|
1 | | -# Stable123Keypoints PoC |
| 1 | +# Stable123Keypoints Stage1 |
2 | 2 |
|
3 | | -A proof-of-concept project for keypoint extraction based on the Zero123Plus model. |
| 3 | +English | [简体中文](README.md) |
4 | 4 |
|
5 | | -## Project Overview |
6 | | - |
7 | | -Stable123Keypoints is currently a proof-of-concept project aimed at performing image keypoint extraction using the `sudo-ai/zero123plus-v1.2` model. This project explores the potential application of `Zero123Plus` model weights in keypoint detection tasks. |
8 | | - |
9 | | -## Technical Background |
10 | | - |
11 | | -### Core Findings |
12 | | - |
13 | | -Research has found that the pre-trained weights of the `Zero123Plus` model can successfully reproduce the keypoint extraction effects of the [StableImageKeypoints](https://github.com/Aloento/StableImageKeypoints/blob/v1.5/README_EN.md) project. |
| 5 | +Stage 1 research report for a keypoint extraction exploration project based on the Zero123Plus model.
14 | 6 |
|
15 | | -### Implementation Strategy |
| 7 | +## Project Overview |
16 | 8 |
|
17 | | -- **Fully load** the `Zero123Plus Pipeline` with targeted adaptations |
18 | | -- **Selectively disable** features specific to `Zero123Plus`, including but not limited to: |
19 | | - - Visual global embeddings |
20 | | - - Classifier-free guidance |
21 | | - - Reference image attention mechanisms |
| 9 | +Stable123Keypoints explores whether the `sudo-ai/zero123plus-v1.2` model can be applied to keypoint detection tasks. This stage evaluates whether the Zero123Plus pre-trained weights can be used directly under the same architecture as [StableImageKeypoints v1.5](https://github.com/Aloento/StableImageKeypoints/blob/v1.5/README_EN.md).
22 | 10 |
|
23 | | -### Architecture Explanation |
| 11 | +## Experimental Design |
24 | 12 |
|
25 | | -Although theoretically the same effect could be achieved by making minor modifications based on `StableImageKeypoints`, considering the long-term development goals of the project, we chose to load the complete Pipeline: |
| 13 | +### Testing Protocol |
26 | 14 |
|
27 | | -- **Future Goals**: Introduce advanced features such as multi-view consistency |
| 15 | +- **Baseline Model**: `sd-legacy/stable-diffusion-v1-5` |
| 16 | +- **Test Model**: `sudo-ai/zero123plus-v1.2` |
| 17 | +- **Network Architecture**: Kept largely consistent with `StableImageKeypoints v1.5`
| 18 | +- **Comparison Dimensions**: |
| 19 | + - Loss function convergence |
| 20 | + - Attention mechanism activation patterns |
| 21 | + - Keypoint extraction effectiveness |
28 | 22 |
|
29 | 23 | ## Quick Start |
30 | 24 |
|
@@ -53,15 +47,47 @@ Please refer to the environment configuration requirements of [StableImageKeypoi |
53 | 47 |
|
54 | 48 | The remaining operation steps are consistent with the `StableImageKeypoints` project. |
55 | 49 |
|
56 | | -## Results |
| 50 | +## Experimental Results |
| 51 | + |
| 52 | +### Training Convergence Analysis |
| 53 | + |
| 54 | + |
| 55 | + |
| 56 | +As shown in the figure, the loss converges normally when training with the `Zero123Plus` model weights, which at first suggests that the model is able to learn.
| 57 | + |
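One way to make "converges normally" checkable is to compare the average loss in an early window of training against a late one. The sketch below is an assumption for illustration and is not taken from this repository: `has_converged`, its window size, and its thresholds are hypothetical, and the loss curves are synthetic.

```python
def has_converged(losses, window=5, min_drop=0.5, max_late_spread=0.05):
    """Heuristic convergence check on a list of per-step loss values.

    Converged means: the late-window mean fell by at least `min_drop`
    (as a fraction of the early-window mean) and the late window is flat,
    i.e. its max-min spread is at most `max_late_spread` of the early mean.
    """
    if len(losses) < 2 * window:
        raise ValueError("need at least 2 * window loss values")
    early = losses[:window]
    late = losses[-window:]
    early_mean = sum(early) / window
    late_mean = sum(late) / window
    dropped = (early_mean - late_mean) >= min_drop * early_mean
    flat = (max(late) - min(late)) <= max_late_spread * early_mean
    return dropped and flat


# Synthetic curves for illustration only (not measured from any model).
decaying = [0.8 ** step for step in range(40)]
stuck = [1.0 for _ in range(40)]
print(has_converged(decaying), has_converged(stuck))
```

A passing check of this kind only says that the optimizer made progress. It says nothing about whether the resulting attention maps are usable, which is why the attention analysis is reported separately.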
| 58 | +### Attention Mechanism Analysis |
| 59 | + |
| 60 | + |
| 61 | + |
| 62 | +However, visualizing the attention maps that the model produces in response to `context` reveals a **critical issue**: the attention distribution is **divergent**, failing to form the expected concentrated response pattern at keypoint locations.
| 63 | + |
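The difference between a concentrated and a divergent attention map can also be expressed as a number rather than judged by eye. The following is a minimal sketch under assumptions: `attention_concentration` is a hypothetical helper, not part of this project, and the two 4x4 maps are synthetic stand-ins for real attention maps.

```python
import math


def attention_concentration(attn_map):
    """Return (normalized_entropy, peak_mass) for a 2D non-negative map.

    normalized_entropy lies in [0, 1]: near 0 means the mass sits on a few
    cells, near 1 means it is spread almost uniformly (the divergent case).
    peak_mass is the share of total attention held by the strongest cell.
    """
    flat = [float(v) for row in attn_map for v in row]
    total = sum(flat)
    if len(flat) < 2 or total <= 0:
        raise ValueError("map needs at least 2 cells and positive total mass")
    probs = [v / total for v in flat]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(len(flat)), max(probs)


# Synthetic 4x4 maps, not real model outputs.
concentrated = [[0.0] * 4 for _ in range(4)]
concentrated[1][2] = 1.0          # all attention on a single cell
diffuse = [[1.0] * 4 for _ in range(4)]  # attention spread evenly

print(attention_concentration(concentrated))
print(attention_concentration(diffuse))
```

Under this measure, a map that responds at a keypoint should score a low normalized entropy and a high peak mass, while a divergent map should score close to 1 on entropy. Where to draw the line between the two is a judgment call that this sketch does not settle.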
| 64 | +### Comparative Experiment Verification |
| 65 | + |
| 66 | +To rule out the influence of loading methods, we conducted the following comparative tests: |
| 67 | + |
| 68 | +1. **Full Zero123Plus Pipeline Loading**: Attention divergence ❌ |
| 69 | +2. **Zero123Plus Weights Only (without Pipeline)**: Attention divergence ❌ |
| 70 | +3. **Using stable-diffusion-v1-5 Weights (same architecture and configuration)**: Keypoint extraction normal ✅ |
| 71 | + |
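The elimination logic behind these three runs can be written out explicitly. The sketch below is illustrative only: `Run` and `separates_outcomes` are hypothetical names, the outcomes are copied from the list above rather than measured by this code, and labelling the third run as "no pipeline" is an assumption about how the SD weights were loaded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    weights: str  # which pre-trained weights were used
    loader: str   # how the weights were loaded
    ok: bool      # True if keypoint extraction behaved normally


# Outcomes transcribed from the comparative tests listed above.
RUNS = [
    Run("zero123plus-v1.2", "zero123plus pipeline", False),
    Run("zero123plus-v1.2", "no pipeline", False),
    Run("stable-diffusion-v1-5", "no pipeline", True),  # loader is assumed
]


def separates_outcomes(runs, factor):
    """True if every value of `factor` maps to a single outcome.

    A factor whose values never mix passing and failing runs is a
    candidate explanation; a factor that mixes them is not sufficient.
    """
    outcomes = {}
    for run in runs:
        outcomes.setdefault(getattr(run, factor), set()).add(run.ok)
    return all(len(seen) == 1 for seen in outcomes.values())


print("weights:", separates_outcomes(RUNS, "weights"))
print("loader:", separates_outcomes(RUNS, "loader"))
```

With these three runs, only the choice of weights lines up cleanly with success and failure. The loading method does not, since the "no pipeline" path appears in both a failing and a passing run.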
| 72 | +## Stage Conclusions |
| 73 | + |
| 74 | +### Core Findings |
| 75 | + |
| 76 | +**Without targeted code modifications, the Zero123Plus pre-trained weights cannot be directly applied to keypoint extraction tasks.** |
| 77 | + |
| 78 | +Although the loss converges normally during training, the model does not produce the expected response to a pure `context` input. Specifically:
| 79 | + |
| 80 | +- ✅ **Training Feasibility**: Loss function convergence is normal |
| 81 | +- ❌ **Functional Effectiveness**: Attention mechanism not activated at keypoint locations |
| 82 | +- ✅ **Code Correctness**: `SD-1.5` weights work normally with the same code |
57 | 83 |
|
58 | | - |
| 84 | +### Problem Attribution Analysis |
59 | 85 |
|
60 | | - |
| 86 | +Considering the minimal structural differences between `Zero123Plus` and `Stable Diffusion v1.5`, we infer: |
61 | 87 |
|
62 | | - |
| 88 | +**The special operations introduced during Zero123Plus pre-training (such as multi-view condition injection and reference image attention) have fundamentally changed how the model's internal weights process `encoder_hidden_states`.**
63 | 89 |
|
64 | | -We can observe that the generated results are similar to the `StableImageKeypoints v1.5` project. |
| 90 | +If this inference holds, the change goes beyond a difference in feature extraction: it amounts to a deep restructuring of the attention mechanism, which makes it difficult for the model to respond to a pure text `context` with the spatially localized attention that the original SD model produces.
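
To make concrete why the same code can yield localized maps with one set of weights and flat maps with another, here is a toy version of generic scaled dot-product cross-attention, in which `context` plays the role of `encoder_hidden_states`. It is a sketch under assumptions, not this project's implementation: all names are hypothetical, the matrices are hand-picked, and the zero key projection is only an extreme stand-in that does not model what the Zero123Plus weights actually do.

```python
import math


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention_maps(pixel_feats, context, w_q, w_k):
    """Return maps[token][pixel]: attention each pixel pays to each token.

    pixel_feats: per-pixel feature vectors (a flattened H*W grid).
    context:     token embeddings (the `encoder_hidden_states` role).
    w_q, w_k:    query/key projection matrices, i.e. the learned weights.
    """
    keys = [matvec(w_k, token) for token in context]
    scale = 1.0 / math.sqrt(len(keys[0]))
    maps = [[] for _ in context]
    for feat in pixel_feats:
        query = matvec(w_q, feat)
        scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
        for token_idx, weight in enumerate(softmax(scores)):
            maps[token_idx].append(weight)
    return maps


# Toy inputs: 4 pixels, 2 context tokens, 2-dimensional features.
pixels = [[4.0, 0.0], [0.0, 4.0], [0.0, 4.0], [0.0, 4.0]]
context = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

localized = cross_attention_maps(pixels, context, identity, identity)
flat = cross_attention_maps(pixels, context, identity, zeros)
print([round(v, 3) for v in localized[0]])  # token 0 peaks at pixel 0
print([round(v, 3) for v in flat[0]])       # token 0 is uniform
```

In the first call, token 0 attracts most of the attention at pixel 0 and little elsewhere. In the second, only the key projection differs and the map for the same token becomes uniform, even though the code path is identical.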
65 | 91 |
|
66 | 92 | > [!CAUTION] |
67 | 93 | > **Do not use FP16 precision** |
|