This repository was archived by the owner on Nov 4, 2025. It is now read-only.

Commit 36d40b7

Update README and README_EN to revise the project-stage description; add an experimental-results analysis section; change the model type in default.yaml to sudo-ai/zero123plus-v1.2 and adjust the wandb run name; update init_random_noise in ptp_utils.py to return 1024-dimensional random noise; add keypoint and loss-curve convergence images.

1 parent f99f896 commit 36d40b7

6 files changed: 105 additions & 53 deletions


README.md

Lines changed: 51 additions & 25 deletions

@@ -1,30 +1,24 @@
-# Stable123Keypoints PoC
+# Stable123Keypoints Stage1
 
-A proof-of-concept project for keypoint extraction based on the Zero123Plus model.
+[English](README_EN.md) | 简体中文
 
-## Project Overview
-
-Stable123Keypoints is currently a proof-of-concept project aimed at extracting image keypoints with the `sudo-ai/zero123plus-v1.2` model. It explores the potential of the `Zero123Plus` model weights for keypoint detection tasks.
-
-## Technical Background
-
-### Core Finding
-
-We found that the pre-trained weights of the `Zero123Plus` model can successfully reproduce the keypoint extraction results of the [StableImageKeypoints](https://github.com/Aloento/StableImageKeypoints/tree/v1.5) project.
+A keypoint-extraction exploration project based on the Zero123Plus model - Stage 1 research report.
 
-### Implementation Strategy
+## Project Overview
 
-- **Fully load** the `Zero123Plus Pipeline` and adapt it as needed
-- **Selectively disable** `Zero123Plus`-specific features, including but not limited to:
-  - Visual global embeddings
-  - Classifier-free guidance
-  - Reference-image attention
+Stable123Keypoints aims to explore the applicability of the `sudo-ai/zero123plus-v1.2` model to keypoint detection tasks. This stage focuses on evaluating the direct usability of the Zero123Plus pre-trained weights under the same architecture as [StableImageKeypoints v1.5](https://github.com/Aloento/StableImageKeypoints/tree/v1.5).
 
-### Architecture Notes
+## Experimental Design
 
-Although the same effect could in theory be achieved with minor modifications on top of `StableImageKeypoints`, given the project's long-term goals we chose to load the complete Pipeline:
+### Test Plan
 
-- **Future goal**: introduce advanced features such as multi-view consistency
+- **Baseline model**: `sd-legacy/stable-diffusion-v1-5`
+- **Test model**: `sudo-ai/zero123plus-v1.2`
+- **Network architecture**: kept essentially consistent with `StableImageKeypoints v1.5`
+- **Comparison dimensions**:
+  - Loss function convergence
+  - Attention activation patterns
+  - Keypoint extraction quality
 
 ## Quick Start
 
@@ -53,15 +47,47 @@ Stable123Keypoints is currently a proof-of-concept project
 
 The remaining steps are the same as in the `StableImageKeypoints` project.
 
-## Results
+## Experimental Results
+
+### Training Convergence Analysis
+
+![Loss curve convergence](assets/sk123.png)
+
+As shown in the figure, the loss converges normally when training with the `Zero123Plus` weights, which initially indicates that the model is able to learn.
+
+### Attention Mechanism Analysis
+
+![Attention activation pattern](assets/keypoint.png)
+
+However, visualizing the attention maps produced when the model is activated by the `context` reveals a **critical issue**: the attention distribution is **divergent** and fails to form the expected concentrated response at keypoint locations.
+
+### Comparative Verification
+
+To rule out the influence of the loading method, we ran the following comparisons:
+
+1. **Full Zero123Plus Pipeline loading**: attention diverges ❌
+2. **Zero123Plus weights only (no Pipeline)**: attention diverges ❌
+3. **stable-diffusion-v1-5 weights (same code and configuration)**: keypoint extraction works ✅
+
+## Stage Conclusions
+
+### Core Finding
+
+**Without targeted code modifications, the Zero123Plus pre-trained weights cannot be used directly for keypoint extraction.**
+
+Although the loss converges normally during training, the model does not respond as expected to the plain context. Specifically:
+
+- ✅ **Training feasibility**: the loss converges normally
+- ❌ **Functional effectiveness**: attention is not activated at keypoint locations
+- ✅ **Code correctness**: the `SD-1.5` weights work normally with the same code
 
-![Example results](assets/res.png)
+### Root-Cause Analysis
 
-![Convergence](assets/heat.png)
+Given that `Zero123Plus` and `Stable Diffusion v1.5` differ only slightly in model structure, we infer:
 
-![Consistency](assets/augmentation.png)
+**The special operations introduced during Zero123Plus pre-training (such as multi-view condition injection and reference-image attention) have fundamentally changed how the model's internal weights process `encoder_hidden_states`.**
 
-We observe that the generated results are similar to those of the `StableImageKeypoints v1.5` project.
+This is not a mere difference in feature extraction; it involves a deep restructuring of the attention mechanism, leaving the model unable to produce spatially localized responses to a pure text `context` the way the original SD model does.
 
 > [!CAUTION]
 > **Do not use FP16 precision**
README_EN.md

Lines changed: 51 additions & 25 deletions

@@ -1,30 +1,24 @@
-# Stable123Keypoints PoC
+# Stable123Keypoints Stage1
 
-A proof-of-concept project for keypoint extraction based on the Zero123Plus model.
+English | [简体中文](README.md)
 
-## Project Overview
-
-Stable123Keypoints is currently a proof-of-concept project aimed at performing image keypoint extraction using the `sudo-ai/zero123plus-v1.2` model. This project explores the potential application of `Zero123Plus` model weights in keypoint detection tasks.
-
-## Technical Background
-
-### Core Findings
-
-Research has found that the pre-trained weights of the `Zero123Plus` model can successfully reproduce the keypoint extraction effects of the [StableImageKeypoints](https://github.com/Aloento/StableImageKeypoints/blob/v1.5/README_EN.md) project.
+Keypoint Extraction Exploration Project Based on the Zero123Plus Model - Stage 1 Research Report.
 
-### Implementation Strategy
+## Project Overview
 
-- **Fully load** the `Zero123Plus Pipeline` with targeted adaptations
-- **Selectively disable** features specific to `Zero123Plus`, including but not limited to:
-  - Visual global embeddings
-  - Classifier-free guidance
-  - Reference image attention mechanisms
+Stable123Keypoints aims to explore the application potential of the `sudo-ai/zero123plus-v1.2` model in keypoint detection tasks. This stage focuses on evaluating the direct usability of Zero123Plus pre-trained weights under the same architecture as [StableImageKeypoints v1.5](https://github.com/Aloento/StableImageKeypoints/blob/v1.5/README_EN.md).
 
-### Architecture Explanation
+## Experimental Design
 
-Although in theory the same effect could be achieved with minor modifications on top of `StableImageKeypoints`, considering the project's long-term development goals, we chose to load the complete Pipeline:
+### Testing Protocol
 
-- **Future Goals**: Introduce advanced features such as multi-view consistency
+- **Baseline Model**: `sd-legacy/stable-diffusion-v1-5`
+- **Test Model**: `sudo-ai/zero123plus-v1.2`
+- **Network Architecture**: Kept essentially consistent with `StableImageKeypoints v1.5`
+- **Comparison Dimensions**:
+  - Loss function convergence
+  - Attention mechanism activation patterns
+  - Keypoint extraction effectiveness
 
 ## Quick Start
 
@@ -53,15 +47,47 @@ Please refer to the environment configuration requirements of [StableImageKeypoi
 
 The remaining operation steps are consistent with the `StableImageKeypoints` project.
 
-## Results
+## Experimental Results
+
+### Training Convergence Analysis
+
+![Loss Curve Convergence](assets/sk123.png)
+
+As shown in the figure, when training with `Zero123Plus` model weights, the loss function converges normally, initially indicating that the model has learning capability.
+
+### Attention Mechanism Analysis
+
+![Attention Activation Pattern](assets/keypoint.png)
+
+However, visualizing the attention maps after the model is activated by `context` reveals a **critical issue**: the attention distribution exhibits a **divergent state**, failing to form the expected concentrated response pattern at keypoint locations.
+
+### Comparative Experiment Verification
+
+To rule out the influence of loading methods, we conducted the following comparative tests:
+
+1. **Full Zero123Plus Pipeline Loading**: Attention divergence ❌
+2. **Zero123Plus Weights Only (without Pipeline)**: Attention divergence ❌
+3. **Using stable-diffusion-v1-5 Weights (same code and configuration)**: Keypoint extraction normal ✅
+
+## Stage Conclusions
+
+### Core Findings
+
+**Without targeted code modifications, the Zero123Plus pre-trained weights cannot be directly applied to keypoint extraction tasks.**
+
+Although the loss function converges normally during training, the model does not produce the expected response to pure context. Specifically:
+
+- ✅ **Training Feasibility**: Loss function convergence is normal
+- ❌ **Functional Effectiveness**: Attention mechanism is not activated at keypoint locations
+- ✅ **Code Correctness**: `SD-1.5` weights work normally with the same code
 
-![Example Results](assets/res.png)
+### Problem Attribution Analysis
 
-![Convergence](assets/heat.png)
+Considering the minimal structural differences between `Zero123Plus` and `Stable Diffusion v1.5`, we infer:
 
-![Consistency](assets/augmentation.png)
+**The special operations introduced during Zero123Plus pre-training (such as multi-view condition injection and reference image attention) have fundamentally changed how the model's internal weights process `encoder_hidden_states`.**
 
-We can observe that the generated results are similar to those of the `StableImageKeypoints v1.5` project.
+This change is not a simple feature-extraction difference; it involves a deep reconstruction of the attention mechanism, making it difficult for the model to produce spatially localized responses to a pure text `context` as the original SD model does.
 
 > [!CAUTION]
 > **Do not use FP16 precision**
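The "divergent vs. concentrated" contrast described in the attention analysis can be quantified; one simple summary statistic is the normalized entropy of a flattened attention map. A minimal standard-library sketch on toy data (the 8×8 map size, threshold values, and helper name are illustrative assumptions, not project code):

```python
import math

def attention_entropy(attn):
    """Normalized entropy of a flattened attention map, in [0, 1].

    Values near 1 mean the attention is spread out ("divergent");
    values near 0 mean it is concentrated on a few locations.
    """
    total = sum(attn)
    probs = [a / total for a in attn]
    h = -sum(p * math.log(p) for p in probs if p > 0)
    return h / math.log(len(probs))

# Toy maps standing in for a flattened 8x8 cross-attention map:
peaked = [0.001] * 63 + [1.0]   # response concentrated at one keypoint
diffuse = [1.0] * 64            # response spread over the whole map

print(attention_entropy(peaked))   # low: spatially localized
print(attention_entropy(diffuse))  # near 1: divergent
```

On real data, `attn` would be one learned token's cross-attention map flattened to a list; entropy consistently near 1 across tokens would match the divergence reported here.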

assets/keypoint.png (108 KB)

assets/sk123.png (53.9 KB)

configs/default.yaml

Lines changed: 2 additions & 2 deletions

@@ -3,7 +3,7 @@
 
 # Model configuration
 model:
-  type: "sd-legacy/stable-diffusion-v1-5" # LDM model type
+  type: "sudo-ai/zero123plus-v1.2" # LDM model type
   my_token: null # Hugging Face token (optional)
 
 # Dataset configuration
@@ -63,6 +63,6 @@ visualization:
 # Weights & Biases configuration
 wandb:
   enabled: true # enable wandb logging
-  name: "sk1.5" # wandb run name
+  name: "sk123" # wandb run name
   project: "Stable123Keypoints" # wandb project name
   entity: "aronfothi_org" # wandb organization name (optional)

src/ptp_utils.py

Lines changed: 1 addition & 1 deletion

@@ -445,4 +445,4 @@ def __init__(self):
 
 
 def init_random_noise(device, num_words=500):
-    return torch.randn(1, num_words, 768).to(device)
+    return torch.randn(1, num_words, 1024).to(device)
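The single changed line tracks the width of `encoder_hidden_states` expected by the configured backbone: 768 for SD 1.x's CLIP text embeddings versus 1024 for `sudo-ai/zero123plus-v1.2`. A hedged sketch of deriving that width from the model type instead of hard-coding it (the dictionary and helper name are illustrative assumptions based only on this diff, not project code):

```python
# Illustrative sketch: map each supported backbone to the context-embedding
# width its cross-attention layers expect. The 768/1024 values are taken
# from this commit's diff, not from inspecting the models themselves.
CROSS_ATTENTION_DIM = {
    "sd-legacy/stable-diffusion-v1-5": 768,
    "sudo-ai/zero123plus-v1.2": 1024,
}

def noise_shape(model_type: str, num_words: int = 500) -> tuple:
    """Shape of the learnable random-noise context for a given backbone."""
    return (1, num_words, CROSS_ATTENTION_DIM[model_type])

print(noise_shape("sudo-ai/zero123plus-v1.2"))  # (1, 500, 1024)
```

With such a table, `init_random_noise` could take the width from `config.model.type` rather than embedding a magic number, so switching backbones would not require editing the function body.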

0 commit comments
