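Update README and README_EN and revise the project stage description; add an experimental results analysis section; change the model type in default.yaml to sudo-ai/zero123plus-v1.2 and adjust the wandb run name; update init_random_noise in ptp_utils.py to return 1024-dimensional random noise; add keypoint and loss-curve convergence images.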

Aloento · Aloento · commit 36d40b77cdfd · 2025-10-05T08:26:50.000Z
diff --git a/README.md b/README.md
@@ -1,30 +1,24 @@
-# Stable123Keypoints PoC
+# Stable123Keypoints Stage1
 
-A proof-of-concept project for keypoint extraction based on the Zero123Plus model.
+[English](README_EN.md) | 简体中文
 
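-## Project Overview
-
-Stable123Keypoints is currently a proof-of-concept project that aims to extract image keypoints with the `sudo-ai/zero123plus-v1.2` model. The project explores the potential of `Zero123Plus` model weights for keypoint detection tasks.
-
-## Technical Background
-
-### Core Findings
-
-We found that the pre-trained weights of the `Zero123Plus` model can successfully reproduce the keypoint extraction results of the [StableImageKeypoints](https://github.com/Aloento/StableImageKeypoints/tree/v1.5) project.
+A keypoint extraction exploration project based on the Zero123Plus model - Stage 1 research report.
 
-### Implementation Strategy
+## Project Overview
 
-- **Fully load** the `Zero123Plus Pipeline` and adapt it as needed
-- **Selectively disable** features specific to `Zero123Plus`, including but not limited to:
-  - Visual global embeddings
-  - Classifier-free guidance
-  - Reference image attention
+Stable123Keypoints explores whether the `sudo-ai/zero123plus-v1.2` model can be applied to keypoint detection. This stage evaluates whether the Zero123Plus pre-trained weights can be used directly under the same architecture as [StableImageKeypoints v1.5](https://github.com/Aloento/StableImageKeypoints/tree/v1.5).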
 
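-### Architecture Notes
+## Experimental Design
 
-In theory the same result could be obtained with minor changes on top of `StableImageKeypoints`, but given the project's long-term goals we chose to load the complete Pipeline:
+### Test Plan
 
-- **Future goal**: introduce advanced features such as multi-view consistency
+- **Baseline model**: `sd-legacy/stable-diffusion-v1-5`
+- **Test model**: `sudo-ai/zero123plus-v1.2`
+- **Network architecture**: kept essentially the same as `StableImageKeypoints v1.5`
+- **Comparison dimensions**:
+  - Loss function convergence
+  - Attention activation patterns
+  - Keypoint extraction quality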
 
 ## Quick Start
 
@@ -53,15 +47,47 @@ Stable123Keypoints is currently a proof-of-concept project that aims to
 
    The remaining steps are the same as in the `StableImageKeypoints` project.
 
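-## Results
+## Experimental Results
+
+### Training Convergence Analysis
+
+![Loss curve convergence](assets/sk123.png)
+
+As the figure shows, the loss converges normally when training with the `Zero123Plus` model weights, which gives a preliminary indication that the model is able to learn.
+
+### Attention Mechanism Analysis
+
+![Attention activation pattern](assets/keypoint.png)
+
+However, visualizing the attention maps produced when the model is activated by `context` revealed a **critical problem**: the attention distribution is **diffuse** and fails to form the expected concentrated response at keypoint locations.
+
+### Comparative Experiments
+
+To rule out the loading method as a cause, we ran the following comparison tests:
+
+1. **Loading the complete Zero123Plus Pipeline**: attention diffuse ❌
+2. **Loading only the Zero123Plus weights (without the Pipeline)**: attention diffuse ❌
+3. **Using stable-diffusion-v1-5 weights (same code and configuration)**: keypoint extraction works ✅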
+
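+## Stage Conclusions
+
+### Core Finding
+
+**Without targeted code changes, the Zero123Plus pre-trained weights cannot be used directly for keypoint extraction.**
+
+Although the loss converges normally during training, the model does not respond as expected to context alone. Specifically:
+
+- ✅ **Training feasibility**: the loss converges normally
+- ❌ **Functional effectiveness**: attention does not activate at keypoint locations
+- ✅ **Code correctness**: `SD-1.5` weights work normally with the same code
 
-![Example results](assets/res.png)
+### Root-Cause Analysis
 
-![Convergence](assets/heat.png)
+Given that `Zero123Plus` and `Stable Diffusion v1.5` differ only slightly in model structure, we infer:
 
-![Consistency](assets/augmentation.png)
+**The special operations introduced during Zero123Plus pre-training (such as multi-view condition injection and reference image attention) have fundamentally changed how the model's internal weights process `encoder_hidden_states`.**
 
-We can observe that the generated results are similar to those of the `StableImageKeypoints v1.5` project.
+This change is not a simple difference in feature extraction; it amounts to a deep restructuring of the attention mechanism, which makes it hard for the model to respond to a pure-text `context` in a spatially localized way, as the original SD model does.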
 
 > [!CAUTION]  
 > **Do not use FP16 precision**  
diff --git a/README_EN.md b/README_EN.md
@@ -1,30 +1,24 @@
-# Stable123Keypoints PoC
+# Stable123Keypoints Stage1
 
-A proof-of-concept project for keypoint extraction based on the Zero123Plus model.
+English | [简体中文](README.md)
 
-## Project Overview
-
-Stable123Keypoints is currently a proof-of-concept project aimed at performing image keypoint extraction using the `sudo-ai/zero123plus-v1.2` model. This project explores the potential application of `Zero123Plus` model weights in keypoint detection tasks.
-
-## Technical Background
-
-### Core Findings
-
-Research has found that the pre-trained weights of the `Zero123Plus` model can successfully reproduce the keypoint extraction effects of the [StableImageKeypoints](https://github.com/Aloento/StableImageKeypoints/blob/v1.5/README_EN.md) project.
+A keypoint extraction exploration project based on the Zero123Plus model - Stage 1 research report.
 
-### Implementation Strategy
+## Project Overview
 
-- **Fully load** the `Zero123Plus Pipeline` with targeted adaptations
-- **Selectively disable** features specific to `Zero123Plus`, including but not limited to:
-  - Visual global embeddings
-  - Classifier-free guidance
-  - Reference image attention mechanisms
+Stable123Keypoints aims to explore the application potential of the `sudo-ai/zero123plus-v1.2` model in keypoint detection tasks. This stage focuses on evaluating the direct usability of Zero123Plus pre-trained weights under the same architecture as [StableImageKeypoints v1.5](https://github.com/Aloento/StableImageKeypoints/blob/v1.5/README_EN.md).
 
-### Architecture Explanation
+## Experimental Design
 
-Although theoretically the same effect could be achieved by making minor modifications based on `StableImageKeypoints`, considering the long-term development goals of the project, we chose to load the complete Pipeline:
+### Testing Protocol
 
-- **Future Goals**: Introduce advanced features such as multi-view consistency
+- **Baseline Model**: `sd-legacy/stable-diffusion-v1-5`
+- **Test Model**: `sudo-ai/zero123plus-v1.2`
+- **Network Architecture**: Kept essentially the same as `StableImageKeypoints v1.5`
+- **Comparison Dimensions**:
+  - Loss function convergence
+  - Attention mechanism activation patterns
+  - Keypoint extraction effectiveness
 
 ## Quick Start
 
@@ -53,15 +47,47 @@ Please refer to the environment configuration requirements of [StableImageKeypoi
 
    The remaining operation steps are consistent with the `StableImageKeypoints` project.
 
-## Results
+## Experimental Results
+
+### Training Convergence Analysis
+
+![Loss Curve Convergence](assets/sk123.png)
+
+As the figure shows, the loss converges normally when training with the `Zero123Plus` model weights, which gives a preliminary indication that the model is able to learn.
+
+### Attention Mechanism Analysis
+
+![Attention Activation Pattern](assets/keypoint.png)
+
+However, visualizing the attention maps produced when the model is activated by `context` revealed a **critical issue**: the attention distribution is **diffuse** and fails to form the expected concentrated response at keypoint locations.
+
+### Comparative Experiment Verification
+
+To rule out the influence of loading methods, we conducted the following comparative tests:
+
+1. **Full Zero123Plus Pipeline loading**: attention diffuse ❌
+2. **Zero123Plus weights only (without the Pipeline)**: attention diffuse ❌
+3. **stable-diffusion-v1-5 weights (same code and configuration)**: keypoint extraction works ✅
+
+## Stage Conclusions
+
+### Core Findings
+
+**Without targeted code modifications, the Zero123Plus pre-trained weights cannot be directly applied to keypoint extraction tasks.**
+
+Although the loss converges normally during training, the model does not respond as expected to context alone. Specifically:
+
+- ✅ **Training Feasibility**: Loss function convergence is normal
+- ❌ **Functional Effectiveness**: Attention mechanism not activated at keypoint locations
+- ✅ **Code Correctness**: `SD-1.5` weights work normally with the same code
 
-![Example Results](assets/res.png)
+### Problem Attribution Analysis
 
-![Convergence](assets/heat.png)
+Considering the minimal structural differences between `Zero123Plus` and `Stable Diffusion v1.5`, we infer:
 
-![Consistency](assets/augmentation.png)
+**The special operations introduced during Zero123Plus pre-training (such as multi-view condition injection, reference image attention, etc.) have fundamentally changed how the model's internal weights process `encoder_hidden_states`.**
 
-We can observe that the generated results are similar to the `StableImageKeypoints v1.5` project.
+This change is not a simple difference in feature extraction; it amounts to a deep restructuring of the attention mechanism, which makes it hard for the model to respond to a pure-text `context` in a spatially localized way, as the original SD model does.
 
 > [!CAUTION]  
 > **Do not use FP16 precision**  
diff --git a/assets/keypoint.png b/assets/keypoint.png
diff --git a/assets/sk123.png b/assets/sk123.png
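
The README changes above describe the Zero123Plus attention maps as diffuse rather than concentrated at keypoints, based on visual inspection. As a rough illustration of how that observation could be put into a number, the sketch below computes a normalized Shannon entropy for a single 2D attention map. The function name `attention_entropy` and the choice of entropy as the measure are assumptions made for this example; nothing in the commit indicates the project uses such a metric.

```python
import math


def attention_entropy(attn_map):
    """Normalized Shannon entropy of a 2D non-negative attention map.

    Hypothetical helper, not part of the repository. Values near 1.0 mean
    the attention is spread almost uniformly (diffuse); values near 0.0
    mean it is concentrated on very few locations.
    """
    flat = [value for row in attn_map for value in row]
    if len(flat) < 2:
        raise ValueError("attention map needs at least two cells")
    if any(value < 0 for value in flat):
        raise ValueError("attention values must be non-negative")
    total = sum(flat)
    if total <= 0:
        raise ValueError("attention map must have positive mass")

    # Turn the map into a probability distribution over locations.
    probs = [value / total for value in flat]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)

    # Divide by the maximum possible entropy so the result lies in [0, 1].
    return entropy / math.log(len(flat))
```

Comparing this value between a run with `sd-legacy/stable-diffusion-v1-5` weights and a run with `sudo-ai/zero123plus-v1.2` weights would be one possible way to turn the ✅/❌ comparison into a measurable difference, assuming the attention maps are exported as plain nested lists or converted to them first.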
diff --git a/configs/default.yaml b/configs/default.yaml
@@ -3,7 +3,7 @@
 
 # Model configuration
 model:
-  type: "sd-legacy/stable-diffusion-v1-5" # LDM model type
+  type: "sudo-ai/zero123plus-v1.2" # LDM model type
   my_token: null # Hugging Face token (optional)
 
 # Dataset configuration
@@ -63,6 +63,6 @@ visualization:
 # Weights & Biases configuration
 wandb:
   enabled: true # whether to enable wandb logging
-  name: "sk1.5" # wandb run name
+  name: "sk123" # wandb run name
   project: "Stable123Keypoints" # wandb project name
   entity: "aronfothi_org" # wandb organization name (optional)
diff --git a/src/ptp_utils.py b/src/ptp_utils.py
@@ -445,4 +445,4 @@ def __init__(self):
 
 
 def init_random_noise(device, num_words=500):
-    return torch.randn(1, num_words, 768).to(device)
+    return torch.randn(1, num_words, 1024).to(device)
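
The hunk above hard-codes the last dimension of the random context, so switching `model.type` back in `configs/default.yaml` would also require editing this function by hand. A minimal sketch of one alternative is shown below: the width is looked up from the model name. The mapping `EMBED_DIM_BY_MODEL`, the helper `noise_shape`, and the extra `model_type` parameter are hypothetical; the two widths are simply the values that appear on either side of this diff, paired with the model names from the config change, which is an assumption. If the project loads a diffusers UNet, reading `unet.config.cross_attention_dim` would likely be a more robust source for this number.

```python
# Widths taken from the two sides of this diff (assumed pairing with the
# model names changed in configs/default.yaml in the same commit).
EMBED_DIM_BY_MODEL = {
    "sd-legacy/stable-diffusion-v1-5": 768,
    "sudo-ai/zero123plus-v1.2": 1024,
}


def noise_shape(model_type, num_words=500):
    """Return the (batch, tokens, width) shape of the random context."""
    try:
        embed_dim = EMBED_DIM_BY_MODEL[model_type]
    except KeyError:
        raise ValueError(f"unknown embedding width for {model_type!r}") from None
    return (1, num_words, embed_dim)


def init_random_noise(device, model_type, num_words=500):
    """Hypothetical variant of init_random_noise that picks the width per model."""
    import torch  # imported lazily so noise_shape works without PyTorch

    return torch.randn(*noise_shape(model_type, num_words), device=device)
```

With this shape, a call such as `init_random_noise("cpu", "sudo-ai/zero123plus-v1.2")` would produce a tensor of shape `(1, 500, 1024)`, matching the new behavior in the diff, while the SD 1.5 name would give the previous width of 768.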
