This repository contains the implementation, experiments, and research notes for an AI Residency project focused on fine-tuning Stable Diffusion models for concurrent identity preservation and pose guidance.
- dreambooth/: Core DreamBooth implementation for subject identity learning.
- controlnet/: Advanced fine-tuning of ControlNet integrated with DreamBooth architectures.
- docs/: Project documentation, task descriptions, and technical concepts.
- Full Project Report
- paper/: Detailed research summaries and pseudocode for relevant SOTA methods (HyperHuman, MagicPose, etc.).
This project explores the fine-tuning of Stable Diffusion v1.5 to generate specific subjects (identity) in user-defined configurations (pose).
- DreamBooth Reproduction: Successfully learned specific subjects with minimal data; identified prompt fidelity vs. identity trade-offs.
- Human Pose Integration: Combined ControlNet with DreamBooth. Discovered that 200–600 training steps (avg. 400) and LoRA ranks ≥ 16 provide the optimal balance for identity preservation.
- Key Finding: Fine-tuning the Text Encoder is critical for learning complex human identities but risks "catastrophic forgetting" of structural concepts.
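To make the LoRA-rank finding concrete, here is a minimal, self-contained LoRA linear layer in PyTorch. The class, dimensions, and hyperparameters are illustrative stand-ins (not the project's actual training code); 320 is a typical SD v1.5 attention width, and rank 16 matches the threshold reported above.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA adapter: y = W x + (alpha / r) * B(A(x)).
    The base weight is frozen; only the low-rank A/B pair trains."""
    def __init__(self, base: nn.Linear, rank: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():      # freeze the pre-trained projection
            p.requires_grad_(False)
        self.down = nn.Linear(base.in_features, rank, bias=False)   # A: d -> r
        self.up = nn.Linear(rank, base.out_features, bias=False)    # B: r -> d
        nn.init.zeros_(self.up.weight)        # adapter starts as a no-op
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * self.up(self.down(x))

# Wrap one hypothetical 320-dim projection at rank 16.
layer = LoRALinear(nn.Linear(320, 320), rank=16)
x = torch.randn(2, 77, 320)
out = layer(x)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# rank 16 adds 2 * 16 * 320 trainable weights per wrapped layer
```

Higher ranks enlarge only the `down`/`up` matrices, which is why rank is a direct knob on how much identity-specific capacity each attention projection gains.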
Transitioned to a significantly harder domain: a non-humanoid robot subject with sparse training data (3–6 images).
- Constraint: OpenPose algorithms struggle with non-humanoid joint structures, limiting dataset quality.
- ControlNet Bias: Pre-trained ControlNets exhibit a strong "human bias," making it difficult to maintain robot morphology in extreme or unusual poses.
To address overfitting and structural bias, several research-backed techniques were implemented:
- Custom Diffusion Optimization:
- K/V Attention Tuning: Trained only the Key (K) and Value (V) projections in cross-attention layers.
- Embedding Training: Optimized the `[V]` rare-token embedding exclusively, which reduced structural forgetting but resulted in lower identity fidelity.
- Multi-Stage Training (MagicPose Style):
- Stage 1 (Appearance): Isolated identity training without ControlNet interference.
- Stage 2 (Pose): Structural guidance training with the identity-aware Text Encoder frozen.
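The K/V-only tuning and stage-wise freezing above can be sketched in PyTorch as follows. The attention class is a toy stand-in (its `to_k`/`to_v` naming mirrors the diffusers convention, but this is not the project's actual code), and the "text encoder" is a placeholder module used only to show the Stage 2 freeze.

```python
import torch.nn as nn

class ToyCrossAttention(nn.Module):
    """Toy stand-in mirroring diffusers' projection names (to_q/to_k/to_v)."""
    def __init__(self, dim: int = 320, ctx_dim: int = 768):
        super().__init__()
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(ctx_dim, dim, bias=False)  # keyed on text embeddings
        self.to_v = nn.Linear(ctx_dim, dim, bias=False)
        self.to_out = nn.Linear(dim, dim)

def kv_only_params(model: nn.Module):
    """Freeze the whole model, then re-enable only the K/V projections
    (the Custom Diffusion-style subset used for identity tuning)."""
    for p in model.parameters():
        p.requires_grad_(False)
    selected = []
    for name, p in model.named_parameters():
        if ".to_k." in name or ".to_v." in name:
            p.requires_grad_(True)
            selected.append(p)
    return selected

# Stage 1 (appearance): train only K/V projections of the cross-attention blocks.
unet_blocks = nn.Sequential(ToyCrossAttention(), ToyCrossAttention())
params = kv_only_params(unet_blocks)

# Stage 2 (pose): freeze the identity-aware text encoder before pose training.
text_encoder = nn.Linear(768, 768)  # placeholder for the real CLIP text encoder
text_encoder.requires_grad_(False)
```

Selecting parameters by name keeps the recipe robust: the same substring filter works whether the attention blocks live in a toy `Sequential` or a full UNet module tree.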
- Pose-Identity Conflict: Precise subject generation in "hard" (non-humanoid) poses remains a major challenge due to the inherent human-centric bias in pre-trained spatial adapters.
- Overfitting vs. Generalization: Naive DreamBooth training often causes the model to "forget" structural flexibility. Strategic dropout and targeted parameter tuning (e.g., K/V attention) are essential for maintaining pose adherence.
- Future Work: Bridging the gap between the specific morphology of non-humanoid subjects and general spatial conditioning models.
> [!TIP]
> Refer to IDEA.md for a deep dive into the technical papers that inspired these implementations.