This repository contains a PyTorch implementation of a Conditional Variational Autoencoder (CVAE), trained on the CelebA dataset. The model generates 64×64 human face images with controllable semantic attributes.
- Gender: Male / Female
- Glasses: Present / Absent
- Beard: Present / Absent (derived from Goatee and No_Beard)
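
As an illustration of how the three condition bits could be assembled from CelebA's binary attribute labels, here is a minimal sketch. The function name `build_condition` and the rule used for the beard bit (beard is present if `Goatee` is set or `No_Beard` is not set) are assumptions made for this example; the exact rule used in this repository may differ.

```python
import torch

def build_condition(attr: torch.Tensor, attr_names: list[str]) -> torch.Tensor:
    """Map a (B, num_attrs) 0/1 attribute tensor to a (B, 3) condition tensor.

    Columns of the result: [gender, glasses, beard].
    The beard rule below is an assumption, not taken from the repository.
    """
    idx = {name: i for i, name in enumerate(attr_names)}
    gender = attr[:, idx["Male"]] == 1
    glasses = attr[:, idx["Eyeglasses"]] == 1
    # Assumed rule: a goatee counts as a beard, and so does "No_Beard == 0".
    beard = (attr[:, idx["Goatee"]] == 1) | (attr[:, idx["No_Beard"]] == 0)
    return torch.stack([gender, glasses, beard], dim=1).float()

# Toy example with only the four relevant attribute columns.
names = ["Eyeglasses", "Goatee", "Male", "No_Beard"]
attr = torch.tensor([[0, 0, 1, 1],   # male, no glasses, no beard
                     [1, 1, 0, 0]])  # female label, glasses, goatee
cond = build_condition(attr, names)
print(cond.tolist())  # [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]
```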
Compresses input image x and conditions c into a probabilistic latent space z.
- Input: RGB image + condition maps (spatially broadcasted)
- Backbone: Residual downsampling blocks (Conv → ReLU → MaxPool → Conv)
- Output: μ and log(σ²) vectors parameterizing the latent distribution
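
The two encoder-side ideas above (spatially broadcasting the conditions and producing μ and log(σ²)) can be sketched as follows. The helper names `broadcast_conditions` and `reparameterize`, and the latent size of 8 used in the example, are hypothetical and chosen only for illustration; the sampling step is the standard reparameterization trick, which is assumed here rather than confirmed from `net.py`.

```python
import torch

def broadcast_conditions(x: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
    """Tile a (B, K) condition vector into (B, K, H, W) maps and
    concatenate them with the (B, 3, H, W) image along the channel axis."""
    b, _, h, w = x.shape
    c_maps = c.view(b, -1, 1, 1).expand(-1, -1, h, w)
    return torch.cat([x, c_maps], dim=1)

def reparameterize(mu: torch.Tensor, logvar: torch.Tensor) -> torch.Tensor:
    """Sample z = mu + sigma * eps with eps ~ N(0, I), where sigma = exp(0.5 * logvar)."""
    std = torch.exp(0.5 * logvar)
    eps = torch.randn_like(std)
    return mu + eps * std

x = torch.rand(4, 3, 64, 64)              # batch of RGB images
c = torch.tensor([[1.0, 0.0, 1.0]] * 4)   # gender, glasses, beard
enc_in = broadcast_conditions(x, c)
print(enc_in.shape)  # torch.Size([4, 6, 64, 64])

# Hypothetical latent size of 8; the real value is LATENT_SIZE in const.py.
z = reparameterize(torch.zeros(4, 8), torch.zeros(4, 8))
```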
Injects attributes at multiple spatial resolutions using Conditioning Paths:
- Conditioning Path: Each attribute uses a dedicated network taking z and c, with a gating mechanism: Out = f(z, c) * σ(g(z, c))
- Hierarchical Injection:
- Gender: Low resolution (res0 block) – defines overall facial structure
- Glasses: Medium resolution (res3 block)
- Beard: High resolution (res4 block) – fine texture details
Upsampling is performed with custom residual UpSamplingBlocks.
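
To make the gating formula Out = f(z, c) * σ(g(z, c)) concrete, here is a minimal sketch of one conditioning path. The class name `GatedConditioningPath`, the use of single linear layers for f and g, and the reshaping to a feature map are all assumptions for illustration; the actual networks in `net.py` and the way their output is merged into the decoder may look different.

```python
import torch
from torch import nn

class GatedConditioningPath(nn.Module):
    """Hypothetical conditioning path: a feature branch f and a gate branch g,
    both fed with the concatenation of z and a single attribute c."""

    def __init__(self, latent_size: int, cond_size: int, out_channels: int, resolution: int):
        super().__init__()
        in_features = latent_size + cond_size
        out_features = out_channels * resolution * resolution
        self.f = nn.Linear(in_features, out_features)  # feature branch
        self.g = nn.Linear(in_features, out_features)  # gate branch
        self.out_shape = (out_channels, resolution, resolution)

    def forward(self, z: torch.Tensor, c: torch.Tensor) -> torch.Tensor:
        zc = torch.cat([z, c], dim=1)
        out = self.f(zc) * torch.sigmoid(self.g(zc))  # Out = f(z, c) * σ(g(z, c))
        return out.view(-1, *self.out_shape)

path = GatedConditioningPath(latent_size=16, cond_size=1, out_channels=8, resolution=4)
z = torch.randn(2, 16)
gender = torch.tensor([[1.0], [0.0]])
features = path(z, gender)
print(features.shape)  # torch.Size([2, 8, 4, 4])
```

The sigmoid gate lies in (0, 1), so each path can scale its own contribution up or down per feature depending on z and the attribute value.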
checkpoint/ # Model weights (.pt)
images/ # Generated images during training
result/ # Final images
Utils/
├── checkpoint_manager.py # Save/load utilities
├── const.py # Hyperparameters and paths
└── image_manager.py # Image grid utilities
net.py # CVAE architecture
train.py # Training script
test.py # Inference and test grid generation
README.md
Clone the repository and install the following dependencies:

pip install torch torchvision matplotlib tqdm pillow

Dataset: Auto-downloads CelebA to ../data (configurable in const.py). Ensure an internet connection and sufficient disk space.
python train.py
- 300 epochs by default
- Debug images saved to ./images
- Checkpoints saved to ./checkpoint
- Modify hyperparameters (BATCH_SIZE, LEARNING_RATE, LATENT_SIZE) in Utils/const.py
python test.py
- Loads the latest checkpoint from ./checkpoint
- Generates an 8×8 grid of all attribute combinations
- Saves results to images
Training minimizes the standard VAE loss (the negative ELBO):
L = L_recon(x, x_hat) + β * D_KL(q(z|x,c) || p(z))
- L_recon: MSE between original and reconstructed images
- D_KL: Kullback-Leibler divergence to regularize latent space
- β: Weighting factor (default BETA_KL = 1.0)
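
The loss terms above can be sketched in PyTorch as follows. The function name `cvae_loss` and the reduction (sum over pixels and latent dimensions, mean over the batch) are assumptions for this example; `train.py` may normalize differently. The KL term uses the closed form for a diagonal Gaussian against N(0, I).

```python
import torch
import torch.nn.functional as F

BETA_KL = 1.0  # default weighting factor, as stated above

def cvae_loss(x_hat, x, mu, logvar, beta=BETA_KL):
    """Return (total, recon, kl), each averaged over the batch."""
    batch = x.size(0)
    # Reconstruction term: squared error summed over pixels (assumed reduction).
    recon = F.mse_loss(x_hat, x, reduction="sum") / batch
    # KL(N(mu, sigma^2) || N(0, I)) = -0.5 * sum(1 + logvar - mu^2 - exp(logvar))
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / batch
    return recon + beta * kl, recon, kl

x = torch.rand(2, 3, 64, 64)
mu = torch.ones(2, 8)       # hypothetical latent size of 8
logvar = torch.zeros(2, 8)
total, recon, kl = cvae_loss(x, x, mu, logvar)
print(recon.item(), kl.item())  # 0.0 4.0
```

With a perfect reconstruction the first term vanishes, and with unit variance the KL term reduces to half the squared norm of μ per sample, which is 4.0 here.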
Random faces (z ~ N(0, I)) conditioned on 8 attribute combinations. Each row represents a different attribute set.
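
A grid like the one described above could be assembled along these lines. This is a sketch under assumptions: the latent size of 16 is hypothetical, reusing the same 8 latent samples in every row is a design choice made here so that rows differ only in their attributes, and the decoder call is shown only as a comment because its real signature lives in `net.py`.

```python
from itertools import product

import torch

ATTRIBUTES = ("gender", "glasses", "beard")

# All 2^3 = 8 binary attribute combinations, one per grid row.
combos = torch.tensor(list(product([0.0, 1.0], repeat=len(ATTRIBUTES))))  # (8, 3)

samples_per_row = 8
latent_size = 16  # hypothetical; the real value is LATENT_SIZE in const.py
z = torch.randn(samples_per_row, latent_size)  # z ~ N(0, I)

# Each condition row is repeated across its grid row; the same z samples
# are reused in every row (an assumption made for this sketch).
c_grid = combos.repeat_interleave(samples_per_row, dim=0)  # (64, 3)
z_grid = z.repeat(len(combos), 1)                          # (64, 16)

# images = decoder(z_grid, c_grid)  # hypothetical call, not the repository's API
print(c_grid.shape, z_grid.shape)  # torch.Size([64, 3]) torch.Size([64, 16])
```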
