A parameter-efficient approach to adapt Vision-Language Models (VLMs) for simultaneous facial analysis tasks using soft prompt optimization.
This repository implements a multitask learning framework that teaches a CLIP-based vision-language model to simultaneously recognize three facial attributes from images:
- Age (9 age groups: 0-2, 3-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70+)
- Gender (male, female)
- Emotion (surprise, fear, disgust, happy, sad, angry, neutral)
Instead of fine-tuning the entire model (which would require training millions of parameters), it uses prompt learning to adapt the model efficiently by learning only a small set of continuous prompt tokens.
- CoOp (Context Optimization) - Learns text prompts that replace hand-crafted templates like "A photo of a [class]"
- SoftCPT (Soft Context Prompt Tuning) - Advanced approach that generates task-specific prompts dynamically
It additionally supports Visual Prompt Tuning (VPT), which learns visual tokens prepended to the image patches, further improving adaptation with minimal parameter overhead.
- Only 0.1-2% of model parameters are trained
- Fast training and inference
- Maintains accuracy competitive with full fine-tuning approaches
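To make the prompt-learning idea concrete, the sketch below shows a CoOp-style prompt learner in plain PyTorch: a handful of learnable context embeddings are prepended to frozen class-name embeddings and stand in for a hand-crafted template. This is an illustrative sketch only, not this repository's implementation, and the class and argument names are hypothetical.

```python
import torch
import torch.nn as nn

class CoOpPromptLearner(nn.Module):
    """Illustrative CoOp-style soft prompts (sketch only, not the repo's code).

    A small set of learnable context vectors replaces the hand-crafted words
    in "A photo of a [class]"; only these vectors are trained, while the
    class-name embeddings and the CLIP encoders stay frozen.
    """

    def __init__(self, num_context: int, embed_dim: int, class_name_embeds: torch.Tensor):
        super().__init__()
        # The only trainable parameters: (num_context, embed_dim) "soft words".
        self.context = nn.Parameter(0.02 * torch.randn(num_context, embed_dim))
        # Pre-computed, frozen token embeddings of the class names: (C, L, D).
        self.register_buffer("class_name_embeds", class_name_embeds)

    def forward(self) -> torch.Tensor:
        num_classes = self.class_name_embeds.size(0)
        ctx = self.context.unsqueeze(0).expand(num_classes, -1, -1)   # (C, N, D)
        # One prompt per class: [learned context | class-name tokens] -> (C, N+L, D)
        return torch.cat([ctx, self.class_name_embeds], dim=1)
```

The concatenated sequences are then encoded by the frozen text encoder to produce the per-class text features used for classification.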
🎮 Live Demo: Test the trained model interactively on Hugging Face Spaces
👉 https://huggingface.co/spaces/moan2851/PECore_age_gender_emotion_recognition
Upload a face image and see the model predict age, gender, and emotion in real-time!
To use this code and reproduce the training or test results, please install the required packages.
```bash
pip install -r requirements.txt
```

All the datasets used for training and testing were preprocessed using the code available in the Dataset_preprocessing folder to produce centered, cropped face images of a single person from each image.
(An exception is made for the Lagenda dataset because it already provides annotations and face locations for all faces in each image)
To obtain the cropped faces for each dataset, run the following Python script:
```bash
python3 dataset_processing.py --folder ../datasets_with_standard_labels/CelebA_HQ --output_dir ../processed_datasets --num_threads 4 --size 384
```

Alternatively, for simplicity, run the script process.sh, which calls the Python script for all the datasets used.
After processing, each dataset was split into training and validation sets using a fixed seed and an 80/20 ratio. A Python script handles the splitting and is called only on the datasets used for training:
```bash
python3 dataset/split_csv.py ../processed_datasets/datasets_with_standard_labels/CelebA_HQ/train/labels.csv --train_ratio 0.8 --seed 2025 --rename_original_csv
```

A bash script is also provided to run the splitting automatically on all the datasets of interest:
```bash
./script/split_dataset.sh
```

Note: The uploaded ../processed_datasets.zip already contains the processed and split datasets.
Two training Python scripts were used to run the experiments: one for single-task training and one for multitask training. Both work in the same way: they take a JSON configuration file that contains all the information needed for training. Through the configuration files it is possible to select the type of experiment to run and the hyperparameters.
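As a rough illustration of how such a configuration drives an experiment, the snippet below reads one of the provided config files and inspects the keys documented later in this README; the loader shown here is hypothetical and much simpler than the actual training scripts.

```python
import json

# Sketch: read one of the provided configs and inspect the experiment it defines.
# (Hypothetical loader; the real training scripts parse more keys than shown here.)
with open("./config/softpe_25.json") as f:
    cfg = json.load(f)

if cfg["TASK"] == -1:
    print("Multitask training on:", cfg["DATASET_NAMES"]["-1"])
else:
    print("Single-task training:", cfg["TASK_NAMES"][cfg["TASK"]])
print("Tuning:", cfg["TUNING"], "| text context tokens:", cfg["NUM_TEXT_CNTX"])
```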
The experiments can be reproduced by running the following Python scripts:
```bash
# CoOp and CoOp+VPT for emotion
python3 coop_train.py ./config/coop/emotion/PE_coop_15.json
python3 coop_train.py ./config/coop/emotion/PE_coop_20.json
python3 coop_train.py ./config/coop/emotion/PE_coop_25.json

python3 coop_train.py ./config/coop/emotion/PE_vpt_10_cn15.json
python3 coop_train.py ./config/coop/emotion/PE_vpt_10_cn20.json
python3 coop_train.py ./config/coop/emotion/PE_vpt_10_cn25.json
```

```bash
# CoOp and CoOp+VPT for age
python3 coop_train.py ./config/coop/age/PE_coop_15.json
python3 coop_train.py ./config/coop/age/PE_coop_20.json
python3 coop_train.py ./config/coop/age/PE_coop_25.json

python3 coop_train.py ./config/coop/age/PE_vpt_10_cn15.json
python3 coop_train.py ./config/coop/age/PE_vpt_10_cn20.json
python3 coop_train.py ./config/coop/age/PE_vpt_10_cn25.json
```

```bash
# CoOp and CoOp+VPT for gender
python3 coop_train.py ./config/coop/gender/PE_coop_15.json
python3 coop_train.py ./config/coop/gender/PE_coop_20.json
python3 coop_train.py ./config/coop/gender/PE_coop_25.json

python3 coop_train.py ./config/coop/gender/PE_vpt_10_cn15.json
python3 coop_train.py ./config/coop/gender/PE_vpt_10_cn20.json
python3 coop_train.py ./config/coop/gender/PE_vpt_10_cn25.json
```

```bash
# SoftCPT and SoftCPT+VPT (multitask)
python3 train_multitask.py ./config/softpe_15.json
python3 train_multitask.py ./config/softpe_20.json
python3 train_multitask.py ./config/softpe_25.json

python3 train_multitask.py ./config/vpt_10_cn_15.json
python3 train_multitask.py ./config/vpt_10_cn_20.json
python3 train_multitask.py ./config/vpt_10_cn_25.json
```

Alternatively, the following bash scripts run all the configurations for a particular single task or for the multitask setting (these scripts also test all the trained models at the end of the experiments):
```bash
./script/coop_pe_age.sh      # CoOp and CoOp+VPT for age
./script/coop_pe_gender.sh   # CoOp and CoOp+VPT for gender
./script/coop_pe_emotion.sh  # CoOp and CoOp+VPT for emotion
./script/softpe.sh           # SoftCPT and SoftCPT+VPT
```

NOTE: The VPT experiments load the pretrained SoftCPT weights from the path specified in the config file, so the SoftCPT experiments must be run first.
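For reference, restoring pretrained SoftCPT tokens before a VPT run boils down to something like the sketch below; the module and key names are hypothetical, and the actual loading logic is driven by the PRETRAINED_CPT entry in the config file.

```python
import torch
import torch.nn as nn

def load_pretrained_prompt_tokens(prompt_learner: nn.Module, ckpt_path: str) -> None:
    """Copy pretrained soft-prompt weights into a prompt-learner module.
    Sketch only: key names and strictness depend on the actual checkpoints."""
    state = torch.load(ckpt_path, map_location="cpu")
    prompt_learner.load_state_dict(state, strict=False)
```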
To test a model and obtain the accuracy scores, the GFLOPs, and the parameter counts, use the test.py script:
```bash
python3 test.py --model_type "PECore" \
    --num_prompt 0 \
    --dataset_path "../processed_datasets/datasets_with_standard_labels/RAF-DB" \
                   "../processed_datasets/datasets_with_standard_labels/UTKFace" \
                   "../processed_datasets/datasets_with_standard_labels/FairFace" \
                   "../processed_datasets/datasets_with_standard_labels/CelebA_HQ" \
                   "../processed_datasets/datasets_with_standard_labels/VggFace2-Test" \
    --ckpt_dir "../TRAIN/PECore/L14/SoftCPT/TSCA_cntx_15/ckpt/" \
    --batch_size 32 --no_tqdm
```

The script accepts the following arguments:
- `--model_type` : "PECore"
- `--dataset_path` : Paths to the datasets to test, separated by spaces. For each dataset the test split is loaded.
- `--batch_size` : Size of the batch to be used.
- `--output_path` : Output directory for saving the results and the plots; if not passed, it is derived automatically from `--ckpt_dir` (TRAIN -> TEST).
- `--num_prompt` : Number of visual prompt tokens in the loaded model; use 0 if the model only uses text prompting.
- `--ckpt_dir` : Path to the checkpoint directory containing all the weights saved during training. The different .pt files are detected and loaded automatically. (Note: if no vision_ckpt.pt is found in the checkpoint directory, it is downloaded automatically from the Hugging Face Hub.)
- `--no_tqdm` : If passed, no progress bar is shown.
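As a point of reference, the trainable-parameter count that test.py reports can be reproduced with a few lines of PyTorch; this is a minimal sketch, not the script's exact code.

```python
import torch.nn as nn

def count_parameters(model: nn.Module) -> tuple[int, int]:
    """Return (trainable, total) parameter counts for a model.
    Illustrative helper; test.py computes these (plus GFLOPs) internally."""
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total
```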
If all the experiments were run with the previous configurations, the following script tests all the trained models at once, without running the Python script above separately for each one:

```bash
./script/test_all.sh
```

To test the baseline with the standard hard-prompt template "A photo of a <class>", run the following Python script:
```bash
python3 baseline.py --model_type "PECore" \
    --dataset_path "../processed_datasets/datasets_with_standard_labels/RAF-DB" \
                   "../processed_datasets/datasets_with_standard_labels/UTKFace" \
                   "../processed_datasets/datasets_with_standard_labels/FairFace" \
                   "../processed_datasets/datasets_with_standard_labels/CelebA_HQ" \
                   "../processed_datasets/datasets_with_standard_labels/VggFace2-Test" \
    --batch_size 32 --no_tqdm
```

All the configuration files are stored in the config/ directory in JSON format. In particular, the configuration files for each task can be found in the `config/coop/<task_name>` folder, while the multitask configuration files can be found in the `config/` folder.
Below is a comprehensive list of all parameters that can be used in the configuration files:
TUNING (string)
- Description: Type of tuning method to use. Options: `"softcpt"` (Soft Context Prompt Tuning), `"coop"` (Context Optimization)
- How to switch from CoOp to SoftCPT: change `"TUNING": "coop"` to `"TUNING": "softcpt"`
- Example: `"TUNING": "softcpt"`

MODEL (string)
- Description: Base model architecture to use
- Example: `"MODEL": "pecore"`

TASK (integer)
- Description: Task identifier. Use `-1` for multitask training, or a specific task index (0, 1, 2, etc.) for single-task training
- How to switch from multitask to single-task: change `"TASK": -1` to `"TASK": 0` (or 1, 2 for other tasks)
- Example: `"TASK": -1` (multitask) or `"TASK": 0` (single-task for the first task)

MODEL_TYPE (string)
- Description: Specific model variant/configuration
- Example: `"MODEL_TYPE": "PE-Core-L14-336"`

NUM_VISUAL_PROMPT (integer)
- Description: Number of visual prompt tokens to use in VPT (Visual Prompt Tuning). Set to `0` to disable VPT and use text-only prompting
- How to switch from VPT to text-only: change `"NUM_VISUAL_PROMPT": 10` to `"NUM_VISUAL_PROMPT": 0`
- Example: `"NUM_VISUAL_PROMPT": 10` (VPT enabled) or `"NUM_VISUAL_PROMPT": 0` (text-only)
NUM_TEXT_CNTX (integer)
- Description: Number of text context tokens for prompt learning
- Example: `"NUM_TEXT_CNTX": 25`

TASK_NAMES (array of strings)
- Description: Natural language descriptions of each task
- Example: `["age estimation from face picture", "gender recognition from facial features", "emotion classification from facial expression"]`

CLASSES (array of arrays)
- Description: Class labels for each task. Each inner array corresponds to one task
- Example:

```json
[
    ["0-2", "3-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"],
    ["male", "female"],
    ["surprise", "fear", "disgust", "happy", "sad", "angry", "neutral"]
]
```

NAMED_TRAINABLE_PARAMETERS (array of strings)
- Description: Names of model components that should be trainable during training
- Example: `["prompt_learner", "task_prompt_learner", "prompt_gen"]`

DATASET_NAMES (object)
- Description: Mapping of task ID to dataset name(s). Use key `"-1"` with an array of datasets for multitask, or a specific task key with a single dataset for single-task
- How to switch from multitask to single-task: change `{"-1": ["Dataset1", "Dataset2"]}` to `{"0": "Dataset1"}` (and set `TASK` to 0)
- Example:
  - Multitask: `{"-1": ["FairFace", "RAF-DB", "CelebA_HQ", "Lagenda"]}`
  - Single-task: `{"0": "FairFace"}`

DATASET_ROOT (string)
- Description: Root directory path where processed datasets are stored
- Example: `"DATASET_ROOT": "../processed_datasets/datasets_with_standard_labels"`

BALANCE_TASK (object)
- Description: Task-specific ratio in the merged dataset. Keys are task IDs (as strings), values are the target ratios.
- Example: `{"2": 0.33}` (task 2 should represent 33% of the merged dataset)
PRETRAINED_CPT (string, optional)
- Description: Path to a pretrained checkpoint file. Omit this parameter entirely to train from scratch
- How to switch from pretrained to from-scratch: remove the `"PRETRAINED_CPT"` key from the JSON file
- Example: `"PRETRAINED_CPT": "../TRAIN/PECore/L14/SoftCPT/TSCA_cntx_25/ckpt/softCPT_tokens_bval.pt"`

BATCH_SIZE (integer)
- Description: Number of samples per training batch
- Example: `"BATCH_SIZE": 60`

LR (float)
- Description: Learning rate for the optimizer
- Example: `"LR": 0.002`

EPOCHS (integer)
- Description: Maximum number of training epochs
- Example: `"EPOCHS": 50`

PATIENCE (integer)
- Description: Number of epochs with no improvement after which training is stopped (early stopping)
- Example: `"PATIENCE": 7`

EMD_WEIGHT (float)
- Description: Weight of the Earth Mover's Distance loss component
- Example: `"EMD_WEIGHT": 30`

EMD_OMEGA (float)
- Description: Omega parameter for the EMD loss
- Example: `"EMD_OMEGA": 2.0`

EMD_MU (float)
- Description: Mu parameter for the EMD loss
- Example: `"EMD_MU": -0.0025`
NUM_WORKERS (integer)
- Description: Number of subprocesses to use for data loading
- Example: `"NUM_WORKERS": 3`

PREFETCH_FACTOR (integer)
- Description: Number of batches loaded in advance by each worker
- Example: `"PREFETCH_FACTOR": 1`

OUTPUT_DIR (string)
- Description: Directory path where training outputs (checkpoints, logs) will be saved
- Example: `"OUTPUT_DIR": "../TRAIN/PECore/L14/SoftCPT/TSCA_cntx_25_vpt_10"`

VERBOSE (boolean)
- Description: Enable verbose logging output
- Example: `"VERBOSE": true`

USE_TQDM (boolean)
- Description: Enable tqdm progress bars during training
- Example: `"USE_TQDM": false`
Example Configuration for Multitask Training (config/vpt_10_cn_25.json)

```json
{
"TUNING" : "softcpt",
"MODEL" : "pecore",
"TASK" : -1,
"MODEL_TYPE" : "PE-Core-L14-336",
"NUM_VISUAL_PROMPT" : 10,
"NUM_TEXT_CNTX" : 25,
"TASK_NAMES": ["age estimation from face picture", "gender recognition from facial features", "emotion classification from facial expression"],
"CLASSES": [
["0-2", "3-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"],
["male", "female"],
["surprise", "fear", "disgust", "happy", "sad", "angry", "neutral"]
],
"NAMED_TRAINABLE_PARAMETERS": [
"prompt_learner",
"task_prompt_learner",
"prompt_gen"
],
"DATASET_NAMES": {
"-1" : ["FairFace", "RAF-DB", "CelebA_HQ", "Lagenda"]
},
"PRETRAINED_CPT" : "../TRAIN/PECore/L14/SoftCPT/TSCA_cntx_25/ckpt/softCPT_tokens_bval.pt",
"CSP": false,
"EMD_WEIGHT" : 30,
"EMD_OMEGA" : 2.0,
"EMD_MU": -0.0025,
"DATASET_ROOT": "../processed_datasets/datasets_with_standard_labels",
"BALANCE_TASK" : {"2" : 0.33},
"BATCH_SIZE" : 60,
"NUM_WORKERS" : 3,
"LR" : 0.002,
"OUTPUT_DIR" : "../TRAIN/PECore/L14/SoftCPT/TSCA_cntx_25_vpt_10",
"PREFETCH_FACTOR" : 1,
"VERBOSE": true,
"USE_TQDM": false,
"EPOCHS" : 50,
"PATIENCE" : 7
}
```

To switch from VPT to text-only prompting:
- Set `"NUM_VISUAL_PROMPT": 0`
- Remove the `"PRETRAINED_CPT"` key from the JSON file
To switch from CoOp to SoftCPT prompting:
- Change `"TUNING": "coop"` to `"TUNING": "softcpt"`
To switch from multitask to single-task training:
- Change `"TASK": -1` to `"TASK": 0` (or 1, 2 for different tasks)
- Change `"DATASET_NAMES": {"-1": ["Dataset1", "Dataset2", ...]}` to `"DATASET_NAMES": {"0": "Dataset1"}`
Example: Text-only SoftCPT prompting configuration (config/softpe_20.json)

```
{
    "TUNING" : "softcpt",
    "NUM_VISUAL_PROMPT" : 0,   // Text-only (no VPT)
    "NUM_TEXT_CNTX" : 20,
    // ... no PRETRAINED_CPT key means training from scratch and prompting the model with text only
}
```

Example: VPT with SoftCPT prompting configuration (config/vpt_10_cn_25.json)
```
{
    "TUNING" : "softcpt",
    "NUM_VISUAL_PROMPT" : 10,   // VPT enabled
    "NUM_TEXT_CNTX" : 25,
    "PRETRAINED_CPT" : "../path/to/pretrained.pt",   // Load pretrained weights
    // ...
}
```

This project contains code from the Perception Encoder repository (https://github.com/facebookresearch/perception_models), licensed under the Apache License, Version 2.0.