
VLM Multitask Face Analysis

A parameter-efficient approach to adapt Vision-Language Models (VLMs) for simultaneous facial analysis tasks using soft prompt optimization.

🎯 What This Code Does

This repository implements a multitask learning framework that teaches a CLIP-based vision-language model to simultaneously recognize three facial attributes from images:

  • Age (9 age groups: 0-2, 3-9, 10-19, 20-29, 30-39, 40-49, 50-59, 60-69, 70+)
  • Gender (male, female)
  • Emotion (surprise, fear, disgust, happy, sad, angry, neutral)

Instead of fine-tuning the entire model (which would mean updating millions of parameters), it uses prompt learning to adapt the model efficiently by training only a small set of continuous prompt tokens.

🚀 Key Features

Two Prompt Learning Approaches

  1. CoOp (Context Optimization) - Learns continuous text prompts that replace hand-crafted templates such as "A photo of a [class]" (see the sketch below)
  2. SoftCPT (Soft Context Prompt Tuning) - A more advanced approach that generates task-specific prompts dynamically
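
The snippet below is a minimal PyTorch-style sketch of the CoOp idea: a shared set of learnable context vectors is prepended to the class-name embeddings. It is illustrative only and does not reproduce the repository's exact prompt_learner module.

import torch
import torch.nn as nn

class CoOpPromptLearner(nn.Module):
    """Illustrative CoOp-style prompt learner (not the repository's implementation)."""
    def __init__(self, num_context_tokens, embed_dim, class_token_embeddings):
        super().__init__()
        # Learnable context vectors that replace the hand-crafted "A photo of a" prefix.
        self.context = nn.Parameter(torch.randn(num_context_tokens, embed_dim) * 0.02)
        # Frozen token embeddings of the class names (e.g. "happy", "sad", ...),
        # shaped (num_classes, class_len, embed_dim).
        self.register_buffer("class_tokens", class_token_embeddings)

    def forward(self):
        # Prepend the shared learnable context to every class embedding:
        # result has shape (num_classes, num_context_tokens + class_len, embed_dim).
        num_classes = self.class_tokens.shape[0]
        ctx = self.context.unsqueeze(0).expand(num_classes, -1, -1)
        return torch.cat([ctx, self.class_tokens], dim=1)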

Visual Prompt Tuning (VPT)

Additionally supports Visual Prompt Tuning to learn visual tokens that are prepended to image patches, further improving adaptation with minimal parameter overhead.
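
Conceptually, VPT amounts to prepending a few learnable tokens to the patch sequence before the transformer blocks. The following is a minimal sketch of that idea, not the repository's exact implementation.

import torch
import torch.nn as nn

class VisualPromptTokens(nn.Module):
    """Illustrative VPT-style module (not the repository's implementation)."""
    def __init__(self, num_prompts, embed_dim):
        super().__init__()
        # Learnable visual prompt tokens, shared across all images.
        self.prompts = nn.Parameter(torch.randn(num_prompts, embed_dim) * 0.02)

    def forward(self, patch_embeddings):
        # patch_embeddings: (batch, num_patches, embed_dim) from the frozen image encoder stem.
        batch = patch_embeddings.shape[0]
        prompts = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # Prepend the prompt tokens to the patch sequence before the transformer blocks.
        return torch.cat([prompts, patch_embeddings], dim=1)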

Parameter Efficiency

  • Only 0.1-2% of the model's parameters are trained (see the parameter-count sketch below)
  • Fast training and inference
  • Maintains accuracy competitive with full fine-tuning
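
As a quick sanity check, the trainable-parameter fraction can be computed with a generic PyTorch helper like the one below, assuming only the prompt modules have requires_grad=True:

def trainable_fraction(model):
    # Counts trainable vs. total parameters of a PyTorch model; with prompt tuning,
    # only the prompt modules should require gradients.
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    return trainable, total, 100.0 * trainable / total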

📊 Try It Out

🎮 Live Demo: Test the trained model interactively on Hugging Face Spaces
👉 https://huggingface.co/spaces/moan2851/PECore_age_gender_emotion_recognition

Upload a face image and see the model predict age, gender, and emotion in real time!

How to Use This Code

To use this code and reproduce the training or test results, please install the required packages.

pip install -r requirements.txt

Dataset Preparation

All datasets used for training and testing were preprocessed with the code in the Dataset_preprocessing folder to produce center-cropped face images, each containing a single person (a simplified sketch of this cropping step is shown after the commands below).

(The Lagenda dataset is an exception because it already provides annotations and face locations for all faces in each image.)

To obtain the cropped faces for each dataset, run the following Python script:

python3 dataset_processing.py --folder ../datasets_with_standard_labels/CelebA_HQ --output_dir ../processed_datasets --num_threads 4 --size 384

or, for simplicity, run the process.sh script, which calls the Python script for all the datasets used.
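
The cropping step roughly corresponds to the sketch below, which uses OpenCV's Haar cascade face detector; the actual dataset_processing.py may use a different detector and handle multi-face images differently.

import cv2

def crop_face(image_path, size=384):
    # Illustrative sketch: detect a single face and return a center-cropped, resized image.
    detector = cv2.CascadeClassifier(cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    image = cv2.imread(image_path)
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) != 1:
        return None  # keep only images with exactly one detected face
    x, y, w, h = faces[0]
    face = image[y:y + h, x:x + w]
    return cv2.resize(face, (size, size))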

After processing, each dataset was split into training and validation sets using a fixed seed and an 80/20 ratio. A Python script performs the split and is called only for the datasets used for training (a simplified sketch of the split is shown after the commands below):

python3 dataset/split_csv.py ../processed_datasets/datasets_with_standard_labels/CelebA_HQ/train/labels.csv --train_ratio 0.8 --seed 2025 --rename_original_csv

A bash script is also provided to call the splitting script automatically on all the datasets of interest:

./script/split_dataset.sh
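
The split itself amounts to the following sketch (illustrative; dataset/split_csv.py may differ in column handling and output file naming):

import pandas as pd

def split_labels(csv_path, train_ratio=0.8, seed=2025):
    # Read the labels CSV and split it 80/20 with a fixed seed for reproducibility.
    labels = pd.read_csv(csv_path)
    train = labels.sample(frac=train_ratio, random_state=seed)
    val = labels.drop(train.index)
    # Output file names here are placeholders, not the script's actual naming scheme.
    train.to_csv("train_labels.csv", index=False)
    val.to_csv("val_labels.csv", index=False)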

Note: The uploaded ../processed_datasets.zip already contains the processed and split datasets.

Start the Training

Two training Python scripts are used to run the experiments: one for single-task training and one for multitask training. Both work the same way: they take a JSON configuration file containing all the information needed for training. The configuration file selects the type of experiment to run and the hyperparameters.

The experiments can be reproduced by running the following Python scripts:

Single-task CoOp experiments for the emotion task

python3 coop_train.py ./config/coop/emotion/PE_coop_15.json
python3 coop_train.py ./config/coop/emotion/PE_coop_20.json
python3 coop_train.py ./config/coop/emotion/PE_coop_25.json

Single-task VPT + CoOp experiments for the emotion task

python3 coop_train.py ./config/coop/emotion/PE_vpt_10_cn15.json
python3 coop_train.py ./config/coop/emotion/PE_vpt_10_cn20.json
python3 coop_train.py ./config/coop/emotion/PE_vpt_10_cn25.json

Single-task CoOp experiments for the age task

python3 coop_train.py ./config/coop/age/PE_coop_15.json
python3 coop_train.py ./config/coop/age/PE_coop_20.json
python3 coop_train.py ./config/coop/age/PE_coop_25.json

Single-task VPT + CoOp experiments for the age task

python3 coop_train.py ./config/coop/age/PE_vpt_10_cn15.json
python3 coop_train.py ./config/coop/age/PE_vpt_10_cn20.json
python3 coop_train.py ./config/coop/age/PE_vpt_10_cn25.json

Single-task CoOp experiments for the gender task

python3 coop_train.py ./config/coop/gender/PE_coop_15.json
python3 coop_train.py ./config/coop/gender/PE_coop_20.json
python3 coop_train.py ./config/coop/gender/PE_coop_25.json

Single-task VPT + CoOp experiments for the gender task

python3 coop_train.py ./config/coop/gender/PE_vpt_10_cn15.json
python3 coop_train.py ./config/coop/gender/PE_vpt_10_cn20.json
python3 coop_train.py ./config/coop/gender/PE_vpt_10_cn25.json

Multitask SoftCPT experiments

python3 train_multitask.py ./config/softpe_15.json
python3 train_multitask.py ./config/softpe_20.json
python3 train_multitask.py ./config/softpe_25.json

Multitask VPT + SoftCPT experiments

python3 train_multitask.py ./config/vpt_10_cn_15.json
python3 train_multitask.py ./config/vpt_10_cn_20.json
python3 train_multitask.py ./config/vpt_10_cn_25.json

Alternatively, the following bash scripts run all the configurations for a given single task or for the multitask setting (each script also tests all the trained models at the end of the experiments):

./script/coop_pe_age.sh      # CoOp and CoOp+VPT for age
./script/coop_pe_gender.sh   # CoOp and CoOp+VPT for gender
./script/coop_pe_emotion.sh  # CoOp and CoOp+VPT for emotion
./script/softpe.sh           # SoftCPT and SoftCPT+VPT

NOTE: The VPT experiments load the pretrained SoftCPT weights from the path specified in the config file, so the SoftCPT experiments must be run first.

Testing

To test a model and obtain the accuracy scores, GFLOPs, and parameter counts, use the test.py script:

python3 test.py --model_type "PECore" \
                    --num_prompt 0 \
                    --dataset_path "../processed_datasets/datasets_with_standard_labels/RAF-DB" \
                    "../processed_datasets/datasets_with_standard_labels/UTKFace" \
                    "../processed_datasets/datasets_with_standard_labels/FairFace" \
                    "../processed_datasets/datasets_with_standard_labels/CelebA_HQ" \
                    "../processed_datasets/datasets_with_standard_labels/VggFace2-Test" \
                    --ckpt_dir "../TRAIN/PECore/L14/SoftCPT/TSCA_cntx_15/ckpt/" \
                    --batch_size 32 --no_tqdm

The script takes the following arguments:

--model_type : "PECore"

--dataset_path : Paths to the datasets to test, separated by spaces. For each dataset the test split is loaded.

--batch_size : Batch size to use.

--output_path : Output directory for saving results and plots. If not passed, the output directory is derived automatically from ckpt_dir (TRAIN -> TEST).

--num_prompt : Number of visual prompt tokens in the loaded model. Use 0 if the model uses text prompting only.

--ckpt_dir : Path to the checkpoint directory containing all the weights saved during training. The different .pt files are detected and loaded automatically. (Note: if no vision_ckpt.pt is found in the checkpoint directory, it is downloaded automatically from the Hugging Face Hub.)

--no_tqdm : If passed, no progress bar is shown.

If all the experiments were run with the configurations above, the following script tests all the trained models at once, without running the Python script separately for each one:

./script/test_all.sh

To test the baseline with the standard hard prompt template "A photo of a <class>", run the following Python script (a simplified sketch of zero-shot hard-prompt classification is shown after the command):

python3 baseline.py --model_type "PECore" \
                    --dataset_path "../processed_datasets/datasets_with_standard_labels/RAF-DB" \
                    "../processed_datasets/datasets_with_standard_labels/UTKFace" \
                    "../processed_datasets/datasets_with_standard_labels/FairFace" \
                    "../processed_datasets/datasets_with_standard_labels/CelebA_HQ" \
                    "../processed_datasets/datasets_with_standard_labels/VggFace2-Test" \                    
                    --batch_size 32 --no_tqdm
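
Conceptually, the hard-prompt baseline is standard CLIP-style zero-shot classification. The sketch below assumes a CLIP-like model exposing encode_image/encode_text and a matching tokenizer; it is not the repository's exact baseline.py.

import torch

def zero_shot_predict(model, tokenizer, image_tensor, class_names):
    # Build hard prompts from the fixed template and encode them once per class.
    prompts = [f"A photo of a {c}" for c in class_names]
    text_tokens = tokenizer(prompts)
    with torch.no_grad():
        image_feat = model.encode_image(image_tensor.unsqueeze(0))
        text_feat = model.encode_text(text_tokens)
    # Cosine similarity between the image embedding and each class prompt embedding.
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    logits = image_feat @ text_feat.t()
    return class_names[logits.argmax(dim=-1).item()]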

Configuration

All the configuration files are stored in the config/ directory in JSON format. In particular, the configuration files for each task can be found in the config/coop/<task_name> folder, while the multitask configuration files can be found in the config/ folder.

Configuration Parameters Reference

Below is a comprehensive list of all parameters that can be used in the configuration files:

Core Training Parameters

TUNING (string)

  • Description: Type of tuning method to use. Options: "softcpt" (Soft Context Prompt Tuning), "coop" (Context Optimization)
  • How to switch from CoOp to SoftCPT: Change "TUNING": "coop" to "TUNING": "softcpt"
  • Example: "TUNING": "softcpt"

MODEL (string)

  • Description: Base model architecture to use
  • Example: "MODEL": "pecore"

TASK (integer)

  • Description: Task identifier. Use -1 for multitask training, or specific task index (0, 1, 2, etc.) for single-task training
  • How to switch from multitask to single-task: Change "TASK": -1 to "TASK": 0 (or 1, 2 for other tasks)
  • Example: "TASK": -1 (multitask) or "TASK": 0 (single-task for first task)

MODEL_TYPE (string)

  • Description: Specific model variant/configuration
  • Example: "MODEL_TYPE": "PE-Core-L14-336"

Prompt Configuration

NUM_VISUAL_PROMPT (integer)

  • Description: Number of visual prompt tokens to use in VPT (Visual Prompt Tuning). Set to 0 to disable VPT and use text-only prompting
  • How to switch from VPT to text-only: Change "NUM_VISUAL_PROMPT": 10 to "NUM_VISUAL_PROMPT": 0
  • Example: "NUM_VISUAL_PROMPT": 10 (VPT enabled) or "NUM_VISUAL_PROMPT": 0 (text-only)

NUM_TEXT_CNTX (integer)

  • Description: Number of text context tokens for prompt learning
  • Example: "NUM_TEXT_CNTX": 25

TASK_NAMES (array of strings)

  • Description: Natural language descriptions of each task
  • Example: ["age estimation from face picture", "gender recognition from facial features", "emotion classification from facial expression"]

CLASSES (array of arrays)

  • Description: Class labels for each task. Each inner array corresponds to one task
  • Example:
[
    ["0-2", "3-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"],
    ["male", "female"],
    ["surprise", "fear", "disgust", "happy", "sad", "angry", "neutral"]
]

NAMED_TRAINABLE_PARAMETERS (array of strings)

  • Description: Names of model components that should be trainable during training
  • Example: ["prompt_learner", "task_prompt_learner", "prompt_gen"]

Dataset Configuration

DATASET_NAMES (object)

  • Description: Mapping of task ID to dataset name(s). Use key "-1" with array of datasets for multitask, or specific task keys with single dataset for single-task
  • How to switch from multitask to single-task: Change from {"-1": ["Dataset1", "Dataset2"]} to {"0": "Dataset1"} (and set TASK to 0)
  • Example:
    • Multitask: {"-1": ["FairFace", "RAF-DB", "CelebA_HQ", "Lagenda"]}
    • Single-task: {"0": "FairFace"}

DATASET_ROOT (string)

  • Description: Root directory path where processed datasets are stored
  • Example: "DATASET_ROOT": "../processed_datasets/datasets_with_standard_labels"

BALANCE_TASK (object)

  • Description: Target ratio of each task's samples in the merged multitask dataset. Keys are task IDs (as strings), values are target ratios (see the sketch below)
  • Example: {"2": 0.33} (task 2 should make up 33% of the merged dataset)

Model Loading

PRETRAINED_CPT (string, optional)

  • Description: Path to a pretrained checkpoint file. Omit this parameter to train from scratch
  • How to switch from pretrained to from-scratch: Remove the "PRETRAINED_CPT" key from the JSON file
  • Example: "PRETRAINED_CPT": "../TRAIN/PECore/L14/SoftCPT/TSCA_cntx_25/ckpt/softCPT_tokens_bval.pt"

Training Hyperparameters

BATCH_SIZE (integer)

  • Description: Number of samples per training batch
  • Example: "BATCH_SIZE": 60

LR (float)

  • Description: Learning rate for the optimizer
  • Example: "LR": 0.002

EPOCHS (integer)

  • Description: Maximum number of training epochs
  • Example: "EPOCHS": 50

PATIENCE (integer)

  • Description: Number of epochs with no improvement after which training will be stopped (early stopping)
  • Example: "PATIENCE": 7

Loss and Regularization

EMD_WEIGHT (float)

  • Description: Weight for the Earth Mover's Distance (EMD) loss component (see the sketch after this parameter group)
  • Example: "EMD_WEIGHT": 30

EMD_OMEGA (float)

  • Description: Omega parameter for EMD loss
  • Example: "EMD_OMEGA": 2.0

EMD_MU (float)

  • Description: Mu parameter for EMD loss
  • Example: "EMD_MU": -0.0025

Data Loading

NUM_WORKERS (integer)

  • Description: Number of subprocesses to use for data loading
  • Example: "NUM_WORKERS": 3

PREFETCH_FACTOR (integer)

  • Description: Number of batches loaded in advance by each worker
  • Example: "PREFETCH_FACTOR": 1

Output and Logging

OUTPUT_DIR (string)

  • Description: Directory path where training outputs (checkpoints, logs) will be saved
  • Example: "OUTPUT_DIR": "../TRAIN/PECore/L14/SoftCPT/TSCA_cntx_25_vpt_10"

VERBOSE (boolean)

  • Description: Enable verbose logging output
  • Example: "VERBOSE": true

USE_TQDM (boolean)

  • Description: Enable tqdm progress bars during training
  • Example: "USE_TQDM": false

Example Configuration for Multitask Training (config/vpt_10_cn25.json)

{
    "TUNING" : "softcpt",
    "MODEL"  : "pecore",
    "TASK"   : -1,
    "MODEL_TYPE" : "PE-Core-L14-336",

    "NUM_VISUAL_PROMPT" : 10,
    "NUM_TEXT_CNTX" : 25,
    "TASK_NAMES": ["age estimation from face picture", "gender recognition from facial features", "emotion classification from facial expression"], 
    "CLASSES": [
        ["0-2", "3-9", "10-19", "20-29", "30-39", "40-49", "50-59", "60-69", "70+"],
        ["male", "female"],
        ["surprise", "fear", "disgust", "happy", "sad", "angry", "neutral"]
    ],    
    "NAMED_TRAINABLE_PARAMETERS": [
        "prompt_learner",
        "task_prompt_learner", 
        "prompt_gen"
    ],

    "DATASET_NAMES": {
        "-1" : ["FairFace", "RAF-DB", "CelebA_HQ", "Lagenda"]     
    },

    "PRETRAINED_CPT" : "../TRAIN/PECore/L14/SoftCPT/TSCA_cntx_25/ckpt/softCPT_tokens_bval.pt",


    "CSP": false,
    "EMD_WEIGHT" : 30,
    "EMD_OMEGA" : 2.0,
    "EMD_MU": -0.0025,
    "DATASET_ROOT": "../processed_datasets/datasets_with_standard_labels",
    "BALANCE_TASK" : {"2" : 0.33},
    "BATCH_SIZE" : 60,
    "NUM_WORKERS" : 3,
    "LR" : 0.002,
    "OUTPUT_DIR" : "../TRAIN/PECore/L14/SoftCPT/TSCA_cntx_25_vpt_10",
    "PREFETCH_FACTOR" : 1,
    "VERBOSE": true,
    "USE_TQDM": false,
    "EPOCHS" : 50,
    "PATIENCE" : 7

}

Quick Configuration Switching Guide

To switch from VPT to text-only prompting:

  1. Set "NUM_VISUAL_PROMPT": 0
  2. Remove the "PRETRAINED_CPT" key from the JSON file

To switch from CoOp to SoftCPT prompting:

  • Change "TUNING": "coop" to "TUNING": "softcpt"

To switch from multitask to single-task training:

  1. Change "TASK": -1 to "TASK": 0 (or 1, 2 for different tasks)
  2. Change "DATASET_NAMES": {"-1": ["Dataset1", "Dataset2", ...]} to "DATASET_NAMES": {"0": "Dataset1"}

Example: Text-only SoftCPT configuration (config/softpe_20.json)

{
    "TUNING" : "softcpt",
    "NUM_VISUAL_PROMPT" : 0,  // Text-only (no VPT)
    "NUM_TEXT_CNTX" : 20,
    // ... omitting the PRETRAINED_CPT key means training from scratch, with text prompting only
}

Example: VPT with Soft prompting configuration (config/vpt_10_cn25.json)

{
    "TUNING" : "softcpt",
    "NUM_VISUAL_PROMPT" : 10,  // VPT enabled
    "NUM_TEXT_CNTX" : 25,
    "PRETRAINED_CPT" : "../path/to/pretrained.pt",  // Load pretrained weights
    // ...
}

Acknowledgements

This project contains code from the Perception Encoder repository (https://github.com/facebookresearch/perception_models), licensed under the Apache License, Version 2.0.
