Human-CLAP is a human-perception-based CLAP model fine-tuned with human-scored text-audio similarity. Fine-tuning improves the correlation between CLAPScore and human-scored text-audio similarity, so that CLAPScore aligns more closely with human perception.

Our Human-CLAP models can be used in the same way as the CLAP models available on HuggingFace Transformers.
- Trained Human-CLAP models can be found on HuggingFace
- Install

Recommended Python version: 3.11

```bash
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install transformers soundfile
```
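As an optional sanity check (not part of the original instructions), the snippet below confirms that the installed packages import and that a CUDA build of PyTorch is usable; the prediction example that follows moves the model to GPU 0:

```python
import torch
import transformers

## the prediction example below runs on GPU 0, so CUDA must be available
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```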
- Predict CLAPScore

Prediction example from `cal_clap.py`:

- Load the model from `sarulab-speech/human-clap-wsce-mae`
- Load the processor and the tokenizer from `laion/clap-htsat-fused`
```python
import torch
import torchaudio
from transformers import ClapModel, ClapProcessor


## util codes
def get_audio(audiopath):
    audio, sr = torchaudio.load(audiopath)
    audio = audio[0]  # use the first channel
    if sr != 48000:  # CLAP expects 48 kHz input
        resampler = torchaudio.transforms.Resample(sr, 48000)
        audio = resampler(audio)
    audio = audio.detach().numpy().copy()
    return audio


## calculate CLAPScore
def CLAPScore(text, audiopath, model, processor):
    ## get audio
    audio = get_audio(audiopath=audiopath)
    ## preprocess the text and audio
    inputs = processor(
        text=text, audios=audio, return_tensors="pt", sampling_rate=48000, padding=True
    ).to(0)
    ## predict
    outputs = model(**inputs)
    ## normalize the embeddings
    text_embeds = outputs.text_embeds
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    audio_embeds = outputs.audio_embeds
    audio_embeds = audio_embeds / audio_embeds.norm(dim=-1, keepdim=True)
    logits_per_text = torch.matmul(text_embeds, audio_embeds.t())
    logits_per_audio = torch.matmul(audio_embeds, text_embeds.t())
    ## output the averaged similarity
    return (logits_per_text[0].item() + logits_per_audio[0].item()) / 2


def main():
    ## initialize the model and processor
    model_path = "sarulab-speech/human-clap-wsce-mae"
    processor_path = "laion/clap-htsat-fused"
    model = ClapModel.from_pretrained(model_path).to(0)
    processor = ClapProcessor.from_pretrained(processor_path)
    ## calculate CLAPScore
    text = "Text input is written here"
    audiopath = "/path/to/wav/file.wav"
    similarity = CLAPScore(
        text=text, audiopath=audiopath, model=model, processor=processor
    )
    print(similarity)


if __name__ == "__main__":
    main()
```

The returned CLAPScore is the cosine similarity between the normalized text and audio embeddings, so it lies in [-1, 1].

Train Human-CLAP by fine-tuning `laion/clap-htsat-fused`.
- Setup
```bash
git clone https://github.com/sarulab-speech/human-clap/
cd human-clap
pip install .
```

- Train
- Select options and execute `./src/clap_finetune/train.py`. See `python src/clap_finetune/train.py --help` for the available options.
- An example of training conditions can be seen in `./train_model.sh`:

```bash
chmod +x train_model.sh
./train_model.sh
```

The training data uses the same webdataset format as https://github.com/LAION-AI/CLAP.
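For reference, here is a minimal sketch of packing text-audio pairs into that webdataset format. It assumes the `webdataset` package is installed; the `(path, caption)` pairs are placeholders, and the file extensions and the `"text"` json key follow the LAION-CLAP convention but should be verified against that repository:

```python
import json

import webdataset as wds

## hypothetical (audio path, caption) pairs; replace with your own data
pairs = [
    ("/path/to/audio_000.wav", "A dog barks in the distance"),
    ("/path/to/audio_001.wav", "Rain falls on a tin roof"),
]

## pack each pair into a tar shard: one <key>.wav and one <key>.json per sample,
## where the json carries the caption under a "text" field
with wds.TarWriter("train-000000.tar") as sink:
    for i, (wavpath, caption) in enumerate(pairs):
        with open(wavpath, "rb") as f:
            wav_bytes = f.read()
        sink.write(
            {
                "__key__": f"{i:06d}",
                "wav": wav_bytes,
                "json": json.dumps({"text": caption}).encode("utf-8"),
            }
        )
```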
```bibtex
@article{Takano_APSIPA2025_01,
  author={Taisei Takano and Yuki Okamoto and Yusuke Kanamori and Yuki Saito and Ryotaro Nagase and Hiroshi Saruwatari},
  title={Human-CLAP: Human-perception-based contrastive language-audio pretraining},
  journal={Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
  year={2025},
  pages={131--136},
}
```
This work was supported by JSPS KAKENHI Grant Number 24K23880, ROIS NII Open Collaborative Research 2024-(24S0504), and JST Moonshot Grant Number JPMJMS2237.
