Human-CLAP is a human-perception-based CLAP model fine-tuned with human-scored text-audio similarity. Fine-tuning improves the correlation between CLAPScore and human-scored text-audio similarity, so that CLAPScore aligns more closely with human perception.

Our Human-CLAP models can be used in the same way as the CLAP models available on HuggingFace Transformers.
- Trained Human-CLAP models can be found on HuggingFace
- Install

Recommended Python version: 3.11

```bash
pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install transformers soundfile
```
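As an optional sanity check (not part of the original instructions), the snippet below confirms that the installed packages import and that a CUDA build of PyTorch is usable; the prediction example that follows moves the model to GPU 0:

```python
import torch
import transformers

## the prediction example below runs on GPU 0, so CUDA must be available
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
```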
- Predict CLAPScore

Prediction example from `cal_clap.py`:

- Load the model from `sarulab-speech/human-clap-wsce-mae`
- Load the processor and the tokenizer from `laion/clap-htsat-fused`
```python
import torch
import torchaudio
from transformers import ClapModel, ClapProcessor


## util codes
def get_audio(audiopath):
    audio, sr = torchaudio.load(audiopath)
    audio = audio[0]  # use the first channel
    if sr != 48000:  # CLAP expects 48 kHz input
        resampler = torchaudio.transforms.Resample(sr, 48000)
        audio = resampler(audio)
    audio = audio.detach().numpy().copy()
    return audio


## calculate CLAPScore
def CLAPScore(text, audiopath, model, processor):
    ## get audio
    audio = get_audio(audiopath=audiopath)
    ## preprocess the text and audio
    inputs = processor(
        text=text, audios=audio, return_tensors="pt", sampling_rate=48000, padding=True
    ).to(0)
    ## predict
    outputs = model(**inputs)
    ## normalize the embeddings
    text_embeds = outputs.text_embeds
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    audio_embeds = outputs.audio_embeds
    audio_embeds = audio_embeds / audio_embeds.norm(dim=-1, keepdim=True)
    logits_per_text = torch.matmul(text_embeds, audio_embeds.t())
    logits_per_audio = torch.matmul(audio_embeds, text_embeds.t())
    ## output the averaged similarity
    return (logits_per_text[0].item() + logits_per_audio[0].item()) / 2


def main():
    ## initialize the model and processor
    model_path = "sarulab-speech/human-clap-wsce-mae"
    processor_path = "laion/clap-htsat-fused"
    model = ClapModel.from_pretrained(model_path).to(0)
    processor = ClapProcessor.from_pretrained(processor_path)
    ## calculate CLAPScore
    text = "Text input is written here"
    audiopath = "/path/to/wav/file.wav"
    similarity = CLAPScore(
        text=text, audiopath=audiopath, model=model, processor=processor
    )
    print(similarity)


if __name__ == "__main__":
    main()
```

The returned CLAPScore is the cosine similarity between the normalized text and audio embeddings, so it lies in [-1, 1].

Train Human-CLAP by fine-tuning `laion/clap-htsat-fused`.
- Setup
```bash
git clone https://github.com/sarulab-speech/human-clap/
cd human-clap
pip install .
```

- Train
- Select options and execute `./src/clap_finetune/train.py`. See `python src/clap_finetune/train.py --help` for the available options.
- An example of training conditions can be seen in `./train_model.sh`:

```bash
chmod +x train_model.sh
./train_model.sh
```

The training data uses the same webdataset format as https://github.com/LAION-AI/CLAP.
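For reference, here is a minimal sketch of packing text-audio pairs into that webdataset format. It assumes the `webdataset` package is installed; the `(path, caption)` pairs are placeholders, and the file extensions and the `"text"` json key follow the LAION-CLAP convention but should be verified against that repository:

```python
import json

import webdataset as wds

## hypothetical (audio path, caption) pairs; replace with your own data
pairs = [
    ("/path/to/audio_000.wav", "A dog barks in the distance"),
    ("/path/to/audio_001.wav", "Rain falls on a tin roof"),
]

## pack each pair into a tar shard: one <key>.wav and one <key>.json per sample,
## where the json carries the caption under a "text" field
with wds.TarWriter("train-000000.tar") as sink:
    for i, (wavpath, caption) in enumerate(pairs):
        with open(wavpath, "rb") as f:
            wav_bytes = f.read()
        sink.write(
            {
                "__key__": f"{i:06d}",
                "wav": wav_bytes,
                "json": json.dumps({"text": caption}).encode("utf-8"),
            }
        )
```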
```bibtex
@article{Takano_APSIPA2025_01,
  author={Taisei Takano and Yuki Okamoto and Yusuke Kanamori and Yuki Saito and Ryotaro Nagase and Hiroshi Saruwatari},
  title={Human-CLAP: Human-perception-based contrastive language-audio pretraining},
  journal={Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
  year={2025},
  pages={131--136},
}
```
This work was supported by JSPS KAKENHI Grant Number 24K23880, ROIS NII Open Collaborative Research 2024-(24S0504), and JST Moonshot Grant Number JPMJMS2237.
