Human-CLAP: Human-perception-based contrastive language–audio pretraining

Human-CLAP is a CLAP model fine-tuned with human-scored text-audio similarity. Fine-tuning on human scores improves the correlation between CLAPScore and human-rated text-audio similarity, so that CLAPScore aligns more closely with human perception.

(Figure: Overview of Human-CLAP)

Models

Our Human-CLAP models can be used in the same way as the CLAP model available through HuggingFace Transformers.
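
A minimal loading sketch, assuming the fine-tuned checkpoint published as sarulab-speech/human-clap-wsce-mae (the model id used in the prediction example below) together with the unchanged laion/clap-htsat-fused processor:

from transformers import ClapModel, ClapProcessor

# Load the fine-tuned Human-CLAP weights with the standard CLAP classes.
model = ClapModel.from_pretrained("sarulab-speech/human-clap-wsce-mae")
# The processor (tokenizer + feature extractor) is reused from the base model.
processor = ClapProcessor.from_pretrained("laion/clap-htsat-fused")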

Usage

  1. Install

Recommended python version: 3.11

pip install torch torchaudio --index-url https://download.pytorch.org/whl/cu124
pip install transformers soundfile
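
A quick sanity check after installation; a minimal sketch (the CUDA build is only needed if you run the model on a GPU):

import torch
import transformers

# Confirm both packages import and report whether a GPU build of PyTorch is active.
print("torch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
print("transformers:", transformers.__version__)
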
  2. Predict CLAPScore

Prediction example from cal_clap.py:
import torchaudio
import torch
from transformers import ClapModel, ClapProcessor


## util codes
def get_audio(audiopath):
    audio, sr = torchaudio.load(audiopath)
    audio = audio[0]  # use the first channel
    if sr != 48000:
        # CLAP expects 48 kHz audio; resample when the file uses another rate
        resampler = torchaudio.transforms.Resample(sr, 48000)
        audio = resampler(audio)

    audio = audio.detach().numpy().copy()
    return audio


## calculate CLAPScore
def CLAPScore(text, audiopath, model, processor):
    ## get audio
    audio = get_audio(audiopath=audiopath)

    ## preprocess text and audio (inputs are moved to GPU 0, matching the model)
    inputs = processor(
        text=text, audios=audio, return_tensors="pt", sampling_rate=48000, padding=True
    ).to(0)

    ## predict
    outputs = model(**inputs)

    ## normalize
    text_embeds = outputs.text_embeds
    text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
    audio_embeds = outputs.audio_embeds
    audio_embeds = audio_embeds / audio_embeds.norm(dim=-1, keepdim=True)

    logits_per_text = torch.matmul(text_embeds, audio_embeds.t())
    logits_per_audio = torch.matmul(audio_embeds, text_embeds.t())

    ## output
    return (logits_per_text[0].item() + logits_per_audio[0].item()) / 2


def main():
    ## initialize model
    model_path = "sarulab-speech/human-clap-wsce-mae"
    processor_path = "laion/clap-htsat-fused"
    model = ClapModel.from_pretrained(model_path).to(0)
    processor = ClapProcessor.from_pretrained(processor_path)

    ## calculate CLAPScore
    text = "Text input is written here"
    audiopath = "/path/to/wav/file.wav"
    similarity = CLAPScore(
        text=text, audiopath=audiopath, model=model, processor=processor
    )
    print(similarity)


if __name__ == "__main__":
    main()
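
The same model can also score several candidate captions against one audio clip in a single forward pass. A minimal sketch reusing the helpers above (get_audio, and the model/processor initialized as in main); since the embeddings are L2-normalized, each score is a cosine similarity in [-1, 1]:

def CLAPScore_multi(texts, audiopath, model, processor):
    ## load the audio once and score every caption against it
    audio = get_audio(audiopath=audiopath)
    inputs = processor(
        text=texts, audios=audio, return_tensors="pt", sampling_rate=48000, padding=True
    ).to(0)
    outputs = model(**inputs)

    ## cosine similarity between each text embedding and the single audio embedding
    text_embeds = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)
    audio_embeds = outputs.audio_embeds / outputs.audio_embeds.norm(dim=-1, keepdim=True)
    scores = torch.matmul(text_embeds, audio_embeds.t()).squeeze(-1)
    return scores.tolist()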

Train Human-CLAP

Train Human-CLAP by fine-tuning laion/clap-htsat-fused.

  1. Setup
git clone https://github.com/sarulab-speech/human-clap/
cd human-clap
pip install .
  2. Train
  • Select options and run ./src/clap_finetune/train.py
    • An example training configuration can be found in ./train_model.sh
    • See python src/clap_finetune/train.py --help for the available options
chmod +x train_model.sh
./train_model.sh

Dataset format

The dataset uses the same WebDataset format as https://github.com/LAION-AI/CLAP.
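
A minimal sketch of how such a shard might be assembled, assuming the LAION-CLAP convention of one .flac plus one .json (with a "text" caption field) per sample and the webdataset package for writing tar shards; check the LAION-AI/CLAP repository for the authoritative layout:

import io
import json

import soundfile as sf
import webdataset as wds

## write one example into a shard using the (assumed) layout: key.flac + key.json
with wds.TarWriter("shard-000000.tar") as sink:
    audio, sr = sf.read("/path/to/wav/file.wav")

    ## encode the audio as FLAC bytes
    buf = io.BytesIO()
    sf.write(buf, audio, sr, format="FLAC")

    sink.write(
        {
            "__key__": "sample_000000",
            "flac": buf.getvalue(),
            "json": json.dumps({"text": "Text input is written here"}).encode("utf-8"),
        }
    )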

Citation

@article{Takano_APSIPA2025_01,
  author={Taisei Takano and Yuki Okamoto and Yusuke Kanamori and Yuki Saito and Ryotaro Nagase and Hiroshi Saruwatari},
  title={Human-CLAP: Human-perception-based contrastive language-audio pretraining},
  journal={Proc. Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC)},
  year={2025},
  pages={131--136},
}

Acknowledgements

This work was supported by JSPS KAKENHI Grant Number 24K23880, ROIS NII Open Collaborative Research 2024-(24S0504), and JST Moonshot Grant Number JPMJMS2237.
