
wav2caption


Audio → instrument-aware caption for AI music generation. A Suno / Udio / ACE-Step prompt generator that describes what is playing and how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).

Point it at a WAV/MP3/FLAC file and get back a structured analysis and a ready-to-paste prompt for ACE-Step, Suno, Udio, or any other prompt-conditioned music model.

Disclaimer: This is an independent third-party tool. It is not affiliated with, endorsed by, or sponsored by Suno, Udio, ACE-Step, Essentia, MTG-Jamendo, or Discogs. Those names appear nominatively to identify the downstream prompt formats and upstream models / datasets this tool integrates with. Bundled model weights inherit their original CC-BY / Apache-2.0 licenses; users are responsible for verifying that audio inputs they analyse are properly licensed.

A typical caption looks like:

live drums, electric guitar, piano, bass, string section, brass section,
D major, 140 BPM, dynamic build-up, breakdown section

Under the hood it combines Essentia's TensorFlow graphs (MTG-Jamendo 40-class instrument head + Discogs-EffNet embeddings) with classical MIR features (BPM, key, loudness, spectral centroid, pitch range) and a small role taxonomy, so the caption describes both what is playing and how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).


Why this exists

Most "audio → tag" tools stop at a flat list of instruments. When you feed that into a prompt-conditioned music model, the arrangement gets lost — instruments are named but their role is missing, and dynamics are dropped entirely. wav2caption was factored out of a production pipeline that captioned hundreds of reference tracks for ACE-Step Lego-mode generation, and it keeps two things other tools don't:

  • Role grouping. drums and bass are not just instruments; they are the rhythm and bass roles. A section that also has strings + brass gets tagged as "string section, brass section" rather than five indistinguishable labels.
  • Section features. Per-window loudness, centroid, and pitch-range give you "quiet (breakdown/interlude)", "peak energy (chorus/climax)", "staccato stabs", "metallic percussion accents" — the kind of descriptors music LLMs actually condition on.
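The role grouping above can be sketched in a few lines. This is a standalone illustration, not wav2caption's exact aggregation rule: the ROLE_MAP here is abbreviated, and the max-over-members pooling is an assumption (the real role scores appear to blend member probabilities rather than take a plain max).

```python
# Standalone sketch: fold per-instrument probabilities into role scores.
# ROLE_MAP entries are abbreviated and the pooling rule is illustrative.
ROLE_MAP = {
    "rhythm": ("drums", "drummachine", "percussion"),
    "bass": ("bass", "doublebass"),
    "strings": ("strings", "violin", "cello"),
}

def role_scores(instrument_probs: dict[str, float]) -> dict[str, float]:
    """Score each role by its strongest member instrument."""
    return {
        role: max((instrument_probs.get(i, 0.0) for i in members), default=0.0)
        for role, members in ROLE_MAP.items()
    }

probs = {"drums": 0.40, "bass": 0.29, "violin": 0.13}
print(role_scores(probs))  # {'rhythm': 0.4, 'bass': 0.29, 'strings': 0.13}
```

The payoff is that "drums 0.40 + violin 0.13" becomes "rhythm + strings", which is what the downstream caption conditions on.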

Install

pip install wav2caption
# Then opt in to the (AGPL-3.0) Essentia runtime — required for analysis.
pip install "wav2caption[essentia]"

Essentia is distributed under AGPL-3.0 (or a commercial license from MTG-UPF). If you ship a network service built on wav2caption, you may need to release your source under AGPL-3.0 or buy a commercial license. The wav2caption code itself is Apache-2.0.

Models

The pretrained weights are not bundled (they are licensed CC-BY-NC-SA 4.0, which excludes commercial use). Download them once, then verify the SHA-256 digests:

mkdir -p ~/.cache/wav2caption/models
cd ~/.cache/wav2caption/models
curl -LO https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb
curl -LO https://essentia.upf.edu/models/classification-heads/mtg_jamendo_instrument/mtg_jamendo_instrument-discogs-effnet-1.pb

# Captured 2026-04-18 against https://essentia.upf.edu/models/
sha256sum -c <<'EOF'
3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1  discogs-effnet-bs64-1.pb
2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a  mtg_jamendo_instrument-discogs-effnet-1.pb
EOF

The same check is available programmatically:

from wav2caption import resolve_models, verify_digests
verify_digests(resolve_models())

or automatically on every analyze(...) call by setting WAV2CAPTION_VERIFY_DIGESTS=1 in your environment.
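If you prefer to pin digests yourself, the check amounts to a streamed SHA-256 compare. A minimal standalone sketch, where the helper names are illustrative rather than wav2caption's internals:

```python
# Standalone sketch of streamed SHA-256 verification for downloaded models.
# sha256_of() and verify() are illustrative names, not wav2caption's API.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(expected: dict[str, str], models_dir: Path) -> None:
    """Raise if any file is missing or its digest does not match."""
    for name, digest in expected.items():
        actual = sha256_of(models_dir / name)
        if actual != digest:
            raise ValueError(f"{name}: expected {digest}, got {actual}")
```

Streaming in chunks keeps memory flat even for large .pb graphs.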

⚠️ Supply-chain note. The .pb files are TensorFlow GraphDefs and a maliciously crafted graph can influence what runs inside Essentia. Always download over HTTPS from essentia.upf.edu and verify the digests before first load.

Alternatively, point WAV2CAPTION_MODELS_DIR (or the --models-dir flag) at a folder that already contains the models.

Quick start

CLI

wav2caption song.wav
wav2caption song.wav --json > analysis.json
wav2caption song.wav --section-seconds 5

Example output

On a 3:32 record-grand-prix reference instrumental, wav2caption song.wav produces:

=== song.wav ===
duration: 3:32  tempo: 132.9 BPM  key: Eb major (conf 0.87)  danceability: 1.10

[ detected instruments ]
  drums                0.402  ################
  electricguitar       0.308  ############
  bass                 0.286  ###########
  guitar               0.274  ##########
  piano                0.222  ########
  acousticguitar       0.177  #######
  synthesizer          0.176  #######
  violin               0.126  #####
  ...

[ role scores ]
  rhythm             0.468
  acoustic_guitar    0.450
  harmony            0.377
  lead_guitar        0.308
  bass               0.286
  strings            0.219
  synth              0.176
  brass              0.118
  vocal              0.067
  woodwind           0.061

[ sections ]
  0:20-0:30  loud=1301  bright=1019Hz  Eb major
    roles: rhythm=drums(0.44) / lead_guitar=electricguitar(0.37) / bass=bass(0.34) / ...
    features: metallic percussion accents, string harmonies, brass accents
  0:30-0:40  loud=1224  bright=1278Hz  Eb major
    roles: rhythm=drums(0.38) / lead_guitar=electricguitar(0.31) / bass=bass(0.29) / ...
    features: metallic percussion accents, staccato stabs

[ caption ]
  live drums, electric guitar, piano, bass, string section, acoustic guitar,
  Eb major, 133 BPM, dynamic build-up, breakdown section

Python

from wav2caption import analyze, build_caption

result = analyze("song.wav")
print(build_caption(result))

for s in result.sections:
    roles = {r: name for r, (name, _score) in s.roles.items()}
    print(f"{s.start:>5.1f}s  {roles}  {s.features}")

AnalysisResult is a typed dataclass:

@dataclass
class AnalysisResult:
    path: Path
    duration_sec: float
    bpm: float
    key: str
    scale: str  # "major" | "minor"
    key_confidence: float
    danceability: float
    detected_instruments: list[tuple[str, float]]   # (label, probability)
    role_scores: dict[str, float]                   # aggregated per role
    sections: list[Section]
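Because AnalysisResult is a plain dataclass, dataclasses.asdict plus json.dumps gives you a JSON payload similar in spirit to the CLI's --json output (the exact CLI schema is not guaranteed to match). A standalone sketch with made-up values and a trimmed field set:

```python
# Standalone sketch: serialise an AnalysisResult-shaped dataclass to JSON.
# The field set is trimmed and the values are made up for illustration.
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path

@dataclass
class AnalysisResult:
    path: Path
    bpm: float
    key: str
    detected_instruments: list[tuple[str, float]] = field(default_factory=list)

result = AnalysisResult(Path("song.wav"), 132.9, "Eb",
                        [("drums", 0.40), ("bass", 0.29)])
# Path objects are not JSON-serialisable; default=str covers them.
payload = json.dumps(asdict(result), default=str)
print(payload)
```

Note that asdict() recurses into nested containers, so the (label, probability) tuples come back as JSON arrays.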

Role taxonomy

  role              instruments
  ----              -----------
  rhythm            drums, drummachine, beat, percussion, bongo
  bass              bass, acousticbassguitar, doublebass
  harmony           piano, electricpiano, keyboard, rhodes, organ, pipeorgan, accordion
  lead_guitar       electricguitar
  acoustic_guitar   acousticguitar, classicalguitar, guitar
  strings           strings, violin, viola, cello, orchestra
  brass             brass, trumpet, trombone, horn, saxophone
  woodwind          flute, clarinet, oboe
  synth             synthesizer, pad, sampler, computer
  bells             bell, harp, harmonica
  vocal             voice

The mapping is intentionally opinionated and biased toward production arrangement labels rather than strict orchestration (e.g. guitar goes to acoustic_guitar because the MTG-Jamendo label is ambiguous and the acoustic interpretation is safer for caption conditioning). Override ROLE_MAP if you disagree — it's just a dict[str, tuple[str, ...]].
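For illustration, here is a standalone sketch of the ROLE_MAP shape and a get_role()-style lookup. The entries are abbreviated and the helper is hypothetical; see src/wav2caption/constants.py for the real definitions.

```python
# Standalone sketch of the ROLE_MAP shape (dict[str, tuple[str, ...]])
# and a get_role()-style lookup. Abbreviated and illustrative only.
ROLE_MAP: dict[str, tuple[str, ...]] = {
    "rhythm": ("drums", "drummachine", "percussion"),
    "acoustic_guitar": ("acousticguitar", "classicalguitar", "guitar"),
    "lead_guitar": ("electricguitar",),
}

# Disagree with the ambiguous 'guitar' -> acoustic mapping? Reassign it:
ROLE_MAP["acoustic_guitar"] = ("acousticguitar", "classicalguitar")
ROLE_MAP["lead_guitar"] = ("electricguitar", "guitar")

def get_role(instrument: str) -> "str | None":
    """Return the first role whose member tuple contains the instrument."""
    for role, members in ROLE_MAP.items():
        if instrument in members:
            return role
    return None

print(get_role("guitar"))  # lead_guitar, after the override above
```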

Project layout

src/wav2caption/
    __init__.py       # public API
    analyzer.py       # analyze() + build_caption() + dataclasses
    constants.py      # INSTRUMENTS, ROLE_MAP, get_role()
    models.py         # model-path discovery
    cli.py            # wav2caption console script
tests/                # no-Essentia unit tests

Development

git clone https://github.com/hinanohart/wav2caption
cd wav2caption
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
ruff check .
mypy src

The unit tests intentionally do not require Essentia, so CI stays fast and free of TensorFlow. Real-audio smoke tests belong in examples/.

License

  • Source code: Apache 2.0 (see LICENSE).
  • Runtime dep Essentia: AGPL-3.0 (opt-in via pip install "wav2caption[essentia]").
  • Pretrained models: CC-BY-NC-SA 4.0 (user-downloaded, non-commercial).

Full third-party notices: NOTICE.md.

If you need a commercial pipeline you will have to either license Essentia from MTG-UPF or swap in a different backend. The Apache-2.0-licensed code in this repo is backend-agnostic enough that a torch / onnxruntime port is straightforward — PRs welcome.
