Audio → instrument-aware caption for AI music generation. Suno / Udio / ACE-Step prompt generator that describes what is playing *and how it is used* (rhythm / bass / harmony / lead / strings / brass / synth / vocal).
Point it at a WAV/MP3/FLAC file and get back a structured analysis and a ready-to-paste prompt for ACE-Step, Suno, Udio, or any other prompt-conditioned music model.
Disclaimer: This is an independent third-party tool. It is not affiliated with, endorsed by, or sponsored by Suno, Udio, ACE-Step, Essentia, MTG-Jamendo, or Discogs. Those names appear nominatively to identify the downstream prompt formats and upstream models / datasets this tool integrates with. Bundled model weights inherit their original CC-BY / Apache-2.0 licenses; users are responsible for verifying that audio inputs they analyse are properly licensed.
live drums, electric guitar, piano, bass, string section, brass section,
D major, 140 BPM, dynamic build-up, breakdown section
Under the hood it combines Essentia's TensorFlow graphs (MTG-Jamendo 40-class instrument head + Discogs-EffNet embeddings) with classical MIR features (BPM, key, loudness, spectral centroid, pitch range) and a small role taxonomy, so the caption describes both what is playing and how it is used (rhythm / bass / harmony / lead / strings / brass / synth / vocal).
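A rough sketch of how those Essentia pieces typically fit together (this mirrors Essentia's published model-usage examples, not wav2caption's exact internals; the graph filenames match the download step further down, everything else is illustrative):

```python
# Illustrative only: Discogs-EffNet embeddings -> MTG-Jamendo 40-class instrument
# head, plus classical tempo/key descriptors. wav2caption's analyze() wraps the
# equivalent steps and adds sectioning + the role taxonomy on top.
from essentia.standard import (
    KeyExtractor,
    MonoLoader,
    RhythmExtractor2013,
    TensorflowPredict2D,
    TensorflowPredictEffnetDiscogs,
)

audio_16k = MonoLoader(filename="song.wav", sampleRate=16000)()  # EffNet expects 16 kHz mono
audio_44k = MonoLoader(filename="song.wav", sampleRate=44100)()  # rhythm/key features at 44.1 kHz

embeddings = TensorflowPredictEffnetDiscogs(
    graphFilename="discogs-effnet-bs64-1.pb", output="PartitionedCall:1"
)(audio_16k)
instrument_probs = TensorflowPredict2D(
    graphFilename="mtg_jamendo_instrument-discogs-effnet-1.pb"
)(embeddings)  # one probability row per patch, 40 instrument classes

bpm, _, beat_conf, _, _ = RhythmExtractor2013(method="multifeature")(audio_44k)
key, scale, key_strength = KeyExtractor()(audio_44k)
```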
Most "audio → tag" tools stop at a flat list of instruments. When you feed
that into a prompt-conditioned music model, the arrangement gets lost —
instruments are named but their role is missing, and dynamics are dropped
entirely. wav2caption was factored out of a production pipeline that
captioned hundreds of reference tracks for ACE-Step Lego-mode generation, and
it keeps two things other tools don't:
- Role grouping. `drums` and `bass` are not just instruments; they are the rhythm and bass roles. A section that also has `strings` + `brass` gets tagged as "string section, brass section" rather than five indistinguishable labels (sketched right after this list).
- Section features. Per-window loudness, centroid, and pitch-range give you "quiet (breakdown/interlude)", "peak energy (chorus/climax)", "staccato stabs", "metallic percussion accents" — the kind of descriptors music LLMs actually condition on.
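A minimal sketch of that grouping step (the trimmed-down mapping and the max-over-members aggregation are illustrative assumptions; the shipped `ROLE_MAP` in `constants.py` is the full table further down, and `analyzer.py` may aggregate differently):

```python
# Illustrative: collapse per-instrument probabilities into per-role scores.
ROLE_MAP_SUBSET = {
    "rhythm": ("drums", "drummachine", "percussion"),
    "bass": ("bass", "doublebass"),
    "strings": ("strings", "violin", "cello"),
    "brass": ("brass", "trumpet", "trombone"),
}


def role_scores(instrument_probs: dict[str, float]) -> dict[str, float]:
    # Take the strongest member of each role as that role's score.
    return {
        role: max((instrument_probs.get(label, 0.0) for label in members), default=0.0)
        for role, members in ROLE_MAP_SUBSET.items()
    }


print(role_scores({"drums": 0.44, "bass": 0.34, "violin": 0.21, "trumpet": 0.12}))
# -> {'rhythm': 0.44, 'bass': 0.34, 'strings': 0.21, 'brass': 0.12}
```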
pip install wav2caption
# Then opt in to the (AGPL-3.0) Essentia runtime — required for analysis.
pip install "wav2caption[essentia]"Essentia is distributed under AGPL-3.0 (or a commercial license from MTG-UPF). If you ship a network service built on
wav2caption, you may need to release your source under AGPL-3.0 or buy a commercial license. Thewav2captioncode itself is Apache-2.0.
The pretrained weights are not bundled (they are CC-BY-NC-SA 4.0 and non-commercial). Download them once, then verify the SHA-256 digests:
mkdir -p ~/.cache/wav2caption/models
cd ~/.cache/wav2caption/models
curl -LO https://essentia.upf.edu/models/feature-extractors/discogs-effnet/discogs-effnet-bs64-1.pb
curl -LO https://essentia.upf.edu/models/classification-heads/mtg_jamendo_instrument/mtg_jamendo_instrument-discogs-effnet-1.pb
# Captured 2026-04-18 against https://essentia.upf.edu/models/
sha256sum -c <<'EOF'
3ed9af50d5367c0b9c795b294b00e7599e4943244f4cbd376869f3bfc87721b1 discogs-effnet-bs64-1.pb
2e8c3003c722e098da371b6a1f7ad0ce62fac0dcfc09c7c7997d430941196c2a mtg_jamendo_instrument-discogs-effnet-1.pb
EOF

The same check is available programmatically:
from wav2caption import resolve_models, verify_digests
verify_digests(resolve_models())

or automatically on every `analyze(...)` call by setting
`WAV2CAPTION_VERIFY_DIGESTS=1` in your environment.
⚠️ Supply-chain note. The `.pb` files are TensorFlow GraphDefs and a maliciously crafted graph can influence what runs inside Essentia. Always download over HTTPS from essentia.upf.edu and verify the digests before first load.
Or point `WAV2CAPTION_MODELS_DIR` (or `--models-dir`) at an existing folder.
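For example (how the flag and the environment variables interact is not spelled out above, so treat the combination as illustrative):

```bash
# Keep the graphs outside the default cache and verify digests on every analysis.
export WAV2CAPTION_MODELS_DIR=/opt/wav2caption-models
export WAV2CAPTION_VERIFY_DIGESTS=1
wav2caption song.wav

# Or point a single invocation at a different folder.
wav2caption song.wav --models-dir ./models
```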
wav2caption song.wav
wav2caption song.wav --json > analysis.json
wav2caption song.wav --section-seconds 5

On a 3:32 record-grand-prix reference instrumental, wav2caption song.wav produces:
=== song.wav ===
duration: 3:32 tempo: 132.9 BPM key: Eb major (conf 0.87) danceability: 1.10
[ detected instruments ]
drums 0.402 ################
electricguitar 0.308 ############
bass 0.286 ###########
guitar 0.274 ##########
piano 0.222 ########
acousticguitar 0.177 #######
synthesizer 0.176 #######
violin 0.126 #####
...
[ role scores ]
rhythm 0.468
acoustic_guitar 0.450
harmony 0.377
lead_guitar 0.308
bass 0.286
strings 0.219
synth 0.176
brass 0.118
vocal 0.067
woodwind 0.061
[ sections ]
0:20-0:30 loud=1301 bright=1019Hz Eb major
roles: rhythm=drums(0.44) / lead_guitar=electricguitar(0.37) / bass=bass(0.34) / ...
features: metallic percussion accents, string harmonies, brass accents
0:30-0:40 loud=1224 bright=1278Hz Eb major
roles: rhythm=drums(0.38) / lead_guitar=electricguitar(0.31) / bass=bass(0.29) / ...
features: metallic percussion accents, staccato stabs
[ caption ]
live drums, electric guitar, piano, bass, string section, acoustic guitar,
Eb major, 133 BPM, dynamic build-up, breakdown section
from wav2caption import analyze, build_caption
result = analyze("song.wav")
print(build_caption(result))
for s in result.sections:
    roles = {r: name for r, (name, _score) in s.roles.items()}
    print(f"{s.start:>5.1f}s {roles} {s.features}")

`AnalysisResult` is a typed dataclass:
@dataclass
class AnalysisResult:
    path: Path
    duration_sec: float
    bpm: float
    key: str
    scale: str  # "major" | "minor"
    key_confidence: float
    danceability: float
    detected_instruments: list[tuple[str, float]]  # (label, probability)
    role_scores: dict[str, float]  # aggregated per role
    sections: list[Section]
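Section itself is not reproduced here; inferred from the per-section CLI output and the loop above, it carries roughly the following fields (names are guesses, not the shipped definition):

```python
@dataclass
class Section:                               # inferred sketch only
    start: float                             # window start in seconds (s.start above)
    end: float
    loudness: float                          # "loud=1301" in the CLI output
    centroid_hz: float                       # "bright=1019Hz"
    key: str                                 # per-window estimate, e.g. "Eb major"
    roles: dict[str, tuple[str, float]]      # role -> (instrument, score), as in s.roles
    features: list[str]                      # e.g. "staccato stabs"
```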

The role taxonomy maps MTG-Jamendo instrument labels to arrangement roles:

| role | instruments |
|---|---|
| rhythm | drums, drummachine, beat, percussion, bongo |
| bass | bass, acousticbassguitar, doublebass |
| harmony | piano, electricpiano, keyboard, rhodes, organ, pipeorgan, accordion |
| lead_guitar | electricguitar |
| acoustic_guitar | acousticguitar, classicalguitar, guitar |
| strings | strings, violin, viola, cello, orchestra |
| brass | brass, trumpet, trombone, horn, saxophone |
| woodwind | flute, clarinet, oboe |
| synth | synthesizer, pad, sampler, computer |
| bells | bell, harp, harmonica |
| vocal | voice |
The mapping is intentionally opinionated and biased toward production
arrangement labels rather than strict orchestration (e.g. `guitar` goes to
`acoustic_guitar` because the MTG-Jamendo label is ambiguous and the
acoustic interpretation is safer for caption conditioning). Override
`ROLE_MAP` if you disagree — it's just a `dict[str, tuple[str, ...]]`.
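For example (this assumes analyze() reads the module-level mapping at call time, which is an assumption on my part):

```python
# Move the ambiguous "guitar" label from acoustic_guitar to lead_guitar.
# The values are tuples, so build new ones rather than mutating in place.
from wav2caption.constants import ROLE_MAP

ROLE_MAP["lead_guitar"] = ROLE_MAP["lead_guitar"] + ("guitar",)
ROLE_MAP["acoustic_guitar"] = tuple(
    label for label in ROLE_MAP["acoustic_guitar"] if label != "guitar"
)
```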
src/wav2caption/
__init__.py # public API
analyzer.py # analyze() + build_caption() + dataclasses
constants.py # INSTRUMENTS, ROLE_MAP, get_role()
models.py # model-path discovery
cli.py # wav2caption console script
tests/ # no-Essentia unit tests
git clone https://github.com/hinanohart/wav2caption
cd wav2caption
python -m venv .venv && source .venv/bin/activate
pip install -e ".[dev]"
pytest
ruff check .
mypy src

The unit tests intentionally do not require Essentia, so CI stays fast
and free of TensorFlow. Real-audio smoke tests belong in examples/.
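A test in that spirit might look like this (the shapes of `INSTRUMENTS` and `ROLE_MAP` are assumed from the `constants.py` description, so adjust to the real definitions):

```python
# tests/test_role_map.py: sketch; needs neither Essentia nor TensorFlow.
from wav2caption.constants import INSTRUMENTS, ROLE_MAP


def test_role_map_only_references_known_labels():
    known = set(INSTRUMENTS)
    for role, members in ROLE_MAP.items():
        for label in members:
            assert label in known, f"{role!r} maps to unknown label {label!r}"
```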
- Source code: Apache 2.0 (see LICENSE).
- Runtime dep Essentia: AGPL-3.0 (opt-in via `pip install "wav2caption[essentia]"`).
- Pretrained models: CC-BY-NC-SA 4.0 (user-downloaded, non-commercial).
Full third-party notices: NOTICE.md.
If you need a commercial pipeline you will have to either license Essentia
from MTG-UPF or swap in a different backend. The Apache-2.0-licensed code
in this repo is backend-agnostic enough that a torch / onnxruntime
port is straightforward — PRs welcome.
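A hypothetical shape for such a backend (nothing like this interface exists in the repo today; it only illustrates what a torch / onnxruntime swap would have to provide):

```python
from typing import Protocol

import numpy as np


class InstrumentBackend(Protocol):
    """Anything that turns mono audio into per-label probabilities."""

    labels: tuple[str, ...]  # e.g. the 40 MTG-Jamendo instrument names

    def predict(self, audio: np.ndarray, sample_rate: int) -> np.ndarray:
        """Return an (n_labels,) probability vector for one analysis window."""
        ...
```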