Synthetic Hebrew audio dataset for the Audio Violence Detection Pipeline (AVDP), generated by the SynthBanshee pipeline.
This is a data-only repository. It contains no application code. All pipeline logic, configuration, documentation, and tests live in SynthBanshee.
If you are a Claude Code agent or AI assistant: read CLAUDE.md before making any changes. Key rules: never rename/modify/delete files in assets/; never edit .wav files by hand; always update DELIVERIES.md when adding clips; never drop has_violence from metadata or manifests.
AVDP is an AI safety initiative run by DataHack with two downstream products:
- She-Proves — passively monitors a smartphone for domestic violence incidents and preserves audio evidence for legal use
- Elephant in the Room (הפיל שבחדר) — a Raspberry Pi–class device in clinic/welfare offices that alerts security when a social worker is under threat
The clips in this repository are synthetic (is_synthetic: true in all metadata). They are generated by a text-to-speech pipeline using Microsoft Azure Cognitive Services Hebrew neural voices. A real-data pipeline (actor recordings) is planned for a later phase; those recordings will live in a separate repository.
    assets/
    ├── speech/                # Per-utterance WAV cache, named by SHA-256 of the full rendered
    │   │                      # SSML string. Never modify or rename these files: SynthBanshee
    │   │                      # uses the hash as the cache key. Deleting a file forces a paid
    │   │                      # re-synthesis; adding a file with a wrong name silently breaks
    │   │                      # cache lookups.
    │   └── dirty/             # Pre-preprocessing ("dirty") WAV files, retained per spec.
    │                          # Named {clip_id}_dirty.wav, not by hash.
    └── scripts/               # Per-scene script generation cache, named by SHA-256 of all
                               # generation inputs. Same rules as assets/speech/.
    data/
    └── he/                    # Language code (ISO 639-1). All current clips are Hebrew.
        └── {speaker_id}/      # Speaker persona ID, e.g. agg_m_30-45_001
            ├── {clip_id}.wav    # 16 kHz, mono, 16-bit PCM WAV
            ├── {clip_id}.txt    # Per-turn transcript with onset/offset markers
            ├── {clip_id}.json   # ClipMetadata (weak labels, speaker info, is_synthetic, etc.)
            └── {clip_id}.jsonl  # Per-event EventLabel records (strong labels)

Every .wav must have a matching .txt, .json, and .jsonl. A clip without all four files is invalid and will be rejected by synthbanshee validate.
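
As a quick local pre-check of the four-file rule, something like the following Python sketch could be used. It is illustrative only and not part of SynthBanshee: the helper name find_incomplete_clips is hypothetical, and synthbanshee validate remains the authoritative check.

```python
from pathlib import Path

# Extensions every clip must ship with, per the four-file rule.
REQUIRED_SUFFIXES = (".wav", ".txt", ".json", ".jsonl")


def find_incomplete_clips(data_dir):
    """Map each incomplete clip (path without extension) to its missing suffixes.

    Hypothetical helper for illustration; it only checks file presence,
    not file contents.
    """
    stems = set()
    for path in Path(data_dir).rglob("*"):
        if path.is_file() and path.suffix in REQUIRED_SUFFIXES:
            stems.add(path.parent / path.stem)

    incomplete = {}
    for stem in sorted(stems):
        missing = [
            suffix
            for suffix in REQUIRED_SUFFIXES
            if not (stem.parent / (stem.name + suffix)).exists()
        ]
        if missing:
            incomplete[str(stem)] = missing
    return incomplete
```

Calling find_incomplete_clips("data/he") returns an empty dict when every clip has all four files.
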
- All filenames (and filesystem path components) are ASCII only, lowercase, no spaces.
- Format: {scene_id_lower}_{take_number:02d}, e.g. sp_it_a_0001_00. The same id appears uppercase in YAML scene_id.
- The on-disk speaker directory is speaker_id.lower() of the scene's first listed speaker. The speakers[].speaker_id value in the .json stays uppercase (AGG_M_30-45_001); only the directory name is lowercase (agg_m_30-45_001/).
- Single source of truth for per-surface casing rules: SynthBanshee docs/spec.md §2.5, Identifier casing (per surface).
- No Hebrew text in filenames or JSON keys/values; Hebrew belongs in .txt transcript files only.
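
The naming rules above can be sketched in Python as follows. This is a minimal illustration, not SynthBanshee's implementation: the helper names clip_id_for and clip_wav_path are hypothetical, and the allowed character set in the regular expression is an assumption (the rules only say ASCII, lowercase, no spaces).

```python
import re
from pathlib import PurePosixPath

# Assumed character set: lowercase ASCII letters, digits, underscore, hyphen.
_SAFE_COMPONENT = re.compile(r"[a-z0-9_-]+")


def clip_id_for(scene_id, take_number):
    """Build a clip id as {scene_id_lower}_{take_number:02d}."""
    return f"{scene_id.lower()}_{take_number:02d}"


def clip_wav_path(scene_id, take_number, first_speaker_id, lang="he"):
    """Return the relative .wav path for a clip (illustrative helper only)."""
    speaker_dir = first_speaker_id.lower()  # directory is lowercase; JSON stays uppercase
    clip_id = clip_id_for(scene_id, take_number)
    for component in (lang, speaker_dir, clip_id):
        if not _SAFE_COMPONENT.fullmatch(component):
            raise ValueError(f"non-conforming path component: {component!r}")
    return PurePosixPath("data") / lang / speaker_dir / f"{clip_id}.wav"
```

For example, clip_wav_path("SP_IT_A_0001", 0, "AGG_M_30-45_001") yields data/he/agg_m_30-45_001/sp_it_a_0001_00.wav.
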
Labels follow a three-level hierarchy defined in configs/taxonomy.yaml in the SynthBanshee repo:
| Level | Field | Examples |
|---|---|---|
| Violence typology (scene-level) | violence_typology | SV, IT, NEG, NEU |
| Tier 1 category (event-level) | tier1_category | PHYS, VERB, DIST, ACOU, EMOT, NONE |
| Tier 2 subtype (event-level) | tier2_subtype | VERB_THREAT, DIST_SCREAM, PHYS_HARD |
has_violence is a derived convenience field computed from the strong-label events, not from typology or intensity. The rule is pinned in SynthBanshee docs/spec.md §5.1 and lives in synthbanshee/labels/generator.py:

    has_violence = any(e.tier1_category != "NONE" for e in events)

This means NEG (Negative / Confusor) clips are correctly has_violence: false even at max_intensity ≥ 3: by definition NEG is "acoustically intense but non-violent", so every event lands tier1_category: "NONE". Do not re-derive has_violence from typology + intensity alone; you will disagree with the data on every NEG row. The taxonomy columns are the ground truth; has_violence is for fast filtering and baseline modelling only, never the sole training label.
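
A consumer who wants to confirm the rule against a clip's strong labels could do so roughly as below. This is a sketch under assumptions: each line of the .jsonl is taken to be a JSON object with a tier1_category key, has_violence is taken to be a top-level key of the .json metadata, and both function names are hypothetical.

```python
import json


def derive_has_violence(jsonl_text):
    """Apply the has_violence rule to the text of a {clip_id}.jsonl file."""
    events = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return any(event["tier1_category"] != "NONE" for event in events)


def has_violence_matches(metadata, jsonl_text):
    """Check that the stored flag agrees with the flag derived from the events.

    `metadata` is the parsed {clip_id}.json; the top-level position of
    `has_violence` is an assumption of this sketch.
    """
    return metadata.get("has_violence") == derive_has_violence(jsonl_text)
```

A NEG clip whose events all carry tier1_category "NONE" derives to False regardless of its intensity scores, which is the behaviour described above.
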
Intensity is scored 1–5 per turn:
| Score | Label | Description |
|---|---|---|
| 1 | Low tension | Calm conversation, mild undercurrent |
| 2 | Moderate tension | Noticeable friction, raised voices |
| 3 | Active conflict | Clear verbal aggression or intimidation |
| 4 | Escalated violence | Physical or high-intensity verbal violence |
| 5 | Extreme / life-threatening | Severe physical violence, panic, imminent danger |
All clips must conform to:
- Sample rate: 16 kHz
- Channels: Mono
- Bit depth: 16-bit PCM
- Peak normalization: target −2.0 dBFS (configurable, range [−12.0, −1.5]) via single global gain, then safety limiter at ≤ −1.0 dBFS. The measured peak lands in preprocessing_applied.normalized_dbfs; the configured target lands in generation_metadata.loudness_target_peak_dbfs. See spec §3 and §5.1 field notes.
- Silence padding: ≥ 0.5 s ambient baseline before and after target speech
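
For a rough independent look at the container-level requirements, a consumer could use Python's standard wave module as sketched below. The function name check_wav_format is hypothetical, the sketch covers only sample rate, channel count, bit depth, and the limiter ceiling, and it does not attempt the silence-padding check. It is not a replacement for synthbanshee validate.

```python
import array
import math
import sys
import wave

EXPECTED_RATE_HZ = 16000
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit PCM


def check_wav_format(path, limiter_ceiling_dbfs=-1.0):
    """Return a list of problems found; an empty list means these checks passed."""
    with wave.open(str(path), "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        width = wav.getsampwidth()
        frames = wav.readframes(wav.getnframes())

    problems = []
    if rate != EXPECTED_RATE_HZ:
        problems.append(f"sample rate is {rate} Hz, expected {EXPECTED_RATE_HZ}")
    if channels != EXPECTED_CHANNELS:
        problems.append(f"{channels} channels, expected mono")
    if width != EXPECTED_SAMPLE_WIDTH_BYTES:
        problems.append(f"sample width is {width} bytes, expected 2")
        return problems  # the peak measurement below assumes 16-bit samples

    samples = array.array("h")  # signed 16-bit integers
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    peak = max((abs(s) for s in samples), default=0)
    if peak > 0:
        peak_dbfs = 20.0 * math.log10(peak / 32768.0)
        if peak_dbfs > limiter_ceiling_dbfs:
            problems.append(
                f"peak is {peak_dbfs:.2f} dBFS, above {limiter_ceiling_dbfs} dBFS"
            )
    return problems
```
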
Clips carry a generation_metadata block in their .json file when the generator recorded pipeline provenance (pipeline_version, tts_backend per speaker, voice_family per speaker, mix_mode_used, normalization_strategy, loudness_target_peak_dbfs, breathiness_applied, effective_prosody_caps). Older clips may have it as null — treat absence as "unknown", not as failure. See spec §5.1 field notes.
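
When reading provenance, the "absence means unknown" rule can be handled with a small accessor like the one sketched here. It assumes generation_metadata is a top-level key of the parsed .json; the function name provenance_field is hypothetical.

```python
def provenance_field(metadata, field):
    """Return a generation_metadata field, or None when provenance is unknown.

    `metadata` is the parsed {clip_id}.json as a dict. A missing or null
    generation_metadata block is treated as unknown, never as an error.
    """
    generation = metadata.get("generation_metadata")
    if generation is None:
        return None  # older clip: provenance was not recorded
    return generation.get(field)
```

For instance, provenance_field(metadata, "pipeline_version") returns None for older clips, so callers can bucket them as "unknown" instead of rejecting them.
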
Per-delivery quality posture lives in deliveries/{slug}/notes.md. Each delivery records the SynthBanshee commit, milestone state, prosody / acoustic QA findings, and any known limitations specific to that batch. Consumer teams reading the corpus should always start from the delivery notes for the clips they're working with rather than assuming a single global quality bar.
SynthBanshee writes directly to this repository when the following environment variables are set (configured in .envrc of the SynthBanshee repo):
| Variable | Points to |
|---|---|
| SYNTHBANSHEE_CACHE_DIR | assets/speech/ |
| SYNTHBANSHEE_SCRIPT_CACHE_DIR | assets/scripts/ |
| SYNTHBANSHEE_DATA_DIR | data/he/ |
Do not write to this repository by hand. All files should be produced by synthbanshee generate or synthbanshee generate-batch. Manual edits to .wav files will invalidate the SHA-256 cache keys and break re-synthesis detection.
To verify that a clip is spec-compliant, use the SynthBanshee CLI from the SynthBanshee repo:

    synthbanshee validate data/he/{speaker_id}/{clip_id}.wav

To run QA over an entire dataset directory:

    synthbanshee qa-report data/he/

All data deliveries are logged in DELIVERIES.md, one row per merged PR,
with clip counts, duration, prosody QA results, and known limitations.
Per-delivery notes and structured metadata live under deliveries/{slug}/.
See CLAUDE.md for the full rules governing this repository — cache integrity,
label policy, delivery log conventions, and what not to do.
| Repo | Purpose |
|---|---|
| DataHackIL/SynthBanshee | Pipeline code, configs, templates, tests, documentation |
| This repo | Generated data and asset cache only |