avdp-synth-corpus

Synthetic Hebrew audio dataset for the Audio Violence Detection Pipeline (AVDP), generated by the SynthBanshee pipeline.

This is a data-only repository. It contains no application code. All pipeline logic, configuration, documentation, and tests live in SynthBanshee.

If you are a Claude Code agent or AI assistant: read CLAUDE.md before making any changes. Key rules: never rename/modify/delete files in assets/; never edit .wav files by hand; always update DELIVERIES.md when adding clips; never drop has_violence from metadata or manifests.

What is this data for?

AVDP is an AI safety initiative run by DataHack with two downstream products:

She-Proves — passively monitors a smartphone for domestic violence incidents and preserves audio evidence for legal use
Elephant in the Room (הפיל שבחדר) — a Raspberry Pi–class device in clinic/welfare offices that alerts security when a social worker is under threat

The clips in this repository are synthetic (is_synthetic: true in all metadata). They are generated by a text-to-speech pipeline using Microsoft Azure Cognitive Services Hebrew neural voices. A real-data pipeline (actor recordings) is planned for a later phase; those recordings will live in a separate repository.

Repository layout

assets/
  speech/          # Per-utterance WAV cache, named by SHA-256 of the full rendered
  │                # SSML string. Never modify or rename these files — SynthBanshee
  │                # uses the hash as the cache key. Deleting a file forces a paid
  │                # re-synthesis; adding a file with a wrong name silently breaks
  │                # cache lookups.
  │  dirty/        # Pre-preprocessing ("dirty") WAV files, retained per spec.
  │                # Named {clip_id}_dirty.wav — not by hash.
  scripts/         # Per-scene script generation cache, named by SHA-256 of all
                   # generation inputs. Same rules as assets/speech/.

data/
  he/              # Language code (ISO 639-1). All current clips are Hebrew.
    {speaker_id}/  # Speaker persona ID, e.g. agg_m_30-45_001
      {clip_id}.wav    # 16 kHz, mono, 16-bit PCM WAV
      {clip_id}.txt    # Per-turn transcript with onset/offset markers
      {clip_id}.json   # ClipMetadata (weak labels, speaker info, is_synthetic, etc.)
      {clip_id}.jsonl  # Per-event EventLabel records (strong labels)

Every .wav must have a matching .txt, .json, and .jsonl. A clip without all four files is invalid and will be rejected by synthbanshee validate.
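
The four-file rule above can be pre-checked locally before running the real validator. The following is a minimal sketch, not part of SynthBanshee: the function names `missing_siblings` and `find_incomplete_clips` are hypothetical, and it only tests for file presence, not for content validity.

```python
from pathlib import Path

# The four files every clip must ship with, per the rule above.
REQUIRED_SUFFIXES = (".wav", ".txt", ".json", ".jsonl")


def missing_siblings(wav_path: Path) -> list[str]:
    """Return the required suffixes that have no file next to wav_path."""
    return [
        suffix
        for suffix in REQUIRED_SUFFIXES
        if not wav_path.with_suffix(suffix).is_file()
    ]


def find_incomplete_clips(data_root: Path) -> dict[str, list[str]]:
    """Map each incomplete clip's .wav path to the suffixes it is missing."""
    report = {}
    for wav_path in sorted(data_root.rglob("*.wav")):
        missing = missing_siblings(wav_path)
        if missing:
            report[str(wav_path)] = missing
    return report
```

Calling `find_incomplete_clips(Path("data/he"))` returns an empty dict when every `.wav` has its three companions. Because the scan starts from `.wav` files, an orphaned `.json` or `.txt` with no audio is not reported by this sketch; `synthbanshee validate` remains the authoritative check.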

Clip ID and filename conventions

All filenames (and filesystem path components) are ASCII only, lowercase, no spaces.
Format: {scene_id_lower}_{take_number:02d}, e.g. sp_it_a_0001_00. The scene portion is the lowercased form of the scene_id in the scene YAML, where it is written in uppercase (SP_IT_A_0001).
The on-disk speaker directory is speaker_id.lower() of the scene's first listed speaker. The speakers[].speaker_id value in the .json stays uppercase (AGG_M_30-45_001); only the directory name is lowercase (agg_m_30-45_001/).
Single source of truth for per-surface casing rules: SynthBanshee docs/spec.md §2.5 — Identifier casing (per surface).
No Hebrew text in filenames or JSON keys/values — Hebrew belongs in .txt transcript files only.
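
As an illustration of the conventions above, here is a small sketch of how a clip ID and speaker directory could be derived and how a path component could be checked. The helper names are hypothetical and do not come from SynthBanshee; the authoritative casing rules are in spec §2.5.

```python
def make_clip_id(scene_id: str, take_number: int) -> str:
    """Build {scene_id_lower}_{take_number:02d} from the YAML scene_id."""
    return f"{scene_id.lower()}_{take_number:02d}"


def speaker_dir(first_speaker_id: str) -> str:
    """On-disk directory: lowercased speaker_id of the first listed speaker."""
    return first_speaker_id.lower()


def is_valid_path_component(name: str) -> bool:
    """True if name is non-empty, ASCII only, lowercase, and has no whitespace."""
    return (
        bool(name)
        and name.isascii()
        and name == name.lower()
        and not any(ch.isspace() for ch in name)
    )


print(make_clip_id("SP_IT_A_0001", 0))       # sp_it_a_0001_00
print(speaker_dir("AGG_M_30-45_001"))        # agg_m_30-45_001
print(is_valid_path_component("SP_IT.wav"))  # False (uppercase)
```

Note that the lowercasing applies only to filenames and directory names; the `speaker_id` stored inside the `.json` keeps its uppercase form.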

Label taxonomy

Labels follow a three-level hierarchy defined in configs/taxonomy.yaml in the SynthBanshee repo:

| Level | Field | Examples |
| --- | --- | --- |
| Violence typology (scene-level) | `violence_typology` | `SV`, `IT`, `NEG`, `NEU` |
| Tier 1 category (event-level) | `tier1_category` | `PHYS`, `VERB`, `DIST`, `ACOU`, `EMOT`, `NONE` |
| Tier 2 subtype (event-level) | `tier2_subtype` | `VERB_THREAT`, `DIST_SCREAM`, `PHYS_HARD` |

has_violence is a derived convenience field computed from the strong-label events, not from typology or intensity. The rule is pinned in SynthBanshee docs/spec.md §5.1 and lives in synthbanshee/labels/generator.py:

has_violence = any(e.tier1_category != "NONE" for e in events)

This means NEG (Negative / Confusor) clips are correctly has_violence: false even at max_intensity ≥ 3 — by definition NEG is "acoustically intense but non-violent" so every event lands tier1_category: "NONE". Do not re-derive has_violence from typology + intensity alone; you will disagree with the data on every NEG row. The taxonomy columns are the ground truth — has_violence is for fast filtering and baseline modelling only, never the sole training label.
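
To make the NEG case concrete, the sketch below applies the same `any(...)` rule to event records given as JSON lines. It assumes each record in a clip's `.jsonl` exposes `tier1_category` as a top-level key; the function name and the sample records are illustrative, not taken from the corpus.

```python
import json


def has_violence_from_jsonl(lines: list[str]) -> bool:
    """Apply the event-level rule: violent if any event is not tier1 NONE."""
    events = [json.loads(line) for line in lines if line.strip()]
    return any(e["tier1_category"] != "NONE" for e in events)


# Illustrative NEG clip: loud but non-violent, so every event is NONE.
neg_events = [
    '{"tier1_category": "NONE"}',
    '{"tier1_category": "NONE"}',
]

# Illustrative violent clip: one verbal and one physical event.
violent_events = [
    '{"tier1_category": "NONE"}',
    '{"tier1_category": "VERB"}',
    '{"tier1_category": "PHYS"}',
]

print(has_violence_from_jsonl(neg_events))      # False
print(has_violence_from_jsonl(violent_events))  # True
```

The result depends only on the events, never on `violence_typology` or intensity, which is why a high-intensity NEG clip still comes out `False`.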

Intensity is scored 1–5 per turn:

| Score | Label | Description |
| --- | --- | --- |
| 1 | Low tension | Calm conversation, mild undercurrent |
| 2 | Moderate tension | Noticeable friction, raised voices |
| 3 | Active conflict | Clear verbal aggression or intimidation |
| 4 | Escalated violence | Physical or high-intensity verbal violence |
| 5 | Extreme / life-threatening | Severe physical violence, panic, imminent danger |

Audio format

All clips must conform to:

Sample rate: 16 kHz
Channels: Mono
Bit depth: 16-bit PCM
Peak normalization: target −2.0 dBFS (configurable, range [−12.0, −1.5]) via single global gain, then safety limiter at ≤ −1.0 dBFS. The measured peak lands in preprocessing_applied.normalized_dbfs; the configured target lands in generation_metadata.loudness_target_peak_dbfs. See spec §3 and §5.1 field notes.
Silence padding: ≥ 0.5 s ambient baseline before and after target speech
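
For a quick local sanity check of the header fields and peak level listed above, something like the following could be used. It is a sketch built on the Python standard library only, with hypothetical function names; it does not check silence padding and is not a substitute for `synthbanshee validate`.

```python
import array
import math
import sys
import wave


def check_wav_header(path: str) -> list[str]:
    """Return human-readable format violations (empty list if none)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != 16000:
            problems.append(f"sample rate is {wav.getframerate()} Hz, expected 16000")
        if wav.getnchannels() != 1:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
    return problems


def peak_dbfs(path: str) -> float:
    """Measure the sample peak of a 16-bit PCM WAV in dBFS (-inf for silence)."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("peak_dbfs expects 16-bit PCM")
        samples = array.array("h", wav.readframes(wav.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    peak = max((abs(s) for s in samples), default=0)
    return 20 * math.log10(peak / 32768) if peak else float("-inf")
```

A clip that passed the safety limiter would be expected to satisfy `peak_dbfs(path) <= -1.0`, and the measured value should be close to the `preprocessing_applied.normalized_dbfs` recorded in its `.json`.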

Pipeline versions and data quality

Clips carry a generation_metadata block in their .json file when the generator recorded pipeline provenance (pipeline_version, tts_backend per speaker, voice_family per speaker, mix_mode_used, normalization_strategy, loudness_target_peak_dbfs, breathiness_applied, effective_prosody_caps). Older clips may have it as null — treat absence as "unknown", not as failure. See spec §5.1 field notes.
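
A consumer reading clip metadata might handle the nullable block as in this sketch. The function name is hypothetical, and the code assumes `generation_metadata` is a top-level key of the clip's `.json` that is either an object or null (or missing entirely in older clips).

```python
import json
from pathlib import Path


def pipeline_version(metadata_path: Path) -> str:
    """Return the recorded pipeline_version, or "unknown" if provenance is absent."""
    metadata = json.loads(metadata_path.read_text(encoding="utf-8"))
    provenance = metadata.get("generation_metadata")  # may be missing or null
    if not provenance:
        return "unknown"
    return provenance.get("pipeline_version") or "unknown"
```

Returning a sentinel instead of raising keeps older clips usable in batch analyses while still letting callers filter on clips whose provenance is known.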

Per-delivery quality posture lives in deliveries/{slug}/notes.md. Each delivery records the SynthBanshee commit, milestone state, prosody / acoustic QA findings, and any known limitations specific to that batch. Consumer teams reading the corpus should always start from the delivery notes for the clips they're working with rather than assuming a single global quality bar.

How clips get here

SynthBanshee writes directly to this repository when the following environment variables are set (configured in .envrc of the SynthBanshee repo):

| Variable | Points to |
| --- | --- |
| `SYNTHBANSHEE_CACHE_DIR` | `assets/speech/` |
| `SYNTHBANSHEE_SCRIPT_CACHE_DIR` | `assets/scripts/` |
| `SYNTHBANSHEE_DATA_DIR` | `data/he/` |

Do not write to this repository by hand. All files should be produced by synthbanshee generate or synthbanshee generate-batch. A hand-edited .wav in the asset cache no longer matches the content its SHA-256 cache key stands for, which breaks re-synthesis detection.
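
To show why hash-named cache files are fragile, here is a rough sketch of how a name for `assets/speech/` could be derived from a rendered SSML string. The UTF-8 encoding, lowercase hex digest, and `.wav` suffix are all assumptions, and `speech_cache_name` is a hypothetical helper; the real key derivation lives in SynthBanshee and may differ.

```python
import hashlib


def speech_cache_name(rendered_ssml: str) -> str:
    """Derive a cache filename from the full rendered SSML string.

    Assumed details (not confirmed by this repo): UTF-8 encoding,
    lowercase hex digest, and a ".wav" suffix.
    """
    digest = hashlib.sha256(rendered_ssml.encode("utf-8")).hexdigest()
    return f"{digest}.wav"


a = speech_cache_name("<speak>shalom</speak>")
b = speech_cache_name("<speak>shalom </speak>")  # one extra space
print(a == b)  # False: any change to the SSML yields a different name
```

Since a lookup recomputes the name from the SSML, a renamed file is never found again and a hand-edited file is served as if it were the original synthesis.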

Validation

To verify that a clip is spec-compliant, use the SynthBanshee CLI from the SynthBanshee repo:

synthbanshee validate data/he/{speaker_id}/{clip_id}.wav

To run QA over an entire dataset directory:

synthbanshee qa-report data/he/

Delivery history

All data deliveries are logged in DELIVERIES.md — one row per merged PR, with clip counts, duration, prosody QA results, and known limitations. Per-delivery notes and structured metadata live under deliveries/{slug}/.

Agent and contributor guidelines

See CLAUDE.md for the full rules governing this repository — cache integrity, label policy, delivery log conventions, and what not to do.

Related repositories

| Repo | Purpose |
| --- | --- |
| DataHackIL/SynthBanshee | Pipeline code, configs, templates, tests, documentation |
| This repo | Generated data and asset cache only |
