av-recorder

Synchronized audio/video capture for Linux that, in one call, benchmarks the pipeline, saves raw frames + audio to av_data/, and writes a synchronized MP4 to processed_av/. It runs from real devices (V4L2 camera + PortAudio mic) or from externally pushed frames/audio, can apply DPDFNet speech enhancement, and lets you defer enhancement/rendering so you can capture now and render later by name.

Frames are JPEG-encoded by a worker pool and timestamped on CLOCK_MONOTONIC; audio carries its true capture time; both stream to disk continuously. The MP4 is built variable-frame-rate from those timestamps, so the camera's real rate is honored and audio stays aligned.

Install

System packages (Linux):

sudo apt install ffmpeg v4l-utils          # ffmpeg = MP4 mux; v4l-utils = camera controls
# PyTurboJPEG 2.x also needs libjpeg-turbo >= 3.0 (official .deb), else pin PyTurboJPEG<2

Python (with uv, using the provided pyproject.toml):

uv sync --extra all                        # everything
# or pick what you need:
uv sync --extra audio --extra enhance      # mic + DPDFNet

Each optional piece degrades gracefully: no sounddevice → video-only, no PyTurboJPEG → OpenCV encode, no sherpa-onnx/ffmpeg → enhancement/MP4 skipped (with a note in the result, not a crash).

Output layout

av_data/<name>/frames/f_000001.jpg ...   raw JPEG frames (capture order)
av_data/<name>/audio.wav                 raw PCM (int16), streamed to disk
av_data/<name>/audio_enhanced.wav        DPDFNet output (only if enhancement run)
av_data/<name>/manifest.json             config + per-frame timestamps + metrics
processed_av/<name>.mp4                   synchronized H.264 + AAC

data_dir and processed_dir default to the relative paths av_data and processed_av under the current directory; pass absolute paths for fixed locations.
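
As a minimal sketch of the layout above, the helper below builds the expected paths for a session name. `session_paths` is a hypothetical helper, not part of av_recorder; it only mirrors the file and directory names listed in this section.

```python
from pathlib import Path


def session_paths(name, data_dir="av_data", processed_dir="processed_av"):
    """Return the paths this README lists for a session (hypothetical helper)."""
    root = Path(data_dir) / name
    return {
        "frames_dir": root / "frames",
        "wav": root / "audio.wav",
        "enhanced_wav": root / "audio_enhanced.wav",  # present only after enhancement
        "manifest": root / "manifest.json",
        "mp4": Path(processed_dir) / f"{name}.mp4",
    }


paths = session_paths("interview")
print(paths["manifest"].exists())  # False until a session named "interview" is recorded
```

Checking `paths["manifest"].exists()` is a cheap way to tell whether a deferred session is available before calling process(name).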

API usage

from av_recorder import AVRecorder, RecordingResult

1. Record a fixed duration (device mode)

rec = AVRecorder(width=1280, height=720, fps=30)
result = rec.record(duration=60, name="interview")

print(result.metrics)        # printable benchmark summary
print(result.mp4_path)       # processed_av/interview.mp4

2. Open-ended capture: `start()` / `stop()`

Stop on a keypress, event, signal — not a timer:

rec = AVRecorder(fps=30)
rec.start(name="session")
# ... wait on anything: input(), a threading.Event, a socket message, SIGTERM ...
result = rec.stop()          # finalizes raw files (+ MP4/enhancement unless deferred)
rec.is_recording()           # -> False

Guards: start() while recording raises; stop() without start() raises. The recorder is reusable — loop start/stop; each session gets its own <name>/.
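
One way to wire start()/stop() to an arbitrary stop condition is sketched below. `record_until` is a hypothetical helper, not part of av_recorder; it assumes only that the recorder exposes start(name=...) and stop() as shown above, and it uses a threading.Event as the stop signal.

```python
import threading


def record_until(rec, stop_event, name, timeout=None):
    """Start `rec`, block until `stop_event` is set (or `timeout` elapses), then stop.

    `rec` is any object with start(name=...) and stop(); hypothetical helper.
    """
    rec.start(name=name)
    try:
        stop_event.wait(timeout)
    finally:
        # Always finalize, even if the wait is interrupted (e.g. KeyboardInterrupt).
        result = rec.stop()
    return result
```

A signal handler, a socket reader, or a UI thread can then end the session by calling `stop_event.set()`.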

3. Context manager

with AVRecorder(fps=30) as rec:      # __enter__ calls start()
    wait_for_event()
    result = rec.stop()              # or let __exit__ stop it on the way out

4. Defer rendering/enhancement, then process later by name

Capture cheaply and skip the heavy work, then render/enhance whenever you want, even in another process or on another machine. process(name) finds av_data/<name>/, reads the manifest timestamps + frames + audio.wav, and produces the enhanced track and/or MP4. It opens no devices and never recaptures or rewrites the stored JPEGs; they are read as-is and transcoded to H.264 when the MP4 is rendered.

# capture only: raw frames + audio + manifest, no MP4, no enhancement
rec = AVRecorder(fps=30)
rec.record(duration=60, name="interview", make_mp4=False, enhance=False)

# ... later / elsewhere ...
proc = AVRecorder(data_dir="av_data", processed_dir="processed_av",
                  denoise_model="dpdfnet2.onnx")
result = proc.process("interview")            # enhance + render
result = proc.process("interview", enhance=False)   # render only, no enhancement
result = proc.process("interview", make_mp4=False)  # enhance only, no MP4

make_mp4 and enhance toggles exist on record(), stop(), and process():

call	`make_mp4`	`enhance`
`record(...)` / `stop(...)`	render MP4 at capture end	enhance at capture end (needs `denoise_model`)
`process(name, ...)`	(re)render MP4 from stored data	`None` = enhance iff `denoise_model` set; `True`/`False` forces
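
The three-state `enhance` rule for process() in the table can be read as the small decision function below. This is a sketch of the documented behavior, not the library's code; `resolve_enhance` is a hypothetical name, and what happens when enhancement is forced on without a model is not stated here, so the sketch does not model it.

```python
def resolve_enhance(enhance, denoise_model):
    """Decide whether enhancement should run, following the table above.

    enhance=None       -> enhance only if a denoise model is configured
    enhance=True/False -> forced on/off
    """
    if enhance is None:
        return denoise_model is not None
    return bool(enhance)


print(resolve_enhance(None, "dpdfnet2.onnx"))   # True
print(resolve_enhance(None, None))              # False
print(resolve_enhance(False, "dpdfnet2.onnx"))  # False
```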

5. External feed — supply your own frames/audio (no devices opened)

In external mode the recorder never touches the camera or mic; your pipeline pushes data. The one rule: video and audio timestamps must share one clock (time.monotonic()).

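import time
import numpy as np

rec = AVRecorder(source="external", fps=30, sample_rate=48000, channels=1)
rec.start(name="external")

# placeholder inputs; in practice these come from your own pipeline
bgr_ndarray = np.zeros((720, 1280, 3), dtype=np.uint8)   # one (H, W, 3) BGR frame
int16_samples = np.zeros(1024, dtype=np.int16)           # one mono audio block

t = time.monotonic()
rec.push_video(bgr_ndarray, t=t)        # BGR ndarray -> JPEG-encoded
# rec.push_video(jpeg_bytes, t=t)       # alternative: raw JPEG bytes -> written as-is (no re-encode)
rec.push_audio(int16_samples, t=t)      # (n,) or (n, channels); int16 or float[-1,1]

result = rec.stop()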

t is the capture instant of the frame / first audio sample; t=None stamps on arrival. If your source has its own clock, map it to time.monotonic() once and add the offset to every t.
push_video/push_audio are thread-safe — independent producers can push concurrently.
External audio needs no sounddevice, external video no OpenCV camera. Use on_full="block" to avoid dropping frames while feeding fast.
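
The "map it once and add the offset" advice can be sketched as follows. `ClockMapper` is a hypothetical helper, not part of av_recorder, and it assumes the source clock ticks at the same rate as time.monotonic(); it corrects a fixed offset only, not drift.

```python
import time


class ClockMapper:
    """Map timestamps from a source clock onto time.monotonic() (hypothetical helper)."""

    def __init__(self, source_now, monotonic_now=None):
        if monotonic_now is None:
            monotonic_now = time.monotonic()
        # Measured once: how far the monotonic clock is ahead of the source clock.
        self.offset = monotonic_now - source_now

    def to_monotonic(self, source_t):
        return source_t + self.offset


# Fixed values make the arithmetic visible; normally monotonic_now is left as None.
mapper = ClockMapper(source_now=100.0, monotonic_now=5.0)
print(mapper.to_monotonic(101.5))  # 6.5
```

The mapped value is what would be passed as `t` to push_video/push_audio.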

6. Reading the result and metrics

result.name                  # session name
result.frames_dir            # Path to av_data/<name>/frames
result.wav_path              # Path or None (raw audio)
result.enhanced_wav_path     # Path or None (enhanced audio)
result.manifest_path         # Path to manifest.json
result.mp4_path              # Path or None

m = result.metrics           # BenchmarkMetrics dataclass
m.duration_s                 # actual elapsed capture window (s)
m.frames                     # frames captured
m.dropped                    # frames dropped (queue full, on_full="drop")
m.effective_fps              # measured fps from timestamps
m.nominal_fps                # configured fps
m.grab_ms_mean               # device fetch time (nan in external mode)
m.retrieve_ms_mean           # MJPG->BGR decode time
m.write_latency_p95_ms       # capture->disk latency, p95
m.write_latency_max_ms
m.queue_max                  # peak queue depth
m.video_mb                   # total JPEG bytes written
m.video_mbps
m.audio_enabled
m.audio_samples
m.audio_xruns                # input overflows (>0 means dropped audio)
m.audio_block_jitter_ms
m.soundcard_drift_ppm        # audio clock vs system clock
m.camera_fps_label_error_ppm # true rate vs nominal label
m.projected_av_skew_ms_per_hour  # skew IF you played at nominal fps (the MP4 has none)
m.encoder                    # "turbojpeg" | "opencv"
m.denoise_enabled
m.denoise_model
m.denoise_out_sr             # enhanced sample rate (16000 or 48000)
m.denoise_rt_factor          # enhancement time / audio length (<1 = faster than real-time)
m.notes                      # list of fallback/warning strings
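
A quick post-capture health check over these fields might look like the sketch below. `capture_warnings` is a hypothetical helper, and the 2% fps tolerance is an arbitrary assumption chosen for illustration; only the attribute names come from the listing above.

```python
def capture_warnings(m, fps_tolerance=0.02):
    """Return human-readable warnings for a metrics object (hypothetical helper).

    `m` needs the attributes dropped, audio_enabled, audio_xruns,
    effective_fps and nominal_fps, named as in the listing above.
    """
    warnings = []
    if m.dropped > 0:
        warnings.append(f"{m.dropped} frame(s) dropped (queue full)")
    if m.audio_enabled and m.audio_xruns > 0:
        warnings.append(f"{m.audio_xruns} audio overflow(s): audio was lost")
    if m.nominal_fps > 0:
        deviation = abs(m.effective_fps - m.nominal_fps) / m.nominal_fps
        if deviation > fps_tolerance:
            warnings.append(
                f"effective fps {m.effective_fps:.2f} deviates from nominal {m.nominal_fps:.2f}"
            )
    return warnings
```

Calling `capture_warnings(result.metrics)` alongside `result.metrics.notes` gives one place to look after each session.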

7. Speech enhancement (DPDFNet)

# download a model variant...
path = AVRecorder.download_denoise_model("dpdfnet2")       # -> ./dpdfnet2.onnx
url  = AVRecorder.denoise_model_url("dpdfnet8")            # just the URL

# ...then enable it (runs at stop() unless deferred, or in process())
rec = AVRecorder(fps=30, denoise_model="dpdfnet2.onnx")
result = rec.record(60, name="clean")
result.enhanced_wav_path                                   # the enhanced track
result.metrics.denoise_rt_factor                           # speed vs real-time

Variants (AVRecorder.DPDFNET_VARIANTS):

model	output rate	use
`dpdfnet_baseline`	16 kHz	fastest / lowest resource
`dpdfnet2`	16 kHz	real-time / embedded
`dpdfnet4`	16 kHz	balanced
`dpdfnet8`	16 kHz	best quality
`dpdfnet2_48khz_hr`	48 kHz	high-resolution full-band

16 kHz models output a 16 kHz mono track (ideal for ASR/diarization); the raw WAV stays full-band. Use dpdfnet2_48khz_hr to keep the MP4 audio at 48 kHz.

8. List audio devices

print(AVRecorder.list_audio_devices())     # static; needs sounddevice

9. Constructor reference

AVRecorder(
    # video (device mode)
    width=1280, height=720, fps=30.0, video_device=0,
    fourcc="MJPG", convert_rgb=True, buffersize=None,
    v4l2_controls=("exposure_dynamic_framerate=0",), warmup=15,
    # audio
    audio=True, audio_device=None, sample_rate=48000, channels=1, blocksize=1024,
    # JPEG encode
    encoder="turbojpeg",   # or "opencv"
    quality=90, subsample="420",   # "gray" | "420" | "422" | "444"
    progressive=False, turbo_lib=None,
    # pipeline
    workers=2, queue_size=64, on_full="drop",   # or "block"
    fsync=False,
    # source
    source="device",       # or "external"
    # enhancement
    denoise=False, denoise_model=None, denoise_num_threads=1, denoise_provider="cpu",
    # output
    data_dir="av_data", processed_dir="processed_av",
)

Tuning notes: buffersize=None keeps the V4L2 default (do not set 1 — it starves the buffer pool and halves fps). fsync=False uses the OS cache (faster, fewer stalls); fsync=True forces each write to disk. convert_rgb=False writes the camera's raw MJPG with no decode/encode (fastest capture-and-save, but no quality/subsample control). turbo_lib points PyTurboJPEG at a specific libturbojpeg.so.
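
The difference between on_full="drop" and on_full="block" in the constructor above can be illustrated with a bounded standard-library queue. This is a sketch of the general policy, not av_recorder's implementation; `enqueue_frame` is a hypothetical name.

```python
import queue


def enqueue_frame(q, frame, on_full="drop", timeout=None):
    """Put `frame` on a bounded queue; return True if queued, False if dropped."""
    if on_full == "block":
        q.put(frame, timeout=timeout)  # waits for a free slot (back-pressure)
        return True
    if on_full != "drop":
        raise ValueError(f"unknown on_full policy: {on_full!r}")
    try:
        q.put_nowait(frame)
        return True
    except queue.Full:
        return False  # caller counts this as a dropped frame


q = queue.Queue(maxsize=2)
results = [enqueue_frame(q, i) for i in range(3)]
print(results)  # [True, True, False]
```

With "drop" the producer never stalls but frames can be lost when the workers fall behind; with "block" nothing is lost but the producer waits, which is why it suits an external feed that can tolerate back-pressure.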

CLI usage

av_recorder.py runs as a script:

# basic 60-second capture (device mode, audio on, MP4 on)
uv run av_recorder.py --duration 60 --name interview

# list input devices, then pick one
uv run av_recorder.py --list-audio
uv run av_recorder.py --duration 60 --audio-device 3

# resolution / fps / video device
uv run av_recorder.py --duration 30 --width 1920 --height 1080 --fps 30 --device 0

# video only / skip the MP4 / skip enhancement
uv run av_recorder.py --duration 30 --no-audio
uv run av_recorder.py --duration 30 --no-mp4
uv run av_recorder.py --duration 30 --no-enhance --denoise-model dpdfnet2.onnx

# DEFER everything, then render/enhance later by name
uv run av_recorder.py --duration 60 --name interview --no-mp4 --no-enhance
uv run av_recorder.py --process interview --denoise-model dpdfnet2.onnx   # enhance + render
uv run av_recorder.py --process interview                                # render only

# enhancement during capture
uv run av_recorder.py --download-model dpdfnet2                # -> ./dpdfnet2.onnx
uv run av_recorder.py --duration 60 --denoise-model dpdfnet2.onnx

# custom output folders
uv run av_recorder.py --duration 30 --data-dir /srv/av_data --processed-dir /srv/processed

All av_recorder.py flags:

flag	default	meaning
`--duration`	`20`	seconds to capture
`--name`	timestamp	session name (folder + MP4 name)
`--width` / `--height`	`1280` / `720`	capture resolution
`--fps`	`30`	requested frame rate
`--device`	`0`	video device index (`/dev/videoN`)
`--audio-device`	system default	mic index/name (`--list-audio`)
`--no-audio`	off	disable audio capture
`--no-mp4`	off	skip the MP4 (defer; render later with `--process`)
`--no-enhance`	off	skip enhancement at capture (defer to `--process`)
`--process NAME`	none	render/enhance an existing recording in `--data-dir` and exit
`--data-dir`	`av_data`	raw output root
`--processed-dir`	`processed_av`	MP4 output root
`--denoise-model`	none	DPDFNet `.onnx` (used at capture, or with `--process`)
`--denoise-provider`	`cpu`	sherpa-onnx provider
`--download-model VARIANT`	none	download a variant and exit
`--list-audio`	off	print input devices and exit
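
To script the deferred capture-then-process flow from Python, the argument lists can be built as below and handed to subprocess.run. The helper names are hypothetical; only flags from the table above are used.

```python
def capture_cmd(name, duration, defer=True):
    """argv for a capture; with defer=True the MP4 and enhancement are skipped."""
    cmd = ["uv", "run", "av_recorder.py", "--duration", str(duration), "--name", name]
    if defer:
        cmd += ["--no-mp4", "--no-enhance"]
    return cmd


def process_cmd(name, denoise_model=None):
    """argv for rendering (and, if a model is given, enhancing) a stored session."""
    cmd = ["uv", "run", "av_recorder.py", "--process", name]
    if denoise_model is not None:
        cmd += ["--denoise-model", denoise_model]
    return cmd


print(" ".join(capture_cmd("interview", 60)))
print(" ".join(process_cmd("interview", "dpdfnet2.onnx")))
```

Each list can be passed to `subprocess.run(cmd, check=True)` so a failed capture stops the script before the processing step.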

Demo

demo_record_then_replay.py is the project's reference demo: it captures with the heavy work deferred, then processes by name (enhance + render).

# capture now, render/enhance later in two steps
uv run demo_record_then_replay.py --duration 8 --name interview
uv run demo_record_then_replay.py --process interview --denoise-model dpdfnet2.onnx

# capture (deferred) then process immediately, one run
uv run demo_record_then_replay.py --duration 8 --name interview --denoise-model dpdfnet2.onnx

# render only (no enhancement) from an existing recording
uv run demo_record_then_replay.py --process interview

How synchronization works (and its limit)

Every frame is stamped with time.monotonic() at capture; audio carries its true capture time on the same clock. The MP4 is muxed variable-frame-rate from those timestamps, so a camera whose real rate isn't exactly its nominal label doesn't drift against audio. The benchmark reports per-device drift (a rate, in ppm) and the skew you'd get only if you ignored the timestamps and played at nominal fps.
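
To make the variable-frame-rate idea concrete, the sketch below turns capture timestamps into per-frame display durations and writes them in the form ffmpeg's concat demuxer accepts (`file` and `duration` entries). This README does not say how av_recorder passes timestamps to ffmpeg, so treat this as one common approach under that assumption, with hypothetical function names.

```python
def frame_durations(timestamps):
    """Per-frame display durations (s) from capture timestamps on one clock."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    durations = [b - a for a, b in zip(timestamps, timestamps[1:])]
    if any(d <= 0 for d in durations):
        raise ValueError("timestamps must be strictly increasing")
    # The last frame has no successor; reuse the previous interval for it.
    durations.append(durations[-1])
    return durations


def ffconcat_listing(frame_names, timestamps):
    """Text for ffmpeg's concat demuxer: one file + duration entry per frame."""
    lines = ["ffconcat version 1.0"]
    for name, dur in zip(frame_names, frame_durations(timestamps)):
        lines.append(f"file '{name}'")
        lines.append(f"duration {dur:.6f}")
    return "\n".join(lines) + "\n"


print(ffconcat_listing(["f_000001.jpg", "f_000002.jpg", "f_000003.jpg"], [0.0, 0.5, 1.25]))
```

Because each frame carries its own duration, a camera that delivers frames unevenly still plays back on the timeline it was captured on.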

What is not measured: the constant device-latency offset between the two streams (the fixed "audio leads video by N ms" skew). Software timestamps can't recover it — that needs a physical flash + beep reference. This tool gives a shared timeline and true rates, not absolute lip-sync calibration.
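
For a sense of scale on the ppm figures mentioned in this section, the arithmetic below converts a rate error into the timing error it would accumulate over an hour. How av_recorder combines camera and sound-card drift into its projected skew metric is not stated here; the functions are a hypothetical illustration of the unit conversion only.

```python
def rate_error_ppm(measured, nominal):
    """Relative rate error in parts per million."""
    return (measured - nominal) / nominal * 1e6


def skew_ms_per_hour(ppm):
    """Timing error accumulated over one hour at a given ppm rate error."""
    return ppm * 1e-6 * 3600.0 * 1000.0


ppm = rate_error_ppm(measured=29.97, nominal=30.0)
print(round(ppm), round(skew_ms_per_hour(ppm)))  # -1000 -3600
```

A camera labeled 30 fps that really runs at 29.97 fps is off by about 1000 ppm, which would be roughly 3.6 seconds per hour if the timestamps were ignored.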

Notes & limitations

Memory on long runs. Both audio and video stream to disk continuously and are released as they go, so RAM stays flat for the whole capture. The only exception: if speech enhancement runs, it reads the finished WAV back into memory once to feed the offline DPDFNet model (a transient spike proportional to clip length, ~5.5 MB/min at 48 kHz mono), since the model needs the complete signal. Defer it (enhance=False / --no-enhance) and run process(name) later if you want capture itself to stay lean. (Small per-frame/per-block timestamp metadata still accrues in RAM during capture — kilobytes-to-low-MB per hour.)
Enhancement is offline. sherpa-onnx's Python API exposes only the offline DPDFNet denoiser, so enhancement runs on the whole buffer (output-equivalent to streaming for a finalize-at-stop recorder). The live streaming denoiser is C-only.
MP4 is a transcode. Frames are re-encoded to H.264 at render time, so the MP4 isn't a byte copy of the JPEGs. This is post-capture and doesn't affect the live benchmark numbers.
Linux-focused. Device mode uses V4L2 + PortAudio; external mode is platform-agnostic.
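
The memory figure quoted for enhancement follows from the size of uncompressed PCM. The sketch below assumes the audio is held as int16 (2 bytes per sample), matching the raw WAV format; if the model works on a wider sample type in memory, the footprint scales accordingly.

```python
def pcm_bytes(seconds, sample_rate=48000, channels=1, bytes_per_sample=2):
    """Size of uncompressed PCM audio; int16 samples are 2 bytes each."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


per_minute = pcm_bytes(60)
print(per_minute)                    # 5760000 bytes
print(round(per_minute / 2**20, 2))  # 5.49 (MiB)
```

By the same arithmetic an hour of 48 kHz mono audio is 345,600,000 bytes, which is the size of the transient spike to plan for when enhancing a long session.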

Future roadmap

Not yet implemented; candidate directions:

Live streaming enhancement. Wrap sherpa-onnx's C streaming DPDFNet denoiser (libsherpa-onnx-c-api.so, already installed via sherpa-onnx-core) via ctypes so audio can be enhanced during capture — for live monitoring or feeding a real-time transcriber — instead of only at render time.
Flash + beep calibration. A helper that emits a simultaneous screen flash and audio click, detects both offline, and writes the measured constant A/V offset into the manifest — closing the one gap software timestamps can't (absolute lip-sync).
Incremental manifest. Stream per-frame/per-block timestamps to disk during capture so even metadata RAM is bounded on marathon (multi-hour) sessions.
Stream-copy MP4 path. When frames are raw MJPG (convert_rgb=False), offer a remux that avoids the H.264 transcode for a faster, lossless render.
Hardware video encode. Optional VAAPI/NVENC H.264 at render time for large batch processing.
Packaging. Flip [tool.uv] package = false to a real build backend so the module can be pip install-ed and versioned as a distributable package.
