Synchronized audio/video capture for Linux that, in one call, benchmarks the
pipeline, saves raw frames + audio to av_data/, and writes a synchronized
MP4 to processed_av/. It runs from real devices (V4L2 camera + PortAudio mic)
or from externally pushed frames/audio, can apply DPDFNet speech
enhancement, and lets you defer enhancement/rendering so you can capture now
and render later by name.
Frames are JPEG-encoded by a worker pool and timestamped on CLOCK_MONOTONIC;
audio carries its true capture time; both stream to disk continuously. The MP4 is
built variable-frame-rate from those timestamps, so the camera's real rate is
honored and audio stays aligned.
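The variable-frame-rate idea can be sketched as follows. This is a minimal illustration, not the recorder's actual muxing code (which is not shown here): it assumes a list of per-frame time.monotonic() timestamps and writes an ffmpeg concat-demuxer list in which each frame lasts until the next frame's timestamp. The helper name `concat_list` and the `last_duration` fallback are hypothetical.

```python
from pathlib import Path


def concat_list(frame_paths, timestamps, last_duration=1 / 30):
    """Build ffmpeg concat-demuxer text with one duration per frame.

    Hypothetical helper: each frame is shown until the next frame's
    timestamp; the final frame gets last_duration (an assumed fallback).
    """
    if len(frame_paths) != len(timestamps):
        raise ValueError("need exactly one timestamp per frame")
    lines = ["ffconcat version 1.0"]
    for i, path in enumerate(frame_paths):
        if i + 1 < len(timestamps):
            # Real gap between captures, not 1/nominal_fps.
            duration = timestamps[i + 1] - timestamps[i]
        else:
            duration = last_duration
        lines.append(f"file '{Path(path).as_posix()}'")
        lines.append(f"duration {duration:.6f}")
    return "\n".join(lines) + "\n"
```

A list like this could be fed to ffmpeg's concat demuxer (`-f concat`), so the rendered video follows the measured frame spacing rather than the nominal fps label.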
System packages (Linux):
sudo apt install ffmpeg v4l-utils # ffmpeg = MP4 mux; v4l-utils = camera controls
# PyTurboJPEG 2.x also needs libjpeg-turbo >= 3.0 (official .deb), else pin PyTurboJPEG<2
Python (with uv, using the provided pyproject.toml):
uv sync --extra all # everything
# or pick what you need:
uv sync --extra audio --extra enhance # mic + DPDFNet
Each optional piece degrades gracefully: no sounddevice → video-only, no
PyTurboJPEG → OpenCV encode, no sherpa-onnx/ffmpeg → enhancement/MP4 skipped
(with a note in the result, not a crash).
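One way to picture this fallback behaviour is a small probe that checks which optional pieces are present and collects a note for each missing one. This is a sketch under assumptions, not the module's own detection logic: the function name, the note strings, and the use of `importlib.util.find_spec` are illustrative. It only checks that the Python packages are importable, not that libturbojpeg itself loads.

```python
import importlib.util
import shutil

# Fallback applied when each optional piece is missing (wording is illustrative).
FALLBACKS = {
    "audio": "sounddevice missing: video-only",
    "turbojpeg": "PyTurboJPEG missing: OpenCV JPEG encode",
    "enhance": "sherpa-onnx missing: enhancement skipped",
    "mp4": "ffmpeg missing: MP4 skipped",
}


def detect_features():
    """Return (availability dict, list of fallback notes)."""
    have = {
        "audio": importlib.util.find_spec("sounddevice") is not None,
        "turbojpeg": importlib.util.find_spec("turbojpeg") is not None,
        "enhance": importlib.util.find_spec("sherpa_onnx") is not None,
        "mp4": shutil.which("ffmpeg") is not None,
    }
    notes = [FALLBACKS[key] for key, ok in have.items() if not ok]
    return have, notes
```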
av_data/<name>/frames/f_000001.jpg ... raw JPEG frames (capture order)
av_data/<name>/audio.wav raw PCM (int16), streamed to disk
av_data/<name>/audio_enhanced.wav DPDFNet output (only if enhancement run)
av_data/<name>/manifest.json config + per-frame timestamps + metrics
processed_av/<name>.mp4 synchronized H.264 + AAC
data_dir/processed_dir default to relative av_data/processed_av under the
current directory; pass absolute paths for fixed locations.
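The layout above maps directly onto a few path joins. The helper below is a hypothetical convenience (it is not part of av_recorder) for locating a session's files from its name, for example to check what a deferred capture left on disk.

```python
from pathlib import Path


def session_paths(name, data_dir="av_data", processed_dir="processed_av"):
    """Return the expected file locations for one session (hypothetical helper)."""
    root = Path(data_dir) / name
    return {
        "frames": root / "frames",
        "audio": root / "audio.wav",
        "enhanced": root / "audio_enhanced.wav",
        "manifest": root / "manifest.json",
        "mp4": Path(processed_dir) / f"{name}.mp4",
    }
```

With the defaults the paths are relative to the current directory; calling `.resolve()` on any of them shows where they actually land.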
from av_recorder import AVRecorder, RecordingResult
rec = AVRecorder(width=1280, height=720, fps=30)
result = rec.record(duration=60, name="interview")
print(result.metrics) # printable benchmark summary
print(result.mp4_path) # processed_av/interview.mp4
Stop on a keypress, event, or signal, not a timer:
rec = AVRecorder(fps=30)
rec.start(name="session")
# ... wait on anything: input(), a threading.Event, a socket message, SIGTERM ...
result = rec.stop() # finalizes raw files (+ MP4/enhancement unless deferred)
rec.is_recording() # -> False
Guards: start() while recording raises; stop() without start() raises. The
recorder is reusable — loop start/stop; each session gets its own <name>/.
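As an illustration of stopping on a signal, the sketch below waits on a threading.Event that a signal handler sets, then calls stop(). It relies only on the start(name=...) and stop() calls shown above; the wrapper name and its parameters are hypothetical, and it must run in the main thread because Python only allows signal handlers to be installed there.

```python
import signal
import threading


def record_until_signal(rec, name, signals=(signal.SIGINT, signal.SIGTERM), timeout=None):
    """Start rec, block until a signal (or timeout), then stop and return the result."""
    stop_requested = threading.Event()
    previous = {}
    for sig in signals:
        # Remember the old handler so it can be restored afterwards.
        previous[sig] = signal.signal(sig, lambda signum, frame: stop_requested.set())
    try:
        rec.start(name=name)
        stop_requested.wait(timeout)  # returns early once a handler sets the event
        return rec.stop()
    finally:
        for sig, handler in previous.items():
            signal.signal(sig, handler)
```

Usage would look like `result = record_until_signal(AVRecorder(fps=30), name="session")`, ending on Ctrl+C or SIGTERM.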
with AVRecorder(fps=30) as rec: # __enter__ calls start()
    wait_for_event() # placeholder: block until your own stop condition
    result = rec.stop() # or let __exit__ stop it on the way out
Capture cheaply and skip the heavy work, then render/enhance whenever you want,
even in another process or on another machine. process(name) finds
av_data/<name>/, reads the manifest timestamps + frames + audio.wav, and
produces the enhanced track and/or MP4. It opens no devices and leaves the
stored JPEGs untouched; they are only read and encoded to H.264 for the MP4.
# capture only: raw frames + audio + manifest, no MP4, no enhancement
rec = AVRecorder(fps=30)
rec.record(duration=60, name="interview", make_mp4=False, enhance=False)
# ... later / elsewhere ...
proc = AVRecorder(data_dir="av_data", processed_dir="processed_av",
                  denoise_model="dpdfnet2.onnx")
result = proc.process("interview") # enhance + render
result = proc.process("interview", enhance=False) # render only, no enhancement
result = proc.process("interview", make_mp4=False) # enhance only, no MP4
make_mp4 and enhance toggles exist on record(), stop(), and process():
| call | make_mp4 | enhance |
|---|---|---|
| record(...) / stop(...) | render MP4 at capture end | enhance at capture end (needs denoise_model) |
| process(name, ...) | (re)render MP4 from stored data | None = enhance iff denoise_model set; True/False forces |
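The tri-state enhance flag in the table can be read as the small decision function below. It is a sketch of the documented rule, not the library's code: the function name is hypothetical, and returning a note when enhancement is forced on without a model is an assumption based on the "note in the result, not a crash" behaviour described earlier.

```python
def resolve_enhance(enhance, denoise_model):
    """Return (run_enhancement, note) for the given flag and model setting."""
    if enhance is None:
        # Default: enhance if and only if a model is configured.
        return denoise_model is not None, None
    if enhance and denoise_model is None:
        # Assumed behaviour: skip with a note rather than raise.
        return False, "enhancement requested but no denoise_model set"
    return bool(enhance), None
```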
In external mode (source="external") the recorder never touches the camera or mic;
your pipeline pushes data. The one rule: video and audio timestamps must share one
clock (time.monotonic()).
import time, numpy as np
rec = AVRecorder(source="external", fps=30, sample_rate=48000, channels=1)
rec.start(name="external")
t = time.monotonic()
rec.push_video(bgr_ndarray, t=t) # BGR ndarray -> JPEG-encoded
rec.push_video(jpeg_bytes, t=t) # raw JPEG bytes -> written as-is (no re-encode)
rec.push_audio(int16_samples, t=t) # (n,) or (n, channels); int16 or float[-1,1]
result = rec.stop()
- t is the capture instant of the frame / first audio sample; t=None stamps on arrival. If your source has its own clock, map it to time.monotonic() once and add the offset to every t.
- push_video/push_audio are thread-safe; independent producers can push concurrently.
- External audio needs no sounddevice, external video no OpenCV camera. Use on_full="block" to avoid dropping frames while feeding fast.
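Mapping a source clock onto time.monotonic() once, as the first note suggests, amounts to storing a single offset. A minimal sketch follows; the class name and its parameters are hypothetical.

```python
import time


class ClockMapper:
    """Map timestamps from a source clock onto time.monotonic() via a fixed offset."""

    def __init__(self, source_now, monotonic_now=None):
        # Sample both clocks as close together as possible, once.
        if monotonic_now is None:
            monotonic_now = time.monotonic()
        self.offset = monotonic_now - source_now

    def to_monotonic(self, source_t):
        """Convert one source-clock timestamp to the monotonic timeline."""
        return source_t + self.offset
```

With this, a push would look like `rec.push_video(frame, t=mapper.to_monotonic(frame_pts))`, where `frame_pts` stands for whatever timestamp your source attaches to the frame. A fixed offset assumes the two clocks tick at the same rate; a source clock that drifts would need periodic re-sampling.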
result.name # session name
result.frames_dir # Path to av_data/<name>/frames
result.wav_path # Path or None (raw audio)
result.enhanced_wav_path # Path or None (enhanced audio)
result.manifest_path # Path to manifest.json
result.mp4_path # Path or None
m = result.metrics # BenchmarkMetrics dataclass
m.duration_s # actual elapsed capture window (s)
m.frames # frames captured
m.dropped # frames dropped (queue full, on_full="drop")
m.effective_fps # measured fps from timestamps
m.nominal_fps # configured fps
m.grab_ms_mean # device fetch time (nan in external mode)
m.retrieve_ms_mean # MJPG->BGR decode time
m.write_latency_p95_ms # capture->disk latency, p95
m.write_latency_max_ms
m.queue_max # peak queue depth
m.video_mb # total JPEG bytes written
m.video_mbps
m.audio_enabled
m.audio_samples
m.audio_xruns # input overflows (>0 means dropped audio)
m.audio_block_jitter_ms
m.soundcard_drift_ppm # audio clock vs system clock
m.camera_fps_label_error_ppm # true rate vs nominal label
m.projected_av_skew_ms_per_hour # skew IF you played at nominal fps (the MP4 has none)
m.encoder # "turbojpeg" | "opencv"
m.denoise_enabled
m.denoise_model
m.denoise_out_sr # enhanced sample rate (16000 or 48000)
m.denoise_rt_factor # enhancement time / audio length (<1 = faster than real-time)
m.notes # list of fallback/warning strings
# download a model variant...
path = AVRecorder.download_denoise_model("dpdfnet2") # -> ./dpdfnet2.onnx
url = AVRecorder.denoise_model_url("dpdfnet8") # just the URL
# ...then enable it (runs at stop() unless deferred, or in process())
rec = AVRecorder(fps=30, denoise_model="dpdfnet2.onnx")
result = rec.record(60, name="clean")
result.enhanced_wav_path # the enhanced track
result.metrics.denoise_rt_factor # speed vs real-time
Variants (AVRecorder.DPDFNET_VARIANTS):
| model | output rate | use |
|---|---|---|
| dpdfnet_baseline | 16 kHz | fastest / lowest resource |
| dpdfnet2 | 16 kHz | real-time / embedded |
| dpdfnet4 | 16 kHz | balanced |
| dpdfnet8 | 16 kHz | best quality |
| dpdfnet2_48khz_hr | 48 kHz | high-resolution full-band |
16 kHz models output a 16 kHz mono track (ideal for ASR/diarization); the raw WAV
stays full-band. Use dpdfnet2_48khz_hr to keep the MP4 audio at 48 kHz.
print(AVRecorder.list_audio_devices()) # static; needs sounddevice
AVRecorder(
    # video (device mode)
    width=1280, height=720, fps=30.0, video_device=0,
    fourcc="MJPG", convert_rgb=True, buffersize=None,
    v4l2_controls=("exposure_dynamic_framerate=0",), warmup=15,
    # audio
    audio=True, audio_device=None, sample_rate=48000, channels=1, blocksize=1024,
    # JPEG encode
    encoder="turbojpeg", # or "opencv"
    quality=90, subsample="420", # "gray" | "420" | "422" | "444"
    progressive=False, turbo_lib=None,
    # pipeline
    workers=2, queue_size=64, on_full="drop", # or "block"
    fsync=False,
    # source
    source="device", # or "external"
    # enhancement
    denoise=False, denoise_model=None, denoise_num_threads=1, denoise_provider="cpu",
    # output
    data_dir="av_data", processed_dir="processed_av",
)
Tuning notes: buffersize=None keeps the V4L2 default (do not set 1; it
starves the buffer pool and halves fps). fsync=False uses the OS cache (faster,
fewer stalls); fsync=True forces each write to disk. convert_rgb=False writes
the camera's raw MJPG with no decode/encode (fastest capture-and-save, but no
quality/subsample control). turbo_lib points PyTurboJPEG at a specific
libturbojpeg.so.
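To make the fsync trade-off concrete, here is what forcing a write to disk typically looks like in Python. This is a generic sketch, not the recorder's writer: the function name is hypothetical, and only the standard flush-then-os.fsync pattern is shown.

```python
import os


def write_frame(path, data, fsync=False):
    """Write bytes to path; with fsync=True, wait until the OS has flushed them."""
    with open(path, "wb") as f:
        f.write(data)
        if fsync:
            f.flush()              # push Python's buffer into the OS page cache
            os.fsync(f.fileno())   # ask the OS to flush that cache to the device
    return len(data)
```

Without the fsync call the write returns as soon as the data is in the page cache, which is why fsync=False stalls less; the cost is that very recent frames can be lost on power failure.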
av_recorder.py runs as a script:
# basic 60-second capture (device mode, audio on, MP4 on)
uv run av_recorder.py --duration 60 --name interview
# list input devices, then pick one
uv run av_recorder.py --list-audio
uv run av_recorder.py --duration 60 --audio-device 3
# resolution / fps / video device
uv run av_recorder.py --duration 30 --width 1920 --height 1080 --fps 30 --device 0
# video only / skip the MP4 / skip enhancement
uv run av_recorder.py --duration 30 --no-audio
uv run av_recorder.py --duration 30 --no-mp4
uv run av_recorder.py --duration 30 --no-enhance --denoise-model dpdfnet2.onnx
# DEFER everything, then render/enhance later by name
uv run av_recorder.py --duration 60 --name interview --no-mp4 --no-enhance
uv run av_recorder.py --process interview --denoise-model dpdfnet2.onnx # enhance + render
uv run av_recorder.py --process interview # render only
# enhancement during capture
uv run av_recorder.py --download-model dpdfnet2 # -> ./dpdfnet2.onnx
uv run av_recorder.py --duration 60 --denoise-model dpdfnet2.onnx
# custom output folders
uv run av_recorder.py --duration 30 --data-dir /srv/av_data --processed-dir /srv/processed
All av_recorder.py flags:
| flag | default | meaning |
|---|---|---|
| --duration | 20 | seconds to capture |
| --name | timestamp | session name (folder + MP4 name) |
| --width / --height | 1280 / 720 | capture resolution |
| --fps | 30 | requested frame rate |
| --device | 0 | video device index (/dev/videoN) |
| --audio-device | system default | mic index/name (--list-audio) |
| --no-audio | off | disable audio capture |
| --no-mp4 | off | skip the MP4 (defer; render later with --process) |
| --no-enhance | off | skip enhancement at capture (defer to --process) |
| --process NAME | none | render/enhance an existing recording in --data-dir and exit |
| --data-dir | av_data | raw output root |
| --processed-dir | processed_av | MP4 output root |
| --denoise-model | none | DPDFNet .onnx (used at capture, or with --process) |
| --denoise-provider | cpu | sherpa-onnx provider |
| --download-model VARIANT | none | download a variant and exit |
| --list-audio | off | print input devices and exit |
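If you drive the script from another Python program, the deferred capture-then-process workflow can be expressed as two command lists. The sketch uses only flags from the table above; the helper names are hypothetical, and running the commands assumes uv and av_recorder.py are available in the working directory.

```python
import subprocess


def capture_cmd(name, duration, defer=True):
    """Command line for a capture; defer=True skips MP4 and enhancement."""
    cmd = ["uv", "run", "av_recorder.py", "--duration", str(duration), "--name", name]
    if defer:
        cmd += ["--no-mp4", "--no-enhance"]
    return cmd


def process_cmd(name, denoise_model=None):
    """Command line to render (and enhance, if a model is given) by name."""
    cmd = ["uv", "run", "av_recorder.py", "--process", name]
    if denoise_model is not None:
        cmd += ["--denoise-model", denoise_model]
    return cmd


def run(cmd):
    """Run one command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)
```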
demo_record_then_replay.py is the project's reference demo: it captures with the
heavy work deferred, then processes by name (enhance + render).
# capture now, render/enhance later in two steps
uv run demo_record_then_replay.py --duration 8 --name interview
uv run demo_record_then_replay.py --process interview --denoise-model dpdfnet2.onnx
# capture (deferred) then process immediately, one run
uv run demo_record_then_replay.py --duration 8 --name interview --denoise-model dpdfnet2.onnx
# render only (no enhancement) from an existing recording
uv run demo_record_then_replay.py --process interview
Every frame is stamped with time.monotonic() at capture; audio carries its true
capture time on the same clock. The MP4 is muxed variable-frame-rate from those
timestamps, so a camera whose real rate isn't exactly its nominal label doesn't
drift against audio. The benchmark reports per-device drift (a rate, in ppm)
and the skew you'd get only if you ignored the timestamps and played at nominal fps.
What is not measured: the constant device-latency offset between the two streams (the fixed "audio leads video by N ms" skew). Software timestamps can't recover it — that needs a physical flash + beep reference. This tool gives a shared timeline and true rates, not absolute lip-sync calibration.
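The drift and skew figures are simple rate arithmetic. The sketch below shows one plausible way to derive them from timestamps; the tool's exact formulas and sign conventions are not given here, so the function names and the "video minus audio" convention are assumptions.

```python
def effective_fps(timestamps):
    """Mean frame rate between the first and last frame timestamps."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    return (len(timestamps) - 1) / (timestamps[-1] - timestamps[0])


def rate_error_ppm(measured, nominal):
    """Relative rate error in parts per million (positive = faster than nominal)."""
    return (measured / nominal - 1.0) * 1e6


def skew_ms_per_hour(video_ppm, audio_ppm):
    """Skew accumulated per hour if both streams were played at nominal rates.

    1 ppm of relative rate mismatch is 3.6 ms over 3600 s.
    """
    return (video_ppm - audio_ppm) * 3.6
```

For example, a camera labelled 30 fps that really delivers 30.003 fps is about 100 ppm fast, which would accumulate roughly 360 ms per hour against a perfect audio clock if the timestamps were ignored.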
- Memory on long runs. Both audio and video stream to disk continuously and are released as they go, so RAM stays flat for the whole capture. The only exception: if speech enhancement runs, it reads the finished WAV back into memory once to feed the offline DPDFNet model (a transient spike proportional to clip length, ~5.5 MB/min at 48 kHz mono), since the model needs the complete signal. Defer it (enhance=False/--no-enhance) and run process(name) later if you want capture itself to stay lean. (Small per-frame/per-block timestamp metadata still accrues in RAM during capture, on the order of kilobytes to a few MB per hour.)
- Enhancement is offline. sherpa-onnx's Python API exposes only the offline DPDFNet denoiser, so enhancement runs on the whole buffer (output-equivalent to streaming for a finalize-at-stop recorder). The live streaming denoiser is C-only.
- MP4 is a transcode. Frames are re-encoded to H.264 at render time, so the MP4 isn't a byte copy of the JPEGs. This is post-capture and doesn't affect the live benchmark numbers.
- Linux-focused. Device mode uses V4L2 + PortAudio; external mode is platform-agnostic.
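The ~5.5 MB/min figure in the memory note above follows from the raw PCM size: 48,000 samples/s × 2 bytes × 60 s = 5,760,000 bytes, which is about 5.49 MiB. A small helper (hypothetical name) makes it easy to estimate the spike for other clip lengths; it counts int16 samples only, so any conversion to float inside the enhancer would raise the real number.

```python
def wav_bytes(seconds, sample_rate=48000, channels=1, bytes_per_sample=2):
    """Size of raw PCM audio in bytes (int16 by default)."""
    return int(seconds * sample_rate * channels * bytes_per_sample)


def to_mib(n_bytes):
    """Convert bytes to mebibytes."""
    return n_bytes / 2**20
```

By the same arithmetic, one hour of 48 kHz mono audio is 345,600,000 bytes, or roughly 330 MiB read back at enhancement time.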
Not yet implemented; candidate directions:
- Live streaming enhancement. Wrap sherpa-onnx's C streaming DPDFNet denoiser (libsherpa-onnx-c-api.so, already installed via sherpa-onnx-core) via ctypes so audio can be enhanced during capture, for live monitoring or feeding a real-time transcriber, instead of only at render time.
- Flash + beep calibration. A helper that emits a simultaneous screen flash and audio click, detects both offline, and writes the measured constant A/V offset into the manifest, closing the one gap software timestamps can't (absolute lip-sync).
- Incremental manifest. Stream per-frame/per-block timestamps to disk during capture so even metadata RAM is bounded on marathon (multi-hour) sessions.
- Stream-copy MP4 path. When frames are raw MJPG (convert_rgb=False), offer a remux that avoids the H.264 transcode for a faster, lossless render.
- Hardware video encode. Optional VAAPI/NVENC H.264 at render time for large batch processing.
- Packaging. Flip [tool.uv] package = false to a real build backend so the module can be pip install-ed and versioned as a distributable package.