sandergs92/PySAVR

av-recorder

Synchronized audio/video capture for Linux that, in one call, benchmarks the pipeline, saves raw frames + audio to av_data/, and writes a synchronized MP4 to processed_av/. It runs from real devices (V4L2 camera + PortAudio mic) or from externally pushed frames/audio, can apply DPDFNet speech enhancement, and lets you defer enhancement/rendering so you can capture now and render later by name.

Frames are JPEG-encoded by a worker pool and timestamped on CLOCK_MONOTONIC; audio carries its true capture time; both stream to disk continuously. The MP4 is built variable-frame-rate from those timestamps, so the camera's real rate is honored and audio stays aligned.
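
As an illustration of what "variable-frame-rate from timestamps" can mean in practice, here is a minimal sketch that turns per-frame capture times into a playlist for ffmpeg's concat demuxer, where each `file` entry carries its own `duration`. This is an assumption about one workable approach, not a description of how av_recorder builds its MP4; `ffconcat_lines` is a hypothetical helper.

```python
def ffconcat_lines(frame_names, timestamps):
    """Build concat-demuxer lines that give each frame its real display time.

    Hypothetical helper: timestamps are capture instants in seconds on one clock.
    """
    if len(frame_names) != len(timestamps):
        raise ValueError("need exactly one timestamp per frame")
    lines = ["ffconcat version 1.0"]
    if not frame_names:
        return lines
    # Each frame is shown until the next frame's capture instant.
    durations = [b - a for a, b in zip(timestamps, timestamps[1:])]
    # The last frame has no successor; reuse the previous gap (an assumption).
    durations.append(durations[-1] if durations else 0.0)
    for name, dur in zip(frame_names, durations):
        lines.append(f"file '{name}'")
        lines.append(f"duration {dur:.6f}")
    return lines


example = ffconcat_lines(
    ["f_000001.jpg", "f_000002.jpg", "f_000003.jpg"],
    [10.000, 10.033, 10.070],
)
print("\n".join(example))
```

Because the durations come from measured gaps rather than from 1/fps, a camera that delivers frames slightly faster or slower than its label still lines up with audio recorded on the same clock.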


Install

System packages (Linux):

sudo apt install ffmpeg v4l-utils          # ffmpeg = MP4 mux; v4l-utils = camera controls
# PyTurboJPEG 2.x also needs libjpeg-turbo >= 3.0 (official .deb), else pin PyTurboJPEG<2

Python (with uv, using the provided pyproject.toml):

uv sync --extra all                        # everything
# or pick what you need:
uv sync --extra audio --extra enhance      # mic + DPDFNet

Each optional piece degrades gracefully: no sounddevice → video-only, no PyTurboJPEG → OpenCV encode, no sherpa-onnx/ffmpeg → enhancement/MP4 skipped (with a note in the result, not a crash).
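
To see ahead of time which fallbacks will apply on a given machine, a small probe along these lines may help. It is a sketch, not part of av_recorder: `detect_capabilities` is a hypothetical helper, and it assumes the usual import names (`sounddevice`, `turbojpeg` for PyTurboJPEG, `sherpa_onnx` for sherpa-onnx) and an `ffmpeg` binary on PATH.

```python
import importlib.util
import shutil


def detect_capabilities():
    """Report which optional pieces are importable/installed (hypothetical helper)."""

    def has(module):
        # find_spec returns None for a missing top-level module (no import happens).
        return importlib.util.find_spec(module) is not None

    return {
        "audio": has("sounddevice"),       # missing -> video-only
        "turbojpeg": has("turbojpeg"),     # missing -> OpenCV encode
        "enhance": has("sherpa_onnx"),     # missing -> enhancement skipped
        "mp4": shutil.which("ffmpeg") is not None,  # missing -> MP4 skipped
    }


for piece, ok in detect_capabilities().items():
    print(f"{piece:10s} {'available' if ok else 'missing'}")
```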


Output layout

av_data/<name>/frames/f_000001.jpg ...   raw JPEG frames (capture order)
av_data/<name>/audio.wav                 raw PCM (int16), streamed to disk
av_data/<name>/audio_enhanced.wav        DPDFNet output (only if enhancement run)
av_data/<name>/manifest.json             config + per-frame timestamps + metrics
processed_av/<name>.mp4                   synchronized H.264 + AAC

data_dir/processed_dir default to relative av_data/processed_av under the current directory; pass absolute paths for fixed locations.


API usage

from av_recorder import AVRecorder, RecordingResult

1. Record a fixed duration (device mode)

rec = AVRecorder(width=1280, height=720, fps=30)
result = rec.record(duration=60, name="interview")

print(result.metrics)        # printable benchmark summary
print(result.mp4_path)       # processed_av/interview.mp4

2. Open-ended capture: start() / stop()

Stop on a keypress, event, signal — not a timer:

rec = AVRecorder(fps=30)
rec.start(name="session")
# ... wait on anything: input(), a threading.Event, a socket message, SIGTERM ...
result = rec.stop()          # finalizes raw files (+ MP4/enhancement unless deferred)
rec.is_recording()           # -> False

Guards: start() while recording raises; stop() without start() raises. The recorder is reusable — loop start/stop; each session gets its own <name>/.
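
A sketch of the "wait on anything" pattern with a threading.Event, written against the start()/stop() calls shown above. `record_until` is a hypothetical wrapper, and the poll interval is an arbitrary choice; the try/finally is there so the raw files are finalized even if the wait is interrupted.

```python
import threading


def record_until(rec, stop_event, name, poll_s=0.1):
    """Start `rec`, wait for `stop_event`, and always finalize with stop().

    `rec` is anything with start(name=...) and stop(), e.g. an AVRecorder.
    """
    rec.start(name=name)
    try:
        # Wake periodically so KeyboardInterrupt is handled promptly.
        while not stop_event.wait(poll_s):
            pass
    finally:
        result = rec.stop()
    return result
```

A signal handler or another thread only has to call `stop_event.set()`; for example, `signal.signal(signal.SIGTERM, lambda *_: stop_event.set())` in the main thread.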

3. Context manager

with AVRecorder(fps=30) as rec:      # __enter__ calls start()
    wait_for_event()
    result = rec.stop()              # or let __exit__ stop it on the way out

4. Defer rendering/enhancement, then process later by name

Capture cheaply and skip the heavy work, then render/enhance whenever you want, even in another process or on another machine. process(name) finds av_data/<name>/, reads the manifest timestamps + frames + audio.wav, and produces the enhanced track and/or MP4. It opens no devices and does not re-capture or re-JPEG-encode anything: the stored JPEGs are read as-is and are transcoded to H.264 only when the MP4 is rendered.

# capture only: raw frames + audio + manifest, no MP4, no enhancement
rec = AVRecorder(fps=30)
rec.record(duration=60, name="interview", make_mp4=False, enhance=False)

# ... later / elsewhere ...
proc = AVRecorder(data_dir="av_data", processed_dir="processed_av",
                  denoise_model="dpdfnet2.onnx")
result = proc.process("interview")            # enhance + render
result = proc.process("interview", enhance=False)   # render only, no enhancement
result = proc.process("interview", make_mp4=False)  # enhance only, no MP4

make_mp4 and enhance toggles exist on record(), stop(), and process():

| call | make_mp4 | enhance |
| --- | --- | --- |
| record(...) / stop(...) | render MP4 at capture end | enhance at capture end (needs denoise_model) |
| process(name, ...) | (re)render MP4 from stored data | None = enhance iff denoise_model set; True/False forces |
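
For batch rendering, the output layout makes it easy to find sessions that were captured with rendering deferred. The sketch below assumes only the documented layout (av_data/<name>/manifest.json and processed_av/<name>.mp4); `pending_sessions` is a hypothetical helper, not an av_recorder API.

```python
from pathlib import Path


def pending_sessions(data_dir="av_data", processed_dir="processed_av"):
    """Names under data_dir that have a manifest but no rendered MP4 yet."""
    data_root, out_root = Path(data_dir), Path(processed_dir)
    if not data_root.is_dir():
        return []
    return sorted(
        d.name
        for d in data_root.iterdir()
        if (d / "manifest.json").is_file()
        and not (out_root / f"{d.name}.mp4").exists()
    )
```

Each returned name can then be handed to `proc.process(name)` in a loop.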

5. External feed — supply your own frames/audio (no devices opened)

In external mode the recorder never opens the camera or mic; your pipeline pushes data. The one rule: video and audio timestamps must share one clock (time.monotonic()).

import time, numpy as np

rec = AVRecorder(source="external", fps=30, sample_rate=48000, channels=1)
rec.start(name="external")

t = time.monotonic()
rec.push_video(bgr_ndarray, t=t)        # BGR ndarray -> JPEG-encoded
rec.push_video(jpeg_bytes,  t=t)        # raw JPEG bytes -> written as-is (no re-encode)
rec.push_audio(int16_samples, t=t)      # (n,) or (n, channels); int16 or float[-1,1]

result = rec.stop()
  • t is the capture instant of the frame / first audio sample; t=None stamps on arrival. If your source has its own clock, map it to time.monotonic() once and add the offset to every t.
  • push_video/push_audio are thread-safe — independent producers can push concurrently.
  • External audio needs no sounddevice, external video no OpenCV camera. Use on_full="block" to avoid dropping frames while feeding fast.
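
The "map it once and add the offset" advice can be sketched as follows. `ClockMapper` is a hypothetical helper: it samples both clocks at one instant, stores the difference, and applies it to every later source timestamp. It assumes the source clock ticks in seconds and does not drift noticeably against time.monotonic() over the session.

```python
import time


class ClockMapper:
    """Translate timestamps from a source clock onto time.monotonic()."""

    def __init__(self, source_now, monotonic_now=None):
        if monotonic_now is None:
            monotonic_now = time.monotonic()
        # Measured once: monotonic = source + offset.
        self.offset = monotonic_now - source_now

    def to_monotonic(self, source_t):
        return source_t + self.offset


# Example with made-up numbers: the source clock read 1000.0 s when
# time.monotonic() read 50.0 s.
mapper = ClockMapper(source_now=1000.0, monotonic_now=50.0)
print(mapper.to_monotonic(1000.5))
```

The mapped value is what would be passed as `t=` to push_video/push_audio.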

6. Reading the result and metrics

result.name                  # session name
result.frames_dir            # Path to av_data/<name>/frames
result.wav_path              # Path or None (raw audio)
result.enhanced_wav_path     # Path or None (enhanced audio)
result.manifest_path         # Path to manifest.json
result.mp4_path              # Path or None

m = result.metrics           # BenchmarkMetrics dataclass
m.duration_s                 # actual elapsed capture window (s)
m.frames                     # frames captured
m.dropped                    # frames dropped (queue full, on_full="drop")
m.effective_fps              # measured fps from timestamps
m.nominal_fps                # configured fps
m.grab_ms_mean               # device fetch time (nan in external mode)
m.retrieve_ms_mean           # MJPG->BGR decode time
m.write_latency_p95_ms       # capture->disk latency, p95
m.write_latency_max_ms
m.queue_max                  # peak queue depth
m.video_mb                   # total JPEG bytes written
m.video_mbps
m.audio_enabled
m.audio_samples
m.audio_xruns                # input overflows (>0 means dropped audio)
m.audio_block_jitter_ms
m.soundcard_drift_ppm        # audio clock vs system clock
m.camera_fps_label_error_ppm # true rate vs nominal label
m.projected_av_skew_ms_per_hour  # skew IF you played at nominal fps (the MP4 has none)
m.encoder                    # "turbojpeg" | "opencv"
m.denoise_enabled
m.denoise_model
m.denoise_out_sr             # enhanced sample rate (16000 or 48000)
m.denoise_rt_factor          # enhancement time / audio length (<1 = faster than real-time)
m.notes                      # list of fallback/warning strings
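
A sketch of a post-capture sanity check built on the fields listed above. `capture_warnings` and its threshold are hypothetical, and it assumes m.frames counts only frames that were written (so frames + dropped is the total offered); a SimpleNamespace stands in for a real BenchmarkMetrics so the example runs without hardware.

```python
from types import SimpleNamespace


def capture_warnings(m, max_drop_ratio=0.01):
    """Collect human-readable problems from a metrics-like object."""
    warnings = []
    total = m.frames + m.dropped  # assumption: frames excludes dropped ones
    if total and m.dropped / total > max_drop_ratio:
        warnings.append(f"dropped {m.dropped} of {total} frames")
    if m.audio_enabled and m.audio_xruns > 0:
        warnings.append(f"{m.audio_xruns} audio overflow(s): audio was dropped")
    warnings.extend(m.notes)  # fallback/warning strings reported by the recorder
    return warnings


fake = SimpleNamespace(frames=95, dropped=5, audio_enabled=True,
                       audio_xruns=0, notes=["ffmpeg missing"])
print(capture_warnings(fake))
```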

7. Speech enhancement (DPDFNet)

# download a model variant...
path = AVRecorder.download_denoise_model("dpdfnet2")       # -> ./dpdfnet2.onnx
url  = AVRecorder.denoise_model_url("dpdfnet8")            # just the URL

# ...then enable it (runs at stop() unless deferred, or in process())
rec = AVRecorder(fps=30, denoise_model="dpdfnet2.onnx")
result = rec.record(60, name="clean")
result.enhanced_wav_path                                   # the enhanced track
result.metrics.denoise_rt_factor                           # speed vs real-time

Variants (AVRecorder.DPDFNET_VARIANTS):

| model | output rate | use |
| --- | --- | --- |
| dpdfnet_baseline | 16 kHz | fastest / lowest resource |
| dpdfnet2 | 16 kHz | real-time / embedded |
| dpdfnet4 | 16 kHz | balanced |
| dpdfnet8 | 16 kHz | best quality |
| dpdfnet2_48khz_hr | 48 kHz | high-resolution full-band |

16 kHz models output a 16 kHz mono track (ideal for ASR/diarization); the raw WAV stays full-band. Use dpdfnet2_48khz_hr to keep the MP4 audio at 48 kHz.

8. List audio devices

print(AVRecorder.list_audio_devices())     # static; needs sounddevice

9. Constructor reference

AVRecorder(
    # video (device mode)
    width=1280, height=720, fps=30.0, video_device=0,
    fourcc="MJPG", convert_rgb=True, buffersize=None,
    v4l2_controls=("exposure_dynamic_framerate=0",), warmup=15,
    # audio
    audio=True, audio_device=None, sample_rate=48000, channels=1, blocksize=1024,
    # JPEG encode
    encoder="turbojpeg",   # or "opencv"
    quality=90, subsample="420",   # "gray" | "420" | "422" | "444"
    progressive=False, turbo_lib=None,
    # pipeline
    workers=2, queue_size=64, on_full="drop",   # or "block"
    fsync=False,
    # source
    source="device",       # or "external"
    # enhancement
    denoise=False, denoise_model=None, denoise_num_threads=1, denoise_provider="cpu",
    # output
    data_dir="av_data", processed_dir="processed_av",
)

Tuning notes:

  • buffersize=None keeps the V4L2 default. Do not set 1: it starves the buffer pool and halves fps.
  • fsync=False uses the OS cache (faster, fewer stalls); fsync=True forces each write to disk.
  • convert_rgb=False writes the camera's raw MJPG with no decode/encode (fastest capture-and-save, but no quality/subsample control).
  • turbo_lib points PyTurboJPEG at a specific libturbojpeg.so.
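
The queue_size / on_full pair describes a bounded queue between capture and the encode workers. The sketch below shows the two policies with the standard library's queue.Queue; it illustrates the trade-off only and is not av_recorder's internal code (`offer` is a hypothetical helper).

```python
import queue


def offer(q, item, on_full="drop"):
    """Put `item` on a bounded queue; return False if it was dropped."""
    if on_full == "block":
        q.put(item)  # waits for a free slot: back-pressure on the producer
        return True
    if on_full == "drop":
        try:
            q.put_nowait(item)
            return True
        except queue.Full:
            return False  # producer keeps its pace, the item is lost
    raise ValueError(f"unknown on_full policy: {on_full!r}")


q = queue.Queue(maxsize=2)
print([offer(q, i) for i in range(3)])
```

"drop" keeps a live camera loop from stalling at the cost of lost frames; "block" loses nothing but slows the producer, which suits external feeds that can wait.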


CLI usage

av_recorder.py runs as a script:

# basic 60-second capture (device mode, audio on, MP4 on)
uv run av_recorder.py --duration 60 --name interview

# list input devices, then pick one
uv run av_recorder.py --list-audio
uv run av_recorder.py --duration 60 --audio-device 3

# resolution / fps / video device
uv run av_recorder.py --duration 30 --width 1920 --height 1080 --fps 30 --device 0

# video only / skip the MP4 / skip enhancement
uv run av_recorder.py --duration 30 --no-audio
uv run av_recorder.py --duration 30 --no-mp4
uv run av_recorder.py --duration 30 --no-enhance --denoise-model dpdfnet2.onnx

# DEFER everything, then render/enhance later by name
uv run av_recorder.py --duration 60 --name interview --no-mp4 --no-enhance
uv run av_recorder.py --process interview --denoise-model dpdfnet2.onnx   # enhance + render
uv run av_recorder.py --process interview                                # render only

# enhancement during capture
uv run av_recorder.py --download-model dpdfnet2                # -> ./dpdfnet2.onnx
uv run av_recorder.py --duration 60 --denoise-model dpdfnet2.onnx

# custom output folders
uv run av_recorder.py --duration 30 --data-dir /srv/av_data --processed-dir /srv/processed

All av_recorder.py flags:

| flag | default | meaning |
| --- | --- | --- |
| --duration | 20 | seconds to capture |
| --name | timestamp | session name (folder + MP4 name) |
| --width / --height | 1280 / 720 | capture resolution |
| --fps | 30 | requested frame rate |
| --device | 0 | video device index (/dev/videoN) |
| --audio-device | system default | mic index/name (--list-audio) |
| --no-audio | off | disable audio capture |
| --no-mp4 | off | skip the MP4 (defer; render later with --process) |
| --no-enhance | off | skip enhancement at capture (defer to --process) |
| --process NAME | none | render/enhance an existing recording in --data-dir and exit |
| --data-dir | av_data | raw output root |
| --processed-dir | processed_av | MP4 output root |
| --denoise-model | none | DPDFNet .onnx (used at capture, or with --process) |
| --denoise-provider | cpu | sherpa-onnx provider |
| --download-model VARIANT | none | download a variant and exit |
| --list-audio | off | print input devices and exit |

Demo

demo_record_then_replay.py is the project's reference demo: it captures with the heavy work deferred, then processes by name (enhance + render).

# capture now, render/enhance later in two steps
uv run demo_record_then_replay.py --duration 8 --name interview
uv run demo_record_then_replay.py --process interview --denoise-model dpdfnet2.onnx

# capture (deferred) then process immediately, one run
uv run demo_record_then_replay.py --duration 8 --name interview --denoise-model dpdfnet2.onnx

# render only (no enhancement) from an existing recording
uv run demo_record_then_replay.py --process interview

How synchronization works (and its limit)

Every frame is stamped with time.monotonic() at capture; audio carries its true capture time on the same clock. The MP4 is muxed variable-frame-rate from those timestamps, so a camera whose real rate isn't exactly its nominal label doesn't drift against audio. The benchmark reports per-device drift (a rate, in ppm) and the skew you'd get only if you ignored the timestamps and played at nominal fps.

What is not measured: the constant device-latency offset between the two streams (the fixed "audio leads video by N ms" skew). Software timestamps can't recover it — that needs a physical flash + beep reference. This tool gives a shared timeline and true rates, not absolute lip-sync calibration.
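
To make the rate figures concrete, here is a sketch of how a measured frame rate, a label error in ppm, and a projected skew per hour can be derived from per-frame timestamps. The formulas and sign convention are assumptions for illustration (the tool's own metrics may be computed differently, and this version ignores soundcard drift); `rate_report` is a hypothetical helper.

```python
def rate_report(timestamps, nominal_fps):
    """Derive measured rate and label error from per-frame capture times (s)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    span_s = timestamps[-1] - timestamps[0]
    if span_s <= 0:
        raise ValueError("timestamps must increase")
    # N timestamps span N-1 frame intervals.
    effective_fps = (len(timestamps) - 1) / span_s
    label_error_ppm = (effective_fps / nominal_fps - 1.0) * 1e6
    # One real hour of frames played at nominal fps lasts
    # effective/nominal hours; the difference is the skew against audio.
    skew_ms_per_hour = label_error_ppm * 1e-6 * 3600.0 * 1000.0
    return {
        "effective_fps": effective_fps,
        "label_error_ppm": label_error_ppm,
        "skew_ms_per_hour": skew_ms_per_hour,
    }


# A camera labelled 30 fps that really delivers 29.97 fps.
ts = [i / 29.97 for i in range(301)]
print(rate_report(ts, nominal_fps=30.0))
```

In this example the label is off by about 1000 ppm, which would accumulate to roughly 3.6 seconds of skew per hour if the frames were played at the nominal rate; muxing from the timestamps avoids that.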

Notes & limitations

  • Memory on long runs. Both audio and video stream to disk continuously and are released as they go, so RAM stays flat for the whole capture. The only exception: if speech enhancement runs, it reads the finished WAV back into memory once to feed the offline DPDFNet model (a transient spike proportional to clip length, ~5.5 MB/min at 48 kHz mono), since the model needs the complete signal. Defer it (enhance=False / --no-enhance) and run process(name) later if you want capture itself to stay lean. (Small per-frame/per-block timestamp metadata still accrues in RAM during capture — kilobytes-to-low-MB per hour.)
  • Enhancement is offline. sherpa-onnx's Python API exposes only the offline DPDFNet denoiser, so enhancement runs on the whole buffer (output-equivalent to streaming for a finalize-at-stop recorder). The live streaming denoiser is C-only.
  • MP4 is a transcode. Frames are re-encoded to H.264 at render time, so the MP4 isn't a byte copy of the JPEGs. This is post-capture and doesn't affect the live benchmark numbers.
  • Linux-focused. Device mode uses V4L2 + PortAudio; external mode is platform-agnostic.
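
The memory figure in the first note can be checked with simple arithmetic, assuming the WAV is held as int16 samples (2 bytes each); a float32 copy would be twice as large. `wav_bytes_per_min` is a hypothetical helper.

```python
def wav_bytes_per_min(sample_rate=48000, channels=1, bytes_per_sample=2):
    """Bytes of PCM audio produced per minute of recording."""
    return sample_rate * channels * bytes_per_sample * 60


per_min = wav_bytes_per_min()
print(f"{per_min / 2**20:.2f} MiB/min, {per_min * 60 / 2**20:.0f} MiB/hour")
```

At 48 kHz mono int16 this comes to about 5.5 MiB per minute, or about 330 MiB for an hour-long clip.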

Future roadmap

Not yet implemented; candidate directions:

  • Live streaming enhancement. Wrap sherpa-onnx's C streaming DPDFNet denoiser (libsherpa-onnx-c-api.so, already installed via sherpa-onnx-core) via ctypes so audio can be enhanced during capture — for live monitoring or feeding a real-time transcriber — instead of only at render time.
  • Flash + beep calibration. A helper that emits a simultaneous screen flash and audio click, detects both offline, and writes the measured constant A/V offset into the manifest — closing the one gap software timestamps can't (absolute lip-sync).
  • Incremental manifest. Stream per-frame/per-block timestamps to disk during capture so even metadata RAM is bounded on marathon (multi-hour) sessions.
  • Stream-copy MP4 path. When frames are raw MJPG (convert_rgb=False), offer a remux that avoids the H.264 transcode for a faster, lossless render.
  • Hardware video encode. Optional VAAPI/NVENC H.264 at render time for large batch processing.
  • Packaging. Flip [tool.uv] package = false to a real build backend so the module can be pip install-ed and versioned as a distributable package.
