Test strategy for aTrain: unit + E2E (transcription & UI) — proposal #9

@BW-Projects

Now that the lint/security CI is in place (ruff JuergenFleiss#160, bandit JuergenFleiss#167,
pip-audit JuergenFleiss#168), the next piece is an actual test suite. This issue
proposes a concrete, layered setup and — as discussed — I've already
prototyped it and verified it runs green, so the numbers below are
measured, not estimated.

My recommendation is to keep the tests in aTrain for now (not in
aTrain_core), start minimal and iterate — partly because aTrain_core is
expected to fold into aTrain (JuergenFleiss#145), at which point the suites merge.
Otherwise the setup is exactly as described below.

Proposed layers (all verified green in a fork preview, jobs run in parallel)

| Job | What it covers | Install | Time |
|---|---|---|---|
| unit | torch-/NiceGUI-free helpers (e.g. archive.py file handler) | nothing heavy | seconds |
| e2e (core pipeline) | transcription engine (tiny model, CPU), speaker detection off and on | aTrain_core + CPU torch (~1.5 GB) | ~40 s (+~15 s for the diarization case) |
| e2e (app, full stack) | the shipped cu128 stack renders the UI and transcribes through it | full app, locked cu128 + GTK (~7.5 GB) | ~1m30s |

The split is deliberate: a fast lightweight signal (unit, ~seconds), a
fast engine check that doesn't need the heavy GPU stack (e2e core,
~40 s on CPU torch — the inference is CTranslate2, not torch), and one
heavier job that validates the actual shipped environment installs,
imports, boots and transcribes. On a GPU-less runner the cu128 build
simply runs on CPU, so no GPU runner (which would be billed) is needed.

Tests are separated with a marker: the unit job runs pytest -m "not e2e"
so it never pulls torch/NiceGUI; the e2e jobs run the marked tests.
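
A minimal sketch of how the marker split could be wired, assuming a conftest.py at the test root and a hypothetical tests/e2e/ directory (neither the layout nor these hooks is taken from the draft PR). One caveat it is meant to surface: -m "not e2e" deselects tests only after collection, so pytest still imports every test module it collects. Heavy imports therefore need to stay inside the e2e test functions, or the unit job needs to skip collecting that directory.

```python
# conftest.py -- sketch only; the tests/e2e/ layout is an assumption.

E2E_MARKER_DOC = "e2e: end-to-end test (needs torch/NiceGUI and a model download)"


def pytest_configure(config):
    # Register the marker so `pytest --strict-markers` accepts it.
    config.addinivalue_line("markers", E2E_MARKER_DOC)


def pytest_collection_modifyitems(config, items):
    # Mark every test that lives under a directory named "e2e", so the
    # individual tests do not each need an @pytest.mark.e2e decorator.
    # `item.path` is a pathlib.Path (available in pytest 7 and later).
    for item in items:
        if "e2e" in item.path.parts:
            item.add_marker("e2e")
```

With this in place the unit job would run pytest -m "not e2e" and the e2e jobs pytest -m e2e.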

Test data

A ~3 s public-domain LibriVox clip, trimmed from the existing
sample_data/SampleAudio.mp3, added under tests/fixtures/. Transcribing
it with the tiny model produces "This is a LibreVox recording." The E2E
assertion is a smoke check, not an accuracy check: it requires that the
transcript body contains at least 3 words and that the expected output
files were written. WER/accuracy stays out of scope here (tracked in
JuergenFleiss#147).
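
The smoke check described above could look roughly like the helper below. It is a sketch: the function name and the file names passed to it (for example transcription.txt) are hypothetical and stand in for whatever aTrain_core actually writes.

```python
from pathlib import Path

MIN_WORDS = 3  # threshold stated in the proposal


def smoke_check(output_dir, transcript_name, expected_files):
    """Return a list of problems; an empty list means the smoke check passed."""
    out = Path(output_dir)
    problems = [
        f"missing output file: {name}"
        for name in expected_files
        if not (out / name).is_file()
    ]
    transcript = out / transcript_name
    if not transcript.is_file():
        problems.append(f"missing transcript: {transcript_name}")
        return problems
    # Simplification: counts whitespace-separated tokens in the whole file,
    # not only the transcript body.
    n_words = len(transcript.read_text(encoding="utf-8").split())
    if n_words < MIN_WORDS:
        problems.append(f"transcript has {n_words} words, expected at least {MIN_WORDS}")
    return problems
```

An E2E test would then end with something like assert smoke_check(out_dir, "transcription.txt", [...]) == [], which reports every failed condition at once instead of stopping at the first.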

Speaker detection (diarization) is exercised too: the core job runs the
transcription both with and without it, asserting a SPEAKER_ label appears
when it's on (a smoke check, not an exact speaker count). It stays cheap —
the speaker model is a public 32 MB download (aTrain-core/speaker-detection,
no HF token, pre-loaded waveform so no system ffmpeg) — so it runs per PR,
no nightly needed.
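
For the diarization case, the label assertion can be kept equally loose. The sketch below assumes labels have the form SPEAKER_ followed by an identifier (for example SPEAKER_00); the exact suffix format is an assumption, and the helper name is hypothetical.

```python
import re

# Assumed label shape: "SPEAKER_" plus at least one word character.
SPEAKER_LABEL = re.compile(r"SPEAKER_\w+")


def speaker_labels(transcript_text):
    """Return the set of distinct speaker labels found in the transcript."""
    return set(SPEAKER_LABEL.findall(transcript_text))
```

With speaker detection on, the test would only assert that speaker_labels(text) is non-empty, which keeps it independent of how many speakers the model decides to report.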

Decision 1 — the UI test

Two options, both prototyped:

  • (A) Boot smoke: launch the app with aTrain start --no-native, then
    poll :8080 for HTTP 200. Dead simple, no model, no test framework. ~7 s.
  • (B) NiceGUI User-fixture click-through (recommended): renders the
    real page in-process (no browser) and drives a transcription through
    the app's real wiring (start_transcription → run.cpu_bound →
    finished dialog), tiny model on CPU. ~6 s.

Recommendation: (B). It costs about the same but covers the actual
upload→transcribe→result path, not just "does the server answer". No
Selenium/Chromium needed. (A) remains a fine fallback if we want the
lightest possible smoke.
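
For reference, the polling half of option (A) needs nothing beyond the standard library. The function below is a sketch, not the prototype's code; its name, timeout, and interval are illustrative assumptions.

```python
import http.client
import time
import urllib.request


def wait_for_http_200(url, timeout_s=30.0, interval_s=0.5):
    """Poll `url` until it answers HTTP 200; return False if the deadline passes."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2.0) as resp:
                if resp.status == 200:
                    return True
        except (OSError, http.client.HTTPException):
            # Server not ready yet: connection refused, reset, timeout, or a
            # non-2xx answer (urllib raises HTTPError, an OSError subclass).
            pass
        time.sleep(interval_s)
    return False


# Usage sketch (not executed here):
#   proc = subprocess.Popen(["aTrain", "start", "--no-native"])
#   try:
#       assert wait_for_http_200("http://127.0.0.1:8080/")
#   finally:
#       proc.terminate()
```

Returning a boolean instead of raising keeps the cleanup in the caller's finally block straightforward, so the server process is terminated whether or not the page ever came up.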

Decision 2 — lightweight core test and/or full-stack UI test

Keep both — a lightweight aTrain_core test and the UI-based full-stack
test — because they cover different things:

  • e2e (core pipeline) exercises aTrain_core (the engine) on its own.
    It is lightweight (CPU torch, ~1.5 GB) and fast (~40 s), so it gives
    quick per-PR feedback. It installs aTrain_core standalone (@develop)
    today; once aTrain_core folds into aTrain (JuergenFleiss/aTrain#145,
    "Decide long-term pinning strategy for aTrain_core dependency") this
    becomes a direct in-repo test of that code, with no separate install.
  • e2e (app, full stack) runs through the UI on the locked cu128 stack
    — the exact dependency set we ship — catching ABI / lock / install
    regressions the lightweight job can't. The whole job is only ~1m30s
    per PR, so it's comfortably affordable to keep in the per-PR loop.

Decision 3 — unit tests

Start with one or two simple, torch-/NiceGUI-free helpers (the archive
file handler is a good first target). Keep it minimal; iterate once the
pipeline stands. These run in the lightweight unit job in seconds.

Open question for maintainers

The whole suite (tiny model + the small public diarization model) is cheap
enough to run per PR — the heaviest job is only ~1m30s — so no nightly
is needed for the coverage above. The one thing worth a maintainer call:
do you want an optional nightly that additionally runs the full
sample_data/SampleAudio.mp3 (or an even longer sample) with speaker
detection on, i.e. the heavier, more realistic path, kept out of the
per-PR loop? The per-PR jobs already include a short-clip diarization
case, so this would purely be "exercise a bigger input overnight".

Plan

The recommended setup is implemented in the linked draft PR #7
(jobs + tests + the sample clip) — so you can see the real diff and the CI
running green. Open to discussion; if the approach looks good, I'll finalize
the draft PR for review.

Maps to BSI IT-Grundschutz

  • CON.8 §3.2.5 (Funktionstests und Sicherheitstests) — the E2E tests are
    the "Funktionstest" half, complementing the automated code analysis
    (ruff / bandit / pip-audit) already in place.
