Now that the lint/security CI is in place (ruff JuergenFleiss#160, bandit JuergenFleiss#167,
pip-audit JuergenFleiss#168), the next piece is an actual test suite. This issue
proposes a concrete, layered setup and — as discussed — I've already
prototyped it and verified it runs green, so the numbers below are
measured, not estimated.
My recommendation is to keep the tests in aTrain for now (not in
aTrain_core), start minimal and iterate — partly because aTrain_core is
expected to fold into aTrain (JuergenFleiss#145), at which point the suites merge.
Otherwise the setup is exactly as described below.
Proposed layers (all verified green in a fork preview; the jobs run in parallel):
unit: torch-/NiceGUI-free helpers (archive.py file handler); runs in seconds
e2e (core pipeline): transcription engine (tiny model, CPU), speaker detection off and on; aTrain_core + CPU torch (~1.5 GB); ~40 s (+~15 s for the diarization case)
e2e (app, full stack): the shipped cu128 stack renders the UI and transcribes through it; full app, locked cu128 + GTK (~7.5 GB); ~1m30s
The split is deliberate: a fast lightweight signal (unit, ~seconds), a
fast engine check that doesn't need the heavy GPU stack (e2e core,
~40 s on CPU torch — the inference is CTranslate2, not torch), and one
heavier job that validates the actual shipped environment installs,
imports, boots and transcribes. On a GPU-less runner the cu128 build
simply runs on CPU, so no GPU runner (which would be billed) is needed.
Tests are separated with a marker: the unit job runs pytest -m "not e2e"
so it never pulls torch/NiceGUI; the e2e jobs run the marked tests.
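A minimal sketch of how that marker split could look in pytest. Only the `e2e` marker name and the `-m "not e2e"` selection come from the proposal; the test names and bodies are hypothetical placeholders, not the tests in the draft PR.

```python
import pytest


# In a real repo this hook lives in conftest.py; it is shown inline for brevity.
def pytest_configure(config):
    # Register the marker so that --strict-markers accepts it.
    config.addinivalue_line(
        "markers", "e2e: end-to-end test that needs the engine or the full app"
    )


def test_unit_placeholder():
    # Unmarked: selected by `pytest -m "not e2e"` (the unit job).
    assert "archive".upper() == "ARCHIVE"


@pytest.mark.e2e
def test_e2e_placeholder():
    # Marked: deselected by `pytest -m "not e2e"`, selected by `pytest -m e2e`.
    # Heavy imports (torch, NiceGUI) would go inside the body, so that merely
    # collecting this module in the unit job does not import them.
    pass
```

Keeping the heavy imports inside the marked test bodies (or in separate modules) is what lets the unit job stay free of torch and NiceGUI.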
Test data
A ~3 s public-domain LibriVox clip, trimmed from the existing sample_data/SampleAudio.mp3, added under tests/fixtures/. Transcribing
it with the tiny model produces "This is a LibreVox recording." The E2E
assertion is a smoke check, not accuracy: it requires the transcript
body to contain at least 3 words (plus that the expected output files
were written). WER/accuracy stays out of scope here (tracked in JuergenFleiss#147).
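The smoke assertion described above could be sketched roughly as follows. The helper name, its parameters and the file name used in the usage note are assumptions for illustration; the actual output file names written by aTrain_core are not stated here.

```python
from pathlib import Path


def check_smoke_output(
    output_dir: Path,
    expected_files: list[str],
    transcript_file: str,
    min_words: int = 3,
) -> None:
    """Smoke check: expected files exist and the transcript has enough words."""
    # 1) Every expected output file must have been written.
    missing = [name for name in expected_files if not (output_dir / name).is_file()]
    assert not missing, f"missing output files: {missing}"

    # 2) The transcript body must contain at least `min_words` words.
    #    This deliberately does not compare against a reference text.
    text = (output_dir / transcript_file).read_text(encoding="utf-8")
    words = text.split()
    assert len(words) >= min_words, f"transcript too short: {text!r}"
```

A test would call it as, for example, `check_smoke_output(out_dir, ["transcription.txt"], "transcription.txt")`, where `transcription.txt` is a hypothetical file name. Counting words instead of matching the exact sentence keeps the check stable if the tiny model's output varies slightly.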
Speaker detection (diarization) is exercised too: the core job runs the
transcription both with and without it, asserting a SPEAKER_ label appears
when it's on (a smoke check, not an exact speaker count). It stays cheap —
the speaker model is a public 32 MB download (aTrain-core/speaker-detection,
no HF token, pre-loaded waveform so no system ffmpeg) — so it runs per PR,
no nightly needed.
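The diarization smoke check can be expressed as a small predicate. This is a sketch under assumptions: the proposal only says a SPEAKER_ label must appear, so the suffix pattern (as in `SPEAKER_00`) and the transcript line format in the usage note are guesses.

```python
import re

# Assumption: labels look like "SPEAKER_00"; only the "SPEAKER_" prefix is
# taken from the proposal text.
SPEAKER_LABEL = re.compile(r"\bSPEAKER_\w+")


def has_speaker_label(transcript: str) -> bool:
    """Return True if at least one speaker label occurs in the transcript."""
    return SPEAKER_LABEL.search(transcript) is not None
```

With speaker detection on, the test would assert `has_speaker_label(text)`; it intentionally does not count speakers, matching the smoke-check scope.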
Decision 1 — the UI test
Two options, both prototyped:
(A) Boot smoke: run aTrain start --no-native and poll :8080 for
HTTP 200. Dead simple, no model, no test framework. ~7 s.
(B) NiceGUI User-fixture click-through (recommended): renders the
real page in-process (no browser) and drives a transcription through
the app's real wiring (start_transcription → run.cpu_bound →
finished dialog), tiny model on CPU. ~6 s.
Recommendation: (B). It costs about the same but covers the actual
upload→transcribe→result path, not just "does the server answer". No
Selenium/Chromium needed. (A) remains a fine fallback if we want the
lightest possible smoke.
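For reference, the polling half of option (A) needs only the standard library. The sketch below is illustrative, not the prototype's code; the function name, the timeout and the interval values are assumptions.

```python
import time
import urllib.request


def wait_for_http_200(url: str, timeout: float = 60.0, interval: float = 0.5) -> bool:
    """Poll `url` until it answers HTTP 200 or `timeout` seconds have passed."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval + 1.0) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Covers connection refused, socket timeouts and urllib's
            # URLError/HTTPError (all OSError subclasses): server not ready yet.
            pass
        time.sleep(interval)
    return False
```

In a CI step the app would be started first (for example with `subprocess.Popen` running `aTrain start --no-native`), then `wait_for_http_200("http://127.0.0.1:8080/")` decides pass or fail; the loopback host is an assumption.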
Decision 2 — lightweight core test and/or full-stack UI test
Keep both — a lightweight aTrain_core test and the UI-based full-stack
test — because they cover different things:
e2e (core pipeline) exercises aTrain_core (the engine) on its own,
lightweight (CPU torch, ~1.5 GB) — fast (~40 s), quick per-PR feedback. It
installs aTrain_core standalone (@develop) today; once aTrain_core folds
into aTrain (JuergenFleiss/aTrain#145) this becomes a direct in-repo test of that code, with no
separate install.
e2e (app, full stack) runs through the UI on the locked cu128 stack
— the exact dependency set we ship — catching ABI / lock / install
regressions the lightweight job can't. The whole job is only ~1m30s
per PR, so it's comfortably affordable to keep in the per-PR loop.
Decision 3 — unit tests
Start with one or two simple, torch-/NiceGUI-free helpers (the archive
file handler is a good first target). Keep it minimal; iterate once the
pipeline stands. These run in the lightweight unit job in seconds.
Open question for maintainers
The whole suite (tiny model + the small public diarization model) is cheap
enough to run per PR — the heaviest job is only ~1m30s — so no nightly
is needed for the coverage above. The one thing worth a maintainer call:
do you want an optional nightly that additionally runs the full sample_data/SampleAudio.mp3 (or an even longer sample) with speaker
detection on — the heavier, more realistic path — kept out of the per-PR
loop? The per-PR jobs already include a short-clip diarization case, so
this would purely be "exercise a bigger input overnight".
Plan
The recommended setup is implemented in the linked draft PR #7
(jobs + tests + the sample clip) — so you can see the real diff and the CI
running green. Open to discussion; if the approach looks good, I'll finalize
the draft PR for review.
Maps to BSI IT-Grundschutz
CON.8 §3.2.5 (Funktionstests und Sicherheitstests) — the E2E tests are
the "Funktionstest" half, complementing the automated code analysis
(ruff / bandit / pip-audit) already in place.