Test strategy for aTrain: unit + E2E (transcription & UI) — proposal

Now that the lint/security CI is in place (ruff #160, bandit #167,
pip-audit #168), the next piece is an actual test suite. This issue
proposes a concrete, layered setup. As discussed, I've already
prototyped it and verified that it runs green, so the numbers below are
measured, not estimated.

I recommend keeping the tests in **aTrain** for now (not in
aTrain_core), starting minimal and iterating. aTrain_core is expected to
fold into aTrain (#145), at which point the suites would merge anyway.
The rest of the setup is as described below.

## Proposed layers (all verified green in a fork preview, jobs run in parallel)

| Job | What it covers | Install | Time |
|-----|----------------|---------|------|
| `unit` | torch-/NiceGUI-free helpers (e.g. `archive.py` file handler) | no heavy dependencies | seconds |
| `e2e (core pipeline)` | transcription engine (tiny model, CPU), speaker detection **off and on** | aTrain_core + **CPU** torch (~1.5 GB) | ~40 s (+~15 s for the diarization case) |
| `e2e (app, full stack)` | the shipped **cu128** stack renders the UI **and** transcribes through it | full app, locked cu128 + GTK (~7.5 GB) | ~1m30s |

The split is deliberate: a fast, lightweight signal (`unit`, seconds), a
fast engine check that doesn't need the heavy GPU stack
(`e2e (core pipeline)`, ~40 s on CPU torch, since inference runs through
CTranslate2, not torch), and one heavier job that validates that the
*actual shipped* environment installs, imports, boots and transcribes.
On a GPU-less runner the cu128 build simply runs on CPU, so no GPU runner
(which would be billed) is needed.

Tests are separated with the `e2e` pytest marker: the `unit` job runs
`pytest -m "not e2e"` so it never pulls torch/NiceGUI; the e2e jobs run
the marked tests.
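
One way the marker split could be wired up is a small `conftest.py`. This is a sketch, not the code in the draft PR: it assumes the E2E tests live in a directory named `e2e` (a hypothetical layout) and marks them by location instead of decorating each test. `addinivalue_line` and `add_marker` are standard pytest hook APIs.

```python
# conftest.py (sketch): register the "e2e" marker and apply it by location.
from pathlib import Path

E2E_DIR_NAME = "e2e"  # assumption: E2E tests are collected from tests/e2e/


def pytest_configure(config):
    # Registering the marker avoids "unknown marker" warnings and keeps
    # `pytest --strict-markers` usable.
    config.addinivalue_line(
        "markers", "e2e: end-to-end test that needs torch and/or NiceGUI"
    )


def pytest_collection_modifyitems(config, items):
    # Mark every test collected from a directory named "e2e", so that
    # `pytest -m "not e2e"` deselects them and `pytest -m e2e` selects them.
    for item in items:
        if E2E_DIR_NAME in Path(item.path).parts:
            item.add_marker("e2e")
```

Note that `-m` deselects tests after collection, so the E2E modules are still imported in the `unit` job. For that job to stay torch-/NiceGUI-free, heavy imports would need to sit inside the test functions or fixtures, or be guarded with `pytest.importorskip`.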

### Test data

The test clip is a ~3 s public-domain LibriVox recording, trimmed from the
existing `sample_data/SampleAudio.mp3` and added under `tests/fixtures/`.
Transcribing it with the tiny model produces "This is a LibreVox recording."
The E2E assertion is a **smoke** check, not an accuracy check: it requires
the transcript body to contain **at least 3 words** and the expected output
files to have been written. WER/accuracy stays out of scope here (tracked
in #147).
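
The smoke assertion described above could look roughly like this. It is a minimal sketch: the helper name `assert_smoke` and the file name in the usage example are placeholders, not taken from the draft PR, and "words" are counted as whitespace-separated tokens.

```python
from pathlib import Path

MIN_WORDS = 3  # threshold stated in this proposal


def assert_smoke(transcript: str, output_dir: Path, expected_files: list[str]) -> None:
    """Smoke check only: enough words came out and the output files exist."""
    words = transcript.split()
    assert len(words) >= MIN_WORDS, f"transcript too short: {transcript!r}"

    # Report all missing files at once rather than failing on the first one.
    missing = [name for name in expected_files if not (output_dir / name).is_file()]
    assert not missing, f"missing output files: {missing}"
```

A test would call it as `assert_smoke(text, out_dir, ["transcription.txt"])`, where the list holds whatever output files the run is expected to produce. The check deliberately says nothing about which words were recognised.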

Speaker detection (diarization) is exercised too: the core job runs the
transcription both with and without it, asserting that a `SPEAKER_` label
appears when it's on (a smoke check, not an exact speaker count). It stays
cheap: the speaker model is a public 32 MB download
(`aTrain-core/speaker-detection`) that needs no HF token, and the audio is
passed as a pre-loaded waveform, so no system ffmpeg is required. It can
therefore run per PR, with no nightly needed.
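
The diarization check can be sketched as a small helper that looks for the `SPEAKER_` prefix in the transcript text. The function names and the sample strings in the usage note are illustrative assumptions; the actual transcript layout may differ.

```python
import re

# Matches the "SPEAKER_" prefix plus any suffix (assumed to look like SPEAKER_00).
SPEAKER_LABEL = re.compile(r"SPEAKER_\w*")


def speaker_labels(transcript: str) -> set[str]:
    """Return the distinct speaker labels found in the transcript text."""
    return set(SPEAKER_LABEL.findall(transcript))


def check_diarization(transcript: str, speaker_detection: bool) -> None:
    # Smoke check: with speaker detection on, at least one label must appear.
    # The number of speakers is intentionally not asserted.
    if speaker_detection:
        assert speaker_labels(transcript), "no SPEAKER_ label in transcript"
```

For example, `check_diarization("SPEAKER_00: This is a LibreVox recording.", True)` passes, while the same call on text without a label raises `AssertionError`. With `speaker_detection=False` the helper asserts nothing.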

## Decision 1 — the UI test

Two options, both prototyped:

- **(A) Boot smoke**: start `aTrain start --no-native`, poll `:8080` for
  HTTP 200. Dead simple, no model, no test framework. ~7 s.
- **(B) NiceGUI `User`-fixture click-through** *(recommended)*: renders the
  real page in-process (no browser) **and** drives a transcription through
  the app's real wiring (`start_transcription` → `run.cpu_bound` →
  finished dialog), tiny model on CPU. ~6 s.

**Recommendation: (B).** It costs about the same but covers the actual
upload→transcribe→result path, not just "does the server answer". No
Selenium/Chromium needed. (A) remains a fine fallback if we want the
lightest possible smoke.
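
For reference, option (A) can be sketched with the standard library alone. The command and port come from the description above; the function names, timeouts and polling interval are assumptions chosen for illustration.

```python
import subprocess
import time
import urllib.request


def wait_for_http_200(url: str, timeout_s: float = 30.0, interval_s: float = 0.5) -> bool:
    """Poll `url` until it answers with HTTP 200 or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Covers URLError/HTTPError and refused connections:
            # the server is not ready yet, so keep polling.
            pass
        time.sleep(interval_s)
    return False


def boot_smoke(url: str = "http://127.0.0.1:8080") -> bool:
    """Start the app without the native window and wait for the first 200."""
    proc = subprocess.Popen(["aTrain", "start", "--no-native"])
    try:
        return wait_for_http_200(url)
    finally:
        proc.terminate()
        proc.wait(timeout=10)
```

A CI step would then only need `assert boot_smoke()`. This confirms that the server answers, but, unlike (B), it exercises none of the transcription wiring.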

## Decision 2 — lightweight core test and/or full-stack UI test

Keep **both** — a lightweight aTrain_core test *and* the UI-based full-stack
test — because they cover different things:
- `e2e (core pipeline)` exercises **aTrain_core** (the engine) on its own.
  It is lightweight (CPU torch, ~1.5 GB) and fast (~40 s), so it gives quick
  per-PR feedback. It installs aTrain_core standalone (`@develop`) today;
  once aTrain_core folds into aTrain (#145) this becomes a direct in-repo
  test of that code, with no separate install.
- `e2e (app, full stack)` runs through the **UI on the locked cu128 stack**
  — the exact dependency set we ship — catching ABI / lock / install
  regressions the lightweight job can't. The whole job is only **~1m30s**
  per PR, so it's comfortably affordable to keep in the per-PR loop.

## Decision 3 — unit tests

Start with one or two simple, torch-/NiceGUI-free helpers (the archive
file handler is a good first target). Keep it minimal; iterate once the
pipeline stands. These run in the lightweight `unit` job in seconds.

## Open question for maintainers

The whole suite (tiny model + the small public diarization model) is cheap
enough to run **per PR** — the heaviest job is only ~1m30s — so no nightly
is needed for the coverage above. The one thing worth a maintainer call:
do you want an **optional nightly** that additionally runs the full
`sample_data/SampleAudio.mp3` (or an even longer sample) **with speaker
detection on** — the heavier, more realistic path — kept out of the per-PR
loop? The per-PR jobs already include a short-clip diarization case, so
this would purely be "exercise a bigger input overnight".

## Plan

The recommended setup is implemented in the linked **draft PR #7**
(jobs + tests + the sample clip) — so you can see the real diff and the CI
running green. Open to discussion; if the approach looks good, I'll finalize
the draft PR for review.

## Maps to BSI IT-Grundschutz

- CON.8 §3.2.5 (Funktionstests und Sicherheitstests) — the E2E tests are
  the "Funktionstest" half, complementing the automated code analysis
  (ruff / bandit / pip-audit) already in place.

