Test Code Health Analysis #128

OmerGronich · 2026-04-16T10:28:12Z

The Idea

Fallow's health model analyzes production code: complexity, maintainability, hotspots, coupling, dead code. But test code quality is the other half of the equation. A codebase can have 100% line coverage and still have a 30% mutation score, because tests can execute code without verifying its behavior. Structurally poor test files (500-line tests, duplicated setup, zero assertions) give false confidence and are expensive to maintain.

What if fallow could analyze test code health the same way it analyzes production code health?

Two Complementary Dimensions

1. Structural test quality — "is the test well-written?"

The test-code counterpart to production complexity analysis. Score each test case based on structural complexity and spec-design smells:

No assertions — test executes code but verifies nothing
Low assertion density — 1 assertion in a 30-line test
Multiple phases — test does too much (multiple act/assert cycles)
High mocking — over-isolation, brittle to refactors
Large tests — hard to understand failure cause
Hidden complexity — complexity moved to helpers, not removed
Duplicated setup — repeated arrange scaffolding across tests

Roll up to file-level scores. Distinguish harmful duplication (repeated setup worth extracting) from coverage-matrix repetition (many small tests covering input variations — often intentional and fine).
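
A minimal sketch of how several of the smells above might be flagged once per-test metrics have been extracted from the AST. The `TestCaseMetrics` shape, the smell names, the roll-up rule, and every threshold are hypothetical and chosen only for illustration; they are not fallow's or SCRAP's actual values.

```typescript
// Hypothetical per-test metrics, assumed to come from an earlier AST pass.
interface TestCaseMetrics {
  name: string;
  lines: number;      // lines in the test body
  assertions: number; // expect()/assert calls found
  mocks: number;      // mock/spy/stub calls found
  actPhases: number;  // separate act/assert cycles
}

type Smell =
  | "no-assertions"
  | "low-assertion-density"
  | "multiple-phases"
  | "high-mocking"
  | "large-test";

// Illustrative thresholds only; real values would need tuning.
const LIMITS = { minDensity: 0.1, maxPhases: 1, maxMocks: 5, maxLines: 50 };

function detectSmells(t: TestCaseMetrics): Smell[] {
  const smells: Smell[] = [];
  if (t.assertions === 0) {
    smells.push("no-assertions");
  } else if (t.assertions / t.lines < LIMITS.minDensity) {
    smells.push("low-assertion-density");
  }
  if (t.actPhases > LIMITS.maxPhases) smells.push("multiple-phases");
  if (t.mocks > LIMITS.maxMocks) smells.push("high-mocking");
  if (t.lines > LIMITS.maxLines) smells.push("large-test");
  return smells;
}

// File-level roll-up (assumed rule): fraction of test cases with no smells.
function fileScore(tests: TestCaseMetrics[]): number {
  if (tests.length === 0) return 1;
  const clean = tests.filter((t) => detectSmells(t).length === 0).length;
  return clean / tests.length;
}

const example: TestCaseMetrics[] = [
  { name: "adds items", lines: 8, assertions: 2, mocks: 0, actPhases: 1 },
  { name: "checkout flow", lines: 30, assertions: 1, mocks: 7, actPhases: 3 },
  { name: "renders", lines: 5, assertions: 0, mocks: 0, actPhases: 1 },
];
```

Here only the first test is clean, so the file scores one third. Hidden complexity and duplicated setup are left out because they need cross-test and cross-file information that this per-test shape does not carry.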

This is the approach taken by SCRAP (Robert C. Martin) for Clojure specs. The concepts are language-agnostic and should translate to JS/TS test frameworks.

2. Test effectiveness — "does the test catch bugs?"

Mutation testing answers this by making small changes to production code (flipping conditions, removing statements, changing operators) and checking if any test fails. If a mutation survives, that's a gap.
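
To make the mechanism concrete, here is a toy sketch in which the mutants are written out by hand rather than generated from an AST. `isAdult`, the mutants, and both test suites are hypothetical. The weak suite executes every line of `isAdult` yet kills no mutant, while the strong suite kills all of them.

```typescript
type Fn = (age: number) => boolean;

// Production code under test.
const isAdult: Fn = (age) => age >= 18;

// Hand-written mutants standing in for what a mutation tool would generate.
const mutants: Fn[] = [
  (age) => age > 18, // boundary changed: >= became >
  (age) => age < 18, // comparison flipped
  () => true,        // condition replaced by a constant
];

// A suite returns true when every check passes for the given implementation.
type Suite = (impl: Fn) => boolean;

// Executes the code (full line coverage) but asserts nothing about the result.
const weakSuite: Suite = (impl) => {
  impl(18);
  impl(5);
  return true;
};

// Checks the boundary value and both sides of it.
const strongSuite: Suite = (impl) =>
  impl(18) === true && impl(17) === false && impl(30) === true;

// Mutation score: percentage of mutants for which the suite fails.
function mutationScore(suite: Suite, candidates: Fn[]): number {
  const killed = candidates.filter((m) => !suite(m)).length;
  return (killed / candidates.length) * 100;
}

console.log(mutationScore(weakSuite, mutants));   // 0
console.log(mutationScore(strongSuite, mutants)); // 100
```

Both suites pass against the unmodified `isAdult`, which is why line coverage alone cannot tell them apart.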

Key patterns from mutate4java (also Robert C. Martin) that are relevant:

Coverage-filtered mutation — skip mutation sites on uncovered lines (no point mutating code no test reaches)
Differential mutation — hash each function scope, only re-mutate changed scopes on subsequent runs
AST-based mutation operators — true↔false, ==↔!=, +↔-, &&↔||, !expr→expr, 0↔1, rvalue→null
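
The differential idea can be sketched as a hash manifest compared between runs. This is an assumption-laden illustration: the `Scopes` input shape and the function names are hypothetical, and the FNV-1a hash is only a dependency-free stand-in, since the thread does not describe how mutate4java actually hashes scopes.

```typescript
// FNV-1a 32-bit hash, used here only as a dependency-free stand-in for a real content hash.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  return (hash >>> 0).toString(16);
}

// Function name -> source text of that function scope (hypothetical input shape).
type Scopes = Record<string, string>;
// Function name -> hash recorded on a previous run.
type Manifest = Record<string, string>;

function buildManifest(scopes: Scopes): Manifest {
  const manifest: Manifest = {};
  for (const name of Object.keys(scopes)) {
    manifest[name] = fnv1a(scopes[name]);
  }
  return manifest;
}

// Returns the scopes whose hash is new or different, i.e. the ones worth re-mutating.
function scopesToMutate(previous: Manifest, current: Scopes): string[] {
  const now = buildManifest(current);
  return Object.keys(now).filter((name) => previous[name] !== now[name]);
}

const firstRun: Scopes = {
  add: "function add(a, b) { return a + b; }",
  isAdult: "function isAdult(age) { return age >= 18; }",
};
const secondRun: Scopes = {
  ...firstRun,
  isAdult: "function isAdult(age) { return age >= 16; }",
};

const changed = scopesToMutate(buildManifest(firstRun), secondRun);
console.log(changed); // ["isAdult"]
```

Coverage filtering would be a second filter applied to the same list: drop any mutation site whose line has no hits in the coverage data.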

The mutation score (% of mutations killed) is a much stronger signal than line coverage. It could enhance the existing CRAP formula by taking the place of line coverage: CC² × (1 − mutation_score/100)³ + CC, where CC is cyclomatic complexity and mutation_score is a percentage from 0 to 100.
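
A short worked example of the formula, assuming the standard CRAP definition with the percentage term supplied either by line coverage or by mutation score. The function and variable names are hypothetical.

```typescript
// CRAP-style score: cc^2 * (1 - pct/100)^3 + cc, where pct is either
// line coverage or mutation score, both on a 0-100 scale.
function crap(cc: number, pct: number): number {
  const unverified = 1 - pct / 100;
  return cc * cc * Math.pow(unverified, 3) + cc;
}

const complexity = 10;
const lineCoveragePct = 100;  // every line executed
const mutationScorePct = 30;  // but only 30% of mutants killed

console.log(crap(complexity, lineCoveragePct));  // 10
console.log(crap(complexity, mutationScorePct)); // about 44.3
```

With full line coverage the score collapses to the complexity itself, while the mutation-weighted variant for the same function is more than four times higher.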

How This Fits

Fallow already has the building blocks:

Test file detection — hotspot entries already tag test paths with [test]
Coverage consumption — --coverage already accepts Istanbul data
AST parsing — Oxc already parses the test files
Health scoring — the per-file maintainability index pattern already exists

The structural analysis side (smell detection, duplication) is pure AST pattern matching — no type resolution needed, same as production complexity analysis. The mutation side could start by consuming reports from existing tools like Stryker.

Combined, this would give a complete picture: not just "is our code healthy?" but "are our tests healthy enough to keep it that way?"

Why Now

AI code generators produce tests that pass but are often structurally weak — large examples, shallow assertions, duplicated arrange/act patterns. As AI-generated test code grows, structural test quality becomes a scaling problem that manual review can't keep up with.

References

SCRAP (Robert C. Martin) — structural quality analyzer for test code
mutate4java (Robert C. Martin) — mutation testing with differential mutation and coverage filtering
Stryker Mutator — mutation testing for JavaScript/TypeScript
Mutation Testing Elements — standardized mutation report schema

BartWaardenburg (Maintainer) · 2026-04-20T09:09:52Z

Thanks for the detailed write-up. The SCRAP and mutate4java references sent me down a good rabbit hole. Sharing where I've landed after thinking it through against what fallow's existing infrastructure can realistically back.

On consuming vs building for mutation testing: going with consume. Stryker's mutation-testing-elements schema is stable, and fallow already consumes Istanbul the same way, so a --mutation-report stryker.json path is a natural extension. Building natively would mean reinventing ten years of Stryker's operator work with no chance of a sub-second payoff, since mutation testing cannot run in sub-second time regardless of engine. On the CRAP variant specifically: I'd keep it additive rather than replace. Line-coverage CRAP stays the default for everyone, and a mutation-weighted variant shows up as a separate field when the report is present. Retroactively shifting headline CRAP numbers for existing users is a bigger support cost than the extra field.
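
A rough sketch of what the consuming side could look like. The types cover only the subset of the mutation-testing-elements report used here, the score follows Stryker's documented definition as I understand it (detected over detected plus undetected), and the `FunctionHealth` shape with its `crapMutation` field is a hypothetical name for the additive field, not fallow's actual output.

```typescript
// Subset of the mutation-testing-elements report; only the fields used here.
type MutantStatus =
  | "Killed" | "Survived" | "NoCoverage" | "Timeout"
  | "CompileError" | "RuntimeError" | "Ignored" | "Pending";

interface MutationReport {
  files: Record<string, { mutants: { id: string; status: MutantStatus }[] }>;
}

// Score per file: detected / (detected + undetected) * 100, where
// detected = Killed + Timeout and undetected = Survived + NoCoverage.
// Files with no scorable mutants are omitted from the result.
function scoreByFile(report: MutationReport): Record<string, number> {
  const scores: Record<string, number> = {};
  for (const path of Object.keys(report.files)) {
    let detected = 0;
    let undetected = 0;
    for (const m of report.files[path].mutants) {
      if (m.status === "Killed" || m.status === "Timeout") detected++;
      else if (m.status === "Survived" || m.status === "NoCoverage") undetected++;
    }
    const total = detected + undetected;
    if (total > 0) scores[path] = (detected / total) * 100;
  }
  return scores;
}

// Hypothetical output shape: the mutation-weighted field is optional and additive.
interface FunctionHealth {
  crap: number;          // line-coverage CRAP, always present
  crapMutation?: number; // present only when a mutation report was supplied
}

const sampleReport: MutationReport = {
  files: {
    "src/cart.ts": {
      mutants: [
        { id: "1", status: "Killed" },
        { id: "2", status: "Survived" },
        { id: "3", status: "NoCoverage" },
        { id: "4", status: "Timeout" },
        { id: "5", status: "Ignored" },
      ],
    },
    "src/broken.ts": { mutants: [{ id: "6", status: "CompileError" }] },
  },
};

const withoutReport: FunctionHealth = { crap: 12 };
console.log(scoreByFile(sampleReport), withoutReport);
```

Keeping `crapMutation` optional means existing consumers of the JSON output see no change unless they opt in by passing a report.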

On the structural smells, going through your list:

No assertions + dead expect(). On the roadmap first, solo, as error. Committed .only/.skip/.todo folded in for the same cost.
Low assertion density. On the roadmap as a per-file JSON metric rather than a rule finding (density-as-alert has a rough track record in SonarQube and CodeClimate). Paired with an assertion-strength classification (strong / shallow / mock / snapshot / structural) so "low density" can be interpreted against what kind of assertions are actually there. The classifier needs framework-aware overrides to avoid misclassifying matchers like toBeInTheDocument, so the matcher overrides auto-activate from package.json enablers (testing-library, supertest, Angular TestBed, NestJS TestingModule, Cypress) to keep it zero-config.
Multiple act/assert phases. Reframed on the roadmap as test cyclomatic complexity > 1. A test with branching, looping, or try-catch around expects is the thing that's actually broken; "multiple phases" as a LOC ratio tends to misfire on integration tests and Playwright's test.step(). CC on test bodies reuses the existing engine directly. Angular's fakeAsync/tick would be exempt.
High mocking. Parked for now. The signal is real, but it breaks structurally on dependency-injection frameworks where TestingModule.overrideProvider(...) or TestBed.configureTestingModule({ providers: [...] }) look like 10+ mocks per test without being over-mocked. Would revisit if fallow grows a test-framework-plugin layer that can classify DI wiring separately from genuine mock-assertions.
Large tests. Folded into the complexity signal above. CC is a better proxy for "hard to diagnose" than raw LOC; long setup is legitimate, long decision structure is not.
Hidden complexity in helpers. Fallow's existing hotspot engine already does this on any path. Widening the scope so test-utils/, __helpers__/, etc. are first-class hotspot surface is a config tweak rather than a new feature.
Duplicated arrange/setup. Respectfully skipping. The testing literature on this is genuinely split (DAMP over DRY: each test tells its own story, and helper abstractions break dozens of tests per refactor). Your own distinction, harmful duplication vs coverage-matrix repetition, is exactly the hard boundary, and I don't have a confident way to draw it objectively without a lot of framework-specific heuristics for it.each/describe.each/test.each. If someone ships a credible classifier for this, I'm happy to be wrong.
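
The assertion-strength classification with auto-activated overrides, described under low assertion density above, might look roughly like this. Everything here is a sketch under assumptions: which matcher lands in which class, the fallback for unknown matchers, and treating `toBeInTheDocument` as strong are all illustrative choices, not fallow's actual tables.

```typescript
type Strength = "strong" | "shallow" | "mock" | "snapshot" | "structural";

// Illustrative default mapping for common Jest matchers (assumed assignments).
const DEFAULTS: [string, Strength][] = [
  ["toBe", "strong"],
  ["toEqual", "strong"],
  ["toStrictEqual", "strong"],
  ["toBeDefined", "shallow"],
  ["toBeTruthy", "shallow"],
  ["toHaveBeenCalled", "mock"],
  ["toHaveBeenCalledWith", "mock"],
  ["toMatchSnapshot", "snapshot"],
  ["toMatchInlineSnapshot", "snapshot"],
  ["toHaveLength", "structural"],
  ["toHaveProperty", "structural"],
];

// Overrides keyed by the package.json dependency that enables them.
const OVERRIDES = new Map<string, [string, Strength][]>([
  [
    "@testing-library/jest-dom",
    [
      ["toBeInTheDocument", "strong"],
      ["toHaveTextContent", "strong"],
    ],
  ],
]);

// Builds a classifier from the dependency names found in package.json.
function buildClassifier(dependencies: string[]): (matcher: string) => Strength {
  const table = new Map<string, Strength>(DEFAULTS);
  for (const dep of dependencies) {
    for (const [matcher, strength] of OVERRIDES.get(dep) ?? []) {
      table.set(matcher, strength);
    }
  }
  // Unknown matchers fall back to "shallow" (an assumed, conservative default).
  return (matcher) => table.get(matcher) ?? "shallow";
}

const plain = buildClassifier([]);
const withJestDom = buildClassifier(["@testing-library/jest-dom"]);

console.log(plain("toBeInTheDocument"));       // "shallow"
console.log(withJestDom("toBeInTheDocument")); // "strong"
```

The same matcher is classified differently depending on whether the enabling dependency is present, which is the misclassification the overrides are meant to prevent.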

New signal not in your list, worth flagging because it's the one nobody else can compute: per-module test vs production cyclomatic-complexity ratio. Using fallow's module graph (which ESLint and Stryker structurally don't have), you can say "this module has 2000 LOC of complex production code and a test-CC of 3." A structural signal of under-test-ness that doesn't depend on runtime coverage at all. Percentile banners rather than absolute thresholds. On the roadmap.
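
One way the ratio and its percentile ranking could be computed, as a sketch only. The `ModuleComplexity` shape is hypothetical, and attributing test complexity to a module through the tests that import it is my assumption about how the module graph would be used.

```typescript
interface ModuleComplexity {
  path: string;
  prodCC: number; // summed cyclomatic complexity of the production module
  testCC: number; // summed cyclomatic complexity of tests attributed to it (assumed: via imports)
}

// Test-to-production complexity ratio. A module with no production
// complexity has nothing to under-test, so it ranks as fully covered.
function ratio(m: ModuleComplexity): number {
  return m.prodCC === 0 ? Infinity : m.testCC / m.prodCC;
}

// Percentile rank (0-100): share of modules with a strictly lower ratio.
function percentileRanks(modules: ModuleComplexity[]): Record<string, number> {
  const ratios = modules.map(ratio);
  const ranks: Record<string, number> = {};
  modules.forEach((m, i) => {
    const below = ratios.filter((r) => r < ratios[i]).length;
    ranks[m.path] = (below / modules.length) * 100;
  });
  return ranks;
}

const modules: ModuleComplexity[] = [
  { path: "src/billing.ts", prodCC: 120, testCC: 3 },
  { path: "src/cart.ts", prodCC: 40, testCC: 30 },
  { path: "src/utils.ts", prodCC: 10, testCC: 12 },
  { path: "src/auth.ts", prodCC: 60, testCC: 30 },
];

const ranks = percentileRanks(modules);
console.log(ranks["src/billing.ts"]); // 0
```

Ranking within the codebase rather than against a fixed cutoff means `src/billing.ts` is flagged because it is the least-tested module relative to its peers, whatever the absolute ratio is.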

Snapshot-only files. Also on the roadmap. toMatchSnapshot can give you 100% coverage with zero behavior verification, so it is worth surfacing as its own signal, with glob opt-outs for golden-output suites (__snapshots__/, *.golden.test.ts).
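
A small sketch of the snapshot-only check. The `TestFile` shape is hypothetical, and the opt-outs are passed as regular expressions assumed to have been compiled from the configured globs; glob compilation itself is left out.

```typescript
// Jest's snapshot matchers.
const SNAPSHOT_MATCHERS = new Set([
  "toMatchSnapshot",
  "toMatchInlineSnapshot",
  "toThrowErrorMatchingSnapshot",
  "toThrowErrorMatchingInlineSnapshot",
]);

// Hypothetical input: a test file and the matcher names found in it.
interface TestFile {
  path: string;
  matchers: string[];
}

// True when the file asserts something, every assertion is a snapshot
// matcher, and the path is not covered by an opt-out pattern.
function isSnapshotOnly(file: TestFile, optOut: RegExp[] = []): boolean {
  if (optOut.some((re) => re.test(file.path))) return false;
  return (
    file.matchers.length > 0 &&
    file.matchers.every((m) => SNAPSHOT_MATCHERS.has(m))
  );
}

// Stand-in for the compiled form of *.golden.test.ts.
const goldenOptOut = [/\.golden\.test\.ts$/];

const files: TestFile[] = [
  { path: "src/card.test.ts", matchers: ["toMatchSnapshot", "toMatchSnapshot"] },
  { path: "src/report.golden.test.ts", matchers: ["toMatchSnapshot"] },
  { path: "src/cart.test.ts", matchers: ["toMatchSnapshot", "toEqual"] },
];

const flagged = files.filter((f) => isSnapshotOnly(f, goldenOptOut)).map((f) => f.path);
console.log(flagged); // ["src/card.test.ts"]
```

Files with no matchers at all are deliberately not flagged here, since they belong to the separate no-assertions signal.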

Framing-wise: landing this as "test quality complements coverage" rather than "AI test health." Coverage measures reach ("did the line run?"); these signals measure verification ("was the result checked?"). A codebase with 90% coverage and no assertions has 0% meaningful coverage, and that's the gap. Works for AI-generated tests, works for hand-written ones.

Surface-wise: the structural signals would feed the existing per-file health score as penalty weights, the same way complexity and coupling already do, so they'd appear in the default fallow run as reasons a file's health dropped, with fallow explain naming the specific signal. No new composite score to maintain.

Really appreciate the nudge on this. The structural-quality-plus-mutation-effectiveness framing is exactly right.

1 reply

OmerGronich Apr 21, 2026
Author

That's great, especially the CC ratio idea. Looking forward to it.
