Test Code Health Analysis #128
Thanks for the detailed write-up. The SCRAP and mutate4java references sent me down a good rabbit hole. Sharing where I've landed after thinking it through against what fallow's existing infrastructure can realistically back.
On consuming vs building for mutation testing. Going with consume. Stryker's
On the structural smells, going through your list:
New signal not in your list, worth flagging because it's the one nobody else can compute: per-module test vs production cyclomatic-complexity ratio. Using fallow's module graph (which ESLint and Stryker structurally don't have), you can say "this module has 2000 LOC of complex production code and a test-CC of 3." A structural signal of under-test-ness that doesn't depend on runtime coverage at all. Percentile banners rather than absolute thresholds. On the roadmap.
Snapshot-only files. Also on the roadmap.
Framing-wise: landing this as "test quality complements coverage" rather than "AI test health." Coverage measures reach ("did the line run?"); these signals measure verification ("was the result checked?"). A codebase with 90% coverage and no assertions has 0% meaningful coverage, and that's the gap. Works for AI-generated tests, works for hand-written ones.
Surface-wise: the structural signals would feed the existing per-file health score as penalty weights, the same way complexity and coupling already do, so they'd appear in the default
Really appreciate the nudge on this. The structural-quality-plus-mutation-effectiveness framing is exactly right.
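For illustration, the per-module test vs production complexity ratio mentioned in this reply could be ranked by percentile roughly as follows. This is a minimal sketch, not fallow's implementation: the `ModuleComplexity` shape, the function names, and the cutoff are all hypothetical, and it assumes the module graph already yields summed cyclomatic complexity per module.

```typescript
// Hypothetical input shape: fallow's real module-graph output is not shown in this thread.
interface ModuleComplexity {
  module: string;
  productionCC: number; // summed cyclomatic complexity of the production code
  testCC: number;       // summed cyclomatic complexity of the tests attributed to it
}

interface RatedModule extends ModuleComplexity {
  ratio: number;      // testCC / productionCC
  percentile: number; // share of modules with a strictly lower ratio, 0..100
}

// Rank every module against the rest of the codebase instead of a fixed threshold.
function rateModules(modules: ModuleComplexity[]): RatedModule[] {
  const rated = modules
    .filter((m) => m.productionCC > 0) // nothing to under-test in an empty module
    .map((m) => ({ ...m, ratio: m.testCC / m.productionCC }));
  return rated.map((m) => ({
    ...m,
    percentile:
      (100 * rated.filter((o) => o.ratio < m.ratio).length) / rated.length,
  }));
}

// Modules whose ratio falls below the given percentile, least tested first.
function leastTested(
  modules: ModuleComplexity[],
  cutoffPercentile: number
): string[] {
  return rateModules(modules)
    .filter((m) => m.percentile < cutoffPercentile)
    .sort((a, b) => a.ratio - b.ratio)
    .map((m) => m.module);
}
```

Ranking by percentile keeps the signal relative to the codebase itself, so a project with uniformly light tests does not flag every module at once.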
The Idea
Fallow's health model analyzes production code — complexity, maintainability, hotspots, coupling, dead code. But test code quality is the other half of the equation. A codebase can have 100% line coverage and still have a 30% mutation score — tests execute code but don't verify behavior. Structurally poor test files (500-line tests, duplicated setup, zero assertions) give false confidence and are expensive to maintain.
What if fallow could analyze test code health the same way it analyzes production code health?
Two Complementary Dimensions
1. Structural test quality — "is the test well-written?"
The test-code counterpart to production complexity analysis. Score each test case based on structural complexity and spec-design smells:
Roll up to file-level scores. Distinguish harmful duplication (repeated setup worth extracting) from coverage-matrix repetition (many small tests covering input variations — often intentional and fine).
This is the approach taken by SCRAP (Robert C. Martin) for Clojure specs. The concepts are language-agnostic and would translate directly to JS/TS test frameworks.
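As a rough sketch of the file-level roll-up described above, assuming per-test metrics have already been extracted from the AST. The `TestCaseMetrics` shape, the `setupFingerprint` field, and both thresholds are hypothetical choices for illustration; they are not taken from SCRAP or from fallow.

```typescript
// Hypothetical per-test metrics; a real analyzer would derive these from the AST.
interface TestCaseMetrics {
  name: string;
  lines: number;            // length of the test body
  assertions: number;       // number of expect/assert calls found
  setupLines: number;       // lines spent on arrange/setup
  setupFingerprint: string; // hash of the normalized setup statements
}

interface FileSmells {
  zeroAssertion: string[];     // tests that verify nothing
  oversized: string[];         // tests longer than maxLines
  duplicatedSetup: string[][]; // groups sharing a long setup worth extracting
}

function analyzeTestFile(
  cases: TestCaseMetrics[],
  maxLines = 40,    // assumed threshold
  minSetupLines = 5 // assumed threshold
): FileSmells {
  const groups: Record<string, string[]> = {};
  for (const c of cases) {
    // Short, table-like tests are treated as coverage-matrix repetition and skipped.
    if (c.setupLines < minSetupLines) continue;
    (groups[c.setupFingerprint] = groups[c.setupFingerprint] || []).push(c.name);
  }
  return {
    zeroAssertion: cases.filter((c) => c.assertions === 0).map((c) => c.name),
    oversized: cases.filter((c) => c.lines > maxLines).map((c) => c.name),
    duplicatedSetup: Object.keys(groups)
      .map((key) => groups[key])
      .filter((names) => names.length > 1),
  };
}
```

Gating the duplication check on setup length is one simple way to leave many small input-variation tests alone while still flagging repeated heavy setup.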
2. Test effectiveness — "does the test catch bugs?"
Mutation testing answers this by making small changes to production code (flipping conditions, removing statements, changing operators) and checking if any test fails. If a mutation survives, that's a gap.
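A toy illustration of that loop, with hand-written mutants standing in for what a real tool would generate automatically. Everything here (`isAdult`, the mutant labels, the two suites) is made up for the example. Both suites execute every line of `isAdult`, yet only one of them catches all the mutations.

```typescript
type Predicate = (age: number) => boolean;

// Production code under test.
const isAdult: Predicate = (age) => age >= 18;

// Hand-written mutants, each applying one small change to the original.
const mutants: Record<string, Predicate> = {
  ">= to >": (age) => age > 18,
  ">= to <": (age) => age < 18,
  "18 to 19": (age) => age >= 19,
  "always true": () => true,
};

// A suite returns true when all of its checks pass for the given implementation.
type Suite = (impl: Predicate) => boolean;

const weakSuite: Suite = (impl) => impl(30) === true;
const strongSuite: Suite = (impl) =>
  impl(30) === true && impl(18) === true && impl(17) === false;

interface MutationResult {
  killed: string[];   // mutants that made the suite fail
  survived: string[]; // mutants the suite did not notice
  score: number;      // percentage of mutants killed
}

function runMutationTest(
  suite: Suite,
  candidates: Record<string, Predicate>
): MutationResult {
  const names = Object.keys(candidates);
  const killed = names.filter((name) => !suite(candidates[name]));
  const survived = names.filter((name) => suite(candidates[name]));
  return { killed, survived, score: (100 * killed.length) / names.length };
}

const weak = runMutationTest(weakSuite, mutants);     // score: 25
const strong = runMutationTest(strongSuite, mutants); // score: 100
```

The surviving mutants in `weak.survived` point at the missing boundary checks around 18, which line coverage alone would not reveal.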
Key patterns from mutate4java (also Robert C. Martin) that are relevant:
true↔false, ==↔!=, +↔-, &&↔||, !expr→expr, 0↔1, rvalue→null
The mutation score (% of mutations killed) is a much stronger signal than line coverage. It could enhance the existing CRAP formula:
CC² × (1 - mutation_score/100)³ + CC
instead of using line coverage.
How This Fits
Fallow already has the building blocks:
[test] --coverage already accepts Istanbul data
The structural analysis side (smell detection, duplication) is pure AST pattern matching: no type resolution needed, same as production complexity analysis. The mutation side could start by consuming reports from existing tools like Stryker.
Combined, this would give a complete picture: not just "is our code healthy?" but "are our tests healthy enough to keep it that way?"
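To make the "consume a report" route concrete, here is a minimal sketch of the mutation-adjusted CRAP variant proposed earlier. The report shape is an assumption modeled on the JSON mutation report Stryker can emit, reduced to the fields used here; counting `Killed` and `Timeout` as detected is also an assumption to check against the real schema. Applying a file-level mutation score to a function-level CC is a simplification for the example.

```typescript
// Assumed subset of a Stryker-style JSON report; verify against the real schema.
type MutantStatus =
  | "Killed"
  | "Timeout"
  | "Survived"
  | "NoCoverage"
  | "Ignored"
  | "CompileError"
  | "RuntimeError"
  | "Pending";

interface MutationReport {
  files: Record<string, { mutants: { status: MutantStatus }[] }>;
}

// Percentage of valid mutants detected, or undefined when none are valid.
function mutationScorePercent(
  mutants: { status: MutantStatus }[]
): number | undefined {
  const count = (...statuses: MutantStatus[]) =>
    mutants.filter((m) => statuses.indexOf(m.status) !== -1).length;
  const detected = count("Killed", "Timeout");
  const valid = detected + count("Survived", "NoCoverage");
  return valid === 0 ? undefined : (100 * detected) / valid;
}

// CC² × (1 - mutation_score/100)³ + CC, with mutation score in place of coverage.
function crapFromMutationScore(cc: number, scorePercent: number): number {
  return Math.pow(cc, 2) * Math.pow(1 - scorePercent / 100, 3) + cc;
}

// Hypothetical helper: adjusted CRAP for one file, given its complexity.
function crapForFile(
  report: MutationReport,
  path: string,
  cc: number
): number | undefined {
  const file = report.files[path];
  if (!file) return undefined;
  const score = mutationScorePercent(file.mutants);
  return score === undefined ? undefined : crapFromMutationScore(cc, score);
}
```

With a complexity of 10, a 30% mutation score gives roughly 44.3, while a 100% score gives 10, so the penalty only disappears when the tests actually kill the mutants.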
Why Now
AI code generators produce tests that pass but are often structurally weak — large examples, shallow assertions, duplicated arrange/act patterns. As AI-generated test code grows, structural test quality becomes a scaling problem that manual review can't keep up with.
References