"The test your tests have to pass."
Test quality governance for AI agent workflows: 5 commands, 5 agents, 12 violation types, and a 0-100 Test Quality Score.
AI agents write tests that pass but don't protect.
Your agent generates a test suite. Every test is green. Coverage is 95%. You ship with confidence. Then production breaks — and the test suite never flinched.
This is not a hypothetical. The research is clear:
- Only 29% of developers trust AI accuracy (Stack Overflow Developer Survey 2025)
- Best AI test generators achieve 71% mutation scores — meaning 29% of bugs slip through undetected (Diffblue 2025 Benchmarks)
- Researchers identified 13 new test smells specific to auto-generated tests that don't exist in human-written tests (Springer 2025)
- AI-generated code has 1.7x more issues than human-written code (CodeRabbit 2025)
Your test suite shows green. Your code ships bugs. The tests are theater.
Agent-Litmus doesn't run your tests. It tests your tests.
Agent writes tests -> Tests pass -> Green bar -> You trust it -> Ship it
Nobody checks if:
- Assertions are meaningful (or just `expect(result).toBeDefined()`)
- Edge cases are covered (or just happy paths)
- Tests would catch a regression (or just pass for any output)
- "100% tests passing" actually means "100% protection"

It doesn't.
Agent writes tests -> /litmus-scan catches 12 violation types
-> /litmus-edge maps every edge case
-> /litmus-strength asks "would this catch a real bug?"
-> /litmus-fix generates concrete improvements
-> /litmus-report gives project-wide Test Quality Score
Verdicts are honest: EFFECTIVE / WEAK / HOLLOW — not just "tests pass."
| Concern | What You Tell the Agent | What Validates It |
|---|---|---|
| Testing | "Please write tests" | Jest / Pytest |
| Linting | "Please format nicely" | ESLint / Prettier |
| Evidence | "Please cite sources" | Agent-Cite |
| Drift | "Please follow instructions" | Agent-Drift |
| Test Quality | "Are these tests real?" | Agent-Litmus |
Without Agent-Litmus, green means nothing. With it, green means protected.
Agent-Litmus detects 12 violation types across 4 categories:
| Type | Severity | What It Catches |
|---|---|---|
| `HOLLOW_ASSERTION` | error | `expect(result).toBeDefined()` — checks existence, not correctness. Passes for `{ error: true }` just as happily as `{ name: 'Alice' }`. |
| `WEAK_ASSERTION` | warning | `expect(result.length).toBeGreaterThan(0)` — confirms non-empty, but `["CORRUPTED"]` passes too. |
| `NO_ASSERTION` | critical | Test function calls code but never checks the result. Zero assertions. Smoke test at best. |
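A simplified sketch of how this classification might work (the pattern lists are illustrative only, not Agent-Litmus's actual rules):

```typescript
// Illustrative patterns only — the real scanner's rules are richer.
const HOLLOW_PATTERNS = [/\.toBeDefined\(\)/, /\.toBeTruthy\(\)/, /\.not\.toBeNull\(\)/];
const WEAK_PATTERNS = [/\.toBeGreaterThan\(0\)/];

function classifyAssertion(line: string): "HOLLOW" | "WEAK" | "STRONG" {
  if (HOLLOW_PATTERNS.some((p) => p.test(line))) return "HOLLOW";
  if (WEAK_PATTERNS.some((p) => p.test(line))) return "WEAK";
  return "STRONG"; // e.g. toEqual / toStrictEqual with a concrete expected value
}

console.log(classifyAssertion("expect(result).toBeDefined()"));             // "HOLLOW"
console.log(classifyAssertion("expect(result.length).toBeGreaterThan(0)")); // "WEAK"
console.log(classifyAssertion("expect(user).toEqual({ name: 'Alice' })"));  // "STRONG"
```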
| Type | Severity | What It Catches |
|---|---|---|
| `IMPLEMENTATION_COUPLING` | error | `jest.spyOn(service, '_validate')` — tests HOW the code works, not WHAT it produces. Breaks on refactor. |
| `OVER_MOCKING` | error | 8 mocks, 1 assertion. You're testing that JavaScript calls functions. Not that your logic works. |
| `BRITTLE_SELECTOR` | warning | `.css-1a2b3c`, `/html/body/div[3]/span` — breaks every build, trains devs to ignore failures. |
| Type | Severity | What It Catches |
|---|---|---|
| `MISSING_EDGE_CASE` | warning | Source checks `if (!name)` but no test ever passes `null`. The guard is untested. |
| `HAPPY_PATH_ONLY` | warning* | Every test uses valid input. No error paths, no boundaries, no nulls. (*Escalates to error if source has error handling.) |
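A minimal sketch of MISSING_EDGE_CASE: the source guards against a falsy name, but a happy-path-only suite never exercises that branch (function and values are hypothetical):

```typescript
// Source under test: has an explicit guard for missing input.
function greet(name: string | null | undefined): string {
  if (!name) return "Hello, guest"; // guarded branch — needs its own test
  return `Hello, ${name}`;
}

// Happy path only: the guard above is never executed.
const happy = greet("Alice");   // "Hello, Alice"

// Edge tests: actually exercise the guard.
const nullCase = greet(null);   // "Hello, guest"
const emptyCase = greet("");    // "Hello, guest"

console.log({ happy, nullCase, emptyCase });
```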
| Type | Severity | What It Catches |
|---|---|---|
| `DUPLICATE_TEST_LOGIC` | info | 3 tests: `add(1,1)`, `add(2,2)`, `add(3,3)`. Same code path, three times. Zero extra coverage. |
| `TEST_PRIVATE_METHOD` | warning | `service._validateEmail()` — testing internals that break on refactor. |
| `HARDCODED_DEPENDENCY` | warning | `/Users/dev/data.json`, `localhost:3000`, `new Date('2024-01-15')` — time bombs. |
| `FLAKY_INDICATOR` | warning | `setTimeout(2000)`, `Math.random()`, timing assertions — non-deterministic failures. |
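One common fix for HARDCODED_DEPENDENCY and FLAKY_INDICATOR is to inject the clock rather than hardcoding `new Date('2024-01-15')` or reading real time inside the code under test — a sketch (function and dates are hypothetical):

```typescript
// Deterministic time: the test supplies the clock, production uses the default.
function daysUntil(deadline: Date, now: () => Date = () => new Date()): number {
  const MS_PER_DAY = 86_400_000;
  return Math.ceil((deadline.getTime() - now().getTime()) / MS_PER_DAY);
}

// A fixed clock makes the test reproducible on any machine, on any day.
const fixedNow = () => new Date("2024-01-10T00:00:00Z");
const days = daysUntil(new Date("2024-01-15T00:00:00Z"), fixedNow);
console.log(days); // 5
```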
| Command | Purpose | Verdict |
|---|---|---|
| `/litmus-scan <test-file>` | Scan for 12 violation types, classify all assertions | EFFECTIVE / WEAK / HOLLOW |
| `/litmus-edge <source-file>` | Map all edge cases, check which are tested | COVERED / GAPS / EXPOSED |
| `/litmus-strength <test-file>` | Thought-experiment mutation testing | STRONG / MODERATE / THEATER |
| `/litmus-fix <test-file>` | Generate concrete improved test code | Before/After diffs |
| `/litmus-report [scope]` | Batch audit, project-wide Test Quality Score | PROTECTED / AT_RISK / EXPOSED |
```shell
# Scan a single test file
/litmus-scan src/utils/auth.test.ts

# Check edge case coverage for a source file
/litmus-edge src/utils/auth.ts

# Would these tests catch real bugs?
/litmus-strength src/utils/auth.test.ts

# Fix the violations
/litmus-fix src/utils/auth.test.ts --auto

# Project-wide assessment
/litmus-report --format summary

# Strict mode (warnings become errors)
/litmus-scan src/utils/auth.test.ts --strict

# Focus on assertion quality only
/litmus-scan src/utils/auth.test.ts --focus A
```

A single number from 0-100 that answers: "How protected is this code?"
TQS = assertion_strength * 0.40 + violation_penalty * 0.30 + edge_coverage * 0.30
| Component (Weight) | Calculation |
|---|---|
| Assertion Strength (40%) | 100 - (weak% x 1) - (hollow% x 2) |
| Violation Penalty (30%) | 100 - (critical x 15) - (error x 8) - (warning x 3) - (info x 1) |
| Edge Coverage (30%) | Tested edges / total edges x 100 |
All component scores are floored at 0 (no negative values). The final TQS is capped at 100.
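The formula and component table above can be sketched as a single function (weights and penalties are the defaults from the table; the zero-edge-case behavior is an assumption):

```typescript
interface TqsInput {
  weakPct: number;    // % of assertions classified WEAK
  hollowPct: number;  // % of assertions classified HOLLOW
  critical: number;   // violation counts by severity
  error: number;
  warning: number;
  info: number;
  testedEdges: number;
  totalEdges: number;
}

function testQualityScore(i: TqsInput): number {
  const floor = (x: number) => Math.max(0, x); // components floored at 0
  const assertionStrength = floor(100 - i.weakPct * 1 - i.hollowPct * 2);
  const violationPenalty = floor(
    100 - i.critical * 15 - i.error * 8 - i.warning * 3 - i.info * 1
  );
  // Assumption: a file with no detectable edges counts as fully covered.
  const edgeCoverage =
    i.totalEdges === 0 ? 100 : floor((i.testedEdges / i.totalEdges) * 100);
  const tqs = assertionStrength * 0.4 + violationPenalty * 0.3 + edgeCoverage * 0.3;
  return Math.min(100, Math.round(tqs)); // final score capped at 100
}

// Worked example: 20% weak, 10% hollow, 1 error + 2 warnings, 6 of 10 edges tested.
const score = testQualityScore({
  weakPct: 20, hollowPct: 10,
  critical: 0, error: 1, warning: 2, info: 0,
  testedEdges: 6, totalEdges: 10,
});
console.log(score); // 68
```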
| TQS | Verdict | Meaning |
|---|---|---|
| 80-100 | PROTECTED | Tests are doing their job. Ship with confidence. |
| 50-79 | AT_RISK | Tests exist but have significant blind spots. Bugs will get through. |
| 0-49 | EXPOSED | Green-bar theater. Tests pass but protect nothing. |
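Mapping a score to a verdict is then a threshold check (a sketch; defaults match the table above):

```typescript
type Verdict = "PROTECTED" | "AT_RISK" | "EXPOSED";

// Thresholds default to the published bands: 80+ and 50+.
function verdictFor(tqs: number, protectedAt = 80, atRiskAt = 50): Verdict {
  if (tqs >= protectedAt) return "PROTECTED";
  if (tqs >= atRiskAt) return "AT_RISK";
  return "EXPOSED";
}

console.log(verdictFor(92)); // "PROTECTED"
console.log(verdictFor(68)); // "AT_RISK"
console.log(verdictFor(35)); // "EXPOSED"
```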
Use Agent-Litmus when:
- An AI agent just wrote tests for your code
- You want to audit test quality before a release
- You're doing periodic test health checks
- You're reviewing a PR with new tests
- Your test suite is green but bugs keep shipping
Don't use Agent-Litmus for:
- Code without any tests (write tests first, then audit them)
- Test infrastructure setup (jest.config, conftest.py)
- Mocking library configuration
- E2E test orchestration (Playwright, Cypress config)
```shell
curl -fsSL https://raw.githubusercontent.com/saisumantatgit/Agent-Litmus/main/install.sh | bash
```

The installer auto-detects your CLI (Claude Code, Cursor, Codex, Aider) and installs the appropriate adapter.

```shell
git clone https://github.com/saisumantatgit/Agent-Litmus.git
cp -r Agent-Litmus/.claude/ .claude/
cp -r Agent-Litmus/.claude-plugin/ .claude-plugin/
cp -r Agent-Litmus/references/ references/
cp -r Agent-Litmus/templates/ templates/
```

| Platform | Adapter Location | Setup |
|---|---|---|
| Claude Code | `adapters/claude-code/` | Plugin + commands (native) |
| Cursor | `adapters/cursor/` | Rules file in `.cursor/rules/` |
| OpenAI Codex | `adapters/codex/` | `AGENTS.md` system prompt |
| Aider | `adapters/aider/` | `.aider.conf.yml` |
| Generic | `adapters/generic/` | Copy prompts from `prompts/` |
```shell
# 1. Install
curl -fsSL https://raw.githubusercontent.com/saisumantatgit/Agent-Litmus/main/install.sh | bash

# 2. Scan your weakest test file
/litmus-scan path/to/your.test.ts

# 3. See the violations and verdict (EFFECTIVE/WEAK/HOLLOW)

# 4. Fix the violations
/litmus-fix path/to/your.test.ts --auto

# 5. Get project-wide score
/litmus-report
```

Copy `templates/litmus-protocol.yaml` to your project root as `.litmus-protocol.yaml`:
```yaml
# Override violation severities
violations:
  HOLLOW_ASSERTION: error      # default: error
  WEAK_ASSERTION: warning      # default: warning
  NO_ASSERTION: critical       # default: critical
  DUPLICATE_TEST_LOGIC: off    # disable this check

# Test file discovery patterns
test_patterns:
  - "**/*.test.{ts,tsx,js,jsx}"
  - "**/test_*.py"

# TQS thresholds
tqs:
  protected: 80
  at_risk: 50

# Scoring weights (must sum to 1.0)
scoring:
  assertion_strength: 0.40
  violation_penalty: 0.30
  edge_coverage: 0.30

# Directories to ignore
ignore:
  - "node_modules/"
  - "vendor/"
  - ".git/"
```

See `templates/litmus-protocol.yaml` for all configuration options.
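Since the scoring weights must sum to 1.0, a config loader might validate them like this (a sketch; field names follow the YAML keys, the epsilon tolerance is an assumption):

```typescript
interface ScoringWeights {
  assertion_strength: number;
  violation_penalty: number;
  edge_coverage: number;
}

// Floating-point sums of 0.40 + 0.30 + 0.30 may not be exactly 1.0,
// so compare within a small tolerance instead of with ===.
function weightsAreValid(w: ScoringWeights, epsilon = 1e-9): boolean {
  const sum = w.assertion_strength + w.violation_penalty + w.edge_coverage;
  return Math.abs(sum - 1.0) < epsilon;
}

console.log(weightsAreValid({ assertion_strength: 0.4, violation_penalty: 0.3, edge_coverage: 0.3 })); // true
console.log(weightsAreValid({ assertion_strength: 0.5, violation_penalty: 0.3, edge_coverage: 0.3 })); // false
```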
Agent-Litmus works with any AI coding assistant that can read files and follow instructions.
| Platform | Support Level | Adapter |
|---|---|---|
| Claude Code | Native plugin | .claude-plugin/ + commands + skills + agents |
| Cursor | Rules integration | .cursor/rules/litmus.md |
| OpenAI Codex | Agent instructions | AGENTS.md |
| Aider | Config integration | .aider.conf.yml |
| Windsurf | Generic prompts | prompts/*.md |
| Cline | Generic prompts | prompts/*.md |
| Any LLM | Copy-paste prompts | prompts/*.md |
Agent-Litmus is one of six products in the Agent Suite for AI agent governance:
| Product | Tagline | Purpose |
|---|---|---|
| Agent-PROVE | "Prove it or it fails." | Thinking validation — structured reasoning frameworks |
| Agent-Trace | "See the ripple effect before it happens." | Blast radius mapping — impact analysis before changes |
| Agent-Drift | "Not on my watch." | Drift detection — catch when agents deviate from instructions |
| Agent-Litmus | "The test your tests have to pass." | Test quality governance — are tests protecting code? |
| Agent-Cite | "Cite it or it's opinion." | Evidence enforcement — require citations for claims |
| Agent-Scribe | "Nothing is lost." | Session governance — capture decisions and context |
Agent-Litmus was built from research across:
- Springer 2025 papers on auto-generated test quality and the 13 test smells unique to AI-generated tests
- Diffblue 2025 mutation testing benchmarks showing AI tests achieve only 71% mutation scores
- Stack Overflow Developer Survey 2025 reporting only 29% developer trust in AI accuracy
- CodeRabbit 2025 analysis showing 1.7x more issues in AI-generated code
The 12 violation types map directly to the documented failure modes of AI-generated tests. The assertion classification (STRONG/WEAK/HOLLOW) comes from mutation testing research: assertions that survive mutations are not protecting code.
See CONTRIBUTING.md for how to:
- Add new violation types
- Add assertion patterns for new frameworks
- Add CLI adapters
- Add edge case categories
MIT License. Copyright (c) 2026 Sai Sumanth Battepati.
See LICENSE for details.