A lightweight evaluation harness for LLM features that produce structured JSON outputs.
It provides:
- Schema validation
- Regression scoring (Exact Match + F1)
- Deterministic mock adapter for CI
- OpenAI / Azure OpenAI support
- Quality gates (threshold-based failure)
- Baseline regression protection
The goal is simple: treat AI features like production software — measurable, testable, and safe to evolve.
LLM-powered features regress easily when:
- prompts change
- models are upgraded
- decoding parameters change
- schemas evolve
- retrieval / RAG logic changes
This harness ensures changes are:
- measurable
- reproducible
- enforceable in CI
```mermaid
flowchart LR
  D["Dataset (JSONL)"] --> R["Runner"]
  P["Prompt (vN)"] --> R
  R --> A["Adapter\nmock | openai | azure"]
  A --> O["Structured Output (JSON)"]
  S["JSON Schema"] --> V["Schema Validation"]
  O --> V
  O --> M["Scoring\nExact Match + F1"]
  V --> M
  B["Baseline"] --> G["Regression Gates"]
  M --> G
  G --> RPT["Report (JSON)"]
```
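The "Schema Validation" stage can be sketched as a minimal hand-rolled check. This is only illustrative: the single required `tasks` array is a hypothetical simplification of `schemas/task_extraction.schema.json`, and a real implementation would typically use a full JSON Schema validator such as the `jsonschema` package.

```python
# Minimal sketch of the schema-validation stage. Assumes a hypothetical
# schema with one required "tasks" array of strings; the real
# schemas/task_extraction.schema.json is richer, and the harness would
# normally delegate to a proper JSON Schema validator.
def is_schema_valid(output) -> bool:
    """Return True if the structured output matches the expected shape."""
    if not isinstance(output, dict) or set(output) != {"tasks"}:
        return False
    tasks = output["tasks"]
    return isinstance(tasks, list) and all(isinstance(t, str) for t in tasks)
```

Outputs that fail this check count against `schema_valid_rate`, which the quality gates enforce.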
```
ai-evaluation-harness/
  .github/workflows/ci.yml
  baselines/
  datasets/
  prompts/
  reports/
  schemas/
  src/eval_harness/
  tests/
  run.ps1
  test.ps1
```
This repository includes helper scripts to ensure a consistent local setup.
From repo root:
```powershell
.\run.ps1
```

This will:
- Create `.venv` if missing
- Activate it
- Install dependencies
- Run the evaluation harness
| Parameter | Default | Description |
|---|---|---|
| `-Adapter` | `mock` | `mock`, `openai`, `azure` |
| `-Dataset` | `datasets\sample_tasks.jsonl` | JSONL evaluation dataset |
| `-Prompt` | `prompts\task_extraction\v1.md` | Prompt file |
| `-Schema` | `schemas\task_extraction.schema.json` | JSON schema file |
| `-MinSchemaValidRate` | `1.0` | Fail if below threshold |
| `-MinAvgF1` | `0.8` | Fail if below threshold |
| `-Baseline` | `baselines\task_extraction.mock.baseline.json` | Baseline reference |
| `-MaxAvgF1Drop` | `0.02` | Allowed regression delta |
| `-WriteBaseline` | (switch) | Write current summary as baseline |
A baseline is a committed reference summary representing known-good evaluation performance.
It prevents silent degradation.
Example:
If the baseline avg F1 is 0.92 and a change drops it to 0.83, that is still a regression, even though it clears the 0.80 minimum threshold.
The baseline regression check catches this.
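The gate logic amounts to comparing the current average F1 against the committed baseline, independently of the absolute minimum threshold. A minimal sketch (the function name and signature are illustrative, not the harness's actual API):

```python
def passes_baseline_gate(current_avg_f1: float,
                         baseline_avg_f1: float,
                         max_avg_f1_drop: float = 0.02) -> bool:
    """Return False when avg F1 drops more than the allowed delta vs. baseline."""
    return (baseline_avg_f1 - current_avg_f1) <= max_avg_f1_drop

# The scenario above: baseline 0.92, current 0.83 -> a 0.09 drop fails
# the gate even though 0.83 clears the 0.80 minimum threshold.
```

Note that the baseline gate and the `-MinAvgF1` gate are independent: a run must clear both.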
Explicitly run:
```powershell
.\run.ps1 -WriteBaseline
```

This writes `baselines/task_extraction.mock.baseline.json`.
Commit it.
Baselines are never updated automatically.
Update only when:
- Prompt improvements intentionally change output
- Model upgrades are accepted
- Dataset changes are deliberate
Do NOT update baseline to “make CI green.”
Treat baseline updates like snapshot test updates — deliberate and reviewed.
A Python virtual environment is an isolated Python runtime per project.
It ensures:
- No global dependency pollution
- Reproducible installs
- Alignment with CI
The environment lives in:
.venv/
It is not committed to Git.
Activate:

```powershell
.\.venv\Scripts\Activate.ps1
```

Deactivate:

```powershell
deactivate
```

Add to your `$PROFILE`:

```powershell
function venv {
  if (!(Test-Path ".\pyproject.toml")) {
    Write-Host "Not repo root" -ForegroundColor Yellow
    return
  }
  if (!(Test-Path ".\.venv\Scripts\Activate.ps1")) {
    python -m venv .venv
  }
  . .\.venv\Scripts\Activate.ps1
}

function devenv {
  if (Get-Command deactivate -ErrorAction SilentlyContinue) {
    deactivate
  }
}
```

Usage:
```powershell
cd ai-evaluation-harness
venv
```

No.
Both run.ps1 and test.ps1 automatically create and activate .venv.
Manual activation is only needed for interactive debugging.
Run tests via:
```powershell
.\test.ps1
```

This ensures:
- `.venv` exists
- pytest is installed
- tests run via `python -m pytest`
CI runs:
- Unit tests
- Mock adapter quality gate
- Baseline regression check (if baseline committed)
- Optional Azure gate if secrets are configured
Reports are uploaded as artifacts.
Below is a simplified example of a generated report file (reports/run-abc123.json):
```json
{
  "summary": {
    "total": 12,
    "schema_valid_rate": 1.0,
    "exact_match_rate": 0.42,
    "avg_f1": 0.78,
    "avg_latency_ms": 12.4,
    "total_cost_usd": 0.00
  },
  "cases": [
    {
      "id": "case-001",
      "schema_valid": true,
      "exact_match": false,
      "f1": 0.83,
      "latency_ms": 9,
      "cost_usd": 0.0
    },
    {
      "id": "case-002",
      "schema_valid": true,
      "exact_match": true,
      "f1": 1.0,
      "latency_ms": 8,
      "cost_usd": 0.0
    }
  ]
}
```

The summary section is what quality gates and baseline regression checks compare against.
- `schema_valid_rate` ensures structured output correctness.
- `exact_match_rate` is strict equality vs expected output.
- `avg_f1` allows partial-credit scoring.
- `avg_latency_ms` enables performance tracking.
- `total_cost_usd` enables future cost gating.
These metrics make LLM behavior measurable and enforceable in CI.
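As an illustration, partial-credit scoring like `avg_f1` is commonly computed as token-overlap F1, sketched below. This is an assumption about the metric's shape; the harness's actual tokenization and field handling may differ.

```python
from collections import Counter

def token_f1(predicted: str, expected: str) -> float:
    """Token-overlap F1: partial credit when outputs are close but not exact."""
    pred = predicted.lower().split()
    exp = expected.lower().split()
    if not pred or not exp:
        return 1.0 if pred == exp else 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    overlap = sum((Counter(pred) & Counter(exp)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(exp)
    return 2 * precision * recall / (precision + recall)
```

For example, `token_f1("buy milk today", "buy milk")` gives 0.8 (precision 2/3, recall 1.0), where strict exact match would score 0.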
Recommended next steps (intentionally not over-engineered):
- Per-tag thresholds (edge-case vs happy-path)
- Cost tracking gates
- Prompt A/B comparison
- Per-case regression diff reporting
Keep it small, measurable, and production-aligned.
This harness favors:
- Determinism in CI
- Explicit baselines
- Controlled regression
- Clear failure modes
AI systems drift.
Production systems measure that drift.