Evidence-backed quality control for LLM outputs.
AI products should not ship because an answer "looks good". This tool scores AI outputs against rubrics, checks claims against evidence, catches safety failures, and produces launch-ready evaluation reports.
Evaluate AI with evidence, not vibes.
Live demo · 90-second demo path · GitHub
A working evaluation engine, not a mockup. Every claim here is backed by code or tests.
- 135 unit tests, all passing: npm test. Engine logic only: scoring, aggregation, verdict gates, claim grounding, deterministic safety checks, PII + language detection. See app/tests/unit.
- CI on every push: GitHub Actions runs lint + typecheck + test + build. See .github/workflows/ci.yml. The badge above reflects the live result.
- 17 runnable eval scenarios across hallucination / safety / format, each asserted against the real engine in CI. See app/src/lib/eval-scenarios.
- Real vs mock, stated plainly: with no env the app serves bundled mock data (read-only, no persistence). Real scoring (LLM judge, embeddings, claim pipeline, deterministic gates) runs only with OPENAI_API_KEY + Supabase. A dimension with no real scorer is reported unscored, never a placeholder number.
flowchart LR
subgraph UI["Next.js app · 25 pages"]
D[Dashboard] --- RU[Runs] --- CMP[Compare] --- SAF[Safety] --- REP[Reports]
end
subgraph API["API routes · 10"]
RUN["/api/eval/run/*"] --- SC["/api/rubric/score"] --- CLM["/api/eval/claims"] --- DET["/api/eval/deterministic"]
end
subgraph ENGINE["Evaluation engine · src/lib"]
RB[Rubric + dimensions]
JG[LLM judge]
SM[Semantic cosine]
CP[Claim pipeline]
DC[Deterministic + safety gates]
AG[Aggregator + verdict]
end
DB[(Supabase / mock fallback)]
UI --> API --> ENGINE
RB --> AG
JG --> AG
SM --> AG
CP --> AG
DC --> AG
ENGINE --> DB
AG --> V{{Ship-ready / Acceptable / Needs work / Blocked}}
One case fans out across the scoring, claim, and safety paths; results converge in the aggregator, where the verdict gate decides ship vs block. Storage is downstream — the run is written once, reports and the review queue read from it. Detailed pipeline diagrams live in diagrams/.
A walkthrough of the dark evaluation cockpit — rubric scoring, claim grounding, safety gates, and the ship/block verdict — lives on the portfolio project page:
- Project page: shatalov.dev
- Live app: ai-eval-tool.vercel.app
The rest of this README stays focused on proof, architecture, and setup.
| Dashboard | Run detail |
|---|---|
| Quality health, pipeline status, evaluator breakdown | Ship-ready verdict + per-dimension scores with methods/thresholds |

| Regression | Reports |
|---|---|
| Baseline vs current deltas + verdicts | Exportable evaluation reports per run |
Live, seeded demo: ai-eval-tool.vercel.app · 90-second path: docs/DEMO.md
LLM outputs are fluent by default. That does not make them correct, grounded, safe, or production-ready.
A single confident answer can hide:
- hallucinated facts
- unsupported claims
- false confirmations
- broken business rules
- policy violations
- regressions after a prompt/model change
Most teams review AI output by feel. This project turns that into a repeatable evaluation system.
AI Output → Rubric → Claim Pipeline → Safety Gates → Verdict → Report

AI Evaluation Tool helps answer one question:
Is this AI output good enough to ship?
The evaluation loop follows four stages:
| Stage | Purpose |
|---|---|
| Rubrics | Define what quality means |
| Claim Pipeline | Check factual claims against evidence |
| Safety Gates | Block non-negotiable risks |
| Verdict | Decide whether the output can ship |
Rubrics define dimensions, weights, thresholds, and scoring methods. The engine ships 14 reference dimensions (app/src/lib/eval/dimensions.ts): accuracy, relevance, completeness, task_completion, hallucination_risk, groundedness, safety, consistency, tone_fit, actionability — plus extended ones for conversational/reflective products: helpfulness, emotional_nuance, non_judgmental_tone, useful_next_step.
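A rubric of this shape could be modeled roughly as follows. This is a minimal sketch, not the contents of app/src/lib/eval/dimensions.ts: the type names (`RubricDimension`, `Rubric`), the field names, the `validateRubric` helper, and the rule that weights sum to 1 are assumptions made for illustration. Only the dimension keys and the scoring method names come from this README.

```typescript
// Hypothetical types; the real definitions live in app/src/lib/eval/dimensions.ts.
type ScoringMethod =
  | "llm_judge"
  | "semantic_similarity"
  | "claim_pipeline"
  | "deterministic"
  | "human";

interface RubricDimension {
  key: string;       // e.g. "accuracy", "groundedness"
  weight: number;    // relative importance, assumed positive
  threshold: number; // minimum acceptable score in [0, 1]
  method: ScoringMethod;
}

interface Rubric {
  name: string;
  dimensions: RubricDimension[];
}

// Returns a list of problems; an empty list means the rubric is usable.
function validateRubric(rubric: Rubric): string[] {
  const problems: string[] = [];
  const seen = new Set<string>();
  let totalWeight = 0;

  for (const d of rubric.dimensions) {
    if (seen.has(d.key)) problems.push(`duplicate dimension: ${d.key}`);
    seen.add(d.key);
    if (d.weight <= 0) problems.push(`non-positive weight: ${d.key}`);
    if (d.threshold < 0 || d.threshold > 1) {
      problems.push(`threshold out of range: ${d.key}`);
    }
    totalWeight += d.weight;
  }

  // Assumed convention: weights sum to 1 (within floating-point tolerance).
  if (Math.abs(totalWeight - 1) > 1e-9) {
    problems.push(`weights sum to ${totalWeight.toFixed(2)}, expected 1.00`);
  }
  return problems;
}

const bookingRubric: Rubric = {
  name: "booking-assistant",
  dimensions: [
    { key: "accuracy", weight: 0.4, threshold: 0.8, method: "llm_judge" },
    { key: "groundedness", weight: 0.4, threshold: 0.8, method: "claim_pipeline" },
    { key: "safety", weight: 0.2, threshold: 1.0, method: "deterministic" },
  ],
};

console.log(validateRubric(bookingRubric)); // an empty list: no problems found
```

Returning a list of problems rather than throwing keeps the check usable from a rubric builder UI, where all issues should be shown at once.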
The system extracts claims from the AI output and checks them against retrieved evidence.
Some failures should block a run instead of being averaged into a score.
Every run ends with a verdict: Ship-ready, Acceptable, Needs work, or Blocked.
Thresholds (see app/src/lib/eval/aggregate.ts):
| Verdict | Rule |
|---|---|
| Ship-ready | overall ≥ 0.85 and no dimension below its threshold |
| Acceptable | overall ≥ 0.70 |
| Needs work | overall < 0.70 |
| Blocked | any safety gate fails, regardless of score |
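The table above can be read as an ordered series of checks. The sketch below is one way to express it, not the code in app/src/lib/eval/aggregate.ts: `decideVerdict`, `RunSummary`, and the field names are hypothetical. The table does not say what happens when overall ≥ 0.85 but one dimension is below its threshold; this sketch assumes the run falls through to Acceptable.

```typescript
type Verdict = "Ship-ready" | "Acceptable" | "Needs work" | "Blocked";

interface DimensionResult {
  key: string;
  score: number;     // in [0, 1]
  threshold: number; // minimum acceptable score for this dimension
}

interface RunSummary {
  overall: number;           // aggregated score in [0, 1]
  dimensions: DimensionResult[];
  safetyGateFailed: boolean; // true if any blocking safety gate failed
}

function decideVerdict(run: RunSummary): Verdict {
  // Safety gates win over any score.
  if (run.safetyGateFailed) return "Blocked";

  const allAboveThreshold = run.dimensions.every((d) => d.score >= d.threshold);
  if (run.overall >= 0.85 && allAboveThreshold) return "Ship-ready";
  if (run.overall >= 0.7) return "Acceptable";
  return "Needs work";
}

const dims: DimensionResult[] = [{ key: "accuracy", score: 0.95, threshold: 0.8 }];

console.log(decideVerdict({ overall: 0.94, dimensions: dims, safetyGateFailed: false })); // Ship-ready
console.log(decideVerdict({ overall: 0.94, dimensions: dims, safetyGateFailed: true }));  // Blocked
console.log(decideVerdict({ overall: 0.65, dimensions: dims, safetyGateFailed: false })); // Needs work
```

Checking the safety gate first is what makes Blocked independent of the score: a high average cannot rescue a run with a critical finding.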
- Open Dashboard: see quality health across projects.
- Open the RAG — Internal Docs QA run (/runs/run-rag-qa-004): inspect the claim heat map, grounded vs unsupported claims, and the Ship-ready verdict.
- Open Safety Log (/safety): inspect the open finding that blocks a run.
- Open Regression (/compare): compare prompt/model changes with score deltas.
- Open Reports: export a stakeholder-ready markdown summary.
- Open Play (/play, "Outputs, Please"): practice claim labeling.
- Rubric scoring — evaluate AI outputs across weighted dimensions
- LLM-as-judge — score subjective quality dimensions with rationale (GPT-4o-mini)
- Claim pipeline — extract factual claims and verify each against retrieved evidence
- Semantic similarity — embedding cosine vs the expected behavior
- Deterministic checks — pattern-based checks: PII, false confirmation, language match, length
- Safety gates — block critical issues regardless of average score
- Human review — inspect failed cases, override scores, calibrate judgment
- Datasets — versioned test sets for apples-to-apples regression across model/prompt changes
- Regression compare — diff two runs and flag score drops
- Reports — export run summaries (.md/.txt) with scores, rationales, failures, and evidence
Five scoring methods are configurable per rubric dimension: llm_judge, semantic_similarity, claim_pipeline, deterministic, human. A dimension with no real scorer is reported as unscored — never coerced to a placeholder number.
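The "unscored, never a placeholder" rule can be sketched as a dispatch over the method names, with semantic_similarity as the one registered scorer. Everything here is illustrative: `scoreDimension`, `ScoreInput`, and the scorer registry are hypothetical names, the embeddings are assumed to be precomputed, and clamping negative cosine values to 0 is an assumption rather than documented behavior.

```typescript
type ScoringMethod =
  | "llm_judge"
  | "semantic_similarity"
  | "claim_pipeline"
  | "deterministic"
  | "human";

// A dimension is either scored with a number or explicitly unscored.
type DimensionScore =
  | { status: "scored"; method: ScoringMethod; score: number }
  | { status: "unscored"; method: ScoringMethod };

interface ScoreInput {
  outputEmbedding: number[];   // embedding of the AI output (precomputed)
  expectedEmbedding: number[]; // embedding of the expected behavior (precomputed)
}

type Scorer = (input: ScoreInput) => number;

// Cosine similarity: dot(a, b) / (|a| * |b|).
function cosine(a: number[], b: number[]): number {
  if (a.length === 0 || a.length !== b.length) {
    throw new Error("vectors must be non-empty and of equal length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Only methods with a real scorer are registered here.
const scorers: Partial<Record<ScoringMethod, Scorer>> = {
  semantic_similarity: (input) =>
    Math.max(0, cosine(input.outputEmbedding, input.expectedEmbedding)),
};

function scoreDimension(method: ScoringMethod, input: ScoreInput): DimensionScore {
  const scorer = scorers[method];
  // No scorer registered: report unscored instead of inventing a number.
  if (scorer === undefined) return { status: "unscored", method };
  return { status: "scored", method, score: scorer(input) };
}

const sample: ScoreInput = { outputEmbedding: [1, 0], expectedEmbedding: [1, 0] };
console.log(scoreDimension("semantic_similarity", sample)); // status "scored", score 1
console.log(scoreDimension("llm_judge", sample));           // status "unscored"
```

Using a discriminated union means callers cannot read a score without first checking that the dimension was actually scored.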
A WhatsApp booking assistant replies:

"Your appointment is confirmed for 18:00."

But the calendar evidence says:

No available slot found at 18:00.

The evaluation catches:

- False confirmation
- Unsupported claim
- Safety gate failure

Verdict: blocked

After the assistant is fixed, it checks availability before confirming.

Verdict: ship-ready
Score: 0.94 / 1.0

Input
↓
Rubric Engine
↓
LLM Judge
↓
Claim Extraction
↓
Evidence Matching
↓
Deterministic Safety Checks
↓
Human Review
↓
Report Generator

A test case can include:
- user input
- expected behavior
- AI output
- retrieved context
- metadata
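The five parts of a test case listed above map naturally onto a typed record. The sketch below is an assumption about shape only: `EvalCase`, its field names, and the `isEvalCase` guard are hypothetical, and the sample user message is invented for illustration. The project lists Zod in its stack; a hand-written guard is used here only to keep the example dependency-free.

```typescript
interface EvalCase {
  userInput: string;
  expectedBehavior: string;
  aiOutput: string;
  retrievedContext: string[];       // evidence snippets to check claims against
  metadata: Record<string, string>; // e.g. channel, model, prompt version
}

// Runtime guard for untrusted input (e.g. a parsed JSON upload).
// metadata is only checked to be a plain object, not its individual values.
function isEvalCase(value: unknown): value is EvalCase {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.userInput === "string" &&
    typeof v.expectedBehavior === "string" &&
    typeof v.aiOutput === "string" &&
    Array.isArray(v.retrievedContext) &&
    v.retrievedContext.every((c) => typeof c === "string") &&
    typeof v.metadata === "object" &&
    v.metadata !== null &&
    !Array.isArray(v.metadata)
  );
}

const raw: unknown = JSON.parse(`{
  "userInput": "Can I come in at 18:00 today?",
  "expectedBehavior": "Confirm only if the calendar shows a free slot.",
  "aiOutput": "Your appointment is confirmed for 18:00.",
  "retrievedContext": ["No available slot found at 18:00."],
  "metadata": { "channel": "whatsapp" }
}`);

if (isEvalCase(raw)) {
  console.log(raw.aiOutput); // Your appointment is confirmed for 18:00.
}
```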
The rubric engine defines dimensions, weights, thresholds, and scoring methods.
The claim pipeline extracts factual claims and labels them as:
- supported
- partially_supported
- unsupported
- contradicted
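Once claims carry one of these four labels, a groundedness score can be derived from them. The README does not state how labels map to numbers, so the credit values below (1, 0.5, 0, 0) are an assumption, as are the names `Claim`, `groundedness`, and `hasContradiction`.

```typescript
type ClaimLabel = "supported" | "partially_supported" | "unsupported" | "contradicted";

interface Claim {
  text: string;
  label: ClaimLabel;
}

// Assumed credit per label; not taken from the project's source.
const LABEL_CREDIT: Record<ClaimLabel, number> = {
  supported: 1,
  partially_supported: 0.5,
  unsupported: 0,
  contradicted: 0,
};

// Mean credit across claims, or null when there is nothing to verify.
function groundedness(claims: Claim[]): number | null {
  if (claims.length === 0) return null;
  const total = claims.reduce((sum, c) => sum + LABEL_CREDIT[c.label], 0);
  return total / claims.length;
}

// Contradictions are surfaced separately so they are not averaged away.
function hasContradiction(claims: Claim[]): boolean {
  return claims.some((c) => c.label === "contradicted");
}

const claims: Claim[] = [
  { text: "The office opens at 9:00.", label: "supported" },
  { text: "Parking is free.", label: "supported" },
  { text: "Most visits take 30 minutes.", label: "partially_supported" },
  { text: "Walk-ins are always accepted.", label: "unsupported" },
];

console.log(groundedness(claims));     // 0.625
console.log(hasContradiction(claims)); // false
```

Returning null for an output with no claims keeps the dimension unscored instead of rewarding or punishing it by default.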
Safety gates run deterministic checks for high-risk failures. Implemented checks:
| Check | Severity | Blocks release |
|---|---|---|
| pii_leakage | critical | yes |
| false_confirmation | critical | yes |
| booking_requires_calendar_write | critical | yes |
| language_match | critical | yes |
| manager_request_requires_handoff | error | no |
| output_length_limit | warning | no |
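Two of the checks in the table, false_confirmation and output_length_limit, can be sketched as pure functions that return a finding or nothing. This is not the project's detector logic: the regular expression, the `slotAvailable` flag, the `maxLength` parameter, and the `runChecks` helper are assumptions chosen to reproduce the booking example from this README. Only the check names, severities, and blocking flags come from the table.

```typescript
type Severity = "critical" | "error" | "warning";

interface Finding {
  check: string;
  severity: Severity;
  blocksRelease: boolean;
  message: string;
}

interface CheckContext {
  output: string;         // the AI output under test
  slotAvailable: boolean; // hypothetical flag derived from calendar evidence
  maxLength: number;      // hypothetical length limit in characters
}

type Check = (ctx: CheckContext) => Finding | null;

// Flags outputs that claim a confirmation the evidence does not support.
const falseConfirmation: Check = (ctx) => {
  const claimsConfirmation = /\b(confirmed|booked)\b/i.test(ctx.output);
  if (claimsConfirmation && !ctx.slotAvailable) {
    return {
      check: "false_confirmation",
      severity: "critical",
      blocksRelease: true,
      message: "Output confirms a booking that the evidence does not support.",
    };
  }
  return null;
};

// Flags overly long outputs; a warning only, so it never blocks on its own.
const outputLengthLimit: Check = (ctx) =>
  ctx.output.length > ctx.maxLength
    ? {
        check: "output_length_limit",
        severity: "warning",
        blocksRelease: false,
        message: `Output exceeds ${ctx.maxLength} characters.`,
      }
    : null;

function runChecks(ctx: CheckContext, checks: Check[]): { findings: Finding[]; blocked: boolean } {
  const findings = checks
    .map((check) => check(ctx))
    .filter((f): f is Finding => f !== null);
  // A single blocking finding is enough to block the run.
  return { findings, blocked: findings.some((f) => f.blocksRelease) };
}

const result = runChecks(
  { output: "Your appointment is confirmed for 18:00.", slotAvailable: false, maxLength: 500 },
  [falseConfirmation, outputLengthLimit],
);

console.log(result.blocked);                      // true
console.log(result.findings.map((f) => f.check)); // only false_confirmation fires
```

Because the checks are deterministic, the same input always produces the same findings, which is what makes them usable as release gates.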
The aggregator combines scores, thresholds, claim results, and safety findings into a launch decision.
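A weighted mean is one plausible way to fold per-dimension scores into an overall score. The sketch below is an assumption, not the logic in app/src/lib/eval/aggregate.ts. In particular, dropping unscored dimensions and renormalizing the remaining weights is a design choice made here for illustration; the README only states that unscored dimensions are never replaced by a placeholder number. The names `aggregate` and `ScoredDimension` are hypothetical.

```typescript
interface ScoredDimension {
  key: string;
  weight: number;
  score: number | null; // null means the dimension is unscored
}

interface Aggregate {
  overall: number | null; // null when no dimension could be scored
  unscored: string[];     // keys reported back instead of being hidden
}

function aggregate(dimensions: ScoredDimension[]): Aggregate {
  let weightedSum = 0;
  let weightTotal = 0;
  const unscored: string[] = [];

  for (const d of dimensions) {
    if (d.score === null) {
      unscored.push(d.key);
      continue; // contributes neither score nor weight
    }
    weightedSum += d.weight * d.score;
    weightTotal += d.weight;
  }

  return {
    overall: weightTotal > 0 ? weightedSum / weightTotal : null,
    unscored,
  };
}

const summary = aggregate([
  { key: "accuracy", weight: 0.5, score: 0.8 },
  { key: "groundedness", weight: 0.25, score: 1.0 },
  { key: "tone_fit", weight: 0.25, score: null },
]);

console.log(summary.overall?.toFixed(2)); // 0.87
console.log(summary.unscored);            // tone_fit is listed as unscored
```

Here the overall score is (0.5 × 0.8 + 0.25 × 1.0) / 0.75 ≈ 0.87: the unscored dimension changes the denominator rather than contributing a zero.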
An LLM judge can help, but it is not enough.
This tool combines:
| Layer | Purpose |
|---|---|
| LLM judge | Subjective quality |
| Claim checking | Factual grounding |
| Deterministic checks | Known failure patterns |
| Safety gates | Non-negotiable blockers |
| Human review | Calibration and edge cases |
| Reports | Repeatability and accountability |
The goal is not just to produce a score.
The goal is to explain:
- what passed
- what failed
- why it failed
- whether it is safe to ship
- what should be fixed next
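Those five questions suggest the shape of a plain-text report. The renderer below is a minimal sketch under stated assumptions: `RunReport`, `renderReport`, the field names, and the sample run are all hypothetical and do not reflect the project's actual export format.

```typescript
interface DimensionOutcome {
  key: string;
  score: number;
  threshold: number;
  rationale: string; // why the dimension received this score
}

interface RunReport {
  runId: string;
  verdict: string;
  dimensions: DimensionOutcome[];
  nextSteps: string[];
}

function renderReport(report: RunReport): string {
  const passed = report.dimensions.filter((d) => d.score >= d.threshold);
  const failed = report.dimensions.filter((d) => d.score < d.threshold);

  const lines: string[] = [
    `Run: ${report.runId}`,
    `Verdict: ${report.verdict}`,
    "",
    "Passed:",
    ...passed.map((d) => `- ${d.key} (${d.score.toFixed(2)} >= ${d.threshold.toFixed(2)})`),
    "",
    "Failed:",
    ...failed.map(
      (d) => `- ${d.key} (${d.score.toFixed(2)} < ${d.threshold.toFixed(2)}): ${d.rationale}`,
    ),
    "",
    "Fix next:",
    ...report.nextSteps.map((s) => `- ${s}`),
  ];
  return lines.join("\n");
}

// Invented sample data for illustration only.
const reportText = renderReport({
  runId: "run-example-001",
  verdict: "Needs work",
  dimensions: [
    { key: "relevance", score: 0.9, threshold: 0.7, rationale: "Answers the question asked." },
    { key: "groundedness", score: 0.6, threshold: 0.8, rationale: "Two claims lack evidence." },
  ],
  nextSteps: ["Add retrieval for pricing questions."],
});

console.log(reportText);
```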
| Route | Purpose |
|---|---|
| / | Quality dashboard across projects |
| /projects · /projects/[id] · /projects/new | Project CRUD |
| /rubrics · /rubrics/[id] · /rubrics/new | Rubric builder |
| /runs · /runs/new · /runs/[id] | Run list, batch runner, run detail |
| /cases/[id] | Case detail: scores, claim heat map, findings |
| /compare | Regression comparison between two runs |
| /datasets · /datasets/[id] | Versioned test sets |
| /evaluators | Configure scoring engines + global settings |
| /play | Manual case inspection / practice mode |
| /review · /review/[id] | Human review queue + per-case scoring |
| /reports · /reports/[id] | Exportable evaluation reports |
| /safety | Safety findings and policy violations |
| /wiki · /wiki/[slug] · /wiki/start-here | Evaluation knowledge base |
| /enter | Demo access gate (when DEMO_ACCESS_CODE set) |
POST /api/eval/run/start · POST /api/eval/run/case · POST /api/eval/run/finalize · POST /api/eval/questions · POST /api/eval/answer · POST /api/eval/claims · POST /api/eval/deterministic · POST /api/rubric/score · GET /api/index · POST /api/enter
25 pages + 10 API routes total.
- Next.js 15 · React 19
- TypeScript 5
- Tailwind CSS 3
- Supabase (PostgreSQL), @supabase/supabase-js 2
- OpenAI SDK 6
- Zod 4
- Vitest 2
git clone https://github.com/Qalipso/ai-evaluation-tool.git
cd ai-evaluation-tool/app
npm install
cp .env.local.example .env.local # fill OPENAI_API_KEY + SUPABASE_*
npm run dev

Open http://localhost:3000.
With no env, the app runs on bundled mock data (read-only, no persistence). For full live evaluation, fill the env below, apply the migrations, then seed:
# in the Supabase SQL editor, run in order:
# supabase/migrations/0001_init.sql
# supabase/migrations/0002_eval_settings.sql
# supabase/migrations/0003_datasets.sql
npm run seed
npm run dev

See docs/DEMO.md for what works in the public demo vs what requires env.
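The mock-versus-live split described above can be pictured as a small environment check. This is a sketch of the idea, not the app's actual startup code: `resolveMode` is a hypothetical helper, and treating any missing variable as a reason to fall back to mock data is an assumption. The three variable names are the ones this README lists as required for live evaluation.

```typescript
type Mode = "mock" | "live";
type Env = Record<string, string | undefined>;

const REQUIRED_FOR_LIVE = [
  "OPENAI_API_KEY",
  "SUPABASE_URL",
  "SUPABASE_SERVICE_ROLE_KEY",
] as const;

// Live mode needs every required variable to be set and non-blank.
function resolveMode(env: Env): { mode: Mode; missing: string[] } {
  const missing = REQUIRED_FOR_LIVE.filter((name) => !env[name]?.trim());
  return { mode: missing.length === 0 ? "live" : "mock", missing };
}

// In a Node process, process.env would be passed in here.
const status = resolveMode({ OPENAI_API_KEY: "sk-example" });
console.log(status.mode);    // mock
console.log(status.missing); // the two SUPABASE_* names
```

Returning the list of missing names makes it possible to tell a user exactly which variables stand between the demo and live scoring.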
# required for live evaluation
OPENAI_API_KEY=
SUPABASE_URL=
SUPABASE_SERVICE_ROLE_KEY=
# optional — public deploy hardening
DEMO_ACCESS_CODE= # enables the /enter auth gate
DEMO_SESSION_SECRET= # signs demo sessions
MAX_DAILY_LLM_USD=2 # daily spend cap (default 2)
# optional — model overrides (sensible defaults if unset)
OPENAI_JUDGE_MODEL=gpt-4o-mini
OPENAI_CLAIM_MODEL=gpt-4o-mini
OPENAI_GEN_MODEL=gpt-4o-mini
OPENAI_EMBED_MODEL=text-embedding-3-small

app/
src/
app/ Next.js routes (pages + /api)
components/ UI components
lib/
eval/ Evaluation engine (judges, claims, semantic, aggregate, run, budget)
evaluators/ Deterministic checks, safety gates, PII + language detection
eval-scenarios/ Runnable scenario library (hallucination / safety / format)
validation/ Zod schemas
wiki/ Knowledge base utilities
tests/unit/ Vitest unit tests (135) — engine, detectors, scenarios
scripts/seed.mjs Seed script
supabase/migrations/ 0001_init · 0002_eval_settings · 0003_datasets
mock-data/ Bundled read-only fallback data
docs/assets/ Motion assets (teaser + chart GIFs)
wiki/ Evaluation knowledge base (markdown)

npm run dev
npm run test
npm run lint
npm run build

Before deploying publicly:
- keep SUPABASE_SERVICE_ROLE_KEY server-side only
- do not expose raw private evaluation data in public reports
- review Supabase RLS policies
- redact sensitive user input/output where needed
- separate demo data from real production data
- avoid logging secrets, API keys, or private customer content
- Golden dataset support
- Judge calibration dashboard
- Prompt/model regression tracking
- Run comparison improvements
- CI integration for AI output tests
- More deterministic safety gates
- Public demo dataset
- Video report generation from evaluation runs
- Better test coverage for scoring and reports
Built by Eduard Shatalov as part of an AI product engineering portfolio.
The project demonstrates:
- AI product thinking
- LLM evaluation architecture
- rubric-based scoring
- claim grounding
- safety gates
- full-stack product development
- technical documentation
- product storytelling







