This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
KG AI Benchmark is a React + TypeScript application for benchmarking local LLMs hosted in LM Studio (or any OpenAI-compatible runtime) using a curated GATE PYQ (Previous Year Questions) dataset. The app is client-side only, with benchmark state persisted to Supabase.
- `npm run dev` - Start Vite dev server at http://localhost:5173
- `npm run build` - Type-check with TypeScript and build production bundle
- `npm run lint` - Run ESLint with TypeScript rules
- `npm run preview` - Preview production build locally
- No test suite is currently configured
- Supabase stores profiles, diagnostics, and run history; ensure the environment has `VITE_SUPABASE_URL` and `VITE_SUPABASE_ANON_KEY` configured (see the sketch below).
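A minimal sketch of how the storage layer can read those variables, assuming the standard `@supabase/supabase-js` client and Vite's `import.meta.env`; the actual setup in src/services/storage.ts may differ:

```ts
// Sketch only; the real client setup in src/services/storage.ts may differ.
import { createClient } from '@supabase/supabase-js';

const supabaseUrl = import.meta.env.VITE_SUPABASE_URL;
const supabaseAnonKey = import.meta.env.VITE_SUPABASE_ANON_KEY;

if (!supabaseUrl || !supabaseAnonKey) {
  throw new Error('Missing VITE_SUPABASE_URL or VITE_SUPABASE_ANON_KEY');
}

export const supabase = createClient(supabaseUrl, supabaseAnonKey);
```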
The application uses a centralized BenchmarkContext (src/context/BenchmarkContext.tsx) as the single source of truth:
- Profiles (`ModelProfile[]`) - LM Studio connection configs, diagnostics history, and benchmark step configurations
- Runs (`BenchmarkRun[]`) - Benchmark executions with attempts, metrics, and evaluation results
- Questions - Static dataset loaded from `src/data/questions.ts` (100 GATE PYQs)
- Topology - Subject/topic/subtopic catalog loaded from `src/data/topology.ts`
State updates flow through a reducer pattern with actions:
- `UPSERT_PROFILE` / `DELETE_PROFILE` - Manage model configurations
- `UPSERT_RUN` / `DELETE_RUN` - Manage benchmark runs
- `RECORD_DIAGNOSTIC` - Store diagnostic results in profile history
All state changes automatically sync to Supabase via src/services/storage.ts.
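A minimal sketch of the action contract implied above; the action names come from this section, but the payload field names and the import path are assumptions:

```ts
// Hypothetical action shapes; payload field names are assumptions, not the repo's exact types.
import type { ModelProfile, BenchmarkRun, DiagnosticsResult } from '@/types/benchmark';

type BenchmarkAction =
  | { type: 'UPSERT_PROFILE'; profile: Partial<ModelProfile> & { id: string } }
  | { type: 'DELETE_PROFILE'; profileId: string }
  | { type: 'UPSERT_RUN'; run: Partial<BenchmarkRun> & { id: string } }
  | { type: 'DELETE_RUN'; runId: string }
  | { type: 'RECORD_DIAGNOSTIC'; profileId: string; diagnostic: DiagnosticsResult };

// Example dispatch from a component (illustrative):
// dispatch({ type: 'RECORD_DIAGNOSTIC', profileId: profile.id, diagnostic: result });
```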
Both profiles and runs go through normalization functions (`normalizeProfile`, `normalizeRun`, sketched after this list) that:
- Merge partial updates with existing records
- Apply defaults from `src/data/defaults.ts`
- Ensure all required fields are present
- Maintain `createdAt`/`updatedAt` timestamps
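A minimal sketch of the normalization pattern described above; `defaultProfile` is an assumed export name, and the real `normalizeProfile` may cover more fields:

```ts
// Sketch of the merge-with-defaults pattern; not the repo's exact implementation.
import type { ModelProfile } from '@/types/benchmark';
import { defaultProfile } from '@/data/defaults'; // assumed export name

function normalizeProfile(
  partial: Partial<ModelProfile> & { id: string },
  existing?: ModelProfile,
): ModelProfile {
  const now = new Date().toISOString();
  return {
    ...defaultProfile,  // apply defaults first
    ...existing,        // then the existing record, if any
    ...partial,         // then the partial update wins
    createdAt: existing?.createdAt ?? now,
    updatedAt: now,
  };
}
```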
lmStudioClient.ts
- `sendChatCompletion()` - Sends chat requests to `/v1/chat/completions`
- Automatic JSON-mode fallback: if the server rejects `response_format: { type: 'json_object' }`, retries without it and sets `fallbackUsed: true`
- Returns `ChatCompletionResult` with usage stats and raw response
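A hypothetical usage sketch; the actual `sendChatCompletion` signature may differ, and the option names (`baseUrl`, `model`, `messages`, `jsonMode`) are assumptions:

```ts
// Illustrative call; option names are assumptions about the client's signature.
import { sendChatCompletion } from '@/services/lmStudioClient';

const result = await sendChatCompletion({
  baseUrl: profile.baseUrl,   // e.g. a local LM Studio endpoint
  model: profile.modelId,
  messages: [{ role: 'user', content: prompt }],
  jsonMode: profile.metadata.supportsJsonMode !== false,
});

if (result.fallbackUsed) {
  // The server rejected response_format; the reply may not be strict JSON,
  // so parseModelResponse() must handle plain text.
}
```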
diagnostics.ts
- `runDiagnostics()` - Executes two-tier diagnostics:
  - HANDSHAKE (Level 1): Validates server connectivity and JSON-mode support
  - READINESS (Level 2): Runs a sample question through the full pipeline to verify answer parsing and evaluation
- Stores results in the profile's `diagnostics[]` array with timestamps and structured logs
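A hypothetical call sequence; the real `runDiagnostics` signature and the status value are assumptions:

```ts
// Illustrative flow only; argument order and the 'passed' status string are assumptions.
import { runDiagnostics } from '@/services/diagnostics';

const handshake = await runDiagnostics(profile, 'HANDSHAKE');
if (handshake.status === 'passed') {
  const readiness = await runDiagnostics(profile, 'READINESS');
  dispatch({ type: 'RECORD_DIAGNOSTIC', profileId: profile.id, diagnostic: readiness });
}
```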
benchmarkEngine.ts
- `executeBenchmarkRun()` - Orchestrates the full benchmark execution:
  - Iterates through selected questions
  - Builds prompts with question text, options, and formatting instructions
  - Calls LM Studio for each question
  - Parses responses via `parseModelResponse()` (`src/services/evaluation.ts`)
  - Evaluates answers via `evaluateModelAnswer()`
  - Aggregates metrics (accuracy, latency, pass/fail counts)
  - Streams progress via `onProgress` callback
  - Supports `AbortSignal` for cancellation
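A hypothetical invocation sketch; the option names (`profile`, `questions`, `onProgress`, `signal`) and the progress payload shape are assumptions, but the cancellation and progress mechanics match the list above:

```ts
// Illustrative only; the engine's exact options object may be shaped differently.
import { executeBenchmarkRun } from '@/services/benchmarkEngine';

const controller = new AbortController();

const run = await executeBenchmarkRun({
  profile,
  questions: selectedQuestions,
  onProgress: ({ completed, total }) => {
    console.log(`Answered ${completed}/${total} questions`);
  },
  signal: controller.signal, // call controller.abort() to cancel mid-run
});
```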
evaluation.ts
- `parseModelResponse()` - Extracts `answer`, `explanation`, `confidence` from JSON or plain text
- `evaluateModelAnswer()` - Type-specific evaluation:
  - MCQ: Single option match (A, B, C, D)
  - MSQ: Multiple options match (A,C or A, C)
  - NAT: Numeric range validation or exact string match
  - TRUE_FALSE: Boolean comparison
- Returns `BenchmarkAttemptEvaluation` with expected/received/passed/score
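An illustrative sketch of the MSQ comparison idea only (order- and whitespace-insensitive option matching); the real `evaluateModelAnswer` also handles MCQ, NAT ranges, and TRUE_FALSE, and its signature may differ:

```ts
// Normalizes "A, C" and "A,C" to the same canonical form before comparing.
function msqMatches(expected: string, received: string): boolean {
  const normalize = (value: string) =>
    value
      .split(',')
      .map((option) => option.trim().toUpperCase())
      .filter(Boolean)
      .sort()
      .join(',');
  return normalize(expected) === normalize(received);
}
```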
/ (AppLayout wrapper)
├── /dashboard - KPI cards, trend charts (Recharts), latest runs table
├── /profiles - Profile list, diagnostics panel, CRUD operations
├── /runs - Runs table with filters, "New run" flow
└── /runs/:runId - Run detail with metrics, charts, attempt breakdown
- `BenchmarkQuestion` - Question data with type, prompt, options, answer key, metadata
- `ModelProfile` - LM Studio config + benchmark steps + diagnostics history + metadata (`supportsJsonMode`, `lastHandshakeAt`, `lastReadinessAt`)
- `BenchmarkRun` - Run metadata + question IDs + attempts + metrics
- `BenchmarkAttempt` - Single question evaluation with request/response payloads, latency, tokens, and evaluation results
- `DiagnosticsResult` - Diagnostic run with level, status, logs, and metadata
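A rough shape inferred from the descriptions above; the authoritative definitions live in src/types/benchmark.ts and include more fields, so the names marked below are assumptions:

```ts
// Inferred sketch only; field names such as latencyMs and tokens are assumptions.
interface BenchmarkAttemptEvaluation {
  expected: string;
  received: string;
  passed: boolean;
  score: number;
}

interface BenchmarkAttempt {
  questionId: string;
  requestPayload: unknown;
  responsePayload: unknown;
  latencyMs: number;                                // assumed field name
  tokens?: { prompt: number; completion: number };  // assumed shape
  evaluation: BenchmarkAttemptEvaluation;
}
```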
The question loader (src/data/questions.ts) ingests pyq-gate-sample.json:
- 100 normalized GATE PYQs
- Types: MCQ, MSQ, NAT, TRUE_FALSE
- Plain-string prompts/options (legacy rich-text fields removed Oct 2025)
- Metadata includes topology (subject/topic/subtopic), PYQ year/exam/branch, and tags
The topology catalog (src/data/topology.ts) loads pyq-gate-sample-topology.json for subject/topic/subtopic filters.
All modals should have translucent backgrounds (per user global instructions).
Always respect `profile.metadata.supportsJsonMode` (sketched below):
- If `false`, skip `response_format` in chat completion requests
- If `undefined`, attempt JSON mode and fall back on error (handled automatically by `sendChatCompletion`)
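A minimal sketch of that decision when building a request body; field names other than `supportsJsonMode` are assumptions:

```ts
// Illustrative only; profile.modelId is an assumed field name.
const { supportsJsonMode } = profile.metadata;

const body: Record<string, unknown> = {
  model: profile.modelId,
  messages,
};

if (supportsJsonMode !== false) {
  // true or undefined: attempt JSON mode; sendChatCompletion retries without
  // response_format (and sets fallbackUsed) if the server rejects it.
  body.response_format = { type: 'json_object' };
}
```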
The UI blocks benchmark launch until both HANDSHAKE and READINESS diagnostics pass for the selected profile. This ensures the model can handle the full pipeline before executing long-running evaluations.
Prompts follow a consistent structure (see `buildQuestionPrompt` in `benchmarkEngine.ts`; an illustrative example follows this list):
- Question header with type
- Optional instructions
- Options list (A, B, C, D for MCQ/MSQ)
- JSON format instructions: `{ "answer": "...", "explanation": "...", "confidence": 0-1 }`
- Special handling for NAT numeric ranges
- Supabase currently stores complete benchmark payloads; future work may add per-user auth and server-side analytics.
- If migrating away from Supabase, mirror the contracts defined in `src/types/benchmark.ts` and update `src/services/storage.ts` accordingly.
- Cancellation controls and progress indicators for running benchmarks
- FILL_BLANK and descriptive question evaluation with rubric scoring
- Dataset import/export (CSV/JSON)
- Screenshot capture for completed runs
- Server-side caching or virtualization for large question banks
The project uses `@/` as an alias for `src/`:
- Configured in `vite.config.ts` (`resolve.alias`); a sketch follows this list
- Use `@/components/...`, `@/services/...`, etc. in imports
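A sketch of the alias wiring; the repo's actual `vite.config.ts` may differ, and TypeScript usually needs a matching `paths` entry in `tsconfig.json` for editor resolution:

```ts
// Sketch only; not the repo's exact config.
import { defineConfig } from 'vite';
import react from '@vitejs/plugin-react';
import { fileURLToPath, URL } from 'node:url';

export default defineConfig({
  plugins: [react()],
  resolve: {
    alias: {
      // Map @/ to the src/ directory so imports like '@/services/storage' resolve.
      '@': fileURLToPath(new URL('./src', import.meta.url)),
    },
  },
});
```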
- React 19 with functional components and hooks
- Strict TypeScript with ESLint rules
- Recharts for data visualization
- No emoji usage unless explicitly requested