KG AI Benchmark – Implementation Plan

Context

KG AI Benchmark is a React 19 + TypeScript application used to benchmark local LLMs running through LM Studio or an OpenAI-compatible endpoint. The UI overhaul (Tailwind migration, theme system, component refactors) is largely complete, and the next major milestone is stronger LM Studio integration so that administrators can discover installed models and manage default profiles without manual setup.

Primary Outcomes

Maintain the polished Tailwind-based UI and theme system already delivered.
Deliver a transparent benchmark launch experience with in-modal progress tracking.
Expand intelligent defaults for benchmark configuration to reduce manual effort.
Integrate directly with the LM Studio REST API so the app auto-discovers available models (including capabilities metadata) and surfaces them as ready-to-use profiles.
Finish the remaining UI polish pass (spacing, status states, hover affordances) once the discovery work lands.

Current Focus: Run Launch Experience Overhaul

Problem Statement

The new run modal in src/pages/Runs.tsx (NewRunPanel) still surfaces a Launch benchmark CTA that flips to Launching… with no visible increments, making long-running batches appear stalled.
Progress only appears indirectly when the run row updates or the floating toast renders, so users cannot verify that individual questions are executing or diagnose slowdowns mid-run.

Goals

Rename the primary CTA to Run benchmark and ensure copy on the modal reflects the multi-stage experience.
Keep the modal open after submission and pivot the layout to show a live execution view without forcing navigation to the runs table.
Surface per-question status (queued → running → completed/failed), aggregate metrics (accuracy, average latency, elapsed time), and highlight the current question being evaluated.
Allow the operator to minimize/dismiss the modal once progress is visible while keeping a persistent inline indicator on the runs screen.
Provide clear failure and completion states with next-step affordances (view run details, start another run).

Non-Goals

No server-side streaming or websocket introduction—leverage executeBenchmarkRun's existing onProgress callback.
No scheduling/queuing of multiple concurrent runs; focus on the single-run experience.
Cancel/abort controls are out of scope for the first pass unless the existing engine already supports it.

Current Behavior Audit

src/pages/Runs.tsx handles launch via NewRunPanel, sets a launching boolean, and immediately closes the modal once onLaunch resolves.
handleLaunchRun triggers executeBenchmarkRun and uses upsertRun to write incremental attempts, but these updates are only visible after closing the modal.
While a run is active, a temporary toast (launchingRunId banner) instructs the user to watch the table, yet rows only show coarse Running status with no granular context.
There is no shared state that tracks "active run" metadata for components outside of the runs table, so a future progress component has nothing to subscribe to.

Proposed UX Flow

Configure run – identical form as today with improved CTA text.
Running state – modal transitions into a progress dashboard:
- Timeline/progress bar showing % completion derived from attempts vs selected questions.
- Scrollable list of selected questions (label + type) with status pills and latency once answered.
- Hero section with current question id, step-level copy, elapsed time, and recent metrics.
Completion state – modal shows summary (accuracy, average latency, failures) and offers buttons to view run detail or start another run. In the failure case, surface error notes for the last attempt and link to diagnostics.

UX & Copy Requirements

Primary button copy Run benchmark (idle) → Starting… (while awaiting first update) → View run/Close.
Secondary actions: Back to editing prior to launch; once running, replace with Hide window that simply dismisses the modal but keeps run active.
Status tags: Queued, Running, Passed, Failed. Should reuse Tailwind token colors already defined for status pills.

Data & State Updates

Extend BenchmarkContext with an activeRun slice (id, questionIds, attempts, progress, startedAt, status, lastUpdatedAt) that persists while the app session is open.
Update handleLaunchRun to populate the activeRun store before calling executeBenchmarkRun and to emit progress snapshots from the onProgress callback.
Provide a derived selector/hook (e.g., useActiveRunProgress()) that the modal and a lightweight runs-table indicator can consume.
Preserve existing runs entries so that if the modal is closed, returning to Runs shows consistent data.
Consider debouncing updates to avoid re-render storms (batch updates or requestAnimationFrame).

Technical Tasks & Phasing

Discovery & scaffolding
- Inventory question metadata needed for progress rows (prompt, display id, type) and ensure it is available without repeated lookups.
- Draft TypeScript types for the new ActiveRunProgress structure and integrate with context reducer/storage (unit tests for reducer transitions).
Progress plumbing
- Refactor handleLaunchRun to stream progress events into context, including derived metrics (elapsed time, average latency).
- Ensure failure paths and completion events finalize activeRun with terminal status and timestamps.
- Decide how to handle modal dismissal (keep state in context, rehydrate modal if user reopens while run active).
UI implementation
- Convert NewRunPanel into a multi-step component (mode: 'form' | 'progress' | 'complete') with smooth transitions.
- Build new progress subcomponents (progress header, metrics widgets, question list, empty states) using Tailwind primitives.
- Update CTA/button logic and ensure accessibility (focus management when switching states).
Post-run handoff
- Inject inline indicator/banner on the Runs page when an active run exists (e.g., status chip above the table with link to reopen modal).
- Write copy/affordances for completion and error states that direct users to run detail or diagnostics.
- Update documentation (requirement-doc.md or README) as needed.

Detailed To-Do

Testing & Validation

Unit-test reducer transitions for all new actions (start, progress tick, completion, failure, reset).
Component tests for NewRunPanel multi-step flow to ensure UI switches when progress events stream in.
Manual verification against a long-running profile to confirm progress list updates without freezing.
Regression pass on runs table and existing run detail view to ensure persisted data unchanged.

Dependencies & Risks

Need reliable timestamps and metrics from executeBenchmarkRun; latency spikes or errors must not crash the progress UI.
Modal staying mounted for long runs could increase memory usage; ensure list virtualization is unnecessary given question counts (<100).
If multiple runs are launched concurrently (future scenario), we must clarify how activeRun behaves (likely single active run only).
Ensure storage persistence does not record activeRun snapshot in localStorage unless specifically desired (avoid stale resume on reload).

Open Questions

Should closing the modal cancel the run, or merely hide the UI? Current thinking: hide only, but needs confirmation.
Do we want to allow launching another run while one is active? If not, disable CTA globally until completion.
Is there interest in exporting a progress log or metrics snapshot directly from the modal?

Workstreams & Status

A. Tailwind CSS Migration (Complete)

✅ Dependencies installed (tailwindcss, postcss, autoprefixer).
✅ Tailwind config with brand palette, spacing scale, dark mode.
✅ Global styles replaced with Tailwind directives; legacy CSS removed.

B. Theme System (Complete)

✅ ThemeContext with auto/Light/Dark modes and localStorage persistence.
✅ App wrapped in ThemeProvider; sidebar toggle cycles Light → Dark → Auto.
✅ Dark-mode color tokens wired into Tailwind config.

C. Intelligent Defaults (Complete)

✅ DEFAULT_PROFILE_VALUES centralizes model parameters (temperature, topP, penalties).
✅ ModelProfile types extended with topP, frequencyPenalty, presencePenalty.
✅ Profile form updated with helper text + new inputs; LM Studio client sends the values.

D. Component Library & Layout Refresh (Complete)

✅ Modal, Button, and Card primitives implemented with Tailwind variants.
✅ Dashboard, Profiles, Runs, and Run Detail screens migrated to Tailwind layout patterns.
✅ Diagnostics and run workflows respect the new design system.

E. UI Polish (Phase 5, Not Started)

⬜ Apply consistent spacing scale across pages.
⬜ Enhance card/panel styling for visual hierarchy.
⬜ Improve status pills, table hover states, and form field affordances.
⬜ Add subtle transitions where beneficial.

F. LM Studio Model Discovery & Auto Profiles (New, In Progress)

Goal: When LM Studio is running locally, automatically fetch its available models, capture capability metadata, and present them as pre-populated profiles that can be adopted or customized.

🟡 Discovery Client
- Implement lmStudioDiscoveryClient service to call GET /api/v0/models (rich metadata) with fallback to GET /v1/models.
- Normalize payload into internal DiscoveredModel records (id, type, context window, quantization, capabilities array, load state).
⬜ Benchmark Context Integration
- Extend BenchmarkContext to store discovered models alongside saved profiles.
- Add load/refresh actions that fetch on app start and via manual refresh in Profiles tab.
⬜ Auto Profile Generation
- Map discovered models into default profile entries (base URL, model id, capability tags) while keeping user-created profiles separate.
- Flag autogenerated profiles to prevent accidental edits from overwriting persistent user data.
⬜ UI Updates
- Update Profiles tab to display “Discovered Models” section with capability badges (e.g., tool_use, vision, embeddings) and load state.
- Provide CTA to adopt/clone a discovered model into a persistent profile.
⬜ Resilience & Telemetry
- Handle offline LM Studio gracefully (retry strategy, UI messaging).
- Surface last sync timestamp and error state in diagnostics drawer.

G. Run Launch Experience Overhaul

✅ Draft progress-driven UX requirements in PLAN.md (capture problem, goals, technical approach).
✅ Implement activeRun state management and selectors in BenchmarkContext.
✅ Add inline active-run indicator and completion affordances on the Runs page.
🟡 Replace modal-driven progress flow with a full-screen live dashboard.
- ✅ Redirect new run launches to the dedicated progress/detail route immediately after submission.
- ✅ Redesign RunDetail to support active runs with live metrics, dataset summary, and status-aware hero treatment.
- ✅ Build a question navigator layout (left status rail + right attempt inspector) that works for in-progress and completed runs.
- ✅ Auto-focus the inspector on the currently running question while honoring manual overrides.
- ✅ Keep the run creation experience scoped to configuration only (modal/panel) and ensure "View progress" opens the dashboard.
- 🟡 Validate that completed runs continue to render cleanly with the updated UX and data model.

H. Multi-Step Benchmark Pipeline Restoration (New, High Priority)

Goal: Reinstate the documented multi-call execution flow so every question triggers distinct topology and answer evaluations, with results captured per step.

⬜ Engine Refactor
- Update executeBenchmarkRun to iterate through profile.benchmarkSteps, issuing separate chat completions for enabled steps (topology first, final answer second at minimum).
- Ensure prompts honor each step’s promptTemplate, injecting question context plus prior step outputs when needed.
⬜ Attempt Persistence
- Extend BenchmarkAttempt to store per-step request/response payloads, inferred topology metadata, and evaluation notes.
- Persist topology comparison outcomes alongside answer scoring so downstream analytics can surface accuracy per step.
⬜ Evaluation Enhancements
- Add a topology validator that compares the model’s predicted subject/topic/subtopic to the ground truth before scoring answers.
- Propagate step-level pass/fail metrics into aggregated run stats (e.g., topology accuracy, answer accuracy, combined score).
⬜ UI & Reporting
- Surface step timelines in Run Detail (e.g., topology vs answer results, latency per step, JSON payloads).
- Update readiness diagnostics to exercise the full multi-step flow and fail fast when topology extraction breaks.
⬜ Config & Defaults
- Refresh default benchmark steps to include explicit topology and answer templates and document customization rules in Profiles.
- Backfill existing saved profiles/runs with default steps if missing to avoid runtime regressions.

I. Supabase Persistence Layer (New, In Progress)

Goal: Replace localStorage with Supabase-backed storage so profiles, diagnostics, and runs survive reloads and support multi-device use.

✅ Client bootstrap
- Add Supabase JS client configured via VITE_SUPABASE_URL/VITE_SUPABASE_ANON_KEY.
- Load and persist benchmark data (profiles, runs, diagnostics) through Supabase instead of localStorage.
⬜ Database schema & migrations
- Create profiles and runs tables with id (uuid), data (jsonb), updated_at, plus helper columns (name, status, etc.).
- Add row-level security policies (RLS) aligned with anonymous key usage or replace with service-role key when auth is added.
⬜ API hardening
- Handle offline/latency scenarios with optimistic updates and retry queue.
- Add structured error surfacing in the UI when persistence fails.
⬜ Future enhancements
- Split large run payloads into normalized tables for analytics queries.
- Introduce user auth + multi-tenant scoping before production rollout.

Upcoming Milestones

Restore the multi-step benchmark pipeline (topology + answer calls, per-step metrics, UI surfaces).
Ship the run launch progress experience (active run plumbing, multi-step modal, inline indicator).
Finish LM Studio discovery client and context plumbing so models auto-populate on load.
Ship Profiles UI updates that clearly separate discovered models from saved profiles and allow quick adoption.
Complete Phase 5 polish tasks once the discovery experience is stable.

Risks & Mitigations

LM Studio server availability – Mitigate with retries, offline messaging, and manual refresh controls.
Payload drift between /api/v0/models versions – Guard with type-safe parsing, optional chaining, and capability defaults.
Profile confusion – Clearly label autogenerated entries and require explicit “Save as profile” before editing parameters.
Run progress rendering cost – Frequent progress ticks could trigger heavy React updates; batch reducer dispatches and memoize derived selectors to keep the UI responsive.
Multi-step regression risk – Introduce integration and smoke checks that verify topology and answer steps run in sequence before shipping.

Open Questions

Should discovered models auto-refresh in the background or only on user action?
How should we persist capability metadata if LM Studio is unavailable after initial discovery?
Do we want to pre-filter models (e.g., hide unloaded ones when JIT loading is off) or show the full list with load-state badges?
When the run progress modal is dismissed, where should the user reopen it (banner, toast, pinned button)?
Do we need a cancel/abort affordance in the progress view for long-running or stuck attempts?
Should active-run state persist across reloads, or is it acceptable to revert to table-only updates after refresh?

Progress Tracking Legend

✅ Completed
🟡 In Progress
⬜ Not Started
⏭️ Skipped

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

KG AI Benchmark – Implementation Plan

Context

Primary Outcomes

Current Focus: Run Launch Experience Overhaul

Problem Statement

Goals

Non-Goals

Current Behavior Audit

Proposed UX Flow

UX & Copy Requirements

Data & State Updates

Technical Tasks & Phasing

Detailed To-Do

Testing & Validation

Dependencies & Risks

Open Questions

Workstreams & Status

A. Tailwind CSS Migration (Complete)

B. Theme System (Complete)

C. Intelligent Defaults (Complete)

D. Component Library & Layout Refresh (Complete)

E. UI Polish (Phase 5, Not Started)

F. LM Studio Model Discovery & Auto Profiles (New, In Progress)

G. Run Launch Experience Overhaul

H. Multi-Step Benchmark Pipeline Restoration (New, High Priority)

I. Supabase Persistence Layer (New, In Progress)

Upcoming Milestones

Risks & Mitigations

Open Questions

Progress Tracking Legend

FilesExpand file tree

PLAN.md

Latest commit

History

PLAN.md

File metadata and controls

KG AI Benchmark – Implementation Plan

Context

Primary Outcomes

Current Focus: Run Launch Experience Overhaul

Problem Statement

Goals

Non-Goals

Current Behavior Audit

Proposed UX Flow

UX & Copy Requirements

Data & State Updates

Technical Tasks & Phasing

Detailed To-Do

Testing & Validation

Dependencies & Risks

Open Questions

Workstreams & Status

A. Tailwind CSS Migration (Complete)

B. Theme System (Complete)

C. Intelligent Defaults (Complete)

D. Component Library & Layout Refresh (Complete)

E. UI Polish (Phase 5, Not Started)

F. LM Studio Model Discovery & Auto Profiles (New, In Progress)

G. Run Launch Experience Overhaul

H. Multi-Step Benchmark Pipeline Restoration (New, High Priority)

I. Supabase Persistence Layer (New, In Progress)

Upcoming Milestones

Risks & Mitigations

Open Questions

Progress Tracking Legend