Per-stage provider metrics dashboard for multi-provider compile selection

## Credit

Idea from [@richardchen874-sys](https://github.com/richardchen874-sys) in [Discussion #77 — Kompl v0.2.0](https://github.com/tuirk/Kompl/discussions/77#discussioncomment-17305165). Maintainer will look at prioritizing in an upcoming sprint; **open for anyone to pick up** — comment on the issue if you're starting work so we don't duplicate effort.

## Problem

Kompl v0.2.0 added a second compile/chat backend (DeepSeek V4 Pro alongside Gemini 2.5) and per-session model lock (`compile_progress.compile_model` stamped at finalize; `getEffectiveCompileModel()` for all session LLM calls). Settings still expose one global default: pick a model, run the whole pipeline on it.

That works as a workaround (e.g. switch to DeepSeek when Gemini truncates on dense PDFs, [#7](https://github.com/tuirk/Kompl/issues/7)), but it's the wrong abstraction. The useful question isn't "which model is smarter." It's which backend reliably gets through `extract → draft → match → crossref → commit` without truncating, burning retries, or silently dropping sections.

Different providers may be better at different stages (dense extraction vs drafting vs triage). Users have no data to decide.

## Proposed direction (not a spec)

Track **provider behavior per pipeline stage**, surfaced in a **dashboard** so users can make a data-informed provider choice instead of guessing from Settings.

Metrics to track (from the discussion):

| Metric | Why it matters |
|--------|----------------|
| Truncation rate | Gemini structured-output pathology; json-repair salvage in nlp-service logs |
| Retry rate | Hidden cost + latency |
| Cost per successfully compiled source | Normalizes spend against actual output, not just raw token burn |
| Latency per pipeline step | `compile_progress` already tracks per-step status; duration is partial today |
| Structured-output validity | Parse/salvage failures vs clean completes |
| Failure recovery after partial extract | Retry paths vs silent degradation |
| Stage-specific provider fit | e.g. one provider for extract, another for draft |

**Outcome:** a dashboard that complements the existing daily spend cap in Settings. Today the default cap is $5/day, calibrated from heavy test usage while debugging the Gemini repetition/truncation bug ([#7](https://github.com/tuirk/Kompl/issues/7)). With DeepSeek V4 Pro's discounted pricing becoming permanent, that cap is arguably overkill for most normal usage. Per-stage cost and reliability data would let users tune both provider *and* budget from evidence instead of a conservative default.

Automatic per-stage routing is out of scope for v1 of this issue. Start with observability and recommendations.

## What exists today (starting points)

- Session model lock: `app/src/lib/db.ts` (`getSessionCompileModel`, `getEffectiveCompileModel`)
- Per-step progress: `compile_progress` steps + UI `X/Y` (v0.2.0)
- Provider routing: `getProviderForModel`, `nlp-service/services/llm_client.py`, provider modules under `nlp-service/services/providers/`
- Daily spend cap: `nlp-service/routers/llm.py` (`GET /llm/usage`), `llm-config.json`, Settings UI
- Truncation/salvage signals in nlp-service logs (`salvaged truncated response via json-repair`)

## Open questions (for whoever picks this up)

- **Storage:** extend SQLite (`activity` / new table) vs derive from logs vs hybrid?
- **Granularity:** per-session rollup vs per-source vs per-LLM-call-site (12+ sites in `llm_client.py`)?
- **Dashboard placement:** Settings panel, dedicated page, or compile-progress adjunct?
- **Selection UX:** recommendations only in v1, or per-stage defaults in Settings (larger scope)?
- **Mixed sessions:** same session, different stage needs. Align with per-session lock or revisit lock semantics?
- **Cap interaction:** should the dashboard suggest a daily cap based on observed spend, or stay read-only?

## Suggested first slice (optional)

Smallest useful increment: persist per-stage success/failure + latency + provider + salvage flag for one compile session, expose via API, render a minimal table in Settings (provider × stage). Defer automatic routing and cap auto-tuning.

## Files likely in scope

- `app/src/lib/db.ts` — schema / helpers for metrics persistence
- `app/src/lib/compile/` — step runners (`extract`, `draft`, `match`, `crossref`, `commit`)
- `app/src/app/api/compile/*` — HTTP shims that POST to nlp-service
- `nlp-service/services/llm_client.py` — central LLM call sites, usage metadata, salvage paths
- `app/src/app/settings/page.tsx` — dashboard UI + daily cap section (complementary placement)

## Related

- [#7](https://github.com/tuirk/Kompl/issues/7) — Gemini truncation on dense sources (motivation for $5 default cap)
- [Discussion #77](https://github.com/tuirk/Kompl/discussions/77) — v0.2.0 release thread
- [validation/2026-05-05-multi-provider-corpus/REPORT.md](https://github.com/tuirk/Kompl/blob/main/validation/2026-05-05-multi-provider-corpus/REPORT.md) — prior multi-provider corpus comparison


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Per-stage provider metrics dashboard for multi-provider compile selection #144

Credit

Problem

Proposed direction (not a spec)

What exists today (starting points)

Open questions (for whoever picks this up)

Suggested first slice (optional)

Files likely in scope

Related

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Metric	Why it matters
Truncation rate	Gemini structured-output pathology; json-repair salvage in nlp-service logs
Retry rate	Hidden cost + latency
Cost per successfully compiled source	Normalizes spend against actual output, not just raw token burn
Latency per pipeline step	`compile_progress` already tracks per-step status; duration is partial today
Structured-output validity	Parse/salvage failures vs clean completes
Failure recovery after partial extract	Retry paths vs silent degradation
Stage-specific provider fit	e.g. one provider for extract, another for draft

Per-stage provider metrics dashboard for multi-provider compile selection #144

Description

Credit

Problem

Proposed direction (not a spec)

What exists today (starting points)

Open questions (for whoever picks this up)

Suggested first slice (optional)

Files likely in scope

Related

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions