
Per-stage provider metrics dashboard for multi-provider compile selection #144

Description

@tuirk

Credit

Idea from @richardchen874-sys in Discussion #77 (Kompl v0.2.0). The maintainer will look at prioritizing this in an upcoming sprint, but it is open for anyone to pick up. If you start work, comment on the issue so we don't duplicate effort.

Problem

Kompl v0.2.0 added a second compile/chat backend (DeepSeek V4 Pro alongside Gemini 2.5) and a per-session model lock: compile_progress.compile_model is stamped at finalize, and getEffectiveCompileModel() is used for all session LLM calls. Settings still exposes only one global default: pick a model, run the whole pipeline on it.

That works as a workaround (e.g. switch to DeepSeek when Gemini truncates on dense PDFs, #7), but it's the wrong abstraction. The useful question isn't "which model is smarter." It's which backend reliably gets through extract → draft → match → crossref → commit without truncating, burning retries, or silently dropping sections.

Different providers may be better at different stages (dense extraction vs drafting vs triage). Users have no data to decide.

Proposed direction (not a spec)

Track provider behavior per pipeline stage, surfaced in a dashboard so users can make a data-informed provider choice instead of guessing from Settings.

Metrics to track (from the discussion):

| Metric | Why it matters |
| --- | --- |
| Truncation rate | Gemini structured-output pathology; json-repair salvage in nlp-service logs |
| Retry rate | Hidden cost + latency |
| Cost per successfully compiled source | Normalizes spend against actual output, not just raw token burn |
| Latency per pipeline step | compile_progress already tracks per-step status; duration is partial today |
| Structured-output validity | Parse/salvage failures vs clean completes |
| Failure recovery after partial extract | Retry paths vs silent degradation |
| Stage-specific provider fit | e.g. one provider for extract, another for draft |
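To make the metric definitions concrete, here is a rough sketch of how several of them could be derived from per-call records. Nothing here exists in Kompl today: `CallRecord`, its field names, and `summarize` are all hypothetical, and the exact definitions (for example, whether a salvaged response counts as truncated) are still open.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class CallRecord:
    """One LLM call. Every field name is a hypothetical placeholder."""
    provider: str
    stage: str        # extract, draft, match, crossref, or commit
    source_id: str
    attempt: int      # 1 for the first try, 2+ for retries
    truncated: bool   # response was cut off (whether or not it was salvaged)
    parsed_ok: bool   # structured output parsed cleanly, without salvage
    cost_usd: float
    latency_s: float


def summarize(records, compiled_sources):
    """Roll call records up to per-(provider, stage) metrics.

    compiled_sources is the set of source_ids that made it through the
    whole pipeline; cost is normalized against those, not raw call count.
    """
    groups = defaultdict(list)
    for r in records:
        groups[(r.provider, r.stage)].append(r)

    out = {}
    for key, rs in groups.items():
        n = len(rs)
        ok_sources = {r.source_id for r in rs if r.source_id in compiled_sources}
        total_cost = sum(r.cost_usd for r in rs)
        out[key] = {
            "calls": n,
            "truncation_rate": sum(r.truncated for r in rs) / n,
            "retry_rate": sum(r.attempt > 1 for r in rs) / n,
            "validity_rate": sum(r.parsed_ok for r in rs) / n,
            "mean_latency_s": sum(r.latency_s for r in rs) / n,
            # None when no source in this group compiled successfully.
            "cost_per_compiled_source": (
                total_cost / len(ok_sources) if ok_sources else None
            ),
        }
    return out
```

Keying on (provider, stage) keeps the "stage-specific provider fit" row a simple comparison of two rows with the same stage.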

Outcome: a dashboard that complements the existing daily spend cap in Settings. Today the default cap is $5/day, calibrated from heavy test usage while debugging the Gemini repetition/truncation bug (#7). With DeepSeek V4 Pro's discounted pricing becoming permanent, that cap is arguably overkill for most normal usage. Per-stage cost and reliability data would let users tune both provider and budget from evidence instead of a conservative default.

Automatic per-stage routing is out of scope for v1 of this issue. Start with observability and recommendations.
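For the "recommendations" half, one minimal shape is to pick, per stage, the provider with the highest clean-completion rate once there is enough data. This is only a sketch: the `stats` layout, the "clean" definition (no truncation, retry, or salvage), and the `min_calls` threshold are assumptions, not anything decided in the discussion.

```python
def recommend(stats, min_calls=20):
    """Suggest a provider per stage from observed clean-completion rates.

    stats maps (provider, stage) -> {"calls": int, "clean": int}, where
    "clean" counts calls that finished with no truncation, retry, or salvage.
    Groups with fewer than min_calls calls are ignored as too noisy.
    Returns {stage: (provider, clean_rate)}.
    """
    best = {}
    for (provider, stage), s in stats.items():
        if s["calls"] < min_calls:
            continue
        rate = s["clean"] / s["calls"]
        if stage not in best or rate > best[stage][1]:
            best[stage] = (provider, rate)
    return best
```

The output is advice only; it never changes which model a session runs on, so it stays compatible with the per-session lock.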

What exists today (starting points)

  • Session model lock: app/src/lib/db.ts (getSessionCompileModel, getEffectiveCompileModel)
  • Per-step progress: compile_progress steps + UI X/Y (v0.2.0)
  • Provider routing: getProviderForModel, nlp-service/services/llm_client.py, provider modules under nlp-service/services/providers/
  • Daily spend cap: nlp-service/routers/llm.py (GET /llm/usage), llm-config.json, Settings UI
  • Truncation/salvage signals in nlp-service logs (salvaged truncated response via json-repair)

Open questions (for whoever picks this up)

  • Storage: extend SQLite (activity / new table) vs derive from logs vs hybrid?
  • Granularity: per-session rollup vs per-source vs per-LLM-call-site (12+ sites in llm_client.py)?
  • Dashboard placement: Settings panel, dedicated page, or compile-progress adjunct?
  • Selection UX: recommendations only in v1, or per-stage defaults in Settings (larger scope)?
  • Mixed sessions: same session, different stage needs. Align with per-session lock or revisit lock semantics?
  • Cap interaction: should the dashboard suggest a daily cap based on observed spend, or stay read-only?
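On the cap question, if the answer ends up being "suggest, but don't apply", one read-only form could look like the sketch below. The policy is entirely an assumption for illustration: a nearest-rank 95th percentile of observed daily spend, multiplied by a headroom factor, with a small floor. `suggest_daily_cap` and its parameters are hypothetical.

```python
import math


def suggest_daily_cap(daily_spend_usd, headroom=1.5, floor_usd=0.50):
    """Suggest a daily cap from observed per-day spend (read-only advice).

    Uses the nearest-rank 95th percentile of the observed days, scaled by
    `headroom`, and never suggests less than `floor_usd`.
    Returns None when there is no history to base a suggestion on.
    """
    if not daily_spend_usd:
        return None
    ordered = sorted(daily_spend_usd)
    rank = max(1, math.ceil(0.95 * len(ordered)))  # 1-based nearest rank
    p95 = ordered[rank - 1]
    return round(max(floor_usd, p95 * headroom), 2)
```

A percentile rather than the maximum keeps one unusually heavy debugging day from inflating the suggestion indefinitely once enough history accumulates.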

Suggested first slice (optional)

Smallest useful increment: persist per-stage success/failure + latency + provider + salvage flag for one compile session, expose via API, render a minimal table in Settings (provider × stage). Defer automatic routing and cap auto-tuning.
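A rough sketch of what that slice could persist and query. The real schema would presumably live in app/src/lib/db.ts; Python's sqlite3 is used here only to keep the example self-contained, and the SQL is the part meant to carry over. The `stage_metrics` table, its columns, and both helper functions are hypothetical.

```python
import sqlite3

# Hypothetical table: one row per stage run within a compile session.
SCHEMA = """
CREATE TABLE IF NOT EXISTS stage_metrics (
    id          INTEGER PRIMARY KEY,
    session_id  TEXT    NOT NULL,
    stage       TEXT    NOT NULL,  -- extract, draft, match, crossref, commit
    provider    TEXT    NOT NULL,
    success     INTEGER NOT NULL,  -- 0 or 1
    salvaged    INTEGER NOT NULL,  -- 0 or 1: output needed json-repair salvage
    latency_ms  INTEGER NOT NULL
);
"""


def record_stage(conn, session_id, stage, provider, success, salvaged, latency_ms):
    """Persist the outcome of one stage run."""
    conn.execute(
        "INSERT INTO stage_metrics "
        "(session_id, stage, provider, success, salvaged, latency_ms) "
        "VALUES (?, ?, ?, ?, ?, ?)",
        (session_id, stage, provider, int(success), int(salvaged), latency_ms),
    )


def provider_stage_table(conn, session_id):
    """Return (provider, stage, runs, successes, salvaged, avg_latency_ms) rows."""
    return conn.execute(
        "SELECT provider, stage, COUNT(*), SUM(success), SUM(salvaged), "
        "AVG(latency_ms) "
        "FROM stage_metrics WHERE session_id = ? "
        "GROUP BY provider, stage ORDER BY provider, stage",
        (session_id,),
    ).fetchall()
```

Usage would be along the lines of `conn = sqlite3.connect(":memory:")`, `conn.executescript(SCHEMA)`, a `record_stage(...)` call per stage run, then `provider_stage_table(conn, session_id)` behind whatever API route feeds the Settings table.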

Files likely in scope

  • app/src/lib/db.ts — schema / helpers for metrics persistence
  • app/src/lib/compile/ — step runners (extract, draft, match, crossref, commit)
  • app/src/app/api/compile/* — HTTP shims that POST to nlp-service
  • nlp-service/services/llm_client.py — central LLM call sites, usage metadata, salvage paths
  • app/src/app/settings/page.tsx — dashboard UI + daily cap section (complementary placement)

Labels: enhancement, up-for-grabs
