Model Quartermaster

scarecr0w12 edited this page Jun 24, 2026 · 6 revisions

Model Quartermaster (MQM)

MQM is a learning-based model selection engine that dynamically routes requests to the most appropriate LLM based on task characteristics, historical performance, cost constraints, and learned patterns. It is implemented in packages/infra/src/model-quartermaster/ and registered as a pipeline hook at pre-llm and post-llm stages.

The legacy Quartermaster (packages/infra/src/quartermaster/) handles tool orchestration learning (predicting which tool to use next) and is documented separately.

Architecture

6-Signal Prediction Engine

| Signal | Source | Purpose |
| --- | --- | --- |
| Historical performance | Per-task category stats | Which model performed best for this task type |
| Episodic memory hits | Memory system | Similar past requests and their outcomes |
| Cost optimization | Provider cost data | Token/$ efficiency |
| Quality estimation | Reflection feedback | Confidence and correctness scores |
| Trajectory patterns | Recent model usage | Context from the current session |
| Reflection feedback | Post-turn analysis | Self-assessment of response quality |

Signals are fused via weighted combination to predict the best model before each LLM call.
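As a rough illustration of weighted fusion, the sketch below scores each candidate model as a weighted average of its six signal scores and picks the highest. All names (`SignalScores`, `fuse`, `predictBest`) are hypothetical. The assumptions that every signal is normalized to [0, 1] and that the fused score doubles as the prediction confidence are made here for illustration and are not confirmed by this page.

```typescript
// Hypothetical types: the keys mirror the six signals in the table above,
// but the real MQM type names and score scale are not documented here.
type SignalName =
  | "historicalPerformance"
  | "episodicMemory"
  | "costOptimization"
  | "qualityEstimation"
  | "trajectoryPatterns"
  | "reflectionFeedback";

type SignalScores = Record<SignalName, number>; // each score assumed in [0, 1]

interface Prediction {
  model: string;
  confidence: number; // assumption: the fused score is used as confidence
}

// Weighted average of the per-signal scores for one candidate model.
function fuse(scores: SignalScores, weights: SignalScores): number {
  let weighted = 0;
  let total = 0;
  for (const name of Object.keys(weights) as SignalName[]) {
    weighted += weights[name] * scores[name];
    total += weights[name];
  }
  return total > 0 ? weighted / total : 0;
}

// Return the candidate with the highest fused score, or undefined if there are none.
function predictBest(
  candidates: Record<string, SignalScores>,
  weights: SignalScores,
): Prediction | undefined {
  let best: Prediction | undefined;
  for (const model of Object.keys(candidates)) {
    const confidence = fuse(candidates[model], weights);
    if (best === undefined || confidence > best.confidence) {
      best = { model, confidence };
    }
  }
  return best;
}
```

Dividing by the total weight keeps the result in [0, 1] even when the learned weights do not sum to one, which makes it directly comparable to the confidence thresholds.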

Decision Modes

| Mode | Confidence | Behavior |
| --- | --- | --- |
| enforce | > 0.85 | Override model selection entirely |
| suggest | > 0.65 and ≤ 0.85 | Inject hint into system prompt |
| defer | ≤ 0.65 | Use default provider |
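The thresholds in the table above can be read as a simple cascade. The sketch below is one way to express it; `decideMode` is a hypothetical name, and the defaults are the 0.85 and 0.65 values from the table.

```typescript
type DecisionMode = "enforce" | "suggest" | "defer";

// Map a prediction confidence to a decision mode.
// Both comparisons are strict, so exactly 0.85 is "suggest" and exactly 0.65 is "defer".
function decideMode(
  confidence: number,
  enforceConfidence = 0.85,
  suggestConfidence = 0.65,
): DecisionMode {
  if (confidence > enforceConfidence) return "enforce";
  if (confidence > suggestConfidence) return "suggest";
  return "defer";
}
```

For example, `decideMode(0.9)` yields `"enforce"`, `decideMode(0.7)` yields `"suggest"`, and `decideMode(0.65)` yields `"defer"`.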

Adaptive Learning

Signal weights update via EMA (Exponential Moving Average):

new_weight = old_weight + learning_rate × (reward - old_weight)

The learning rate decays with the number of observations: learning_rate = 0.05 × 0.995^observations. Updates are driven by quality and cost-efficiency feedback collected after each turn.
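A minimal sketch of the update rule and decay schedule described above. The function names are hypothetical, and the reward is assumed to be a number in [0, 1]; the constants 0.05 and 0.995 come from the text.

```typescript
const BASE_LEARNING_RATE = 0.05;
const DECAY_PER_OBSERVATION = 0.995;

// Learning rate after a given number of observations: 0.05 × 0.995^observations.
function learningRate(observations: number): number {
  return BASE_LEARNING_RATE * Math.pow(DECAY_PER_OBSERVATION, observations);
}

// EMA step: move the weight toward the reward by a fraction equal to the learning rate.
function updateWeight(oldWeight: number, reward: number, observations: number): number {
  const rate = learningRate(observations);
  return oldWeight + rate * (reward - oldWeight);
}
```

With an old weight of 0.5, a reward of 1.0, and no prior observations, the rate is 0.05 and the new weight is 0.525. After 100 observations the rate has fallen to roughly 0.030, so later feedback moves the weights less.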

Observation-First Startup

MQM starts in observe-only mode. It records performance data for the first 10 LLM calls, then activates and begins making predictions. The short warm-up lets MQM start routing after only a handful of turns. The cutoff is governed by the observeThreshold setting (the sample configuration on this page uses 50).
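The warm-up behavior can be pictured as a counter compared against a threshold. The class below is a hypothetical sketch, not the actual implementation. It assumes the threshold is configurable (the configuration section lists an `observeThreshold` setting) and that MQM becomes active once the recorded call count reaches it.

```typescript
// Hypothetical gate: stays in observe-only mode until enough LLM calls are recorded.
class ObservationGate {
  private calls = 0;
  private readonly threshold: number;

  constructor(threshold: number) {
    this.threshold = threshold;
  }

  // Call once per completed LLM call.
  recordCall(): void {
    this.calls += 1;
  }

  // True once the warm-up period is over and predictions may be made.
  isActive(): boolean {
    return this.calls >= this.threshold;
  }
}
```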

Arbiter Strategies

| Strategy | Preference | Confidence Required | Best For |
| --- | --- | --- | --- |
| conservative | Cheaper models | High | Cost-sensitive deployments |
| balanced | Cost/quality balance | Standard | General use (default) |
| aggressive | Highest quality | Lower | Quality-critical tasks |

Auto Model Selection Mode (v0.46+)

Auto mode enables per-turn dynamic model selection using a configurable model pool managed through the Quartermaster UI. This extends beyond MQM to allow explicit user control over available models.

How It Works

  1. Model Pool — a list of {provider, model, enabled} entries curated in the Quartermaster settings UI
  2. Pool Filtering — at pre-llm time, only enabled models are considered
  3. MQM Prediction — if enabled, MQM predicts the best model from the filtered pool
  4. Heuristic Fallback — if MQM is disabled or confidence is low, falls back to complexity-based selection (simple → fast model, complex → powerful model)
  5. Per-Turn Metadata — resolved model info is reported in the WebSocket done payload: requestedModelMode, resolvedProvider, resolvedModel, autoFallback, autoFallbackReason
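The five steps above can be sketched as a single resolution function. This is an illustrative reading, not the project's code: `resolveAutoModel`, the `"auto"` value for `requestedModelMode`, the 0.65 confidence cutoff, the "predicted model not in pool" reason, and the assumption that the pool is ordered from fastest to most powerful are all hypothetical. The metadata field names and the "MQM disabled" and "low confidence" reasons come from this page.

```typescript
interface PoolEntry { provider: string; model: string; enabled: boolean; }
interface MqmPrediction { provider: string; model: string; confidence: number; }

// Field names follow the WebSocket done payload described above.
interface ResolvedModel {
  requestedModelMode: "auto"; // assumed value
  resolvedProvider: string;
  resolvedModel: string;
  autoFallback: boolean;
  autoFallbackReason?: string;
}

function resolveAutoModel(
  pool: PoolEntry[],
  prediction: MqmPrediction | null, // null when MQM is disabled
  isComplex: boolean,
  minConfidence = 0.65, // assumed cutoff for "low confidence"
): ResolvedModel {
  // Step 2: only enabled entries are considered.
  const enabled = pool.filter((entry) => entry.enabled);
  if (enabled.length === 0) {
    throw new Error("auto model pool has no enabled entries");
  }

  // Step 3: use the MQM prediction when it is confident and still in the pool.
  let reason = "MQM disabled";
  if (prediction !== null) {
    reason = "low confidence";
    if (prediction.confidence > minConfidence) {
      reason = "predicted model not in pool"; // hypothetical reason string
      for (const entry of enabled) {
        if (entry.provider === prediction.provider && entry.model === prediction.model) {
          return {
            requestedModelMode: "auto",
            resolvedProvider: entry.provider,
            resolvedModel: entry.model,
            autoFallback: false,
          };
        }
      }
    }
  }

  // Step 4: heuristic fallback. Assumes the pool is ordered fastest first, most powerful last.
  const pick = isComplex ? enabled[enabled.length - 1] : enabled[0];
  return {
    requestedModelMode: "auto",
    resolvedProvider: pick.provider,
    resolvedModel: pick.model,
    autoFallback: true,
    autoFallbackReason: reason,
  };
}
```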

Configuration

{
  "modelSelection": {
    "autoModelPool": [
      { "provider": "anthropic", "model": "claude-sonnet-4-5", "enabled": true },
      { "provider": "openai", "model": "gpt-4o", "enabled": true },
      { "provider": "ollama", "model": "llama3.2", "enabled": false }
    ]
  }
}

Chat UI

  • Model selector gains an Auto option alongside manual provider/model selection
  • When Auto is selected, the chat header displays the resolved model per turn
  • Auto fallback reasons (e.g., "MQM disabled", "low confidence") shown in warning toasts

Tool Prediction with Auto Mode

When Quartermaster tool prediction is active, the follow-up instruction after tool execution includes a hint suggesting which tool to use next based on learned patterns. This nudges the model toward productive actions (e.g., suggesting file_write when the model is stuck reading files).

Task Categorization

Requests are automatically classified into 5 categories using heuristic keyword matching:

| Category | Triggers | Example Queries |
| --- | --- | --- |
| code | Code blocks, function/class keywords, file paths | "Write a Python script to..." |
| analysis | Data, summarize, evaluate, compare | "Analyze this log output..." |
| creative | Write, design, create, story | "Write a blog post about..." |
| factual | What is, who, when, define, explain | "What is the capital of..." |
| conversation | Greetings, thanks, general chat | "Hello, how are you?" |
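A keyword classifier along these lines might look like the sketch below. The regular expressions are illustrative guesses based on the trigger column, not the actual rule set. In particular, `script`, `analyze`, and the file-extension list are assumptions added so that the example queries land in their listed categories.

```typescript
type TaskCategory = "code" | "analysis" | "creative" | "factual" | "conversation";

// Rules are checked in order, so "code" wins over "creative" for
// a message such as "Write a Python script to parse logs".
const CATEGORY_RULES: Array<{ category: TaskCategory; pattern: RegExp }> = [
  // Fenced code, function/class keywords, or something that looks like a file path.
  { category: "code", pattern: /`{3}|\b(function|class|script)\b|[\w./-]+\.(ts|js|py)\b/i },
  { category: "analysis", pattern: /\b(data|summari[sz]e|evaluate|compare|analy[sz]e)\b/i },
  { category: "creative", pattern: /\b(write|design|create|story)\b/i },
  { category: "factual", pattern: /\b(what is|who|when|define|explain)\b/i },
];

// Return the first matching category; anything unmatched is treated as conversation.
function categorize(message: string): TaskCategory {
  for (const rule of CATEGORY_RULES) {
    if (rule.pattern.test(message)) return rule.category;
  }
  return "conversation";
}
```

Rule order is the main design choice here: several trigger words overlap (for example "write" appears in both code and creative requests), so the more specific categories are tested first.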

Context Fingerprinting

Multi-feature context extraction for pattern matching: message length, code detection, question count, complexity estimation, tool round depth, file count, error context, and session age.
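To make the feature list concrete, the sketch below extracts the eight features from a turn. Every name, type, and heuristic in it is hypothetical: the page names the features but not how they are computed or stored, so the 500-character complexity cutoff and the error keywords are placeholders.

```typescript
// Hypothetical shape for the eight features listed above.
interface ContextFingerprint {
  messageLength: number;
  hasCode: boolean;
  questionCount: number;
  complexity: "simple" | "complex";
  toolRoundDepth: number;
  fileCount: number;
  hasErrorContext: boolean;
  sessionAgeMs: number;
}

// Hypothetical input describing the current turn.
interface TurnContext {
  message: string;
  toolRoundDepth: number;
  attachedFiles: string[];
  sessionStartedAt: number; // epoch milliseconds
}

function fingerprint(ctx: TurnContext, now: number = Date.now()): ContextFingerprint {
  const messageLength = ctx.message.length;
  const hasCode = /`{3}|\b(function|class)\b/.test(ctx.message);
  const questionCount = (ctx.message.match(/\?/g) || []).length;
  return {
    messageLength,
    hasCode,
    questionCount,
    // Placeholder heuristic: long, code-bearing, or multi-question messages count as complex.
    complexity: messageLength > 500 || hasCode || questionCount > 1 ? "complex" : "simple",
    toolRoundDepth: ctx.toolRoundDepth,
    fileCount: ctx.attachedFiles.length,
    hasErrorContext: /\b(error|exception|traceback|failed)\b/i.test(ctx.message),
    sessionAgeMs: now - ctx.sessionStartedAt,
  };
}
```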

Database Schema

5 tables in cortex.db (migration 019):

| Table | Purpose |
| --- | --- |
| mqm_model_stats | Per-model performance: success rate, avg tokens, avg cost, task category breakdown |
| mqm_signal_weights | Learned signal importance with EMA history |
| mqm_decisions | Full audit trail: predicted vs actual, confidence, mode, correctness |
| mqm_session_state | Per-session tracking and context |
| mqm_patterns | Learned model selection patterns by task + context |

Pipeline Integration

MQM runs as a pipeline hook (@cortex/model-quartermaster, priority 5):

  • pre-llm — computes prediction, enforces model override if confidence > 0.85
  • post-llm — records outcome, updates weights, logs lens audit event

Observatory Events

5 lens audit event types:

  • mqm_prediction — model selection prediction made
  • mqm_observation — outcome recorded after LLM call
  • mqm_weight_updated — signal weight adjusted
  • mqm_pattern_learned — new pattern detected
  • mqm_mode_changed — arbiter strategy or mode switched

CLI

cortex mqm stats            # Performance statistics per model
cortex mqm decisions        # Recent routing decisions
cortex mqm weights          # Current signal weights
cortex mqm accuracy         # Prediction accuracy metrics

Web UI

The Quartermaster unified page provides three panes:

  • Tool Orchestration — QM patterns, decisions, tool stats (from legacy Quartermaster)
  • Model Intelligence — MQM model stats, accuracy trends chart, signal weight bars, top-10 tool stats with success rate bars
  • Settings — Enable/disable Model Intelligence, pin dedicated QM provider + model, choose strategy, observe threshold

Configuration

{
  "modelSelection": {
    "enabled": false,
    "mode": "balanced",
    "observeThreshold": 50,
    "enforceConfidence": 0.85,
    "suggestConfidence": 0.65,
    "costBudget": 1.0,
    "qualityThreshold": 0.7,
    "allowedProviders": ["anthropic", "openai", "ollama"],
    "quartermasterProvider": "google",
    "quartermasterModel": "gemini-2.0-flash",
    "autoModelPool": [
      { "provider": "anthropic", "model": "claude-sonnet-4-5", "enabled": true },
      { "provider": "openai", "model": "gpt-4o", "enabled": true }
    ]
  }
}

API Endpoints

| Method | Path | Description |
| --- | --- | --- |
| GET | /api/qm/config | Current MQM config |
| POST | /api/qm/config | Update MQM config |
| POST | /api/qm/reset | Reset all learned state |
| GET | /api/mqm/summary | High-level MQM summary |
| GET | /api/mqm/accuracy | Prediction accuracy |
| GET | /api/mqm/stats | Per-model statistics |
| GET | /api/mqm/decisions | Decision audit log |
| GET | /api/mqm/weights | Current signal weights |
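As a usage sketch, the read-only MQM endpoints above could be wrapped in a small typed helper. The helper names are hypothetical, the response bodies are left as `unknown` because their shape is not documented here, and the fetch implementation is passed in so the code does not assume a particular runtime.

```typescript
// Paths copied from the endpoint table above (GET endpoints under /api/mqm only).
const MQM_ENDPOINTS = {
  summary: "/api/mqm/summary",
  accuracy: "/api/mqm/accuracy",
  stats: "/api/mqm/stats",
  decisions: "/api/mqm/decisions",
  weights: "/api/mqm/weights",
} as const;

type MqmEndpoint = keyof typeof MQM_ENDPOINTS;

// Minimal subset of a fetch response that this sketch relies on.
interface MinimalResponse {
  ok: boolean;
  status: number;
  json(): Promise<unknown>;
}
type FetchLike = (url: string) => Promise<MinimalResponse>;

// Join the base URL and endpoint path, tolerating a trailing slash on the base.
function mqmUrl(baseUrl: string, endpoint: MqmEndpoint): string {
  return baseUrl.replace(/\/+$/, "") + MQM_ENDPOINTS[endpoint];
}

// Fetch one endpoint and return the parsed JSON body, untyped.
async function getMqm(
  baseUrl: string,
  endpoint: MqmEndpoint,
  fetchImpl: FetchLike,
): Promise<unknown> {
  const res = await fetchImpl(mqmUrl(baseUrl, endpoint));
  if (!res.ok) {
    throw new Error("MQM request failed with status " + res.status);
  }
  return res.json();
}
```

In a runtime with a global `fetch`, a call would pass it as the third argument together with the server's base URL, for example a local address such as `http://localhost:3000` (a placeholder, since the actual host and port depend on the deployment).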
