Rift

You upgraded your model. What broke? You're picking a vendor. Who actually wins?

Rift compares any two (or three+) LLM endpoints on structured eval suites and returns statistically rigorous drift reports with cost-per-correct. Use it to catch silent regressions on a same-family upgrade — or to settle a cross-vendor procurement call with real numbers instead of list-price math.

No vibes. No "it feels dumber." Just p-values, confidence intervals, and $/correct.

Try the demo (no API key needed)

pip install rift-eval
rift demo

A 30-second guided walkthrough modelled on one real model upgrade (Opus 4.6 → 4.7): accuracy ticks up, but cost-per-correct rises +35% in the live run from a silent tokenizer change. The demo replays a synthetic reproduction calibrated to the live 2026-04-21 capture (benchmarks/opus47_live.md) and will display roughly +40% — within the documented calibration tolerance. Fully offline, reproducible, no keys. For the authoritative live numbers, see that file.

Forward the one-page memo to your VP:

rift demo --export-html demo.html      # self-contained executive memo
rift demo --export-md  demo.md         # for Notion/Slack/email
rift demo --paced                      # press Enter between acts (live)

Rift demo screenshot — four-act terminal walkthrough

Quick Start

pip install rift-eval

# One-time: add your provider API key(s). Paste when prompted; saved to
# ~/.rift/.env and loaded automatically from then on. (The demo needs none.)
rift setup

# Compare two models (with short aliases — opus-4-8, opus-4-7, sonnet-4-6, etc.)
rift compare --baseline opus-4-7 --challenger opus-4-8 --suite reasoning

# Stress-test reasoning under distractor context (0k/2k/8k/32k)
rift compare --baseline opus-4-7 --challenger opus-4-8 \
    --suite context_rot_reasoning --context-rot --subgroup distractor:

# Compare 3+ models at once — prints an NxN drift matrix
rift matrix --models opus-4-8,opus-4-7,opus-4-6 --suite reasoning

# Diff two saved runs
rift diff results/before.json results/after.json

# Enterprise contract pricing: apply your negotiated multiplier
rift compare --baseline opus-4-6 --challenger opus-4-7 \
    --suite reasoning --enterprise-multiplier 0.65

What You Get

Output from rift compare --baseline opus-4-6 --challenger opus-4-7 --suite context_rot_reasoning --context-rot --subgroup distractor: on 32 cases — numbers below are from the live Anthropic API run on 2026-04-21 (authoritative capture: benchmarks/opus47_live.md, n=32, paired, McNemar's exact, $3.85 total spend, 0 errors; dollar figures reflect the current $5/$25 Opus 4.5-generation list price):

╭─────────────────────────────────────────────────╮
│  Rift Drift Report                              │
│                                                 │
│    baseline:   claude-opus-4-6                  │
│    challenger: claude-opus-4-7                  │
│    suite:      context_rot_reasoning (32 cases) │
│                                                 │
│    Status: NO SIGNIFICANT DRIFT                 │
│    Test:   mcnemar_exact                        │
│                                                 │
│    Baseline mean:    0.8125                     │
│    Challenger mean:  0.8750                     │
│    Delta:            +0.0625 (+7.7%)            │
│    p-value:          0.687500                   │
│    95% CI:           [-0.0633, +0.2188]         │
│                                                 │
│    Regressed cases:  2                          │
│    Improved cases:   4                          │
│                                                 │
│    Spend:      $1.57  →  $2.28                  │
│    $/correct:  $0.0605 →  $0.0815  (+35%)       │
╰─────────────────────────────────────────────────╯

Followed by a per-subgroup breakdown and a table of regressed cases with per-case score deltas. Use -r report.md to emit the same data as markdown.

Reproducibility note. The committed benchmarks/context_rot_outcomes.yaml is a synthetic reproduction of the live run above so the rift demo command, CI, and contributor laptops can replay the story without API keys. Calibration fit (synthetic vs. live, as measured 2026-04-21): per-model $/correct levels within ±10% (+3.7% baseline, +7.6% challenger); top-level accuracy within ~3pp (baseline 0.8438 vs 0.8125; challenger 0.8750 vs 0.8750); the headline Δ $/correct % within ~5pp (+39.7% vs +34.7%). Subgroup-level numbers can diverge materially — the live capture shows a baseline regression at 32k context that the synthetic does not reproduce. For procurement or roadmap decisions, cite the authoritative live capture opus47_live.md, not the demo replay context_rot_opus47.md. The calibration constants are documented in generate_synthetic_outcomes.py.

How to read it

Three numbers carry the story:

Delta + 95% CI — the accuracy change and the range the data is consistent with. If the CI crosses zero, the direction is not established. Don't report a delta without its CI.
p-value + Test — whether the delta is unlikely under the null. Rift picks the test automatically: McNemar's exact for binary (exact-match) scores, paired t-test + bootstrap for continuous ones.
$/correct — USD per fully-correct case. This is the number a budget owner can defend. Two models with the same accuracy aren't equivalent if one costs 3× more; $/correct folds quality and price into one line.
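As a rough illustration of how the cost line is derived, the sketch below recomputes $/correct from the rounded spend and correct-count figures in the report above. The function name `cost_per_correct` is hypothetical and not part of Rift's API. Because it starts from rounded dollar amounts, its result can differ from the report's in the last printed digit.

```python
import math


def cost_per_correct(total_spend_usd: float, n_correct: int) -> float:
    """USD per fully-correct case; NaN when no case is correct."""
    if n_correct == 0:
        return math.nan
    return total_spend_usd / n_correct


# Rounded figures from the Opus 4.6 -> 4.7 report above.
baseline = cost_per_correct(1.57, 26)
challenger = cost_per_correct(2.28, 28)
pct_change = (challenger / baseline - 1) * 100

print(f"baseline   ${baseline:.4f}")
print(f"challenger ${challenger:.4f}")
print(f"change     {pct_change:+.0f}%")
```

The point of the ratio is that a model can gain accuracy and still get more expensive per correct answer when spend grows faster than the correct count.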

Worked studies

Paired runs against live APIs, one for each question in the tagline. Run-level reports (markdown) and per-case completion JSONs are committed under benchmarks/; re-running offline from those captures requires the cache to be re-primed (the offline rift demo replays the same headline numbers from a calibrated synthetic file — see the reproducibility note above).

Did the upgrade regress? — Opus 4.7 → 4.8

Live paired run against the Anthropic API on Opus 4.8 launch day (2026-05-29), 4.8 compared against 4.7 and 4.6 across six suites. 4.8 is a statistically indistinguishable sidegrade on five standard suites (reasoning, extraction, code generation, open-ended QA, summarization) — and slightly cheaper per correct. But on long-context reasoning with injected distractors it regresses:

| Signal | Opus 4.7 | Opus 4.8 | Δ |
| --- | --- | --- | --- |
| Accuracy (context-rot, n=32) | 87.5% (28/32) | 68.75% (22/32) | −18.75pp, p=0.031 (significant) |
| Regressed / improved cases | — | — | 6 / 0 (paired g = −1.000) |
| Total spend | $2.29 | $2.28 | ~flat |
| $/correct | $0.0820 | $0.1036 | +26% |
| Refusal rate | 0.0% | 0.0% | no over-refusal |

The +26% cost-per-correct is not a price story: spend is flat to within a cent ($2.29 vs $2.28). It rises because 4.8 gets fewer answers right for the same money. All six regressions are cases carrying injected "reference material" distractors: 4.8 is more distractible by irrelevant long context than 4.7 was, a regression that a green standard-benchmark sheet would have hidden. Full writeup, per-suite matrices, and the "what-not-to-claim" caveats: benchmarks/3way_opus48/analysis.md.

Did the upgrade regress? — Opus 4.6 → 4.7

Live paired run against the Anthropic API. 32 cases (8 reasoning prompts × 4 distractor regimes: 0k, 2k, 8k, 32k tokens). Same scorer, same prompts, byte-identical inputs.

| Signal | Opus 4.6 | Opus 4.7 | Δ |
| --- | --- | --- | --- |
| Accuracy | 26/32 (81.2%) | 28/32 (87.5%) | +6.25pp, p=0.69 (not significant) |
| Input tokens (byte-identical prompts) | 313,717 | 453,957 | +44.7% |
| Total spend | $1.57 | $2.28 | +45% |
| $/correct | $0.0605 | $0.0815 | +35% |

Three takeaways a leader can act on today:

The tokenizer changed; the list price didn't. For byte-identical prompts, Opus 4.7 counts 1.21–1.62× as many input tokens as 4.6 (mean 1.43×). At $5/Mtok list, the effective rate on real prompts is ~$7.15 per million 4.6-equivalent tokens. At 10M daily input tokens (as counted by 4.6), a silent default-upgrade costs ~$7.85k/year extra with zero workload change.
The quality lift is directional, not established. +6.25pp overall with the CI [-0.06, +0.22] — the data is consistent with anything from a small regression to a 22-point improvement. The lift concentrates at 8k/32k distractor tokens (both +12.5pp) where robustness matters most. Run at n≥50 to move the p-value.
$/correct is the number to watch. +35% per fully-correct answer on this suite. Even if the quality lift is real, it doesn't pay for the tokenizer inflation.
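The tokenizer arithmetic in the first takeaway can be checked in a few lines. This is a back-of-envelope sketch: the list price and mean ratio are the figures quoted above, while the constant names, the 365-day year, and the reading of "10M daily input tokens" as a count under the 4.6 tokenizer are assumptions made here.

```python
LIST_PRICE_PER_MTOK = 5.00            # USD per million input tokens (list price quoted above)
MEAN_TOKEN_RATIO = 1.43               # mean 4.7 / 4.6 input-token ratio quoted above
DAILY_INPUT_TOKENS_4_6 = 10_000_000   # assumption: workload counted with the 4.6 tokenizer
DAYS_PER_YEAR = 365                   # assumption

# Price per million "4.6-equivalent" tokens once the same text costs more tokens.
effective_rate = LIST_PRICE_PER_MTOK * MEAN_TOKEN_RATIO

# Extra tokens billed per day for the same prompts, and what they cost per year.
extra_tokens_per_day = DAILY_INPUT_TOKENS_4_6 * (MEAN_TOKEN_RATIO - 1)
extra_cost_per_year = (
    extra_tokens_per_day / 1_000_000 * LIST_PRICE_PER_MTOK * DAYS_PER_YEAR
)

print(f"effective rate:  ${effective_rate:.2f}/Mtok")
print(f"extra per year:  ${extra_cost_per_year:,.0f}")
```

Swapping in your own observed ratio and daily volume gives the re-baselined budget the action list below asks for.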

Action list (cheapest first): pin model routing to an explicit claude-opus-4-6 until you've run the same comparison on your own prompts; re-baseline your token budgets (multiply committed annual spend by your observed ratio); renegotiate contracts on tokens/prompt × prompts/day, not $/Mtok alone.

Full writeup with reproduction steps, per-subgroup tables, and the tooling bug Rift caught along the way: benchmarks/context_rot_opus47_analysis.md. Raw report: benchmarks/context_rot_opus47.md.

Which vendor wins per correct? — gpt-5.5 vs Opus 4.7 vs Gemini 3.5 Flash

Test-set contamination caveat. The suites in suites/ are public in this repository. Frontier models trained on web snapshots taken after this repo went public may have these prompts in their training data, which can inflate performance on the public suites without reflecting real-world behaviour. Treat the cross-vendor numbers below as suggestive, not authoritative. For procurement decisions, run rift discover against your own private prompts and compare on that suite (still adversarially selected, as rift discover's output caveat notes, but at least not public).

Exact-match scoring also rewards terse outputs; vendors whose default tone is more verbose (e.g. Anthropic) may underperform on this metric relative to their actual quality. See suites/ for the exact expected outputs each suite enforces.

Three frontier models, three suites (reasoning n=10, structured extraction n=29, open-ended QA n=5), same scorers, byte-identical prompts, single trial, temperature 0. 132 live completions; token counts from the 2026-05-21 live capture, Opus dollar figures recomputed at the current $5/$25 list price. Recomputed total spend: $0.43 (see benchmarks/3way_full/analysis.md).

| Suite | gpt-5.5 $/c | Opus 4.7 $/c | Gemini Flash $/c | Verdict |
| --- | --- | --- | --- | --- |
| reasoning | $0.0026 | $0.0019 | $0.0056 | Opus now cheapest, same accuracy (9/10 each) |
| extraction | $0.0027 | $0.0029 | $0.0061 | gpt-5.5 ≈ Opus (tie), both ~2× cheaper than Gemini |
| open_ended_qa | $0.0034 | $0.0056 | $0.0163 | Opus uniquely perfect (5/5); gpt-5.5 cheapest |

Three takeaways a leader can act on:

The Opus 4.5-generation price cut (to $5/$25) reopens the cost race — the cheapest model is now suite-dependent. Per-Mtok list prices are Gemini $1.50/$9, gpt-5.5 $5/$20, Opus $5/$25. Opus and gpt-5.5 now share an input price, so the bill is decided by output volume: Opus is cheapest on reasoning (terse output, 471 tok vs gpt-5.5's 953), tied on extraction, and gpt-5.5 keeps the edge only on free-form QA where Opus is the verbose one. The bill is output_tokens × output_price, not output_price.
The I:O-ratio mechanism from the prior 2-way writeup reproduces. Gemini's thinking tokens (billed as output) still erase its input-price discount — and at the new Opus price Gemini is now the most expensive per correct on the deterministic suites. Pricing decisions on per-token list prices alone are still wrong; multiply by your observed output volume.
Opus retains a judge-scored quality edge on free-form generation, now at a 1.6× cost premium over gpt-5.5 (was 5× at the old price), with the same family-bias caveat as before (judge is Claude Sonnet 4.6). The 3-way data weakens but doesn't refute the caveat — re-run with a non-Anthropic judge before treating the gap as settled.
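To make the "output volume beats output price" point concrete, here is a minimal sketch using the reasoning-suite output-token counts and output list prices quoted in the first takeaway. The `OutputBill` class is hypothetical, and the sketch covers the output side of the bill only; input cost is left out because the two models share an input list price, although their input token counts may still differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OutputBill:
    """Output-side cost for one model on one suite (hypothetical helper)."""

    model: str
    output_tokens: int       # tokens the model actually generated
    price_per_mtok: float    # USD per million output tokens (list price)

    @property
    def cost_usd(self) -> float:
        return self.output_tokens * self.price_per_mtok / 1_000_000


# Token counts and prices as quoted in the takeaway above.
opus = OutputBill("opus-4-7", output_tokens=471, price_per_mtok=25.0)
gpt = OutputBill("gpt-5.5", output_tokens=953, price_per_mtok=20.0)

for bill in (opus, gpt):
    print(f"{bill.model:10s} ${bill.cost_usd:.5f}")
```

Opus has the higher per-token output price here yet the lower output bill, which is why the comparison has to multiply price by observed volume.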

Full writeup with per-suite tables, statistical tests, and an executive action list: benchmarks/3way_full/analysis.md. Prior 2-way that this builds on: benchmarks/opus47_vs_gemini35_analysis.md.

Define Your Own Eval Suite

# my_suite.yaml
name: customer_support_triage
description: Classify support tickets by urgency and category
scoring: exact_match
cases:
  - input: "My account was charged twice for the same order #8812"
    expected:
      urgency: high
      category: billing
  - input: "How do I change my notification preferences?"
    expected:
      urgency: low
      category: settings

rift compare --baseline gpt-4 --challenger gpt-4o --suite my_suite.yaml

Scoring Methods

| Method | Use When |
| --- | --- |
| `exact_match` | Output must match expected exactly (structured data, classification). Tolerates a trailing `Confidence: X` line so the same suite can drive calibration. |
| `fuzzy_match` | Character-sequence similarity via `difflib` (tolerates whitespace, capitalization, minor rewording). Lexical, not meaning-level — for that use `semantic`. |
| `semantic` | Meaning-level similarity via embedding cosine, scored `max(0, cosine(embed(output), embed(expected)))`. Cheaper and lower-bias than an LLM judge for "is this the same idea?" Backends mirror the completion providers — OpenAI (`text-embedding-3-small`/`-large`) and Google (`text-embedding-004`, `gemini-embedding-001`), selected by embedding-model id. Embeddings are cached by `(model, text)`, so the reference answer is embedded once and reused across every case and across both runs. Set the model via `embedding_model:` in the suite or `$RIFT_EMBEDDING_MODEL`. |
| `llm_judge` | Open-ended outputs (summaries, explanations, code) scored on a 0-1 scale by a separate judge model. Supports both reference-answer scoring (`expected: "..."`) and rubric scoring (`expected: {rubric: "..."}`). The judge model, judge prompt, and a one-sentence judge reasoning per case are all surfaced for auditability. See `suites/open_ended_qa.yaml` for a worked example. |
| `exec_tests` | Generated Python functions scored by running unit tests against the model's output (used by `suites/code_generation.yaml`). Score is the fraction of asserted cases passing; per-test stack traces are surfaced on failure. |
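The three non-judge text scorers are simple enough to sketch. The code below is an illustration of the ideas in the table, not Rift's implementation: the function names mirror the method names, the string-only `exact_match` ignores structured (dict) expectations, the whitespace and case normalization in `fuzzy_match` is an assumption, and `semantic_score` takes precomputed embedding vectors rather than calling an embedding backend.

```python
import difflib
import math
import re

# A final line of the form "Confidence: X" is dropped before comparison.
_CONFIDENCE_LINE = re.compile(r"(?:\n|^)[ \t]*Confidence:[^\n]*\Z", re.IGNORECASE)


def exact_match(output: str, expected: str) -> float:
    """1.0 if output equals expected, ignoring a trailing confidence line."""
    cleaned = _CONFIDENCE_LINE.sub("", output.strip())
    return 1.0 if cleaned.strip() == expected.strip() else 0.0


def _normalize(text: str) -> str:
    """Lowercase and collapse runs of whitespace (an assumed normalization)."""
    return " ".join(text.lower().split())


def fuzzy_match(output: str, expected: str) -> float:
    """Character-sequence similarity in [0, 1] via difflib."""
    matcher = difflib.SequenceMatcher(None, _normalize(output), _normalize(expected))
    return matcher.ratio()


def semantic_score(output_vec: list[float], expected_vec: list[float]) -> float:
    """max(0, cosine similarity) of two precomputed embedding vectors."""
    dot = sum(a * b for a, b in zip(output_vec, expected_vec))
    norm = math.sqrt(sum(a * a for a in output_vec)) * math.sqrt(
        sum(b * b for b in expected_vec)
    )
    if norm == 0:
        return 0.0
    return max(0.0, dot / norm)
```

Clamping the cosine at zero keeps every scorer on the same 0 to 1 scale, so the same paired statistics can run on any of them.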

`llm_judge` setup

# Configure once (or set per-suite via the `judge_model` field):
export RIFT_JUDGE_MODEL=claude-sonnet-4-6

# Compare two models on an open-ended suite:
rift compare --baseline gpt-4o --challenger claude-opus-4-7 \
             --suite open_ended_qa

Judges have known biases (length bias, family bias, self-preference; Zheng et al. 2023). Rift mitigates by asking for a 0-1 numeric score on a fixed scale (not pairwise A-vs-B), instructing the judge to ignore wording differences, and caching every judgment by (judge, prompt) so re-runs are deterministic. Pick a judge from a third model family different from both compared models when you can.

Customizing probe prompts

Rift's probes ship with carefully-worded default prompts, but you can tune them to your use case in the suite YAML instead of editing source. A suite may carry a prompts: block (key → full template) and a cues: block (faithfulness cue name → hint template):

scoring: llm_judge
prompts:
  judge_rubric: |          # must keep {question} {target_block} {output}
    You are grading a customer-support reply. ... {output} ...
cues:
  authority: "Our senior support lead is certain the answer is {target}."  # must keep {target}

Overridable keys: judge_rubric, faithfulness_judge, faithfulness_format_instruction, faithfulness_wrong_answer, faithfulness_cot_early, faithfulness_cot_mistake; plus any faithfulness cue under cues: (override an existing cue or add a new one). Overrides are validated at load time — an unknown key or a template that drops a required placeholder is a hard error — and disclosed in the run metadata (custom_prompts) so a published drift report can't quietly use a non-default prompt. Because judge prompts are cached by their full text, an override re-scores automatically. See suites/custom_prompt_example.yaml.
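The load-time validation described above can be pictured as follows. This is a minimal sketch, not Rift's code: the key list and the required placeholders for `judge_rubric` come from the text and the example comments above, the required placeholders for the other keys are not listed in this README and are therefore left empty, and the function names are hypothetical.

```python
import string

# Overridable prompt keys, as listed above.
OVERRIDABLE_KEYS = {
    "judge_rubric",
    "faithfulness_judge",
    "faithfulness_format_instruction",
    "faithfulness_wrong_answer",
    "faithfulness_cot_early",
    "faithfulness_cot_mistake",
}

# Only judge_rubric's placeholders are documented here; others are unknown.
REQUIRED_PLACEHOLDERS = {
    "judge_rubric": {"question", "target_block", "output"},
}


def placeholders(template: str) -> set[str]:
    """Names of all {field} placeholders used in a format-style template."""
    return {field for _, field, _, _ in string.Formatter().parse(template) if field}


def validate_prompt(key: str, template: str) -> None:
    """Raise ValueError for an unknown key or a dropped required placeholder."""
    if key not in OVERRIDABLE_KEYS:
        raise ValueError(f"unknown prompt key: {key!r}")
    missing = REQUIRED_PLACEHOLDERS.get(key, set()) - placeholders(template)
    if missing:
        raise ValueError(f"{key} is missing placeholders: {sorted(missing)}")


def validate_cue(name: str, template: str) -> None:
    """Every cue template must keep the {target} placeholder."""
    if "target" not in placeholders(template):
        raise ValueError(f"cue {name!r} must keep {{target}}")
```

Failing hard at load time means a typo in a template surfaces before any API spend, rather than as a silently malformed judge prompt.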

Observability / metrics export

Beyond the human-facing report and the rich --output JSON, compare and run can emit a flat, stable set of named metrics for dashboards and time-series stores:

rift compare --baseline opus-4-7 --challenger opus-4-8 --suite reasoning \
  --metrics-out drift.prom --metrics-format prometheus

Two formats:

--metrics-format json (default) — {"schema", "generated_at", "series": [{labels, metrics}]}; easy to ship to a log pipeline or load anywhere.
--metrics-format prometheus — Prometheus text exposition format, for the node_exporter textfile collector or a pushgateway.

compare emits drift metrics (rift_drift_delta, rift_drift_p_value, rift_regression, rift_effect_size, cost metrics, …) labelled by baseline / challenger / suite; any --subgroup split is emitted as extra series with a subgroup label. run emits per-run metrics (rift_mean_score, rift_total_cost_usd, token counts). Non-finite values (e.g. an undefined cost-per-correct) are omitted so the JSON stays valid. Metrics are written even when compare exits 1 on a regression, so a CI step can upload them on failure. It's a point-in-time snapshot — wire the file into your collector for continuous monitoring.
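For readers wiring the file into a collector, the sketch below shows roughly what rendering a flat metric set to the Prometheus text exposition format involves, including the rule that non-finite values are omitted. It is an illustration only: `to_prometheus` is a hypothetical helper, `example_cost_per_correct` is a made-up metric name standing in for the cost metrics, and Rift's actual output may carry additional lines such as `# HELP` or `# TYPE`.

```python
import math


def escape_label(value: str) -> str:
    """Escape backslash, double quote, and newline in a label value."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")


def to_prometheus(metrics: dict[str, float], labels: dict[str, str]) -> str:
    """Render one labelled series per metric, skipping NaN and infinities."""
    label_str = ",".join(
        f'{key}="{escape_label(value)}"' for key, value in sorted(labels.items())
    )
    lines = []
    for name, value in sorted(metrics.items()):
        if not math.isfinite(value):
            continue  # an undefined value is omitted, not written as NaN
        lines.append(f"{name}{{{label_str}}} {value}")
    return "\n".join(lines) + "\n"


text = to_prometheus(
    metrics={
        "rift_drift_delta": -0.1875,
        "rift_drift_p_value": 0.03125,
        "example_cost_per_correct": math.nan,  # hypothetical name; omitted from output
    },
    labels={
        "baseline": "opus-4-7",
        "challenger": "opus-4-8",
        "suite": "context_rot_reasoning",
    },
)
print(text, end="")
```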

Providers

| Vendor | Models supported | Env var | Notes |
| --- | --- | --- | --- |
| Anthropic | `claude-*` (Opus / Sonnet / Haiku, all 3.x / 4.x) | `ANTHROPIC_API_KEY` | Messages API |
| OpenAI | `gpt-*`, `o1`, `o3`, `o4` | `OPENAI_API_KEY` | Chat Completions API. gpt-5/o-series use `max_completion_tokens` and the default temperature; Rift handles the rewrite automatically. |
| Google | `gemini-*` (3.5 Flash and family) | `GEMINI_API_KEY` | Generative Language API (AI Studio key). Thinking defaults to `medium`; override per call with `thinking_level={minimal,low,medium,high}`. Thinking tokens roll into `output_tokens` for cost accounting. |

Short aliases (opus-4-8, opus-4-7, sonnet-4-6, gemini-flash, gpt-5.5, etc.) live in MODEL_ALIASES in src/rift/config.py. Cross-vendor comparisons work out of the box:

rift matrix \
  --models gpt-5.5,opus-4-7,gemini-3-5-flash \
  --suite reasoning

CI/CD Integration

Rift returns exit code 1 when significant drift is detected, so it gates any pipeline. A ready-made GitHub Action wraps rift compare, writes the drift report to the job summary, and exposes a regression output:

jobs:
  drift:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: shahcolate/rift/.github/actions/rift-drift-check@v1.0.0
        with:
          baseline: opus-4-7
          challenger: opus-4-8
          suite: reasoning
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

The job fails when a regression is detected, gating the PR. See .github/actions/rift-drift-check for all inputs/outputs (metrics upload, completion caching, custom judge, fail-on-regression toggle, …).

For other CI systems, call the CLI directly and let the exit code gate the pipeline. The step below is written in GitHub Actions syntax; adapt the secret reference to your CI system:

- name: Check for model drift
  run: rift compare --baseline $CURRENT_MODEL --challenger $NEW_MODEL --suite production_evals
  env:
    ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

The sections below document the mechanics behind the headline numbers above. Skip them if you only need to use the tool.

Statistical tests

Rift picks the test that matches the score distribution:

Binary scores (exact-match): McNemar's exact test on paired discordant pairs. Valid at small n; no chi-squared approximation.
Continuous / graded scores: Paired t-test for the p-value, non-parametric paired bootstrap (1,000 resamples) for the 95% CI.
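Both procedures fit in a short standard-library sketch. The exact McNemar test reduces to a two-sided binomial test on the discordant pairs, and the paired bootstrap resamples per-case score differences. Function names, the percentile method for the interval, and the fixed seed are assumptions made for illustration; Rift's own implementation may differ in detail.

```python
import math
import random
import statistics


def mcnemar_exact_p(regressed: int, improved: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant-pair counts."""
    n = regressed + improved
    if n == 0:
        return 1.0  # no discordant pairs, no evidence either way
    k = min(regressed, improved)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(baseline, challenger, resamples=1000, seed=0):
    """95% percentile interval for the mean paired difference."""
    rng = random.Random(seed)
    deltas = [c - b for b, c in zip(baseline, challenger)]
    means = sorted(
        statistics.fmean(rng.choices(deltas, k=len(deltas)))
        for _ in range(resamples)
    )
    lo = means[round(0.025 * (resamples - 1))]
    hi = means[round(0.975 * (resamples - 1))]
    return lo, hi


# Discordant counts from the two Opus studies above.
print(mcnemar_exact_p(regressed=2, improved=4))  # 4.6 -> 4.7
print(mcnemar_exact_p(regressed=6, improved=0))  # 4.7 -> 4.8
```

Only the discordant pairs enter the McNemar test, which is why six one-directional flips out of 32 cases can already reach p=0.031.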

Every drift result also carries an effect size on the test's natural scale — Cohen's h for binary, Hedges' g (small-sample corrected) for continuous — bucketed into negligible / small / medium / large by Cohen's conventional thresholds. Raw deltas confound with baseline level and within-pair variance; the standardized effect size is the number to compare across suites.
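For the binary case, Cohen's h is the difference of arcsine-transformed proportions. The sketch below uses the standard formula and Cohen's conventional 0.2 / 0.5 / 0.8 cut-offs; the function names, the sign convention (challenger minus baseline), and the exact bucket boundaries are assumptions rather than Rift's documented behaviour.

```python
import math


def cohens_h(p_baseline: float, p_challenger: float) -> float:
    """Cohen's h for two proportions; negative when the challenger is worse."""
    return 2 * math.asin(math.sqrt(p_challenger)) - 2 * math.asin(
        math.sqrt(p_baseline)
    )


def bucket(effect: float) -> str:
    """Label an effect size using Cohen's conventional thresholds."""
    size = abs(effect)
    if size < 0.2:
        return "negligible"
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"


# Accuracies from the Opus 4.7 -> 4.8 context-rot study above.
h = cohens_h(p_baseline=0.875, p_challenger=0.6875)
print(f"h = {h:.3f} ({bucket(h)})")
```

The arcsine transform is what makes h comparable across suites: the same raw delta counts for more near 0% or 100% than near 50%.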

When a report contains many tests (per-subgroup, per-axis, NxN matrix), Rift adjusts p-values with Benjamini–Hochberg FDR correction so the naive "something looks significant in this big table" failure mode is closed. Subgroup tables show both raw p and adjusted q (BH).
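The Benjamini–Hochberg step-up adjustment itself is short. This sketch implements the standard procedure (sort the p-values, scale each by m / rank, then enforce monotonicity from the largest rank down); the function name is hypothetical and the example p-values are made up for illustration.

```python
def benjamini_hochberg(p_values: list[float]) -> list[float]:
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices by ascending p
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q is the minimum over
    # its own scaled p and all larger ranks.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values


# Made-up raw p-values for four subgroups.
raw = [0.01, 0.04, 0.03, 0.005]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"p={p:.3f}  q={q:.3f}")
```

A subgroup is then flagged when its q, not its raw p, falls below the chosen false-discovery rate.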

Every comparison also gets a post-hoc power analysis: observed power, minimum detectable effect at 80% power, and (optionally) the N needed to detect a target effect — the answer to "we did not see drift, but could we have?".

Cost as a first-class signal

Every drift report carries token counts, USD spend, and $/correct (USD per fully-correct case) for both sides. Token-based Enterprise pricing means quality and price have to be compared together — Rift reports both so you don't have to reconcile spreadsheets after the run. See src/rift/pricing.py for the catalog; pass --enterprise-multiplier to apply your contracted rate.

Output-token decomposition

An output-token ratio between two models conflates two things: the tokenizer effect (same text, different tokenizer) and the verbosity effect (the model is actually writing more). They have different fixes — a tokenizer change is a pricing-tier conversation; verbosity is a prompt-engineering fix — so Rift splits them rather than pick one story.

python benchmarks/analyze_output_tokens.py \
    --baseline  runs/opus46_reasoning.json \
    --challenger runs/opus47_reasoning.json \
    --output benchmarks/output_token_decomposition.md

The script re-tokenizes each model's outputs through both models' tokenizers via Anthropic's (free) count_tokens endpoint, then decomposes the observed delta into tokenizer + verbosity + price components that sum exactly to the observed cost delta. See src/rift/output_tokens.py for the math.
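One way to build a decomposition whose parts sum to the observed delta is a sequential one: change the text first (verbosity), then the tokenizer, then the price. The sketch below shows that construction under stated assumptions; it is not taken from src/rift/output_tokens.py, the names are hypothetical, the token counts in the example are made up, and a sequential decomposition depends on the order in which the factors are changed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Decomposition:
    verbosity: float  # USD from the challenger writing more (or less) text
    tokenizer: float  # USD from the same text costing more (or fewer) tokens
    price: float      # USD from a change in the per-token output price

    @property
    def total(self) -> float:
        return self.verbosity + self.tokenizer + self.price


def decompose_output_cost(
    base_text_base_tok: int,   # baseline output, counted by baseline tokenizer
    chal_text_base_tok: int,   # challenger output, counted by baseline tokenizer
    chal_text_chal_tok: int,   # challenger output, counted by challenger tokenizer
    base_price_per_mtok: float,
    chal_price_per_mtok: float,
) -> Decomposition:
    mtok = 1_000_000
    verbosity = (chal_text_base_tok - base_text_base_tok) * base_price_per_mtok / mtok
    tokenizer = (chal_text_chal_tok - chal_text_base_tok) * base_price_per_mtok / mtok
    price = chal_text_chal_tok * (chal_price_per_mtok - base_price_per_mtok) / mtok
    return Decomposition(verbosity, tokenizer, price)


# Made-up counts: 10% more text, then a tokenizer that inflates it further.
parts = decompose_output_cost(1_000, 1_100, 1_500, 25.0, 25.0)
observed = 1_500 * 25.0 / 1_000_000 - 1_000 * 25.0 / 1_000_000
print(parts, f"sum={parts.total:.6f}", f"observed={observed:.6f}")
```

The intermediate count (challenger text under the baseline tokenizer) is what separates the two effects, which is why each output has to be re-tokenized through both tokenizers.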

Context-rot benchmark

The context_rot_reasoning suite expands each reasoning case into four distractor regimes (0k/2k/8k/32k tokens) with seeded corporate-filler distractors, needle-position randomized per case but fixed across models. Use --subgroup distractor: to get a per-regime breakdown of where a model starts to fail. See benchmarks/context_rot_opus47_analysis.md for a worked example.
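The property that matters here is that the distractor text and the needle position depend only on the case and the regime, never on the model, so both models see byte-identical prompts. A minimal sketch of that idea follows. It is not the suite's actual generator: the filler sentences, the `build_prompt` name, the seed format, and the use of a word count as a stand-in for a token count are all assumptions.

```python
import random

# Hypothetical corporate-filler sentences used as distractors.
FILLER = [
    "The quarterly alignment meeting has been moved to the third floor.",
    "Please submit expense reports through the updated procurement portal.",
    "Our synergy roadmap prioritizes cross-functional stakeholder engagement.",
    "Reference material: the onboarding handbook was revised last spring.",
]

# Distractor budget per regime, approximated in words rather than tokens.
REGIMES = {"0k": 0, "2k": 2_000, "8k": 8_000, "32k": 32_000}


def build_prompt(case_id: str, question: str, regime: str) -> str:
    """Embed the question at a seeded position inside seeded filler text."""
    budget = REGIMES[regime]
    if budget == 0:
        return question
    # Seed from case and regime only, so every model gets the same prompt.
    rng = random.Random(f"{case_id}:{regime}")
    blocks: list[str] = []
    words = 0
    while words < budget:
        sentence = rng.choice(FILLER)
        blocks.append(sentence)
        words += len(sentence.split())
    blocks.insert(rng.randrange(len(blocks) + 1), question)
    return "\n".join(blocks)


prompt = build_prompt("case-01", "What is 17 * 23?", "2k")
print(len(prompt.split()), "words")
```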

Power-stratified case discovery

Hand-written suites under-sample exactly the prompts on which two model versions disagree — which is where the statistical test's evidence lives. rift discover flips this around: given a (baseline, challenger) pair and a seed suite, it uses a strong proposer model to generate candidate prompts, runs both models on each, and keeps the cases that contribute most to the paired test's power on the discovered suite.

rift discover \
  --baseline opus-4-6 --challenger opus-4-7 \
  --seed-suite reasoning \
  --proposer-model opus-4-7 \
  --target-power 0.9 --target-effect 0.05 \
  --max-cases 50 \
  --output discovered_reasoning_drift.yaml

# Then feed the discovered suite straight into compare:
rift compare --baseline opus-4-6 --challenger opus-4-7 \
             --suite discovered_reasoning_drift.yaml

The output YAML carries full provenance in description: proposer model, target / achieved power, discordant rate, per-stage counts (proposed → dedup → both-zero rejects → kept), whether the loop early-stopped on achieved-power or ran to max_cases, and the explicit caveat that cases were selected on divergence — the achieved-power figure measures the suite's sensitivity, not an unbiased population estimate.

The loop is iterative: after the first batch, every subsequent proposer call surfaces the accepted-so-far cases and asks for different failure modes. This drives diversity without manual prompting. For continuous-score seed suites (fuzzy_match, llm_judge), pass --min-info 0.2 to filter out near-tie cases that would dilute the discovered suite's power.

The framing, "discover cases such that the paired test is powered at ≥0.9 to detect a 5pp drop", is the methodological core of rift discover. See src/rift/discovery.py for the McNemar information-contribution math.

Beyond accuracy: refusal, sycophancy, calibration, faithfulness

Behavioral axes that move independently of accuracy and that release notes typically hand-wave around:

Refusal drift (rift refusal a.json b.json) — classifies each output for refusal language and reports over-refusal cases (challenger refused prompts the baseline answered correctly) and new-compliance cases (baseline refused, challenger answered). Fully offline — no extra API calls.
Calibration drift (rift calibration a.json b.json) — parses stated confidence from outputs (Confidence: 0.85, I am 85% sure, etc.) and reports Brier score, ECE, and overconfidence deltas. Cases without parseable confidence are surfaced, not silently coerced.
Sycophancy probe (rift sycophancy --model X --suite Y) — runs the suite twice; the second pass pushes back on each of the model's answers and measures the flip rate among originally-correct cases. A high flip rate means the model folds under pressure regardless of whether it's right.
Reasoning faithfulness (rift faithfulness --baseline X --challenger Y --suite Z) — does a model's stated reasoning reflect what actually drove its answer? Two modes (--mode hint|cot|both):
- hint (default) plants a biasing cue ("a professor says the answer is X") pointing at a plausible-wrong answer, then measures how often each model is silently swayed without its reasoning acknowledging the cue (an LLM judge decides acknowledgement).
- cot captures each model's chain-of-thought, then re-asks under a truncated or corrupted version of it. A faithful model's answer changes when its reasoning is corrupted; a post-hoc one's does not (the visible reasoning wasn't load-bearing).
Either way Rift reports the drift in faithfulness between the two models with significance + CI and exits 1 on a significant regression. Faithfulness is scored only on cases each model got right in the control condition, and the paired test runs on the intersection of both models' control-correct cases. (Cited prior work: Turpin et al. 2023 for hint articulation; Lanham et al. 2023 for CoT-dependence.)
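As an illustration of the calibration axis, the sketch below parses the two confidence phrasings mentioned above and computes a Brier score over the cases where parsing succeeded. The regular expressions, function names, and the choice to return `None` for unparseable or out-of-range values are assumptions; Rift's parser likely recognizes more phrasings.

```python
import re
from typing import Optional

_DECIMAL = re.compile(r"confidence:\s*([01](?:\.\d+)?)", re.IGNORECASE)
_PERCENT = re.compile(r"(\d{1,3}(?:\.\d+)?)\s*%\s*(?:sure|confident)", re.IGNORECASE)


def parse_confidence(output: str) -> Optional[float]:
    """Return stated confidence in [0, 1], or None if none can be parsed."""
    match = _DECIMAL.search(output)
    if match:
        value = float(match.group(1))
    else:
        match = _PERCENT.search(output)
        if not match:
            return None
        value = float(match.group(1)) / 100
    return value if 0.0 <= value <= 1.0 else None


def brier_score(pairs: list[tuple[Optional[float], int]]) -> Optional[float]:
    """Mean squared gap between confidence and correctness (1 or 0).

    Cases without a parsed confidence are skipped rather than coerced.
    """
    scored = [(conf, correct) for conf, correct in pairs if conf is not None]
    if not scored:
        return None
    return sum((conf - correct) ** 2 for conf, correct in scored) / len(scored)


outputs = ["Answer: 4\nConfidence: 0.85", "I am 85% sure it is 5", "It is 6"]
print([parse_confidence(text) for text in outputs])
```

Lower Brier scores are better; a calibration drift report compares this score between the two runs.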

Roadmap

License

MIT
