test(gsar): stabilize live integration suite under judge variance by fede-kamel · Pull Request #18 · oracle-samples/locus

fede-kamel · 2026-04-30T18:15:07Z

Empirical reliability work after #17. Ran the GSAR live suite 80 times across three rounds (20 + 30 + 30), identified two assertions that were flaky under judge variance, fixed them, and used the final 30-run round to confirm stability.

Results

| Round | Runs | Result | Failures |
| --- | --- | --- | --- |
| Original | 20 | 19 pass / 1 fail | `test_gsar_cross_judge_decision_agreement`: gpt-4o sent grounded → replan |
| After cross-judge fix | 30 | 28 pass / 2 fail | `test_gsar_rho_zero_inflation_visible_live`: judge over-contradicted, W(G)+W(K)=0 → strict inflation degenerate |
| After P5 precondition fix | 30 | 30 pass / 0 fail | none |

Final round: 237 passes, 3 honest skips, 0 failures across 240 test attempts (30 runs × 8 tests). 1912s total wall-clock (~64s/run).

The two stabilizations

1. Cross-judge test → directional, not tier-based. Renamed `test_gsar_cross_judge_decision_agreement` to `test_gsar_cross_judge_score_directional_agreement`. The old assertion required the judges to agree on the decision tier, which failed when gpt-4o sent the grounded report to replan. The paper's §11 / Table 10 claim is judge-agnosticism of the mechanism, not tier alignment under model variance. The new assertion: each judge must score the grounded report strictly higher than the ungrounded report.
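A minimal sketch of what a directional assertion of this kind could look like. The helper name `assert_directional_agreement`, the judge labels, and the score values are hypothetical placeholders for illustration. They are not taken from the test suite and are not measured results.

```python
from typing import Mapping, Tuple


def assert_directional_agreement(
    scores_by_judge: Mapping[str, Tuple[float, float]],
) -> None:
    """Check that every judge ranks the grounded report above the ungrounded one.

    Each value is a (grounded_score, ungrounded_score) pair. Only the direction
    of the gap is checked, not which decision tier either score falls into.
    """
    for judge, (grounded, ungrounded) in scores_by_judge.items():
        assert grounded > ungrounded, (
            f"{judge}: grounded score {grounded:.3f} is not strictly above "
            f"ungrounded score {ungrounded:.3f}"
        )


# Illustrative numbers only (hypothetical judges and scores).
example_scores = {
    "judge_a": (0.82, 0.31),
    "judge_b": (0.64, 0.12),
}
assert_directional_agreement(example_scores)
```

Under this shape, two judges may land on different tiers for the same report and the check still passes, as long as each one separates grounded from ungrounded in the same direction.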

2. P5 ablation gets a second precondition. The test already skipped when the judge produced no contradicted claim (W(X) = 0). It now also skips when the judge produces no grounded/complementary claims (W(G) + W(K) = 0). In that degenerate partition both ρ values yield S = 0 by the math, so strict inflation is unobservable. The weak inequality `s_no_rho >= s_default` is still asserted unconditionally per Property P5.
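The ordering of the checks described above can be sketched as follows. This is an assumption-laden illustration: `p5_skip_reason` and `check_p5` are hypothetical names, the scores are passed in rather than computed (the GSAR score formula is not reproduced here), and the standard-library `unittest.SkipTest` stands in for whatever skip mechanism the suite actually uses.

```python
import unittest
from typing import Optional


def p5_skip_reason(w_g: float, w_k: float, w_x: float) -> Optional[str]:
    """Return why strict inflation is unobservable, or None if it is observable."""
    if w_x == 0:
        return "no contradicted claims: W(X) = 0"
    if w_g + w_k == 0:
        return "no grounded/complementary claims: W(G) + W(K) = 0"
    return None


def check_p5(
    s_default: float, s_no_rho: float, w_g: float, w_k: float, w_x: float
) -> None:
    # The weak inequality is asserted first, for every partition.
    assert s_no_rho >= s_default, "weak P5 inequality violated"

    # The strict inequality is only meaningful when both preconditions hold.
    reason = p5_skip_reason(w_g, w_k, w_x)
    if reason is not None:
        # Report the partition shape that triggered the skip.
        raise unittest.SkipTest(
            f"strict inflation unobservable ({reason}); "
            f"partition: W(G)={w_g}, W(K)={w_k}, W(X)={w_x}"
        )
    assert s_no_rho > s_default, "strict inflation expected but not observed"
```

Asserting the weak inequality before the skip decision means a genuine P5 violation still fails the test even in a degenerate partition; only the strict form is waived.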

Validation

30/30 runs passed after the fix; 1912s wall-clock.
27 runs hit full 8/8; 3 runs hit 7/8 with the P5 ablation correctly skipping due to judge-variance partitions where the property is unobservable. Each skip prints the partition shape that triggered it.
hatch run lint clean.
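As a quick cross-check, the tally reported in Results follows from the run breakdown above. The variable names are illustrative; the numbers are the ones stated in this PR.

```python
TESTS_PER_RUN = 8
RUNS = 30
full_runs = 27            # runs that passed 8/8
runs_with_one_skip = 3    # runs that passed 7/8 with the P5 ablation skipped
assert full_runs + runs_with_one_skip == RUNS

passes = full_runs * TESTS_PER_RUN + runs_with_one_skip * (TESTS_PER_RUN - 1)
skips = runs_with_one_skip
attempts = RUNS * TESTS_PER_RUN
mean_seconds_per_run = 1912 / RUNS

print(passes, skips, attempts)          # 237 3 240
print(round(mean_seconds_per_run, 1))   # 63.7
```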

Test plan

CI runs the existing test files cleanly (the new tests are still gated on OPENAI_API_KEY).
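One way such gating can be expressed, as a dependency-free sketch. The helper `require_openai_key` is hypothetical and uses the standard-library `unittest.SkipTest`; the suite's actual gating mechanism is not shown in this PR and may differ (for example, a pytest skip marker).

```python
import os
import unittest
from typing import Mapping, Optional


def require_openai_key(env: Optional[Mapping[str, str]] = None) -> None:
    """Skip the calling test when OPENAI_API_KEY is missing or empty."""
    env = os.environ if env is None else env
    if not env.get("OPENAI_API_KEY"):
        raise unittest.SkipTest("OPENAI_API_KEY not set; live tests skipped")
```

Calling a guard like this at the top of each live test keeps CI without the key green while leaving the non-live tests unaffected.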

test(gsar): stabilize live integration suite under judge variance See merge request saas-observ-eng/locus!102

Merge branch 'tests/gsar-stability-fixes' into 'main'

c4a7df3

oracle-contributor-agreement (bot) added the "OCA Verified" label (all contributors have signed the Oracle Contributor Agreement) on Apr 30, 2026

fede-kamel merged commit b96142f into main Apr 30, 2026
1 check passed

fede-kamel mentioned this pull request Apr 30, 2026

feat: GSAR Agent integration — Agent(gsar=GSARConfig(...)) #19

Merged

1 task

fede-kamel deleted the tests/gsar-stability-fixes-github branch May 13, 2026 04:29

fede-kamel merged 1 commit into main from tests/gsar-stability-fixes-github
