test(gsar): stabilize live integration suite under judge variance #18

Merged: fede-kamel merged 1 commit into main from tests/gsar-stability-fixes-github on Apr 30, 2026

@fede-kamel
Contributor

Empirical reliability work after #17. Ran the GSAR live suite 80 times across three iteration rounds (20 + 30 + 30 runs). The first two rounds exposed two assertions that were flaky under judge variance; each was fixed in turn, and the final round of 30 runs confirmed stability.

Results

| Round | Runs | Result | Failures |
|---|---|---|---|
| Original | 20 | 19 pass / 1 fail | test_gsar_cross_judge_decision_agreement: gpt-4o sent grounded → replan |
| After cross-judge fix | 30 | 28 pass / 2 fail | test_gsar_rho_zero_inflation_visible_live: judge over-contradicted, W(G)+W(K)=0 → strict inflation degenerate |
| After P5 precondition fix | 30 | 30 pass / 0 fail | |

Final round: 237 passes, 3 honest skips, 0 failures across 240 test attempts (30 runs × 8 tests). 1912s total wall-clock (~64s/run).

The two stabilizations

1. Cross-judge test → directional, not tier-based. Renamed test_gsar_cross_judge_decision_agreement to test_gsar_cross_judge_score_directional_agreement. The paper's §11 / Table 10 claim is judge-agnosticism of the mechanism, not tier alignment under model variance. The new assertion: each judge must score the grounded report strictly higher than the ungrounded report.

2. P5 ablation gets a second precondition. The test already skipped when the judge produced no contradicted claim (W(X) = 0). It now also skips when the judge produces no grounded/complementary claims (W(G) + W(K) = 0): in that degenerate partition both ρ values yield S = 0 by the math, so strict inflation is unobservable. The weak inequality s_no_rho >= s_default is still asserted unconditionally per Property P5.
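
The two stabilizations above can be sketched roughly as follows. This is a minimal illustration, not the code in the PR: the `Partition` container, the helper names, and the judge names are hypothetical, and the sketch assumes that "strict inflation" means `s_no_rho > s_default`. In the real suite the skip would presumably go through the test runner's skip mechanism rather than a return value.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Mapping, Optional


def directional_disagreements(scores: Mapping[str, tuple[float, float]]) -> list[str]:
    """Return the judges that did NOT score the grounded report strictly higher.

    `scores` maps a judge name to (grounded_score, ungrounded_score).
    An empty result means every judge agrees on the direction.
    """
    return [
        judge
        for judge, (grounded, ungrounded) in scores.items()
        if not grounded > ungrounded
    ]


@dataclass(frozen=True)
class Partition:
    """Hypothetical container for the judge's claim weights."""
    w_grounded: float       # W(G)
    w_complementary: float  # W(K)
    w_contradicted: float   # W(X)


def p5_skip_reason(p: Partition) -> Optional[str]:
    """Explain why strict inflation is unobservable, or return None."""
    if p.w_contradicted == 0:
        return f"no contradicted claims, W(X) = 0: {p}"
    if p.w_grounded + p.w_complementary == 0:
        return f"no grounded/complementary claims, W(G) + W(K) = 0: {p}"
    return None


def check_p5(p: Partition, s_no_rho: float, s_default: float) -> str:
    # The weak inequality is checked unconditionally, before any skip.
    assert s_no_rho >= s_default, "P5 weak inequality violated"
    reason = p5_skip_reason(p)
    if reason is not None:
        # The reason string carries the partition shape that triggered the skip.
        return "skipped: " + reason
    # Assumed meaning of strict inflation: removing rho strictly raises the score.
    assert s_no_rho > s_default, "strict inflation not observed"
    return "passed"


if __name__ == "__main__":
    # Judge names and scores are made up for illustration.
    print(directional_disagreements({"judge_a": (0.9, 0.2), "judge_b": (0.7, 0.3)}))
    print(check_p5(Partition(0.0, 0.0, 1.0), s_no_rho=0.0, s_default=0.0))
```

Checking the weak inequality before the skip decision mirrors the description: the skip only suppresses the strict comparison, never the unconditional one.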

Validation

  • 30/30 runs passed after the fix; 1912s wall-clock.
  • 27 runs hit full 8/8; 3 runs hit 7/8 with the P5 ablation correctly skipping due to judge-variance partitions where the property is unobservable. Each skip prints the partition shape that triggered it.
  • hatch run lint clean.
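
The pass/skip totals follow from the run counts above. A small sketch of the arithmetic, using only the figures reported in this description (the function name `tally` is illustrative):

```python
def tally(runs_full: int, runs_one_skip: int, tests_per_run: int = 8) -> dict:
    """Aggregate per-run outcomes into suite-level totals.

    Assumes each run in `runs_one_skip` skipped exactly one test
    and passed the rest, as reported for the final round.
    """
    runs = runs_full + runs_one_skip
    attempts = runs * tests_per_run
    skips = runs_one_skip
    passes = attempts - skips
    return {"runs": runs, "attempts": attempts, "passes": passes, "skips": skips}


if __name__ == "__main__":
    totals = tally(runs_full=27, runs_one_skip=3)
    seconds_per_run = 1912 / totals["runs"]
    print(totals)                    # 240 attempts, 237 passes, 3 skips
    print(f"{seconds_per_run:.1f}")  # 63.7
```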

Test plan

  • CI runs the existing test files cleanly (the new tests are still gated on OPENAI_API_KEY).
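
For readers unfamiliar with key-gated live tests, one way to express that kind of gate is sketched below. It uses the standard library's `unittest` only to stay self-contained; the repository's actual gating mechanism is not shown in this PR, and the helper and class names are hypothetical.

```python
import os
import unittest


def live_judge_available(env=None) -> bool:
    """True when an OpenAI API key is present in the given environment mapping."""
    env = os.environ if env is None else env
    return bool(env.get("OPENAI_API_KEY"))


@unittest.skipUnless(
    live_judge_available(),
    "OPENAI_API_KEY not set; live GSAR tests skipped",
)
class LiveGsarSmoke(unittest.TestCase):
    """Placeholder live test; it only runs when the key is configured."""

    def test_key_present(self):
        self.assertTrue(live_judge_available())
```

With this shape, a CI job without the key reports the live tests as skipped rather than failed.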

test(gsar): stabilize live integration suite under judge variance

See merge request saas-observ-eng/locus!102
@oracle-contributor-agreement (bot) added the OCA Verified label ("All contributors have signed the Oracle Contributor Agreement.") on Apr 30, 2026
@fede-kamel merged commit b96142f into main on Apr 30, 2026
1 check passed
@fede-kamel deleted the tests/gsar-stability-fixes-github branch on May 13, 2026 04:29