test(gsar): stabilize live integration suite under judge variance#18
Merged
Conversation
test(gsar): stabilize live integration suite under judge variance See merge request saas-observ-eng/locus!102
1 task
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Empirical reliability work after #17. Ran the GSAR live suite 80 times across three iteration rounds; identified two flaky-under-judge-variance assertions, fixed them, then re-ran 30 times to confirm stability.
Results
test_gsar_cross_judge_decision_agreement: gpt-4o sent grounded → replantest_gsar_rho_zero_inflation_visible_live: judge over-contradicted, W(G)+W(K)=0 → strict inflation degenerateFinal: 237 passes, 3 honest skips, 0 failures across 240 attempts. 1912s total wall-clock (~64s/run).
The two stabilizations
1. Cross-judge test → directional, not tier-based. Renamed
test_gsar_cross_judge_decision_agreementtotest_gsar_cross_judge_score_directional_agreement. The paper's §11 / Table 10 claim is judge-agnosticism of the mechanism, not tier alignment under model variance. The new assertion: each judge must score the grounded report strictly higher than the ungrounded report.2. P5 ablation gets a second precondition. Already skipped when the judge produced no contradicted claim (W(X) = 0). Now also skips when the judge produces no grounded/complementary claims (W(G) + W(K) = 0) — in that degenerate partition both ρ values yield S = 0 by the math, so strict inflation is unobservable. The weak inequality
s_no_rho >= s_defaultis still asserted unconditionally per Property P5.Validation
hatch run lintclean.Test plan
OPENAI_API_KEY).