Restore rigorous baseline validation with 20-seed statistical significance #16

Open
TasfinMahmud wants to merge 2 commits into main from feat/baseline-20-seeds

Conversation

@TasfinMahmud

Collaborator

@abhiprd2000 The evaluation has been re-run with the seed count increased from 5 to 20, both to address the low statistical power of the 5-seed run and to support a formal significance test. As anticipated, the standard error bounds tightened considerably with the larger sample size.
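
As a rough check on the expected tightening: assuming the ± values are standard errors of the mean and the per-seed standard deviation $s$ stays roughly unchanged (the run output does not state which statistic the ± denotes), quadrupling the seed count should about halve the bound:

$$\mathrm{SE}_n = \frac{s}{\sqrt{n}}, \qquad \frac{\mathrm{SE}_{20}}{\mathrm{SE}_{5}} = \sqrt{\frac{5}{20}} = \frac{1}{2}$$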

Here are the aggregated results from the 20-seed run:

============================================================
FINAL AGGREGATED RESULTS (20 Seeds)
============================================================
Physics GAP: +0.538 ± 0.136
Softmax GAP: +0.247 ± 0.215
MC-Drop GAP: +0.334 ± 0.091
Ensemble GAP: +0.386 ± 0.120

Statistical Significance (Physics vs Ensemble):
  Paired t-test p-value: 0.0032
  Average Matched Coverage: N = 5567.1 / 14914 (37.33%)

Noise Test Catch Rate (Physics catches Ensemble's confident errors):
   clean: 0.717 ± 0.273
    20dB: 0.726 ± 0.268
    10dB: 0.772 ± 0.237
     5dB: 0.829 ± 0.177
     0dB: 0.827 ± 0.176

With $N=20$ seeds, the paired t-test comparing the Physics GAP against the Deep Ensemble GAP yields $p = 0.0032$, which is below the $p < 0.05$ threshold for statistical significance. This resolves the low-power concern from the 5-seed run, and the script has been permanently updated to reflect the 20-seed requirement.
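
For reference, here is a minimal sketch of the paired t-statistic that a test like this one is built on, using only the Python standard library. The function name `paired_t_statistic` and the per-seed values are hypothetical: the data is synthetic and is neither the script's code nor the numbers from this run. The constant 2.093 is the standard two-sided critical value of Student's t at $\alpha = 0.05$ with 19 degrees of freedom.

```python
import math
import random
import statistics

# Two-sided critical value of Student's t at alpha = 0.05 with 19 degrees of freedom.
T_CRIT_DF19 = 2.093


def paired_t_statistic(a, b):
    """Return (t, degrees_of_freedom) for a paired t-test on two matched samples."""
    if len(a) != len(b):
        raise ValueError("paired samples must have the same length")
    n = len(a)
    if n < 2:
        raise ValueError("need at least two pairs")
    # The paired test works on per-seed differences, not on the two groups separately.
    diffs = [x - y for x, y in zip(a, b)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    t = mean_diff / (sd_diff / math.sqrt(n))
    return t, n - 1


# Synthetic per-seed GAP values for illustration only (NOT results from this PR).
rng = random.Random(0)
physics_gap = [rng.gauss(0.5, 0.1) for _ in range(20)]
ensemble_gap = [p - rng.gauss(0.1, 0.1) for p in physics_gap]

t, df = paired_t_statistic(physics_gap, ensemble_gap)
print(f"t = {t:.3f}, df = {df}, significant at 0.05: {abs(t) > T_CRIT_DF19}")
```

If SciPy is available, `scipy.stats.ttest_rel(a, b)` returns the same statistic together with the exact p-value.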

abhiprd2000 self-requested a review July 1, 2026 18:49