Description:
While running the ATLAS system, we observed a significant discrepancy in F1 scores between the pretrained model and our own-trained version. The pretrained model closely matches the paper's results (within 1%), but the own-trained results deviate much more.
We consider reproducibility differences within ±2% acceptable, but these deviations exceed that margin.
Results Summary

| Model | Level | TP | FP | TN | FN | Precision | Recall | F1 Score | F1 % Diff |
|---|---|---|---|---|---|---|---|---|---|
| Paper | M1 (Event Level) | 8168 | 3 | 24304 | 0 | 0.9996 | 1.0000 | 0.9998 | — |
| Own-Trained | M1 (Event Level) | 5299 | 379 | 243137 | 2881 | 0.9333 | 0.6478 | 0.7648 | −23.5103 |
| Pre-Trained | M1 (Event Level) | 8180 | 1 | 243494 | 0 | 0.9999 | 1.0000 | 0.9999 | 0.0123 |

| Model | Level | TP | FP | TN | FN | Precision | Recall | F1 Score | F1 % Diff |
|---|---|---|---|---|---|---|---|---|---|
| Paper | M3 (Entity Level) | 35 | 1 | 24423 | 1 | 0.9722 | 0.9722 | 0.9722 | — |
| Own-Trained | M3 (Entity Level) | 18 | 15 | 1263 | 6 | 0.5455 | 0.7500 | 0.6316 | −35.0376 |
| Pre-Trained | M3 (Entity Level) | 24 | 1 | 1308 | 0 | 0.9600 | 1.0000 | 0.9796 | 0.7580 |
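As a sanity check, the metric columns above follow directly from the confusion counts. A minimal Python sketch (assuming the standard precision/recall/F1 definitions; the helper name `prf1` is our own) reproduces the Own-Trained M1 row and its percentage difference against the paper:

```python
def prf1(tp, fp, fn):
    """Precision, recall, and F1 from confusion counts (TN is not needed)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Own-Trained, M1 (Event Level) row from the table above
p, r, f1 = prf1(tp=5299, fp=379, fn=2881)
print(round(p, 4), round(r, 4), round(f1, 4))  # 0.9333 0.6478 0.7648

# Percentage difference relative to the paper's F1
_, _, f1_paper = prf1(tp=8168, fp=3, fn=0)
print(round((f1 - f1_paper) / f1_paper * 100, 4))  # -23.5103
```

Since the derived metrics are internally consistent with the raw counts, the gap comes from the counts themselves (far more FN and FP in the own-trained run), not from a scoring bug.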
Request:
Any ideas or guidance on why this discrepancy in F1 scores between the pretrained and own-trained models might be occurring?