ATLAS is a research framework for leakage‑resilient machine learning
evaluation.
It enforces strict information‑flow constraints so that validation and
test data cannot influence model development, helping ensure reliable
and reproducible performance estimates.
The framework formalizes a Split‑Before‑Fit protocol, provides automated leakage auditing, and introduces a quantitative Leakage Risk Score (LRS) for evaluation governance.

Repository structure:

    ATLAS
    │
    ├── README.md
    │
    ├── data
    │   ├── synthetic
    │   ├── realworld
    │   ├── higgs
    │   ├── higgs_negative_control
    │   └── audit
    │
    ├── experiments
    │
    └── figures

- data/: All experiment outputs used in the manuscript, including synthetic experiments, real-world datasets, HIGGS benchmark results, and protocol audit artifacts.
- data/synthetic/: Results from controlled synthetic experiments evaluating leakage behavior under different protocol conditions.
- data/realworld/: Benchmark results on multiple real-world datasets demonstrating leakage pressure in practical settings.
- data/higgs/: Large-scale experiments on the HIGGS dataset, used to evaluate robustness under realistic machine learning pipelines.
- data/higgs_negative_control/: Negative-control experiments (label-shuffle) verifying that measured optimism gaps are not statistical artifacts.
- data/audit/: ATLAS protocol audit logs, reproducibility metadata, and leakage risk diagnostics.
- experiments/: Python scripts used to run the experiments and reproduce the results reported in the paper.
- figures/: Figures included in the manuscript.

Evaluation pipelines must follow these steps, in order:
- Define train / validation / test splits before modeling
- Fit all operators on train only
- Use validation for model selection
- Use the test set only once for final reporting
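
The four steps above can be illustrated with a minimal, standard-library-only sketch. It does not use the ATLAS API; `split_indices` and `Standardizer` are hypothetical names introduced here for illustration. The point it shows is that the splits are fixed first and the preprocessing operator learns its statistics from the training split alone.

```python
import random
import statistics


def split_indices(n, seed=0, frac=(0.6, 0.2, 0.2)):
    """Partition range(n) into disjoint train/validation/test index lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_train = int(frac[0] * n)
    n_val = int(frac[1] * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


class Standardizer:
    """Hypothetical operator: its statistics come from whatever it is fitted on."""

    def fit(self, values):
        self.mean = statistics.fmean(values)
        self.std = statistics.pstdev(values) or 1.0  # guard against zero variance
        return self

    def transform(self, values):
        return [(v - self.mean) / self.std for v in values]


data = [float(x) for x in range(100)]

# 1. Define the splits before any modeling or preprocessing.
train_idx, val_idx, test_idx = split_indices(len(data))

# 2. Fit the operator on the training split only.
scaler = Standardizer().fit([data[i] for i in train_idx])

# 3./4. Validation and test are only ever transformed, never fitted on.
train_x = scaler.transform([data[i] for i in train_idx])
val_x = scaler.transform([data[i] for i in val_idx])
test_x = scaler.transform([data[i] for i in test_idx])
```

Fitting `Standardizer` on the full `data` list instead would let validation and test statistics flow into the training features, which is the kind of leakage the protocol is meant to rule out.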
ALAV automatically audits pipeline artifacts and detects protocol violations.
Checks include:
- split overlap detection
- preprocessing scope verification
- test‑reuse detection
- duplicate leakage detection
- temporal/group leakage checks
- cache contamination checks
Output status:
PASS / WARN / FAIL
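
As an illustration of the first check in the list, the following sketch detects overlap between splits and maps the result to a status. It is not the ALAV implementation: the function name `check_split_overlap` and the severity rule (any overlap involving the test split is FAIL, train/validation overlap is WARN) are assumptions made for this example.

```python
def check_split_overlap(train_ids, val_ids, test_ids):
    """Return (status, overlaps) for pairwise overlap between the three splits."""
    splits = {
        "train": set(train_ids),
        "validation": set(val_ids),
        "test": set(test_ids),
    }
    names = list(splits)
    overlaps = {}
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            shared = splits[a] & splits[b]
            if shared:
                overlaps[(a, b)] = sorted(shared)

    if not overlaps:
        return "PASS", overlaps
    # Assumed severity rule: overlap that touches the test split is a FAIL,
    # overlap confined to train/validation is a WARN.
    status = "FAIL" if any("test" in pair for pair in overlaps) else "WARN"
    return status, overlaps
```

For example, `check_split_overlap([1, 2], [2], [4])` reports a train/validation overlap on ID `2`, while moving that shared ID into the test split escalates the status.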
ATLAS quantifies evaluation risk using a Leakage Risk Score (0-100).
Risk levels:
| Score | Interpretation |
|---|---|
| 0-19 | Low |
| 20-39 | Medium |
| 40-69 | High |
| 70-100 | Critical |
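
The table can be read as a simple lookup. The sketch below encodes the bands as given; the function name `risk_level` is hypothetical, and assigning fractional scores (for example 19.5) to the lower band is an assumption, since the table only lists integer ranges.

```python
def risk_level(score):
    """Map a Leakage Risk Score in [0, 100] to its interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError("Leakage Risk Score must be between 0 and 100")
    if score < 20:
        return "Low"
    if score < 40:
        return "Medium"
    if score < 70:
        return "High"
    return "Critical"
```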
The score is computed from three surrogate indicators:
- Duplicate Overlap Rate (DOR)
- Preprocessing Leakage Indicator (PLI)
- Test‑Reuse Optimism Proxy (TOP)
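
This README does not give formulas for the indicators. As a rough illustration of what a duplicate-overlap measure could look like, the sketch below assumes that DOR is the fraction of test rows whose exact feature values also appear in the training split. Both that definition and the name `duplicate_overlap_rate` are assumptions, not the ATLAS definition.

```python
def duplicate_overlap_rate(train_rows, test_rows):
    """Fraction of test rows that exactly duplicate some training row.

    Assumed definition for illustration only.
    """
    if not test_rows:
        return 0.0
    train_keys = {tuple(row) for row in train_rows}
    duplicates = sum(1 for row in test_rows if tuple(row) in train_keys)
    return duplicates / len(test_rows)


train = [(1, 2), (3, 4)]
test = [(1, 2), (5, 6), (3, 4), (7, 8)]
dor = duplicate_overlap_rate(train, test)  # two of the four test rows are duplicates
```

Exact matching is the simplest choice; near-duplicate detection would need a similarity threshold, which this sketch does not attempt.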
Data → Split → Train → Select → Evaluate
The evaluation stage is protected by the ATLAS trust layer, preventing information leakage from test data.
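
Example usage:

    from atlas import Protocol, Auditor

    # `data`, `model`, and the train/validation/test partitions are supplied by the caller.
    protocol = Protocol()
    protocol.split(data)                            # define splits before modeling
    model = protocol.train(model, train_data)       # fit on the training split only
    protocol.select(model, validation_data)         # model selection on validation
    results = protocol.evaluate(model, test_data)   # single final test evaluation
    Auditor.run(protocol)                           # audit the pipeline

ATLAS produces machine‑auditable artifacts such as: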
- data/audit/split_manifest.json
- data/audit/operator_log.csv
- data/audit/duplicate_report.csv
- data/audit/alav_report.json

These allow independent verification of evaluation integrity.
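
To show how an artifact of this kind can support independent verification, here is a minimal sketch that builds a split manifest with one SHA-256 digest per split. The schema (`seed`, `splits`, `n`, `sha256`) and the helper names are hypothetical; the actual contents of `split_manifest.json` are not described in this README.

```python
import hashlib
import json


def split_digest(ids):
    """Order-independent SHA-256 digest of a split's record IDs."""
    payload = json.dumps(sorted(ids)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def build_manifest(splits, seed):
    """Summarize each split by its size and digest (hypothetical schema)."""
    return {
        "seed": seed,
        "splits": {
            name: {"n": len(ids), "sha256": split_digest(ids)}
            for name, ids in splits.items()
        },
    }


manifest = build_manifest({"train": [0, 1, 2], "validation": [3], "test": [4]}, seed=0)
manifest_json = json.dumps(manifest, indent=2, sort_keys=True)
```

A reviewer who re-derives the splits from the same seed can recompute the digests and compare them with the manifest without needing the raw data to be shipped alongside it.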