CLiP (Constant-time Lifecycle Prediction) predicts a file’s lifetime (scalar) or lifecycle (coarse I/O activity histogram over time) as soon as the file is created or first opened, using only the file creation/first-open call stack as context.
This repository contains the analysis notebooks used to reproduce/illustrate the evaluation results described in the paper “Context Matters: Constant-Time File Lifecycle Prediction from File Creation Call Stacks”.
The core idea is simple: the creation context (call stack) acts as a semantic signature of the file’s role in the application, enabling accurate predictions with constant-time inference via a compact per-application lookup table.
CLiP follows a two-phase workflow:
During training runs, an LD_PRELOAD tracer intercepts POSIX I/O calls to:
- capture the call stack at the first open()/create (creation context),
- track subsequent I/O to extract the ground truth:
  - Lifetime: time from first open/create to last access,
  - Lifecycle histogram: #I/O events in coarse time bins after first open.
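As a rough illustration of the two ground-truth quantities, the sketch below derives them from a list of per-file I/O timestamps. It is not the tracer's code: the function names, the 10-second bin width, the six-bin horizon, and the choice to count the first open as an event are all assumptions made for the example.

```python
from typing import List, Sequence


def lifetime(timestamps: Sequence[float]) -> float:
    """Time from the first open/create to the last access, in seconds."""
    return max(timestamps) - min(timestamps)


def lifecycle_histogram(
    timestamps: Sequence[float],
    bin_width: float = 10.0,  # assumed bin width (seconds), not from the paper
    num_bins: int = 6,        # assumed number of bins, not from the paper
) -> List[int]:
    """Count I/O events per coarse time bin, measured from the first open."""
    t0 = min(timestamps)
    hist = [0] * num_bins
    for t in timestamps:
        idx = int((t - t0) // bin_width)
        # Assumption: events past the last bin are clamped into it.
        hist[min(idx, num_bins - 1)] += 1
    return hist


# Hypothetical timestamps (seconds) of one file's I/O events.
events = [100.0, 100.5, 103.0, 112.0, 131.0]
print(lifetime(events))             # 31.0
print(lifecycle_histogram(events))  # [3, 1, 0, 1, 0, 0]
```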
Per application, CLiP aggregates these observations into compact lookup tables:
- Lifetime table: context -> median lifetime,
- Lifecycle table: context -> mean histogram (bin-wise).
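The aggregation step can be sketched as follows, assuming each training observation is a `(context, lifetime, histogram)` tuple. The `build_tables` helper and the tuple layout are hypothetical; only the median and bin-wise mean come from the description above.

```python
from collections import defaultdict
from statistics import median


def build_tables(observations):
    """Aggregate (context, lifetime, histogram) observations per context.

    Returns (lifetime_table, lifecycle_table):
      lifetime_table:  context -> median lifetime
      lifecycle_table: context -> bin-wise mean histogram
    """
    lifetimes = defaultdict(list)
    histograms = defaultdict(list)
    for ctx, life, hist in observations:
        lifetimes[ctx].append(life)
        histograms[ctx].append(hist)

    lifetime_table = {ctx: median(vals) for ctx, vals in lifetimes.items()}
    lifecycle_table = {
        # zip(*hists) groups the same bin across all files of a context.
        ctx: [sum(col) / len(col) for col in zip(*hists)]
        for ctx, hists in histograms.items()
    }
    return lifetime_table, lifecycle_table


# Hypothetical observations with two-bin histograms.
observations = [
    ("ctx_a", 10.0, [4, 0]),
    ("ctx_a", 12.0, [2, 2]),
    ("ctx_a", 30.0, [3, 1]),
    ("ctx_b", 1.0, [1, 0]),
]
lifetime_table, lifecycle_table = build_tables(observations)
print(lifetime_table)   # {'ctx_a': 12.0, 'ctx_b': 1.0}
print(lifecycle_table)  # {'ctx_a': [3.0, 1.0], 'ctx_b': [1.0, 0.0]}
```

In this toy data the median keeps the single 30-second outlier in `ctx_a` from pulling the predicted lifetime away from the typical value.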
At runtime, CLiP only needs to instrument create/open calls:
- hash the current creation call stack,
- do a constant-time table lookup to return the predicted lifetime or lifecycle.
If a context was never observed during training, CLiP falls back to a global default learned from training data.
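A minimal sketch of the runtime path is shown below. The way the stack is turned into a key (joining frame names and hashing with SHA-1), the `LifetimePredictor` class, and the example frame names are assumptions for illustration; the source does not specify the hash or the table format.

```python
import hashlib
from typing import Dict, Sequence


def context_key(call_stack: Sequence[str]) -> str:
    """Hash a call stack (list of frame names) into a fixed-size key.

    Assumption: frames are joined as text and hashed with SHA-1.
    """
    joined = "\n".join(call_stack).encode("utf-8")
    return hashlib.sha1(joined).hexdigest()


class LifetimePredictor:
    """Per-application lookup table with a global fallback value."""

    def __init__(self, table: Dict[str, float], default: float) -> None:
        self.table = table
        self.default = default  # global default learned from training data

    def predict(self, call_stack: Sequence[str]) -> float:
        # Average O(1) dictionary lookup; unseen contexts get the default.
        return self.table.get(context_key(call_stack), self.default)


# Hypothetical creation contexts.
seen_stack = ["main", "write_checkpoint", "fopen"]
unseen_stack = ["main", "log_init", "fopen"]

predictor = LifetimePredictor({context_key(seen_stack): 42.0}, default=5.0)
print(predictor.predict(seen_stack))    # 42.0
print(predictor.predict(unseen_stack))  # 5.0
```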
- Lifetime Prediction.ipynb: Lifetime prediction analysis (Predicted/Real ratio, PDF/CDF plots, accuracy within ±5% / ±10%, etc.).
- Lifetime Prediction-ML.ipynb: Additional/alternative modeling and analysis notebook (baseline exploration).
Note: This repo focuses on analysis and reproducibility notebooks. The paper details the tracing + lookup-table method and the evaluation protocol.
CLiP is evaluated on three representative HPC workloads: Incompact3d, LAMMPS, and NAMD.
For lifetime prediction, the paper reports strong concentration of the Predicted/Real ratio around 1. For example, LAMMPS reaches ~95% of files within ±5% relative error and ~99–100% within ±10%.
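To make the metric concrete, the sketch below computes the share of files whose Predicted/Real ratio lies within a given relative tolerance of 1. It is not taken from the notebooks: the `within_tolerance` helper, the sample values, and the choice to skip files with a zero real lifetime are assumptions.

```python
from typing import Sequence


def within_tolerance(
    predicted: Sequence[float], real: Sequence[float], tol: float
) -> float:
    """Fraction of files whose Predicted/Real ratio is within 1 ± tol."""
    # Assumption: files with a zero real lifetime are skipped, since the
    # ratio is undefined for them.
    ratios = [p / r for p, r in zip(predicted, real) if r > 0]
    if not ratios:
        raise ValueError("no files with a positive real lifetime")
    hits = sum(1 for ratio in ratios if abs(ratio - 1.0) <= tol)
    return hits / len(ratios)


# Made-up lifetimes in seconds, for illustration only.
predicted = [10.0, 10.2, 9.3, 12.0]
real = [10.0, 10.0, 10.0, 10.0]

print(within_tolerance(predicted, real, 0.05))  # 0.5
print(within_tolerance(predicted, real, 0.10))  # 0.75
```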
For lifecycle prediction (binned I/O counts after first open), accuracy depends on application phases and is sensitive to bin boundaries (small time shifts can move I/O between adjacent bins).
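The bin-boundary effect can be seen in a small, made-up example: three events that really occur just before the 10-second boundary are predicted about half a second later, and the histogram changes sharply. The bin width, number of bins, and the absolute-error measure used here are assumptions for illustration.

```python
from typing import List, Sequence


def histogram(
    offsets: Sequence[float], bin_width: float = 10.0, num_bins: int = 3
) -> List[int]:
    """Count events per bin; offsets are seconds since the first open."""
    hist = [0] * num_bins
    for off in offsets:
        hist[min(int(off // bin_width), num_bins - 1)] += 1
    return hist


# Real events cluster just before the 10 s boundary ...
real_offsets = [0.0, 9.6, 9.8, 9.9]
# ... while the predicted events land about 0.5 s later, just after it.
predicted_offsets = [0.0, 10.1, 10.3, 10.4]

real_hist = histogram(real_offsets)
pred_hist = histogram(predicted_offsets)
error = sum(abs(p - r) for p, r in zip(pred_hist, real_hist))

print(real_hist)  # [4, 0, 0]
print(pred_hist)  # [1, 3, 0]
print(error)      # 6
```

A timing shift of roughly half a second moves three of the four events into the adjacent bin, so a bin-wise error measure reports a large mismatch even though the predicted activity is nearly correct in time.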
- Python 3.9+ (recommended)
- Jupyter / JupyterLab
- Typical scientific stack: numpy, pandas, matplotlib (and optionally scipy)
Example:
pip install numpy pandas matplotlib jupyter
jupyter lab