Code and notebooks for the paper "Collective Posterior Inference from Highly Variable Empirical Replicates".
The collective posterior extends simulation-based inference from a single observation to a set of replicate observations. The core implementation is in collective_posterior.py; it wraps an amortized individual posterior estimator and combines replicate-wise evidence into a collective posterior.
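
As a rough illustration of what combining replicate-wise evidence can look like, the sketch below multiplies per-replicate posterior densities on a grid, divides out the repeated prior, and floors each density at a small epsilon so that one aberrant replicate cannot rule out a parameter value on its own. This is an assumption made for illustration, not a transcription of collective_posterior.py: the Gaussian stand-in posteriors, the uniform prior, the grid, the toy numbers, and every function name are hypothetical, and the paper's exact definition may differ.

```python
import math


def gaussian_pdf(x, mean, sd):
    """Normal density; a stand-in for an individual posterior q(theta | x_i)."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def collective_log_density(theta, replicate_means, sd, log_prior, epsilon):
    """Unnormalized log density (assumed form):
    (1 - n) * log p(theta) + sum_i log max(q_i(theta), epsilon)."""
    n = len(replicate_means)
    total = (1 - n) * log_prior
    for mean in replicate_means:
        q = gaussian_pdf(theta, mean, sd)
        total += math.log(max(q, epsilon))  # floor each replicate's density
    return total


def grid_argmax(replicate_means, sd, epsilon, lo=0.0, hi=6.0, steps=600):
    """Return the grid point in [lo, hi] with the highest collective log density."""
    log_prior = -math.log(hi - lo)  # uniform prior on [lo, hi]
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    return max(
        grid,
        key=lambda t: collective_log_density(t, replicate_means, sd, log_prior, epsilon),
    )


# Three concordant replicates and one outlier (toy numbers, not from the paper).
means = [0.9, 1.0, 1.1, 5.0]
robust = grid_argmax(means, sd=0.3, epsilon=1e-3)
naive = grid_argmax(means, sd=0.3, epsilon=1e-300)  # effectively no floor
print(f"with floor: {robust:.2f}, without floor: {naive:.2f}")
```

In this toy setting the floored version stays with the three concordant replicates, while the effectively unfloored product is pulled toward the average of all four.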

- `collective_posterior.py`: collective posterior implementation and samplers.
- `simulators.py`: WF, GLU, and SLCP simulators and wrappers.
- `evo_sim.py`: 3-locus evolutionary simulator and wrappers.
- `inference.py`: trains an individual-observation NPE posterior.
- `inference_sbi_iid.py`: trains NPE with a permutation-invariant embedding for replicate sets, referred to in the notebooks as NPE+PIE.
- `test_posterior.py`: evaluates synthetic benchmarks and saves accuracy, coverage, and posterior samples.
- `installation_check.ipynb`: smoke test for the installed environment.
- `GLU/`, `SLCP/`, `EVO_SIM/`, `WF/`: benchmark-specific notebooks, posteriors, cached samples, and figures.

The paper notebooks are tested with Python 3.10 and the pinned packages in requirements.txt.
Important pinned dependencies:

- `torch==2.5.1+cpu`
- `sbi==0.25.0`
- `scikit-learn==1.5.0`
- `seaborn==0.13.2`

The CPU PyTorch wheel is intentional: it avoids CUDA/cuDNN runtime mismatches in notebook kernels. sbi==0.25.0 is used because the NPSE workflow needs recent sbi functionality, including NPSE/iid support. Use the same environment for loading the saved .pkl posterior files and for running the notebooks.
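
Because the saved posteriors are meant to be loaded in the same environment, it can help to compare the installed versions against the pins before opening a notebook. The following is a minimal sketch using only the standard library; the helper names are hypothetical, and the pins are copied from the list above (extend them from requirements.txt as needed).

```python
from importlib import metadata

# Pins copied from the dependency list above.
PINNED = {
    "torch": "2.5.1+cpu",
    "sbi": "0.25.0",
    "scikit-learn": "1.5.0",
    "seaborn": "0.13.2",
}


def installed_version(package):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pins, get_version=installed_version):
    """Map each package that deviates from its pin to (expected, found)."""
    mismatches = {}
    for name, expected in pins.items():
        found = get_version(name)
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches


if __name__ == "__main__":
    for name, (expected, found) in find_mismatches(PINNED).items():
        print(f"{name}: expected {expected}, found {found}")
```

An empty result means the four pinned packages match; installation_check.ipynb remains the fuller check.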
The local GLU and SLCP implementations follow the corresponding sbibm task definitions, so sbibm is not required at runtime. phylo_abc.ipynb is outside the pinned paper environment and still requires the external GenomeRearrangement package separately.
From this directory:

```bash
conda create -n collective python=3.10 -y
conda activate collective
python -m pip install --upgrade pip
python -m pip install -r requirements.txt
python -m pip install sbibm==1.1.0 --no-deps
python -m ipykernel install --user --name collective --display-name "Python (collective)"
```
## Installation Check
After installing, open and run:
```bash
jupyter lab installation_check.ipynb
```

Run `installation_check.ipynb` from the repository root. It checks package imports and versions, simulator outputs, posterior loading, and a small collective-posterior sampling call. If the saved notebook output shows old package versions, re-run all cells with the new kernel.
Many notebooks use relative paths such as posteriors/..., tests/..., or x_test.pt. Run benchmark notebooks from their own directory.
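
One way to catch a wrong working directory early is to check that the relative inputs resolve before running any heavy cells. This is a small sketch, not part of the repository; `missing_inputs` is a hypothetical helper, and the example paths are the GLU files named elsewhere in this README.

```python
from pathlib import Path


def missing_inputs(notebook_dir, relative_paths):
    """Return the relative paths that do not exist under notebook_dir."""
    base = Path(notebook_dir)
    return [rel for rel in relative_paths if not (base / rel).exists()]


if __name__ == "__main__":
    # Run from inside GLU/, so the current directory is the notebook directory.
    needed = ["posteriors/posterior_GLU_100000_20.pkl", "tests/test_theta.pt"]
    for rel in missing_inputs(Path.cwd(), needed):
        print(f"not found relative to {Path.cwd()}: {rel}")
```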
The repository includes cached .pkl, .pt, .csv, and figure files for most results. Use these cached files for fast reproduction of the reported plots. Retraining or resampling can produce slightly different numerical values unless you also reproduce the original random seeds and sampling settings.
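
If you do retrain or resample, fixing the random number generators at least makes your own runs repeatable. The sketch below is an assumption about how one might do that in a notebook or script; the seed value is a placeholder, since the original seeds are not listed here, and `seed_everything` is a hypothetical helper rather than a function from this repository.

```python
import random


def seed_everything(seed):
    """Seed Python's RNG and, when installed, NumPy and PyTorch."""
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass  # NumPy not installed; nothing to seed
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass  # PyTorch not installed; nothing to seed


if __name__ == "__main__":
    seed_everything(0)  # placeholder seed, not the one used for the paper
```

Seeding alone does not guarantee identical numbers across package versions or hardware, which is one more reason to prefer the cached files for reproducing the reported plots.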
## GLU and SLCP

Fast path with cached results:

```bash
cd GLU
jupyter lab "test collective.ipynb"
cd ../SLCP
jupyter lab "test collective.ipynb"
```

These notebooks reproduce the synthetic GLU and SLCP figures from saved posterior and test files:

- `GLU/posteriors/posterior_GLU_100000_20.pkl`
- `GLU/posteriors/posterior_iid_GLU_100000_20.pkl`
- `GLU/tests/test_theta.pt`
- `GLU/tests/test_x.pt`
- `GLU/tests/accus_*.pt`, `GLU/tests/covs_*.pt`, `GLU/tests/samples_*.pt`
- corresponding files under `SLCP/posteriors/` and `SLCP/tests/`

To regenerate the trained posteriors from the repository root:

```bash
python inference.py -m GLU -n 100000 -e 20
python inference_sbi_iid.py -m GLU -n 100000 -e 20
python inference.py -m SLCP -n 100000 -e 20
python inference_sbi_iid.py -m SLCP -n 100000 -e 20
```

The training scripts save new posterior files under the model folder with their built-in filenames. Move or rename them into `posteriors/` if you want the notebooks to use regenerated posteriors without editing notebook paths.
To regenerate synthetic evaluation tensors, use `test_posterior.py`. Pass `-cp` for the Collective Posterior condition and omit it for NPE+PIE. For example:

```bash
python test_posterior.py -m SLCP -p SLCP/posteriors/posterior_SLCP_100000_20.pkl -s 1000 -t SLCP/tests/test_theta.pt -x SLCP/tests/test_x.pt -cp -e _adaptive
python test_posterior.py -m SLCP -p SLCP/posteriors/posterior_iid_SLCP_100000_20.pkl -s 1000 -t SLCP/tests/test_theta.pt -x SLCP/tests/test_x.pt
```

The script writes new `accus_*`, `covs_*`, and `samples_*` files under the model folder. Move them into the corresponding `tests/` folder or update the notebook paths if you want the notebooks to use regenerated files.
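
The move step can be scripted. The following is a sketch under the assumption that the regenerated files sit directly in the model folder and match the three prefixes above; `move_outputs` is a hypothetical helper, not part of the repository. By default it skips any file that already exists in `tests/`, so the cached results are not replaced by accident.

```python
import shutil
from pathlib import Path

# Prefixes of the files written by test_posterior.py, as described above.
PATTERNS = ("accus_*", "covs_*", "samples_*")


def move_outputs(model_dir, tests_subdir="tests", patterns=PATTERNS, overwrite=False):
    """Move matching files from model_dir into model_dir/tests_subdir.

    Returns the names of the files that were moved. Files already present in
    the target folder are left untouched unless overwrite is True.
    """
    model_dir = Path(model_dir)
    target = model_dir / tests_subdir
    target.mkdir(exist_ok=True)
    moved = []
    for pattern in patterns:
        for path in sorted(model_dir.glob(pattern)):
            destination = target / path.name
            if not path.is_file():
                continue
            if destination.exists() and not overwrite:
                continue  # keep the cached file
            shutil.move(str(path), str(destination))
            moved.append(path.name)
    return moved


if __name__ == "__main__":
    print(move_outputs("SLCP"))
```

Pass `overwrite=True` only after backing up the cached files you intend to replace.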
## EVO_SIM

Fast path with cached results:

```bash
cd EVO_SIM
jupyter lab "test collective.ipynb"
jupyter lab epsilon_vs_N.ipynb
```

Main cached files:

- `posterior_EVO_SIM_30000_20.pkl`
- `posterior_iid.pkl`
- `theta_test.pt`
- `x_test.pt`, `x_test_h.pt`, `x_test_n.pt`, `x_test_r.pt`
- `accus_EVO_SIM_*.pt`, `covs_EVO_SIM_*.pt`, `samples_EVO_SIM_*.pt`
- `accus_npse*.pt`, `covs_npse*.pt`, `samples_npse*.pt`
- `epsilon_vs_N_results_full.pt`
- `epsilon_testset_estimates_full.pt`

Notebook roles:

- `simulate.ipynb`: generates the EVO_SIM synthetic test tensors.
- `test collective.ipynb`: reproduces the main synthetic EVO_SIM comparison between Collective Posterior, NPE+PIE, and NPSE.
- `epsilon_vs_N.ipynb`: reproduces the epsilon-by-number-of-replicates sweep. The sweep caches each `(epsilon, N)` result to `epsilon_vs_N_results_full.pt`, so interrupted runs can be resumed.
- `epsilon_fitting.ipynb`: explores epsilon estimation.

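
The resumable caching described for the epsilon sweep follows a common pattern: load whatever results already exist, skip those `(epsilon, N)` pairs, and write the cache after every new result. The sketch below shows that pattern under stated assumptions. It uses `pickle` in place of the `.pt` file the notebook writes, and `run_sweep` and `evaluate` are hypothetical names rather than the notebook's own code.

```python
import pickle
from pathlib import Path


def run_sweep(epsilons, replicate_counts, evaluate, cache_path):
    """Evaluate every (epsilon, N) pair once, persisting results as it goes.

    evaluate(epsilon, n) is a caller-supplied function (hypothetical here).
    Pairs already present in the cache file are skipped, so an interrupted
    run can simply be started again.
    """
    cache_path = Path(cache_path)
    results = {}
    if cache_path.exists():
        with cache_path.open("rb") as f:
            results = pickle.load(f)
    tmp_path = cache_path.with_suffix(cache_path.suffix + ".tmp")
    for eps in epsilons:
        for n in replicate_counts:
            if (eps, n) in results:
                continue  # already computed in an earlier run
            results[(eps, n)] = evaluate(eps, n)
            # Write to a temporary file first so a crash cannot corrupt the cache.
            with tmp_path.open("wb") as f:
                pickle.dump(results, f)
            tmp_path.replace(cache_path)
    return results
```

Writing after each pair costs some I/O, but it bounds the work lost to an interruption at a single evaluation.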
To regenerate the individual NPE and NPE+PIE posteriors from the repository root:

```bash
python inference.py -m EVO_SIM -n 30000 -e 20
python inference_sbi_iid.py -m EVO_SIM -n 30000 -e 20
```

The training scripts save new posterior files under `EVO_SIM/` with their built-in filenames. Rename them if you want to replace the cached posterior files loaded by the notebooks.
To regenerate Collective Posterior and NPE+PIE evaluation samples:

```bash
python test_posterior.py -m EVO_SIM -p EVO_SIM/posterior_EVO_SIM_30000_20.pkl -s 1000 -t EVO_SIM/theta_test.pt -x EVO_SIM/x_test_r.pt -cp -e _adaptive
python test_posterior.py -m EVO_SIM -p EVO_SIM/posterior_iid.pkl -s 1000 -t EVO_SIM/theta_test.pt -x EVO_SIM/x_test_r.pt
```

The NPSE comparison files are cached in `EVO_SIM/`. If you retrain NPSE, use `EVO_SIM/inference_npse.py` as the starting point and save the regenerated `accus_npse`, `covs_npse`, and `samples_npse` files expected by `test collective.ipynb`.
## WF

Fast path with cached results:

```bash
cd WF
jupyter lab "test collective.ipynb"
```

Main cached files:

- `WF/posteriors/posterior_WF_30000_20.pkl`
- `WF/posteriors/posterior_iid_WF_30000_20.pkl`
- `WF/tests/theta_test.pt`
- `WF/tests/x_test.pt`
- `WF/accus_WF_adaptive.pt`, `WF/covs_WF_adaptive.pt`, `WF/samples_WF_adaptive.pt`
- `WF/accus_WF_iid.pt`, `WF/covs_WF_iid.pt`, `WF/samples_WF_iid.pt`

To regenerate the posteriors from the repository root:

```bash
python inference.py -m WF -n 30000 -e 20
python inference_sbi_iid.py -m WF -n 30000 -e 20
```

The training scripts save new posterior files under `WF/` with their built-in filenames. Move or rename them into `WF/posteriors/` if you want to replace the cached posterior files loaded by the notebooks.
To regenerate the synthetic benchmark samples and coverage files:

```bash
python test_posterior.py -m WF -p WF/posteriors/posterior_WF_30000_20.pkl -s 1000 -t WF/tests/theta_test.pt -x WF/tests/x_test.pt -cp -e _adaptive
python test_posterior.py -m WF -p WF/posteriors/posterior_iid_WF_30000_20.pkl -s 1000 -t WF/tests/theta_test.pt -x WF/tests/x_test.pt
```

## WF Empirical Analysis

WF empirical data are in `WF/empirical_data/`.

Recommended order:

```bash
cd WF
jupyter lab npse.ipynb
jupyter lab empirical_wf.ipynb
```

`npse.ipynb` trains or loads NPSE, evaluates it on the empirical WF datasets, runs the NPSE inference-cycle check, and saves posterior samples and summaries:

- `posteriors/posterior_npse.pkl`
- `tests/samples_npse_<dataset>.pt`
- `tests/cycle_samples_npse_<dataset>.pt`
- `tests/cycle_observations_npse_<dataset>.pt`
- `tests/npse_empirical_summary.csv`
- `tests/npse_inference_cycle_summary.csv`

Set FORCE_TRAIN_NPSE = False in npse.ipynb if you want to load the cached posteriors/posterior_npse.pkl instead of retraining.
`empirical_wf.ipynb` compares Collective Posterior, NPE+PIE, and NPSE on the empirical WF datasets. By default it reuses saved posterior samples:

- `REUSE_SAVED_SAMPLES = True`
- `FORCE_RECOMPUTE_SAMPLES = False`

The notebook loads or saves:

- `tests/samples_collective_<dataset>.pt`
- `tests/samples_npe_pie_<dataset>.pt`
- `tests/samples_npse_<dataset>.pt`
- `tests/cycle_samples_collective_<dataset>.pt`
- `tests/cycle_samples_npe_pie_<dataset>.pt`
- `tests/cycle_samples_npse_<dataset>.pt`
- `tests/cycle_ovl_ks.csv`
- `tests/posterior_predictive_interval_coverage_200.csv`
- `tests/posterior_predictive_interval_coverage_mae_200.csv`
- `tests/predictive_checks_summary.png`
- `tests/predictive_checks_summary.tif`
- `tests/predictive_checks_summary.pdf`

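
A pair of flags like `REUSE_SAVED_SAMPLES` and `FORCE_RECOMPUTE_SAMPLES` usually gates a load-or-compute step. The sketch below illustrates one plausible reading, in which forcing a recompute takes precedence over reuse. That precedence is an assumption, as are the helper name `load_or_compute` and the use of `pickle` in place of the notebook's `.pt` files; check the notebook cells for the actual behavior.

```python
import pickle
from pathlib import Path


def load_or_compute(path, compute, reuse_saved=True, force_recompute=False):
    """Return cached samples when allowed, otherwise compute and save them.

    compute is a zero-argument callable supplied by the caller (hypothetical).
    Assumed precedence: force_recompute=True ignores any saved file.
    """
    path = Path(path)
    if reuse_saved and not force_recompute and path.exists():
        with path.open("rb") as f:
            return pickle.load(f)
    samples = compute()
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("wb") as f:
        pickle.dump(samples, f)
    return samples
```

With the defaults, a second call for the same path reads the saved file instead of sampling again, which is what makes the notebook fast to re-run.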
wf_collective_abc.ipynb contains the Rej-ABC empirical WF analysis. epsilon_fitting.ipynb and erratic.ipynb provide WF-specific epsilon and posterior-density diagnostics.