A secure, Kaggle-style GNN competition for molecular property prediction
ENIGMA is a competitive benchmarking platform for Graph Neural Networks on molecular graph classification. Participants build GNN models to predict BACE-1 enzyme inhibition — a target relevant to Alzheimer's disease drug discovery — while all submissions are protected by RSA-2048 encryption.
Given a molecular graph $G = (V, E)$:

- Nodes $V$ represent atoms with features $\mathbf{x}_v \in \mathbb{R}^d$ encoding atomic properties
- Edges $E$ represent chemical bonds with features encoding bond types

Learn a graph-level representation and predict a binary label $y \in \{0, 1\}$ indicating whether the molecule is an active inhibitor of BACE-1 (Beta-secretase 1).
ENIGMA is built on top of the OGB MolBACE dataset, but the competition design departs significantly from the vanilla OGB benchmark. The table below summarises the key differences, highlighting where ENIGMA introduces something new.
| Aspect | OGB MolBACE (Standard) | ENIGMA |
|---|---|---|
| Evaluation Metric | ROC-AUC (single metric) | Macro F1 (primary) + Efficiency Score + Cliff Accuracy — a multi-dimensional evaluation that rewards balanced classification, computational frugality, and robustness to activity cliffs |
| Submission Security | Open upload to OGB evaluation server | RSA-2048 encrypted submissions via GitHub Pull Requests — predictions are chunked and encrypted with OAEP/SHA-256 padding; only CI runners with the private key can decrypt them |
| Scoring Infrastructure | OGB central evaluation server | Fully automated GitHub Actions CI pipeline — decryption → validation → scoring → leaderboard update, all within an ephemeral runner. No central server required |
| Data Splits | Scaffold split only | Scaffold split + MMP-OOD (Matched Molecular Pair Out-of-Distribution) — an additional stress-test split where test molecules are drawn from activity cliff pairs and training molecules are scaffold-excluded, evaluating true OOD generalisation |
| Activity Cliff Evaluation | Not evaluated | Pairwise cliff accuracy — the model is scored on classifying both molecules of each activity cliff pair correctly |
| Test Label Access | Publicly evaluable via OGB API | Hidden labels injected at CI time — test labels are stored in a GitHub Secret (`TEST_LABELS_CSV`) or a private repository, never committed to the public repo |
| Submission Attempts | Unlimited | One submission per team — enforced programmatically by `validate_submission.py`. This encourages careful model selection over leaderboard probing |
| Graph Data Format | Implicit via PyG (`edge_index`, `data.x`) | Explicit dense adjacency matrices stored as `.npz` files, alongside the standard PyG format |
| Efficiency Metric | Not measured | Efficiency Score — log-scaled hardware metrics (inference time, parameter count) combined with Macro F1, shown alongside the primary metric on the leaderboard |
| Robustness Testing | Not tested | Adversarial graph perturbations (random edge flips, gradient-based edge removal, feature noise, feature masking) measured via Attack Success Rate |
| Uncertainty Quantification | Not evaluated | MC Dropout, Conformal Prediction, Temperature Scaling — tools provided to measure epistemic uncertainty, calibration error, and prediction-set coverage |
| Baseline Suite | Single GCN baseline | Six baseline architectures: GCN, GIN, GraphSAGE (starter), plus D-MPNN and Spectral GNN with Chebyshev convolutions and Laplacian regularisation (advanced) |
| Leaderboard | Static OGB leaderboard | Interactive HTML/JS leaderboard with Pareto-front visualisation (F1 vs. Efficiency) auto-generated on every merge |
- Macro F1 over ROC-AUC — With ~30% positive class, ROC-AUC can be misleadingly high even when the minority class is poorly predicted. Macro F1 forces balanced performance across both classes.
- Encrypted submissions — In open benchmarks, participants can reverse-engineer test labels by submitting carefully constructed probes. RSA-2048 encryption eliminates this attack vector entirely.
- MMP-OOD evaluation — Standard scaffold splits can still leak structural similarity. Activity-cliff pairs are the hardest cases in drug discovery; evaluating on them reveals whether a model has truly learned molecular SAR (Structure–Activity Relationships).
- Efficiency scoring — Encourages participants to build lightweight, deployable models rather than scaling to impractically large architectures.
- One-shot submission — Mimics real-world drug-discovery decisions where you commit resources to a single model. Prevents overfitting to the test set through repeated submissions.
- How ENIGMA Differs from Standard OGB MolBACE
- Dataset
- Graph Specification
- Evaluation Metric
- Security Architecture
- Getting Started
- Submission Process
- Baseline Architectures
- Advanced Architectures
- Evaluation Dimensions
- Repository Structure
- Rules
- References
We use the OGB MolBACE dataset from the Open Graph Benchmark:
| Property | Value |
|---|---|
| Source | OGB MolBACE |
| Task | Binary classification (BACE-1 inhibitor: yes/no) |
| Split | Scaffold-based (prevents structural leakage) |
| Class balance | ~30% positive (imbalanced) |
| Split | Molecules | Labels | Description |
|---|---|---|---|
| Train | 1,210 | ✅ Provided | Model training |
| Valid | 151 | ✅ Provided | Hyperparameter tuning |
| Test | 152 | 🔒 Hidden | Final evaluation (CI-only) |
Each molecule is a graph with:

- Node features: 9-dimensional vectors $\mathbf{x}_v \in \mathbb{R}^9$ — atomic number, chirality, degree, formal charge, hydrogen count, hybridization, aromaticity, ring membership
- Edge features: 3-dimensional vectors — bond type, stereochemistry, conjugation
The scaffold split groups molecules by Bemis-Murcko scaffolds, ensuring structurally different molecules in train/test. This simulates real-world drug discovery where novel molecular scaffolds must be classified.
With ~30% positive class, a naive all-zeros classifier achieves ~70% accuracy but poor F1. Participants are encouraged to use class weighting, focal loss, or oversampling.
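As a minimal illustration of class weighting (a sketch, not part of the starter code), PyTorch's `BCEWithLogitsLoss` accepts a positive-class weight:

```python
import torch
import torch.nn as nn

# With ~30% positives, up-weight the positive class by n_neg / n_pos ≈ 0.7 / 0.3
pos_weight = torch.tensor([0.70 / 0.30])
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

# Toy example: raw model outputs (logits) and binary labels
logits = torch.randn(8, 1)
labels = torch.randint(0, 2, (8, 1)).float()
loss = criterion(logits, labels)
```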
Every molecular graph is pre-computed and stored as dense NumPy matrices in data/graphs/:
| File | Molecules | Contents |
|---|---|---|
| `data/graphs/train_graphs.npz` | 1,210 | `indices`, per-molecule `adj_i`, `x_i`, `y_i` arrays |
| `data/graphs/valid_graphs.npz` | 151 | `indices`, per-molecule `adj_i`, `x_i`, `y_i` arrays |
| `data/graphs/test_graphs.npz` | 152 | `indices`, per-molecule `adj_i`, `x_i` arrays (labels hidden) |
For each molecule $i$, the archive stores a binary adjacency matrix $\mathbf{A}_i \in \{0, 1\}^{n_i \times n_i}$ and a node feature matrix $\mathbf{X}_i \in \mathbb{R}^{n_i \times 9}$, where $n_i$ is the number of atoms in molecule $i$.
Loading example (Python):

```python
import numpy as np

data = np.load('data/graphs/train_graphs.npz', allow_pickle=False)
indices = data['indices']  # molecule IDs
A = data['adj_2']          # (n, n) binary adjacency matrix
X = data['x_2']            # (n, 9) node feature matrix
y = data['y_2']            # label: 0 or 1
print(f"Molecule 2: {A.shape[0]} atoms, label = {y[0]}")
```

The same data is available via the OGB API (`data.edge_index`, `data.x`). See `data/graphs/README_graphs.md` for the full format specification.
Primary metric: Macro F1 Score — equally weights performance on both classes:

$$\text{Macro F1} = \frac{1}{2}\left(\text{F1}_0 + \text{F1}_1\right)$$

where $\text{F1}_c = \frac{2 \cdot P_c \cdot R_c}{P_c + R_c}$ is the harmonic mean of precision $P_c$ and recall $R_c$ for class $c$.

Why Macro F1? — Treats both classes equally regardless of sample size. Penalises poor performance on the minority class. Standard in molecular property prediction benchmarks.

Secondary metric: Efficiency Score — combines prediction quality with computational cost (see `competition/metrics.py` for the exact formula). Logarithmic scaling ensures diminishing-return hardware gains; squaring F1 heavily rewards prediction quality. The leaderboard shows both metrics.
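For local checks, Macro F1 matches scikit-learn's macro-averaged F1 (a quick sanity check, not the official scorer):

```python
from sklearn.metrics import f1_score

y_true = [0, 0, 1, 1, 0, 1]
y_pred = [0, 1, 1, 1, 0, 0]
print(f1_score(y_true, y_pred, average='macro'))  # mean of per-class F1 scores
```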
ENIGMA uses a layered security model to ensure fair, tamper-proof evaluation.
```text
┌──────────────┐  public_key.pem    ┌──────────────┐  GitHub Secret     ┌──────────────┐
│              │ ─────────────────▶ │              │ ─────────────────▶ │              │
│ Participant  │  encrypt.py        │  GitHub PR   │  RSA_PRIVATE_KEY   │  CI Runner   │
│ predictions  │  (RSA-2048, OAEP,  │ (.enc file)  │  decrypt.py        │ (plaintext)  │
│ (.csv)       │  SHA-256, chunked) │              │                    │              │
└──────────────┘                    └──────────────┘                    └──────────────┘
                                                                               │
                                                                         score + rank
                                                                               │
                                                                               ▼
                                                                      Public Leaderboard
```
How it works:

- Encrypt — You run `encryption/encrypt.py` with the public key. Your CSV is split into 190-byte chunks, each encrypted with OAEP/SHA-256 padding. The `.enc` file is unreadable without the private key.
- Submit — You open a Pull Request containing only the `.enc` file. Other participants cannot see your predictions.
- Decrypt — GitHub Actions injects the private key from a repository secret, decrypts, scores, and deletes the key — all within an ephemeral CI runner.
- Publish — Only the final score and rank appear on the public leaderboard.
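For intuition, the 190-byte chunk size follows from the RSA-OAEP plaintext limit (256 − 2·32 − 2 = 190 bytes for RSA-2048 with SHA-256). A minimal sketch of this scheme using the `cryptography` package — illustrative only; the canonical tool is `encryption/encrypt.py`:

```python
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding

CHUNK = 190  # max plaintext per RSA-2048/OAEP-SHA256 block: 256 - 2*32 - 2

with open('encryption/public_key.pem', 'rb') as f:
    pub = serialization.load_pem_public_key(f.read())

oaep = padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                    algorithm=hashes.SHA256(), label=None)

plaintext = open('predictions.csv', 'rb').read()
ciphertext = b''.join(pub.encrypt(plaintext[i:i + CHUNK], oaep)
                      for i in range(0, len(plaintext), CHUNK))

with open('predictions.enc', 'wb') as f:
    f.write(ciphertext)
```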
Test labels are never committed to this repository. During CI they are injected via:

- GitHub Secret `TEST_LABELS_CSV` (base64-encoded CSV) — preferred
- Private repository `enigma-private` (cloned with `PRIVATE_REPO_TOKEN`) — fallback
```bash
git clone https://github.com/muuki2/enigma.git
cd enigma

python -m venv venv
source venv/bin/activate        # On Windows: venv\Scripts\activate
pip install -r starter_code/requirements.txt

cd starter_code
python baseline.py --all        # run GCN, GIN, GraphSAGE
python baseline.py --model gcn  # or a specific model
```

This downloads OGB MolBACE, trains for 50 epochs, generates `submissions/{model}_submission.csv`, and reports validation F1.
| Model | Validation Macro F1 |
|---|---|
| GCN | 0.6153 |
| GIN | 0.6103 |
| GraphSAGE | 0.5835 |
```python
from ogb.graphproppred import PygGraphPropPredDataset

dataset = PygGraphPropPredDataset(name='ogbg-molbace')
graph = dataset[0]
print(f"Nodes: {graph.num_nodes}, Edges: {graph.num_edges}")
print(f"Node features: {graph.x.shape}, Label: {graph.y.item()}")
```

Submissions are plain CSV files with two columns:

```csv
id,y_pred
0,1
1,0
6,1
...
```

- `id`: molecule index from `data/public/test.csv`
- `y_pred`: binary prediction (0 or 1). The legacy column name `target` is still accepted.
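For example, predictions could be written in this format with pandas (a sketch; it assumes `data/public/test.csv` exposes an `id` column):

```python
import pandas as pd

test_ids = pd.read_csv('data/public/test.csv')['id']  # hypothetical column name
preds = [0] * len(test_ids)                           # replace with model outputs
pd.DataFrame({'id': test_ids, 'y_pred': preds}).to_csv('predictions.csv', index=False)
```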
Encrypt your predictions and open a PR:

```bash
python encryption/encrypt.py \
  submissions/inbox/my_team/run_01/predictions.csv \
  encryption/public_key.pem \
  submissions/inbox/my_team/run_01/predictions.enc

git add submissions/inbox/my_team/run_01/predictions.enc
git commit -m "Submission: My Team Name"
git push origin my-branch && gh pr create  # or open PR on GitHub
```

What happens after you submit:

- Decrypt — CI decrypts your `.enc` using the private key from GitHub Secrets
- Validate — `competition/validate_submission.py` checks format
- Score — `competition/evaluate.py` computes Macro F1 against hidden labels
- Comment — A bot comments on your PR with the score
- Leaderboard — Leaderboard and interactive board are updated automatically
Include `metadata.json` alongside your submission to appear with efficiency metrics:

```json
{
  "team_name": "alice",
  "model_name": "MyGNN",
  "submission_type": "human",
  "efficiency_metrics": {"inference_time_ms": 5.2, "total_params": 45000}
}
```

Use `evaluation/speed_benchmark.py` to measure these values. See `schema/submission_metadata.json` for the full schema.
```text
submissions/inbox/<team>/<run_id>/
├── predictions.enc   # Required (RSA-encrypted)
└── metadata.json     # Optional (efficiency + model info)
```
| Rank | Team | Macro-F1 | Efficiency | Params |
|---|---|---|---|---|
| 🥇 1 | Baseline-Spectral | 0.7215 | 0.6360 | 40.4K |
| 🥈 2 | Baseline-DMPNN | 0.6674 | 0.0833 | 53.6K |
| 🥉 3 | Baseline-GCN | 0.6153 | - | - |
The competition provides three baseline GNN architectures. Below are their message-passing formulations.
GCN (Kipf & Welling, 2017) performs spectral graph convolutions using a first-order approximation:

$$\mathbf{H}^{(l+1)} = \sigma\left(\tilde{\mathbf{D}}^{-1/2} \tilde{\mathbf{A}} \tilde{\mathbf{D}}^{-1/2} \mathbf{H}^{(l)} \mathbf{W}^{(l)}\right)$$

where $\tilde{\mathbf{A}} = \mathbf{A} + \mathbf{I}$ is the adjacency matrix with self-loops and $\tilde{\mathbf{D}}$ is its diagonal degree matrix.
GraphSAGE (Hamilton et al., 2017) learns to aggregate neighborhood features:

$$\mathbf{h}_v^{(l+1)} = \sigma\left(\mathbf{W}^{(l)} \left[\mathbf{h}_v^{(l)} \,\Vert\, \mathrm{AGG}\left(\{\mathbf{h}_u^{(l)} : u \in \mathcal{N}(v)\}\right)\right]\right)$$

where AGG can be mean, max-pool, or LSTM aggregation. Our baseline uses mean aggregation.
GIN (Xu et al., 2019) achieves maximal expressive power among message-passing GNNs:

$$\mathbf{h}_v^{(l+1)} = \mathrm{MLP}^{(l)}\left(\left(1 + \epsilon^{(l)}\right) \mathbf{h}_v^{(l)} + \sum_{u \in \mathcal{N}(v)} \mathbf{h}_u^{(l)}\right)$$

where $\epsilon^{(l)}$ is a learnable (or fixed) scalar.
All models use global mean pooling for graph-level prediction:

$$\mathbf{h}_G = \frac{1}{|V|} \sum_{v \in V} \mathbf{h}_v^{(L)}$$

followed by a linear classifier:

$$\hat{y} = \mathrm{softmax}\left(\mathbf{W} \mathbf{h}_G + \mathbf{b}\right)$$
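Putting these pieces together, a minimal PyTorch Geometric classifier in this style might look as follows (an illustrative sketch, not the actual `starter_code/baseline.py`; it assumes float node features, whereas OGB's categorical atom features would normally pass through an `AtomEncoder` first):

```python
import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, global_mean_pool

class MiniGCN(torch.nn.Module):
    """Two GCN layers -> global mean pooling -> linear classifier."""
    def __init__(self, in_dim=9, hidden=64, num_classes=2):
        super().__init__()
        self.conv1 = GCNConv(in_dim, hidden)
        self.conv2 = GCNConv(hidden, hidden)
        self.lin = torch.nn.Linear(hidden, num_classes)

    def forward(self, x, edge_index, batch):
        h = F.relu(self.conv1(x, edge_index))
        h = F.relu(self.conv2(h, edge_index))
        h_G = global_mean_pool(h, batch)  # graph-level representation
        return self.lin(h_G)              # class logits
```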
Beyond the baselines, we provide two advanced architectures with stronger mathematical foundations.
D-MPNN (Yang et al., 2019) is an edge-centric GNN designed for molecular graphs that prevents "message backflow" — a key limitation of standard MPNNs.
Message Passing (hidden states $\mathbf{h}_{vw}^{(t)}$ live on directed edges $v \to w$):

$$\mathbf{m}_{vw}^{(t+1)} = \sum_{u \in \mathcal{N}(v) \setminus \{w\}} \mathbf{h}_{uv}^{(t)}, \qquad \mathbf{h}_{vw}^{(t+1)} = \mathrm{ReLU}\left(\mathbf{h}_{vw}^{(0)} + \mathbf{W}\, \mathbf{m}_{vw}^{(t+1)}\right)$$

Excluding $u = w$ from the sum is what prevents a message from flowing straight back to its source.
Key Features:
- Messages flow along directed edges
- Prevents information from immediately flowing back to source
- Edge features are first-class citizens
- Particularly effective for molecular property prediction
Implementation: advanced_baselines/dmpnn.py
Our Spectral GNN operates in the graph frequency domain using Chebyshev polynomial approximations.
Chebyshev Convolution:

$$\mathbf{H}' = \sum_{k=0}^{K-1} \theta_k\, T_k(\tilde{\mathbf{L}})\, \mathbf{X}$$

where:

- $\tilde{\mathbf{L}} = \frac{2}{\lambda_{max}} \mathbf{L} - \mathbf{I}$ is the scaled Laplacian
- $T_k$ are Chebyshev polynomials: $T_0 = 1$, $T_1 = x$, $T_k = 2xT_{k-1} - T_{k-2}$
- $\theta_k$ are learnable spectral coefficients
Laplacian Regularization Loss:

We minimize the Dirichlet energy to encourage smoothness:

$$\mathcal{L}_{\mathrm{lap}} = \mathrm{tr}\left(\mathbf{H}^\top \mathbf{L}\, \mathbf{H}\right) = \frac{1}{2} \sum_{u,v} \mathbf{A}_{uv} \lVert \mathbf{h}_u - \mathbf{h}_v \rVert^2$$
Laplacian Positional Encodings:

Optional positional features from Laplacian eigenvectors:

$$\mathbf{L} = \mathbf{U} \boldsymbol{\Lambda} \mathbf{U}^\top$$

The first $k$ nontrivial eigenvectors (the columns of $\mathbf{U}$ with the smallest nonzero eigenvalues) are concatenated to the node features as positional encodings.
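A minimal NumPy sketch of computing these encodings on the dense adjacency format (illustrative only):

```python
import numpy as np

def laplacian_pe(A, k=4):
    """First k nontrivial Laplacian eigenvectors as positional encodings.
    A: (n, n) binary adjacency matrix, as stored in data/graphs/*.npz."""
    L = np.diag(A.sum(axis=1)) - A        # combinatorial Laplacian L = D - A
    eigvals, eigvecs = np.linalg.eigh(L)  # eigenvalues in ascending order
    return eigvecs[:, 1:k + 1]            # skip the trivial constant eigenvector
```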
Implementation: advanced_baselines/spectral_gnn.py
We evaluate submissions along multiple dimensions beyond raw accuracy.
Macro F1 Score is the primary ranking metric (see Evaluation Metrics).
Tracked via the efficiency formula above. We record:
- Inference time (ms per batch)
- Parameter count
- Memory usage
- FLOPs estimate
Use the profiler in evaluation/speed_benchmark.py to measure your model.
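As a rough illustration of what such a profiler records (a sketch; `evaluation/speed_benchmark.py` is the authoritative tool, and `quick_profile` is a hypothetical helper):

```python
import time
import torch

def quick_profile(model, batches, n_warmup=3):
    """Hypothetical helper: parameter count and mean inference time per batch."""
    n_params = sum(p.numel() for p in model.parameters())
    model.eval()
    times = []
    with torch.no_grad():
        for i, args in enumerate(batches):
            t0 = time.perf_counter()
            model(*args)
            if i >= n_warmup:  # discard warm-up iterations
                times.append((time.perf_counter() - t0) * 1000.0)
    return n_params, sum(times) / len(times)  # params, ms per batch
```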
Good models should know when they don't know. We provide tools to evaluate:
MC Dropout: Epistemic uncertainty via $T$ stochastic forward passes with dropout enabled:

$$\hat{p}(y \mid \mathbf{x}) \approx \frac{1}{T} \sum_{t=1}^{T} p\left(y \mid \mathbf{x}, \boldsymbol{\theta}_t\right)$$

Conformal Prediction: Distribution-free prediction sets with coverage guarantees:

$$C(\mathbf{x}) = \left\{ y : s(\mathbf{x}, y) \le \hat{q} \right\}, \qquad P\left(y \in C(\mathbf{x})\right) \ge 1 - \alpha$$

where $s$ is a nonconformity score and $\hat{q}$ is the $(1 - \alpha)$ quantile of the calibration scores.

Temperature Scaling: Post-hoc calibration via:

$$\hat{p} = \mathrm{softmax}\left(\mathbf{z} / T\right)$$

with temperature $T > 0$ fitted on the validation set.
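A minimal MC Dropout sketch under these definitions (illustrative; the provided tooling lives in `evaluation/uncertainty.py`, and the `model(*inputs)` signature is hypothetical):

```python
import torch

@torch.no_grad()
def mc_dropout_predict(model, inputs, T=30):
    """Mean and spread of class probabilities over T stochastic forward passes."""
    model.train()  # keep dropout layers active at inference time
    probs = torch.stack([torch.softmax(model(*inputs), dim=-1) for _ in range(T)])
    return probs.mean(dim=0), probs.std(dim=0)  # predictive mean, epistemic spread
```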
Metrics:
- Expected Calibration Error (ECE)
- Brier Score
- Empirical Coverage at 90%
Implementation: evaluation/uncertainty.py
We evaluate model robustness to graph perturbations:
Attack Types:
- Random Edge Perturbation: Add/remove random edges
- Gradient-Based Attack: Remove high-importance edges
- Feature Noise: Gaussian noise on node features
- Feature Masking: Zero out random features
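For instance, random edge perturbation on the dense adjacency format might look like this (a sketch; the evaluated attacks are implemented in `evaluation/adversarial.py`):

```python
import numpy as np

def random_edge_flip(A, n_flips=5, seed=0):
    """Flip n_flips random off-diagonal entries of a symmetric adjacency matrix."""
    rng = np.random.default_rng(seed)
    A = A.copy()
    n = A.shape[0]
    for _ in range(n_flips):
        i, j = rng.integers(n), rng.integers(n)
        if i != j:
            A[i, j] = A[j, i] = 1 - A[i, j]  # add or remove the edge
    return A
```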
Metrics:
- Robust Accuracy under attack
- Attack Success Rate (ASR)
Implementation: evaluation/adversarial.py
We visualize the accuracy-efficiency trade-off as a Pareto front (F1 vs. Efficiency).
A model is Pareto optimal if no other model is:
- Better in accuracy AND equally efficient, OR
- Equally accurate AND more efficient, OR
- Better in both
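A small sketch of flagging Pareto-optimal models under this definition (illustrative; the official plot comes from `visualization/pareto_plot.py`):

```python
import numpy as np

def pareto_mask(f1, efficiency):
    """True where no other model dominates (higher is better on both axes)."""
    pts = np.column_stack([f1, efficiency])
    mask = np.ones(len(pts), dtype=bool)
    for i, p in enumerate(pts):
        others = np.delete(pts, i, axis=0)
        # dominated: some other point is >= on both metrics and > on at least one
        dominated = np.any(np.all(others >= p, axis=1) & np.any(others > p, axis=1))
        mask[i] = not dominated
    return mask
```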
Hypervolume Indicator:

$$\mathrm{HV}(S) = \lambda\left(\bigcup_{\mathbf{s} \in S} [\mathbf{r}, \mathbf{s}]\right)$$

the volume (Lebesgue measure $\lambda$) of the objective-space region dominated by the solution set $S$ and bounded by a reference point $\mathbf{r}$. Higher hypervolume indicates better overall performance.
Visualization: visualization/pareto_plot.py
- GAT (Graph Attention Network) — attention-weighted message passing
- MPNN (Message Passing Neural Network) — edge-conditioned convolutions
- AttentiveFP — designed specifically for molecular property prediction
- D-MPNN — see our implementation in `advanced_baselines/dmpnn.py`
- Spectral GNN — see our implementation in `advanced_baselines/spectral_gnn.py`
- Ensemble methods — combine multiple architectures
- Class weighting — address class imbalance via weighted cross-entropy
- Focal loss — down-weight easy examples, focus on hard ones
- Laplacian regularization — encourage smooth representations (see Spectral GNN)
- Data augmentation — random edge dropping, node feature masking
- Different pooling — sum pooling, attention-based pooling, Set2Set
- Virtual nodes — add a global node connected to all atoms
- Positional encodings — Laplacian eigenvectors, random walk features
- Learning rate scheduling — cosine annealing, warm restarts
- Early stopping — monitor validation F1 to prevent overfitting
- Speed benchmark: `evaluation/speed_benchmark.py` — profile inference time
- Uncertainty: `evaluation/uncertainty.py` — MC Dropout, Conformal Prediction
- Adversarial: `evaluation/adversarial.py` — robustness testing
- Visualization: `visualization/pareto_plot.py` — Pareto front analysis
- PyTorch Geometric Documentation
- OGB Leaderboard for MolBACE
- Graph Neural Networks: A Review
- GraphSAGE Paper
- GIN Paper
```text
enigma/
├── competition/                  # Competition infrastructure
│   ├── config.yaml               # Single source of truth for all settings
│   ├── evaluate.py               # Scoring entry-point (CI)
│   ├── metrics.py                # Metric computation (Macro-F1, Efficiency)
│   ├── validate_submission.py    # Submission format validation
│   └── render_leaderboard.py     # Leaderboard renderer (Markdown + JS)
├── encryption/                   # RSA-2048 submission encryption
│   ├── encrypt.py                # Encrypt predictions (participant-facing)
│   ├── decrypt.py                # Decrypt submissions (CI-only)
│   └── public_key.pem            # RSA public key (safe to publish)
├── data/
│   ├── public/                   # Public data for participants
│   │   ├── train.csv, valid.csv, test.csv
│   ├── graphs/                   # Pre-computed A and X matrices (.npz)
│   │   ├── train_graphs.npz, valid_graphs.npz, test_graphs.npz
│   │   └── README_graphs.md
│   ├── mmp_split/                # MMP-OOD activity-cliff split
│   └── ogb/                      # OGB dataset (auto-downloaded)
├── submissions/inbox/            # Submit here: inbox/<team>/<run_id>/
├── leaderboard/                  # Authoritative CSV + auto-generated Markdown
├── docs/                         # GitHub Pages interactive leaderboard
├── starter_code/                 # Baseline GNNs (GCN, GIN, GraphSAGE)
├── advanced_baselines/           # D-MPNN + Spectral GNN
├── evaluation/                   # Speed, uncertainty, adversarial, MMP-OOD
├── visualization/                # Pareto front analysis
├── scripts/                      # Label generation, local tests, MMP evaluation
├── schema/                       # Submission metadata JSON schema
├── .github/workflows/            # CI: decrypt → validate → score → leaderboard
├── requirements.txt              # CI dependencies
└── README.md                     # This file
```
- No external data: Use only the provided OGB MolBACE dataset
- No pre-trained models: Train from scratch; pre-trained molecular embeddings are not allowed
- One submission per team: Each team may submit only once — make it count!
- One submission per PR: Each pull request should contain exactly one predictions file
- Code sharing encouraged: You may share code and ideas, but submit individually
- Fair play: Do not attempt to access test labels or exploit the evaluation system
- Submission privacy: All submissions must be encrypted using the provided RSA public key. Only final scores and ranks appear on the public leaderboard — private submissions must not be visible
- LLM usage restriction: Large Language Models must not be used to fully design the competition, including dataset creation, task definition, or evaluation logic. This competition's dataset (OGB MolBACE) was created by the academic community, and the evaluation logic was designed by the organizer independently
- Computational affordability: Full model training must not exceed 3 hours on CPU. The provided dataset (1,210 training molecules, ~30 atoms each) and baseline models (~40K–54K parameters) train in minutes on CPU. Participants should keep model complexity within this budget
- Kaggle-style ranking: Tied scores share the same rank on the leaderboard (min method). The next rank after a tie skips accordingly
Q: Can I use libraries other than PyTorch Geometric?
Yes. You can use DGL, Spektral, JAX, or any other framework. Ensure your final predictions follow the CSV format.
Q: How do I test locally before submitting?
Use the validation set to evaluate your model locally. Training labels are available via OGB; only test labels are hidden.
Q: Can I submit multiple times?
No. Each team is limited to one submission — make it count! If you need to correct an error, contact the organisers.
Q: How does the automated scoring work?
When you open a PR, GitHub Actions fetches the hidden test labels from a private repository, runs the scoring script, and comments on your PR with the result.
Q: When does the competition end?
This is an ongoing challenge. Top performers will be contacted for the research opportunity.
- Dataset: Open Graph Benchmark
- Original BACE data: MoleculeNet
If you use this challenge or the methods implemented here, please cite the following:
Open Graph Benchmark (OGB)

```bibtex
@article{hu2020ogb,
title={Open Graph Benchmark: Datasets for Machine Learning on Graphs},
author={Hu, Weihua and Fey, Matthias and Zitnik, Marinka and Dong, Yuxiao and Ren, Hongyu and Liu, Bowen and Catasta, Michele and Leskovec, Jure},
journal={Advances in Neural Information Processing Systems},
volume={33},
pages={22118--22133},
year={2020}
}
```

MoleculeNet

```bibtex
@article{wu2018moleculenet,
title={MoleculeNet: A Benchmark for Molecular Machine Learning},
author={Wu, Zhenqin and Ramsundar, Bharath and Feinberg, Evan N and Gomes, Joseph and Geniesse, Caleb and Pappu, Aneesh S and Leswing, Karl and Pande, Vijay},
journal={Chemical Science},
volume={9},
number={2},
pages={513--530},
year={2018},
publisher={Royal Society of Chemistry}
}
```

GraphSAGE

```bibtex
@inproceedings{hamilton2017inductive,
title={Inductive Representation Learning on Large Graphs},
author={Hamilton, William L and Ying, Rex and Leskovec, Jure},
booktitle={Advances in Neural Information Processing Systems},
volume={30},
year={2017}
}
```

Graph Convolutional Networks (GCN)

```bibtex
@inproceedings{kipf2017semi,
title={Semi-Supervised Classification with Graph Convolutional Networks},
author={Kipf, Thomas N and Welling, Max},
booktitle={International Conference on Learning Representations},
year={2017}
}
```

Graph Isomorphism Network (GIN)

```bibtex
@inproceedings{xu2019powerful,
title={How Powerful are Graph Neural Networks?},
author={Xu, Keyulu and Hu, Weihua and Leskovec, Jure and Jegelka, Stefanie},
booktitle={International Conference on Learning Representations},
year={2019}
}
```

Directed Message Passing Neural Network (D-MPNN)

```bibtex
@article{yang2019analyzing,
title={Analyzing Learned Molecular Representations for Property Prediction},
author={Yang, Kevin and Swanson, Kyle and Jin, Wengong and Coley, Connor and
Eiden, Philipp and Gao, Hua and Guzman-Perez, Angel and Hopper, Timothy and
Kelley, Brian and Mathea, Miriam and others},
journal={Journal of Chemical Information and Modeling},
volume={59},
number={8},
pages={3370--3388},
year={2019},
publisher={ACS Publications}
}
```

Spectral Graph Theory

```bibtex
@book{chung1997spectral,
title={Spectral Graph Theory},
author={Chung, Fan RK},
year={1997},
publisher={American Mathematical Society}
}
```

Chebyshev Spectral Convolutions

```bibtex
@inproceedings{defferrard2016convolutional,
title={Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering},
author={Defferrard, Micha{\"e}l and Bresson, Xavier and Vandergheynst, Pierre},
booktitle={Advances in Neural Information Processing Systems},
volume={29},
year={2016}
}
```

Conformal Prediction

```bibtex
@article{romano2020classification,
title={Classification with Valid and Adaptive Coverage},
author={Romano, Yaniv and Sesia, Matteo and Candes, Emmanuel},
journal={Advances in Neural Information Processing Systems},
volume={33},
pages={3581--3591},
year={2020}
}
```

PyTorch Geometric

```bibtex
@inproceedings{fey2019fast,
title={Fast Graph Representation Learning with PyTorch Geometric},
author={Fey, Matthias and Lenssen, Jan Eric},
booktitle={ICLR Workshop on Representation Learning on Graphs and Manifolds},
year={2019}
}
```

- Jure Leskovec (Stanford University) — Open Graph Benchmark, GraphSAGE
- Weihua Hu (Stanford University) — Open Graph Benchmark
- Zhenqin Wu and Vijay Pande (Stanford University) — MoleculeNet
- William L. Hamilton, Rex Ying, Jure Leskovec — GraphSAGE
- Thomas N. Kipf, Max Welling — Graph Convolutional Networks
- Keyulu Xu, Weihua Hu, Jure Leskovec, Stefanie Jegelka — Graph Isomorphism Network
- Matthias Fey, Jan Eric Lenssen — PyTorch Geometric
- Deep Graph Library (DGL) Team — DGL Framework
- BASIRA Lab — Research collaboration and support
- Prof. Islem Rekik (Imperial College London) — Mentorship and guidance
- Murat Kolic — Sarajevo, Bosnia and Herzegovina
For questions or issues, please open a GitHub Issue.
- Organizer: Murat Kolic (@muuki2)
- Affiliation: BASIRA Lab
- Location: Sarajevo, Bosnia and Herzegovina
ENIGMA — Encrypted Neural Inference on Graphs for Molecular Analysis