Graph-based Liquid-biopsy Inductive Modeling for PreeclampSia
This repository hosts a prediction-only challenge focused on maternal-fetal health modeling using graph learning. Participant code is run outside this repository. Submissions are scored in CI against hidden labels.
- Inductive graph learning across cfRNA and placental transcriptomics to detect maternal-fetal health issues.
- Learn transferable representations that generalize to unseen samples and domains rather than treating each dataset independently.
Alignment with BASIRA Lab's Mission
- Prioritizes robust generalization across heterogeneous datasets.
- Uses compute-efficient, non-data-hungry graph learning methods that can run on standard hardware.
- Draws from studies on inductive learning, message passing, and representation transfer.
- Model design follows DGL Lectures 1.1-4.6, covering:
- Graph construction from tabular data
- Node feature encoding
- Neighborhood aggregation (GraphSAGE-style inductive updates)
- Mini-batch training via neighborhood sampling
- Inductive inference on unseen nodes
- Task: Binary classification (0=Control, 1=Preeclampsia)
- Setting: Inductive transfer from cfRNA (train) to placenta (test)
- Primary metric: F1 Score
- Additional metrics: Accuracy, Precision, Recall
- Public leaderboard: Auto-updated after merged submissions
- Public datasets from Gene Expression Omnibus (GEO, NIH)
- Maternal plasma cfRNA: GSE192902
- Placental RNA-seq: GSE234729
- Training set: cfRNA samples
- Test set: placenta samples (unseen during training)
- Labels: binary disease status
- Identify and validate cfRNA biomarkers for early prediction of preeclampsia, often before clinical symptoms appear.
- Support research in maternal-fetal health and early detection of preeclampsia.
- Integrate gene expression and clinical metadata to capture subtle risk patterns while handling noisy and imbalanced data for robust and equitable predictions.
This competition explicitly provides both required graph components:
- Adjacency matrix A: data/public/adjacency_matrix.csv
- Node feature matrix X: derived from data/public/train.csv and data/public/test.csv
Related graph files:
- data/public/graph_edges.csv
- data/public/node_types.csv
- data/public/graph_artifacts.pt
Interpretation:
- A[i, j] = 1 indicates an edge between nodes i and j, else 0
- X is node-by-feature and includes harmonized expression features and released covariates
- Node alignment is by node_id; use data/public/test_nodes.csv (and node files) as the ordering reference so rows in X correspond to the same nodes indexed in A.
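The alignment rule above can be sketched as follows. This is a minimal illustration on in-memory toy CSV text, not the released files: the feature column names (`f1`, `f2`), the node IDs, and the assumption that the adjacency CSV carries node IDs in its header row and first column are all hypothetical, so check the actual file layout before adapting it.

```python
import csv
import io

# Toy stand-ins for the released files; the real layout may differ.
ADJ_CSV = """node_id,n2,n0,n1
n2,0,1,0
n0,1,0,1
n1,0,1,0
"""
FEAT_CSV = """node_id,f1,f2
n0,0.5,1.0
n1,0.1,0.2
n2,0.9,0.3
"""

def load_adjacency(text):
    """Return (node order, dense 0/1 matrix) from an ID-labelled CSV."""
    rows = list(csv.reader(io.StringIO(text)))
    order = rows[0][1:]                  # header holds the column node IDs
    row_ids = [r[0] for r in rows[1:]]   # first column holds the row node IDs
    if row_ids != order:
        raise ValueError("row and column node order disagree")
    matrix = [[int(v) for v in r[1:]] for r in rows[1:]]
    return order, matrix

def load_features(text):
    """Return {node_id: feature vector} from a node-by-feature CSV."""
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["node_id"]: [float(v) for k, v in row.items() if k != "node_id"]
        for row in reader
    }

def align_features(order, features):
    """Stack feature rows in adjacency order, failing on missing nodes."""
    missing = [n for n in order if n not in features]
    if missing:
        raise KeyError(f"nodes without features: {missing}")
    return [features[n] for n in order]

order, A = load_adjacency(ADJ_CSV)
X = align_features(order, load_features(FEAT_CSV))
# Row k of X now describes the same node as row/column k of A.
```

Failing loudly on a missing or misordered node is a deliberate choice here: a silent misalignment between X and A would still train, but on the wrong neighborhoods.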
The benchmark includes meaningful modeling difficulty:
- 🧪 Noisy and partially missing metadata
- ⚖️ Label imbalance pressure
- 🧬 High-dimensional features relative to sample size (sparsity pressure)
- 🔄 Cross-domain distribution shift (cfRNA -> placenta)
- 🕸️ Inductive generalization to unseen test nodes
- Full training for this competition should not exceed 3 hours on CPU.
- If needed, downsize graph complexity (for example by reducing node count, edge density, or neighborhood sampling size) while preserving task integrity.
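One way to reduce edge density, sketched below, is to cap how many neighbors each node keeps. This is only an illustration of the idea, not the organizers' procedure: the function names, the toy complete graph, and the choice of seeded random sampling are assumptions, and pure Python is used only to keep the sketch dependency-free.

```python
import random

def cap_neighbors(adj, max_neighbors, seed=0):
    """Keep at most `max_neighbors` sampled neighbors per node.

    `adj` maps node -> set of neighbors (undirected graph). An edge survives
    if either endpoint samples it, so a node's final degree can exceed the cap.
    """
    rng = random.Random(seed)
    kept = {node: set() for node in adj}
    for node in sorted(adj):
        neighbors = sorted(adj[node])
        if len(neighbors) > max_neighbors:
            neighbors = rng.sample(neighbors, max_neighbors)
        for other in neighbors:
            kept[node].add(other)
            kept[other].add(node)  # mirror the edge to stay undirected
    return kept

def edge_count(adj):
    """Count undirected edges in a symmetric neighbor map."""
    return sum(len(v) for v in adj.values()) // 2

# Toy example: a complete graph on 6 nodes (15 edges).
nodes = list(range(6))
full = {i: {j for j in nodes if j != i} for i in nodes}
pruned = cap_neighbors(full, max_neighbors=2)
print(edge_count(full), edge_count(pruned))
```

Because pruning only removes edges and never relabels nodes, the task itself (which nodes are predicted, and their labels) is unchanged.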
build_dataset.ipynb and Kaggle
Objective: Ensure structural compatibility for graph construction and inductive learning by handling expression data, parsing and cleaning metadata, and fusing expression data with metadata.
Objective: Implement an advanced inductive GNN for cfRNA -> placenta prediction, ensuring generalizable node representations and inductive learning.
Key Components:
- Graph Construction: Build hetero-graphs using similarity and ancestry edges.
- Node Feature Encoding: Integrate gene expression and metadata into node-level features.
- Neighborhood Aggregation: GraphSAGE-style layers with BatchNorm and ReLU for neighbor information propagation.
- Mini-Batch Training: Use neighborhood sampling for efficient training on large graphs.
- Inductive Inference: Generate predictions for unseen placenta nodes without label leakage.
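The neighborhood aggregation and sampling components above can be illustrated with a single GraphSAGE-style mean-aggregation update. This is a dependency-free sketch, not the code in starter_code/: the function names, toy graph, and weight matrices are hypothetical, BatchNorm is omitted for brevity, and a real model would use learned weights in a tensor library.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mean_vectors(vectors, dim):
    """Element-wise mean; a zero vector if there are no neighbors."""
    if not vectors:
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def sage_layer(features, neighbors, W_self, W_neigh, fanout=None, rng=None):
    """Compute ReLU(W_self h_v + W_neigh mean(h_u)) for every node v.

    If `fanout` is set, at most that many neighbors are sampled per node.
    """
    rng = rng or random.Random(0)
    out = {}
    for node, h in features.items():
        nbrs = sorted(neighbors.get(node, ()))
        if fanout is not None and len(nbrs) > fanout:
            nbrs = rng.sample(nbrs, fanout)
        agg = mean_vectors([features[u] for u in nbrs], len(h))
        z = [a + b for a, b in zip(matvec(W_self, h), matvec(W_neigh, agg))]
        out[node] = [max(0.0, v) for v in z]  # ReLU
    return out

# Toy graph: node "a" is connected to "b" and "c".
features = {"a": [1.0, 0.0], "b": [0.0, 2.0], "c": [2.0, 2.0]}
neighbors = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}}
W_self = [[1.0, 0.0], [0.0, 1.0]]
W_neigh = [[1.0, 0.0], [0.0, -1.0]]

hidden = sage_layer(features, neighbors, W_self, W_neigh)
```

The weights are shared across nodes rather than tied to node identity, which is what lets the same layer be applied to nodes that were not present during training.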
- starter_code/advanced_GNN_model.py
- starter_code/baseline.py
- starter_code/build_adjacency_matrix.py
- starter_code/build_graph_artifacts.py
Submission instructions are in CONTRIBUTING.md.
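Before submitting, it can help to check predictions against a held-out split using the same metric families named in this README (F1 as primary; accuracy, precision, and recall as additional). The sketch below is a plain-Python approximation, not the CI scorer: the function name is hypothetical, and returning 0.0 when a denominator is zero is an assumed convention that the official scoring may not share.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall, and F1 for binary labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    # Assumed convention: undefined ratios are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    accuracy = (tp + tn) / len(pairs) if pairs else 0.0
    return {"f1": f1, "accuracy": accuracy,
            "precision": precision, "recall": recall}

# Toy labels: 0=Control, 1=Preeclampsia.
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

With imbalanced labels, accuracy alone can look strong for a model that rarely predicts the positive class, which is one reason to watch F1 alongside it.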
Key policy:
- Only one submission attempt per participant (enforced in CI)
- Submission files are public, but participant predictions are encrypted at rest (predictions.csv.enc); only CI with organizer secrets decrypts them for scoring.
- Public page: https://mubarraqqq.github.io/gnn-challenge/leaderboard.html
- Source CSV: leaderboard/leaderboard.csv
- Rendered markdown: leaderboard.md
- Tie handling: equal scores share rank
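The tie-handling rule can be illustrated with a short ranking sketch. It assumes standard competition ranking (tied entries share a rank and the next rank is skipped, giving 1, 1, 3); the actual leaderboard pipeline may instead use dense ranking or round scores before comparing. The function name and team names are hypothetical.

```python
def rank_with_ties(scores):
    """Map participant -> rank, highest score first; equal scores share a rank."""
    # Sort by descending score, then by name for a stable display order.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    ranks = {}
    prev_score, prev_rank = None, 0
    for position, (name, score) in enumerate(ordered, start=1):
        if score != prev_score:
            prev_rank = position   # new score: rank is the 1-based position
            prev_score = score
        ranks[name] = prev_rank    # tied score: reuse the previous rank
    return ranks

scores = {"team_a": 0.81, "team_b": 0.77, "team_c": 0.81, "team_d": 0.60}
ranks = rank_with_ties(scores)
print(ranks)
```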
Use this command to regenerate all leaderboard outputs from the canonical pipeline:
python update_leaderboard.py && python competition/render_leaderboard.py

@dataset{gnn_challenge_2026,
  title={GNN Challenge: cfRNA -> Placenta Inductive GNN for Maternal-Fetal Health Prediction},
  author={Mubaraq Onipede},
  year={2026},
  url={https://github.com/Mubarraqqq/gnn-challenge}
}

See LICENSE.

