🏆 View Live Leaderboard: Open leaderboard
This dataset comprises street network graphs for 120 cities across North America, South America, Europe, Asia, Africa, and Australia and Oceania, including regions such as the Middle East and Central Asia. The graphs are extracted from OpenStreetMap using OSMnx and cover the driveable roads within a 500-meter buffer around each city's central point.
The class distribution is unbalanced, reflecting real-world urban patterns: 37 grid cities (planned orthogonal layouts, e.g. Salt Lake City, USA), 31 organic cities (irregular, historic winding streets, e.g. Boston, USA), and 52 hybrid cities (a mix of both, e.g. Atlanta, USA).
Each city's data is stored as a serialized NetworkX graph in .pkl format within the city_graphs folder, including nodes (intersections with coordinates), edges (roads with lengths and geometries), and graph attributes for layout type (grid/organic/hybrid) and city name.
This dataset is ideal for urban planning analysis, graph theory, or machine learning tasks like layout classification. It was generated via a Python script using OSMnx and NetworkX.
The goal of Task 3 is to train a model to classify each city’s street layout into one of three classes:
- 0 = organic
- 1 = grid
- 2 = hybrid
Participants will train on the train set and submit predictions for the test set as a submission.csv.
- Class distribution in the full dataset:
- organic: 31
- grid: 37
- hybrid: 52
Each city graph is stored as a serialized NetworkX graph (.pkl) and contains:
- nodes: intersections with coordinates (x, y)
- edges: road segments (may include attributes such as length/geometry depending on OSM)
- graph attributes (e.g., city name).
For the test set, the label attribute is removed.
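A minimal sketch of reading one city graph, assuming the files are ordinary `pickle` dumps of a NetworkX graph. The exact attribute keys are not listed above, so the helper simply reports whichever keys are present. `load_city_graph` and `describe` are hypothetical names used for illustration.

```python
import pickle
import networkx as nx

def load_city_graph(path):
    """Unpickle one city graph. Only unpickle files from a trusted source."""
    with open(path, "rb") as f:
        graph = pickle.load(f)
    if not isinstance(graph, nx.Graph):  # also true for DiGraph / MultiDiGraph
        raise TypeError(f"{path} does not contain a NetworkX graph")
    return graph

def describe(graph):
    """Summarize graph size and the attribute keys actually present."""
    node_keys = set()
    for _, attrs in graph.nodes(data=True):
        node_keys.update(attrs)
    return {
        "n_nodes": graph.number_of_nodes(),
        "n_edges": graph.number_of_edges(),
        "graph_attrs": sorted(graph.graph),   # e.g. city name; label only in train
        "node_attrs": sorted(node_keys),      # expected to include x and y
    }
```

Calling `describe` on one train file and one test file is a quick way to confirm which graph attribute holds the label and that it is absent from the test graphs.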
This dataset is useful for urban planning analysis, graph learning, and layout classification tasks.
The dataset is split 70/30, stratified by class:
- gnn_challenge/data/train/: labeled graphs (70%)
- gnn_challenge/data/test/: unlabeled graphs (30%)
Training labels are provided in:
gnn_challenge/data/train_labels.csv, with columns:
- filename
- target
Train your model using:
- graphs in gnn_challenge/data/train/
- labels in gnn_challenge/data/train_labels.csv
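One way to pair each training graph with its label is sketched below. It assumes the `filename` column matches the file names inside `train/` exactly and that `target` holds the integer class id. `load_training_set` is a hypothetical helper, not part of the provided baseline.

```python
import csv
import pickle
from pathlib import Path

def load_training_set(data_dir="gnn_challenge/data"):
    """Return (graphs, labels), paired via the filename column of train_labels.csv."""
    data_dir = Path(data_dir)
    graphs, labels = [], []
    with open(data_dir / "train_labels.csv", newline="") as f:
        for row in csv.DictReader(f):
            # Assumption: filename is relative to the train/ folder.
            with open(data_dir / "train" / row["filename"], "rb") as g:
                graphs.append(pickle.load(g))
            labels.append(int(row["target"]))
    return graphs, labels
```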
Predict labels for every graph in:
gnn_challenge/data/test/
Create a submission.csv in the following format:
filename,prediction
Boston_Massachusetts_USA.pkl,2
Delhi_India.pkl,0
Turin_Italy.pkl,1
...
Encrypt your CSV and submit only the encrypted file in submissions/.
Required naming rule (important):
- Use your team name in the filename: <team_name>.csv.enc
- Examples: abdksm.csv.enc, Muhammad_Isah.csv.enc
- Do not submit generic names such as submission.csv.enc
This naming rule is used to display your team name correctly on the leaderboard.
The provided baseline is a Graph Convolutional Network (GCN) for graph-level classification (one label per city graph).
Each city is a graph G stored as a .pkl NetworkX file.
- Nodes: intersections with coordinate attributes x and y
- Edges: road connections between intersections
For each node, we build a 3D feature vector:
- Centered & scaled x-coordinate
- Centered & scaled y-coordinate
- Normalized node degree
So the node feature matrix is:
X ∈ R^(N×3), where N = number of nodes in the city graph.
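The feature construction can be sketched as follows. The exact scaling used by the baseline is not stated, so this version makes two assumptions: coordinates are centered on the graph centroid and divided by the largest absolute offset, and degrees are divided by the maximum degree. `node_features` is a hypothetical name, and the graph is assumed to have at least one node.

```python
import networkx as nx
import numpy as np

def node_features(graph):
    """Build an (N, 3) matrix: scaled x, scaled y, normalized degree."""
    nodes = list(graph.nodes())
    xy = np.array(
        [[graph.nodes[n]["x"], graph.nodes[n]["y"]] for n in nodes], dtype=float
    )
    xy -= xy.mean(axis=0)          # center on the graph centroid
    scale = np.abs(xy).max()       # assumption: one shared scale keeps the aspect ratio
    if scale > 0:
        xy /= scale
    deg = np.array([graph.degree(n) for n in nodes], dtype=float)
    if deg.max() > 0:
        deg /= deg.max()           # assumption: normalize by the maximum degree
    return np.column_stack([xy, deg])
```

Using a single scale for both axes preserves street angles, which matters when the model has to tell orthogonal grids from irregular layouts.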
The baseline builds a sparse adjacency matrix with self-loops and applies standard GCN normalization:
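Normalized adjacency: D^{-1/2}(A+I)D^{-1/2}, where A is the adjacency matrix, I is the identity (the self-loops), and D is the diagonal degree matrix of A+I.
This keeps feature magnitudes from growing with node degree as layers are stacked, which improves stability compared to using a raw adjacency matrix.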
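A small dense NumPy sketch of this normalization is shown below. The baseline stores the matrix in sparse form; a dense array is used here only to keep the example short. Treating every road as undirected and collapsing parallel edges are assumptions, and `normalized_adjacency` is a hypothetical name.

```python
import numpy as np

def normalized_adjacency(edges, n_nodes):
    """Return D^{-1/2} (A + I) D^{-1/2} for edges given as (i, j) index pairs."""
    a = np.zeros((n_nodes, n_nodes))
    for i, j in edges:
        if i == j:
            continue              # self-loops are added once via the identity below
        a[i, j] = 1.0
        a[j, i] = 1.0             # assumption: undirected, parallel edges collapse
    a_hat = a + np.eye(n_nodes)
    deg = a_hat.sum(axis=1)       # degrees of A + I, always >= 1
    d_inv_sqrt = 1.0 / np.sqrt(deg)
    return a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
```

Because the self-loop guarantees a degree of at least 1, the inverse square root is always defined, even for isolated nodes.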
The baseline uses two GCN-style message passing layers (implemented with sparse matrix multiplication) and then a graph pooling step:
- Layer 1: X -> hidden
- Layer 2: hidden -> hidden
- Pooling: concatenate mean pooling + max pooling to get a graph embedding
- Classifier: linear layer to output 3 logits (organic/grid/hybrid)
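The forward pass described by this list can be illustrated in plain NumPy. This is a sketch of the data flow only, not the baseline's code: the ReLU activation, the hidden size of 16, and the random initialization are assumptions, and dropout is omitted. `gcn_forward` and `init_params` are hypothetical names.

```python
import numpy as np

def init_params(in_dim=3, hidden=16, n_classes=3, seed=0):
    """Small random weights; the hidden size is an assumed value."""
    rng = np.random.default_rng(seed)
    return {
        "w1": rng.normal(0, 0.1, (in_dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0, 0.1, (hidden, hidden)), "b2": np.zeros(hidden),
        "w_out": rng.normal(0, 0.1, (2 * hidden, n_classes)),
        "b_out": np.zeros(n_classes),
    }

def gcn_forward(a_norm, x, params):
    """Two propagation layers, mean + max pooling, then a linear classifier."""
    # Layer 1: X -> hidden (ReLU is an assumption)
    h = np.maximum(a_norm @ x @ params["w1"] + params["b1"], 0.0)
    # Layer 2: hidden -> hidden
    h = np.maximum(a_norm @ h @ params["w2"] + params["b2"], 0.0)
    # Graph embedding of size 2 * hidden
    pooled = np.concatenate([h.mean(axis=0), h.max(axis=0)])
    # Three logits: organic / grid / hybrid
    return pooled @ params["w_out"] + params["b_out"]
```

Mean and max pooling are both independent of node order, so relabeling the intersections of a city does not change the logits.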
Training uses:
- Adam optimizer
- Cross-entropy loss
- Class weights (to counter the class imbalance)
- Dropout and weight decay (regularization)
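How class weights enter the loss can be sketched in NumPy. The baseline's weighting formula is not given, so inverse-frequency weights are assumed here, and the weighted losses are averaged by the total weight. `class_weights` and `weighted_cross_entropy` are hypothetical names.

```python
import numpy as np

def class_weights(labels, n_classes=3):
    """Inverse-frequency weights: n_samples / (n_classes * count_c)."""
    counts = np.bincount(np.asarray(labels), minlength=n_classes)
    return len(labels) / (n_classes * np.maximum(counts, 1))

def weighted_cross_entropy(logits, labels, weights):
    """Class-weighted mean of per-sample cross-entropy losses."""
    logits = np.asarray(logits, dtype=float)
    labels = np.asarray(labels)
    shifted = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    per_sample = -log_probs[np.arange(len(labels)), labels]
    w = weights[labels]
    return float((w * per_sample).sum() / w.sum())
```

With the full-dataset counts (31 organic, 37 grid, 52 hybrid), this formula gives the rarest class, organic, the largest weight (120 / 93) and hybrid the smallest (120 / 156).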
To provide a baseline metric without touching the hidden test labels, the script splits the training set internally:
- 70% train
- 30% validation (stratified)
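A stratified split like this can be sketched with the standard library alone. The seed and the rounding rule are assumptions, and `stratified_split` is a hypothetical helper rather than the function the baseline script uses.

```python
import random
from collections import defaultdict

def stratified_split(labels, val_fraction=0.3, seed=42):
    """Return (train_idx, val_idx), holding out about val_fraction of each class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)                 # assumption: fixed seed
    train_idx, val_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)
        n_val = round(len(members) * val_fraction)
        val_idx.extend(members[:n_val])
        train_idx.extend(members[n_val:])
    return sorted(train_idx), sorted(val_idx)
```

Splitting per class keeps the class proportions in the validation set close to those in the training set, which matters for a small, unbalanced dataset.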
It prints:
- Validation Accuracy
- Validation Macro-F1 (main metric)
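Both metrics can be computed in a few lines of plain Python, as sketched below. Macro-F1 is the unweighted mean of the per-class F1 scores, so each of the three classes counts equally regardless of size. Scoring a class with no true and no predicted samples as 0 is an assumed convention here.

```python
def accuracy(y_true, y_pred):
    """Fraction of graphs whose predicted class matches the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true, y_pred, n_classes=3):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for c in range(n_classes):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # F1 = 2TP / (2TP + FP + FN)
    return sum(scores) / n_classes
```

For example, with true labels `[0, 0, 1, 1, 2, 2]` and predictions `[0, 1, 1, 1, 2, 0]`, the per-class F1 scores are 0.5, 0.8, and 2/3, so macro-F1 is about 0.656 while accuracy is 4/6.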
After training, the baseline predicts on the unlabeled test graphs and writes:
gnn_challenge/data/submission.csv
Format:
filename,prediction
City1.pkl,2
City2.pkl,0
...
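A file in this format can be produced with the standard `csv` module, as in the sketch below. `write_submission` is a hypothetical helper. Sorting rows by filename and rejecting class ids outside 0 to 2 are choices made here, not requirements stated above.

```python
import csv
from pathlib import Path

def write_submission(predictions, out_path="gnn_challenge/data/submission.csv"):
    """Write {filename: class_id} as a two-column CSV with a header row."""
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["filename", "prediction"])
        for filename in sorted(predictions):
            label = int(predictions[filename])
            if label not in (0, 1, 2):
                raise ValueError(f"invalid class {label} for {filename}")
            writer.writerow([filename, label])
    return out_path
```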