Autonomous topological navigation in a maze simulator using CosPlace + SuperGlue recall-and-verify, with real-time A* re-planning. Toggle autopilot mid-game.
An autonomous player for the ai4ce/vis_nav_game
maze simulator. The agent drives itself to a target image using a topological
map built from its own exploration footage, two pretrained vision models
working as a recall-and-verify pair, and an A* planner that re-runs every
frame. You toggle autopilot with the a key during navigation; everything
else (localization, planning, steering) runs in real time.
This is the working version used in our demo. Drop your exploration data in
data/images_subsample/, press a, and watch the player solve the maze.
- Sped-up demo (~2 min): https://youtu.be/3dnE2oHW1Zc
- Full team demo video: https://youtu.be/shXDClZoEfM
An autonomous mode you can toggle mid-game. Press a during the
navigation phase and the player takes over. Press a again to take it
back. State resets cleanly on each toggle so you can hand control back
and forth.
A real-time debug overlay. Two extra OpenCV windows open in autopilot:
Auto: Current Location (the database frame the player thinks it is at)
and Auto: Next Target (the database frame it is steering toward). You
can watch the planner re-route in real time when localization jumps.
Inspection keys. While the game is running:
| Key | What it does |
|---|---|
| ↑ ↓ ← → | Manual movement (forward, backward, turn left, turn right) |
| Space | CHECKIN (claim you have reached the target) |
| Esc | QUIT the current phase |
| a | Toggle autopilot (navigation phase only) |
| g | Show the current target image |
| q | Show the database frame closest to your current FPV |
| l | Print a loop-closure summary to the terminal |
| p | Recompute and visualize the A* path |
Two pretrained models, used for what each is good at. CosPlace (ResNet-18, 512-D) gives fast top-K recall over the database. SuperPoint + SuperGlue verifies each candidate with keypoint matching and an essential-matrix inlier count. CosPlace alone is fast but aliases on repeated wall textures. SuperGlue alone is too slow to run against the whole database every frame. Together they are both fast and precise.
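The recall-and-verify split can be sketched in a few lines. This is a minimal illustration, not the code in baseline_lv1.py: `localize` and the `count_inliers` callback are hypothetical names, the candidate list stands in for the CosPlace top-K result, and the 30-inlier fallback threshold is taken from the per-frame loop described later in this README (treating "clears" as "at least" is an assumption).

```python
from typing import Callable, Sequence


def localize(
    candidates: Sequence[int],
    count_inliers: Callable[[int], int],
    min_inliers: int = 30,
) -> int:
    """Pick one database index from CosPlace candidates.

    candidates: database indices ordered by CosPlace distance, best first.
    count_inliers: hypothetical callback returning the SuperGlue
        inlier count between the current view and one candidate.
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    # Verify every recalled candidate once; ties keep the better CosPlace rank.
    scored = [(count_inliers(idx), idx) for idx in candidates]
    best_count, best_idx = max(scored, key=lambda s: s[0])
    if best_count >= min_inliers:
        return best_idx
    # No candidate passed verification: fall back to CosPlace top-1.
    return candidates[0]


# Toy inlier counts standing in for SuperGlue results.
fake_counts = {12: 10, 40: 85, 41: 60, 7: 0, 99: 31}
print(localize([12, 40, 41, 7, 99], fake_counts.get))  # 40
print(localize([12, 7], fake_counts.get))              # 12 (fallback)
```

The expensive matcher only ever sees K candidates per frame, which is what keeps the pair fast.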
A topological graph with geometry-verified shortcuts. The map is a
networkx graph over exploration frames. Every consecutive pair gets a
sequence edge so there is always some path. On top of that, every node
gets up to five extra edges to its CosPlace nearest neighbours, but only
if SuperGlue returns at least 50 inliers between the two views. Those
verified shortcuts are what let the planner take diagonal hops across
the maze instead of replaying the recording in reverse.
Idle compression. When two consecutive frames have nearly identical
descriptors (the player was sitting still during exploration), the edge
between them gets weight 0.01 instead of 1.0. A* steps through
those for free, so idle gaps in the recording do not bloat the planned
path.
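To see the effect of the 0.01 weight, here is a small dependency-free sketch. It uses a plain Dijkstra search as a stand-in for the planner (A* with a zero heuristic reduces to Dijkstra); `chain_graph` and `path_cost` are hypothetical helper names, and the chain of six frames is made up for illustration.

```python
import heapq

IDLE_WEIGHT = 0.01  # weight for near-identical consecutive frames
MOVE_WEIGHT = 1.0   # weight for every other sequence edge


def chain_graph(idle_flags):
    """Adjacency dict for a frame chain; idle_flags[i] marks edge (i, i+1) as idle."""
    adj = {i: [] for i in range(len(idle_flags) + 1)}
    for i, idle in enumerate(idle_flags):
        w = IDLE_WEIGHT if idle else MOVE_WEIGHT
        adj[i].append((i + 1, w))
        adj[i + 1].append((i, w))
    return adj


def path_cost(adj, start, goal):
    """Cost of the cheapest path from start to goal (Dijkstra)."""
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == goal:
            return d
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, w in adj[node]:
            nd = d + w
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                heapq.heappush(heap, (nd, nxt))
    return float("inf")


# Frames 1..4 were recorded while standing still.
compressed = chain_graph([False, True, True, True, False])
uncompressed = chain_graph([False] * 5)
print(round(path_cost(compressed, 0, 5), 2))  # 2.03
print(path_cost(uncompressed, 0, 5))          # 5.0
```

With compression the three idle edges add only 0.03 to the cost, so a verified shortcut has to beat the real travel distance rather than the length of the recording.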
Multi-crop descriptors. Each frame's CosPlace embedding is the average of three crops (full, left half, right half), L2-normalized after averaging. This tolerates small left/right viewpoint shifts so the agent does not have to be perfectly centered on a stored frame to localize.
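A minimal sketch of the multi-crop averaging, assuming NumPy and an H x W x C image array. `multi_crop_descriptor` is a hypothetical name, and `toy_embed` is a stand-in for the CosPlace forward pass, not the real model.

```python
import numpy as np


def multi_crop_descriptor(image, embed):
    """Average the embeddings of three crops, then L2-normalize.

    image: H x W x C array.
    embed: hypothetical callable mapping a crop to a 1-D vector
        (the CosPlace model in the real player).
    """
    width = image.shape[1]
    crops = [image, image[:, : width // 2], image[:, width // 2 :]]
    mean = np.mean([embed(crop) for crop in crops], axis=0)
    norm = np.linalg.norm(mean)
    # Normalize after averaging; guard against an all-zero vector.
    return mean / norm if norm > 0 else mean


def toy_embed(crop):
    """Toy embedding: per-channel mean intensity plus a constant term."""
    channels = crop.reshape(-1, crop.shape[-1]).mean(axis=0)
    return np.append(channels, 1.0)


image = np.random.default_rng(0).random((4, 6, 3))
descriptor = multi_crop_descriptor(image, toy_embed)
print(descriptor.shape)  # (4,)
```

Normalizing last matters because the mean of several unit vectors is generally shorter than unit length.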
Graceful degradation. If SuperGlue is not installed, the player detects this at import time, logs a warning, and runs in CosPlace-only mode. Localization gets less precise, geometry-verified edges drop out of the graph, but the rest of the pipeline still works end-to-end.
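One way to implement this kind of import-time fallback is sketched below. It is an assumption-laden illustration, not the player's actual code: `try_import_superglue` and the search roots are hypothetical, and `models.matching.Matching` is the module layout of the SuperGluePretrainedNetwork repository.

```python
import logging
import sys
from pathlib import Path

log = logging.getLogger("autopilot")


def try_import_superglue(search_roots):
    """Return the Matching class if the SuperGlue repo is importable, else None."""
    for root in search_roots:
        repo = Path(root) / "SuperGluePretrainedNetwork"
        if repo.is_dir():
            # Make the cloned repo importable, then stop searching.
            sys.path.insert(0, str(repo))
            break
    try:
        from models.matching import Matching
        return Matching
    except ImportError:
        log.warning("SuperGlue not found; running in CosPlace-only mode")
        return None


# Assumed search roots: the working directory and its parent.
Matching = try_import_superglue([Path.cwd(), Path.cwd().parent])
SUPERGLUE_AVAILABLE = Matching is not None
```

The rest of the pipeline can then branch on the single `SUPERGLUE_AVAILABLE` flag instead of catching import errors at every call site.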
A stuck detector. If the localized index does not change for ten
frames in a row, the autopilot forces a LEFT turn to break symmetry.
This stops the agent from oscillating in front of a dead-end view that
keeps localizing to the same node.
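The stuck rule amounts to a small counter. The sketch below is illustrative only; the class name is hypothetical, and counting the tenth identical frame as the trigger is an assumption about how "ten frames in a row" is measured.

```python
class StuckDetector:
    """Flags when the localized index repeats for `patience` frames in a row."""

    def __init__(self, patience=10):
        self.patience = patience
        self.last_idx = None
        self.streak = 0

    def update(self, idx):
        """Feed one localized index; return True when a forced turn is due."""
        if idx == self.last_idx:
            self.streak += 1
        else:
            self.last_idx = idx
            self.streak = 1
        if self.streak >= self.patience:
            self.streak = 0  # reset so the turn fires once per streak
            return True
        return False


detector = StuckDetector()
flags = [detector.update(5) for _ in range(10)]
print(flags[-1])  # True: ten identical frames, so the autopilot would turn LEFT
```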
The player factors a hard metric navigation problem into three easier sub-problems and uses well-tested CV primitives for each.
When the exploration phase ends, the framework calls pre_navigation().
The player loads CosPlace, scans data/images_subsample/ in natural
sort order, and computes a 512-D descriptor for every image. Those go
into a (N, 512) matrix and a BallTree for fast K-NN lookup. Because
the descriptors are L2-normalized, Euclidean distance in the tree is
monotonic in cosine distance, so top-K BallTree retrieval is equivalent
to top-K cosine NN. SuperPoint and SuperGlue load lazily with the
indoor weights.
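The equivalence between Euclidean and cosine ranking follows from a one-line identity. For unit-length descriptors $a$ and $b$ with angle $\theta$ between them:

$$
\lVert a - b \rVert^2 = \lVert a \rVert^2 + \lVert b \rVert^2 - 2\,a \cdot b = 2 - 2\cos\theta = 2\,(1 - \cos\theta)
$$

So the Euclidean distance equals $\sqrt{2\,d_{\cos}}$ with $d_{\cos} = 1 - \cos\theta$, which is strictly increasing in the cosine distance, and both metrics rank neighbours in the same order.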
Two kinds of edges go in:
- Sequence edges. Every pair (i, i+1) of consecutive frames gets an edge. Idle-streak pairs get weight 0.01; everything else gets weight 1.0. This guarantees a spanning path through the graph regardless of what the verifier accepts later.
- CosPlace KNN edges. For each node, look up its five CosPlace nearest neighbours. If the descriptor distance is under 0.4 and SuperGlue returns at least 50 inliers between the two views, add the edge. Otherwise skip it. A pairwise cache makes sure each (i, j) pair only goes through SuperGlue once.
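The two edge types can be sketched without any graph library. This is a hedged illustration: `build_edges`, `knn`, and `count_inliers` are hypothetical names, the result is a plain dict rather than the networkx graph the player uses, and the weight of 1.0 for verified shortcut edges is an assumption (the text above only gives weights for sequence edges).

```python
def build_edges(n_frames, idle_pairs, knn, count_inliers,
                max_dist=0.4, min_inliers=50):
    """Return {(i, j): weight} with i < j for the topological map.

    idle_pairs: set of i where frames i and i+1 are near-identical.
    knn: dict node -> list of (neighbour, descriptor_distance).
    count_inliers: hypothetical callback (i, j) -> SuperGlue inlier count.
    """
    edges = {}
    # Sequence edges: always present, cheap across idle streaks.
    for i in range(n_frames - 1):
        edges[(i, i + 1)] = 0.01 if i in idle_pairs else 1.0

    cache = {}  # pairwise cache: each (i, j) is verified at most once
    for i, neighbours in knn.items():
        for j, dist in neighbours:
            key = (min(i, j), max(i, j))
            if i == j or key in edges or dist >= max_dist:
                continue
            if key not in cache:
                cache[key] = count_inliers(*key)
            if cache[key] >= min_inliers:
                edges[key] = 1.0  # assumed weight for verified shortcuts
    return edges


calls = []

def fake_inliers(i, j):
    """Toy verifier: only the pair (0, 4) looks like the same place."""
    calls.append((i, j))
    return 80 if (i, j) == (0, 4) else 10

knn = {0: [(4, 0.2), (3, 0.5)], 4: [(0, 0.2), (2, 0.3)], 2: [(4, 0.3)]}
edges = build_edges(5, {1}, knn, fake_inliers)
print(sorted(edges))  # [(0, 1), (0, 4), (1, 2), (2, 3), (3, 4)]
print(calls)          # [(0, 4), (2, 4)]
```

In this toy run the pair (2, 4) is proposed from both sides but verified once, and (0, 3) never reaches the verifier because its descriptor distance is above 0.4.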
The framework hands the player four target images (front, right, back,
left of the goal). The player runs the front view through the same
recall-and-verify pipeline used for self-localization and stores the
matched database index as self.goal. From this point on, navigation is
"reach node self.goal in graph G" rather than free-form metric
navigation. That reduction is the trick that makes a topological player
solve a metric maze.
Every frame, while autopilot is on:
- Localize. CosPlace returns the top 5 candidates from the database; SuperGlue picks the one with the most inliers. If no candidate clears 30 inliers, fall back to CosPlace top-1.
- Check for arrival. If the localized index is within one index of the goal, return CHECKIN.
- Plan. nx.astar_path(G, current_idx, goal, weight="weight"). This re-runs every frame, so any localization jump self-corrects on the next planning step.
- Step. Take path[1] as the next waypoint.
- Steer. Match SuperGlue keypoints between the current FPV and the waypoint image, take the horizontal centroid of the matched keypoints in the FPV, and:
  - centroid left of center by 30 px → LEFT
  - centroid right of center by 30 px → RIGHT
  - otherwise → FORWARD
  - fewer than 8 matches → FORWARD (don't spin blindly).
- Detect stuck. If the localized index does not change for ten frames, force a LEFT to escape symmetry.
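The steering step of the loop above reduces to a few comparisons. The sketch below is illustrative: `steer` is a hypothetical function, it receives only the x-coordinates of the matched keypoints in the current FPV, and reading "by 30 px" as "by more than 30 px" is an assumption.

```python
def steer(match_xs, frame_width, deadband_px=30, min_matches=8):
    """Choose an action from the horizontal centroid of matched keypoints.

    match_xs: x-coordinates (pixels) of matched keypoints in the current FPV.
    """
    if len(match_xs) < min_matches:
        return "FORWARD"  # too few matches to trust a turn
    offset = sum(match_xs) / len(match_xs) - frame_width / 2
    if offset < -deadband_px:
        return "LEFT"
    if offset > deadband_px:
        return "RIGHT"
    return "FORWARD"


print(steer([100] * 10, 320))  # LEFT: centroid is 60 px left of center
print(steer([250] * 10, 320))  # RIGHT: centroid is 90 px right of center
print(steer([170] * 10, 320))  # FORWARD: inside the 30 px deadband
print(steer([10] * 5, 320))    # FORWARD: fewer than 8 matches
```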
The steering rule deliberately ignores essential-matrix yaw and CosPlace
score even though both are available. We tried richer estimators
(align_step_to_next, geometric_servo_step) and the centroid rule
won on the narrow-FOV camera. Simpler controllers degrade more
gracefully when matches get noisy.
A few details that took a while to get right and would silently break the pipeline if missed:
- cv2.recoverPose returns a {0, 255} mask, not {0, 1}. The inlier count uses (mask_pose > 0).sum() instead of mask_pose.sum(), so counts are not 255× inflated and the inlier threshold actually means what you think it means.
- Image filenames are sorted with natsorted rather than Python's default string sort, so frames stay in numeric order (0009.jpg < 0010.jpg < 0100.jpg) even if names are not zero-padded to a fixed width, where a lexicographic sort would scramble the sequence edges.
- CosPlace descriptors are L2-normalized after the multi-crop average, not before, so the BallTree's Euclidean distance stays monotonic in cosine distance.
- Models load lazily (CosPlace and SuperGlue both). A manual-play session does not pay the GPU cost, and the SuperGlue path is conditional on the import succeeding.
- Pygame's QUIT event is checked before any autopilot logic, so closing the window always works even when the player is mid-step.
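Two of these pitfalls are easy to reproduce in isolation. The sketch below assumes NumPy; the mask is a hand-made stand-in with the {0, 255} values described above rather than real cv2.recoverPose output, and `natural_key` is a hypothetical stdlib-only substitute for natsorted, shown on unpadded names where the difference from a plain string sort is visible.

```python
import re

import numpy as np

# Stand-in for a pose mask with three inliers out of five correspondences.
mask_pose = np.array([[255], [0], [255], [255], [0]], dtype=np.uint8)
inflated = int(mask_pose.sum())       # sums the 255s: 765
inliers = int((mask_pose > 0).sum())  # counts nonzero entries: 3
print(inflated, inliers)  # 765 3


def natural_key(name):
    """Split into digit and non-digit runs so numbers compare numerically."""
    return [int(tok) if tok.isdigit() else tok
            for tok in re.split(r"(\d+)", name)]


names = ["10.jpg", "9.jpg", "100.jpg"]
print(sorted(names))                   # ['10.jpg', '100.jpg', '9.jpg']
print(sorted(names, key=natural_key))  # ['9.jpg', '10.jpg', '100.jpg']
```

With the summed mask, a 50-inlier threshold would be cleared by a single inlier, which is why the comparison form matters.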
Two diagrams: what runs once when exploration ends, and what runs every game tick during navigation. Editable source: docs/architecture.drawio (open in draw.io).
Build the map (once, when EXPLORATION ends):
Per-frame autopilot loop (every tick during NAVIGATION):
.
├── baseline_lv1.py the player (single file, ~2,500 lines)
├── player.py upstream keyboard player, unmodified
├── environment.yaml conda environment
├── requirements.txt pip-only fallback
└── docs/
├── architecture.svg rendered architecture
├── architecture.drawio editable source
└── lv1_demo.gif demo loop
conda env create -f environment.yaml
conda activate vis_nav
pip install git+https://github.com/ai4ce/vis_nav_game_public.git
For the optional SuperGlue verifier, clone SuperGluePretrainedNetwork either next to this repo or one directory up. The player searches both locations at import time and falls back to CosPlace-only if neither exists.
python baseline_lv1.py
The first time you run, the framework starts in the exploration phase.
Drive the maze manually with the arrow keys. When you press Esc to
end exploration, the player builds the descriptor database and the
graph (this takes a minute on a GPU; longer on CPU), then the
navigation phase begins. Press a to hand over to the autopilot.
Place your exploration data under data/images_subsample/ (frames
named 0001.jpg, 0002.jpg, …) and your startup.json at the repo
root before running. Both are gitignored.
- ai4ce/vis_nav_game for the simulator and the baseline scaffolding.
- CosPlace (Berton et al.) for the place-recognition descriptors.
- SuperPoint + SuperGlue (Magic Leap) for the keypoint extractor and matcher.
- The AI4CE course staff and our teammates.
Nishant Pushparaju · nishantpushparaju@gmail.com · github.com/Nishant-ZFYII
