
feat: paper atlas — WebGL scatter of every embedded paper in 2D#12

Open
charlielidbury wants to merge 20 commits into similarity-graph from paper-atlas

Conversation

@charlielidbury
Collaborator

A /atlas page that renders all ~524k embedded papers as a pan/zoomable WebGL scatter using a PaCMAP projection, with source-coloured points, an interactive legend filter, a hover sidebar, and multi-select search highlighting.

Surface

  • paper_projection_2d table — (paper_id, projection, x, y) keyed; lets multiple projections coexist (pacmap_pl_v1, pacmap_v1).
  • GET /api/atlas?projection=<name>[&viewport=...][&limit=N] — serves up to 1M points per projection.
  • GET /api/papers/<path:paper_id> — full paper metadata for hover sidebar.
  • /atlas page in the FE — regl-scatterplot, in-canvas selection highlight, source-toggleable legend with ALL/NONE shortcuts, semantic search bar with multi-select.
  • scripts/load_pacmap_coords.py — loads a CSV of 2D coords into the projection table.

Notes

  • Stacks on top of feat: similarity graph UI with variable-thickness edges #11. The 15 commits are the atlas-specific work; the similarity-graph commits are no longer duplicated after the rebase.
  • arxiv hidden by default in the legend so conference clusters are visible without unticking.
  • 524k points: ~9s initial load (109MB JSON), ~55 FPS pan/zoom (swiftshader), 346ms filter toggle. Streaming/NDJSON migration is planned but not in this PR — see docs/streaming-points.md once it lands.

🤖 Generated with Claude Code

charlielidbury and others added 15 commits May 12, 2026 18:21
Stores PaCMAP/UMAP/t-SNE 2D coordinates per (paper_id, projection) so
multiple projections can coexist; (projection, x, y) index supports
viewport range queries from the upcoming /atlas page.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Streams a (paper_id, x, y) CSV into paper_projection_2d in 5k-row
batches, upserting on (paper_id, projection) so re-runs are
idempotent. Rows without a matching paper are filtered client-side
to dodge the FK violation.
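
The batching and filtering step could look roughly like the sketch below. The loader itself is a Python script, so this is only an illustration of the logic in TypeScript; `CoordRow`, `batchKnownRows`, and the `knownPaperIds` set are hypothetical names, and the actual upsert statement is not shown.

```typescript
// Hypothetical row shape for the (paper_id, x, y) CSV described above.
type CoordRow = { paperId: string; x: number; y: number };

// Yield rows in fixed-size batches, skipping ids with no matching paper
// so the later insert never hits the foreign-key constraint.
function* batchKnownRows(
  rows: Iterable<CoordRow>,
  knownPaperIds: ReadonlySet<string>,
  batchSize: number = 5000,
): Generator<CoordRow[], void, undefined> {
  let batch: CoordRow[] = [];
  for (const row of rows) {
    if (!knownPaperIds.has(row.paperId)) continue; // no matching paper: drop
    batch.push(row);
    if (batch.length === batchSize) {
      yield batch;
      batch = [];
    }
  }
  if (batch.length > 0) yield batch; // flush the final partial batch
}
```

Each yielded batch would then be written with one upsert keyed on (paper_id, projection), which is what makes a re-run safe.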

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Returns 2D-projection points joined with paper.title/source for the
upcoming /atlas scatter plot. Optional viewport (xmin,ymin,xmax,ymax)
clips to a rectangle; default limit 50k, max 200k. Reuses the
neighbors connection so per-request overhead stays low.
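
As a rough illustration of the query-parameter handling, the sketch below parses the viewport rectangle and clamps the limit. The endpoint itself is Flask, so this TypeScript is only a model of the rules stated above; the function names and the behaviour on malformed input (throwing, or falling back to the default limit) are assumptions.

```typescript
type Viewport = { xmin: number; ymin: number; xmax: number; ymax: number };

// Parse "xmin,ymin,xmax,ymax"; a missing parameter means "no clipping".
function parseViewport(raw: string | null): Viewport | null {
  if (raw === null || raw.trim() === "") return null;
  const parts = raw.split(",").map((p) => (p.trim() === "" ? NaN : Number(p)));
  if (parts.length !== 4 || parts.some((n) => !Number.isFinite(n))) {
    throw new Error("viewport must be xmin,ymin,xmax,ymax");
  }
  const [xmin, ymin, xmax, ymax] = parts as [number, number, number, number];
  if (xmin > xmax || ymin > ymax) {
    throw new Error("viewport min must not exceed max");
  }
  return { xmin, ymin, xmax, ymax };
}

// Default 50k, ceiling 200k, as described in this commit.
function clampLimit(
  raw: string | null,
  defaultLimit: number = 50000,
  maxLimit: number = 200000,
): number {
  if (raw === null) return defaultLimit;
  const n = Number.parseInt(raw, 10);
  if (!Number.isFinite(n) || n < 1) return defaultLimit; // assumed fallback
  return Math.min(n, maxLimit);
}
```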

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renders the paper-atlas as a regl-scatterplot canvas: pan/zoom,
hover tooltips with title/source, source-coloured points, click to
open /graph?papers=<id>. Lazy-imports the WebGL renderer so SSR
doesn't choke. Defaults to projection=pacmap_pl_v1 against the
18k PL load; flip the constant once the full corpus lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Map view" link in the header next to the Oversight title so
users can jump from the chat-style search into the 2D paper-cloud
scatter without typing the URL manually.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled bugs in the v1 atlas init: (1) regl-scatterplot's
colorBy defaults to null, so the per-point category in points[i][2]
was ignored and every point rendered with pointColor[0] (all blue);
(2) the canvas was Tailwind-sized via CSS only, leaving the
attribute width/height at the 300x150 default, which made regl
allocate a buffer that didn't match the viewport and could even
fail GL context init on stricter drivers.

Set colorBy='valueA' so the category column drives colour, seed
canvas.width/height (× devicePixelRatio) before createScatterplot,
and use width/height='auto' so the lib reads the canvas's intrinsic
size. ResizeObserver now also resizes the backing buffer.
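
A minimal sketch of the backing-store sizing, assuming a helper of our own naming (`syncBackingStore`) and a structural `Sizable` type so it can run outside a browser. In the page this would receive the real canvas element, its CSS box, and `window.devicePixelRatio`.

```typescript
// Anything with width/height attributes; an HTMLCanvasElement fits this shape.
type Sizable = { width: number; height: number };

// Set the attribute size (the GL backing buffer) from the CSS size and the
// device pixel ratio, so the buffer matches what is actually displayed.
function syncBackingStore(
  canvas: Sizable,
  cssWidth: number,
  cssHeight: number,
  dpr: number,
): void {
  canvas.width = Math.max(1, Math.round(cssWidth * dpr));
  canvas.height = Math.max(1, Math.round(cssHeight * dpr));
}
```

Calling this once before creating the scatterplot and again from the ResizeObserver callback keeps the attribute size from staying at the 300x150 default.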

Verified end-to-end via headless Chrome + CDP: 18,063 points draw
with mixed colours, hover at the cloud centre shows a real paper
("Biparsers: Exact Printing for Data Synchronisation · POPL"), and
clicking that point navigates to /graph?papers=10.1145%2F3704910.
0 exceptions, 0 console errors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each legend entry is now a button. Clicking it toggles whether
that source's points appear on the map; hidden entries dim to 40%
opacity with a strike-through and a hollow color swatch. Two small
ALL / NONE affordances show/hide everything at once.

Filtering uses regl-scatterplot's filter([indices]) API rather than
re-feeding the buffer, so we keep the original point indices intact
(click → /graph navigation still finds the right paper_id) and we
don't tear down the GL context on every toggle (camera pan/zoom
state is preserved).

Hover tooltips for a now-hidden source are suppressed so we don't
show metadata for an invisible dot.

Verified end-to-end via CDP: toggling POPL off drops 3107 non-black
canvas pixels and suppresses the tooltip at a known POPL point;
toggling back restores both. NONE → canvas goes blank (0 non-black
pixels); ALL → cloud returns.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 360px right sidebar to /atlas mirroring graph.tsx's HoverPreview:
source · date · title (linked) · authors · paper_id · "View paper" CTA ·
abstract. Hover a point → sidebar populates after a lazy
GET /api/papers/<id> fetch (results cached so wiggling between known
points is instant).
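
The caching behaviour can be sketched as a small wrapper around whatever function performs the fetch. `makeCachedLoader` is a hypothetical name, and caching the promise (rather than the resolved value) is an assumption about how repeated hovers are de-duplicated.

```typescript
// Wrap an async loader so each id is fetched at most once; concurrent
// callers for the same id share the same in-flight promise.
function makeCachedLoader<T>(
  load: (id: string) => Promise<T>,
): (id: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();
  return (id: string): Promise<T> => {
    let hit = cache.get(id);
    if (hit === undefined) {
      hit = load(id).catch((err: unknown) => {
        cache.delete(id); // do not cache failures; allow a later retry
        throw err;
      });
      cache.set(id, hit);
    }
    return hit;
  };
}
```

In the page, `load` would be the function that issues the GET /api/papers/<id> request and parses the response.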

Click a point now PINS that paper to the sidebar instead of navigating
to /graph. Once pinned, the panel sticks while the user keeps panning
the map; subsequent hovers don't displace it. The clear (×) button
unpins and returns to the empty state. A small "pinned" badge in the
sidebar header signals the latched state.

Adds GET /api/papers/<path:paper_id> on the backend — the existing
/neighbors endpoint couldn't be reused because Flask's default <str>
converter rejects DOI-style ids with slashes. Verified the new route
doesn't shadow /neighbors via curl on both an arxiv id and a DOI.

Verified end-to-end via CDP: hovering a known POPL point at the cloud
centre populates the sidebar with "Biparsers: Exact Printing for Data
Synchronisation · POPL · Jan 2025 · authors · full abstract"; clicking
flips the header to show "pinned"; hovering elsewhere leaves the
sidebar unchanged (pinned wins); X clears back to the empty state.
URL stays /atlas throughout (no more accidental graph navigation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a debounced search field in the top-left of the canvas. Typing a
query (e.g. "garbage collection") hits the existing /api/search
(semantic, embedding-backed) and shows up to 10 results, filtered to
papers that actually exist in the current atlas projection — anything
else is unhighlightable on this map and would mislead.

Each result has a checkbox. Checked papers go into a Set<string> and
two things happen:
1. scatter.select(indices, { preventEvent: true }) is called so
   regl-scatterplot bumps their pointSizeSelected. preventEvent stops
   it from firing the 'select' event back at our pin-the-paper
   subscriber.
2. Yellow ring overlays (DOM, pointer-events-none, z-20) are drawn on
   top of the canvas at each selected point's screen position. We
   couldn't rely on pointColorActive alone because regl-scatterplot's
   getColors path ignores it when colorBy is set (esm.js ~6596) —
   selected points would otherwise just be slightly bigger
   same-coloured dots. The overlay subscribes to the scatter's 'view'
   event so the rings track pan / zoom; ResizeObserver also re-ticks
   so they stay aligned across viewport changes.

The clear (×) button clears the input and closes the dropdown but
leaves selections intact, so the user can search again to add more
highlights without losing their existing set; a separate "clear"
link in the summary line drops all selections at once.

Verified end-to-end via CDP: searching "garbage collection" against
the 18k PL atlas returns 6 candidates, 5 of which exist in the
projection; checking 3 produces 3 ring overlays (clustered tightly
because GC papers neighbour each other in PaCMAP) plus a "3 selected"
badge; subsequent zoom-wheel events shift each ring's left/top
correctly (e.g. one ring moves from 781,553 → 894,738 after 5 wheel
ticks); pinning a different paper via canvas click coexists with the
selection (sidebar shows "pinned", ring count stays at 2).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
At 524k points (full corpus), the naive O(N) "scan every point and
test source membership" recompute used by visibleIndices took ~2.5s
on every legend toggle — long enough to feel sluggish. Replace it
with two memoised passes: one builds a Map<source, Uint32Array> of
point indices at load time, and the toggle then concatenates the
*visible* source buckets in O(unique_sources + total_visible). The
typed-array storage also cuts memory: ~2 MB instead of ~10 MB for
the boxed-Number array on the same data.
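
The two passes could be sketched as follows. The function names are illustrative, and the input is assumed to be a plain array holding each point's source in point order.

```typescript
// Pass 1 (at load time): group point indices by source into typed arrays.
function buildSourceBuckets(sources: readonly string[]): Map<string, Uint32Array> {
  const counts = new Map<string, number>();
  for (const s of sources) counts.set(s, (counts.get(s) ?? 0) + 1);

  const buckets = new Map<string, Uint32Array>();
  const cursors = new Map<string, number>();
  for (const [s, n] of counts) {
    buckets.set(s, new Uint32Array(n));
    cursors.set(s, 0);
  }
  sources.forEach((source, index) => {
    const bucket = buckets.get(source);
    const at = cursors.get(source);
    if (bucket === undefined || at === undefined) return; // cannot happen
    bucket[at] = index;
    cursors.set(source, at + 1);
  });
  return buckets;
}

// Pass 2 (on every toggle): concatenate only the visible buckets.
function concatVisible(
  buckets: ReadonlyMap<string, Uint32Array>,
  hidden: ReadonlySet<string>,
): Uint32Array {
  let total = 0;
  for (const [s, b] of buckets) if (!hidden.has(s)) total += b.length;
  const out = new Uint32Array(total);
  let offset = 0;
  for (const [s, b] of buckets) {
    if (hidden.has(s)) continue;
    out.set(b, offset);
    offset += b.length;
  }
  return out;
}
```

Note that the result is grouped by source rather than sorted by index. If the consumer expects a plain number array, `Array.from(result)` converts it.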

Measured via CDP at 524k points: filter toggle latency went from
~2.5s to ~350ms (canvas pixel-count drop confirms the filter
actually applied). The 18k-PL projection sees the speedup too but
the absolute numbers were already imperceptible there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- DEFAULT_PROJECTION flips from pacmap_pl_v1 (18k PL) to pacmap_v1
  (524k full corpus). Users can still load the smaller projection
  by visiting /atlas?projection=pacmap_pl_v1 — the page now reads
  the query string on mount.
- /api/atlas: limit ceiling raised from 200k to 1M (524k corpus
  needs more headroom than the original PL-only cap allowed),
  default limit bumped to 1M so a no-args call returns everything.
- COLOR_PALETTE expanded from 12 to 22 hues so all 20 sources in
  the full corpus get a distinct colour (was greying out half the
  legend before).
- pointScaleMode='asinh' + smaller pointSize + lower opacity (0.55)
  tuned for the denser cloud — without these tweaks the 524k
  points saturated mid-cluster and washed out structure.

Measured at 524k via CDP/headless swiftshader:
  TIME_TO_DRAWN_MS: ~9000  (download + parse + GL init)
  HEAP_MB:         used=341 total=396  (well under the 4.3 GB limit)
  PAN_FPS:         55.8     (smooth in headless; native GPU faster)
  FILTER_TOGGLE:   346 ms   (with the precompute commit on top)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
arxiv is ~80% of the full-corpus projection (524k → ~419k arxiv
papers), so by default it washes the cloud violet and buries the
conference clusters underneath. Initialise hiddenSources with
{"arxiv"} so the first paint emphasises the more interesting
conference clusters; the user can still click arxiv in the legend
to bring it back.

The 18k PL projection (?projection=pacmap_pl_v1) is also majority
arxiv (9753 / 18063), so the same default reads sensibly there.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous implementation drew yellow ring <div>s on top of the
canvas, positioned via scatter.getScreenPosition(). Two bugs:
  1. devicePixelRatio math — getScreenPosition returns CSS pixels
     (regl-scatterplot.esm.js:8401, `(currentWidth * (v[0]+1))/2`
     where currentWidth is already CSS px), but our overlay divided
     by dpr anyway, so on Retina the rings landed at half their
     correct position.
  2. Even at dpr=1 the rings drifted ~17 px from the underlying dot
     (measured by sampling the canvas's bright-pixel centroid versus
     the ring's centre) — likely some interaction with our normalize
     function and the camera's default zoom we couldn't track down.

Removed the overlay entirely. Highlights now live inside regl's own
pipeline: we reserve a "highlight" colour category at index
legend.length, paint it #ffd84a in pointColor, and give it a bigger
slot in pointSize (12 vs the base 3). When the selection set
changes, we clone just the affected rows in the points array with
their 3rd column swapped to highlightCategory and call
scatter.draw(...) — pixel-perfect by construction, no DPR math, no
pan/zoom reconciliation.
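
The row-cloning step might look like the sketch below, assuming each point is an `[x, y, category]` triple. `withHighlights` is a hypothetical helper name, and the returned array is what would be handed to `scatter.draw(...)`.

```typescript
// [x, y, category]; the third column selects colour (and size) in the renderer.
type AtlasPoint = [number, number, number];

// Return a copy of `points` in which only the selected rows are replaced by
// clones carrying the highlight category. Unselected rows are shared, not copied.
function withHighlights(
  points: readonly AtlasPoint[],
  selected: ReadonlySet<number>,
  highlightCategory: number,
): AtlasPoint[] {
  const out = points.slice(); // shallow copy of the row list
  for (const index of selected) {
    const p = points[index];
    if (p === undefined) continue; // ignore stale indices
    out[index] = [p[0], p[1], highlightCategory];
  }
  return out;
}
```

Because the original rows are never mutated, clearing the selection only requires drawing the original array again.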

scatter.draw is now serialized through a tiny inflight ref because
the lib rejects concurrent draws with "Ignoring draw call on the
previous draw call has not yet finished".
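
One way to serialize draws is sketched below. The commit only says an inflight ref is used; the "latest request wins" coalescing shown here is an assumption, as is the name `makeSerializedDraw`.

```typescript
// Wrap an async draw so at most one call runs at a time. Requests arriving
// while a draw is in flight are coalesced: only the most recent one is kept
// and it runs as soon as the current draw settles.
function makeSerializedDraw<P>(
  draw: (points: P) => Promise<void>,
): (points: P) => void {
  let inflight = false;
  let pending: { points: P } | null = null;

  const run = (points: P): void => {
    inflight = true;
    draw(points)
      .catch(() => undefined) // a failed draw must not wedge the queue
      .then(() => {
        inflight = false;
        if (pending !== null) {
          const next = pending.points;
          pending = null;
          run(next);
        }
      });
  };

  return (points: P): void => {
    if (inflight) {
      pending = { points }; // overwrite any older pending request
      return;
    }
    run(points);
  };
}
```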

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hts pop

Two bugs in the search-highlight path:

1. Selecting a paper repainted hidden sources. scatter.draw() resets
   any prior filter() — the lib treats a new point buffer as fresh
   visibility, so toggling the search-selection set un-hid every
   source the user had clicked off in the legend (most visibly,
   arxiv came back). Fixed by re-applying the current filter
   immediately after every draw(), reading state from refs so the
   redraw effect's async closure sees the latest values.

2. Highlighted points weren't obvious. We were passing pointSize and
   opacity as arrays indexed by the 3rd column, but regl-scatterplot
   ignores those arrays unless sizeBy / opacityBy are explicitly set
   to the same encoding (esm.js DEFAULT_*_BY = null). Wire sizeBy
   and opacityBy to 'valueA' and bump the highlight slot to
   pointSize=30 (10x the cloud), opacity=1.0 (vs the cloud's 0.55).
   Yellow + big + opaque now reads obviously selected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The ResizeObserver fired on initial container layout (which is fine)
and again whenever the container's box changed — for example when
the sidebar's paper-details panel rendered the abstract a few
hundred ms after a click. The inner scatter.set({width,height}) call
clobbers the active filter state in regl-scatterplot, so hidden
sources (notably arxiv) reappeared a moment after the user clicked
a highlight, looking exactly like "the whole graph reset after a
while". Re-apply the current hiddenSources state from the ref right
after the size change so the visual stays consistent across layout
shifts.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 12, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| oversight | Ready | Preview, Comment | May 12, 2026 6:00pm |

charlielidbury and others added 5 commits May 12, 2026 19:58
Captures the wire format, backend/frontend split, perf projections,
and migration steps for the NDJSON streaming path on /api/atlas.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in ?format=ndjson branch that yields a header line
({projection, total, bbox}) followed by one JSON-per-line point. Uses
a fresh psycopg connection + named server-side cursor (atlas_stream,
itersize=5000) so the streaming cursor never blocks the shared
_neighbors_conn, and drops ORDER BY paper_id so PostgreSQL can emit
rows as it scans instead of buffering to sort.
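
The wire format can be illustrated with a small generator. The real endpoint is Flask reading from a server-side cursor, so this TypeScript only models the framing; the point fields and the bbox layout are assumptions, since the commit names only the header keys.

```typescript
// Header keys come from the commit message; the bbox tuple order is assumed.
type StreamHeader = {
  projection: string;
  total: number;
  bbox: [number, number, number, number];
};
// Assumed per-point fields.
type StreamPoint = { paper_id: string; x: number; y: number };

// Emit one header line, then one JSON document per line for each point.
function* ndjsonLines(
  header: StreamHeader,
  points: Iterable<StreamPoint>,
): Generator<string, void, undefined> {
  yield JSON.stringify(header) + "\n";
  for (const p of points) {
    yield JSON.stringify(p) + "\n";
  }
}
```

Because the header is yielded before any point row is read, a consumer can size its buffers and show a total before the scan finishes.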

The default JSON path is unchanged. The header line arrives in ~200ms
vs ~5s for the bulk JSON build, enabling the frontend to render points
progressively instead of staring at a spinner.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure browser-side module that fetches /api/atlas?format=ndjson, parses
the response via ReadableStream.getReader() + TextDecoder, and surfaces
the header and per-batch point arrays to the caller. Line-buffered split
on the last newline carries partial chunks forward correctly.
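
The line-buffering idea can be sketched independently of the network layer. `NdjsonLineBuffer` is a hypothetical class; in the browser each chunk would come from `TextDecoder.decode(bytes, { stream: true })` applied to what the reader returns.

```typescript
// Accumulates decoded text and returns only complete JSON lines.
// Anything after the last newline is carried over to the next push.
class NdjsonLineBuffer {
  private carry = "";

  push(chunk: string): unknown[] {
    const text = this.carry + chunk;
    const lastNewline = text.lastIndexOf("\n");
    if (lastNewline === -1) {
      this.carry = text; // no complete line yet
      return [];
    }
    this.carry = text.slice(lastNewline + 1);
    return text
      .slice(0, lastNewline)
      .split("\n")
      .filter((line) => line.trim() !== "")
      .map((line) => JSON.parse(line) as unknown);
  }

  // Call once the stream ends to parse a final line with no trailing newline.
  flush(): unknown[] {
    const rest = this.carry.trim();
    this.carry = "";
    return rest === "" ? [] : [JSON.parse(rest) as unknown];
  }
}
```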

Batch size of 25k is tuned to the in-page redraw cadence: smaller values
starve the renderer because batches arrive faster than scatter.draw can
complete, so the canvas stays blank until the stream ends.

Also exports makeNormalizer(bbox), a bbox-driven version of the existing
normalizePoints math so the page can normalise each point as it arrives
instead of waiting for a global min/max scan.
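
A bbox-driven normaliser could be as small as the sketch below. It is not the module's actual code: the name `makeBBoxNormalizer`, the object-shaped bbox, and the choice to map each axis independently onto [-1, 1] are all assumptions made for illustration.

```typescript
type BBox = { xmin: number; ymin: number; xmax: number; ymax: number };

// Build a function that maps raw projection coordinates into [-1, 1] on each
// axis, using bounds known up front instead of a scan over all points.
function makeBBoxNormalizer(bbox: BBox): (x: number, y: number) => [number, number] {
  // Guard against a degenerate box so we never divide by zero.
  const spanX = bbox.xmax - bbox.xmin || 1;
  const spanY = bbox.ymax - bbox.ymin || 1;
  return (x: number, y: number): [number, number] => [
    ((x - bbox.xmin) / spanX) * 2 - 1,
    ((y - bbox.ymin) / spanY) * 2 - 1,
  ];
}
```

Since the bounds arrive in the stream header, every batch can be normalised the moment it is parsed, and earlier batches never need rescaling.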

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Forks the points fetch effect and the normalised-points memo so that
?stream=1 uses the new streamAtlas helper, appending batches to the
points array as they arrive and normalising via the header's bbox.
The default JSON path is unchanged.

The existing scatter draw effect already gates on normalized.length > 0
and serializes overlapping draws via drawInflightRef, so progressive
batch growth flows through unchanged. Search, legend, and hover all
keep working mid-load — their memos rebuild as points grows.

Header counter shows N / total papers in stream mode so the user can
see progress.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>