feat: paper atlas — WebGL scatter of every embedded paper in 2D#12
Open
charlielidbury wants to merge 20 commits into
Stores PaCMAP/UMAP/t-SNE 2D coordinates per (paper_id, projection) so multiple projections can coexist; (projection, x, y) index supports viewport range queries from the upcoming /atlas page. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Streams a (paper_id, x, y) CSV into paper_projection_2d in 5k-row batches, upserting on (paper_id, projection) so re-runs are idempotent and rows without a matching paper are filtered client-side to dodge the FK violation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Returns 2D-projection points joined with paper.title/source for the upcoming /atlas scatter plot. Optional viewport (xmin,ymin,xmax,ymax) clips to a rectangle; default limit 50k, max 200k. Reuses the neighbors connection so per-request overhead stays low. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Renders the paper-atlas as a regl-scatterplot canvas: pan/zoom, hover tooltips with title/source, source-coloured points, click to open /graph?papers=<id>. Lazy-imports the WebGL renderer so SSR doesn't choke. Defaults to projection=pacmap_pl_v1 against the 18k PL load; flip the constant once the full corpus lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a "Map view" link in the header next to the Oversight title so users can jump from the chat-style search into the 2D paper-cloud scatter without typing the URL manually. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two coupled bugs in the v1 atlas init: (1) regl-scatterplot's
colorBy defaults to null, so the per-point category in points[i][2]
was ignored and every point rendered with pointColor[0] (all blue);
(2) the canvas was Tailwind-sized via CSS only, leaving the
attribute width/height at the 300x150 default, which made regl
allocate a buffer that didn't match the viewport and could even
fail GL context init on stricter drivers.
Set colorBy='valueA' so the category column drives colour, seed
canvas.width/height (× devicePixelRatio) before createScatterplot,
and use width/height='auto' so the lib reads the canvas's intrinsic
size. ResizeObserver now also resizes the backing buffer.
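
The sizing half of the fix can be sketched as follows. This is an illustration under assumptions rather than the PR's code: `SizedCanvas`, `backingStoreSize`, and `seedCanvasSize` are hypothetical names, and rounding to whole pixels with a minimum of 1 is a choice made here.

```typescript
// Minimal stand-in for the two HTMLCanvasElement attributes the fix touches,
// so the sketch also runs outside a browser. Hypothetical name.
interface SizedCanvas {
  width: number;
  height: number;
}

// Convert a CSS-pixel box into backing-store pixels (CSS size x devicePixelRatio).
function backingStoreSize(
  cssWidth: number,
  cssHeight: number,
  dpr: number
): { width: number; height: number } {
  const ratio = dpr > 0 ? dpr : 1; // fall back to 1 if dpr is missing or invalid
  return {
    width: Math.max(1, Math.round(cssWidth * ratio)),
    height: Math.max(1, Math.round(cssHeight * ratio)),
  };
}

// Seed the attribute size so it no longer stays at the 300x150 default.
function seedCanvasSize(
  canvas: SizedCanvas,
  cssWidth: number,
  cssHeight: number,
  dpr: number
): void {
  const size = backingStoreSize(cssWidth, cssHeight, dpr);
  canvas.width = size.width;
  canvas.height = size.height;
}
```

In the page this would presumably be called with the container's measured CSS size and window.devicePixelRatio, once before createScatterplot and again from the ResizeObserver callback.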
Verified end-to-end via headless Chrome + CDP: 18,063 points draw
with mixed colours, hover at the cloud centre shows a real paper
("Biparsers: Exact Printing for Data Synchronisation · POPL"), and
clicking that point navigates to /graph?papers=10.1145%2F3704910.
0 exceptions, 0 console errors.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Each legend entry is now a button. Clicking it toggles whether that source's points appear on the map; hidden entries dim to 40% opacity with a strike-through and a hollow color swatch. Two small ALL / NONE affordances show/hide everything at once.
Filtering uses regl-scatterplot's filter([indices]) API rather than re-feeding the buffer, so we keep the original point indices intact (click → /graph navigation still finds the right paper_id) and we don't tear down the GL context on every toggle (camera pan/zoom state is preserved). Hover tooltips for a now-hidden source are suppressed so we don't show metadata for an invisible dot.
Verified end-to-end via CDP: toggling POPL off drops 3107 non-black canvas pixels and suppresses the tooltip at a known POPL point; toggling back restores both. NONE → canvas goes blank (0 non-black pixels); ALL → cloud returns.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a 360px right sidebar to /atlas mirroring graph.tsx's HoverPreview: source · date · title (linked) · authors · paper_id · "View paper" CTA · abstract.
Hover a point → sidebar populates after a lazy GET /api/papers/<id> fetch (results cached so wiggling between known points is instant). Click a point now PINS that paper to the sidebar instead of navigating to /graph. Once pinned, the panel sticks while the user keeps panning the map; subsequent hovers don't displace it. The clear (×) button unpins and returns to the empty state. A small "pinned" badge in the sidebar header signals the latched state.
Adds GET /api/papers/<path:paper_id> on the backend; the existing /neighbors endpoint couldn't be reused because Flask's default <str> converter rejects DOI-style ids with slashes. Verified the new route doesn't shadow /neighbors via curl on both an arxiv id and a DOI.
Verified end-to-end via CDP: hovering a known POPL point at the cloud centre populates the sidebar with "Biparsers: Exact Printing for Data Synchronisation · POPL · Jan 2025 · authors · full abstract"; clicking flips the header to show "pinned"; hovering elsewhere leaves the sidebar unchanged (pinned wins); X clears back to the empty state. URL stays /atlas throughout (no more accidental graph navigation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
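
The hover / pin / clear behaviour described in this commit can be modelled as a small state transition. The sketch below is an assumption-laden illustration: `SidebarState`, `SidebarEvent`, and `nextSidebar` are hypothetical names, and treating a click on another point while pinned as "re-pin to the new paper" is a guess, since the commit message only says that hovers do not displace a pin.

```typescript
// Hypothetical model of the sidebar: which paper it shows and whether it is pinned.
type SidebarState = { paperId: string | null; pinned: boolean };

type SidebarEvent =
  | { kind: "hover"; paperId: string }
  | { kind: "click"; paperId: string }
  | { kind: "clear" };

function nextSidebar(state: SidebarState, event: SidebarEvent): SidebarState {
  switch (event.kind) {
    case "hover":
      // "Pinned wins": a hover never displaces a pinned paper.
      return state.pinned ? state : { paperId: event.paperId, pinned: false };
    case "click":
      // Clicking pins the paper instead of navigating to /graph.
      return { paperId: event.paperId, pinned: true };
    case "clear":
      // The clear button unpins and returns to the empty state.
      return { paperId: null, pinned: false };
  }
}
```

Keeping the transition pure makes the "pinned wins" rule easy to check in isolation from the fetch cache and the renderer.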
Adds a debounced search field in the top-left of the canvas. Typing a
query (e.g. "garbage collection") hits the existing /api/search
(semantic, embedding-backed) and shows up to 10 results, filtered to
papers that actually exist in the current atlas projection — anything
else is unhighlightable on this map and would mislead.
Each result has a checkbox. Checked papers go into a Set<string> and
two things happen:
1. scatter.select(indices, { preventEvent: true }) is called so
regl-scatterplot bumps their pointSizeSelected. preventEvent stops
it from firing the 'select' event back at our pin-the-paper
subscriber.
2. Yellow ring overlays (DOM, pointer-events-none, z-20) are drawn on
top of the canvas at each selected point's screen position. We
couldn't rely on pointColorActive alone because regl-scatterplot's
getColors path ignores it when colorBy is set (esm.js ~6596) —
selected points would otherwise just be slightly bigger
same-coloured dots. The overlay subscribes to the scatter's 'view'
event so the rings track pan / zoom; ResizeObserver also re-ticks
so they stay aligned across viewport changes.
The clear (×) button clears the input and closes the dropdown but
leaves selections intact, so the user can search again to add more
highlights without losing their existing set; a separate "clear"
link in the summary line drops all selections at once.
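
The selection bookkeeping above amounts to two lookups against a paper_id → point-index map. A minimal sketch, assuming hypothetical helper names (`buildIndexById`, `selectableResults`, `selectedIndices`) and assuming the results are filtered before the 10-item cap is applied:

```typescript
// Map paper_id -> point index, built once from the loaded points.
function buildIndexById(paperIds: readonly string[]): Map<string, number> {
  const indexById = new Map<string, number>();
  paperIds.forEach((id, i) => indexById.set(id, i));
  return indexById;
}

// Keep only search hits that exist in the current projection, capped at `max`.
function selectableResults(
  resultIds: readonly string[],
  indexById: ReadonlyMap<string, number>,
  max: number = 10
): string[] {
  return resultIds.filter((id) => indexById.has(id)).slice(0, max);
}

// Translate the checked-paper set into point indices, e.g. for
// scatter.select(indices, { preventEvent: true }).
function selectedIndices(
  selected: ReadonlySet<string>,
  indexById: ReadonlyMap<string, number>
): number[] {
  const out: number[] = [];
  selected.forEach((id) => {
    const i = indexById.get(id);
    if (i !== undefined) out.push(i); // skip ids missing from this projection
  });
  return out;
}
```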
Verified end-to-end via CDP: searching "garbage collection" against
the 18k PL atlas returns 6 candidates, 5 of which exist in the
projection; checking 3 produces 3 ring overlays (clustered tightly
because GC papers neighbour each other in PaCMAP) plus a "3 selected"
badge; subsequent zoom-wheel events shift each ring's left/top
correctly (e.g. one ring moves from 781,553 → 894,738 after 5 wheel
ticks); pinning a different paper via canvas click coexists with the
selection (sidebar shows "pinned", ring count stays at 2).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
At 524k points (full corpus), the naive O(N) "scan every point and test source membership" recompute used by visibleIndices took ~2.5s on every legend toggle — long enough to feel sluggish. Replace it with two memoised passes: one builds a Map<source, Uint32Array> of point indices at load time, and the toggle then concatenates the *visible* source buckets in O(unique_sources + total_visible). The typed-array storage also cuts memory: ~2 MB instead of ~10 MB for the boxed-Number array on the same data. Measured via CDP at 524k points: filter toggle latency went from ~2.5s to ~350ms (canvas pixel-count drop confirms the filter actually applied). The 18k-PL projection sees the speedup too but the absolute numbers were already imperceptible there. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
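
The two passes can be sketched as below. This is an illustration of the technique rather than the PR's code: `buildSourceBuckets` is a hypothetical name, the input is assumed to be one source string per point, and the output of `visibleIndices` is grouped by source rather than sorted by index.

```typescript
// Load-time pass: group point indices by source into typed arrays.
function buildSourceBuckets(sources: readonly string[]): Map<string, Uint32Array> {
  // Count points per source so each typed array is allocated exactly once.
  const counts = new Map<string, number>();
  sources.forEach((s) => counts.set(s, (counts.get(s) ?? 0) + 1));

  const buckets = new Map<string, Uint32Array>();
  const cursors = new Map<string, number>();
  counts.forEach((n, s) => {
    buckets.set(s, new Uint32Array(n));
    cursors.set(s, 0);
  });

  // Write each point index into the next free slot of its source bucket.
  sources.forEach((s, i) => {
    const bucket = buckets.get(s);
    const at = cursors.get(s);
    if (bucket === undefined || at === undefined) return;
    bucket[at] = i;
    cursors.set(s, at + 1);
  });
  return buckets;
}

// Toggle-time pass: concatenate the buckets of every non-hidden source.
// Cost is O(unique_sources + total_visible), not O(N) membership tests.
function visibleIndices(
  buckets: ReadonlyMap<string, Uint32Array>,
  hidden: ReadonlySet<string>
): Uint32Array {
  let total = 0;
  buckets.forEach((b, s) => {
    if (!hidden.has(s)) total += b.length;
  });
  const out = new Uint32Array(total);
  let offset = 0;
  buckets.forEach((b, s) => {
    if (hidden.has(s)) return;
    out.set(b, offset); // bulk copy of one bucket
    offset += b.length;
  });
  return out;
}
```

If the filter call needs a plain number array, Array.from(result) converts the typed array at the boundary; whether that conversion is required is not stated in the commit.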
- DEFAULT_PROJECTION flips from pacmap_pl_v1 (18k PL) to pacmap_v1 (524k full corpus). Users can still load the smaller projection by visiting /atlas?projection=pacmap_pl_v1; the page now reads the query string on mount.
- /api/atlas: limit ceiling raised from 200k to 1M (524k corpus needs more headroom than the original PL-only cap allowed), default limit bumped to 1M so a no-args call returns everything.
- COLOR_PALETTE expanded from 12 to 22 hues so all 20 sources in the full corpus get a distinct colour (was greying out half the legend before).
- pointScaleMode='asinh' + smaller pointSize + lower opacity (0.55) tuned for the denser cloud; without these tweaks the 524k points saturated mid-cluster and washed out structure.
Measured at 524k via CDP/headless swiftshader:
TIME_TO_DRAWN_MS: ~9000 (download + parse + GL init)
HEAP_MB: used=341 total=396 (well under the 4.3 GB limit)
PAN_FPS: 55.8 (smooth in headless; native GPU faster)
FILTER_TOGGLE: 346 ms (with the precompute commit on top)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
arxiv is ~80% of the full-corpus projection (524k → ~419k arxiv
papers), so by default it washes the cloud violet and buries the
conference clusters underneath. Initialise hiddenSources with
{"arxiv"} so the first paint emphasises the more interesting
conference clusters; the user can still click arxiv in the legend
to bring it back.
The 18k PL projection (?projection=pacmap_pl_v1) is also majority
arxiv (9753 / 18063), so the same default reads sensibly there.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous implementation drew yellow ring <div>s on top of the
canvas, positioned via scatter.getScreenPosition(). Two bugs:
1. devicePixelRatio math — getScreenPosition returns CSS pixels
(regl-scatterplot.esm.js:8401, `(currentWidth * (v[0]+1))/2`
where currentWidth is already CSS px), but our overlay divided
by dpr anyway, so on Retina the rings landed at half their
correct position.
2. Even at dpr=1 the rings drifted ~17 px from the underlying dot
(measured by sampling the canvas's bright-pixel centroid versus
the ring's centre) — likely some interaction with our normalize
function and the camera's default zoom we couldn't track down.
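
Bug 1 is easiest to see as arithmetic. The sketch below uses the formula quoted above; `ndcToCssX` and `buggyOverlayX` are hypothetical names introduced only for this illustration.

```typescript
// Normalised x in [-1, 1] -> CSS pixels, following (currentWidth * (v[0] + 1)) / 2.
function ndcToCssX(ndcX: number, cssWidth: number): number {
  return (cssWidth * (ndcX + 1)) / 2;
}

// What the overlay did: divide a value that is already in CSS pixels by dpr.
function buggyOverlayX(ndcX: number, cssWidth: number, dpr: number): number {
  return ndcToCssX(ndcX, cssWidth) / dpr;
}

// A point at the horizontal centre of an 800 CSS-px wide canvas:
const correctX = ndcToCssX(0, 800); // 400
const retinaX = buggyOverlayX(0, 800, 2); // 200, half the correct position
```

At dpr=1 the division is a no-op, which is why this bug only showed on Retina displays and does not explain the separate ~17 px drift in bug 2.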
Removed the overlay entirely. Highlights now live inside regl's own
pipeline: we reserve a "highlight" colour category at index
legend.length, paint it #ffd84a in pointColor, and give it a bigger
slot in pointSize (12 vs the base 3). When the selection set
changes, we clone just the affected rows in the points array with
their 3rd column swapped to highlightCategory and call
scatter.draw(...) — pixel-perfect by construction, no DPR math, no
pan/zoom reconciliation.
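
A minimal sketch of the row swap, assuming points are stored as [x, y, category] tuples and that selections arrive as point indices. `Point` and `withHighlights` are hypothetical names; the actual page may key selections by paper_id instead.

```typescript
type Point = [number, number, number]; // [x, y, category]

// Return a new points array in which only the selected rows are cloned,
// with their 3rd column swapped to the reserved highlight category.
function withHighlights(
  points: readonly Point[],
  selected: readonly number[],
  highlightCategory: number
): Point[] {
  const out = points.slice(); // shallow copy: untouched rows are shared, not cloned
  selected.forEach((i) => {
    const p = points[i];
    if (p === undefined) return; // ignore out-of-range indices
    out[i] = [p[0], p[1], highlightCategory];
  });
  return out;
}
```

Because the original rows are never mutated, clearing the selection can redraw from the untouched base array.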
scatter.draw is now serialized through a tiny inflight ref because
the lib rejects concurrent draws with "Ignoring draw call on the
previous draw call has not yet finished".
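
One way to serialize draws is sketched below. It is an assumption, not the PR's implementation: `makeSerializedDraw` is a hypothetical helper, the draw function is assumed to return a Promise that settles when the draw finishes, and the "latest request wins" coalescing is a design choice made here since the commit message does not say whether queued draws are kept or dropped.

```typescript
type DrawFn<T> = (points: T) => Promise<void>;

function makeSerializedDraw<T>(draw: DrawFn<T>) {
  let inflight = false;
  // Only the most recent request made while a draw is running is kept.
  let pending: { points: T } | null = null;

  const run = (points: T): void => {
    inflight = true;
    draw(points)
      .catch(() => undefined) // a failed draw must not wedge the queue
      .then(() => {
        inflight = false;
        if (pending !== null) {
          const next = pending.points;
          pending = null;
          run(next);
        }
      });
  };

  return {
    // Start a draw now, or remember these points until the current draw ends.
    request(points: T): void {
      if (inflight) {
        pending = { points };
        return;
      }
      run(points);
    },
    hasPending: (): boolean => pending !== null,
  };
}
```

With this shape, three rapid requests result in at most two draws: the first one, then the latest of the rest.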
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…hts pop
Two bugs in the search-highlight path:
1. Selecting a paper repainted hidden sources. scatter.draw() resets any prior filter(): the lib treats a new point buffer as fresh visibility, so toggling the search-selection set un-hid every source the user had clicked off in the legend (most visibly, arxiv came back). Fixed by re-applying the current filter immediately after every draw(), reading state from refs so the redraw effect's async closure sees the latest values.
2. Highlighted points weren't obvious. We were passing pointSize and opacity as arrays indexed by the 3rd column, but regl-scatterplot ignores those arrays unless sizeBy / opacityBy are explicitly set to the same encoding (esm.js DEFAULT_*_BY = null). Wire sizeBy and opacityBy to 'valueA' and bump the highlight slot to pointSize=30 (10x the cloud), opacity=1.0 (vs the cloud's 0.55). Yellow + big + opaque now reads obviously selected.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
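
The per-category arrays from fix 2 could be assembled roughly as follows. The option names (colorBy, sizeBy, opacityBy, pointColor, pointSize, opacity) and the values 3, 30, 0.55, 1.0 and #ffd84a come from the commit messages in this PR; `buildEncodings`, the `HighlightEncodings` shape, and cycling the palette when there are more sources than hues are assumptions made for this sketch.

```typescript
interface HighlightEncodings {
  options: {
    colorBy: "valueA";
    sizeBy: "valueA";
    opacityBy: "valueA";
    pointColor: string[];
    pointSize: number[];
    opacity: number[];
  };
  highlightCategory: number; // index of the reserved highlight slot
}

function buildEncodings(palette: readonly string[], legendLength: number): HighlightEncodings {
  if (palette.length === 0) throw new Error("palette must not be empty");
  const pointColor: string[] = [];
  const pointSize: number[] = [];
  const opacity: number[] = [];
  // One slot per legend entry: base size and the dense-cloud opacity.
  for (let i = 0; i < legendLength; i++) {
    pointColor.push(palette[i % palette.length]);
    pointSize.push(3);
    opacity.push(0.55);
  }
  // Reserved highlight slot at index legendLength: yellow, 10x size, opaque.
  pointColor.push("#ffd84a");
  pointSize.push(30);
  opacity.push(1.0);
  return {
    options: {
      colorBy: "valueA",
      sizeBy: "valueA",
      opacityBy: "valueA",
      pointColor,
      pointSize,
      opacity,
    },
    highlightCategory: legendLength,
  };
}
```

All three arrays share one index space, so a point whose 3rd column equals highlightCategory picks up the colour, size, and opacity of the highlight slot together.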
The ResizeObserver fired on initial container layout (which is fine)
and again whenever the container's box changed — for example when
the sidebar's paper-details panel rendered the abstract a few
hundred ms after a click. The inner scatter.set({width,height}) call
clobbers the active filter state in regl-scatterplot, so hidden
sources (notably arxiv) reappeared a moment after the user clicked
a highlight, looking exactly like "the whole graph reset after a
while". Re-apply the current hiddenSources state from the ref right
after the size change so the visual stays consistent across layout
shifts.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Captures the wire format, backend/frontend split, perf projections, and migration steps for the NDJSON streaming path on /api/atlas. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds an opt-in ?format=ndjson branch that yields a header line
({projection, total, bbox}) followed by one JSON-per-line point. Uses
a fresh psycopg connection + named server-side cursor (atlas_stream,
itersize=5000) so the streaming cursor never blocks the shared
_neighbors_conn, and drops ORDER BY paper_id so PostgreSQL can emit
rows as it scans instead of buffering to sort.
The default JSON path is unchanged. The header line arrives in ~200ms
vs ~5s for the bulk JSON build, enabling the frontend to render points
progressively instead of staring at a spinner.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pure browser-side module that fetches /api/atlas?format=ndjson, parses the response via ReadableStream.getReader() + TextDecoder, and surfaces the header and per-batch point arrays to the caller. Line-buffered split on the last newline carries partial chunks forward correctly. Batch size of 25k is tuned to the in-page redraw cadence: smaller values starve the renderer because batches arrive faster than scatter.draw can complete, so the canvas stays blank until the stream ends. Also exports makeNormalizer(bbox), a bbox-driven version of the existing normalizePoints math so the page can normalise each point as it arrives instead of waiting for a global min/max scan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
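
The two pieces of parsing logic described here can be sketched independently of the fetch call. Both functions below are guesses at the shape of the real module: `makeLineSplitter` is a hypothetical name, the bbox is assumed to carry xmin/ymin/xmax/ymax fields, and mapping each axis independently onto [-1, 1] is an assumption about what the existing normalizePoints math does.

```typescript
// Line-buffered splitter: feed decoded text chunks, get back complete lines.
// Anything after the last newline is carried into the next push().
function makeLineSplitter() {
  let carry = "";
  return {
    push(chunk: string): string[] {
      const text = carry + chunk;
      const cut = text.lastIndexOf("\n");
      if (cut === -1) {
        carry = text; // no complete line yet
        return [];
      }
      carry = text.slice(cut + 1);
      return text
        .slice(0, cut)
        .split("\n")
        .filter((line) => line.length > 0);
    },
    // Call once the stream ends to recover a final line with no trailing newline.
    flush(): string[] {
      const rest = carry;
      carry = "";
      return rest.length > 0 ? [rest] : [];
    },
  };
}

// Assumed bbox shape; the header only documents {projection, total, bbox}.
interface BBox {
  xmin: number;
  ymin: number;
  xmax: number;
  ymax: number;
}

// Normalise a point as it arrives, without a global min/max scan.
function makeNormalizer(bbox: BBox): (x: number, y: number) => [number, number] {
  const spanX = bbox.xmax - bbox.xmin || 1; // avoid dividing by zero
  const spanY = bbox.ymax - bbox.ymin || 1;
  return (x, y) => [
    ((x - bbox.xmin) / spanX) * 2 - 1,
    ((y - bbox.ymin) / spanY) * 2 - 1,
  ];
}
```

In the browser, each chunk from ReadableStream.getReader() would be decoded with TextDecoder (using the stream option so multi-byte characters split across chunks survive) before being passed to push().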
Forks the points fetch effect and the normalised-points memo so that ?stream=1 uses the new streamAtlas helper, appending batches to the points array as they arrive and normalising via the header's bbox. The default JSON path is unchanged. The existing scatter draw effect already gates on normalized.length > 0 and serializes overlapping draws via drawInflightRef, so progressive batch growth flows through unchanged. Search, legend, and hover all keep working mid-load — their memos rebuild as points grows. Header counter shows N / total papers in stream mode so the user can see progress. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A /atlas page that renders all ~524k embedded papers as a pan/zoomable WebGL scatter using a PaCMAP projection, with source-coloured points, an interactive legend filter, a hover sidebar, and multi-select search highlight.

Surface
- paper_projection_2d table: (paper_id, projection, x, y) keyed; lets multiple projections coexist (pacmap_pl_v1, pacmap_v1).
- GET /api/atlas?projection=<name>[&viewport=...][&limit=N]: serves up to 1M points per projection.
- GET /api/papers/<path:paper_id>: full paper metadata for the hover sidebar.
- /atlas page in the FE: regl-scatterplot, in-canvas selection highlight, source-toggleable legend with ALL/NONE shortcuts, semantic search bar with multi-select.
- scripts/load_pacmap_coords.py: loads a CSV of 2D coords into the projection table.

Notes
docs/streaming-points.md once it lands.

🤖 Generated with Claude Code