feat: ingest PL conference proceedings + unified oversight sync by charlielidbury · Pull Request #10 · ottowhite/oversight

charlielidbury · 2026-05-10T17:12:19Z

Summary

Adds an ingest pipeline for the eight major Programming Languages conferences (POPL, PLDI, ICFP, OOPSLA, ESOP, ECOOP, CC, Haskell Symposium) and refactors sync into a unified `oversight sync` command backed by a `SourcePoller` registry.

Motivation: the existing arxiv-CS harvester misses any paper that was never deposited on arXiv. "Pure Subtype Systems" (Hutchins, POPL 2010) was the canonical example — foundational PL work, only in ACM DL. With this branch it's now in the DB and the #1 hit for a semantic search of "pure subtype systems".

What changed

New harvester (`src/oversight/PLConferenceHarvester.py`, ~1.3k lines) — pure HTTP, no LLM. Pipeline: DBLP for paper lists → OpenAlex for abstracts → Semantic Scholar fallback → skip-with-log if both miss. Handles PACMPL volume mapping for post-2017 SIGPLAN venues, conf-style TOCs for pre-2017 + Tier 2, three DBLP mirrors with retry rotation, per-source retry budgets and rate limits, on-disk cache at `.cache/pl_conferences/`.
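The abstract-resolution order described above (primary source, then fallback, then skip-with-log) can be sketched roughly as follows. This is an illustration only: `resolve_abstract` and the lookup callables are hypothetical names, not functions taken from `PLConferenceHarvester.py`.

```python
from typing import Callable, List, Optional, Tuple

# A lookup takes a DOI and returns an abstract, or None on a miss.
AbstractLookup = Callable[[str], Optional[str]]


def resolve_abstract(
    doi: str,
    lookups: List[Tuple[str, AbstractLookup]],
    skipped: List[str],
) -> Optional[str]:
    """Try each source in order; record the DOI as skipped if all miss."""
    for _name, lookup in lookups:
        abstract = lookup(doi)
        if abstract:
            return abstract
    tried = ", ".join(name for name, _ in lookups)
    skipped.append(f"skip {doi}: no abstract from {tried}")
    return None


# Stand-in sources: the primary misses, the fallback hits for one DOI only.
def fake_openalex(doi: str) -> Optional[str]:
    return None


def fake_semantic_scholar(doi: str) -> Optional[str]:
    return "An abstract." if doi == "10.1145/example" else None


SOURCES = [("openalex", fake_openalex), ("semantic_scholar", fake_semantic_scholar)]
```

Keeping the sources as an ordered list means a further fallback could be appended without touching the loop.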

Sync refactor (`SourcePoller.py`, `source_registry.py`, `ArxivPoller.py`, `PLConfPoller.py`, `cli.py`) — `oversight sync` with `--sources`, `--backfill`, `--dry-run`. Per-poller watermarks. `make oversight/sync` now routes through this. ML and Systems pollers are explicit `NotImplementedPoller` stubs (out of scope here, follow-up phase).
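For readers unfamiliar with the registry pattern, here is a minimal sketch of how a poller protocol, a stub poller, and a `sync` driver might fit together. The method names (`watermark`, `poll`), the `sync` signature, and `FakePoller` are assumptions made for illustration; the real `SourcePoller.py` and `source_registry.py` may look quite different.

```python
from typing import Dict, Iterable, Optional, Protocol


class SourcePoller(Protocol):
    name: str

    def watermark(self) -> Optional[str]: ...
    def poll(self, backfill: bool) -> int: ...


class NotImplementedPoller:
    """Explicit stub for sources that have not been migrated yet."""

    def __init__(self, name: str) -> None:
        self.name = name

    def watermark(self) -> Optional[str]:
        return None

    def poll(self, backfill: bool) -> int:
        raise NotImplementedError(f"{self.name}: poller not migrated yet")


class FakePoller:
    """Stand-in for a real poller; reports a fixed watermark and item count."""

    def __init__(self, name: str, wm: str, new_items: int) -> None:
        self.name = name
        self._wm = wm
        self._new_items = new_items

    def watermark(self) -> Optional[str]:
        return self._wm

    def poll(self, backfill: bool) -> int:
        return self._new_items


def sync(
    registry: Dict[str, SourcePoller],
    sources: Optional[Iterable[str]] = None,
    backfill: bool = False,
    dry_run: bool = False,
) -> Dict[str, str]:
    """Run the selected pollers and return a one-line status per source."""
    selected = list(sources) if sources is not None else list(registry)
    report: Dict[str, str] = {}
    for name in selected:
        poller = registry[name]
        if dry_run:
            # Dry run only inspects the per-poller watermark.
            report[name] = f"watermark={poller.watermark()}"
            continue
        try:
            report[name] = f"ingested={poller.poll(backfill)}"
        except NotImplementedError as exc:
            report[name] = f"skipped ({exc})"
    return report
```

A stub that raises `NotImplementedError` keeps unmigrated sources visible in the report instead of silently absent.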

Bug fixes along the way (each in its own commit):

`fix: anchor PLConferenceHarvester paper date to conference year` — PACMPL volumes are published in December of the prior calendar year, so OpenAlex's `publication_date` was bucketing POPL 2018 papers under 2017 etc. Snap out-of-year dates to `<conf_year>-01-01`. Recovered POPL 2015/2017/2018/2020 (~250 papers re-bucketed).
`fix: truncate over-long abstracts before embedding` — single 1.2k-word abstract was aborting whole embed batches.
`fix: cap OpenAlex/Semantic Scholar retries` — multi-hour Retry-After headers and aggressive throttling were burning ~90s per paper on default retry chains.

Outcome

| Venue | Papers | Years |
|---|---|---|
| OOPSLA | 2,280 | 1986–2025 |
| POPL | 2,179 | 1973–2026 |
| PLDI | 1,710 | 1987–2025 |
| ICFP | 1,010 | 1996–2025 |
| ECOOP | 366 | 1987–2025 |
| ESOP | 232 | 1986–2026 |
| CC | 207 | 1988–2026 |
| Haskell | 48 | 2000–2009 |
| Total | 8,032 | |

All 8,032 embedded; zero unembedded. Hutchins POPL 2010 verified as the top semantic-search hit for "pure subtype systems".

Known gaps (deferred)

Haskell Symposium 2010–2025 — OpenAlex daily quota exhausted mid-run during ingest. Cache is warm; resumes incrementally with `oversight sync --sources pl` after the quota resets. Tracked separately.
Pre-LIPIcs ESOP/ECOOP/CC abstracts — Springer LNCS doesn't expose them via OpenAlex or Semantic Scholar. ~2.5k papers theoretically affected. Would need a Springer-direct scraper to recover.
arxiv-vs-proceedings duplicates — explicitly deferred per plan; will quantify and decide on dedup strategy with measured data once everything's loaded.
ML / Systems poller migration — `OpenReviewHarvester` and `superscraper` stay on the old ad-hoc path until a follow-up phase migrates them into the registry.
arxiv embedding behavioral delta — old path called `_embed_missing_ai_papers` from inside `ArXivRepository.sync`; new path doesn't, and the unified end-of-sync embedding pass uses `embed_missing_conference_papers` which skips `source='arxiv'`. Worth verifying on next sync that arxiv embeddings still happen.

Plan and design rationale: `docs/pl-conferences-plan.md`.

Test plan

`uv run pytest tests/test_source_poller.py` — 12/12 pass locally.
`make format && make typecheck` — both clean (pre-commit hooks enforce this on every commit).
`uv run oversight sync --dry-run` — both arxiv and pl pollers report sane watermarks; ml/systems show explicit NotImplementedPoller messages.
`uv run oversight sync --sources arxiv` — incremental near-noop given recent sync.
DB: `SELECT source, COUNT(*) FROM paper WHERE source IN ('POPL','PLDI','ICFP','OOPSLA','ESOP','ECOOP','CC','Haskell') GROUP BY source` matches the table above.
DB: `SELECT title FROM paper WHERE title ILIKE '%pure subtype%'` returns Hutchins POPL 2010 alongside the 2024 follow-up.
Spot-check that `oversight sync --sources pl --backfill` doesn't trigger an accidental full re-pull (it should refuse without an explicit subset, or behave as advertised).

🤖 Generated with Claude Code

Captures the gap surfaced by Hutchins POPL 2010 (paper never on arxiv) and proposes a DBLP + OpenAlex pipeline covering POPL, PLDI, ICFP, OOPSLA, ESOP, ECOOP, CC, and Haskell Symposium. Also sketches a SourcePoller registry so PL ingest folds into a single `oversight sync` command rather than a per-source CLI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Introduces PLConferenceHarvester, a per-(venue, year) ingester that pulls a volume's table of contents from DBLP (search API, since the per-volume JSON endpoint 404s for PACMPL) and enriches each entry via OpenAlex, with Semantic Scholar as the abstract fallback. Emits papers in the same "scraped" JSON shape the existing consume path reads, so no PaperRepository changes are needed. Hardcoded to ("popl", 2024) for the Phase 1 vertical slice; the constructor is already (venue, year) so Phase 2 can drive the full back-catalogue. Also gitignores the local API response cache directory. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

All 93 papers from the POPL issue of PACMPL volume 8, produced by running `python -m oversight.PLConferenceHarvester`. Every paper has a DOI, a real abstract from OpenAlex, and a 2024-01-02 publication date; no DBLP entries had to be skipped. Following the convention in data/systems_conferences/ and data/vldb/, the file is committed to the repo so the consume path is reproducible without re-hitting external APIs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Drives off a small VENUES registry and discovers per-venue (year, TOC) pairs from DBLP's index.xml, with no per-venue branching. The two DBLP TOC schemes (PACMPL journal vs conf/<slug>/<year>) are handled in a data-driven way.

Also fixes parsing edge cases surfaced while expanding past POPL 2024:
- <h2> headings with embedded tags (CC@<ref>ETAPS</ref> ...) now match via DOTALL non-greedy.
- 2-digit-year proceedings keys (conf/popl/77 etc) are recognised in addition to 4-digit ones, recovering POPL 1974/1977/1979/1981/1982.

The CLI now iterates every venue × every DBLP-indexed year by default, caches DBLP search-API responses and OpenAlex/SS replies under .cache/pl_conferences/, supports --year/--year-min/--year-max scoping, and adds a --skip-existing-doi switch that pre-loads non-arxiv DOIs from the DB so we don't waste OpenAlex lookups. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
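The 2-digit-year handling could look something like the sketch below. The function name `proceedings_year` and the exact key pattern are assumptions; in particular it assumes every 2-digit key refers to a 20th-century year, which holds for keys such as conf/popl/77 but is not verified against all of DBLP.

```python
import re
from typing import Optional

# Matches keys such as "conf/popl/2010" or "conf/popl/77" (hypothetical pattern).
_KEY_RE = re.compile(r"conf/[a-z0-9]+/(\d{4}|\d{2})")


def proceedings_year(key: str) -> Optional[int]:
    """Return the proceedings year encoded in a DBLP-style key, or None."""
    match = _KEY_RE.fullmatch(key)
    if match is None:
        return None
    digits = match.group(1)
    if len(digits) == 4:
        return int(digits)
    # Assumption: 2-digit keys only occur for years before 2000.
    return 1900 + int(digits)
```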

Run per-paper OpenAlex/Semantic-Scholar lookups across a thread pool (default 16 workers) and process years concurrently from the CLI driver (default 4 workers). Each worker uses a thread-local requests Session. Drops the per-call sleep (request_delay_s default 0.0) since bounded concurrency plus the existing exponential-backoff retry path is enough to stay polite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Harvested via PLConferenceHarvester into data/pl_conferences/popl/. DBLP has no proceedings for POPL 1974, so the year file is absent. 2024 was already present from the Phase 1 vertical slice; this commit overwrites it with the same content from a fresh run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DBLP returns 'Connection reset by peer' under concurrent load, and Semantic Scholar returns 429s. Cap DBLP at one in-flight request with a 1.5s minimum interval (process-global), Semantic Scholar at four concurrent. Bump retry budget to six attempts with a longer base backoff so transient resets don't fail a year. OpenAlex with mailto remains ungated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
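One way to express "one in-flight request with a minimum interval" is a process-global gate used as a context manager. The sketch below is hypothetical (`MinIntervalGate` is not a name from the harvester) and measures the interval from the end of one request to the start of the next; a `threading.BoundedSemaphore(4)` would play the analogous role for Semantic Scholar.

```python
import threading
import time
from typing import Callable, Optional


class MinIntervalGate:
    """Serialise callers and keep at least min_interval_s between requests."""

    def __init__(
        self,
        min_interval_s: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self._lock = threading.Lock()
        self._min_interval_s = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._last_finish: Optional[float] = None

    def __enter__(self) -> "MinIntervalGate":
        self._lock.acquire()  # only one request in flight at a time
        if self._last_finish is not None:
            wait = self._min_interval_s - (self._clock() - self._last_finish)
            if wait > 0:
                self._sleep(wait)
        return self

    def __exit__(self, exc_type, exc, tb) -> bool:
        self._last_finish = self._clock()
        self._lock.release()
        return False  # never swallow exceptions from the request


# Process-global gate for DBLP; 1.5 s is the interval quoted in the commit message.
DBLP_GATE = MinIntervalGate(1.5)
```

Injecting `clock` and `sleep` is only there so the spacing logic can be exercised without real waiting.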

39 years of PLDI proceedings harvested via DBLP, OpenAlex, and Semantic Scholar fallback. PLDI did not run in 1986; DBLP has no entry for 2026 yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

30 years of ICFP proceedings harvested via DBLP, OpenAlex, and Semantic Scholar fallback. ICFP started in 1996; DBLP has no entry for 2026 yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

DBLP's search API at /search/publ/api throttles aggressively under sustained load — empirically a single venue's 40-year backfill can trip a 30+ minute IP-level rate limit that no exponential backoff works around. The static '.xml' TOC files at db/conf/<slug>/<slug><year>.xml and db/journals/pacmpl/pacmpl<vol>.xml carry the same fields (DOI, title, authors, key, year, PACMPL number) and are not subject to the search-API rate limit. Parse the static XML for each TOC, convert to the search-API 'info' shape so the rest of the pipeline is unchanged, and fall back to the search API only if the static fetch fails. The on-disk cache key is shared between the two paths, so subsequent runs hit cache regardless of which path produced the data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
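As a rough illustration of converting a static TOC record into a search-API-like dict, consider the sketch below. The XML layout in `SAMPLE`, the output field names, and the placeholder key and DOI are all assumptions made for the example; the real DBLP files may also use character entities that need the DBLP DTD, which this sketch does not handle.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

DOI_PREFIX = "https://doi.org/"
RECORD_TAGS = {"inproceedings", "article"}


def toc_xml_to_infos(xml_text: str) -> List[Dict]:
    """Flatten publication records into dicts shaped like search-API 'info'."""
    infos: List[Dict] = []
    for rec in ET.fromstring(xml_text).iter():
        if rec.tag not in RECORD_TAGS:
            continue
        title_el = rec.find("title")
        ee_urls = [ee.text or "" for ee in rec.findall("ee")]
        doi = next(
            (url[len(DOI_PREFIX):] for url in ee_urls if url.startswith(DOI_PREFIX)),
            None,
        )
        infos.append({
            "key": rec.get("key"),
            # Titles can contain inline markup, so join all text fragments.
            "title": "".join(title_el.itertext()) if title_el is not None else "",
            "year": rec.findtext("year"),
            "doi": doi,
            "authors": {"author": [{"text": a.text} for a in rec.findall("author")]},
        })
    return infos


# Hypothetical record; the key and DOI are placeholders, not real entries.
SAMPLE = """<bht><dblpcites>
<r><inproceedings key="conf/popl/Example10">
<author>A. Author</author><author>B. Author</author>
<title>An <i>Example</i> Title.</title>
<year>2010</year>
<ee>https://doi.org/10.1145/0000000.0000000</ee>
</inproceedings></r>
</dblpcites></bht>"""
```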

23 of ~40 years of OOPSLA proceedings harvested. DBLP rate-limited the IP after 2008; remaining years (2009-2025) will follow in a subsequent commit once the rate limit clears. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Only ESOP 1992 survived the no-abstract filter so far for pre-2010 Springer DOIs (OpenAlex/Semantic Scholar abstracts are sparse for old Springer LNCS). Will resume harvesting after DBLP rate limit clears. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

dblp.org throttles aggressively under sustained load — even after moving to the static XML TOCs, sustained harvesting can hit a 30+ minute IP-level block. dblp publishes two long-running mirrors at dblp.uni-trier.de and dblp.dagstuhl.de that serve the same content under separate quotas. Round-robin DBLP requests across all three on each retry attempt, so a primary outage doesn't stall the bulk ingest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
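The per-attempt rotation can be sketched as below. `fetch_with_mirror_rotation` and its `fetch` parameter are hypothetical, and backoff between attempts is omitted to keep the example short; only the three host names come from the commit message.

```python
from typing import Callable, Optional

DBLP_HOSTS = ("dblp.org", "dblp.uni-trier.de", "dblp.dagstuhl.de")


def fetch_with_mirror_rotation(
    path: str,
    fetch: Callable[[str], str],
    max_attempts: int = 6,
) -> str:
    """Retry a DBLP request, moving to the next mirror on every attempt."""
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        host = DBLP_HOSTS[attempt % len(DBLP_HOSTS)]
        try:
            return fetch(f"https://{host}{path}")
        except OSError as exc:  # includes ConnectionResetError
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed for {path}") from last_error
```

Because the host index is derived from the attempt number, a block on the primary costs one failed attempt rather than the whole retry budget.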

Completes the OOPSLA back-catalogue. 17 more years (2009-2025) harvested via the static-XML path with mirror rollover. Combined with the earlier 1986-2008 commit, OOPSLA spans 1986 to 2025. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The unauthenticated SS API throttles to a few RPS and 429s repeatedly on bulk lookups, particularly for old Springer LNCS DOIs that have no abstract anywhere anyway. The default 6-attempt exponential backoff burns ~90s per paper before giving up — at 20-30 papers per ESOP/ECOOP year that wedges bulk ingest into many-hours-per-venue territory. OpenAlex (which has the abstract for most modern DOIs) is the primary source; SS is a fallback for the cases OpenAlex misses. If SS also throttles, we cleanly drop the paper rather than burn the budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

19 of 35 indexed years yielded papers with abstracts. Pre-2018 ESOP proceedings were published by Springer LNCS, which exposes neither abstracts to OpenAlex nor reliably to Semantic Scholar; most of the intervening years (1986-2017 except a handful) were ingested as 0-2 papers each. Post-2018 LIPIcs proceedings have full abstracts and yield 20-36 papers per year. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ESOP 2026 was published while the harvester was running on earlier years. 5 papers had abstracts in OpenAlex/SS at run time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

12 of 38 indexed years yielded papers with abstracts. ECOOP was a Springer LNCS conference until 2015 when it migrated to LIPIcs; nearly all 1987-2014 abstracts are missing from OpenAlex and Semantic Scholar (the publishers don't expose them), so those years mostly produce 0 or 1 paper. Post-2015 LIPIcs proceedings have abstracts and yield 22-43 papers per year. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A handful of PL conference abstracts (notably PACMPL/LIPIcs 'Journal-first' submissions whose 'abstract' is effectively the full body of the journal extension) exceed the embedding model's ~1536-token window. Previously the entire embed batch aborted with 'At least one of the texts is too long to embed', which in practice meant a single bad row halted bulk consume of an entire venue. Clip oversize texts to the model's word budget and continue. A head excerpt is good enough for semantic retrieval; a hard fail on the whole batch is not. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
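A head-excerpt clip of this kind is only a few lines. The sketch below is an assumption about the shape of the fix (`clip_to_word_budget` is a hypothetical name, and the real word budget is not stated here):

```python
def clip_to_word_budget(text: str, max_words: int) -> str:
    """Keep the first max_words whitespace-separated words of text."""
    words = text.split()
    if len(words) <= max_words:
        return text  # short texts pass through untouched
    return " ".join(words[:max_words])
```

Applying it per text before building the batch means one oversize abstract no longer fails the whole batch.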

19 of 34 indexed years yielded papers with abstracts. Like ECOOP and ESOP, CC was a Springer LNCS conference; it migrated to ACM in 2016, and pre-2016 abstracts are mostly missing from OpenAlex/Semantic Scholar. Post-2016 yields 14-29 papers per year. CC 2026 was published while the harvester was running. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

OpenAlex's daily quota 429 returns Retry-After = remaining seconds in the day, which can be 7+ hours. Honoring that literally wedges the bulk harvester. Cap the requested wait at max(backoff, 60s) so we burn at most the normal backoff budget before giving up on a DOI and moving on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
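The clamp described above amounts to taking the smaller of the server's requested wait and a local ceiling. A minimal sketch, with `capped_wait` as a hypothetical helper name:

```python
from typing import Optional


def capped_wait(
    retry_after_s: Optional[float],
    backoff_s: float,
    floor_s: float = 60.0,
) -> float:
    """Seconds to sleep before the next retry attempt."""
    if retry_after_s is None:
        return backoff_s  # no header: fall back to normal backoff
    ceiling = max(backoff_s, floor_s)
    return min(retry_after_s, ceiling)
```

With a 7-hour `Retry-After` and an 8-second backoff, the wait comes out at 60 seconds instead of 25,200.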

OpenAlex's daily quota 429 returns Retry-After of multiple hours. Even with the wait clamped, six retries-per-paper × dozens of papers-per-year stalls bulk harvesting. Drop to 2 attempts: try once, give up, and fall through to Semantic Scholar. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

7 of 25 indexed years yielded papers. Haskell papers are ACM SIGPLAN with DOIs in the 10.1145 range; OpenAlex usually has the abstracts. The bulk run for 2010-2025 hit OpenAlex's 10k/day quota (daily reset at UTC midnight) and could not be completed within this session. Resume after quota refresh; the ESOP/ECOOP/CC caches are warm so they won't re-cost OpenAlex calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-05-10T17:12:24Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
oversight	Ready	Preview, Comment	May 12, 2026 3:47pm

OpenAlex returns ``publication_date`` as the journal release date, which for ACM-published PACMPL volumes is December of the year *before* the conference (POPL 2018 → 2017-12-27, POPL 2020 → 2019-12-20). Pre-PACMPL POPL 2015 has the same shape (ACM dated the proceedings 2014-12-19). The harvester previously wrote this date straight to the JSON ``date`` field. Paper.from_scraped_json then stored it as ``update_date`` in the DB, so year-bucketing queries over ``update_date`` saw POPL 2018 papers as 2017 papers — making POPL 2015/2017/2018/2020 appear absent under their conference year. PLDI/ICFP/OOPSLA happen mid-year so the bug is invisible there. Snap any out-of-year publication_date back to ``YYYY-01-01`` for the conference year. In-year dates are preserved unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
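The snapping rule can be written as a small pure function. This is a sketch under the assumption that dates arrive as ISO `YYYY-MM-DD` strings; `anchor_to_conference_year` is a hypothetical name.

```python
from datetime import date


def anchor_to_conference_year(publication_date: str, conf_year: int) -> str:
    """Return the date unchanged if in-year, else January 1 of conf_year."""
    parsed = date.fromisoformat(publication_date)
    if parsed.year == conf_year:
        return publication_date
    return f"{conf_year:04d}-01-01"
```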

Re-runs of PLConferenceHarvester for POPL 2015/2018/2020 with the date-normalisation fix in place. All three years now carry dates within the conference year (2015-01-15 / 2018-01-01 / 2020-01-01) instead of the journal-release date OpenAlex returns (December of the prior year), restoring them to the year-bucketing inventory query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Re-run of PLConferenceHarvester for POPL 2017 with the date-normalisation fix in place. All 66 papers now carry 2017-01-01 instead of the journal-release date 2016-12-22 OpenAlex returns, moving them out of the 2016 bucket (128 -> 62 real POPL 2016) and into 2017 (0 -> 66) in the year-bucketing inventory query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The /api/search endpoint hardcodes its source allowlist in three places (GET param parsing, sources_flags iteration, default-everything fallback) and the recently ingested PL venues (POPL, PLDI, ICFP, OOPSLA, ESOP, ECOOP, CC, Haskell) were absent from all of them. Net effect: 8,032 PL papers were unreachable from the frontend even with no filters applied. Adds the eight PL venues alongside the existing AI/Systems venues in all three places. No behavioral change for existing sources. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The arxiv embedding pass previously only embedded papers in cs.AI, cs.CL, cs.LG, and cs.MA. Logic (cs.LO) and Programming Languages (cs.PL) papers were ingested but never embedded, so they were invisible to semantic search even though present in the paper table. Adds cs.LO and cs.PL to the eligible-for-embedding category list. A backfill embedding pass over already-ingested cs.LO and cs.PL rows needs to be run separately (out of scope for this commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds POPL, PLDI, ICFP, OOPSLA, ESOP, ECOOP, CC, and Haskell Symposium to the sidebar source filter alongside the existing AI and Systems groupings, plus the inventory display order. Mirrors the structure of the existing systems-conferences and AI-conferences blocks (collapsible group with per-venue toggles and an indeterminate parent checkbox). Pairs with the API change that exposes PL venues to /api/search. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Resumes the Haskell ingest that was partial in the original PR (only 2000-2009 had been fetched before OpenAlex's daily quota cut off the Phase 2 run). After the quota reset, a follow-up harvester run filled in 2010-2025 plus revised 2005/2006/2009 with the date-normalization fix. Total Haskell coverage in the DB is now 326 papers spanning 2001-2025, up from 48. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The /api/search endpoint hardcoded the conference list four separate times (GET param parsing, _build_filters per-group iteration, and the default-everything fallback). The grouping into AI / Systems / PL is a frontend UI concern; the backend filter just needs a flat set of valid source names. Extracted to module-level KNOWN_CONFERENCES and KNOWN_SOURCES constants; _build_filters becomes a four-line list-comprehension. No behavioural change — same sources accepted, same defaults. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds oversight/sync/pl target that runs the PL harvester with --skip-existing-doi (cheap incremental behaviour — only OpenAlex/SS lookups for new DOIs) and then consumes the resulting JSON into the DB. make oversight/sync now depends on both oversight/sync/arxiv (existing) and oversight/sync/pl, so the daily cron picks up new PL volumes when DBLP indexes them. Updates docs/pl-conferences-plan.md Phase 3 to reflect this simpler Make-target approach rather than the SourcePoller protocol earlier versions of the plan proposed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The list was originally cs.AI/CL/LG/MA and got extended to include cs.LO and cs.PL, at which point "ai" stopped describing the contents. More importantly the name conflated two concepts: it's specifically the arxiv-side embed gate (conferences bypass it entirely and are embedded unconditionally via embed_missing_conference_papers).

Renamed:
- ai_categories -> arxiv_embed_categories
- get_unembedded_arxiv_ai_papers -> get_unembedded_arxiv_papers
- _embed_missing_ai_papers -> _embed_missing_arxiv_papers

No behavioural change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Previously the slider defaulted to 3 years and the backend's fallback was 5 years, both shorter than the age of foundational papers in the corpus (POPL 1973 onwards). Searching by title for a classic like "Pure subtype systems" (Hutchins, POPL 2010) silently dropped it because of the time filter, which is surprising UX.
- Adds an "All time" step to the slider (36500 days, displayed as "All time") and makes it the default.
- Updates the backend's default-when-omitted to match.

A 100-year sentinel covers everything in the corpus and gives room for future earlier ingest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

charlielidbury · 2026-05-12T15:50:11Z

+// (POPL 1973) would do.
+const TIME_STEPS = [7, 14, 30, 90, 180, 365, 730, 1095, 1825, 2555, 3650, 36500];
+const ALL_TIME_DAYS = 36500;
+const DEFAULT_TIME_INDEX = TIME_STEPS.length - 1; // "all time"


note: i did this because PL papers are often quite old, @ottowhite do you want it to stay at 5 years?

charlielidbury and others added 23 commits May 10, 2026 16:24

feat: ingest PLDI back-catalogue (1987-2025)

aca7126

feat: ingest ICFP back-catalogue (1996-2025)

d70687e

feat: ingest ESOP 2026

9030145

charlielidbury marked this pull request as draft May 10, 2026 17:23

vercel Bot deployed to Preview May 10, 2026 19:14 View deployment

charlielidbury marked this pull request as ready for review May 12, 2026 14:54

vercel Bot deployed to Preview May 12, 2026 14:55 View deployment

vercel Bot deployed to Preview May 12, 2026 15:16 View deployment

charlielidbury and others added 8 commits May 12, 2026 17:27

charlielidbury force-pushed the pl-conferences branch from 2ebab67 to c681f7e Compare May 12, 2026 15:29

vercel Bot deployed to Preview May 12, 2026 15:29 View deployment

vercel Bot deployed to Preview May 12, 2026 15:36 View deployment

vercel Bot deployed to Preview May 12, 2026 15:47 View deployment

charlielidbury requested a review from ottowhite May 12, 2026 15:51

charlielidbury self-assigned this May 12, 2026

charlielidbury mentioned this pull request May 12, 2026

feat: similarity graph UI with variable-thickness edges #11

Open

charlielidbury wants to merge 34 commits into main from pl-conferences
