
feat: ingest PL conference proceedings + unified oversight sync #10

Open
charlielidbury wants to merge 34 commits into main from pl-conferences


Conversation

@charlielidbury
Collaborator

Summary

Adds an ingest pipeline for the eight major Programming Languages conferences (POPL, PLDI, ICFP, OOPSLA, ESOP, ECOOP, CC, Haskell Symposium) and refactors sync into a unified `oversight sync` command backed by a `SourcePoller` registry.

Motivation: the existing arxiv-CS harvester misses any paper that was never deposited on arXiv. "Pure Subtype Systems" (Hutchins, POPL 2010) was the canonical example — foundational PL work, only in ACM DL. With this branch it's now in the DB and the #1 hit for a semantic search of "pure subtype systems".

What changed

New harvester (`src/oversight/PLConferenceHarvester.py`, ~1.3k lines) — pure HTTP, no LLM. Pipeline: DBLP for paper lists → OpenAlex for abstracts → Semantic Scholar fallback → skip-with-log if both miss. Handles PACMPL volume mapping for post-2017 SIGPLAN venues, conf-style TOCs for pre-2017 + Tier 2, three DBLP mirrors with retry rotation, per-source retry budgets and rate limits, on-disk cache at `.cache/pl_conferences/`.
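
The fallback order above can be sketched roughly as follows. This is an illustration only, not the code in `PLConferenceHarvester.py`: `resolve_abstract`, the `AbstractSource` signature, and the two stand-in lookup functions are hypothetical names, and the real harvester also handles caching, rate limits, and retries.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("pl_harvest_sketch")

# Hypothetical signature: a source maps a DOI to an abstract, or None on a miss.
AbstractSource = Callable[[str], Optional[str]]


def resolve_abstract(
    doi: str, sources: list[tuple[str, AbstractSource]]
) -> Optional[str]:
    """Try each source in order and return the first non-empty abstract."""
    for name, lookup in sources:
        try:
            abstract = lookup(doi)
        except Exception as exc:  # one failing source should not abort the paper
            logger.warning("%s lookup failed for %s: %s", name, doi, exc)
            continue
        if abstract and abstract.strip():
            return abstract.strip()
    # Both sources missed: log and let the caller skip the paper.
    logger.info("skipping %s: no abstract from any source", doi)
    return None


# Offline stand-ins for the OpenAlex and Semantic Scholar lookups (made-up data).
def fake_openalex(doi: str) -> Optional[str]:
    return {"10.1145/a": "An abstract."}.get(doi)


def fake_semantic_scholar(doi: str) -> Optional[str]:
    return {"10.1145/b": "Fallback abstract."}.get(doi)


SOURCES = [("openalex", fake_openalex), ("semantic_scholar", fake_semantic_scholar)]
```

Keeping the sources as an ordered list means the primary/fallback order is data rather than control flow, which is one way to keep the skip-with-log path in a single place.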

Sync refactor (`SourcePoller.py`, `source_registry.py`, `ArxivPoller.py`, `PLConfPoller.py`, `cli.py`) — `oversight sync` with `--sources`, `--backfill`, `--dry-run`. Per-poller watermarks. `make oversight/sync` now routes through this. ML and Systems pollers are explicit `NotImplementedPoller` stubs (out of scope here, follow-up phase).
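
For readers who have not opened the diff, a registry of this kind might look roughly like the sketch below. Only the names `SourcePoller` and `NotImplementedPoller` come from this PR; the method names (`watermark`, `poll`), `FixedPoller`, `REGISTRY`, and `sync` are assumptions made for illustration, and the dates are made up.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Protocol


class SourcePoller(Protocol):
    """Assumed shape of a poller: a name, a watermark, and a poll step."""

    name: str

    def watermark(self) -> Optional[date]: ...

    def poll(self, dry_run: bool = False) -> int: ...


@dataclass
class NotImplementedPoller:
    """Explicit stub for sources that have not been migrated yet."""

    name: str

    def watermark(self) -> Optional[date]:
        return None

    def poll(self, dry_run: bool = False) -> int:
        raise NotImplementedError(f"{self.name}: poller not migrated yet")


@dataclass
class FixedPoller:
    """Stand-in poller with canned state so the sketch runs offline."""

    name: str
    last_seen: date
    pending: int

    def watermark(self) -> Optional[date]:
        return self.last_seen

    def poll(self, dry_run: bool = False) -> int:
        if dry_run:
            return 0  # report only, ingest nothing
        ingested, self.pending = self.pending, 0
        return ingested


REGISTRY: dict[str, SourcePoller] = {
    "arxiv": FixedPoller("arxiv", date(2026, 5, 10), pending=3),
    "pl": FixedPoller("pl", date(2026, 5, 10), pending=5),
    "ml": NotImplementedPoller("ml"),
    "systems": NotImplementedPoller("systems"),
}


def sync(sources: list[str], dry_run: bool = False) -> dict[str, str]:
    """Run each requested poller and collect a one-line report per source."""
    report = {}
    for name in sources:
        poller = REGISTRY[name]
        try:
            ingested = poller.poll(dry_run=dry_run)
            report[name] = f"watermark={poller.watermark()} ingested={ingested}"
        except NotImplementedError as exc:
            report[name] = f"skipped: {exc}"
    return report
```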

Bug fixes along the way (each in its own commit):

  • `fix: anchor PLConferenceHarvester paper date to conference year` — PACMPL volumes are published in December of the prior calendar year, so OpenAlex's `publication_date` was bucketing POPL 2018 papers under 2017 etc. Snap out-of-year dates to `<conf_year>-01-01`. Recovered POPL 2015/2017/2018/2020 (~250 papers re-bucketed).
  • `fix: truncate over-long abstracts before embedding` — single 1.2k-word abstract was aborting whole embed batches.
  • `fix: cap OpenAlex/Semantic Scholar retries` — multi-hour Retry-After headers and aggressive throttling were burning ~90s per paper on default retry chains.

Outcome

| Venue | Papers | Years |
|---|---|---|
| OOPSLA | 2,280 | 1986–2025 |
| POPL | 2,179 | 1973–2026 |
| PLDI | 1,710 | 1987–2025 |
| ICFP | 1,010 | 1996–2025 |
| ECOOP | 366 | 1987–2025 |
| ESOP | 232 | 1986–2026 |
| CC | 207 | 1988–2026 |
| Haskell | 48 | 2000–2009 |
| Total | 8,032 | |

All 8,032 embedded; zero unembedded. Hutchins POPL 2010 verified as the top semantic-search hit for "pure subtype systems".

Known gaps (deferred)

  • Haskell Symposium 2010–2025 — OpenAlex daily quota exhausted mid-run during ingest. Cache is warm; resumes incrementally with `oversight sync --sources pl` after the quota resets. Tracked separately.
  • Pre-LIPIcs ESOP/ECOOP/CC abstracts — Springer LNCS doesn't expose them via OpenAlex or Semantic Scholar. ~2.5k papers theoretically affected. Would need a Springer-direct scraper to recover.
  • arxiv-vs-proceedings duplicates — explicitly deferred per plan; will quantify and decide on dedup strategy with measured data once everything's loaded.
  • ML / Systems poller migration — `OpenReviewHarvester` and `superscraper` stay on the old ad-hoc path until a follow-up phase migrates them into the registry.
  • arxiv embedding behavioral delta — old path called `_embed_missing_ai_papers` from inside `ArXivRepository.sync`; new path doesn't, and the unified end-of-sync embedding pass uses `embed_missing_conference_papers` which skips `source='arxiv'`. Worth verifying on next sync that arxiv embeddings still happen.

Plan and design rationale: `docs/pl-conferences-plan.md`.

Test plan

  • `uv run pytest tests/test_source_poller.py` — 12/12 pass locally.
  • `make format && make typecheck` — both clean (pre-commit hooks enforce this on every commit).
  • `uv run oversight sync --dry-run` — both arxiv and pl pollers report sane watermarks; ml/systems show explicit NotImplementedPoller messages.
  • `uv run oversight sync --sources arxiv` — incremental near-noop given recent sync.
  • DB: `SELECT source, COUNT(*) FROM paper WHERE source IN ('POPL','PLDI','ICFP','OOPSLA','ESOP','ECOOP','CC','Haskell') GROUP BY source` matches the table above.
  • DB: `SELECT title FROM paper WHERE title ILIKE '%pure subtype%'` returns Hutchins POPL 2010 alongside the 2024 follow-up.
  • Spot-check `oversight sync --sources pl --backfill` doesn't trigger an accidental full re-pull (it should refuse without explicit subset, or behave as advertised).

🤖 Generated with Claude Code

charlielidbury and others added 23 commits May 10, 2026 16:24
Captures the gap surfaced by Hutchins POPL 2010 (paper never on arxiv)
and proposes a DBLP + OpenAlex pipeline covering POPL, PLDI, ICFP,
OOPSLA, ESOP, ECOOP, CC, and Haskell Symposium. Also sketches a
SourcePoller registry so PL ingest folds into a single `oversight sync`
command rather than a per-source CLI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces PLConferenceHarvester, a per-(venue, year) ingester that pulls a
volume's table of contents from DBLP (search API, since the per-volume JSON
endpoint 404s for PACMPL) and enriches each entry via OpenAlex, with
Semantic Scholar as the abstract fallback. Emits papers in the same
"scraped" JSON shape the existing consume path reads, so no PaperRepository
changes are needed.

Hardcoded to ("popl", 2024) for the Phase 1 vertical slice; the constructor
is already (venue, year) so Phase 2 can drive the full back-catalogue.

Also gitignores the local API response cache directory.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 93 papers in the POPL issue of PACMPL volume 8, produced by running
`python -m oversight.PLConferenceHarvester`. Every paper has a DOI, a
real abstract from OpenAlex, and a 2024-01-02 publication date; no DBLP
entries had to be skipped.

Following the convention in data/systems_conferences/ and data/vldb/, the
file is committed to the repo so the consume path is reproducible without
re-hitting external APIs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drives off a small VENUES registry and discovers per-venue (year, TOC)
pairs from DBLP's index.xml — no per-venue branching. Handles the two
DBLP TOC schemes (PACMPL journal vs conf/<slug>/<year>) data-driven.

Also fixes parsing edge cases surfaced while expanding past POPL 2024:
- <h2> headings with embedded tags (CC@<ref>ETAPS</ref> ...) now match
  via DOTALL non-greedy.
- 2-digit-year proceedings keys (conf/popl/77 etc) are recognised in
  addition to 4-digit ones, recovering POPL 1974/1977/1979/1981/1982.

The CLI now iterates every venue × every DBLP-indexed year by default,
caches DBLP search-API responses and OpenAlex/SS replies under
.cache/pl_conferences/, supports --year/--year-min/--year-max scoping,
and a --skip-existing-doi switch that pre-loads non-arxiv DOIs from the
DB so we don't waste OpenAlex lookups.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
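
A minimal sketch of the 2-digit-year handling described in the commit above. The regex and the helper name `proceedings_year` are hypothetical, and the sketch assumes two-digit keys only occur for 20th-century proceedings.

```python
import re
from typing import Optional

# Assumed key shape: conf/<slug>/<year>, where <year> is 4 or 2 digits.
_KEY_RE = re.compile(r"^conf/(?P<slug>[a-z0-9]+)/(?P<year>\d{4}|\d{2})$")


def proceedings_year(key: str) -> Optional[int]:
    """Return the proceedings year encoded in a DBLP-style key, else None."""
    match = _KEY_RE.match(key)
    if match is None:
        return None
    digits = match.group("year")
    if len(digits) == 4:
        return int(digits)
    # Assumption: a two-digit year such as "77" means 1977, never 2077.
    return 1900 + int(digits)
```
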
Run per-paper OpenAlex/Semantic-Scholar lookups across a thread pool
(default 16 workers) and process years concurrently from the CLI driver
(default 4 workers). Each worker uses a thread-local requests Session.
Drops the per-call sleep (request_delay_s default 0.0) since bounded
concurrency plus the existing exponential-backoff retry path is enough
to stay polite.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
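
The thread-local session pattern from the commit above, sketched offline. `FakeSession` stands in for `requests.Session` so the example runs without network access; `session`, `enrich`, and `enrich_all` are hypothetical names, not the harvester's own.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

_local = threading.local()


class FakeSession:
    """Stand-in for requests.Session; records which thread created it."""

    def __init__(self) -> None:
        self.owner = threading.get_ident()

    def get_abstract(self, doi: str) -> str:
        return f"abstract for {doi}"


def session() -> FakeSession:
    """Return this thread's session, creating it on first use."""
    current = getattr(_local, "session", None)
    if current is None:
        current = _local.session = FakeSession()
    return current


def enrich(doi: str) -> tuple[str, str]:
    s = session()
    # The session was created on this worker thread and is never shared.
    assert s.owner == threading.get_ident()
    return doi, s.get_abstract(doi)


def enrich_all(dois: list[str], max_workers: int = 16) -> dict[str, str]:
    """Look up every DOI on a bounded thread pool."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(enrich, dois))
```
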
Harvested via PLConferenceHarvester into data/pl_conferences/popl/.
DBLP has no proceedings for POPL 1974, so the year file is absent.
2024 was already present from the Phase 1 vertical slice; this commit
overwrites it with the same content from a fresh run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DBLP returns 'Connection reset by peer' under concurrent load, and
Semantic Scholar returns 429s. Cap DBLP at one in-flight request with a
1.5s minimum interval (process-global), Semantic Scholar at four
concurrent. Bump retry budget to six attempts with a longer base
backoff so transient resets don't fail a year. OpenAlex with mailto
remains ungated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
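
One way to express "one in-flight request with a minimum interval" is a lock-backed gate, sketched below. `MinIntervalGate`, `DBLP_GATE`, and `SS_SLOTS` are hypothetical names; the injectable `clock` and `sleep` parameters exist only so the sketch can be exercised without real waiting.

```python
import threading
import time


class MinIntervalGate:
    """Allow one caller at a time and space request starts by a minimum gap."""

    def __init__(self, min_interval_s, clock=time.monotonic, sleep=time.sleep):
        self._min = min_interval_s
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_ok = float("-inf")  # the first caller never waits

    def __enter__(self):
        self._lock.acquire()  # held for the whole request: one in flight
        wait = self._next_ok - self._clock()
        if wait > 0:
            self._sleep(wait)
        self._next_ok = self._clock() + self._min
        return self

    def __exit__(self, exc_type, exc, tb):
        self._lock.release()
        return False


# Process-global gates mirroring the limits named in the commit message.
DBLP_GATE = MinIntervalGate(1.5)
SS_SLOTS = threading.BoundedSemaphore(4)  # at most four concurrent SS calls
```

A caller would wrap each DBLP request in `with DBLP_GATE:` and each Semantic Scholar request in `with SS_SLOTS:`.
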
39 years of PLDI proceedings harvested via DBLP, OpenAlex, and
Semantic Scholar fallback. PLDI did not run in 1986; DBLP has no entry
for 2026 yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
30 years of ICFP proceedings harvested via DBLP, OpenAlex, and
Semantic Scholar fallback. ICFP started in 1996; DBLP has no entry
for 2026 yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DBLP's search API at /search/publ/api throttles aggressively under
sustained load — empirically a single venue's 40-year backfill can
trip a 30+ minute IP-level rate limit that no exponential backoff
works around. The static '.xml' TOC files at db/conf/<slug>/<slug><year>.xml
and db/journals/pacmpl/pacmpl<vol>.xml carry the same fields (DOI,
title, authors, key, year, PACMPL number) and are not subject to the
search-API rate limit.

Parse the static XML for each TOC, convert to the search-API 'info'
shape so the rest of the pipeline is unchanged, and fall back to the
search API only if the static fetch fails. The on-disk cache key is
shared between the two paths, so subsequent runs hit cache regardless
of which path produced the data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
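
A rough sketch of converting a static TOC file into per-paper dicts. The XML layout in `SAMPLE_TOC` is a simplified assumption about the DBLP format, the record in it is made up, and the output keys are illustrative rather than the exact search-API 'info' shape the harvester emits.

```python
import xml.etree.ElementTree as ET

DOI_PREFIX = "https://doi.org/"

# Made-up record in an assumed, simplified DBLP TOC layout.
SAMPLE_TOC = """<bht>
  <dblpcites>
    <r><inproceedings key="conf/example/Author10">
      <author>A. Author</author>
      <author>B. Author</author>
      <title>Example paper.</title>
      <year>2010</year>
      <ee>https://doi.org/10.1145/0000000.0000001</ee>
    </inproceedings></r>
  </dblpcites>
</bht>"""


def toc_xml_to_infos(xml_text: str) -> list[dict]:
    """Extract key, title, authors, year, and DOI from each paper record."""
    root = ET.fromstring(xml_text)
    infos = []
    for tag in ("inproceedings", "article"):  # conference and journal records
        for rec in root.iter(tag):
            # Take the first electronic-edition link that is a DOI URL.
            doi = next(
                (
                    ee.text[len(DOI_PREFIX):]
                    for ee in rec.findall("ee")
                    if ee.text and ee.text.startswith(DOI_PREFIX)
                ),
                None,
            )
            title_el = rec.find("title")
            title = "".join(title_el.itertext()).strip() if title_el is not None else ""
            infos.append(
                {
                    "key": rec.get("key"),
                    "title": title,
                    "authors": [a.text for a in rec.findall("author") if a.text],
                    "year": rec.findtext("year"),
                    "doi": doi,
                }
            )
    return infos
```
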
23 of ~40 years of OOPSLA proceedings harvested. DBLP rate-limited
the IP after 2008; remaining years (2009-2025) will follow in a
subsequent commit once the rate limit clears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Only ESOP 1992 survived the no-abstract filter so far for pre-2010
Springer DOIs (OpenAlex/Semantic Scholar abstracts are sparse for
old Springer LNCS). Will resume harvesting after DBLP rate limit
clears.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dblp.org throttles aggressively under sustained load — even after
moving to the static XML TOCs, sustained harvesting can hit a 30+
minute IP-level block. dblp publishes two long-running mirrors at
dblp.uni-trier.de and dblp.dagstuhl.de that serve the same content
under separate quotas. Round-robin DBLP requests across all three on
each retry attempt, so a primary outage doesn't stall the bulk
ingest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
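
The mirror rotation can be sketched as picking the host by attempt index. `fetch_with_rotation` and its `fetch` callback are hypothetical, and the backoff sleep between attempts is left out to keep the example short.

```python
from typing import Callable

# The three hosts named in the commit message above.
DBLP_HOSTS = ["dblp.org", "dblp.uni-trier.de", "dblp.dagstuhl.de"]


def fetch_with_rotation(
    path: str, fetch: Callable[[str], str], max_attempts: int = 6
) -> str:
    """Retry a DBLP path, moving to the next mirror on every attempt."""
    last_error = None
    for attempt in range(max_attempts):
        host = DBLP_HOSTS[attempt % len(DBLP_HOSTS)]
        try:
            return fetch(f"https://{host}{path}")
        except OSError as exc:  # includes ConnectionResetError
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed for {path}") from last_error
```
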
Completes the OOPSLA back-catalogue. 17 more years (2009-2025)
harvested via the static-XML path with mirror rollover. Combined
with the earlier 1986-2008 commit, OOPSLA spans 1986 to 2025.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unauthenticated SS API throttles to a few RPS and 429s repeatedly
on bulk lookups, particularly for old Springer LNCS DOIs that have no
abstract anywhere anyway. The default 6-attempt exponential backoff
burns ~90s per paper before giving up — at 20-30 papers per ESOP/ECOOP
year that wedges bulk ingest into many-hours-per-venue territory.

OpenAlex (which has the abstract for most modern DOIs) is the
primary source; SS is a fallback for the cases OpenAlex misses. If SS
also throttles, we cleanly drop the paper rather than burn the
budget.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
19 of 35 indexed years yielded papers with abstracts. Pre-2018 ESOP
proceedings were published by Springer LNCS, which exposes abstracts
neither to OpenAlex nor, reliably, to Semantic Scholar; most of the
intervening years (1986-2017 except a handful) were ingested as 0-2
papers each. Post-2018 LIPIcs proceedings have full abstracts and
yield 20-36 papers per year.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ESOP 2026 was published while the harvester was running on earlier
years. 5 papers had abstracts in OpenAlex/SS at run time.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 of 38 indexed years yielded papers with abstracts. ECOOP was a
Springer LNCS conference until 2015 when it migrated to LIPIcs;
nearly all 1987-2014 abstracts are missing from OpenAlex and
Semantic Scholar (the publishers don't expose them), so those years
mostly produce 0 or 1 paper. Post-2015 LIPIcs proceedings have
abstracts and yield 22-43 papers per year.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A handful of PL conference abstracts (notably PACMPL/LIPIcs
'Journal-first' submissions whose 'abstract' is effectively the
full body of the journal extension) exceed the embedding model's
~1536-token window. Previously the entire embed batch aborted
with 'At least one of the texts is too long to embed', which in
practice meant a single bad row halted bulk consume of an entire
venue.

Clip oversize texts to the model's word budget and continue. A
head excerpt is good enough for semantic retrieval; a hard fail
on the whole batch is not.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
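
A minimal sketch of head-truncation by word count. The helper names and the 1000-word default are assumptions; the actual budget depends on the embedding model and is not stated in this PR.

```python
def clip_to_word_budget(text: str, max_words: int) -> str:
    """Keep the first max_words whitespace-separated words of text."""
    words = text.split()
    if len(words) <= max_words:
        return text  # short texts pass through unchanged
    return " ".join(words[:max_words])


def clip_batch(texts: list[str], max_words: int = 1000) -> list[str]:
    """Clip every text so one oversize row cannot fail the whole batch."""
    return [clip_to_word_budget(t, max_words) for t in texts]
```
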
19 of 34 indexed years yielded papers with abstracts. Like ECOOP
and ESOP, CC was a Springer LNCS conference until 2016 when it
migrated to ACM; pre-2016 abstracts are mostly missing
from OpenAlex/Semantic Scholar. Post-2016 yields 14-29 papers per
year. CC 2026 was published while the harvester was running.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenAlex's daily quota 429 returns Retry-After = remaining seconds
in the day, which can be 7+ hours. Honoring that literally wedges
the bulk harvester. Cap the requested wait at max(backoff, 60s) so
we burn at most the normal backoff budget before giving up on a
DOI and moving on.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
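
One reading of the clamp described above, as a sketch: the server's requested wait is honoured only up to `max(backoff, 60s)`. `clamped_wait` is a hypothetical name, and treating a missing or date-valued `Retry-After` as "use the normal backoff" is an assumption.

```python
from typing import Optional


def clamped_wait(
    retry_after_header: Optional[str], backoff_s: float, floor_cap_s: float = 60.0
) -> float:
    """Seconds to sleep before retrying, never more than max(backoff, 60s)."""
    cap = max(backoff_s, floor_cap_s)
    if retry_after_header is None:
        requested = backoff_s
    else:
        try:
            requested = float(retry_after_header)
        except ValueError:
            # Retry-After may also be an HTTP-date; this sketch ignores that
            # form and falls back to the normal backoff.
            requested = backoff_s
    return min(max(requested, 0.0), cap)
```
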
OpenAlex's daily quota 429 returns Retry-After of multiple hours.
Even with the wait clamped, six retries-per-paper × dozens of
papers-per-year stalls bulk harvesting. Drop to 2 attempts: try,
retry once, then give up and fall through to Semantic Scholar.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 of 25 indexed years yielded papers. Haskell papers are ACM
SIGPLAN with DOIs in the 10.1145 range; OpenAlex usually has the
abstracts. The bulk run for 2010-2025 hit OpenAlex's 10k/day quota
(daily reset at UTC midnight) and could not be completed within
this session. Resume after quota refresh; the ESOP/ECOOP/CC
caches are warm so they won't re-cost OpenAlex calls.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vercel

vercel Bot commented May 10, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
|---|---|---|---|
| oversight | Ready | Preview, Comment | May 12, 2026 3:47pm |

OpenAlex returns ``publication_date`` as the journal release date, which
for ACM-published PACMPL volumes is December of the year *before* the
conference (POPL 2018 → 2017-12-27, POPL 2020 → 2019-12-20). Pre-PACMPL
POPL 2015 has the same shape (ACM dated the proceedings 2014-12-19).

The harvester previously wrote this date straight to the JSON ``date``
field. Paper.from_scraped_json then stored it as ``update_date`` in the
DB, so year-bucketing queries over ``update_date`` saw POPL 2018 papers
as 2017 papers — making POPL 2015/2017/2018/2020 appear absent under
their conference year. PLDI/ICFP/OOPSLA happen mid-year so the bug is
invisible there.

Snap any out-of-year publication_date back to ``YYYY-01-01`` for the
conference year. In-year dates are preserved unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
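
The snapping rule from the commit above, as a small sketch. `anchor_to_conference_year` is a hypothetical name, and mapping a missing or unparseable date to January 1 of the conference year is an assumption rather than something the commit states.

```python
from datetime import date


def anchor_to_conference_year(publication_date, conf_year: int) -> str:
    """Keep in-year dates; snap anything else to <conf_year>-01-01."""
    fallback = f"{conf_year:04d}-01-01"
    try:
        parsed = date.fromisoformat(publication_date)
    except (TypeError, ValueError):
        return fallback  # assumption: missing/garbled dates are also snapped
    if parsed.year == conf_year:
        return parsed.isoformat()
    return fallback
```
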
charlielidbury and others added 8 commits May 12, 2026 17:27
Re-runs of PLConferenceHarvester for POPL 2015/2018/2020 with the
date-normalisation fix in place. All three years now carry dates within
the conference year (2015-01-15 / 2018-01-01 / 2020-01-01) instead of
the journal-release date OpenAlex returns (December of the prior year),
restoring them to the year-bucketing inventory query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-run of PLConferenceHarvester for POPL 2017 with the date-normalisation
fix in place. All 66 papers now carry 2017-01-01 instead of the
journal-release date 2016-12-22 OpenAlex returns, moving them out of the
2016 bucket (128 -> 62 real POPL 2016) and into 2017 (0 -> 66) in the
year-bucketing inventory query.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /api/search endpoint hardcodes its source allowlist in three places
(GET param parsing, sources_flags iteration, default-everything fallback)
and the recently ingested PL venues (POPL, PLDI, ICFP, OOPSLA, ESOP,
ECOOP, CC, Haskell) were absent from all of them. Net effect: 8,032 PL
papers were unreachable from the frontend even with no filters applied.

Adds the eight PL venues alongside the existing AI/Systems venues in all
three places. No behavioral change for existing sources.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The arxiv embedding pass previously only embedded papers in cs.AI,
cs.CL, cs.LG, and cs.MA. Logic (cs.LO) and Programming Languages
(cs.PL) papers were ingested but never embedded, so they were invisible
to semantic search even though present in the paper table.

Adds cs.LO and cs.PL to the eligible-for-embedding category list. A
backfill embedding pass over already-ingested cs.LO and cs.PL rows
needs to be run separately (out of scope for this commit).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds POPL, PLDI, ICFP, OOPSLA, ESOP, ECOOP, CC, and Haskell Symposium
to the sidebar source filter alongside the existing AI and Systems
groupings, plus the inventory display order. Mirrors the structure of
the existing systems-conferences and AI-conferences blocks (collapsible
group with per-venue toggles and an indeterminate parent checkbox).

Pairs with the API change that exposes PL venues to /api/search.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resumes the Haskell ingest that was partial in the original PR (only
2000-2009 had been fetched before OpenAlex's daily quota cut off the
Phase 2 run). After the quota reset, a follow-up harvester run filled
in 2010-2025 plus revised 2005/2006/2009 with the date-normalization
fix. Total Haskell coverage in the DB is now 326 papers spanning
2001-2025, up from 48.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /api/search endpoint hardcoded the conference list three separate
times (GET param parsing, _build_filters per-group iteration, and
the default-everything fallback). The grouping into AI / Systems / PL
is a frontend UI concern; the backend filter just needs a flat set
of valid source names.

Extracted to module-level KNOWN_CONFERENCES and KNOWN_SOURCES
constants; _build_filters becomes a four-line list-comprehension.
No behavioural change — same sources accepted, same defaults.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
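
A sketch of what a flat allowlist with a short filter builder could look like. The constant names come from the commit message; their contents here are deliberately partial (only the eight PL venues and `arxiv`, with the AI and Systems venues omitted), and `build_filters` is a hypothetical stand-in for `_build_filters`.

```python
# Only the PL venue names are taken from this PR; the real constants also
# include the AI and Systems venues, which this sketch leaves out.
PL_VENUES = ["POPL", "PLDI", "ICFP", "OOPSLA", "ESOP", "ECOOP", "CC", "Haskell"]

KNOWN_CONFERENCES = frozenset(PL_VENUES)
KNOWN_SOURCES = KNOWN_CONFERENCES | {"arxiv"}


def build_filters(requested):
    """Return the valid requested sources, or every known source if none given."""
    if not requested:
        return sorted(KNOWN_SOURCES)  # default-everything fallback
    return sorted(s for s in requested if s in KNOWN_SOURCES)
```
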
Adds oversight/sync/pl target that runs the PL harvester with
--skip-existing-doi (cheap incremental behaviour — only OpenAlex/SS
lookups for new DOIs) and then consumes the resulting JSON into the DB.
make oversight/sync now depends on both oversight/sync/arxiv (existing)
and oversight/sync/pl, so the daily cron picks up new PL volumes when
DBLP indexes them.

Updates docs/pl-conferences-plan.md Phase 3 to reflect this simpler
Make-target approach rather than the SourcePoller protocol earlier
versions of the plan proposed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The list was originally cs.AI/CL/LG/MA and got extended to include
cs.LO and cs.PL — at which point "ai" stopped describing the contents.
More importantly the name conflated two concepts: it's specifically
the arxiv-side embed gate (conferences bypass it entirely and are
embedded unconditionally via embed_missing_conference_papers).

Renamed:
- ai_categories                  -> arxiv_embed_categories
- get_unembedded_arxiv_ai_papers -> get_unembedded_arxiv_papers
- _embed_missing_ai_papers       -> _embed_missing_arxiv_papers

No behavioural change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the slider defaulted to 3 years and the backend's fallback
was 5 years — both shorter than the age of foundational papers in the
corpus (POPL 1973 onwards). Searching by title for a classic like
"Pure subtype systems" (Hutchins, POPL 2010) silently dropped it
because of the time filter, which is surprising UX.

- Adds an "All time" step to the slider (36500 days, displayed as
  "All time") and makes it the default.
- Updates the backend's default-when-omitted to match.

A 100-year sentinel covers everything in the corpus and gives room
for future earlier ingest.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Comment thread frontend/pages/index.tsx
// (POPL 1973) would do.
const TIME_STEPS = [7, 14, 30, 90, 180, 365, 730, 1095, 1825, 2555, 3650, 36500];
const ALL_TIME_DAYS = 36500;
const DEFAULT_TIME_INDEX = TIME_STEPS.length - 1; // "all time"
Collaborator Author


note: i did this because PL papers are often quite old, @ottowhite do you want it to stay at 5 years?
