feat: ingest PL conference proceedings + unified oversight sync #10
Open
charlielidbury wants to merge 34 commits into
Captures the gap surfaced by Hutchins POPL 2010 (paper never on arxiv) and proposes a DBLP + OpenAlex pipeline covering POPL, PLDI, ICFP, OOPSLA, ESOP, ECOOP, CC, and Haskell Symposium. Also sketches a SourcePoller registry so PL ingest folds into a single `oversight sync` command rather than a per-source CLI. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Introduces PLConferenceHarvester, a per-(venue, year) ingester that pulls a
volume's table of contents from DBLP (search API, since the per-volume JSON
endpoint 404s for PACMPL) and enriches each entry via OpenAlex, with
Semantic Scholar as the abstract fallback. Emits papers in the same
"scraped" JSON shape the existing consume path reads, so no PaperRepository
changes are needed.
Hardcoded to ("popl", 2024) for the Phase 1 vertical slice; the constructor
is already (venue, year) so Phase 2 can drive the full back-catalogue.
Also gitignores the local API response cache directory.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All 93 papers from PACMPL Volume 8, Issue POPL, produced by running `python -m oversight.PLConferenceHarvester`. Every paper has a DOI, a real abstract from OpenAlex, and a 2024-01-02 publication date; no DBLP entries had to be skipped. Following the convention in data/systems_conferences/ and data/vldb/, the file is committed to the repo so the consume path is reproducible without re-hitting external APIs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drives off a small VENUES registry and discovers per-venue (year, TOC) pairs from DBLP's index.xml, with no per-venue branching. Handles the two DBLP TOC schemes (PACMPL journal vs conf/<slug>/<year>) in a data-driven way.
Also fixes parsing edge cases surfaced while expanding past POPL 2024:
- <h2> headings with embedded tags (CC@<ref>ETAPS</ref> ...) now match via DOTALL non-greedy.
- 2-digit-year proceedings keys (conf/popl/77 etc) are recognised in addition to 4-digit ones, recovering POPL 1974/1977/1979/1981/1982.
The CLI now iterates every venue × every DBLP-indexed year by default, caches DBLP search-API responses and OpenAlex/SS replies under .cache/pl_conferences/, and supports --year/--year-min/--year-max scoping plus a --skip-existing-doi switch that pre-loads non-arxiv DOIs from the DB so we don't waste OpenAlex lookups.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
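The 2-digit-year handling described above could look roughly like the sketch below. The regex, the function name `year_from_proceedings_key`, and the rule that 2-digit keys always mean 19xx are assumptions made for illustration; they are not taken from the harvester.

```python
import re
from typing import Optional

# Hypothetical pattern: "conf/<slug>/<year>" where <year> is 4 or 2 digits.
# The negative lookahead rejects keys whose numeric part has any other length.
_KEY_RE = re.compile(r"^conf/[a-z0-9]+/(?P<year>\d{4}|\d{2})(?!\d)")


def year_from_proceedings_key(key: str) -> Optional[int]:
    """Return the proceedings year encoded in a DBLP-style key, or None."""
    match = _KEY_RE.match(key)
    if match is None:
        return None
    raw = match.group("year")
    if len(raw) == 4:
        return int(raw)
    # Assumption: 2-digit keys only occur for 20th-century proceedings.
    return 1900 + int(raw)
```

A caller would pass each proceedings key found in the index and skip entries for which the function returns `None`.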
Run per-paper OpenAlex/Semantic-Scholar lookups across a thread pool (default 16 workers) and process years concurrently from the CLI driver (default 4 workers). Each worker uses a thread-local requests Session. Drops the per-call sleep (request_delay_s default 0.0) since bounded concurrency plus the existing exponential-backoff retry path is enough to stay polite. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
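As a rough illustration of the pattern in this commit, the sketch below fans lookups out over a thread pool while giving each worker thread its own session object. The names `get_session` and `enrich_all` are hypothetical, and the session is produced by a caller-supplied factory (in the real harvester this would presumably be `requests.Session`, which is not imported here to keep the sketch stdlib-only).

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, TypeVar

S = TypeVar("S")
R = TypeVar("R")

_local = threading.local()


def get_session(factory: Callable[[], S]) -> S:
    """Create one session per thread on first use, then reuse it."""
    session = getattr(_local, "session", None)
    if session is None:
        session = factory()
        _local.session = session
    return session


def enrich_all(
    dois: Iterable[str],
    lookup: Callable[[S, str], R],
    session_factory: Callable[[], S],
    max_workers: int = 16,  # default worker count stated in the commit message
) -> List[R]:
    """Run lookup(session, doi) for every DOI; results keep input order."""
    def work(doi: str) -> R:
        return lookup(get_session(session_factory), doi)

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(work, dois))
```

Because sessions live in thread-local storage, at most `max_workers` of them are created per pool, and no session is shared between threads.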
Harvested via PLConferenceHarvester into data/pl_conferences/popl/. DBLP has no proceedings for POPL 1974, so the year file is absent. 2024 was already present from the Phase 1 vertical slice; this commit overwrites it with the same content from a fresh run. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DBLP returns 'Connection reset by peer' under concurrent load, and Semantic Scholar returns 429s. Cap DBLP at one in-flight request with a 1.5s minimum interval (process-global), Semantic Scholar at four concurrent. Bump retry budget to six attempts with a longer base backoff so transient resets don't fail a year. OpenAlex with mailto remains ungated. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
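One way to express the two limits described here is a process-global gate for DBLP and a bounded semaphore for Semantic Scholar. This is a minimal sketch: the class name `MinIntervalGate` and the module-level names `DBLP_GATE` and `SS_SLOTS` are hypothetical, and only the numbers (one in-flight request, 1.5s interval, four concurrent) come from the commit message.

```python
import threading
import time


class MinIntervalGate:
    """Allow one caller at a time and space request starts by a minimum gap."""

    def __init__(self, min_interval_s: float) -> None:
        self._min_interval_s = min_interval_s
        self._lock = threading.Lock()
        self._last_start = float("-inf")  # monotonic seconds of the last start

    def __enter__(self) -> "MinIntervalGate":
        # The lock is held for the whole request, so only one is in flight.
        self._lock.acquire()
        wait = self._last_start + self._min_interval_s - time.monotonic()
        if wait > 0:
            time.sleep(wait)
        self._last_start = time.monotonic()
        return self

    def __exit__(self, *exc_info) -> bool:
        self._lock.release()
        return False  # never swallow exceptions from the request


# Process-global limits matching the commit message.
DBLP_GATE = MinIntervalGate(1.5)
SS_SLOTS = threading.BoundedSemaphore(4)
```

A DBLP request would then run inside `with DBLP_GATE:` and a Semantic Scholar request inside `with SS_SLOTS:`, leaving OpenAlex calls ungated.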
39 years of PLDI proceedings harvested via DBLP, OpenAlex, and Semantic Scholar fallback. PLDI did not run in 1986; DBLP has no entry for 2026 yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
30 years of ICFP proceedings harvested via DBLP, OpenAlex, and Semantic Scholar fallback. ICFP started in 1996; DBLP has no entry for 2026 yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
DBLP's search API at /search/publ/api throttles aggressively under sustained load — empirically a single venue's 40-year backfill can trip a 30+ minute IP-level rate limit that no exponential backoff works around. The static '.xml' TOC files at db/conf/<slug>/<slug><year>.xml and db/journals/pacmpl/pacmpl<vol>.xml carry the same fields (DOI, title, authors, key, year, PACMPL number) and are not subject to the search-API rate limit. Parse the static XML for each TOC, convert to the search-API 'info' shape so the rest of the pipeline is unchanged, and fall back to the search API only if the static fetch fails. The on-disk cache key is shared between the two paths, so subsequent runs hit cache regardless of which path produced the data. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
23 of ~40 years of OOPSLA proceedings harvested. DBLP rate-limited the IP after 2008; remaining years (2009-2025) will follow in a subsequent commit once the rate limit clears. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Only ESOP 1992 survived the no-abstract filter so far for pre-2010 Springer DOIs (OpenAlex/Semantic Scholar abstracts are sparse for old Springer LNCS). Will resume harvesting after DBLP rate limit clears. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
dblp.org throttles aggressively under sustained load — even after moving to the static XML TOCs, sustained harvesting can hit a 30+ minute IP-level block. dblp publishes two long-running mirrors at dblp.uni-trier.de and dblp.dagstuhl.de that serve the same content under separate quotas. Round-robin DBLP requests across all three on each retry attempt, so a primary outage doesn't stall the bulk ingest. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
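The rotation could be sketched as below. The three hostnames come from the commit message; the `https://` scheme, the function name `fetch_with_mirror_rotation`, the injected `fetch` callable, and the choice to treat any `OSError` as retryable are assumptions. Backoff between attempts is left out to keep the sketch short.

```python
from typing import Callable, Optional, Sequence

DBLP_MIRRORS = (
    "https://dblp.org",
    "https://dblp.uni-trier.de",
    "https://dblp.dagstuhl.de",
)


def fetch_with_mirror_rotation(
    path: str,
    fetch: Callable[[str], str],
    mirrors: Sequence[str] = DBLP_MIRRORS,
    max_attempts: int = 6,
) -> str:
    """Try `path` on each mirror in turn, moving to the next mirror per retry."""
    last_error: Optional[BaseException] = None
    for attempt in range(max_attempts):
        base = mirrors[attempt % len(mirrors)]
        try:
            return fetch(base + path)
        except OSError as exc:  # includes connection resets and timeouts
            last_error = exc
    raise RuntimeError(f"all {max_attempts} attempts failed for {path}") from last_error
```

With three mirrors and six attempts, each mirror is tried twice before the request is abandoned.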
Completes the OOPSLA back-catalogue. 17 more years (2009-2025) harvested via the static-XML path with mirror rollover. Combined with the earlier 1986-2008 commit, OOPSLA spans 1986 to 2025. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The unauthenticated SS API throttles to a few RPS and 429s repeatedly on bulk lookups, particularly for old Springer LNCS DOIs that have no abstract anywhere anyway. The default 6-attempt exponential backoff burns ~90s per paper before giving up — at 20-30 papers per ESOP/ECOOP year that wedges bulk ingest into many-hours-per-venue territory. OpenAlex (which has the abstract for most modern DOIs) is the primary source; SS is a fallback for the cases OpenAlex misses. If SS also throttles, we cleanly drop the paper rather than burn the budget. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
19 of 35 indexed years yielded papers with abstracts. Pre-2018 ESOP proceedings were published by Springer LNCS, which exposes abstracts neither to OpenAlex nor reliably to Semantic Scholar; most of the intervening years (1986-2017 except a handful) were ingested as 0-2 papers each. Post-2018 LIPIcs proceedings have full abstracts and yield 20-36 papers per year. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ESOP 2026 was published while the harvester was running on earlier years. 5 papers had abstracts in OpenAlex/SS at run time. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
12 of 38 indexed years yielded papers with abstracts. ECOOP was a Springer LNCS conference until 2015 when it migrated to LIPIcs; nearly all 1987-2014 abstracts are missing from OpenAlex and Semantic Scholar (the publishers don't expose them), so those years mostly produce 0 or 1 paper. Post-2015 LIPIcs proceedings have abstracts and yield 22-43 papers per year. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A handful of PL conference abstracts (notably PACMPL/LIPIcs 'Journal-first' submissions whose 'abstract' is effectively the full body of the journal extension) exceed the embedding model's ~1536-token window. Previously the entire embed batch aborted with 'At least one of the texts is too long to embed', which in practice meant a single bad row halted bulk consume of an entire venue. Clip oversize texts to the model's word budget and continue. A head excerpt is good enough for semantic retrieval; a hard fail on the whole batch is not. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
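The clipping step amounts to keeping a head excerpt of each oversize text. A minimal sketch, assuming whitespace-delimited words as the unit; the function name `clip_to_word_budget` is hypothetical and the budget value is left to the caller, since the commit message does not state it.

```python
def clip_to_word_budget(text: str, max_words: int) -> str:
    """Return `text` unchanged if it fits, else its first `max_words` words."""
    words = text.split()
    if len(words) <= max_words:
        return text
    # Keep the head of the abstract; the tail is dropped rather than failing.
    return " ".join(words[:max_words])
```

Applied to every text before the batch is submitted, this means one oversize abstract no longer aborts the rest of the batch.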
19 of 34 indexed years yielded papers with abstracts. Like ECOOP and ESOP, CC was a Springer LNCS conference until 2016, when it migrated to ACM; pre-2016 abstracts are mostly missing from OpenAlex/Semantic Scholar. Post-2016 yields 14-29 papers per year. CC 2026 was published while the harvester was running. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenAlex's daily quota 429 returns Retry-After = remaining seconds in the day, which can be 7+ hours. Honoring that literally wedges the bulk harvester. Cap the requested wait at max(backoff, 60s) so we burn at most the normal backoff budget before giving up on a DOI and moving on. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
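Read literally, the clamp is `min(Retry-After, max(backoff, 60s))`. The sketch below is one interpretation of that rule; the function name `capped_wait_s` and the behaviour when no Retry-After header is present (fall back to the plain backoff) are assumptions.

```python
from typing import Optional


def capped_wait_s(
    retry_after_s: Optional[float],
    backoff_s: float,
    cap_floor_s: float = 60.0,
) -> float:
    """Seconds to sleep before the next attempt.

    The server's Retry-After is honoured only up to max(backoff, 60s),
    so a multi-hour daily-quota hint cannot stall the harvester.
    """
    if retry_after_s is None:
        return backoff_s  # assumption: no header means plain backoff
    return min(retry_after_s, max(backoff_s, cap_floor_s))
```

For example, a 7-hour Retry-After with a 4-second backoff results in a 60-second wait, while a 10-second Retry-After is honoured as given.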
OpenAlex's daily quota 429 returns Retry-After of multiple hours. Even with the wait clamped, six retries-per-paper × dozens of papers-per-year stalls bulk harvesting. Drop to 2 attempts: try once, give up, and fall through to Semantic Scholar. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
7 of 25 indexed years yielded papers. Haskell papers are ACM SIGPLAN with DOIs in the 10.1145 range; OpenAlex usually has the abstracts. The bulk run for 2010-2025 hit OpenAlex's 10k/day quota (daily reset at UTC midnight) and could not be completed within this session. Resume after quota refresh; the ESOP/ECOOP/CC caches are warm so they won't re-cost OpenAlex calls. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
OpenAlex returns ``publication_date`` as the journal release date, which for ACM-published PACMPL volumes is December of the year *before* the conference (POPL 2018 → 2017-12-27, POPL 2020 → 2019-12-20). Pre-PACMPL POPL 2015 has the same shape (ACM dated the proceedings 2014-12-19). The harvester previously wrote this date straight to the JSON ``date`` field. Paper.from_scraped_json then stored it as ``update_date`` in the DB, so year-bucketing queries over ``update_date`` saw POPL 2018 papers as 2017 papers — making POPL 2015/2017/2018/2020 appear absent under their conference year. PLDI/ICFP/OOPSLA happen mid-year so the bug is invisible there. Snap any out-of-year publication_date back to ``YYYY-01-01`` for the conference year. In-year dates are preserved unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
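The snapping rule can be sketched in a few lines. The function name `normalise_conference_date` is hypothetical, and the sketch assumes `publication_date` is always a full ISO `YYYY-MM-DD` string.

```python
from datetime import date


def normalise_conference_date(publication_date: str, conference_year: int) -> str:
    """Keep in-year dates; snap anything else to Jan 1 of the conference year."""
    parsed = date.fromisoformat(publication_date)
    if parsed.year == conference_year:
        return publication_date
    return f"{conference_year:04d}-01-01"
```

With this rule, POPL 2018's `2017-12-27` becomes `2018-01-01`, while a date already inside the conference year passes through unchanged.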
Re-runs of PLConferenceHarvester for POPL 2015/2018/2020 with the date-normalisation fix in place. All three years now carry dates within the conference year (2015-01-15 / 2018-01-01 / 2020-01-01) instead of the journal-release date OpenAlex returns (December of the prior year), restoring them to the year-bucketing inventory query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Re-run of PLConferenceHarvester for POPL 2017 with the date-normalisation fix in place. All 66 papers now carry 2017-01-01 instead of the journal-release date 2016-12-22 OpenAlex returns, moving them out of the 2016 bucket (128 -> 62 real POPL 2016) and into 2017 (0 -> 66) in the year-bucketing inventory query. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /api/search endpoint hardcodes its source allowlist in three places (GET param parsing, sources_flags iteration, default-everything fallback) and the recently ingested PL venues (POPL, PLDI, ICFP, OOPSLA, ESOP, ECOOP, CC, Haskell) were absent from all of them. Net effect: 8,032 PL papers were unreachable from the frontend even with no filters applied. Adds the eight PL venues alongside the existing AI/Systems venues in all three places. No behavioral change for existing sources. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The arxiv embedding pass previously only embedded papers in cs.AI, cs.CL, cs.LG, and cs.MA. Logic (cs.LO) and Programming Languages (cs.PL) papers were ingested but never embedded, so they were invisible to semantic search even though present in the paper table. Adds cs.LO and cs.PL to the eligible-for-embedding category list. A backfill embedding pass over already-ingested cs.LO and cs.PL rows needs to be run separately (out of scope for this commit). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds POPL, PLDI, ICFP, OOPSLA, ESOP, ECOOP, CC, and Haskell Symposium to the sidebar source filter alongside the existing AI and Systems groupings, plus the inventory display order. Mirrors the structure of the existing systems-conferences and AI-conferences blocks (collapsible group with per-venue toggles and an indeterminate parent checkbox). Pairs with the API change that exposes PL venues to /api/search. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resumes the Haskell ingest that was partial in the original PR (only 2000-2009 had been fetched before OpenAlex's daily quota cut off the Phase 2 run). After the quota reset, a follow-up harvester run filled in 2010-2025 plus revised 2005/2006/2009 with the date-normalization fix. Total Haskell coverage in the DB is now 326 papers spanning 2001-2025, up from 48. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The /api/search endpoint hardcoded the conference list four separate times (GET param parsing, _build_filters per-group iteration, and the default-everything fallback). The grouping into AI / Systems / PL is a frontend UI concern; the backend filter just needs a flat set of valid source names. Extracted to module-level KNOWN_CONFERENCES and KNOWN_SOURCES constants; _build_filters becomes a four-line list-comprehension. No behavioural change — same sources accepted, same defaults. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
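The shape of this refactor might resemble the sketch below. Only the constant names `KNOWN_CONFERENCES` and `KNOWN_SOURCES` come from the commit message; the lowercase venue slugs, the `"arxiv"` source name, and the helper `build_source_filter` (a stand-in, not the real `_build_filters`) are hypothetical, and the AI and Systems venues are omitted because this page does not list them.

```python
from typing import Iterable, List, Optional

# Hypothetical slugs for the eight PL venues; the real constant also
# contains the AI and Systems venues.
KNOWN_CONFERENCES = frozenset(
    {"popl", "pldi", "icfp", "oopsla", "esop", "ecoop", "cc", "haskell"}
)
KNOWN_SOURCES = KNOWN_CONFERENCES | {"arxiv"}  # "arxiv" name is an assumption


def build_source_filter(requested: Optional[Iterable[str]]) -> List[str]:
    """Return the valid requested sources, or every source if none were given."""
    if not requested:
        return sorted(KNOWN_SOURCES)
    return [source for source in requested if source in KNOWN_SOURCES]
```

Keeping a single flat set means a new venue is added in one place, and the AI / Systems / PL grouping stays in the frontend.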
Adds oversight/sync/pl target that runs the PL harvester with --skip-existing-doi (cheap incremental behaviour — only OpenAlex/SS lookups for new DOIs) and then consumes the resulting JSON into the DB. make oversight/sync now depends on both oversight/sync/arxiv (existing) and oversight/sync/pl, so the daily cron picks up new PL volumes when DBLP indexes them. Updates docs/pl-conferences-plan.md Phase 3 to reflect this simpler Make-target approach rather than the SourcePoller protocol earlier versions of the plan proposed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2ebab67 to c681f7e
The list was originally cs.AI/CL/LG/MA and got extended to include cs.LO and cs.PL, at which point "ai" stopped describing the contents. More importantly, the name conflated two concepts: it's specifically the arxiv-side embed gate (conferences bypass it entirely and are embedded unconditionally via embed_missing_conference_papers).
Renamed:
- ai_categories -> arxiv_embed_categories
- get_unembedded_arxiv_ai_papers -> get_unembedded_arxiv_papers
- _embed_missing_ai_papers -> _embed_missing_arxiv_papers
No behavioural change.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously the slider defaulted to 3 years and the backend's fallback was 5 years, both shorter than the age of foundational papers in the corpus (POPL 1973 onwards). Searching by title for a classic like "Pure subtype systems" (Hutchins, POPL 2010) silently dropped it because of the time filter, which is surprising UX.
- Adds an "All time" step to the slider (36500 days, displayed as "All time") and makes it the default.
- Updates the backend's default-when-omitted to match. A 100-year sentinel covers everything in the corpus and leaves room for ingesting earlier papers in future.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
charlielidbury commented May 12, 2026
// (POPL 1973) would do.
const TIME_STEPS = [7, 14, 30, 90, 180, 365, 730, 1095, 1825, 2555, 3650, 36500];
const ALL_TIME_DAYS = 36500;
const DEFAULT_TIME_INDEX = TIME_STEPS.length - 1; // "all time"
Collaborator
Author
Note: I did this because PL papers are often quite old. @ottowhite, do you want it to stay at 5 years?
Summary
Adds an ingest pipeline for the eight major Programming Languages conferences (POPL, PLDI, ICFP, OOPSLA, ESOP, ECOOP, CC, Haskell Symposium) and refactors sync into a unified `oversight sync` command backed by a `SourcePoller` registry.
Motivation: the existing arxiv-CS harvester misses any paper that was never deposited on arXiv. "Pure Subtype Systems" (Hutchins, POPL 2010) was the canonical example — foundational PL work, only in ACM DL. With this branch it's now in the DB and the #1 hit for a semantic search of "pure subtype systems".
What changed
New harvester (`src/oversight/PLConferenceHarvester.py`, ~1.3k lines) — pure HTTP, no LLM. Pipeline: DBLP for paper lists → OpenAlex for abstracts → Semantic Scholar fallback → skip-with-log if both miss. Handles PACMPL volume mapping for post-2017 SIGPLAN venues, conf-style TOCs for pre-2017 + Tier 2, three DBLP mirrors with retry rotation, per-source retry budgets and rate limits, on-disk cache at `.cache/pl_conferences/`.
Sync refactor (`SourcePoller.py`, `source_registry.py`, `ArxivPoller.py`, `PLConfPoller.py`, `cli.py`) — `oversight sync` with `--sources`, `--backfill`, `--dry-run`. Per-poller watermarks. `make oversight/sync` now routes through this. ML and Systems pollers are explicit `NotImplementedPoller` stubs (out of scope here, follow-up phase).
Bug fixes along the way (each in its own commit):
Outcome
All 8,032 embedded; zero unembedded. Hutchins POPL 2010 verified as the top semantic-search hit for "pure subtype systems".
Known gaps (deferred)
Plan and design rationale: `docs/pl-conferences-plan.md`.
Test plan
🤖 Generated with Claude Code