Open
34 commits
b89ceed
docs: add plan for ingesting PL conference proceedings
charlielidbury May 10, 2026
c5b95e9
feat: add DBLP and OpenAlex clients for PL conference ingest
charlielidbury May 10, 2026
490c2b4
feat: harvest POPL 2024 proceedings into data/pl_conferences/
charlielidbury May 10, 2026
00b9c4d
refactor: generalize PLConferenceHarvester for arbitrary venue/year
charlielidbury May 10, 2026
824f6eb
feat: parallelize PLConferenceHarvester for bulk ingest
charlielidbury May 10, 2026
a3d040e
feat: ingest POPL back-catalogue (1973-2026)
charlielidbury May 10, 2026
49a7299
feat: rate-limit DBLP and Semantic Scholar in PLConferenceHarvester
charlielidbury May 10, 2026
aca7126
feat: ingest PLDI back-catalogue (1987-2025)
charlielidbury May 10, 2026
d70687e
feat: ingest ICFP back-catalogue (1996-2025)
charlielidbury May 10, 2026
3ca7295
feat: harvest DBLP TOCs via static XML, falling back to search API
charlielidbury May 10, 2026
36c742c
feat: ingest OOPSLA back-catalogue (1986-2008, partial)
charlielidbury May 10, 2026
244482e
feat: ingest ESOP back-catalogue (1992-1992, partial)
charlielidbury May 10, 2026
110beb0
feat: roll DBLP requests across mirrors on rate limit
charlielidbury May 10, 2026
5634dab
feat: ingest OOPSLA back-catalogue (2009-2025)
charlielidbury May 10, 2026
e5a8d27
fix: cap Semantic Scholar retries at 2 attempts
charlielidbury May 10, 2026
a8f74f5
feat: ingest ESOP back-catalogue (1986-2025)
charlielidbury May 10, 2026
9030145
feat: ingest ESOP 2026
charlielidbury May 10, 2026
b665588
feat: ingest ECOOP back-catalogue (1987-2025)
charlielidbury May 10, 2026
0ab27b2
fix: truncate over-long abstracts before embedding instead of aborting
charlielidbury May 10, 2026
b2e1e22
feat: ingest CC back-catalogue (1988-2026)
charlielidbury May 10, 2026
e42015e
fix: clamp OpenAlex Retry-After to a reasonable cap
charlielidbury May 10, 2026
1a859f1
fix: cap OpenAlex retries at 2 attempts
charlielidbury May 10, 2026
031b158
feat: ingest Haskell Symposium back-catalogue (2000-2009, partial)
charlielidbury May 10, 2026
ff7aa8a
fix: anchor PLConferenceHarvester paper date to conference year
charlielidbury May 10, 2026
34636ad
feat: re-ingest POPL 2015, 2018, 2020 with conference-year dates
charlielidbury May 10, 2026
ec6741b
feat: re-ingest POPL 2017 with conference-year date
charlielidbury May 10, 2026
d0cf87c
fix(api): expose PL conferences in /api/search source filter
charlielidbury May 10, 2026
f8bab7a
fix: include cs.LO and cs.PL in arxiv embedding scope
charlielidbury May 10, 2026
5d5ee6c
feat(frontend): add PL conference toggles to source filters
charlielidbury May 10, 2026
a9a757d
feat: complete Haskell Symposium back-catalogue ingest (2010-2025)
charlielidbury May 12, 2026
16e3921
refactor: collapse three duplicated conference lists in flask_app
charlielidbury May 12, 2026
c681f7e
feat: integrate PL sync into make oversight/sync via Make dependency
charlielidbury May 12, 2026
68e9253
refactor: rename ai_categories to arxiv_embed_categories
charlielidbury May 12, 2026
b927515
feat: default search time window to "All time"
charlielidbury May 12, 2026
3 changes: 3 additions & 0 deletions .gitignore
Original file line number Diff line number Diff line change
@@ -35,3 +35,6 @@ logs_runtime/

# Scraping
src/superscraper/tools/.cache/

# Local API response cache (PL harvester etc.)
.cache/
10 changes: 8 additions & 2 deletions Makefile
@@ -53,5 +53,11 @@ format/check:
 typecheck:
 	uv run ty check src/
 
-oversight/sync:
-	uv run python -m oversight.ArXivRepository --sync
+oversight/sync: oversight/sync/arxiv oversight/sync/pl
+
+oversight/sync/arxiv:
+	uv run python -m oversight.ArXivRepository --sync
+
+oversight/sync/pl:
+	uv run python -m oversight.PLConferenceHarvester --skip-existing-doi
+	uv run oversight consume data/pl_conferences/ --format scraped
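Reviewer note: the diff does not show how `--skip-existing-doi` is implemented. One plausible reading, sketched below purely as an assumption, is that the harvester collects the `paper_id` values (DOIs) already stored under `data/pl_conferences/` and skips candidates that match. The helper names `existing_dois` and `filter_new` are hypothetical and do not come from the PR.

```python
import json
from pathlib import Path


def existing_dois(root: Path) -> set[str]:
    """Collect every paper_id stored under root/<venue>/<year>.json.

    Hypothetical helper: the directory layout is taken from the diff,
    the function itself is not part of the PR.
    """
    seen: set[str] = set()
    for path in sorted(root.glob("*/*.json")):
        for record in json.loads(path.read_text(encoding="utf-8")):
            # DOIs are case-insensitive, so compare them lowercased.
            seen.add(record["paper_id"].lower())
    return seen


def filter_new(candidates: list[dict], seen: set[str]) -> list[dict]:
    """Keep only the candidates whose DOI is not already on disk."""
    return [c for c in candidates if c["paper_id"].lower() not in seen]
```

With a check of this kind, re-running `make oversight/sync/pl` would only need to fetch metadata for papers that are not yet in the data directory.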
20 changes: 20 additions & 0 deletions data/pl_conferences/cc/1988.json
@@ -0,0 +1,20 @@
[
{
"paper_id": "10.1007/3-540-51364-7_6",
"title": "Generators for High-Speed Front-Ends",
"abstract": "High-speed compilers can be constructed automatically. We present some existing tools for the generation of fast front-ends. Rex (Regular EXpression tool) is a scanner generator whose specifications are based on regular expressions and arbitrary semantic actions written in one of the target languages C or Modula-2. As scanners sometimes have to consider the context to unambiguously recognize a token the right context can be specified by an additional regular expression and the left context can be handled by so-called start states. The generated scanners automatically compute the line and column position of the tokens and offer an efficient mechanism to normalize identifiers and keywords to upper or lower case letters. The scanners are table-driven and run at a speed of 180,000 to 195,000 lines per minute on a MC 68020 processor. Lalr is a LALR(1) parser generator accepting grammars written in extended BNT notation which may be augmented by semantic actions expressed by statements of the target language. The generator provides a mechanism for S-attribution, that is synthesized attributes can be computed during parsing. In case of LR-conflicts, unlike other tools, Lalr provides not only information about an internal state consisting of a set of items but it prints a derivation tree which is much more useful to analyze the problem. Conflicts can be resolved by specifying precedence and associativity of operators and productions. The generated parsers include automatic error reporting, error recovery, and error repair. The parsers are table-driven and run at a speed of 400,000 lines per minute. Currently parsers can be generated in the target languages C and Modula-2. Ell is a LL(1) parser generator accepting the same specification language as Lalr except that the grammars must obey the LL(1) property. The generated parsers include automatic error reporting, recovery, and repair like Lalr. The parsers are implemented following the recursive descent method and reach a speed of 450,000 lines per minute. The possible target languages are again C and Modula-2 A comparison of the above tools with the corresponding UNIX tools shows that significant improvements have been achieved thus allowing the generation of high-speed compilers.",
"date": "1989-01-01",
"link": "https://doi.org/10.1007/3-540-51364-7_6",
"conference_name": "CC",
"authors": [
{
"first_name": "Josef",
"last_name": "Grosch",
"institution": "Karlsruhe Institute of Technology"
}
],
"dblp_key": "conf/cc/Grosch88",
"venue": "cc",
"year": 1988
}
]
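Reviewer note: each record in these files carries the same ten keys. The sketch below is a minimal typed loader for that shape, not the project's own model; the class and function names (`Author`, `Paper`, `parse_paper`, `date_matches_year`) are hypothetical. It also makes one property of the record above easy to check: `date` is `1989-01-01` while `year` is `1988`. The diff does not say whether that gap is intended.

```python
import datetime as dt
from dataclasses import dataclass


@dataclass(frozen=True)
class Author:
    first_name: str
    last_name: str
    institution: str  # may be an empty string in the data


@dataclass(frozen=True)
class Paper:
    paper_id: str          # the DOI
    title: str
    abstract: str
    date: dt.date
    link: str
    conference_name: str
    authors: tuple[Author, ...]
    dblp_key: str
    venue: str
    year: int


def parse_paper(raw: dict) -> Paper:
    """Convert one decoded JSON object into a Paper.

    Raises TypeError if the record has missing or unexpected keys,
    which doubles as a crude schema check.
    """
    fields = dict(raw)
    fields["date"] = dt.date.fromisoformat(raw["date"])
    fields["authors"] = tuple(Author(**a) for a in raw["authors"])
    return Paper(**fields)


def date_matches_year(paper: Paper) -> bool:
    """True when the year of the `date` field equals the `year` field."""
    return paper.date.year == paper.year
```

Running `date_matches_year` over every file would list the records whose `date` was not anchored to the conference year, which may help when reviewing the re-ingest commits.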
20 changes: 20 additions & 0 deletions data/pl_conferences/cc/1996.json
@@ -0,0 +1,20 @@
[
{
"paper_id": "10.1007/3-540-61053-7_71",
"title": "Delegating Compiler Objects: An Object-Oriented Approach to Crafting Compilers",
"abstract": "Conventional compilers often are large entities that are highly complex, difficult to maintain and hard to reuse. In this article it is argued that this is due to the inherently functional approach to compiler construction. An alternative approach to compiler construction is proposed, based on object-oriented principles, which solves (or at least lessens) the problems of compiler construction. The approach is based on delegating compiler objects (Dcos) that provide a structural decomposition of compilers in addition to the conventional functional decomposition. The DCO approach makes use of the parser delegation and lexer delegation techniques, that provide reuse and modularisation of syntactical, respectively, lexical specifications.",
"date": "1996-01-01",
"link": "https://doi.org/10.1007/3-540-61053-7_71",
"conference_name": "CC",
"authors": [
{
"first_name": "Jan",
"last_name": "Bosch",
"institution": ""
}
],
"dblp_key": "conf/cc/Bosch96",
"venue": "cc",
"year": 1996
}
]
25 changes: 25 additions & 0 deletions data/pl_conferences/cc/1998.json
@@ -0,0 +1,25 @@
[
{
"paper_id": "10.1007/BFb0026420",
"title": "Generalised Recursive Descent parsing and Fellow-Determinism",
"abstract": "This paper presents a construct for mapping arbitrary non-left recursive context-free grammars into recursive descent parsers that: handle ambiguous grammars correctly; perform with LL(1) efficiency on LL(1) grammars; allow straightforward implementation of both inherited and synthesized attributes; and allow semantic actions to be added at any point in the grammar. We describe both the basic algorithm and a tool, GRDP, which generates parsers which use this technique. Modifications of the basic algorithm to improve efficiency lead to a discussion of follow-determinism, a fundamental property that gives insights into the behaviour of both LL and LR parsers.",
"date": "1998-01-01",
"link": "https://doi.org/10.1007/BFb0026420",
"conference_name": "CC",
"authors": [
{
"first_name": "Adrian",
"last_name": "Johnstone",
"institution": "Royal Holloway University of London"
},
{
"first_name": "Elizabeth",
"last_name": "Scott",
"institution": "Universidad de Londres"
}
],
"dblp_key": "conf/cc/JohnstoneS98",
"venue": "cc",
"year": 1998
}
]
25 changes: 25 additions & 0 deletions data/pl_conferences/cc/1999.json
@@ -0,0 +1,25 @@
[
{
"paper_id": "10.1007/978-3-540-49051-7_3",
"title": "Faster Generalized LR Parsing",
"abstract": "Tomita devised a method of generalized LR (GLR) parsing to parse ambiguous grammars efficiently. A GLR parser uses linear-time LR parsing techniques as long as possible, falling back on more expensive general techniques when necessary.Much research has addressed speeding up LR parsers. However, we argue that this previous work is not transferable to GLR parsers. Instead, we speed up LR parsers by building larger pushdown automata, trading space for time. A variant of the GLR algorithm then incorporates our faster LR parsers.Our timings show that our new method for GLR parsing can parse highly ambiguous grammars significantly faster than a standard GLR parser.",
"date": "1999-01-01",
"link": "https://doi.org/10.1007/978-3-540-49051-7_3",
"conference_name": "CC",
"authors": [
{
"first_name": "John",
"last_name": "Aycock",
"institution": "University of Victoria"
},
{
"first_name": "Nigel",
"last_name": "Horspool",
"institution": "University of Victoria"
}
],
"dblp_key": "conf/cc/AycockH99",
"venue": "cc",
"year": 1999
}
]
20 changes: 20 additions & 0 deletions data/pl_conferences/cc/2001.json
@@ -0,0 +1,20 @@
[
{
"paper_id": "10.1007/3-540-45306-7_1",
"title": "Virtual Classes and Their Implementation",
"abstract": "One of the characteristics of BETA [4] is the unification of abstraction mechanisms such as class, procedure, process type, generic class, interface, etc. into one abstraction mechanism: the pattern. In addition to keeping the language small, the unification has given a systematic treatment of all abstraction mechanisms and leads to a number of new possibilities. One of the interesting results of the unification is the notion of virtual class [[7],[8], which is the BETA mechanism for expressing genericity. A class may define an attribute in the form of a virtual class just as a class may define an attribute in the form of a virtual procedure. A subclass may then refine the definition of the virtual class attribute into a more specialized class. This is very much in the same way as a virtual procedure can be refined - resulting in a more specialized procedure. Virtual classes can be seen as an object-oriented version of generics. Other attempts to provide genericity for OO languages has been based on various forms of parametric polymorphism and function application rather than inheritance. Virtual classes have been used for more than 15 years in the BETA community and they have demonstrated their usefulness as a powerful abstraction mechanism. There has recently been an increasing interest in virtual classes and a number of proposals for adding virtual classes to other languages, extending virtual classes, and unifying virtual classes and parameterized classes have been made [[1],[2],[3],[13],[14],[15],[16],[17]. Another distinguishing feature of BETA is the notion of nested class [6]. The nested class construct originates already with Simula and is supported in a more general form in BETA. Nested classes have thus been available to the OO community for almost 4 decades, and the mechanism has found many uses in particular to structure large systems. Despite the usefulness, mainstream OO languages have not included general nesting mechanisms although C++ has a restricted form of nested classes, only working as a scoping mechanism. Recently nested classes has been added to the Java language. From a semantic analysis point of view the combination of inheritance, and general nesting adds some complexity to the semantic analysis, since the search space for names becomes two-dimensional. With virtual classes, the analysis becomes even more complicated — for details see ref. [10]. The unification of class and procedure has also lead to an inheritance mechanism for procedures [5] where method-combination is based on the inner-construct known from Simula. In BETA, patterns are first-class values, which implies that procedures as well as classes are first-class values. BETA also supports the notion of class-less objects, which has been adapted in the form of anonymous classes in Java. Finally, it might be mentioned that BETA supports coroutines as well as concurrent active objects. For further details about BETA, see [6,9,11]. The Mjølner System is a program development environment for BETA and may be obtained from ref. [12].",
"date": "2001-01-01",
"link": "https://doi.org/10.1007/3-540-45306-7_1",
"conference_name": "CC",
"authors": [
{
"first_name": "Ole Lehrmann",
"last_name": "Madsen",
"institution": "Aarhus University"
}
],
"dblp_key": "conf/cc/Madsen01",
"venue": "cc",
"year": 2001
}
]
30 changes: 30 additions & 0 deletions data/pl_conferences/cc/2004.json
@@ -0,0 +1,30 @@
[
{
"paper_id": "10.1007/978-3-540-24723-4_7",
"title": "Generalised Parsing: Some Costs",
"abstract": "We discuss generalisations of bottom up parsing, emphasising the relative costs for real programming languages. Our goal is to provide a roadmap of the available approaches in terms of their space and time performance for programming language applications, focusing mainly on GLR style algorithms. It is well known that the original Tomita GLR algorithm fails to terminate on hidden left recursion: here we analyse two approaches to correct GLR parsing (i) the modification due to Farshi that is incorporated into Visser’s work and (ii) our own right-nullable GLR (RNGLR) algorithm, showing that Farshi’s approach can be expensive. We also present results from our new Binary RNGLR algorithm which is asymptotically the fastest parser in this family and show that the recently reported reduction incorporated parsers can require automata that are too large to be practical on current machines.",
"date": "2004-01-01",
"link": "https://doi.org/10.1007/978-3-540-24723-4_7",
"conference_name": "CC",
"authors": [
{
"first_name": "Adrian",
"last_name": "Johnstone",
"institution": "Royal Holloway University of London"
},
{
"first_name": "Elizabeth",
"last_name": "Scott",
"institution": "Royal Holloway University of London"
},
{
"first_name": "Giorgios",
"last_name": "Economopoulos",
"institution": "Royal Holloway University of London"
}
],
"dblp_key": "conf/cc/JohnstoneSE04",
"venue": "cc",
"year": 2004
}
]
25 changes: 25 additions & 0 deletions data/pl_conferences/cc/2008.json
@@ -0,0 +1,25 @@
[
{
"paper_id": "10.1007/978-3-540-78791-4_11",
"title": "Compiler-Guaranteed Safety in Code-Copying Virtual Machines",
"abstract": "Virtual Machine authors face a difficult choice between low performance, cheap interpreters, or specialized and costly compilers. A method able to bridge this wide gap is the existing code-copying technique that reuses chunks of the VM’s binary code to create a simple JIT. This technique is not reliable without a compiler guaranteeing that copied chunks are still functionally equivalent despite aggressive optimizations. We present a proof-of-concept, minimal-impact modification of a highly optimizing compiler, GCC. A VM programmer marks chunks of VM source code as copyable. The chunks of native code resulting from compilation of the marked source become addressable and self-contained. Chunks can be safely copied at VM runtime, concatenated and executed together. This allows code-copying VMs to safely achieve speedup up to 3 times, 1.67 on average, over the direct interpretation. This maintainable enhancement makes the code-copying technique reliable and thus practically usable.",
"date": "2008-04-01",
"link": "https://doi.org/10.1007/978-3-540-78791-4_11",
"conference_name": "CC",
"authors": [
{
"first_name": "Gregory B.",
"last_name": "Prokopski",
"institution": "McGill University"
},
{
"first_name": "Clark",
"last_name": "Verbrugge",
"institution": "McGill University"
}
],
"dblp_key": "conf/cc/ProkopskiV08",
"venue": "cc",
"year": 2008
}
]
45 changes: 45 additions & 0 deletions data/pl_conferences/cc/2013.json
@@ -0,0 +1,45 @@
[
{
"paper_id": "10.1007/978-3-642-37051-9_6",
"title": "Simple and Efficient Construction of Static Single Assignment Form",
"abstract": "We present a simple SSA construction algorithm, which allows direct translation from an abstract syntax tree or bytecode into an SSA-based intermediate representation. The algorithm requires no prior analysis and ensures that even during construction the intermediate representation is in SSA form. This allows the application of SSA-based optimizations during construction. After completion, the intermediate representation is in minimal and pruned SSA form. In spite of its simplicity, the runtime of our algorithm is on par with Cytron et al.’s algorithm.",
"date": "2013-01-01",
"link": "https://doi.org/10.1007/978-3-642-37051-9_6",
"conference_name": "CC",
"authors": [
{
"first_name": "Matthias",
"last_name": "Braun",
"institution": "Karlsruhe Institute of Technology"
},
{
"first_name": "Sebastian",
"last_name": "Buchwald",
"institution": "Karlsruhe Institute of Technology"
},
{
"first_name": "Sebastian",
"last_name": "Hack",
"institution": "Saarland University"
},
{
"first_name": "Roland",
"last_name": "Leißa",
"institution": "Saarland University"
},
{
"first_name": "Christoph",
"last_name": "Mallon",
"institution": "Saarland University"
},
{
"first_name": "Andreas",
"last_name": "Zwinkau",
"institution": "Karlsruhe Institute of Technology"
}
],
"dblp_key": "conf/cc/BraunBHLMZ13",
"venue": "cc",
"year": 2013
}
]
48 changes: 48 additions & 0 deletions data/pl_conferences/cc/2014.json
@@ -0,0 +1,48 @@
[
{
"paper_id": "10.1007/978-3-642-54807-9_12",
"title": "String Analysis for Dynamic Field Access",
"abstract": "In JavaScript, and scripting languages in general, dynamic field access is a commonly used feature. Unfortunately, current static analysis tools either completely ignore dynamic field access or use overly conservative approximations that lead to poor precision and scalability. We present new string domains to reason about dynamic field access in a static analysis tool. A key feature of the domains is that the equal, concatenate and join operations take $\\mathcal{O}$ (1) time. Experimental evaluation on four common JavaScript libraries, including jQuery and Prototype, shows that traditional string domains are insufficient. For instance, the commonly used constant string domain can only ensure that at most 21% dynamic field accesses are without false positives. In contrast, our string domain $\\mathcal{H}$ ensures no false positives for up to 90% of all dynamic field accesses. We demonstrate that a dataflow analysis equipped with the $\\mathcal{H}$ domain gains significant precision resulting in an analysis speedup of more than 1.5x for 7 out of 10 benchmark programs.",
"date": "2014-01-01",
"link": "https://doi.org/10.1007/978-3-642-54807-9_12",
"conference_name": "CC",
"authors": [
{
"first_name": "Magnus",
"last_name": "Madsen",
"institution": "Aarhus University"
},
{
"first_name": "Esben",
"last_name": "Andreasen",
"institution": "Aarhus University"
}
],
"dblp_key": "conf/cc/MadsenA14",
"venue": "cc",
"year": 2014
},
{
"paper_id": "10.1007/978-3-642-54807-9_8",
"title": "Taming Control Divergence in GPUs through Control Flow Linearization",
"abstract": "Branch divergence is a very commonly occurring performance problem in GPGPU in which the execution of diverging branches is serialized to execute only one control flow path at a time. Existing hardware mechanism to reconverge threads using a stack causes duplicate execution of code for unstructured control flow graphs. Also the stack mechanism cannot effectively utilize the available parallelism among diverging branches. Further, the amount of nested divergence allowed is also limited by depth of the branch divergence stack. In this paper we propose a simple and elegant transformation to handle all of the above mentioned problems. The transformation converts an unstructured CFG to a structured CFG without duplicating user code. It incurs only a linear increase in the number of basic blocks and also the number of instructions. Our solution linearizes the CFG using a predicate variable. This mechanism reconverges the divergent threads as early as possible. It also reduces the depth of the reconvergence stack. The available parallelism in nested branches can be effectively extracted by scheduling the basic blocks to reduce the effect of stalls due to memory accesses. It can also increase execution efficiency of nested loops with different trip counts for different threads. We implemented the proposed transformation at PTX level using the Ocelot compiler infrastructure. We evaluated the technique using various benchmarks to show that it can be effective in handling the performance problem due to divergence in unstructured CFGs.",
"date": "2014-01-01",
"link": "https://doi.org/10.1007/978-3-642-54807-9_8",
"conference_name": "CC",
"authors": [
{
"first_name": "Jayvant",
"last_name": "Anantpur",
"institution": "Indian Institute of Science Bangalore"
},
{
"first_name": "R.",
"last_name": "Govindarajan",
"institution": "Indian Institute of Science Bangalore"
}
],
"dblp_key": "conf/cc/AnantpurG14",
"venue": "cc",
"year": 2014
}
]
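Reviewer note: the files in this diff follow a `data/pl_conferences/<venue>/<year>.json` layout, and each record repeats the venue and year in its own `venue` and `year` fields. The sketch below assumes that convention holds for every file and reports records that disagree with their path. The function name `path_mismatches` is hypothetical, and the code assumes every file stem is a plain integer year.

```python
import json
from pathlib import Path


def path_mismatches(root: Path) -> list[tuple[str, str]]:
    """Return (relative file path, paper_id) for each record whose
    `venue` or `year` field disagrees with its <venue>/<year>.json path."""
    problems: list[tuple[str, str]] = []
    for path in sorted(root.glob("*/*.json")):
        venue = path.parent.name
        year = int(path.stem)  # raises ValueError on a non-numeric file name
        for record in json.loads(path.read_text(encoding="utf-8")):
            if record["venue"] != venue or record["year"] != year:
                rel = path.relative_to(root).as_posix()
                problems.append((rel, record["paper_id"]))
    return problems
```

Calling `path_mismatches(Path("data/pl_conferences"))` from the repository root would return an empty list when every record sits in the file its fields describe.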