citation-cosmograph/citation-astrolabe

citation-astrolabe 🔭

Venue governance database — AI-agent-driven extraction of editorial boards and program committees

Astrolabe is the instrument behind citation-cosmograph's venue governance detection. It will scrape academic venue websites, extract structured governance data using a locally deployed language model (Pulsar), resolve extracted members against OpenAlex author profiles, and store the results in a persistent, incrementally growing database.

pulsar 🌟  →  astrolabe 🔭  →  citation-constellation ✨
(the signal)    (the instrument)    (the map)
                     ▲
                 you are here

Purpose

Citation analysis tools can detect self-citation, co-authorship ties, and institutional proximity — but they miss a structural layer that is hiding in plain sight: venue governance. A citation that flows through a journal where the cited researcher sits on the editorial board, or where their co-author serves as associate editor, passes through a structural pathway that could mediate citation behavior. This pathway is currently invisible to all major bibliometric tools.

The reason it is invisible is practical, not conceptual. Editorial board and program committee data is scattered across thousands of heterogeneous journal and conference websites — each with different page structures, different naming conventions, and no standardized machine-readable format. Building a comprehensive governance database has historically required either dedicated manual curation or fragile per-publisher web scrapers.

Astrolabe will solve this with an AI-agent pipeline: fetch the page, feed it to a local LLM, extract structured data, resolve identities, and store the results. The database will grow incrementally with every new venue encountered.

Planned Architecture

┌──────────────────────────────────────────────────────────┐
│  Astrolabe Pipeline                                       │
│                                                           │
│  1. Venue Discovery                                       │
│     └─ Collect citing venues from OpenAlex source metadata│
│                                                           │
│  2. Web Scraping                                          │
│     ├─ httpx + BeautifulSoup (static HTML)               │
│     └─ Cloudflare Crawl API (JS-rendered, fallback)      │
│                                                           │
│  3. LLM Extraction (via Pulsar 🌟)                       │
│     ├─ Structured prompt → JSON array                    │
│     ├─ Member name, role, institution, ORCID             │
│     └─ Per-member confidence scoring                     │
│                                                           │
│  4. Entity Resolution                                     │
│     ├─ ORCID match (high confidence)                     │
│     ├─ Name + institution match (medium)                 │
│     └─ Name-only match (low, flagged)                    │
│                                                           │
│  5. Persistent Storage                                    │
│     ├─ PostgreSQL / SQLite                               │
│     ├─ Timestamped entries with confidence scores        │
│     └─ Incremental growth with nightly refresh           │
└──────────────────────────────────────────────────────────┘
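The three entity resolution tiers in step 4 could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the `Candidate` and `Member` types, the `resolve` function, and the tier labels are all hypothetical, and candidates are assumed to have been fetched from OpenAlex already.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical, simplified stand-in for an OpenAlex author profile."""
    author_id: str
    name: str
    institution: Optional[str] = None
    orcid: Optional[str] = None


@dataclass(frozen=True)
class Member:
    """One governance member as extracted from a venue page."""
    name: str
    institution: Optional[str] = None
    orcid: Optional[str] = None


def _norm(text: Optional[str]) -> str:
    """Lowercase and collapse whitespace so comparisons ignore formatting."""
    return " ".join((text or "").lower().split())


def resolve(member: Member, candidates: list[Candidate]) -> tuple[Optional[str], str]:
    """Return (author_id, tier), trying the strongest evidence first."""
    # Tier 1: an exact ORCID match is treated as high confidence.
    if member.orcid:
        for c in candidates:
            if c.orcid and c.orcid == member.orcid:
                return c.author_id, "high"
    # Tier 2: same normalized name and same normalized institution.
    same_name = [c for c in candidates if _norm(c.name) == _norm(member.name)]
    if member.institution:
        for c in same_name:
            if _norm(c.institution) == _norm(member.institution):
                return c.author_id, "medium"
    # Tier 3: name only; accept a unique hit but flag it as low confidence.
    if len(same_name) == 1:
        return same_name[0].author_id, "low"
    return None, "unresolved"
```

Leaving ambiguous name-only matches unresolved, rather than picking one, is an assumption of this sketch; the source only states that name-only matches are low confidence and flagged.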

Planned Database Schema

Astrolabe will maintain three core tables:

venues — One row per academic venue (journal or conference).

  • Source ID (OpenAlex), name, homepage URL, ISSN, publisher, type, last scraped timestamp.

governance_members — One row per person-venue-role relationship.

  • Venue ID, member name, role (editor-in-chief, associate editor, editorial board, program committee, organizing committee), institution, ORCID, OpenAlex author ID, match confidence, extraction confidence, scrape timestamp, source URL.

scrape_log — One row per scrape attempt.

  • Venue ID, URL attempted, HTTP status, extraction method (static / Cloudflare / cached), LLM model used, timestamp.
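As a rough sketch of how these three tables might look in SQLite, the snippet below creates them with Python's standard `sqlite3` module. The column names, types, and constraints are assumptions derived from the field lists above, not the project's actual migrations.

```python
import sqlite3

# Column names and types are illustrative assumptions.
SCHEMA = """
CREATE TABLE IF NOT EXISTS venues (
    id              INTEGER PRIMARY KEY,
    openalex_id     TEXT UNIQUE NOT NULL,  -- OpenAlex source ID
    name            TEXT NOT NULL,
    homepage_url    TEXT,
    issn            TEXT,
    publisher       TEXT,
    type            TEXT,                  -- journal or conference
    last_scraped_at TEXT                   -- ISO 8601 timestamp
);
CREATE TABLE IF NOT EXISTS governance_members (
    id                    INTEGER PRIMARY KEY,
    venue_id              INTEGER NOT NULL REFERENCES venues(id),
    member_name           TEXT NOT NULL,
    role                  TEXT NOT NULL,
    institution           TEXT,
    orcid                 TEXT,
    openalex_author_id    TEXT,
    match_confidence      REAL,
    extraction_confidence REAL,
    scraped_at            TEXT NOT NULL,
    source_url            TEXT
);
CREATE TABLE IF NOT EXISTS scrape_log (
    id                INTEGER PRIMARY KEY,
    venue_id          INTEGER NOT NULL REFERENCES venues(id),
    url_attempted     TEXT NOT NULL,
    http_status       INTEGER,
    extraction_method TEXT,                -- static / cloudflare / cached
    llm_model         TEXT,
    attempted_at      TEXT NOT NULL
);
"""


def init_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open a SQLite database and create the three tables if missing."""
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA foreign_keys = ON")
    conn.executescript(SCHEMA)
    return conn
```

Calling `init_db()` with no argument yields an in-memory database, which is convenient for experimenting with the schema before committing to PostgreSQL or a file-backed SQLite store.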

Planned Features

Incremental growth. The database will grow organically as new researchers are analyzed in Citation-Constellation. When a citing venue is not yet in the database, it will be queued for scraping and extraction. Subsequent analyses — for the same researcher or others in the same field — will benefit from the accumulated data.

Nightly refresh. A Kubernetes CronJob will process the queue of new venues and refresh entries older than 12 months, keeping the database current as editorial boards rotate.
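The selection logic such a job might run could look like the sketch below. It assumes that "12 months" is approximated as 365 days and that never-scraped venues are processed before stale ones; the function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Assumption: "12 months" is approximated as 365 days.
MAX_AGE = timedelta(days=365)


def needs_scrape(last_scraped: Optional[datetime], now: datetime,
                 max_age: timedelta = MAX_AGE) -> bool:
    """True for never-scraped venues and for entries older than max_age."""
    if last_scraped is None:
        return True
    return now - last_scraped > max_age


def select_due(venues: dict[str, Optional[datetime]], now: datetime) -> list[str]:
    """Return venue IDs due for processing, never-scraped venues first."""
    due = [vid for vid, ts in venues.items() if needs_scrape(ts, now)]
    # Sort key: new venues (None) first, then the oldest scrape first.
    return sorted(due, key=lambda vid: (venues[vid] is not None,
                                        venues[vid] or now))


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    example = {"new": None, "old": now - timedelta(days=400)}
    print(select_due(example, now))
```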

Quality assurance. A sampling-based validation pipeline will draw a random sample of extractions and compare each one against a manual check. Precision and recall will be tracked per publisher, since page structure and extraction difficulty vary from one publisher to the next.
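A minimal sketch of the sampling and per-publisher metrics, under the assumption that each manual check yields the set of names actually present on the page. The function names and the sample format are hypothetical.

```python
import random
from collections import defaultdict


def draw_sample(extraction_ids, fraction, seed=0):
    """Pick a reproducible random subset of extractions for manual review."""
    ids = list(extraction_ids)
    if not ids:
        return []
    k = max(1, round(len(ids) * fraction))
    return random.Random(seed).sample(ids, k)


def per_publisher_metrics(samples):
    """Compute precision and recall per publisher.

    Each sample is (publisher, extracted_names, verified_names), where the
    verified names come from a manual check of the same page.
    """
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for publisher, extracted, verified in samples:
        extracted, verified = set(extracted), set(verified)
        c = counts[publisher]
        c["tp"] += len(extracted & verified)   # correctly extracted
        c["fp"] += len(extracted - verified)   # extracted but not on the page
        c["fn"] += len(verified - extracted)   # on the page but missed
    metrics = {}
    for publisher, c in counts.items():
        found = c["tp"] + c["fp"]
        actual = c["tp"] + c["fn"]
        metrics[publisher] = {
            # None marks an undefined ratio (nothing extracted or verified).
            "precision": c["tp"] / found if found else None,
            "recall": c["tp"] / actual if actual else None,
        }
    return metrics
```

Matching on exact name strings is a simplification; a real check would likely normalize names or compare resolved author IDs.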

Comprehensive logging. Every scrape, extraction, and entity resolution decision will be logged with full provenance: URL fetched, HTTP status, raw HTML hash, LLM prompt, LLM response, entity resolution decisions with confidence scores, and model version.
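One way such a provenance entry might be assembled is shown below. The field names are assumptions chosen to mirror the list above, and SHA-256 is an assumed choice of hash; the source does not specify either.

```python
import hashlib
import json
from datetime import datetime, timezone


def provenance_record(url, http_status, raw_html, prompt, response,
                      resolution, model_version):
    """Build one JSON-serializable log entry for a scrape-and-extract run."""
    return {
        "url": url,
        "http_status": http_status,
        # Store a hash rather than the page itself to keep entries small.
        "raw_html_sha256": hashlib.sha256(raw_html.encode("utf-8")).hexdigest(),
        "llm_prompt": prompt,
        "llm_response": response,
        "entity_resolution": resolution,
        "model_version": model_version,
        "logged_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    record = provenance_record(
        url="https://example.org/editorial-board",   # placeholder URL
        http_status=200,
        raw_html="<html></html>",
        prompt="(prompt text)",
        response="[]",
        resolution=[],
        model_version="(model identifier)",
    )
    print(json.dumps(record, indent=2))
```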

Multi-discipline coverage. Astrolabe will not be limited to computer science or any single field. Any venue with a web-accessible editorial board or committee page is a valid target — sciences, humanities, medicine, engineering, social sciences.

Relationship to the Ecosystem

Astrolabe sits between Pulsar (the LLM) and Citation-Constellation (the scoring engine):

  1. Citation-Constellation encounters a venue during citation classification and queries Astrolabe's database.
  2. If the venue is known, Astrolabe returns its governance members for cross-matching.
  3. If the venue is unknown, Astrolabe queues it for scraping.
  4. The scraping pipeline fetches the venue's editorial board page and sends the HTML to Pulsar for structured extraction.
  5. Astrolabe performs entity resolution against OpenAlex and stores the results.
  6. On the next analysis, the venue is known.

Over time, the database should converge toward comprehensive coverage of actively cited venues.
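The lookup-or-queue flow in the steps above can be illustrated with a small in-memory stand-in. The `GovernanceStore` class and its methods are hypothetical, and `scrape_and_extract` is a placeholder for the real scraping, Pulsar extraction, and entity resolution stages.

```python
from collections import deque


class GovernanceStore:
    """In-memory stand-in for the Astrolabe database and scrape queue."""

    def __init__(self):
        self._members = {}      # venue_id -> list of member records
        self._queue = deque()   # venue IDs waiting to be scraped
        self._queued = set()    # guards against duplicate queue entries

    def lookup(self, venue_id):
        """Return members for a known venue; queue unknown venues."""
        if venue_id in self._members:
            return self._members[venue_id]
        if venue_id not in self._queued:
            self._queue.append(venue_id)
            self._queued.add(venue_id)
        return None

    def process_queue(self, scrape_and_extract):
        """Drain the queue, storing whatever the pipeline returns."""
        while self._queue:
            venue_id = self._queue.popleft()
            self._queued.discard(venue_id)
            self._members[venue_id] = scrape_and_extract(venue_id)
```

In this sketch the first `lookup` for an unknown venue returns `None` and queues it; once `process_queue` has run, the same `lookup` returns the stored members, which corresponds to step 6.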

Planned External Dependencies

| Service | Purpose | Cost |
| --- | --- | --- |
| Pulsar | LLM inference for structured extraction | Self-hosted |
| OpenAlex API | Venue metadata, author profiles for entity resolution | Free |
| ORCID Public API | Identity validation during entity resolution | Free |
| Cloudflare Crawl API | JS-rendered venue pages (fallback) | Free tier (6 req/min) |

No commercial API keys or institutional subscriptions will be required.

The Database as an Independent Resource

While Astrolabe is designed to feed Citation-Constellation's Phase 4 classification, the venue governance database it produces is valuable in its own right. Potential independent uses include:

  • Editorial board diversity studies — Who governs academic venues, and how does composition vary by field, geography, and gender?
  • Governance–citation relationship analysis — Do venues with editorial board members from a researcher's network cite that researcher more frequently?
  • Temporal governance tracking — How do editorial boards evolve over time? What is the typical tenure of a board member?
  • Venue governance lookup — A simple tool for researchers to see who governs the journals they publish in or review for.

These use cases require no integration with Citation-Constellation and can be pursued independently once the database reaches sufficient coverage.

Status

🔧 In development. The scraping pipeline, LLM extraction prompts, entity resolution logic, and database schema are being designed and prototyped. This repository will contain the full pipeline, database migrations, CLI tools, and documentation once the initial version is ready.

Part of citation-cosmograph

| Repo | Role |
| --- | --- |
| citation-pulsar-helm 🌟 | LLM inference on Kubernetes (the signal) |
| citation-astrolabe 🔭 | Venue governance database (the instrument) |
| citation-constellation | BARON & HEROCON scoring (the map) |

Phased Implementation Architecture Diagram


Future Roadmap Diagram

Paper

For the full methodology, conceptual foundations, tool landscape comparison, discussion of responsible research assessment alignment, and detailed limitations analysis, see the accompanying research paper:

Mahbub Ul Alam. Where do your citations come from? Citation-Constellation: A free, open-source, no-code, and auditable tool for citation network decomposition with complementary BARON and HEROCON scores, 2026. URL: https://arxiv.org/abs/2603.24216, arXiv:2603.24216, doi:10.48550/arXiv.2603.24216.

The paper is also available embedded within the web tool under the Full Research Paper tab.

BibTeX

@misc{alam2026citationconstellation,
  title={Where Do Your Citations Come From? {C}itation-{C}onstellation: A Free, Open-Source, No-Code, and Auditable Tool for Citation Network Decomposition with Complementary {BARON} and {HEROCON} Scores},
  author={Mahbub Ul Alam},
  year={2026},
  eprint={2603.24216},
  archivePrefix={arXiv},
  primaryClass={cs.DL},
  url={https://arxiv.org/abs/2603.24216},
  doi={10.48550/arXiv.2603.24216}
}

Acknowledgements

Powered by OpenAlex, ORCID, and ROR.


License

MIT
