feat: MCP server + live API upgrades (PubMed, ClinVar, gnomAD, ClinGen) by zzgael · Pull Request #66 · victormar1/PubMatcher

zzgael · 2026-03-24T21:35:39Z

Summary

This PR adds an MCP (Model Context Protocol) server to PubMatcher and upgrades the data-fetching layer to use live APIs instead of static files and web scraping.

All changes are made directly in utils/ and services/, so both the existing web app and the new MCP server share the same code — no duplication.

Data layer upgrades (backward-compatible)

| Source | Before | After |
| --- | --- | --- |
| PubMed | Cheerio web scraping | NCBI E-utilities API (abstracts, DOIs, authors, journals) |
| ClinVar | Static JSON (`BDD/clinvarCountsPerGene.json`) | Live NCBI ClinVar API + direct URLs |
| gnomAD | Static CSV (`BDD/constraints_v2.csv`, `v4.csv`) | GraphQL API (works for any gene, extra fields) |
| ClinGen | Static CSV (`BDD/gene_validity.csv`) | Live download from ClinGen endpoint |
| OMIM | No caching, no timeout | Caching + 30s timeout + error handling |
| All sources | Sequential execution, no caching, no rate limiting | Parallel via `Promise.allSettled`, 1h cache TTL, 350ms NCBI rate limiter with retry |
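
The parallel execution in the last row can be sketched roughly as follows. This is a minimal illustration, not the PR's actual code: `fetchAllSources`, `collectResults`, and the shape of the `fetchers` map are hypothetical names chosen for the example.

```javascript
// Hypothetical helper: turns Promise.allSettled output into per-source data
// plus a list of errors, so one failing source does not hide the others.
function collectResults(names, settled) {
  const data = {};
  const sourceErrors = [];
  settled.forEach((result, i) => {
    if (result.status === "fulfilled") {
      data[names[i]] = result.value;
    } else {
      const reason = result.reason;
      sourceErrors.push({
        source: names[i],
        message: String(reason?.message ?? reason),
      });
    }
  });
  return { data, sourceErrors };
}

// `fetchers` is an assumed map of source name -> function returning a promise.
async function fetchAllSources(fetchers) {
  const names = Object.keys(fetchers);
  // The async arrow also converts synchronous throws into rejections.
  const settled = await Promise.allSettled(names.map(async (n) => fetchers[n]()));
  return collectResults(names, settled);
}
```

Unlike `Promise.all`, `Promise.allSettled` never rejects, which is what makes partial results possible when a single source is down.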

The web app's getData(req) function returns the exact same keys as before — the frontend doesn't need any changes.

New shared infrastructure

- utils/cache.js: In-memory cache with configurable TTL
- utils/rateLimiter.js: Mutex-based queue with exponential backoff retry (429, 5xx)
- utils/formatters.js: Markdown formatters for MCP output (testable CJS module)
- services/dataservice.js: Exports analyzeGenes() (flat, for the web app) and analyzeGenesStructured() (per-source, for MCP), plus validateGene() and searchLiterature()
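
For readers unfamiliar with the pattern, an in-memory TTL cache can be sketched in a few lines. The real utils/cache.js may look quite different; `createCache` and its injectable `now` clock are assumptions made for this example. The 1-hour default mirrors the TTL stated in this PR.

```javascript
// Hypothetical TTL cache sketch; not the PR's utils/cache.js.
// `now` is injectable so expiry can be tested without real waiting.
function createCache(defaultTtlMs = 60 * 60 * 1000, now = Date.now) {
  const store = new Map();
  return {
    get(key) {
      const entry = store.get(key);
      if (!entry) return undefined;
      if (now() >= entry.expiresAt) {
        store.delete(key); // lazily evict expired entries on read
        return undefined;
      }
      return entry.value;
    },
    set(key, value, ttlMs = defaultTtlMs) {
      store.set(key, { value, expiresAt: now() + ttlMs });
    },
    clear() {
      store.clear();
    },
  };
}
```

Entries are evicted lazily on read, which keeps the sketch free of timers; a long-running process with many distinct keys would probably want periodic cleanup as well.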

MCP server (`mcp/`)

A thin layer (~70 lines of tool definitions) that imports from services/dataservice.js and utils/formatters.js. Provides 3 tools:

- analyze_genes: Full 8-database analysis for 1-10 genes
- validate_gene: Quick HGNC + ClinGen check
- search_literature: PubMed search with phenotype refinement

Compatible with Claude Desktop, VS Code Copilot, Cursor, and any MCP client.

When a data source fails, the output explicitly warns the LLM (e.g. "ClinVar: unavailable (429 rate limited)") instead of silently showing zeros.
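
A warning block of this kind could be rendered along the following lines. This is only a sketch: `renderSourceWarnings` and the `{ source, message }` error shape are hypothetical, and the PR's own formatter in utils/formatters.js may produce different Markdown.

```javascript
// Hypothetical formatter: renders failed sources as a Markdown section
// so the LLM sees explicit warnings instead of misleading zeros.
function renderSourceWarnings(sourceErrors) {
  if (!sourceErrors || sourceErrors.length === 0) return "";
  const lines = sourceErrors.map(
    (e) => `- ${e.source}: unavailable (${e.message})`
  );
  return ["## Data Source Warnings", ...lines, ""].join("\n");
}

console.log(
  renderSourceWarnings([{ source: "ClinVar", message: "429 rate limited" }])
);
```

Returning an empty string when nothing failed lets the caller prepend the result unconditionally.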

Test suite (114 tests)

| Suite | Tests | Description |
| --- | --- | --- |
| Unit tests | 97 | Fully mocked (axios/rateLimiter), ~10s |
| Integration tests | 17 | Real APIs, BRCA1+TP53, zero skips, ~17s |

Coverage: 99.64% lines, 86.58% branches.

Run with:

npm test                  # unit tests
npm run test:integration  # real API tests

Related issues

Closes #54 (PubMed API), closes #55 (ClinVar links), closes #56 (OMIM status), closes #40 (gnomAD constraints)

Test plan

- npm test: 97 unit tests pass
- npm run test:integration: 17 integration tests pass against real APIs
- MCP server starts and responds to initialize, tools/list, tools/call
- validate_gene BRCA1 returns correct gene info
- search_literature TP53 ["Li-Fraumeni"] returns articles with abstracts
- analyze_genes ["BRCA1"] ["breast cancer"] returns all 8 data sources
- validate_gene XYZFAKE returns "not found" error
- Web app backward compat: getData(req) returns all 25 expected keys
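
To reproduce the MCP checks by hand, the three requests are ordinary JSON-RPC 2.0 messages. The sketch below only builds and prints them. The tool argument name `gene`, the client name, and the protocol version string are assumptions; check the tool schemas returned by tools/list for the real parameter names.

```javascript
// Builds JSON-RPC 2.0 requests of the kind an MCP client sends.
let nextId = 1;
function rpcRequest(method, params) {
  return { jsonrpc: "2.0", id: nextId++, method, params };
}

const initialize = rpcRequest("initialize", {
  protocolVersion: "2024-11-05", // assumed; use the version your client supports
  capabilities: {},
  clientInfo: { name: "manual-smoke-test", version: "0.0.0" }, // illustrative
});

const listTools = rpcRequest("tools/list", {});

const callValidate = rpcRequest("tools/call", {
  name: "validate_gene",
  arguments: { gene: "BRCA1" }, // argument name "gene" is a guess
});

// Over a stdio transport, each message is sent as one line of JSON.
for (const message of [initialize, listTools, callValidate]) {
  console.log(JSON.stringify(message));
}
```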

Major upgrade to the data-fetching layer. All improvements are made directly in utils/, so both the web app and the new MCP server benefit.

Upgrades to existing utils/ (backward-compatible return shapes):
- PubMed (closes victormar1#54): NCBI E-utilities API replaces cheerio web scraping. Returns abstracts, DOIs, full author lists, journal info.
- ClinVar (closes victormar1#55): Live NCBI ClinVar API replaces static BDD/clinvarCountsPerGene.json. Always up-to-date, adds totalPathogenic/totalLikelyPathogenic + direct ClinVar URLs.
- gnomAD (closes victormar1#40): GraphQL API replaces static CSV files (BDD/constraints_v2.csv, v4.csv). Works for any gene, adds oe_mis, oe_lof, lof_z fields + gnomAD URL.
- ClinGen: Live download from search.clinicalgenome.org replaces BDD/gene_validity.csv.
- OMIM (closes victormar1#56): Added caching, timeouts, proper error handling.

New shared infrastructure:
- utils/cache.js: In-memory cache with 1-hour TTL for all API responses.
- utils/rateLimiter.js: 350ms throttle for NCBI APIs (3 req/s without API key).

dataservice.js improvements:
- Promise.allSettled() runs all 7 data sources in parallel (was sequential).
- Exported analyzeGenes() function for programmatic use (MCP server, scripts).
- One failing source no longer blocks the rest; partial results are returned.

MCP server (mcp/):
- Thin layer importing from ../utils/ via createRequire (no code duplication).
- 3 tools: analyze_genes, validate_gene, search_literature.
- Compatible with Claude Desktop, VS Code Copilot, Cursor, etc.

Test suite (tests/):
- Added Jest + 17 tests covering cache, PubMed, ClinVar, gnomAD, dataservice.
- Tests handle NCBI rate limiting gracefully (429 resilient).

Upgrade data layer (utils/) to use live APIs, add MCP server, add properly mocked test suite (21 tests, no network calls).

Data layer upgrades (backward-compatible return shapes):
- PubMed (victormar1#54): E-utilities API replaces cheerio web scraping
- ClinVar (victormar1#55): Live NCBI API replaces static JSON (BDD/clinvarCountsPerGene.json)
- gnomAD (victormar1#40): GraphQL API replaces static CSV (BDD/constraints_v2.csv, v4.csv)
- ClinGen: Live download replaces BDD/gene_validity.csv
- OMIM (victormar1#56): Added caching, timeouts, error handling
- All utils: caching (1h TTL), rate limiting (350ms NCBI), timeouts

dataservice.js: Promise.allSettled for parallel execution, exported analyzeGenes() for programmatic use.

MCP server (mcp/): Thin layer importing from ../utils/ via createRequire. 3 tools: analyze_genes, validate_gene, search_literature.

Test suite (tests/): Jest with axios mocks: deterministic, fast, no external API dependency. Covers PubMed parsing, ClinVar queries, gnomAD GraphQL, cache behavior, error handling, and full dataservice pipeline.
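
As an illustration of what replacing the scraper with E-utilities involves, the sketch below builds an ESearch request URL for PubMed. The endpoint and the `db`, `term`, `retmode`, and `retmax` parameters are standard E-utilities; the function name `buildEsearchUrl` and the way gene and phenotype terms are combined are assumptions, not necessarily how this PR composes its queries.

```javascript
const EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";

// Hypothetical helper: builds a PubMed ESearch URL for a gene symbol,
// optionally narrowed by quoted phenotype phrases joined with AND.
function buildEsearchUrl(gene, phenotypes = [], retmax = 5) {
  const terms = [gene, ...phenotypes.map((p) => `"${p}"`)];
  const params = new URLSearchParams({
    db: "pubmed",
    term: terms.join(" AND "),
    retmode: "json",
    retmax: String(retmax),
  });
  return `${EUTILS_BASE}/esearch.fcgi?${params}`;
}

console.log(buildEsearchUrl("TP53", ["Li-Fraumeni"]));
```

The PMIDs returned by ESearch would then be passed to EFetch to retrieve abstracts and author lists.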

…t coverage

Rate limiter:
- Mutex-based queue serializes concurrent requests (no more race conditions)
- Exponential backoff retry on 429/5xx (1s, 2s, then give up)
- resetLimiter() for clean test isolation

ClinVar:
- 6 queries now fire via Promise.all through the rate limiter queue (queue still serializes them, but code expresses correct intent)

dataservice.js:
- New analyzeGenesStructured() returns per-source data + sourceErrors array
- analyzeGenes() flattens for web app backward compat
- Source errors detected from both rejected promises AND internal error fields
- MCP server imports analyzeGenesStructured (no more duplicated orchestration)

MCP server:
- Sources that failed show "Data Source Warnings" section at the top so the LLM explicitly tells users which data is unavailable
- Sections for failed sources are skipped (no silent zeros)

Test suite (50 tests, 11 suites):
- Every util has its own test file with mocked HTTP
- rateLimiter tested for queue serialization, retry, backoff, no-retry on 4xx
- Proper separation: util tests mock rateLimitedGet, not axios
- dataservice tests verify structured output, error tracking, flat compat
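
The rate limiter behaviour described above (serialized queue, minimum spacing, retry on 429/5xx after 1s and 2s) can be sketched as below. This is an illustration under assumptions, not the PR's utils/rateLimiter.js: `createLimiter`, `schedule`, and the injectable `sleep` are hypothetical names, and errors are assumed to carry an axios-style `response.status`.

```javascript
// Hypothetical limiter sketch: a promise chain acts as the mutex/queue.
function createLimiter({ minIntervalMs = 350, backoffMs = [1000, 2000], sleep } = {}) {
  const wait = sleep ?? ((ms) => new Promise((resolve) => setTimeout(resolve, ms)));
  let tail = Promise.resolve(); // every task is chained after the previous one
  let lastStart = 0;

  const isRetryable = (err) => {
    const status = err?.response?.status; // axios-style error shape (assumed)
    return status === 429 || (status >= 500 && status < 600);
  };

  async function runWithRetry(task) {
    for (let attempt = 0; ; attempt++) {
      // Enforce the minimum spacing between request starts.
      const gap = lastStart + minIntervalMs - Date.now();
      if (gap > 0) await wait(gap);
      lastStart = Date.now();
      try {
        return await task();
      } catch (err) {
        // Give up on non-retryable errors or once the backoff list is exhausted.
        if (!isRetryable(err) || attempt >= backoffMs.length) throw err;
        await wait(backoffMs[attempt]);
      }
    }
  }

  return function schedule(task) {
    const run = tail.then(() => runWithRetry(task));
    tail = run.catch(() => {}); // a failed task must not block the queue
    return run;
  };
}
```

Because retries happen inside the queued task, a request that is backing off also holds back the requests behind it, which is usually the desired behaviour when the upstream API is already signalling overload.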

Unit tests (65 tests, 11 suites):
- 99.55% line coverage, 97.98% statement coverage
- Added: ClinGen header-not-found, ClinGen download failure, ClinGen duplicate ranking, HGNC non-200, UniProt keywords failure, OMIM missing description, PubMed abstract variants (string/array/object), PubMed >5 authors truncation, gnomAD pLI=0 delta, gnomAD N/A delta, dataservice rejected promise path, getData Express handler
- Module singletons (ClinGen validity, UniProt keywords) properly isolated via jest.resetModules() between tests

Integration tests (15 tests, real APIs):
- BRCA1 full analysis: gene info, ClinGen Definitive, PubMed >10k articles with metadata, ClinVar >100 pathogenic, gnomAD v2+v4, UniProt function contains DNA/repair, IMPC phenotypes, PanelApp, OMIM breast/ovarian cancer
- TP53 analysis: PubMed >20k, ClinVar >500 pathogenic, gnomAD constraint data present
- Invalid gene: returns valid=false
- Run separately: npm run test:integration
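
The "abstract variants (string/array/object)" case arises because XML-to-JSON parsers return different shapes depending on how many abstract sections an article has. A normalizer for this could look like the sketch below. The function name and the `#text` key are assumptions (that key is the default text-node name in some parsers such as fast-xml-parser); the PR's actual parsing code may differ.

```javascript
// Hypothetical normalizer: flattens a parsed abstract node that may be
// a string, an array of sections, or an object with a text child.
function normalizeAbstract(node, textKey = "#text") {
  if (node == null) return "";
  if (typeof node === "string") return node.trim();
  if (Array.isArray(node)) {
    return node
      .map((part) => normalizeAbstract(part, textKey))
      .filter(Boolean) // drop empty sections
      .join(" ");
  }
  if (typeof node === "object") return normalizeAbstract(node[textKey], textKey);
  return String(node); // numbers or booleans from lenient parsers
}

console.log(normalizeAbstract(["Background.", { "#text": "Methods." }]));
```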

MCP server cleanup:
- Extracted formatters to utils/formatters.js (testable CJS module)
- server.js now imports only from dataservice.js + formatters.js
- validateGene and searchLiterature moved to dataservice.js
- No more direct util imports in server.js: single source of truth

Integration tests (17 tests):
- Removed ALL error skips: tests MUST pass unconditionally
- Asserts `error` is undefined for every source
- Asserts `sourceErrors` is empty array
- Tests validateGene and searchLiterature via real APIs
- OMIM timeout increased to 30s for Ensembl reliability

Unit tests (85 tests, 13 suites):
- Added formatter tests (formatGeneAnalysis, formatValidation, formatLiterature, formatSourceErrors) with full branch coverage
- Added validateGene + searchLiterature unit tests
- Added getGeneConstraints extractConstraints field coverage
- Added getClinVarData undefined count branch
- Branch coverage: 82.92%, Lines: 99.46%

Coverage: 99.64% lines, 86.58% branches, 97.22% functions.

Added tests for:
- PubMed: bare article (no abstract/authors/journal/DOI), direct PMID, empty efetch response, NaN count from esearch
- Formatters: missing fields (no authors, no DOI, no geneLink, null pubmedUrl), PanelApp error strings, flat array phenotypes, constraintsDelta rendering, partial citations
- gnomAD: extractConstraints with null individual fields

Remaining uncovered branches are defensive optional chaining (?.) in XML parsing; they require malformed API responses to trigger.

97 unit tests (13 suites) + 17 integration tests = 114 total.

zzgael added 6 commits March 24, 2026 20:16

zzgael mentioned this pull request Mar 24, 2026

Proposal: MCP server + live API upgrades for PubMatcher #67

Open

zzgael wants to merge 6 commits into victormar1:prod from zzgael:feat/mcp-server-and-api-enhancements
