feat: MCP server + live API upgrades (PubMed, ClinVar, gnomAD, ClinGen) #66

Open. zzgael wants to merge 6 commits into victormar1:prod from zzgael:feat/mcp-server-and-api-enhancements

zzgael commented Mar 24, 2026

Summary

This PR adds an MCP (Model Context Protocol) server to PubMatcher and upgrades the data-fetching layer to use live APIs instead of static files and web scraping.

All changes are made directly in utils/ and services/, so both the existing web app and the new MCP server share the same code — no duplication.

Data layer upgrades (backward-compatible)

| Source | Before | After |
|---|---|---|
| PubMed | Cheerio web scraping | NCBI E-utilities API (abstracts, DOIs, authors, journals) |
| ClinVar | Static JSON (BDD/clinvarCountsPerGene.json) | Live NCBI ClinVar API + direct URLs |
| gnomAD | Static CSV (BDD/constraints_v2.csv, v4.csv) | GraphQL API (works for any gene, extra fields) |
| ClinGen | Static CSV (BDD/gene_validity.csv) | Live download from ClinGen endpoint |
| OMIM | No caching, no timeout | Caching + 30s timeout + error handling |
| All sources | Sequential execution, no caching, no rate limiting | Parallel via Promise.allSettled, 1h cache TTL, 350ms NCBI rate limiter with retry |

The web app's getData(req) function returns the exact same keys as before — the frontend doesn't need any changes.

New shared infrastructure

  • utils/cache.js — In-memory cache with configurable TTL
  • utils/rateLimiter.js — Mutex-based queue with exponential backoff retry (429, 5xx)
  • utils/formatters.js — Markdown formatters for MCP output (testable CJS module)
  • services/dataservice.js — Exports both analyzeGenes() (flat, web app) and analyzeGenesStructured() (per-source, MCP) + validateGene() and searchLiterature()
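
To illustrate the in-memory TTL cache listed above, here is a minimal sketch. It is not the contents of utils/cache.js: the class name `TtlCache`, the injectable `now` clock, and the lazy eviction on read are all assumptions made for the example. Only the 1-hour default TTL comes from this PR.

```javascript
// Hypothetical sketch of an in-memory cache with a configurable TTL.
// Not the PR's utils/cache.js; names and eviction strategy are assumptions.
class TtlCache {
  constructor(ttlMs = 60 * 60 * 1000, now = () => Date.now()) {
    this.ttlMs = ttlMs; // default: 1 hour, as described in the PR
    this.now = now; // injectable clock so tests do not need to wait
    this.entries = new Map();
  }

  set(key, value) {
    this.entries.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }

  get(key) {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.entries.delete(key); // evict expired entries lazily, on read
      return undefined;
    }
    return entry.value;
  }
}
```

A caller would typically key entries by source and gene (for example `"clinvar:BRCA1"`), check `get()` first, and only hit the API on a miss.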

MCP server (mcp/)

A thin layer (~70 lines of tool definitions) that imports from services/dataservice.js and utils/formatters.js. Provides 3 tools:

  • analyze_genes — Full 8-database analysis for 1-10 genes
  • validate_gene — Quick HGNC + ClinGen check
  • search_literature — PubMed search with phenotype refinement

Compatible with Claude Desktop, VS Code Copilot, Cursor, and any MCP client.

When a data source fails, the output explicitly warns the LLM (e.g. "ClinVar: unavailable (429 rate limited)") instead of silently showing zeros.
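
A rough sketch of how such a warning block could be rendered is shown here. The function name `formatSourceWarnings` and the `{ source, message }` shape of each entry are assumptions for illustration; the PR's own formatter lives in utils/formatters.js and may differ.

```javascript
// Hypothetical sketch: render failed sources as an explicit Markdown section.
// Assumes sourceErrors is an array of { source, message } objects.
function formatSourceWarnings(sourceErrors) {
  if (!Array.isArray(sourceErrors) || sourceErrors.length === 0) {
    return ""; // nothing failed, so no warning section is emitted
  }
  const lines = sourceErrors.map(
    ({ source, message }) => `- ${source}: unavailable (${message})`
  );
  return ["## Data Source Warnings", "", ...lines, ""].join("\n");
}
```

Placing this section ahead of the per-source output means the model reads the failures before any data, and the sections for failed sources can then be omitted rather than rendered with zeros.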

Test suite (114 tests)

| Suite | Tests | Description |
|---|---|---|
| Unit tests | 97 | Fully mocked (axios/rateLimiter), ~10s |
| Integration tests | 17 | Real APIs, BRCA1+TP53, zero skips, ~17s |

Coverage: 99.64% lines, 86.58% branches.

Run with:

npm test                  # unit tests
npm run test:integration  # real API tests

Related issues

Closes #54 (PubMed API), closes #55 (ClinVar links), closes #56 (OMIM status), closes #40 (gnomAD constraints)

Test plan

  • npm test — 97 unit tests pass
  • npm run test:integration — 17 integration tests pass against real APIs
  • MCP server starts and responds to initialize, tools/list, tools/call
  • validate_gene BRCA1 returns correct gene info
  • search_literature TP53 ["Li-Fraumeni"] returns articles with abstracts
  • analyze_genes ["BRCA1"] ["breast cancer"] returns all 8 data sources
  • validate_gene XYZFAKE returns "not found" error
  • Web app backward compat: getData(req) returns all 25 expected keys
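
For the MCP smoke checks in the list above, the messages a client would write to the server's stdin can be sketched as follows. MCP uses JSON-RPC 2.0, and the stdio transport sends one JSON message per line. The protocol version string, the client name, and the tool argument name `gene` are assumptions for this example; the actual input schema is whatever mcp/ declares.

```javascript
// Sketch of the JSON-RPC 2.0 messages behind initialize, tools/list, tools/call.
let nextId = 1;

function rpcRequest(method, params) {
  return { jsonrpc: "2.0", id: nextId++, method, params };
}

// The stdio transport is newline-delimited: one JSON object per line.
function frame(message) {
  return JSON.stringify(message) + "\n";
}

const initialize = rpcRequest("initialize", {
  protocolVersion: "2024-11-05", // assumption: any version the server accepts
  capabilities: {},
  clientInfo: { name: "smoke-test", version: "0.0.0" }, // hypothetical client
});

// Notifications carry no id; this one follows a successful initialize.
const initialized = { jsonrpc: "2.0", method: "notifications/initialized" };

const listTools = rpcRequest("tools/list", {});

const callValidate = rpcRequest("tools/call", {
  name: "validate_gene",
  arguments: { gene: "BRCA1" }, // assumption: argument name is hypothetical
});

const script = [initialize, initialized, listTools, callValidate].map(frame).join("");
```

Piping `script` into the server process and reading one JSON line per response is enough to exercise the handshake and a single tool call by hand.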

zzgael added 6 commits March 24, 2026 20:16
Major upgrade to the data-fetching layer. All improvements are made
directly in utils/, so both the web app and the new MCP server benefit.

Upgrades to existing utils/ (backward-compatible return shapes):

- PubMed (closes victormar1#54): NCBI E-utilities API replaces cheerio web scraping.
  Returns abstracts, DOIs, full author lists, journal info.
- ClinVar (closes victormar1#55): Live NCBI ClinVar API replaces static BDD/clinvarCountsPerGene.json.
  Always up-to-date, adds totalPathogenic/totalLikelyPathogenic + direct ClinVar URLs.
- gnomAD (closes victormar1#40): GraphQL API replaces static CSV files (BDD/constraints_v2.csv, v4.csv).
  Works for any gene, adds oe_mis, oe_lof, lof_z fields + gnomAD URL.
- ClinGen: Live download from search.clinicalgenome.org replaces BDD/gene_validity.csv.
- OMIM (closes victormar1#56): Added caching, timeouts, proper error handling.
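
As a sketch of the PubMed change, the two E-utilities calls (esearch for PMIDs and the hit count, efetch for article records) can be built like this. The endpoint paths and the `db`, `term`, `id`, `retmode`, and `retmax` parameters are standard E-utilities parameters; the helper names and the way the gene and phenotypes are combined into a search term are assumptions, not the PR's actual query logic.

```javascript
// Hypothetical helpers for building NCBI E-utilities URLs (no network calls).
const EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils";

// Assumption: restrict matches to title/abstract and OR the phenotypes together.
function buildPubMedTerm(gene, phenotypes = []) {
  const clauses = [`${gene}[Title/Abstract]`];
  if (phenotypes.length > 0) {
    const anyPhenotype = phenotypes
      .map((p) => `"${p}"[Title/Abstract]`)
      .join(" OR ");
    clauses.push(`(${anyPhenotype})`);
  }
  return clauses.join(" AND ");
}

// esearch returns matching PMIDs and the total count.
function esearchUrl(term, retmax = 20) {
  const params = new URLSearchParams({
    db: "pubmed",
    term,
    retmode: "json",
    retmax: String(retmax),
  });
  return `${EUTILS_BASE}/esearch.fcgi?${params}`;
}

// efetch returns full records (abstract, authors, journal, DOI) as XML.
function efetchUrl(pmids) {
  const params = new URLSearchParams({
    db: "pubmed",
    id: pmids.join(","),
    retmode: "xml",
  });
  return `${EUTILS_BASE}/efetch.fcgi?${params}`;
}
```

Both URLs would then go through the shared NCBI rate limiter, since unauthenticated clients are limited to 3 requests per second.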

New shared infrastructure:
- utils/cache.js: In-memory cache with 1-hour TTL for all API responses.
- utils/rateLimiter.js: 350ms throttle for NCBI APIs (3 req/s without API key).

dataservice.js improvements:
- Promise.allSettled() runs all 7 data sources in parallel (was sequential).
- Exported analyzeGenes() function for programmatic use (MCP server, scripts).
- One failing source no longer blocks the rest — returns partial results.

MCP server (mcp/):
- Thin layer importing from ../utils/ via createRequire (no code duplication).
- 3 tools: analyze_genes, validate_gene, search_literature.
- Compatible with Claude Desktop, VS Code Copilot, Cursor, etc.

Test suite (tests/):
- Added Jest + 17 tests covering cache, PubMed, ClinVar, gnomAD, dataservice.
- Tests handle NCBI rate limiting gracefully (429 resilient).

Upgrade data layer (utils/) to use live APIs, add MCP server, add
properly mocked test suite (21 tests, no network calls).

Data layer upgrades (backward-compatible return shapes):
- PubMed (victormar1#54): E-utilities API replaces cheerio web scraping
- ClinVar (victormar1#55): Live NCBI API replaces static JSON (BDD/clinvarCountsPerGene.json)
- gnomAD (victormar1#40): GraphQL API replaces static CSV (BDD/constraints_v2.csv, v4.csv)
- ClinGen: Live download replaces BDD/gene_validity.csv
- OMIM (victormar1#56): Added caching, timeouts, error handling
- All utils: caching (1h TTL), rate limiting (350ms NCBI), timeouts
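
For the gnomAD item above, a GraphQL request for constraint metrics might be assembled as in this sketch. The field names `oe_lof`, `oe_mis`, and `lof_z` are the ones named in this PR; the query shape, the GRCh38 reference genome, and the helper names are assumptions that should be checked against the live gnomAD schema before use.

```javascript
// Hypothetical sketch: build a gnomAD GraphQL request and read the response.
const GNOMAD_API = "https://gnomad.broadinstitute.org/api";

// Assumption: query shape and reference genome; verify against the schema.
const CONSTRAINT_QUERY = `
  query GeneConstraint($symbol: String!) {
    gene(gene_symbol: $symbol, reference_genome: GRCh38) {
      gnomad_constraint { oe_lof oe_mis lof_z }
    }
  }
`;

// Returns the URL and fetch-style options; performs no network call itself.
function buildConstraintRequest(symbol) {
  return {
    url: GNOMAD_API,
    options: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({ query: CONSTRAINT_QUERY, variables: { symbol } }),
    },
  };
}

// Tolerates a missing gene or missing individual fields by returning nulls.
function extractConstraintFields(payload) {
  const constraint = payload?.data?.gene?.gnomad_constraint;
  if (!constraint) return null;
  return {
    oe_lof: constraint.oe_lof ?? null,
    oe_mis: constraint.oe_mis ?? null,
    lof_z: constraint.lof_z ?? null,
  };
}
```

Passing the gene symbol as a GraphQL variable rather than interpolating it into the query string avoids quoting problems with unusual symbols.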

dataservice.js: Promise.allSettled for parallel execution, exported
analyzeGenes() for programmatic use.

MCP server (mcp/): Thin layer importing from ../utils/ via createRequire.
3 tools: analyze_genes, validate_gene, search_literature.

Test suite (tests/): Jest with axios mocks — deterministic, fast, no
external API dependency. Covers PubMed parsing, ClinVar queries, gnomAD
GraphQL, cache behavior, error handling, and full dataservice pipeline.

…t coverage

Rate limiter:
- Mutex-based queue serializes concurrent requests (no more race conditions)
- Exponential backoff retry on 429/5xx (1s, 2s, then give up)
- resetLimiter() for clean test isolation
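
The queue-plus-backoff behaviour described above can be sketched as follows. This is not utils/rateLimiter.js: `createLimiter`, its option names, and the use of a promise chain as the mutex are assumptions for illustration. The defaults mirror the numbers in this PR (350ms spacing, retries after 1s and 2s, then give up), and the `err.response.status` lookup assumes axios-style errors.

```javascript
// Hypothetical sketch of a serializing rate limiter with exponential backoff.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

function createLimiter({ minIntervalMs = 350, retries = 2, baseDelayMs = 1000 } = {}) {
  let tail = Promise.resolve(); // promise chain acting as a mutex/queue
  let lastStart = 0;

  // Retry only on 429 and 5xx; other 4xx errors fail immediately.
  const isRetryable = (status) => status === 429 || (status >= 500 && status < 600);

  async function runOnce(task) {
    const wait = lastStart + minIntervalMs - Date.now();
    if (wait > 0) await sleep(wait); // enforce spacing between request starts
    lastStart = Date.now();
    return task();
  }

  async function runWithRetry(task) {
    for (let attempt = 0; ; attempt++) {
      try {
        return await runOnce(task);
      } catch (err) {
        const status = err?.response?.status ?? err?.status;
        if (attempt >= retries || !isRetryable(status)) throw err;
        await sleep(baseDelayMs * 2 ** attempt); // 1s, then 2s with defaults
      }
    }
  }

  return function schedule(task) {
    const result = tail.then(() => runWithRetry(task));
    tail = result.catch(() => {}); // a failed task must not block the queue
    return result;
  };
}
```

One design consequence of this sketch is that a task being retried keeps its place at the head of the queue, so other requests wait out the backoff too. That is usually what you want when the upstream has just answered 429.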

ClinVar:
- 6 queries now fire via Promise.all through the rate limiter queue
  (queue still serializes them, but code expresses correct intent)

dataservice.js:
- New analyzeGenesStructured() returns per-source data + sourceErrors array
- analyzeGenes() flattens for web app backward compat
- Source errors detected from both rejected promises AND internal error fields
- MCP server imports analyzeGenesStructured (no more duplicated orchestration)
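
The structured/flat split described above could look roughly like this sketch. The names `collectSources` and `flatten`, the `sources` map of async fetchers, and the `{ source, message }` error shape are all hypothetical. The part taken from this PR is the idea itself: run sources through `Promise.allSettled`, record an error both when a promise rejects and when a fulfilled result carries an `error` field, and flatten the per-source data for the web app.

```javascript
// Hypothetical sketch of per-source collection with explicit error tracking.
// `sources` maps a source name to a function returning a promise of an object.
async function collectSources(gene, sources) {
  const names = Object.keys(sources);
  const settled = await Promise.allSettled(
    // Wrapping in a promise chain also captures synchronous throws.
    names.map((name) => Promise.resolve().then(() => sources[name](gene)))
  );

  const data = {};
  const sourceErrors = [];
  settled.forEach((outcome, i) => {
    const name = names[i];
    if (outcome.status === "rejected") {
      const reason = outcome.reason;
      sourceErrors.push({ source: name, message: String(reason?.message ?? reason) });
    } else if (outcome.value && outcome.value.error) {
      // The fetcher resolved but reported a failure internally.
      sourceErrors.push({ source: name, message: String(outcome.value.error) });
    } else {
      data[name] = outcome.value;
    }
  });
  return { gene, data, sourceErrors };
}

// Merge per-source objects into one flat object, as a web handler might expect.
function flatten(structured) {
  return Object.assign({ gene: structured.gene }, ...Object.values(structured.data));
}
```

In this sketch a failed source contributes no keys at all to the flat object. Whether the real `analyzeGenes()` fills in defaults to keep the web app's key set stable is not something this example tries to reproduce.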

MCP server:
- Sources that failed show "Data Source Warnings" section at the top
  so the LLM explicitly tells users which data is unavailable
- Sections for failed sources are skipped (no silent zeros)

Test suite (50 tests, 11 suites):
- Every util has its own test file with mocked HTTP
- rateLimiter tested for queue serialization, retry, backoff, no-retry on 4xx
- Proper separation: util tests mock rateLimitedGet, not axios
- dataservice tests verify structured output, error tracking, flat compat

Unit tests (65 tests, 11 suites):
- 99.55% line coverage, 97.98% statement coverage
- Added: ClinGen header-not-found, ClinGen download failure, ClinGen
  duplicate ranking, HGNC non-200, UniProt keywords failure, OMIM
  missing description, PubMed abstract variants (string/array/object),
  PubMed >5 authors truncation, gnomAD pLI=0 delta, gnomAD N/A delta,
  dataservice rejected promise path, getData Express handler
- Module singletons (ClinGen validity, UniProt keywords) properly
  isolated via jest.resetModules() between tests

Integration tests (15 tests, real APIs):
- BRCA1 full analysis: gene info, ClinGen Definitive, PubMed >10k
  articles with metadata, ClinVar >100 pathogenic, gnomAD v2+v4,
  UniProt function contains DNA/repair, IMPC phenotypes, PanelApp,
  OMIM breast/ovarian cancer
- TP53 analysis: PubMed >20k, ClinVar >500 pathogenic, gnomAD
  constraint data present
- Invalid gene: returns valid=false
- Run separately: npm run test:integration

MCP server cleanup:
- Extracted formatters to utils/formatters.js (testable CJS module)
- server.js now imports only from dataservice.js + formatters.js
- validateGene and searchLiterature moved to dataservice.js
- No more direct util imports in server.js — single source of truth

Integration tests (17 tests):
- Removed ALL error skips — tests MUST pass unconditionally
- Asserts `error` is undefined for every source
- Asserts `sourceErrors` is empty array
- Tests validateGene and searchLiterature via real APIs
- OMIM timeout increased to 30s for Ensembl reliability

Unit tests (85 tests, 13 suites):
- Added formatter tests (formatGeneAnalysis, formatValidation,
  formatLiterature, formatSourceErrors) with full branch coverage
- Added validateGene + searchLiterature unit tests
- Added getGeneConstraints extractConstraints field coverage
- Added getClinVarData undefined count branch
- Branch coverage: 82.92%, Lines: 99.46%

Coverage: 99.64% lines, 86.58% branches, 97.22% functions.

Added tests for:
- PubMed: bare article (no abstract/authors/journal/DOI), direct PMID,
  empty efetch response, NaN count from esearch
- Formatters: missing fields (no authors, no DOI, no geneLink, null
  pubmedUrl), PanelApp error strings, flat array phenotypes,
  constraintsDelta rendering, partial citations
- gnomAD: extractConstraints with null individual fields

Remaining uncovered branches are defensive optional chaining (?.)
in XML parsing — require malformed API responses to trigger.

97 unit tests (13 suites) + 17 integration tests = 114 total.