A tool that answers: "If I block gene X with a drug, what diseases will that help, and what might go wrong?"
All humans share ~99.9% of the same DNA, but some people are born with mutations that naturally break a specific gene's function. Such individuals are called natural knockouts. People born with two broken copies of the CCR5 gene are highly resistant to HIV infection, which led directly to Maraviroc. People with a broken HSD17B13 gene have lower rates of chronic liver disease. These people are nature's version of a clinical trial for any drug that targets the same gene. Drugs backed by such genetic evidence are 2.6x more likely to get approved than those without it.
The Works in Progress article Nature's Laboratory, the piece that introduced me to this concept, got me asking:
"Could one automate the search for such genetic variants?"
That rabbit hole led me to Formation Bio's pipeline, which combines genetic database search with LLMs to do exactly that.
Built to demonstrate AI's role in accelerating scientific research, this tool is an open-source alternative built on the same idea for educational purposes.
Set ANTHROPIC_API_KEY in .env. Model used is claude-sonnet-4-6.
Clone the repo:
git clone https://github.com/your-username/knockout-discovery
cd knockout-discovery
Install and run:
pip install -r requirements.txt
streamlit run app.py
Type a drug description, e.g. PCSK9 inhibitor for hypercholesterolemia
- Step 1: Claude Sonnet 4.6 extracts the target gene and modulation direction: Gene: PCSK9 | Direction: Inhibition | Modality: Antibody | Indication: Hypercholesterolemia
- Step 2: Queries Open Targets + FinnGen across 500K+ people and 2,400+ disease phenotypes, and gnomAD for gene constraint scores and LoF variant annotations
  - Open Targets: Hypercholesterolemia | p: 3.2e-45 | beta: -0.82
  - FinnGen: Coronary artery disease | p: 1.1e-12 | beta: -0.31
  - gnomAD: PCSK9 | pLI: 0.01 | LOEUF: 1.19 | LoF variant: rs28362286 | consequence: stop_gained
- Step 3: Filters to statistically significant hits only (p < 5e-8)
- Step 4: Labels each signal as Efficacy (drug works for this disease) or Safety (potential side effect)
- Step 5: Sonnet 4.6 generates an extensive evidence report. Downloadable with data via zip.
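As a rough illustration of Step 3, here is a minimal sketch of the filter. It uses the 5e-8 cutoff from the step above plus the effect-size minimums and per-source deduplication described for src/filters.py. The `Association` class, the constant names, and the `study_type` labels are assumptions made for this example, not the project's actual code, and the last two sample rows are made up to exercise the filter.

```python
from dataclasses import dataclass

# Thresholds taken from the README text; the constant names are hypothetical.
P_VALUE_MAX = 5e-8
MIN_ABS_BETA = {"gene_burden": 0.1, "gwas": 0.05}


@dataclass(frozen=True)
class Association:
    source: str      # e.g. "open_targets" or "finngen" (assumed labels)
    study_type: str  # "gene_burden" or "gwas" (assumed labels)
    disease: str
    p_value: float
    beta: float


def filter_associations(rows):
    """Keep significant hits with a large enough effect, one per (source, disease)."""
    best = {}
    for row in rows:
        if row.p_value >= P_VALUE_MAX:
            continue  # not genome-wide significant
        # Unknown study types get no effect-size minimum in this sketch.
        if abs(row.beta) < MIN_ABS_BETA.get(row.study_type, 0.0):
            continue  # effect too small
        key = (row.source, row.disease.lower())
        # Deduplicate: keep only the most significant hit per source and disease.
        if key not in best or row.p_value < best[key].p_value:
            best[key] = row
    return sorted(best.values(), key=lambda r: r.p_value)


rows = [
    Association("open_targets", "gene_burden", "Hypercholesterolemia", 3.2e-45, -0.82),
    Association("finngen", "gwas", "Coronary artery disease", 1.1e-12, -0.31),
    # Made-up rows: a weaker duplicate and a non-significant hit.
    Association("finngen", "gwas", "Coronary artery disease", 4.0e-9, -0.29),
    Association("finngen", "gwas", "Asthma", 2.0e-3, 0.40),
]
kept = filter_associations(rows)
for row in kept:
    print(f"{row.source}: {row.disease} | p: {row.p_value:.1e} | beta: {row.beta}")
```

With these inputs only the two hits shown under Step 2 survive: the weaker coronary artery disease duplicate is dropped by deduplication, and the asthma row fails the p-value cutoff.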
Note on terminology: loss-of-function (LoF) gene variants are mutations that break or reduce a gene's protein output. If people born with a broken copy of gene X are healthier in some way, that is direct evidence a drug targeting gene X might work. gnomAD catalogs these variants across hundreds of thousands of people and scores how tolerant each gene is to being broken.
Input: PCSK9 inhibitor for hypercholesterolemia
Output:
- Protective signals for hypercholesterolemia and lipid disorders (OR ~0.2-0.4)
- Cardiovascular cascade: coronary artery disease, myocardial infarction, peripheral artery disease
- T2D safety signal (PCSK9 LoF carriers have higher T2D rates)
Input: anti-IL-6 receptor antibody for rheumatoid arthritis
Output:
- Protective signal for coronary artery disease
- Possible atopic dermatitis / asthma safety signals
- Infection risk absent -- FinnGen lacks power here
Input: anti-IL-18 antibody for Crohn's disease
Output:
- No Crohn's signal
- Protective musculoskeletal signals: spondylosis, disc disorders (beta -0.05 to -0.11)
- Direction paradox on atherosclerosis, flagged in the report
Note: I am not a biologist. I just hold an intellectual interest in genetics and compbio. To build an intuition for the biology involved, I had Opus 4.6 create a 50-page textbook that let me grasp the concepts from the ground up.
- MoA Parsing (src/moa_parser.py): Claude Sonnet 4.6 takes the plain-text drug description and extracts the target gene symbol(s), whether the drug inhibits or activates the target, the modality (antibody, small molecule, etc.), and the proposed indication. Returns structured JSON, e.g. Gene: PCSK9 | Direction: Inhibition | Modality: Antibody | Indication: Hypercholesterolemia
- Gene Resolution (src/gene_resolver.py): Takes the gene symbol (e.g. IL18) and queries the MyGene.info API to get the stable Ensembl ID (e.g. ENSG00000150782). This ID is what the databases use to identify genes.
- Data Retrieval -- Two sources queried for every target gene:
  - Open Targets (src/open_targets.py): Pulls three things: (a) all diseases associated with the gene, (b) GWAS credible sets (variant-level association data), and (c) gene burden test results (rare variant collapse tests that directly measure what happens when a gene is broken), e.g. Hypercholesterolemia | p: 3.2e-45 | beta: -0.82
  - FinnGen (src/finngen.py): Gets PheWAS results for the gene across ~500K Finnish individuals and 2,400+ diseases. Returns beta (effect size) and p-value for every phenotype, e.g. Coronary artery disease | p: 1.1e-12 | beta: -0.31
- Variant Annotation (src/gnomad.py): Queries gnomAD (which catalogs genetic variation across 800K+ individuals) for two things: (a) the gene's constraint scores (pLI and LOEUF), which tell you how tolerant the gene is to being broken. A highly constrained gene means nature punishes loss-of-function mutations, so fewer natural knockouts exist in the gene pool. (b) A list of high-confidence loss-of-function (LoF) variants in the gene, the mutations that actually break or reduce protein output, e.g. PCSK9 | pLI: 0.01 | LOEUF: 1.19 | LoF variant: rs28362286 | consequence: stop_gained
- Filtering (src/filters.py): Filters out raw associations that do not pass two thresholds: the p-value must be below 5e-8 (genome-wide significance, to account for testing thousands of phenotypes at once) and the effect size must meet a minimum (|beta| >= 0.1 for gene burden, >= 0.05 for GWAS). Deduplicates so the same disease from the same source only appears once (keeps the most significant hit). All thresholds live in src/config.py.
- Direction-of-Effect (src/interpretation.py): The core logic. For each filtered association, determines whether the signal means "the drug should help with this disease" (efficacy) or "the drug might cause this as a side effect" (safety). The logic: if the drug is an inhibitor and the variant is loss-of-function (both reduce the protein), then a negative beta (protective) means an efficacy signal, and a positive beta (increased risk) means a safety signal. The logic flips for activator drugs or gain-of-function variants. Labels each association as EFFICACY_SIGNAL, SAFETY_SIGNAL, or UNKNOWN.
- Phenotype Harmonization (src/harmonizer.py): The same disease shows up under different names across databases ("Type 2 diabetes" vs "T2DM" vs a FinnGen endpoint code). Sonnet 4.6 merges duplicates into a canonical name and groups everything by therapeutic area (cardiovascular, immunology, etc.).
- Replication Scoring (src/replication.py): Checks whether each signal was found in more than one data source (e.g. both Open Targets gene burden and FinnGen). A signal replicated across independent biobanks is much more trustworthy than one found in a single source. Tags each result with replicated: true/false and lists sources.
- Literature Context (src/literature.py): Sonnet 4.6 looks up the target gene's protein function, known disease roles, existing drugs, expected genetic signals, and cases where the biology is complex or contradictory, from published research. This gets fed into the report generator.
- Report Generation (src/report_generator.py): Sonnet 4.6 combines the signals and literature context to write a structured report that covers indication support, per-target analysis, additional opportunities, safety concerns, evidence strength, and limitations.
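The direction-of-effect rule described for src/interpretation.py can be sketched roughly as below. This is an illustration under assumptions, not the project's implementation: the function name, the string spellings for drug direction and variant effect, and the treatment of a zero or missing beta as UNKNOWN are all hypothetical. It also assumes that an activator drug paired with a gain-of-function variant lines up again, so the two flips cancel.

```python
from enum import Enum


class Label(str, Enum):
    EFFICACY_SIGNAL = "EFFICACY_SIGNAL"
    SAFETY_SIGNAL = "SAFETY_SIGNAL"
    UNKNOWN = "UNKNOWN"


# -1 means "lowers protein activity", +1 means "raises it".
# The dictionary keys are assumed spellings, not taken from the project.
DRUG_SIGN = {"inhibition": -1, "activation": +1}
VARIANT_SIGN = {"loss_of_function": -1, "gain_of_function": +1}


def label_association(drug_direction, variant_effect, beta):
    """Label one association as efficacy, safety, or unknown."""
    drug = DRUG_SIGN.get(drug_direction.lower()) if drug_direction else None
    variant = VARIANT_SIGN.get(variant_effect.lower()) if variant_effect else None
    if drug is None or variant is None or beta is None or beta == 0:
        return Label.UNKNOWN
    # When the variant mimics the drug (signs match), beta predicts the drug's
    # effect directly; when they oppose each other, the sign of beta is flipped.
    predicted = beta * drug * variant
    return Label.EFFICACY_SIGNAL if predicted < 0 else Label.SAFETY_SIGNAL


# PCSK9 inhibitor vs. LoF carriers: protective beta from the README example.
print(label_association("Inhibition", "loss_of_function", -0.82).value)
# A positive beta (value made up here) for the same pairing flags a safety signal.
print(label_association("Inhibition", "loss_of_function", 0.10).value)
```

Encoding both directions as signs keeps the flip logic to a single multiplication, which may be easier to audit than nested conditionals.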
- prompts/: Claude system prompts, stored as markdown files.
- src/engine.py: Wires the pipeline together, yields live progress updates.
- app.py: The Streamlit app.
- src/db.py: DuckDB at data/knockout.db. Caches raw and filtered associations per gene to avoid re-querying the APIs.
- reports/: Example reports from previous runs.
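The replication check described for src/replication.py could look something like the sketch below. It assumes signals have already been harmonized to a canonical disease name and are passed in as plain dictionaries. The function name, the dictionary keys, and the source labels are hypothetical, and the sample signals are made up for illustration.

```python
from collections import defaultdict


def tag_replication(signals):
    """Flag each disease as replicated when it appears in two or more sources."""
    sources_by_disease = defaultdict(set)
    for signal in signals:
        sources_by_disease[signal["disease"]].add(signal["source"])
    return [
        {
            "disease": disease,
            "replicated": len(sources) > 1,
            "sources": sorted(sources),
        }
        for disease, sources in sorted(sources_by_disease.items())
    ]


# Made-up, already-harmonized signals; source labels are assumed spellings.
signals = [
    {"disease": "Hypercholesterolemia", "source": "open_targets_gene_burden"},
    {"disease": "Hypercholesterolemia", "source": "finngen"},
    {"disease": "Coronary artery disease", "source": "finngen"},
]
tagged = tag_replication(signals)
for item in tagged:
    print(item)
```

Collecting sources in a set means repeated hits from the same database never count as replication on their own.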
- Works in Progress: The Nature's Laboratory article that inspired this project.
- Formation Bio: Scaling Genetics Insights was the technical reference for this project. This is an open source version using public data.
- Open Targets: Gene-disease association scores, GWAS credible sets, and gene burden test results.
- FinnGen: PheWAS results across ~500K Finnish individuals and 2,400+ diseases.
- gnomAD: Gene constraint scores and LoF variant catalog from 800K+ sequenced individuals.
- MyGene.info: Gene symbol to Ensembl ID.
