GitHub - tanaydesai/knockout-discovery: Analyze human genetic knockouts to predict drug efficacy and side effects

Knockout Discovery Engine

A tool that answers: "If I block gene X with a drug, what diseases will that help, and what might go wrong?"

Introduction

All humans share ~99.9% of the same DNA, but some people are born with mutations that naturally break a specific gene's function. Such individuals are called natural knockouts. People born with a broken CCR5 gene are highly resistant to HIV infection, an observation that led directly to Maraviroc. People with a broken HSD17B13 gene have lower rates of chronic liver disease. These people are nature's version of a clinical trial for any drug that targets the same gene. Drugs backed by such genetic evidence are 2.6x more likely to get approved than those without it.

The Works in Progress article Nature's Laboratory, the piece that introduced me to this concept, got me asking:

"Could one automate the search for such genetic variants?"

That rabbit hole led me to Formation Bio's pipeline, which combines genetic database search with LLMs to do exactly that.

This tool is an open-source alternative built on the same idea. It is meant for educational purposes and to demonstrate AI's role in accelerating scientific research.

Usage

Clone the repo:

git clone https://github.com/your-username/knockout-discovery
cd knockout-discovery

Set ANTHROPIC_API_KEY in a .env file. The model used is claude-sonnet-4-6.

Install and run:

pip install -r requirements.txt
streamlit run app.py

What it does

Type a drug description, e.g. PCSK9 inhibitor for hypercholesterolemia

Step 1: Claude Sonnet 4.6 extracts the target gene and modulation direction: Gene: PCSK9 | Direction: Inhibition | Modality: Antibody | Indication: Hypercholesterolemia
Step 2: Queries Open Targets + FinnGen across 500K+ people and 2,400+ disease phenotypes, and gnomAD for gene constraint scores and LoF variant annotations
- Open Targets: Hypercholesterolemia | p: 3.2e-45 | beta: -0.82
- FinnGen: Coronary artery disease | p: 1.1e-12 | beta: -0.31
- gnomAD: PCSK9 | pLI: 0.01 | LOEUF: 1.19 | LoF variant: rs28362286 | consequence: stop_gained
Step 3: Filters to statistically significant hits only (p < 5e-8)
Step 4: Labels each signal as Efficacy (drug works for this disease) or Safety (potential side effect)
Step 5: Sonnet 4.6 generates an extensive evidence report. The report and its underlying data can be downloaded as a zip file.

Note on terminology: loss-of-function (LoF) gene variants are mutations that break or reduce a gene's protein output. If people born with a broken copy of gene X are healthier in some way, that is direct evidence a drug targeting gene X might work. gnomAD catalogs these variants across hundreds of thousands of people and scores how tolerant each gene is to being broken.

Examples

PCSK9 Inhibitor

Input: PCSK9 inhibitor for hypercholesterolemia

Output:

Protective signals for hypercholesterolemia and lipid disorders (OR ~0.2-0.4)
Cardiovascular cascade: coronary artery disease, myocardial infarction, peripheral artery disease
T2D safety signal (PCSK9 LoF carriers have higher T2D rates)

Anti-IL-6 Receptor (tocilizumab)

Input: anti-IL-6 receptor antibody for rheumatoid arthritis

Output:

Protective signal for coronary artery disease
Possible atopic dermatitis / asthma safety signals
No infection-risk signal detected -- FinnGen lacks the statistical power here

Anti-IL-18 Antibody

Input: anti-IL-18 antibody for Crohn's disease

Output:

No Crohn's signal
Protective musculoskeletal signals: spondylosis, disc disorders (beta -0.05 to -0.11)
Direction paradox on atherosclerosis, flagged in the report

How It Works

Note: I am not a biologist. I just hold an intellectual interest in genetics and compbio. To build an intuition for the biology involved, I had Opus 4.6 create a 50-page textbook that explains the concepts from the ground up.

Pipeline

MoA Parsing src/moa_parser.py: Claude Sonnet 4.6 takes the plain-text drug description and extracts its mechanism of action (MoA): the target gene symbol(s), whether the drug inhibits or activates the target, the modality (antibody, small molecule, etc.), and the proposed indication. Returns structured JSON, e.g. Gene: PCSK9 | Direction: Inhibition | Modality: Antibody | Indication: Hypercholesterolemia
Gene Resolution src/gene_resolver.py: Takes the gene symbol (e.g. IL18) and queries the MyGene.info API to get the stable Ensembl ID (e.g. ENSG00000150782). This ID is what the databases use to identify genes.
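
The Gene Resolution step above can be sketched with the standard library alone. This is a minimal illustration, not the contents of src/gene_resolver.py: the MyGene.info v3 query endpoint and its q, species, and fields parameters are part of the public API, while the function names build_query_url, extract_ensembl_id, and resolve_gene are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

MYGENE_QUERY_URL = "https://mygene.info/v3/query"


def build_query_url(symbol: str) -> str:
    """Build a MyGene.info query URL for a human gene symbol."""
    params = {
        "q": f"symbol:{symbol}",
        "species": "human",
        "fields": "ensembl.gene",
        "size": 1,
    }
    return f"{MYGENE_QUERY_URL}?{urllib.parse.urlencode(params)}"


def extract_ensembl_id(response: dict) -> Optional[str]:
    """Return the first Ensembl gene ID found in a query response, if any."""
    for hit in response.get("hits", []):
        ensembl = hit.get("ensembl")
        # The ensembl field is a dict for one mapping and a list for several.
        if isinstance(ensembl, list):
            ensembl = ensembl[0] if ensembl else None
        if isinstance(ensembl, dict) and "gene" in ensembl:
            return ensembl["gene"]
    return None


def resolve_gene(symbol: str, timeout: float = 10.0) -> Optional[str]:
    """Resolve a gene symbol to an Ensembl ID with one HTTP request."""
    with urllib.request.urlopen(build_query_url(symbol), timeout=timeout) as resp:
        return extract_ensembl_id(json.load(resp))
```

Calling resolve_gene("IL18") performs a live request, so the response parsing is kept in its own function that can be checked offline against a saved response.
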
Data Retrieval -- Two sources are queried for every target gene:

Open Targets src/open_targets.py: Pulls three things: (a) all diseases associated with the gene, (b) GWAS credible sets (variant-level association data), and (c) gene burden test results (rare-variant collapsing tests that directly measure what happens when a gene is broken), e.g. Hypercholesterolemia | p: 3.2e-45 | beta: -0.82
FinnGen src/finngen.py: Gets PheWAS results for the gene across ~500K Finnish individuals and 2,400+ diseases. Returns beta (effect size) and p-value for every phenotype, e.g. Coronary artery disease | p: 1.1e-12 | beta: -0.31

Variant Annotation src/gnomad.py: Queries gnomAD, which catalogs genetic variation across 800K+ individuals, for two things: (a) the gene's constraint scores (pLI and LOEUF), which tell you how tolerant the gene is to being broken. A highly constrained gene means nature punishes loss-of-function mutations, so fewer natural knockouts exist in the gene pool. (b) A list of high-confidence loss-of-function (LoF) variants in the gene, i.e. the mutations that actually break or reduce protein output, e.g. PCSK9 | pLI: 0.01 | LOEUF: 1.19 | LoF variant: rs28362286 | consequence: stop_gained
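
To make the constraint scores above concrete, here is a small sketch of how pLI and LOEUF might be read together. A high pLI or a low LOEUF both point to a gene that tolerates loss of function poorly. The cutoffs below (pLI >= 0.9, LOEUF < 0.35) are commonly cited rules of thumb that differ between gnomAD releases; they are assumptions for illustration and are not taken from src/config.py. The Constraint class and describe_constraint function are hypothetical names.

```python
from dataclasses import dataclass

# Illustrative cutoffs only (assumed, not read from the project's config).
PLI_CONSTRAINED = 0.9
LOEUF_CONSTRAINED = 0.35


@dataclass(frozen=True)
class Constraint:
    gene: str
    pli: float    # probability of being loss-of-function intolerant
    loeuf: float  # upper bound of the observed/expected LoF ratio


def describe_constraint(c: Constraint) -> str:
    """Label a gene as constrained or tolerant to loss of function."""
    if c.pli >= PLI_CONSTRAINED or c.loeuf < LOEUF_CONSTRAINED:
        return "constrained"
    return "tolerant"


# Values from the PCSK9 example in this README.
pcsk9 = Constraint(gene="PCSK9", pli=0.01, loeuf=1.19)
print(pcsk9.gene, describe_constraint(pcsk9))
```

Under these assumed cutoffs PCSK9 comes out as tolerant, which fits the fact that natural PCSK9 knockouts exist in the population and can be studied.
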
Filtering src/filters.py: Drops raw associations that fail either of two thresholds: the p-value must be below 5e-8 (the conventional genome-wide significance cutoff, strict enough to limit false positives when many variants and phenotypes are tested at once) and the absolute effect size must meet a minimum (|beta| >= 0.1 for gene burden, >= 0.05 for GWAS). It then deduplicates so that the same disease from the same source appears only once, keeping the most significant hit. All thresholds live in src/config.py.
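
The two thresholds and the deduplication rule described under Filtering can be sketched as follows. The threshold values come from the text; the Association record, its field names, and the "gene_burden" / "gwas" labels are hypothetical stand-ins for whatever src/filters.py and src/config.py actually use.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

P_VALUE_MAX = 5e-8
MIN_ABS_BETA = {"gene_burden": 0.1, "gwas": 0.05}


@dataclass(frozen=True)
class Association:
    disease: str
    source: str   # e.g. "open_targets" or "finngen"
    kind: str     # "gene_burden" or "gwas" (assumed labels)
    p_value: float
    beta: float


def passes_thresholds(a: Association) -> bool:
    """Apply the significance and effect-size cutoffs; unknown kinds raise KeyError."""
    return a.p_value < P_VALUE_MAX and abs(a.beta) >= MIN_ABS_BETA[a.kind]


def filter_and_dedupe(associations: Iterable[Association]) -> List[Association]:
    """Keep the most significant passing hit per (disease, source) pair."""
    best: Dict[Tuple[str, str], Association] = {}
    for a in associations:
        if not passes_thresholds(a):
            continue
        key = (a.disease, a.source)
        if key not in best or a.p_value < best[key].p_value:
            best[key] = a
    return list(best.values())
```

Keying on (disease, source) rather than on disease alone preserves one hit per source, which a later replication check needs in order to count independent sources.
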
Direction-of-Effect src/interpretation.py: The core logic. For each filtered association, determines whether the signal means "the drug should help with this disease" (efficacy) or "the drug might cause this as a side effect" (safety). The logic: if the drug is an inhibitor and the variant is loss-of-function (both reduce the protein), then a negative beta (protective) means efficacy signal, and a positive beta (increased risk) means safety signal. Flips the logic for activator drugs or gain-of-function variants. Labels each association as EFFICACY_SIGNAL, SAFETY_SIGNAL, or UNKNOWN.
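
The direction-of-effect rule above reduces to one question: does the variant push the target in the same direction as the drug? A minimal sketch under that reading is shown below. The three label strings come from the text; the classify function, the Signal enum, and the "inhibition" / "activation" / "lof" / "gof" input strings are hypothetical and may not match src/interpretation.py.

```python
from enum import Enum


class Signal(str, Enum):
    EFFICACY = "EFFICACY_SIGNAL"
    SAFETY = "SAFETY_SIGNAL"
    UNKNOWN = "UNKNOWN"


# -1 lowers the target's activity, +1 raises it.
DRUG_SIGN = {"inhibition": -1, "activation": +1}
VARIANT_SIGN = {"lof": -1, "gof": +1}


def classify(drug_direction: str, variant_effect: str, beta: float) -> Signal:
    """Label one association as an efficacy signal, a safety signal, or unknown."""
    drug = DRUG_SIGN.get(drug_direction.lower())
    variant = VARIANT_SIGN.get(variant_effect.lower())
    if drug is None or variant is None or beta == 0:
        return Signal.UNKNOWN
    # When the variant mimics the drug, carriers preview the drug's effect.
    # When it opposes the drug, the drug is expected to do the reverse.
    expected_beta = beta if drug == variant else -beta
    return Signal.EFFICACY if expected_beta < 0 else Signal.SAFETY
```

For the PCSK9 example, classify("Inhibition", "LoF", -0.82) yields the efficacy label, while the same association for an activator drug would be labeled as a safety signal.
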
Phenotype Harmonization src/harmonizer.py: The same disease shows up under different names across databases ("Type 2 diabetes" vs "T2DM" vs a FinnGen endpoint code). Sonnet 4.6 merges duplicates into a canonical name and groups everything by therapeutic area (cardiovascular, immunology, etc.).
Replication Scoring src/replication.py: Checks whether each signal was found in more than one data source (e.g. both Open Targets gene burden and FinnGen). A signal replicated across independent biobanks is much more trustworthy than one found in a single source. Tags each result with replicated: true/false and lists sources.
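
The replication check above amounts to counting distinct sources per harmonized disease name. The sketch below assumes the input is a list of (disease, source) pairs that have already been harmonized; the score_replication name and the output shape are illustrative guesses, not the actual interface of src/replication.py.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def score_replication(signals: Iterable[Tuple[str, str]]) -> Dict[str, dict]:
    """Tag each disease with whether it was seen in more than one source."""
    sources_by_disease = defaultdict(set)
    for disease, source in signals:
        sources_by_disease[disease].add(source)
    return {
        disease: {"replicated": len(sources) > 1, "sources": sorted(sources)}
        for disease, sources in sources_by_disease.items()
    }
```

Using a set per disease means repeated hits from the same source do not count as replication; only a second, distinct source flips the flag to true.
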
Literature Context src/literature.py: Sonnet 4.6 looks up the target gene's protein function, known disease roles, existing drugs, expected genetic signals, and cases where the biology is complex or contradictory, from published research. This gets fed into the report generator.
Report Generation src/report_generator.py: Sonnet 4.6 combines the signals and literature context to write a structured report that covers indication support, per-target analysis, additional opportunities, safety concerns, evidence strength, and limitations.

Other Files

prompts/: Claude system prompts, stored as markdown files.
src/engine.py: Wires the pipeline together, yields live progress updates. app.py is the Streamlit app.
src/db.py: DuckDB at data/knockout.db. Caches raw and filtered associations per gene to avoid re-querying the APIs.
reports/: Example reports from previous runs.

Credits

Works in Progress: The Nature's Laboratory article that inspired this project.
Formation Bio: Scaling Genetics Insights was the technical reference for this project. This is an open source version using public data.
Open Targets: Gene-disease association scores, GWAS credible sets, and gene burden test results.
FinnGen: PheWAS results across ~500K Finnish individuals and 2,400+ diseases.
gnomAD: Gene constraint scores and LoF variant catalog from 800K+ sequenced individuals.
MyGene.info: Gene symbol to Ensembl ID.

