VNTR downsample mode produces unrealistic coverage: simulation architecture doesn't generate flanking reads

## Problem

`downsample_mode: "vntr"` targets a coverage level in the VNTR region but produces ~2x flanking coverage (should be ~70x). `downsample_mode: "non_vntr"` produces better flanking (~140x) but inflated VNTR (~500x). Neither matches real exome data.

## Root Cause

MucOneUp simulates reads from a ~46kb MUC1 haplotype, not a whole exome. The pBLAT fragment simulation places virtually ALL reads in the VNTR or immediate flanking because the capture probes are designed for this small reference. In real exomes, the MUC1 region receives ~30k of 50M+ total reads (0.06%), and flanking probes from other exome targets create ~65x background coverage.

The simulation architecture simply cannot produce realistic flanking coverage because there are no reads from the rest of the exome.

## Coverage Comparison

| Metric | Real exomes (n=1,043) | vntr mode | non_vntr mode |
|--------|----------------------|-----------|---------------|
| VNTR mean | **201x** | 376x | 507x |
| Flanking mean | **70x** | **2x** | 140x |
| VNTR/Flanking ratio | **2.9x** | 188x | 3.5x |
| Total reads | **31,851** | 9,649 | 9,957 |

## Why This Matters

VNtyper's Kestrel algorithm uses k-mer frequencies from reads in the VNTR region. The flanking coverage level affects:
- Read extraction during BAM preprocessing
- Background noise for variant calling
- Coverage statistics used for confidence scoring

With 2x flanking (vntr mode), there's essentially no background. With 140x flanking but 500x VNTR (non_vntr mode), the ratio is approximately right but absolute values are inflated.

## Possible Solutions

1. **Calibrate `non_vntr` mode with adjusted `read_number`**: Find the `read_number` that produces ~200x VNTR mean (matching real data). This is the simplest fix -- just change a parameter. The ratio (3.5x vs 2.9x) is close enough.

2. **Add a "target_vntr_coverage" parameter** that works with `non_vntr` mode: automatically compute the `read_number` needed to achieve a target VNTR coverage given the known enrichment ratio.

3. **Spike in background reads**: After VNTR-focused simulation, add reads from a real exome BAM (or simulated whole-genome) in the flanking regions to create realistic background. Complex but most accurate.

4. **Accept the limitation**: Document that simulated data has different absolute coverage than real exomes, but the VNTR-specific signal (the part VNtyper actually uses) is realistic. Use coverage titration to test across a range of depths.

## Recommendation

Option 1 is pragmatic for the manuscript. We can empirically find the `read_number` that gives ~200x VNTR mean in `non_vntr` mode. Based on current data:
- 100k reads → 507x VNTR mean
- Need ~200x → try ~40k reads (200/507 × 100k ≈ 39k)

## Environment

- MucOneUp v0.44.0
- Both downsample modes tested

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VNTR downsample mode produces unrealistic coverage: simulation architecture doesn't generate flanking reads #88

Problem

Root Cause

Coverage Comparison

Why This Matters

Possible Solutions

Recommendation

Environment

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Metric	Real exomes (n=1,043)	vntr mode	non_vntr mode
VNTR mean	201x	376x	507x
Flanking mean	70x	2x	140x
VNTR/Flanking ratio	2.9x	188x	3.5x
Total reads	31,851	9,649	9,957

VNTR downsample mode produces unrealistic coverage: simulation architecture doesn't generate flanking reads #88

Description

Problem

Root Cause

Coverage Comparison

Why This Matters

Possible Solutions

Recommendation

Environment

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions