Skip to content

VNTR downsample mode produces unrealistic coverage: simulation architecture doesn't generate flanking reads #88

@berntpopp

Description

@berntpopp

Problem

downsample_mode: "vntr" targets a coverage level in the VNTR region but produces ~2x flanking coverage (should be ~70x). downsample_mode: "non_vntr" produces better flanking (~140x) but inflated VNTR (~500x). Neither matches real exome data.

Root Cause

MucOneUp simulates reads from a ~46kb MUC1 haplotype, not a whole exome. The pBLAT fragment simulation places virtually ALL reads in the VNTR or immediate flanking because the capture probes are designed for this small reference. In real exomes, the MUC1 region receives ~30k of 50M+ total reads (0.06%), and flanking probes from other exome targets create ~65x background coverage.

The simulation architecture simply cannot produce realistic flanking coverage because there are no reads from the rest of the exome.

Coverage Comparison

Metric Real exomes (n=1,043) vntr mode non_vntr mode
VNTR mean 201x 376x 507x
Flanking mean 70x 2x 140x
VNTR/Flanking ratio 2.9x 188x 3.5x
Total reads 31,851 9,649 9,957

Why This Matters

VNtyper's Kestrel algorithm uses k-mer frequencies from reads in the VNTR region. The flanking coverage level affects:

  • Read extraction during BAM preprocessing
  • Background noise for variant calling
  • Coverage statistics used for confidence scoring

With 2x flanking (vntr mode), there's essentially no background. With 140x flanking but 500x VNTR (non_vntr mode), the ratio is approximately right but absolute values are inflated.

Possible Solutions

  1. Calibrate non_vntr mode with adjusted read_number: Find the read_number that produces ~200x VNTR mean (matching real data). This is the simplest fix -- just change a parameter. The ratio (3.5x vs 2.9x) is close enough.

  2. Add a "target_vntr_coverage" parameter that works with non_vntr mode: automatically compute the read_number needed to achieve a target VNTR coverage given the known enrichment ratio.

  3. Spike in background reads: After VNTR-focused simulation, add reads from a real exome BAM (or simulated whole-genome) in the flanking regions to create realistic background. Complex but most accurate.

  4. Accept the limitation: Document that simulated data has different absolute coverage than real exomes, but the VNTR-specific signal (the part VNtyper actually uses) is realistic. Use coverage titration to test across a range of depths.

Recommendation

Option 1 is pragmatic for the manuscript. We can empirically find the read_number that gives ~200x VNTR mean in non_vntr mode. Based on current data:

  • 100k reads → 507x VNTR mean
  • Need ~200x → try ~40k reads (200/507 × 100k ≈ 39k)

Environment

  • MucOneUp v0.44.0
  • Both downsample modes tested

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions