Problem
downsample_mode: "vntr" targets a coverage level in the VNTR region but produces ~2x flanking coverage (should be ~70x). downsample_mode: "non_vntr" produces better flanking (~140x) but inflated VNTR (~500x). Neither matches real exome data.
Root Cause
MucOneUp simulates reads from a ~46kb MUC1 haplotype, not a whole exome. The pBLAT fragment simulation places virtually ALL reads in the VNTR or immediate flanking because the capture probes are designed for this small reference. In real exomes, the MUC1 region receives ~30k of 50M+ total reads (0.06%), and flanking probes from other exome targets create ~65x background coverage.
The simulation architecture simply cannot produce realistic flanking coverage because there are no reads from the rest of the exome.
Coverage Comparison
| Metric |
Real exomes (n=1,043) |
vntr mode |
non_vntr mode |
| VNTR mean |
201x |
376x |
507x |
| Flanking mean |
70x |
2x |
140x |
| VNTR/Flanking ratio |
2.9x |
188x |
3.5x |
| Total reads |
31,851 |
9,649 |
9,957 |
Why This Matters
VNtyper's Kestrel algorithm uses k-mer frequencies from reads in the VNTR region. The flanking coverage level affects:
- Read extraction during BAM preprocessing
- Background noise for variant calling
- Coverage statistics used for confidence scoring
With 2x flanking (vntr mode), there's essentially no background. With 140x flanking but 500x VNTR (non_vntr mode), the ratio is approximately right but absolute values are inflated.
Possible Solutions
-
Calibrate non_vntr mode with adjusted read_number: Find the read_number that produces ~200x VNTR mean (matching real data). This is the simplest fix -- just change a parameter. The ratio (3.5x vs 2.9x) is close enough.
-
Add a "target_vntr_coverage" parameter that works with non_vntr mode: automatically compute the read_number needed to achieve a target VNTR coverage given the known enrichment ratio.
-
Spike in background reads: After VNTR-focused simulation, add reads from a real exome BAM (or simulated whole-genome) in the flanking regions to create realistic background. Complex but most accurate.
-
Accept the limitation: Document that simulated data has different absolute coverage than real exomes, but the VNTR-specific signal (the part VNtyper actually uses) is realistic. Use coverage titration to test across a range of depths.
Recommendation
Option 1 is pragmatic for the manuscript. We can empirically find the read_number that gives ~200x VNTR mean in non_vntr mode. Based on current data:
- 100k reads → 507x VNTR mean
- Need ~200x → try ~40k reads (200/507 × 100k ≈ 39k)
Environment
- MucOneUp v0.44.0
- Both downsample modes tested
Problem
downsample_mode: "vntr"targets a coverage level in the VNTR region but produces ~2x flanking coverage (should be ~70x).downsample_mode: "non_vntr"produces better flanking (~140x) but inflated VNTR (~500x). Neither matches real exome data.Root Cause
MucOneUp simulates reads from a ~46kb MUC1 haplotype, not a whole exome. The pBLAT fragment simulation places virtually ALL reads in the VNTR or immediate flanking because the capture probes are designed for this small reference. In real exomes, the MUC1 region receives ~30k of 50M+ total reads (0.06%), and flanking probes from other exome targets create ~65x background coverage.
The simulation architecture simply cannot produce realistic flanking coverage because there are no reads from the rest of the exome.
Coverage Comparison
Why This Matters
VNtyper's Kestrel algorithm uses k-mer frequencies from reads in the VNTR region. The flanking coverage level affects:
With 2x flanking (vntr mode), there's essentially no background. With 140x flanking but 500x VNTR (non_vntr mode), the ratio is approximately right but absolute values are inflated.
Possible Solutions
Calibrate
non_vntrmode with adjustedread_number: Find theread_numberthat produces ~200x VNTR mean (matching real data). This is the simplest fix -- just change a parameter. The ratio (3.5x vs 2.9x) is close enough.Add a "target_vntr_coverage" parameter that works with
non_vntrmode: automatically compute theread_numberneeded to achieve a target VNTR coverage given the known enrichment ratio.Spike in background reads: After VNTR-focused simulation, add reads from a real exome BAM (or simulated whole-genome) in the flanking regions to create realistic background. Complex but most accurate.
Accept the limitation: Document that simulated data has different absolute coverage than real exomes, but the VNTR-specific signal (the part VNtyper actually uses) is realistic. Use coverage titration to test across a range of depths.
Recommendation
Option 1 is pragmatic for the manuscript. We can empirically find the
read_numberthat gives ~200x VNTR mean innon_vntrmode. Based on current data:Environment