Same samples for both training (+BAMSurgeon spike-ins) and calling? #89

@AndreasKienzle

Hello,

I have a dataset of about 50 tumor-normal paired cell-free DNA samples.

For model training, the tumor BAM files will be spiked in silico with random SNVs and INDELs (COSMIC v102) using BAMSurgeon.

python preprocess.py \
    --mode train \
    --reference GRCh38.fa \
    --region_bed region.bed \
    --tumor_bam SPIKED_tumor.bam \
    --normal_bam normal.bam \
    --work work_train \
    --truth_vcf SPIKED_truth.vcf \
    --min_mapq 10 \
    --number_threads 10 \
    --scan_alignments_binary ../bin/scan_alignments

Would it be valid to train on all 50 samples (with BAMSurgeon spike-ins) and then perform somatic calling on the corresponding original (unspiked) tumor BAMs? Technically, the spiked training BAM files differ from the original ones.
Or do I have to split the dataset into 20% training and 80% testing?
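
To make the second option concrete, here is a minimal sketch of a sample-level split, where each tumor-normal pair goes to either training or calling and never both. The sample IDs, the `split_samples` helper, and the fixed seed are hypothetical and only illustrate the idea; they are not part of the tool.

```python
import random


def split_samples(sample_ids, train_fraction=0.2, seed=0):
    """Split sample IDs into disjoint training and calling sets.

    Hypothetical helper: the split is done per sample (per patient),
    so no sample contributes to both sets.
    """
    ids = sorted(sample_ids)           # fixed order before shuffling
    rng = random.Random(seed)          # seeded for a reproducible split
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Placeholder IDs for the 50 tumor-normal pairs.
samples = [f"sample_{i:02d}" for i in range(1, 51)]
train_ids, call_ids = split_samples(samples)
print(len(train_ids), len(call_ids))  # 10 40
```

With this split, only the spiked BAMs of `train_ids` would be passed to `--mode train`, and calling would run on the unspiked BAMs of `call_ids`.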

Best and thanks in advance,
Andreas
