Hello,
I have a dataset of about 50 tumor-normal paired cell-free DNA samples.
For model training, the tumor BAM files will be spiked in silico with random SNVs and indels (COSMIC v102) using BAMSurgeon.
python preprocess.py \
    --mode train \
    --reference GRCh38.fa \
    --region_bed region.bed \
    --tumor_bam SPIKED_tumor.bam \
    --normal_bam normal.bam \
    --work work_train \
    --truth_vcf SPIKED_truth.vcf \
    --min_mapq 10 \
    --number_threads 10 \
    --scan_alignments_binary ../bin/scan_alignments
Would it be valid to train on all 50 samples (with BAMSurgeon spike-ins) and then perform somatic calling on the corresponding original (unspiked) tumor BAMs? Strictly speaking, the spiked training BAM files are not identical to the original ones.
Or do I have to split the dataset into 20% training and 80% testing?
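To make the second option concrete, this is roughly what I mean by a split at the sample level. It is only a minimal sketch: the sample IDs and the `split_samples` helper are placeholders of mine and not part of the tool, and the training fraction is a parameter that can be changed.

```python
import random


def split_samples(sample_ids, train_fraction, seed=0):
    """Shuffle sample IDs reproducibly and split them into train/test lists."""
    ids = sorted(sample_ids)            # fixed starting order for reproducibility
    random.Random(seed).shuffle(ids)    # seeded shuffle, independent of global state
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Hypothetical IDs standing in for the 50 tumor-normal pairs.
samples = [f"sample_{i:02d}" for i in range(1, 51)]

# 20% training / 80% testing, as in the question above.
train_ids, test_ids = split_samples(samples, train_fraction=0.2)
print(len(train_ids), len(test_ids))  # 10 40
```

Each pair ends up in exactly one of the two lists, so a spiked BAM used for training and its unspiked original would never both be used.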
Best and thanks in advance,
Andreas