Same samples for both training (+BAMSurgeon spike-ins) and calling? #89

Hello,

I have a dataset of about 50 tumor-normal paired cell-free DNA samples.

For model training, the tumor BAM files will be spiked in silico with random SNVs and INDELs (COSMIC v102) using BAMSurgeon.

python preprocess.py \
	--mode train \
	--reference GRCh38.fa \
	--region_bed region.bed \
	--tumor_bam SPIKED_tumor.bam \
	--normal_bam normal.bam \
	--work work_train \
	--truth_vcf SPIKED_truth.vcf \
	--min_mapq 10 \
	--number_threads 10 \
	--scan_alignments_binary ../bin/scan_alignments


Would it be valid to train on all 50 samples (with BAMSurgeon spike-ins) and then perform somatic calling on the corresponding original (unspiked) tumor BAMs? In principle, the spiked training BAM files differ from the original ones.
Or do I have to split the dataset into 20% for training and 80% for testing?
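
To make the second option concrete, this is roughly the sample-level split I have in mind: each tumor-normal pair goes either to training or to calling, never both. It is only a sketch. The sample names and the `split_samples` helper are hypothetical, and the 20% fraction is simply the number from my question.

```python
import random


def split_samples(sample_ids, train_fraction=0.2, seed=0):
    """Split sample IDs into disjoint training and calling sets."""
    ids = sorted(sample_ids)      # sort first so the shuffle is reproducible
    rng = random.Random(seed)     # local RNG, does not touch global state
    rng.shuffle(ids)
    n_train = round(len(ids) * train_fraction)
    return ids[:n_train], ids[n_train:]


# Hypothetical sample names; the real IDs would come from the cohort sheet.
samples = [f"sample_{i:02d}" for i in range(1, 51)]
train_ids, call_ids = split_samples(samples)

# No pair may appear on both sides of the split.
assert not set(train_ids) & set(call_ids)
print(len(train_ids), "for training,", len(call_ids), "for calling")
```

With 50 pairs this gives 10 for training (spiked) and 40 for calling (unspiked).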

Best and thanks in advance,
Andreas



