Questions About Training Random Forest Model in PTATO #27

@GexinLiu

Hi,
I'm currently working on training the random forest model in PTATO and have been using the provided 'training_vcfs' to run the training program. I have encountered some issues and would appreciate your guidance.

I'm unsure how to properly modify the configuration file and execute the training program, as I couldn't find comprehensive instructions for this part. Below are the command and configuration file I used:

Command:

 nextflow run /home/ug2263/software/PTATO/ptato-train.nf \
    -c /home/ug2263/software/PTATO/configs/run_model_demo_test.config \
    --out_dir /home/ug2263/data/LY_OME/01_processing/PTATO_model/model_demo_test \
    -process.memory '800 GB' \
    -process.cpus 64 \
    -process.maxForks 10 \
    -process.queueSize 100 \
    -resume

Configuration File (run_model_demo_test.config):

includeConfig "${projectDir}/configs/process.config"
includeConfig "${projectDir}/configs/nextflow.config"
includeConfig "${projectDir}/configs/resources.config"

params {

  run {
    snvs = true
    QC = false
    svs = false
    indels = false
    cnvs = false
  }

  // TRAINING
  train {
    version = '2.0.0'
  }
  pta_vcfs_dir = '/home/ug2263/data/LY_OME/Training/training_vcfs/TP'
  nopta_vcfs_dir = '/home/ug2263/data/LY_OME/Training/training_vcfs/FP'
  // END TRAINING

  // TESTING
  input_vcfs_dir = '/home/ug2263/data/LY_OME/Training/training_vcfs/TP'
  bams_dir = ''
  // END TESTING

  out_dir = ''
  bulk_names = [
    ['IBFM26', 'IBFM26_shared_filtered'],
    ['PMCCB15', 'PMCCB15_shared_filtered'],
    ['PMCAHH1-FANCCKO', 'PMCAHH1-FANCCKO_shared_filtered'],
    ['IBFM35', 'IBFM35_shared_filtered'],
    ['PB10268', 'PB10268_shared_filtered']
  ]

  snvs {
    rf_rds = ""
  }

  indels {
    rf_rds = ''
    excludeindellist = "${projectDir}/resources/hg38/indels/excludeindellist/PTA_Indel_ExcludeIndellist_normNoGTrenamed.vcf.gz"
  }
  optional {

    germline_vcfs_dir = ''
    callableloci_dir = ''
    autosomal_callable_dir = ''
    walker_vcfs_dir = ''

    short_variants {
      somatic_vcfs_dir = ''
      phased_vcfs_dir = ''
      ab_tables_dir = ''
      context_beds_dir = ''
      features_beds_dir = ''
    }

    snvs {
      rf_tables_dir = ''
      ptato_vcfs_dir = ''
    }

    indels {
      rf_tables_dir = ''
      ptato_vcfs_dir = ''
    }

    qc {
      wgs_metrics_dir = ''
      alignment_summary_metrics_dir = ''
    }

    svs {
      gridss_driver_vcfs_dir = ''
      gridss_unfiltered_vcfs_dir = ''
      gripss_somatic_filtered_vcfs_dir = ''
      gripss_filtered_files_dir = ''
      integrated_sv_files_dir = ''
    }

    cnvs {
      cobalt_ratio_tsv_dir = ''
      cobalt_filtered_readcounts_dir = ''
      baf_filtered_files_dir = ''
    }
  }


}

However, the execution failed. I encountered the following errors:

Error 1:

" N E X T F L O W   ~  version 25.04.6

Launching `/home/ug2263/software/PTATO/ptato-train.nf` [silly_church] DSL2 - revision: d5f55b4f34

WARN: Include with `params()` is deprecated -- pass params as a workflow or process input instead
Cannot find a component with name 'extractInputVcfGzFromDir' in module: /home/ug2263/software/PTATO/NextflowModules/Utils/getFilesFromDir.nf

Did you mean any of these?
  extractInputVcfFromDir


 -- Check script '/home/ug2263/software/PTATO/ptato-train.nf' at line: 8 or see '.nextflow.log' file for more details"
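
For reference, the name mismatch reported above can be reproduced outside Nextflow. The following is a minimal sketch, not part of PTATO: it assumes components are declared with `def`, `process`, or `workflow` at the start of a line, and the `demo` string is a hypothetical stand-in for the contents of `getFilesFromDir.nf`.

```python
import difflib
import re

# Matches DSL2-style definitions such as `def name(`, `process name {`, `workflow name {`.
# Assumption: one definition per line, keyword at the start of the line.
COMPONENT_RE = re.compile(r"^\s*(?:def|process|workflow)\s+([A-Za-z_]\w*)", re.MULTILINE)


def list_components(module_text):
    """Return the component names defined in the text of a Nextflow module."""
    return COMPONENT_RE.findall(module_text)


def check_include(name, module_text):
    """Return (found, suggestions) for a component name requested by an include."""
    names = list_components(module_text)
    if name in names:
        return True, []
    # Suggest similarly spelled names, like the "Did you mean" hint in the log.
    return False, difflib.get_close_matches(name, names, n=3, cutoff=0.6)


if __name__ == "__main__":
    # Hypothetical module content; replace with the real file's text to check it.
    demo = "def extractInputVcfFromDir( dir ) {\n}\n"
    print(check_include("extractInputVcfGzFromDir", demo))
```

Reading the real module with `open(path).read()` and passing each included name through `check_include` would show which includes in `ptato-train.nf` have no matching definition.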

After revising the function ‘extractInputVcfFromDir’ in ptato.nf to address the above error, I encountered another issue:

Error 2:
No .vcf(.gz) files found in [/home/ug2263/data/LY_OME/Training/training_vcfs/FP/*/*.{vcf,vcf.gz}].

Upon further investigation, I believe this error originates from 'short_variants.nf' rather than 'getFilesFromDir.nf'. I'm quite confused about this part and would appreciate any solutions or suggestions you might have.
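
One thing that can be checked independently of the pipeline: the glob in the error message, `FP/*/*.{vcf,vcf.gz}`, only matches files exactly one subdirectory below `FP`, so VCFs placed directly in `FP` would not be found. Below is a minimal sketch for testing a directory against that pattern. The function names are hypothetical, and whether PTATO expects a particular naming for the intermediate directory is not something this check can confirm.

```python
from pathlib import Path


def _vcfs_matching(root, prefix):
    """Collect regular files under root matching prefix + '*.vcf' or prefix + '*.vcf.gz'."""
    if not root.is_dir():
        return []
    # pathlib has no brace expansion, so the two extensions are globbed separately.
    hits = set(root.glob(prefix + "*.vcf")) | set(root.glob(prefix + "*.vcf.gz"))
    return sorted(p for p in hits if p.is_file())


def find_vcfs(vcfs_dir):
    """Files matching <vcfs_dir>/*/*.{vcf,vcf.gz}, the pattern from the error message."""
    return _vcfs_matching(Path(vcfs_dir), "*/")


def find_misplaced_vcfs(vcfs_dir):
    """VCFs sitting directly in vcfs_dir, which the two-level pattern would skip."""
    return _vcfs_matching(Path(vcfs_dir), "")


if __name__ == "__main__":
    # Path taken from the configuration above; adjust as needed.
    fp_dir = "/home/ug2263/data/LY_OME/Training/training_vcfs/FP"
    print("matched by pattern:", len(find_vcfs(fp_dir)))
    print("directly in directory:", len(find_misplaced_vcfs(fp_dir)))
```

If the first count is zero and the second is not, the files are one level too shallow for the pattern. If both are zero, the directory itself or the file extensions are worth checking.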

Thank you in advance for your help!

Best regards,
Gexin
