Kat comp finds specific kmers between 2 fastq files with the same reads not given in the same order (with a reproducible example) #188

@jfouret

Hi,

Thank you for this tool. I wanted to use kat comp to validate another tool for lossless compression of FASTQ reads. That tool reorders the reads but should otherwise be lossless. I was surprised to see specific k-mers after decompression, so to rule out an artefact I wanted to confirm that when I give kat comp two identical sets of reads, but not in the same order, I get 0 specific k-mers.

However, that is not what I observed. Maybe I made a mistake somewhere, or there may be an artefact in kat comp.

Below is the code to reproduce my results:

# Download the run and convert it to FASTQ
SRR=SRR14237206
apptainer run docker://ncbi/sra-tools prefetch $SRR
apptainer run docker://ncbi/sra-tools fasterq-dump $SRR \
  --split-files --progress
# Compress the FASTQ files and move them to fastq/
pigz -p 8 ${SRR}_*
mkdir fastq ; mv ${SRR}_*.fastq.gz fastq/
# Shuffle the read order of the _1 file
mkdir shuffle
apptainer run docker://staphb/seqkit seqkit shuffle fastq/${SRR}_1.fastq.gz --out-file shuffle/${SRR}_1.fastq.gz
# Compare the original and shuffled files with kat comp
apptainer run docker://ghcr.io/nexomis/kat:2.4.1 comp -N -O -H 1000000000 -I 1000000000 -t 12 fastq/${SRR}_1.fastq.gz shuffle/${SRR}_1.fastq.gz
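
As an extra check that does not depend on KAT, the two files can be compared as unordered sets of records. The sketch below is a hypothetical helper (`fastq_digest` is a name made up here, not part of seqkit or KAT). It assumes strict 4-line FASTQ records and that `gzip`, `paste`, `sort` and GNU `md5sum` are available. Identical digests would indicate that the shuffle changed only the order of the reads.

```shell
# Hypothetical helper: print an order-independent checksum of a gzipped FASTQ.
# Assumption: every record is exactly 4 lines (header, sequence, '+', quality).
fastq_digest() {
  gzip -dc "$1" \
    | paste - - - - \
    | LC_ALL=C sort \
    | md5sum \
    | cut -d ' ' -f 1
}
```

For example, `fastq_digest fastq/${SRR}_1.fastq.gz` and `fastq_digest shuffle/${SRR}_1.fastq.gz` should print the same value if the two files contain the same reads. Sorting a file of this size needs noticeable time and temporary disk space.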

The kat comp run gave these results:

$ apptainer run docker://ghcr.io/nexomis/kat:2.4.1 comp -N -O -H 1000000000 -I 1000000000 -t 12 fastq/${SRR}_1.fastq.gz shuffle/${SRR}_1.fastq.gz
INFO:    Using cached SIF image
Kmer Analysis Toolkit (KAT) V2.4.1

Running KAT in COMP mode
------------------------

Input 1 is a sequence file.  Counting kmers for input 1 (fastq/SRR14237206_1.fastq.gz) ... done.  Time taken: 32.2s

Input 2 is a sequence file.  Counting kmers for input 2 (shuffle/SRR14237206_1.fastq.gz) ... done.  Time taken: 34.2s

Comparing hashes ... done.  Time taken: 27.0s

Merging results ... done.  Time taken: 0.7s

Saving results to disk ... done.  Time taken: 0.3s


Summary statistics
------------------

K-mer statistics for: 
 - Hash 1: "fastq/SRR14237206_1.fastq.gz"
 - Hash 2: "shuffle/SRR14237206_1.fastq.gz"

Total K-mers in: 
 - Hash 1: 1464945516
 - Hash 2: 1464945516

Distinct K-mers in:
 - Hash 1: 364458959
 - Hash 2: 364458959

Total K-mers only found in:
 - Hash 1: 0
 - Hash 2: 131916277

Distinct K-mers only found in:
 - Hash 1: 0
 - Hash 2: 129068475

Shared K-mers:
 - Total shared found in hash 1: 1464945516
 - Total shared found in hash 2: 1464945516
 - Distinct shared K-mers: 364458959

Distance between spectra 1 and 2 (all k-mers):
 - Manhattan distance: 0
 - Euclidean distance: 0
 - Cosine distance: 1.11022e-16
 - Canberra distance: 0
 - Jaccard distance: 0

Distance between spectra 1 and 2 (shared k-mers):
 - Manhattan distance: 0
 - Euclidean distance: 0
 - Cosine distance: 1.11022e-16
 - Canberra distance: 0
 - Jaccard distance: 0

Creating plot(s) ... done.  Time taken: 1.3s

Analysing peaks for spectra copy number matrix
----------------------------------------------

Analysing distributions for: kat-comp-main.mx ... 
Analysing full spectra
No peaks detected for full spectra.  Can't continue.
done.  Time taken:  0.0s

Main spectra statistics
-----------------------
K-value used: 27
Peaks in analysis: 0
Global minima @ Frequency=2x (1420224)
Global maxima @ Frequency=9x (10974317)
Overall mean k-mer frequency: 0x

No peaks detected

Calculating genome statistics
-----------------------------
No peaks detected, so no genome stats to report
Estimated assembly completeness: Unknown

Creating plots
--------------

No peaks in K-mer frequency histogram.  Not plotting.


KAT COMP completed.
Total runtime: 96.7s

What I do not understand is this:

Total K-mers only found in:
 - Hash 1: 0
 - Hash 2: 131916277 <=============================================

Distinct K-mers only found in:
 - Hash 1: 0
 - Hash 2: 129068475  <=============================================
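
The Hash 2 figures also look inconsistent with each other. The arithmetic below is a small sketch using only numbers copied from the summary. It assumes that "shared" and "only found in" are meant to partition the k-mers of each hash, which is my reading of the report and not something I have confirmed in the KAT documentation.

```shell
# Values copied from the Hash 2 lines of the kat comp summary.
total_h2=1464945516         # Total K-mers in hash 2
shared_total_h2=1464945516  # Total shared found in hash 2
only_total_h2=131916277     # Total K-mers only found in hash 2

# Assumption: shared + only-found should equal the total for one hash,
# so any non-zero excess means the categories overlap or are miscounted.
excess=$(( shared_total_h2 + only_total_h2 - total_h2 ))
echo "excess total k-mers in hash 2: $excess"
```

Under that assumption the excess is exactly the "only found in Hash 2" count, so those k-mers appear to be counted in addition to a total that is already fully shared.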

Thank you,
