Hi,
Thank you for this tool. I wanted to use kat comp to validate another tool for lossless compression of FASTQ reads. That tool reorders the reads but should be lossless. I was surprised to see specific k-mers after decompression, so to check that this is not an artefact I wanted to confirm that giving kat comp two identical sets of reads, in a different order, yields 0 specific k-mers.
However, that is not what I observed. Maybe I made a mistake somewhere, or there may be an artefact in kat comp.
Below is the code to reproduce my results:
SRR=SRR14237206
apptainer run docker://ncbi/sra-tools prefetch $SRR
apptainer run docker://ncbi/sra-tools fasterq-dump $SRR \
--split-files --progress
pigz -p 8 ${SRR}_*
mkdir fastq ; mv ${SRR}_*.fastq.gz fastq/
mkdir shuffle
apptainer run docker://staphb/seqkit seqkit shuffle fastq/${SRR}_1.fastq.gz --out-file shuffle/${SRR}_1.fastq.gz
apptainer run docker://ghcr.io/nexomis/kat:2.4.1 comp -N -O -H 1000000000 -I 1000000000 -t 12 fastq/${SRR}_1.fastq.gz shuffle/${SRR}_1.fastq.gz
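To rule out the shuffle step itself as the source of any difference, the two files can be compared with an order-independent checksum. This is only a minimal sketch: the function name `fastq_order_free_md5` is made up for illustration, and it assumes strict 4-line FASTQ records (no wrapped sequence lines) and that `gzip`, `paste`, `sort` and `md5sum` are available.

```shell
# Print a checksum of a gzipped FASTQ file that does not depend on read order.
# Assumption: every record is exactly 4 lines (header, sequence, '+', quality).
fastq_order_free_md5() {
    gzip -dc "$1" \
        | paste - - - - \
        | LC_ALL=C sort \
        | md5sum \
        | cut -d ' ' -f 1
}
```

Calling it once on fastq/${SRR}_1.fastq.gz and once on shuffle/${SRR}_1.fastq.gz should print the same checksum if the two files hold exactly the same records in a different order. `paste - - - -` puts each record on one line so that `sort` reorders whole records rather than individual lines.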
The kat comp command gave these results:
$ apptainer run docker://ghcr.io/nexomis/kat:2.4.1 comp -N -O -H 1000000000 -I 1000000000 -t 12 fastq/${SRR}_1.fastq.gz shuffle/${SRR}_1.fastq.gz
INFO: Using cached SIF image
Kmer Analysis Toolkit (KAT) V2.4.1
Running KAT in COMP mode
------------------------
Input 1 is a sequence file. Counting kmers for input 1 (fastq/SRR14237206_1.fastq.gz) ... done. Time taken: 32.2s
Input 2 is a sequence file. Counting kmers for input 2 (shuffle/SRR14237206_1.fastq.gz) ... done. Time taken: 34.2s
Comparing hashes ... done. Time taken: 27.0s
Merging results ... done. Time taken: 0.7s
Saving results to disk ... done. Time taken: 0.3s
Summary statistics
------------------
K-mer statistics for:
- Hash 1: "fastq/SRR14237206_1.fastq.gz"
- Hash 2: "shuffle/SRR14237206_1.fastq.gz"
Total K-mers in:
- Hash 1: 1464945516
- Hash 2: 1464945516
Distinct K-mers in:
- Hash 1: 364458959
- Hash 2: 364458959
Total K-mers only found in:
- Hash 1: 0
- Hash 2: 131916277
Distinct K-mers only found in:
- Hash 1: 0
- Hash 2: 129068475
Shared K-mers:
- Total shared found in hash 1: 1464945516
- Total shared found in hash 2: 1464945516
- Distinct shared K-mers: 364458959
Distance between spectra 1 and 2 (all k-mers):
- Manhattan distance: 0
- Euclidean distance: 0
- Cosine distance: 1.11022e-16
- Canberra distance: 0
- Jaccard distance: 0
Distance between spectra 1 and 2 (shared k-mers):
- Manhattan distance: 0
- Euclidean distance: 0
- Cosine distance: 1.11022e-16
- Canberra distance: 0
- Jaccard distance: 0
Creating plot(s) ... done. Time taken: 1.3s
Analysing peaks for spectra copy number matrix
----------------------------------------------
Analysing distributions for: kat-comp-main.mx ...
Analysing full spectra
No peaks detected for full spectra. Can't continue.
done. Time taken: 0.0s
Main spectra statistics
-----------------------
K-value used: 27
Peaks in analysis: 0
Global minima @ Frequency=2x (1420224)
Global maxima @ Frequency=9x (10974317)
Overall mean k-mer frequency: 0x
No peaks detected
Calculating genome statistics
-----------------------------
No peaks detected, so no genome stats to report
Estimated assembly completeness: Unknown
Creating plots
--------------
No peaks in K-mer frequency histogram. Not plotting.
KAT COMP completed.
Total runtime: 96.7s
What I do not understand is this:
Total K-mers only found in:
- Hash 1: 0
- Hash 2: 131916277 <=============================================
Distinct K-mers only found in:
- Hash 1: 0
- Hash 2: 129068475 <=============================================
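The reported numbers also look internally inconsistent. Here is a quick arithmetic check using only the values from the summary, under the assumption (my reading, not something I have confirmed in the KAT documentation) that "shared" and "only found in" are meant to partition the k-mers of each hash:

```shell
# Values copied from the KAT summary for hash 2.
total_h2=1464945516          # Total K-mers in hash 2
shared_total_h2=1464945516   # Total shared found in hash 2
only_total_h2=131916277      # Total K-mers only found in hash 2

distinct_h2=364458959        # Distinct K-mers in hash 2
shared_distinct=364458959    # Distinct shared K-mers
only_distinct_h2=129068475   # Distinct K-mers only found in hash 2

# If shared and hash-specific k-mers partition the hash, these sums
# should equal the reported totals.
sum_total=$((shared_total_h2 + only_total_h2))
sum_distinct=$((shared_distinct + only_distinct_h2))

echo "total:    shared + only = $sum_total, reported = $total_h2"
echo "distinct: shared + only = $sum_distinct, reported = $distinct_h2"
```

In both cases the sum is larger than the reported count for hash 2, while all spectra distances are 0, so the "only found in hash 2" figures do not seem to fit with the rest of the summary.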
Thank you,