perf: add --use-ram and --use-awk for faster high-freq k-mer filtering by pegasas555 · Pull Request #17 · natapol/kitsune

pegasas555 · 2025-08-18T10:12:44Z

Summary

This PR introduces two performance options to the ACF pipeline:

--use-ram: prefer /dev/shm for temp storage to reduce disk I/O.

--use-awk: select high-frequency k-mers via jellyfish dump | awk '$2 >= MIN' (keeping only k-mers whose count is at least MIN and dropping the rest) as a faster alternative to the legacy path.

Results on a representative dataset show ~5.4× wall-time speedup with identical outputs.

Motivation / Context

Legacy flow incurred heavy disk I/O and extra passes.

Filtering low-count k-mers early is both biologically sensible and computationally cheaper.

This keeps algorithmic results the same; it changes the implementation, not the definition of the result.

Changes

kitsune/modules/kitsunejf.py: implement --use-ram, --use-awk; stream pipelines; {}/[] literals in hot paths.

kitsune/modules/acf.py: wire flags into ACF flow; choose RAM tmpdir and awk path when requested.

kitsune/modules/kopt.py: add argparse options and help text.

kitsune/modules/ofc.py, cre.py, dmatrix.py: refactor to support temporary path handling and data streaming; minor structural cleanups.

--help updated for both flags
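
For illustration, opt-in flags of this kind can be declared with argparse roughly as follows. This is a minimal sketch, not the code in kopt.py: the function name `build_parser`, the `prog` string, and the help texts are assumptions made for the example.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser; the real option definitions live in kopt.py.
    parser = argparse.ArgumentParser(prog="kitsune acf")
    # store_true flags default to False, so the legacy path stays the default.
    parser.add_argument("--use-ram", action="store_true",
                        help="prefer /dev/shm for temporary files")
    parser.add_argument("--use-awk", action="store_true",
                        help="filter the jellyfish dump by minimum count with awk")
    return parser


args = build_parser().parse_args(["--use-awk"])
print(args.use_ram, args.use_awk)  # False True
```

Because both options use `store_true`, omitting them leaves behavior unchanged.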

CLI / Behavior

New flags:

--use-ram → try /dev/shm, fallback to default tmp if unavailable.

--use-awk → use awk to enforce min-count threshold on jellyfish dump.
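
The /dev/shm fallback described above could look roughly like this. It is a sketch under assumptions, not the PR's implementation: the helper name `pick_tmp_root` and its `ram_dir` parameter are hypothetical.

```python
import os
import tempfile


def pick_tmp_root(use_ram: bool, ram_dir: str = "/dev/shm") -> str:
    """Return ram_dir if requested and writable, else the default temp dir."""
    # Hypothetical helper: fall back silently when /dev/shm is missing
    # (e.g. on macOS) or not writable by the current user.
    if use_ram and os.path.isdir(ram_dir) and os.access(ram_dir, os.W_OK):
        return ram_dir
    return tempfile.gettempdir()


# The working directory is removed automatically when the block exits.
with tempfile.TemporaryDirectory(dir=pick_tmp_root(use_ram=True)) as workdir:
    print(workdir)
```

Note that files under /dev/shm count against system memory, so large intermediate files can exhaust RAM.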

Default behavior remains unchanged (legacy path) unless flags are provided.
NOTE: --use-ram may not be generally useful; it can be toggled for niche scenarios such as I/O bottlenecks, and only when the machine has memory to spare. --use-awk is what considerably speeds up the runtime.
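
As an illustration of the awk path, the `jellyfish dump | awk '$2 >= MIN'` pipeline can be streamed from Python without writing an intermediate file. This is a sketch only, not the code in kitsunejf.py: the function names are hypothetical, and it assumes the dump is produced in two-column "kmer count" form (Jellyfish's `-c` option), which is what the `$2` comparison implies.

```python
import subprocess
from typing import Iterator, List, Tuple


def awk_filter_cmd(min_count: int) -> List[str]:
    # Pass the threshold with -v so no shell quoting is needed.
    return ["awk", "-v", f"min={min_count}", "$2 >= min"]


def stream_filtered(jf_path: str, min_count: int) -> Iterator[Tuple[str, int]]:
    """Yield (kmer, count) pairs whose count is at least min_count."""
    # Assumption: `jellyfish dump -c` prints one "kmer count" pair per line.
    dump = subprocess.Popen(["jellyfish", "dump", "-c", jf_path],
                            stdout=subprocess.PIPE)
    awk = subprocess.Popen(awk_filter_cmd(min_count), stdin=dump.stdout,
                           stdout=subprocess.PIPE, text=True)
    dump.stdout.close()  # lets jellyfish receive SIGPIPE if awk exits early
    for line in awk.stdout:
        kmer, count = line.split()
        yield kmer, int(count)
    awk.wait()
    dump.wait()
```

A caller would iterate over `stream_filtered("counts.jf", 2)`, where `counts.jf` stands in for a Jellyfish database path; only one line is held in memory at a time.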
Benchmarks

Legacy (no flags): kitsune acf on 2 viral FASTA files with --canonical and k from 1 to 31.

With --use-ram --use-awk on the same data: ~5.4× wall-time speedup.

--

--use-ram: Use /dev/shm for temporary directory

--use-awk: Use awk to filter out low-frequency k-mers as alternative

Style: replace dict()/list() with {} and []

Results: 61.49s → 11.41s (~5.39× faster) on representative dataset with identical outputs

pegasas555 and others added 3 commits May 16, 2024 16:05

Fix issue natapol#14

fee2ae1

Merge branch 'natapol:master' into master

9cbb804

pegasas555 wants to merge 3 commits into natapol:master from pegasas555:master