perf: add --use-ram and --use-awk for faster high-freq k-mer filtering#17
Open
pegasas555 wants to merge 3 commits into
--use-ram: Use /dev/shm as the temporary directory
--use-awk: Use awk to filter out low-frequency k-mers as an alternative to the legacy path
Style: replace dict()/list() with {} and []
Results: 61.49s → 11.41s (~5.39× faster) on a representative dataset with identical outputs
Summary
This PR introduces two performance options to the ACF pipeline:
--use-ram: prefer /dev/shm for temp storage to reduce disk I/O.
--use-awk: keep only high-frequency k-mers (count ≥ MIN) via jellyfish dump | awk '$2 >= MIN', as a faster alternative to the legacy path.
Results on a representative dataset show ~5.4× wall-time speedup with identical outputs.
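For reviewers, the --use-awk idea can be sketched as a streamed two-process pipeline. This is a minimal illustration, not the code in kitsunejf.py: the function name `filter_min_count` is hypothetical, and it assumes the dump command emits two whitespace-separated columns (k-mer, count), as `jellyfish dump -c` does.

```python
import shutil
import subprocess


def filter_min_count(dump_cmd, min_count):
    """Stream `dump_cmd | awk '$2 >= min'` and yield (kmer, count) pairs.

    dump_cmd is an argv list whose stdout has one "KMER COUNT" pair per line
    (assumption: column output, e.g. ["jellyfish", "dump", "-c", "counts.jf"]).
    """
    if shutil.which("awk") is None:
        raise RuntimeError("awk not found on PATH")

    dump = subprocess.Popen(dump_cmd, stdout=subprocess.PIPE)
    awk = subprocess.Popen(
        # Pass the threshold with -v so it is never spliced into the program text.
        ["awk", "-v", f"min={int(min_count)}", "$2 >= min"],
        stdin=dump.stdout,
        stdout=subprocess.PIPE,
        text=True,
    )
    dump.stdout.close()  # the parent drops its copy so awk owns the pipe

    try:
        for line in awk.stdout:
            kmer, count = line.split()
            yield kmer, int(count)
    finally:
        awk.stdout.close()
        awk.wait()
        dump.wait()
```

Because nothing is written to disk between the two processes, low-count k-mers are discarded before any Python-side parsing happens.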
Motivation / Context
Legacy flow incurred heavy disk I/O and extra passes.
Filtering low-count k-mers early is both biologically sensible and computationally cheaper.
This keeps algorithmic results the same; it changes the implementation, not the definition of the result.
Changes
kitsune/modules/kitsunejf.py: implement --use-ram, --use-awk; stream pipelines; {}/[] literals in hot paths.
kitsune/modules/acf.py: wire flags into ACF flow; choose RAM tmpdir and awk path when requested.
kitsune/modules/kopt.py: add argparse options and help text.
kitsune/modules/ofc.py, cre.py, dmatrix.py: refactor to support temporary path handling and data streaming; minor structural cleanups.
--help updated for both flags
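The two options are plain boolean switches. A minimal sketch of how such flags are typically declared with argparse follows; the `build_parser` helper and the help strings are hypothetical and do not reproduce the actual definitions in kopt.py.

```python
import argparse


def build_parser():
    """Hypothetical parser showing only the two new opt-in flags."""
    parser = argparse.ArgumentParser(prog="kitsune acf")
    parser.add_argument(
        "--use-ram",
        action="store_true",  # defaults to False, so the legacy path stays the default
        help="prefer /dev/shm for temporary files (falls back to the default tmp)",
    )
    parser.add_argument(
        "--use-awk",
        action="store_true",
        help="apply the min-count threshold with awk on jellyfish dump output",
    )
    return parser
```

With `store_true`, omitting both flags leaves `args.use_ram` and `args.use_awk` false, which matches the stated goal of keeping default behavior unchanged.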
CLI / Behavior
New flags:
--use-ram → try /dev/shm, fallback to default tmp if unavailable.
--use-awk → use awk to enforce min-count threshold on jellyfish dump.
Default behavior remains unchanged (legacy path) unless flags are provided.
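The --use-ram fallback described above could look roughly like this. The helper name `choose_tmp_root` and its `ram_dir` parameter are assumptions made for illustration; the PR's own selection logic is not shown here.

```python
import os
import tempfile


def choose_tmp_root(use_ram, ram_dir="/dev/shm"):
    """Return ram_dir when requested and usable, else the default temp directory."""
    # Require an existing directory that is both writable and searchable;
    # otherwise fall back silently to the platform default.
    if use_ram and os.path.isdir(ram_dir) and os.access(ram_dir, os.W_OK | os.X_OK):
        return ram_dir
    return tempfile.gettempdir()
```

A caller would then create its working directory under the returned root, for example with `tempfile.mkdtemp(dir=choose_tmp_root(args.use_ram))`. Note that files under /dev/shm consume memory, so large intermediate files can exhaust RAM.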

NOTE: --use-ram may not be generally useful; it can be toggled for niche scenarios such as I/O bottlenecks, when resources allow, whereas --use-awk considerably speeds up the runtime.
Benchmarks
Legacy (no flags): kitsune acf on 2 virus FASTA files with --canonical and k from 1 to 31.
With --use-ram --use-awk on the same data: ~5.4× faster wall time.