perf: add --use-ram and --use-awk for faster high-freq k-mer filtering #17

Open
pegasas555 wants to merge 3 commits into natapol:master from pegasas555:master

@pegasas555

Summary

This PR introduces two performance options to the ACF pipeline:

--use-ram: prefer /dev/shm for temp storage to reduce disk I/O.

--use-awk: filter high-frequency k-mers via jellyfish dump | awk '$2 >= MIN' as a faster alternative to the legacy path.
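
As a rough illustration of the --use-awk idea, the sketch below streams `jellyfish dump` into awk from Python without writing an intermediate file. It is not the code in this PR: the function names are hypothetical, and it assumes the dump is produced in column mode (`-c`), so that the count is the second field awk sees.

```python
import subprocess


def dump_cmd(jf_path):
    """jellyfish dump in column mode: one 'KMER COUNT' pair per line (assumed invocation)."""
    return ["jellyfish", "dump", "-c", jf_path]


def awk_cmd(min_count):
    """awk program that keeps rows whose second field (the count) is >= min_count."""
    return ["awk", "-v", f"min={int(min_count)}", "$2 >= min"]


def stream_filtered_kmers(jf_path, min_count):
    """Yield (kmer, count) pairs from dump | awk, one line at a time."""
    dump = subprocess.Popen(dump_cmd(jf_path), stdout=subprocess.PIPE)
    awk = subprocess.Popen(
        awk_cmd(min_count), stdin=dump.stdout, stdout=subprocess.PIPE, text=True
    )
    dump.stdout.close()  # awk now owns the read end; dump gets SIGPIPE if awk exits early
    for line in awk.stdout:
        kmer, count = line.split()
        yield kmer, int(count)
    awk.stdout.close()
    if awk.wait() != 0 or dump.wait() != 0:
        raise RuntimeError("jellyfish dump | awk pipeline failed")
```

Passing the threshold with `-v min=...` keeps the awk program itself constant and avoids interpolating a value into it.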

Results on a representative dataset show a ~5.4× wall-time speedup (61.49 s → 11.41 s) with identical outputs.

Motivation / Context

The legacy flow incurs heavy disk I/O and makes extra passes over the data.

Filtering low-count k-mers early is both biologically sensible and computationally cheaper.

This keeps algorithmic results the same; it changes the implementation, not the definition of the result.

Changes

kitsune/modules/kitsunejf.py: implement --use-ram and --use-awk; stream pipelines; use {}/[] literals instead of dict()/list() in hot paths.

kitsune/modules/acf.py: wire flags into ACF flow; choose RAM tmpdir and awk path when requested.

kitsune/modules/kopt.py: add argparse options and help text.

kitsune/modules/ofc.py, cre.py, dmatrix.py: refactor to support temporary path handling and data streaming; minor structural cleanups.

--help: updated for both flags.
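
For readers who want to picture the new options, here is a minimal argparse sketch of two opt-in boolean flags. The parser name, the `build_parser` helper, and the help strings are assumptions for illustration; the actual definitions live in kitsune/modules/kopt.py and may differ.

```python
import argparse


def build_parser():
    """Hypothetical parser showing two opt-in flags that default to off."""
    parser = argparse.ArgumentParser(prog="kitsune acf")
    parser.add_argument(
        "--use-ram",
        action="store_true",
        help="prefer /dev/shm for temporary files, falling back to the default tmp dir",
    )
    parser.add_argument(
        "--use-awk",
        action="store_true",
        help="filter the jellyfish dump by minimum count with awk",
    )
    return parser


args = build_parser().parse_args(["--use-awk"])
print(args.use_ram, args.use_awk)
```

With `store_true`, omitting a flag leaves it `False`, which matches the statement that the legacy path stays the default.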

CLI / Behavior

New flags:

--use-ram → try /dev/shm; fall back to the default tmp directory if unavailable.

--use-awk → use awk to enforce min-count threshold on jellyfish dump.
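
The --use-ram fallback described above could look roughly like the following. This is a sketch under assumptions, not the PR's implementation: `pick_tmp_root` and `make_workdir` are hypothetical names, and "unavailable" is taken to mean that /dev/shm is missing or not writable.

```python
import os
import tempfile


def pick_tmp_root(use_ram, shm_path="/dev/shm"):
    """Return shm_path when requested and usable, else the default temp directory."""
    if use_ram and os.path.isdir(shm_path) and os.access(shm_path, os.W_OK | os.X_OK):
        return shm_path
    return tempfile.gettempdir()


def make_workdir(use_ram=False):
    """Create a private scratch directory under the chosen root."""
    return tempfile.mkdtemp(prefix="kitsune_", dir=pick_tmp_root(use_ram))
```

Because /dev/shm is backed by memory, whatever is written there counts against available RAM, so the caller should remove the work directory when the run finishes.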

Default behavior remains unchanged (legacy path) unless flags are provided.
NOTE: --use-ram may not be generally useful; it can be toggled for niche scenarios such as I/O bottlenecks, when resources allow, whereas --use-awk considerably speeds up the runtime.

Benchmarks

Legacy (no flags): kitsune acf on 2 viral FASTA files with --canonical and k from 1 to 31.
With --use-ram --use-awk on the same data: ~5.4× wall-time speedup.
(Image: kitsune_modified)

pegasas555 and others added 3 commits May 16, 2024 16:05
--use-ram: Use /dev/shm for temporary directory
--use-awk: Use awk to filter out low-frequency k-mers as alternative
- Style: replace dict()/list() with {} and []
Results: 61.49s → 11.41s (~5.39× faster) on representative dataset with identical outputs
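
To make the reported numbers easy to re-check, the sketch below recomputes the speedup from the two wall times quoted in this PR and shows one way to confirm that two runs produced byte-identical output files. The helper names are hypothetical, and comparing SHA-256 digests is an assumption about how "identical outputs" might be verified, not a description of how it was done here.

```python
import hashlib


def speedup(baseline_s, optimized_s):
    """Ratio of baseline wall time to optimized wall time."""
    return baseline_s / optimized_s


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large outputs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_output(path_a, path_b):
    """True when both files have identical contents."""
    return sha256_of(path_a) == sha256_of(path_b)


# Wall times reported in this PR: 61.49 s (legacy) and 11.41 s (with both flags).
print(f"{speedup(61.49, 11.41):.2f}x")  # prints 5.39x
```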