Gzip large TSV outputs to reduce disk usage #24

@rachadele

Description

Problem

Work directory and published outputs accumulate large uncompressed TSV files. Probability matrices from `scvi_predict` (one per query×ref combination) and F1/confusion TSVs from `classify_all` and `predict_seurat` are the main contributors.

Proposed Fix

Compress TSV outputs at write time in the relevant scripts, and update module `output:` patterns and `publishDir` patterns to match `*.tsv.gz`. Most downstream readers (pandas, R) handle gzipped TSVs transparently.
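On the Python side this should be a one-line change per write site: pandas infers gzip compression from a `.gz` suffix on both write and read. A minimal sketch (the DataFrame contents and variable name are hypothetical; the filename pattern follows the `.rf.prob.df.tsv` outputs named below):

```python
import pandas as pd

# Hypothetical probability matrix standing in for scvi_predict output
prob_df = pd.DataFrame({"cell": ["c1", "c2"], "T cell": [0.9, 0.1]})

# Before: prob_df.to_csv("query.rf.prob.df.tsv", sep="\t", index=False)
# After: appending .gz triggers gzip compression (compression="infer" is the default)
prob_df.to_csv("query.rf.prob.df.tsv.gz", sep="\t", index=False)

# Downstream readers need no change beyond the new extension:
roundtrip = pd.read_csv("query.rf.prob.df.tsv.gz", sep="\t")
print(roundtrip.equals(prob_df))  # True
```

In R, wrapping the connection with `gzfile()` (or using `readr::write_tsv()`, which also infers compression from a `.gz` suffix) achieves the same thing in `predict_seurat.R`.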

Affected modules / scripts

  • `scvi_predict` / `bin/predict_scvi.py` — probability TSVs (`.rf.prob.df.tsv`, `.knn.prob.df.tsv`)
  • `classify_all` / `bin/classify_all.py` — F1 summary + confusion matrix TSVs
  • `predict_seurat` / `bin/predict_seurat.R` — prediction score TSVs
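For each of these modules, the corresponding Nextflow declarations only need their glob patterns extended. A hypothetical sketch for `scvi_predict` (process body elided; emit names and publish path are illustrative, not from the pipeline):

```nextflow
process scvi_predict {
    publishDir "${params.outdir}/scvi_predict", pattern: "*.prob.df.tsv.gz"

    output:
    path "*.rf.prob.df.tsv.gz",  emit: rf_probs
    path "*.knn.prob.df.tsv.gz", emit: knn_probs

    // script block unchanged apart from the .gz suffix in predict_scvi.py
}
```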

Expected outcome

Significant reduction in disk usage in both the Nextflow work directory and the published output directory, with no change in downstream behavior.
