Distill the 3 species detection tables into a single output file with a python/pola.rs script

# Goals
- clarified in parent issue: #107

# Plan 
1. Read in each table and merge them into one using a common key (such as a `taxonomy_id`).
   - Do we need to find or create this key in all 3 tables first?
   - [ ] See how Eddy standardized her outputs for comparisons.
2. Keep only the necessary columns and sort the rows (by confidence of identification, or by abundance?).
3. Split into a high-confidence table (taxa detected by at least 2 of the 3 tools) and a low-confidence table, and write both out.
  
## Understand formats
Output format documentation for each tool:
- [sylph](https://sylph-docs.github.io/Output-format/)  
- [ganon2](https://pirovc.github.io/ganon/outputfiles/#ganon-classify): Output files:
>     results.rep: plain report of the run, used to further generate tree-like reports
>     results.tre: tree-like report with cumulative abundances by taxonomic ranks (can be re-generated with ganon report)
- [kraken2](https://github.com/DerrickWood/kraken2/wiki/Manual#output-formats): `mock9.kraken2.report.txt` looks similar to ganon2's `.tre`, with a hierarchy of taxonomic ranks?
- [taxpasta output](https://taxpasta.readthedocs.io/en/latest/tutorials/getting-started/#raw-output): how do I match the `taxonomy_id` to the species/taxon name?
  - To add taxonomic names (rather than IDs) to the profiles with taxpasta, see [here](https://taxpasta.readthedocs.io/en/latest/how-tos/how-to-add-names/). This requires supplying NCBI (or other) taxonomy files (`.dmp`).
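
If we end up mapping IDs to names ourselves instead of through taxpasta, a small parser for the NCBI `names.dmp` file may be enough. The sketch below assumes the standard taxdump layout (fields separated by `\t|\t`, lines ending in `\t|`, with the name class in the fourth field) and keeps only rows whose class is `scientific name`. The function name `load_scientific_names` is made up for illustration, and the sample lines are a hand-written excerpt in that layout, not real file contents.

```python
import io


def load_scientific_names(lines):
    """Map tax_id -> scientific name from names.dmp-style lines."""
    names = {}
    for line in lines:
        # Drop the trailing "\t|\n" terminator, then split on the field separator
        fields = line.rstrip("\t|\n").split("\t|\t")
        if len(fields) < 4:
            continue  # skip malformed or empty lines
        tax_id, name, _unique_name, name_class = fields[:4]
        if name_class == "scientific name":
            names[int(tax_id)] = name
    return names


# Hand-written sample in the names.dmp layout (not real file contents)
sample = io.StringIO(
    "562\t|\tEscherichia coli\t|\t\t|\tscientific name\t|\n"
    "562\t|\tBacillus coli\t|\t\t|\tsynonym\t|\n"
    "1280\t|\tStaphylococcus aureus\t|\t\t|\tscientific name\t|\n"
)
names = load_scientific_names(sample)
```

With the real file this would be called as `load_scientific_names(open("names.dmp"))`, and the resulting dict can be turned into a two-column table and joined onto the merged output by `taxonomy_id`.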

## Information to retain
 
- Species name
- Confidence metric:
  - adjusted_ANI (`sylph`)
- Abundance estimate

## References 
- Best tool to use: [polars](https://docs.pola.rs/) with Python; it is fast and Rust-based. _This is a quick way to learn the library, and Copilot/Seqera will help generate a base script and module._
- [x] (No, skip for now) Is it relevant to use [taxpasta](https://taxpasta.readthedocs.io/en/latest/) here to `standardize` or `merge` the 3 tool outputs?
  - [x] Does it [support](https://taxpasta.readthedocs.io/en/latest/#supported-taxonomic-profilers) all 3 of our tools? _It does not support sylph [yet](https://github.com/taxprofiler/taxpasta/issues/161)_ 😞. Brought up the polars library in their [repo here](https://github.com/taxprofiler/taxpasta/issues/164).
  - _Consider whether their minimalistic 2-column output format (`taxonomy_id` and `count`) is good enough for us._
- Example outputs from the 3 tools for the test datasets (`mock9` and `mock20`) are in these work dirs; store them somewhere or link them here for reference?
```groovy
[80/06b239] ORCHESTRATE_SOMATEM:SOMATEM:SPECIES_DETECTION:SYLPH_PROFILE (mock9)   [100%] 2 of 2 ✔
[38/2e20cc] ORCHESTRATE_SOMATEM:SOMATEM:SPECIES_DETECTION:GANON_CLASSIFY (mock9)  [100%] 2 of 2 ✔
[0e/f5823f] ORCHESTRATE_SOMATEM:SOMATEM:SPECIES_DETECTION:KRAKEN2_KRAKEN2 (mock9) [100%] 2 of 2 ✔
```

