Add genetic map auto detection and conversion #272
LouisLeNezet wants to merge 20 commits into nf-core:dev from
Conversation
| The following parameters are automatically detected, but can also be set (e.g. when you only provide `pos` and `cm` with no header):
|
| - `--map_sep`: Field separator used in the map file (e.g. "\t", " ", ",")
| - `--map_header`: Whether the file contains a header row (`true` or `false`)
| - `--map_colnames`: Ordered list of column names in the file
|
| For the example below, the map file uses tab separators, contains a header, and provides the columns in the following order: `chr`, `id`, `cm`, `pos`. Therefore, the appropriate parameters would be:
|
| `--map_sep "\t" --map_header true --map_colnames "chr,id,cm,pos"`
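As an illustration of how these three parameters could drive the parsing, here is a minimal Python sketch; the function name `read_genetic_map` and its defaults are hypothetical, not the pipeline's actual code:

```python
import csv
import io

def read_genetic_map(text, map_sep="\t", map_header=True,
                     map_colnames=("chr", "id", "cm", "pos")):
    """Parse a genetic map using user-supplied --map_sep, --map_header
    and --map_colnames values (illustrative sketch only)."""
    rows = []
    reader = csv.reader(io.StringIO(text), delimiter=map_sep)
    for i, fields in enumerate(reader):
        if map_header and i == 0:
            continue  # skip the header row when present
        rows.append(dict(zip(map_colnames, fields)))
    return rows

# Usage: a tab-separated file with a header row
example = "chr\tid\tcm\tpos\nchr1\tid1\t0.0\t55550\n"
print(read_genetic_map(example))
```

With `map_header=False` and a reordered `map_colnames`, the same helper covers headerless files, which is the scenario the quoted documentation describes.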
This is extremely complicated for a user, and adding more configuration parameters implies a higher risk of errors and the need for more tests and error messages.
I would suggest that we stick to one map format (e.g. the GLIMPSE format), require users to upload that format as the input file, and then, if they need to run any tool, autoconvert from that strict format to what the other tools need. The one map format to use could be the most "comprehensive" of them all (i.e. the one that includes all the necessary information). We could then point to external documentation on how to obtain it.
The autoconversion can be a single module that produces all the format types as different outputs, which are then used by the tools.
I think the format should be the Oxford 3-column format.
This could indeed be a simpler solution, where we would only need MAPCONVERT.
But as a consequence, we would push the burden of the conversion back onto users who don't have the right format.
As for which format to use as the default, it could be nice to ask the nf-core community. When I worked on dogs, the genetic map files I found were mostly in the 4-column PLINK format.
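To make the trade-off concrete, converting from one strict format to another is only a few lines. This sketch assumes a 4-column PLINK layout (`chr id cm pos`) as input and a 3-column `pos chr cM` layout (as used by GLIMPSE) as output; the function name is hypothetical:

```python
def plink_to_glimpse(lines):
    """Convert 4-column PLINK map lines (chr id cm pos) to a
    3-column 'pos chr cM' layout (assumed GLIMPSE-style target)."""
    out = ["pos chr cM"]  # assumed header of the target format
    for line in lines:
        chrom, _snp_id, cm, pos = line.split()
        out.append(f"{pos} {chrom} {cm}")
    return out

print(plink_to_glimpse(["chr1 id1 0.00000 55550",
                        "chr1 id4 0.41029 785910"]))
```

A single module exposing one such function per target format would match the "one strict input format, autoconvert to the rest" proposal above.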
Before implementing mapautodetect and mapconvert locally, check whether the nf-core GAWK module can be used. If that module does not fit your purpose, consider developing a general-purpose module under the nf-core CUSTOM/* modules, and use template scripts if it has more than 20 lines. You can then import the module into this pipeline via `nf-core modules install`, which avoids keeping a local copy.
Using the GAWK module would make maintenance complicated if we pass the program as a string.
But we could pass it as a file from the assets directory.
| ```csv title="chr21.map"
| chr1 id1 0.00000 55550
| chr1 id2 0.00000 632942
| chr1 id3 0.00000 633147
| chr1 id4 0.41029 785910
| chr1 id5 0.41742 788439
| chr1 id6 0.41764 788511
| chr1 id7 0.43061 792862
| chr1 id8 0.43586 794568
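For a file like the one above, autodetection of the separator and header can be sketched in a few lines of Python. This assumes candidate separators are tried in a fixed order and a header is recognized by the absence of numeric fields in the first row; function names are hypothetical, not pipeline code:

```python
def detect_sep(lines):
    """Pick the first candidate separator that splits every non-empty
    line into the same number (>1) of fields (illustrative sketch)."""
    for sep in ("\t", ",", ";", " "):
        counts = {len(line.split(sep)) for line in lines if line.strip()}
        if len(counts) == 1 and counts.pop() > 1:
            return sep
    raise ValueError("could not detect a separator")

def detect_header(lines, sep):
    """Assume a header when the first row contains no numeric field."""
    def numeric(field):
        try:
            float(field)
            return True
        except ValueError:
            return False
    return not any(numeric(f) for f in lines[0].split(sep))
```

This kind of heuristic is exactly where a strict, schema-validated single format would remove guesswork.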
You should validate the map structure with a schema: which columns are mandatory and which are optional? Appropriate errors should be raised.
Using a schema could indeed be a good solution.
I will try to implement it.
| "map_sep": {
|   "type": "string",
|   "description": "Separator used in the genetic map file.",
|   "default": null,
|   "fa_icon": "fas fa-arrows-alt-h",
|   "hidden": true
| },
| "map_header": {
|   "type": "boolean",
|   "description": "Does the genetic map file contain a header line?",
|   "fa_icon": "fas fa-heading",
|   "hidden": true
| },
| "map_colnames": {
|   "type": "string",
|   "description": "Column names for the genetic map file.",
|   "default": null,
|   "fa_icon": "fas fa-columns",
|   "pattern": "^(chr|id|cm|pos)(,(chr|id|cm|pos))*$",
|   "hidden": true
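The `pattern` above can also be enforced at runtime. A small Python sketch, reusing the same regular expression and adding a uniqueness check that the regex alone cannot express (the helper name is hypothetical):

```python
import re

# Same pattern as in the schema; note it still accepts duplicates
# such as "chr,chr", so an extra uniqueness check is useful.
COLNAMES_PATTERN = re.compile(r"^(chr|id|cm|pos)(,(chr|id|cm|pos))*$")

def validate_colnames(value):
    """Validate a --map_colnames string and return the column list."""
    if not COLNAMES_PATTERN.match(value):
        raise ValueError(f"invalid map_colnames: {value!r}")
    names = value.split(",")
    if len(names) != len(set(names)):
        raise ValueError(f"duplicate column names in map_colnames: {value!r}")
    return names
```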
This can be detected with a schema.
Do not change tests. Add new tests if necessary, but do not change existing tests; otherwise, we cannot evaluate whether previous functionality remains stable.
I made sure that the md5sums only changed for the tests concerned by the addition of the map file.
The only snapshots that changed are for tests where we added a map.
This is also why I changed the test names to more descriptive ones, to better follow the changes.
If no nf-core module fulfills this, it could be replaced by a single custom module doing the detection + conversion in Python/R (and submitted to nf-core). The AWK code has no validation, no numeric type enforcement and no header awareness; it is harder to unit test and fragile with respect to whitespace.
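For illustration, a validating parser with numeric type enforcement is short in Python. This is a sketch under the assumption of a 4-column `chr id cm pos` layout, not an actual module:

```python
def parse_map_line(line, lineno):
    """Validate one 4-column map record (chr id cm pos), enforcing
    column count and numeric types, with line-numbered errors."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"line {lineno}: expected 4 columns, got {len(fields)}")
    chrom, snp_id, cm, pos = fields
    try:
        cm_val = float(cm)
        pos_val = int(pos)
    except ValueError as err:
        raise ValueError(f"line {lineno}: non-numeric cm/pos field") from err
    if cm_val < 0 or pos_val < 0:
        raise ValueError(f"line {lineno}: negative coordinate")
    return {"chr": chrom, "id": snp_id, "cm": cm_val, "pos": pos_val}
```

Each failure mode raises a distinct, line-numbered error, which is the kind of message the plain AWK version cannot easily produce.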
Please do not change the original tests; this messes up the nf-tests later and makes traceability more difficult. We can manage test cases using the params in nf-test.
Here, I only reordered the input to match the other config files.
The aim of a test, from my point of view, is to be as close as possible to a real scenario.
That's why I added a map file in test_batch, test, test_quilt, test_stitch, test_panelprep, test_all and test_vcf, as they weren't using one yet.
I can remove it from test_batch and test_vcf, as they are not aimed at testing the tools.
But in my opinion, the tests aimed at tool testing should include the map to be more comprehensive.
PR checklist
Closes #216
Closes #150
Genetic map has been added to `test_batch`, `test`, `test_quilt`, `test_stitch`, `test_panelprep`, `test_all` and `test_vcf`. Therefore the md5sums for the tests with a map have been updated. The test names have been clarified.

- Make sure your code lints (`nf-core pipelines lint`).
- Ensure the test suite passes (`nextflow run . -profile test,docker --outdir <OUTDIR>`).
- Ensure the test suite passes in debug mode (`nextflow run . -profile debug,test,docker --outdir <OUTDIR>`).
- `docs/usage.md` is updated.
- `docs/output.md` is updated.
- `CHANGELOG.md` is updated.
- `README.md` is updated (including new tool citations and authors/contributors).