Skip to content

Add genetic map auto detection and conversion#272

Open
LouisLeNezet wants to merge 20 commits intonf-core:devfrom
LouisLeNezet:map_convert
Open

Add genetic map auto detection and conversion#272
LouisLeNezet wants to merge 20 commits intonf-core:devfrom
LouisLeNezet:map_convert

Conversation

@LouisLeNezet
Copy link
Collaborator

@LouisLeNezet LouisLeNezet commented Mar 16, 2026

PR checklist

Closes #216
Closes #150

Genetic map has been added to test_batch, test, test_quilt, test_stitch, test_panelprep, test_all and test_vcf. Therefore the md5sum for the test - with map have been updated.
The test names have been clarified.

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • If necessary, also make a PR on the nf-core/phaseimpute branch on the nf-core/test-datasets repository.
  • Make sure your code lints (nf-core pipelines lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Check for unexpected warnings in debug mode (nextflow run . -profile debug,test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@LouisLeNezet LouisLeNezet self-assigned this Mar 16, 2026
@LouisLeNezet LouisLeNezet changed the base branch from main to dev March 16, 2026 16:47
@LouisLeNezet LouisLeNezet linked an issue Mar 16, 2026 that may be closed by this pull request
@LouisLeNezet LouisLeNezet requested a review from atrigila March 17, 2026 14:04
Comment on lines +162 to +170
The following parameters are automatically detected, but can also be set (e.g. when you only provide `pos` and `cm` with no header):

- `--map_sep`: Field separator used in the map file (e.g. "\t", " ", ",")
- `--map_header`: Whether the file contains a header row (`true` or `false`)
- `--map_colnames`: Ordered list of column names in the file

For the example below, the map file uses tab separators, contains a header, and provides the columns in the following order: `chr`, `id`, `cm`, `pos`. Therefore, the appropriate parameters would be:

`--map_sep "\t" --map_header true --map_colnames "chr,id,cm,pos"`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is extremely complicated for an user and the fact that it has more parameters to set configurations implies more risk of errors and the necessity of more tests and error messages.

I would suggest that we stick to one map format (e.g. glimpse format), then require users to upload that format as input file and then if they need to run any tool, the autoconversion will be from that strict format to the rest of the tools. The one map format to use could be the one that is the most "comprehensive" of them all (that it includes all the necessary information). We could then point out to external documentation on how to obtain those.

The autoconversion can be a single module that produces all the types of formats as different outputs, then used by the tools.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the format should be Oxford 3-column format

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could indeed be a simpler solution, where we would only need MAPCONVERT.
But as a consequence we push the burden of the conversion back to the user if they don't have the right format.

For which format to use as default, I think it could be nice to ask the nf-core community. When I worked on dogs, the genetic map file I found was mostly in 4-column plink format.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before implementing mapautodetect and mapconvert locally, check if we can use the GAWK nf-core module. If this nf-core module does not fit your purpose, consider developing a general‐purpose module under nf-core CUSTOM/* modules and also use template scripts if it has more than 20 lines. You can then import the module into this pipeline via nf‑core modules install, which will avoid having a local copy.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the GAWK modules would make the maintenance complicated if we pass the program as a string.
But we could pass it as a file from the assets directory.

Comment on lines +172 to +180
```csv title="chr21.map"
chr1 id1 0.00000 55550
chr1 id2 0.00000 632942
chr1 id3 0.00000 633147
chr1 id4 0.41029 785910
chr1 id5 0.41742 788439
chr1 id6 0.41764 788511
chr1 id7 0.43061 792862
chr1 id8 0.43586 794568
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should validate the map structure with a schema, which columns are mandatory? which are optional? Correct errors should be raised.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The schema usage, could indeed be a good solution.
I will try to implement it.

Comment on lines +267 to +286
"map_sep": {
"type": "string",
"description": "Separator used in the genetic map file.",
"default": null,
"fa_icon": "fas fa-arrows-alt-h",
"hidden": true
},
"map_header": {
"type": "boolean",
"description": "Does the genetic map file contain a header line?",
"fa_icon": "fas fa-heading",
"hidden": true
},
"map_colnames": {
"type": "string",
"description": "Column names for the genetic map file.",
"default": null,
"fa_icon": "fas fa-columns",
"pattern": "^(chr|id|cm|pos)(,(chr|id|cm|pos))*$",
"hidden": true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be detected with a schema.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do not change tests. Add new tests if necessary, but do not change previous tests. Otherwise, we cannot evaluate if previous functionality remains stable.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made sure that the md5sum only changed for the test concerned by the addition of the map file.
The only snapshot that changed are for test where we added a map.

This is also why I've changed the test for a more descriptive name to better follow the changes.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If no nf-core modules fulfill this, this could be replaced by a single custom module doing the detect + convert using python/R (and submitted to nf-core). The AWK code has no validation, no numeric type enforcement, no header awareness, it is a bit harder to unit test and fragile to whitespaces.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not change the original tests, this messes up with the nf-tests later and it is more difficult to do traceability. We can manage tests cases using the params in nf-test.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I only reordered the input to match the other config files.

The aim of the test from my point of view is to be the closer as possible to real scenario.
That's why I did add a map file in test_batch, test, test_quilt, test_stitch, test_panelprep, test_all and test_vcf. As they weren't using one yet.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can remove it from test_batch and test_vcf as they are not aimed to test the tools.
But for me, the test aimed at tool testing should have the map to me more comprehensive.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Map is not supported yet

2 participants