@anna-parker anna-parker commented Nov 10, 2025

resolves #4708, #4734

partially resolves #5392, #5185 (comment)

Builds on #5382

BREAKING CHANGES

When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry, they are now required to add an additional fastaId column containing a space- or comma-separated list of the fastaIds (fasta header IDs) of the respective sequences. If no fastaId column is supplied, the submissionId is used instead and the backend assumes (as in the single-segmented case) a one-to-one mapping of metadata submissionId to fastaId.
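
As a rough illustration of the grouping rule described here, the following Python sketch splits a `fastaId` cell on whitespace or commas and falls back to the `submissionId` when the column is absent. The function name `resolve_fasta_ids` and the dict-based row are assumptions made for this example; the actual backend is written in Kotlin and may differ.

```python
import re


def resolve_fasta_ids(row: dict[str, str]) -> list[str]:
    """Return the fastaIds grouped under one metadata row.

    Hypothetical helper: `row` stands in for one parsed metadata line.
    """
    raw = row.get("fastaId")
    if raw is None:
        # No fastaId column: assume a one-to-one submissionId -> fastaId mapping.
        return [row["submissionId"]]
    # Split on any run of whitespace and/or commas, dropping empty pieces.
    return [part for part in re.split(r"[,\s]+", raw.strip()) if part]


rows = [
    {"submissionId": "sample1", "fastaId": "seqA seqB,seqC"},
    {"submissionId": "sample2"},  # no fastaId supplied
]
grouped = {row["submissionId"]: resolve_fasta_ids(row) for row in rows}
print(grouped)
```

Here `sample1` groups three sequences, while `sample2` is matched to a single sequence whose fasta header equals its `submissionId`.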

This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings)

Nextclade sort will be used to assign segments/subtypes for all aligned sequences:

minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype; entries must have the format <submissionId>_<segmentName> (as in the current setup).
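
A minimal sketch of how a header of the form `<submissionId>_<segmentName>` could be taken apart. It assumes the segment name is whatever follows the last underscore and that segment names themselves contain no underscore; `split_fasta_header` is a hypothetical name, not a function from the codebase.

```python
def split_fasta_header(header: str, segment_names: set[str]) -> tuple[str, str]:
    """Split '<submissionId>_<segmentName>' at the last underscore.

    Assumption: segment names contain no underscore, so the submissionId
    may contain underscores but the segment name may not.
    """
    submission_id, separator, segment_name = header.rpartition("_")
    if not separator or not submission_id or segment_name not in segment_names:
        raise ValueError(f"Header {header!r} does not end in a known segment name")
    return submission_id, segment_name


print(split_fasta_header("my_sample_1_L", {"L", "M", "S"}))
```

Splitting at the last underscore keeps submission IDs such as `my_sample_1` intact.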

As preprocessing now assigns segments it will return a map from the segment (or subtype) to the fastaHeader in the processedData: sequenceNameToFastaHeaderMap. This allows us to surface this assignment on the edit page.
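
To make the shape of this map concrete, here is a small Python sketch. The input format (a list of `(fasta_header, segment_name)` pairs) and the choice to raise on a doubly assigned segment are assumptions for illustration; only the key name `sequenceNameToFastaHeaderMap` comes from the description above.

```python
def build_sequence_name_to_fasta_header_map(
    assignments: list[tuple[str, str]],
) -> dict[str, str]:
    """Map each assigned segment (or subtype) to the fasta header it came from.

    `assignments` is a hypothetical list of (fasta_header, segment_name) pairs.
    """
    mapping: dict[str, str] = {}
    for fasta_header, segment_name in assignments:
        if segment_name in mapping:
            # Assumed policy: two sequences claiming one segment is an error.
            raise ValueError(f"Segment {segment_name!r} assigned more than once")
        mapping[segment_name] = fasta_header
    return mapping


processed_data = {
    "sequenceNameToFastaHeaderMap": build_sequence_name_to_fasta_header_map(
        [("seqA", "L"), ("seqB", "M"), ("seqC", "S")]
    )
}
print(processed_data)
```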

Testing

You can use pathoplexus/dev_example_data#2 for testing.

Prepro config changes

Instead of having a dictionary for the nextclade datasets and servers, we make nucleotideSequences a list of sequences. Old config:

nextclade_dataset_name: 
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]

New config:

nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> 
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> 
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output

Note the templates now also generate the genes list from the merged config.
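
The fallback rules for the per-sequence fields can be sketched in Python as follows. This is an illustration under stated assumptions, not the preprocessing pipeline's code: `resolve_sequences` is a hypothetical name, the config is given as an already-parsed dict, `accepted_sort_matches` is assumed to be a list, and the override URL is a placeholder.

```python
def resolve_sequences(config: dict) -> list[dict]:
    """Fill in per-sequence defaults from the top-level config.

    Hypothetical helper mirroring the documented fallbacks:
    - nextclade_dataset_server: per-sequence value overrides the top-level one
    - accepted_sort_matches: defaults to the sequence's nextclade_dataset_name
    """
    default_server = config.get("nextclade_dataset_server")
    resolved = []
    for seq in config["nucleotideSequences"]:
        dataset_name = seq.get("nextclade_dataset_name")
        resolved.append(
            {
                "name": seq["name"],
                "nextclade_dataset_name": dataset_name,
                "nextclade_dataset_server": seq.get(
                    "nextclade_dataset_server", default_server
                ),
                "accepted_sort_matches": seq.get("accepted_sort_matches")
                or ([dataset_name] if dataset_name else []),
            }
        )
    return resolved


config = {
    "nucleotideSequences": [
        {
            "name": "L",
            "nextclade_dataset_name": "nextstrain/cchfv/linked/L",
            # Placeholder override URL, for illustration only.
            "nextclade_dataset_server": "https://example.org/override",
        },
        {"name": "M", "nextclade_dataset_name": "nextstrain/cchfv/linked/M"},
    ],
    "nextclade_dataset_server": "https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output",
}
for entry in resolve_sequences(config):
    print(entry["name"], entry["nextclade_dataset_server"], entry["accepted_sort_matches"])
```

Segment `L` keeps its own server, while `M` inherits the top-level `nextclade_dataset_server` and uses its dataset name as the only accepted sort match.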

PR Checklist

  • Update values.schema.json
  • keep tests for alignment NONE case
  • Create a minimizer for tests using: https://github.com/loculus-project/nextclade-sort-minimizer-creator
  • Any manual testing that has been done is documented: submission of EVs from test folder were submitted with the same fastaHeader as the submissionId -> this succeeded, additionally the submission of CCHF with a fastaID column in the metadata was tested (also in folder above), additionally revision of a segment was tested
  • Have preprocessing send back a segment: fastaHeader mapping

Future Work

  • add integration testing for full EV submission user journey
  • improve CCHF minimizer (some segments are again not assigned)
  • discuss if the originalData dictionary should be migrated (persistent DB has segmentName as key, now we have fastaHeader as key)
  • update PPX docs with new multi-segment submission format

🚀 Preview: https://multi-segment-submission.loculus.org

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Nov 10, 2025
@anna-parker anna-parker changed the base branch from main to edit-page-anya November 10, 2025 08:34
@anna-parker anna-parker force-pushed the multi-segment-submission-2 branch 4 times, most recently from 12b539b to f8df4fa Compare November 10, 2025 08:49
@anna-parker anna-parker changed the title Multi segment submission 2 feat!(backend): refactor multi-segment submission Nov 10, 2025
@anna-parker anna-parker force-pushed the multi-segment-submission-2 branch from eec24bb to f8df4fa Compare November 10, 2025 10:29
@anna-parker
Contributor Author

@codex review

@anna-parker anna-parker marked this pull request as ready for review November 10, 2025 10:29

@anna-parker anna-parker force-pushed the multi-segment-submission-2 branch from ffac38a to 87fbe02 Compare November 11, 2025 08:57
@fengelniederhammer
Contributor

Nitpick: IMO the commit message shouldn't be "refactor" but should instead describe that it's a significant change to the sequence structure. Maybe something like "don't require the segment name in sequence submissions"?

@fengelniederhammer
Contributor

> When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry they are required to add an additional fastaId column with a space (or comma) -separated list of the fastaIds (fasta header IDs) of the respective sequences. If no fastaId column is supplied the submissionId will be used instead and the backend will assume that (as in the single-segmented case) there is a one-to-one mapping of metadata submissionId to fastaId.

Could you please explain why we need the fastaIds now? Is the submissionId even necessary then?

I think initially we agreed to still stick to the pattern >{submissionid}_{someSegmentId} for the fasta headers. Why don't we keep it?

@theosanderson
Member

> Is the submissionId even necessary then?

I think the submissionId is generally part of our model as (in part) a user's id for a sequence. E.g. we surface it on the review cards.

Comment on lines 133 to 137
    val metadataFastaIds = uploadDatabaseService.getFastaIdsForMetadata(uploadId)
    val metadataFastaIdsSet = metadataFastaIds.flatten().toSet()
    if (metadataFastaIdsSet.size < metadataFastaIds.flatten().size) {
        throw UnprocessableEntityException("Metadata file contains duplicate fastaIds.")
    }
Put this into a validate function, similar to the one below for submission IDs.

corneliusroemer and others added 4 commits November 20, 2025 12:25
resolves #4847

### Screenshot

Improves #4821, comes
after #5398

You can use pathoplexus/dev_example_data#2 for
testing.

Nextclade sort will be used to assign segments/subtypes for all aligned
sequences:
```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```
For organisms without a nextclade dataset we still allow the fasta
headers to be used to determine the segment/subtype - entries must have
the format <submissionId>_<segmentName> (as in current set up).

As preprocessing now assigns segments it will return a map from the
segment (or subtype) to the fastaHeader in the processedData:
`sequenceNameToFastaHeaderMap`. This allows us to surface this
assignment on the edit page.

## Prepro config changes

Instead of having a dictionary for the nextclade datasets and servers we
make `nucleotideSequences` a list of sequences:
```
nextclade_dataset_name: 
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> 
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> 
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```

Note the templates now also generate the genes list from the merged
config.

### PR Checklist
- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using:
https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: submission of
EVs from test folder were submitted with the same fastaHeader as the
submissionId -> this succeeded, additionally the submission of CCHF with
a fastaID column in the metadata was tested (also in folder above),
additionally revision of a segment was tested
- [x] Have preprocessing send back a segment: fastaHeader mapping

## Future Work
- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated
(persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format

🚀 Preview: https://sort-multi-path.loculus.org
@anna-parker anna-parker merged commit 0d41959 into edit-page-anya Nov 20, 2025
41 of 45 checks passed
@anna-parker anna-parker deleted the multi-segment-submission-2 branch November 20, 2025 13:32
@anna-parker anna-parker restored the multi-segment-submission-2 branch November 20, 2025 14:09
anna-parker added a commit that referenced this pull request Nov 20, 2025
Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
anna-parker added a commit that referenced this pull request Nov 21, 2025
anna-parker added a commit that referenced this pull request Nov 24, 2025
anna-parker added a commit that referenced this pull request Nov 24, 2025
anna-parker added a commit that referenced this pull request Nov 25, 2025
anna-parker added a commit that referenced this pull request Nov 26, 2025
resolves #4708,
#4734

partially resolves
#5392,
#5185 (comment)

Builds on #5382

When users submit to multi-segmented organisms and want to group
multiple segments under one metadata entry they are now required to add
an additional `fastaId` column with a space (or comma) -separated list
of the `fastaIds` (fasta header IDs) of the respective sequences. If no
`fastaId` column is supplied the `submissionId` will be used instead and
the backend will assume that (as in the single-segmented case) there is
a one-to-one mapping of metadata `submissionId` to `fastaId`.

This new submission structure was voted for in microbioinfo:
https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399
and discussed in
https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6
(and in other meetings)

Nextclade sort will be used to assign segments/subtypes for all aligned
sequences:
```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```
For organisms without a nextclade dataset we still allow the fasta
headers to be used to determine the segment/subtype - entries must have
the format <submissionId>_<segmentName> (as in current set up).

As preprocessing now assigns segments it will return a map from the
segment (or subtype) to the fastaHeader in the processedData:
`sequenceNameToFastaHeaderMap`. This allows us to surface this
assignment on the edit page.

You can use pathoplexus/dev_example_data#2 for
testing.

Instead of having a dictionary for the nextclade datasets and servers we
make `nucleotideSequences` a list of sequences:
```
nextclade_dataset_name:
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional; was previously incorrectly placed at the organism level>
    nextclade_dataset_server: <optional; overrides the top-level nextclade_dataset_server for this sequence>
    accepted_sort_matches: <optional; used for classify_with_nextclade_sort and require_nextclade_sort_match; defaults to nextclade_dataset_name if not given>
    gene_prefix: <optional; prefix added to genes produced by the nextclade run, e.g. nextclade labels a gene `AV1` but we expect `EV1_AV1`, so the prefix would be `EV1`>
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```
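
The change from the old to the new layout can be sketched as a small transformation on the parsed config. This is only an illustration using plain dictionaries; `migrate_prepro_config` is a hypothetical name, and the sketch copies all other top-level keys unchanged (including `genes`, which the new example above omits), so adjust as needed.

```python
def migrate_prepro_config(old: dict) -> dict:
    """Turn the old per-segment dataset dictionary into a nucleotideSequences list.

    Hypothetical helper for illustration; not part of the Loculus templates.
    """
    # Keep every top-level key except the old per-segment dictionary.
    new = {key: value for key, value in old.items() if key != "nextclade_dataset_name"}
    dataset_names = old.get("nextclade_dataset_name", {})
    new["nucleotideSequences"] = [
        {"name": segment, "nextclade_dataset_name": dataset}
        for segment, dataset in dataset_names.items()
    ]
    return new
```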

Note that the templates now also generate the genes list from the merged
config.

- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using:
https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the
test folder were submitted with the same fastaHeader as the
submissionId, which succeeded; submission of CCHF with a `fastaId`
column in the metadata was tested (also in the folder above); revision
of a segment was also tested
- [x] Have preprocessing send back a segment: fastaHeader mapping

- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated
(persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format

🚀 Preview: https://multi-segment-submission.loculus.org

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
anna-parker added a commit that referenced this pull request Nov 26, 2025
anna-parker added a commit that referenced this pull request Nov 27, 2025
anna-parker added a commit that referenced this pull request Nov 28, 2025
anna-parker added a commit that referenced this pull request Dec 1, 2025
anna-parker added a commit that referenced this pull request Dec 1, 2025
anna-parker added a commit that referenced this pull request Dec 2, 2025
anna-parker added a commit that referenced this pull request Dec 3, 2025
anna-parker added a commit that referenced this pull request Dec 4, 2025
resolves #4708,
#4734

partially resolves
#5392,
#5185 (comment)

Builds on #5382

When users submit to multi-segmented organisms and want to group
multiple segments under one metadata entry they are now required to add
an additional `fastaId` column with a space (or comma) -separated list
of the `fastaIds` (fasta header IDs) of the respective sequences. If no
`fastaId` column is supplied the `submissionId` will be used instead and
the backend will assume that (as in the single-segmented case) there is
a one-to-one mapping of metadata `submissionId` to `fastaId`.

This new submission structure was voted for in microbioinfo:
https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399
and discussed in
https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6
(and in other meetings)

Nextclade sort will be used to assign segments/subtypes for all aligned
sequences:
```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```
For organisms without a nextclade dataset we still allow the fasta
headers to be used to determine the segment/subtype - entries must have
the format <submissionId>_<segmentName> (as in current set up).

As preprocessing now assigns segments it will return a map from the
segment (or subtype) to the fastaHeader in the processedData:
`sequenceNameToFastaHeaderMap`. This allows us to surface this
assignment on the edit page.

You can use pathoplexus/dev_example_data#2 for
testing.

Instead of having a dictionary for the nextclade datasets and servers we
make `nucleotideSequences` a list of sequences:
```
nextclade_dataset_name:
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level>
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used>
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```

Note the templates now also generate the genes list from the merged
config.

- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using:
https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the
test folder were submitted with the same fastaHeader as the submissionId
-> this succeeded. Additionally, submission of CCHF with a fastaID
column in the metadata was tested (also in the folder above), as was
revision of a segment
- [x] Have preprocessing send back a segment: fastaHeader mapping

- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated
(persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format

🚀 Preview: https://multi-segment-submission.loculus.org

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
anna-parker added a commit that referenced this pull request Dec 5, 2025
… refactor multi segment submission in backend and edit page and have prepro assign segments (#5382)

resolves #4999 #4708,
#4734,
#5511

partially resolves
#5392,
#5185 (comment)

includes work done in
#5398 and
#5402

This PR additionally fixes submission, subtype assignment and search for
EVs and other multi-path organisms.

### BREAKING CHANGES

When users submit to multi-segmented organisms and want to group
multiple segments under one metadata entry they are now required to add
an additional `fastaIds` column with a space-separated list of the
`fastaId`s (fasta header IDs) of the respective sequences. If no
`fastaIds` column is supplied the `submissionId` will be used instead
and the backend will assume that (as in the single-segmented case) there
is a one-to-one mapping of metadata `submissionId` to `fastaId`.
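
As a rough illustration of the rule above (not the backend's actual code), the sketch below resolves the fastaIds for one metadata row. The function name `resolve_fasta_ids` is hypothetical, and treating an empty `fastaIds` cell the same as a missing column is an assumption.

```python
def resolve_fasta_ids(row: dict[str, str]) -> list[str]:
    """Return the fastaIds that belong to one metadata row."""
    raw = (row.get("fastaIds") or "").strip()
    if not raw:
        # Fallback described above: one-to-one mapping of submissionId to fastaId.
        # Assumption: an empty cell is handled like a missing column.
        return [row["submissionId"]]
    # str.split() without arguments splits on any run of whitespace.
    return raw.split()


multi_segment_row = {
    "submissionId": "sample1",
    "fastaIds": "sample1_L sample1_M sample1_S",
}
single_row = {"submissionId": "sample2"}

print(resolve_fasta_ids(multi_segment_row))  # ['sample1_L', 'sample1_M', 'sample1_S']
print(resolve_fasta_ids(single_row))  # ['sample2']
```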

This new submission structure was voted for in microbioinfo:
https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399
and discussed in
https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6
(and in other meetings)

Nextclade sort (uses a minimizer index for fast local alignment) or
nextclade align (full sequence alignment to reference) will be used to
assign segments/subtypes for all multi-segmented and multi-pathogen
sequences (this is also done in ingest for grouping segments):
```
segment_classification_method: "minimizer" or "align"
minimizer_url: <url_to_minimizer_index_used_by_nextclade_sort>
```
For organisms without a nextclade dataset we still allow the fasta
headers to be used to determine the segment/subtype - entries must have
the format `<submissionId>_<segmentName>` (as in current set up).
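
A minimal sketch of how such a header could be split, assuming the configured segment names are known up front. The helper `split_header` is hypothetical and is not taken from the preprocessing code; matching against known segment names (rather than splitting on the last underscore) is a design choice made here so that underscores inside the submissionId do not cause a wrong split.

```python
from __future__ import annotations


def split_header(header: str, segment_names: list[str]) -> tuple[str, str] | None:
    """Split '<submissionId>_<segmentName>' into (submissionId, segmentName).

    Returns None when the header does not end in a known segment name.
    """
    # Try longer names first so that e.g. 'seg_L' wins over 'L' if both exist.
    for segment in sorted(segment_names, key=len, reverse=True):
        suffix = f"_{segment}"
        if header.endswith(suffix) and len(header) > len(suffix):
            return header[: -len(suffix)], segment
    return None


segments = ["L", "M", "S"]
print(split_header("sample_1_L", segments))  # ('sample_1', 'L')
print(split_header("sample1", segments))  # None
```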

As preprocessing now assigns segments it will return a map from the
segment (or subtype) to the fastaId in the processedData, the map is
called: `sequenceNameToFastaId`. This allows us to surface the segment
assignment on the edit page.
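
To make the shape of this map concrete, here is a small sketch that inverts a per-sequence classification result into a segment-to-fastaId map. The input format, the function name and the choice to raise on duplicate assignments are all assumptions for illustration; the real pipeline may report such cases differently.

```python
def build_sequence_name_to_fasta_id(assignments: dict[str, str]) -> dict[str, str]:
    """Invert {fastaId: segment} into {segment: fastaId}."""
    sequence_name_to_fasta_id: dict[str, str] = {}
    for fasta_id, segment in assignments.items():
        if segment in sequence_name_to_fasta_id:
            # Assumption: two sequences for the same segment is treated as an error.
            raise ValueError(
                f"Segment {segment!r} assigned to both "
                f"{sequence_name_to_fasta_id[segment]!r} and {fasta_id!r}"
            )
        sequence_name_to_fasta_id[segment] = fasta_id
    return sequence_name_to_fasta_id


# Hypothetical classification result: fastaId -> assigned segment
classified = {"contig_a": "L", "contig_b": "M", "contig_c": "S"}
print(build_sequence_name_to_fasta_id(classified))
# {'L': 'contig_a', 'M': 'contig_b', 'S': 'contig_c'}
```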

### Nextclade Preprocessing pipeline config changes

Instead of having a dictionary for the nextclade datasets and servers we
add `nextclade_sequence_and_datasets`, a list where each item includes
all information required to run nextclade for one sequence. I.e. we
change from:
```
nextclade_dataset_name:
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```
to: 
```
nextclade_sequence_and_datasets: 
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> 
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name and name are used> 
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
    genes: [RdRp]
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
    genes: [GPC]
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
    genes: [NP]
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
segment_classification_method: <optional, default for multi segmented viruses is align - if you assign segments in ingest for grouping use the same option here as you use there e.g. "minimizer" or "align">
minimizer_url: <optional, url_to_minimizer_index_used_by_nextclade_sort>
```
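
For anyone migrating an existing config, the change can be sketched as a plain dictionary transformation. This is only an illustration: `migrate_nextclade_config` is a hypothetical helper, and because the old flat `genes` list does not say which gene belongs to which segment, that mapping has to be supplied by hand (here as `genes_by_segment`).

```python
def migrate_nextclade_config(old: dict, genes_by_segment: dict[str, list[str]]) -> dict:
    """Convert the old per-key dictionaries into the new list-of-sequences layout."""
    # Keep everything except the keys that move into the per-sequence items.
    new = {
        key: value
        for key, value in old.items()
        if key not in ("nextclade_dataset_name", "genes")
    }
    new["nextclade_sequence_and_datasets"] = [
        {
            "name": segment,
            "nextclade_dataset_name": dataset,
            "genes": genes_by_segment.get(segment, []),
        }
        for segment, dataset in old["nextclade_dataset_name"].items()
    ]
    return new


old_config = {
    "nextclade_dataset_name": {
        "L": "nextstrain/cchfv/linked/L",
        "M": "nextstrain/cchfv/linked/M",
        "S": "nextstrain/cchfv/linked/S",
    },
    "nextclade_dataset_server": "https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output",
    "genes": ["RdRp", "GPC", "NP"],
}
new_config = migrate_nextclade_config(
    old_config, {"L": ["RdRp"], "M": ["GPC"], "S": ["NP"]}
)
for item in new_config["nextclade_sequence_and_datasets"]:
    print(item)
```

The top-level `nextclade_dataset_server` is left untouched, since it still acts as the default for every sequence.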

### Ingest Pipeline Config changes

`minimizer_index` is renamed to `minimizer_url` for consistency (the
option can be used in both ingest and preprocessing, and the value should
be the same in both).

### Optional additional Config changes

Limit the number of sequences the backend will accept per submission by
using `maxSequencesPerEntry` (this should be added for multi-segmented
organisms):
```
submissionDataTypes: &defaultSubmissionDataTypes
  consensusSequences: true
  maxSequencesPerEntry: 1
```

### Testing

You can use pathoplexus/example_data#16 and
pathoplexus/dev_example_data#2 for testing.

### PR Checklist
- [x] Update values.schema.json and other READMEs
- [x] add fastaId to commonMetadata (ensure it is downloaded in
templates): #5561
- [x] Fix how genes are returned (will cause a config update):
#5563
- [x] Improve prepro code (less duplication and more tests):
#5554
- [x] ingest EVs as single segmented to ensure search works:
#5511
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using:
https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the
test folder were submitted with the same fastaHeader as the submissionId
-> this succeeded. Additionally, submission of CCHF with a fastaID
column in the metadata was tested (also in the folder above), as was
revision of a segment
- [x] Have preprocessing send back a segment: fastaHeader mapping
- ~add integration testing for full EV submission user journey~ -> will
be done in a later PR
- [x] improve CCHF minimizer (some segments are again not assigned)
- [x] discuss if the originalData dictionary should be migrated
(persistent DB has segmentName as key, now we have fastaHeader as key)
-> decided against
- [x] update PPX docs with new multi-segment submission format -> test
PR here: pathoplexus/pathoplexus#759
- [x] update example data for demo

🚀 Preview: https://edit-page-anya.loculus.org

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
Co-authored-by: Theo Sanderson <theo@sndrsn.co.uk>