@anna-parker
Contributor

@anna-parker anna-parker commented Nov 6, 2025

resolves #4999, #4708, #4734, #5511

partially resolves #5392, #5185 (comment)

includes work done in #5398 and #5402

This PR additionally fixes submission, subtype assignment and search for EVs and other multi-path organisms.

BREAKING CHANGES

When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry, they are now required to add an additional fastaIds column with a space-separated list of the fastaIds (fasta header IDs) of the respective sequences. If no fastaIds column is supplied, the submissionId is used instead and the backend assumes that (as in the single-segmented case) there is a one-to-one mapping of metadata submissionId to fastaId.
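
The fallback described above can be sketched as follows. This is an illustration in Python, not the backend's actual code: the function name `resolve_fasta_ids` and the row shape (a column-to-value dictionary) are assumptions, and treating an empty `fastaIds` value like a missing column is also an assumption.

```python
def resolve_fasta_ids(metadata_row: dict[str, str]) -> list[str]:
    """Return the fasta IDs grouped under one metadata entry.

    Hypothetical helper: the row is assumed to map column names to values.
    """
    raw = (metadata_row.get("fastaIds") or "").strip()
    if raw:
        # Space-separated list of fasta header IDs.
        return raw.split()
    # No fastaIds column (assumption: same for an empty value):
    # one-to-one mapping of submissionId to fastaId.
    return [metadata_row["submissionId"]]


rows = [
    {"submissionId": "sample1", "fastaIds": "sample1_L sample1_M sample1_S"},
    {"submissionId": "sample2"},
]
for row in rows:
    print(row["submissionId"], "->", resolve_fasta_ids(row))
```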

This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings)

Nextclade sort (uses a minimizer index for fast local alignment) or nextclade align (full sequence alignment to reference) will be used to assign segments/subtypes for all multi-segmented and multi-pathogen sequences (this is also done in ingest for grouping segments):

segment_classification_method: "minimizer" or "align"
minimizer_url: <url_to_minimizer_index_used_by_nextclade_sort>
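
As a rough illustration of what assigning segments from classification results could look like, the sketch below picks, for each sequence, the best-scoring match among the datasets configured for the organism. It does not call nextclade: it assumes the matches have already been parsed into records, and the `SortHit` fields, the `dataset_to_segment` mapping and the "highest score wins" rule are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SortHit:
    """One (sequence, dataset) match; field names are hypothetical."""

    seq_name: str
    dataset: str
    score: float


def assign_segments(
    hits: list[SortHit], dataset_to_segment: dict[str, str]
) -> dict[str, str]:
    """Map each sequence name to the segment of its best-scoring accepted dataset."""
    best: dict[str, SortHit] = {}
    for hit in hits:
        if hit.dataset not in dataset_to_segment:
            continue  # match to a dataset this organism does not accept
        current = best.get(hit.seq_name)
        if current is None or hit.score > current.score:
            best[hit.seq_name] = hit
    return {name: dataset_to_segment[hit.dataset] for name, hit in best.items()}


dataset_to_segment = {
    "nextstrain/cchfv/linked/L": "L",
    "nextstrain/cchfv/linked/M": "M",
    "nextstrain/cchfv/linked/S": "S",
}
hits = [
    SortHit("seq1", "nextstrain/cchfv/linked/L", 0.9),
    SortHit("seq1", "nextstrain/cchfv/linked/M", 0.2),
    SortHit("seq2", "some/other/dataset", 0.8),
]
print(assign_segments(hits, dataset_to_segment))
```

Sequences without any accepted match (here `seq2`) are simply left out of the result, so the caller can decide how to report them.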

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype - entries must have the format <submissionId>_<segmentName> (as in the current setup).
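
A minimal sketch of parsing that header format, in Python for illustration. The helper name `split_header` is hypothetical, and matching against the list of known segment names (instead of splitting on the first underscore) is an assumption chosen so that submission IDs containing `_` stay intact.

```python
def split_header(fasta_id: str, segment_names: list[str]) -> tuple[str, str]:
    """Split '<submissionId>_<segmentName>' into (submissionId, segmentName)."""
    # Try longer segment names first so that e.g. 'seg_A' wins over 'A'.
    for segment in sorted(segment_names, key=len, reverse=True):
        suffix = f"_{segment}"
        if fasta_id.endswith(suffix) and len(fasta_id) > len(suffix):
            return fasta_id[: -len(suffix)], segment
    raise ValueError(f"{fasta_id!r} does not end in a known segment name")


print(split_header("my_sample_L", ["L", "M", "S"]))
```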

As preprocessing now assigns segments, it returns a map from the segment (or subtype) to the fastaId in the processedData. The map is called sequenceNameToFastaId and allows us to surface the segment assignment on the edit page.
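
To make the shape of this map concrete, here is a small Python sketch that inverts a per-sequence assignment into a segment-to-fastaId map. Only the name `sequenceNameToFastaId` comes from this PR; the helper name, the input shape and the choice to raise when two sequences claim the same segment are assumptions for illustration.

```python
def build_sequence_name_to_fasta_id(segment_by_fasta_id: dict[str, str]) -> dict[str, str]:
    """Invert fastaId -> segment into segment -> fastaId."""
    result: dict[str, str] = {}
    for fasta_id, segment in segment_by_fasta_id.items():
        if segment in result:
            # Assumption: one entry may hold at most one sequence per segment.
            raise ValueError(
                f"segment {segment!r} assigned to both {result[segment]!r} and {fasta_id!r}"
            )
        result[segment] = fasta_id
    return result


processed_data_fragment = {
    "sequenceNameToFastaId": build_sequence_name_to_fasta_id(
        {"sample1_a": "L", "sample1_b": "S"}
    )
}
print(processed_data_fragment)
```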

Nextclade Preprocessing pipeline config changes

Instead of having a dictionary for the nextclade datasets and servers we make nucleotideSequences a dictionary where each item includes all information required to run nextclade. I.e. we change from:

nextclade_dataset_name:
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]

to:

nextclade_sequence_and_datasets: 
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> 
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name and name are used> 
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
    genes: [RdRp]
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
    genes: [GPC]
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
    genes: [NP]
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
segment_classification_method: <optional, default for multi segmented viruses is align - if you assign segments in ingest for grouping use the same option here as you use there e.g. "minimizer" or "align">
minimizer_url: <optional, url_to_minimizer_index_used_by_nextclade_sort>
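
For anyone migrating an existing config by hand, the change of shape can be sketched in Python as below. This is not a script shipped with this PR: `migrate_config` is a hypothetical helper, it works on already-parsed dictionaries, and because the old config lists genes only at the organism level, the per-segment gene assignment has to be supplied separately (the `genes_by_segment` argument is an assumption).

```python
def migrate_config(old: dict, genes_by_segment: dict[str, list[str]]) -> dict:
    """Convert the old per-organism dictionaries into the new list of sequences."""
    new: dict = {
        "nextclade_sequence_and_datasets": [
            {
                "name": segment,
                "nextclade_dataset_name": dataset,
                "genes": genes_by_segment.get(segment, []),
            }
            for segment, dataset in old["nextclade_dataset_name"].items()
        ]
    }
    # The organism-level server stays where it was.
    if "nextclade_dataset_server" in old:
        new["nextclade_dataset_server"] = old["nextclade_dataset_server"]
    return new


old_config = {
    "nextclade_dataset_name": {
        "L": "nextstrain/cchfv/linked/L",
        "M": "nextstrain/cchfv/linked/M",
        "S": "nextstrain/cchfv/linked/S",
    },
    "nextclade_dataset_server": "https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output",
    "genes": ["RdRp", "GPC", "NP"],
}
new_config = migrate_config(old_config, {"L": ["RdRp"], "M": ["GPC"], "S": ["NP"]})
print(new_config["nextclade_sequence_and_datasets"][0])
```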

Ingest Pipeline Config changes

minimizer_index is renamed to minimizer_url for consistency. The option can be used in both ingest and preprocessing, and both should be set to the same value.

Optional additional Config changes

Limit the number of sequences the backend will accept per submission with maxSequencesPerEntry (should be added for multi-segmented organisms):

submissionDataTypes: &defaultSubmissionDataTypes
  consensusSequences: true
  maxSequencesPerEntry: 1
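
The kind of check this option enables can be sketched as follows. This Python snippet only illustrates the idea and is not the backend implementation; the function name, the input shape and the error wording are assumptions, and the limit is applied per metadata entry, following the option's name.

```python
from typing import Optional


def check_sequence_limit(
    fasta_ids_by_entry: dict[str, list[str]],
    max_sequences_per_entry: Optional[int],
) -> list[str]:
    """Return one error message per entry that exceeds the limit."""
    if max_sequences_per_entry is None:
        return []  # no limit configured
    return [
        f"Entry {submission_id} has {len(fasta_ids)} sequences, "
        f"but at most {max_sequences_per_entry} are allowed"
        for submission_id, fasta_ids in fasta_ids_by_entry.items()
        if len(fasta_ids) > max_sequences_per_entry
    ]


entries = {"sample1": ["sample1"], "sample2": ["sample2_a", "sample2_b"]}
print(check_sequence_limit(entries, 1))
```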

Testing

You can use pathoplexus/example_data#16 and pathoplexus/dev_example_data#2 for testing.

PR Checklist

🚀 Preview: https://edit-page-anya.loculus.org

@anna-parker anna-parker added the preview Triggers a deployment to argocd label Nov 6, 2025
@anna-parker anna-parker changed the title from "feat: update edit page" to "feat(website):multi pathogen - update edit page" Nov 6, 2025
@anna-parker anna-parker marked this pull request as ready for review November 6, 2025 18:01

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

@anna-parker anna-parker changed the base branch from main to preprocessingSegmentAssignment November 6, 2025 18:35
Contributor

@fengelniederhammer fengelniederhammer left a comment

On http://localhost:3000/cchf/submission/2/submit?inputMode=form, when I upload the same file twice, then the page crashes:
"Uncaught Error: A sequence with the fastaHeader LOC already exists."

Nice improvement though! (Although I may have preferred a separate PR on top of mine to avoid reviewing my own code in a way here)

@anna-parker anna-parker force-pushed the edit-page-anya branch 2 times, most recently from 873fd80 to 0a27555 Compare November 19, 2025 20:23
@anna-parker
Contributor Author

@fengelniederhammer I fixed the error now - I will try to add a test for that case tomorrow

@corneliusroemer corneliusroemer changed the title from "feat(website):multi pathogen - update edit page" to "feat(website):multi pathogen - update edit page (1/n)" Nov 20, 2025
anna-parker added a commit that referenced this pull request Nov 20, 2025
resolves #4708,
#4734

partially resolves
#5392,
#5185 (comment)

Builds on #5382

### BREAKING CHANGES

When users submit to multi-segmented organisms and want to group
multiple segments under one metadata entry they are now required to add
an additional `fastaId` column with a space- or comma-separated list
of the `fastaIds` (fasta header IDs) of the respective sequences. If no
`fastaId` column is supplied the `submissionId` will be used instead and
the backend will assume that (as in the single-segmented case) there is
a one-to-one mapping of metadata `submissionId` to `fastaId`.

This new submission structure was voted for in microbioinfo:
https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399
and discussed in
https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6
(and in other meetings)

Nextclade sort will be used to assign segments/subtypes for all aligned
sequences:
```
minimizer_index: <url_to_minimizer_index_used_by_nextclade_sort>
```
For organisms without a nextclade dataset we still allow the fasta
headers to be used to determine the segment/subtype - entries must have
the format <submissionId>_<segmentName> (as in the current setup).

As preprocessing now assigns segments it will return a map from the
segment (or subtype) to the fastaHeader in the processedData:
`sequenceNameToFastaHeaderMap`. This allows us to surface this
assignment on the edit page.

### Testing

You can use pathoplexus/dev_example_data#2 for
testing.

### Prepro config changes

Instead of having a dictionary for the nextclade datasets and servers, we
make `nucleotideSequences` a list of sequences. The first block below is
the old config, the second is the new one:
```
nextclade_dataset_name: 
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]
```

```
nucleotideSequences:
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> 
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name is used> 
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
```

Note the templates now also generate the genes list from the merged
config.
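
The `gene_prefix` option from the config above could behave roughly like this Python sketch. The helper name `prefix_genes` is hypothetical, and joining prefix and gene name with an underscore is an assumption read off the `AV1` to `EV1_AV1` example rather than confirmed behaviour.

```python
from typing import Optional


def prefix_genes(genes: list[str], gene_prefix: Optional[str]) -> list[str]:
    """Prepend the configured prefix to every gene name produced by nextclade."""
    if not gene_prefix:
        return list(genes)  # no prefix configured: names stay unchanged
    # Assumption: prefix and gene name are joined with '_'.
    return [f"{gene_prefix}_{gene}" for gene in genes]


print(prefix_genes(["AV1"], "EV1"))
```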

### PR Checklist
- [ ] Update values.schema.json
- [x] keep tests for alignment NONE case
- [x] Create a minimizer for tests using:
https://github.com/loculus-project/nextclade-sort-minimizer-creator
- [x] Any manual testing that has been done is documented: EVs from the
test folder were submitted with the same fastaHeader as the submissionId
(this succeeded), CCHF was submitted with a fastaID column in the
metadata (also in the folder above), and revision of a segment was
tested
- [x] Have preprocessing send back a segment: fastaHeader mapping

## Future Work
- [ ] add integration testing for full EV submission user journey
- [ ] improve CCHF minimizer (some segments are again not assigned)
- [ ] discuss if the originalData dictionary should be migrated
(persistent DB has segmentName as key, now we have fastaHeader as key)
- [ ] update PPX docs with new multi-segment submission format

🚀 Preview: https://multi-segment-submission.loculus.org

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
@anna-parker anna-parker changed the title from "feat(website):multi pathogen - update edit page (1/n)" to "feat(website):multi pathogen - edit page, backend refactor and prepro segment assignment" Nov 20, 2025
@fengelniederhammer
Contributor

Sorry, but did you just merge #5398 into this PR? My understanding was that we merge those PRs into the feature branch. Now we have one very large PR (I thought we actually wanted to avoid that).

anna-parker and others added 13 commits December 4, 2025 18:41
…s with correct fields (#5561)

resolves #5572

Follow up PR with `fastaId` to `fastaIds` change:
#5583

### Screenshot
fastaIds field added to template for CCHF:
<img width="1672" height="1138" alt="image"
src="https://github.com/user-attachments/assets/0dcc1be8-2f01-4205-a819-84ea8055fc5f"
/>
<img width="1836" height="520" alt="image"
src="https://github.com/user-attachments/assets/b41678ca-17c5-41ef-a409-86288f18d124"
/>
Not added for ebola or EVs:
<img width="1728" height="644" alt="image"
src="https://github.com/user-attachments/assets/ebcfa88a-e64f-47b1-9fea-e3dfd6bd21a5"
/>
<img width="1836" height="520" alt="image"
src="https://github.com/user-attachments/assets/a77e47ee-8f06-4a29-9eeb-68863ed3dbd0"
/>


### PR Checklist
- [ ] Make PR with same changes in PPX -> after docs are approved
- ~[x] Add fastaIds to commonMetadata fields in config?~ -> fastaIds
field does not exist for single segmented organisms and thus should not
be added here as this breaks Loculus
- [x] Ensure metadata template downloads are correct

🚀 Preview: https://multipath-docs.loculus.org

---------

Co-authored-by: Theo Sanderson <theo@sndrsn.co.uk>
partially resolves
#5576

### Screenshot

Segment assignment by nextclade sort and segment assignment via
nextclade align now use the same code to assign a segment to each entry
after running nextclade on the batch.

I also realized we don't have tests for the
`require_nextclade_sort_match` config option, so I added them here.

### PR Checklist
- [ ] All necessary documentation has been adapted.
- [ ] The implemented feature is covered by appropriate, automated
tests.
- [ ] Any manual testing that has been done is documented (i.e. what
exactly was tested?)

🚀 Preview: https://multipath-updates.loculus.org
Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
…5559)

resolves #5558

🚀 Preview: https://backend-reject-gt1-seq.loculus.org

---------

Co-authored-by: anna-parker <50943381+anna-parker@users.noreply.github.com>
Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
Co-authored-by: Theo Sanderson <theo@sndrsn.co.uk>
resolves #5571

Adds the new config structure to the
kubernetes/loculus/values.schema.json. I also renamed
`nucleotideSequences` to `nextcladeSequenceAndDatasets` to avoid
confusion as preprocessing expects different items in this list of
dictionaries.

### Screenshot
<img width="1096" height="1398" alt="image"
src="https://github.com/user-attachments/assets/02041fb7-2e92-4d14-9bc9-3d1feab0c112"
/>

### PR Checklist
- [ ] All necessary documentation has been adapted.
- [ ] The implemented feature is covered by appropriate, automated
tests.
- [ ] Any manual testing that has been done is documented (i.e. what
exactly was tested?)

🚀 Preview: Add `preview` label to enable
resolves #

### Screenshot
When testing on PPX I see this error, caused by
#5552, that I sadly missed
(a string is converted to a list). This should resolve the bug:

```
INFO:loculus_preprocessing.nextclade:Downloading Nextclade dataset: ['nextclade3', 'dataset', 'get', '--name=nextstrain/rsv/a/EPI_ISL_412866', '--server=https://data.clades.nextstrain.org/v3', '--output-dir=/tmp/tmp6frl2wtu/main', '-', '-', 't', 'a', 'g', '=', '2', '0', '2', '5', '-', '0', '8', '-', '2', '5', '-', '-', '0', '9', '-', '0', '0', '-', '3', '5', 'Z']
error: unexpected argument '-' found
```
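
One common way to end up with a log like the one above in Python is to `extend` an argument list with a string, which adds one element per character. The sketch below shows that pitfall and the fix; whether this is exactly what happened in #5552 is an assumption, and `build_command` is a hypothetical helper that only reuses the flags visible in the log.

```python
from typing import Optional


def build_command(
    name: str, server: str, output_dir: str, tag: Optional[str] = None
) -> list[str]:
    """Build the argument list for downloading a dataset (command is not executed)."""
    cmd = [
        "nextclade3",
        "dataset",
        "get",
        f"--name={name}",
        f"--server={server}",
        f"--output-dir={output_dir}",
    ]
    if tag is not None:
        # append adds the whole string as ONE argument.
        cmd.append(f"--tag={tag}")
    return cmd


# The pitfall: extend iterates over the string character by character.
buggy = ["nextclade3"]
buggy.extend("--tag=x")
print(buggy)

print(build_command("nextstrain/rsv/a/EPI_ISL_412866", "https://data.clades.nextstrain.org/v3", "/tmp/main", tag="2025-08-25--09-00-35Z"))
```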

### PR Checklist
- [ ] All necessary documentation has been adapted.
- [ ] The implemented feature is covered by appropriate, automated
tests.
- [ ] Any manual testing that has been done is documented (i.e. what
exactly was tested?)

🚀 Preview: Add `preview` label to enable
resolves these comments:

#5382 (comment)

#5382 (comment)

#5382 (comment)

🚀 Preview: Add `preview` label to enable
…ta field `fastaIds` (#5629)

resolves #5627

The `fastaIds` field should contain a space-separated list of fasta
IDs. The website was constructing `fastaIds` for submissions via the
edit page by concatenating full fasta headers of the form `>fastaId
description` (and removing `>`), but this led to the backend treating
the description as another fastaId.

Also adds a description to multiple fasta files in the integration tests
to ensure that such an issue cannot be missed in future.
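
The intended behaviour can be sketched as follows, in Python for illustration only (the website code itself is not shown here, and both helper names are hypothetical): only the first whitespace-delimited token of each header is kept as the fasta ID before the IDs are joined.

```python
def fasta_id_from_header(header: str) -> str:
    """Return the ID part of a header line such as '>fastaId description'."""
    without_marker = header[1:] if header.startswith(">") else header
    tokens = without_marker.split(maxsplit=1)
    if not tokens:
        raise ValueError("empty fasta header")
    return tokens[0]  # everything after the first whitespace is the description


def join_fasta_ids(headers: list[str]) -> str:
    """Build the space-separated fastaIds value from full header lines."""
    return " ".join(fasta_id_from_header(header) for header in headers)


print(join_fasta_ids([">seq1 some description", ">seq2"]))
```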

### Screenshot

### PR Checklist
- [ ] All necessary documentation has been adapted.
- [ ] The implemented feature is covered by appropriate, automated
tests.
- [ ] Any manual testing that has been done is documented (i.e. what
exactly was tested?)

🚀 Preview: https://fastaheader.loculus.org

---------

Co-authored-by: Cornelius Roemer <cornelius.roemer@gmail.com>
Co-authored-by: Fabian Engelniederhammer <92720311+fengelniederhammer@users.noreply.github.com>
@anna-parker
Contributor Author

@codex review

@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

@anna-parker anna-parker changed the title from "feat!(website):multi pathogen - edit page, backend refactor and prepro segment assignment" to "feat!(website, prepro, backend, config, integration):multi pathogen - edit page, backend refactor and prepro segment assignment" Dec 5, 2025
@anna-parker anna-parker changed the title from "feat!(website, prepro, backend, config, integration):multi pathogen - edit page, backend refactor and prepro segment assignment" to "feat!(website, prepro, backend, config, integration):multi pathogen - refactor multi segment submission in backend and edit page and have prepro assign segments" Dec 5, 2025
@anna-parker anna-parker merged commit d3c43c0 into main Dec 5, 2025
55 checks passed
@anna-parker anna-parker deleted the edit-page-anya branch December 5, 2025 13:44

Labels

preview Triggers a deployment to argocd update_db_schema

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Refactor multi-segment submission Multi pathogen support: edit page

5 participants