feat!(website, prepro, backend, config, integration):multi pathogen - refactor multi segment submission in backend and edit page and have prepro assign segments #5382

anna-parker · 2025-11-06T10:25:40Z

partially resolves #5392, #5185 (comment)

includes work done in #5398 and #5402

This PR additionally fixes submission, subtype assignment and search for EVs and other multi-pathogen organisms.

BREAKING CHANGES

When users submit to multi-segmented organisms and want to group multiple segments under one metadata entry, they are now required to add a fastaIds column containing a space-separated list of the fastaIds (fasta header IDs) of the respective sequences. If no fastaIds column is supplied, the submissionId is used instead and the backend assumes (as in the single-segmented case) a one-to-one mapping from metadata submissionId to fastaId.
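A minimal sketch of the fallback rule described above, for illustration only. It is not the backend's actual implementation: the helper name `fasta_ids_for_entry` and the dict-shaped metadata row are assumptions, and treating an empty fastaIds value like a missing column is also an assumption of this sketch.

```python
def fasta_ids_for_entry(row: dict) -> list:
    """Return the fasta header IDs that belong to one metadata row.

    Hypothetical helper: `row` is one parsed row of the metadata file,
    keyed by column name.
    """
    raw = (row.get("fastaIds") or "").strip()
    if raw:
        # fastaIds is a space-separated list of fasta header IDs.
        return raw.split()
    # No fastaIds given: assume a one-to-one mapping submissionId -> fastaId.
    return [row["submissionId"]]
```

With this rule, a row with `fastaIds` set to `s1_L s1_M s1_S` groups three sequences under one entry, while a row without the column maps to the single sequence whose header equals its submissionId.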

This new submission structure was voted for in microbioinfo: https://microbial-bioinfo.slack.com/archives/CB0HYT53M/p1760961465729399 and discussed in https://app.nuclino.com/Loculus/Development/2025-10-20-Weekly-6d5fe89f-8ded-4286-b892-d215e0a498f6 (and in other meetings)

Nextclade sort (uses a minimizer index for fast local alignment) or nextclade align (full sequence alignment to reference) will be used to assign segments/subtypes for all multi-segmented and multi-pathogen sequences (this is also done in ingest for grouping segments):

segment_classification_method: "minimizer" or "align"
minimizer_url: <url_to_minimizer_index_used_by_nextclade_sort>

For organisms without a nextclade dataset we still allow the fasta headers to be used to determine the segment/subtype: entries must have the format <submissionId>_<segmentName> (as in the current setup).
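The header convention above can be sketched as follows. This is an illustrative parser, not the pipeline's code; the function name `split_header` and the choice to raise `ValueError` on unmatched headers are assumptions.

```python
def split_header(fasta_id: str, segment_names: list) -> tuple:
    """Split '<submissionId>_<segmentName>' into (submissionId, segmentName).

    Hypothetical helper: `segment_names` are the segment/subtype names
    configured for the organism.
    """
    # Try longer names first so that a name like 'seg_A' wins over 'A'.
    for segment in sorted(segment_names, key=len, reverse=True):
        suffix = f"_{segment}"
        if fasta_id.endswith(suffix) and len(fasta_id) > len(suffix):
            return fasta_id[: -len(suffix)], segment
    raise ValueError(f"header {fasta_id!r} does not end in a known segment name")
```

Matching on the suffix rather than splitting on the first underscore keeps submissionIds that themselves contain underscores intact.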

As preprocessing now assigns segments, it returns a map from the segment (or subtype) to the fastaId in the processedData. The map is called sequenceNameToFastaId, and it allows us to surface the segment assignment on the edit page.
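As a rough illustration of how a consumer might use this map, the sketch below inverts it so that each submitted fastaId can be labelled with its assigned segment. Only the key name sequenceNameToFastaId comes from this PR; the function name and the example values are assumptions.

```python
def segment_by_fasta_id(processed_data: dict) -> dict:
    """Invert sequenceNameToFastaId into a fastaId -> segment lookup.

    Hypothetical helper; `processed_data` stands in for the processedData
    object returned by preprocessing.
    """
    mapping = processed_data.get("sequenceNameToFastaId", {})
    return {fasta_id: name for name, fasta_id in mapping.items()}


# Illustrative values only.
example = {"sequenceNameToFastaId": {"L": "sample1_a", "M": "sample1_b"}}
lookup = segment_by_fasta_id(example)
```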

Nextclade Preprocessing pipeline config changes

Instead of having a dictionary for the nextclade datasets and servers we make nucleotideSequences a dictionary where each item includes all information required to run nextclade. I.e. we change from:

nextclade_dataset_name:
    L: nextstrain/cchfv/linked/L
    M: nextstrain/cchfv/linked/M
    S: nextstrain/cchfv/linked/S
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
genes: [RdRp, GPC, NP]

to:

nextclade_sequence_and_datasets: 
  - name: L
    nextclade_dataset_name: nextstrain/cchfv/linked/L
    nextclade_dataset_tag: <optional - was previously incorrectly placed on an organism level> 
    nextclade_dataset_server: <optional overwrites nextclade_dataset_server for this seq>
    accepted_sort_matches: <optional, used for classify_with_nextclade_sort and require_nextclade_sort_match, if not given nextclade_dataset_name and name are used> 
    gene_prefix: <optional, prefix to add to genes produced by nextclade run, e.g. nextclade labels genes as `AV1` but we expect `EV1_AV1`, here `EV1` would be the prefix >
    genes: [RdRp]
  - name: M
    nextclade_dataset_name: nextstrain/cchfv/linked/M
    genes: [GPC]
  - name: S
    nextclade_dataset_name: nextstrain/cchfv/linked/S
    genes: [NP]
nextclade_dataset_server: https://raw.githubusercontent.com/nextstrain/nextclade_data/cornelius-cchfv/data_output
segment_classification_method: <optional, default for multi segmented viruses is align - if you assign segments in ingest for grouping use the same option here as you use there e.g. "minimizer" or "align">
minimizer_url: <optional, url_to_minimizer_index_used_by_nextclade_sort>
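The restructuring above could be scripted roughly as follows. This is a hedged sketch, not a migration tool shipped with this PR: the function name is hypothetical, and because the old flat genes list does not say which gene belongs to which segment, the caller has to supply that mapping (`genes_by_segment`).

```python
def migrate_nextclade_config(old: dict, genes_by_segment: dict) -> dict:
    """Convert the old per-organism layout into nextclade_sequence_and_datasets.

    Hypothetical helper: `old` is the parsed old config and
    `genes_by_segment` maps each segment name to its list of genes.
    """
    dataset_names = old["nextclade_dataset_name"]
    # Keep organism-level keys such as nextclade_dataset_server unchanged.
    new = {
        key: value
        for key, value in old.items()
        if key not in ("nextclade_dataset_name", "genes")
    }
    new["nextclade_sequence_and_datasets"] = [
        {
            "name": segment,
            "nextclade_dataset_name": dataset,
            "genes": genes_by_segment.get(segment, []),
        }
        for segment, dataset in dataset_names.items()
    ]
    return new
```

Optional per-sequence keys (nextclade_dataset_tag, accepted_sort_matches, gene_prefix) are not derivable from the old layout and would need to be added by hand.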

Ingest Pipeline Config changes

minimizer_index is renamed to minimizer_url for consistency. The option can be used in both ingest and preprocessing, and the value should be the same in both.

Optional additional Config changes

Limit the number of sequences the backend will accept per entry by setting maxSequencesPerEntry (this should be added for multi-segmented organisms):

submissionDataTypes: &defaultSubmissionDataTypes
  consensusSequences: true
  maxSequencesPerEntry: 1
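The effect of this limit can be sketched as a simple check. This is not the backend's implementation: the function name is hypothetical, and rejecting the entry with a `ValueError` (and treating an unset limit as "no limit") are assumptions of the sketch.

```python
from typing import Optional


def check_sequence_count(fasta_ids: list, max_sequences_per_entry: Optional[int]) -> None:
    """Raise if one metadata entry references more sequences than allowed.

    Hypothetical helper: `fasta_ids` are the fasta header IDs grouped under
    one entry; `None` is assumed to mean that no limit is configured.
    """
    if max_sequences_per_entry is None:
        return
    if len(fasta_ids) > max_sequences_per_entry:
        raise ValueError(
            f"entry has {len(fasta_ids)} sequences, "
            f"but at most {max_sequences_per_entry} are allowed"
        )
```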

Testing

You can use pathoplexus/example_data#16 and pathoplexus/dev_example_data#2 for testing.

PR Checklist

🚀 Preview: https://edit-page-anya.loculus.org

integration-tests/tests/specs/features/revise-sequence.spec.ts

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.
