[BUG] Silent data corruption in load_centiles due to unassigned pandas sort #396

@divye-joshi

Reproduction notebook for this issue (the last three cells contain markdown and comments that walk through replicating the bug):

https://colab.research.google.com/drive/1BHdlJdPYTy0dRymzEfNFWS85ZImZliFo?usp=sharing

Addressed with a simple fix in PR #397.

Description

There is a critical data handling bug in the NormData.load_centiles method that causes silent data corruption. When loading centile predictions from a saved CSV file, the method attempts to sort the data by subject ID (observations). However, the pandas sort_values function is called without inplace=True or variable reassignment.

Because pandas returns a sorted copy and leaves the original dataframe unchanged, the centile values are extracted in the exact top-to-bottom order of the CSV file. Later, xarray maps these unsorted values onto a sorted array of subject IDs, with no check that the two orders agree.
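The pandas behaviour described above can be checked in isolation. This is a minimal sketch, not code from pcntoolkit: the column name "rv1" and the values are made up for illustration, and only "observations" comes from the report.

```python
import pandas as pd

# Hypothetical toy frame; rows are deliberately out of order by subject ID.
df = pd.DataFrame({"observations": [103, 101, 102], "rv1": [0.3, 0.1, 0.2]})

# Calling sort_values without using the result leaves df untouched.
df.sort_values(by="observations")
ids_after_discarded_sort = df["observations"].tolist()

# Assigning the returned frame is what actually produces sorted data.
sorted_df = df.sort_values(by="observations")
ids_after_assigned_sort = sorted_df["observations"].tolist()

print(ids_after_discarded_sort)  # still in CSV order
print(ids_after_assigned_sort)   # ascending by subject ID
```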

Root Cause

A sort by subject ID was clearly intended, but it is implemented incorrectly: the result of sort_values is never assigned. In pcntoolkit/dataio/norm_data.py (inside the load_centiles function):

for i, c in enumerate(centiles):
    sub = df[df["centile"] == c]
    
    # BUG: sort_values returns a sorted copy, which is discarded here. `sub` remains unsorted.
    sub.sort_values(by="observations")
    
    for j, rv in enumerate(response_vars):
        # Extracts data in unsorted CSV order, stripping IDs
        A[i, :, j] = sub[rv]
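One way to correct the loop is to assign the sorted frame back to sub before extracting values. The sketch below is an assumption about what such a fix could look like, not the contents of PR #397: the helper name centiles_to_array, the toy data, and the way A is allocated are all hypothetical, and it assumes every centile has one row per subject.

```python
import numpy as np
import pandas as pd


def centiles_to_array(df, centiles, response_vars):
    """Hypothetical stand-in for the loop in load_centiles, with the sort assigned."""
    n_obs = df["observations"].nunique()
    A = np.zeros((len(centiles), n_obs, len(response_vars)))
    for i, c in enumerate(centiles):
        # Reassign so that `sub` really is ordered by subject ID.
        sub = df[df["centile"] == c].sort_values(by="observations")
        for j, rv in enumerate(response_vars):
            A[i, :, j] = sub[rv].to_numpy()
    return A


# Toy CSV-like data where subject 103 appears before 101 and 102.
df = pd.DataFrame({
    "observations": [103, 101, 102, 103, 101, 102],
    "centile": [0.05, 0.05, 0.05, 0.95, 0.95, 0.95],
    "rv1": [3.0, 1.0, 2.0, 30.0, 10.0, 20.0],
})
A = centiles_to_array(df, centiles=[0.05, 0.95], response_vars=["rv1"])
```

With the reassignment in place, position k along the second axis of A corresponds to the k-th smallest subject ID, which matches a sorted coordinate array.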

Consequences

If a user loads a CSV that is not already perfectly sorted by subject ID (a highly common scenario when concatenating cluster job outputs or merging data from multiple imaging sites), the predictions are silently scrambled.

For example, if the CSV lists Subject 103 before Subject 101, Subject 101 will be assigned Subject 103's centile scores. Because this does not throw a KeyError or ValueError, researchers will unknowingly proceed with corrupted clinical/normative scores for their downstream analyses.
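The Subject 103 / Subject 101 example can be reproduced with plain pandas and NumPy. This sketch only imitates the pairing of unsorted values with sorted IDs; it does not call pcntoolkit or xarray, and the scores, the "rv1" column, and the check_sorted guard are hypothetical additions for illustration.

```python
import numpy as np
import pandas as pd

# CSV order: Subject 103 first, then Subject 101 (scores are made up).
sub = pd.DataFrame({"observations": [103, 101], "rv1": [0.93, 0.11]})

# What the buggy path effectively does: sorted IDs paired with CSV-order values.
sorted_ids = np.sort(sub["observations"].to_numpy())
csv_order_values = sub["rv1"].to_numpy()
buggy = dict(zip(sorted_ids.tolist(), csv_order_values.tolist()))

# Ground truth: each ID paired with the value from its own row.
truth = dict(zip(sub["observations"].tolist(), sub["rv1"].tolist()))


def check_sorted(frame):
    """Hypothetical guard: fail loudly instead of silently misaligning."""
    if not frame["observations"].is_monotonic_increasing:
        raise ValueError("rows are not sorted by 'observations'")


print(buggy[101], truth[101])  # Subject 101 receives Subject 103's score
```

A guard of this kind, placed just before the values are extracted, would turn the silent misalignment into an explicit error.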
