Fix biobank clinical merge when ALIQUOT_STATUS already exists #1345

Merged
callachennault merged 2 commits into knowledgesystems:master from mandawilson:fix-biobank-clinical-aliquot-merge on Jun 23, 2026

@mandawilson mandawilson commented Jun 19, 2026

*** This has already been deployed. ***

Old scripts were kept and should be cleaned up:

cd /data/portal-cron/scripts

$ ls -lthr *BK_DELETE_ME
-rwxrw-r-- 1 cbioportal_importer cbioportal_importer 4.6K Dec 11  2025 combine_files_py3.py.BK_DELETE_ME
-rwxrwxr-x 1 cbioportal_importer cbioportal_importer  80K Jun 19 17:25 fetch-dmp-data-for-import.sh.BK_DELETE_ME

When combine_files_py3.py is given two files that share a non-key column, it creates two columns, one suffixed _x and one suffixed _y. I assume we never want that, and that until now no two files being merged shared any columns other than key columns like PATIENT_ID. We should review this further, but for now the biobank merge passes a new -p option, which takes the overlapping column from the right-hand file and replaces the one from the left-hand file.

Add opt-in -p to combine_files_py3 so overlapping columns use biobank values instead of pandas _x/_y suffixes.
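For readers unfamiliar with the suffix behavior, here is a minimal sketch of how pandas handles an overlapping non-key column, and one possible way to prefer the right-hand values. The `merge_prefer_right` helper, the sample rows, and the status values are illustrative assumptions, not the code in combine_files_py3.py. In particular, this PR does not state what -p does for patients missing from the right-hand file; the sketch simply leaves those values empty.

```python
import pandas as pd

# Toy stand-ins for data_clinical_patient.txt and the biobank file (made-up rows).
clinical = pd.DataFrame({
    "PATIENT_ID": ["P-1", "P-2"],
    "OS_STATUS": ["0:LIVING", "1:DECEASED"],
    "ALIQUOT_STATUS": ["stale", "stale"],
})
biobank = pd.DataFrame({
    "PATIENT_ID": ["P-1"],
    "ALIQUOT_STATUS": ["fresh"],
})

# Default pandas behavior: overlapping non-key columns get _x/_y suffixes.
legacy = clinical.merge(biobank, on="PATIENT_ID", how="left")
legacy_columns = list(legacy.columns)


def merge_prefer_right(left, right, keys, how="left"):
    """Hypothetical helper: drop overlapping non-key columns from the left
    frame before merging, so only the right-hand values survive."""
    overlap = [c for c in left.columns if c in right.columns and c not in keys]
    return left.drop(columns=overlap).merge(right, on=keys, how=how)


preferred = merge_prefer_right(clinical, biobank, ["PATIENT_ID"])
print(legacy_columns)
print(list(preferred.columns))
```

With this approach the merged frame has a single ALIQUOT_STATUS column. Whether unmatched patients should keep their left-hand value instead is a design choice the audit mentioned in the TODO would need to settle.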

I tested this on pipelines3.

Old behavior of combine_files_py3.py (the problem):

cd /data/portal-cron/cbio-portal-data/pipelines-testing/studies/msk_impact_biobank

$ python3 $PORTAL_HOME/scripts/combine_files_py3.py \
>   -i data_clinical_patient.txt data_clinical_patient_biobank.txt \
>   -o /tmp/data_clinical_patient_merged_test.txt \
>   -c PATIENT_ID -m left

[cbioportal_importer@pipelines3 msk_impact_biobank]$ head -1 /tmp/data_clinical_patient_merged_test.txt
PATIENT_ID	STAGE_HIGHEST_RECORDED	NUM_ICDO_DX	OS_MONTHS	OS_STATUS	YOST_INDEX_IMPUTED_MEDIAN	GENDER	RACE	ETHNICITY	CURRENT_AGE_DEID	ADRENAL_GLANDS	BONE	CNS_BRAIN	INTRA_ABDOMINAL	LIVER	LUNG	LYMPH_NODES	OTHER	PLEURA	REPRODUCTIVE_ORGANS	GLEASON_FIRST_REPORTED	GLEASON_HIGHEST_REPORTED	HISTORY_OF_PDL1	PRIOR_MED_TO_MSK	SMOKING_PREDICTIONS_3_CLASSES	OTHER_PATIENT_ID	PARTA_CONSENTED_12_245	PARTC_CONSENTED_12_245	ALIQUOT_STATUS_x	ALIQUOT_STATUS_y

Fixed combine_files_py3.py:

$ python3 $PORTAL_HOME/scripts/combine_files_py3.py \
> -i data_clinical_patient.txt data_clinical_patient_biobank.txt \
> -o /tmp/data_clinical_patient_merged_test.txt \
> -c PATIENT_ID -m left -p
[cbioportal_importer@pipelines3 msk_impact_biobank]$ head -1 /tmp/data_clinical_patient_merged_test.txt
PATIENT_ID	STAGE_HIGHEST_RECORDED	NUM_ICDO_DX	OS_MONTHS	OS_STATUS	YOST_INDEX_IMPUTED_MEDIAN	GENDER	RACE	ETHNICITY	CURRENT_AGE_DEID	ADRENAL_GLANDS	BONE	CNS_BRAIN	INTRA_ABDOMINAL	LIVER	LUNG	LYMPH_NODES	OTHER	PLEURA	REPRODUCTIVE_ORGANS	GLEASON_FIRST_REPORTED	GLEASON_HIGHEST_REPORTED	HISTORY_OF_PDL1	PRIOR_MED_TO_MSK	SMOKING_PREDICTIONS_3_CLASSES	OTHER_PATIENT_ID	PARTA_CONSENTED_12_245	PARTC_CONSENTED_12_245	ALIQUOT_STATUS

…d-trip imports.

Add opt-in -p to combine_files_py3 so overlapping columns use biobank values instead of pandas _x/_y suffixes.
Lock in legacy _x/_y output without -p and confirm -p does not change merges when columns do not overlap.
# TODO: prefer-right-columns may be right for all combine_files_py3
# callers, but other merges have never had overlapping column names
# before biobank; audit before making -p the default.
$PYTHON3_BINARY $PORTAL_HOME/scripts/combine_files_py3.py -i "$input_clinical_file" "$biobank_clinical_file" -o "$merged_clinical_file" -c "PATIENT_ID" -m left -p

I think this is a good fix to have in the script but another way to think about it is that ALIQUOT_STATUS should be listed in the -c argument. I think I hadn't listed it before because the data_clinical_patient initially didn't have this column. It would look like the below. The -p argument could likely be left out in this case.

 $PYTHON3_BINARY $PORTAL_HOME/scripts/combine_files_py3.py -i "$input_clinical_file" "$biobank_clinical_file" -o "$merged_clinical_file" -c "PATIENT_ID ALIQUOT_STATUS" -m left -p

The script is also supposed to default to merging the columns that are present in both files. The issue shows up when you provide the -c argument but don't list all common columns there.

@mandawilson mandawilson Jun 23, 2026

@callachennault I think -c is for join keys not columns in common. If we add ALIQUOT_STATUS to -c:

-c "PATIENT_ID ALIQUOT_STATUS" -m left

pandas joins on both columns. Rows match only when patient and status value are identical. After a Databricks round-trip, clinical often has stale ALIQUOT_STATUS while biobank has fresh status for the same PATIENT_ID. Those rows won’t match, so a left merge won’t update status from biobank — we lose the refresh you want.

See the on parameter in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html

If you agree, can you merge this?
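The join-key point can be reproduced with a small sketch. The rows, the status values, and the BIOBANK_FLAG column are made up for illustration; only PATIENT_ID and ALIQUOT_STATUS come from this PR.

```python
import pandas as pd

# Clinical data after a round-trip: P-1 carries a stale status.
clinical = pd.DataFrame({
    "PATIENT_ID": ["P-1", "P-2"],
    "ALIQUOT_STATUS": ["stale", "same"],
})
# Biobank has the current status; BIOBANK_FLAG is a hypothetical extra column.
biobank = pd.DataFrame({
    "PATIENT_ID": ["P-1", "P-2"],
    "ALIQUOT_STATUS": ["fresh", "same"],
    "BIOBANK_FLAG": ["yes", "yes"],
})

# Equivalent of -c "PATIENT_ID ALIQUOT_STATUS": both columns become join keys.
merged = clinical.merge(biobank, on=["PATIENT_ID", "ALIQUOT_STATUS"], how="left")
print(merged)
```

P-1 finds no matching biobank row because the status values differ, so it keeps the stale status and gets no biobank data at all. P-2 matches only because its status happens to be identical in both files.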

Ah yes you're right. This script has mainly been used for combining files of the exact same format, or if there is a new column, it's only in one of the files. This makes sense for the biobank case. Thanks for catching this.

@mandawilson mandawilson mentioned this pull request Jun 23, 2026
@callachennault callachennault merged commit 9039a14 into knowledgesystems:master Jun 23, 2026
2 checks passed