Fix biobank clinical merge when ALIQUOT_STATUS already exists by mandawilson · Pull Request #1345 · knowledgesystems/cmo-pipelines

mandawilson · 2026-06-19T21:11:31Z

*** This has already been deployed. ***

Old scripts were kept and should be cleaned up:

cd /data/portal-cron/scripts

$ ls -lthr *BK_DELETE_ME
-rwxrw-r-- 1 cbioportal_importer cbioportal_importer 4.6K Dec 11  2025 combine_files_py3.py.BK_DELETE_ME
-rwxrwxr-x 1 cbioportal_importer cbioportal_importer  80K Jun 19 17:25 fetch-dmp-data-for-import.sh.BK_DELETE_ME

If combine_files_py3.py was given two files that share a non-key column, it created two columns, one suffixed with _x and one with _y (the pandas default for overlapping column names). I assume we never want that, and that until now no two files being merged shared any columns besides key columns like PATIENT_ID. We should review this further, but for now biobank passes a -p option, which says to use the column from the right-hand file and replace the one in the left-hand file.

Add opt-in -p to combine_files_py3 so overlapping columns use biobank values instead of pandas _x/_y suffixes.
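
For anyone less familiar with the pandas behavior, here is a minimal sketch of the suffixing and of one way a prefer-right merge could work. The helper merge_prefer_right and the sample values (P-1, stale, fresh) are hypothetical and not taken from combine_files_py3.py. The real -p implementation may differ, for example in how it treats left-hand rows that have no match in the right-hand file.

```python
import pandas as pd


def merge_prefer_right(left, right, keys, how="left"):
    """Hypothetical helper: for non-key columns present in both frames,
    drop the left-hand copy so the right-hand values win."""
    overlap = [c for c in left.columns if c in right.columns and c not in keys]
    return left.drop(columns=overlap).merge(right, on=keys, how=how)


# Made-up sample data standing in for the two clinical files.
clinical = pd.DataFrame({
    "PATIENT_ID": ["P-1", "P-2"],
    "OS_STATUS": ["0:LIVING", "1:DECEASED"],
    "ALIQUOT_STATUS": ["stale", "stale"],
})
biobank = pd.DataFrame({
    "PATIENT_ID": ["P-1", "P-2"],
    "ALIQUOT_STATUS": ["fresh", "fresh"],
})

# Plain merge: pandas keeps both overlapping columns and suffixes them.
default = clinical.merge(biobank, on=["PATIENT_ID"], how="left")
print(list(default.columns))

# Prefer-right merge: a single ALIQUOT_STATUS column with right-hand values.
preferred = merge_prefer_right(clinical, biobank, ["PATIENT_ID"])
print(list(preferred.columns))
```

With this approach a left-hand row that has no match on the right would end up with an empty ALIQUOT_STATUS rather than its old value, which is one of the things worth checking in the audit mentioned above.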

I tested this on pipelines3.

Old combine_files_py3.py, showing the problem:

cd /data/portal-cron/cbio-portal-data/pipelines-testing/studies/msk_impact_biobank

$ python3 $PORTAL_HOME/scripts/combine_files_py3.py \
>   -i data_clinical_patient.txt data_clinical_patient_biobank.txt \
>   -o /tmp/data_clinical_patient_merged_test.txt \
>   -c PATIENT_ID -m left

[cbioportal_importer@pipelines3 msk_impact_biobank]$ head -1 /tmp/data_clinical_patient_merged_test.txt
PATIENT_ID	STAGE_HIGHEST_RECORDED	NUM_ICDO_DX	OS_MONTHS	OS_STATUS	YOST_INDEX_IMPUTED_MEDIAN	GENDER	RACE	ETHNICITY	CURRENT_AGE_DEID	ADRENAL_GLANDS	BONE	CNS_BRAIN	INTRA_ABDOMINAL	LIVER	LUNG	LYMPH_NODES	OTHER	PLEURA	REPRODUCTIVE_ORGANS	GLEASON_FIRST_REPORTED	GLEASON_HIGHEST_REPORTED	HISTORY_OF_PDL1	PRIOR_MED_TO_MSK	SMOKING_PREDICTIONS_3_CLASSES	OTHER_PATIENT_ID	PARTA_CONSENTED_12_245	PARTC_CONSENTED_12_245	ALIQUOT_STATUS_x	ALIQUOT_STATUS_y

Fixed combine_files_py3.py:

$ python3 $PORTAL_HOME/scripts/combine_files_py3.py \
> -i data_clinical_patient.txt data_clinical_patient_biobank.txt \
> -o /tmp/data_clinical_patient_merged_test.txt \
> -c PATIENT_ID -m left -p
[cbioportal_importer@pipelines3 msk_impact_biobank]$ head -1 /tmp/data_clinical_patient_merged_test.txt
PATIENT_ID	STAGE_HIGHEST_RECORDED	NUM_ICDO_DX	OS_MONTHS	OS_STATUS	YOST_INDEX_IMPUTED_MEDIAN	GENDER	RACE	ETHNICITY	CURRENT_AGE_DEID	ADRENAL_GLANDS	BONE	CNS_BRAIN	INTRA_ABDOMINAL	LIVER	LUNG	LYMPH_NODES	OTHER	PLEURA	REPRODUCTIVE_ORGANS	GLEASON_FIRST_REPORTED	GLEASON_HIGHEST_REPORTED	HISTORY_OF_PDL1	PRIOR_MED_TO_MSK	SMOKING_PREDICTIONS_3_CLASSES	OTHER_PATIENT_ID	PARTA_CONSENTED_12_245	PARTC_CONSENTED_12_245	ALIQUOT_STATUS

callachennault · 2026-06-22T21:09:55Z

+            # TODO: prefer-right-columns may be right for all combine_files_py3
+            # callers, but other merges have never had overlapping column names
+            # before biobank; audit before making -p the default.
+            $PYTHON3_BINARY $PORTAL_HOME/scripts/combine_files_py3.py -i "$input_clinical_file" "$biobank_clinical_file" -o "$merged_clinical_file" -c "PATIENT_ID" -m left -p


I think this is a good fix to have in the script but another way to think about it is that ALIQUOT_STATUS should be listed in the -c argument. I think I hadn't listed it before because the data_clinical_patient initially didn't have this column. It would look like the below. The -p argument could likely be left out in this case.

$PYTHON3_BINARY $PORTAL_HOME/scripts/combine_files_py3.py -i "$input_clinical_file" "$biobank_clinical_file" -o "$merged_clinical_file" -c "PATIENT_ID ALIQUOT_STATUS" -m left -p

The script is also supposed to default to merging the columns that are present in both files. The issue shows up when you provide the -c argument but don't list all common columns there.

mandawilson · 2026-06-23

@callachennault I think -c is for join keys, not columns in common. If we add ALIQUOT_STATUS to -c:

-c "PATIENT_ID ALIQUOT_STATUS" -m left

pandas joins on both columns. Rows match only when patient and status value are identical. After a Databricks round-trip, clinical often has stale ALIQUOT_STATUS while biobank has fresh status for the same PATIENT_ID. Those rows won’t match, so a left merge won’t update status from biobank — we lose the refresh you want.

See the on parameter in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
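
A small sketch of that point, using made-up patient IDs and status values (nothing here comes from the real files). It joins on both columns the way -c "PATIENT_ID ALIQUOT_STATUS" would, assuming the script passes the -c values straight through as the pandas join keys.

```python
import pandas as pd

# Made-up sample data: P-1 has a stale status in clinical, P-2 is unchanged.
clinical = pd.DataFrame({
    "PATIENT_ID": ["P-1", "P-2"],
    "ALIQUOT_STATUS": ["stale", "unchanged"],
})
biobank = pd.DataFrame({
    "PATIENT_ID": ["P-1", "P-2"],
    "ALIQUOT_STATUS": ["fresh", "unchanged"],
})

# Joining on both columns: a row matches only when patient AND status agree.
merged = clinical.merge(
    biobank,
    on=["PATIENT_ID", "ALIQUOT_STATUS"],
    how="left",
    indicator=True,  # adds a _merge column showing where each row came from
)
print(merged)
```

P-1 comes back marked left_only and still carries the stale value, so the fresh biobank status never reaches the merged file.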

If you agree, can you merge this?

callachennault · 2026-06-23

Ah yes, you're right. This script has mainly been used for combining files of the exact same format, or, if there is a new column, it's only in one of the files. This makes sense for the biobank case. Thanks for catching this.

mandawilson added 2 commits June 19, 2026 22:08

Fix biobank clinical merge when ALIQUOT_STATUS already exists on round-trip imports.

c8f835a

Add opt-in -p to combine_files_py3 so overlapping columns use biobank values instead of pandas _x/_y suffixes.

Add golden-file and no-op tests for biobank merge with and without -p.

796bf4a

Lock in legacy _x/_y output without -p and confirm -p does not change merges when columns do not overlap.

mandawilson requested review from callachennault and n1zea144 June 19, 2026 21:28

callachennault approved these changes Jun 22, 2026

mandawilson mentioned this pull request Jun 23, 2026

Add biofluid merge code #1346

Open

1 task

callachennault merged commit 9039a14 into knowledgesystems:master Jun 23, 2026
2 checks passed

callachennault merged 2 commits into knowledgesystems:master from mandawilson:fix-biobank-clinical-aliquot-merge
