Fix biobank clinical merge when ALIQUOT_STATUS already exists#1345
Conversation
…d-trip imports. Add opt-in -p to combine_files_py3 so overlapping columns use biobank values instead of pandas _x/_y suffixes.
Lock in legacy _x/_y output without -p and confirm -p does not change merges when columns do not overlap.
| # TODO: prefer-right-columns may be right for all combine_files_py3 | ||
| # callers, but other merges have never had overlapping column names | ||
| # before biobank; audit before making -p the default. | ||
| $PYTHON3_BINARY $PORTAL_HOME/scripts/combine_files_py3.py -i "$input_clinical_file" "$biobank_clinical_file" -o "$merged_clinical_file" -c "PATIENT_ID" -m left -p |
There was a problem hiding this comment.
I think this is a good fix to have in the script but another way to think about it is that ALIQUOT_STATUS should be listed in the -c argument. I think I hadn't listed it before because the data_clinical_patient initially didn't have this column. It would look like the below. The -p argument could likely be left out in this case.
$PYTHON3_BINARY $PORTAL_HOME/scripts/combine_files_py3.py -i "$input_clinical_file" "$biobank_clinical_file" -o "$merged_clinical_file" -c "PATIENT_ID ALIQUOT_STATUS" -m left -p
The script is also supposed to default to merging the columns that are present in both files. The issue shows up when you provide the -c argument but don't list all common columns there.
There was a problem hiding this comment.
@callachennault I think -c is for join keys not columns in common. If we add ALIQUOT_STATUS to -c:
-c "PATIENT_ID ALIQUOT_STATUS" -m left
pandas joins on both columns. Rows match only when patient and status value are identical. After a Databricks round-trip, clinical often has stale ALIQUOT_STATUS while biobank has fresh status for the same PATIENT_ID. Those rows won’t match, so a left merge won’t update status from biobank — we lose the refresh you want.
See: on in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.merge.html
If you agree, can you merge this?
There was a problem hiding this comment.
Ah yes you're right. This script has mainly been used for combining files of the exact same format, or if there is a new column, it's only in one of the files. This makes sense for the biobank case. Thanks for catching this.
*** This has already been deployed. ***
Old scripts were kept and should be cleaned up:
If
combine_files_py3.pywas given two files with the same non-key column it would create two columns, one with _x on the end and one with _y. I assume we never want that, and that until now no two files being merged had the same columns (besides key columns like PATIENT_ID). We should review this further, but for now let biobank use a -p option to say we want to use the column from the right hand file and replace the one in the left hand file.Add opt-in -p to combine_files_py3 so overlapping columns use biobank values instead of pandas _x/_y suffixes.
I tested this on pipelines3.
Old problem
combine_files_py3.py:Fixed
combine_files_py3.py: