-
Notifications
You must be signed in to change notification settings - Fork 0
How to: Data to merge
This folder is the handoff point between the data-generating team and the engineer who maintains the OEPS backend. New CSV data that is ready to be incorporated into OEPS should be placed here.
-
Create your CSV following the preparing CSV data requirements:
- CamelCase column names
- Join column (HEROP_ID, GEOID, FIPS, or ZCTA5)
- One year and one geography level per file
- Empty cells for "no data" (not NA)
-
Add the CSV to this folder (
data_to_merge/). -
Optionally add a note so the engineer knows what to do next. You can:
- Add a comment in the PR or Slack
- Create a small text file alongside the CSV (e.g.
MOUD_tract_2025_notes.txt) with:- Geography level (state, county, tract, zcta)
- Data year
- List of variable/column names
- Target table if you know it (e.g. tract-2025)
- Create or edit table sources in the registry
- Add files to
backend/oeps/data/tables/ - Run
flask merge-csvor any backend commands
An engineer will handle the merge and registry updates.
When new CSVs appear in data_to_merge/, complete the integration:
remove-variable and merge-csv only change local files on your machine (backend/oeps/data/ and backend/oeps/registry/). They do not touch the production server, BigQuery, or the live explorer. Production is updated later when you run build-explorer (and --upload-map-data), BigQuery upload, and/or deploy. Backup and restore below refer to this local state.
Before running remove-variable, create a restore point. Recommended: one Git tag from main (or from your current branch). Alternative: copy two folders.
Option A — Git tag (recommended: one good backup)
This gives you a single named restore point. Use main so the backup is the last known-good state before your merges.
- From the repo root, open a terminal.
- Make sure main is up to date (if you use it as the backup baseline):
git fetch origin main git checkout main git pull origin main
- Create a tag with today’s date (this is your backup; no files are copied, the tag points to this commit):
Use the actual date, e.g.
git tag backup-before-remove-merge-2025-02-09
backup-before-remove-merge-2025-02-09. - Switch back to your working branch to run remove/merge:
(Use your branch name.)
git checkout access-metrics-2025
Note: If your working branch already has the data/registry state you want to keep (e.g. you haven’t merged main in yet), you can create the tag on your current branch instead: stay on that branch and run only the git tag ... command in step 3, then continue working.
To restore later:
If something goes wrong and you want to put backend/oeps/data and backend/oeps/registry back to the backup state:
git checkout backup-before-remove-merge-2025-02-09 -- backend/oeps/data backend/oeps/registryThen commit or stash as needed. To delete the tag later: git tag -d backup-before-remove-merge-2025-02-09.
Option B — Copy folders (no Git needed)
- From the repo root (e.g.
c:\workspace-healthyregions\oeps), open PowerShell. - Create two dated backup folders (use today’s date):
$date = Get-Date -Format "yyyy-MM-dd" Copy-Item -Recurse backend\oeps\data backend\oeps\data_backup_$date Copy-Item -Recurse backend\oeps\registry backend\oeps\registry_backup_$date
- Confirm the folders exist:
backend\oeps\data_backup_YYYY-MM-DDandbackend\oeps\registry_backup_YYYY-MM-DD.
To restore later:
Replace the live folders with the backups (e.g. remove or rename data and registry, then rename the _backup_* folders to data and registry), or copy the backup folders’ contents over the live ones.
Do the backup before the first real remove-variable run for a given batch of merges. If main is already in a good state and you don’t commit or push your local changes, you don’t need a separate backup: you can restore with git checkout main -- backend/oeps/data backend/oeps/registry to throw away local changes to data and registry.
cd backend
flask inspect-csv -s ../data_to_merge/YourFile.csvThis checks whether all columns have matching variables in the registry. If any are missing, add variable definitions (via Pages CMS or registry JSON) first.
If the target table (e.g. tract-2025) does not exist:
flask create-table-source -n tract-2025 -g geo-2020-tract(Replace with the correct geography and geodata source as needed.)
flask merge-csv -s ../data_to_merge/YourFile.csv -t tract-2025 --dry-runReview the output before proceeding. Dry-run does not change any data; it only reports what would happen.
Understanding the dry-run output:
- Matched columns — Columns that exist in the registry and are not yet in the target table. These will be merged when you run the merge (step 4).
- Matched columns already in target — Columns that exist in both the registry and the target table. By default the merge skips them (does not overwrite). If the dry run reports only these and no "Matched columns," it will say "There are no matched columns to merge. Cancelling operation."
- Unmatched columns — Column names in the CSV that have no variable in the registry (or a name mismatch). These are ignored. Add the variable to the registry if they should be included.
Option A — Only new columns (dry run showed "Matched columns" and you don’t need to refresh existing data):
flask merge-csv -s ../data_to_merge/YourFile.csv -t tract-2025This adds the new columns and updates variable table_sources in the registry. Existing columns in the target are left unchanged.
Option B — Refresh existing columns and/or add new columns (remove + merge):
If you want the target table to get the latest data from the CSV for both variables that already exist and new variables, use remove then merge:
-
Remove the variables that were "Matched columns already in target" (so they are no longer in the target table):
flask remove-variable -n "Var1,Var2,Var3" -t tract-2025 --yesUse the exact column names from the dry-run "Matched columns already in target" list, comma-separated. Omit this step if there are no such columns to refresh.
Optional: Preview what would be removed without changing any files by adding
--dry-run:flask remove-variable -n "Var1,Var2,Var3" -t tract-2025 --dry-run -
Merge the CSV (one command):
flask merge-csv -s ../data_to_merge/YourFile.csv -t tract-2025
This single merge adds data for:
- The columns you just removed (they are re-added from the incoming CSV), and
- Any new columns (Matched columns) that were not in the target.
So one remove (if needed) + one merge updates both refreshed and new columns at the same time.
- Run
flask build-explorer(and--upload-map-dataif publishing to production). This and any BigQuery/data-package steps are when production (server, explorer, BigQuery) gets updated—not during merge/remove. - Optionally run
flask build-docs, regenerate data packages, sync to BigQuery - Commit the changes and remove or archive the merged CSV from
data_to_merge/
See Merging data into OEPS for detailed requirements, geodata/table source creation, and troubleshooting.
When you have both new columns (to add) and existing columns (to refresh), use remove + merge per CSV: one remove (list all "already in target" vars to refresh), then one merge. That single merge updates both the refreshed (old) columns and the new columns at the same time.
Setup: From repo root, cd backend and set FLASK_APP=oeps. Do the backup (step 0 above) before the first remove-variable run.
-
Dry-run to see which columns are "Matched" (new) vs "Matched columns already in target" (existing):
flask merge-csv -s ../data_to_merge/YourFile.csv -t TARGET --dry-run
-
If there are "Matched columns already in target" and you want to refresh them: run remove with those names (comma-separated). You can preview with
--dry-runfirst (e.g.flask remove-variable -n "Col1,Col2,..." -t TARGET --dry-run), then run without--dry-run:Skip this step if you only have new columns and don’t need to refresh any existing ones.flask remove-variable -n "Col1,Col2,..." -t TARGET --dry-run # optional: preview flask remove-variable -n "Col1,Col2,..." -t TARGET --yes # run for real
-
Run the merge (one command). This adds/refreshes both the columns you removed and any new columns from the CSV:
flask merge-csv -s ../data_to_merge/YourFile.csv -t TARGET
FQHC tract 2025 (all existing in target → remove then merge):
flask remove-variable -n "FqhcCntDr,FqhcTmDr2,FqhcTmDr" -t tract-2025 --yes
flask merge-csv -s ../data_to_merge/fqhc-tract-2025.csv -t tract-2025FQHC county 2025 (all existing → remove then merge):
flask remove-variable -n "TotTracts,FqhcCtTmDr,FqhcCtTmDr2,FqhcTmDrP,FqhcTmDrP2,FqhcAvTmDr,FqhcAvTmDr2" -t county-2025 --yes
flask merge-csv -s ../data_to_merge/fqhc-county-2025.csv -t county-2025Hospital tract 2025 (existing: HospCntDr, HospTmDr; new after registry fix: HospTmDr2):
flask remove-variable -n "HospCntDr,HospTmDr" -t tract-2025 --yes
flask merge-csv -s ../data_to_merge/Hospital-Tract-2025.csv -t tract-2025Hospital county 2025 (existing: TotTracts, HospCtTmDr, HospTmDrP, HospAvTmDr; new: HospCtTmDr2, HospTmDrP2, HospAvTmDr2):
flask remove-variable -n "TotTracts,HospCtTmDr,HospTmDrP,HospAvTmDr" -t county-2025 --yes
flask merge-csv -s ../data_to_merge/Hospital-county-2025.csv -t county-2025Pharmacy tract 2025 — Check that Pharmacy-Tract-2025.csv is not empty (it was empty in an earlier run). If it has data:
flask merge-csv -s ../data_to_merge/Pharmacy-Tract-2025.csv -t tract-2025 --dry-run
# If "Matched columns already in target" for RxTmDr,RxTmDr2: remove then merge. Otherwise just merge.
flask remove-variable -n "RxTmDr,RxTmDr2" -t tract-2025 --yes # only if dry-run said already in target
flask merge-csv -s ../data_to_merge/Pharmacy-Tract-2025.csv -t tract-2025Pharmacy county 2025 (existing: TotTracts; new: RxCtTmDr, RxTmDrP, RxAvTmDr, RxCtTmDr2, RxTmDrP2, RxAvTmDr2 — or some already in target; dry-run to confirm):
flask merge-csv -s ../data_to_merge/Pharmacy-County-2025.csv -t county-2025 --dry-run
# If TotTracts (or others) are "already in target" and you want to refresh:
flask remove-variable -n "TotTracts,RxCtTmDr,RxTmDrP,RxAvTmDr" -t county-2025 --yes # only names that were "already in target"
flask merge-csv -s ../data_to_merge/Pharmacy-County-2025.csv -t county-2025Order: Run tract merges first (FQHC, Hospital, Pharmacy tract), then county merges, so you don’t remove a variable from both geography levels in one go if it appears in multiple CSVs. Or run one CSV end-to-end (remove → merge) before the next.
After all merges: Run flask build-explorer, update data packages/BigQuery as needed, commit, and archive or remove the merged CSVs from data_to_merge/ if desired.