Fix mixed-format dates in incrementally-fetched CSVs by nesanders · Pull Request #60 · nesanders/MAenvironmentaldata

nesanders · 2026-04-29T23:05:42Z

Summary

Fixes the IntCastingNaNError crash in dashboard chart generation introduced by PR #56's incremental fetching.

The cached CSVs store dates as YYYY-MM-DD, while the EEA DataLake API returns YYYY-MM-DDTHH:MM:SS. After the concat in fetch_incremental, the date column became object dtype and was written to CSV with mixed serializations. Downstream, pd.to_datetime inferred the format from the first row and silently converted the T-separated rows to NaT, so s_data['Year'].astype(int) in MADEP_enforcements_viz.py:79 crashed with IntCastingNaNError.
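For reviewers who want to see the failure in isolation, here is a minimal sketch of the mechanism described above. The sample dates are made up, and `errors="coerce"` is an assumption about how the downstream parse is configured (the PR only says the rows were silently turned into NaT). The behavior shown is what I would expect on pandas 2.x, where the format is inferred from the first element.

```python
import pandas as pd

# Hypothetical sample: one cached-style date, one API-style timestamp.
mixed = pd.Series(["2024-01-15", "2024-02-20T00:00:00"])

# Assumption: the downstream parse coerces failures to NaT instead of raising.
inferred = pd.to_datetime(mixed, errors="coerce")

# .dt.year becomes float64 when any value is NaT, so NaN can appear here.
years = inferred.dt.year

try:
    years.astype(int)
    cast_error = None
except ValueError as exc:  # IntCastingNaNError subclasses ValueError
    cast_error = type(exc).__name__

# Parsing explicitly as ISO 8601 accepts both serializations.
explicit = pd.to_datetime(mixed, format="ISO8601")
```

With `format="ISO8601"` both strings parse, which is why the fix casts with that format rather than relying on inference.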

Changes

- get_EEA_data_portal.py: cast new_data's date column to datetime64 with format='ISO8601' before the concat so the combined column has a uniform dtype and serializes consistently.
- EEADP_enforcement.csv: re-write the 44 existing T-format rows as YYYY-MM-DD.
- EEADP_inspection.csv: re-write the 24 existing T-format rows as YYYY-MM-DD.
- db_semantic_context.txt: regenerated from the fixed DB.
- gs://openamend-data/EEADP_drinkingWater.csv: 7 T-format rows fixed (uploaded out of band; file is too large to commit).
- gs://openamend-data/amend.db: rebuilt locally from the fixed CSVs and uploaded, so the next charts run won't fail.
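The cast-before-concat change can be sketched roughly as follows. This is not the code from get_EEA_data_portal.py: the `combine` helper and the sample frames are hypothetical, the sketch also parses the cached column (the PR only mentions casting new_data), and the explicit `date_format` on write is my addition to pin the serialization.

```python
import io

import pandas as pd


def combine(cached, new_data, date_col):
    """Parse the date column on both frames, then concatenate (hypothetical helper)."""
    cached = cached.copy()
    new_data = new_data.copy()
    # format="ISO8601" accepts both YYYY-MM-DD and YYYY-MM-DDTHH:MM:SS.
    cached[date_col] = pd.to_datetime(cached[date_col], format="ISO8601")
    new_data[date_col] = pd.to_datetime(new_data[date_col], format="ISO8601")
    return pd.concat([cached, new_data], ignore_index=True)


# Made-up rows standing in for the cached CSV and a fresh API response.
cached = pd.DataFrame({"EnforcementDate": ["2024-01-15", "2024-01-16"]})
new_data = pd.DataFrame({"EnforcementDate": ["2024-02-20T00:00:00"]})
combined = combine(cached, new_data, "EnforcementDate")

# Round-trip through CSV text; date_format pins every row to YYYY-MM-DD.
buffer = io.StringIO()
combined.to_csv(buffer, index=False, date_format="%Y-%m-%d")
csv_lines = buffer.getvalue().splitlines()
```

Because the combined column is datetime64 rather than object, every row goes through the same formatter on write, so the cached file should not accumulate a second serialization.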

Test plan

- Verify locally that pd.to_datetime(...).dt.year produces 0 NaN values on the fixed enforcement and inspection tables in the rebuilt DB
- Verify the fix produces uniform date format after CSV round-trip
- Trigger an update-charts workflow run and confirm it completes successfully
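The first check in the test plan could be scripted along these lines. This is a sketch against an in-memory SQLite database with made-up rows; the `count_unparsed_years` helper is hypothetical, and the table name `EEADP_enforcement` and column name `EnforcementDate` are assumptions about the rebuilt DB's schema.

```python
import sqlite3

import pandas as pd


def count_unparsed_years(conn, table, date_col):
    """Return how many rows yield a NaN year after date parsing (hypothetical helper)."""
    frame = pd.read_sql_query(f'SELECT "{date_col}" FROM "{table}"', conn)
    years = pd.to_datetime(frame[date_col], errors="coerce").dt.year
    return int(years.isna().sum())


# In-memory stand-in for the rebuilt DB, with uniformly formatted dates.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE EEADP_enforcement ("EnforcementDate" TEXT)')
conn.executemany(
    "INSERT INTO EEADP_enforcement VALUES (?)",
    [("2024-01-15",), ("2024-02-20",)],
)
bad_rows = count_unparsed_years(conn, "EEADP_enforcement", "EnforcementDate")
```

Pointing the same helper at a local copy of the rebuilt DB should report zero for the enforcement and inspection tables if the rewrite caught every T-format row.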

🤖 Generated with Claude Code

The cached CSVs store dates as 'YYYY-MM-DD' strings, while the EEA DataLake API returns 'YYYY-MM-DDTHH:MM:SS'. After concat in fetch_incremental, the column is object dtype with mixed serializations on CSV write, and pd.to_datetime later infers format from the first row, silently NaT-ing the rest. This crashed dashboard chart generation with IntCastingNaNError on s_data['Year'].astype(int).

Fixes:
- get_EEA_data_portal.py: cast new_data's date column to datetime64 with format='ISO8601' before the concat, so the combined column has a uniform datetime64 dtype and serializes consistently.
- EEADP_enforcement.csv: re-write existing 44 T-format rows as YYYY-MM-DD
- EEADP_inspection.csv: re-write existing 24 T-format rows as YYYY-MM-DD
- db_semantic_context.txt: regenerated from the fixed DB
- gs://openamend-data/EEADP_drinkingWater.csv: 7 T-format rows fixed (uploaded out of band; too large to commit)
- gs://openamend-data/amend.db: rebuilt from fixed CSVs

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>

github-actions · 2026-04-29T23:07:45Z

✅ Semantic Eval Results

Each eval sends a natural-language question plus the relevant table schemas from db_semantic_context.txt to gpt-4o-mini, executes the generated SQL against AMEND.db, then scores it with a second LLM call using a per-case rubric. Hard pass = SQL ran and returned rows without hitting any known anti-patterns. Fatal = judge determined the query would return wrong or misleading results.
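As a rough illustration of the "hard pass" criterion only (the LLM generation and judging steps are out of scope here), the check could look something like the sketch below. The `hard_pass` function, the `ANTI_PATTERNS` list, and the sample table are all hypothetical; the actual anti-patterns the eval looks for are not listed in this comment.

```python
import re
import sqlite3

# Hypothetical anti-pattern list; the real eval's patterns are not shown here.
ANTI_PATTERNS = [re.compile(r"select\s+\*", re.IGNORECASE)]


def hard_pass(conn, sql):
    """True if the SQL avoids the anti-patterns, runs, and returns at least one row."""
    if any(pattern.search(sql) for pattern in ANTI_PATTERNS):
        return False
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False
    return len(rows) > 0


# Tiny in-memory table with made-up rows, just to exercise the check.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE demo (year INTEGER, total REAL)")
conn.executemany("INSERT INTO demo VALUES (?, ?)", [(2021, 1.5), (2022, 2.5)])

ok = hard_pass(conn, "SELECT year, SUM(total) FROM demo GROUP BY year")
missing_table = hard_pass(conn, "SELECT year FROM no_such_table")
no_rows = hard_pass(conn, "SELECT year FROM demo WHERE year > 3000")
star = hard_pass(conn, "SELECT * FROM demo")
```

A query that errors, returns nothing, or matches a listed pattern fails the hard pass before any judge scoring would happen.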

Metric	Value
Hard pass rate	10/10 (100%)
Fatal failures	0
Mean judge score	5.0/5
P50 judge score	5/5
Model	gpt-4o-mini
Semantic context hash	`0b4c17034694`

Per-case results

ID	Hard pass	Score	Fatal	Reason
`cso_top_operator`	✅	5/5	no	The query correctly aggregates total CSO discharge volume by operator for 2022, using the correct table and filters, and orders the results as required.
`cso_monthly_rainfall`	✅	5/5	no	The query correctly aggregates both CSO discharge volume and precipitation totals by month and joins them appropriately.
`cso_by_watershed`	✅	5/5	no	The query correctly joins the tables, groups by watershed, and applies the necessary filter for eventType.
`enforcement_vs_budget`	✅	5/5	no	The query correctly joins the enforcement actions with the budget data on the year, extracts the year from the EnforcementDate correctly, and applies the necessary filters.
`staffing_trend`	✅	5/5	no	The query correctly uses the MADEP_staff_Comptroller table, groups by year, and counts employees from 2005 to present.
`303d_impaired_trend`	✅	5/5	no	The query correctly counts listings from the EPA_303d_Impairments table, groups by reportingCycle, and orders the results correctly.
`303d_named_waterbody`	✅	5/5	no	The query correctly uses the EPA_303d_Impairments table, applies the maximum reportingCycle filter, and uses the LIKE pattern for the waterbody.
`cso_to_impaired`	✅	5/5	no	Both joins are correct and the reportingCycle filter is applied to EPA_303d_Impairments, which is appropriate.
`all_caps_boston`	✅	5/5	no	The query correctly uses UPPER() to filter the municipality for 'BOSTON' and checks for CSO event types using LIKE.
`ecos_per_capita`	✅	5/5	no	The query correctly uses the ECOS_budgets table, aggregates data for multiple states, and calculates per-capita spending accurately.

nesanders merged commit c2e9eec into main Apr 29, 2026
10 checks passed

nesanders deleted the bugfix/incremental-fetch-date-format branch April 29, 2026 23:12
