Fix mixed-format dates in incrementally-fetched CSVs #60
Merged
The cached CSVs store dates as 'YYYY-MM-DD' strings, while the EEA DataLake API returns 'YYYY-MM-DDTHH:MM:SS'. After concat in fetch_incremental, the column is object dtype with mixed serializations on CSV write, and pd.to_datetime later infers format from the first row, silently NaT-ing the rest. This crashed dashboard chart generation with IntCastingNaNError on s_data['Year'].astype(int).

Fixes:
- get_EEA_data_portal.py: cast new_data's date column to datetime64 with format='ISO8601' before the concat, so the combined column has a uniform datetime64 dtype and serializes consistently.
- EEADP_enforcement.csv: re-write existing 44 T-format rows as YYYY-MM-DD
- EEADP_inspection.csv: re-write existing 24 T-format rows as YYYY-MM-DD
- db_semantic_context.txt: regenerated from the fixed DB
- gs://openamend-data/EEADP_drinkingWater.csv: 7 T-format rows fixed (uploaded out of band; too large to commit)
- gs://openamend-data/amend.db: rebuilt from fixed CSVs

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
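For reviewers who want to see the failure mode in isolation, here is a minimal reproduction sketch. It assumes pandas 2.x (where the format is inferred from the first element) and assumes the downstream parse uses errors="coerce", which would explain why the bad rows became NaT silently instead of raising. The sample values are made up for illustration and are not taken from the real CSVs.

```python
import pandas as pd
from pandas.errors import IntCastingNaNError

# Hypothetical sample: two rows as stored in the cached CSV (date only),
# followed by one row in the T-separated form the API returns.
raw = pd.Series(["2024-01-15", "2024-02-20", "2024-03-05T00:00:00"])

# pandas infers '%Y-%m-%d' from the first row; rows that do not match that
# format are coerced to NaT (errors="coerce" is an assumption about the caller).
parsed = pd.to_datetime(raw, errors="coerce")
n_nat = int(parsed.isna().sum())

# Extracting the year from a column containing NaT yields floats with NaN,
# and casting those to int is what raises IntCastingNaNError.
try:
    parsed.dt.year.astype(int)
    crashed = False
except IntCastingNaNError:
    crashed = True

print(f"NaT rows: {n_nat}, int cast crashed: {crashed}")
```

The point of the sketch is that nothing fails at parse time; the error only surfaces later at the integer cast, which is why the crash showed up in chart generation and not in the fetch step.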
✅ Semantic Eval Results
Each eval sends a natural-language question plus the relevant table schemas from
Per-case results
Summary
Fixes the IntCastingNaNError crash in dashboard chart generation introduced by PR #56's incremental fetching.

The cached CSVs store dates as YYYY-MM-DD, while the EEA DataLake API returns YYYY-MM-DDTHH:MM:SS. After concat in fetch_incremental, the column became object dtype with mixed serializations on CSV write. pd.to_datetime then inferred format from the first row and silently NaT-ed the T-separated rows downstream, so s_data['Year'].astype(int) in MADEP_enforcements_viz.py:79 crashed with IntCastingNaNError.

Changes
- get_EEA_data_portal.py: cast new_data's date column to datetime64 with format='ISO8601' before the concat so the combined column has a uniform dtype and serializes consistently.
- EEADP_enforcement.csv: re-write the 44 existing T-format rows as YYYY-MM-DD.
- EEADP_inspection.csv: re-write the 24 existing T-format rows as YYYY-MM-DD.
- db_semantic_context.txt: regenerated from the fixed DB.
- gs://openamend-data/EEADP_drinkingWater.csv: 7 T-format rows fixed (uploaded out of band; file is too large to commit).
- gs://openamend-data/amend.db: rebuilt locally from the fixed CSVs and uploaded, so the next charts run won't fail.

Test plan
- pd.to_datetime(...).dt.year produces 0 NaN values on the fixed enforcement and inspection tables in the rebuilt DB
- update-charts workflow run and confirm it completes successfully

🤖 Generated with Claude Code
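As a rough illustration of the fix and of the first test-plan check, the sketch below casts the new rows with format='ISO8601' before the concat, round-trips the result through CSV, and counts NaN years. It is not the code from get_EEA_data_portal.py: the column name "date", the frame contents, and the assumption that the cached frame is already parsed to datetime64 are all hypothetical. format='ISO8601' requires pandas 2.0 or later.

```python
import io

import pandas as pd

# Hypothetical cached frame, assumed to be parsed to datetime64 already.
cached = pd.DataFrame({"date": ["2024-01-15", "2024-02-20"], "id": [1, 2]})
cached["date"] = pd.to_datetime(cached["date"], format="ISO8601")

# Hypothetical new rows in the T-separated form returned by the API.
new_data = pd.DataFrame({"date": ["2024-03-05T00:00:00"], "id": [3]})

# The fix: cast before concatenating, so both sides are datetime64 and the
# combined column does not fall back to object dtype.
new_data["date"] = pd.to_datetime(new_data["date"], format="ISO8601")
combined = pd.concat([cached, new_data], ignore_index=True)

# Round-trip through CSV (in memory here) to check that every row is
# serialized in the same format.
buf = io.StringIO()
combined.to_csv(buf, index=False)
buf.seek(0)
reloaded = pd.read_csv(buf)

# Test-plan style check: no NaN years after re-parsing the written strings.
years = pd.to_datetime(reloaded["date"]).dt.year
nan_years = int(years.isna().sum())
print(f"NaN years after round trip: {nan_years}")
```

Casting at the concat boundary keeps the written CSV uniform going forward, but it does not repair rows that were already written in mixed form, which is why the existing CSVs and the DB also had to be rewritten.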