Question
Should the v3.1 / v4.x catalogs be stored and consumed natively as
geopandas.GeoDataFrame (with a Shapely geometry column on every row)
rather than as plain pandas.DataFrame plus on-demand Fan.to_shapely() /
Blotch.to_shapely() conversion?
Surfaced during the production.coverage port — the existing pattern
(materialise polygons in a 2.5-minute Shapely union pass, cache as
parquet) means every downstream consumer that wants polygon geometry
either re-runs the conversion or reads the cache. A native
GeoDataFrame distribution would let consumers gpd.read_parquet(...)
and have geometry available immediately, no conversion step.
What to evaluate
- Memory footprint. v3.1 has ~600 k fan + blotch markings. Estimate
  per-polygon Shapely + GeoSeries overhead end-to-end:
  - Fan polygon: 3-vertex line + 100-vertex semi-circle arc ≈ 103 points.
  - Blotch polygon: ~36-point ellipse approximation.
  - Mean ~70 points × 2 floats × 8 B ≈ 1.1 KB per row in WKB; somewhat
    more in live Shapely objects. → very rough estimate ~700 MB-1.5 GB
    in-memory for the full catalog.
  Confirm with a 50 k-row pilot.
- On-disk size. GeoParquet vs current plain Parquet for a v3.1
  sample. Acceptable inflation factor before the convenience win is
  eaten by larger Zenodo uploads / pooch caches?
- API options.
  - (a) Keep plain DataFrame as the canonical catalog; add an
    io.get_fan_catalog(version, geometry=True) switch that returns
    a GeoDataFrame, materialising geometry on the fly with a small
    in-memory cache.
  - (b) Ship a separate GeoParquet alongside the plain Parquet on
    Zenodo for users who want geometry pre-computed.
  - (c) Make the GeoDataFrame the default: biggest behavioural
    change, hardest to roll back.
- Downstream impact. Confirm that production.coverage,
  classify_by_activity, and any other module that uses geometry
  benefit transparently, i.e. they do not re-run the Shapely
  conversion after a geometry-aware load.
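
As a quick sanity check on the memory-footprint arithmetic above, the following sketch reproduces the coordinate-payload estimate from the quoted vertex counts. The 50/50 fan/blotch mix is an assumption (implied by the ~70-point mean), and the result covers raw coordinates only; the overhead of live Shapely objects is not modelled, so the 50 k-row pilot is still needed for real numbers.

```python
# Back-of-envelope estimate of the geometry payload, using the vertex
# counts quoted in this issue. Only raw coordinate bytes are counted;
# per-object overhead of live Shapely geometries is NOT included.

FAN_POINTS = 3 + 100      # 3-vertex line + 100-vertex semi-circle arc
BLOTCH_POINTS = 36        # ellipse approximation
BYTES_PER_POINT = 2 * 8   # x and y stored as float64
N_ROWS = 600_000          # approximate v3.1 fan + blotch count


def coord_bytes(n_points):
    """Raw coordinate payload of one polygon ring, in bytes."""
    return n_points * BYTES_PER_POINT


def catalog_bytes(n_rows, fan_fraction=0.5):
    """Total coordinate payload for a given fan/blotch mix.

    fan_fraction=0.5 is an assumption, not a measured catalog property.
    """
    mean_points = fan_fraction * FAN_POINTS + (1 - fan_fraction) * BLOTCH_POINTS
    return n_rows * mean_points * BYTES_PER_POINT


total = catalog_bytes(N_ROWS)
print(f"per fan:    {coord_bytes(FAN_POINTS)} B")
print(f"per blotch: {coord_bytes(BLOTCH_POINTS)} B")
print(f"catalog:    {total / 1e6:.0f} MB (coordinates only)")
```

The coordinate-only total lands near the lower end of the ~700 MB-1.5 GB range; varying fan_fraction shows how sensitive the estimate is to the actual fan/blotch ratio.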
Acceptance
A short markdown note in the repo (or a follow-up issue) recommending
one of (a)/(b)/(c) with measured numbers for memory and disk, and a
proposed migration path that keeps pip install p4tools users on
plain Parquet by default if memory turns out to be the deciding cost.