Evaluate native GeoDataFrame catalog representation (memory cost?)

## Question

Should the v3.1 / v4.x catalogs be stored and consumed natively as
`geopandas.GeoDataFrame` (with a Shapely geometry column on every row)
rather than as plain `pandas.DataFrame` plus on-demand `Fan.to_shapely()` /
`Blotch.to_shapely()` conversion?

Surfaced during the `production.coverage` port — the existing pattern
(materialise polygons in a 2.5-minute Shapely union pass, cache as
parquet) means every downstream consumer that wants polygon geometry
either re-runs the conversion or reads the cache. A native
GeoDataFrame distribution would let consumers `gpd.read_parquet(...)`
and have geometry available immediately, no conversion step.

## What to evaluate

1. **Memory footprint.** v3.1 has ~600 k fan + blotch markings. Estimate
   per-polygon Shapely + GeoSeries overhead end-to-end:
   - Fan polygon: 3-vertex line + 100-vertex semi-circle arc ≈ 103 points.
   - Blotch polygon: ~36-point ellipse approximation.
   - Mean ~70 points × 2 floats × 8 B ≈ 1.1 KB per row in WKB; somewhat
     more in live Shapely objects. → very rough estimate ~700 MB-1.5 GB
     in-memory for the full catalog.
   Confirm with a 50 k-row pilot.

2. **On-disk size.** GeoParquet vs current plain Parquet for a v3.1
   sample. Acceptable inflation factor before the convenience win is
   eaten by larger Zenodo uploads / pooch caches?

3. **API options.**
   - (a) Keep plain DataFrame as the canonical catalog; add an
     `io.get_fan_catalog(version, geometry=True)` switch that returns
     a GeoDataFrame, materialising geometry on the fly with a small
     in-memory cache.
   - (b) Ship a *separate* GeoParquet alongside the plain Parquet on
     Zenodo for users who want geometry pre-computed.
   - (c) Make the GeoDataFrame the default — biggest behavioural
     change, hardest to roll back.

4. **Downstream impact.** Confirm `production.coverage`,
   `classify_by_activity`, and any other module that uses geometry
   transparently benefits — i.e. they don't re-Shapely after a
   geometry-aware load.

## Acceptance

A short markdown note in the repo (or a follow-up issue) recommending
one of (a)/(b)/(c) with measured numbers for memory and disk, and a
proposed migration path that keeps `pip install p4tools` users on
plain Parquet by default if memory turns out to be the deciding cost.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Evaluate native GeoDataFrame catalog representation (memory cost?) #13

Question

What to evaluate

Acceptance

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Evaluate native GeoDataFrame catalog representation (memory cost?) #13

Description

Question

What to evaluate

Acceptance

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions