Evaluate native GeoDataFrame catalog representation (memory cost?) #13

Description

@michaelaye

Question

Should the v3.1 / v4.x catalogs be stored and consumed natively as
geopandas.GeoDataFrame (with a Shapely geometry column on every row)
rather than as plain pandas.DataFrame plus on-demand Fan.to_shapely() /
Blotch.to_shapely() conversion?

Surfaced during the production.coverage port — the existing pattern
(materialise polygons in a 2.5-minute Shapely union pass, cache as
parquet) means every downstream consumer that wants polygon geometry
either re-runs the conversion or reads the cache. A native
GeoDataFrame distribution would let consumers gpd.read_parquet(...)
and have geometry available immediately, no conversion step.

What to evaluate

  1. Memory footprint. v3.1 has ~600 k fan + blotch markings. Estimate
    per-polygon Shapely + GeoSeries overhead end-to-end:

    • Fan polygon: 3-vertex line + 100-vertex semi-circle arc ≈ 103 points.
    • Blotch polygon: ~36-point ellipse approximation.
    • Mean ~70 points × 2 floats × 8 B ≈ 1.1 KB per row in WKB; somewhat
      more in live Shapely objects. At ~600 k rows that is ~670 MB of raw
      coordinates, so a very rough estimate is 700 MB-1.5 GB in memory
      for the full catalog.
      Confirm with a 50 k-row pilot.
  2. On-disk size. Compare GeoParquet against the current plain Parquet
    for a v3.1 sample. What inflation factor is acceptable before the
    convenience win is eaten by larger Zenodo uploads / pooch caches?

  3. API options.

    • (a) Keep plain DataFrame as the canonical catalog; add an
      io.get_fan_catalog(version, geometry=True) switch that returns
      a GeoDataFrame, materialising geometry on the fly with a small
      in-memory cache.
    • (b) Ship a separate GeoParquet alongside the plain Parquet on
      Zenodo for users who want geometry pre-computed.
    • (c) Make the GeoDataFrame the default — biggest behavioural
      change, hardest to roll back.
  4. Downstream impact. Confirm that production.coverage,
    classify_by_activity, and any other module that uses geometry
    benefit transparently, i.e. they don't re-run the Shapely
    conversion after a geometry-aware load.

Acceptance

A short markdown note in the repo (or a follow-up issue) recommending
one of (a)/(b)/(c) with measured numbers for memory and disk, and a
proposed migration path that keeps pip install p4tools users on
plain Parquet by default if memory turns out to be the deciding cost.
