
🚀 Performance: Sub-second edits for large networks (Bay Area scale) #442

@e-lo

Description

Summary

Performance profiling reveals that individual network edits currently take ~1 second for a 66K link network, with 61% of that time spent on inefficient hash computation. For Bay Area 9-county scale networks (200-500K links), this would scale to 3-7 seconds per edit.

Goal: Enable sub-second individual edits for networks up to 500K links.

Related Issues

  • #443 - Fix hash computation
  • #444 - Reduce redundant hash checks
  • #445 - Batch/defer Pandera validation

Profiling Results

Test Environment

  • Test data: St. Paul example (66,253 links, 17,159 nodes)
  • Target scale: Bay Area 9-county (~200K-500K links)

Current Performance Breakdown

| Operation | Time | % of Total |
|---|---|---|
| Apply project (total) | 0.94s | 100% |
| → Hash computation (`network_hash`) | 0.69s | 61% |
| → Selection operation | 0.23s | 24% |
| → Pandera validation | 0.17s | 15% |
| → Actual edit | ~0.05s | 5% |

Profiler Output (Top Functions)

```
ncalls  tottime  cumtime  filename:lineno(function)
     3    0.008    0.686  network.py:226(network_hash)
     7    0.358    0.679  df_accessors.py:83(__call__)
250277    0.190    0.198  shapely/geometry/base.py:174(__repr__)
     1    0.000    0.173  pandera/.../container.py:41(validate)
```

Root Cause: Inefficient df_hash() Implementation

The hash function in `network_wrangler/utils/df_accessors.py:83-93`:

```python
def __call__(self):
    df_sorted = self._obj.sort_index(axis=0).sort_index(axis=1)
    _value = str(df_sorted.values.tolist()).encode()  # ← Problem!
    hash = hashlib.sha1(_value).hexdigest()
    return hash
```

This converts the entire DataFrame to a nested Python list and then to a string, including all 66K geometries. The ~250K calls to `shapely.__repr__` confirm that every geometry is being stringified.
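The mechanism is easy to demonstrate: `str()` on a nested list calls `repr()` on every element, so each geometry cell is stringified once per hash. A minimal illustration with a stand-in geometry class (names here are illustrative, not from network_wrangler):

```python
import pandas as pd

class Geom:
    """Stand-in for a shapely geometry that counts repr() calls."""
    def __init__(self):
        self.reprs = 0
    def __repr__(self):
        self.reprs += 1
        return "<GEOMETRY>"

g = Geom()
df = pd.DataFrame({"geometry": [g, g, g], "lanes": [2, 3, 2]})

before = g.reprs
_ = str(df.values.tolist())  # the pattern used by df_hash()
calls = g.reprs - before
print(calls)  # one repr() per geometry cell
```

At 66K links this is exactly the 250K `shapely.__repr__` calls the profiler shows.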

The hash is computed 3 times per `apply()` call:

  1. `selection.py:149` - when creating a selection
  2. `selection.py:345-346` - when checking whether the selection is stale (called twice)

Projected Performance at Bay Area Scale

| Scenario | Apply Time | Meets <1s Target? |
|---|---|---|
| Current code (3x scale) | 2.8s | No |
| Current code (7.5x scale) | 7.1s | No |
| With hash fix (3x) | 0.8s | Yes |
| With hash fix (7.5x) | 1.9s | ⚠️ Close |
| With hash + validation fixes | <1s | Yes |

Proposed Solutions

1. Fix Hash Computation (Priority 1)

Issue: #443 (related to #391)

Two options:

  • Option A (Recommended): Dirty flag pattern - track modifications instead of recomputing hash
  • Option B: Use pandas.util.hash_pandas_object() which is optimized for DataFrames

Expected improvement: 0.69s → ~0.01s (98% reduction in hash time)
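A hypothetical sketch of Option A (dirty-flag pattern), using Option B's `pd.util.hash_pandas_object()` as the recompute path when the flag is set. Class and method names are illustrative, not the actual network_wrangler API; a real fix would also convert geometry columns (e.g. to WKB) before hashing, since object-dtype hashing still stringifies elements:

```python
import hashlib
import pandas as pd

class HashedNetwork:
    def __init__(self, links_df: pd.DataFrame):
        self.links_df = links_df
        self._hash = None  # None = dirty; recompute on next access

    def edit(self, idx, col, value):
        self.links_df.loc[idx, col] = value
        self._hash = None  # O(1) invalidation instead of a full rehash

    @property
    def network_hash(self) -> str:
        if self._hash is None:
            # Row-wise uint64 hashes, reduced to one digest; no per-cell repr()
            rows = pd.util.hash_pandas_object(self.links_df, index=True)
            self._hash = hashlib.sha1(rows.values.tobytes()).hexdigest()
        return self._hash

net = HashedNetwork(pd.DataFrame({"lanes": [2, 3, 2]}))
h1 = net.network_hash
net.edit(0, "lanes", 4)
h2 = net.network_hash
print(h1 != h2)  # the edit invalidates and changes the hash
```

Unchanged networks return the cached digest in O(1), so repeated staleness checks become essentially free.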

2. Reduce Redundant Hash Checks (Priority 2)

Issue: #444

The selection code calls `network_hash` multiple times per operation. Consolidate these into a single check per operation.
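One way to sketch this consolidation (illustrative names, not the actual selection API): record the network hash once at selection creation and expose a single staleness check, so each `apply()` compares hashes once rather than re-deriving them at every call site.

```python
class CachedSelection:
    """Selection that remembers the network hash it was built against."""
    def __init__(self, net_hash: str, link_ids: list):
        self.link_ids = link_ids
        self._hash_at_creation = net_hash  # computed once, at creation

    def is_stale(self, current_net_hash: str) -> bool:
        # Single comparison per operation; no rehash here
        return current_net_hash != self._hash_at_creation

sel = CachedSelection("abc123", [1, 2, 3])
print(sel.is_stale("abc123"))  # False: network unchanged
print(sel.is_stale("def456"))  # True: network edited since selection
```

Combined with the dirty-flag fix, the remaining comparison is a string equality check, not a 66K-row hash.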

3. Batch/Defer Pandera Validation (Priority 3)

Issue: #445

Currently validates full schema on every edit. Consider:

  • Validate only changed rows
  • Defer validation to explicit checkpoints
  • Add "fast mode" that skips validation

Expected improvement: 0.17s → ~0.02s
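The "validate only changed rows" idea can be sketched as follows; `validate_rows` is a stand-in for a pandera `schema.validate()` call, and the column names are illustrative:

```python
import pandas as pd

def validate_rows(rows: pd.DataFrame) -> None:
    # Placeholder for the real pandera validation, kept cheap here
    if (rows["lanes"] < 0).any():
        raise ValueError("lanes must be non-negative")

def apply_edit(links_df: pd.DataFrame, idx, col, value) -> None:
    links_df.loc[idx, col] = value
    # Validate only the edited subset: O(changed rows), not O(network)
    validate_rows(links_df.loc[[idx]])

links = pd.DataFrame({"lanes": [2, 3, 2]})
apply_edit(links, 1, "lanes", 4)
print(links.loc[1, "lanes"])  # 4
```

Full-schema validation would then run only at explicit checkpoints (e.g. before writing the network to disk).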

4. Default to Parquet Format (Priority 4)

| Format | Load Time | File Size |
|---|---|---|
| JSON/GeoJSON | 1.52s | 42.4 MB |
| Parquet | 1.17s | 8.3 MB |

Recommend Parquet as default for large networks.

Reproduction

```bash
# Run benchmarks
pytest tests/test_benchmarks.py --benchmark-only -v

# Profile a specific operation
python -c "
import cProfile
from network_wrangler.roadway import load_roadway_from_dir
from projectcard import read_cards

net = load_roadway_from_dir('examples/stpaul')
cards = read_cards('examples/stpaul/project_cards/road.prop_change.multiple.yml')
proj = list(cards.values())[0]

cProfile.run('net.apply(proj)', sort='cumulative')
"
```

Checklist

Labels: enhancement (New feature or request)