🚀 Performance: Sub-second edits for large networks (Bay Area scale) #442
Description
Summary
Performance profiling reveals that individual network edits currently take ~1 second for a 66K link network, with 61% of that time spent on inefficient hash computation. For Bay Area 9-county scale networks (200-500K links), this would scale to 3-7 seconds per edit.
Goal: Enable sub-second individual edits for networks up to 500K links.
Related Issues
- ⏱️ Find quicker way to detect a change in a roadway network #391 - Original issue identifying hash computation as slow (now with detailed profiling data)
Profiling Results
Test Environment
- Test data: St. Paul example (66,253 links, 17,159 nodes)
- Target scale: Bay Area 9-county (~200K-500K links)
Current Performance Breakdown
| Operation | Time | % of Total |
|---|---|---|
| Apply project (total) | 0.94s | 100% |
| → Hash computation (`network_hash`) | 0.69s | 61% |
| → Selection operation | 0.23s | 24% |
| → Pandera validation | 0.17s | 15% |
| → Actual edit | ~0.05s | 5% |
Profiler Output (Top Functions)
```
   ncalls  tottime  cumtime  filename:lineno(function)
        3    0.008    0.686  network.py:226(network_hash)
        7    0.358    0.679  df_accessors.py:83(__call__)
   250277    0.190    0.198  shapely/geometry/base.py:174(__repr__)
        1    0.000    0.173  pandera/.../container.py:41(validate)
```
Root Cause: Inefficient df_hash() Implementation
The hash function in `network_wrangler/utils/df_accessors.py:83-93`:

```python
def __call__(self):
    df_sorted = self._obj.sort_index(axis=0).sort_index(axis=1)
    _value = str(df_sorted.values.tolist()).encode()  # ← Problem!
    hash = hashlib.sha1(_value).hexdigest()
    return hash
```

This converts the entire DataFrame to a nested Python list and then to one giant string, including all 66K geometries. The 250K calls to `shapely.__repr__` confirm that every geometry is being stringified.
The hash is computed 3 times per `apply()` call:
- `selection.py:149` - when creating a selection
- `selection.py:345-346` - when checking if the selection is stale (called twice)
Projected Performance at Bay Area Scale
| Scenario | Apply Time | Meets <1s Target? |
|---|---|---|
| Current code (3x scale) | 2.8s | ❌ |
| Current code (7.5x scale) | 7.1s | ❌ |
| With hash fix (3x) | 0.8s | ✅ |
| With hash fix (7.5x) | 1.9s | ❌ |
| With hash + validation fixes | <1s | ✅ |
Proposed Solutions
1. Fix Hash Computation (Priority 1)
Two options:
- Option A (Recommended): Dirty flag pattern - track modifications instead of recomputing hash
- Option B: Use `pandas.util.hash_pandas_object()`, which is optimized for DataFrames
Expected improvement: 0.69s → ~0.01s (98% reduction in hash time)
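Option B can be sketched as follows (a minimal illustration, not network_wrangler code; `fast_df_hash` is a hypothetical name, and a real GeoDataFrame would first need its geometry column converted to a hashable form, e.g. WKB). `hash_pandas_object` produces one uint64 per row in vectorized code, so no Python-level stringification of cells happens:

```python
import hashlib

import pandas as pd

def fast_df_hash(df: pd.DataFrame) -> str:
    """Hash a DataFrame without stringifying every cell.

    Cost is O(rows) with a small constant, versus building a full
    Python repr of the frame as the current df_hash does.
    """
    df_sorted = df.sort_index(axis=0).sort_index(axis=1)  # keep order-insensitivity
    row_hashes = pd.util.hash_pandas_object(df_sorted, index=True)
    return hashlib.sha1(row_hashes.values.tobytes()).hexdigest()

df = pd.DataFrame({"model_link_id": [1, 2, 3], "lanes": [2, 2, 4]})
h1 = fast_df_hash(df)
assert h1 == fast_df_hash(df.copy())  # same contents -> same hash

df.loc[0, "lanes"] = 3
assert fast_df_hash(df) != h1  # any edit changes the hash
```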
2. Reduce Redundant Hash Checks (Priority 2)
Issue: #444
The selection code calls network_hash multiple times. Consolidate to single check per operation.
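Beyond consolidating call sites, the dirty-flag pattern from Option A makes repeated hash reads essentially free: cache the hash and invalidate it on edit. A hypothetical sketch (class and method names are illustrative, not the actual network_wrangler API):

```python
import hashlib

import pandas as pd

class HashedFrame:
    """Caches the frame hash; edits mark it dirty instead of rehashing."""

    def __init__(self, df: pd.DataFrame):
        self._df = df
        self._hash = None  # None means "dirty, recompute on next read"

    def edit(self, index, column, value):
        self._df.loc[index, column] = value
        self._hash = None  # O(1) invalidation instead of an O(n) rehash

    @property
    def hash(self) -> str:
        if self._hash is None:  # recompute at most once per edit batch
            rows = pd.util.hash_pandas_object(self._df, index=True)
            self._hash = hashlib.sha1(rows.values.tobytes()).hexdigest()
        return self._hash

net = HashedFrame(pd.DataFrame({"lanes": [2, 2, 4]}))
h1 = net.hash
assert net.hash == h1          # repeated staleness checks reuse the cache
net.edit(0, "lanes", 3)
assert net.hash != h1          # edit invalidates, then recomputes once
```

With this in place, the three `network_hash` calls per `apply()` cost one hash computation at most.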
3. Batch/Defer Pandera Validation (Priority 3)
Issue: #445
Currently validates full schema on every edit. Consider:
- Validate only changed rows
- Defer validation to explicit checkpoints
- Add "fast mode" that skips validation
Expected improvement: 0.17s → ~0.02s
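The "validate only changed rows" idea can be sketched independently of Pandera's API (here `validate_lanes` is a hypothetical stand-in for a schema check, and `apply_edit` for the edit path):

```python
import pandas as pd

def validate_lanes(df: pd.DataFrame) -> None:
    """Stand-in for a schema check (real code would call Pandera)."""
    bad = df["lanes"] < 0
    if bad.any():
        raise ValueError(f"negative lanes at rows {list(df.index[bad])}")

def apply_edit(df, idx, column, value, validate_all=False):
    df.loc[idx, column] = value
    # Validate only the touched row(s) instead of the full 66K-row frame.
    target = df if validate_all else df.loc[[idx]]
    validate_lanes(target)

links = pd.DataFrame({"lanes": [2, 2, 4]})
apply_edit(links, 1, "lanes", 3)  # only row 1 is re-checked
assert links.loc[1, "lanes"] == 3
```

The same structure supports the other two options: defer by collecting changed indices and validating them at an explicit checkpoint, or skip `validate_lanes` entirely in a fast mode.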
4. Default to Parquet Format (Priority 4)
| Format | Load Time | File Size |
|---|---|---|
| JSON/GeoJSON | 1.52s | 42.4 MB |
| Parquet | 1.17s | 8.3 MB |
Recommend Parquet as default for large networks.
Reproduction
```shell
# Run benchmarks
pytest tests/test_benchmarks.py --benchmark-only -v

# Profile a specific operation
python -c "
import cProfile
from network_wrangler.roadway import load_roadway_from_dir
from projectcard import read_cards
net = load_roadway_from_dir('examples/stpaul')
cards = read_cards('examples/stpaul/project_cards/road.prop_change.multiple.yml')
proj = list(cards.values())[0]
cProfile.run('net.apply(proj)', sort='cumulative')
"
```

Checklist
- [ ] Fix hash computation (Performance: Replace df_hash() with efficient implementation #443, extends ⏱️ Find quicker way to detect a change in a roadway network #391)
- [ ] Reduce redundant hash checks (Performance: Reduce redundant network_hash calls in selection #444)
- [ ] Optimize validation (Performance: Add option to defer/batch Pandera validation #445)
- [ ] Document performance best practices
- [ ] Add performance regression tests