🚀 Performance: Sub-second edits for large networks (Bay Area scale) #442
Description
Summary
Performance profiling reveals that individual network edits currently take ~1 second for a 66K link network, with 61% of that time spent on inefficient hash computation. For Bay Area 9-county scale networks (200-500K links), this would scale to 3-7 seconds per edit.
Goal: Enable sub-second individual edits for networks up to 500K links.
Related Issues
- ⏱️ Find quicker way to detect a change in a roadway network #391 - Original issue identifying hash computation as slow (now with detailed profiling data)
Profiling Results
Test Environment
- Test data: St. Paul example (66,253 links, 17,159 nodes)
- Target scale: Bay Area 9-county (~200K-500K links)
Current Performance Breakdown
| Operation | Time | % of Total |
|---|---|---|
| Apply project (total) | 0.94s | 100% |
| → Hash computation (`network_hash`) | 0.69s | 61% |
| → Selection operation | 0.23s | 24% |
| → Pandera validation | 0.17s | 15% |
| → Actual edit | ~0.05s | 5% |
Profiler Output (Top Functions)
```
   ncalls  tottime  cumtime  filename:lineno(function)
        3    0.008    0.686  network.py:226(network_hash)
        7    0.358    0.679  df_accessors.py:83(__call__)
   250277    0.190    0.198  shapely/geometry/base.py:174(__repr__)
        1    0.000    0.173  pandera/.../container.py:41(validate)
```
Root Cause: Inefficient df_hash() Implementation
The hash function in `network_wrangler/utils/df_accessors.py:83-93`:

```python
def __call__(self):
    df_sorted = self._obj.sort_index(axis=0).sort_index(axis=1)
    _value = str(df_sorted.values.tolist()).encode()  # ← Problem!
    hash = hashlib.sha1(_value).hexdigest()
    return hash
```

This converts the entire DataFrame to a nested Python list and then to one giant string, including all 66K geometries. The 250K calls to `shapely.__repr__` confirm that every geometry is being stringified.
The hash is computed 3 times per `apply()` call:
- `selection.py:149` - when creating a selection
- `selection.py:345-346` - when checking if the selection is stale (called twice)
Projected Performance at Bay Area Scale
| Scenario | Apply Time | Meets <1s Target? |
|---|---|---|
| Current code (3x scale) | 2.8s | ❌ |
| Current code (7.5x scale) | 7.1s | ❌ |
| With hash fix (3x) | 0.8s | ✅ |
| With hash fix (7.5x) | 1.9s | ❌ |
| With hash + validation fixes | <1s | ✅ |
Proposed Solutions
1. Fix Hash Computation (Priority 1)
Two options:
- Option A (Recommended): Dirty flag pattern - track modifications instead of recomputing hash
- Option B: Use `pandas.util.hash_pandas_object()`, which is optimized for DataFrames
Expected improvement: 0.69s → ~0.01s (98% reduction in hash time)
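Option B can be sketched as follows (a minimal illustration, not network_wrangler code; `fast_df_hash` is a hypothetical name, and a real GeoDataFrame would first need its geometry column converted to a hashable form, e.g. WKB). `hash_pandas_object` produces one uint64 per row in vectorized code, so no Python-level stringification of cells happens:

```python
import hashlib

import pandas as pd

def fast_df_hash(df: pd.DataFrame) -> str:
    """Hash a DataFrame without stringifying every cell.

    Cost is O(rows) with a small constant, versus building a full
    Python repr of the frame as the current df_hash does.
    """
    df_sorted = df.sort_index(axis=0).sort_index(axis=1)  # keep order-insensitivity
    row_hashes = pd.util.hash_pandas_object(df_sorted, index=True)
    return hashlib.sha1(row_hashes.values.tobytes()).hexdigest()

df = pd.DataFrame({"model_link_id": [1, 2, 3], "lanes": [2, 2, 4]})
h1 = fast_df_hash(df)
assert h1 == fast_df_hash(df.copy())  # same contents -> same hash

df.loc[0, "lanes"] = 3
assert fast_df_hash(df) != h1  # any edit changes the hash
```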
2. Reduce Redundant Hash Checks (Priority 2)
Issue: #444
The selection code calls network_hash multiple times. Consolidate to single check per operation.
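Beyond consolidating call sites, the dirty-flag pattern from Option A makes repeated hash reads essentially free: cache the hash and invalidate it on edit. A hypothetical sketch (class and method names are illustrative, not the actual network_wrangler API):

```python
import hashlib

import pandas as pd

class HashedFrame:
    """Caches the frame hash; edits mark it dirty instead of rehashing."""

    def __init__(self, df: pd.DataFrame):
        self._df = df
        self._hash = None  # None means "dirty, recompute on next read"

    def edit(self, index, column, value):
        self._df.loc[index, column] = value
        self._hash = None  # O(1) invalidation instead of an O(n) rehash

    @property
    def hash(self) -> str:
        if self._hash is None:  # recompute at most once per edit batch
            rows = pd.util.hash_pandas_object(self._df, index=True)
            self._hash = hashlib.sha1(rows.values.tobytes()).hexdigest()
        return self._hash

net = HashedFrame(pd.DataFrame({"lanes": [2, 2, 4]}))
h1 = net.hash
assert net.hash == h1          # repeated staleness checks reuse the cache
net.edit(0, "lanes", 3)
assert net.hash != h1          # edit invalidates, then recomputes once
```

With this in place, the three `network_hash` calls per `apply()` cost one hash computation at most.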
3. Batch/Defer Pandera Validation (Priority 3)
Issue: #445
Currently validates full schema on every edit. Consider:
- Validate only changed rows
- Defer validation to explicit checkpoints
- Add "fast mode" that skips validation
Expected improvement: 0.17s → ~0.02s
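The "validate only changed rows" idea can be sketched independently of Pandera's API (here `validate_lanes` is a hypothetical stand-in for a schema check, and `apply_edit` for the edit path):

```python
import pandas as pd

def validate_lanes(df: pd.DataFrame) -> None:
    """Stand-in for a schema check (real code would call Pandera)."""
    bad = df["lanes"] < 0
    if bad.any():
        raise ValueError(f"negative lanes at rows {list(df.index[bad])}")

def apply_edit(df, idx, column, value, validate_all=False):
    df.loc[idx, column] = value
    # Validate only the touched row(s) instead of the full 66K-row frame.
    target = df if validate_all else df.loc[[idx]]
    validate_lanes(target)

links = pd.DataFrame({"lanes": [2, 2, 4]})
apply_edit(links, 1, "lanes", 3)  # only row 1 is re-checked
assert links.loc[1, "lanes"] == 3
```

The same structure supports the other two options: defer by collecting changed indices and validating them at an explicit checkpoint, or skip `validate_lanes` entirely in a fast mode.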
4. Default to Parquet Format (Priority 4)
| Format | Load Time | File Size |
|---|---|---|
| JSON/GeoJSON | 1.52s | 42.4 MB |
| Parquet | 1.17s | 8.3 MB |
Recommend Parquet as default for large networks.
Reproduction
```shell
# Run benchmarks
pytest tests/test_benchmarks.py --benchmark-only -v

# Profile a specific operation
python -c "
import cProfile
from network_wrangler.roadway import load_roadway_from_dir
from projectcard import read_cards
net = load_roadway_from_dir('examples/stpaul')
cards = read_cards('examples/stpaul/project_cards/road.prop_change.multiple.yml')
proj = list(cards.values())[0]
cProfile.run('net.apply(proj)', sort='cumulative')
"
```

Checklist
- [ ] Fix hash computation (Performance: Replace df_hash() with efficient implementation #443, extends ⏱️ Find quicker way to detect a change in a roadway network #391)
- [ ] Reduce redundant hash checks (Performance: Reduce redundant network_hash calls in selection #444)
- [ ] Optimize validation (Performance: Add option to defer/batch Pandera validation #445)
- [ ] Document performance best practices
- [ ] Add performance regression tests