# perf: add batch scope resolution API to eliminate redundant validate+explode per (property × timespan × category) #451
## Problem
`prop_for_scope` is called once per (property × timespan × category) combination by downstream consumers such as cube_wrangler's `split_properties_by_time_period_and_category`. On a MetCouncil-scale network this means 35 independent calls (3 simple properties × 5 time periods + `price` with 4 categories × 5 time periods = 35 combinations).
Each call independently executes:
```python
# network_wrangler/roadway/links/scopes.py — prop_for_scope()
links_df = validate_df_to_model(links_df, RoadLinksTable) # (1) full schema validation
candidate = _create_exploded_df_for_scoped_prop(links_df, prop_name) # (2) explode + json_normalize + datetime conversion
filtered = _filter_exploded_df_to_scope(candidate, timespan, category) # (3) cheap filter — varies per call
```
### Redundancy 1: validation called 35× on an already-valid DataFrame
`validate_df_to_model` does significant work on every call:
| Step | Cost |
|---|---|
| `copy.deepcopy(df.attrs)` | full deep copy of attrs |
| `_convert_string_dtype_to_object` | full DataFrame copy + column iteration |
| `model.validate(df, lazy=True)` | Pandera checks all 28+ columns of `RoadLinksTable` |
| `fill_df_with_defaults_from_model` | fills NaN defaults across columns |
When called from `split_properties_by_time_period_and_category`, the input is always `roadway_net.links_df` — a DataFrame that was already validated and coerced when the `RoadwayNetwork` was loaded. There is no caching or memoization, so the full validation is repeated 35 times on the same object.
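For intuition, the redundancy is purely per-object. A hypothetical identity-based guard (not network_wrangler code, and not the fix proposed in this issue, shown only to illustrate that 34 of the 35 validations add no information) would already collapse the cost:

```python
import pandas as pd

# Illustrative sketch: skip re-validation when the exact same DataFrame
# object has already been validated. The cache and validate_once are
# hypothetical names, not part of the network_wrangler API.
_validated: set[int] = set()

def validate_once(df: pd.DataFrame, validate) -> pd.DataFrame:
    if id(df) in _validated:
        return df            # already validated this exact object; skip
    out = validate(df)       # heavy path: schema checks, coercion, defaults
    _validated.add(id(out))
    return out
```

With this guard in place of a bare `validate_df_to_model` call, 35 invocations on the same `links_df` would run the heavy path only once.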
### Redundancy 2: explode called 35× when the result is the same per property
`_create_exploded_df_for_scoped_prop` builds a tidy exploded DataFrame from the `sc_{prop}` list column. Its output depends only on the property name — not on the timespan or category being queried. It is therefore identical for every call on the same property, yet it runs once per (timespan, category) combination.
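The scope-independence is easy to see on a toy frame (stand-in data and column names; the real `_create_exploded_df_for_scoped_prop` also normalizes nested fields and converts datetimes):

```python
import pandas as pd

# Toy data: two links with a default "lanes" value and scoped overrides
# in a list-of-dicts column (stand-in for the real sc_lanes structure).
links = pd.DataFrame({
    "model_link_id": [1, 2],
    "lanes": [2, 3],
    "sc_lanes": [
        [{"timespan": ["6:00", "9:00"], "value": 3}],
        [{"timespan": ["6:00", "9:00"], "value": 2},
         {"timespan": ["15:00", "19:00"], "value": 4}],
    ],
})

# Explode once: one row per scoped value, indexed by the original link row.
exploded = links.explode("sc_lanes").dropna(subset=["sc_lanes"])
tidy = pd.json_normalize(exploded["sc_lanes"].tolist()).set_index(exploded.index)
# tidy is identical no matter which timespan/category is queried afterwards;
# only the cheap filter over it varies per scope.
```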
### Measured impact (stpaul network, 66,253 links)
| Step | Time per call | × calls | Total |
|---|---|---|---|
| validate + explode | ~0.3–0.6s | 35 | ~10–15s |
| filter (step 3 only) | ~0.01s | 35 | ~0.4s |
On a 200k-link production network: ~30–45s for `split_properties` alone.
## Proposed fix
Add a public `props_for_scopes` function that:
- Validates once
- Explodes once per property
- Filters cheaply for each (timespan, category) scope
```python
def props_for_scopes(
    links_df: pd.DataFrame,
    prop_name: str,
    scopes: list[dict],
    strict_timespan_match: bool = False,
    min_overlap_minutes: int = 60,
    allow_default: bool = True,
) -> dict[str, pd.Series]:
    """Resolve one property for multiple (timespan, category) scopes in a single pass.

    Validates and explodes links_df once; filters once per scope.

    Args:
        links_df: RoadLinksTable DataFrame.
        prop_name: Name of the property to resolve.
        scopes: List of dicts, each with keys:
            - "label": str — key in the returned dict
            - "timespan": list[TimeString]
            - "category": str | int | None
        strict_timespan_match: passed to _filter_exploded_df_to_scope.
        min_overlap_minutes: passed to _filter_exploded_df_to_scope.
        allow_default: if True, return the default column when no scoped values exist.

    Returns:
        dict mapping each scope label to a pd.Series of resolved values.
    """
    links_df = validate_df_to_model(links_df, RoadLinksTable)  # validate ONCE
    base = links_df[prop_name].copy()
    sc_col = f"sc_{prop_name}"
    if sc_col not in links_df.columns or links_df[sc_col].isna().all():
        if not allow_default:
            raise ValueError(f"{prop_name} has no scoped values and allow_default=False")
        return {s["label"]: base.copy() for s in scopes}
    exploded = _create_exploded_df_for_scoped_prop(links_df, prop_name)  # explode ONCE
    result = {}
    for scope in scopes:
        filtered = _filter_exploded_df_to_scope(
            exploded,
            timespan=scope["timespan"],
            category=scope.get("category"),
            strict_timespan_match=strict_timespan_match,
            min_overlap_minutes=min_overlap_minutes,
        )
        col = base.copy()
        col.loc[filtered.index] = filtered["scoped"]
        result[scope["label"]] = col
    return result
```
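A self-contained toy of the same pattern, under simplifying assumptions (exact-match timespan equality stands in for `_filter_exploded_df_to_scope`, which also handles partial overlap, categories, and validation):

```python
import pandas as pd

def toy_props_for_scopes(links: pd.DataFrame, prop: str, scopes: list[dict]) -> dict:
    """Toy sketch: explode once, then apply one cheap filter per scope."""
    base = links[prop]
    exploded = links.explode(f"sc_{prop}").dropna(subset=[f"sc_{prop}"])
    tidy = pd.json_normalize(exploded[f"sc_{prop}"].tolist()).set_index(exploded.index)
    out = {}
    for s in scopes:
        # Simplified matching: exact timespan equality only.
        hit = tidy[tidy["timespan"].apply(lambda t: t == s["timespan"])]
        col = base.copy()
        col.loc[hit.index] = hit["value"]   # scoped value overrides the default
        out[s["label"]] = col
    return out

links = pd.DataFrame({
    "lanes": [2, 3],
    "sc_lanes": [
        [{"timespan": ["6:00", "9:00"], "value": 3}],
        [{"timespan": ["6:00", "9:00"], "value": 2},
         {"timespan": ["15:00", "19:00"], "value": 4}],
    ],
})
scopes = [{"label": "AM", "timespan": ["6:00", "9:00"]},
          {"label": "PM", "timespan": ["15:00", "19:00"]}]
resolved = toy_props_for_scopes(links, "lanes", scopes)
# resolved["AM"] -> [3, 2]; resolved["PM"] -> [2, 4] (default 2 kept for link 1)
```

The explode runs once no matter how many scopes are requested, which is exactly the saving the proposed `props_for_scopes` delivers on the real tables.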
## Caller change in cube_wrangler
```python
# cube_wrangler/roadway.py

# Before: one prop_for_scope call per combination
for time_suffix, category_suffix in itertools.product(time_periods, categories):
    roadway_net.links_df[out_var + "_" + ...] = prop_for_scope(
        roadway_net.links_df, params["v"], category=..., timespan=...
    )[params["v"]]

# After: one props_for_scopes call per property
scopes = [
    {"label": f"{out_var}{cat_sfx}{ts_sfx}",
     "timespan": params["time_periods"][ts_sfx],
     "category": params["categories"][cat_sfx]}
    for ts_sfx, cat_sfx in itertools.product(params["time_periods"], params["categories"])
]
resolved = props_for_scopes(roadway_net.links_df, params["v"], scopes)
for label, series in resolved.items():
    roadway_net.links_df[label] = series
```
## Expected speedup
| Operation | Before | After | Reduction |
|---|---|---|---|
| `validate_df_to_model` | 35× | 1× | 34 avoided |
| `_create_exploded_df_for_scoped_prop` | 35× | 7× (one per property) | 28 avoided |
| `_filter_exploded_df_to_scope` | 35× | 35× | unchanged (cheap) |
| Total split_properties (stpaul, 66k links) | ~12s | ~1s | ~10–30× |
| Total split_properties (200k links) | ~40s | ~3s | ~13× |
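The call-count reductions in the table reduce to simple arithmetic (counts only; the timings are measured, not derived):

```python
# Scope combinations from the Problem section: 3 simple properties × 5 time
# periods, plus price with 4 categories × 5 time periods.
combos = 3 * 5 + 4 * 5          # 35 (property × timespan × category) scopes
validate_avoided = combos - 1   # validate once instead of 35×
explode_avoided = combos - 7    # explode once per property, per the table
# 34 + 28 = 62 heavy operations eliminated; only the 35 cheap filters,
# one validation, and the per-property explodes remain.
```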
## Implementation notes
- `props_for_scopes` should be exported from `network_wrangler.roadway.links.scopes` alongside `prop_for_scope`
- `_create_exploded_df_for_scoped_prop` is already well-factored; no changes needed to it
- The existing `prop_for_scope` should remain unchanged for backwards compatibility
- A benchmark test can be added using a pre-loaded `RoadwayNetwork` fixture (the stpaul test network), calling `props_for_scopes` with a realistic set of MetCouncil-style scopes
## Context
Identified by profiling `cube_wrangler`'s log → project card pipeline. Full analysis in `cube_wrangler/PERFORMANCE_ANALYSIS.md` on branch `feature/perf-link-changes`. The `_process_link_changes` bottleneck (O(N_network × N_changes) boolean mask scan) has already been fixed; this is the next highest-impact item.