
perf: add batch scope resolution API to eliminate redundant validate+explode per (property × timespan × category) #451

@e-lo

Description


Problem

prop_for_scope is called once per (property × timespan × category) combination by downstream consumers such as cube_wrangler's split_properties_by_time_period_and_category. On a MetCouncil-scale network this means 35+ independent calls (3 simple properties × 5 time periods + `price` with 4 categories × 5 time periods = 35 combinations).
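The fan-out can be sketched directly (property and time-period names below are illustrative, not the actual MetCouncil configuration):

```python
import itertools

# Hypothetical MetCouncil-style configuration; names are placeholders.
time_periods = ["EA", "AM", "MD", "PM", "NT"]            # 5 time periods
simple_props = ["lanes", "ML_lanes", "access"]           # 3 uncategorized properties
categorized = {"price": ["sov", "hov2", "hov3", "trk"]}  # 1 property x 4 categories

# One prop_for_scope call per (property, timespan, category) combination.
calls = [(p, tp, None) for p in simple_props for tp in time_periods]
calls += [
    (p, tp, cat)
    for p, cats in categorized.items()
    for cat, tp in itertools.product(cats, time_periods)
]

print(len(calls))  # 3*5 + 4*5 = 35 independent invocations
```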

Each call independently executes:

```python
# network_wrangler/roadway/links/scopes.py — prop_for_scope()

links_df = validate_df_to_model(links_df, RoadLinksTable)             # (1) full schema validation
candidate = _create_exploded_df_for_scoped_prop(links_df, prop_name)  # (2) explode + json_normalize + datetime conversion
filtered = _filter_exploded_df_to_scope(candidate, timespan, category)  # (3) cheap filter — varies per call
```

Redundancy 1: validation called 35× on an already-valid DataFrame

`validate_df_to_model` does significant work on every call:

| Step | Cost |
|------|------|
| `copy.deepcopy(df.attrs)` | full deep copy of attrs |
| `_convert_string_dtype_to_object` | full DataFrame copy + column iteration |
| `model.validate(df, lazy=True)` | Pandera checks all 28+ columns of `RoadLinksTable` |
| `fill_df_with_defaults_from_model` | fills NaN defaults across columns |

When called from `split_properties_by_time_period_and_category`, the input is always `roadway_net.links_df` — a DataFrame that was already validated and coerced when the `RoadwayNetwork` was loaded. There is no caching or memoization, so the full validation is repeated 35 times on the same object.
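The missing memoization can be sketched in isolation. The `_wrangler_validated` flag below is hypothetical — `validate_df_to_model` has no such mechanism today — but it shows how repeat calls on an already-validated frame could short-circuit:

```python
import pandas as pd

# Hypothetical sketch only: tag a frame in df.attrs once it has passed
# validation, and skip the expensive path on repeat calls.
def validate_once(df: pd.DataFrame) -> pd.DataFrame:
    if df.attrs.get("_wrangler_validated"):
        return df  # already validated: cheap exit, no deepcopy/copy/Pandera
    # ... full schema validation (deepcopy attrs, coerce dtypes,
    # Pandera checks, fill defaults) would run here ...
    df.attrs["_wrangler_validated"] = True
    return df

links = pd.DataFrame({"lanes": [2, 3]})
validate_once(links)                       # pays the full cost once
assert links.attrs["_wrangler_validated"]  # calls 2..35 now short-circuit
```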

Redundancy 2: explode called 35× when the result is the same per property

`_create_exploded_df_for_scoped_prop` builds a tidy exploded DataFrame from the `sc_{prop}` list column. Its output depends only on the property name — not on the timespan or category being queried. It is therefore identical for every call on the same property, yet it runs once per (timespan, category) combination.
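Because the exploded frame is keyed only by property name, a dict-based cache would serve every (timespan, category) query for that property. A minimal sketch, with a stand-in for `_create_exploded_df_for_scoped_prop`:

```python
import pandas as pd

# Sketch of per-property memoization. `explode_fn` stands in for
# _create_exploded_df_for_scoped_prop; the cache key is prop_name alone.
def make_cached_explode(explode_fn):
    cache: dict[str, pd.DataFrame] = {}

    def cached(links_df: pd.DataFrame, prop_name: str) -> pd.DataFrame:
        if prop_name not in cache:
            cache[prop_name] = explode_fn(links_df, prop_name)
        return cache[prop_name]

    return cached

calls = []
cached_explode = make_cached_explode(
    lambda df, p: calls.append(p) or pd.DataFrame({"prop": [p]})
)
links = pd.DataFrame()
for _ in range(5):                # 5 timespan scopes on the same property
    cached_explode(links, "lanes")
print(len(calls))  # 1 — the explode ran once, not 5 times
```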

Measured impact (stpaul network, 66,253 links)

| Step | Time per call | × calls | Total |
|------|---------------|---------|-------|
| validate + explode | ~0.3–0.6s | 35 | ~10–15s |
| filter (step 3 only) | ~0.01s | 35 | ~0.4s |

On a 200k-link production network: ~30–45s for `split_properties` alone.


Proposed fix

Add a public `props_for_scopes` function that:

  1. Validates once
  2. Explodes once per property
  3. Filters cheaply for each (timespan, category) scope

```python
def props_for_scopes(
links_df: pd.DataFrame,
prop_name: str,
scopes: list[dict],
strict_timespan_match: bool = False,
min_overlap_minutes: int = 60,
allow_default: bool = True,
) -> dict[str, pd.Series]:
"""Resolve one property for multiple (timespan, category) scopes in a single pass.

Validates and explodes links_df once; filters once per scope.

Args:
    links_df: RoadLinksTable DataFrame.
    prop_name: Name of the property to resolve.
    scopes: List of dicts, each with keys:
        - "label": str — key in the returned dict
        - "timespan": list[TimeString]
        - "category": str | int | None
    strict_timespan_match: passed to _filter_exploded_df_to_scope.
    min_overlap_minutes: passed to _filter_exploded_df_to_scope.
    allow_default: if True, return the default column when no scoped values exist.

Returns:
    dict mapping each scope label to a pd.Series of resolved values.
"""
links_df = validate_df_to_model(links_df, RoadLinksTable)  # validate ONCE

base = links_df[prop_name].copy()

if f"sc_{prop_name}" not in links_df.columns or links_df[f"sc_{prop_name}"].isna().all():
    if not allow_default:
        raise ValueError(f"{prop_name} has no scoped values and allow_default=False")
    return {s["label"]: base.copy() for s in scopes}

exploded = _create_exploded_df_for_scoped_prop(links_df, prop_name)  # explode ONCE

result = {}
for scope in scopes:
    filtered = _filter_exploded_df_to_scope(
        exploded,
        timespan=scope["timespan"],
        category=scope.get("category"),
        strict_timespan_match=strict_timespan_match,
        min_overlap_minutes=min_overlap_minutes,
    )
    col = base.copy()
    col.loc[filtered.index] = filtered["scoped"]
    result[scope["label"]] = col

return result

```
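The validate-once / explode-once / filter-many pattern can be demonstrated in plain pandas, with no network_wrangler dependencies. Column names and the scoped-value shape below are a toy approximation, not the actual `RoadLinksTable` schema:

```python
import pandas as pd

# Toy links frame: a default `lanes` value plus a list column of scoped
# overrides (shape approximated for illustration).
links = pd.DataFrame({
    "model_link_id": [1, 2, 3],
    "lanes": [2, 2, 3],
    "sc_lanes": [[{"timespan": ["6:00", "9:00"], "value": 3}], None, None],
})

# Explode ONCE per property: one row per scoped entry.
entries = links[["model_link_id", "sc_lanes"]].dropna(subset=["sc_lanes"]).explode("sc_lanes")
exploded = pd.concat(
    [entries[["model_link_id"]].reset_index(drop=True),
     pd.json_normalize(entries["sc_lanes"].tolist())],
    axis=1,
)

# ...then filter cheaply per scope (here: exact timespan match for the AM peak).
am = exploded[exploded["timespan"].apply(lambda ts: ts == ["6:00", "9:00"])]
resolved = links.set_index("model_link_id")["lanes"].copy()
resolved.loc[am["model_link_id"]] = am.set_index("model_link_id")["value"]
print(resolved.tolist())  # [3, 2, 3] — link 1 picks up its AM override
```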

Caller change in cube_wrangler

```python
# cube_wrangler/roadway.py

# Before: one prop_for_scope call per combination
for time_suffix, category_suffix in itertools.product(time_periods, categories):
    roadway_net.links_df[out_var + "_" + ...] = prop_for_scope(
        roadway_net.links_df, params["v"], category=..., timespan=...
    )[params["v"]]

# After: one props_for_scopes call per property
scopes = [
    {"label": f"{out_var}{cat_sfx}{ts_sfx}",
     "timespan": params["time_periods"][ts_sfx],
     "category": params["categories"][cat_sfx]}
    for ts_sfx, cat_sfx in itertools.product(params["time_periods"], params["categories"])
]
resolved = props_for_scopes(roadway_net.links_df, params["v"], scopes)
for label, series in resolved.items():
    roadway_net.links_df[label] = series
```


Expected speedup

| Operation | Before | After | Reduction |
|-----------|--------|-------|-----------|
| `validate_df_to_model` | 35× | 1× | 34 avoided |
| `_create_exploded_df_for_scoped_prop` | 35× | 7× (one per property) | 28 avoided |
| `_filter_exploded_df_to_scope` | 35× | 35× | unchanged (cheap) |
| Total `split_properties` (stpaul, 66k links) | ~12s | ~1s | ~10–30× |
| Total `split_properties` (200k links) | ~40s | ~3s | ~13× |

Implementation notes

  • `props_for_scopes` should be exported from `network_wrangler.roadway.links.scopes` alongside `prop_for_scope`
  • `_create_exploded_df_for_scoped_prop` is already well-factored; no changes needed to it
  • The existing `prop_for_scope` should remain unchanged for backwards compatibility
  • A benchmark test can be added using a pre-loaded `RoadwayNetwork` fixture (the stpaul test network), calling `props_for_scopes` with a realistic set of MetCouncil-style scopes
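The shape of such a benchmark can be sketched without the fixture. The stand-in functions below only model the cost structure (one heavy validate+explode step versus many cheap filters); real timings would come from the stpaul `RoadwayNetwork` fixture:

```python
import time

# Stand-ins modeling the measured cost structure (values are scaled down).
def heavy():
    time.sleep(0.002)   # stand-in for the ~0.3–0.6s validate + explode

def cheap():
    pass                # stand-in for the ~0.01s scope filter

def per_scope(n_scopes=35):     # current behavior: heavy work per scope
    for _ in range(n_scopes):
        heavy()
        cheap()

def batched(n_scopes=35):       # proposed: heavy work once, filter per scope
    heavy()
    for _ in range(n_scopes):
        cheap()

t0 = time.perf_counter(); per_scope(); t_old = time.perf_counter() - t0
t0 = time.perf_counter(); batched(); t_new = time.perf_counter() - t0
print(t_old > t_new)  # batched avoids 34 of the 35 heavy calls
```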

Context

Identified by profiling `cube_wrangler`'s log → project card pipeline. Full analysis in `cube_wrangler/PERFORMANCE_ANALYSIS.md` on branch `feature/perf-link-changes`. The `_process_link_changes` bottleneck (O(N_network × N_changes) boolean mask scan) has already been fixed; this is the next highest-impact item.
