Merged
18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,24 @@ Versioning: [SemVer](https://semver.org/spec/v2.0.0.html).

### Added

- **Manifest decomposition + regression sections.** The manifest sidecar
now records two additional summaries of the engine's realized signal
layer. `seasonal_decomposition` snapshots the global per-period
seasonal-strength array plus the per-metric and per-entity sensitivity
multipliers — a downstream consumer can reproduce the effective
seasonal lift at any `(entity, period, metric)` cell without
re-reading the source config. `regression_pairs_global` carries
pair-wise OLS β + intercept (both directions), r², per-direction
residual variance, and the finite-observation count for every
declared correlation pair, pooled across all entities;
`regression_pairs_by_archetype` provides the same OLS surface
restricted to each archetype's entity subset, so consumers can see
which archetypes carry the correlation. Configs without
`seasonal_effects` emit an empty-sentinel `seasonal_decomposition`
(empty list and empty dicts); configs without `correlations` emit
empty regression sections. Manifest schema bumps 1.9 → 1.10 for the
three additive sections.

- **Per-metric treatment effects.** New optional `target_metric` field
on the treatment surface (set on `SegmentInput.treatment` in the
builder; mirrored as `Entity.treatment_target_metric` in the engine).
25 changes: 18 additions & 7 deletions docs/site/api-reference.md
@@ -237,13 +237,14 @@ Use this when you need the ground-truth trajectory positions — the
the primary consumers. Recovering positions from noisy fact-table cells
is impossible in general; this function exposes them directly.

-`GenerationState` is a frozen dataclass with three fields:
+`GenerationState` is a frozen dataclass with four fields:

| Field | Type | Contents |
|---|---|---|
| `trajectories` | `dict[str, ndarray]` | Per-entity position array, length `n_periods`, values in `[0, 1]` |
| `scd` | `SCDState` | Per-dim SCD Type 2 versioning (empty when no SCD columns are configured) |
| `bridges` | `BridgeAssociations` | Per-bridge association ground truth (empty when no bridges are configured) |
| `entity_metrics` | `dict[str, dict[str, ndarray]]` | Per-entity, per-metric realized series — the noise-free, distribution-shaped values the fact tables were built from. Consumed by `build_manifest` for the regression-pair sections; downstream feature pipelines pick it up here when they need the same arrays without re-running the engine |

**Returns** — `(tables, state)`.

@@ -255,8 +256,11 @@ is impossible in general; this function exposes them directly.
from plotsim import generate_tables_with_state, build_manifest

tables, state = generate_tables_with_state(cfg)
-manifest = build_manifest(cfg, state.trajectories, tables,
-                          scd_state=state.scd, bridge_state=state.bridges)
+manifest = build_manifest(
+    cfg, state.trajectories, tables,
+    scd_state=state.scd, bridge_state=state.bridges,
+    entity_metrics=state.entity_metrics,
+)
```

---
@@ -393,8 +397,11 @@ are still written so you can inspect the broken data. Block on
from plotsim import generate_tables_with_state, build_manifest, write_tables

tables, state = generate_tables_with_state(cfg)
-manifest = build_manifest(cfg, state.trajectories, tables,
-                          scd_state=state.scd, bridge_state=state.bridges)
+manifest = build_manifest(
+    cfg, state.trajectories, tables,
+    scd_state=state.scd, bridge_state=state.bridges,
+    entity_metrics=state.entity_metrics,
+)
out = write_tables(tables, cfg, manifest=manifest)
print(f"Wrote to {out}")
```
@@ -496,13 +503,15 @@ def build_manifest(
sample_rate: float | None = None,
scd_state: SCDState | None = None,
bridge_state: BridgeAssociations | None = None,
entity_metrics: dict[str, dict[str, numpy.ndarray]] | None = None,
) -> ManifestSchema
```

The manifest captures the *signal layer* a noisy fact table can't
recover: archetype assignments, trajectory positions, event-firing
-periods, SCD band crossings, bridge associations, and reproducibility
-metadata.
+periods, SCD band crossings, bridge associations, the engine's
+seasonal-strength inputs, per-pair regression summaries for declared
+correlations, and reproducibility metadata.

**Parameters**

@@ -514,6 +523,7 @@ metadata.
| `sample_rate` | Override for `config.manifest.trajectory_sample_rate`. `None` reads the config value. |
| `scd_state` | Pass `state.scd` to record SCD Type 2 band crossings. `None` leaves `manifest.scd_events` empty. |
| `bridge_state` | Pass `state.bridges` to record M:N associations. `None` leaves `manifest.bridge_associations` empty. |
| `entity_metrics` | Pass `state.entity_metrics` to populate `manifest.regression_pairs_global` and `manifest.regression_pairs_by_archetype` with pair-wise OLS summaries for every declared correlation pair. `None` leaves both sections at their empty defaults. |

The function is pure — same inputs produce a byte-identical manifest.
No RNG, no clock, no filesystem.
@@ -531,6 +541,7 @@ tables, state = generate_tables_with_state(cfg)
manifest = build_manifest(
cfg, state.trajectories, tables,
scd_state=state.scd, bridge_state=state.bridges,
entity_metrics=state.entity_metrics,
)
write_manifest(manifest, Path("output"))
```
188 changes: 185 additions & 3 deletions docs/site/manifest-reference.md
@@ -32,6 +32,7 @@ tables, state = generate_tables_with_state(cfg)
manifest = build_manifest(
cfg, state.trajectories, tables,
scd_state=state.scd, bridge_state=state.bridges,
entity_metrics=state.entity_metrics,
)
write_tables(tables, cfg, manifest=manifest)
```
@@ -46,7 +47,7 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,

```json
{
-"schema_version": "1.7",
+"schema_version": "1.10",
"seed": 42,
"config_sha256": "<64-char hex>",
"archetype_assignments": [...],
@@ -64,13 +65,16 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,
"correlations": [...],
"outlier_injections": [...] | null,
"parent_child_relations": [...],
-"noise_config": {...} | null
+"noise_config": {...} | null,
"seasonal_decomposition": {...},
"regression_pairs_global": [...],
"regression_pairs_by_archetype": {...}
}
```

| Field | Type | Description |
|---|---|---|
-| `schema_version` | `str` | Wire-shape version. Currently `"1.9"` (bumped over time as new additive sections — `causal_graph`, `correlations`, `outlier_injections`, multi-source mappings, `parent_child_relations`, `noise_config` — landed; 1.7 → 1.8 extended `noise_config` with `noise_family` / `degrees_of_freedom`; 1.8 → 1.9 added the optional `target_metric` field on the per-entity `treatment` and per-cohort `treatment_cohorts` records) |
+| `schema_version` | `str` | Wire-shape version. Currently `"1.10"` (bumped over time as new additive sections — `causal_graph`, `correlations`, `outlier_injections`, multi-source mappings, `parent_child_relations`, `noise_config` — landed; 1.7 → 1.8 extended `noise_config` with `noise_family` / `degrees_of_freedom`; 1.8 → 1.9 added the optional `target_metric` field on the per-entity `treatment` and per-cohort `treatment_cohorts` records; 1.9 → 1.10 added the `seasonal_decomposition` snapshot plus per-pair OLS summaries in `regression_pairs_global` / `regression_pairs_by_archetype`) |
| `seed` | `int` | The seed used for generation — `config.seed` |
| `config_sha256` | `str` | Full SHA-256 hex of the JSON-serialized config. Detects config drift between generation and consumption |
| `archetype_assignments` | array | One entry per entity; see below |
@@ -88,6 +92,9 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,
| `correlations` | array | One entry per user-declared `config.correlations` pair, with the realized (post-Higham, post-compensation) coefficient. Empty list when no correlations are configured |
| `outlier_injections` | array or `null` | Per-cell outlier-fire log. `null` when skipped (no `outlier_rate`, vectorized mode, or cell budget exceeded). `[]` when the detector ran and observed no firings |
| `noise_config` | object or `null` | Noise-model record. `null` when the run uses the default magnitude-scaled gaussian lane; populated when EITHER `noise.scale_with_trajectory` is `true` OR `noise.noise_family` is non-default (`"student_t"` / `"laplace"`) |
| `seasonal_decomposition` | object | Snapshot of the seasonal-strength inputs the engine consumed. Always emitted; configs without `seasonal_effects` get the empty-sentinel shape (empty list / empty dicts) |
| `regression_pairs_global` | array | Pair-wise OLS summary (slope, intercept, r², residual variance) for every declared correlation pair, pooled across every entity. Empty list when no correlations are configured |
| `regression_pairs_by_archetype` | object | Same OLS summary as `regression_pairs_global` but grouped by `Entity.archetype`. Keys are archetype names; values mirror the global list shape. Empty dict when no correlations are configured |

---

@@ -666,6 +673,181 @@ the scorer well-calibrated under the heavier-tailed residuals.

---

## `seasonal_decomposition`

Snapshot of the seasonal-strength inputs the engine consumed during
metric generation.

```json
{
"seasonal_decomposition": {
"seasonal_factors": [0.0, 0.8, 0.8, 0.0, 0.0, -0.3, -0.3, 0.0, 0.0, 0.0, 0.0, 0.8],
"metric_seasonal_sensitivities": {
"engagement": 1.0,
"mrr": 0.6
},
"entity_seasonal_sensitivities": {
"growers_001": 1.0,
"decliners_002": 0.0
}
}
}
```

| Field | Type | Description |
|---|---|---|
| `seasonal_factors` | array of `float` | Length-`n_periods` global strength array. Entry `t` is the sum of every `SeasonalEffect.strength` whose `months` set contains period `t`'s calendar month |
| `metric_seasonal_sensitivities` | object | One entry per metric, keyed by `Metric.name` and valued by `Metric.seasonal_sensitivity`. The per-metric multiplier the engine applies on top of the global strength |
| `entity_seasonal_sensitivities` | object | One entry per entity, keyed by `Entity.name` and valued by `Entity.seasonal_sensitivity`. The per-entity multiplier the engine applies on top of the global strength |

### When the section is the empty sentinel

Configs without any `seasonal_effects` declared get the empty-sentinel
shape (`seasonal_factors: []`, `metric_seasonal_sensitivities: {}`,
`entity_seasonal_sensitivities: {}`) rather than `null`. The
sensitivity multipliers are inert in that lane (the engine
short-circuits before applying them), so recording them would only add
noise. The section is always present, so a downstream consumer can
read it without a None-check. Indexing into the empty list or looking
up a key in the empty dicts still fails, so guard per-cell lookups.

**Use case** — reconstruct the engine's effective seasonal lift at any
cell without re-reading the YAML config. For an `(entity, period, metric)`
triple:

```python
lift = (
manifest["seasonal_decomposition"]["seasonal_factors"][period]
* manifest["seasonal_decomposition"]["metric_seasonal_sensitivities"][metric]
* manifest["seasonal_decomposition"]["entity_seasonal_sensitivities"][entity]
)
```

A seasonality-aware anomaly detector can subtract this lift before
scoring; a feature pipeline can expose `seasonal_factor` as a regressor
that exactly mirrors the engine's modulation.
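
The lookup above fails on the empty-sentinel shape, because the list and both dicts are empty. A minimal sketch of a guarded helper follows. The function name `seasonal_lift` is hypothetical, and returning `0.0` for the sentinel is an assumption of this sketch (the section only says the multipliers are inert in that lane), not documented engine behavior. The sample values are the ones from the JSON example above.

```python
from typing import Any, Mapping


def seasonal_lift(
    manifest: Mapping[str, Any], entity: str, period: int, metric: str
) -> float:
    """Effective seasonal lift for one (entity, period, metric) cell."""
    section = manifest["seasonal_decomposition"]
    factors = section["seasonal_factors"]
    if not factors:
        # Empty sentinel: no seasonal_effects were declared.
        # Treating that as "no lift" (0.0) is this sketch's assumption.
        return 0.0
    return (
        factors[period]
        * section["metric_seasonal_sensitivities"][metric]
        * section["entity_seasonal_sensitivities"][entity]
    )


# Sample manifest fragment copied from the JSON example in this section.
example = {
    "seasonal_decomposition": {
        "seasonal_factors": [0.0, 0.8, 0.8, 0.0, 0.0, -0.3, -0.3, 0.0, 0.0, 0.0, 0.0, 0.8],
        "metric_seasonal_sensitivities": {"engagement": 1.0, "mrr": 0.6},
        "entity_seasonal_sensitivities": {"growers_001": 1.0, "decliners_002": 0.0},
    }
}

# Period 1 has strength 0.8; "mrr" scales it by 0.6; "growers_001" by 1.0.
lift = seasonal_lift(example, "growers_001", 1, "mrr")
```

An entity with sensitivity `0.0` (here `decliners_002`) gets zero lift in every period, whatever the global strength.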

---

## `regression_pairs_global`

Pair-wise ordinary-least-squares fit for every declared correlation,
pooled across every entity and period.

```json
{
"regression_pairs_global": [
{
"metric_a": "engagement",
"metric_b": "mrr",
"beta_a_to_b": 0.84,
"intercept_a_to_b": 12.3,
"beta_b_to_a": 0.71,
"intercept_b_to_a": -4.1,
"r_squared": 0.6,
"residual_variance_a_to_b": 18.7,
"residual_variance_b_to_a": 0.04,
"n_observations": 720
}
]
}
```

| Field | Type | Description |
|---|---|---|
| `metric_a` / `metric_b` | `str` | The pair, in the order the user declared them in `config.correlations` |
| `beta_a_to_b` | `float` | OLS slope for `b = beta * a + intercept` over the pooled `(a, b)` observations |
| `intercept_a_to_b` | `float` | OLS intercept for the same regression |
| `beta_b_to_a` | `float` | OLS slope for the reverse regression `a = beta * b + intercept` |
| `intercept_b_to_a` | `float` | OLS intercept for the reverse regression |
| `r_squared` | `float` | Direction-invariant coefficient of determination. Equal to `corr(a, b) ** 2` on the same observations |
| `residual_variance_a_to_b` | `float` | Variance of `b - (beta_a_to_b * a + intercept_a_to_b)` — the unexplained-noise scale for the `a → b` direction |
| `residual_variance_b_to_a` | `float` | Same for the reverse direction |
| `n_observations` | `int` | Count of finite `(a, b)` pairs used. Cells with NaN in either metric (cold-start lead-ins, MCAR-rewritten values) are excluded |

One entry per pair in `config.correlations`. Auto-zero off-diagonals
(pairs the user did not declare) are not recorded. Sorted by
`(metric_a, metric_b)` for stable JSON output.

**Distinct from** `correlations` (which records the realized Pearson
coefficient the copula targeted). `regression_pairs_global` describes
the *fitted linear relationship* between the realized series: slope
and intercept, plus the unexplained variance in each direction. A high
`r_squared` combined with small residual variances says the pair moves
tightly together along a straight line. Asymmetric residual variances
are normal under unequal metric scales: `r_squared` is the same in
both directions, so each residual variance is `(1 - r_squared)` times
the variance of the metric being predicted, in that metric's units.

`n_observations < 2` is a degenerate case (sparse cold-start, no
overlap between metric domains); the record's β / intercept / variance
fields are all `0.0` and downstream consumers should gate on the count
before reading the coefficients.
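
To make the field definitions concrete, here is a minimal pure-Python sketch of how such a record could be computed from two aligned series. It is not the engine's implementation. The function name `ols_pair_summary` is hypothetical, residual variance is taken as the population variance (divide by `n`), and `r_squared` is set to `0.0` in the degenerate case; both choices are assumptions, since the text above does not specify them.

```python
import math
from typing import Sequence


def ols_pair_summary(a: Sequence[float], b: Sequence[float]) -> dict:
    """Pair-wise OLS summary in both directions over finite (a, b) pairs."""
    # Exclude cells where either metric is NaN or infinite.
    pairs = [(x, y) for x, y in zip(a, b) if math.isfinite(x) and math.isfinite(y)]
    n = len(pairs)
    result = {
        "beta_a_to_b": 0.0, "intercept_a_to_b": 0.0,
        "beta_b_to_a": 0.0, "intercept_b_to_a": 0.0,
        "r_squared": 0.0,  # assumption: zeroed in the degenerate case
        "residual_variance_a_to_b": 0.0, "residual_variance_b_to_a": 0.0,
        "n_observations": n,
    }
    if n < 2:
        return result
    mean_a = sum(x for x, _ in pairs) / n
    mean_b = sum(y for _, y in pairs) / n
    s_aa = sum((x - mean_a) ** 2 for x, _ in pairs)
    s_bb = sum((y - mean_b) ** 2 for _, y in pairs)
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in pairs)
    if s_aa == 0.0 or s_bb == 0.0:
        # A constant series has no defined slope; this sketch keeps the zeros.
        return result
    beta_ab = s_ab / s_aa  # slope of b regressed on a
    beta_ba = s_ab / s_bb  # slope of a regressed on b
    result.update({
        "beta_a_to_b": beta_ab,
        "intercept_a_to_b": mean_b - beta_ab * mean_a,
        "beta_b_to_a": beta_ba,
        "intercept_b_to_a": mean_a - beta_ba * mean_b,
        "r_squared": (s_ab * s_ab) / (s_aa * s_bb),
        # Residual sum of squares divided by n (population variance, assumed).
        "residual_variance_a_to_b": max(0.0, s_bb - beta_ab * s_ab) / n,
        "residual_variance_b_to_a": max(0.0, s_aa - beta_ba * s_ab) / n,
    })
    return result


# The NaN cell is dropped, leaving four points on the line b = 2a + 1.
summary = ols_pair_summary([1.0, 2.0, 3.0, 4.0, float("nan")],
                           [3.0, 5.0, 7.0, 9.0, 1.0])
```

In this formulation the product `beta_a_to_b * beta_b_to_a` equals `r_squared`, which is a quick consistency check a consumer can run on any record.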

**Use case** — score a regression baseline. A predictor of `mrr` from
`engagement` should land near `beta_a_to_b` with residual variance
close to `residual_variance_a_to_b`. Larger deviations flag either
model misspecification or that the consumer is over-fitting noise the
manifest already attributes to residuals.
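
One way to run that comparison is sketched below. The helper name `baseline_gap` and the two reported quantities (a slope difference and a residual-variance ratio) are illustrative choices, not part of the manifest or the library. The record is the one from the JSON example above, and the fitted values are made-up inputs standing in for a consumer's own model.

```python
from typing import Any, Mapping, Optional


def baseline_gap(
    record: Mapping[str, Any],
    fitted_beta: float,
    fitted_residual_variance: float,
) -> Optional[dict]:
    """Compare a consumer's a -> b fit against a manifest regression record."""
    if record["n_observations"] < 2:
        # Degenerate record: the coefficients are placeholders, so skip it.
        return None
    reference_variance = record["residual_variance_a_to_b"]
    ratio = (
        fitted_residual_variance / reference_variance
        if reference_variance > 0
        else None  # a perfect reference fit leaves no scale to compare against
    )
    return {
        "beta_gap": fitted_beta - record["beta_a_to_b"],
        "residual_variance_ratio": ratio,
    }


# Record copied from the regression_pairs_global example in this section.
record = {
    "metric_a": "engagement", "metric_b": "mrr",
    "beta_a_to_b": 0.84, "intercept_a_to_b": 12.3,
    "beta_b_to_a": 0.71, "intercept_b_to_a": -4.1,
    "r_squared": 0.6,
    "residual_variance_a_to_b": 18.7, "residual_variance_b_to_a": 0.04,
    "n_observations": 720,
}

# Hypothetical consumer fit: slope 0.80, residual variance 18.7.
gap = baseline_gap(record, fitted_beta=0.80, fitted_residual_variance=18.7)
```

A ratio well below 1 suggests the consumer's model is explaining variance the manifest attributes to residuals; what threshold counts as "well below" is left to the consumer.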

---

## `regression_pairs_by_archetype`

The same OLS surface as `regression_pairs_global`, but restricted to
each archetype's entity subset so a consumer can see which archetypes
carry the declared correlations.

```json
{
"regression_pairs_by_archetype": {
"growth": [
{
"metric_a": "engagement",
"metric_b": "mrr",
"beta_a_to_b": 0.91,
"intercept_a_to_b": 9.2,
"beta_b_to_a": 0.86,
"intercept_b_to_a": -7.0,
"r_squared": 0.78,
"residual_variance_a_to_b": 10.4,
"residual_variance_b_to_a": 0.02,
"n_observations": 360
}
],
"decline": [
{
"metric_a": "engagement",
"metric_b": "mrr",
"beta_a_to_b": 0.62,
"intercept_a_to_b": 15.8,
"beta_b_to_a": 0.41,
"intercept_b_to_a": 1.2,
"r_squared": 0.31,
"residual_variance_a_to_b": 25.6,
"residual_variance_b_to_a": 0.08,
"n_observations": 360
}
]
}
}
```

The top-level object's keys are archetype names (matching
`Entity.archetype`); each value list mirrors the
`regression_pairs_global` shape, one entry per declared pair.
Archetypes that contribute no finite observations are omitted entirely
(rather than mapped to an empty list) — the dict reflects archetypes
that actually contributed to the fit.

Empty `{}` when no correlations are declared.

**Use case** — diagnose where in the population a declared correlation
is strongest. A pair with a high pooled `r_squared` but per-archetype
values that swing widely is a signal that the correlation is a mixture
artefact, not a within-archetype relationship — a model trained on the
pooled fit will mispredict for the archetype whose β diverges most.
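
A minimal sketch of that diagnosis: collect the per-archetype `r_squared` for one pair and report how far the values spread. The helper name `r_squared_spread` is hypothetical, and what counts as a wide swing is left to the consumer. The input uses the `r_squared` and `n_observations` values from the JSON example above, with the other fields trimmed for brevity.

```python
from typing import Any, Dict, List, Mapping, Tuple


def r_squared_spread(
    by_archetype: Mapping[str, List[Mapping[str, Any]]],
    metric_a: str,
    metric_b: str,
) -> Tuple[Dict[str, float], float]:
    """Per-archetype r_squared for one pair, plus the max-minus-min spread."""
    values: Dict[str, float] = {}
    for archetype, records in by_archetype.items():
        for rec in records:
            same_pair = (rec["metric_a"], rec["metric_b"]) == (metric_a, metric_b)
            # Gate on the count: degenerate records carry placeholder values.
            if same_pair and rec["n_observations"] >= 2:
                values[archetype] = rec["r_squared"]
    if not values:
        return values, 0.0
    return values, max(values.values()) - min(values.values())


# Trimmed copy of the regression_pairs_by_archetype example in this section.
by_archetype = {
    "growth": [{"metric_a": "engagement", "metric_b": "mrr",
                "r_squared": 0.78, "n_observations": 360}],
    "decline": [{"metric_a": "engagement", "metric_b": "mrr",
                 "r_squared": 0.31, "n_observations": 360}],
}

per_archetype, spread = r_squared_spread(by_archetype, "engagement", "mrr")
```

The same loop works for comparing `beta_a_to_b` across archetypes, which is the quantity that matters when deciding whether a pooled model will mispredict for one sub-population.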

---

## Reading the manifest in Python

```python
11 changes: 11 additions & 0 deletions docs/site/user-guide/metrics-and-connections.md
@@ -245,6 +245,17 @@ records the adjustments in `manifest.correlation_adjustments`. Strong
mirrors (`mirrors`, `inverts`) on many metrics tend to over-constrain
the matrix, and a warning fires.

Beyond the correlation target itself, the manifest emits a pair-wise
OLS fit (slope, intercept, r², residual variance) for every declared
correlation in `manifest.regression_pairs_global` (pooled across all
entities) and `manifest.regression_pairs_by_archetype` (grouped by
archetype). The pooled fit answers "given the realized output, what
linear relationship do these metrics actually follow"; the
per-archetype fit answers "is that relationship the same in every
sub-population, or is the pooled correlation a mixture artefact?"
See [`manifest-reference.md`](../manifest-reference.md#regression_pairs_global)
for the field layout.

---

## Causal lag — `follows` + `delay`
25 changes: 20 additions & 5 deletions docs/site/user-guide/seasonality.md
@@ -190,11 +190,26 @@ combined with windows shorter than 24 periods.

Seasonal modulation is a deterministic function of the config — same
`(config, seed)` produces the same `seasonal_factor` at every cell. The
-manifest doesn't record per-cell seasonal factors directly; you can
-reconstruct them from the config alone.
-
-If you need per-cell verification, [`trace_metric_cell`](../api-reference.md#trace_metric_cell)
-returns the `seasonal_factor` and `modulated_center` for any single
+manifest's `seasonal_decomposition` section captures the three inputs
+the engine consumed so a consumer can reproduce the effective lift at
+any cell without re-reading the YAML:
+
+- `seasonal_factors` — the length-`n_periods` global strength array
+(entry `t` is the summed strength of every effect whose `months`
+set contains period `t`'s calendar month).
+- `metric_seasonal_sensitivities` — per-metric multipliers
+(`Metric.seasonal_sensitivity`).
+- `entity_seasonal_sensitivities` — per-entity multipliers
+(`Entity.seasonal_sensitivity`).
+
+The effective lift at cell `(entity, period, metric)` is the product
+of those three values — the same multiplication the engine applies
+during metric generation. Configs without any `seasonal_effects`
+declared get the empty-sentinel shape (empty list and empty dicts).
+
+If you need per-cell verification rather than reconstruction,
+[`trace_metric_cell`](../api-reference.md#trace_metric_cell) returns
+the `seasonal_factor` and `modulated_center` for any single
`(entity, period, metric)` triple.

---
1 change: 1 addition & 0 deletions plotsim/cli.py
@@ -280,6 +280,7 @@ def cmd_run(args: argparse.Namespace) -> int:
tables,
scd_state=gen_state.scd,
bridge_state=gen_state.bridges,
entity_metrics=gen_state.entity_metrics,
)

output_dir = Path(args.output_dir) if args.output_dir else None