Merged
18 changes: 18 additions & 0 deletions CHANGELOG.md
@@ -20,6 +20,24 @@ Versioning: [SemVer](https://semver.org/spec/v2.0.0.html).
1.6 → 1.7 with a new optional `noise_config` field populated only
when the flag is enabled.

- **Heavy-tailed noise families (Student-t, Laplace).** New
`noise_family` field on `NoiseConfig` accepts `"gaussian"` (default,
byte-identical to prior behavior), `"student_t"` (with required
`degrees_of_freedom`), or `"laplace"`. Heavy-tailed families produce
outlier-prone residuals without explicit outlier injection — useful
for modeling sensor noise, financial returns, or any domain with
fat-tailed observation error. Family dispatch composes orthogonally
with `scale_with_trajectory`: the resolved scale is the same for
every family, only the sampling distribution differs. Config-time
validation rejects `student_t` without `degrees_of_freedom`, rejects
`degrees_of_freedom` on other families, and rejects `df < 1`.
Builder mirror on `NoiseInput`; preset shorthand always resolves to
gaussian. Manifest schema bumps 1.7 → 1.8 — `NoiseConfigInfo` gains
`noise_family` and `degrees_of_freedom`, and its emission criterion
broadens to "heteroscedastic OR non-default family" so the manifest
records the realized noise family whenever it diverges from the
historical lane.

- **`pool.<attr>` source on per_entity_per_period facts.** Widens the
per-entity value-pool surface to the most common fact grain (one
row per entity per period). Two new dispatch handlers —
39 changes: 34 additions & 5 deletions docs/site/config-reference.md
@@ -25,7 +25,7 @@ bridges: [ ... ]
quality: [ ... ]
holdout: { target, periods, min_training_periods }
entity_features: true | false | { metrics, include_labels }
noise: <preset_name> | { gaussian_sigma, outlier_rate, mcar_rate, scale_with_trajectory }
noise: <preset_name> | { gaussian_sigma, outlier_rate, mcar_rate, scale_with_trajectory, noise_family, degrees_of_freedom }
output: csv | parquet | jsonl | sql | { format, directory, cell_budget, denormalized, partition_by, sql_dialect }
locale: <faker locale or list of locales>
seed: <int>
@@ -644,14 +644,18 @@ noise:
  outlier_rate: 0.02
  mcar_rate: 0.01
  scale_with_trajectory: false
  noise_family: gaussian
  degrees_of_freedom: null # required when noise_family is "student_t"
```

| Field | Type | Default | Range | Effect |
|---|---|---|---|---|
| `gaussian_sigma` | `float` | `0.0` | `0.0`–`5.0` | Multiplicative log-normal jitter on each draw — `value *= exp(N(0, σ²))`. Bigger σ = wider spread |
| `gaussian_sigma` | `float` | `0.0` | `0.0`–`5.0` | Multiplicative log-normal jitter on each draw — `value *= exp(N(0, σ²))`. Bigger σ = wider spread. Used by every `noise_family` as the scale parameter |
| `outlier_rate` | `float` | `0.0` | `0.0`–`1.0` | Probability per cell of replacing the value with a 3-σ tail draw |
| `mcar_rate` | `float` | `0.0` | `0.0`–`1.0` | Probability per cell of dropping the value to NaN (missing-completely-at-random) |
| `scale_with_trajectory` | `bool` | `false` | — | When `true`, the gaussian standard deviation at each cell becomes `gaussian_sigma × trajectory_position` instead of `gaussian_sigma × \|value\|`. Position-zero cells receive zero gaussian noise; position-one cells receive the full σ. Outlier and MCAR branches are unchanged. Use when the dataset's noise model should be heteroscedastic — e.g. high-engagement entities exhibit larger observation variance — rather than proportional to the value magnitude |
| `scale_with_trajectory` | `bool` | `false` | — | When `true`, the gaussian standard deviation at each cell becomes `gaussian_sigma × trajectory_position` instead of `gaussian_sigma × \|value\|`. Position-zero cells receive zero gaussian noise; position-one cells receive the full σ. Outlier and MCAR branches are unchanged. Use when the dataset's noise model should be heteroscedastic — e.g. high-engagement entities exhibit larger observation variance — rather than proportional to the value magnitude. Composes orthogonally with `noise_family` |
| `noise_family` | `str` | `"gaussian"` | `"gaussian"` / `"student_t"` / `"laplace"` | Distribution of the additive jitter. `"gaussian"` (default) preserves the historical behavior byte-for-byte. `"student_t"` draws from a Student-t with `degrees_of_freedom` and produces heavier tails (outlier-prone residuals without explicit `outlier_rate`). `"laplace"` draws from a Laplace distribution — sharper peak, heavier tails than Gaussian. Composes with `scale_with_trajectory`: the resolved scale is the same for every family |
| `degrees_of_freedom` | `float` or `null` | `null` | ≥ `1.0` | Required when `noise_family: student_t`; forbidden otherwise (a non-null value with any other family raises at load time). Lower values yield heavier tails; `df = 1` is the Cauchy distribution (undefined mean), and the variance is infinite for any `df ≤ 2`. Typical values: `df = 3`–`5` for visibly heavy tails, `df = 10`–`30` for mild Gaussian-like residuals |
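
The load-time rules in the table above can be sketched in plain Python. This is a minimal illustration only: `check_noise_family` and `ALLOWED_FAMILIES` are hypothetical names, not part of `plotsim`, and the real checks live in the `NoiseConfig` validator.

```python
from typing import Optional

# Hypothetical constant: the three families listed in the table above.
ALLOWED_FAMILIES = ("gaussian", "student_t", "laplace")


def check_noise_family(
    noise_family: str = "gaussian",
    degrees_of_freedom: Optional[float] = None,
) -> None:
    """Raise ValueError for the combinations the table describes as invalid."""
    if noise_family not in ALLOWED_FAMILIES:
        raise ValueError(f"unknown noise_family: {noise_family!r}")
    if degrees_of_freedom is not None and degrees_of_freedom < 1.0:
        raise ValueError("degrees_of_freedom must be >= 1.0")
    if noise_family == "student_t" and degrees_of_freedom is None:
        raise ValueError("noise_family='student_t' requires degrees_of_freedom")
    if noise_family != "student_t" and degrees_of_freedom is not None:
        raise ValueError(
            "degrees_of_freedom is only valid when noise_family='student_t'"
        )


check_noise_family()                   # default gaussian lane: accepted
check_noise_family("student_t", 4.0)   # accepted
check_noise_family("laplace")          # accepted
```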

Four named presets accept the lower-case canonical name OR a friendly
alias — pick whichever reads naturally:
@@ -665,8 +669,33 @@

The same constants are exported from `plotsim` for engine-direct
mutation: `PERFECTLY_CLEAN`, `SLIGHTLY_MESSY`, `REALISTIC`, `DIRTY`.
Presets always set `scale_with_trajectory: false`; opt into the
heteroscedastic lane by passing the explicit dict form.
Presets always set `scale_with_trajectory: false` and
`noise_family: gaussian`; opt into the heteroscedastic lane or a
heavy-tailed family by passing the explicit dict form.

**Picking a heavy-tailed family.** `student_t` with low `df` (3–5)
models occasional large deviations driven by a heavy-tailed underlying
process — sensor failures, financial return spikes, support-ticket
volume after an outage. `laplace` is similar but with a sharper peak
around the center and exponential (rather than power-law) tails — a
good fit when most residuals are small but a non-negligible minority
are several scales out. Both compose with `outlier_rate` if you also
want explicit "blow up the value by 3–10×" injection on top of the
heavy-tailed jitter.

```yaml
# Heavy-tailed noise from a Student-t
noise:
  gaussian_sigma: 0.10
  noise_family: student_t
  degrees_of_freedom: 4

# Laplace residuals, heteroscedastic amplitude
noise:
  gaussian_sigma: 0.05
  scale_with_trajectory: true
  noise_family: laplace
```
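
To get a feel for how much heavier the tails are, the sketch below estimates the fraction of draws that land more than four scales from the center for each family. It is a standard-library stand-in, not the engine's sampler: `draw_jitter` and `tail_fraction` are hypothetical helpers, and the parameterization (a standard Student-t multiplied by the scale, a Laplace with that same scale parameter) is an assumption modeled on the replay code in `plotsim/inspect.py`.

```python
import math
import random


def draw_jitter(rng: random.Random, family: str, scale: float, df: float = 0.0) -> float:
    """Draw one additive jitter value for the given family (illustrative only)."""
    if family == "gaussian":
        return rng.gauss(0.0, scale)
    if family == "student_t":
        # Student-t as Z / sqrt(V / df), where V ~ chi-square(df) = Gamma(df / 2, 2).
        z = rng.gauss(0.0, 1.0)
        v = rng.gammavariate(df / 2.0, 2.0)
        return scale * z / math.sqrt(v / df)
    if family == "laplace":
        # The difference of two independent Exp(1) draws is Laplace(0, 1).
        return scale * (rng.expovariate(1.0) - rng.expovariate(1.0))
    raise ValueError(f"unknown family: {family!r}")


def tail_fraction(family: str, df: float = 0.0, threshold: float = 4.0,
                  n: int = 100_000, seed: int = 0) -> float:
    """Fraction of n draws whose magnitude exceeds threshold * scale."""
    rng = random.Random(seed)
    scale = 1.0
    hits = sum(abs(draw_jitter(rng, family, scale, df)) > threshold * scale
               for _ in range(n))
    return hits / n


if __name__ == "__main__":
    for family, df in [("gaussian", 0.0), ("student_t", 4.0), ("laplace", 0.0)]:
        print(family, tail_fraction(family, df))
```

Note that the same `scale` value does not give the three families the same standard deviation; it only parameterizes each family's own scale.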

`noise` is independent of the `quality` block — `noise` perturbs metric
values *during* generation (correlations and trajectory still hold);
41 changes: 25 additions & 16 deletions docs/site/manifest-reference.md
@@ -70,7 +70,7 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,

| Field | Type | Description |
|---|---|---|
| `schema_version` | `str` | Wire-shape version. Currently `"1.7"` (bumped over time as new additive sections — `causal_graph`, `correlations`, `outlier_injections`, multi-source mappings, `parent_child_relations`, `noise_config` — landed) |
| `schema_version` | `str` | Wire-shape version. Currently `"1.8"` (bumped over time as new additive sections — `causal_graph`, `correlations`, `outlier_injections`, multi-source mappings, `parent_child_relations`, `noise_config` — landed; 1.7 → 1.8 extended `noise_config` with `noise_family` / `degrees_of_freedom`) |
| `seed` | `int` | The seed used for generation — `config.seed` |
| `config_sha256` | `str` | Full SHA-256 hex of the JSON-serialized config. Detects config drift between generation and consumption |
| `archetype_assignments` | array | One entry per entity; see below |
@@ -87,7 +87,7 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,
| `causal_graph` | array | One `CausalEdge` per metric with a non-None `causal_lag`. Empty list when no metric uses `causal_lag` |
| `correlations` | array | One entry per user-declared `config.correlations` pair, with the realized (post-Higham, post-compensation) coefficient. Empty list when no correlations are configured |
| `outlier_injections` | array or `null` | Per-cell outlier-fire log. `null` when skipped (no `outlier_rate`, vectorized mode, or cell budget exceeded). `[]` when the detector ran and observed no firings |
| `noise_config` | object or `null` | Noise-model record. `null` when the run uses the default magnitude-scaled gaussian lane; populated only when `noise.scale_with_trajectory` is `true` |
| `noise_config` | object or `null` | Noise-model record. `null` when the run uses the default magnitude-scaled gaussian lane; populated when EITHER `noise.scale_with_trajectory` is `true` OR `noise.noise_family` is non-default (`"student_t"` / `"laplace"`) |

---

@@ -627,33 +627,42 @@ seed signals a generation regression.

## `noise_config`

Noise-model record — emitted only when the run opted into
heteroscedastic gaussian noise via `noise.scale_with_trajectory: true`.
`null` for the default magnitude-scaled lane (and absent from manifests
produced before `schema_version: "1.7"`).
Noise-model record — emitted whenever the run diverges from the
historical magnitude-scaled gaussian lane. Two triggers, either
sufficient: `noise.scale_with_trajectory: true` (heteroscedastic
amplitude) OR `noise.noise_family` is non-default (heavy-tailed
family — `"student_t"` or `"laplace"`). `null` for the default lane
(and absent from manifests produced before `schema_version: "1.7"`).

```json
{
"noise_config": {
"gaussian_sigma": 0.20,
"outlier_rate": 0.0,
"mcar_rate": 0.0,
"scale_with_trajectory": true
"scale_with_trajectory": true,
"noise_family": "student_t",
"degrees_of_freedom": 4.0
}
}
```
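
The emission criterion ("heteroscedastic OR non-default family") can be sketched as a small function over a plain dict. This is an assumption-laden illustration: `noise_config_record` is a hypothetical name, and the actual manifest writer in `plotsim` may be structured differently.

```python
from typing import Optional


def noise_config_record(noise: dict) -> Optional[dict]:
    """Return the manifest record, or None for the default gaussian lane."""
    scale_with_trajectory = bool(noise.get("scale_with_trajectory", False))
    noise_family = noise.get("noise_family", "gaussian")

    # Either trigger alone is sufficient to emit the record.
    if not scale_with_trajectory and noise_family == "gaussian":
        return None

    return {
        "gaussian_sigma": noise.get("gaussian_sigma", 0.0),
        "outlier_rate": noise.get("outlier_rate", 0.0),
        "mcar_rate": noise.get("mcar_rate", 0.0),
        "scale_with_trajectory": scale_with_trajectory,
        "noise_family": noise_family,
        "degrees_of_freedom": noise.get("degrees_of_freedom"),
    }
```

With this shape, a run that sets only `noise_family: laplace` yields a record whose `scale_with_trajectory` is `false`, which is the case the new field descriptions call out.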

| Field | Type | Description |
|---|---|---|
| `gaussian_sigma` | `float` | The σ multiplier from `config.noise.gaussian_sigma`. Under the heteroscedastic lane the realized scale at a cell is `gaussian_sigma × trajectory_position` |
| `outlier_rate` | `float` | Mirrors `config.noise.outlier_rate`. Unaffected by the heteroscedastic flag — recorded here for completeness so the manifest fully describes the noise model |
| `mcar_rate` | `float` | Mirrors `config.noise.mcar_rate`. Unaffected by the heteroscedastic flag |
| `scale_with_trajectory` | `bool` | Always `true` when this record is present (the field exists for forward compatibility in case the manifest later starts recording the default-off lane as well) |

**Use case** — distinguish a run that opted into position-scaled
gaussian noise from one that didn't, without re-reading the YAML
config. Anomaly-detection scoring that assumes uniform noise variance
can read this field to switch to a position-aware likelihood model.
| `gaussian_sigma` | `float` | The σ multiplier from `config.noise.gaussian_sigma`. Under the heteroscedastic lane the realized scale at a cell is `gaussian_sigma × trajectory_position`; otherwise `gaussian_sigma × \|value\|`. Used by every family as the scale parameter |
| `outlier_rate` | `float` | Mirrors `config.noise.outlier_rate`. Unaffected by the family or heteroscedastic flag — recorded here for completeness so the manifest fully describes the noise model |
| `mcar_rate` | `float` | Mirrors `config.noise.mcar_rate`. Unaffected by the family or heteroscedastic flag |
| `scale_with_trajectory` | `bool` | `true` when the heteroscedastic lane was engaged. `false` when the record was emitted purely because `noise_family` diverged from the default |
| `noise_family` | `str` | The additive-jitter distribution — one of `"gaussian"`, `"student_t"`, `"laplace"`. Mirrors `config.noise.noise_family` |
| `degrees_of_freedom` | `float` or `null` | Populated only when `noise_family == "student_t"`; `null` otherwise |

**Use case**: distinguish a run that opted into position-scaled
gaussian noise or a heavy-tailed noise family from one that didn't,
without re-reading the YAML config. Anomaly-detection scoring that
assumes uniform gaussian noise variance can read this record to switch
to a position-aware or family-aware likelihood model. For example,
switching to a t-distribution likelihood when
`noise_family == "student_t"` keeps the scorer well-calibrated under
the heavier-tailed residuals.
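
A consumer-side scorer might dispatch on the record like this. The sketch is hypothetical: `log_density` is not a `plotsim` function, the caller is assumed to have already resolved the per-cell `scale`, and the parameterization (Student-t multiplied by the scale, Laplace with that same scale parameter) is an assumption based on the replay code in this change.

```python
import math
from typing import Optional


def log_density(residual: float, scale: float, noise_config: Optional[dict]) -> float:
    """Log-density of a residual under the family named in the manifest record.

    A null record (or one without ``noise_family``) is treated as gaussian.
    """
    family = "gaussian"
    if noise_config is not None:
        family = noise_config.get("noise_family", "gaussian")

    if family == "student_t":
        df = float(noise_config["degrees_of_freedom"])
        z = residual / scale
        return (
            math.lgamma((df + 1.0) / 2.0)
            - math.lgamma(df / 2.0)
            - 0.5 * math.log(df * math.pi)
            - math.log(scale)
            - (df + 1.0) / 2.0 * math.log1p(z * z / df)
        )
    if family == "laplace":
        return -math.log(2.0 * scale) - abs(residual) / scale
    # Gaussian fallback, also used for the default (null) record.
    return -0.5 * math.log(2.0 * math.pi * scale * scale) - residual ** 2 / (2.0 * scale ** 2)
```

A residual six scales out is far less surprising under the `student_t` record than under the gaussian default, which is why a gaussian-only scorer would over-flag such cells.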

---

23 changes: 23 additions & 0 deletions plotsim-schema.json
@@ -921,6 +921,29 @@
"default": false,
"title": "Scale With Trajectory",
"type": "boolean"
},
"noise_family": {
"default": "gaussian",
"enum": [
"gaussian",
"student_t",
"laplace"
],
"title": "Noise Family",
"type": "string"
},
"degrees_of_freedom": {
"anyOf": [
{
"minimum": 1.0,
"type": "number"
},
{
"type": "null"
}
],
"default": null,
"title": "Degrees Of Freedom"
}
},
"title": "NoiseConfig",
12 changes: 12 additions & 0 deletions plotsim/builder/input.py
@@ -1145,6 +1145,18 @@ class NoiseInput(BaseModel):
# (``"clean"`` / ``"slightly_messy"`` / ...) always leaves this False;
# users opt in by passing the explicit dict form.
scale_with_trajectory: bool = False
# 0.6-M23: mirrors ``NoiseConfig.noise_family``. Selects the additive
# jitter distribution — ``"gaussian"`` (default, byte-identical to
# pre-M23 behavior), ``"student_t"`` (heavy-tailed; requires
# ``degrees_of_freedom``), or ``"laplace"`` (heavy-tailed, sharper
# peak). Preset shorthand always resolves to ``"gaussian"``.
noise_family: Literal["gaussian", "student_t", "laplace"] = "gaussian"
# 0.6-M23: mirrors ``NoiseConfig.degrees_of_freedom``. Required when
# ``noise_family="student_t"``; forbidden otherwise. The engine-side
# validator on ``NoiseConfig`` raises with a clear message if the
# combination is incoherent — the builder simply passes the field
# through.
degrees_of_freedom: Optional[float] = Field(default=None, ge=1.0)


class OutputInput(BaseModel):
2 changes: 2 additions & 0 deletions plotsim/builder/interpreter.py
@@ -194,6 +194,8 @@ def interpret(user_input: UserInput) -> PlotsimConfig:
outlier_rate=user_input.noise.outlier_rate,
mcar_rate=user_input.noise.mcar_rate,
scale_with_trajectory=user_input.noise.scale_with_trajectory,
noise_family=user_input.noise.noise_family,
degrees_of_freedom=user_input.noise.degrees_of_freedom,
)
else:
noise_cfg = NoiseConfig()
32 changes: 32 additions & 0 deletions plotsim/config.py
@@ -2470,6 +2470,38 @@ class NoiseConfig(_Frozen):
    # MCAR rates are unaffected. Default False preserves the multiplicative-
    # on-magnitude behavior bit-for-bit.
    scale_with_trajectory: bool = False
    # 0.6-M23: distribution family for the additive jitter branch. ``"gaussian"``
    # (default) preserves the historical ``rng.normal`` draw byte-for-byte.
    # ``"student_t"`` draws from a Student-t distribution scaled by ``scale``
    # (heavier tails: outlier-prone residuals without explicit outlier
    # injection). ``"laplace"`` draws from a Laplace distribution scaled by
    # ``scale`` (sharper peak + heavier tails than Gaussian). Composes
    # orthogonally with ``scale_with_trajectory``: in both lanes the realized
    # scale is the same value the gaussian branch would have used, just
    # parameterizing a different family.
    noise_family: Literal["gaussian", "student_t", "laplace"] = "gaussian"
    # 0.6-M23: degrees-of-freedom parameter for ``noise_family="student_t"``.
    # Lower values produce heavier tails; ``df=1`` is the Cauchy limit (no
    # finite mean). Validator below requires this be set (and >= 1.0) when
    # the family is ``student_t``, and absent for every other family.
    degrees_of_freedom: Optional[float] = Field(default=None, ge=1.0)

    @model_validator(mode="after")
    def _validate_noise_family_params(self) -> "NoiseConfig":
        if self.noise_family == "student_t":
            if self.degrees_of_freedom is None:
                raise ValueError(
                    "noise_family='student_t' requires degrees_of_freedom to be set "
                    "(float >= 1.0; lower values mean heavier tails)"
                )
        else:
            if self.degrees_of_freedom is not None:
                raise ValueError(
                    "degrees_of_freedom is only valid when noise_family='student_t'; "
                    f"got noise_family={self.noise_family!r} with "
                    f"degrees_of_freedom={self.degrees_of_freedom}"
                )
        return self


class ManifestConfig(_Frozen):
15 changes: 14 additions & 1 deletion plotsim/inspect.py
@@ -619,6 +619,12 @@ def _detect_noise_branches(
the side RNG in lockstep — same number of bytes consumed, same value
drawn. Callers must pass the same ``trajectory_position`` the engine
saw at this cell.

0.6-M23: when ``noise.noise_family`` is non-default, the replay must
invoke the same family on the side generator so the post-jitter RNG
state matches the engine's. Otherwise the subsequent ``random()`` calls
for outlier and MCAR checks would read from a different byte position,
yielding garbage outlier-injection records in the manifest.
"""
side = np.random.default_rng()
side.bit_generator.state = rng_state_snapshot
@@ -629,7 +635,14 @@ def _detect_noise_branches(
else:
mag = abs(v) if v != 0.0 else 1.0
scale = noise.gaussian_sigma * mag
v = v + float(side.normal(loc=0.0, scale=scale))
family = getattr(noise, "noise_family", "gaussian")
if family == "gaussian":
v = v + float(side.normal(loc=0.0, scale=scale))
elif family == "student_t":
df = float(noise.degrees_of_freedom)
v = v + float(side.standard_t(df)) * scale
else: # "laplace"
v = v + float(side.laplace(loc=0.0, scale=scale))
outlier_fired = False
if noise.outlier_rate > 0.0:
if side.random() < noise.outlier_rate: