diff --git a/CHANGELOG.md b/CHANGELOG.md
index 618733e..3ef8e3f 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -9,6 +9,17 @@ Versioning: [SemVer](https://semver.org/spec/v2.0.0.html).
 
 ### Added
 
+- **Heteroscedastic gaussian noise.** Optional `scale_with_trajectory`
+  flag on `NoiseConfig` (mirror on the builder's `NoiseInput`). When
+  `true`, each cell's gaussian standard deviation becomes
+  `gaussian_sigma × trajectory_position` instead of
+  `gaussian_sigma × |value|` — position-zero cells receive zero
+  gaussian noise, position-one cells receive the full σ. Outlier and
+  MCAR rates are unaffected. Default `false` keeps engine output
+  byte-identical to the magnitude-scaled lane. Manifest schema bumps
+  1.6 → 1.7 with a new optional `noise_config` field populated only
+  when the flag is enabled.
+
 - **`pool.` source on per_entity_per_period facts.** Widens the
   per-entity value-pool surface to the most common fact grain (one row
   per entity per period). Two new dispatch handlers —
diff --git a/docs/site/config-reference.md b/docs/site/config-reference.md
index 285f6ae..f5b4726 100644
--- a/docs/site/config-reference.md
+++ b/docs/site/config-reference.md
@@ -25,7 +25,7 @@ bridges: [ ... ]
 quality: [ ... ]
 holdout: { target, periods, min_training_periods }
 entity_features: true | false | { metrics, include_labels }
-noise: | { gaussian_sigma, outlier_rate, mcar_rate }
+noise: | { gaussian_sigma, outlier_rate, mcar_rate, scale_with_trajectory }
 output: csv | parquet | jsonl | sql | { format, directory, cell_budget, denormalized, partition_by, sql_dialect }
 locale:
 seed:
@@ -643,6 +643,7 @@ noise:
   gaussian_sigma: 0.05
   outlier_rate: 0.02
   mcar_rate: 0.01
+  scale_with_trajectory: false
 ```
 
 | Field | Type | Default | Range | Effect |
@@ -650,6 +651,7 @@ noise:
 | `gaussian_sigma` | `float` | `0.0` | `0.0`–`5.0` | Multiplicative log-normal jitter on each draw — `value *= exp(N(0, σ²))`. Bigger σ = wider spread |
 | `outlier_rate` | `float` | `0.0` | `0.0`–`1.0` | Probability per cell of replacing the value with a 3-σ tail draw |
 | `mcar_rate` | `float` | `0.0` | `0.0`–`1.0` | Probability per cell of dropping the value to NaN (missing-completely-at-random) |
+| `scale_with_trajectory` | `bool` | `false` | — | When `true`, the gaussian standard deviation at each cell becomes `gaussian_sigma × trajectory_position` instead of `gaussian_sigma × \|value\|`. Position-zero cells receive zero gaussian noise; position-one cells receive the full σ. Outlier and MCAR branches are unchanged. Use when the dataset's noise model should be heteroscedastic — e.g. high-engagement entities exhibit larger observation variance — rather than proportional to the value magnitude |
 
 Four named presets accept the lower-case canonical name OR a friendly
 alias — pick whichever reads naturally:
@@ -663,6 +665,8 @@ alias — pick whichever reads naturally:
 
 The same constants are exported from `plotsim` for engine-direct
 mutation: `PERFECTLY_CLEAN`, `SLIGHTLY_MESSY`, `REALISTIC`, `DIRTY`.
+Presets always set `scale_with_trajectory: false`; opt into the
+heteroscedastic lane by passing the explicit dict form.
 
 `noise` is independent of the `quality` block — `noise` perturbs metric
 values *during* generation (correlations and trajectory still hold);
diff --git a/docs/site/manifest-reference.md b/docs/site/manifest-reference.md
index 0ec130d..bfa6f02 100644
--- a/docs/site/manifest-reference.md
+++ b/docs/site/manifest-reference.md
@@ -46,7 +46,7 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,
 
 ```json
 {
-  "schema_version": "1.6",
+  "schema_version": "1.7",
   "seed": 42,
   "config_sha256": "<64-char hex>",
   "archetype_assignments": [...],
@@ -63,13 +63,14 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,
   "causal_graph": [...],
   "correlations": [...],
   "outlier_injections": [...] | null,
-  "parent_child_relations": [...]
+  "parent_child_relations": [...],
+  "noise_config": {...} | null
 }
 ```
 
 | Field | Type | Description |
 |---|---|---|
-| `schema_version` | `str` | Wire-shape version. Currently `"1.6"` (bumped over time as new additive sections — `causal_graph`, `correlations`, `outlier_injections`, multi-source mappings, `parent_child_relations` — landed) |
+| `schema_version` | `str` | Wire-shape version. Currently `"1.7"` (bumped over time as new additive sections — `causal_graph`, `correlations`, `outlier_injections`, multi-source mappings, `parent_child_relations`, `noise_config` — landed) |
 | `seed` | `int` | The seed used for generation — `config.seed` |
 | `config_sha256` | `str` | Full SHA-256 hex of the JSON-serialized config. Detects config drift between generation and consumption |
 | `archetype_assignments` | array | One entry per entity; see below |
@@ -86,6 +87,7 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,
 | `causal_graph` | array | One `CausalEdge` per metric with a non-None `causal_lag`. Empty list when no metric uses `causal_lag` |
 | `correlations` | array | One entry per user-declared `config.correlations` pair, with the realized (post-Higham, post-compensation) coefficient. Empty list when no correlations are configured |
 | `outlier_injections` | array or `null` | Per-cell outlier-fire log. `null` when skipped (no `outlier_rate`, vectorized mode, or cell budget exceeded). `[]` when the detector ran and observed no firings |
+| `noise_config` | object or `null` | Noise-model record. `null` when the run uses the default magnitude-scaled gaussian lane; populated only when `noise.scale_with_trajectory` is `true` |
 
 ---
 
@@ -623,6 +625,38 @@ seed signals a generation regression.
 
 ---
 
+## `noise_config`
+
+Noise-model record — emitted only when the run opted into
+heteroscedastic gaussian noise via `noise.scale_with_trajectory: true`.
+`null` for the default magnitude-scaled lane (and absent from manifests
+produced before `schema_version: "1.7"`).
+
+```json
+{
+  "noise_config": {
+    "gaussian_sigma": 0.20,
+    "outlier_rate": 0.0,
+    "mcar_rate": 0.0,
+    "scale_with_trajectory": true
+  }
+}
+```
+
+| Field | Type | Description |
+|---|---|---|
+| `gaussian_sigma` | `float` | The σ multiplier from `config.noise.gaussian_sigma`. Under the heteroscedastic lane the realized scale at a cell is `gaussian_sigma × trajectory_position` |
+| `outlier_rate` | `float` | Mirrors `config.noise.outlier_rate`. Unaffected by the heteroscedastic flag — recorded here for completeness so the manifest fully describes the noise model |
+| `mcar_rate` | `float` | Mirrors `config.noise.mcar_rate`. Unaffected by the heteroscedastic flag |
+| `scale_with_trajectory` | `bool` | Always `true` when this record is present (the field exists for forward compatibility in case the manifest later starts recording the default-off lane as well) |
+
+**Use case** — distinguish a run that opted into position-scaled
+gaussian noise from one that didn't, without re-reading the YAML
+config. Anomaly-detection scoring that assumes uniform noise variance
+can read this field to switch to a position-aware likelihood model.
+
+---
+
 ## Reading the manifest in Python
 
 ```python