Merged
11 changes: 11 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,17 @@ Versioning: [SemVer](https://semver.org/spec/v2.0.0.html).

### Added

- **Heteroscedastic gaussian noise.** Optional `scale_with_trajectory`
flag on `NoiseConfig` (mirrored on the builder's `NoiseInput`). When
`true`, each cell's gaussian standard deviation becomes
`gaussian_sigma × trajectory_position` instead of
`gaussian_sigma × |value|` — position-zero cells receive zero
gaussian noise, position-one cells receive the full σ. Outlier and
MCAR rates are unaffected. Default `false` keeps engine output
byte-identical to the magnitude-scaled lane. Manifest schema bumps
1.6 → 1.7 with a new optional `noise_config` field populated only
when the flag is enabled.

- **`pool.<attr>` source on per_entity_per_period facts.** Widens the
per-entity value-pool surface to the most common fact grain (one
row per entity per period). Two new dispatch handlers —
6 changes: 5 additions & 1 deletion docs/site/config-reference.md
@@ -25,7 +25,7 @@ bridges: [ ... ]
quality: [ ... ]
holdout: { target, periods, min_training_periods }
entity_features: true | false | { metrics, include_labels }
noise: <preset_name> | { gaussian_sigma, outlier_rate, mcar_rate }
noise: <preset_name> | { gaussian_sigma, outlier_rate, mcar_rate, scale_with_trajectory }
output: csv | parquet | jsonl | sql | { format, directory, cell_budget, denormalized, partition_by, sql_dialect }
locale: <faker locale or list of locales>
seed: <int>
@@ -643,13 +643,15 @@ noise:
gaussian_sigma: 0.05
outlier_rate: 0.02
mcar_rate: 0.01
scale_with_trajectory: false
```

| Field | Type | Default | Range | Effect |
|---|---|---|---|---|
| `gaussian_sigma` | `float` | `0.0` | `0.0`–`5.0` | Multiplicative log-normal jitter on each draw — `value *= exp(N(0, σ²))`. Bigger σ = wider spread |
| `outlier_rate` | `float` | `0.0` | `0.0`–`1.0` | Probability per cell of replacing the value with a 3-σ tail draw |
| `mcar_rate` | `float` | `0.0` | `0.0`–`1.0` | Probability per cell of dropping the value to NaN (missing-completely-at-random) |
| `scale_with_trajectory` | `bool` | `false` | — | When `true`, the gaussian standard deviation at each cell becomes `gaussian_sigma × trajectory_position` instead of `gaussian_sigma × \|value\|`. Position-zero cells receive zero gaussian noise; position-one cells receive the full σ. Outlier and MCAR branches are unchanged. Use when the dataset's noise model should be heteroscedastic — e.g. high-engagement entities exhibit larger observation variance — rather than proportional to the value magnitude |
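
As a rough illustration of the `scale_with_trajectory` row above, the sketch below computes the gaussian standard deviation for one cell under each lane. `gaussian_scale` is a hypothetical helper, not part of `plotsim`, and it assumes `trajectory_position` lies in [0, 1]; how the engine turns this scale into an actual draw is not shown.

```python
def gaussian_scale(value: float, trajectory_position: float,
                   gaussian_sigma: float,
                   scale_with_trajectory: bool = False) -> float:
    """Hypothetical helper: per-cell gaussian standard deviation."""
    if scale_with_trajectory:
        # Heteroscedastic lane: the scale follows the trajectory position.
        return gaussian_sigma * trajectory_position
    # Default lane: the scale follows the value magnitude.
    return gaussian_sigma * abs(value)
```

With `gaussian_sigma=0.2` and the flag on, a cell at position 0 gets a scale of 0 and a cell at position 1 gets 0.2, regardless of the value's magnitude.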

Four named presets accept the lower-case canonical name OR a friendly
alias — pick whichever reads naturally:
@@ -663,6 +665,8 @@ alias — pick whichever reads naturally:

The same constants are exported from `plotsim` for engine-direct
mutation: `PERFECTLY_CLEAN`, `SLIGHTLY_MESSY`, `REALISTIC`, `DIRTY`.
Presets always set `scale_with_trajectory: false`; opt into the
heteroscedastic lane by passing the explicit dict form.
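
A minimal sketch of the explicit dict form, written as a plain Python dict that mirrors the YAML block above. `check_noise` and `NOISE_RANGES` are hypothetical names that apply the ranges from the field table; they are not `plotsim` APIs, and how the dict is handed to the engine is not shown here.

```python
# Ranges copied from the field table; scale_with_trajectory is a plain bool.
NOISE_RANGES = {
    "gaussian_sigma": (0.0, 5.0),
    "outlier_rate": (0.0, 1.0),
    "mcar_rate": (0.0, 1.0),
}

def check_noise(noise: dict) -> dict:
    """Hypothetical validator: raise ValueError on out-of-range fields."""
    for field, (low, high) in NOISE_RANGES.items():
        value = noise.get(field, 0.0)  # documented default is 0.0
        if not low <= value <= high:
            raise ValueError(f"{field}={value} outside [{low}, {high}]")
    if not isinstance(noise.get("scale_with_trajectory", False), bool):
        raise ValueError("scale_with_trajectory must be a bool")
    return noise

heteroscedastic_noise = check_noise({
    "gaussian_sigma": 0.20,
    "outlier_rate": 0.0,
    "mcar_rate": 0.0,
    "scale_with_trajectory": True,  # presets always leave this False
})
```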

`noise` is independent of the `quality` block — `noise` perturbs metric
values *during* generation (correlations and trajectory still hold);
40 changes: 37 additions & 3 deletions docs/site/manifest-reference.md
@@ -46,7 +46,7 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,

```json
{
"schema_version": "1.6",
"schema_version": "1.7",
"seed": 42,
"config_sha256": "<64-char hex>",
"archetype_assignments": [...],
@@ -63,13 +63,14 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,
"causal_graph": [...],
"correlations": [...],
"outlier_injections": [...] | null,
"parent_child_relations": [...]
"parent_child_relations": [...],
"noise_config": {...} | null
}
```

| Field | Type | Description |
|---|---|---|
| `schema_version` | `str` | Wire-shape version. Currently `"1.6"` (bumped over time as new additive sections — `causal_graph`, `correlations`, `outlier_injections`, multi-source mappings, `parent_child_relations` — landed) |
| `schema_version` | `str` | Wire-shape version. Currently `"1.7"` (bumped over time as new additive sections — `causal_graph`, `correlations`, `outlier_injections`, multi-source mappings, `parent_child_relations`, `noise_config` — landed) |
| `seed` | `int` | The seed used for generation — `config.seed` |
| `config_sha256` | `str` | Full SHA-256 hex of the JSON-serialized config. Detects config drift between generation and consumption |
| `archetype_assignments` | array | One entry per entity; see below |
@@ -86,6 +87,7 @@ produces a byte-identical `manifest.json`. Encoding: UTF-8,
| `causal_graph` | array | One `CausalEdge` per metric with a non-None `causal_lag`. Empty list when no metric uses `causal_lag` |
| `correlations` | array | One entry per user-declared `config.correlations` pair, with the realized (post-Higham, post-compensation) coefficient. Empty list when no correlations are configured |
| `outlier_injections` | array or `null` | Per-cell outlier-fire log. `null` when skipped (no `outlier_rate`, vectorized mode, or cell budget exceeded). `[]` when the detector ran and observed no firings |
| `noise_config` | object or `null` | Noise-model record. `null` when the run uses the default magnitude-scaled gaussian lane; populated only when `noise.scale_with_trajectory` is `true` |

---

@@ -623,6 +625,38 @@ seed signals a generation regression.

---

## `noise_config`

Noise-model record — emitted only when the run opted into
heteroscedastic gaussian noise via `noise.scale_with_trajectory: true`.
`null` for the default magnitude-scaled lane (and absent from manifests
produced before `schema_version: "1.7"`).

```json
{
"noise_config": {
"gaussian_sigma": 0.20,
"outlier_rate": 0.0,
"mcar_rate": 0.0,
"scale_with_trajectory": true
}
}
```

| Field | Type | Description |
|---|---|---|
| `gaussian_sigma` | `float` | The σ multiplier from `config.noise.gaussian_sigma`. Under the heteroscedastic lane the realized scale at a cell is `gaussian_sigma × trajectory_position` |
| `outlier_rate` | `float` | Mirrors `config.noise.outlier_rate`. Unaffected by the heteroscedastic flag — recorded here for completeness so the manifest fully describes the noise model |
| `mcar_rate` | `float` | Mirrors `config.noise.mcar_rate`. Unaffected by the heteroscedastic flag |
| `scale_with_trajectory` | `bool` | Always `true` when this record is present (the field exists for forward compatibility in case the manifest later starts recording the default-off lane as well) |

**Use case** — distinguish a run that opted into position-scaled
gaussian noise from one that didn't, without re-reading the YAML
config. Anomaly-detection scoring that assumes uniform noise variance
can read this field to switch to a position-aware likelihood model.
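
The check described above could look like the following sketch. `uses_position_scaled_noise` is a hypothetical helper name; it relies only on the documented behavior that `noise_config` is `null` for the default lane and absent from manifests older than schema version 1.7.

```python
import json

def uses_position_scaled_noise(manifest: dict) -> bool:
    """Hypothetical helper: True when the run used position-scaled gaussian noise."""
    # .get() also covers pre-1.7 manifests, where the key is absent entirely.
    noise = manifest.get("noise_config")
    return noise is not None and noise.get("scale_with_trajectory") is True

manifest = json.loads('{"schema_version": "1.7", "noise_config": null}')
print(uses_position_scaled_noise(manifest))  # False
```

A scorer can branch on the result: keep the uniform-variance likelihood when it is `False`, and switch to a position-aware one when it is `True`.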

---

## Reading the manifest in Python

```python