26 changes: 26 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,32 @@ Versioning: [SemVer](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Changed

- **Bundled template catalog refreshed.** `plotsim.list_templates()`
now returns exactly six domain templates: `banking`, `health`,
`hr`, `marketing`, `retail`, `saas`. Each is schema-realistic
(real column topology and FK shapes for the domain),
output-realistic (pool, range, distribution, correlation, and seasonality
choices match the domain's real data shape), and feature-deep —
every template exercises SCD2, lifecycle stages, 3+ correlations,
causal lag, seasonality, and 2 event tables; CDC on the relevant
fact for each domain; per-metric treatment cohorts on `marketing`
and `health`; bridge tables on `hr`, `retail`, `banking`, and
`health`; parent/child fact grain on `retail`, `banking`, and
`health`; cross-fact FK on `retail` and `health`; geo bundle on
`retail`, `banking`, and `health`; narrative columns on `hr`,
`retail`, `banking`, and `health`; heteroscedastic noise on
`saas` and `health`; student-t noise on `banking`; holdout splits
on `banking` and `health`; sub-entity dim on `saas`; multi-locale
on `retail`. The previous catalog of fourteen mixed-purpose
templates — `ab_trial`, `bare_minimum`, `cdc_demo`,
`crm_billing_overlap`, `education`, `geo_retail`, `lakehouse`,
`latency_skew`, `narrative_reviews`, `orders` — has been demoted
from public surface: the feature-vehicle YAMLs and `.py`
companions for each now live under `tests/configs/` and continue
to power the existing feature-coverage test files unchanged.

### Added

- **Manifest decomposition + regression sections.** The manifest sidecar
2 changes: 1 addition & 1 deletion CONTRIBUTING.md
@@ -41,7 +41,7 @@ your config shape:

Steps:

-1. Copy an existing template (e.g. `saas_template.yaml` +
+1. Copy an existing template (e.g. `saas.yaml` +
`saas_template.py`) as a starting point.
2. Edit metrics, segments / archetypes, dimensions, facts, events, and
any feature-specific blocks for the new use case.
8 changes: 3 additions & 5 deletions docs/site/api-reference.md
@@ -131,12 +131,10 @@ Return the names of bundled builder templates.
def list_templates() -> list[str]
```

-Names round-trip through [`load_template`](#load_template). Templates
-whose filename ends in `_template` strip that suffix; `bare_minimum`
-and the single-feature templates keep their full stems. Sorted
-alphabetically.
+Names round-trip through [`load_template`](#load_template). The bundled
+catalog covers six domains, sorted alphabetically.

-**Returns** — e.g. `["ab_trial", "bare_minimum", "cdc_demo", "crm_billing_overlap", "education", "geo_retail", "hr", "lakehouse", "latency_skew", "marketing", "narrative_reviews", "retail", "saas"]`.
+**Returns** — `["banking", "health", "hr", "marketing", "retail", "saas"]`.

**Example**

10 changes: 7 additions & 3 deletions docs/site/column-types.md
@@ -144,7 +144,9 @@ Output dtype is `float` for `latitude` / `longitude` and `string`
for everything else. `geo.<field>` is dim-only; on facts and
events the engine raises `unsupported generated provider`. See
[Geo hierarchy](./user-guide/geo-hierarchy.md) for the underlying
-dataset, determinism, and the bundled `geo_retail` template.
+dataset, determinism, and the `tests/configs/geo_retail.yaml`
+worked example; the bundled `retail`, `banking`, and `health`
+domain templates each put a geo bundle on their customer/patient dim.

---

@@ -182,8 +184,10 @@ builder API). `narrative` is fact-only and per_entity_per_period;
the cell builder forces the scalar fact path because it consumes one
RNG draw per slot per row. See
[Narrative text source](./user-guide/narrative-source.md) for the
-lexicon-design playbook, validation gates, and the bundled
-`narrative_reviews` template.
+lexicon-design playbook, validation gates, and the
+`tests/configs/narrative_reviews.yaml` worked example; narrative
+columns also ship on the bundled `hr`, `retail`, `banking`, and
+`health` domain templates.

---

6 changes: 3 additions & 3 deletions docs/site/config-reference.md
@@ -725,7 +725,7 @@ output:
|---|---|---|---|
| `format` | `"csv"` / `"parquet"` / `"jsonl"` / `"sql"` | `"csv"` | `parquet` requires `pip install plotsim[parquet]` (pyarrow) and produces typed binary files ~5–10× smaller than CSV. `jsonl` writes newline-delimited JSON (one self-contained object per row) for streaming-ingestion / schema-on-read consumers. `sql` writes a single `data.sql` file with dialect-aware DDL + batched INSERTs instead of per-table files |
| `directory` | `str` | `"output"` | Where `write_tables` writes. Override at call time with `write_tables(..., output_dir=...)` |
-| `cell_budget` | `int ≥ 0` / `null` | `null` | Soft cell-count cap consumed by the load-time scale estimator. `null` falls through to `PLOTSIM_CELL_BUDGET` env var, then to the 2,000,000 default. `0` disables the soft cap entirely. See [Cell-count budget](#cell-count-budget) for precedence and the bundled `lakehouse` template for a worked example |
+| `cell_budget` | `int ≥ 0` / `null` | `null` | Soft cell-count cap consumed by the load-time scale estimator. `null` falls through to `PLOTSIM_CELL_BUDGET` env var, then to the 2,000,000 default. `0` disables the soft cap entirely. See [Cell-count budget](#cell-count-budget) for precedence and `tests/configs/lakehouse.yaml` for a worked example |
| `denormalized` | `bool` | `false` | Opt-in wide-table companion writer. When `true`, every fact table is left-joined with its FK'd dims (SCD2 dims filtered to current state) and emits `<fct>_wide.<ext>` alongside the normalized output. Under `format: sql` the wide tables emit as trailing blocks inside `data.sql` instead of separate files |
| `partition_by` | `str` / `null` | `null` | Column name to partition Parquet output on. When set, every table that carries the column is written as a Hive-style directory (`<output_dir>/<table>/<col>=<value>/...`) via `pyarrow.parquet.write_to_dataset`. Tables without the column fall back to single files. Requires `format: parquet`; cross-validated at config load |
| `sql_dialect` | `"postgresql"` / `"mysql"` / `"sqlite"` | `"postgresql"` | Dialect for the SQL dump writer — selects identifier quoting (`"col"` for PG/SQLite, `` `col` `` for MySQL), type words (PG `NUMERIC` / MySQL `DOUBLE` + `VARCHAR(255)` for string PKs / SQLite `REAL`), and boolean encoding. The default round-trips under any format; explicit `mysql` / `sqlite` requires `format: sql` (cross-validated at config load) |
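
As a rough illustration of the `partition_by` layout described in the table, the sketch below builds Hive-style directory names from plain dict rows. It does not call plotsim or pyarrow: `hive_partition_dirs`, the `fct_orders` table name, and the sample rows are hypothetical, and the `.parquet` fallback filename is an assumption about how a single-file table would be named.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def hive_partition_dirs(output_dir, table, rows, partition_by):
    """Group rows by partition value and return {path: rows}.

    Hypothetical helper: it only mimics the
    <output_dir>/<table>/<col>=<value>/ layout, not the real writer.
    """
    base = PurePosixPath(output_dir)
    # Tables that do not carry the column fall back to a single file.
    if not rows or partition_by not in rows[0]:
        return {str(base / f"{table}.parquet"): list(rows)}
    groups = defaultdict(list)
    for row in rows:
        directory = base / table / f"{partition_by}={row[partition_by]}"
        groups[str(directory)].append(row)
    return dict(groups)


rows = [
    {"region": "emea", "amount": 10},
    {"region": "apac", "amount": 7},
    {"region": "emea", "amount": 3},
]
layout = hive_partition_dirs("output", "fct_orders", rows, "region")
fallback = hive_partition_dirs("output", "fct_orders", rows, "country")
for path in sorted(layout):
    print(path, len(layout[path]))
```

Each distinct value of the partition column becomes its own directory, while a table lacking the column stays a single file.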
@@ -894,8 +894,8 @@ precedence order (the first one that resolves wins):
1. **Config field (recommended)** — set `output.cell_budget: N` in
the YAML (or pass `output={"cell_budget": N}` to `create()`).
Reproducible from the config alone — no env vars or flags
-required, which is the contract the bundled `lakehouse`
-template relies on.
+required, which is the contract the `tests/configs/lakehouse.yaml`
+worked example relies on.
2. **Environment variable** — `PLOTSIM_CELL_BUDGET=N` sets the
soft cap to `N` cells when no config field is set.
3. **Default** — `2,000,000` cells.
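
The precedence order above can be sketched in a few lines of plain Python. This is a minimal illustration, not plotsim's implementation: `resolve_cell_budget` is a hypothetical helper, and treating an empty environment variable as unset is an assumption.

```python
import os

DEFAULT_CELL_BUDGET = 2_000_000  # default quoted in the list above


def resolve_cell_budget(config_value=None, environ=None):
    """Return the soft cap in cells, or None when the cap is disabled.

    Hypothetical helper mirroring the documented order:
    config field, then PLOTSIM_CELL_BUDGET, then the default.
    """
    environ = os.environ if environ is None else environ
    env_value = environ.get("PLOTSIM_CELL_BUDGET")
    if config_value is not None:
        budget = config_value
    elif env_value not in (None, ""):  # assumption: empty string means unset
        budget = int(env_value)
    else:
        budget = DEFAULT_CELL_BUDGET
    if budget < 0:
        raise ValueError("cell budget must be >= 0")
    # 0 disables the soft cap entirely.
    return None if budget == 0 else budget


from_config = resolve_cell_budget(500_000, {"PLOTSIM_CELL_BUDGET": "9"})
from_env = resolve_cell_budget(None, {"PLOTSIM_CELL_BUDGET": "1500000"})
from_default = resolve_cell_budget(None, {})
disabled = resolve_cell_budget(0, {})
```

Passing the environment as a plain dict keeps the sketch deterministic; the config value wins even when the variable is also set.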
9 changes: 5 additions & 4 deletions docs/site/cookbook/data-engineering.md
@@ -56,8 +56,9 @@ whichever fits your workflow.

Or skip the YAML round-trip entirely — the
[`saas_template.py`](https://github.com/mohossam01/plotsim/blob/main/plotsim/configs/templates/saas_template.py)
-bundled with plotsim shows the same template authored as
-`create(**kwargs)` directly.
+bundled with plotsim shows the same SaaS template authored as
+`create(**kwargs)` directly, paired with `saas.yaml` in the
+same directory.

Pin `seed:` in the YAML (or pass `seed=42` to `create`) and the fixture
is byte-stable across CI runs.
@@ -351,8 +352,8 @@ in the config (recommended; reproducible from YAML alone),
`output.cell_budget: 0` (or `PLOTSIM_CELL_BUDGET=0`) disables the soft
cap entirely; only the `50,000,000`-cell hard ceiling still applies.
See [Limits](../config-reference.md#limits-and-performance-gates) for
-the full ladder and the bundled `lakehouse` template for a worked
-example of a 1.5M-cell config.
+the full ladder; `tests/configs/lakehouse.yaml` in the repo is a
+worked example of a config near the 1.5M-cell range.

---

11 changes: 6 additions & 5 deletions docs/site/cookbook/data-science.md
@@ -50,8 +50,9 @@ multi-metric dataset with archetype ground truth.
```

The [`saas_template.py`](https://github.com/mohossam01/plotsim/blob/main/plotsim/configs/templates/saas_template.py)
-companion shows the same template authored as a `create(**kwargs)`
-call — every YAML field maps 1-1 to a Python keyword.
+companion (paired with `saas.yaml` in the same directory) shows the
+same SaaS template authored as a `create(**kwargs)` call — every YAML
+field maps 1-1 to a Python keyword.

---

@@ -228,9 +229,9 @@ time, not just larger ones.

All six builder distribution families (`lognorm`, `gamma`,
`weibull`, `beta`, `normal`, `poisson`) are pinnable the same way
-via `MetricInput.distribution` + `distribution_params`. The bundled
-`latency_skew` template (`plotsim template latency_skew`) exercises
-all six on a single config. Full mechanics:
+via `MetricInput.distribution` + `distribution_params`. The
+`tests/configs/latency_skew.yaml` worked example exercises all six
+on a single config. Full mechanics:
[`metrics-and-connections.md` §pinning the distribution explicitly](../user-guide/metrics-and-connections.md#pinning-the-distribution-explicitly).

---
18 changes: 9 additions & 9 deletions docs/site/feature-reference.md
@@ -22,7 +22,7 @@ Three surfaces today:
|---|---|---|
| Library | `plotsim.create`, `create_from_yaml`, `generate_tables`, `write_tables` | Python users in an IDE or notebook |
| CLI | `plotsim run`, `validate`, `info`, `template`, `schema` | Terminal, CI, scripts |
-| YAML | bundled templates: `ab_trial`, `bare_minimum`, `cdc_demo`, `crm_billing_overlap`, `education`, `geo_retail`, `hr`, `lakehouse`, `latency_skew`, `marketing`, `narrative_reviews`, `retail`, `saas` | Anyone who wants to hand-edit a config |
+| YAML | bundled domain templates: `banking`, `health`, `hr`, `marketing`, `retail`, `saas` | Anyone who wants to hand-edit a config |

---

@@ -39,7 +39,7 @@ integrity / provenance tooling.
|---|---|---|
| Trajectory-first metric generation | Every metric for an entity at time *t* is derived from one archetype-curve position | `generate_tables(cfg)` |
| Determinism | Single seeded `numpy.random.Generator` flows through every random draw | YAML `seed:` (integer) |
-| Cell-budget scale gate | Soft pre-flight guard that aborts runs above the configured cell ceiling. Precedence: `output.cell_budget` field > `PLOTSIM_CELL_BUDGET` env > 2M default; `0` disables. Bundled template `lakehouse` exercises a 1.5M-cell config. | YAML `output.cell_budget: <int>`; env override `PLOTSIM_CELL_BUDGET` / `PLOTSIM_ALLOW_LARGE_DATASET` |
+| Cell-budget scale gate | Soft pre-flight guard that aborts runs above the configured cell ceiling. Precedence: `output.cell_budget` field > `PLOTSIM_CELL_BUDGET` env > 2M default; `0` disables. `tests/configs/lakehouse.yaml` is a worked example near the 1.5M-cell range. | YAML `output.cell_budget: <int>`; env override `PLOTSIM_CELL_BUDGET` / `PLOTSIM_ALLOW_LARGE_DATASET` |

#### Tables emitted

@@ -114,7 +114,7 @@ is no longer byte-identical to a pre-flag run of the same file.
|---|---|---|
| Lifecycle stages | Per-entity stage sequence with stage-specific archetype overrides | YAML `lifecycle:` |
| Cohort arrival distribution | Per-segment entity arrival shape — `uniform` / `linear` / `step` / `explicit` — driving `Entity.start_period`, so the entity body grows or contracts across the window. Cold-start cells are NaN-filled and dropped pre-write. Validator enforces every entity has ≥2 active periods. | builder kwarg `arrival:` on segments (4-shape discriminated union); YAML `Entity.start_period` directly |
-| Treatment / control cohorts | Per-entity treatment assignment with a logit-shift on trajectory position from `treatment_start_period` onward (`treatment_lift_log_odds`). Known effect → A/B test analysis, uplift modeling, causal inference. Manifest carries `TreatmentAssignment` per entity + `TreatmentCohort` per segment. Bundled template `ab_trial`. | YAML `Entity.treatment_group` / `treatment_lift_log_odds` / `treatment_start_period` |
+| Treatment / control cohorts | Per-entity treatment assignment with a logit-shift on trajectory position from `treatment_start_period` onward (`treatment_lift_log_odds`). Known effect → A/B test analysis, uplift modeling, causal inference. Manifest carries `TreatmentAssignment` per entity + `TreatmentCohort` per segment. Demonstrated on bundled `marketing` and `health` (per-metric lifts) and `banking` (whole-trajectory lift); `tests/configs/ab_trial.yaml` is the dedicated worked example. | YAML `Entity.treatment_group` / `treatment_lift_log_odds` / `treatment_start_period` |
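
To make the logit-shift in the treatment row concrete, here is a minimal sketch of shifting a trajectory position in log-odds space from a start period onward. It is an illustration under stated assumptions, not the engine's code: `apply_treatment` is a hypothetical helper, positions are assumed to lie strictly between 0 and 1, and periods are assumed to be 0-indexed.

```python
import math


def logit(p):
    """Log-odds of a probability-like position in (0, 1)."""
    return math.log(p / (1.0 - p))


def sigmoid(x):
    """Inverse of logit."""
    return 1.0 / (1.0 + math.exp(-x))


def apply_treatment(positions, lift_log_odds, start_period):
    """Shift positions by a constant log-odds lift from start_period onward."""
    shifted = []
    for period, position in enumerate(positions):
        if period >= start_period:
            position = sigmoid(logit(position) + lift_log_odds)
        shifted.append(position)
    return shifted


control = [0.30, 0.35, 0.40, 0.45]
treated = apply_treatment(control, lift_log_odds=0.5, start_period=2)
```

Because the lift is applied in log-odds space, the treated positions stay inside (0, 1) and the effect size is known exactly, which is what makes the cohort usable as A/B ground truth.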

### 6. Dim columns + fact-grain text — fill non-metric cells with realistic content

@@ -124,20 +124,20 @@ is no longer byte-identical to a pre-flag run of the same file.
| Faker-backed text + identifiers | PII-shape providers wired into the engine: `name`, `email`, `phone_number`, `company`, `address`, `postcode`, `country`, `city`, `latitude`, `longitude`, `sentence`. Deterministic under the run seed. Useful for masking exercises and regex-validation scenarios; **does not read entity, archetype, or trajectory** (each call is an independent draw). |
| Range source | `type: range` with `range: [min, max]` on fact / event columns produces a per-row uniform draw between the bounds. Integer bounds → `dtype: int` and inclusive upper bound; float bounds → `dtype: float` and exclusive upper bound (numpy conventions). Use it for `quantity ∈ [1, 5]`, `unit_price ∈ [10.0, 500.0]`, and similar shape constraints that `faker.random_int` / `faker.pyfloat` express less precisely. Deterministic under seed. |
| Pool source on facts and events | `type: pool.<attribute>` lifts the per-entity value pool (previously dim-only) onto per_entity_per_period facts, variable-grain facts, per_parent_row child facts, and event tables. Every row resolves to its entity's segment, then draws uniformly from `attributes[<attr>]` — so a `loyal` cohort customer's `channel` always lands in `[app, web]` while a `casual` customer's lands in `[sms, email]`. Per_period facts (the `dim_date`-style grain) remain out of scope — those rows have no per-row entity binding. |
-| Narrative text source (trajectory-aware) | Per-archetype lexicons + a sentence template rendered into a `narrative` column on a fact table. Output vocabulary tracks the entity's trajectory position (a high-position `growth` entity produces systematically different text than a low-position `decline` entity); a simple bag-of-words classifier hits ≥0.55 accuracy on archetype prediction. Deterministic under seed; preserves the trajectory-first invariant. **Fact-only** (rejected on dim / event tables at config load). **Performance:** forces the scalar fact builder path (~3-10× slower than vectorized metric-only facts), so keep narrative on tables that genuinely need text. Bundled template `narrative_reviews`. See [Narrative source](./user-guide/narrative-source.md). |
+| Narrative text source (trajectory-aware) | Per-archetype lexicons + a sentence template rendered into a `narrative` column on a fact table. Output vocabulary tracks the entity's trajectory position (a high-position `growth` entity produces systematically different text than a low-position `decline` entity); a simple bag-of-words classifier hits ≥0.55 accuracy on archetype prediction. Deterministic under seed; preserves the trajectory-first invariant. **Fact-only** (rejected on dim / event tables at config load). **Performance:** forces the scalar fact builder path (~3-10× slower than vectorized metric-only facts), so keep narrative on tables that genuinely need text. Demonstrated on bundled `hr`, `retail`, `banking`, `health`; `tests/configs/narrative_reviews.yaml` is the dedicated lexicon-design walkthrough. See [Narrative source](./user-guide/narrative-source.md). |
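
The pool-source row in the table can be pictured with a small stand-alone sketch: each fact row resolves to its entity's segment, then draws uniformly from that segment's attribute pool. The segment pools come from the example in the table; the entity IDs, `draw_pool_value`, and the use of the standard-library `random` module (instead of the engine's seeded numpy generator) are assumptions made for illustration.

```python
import random

# Segment pools taken from the example in the table above.
SEGMENT_POOLS = {
    "loyal": {"channel": ["app", "web"]},
    "casual": {"channel": ["sms", "email"]},
}
# Hypothetical entity-to-segment assignment.
ENTITY_SEGMENT = {"cust_001": "loyal", "cust_002": "casual"}


def draw_pool_value(entity_id, attribute, rng):
    """Resolve the entity's segment, then draw uniformly from its pool."""
    segment = ENTITY_SEGMENT[entity_id]
    return rng.choice(SEGMENT_POOLS[segment][attribute])


rng = random.Random(42)  # one seeded generator keeps the rows reproducible
fact_rows = [
    {
        "customer_id": entity_id,
        "period": period,
        "channel": draw_pool_value(entity_id, "channel", rng),
    }
    for entity_id in ENTITY_SEGMENT
    for period in range(3)
]
```

Whatever the seed, a `loyal` customer's `channel` can only be `app` or `web`, and a `casual` customer's only `sms` or `email`.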

### 7. Audit + downstream-pipeline outputs

| Feature | Behavior |
|---|---|
| SCD Type 2 | `dim_<entity>` expanded to N×versions with `valid_from_period` and band-crossing events surfaced in the manifest |
| SCD Type 1 | default (no-op) |
| Fact-side CDC | `facts[].cdc: true` emits `_inserted_at` / `_updated_at` / `_op` audit columns; column-level quality issues flip `_op` to `"U"` on affected rows. Demonstrated in `cdc_demo` (dedicated) and `retail` (realistic POS purchase ledger). |
| Fact-side CDC | `facts[].cdc: true` emits `_inserted_at` / `_updated_at` / `_op` audit columns; column-level quality issues flip `_op` to `"U"` on affected rows. Demonstrated on bundled `saas` (revenue restatement), `marketing` (spend attribution), `retail` (purchase ledger), `banking` (loan disbursement), `health` (encounter chart amendment); `tests/configs/cdc_demo.yaml` is the dedicated minimal walkthrough. |
| Holdout splits | `output.holdout: {fraction\|periods}` writes `{table}_train.<csv\|parquet>` + `{table}_holdout.<csv\|parquet>` instead of one file per fact, split by period index |
| Denormalization | `output.denormalized: true` joins each fact with its FK'd dims (SCD2 current-only, audit columns excluded, dim columns prefixed `<dim>__<col>`); emits `<fct>_wide.{csv\|parquet}` alongside normalized output for 1NF–3NF decomposition exercises. Demonstrated in `saas`. |
-| Log-file writer | Event tables with `log_format: "{ts} ... "` + `log_filename: "..."` emit a structured `.log` file alongside the CSV/Parquet event table. Format string is `template.format(**row.to_dict())` per row; unknown placeholders raise. Demonstrated in `saas` (`evt_login` as syslog-flavoured lines). |
-| Multi-source / overlap | `multi_source:` block emits per-source dim copies with controlled drift (casing / abbreviation / swap) and per-source key schemes; `source_entity_mappings` ground truth in the manifest. Demonstrated in `crm_billing_overlap` (CRM + billing dual-source, 40 mapping records). |
-| Nested / JSON columns | `dtype: struct` (with `nested_schema`) or `dtype: array` (with `array_element_type`) paired with `source: nested` on dim columns. Parquet preserves native nested schema (`pa.struct(...)`); CSV serializes as JSON string. Dim-only, one level of nesting, primitive leaves in V1. Demonstrated in `retail` (`dim_product_category.catalog_metadata`). |
+| Log-file writer | Event tables with `log_format: "{ts} ... "` + `log_filename: "..."` emit a structured `.log` file alongside the CSV/Parquet event table. Format string is `template.format(**row.to_dict())` per row; unknown placeholders raise. `tests/configs/saas_template.yaml` (`evt_login` as syslog-flavoured lines) is the worked example. |
+| Multi-source / overlap | `multi_source:` block emits per-source dim copies with controlled drift (casing / abbreviation / swap) and per-source key schemes; `source_entity_mappings` ground truth in the manifest. `tests/configs/crm_billing_overlap.yaml` is the worked example (CRM + billing dual-source, 40 mapping records). |
+| Nested / JSON columns | `dtype: struct` (with `nested_schema`) or `dtype: array` (with `array_element_type`) paired with `source: nested` on dim columns. Parquet preserves native nested schema (`pa.struct(...)`); CSV serializes as JSON string. Dim-only, one level of nesting, primitive leaves in V1. `tests/configs/retail_template.yaml` (`dim_product_category.catalog_metadata`) is the worked example. |
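
The log-file writer row describes per-row rendering with `template.format(**row.to_dict())`. The sketch below shows that mechanism with plain dicts standing in for DataFrame rows; `render_log_lines`, the column names, and the format string are hypothetical, and only the `str.format` behaviour itself is taken from the table.

```python
def render_log_lines(rows, log_format):
    """Render one log line per row via str.format.

    A placeholder with no matching column raises KeyError.
    """
    return [log_format.format(**row) for row in rows]


rows = [
    {"ts": "2024-01-05T08:15:00", "user_id": "u_17", "status": "ok"},
    {"ts": "2024-01-05T08:16:30", "user_id": "u_42", "status": "denied"},
]
lines = render_log_lines(rows, "{ts} login user={user_id} status={status}")

# A placeholder that is not a column of the event table fails loudly.
try:
    render_log_lines(rows, "{ts} host={hostname}")
except KeyError as exc:
    missing = exc.args[0]
```

Failing on an unknown placeholder, rather than emitting a blank, surfaces a typo in `log_format` on the first row instead of in a downstream log parser.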

### 8. Validation, manifest, and provenance (advanced)

@@ -223,7 +223,7 @@ convenience shapes:
- `window=("2024-01", "2024-12", "monthly")` shorthand.

Templates: `plotsim.list_templates()` →
-`["ab_trial", "bare_minimum", "cdc_demo", "crm_billing_overlap", "education", "geo_retail", "hr", "lakehouse", "latency_skew", "marketing", "narrative_reviews", "retail", "saas"]`.
+`["banking", "health", "hr", "marketing", "retail", "saas"]`.
`plotsim.load_template("saas")` returns a `PlotsimConfig` ready to mutate
or pass to `generate_tables`.
