12 changes: 12 additions & 0 deletions CHANGELOG.md
@@ -9,6 +9,18 @@ Versioning: [SemVer](https://semver.org/spec/v2.0.0.html).

### Added

- **`pool.<attr>` source on per_entity_per_period facts.** Widens the
per-entity value-pool surface to the most common fact grain (one
row per entity per period). Two new dispatch handlers —
`_fact_scalar_pool` and `_fact_vec_pool` — register against
`BuilderKind.PER_ENTITY_PER_PERIOD_FACT_{SCALAR,VECTORIZED}` and
draw uniformly from the row's entity's pool list. Pool sources
remain rejected on per_period facts (no per-row entity binding),
reference dims, and sub-entity dims. Pairs naturally with `cdc:
true` on the same fact, so a column like `payment_type:
pool.payment_method` now works alongside SCD2 and CDC on a single
transactional table.

- **Parent/child fact grain + sibling-fact references.** Three
composable patterns for multi-fact stars:
- **Header / detail** — a `per_parent_row` child fact fans out
12 changes: 6 additions & 6 deletions docs/site/column-types.md
@@ -31,7 +31,7 @@ Some types take additional fields (`labels` for `bucket`, `tracks` /
| `faker.{kind}` | yes | yes | yes | yes |
| `geo.{field}` | yes (dim only) | — | — | — |
| `static.{value}` | yes | yes | yes | yes |
| `pool.{attr}` | yes (per-entity dim only) | yes (variable-grain + per_parent_row) | yes | — |
| `pool.{attr}` | yes (per-entity dim only) | yes (per_entity_per_period + variable-grain + per_parent_row) | yes | — |
| `range` | — | yes | yes | — |
| `segment.count` | yes (per-entity dim only) | — | — | — |
| `timestamp` | — | — | yes | — |
@@ -235,11 +235,11 @@ dimensions:

Output dtype is `string`.

**Valid on**: per-entity dimension columns, variable-grain fact
columns, per_parent_row child-fact columns, and event columns. The
engine reads the row's entity FK and draws from
`attributes[attr_name]` for that entity's segment.
Per_entity_per_period and per_period facts, reference dims, and
**Valid on**: per-entity dimension columns, per_entity_per_period
fact columns, variable-grain fact columns, per_parent_row child-fact
columns, and event columns. The engine reads the row's entity FK
and draws from `attributes[attr_name]` for that entity's segment.
Per_period facts (the `dim_date`-style grain), reference dims, and
sub-entity dims are out of scope — pool dispatch requires either a
per-row entity binding (facts / events) or a 1:1 row-to-entity
mapping (per_entity dim).
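For intuition, the per-row dispatch described above can be sketched standalone. This is a minimal illustration, not the engine's code: `draw_pool_value` and the entity names (`loyal_0000`, `casual_0000`) are hypothetical, and the sketch assumes the pool has already been resolved to a dict keyed by entity name.

```python
import numpy as np


def draw_pool_value(value_pool, entity_name, rng):
    """Draw one value uniformly from the pool bound to a row's entity.

    `value_pool` maps entity name -> list of allowed values (hypothetical
    shape, mirroring the per-entity lookup the docs describe).
    """
    choices = value_pool.get(entity_name)
    if not choices:
        raise ValueError(f"no pool entries for entity {entity_name!r}")
    # One integer index per row; the seeded generator keeps it reproducible.
    return choices[int(rng.integers(0, len(choices)))]


# Hypothetical pools: each entity inherits its segment's attribute list.
value_pool = {
    "loyal_0000": ["app", "web"],
    "casual_0000": ["sms", "email"],
}

rng = np.random.default_rng(42)
loyal_draws = [draw_pool_value(value_pool, "loyal_0000", rng) for _ in range(20)]
casual_draws = [draw_pool_value(value_pool, "casual_0000", rng) for _ in range(20)]
```

A per_period row has no entity to key the lookup with, which is why that grain stays out of scope.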
2 changes: 1 addition & 1 deletion docs/site/feature-reference.md
@@ -123,7 +123,7 @@ is no longer byte-identical to a pre-flag run of the same file.
| Geo bundle provider | `geo.<field>` column types pull country / region / city / postcode / lat-lng from a curated 200-entry, 17-country reference dataset. All fields on the same dim row come from a single bundle, so the city is in the stated country, the postcode looks right for that country, and lat/lng land on the named city. Dim-only; the engine rejects geo on facts/events. See [Geo hierarchy](./user-guide/geo-hierarchy.md). |
| Faker-backed text + identifiers | PII-shape providers wired into the engine: `name`, `email`, `phone_number`, `company`, `address`, `postcode`, `country`, `city`, `latitude`, `longitude`, `sentence`. Deterministic under the run seed. Useful for masking exercises and regex-validation scenarios; **does not read entity, archetype, or trajectory** (each call is an independent draw). |
| Range source | `type: range` with `range: [min, max]` on fact / event columns produces a per-row uniform draw between the bounds. Integer bounds → `dtype: int` and inclusive upper bound; float bounds → `dtype: float` and exclusive upper bound (numpy conventions). Use it for `quantity ∈ [1, 5]`, `unit_price ∈ [10.0, 500.0]`, and similar shape constraints that `faker.random_int` / `faker.pyfloat` express less precisely. Deterministic under seed. |
| Pool source on facts and events | `type: pool.<attribute>` lifts the per-entity value pool (previously dim-only) onto variable-grain facts, per_parent_row child facts, and event tables. Every row resolves to its entity's segment, then draws uniformly from `attributes[<attr>]` — so a `loyal` cohort customer's `channel` always lands in `[app, web]` while a `casual` customer's lands in `[sms, email]`. |
| Pool source on facts and events | `type: pool.<attribute>` lifts the per-entity value pool (previously dim-only) onto per_entity_per_period facts, variable-grain facts, per_parent_row child facts, and event tables. Every row resolves to its entity's segment, then draws uniformly from `attributes[<attr>]` — so a `loyal` cohort customer's `channel` always lands in `[app, web]` while a `casual` customer's lands in `[sms, email]`. Per_period facts (the `dim_date`-style grain) remain out of scope — those rows have no per-row entity binding. |
| Narrative text source (trajectory-aware) | Per-archetype lexicons + a sentence template rendered into a `narrative` column on a fact table. Output vocabulary tracks the entity's trajectory position (a high-position `growth` entity produces systematically different text than a low-position `decline` entity); a simple bag-of-words classifier hits ≥0.55 accuracy on archetype prediction. Deterministic under seed; preserves the trajectory-first invariant. **Fact-only** (rejected on dim / event tables at config load). **Performance:** forces the scalar fact builder path (~3-10× slower than vectorized metric-only facts), so keep narrative on tables that genuinely need text. Bundled template `narrative_reviews`. See [Narrative source](./user-guide/narrative-source.md). |
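The bound conventions in the Range source row can be illustrated with plain NumPy. This is a hedged sketch: `draw_range` is a hypothetical helper, and the use of `endpoint=True` for the integer path is an assumption about how an inclusive upper bound could be obtained, not a claim about the engine's implementation (the diff only shows `rng.uniform` on the float path).

```python
import numpy as np


def draw_range(rng, lo, hi, size):
    """Uniform draw between lo and hi (hypothetical helper).

    Integer bounds -> integer output with an inclusive upper bound.
    Float bounds   -> float output on the half-open interval [lo, hi).
    """
    if isinstance(lo, int) and isinstance(hi, int):
        # endpoint=True makes `hi` itself a possible outcome.
        return rng.integers(lo, hi, size=size, endpoint=True)
    return rng.uniform(lo, hi, size=size)


rng = np.random.default_rng(7)
quantity = draw_range(rng, 1, 5, 1000)            # integers in 1..5
unit_price = draw_range(rng, 10.0, 500.0, 1000)   # floats in [10.0, 500.0)
```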

### 7. Audit + downstream-pipeline outputs
91 changes: 91 additions & 0 deletions plotsim/tables.py
@@ -1259,6 +1259,50 @@ def _fact_vec_range(parsed: RangeSource, ctx: dict):
return rng.uniform(parsed.min, parsed.max, size=total_rows)


def _fact_vec_pool(parsed: PoolSource, ctx: dict):
    """Bulk per-row pool draw on a vectorized per_entity_per_period fact.

    Output is entity-major (matches ``entity_pk_repeated`` /
    ``date_key_tiled`` layout). One bulk ``rng.integers`` draw per
    entity sized to ``n_periods``, scattered into the contiguous
    entity block. Per-entity draws keep ordering stable when entities
    have heterogeneous pool sizes.
    """
    del parsed  # PoolSource carries only a marker name; data is on col.value_pool.
    col = ctx["col"]
    rng = ctx["rng"]
    if rng is None:
        raise ValueError(
            f"fact column {col.name!r} has source {col.source!r} but no "
            f"RNG was supplied to the vectorized fact builder; pool "
            f"draws require the per-table RNG"
        )
    if col.value_pool is None:
        raise ValueError(
            f"fact column {col.name!r} declares pool source {col.source!r} "
            f"but Column.value_pool is None; Column._pool_pairing should "
            f"have rejected this at load"
        )
    config = ctx["config"]
    n_periods = ctx["n_periods"]
    total_rows = ctx["total_rows"]
    out = np.empty(total_rows, dtype=object)
    cursor = 0
    for entity in config.entities:
        choices = col.value_pool.get(entity.name)
        if choices is None:
            raise ValueError(
                f"fact column {col.name!r} value_pool has no entry for "
                f"entity {entity.name!r}; validate_value_pool_coverage "
                f"should have caught this at load"
            )
        indices = rng.integers(0, len(choices), size=n_periods)
        for k in range(n_periods):
            out[cursor + k] = _coerce_static(choices[int(indices[k])], col.dtype)
        cursor += n_periods
    return out
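The entity-major layout described in the docstring can be reproduced in isolation. The sketch below is an illustration under stated assumptions, not a proposed replacement: `bulk_pool_draw` and the sample entity names are hypothetical, it fills each block with NumPy fancy indexing instead of the per-cell loop, and it deliberately skips the `_coerce_static` dtype coercion that the handler applies.

```python
import numpy as np


def bulk_pool_draw(entity_names, value_pool, n_periods, rng):
    """Entity-major bulk draw: rows [i*n_periods, (i+1)*n_periods) belong
    to entity i, and every value in that block comes from its own pool."""
    out = np.empty(len(entity_names) * n_periods, dtype=object)
    cursor = 0
    for name in entity_names:
        choices = np.asarray(value_pool[name], dtype=object)
        # One bulk draw per entity, so pools of different sizes never
        # share an index range.
        indices = rng.integers(0, len(choices), size=n_periods)
        out[cursor:cursor + n_periods] = choices[indices]
        cursor += n_periods
    return out


# Hypothetical data: two entities with pools of different sizes.
entity_names = ["alpha_0000", "beta_0000"]
value_pool = {
    "alpha_0000": ["Tech", "Finance"],
    "beta_0000": ["Healthcare"],
}
n_periods = 6
column = bulk_pool_draw(entity_names, value_pool, n_periods, np.random.default_rng(0))
```

Drawing per entity rather than once for the whole column keeps each block's draw bounded by that entity's own pool length.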


def _fact_vec_text_bucket(parsed: TextBucketSource, ctx: dict):
# M105: trajectory-position-driven text emission. ``trajectories_2d``
# is shape (E, P); flatten in the same row-major (entity, period)
@@ -1347,6 +1391,11 @@ def _fact_vec_unsupported(parsed: Any, ctx: dict):
RangeSource,
_fact_vec_range,
)
COLUMN_DISPATCH.register(
    BuilderKind.PER_ENTITY_PER_PERIOD_FACT_VECTORIZED,
    PoolSource,
    _fact_vec_pool,
)
COLUMN_DISPATCH.register_unsupported(
BuilderKind.PER_ENTITY_PER_PERIOD_FACT_VECTORIZED,
_fact_vec_unsupported,
@@ -1620,6 +1669,43 @@ def _fact_scalar_range(parsed: RangeSource, ctx: dict):
return float(rng.uniform(parsed.min, parsed.max))


def _fact_scalar_pool(parsed: PoolSource, ctx: dict):
    """Per-cell pool draw on a scalar per_entity_per_period fact.

    Looks up the per-entity choice list on ``col.value_pool`` keyed by
    the current row's entity name, then draws one index from the
    seeded RNG. Same shape as ``_evt_row_pool`` but the entity is
    already in ctx (no PK reverse-lookup needed on the per-entity
    dim).
    """
    del parsed  # PoolSource carries only a marker name; data is on col.value_pool.
    col = ctx["col"]
    rng = ctx["rng"]
    entity = ctx["entity"]
    if rng is None or entity is None:
        raise ValueError(
            f"fact column {col.name!r} pool source needs both `entity` and "
            f"`rng` in ctx (got entity={entity!r}, "
            f"rng={'set' if rng is not None else 'None'}); this is an "
            f"internal wiring bug, not a config error"
        )
    if col.value_pool is None:
        raise ValueError(
            f"fact column {col.name!r} declares pool source {col.source!r} "
            f"but Column.value_pool is None; Column._pool_pairing should "
            f"have rejected this at load"
        )
    choices = col.value_pool.get(entity.name)
    if choices is None:
        raise ValueError(
            f"fact column {col.name!r} value_pool has no entry for entity "
            f"{entity.name!r}; validate_value_pool_coverage should have "
            f"caught this at load"
        )
    pick = int(rng.integers(0, len(choices)))
    return _coerce_static(choices[pick], col.dtype)


def _fact_scalar_unsupported(parsed: Any, ctx: dict):
col = ctx["col"]
raise ValueError(
@@ -1691,6 +1777,11 @@ def _fact_scalar_unsupported(parsed: Any, ctx: dict):
RangeSource,
_fact_scalar_range,
)
COLUMN_DISPATCH.register(
    BuilderKind.PER_ENTITY_PER_PERIOD_FACT_SCALAR,
    PoolSource,
    _fact_scalar_pool,
)
COLUMN_DISPATCH.register_unsupported(
BuilderKind.PER_ENTITY_PER_PERIOD_FACT_SCALAR,
_fact_scalar_unsupported,
23 changes: 14 additions & 9 deletions plotsim/validation.py
@@ -581,9 +581,9 @@ def validate_value_pool_coverage(config: PlotsimConfig) -> list[str]:
1. ``PoolSource`` columns are only meaningful on tables where
every row resolves to exactly one entity and where the
engine has wired the per-row entity → pool lookup: per_entity
dims (M114), variable-grain fact tables, per_parent_row
child facts, and event tables. Per_entity_per_period facts,
per_period facts, reference dims, and sub-entity dims are
dims (M114), per_entity_per_period facts, variable-grain
fact tables, per_parent_row child facts, and event tables.
Per_period facts, reference dims, and sub-entity dims are
out of scope — either no per-row entity binding or no
dispatch handler is registered for the grain.
2. The ``value_pool`` dict's keys must cover every ``Entity.name``
@@ -597,7 +597,10 @@ def validate_value_pool_coverage(config: PlotsimConfig) -> list[str]:
add variable-grain facts, per_parent_row child facts, and event
tables so authors can curate per-entity value pools on the fact
/ event rows directly (e.g. ``payment_method`` on ``fct_orders``)
without the indirection of a separate dim-row lookup.
without the indirection of a separate dim-row lookup. A follow-up
widened it again to include per_entity_per_period facts once
``_fact_scalar_pool`` / ``_fact_vec_pool`` landed in the column
dispatch registry.
"""
errors: list[str] = []
entity_names = {e.name for e in config.entities}
@@ -607,7 +610,8 @@ def validate_value_pool_coverage(config: PlotsimConfig) -> list[str]:
pool_capable_tables = per_entity_dim_names | {
t.name
for t in config.tables
if (t.type == "fact" and t.grain in ("variable", "per_parent_row")) or t.type == "event"
if (t.type == "fact" and t.grain in ("per_entity_per_period", "variable", "per_parent_row"))
or t.type == "event"
}

for tbl in config.tables:
@@ -619,10 +623,11 @@ def validate_value_pool_coverage(config: PlotsimConfig) -> list[str]:
errors.append(
f"table {tbl.name!r} column {col.name!r} declares a "
f"'pool:' source but the table is not a per_entity "
f"dim, a variable-grain fact, a per_parent_row child "
f"fact, or an event (type={tbl.type!r}, "
f"grain={tbl.grain!r}); pool sources need a per-row "
f"per-entity binding the engine can dispatch against"
f"dim, a per_entity_per_period fact, a variable-grain "
f"fact, a per_parent_row child fact, or an event "
f"(type={tbl.type!r}, grain={tbl.grain!r}); pool "
f"sources need a per-row per-entity binding the "
f"engine can dispatch against"
)
continue
if col.value_pool is None:
119 changes: 65 additions & 54 deletions tests/test_pool_attr.py
@@ -248,62 +248,73 @@ def test_pool_attr_missing_on_some_segments_raises():
)


def test_pool_attr_on_per_entity_per_period_fact_rejected():
"""``pool.{attr}`` is now valid on variable-grain facts,
per_parent_row child facts, and event tables (0.6-M19 Fix 1), but
the per_entity_per_period fact grain stays out of scope — the
engine has no per-row pool dispatch handler registered for that
grain. The engine validator surfaces the gap at config load.
def test_pool_attr_on_per_entity_per_period_fact_accepted():
"""``pool.{attr}`` on a per_entity_per_period fact column now
interprets cleanly and wires through ``_fact_scalar_pool`` /
``_fact_vec_pool``. The widening over the M19-Fix-1 baseline
(which accepted variable-grain facts, per_parent_row children,
and events) covers the most common fact grain — one row per
(entity, period).
"""
with pytest.raises(ValueError, match="variable-grain fact"):
interpret(
_input(
segments=[
{
"name": "alpha",
"count": 3,
"archetype": "growth",
"attributes": {"industry": ["Tech"]},
},
{
"name": "beta",
"count": 3,
"archetype": "flat",
"attributes": {"industry": ["Healthcare"]},
},
],
dimensions=[
{
"name": "dim_date",
"per": "period",
"columns": [
{"name": "date_key", "type": "id"},
{"name": "date", "type": "date"},
],
},
{
"name": "dim_company",
"per": "unit",
"columns": [
{"name": "company_id", "type": "id"},
],
},
],
facts=[
{
"name": "fct_company",
"metrics": ["engagement"],
"columns": [
{"name": "date_key", "type": "ref.dim_date"},
{"name": "company_id", "type": "ref.dim_company"},
{"name": "engagement", "type": "metric.engagement"},
# Illegal: pool.{attr} on a fact column
{"name": "industry", "type": "pool.industry"},
],
},
],
)
cfg = interpret(
_input(
segments=[
{
"name": "alpha",
"count": 3,
"archetype": "growth",
"attributes": {"industry": ["Tech", "Finance"]},
},
{
"name": "beta",
"count": 3,
"archetype": "flat",
"attributes": {"industry": ["Healthcare"]},
},
],
dimensions=[
{
"name": "dim_date",
"per": "period",
"columns": [
{"name": "date_key", "type": "id"},
{"name": "date", "type": "date"},
],
},
{
"name": "dim_company",
"per": "unit",
"columns": [
{"name": "company_id", "type": "id"},
],
},
],
facts=[
{
"name": "fct_company",
"metrics": ["engagement"],
"columns": [
{"name": "date_key", "type": "ref.dim_date"},
{"name": "company_id", "type": "ref.dim_company"},
{"name": "engagement", "type": "metric.engagement"},
{"name": "industry", "type": "pool.industry"},
],
},
],
)
)
fct = next(t for t in cfg.tables if t.name == "fct_company")
industry_col = next(c for c in fct.columns if c.name == "industry")
assert industry_col.source == "pool:industry"
# Builder expands each segment into per-entity rows (alpha_0000, ...).
# Every entity in a segment shares that segment's attribute pool.
pool = industry_col.value_pool
assert pool is not None
alpha_keys = sorted(k for k in pool if k.startswith("alpha"))
beta_keys = sorted(k for k in pool if k.startswith("beta"))
assert len(alpha_keys) == 3 and len(beta_keys) == 3
assert all(pool[k] == ["Tech", "Finance"] for k in alpha_keys)
assert all(pool[k] == ["Healthcare"] for k in beta_keys)
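The segment-to-entity expansion that these assertions rely on can be sketched as follows. This is an assumption-laden illustration: `expand_segment_pools` is a hypothetical helper, and the `{segment}_{index:04d}` key format is inferred from the `alpha_0000` example in the comment above, not taken from the builder's code.

```python
def expand_segment_pools(segments, attr):
    """Map every generated entity name to its segment's pool for `attr`.

    Assumes entity names follow `<segment>_<4-digit index>`; each entity
    gets its own copy of the segment's list.
    """
    pool = {}
    for seg in segments:
        for i in range(seg["count"]):
            pool[f"{seg['name']}_{i:04d}"] = list(seg["attributes"][attr])
    return pool


# Same segment shapes as the test input above.
segments = [
    {"name": "alpha", "count": 3, "attributes": {"industry": ["Tech", "Finance"]}},
    {"name": "beta", "count": 3, "attributes": {"industry": ["Healthcare"]}},
]
pool = expand_segment_pools(segments, "industry")
```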


# ── Auto-schema attribute columns ──────────────────────────────────────────