
feat: type-aware partitioning, SummingMergeTree + LowCardinality, --stdout #30

Merged: Maksim-Gr merged 1 commit into main from feat/quick-wins-partitioning-engines on Jun 19, 2026

Maksim-Gr (Owner) commented on Jun 19, 2026

Summary

Fixes one correctness bug and lands four low-risk quick wins that build on classifiers already present in the codebase. No new dependencies.

  • Fix type-blind PARTITION BY: table now decides PARTITION BY toYYYYMM(...) from the column's inferred type (DateTime64) instead of its name. Previously a scan → table flow could emit PARTITION BY toYYYYMM(<String column>), which ClickHouse rejects: date-only strings are inferred as String, yet any field whose name merely looked like a timestamp qualified for partitioning. Also removes the timestamp heuristic that was duplicated between generator and scanner.
  • Auto-suggest SummingMergeTree: scan suggests it when numeric metrics and a dimension (id/timestamp) exist, with a narrow grouping key (one primary id + one timestamp) and the metric columns as SUM COLUMNS.
  • LowCardinality(String) detection: inference tracks capped distinct string values and flags low-cardinality columns (conservative thresholds: ≥20 records, ≤100 distinct, distinct < records/2); flagged columns are rendered as LowCardinality(String) across table/kafka/diff.
  • --stdout flag: kafka/table/diff can print migrations (with -- up / -- down headers) instead of writing files.
  • Docs: README updated; version bumped 0.4.0 → 0.5.0.
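To illustrate the PARTITION BY fix described in the first bullet, here is a minimal sketch of a type-driven decision. It is not the code from this PR: `InferredType`, `Column`, and `partition_clause` are hypothetical names, and the real generator may pick the column differently (this sketch takes the first DateTime64 column it finds).

```rust
/// Hypothetical inferred column types; the enum in the codebase may differ.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum InferredType {
    String,
    Int64,
    Float64,
    DateTime64,
}

/// Hypothetical column descriptor: a name plus its inferred type.
struct Column {
    name: String,
    ty: InferredType,
}

/// Returns a PARTITION BY clause only if some column was inferred as
/// DateTime64. The column name is never consulted, so a String column
/// called "timestamp" does not qualify.
fn partition_clause(columns: &[Column]) -> Option<String> {
    columns
        .iter()
        .find(|c| c.ty == InferredType::DateTime64)
        .map(|c| format!("PARTITION BY toYYYYMM({})", c.name))
}
```

With this shape, a schema whose only timestamp-like field is a String yields `None`, so no PARTITION BY clause is emitted at all instead of one that ClickHouse would reject.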

…tdout

Fix a correctness bug and land several quick wins that build on classifiers
already present in the codebase.

- generator: decide PARTITION BY from the column's inferred type (DateTime64)
  instead of its name, so a scan -> table flow can no longer emit
  PARTITION BY toYYYYMM(<String column>), which ClickHouse rejects. Removes the
  timestamp heuristic that was duplicated between generator and scanner.
- scanner: suggest SummingMergeTree when numeric metrics and a dimension exist,
  with a narrow grouping key (one primary id + one timestamp) and the metric
  columns as SUM COLUMNS.
- inference/schema: detect low-cardinality String columns (capped distinct-value
  tracking, conservative thresholds) and render them as LowCardinality(String);
  applied across table/kafka/diff output.
- cli/main: add --stdout to kafka/table/diff to print migrations instead of
  writing files.
- docs: document the above in README; bump version 0.4.0 -> 0.5.0.
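The low-cardinality detection can be sketched roughly as follows, using the thresholds listed in the summary (≥20 records, ≤100 distinct, distinct < records/2). This is an illustration under assumptions, not the implementation from this PR: `StringStats`, `observe`, `is_low_cardinality`, and `render_type` are hypothetical names, and the cap is assumed to equal the distinct-value limit.

```rust
use std::collections::HashSet;

// Thresholds taken from the PR summary.
const MIN_RECORDS: usize = 20;
const MAX_DISTINCT: usize = 100;

/// Hypothetical per-column statistics for a String field.
#[derive(Default)]
struct StringStats {
    records: usize,
    distinct: HashSet<String>,
    /// Set once more than MAX_DISTINCT values were seen; tracking stops then.
    overflowed: bool,
}

impl StringStats {
    /// Records one observed value, keeping at most MAX_DISTINCT distinct strings.
    fn observe(&mut self, value: &str) {
        self.records += 1;
        if self.overflowed || self.distinct.contains(value) {
            return;
        }
        if self.distinct.len() == MAX_DISTINCT {
            // Cap reached: the column cannot qualify, so free the memory.
            self.overflowed = true;
            self.distinct.clear();
        } else {
            self.distinct.insert(value.to_string());
        }
    }

    /// Applies the three thresholds. `distinct * 2 < records` expresses
    /// `distinct < records / 2` without integer-division rounding.
    fn is_low_cardinality(&self) -> bool {
        !self.overflowed
            && self.records >= MIN_RECORDS
            && self.distinct.len() <= MAX_DISTINCT
            && self.distinct.len() * 2 < self.records
    }

    /// ClickHouse type to render for this column.
    fn render_type(&self) -> &'static str {
        if self.is_low_cardinality() {
            "LowCardinality(String)"
        } else {
            "String"
        }
    }
}
```

Capping the set bounds memory on high-cardinality columns such as UUIDs, and the minimum record count keeps small samples from being flagged by accident.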

Adds 8 unit tests (18 -> 26). cargo fmt, clippy -D warnings, and tests pass.
Maksim-Gr merged commit 5aa682c into main on Jun 19, 2026
1 check passed