-
Notifications
You must be signed in to change notification settings - Fork 85
Add datatype field to Field and Metric; reframe is_time as role marker #113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,2 @@ | ||
| __pycache__/ | ||
| *.py[cod] |
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -36,6 +36,22 @@ Supported SQL and expression language dialects for metrics and field definitions | |
| | `TABLEAU` | Tableau calculations | | ||
| | `DATABRICKS` | Databricks SQL | | ||
|
|
||
| ### Datatypes | ||
|
|
||
| Logical data types for fields and metrics. | ||
|
|
||
| | Datatype | Description | | ||
| |----------|-------------| | ||
| | `string` | Variable-length Unicode character data. | | ||
| | `integer` | Signed integer with no scale. | | ||
| | `number` | Real number (floating-point or decimal) with unspecified precision. | | ||
| | `boolean` | Logical two-valued truth type. | | ||
| | `date` | Calendar date with no time-of-day component. | | ||
| | `time` | Time-of-day with no date component. | | ||
| | `timestamp` | Instant-in-time without timezone offset (naive / local). | | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. maybe the default
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I don't have a strong opinion on it, but I tend to lean towards postgres naming. But also not opposed to omitting |
||
| | `timestamp_tz` | Instant-in-time with timezone offset (zoned). | | ||
| | `other` | Any data type not covered above; use `custom_extensions` for vendor-specific refinement. | | ||
|
|
||
| ### Vendors | ||
|
|
||
| Supported vendors for custom extensions and integrations. | ||
|
|
@@ -202,6 +218,7 @@ Fields represent row-level attributes that can be used for grouping, filtering, | |
| | `dimension` | object | No | Dimension metadata (e.g., `is_time` flag) | | ||
| | `label` | string | No | Label for categorization | | ||
| | `description` | string | No | Human-readable description | | ||
| | `datatype` | string (enum) | No | Logical data type for this field. See [Datatypes](#datatypes). | | ||
| | `ai_context` | string/object | No | Additional context for AI tools (e.g., synonyms) | | ||
| | `custom_extensions` | array | No | Vendor-specific attributes | | ||
|
|
||
|
|
@@ -228,7 +245,7 @@ expression: | |
|
|
||
| | Field | Type | Description | | ||
| |-------|------|-------------| | ||
| | `is_time` | boolean | Indicates if this is a time-based dimension for temporal filtering | | ||
| | `is_time` | boolean | Temporal-role marker. When `true`, consumers that distinguish time dimensions (e.g. for time-series analysis or temporal filtering) should treat this field as a time dimension. This is a *role* flag, independent of the field's data type. See [Datatype and `is_time`: type vs. role](#datatype-and-is_time-type-vs-role). | | ||
|
|
||
| ### Examples | ||
|
|
||
|
|
@@ -268,6 +285,7 @@ expression: | |
| dialects: | ||
| - dialect: ANSI_SQL | ||
| expression: order_date | ||
| datatype: date | ||
| dimension: | ||
| is_time: true | ||
| description: Date when order was placed | ||
|
|
@@ -290,6 +308,33 @@ expression: | |
| description: Normalized email address | ||
| ``` | ||
|
|
||
| ### Datatype and `is_time`: type vs. role | ||
|
|
||
| `datatype` and `dimension.is_time` are independent properties that answer different questions: | ||
|
|
||
| - **`datatype`** describes the *data type* of the field (e.g. `date`, `integer`, `string`, `timestamp_tz`): what kind of values the field holds. | ||
| - **`dimension.is_time`** is a *temporal-role marker*: whether the field should be treated as a time dimension for time-series analysis or temporal filtering, regardless of its data type. | ||
|
|
||
| **Default for `is_time`.** When `is_time` is not set explicitly, it defaults to `true` if `datatype` is one of `date`, `time`, `timestamp`, `timestamp_tz`, and `false` otherwise. Explicit `is_time` always wins. Set `is_time: false` on a temporal-typed column (e.g. an audit `created_at` you don't want on the time axis) to opt out of the default. | ||
|
|
||
| Common combinations: | ||
|
|
||
| | Column example | `datatype` | `is_time` | Effective role | Why | | ||
| |---|---|---|---|---| | ||
| | `d_date` (calendar date) | `date` | omitted | time dimension | Temporal `datatype`; `is_time` defaults to `true`. | | ||
| | `order_timestamp` | `timestamp_tz` | omitted | time dimension | Same. | | ||
| | `created_at` (audit timestamp) | `timestamp` | `false` | regular dimension | Explicit opt-out of the temporal default. | | ||
| | `d_year` (integer year grain) | `integer` | `true` | time dimension | Non-temporal `datatype`; `is_time: true` makes the role explicit. | | ||
| | `d_quarter_name` (e.g. `"Q1"`) | `string` | `true` | time dimension | String-valued temporal grain. | | ||
| | `customer_id` | `integer` | omitted | regular dimension | Non-temporal `datatype`; `is_time` defaults to `false`. | | ||
|
|
||
| > **Precedent.** This type/role separation mirrors [Snowflake Semantic Views' YAML authoring form](https://docs.snowflake.com/en/user-guide/views-semantic/semantic-view-yaml-spec), which has a structural `time_dimensions:` collection whose entries can carry any `data_type`. The published example annotates `order_year` with `data_type: NUMBER`. LookML supports a similar split via its [`dimension_group`](https://cloud.google.com/looker/docs/reference/param-field-dimension-group), whose `datatype` enum covers `date`, `datetime`, `timestamp`, plus the integer-encoded forms `epoch` and `yyyymmdd`. | ||
|
|
||
| **Consumer guidance.** | ||
|
|
||
| - For *data-type* questions (casting, serialization, downstream type inference): prefer `datatype` when present. If only `is_time: true` is set, do not infer a specific scalar type from it. | ||
| - For *role* questions (classifying time dimensions in a query UI, generating time-series output sections, choosing time-aware aggregations): treat the field as a time dimension when `is_time` resolves to `true`, whether explicitly set or defaulted from a temporal `datatype`. | ||
|
|
||
| --- | ||
|
|
||
| ## Metrics | ||
|
|
@@ -303,6 +348,7 @@ Quantitative measures defined on business data, representing key calculations li | |
| | `name` | string | Yes | Unique identifier for the metric | | ||
| | `expression` | object | Yes | Expression definition with dialect support | | ||
| | `description` | string | No | Human-readable description of what the metric measures | | ||
| | `datatype` | string (enum) | No | Logical data type for this metric. See [Datatypes](#datatypes). | | ||
| | `ai_context` | string/object | No | Additional context for AI tools (e.g., synonyms) | | ||
| | `custom_extensions` | array | No | Vendor-specific attributes | | ||
|
|
||
|
|
@@ -324,9 +370,11 @@ expression: | |
| ```yaml | ||
| - name: total_revenue | ||
| expression: | ||
| - dialect: ANSI_SQL | ||
| expression: SUM(orders.amount) | ||
| dialects: | ||
| - dialect: ANSI_SQL | ||
| expression: SUM(orders.amount) | ||
| description: Total revenue across all orders | ||
| datatype: number | ||
| ai_context: | ||
| synonyms: | ||
| - "total sales" | ||
|
|
@@ -338,9 +386,11 @@ expression: | |
| ```yaml | ||
| - name: avg_orders | ||
| expression: | ||
| - dialect: ANSI_SQL | ||
| expression: SUM(orders.amount) / COUNT(DISTINCT customers.id) | ||
| dialects: | ||
| - dialect: ANSI_SQL | ||
| expression: SUM(orders.amount) / COUNT(DISTINCT customers.id) | ||
| description: Average orders | ||
| datatype: number | ||
| ai_context: | ||
| synonyms: | ||
| - "Order Average by customer" | ||
|
|
@@ -446,6 +496,7 @@ semantic_model: | |
| dialects: | ||
| - dialect: ANSI_SQL | ||
| expression: order_date | ||
| datatype: date | ||
| dimension: | ||
| is_time: true | ||
| description: Order date | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One concern here is that using a generic number type isn’t really a valid interoperability strategy. Different data warehouses interpret and implement it differently, so it’s not interchangeable in practice.
For example, you can’t take a model built against data in Snowflake and expect it to behave identically in Databricks just because both use a number type. As we’ve already discussed in other issues/threads, precision and scale vary across systems, and that directly affects results.
Without explicitly specifying those parameters, this approach is likely to introduce subtle bugs. If strict type checking is enforced and there’s no automatic conversion, things will simply break. Even with implicit or explicit casting at the physical layer, you can still end up with precision loss, which is not acceptable for many use cases.
So I think we need a different approach here. Either we make the type definition explicit (e.g., always specifying precision and scale for numeric types), or we introduce some abstraction that preserves correctness across backends instead of relying on a loosely defined number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the feedback @KSDaemon. The frame of reference I'm coming from is that these types need to be precise enough to generate SQL on top of (e.g. a BI tool building on top of the semantic definitions). In my experience, the SQL required for integers vs float/decimal can differ for warehouses that use truncation semantics for integers. I have not come across cases where I've needed to generate different SQL depending on whether the underlying type is a float or decimal.
But it is worth taking a step back to ask what else folks would rely on these logical data types for. Do you have any particular use cases in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up. Are there cases beyond the generation of SQL for semantic queries that you have in mind to support here? I'm happy to add decimal as data type, but I'd like to understand a motivating scenario for how the spec would be used in such a way that
numberanddecimalwould be treated differently.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using JSON Schema data types (https://json-schema.org/understanding-json-schema/reference/type) as a minimum would be a good idea.