Skip to content

[Feature]: Support SQL Expressions in Relationship Join Conditions (Auto-Generated Join Key Dimensions) #106

@mluizzi-whoop

Description

@mluizzi-whoop

Problem Statement

The Challenge

Users frequently need to join tables on transformed columns without modifying their underlying dbt models. A common scenario is joining a timestamp column to a date column:

Example Scenario:

snowflake_relationships:
  - name: transactions_to_daily_summary
    left_table: {{ table('transactions') }}
    right_table: {{ table('daily_summary') }}
    relationship_conditions:
      - "{{ column('transactions', 'user_id') }} = {{ column('daily_summary', 'user_id') }}"
      - "{{ column('transactions', 'transaction_timestamp') }} = {{ column('daily_summary', 'summary_date') }}"

The problem: If transactions.transaction_timestamp is a TIMESTAMP but daily_summary.summary_date is a DATE, users need to join on transaction_timestamp::DATE = summary_date.

Current limitation: SST explicitly rejects SQL transformations in relationship conditions:

# From snowflake_semantic_tools/core/validation/rules/references.py (lines 401-426)
patterns = [
    (r"::", "type casting (::)"),
    (r"\bCAST\s*\(", "CAST function"),
    (r"\bTO_DATE\s*\(", "TO_DATE function"),
    (r"\bDATE\s*\(", "DATE function"),
    (r"\bDATE_TRUNC\s*\(", "DATE_TRUNC function"),  # Not currently listed but relevant
    # ... many more patterns
]

Current workaround: Add a computed column to the dbt model (e.g., transaction_date), which:

  • Requires modifying the upstream model
  • Adds redundant columns to tables
  • Increases maintenance burden
  • May not be feasible for models owned by other teams

Use Cases

  1. Timestamp to Date joins: DATE(event_timestamp) = calendar_date
  2. Monthly aggregation joins: DATE_TRUNC('month', order_date) = month_start_date
  3. Quarterly joins: DATE_TRUNC('quarter', transaction_date) = quarter_date
  4. Case-insensitive joins: UPPER(email) = UPPER(normalized_email) (less common)

Proposed Solution

Overview

Allow SQL expressions in relationship conditions by automatically generating "join key dimensions" during semantic view generation. The user writes natural SQL expressions, and SST handles the complexity behind the scenes.

User Experience (No New Syntax Required!)

Users write intuitive expressions:

snowflake_relationships:
  - name: events_to_calendar
    left_table: {{ table('events') }}
    right_table: {{ table('calendar') }}
    relationship_conditions:
      - "{{ column('events', 'user_id') }} = {{ column('calendar', 'user_id') }}"
      # NEW: Expressions now supported!
      - "DATE({{ column('events', 'event_timestamp') }}) = {{ column('calendar', 'calendar_date') }}"

Or with DATE_TRUNC:

relationship_conditions:
  - "DATE_TRUNC('month', {{ column('events', 'event_timestamp') }}) = {{ column('calendar', 'month_date') }}"

How It Works

Step 1: Detection (during template resolution/parsing)

  • Parse each relationship condition
  • Detect SQL transformation patterns around {{ column() }} templates
  • Extract: table name, base column, full expression

Step 2: Join Key Dimension Generation

  • Generate a synthetic dimension name: _JK_{TABLE}_{COLUMN}_{HASH}
  • Example: _JK_EVENTS_EVENT_TIMESTAMP_A1B2
  • Store mapping: expression → dimension name

Step 3: SQL Generation

  • Add synthetic dimensions to DIMENSIONS clause
  • Reference dimension names (not raw columns) in RELATIONSHIPS clause

Generated SQL Example

Input:

relationship_conditions:
  - "DATE({{ column('events', 'event_timestamp') }}) = {{ column('calendar', 'calendar_date') }}"

Output:

CREATE SEMANTIC VIEW my_semantic_view
TABLES (
    EVENTS AS DB.SCHEMA.EVENTS
        PRIMARY KEY (EVENT_ID)
        COMMENT = 'Events table',
    CALENDAR AS DB.SCHEMA.CALENDAR
        PRIMARY KEY (CALENDAR_DATE)
        COMMENT = 'Calendar dimension'
)
DIMENSIONS (
    -- Regular dimensions from dbt model metadata...
    EVENTS.EVENT_ID AS EVENT_ID
        COMMENT = 'Event identifier',
    EVENTS.USER_ID AS USER_ID
        COMMENT = 'User identifier',
    EVENTS.EVENT_TIMESTAMP AS EVENT_TIMESTAMP
        COMMENT = 'Event timestamp',
    CALENDAR.CALENDAR_DATE AS CALENDAR_DATE
        COMMENT = 'Calendar date',
    
    -- AUTO-GENERATED JOIN KEY DIMENSION
    EVENTS._JK_EVENT_TIMESTAMP_DATE AS DATE(EVENT_TIMESTAMP)
        COMMENT = 'Auto-generated join key: DATE(EVENT_TIMESTAMP)'
)
RELATIONSHIPS (
    EVENTS_TO_CALENDAR AS
        EVENTS (_JK_EVENT_TIMESTAMP_DATE) REFERENCES CALENDAR (CALENDAR_DATE)
)
COMMENT = 'Semantic view with expression-based join';

Key Insight: Snowflake Dimensions Support Expressions

Research confirmed that Snowflake semantic views support expressions in the DIMENSIONS clause:

DIMENSIONS (
    TABLE_NAME.DIM_NAME AS <expression>
)

The critical question is: Can RELATIONSHIPS reference dimension aliases instead of raw table columns?


Alternatives Considered

Alternative 1: Add Computed Column to dbt Model ❌ (Current Workaround)

-- In dbt model
SELECT
    *,
    transaction_timestamp::DATE AS transaction_date
FROM ...

Pros:

  • Works today
  • Explicit and visible

Cons:

  • Requires modifying upstream models
  • Adds redundant columns to physical tables
  • May not be feasible for shared/external models
  • Increases table storage and maintenance burden

Alternative 2: New Template Function {{ join_key() }}

relationship_conditions:
  - "{{ join_key('events', 'event_date', 'DATE(event_timestamp)') }} = {{ column('calendar', 'calendar_date') }}"

Pros:

  • Explicit about creating a join key dimension
  • Clear separation of concerns

Cons:

  • More verbose syntax
  • Requires learning new template function
  • Breaks the intuitive SQL-like pattern users expect

Alternative 3: Explicit Join Key Dimension Definition ❌

snowflake_relationships:
  - name: events_to_calendar
    join_key_dimensions:
      - table: events
        name: event_date
        expr: DATE(event_timestamp)
    relationship_conditions:
      - "{{ column('events', 'event_date') }} = {{ column('calendar', 'calendar_date') }}"

Pros:

  • Very explicit
  • Separates dimension definition from usage

Cons:

  • Verbose
  • Two places to maintain (dimension + condition)
  • Not inline with relationship definition

Alternative 4: Auto-Detect and Generate (Proposed Solution) ✅

Pros:

  • No new syntax to learn
  • Users write natural SQL expressions
  • Complexity handled automatically
  • Inline with relationship definition
  • Matches how users think about joins

Cons:

  • Magic happening behind the scenes (mitigated by logging/comments)
  • Need robust expression parsing
  • Potential edge cases with complex expressions

Priority

High - Would significantly improve workflow

This is a frequently requested pattern that currently requires modifying dbt models, which is often not desirable or possible.


Impact

Who Benefits

  1. All SST users who need to join tables on transformed columns
  2. Teams with shared models where modifying upstream tables isn't an option
  3. Data engineers working with timestamp/date mismatches (very common)
  4. Organizations with strict model ownership where adding columns requires approval

Common Patterns This Enables

Pattern Expression Use Case
Timestamp → Date DATE(timestamp_col) Join events to daily calendar
Date Truncation DATE_TRUNC('month', date_col) Monthly aggregation joins
Type Casting col::DATE or CAST(col AS DATE) Type alignment
Quarterly DATE_TRUNC('quarter', date_col) Quarterly reporting

Technical Considerations

Research Completed

1. Current Validation Blocks Transformations

File: snowflake_semantic_tools/core/validation/rules/references.py (lines 389-460)

The has_sql_transformation() function explicitly rejects:

  • Type casting (::)
  • CAST(), CONVERT(), TO_DATE(), TO_TIMESTAMP(), TO_CHAR(), TO_NUMBER()
  • DATE(), TRIM(), UPPER(), LOWER(), SUBSTRING()
  • COALESCE(), NVL(), IFNULL()
  • CASE WHEN statements
  • Arithmetic operations (+, -, *, /)
  • String concatenation (||)

2. Relationship SQL Generation

File: snowflake_semantic_tools/core/parsing/join_condition_parser.py (lines 247-285)

The generate_sql_references() method generates:

TABLE1 (col1, col2) REFERENCES TABLE2 (col1, col2)

Currently only uses column names from ParsedCondition.left_column and .right_column.

3. Dimension SQL Generation Supports Expressions

File: snowflake_semantic_tools/core/generation/semantic_view_builder.py (line 800)

dim_def = f"    {table_name}.{dim_name} AS {expression}"

This confirms dimensions CAN have expressions in the generated SQL.

Hypothesis to Validate

Critical Question: Can Snowflake semantic view RELATIONSHIPS reference dimension aliases (which can contain expressions) instead of raw table columns?

-- If this works, the feature is viable:
DIMENSIONS (
    EVENTS.EVENT_DATE AS DATE(EVENT_TIMESTAMP)
)
RELATIONSHIPS (
    REL_NAME AS
        EVENTS (EVENT_DATE) REFERENCES CALENDAR (CALENDAR_DATE)
        --      ^^^^^^^^^^^ dimension alias, not raw column
)

Validation SQL Script

Run this in Snowflake to test if dimension aliases work in RELATIONSHIPS:

-- ============================================================================
-- TEST: Can Snowflake semantic views join on dimension expressions?
-- ============================================================================
-- 
-- HYPOTHESIS: You can define a dimension with an expression (like DATE(timestamp_col))
-- and then reference that dimension name in a RELATIONSHIPS clause.
--
-- WHY THIS MATTERS: If this works, users can join a timestamp column to a date column
-- without adding a new column to their dbt model. They could write:
--
--   relationship_conditions:
--     - "DATE_TRUNC('month', {{ column('events', 'timestamp') }}) = {{ column('calendar', 'month_date') }}"
--
-- And the tool would auto-generate a dimension for the expression.
--
-- INSTRUCTIONS:
-- 1. Set your database/schema context below
-- 2. Run this script to test each hypothesis
-- ============================================================================

-- Set context (adjust to your environment)
USE ROLE your_role;
USE WAREHOUSE your_warehouse;
USE DATABASE your_database;
USE SCHEMA your_schema;

-- ============================================================================
-- Step 1: Create test tables
-- ============================================================================

CREATE OR REPLACE TABLE SST_TEST_EVENTS (
    event_id INT,
    user_id INT,
    event_timestamp TIMESTAMP_NTZ,
    event_type VARCHAR(50)
);

CREATE OR REPLACE TABLE SST_TEST_CALENDAR (
    calendar_date DATE,
    day_name VARCHAR(20),
    is_weekend BOOLEAN
);

-- Insert sample data
INSERT INTO SST_TEST_EVENTS VALUES
    (1, 100, '2025-01-15 10:30:00', 'login'),
    (2, 100, '2025-01-15 11:45:00', 'purchase'),
    (3, 101, '2025-01-16 09:00:00', 'login');

INSERT INTO SST_TEST_CALENDAR VALUES
    ('2025-01-15', 'Wednesday', FALSE),
    ('2025-01-16', 'Thursday', FALSE),
    ('2025-01-17', 'Friday', FALSE);

-- ============================================================================
-- TEST 1: Define dimension with expression, reference in relationship
-- ============================================================================

CREATE OR REPLACE SEMANTIC VIEW SST_TEST_DIMENSION_EXPR_JOIN
TABLES (
    SST_TEST_EVENTS AS CURRENT_DATABASE().CURRENT_SCHEMA().SST_TEST_EVENTS
        PRIMARY KEY (EVENT_ID)
        COMMENT = 'Events with timestamps',
    SST_TEST_CALENDAR AS CURRENT_DATABASE().CURRENT_SCHEMA().SST_TEST_CALENDAR
        PRIMARY KEY (CALENDAR_DATE)
        COMMENT = 'Calendar dimension'
)
DIMENSIONS (
    -- Regular dimension (simple column reference)
    SST_TEST_EVENTS.EVENT_ID AS EVENT_ID
        COMMENT = 'Event identifier',
    SST_TEST_EVENTS.USER_ID AS USER_ID
        COMMENT = 'User identifier',
    SST_TEST_EVENTS.EVENT_TYPE AS EVENT_TYPE
        COMMENT = 'Type of event',
    -- EXPRESSION DIMENSION: This is the key test!
    -- Can we define a dimension that truncates timestamp to date?
    SST_TEST_EVENTS.EVENT_DATE AS DATE(EVENT_TIMESTAMP)
        COMMENT = 'Event date derived from timestamp',
    -- Calendar dimensions
    SST_TEST_CALENDAR.CALENDAR_DATE AS CALENDAR_DATE
        COMMENT = 'Calendar date',
    SST_TEST_CALENDAR.DAY_NAME AS DAY_NAME
        COMMENT = 'Day of week name'
)
RELATIONSHIPS (
    -- THE KEY TEST: Reference the dimension alias (EVENT_DATE) 
    -- not the raw column (EVENT_TIMESTAMP)
    EVENTS_TO_CALENDAR AS
        SST_TEST_EVENTS (EVENT_DATE) REFERENCES SST_TEST_CALENDAR (CALENDAR_DATE)
)
COMMENT = 'Test: Can relationships reference dimension expressions?';

-- If the above succeeds, print success
SELECT 'TEST 1 SUCCESS: Dimension expression joins work!' AS result;

-- Verify the semantic view
SHOW SEMANTIC VIEWS LIKE 'SST_TEST_DIMENSION_EXPR_JOIN';
DESCRIBE SEMANTIC VIEW SST_TEST_DIMENSION_EXPR_JOIN;

-- ============================================================================
-- TEST 2: DATE_TRUNC as join key (the actual use case)
-- ============================================================================

CREATE OR REPLACE TABLE SST_TEST_MONTHLY_CALENDAR (
    month_date DATE,  -- First day of month
    month_name VARCHAR(20),
    quarter INT
);

INSERT INTO SST_TEST_MONTHLY_CALENDAR VALUES
    ('2025-01-01', 'January', 1),
    ('2025-02-01', 'February', 1);

CREATE OR REPLACE SEMANTIC VIEW SST_TEST_MONTHLY_JOIN
TABLES (
    SST_TEST_EVENTS AS CURRENT_DATABASE().CURRENT_SCHEMA().SST_TEST_EVENTS
        PRIMARY KEY (EVENT_ID)
        COMMENT = 'Events with timestamps',
    SST_TEST_MONTHLY_CALENDAR AS CURRENT_DATABASE().CURRENT_SCHEMA().SST_TEST_MONTHLY_CALENDAR
        PRIMARY KEY (MONTH_DATE)
        COMMENT = 'Monthly calendar'
)
DIMENSIONS (
    SST_TEST_EVENTS.EVENT_ID AS EVENT_ID,
    SST_TEST_EVENTS.USER_ID AS USER_ID,
    -- DATE_TRUNC expression dimension - simulating auto-generated join key
    SST_TEST_EVENTS.EVENT_MONTH AS DATE_TRUNC('month', EVENT_TIMESTAMP)
        COMMENT = 'Auto-generated: Event month for joining to monthly calendar',
    SST_TEST_MONTHLY_CALENDAR.MONTH_DATE AS MONTH_DATE,
    SST_TEST_MONTHLY_CALENDAR.MONTH_NAME AS MONTH_NAME
)
RELATIONSHIPS (
    -- Use the expression-based dimension in the relationship
    EVENTS_TO_MONTHLY AS
        SST_TEST_EVENTS (EVENT_MONTH) REFERENCES SST_TEST_MONTHLY_CALENDAR (MONTH_DATE)
)
COMMENT = 'Test: DATE_TRUNC expression as join key';

SELECT 'TEST 2 SUCCESS: DATE_TRUNC join key works!' AS result;

-- ============================================================================
-- CLEANUP (run when done testing)
-- ============================================================================

-- DROP SEMANTIC VIEW IF EXISTS SST_TEST_DIMENSION_EXPR_JOIN;
-- DROP SEMANTIC VIEW IF EXISTS SST_TEST_MONTHLY_JOIN;
-- DROP TABLE IF EXISTS SST_TEST_EVENTS;
-- DROP TABLE IF EXISTS SST_TEST_CALENDAR;
-- DROP TABLE IF EXISTS SST_TEST_MONTHLY_CALENDAR;

Implementation Plan

Phase 1: Validate Hypothesis

  • Run test SQL script in Snowflake
  • Confirm dimension aliases work in RELATIONSHIPS
  • Document any limitations

Phase 2: Expression Detection

Modify JoinConditionParser to detect SQL transformations:

# Suggested patterns to support initially:
EXPRESSION_PATTERNS = [
    r"DATE\s*\(\s*{{\s*column\([^)]+\)\s*}}\s*\)",           # DATE({{ column(...) }})
    r"DATE_TRUNC\s*\([^,]+,\s*{{\s*column\([^)]+\)\s*}}\)",  # DATE_TRUNC('month', {{ column(...) }})
    r"{{\s*column\([^)]+\)\s*}}::DATE",                      # {{ column(...) }}::DATE
    r"CAST\s*\(\s*{{\s*column\([^)]+\)\s*}}\s+AS\s+DATE\)",  # CAST({{ column(...) }} AS DATE)
]

Phase 3: Join Key Dimension Generation

New class: JoinKeyDimensionGenerator

import hashlib
from typing import Dict, Any


class JoinKeyDimensionGenerator:
    """Generates synthetic dimensions for join key expressions."""
    
    def __init__(self):
        self.generated_dimensions: Dict[tuple, Dict[str, Any]] = {}
    
    def generate_dimension_name(self, table: str, column: str, expression: str) -> str:
        """Generate unique dimension name for join key."""
        hash_suffix = hashlib.md5(expression.encode()).hexdigest()[:4].upper()
        return f"_JK_{table.upper()}_{column.upper()}_{hash_suffix}"
    
    def register_join_key(self, table: str, column: str, expression: str) -> str:
        """
        Register a join key expression and return dimension name.
        
        Args:
            table: Source table name
            column: Base column name (extracted from expression)
            expression: Full SQL expression (e.g., "DATE(EVENT_TIMESTAMP)")
            
        Returns:
            Generated dimension name to use in RELATIONSHIPS
        """
        key = (table.lower(), expression)
        if key not in self.generated_dimensions:
            dim_name = self.generate_dimension_name(table, column, expression)
            self.generated_dimensions[key] = {
                'name': dim_name,
                'table': table.upper(),
                'expression': expression,
                'base_column': column.upper(),
                'comment': f'Auto-generated join key: {expression}'
            }
        return self.generated_dimensions[key]['name']
    
    def get_all_dimensions(self) -> list:
        """Return all generated dimensions for inclusion in DIMENSIONS clause."""
        return list(self.generated_dimensions.values())
    
    def clear(self):
        """Clear all generated dimensions (call between semantic views)."""
        self.generated_dimensions.clear()

Phase 4: Update SQL Generation

Modify SemanticViewBuilder:

  1. _build_dimensions_clause() - Include join key dimensions:
def _build_dimensions_clause(self, conn, table_names: List[str], join_key_generator: JoinKeyDimensionGenerator = None) -> str:
    """Build the DIMENSIONS clause of the CREATE SEMANTIC VIEW statement."""
    # ... existing code ...
    
    # Add auto-generated join key dimensions
    if join_key_generator:
        for jk_dim in join_key_generator.get_all_dimensions():
            dim_def = f"    {jk_dim['table']}.{jk_dim['name']} AS {jk_dim['expression']}"
            dim_def += f"\n      COMMENT = '{jk_dim['comment']}'"
            dim_definitions.append(dim_def)
    
    return ",\n".join(dim_definitions)
  1. _build_relationships_clause() - Use dimension names from generator

Phase 5: Update Validation

  1. Remove hard block on SQL transformations (or make it conditional)
  2. Add validation that base column exists in dbt catalog
  3. Add warning for ASOF joins with expressions (see Caveats)

Caveats & Edge Cases

ASOF Joins

ASOF joins require UNIQUE constraints on the right table's join columns. If an expression destroys uniqueness (e.g., DATE_TRUNC groups multiple timestamps to same month), ASOF joins may fail.

Recommendation:

  • Warn users when expressions are used with >= (ASOF) operators
  • Document this limitation

Expression on Both Sides

If both sides have expressions:

- "DATE({{ column('events', 'timestamp') }}) = DATE({{ column('calendar', 'created_at') }})"

Generate two join key dimensions, one for each side.

Same Expression Multiple Times

If the same expression appears in multiple relationships, reuse the dimension.

Complex/Nested Expressions

Initially, support only single-function expressions. Complex nested expressions can be added later:

# Phase 1: Support these
- "DATE({{ column('events', 'timestamp') }})"
- "DATE_TRUNC('month', {{ column('events', 'timestamp') }})"

# Future: Consider supporting these
- "DATE_TRUNC('month', DATE({{ column('events', 'timestamp') }}))"

Files to Modify

File Changes
core/parsing/join_condition_parser.py Add expression detection, extract base column from expressions
core/validation/rules/references.py Conditionally allow expressions when auto-generating dimensions
core/generation/semantic_view_builder.py Add join key dimensions to DIMENSIONS clause, use in RELATIONSHIPS
core/parsing/parsers/semantic_parser.py Handle expression metadata during relationship parsing
New: core/generation/join_key_generator.py JoinKeyDimensionGenerator class

Example Usage

Before (Current - Fails Validation)

snowflake_relationships:
  - name: events_to_calendar
    left_table: {{ table('events') }}
    right_table: {{ table('calendar') }}
    relationship_conditions:
      - "{{ column('events', 'user_id') }} = {{ column('calendar', 'user_id') }}"
      - "DATE({{ column('events', 'event_timestamp') }}) = {{ column('calendar', 'calendar_date') }}"
      # ❌ ERROR: SQL transformation (DATE function) not allowed

After (Proposed - Works)

snowflake_relationships:
  - name: events_to_calendar
    left_table: {{ table('events') }}
    right_table: {{ table('calendar') }}
    relationship_conditions:
      - "{{ column('events', 'user_id') }} = {{ column('calendar', 'user_id') }}"
      - "DATE({{ column('events', 'event_timestamp') }}) = {{ column('calendar', 'calendar_date') }}"
      # ✅ Auto-generates join key dimension: EVENTS._JK_EVENT_TIMESTAMP_DATE AS DATE(EVENT_TIMESTAMP)

Generated Output (Verbose Mode)

$ sst generate --all --verbose

Parsing relationships...
  events_to_calendar: Detected expression in join condition
    Left side: DATE({{ column('events', 'event_timestamp') }})
    Creating join key dimension: EVENTS._JK_EVENT_TIMESTAMP_A1B2 AS DATE(EVENT_TIMESTAMP)

Generating semantic view 'analytics_view'...
  Tables: 2
  Dimensions: 5 (including 1 auto-generated join key)
  Relationships: 1
  ✓ Created successfully

Additional Context

Related Code References

Component File Lines Purpose
Expression blocking core/validation/rules/references.py 389-460 Currently rejects transformations
Join condition parsing core/parsing/join_condition_parser.py 47-285 Parses relationship conditions
Dimension SQL generation core/generation/semantic_view_builder.py 772-820 Builds DIMENSIONS clause
Relationship SQL generation core/generation/semantic_view_builder.py 700-739 Builds RELATIONSHIPS clause
Relationship parsing core/parsing/parsers/semantic_parser.py 147-203 Parses relationship YAML

Snowflake Documentation References


Pre-submission Checklist

  • I have searched existing issues to avoid duplicates
  • I have described a clear problem and solution
  • I have considered alternatives and workarounds

Summary

Problem: Users can't join on transformed columns without modifying dbt models.

Solution: Auto-generate "join key dimensions" when SQL expressions are detected in relationship conditions.

Key Benefits:

  • No new syntax required
  • No dbt model changes needed
  • Complexity handled automatically
  • Intuitive SQL-like expressions

Next Step: Validate that Snowflake semantic views allow dimension aliases in RELATIONSHIPS by running the test SQL script above.

Metadata

Metadata

Assignees

Projects

No projects

Relationships

None yet

Development

No branches or pull requests

Issue actions