
Comments

refactor/dbt-redshift#163

Open
iobruno wants to merge 6 commits into master from refactor/dbt-redshift

Conversation

Owner

@iobruno iobruno commented Feb 10, 2026

Summary

  • Updates staging models (stg_green_, stg_yellow_, stg_fhv_)
    • Replaces the deduplication CTE in stg_yellow/stg_green with QUALIFY on the main SELECT statement
    • Adds a deduplication step (with QUALIFY) to stg_fhv for consistency
  • Updates dimension models (dim_fhv_trips, dim_taxi_trips)
    • Extracts year, quarter, and month fields from pickup/dropoff timestamps
      to support downstream aggregations; shortens join aliases for readability
  • Adds core models:
    • fct_fhv_monthly_zone_traveltime_p90: computes the P90 travel time (in seconds)
      per pickup/dropoff zone pair per month for FHV trips, ranked by longest travel time
    • fct_taxi_trips_monthly_fare_p95: calculates P97, P95, and P90 fare percentiles
      by service type per month, filtering for valid cash/credit card trips
    • fct_taxi_trips_quarterly_revenue: aggregates quarterly revenue and trip counts
      by service type with year-over-year growth percentage
  • Fixes indentation issues across all models
  • Adds an exhaustive (else) clause to the resolve_payment_type_desc macro
  • Updates Dockerfile to use astral-sh/uv:python3.13-trixie-slim as base image for builder stage
    and python3.13-slim as base image for the runner stage
  • Updates README

Test plan

  • Run dbt build to validate all models compile and execute against Redshift
  • Verify the new fact models produce expected output for known date ranges
  • Confirm dim_fhv_trips and dim_taxi_trips still pass existing tests after the schema changes

🤖 Generated with Claude Code

* Update project dependencies
* Bumps dbt_utils to 1.3.3
* Fixes materialization config on dbt_project.yml to prevent warnings
* Updates the upper bound for Python (<3.14)
* `fct_fhv_monthly_zone_traveltime_p90`: computes the P90 travel time (in seconds)
   per pickup/dropoff zone pair per month for FHV trips, ranked by longest travel time

* `fct_taxi_trips_monthly_fare_p95`: calculates P97, P95, and P90 fare percentiles
   by service type per month, filtering for valid cash/credit card trips

* `fct_taxi_trips_quarterly_revenue`: aggregates quarterly revenue and trip counts
@iobruno iobruno force-pushed the refactor/dbt-redshift branch from b091e17 to 1f95544 on February 10, 2026 03:23
* Adds exhaustive clause (else) on resolve_payment_type_desc
* Minor update on var name
@iobruno iobruno force-pushed the master branch 3 times, most recently from 8adbdca to 153db80 on February 14, 2026 03:36
Owner Author

iobruno commented Feb 16, 2026

@claude code review it


claude bot commented Feb 16, 2026

Claude finished @iobruno's task in 2m 40s


Code Review Complete ✓

  • Read repository guidelines (CLAUDE.md)
  • Review new fact models
  • Review updated dimension models
  • Review updated staging models
  • Review macro changes
  • Review Dockerfile updates
  • Review README changes
  • Provide comprehensive feedback

Review Summary

Overall, this is a well-structured refactor that modernizes the dbt-redshift project with improved models, better deduplication, and updated dependencies. The code quality is high, and the changes are mostly solid. However, there are several issues that should be addressed before merging.


🔴 Critical Issues

1. Incorrect Function in fct_fhv_monthly_zone_traveltime_p90.sql

Location: models/core/fct_fhv_monthly_zone_traveltime_p90.sql:11

Issue: timestamp_diff() is a BigQuery function, not a Redshift one. Redshift uses DATEDIFF() or an epoch-seconds difference.

-- Current (WRONG for Redshift):
percentile_cont(timestamp_diff(dropoff_datetime, pickup_datetime, SECOND), 0.90)

-- Should be (Redshift's ordered-set syntax):
percentile_cont(0.90) within group (order by DATEDIFF(second, pickup_datetime, dropoff_datetime))
-- OR:
percentile_cont(0.90) within group (order by EXTRACT(EPOCH FROM dropoff_datetime) - EXTRACT(EPOCH FROM pickup_datetime))

This will cause runtime errors in Redshift.


2. Incorrect Function in fct_taxi_trips_quarterly_revenue.sql

Location: models/core/fct_taxi_trips_quarterly_revenue.sql:38

Issue: Using safe_divide() which is a BigQuery function. Redshift doesn't have this function.

-- Current (WRONG for Redshift):
round(safe_divide(revenue - prev_year_revenue, prev_year_revenue) * 100, 2) as growth

-- Should be:
round(CASE 
    WHEN prev_year_revenue = 0 OR prev_year_revenue IS NULL THEN NULL
    ELSE (revenue - prev_year_revenue) / prev_year_revenue * 100
END, 2) as growth

This will cause runtime errors.


🟡 Performance & Quality Issues

3. Inefficient Window Function Pattern in fct_fhv_monthly_zone_traveltime_p90.sql

Location: models/core/fct_fhv_monthly_zone_traveltime_p90.sql:11-12

Issue: The window function computes P90 for every row, then aggregates with max(timedelta_p90). This is wasteful.

Better approach:

with fhv_timedelta as (
    select
        pickup_year,
        pickup_month,
        pickup_zone,
        dropoff_zone,
        DATEDIFF(second, pickup_datetime, dropoff_datetime) as timedelta_seconds
    from {{ ref('dim_fhv_trips') }}
),

fhv_p90_timedelta as (
    select
        pickup_year as year,
        pickup_month as month,
        pickup_zone,
        dropoff_zone,
        count(1) as num_trips,
        percentile_cont(0.90) within group (order by timedelta_seconds) as timedelta_p90
    from fhv_timedelta
    group by pickup_year, pickup_month, pickup_zone, dropoff_zone
),

fhv_rnk_timedelta as (
    select
        *,
        dense_rank() over (partition by year, month, pickup_zone order by timedelta_p90 desc) as rnk
    from fhv_p90_timedelta
)

This computes P90 once per group instead of once per row, which is much more efficient.


4. Inconsistent Percentile Functions

Location: models/core/fct_taxi_trips_monthly_fare_p95.sql:11-13

Issue: Using percentile_disc() (discrete percentile - returns actual value from dataset) vs percentile_cont() in the FHV model (continuous percentile - interpolates).

Recommendation: Be consistent. For fare amounts, percentile_disc makes sense (returns actual fare). For time durations, percentile_cont makes sense (interpolated). Just ensure this is intentional and documented.
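For reference, the two forms behave differently in Redshift (table and column names here are illustrative):

```sql
-- percentile_disc returns an actual value present in the dataset:
select percentile_disc(0.95) within group (order by fare_amount) as fare_p95_disc
from trips;

-- percentile_cont interpolates between adjacent values:
select percentile_cont(0.95) within group (order by fare_amount) as fare_p95_cont
from trips;
```

On a column holding (10, 20, 30, 40), the 0.95 discrete percentile is 40 (a real row value), while the continuous percentile interpolates to 38.5.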


5. Case Sensitivity Issues in Column Names

Location: Multiple files, e.g., models/staging/stg_fhv_tripdata.sql:13,16,18-20

Issue: Inconsistent casing in source column references:

Affiliated_base_number   -- Mixed case
dropOff_datetime         -- Mixed case
PUlocationID            -- Mixed case

While Redshift is case-insensitive by default, mixed-case identifiers can cause issues if tables were created with quoted identifiers. Recommendation: Use consistent lowercase aliases throughout for maintainability.
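A minimal sketch of the lowercase-aliasing approach (source columns are taken from the list above; the source() reference is hypothetical):

```sql
select
    dropOff_datetime        as dropoff_datetime,
    PUlocationID            as pickup_location_id,
    Affiliated_base_number  as affiliated_base_number
from {{ source('staging', 'fhv_tripdata') }}
```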


6. Missing NULL Checks in dim_taxi_trips Join

Location: models/core/dim_taxi_trips.sql:56-59

Issue: Using INNER JOIN will silently drop trips with invalid location IDs (not in the zone lookup table). This might be intentional, but it's not obvious.

Recommendation:

  • If dropping invalid trips is intentional, add a comment explaining this
  • Consider adding a data quality test to monitor how many trips are dropped
  • Or use LEFT JOIN and handle nulls explicitly if you want to preserve all trips
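If preserving all trips is preferred, a LEFT JOIN with explicit null handling might look like this sketch (ref name and columns are illustrative, not the actual model):

```sql
select
    t.*,
    coalesce(pz.zone, 'Unknown') as pickup_zone,
    coalesce(dz.zone, 'Unknown') as dropoff_zone
from trips t
left join {{ ref('dim_zone_lookup') }} pz on t.pickup_location_id  = pz.location_id
left join {{ ref('dim_zone_lookup') }} dz on t.dropoff_location_id = dz.location_id
```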

🟢 Positive Changes

✅ Excellent: QUALIFY for Deduplication

The switch from CTEs to QUALIFY for deduplication in staging models is a great optimization for Redshift:

qualify row_number() over(partition by VendorID, tpep_pickup_datetime) = 1

This is more efficient than the previous CTE approach and is more readable.


✅ Good: Time Partitioning Fields

Adding pickup_year, pickup_quarter, pickup_month to dimension models (dim_fhv_trips.sql:28-30, dim_taxi_trips.sql:27-29) is excellent for supporting downstream aggregations without repeated date extractions.
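In Redshift, these fields can be derived once in the dimension model, roughly like this (the ref and surrounding column list are illustrative):

```sql
select
    extract(year    from pickup_datetime) as pickup_year,
    extract(quarter from pickup_datetime) as pickup_quarter,
    extract(month   from pickup_datetime) as pickup_month
    -- remaining trip columns elided
from {{ ref('stg_fhv_tripdata') }}
```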


✅ Good: Macro Enhancement

Adding the else clause in resolve_payment_type_desc.sql:15 ensures unknown payment types are handled gracefully. Nice defensive coding!
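As a sketch of the pattern (the actual macro body and payment-type mapping may differ), the exhaustive form looks like:

```sql
{% macro resolve_payment_type_desc(payment_type) %}
    case {{ payment_type }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        else 'Unknown'
    end
{% endmacro %}
```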


✅ Good: Dockerfile Modernization

Upgrading to Python 3.13 and using astral-sh/uv base image is a solid modernization. The multi-stage build is clean and efficient.


🔵 Minor Issues & Suggestions

7. Redundant Aliases in SELECT Statements

Location: Throughout new models (e.g., fct_fhv_monthly_zone_traveltime_p90.sql:36-42)

Issue: Aliasing columns to themselves is redundant:

select
    year                    as year,
    month                   as month,
    pickup_zone             as pickup_zone,

Recommendation: Either remove the aliases or keep them for consistency. I'd suggest keeping them for maintainability (makes column contract explicit), but it's a style choice.


8. Missing Documentation for Complex Models

Location: All new fact models

Issue: No schema documentation (.yml files) for the new fact models.

Recommendation: Add schema files with:

  • Column descriptions
  • Data quality tests (unique, not_null, relationships)
  • Model descriptions explaining business logic

Example:

version: 2
models:
  - name: fct_fhv_monthly_zone_traveltime_p90
    description: "P90 travel time by zone pair per month for FHV trips"
    columns:
      - name: timedelta_p90
        description: "90th percentile travel time in seconds"
        tests:
          - not_null

9. PR Description Claims P97/P95/P90 but Title Says P95

Location: PR summary vs model name

Minor inconsistency: The PR summary says the model "calculates P97, P95, and P90 fare percentiles" but the model is named fct_taxi_trips_monthly_fare_p95.sql. The model does calculate all three percentiles, so the name could be more accurate (e.g., fct_taxi_trips_monthly_fare_percentiles.sql), or the description should emphasize P95 as the primary metric.


10. README Still References BigQuery

Location: README.md:43

Issue: The README says "Set the environment variables for dbt-bigquery" but this is the Redshift adapter.

Should be:

3.2. Set the environment variables for `dbt-redshift`:

Minor copy-paste error that should be corrected.


📋 Test Plan Validation

The test plan in the PR description is appropriate:

  • ✅ Run dbt build - CRITICAL given the BigQuery function usage found
  • ✅ Verify new fact models - Especially important for the P90/P95/P97 calculations
  • ✅ Confirm dim models pass tests - Good to validate schema changes don't break downstream

Additional recommendation: Add integration tests for the new fact models to validate:

  • P90 calculations produce expected values for known datasets
  • Growth percentage calculations handle edge cases (zero/null previous revenue)
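One way to cover the zero/null previous-revenue edge case is a singular dbt test (file name and column names are illustrative), which fails if any rows are returned:

```sql
-- tests/assert_growth_null_without_prev_revenue.sql
select *
from {{ ref('fct_taxi_trips_quarterly_revenue') }}
where prev_year_revenue is null
  and growth is not null
```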

Summary of Required Changes

Must Fix Before Merge:

  1. Replace timestamp_diff() with Redshift's DATEDIFF() in fct_fhv_monthly_zone_traveltime_p90.sql:11
  2. Replace safe_divide() with CASE statement in fct_taxi_trips_quarterly_revenue.sql:38

Should Fix:

  1. Optimize window function in fct_fhv_monthly_zone_traveltime_p90.sql to avoid computing P90 per row
  2. Fix README copy-paste error ("dbt-bigquery" → "dbt-redshift")

Nice to Have:

  1. Add schema documentation for new fact models
  2. Consider data quality tests for zone lookup joins
  3. Verify percentile function choice (disc vs cont) is intentional

The refactor demonstrates strong dbt knowledge and good SQL practices overall. Once the BigQuery function references are fixed for Redshift compatibility, this will be a solid improvement to the codebase! 🚀

