
Comments

refactor/dbt-redshift#163

Open
iobruno wants to merge 6 commits into master from refactor/dbt-redshift

Conversation

Owner

@iobruno iobruno commented Feb 10, 2026

Summary

  • Updates staging models (stg_green_, stg_yellow_, stg_fhv_)
    • Replaces the deduplication CTE in stg_yellow/stg_green with QUALIFY on the main SELECT statement
    • Adds a deduplication step (with QUALIFY) to stg_fhv for consistency
  • Updates dimension models (dim_fhv_trips, dim_taxi_trips)
    • Extracts year, quarter, and month fields from pickup/dropoff timestamps
      to support downstream aggregations; shortens join aliases for readability
  • Adds core models:
    • fct_fhv_monthly_zone_traveltime_p90: computes the P90 travel time (in seconds)
      per pickup/dropoff zone pair per month for FHV trips, ranked by longest travel time
    • fct_taxi_trips_monthly_fare_p95: calculates P97, P95, and P90 fare percentiles
      by service type per month, filtering for valid cash/credit card trips
    • fct_taxi_trips_quarterly_revenue: aggregates quarterly revenue and trip counts
      by service type with year-over-year growth percentage
  • Fixes indentation issues across all models
  • Adds an exhaustive (else) clause to the resolve_payment_type_desc macro
  • Updates Dockerfile to use astral-sh/uv:python3.13-trixie-slim as base image for builder stage
    and python3.13-slim as base image for the runner stage
  • Updates README

Test plan

  • Run dbt build to validate all models compile and execute against Redshift
  • Verify the new fact models produce expected output for known date ranges
  • Confirm dim_fhv_trips and dim_taxi_trips still pass existing tests after the schema changes

🤖 Generated with Claude Code

* Update project dependencies
* Bumps dbt_utils to 1.3.3
* Fixes materialization config on dbt_project.yml to prevent warnings
* Updates the upper bound for Python (<3.14)
* `fct_fhv_monthly_zone_traveltime_p90`: computes the P90 travel time (in seconds)
   per pickup/dropoff zone pair per month for FHV trips, ranked by longest travel time

* `fct_taxi_trips_monthly_fare_p95`: calculates P97, P95, and P90 fare percentiles
   by service type per month, filtering for valid cash/credit card trips

* `fct_taxi_trips_quarterly_revenue`: aggregates quarterly revenue and trip counts
@iobruno iobruno force-pushed the refactor/dbt-redshift branch from b091e17 to 1f95544 on February 10, 2026 03:23
* Adds exhaustive clause (else) on resolve_payment_type_desc
* Minor update on var name
@iobruno iobruno force-pushed the master branch 3 times, most recently from 8adbdca to 153db80 on February 14, 2026 03:36
Owner Author

iobruno commented Feb 16, 2026

@claude code review it


claude bot commented Feb 16, 2026

Claude finished @iobruno's task in 2m 40s


Code Review Complete ✓

  • Read repository guidelines (CLAUDE.md)
  • Review new fact models
  • Review updated dimension models
  • Review updated staging models
  • Review macro changes
  • Review Dockerfile updates
  • Review README changes
  • Provide comprehensive feedback

Review Summary

Overall, this is a well-structured refactor that modernizes the dbt-redshift project with improved models, better deduplication, and updated dependencies. The code quality is high, and the changes are mostly solid. However, there are several issues that should be addressed before merging.


🔴 Critical Issues

1. Incorrect Function in fct_fhv_monthly_zone_traveltime_p90.sql

Location: models/core/fct_fhv_monthly_zone_traveltime_p90.sql:11

Issue: timestamp_diff() is a BigQuery function, not a Redshift one. Redshift uses DATEDIFF() or an epoch-seconds difference.

-- Current (WRONG for Redshift):
percentile_cont(timestamp_diff(dropoff_datetime, pickup_datetime, SECOND), 0.90)

-- Should be (Redshift's ordered-set syntax):
percentile_cont(0.90) within group (order by DATEDIFF(second, pickup_datetime, dropoff_datetime))
-- OR:
percentile_cont(0.90) within group (order by EXTRACT(EPOCH FROM dropoff_datetime) - EXTRACT(EPOCH FROM pickup_datetime))

This will cause runtime errors in Redshift.


2. Incorrect Function in fct_taxi_trips_quarterly_revenue.sql

Location: models/core/fct_taxi_trips_quarterly_revenue.sql:38

Issue: Using safe_divide() which is a BigQuery function. Redshift doesn't have this function.

-- Current (WRONG for Redshift):
round(safe_divide(revenue - prev_year_revenue, prev_year_revenue) * 100, 2) as growth

-- Should be:
round(CASE 
    WHEN prev_year_revenue = 0 OR prev_year_revenue IS NULL THEN NULL
    ELSE (revenue - prev_year_revenue) / prev_year_revenue * 100
END, 2) as growth

This will cause runtime errors.


🟡 Performance & Quality Issues

3. Inefficient Window Function Pattern in fct_fhv_monthly_zone_traveltime_p90.sql

Location: models/core/fct_fhv_monthly_zone_traveltime_p90.sql:11-12

Issue: The window function computes P90 for every row, then aggregates with max(timedelta_p90). This is wasteful.

Better approach:

with fhv_timedelta as (
    select
        pickup_year,
        pickup_month,
        pickup_zone,
        dropoff_zone,
        DATEDIFF(second, pickup_datetime, dropoff_datetime) as timedelta_seconds
    from {{ ref('dim_fhv_trips') }}
),

fhv_p90_timedelta as (
    select
        pickup_year as year,
        pickup_month as month,
        pickup_zone,
        dropoff_zone,
        count(1) as num_trips,
        percentile_cont(0.90) within group (order by timedelta_seconds) as timedelta_p90
    from fhv_timedelta
    group by pickup_year, pickup_month, pickup_zone, dropoff_zone
),

fhv_rnk_timedelta as (
    select
        *,
        dense_rank() over (partition by year, month, pickup_zone order by timedelta_p90 desc) as rnk
    from fhv_p90_timedelta
)

This computes P90 once per group instead of once per row, which is much more efficient.


4. Inconsistent Percentile Functions

Location: models/core/fct_taxi_trips_monthly_fare_p95.sql:11-13

Issue: Using percentile_disc() (discrete percentile - returns actual value from dataset) vs percentile_cont() in the FHV model (continuous percentile - interpolates).

Recommendation: Be consistent. For fare amounts, percentile_disc makes sense (returns actual fare). For time durations, percentile_cont makes sense (interpolated). Just ensure this is intentional and documented.
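For reference, the two forms behave differently in Redshift (table and column names here are illustrative):

```sql
-- percentile_disc returns an actual value present in the dataset:
select percentile_disc(0.95) within group (order by fare_amount) as fare_p95_disc
from trips;

-- percentile_cont interpolates between adjacent values:
select percentile_cont(0.95) within group (order by fare_amount) as fare_p95_cont
from trips;
```

On a column holding (10, 20, 30, 40), the 0.95 discrete percentile is 40 (a real row value), while the continuous percentile interpolates to 38.5.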


5. Case Sensitivity Issues in Column Names

Location: Multiple files, e.g., models/staging/stg_fhv_tripdata.sql:13,16,18-20

Issue: Inconsistent casing in source column references:

Affiliated_base_number   -- Mixed case
dropOff_datetime         -- Mixed case
PUlocationID            -- Mixed case

While Redshift is case-insensitive by default, mixed-case identifiers can cause issues if tables were created with quoted identifiers. Recommendation: Use consistent lowercase aliases throughout for maintainability.
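A minimal sketch of the lowercase-aliasing approach (source columns are taken from the list above; the source() reference is hypothetical):

```sql
select
    dropOff_datetime        as dropoff_datetime,
    PUlocationID            as pickup_location_id,
    Affiliated_base_number  as affiliated_base_number
from {{ source('staging', 'fhv_tripdata') }}
```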


6. Missing NULL Checks in dim_taxi_trips Join

Location: models/core/dim_taxi_trips.sql:56-59

Issue: Using INNER JOIN will silently drop trips with invalid location IDs (not in the zone lookup table). This might be intentional, but it's not obvious.

Recommendation:

  • If dropping invalid trips is intentional, add a comment explaining this
  • Consider adding a data quality test to monitor how many trips are dropped
  • Or use LEFT JOIN and handle nulls explicitly if you want to preserve all trips
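If preserving all trips is preferred, a LEFT JOIN with explicit null handling might look like this sketch (ref name and columns are illustrative, not the actual model):

```sql
select
    t.*,
    coalesce(pz.zone, 'Unknown') as pickup_zone,
    coalesce(dz.zone, 'Unknown') as dropoff_zone
from trips t
left join {{ ref('dim_zone_lookup') }} pz on t.pickup_location_id  = pz.location_id
left join {{ ref('dim_zone_lookup') }} dz on t.dropoff_location_id = dz.location_id
```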

🟢 Positive Changes

✅ Excellent: QUALIFY for Deduplication

The switch from CTEs to QUALIFY for deduplication in staging models is a great optimization for Redshift:

qualify row_number() over(partition by VendorID, tpep_pickup_datetime) = 1

This is more efficient than the previous CTE approach and is more readable.


✅ Good: Time Partitioning Fields

Adding pickup_year, pickup_quarter, pickup_month to dimension models (dim_fhv_trips.sql:28-30, dim_taxi_trips.sql:27-29) is excellent for supporting downstream aggregations without repeated date extractions.
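In Redshift, these fields can be derived once in the dimension model, roughly like this (the ref and surrounding column list are illustrative):

```sql
select
    extract(year    from pickup_datetime) as pickup_year,
    extract(quarter from pickup_datetime) as pickup_quarter,
    extract(month   from pickup_datetime) as pickup_month
    -- remaining trip columns elided
from {{ ref('stg_fhv_tripdata') }}
```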


✅ Good: Macro Enhancement

Adding the else clause in resolve_payment_type_desc.sql:15 ensures unknown payment types are handled gracefully. Nice defensive coding!
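As a sketch of the pattern (the actual macro body and payment-type mapping may differ), the exhaustive form looks like:

```sql
{% macro resolve_payment_type_desc(payment_type) %}
    case {{ payment_type }}
        when 1 then 'Credit card'
        when 2 then 'Cash'
        else 'Unknown'
    end
{% endmacro %}
```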


✅ Good: Dockerfile Modernization

Upgrading to Python 3.13 and using astral-sh/uv base image is a solid modernization. The multi-stage build is clean and efficient.


🔵 Minor Issues & Suggestions

7. Redundant Aliases in SELECT Statements

Location: Throughout new models (e.g., fct_fhv_monthly_zone_traveltime_p90.sql:36-42)

Issue: Aliasing columns to themselves is redundant:

select
    year                    as year,
    month                   as month,
    pickup_zone             as pickup_zone,

Recommendation: Either remove the aliases or keep them for consistency. I'd suggest keeping them for maintainability (makes column contract explicit), but it's a style choice.


8. Missing Documentation for Complex Models

Location: All new fact models

Issue: No schema documentation (.yml files) for the new fact models.

Recommendation: Add schema files with:

  • Column descriptions
  • Data quality tests (unique, not_null, relationships)
  • Model descriptions explaining business logic

Example:

version: 2
models:
  - name: fct_fhv_monthly_zone_traveltime_p90
    description: "P90 travel time by zone pair per month for FHV trips"
    columns:
      - name: timedelta_p90
        description: "90th percentile travel time in seconds"
        tests:
          - not_null

9. PR Description Claims P97/P95/P90 but Title Says P95

Location: PR summary vs model name

Minor inconsistency: The PR summary says the model "calculates P97, P95, and P90 fare percentiles" but the model is named fct_taxi_trips_monthly_fare_p95.sql. The model does calculate all three percentiles, so the name could be more accurate (e.g., fct_taxi_trips_monthly_fare_percentiles.sql), or the description should emphasize P95 as the primary metric.


10. README Still References BigQuery

Location: README.md:43

Issue: The README says "Set the environment variables for dbt-bigquery" but this is the Redshift adapter.

Should be:

3.2. Set the environment variables for `dbt-redshift`:

Minor copy-paste error that should be corrected.


📋 Test Plan Validation

The test plan in the PR description is appropriate:

  • ✅ Run dbt build - CRITICAL given the BigQuery function usage found
  • ✅ Verify new fact models - Especially important for the P90/P95/P97 calculations
  • ✅ Confirm dim models pass tests - Good to validate schema changes don't break downstream

Additional recommendation: Add integration tests for the new fact models to validate:

  • P90 calculations produce expected values for known datasets
  • Growth percentage calculations handle edge cases (zero/null previous revenue)
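One way to cover the zero/null previous-revenue edge case is a singular dbt test (file name and column names are illustrative), which fails if any rows are returned:

```sql
-- tests/assert_growth_null_without_prev_revenue.sql
select *
from {{ ref('fct_taxi_trips_quarterly_revenue') }}
where prev_year_revenue is null
  and growth is not null
```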

Summary of Required Changes

Must Fix Before Merge:

  1. Replace timestamp_diff() with Redshift's DATEDIFF() in fct_fhv_monthly_zone_traveltime_p90.sql:11
  2. Replace safe_divide() with CASE statement in fct_taxi_trips_quarterly_revenue.sql:38

Should Fix:

  1. Optimize window function in fct_fhv_monthly_zone_traveltime_p90.sql to avoid computing P90 per row
  2. Fix README copy-paste error ("dbt-bigquery" → "dbt-redshift")

Nice to Have:

  1. Add schema documentation for new fact models
  2. Consider data quality tests for zone lookup joins
  3. Verify percentile function choice (disc vs cont) is intentional

The refactor demonstrates strong dbt knowledge and good SQL practices overall. Once the BigQuery function references are fixed for Redshift compatibility, this will be a solid improvement to the codebase! 🚀

