Skip to content

akshay-gera/dbt_bigquery_project

Repository files navigation

Airbnb Data Pipeline Project

This project defines various DBT models, snapshots, and associated transformations for managing and analyzing Airbnb listings, hosts, and reviews data. It provides an easy-to-follow structure for data transformation, historical tracking of listings, and analytics, using a combination of fact tables, dimension tables, and snapshot tables.

Table of Contents

Features:

  • Fact Tables: Store transactional or event-level data, like reviews and listings.
  • Dimension Tables: Store descriptive data about entities, like hosts and listings.
  • Snapshot Tables: Track historical changes to listings with SCD (Slowly Changing Dimensions).
  • Data Quality: Includes tests for uniqueness, non-null values, and relationships between tables.

BigQuery Integration

This project leverages BigQuery (sandbox) as the data warehouse for storing Airbnb data. The CSV files containing data about Airbnb listings, hosts, and reviews were uploaded into BigQuery to serve as the source for our data models.

BigQuery was chosen for its scalability and integration with modern data transformation tools like DBT, making it an ideal platform for handling large datasets and complex analytical workflows.

  • Data Upload: The raw CSV files were uploaded into BigQuery using the BigQuery Web UI or command line interface (CLI), where they are stored as tables under a specific project.
  • BigQuery Sandbox: The sandbox environment provides a free-tier option for users to experiment and run small queries, ideal for this project as it doesn't incur additional costs unless the usage exceeds the free limits.

DBT Transformation Process

DBT (Data Build Tool) is used for transforming the raw data into structured tables that can be used for analysis. The following describes the DBT integration process:

  1. DBT Setup:

    • Connect DBT to BigQuery using the BigQuery adapter.
    • DBT models are defined in .sql files and placed in the models/ directory.
    • DBT transformations run SQL queries to transform raw data into usable tables (e.g., dim_listings_cleansed, fct_reviews, dim_hosts).
  2. DBT Models:

    • DBT models are used to transform raw CSV data into cleaned, dimensioned, and fact tables in BigQuery.
    • The transformations (SQL queries) are run within BigQuery, and DBT will create or update tables as needed.
    • The transformation process handles things like data cleaning, normalization, and ensuring relationships between tables (e.g., between dim_listings and dim_hosts).
  3. Running DBT:

    • Run DBT with the command dbt run to trigger the transformation process.
    • DBT will automatically apply tests, build tables, and execute transformations defined in the project, ensuring data quality and integrity.
  4. DBT Models in BigQuery:

    • The final models are stored as tables in BigQuery and can be queried for reporting, analysis, or further transformations.

DBT ensures that the data remains consistent and accurately reflects the latest transformations, making it easier to maintain and scale your data pipeline over time.

Dagster Orchestration

Dagster is used as a data orchestration tool to schedule and manage data workflows across the entire pipeline. It integrates seamlessly with both DBT and BigQuery, ensuring that data processing tasks are executed in a reliable and efficient manner.

How Dagster Works in This Project:

  1. Data Pipelines:

    • Dagster manages the workflow for loading, transforming, and analyzing the Airbnb data.
    • The pipeline orchestrates the tasks by defining solid (the smallest unit of work) and pipeline (a set of solids that run together).
  2. Scheduling and Monitoring:

    • Dagster schedules DBT runs and ensures that transformations occur in the correct sequence.
    • It monitors the status of each task, ensuring that transformations are executed successfully or retried in case of failures.
    • Dagster provides real-time monitoring and logs for task execution, so users can track the progress of the data pipeline and quickly address any issues.
  3. Integration with DBT:

    • Dagster triggers DBT models as part of the pipeline to perform the data transformations. It also handles dependencies between models and schedules the execution of DBT commands (dbt run, dbt test) automatically.
  4. Handling Failures:

    • Dagster ensures the resiliency of the data pipeline by retrying failed tasks and notifying the relevant stakeholders.
  5. Data Quality:

    • Dagster can integrate data quality checks into the pipeline, ensuring that only validated and transformed data is available for use in BigQuery.

By using Dagster to orchestrate the data workflow, the process becomes fully automated, ensuring timely and reliable data transformations and analysis.

Power BI Integration

Once the data is transformed in BigQuery, the next step is to visualize the data using Power BI. Here’s how Power BI integrates with the data warehouse and retrieves the transformed data:

  1. Live Connection to BigQuery:

    • Power BI allows you to connect directly to BigQuery via the BigQuery connector in Power BI Desktop.
    • A live connection means that Power BI queries BigQuery directly in real-time, ensuring that your reports and dashboards are always up to date with the latest data.
  2. Data Mart:

    • In this case, the mart_fullmoon_reviews table is used as a data mart in Power BI. This table contains aggregated and transformed data, ready for analysis.
    • Power BI will query this table for any real-time reports, like sentiment analysis of reviews and full moon-related insights.
  3. Data Refresh:

    • Since Power BI connects live to BigQuery, there’s no need for regular data refreshes; any updates made in BigQuery are automatically reflected in Power BI.
  4. Power BI Dashboard:

    • Power BI visualizations (such as charts, tables, and KPIs) can be created on top of the mart_fullmoon_reviews data, allowing for interactive analysis.
    • Examples of insights you can visualize include:
      • Review sentiments over time.
      • Full moon nights' impact on reviews.
      • Listing popularity and performance based on reviews.

By using a live connection, Power BI ensures that decision-makers have the most up-to-date insights without worrying about manual data updates.

image

image

Models

src_listings

Description: Contains details of Airbnb listings, including metadata like listing_id, listing_name, room_type, minimum_nights, and pricing information.

Column Name Data Type Description Data Tests
listing_id int64 The primary key for this table not_null, unique
listing_url string URL of the listing
listing_name string Name of the listing
listing_room_type string Type of room available (e.g., entire home)
minimum_nights int64 Minimum number of nights to book
host_id int64 ID of the host who owns the listing
listing_price int64 Price per night of the listing
created_at timestamp Timestamp of when the listing was created
updated_at timestamp Last updated timestamp for the listing

src_hosts

Description: Contains details of Airbnb hosts, including host_id, which serves as the primary key.

Column Name Data Type Description Data Tests
host_id int64 The primary key for this table unique, not_null

src_reviews

Description: Stores guest reviews of Airbnb listings. Each review is tied to a specific listing_id.

Column Name Data Type Description Data Tests
listing_id int64 The primary key for this table not_null

dim_hosts

Description: A dimension table for hosts, containing descriptive attributes such as host_id, host_name, and is_superhost.

Column Name Data Type Description Data Tests
host_id int64 Unique host ID
host_name string Name of the host
is_superhost boolean Whether the host is a superhost
created_at timestamp Host account creation timestamp
updated_at timestamp Last update timestamp for host information

dim_listings_cleansed

Description: Contains cleansed listings data, ensuring data integrity and applying transformations (e.g., minimum nights, room type).

Column Name Data Type Description Data Tests
listing_id int64 Unique listing ID unique, not_null
listing_name string Listing name
listing_room_type string Room type (Apartment, House, etc.) accepted_values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room']
minimum_nights int64 Minimum nights for booking positive_value
host_id int64 Unique ID of associated host not_null, relationships: dim_hosts(host_id)
price numeric Price per night (Euros)
created_at timestamp Timestamp of listing creation
updated_at timestamp Last update timestamp

dim_listings_with_hosts

Description: A joined table combining listings and hosts data for enriched analysis.

Column Name Data Type Description Data Tests
listing_id int64 Unique listing ID
listing_name string Listing name
listing_room_type string Type of room available
minimum_nights int64 Minimum nights to book
price numeric Price per night dbt_expectations.expect_column_max_to_be_between(max_value=5000)
host_id int64 Host ID
host_name string Host name
is_superhost boolean Whether the host is a superhost
created_at timestamp Timestamp of listing creation
updated_at timestamp Last update timestamp

fct_reviews

Description: A fact table containing guest reviews for listings.

Column Name Data Type Description Data Tests
listing_id int64 ID of the listing being reviewed
review_date timestamp Date when the review was posted
reviewer_name string Name of the reviewer
review_text string The text of the review
sentiment string Sentiment of the review ('Positive', 'Negative', 'Neutral')

mart_fullmoon_reviews

Description: A data mart table based on fct_reviews enriched with a column indicating whether the review date was a full moon night.

Column Name Data Type Description Data Tests
listing_id int64 Listing ID being reviewed
review_date timestamp Date of review
reviewer_name string Name of the reviewer
review_text string Review text
sentiment string Sentiment of the review
is_full_moon string Whether the review date was a full moon

Snapshots

scd_raw_listings

Description: Snapshot table that tracks historical changes to the Raw_Listings table. Every time a field changes, a new record is created with a new updated_at timestamp.

Column Name Data Type Description Data Tests
id int64 Listing ID
listing_url string URL of the listing
name string Name of the listing
room_type string Type of room
minimum_nights int64 Minimum nights
host_id int64 Host ID
price int64 Price per night
created_at timestamp Created timestamp
updated_at timestamp Last update timestamp
dbt_scd_id string DBT internal hashed ID
dbt_updated_at timestamp DBT update timestamp
dbt_valid_from timestamp Valid from timestamp
dbt_valid_to timestamp Valid to timestamp

About

End to End Data Pipeline Project on Google Cloud Warehouse, DBT Data Modelling and Power BI Data Visualization

Topics

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors