This project defines various DBT models, snapshots, and associated transformations for managing and analyzing Airbnb listings, hosts, and reviews data. It provides an easy-to-follow structure for data transformation, historical tracking of listings, and analytics, using a combination of fact tables, dimension tables, and snapshot tables.
- Fact Tables: Store transactional or event-level data, like reviews and listings.
- Dimension Tables: Store descriptive data about entities, like hosts and listings.
- Snapshot Tables: Track historical changes to listings with SCD (Slowly Changing Dimensions).
- Data Quality: Includes tests for uniqueness, non-null values, and relationships between tables.
This project leverages BigQuery (sandbox) as the data warehouse for storing Airbnb data. The CSV files containing data about Airbnb listings, hosts, and reviews were uploaded into BigQuery to serve as the source for our data models.
BigQuery was chosen for its scalability and integration with modern data transformation tools like DBT, making it an ideal platform for handling large datasets and complex analytical workflows.
- Data Upload: The raw CSV files were uploaded into BigQuery using the BigQuery Web UI or command line interface (CLI), where they are stored as tables under a specific project.
- BigQuery Sandbox: The sandbox environment provides a free-tier option for users to experiment and run small queries, ideal for this project as it doesn't incur additional costs unless the usage exceeds the free limits.
DBT (Data Build Tool) is used for transforming the raw data into structured tables that can be used for analysis. The following describes the DBT integration process:
-
DBT Setup:
- Connect DBT to BigQuery using the BigQuery adapter.
- DBT models are defined in
.sqlfiles and placed in themodels/directory. - DBT transformations run SQL queries to transform raw data into usable tables (e.g.,
dim_listings_cleansed,fct_reviews,dim_hosts).
-
DBT Models:
- DBT models are used to transform raw CSV data into cleaned, dimensioned, and fact tables in BigQuery.
- The transformations (SQL queries) are run within BigQuery, and DBT will create or update tables as needed.
- The transformation process handles things like data cleaning, normalization, and ensuring relationships between tables (e.g., between
dim_listingsanddim_hosts).
-
Running DBT:
- Run DBT with the command
dbt runto trigger the transformation process. - DBT will automatically apply tests, build tables, and execute transformations defined in the project, ensuring data quality and integrity.
- Run DBT with the command
-
DBT Models in BigQuery:
- The final models are stored as tables in BigQuery and can be queried for reporting, analysis, or further transformations.
DBT ensures that the data remains consistent and accurately reflects the latest transformations, making it easier to maintain and scale your data pipeline over time.
Dagster is used as a data orchestration tool to schedule and manage data workflows across the entire pipeline. It integrates seamlessly with both DBT and BigQuery, ensuring that data processing tasks are executed in a reliable and efficient manner.
-
Data Pipelines:
- Dagster manages the workflow for loading, transforming, and analyzing the Airbnb data.
- The pipeline orchestrates the tasks by defining solid (the smallest unit of work) and pipeline (a set of solids that run together).
-
Scheduling and Monitoring:
- Dagster schedules DBT runs and ensures that transformations occur in the correct sequence.
- It monitors the status of each task, ensuring that transformations are executed successfully or retried in case of failures.
- Dagster provides real-time monitoring and logs for task execution, so users can track the progress of the data pipeline and quickly address any issues.
-
Integration with DBT:
- Dagster triggers DBT models as part of the pipeline to perform the data transformations. It also handles dependencies between models and schedules the execution of DBT commands (
dbt run,dbt test) automatically.
- Dagster triggers DBT models as part of the pipeline to perform the data transformations. It also handles dependencies between models and schedules the execution of DBT commands (
-
Handling Failures:
- Dagster ensures the resiliency of the data pipeline by retrying failed tasks and notifying the relevant stakeholders.
-
Data Quality:
- Dagster can integrate data quality checks into the pipeline, ensuring that only validated and transformed data is available for use in BigQuery.
By using Dagster to orchestrate the data workflow, the process becomes fully automated, ensuring timely and reliable data transformations and analysis.
Once the data is transformed in BigQuery, the next step is to visualize the data using Power BI. Here’s how Power BI integrates with the data warehouse and retrieves the transformed data:
-
Live Connection to BigQuery:
- Power BI allows you to connect directly to BigQuery via the BigQuery connector in Power BI Desktop.
- A live connection means that Power BI queries BigQuery directly in real-time, ensuring that your reports and dashboards are always up to date with the latest data.
-
Data Mart:
- In this case, the
mart_fullmoon_reviewstable is used as a data mart in Power BI. This table contains aggregated and transformed data, ready for analysis. - Power BI will query this table for any real-time reports, like sentiment analysis of reviews and full moon-related insights.
- In this case, the
-
Data Refresh:
- Since Power BI connects live to BigQuery, there’s no need for regular data refreshes; any updates made in BigQuery are automatically reflected in Power BI.
-
Power BI Dashboard:
- Power BI visualizations (such as charts, tables, and KPIs) can be created on top of the
mart_fullmoon_reviewsdata, allowing for interactive analysis. - Examples of insights you can visualize include:
- Review sentiments over time.
- Full moon nights' impact on reviews.
- Listing popularity and performance based on reviews.
- Power BI visualizations (such as charts, tables, and KPIs) can be created on top of the
By using a live connection, Power BI ensures that decision-makers have the most up-to-date insights without worrying about manual data updates.
Description: Contains details of Airbnb listings, including metadata like listing_id, listing_name, room_type, minimum_nights, and pricing information.
| Column Name | Data Type | Description | Data Tests |
|---|---|---|---|
| listing_id | int64 | The primary key for this table | not_null, unique |
| listing_url | string | URL of the listing | |
| listing_name | string | Name of the listing | |
| listing_room_type | string | Type of room available (e.g., entire home) | |
| minimum_nights | int64 | Minimum number of nights to book | |
| host_id | int64 | ID of the host who owns the listing | |
| listing_price | int64 | Price per night of the listing | |
| created_at | timestamp | Timestamp of when the listing was created | |
| updated_at | timestamp | Last updated timestamp for the listing |
Description: Contains details of Airbnb hosts, including host_id, which serves as the primary key.
| Column Name | Data Type | Description | Data Tests |
|---|---|---|---|
| host_id | int64 | The primary key for this table | unique, not_null |
Description: Stores guest reviews of Airbnb listings. Each review is tied to a specific listing_id.
| Column Name | Data Type | Description | Data Tests |
|---|---|---|---|
| listing_id | int64 | The primary key for this table | not_null |
Description: A dimension table for hosts, containing descriptive attributes such as host_id, host_name, and is_superhost.
| Column Name | Data Type | Description | Data Tests |
|---|---|---|---|
| host_id | int64 | Unique host ID | |
| host_name | string | Name of the host | |
| is_superhost | boolean | Whether the host is a superhost | |
| created_at | timestamp | Host account creation timestamp | |
| updated_at | timestamp | Last update timestamp for host information |
Description: Contains cleansed listings data, ensuring data integrity and applying transformations (e.g., minimum nights, room type).
| Column Name | Data Type | Description | Data Tests |
|---|---|---|---|
| listing_id | int64 | Unique listing ID | unique, not_null |
| listing_name | string | Listing name | |
| listing_room_type | string | Room type (Apartment, House, etc.) | accepted_values: ['Entire home/apt', 'Private room', 'Shared room', 'Hotel room'] |
| minimum_nights | int64 | Minimum nights for booking | positive_value |
| host_id | int64 | Unique ID of associated host | not_null, relationships: dim_hosts(host_id) |
| price | numeric | Price per night (Euros) | |
| created_at | timestamp | Timestamp of listing creation | |
| updated_at | timestamp | Last update timestamp |
Description: A joined table combining listings and hosts data for enriched analysis.
| Column Name | Data Type | Description | Data Tests |
|---|---|---|---|
| listing_id | int64 | Unique listing ID | |
| listing_name | string | Listing name | |
| listing_room_type | string | Type of room available | |
| minimum_nights | int64 | Minimum nights to book | |
| price | numeric | Price per night | dbt_expectations.expect_column_max_to_be_between(max_value=5000) |
| host_id | int64 | Host ID | |
| host_name | string | Host name | |
| is_superhost | boolean | Whether the host is a superhost | |
| created_at | timestamp | Timestamp of listing creation | |
| updated_at | timestamp | Last update timestamp |
Description: A fact table containing guest reviews for listings.
| Column Name | Data Type | Description | Data Tests |
|---|---|---|---|
| listing_id | int64 | ID of the listing being reviewed | |
| review_date | timestamp | Date when the review was posted | |
| reviewer_name | string | Name of the reviewer | |
| review_text | string | The text of the review | |
| sentiment | string | Sentiment of the review ('Positive', 'Negative', 'Neutral') |
Description: A data mart table based on fct_reviews enriched with a column indicating whether the review date was a full moon night.
| Column Name | Data Type | Description | Data Tests |
|---|---|---|---|
| listing_id | int64 | Listing ID being reviewed | |
| review_date | timestamp | Date of review | |
| reviewer_name | string | Name of the reviewer | |
| review_text | string | Review text | |
| sentiment | string | Sentiment of the review | |
| is_full_moon | string | Whether the review date was a full moon |
Description: Snapshot table that tracks historical changes to the Raw_Listings table. Every time a field changes, a new record is created with a new updated_at timestamp.
| Column Name | Data Type | Description | Data Tests |
|---|---|---|---|
| id | int64 | Listing ID | |
| listing_url | string | URL of the listing | |
| name | string | Name of the listing | |
| room_type | string | Type of room | |
| minimum_nights | int64 | Minimum nights | |
| host_id | int64 | Host ID | |
| price | int64 | Price per night | |
| created_at | timestamp | Created timestamp | |
| updated_at | timestamp | Last update timestamp | |
| dbt_scd_id | string | DBT internal hashed ID | |
| dbt_updated_at | timestamp | DBT update timestamp | |
| dbt_valid_from | timestamp | Valid from timestamp | |
| dbt_valid_to | timestamp | Valid to timestamp |

