A centralized, end-to-end data warehouse pipeline to ingest, transform, and analyse retail transaction data. Built for AcmeMart, a mid-sized retail and e-commerce company operating both physical stores and an online shopping platform. This project leverages a modern data stack to consolidate fragmented data sources into a single source of truth, enabling self-service analytics and proactive decision-making.
- AcmeMart Transaction Analytics
- Table of Contents
- Project Overview
- Business Context
- Architecture
- Tech Stack
- Project Structure
- Prerequisites
- Getting Started
- Running the Pipeline
- Data Models
- Environment Variables
- Learning Outcomes
- Contributing
- License
AcmeMart generates large volumes of transaction data daily across in-store and online channels. This data was previously fragmented across multiple source files in Google Drive with no unified schema, making consistent reporting difficult.
This project addresses those challenges by building a structured batch data pipeline that:
- Ingests raw CSV files from Google Drive into Snowflake using Airbyte on a daily schedule
- Transforms raw data through staged cleaning, typing, and renaming using dbt
- Models data into fact and dimension tables following a star schema
- Aggregates key business metrics - sales by product, store performance, customer behaviour trends
- Validates data quality through dbt tests at every layer
- Documents all models, columns, and lineage via the dbt docs site
- Centralised data warehouse - single source of truth for all retail transaction data
- Clean fact and dimension tables - structured, query-ready gold layer
- Aggregated datasets - pre-built summaries for reporting and dashboards
- Improved data quality - validated through automated dbt tests
- Faster insight generation - self-service SQL analytics on Snowflake
- Reduced manual reporting - replaces spreadsheet-based processes
Company: AcmeMart
Industry: Retail & E-commerce
Core services: In-store retail sales, online platform, customer management, supplier & product management
AcmeMart sells a wide range of consumer goods including groceries, household items, and personal care products. With growing transaction volume across multiple channels, the company needed a scalable data foundation to move from reactive reporting to proactive, data-driven decision-making.
Key challenges this project solves:
- Data silos - source files logically isolated in Google Drive with no integration layer
- Limited analytics capability - no unified view of sales by product, store, or customer
- Manual reporting - teams relying on spreadsheets causing delays and inconsistencies
- Poor data quality - inconsistent data types and formats across source files
- No scalable data model - no structured warehouse layer to support analytics
| Layer | Tool | Description |
|---|---|---|
| Source | Google Drive | Raw CSV files: transactions, customers, products, stores |
| Ingestion | Airbyte | Batch connector syncing CSVs to Snowflake BRONZE schema daily |
| Bronze | Snowflake | Raw landing zone - untransformed, all columns as VARCHAR |
| Staging | dbt | Cleaned, typed, renamed, deduplicated models |
| Gold | dbt | Fact and dimension tables following star schema |
| Aggregates | dbt | Pre-summarised metrics: sales by store, product, customer |
| Validation | dbt tests | not_null, unique, accepted_values, referential integrity |
| Documentation | dbt docs | Auto-generated lineage graph and data dictionary |
| Version control | Git / GitHub | All dbt models, tests, and schema YAML files |
| Tool | Purpose |
|---|---|
| Google Drive | Source file storage (CSV) |
| Airbyte | Data ingestion - Google Drive → Snowflake |
| Snowflake | Cloud data warehouse |
| dbt | Data transformation, modelling, testing, documentation |
| Python | Scripting and pipeline utilities |
| Git / GitHub | Version control |
acmemart-dbt/
├── models/
│ ├── staging/
│ │ ├── stg_transactions.sql
│ │ ├── stg_customers.sql
│ │ ├── stg_products.sql
│ │ ├── stg_stores.sql
│ │ └── schema.yml
│ ├── gold/
│ │ ├── fct_sales.sql
│ │ ├── dim_customers.sql
│ │ ├── dim_products.sql
│ │ ├── dim_stores.sql
│ │ └── schema.yml
│ └── aggregates/
│ ├── agg_sales_by_store.sql
│ ├── agg_sales_by_product.sql
│ ├── agg_customer_summary.sql
│ └── schema.yml
├── macros/
│ └── clean_string.sql
├── tests/
│ └── assert_positive_amounts.sql
├── dbt_project.yml
├── packages.yml
├── profiles.yml.example ← template only, never commit real credentials
├── .env.example
├── .gitignore
└── README.md
Ensure the following are in place before proceeding:
- Snowflake account (free trial available)
- Airbyte Cloud account or self-hosted Airbyte instance
- Google Drive folder containing the source CSV files
- dbt Core installed locally (
pip install dbt-snowflake) - Python 3.9+ with pip
- Git
- dbt-snowflake
(installed via
requirements.txtin Step 5)
Verify your dbt installation:
dbt --versiongit clone https://github.com/samuelede/AcmeMart-Transaction-Analytics.git
cd acmemart-transaction-analyticsCopy the example env file and populate it with your credentials:
cp .env.example .envEdit .env with your Snowflake and Airbyte details:
SNOWFLAKE_ACCOUNT=your_account_identifier
SNOWFLAKE_USER=your_username
SNOWFLAKE_PASSWORD=your_password
SNOWFLAKE_WAREHOUSE=COMPUTE_WH
SNOWFLAKE_DATABASE=ACMEMART_DW
SNOWFLAKE_SCHEMA=BRONZE
SNOWFLAKE_ROLE=TRANSFORMER_ROLE
Important: Never commit your
.envfile to version control. It is listed in.gitignoreby default.
Run the following in a Snowflake worksheet to create the required objects:
-- Create database and schemas
CREATE DATABASE IF NOT EXISTS ACMEMART_DW;
CREATE SCHEMA IF NOT EXISTS ACMEMART_DW.BRONZE;
CREATE SCHEMA IF NOT EXISTS ACMEMART_DW.STAGING;
CREATE SCHEMA IF NOT EXISTS ACMEMART_DW.GOLD;
-- Create a dedicated role and user for dbt
CREATE ROLE TRANSFORMER_ROLE;
CREATE USER DBT_USER
PASSWORD = '<strong-password>'
DEFAULT_ROLE = TRANSFORMER_ROLE;
GRANT ROLE TRANSFORMER_ROLE TO USER DBT_USER;
GRANT USAGE ON WAREHOUSE COMPUTE_WH TO ROLE TRANSFORMER_ROLE;
GRANT ALL ON DATABASE ACMEMART_DW TO ROLE TRANSFORMER_ROLE;
GRANT ALL ON ALL SCHEMAS IN DATABASE ACMEMART_DW TO ROLE TRANSFORMER_ROLE;1. In Airbyte → Sources → New Source
2. Connector : Google Drive
3. Name : AcmeMart - Google Drive
4. Auth : paste service account JSON key
5. Folder URL : paste the shared Drive folder URL
6. File type : CSV
7. Click "Test and Save" → confirm green checkmark
1. In Airbyte → Destinations → New Destination
2. Connector : Snowflake
3. Name : AcmeMart - Snowflake
4. Credentials : use values from your .env file
5. Database : ACMEMART_DW
6. Default schema : BRONZE
7. Click "Test and Save" → confirm green checkmark
1. In Airbyte → Connections → New Connection
2. Source : AcmeMart - Google Drive
3. Destination : AcmeMart - Snowflake
Airbyte automatically detects all store CSV files from the Drive folder
and combines them into a single stream called csv_sales.
In the Select streams step, configure the stream as follows:
| Setting | Value |
|---|---|
| Stream name | csv_sales |
| Sync mode | Incremental | Append + Deduped |
| Primary key | transaction_id, store_id, customer_id |
| Cursor field | _ab_source_file_last_modified |
How to set the primary key:
1. Click the Fields column (13/13) on the csv_sales row
2. In the field panel, tick the primary key checkbox for:
✅ transaction_id
✅ store_id
✅ customer_id
3. Confirm cursor is set to: _ab_source_file_last_modified
4. Click Next → Configure connection
The cursor field
_ab_source_file_last_modifiedtells Airbyte which files in Google Drive are new since the last sync. Only new or updated store files are pulled on each run — not the entire folder every time.
1. Sync schedule : Every 24 hours
2. Destination namespace : ACMEMART_DW.BRONZE
3. Normalisation : Off (dbt handles all transformation)
4. Click "Set up connection"
Trigger a manual sync and confirm it succeeds:
1. Click "Sync Now" on the connection page
2. Wait for status → Succeeded
3. In Snowflake, verify data landed:
SELECT COUNT(*) FROM ACMEMART_DW.BRONZE.CSV_SALES;Screenshot the Airbyte dashboard showing Succeeded status and the Snowflake query result showing a non-zero row count. These are your Week 1 deliverables.
pip install -r requirements.txtVerify the installation:
dbt --versionCreate the .dbt folder and copy the template in one command:
mkdir -p ~/.dbt && cp profiles.yml.example ~/.dbt/profiles.ymlOpen the file and fill in your Snowflake credentials:
Windows (Git Bash):
notepad "C:/Users/<YourUsername>/.dbt/profiles.yml"Mac / Linux:
open ~/.dbt/profiles.yml
profiles.ymlmust never be committed to version control. Onlyprofiles.yml.examplelives in the repository. The real file is excluded via.gitignore.
dbt debugA successful connection ends with:
Connection test: [OK connection ok]
All checks passed!
dbt depsRun a manual sync from the Airbyte UI, or wait for the scheduled daily run. Verify data landed in Snowflake:
-- Verify Airbyte sync landed data in Snowflake
SELECT
COUNT(*) AS total_rows,
COUNT(DISTINCT transaction_id) AS unique_transactions,
COUNT(DISTINCT store_id) AS stores,
COUNT(DISTINCT product_id) AS products,
COUNT(DISTINCT customer_id) AS customers,
MIN(transaction_timestamp) AS earliest_transaction,
MAX(transaction_timestamp) AS latest_transaction
FROM ACMEMART_DW.BRONZE.CSV_SALES;# Run all models (staging → gold → aggregates)
dbt run
# Run a specific layer only
dbt run --select staging
dbt run --select gold
dbt run --select aggregates# Run all tests
dbt test
# Run tests for a specific model
dbt test --select fct_salesdbt docs generate
dbt docs serve --port 8081Open http://localhost:8081 in your browser to view the lineage graph and data dictionary.
If port 8081 is blocked on Windows: try
--port 8888or--port 8000. If all ports fail, run Git Bash as Administrator and retry.
| Model | Source | Description |
|---|---|---|
stg_transactions |
BRONZE.TRANSACTIONS |
Cleaned transaction records - typed, deduplicated |
stg_customers |
BRONZE.CUSTOMERS |
Cleaned customer profiles |
stg_products |
BRONZE.PRODUCTS |
Cleaned product catalogue |
stg_stores |
BRONZE.STORES |
Cleaned store reference data |
| Model | Type | Description |
|---|---|---|
fct_sales |
Fact | One row per transaction line item with foreign keys to all dimensions |
dim_customers |
Dimension | Customer profiles with loyalty tier and registration data |
dim_products |
Dimension | Product catalogue with category, brand, pricing |
dim_stores |
Dimension | Store details including type (physical / online), region |
| Model | Description |
|---|---|
agg_sales_by_store |
Total revenue, transaction count, and average order value per store |
agg_sales_by_product |
Units sold, revenue, and margin per product and category |
agg_customer_summary |
Purchase frequency, total spend, and recency per customer |
The following variables are required in your .env file:
| Variable | Description | Example |
|---|---|---|
SNOWFLAKE_ACCOUNT |
Snowflake account identifier | xyz12345.us-east-1 |
SNOWFLAKE_USER |
Snowflake username | dbt_user |
SNOWFLAKE_PASSWORD |
Snowflake password | securepassword |
SNOWFLAKE_WAREHOUSE |
Compute warehouse name | COMPUTE_WH |
SNOWFLAKE_DATABASE |
Target database | ACMEMART_DW |
SNOWFLAKE_SCHEMA |
Default schema | BRONZE |
SNOWFLAKE_ROLE |
Role with appropriate grants | TRANSFORMER_ROLE |
Working through this project builds practical skills in:
- Data engineering - designing and operating batch ETL pipelines with Airbyte and Snowflake
- Data modelling - applying star schema principles to build fact and dimension tables
- dbt - writing staging, gold, and aggregate models; applying tests; generating documentation
- Data quality - implementing automated validation using dbt's built-in and custom tests
- Warehouse design - structuring bronze / staging / gold layers in Snowflake
- Version control - managing a dbt project with Git feature branches and pull requests
Contributions are welcome. If you'd like to improve the models, fix a bug, or extend the project with new features, please follow the steps below.
- Fork the repository - click the Fork button at the top of the GitHub page
- Create a feature branch from
main:
git checkout -b feature/your-feature-name- Make your changes and ensure all dbt tests pass before committing:
dbt run && dbt test- Commit with a clear message:
git commit -m "feat: describe what your change does"- Push your branch:
git push origin feature/your-feature-name- Open a Pull Request against
mainwith a description of what you changed and why
- Keep pull requests focused - one feature or fix per PR
- Follow the existing model naming conventions (
stg_,fct_,dim_,agg_) - Add or update
schema.ymltests and column descriptions for any new models - Do not commit
.envfiles or any credentials
Found a bug or have a suggestion? Open an issue on the Issues page with as much detail as possible, including steps to reproduce if applicable.
This project is licensed under the MIT License