Repo: github.com/mosesobeng/qa_pipeline
This file provides a top-level overview of the entire solution, referencing detailed documentation and code.
Below is a high-level project layout:
- airflow/:
- DAGs:
ingestion_gdrive_to_gcs.py: Ingest Assessment ZIP data from Google Drive to GCS.ingestion_quality_assurance.py: Load raw CSV to BigQuery.ingestion_customer_subscriptions.py: JSON ingestion to BigQuery.transformation_data_warehouse.py: Orchestrates dbt transformations.- dbt/:
models/marts/*: Dimensions and Fact SQL models (e.g.dim_team.sql,fact_autoqa_reviews.sql).models/staging/*: Staging references to raw tables (e.g.stg_autoqa_reviews.sql).schema.yml,sources.yml: Testing & source definitions.
- DAGs:
- diagram/:
erd_non_modeled.mmd,erd_modeled.mmd: Mermaid files for the ERDs.
- docs/:
images/*: Rendered images (ERDs, pipeline diagrams).
- scripts/:
*.sql: sql queries.
Technology Stack and Flow Overview
The assessment source zip file is ingested from the orginal google drive path and placed in a GCS bucket (the “raw” layer). Then load into zendesk-assessment.raw BigQuery dataset with minimal changes. Optionally transform or refine data in the zendesk-assessment.refined dataset (silver). Final dimensional or fact tables live in zendesk-assessment.dw (gold).
ingestion_gdrive_to_gcs.py: A single-step PythonOperator to download & unzip.ingestion_quality_assurance.py: Parallel load multiple csv from GCS → BigQuery.ingestion_customer_subscriptions.py: Incremental ingests JSONetl.jsondata.transformation_data_warehouse.py: Runs dbt to incrementally build models for the gold layer
For more detail, see: Doc: Data Ingestion
The data modeling process in this project is divided into two main phases:
-
Preliminary Analysis of Relationships:
- Explores relationships between six base tables without applying dimensional modeling techniques.
- Key relationships include:
autoqa_reviews↔autoqa_ratings↔autoqa_root_causeautoqa_reviews↔conversationsconversations↔conversation_messagesandconversation_surveys
- An Entity-Relationship Diagram (ERD) for this phase is available in the non-modeled ERD and non-modeled ERD code.
-
Dimensional Modeling:
- Optimizes the structure for analytics by organizing data into fact and dimension tables.
- A second ERD, reflecting the dimensional model, is provided in the modeled ERD and modeled ERD code.
The Mermaid code for this model lives in
diagram/erd_modeled.mmd.
For more detail, see: Doc: Data Modeling
I rely on incremental logic with unique_key in each dimension/fact. This allows partial loads, merges new columns, and handles schema changes. I test each model with not_null, unique, and relationship constraints in schema.yml.
The dbt code for this lives in
airflow/dags/dbt/models/martsand airflow dag isairflow/dags/transformation_data_warehouse.py.
For more detail, see: Doc: Data Model Implementation
SQL related queries for task A3 are found in Dir: Scripts
One DAG processes the nested JSON (etl.json) from GCS. I flatten it with pd.json_normalize for simplicity. More advanced usage might split out sub-objects into separate tables. The pipeline also merges records incrementally, ensuring only changed or new data is updated.
For more detail, see: Doc: ETL JSON Data Pipeline
- The pipeline uses
Airflow+dbt Core+BigQueryin a layered approach, promoting modularity. - Containerization (
Docker,Kubernetes) ensures scalable deployment and avoids cloud lock-in. - Proper partitioning and strong data governance practices help keep the architecture resilient and cost-effective as data volumes rise.
- Container-based ML pipelines can ingest refined data, train models, and serve predictions in a flexible and cloud-agnostic manner.
- Potential solutions include sentiment analysis models, predictive scoring, and time series forecasting—integrated seamlessly with existing transformations and dashboards.
For more detail, see: Doc: Future Consideration
- Dashboards was built in
Tableau Publicagainst the Gold BigQuery tables for streamlined access to curated, business-ready data. - Visualizations focus on ticket scores, reviewer performance, root causes, and time-based trends.
- Documentation captures the rationale behind each metric and chart type.
For more detail, see: Doc: Dashboarding
- Airflow:
- Deploy the DAGs (
ingestion_*.py,transformation_data_warehouse.py). - Supply a GCS bucket name and BigQuery credentials.
- Deploy the DAGs (
- dbt:
cd airflow/dags/dbt/, adjustprofiles.ymlto match BigQuery project, rundbt run --full-refresh.
- Validation:
- Use
dbt test. - Check results in the BigQuery console (
zendesk-assessment.dwdataset).
- Use


