A Google Cloud Solutions Engineer independent study: take a retail star-schema dataset and drive it end-to-end through pre-sales discovery, a two-track workshop sequence (Business Analytics + Data Science), a signed-off KPI contract, a ranked ML experiment backlog, a GCP reference architecture, and a pitch deck.
Read first: ROLE_AND_SCOPE.md — external SE, fictional retailer as customer, GCP only.
- Role & data —
ROLE_AND_SCOPE.md,EXPLORATION.md. - Local lab —
STEP_BY_STEP.md: Docker MySQL → load star schema → run SQL. - Week-2 BA workshop —
workshops/01_BA_WORKSHOP_BRIEF.md→workshops/02_BA_WORKSHOP_FINDINGS.md→ signed-offworkshops/03_PROBLEM_AND_KPIS.md. - Week-2 DS workshop (parallel track) —
workshops/04_DS_WORKSHOP_BRIEF.md→workshops/05_DS_WORKSHOP_FINDINGS.md→ committedworkshops/06_ML_HYPOTHESES_AND_EXPERIMENTS.md. - GCP reference architecture —
docs/CLOUD_ARCHITECTURE.md: Cloud SQL pilot + BigQuery/BQML analytics, with every design choice traced back to a workshop finding. - Pitch —
docs/PITCH_DECK.md: 12-slide narrative for the Director of Analytics. - First-pitch one-pager —
docs/ONE_PAGER.md: CFO-readable, dollar-led summary that travels through the customer's org without you. Seedocs/assets/ONE_PAGER.pngfor the visual.
Suggested first pass: ROLE_AND_SCOPE → skim EXPLORATION → workshops/03_PROBLEM_AND_KPIS → workshops/06_ML_HYPOTHESES_AND_EXPERIMENTS → docs/CLOUD_ARCHITECTURE → docs/PITCH_DECK → docs/ONE_PAGER.
Produced from the CSVs in this repo by scripts/profile_dataset.py and scripts/ds_profile.py — re-runnable any time.
- West earns 42% of revenue and 56% of profit; Central earns 17% of revenue but only 2.6% of profit.
- Discount cliff at 20%: below 20% discount, 0.6% of lines lose money; at ≥30% discount, 96.3% of lines lose money. 70.7% of all negative-profit lines come from the ≥30% band.
- Growth is real: +11.5% → +25.6% → +18.4% YoY sales across 2014–2017.
- Seasonality is a 6.9× peak/trough ratio (Nov peak, Feb trough) — monthly forecasting is viable.
- 19% of customers have negative lifetime profit — a concrete, model-able segment.
| Document | What it is for |
|---|---|
ROLE_AND_SCOPE.md |
Canonical role & scope — who you are, who the customer is, pre/post-sales, no ambiguity |
EXPLORATION.md |
Discovery notes, business scenarios (migration + engagement), GCP service anchors, data/SQL reference |
STEP_BY_STEP.md |
Lab steps — Docker → clean/load → run SQL |
workshops/ |
Week-2 BA + DS workshop sequence (briefs → findings → signed-off artifacts) |
docs/CLOUD_ARCHITECTURE.md |
Full GCP reference architecture with traceability to workshops |
docs/PITCH_DECK.md |
12-slide pre-sales narrative for the Director of Analytics |
docs/ONE_PAGER.md |
First-pitch one-pager — CFO-readable, dollar-led, leave-behind from the intro meeting (rendered: docs/assets/ONE_PAGER.png) |
GITHUB_SETUP.md |
Original repo-creation notes |
Questions |
Original brainstorm |
scripts/profile_dataset.py— BA-flavored profile; writesworkshops/profile_output.json.scripts/ds_profile.py— DS-flavored profile (seasonality, discount cliff, Pareto, RFM, outliers); writesworkshops/ds_profile_output.json.scripts/clean_and_build_fact.py— CSV clean pass →data/clean/.scripts/load_mysql.py— loads cleaned CSVs into local Docker MySQL.
sql/01_schema_mysql.sql— star schema DDL.sql/02_enriched_fact_view.sql— enriched line-level view.sql/03_aggregate_monthly_region.sql— primary-KPI aggregate.sql/04_aggregate_monthly_category.sql— powers top-5 concentration KPI + guardrail.sql/05_aggregate_monthly_customer_segment.sql— segment-level view.sql/06_guardrail_neg_profit_lines.sql— weekly guardrail views (total, by sub-category, by region).sql/07_discount_cliff_rule.sql— E2 single-rule baseline (what the ML classifier must beat).sql/examples/— simple query examples.
sql/bqml/01_forecast_arima_plus.sql— E1 monthly revenue forecast per region.sql/bqml/02_classifier_boosted_tree.sql— E2 line-level loss classifier.sql/bqml/03_customer_loss_flag.sql— E3 forward-looking customer loss flag.
docker-compose.yml— local MySQL 8 on port 3307.requirements.txt— pip deps for the clean + load scripts.
# 1. Clone
git clone https://github.com/anushb1/wayfair-independent-study.git
cd wayfair-independent-study
# 2. Set up Python and profile the data
python3 -m venv .venv
.venv/bin/pip install pandas numpy
.venv/bin/python scripts/profile_dataset.py # → workshops/profile_output.json
.venv/bin/python scripts/ds_profile.py # → workshops/ds_profile_output.json
# 3. (Optional) spin up local MySQL and load the star schema
docker compose up -d
.venv/bin/pip install -r requirements.txt
.venv/bin/python scripts/clean_and_build_fact.py
.venv/bin/python scripts/load_mysql.py
mysql -h 127.0.0.1 -P 3307 -u root -p < sql/01_schema_mysql.sql
# then load aggregates: sql/02..07