3 changes: 2 additions & 1 deletion catalog/repos.md
@@ -547,6 +547,7 @@ Some repos are intentionally duplicated from the same upstream source so that di
| [amazon-personalize-immersion-day](https://github.com/Cognition-Partner-Workshops/amazon-personalize-immersion-day) | AWS Personalize ML workshop | Python, AWS |
| [angular-1.x-bootstrap-admin-dashboard](https://github.com/Cognition-Partner-Workshops/angular-1.x-bootstrap-admin-dashboard) | AngularJS 1.x admin panel | AngularJS, Bootstrap |
| [angular-1.x-dashboard](https://github.com/Cognition-Partner-Workshops/angular-1.x-dashboard) | AngularJS dashboard widgets | AngularJS |
| [etl-workflow](https://github.com/Cognition-Partner-Workshops/etl-workflow) | Full-stack ETL workflow with YAML mapping specs, PostgreSQL source/target, FastAPI web UI, optional IICS integration. Used for Databricks Platform Migration (SQL ETL → PySpark notebooks). | Python, FastAPI, PostgreSQL, YAML |
| [ts-informatica-powercenter](https://github.com/Cognition-Partner-Workshops/ts-informatica-powercenter) | Informatica PowerCenter 9.6.1 XML exports (11 mappings, 117K lines), Oracle SQL, shell orchestration — see [full entry](#ts-informatica-powercenter) | Informatica PowerCenter, Oracle, Shell |
| [katalon-web-automation](https://github.com/Cognition-Partner-Workshops/katalon-web-automation) | Katalon web automation sample | Katalon |
| [keycloak](https://github.com/Cognition-Partner-Workshops/keycloak) | Identity and Access Management | Java |
@@ -561,6 +562,6 @@ Some repos are intentionally duplicated from the same upstream source so that di
| [ruby-redmine](https://github.com/Cognition-Partner-Workshops/ruby-redmine) | Redmine project management | Ruby |
| [sample-serverless-digital-asset-payments](https://github.com/Cognition-Partner-Workshops/sample-serverless-digital-asset-payments) | Serverless digital asset payments | AWS, Serverless |
| [serverless-eda-insurance-claims-processing](https://github.com/Cognition-Partner-Workshops/serverless-eda-insurance-claims-processing) | Event-driven insurance claims | AWS, Serverless |
| [streamify-data-engineering](https://github.com/Cognition-Partner-Workshops/streamify-data-engineering) | Data engineering with Kafka, Spark, dbt | Python, Kafka, Spark, dbt |
| [streamify-data-engineering](https://github.com/Cognition-Partner-Workshops/streamify-data-engineering) | Music streaming data pipeline — PySpark Structured Streaming (Kafka → Spark → Parquet), Airflow batch DAGs, dbt star schema on BigQuery. Primary repo for Databricks Platform Migration workshop. | Python, PySpark, Kafka, Airflow, dbt, BigQuery, GCP, Terraform |
| [todo-app-sandbox-infra](https://github.com/Cognition-Partner-Workshops/todo-app-sandbox-infra) | Todo app sandbox infrastructure | IaC |
| [traderXCognitiondemos](https://github.com/Cognition-Partner-Workshops/traderXCognitiondemos) | TraderX fork for Devin demos | Java |
5 changes: 5 additions & 0 deletions catalog/upstream-map.yaml
@@ -356,6 +356,11 @@ repos:
import_method: github-fork
cluster: C4

etl-workflow:
upstream: null
import_method: original
cluster: null

fineract:
upstream: apache/fineract
upstream_url: https://github.com/apache/fineract
4 changes: 3 additions & 1 deletion modules/README.md
@@ -20,7 +20,7 @@ All hands-on modules organized by engineering discipline. Each module is a self-
| [DevOps & CI/CD](devops-cicd/) | DevOps Engineer, Release Engineer | 5 modules |
| [Cloud & Infrastructure](cloud-infrastructure/) | Cloud Engineer, Platform Engineer | 6 modules |
| [Observability & SRE](observability-sre/) | SRE, Observability Engineer | 4 modules |
| [Data Engineering](data-engineering/) | Data Engineer, Analytics Engineer | 7 modules |
| [Data Engineering](data-engineering/) | Data Engineer, Analytics Engineer | 9 modules |
| [Architecture & Design](architecture-design/) | Solution Architect, Enterprise Architect | 5 modules |
| [AI & ML Engineering](ai-ml-engineering/) | ML Engineer, AI Engineer | 3 modules |
| [Technical Documentation](technical-documentation/) | Technical Writer, Documentation Engineer | 6 modules |
@@ -120,6 +120,8 @@ All hands-on modules organized by engineering discipline. Each module is a self-
| [SAS to Python/Snowflake](data-engineering/sas-to-python-snowflake.md) | Intermediate–Advanced | 60 min | Multiple repos |
| [Informatica PowerCenter Analysis](data-engineering/informatica-powercenter-analysis.md) | Intermediate | 45 min | ts-informatica-powercenter |
| [Informatica PowerCenter to Snowflake Migration](data-engineering/informatica-to-snowflake-migration.md) | Advanced | 75 min | ts-informatica-powercenter, uc-dw-migration-teradata-to-snowflake |
| [COBOL Copybook to PySpark/JSON](data-engineering/cobol-copybook-to-pyspark-json.md) | Intermediate | 45 min | aws-mainframe-modernization-carddemo |
| [Databricks Platform Migration](data-engineering/databricks-platform-migration.md) | Intermediate–Advanced | 60 min | streamify-data-engineering, uc-data-source-migration-legacy-to-modern, etl-workflow |

### Architecture & Design

6 changes: 5 additions & 1 deletion modules/data-engineering/README.md
@@ -16,15 +16,18 @@ Challenges focused on data warehouse migration, ETL pipeline modernization, data
| [Informatica PowerCenter Analysis](informatica-powercenter-analysis.md) | Intermediate | 45 min |
| [Informatica PowerCenter to Snowflake Migration](informatica-to-snowflake-migration.md) | Advanced | 75 min |
| [COBOL Copybook to PySpark/JSON](cobol-copybook-to-pyspark-json.md) | Intermediate | 45 min |
| [Databricks Platform Migration](databricks-platform-migration.md) | Intermediate–Advanced | 60 min |

## Repositories

| Repository | Compatible Modules |
|------------|--------------------|
| uc-dw-migration-teradata-to-snowflake | [DW Migration: Teradata to Snowflake](dw-migration-teradata-to-snowflake.md), [ETL Pipeline Modernization](etl-pipeline-modernization.md), [Data Quality & Validation](data-quality-validation.md) |
| uc-data-source-migration-legacy-to-modern | [Data Source Migration](data-source-migration.md) |
| uc-data-source-migration-legacy-to-modern | [Data Source Migration](data-source-migration.md), [Databricks Platform Migration](databricks-platform-migration.md) |
| ts-informatica-powercenter | [Informatica PowerCenter Analysis](informatica-powercenter-analysis.md), [Informatica PowerCenter to Snowflake Migration](informatica-to-snowflake-migration.md) |
| aws-mainframe-modernization-carddemo | [COBOL Copybook to PySpark/JSON](cobol-copybook-to-pyspark-json.md) |
| streamify-data-engineering | [Databricks Platform Migration](databricks-platform-migration.md) |
| etl-workflow | [Databricks Platform Migration](databricks-platform-migration.md) |

## When to Use This Category

@@ -34,3 +37,4 @@ Challenges focused on data warehouse migration, ETL pipeline modernization, data
- Data Quality & Validation pairs well with any data migration module as a validation step
- The `uc-dw-migration-teradata-to-snowflake` repo was specifically curated for these challenges
- Informatica PowerCenter modules are ideal for enterprises migrating from on-prem Informatica ETL to cloud-native Snowflake architectures
- Databricks Platform Migration is designed for audiences already working in Databricks: it converts standard PySpark/Airflow/dbt stacks to Databricks Notebooks, Workflows, and Delta Lake, and supports parallel Devin sessions for a live parallelization showcase
179 changes: 179 additions & 0 deletions modules/data-engineering/databricks-platform-migration.md
@@ -0,0 +1,179 @@
# Databricks Platform Migration

## Repositories

- [streamify-data-engineering](#streamify-data-engineering)
- [uc-data-source-migration-legacy-to-modern](#uc-data-source-migration-legacy-to-modern)
- [etl-workflow](#etl-workflow)

---

## Challenge

Migrate a standard open-source data engineering stack (PySpark scripts, Airflow DAGs, dbt models targeting BigQuery/GCS) to the Databricks Lakehouse platform. Participants will convert existing pipeline code to Databricks Notebooks, replace file-based sinks with Delta Lake tables, translate Airflow orchestration to Databricks Workflows, and retarget dbt models from BigQuery to the dbt-databricks adapter.
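
To make the core conversion concrete, here is a minimal sketch of replacing a Parquet-on-GCS sink with a Delta Lake table in a notebook cell; the paths and the catalog.schema.table name are illustrative, not taken from the repo:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Before (spark-submit script): the pipeline writes Parquet files to GCS, e.g.
#   events.write.mode("append").partitionBy("date").parquet("gs://bucket/stage/listen_events/")

# After (Databricks notebook cell): the same DataFrame lands in a Delta table
# registered in Unity Catalog.
events = spark.read.parquet("gs://bucket/raw/listen_events/")
(events.write
    .format("delta")
    .mode("append")
    .partitionBy("date")
    .saveAsTable("streamify.bronze.listen_events"))
```

The rest of the module repeats this pattern at larger scale: each source component maps to a Databricks-native equivalent, as summarized below.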

## Platform Conversion Map

| Source Component | Databricks Target |
|---|---|
| PySpark scripts via `spark-submit` | Databricks Notebooks (Python) |
| Parquet files on GCS / local disk | Delta Lake tables in Unity Catalog |
| Airflow DAGs (scheduling + dependencies) | Databricks Workflows (JSON task definitions) |
| dbt-bigquery models (SQL + Jinja) | dbt-databricks models with Delta materialization |
| BigQuery staging / external tables | Databricks SQL warehouses + Unity Catalog schemas |
| GCP Terraform modules | Databricks Terraform provider resources |
| `spark-submit` with `--packages` | Databricks cluster libraries + init scripts |

## Target Outcomes

- Databricks Notebooks replicating the PySpark streaming and batch logic with Delta Lake as the sink
- Databricks Workflow JSON definition replacing the Airflow DAG with equivalent task dependencies
- dbt project retargeted to the `dbt-databricks` adapter with Delta table materializations
- `MIGRATION_PLAN.md` documenting every conversion decision, Delta Lake partitioning strategy, and Unity Catalog schema design
- PR with all Databricks-ready artifacts

## What Participants Will Learn

- How Devin reads standard PySpark code and re-expresses it as Databricks Notebooks with Delta Lake APIs
- How Devin converts Airflow DAG definitions into Databricks Workflow task graphs
- How Devin retargets dbt models from one warehouse adapter (BigQuery) to another (Databricks)
- The mapping between open-source data engineering concepts and their Databricks equivalents

## Devin Features Exercised

- Multi-file codebase analysis (PySpark, Airflow, dbt, Terraform)
- Cross-platform code translation
- Infrastructure-as-Code generation (Databricks Workflow JSON, Terraform)
- Documentation generation with architecture rationale
- PR creation with migration artifacts

## Difficulty

Intermediate to Advanced

## Estimated Time

60 minutes

## Notes

- The `streamify-data-engineering` repo uses GCP infrastructure (BigQuery, GCS, Compute Engine). For workshop purposes, participants do not need a GCP or Databricks account — the goal is for Devin to generate the Databricks-ready code, not to deploy it.
- The Airflow DAG in `airflow/dags/streamify_dag.py` uses BigQuery-specific operators; Devin should replace these with equivalent Databricks notebook task references in the Workflow definition (a minimal sketch of that task structure follows these notes).
- The dbt models in `dbt/models/core/` use BigQuery-specific materializations (partition_by with timestamp granularity); the dbt-databricks equivalent uses Delta liquid clustering or `partition_by` with Delta semantics.
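
A hedged sketch of the Workflow task structure referenced in the second note; task keys, notebook paths, and scheduling details are illustrative and should be checked against the Databricks Jobs/Workflows documentation:

```python
import json

# Hypothetical Jobs/Workflows definition: notebook tasks replace the Airflow
# operators, and depends_on mirrors the original DAG ordering.
workflow = {
    "name": "streamify_hourly",
    "tasks": [
        {
            "task_key": "load_events",
            "notebook_task": {"notebook_path": "/Repos/streamify/databricks/notebooks/load_events"},
        },
        {
            "task_key": "dbt_run",
            "depends_on": [{"task_key": "load_events"}],
            "notebook_task": {"notebook_path": "/Repos/streamify/databricks/notebooks/run_dbt"},
        },
    ],
}

print(json.dumps(workflow, indent=2))
```

In the generated PR this JSON would live under `databricks/workflows/` so reviewers can diff the task graph against the original DAG.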

---

## <a id="streamify-data-engineering"></a>streamify-data-engineering

**Repository:** [streamify-data-engineering](https://github.com/Cognition-Partner-Workshops/streamify-data-engineering)

Music streaming data pipeline with PySpark Structured Streaming (Kafka → Spark → Parquet/GCS), Airflow hourly batch DAGs, dbt star schema models targeting BigQuery, and Terraform for GCP infrastructure. Three event streams (listen_events, page_view_events, auth_events) with full dimensional modeling.
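
For orientation, a hedged sketch of what the streaming conversion in Step 1 might look like, keeping the Kafka source and replacing the Parquet sink with a Delta table; the broker address, topic, checkpoint path, and table name are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

# Read the listen_events topic; broker address and topic name are assumptions.
raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "listen_events")
    .option("startingOffsets", "latest")
    .load())

# The real notebooks would parse the JSON payload against a schema; this sketch
# just keeps the raw value and the event timestamp.
events = raw.select(col("value").cast("string").alias("payload"), col("timestamp"))

# Write to a Delta table instead of Parquet on GCS.
(events.writeStream
    .format("delta")
    .option("checkpointLocation", "/tmp/checkpoints/listen_events")
    .outputMode("append")
    .toTable("streamify.bronze.listen_events"))
```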

### Step 1: Paste into Devin

> Analyze the data pipeline in streamify-data-engineering. This project streams music events through Kafka → PySpark Structured Streaming → GCS (Parquet), then uses Airflow to load hourly batches into BigQuery staging tables, and dbt to build a star schema (dim_users, dim_songs, dim_artists, dim_location, dim_datetime, fact_streams).
>
> Convert this entire pipeline to the Databricks Lakehouse platform:
>
> 1. **PySpark → Databricks Notebooks** (`databricks/notebooks/`): Convert `spark_streaming/stream_all_events.py` and `spark_streaming/streaming_functions.py` into Databricks notebooks that write to Delta Lake tables instead of Parquet files on GCS. Use Auto Loader (`cloudFiles`) for incremental ingestion where appropriate. Add markdown cells explaining each transformation step.
>
> 2. **Airflow → Databricks Workflows** (`databricks/workflows/`): Convert the Airflow DAG in `airflow/dags/streamify_dag.py` into a Databricks Workflow JSON definition. Map each Airflow task (create_external_table, insert_job, dbt_run) to equivalent Databricks notebook tasks or SQL tasks with proper dependency chains.
>
> 3. **dbt-bigquery → dbt-databricks** (`databricks/dbt/`): Convert the dbt project in `dbt/` to use the `dbt-databricks` adapter. Update `profiles.yml` for a Databricks SQL Warehouse connection, convert BigQuery-specific materializations (timestamp partitioning) to Delta Lake equivalents, and replace any BigQuery SQL syntax with Databricks SQL (Spark SQL).
>
> 4. **Migration Plan** (`docs/MIGRATION_PLAN.md`): Document every conversion decision — why Auto Loader vs. structured streaming, Delta Lake partitioning/clustering strategy for the event tables, Unity Catalog schema layout, and Workflow scheduling configuration.
>
> Open a PR with all generated artifacts.

### Step 2: Research with Ask Devin

- *"What GCP-specific code patterns in streamify-data-engineering need to change for Databricks? List every GCS, BigQuery, and Airflow reference."*
- *"What's the best Delta Lake partitioning strategy for high-volume event data — partition by date or use liquid clustering?"*
- *"How should the dbt models change when moving from BigQuery to Databricks SQL? What syntax differences exist?"*
- Use the analysis to plan follow-up sessions — try adding Unity Catalog governance, Delta Live Tables, or Databricks Asset Bundles

### Step 3 (Optional): Read the DeepWiki

Open the repo's DeepWiki page to understand the full architecture: event generation (Eventsim), Kafka topics, Spark streaming functions, Airflow DAG structure, and dbt dimensional model. Use this to verify Devin's conversion covers all data flows end-to-end.

### Step 4 (Optional): Review & Give Feedback

- **Review the diff** — do the Databricks Notebooks preserve the streaming logic? Is the Delta Lake schema well-designed?
- **Leave a comment** asking Devin to add Delta Live Tables (DLT) pipeline definitions as an alternative to the notebook-based approach
- **Watch Devin respond** and push a follow-up commit with DLT Python pipeline code

---

## <a id="uc-data-source-migration-legacy-to-modern"></a>uc-data-source-migration-legacy-to-modern

**Repository:** [uc-data-source-migration-legacy-to-modern](https://github.com/Cognition-Partner-Workshops/uc-data-source-migration-legacy-to-modern)

Spring Boot loan management application with a legacy Corporate Data Warehouse (all-VARCHAR columns, cryptic names like BORR_FST_NM, LN_CURR_BAL). Includes column mappings, seed data, and a modern target schema for migration.
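
For orientation, a hedged sketch of the typed transformation Step 1 asks Devin to generate; BORR_FST_NM and LN_CURR_BAL come from the description above, the ACT/CLO status expansion from the prompt below, and the remaining column names, formats, and paths are assumptions:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_date, when

spark = SparkSession.builder.getOrCreate()

# Legacy CDW extract with all-VARCHAR columns; the source path is an assumption.
legacy = spark.read.option("header", True).csv("/data/legacy/cdw_loans.csv")

loans = legacy.select(
    col("BORR_FST_NM").alias("borrower_first_name"),
    col("LN_CURR_BAL").cast("decimal(18,2)").alias("loan_current_balance"),
    # LN_ORIG_DT and its yyyyMMdd format are hypothetical.
    to_date(col("LN_ORIG_DT"), "yyyyMMdd").alias("loan_origination_date"),
    # LN_STAT_CD is hypothetical; the code expansion mirrors the prompt below.
    when(col("LN_STAT_CD") == "ACT", "Active")
    .when(col("LN_STAT_CD") == "CLO", "Closed")
    .otherwise("Unknown")
    .alias("loan_status"),
)

# Land the typed result as a Delta table; catalog and schema names are illustrative.
loans.write.format("delta").mode("overwrite").saveAsTable("lending.modern.loans")
```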

### Step 1: Paste into Devin

> Analyze the legacy CDW schema in uc-data-source-migration-legacy-to-modern. Review the schema in `src/main/resources/schema-legacy.sql`, seed data in `src/main/resources/data-legacy.sql`, and column mappings in `data/mappings/column_mappings.md`.
>
> Generate a complete Databricks migration pipeline:
>
> 1. **Delta Lake DDL** (`databricks/ddl/`): Create `CREATE TABLE` statements for the modern target schema using proper Spark SQL types (DATE, DECIMAL, STRING), meaningful column names, and Delta Lake partitioning.
>
> 2. **PySpark Ingestion** (`databricks/ingestion/`): Write PySpark notebooks that read from legacy tables (CSV/Parquet source), parse dates and amounts, expand status codes (ACT→Active, CLO→Closed), split denormalized fields into dimension tables, and handle nulls/malformed values with logging.
>
> 3. **Data Quality Framework** (`databricks/quality/`): Create validation notebooks checking row counts, null constraints, referential integrity, and business rules (e.g., balance > 0 for active loans). Generate a `DATA_QUALITY_REPORT.md`.
>
> 4. **Migration Runbook** (`docs/DATABRICKS_MIGRATION_RUNBOOK.md`): Document transformation decisions, column mappings, type conversions, and execution order.
>
> Open a PR.

### Step 2: Research with Ask Devin

- *"What are the riskiest type conversions from all-VARCHAR columns to typed Delta Lake columns? What edge cases should the ingestion handle?"*
- *"What's the best approach for incremental loads after the initial migration — Change Data Feed, MERGE, or full refresh?"*

### Step 3 (Optional): Read the DeepWiki

Open the repo's DeepWiki page to understand the domain model, legacy schema quirks, and the column mapping rules. Use this to verify Devin's migration covers all tables and transformations.

### Step 4 (Optional): Review & Give Feedback

- **Review the diff** — are the Delta Lake types appropriate? Does the ingestion handle all edge cases in the seed data?
- **Leave a comment** asking Devin to add Databricks notebook versions with markdown cells explaining each transformation

---

## <a id="etl-workflow"></a>etl-workflow

**Repository:** [etl-workflow](https://github.com/Cognition-Partner-Workshops/etl-workflow)

Full-stack ETL workflow application with YAML mapping specs defining five data transformations (employee directory enrichment, department roster, salary analytics, manager hierarchy, title progression) that execute as PostgreSQL INSERT...SELECT statements. FastAPI web UI with optional Informatica IICS integration.
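
For orientation, a hedged sketch of what one mapping might become as a Databricks notebook cell, using a salary-ranking transformation as a stand-in; the actual YAML spec format, table names, and columns are assumptions to be checked against the repo:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, rank
from pyspark.sql.window import Window

spark = SparkSession.builder.getOrCreate()

# Delta equivalents of the PostgreSQL employees schema; table and column names
# are assumptions for illustration.
employees = spark.table("hr.source.employees")
salaries = spark.table("hr.source.salaries")

# Roughly the shape of the INSERT...SELECT the YAML specs describe: join,
# aggregate, and rank salaries per department with a window function.
dept_salaries = (employees.join(salaries, "employee_id")
    .groupBy("department_id", "employee_id")
    .agg(avg("salary").alias("avg_salary")))

ranked = dept_salaries.withColumn(
    "salary_rank",
    rank().over(Window.partitionBy("department_id").orderBy(col("avg_salary").desc())),
)

# Write to the Delta target instead of running INSERT...SELECT in PostgreSQL.
ranked.write.format("delta").mode("overwrite").saveAsTable("hr.target.salary_analytics")
```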

### Step 1: Paste into Devin

> Analyze the ETL mapping specifications in etl-workflow/mapping_specs/. These YAML files define five data transformations that currently execute as PostgreSQL INSERT...SELECT statements.
>
> Convert each mapping to a Databricks notebook:
>
> 1. **Databricks Notebooks** (`databricks/notebooks/`): For each YAML spec, create a PySpark notebook that reads from Delta Lake source tables, applies the same joins/filters/aggregations/window functions, and writes to Delta Lake target tables. Add markdown cells documenting the transformation logic.
>
> 2. **Delta Lake DDL** (`databricks/ddl/`): Create source and target table definitions in Spark SQL with appropriate types and partitioning.
>
> 3. **Databricks Workflow** (`databricks/workflows/workflow.json`): Define a Workflow that runs the notebooks in dependency order.
>
> 4. **Migration Notes** (`docs/ETL_TO_DATABRICKS.md`): Document the PostgreSQL → Spark SQL translation decisions.
>
> Open a PR.

### Step 2: Research with Ask Devin

- *"What SQL patterns in the etl-workflow mapping specs are PostgreSQL-specific and need Spark SQL equivalents?"*
- *"How do the YAML mapping specs define joins, filters, and expressions? What's the most complex transformation?"*

### Step 3 (Optional): Read the DeepWiki

Review the mapping spec format, the PostgreSQL source schema (employees database), and the transformation patterns (window functions, self-joins, aggregations).

### Step 4 (Optional): Review & Give Feedback

- **Review the diff** — do the PySpark notebooks produce equivalent results to the SQL versions? Are window functions correctly translated?
- **Leave a comment** asking Devin to add data profiling statistics comparing source and target row counts
1 change: 1 addition & 0 deletions workshops/README.md
@@ -34,6 +34,7 @@ Reusable workshop definitions that can be instantiated as events. Each workshop
| [Feature Development](feature-development/) | New features on existing applications | 1-2 hours | 1-2 | Application Development modules |
| [Quality Engineering & Assurance](quality-engineering/) | Test automation, E2E testing, continuous quality, and code review | 4-6 hours | 9 (3 tracks × 3 labs) | Testing & QA, DevOps-CICD, Technical Documentation modules |
| [General](general/) | Security, modernization, and feature development — 3-track broad tour | 4-6 hours | 9 (3 tracks × 3 labs) | Security, Migration-Modernization, Application Development, Testing & QA modules |
| [Databricks Migration](databricks-migration/) | Converting PySpark/Airflow/dbt/SQL ETL to Databricks Lakehouse — notebooks, Delta Lake, Workflows | 1-6 hours | 6 (2 tracks × 3 labs) | Data Engineering modules |

## Creating a New Workshop
