This is the code repository for Building Modern Data Applications Using Databricks Lakehouse, published by Packt.
Develop, optimize, and monitor data pipelines on Databricks
Learn the latest Databricks features with up-to-date insights into the platform. This book will develop your skills to build scalable and secure data pipelines that ingest, transform, and deliver timely, accurate data to drive business decisions.
This book covers the following exciting features:
- Deploy near-real-time data pipelines in Databricks using Delta Live Tables
- Orchestrate data pipelines using Databricks workflows
- Implement data validation policies and monitor/quarantine bad data
- Apply slowly changing dimension (SCD) Type 1 and Type 2 updates to lakehouse tables
- Secure data access across different groups and users using Unity Catalog
- Automate continuous data pipeline deployment by integrating Git with build tools such as Terraform and Databricks Asset Bundles
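To illustrate the SCD handling mentioned above: in the book this is done with Delta Live Tables, but the underlying Type 2 bookkeeping can be sketched in plain Python. This is a minimal illustration, not the Databricks API; the function name `scd2_upsert` and the tracking columns (`start_date`, `end_date`, `is_current`) are hypothetical.

```python
from datetime import date

def scd2_upsert(dim, key, new_attrs, today):
    """SCD Type 2 sketch (illustrative, not the Databricks API):
    expire the current row for `key`, then append a new current version."""
    for row in dim:
        if row["key"] == key and row["is_current"]:
            # Close out the existing version instead of overwriting it
            row["is_current"] = False
            row["end_date"] = today
    # Append the new version as the current row
    dim.append({"key": key, **new_attrs,
                "start_date": today, "end_date": None, "is_current": True})
    return dim

# Hypothetical customer dimension with one current row
dim = [{"key": "c1", "city": "Austin",
        "start_date": date(2023, 1, 1), "end_date": None, "is_current": True}]
dim = scd2_upsert(dim, "c1", {"city": "Denver"}, date(2024, 6, 1))
# The dimension now holds both versions: the expired Austin row and the
# current Denver row, preserving full history.
```

An SCD Type 1 update, by contrast, would simply overwrite the attributes in place, discarding history.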
If you feel this book is for you, get your copy today!
This book and the associated code are intended solely for educational purposes. The examples and pipelines demonstrated are not to be used in production environments without obtaining the necessary licenses from Databricks, Inc., and signing a Master Cloud Services Agreement (MCSA) with Databricks for production use of Databricks Services, including the 'dbldatagen' library. Refer to the license here: License.

All of the code is organized into folders. For example, `chapter01`.
The code will look like the following:
```python
import dlt

@dlt.table(
    name="random_trip_data_raw",
    comment="The raw taxi trip data ingested from a landing zone.",
    table_properties={
        "quality": "bronze"
    }
)
def random_trip_data_raw():
    # The decorated function returns the DataFrame that populates the table;
    # the source path is illustrative -- see the chapter code for the actual reader.
    return spark.read.format("json").load("/path/to/landing/zone")
```
Following is what you need for this book: This book is for data engineers looking to streamline data ingestion, transformation, and orchestration tasks. Data analysts responsible for managing and processing lakehouse data for analysis, reporting, and visualization will also find this book beneficial. Additionally, DataOps/DevOps engineers will find this book helpful for automating the testing and deployment of data pipelines, optimizing table tasks, and tracking data lineage within the lakehouse. Beginner-level knowledge of Apache Spark and Python is needed to make the most out of this book.
While not mandatory, to get the most out of this book, it’s recommended that you have beginner-level knowledge of Python and Apache Spark, and some familiarity with navigating the Databricks Data Intelligence Platform. It’s also recommended to have the following dependencies installed locally in order to follow along with the hands-on exercises and code examples throughout the book (Chapters 1 to 10):
| Chapter | Software required | OS required |
|---|---|---|
| 1-10 | Python 3.6+ | Windows, macOS, or Linux |
| 1-10 | Databricks CLI 0.205+ | Windows, macOS, or Linux |
Furthermore, it’s recommended that you have a Databricks account and workspace to log in, import notebooks, create clusters, and create new data pipelines. If you do not have a Databricks account, you can sign up for a free trial on the Databricks website.
Will Girten is a lead specialist solutions architect who joined Databricks in early 2019. With over a decade of experience in data and AI, Will has worked in various business verticals, from healthcare to government and financial services. Will’s primary focus has been helping enterprises implement data warehousing strategies for the lakehouse and performance-tuning BI dashboards, reports, and queries. Will is a certified Databricks Data Engineering Professional and Databricks Machine Learning Professional. He holds a Bachelor of Science in computer engineering from the University of Delaware.
