10 changes: 10 additions & 0 deletions install_packages.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
brew install harfbuzz
brew install fribidi
brew install libgit2
---
install.packages("usethis", verbose=TRUE)
---
brew install libtiff
brew install libxt

install.packages("formatR")
3 changes: 2 additions & 1 deletion manuscript/01-Introduction.Rmd
@@ -1,4 +1,5 @@
# Introduction

# Introduction {#introduction}

Imagine being able to deploy, manage, and monitor your machine learning models with ease. No more headaches from version control issues, data drift, and model performance degradation. That's the power of MLOps. *"MLOps Engineering: Building, Deploying, and Managing Machine Learning Workflows with Airflow and MLflow on Kubernetes"* takes you on a journey through the principles, practices, and platforms of MLOps. You'll learn how to create an end-to-end pipeline for machine learning projects using cutting-edge tools and techniques like Kubernetes, Terraform, and GitOps, together with tools that ease your machine learning workflow, such as Apache Airflow and MLflow Tracking. Before we begin, let's take a closer look at what MLOps actually is, what principles it incorporates, and how it differs from traditional DevOps.

9 changes: 3 additions & 6 deletions manuscript/01.1-Introduction-Machine_Learning_Workflow.Rmd
@@ -1,27 +1,24 @@

## Machine Learning Workflow

A machine learning workflow typically involves several stages. These stages are closely related and sometimes overlap, as some of them may require multiple iterations. To give an overview, the workflow is broken down into five stages below.

![](images/01-Introduction/ml-lifecycle.svg)
![ML lifecycle](images/01-Introduction/ml-lifecycle.svg){ width=100% }

**1. Data Preparation**
In the first stage, data used to train a machine learning model is collected, cleaned, and preprocessed. Preprocessing includes tasks to remove missing or duplicate data, normalize data, and split data into a training and testing set.

**2. Model Building**
In the second stage, a machine learning model is selected and trained using the prepared data. This includes tasks such as selecting an appropriate algorithm as a machine learning model, training the model, and tuning the model's parameters to improve its performance.

**3. Model Evaluation**
Afterward, the performance of the trained model is evaluated using the test data set. This includes tasks such as measuring the accuracy and other performance metrics, comparing the performance of different models, and identifying potential issues with the model.

**4. Model Deployment**
Next, the selected and optimized model is deployed to a production environment where it can be used to make predictions on new data. This stage includes tasks like scaling the model to handle large amounts of data, and deploying the model to different environments to be used in different contexts.

**5. Model Monitoring and Maintenance**
Once the model is deployed, it is important to monitor its performance and update it as needed. This includes tasks such as collecting feedback from the model, monitoring the model's performance metrics, and updating the model as necessary.
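
The first three stages can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the workflow used later in the book: the toy data set `RAW_ROWS`, the function names, and the one-parameter threshold "model" are all hypothetical and chosen so the example runs with the standard library alone.

```python
import random
import statistics

# Hypothetical toy data: (feature, label) pairs, including one missing
# value and one duplicate row that the preparation stage should drop.
RAW_ROWS = [(1.0, 0), (2.0, 0), (3.0, 0), (None, 1),
            (2.0, 0), (8.0, 1), (9.0, 1), (10.0, 1)]


def prepare(rows, test_fraction=0.25, seed=0):
    """Stage 1: remove missing/duplicate rows, normalize, and split."""
    cleaned, seen = [], set()
    for row in rows:
        if row[0] is None or row in seen:
            continue
        seen.add(row)
        cleaned.append(row)
    xs = [x for x, _ in cleaned]
    lo, hi = min(xs), max(xs)
    scaled = [((x - lo) / (hi - lo), y) for x, y in cleaned]  # min-max scaling
    random.Random(seed).shuffle(scaled)
    n_test = max(1, int(len(scaled) * test_fraction))
    return scaled[n_test:], scaled[:n_test]  # training set, test set


def train(train_rows):
    """Stage 2: fit a threshold halfway between the two class means."""
    mean0 = statistics.mean(x for x, y in train_rows if y == 0)
    mean1 = statistics.mean(x for x, y in train_rows if y == 1)
    return (mean0 + mean1) / 2


def evaluate(threshold, test_rows):
    """Stage 3: accuracy of the threshold model on held-out data."""
    correct = sum((1 if x > threshold else 0) == y for x, y in test_rows)
    return correct / len(test_rows)


train_rows, test_rows = prepare(RAW_ROWS)
threshold = train(train_rows)
accuracy = evaluate(threshold, test_rows)
print(f"threshold={threshold:.3f} accuracy={accuracy:.2f}")
```

Deployment and monitoring are left out on purpose, since they depend on the target platform rather than on model code.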

The stages are often handled by the same tool or platform, which makes a clear differentiation across stages and tools fairly difficult. Further, some machine learning workflows do not include all of these steps, or they include variations of them. A machine learning workflow is therefore no walk in the park, and the actual model code is just a small piece of the work.

![Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex. (D. Sculley, et al., 2015)](images/01-Introduction/ml-sculley.svg)
![Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex. (D. Sculley, et al., 2015)](images/01-Introduction/ml-sculley.svg){ width=100% }

Developing machine learning models, monitoring their performance, and continuously retraining them on new data, possibly with alternative models, can be challenging and requires the right tools.
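
To make the monitoring and retraining idea concrete, here is a minimal sketch of a retraining trigger. It assumes a single numeric feature and flags a batch when its mean moves more than a chosen number of training standard deviations. The function names, the sample values, and the threshold of `2.0` are illustrative assumptions, not a recommendation.

```python
import statistics


def mean_shift(reference, current):
    """Shift of the current mean, in units of the reference standard deviation."""
    ref_mean = statistics.mean(reference)
    ref_std = statistics.stdev(reference)
    return abs(statistics.mean(current) - ref_mean) / ref_std


def needs_retraining(reference, current, threshold=2.0):
    """Flag a batch whose mean drifted beyond the (assumed) threshold."""
    return mean_shift(reference, current) > threshold


# Hypothetical feature values seen during training and in production.
training_feature = [10.0, 11.0, 9.0, 10.5, 9.5]
stable_batch = [10.2, 9.8, 10.1, 9.9]
drifted_batch = [14.0, 15.0, 13.5, 14.5]

print(needs_retraining(training_feature, stable_batch))
print(needs_retraining(training_feature, drifted_batch))
```

Real monitoring setups usually compare whole distributions and track model metrics as well; the mean comparison here only shows the shape of such a check.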

4 changes: 2 additions & 2 deletions manuscript/01.2-Introduction-MLOps.Rmd
@@ -25,7 +25,7 @@ The goal of DevOps is to automate and streamline the process of building, testin

DevOps focuses on the deployment and management of software in general (or *traditional* software), while MLOps focuses specifically on the deployment and management of machine learning models in a production environment. The goal is basically the same as in DevOps, except that the artifact being deployed is a machine learning model. While this is achieved with the same tools and best practices used in DevOps, deploying machine learning models (compared to software) adds a lot of complexity to the process.

![Traditional vs ML software](images/01-Introduction/mlops-vs-devops.svg)
![Traditional vs ML software](images/01-Introduction/mlops-vs-devops.svg){ width=100% }

Machine learning models are not just lines of code, but also require large amounts of data and specialized hardware to function properly. Further, machine learning models and their complex algorithms might need to change when the distribution of incoming data shifts. Ensuring that models stay accurate and reliable on new data leads to additional challenges.
Another key difference is that MLOps places a great emphasis on model governance, which ensures that machine learning models are compliant with relevant regulations and standards. The above list of tools within DevOps can be extended to the following for MLOps.
@@ -42,7 +42,7 @@ It's important to note that the specific tools used in MLOps and DevOps may vary

Incorporating the tools introduced by DevOps and MLOps can extend the machine learning workflow outlined in the previous section, resulting in a complete MLOps lifecycle that covers each stage of the machine learning process while integrating automation practices.

![](images/01-Introduction/mlops-lifecycle.svg)
![](images/01-Introduction/mlops-lifecycle.svg){ width=100% }

Integrating MLOps into machine learning projects introduces additional complexity into the workflow. Although the development stage can be carried out on a local machine, subsequent stages are typically executed within a cloud platform. Additionally, the transition from one stage to another is automated using CI/CD pipelines, which automate testing and deployment.

2 changes: 1 addition & 1 deletion manuscript/01.3-Introduction-Roles_and_Tasks.Rmd
@@ -2,7 +2,7 @@

The MLOps lifecycle typically involves several key roles and responsibilities, each with their own specific tasks and objectives.

![Roles and their operating areas within the MLOps lifecycle](images/01-Introduction/mlops-roles-and-tasks.svg)
![Roles and their operating areas within the MLOps lifecycle](images/01-Introduction/mlops-roles-and-tasks.svg){ width=100% }

### Data Engineer

11 changes: 6 additions & 5 deletions manuscript/01.4-Introduction-Ops_practices.Rmd
@@ -14,7 +14,7 @@ Containerization is an essential component in operations as it enables deploying

All of this makes containers beneficial compared to deploying an application on a virtual machine or, traditionally, directly on a machine. Virtual machines emulate an entire computer system and require a hypervisor to run, which introduces additional overhead. A traditional deployment, in turn, installs software directly onto a physical or virtual machine without the use of containers. Both approaches also lack the portability of containers.

![](./images/01-Introduction/ops-containerization.drawio.svg)
![](./images/01-Introduction/ops-containerization.drawio.svg){ width=100% }

The concept of container images is analogous to shipping containers in the physical world. Just as shipping containers can be loaded with different types of cargo, a container image can be used to create different containers with various applications and configurations. Both physical containers and container images are standardized, just like blueprints, which enables multiple operators to work with them. This allows for the deployment and management of applications in various environments and cloud platforms, making containerization a versatile solution.

@@ -52,7 +52,7 @@ Popular tools such as Azure ML, AWS Sagemaker, Kubeflow, and MLflow offer their

GitHub provides a variety of branching options to enable flexible collaboration workflows. Each branch serves a specific purpose in the development process, and using branches well helps teams collaborate more efficiently.

![](./images/01-Introduction/ops-version-control.drawio.svg)
![](./images/01-Introduction/ops-version-control.drawio.svg){ width=100% }

*Main Branch:* The main branch is the default branch in a repository. It represents the latest stable version and production-ready state of a codebase, and changes to the code are merged into the main branch as they are completed and tested.
*Feature Branch:* A feature branch is used to develop a new feature or functionality. It is typically created off the main branch, and once the feature is completed, it can be merged back into the main branch.
@@ -65,7 +65,7 @@ After a programmer has made changes to their code, they would typically use Git

After committing changes locally, the programmer may want to share those changes with others. They would do this by pushing their local commits to a remote repository using the command `git push`. Once the changes are pushed, others can pull those changes down to their local machines and continue working on the project by using the command `git pull`.

![](./images/01-Introduction/ops-git-commands.png)
![](./images/01-Introduction/ops-git-commands.png){ width=100% }

If the programmer is collaborating with others, they may need to merge their changes with changes made by others. This can be done using the `git merge <BRANCH-NAME>` command, which combines two branches of development history. The programmer may need to resolve any conflicts that arise during the merge.

@@ -78,7 +78,7 @@ While automating the code review process is generally viewed as advantageous, it

Continuous Integration (CI) and Continuous Delivery / Continuous Deployment (CD) are related software development practices that work together to automate and streamline the development process and the deployment of code changes to production. Deploying new software and models without CI/CD often requires a lot of implicit knowledge and manual steps.

![](./images/01-Introduction/ops-ci-cd.drawio.svg)
![](./images/01-Introduction/ops-ci-cd.drawio.svg){ width=100% }

1. *Continuous Integration (CI)*: is a software development practice that involves frequently integrating code changes into a shared central repository. The goal of CI is to catch and fix integration errors as soon as they are introduced, rather than waiting for them to accumulate over time. This is typically done by running automated tests and builds, to catch any errors that might have been introduced with new code changes, for example when merging a Git feature branch into the main branch.

@@ -97,6 +97,7 @@ At first, the environment variables are defined under `env`. Two variables are d
The second part defines when the pipeline is triggered. The example shows three ways to trigger a pipeline: when pushing to the master branch (`push`), when a pull request targeting the master branch is opened or updated (`pull_request`), or when the pipeline is triggered manually via the GitHub interface (`workflow_dispatch`).
The third part of the code example introduces the actual jobs and steps performed by the pipeline. The pipeline consists of two jobs, `pytest` and `docker`. The first represents the CI part of the pipeline: the run environment of the job is set up, the necessary requirements are installed, and the unit tests are run using the pytest library. If the `pytest` job succeeds, the `docker` job is triggered. This job builds a Docker image from the Dockerfile and automatically pushes it to the Docker Hub repository specified in `tags`. The step introduces another kind of variable besides the `env.` variables used before: `secrets.`. Secrets are GitHub's way of safely storing confidential information such as usernames and passwords. They can be set up in the GitHub interface and used in GitHub Actions via `secrets.SECRET-NAME`.

\footnotesize
```yaml
name: Docker CI base

@@ -153,7 +154,7 @@ jobs:
push: true
tags: ${{ env.DOCKERREPO }}:${{ env.DIRECTORY }}
```

\normalsize
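
The `pytest` job described above runs whatever unit tests the repository contains. As a minimal sketch of such a test file, the following assumes a hypothetical helper `min_max_scale` and two hypothetical tests; pytest collects functions whose names start with `test_` from files named `test_*.py`, and a failing `assert` fails the job.

```python
def min_max_scale(values):
    """Hypothetical function under test: rescale values to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if lo == hi:
        # Constant input: avoid dividing by zero and map everything to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def test_scales_to_unit_interval():
    assert min_max_scale([2.0, 4.0, 6.0]) == [0.0, 0.5, 1.0]


def test_constant_input_does_not_divide_by_zero():
    assert min_max_scale([3.0, 3.0]) == [0.0, 0.0]
```

Keeping the function and its tests in one file is only for brevity; in a real project the tests would usually import the function from the package under test.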

### Infrastructure as code

2 changes: 2 additions & 0 deletions manuscript/02-Overview_about_book_tutorials.Rmd
@@ -1,3 +1,5 @@
\newpage

# Overview about book tutorials

The book contains two sections with distinct focuses. The first section comprises Chapters 3 to 6, which consist of tutorials on the tools mentioned above. These chapters also serve as prerequisites for the subsequent sections. Among these tutorials, the chapters dedicated to *Airflow* and *MLflow* are oriented towards Data Scientists, providing insights into their usage. The chapters centered around *Kubernetes* and *Terraform* target Data and MLOps Engineers, offering detailed guidance on deploying and managing these tools.
2 changes: 1 addition & 1 deletion manuscript/03-Airflow.Rmd
@@ -2,7 +2,7 @@

[Apache Airflow](https://github.com/apache/airflow) is an open-source platform to develop, schedule and monitor workflows. Airflow comes with a web user interface that aims to make managing workflows as easy as possible and provides a good overview of each workflow over time and the ability to inspect logs and manage tasks, for example retrying a task in case of failure.

![](./images/03-Airflow/web-interface_overview.png)
![](./images/03-Airflow/web-interface_overview.png){ width=100% }

However, the philosophy of Airflow is to define workflows as code, so coding will always be required. Thus, Airflow can also be referred to as a *“Workflows as code”*-tool that allows for a dynamic, extensible, and flexible management of its workflows.
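
The idea of defining a workflow as code can be illustrated without Airflow itself. The sketch below is not Airflow's API; it only uses Python's standard `graphlib` module to declare three hypothetical tasks (`extract`, `transform`, `load`) with their upstream dependencies and to run them in dependency order, which is the core concept an Airflow DAG expresses.

```python
from graphlib import TopologicalSorter


def extract():
    return [3, 1, 2]


def transform(rows):
    return sorted(rows)


def load(rows):
    return f"loaded {len(rows)} rows"


# Hypothetical workflow definition: task name -> (callable, upstream task names).
TASKS = {
    "extract": (extract, []),
    "transform": (transform, ["extract"]),
    "load": (load, ["transform"]),
}


def run_workflow(tasks):
    """Run every task after its upstream tasks, passing their results along."""
    graph = {name: set(upstream) for name, (_, upstream) in tasks.items()}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        func, upstream = tasks[name]
        results[name] = func(*(results[u] for u in upstream))
    return results


results = run_workflow(TASKS)
print(results["load"])
```

Because the workflow is ordinary code, it can be versioned, reviewed, and generated dynamically, which is the flexibility the "workflows as code" approach aims for.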
