diff --git a/install_packages.txt b/install_packages.txt new file mode 100644 index 0000000..bfd7a1e --- /dev/null +++ b/install_packages.txt @@ -0,0 +1,10 @@ +brew install harfbuzz +brew install fribidi +brew install libgit2 +--- +install.packages("usethis", verbose=TRUE) +--- +brew install libtiff +brew install libxt + +install.packages("formatR") diff --git a/manuscript/01-Introduction.Rmd b/manuscript/01-Introduction.Rmd index c432b69..52c42f7 100644 --- a/manuscript/01-Introduction.Rmd +++ b/manuscript/01-Introduction.Rmd @@ -1,4 +1,5 @@ -# Introduction + +# Introduction {#introduction} Imagine being able to deploy, manage, and monitor your machine learning models with ease. No more headaches from version control issues, data drift, and model performance degradation. That's the power of MLOps. *"MLOps Engineering: Building, Deploying, and Managing Machine Learning Workflows with Airflow and MLflow on Kubernetes"* takes you on a journey through the principles, practices, and platforms of MLOps. You'll learn how to create an end-to-end pipeline for machine learning projects, using cutting-edge tools and techniques like Kubernetes, Terraform, and GitOps, and working with tools that ease your machine learning workflow, such as Apache Airflow and MLflow Tracking. Before we begin, let's take a closer look at what MLOps actually is, which principles it incorporates, and how it differs from traditional DevOps. diff --git a/manuscript/01.1-Introduction-Machine_Learning_Workflow.Rmd b/manuscript/01.1-Introduction-Machine_Learning_Workflow.Rmd index 0250fcc..9c7fbc8 100644 --- a/manuscript/01.1-Introduction-Machine_Learning_Workflow.Rmd +++ b/manuscript/01.1-Introduction-Machine_Learning_Workflow.Rmd @@ -1,27 +1,24 @@ + ## Machine Learning Workflow A machine learning workflow typically involves several stages. These stages are closely related and sometimes overlap, as some stages may involve multiple iterations. 
In the following, the machine learning workflow is broken down into five stages to keep things simple and to give an overview. -![](images/01-Introduction/ml-lifecycle.svg) +![ML lifecycle](images/01-Introduction/ml-lifecycle.svg){ width=100% } **1. Data Preparation** In the first stage, data used to train a machine learning model is collected, cleaned, and preprocessed. Preprocessing includes tasks to remove missing or duplicate data, normalize data, and split data into a training and testing set. - **2. Model Building** In the second stage, a machine learning model is selected and trained using the prepared data. This includes tasks such as selecting an appropriate algorithm as a machine learning model, training the model, and tuning the model's parameters to improve its performance. - **3. Model Evaluation** Afterward, the performance of the trained model is evaluated using the test data set. This includes tasks such as measuring the accuracy and other performance metrics, comparing the performance of different models, and identifying potential issues with the model. - **4. Model Deployment** Finally, the selected and optimized model is deployed to a production environment where it can be used to make predictions on new data. This stage includes tasks like scaling the model to handle large amounts of data, and deploying the model to different environments to be used in different contexts. - **5. Model Monitoring and Maintenance** Once the model is deployed, it is important to monitor its performance and update it as needed. This includes tasks such as collecting feedback from the model, monitoring the model's performance metrics, and updating the model as necessary. Multiple stages are often handled by the same tool or platform, which makes a clear differentiation across stages and tools fairly difficult. Further, some machine learning workflows will not have all the steps, or they might have some variations. 
A machine learning workflow is therefore not a walk in the park, and the actual model code is just a small piece of the work. -![Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex. (D. Sculley, et al., 2015)](images/01-Introduction/ml-sculley.svg) +![Only a small fraction of real-world ML systems is composed of the ML code, as shown by the small black box in the middle. The required surrounding infrastructure is vast and complex. (D. Sculley, et al., 2015)](images/01-Introduction/ml-sculley.svg){ width=100% } Working with and developing machine learning models, monitoring their performance, and continuously retraining them on new data with possible alternative models can be challenging and requires the right tools. diff --git a/manuscript/01.2-Introduction-MLOps.Rmd b/manuscript/01.2-Introduction-MLOps.Rmd index 229a6b0..89ac529 100644 --- a/manuscript/01.2-Introduction-MLOps.Rmd +++ b/manuscript/01.2-Introduction-MLOps.Rmd @@ -25,7 +25,7 @@ The goal of DevOps is to automate and streamline the process of building, testin DevOps focuses on the deployment and management of software in general (or *traditional* software), while MLOps focuses specifically on the deployment and management of machine learning models in a production environment. The goal is basically the same as in DevOps, except that the artifact being deployed is a machine learning model. While this is achieved by the same tools and best practices used in DevOps, deploying machine learning models (compared to software) adds a lot of complexity to the process. -![Traditional vs ML software](images/01-Introduction/mlops-vs-devops.svg) +![Traditional vs ML software](images/01-Introduction/mlops-vs-devops.svg){ width=100% } Machine learning models are not just lines of code; they also require large amounts of data and specialized hardware to function properly. 
Further, machine learning models and their complex algorithms might need to change when the incoming data shifts. Ensuring that machine learning models remain accurate and reliable on new data leads to additional challenges. Another key difference is that MLOps places great emphasis on model governance, which ensures that machine learning models are compliant with relevant regulations and standards. The above list of tools within DevOps can be extended to the following for MLOps. @@ -42,7 +42,7 @@ It's important to note that the specific tools used in MLOps and DevOps may vary Incorporating the tools introduced by DevOps and MLOps can extend the machine learning workflow outlined in the previous section, resulting in a complete MLOps lifecycle that covers each stage of the machine learning process while integrating automation practices. -![](images/01-Introduction/mlops-lifecycle.svg) +![](images/01-Introduction/mlops-lifecycle.svg){ width=100% } Integrating MLOps into machine learning projects introduces additional complexity into the workflow. Although the development stage can be carried out on a local machine, subsequent stages are typically executed within a cloud platform. Additionally, the transition from one stage to another is automated using CI/CD tools, which automate testing and deployment. diff --git a/manuscript/01.3-Introduction-Roles_and_Tasks.Rmd b/manuscript/01.3-Introduction-Roles_and_Tasks.Rmd index eb2eb2f..3f46d41 100644 --- a/manuscript/01.3-Introduction-Roles_and_Tasks.Rmd +++ b/manuscript/01.3-Introduction-Roles_and_Tasks.Rmd @@ -2,7 +2,7 @@ The MLOps lifecycle typically involves several key roles and responsibilities, each with their own specific tasks and objectives. 
-![Roles and their operating areas within the MLOps lifecycle](images/01-Introduction/mlops-roles-and-tasks.svg) +![Roles and their operating areas within the MLOps lifecycle](images/01-Introduction/mlops-roles-and-tasks.svg){ width=100% } ### Data Engineer diff --git a/manuscript/01.4-Introduction-Ops_practices.Rmd b/manuscript/01.4-Introduction-Ops_practices.Rmd index 6d0265c..3bb6ca8 100644 --- a/manuscript/01.4-Introduction-Ops_practices.Rmd +++ b/manuscript/01.4-Introduction-Ops_practices.Rmd @@ -14,7 +14,7 @@ Containerization is an essential component in operations as it enables deploying All of this makes them beneficial compared to deploying an application on a virtual machine or, as is done traditionally, directly on a machine. Virtual machines emulate an entire computer system and require a hypervisor to run, which introduces additional overhead. Similarly, a traditional deployment involves installing software directly onto a physical or virtual machine without the use of containers or virtualization. Both approaches also lack portability. -![](./images/01-Introduction/ops-containerization.drawio.svg) +![](./images/01-Introduction/ops-containerization.drawio.svg){ width=100% } The concept of container images is analogous to shipping containers in the physical world. Just as shipping containers can be loaded with different types of cargo, a container image can be used to create different containers with various applications and configurations. Both the physical containers and container images are standardized, just like blueprints, enabling multiple operators to work with them. This allows for the deployment and management of applications in various environments and cloud platforms, making containerization a versatile solution. @@ -52,7 +52,7 @@ Popular tools such as Azure ML, AWS Sagemaker, Kubeflow, and MLflow offer their GitHub provides a variety of branching options to enable flexible collaboration workflows. 
Each branch serves a specific purpose in the development process, and using them well helps teams collaborate more efficiently. -![](./images/01-Introduction/ops-version-control.drawio.svg) +![](./images/01-Introduction/ops-version-control.drawio.svg){ width=100% } *Main Branch:* The main branch is the default branch in a repository. It represents the latest stable version and production-ready state of a codebase, and changes to the code are merged into the main branch as they are completed and tested. *Feature Branch:* A feature branch is used to develop a new feature or functionality. It is typically created off the main branch, and once the feature is completed, it can be merged back into the main branch. @@ -65,7 +65,7 @@ After a programmer has made changes to their code, they would typically use Git After committing changes locally, the programmer may want to share those changes with others. They would do this by pushing their local commits to a remote repository using the command `git push`. Once the changes are pushed, others can pull those changes down to their local machines and continue working on the project by using the command `git pull`. -![](./images/01-Introduction/ops-git-commands.png) +![](./images/01-Introduction/ops-git-commands.png){ width=100% } If the programmer is collaborating with others, they may need to merge their changes with changes made by others. This can be done using the `git merge ` command, which combines two branches of development history. The programmer may need to resolve any conflicts that arise during the merge. @@ -78,7 +78,7 @@ While automating the code review process is generally viewed as advantageous, it Continuous Integration (CI) and Continuous Delivery / Continuous Deployment (CD) are related software development practices that work together to automate and streamline the development and deployment of code changes to production. 
Deploying new software and models without CI/CD often requires a lot of implicit knowledge and manual steps. -![](./images/01-Introduction/ops-ci-cd.drawio.svg) +![](./images/01-Introduction/ops-ci-cd.drawio.svg){ width=100% } 1. *Continuous Integration (CI)*: a software development practice that involves frequently integrating code changes into a shared central repository. The goal of CI is to catch and fix integration errors as soon as they are introduced, rather than waiting for them to accumulate over time. This is typically done by running automated tests and builds to catch any errors that might have been introduced with new code changes, for example when merging a Git feature branch into the main branch. @@ -97,6 +97,7 @@ At first, the environment variables are defined under `env`. Two variables are d The second part defines when the pipeline is triggered. The example shows three ways to trigger a pipeline: when pushing to the master branch (`push`), when a pull request targeting the master branch is opened or updated (`pull_request`), or when the pipeline is triggered manually via the GitHub interface (`workflow_dispatch`). The third part of the code example introduces the actual jobs and steps performed by the pipeline. The pipeline consists of two jobs, `pytest` and `docker`. The first represents the CI part of the pipeline. The run environment of the job is set up and the necessary requirements are installed. Afterward, unit tests are run using the pytest library. If the `pytest` job was successful, the `docker` job will be triggered. The job builds the Docker image from the Dockerfile and pushes it automatically to the Docker Hub repository specified in `tags`. The step introduces another kind of variable besides the `env.Variable` used before: `secrets.`. Secrets are GitHub's way of safely storing confidential information such as usernames and passwords. They can be set up using the GitHub interface and used in GitHub Actions via `secrets.SECRET_NAME`. 
+\footnotesize ```yaml name: Docker CI base @@ -153,7 +154,7 @@ jobs: push: true tags: ${{ env.DOCKERREPO }}:${{ env.DIRECTORY }} ``` - +\normalsize ### Infrastructure as code diff --git a/manuscript/02-Overview_about_book_tutorials.Rmd b/manuscript/02-Overview_about_book_tutorials.Rmd index 3dcfb2d..806bd9c 100644 --- a/manuscript/02-Overview_about_book_tutorials.Rmd +++ b/manuscript/02-Overview_about_book_tutorials.Rmd @@ -1,3 +1,5 @@ +\newpage + # Overview about book tutorials The book contains two sections with distinct focuses. The first section comprises Chapters 3 to 6, which consist of tutorials on the specific tools aforementioned. These chapters also serve as prerequisites for the subsequent sections. Among these tutorials, the chapters dedicated to *Airflow* and *MLflow* are oriented towards Data Scientists, providing insights into their usage. The chapters centered around *Kubernetes* and *Terraform* target Data- and MLOps Engineers, offering detailed guidance on deploying and managing these tools. diff --git a/manuscript/03-Airflow.Rmd b/manuscript/03-Airflow.Rmd index e8a4f4a..1525d2e 100644 --- a/manuscript/03-Airflow.Rmd +++ b/manuscript/03-Airflow.Rmd @@ -2,7 +2,7 @@ [Apache Airflow](https://github.com/apache/airflow) is an open-source platform to develop, schedule and monitor workflows. Airflow comes with a web user interface that aims to make managing workflows as easy as possible and provides a good overview of each workflow over time and the ability to inspect logs and manage tasks, for example retrying a task in case of failure. -![](./images/03-Airflow/web-interface_overview.png) +![](./images/03-Airflow/web-interface_overview.png){ width=100% } However, the philosophy of Airflow is to define workflows as code, so coding will always be required. Thus, Airflow can also be referred to as a *“Workflows as code”*-tool that allows for a dynamic, extensible, and flexible management of its workflows. 
diff --git a/manuscript/03.1-Airflow-Core_Components.Rmd b/manuscript/03.1-Airflow-Core_Components.Rmd index e4519c0..5f15bde 100644 --- a/manuscript/03.1-Airflow-Core_Components.Rmd +++ b/manuscript/03.1-Airflow-Core_Components.Rmd @@ -14,6 +14,7 @@ A workflows in Airflow is implemented as a DAG, a *Directed Acyclic Graph*. A *g The DAG object is needed to nest the separate tasks into a workflow. A workflow specified in code, e.g. Python, is often also referred to as a *pipeline*. Both terms can be used synonymously when working with Airflow. The following code snippet depicts how to define a DAG object in Python code. The `dag_id` string is a unique identifier of the DAG object. The `default_args` dictionary consists of additional parameters that can be specified. Only two additional parameters are shown here. There are many more, which are listed in the [official documentation](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/models/dag/index.html#airflow.models.dag.DAG). +\footnotesize ```python from airflow.models import DAG from pendulum import datetime @@ -30,16 +31,18 @@ example_dag = DAG( default_args=default_args ) ``` +\normalsize The state of a workflow can be accessed via the web user interface, as shown in the images below. The first image shows Airflow's overview of all DAGs that are currently "Active", along with further information about them such as their "Owner", how many "Runs" have been performed and whether they were successful, and much more. The second image depicts a detailed overview of the DAG "xcom_fundamentals". Besides looking into the *Audit Logs* of the DAG and the *Task Duration*, it is also possible to check the *Code* of the DAG. 
-![](./images/03-Airflow/web-interface_dags.png) +![](./images/03-Airflow/web-interface_dags.png){ width=100% } -![](./images/03-Airflow/web-interface_dag-grid.png) +![](./images/03-Airflow/web-interface_dag-grid.png){ width=100% } The Airflow command line interface also allows interaction with the DAGs. The commands below show an example of its usage by listing all active DAGs. Further examples of using the CLI in a specific context are shown in the subsection about Tasks. +\footnotesize ```bash # Create the database schema airflow db init @@ -47,11 +50,11 @@ airflow db init # Print the list of active DAGs airflow dags list ``` +\normalsize People sometimes think of a DAG definition file as a place where the actual data processing is done. That is not the case at all! The script's only purpose is to define a DAG object. It needs to evaluate quickly (in seconds, not minutes) since the scheduler of Airflow will load and execute it periodically to account for changes in the DAG definition. A DAG usually consists of multiple steps it runs through, also called *tasks*. Tasks themselves consist of *operators*. This will be outlined in the following subsections. - ### Operators An Operator represents a single predefined task in a workflow. It is basically a unit of work that Airflow has to complete. Operators usually run independently and generally do not share any information by themselves. There are different categories of operators for different kinds of work, for example *Action operators*, *Transfer operators*, or *Sensors*. @@ -67,6 +70,7 @@ The `PythonOperator` is actually declared a deprecated function. Airflow 2.0 pro As its name suggests, the `BashOperator` executes commands in the bash shell. +\footnotesize ```python from airflow.operators.bash_operator import BashOperator @@ -76,11 +80,13 @@ bash_task = BashOperator( dag=action_operator_fundamentals ) ``` +\normalsize **PythonOperator** The `PythonOperator` expects a Python callable. 
Airflow passes a set of keyword arguments from the `op_kwargs` dictionary to the callable as input. +\footnotesize ```python from airflow.operators.python_operator import PythonOperator @@ -94,11 +100,13 @@ sleep_task = PythonOperator( dag=action_operator_fundamentals ) ``` +\normalsize **EmailOperator** The `EmailOperator` sends predefined emails from an Airflow DAG run. For example, it could be used to send a notification on whether a workflow was successful or not. The `EmailOperator` requires the Airflow system to be configured with email server details as a prerequisite. Please refer to the official docs on how to do this. +\footnotesize ```python from airflow.operators.email_operator import EmailOperator @@ -111,6 +119,7 @@ email_task = EmailOperator( dag=action_operator_fundamentals ) ``` +\normalsize #### Transfer Operators @@ -118,6 +127,7 @@ email_task = EmailOperator( The `GoogleApiToS3Operator` makes requests to any Google API that supports discovery and uploads its response to AWS S3. The example below loads data from Google Sheets and saves it to an AWS S3 file. +\footnotesize ```python from airflow.providers.amazon.aws.transfers.google_api_to_s3 import GoogleApiToS3Operator @@ -135,11 +145,13 @@ task_google_sheets_values_to_s3 = GoogleApiToS3Operator( s3_destination_key=s3_destination_key, ) ``` +\normalsize **DynamoDBToS3Operator** The `DynamoDBToS3Operator` copies the content of an AWS DynamoDB table to an AWS S3 bucket. It is also possible to specify criteria such as `dynamodb_scan_kwargs` to filter the transferred data and replicate only the records that match them. +\footnotesize ```python from airflow.providers.amazon.aws.transfers.dynamodb_to_s3 import DynamoDBToS3Operator @@ -154,11 +166,13 @@ backup_db = DynamoDBToS3Operator( file_size=20, ) ``` +\normalsize **AzureFileShareToGCSOperator** The `AzureFileShareToGCSOperator` transfers files from Azure FileShare to Google Cloud Storage. 
Even though the storage systems are quite similar, this operator is beneficial when the cloud provider needs to be changed. The `share_name` parameter denotes the Azure FileShare share name to transfer files from. Similarly, the `dest_gcs` parameter specifies the destination bucket on Google Cloud Storage. +\footnotesize ```python from airflow.providers.google.cloud.transfers.azure_fileshare_to_gcs import AzureFileShareToGCSOperator @@ -176,6 +190,7 @@ sync_azure_files_with_gcs = AzureFileShareToGCSOperator( google_impersonation_chain=None, ) ``` +\normalsize #### Sensors @@ -184,6 +199,7 @@ A *sensor* is a special subclass of an operator that is triggered when an extern It is also possible to specify further requirements for checking the condition. The `mode` argument sets how the condition is checked. `mode='poke'` keeps the task running and re-checks the condition repeatedly until it is met (this is the default), whereas `mode='reschedule'` gives up the task slot between checks and tries again later. In addition, the `poke_interval` defines how long a sensor should wait between checks, and the `timeout` parameter defines how long to wait before letting a task fail. The example below shows a `FileSensor` that checks for the creation of a file with a `poke_interval` defined. +\footnotesize ```python from airflow.contrib.sensors.file_sensor import FileSensor @@ -193,18 +209,19 @@ file_sensor_task = FileSensor(task_id='file_sense', dag=sensor_operator_fundamentals ) ``` +\normalsize Other sensors are for example: * `ExternalTaskSensor` - Waits for a task in another DAG to complete * `HttpSensor` - Requests a web URL and checks for content * `SqlSensor` - Runs a SQL query to check for content - ### Tasks To use an operator in a DAG it needs to be instantiated as a task. Tasks determine how to execute an operator’s work within the context of a DAG. The concepts of a *Task* and *Operator* are somewhat interchangeable, as each task is actually a subclass of Airflow’s `BaseOperator`. 
However, it is useful to think of them as separate concepts. Tasks are instances of operators and are usually assigned to a Python variable. The following code instantiates the `BashOperator` to two different variables `task_1` and `task_2`. The `depends_on_past` argument ensures that the same task in the previous DAG run has succeeded before the current task instance is triggered. +\footnotesize ```python task_1 = BashOperator( task_id="print_date", @@ -227,9 +244,11 @@ task_3 = BashOperator( dag=task_fundamentals ) ``` +\normalsize Tasks can be referred to by their `task_id`, either in the web interface or via the Airflow CLI. `airflow tasks test` runs task instances locally, outputs their log to stdout (on screen), does not bother with dependencies, and does not communicate the state (running, success, failed, …) to the database. It simply allows testing a single task instance. The same applies to `airflow dags test`. +\footnotesize ```bash # Run a single task with the following command airflow run @@ -243,6 +262,7 @@ airflow tasks test task_fundamentals print_date 2015-06-01 # Testing the task sleep airflow tasks test task_fundamentals sleep 2015-06-01 ``` +\normalsize #### Task dependencies @@ -253,12 +273,16 @@ A machine learning or data workflow usually has a specific order in which its ta Example code for chaining tasks looks like this: +\footnotesize ```python # Simply chained dependencies task_1 >> task_2 >> task_3 ``` -![](./images/03-Airflow/task-dependencies.png "task-dependencies") +\normalsize +![](./images/03-Airflow/task-dependencies.png "task-dependencies"){ width=100% } + +\footnotesize ```python # Mixed dependencies task_1 >> task_3 << task_2 @@ -269,26 +293,34 @@ task_2 >> task_3 # or [task_1, task_2] >> task_3 - ``` -![](./images/03-Airflow/task-dependencies_mixed.png "task-dependencies_mixed") +\normalsize +![](./images/03-Airflow/task-dependencies_mixed.png "task-dependencies_mixed"){ width=100% } + +\footnotesize 
```python # It is also possible to define dependencies with task_1.set_downstream(task_2) task_3.set_upstream(task_1) ``` -![](./images/03-Airflow/task-dependencies_upstream.png "task-dependencies_upstream") +\normalsize + +![](./images/03-Airflow/task-dependencies_upstream.png "task-dependencies_upstream"){ width=100% } +\footnotesize ```python # It is also possible to mix it completely wild task_1 >> task_3 << task_2 task_1.set_downstream(task_2) ``` -![](./images/03-Airflow/task-dependencies_wild.png "task-dependencies_wild") +\normalsize + +![](./images/03-Airflow/task-dependencies_wild.png "task-dependencies_wild"){ width=100% } It is possible to list all tasks within a DAG using the CLI. The commands below show two approaches. +\footnotesize ```bash # Prints the list of tasks in the "task_fundamentals" DAG airflow tasks list task_fundamentals @@ -296,10 +328,10 @@ airflow tasks list task_fundamentals # Prints the hierarchy of tasks in the "task_fundamentals" DAG airflow tasks list task_fundamentals --tree ``` +\normalsize In general, each task of a DAG may run on a different compute resource (also called a worker). Running tasks in the same environment or with elevated privileges may require extensive use of environment variables. This means that tasks cannot communicate with each other by default, which impedes the exchange of information and data. To achieve cross-communication, an additional feature of Airflow needs to be used, called *XCom*. - ### XCom *XCom* (short for “cross-communication”) is a mechanism that allows passing information between tasks in a DAG. This is beneficial because, by default, tasks are isolated within Airflow and may run on entirely different machines. @@ -312,12 +344,12 @@ The XCom is identified by a *key*-*value* pair stored in the Airflow Metadata Da The example below shows these mechanics. However, when looking at the `push` one can see a difference in their functionality. 
The method `push_by_returning` uses the operator's auto-push functionality, which pushes its result into a default XCom key (the default is `return_value`). With auto-push, a plain Python `return` statement is enough; it is enabled by setting the `do_xcom_push` argument to `True`, which is also the default (`@task` functions do this as well). To push an XCom with a specific key, the `xcom_push` method needs to be called explicitly. To call `xcom_push`, one needs access to the task instance (ti) object. It can be obtained by passing the `"ti"` parameter to the Python callable of the PythonOperator. Its usage can be seen in the `push` method, where a custom key is also given to the XCom. Similarly, the `puller` function uses the `xcom_pull` method to pull the previously pushed values from the metadata database. +\footnotesize ```python from airflow.models import DAG from pendulum import datetime from airflow.operators.python_operator import PythonOperator - xcom_fundamentals = DAG( dag_id='xcom_fundamentals', start_date=datetime(2023, 1, 1, tz="Europe/Amsterdam"), @@ -354,7 +386,6 @@ def puller(**kwargs): print(f"pulled_value_1: {pulled_value_1}") print(f"pulled_value_2: {pulled_value_2}") - push1 = PythonOperator( task_id='push', # provide context is for getting the TI (task instance ) parameters @@ -381,7 +412,7 @@ pull = PythonOperator( # push1, push2 are upstream to pull [push1, push2] >> pull ``` - +\normalsize ### Scheduling @@ -389,7 +420,7 @@ A workflow can be run either triggered manually or on a scheduled basis. Each DA The Airflow scheduler monitors all DAGs and tasks, and triggers those task instances whose dependencies have been met. A scheduled task needs several attributes to be specified. Looking at the first example of how to define a DAG, we can see that we already defined the attributes `start_date` and `schedule_interval`. We can also add optional attributes such as `end_date` and `max_tries`. 
- +\footnotesize ```python from airflow.models import DAG @@ -407,7 +438,7 @@ example_dag = DAG( default_args=default_args ) ``` - +\normalsize ### Taskflow @@ -417,6 +448,7 @@ All of the processing in a TaskFlow DAG is similar to the traditional paradigm o The example below shows how to define the workflow of an ETL pipeline using the TaskFlow paradigm. The pipeline invokes an *extract* task, sends the ordered data to a *transform* task for summarization, and finally invokes a *load* task with the previously summarized data. It is easy to see that the TaskFlow workflow contrasts with Airflow's traditional paradigm in several ways. +\footnotesize ```python import json import pendulum @@ -459,10 +491,12 @@ def taskflow_api_fundamentals(): # Finally execute the DAG taskflow_api_fundamentals() ``` +\normalsize Even the passing of data between tasks, which might run on different workers, is handled by TaskFlow, so there is no need to call XCom explicitly. XCom is still used behind the scenes, but all of the XCom usage for passing data between tasks is abstracted away. The XComs can therefore still be viewed in the Airflow UI as before. The example below shows the `transform` function written in the traditional Airflow style using XCom and highlights the simplicity of using TaskFlow. +\footnotesize ```python def transform(**kwargs): ti = kwargs["ti"] @@ -477,6 +511,7 @@ def transform(**kwargs): total_value_json_string = json.dumps(total_value) ti.xcom_push("total_order_value", total_value_json_string) ``` +\normalsize As the comparison shows, TaskFlow is an easy way to build workflows in Airflow. It takes away a lot of worries when it comes to building pipelines and allows for a flexible programming experience using decorators. It offers several more features, such as reusing decorated tasks in multiple DAGs, overriding task parameters like the `task_id`, custom XCom backends to automatically store data in e.g. 
AWS S3, and using TaskGroups to group multiple tasks for a better overview in the Airflow UI. diff --git a/manuscript/03.2-Airflow-Exemplary_ML_Workflow.Rmd b/manuscript/03.2-Airflow-Exemplary_ML_Workflow.Rmd index ddb915e..d50d4ee 100644 --- a/manuscript/03.2-Airflow-Exemplary_ML_Workflow.Rmd +++ b/manuscript/03.2-Airflow-Exemplary_ML_Workflow.Rmd @@ -2,10 +2,11 @@ Please refer to the previous section. This involves everything +\footnotesize ```python # build dag that includes everything from before def to_be_done(): pass ``` - +\normalsize diff --git a/manuscript/03.3-Airflow-infrastructure.Rmd b/manuscript/03.3-Airflow-infrastructure.Rmd index cf38e90..e3350f9 100644 --- a/manuscript/03.3-Airflow-infrastructure.Rmd +++ b/manuscript/03.3-Airflow-infrastructure.Rmd @@ -17,7 +17,7 @@ An Airflow deployment generally consists of five different compontens: The following graph shows how the components make up the Airflow architecture. -![Architecture of Airflow as a distributed system](./images/03-Airflow/architecture_airflow.drawio.svg) +![Architecture of Airflow as a distributed system](./images/03-Airflow/architecture_airflow.drawio.svg){ width=100% } ### Scheduler @@ -42,9 +42,11 @@ In a production ready deployment of Airflow the executor pushes the task executi The currently configured executor can be checked by running the following command. +\footnotesize ```bash airflow config get-value core executor ``` +\normalsize ### DAG Directory diff --git a/manuscript/04-MlFlow.Rmd b/manuscript/04-MlFlow.Rmd index f929d85..34df7cd 100644 --- a/manuscript/04-MlFlow.Rmd +++ b/manuscript/04-MlFlow.Rmd @@ -11,7 +11,7 @@ MLflow is library-agnostic, which means one can use it with any ML library and The aim is to make its use as reproducible and reusable as possible so that Data Scientists need only minimal changes to integrate MLflow into their existing codebase. MLflow also comes with a web user interface to conveniently view and compare models and metrics. 
-![Web Interface of MLflow](images/04-MLflow/MLflow_web_interface-overview.png) +![Web Interface of MLflow](images/04-MLflow/MLflow_web_interface-overview.png){ width=100% } ### Prerequisites {.unlisted .unnumbered} diff --git a/manuscript/04.1-MlFlow-Core_Components.Rmd b/manuscript/04.1-MlFlow-Core_Components.Rmd index 25313ee..f083428 100644 --- a/manuscript/04.1-MlFlow-Core_Components.Rmd +++ b/manuscript/04.1-MlFlow-Core_Components.Rmd @@ -7,23 +7,28 @@ The four primary components of MLflow are shown in more detail and with exemplar MLflow Tracking allows to log and compare parameters, code versions, metrics, and artifacts of a machine learning model. This can be easily done by minimal changes to your code using the MLflow Tracking API. The following examples depict the basic concepts and show how to use it. To use MLflow within your code it needs to be imported first. +\footnotesize ```python import mlflow ``` +\normalsize #### MLflow experiment MLflow experiments are a part of MLflow’s tracking component that allow to group runs together based on custom criteria. For example multiple model runs with different model architectures might be grouped within one experiment to make it easier for evaluation. +\footnotesize ```python experiment_name = "introduction-experiment" mlflow.set_experiment(experiment_name) ``` +\normalsize #### MLflow run An MLflow run is an execution environment for a piece of machine learning code. Whenever parameters or performances of a ML run or experiment should be tracked, a new MLflow run is created. This is easily done using `MLflow.start_run()`. Using `MLflow.end_run()` the run will similarly be ended. +\footnotesize ```python run_name = "example-run" @@ -32,9 +37,11 @@ run = mlflow.active_run() print(f"Active run_id: {run.info.run_id}") mlflow.end_run() ``` +\normalsize It is a good practice to pass a run name to the MLflow run to identify it easily afterwards. 
It is also possible to use the context manager as shown below, which allows for a smoother style. +\footnotesize ```python run_name = "context-manager-run" @@ -42,10 +49,12 @@ with mlflow.start_run(run_name=run_name) as run: run_id = run.info.run_id print(f"Active run_id: {run_id}") ``` +\normalsize **Child runs** It is possible to create child runs of the current run, based on the run ID. This can be used for example to gain a better overview of multiple run. Belows code shows how to create a child run. +\footnotesize ```python # Create child runs based on the run ID with mlflow.start_run(run_id=run_id) as parent_run: @@ -60,11 +69,13 @@ with mlflow.start_run(run_id=run_id) as parent_run: mlflow.log_metric("acc", 0.90) print("child run_id : {}".format(child_run.info.run_id)) ``` +\normalsize #### Logging metrics & parameters The main reason to use MLflow Tracking is to log and store parameters and metrics during a MLflow run. *Parameters* represent the input parameters used for training, e.g. the initial learning rate. *Metrics* are used to track the progress of the model training and are usually updated over the course of a model run. MLflow allows to keep track of the model’s train and validation losses and to visualize their development across the training run. Parameters and metrics can be easily logged by calling `MLflow.log_param` and `MLflow.log_metric`. One can also specify a tag to identify the run by using `MLflow.set_tag`. Belows example show how to use each method within a run. +\footnotesize ```python run_name = "tracking-example-run" experiment_name = "tracking-experiment" @@ -96,10 +107,12 @@ with mlflow.start_run(run_name=run_name) as run: print(f"run_id: {run_id}") print(f"experiment_id: {experiment_id}") ``` +\normalsize It is also possible to add information after the experiment ran. One just needs to specify the run ID from the previous run to the current run. The example below shows how to do this, and uses the `mlflow.client.MLflowClient`. 
The `mlflow.client` module provides a Python CRUD interface, which is a lower level API directly translating to the MLflow [REST API](https://mlflow.org/docs/latest/rest-api.html) calls. It can be used similarly to the `mlflow`-module of the higher level API. It is mentioned here to give a hint of its existence. +\footnotesize ```python from mlflow.tracking import MlflowClient @@ -115,35 +128,41 @@ with mlflow.start_run(run_id=run_id): mlflow.log_metric("f1", 0.9) print(f"run_id: {run.info.run_id}") ``` +\normalsize #### Display & View metrics How can the logged parameters and metrics be used and viewed afterwards? It is possible to get an overview of the currently stored runs by querying the MLflow API and printing the results. +\footnotesize ```python current_experiment = dict(mlflow.get_experiment_by_name(experiment_name)) mlflow_run = mlflow.search_runs([current_experiment['experiment_id']]) print(f"MLflow_run: {mlflow_run}") ``` +\normalsize -![MLflow Model Tracking CLI Run Overview](images/04-MLflow/MLflow_cli_interface-tracking.png) +![MLflow Model Tracking CLI Run Overview](images/04-MLflow/MLflow_cli_interface-tracking.png){ width=100% } Yet, viewing all the results in the web interface of MLflow gives a much better overview. By default, the tracking API writes the data to the local filesystem of the machine it’s running on under a `./mlruns` directory. This directory can be accessed by MLflow’s Tracking web interface by running `mlflow ui` via the command line. The web interface can be viewed in the browser under http://localhost:5000 (port 5000 is the MLflow default). The metrics dashboard of a run looks like the following: -![MLflow Model Tracking Dashboard](images/04-MLflow/MLflow_web_interface-tracking.png) +![MLflow Model Tracking Dashboard](images/04-MLflow/MLflow_web_interface-tracking.png){ width=100% } It is also possible to configure MLflow to log to a remote tracking server.
This allows to manage results on in a central place and share them across a team. To get access to a remote tracking server it is needed to set a MLflow tracking URI. This can be done multiple way. Either by setting an environment variable `MLflow_TRACKING_URI` to the servers URI, or by adding it to the start of our code. +\footnotesize ```python import mlflow mlflow.set_tracking_uri("http://YOUR-SERVER:YOUR-PORT") mlflow.set_experiment("my-experiment") ``` +\normalsize #### Logging artifacts *Artifacts* represent any kind of file to save during training, such as plots and model weights. It is possible to log such files as well, and place them within the same run as parameters and metrics. This means everything created within a ML run is saved at one point. Artifact files can be either single local files, or even full directories. The following example creates a local file and logs it to a model run. +\footnotesize ```python import os @@ -175,11 +194,13 @@ with mlflow.start_run(run_id=run_id) as run: print(f"run_id: {run.info.run_id}") print(f"Artifact uri: {artifact_uri}") ``` +\normalsize #### Autolog Previously, all the parameters, metrics, and files have been logged manually by the user. The *autolog*-feature of MLflow allows for automatic logging of metrics, parameters, and models without the need for an explicit log statements. This feature needs to be activated previous to the execution of a run by calling `MLflow.sklearn.autolog()`. +\footnotesize ```python import mlflow.sklearn import numpy as np @@ -197,6 +218,7 @@ with mlflow.start_run(run_name=run_name) as run: mlflow.sklearn.autolog(disable=True) ``` +\normalsize Even though this is a very convenient feature, it is a good practice to log metrics manually, as this gives more control over a ML run. 
@@ -206,6 +228,7 @@ Even though this is a very convenient feature, it is a good practice to log metr MLflow Models manages and deploys models from various different ML libraries such as scikit-learn, TensorFlow, PyTorch, Spark, and [many more](https://MLflow.org/docs/latest/models.html). It includes a generic `MLmodel` format that acts as a standard format to package ML models so they can be used in different projects and environments. The `MLmodel` format defines a convention that saves the model in so called *“flavors”*. For example `mlflow.sklearn` allows to load mlflow models back into scikit-learn. The stored model can also be served easily and conveniently using these *flavors* as a python function either locally, in Docker-based REST servers containers, or on commercial serving platforms like AWS SageMaker or AzureML. The following example is based on the scikit-learn library. +\footnotesize ```python # Import the sklearn models from MLflow import mlflow.sklearn @@ -230,9 +253,11 @@ model_name = f"RandomForestRegressionModel" print(f"model_uri: {model_uri}") print(f"model_name: {model_name}") ``` +\normalsize Once a model is stored in the correct format it can be identified by its `model_uri`, loaded, and used for prediction. +\footnotesize ```python import mlflow.pyfunc @@ -242,14 +267,16 @@ data = [[0, 1, 0]] model_pred = model.predict(data) print(f"model_pred: {model_pred}") ``` +\normalsize -![MLflow Models](images/04-MLflow/MLflow_web_interface-models.png) +![MLflow Models](images/04-MLflow/MLflow_web_interface-models.png){ width=100% } ### MLflow Model Registry The MLflow Model Registry provides a central model store to manage the lifecycle of an ML Model. This allows to register MLflow models like the *RandomForestRegressor* from the previous section to the Model Registry and include model versioning, stage transitions, and annotations. In fact, by running `MLflow.sklearn.log_model` we already did exactly that. 
Look at how easy the MLflow API is to use. Let's have a look at the code again. +\footnotesize ```python import mlflow.sklearn import mlflow.pyfunc @@ -276,9 +303,11 @@ data = [[0, 1, 0]] model_pred = model.predict(data) print(f"model_pred: {model_pred}") ``` +\normalsize Yet, it is also possible to register the MLflow model in the model registry by calling `MLflow.register_model` such as show in belows example. +\footnotesize ```python # The previously stated Model URI and name are needed to register a MLflow Model mv = mlflow.register_model(model_uri, model_name) @@ -286,9 +315,11 @@ print("Name: {}".format(mv.name)) print("Version: {}".format(mv.version)) print("Stage: {}".format(mv.current_stage)) ``` +\normalsize Once registered to the model registry the model is versioned. This enables to load a model based on a specific version and to change a model version respectively. A registered model can be also modified to transition to another version or stage. Both use cases are shown in the example below. +\footnotesize ```python import mlflow.pyfunc @@ -302,9 +333,11 @@ data = [[0, 1, 0]] model_pred = model.predict(data) print(f"model_pred: {model_pred}") ``` +\normalsize Let's stage a model to `'Staging'`. The for-loop below prints all registered models and shows that there is indeed a model with a `'Staging'`-stage. +\footnotesize ```python # Transition the model to another stage from mlflow.client import MLflowClient @@ -323,12 +356,13 @@ client.transition_model_version_stage( for rm in client.search_registered_models(): pprint(dict(rm), indent=4) ``` - +\normalsize ### MLflow Projects MLflow Projects allows to package code and its dependencies as a *project* that can be run reproducible on other data. Each project includes a *MLproject* file written in the *YAML* syntax that defines the projects dependencies, and the commands and arguments it takes to run the project. 
It basically is a convention to organizes and describe the model code so other data scientists or automated tools can run it conveniently. MLflow currently supports four environments to run your code: *Virtualenv*, *Conda*, *Docker Container*, and *system environment*. A very basic `MLproject` file is shown below that is run in an *Virtualenv* +\footnotesize ```yaml name: mlprojects_tutorial @@ -341,11 +375,12 @@ entry_points: alpha: {type: float, default: 0.5} l1_ratio: {type: float, default: 0.1} command: "python wine_model.py {alpha} {l1_ratio}" - ``` +\normalsize A project is run using the `MLflow run` command in the command line. It can run a project from either a local directory or a GitHub URI. The `MLproject` file shows that two parameters are passed to the `MLflow run` command. This is optional in this case as they have default values. It is also possible to specify extra parameters such as the experiment name or to specify the tracking uri (check the [official documentation](https://mlflow.org/docs/latest/python_api/mlflow.projects.html) for more). Below is a possible CLI command show to run the MLflow Project. By setting the `MLFLOW_TRACKING_URI` environment variable it is possible to also specify an execution backend for the run. +\footnotesize ```python # Run the MLflow project from the current directory # The parameters are optional in this case as the MLproject file has defaults @@ -357,8 +392,7 @@ MLFLOW_TRACKING_URI=http://localhost: mlflow run . --experiment-name="mode # Run the MLflow project from a Github URI and use the localhost as backend MLFLOW_TRACKING_URI=http://localhost: MLflow run https://github.com/seblum/mlops-practice#files/06-MLFlow/mlprojects --version=chapter/mlflow - - ``` +\normalsize The MLflow Projects API allows to chain projects together into workflows and also supports launching multiple runs in parallel. 
Combining this with, for example, the MLflow Tracking API enables an easy way of hyperparameter tuning to develop a model with a good fit. diff --git a/manuscript/04.2-MlFlow-Architecture.Rmd b/manuscript/04.2-MlFlow-Architecture.Rmd index 05c8971..d202179 100644 --- a/manuscript/04.2-MlFlow-Architecture.Rmd +++ b/manuscript/04.2-MlFlow-Architecture.Rmd @@ -5,7 +5,7 @@ While MLflow can be run locally for your personal model implementation, it is us The MLflow client can interface with a variety of backend and artifact storage configurations. The official [MLflow documentation](https://mlflow.org/docs/latest/tracking.html#how-runs-and-artifacts-are-recorded) outlines several detailed configurations. The example below depicts the main interaction between the different architectural components of a remote MLflow Tracking Server, a Postgres database for backend storage, and an S3 bucket for artifact storage. -![MLflow Architecture Diagram](images/04-MLflow/architecture_mlflow.drawio.svg) +![MLflow Architecture Diagram](images/04-MLflow/architecture_mlflow.drawio.svg){ width=100% } ### MLflow Tracking Server diff --git a/manuscript/05.1-Kubernetes-Core_Components.Rmd b/manuscript/05.1-Kubernetes-Core_Components.Rmd index 60bd84d..2213f7b 100644 --- a/manuscript/05.1-Kubernetes-Core_Components.Rmd +++ b/manuscript/05.1-Kubernetes-Core_Components.Rmd @@ -4,7 +4,7 @@ A K8s Cluster usually consists of a set of nodes. A node can be a virtual machine (VM) in the cloud, e.g. AWS, Azure, or GCP, or a physical on-premise instance. K8s distinguishes between a *master node* and *worker nodes*. The master node is basically the brain of the cluster. This is where everything is organized, handled, and managed. In comparison, a worker node is where the heavy lifting happens, such as running applications. Master and worker nodes communicate with each other via the so-called kubelet.
One cluster has only one master node and usually multiple worker nodes. -![K8s Cluster](images/05-Kubernetes/k8s-cluster.png) +![K8s Cluster](images/05-Kubernetes/k8s-cluster.png){ width=100% } #### Master & Control Plane @@ -16,7 +16,7 @@ To be able to work as the brain of the cluster, the master node contains a contr + Controller Manager + Cloud Controller Manager -![Master Node](images/05-Kubernetes/master-node.png) +![Master Node](images/05-Kubernetes/master-node.png){ width=100% } ##### API Server The API server serves as the connection between the frontend and the K8s control plane. All communication, external and internal, goes through it. It exposes a RESTful API on port 443 to allow communication, and performs authentication and authorization checks. Whenever we perform something on the K8s cluster, e.g. using a command like `kubectl apply -f `, we communicate with the API server (what we do here is shown in the section about pods). @@ -35,7 +35,7 @@ The controller manager is a daemon that manages the control loop. This means, th The worker nodes are the part of the cluster where the heavy lifting happens. Their VMs (or physical machines) often run Linux and thus provide a suitable running environment for each application. -![Worker Node](images/05-Kubernetes/worker-node.png) +![Worker Node](images/05-Kubernetes/worker-node.png){ width=100% } A worker node consists of three main components. @@ -57,7 +57,7 @@ The kube-proxy runs on every node via a DaemonSet. It is responsible for networ A pod is the smallest deployable unit in K8s (in contrast, the smallest deployable unit in Docker is a container). A pod is therefore a running process on one of the cluster's nodes. Within a pod, there is always one *main container* representing the application (written in whatever language, e.g. JS, Python, Go). There may also be *init containers* and/or *side containers*.
Init containers are containers that are executed before the main container. Side containers are containers that support the main container, e.g. a container that acts as a proxy to your main container. There may be volumes specified within a pod, which enable containers to share and store data. -![Pod](images/05-Kubernetes/pod.png) +![Pod](images/05-Kubernetes/pod.png){ width=100% } The containers running within a pod communicate with each other using localhost and whatever port they expose. The pod itself has a unique IP address, which enables communication between pods. @@ -69,6 +69,7 @@ Imperative management means managing the pods via a CLI and specifying all neces #### Imperative Management +\footnotesize ```bash # start a pod by specifying the pods name, # the container image to run, and the port exposed @@ -78,11 +79,13 @@ kubectl run --image="" --port=80 # It will forward to localhost:8080 kubectl port-forward pod/ 8080:80 ``` +\normalsize #### Declarative Management / Configuration Declarative configuration is done using a *yaml* format, which works on key-value pairs. +\footnotesize ```yaml # pod.yaml apiVersion: v1 @@ -112,20 +115,24 @@ spec: ports: - containerPort: 80 ``` +\normalsize Apply this declarative configuration using the following kubectl command via the CLI. +\footnotesize ```bash kubectl apply -f "file-name.yaml" # similar to before, run following to test your pod on localhost:8080 kubectl port-forward pod/ 8080:80 ``` +\normalsize #### Kubectl A word on interacting with the cluster using the CLI. In general, *kubectl* is used to interact with the K8s cluster. It allows running and applying pod configurations as seen before, as well as the already shown port forwarding. We can also inspect the cluster, see what resources are running on which nodes, see their configurations, and watch their logs. A small selection of commands is shown below.
+\footnotesize ```bash # forward the pods to localhost:8080 kubectl port-forward / 8080:80 @@ -146,3 +153,4 @@ kubectl get # describe and show specific settings of a pods kubectl describe pod ``` +\normalsize diff --git a/manuscript/05.2-Kubernetes-Application_Deployment_and_Design.Rmd b/manuscript/05.2-Kubernetes-Application_Deployment_and_Design.Rmd index 6dacfda..ba7792a 100644 --- a/manuscript/05.2-Kubernetes-Application_Deployment_and_Design.Rmd +++ b/manuscript/05.2-Kubernetes-Application_Deployment_and_Design.Rmd @@ -10,6 +10,7 @@ In general, Pods should be managed through Deployments. The purpose of a Deploym #### ReplicaSets {} A ReplicaSet makes sure that a desired number of pods is running. When looking at Pods' name of a Deployment, it usually has a random string attached. This is because a deployment can have multiple replicas and the random suffix ensures a different name after all. The way ReplicaSets work is that they implement a background control loop that checks the desired number of pods are always present on the cluster. We can specify the number of replicas by creating a yaml-file of a Deployment, similar to previous specifications done to a Pod. As a reminder, the Deployment can be applied using the `kubectl apply -f` as well. +\footnotesize ```yaml # deployment.yaml apiVersion: apps/v1 @@ -38,10 +39,12 @@ spec: ports: - containerPort: 5000 ``` +\normalsize #### Rolling updates A rolling update means that a new version of the application is rolled out. In general, a basic deployments strategy will delete every single pod before it creates a new version. This is very dangerous since there is downtime. The preferred strategy is to perform a rolling update. This ensures keeping traffic to the previous version until the new one is up and running and alternates traffic until the new version is fully healthy. K8s perfoms the update of an application while the application is up and running. 
For example, when there are two ReplicaSets running, one with version v1 and one with v2, K8s performs the update such that it only scales v1 down when v2 is already up and running and the traffic has been redirected to v2 as well. How do the deployments need to be configured for that? +\footnotesize ```yaml # deployment_rolling-update.yaml apiVersion: apps/v1 @@ -84,9 +87,11 @@ spec: ports: - containerPort: 5000 ``` +\normalsize The changes can be applied as well using `kubectl apply -f "file-name.yaml"`. Good to know: K8s does not delete the ReplicaSets of previous versions. They are still stored in the cluster store. The `spec.revisionHistoryLimit` field in the yaml controls this. By default, the last ten versions are kept. It rarely makes sense to keep more, as in the previous yaml where the last 20 versions are specified. This history makes it possible to perform **Rollbacks** to previous versions. To avoid discrepancies in a cluster, one should always update using the declarative approach. Listed below are a number of commands that trigger and help with a rollback or with rollouts in general. +\footnotesize ```bash # check the status of the current deployment process kubectl rollout status deployments @@ -104,12 +109,13 @@ kubectl rollout undo deployment # there is a limit of history and k8s only keeps 10 previous versions kubectl rollout undo deployment --to-revision= ``` - +\normalsize ### Resource Management Besides the importance of a healthy application itself, there should also be enough resources allocated, e.g. memory & CPU, so the application can perform well. Yet, it should only consume the resources it needs and not block others. Otherwise, one application using a lot of resources can leave nothing for other applications and eventually starve them.
To prevent this from happening, K8s allows defining the minimum amount of resources a container needs (request) as well as the maximum amount of resources a container can have (limit). Configuring limits and requests for a container can be done within the spec for a Pod or Deployment. Actually, we have been using them all the time previously. +\footnotesize ```yaml # resource-management.yaml apiVersion: apps/v1 @@ -141,13 +147,14 @@ spec: ports: - containerPort: 5000 ``` - +\normalsize ### DaemonSets The master node of K8s decides on which worker nodes a pod is scheduled. However, there are times where we want to have a copy of a pod on every node of the cluster. A *DaemonSet* ensures exactly this. This can be useful for example to deploy system daemons such as log collectors and monitoring agents. DaemonSets are automatically deployed on every single node, unless it is specified on which nodes to run. They therefore do not need a specification of nodes and can scale up and down with the cluster as needed. They will automatically schedule a pod on each new node. The given example deploys a DaemonSet to cover logging using Fluentd. +\footnotesize ```yaml # daemonsets.yaml apiVersion: v1 @@ -203,22 +210,25 @@ spec: readOnly: true terminationGracePeriodSeconds: 30 ``` - +\normalsize ### StatefulSets *StatefulSets* are used to deploy and manage stateful applications. Stateful applications are applications which are long-lived, for example databases. Most applications on K8s are stateless as they only run for a specific task. However, a database is a source of truth and should be present at all times. StatefulSets manage pods based on the same container specification, just like Deployments. Let's assume we have a StatefulSet with 3 replicas. Each Pod has a PV attached. +\footnotesize ```yaml ``` - +\normalsize ### Jobs & Cron Jobs Using the *busybox* image in the section about volumes we experienced that the image is very short-lived.
K8s is not aware of this and runs into a CrashLoopBackOff error. K8s will keep trying to restart the container, backing off for longer between attempts. Because the image is so short-lived, a job has to be executed within the container, as done previously with a shell command. However, what if we have a simple task that should only run every 5 minutes, or once a day? A good idea is to use CronJobs for such tasks, which start the image when needed. Comparing Jobs and CronJobs, Jobs execute only once, whereas CronJobs execute repeatedly according to a specified schedule expression. The following job simulates a database backup that takes about 30 seconds in total. The part in the `args` specifies that the container will sleep for 20 seconds (the hypothetical backup). Afterward, the finished Job is kept for another 10 seconds before it is cleaned up, as specified in `ttlSecondsAfterFinished`. + +\footnotesize ```yaml # job.yaml apiVersion: batch/v1 @@ -238,8 +248,11 @@ spec: - "echo 'performing db backup...' && sleep 20" restartPolicy: Never ``` +\normalsize The CronJob below runs every minute. Given the five-field structure ( * * * * * ) - ( Minute Hour Day-of-month Month Day-of-week ), the cronjob expression is defined as follows: + +\footnotesize ```yaml # cronjob.yaml apiVersion: batch/v1 @@ -259,4 +272,5 @@ spec: args: - "echo 'performing db backup...' && sleep 20" restartPolicy: Never -``` \ No newline at end of file +``` +\normalsize \ No newline at end of file diff --git a/manuscript/05.3-Kubernetes-Services_and_Networking.Rmd b/manuscript/05.3-Kubernetes-Services_and_Networking.Rmd index 7eaa757..ace780c 100644 --- a/manuscript/05.3-Kubernetes-Services_and_Networking.Rmd +++ b/manuscript/05.3-Kubernetes-Services_and_Networking.Rmd @@ -31,10 +31,7 @@ Loadbalancers are a standard way of exposing applications to the extern, for exa #### default kubernetes services There are also default K8s services created automatically to access K8s with the K8s API.
Check the endpoints of the *kubernetes* service and the endpoints of the *api-service* pod within kube-system namespace. They should be the same. -It is also possible to show all endpoints of the cluster using -```bash -kubectl get endpoints -``` +It is also possible to show all endpoints of the cluster using `kubectl get endpoints`. #### Exemplary setup of database and frontend microservices @@ -42,6 +39,7 @@ The following example show the deployment and linking of two different deploymen When looking at the ClusterIP service with `kubectl describe service backendflask` the IP address of the service to exposes, as well as the listed endpoints that connect to the database-deployments are shown. One can compare them to the IPs of the actual deployments - they are the same. +\footnotesize ```yaml # services_frontend-deployment.yaml apiVersion: apps/v1 @@ -106,7 +104,9 @@ spec: targetPort: 8501 nodePort: 30000 ``` +\normalsize +\footnotesize ```yaml # services_backend-deployment.yaml apiVersion: apps/v1 @@ -150,7 +150,7 @@ spec: # in our case has to be the same as in backendflask. targetPort: 5000 ``` - +\normalsize ### Service Discovery @@ -160,6 +160,7 @@ Service Discovery is a mechanism that lets applications and microservices locate Whenever a service is created, it is registered in the service registry with the service name and the service IP. Most clusters use CoreDNS as a service registry (this would be the telephone book itself). When having a look at the minikube cluster one should see are *core-dns* service running. Now you know what it is for. Having a closer look using `kubectl describe svc `, the core-dns service has only one endpoint. If you want to have an even closer look, you can dive into a pod itself and check the file /etc/resolv.conf. There you find a nameserver where the IP is the one of the core-dns. 
+\footnotesize ```bash # when querying services, it necessary # to specify the corresponding namespace @@ -168,6 +169,7 @@ kubectl get service -n kube-system # command for queriying the dns nslookup ``` +\normalsize #### kube-proxy diff --git a/manuscript/05.4-Kubernetes-Storage.Rmd b/manuscript/05.4-Kubernetes-Storage.Rmd index 7454efa..cb0c569 100644 --- a/manuscript/05.4-Kubernetes-Storage.Rmd +++ b/manuscript/05.4-Kubernetes-Storage.Rmd @@ -13,6 +13,7 @@ There are different types of volumes, e.g.: ### EmptyDir Volume An EmptyDir Volume is initially empty (as the name suggests). The volume is a temporary directory that shares the pods lifetime. If the pod dies, the contents of the emptyDir are lost as well. The EmptyDir is also used to share data between containers inside a Pod during runtime. +\footnotesize ```yaml # volume_empty-dir.yaml apiVersion: apps/v1 @@ -69,11 +70,13 @@ spec: memory: "128Mi" cpu: "500m" ``` +\normalsize As stated in the yaml, the busybox image immediately dies. If the Containers where created without the shell commands, the pod would be in a crashloopbackoff-state. To prevent the Pod to do so it is caught with the `sleep`commands until it scales down. Accessing a container using `kubectl exec`, it can be checked whether the foo/bar.txt has been created in *container-one*. When checking the second container *container-two*, the same file should be visible as well. This is because both containers refer to the same volume. Keep in mind though that the mountPath of the *container-two* is different. +\footnotesize ```bash # get in container kubectl exec -it -c container-one -- sh @@ -85,10 +88,12 @@ ls kubectl exec -it -c container-two -- sh ls ``` +\normalsize ### HostPath Volume THe HostPath Volume type is used when an application needs to access the underlying host file system, meaning the file system of the node. HostPath represents a pre-existing file or directory on the host machine. 
However, this can be quite dangerous and should be used with caution. If having the right access, the application can interfere and basically mess up the host. It is therefore recommended to set the rights to read only to prevent this from happening. +\footnotesize ```yaml # volume_hostpath.yaml apiVersion: apps/v1 @@ -124,18 +129,19 @@ spec: memory: "128Mi" cpu: "500m" ``` +\normalsize Similar to the EmptyDir Volume example, you can check the implementation of the HostPath Volume by accessing the volume. When comparing the file structures of the *hostpath-volume* deployment and the directory `path: /var/log` on the node the deployment is running, they should be the same. All the changes made to either on of them will make the changes available on the other. By making changes via the pod we can directly influence the Node. Again, this is why it is important to keep it read-only. +\footnotesize ```bash # access the kubectl exec -it -- sh # ssh into node minikube ssh - ``` - +\normalsize ### Persistent Volumes Persistent Volumes allow to store data beyond a Pods lifecycle. If a Pod fails, dies or moves to a different node, the data is still intact and can be shared between pods. Persistent Volume types are implemented as plugins that K8s can support(a full list can be found online). Different types of Persistent Volumes are: @@ -151,7 +157,7 @@ If a Pods or Deployments want to consume storage of the PV, they need to get acc All of this is part of a Persistent Volume Subsystem. The Persistent Volume Subsystem provides an API for users and administrators. The API abstracts details of how storage is provided from how it is consumed. Again, the provisioning of storage is done via a PV and the consumption via a PCV. 
-![Persistent Volume Subsystem](images/05-Kubernetes/persistent-volume-subsystem.png) +![Persistent Volume Subsystem](images/05-Kubernetes/persistent-volume-subsystem.png){ width=100% } Listed below are again the three main components when dealing with Persistent Volumes in K8s @@ -163,6 +169,7 @@ So how are Persistent Volumes specified in our deployments yamls? As there are ` Before applying the yaml-files we need to allocate the local storage by claiming storage on the node and setting the paths specified in the yamls. To do this, we ssh into the node using `minikube ssh`. We can then create a specific path on the node such as `/mnt/data/`. We might also create a file in it to test access once a PVC is attached to a Pod. Since we do not know yet on which node the Pod will be scheduled, we should create the directory on both nodes. All steps are listed again below. +\footnotesize ```bash # ssh into node minikube ssh @@ -171,11 +178,12 @@ sudo mkdir /mnt/data # create a file with text sudo sh -c "echo 'this is a pvc test' > /mnt/data/index.html" # do this on both nodes as pod can land on either one of them - ``` +\normalsize Afterward we can apply the yaml files and create a PV, PVC, and the corresponding Deployment utilizing the PVC. The yaml code below shows this process. +\footnotesize ```yaml # volume_persistent-volume.yaml apiVersion: v1 @@ -258,6 +266,7 @@ spec: - port: 80 targetPort: 80 ``` +\normalsize By accessing a pod using `kubectl exec -it -- sh` we can check whether the path is linked using the PVC. Now, the end result may seem the same as what we did with the HostPath Volume. But it actually is not; it only looks that way because both the PersistentVolume and the HostPath connect to the host. Yet, the locally mounted path would be somewhere else when running in the cloud. The PV configuration would point to another storage source instead of a local file system, for example an attached EBS or EFS storage.
Since we also created a LoadBalancer service, we can run `minikube tunnel` to expose the application deployment under localhost:80. It should show the content of the index.html file we created on the storage. diff --git a/manuscript/05.5-Kubernetes-Environment_Configuration_and_Security.Rmd b/manuscript/05.5-Kubernetes-Environment_Configuration_and_Security.Rmd index b88a24c..4ccc9ad 100644 --- a/manuscript/05.5-Kubernetes-Environment_Configuration_and_Security.Rmd +++ b/manuscript/05.5-Kubernetes-Environment_Configuration_and_Security.Rmd @@ -12,6 +12,7 @@ Namespaces allow to organize resources in the cluster, which makes it more overs Of course, there is also the possibility of creating one's own namespace and attaching, e.g., a Deployment to it, as seen in the following example. +\footnotesize ```yaml # namespace.yaml apiVersion: v1 @@ -44,6 +45,7 @@ spec: ports: - containerPort: 5000 ``` +\normalsize When creating a Service, a corresponding DNS entry is created as well, as seen in the Services section when calling `backendflask` directly. This entry is created according to the namespace which is assigned to the service. This can be useful when using the same configuration across multiple namespaces such as development, staging, and production. It is also possible to reach across namespaces. One needs to use the fully qualified domain name (FQDN) though, such as `..svc.cluster.local`. @@ -56,6 +58,7 @@ Annotations are an unstructured key-value mapping stored with a resource that ma Selectors are used to filter K8s objects based on a set of labels. A selector uses a simple boolean language to select pods. The selector matches the labels on an all-or-nothing principle, meaning everything specified in the selector must be fulfilled by the labels. However, this does not apply the other way around: if a resource carries multiple labels and the selector specifies only one of them, the selector still matches the resource.
How a selector matches the labels can be tested using the `kubectl` commands as seen below. +\footnotesize ```bash # Show all pods including their labels kubectl get pods --show-labels @@ -69,10 +72,12 @@ kubectl get pods -l key=value # or also look for multiple kubectl get pods -l 'key in (value1, value2)' ``` +\normalsize When using ReplicaSets in a Deployment, their selector matches the labels to a specific pod (check e.g. the section describing Deployments). Any Pods matching the label of the selector will be created according to the specified replicas. Of course, there can also be multiple labels specified. The same principle applies when working with Services. The example below shows two different Pods and two NodePort services. Each service matches to a Pod based on their selector-label relationship. Have a look at their specific settings using `kubectl`. The NodePort Service *labels-and-selectors-2* has no endpoints, as matching follows the all-or-nothing principle and none of the created Pods carries the label `environment=dev`. In contrast, even though the Pod *cat-v1* has multiple labels specified `app: cat-v1; version: one`, the NodePort Service *labels-and-selectors* is linked to it. It is also linked to the second Pod *cat-v2*. +\footnotesize ```yaml # labels.yaml apiVersion: v1 @@ -131,7 +136,7 @@ spec: - port: 80 targetPort: 5000 ``` - +\normalsize ### ConfigMaps @@ -140,6 +145,7 @@ Besides allow for an easy change of variables, another benefit of using ConfigMa The following example creates two different ConfigMaps. The first one includes three environment variables as data. The second one includes a more complex configuration of an nginx server. +\footnotesize ```yaml # configmaps.yaml apiVersion: v1 @@ -180,9 +186,11 @@ data: } } ``` +\normalsize Additionally, a Deployment is created which uses both ConfigMaps. A ConfigMap is declared under `spec.volumes` as well. It is also possible to state a reference to both ConfigMaps simultaneously.
The Deployment creates two containers. The first container mounts each ConfigMap as a Volume. Container two uses environment variables to access and configure the key-value pairs of the ConfigMaps and store them on the container. +\footnotesize ```yaml # configmaps_deployment.yaml apiVersion: apps/v1 @@ -277,9 +285,11 @@ spec: name: nginx-conf key: nginx.conf ``` +\normalsize We can check for the attached configs by accessing the containers via the shell, similar to what we did in the section about Volumes. In the container *config-map-volume*, the configs are saved under the respective `mountPath` of the volume. In the *config-map-env*, the configs are stored as environment variables. +\footnotesize ```bash # get in container -volume or -env kubectl exec -it -c >container-name< -- sh @@ -289,7 +299,7 @@ ls # print environment variables printenv ``` - +\normalsize ### Secrets @@ -297,6 +307,7 @@ Secrets, as the name suggests, store and manage sensitive information. However, It is possible to create secrets using an imperative approach as shown below. +\footnotesize ```bash # create the two secrets db-password and api-token kubectl create secret generic mysecret-from-cli --from-literal=db-password=123 --from-literal=api-token=token @@ -310,9 +321,11 @@ echo "super-save-password" > secret # create a secret from file kubectl create secret generic mysecret-from-file --from-file=secret ``` +\normalsize Similar to ConfigMaps, secrets are accessed via an environment variable or a volume. +\footnotesize ```yaml # secrets.yaml apiVersion: apps/v1 @@ -357,20 +370,24 @@ spec: memory: "128Mi" cpu: "500m" ``` +\normalsize #### Exemplary use case of secrets When pulling from a private Docker Hub repository, applying the deployment will throw an error since no username and password are specified. As they should not be coded into the deployment yaml itself, they can be accessed via a secret. In fact, a specific secret can be specified for a Docker registry.
The secret can be specified using the imperative approach. +\footnotesize ```bash kubectl create secret docker-registry docker-hub-private \ --docker-username=YOUR_USERNAME \ --docker-password=YOUR_PASSWORD \ --docker-email=YOUR_EMAIL ``` +\normalsize Finally, the secret is specified in the deployment configuration so that it can be used when the deployment is applied. +\footnotesize ```yaml # secret_dockerhub.yaml apiVersion: apps/v1 @@ -401,3 +418,4 @@ spec: ports: - containerPort: 80 ``` +\normalsize \ No newline at end of file diff --git a/manuscript/05.6-Kubernetes-Observability_and_Maintenance.Rmd b/manuscript/05.6-Kubernetes-Observability_and_Maintenance.Rmd index 9b696b1..d950069 100644 --- a/manuscript/05.6-Kubernetes-Observability_and_Maintenance.Rmd +++ b/manuscript/05.6-Kubernetes-Observability_and_Maintenance.Rmd @@ -16,6 +16,7 @@ Belows configuration shows a Deployment which includes a Liveness and a Readines + periodSeconds: The probe is called every x seconds by K8s + failureThreshold: The container will fail and restart if more than x consecutive probes fail +\footnotesize ```yaml # healthchecks.yaml apiVersion: apps/v1 @@ -98,6 +99,6 @@ spec: - port: 80 targetPort: 8080 ``` - +\normalsize ### Debugging diff --git a/manuscript/05.7-Kubernetes-Helm.Rmd b/manuscript/05.7-Kubernetes-Helm.Rmd index 6c8f6fd..318fb14 100644 --- a/manuscript/05.7-Kubernetes-Helm.Rmd +++ b/manuscript/05.7-Kubernetes-Helm.Rmd @@ -10,6 +10,7 @@ As previously mentioned, a Helm package consists of a *Helm Chart* that includes a collection of yaml and helper templates. This chart is finally packaged as a `.tar` file. The following structure shows how the chart `my-custom-chart` is organized in the directory `custom_chart`. The `Chart.yaml` file contains the main metadata about the chart, such as name, version, and description, as well as dependencies if multiple charts are packaged.
The `/templates` directory contains all the Kubernetes manifests that define the behavior of the application and are deployed automatically by installing the chart. The variables that can be specified are listed in the `values.yaml` file, which also holds their default values. Just like a *.gitignore* file it is also possible to add a *.helmignore*. +\footnotesize ``` custom_chart/ ├── .helmignore # Contains patterns to ignore @@ -19,12 +20,15 @@ custom_chart/ └── ingress.yaml # ingress.yaml manifest, └── ... # and others... ``` +\normalsize To create your own Helm Chart, there is no need to create all the files yourself. To bootstrap a Helm Chart there is the built-in command: +\footnotesize ```bash helm create ``` +\normalsize that provides all the common Kubernetes manifests (`deployment.yaml`, `hpa.yaml` , `ingress.yaml` , `service.yaml` , and `serviceaccount.yaml`) as well as helper templates to circumvent resource naming constraints and labels/annotations. The command will provide a scalable deployment for `nginx` by default, which can be simply modified to deploy a custom docker image by editing the `values.yaml` file. @@ -32,14 +36,17 @@ that provide all the common Kubernetes manifests (`deployment.yaml`, `hpa.yaml` As there is a large collection of open-source and public charts already available, it is often not necessary to create your own Helm Chart. One can simply use Helm's built-in command to search for a specific Helm Chart, as shown below by searching for a `redis` deployment, or browse for example *artifactory* or *bitnami*, which provide a large collection of public charts. +\footnotesize ```bash helm search hub redis ``` +\normalsize #### Adding a Helm Chart to the local setup. After finding the correct Helm Chart, it can simply be added to the local setup. Once added to the local setup, the chart is listed in the local repository and is ready to be installed.
+\footnotesize ```bash # Add a helm chart to the local setup under the name "bitnami" helm repo add bitnami https://charts.bitnami.com/bitnami @@ -48,18 +55,22 @@ helm repo add bitnami https://charts.bitnami.com/bitnami # Once a helm chart is listed here it can be installed helm search repo bitnami ``` +\normalsize Since Helm Charts are packaged as a `.tar` file, they can also be downloaded locally and modified as needed. +\footnotesize ```bash # Download the nginx-ingress-controller helm chart to local helm pull bitnami/nginx-ingress-controller --version 5.3.19 ``` +\normalsize #### Installing a Helm Chart Once a Helm Chart is downloaded or added to the local setup, it can be installed using the `helm install` command followed by a custom release name, and the name of the chart to be installed. It is a best practice to update the list of charts before installing, just like when installing packages in other ecosystems, such as pip for Python. +\footnotesize ```bash # Make sure we get the latest list of charts helm repo update @@ -71,17 +82,22 @@ helm install custom-bitnami bitnami/wordpress # Installing a downloaded Helm Chart from a directory helm install -f values.yaml my-custom-chart ./custom_chart ``` +\normalsize Similar to installing a Helm Chart, it can also be uninstalled. +\footnotesize ```bash # helm uninstall helm uninstall custom-bitnami helm uninstall my-custom-chart ``` +\normalsize All installed and released charts can be listed using the following command. +\footnotesize ```bash helm list -``` \ No newline at end of file +``` +\normalsize \ No newline at end of file diff --git a/manuscript/06.1-Terraform-Basic_Usage.Rmd b/manuscript/06.1-Terraform-Basic_Usage.Rmd index d6ba8fe..51ef957 100644 --- a/manuscript/06.1-Terraform-Basic_Usage.Rmd +++ b/manuscript/06.1-Terraform-Basic_Usage.Rmd @@ -3,6 +3,7 @@ A Terraform project is basically just a set of files in a directory containing resource definitions of cloud resources to be created.
Those Terraform files, denoted by the ending `.tf`, use Terraform's configuration language to define the specified resources. In the following example there are two definitions made: a `provider` and a `resource`. Later in this chapter we will dive deeper into the structure of the language. For now, we only need to know this script is creating a file called `hello.txt` that includes the text `"Hello, Terraform"`. It's our Terraform version of Hello World! +\footnotesize ```bash provider "local" { version = "~> 1.4" @@ -12,39 +13,47 @@ resource "local_file" "hello" { filename = "hello.txt" } ``` +\normalsize ### terraform init When a project is run for the first time, the Terraform project needs to be initialized. This is done via the `terraform init` command. Terraform scans the project files in this step and downloads any required providers (more details on providers in a following section). In the given example this is the local provider. +\footnotesize ```bash # Initializes the working directory which consists of all the configuration files terraform init ``` +\normalsize -![](images/06-Terraform/terraform-init.png) +![](images/06-Terraform/terraform-init.png){ width=100% } ### terraform validate The `terraform validate` command checks the code for syntax errors. This step is optional, yet it is a way to catch initial errors or minor careless mistakes. +\footnotesize ```bash # Validates the configuration files in a directory terraform validate ``` -![](images/06-Terraform/terraform-validate.png) +\normalsize + +![](images/06-Terraform/terraform-validate.png){ width=100% } ### terraform plan The `terraform plan` command verifies what action Terraform will perform and what resources will be created. This step is basically a *dry run* of the code to be executed. It also returns the provided values and some permission attributes which have been set.
+\footnotesize ```bash # Creates an execution plan to reach a desired state of the infrastructure terraform plan ``` +\normalsize -![](images/06-Terraform/terraform-plan.png) +![](images/06-Terraform/terraform-plan.png){ width=100% } ### terraform apply @@ -53,22 +62,26 @@ The command `Terraform apply` creates the resource specified in the `.tf` files. Modifications to previously deployed resources can be implemented by using `terraform apply` again. The output will denote that there are resources to change. +\footnotesize ```bash # Provision the changes in the infrastructure as defined in the plan terraform apply ``` +\normalsize -![](images/06-Terraform/terraform-apply.png) +![](images/06-Terraform/terraform-apply.png){ width=100% } ### terraform destroy To destroy all created resources and to delete everything we did before, there is a `terraform destroy` command. +\footnotesize ```bash # Deletes all the old infrastructure resources terraform destroy ``` +\normalsize -![](images/06-Terraform/terraform-destroy.png) +![](images/06-Terraform/terraform-destroy.png){ width=100% } diff --git a/manuscript/06.2-Terraform-Core_Components.Rmd b/manuscript/06.2-Terraform-Core_Components.Rmd index 3cdea55..c7f7a69 100644 --- a/manuscript/06.2-Terraform-Core_Components.Rmd +++ b/manuscript/06.2-Terraform-Core_Components.Rmd @@ -9,16 +9,19 @@ Terraform relies on plugins called providers to interact with Cloud providers, S Depending on the provider it is necessary to supply it with specific parameters. The AWS provider for example needs the `region` as well as credentials (an access key and a secret key). If nothing is specified it will automatically pull this information from the *AWS CLI* and the credentials specified under the directory `.aws/config`. It is also a best practice to specify the version of the provider, as the providers are usually maintained and updated on a regular basis.
+\footnotesize ```bash provider "aws" { region = "us-east-1" } ``` +\normalsize ### Resources A *resource* is the core building block when working with Terraform. It can be a `"local_file"` such as shown in the example above, or a cloud resource such as an `"aws_instance"` on AWS. The resource type is followed by the custom name of the resource in Terraform. Resource definitions are usually specified in the `main.tf` file. Each customization and setting of a resource is done within its resource specification. The style convention when writing Terraform code states that the resource name is written in lowercase and should not repeat the resource type. An example can be seen below. +\footnotesize ```bash # Resource type: aws_instance # Resource name: my-instance @@ -33,6 +36,7 @@ resource "aws_instance" "my-instance" { } } ``` +\normalsize ### Data Sources @@ -40,6 +44,7 @@ resource "aws_instance" "my-instance" { A typical example is shown below as the `"aws_ami"` data source available in the AWS provider. This data source is used to recover attributes from an existing AMI (Amazon Machine Image). The example creates a data source called `"ubuntu"` that queries the AMI registry and returns several attributes related to the located image. +\footnotesize ```bash data "aws_ami" "ubuntu" { most_recent = true @@ -54,15 +59,18 @@ data "aws_ami" "ubuntu" { owners = ["099720109477"] # Canonical } ``` +\normalsize Data sources and their attributes can be used in resource definitions by prepending the `data` prefix to the attribute name. The following example uses the `"aws_ami"` data source within an `"aws_instance"` resource. +\footnotesize ```bash resource "aws_instance" "web" { ami = data.aws_ami.ubuntu.id instance_type = "t2.micro" } ``` +\normalsize ### State @@ -71,6 +79,7 @@ Providing information about already existing resources is the primary purpose of Terraform uses the concept of a backend to store and retrieve its statefile.
The default backend is the local backend, which stores the statefile in the project's root folder. However, we can also configure an alternative (remote) backend to store it elsewhere. The backend can be declared within a `terraform` block in the project files. The given example stores the statefile in an AWS S3 Bucket called `some-bucket`. Keep in mind this needs access to an AWS account and also needs the AWS provider of Terraform. +\footnotesize ```bash terraform { backend "s3" { @@ -79,4 +88,5 @@ terraform { region = "us-east-1" } } -``` \ No newline at end of file +``` +\normalsize \ No newline at end of file diff --git a/manuscript/06.3-Terraform-Modules.Rmd b/manuscript/06.3-Terraform-Modules.Rmd index 651b2bb..3e833ec 100644 --- a/manuscript/06.3-Terraform-Modules.Rmd +++ b/manuscript/06.3-Terraform-Modules.Rmd @@ -5,6 +5,7 @@ A Terraform module allows to reuse resources in multiple places throughout the p A Terraform module is built as a directory containing one or more resource definition files. Basically, when putting all our code in a single directory, we already have a module. This is exactly what we did in our previous examples. However, Terraform does not include subdirectories on its own. Subdirectories must be called explicitly using a Terraform `module` block. The example below references a module located in a `./network` subdirectory and passes two parameters to it. +\footnotesize ```bash # main.tf module "network" { @@ -13,9 +14,11 @@ module "network" { environment = "prod" } ``` +\normalsize Each module consists of a similar file structure as the root directory. This includes a `main.tf` where all resources are specified, as well as files for input and output variables such as `variables.tf` and `outputs.tf`. However, providers are usually configured only in the root module and are not reused in modules. Note that there are different approaches on where to specify the providers.
They are either specified in the `main.tf` or a separate `providers.tf`. It does not make a difference for Terraform as it does not distinguish between the resource definition files. It is merely a strategy to keep code and project in a clean and consistent structure. +\footnotesize ``` root │ main.tf @@ -27,12 +30,13 @@ root │ variables.tf │ outputs.tf ``` - +\normalsize ### Input Variables Each module can have multiple *Input Variables*. Input Variables serve as parameters for a Terraform module so users can customize behavior without editing the source. In the previous example of importing a `network` module, two input variables were specified, `create_public_ip` and `environment`. Input variables are usually specified in the `variables.tf` file. +\footnotesize ```bash # variables.tf variable "instance_name" { @@ -41,6 +45,8 @@ variable "instance_name" { description = "Name of the aws instance to be created" } ``` +\normalsize + Each variable has a type (e.g. `string`, `map`, `set`, `bool`) and may have a `default` value and `description`. Any variable that has no default must be supplied with a value when calling the `module` reference. This means that variables defined in the root module must have values assigned, otherwise Terraform will fail. Values can come from different sources, for example * a variable's `default` value @@ -50,6 +56,7 @@ Each variable has a type (e.g. `string`, `map`, `set`, `boolen`) and may have a Variables can be used in expressions using the `var.` prefix as shown in the example below. We use the resource configuration of the previous example to create an `aws_instance` but this time its name is provided by an input variable. +\footnotesize ```bash # main.tf resource "aws_instance" "awesome-instance" { @@ -61,7 +68,7 @@ resource "aws_instance" "awesome-instance" { } } ``` - +\normalsize ### Output Variables @@ -69,6 +76,7 @@ Similar to Input variables, a terraform module has *output variables*.
As their The example below defines an output value *instance_address* containing the IP address of an EC2 instance that we create with a module. Any module that references this module can use the *instance_address* value by referencing it via *module.module_name.instance_address*. +\footnotesize ```bash # outputs.tf output "instance_address" { @@ -76,13 +84,15 @@ output "instance_address" { description = "Web server's private IP address" } ``` +\normalsize -![](images/06-Terraform/outputs.png "outputs") +![](images/06-Terraform/outputs.png "outputs"){ width=100% } ### Local Variables In addition to input and output variables, a module can use local variables. Local values are basically just a convenience feature to assign a shorter name to an expression and work like standard variables. This means their scope is also limited to the module they are declared in. Using local variables reduces code repetition, which can be especially valuable when dealing with output variables from a module. +\footnotesize ```bash # main.tf locals { @@ -100,3 +110,4 @@ module "service2" { vpc_id = local.vpc_id } ``` +\normalsize diff --git a/manuscript/06.4-Terraform-Tips_and_Tricks.Rmd b/manuscript/06.4-Terraform-Tips_and_Tricks.Rmd index a1175ea..4d0b45d 100644 --- a/manuscript/06.4-Terraform-Tips_and_Tricks.Rmd +++ b/manuscript/06.4-Terraform-Tips_and_Tricks.Rmd @@ -9,16 +9,18 @@ Terraform comes with different looping constructs, each used slightly different. Count can be used to loop over any resource and module. Every Terraform resource has a meta-parameter *count* one can use. Count is the simplest and most limited iteration construct, and all it does is define how many copies of a resource to create. When creating multiple instances with one specification, the problem is that each instance must have a unique name, otherwise Terraform would raise an error.
Therefore we need to index the meta-parameter, just as one would in a for-loop, to give each resource a unique name. The example below shows how to do this on an AWS IAM user. +\footnotesize ```bash resource "aws_iam_user" "example" { count = 2 name = "neo.${count.index}" } ``` +\normalsize -![](images/06-Terraform/additionals_count.png) - +![](images/06-Terraform/additionals_count.png){ width=100% } +\footnotesize ```bash variable "user_names" { description = "Create IAM users with these names" @@ -31,14 +33,17 @@ resource "aws_iam_user" "example" { name = var.user_names[count.index] } ``` +\normalsize -![](images/06-Terraform/additionals_count_list.png) +![](images/06-Terraform/additionals_count_list.png){ width=100% } After using count on a resource it becomes an array of resources rather than one single resource. The same holds when using count on modules. When adding count to a module it turns it into an array of modules. This can run into problems because the way Terraform identifies each resource within the array is by its index. Now, when removing an item from the middle of the array, all items after it shift one index back. This will result in Terraform deleting every resource after that item and then re-creating these resources again from scratch. So after running `terraform plan` with just three names, Terraform’s internal representation will look like this: + +\footnotesize ```bash variable "user_names" { description = "Create IAM users with these names" @@ -51,23 +56,27 @@ resource "aws_iam_user" "example" { name = var.user_names[count.index] } ``` +\normalsize -![](images/06-Terraform/additionals_count_index_deletion.png) +![](images/06-Terraform/additionals_count_index_deletion.png){ width=100% } **Count as conditional** Count can also be used as a form of a conditional if statement. This is possible as Terraform supports *conditional expressions*.
If `count` is set to 1, one copy of that resource is created; if set to 0, the resource is not created at all. Writing this as a conditional expression could look something like the following, where `var.enable_autoscaling` is a boolean variable set to either `true` or `false`. +\footnotesize ```bash resource "example-1" "example" { count = var.enable_autoscaling ? 1 : 0 name = var.user_names[count.index] } ``` +\normalsize ### for-each The *for_each* expression allows looping over sets and maps (lists after conversion with `toset`) to create multiple copies of a resource, just like the *count* meta-parameter. The main difference between them is that *count* expects a non-negative number, whereas *for_each* only accepts a map or a set of strings. Using the same example as above it would look like this: +\footnotesize ```bash variable "user_names" { description = "Create IAM users with these names" @@ -84,11 +93,13 @@ output "all_users" { value = aws_iam_user.example } ``` +\normalsize -![](images/06-Terraform/additionals_for_each_on_list.png) +![](images/06-Terraform/additionals_for_each_on_list.png){ width=100% } Using a map of resources with *for_each* rather than an array of resources as with *count* has the benefit that items can be removed from the middle of the collection safely, without re-creating the resources following the deleted item. Of course, the same can also be done for modules. +\footnotesize ```bash module "users" { source = "./iam-user" @@ -97,12 +108,13 @@ module "users" { user_name = each.value } ``` - +\normalsize ### for Terraform also offers functionality similar to Python list comprehensions in the form of a for expression. This should not be confused with the *for-each* expression seen above.
The basic syntax is shown below, converting the list of names from the previous examples in `var.names` to uppercase: +\footnotesize ```bash output "upper_names" { value = [for name in var.names : upper(name)] @@ -112,18 +124,21 @@ output "short_upper_names" { value = [for name in var.names : upper(name) if length(name) < 5] } ``` +\normalsize -![](images/06-Terraform/additionals_for_each_on_list.png) +![](images/06-Terraform/additionals_for_each_on_list.png){ width=100% } A for loop over lists and maps can be used similarly within a string. This allows us to use control statements directly within strings using a syntax similar to string interpolation. +\footnotesize ```bash output "for_directive" { value = "%{ for name in var.names }${name}, %{ endfor }" } ``` +\normalsize -![](images/06-Terraform/additionals_for_each_on_string.png) +![](images/06-Terraform/additionals_for_each_on_string.png){ width=100% } ### Workspaces diff --git a/manuscript/07-ML-Project_Design.Rmd b/manuscript/07-ML-Project_Design.Rmd index a949592..eafbc19 100644 --- a/manuscript/07-ML-Project_Design.Rmd +++ b/manuscript/07-ML-Project_Design.Rmd @@ -10,7 +10,7 @@ Airflow and MLflow are very flexible with their running environment and their st The infrastructure will be maintained using the Infrastructure as Code tool *Terraform*, and incorporate best Ops practices such as CI/CD and automation. The project will also incorporate the work done by data and machine learning scientists since basic machine learning models will be implemented and run on the platform. -![](images/01-Introduction/airflow-on-eks-basic.drawio.svg) +![](images/01-Introduction/airflow-on-eks-basic.drawio.svg){ width=100% } The following chapters give an introductory tutorial on each of the previously introduced tools.
A machine learning workflow using Airflow is set up on the deployed infrastructure, including data preprocessing, model training, and model deployment, as well as tracking the experiment and deploying the model into production using MLflow. diff --git a/manuscript/08-Deployment-Infrastructure_Overview.Rmd b/manuscript/08-Deployment-Infrastructure_Overview.Rmd index a1d9159..02b67e5 100644 --- a/manuscript/08-Deployment-Infrastructure_Overview.Rmd +++ b/manuscript/08-Deployment-Infrastructure_Overview.Rmd @@ -42,7 +42,6 @@ The *root* directory of the Terraform project contains the general configuration * The `outputs.tf` defining the output variables that expose relevant information about the deployed infrastructure. * The `providers.tf` defining and configuring the providers used in the project, for example, AWS, Kubernetes, Helm. - ### Infrastructure {.unlisted .unnumbered} The *infrastructure* directory holds the individual modules responsible for provisioning specific components of the AWS Cloud and EKS setup. @@ -52,7 +51,6 @@ The *infrastructure* directory holds the individual modules responsible for prov * `networking` contains networking components that provide access to the cluster using ingresses and DNS records, for example the AWS Application Load Balancer or an External DNS. * The `rds` module provides resources to deploy an Amazon Relational Database Service (RDS), such as database instances, subnets, and security groups. This module is needed for the specific tools and components of our ML platform. - ### Modules {.unlisted .unnumbered} The *modules* directory contains Terraform modules that are specific for setting up our ML Platform and provides the components to integrate the MLOps Framework, such as tools for model tracking (MLflow), workflow management (Airflow), or an integrated development environment (JupyterHub).
diff --git a/manuscript/08.1-Deployment_Infrastructure_Root.Rmd b/manuscript/08.1-Deployment_Infrastructure_Root.Rmd index 08e4725..c747278 100644 --- a/manuscript/08.1-Deployment_Infrastructure_Root.Rmd +++ b/manuscript/08.1-Deployment_Infrastructure_Root.Rmd @@ -12,6 +12,7 @@ Secondly, as Airflow needs access to and RDS Database, the RDS module is called. Third, variable values for the Airflow Helm chart are passed to the module. Using Helm makes the deployment of Airflow very easy. Since the deployment is customized, for example with a connection to the Airflow DAG repository on GitHub, it is necessary to specify this information beforehand and to integrate it into the deployment. +\footnotesize ```javascript locals { cluster_name = "${var.name_prefix}-eks" @@ -128,6 +129,5 @@ module "jupyterhub" { module.eks ] } - - -``` \ No newline at end of file +``` +\normalsize \ No newline at end of file diff --git a/manuscript/08.2-Deployment-Infrastructure_Essentials.Rmd b/manuscript/08.2-Deployment-Infrastructure_Essentials.Rmd index 4867a60..94c3f88 100644 --- a/manuscript/08.2-Deployment-Infrastructure_Essentials.Rmd +++ b/manuscript/08.2-Deployment-Infrastructure_Essentials.Rmd @@ -12,6 +12,7 @@ The VPC subnets are tagged with specific metadata relevant to Kubernetes cluster Additionally, three security groups are defined to manage access to worker nodes. They are intended to provide secure management access to the worker nodes within the EKS cluster. Two of these security groups, `"worker_group_mgmt_one"` and `"worker_group_mgmt_two"`, allow SSH access from specific CIDR blocks.
The third security group, `"all_worker_mgmt"`, allows SSH access from multiple CIDR blocks, including `"10.0.0.0/8"`, `"172.16.0.0/12"`, and `"192.168.0.0/16"`. +\footnotesize ```javascript locals { cluster_name = var.cluster_name @@ -93,6 +94,7 @@ resource "aws_security_group" "all_worker_mgmt" { } } ``` +\normalsize ### Elastic Kubernetes Service @@ -107,6 +109,7 @@ The `eks` module also deploys several Kubernetes add-ons, including `coredns`, ` - `aws-ebs-csi-driver` (Container Storage Interface) is an add-on that enables Kubernetes pods to use Amazon Elastic Block Store (EBS) volumes for persistent storage, allowing data to be retained across pod restarts and ensuring data durability for stateful applications. The EBS configuration and deployment are described in the following subsection, but the respective `service_account_role_arn` is linked to the EKS cluster on creation. - `vpc-cni` (Container Network Interface) is essential for AWS EKS clusters, as it enables networking for pods using AWS VPC (Virtual Private Cloud) networking. It ensures that each pod gets an IP address from the VPC subnet and can communicate securely with other AWS resources within the VPC. +\footnotesize ```javascript locals { cluster_name = var.cluster_name @@ -281,6 +284,7 @@ module "vpc_cni_irsa" { } } ``` +\normalsize #### Elastic Block Store @@ -288,6 +292,7 @@ The EBS CSI controller (Elastic Block Store Container Storage Interface) is set The code also configures the default Kubernetes StorageClass named `"gp2"` and annotates it so that it is no longer the default storage class for the cluster, managing how storage volumes are provisioned and utilized in the cluster. Ensuring that the `"gp2"` StorageClass does not become the default storage class is needed because we additionally create an EFS storage (Elastic File System), which is described in the next subsection.
+\footnotesize ```javascript # # EBS CSI controller @@ -343,6 +348,7 @@ resource "kubernetes_annotations" "ebs-no-default-storageclass" { } } ``` +\normalsize #### Elastic File System @@ -352,6 +358,7 @@ For security, the code creates an AWS security group named `"allow_nfs"` that al Finally, the code defines a Kubernetes StorageClass named `"efs"` using the `"kubernetes_storage_class_v1"` resource. The StorageClass specifies the EFS CSI driver as the storage provisioner and the EFS file system created earlier as the backing storage. Additionally, the `"efs"` StorageClass is marked as the default storage class for the cluster using an annotation. This allows dynamic provisioning of EFS-backed persistent volumes for Kubernetes pods by default, simplifying the process of handling storage in the EKS cluster. This is done, for example, for the Airflow deployment in a later step. +\footnotesize ```javascript # # EFS @@ -450,6 +457,7 @@ resource "kubernetes_storage_class_v1" "efs" { ] } ``` +\normalsize #### Cluster Autoscaler @@ -459,6 +467,7 @@ The necessary IAM settings are set up prior to deploying the Autoscaler. First, The EKS Cluster Autoscaler itself is instantiated using the custom `"eks_autoscaler"` module at the bottom of the code snippet. The module is called to set up the Autoscaler for the EKS cluster and the required input variables are provided accordingly. Its components are described in detail below. +\footnotesize ```javascript # # EKS Cluster autoscaler @@ -528,6 +537,7 @@ module "eks_autoscaler" { autoscaler_service_account_name = local.autoscaler_service_account_name } ``` +\normalsize The configuration of the Cluster Autoscaler begins with the creation of a Helm release named `"cluster-autoscaler"` using the `"helm_release"` resource. The Helm chart is sourced from the `"kubernetes.github.io/autoscaler"` repository with the chart version `"9.10.7"`.
The settings inside the Helm release include the AWS region, RBAC (Role-Based Access Control) settings for the service account, cluster auto-discovery settings, and the creation of the service account with the required permissions. @@ -535,6 +545,7 @@ The necessary resources for the settings are created accordingly in the followin An IAM policy named `"cluster_autoscaler"` is created to permit the Cluster Autoscaler to interact with Auto Scaling Groups, EC2 instances, launch configurations, and tags. The policy includes two statements: `"clusterAutoscalerAll"` and `"clusterAutoscalerOwn"`. The first statement grants read access to Auto Scaling Group-related resources, while the second statement allows the Cluster Autoscaler to modify the desired capacity of the Auto Scaling Groups and terminate instances. The policy also includes conditions to ensure that the Cluster Autoscaler can only modify resources with specific tags. The conditions check that the Auto Scaling Group has a tag `"k8s.io/cluster-autoscaler/enabled"` set to `"true"` and a tag `"k8s.io/cluster-autoscaler/"` set to `"owned"`. Recall that we set these tags when setting up the managed node groups for the EKS Cluster in the previous step. +\footnotesize ```javascript resource "helm_release" "cluster-autoscaler" { name = "cluster-autoscaler" @@ -628,11 +639,13 @@ data "aws_iam_policy_document" "cluster_autoscaler" { } } ``` +\normalsize ### Networking The `networking` module of the infrastructure directory integrates an *Application Load Balancer* (ALB) and *External DNS* in the cluster. Both play crucial roles in managing and exposing Kubernetes applications within the EKS cluster to the outside world. The ALB serves as an Ingress Controller to route external traffic to Kubernetes services, while External DNS automates the management of DNS records, making it easier to access services using user-friendly domain names.
The root module of `networking` just calls both submodules, which are described in detail in the following sections. +\footnotesize ```javascript module "external-dns" { ... @@ -642,6 +655,7 @@ module "application-load-balancer" { ... } ``` +\normalsize #### AWS Application Load Balancer (ALB) @@ -653,6 +667,7 @@ Since its policy document is quite extensive, it is loaded from a file named `"A After setting up the IAM role, the code proceeds to install the AWS Load Balancer Controller using Helm. The Helm chart is sourced from the `"aws.github.io/eks-charts"` repository, specifying version `"v2.4.2"`. The service account configuration is provided to the Helm release's values, including the name of the service account and annotations to associate it with the IAM role created earlier. The `"eks.amazonaws.com/role-arn"` annotation points to the ARN of the IAM role associated with the service account, allowing the controller to assume that role and operate with the appropriate permissions. +\footnotesize ```javascript locals { aws_load_balancer_controller_service_account_role_name = "aws-load-balancer-controller-role" @@ -702,6 +717,7 @@ resource "helm_release" "aws-load-balancer-controller" { })] } ``` +\normalsize #### External DNS @@ -711,6 +727,7 @@ The code is structured similar to the ALB and defines local variables first, fol Finally, Helm is used to deploy the External DNS controller as a Kubernetes resource. The Helm release configuration includes specifying the previously created service account, the IAM `role-arn` associated with it, the `aws.region` where the Route 53 hosted zone exists, and a `domainFilter` that restricts the controller to a specific domain provided by us.
+\footnotesize ```javascript locals { external_dns_service_account_role_name = "external-dns-role" @@ -790,6 +807,7 @@ resource "helm_release" "external_dns" { })] } ``` +\normalsize ### Relational Database Service @@ -799,6 +817,7 @@ The resource `aws_db_subnet_group` creates an RDS subnet group with the name `"v The `rds` module is not strictly needed to run a Kubernetes cluster. It is merely an extension of the cluster and is needed to store relevant data of the tools used, such as Airflow or MLflow. The module is thus called directly from the Airflow and MLflow modules themselves. +\footnotesize ```javascript locals { rds_name = var.rds_name @@ -846,4 +865,5 @@ resource "aws_security_group" "rds_sg" { cidr_blocks = ["0.0.0.0/0"] } } -``` \ No newline at end of file +``` +\normalsize \ No newline at end of file diff --git a/manuscript/08.3-Deployment-Infrastructure_Modules.Rmd b/manuscript/08.3-Deployment-Infrastructure_Modules.Rmd index 27d9201..fe70139 100644 --- a/manuscript/08.3-Deployment-Infrastructure_Modules.Rmd +++ b/manuscript/08.3-Deployment-Infrastructure_Modules.Rmd @@ -11,6 +11,7 @@ Airflow itself is deployed in the Terraform code via a Helm chart. The provided The code starts by declaring several local variables that store the names of Kubernetes secrets and S3 buckets for data storage and logging. Next, it creates a Kubernetes namespace for Airflow to isolate the deployment. +\footnotesize ```javascript locals { k8s_airflow_db_secret_name = "${var.name_prefix}-${var.namespace}-db-auth" @@ -53,11 +54,14 @@ module "s3-data-storage" { s3_data_bucket_secret_name = local.s3_data_bucket_secret_name } ``` +\normalsize + Afterward, two custom modules, `"s3-remote-logging"` and `"s3-data-storage"`, set up S3 buckets for remote logging and data storage. Both modules handle creating the S3 buckets and the necessary IAM roles for accessing them.
The Terraform code of both modules is not shown here, but it is available on [GitHub](https://github.com/seblum/mlops-airflow-on-eks). The main difference between the modules lies in the assume-role policies that are needed for the different use cases of storing and reading data, or logging to S3. While the `"s3_log_bucket_role"` allows a Federated entity, specified by an OIDC provider ARN, to assume the role using `"sts:AssumeRoleWithWebIdentity"`, the `"s3_data_bucket_role"` allows both a specific IAM user (constructed from the user's ARN) and the Amazon S3 service itself to assume the role using `"sts:AssumeRole"`. **s3-data-storage role policy** +\footnotesize ```javascript # s3-data-storage role policy resource "aws_iam_role" "s3_data_bucket_role" { @@ -87,9 +91,11 @@ resource "aws_iam_role" "s3_data_bucket_role" { EOF } ``` +\normalsize **s3-remote-logging role policy** +\footnotesize ```javascript # s3-remote-logging role policy resource "aws_iam_role" "s3_log_bucket_role" { @@ -114,12 +120,14 @@ resource "aws_iam_role" "s3_log_bucket_role" { EOF } ``` +\normalsize After the S3 buckets are set up, the code proceeds to create Kubernetes secrets to store various credentials required for Airflow's operation. These include credentials for the PostgreSQL database, GitHub authentication secrets for accessing private repositories, and secrets for accessing GitHub organizations, which are required to authenticate users. The `"rds-airflow"` module is used to create the RDS instance for Airflow, which will serve as the external database for the deployment. The Apache Airflow deployment is defined using a `helm_release` of the *Airflow Community Helm Chart* and is highly customized to cater to our specific needs. The release includes various configurations for Airflow, such as custom environment variables, extra environment variables sourced from the GitHub organization secret, and the `KubernetesExecutor` as the Airflow executor.
The deployment enables DAG synchronization with a dedicated GitHub repository that contains our Airflow DAGs (see chapter 9). It also configures a persistent volume using Amazon EFS for Airflow logs and a Kubernetes Ingress resource to expose the Airflow web interface using an Application Load Balancer (ALB). Additionally, readiness and liveness probes are configured for the web server. +\footnotesize ```javascript # # Helm Release Airflow @@ -215,10 +223,6 @@ resource "helm_release" "airflow" { AIRFLOW__WEBSERVER__BASE_URL = "http://${var.domain_name}/${var.domain_suffix}" AIRFLOW__CORE__LOAD_EXAMPLES = false - # AIRFLOW__LOGGING__LOGGING_LEVEL = "DEBUG" - # AIRFLOW__LOGGING__REMOTE_LOGGING = true - # AIRFLOW__LOGGING__REMOTE_BASE_LOG_FOLDER = "s3://${module.s3-data-storage.s3_log_bucket_name}/airflow/logs" - # AIRFLOW__LOGGING__REMOTE_LOG_CONN_ID = "aws_logs_storage_access" AIRFLOW__CORE__DEFAULT_TIMEZONE = "Europe/Amsterdam" }, users = [] @@ -349,11 +353,13 @@ resource "helm_release" "airflow" { })] } ``` +\normalsize In a final step of the Helm Chart, a custom `WebServerConfig.py` is specified, which integrates our Airflow deployment with a GitHub authentication provider. The Python script consists of two major parts: a custom `AirflowSecurityManager` class definition and the actual `webserver_config` configuration file for Apache Airflow's web server. The custom `CustomSecurityManager` class extends the default `AirflowSecurityManager` to retrieve user information from the GitHub OAuth provider. The `webserver_config` configuration sets up the web server component of Apache Airflow by indicating that OAuth will be used for user authentication. The `SECURITY_MANAGER_CLASS` is set to the previously defined `CustomSecurityManager` to customize how user information is retrieved from the OAuth provider. Finally, the GitHub provider is configured with its required parameters like `client_id`, `client_secret`, and API endpoints.
+\footnotesize ```python ####################################### # Custom AirflowSecurityManager @@ -409,7 +415,6 @@ class CustomSecurityManager(AirflowSecurityManager): else: return {} - ####################################### # Actual `webserver_config.py` ####################################### @@ -457,6 +462,7 @@ AUTH_ROLES_SYNC_AT_LOGIN = True # force users to re-auth after 30min of inactivity (to keep roles in sync) PERMANENT_SESSION_LIFETIME = 1800 ``` +\normalsize
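To make the role of the security manager more concrete, the following is a minimal, self-contained sketch of the kind of mapping such a class performs: it turns a GitHub user payload and a list of team memberships into a user-info dictionary. The helper `map_github_user`, the sample payload, and the team names are hypothetical and are not taken from the deployment above. The dictionary keys (`username`, `email`, `first_name`, `last_name`, `role_keys`) are assumed to follow Flask AppBuilder's OAuth conventions, so they should be checked against the installed version.

\footnotesize
```python
from typing import Any, Dict, List


def map_github_user(me: Dict[str, Any], team_slugs: List[str]) -> Dict[str, Any]:
    """Map a GitHub user payload to a user-info dict (hypothetical helper)."""
    # GitHub's "name" field may be missing or null; fall back to the login.
    full_name = me.get("name") or me.get("login", "")
    # Split only on the first space: everything after it counts as last name.
    first_name, _, last_name = full_name.partition(" ")
    return {
        "username": "github_" + me.get("login", ""),
        "email": me.get("email"),
        "first_name": first_name,
        "last_name": last_name,
        # Team slugs are assumed to be the keys used for role mapping.
        "role_keys": sorted(team_slugs),
    }


# Hypothetical sample data, not from a real account.
sample_user = {"login": "octocat", "name": "Mona Lisa Octocat", "email": None}
user_info = map_github_user(sample_user, ["ml-engineers", "airflow-admins"])
print(user_info["username"], user_info["role_keys"])
```
\normalsize

Keeping this mapping in a plain function, separate from the security manager class, is one possible design choice: it can be unit-tested without a running Airflow web server or a real OAuth round trip.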