An end to end analytics platform that models and visualizes customer churn. The project blends together concepts from data engineering, analytics engineering, and MLOps.
Analytics Dashboard
ML Monitoring Dashboard
This project was developed to:
- Employ various elements of the modern data stack
- Be applied to a somewhat realistic business use case
- Serve as a template for others to learn from and use via extensive documentation
- Consider CI/CD setup with ECR and Jenkins.
- Integrate Prometheus to log operational metrics like model inference latency.
- Add DVC as a more comprehensive option compared to MLFlow's built-in solution.
- Look into including data drift and feature explainability as additional metrics. Explainabilty can be used to observe feature importance changes over time.
- Implement unit and integration tests for Airflow functions.
- Develop SCD2 functionality for dimension tables.
- Evaluate feasibility of incremental refresh (models) via dbt to prevent overhead.
Model retraining pipelines with Experiment Tracking using MLflowAutomated model deployment using MLflow Model RegistryMonitoring using Evidently and Grafana
For this project, the Telco Customer Churn data module which is a sample dataset on IBM's Cognos Analytics platform was selected because it seemed like the best representative considering the difficulty in finding a decent dataset for the use case.
The dataset is then used to train two models. The first is a Gaussian Copula Synthesizer to produce synthetic data with characteristics similar to the original. This is done since there is not much data to go around and serves as a rudimentary imitation of data entering the database. The second is the product of using FLAML's AutoML implementation on the data, and its purpose is to predict churn status for a particular user. These experiments are tracked using MLFlow for better observability and minIO is employed as an object store for the various artefacts involved.
The synthesizer model is hosted via FastAPI, while the classifier is deployed with MLFlow model registry, which dynamically exposes the latest "Champion" model for inference. Airflow is then used to orchestrate the pulling of data from the Synthesizer, obtaining churn status prediction for said data from the classification model, generating a ULID for each customer, and writing it all to a Postgres database. It then triggers dbt afterward to run tests and apply necessary transformations. Airflow is used to record model metrics at a regular frequency and also to retrain the classifier. The data is modeled after the star schema and is finally visualized as a dashboard using Metabase. The various model metrics are stored and is monitored with Grafana to proactively deal with any problems that may arise.
Almost all of the services above run in their own docker containers, as seen in the diagram. These containers are running on a GCP VM, provisioned via Terraform. Finally, GitHub Actions facilitates CI/CD, as the code is quickly checked for any issues using Ruff and changes made to this repo are reflected in the VM.
- Keep in mind that predicting customer churn is a rather sophisticated use case and is difficult to get right. AutoML was only employed to obtain a viable baseline model, and this is how it should be applied most of the time. Check out Fighting Churn with Data for a more comprehensive outlook on modelling customer churn.
- The original dataset has been transformed and can be found here. This is the dataset the models have been trained on.
- As mentioned earlier, the various elements which comprise this project have tried to be documented. Most files should have accompanying comments to facilitate understanding, and more high level system design justifications can be found in the README below.
If you want to run the pipeline on your own, here are some considerations:
- The Docker Compose file is the crux of this project, so that is a good starting point.
- If you are running into an error with regards to
pip installduring the Docker build phase, consider changing the mtu size in the Docker Daemon. - Rename sample.env to .env
- Double check the ports for the various docker services if project is run locally. Ensure it doesn't conflict with ports used by other services on the system.
- Add path to credentials file and locally generated SSH files in the Terraform main file (main.tf).
- Replace variables in variables.tf to your liking.
- If CD implementation is going to be used, make changes to the output filepath (if necessary), and add necessary GitHub secrets.
- Add Postgres connection information to Airflow with name pg_conn.
- You can use the backup under the
assetsfolder to recreate the Grafana dashboard. - Beware of port forwarding if Terraform is being used to create the VM that the project runs on. If the instance external IP address is known, anyone can gain access to the running Docker containers.
- Used to quickly obtain a viable baseline
- Selected Microsoft's FLAML because it had more streamlined access to core AutoML use cases.
- Total training time facilitates a threshold for training, and can be modified.
- Used predict_proba as a crude proxy for confidence. It can benefit from calibration for a more reliable measure.
- Went for this design to obtain something akin to realtime data
- Ended up with a gaussian copula model from the SDVault library
- Initially wanted to go with the popular CTGAN model from this paper, but the final model was an order of magnitude bigger and performed about the same as the GCop Model
- Size and speed of inference is important since the model going into production
- Constraints are added such that location data is generated consistently (i.e. you don't get random city associated with random zip code)
- Stands for Universally Unique Lexicographically Sortable IDentifier, used as primary key to uniquely identify each customer.
- An implementation of UUID (Universally Unique IDentifier), thus inheriting its advantages over a standard autoincrementing integer based primary key, namely robust uniqueness and security.
- In addition, since ULID is generated client side (as opposed to autoincrement), it prevents the need for going back to the primary source table to get the ID number to write to other tables.
- ULID also overcomes one of the major drawbacks associated with using UUIDs, which is the lack of sortability. This is of course until newer versions of UUID (like v6) that solve this problem are standardized.
- Generated using the python-ulid library, and converted to UUID format to be inserted into Postgres.
- One drawback is how the relatively large size of ULID (and UUIDs in general) affects performance, but that remains to be seen.
- Went with the tried and tested star schema (Kimball approach) for this project
- Briefly considered the recent one big table implementation as that would have the consequence of reducing data transformations.
- However, we don't have close to the amount of data necessary to see any real performance gains and it seems like falling into the trap of premature optimization.
- Plus, this approach better facilitates data integrity and ease of understanding.
- Fast lint check using Ruff
- Uses a GitHub actions module (appleboy) that SSHs into the VM and pull --rebase(s) this GitHub repo.
- Triggered on push to repo.
- Testing applied just for data (for now) at beginning of transformation layer via dbt.
- Only applied at source since the subsequent transformations don't affect the conditions checked by test.
- Plus, having redundant tests just increases performance overhead.
- MLflow is used to track experiments and log relevant metrics and artifacts. Since FLAML's AutoML is used to train the classifier, the best model is then entered into the model registry.
- Deployment is automated via loading the "Champion" model from the registry for inference.
- Model retraining is triggered at period intervals, but can also happen when performance (model metrics, or data dsitribution changes) drops beyond a set threshold.
- Monitoring of such drops in performance is facilitated by Grafana, which can also be set up to alert you when such situations occur.
- Defines services for Postgres, pgAdmin, Airflow, FastAPI server, and Metabase.
- Persistent volumes mapped to services for backup purposes.
- Use env file for environment variables available throughout.
- Data synthesis and prediction models handled via one server.
- This is done to reduce overhead in setting up and maintaining multiple servers for services.
- Serves as an example of both serving a model for inference, and pulling from a model for synthetic data.
- Schema and tables created on startup.
- pgAdmin included for GUI, can also consider DBeaver.
- ULID for primary key handled by native UUID datatype.
- Also hosts Metabase metadata.


- Tasks are parallelized for efficiency.
- Use jinja scripting to pull XCom data into Postgres database.

- dbt Core used in Airflow container,
dbt buildtriggered viaBashOperator. - This implementation can be dangerous if there is the potential for resource or environmental clashes between the two services, so be wary. It would be wise to utilize tools like
pyenv,poetry, or the more modernuvby Astral to sidestep this particular problem. - If you have the Kubernetes infrastructure handy, you can also do the same with
KubernetesPodOperatorfor containerization and scalability. - Difficult to integrate dbt Cloud with Airflow unless using something like Astronomer's Astro CLI.
- At a production level, the de facto method of using dbt is via:
- dockerizing the project
- hosting it on an image registry
- Airflow triggers job from most recent image
- Every github push updates the image in the registry
- The aforementioned methods were not pursued in the interest of keeping the project (relatively) straightforward, but they can be considered.
- Acts as visualization layer
- Selected because it is quick to set up and put together a decent result.
- Dashboards developed can be shared publicly, and is therefore highly accessible.
- However, the link does expose the system's external IP, so that is something to be kept in mind.
- Spin up and tear down VMs quickly and reliably for rapid experimentation.
- Use output values for convenience and SSH capabilities.
- VM instance type determined via trial and error.
- Once again, be wary of port forwarding and general firewall settings.

- The MLflow server is hosted on the same container as airflow. This was to reduce the additional overhead caused by the need for inter-container communication if the need arose.
- You can read more about it here. Your options are exposing the host socket (very dangerous), setting up SSH between the instances (cumbersome), or using a Flask extension and wrapping bash access around a REST API (long term support is questionable).
- The backend store is handled by the same postgres instance, while minIO handles object store.
- A custom FastAPI wrapper deals with deployment for flexibility with predict_proba, but one can use MLflow's native support for serving models via Flask, Seldon's ML Server, Kubernetes, or other cloud platforms.


