This project is a proof-of-concept application for predicting future (day-ahead) stock market closing prices. It can help investors make informed decisions and identify opportunities by providing analysis of market trends.
Main components used here are the Hopsworks feature store and the yfinance package. yfinance serves as the initial data source, while Hopsworks stores data, both for training and for inference, and also allows for data versioning.
In this stage, data is extracted, transformed, validated, and loaded in the feature store.
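The extract/transform/validate/load flow can be sketched as below. The pure helpers are illustrative toy versions of the transform and validation steps; the feature group name `stock_prices`, its schema, and the column choices are assumptions, not the project's actual configuration.

```python
# Sketch of the ingestion pipeline: extract -> transform -> validate -> load.
# validate_closes / to_features are toy stand-ins for the real steps.

def validate_closes(closes):
    """Basic validation: the series must be non-empty with strictly positive prices."""
    return len(closes) > 0 and all(c is not None and c > 0 for c in closes)

def to_features(closes):
    """Toy transform: attach the day-over-day return to each closing price."""
    if not closes:
        return []
    rows = [{"close": closes[0], "return": 0.0}]
    for prev, cur in zip(closes, closes[1:]):
        rows.append({"close": cur, "return": (cur - prev) / prev})
    return rows

if __name__ == "__main__":
    import yfinance as yf
    import hopsworks

    raw = yf.download("AAPL", start="2024-01-01", end="2024-02-01")
    closes = raw["Close"].squeeze().tolist()
    assert validate_closes(closes), "validation failed"

    project = hopsworks.login()  # reads the Hopsworks API key from the environment
    fs = project.get_feature_store()
    fg = fs.get_or_create_feature_group(
        name="stock_prices", version=1,  # hypothetical name/version
        primary_key=["date"], event_time="date",
    )
    # fg.insert(transformed_dataframe) would load the data into the feature store.
```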
Main components used here are MLflow, PyTorch, Lightning, and scikit-learn. PyTorch and Lightning are used to define and train a deep learning model. scikit-learn is used for some feature engineering (namely scaling). MLflow is used to log the experiments, allowing metric comparison between runs, and to store critical artifacts such as the model file itself, as well as important pieces of code that allow it to run.
In this stage, we load data from the feature store, perform feature engineering, and train and log a deep learning model.
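The feature-engineering step can be illustrated with a small sketch: min-max scaling (what scikit-learn's MinMaxScaler does for a single feature) and sliding windows that pair an input window with the next-day target. The window size and experiment name are illustrative; the Lightning model and Trainer are only outlined in comments.

```python
# Sketch of the feature-engineering step used before training.

def minmax_scale(values):
    """Scale values to [0, 1], as MinMaxScaler would for one feature."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero for a constant series
    return [(v - lo) / span for v in values]

def make_windows(series, window):
    """Turn a price series into (input window, next-day target) pairs."""
    X = [series[i:i + window] for i in range(len(series) - window)]
    y = [series[i + window] for i in range(len(series) - window)]
    return X, y

if __name__ == "__main__":
    import mlflow
    # import lightning as L  # the LightningModule and Trainer live here

    series = minmax_scale([101.0, 103.0, 102.5, 105.0, 104.0, 106.5])
    X, y = make_windows(series, window=3)

    mlflow.set_experiment("market-price-predictor")  # hypothetical name
    with mlflow.start_run():
        mlflow.log_param("window", 3)
        # trainer.fit(model, dataloader) and mlflow.pytorch.log_model(...) go here.
```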
In this stage, a trained model is pulled from the MLflow model registry, data is loaded from the feature store, and predictions are made and stored in an S3 bucket (using the Boto3 package).
In this stage, past predictions are loaded from S3 and compared to data that is loaded from the feature store. The absolute error is calculated and stored in an S3 bucket, similarly to the inference stage.
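The comparison step amounts to joining past predictions with actuals by date and taking the absolute difference. A minimal sketch, in which the bucket name and object key are hypothetical:

```python
# Sketch of the monitoring step: compute per-date absolute error, then
# persist the result to S3 (same pattern as the inference stage).
import json

def absolute_errors(predictions, actuals):
    """Per-date absolute error for dates present in both mappings."""
    return {d: abs(predictions[d] - actuals[d]) for d in predictions if d in actuals}

if __name__ == "__main__":
    import boto3

    preds = {"2024-01-02": 185.3}    # loaded from S3 in the real pipeline
    actuals = {"2024-01-02": 186.1}  # loaded from the feature store
    errors = absolute_errors(preds, actuals)

    s3 = boto3.client("s3")
    s3.put_object(
        Bucket="my-metrics-bucket",              # hypothetical bucket name
        Key="metrics/AAPL/2024-01-02.json",      # hypothetical key layout
        Body=json.dumps(errors),
    )
```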
Main component used here is Airflow. It is used to create a DAG with multiple tasks and specific task dependencies, and to run it on a schedule.
In this stage, we define the DAG and its tasks, which include data ingestion, model inference, and monitoring. The DAG runs daily.
Main component used here is FastAPI. It is used to create a fast and scalable API service with multiple endpoints.
In this stage, we define endpoints to obtain predictions and the metrics obtained during the monitoring stage, which are pulled from the S3 storage.
Main component used here is Streamlit. It is used to create a simple data dashboard. Calls are made to the API to obtain both predictions and metrics on demand, which are then displayed.
Pre-requisites: Docker, Poetry, Python
In each folder, there will be a .env.default file that represents a template for the necessary environment variables. In each folder, run:

```shell
cp .env.default .env
```

and fill in the variables as you obtain them by following the rest of the guide.
We will need to deploy some packages to be used as dependencies by Airflow later. To this end, we can set up a private PyPI server.
Create credentials using passlib:

```shell
# Install dependencies.
sudo apt install -y apache2-utils
pip install passlib

# Create the credentials under the market-price-predictor name.
mkdir ~/.htpasswd
htpasswd -sc ~/.htpasswd/htpasswd.txt market-price-predictor
```

Set Poetry to use the credentials:
```shell
poetry config repositories.my-pypi http://localhost
poetry config http-basic.my-pypi market-price-predictor <password>
```

Hopsworks is used as a serverless feature store, and as such you need to create an account and a project in Hopsworks. Choose a unique project name, obtain an API key, and add these credentials to the .env files when required.
We use AWS S3 buckets to store predictions and metrics, as well as model artifacts for MLflow. As such, you need to create an AWS account. After this step, you need to do two things:
- Create 2 S3 buckets, one for model artifacts and one for model predictions and metrics. The names are up to you, but they must be unique (you can follow this guide).
- Create access keys (you can follow this guide).

Add the names of the buckets and the keys to the .env files.
Move to the orchestration directory and run the following:
```shell
# Initialize the Airflow database
docker compose up airflow-init

# Start up all services
# Note: You should set up the private PyPI server credentials before running this command.
docker compose --env-file .env up --build -d
```

To deploy the modules to the private PyPI server, simply move to the deploy directory and run:

```shell
# Build and deploy the modules.
sh deploy_modules.sh
```

You can access the Airflow webserver at localhost:8080 and log in using airflow as both the username and password. First, configure the parameters that are necessary to run the pipelines. Go to the Admin tab, then hit Variables. Then you can click the blue + button to add a new variable.
These are the parameters you can configure with example values:
```
ml_pipeline_symbols = ["MSFT", "AAPL", "AMZN", "META", "GOOGL"]
ml_pipeline_interval = "1d"
ml_pipeline_days_delay = 14
ml_pipeline_feature_group_version = 1
ml_pipeline_feature_view_version = 1
```
Note that each symbol listed in the ml_pipeline_symbols variable must have a corresponding model that has already been trained and deployed.
Finally, go to the DAGs tab and find the ml_pipeline dag. Toggle the activation button. You can also run the DAG manually by pressing the play button from the top-right side of the window.
Go to the root directory of your project market-price-predictor and run the following docker command, which will build and run all the docker containers of the web app:
```shell
docker compose -f deploy/app_docker_compose.yaml --project-directory . up --build
```

You can access the API at localhost:8001/api/v1/docs and the Dashboard at localhost:8501.
Move to the data_ingest directory and run the following:
```shell
python pipeline.py --symbol SYMBOL --start_date START_DATE --end_date END_DATE
python feature_views.py --start_date START_DATE --end_date END_DATE
```

for the desired SYMBOL (e.g. AAPL for Apple, MSFT for Microsoft) and dates in the format YYYY-MM-DD.
Edit the configs in model-training/configs for the desired SYMBOL, date ranges, and other hyperparameters.
Move to the model_training directory and run the following:
```shell
python train.py
```

Access the MLflow UI at localhost:6500 and find the run that corresponds to the model you just trained.
Press the Register model button at the top right. Create a new model with the name format "market_price_predictor_symbol_{symbol}", replacing {symbol} with the desired SYMBOL.
Go to the Models page and add an alias @champion to the model you want to use in production.
If you retrain a model for the same symbol, when registering the model, pick the already existing model name and simply add the alias to the newly retrained model.
- Add ECS deployment guide
- Add CI/CD pipeline
- Add tests
- Add IaC (e.g. Terraform)
- Improve model performance
- Incorporate news data + sentiment analysis
- Train with higher resolution data
- Data augmentation
- Automatic hyperparameter tuning
- Automatic model retraining based on monitoring
- Consider changing to different data storage (e.g. Timescale)

