This repository contains reading notes and Python implementations by Silkdust for the book "Machine Learning for Factor Investing". For better compatibility and closer accordance with the original book, the repository is written in English.
The authors of this book offer some solutions as Python notebooks; see detailed info at https://www.mlfactor.com/python.html. However, they do not offer executable files (.ipynb). For your convenience, this project offers both Jupyter notebooks (.ipynb) and a PDF version produced by nbconvert with the XeLaTeX engine.
Finally, I deeply appreciate Guillaume Coqueret and Tony Guida for their devotion to this book. Published in 2020, it offers genuinely up-to-date machine learning techniques and their applications to factor investing, along with a wide range of references for interested readers. As a summary note, this project does not include all the literature reviews in the book, and I encourage interested readers to explore them in the references section or in the original book.
Hope you have fun reading this project!
| Chapter Name | Notebook Ref | PDF Download | Status |
|---|---|---|---|
| Full Version | Here | Download | ☑Finished |
| Chapter 1 - Notations and Data | Here | Download | ☑Finished |
| Chapter 2 - Introduction | Here | Download | ☑Finished |
| Chapter 3 - Factor Investing and Asset Pricing Anomalies | Here | Download | ☑Finished |
| Chapter 4 - Data Preprocessing | Here | Download | ☑Finished |
| Chapter 5 - Penalized Regressions and Sparse Hedging for MVP | Here | Download | ☑Finished¹ |
| Chapter 6 - Tree-based Methods | Here | Download | ☑Finished |
| Chapter 7 - Neural Networks | Here | Download | ☑Finished |
| Chapter 8 - Support Vector Machines | Here | Download | ☑Finished |
| Chapter 9 - Bayesian Methods | Here | Download | ☑Finished² |
| Chapter 10 - Validating and Tuning | Here | Download | ☑Finished |
| Chapter 11 - Ensemble Models | Here | Download | ☑Finished³ |
| Chapter 12 - Portfolio Backtesting | Here | Download | ☑Finished |
| Chapter 13 - Interpretability | Here | Download | ☑Finished |
| Chapter 14 - Causality and Non-stationarity | Here | Download | ☑Finished⁴ |
| Chapter 15 - Unsupervised Learning | Here | Download | ☑Finished |
| Chapter 16 - Reinforcement Learning | Here | Download | ☑Finished |
| References | Here | Download | ☑Finished |
Updated on November 21, 2023: the notes for all 16 chapters have been finished, and an integrated version has been generated with the nbmerge tool so that readers can enjoy this project more efficiently! These Python notes are, after all, humble work. However, this project also marks a milestone as my first open-source project, and I will try my best to make more contributions.
Finally, I would like to express my deep appreciation once again to the authors of this book. Hope you all enjoy it.
Dependency packages are listed in the repository as requirements.txt and will be updated regularly. Generally speaking, you first need to install Python 3 (version 3.8 or later preferred) and Jupyter Notebook on your device. To get the notebooks working properly, run the following commands:
git clone https://github.com/Silkdust/mlfactor-python.git
cd mlfactor-python/
pip install -r ./requirements.txt
If you revise the code or use additional packages, you can regenerate requirements.txt with pipreqsnb after cloning this repo as follows:
pip install pipreqs
pip install pipreqsnb
pipreqsnb --force ./ --encoding=utf-8
For better I/O speed, the data_ml object is stored in .pkl format. However, this file takes a lot of storage space and is therefore not pushed to the repo. You may run the following Python snippet to generate it under the /data/ folder:
import pandas as pd
import pyreadr
# data = pd.read_excel("./data/data_ml.xlsx") # Not Recommended. Too Slow!
result = pyreadr.read_r('./data/data_ml.RData')  # returns a dict mapping R object names to DataFrames
data = result['data_ml']
data.to_pickle("./data/data_ml.pkl")  # cache as a pickle for faster subsequent loads
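As a quick sanity check of the pickle round trip (using a small synthetic frame with made-up column names, since data_ml itself is not shipped with the repo), pickling preserves dtypes exactly and reloads much faster than re-parsing the .xlsx or .RData sources:

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Synthetic stand-in for data_ml (the real file is not shipped with the repo)
df = pd.DataFrame(np.random.default_rng(0).random((1000, 5)),
                  columns=list("abcde"))

path = os.path.join(tempfile.mkdtemp(), "demo.pkl")
df.to_pickle(path)               # binary format, preserves dtypes exactly
restored = pd.read_pickle(path)
assert restored.equals(df)       # the round trip is lossless
```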
- Main reference: Coqueret, G., & Guida, T. (2020). Machine Learning for Factor Investing: R Version. Chapman and Hall/CRC.
- Other references: see here. You can also find them on their website here.
- This project is completely free and is released under the CC0-1.0 license. We encourage reproducibility. See here for details.
Footnotes

1. There are some minor differences between `ElasticNet` in `sklearn` and `glmnet` in R. See here for details. ↩
2. Two less familiar Python packages are used in this chapter to implement Bayesian linear regression and BART. For the first, we provide the source code of `conjugate_bayes` (which is messy) inside the notebook, so there is virtually no need to install it. For the second, please install the `BartPy` package with the following command to avoid issues #37 and #51: `pip install git+https://github.com/JakeColtman/bartpy.git@pytorch --upgrade`. ↩
3. You may find some model caches used in this chapter (or possibly, previous and future chapters) under the `/models/` folder. ↩
4. In this chapter, the causal additive model (`CAM`) and the `PC` algorithm are implemented in Python with the aid of the package `cdt`, which requires the R environment and the R packages `CAM`, `(k)pcalg` and `RCIT`. The configuration can be complex, so we strongly recommend that readers use the pre-trained models under the `/models/` folder (`graph_cam.pkl` and `graph_pc.pkl`). Interested readers can refer to this website for installation guides. ↩
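On the first footnote: one source of the discrepancy is simply parameterization. sklearn's `ElasticNet` takes `alpha` (overall penalty strength) and `l1_ratio` (share of the L1 penalty), which correspond roughly to glmnet's `lambda` and `alpha`. A minimal sketch on synthetic data, assuming only NumPy and sklearn:

```python
import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.3]) + rng.normal(scale=0.1, size=200)

# sklearn objective: (1 / (2 n)) * ||y - Xw||^2
#                    + alpha * l1_ratio * ||w||_1
#                    + 0.5 * alpha * (1 - l1_ratio) * ||w||^2
# glmnet calls the strength lambda and the L1/L2 mixing alpha, and also
# standardizes inputs by default, hence the small numerical differences.
model = ElasticNet(alpha=0.05, l1_ratio=0.5).fit(X, y)
print(np.round(model.coef_, 2))
```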