This project explores consumer purchase behavior using the Online Retail II dataset from UCI/Kaggle. It applies Frequent Itemset Mining (Apriori algorithm) and Decision Tree Classification to uncover actionable insights in retail transactions.
The objective is to extract frequent product combinations using association rules (Apriori) and compare them with rules derived from Decision Tree models trained to predict the purchase of a specific item. The goal is to support marketing strategies such as cross-selling, bundling, and personalized recommendations.
- trabalho2_am1.py: Python script with all data processing and mining routines
- Trabalho_2___AM1.pdf: Final academic report with methodology, results, and discussion
- README.md: Project documentation
- dataset_link.txt: Contains the public link to the dataset used
- Data filtering and preprocessing (handling cancellations, missing data, and item quantities)
- Market Basket Analysis with Apriori from mlxtend
- Association rule generation using support, confidence, and lift
- Construction of Decision Trees to predict purchase of a target item
- Extraction of interpretable rules from tree leaves
- Comparative analysis between both techniques
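
The preprocessing step listed above can be pictured with a small pandas sketch. This is an illustration under stated assumptions, not the project's actual script: the column names (`Invoice`, `Description`, `Quantity`, `Country`), the convention that cancelled invoices start with `C`, and the toy rows are all assumed here and should be checked against the downloaded file.

```python
import pandas as pd

# Hypothetical rows mimicking the dataset layout (column names are assumptions).
raw = pd.DataFrame({
    "Invoice": ["1001", "1001", "1002", "C1003", "1004", "1005"],
    "Description": ["TEA PLATE", "TEACUP", "TEACUP", "TEACUP", None, "TEA PLATE"],
    "Quantity": [6, 6, 2, -2, 1, 0],
    "Country": ["United Kingdom"] * 5 + ["France"],
})

clean = raw[
    ~raw["Invoice"].astype(str).str.startswith("C")  # drop cancellations (assumed "C" prefix)
    & raw["Description"].notna()                      # drop rows with a missing item name
    & (raw["Quantity"] > 0)                           # keep positive quantities only
    & (raw["Country"] == "United Kingdom")            # restrict to one country
]

# One row per invoice, one boolean column per item: the usual input shape
# for frequent itemset mining and for a classifier on basket contents.
basket = pd.crosstab(clean["Invoice"], clean["Description"]).astype(bool)
print(basket)
```

The boolean invoice-by-item matrix is a common starting point for both techniques, since each column can serve either as an itemset member or as a prediction target.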
The vast majority of transactions come from the United Kingdom, followed by Germany and France.
Strong end-of-year purchasing trend, with December showing peak activity.
Best-selling products are mostly decorative items and home accessories.
To ensure a fair comparison, both Apriori and the Decision Tree were evaluated on a restricted scenario: transactions from the United Kingdom, focused on the product PINK REGENCY TEACUP AND SAUCER.
- Rule: REGENCY TEA PLATE ROSES → PINK REGENCY TEACUP AND SAUCER
  Confidence: 0.91, Support: 2.8%, Lift: 31.7
- Items: GREEN, ROSES, TEA PLATE PINK, etc.
  Confidence: 0.85, Support: 0.27%, Lift: 35.9
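
For readers unfamiliar with the three metrics quoted above, the following plain-Python sketch computes them for a single rule using the standard association-rule definitions. The baskets are made up and the helper name `rule_metrics` is hypothetical; it is not taken from the project code.

```python
def rule_metrics(baskets, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    a, c = frozenset(antecedent), frozenset(consequent)
    n = len(baskets)
    n_a = sum(1 for b in baskets if a <= b)          # baskets containing the antecedent
    n_c = sum(1 for b in baskets if c <= b)          # baskets containing the consequent
    n_ac = sum(1 for b in baskets if (a | c) <= b)   # baskets containing both
    if n == 0 or n_a == 0 or n_c == 0:
        raise ValueError("metrics are undefined for empty data or unseen itemsets")
    support = n_ac / n              # P(antecedent and consequent)
    confidence = n_ac / n_a         # P(consequent | antecedent)
    lift = confidence / (n_c / n)   # confidence relative to the consequent's base rate
    return support, confidence, lift


# Toy data (hypothetical): each basket is the set of items on one invoice.
toy_baskets = [
    frozenset({"TEA PLATE", "TEACUP"}),
    frozenset({"TEA PLATE", "TEACUP", "CAKE STAND"}),
    frozenset({"TEA PLATE"}),
    frozenset({"CAKE STAND"}),
    frozenset({"TEACUP"}),
]

support, confidence, lift = rule_metrics(toy_baskets, {"TEA PLATE"}, {"TEACUP"})
print(f"support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
# prints: support=0.40 confidence=0.67 lift=1.11
```

A lift above 1 means the two sides of the rule appear together more often than they would if they were independent, which is why lift values in the thirties indicate a very strong association.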
- Both techniques identified similar associations around coordinated product lines (e.g., “Regency”).
- Apriori yielded more general and high-support rules across all data.
- Decision Trees focused on narrower, high-confidence subgroups and showed higher lift in certain leaves.
- The two approaches are complementary: Apriori excels in discovering broad frequent patterns, while Decision Trees isolate local high-precision segments.
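
One way to put tree leaves on the same footing as association rules is to compute support, confidence, and lift per leaf. The sketch below assumes scikit-learn and a made-up one-hot basket matrix; the item names, the tree depth, and the per-leaf definition of support (share of all invoices that fall in the leaf and contain the target) are illustrative assumptions, not necessarily the choices made in the report.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical one-hot basket matrix: rows are invoices, columns are other items.
feature_names = ["TEA PLATE", "CAKE STAND", "NAPKINS"]
X = np.array([
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 1, 1],
    [0, 0, 0],
])
# Target: 1 if the invoice also contains the item of interest (made-up labels).
y = np.array([1, 1, 1, 0, 0, 0, 0, 0])

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Human-readable split conditions leading to each leaf.
print(export_text(tree, feature_names=feature_names))

leaf_ids = tree.apply(X)   # index of the leaf each invoice lands in
base_rate = y.mean()       # overall share of invoices containing the target

leaf_stats = {}
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    confidence = float(y[in_leaf].mean())                     # P(target | leaf)
    support = float((in_leaf & (y == 1)).sum() / len(y))      # P(leaf and target)
    lift = confidence / float(base_rate)                      # confidence vs. base rate
    leaf_stats[int(leaf)] = (support, confidence, lift)
    print(f"leaf {leaf}: support={support:.3f} confidence={confidence:.3f} lift={lift:.3f}")
```

Reading the path printed by `export_text` for a given leaf yields the rule's conditions, and the matching `leaf_stats` entry gives figures that can be compared directly with an Apriori rule for the same target item.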
git clone https://github.com/your-username/frequent-itemset-mining.git
cd frequent-itemset-mining
pip install pandas mlxtend scikit-learn matplotlib seaborn
jupyter notebook notebook/frequent_itemset_analysis.ipynb

The dataset used is publicly available at the link provided in dataset_link.txt.
Note: Due to licensing, the dataset is not included in this repository. Please download it manually and place it in the working directory.
- Géron, A. (2019). Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow. O'Reilly Media.
- Pedregosa et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research.
- Kaggle: Online Retail II Dataset
This project was developed as part of an academic assignment for the Applied Machine Learning course.






