Official repository for the paper "Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation", published at the 19th ACM Conference on Recommender Systems (RecSys'25).
Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration–exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90% of various datasets, a greedy linear model -- with no type of exploration -- consistently achieves top-tier performance, often outperforming or matching its exploratory counterparts. This observation is further corroborated by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy within these evaluation settings. Our results expose significant inadequacies in offline evaluation protocols for bandits, particularly concerning their capacity to reflect true exploratory efficacy. Consequently, this research underscores the urgent necessity for developing more robust assessment methodologies, guiding future investigations into alternative evaluation frameworks for interactive learning in recommender systems.
- We conduct a large-scale empirical comparison of prominent linear contextual bandit algorithms under widely-used offline evaluation protocols, demonstrating the surprising efficacy of purely exploitative models;
- We reveal and quantify a systematic bias against exploration inherent in these offline protocols, showing how exploration mechanisms are consistently devalued, leading to the artificial dominance of greedy strategies during hyperparameter tuning;
- We advocate for a critical reassessment of offline MAB evaluation practices, discussing the potential and current limitations of alternative methodologies, including simulation-based approaches, and outlining crucial directions for future research to foster more reliable bandit algorithm assessment.
Below we present a summary of the experimental protocol and the achieved results. More details can be found in the original paper.
To evaluate the models, we followed the pipeline illustrated in the above figure. First, the datasets were chronologically sorted and consumed to simulate a continuous environment. For each dataset, the first 50% of the interactions were used as initial training (warm-up), and from this portion, 10% were selected to form a validation split. The remaining 50% of the data was set aside as a test partition.
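The split described above can be sketched in a few lines of plain Python. This is only an illustration, not the repository's preprocessing code: the `(user_id, item_id, timestamp)` tuple layout and the function name are hypothetical, and taking the validation split from the most recent 10% of the warm-up portion is an assumption, since the text only states that 10% of the warm-up is selected.

```python
from typing import List, Tuple

# Hypothetical record layout: (user_id, item_id, timestamp).
Interaction = Tuple[int, int, float]


def chronological_split(
    interactions: List[Interaction],
    warmup_frac: float = 0.5,
    val_frac: float = 0.1,
):
    """Sort by timestamp, then split into train, validation, and test partitions.

    Assumption: the validation split is the most recent `val_frac` of the
    warm-up portion.
    """
    ordered = sorted(interactions, key=lambda row: row[2])
    n_warmup = int(len(ordered) * warmup_frac)
    warmup, test = ordered[:n_warmup], ordered[n_warmup:]

    n_val = int(len(warmup) * val_frac)
    train = warmup[: len(warmup) - n_val]
    val = warmup[len(warmup) - n_val:]
    return train, val, test


# Toy log: 20 interactions whose timestamps arrive out of order.
log = [(u % 4, u % 7, float(t)) for u, t in enumerate(reversed(range(20)))]
train, val, test = chronological_split(log)
print(len(train), len(val), len(test))  # 9 1 10
```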
The warm-up partition was consumed to learn item embeddings through implicit ALS. User states were then generated as the average of the previously consumed item embeddings and used as context to train the linear CMABs. Hyperparameters were fine-tuned based on the performance on the validation split, using the Normalized Discounted Cumulative Gain (NDCG) in a top-20 recommendation setting. Since the item embeddings were pre-trained in a non-incremental fashion, we filtered the test partition to exclude interactions with items unavailable in the warm-up phase.
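As a rough sketch of the three ideas in this paragraph (mean-embedding user states, filtering out unseen items, and NDCG in a top-k setting), consider the snippet below. All function names are hypothetical, and the NDCG variant assumes a single relevant item per interaction (the item the user actually consumed); the repository's implementation may differ.

```python
import math
from typing import Dict, List, Sequence


def user_state(consumed: Sequence[int], item_emb: Dict[int, List[float]]) -> List[float]:
    """Context vector: element-wise mean of the embeddings of consumed items."""
    vectors = [item_emb[i] for i in consumed]
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def filter_known_items(interactions, known_items):
    """Drop interactions whose item has no pre-trained embedding."""
    return [row for row in interactions if row[1] in known_items]


def ndcg_at_k(ranked_items: Sequence[int], relevant_item: int, k: int = 20) -> float:
    """NDCG@k with one relevant item: the ideal DCG is 1, so the score is
    1 / log2(rank + 1) if the item appears in the top k, and 0 otherwise."""
    for rank, item in enumerate(ranked_items[:k], start=1):
        if item == relevant_item:
            return 1.0 / math.log2(rank + 1)
    return 0.0


# Toy data: three items with 2-d embeddings.
item_emb = {10: [1.0, 0.0], 11: [0.0, 1.0], 12: [1.0, 1.0]}
state = user_state([10, 11], item_emb)             # [0.5, 0.5]
test = filter_known_items([(0, 12, 5.0), (0, 99, 6.0)], item_emb.keys())
score = ndcg_at_k([11, 12, 10], relevant_item=12)  # relevant item at rank 2
```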
After the warm-up stage, we divided the remaining data into ten sequential batches, each comprising 10% of the interactions. For each batch, the models yielded recommendations and received feedback based on the item consumed by the user, updating their underlying linear models incrementally.
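The batch-wise replay can be illustrated with a textbook per-item ridge regression that is updated one interaction at a time. The sketch below is not the repository's code: the class and function names are hypothetical, the data is random, and rewarding a recommended item with 1 only when it matches the consumed item is an assumption about the feedback rule. Setting `alpha = 0` gives a purely greedy model, while `alpha > 0` adds a UCB-style exploration bonus on top of the same linear backbone.

```python
import numpy as np


class LinearArm:
    """Ridge regression for one item, updated one interaction at a time."""

    def __init__(self, dim: int, l2: float = 1.0):
        self.A_inv = np.eye(dim) / l2   # inverse of (l2 * I + sum of x x^T)
        self.b = np.zeros(dim)          # sum of reward * x

    def update(self, x: np.ndarray, reward: float) -> None:
        # Sherman-Morrison rank-one update keeps A_inv exact without re-inverting.
        Ax = self.A_inv @ x
        self.A_inv -= np.outer(Ax, Ax) / (1.0 + x @ Ax)
        self.b += reward * x

    def score(self, x: np.ndarray, alpha: float = 0.0) -> float:
        # alpha = 0 is the greedy model; alpha > 0 adds a UCB exploration bonus.
        mean = x @ self.A_inv @ self.b
        bonus = alpha * np.sqrt(x @ self.A_inv @ x)
        return float(mean + bonus)


def recommend(arms, x, k=20, alpha=0.0):
    """Rank item ids by score and return the top k."""
    return sorted(arms, key=lambda item: arms[item].score(x, alpha), reverse=True)[:k]


# Toy replay: two items, random 2-d contexts, ten sequential batches.
rng = np.random.default_rng(0)
arms = {item: LinearArm(dim=2) for item in (0, 1)}
stream = [(rng.normal(size=2), int(rng.integers(2))) for _ in range(100)]
hits = 0
for batch in np.array_split(np.arange(len(stream)), 10):
    for idx in batch:
        x, consumed = stream[idx]
        recs = recommend(arms, x, k=1)        # recommend first...
        hits += consumed in recs
        for item in recs:                     # ...then learn from the feedback
            arms[item].update(x, float(item == consumed))
print(f"hit rate@1: {hits / len(stream):.2f}")
```

Because the exploration term is the only thing that changes between the greedy and UCB variants here, comparing runs with different `alpha` values isolates the effect of exploration under this kind of offline replay.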
In this section, we show the results obtained. For more metrics and plots, consult the results page.
The supported platforms for executing the code are the following:
- macOS 10.12+ x86_64.
- Linux x86_64 (including WSL on Windows 10).
There are two ways to install the libraries: (1) manually, or (2) with Docker, which also works on Windows.
Executing the command below will install the necessary libraries:
pip install -r requirements.txt
or
python -m pip install -r requirements.txt
Note 1: It is recommended to create a new conda environment first, so the installed library versions do not conflict with those of other projects.
Note 2: Some of the required packages do not support native Windows; on Windows, the manual installation only works under WSL.
Note 3: Python 3.8.0 is recommended.
To install the libraries with Docker, execute the following steps:
1- Build a Docker image:
docker build -t mab-recsys .
2- Run the Docker container:
docker run -it --gpus all --shm-size=8g \
-v ./raw:/mab_recsys/raw \
-v ./1-datasets:/mab_recsys/1-datasets \
-v ./2-experiments:/mab_recsys/2-experiments \
-v ./3-results:/mab_recsys/3-results \
-v ./4-tex:/mab_recsys/4-tex \
-v ./5-images:/mab_recsys/5-images \
-v ./6-analyses:/mab_recsys/6-analyses \
-v ./7-agg_results:/mab_recsys/7-agg_results \
mab-recsys /bin/bash -c "source activate py38 && /bin/bash"
The datasets must be downloaded before running the experiments. The list below states, for each dataset, what to download and where to save the files:
- AmazonBeauty: put the All_Beauty.jsonl file in raw/amazon-beauty
- AmazonBooks: download and extract in raw/amazon-books
- AmazonGames: put the Video_Games.jsonl file in raw/amazon-games
- BestBuy: put the train.csv file in raw/BestBuy
- Delicious2K: download hetrec2011-delicious-2k.zip from the Delicious Bookmarks section and extract it in raw/delicious2k
- Delicious2k-urlPrincipal: the same as Delicious2K
- MovieLens-100K: download ml-100k.zip from the MovieLens 100K Dataset section and extract it in raw/ml-100k
- MovieLens-25M: download ml-25m.zip from the MovieLens 25M Dataset section and extract it in raw/ml-25m
- RetailRocket: put the events.csv file in raw/RetailRocket
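Before preprocessing, it may help to check that the files landed in the expected places. The helper below is a hypothetical convenience script, not part of the repository. It uses only the paths named in the list above, and for datasets where the list names just a folder it only checks that the folder exists.

```python
from pathlib import Path

# Paths taken from the list above; entries ending in "/" only require the folder.
EXPECTED = {
    "AmazonBeauty": "amazon-beauty/All_Beauty.jsonl",
    "AmazonBooks": "amazon-books/",
    "AmazonGames": "amazon-games/Video_Games.jsonl",
    "BestBuy": "BestBuy/train.csv",
    "Delicious2K": "delicious2k/",
    "MovieLens-100K": "ml-100k/",
    "MovieLens-25M": "ml-25m/",
    "RetailRocket": "RetailRocket/events.csv",
}


def missing_datasets(raw_dir="raw"):
    """Return the names of datasets whose expected file or folder is absent."""
    root = Path(raw_dir)
    missing = []
    for name, rel in EXPECTED.items():
        target = root / rel
        present = target.is_dir() if rel.endswith("/") else target.is_file()
        if not present:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Run from the repository root so that "raw" resolves correctly.
    for name in missing_datasets():
        print(f"missing: {name}")
```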
Execute the following scripts to reproduce our results:
Once the raw datasets are downloaded (see the Datasets section), they must be preprocessed before generating the recommendations. To do that, execute the following command:
python src/scripts/preprocess/main.py
Executing this script will prompt you for the datasets to preprocess. Enter the dataset indexes separated by spaces to select them.
Another way to select the datasets is by executing the command below:
python src/scripts/preprocess/main.py --datasets <datasets>
Replace <datasets> with the names (or indexes) of the datasets separated by comma (","). The available datasets to preprocess are:
- [1]: amazon-beauty
- [2]: amazon-books
- [3]: amazon-games
- [4]: bestbuy
- [5]: delicious2k
- [6]: delicious2k-urlPrincipal
- [7]: ml-100k
- [8]: ml-25m
- [9]: retailrocket
- all (it will use all datasets)
More information about this step can be found in the documentation about preprocessing.
With the datasets downloaded and preprocessed, it's necessary to generate the embeddings, which will be used as context for the contextual MABs. To do that, execute the following command:
python src/scripts/not_incremental/main.py
Executing this script will prompt you for the datasets and algorithms to use. Enter the dataset and algorithm indexes separated by spaces to select the desired options.
Another way to select the options is by executing the command below:
python src/scripts/not_incremental/main.py --algorithms <algorithms> --datasets <datasets>
Replace <algorithms> with the names (or indexes) of the algorithms separated by comma (","). The available algorithms to execute are:
- [1]: als
- [2]: bpr
- all (it will use all algorithms)
Replace <datasets> with the names (or indexes) of the datasets separated by comma (","). The available datasets to use as train/test are:
- [1]: amazon-beauty
- [2]: amazon-books
- [3]: amazon-games
- [4]: bestbuy
- [5]: delicious2k
- [6]: delicious2k-urlPrincipal
- [7]: ml-100k
- [8]: ml-25m
- [9]: retailrocket
- all (it will use all datasets)
More information about this step can be found in the documentation about not incremental experiment.
With the embeddings generated, it is possible to train and test the contextual MABs. For that, execute the following command:
python src/scripts/incremental/main.py
Executing this script will prompt you for the datasets, algorithms, embeddings, and contexts to use. Enter the option indexes separated by spaces to select the desired options.
Another way to select the options is by executing the command below:
python src/scripts/incremental/main.py --algorithms <algorithms> --datasets <datasets> --embeddings <embeddings> --contexts <contexts>
Replace <algorithms> with the names (or indexes) of the incremental algorithms separated by comma (","). The available algorithms to execute are:
- [1]: Lin
- [2]: LinUCB
- [3]: LinGreedy
- [4]: LinTS
- all (it will use all algorithms)
Replace <datasets> with the names (or indexes) of the datasets separated by comma (","). The available datasets to use as train/test are:
- [1]: amazon-beauty
- [2]: amazon-books
- [3]: amazon-games
- [4]: bestbuy
- [5]: delicious2k
- [6]: delicious2k-urlPrincipal
- [7]: ml-100k
- [8]: ml-25m
- [9]: retailrocket
- all (it will use all datasets)
Replace <embeddings> with the names (or indexes) of the not incremental algorithms separated by comma (","). The embeddings generated by these algorithms will be used as part of the MAB context, so they must be generated beforehand (explained in Section 2 about not incremental algorithms). The available embeddings are:
- [1]: als
- [2]: bpr
- all (it will use all embeddings)
Replace <contexts> with the names (or indexes) of the context generation strategies separated by comma (","). The available strategies are:
- [1]: user
- [2]: item_concat
- [3]: item_mean
- [4]: item_concat+user
- [5]: item_mean+user
- [6]: item_concat+item_mean
- [7]: item_concat+item_mean+user
- all (it will use all strategies)
More information about this step can be found in the documentation about incremental experiment.
After executing all the above commands, it is possible to generate tables and graphics to visualize the results. For that, execute the following command:
python src/scripts/generate_metrics/main.py
Executing this script will prompt you for the datasets, metrics, not incremental algorithms, incremental algorithms, embeddings, and contexts to use. Enter the option indexes separated by spaces to select the desired options.
Another way to select the options is by executing the command below:
python src/scripts/generate_metrics/main.py --datasets <datasets> --topk <top_k> --metrics <metrics> --not_incremental_algorithms <not_incremental_algorithms> --incremental_algorithms <incremental_algorithms> --embeddings <embeddings> --contexts <contexts>
Replace <datasets> with the names (or indexes) of the datasets separated by comma (","). The available datasets to use are:
- [1]: amazon-beauty
- [2]: amazon-books
- [3]: amazon-games
- [4]: bestbuy
- [5]: delicious2k
- [6]: delicious2k-urlPrincipal
- [7]: ml-100k
- [8]: ml-25m
- [9]: retailrocket
- all (it will use all datasets)
Replace <top_k> with the names (or indexes) of the top-K values separated by comma (","). The available top-K values are:
- [1]: top-5
- [2]: top-10
- [3]: top-15
- [4]: top-20
- all (it will use all top-K values)
Replace <metrics> with the names (or indexes) of the metrics separated by comma (","). The available metrics are:
- [1]: ncdg
- [2]: hit rate (hr)
- [3]: f-score
- [4]: novelty
- [5]: coverage
- [6]: diversity
- all (it will use all metrics)
Replace <not_incremental_algorithms> with the names (or indexes) of the not incremental algorithms separated by comma (","). The non-incremental algorithms selected here will be used to compare the results with the incremental algorithms. The available not incremental algorithms are:
- [1]: als
- [2]: bpr
- all (it will use all algorithms)
Replace <incremental_algorithms> with the names (or indexes) of the incremental algorithms separated by comma (","). The available incremental algorithms to execute are:
- [1]: Lin
- [2]: LinUCB
- [3]: LinGreedy
- [4]: LinTS
- all (it will use all algorithms)
Replace <embeddings> with the names (or indexes) of the embeddings separated by comma (","). Only the results of incremental algorithms that used the selected embeddings will be included. The available embeddings are:
- [1]: als
- [2]: bpr
- all (it will use all embeddings)
Replace <contexts> with the names (or indexes) of the context generation strategies separated by comma (","). The available strategies are:
- [1]: user
- [2]: item_concat
- [3]: item_mean
- [4]: item_concat+user
- [5]: item_mean+user
- [6]: item_concat+item_mean
- [7]: item_concat+item_mean+user
- all (it will use all strategies)
More information about this step can be found in the documentation about metrics generation.
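To run the four steps above without the interactive prompts, the commands can be assembled programmatically. The script below is a hypothetical convenience wrapper, not part of the repository. It uses only the script paths and flags documented above, and the dataset, algorithm, embedding, and context values are arbitrary picks from the lists above, chosen purely for illustration. By default it only prints the commands.

```python
import subprocess
import sys


def build_command(script: str, **options) -> list:
    """Build the argv for one pipeline script; list values are comma-joined."""
    cmd = [sys.executable, f"src/scripts/{script}/main.py"]
    for flag, value in options.items():
        if not isinstance(value, str):
            value = ",".join(value)
        cmd += [f"--{flag}", value]
    return cmd


def run_pipeline(commands, dry_run=True):
    """Print each command and, unless dry_run is set, execute it in order."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


# Example selection; any names from the lists above can be used instead.
DATASETS = ["ml-100k", "retailrocket"]

pipeline = [
    build_command("preprocess", datasets=DATASETS),
    build_command("not_incremental", algorithms="als", datasets=DATASETS),
    build_command("incremental", algorithms="all", datasets=DATASETS,
                  embeddings="als", contexts="item_mean"),
    build_command("generate_metrics", datasets=DATASETS, topk="all", metrics="all",
                  not_incremental_algorithms="als", incremental_algorithms="all",
                  embeddings="als", contexts="item_mean"),
]

if __name__ == "__main__":
    # Run from the repository root; pass dry_run=False to actually execute.
    run_pipeline(pipeline)
```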
A Jupyter Notebook can be used to aggregate the results from multiple datasets. Change the necessary variables and execute the cell in the notebook to generate the aggregated graphics and tables.
If our project is useful or relevant to your research, please kindly recognize our contributions by citing our paper:
@inproceedings{pires2025,
author = {Pedro R. Pires and Gregorio F. Azevedo and Pietro L. P. Campos and Rafael T. Sereicikas and Tiago A. Almeida},
title = {Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation},
year = {2025},
booktitle = {Proceedings of the 19th ACM Conference on Recommender Systems},
series = {RecSys '25}
}
