Exploitation Over Exploration for Offline MABs

Official repository for the paper "Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation", published at the 19th ACM Conference on Recommender Systems (RecSys'25).

Abstract

Multi-Armed Bandit (MAB) algorithms are widely used in recommender systems that require continuous, incremental learning. A core aspect of MABs is the exploration–exploitation trade-off: choosing between exploiting items likely to be enjoyed and exploring new ones to gather information. In contextual linear bandits, this trade-off is particularly central, as many variants share the same linear regression backbone and differ primarily in their exploration strategies. Despite its prevalent use, offline evaluation of MABs is increasingly recognized for its limitations in reliably assessing exploration behavior. This study conducts an extensive offline empirical comparison of several linear MABs. Strikingly, across over 90% of various datasets, a greedy linear model -- with no type of exploration -- consistently achieves top-tier performance, often outperforming or matching its exploratory counterparts. This observation is further corroborated by hyperparameter optimization, which consistently favors configurations that minimize exploration, suggesting that pure exploitation is the dominant strategy within these evaluation settings. Our results expose significant inadequacies in offline evaluation protocols for bandits, particularly concerning their capacity to reflect true exploratory efficacy. Consequently, this research underscores the urgent necessity for developing more robust assessment methodologies, guiding future investigations into alternative evaluation frameworks for interactive learning in recommender systems.

Key achievements

We conduct a large-scale empirical comparison of prominent linear contextual bandit algorithms under widely-used offline evaluation protocols, demonstrating the surprising efficacy of purely exploitative models;
We reveal and quantify a systematic bias against exploration inherent in these offline protocols, showing how exploration mechanisms are consistently devalued, leading to the artificial dominance of greedy strategies during hyperparameter tuning;
We advocate for a critical reassessment of offline MAB evaluation practices, discussing the potential and current limitations of alternative methodologies, including simulation-based approaches, and outlining crucial directions for future research to foster more reliable bandit algorithm assessment.

Below we present a summary of the experimental protocol and the achieved results. More details can be found in the original paper.

Experimental protocol

To evaluate the models, we followed the pipeline illustrated in the above figure. First, the datasets were chronologically sorted and consumed to simulate a continuous environment. For each dataset, the first 50% of the interactions were used as initial training (warm-up), and from this portion, 10% were selected to form a validation split. The remaining 50% of the data was set aside as a test partition.
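The split described above can be sketched as follows. This is an illustration, not the repository's code: the `(timestamp, user_id, item_id)` tuple layout and the function name are hypothetical, and the sketch assumes the validation split is the most recent 10% of the warm-up data and is held out from training, which the text does not specify.

```python
from typing import List, Sequence, Tuple

# Assumed layout for the example: (timestamp, user_id, item_id).
Interaction = Tuple[int, str, str]


def chronological_split(
    interactions: Sequence[Interaction],
    warmup_frac: float = 0.5,
    valid_frac: float = 0.1,
) -> Tuple[List[Interaction], List[Interaction], List[Interaction]]:
    """Return (train, validation, test) splits ordered by timestamp."""
    ordered = sorted(interactions, key=lambda row: row[0])
    n_warmup = int(len(ordered) * warmup_frac)
    warmup, test = ordered[:n_warmup], ordered[n_warmup:]
    # Assumption: validation is the most recent slice of the warm-up data.
    n_valid = int(len(warmup) * valid_frac)
    train = warmup[: len(warmup) - n_valid]
    valid = warmup[len(warmup) - n_valid:]
    return train, valid, test
```

With 20 interactions, this yields 9 training, 1 validation, and 10 test interactions.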

The warm-up partition was consumed to learn item embeddings through implicit ALS. User states were then generated as the average of the previously consumed item embeddings and used as context to train the linear CMABs. Hyperparameters were fine-tuned based on the performance on the validation split, using the Normalized Discounted Cumulative Gain (NDCG) in a top-20 recommendation setting. Since the item embeddings were pre-trained in a non-incremental fashion, we filtered the test partition to exclude interactions with items unavailable in the warm-up phase.
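As a rough illustration of two pieces of this step, the sketch below averages item embeddings into a user state and computes NDCG@k with binary relevance. Both functions are hypothetical helpers written for this README, not the repository's implementation; the zero-vector fallback for users without history is an assumption.

```python
import math
from typing import Dict, List, Sequence, Set


def mean_user_state(
    consumed_items: Sequence[str],
    item_embeddings: Dict[str, List[float]],
    dim: int,
) -> List[float]:
    """Average the embeddings of the items a user has consumed so far."""
    vectors = [item_embeddings[i] for i in consumed_items if i in item_embeddings]
    if not vectors:
        # Assumption: users with no known history get a zero vector.
        return [0.0] * dim
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def ndcg_at_k(ranked_items: Sequence[str], relevant: Set[str], k: int = 20) -> float:
    """Binary-relevance NDCG@k for a single recommendation list."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(ranked_items[:k])
        if item in relevant
    )
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0
```

When each test interaction has a single consumed item, `relevant` holds one element and the ideal DCG is 1, so NDCG@k reduces to the discounted position of that item in the list.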

After the warm-up stage, we divided the test partition into ten sequential batches, each comprising 10% of its interactions. For each batch, the models yielded recommendations and received feedback based on the item consumed by the user, updating their underlying linear models incrementally.
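The batch loop can be pictured with the minimal test-then-train sketch below. It is not the repository's code: `PopularityModel` is a hypothetical stand-in for a bandit, the `(user_id, item_id)` layout is assumed, hit rate is used instead of NDCG to keep the example short, and updating only after a whole batch is scored is an assumption.

```python
from collections import Counter
from typing import List, Sequence, Tuple

Interaction = Tuple[str, str]  # (user_id, item_id); layout assumed for the example


def split_into_batches(rows: Sequence[Interaction], n_batches: int = 10) -> List[List[Interaction]]:
    """Split chronologically ordered rows into n_batches contiguous batches."""
    bounds = [round(i * len(rows) / n_batches) for i in range(n_batches + 1)]
    return [list(rows[lo:hi]) for lo, hi in zip(bounds, bounds[1:])]


class PopularityModel:
    """Hypothetical stand-in recommender: ranks items by observed count."""

    def __init__(self) -> None:
        self.counts: Counter = Counter()

    def recommend(self, user: str, k: int) -> List[str]:
        return [item for item, _ in self.counts.most_common(k)]

    def update(self, user: str, item: str) -> None:
        self.counts[item] += 1


def evaluate_in_batches(model, test_rows, k: int = 20, n_batches: int = 10) -> List[float]:
    """Test-then-train: score each batch, then learn from its feedback."""
    hit_rates = []
    for batch in split_into_batches(test_rows, n_batches):
        hits = sum(item in model.recommend(user, k) for user, item in batch)
        hit_rates.append(hits / len(batch) if batch else 0.0)
        # Assumption: the model is updated only after the whole batch is scored.
        for user, item in batch:
            model.update(user, item)
    return hit_rates
```

Any object exposing `recommend(user, k)` and `update(user, item)` can be dropped in place of the stand-in model.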

Results

In this section, we show the results obtained. For more metrics and plots, consult the results page.

NDCG in a top-20 recommendation task (NDCG@20)

Cumulative NDCG@20 for every partition on the test set

Cumulative novelty@20 for every partition on the test set

Installing

The supported platforms for executing the code are the following:

macOS 10.12+ x86_64.
Linux x86_64 (including WSL on Windows 10).

There are two ways to install the dependencies: (1) manually or (2) with Docker (which also works on Windows).

Installing manually

Executing the command below will install the necessary libraries:

pip install -r requirements.txt

or

python -m pip install -r requirements.txt

Note 1: We recommend creating a new conda environment first, so that the installed library versions do not conflict with other projects.

Note 2: Because of some of the required packages, manual installation does not work on native Windows; it only works under WSL.

Note 3: Python 3.8.0 is the recommended version.

Installing with Docker

To install the libraries with Docker, execute the following steps:

1- Build a Docker image:

docker build -t mab-recsys .

2- Run the Docker container:

docker run -it --gpus all --shm-size=8g \
    -v ./raw:/mab_recsys/raw \
    -v ./1-datasets:/mab_recsys/1-datasets \
    -v ./2-experiments:/mab_recsys/2-experiments \
    -v ./3-results:/mab_recsys/3-results \
    -v ./4-tex:/mab_recsys/4-tex \
    -v ./5-images:/mab_recsys/5-images \
    -v ./6-analyses:/mab_recsys/6-analyses \
    -v ./7-agg_results:/mab_recsys/7-agg_results \
    mab-recsys /bin/bash -c "source activate py38 && /bin/bash"

Datasets

The datasets must be downloaded before running the experiments. The list below gives, for each dataset, the file to download and where to save it:

AmazonBeauty: put All_Beauty.jsonl file in raw/amazon-beauty
AmazonBooks: download and extract in raw/amazon-books
AmazonGames: put Video_Games.jsonl file in raw/amazon-games
BestBuy: put train.csv file in raw/BestBuy
Delicious2K: download hetrec2011-delicious-2k.zip in the Delicious Bookmarks section and extract it in raw/delicious2k
Delicious2k-urlPrincipal: the same as Delicious2K
MovieLens-100K: download ml-100k.zip in the MovieLens 100K Dataset section and extract it in raw/ml-100k
MovieLens-25M: download ml-25m.zip in the MovieLens 25M Dataset section and extract it in raw/ml-25m
RetailRocket: put events.csv file in raw/RetailRocket

Executing the code

Execute the following scripts to reproduce our results:

1. Dataset preprocess

With the raw datasets downloaded (more details in Datasets section), it's necessary to preprocess them before generating the recommendations. To do that, execute the following command:

python src/scripts/preprocess/main.py

Running this script prompts you for the datasets to preprocess. Enter the dataset indexes separated by spaces to select them.

Another way to select the datasets is by executing the command below:

python src/scripts/preprocess/main.py --datasets <datasets>

Replace <datasets> with the names (or indexes) of the datasets separated by comma (","). The available datasets to preprocess are:

[1]: amazon-beauty
[2]: amazon-books
[3]: amazon-games
[4]: bestbuy
[5]: delicious2k
[6]: delicious2k-urlPrincipal
[7]: ml-100k
[8]: ml-25m
[9]: retailrocket
all (it will use all datasets)

More information about this step can be found in the documentation about preprocessing.

2. Generate embeddings (not_incremental training)

With the datasets downloaded and preprocessed, it's necessary to generate the embeddings that will be used as context for the contextual MABs. To do that, execute the following command:

python src/scripts/not_incremental/main.py

Running this script prompts you for the datasets and algorithms to use. Enter the dataset and algorithm indexes separated by spaces to select the desired options.

Another way to select the options is by executing the command below:

python src/scripts/not_incremental/main.py --algorithms <algorithms> --datasets <datasets>

Replace <algorithms> with the names (or indexes) of the algorithms separated by comma (","). The available algorithms to execute are:

[1]: als
[2]: bpr
all (it will use all algorithms)

Replace <datasets> with the names (or indexes) of the datasets separated by comma (","). The available datasets to use as train/test are:

[1]: amazon-beauty
[2]: amazon-books
[3]: amazon-games
[4]: bestbuy
[5]: delicious2k
[6]: delicious2k-urlPrincipal
[7]: ml-100k
[8]: ml-25m
[9]: retailrocket
all (it will use all datasets)

More information about this step can be found in the documentation about not incremental experiment.

3. Run incremental experiments

With the embeddings generated, it is possible to train and test the contextual MABs. For that, execute the following command:

python src/scripts/incremental/main.py

Running this script prompts you for the datasets, algorithms, embeddings, and contexts to use. Enter the option indexes separated by spaces to select the desired options.

Another way to select the options is by executing the command below:

python src/scripts/incremental/main.py --algorithms <algorithms> --datasets <datasets> --embeddings <embeddings> --contexts <contexts>

Replace <algorithms> with the names (or indexes) of the incremental algorithms separated by comma (","). The available algorithms to execute are:

[1]: Lin
[2]: LinUCB
[3]: LinGreedy
[4]: LinTS
all (it will use all algorithms)
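As the abstract notes, linear contextual bandits share a linear regression backbone and differ mainly in how they explore. The sketch below illustrates that idea with a generic per-arm ridge regression and an optional LinUCB-style confidence bonus; with `alpha=0` the score is purely greedy. It is not the repository's implementation, and `LinearArm`, `ridge`, and `alpha` are hypothetical names. How the options listed above map onto this sketch is not specified here.

```python
import math
from typing import List

Vector = List[float]


class LinearArm:
    """Ridge regression for one arm, kept up to date one sample at a time."""

    def __init__(self, dim: int, ridge: float = 1.0) -> None:
        # a_inv holds (ridge * I + sum of x x^T)^-1; b holds sum of reward * x.
        self.a_inv = [[(1.0 / ridge) if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _a_inv_times(self, x: Vector) -> Vector:
        return [sum(a * v for a, v in zip(row, x)) for row in self.a_inv]

    def score(self, x: Vector, alpha: float = 0.0) -> float:
        """Predicted reward plus an optional UCB-style exploration bonus."""
        a_inv_x = self._a_inv_times(x)
        theta = self._a_inv_times(self.b)
        mean = sum(t * v for t, v in zip(theta, x))
        variance = max(sum(u * v for u, v in zip(a_inv_x, x)), 0.0)
        return mean + alpha * math.sqrt(variance)

    def update(self, x: Vector, reward: float) -> None:
        """Sherman-Morrison rank-one update of a_inv, then accumulate b."""
        a_inv_x = self._a_inv_times(x)
        denom = 1.0 + sum(u * v for u, v in zip(a_inv_x, x))
        for i in range(len(x)):
            for j in range(len(x)):
                self.a_inv[i][j] -= a_inv_x[i] * a_inv_x[j] / denom
        self.b = [bi + reward * xi for bi, xi in zip(self.b, x)]
```

The bonus shrinks along directions of the context space that the arm has already observed, so exploration only changes the ranking when `alpha` is large relative to the differences in predicted reward.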

Replace <datasets> with the names (or indexes) of the datasets separated by comma (","). The available datasets to use as train/test are:

[1]: amazon-beauty
[2]: amazon-books
[3]: amazon-games
[4]: bestbuy
[5]: delicious2k
[6]: delicious2k-urlPrincipal
[7]: ml-100k
[8]: ml-25m
[9]: retailrocket
all (it will use all datasets)

Replace <embeddings> with the names (or indexes) of the non-incremental algorithms separated by comma (","). The embeddings generated by these algorithms are used as part of the MAB context, so they must be generated beforehand (see step 2, Generate embeddings). The available embeddings are:

[1]: als
[2]: bpr
all (it will use all embeddings)

Replace <contexts> with the names (or indexes) of the context generation strategies separated by comma (","). The available strategies are:

[1]: user
[2]: item_concat
[3]: item_mean
[4]: item_concat+user
[5]: item_mean+user
[6]: item_concat+item_mean
[7]: item_concat+item_mean+user
all (it will use all strategies)

More information about this step can be found in the documentation about incremental experiment.
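For unattended runs, steps 1 to 3 can be chained from a small driver. The sketch below only assembles the command lines from the flags documented above; the helper name is hypothetical and the option values in the usage example are arbitrary picks from the lists, not a recommended configuration.

```python
import sys
from typing import List, Sequence


def build_pipeline_commands(
    datasets: Sequence[str],
    embeddings: Sequence[str],
    algorithms: Sequence[str],
    contexts: Sequence[str],
    python: str = sys.executable,
) -> List[List[str]]:
    """Return one argv list per step, using the comma-separated flag format."""
    ds = ",".join(datasets)
    emb = ",".join(embeddings)
    return [
        # Step 1: preprocess the raw datasets.
        [python, "src/scripts/preprocess/main.py", "--datasets", ds],
        # Step 2: train the non-incremental models that produce the embeddings.
        [python, "src/scripts/not_incremental/main.py",
         "--algorithms", emb, "--datasets", ds],
        # Step 3: run the incremental (bandit) experiments.
        [python, "src/scripts/incremental/main.py",
         "--algorithms", ",".join(algorithms), "--datasets", ds,
         "--embeddings", emb, "--contexts", ",".join(contexts)],
    ]


if __name__ == "__main__":
    for argv in build_pipeline_commands(["ml-100k"], ["als"], ["Lin", "LinUCB"], ["item_mean"]):
        # Print only; to run a step, pass argv to subprocess.run(argv, check=True).
        print(" ".join(argv))
```

Building argument lists instead of shell strings avoids quoting issues if the commands are later passed to `subprocess.run`.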

4. Generate tables and graphics

After executing all the above commands, it is possible to generate tables and graphics to visualize the results. For that, execute the following command:

python src/scripts/generate_metrics/main.py

Running this script prompts you for the datasets, metrics, non-incremental algorithms, incremental algorithms, embeddings, and contexts to use. Enter the option indexes separated by spaces to select the desired options.

Another way to select the options is by executing the command below:

python src/scripts/generate_metrics/main.py --datasets <datasets> --topk <top_k> --metrics <metrics> --not_incremental_algorithms <not_incremental_algorithms> --incremental_algorithms <incremental_algorithms> --embeddings <embeddings> --contexts <contexts>

Replace <datasets> with the names (or indexes) of the datasets separated by comma (","). The available datasets to use are:

[1]: amazon-beauty
[2]: amazon-books
[3]: amazon-games
[4]: bestbuy
[5]: delicious2k
[6]: delicious2k-urlPrincipal
[7]: ml-100k
[8]: ml-25m
[9]: retailrocket
all (it will use all datasets)

Replace <top_k> with the names (or indices) of the topK values, separated by commas (","). The available top-K are:

[1]: top-5
[2]: top-10
[3]: top-15
[4]: top-20
all (will use all topK)

Replace <metrics> with the names (or indexes) of the metrics separated by comma (","). The available metrics are:

[1]: ncdg
[2]: hit rate (hr)
[3]: f-score
[4]: novelty
[5]: coverage
[6]: diversity
all (it will use all metrics)

Replace <not_incremental_algorithms> with the names (or indexes) of the non-incremental algorithms separated by comma (","). The algorithms selected here serve as baselines against which the incremental algorithms are compared. The available non-incremental algorithms are:

[1]: als
[2]: bpr
all (it will use all algorithms)

Replace <incremental_algorithms> with the names (or indexes) of the incremental algorithms separated by comma (","). The available incremental algorithms to execute are:

[1]: Lin
[2]: LinUCB
[3]: LinGreedy
[4]: LinTS
all (it will use all algorithms)

Replace <embeddings> with the names (or indexes) of the embeddings separated by comma (","). Only results from incremental algorithms that used the selected embeddings are included. The available embeddings are:

[1]: als
[2]: bpr
all (it will use all embeddings)

Replace <contexts> with the names (or indexes) of the context generation strategies separated by comma (","). The available strategies are:

[1]: user
[2]: item_concat
[3]: item_mean
[4]: item_concat+user
[5]: item_mean+user
[6]: item_concat+item_mean
[7]: item_concat+item_mean+user
all (it will use all strategies)

More information about this step can be found in the documentation about metrics generation.

5. Aggregate results

A Jupyter Notebook can be used to aggregate the results from multiple datasets. Change the necessary variables and execute the cell in the notebook to generate the aggregated graphics and tables.

Citation

If our project is useful or relevant to your research, please cite our paper:

@inproceedings{pires2025,
    author = {Pedro R. Pires and Gregorio F. Azevedo and Pietro L. P. Campos and Rafael T. Sereicikas and Tiago A. Almeida},
    title = {Exploitation Over Exploration: Unmasking the Bias in Linear Bandit Recommender Offline Evaluation},
    year = {2025},
    booktitle = {Proceedings of the 19th ACM Conference on Recommender Systems},
    series = {RecSys '25}
}
