Multi-Arm Bandits for recommendations and A/B testing on Amazon ratings data set

Abhishek Maheshwarappa
11 min readDec 18, 2020
Slot Machine

Multi-Arm Bandits is used by many companies like Stitchfix, Netflix, Microsoft, and other big companies for recommendations. There are tons of research going on the Multi-Arm Bandits and their application to real-time problems. This article is an attempt to apply Multi-Arm bandits.

What is a Multi-Arm Bandit?

If you are visiting the casino for the first time then I am sure you will be confused about which machine to play to get the jackpot as all seem identical, Right? Let's find out is there a way to solve this problem, Multi-Arm Bandit is inspired by the same problem.

Imagine a gambler, gambling at a casino in the slot machine with unknow distribution of the reward of these machines. The main goal of the gambler is to win maximum times and make a lot of money when he leaves the casino.

If the gambler knows the reward distribution of these then he would play the slot machine which has a higher probability of success. Question is, how does the gambler learn which slot machine is best and get the most money in the shortest amount of time?

The solution is to play the slot machine the arms one-by-one in a sequence while gathering informationtoo maximize the total payout over the long-run.

You might think this is specific to the Las Vegas Strip but the truth is this scenario is an exploration-exploitation tradeoff which helps in making a decision based on what you already know and miss out on a new scenario.

Las Vegas Strip source —

Bandit Algorithms Setting

  • The agent chooses an action from a set of actions
  • The environment generates a response with the rewards or regrets back to the agent
  • The aim of the agent is to maximize the cumulative reward or minimize the cumulative regret which is the difference in total reward gained in n rounds and the total reward that would have been gained with respect to the optimal action

Multi-Armed Bandit Examples

A real-world example of a multi-armed bandit problem is when a news website wants to make a decision about which news to show to the visitor. With no information about the visitor, the important question is which news needs to be shown to get the most clicks?. The goal is to maximize the engagement of the visitor.

This problem is similar to the slot machine issue in the casino, here the problem is in choosing which news to be shown to the visitor. As in the case of a slot machine where winning money is the goal similarly the news website wants to maximize the engagement of the visitor.

The same case for any website which displays ads to its visitors. In this case, they want to maximize ad-revenue, the same question which ad to be shown to the visitor.

Variants of Muti-Armed Bandit

There many different variants of MAB(Multi-Arm Bandit)

Different Environments:

  • Stochastic and Stationary
  • Adversarial and Non Stationary


  • Cumulative regret
  • tracking the best expert

Type of actions

  • Discrete
  • Continous

These are a few to mention, apart from this there few more that are defined by the data scientist according to different variations.

Multi-Armed Bandit Solutions


First, this is to understand what is Exploration vs Exploitation before learning Epsilon-Greedy.

Exploration vs Exploitation

Exploration means when the agent explores the environment rather than taking advantage of previous steps and adding randomness to the algorithm in hope of finding a better approach to convergence or finding the highest reward and if you compare this to the casino then in hope of getting more money trying random slot machine.

Exploitation means when an agent tries to exploit the environment by taking advantage of the previous steps and increasing the reward and in the case of the casino example gambler using previous information of the slot machine where they have won more times and using that slot machine to make more money.

Exploration vs Exploitation Tradeoff

There is always a trade-off between the exploration and exploitation for the agent, using the random toss of the coin the agent chooses to be either exploration or exploitation. There Another method is to use the Esiplon the exploration rate to choose between the exploitation and exploration, which can be constant throughout or decaying over time using a decay rate.

Epsilon-Greedy is a simple method to balance exploration and exploitation by choosing between exploration and exploitation randomly.

The epsilon-greedy, where epsilon refers to the probability of choosing to explore, exploits most of the time with a small chance of exploring.

Pseudocode for the Epsilon Greedy bandit algorithm

In the Epsilon Greedy bandit algorithm, we:

  • Choose a random machine to pull with probability = ϵ
  • Choose the best machine to pull with probability = 1−ϵ

Pseudocode for the algorithm:

set epsilon # e.g. epsilon = 0.1
for t = 1…numRounds:
if r< epsilon:
pull a random bandit arm
pull the bandit arm with the best experimental mean
update the experimental mean of the pulled bandit arm

Note — This implementation is also done in python on Amazon datasets

Upper Confidence Bound (UCB)

The Upper Confidence Bound (UCB) algorithm is often phrased as “optimism in the face of uncertainty.” That is if a bandit has the potential to be the best we should try it. As one tries bandits more and more the uncertainty around the estimate of its payout shrinks. The algorithm allocates exploratory effort to actions that might be optimal and are in this sense “optimistic.”

In our casino example if there a chance of winning in a slot machine we should keep trying with optimism that the gambler will win.

Practically one wants to limit this optimism to what is plausibly possible. This optimism is limited by an optimism hyperparameter which should be tuned and adjusted.

t = timesteps

number of times action (a) is taken

In the above figure

  • the bracket tells us the confidence
  • Q(A) is the action that lies between the confidence interval
  • Lower bracket — lower bound
  • Upper bracket — upper bound

The figure has the 4 actions with different confidence intervals, UCB algorithm will optimistically pick the action that has the highest upper bound i.e. A. By this either it will have the highest value and get the highest reward, or by taking that we will get to learn about an action. Initially, UCB explores more to systematically reduce uncertainty but its exploration reduces over time.

Pseudocode for the algorithm:

first k turns: initialize experimental means by pulling each arm once
set c # confidence range hyperparameter
for t = k+1...numRounds:
for i = 1...k:
calculate a(i, t) = c* square_root(2 * log(t / p(i, t)))
pull arm i that maximizes experimental mean + a(i, t)
update the experimental mean of the pulled bandit arm

Note — This implementation is also done in python on Amazon datasets

Thompson Sampling

Thompson sampling, named after William R. Thompson, is the way to address the exploration-exploitation trade-off in the multi-armed bandit problem.

Thompson Sampling is an algorithm with the exploration and exploitation to maximize the cumulative rewards obtained by performing an action. Thompson Sampling is also sometimes referred to as Posterior Sampling or Probability Matching as it incorporates the Bayesian approach to the bandit problem.

Thompson sampling is a powerful but simple algorithm that implicitly balances exploration and exploitation based on quality and uncertainty. If we have a k-armed bandit and model the probability that each arm gives us a positive reward. The goal is of course to maximize our rewards by pulling the most promising arm.

Thompson sampling, on the other hand, incorporates uncertainty by modeling the bandit’s Bernouilli parameter with a prior beta distribution. Thompson sampling can be generalized to sample from any arbitrary distributions over its parameters.

Beta Distribution

The beauty of the algorithm is that it always chooses the action with the highest expected reward, with the twist that this reward is weighted by uncertainty through the posterior distribution.

#for each turnfor turn_index in range(number_of_turns):
index_of_machine_to_play = -1
max_beta = -1 # note that max beta
#determine which slot machine to play for this turn for each slot

for slot_machine_index in range(number_of_bandits):

#Define the shape parameters for the beta distribution. The shape
will depend on the number of wins and losses that have thus far
been observed for this particular slot machine.
a = number_of_positive_rewards[slot_machine_index] + 1
b = number_of_negative_rewards[slot_machine_index] + 1
#Get a random value from the beta distribution whose shape is
defined by the number of wins and losses that have thus far been
observed for this slot machine
random_beta = np.random.beta(a, b) #if this is the largest beta value thus far observed for this
if random_beta > max_beta: #update the maximum beta value thus far observed
max_beta = random_beta
#set the machine to play to the current machine
index_of_machine_to_play = slot_machine_index
#play the selected slot machine, and record whether we win or lose
if outcomes[turn_index][index_of_machine_to_play] == 1:
number_of_positive_rewards[index_of_machine_to_play] += 1
number_of_negative_rewards[index_of_machine_to_play] += 1

The practical differences between A/B testing and bandit testing

Before understanding the difference let's see what is A/B testing first.

A/B testing

Let take the example of a news website where 50% of the traffic is sent to the control and 50% of the traffic to the variation, run the test and then decide whether to implement the winning variation.

An AB test is a statistical hypothesis testing, a process whereby a hypothesis is made about the relationship between two data sets and those data sets are then compared against each other to determine if there is a statistically significant relationship or not.

To put this in a practical sense. for a news website two recommendation models are built, then data from both the recommendation of both observed and compared to determine which recommendation model is statistically significant.

A/B testing consists of a short period of pure exploration, where you’re randomly assigning equal numbers of users to control and Variation website. It then jumps into a long period of pure exploitation, where you send 100% of your users to the more successful version of your site.

Bandit testing tries to solve the explore-exploit problem in a different way. Instead of pure exploration and pure exploitation, bandit tests are adaptive over time and can involve both exploitation and exploration simultaneously.

Where Bandits can be used

  • Ad Campaigns
  • Movie recommendation
  • Headline or news Recommendation
  • Targeting

Stephen Pavlovich is the director of the conversion, a UK agency specializing in conversion rate optimization for highly competitive niches. Stephen helps start-ups and multinational companies maximize their conversion rates rapidly. He says

“A/B testing isn’t that useful for short-term campaigns. If you’re running tests on an e-commerce site for Black Friday, an A/B test isn’t that practical — you might only be confident in the result at the end of the day. Instead, a MAB will drive more traffic to the better-performing variation — and that in turn can increase revenue.”

There are at least two problems with A/B testing:

  • It jumps discretely from exploration to exploitation, when you might be able to transition more smoothly.
  • During the exploratory phase (the test), it wastes resources exploring inferior options in order to gather as much data as possible.

Bandit testing tries to solve the explore-exploit problem in a different way. Instead of two distinct periods of pure exploration and pure exploitation, bandit tests are adaptive and simultaneously include exploration and exploitation.

So, bandit algorithms try to minimize opportunity costs and minimize regret (the difference between your actual payoff and the payoff you would have collected had you played the optimal — best — options at every opportunity)

Drawbacks of Bandits

One of the issues with Multi-arm bandit testing is its computational complexity and difficulty in programming than with difficult problems. It creates a problem if the organization is having a shortage of DevOps team.


The MAB is implemented on Amazon Product Reviews — Electronic Products User Ratings this data can found in Kaggle

Amazon Product Reviews

Amazon Data

The dataset here is taken from the below website.

Source — Amazon Reviews data ( The repository has several datasets. For this case study, we are using the Electronics dataset.

Attribute Information:

  • userId: Every user identified with a unique id (First Column)
  • productId: Every product identified with a unique id(Second Column)
  • Rating: Rating of the corresponding product by the corresponding user(Third Column)
  • timestamp: Time of the rating ( Fourth Column)
Data head

In this, we will be converting the ratings

  • Rating > 4 will be mapped to reward = 1
  • Rating ≤ 4 will be mapped to reward = 0

As explained in the pseudocode above for the Epsilon, Thompson sampling and UCB are implemented in the python code, which is available on Github. Each of the algorithms is present in a separate python script to maintain modularity.

Offline Replay Evaluation

To make this work offline will be using what is called has online Replay of the online algorithm and evaluate it using that.

Bandit’s recommendations will be different from those generated by the model whose recommendations are reflected in your historic dataset. This creates problems that lead to some of the key challenges in evaluating these algorithms using historic data.

  1. this is problematic is that your data is probably biased
  2. algorithm will often produce recommendations that are different from the recommendations seen by users in the historic dataset

This is the high-level reason, a complete explanation of this issue will be explained in another article since that is not the scope of this article.

The solution to the above problem is called replay. Replay evaluation essentially takes historic events and algorithm’s recommendations at each time step and throws out all samples except for those where your model’s recommendation is the same as the one the user saw in the historic dataset.

From conducting these experiment with

  1. A/B testing
  2. Epsilon
  3. UCB
  4. Thompson

Upon the simulation, the output is used to compare the results and see which algorithms are performing well.


  • From above it clear that Multi-arm bandit is way better compared to A/B testing
  • Among the Multi-arm bandit, Thompson sampling works the best

The Thompson Sampling results (brown) are the best of them all. This bandit performs better than the ε-Greedy bandit because it dynamically adjusts the rate at which it explores — rather than using a constant rate. In the beginning, it explores more often, but over time, it explores less often. As a result, this bandit quickly identifies the best product and exploits it more frequently after it has been found, leading to high performance in both the short- and long-term.

The entire code can be found in our Github


Abhishek Maheshwarappa | Linkedin
Jiaxin Tong |Linkedin