Multi-Arm Bandits for recommendations and A/B testing on Amazon ratings data set


What is a Multi-Arm Bandit?


Bandit Algorithms Setting

  • The agent chooses an action from a set of actions
  • The environment responds with a reward (or regret) for the chosen action
  • The agent's aim is to maximize the cumulative reward, or equivalently to minimize the cumulative regret: the difference between the total reward gained over n rounds and the total reward that would have been gained by always choosing the optimal action
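The cumulative regret defined above can be computed directly once the arms' true means are known. A minimal sketch in Python (the function name and the example means are illustrative, not from the article):

```python
import numpy as np

def cumulative_regret(true_means, pulled_arms):
    """Regret after n rounds: n * best_mean minus the expected reward
    of the arms actually pulled (the standard definition)."""
    true_means = np.asarray(true_means, dtype=float)
    best_mean = true_means.max()
    n = len(pulled_arms)
    return n * best_mean - true_means[list(pulled_arms)].sum()
```

For example, pulling arms [0, 1, 1] when the true means are [0.2, 0.8] gives regret 3 * 0.8 - (0.2 + 0.8 + 0.8) = 0.6.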

Multi-Armed Bandit Examples

Variants of Multi-Armed Bandit

  • Stochastic and Stationary
  • Adversarial and Non-Stationary
  • Cumulative regret
  • Tracking the best expert
  • Discrete
  • Continuous

Multi-Armed Bandit Solutions



Exploration vs Exploitation Tradeoff

  • Choose a random machine to pull with probability = ϵ
  • Choose the best machine to pull with probability = 1−ϵ
set epsilon # e.g. epsilon = 0.1
for t = 1…numRounds:
    draw r uniformly at random from [0, 1]
    if r < epsilon:
        pull a random bandit arm
    else:
        pull the bandit arm with the best experimental mean
    update the experimental mean of the pulled bandit arm
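The epsilon-greedy pseudocode can be turned into a small self-contained simulation. This is a sketch against simulated Bernoulli arms; the arm probabilities and helper names are illustrative, not from the article:

```python
import numpy as np

def epsilon_greedy(true_means, num_rounds=10000, epsilon=0.1, seed=0):
    """Epsilon-greedy on simulated Bernoulli arms.
    Returns the experimental means and the total reward collected."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)   # pulls per arm
    means = np.zeros(k)    # experimental mean per arm
    total_reward = 0.0
    for t in range(num_rounds):
        if rng.random() < epsilon:
            arm = int(rng.integers(k))       # explore: random arm
        else:
            arm = int(np.argmax(means))      # exploit: best experimental mean
        reward = float(rng.random() < true_means[arm])  # Bernoulli reward
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
        total_reward += reward
    return means, total_reward
```

With three arms of win probability 0.2, 0.5, and 0.8, the experimental means converge toward the true values and the best arm ends up pulled most often.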

Upper Confidence Bound (UCB)

  • The bracket term gives the confidence interval around each arm's estimate
  • Q(a), the estimated value of action a, lies within that confidence interval
  • Lower bracket — lower bound of the interval
  • Upper bracket — upper bound of the interval
first k turns: initialize experimental means by pulling each arm once
set c # confidence range hyperparameter
for t = k+1...numRounds:
    for i = 1...k:
        calculate a(i, t) = c * square_root(2 * log(t) / p(i, t)) # p(i, t) = number of pulls of arm i so far
    pull the arm i that maximizes experimental mean + a(i, t)
    update the experimental mean of the pulled bandit arm
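Likewise, the UCB pseudocode can be sketched as a self-contained simulation, again with simulated Bernoulli arms and illustrative parameters:

```python
import numpy as np

def ucb(true_means, num_rounds=10000, c=1.0, seed=0):
    """UCB on simulated Bernoulli arms.
    Returns the experimental means and the total reward collected."""
    rng = np.random.default_rng(seed)
    k = len(true_means)
    counts = np.zeros(k)
    means = np.zeros(k)
    total_reward = 0.0
    # first k turns: initialize by pulling each arm once
    for arm in range(k):
        reward = float(rng.random() < true_means[arm])
        counts[arm] = 1
        means[arm] = reward
        total_reward += reward
    for t in range(k + 1, num_rounds + 1):
        bonus = c * np.sqrt(2.0 * np.log(t) / counts)  # confidence width per arm
        arm = int(np.argmax(means + bonus))            # optimism in the face of uncertainty
        reward = float(rng.random() < true_means[arm])
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]
        total_reward += reward
    return means, total_reward
```

As the confidence bonus shrinks for frequently pulled arms, the algorithm concentrates its pulls on the arm with the highest mean while still occasionally revisiting the others.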

Thompson Sampling

Beta Distribution
import numpy as np

# for each turn
for turn_index in range(number_of_turns):
    index_of_machine_to_play = -1
    max_beta = -1  # largest beta value observed so far this turn
    # determine which slot machine to play for this turn
    for slot_machine_index in range(number_of_bandits):
        # Define the shape parameters for the beta distribution. The shape
        # depends on the number of wins and losses that have thus far been
        # observed for this particular slot machine.
        a = number_of_positive_rewards[slot_machine_index] + 1
        b = number_of_negative_rewards[slot_machine_index] + 1
        # Draw a random value from the beta distribution whose shape is
        # defined by the wins and losses observed so far for this machine
        random_beta = np.random.beta(a, b)
        # if this is the largest beta value observed so far this turn
        if random_beta > max_beta:
            # update the maximum beta value thus far observed
            max_beta = random_beta
            # set the machine to play to the current machine
            index_of_machine_to_play = slot_machine_index
    # play the selected slot machine, and record whether we win or lose
    if outcomes[turn_index][index_of_machine_to_play] == 1:
        number_of_positive_rewards[index_of_machine_to_play] += 1
    else:
        number_of_negative_rewards[index_of_machine_to_play] += 1
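The loop above assumes a few variables already exist. A minimal setup that makes it runnable could look like this (the win probabilities and turn count are illustrative, not from the article):

```python
import numpy as np

number_of_turns = 5000
number_of_bandits = 3
win_probabilities = [0.2, 0.5, 0.8]  # illustrative Bernoulli arms

# Pre-generate the outcome of every (turn, machine) pair
np.random.seed(0)
outcomes = (np.random.rand(number_of_turns, number_of_bandits)
            < np.array(win_probabilities)).astype(int)

# Win/loss counters, used as the beta-distribution shape parameters
number_of_positive_rewards = np.zeros(number_of_bandits, dtype=int)
number_of_negative_rewards = np.zeros(number_of_bandits, dtype=int)
```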

The practical differences between A/B testing and bandit testing

A/B testing

  • Ad campaigns
  • Movie recommendation
  • Headline or news recommendation
  • Targeting
  • It jumps discretely from exploration to exploitation, when you might be able to transition more smoothly.
  • During the exploratory phase (the test), it wastes resources exploring inferior options in order to gather as much data as possible.


Amazon Data

  • userId: Every user identified with a unique id (First Column)
  • productId: Every product identified with a unique id (Second Column)
  • Rating: Rating of the corresponding product by the corresponding user (Third Column)
  • timestamp: Time of the rating (Fourth Column)
  • Rating > 4 will be mapped to reward = 1
  • Rating ≤ 4 will be mapped to reward = 0
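This mapping is a one-liner in pandas. A sketch with a toy frame following the four-column schema above (the values are made up for illustration):

```python
import pandas as pd

# Columns follow the schema above: userId, productId, Rating, timestamp
ratings = pd.DataFrame({
    "userId": ["u1", "u2", "u3", "u4"],
    "productId": ["p1", "p1", "p2", "p2"],
    "Rating": [5.0, 4.0, 3.0, 4.5],
    "timestamp": [1, 2, 3, 4],
})

# Rating > 4 -> reward 1, Rating <= 4 -> reward 0
ratings["reward"] = (ratings["Rating"] > 4).astype(int)
```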

Offline Replay Evaluation

  1. One problem is that your historical data is probably biased toward the recommendations that were actually shown to users
  2. The algorithm will often produce recommendations that differ from the recommendations users saw in the historic dataset, so those rounds cannot be evaluated and are discarded
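The replay idea can be sketched as a small harness (hypothetical code, not the article's actual evaluation): run the candidate policy over the logged rounds, and score only the rounds where its choice matches the logged recommendation:

```python
import numpy as np

def replay_evaluate(policy, logged_arms, logged_rewards, seed=0):
    """Offline replay: score a policy against logged (arm, reward) pairs.
    A round counts only when the policy picks the same arm that was logged."""
    rng = np.random.default_rng(seed)
    matched_rewards = []
    for logged_arm, reward in zip(logged_arms, logged_rewards):
        chosen = policy(rng)
        if chosen == logged_arm:  # only matching rounds can be evaluated
            matched_rewards.append(reward)
    return float(np.mean(matched_rewards)) if matched_rewards else 0.0

# Example: a uniform-random policy over 3 arms
uniform_policy = lambda rng: int(rng.integers(3))
```

A policy that always picks arm 0 would, for logged arms [0, 1, 2, 0] with rewards [1, 0, 1, 1], be scored only on the first and last rounds.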
  1. A/B testing
  2. Epsilon-greedy
  3. UCB
  4. Thompson sampling
  • From the above it is clear that the multi-armed bandits perform far better than A/B testing
  • Among the multi-armed bandit algorithms, Thompson sampling works the best






Abhishek Maheshwarappa
Data Scientist | ML-Ops