Multi-Armed Bandits for recommendations and A/B testing on the Amazon ratings data set


What is a Multi-Arm Bandit?

Las Vegas Strip (source: https://skycondoslv.com/)

Bandit Algorithms Setting

Multi-Armed Bandit Examples

Variants of Multi-Armed Bandits

Multi-Armed Bandit Solutions

Epsilon-Greedy

Exploration vs Exploitation

Exploration vs Exploitation Tradeoff

set epsilon  # e.g. epsilon = 0.1
for t = 1...numRounds:
    r = random.uniform(0, 1)
    if r < epsilon:
        pull a random bandit arm
    else:
        pull the bandit arm with the best experimental mean
    update the experimental mean of the pulled bandit arm
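The pseudocode above can be sketched in Python. This is a minimal simulation, not the implementation used later in the article: the Bernoulli success probabilities in `arm_probs` are made-up example values.

```python
import random

def epsilon_greedy(num_rounds, arm_probs, epsilon=0.1):
    """Simulate epsilon-greedy on Bernoulli arms with the given success probabilities."""
    k = len(arm_probs)
    counts = [0] * k    # times each arm has been pulled
    means = [0.0] * k   # experimental mean reward of each arm
    total_reward = 0
    for t in range(num_rounds):
        if random.random() < epsilon:
            arm = random.randrange(k)                    # explore: random arm
        else:
            arm = max(range(k), key=lambda i: means[i])  # exploit: best experimental mean
        reward = 1 if random.random() < arm_probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean update
        total_reward += reward
    return means, total_reward
```

The incremental update avoids storing the full reward history: the running mean is adjusted by the error of the latest reward divided by the pull count.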

Upper Confidence Bound (UCB)

first k turns: initialize experimental means by pulling each arm once
set c  # confidence range hyperparameter
for t = k+1...numRounds:
    for i = 1...k:
        calculate a(i, t) = c * square_root(2 * log(t) / n(i, t))
        # n(i, t) = number of times arm i has been pulled so far
    pull the arm i that maximizes experimental mean + a(i, t)
    update the experimental mean of the pulled bandit arm
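A runnable sketch of the UCB1 pseudocode above, again on simulated Bernoulli arms (the probabilities are illustrative, not from the Amazon data):

```python
import math
import random

def ucb1(num_rounds, arm_probs, c=1.0):
    """Simulate UCB1 on Bernoulli arms with the given success probabilities."""
    k = len(arm_probs)
    counts = [0] * k
    means = [0.0] * k

    def pull(i):
        reward = 1 if random.random() < arm_probs[i] else 0
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]
        return reward

    total_reward = 0
    for i in range(k):  # first k turns: pull each arm once
        total_reward += pull(i)
    for t in range(k + 1, num_rounds + 1):
        # choose the arm with the largest upper confidence bound:
        # experimental mean + c * sqrt(2 * log(t) / n_i)
        arm = max(range(k),
                  key=lambda i: means[i] + c * math.sqrt(2 * math.log(t) / counts[i]))
        total_reward += pull(arm)
    return means, counts, total_reward
```

Arms that have been pulled rarely get a large confidence bonus, so UCB explores them without any explicit randomness; as `counts[i]` grows, the bonus shrinks and the choice is driven by the experimental mean.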

Thompson Sampling

Beta Distribution
# for each turn
for turn_index in range(number_of_turns):
    index_of_machine_to_play = -1
    max_beta = -1
    # determine which slot machine to play for this turn
    for slot_machine_index in range(number_of_bandits):
        # Define the shape parameters for the beta distribution. The shape
        # depends on the number of wins and losses that have thus far been
        # observed for this particular slot machine.
        a = number_of_positive_rewards[slot_machine_index] + 1
        b = number_of_negative_rewards[slot_machine_index] + 1
        # Get a random value from the beta distribution whose shape is
        # defined by the number of wins and losses that have thus far been
        # observed for this slot machine
        random_beta = np.random.beta(a, b)
        # if this is the largest beta value thus far observed for this turn,
        # set the machine to play to the current machine
        if random_beta > max_beta:
            max_beta = random_beta
            index_of_machine_to_play = slot_machine_index
    # play the selected slot machine, and record whether we win or lose
    if outcomes[turn_index][index_of_machine_to_play] == 1:
        number_of_positive_rewards[index_of_machine_to_play] += 1
    else:
        number_of_negative_rewards[index_of_machine_to_play] += 1

The practical differences between A/B testing and bandit testing

A/B testing

Implementation

Amazon Data

Data head

Offline Replay Evaluation
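Offline replay evaluates a bandit policy against logged interaction data rather than live traffic: the log is replayed, and the policy is scored only on the events where the item it would have chosen matches the item actually shown in the log. The sketch below is a simplified illustration of that idea, not the article's exact implementation; the `policy` callable and the `(item, reward)` log format are assumptions for the example.

```python
import random

def offline_replay(logged_events, policy, k):
    """Replay logged (item, reward) pairs over k items; score the policy only
    on events where it would have chosen the same item the log shows."""
    matched, total_reward = 0, 0
    pos = [0] * k  # per-item positive-reward counts (the policy's state)
    neg = [0] * k  # per-item negative-reward counts
    for item, reward in logged_events:
        choice = policy(pos, neg)
        if choice == item:  # policy agrees with the log: count this event
            matched += 1
            total_reward += reward
            if reward == 1:
                pos[item] += 1
            else:
                neg[item] += 1
    # average reward over the matched events (0.0 if nothing matched)
    return total_reward / matched if matched else 0.0
```

Because only matching events are scored, the estimate is unbiased when the logging policy chose items uniformly at random, which is why replay evaluation pairs naturally with randomized logs.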

Author

Reference
