Reinforcement Learning for Stock Trading

Abhishek Maheshwarappa
6 min read · Nov 15, 2020

One of the most interesting fields of AI is Reinforcement Learning, which rose to popularity in 2016 when AlphaGo came into the spotlight.

AlphaGo is the first computer program to defeat a professional human Go player, beating Lee Sedol in 2016.

(Image credit: People's Daily)

Before getting into trading, let's define some of the important terms in Reinforcement Learning (a toy interaction loop is sketched after the list):

  1. action — the mechanism by which the agent transitions between states of the environment
  2. agent — the entity that uses a policy to maximize expected return gained from transitioning between states of the environment.
  3. environment — the world that contains the agent and allows the agent to observe that world’s state
  4. reward — the numerical result of taking an action in a state, as defined by the environment
  5. state — the parameter values that describe the current configuration of the environment, which the agent uses to choose an action
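
To make these terms concrete, here is a toy, self-contained interaction loop. Everything in it is a made-up placeholder for illustration, not the trading agent built later in this post:

import random

# Toy illustration of the terms above: the "state" is just a step counter,
# the reward is +1 whenever action 1 is chosen, and an episode ends after
# a fixed number of steps.
class ToyEnv:
    def __init__(self, length=5):
        self.length = length
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t                       # initial state

    def step(self, action):
        self.t += 1
        reward = 1 if action == 1 else 0    # reward is defined by the environment
        done = self.t >= self.length        # episode ends after `length` steps
        return self.t, reward, done

class RandomAgent:
    def act(self, state):
        return random.choice([0, 1, 2])     # the agent's (very naive) policy

env, agent = ToyEnv(), RandomAgent()
state, done, total_reward = env.reset(), False, 0
while not done:
    action = agent.act(state)               # agent acts on the current state
    state, reward, done = env.step(action)  # environment returns next state and reward
    total_reward += reward
print("Total reward:", total_reward)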

Stock Trading

Investopedia defines stock trading as the buying and selling of shares in a particular company; if you own the stock, you own a piece of the company.

Stock trading is a constant process of learning, getting feedback, and understanding market trends, through which the trader tries to optimize profit and loss.

This is essentially a trial-and-error approach: sometimes the decision taken will be wrong, and sometimes it will be right. A wrong decision leads to a loss of money, and a right decision leads to an increase in the balance. Based on these experiences, the trader learns over time when to buy, sell, or hold the current stocks.

This can be modeled with Reinforcement Learning, which likewise learns a policy over time.

In this article we will use Q-Learning to learn from this experience by doing the same thing a trader does: buying, selling, or holding stocks with a given budget to spend.

In Q-learning, the possible states and actions are represented in a Q-table, and we use the Bellman equation to update the Q-values in that table.

Bellman Equation

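In the form used later in this post to build the training target (the reward plus the discounted best Q-value of the next state), the Bellman equation can be written as:

Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s_{t+1}, a')

where rₜ is the reward received for taking action aₜ in state sₜ and γ is the discount factor.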

A Q-table has states as rows and actions as columns, and it helps find the best possible action for a given state.

Q(sₜ, aₜ) represents the maximum discounted future reward when we perform action a in state s and continue optimally from then on.

We can think of this function as the maximum possible account balance we can achieve at the end of a training episode after performing action a in state s.
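
As a toy illustration (the numbers below are made up), looking up the best action for a state in a Q-table is just an argmax over that state's row; the action indices follow the encoding used later in this post (0 = hold, 1 = buy, 2 = sell):

import numpy as np

# Hypothetical Q-table: 4 states (rows) x 3 actions (columns: hold, buy, sell)
q_table = np.array([
    [0.5, 0.1, 0.2],
    [0.0, 0.7, 0.1],
    [0.3, 0.2, 0.6],
    [0.1, 0.4, 0.4],
])

actions = ["Hold", "Buy", "Sell"]
state = 2
best_action = int(np.argmax(q_table[state]))  # column with the highest Q-value
print(actions[best_action])                   # -> Sell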

Actions

In our case of stock trading, the possible actions are:

  1. Buy
  2. Sell
  3. Hold

Reward

The reward reflects how the current value compares with the previous step: in this implementation it is the profit realized when a stock is sold (floored at zero), and zero for a buy or hold.
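
A minimal sketch of that reward, mirroring the training loop later in this post (the helper name here is hypothetical):

def compute_reward(action, current_price, bought_price=None):
    # Hypothetical helper: mirrors the reward used in the training loop below.
    if action == 2 and bought_price is not None:      # sell
        return max(current_price - bought_price, 0)   # realized profit, never negative
    return 0                                          # buy or hold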

The Q function then picks the action with the highest Q-value. Because the state space here is too large for an explicit table, we will use Deep Reinforcement Learning and approximate the Q function with a neural network.

We will build the neural network using the Keras Sequential API:

from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Q-network: maps a state vector to one Q-value per action
model = Sequential()
model.add(Dense(units=64, input_dim=self.state_size, activation="relu"))
model.add(Dense(units=32, activation="relu"))
model.add(Dense(units=8, activation="relu"))
model.add(Dense(self.action_size, activation="linear"))
model.compile(loss="mse", optimizer=Adam(lr=0.001))

The network is a Keras Sequential model built entirely from Dense layers: three hidden layers (64, 32, and 8 units) with the ReLU activation, a linear output layer, mean squared error loss, and the Adam optimizer. The input to the network is the state, and the output is a Q-value for each possible action.

Exploration and Exploitation trade-off

The following function chooses the action for a given state based on epsilon, which controls the exploration-exploitation trade-off: with probability epsilon the agent explores by taking a random action, and otherwise it exploits the model's current Q-value estimates.

def act(self, state):
    # during training, explore with probability epsilon (random action)
    if not self.is_eval and np.random.rand() <= self.epsilon:
        return random.randrange(self.action_size)
    # otherwise exploit: pick the action with the highest predicted Q-value
    options = self.model.predict(state)
    return np.argmax(options[0])

Experience Replay

In Deep Reinforcement Learning, experience replay is used during training. Experience replay stores the agent's experience at each time step in a data set called the replay memory. The reason for using replay memory is to break the correlation between consecutive samples: if the network learned only from consecutive samples of experience as they occur in the environment, those samples would be highly correlated and learning would be inefficient. Sampling from the replay memory breaks this correlation.
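
The snippets in this post read from self.memory without showing how it is created; a minimal sketch of the agent's replay memory (the capacity of 1000 is an assumed, illustrative value) is a bounded deque of transition tuples:

from collections import deque

class Agent:
    def __init__(self, state_size, action_size):
        self.state_size = state_size
        self.action_size = action_size
        # Replay memory: a bounded buffer of (state, action, reward, next_state, done)
        # tuples. The maxlen of 1000 is an assumption; transitions are appended in the
        # training loop below via agent.memory.append(...).
        self.memory = deque(maxlen=1000)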

def expReplay(self, batch_size):
    # sample the most recent experiences from the replay memory
    mini_batch = []
    l = len(self.memory)
    for i in range(l - batch_size + 1, l):
        mini_batch.append(self.memory[i])

    for state, action, reward, next_state, done in mini_batch:
        target = reward
        if not done:
            # Bellman target: reward plus discounted max future Q-value
            target = reward + self.gamma * \
                np.amax(self.model.predict(next_state)[0])

        # update only the Q-value of the action that was actually taken
        target_f = self.model.predict(state)
        target_f[0][action] = target
        self.model.fit(state, target_f, epochs=1, verbose=0)

    # decay epsilon so the agent explores less as training progresses
    if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay

Training to learn Trading

The next step is to train the Keras model, based on the window size provided by the user. In this implementation of Q-learning, we train the model for short-term stock trading. The model uses n-day windows of closing prices to determine whether the best action to take at a given time is to buy, sell, or hold.
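
The helper getState used below is not defined in this post; as a sketch (treat the details as an assumption), one common way to build the state is to take the sigmoid of the consecutive price differences in the window, padding with the first price when the window starts before the data does:

import numpy as np

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def getState(data, t, n):
    # Sketch of a state builder (an assumption about the helper used below).
    # `data` is a plain Python list of closing prices; the function returns the
    # sigmoid of the n-1 consecutive price differences ending at time t.
    d = t - n + 1
    block = data[d:t + 1] if d >= 0 else [data[0]] * (-d) + data[0:t + 1]
    res = [sigmoid(block[i + 1] - block[i]) for i in range(n - 1)]
    return np.array([res])   # shape (1, n-1) to match the network input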

# l (number of time steps in the price series) and batch_size are assumed
# to be defined earlier, along with the hyperparameters used below.
for e in range(episode_count + 1):
    print("Episode " + str(e) + "/" + str(episode_count))
    state = getState(data, 0, window_size + 1)
    total_profit = 0
    agent.inventory = []

    for t in range(l):
        action = agent.act(state)

        # hold by default: no reward, just move to the next state
        next_state = getState(data, t + 1, window_size + 1)
        reward = 0

        # buy: add the current price to the inventory
        if action == 1:
            agent.inventory.append(data[t])
            print("Buy: " + formatPrice(data[t]))

        # sell: realize the profit on the earliest bought share
        elif action == 2 and len(agent.inventory) > 0:
            bought_price = agent.inventory.pop(0)
            reward = max(data[t] - bought_price, 0)
            total_profit += data[t] - bought_price
            print("Sell: " + formatPrice(data[t]) + " | Profit: "
                  + formatPrice(data[t] - bought_price))

        done = True if t == l - 1 else False
        agent.memory.append((state, action, reward, next_state, done))
        state = next_state

        if done:
            print("--------------------------------")
            print("Total Profit: " + formatPrice(total_profit))
            print("--------------------------------")

        # train on a batch of replayed experience once enough is collected
        if len(agent.memory) > batch_size:
            agent.expReplay(batch_size)

    # checkpoint the model every 10 episodes
    if e % 10 == 0:
        agent.model.save("models/model_ep" + str(e))

In the above code, the outer for loop runs over the number of episodes, which is one of the hyperparameters, and the inner for loop goes through the entire data set. Yes, for every episode the agent trains on the entire data.

Evaluation

Finally, after training we have an agent ready to trade. To see how well the agent performs, we need to evaluate it on data.
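
As a sketch of the setup for this evaluation (the names getStockDataVec and model_ep30 are illustrative assumptions, not taken from this post), we load a saved checkpoint back, put the agent in evaluation mode so that act() stops exploring, and initialize the bookkeeping variables the loop below relies on:

from keras.models import load_model

model_name = "model_ep30"                      # one of the checkpoints saved above (assumed)
agent.model = load_model("models/" + model_name)
agent.is_eval = True                           # act() will no longer take random actions

data = getStockDataVec(stock_name)             # assumed helper returning closing prices
l = len(data) - 1
state = getState(data, 0, window_size + 1)
total_profit = 0
agent.inventory = []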

for t in range(l):   # range instead of Python 2's xrange
    action = agent.act(state)

    # hold by default
    next_state = getState(data, t + 1, window_size + 1)
    reward = 0

    # buy
    if action == 1:
        agent.inventory.append(data[t])
        print("Buy: " + formatPrice(data[t]))

    # sell
    elif action == 2 and len(agent.inventory) > 0:
        bought_price = agent.inventory.pop(0)
        reward = max(data[t] - bought_price, 0)
        total_profit += data[t] - bought_price
        print("Sell: " + formatPrice(data[t]) + " | Profit: "
              + formatPrice(data[t] - bought_price))

    done = True if t == l - 1 else False
    agent.memory.append((state, action, reward, next_state, done))
    state = next_state

    if done:
        print("--------------------------------")
        print(stock_name + " Total Profit: "
              + formatPrice(total_profit))
        print("--------------------------------")

This lets us evaluate how well the agent trades.

Conclusion

The financial markets are not easy to predict; even the most experienced people fail to predict market crashes or trends. Intelligent guessing is the best we can do, and that is what Reinforcement Learning offers, but it has to be combined with the other variables that affect the market.

Note: This was trained on Northeastern University's high-performance Discovery cluster. Training on a laptop will take a long time, so the suggestion is to train on a cluster. The GitHub repository contains all the necessary information on the requirements and dependencies to run this.

Feel free to reach out if you have any questions.

The entire code is available on GitHub.

Author

Abhishek Maheshwarappa | LinkedIn
Jiaxin Tong | LinkedIn

References

  1. https://deeplizard.com/
  2. https://keras.io/
  3. https://github.com/llSourcell/Q-Learning-for-Trading
  4. https://arxiv.org/pdf/1811.09549.pdf
