Reinforcement Learning for Stock Trading
One of the most interesting fields of AI is reinforcement learning, which rose to popularity in 2016 when the computer program AlphaGo came to light.
AlphaGo was the first computer program to defeat a professional human Go player, and in 2016 it beat Lee Sedol, one of the world's top players.
Before turning to trading, let's define some of the important terms in reinforcement learning:
- action — the mechanism by which the agent transitions between states of the environment
- agent — the entity that uses a policy to maximize expected return gained from transitioning between states of the environment.
- environment — the world that contains the agent and allows the agent to observe that world’s state
- reward — the numerical result of taking an action in a state, as defined by the environment
- state — the parameter values that describe the current configuration of the environment, which the agent uses to choose an action
Stock Trading
Investopedia defines stock trading as the buying and selling of shares in a particular company; if you own the stock, you own a piece of the company.
Stock trading is a constant process of learning, getting feedback, and understanding market trends, through which the trader tries to maximize profit and minimize loss.
This is a trial-and-error approach: sometimes a decision turns out wrong and sometimes it turns out right. A wrong decision leads to a loss of money, and a right decision increases the balance. Based on these experiences, the trader learns over time when to buy, sell, or hold the current stocks.
Reinforcement learning can do the same thing, because it also learns a policy over time.
We will use Q-learning in this article to learn from this experience by doing what the trader does: buying, selling, or holding stocks with a given budget to spend.
In Q-learning, the possible states and actions are represented by a Q-table, and we will use the Bellman equation to update the Q-values in the Q-table.
Bellman Equation
A Q-table has states as rows and actions as columns, and it helps in finding the best possible action for a given state.
Q(sₜ, aₜ) represents the maximum discounted future reward when we perform action aₜ in state sₜ and continue optimally from then on.
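Written as an equation, this definition takes the standard Bellman form, where rₜ is the reward received after taking action aₜ in state sₜ, and γ is the discount factor (the `gamma` that appears in the replay code later in this article):

```latex
Q(s_t, a_t) = r_t + \gamma \max_{a'} Q(s_{t+1}, a')
```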
We can think of this function as the maximum possible account balance we can achieve at the end of a training episode after we perform action a in state s.
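For a small, discrete problem the Q-table can be held directly in memory. The following is a minimal sketch of a single tabular Q-learning update. The state labels, the learning rate `alpha`, the default `gamma`, and the helper `q_update` are illustrative assumptions and not part of this article's code, which uses a neural network instead of a table.

```python
from collections import defaultdict

ACTIONS = ["hold", "buy", "sell"]  # the three actions used in this article


def q_update(q_table, state, action, reward, next_state, alpha=0.1, gamma=0.95):
    """Apply one tabular Q-learning update and return the new Q-value.

    alpha (learning rate) and gamma (discount factor) are assumed values.
    """
    best_next = max(q_table[next_state].values())
    target = reward + gamma * best_next  # Bellman target
    old = q_table[state][action]
    q_table[state][action] = old + alpha * (target - old)
    return q_table[state][action]


# Rows are states, columns are actions; unseen states start at 0.0.
q_table = defaultdict(lambda: {a: 0.0 for a in ACTIONS})

# Hypothetical state labels, used only for illustration.
new_q = q_update(q_table, "price_rising", "buy", reward=1.0, next_state="price_flat")
print(new_q)
```

Each update moves the stored value a fraction `alpha` of the way toward the Bellman target, so repeated visits to the same state and action gradually refine the estimate.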
Actions
In stock trading, the possible actions are:
- Buy
- Sell
- Hold
Reward
The reward is the current value compared with the value at the previous step.
The agent picks the action that has the highest Q-value. We will use deep reinforcement learning for this, with a neural network approximating the Q function.
We will build the neural network with the Keras Sequential API:
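from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam

# Inside the agent class: state_size inputs, one linear output per action
model = Sequential()
model.add(Dense(units=64, input_dim=self.state_size, activation="relu"))
model.add(Dense(units=32, activation="relu"))
model.add(Dense(units=8, activation="relu"))
model.add(Dense(self.action_size, activation="linear"))
model.compile(loss="mse", optimizer=Adam(learning_rate=0.001))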
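The network is a Keras Sequential model with three hidden dense layers (64, 32, and 8 units) that use the ReLU activation function, followed by a linear output layer, and it is trained with the Adam optimizer on a mean squared error loss. The input to the network is the state, and the output is one Q-value per action; the agent takes the action with the highest value.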
Exploration and Exploitation trade-off
The following function chooses an action for a given state. With probability epsilon it explores by picking a random action; otherwise it exploits by picking the action with the highest predicted Q-value.
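import random
import numpy as np

def act(self, state):
    # Explore: during training, take a random action with probability epsilon
    if not self.is_eval and np.random.rand() <= self.epsilon:
        return random.randrange(self.action_size)
    # Exploit: otherwise take the action with the highest predicted Q-value
    options = self.model.predict(state)
    return np.argmax(options[0])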
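To see the trade-off in isolation, here is a minimal, self-contained sketch of epsilon-greedy selection over a plain list of Q-values. The function name `epsilon_greedy` and the example values are assumptions for illustration; the `act` method above applies the same idea to the network's predictions.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Return a random action index with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))                      # explore
    return max(range(len(q_values)), key=lambda i: q_values[i])  # exploit


q_values = [0.1, 0.7, 0.2]  # hypothetical Q-values for hold, buy, sell

# With epsilon = 0 the agent always exploits and picks the largest value.
greedy_action = epsilon_greedy(q_values, epsilon=0.0)

# With epsilon = 1 every choice is random, so all actions appear over many draws.
rng = random.Random(0)
explored = {epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(200)}
```

A high epsilon early in training lets the agent try all three actions; as epsilon decays, the agent relies more and more on what it has learned.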
Experience Replay
In deep reinforcement learning, experience replay is used during training. It stores the agent’s experiences at each time step in a data set called the replay memory. The reason for using replay memory is to break the correlation between consecutive samples: if the network learned only from consecutive experiences in the order they occurred in the environment, the samples would be highly correlated and learning would be inefficient. Sampling from the replay memory breaks this correlation.
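def expReplay(self, batch_size):
    # Take the most recent batch_size - 1 experiences from memory
    mini_batch = []
    l = len(self.memory)
    for i in range(l - batch_size + 1, l):
        mini_batch.append(self.memory[i])
    for state, action, reward, next_state, done in mini_batch:
        # Bellman target: reward plus discounted best Q-value of the next state
        target = reward
        if not done:
            target = reward + self.gamma * \
                np.amax(self.model.predict(next_state)[0])
        # Only the Q-value of the action actually taken is moved toward the target
        target_f = self.model.predict(state)
        target_f[0][action] = target
        self.model.fit(state, target_f, epochs=1, verbose=0)
    # Decay epsilon so the agent explores less over time
    if self.epsilon > self.epsilon_min:
        self.epsilon *= self.epsilon_decay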
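Note that `expReplay` replays the most recent experiences in order. A common alternative that more directly breaks the correlation between consecutive samples is to draw the mini-batch uniformly at random from the memory. The sketch below shows that variant with a bounded buffer. The class name `ReplayMemory`, its capacity, and the toy transitions are assumptions and not this article's code.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of (state, action, reward, next_state, done) tuples."""

    def __init__(self, capacity=1000):
        # A deque with maxlen silently drops the oldest entry once it is full.
        self.buffer = deque(maxlen=capacity)

    def append(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement breaks the temporal ordering.
        return rng.sample(list(self.buffer), batch_size)

    def __len__(self):
        return len(self.buffer)


memory = ReplayMemory(capacity=5)
for t in range(8):  # toy transitions; only the 5 most recent are kept
    memory.append((t, 0, 0.0, t + 1, False))

batch = memory.sample(3, rng=random.Random(0))
```

Bounding the buffer also keeps memory use constant over long training runs, at the cost of forgetting the oldest experiences.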
Training to learn Trading
The next step is to train the Keras model, based on the window size provided by the user. In this implementation of Q-learning, we train the model for short-term stock trading. The model uses n-day windows of closing prices to determine whether the best action at a given time is to buy, sell, or hold stocks.
# l is the number of time steps; assumed here to be len(data) - 1 so that
# getState(data, t + 1, ...) stays within the data
for e in range(episode_count + 1):
    print("Episode " + str(e) + "/" + str(episode_count))
    state = getState(data, 0, window_size + 1)
    total_profit = 0
    agent.inventory = []
    for t in range(l):
        action = agent.act(state)
        # hold (action 0): nothing to do, reward stays 0
        next_state = getState(data, t + 1, window_size + 1)
        reward = 0
        # buy
        if action == 1:
            agent.inventory.append(data[t])
            print("Buy: " + formatPrice(data[t]))
        # sell
        elif action == 2 and len(agent.inventory) > 0:
            bought_price = agent.inventory.pop(0)
            reward = max(data[t] - bought_price, 0)
            total_profit += data[t] - bought_price
            print("Sell: " + formatPrice(data[t]) + " | Profit: " +
                  formatPrice(data[t] - bought_price))
        done = t == l - 1
        agent.memory.append((state, action, reward, next_state, done))
        state = next_state
        if done:
            print("--------------------------------")
            print("Total Profit: " + formatPrice(total_profit))
            print("--------------------------------")
        # Train on past experience once enough has been collected
        if len(agent.memory) > batch_size:
            agent.expReplay(batch_size)
    # Save a checkpoint every 10 episodes
    if e % 10 == 0:
        agent.model.save("models/model_ep" + str(e))
In the above code, the outer for loop runs over the episodes, whose number is one of the hyperparameters, and the inner for loop goes through the entire data set. In other words, in every episode the agent trains on the entire data.
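The loop relies on a `getState` helper that is not shown in this article. As a rough sketch of what an n-day window of closing prices could look like as a state, the hypothetical `window_state` function below returns the day-to-day price differences over the last `n` days, squashed with a sigmoid so that every feature lies between 0 and 1, and pads with the first price when there is not enough history. The differencing, the sigmoid, and the padding are all assumptions; the actual helper in the authors' repository may differ.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def window_state(prices, t, n):
    """Return n - 1 squashed price differences for the window ending at index t."""
    start = t - n + 1
    if start >= 0:
        block = prices[start:t + 1]
    else:
        # Not enough history yet: pad the front with the first price.
        block = [prices[0]] * (-start) + prices[:t + 1]
    return [sigmoid(block[i + 1] - block[i]) for i in range(n - 1)]


closes = [10.0, 11.0, 10.5, 12.0]       # made-up closing prices
early = window_state(closes, t=0, n=3)  # padded window: no price change yet
later = window_state(closes, t=3, n=3)  # differences: -0.5, then +1.5
```

Using differences rather than raw prices keeps the state on a similar scale regardless of the stock's absolute price level.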
Evaluation
After training, the agent is ready to trade. To see how well it performs, we evaluate it on the data.
# Start from the first window with an empty inventory, as in training
state = getState(data, 0, window_size + 1)
total_profit = 0
agent.inventory = []
for t in range(l):
    action = agent.act(state)
    # hold (action 0): nothing to do, reward stays 0
    next_state = getState(data, t + 1, window_size + 1)
    reward = 0
    # buy
    if action == 1:
        agent.inventory.append(data[t])
        print("Buy: " + formatPrice(data[t]))
    # sell
    elif action == 2 and len(agent.inventory) > 0:
        bought_price = agent.inventory.pop(0)
        reward = max(data[t] - bought_price, 0)
        total_profit += data[t] - bought_price
        print("Sell: " + formatPrice(data[t]) + " | Profit: " +
              formatPrice(data[t] - bought_price))
    done = t == l - 1
    agent.memory.append((state, action, reward, next_state, done))
    state = next_state
    if done:
        print("--------------------------------")
        print(stock_name + " Total Profit: " +
              formatPrice(total_profit))
        print("--------------------------------")
The total profit printed at the end shows how well the agent trades.
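The bookkeeping in both loops pairs every sell with the oldest unsold purchase (first in, first out) and clips the reward at zero, so a losing sale lowers the total profit but never produces a negative reward. The sketch below replays a fixed list of actions to make that accounting visible. The function name `replay_actions` and the price and action lists are made up for illustration.

```python
def replay_actions(prices, actions):
    """Return (total_profit, rewards) for actions coded 0 = hold, 1 = buy, 2 = sell."""
    inventory = []  # purchase prices of shares still held
    total_profit = 0.0
    rewards = []
    for price, action in zip(prices, actions):
        reward = 0.0
        if action == 1:
            inventory.append(price)
        elif action == 2 and inventory:
            bought_price = inventory.pop(0)          # oldest purchase first
            reward = max(price - bought_price, 0.0)  # losses give zero reward
            total_profit += price - bought_price
        rewards.append(reward)
    return total_profit, rewards


prices = [10.0, 12.0, 9.0, 15.0]
actions = [1, 1, 2, 2]  # buy, buy, sell at a loss, sell at a gain
total_profit, rewards = replay_actions(prices, actions)
```

Here the first sale loses 1.0 (bought at 10.0, sold at 9.0) and the second gains 3.0 (bought at 12.0, sold at 15.0), so the total profit and the sum of rewards differ.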
Conclusion
Financial markets are not easy to predict; even the most experienced people fail to predict market crashes or trends. Intelligent guessing is the only way forward, and that is what reinforcement learning does. It has to be combined with other variables that affect the market.
Note: This was trained on Discovery, the high-performance computing cluster of Northeastern University. Training on a laptop will take a long time, so we suggest training on a cluster. The GitHub repository contains all the necessary information on the requirements and dependencies to run this.
Feel free to reach out if you have any questions.
The entire code is available on GitHub.
Authors
Abhishek Maheshwarappa | LinkedIn
Jiaxin Tong | LinkedIn