Reinforcement Learning - A brief intro
What is the basic problem that Reinforcement Learning (RL) solves?
When I started with ML, I saw a diagram that split Machine Learning into Supervised Learning, Unsupervised Learning, and Reinforcement Learning (RL). A variant of it, which many of you may have seen, is shown below:
![Image for post Image for post](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F918ab617-35e1-4820-b738-0e401c569583_1180x828.png)
In this blog post, I will introduce RL and the basic building blocks/concepts RL uses.
We can loosely consider both Supervised Learning and Unsupervised Learning as problems of pattern recognition. This is where RL differs from the two: it is not primarily about pattern recognition.
Instead, Reinforcement Learning is primarily about Decision Making. Specifically, it is about decision making so that we maximize the sum of future rewards.
So, fundamentally, Supervised and Unsupervised Learning are closer to each other than either is to RL. Hence, I feel the Venn diagram above paints a slightly misleading picture by portraying all three on the same plane.
An Example:
Let’s now look at an example of a decision-making problem. Imagine that you have a system (think of a function in the programming world) that receives some input and tells you to take one of several actions. For example, consider a simple GPS system that gives you directions to your office. At some junction, it can tell you to turn left, turn right, or go straight. These are the various actions you can perform in the setting of driving a car. A GPS is typically programmed to suggest the action that gets you to your destination fastest, so the system should minimize the time it takes to complete the trip. In the RL world, we do this by defining a Reward and then trying to maximize it. A few other examples where RL can be applied are:
A computer that plays Chess, Go, etc. against humans. The Reward can be positive if the computer wins the game, and negative if it loses.
An autonomous lander that needs to land safely on Mars, without human supervision. The Reward can be positive if it lands safely, and negative if it crash-lands. Extra penalties can be configured, such as a small negative reward for every minute it is in the air, so that it lands not only safely but also quickly (see the sketch after this list).
Inventory management in a supply chain, where a positive reward applies if inventory arrives just in time when it is required, and a negative penalty applies for stock-outs or excess inventory.
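To make the reward idea concrete, here is a minimal sketch of what a reward function for the Mars-lander example might look like. The function name and all the numbers are hypothetical; a real environment would tune these values carefully.

```python
# A hypothetical reward function for the Mars-lander example above.
# All names and numbers are illustrative, not from a real environment.
def lander_reward(landed_safely: bool, crashed: bool, minutes_in_air: float) -> float:
    reward = -0.1 * minutes_in_air   # small penalty for every minute spent in the air
    if landed_safely:
        reward += 100.0              # large positive reward for a safe landing
    elif crashed:
        reward -= 100.0              # large negative reward for a crash landing
    return reward
```

Changing these numbers changes what the agent learns to prioritize, which is why reward design matters in practice.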
Problem Structure:
We saw that RL is essentially a decision-making problem, and looked at a few examples that RL can solve. The problems where we can apply RL share a few common characteristics:
The problem is almost always* composed of several ‘time-steps’. For example, a GPS trip takes some time to reach the destination, and a Chess game involves multiple moves before the game is finally won (or lost). (Contrast this with a pattern-recognition problem such as classifying cat vs. dog images or approving a credit card.)
As there are many time-steps in the problem, we denote the state of the problem at each time step: the information that is relevant to the decision making. For example, in a self-driving car, whether there is a car in front of you, or the speed limit on the road, can be part of the state. But the color of the shirt you are wearing while driving is not part of the state.
The problem is memoryless, i.e., all the information required to take an optimal action can be captured in the state at the time we make the decision. For example, in the game of Chess, if the current board is the state, we take the next action based only on the current board; what happened two moves earlier is irrelevant. A problem where only the current state affects the future path is called a Markov Decision Process. It is so important that it has its own acronym: MDP.
Reinforcement Learning is, essentially, a way to solve an MDP.
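One common way to write down a small MDP explicitly is as a table that maps each (state, action) pair to its possible outcomes. The toy states, actions, probabilities, and rewards below are invented purely for illustration (transition probabilities are discussed further down in this post).

```python
# A tiny, made-up MDP: for each state and action we list the possible
# (probability, next_state, reward) outcomes. Everything here is illustrative.
toy_mdp = {
    "start": {
        "left":  [(1.0, "start", -1.0)],                       # bump into a wall, stay put
        "right": [(0.8, "goal", 10.0), (0.2, "start", -1.0)],  # usually reach the goal
    },
    "goal": {
        "left":  [(1.0, "goal", 0.0)],                         # terminal state, nothing more happens
        "right": [(1.0, "goal", 0.0)],
    },
}
```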
RL Framework:
The RL framework looks something like the image below:
![Image for post Image for post](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F7782d843-cd15-42e9-b4fe-c419a8f729dc_955x367.png)
Img Ref: https://www.kdnuggets.com/2018/03/5-things-reinforcement-learning.html
RL decision making happens inside the Agent. Consider it roughly equivalent to a trained model in the supervised learning world (but be careful with the word “model” in the RL world, where it often denotes a mathematical representation of the state transitions inside the Environment).
The agent can see the current state (s_t) of the environment in which it operates. It then returns an action, which is executed inside the environment. With this, the environment moves to a new state (s_t+1) and also returns some reward (R_t+1), both of which can be observed by the Agent.
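This interaction is usually written as a loop. The sketch below assumes hypothetical `env` and `agent` objects, with an interface loosely modeled on the convention popularized by OpenAI Gym; it is meant to show the flow of state, action, and reward, not any particular library’s API.

```python
# One episode of agent-environment interaction. `env` and `agent` are assumed
# to provide the hypothetical interface noted in the comments.
def run_episode(env, agent):
    state = env.reset()                                   # observe the initial state s_t
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)                         # the policy maps state -> action
        next_state, reward, done = env.step(action)       # environment returns s_t+1 and R_t+1
        agent.observe(state, action, reward, next_state)  # the agent can learn from this transition
        state, total_reward = next_state, total_reward + reward
    return total_reward
```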
We now introduce a concept called discounted rewards. Let’s say you win a million-dollar lottery. The lottery company gives you two options: you can collect the money now, or after five years. The obvious choice is to collect the money now, i.e., a reward now is more valuable than a reward in the future. This is formalized by discounting the rewards at each future time step. If the reward after k time steps is R, we use a discount factor γ and the discounted reward becomes:
![Image for post Image for post](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F419f161f-3113-4e33-9e0a-c837789f8ab7_47x41.png)
So the sum of all discounted rewards in the MDP is denoted as follows:
![Image for post Image for post](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2F6a25b02e-dad8-4788-8663-81999761ec0f_254x43.png)
Sum of discounted rewards; R_k denotes the reward after k time steps.
The objective of RL is to maximize this value.
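In code, this discounted sum is a one-liner over the observed reward sequence. The rewards and γ below are made up just to show the arithmetic.

```python
# Discounted return: sum over k of gamma^k * R_k, for a list of observed rewards.
def discounted_return(rewards, gamma=0.99):
    return sum((gamma ** k) * r for k, r in enumerate(rewards))

print(discounted_return([1.0, 1.0, 1.0], gamma=0.9))   # 1 + 0.9 + 0.81 = 2.71
```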
This value depends on the actions taken at each time step. So an RL agent controls this expression by taking appropriate actions so that the value is maximized. The function that maps state to action in RL is called a policy, denoted by π. Formally, this can be expressed as follows.
![Image for post Image for post](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fcd0433f0-43c9-4fe9-bf56-fbf8f123af34_455x57.png)
The expected sum of future rewards, given that we are at state s at time t, and we are following a policy π
The RL agent is trained to learn a policy that maximizes this value. We can see that the value depends mainly on the following:
Policy: This is given by the RL agent.
State: If we are in a certain state, there are limits to the reward no matter what the RL agent does. Think of being stuck in a huge traffic jam: the state is such that whatever the GPS system tells you, you cannot get a great reward.
v(s) denotes the value of the state. It depends only on the state and the policy. But let’s take a step back and compute such a value for every possible action at state s. So if there are n possible actions, we get n such values, denoted as follows:
![Image for post Image for post](https://substackcdn.com/image/fetch/w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fbucketeer-e05bbc84-baa3-437e-9518-adb32be77984.s3.amazonaws.com%2Fpublic%2Fimages%2Fdc0e06e1-0ca3-4607-b7dc-b501b8468172_517x41.png)
Q value of a state-action pair, given that the policy π will be followed from the next state onward.
Why do we define this extra complexity when we already have an expression for the value of a state? Because if we are at state s and we magically know the Q values of all the actions at that state, we can simply take the action with the highest Q value. This assures us that we took the best possible action at state s, which is not possible using the v values of the state alone. RL can now be broken down into a problem of identifying these Q values for all state-action pairs, and that is exactly how most RL problems are solved: by identifying these Q Values*.
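If the Q values at a state were magically known, acting optimally really is just a lookup. Here is a tiny sketch with made-up numbers:

```python
# Hypothetical Q values for one state of the GPS example; the numbers are invented.
q_values_at_s = {"left": 1.2, "straight": 3.5, "right": -0.4}

# Greedy action selection: pick the action with the highest Q value.
best_action = max(q_values_at_s, key=q_values_at_s.get)
print(best_action)   # "straight"
```

These Q values can be found by various approaches, depending on how simple or complex the environment is (don’t worry too much about understanding these approaches; the point here is just to give an overview of the terms):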
Dynamic Programming: If the environment is simple and can be modeled perfectly, we can use dynamic programming to solve for these values. Note that this approach is not Reinforcement Learning; the approaches below are.
Monte Carlo Simulation: If the environment is complex and we can’t model the state transitions perfectly, we can instead simulate many complete paths and update the Q values based on the rewards observed.
TD Learning: This is similar to Monte Carlo in that we simulate paths, but instead of waiting for a path to complete, we update the Q values immediately after the next state. This is faster than Monte Carlo simulation and has a few other advantages. It is an extremely important idea that fundamentally changed the progress of RL. It was introduced by Richard Sutton, who also wrote arguably the most popular RL textbook and was the doctoral advisor of David Silver, the lead researcher behind AlphaGo at DeepMind.
There are several variations of TD Learning, the most popular being Q-Learning. If the number of states and actions becomes too large, it may be infeasible to learn the Q values for all state-action pairs. In that case, we represent states by features and use function approximators to estimate the Q values. This is where techniques like Deep Reinforcement Learning come in. There is much more to RL, but we will stop here with a brief overview of the concepts we have seen so far.
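To make the Q-Learning idea concrete, here is the standard tabular update rule, Q(s,a) <- Q(s,a) + α[r + γ max_a' Q(s',a') - Q(s,a)], as a minimal sketch. The learning rate, discount factor, and the surrounding agent/environment loop are left out and the values shown are illustrative.

```python
from collections import defaultdict

# Tabular Q-Learning: Q(s,a) += alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).
# alpha and gamma are illustrative; the environment loop around this is omitted.
Q = defaultdict(float)            # Q[(state, action)], defaults to 0.0
alpha, gamma = 0.1, 0.99

def q_learning_update(state, action, reward, next_state, next_actions):
    best_next = max(Q[(next_state, a)] for a in next_actions)   # greedy estimate of future value
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
```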
Overview
MDP: Any RL problem is basically an MDP. It is defined by multiple time-steps, memoryless states, actions, rewards, and transitions between states. This is the environment in which an RL agent operates.
State: A state describes everything about the environment that is relevant to taking an action. The state is the way an RL agent sees the environment.
Action: At any state in an MDP, we can take one of several actions, which changes the state of the environment. Actions are how an RL agent influences the environment.
Transition Probabilities: I haven’t talked about these earlier. In a given state s, when an agent performs an action a, the environment’s new state is not always fixed; it is usually stochastic. Consider a game of chess where the state is the board position before the computer gets to play. After the computer makes a move, it has no control over what the opponent will play, but it can estimate the various possibilities, and these possibilities change based on which action the computer takes now. So the transitions from the current state to the next state depend on the actions taken, and they are represented as transition probabilities. In simple environments these probabilities are known, and Dynamic Programming can be used to solve such problems. But usually the transitions are unknown and the agent learns about them while training.
Rewards: Rewards are the environment’s way of giving feedback to the agent about the action just taken. They are scalar values (a single number). Sometimes the current reward may be high but lead to a state that is bad for the future, so all future rewards need to be considered to say whether an action is optimal or not.
Discount Factor: This is a value between 0 and 1. In some problems you may be quite content with the immediate reward (γ close to 0), while in other settings long-term reward is almost as important as the current reward (γ close to 1). We choose the discount factor depending on the problem.
Policy: A policy is a function that takes a state as input and returns an action* to be taken. An optimal policy is one that maximizes the expected sum of future rewards if the actions it determines are taken.
Q-Values: These are scalar values defined for a state-action pair under a specific policy. The Q value is the expected sum of future rewards when that action is taken at state s and all future actions are taken according to the same policy.
Solving for Q-Values: If we have the Q-values, the MDP is solved. Various approaches exist for finding them, such as Dynamic Programming, Monte Carlo simulation, and TD Learning. Within TD Learning, there are approaches such as Q-Learning and SARSA.
Function approximation: If the state-action space becomes too large, it becomes infeasible to learn Q-values for all pairs. In such cases, a state is represented by its features (a feature vector) and a function is learnt to estimate Q-values from those features. Deep Learning is one way to approximate these functions; simpler methods such as linear functions can be used here too (see the sketch after this list).
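As a rough sketch of function approximation: instead of a table, Q values are computed from a state’s feature vector. The linear version below (with invented sizes) is the simplest case; Deep RL replaces the linear map with a neural network.

```python
import numpy as np

# Linear Q-value approximation: Q(s, a) ~ w_a . features(s).
# Feature and action counts are illustrative.
n_features, n_actions = 8, 4
weights = np.zeros((n_actions, n_features))       # one weight vector per action

def q_values(state_features: np.ndarray) -> np.ndarray:
    """Approximate Q value of every action for a state given as a feature vector."""
    return weights @ state_features               # shape: (n_actions,)
```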
If you want to learn more about RL, you can follow some of these references:
Reinforcement Learning: An Introduction by Sutton and Barto: This is considered the textbook for RL and is a must-read. Best of all, it is available for free.
David Silver’s lectures on YouTube: David Silver was the lead researcher on AlphaGo, which broke new frontiers in RL. This video series gives a good introduction to RL.
Reinforcement Learning from Georgia Tech: This is a course I took as part of my Masters. The videos are freely accessible on Udacity. The course goes deep into the math and explores many stability issues, convergence proofs, and Game Theory related concepts. The videos are very fun to watch.
Footnotes:
RL can also be applied to some single-step processes, like ranking results on a search page. But the majority of problems fall under the multi-time-step setting.
Solving for Q values is not the only way to solve RL problems. There are other approaches that solve for the best policy directly, but the Q-value approach is the most popular one.
A policy may return a distribution over multiple actions, not necessarily one action per state. This is useful in game theory, where opponents keep learning from your actions. Think of rock-paper-scissors: if you always play rock, your opponent can always win. The optimal policy is actually a uniform (1/3) distribution over all three actions (rock, paper, scissors).
If you like this article, do subscribe to my substack for more such articles to reach your mailbox directly.