
Reinforcement Learning (RL) Simplified

In this blog, I will cover the taxonomy of reinforcement learning shown below. It is easier to visualize and memorize the complete concept in a single mind map. This is how I learn, and I hope you all find it easy to understand and apply.

Let’s recap the different learning methods before we go deep into reinforcement learning. The figure below depicts various subfields of machine learning. These subfields are one of the ways in which machine learning algorithms are classified.

Supervised learning: Supervised learning is all about operating to a known expectation; in this case, what needs to be analysed from the data is defined up front. The input datasets in this context are also referred to as “labelled” datasets. Algorithms in this category focus on establishing a relationship between the input and output attributes, and use this relationship to predict an output for new input data points. Classification problems are a typical example of supervised learning. Labelled data helps build reliable models, but it is usually expensive and limited.

Unsupervised learning: In some learning problems we do not have any specific target in mind to solve for; this kind of learning is called unsupervised analysis or learning. The goal in this case is to decipher the structure in the data, as against building a mapping between input and output attributes; in fact, the output attributes are not defined. For this reason, these learning algorithms operate on “unlabelled” datasets.

So, given a bunch of xs, the goal here is to define a function f that gives a compact description of the set of xs. Clustering is a typical example of this.


Semi-supervised learning (SSL): Semi-supervised learning is about using both labelled and unlabelled data to learn better models. It is important that appropriate assumptions are made about the unlabelled data, as inappropriate assumptions can invalidate the model. Semi-supervised learning takes its motivation from the human way of learning.

Reinforcement learning: The Context

Reinforcement learning is learning focussed on maximizing the reward that results from the actions taken.

Analogy: When teaching toddlers new habits, rewarding them every time they follow instructions works very well. In fact, they figure out which behaviour helps them earn rewards. This is exactly what reinforcement learning is. It is also called credit assessment learning.

Most importantly, in reinforcement learning the model is additionally responsible for making decisions, for which a periodic reward is received. Unlike in supervised learning, the results are not immediate and may require a sequence of steps to be executed before the final result is seen. Ideally, the algorithm will generate a sequence of decisions that helps achieve the highest reward or utility.

The goal in this learning technique is to measure the trade-offs effectively by exploring and exploiting the data. For example, when a person has to travel from point A to point B, there are many ways to do so, including travelling by air, water or road, or walking, and there is significant value in considering this data and measuring the trade-offs for each option. Another important aspect is what a delay in the rewards would mean and how it would affect learning. For example, in games like chess, any delay in reward identification may change or impact the result.

So, the representation is very similar to supervised learning, the difference being that the input is not x, y pairs but x, z pairs. The goal is to find a function f that identifies a y given x and z. In the following sections, we will explore what z is.

y = f(x) given z.

A formal definition:

Reinforcement learning is defined as a way of programming agents by reward and punishment without needing to specify how the task is to be achieved
- Kaelbling, Littman, & Moore, 96

So, net-net, RL is neither a type of neural network nor an alternative to neural networks, but an “orthogonal” approach to machine learning. The emphasis is on learning from feedback that evaluates the learner’s performance, with no standard behavioural targets against which the performance is measured; learning to ride a bicycle is an example.

Let us now look at the basic RL model and its different elements in action. As a first step, let us understand some basic terms.

Agent: An agent is an entity that is a learner as well as a decision maker, typically an intelligent program in this case.

Environment: An environment is an entity that is responsible for producing a new situation given an action performed by the agent. It gives rewards or feedback for the action. In short, the environment is everything other than the agent.

State: A state is the situation that an action lands the agent in.

Action: An action is a step executed by an agent that results in a change in state.

Policy: A policy defines how an agent behaves at a given point in time. It describes the mapping between states and actions, and is usually a simple business rule or a function.

Reward: A reward is the short-term benefit of an action that helps in reaching the goal.

Value: Another important element in reinforcement learning is the value function. While the reward function is all about the short-term or immediate benefit of an action, the value function is about the benefit in the long run. The value of a state is the accumulation of rewards an agent can expect to collect over the future, starting from that state.
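To see how these terms fit together, here is a minimal Python sketch of the agent and environment interaction loop. The corridor environment, its reward numbers and the names CorridorEnvironment, always_right_policy and run_episode are all hypothetical and chosen only for illustration; they are not taken from any library.

```python
class CorridorEnvironment:
    """Hypothetical environment: a corridor of cells 0..4; the goal is cell 4."""

    def __init__(self):
        self.state = 0  # the agent starts in the leftmost cell

    def step(self, action):
        """Apply an action (-1 = left, +1 = right); return (new_state, reward, done)."""
        self.state = min(4, max(0, self.state + action))
        done = self.state == 4
        reward = 1.0 if done else -0.1  # small cost per step, payoff at the goal
        return self.state, reward, done


def always_right_policy(state):
    """A policy maps a state to an action; this simple rule ignores the state."""
    return +1


def run_episode(env, policy, max_steps=20):
    """Let the agent act until the episode ends; return the accumulated reward."""
    state, total_reward = env.state, 0.0
    for _ in range(max_steps):
        action = policy(state)                  # the agent decides
        state, reward, done = env.step(action)  # the environment responds
        total_reward += reward                  # rewards accumulate over time
        if done:
            break
    return total_reward


episode_return = run_episode(CorridorEnvironment(), always_right_policy)
print(round(episode_return, 2))  # 0.7: three steps at -0.1, then +1.0 at the goal
```

Each individual reward here is the short-term signal, while the accumulated total over the episode is the kind of long-run quantity that a value function tries to estimate.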

Examples of Reinforcement learning

The easiest way to understand reinforcement learning is to look at some of its practical, real-world applications. In this section, we will list and examine some of them.

Example 1: Game of Chess: In the game of chess, a player makes a move; this move is an informed selection of an action, made in anticipation of the set of counter moves available to the opponent. The next action of the player is determined by the moves the opponent makes.

Example 2: Elevator Scheduling: Consider a building with many floors and many elevators. The key optimization requirement is to choose which elevator should be sent to which floor, which is categorized as a control problem. The input is the set of buttons pressed (inside and outside the lifts) across the floors, the locations of the elevators and the set of floors. The reward in this case is the least waiting time for the people wanting to use the lift. By learning in a simulation of the building, the system learns to control the elevators using estimates of the value of past actions.

Example 3: Mobile Robot behaviour: A mobile robot needs to decide whether to head for the recharge point or for the next trash point, depending on how quickly it has been able to find a recharge point in the past.

Evaluative Feedback

One of the key features that differentiates reinforcement learning from the other learning types is that it uses information to evaluate the impact of a particular action, rather than blindly instructing which action needs to be taken. Evaluative feedback indicates how good the action taken was, while instructive feedback indicates what the correct action is, irrespective of the action actually taken. Though these two mechanisms are distinct, there are cases where they are employed in conjunction. In this section, we will explore some evaluative feedback methods that lay the foundation for the rest of the discussion.
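The difference between the two kinds of feedback can be shown with a tiny Python sketch. The three actions and their payoff numbers are made up for illustration, and the function names are hypothetical.

```python
# Hypothetical three-action task; the payoff numbers are made up for illustration.
PAYOFF = {"left": 0.2, "stay": 0.5, "right": 0.9}
CORRECT_ACTION = max(PAYOFF, key=PAYOFF.get)


def instructive_feedback(action_taken):
    """Says which action is correct, regardless of the action actually taken."""
    return CORRECT_ACTION


def evaluative_feedback(action_taken):
    """Says only how good the chosen action was, not what would have been better."""
    return PAYOFF[action_taken]


print(instructive_feedback("left"))  # right (the same answer for any action)
print(evaluative_feedback("left"))   # 0.2 (a score for this action only)
```

With evaluative feedback alone, the learner has to try the other actions to discover that a better one exists.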

Reinforcement Comparison methods

In most selection methods, an action with a larger reward is more likely to be selected than an action with a smaller reward. The important question is how to qualify whether a reward is big or small. We always need a reference number against which a reward is judged to have a high or a low value. This reference value is called the reference reward. To start with, the reference reward can be the average of previously received rewards. Learning methods that use this idea are called reinforcement comparison methods. These methods can be more effective than action-value methods and form the basis for the actor-critic methods that we will discuss in the sections to come.
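As a rough sketch of this idea, the Python snippet below keeps a preference per action and a reference reward that is a running average of past rewards. The three-action task, the step sizes alpha and beta, the softmax selection rule and all function names are assumptions made for illustration; this is one common way to write the update, not the only one.

```python
import math
import random


def softmax(preferences):
    """Turn action preferences into selection probabilities."""
    highest = max(preferences)
    weights = [math.exp(p - highest) for p in preferences]
    total = sum(weights)
    return [w / total for w in weights]


def reinforcement_comparison(true_means, steps=2000, alpha=0.1, beta=0.1, seed=0):
    """Learn action preferences by comparing each reward with a reference reward."""
    rng = random.Random(seed)
    preferences = [0.0] * len(true_means)
    reference_reward = 0.0  # running average of the rewards received so far
    for _ in range(steps):
        probabilities = softmax(preferences)
        action = rng.choices(range(len(true_means)), weights=probabilities)[0]
        reward = true_means[action] + rng.gauss(0.0, 0.1)  # noisy payoff
        # A reward above the reference raises the preference; one below lowers it.
        preferences[action] += beta * (reward - reference_reward)
        # Move the reference reward towards the latest reward.
        reference_reward += alpha * (reward - reference_reward)
    return preferences, reference_reward


preferences, reference_reward = reinforcement_comparison([0.2, 0.5, 0.9])
best_action = preferences.index(max(preferences))
print(best_action, round(reference_reward, 2))
```

The reward is never judged in absolute terms here: the same reward of 0.5 pushes a preference up early on, when the reference is low, and down later, once the reference has risen above it.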

The Reinforcement learning problem — The grid world example

We will try to understand the reinforcement learning problem using a famous example, the grid world. This particular grid world is a 3 x 4 grid (3 rows by 4 columns), as shown below, and is an approximation of the complexity of the world.

This example assumes the world is a kind of game where you start in a state called the start state (location 1,1). Let us assume four actions can be taken: moving left, right, up and down. The goal is to use these actions to move towards the goal at location 4,3 (in green) while avoiding the red box at location 4,2.

· Start state: position 1,1 → the world starts here

· Success state: position 4,3 → the world ends here, in a success state

· Failure state: position 4,2 → the world ends here, in a failure state

· When the world ends, we need to start over again.

· Wall: There is a road block or a wall at position 2,2. This position cannot be navigated.
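The layout in the list above can be sketched in Python as follows. The (column, row) tuple convention, the names step and is_terminal, and the choice to treat a blocked move as staying in place are assumptions made for illustration.

```python
# Grid dimensions and special cells, using (column, row) positions that
# start at 1, as in the list above.
COLUMNS, ROWS = 4, 3
START, GOAL, FAILURE, WALL = (1, 1), (4, 3), (4, 2), (2, 2)
MOVES = {"UP": (0, 1), "DOWN": (0, -1), "LEFT": (-1, 0), "RIGHT": (1, 0)}


def step(position, action):
    """Return the position reached by taking `action` from `position`.

    Assumption: a move into the wall or off the grid leaves the position unchanged.
    """
    dx, dy = MOVES[action]
    column, row = position[0] + dx, position[1] + dy
    inside = 1 <= column <= COLUMNS and 1 <= row <= ROWS
    if not inside or (column, row) == WALL:
        return position
    return (column, row)


def is_terminal(position):
    """The world ends in the success state or the failure state."""
    return position in (GOAL, FAILURE)


print(step(START, "UP"))     # (1, 2)
print(step((3, 2), "LEFT"))  # (3, 2), blocked by the wall
print(is_terminal(FAILURE))  # True
```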

To reach the goal (4,3) from the start point (1,1) steps can be taken in the following directions:

· Every step in a direction moves you from one position to another (position here is nothing but state). For example, a movement in the “UP” direction from the position 1, 1 will take you to the position 1, 2 and so on.

· Not all directions can be taken from a given position. For example, from position 3,2 only UP, DOWN and RIGHT can be taken; a LEFT movement would hit the wall and hence cannot be taken. That said, only UP and DOWN make sense, as RIGHT would move to the danger position, which results in failure to reach the goal.

· Similarly, the positions on the boundaries of the grid have limitations. For example, position 1,3 allows only RIGHT and DOWN movements; any other movement does not alter the position.

· Let us now look at the shortest path from Start (1,1) to the Goal (4,3). There are two solutions:

o Solution 1: RIGHT → RIGHT → UP → UP → RIGHT (5 steps)

o Solution 2: UP → UP → RIGHT → RIGHT → RIGHT (5 steps)

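· In the real world, not all actions execute as expected; there is a reliability factor, or rather uncertainty, that affects performance. Let us add a small caveat to the example: every time an action is taken to move from one position to another, the probability that the movement goes in the intended direction is 0.8, that is, there is an 80% chance that a movement executes as expected. The calculation below also assumes that the remaining 0.2 is split evenly between the two directions at right angles to the intended one (0.1 each). With this, we can measure the probability of Solution 1 (R → R → U → U → R) reaching the goal. There are two ways this can happen: all five moves execute as expected, or the first four moves each slip at a right angle so that the path actually travelled is U → U → R → R, and the final RIGHT executes as expected.

Probability of all actions happening as expected + Probability of reaching the goal even though the first four actions do not happen as expected

= 0.8 x 0.8 x 0.8 x 0.8 x 0.8 + 0.1 x 0.1 x 0.1 x 0.1 x 0.8

= 0.32768 + 0.00008 = 0.32776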
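The figure above can be cross-checked with a short Python sketch that tracks the probability of being in each position after every command. It assumes a slip model in which the intended direction is taken with probability 0.8 and each of the two perpendicular directions with probability 0.1, that blocked moves leave the position unchanged, and that the success and failure states end the world. All names are hypothetical.

```python
from collections import defaultdict

COLUMNS, ROWS = 4, 3
START, GOAL, FAILURE, WALL = (1, 1), (4, 3), (4, 2), (2, 2)
MOVES = {"U": (0, 1), "D": (0, -1), "L": (-1, 0), "R": (1, 0)}
# Assumed slip model: each perpendicular direction is taken with probability 0.1.
PERPENDICULAR = {"U": "LR", "D": "LR", "L": "UD", "R": "UD"}


def step(position, action):
    """Deterministic move; blocked moves leave the position unchanged."""
    column, row = position[0] + MOVES[action][0], position[1] + MOVES[action][1]
    inside = 1 <= column <= COLUMNS and 1 <= row <= ROWS
    if not inside or (column, row) == WALL:
        return position
    return (column, row)


def success_probability(commands, p_intended=0.8, p_slip=0.1):
    """Probability of being at GOAL after issuing `commands` from START."""
    distribution = {START: 1.0}
    for command in commands:
        following = defaultdict(float)
        for position, probability in distribution.items():
            if position in (GOAL, FAILURE):  # terminal states absorb
                following[position] += probability
                continue
            following[step(position, command)] += probability * p_intended
            for slip in PERPENDICULAR[command]:
                following[step(position, slip)] += probability * p_slip
        distribution = following
    return distribution.get(GOAL, 0.0)


probability = success_probability("RRUUR")
print(round(probability, 5))  # 0.32776
```

Under the same assumptions, the command sequence of Solution 2 ("UURRR") gives the same probability, because the two shortest paths mirror each other's slips.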

As we can see, the element of uncertainty does change the result. In the next section we will discuss the decision process framework that captures these uncertainties.

Reinforcement learning — key features

Reinforcement learning is not a set of techniques but a set of problems; it focuses on what the task is, as against how the task should be addressed.
Reinforcement learning is considered a tool for machines to learn using rewards and punishments, in a way that is driven more by trial and error.
Reinforcement learning employs evaluative feedback. Evaluative feedback measures how effective the action taken is, as against measuring whether the action is the best or the worst. (Note that supervised learning is more of an instructive learning and measures the correctness of an action irrespective of the action being executed.)
The tasks in reinforcement learning are mostly associative tasks. Associative tasks depend on the situation: the actions that best suit the given situation are identified and executed. Non-associative tasks are independent of the situation, and the learner finds the best action when the task is stationary.

This summarizes the core concepts of reinforcement learning. In the next article I will cover RL solution methods. Please share your comments and feedback.
