REINFORCEMENT LEARNING
Introduction
Reinforcement learning involves no supervisor; only a reward signal tells the agent whether it is doing well or not. Time is a key component in RL: the process is sequential, with delayed feedback, and each action the agent takes affects the next data it receives.
It is a type of machine learning technique where a computer agent learns to perform a
task through repeated trial and error interactions with a dynamic environment. This learning
approach enables the agent to make a series of decisions that maximize a reward metric for
the task without human intervention and without being explicitly programmed to achieve the
task.
Through a series of trial-and-error interactions, an agent keeps learning continuously in an interactive environment from its own actions and experiences. Its only goal is to find a suitable action model that maximizes the agent's total cumulative reward. It learns via interaction and feedback.
How does reinforcement learning work? Well, let me explain with an example.
Fig: Reinforcement Learning Example
You can see a dog and its master. Imagine you are training your dog to fetch a stick. Each time the dog fetches the stick successfully, you offer it a treat (say, a bone). Eventually, the dog understands the pattern: whenever the master throws a stick, it should fetch it as quickly as it can to earn the reward (a bone) from the master in less time.
Fig: Terminologies in RL
Reward – For each action selected by the agent, the environment gives a reward. It is usually a scalar value and is nothing but feedback from the environment.
Value Function – The value of a state is the total reward the agent can expect to accumulate starting from that state while executing the policy.
Model – Not every RL agent uses a model of its environment. The model is the agent's view of the environment: it maps state-action pairs to probability distributions over next states.
In supervised learning, the model is trained with a training dataset that has a correct answer key. Decisions are made on the given input because all the data required to train the machine is available up front. The decisions are independent of each other, so each decision is represented by its own label. Example: object recognition.
In reinforcement learning, there is no answer key; the reinforcement agent decides what to do in order to perform the required task. As no training dataset is available, the agent has to learn from its own experience. It is all about making decisions sequentially: in simpler words, the output depends on the current input state, and the next input depends on the output of the previous step. Decisions are therefore dependent, and labels are given to sequences of dependent decisions. Example: a chess game.
An RL problem can best be explained through games. Let's take the game of PacMan, where the goal of the agent (PacMan) is to eat the food in the grid while avoiding the ghosts on its way. In this case, the grid world is the interactive environment in which the agent acts. The agent receives a reward for eating food and a punishment if it gets killed by a ghost (loses the game). The states are the locations of the agent in the grid world, and maximizing the total cumulative reward corresponds to the agent winning the game.
So far we have said that the agent needs to find the “right” action. The right action
depends on the rewards.
Reward: The reward R[t] is a scalar feedback signal which indicates how well the agent is doing at time step t.
In reinforcement learning we need to define our problem such that it can be applied to
satisfy our reward hypothesis. An example would be playing a game of chess where the
agent gets a positive reward for winning a game and a negative reward for losing a game.
Reward Hypothesis: All goals can be described by the maximisation of expected cumulative
reward.
Since the process involves sequential decision making, actions we take early on may have long-term consequences for our overall goal. Sometimes it may be better to sacrifice immediate reward (the reward R[t] at time step t) to gain more long-term reward. An example applied to chess would be to sacrifice a pawn to capture a rook at a later stage.
Goal: The goal is to select actions to maximise total future reward.
In reinforcement learning the agent decides which action A[t] to take at each time step t. The agent makes these decisions based on the scalar reward R[t] it receives and the observation O[t] of the environment.
The environment receives the action from the agent and emits a new observation O[t+1] and scalar reward R[t+1]. What happens next in the environment depends on the history.
State: The state is the information used to determine what happens next.
The key difference between the history and the state is that the state is a function of the history. States can be divided into three main types:
● Environment State (S[t]ᵉ) — the environment's private representation, usually not visible to the agent.
● Agent State (S[t]ᵃ) — the agent's internal representation, used by the agent to pick the next action.
● Information (Markov) State — a state that contains all useful information from the history.
What we believe will happen next depends on the agent's representation of state.
The environment is also divided into fully observable environment and partially
observable environment.
So far, we have defined how to set up RL problems but not how the RL agent learns or what makes up an agent. An RL agent can have one or more of three major components: a policy, a value function, and a model (all introduced above).
● Policy: The agent's behaviour function, which is a map from state to action. It can be a deterministic policy or a stochastic policy.
● Value Function: A prediction of future reward, used to evaluate how good each state is.
● Model: The agent's representation of the environment, predicting what the environment will do next.
Based on which of these components they use, RL agents can be categorized as:
● Value Based: Has no policy; picks actions greedily based on state values.
● Policy Based: Has no value function; uses the policy function to pick actions.
● Actor Critic: Uses both value and policy functions.
● Model Free: Uses policy and/or value functions but has no model.
● Model Based: Uses policy and/or value functions and has a model.
Reinforcement Learning Algorithms
There are two settings in which a reinforcement learning algorithm can improve its policy:
● Learning: The environment is initially unknown and the agent needs to interact with the environment to improve its policy.
● Planning: If the environment (its model) is known, the agent performs computations with its model and then improves its policy.
Since reinforcement learning is a trial-and-error learning approach, the agent needs to learn from its experiences in the environment to make decisions without losing too much reward. It must balance exploitation and exploration: exploring new information versus using known information to maximise reward.
1. Positive Reinforcement
Positive reinforcement means strengthening a behaviour because a favourable event or reward follows that behaviour.
Advantages:
Disadvantage:
– Excess reinforcement can lead to an overload of states, which can diminish the results.
2. Negative Reinforcement
Negative reinforcement means strengthening a behaviour because a negative condition is stopped or avoided as a consequence of that behaviour.
Advantages:
– Maximized behaviour
Disadvantage:
1. Markov Decision Processes (MDPs) are mathematical frameworks for mapping out solutions in RL. An MDP is described by a set of parameters: a set of finite states S, a set of possible actions in each state A, a reward R, a model (transition model) T, and a policy π. The outcome of applying an action in a state does not depend on previous actions or states, only on the current state and action.
A State is a set of tokens that represent every state that the agent can be in.
What is a Model?
A Model (sometimes called Transition Model) gives an action’s effect in a state. In particular,
T(S, a, S’) defines a transition T where being in state S and taking an action ‘a’ takes us to
state S’ (S and S’ may be the same). For stochastic actions (noisy, non-deterministic) we also
define a probability P(S’|S,a) which represents the probability of reaching a state S’ if action
‘a’ is taken in state S. Note that the Markov property states that the effects of an action taken in a state depend only on that state and not on the prior history.
An Action A is a set of all possible actions. A(s) defines the set of actions that can be taken
being in state S.
What is a Reward?
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the
state S. R(S,a) indicates the reward for being in a state S and taking an action ‘a’. R(S,a,S’)
indicates the reward for being in a state S, taking an action ‘a’ and ending up in a state S’.
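To make these definitions concrete, here is a minimal Python sketch of how the transition model T(S, a, S') and the reward R(S, a, S') of a tiny MDP could be represented; the two-state MDP, its actions, probabilities and rewards are made up purely for illustration.

```python
# A toy MDP; every state, action, probability and reward below is illustrative.
states = ["S0", "S1"]
actions = {"S0": ["stay", "go"], "S1": ["stay", "go"]}   # A(s): actions available in state s

# Transition model T(S, a, S'), stored as P(S' | S, a)
T = {
    ("S0", "stay"): {"S0": 1.0},
    ("S0", "go"):   {"S0": 0.2, "S1": 0.8},   # a stochastic (noisy) action
    ("S1", "stay"): {"S1": 1.0},
    ("S1", "go"):   {"S0": 0.9, "S1": 0.1},
}

# Reward R(S, a, S') for being in S, taking action a and ending up in S'
R = {("S0", "go", "S1"): 1.0}                 # reaching S1 from S0 is rewarded

def transition_prob(s, a, s_next):
    """P(S' | S, a) as defined by the transition model T."""
    return T[(s, a)].get(s_next, 0.0)

def reward(s, a, s_next):
    """R(S, a, S'); zero for transitions not listed above."""
    return R.get((s, a, s_next), 0.0)

print(transition_prob("S0", "go", "S1"), reward("S0", "go", "S1"))   # 0.8 1.0
```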
What is a Policy?
A Policy π(s) defines which action the agent takes in each state S; it is a mapping from states to actions and is the solution we are looking for.
In the grid world, the agent can take any one of these actions: UP, DOWN, LEFT, RIGHT. Walls block the agent's path, i.e., if there is a wall in the direction the agent would have moved, the agent stays in the same place. So, for example, if the agent chooses LEFT in the START grid, it stays put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such
sequences can be found:
● RIGHT RIGHT UP UP RIGHT
● UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion. The moves are now noisy: 80% of the time the intended action works correctly, and 20% of the time the action taken causes the agent to move at right angles. For example, if the agent chooses UP, the probability of going UP is 0.8, while the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP). A small sketch of this noisy action model is given after the list below.
● There is a small reward at each step (it can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire state can have a reward of -1).
● Big rewards come at the end (good or bad).
● The goal is to maximize the sum of rewards.
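Here is a minimal Python sketch of the noisy transition model, assuming the 0.8/0.1/0.1 split described above; the helper and its sampling loop are purely illustrative.

```python
import random

# Perpendicular (right-angle) directions for each intended action
RIGHT_ANGLES = {
    "UP": ("LEFT", "RIGHT"),
    "DOWN": ("LEFT", "RIGHT"),
    "LEFT": ("UP", "DOWN"),
    "RIGHT": ("UP", "DOWN"),
}

def noisy_action(intended):
    """Return the action actually executed: 0.8 intended, 0.1 for each right angle."""
    r = random.random()
    if r < 0.8:
        return intended
    first, second = RIGHT_ANGLES[intended]
    return first if r < 0.9 else second

# Sample the outcome of choosing UP many times; the fractions approach 0.8 / 0.1 / 0.1
samples = [noisy_action("UP") for _ in range(10000)]
print({a: samples.count(a) / len(samples) for a in ("UP", "LEFT", "RIGHT")})
```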
Fig: Q Learning
– Training systems that issue custom instructions and materials tailored to the requirements of students (an example application of RL).
Conclusion:
Reinforcement learning helps us discover which actions yield the highest reward over the long run. Realistic environments can be partially observable and non-stationary as well. RL is not very useful when you already have enough data to solve the problem using supervised learning. The main challenge of this method is that its parameters can affect the speed of learning.
Let's consider a problem where an agent can be in various states and can choose an action from a set of actions. Such problems are called sequential decision problems. An MDP is the mathematical framework which captures such a fully observable, non-deterministic environment with a Markovian transition model and additive rewards in which the agent acts. The solution to an MDP is an optimal policy, which refers to the choice of action for every state that maximizes the overall cumulative reward. Thus, the transition model that represents the agent's environment (when the environment is known) and the optimal policy which decides what action the agent needs to perform in each state are the required elements for training the agent to learn a specific behaviour.
Fig : Markov Decision Process
● Markov Property
● Bellman Equation
Agent: Software programs that make intelligent decisions; they are the learners in RL. These agents interact with the environment through actions and receive rewards based on their actions.
Environment: It is the demonstration of the problem to be solved. We can have a real-world environment or a simulated environment with which our agent will interact. Anything that the agent cannot change arbitrarily is considered to be part of the environment. In simple terms, actions can be any decisions we want the agent to learn, and states can be anything that is useful in choosing actions. We do not assume that everything in the environment is unknown to the agent; for example, reward calculation is considered to be part of the environment even though the agent knows a bit about how its reward is calculated as a function of its actions and the states in which they are taken. This is because rewards cannot be arbitrarily changed by the agent. Sometimes the agent might be fully aware of its environment but still find it difficult to maximize the reward, just as we might know how to play a Rubik's cube but still be unable to solve it. So we can safely say that the agent-environment relationship represents the limit of the agent's control and not of its knowledge.
Transition Probability: The probability that the agent will move from one state to another is called the transition probability. The Markov property behind it can be written as
P[S[t+1] | S[t]] = P[S[t+1] | S[1], S[2], ..., S[t]]
where S[t] denotes the current state of the agent and S[t+1] denotes the next state. What this equation means is that the transition from state S[t] to S[t+1] is entirely independent of the past: the RHS of the equation means the same as the LHS when the system has the Markov property. Intuitively, our current state already captures the information of the past states.
As we now know about transition probability, we can define the state transition probability as follows: for a Markov state s = S[t] and any successor state s' = S[t+1], the state transition probability is given by
P(s, s') = P[S[t+1] = s' | S[t] = s]
These probabilities form the state transition probability matrix P, in which each row represents the probability of moving from an original (starting) state to each successor state. The sum of each row is equal to 1.
A Markov process is a memoryless random process, i.e., a sequence of random states S[1], S[2], ..., S[n] with the Markov property. So it is basically a sequence of states with the Markov property. It can be defined using a set of states (S) and a transition probability matrix (P); the dynamics of the environment can be fully defined using these two.
The edges of the chain denote transition probabilities. Let's take some samples from this chain. Suppose we are sleeping; according to the probability distribution there is a 0.6 chance that we will run, a 0.2 chance that we sleep more, and a 0.2 chance that we will eat ice cream. Similarly, we can think of other sequences that we can sample from this chain.
In any two sampled sequences we get a random set of states (e.g., Sleep, Ice-cream, Sleep) every time we run the chain. Hopefully it is now clear why a Markov process is called a random sequence of states.
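As a small illustration, the following Python sketch samples sequences from such a chain. Only the probabilities out of the Sleep state (0.2 sleep, 0.6 run, 0.2 ice-cream) come from the text above; the remaining rows of the transition matrix are assumptions added just so the chain can be run end to end.

```python
import random

# Transition probability matrix P for the Sleep / Run / Ice-cream chain.
P = {
    "Sleep":     {"Sleep": 0.2, "Run": 0.6, "Ice-cream": 0.2},   # from the text
    "Run":       {"Sleep": 0.5, "Run": 0.3, "Ice-cream": 0.2},   # assumed for illustration
    "Ice-cream": {"Sleep": 0.7, "Run": 0.2, "Ice-cream": 0.1},   # assumed for illustration
}

def sample_chain(start, steps):
    """Sample a random sequence of states from the Markov chain."""
    state, sequence = start, [start]
    for _ in range(steps):
        next_states = list(P[state])
        weights = list(P[state].values())
        state = random.choices(next_states, weights=weights, k=1)[0]
        sequence.append(state)
    return sequence

print(sample_chain("Sleep", 5))   # e.g. ['Sleep', 'Run', 'Sleep', 'Ice-cream', ...]
```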
Before moving on to the Markov reward process, let's look at some important concepts that will help us understand MRPs.
Reward and Returns:
Rewards are the numerical values that the agent receives on performing some action
at some state(s) in the environment. The numerical value can be positive or negative based
on the actions of the agent.
r[t+1] is the reward received by the agent at time step t[0] while performing an action (a) to move from one state to another. Similarly, r[t+2] is the reward received by the agent at time step t[1] by performing an action to move to another state, and r[T] is the reward received by the agent at the final time step by performing an action to move to another state.
Episodic Tasks: These are tasks that have a terminal state (end state); we can say they have a finite number of states. For example, in racing games, we start the game (start the race) and play it until the game is over (the race ends!). This is called an episode. Once we restart the game, it starts from an initial state; hence, every episode is independent.
Continuous Tasks : These are the tasks that have no ends i.e. they don’t have any terminal
state. These types of tasks will never end. For example, Learning how to code!
Now, it's easy to calculate the returns from episodic tasks as they eventually end, but what about continuous tasks, which go on forever? Their returns would sum up to infinity! So how do we define returns for continuous tasks?
Discount Factor (ɤ): It determines how much importance is given to immediate versus future rewards, and it helps us avoid an infinite return in continuous tasks. Its value lies between 0 and 1. A value close to 0 means more importance is given to the immediate reward, and a value close to 1 means more importance is given to future rewards. In practice, an agent with a discount factor of 0 will never learn anything beyond the immediate reward, while a discount factor of 1 keeps adding future rewards, which may lead to infinity. Therefore, the optimal value for the discount factor typically lies between 0.2 and 0.8.
So, we can define the return using the discount factor as follows (let's call this equation 1, as we will use it later for deriving the Bellman equation):
G[t] = r[t+1] + ɤ·r[t+2] + ɤ²·r[t+3] + ... = Σ (k ≥ 0) ɤᵏ·r[t+k+1]
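A short sketch of equation 1, computing the discounted return for a made-up reward sequence (the rewards and the ɤ value are illustrative only):

```python
def discounted_return(rewards, gamma):
    """G[t] = r[t+1] + gamma*r[t+2] + gamma^2*r[t+3] + ...   (equation 1)"""
    G = 0.0
    for k, r in enumerate(rewards):
        G += (gamma ** k) * r
    return G

rewards = [1, 0, 0, 2, 1]                      # hypothetical rewards received after time step t
print(discounted_return(rewards, gamma=0.9))   # 1 + 0 + 0 + 2*0.9**3 + 1*0.9**4
```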
Let's understand it with an example. Suppose you live at a place where you face water scarcity, and someone tells you that he will give you 100 litres of water (assume, please!) over each of the next 15 hours, discounted hour by hour as a function of some parameter (ɤ). Let's look at two possibilities:
Possibility 1 (ɤ close to 1): The decrease per hour is not very significant, so it is still worth waiting until the 15th hour. This means that we are also interested in future rewards. So, if the discount factor is close to 1, we will make an effort to go to the end, as the rewards remain of significant importance.
Possibility 2 (ɤ close to 0): We are more interested in early rewards, as the rewards get significantly smaller with every hour. We might not want to wait until the end (the 15th hour), as it would be worthless. So, if the discount factor is close to zero, immediate rewards are more important than future ones.
Which regime is appropriate depends on the task we want to train an agent for. Suppose that in a chess game the goal is to defeat the opponent's king. If we give too much importance to immediate rewards, such as a reward for capturing any opponent piece, the agent will learn to pursue these sub-goals even if its own pieces are captured as a result. So, in this task, future rewards are more important. In other tasks, we might prefer immediate rewards, as in the water example we saw earlier.
Markov Reward Process:
Till now we have seen how a Markov chain defines the dynamics of an environment using a set of states (S) and a transition probability matrix (P). But we know that reinforcement learning is all about the goal of maximizing reward. So let's add rewards to our Markov chain. This gives us the Markov Reward Process.
Markov Reward Process: As the name suggests, MRPs are Markov chains with a value judgement; basically, we get a value from every state our agent is in. The reward function of an MRP is
R(s) = E[ r[t+1] | S[t] = s ]
which tells us the immediate (expected) reward we get from a particular state S[t]. As we will see next, the aim is to maximize these rewards from each state our agent is in; in simple terms, to maximize the cumulative reward we get from the states.
A Markov Reward Process is therefore a tuple (S, P, R, ɤ) where:
● S is a set of states,
● P is the state transition probability matrix,
● R is the reward function, and
● ɤ is the discount factor.
Now, let’s develop our intuition for Bellman Equation and Markov Decision Process.
Policy Function and Value Function:
Value Function determines how good it is for the agent to be in a particular state.
Of course, to determine how good it will be to be in a particular state it must depend on some
actions that it will take. This is where policy comes in. A policy defines what actions to
perform in a particular state s.
Policy Function:
A policy π is a distribution over actions given a state:
π(a | s) = P[ A[t] = a | S[t] = s ]
Now, how do we find the value of a state? The value of a state s when the agent is following a policy π, denoted vπ(s), is the expected return starting from s and following the policy π for the next states until we reach the terminal state. We can formulate this as follows (this function is also called the state-value function):
Value Function:
vπ(s) = Eπ[ G[t] | S[t] = s ]
This equation gives us the expected return starting from state s and going to successor states thereafter, following the policy π. One thing to note is that the returns we get are stochastic, whereas the value of a state is not: it is the expectation of the returns starting from state s. Also note that the value of the terminal state (if there is any) is zero. Let's look at an example:
Example
Suppose our start state is Class 2, and we move to Class 3, then Pass, then Sleep. In short: Class 2 > Class 3 > Pass > Sleep.
Bellman Equation helps us to find optimal policies and value functions. We know
that our policy changes with experience so we will have different value functions according
to different policies. The optimal value function is one that gives maximum value
compared to all other value functions.
The Bellman equation states that the value function can be decomposed into two parts: the immediate reward r[t+1], and the discounted value of the successor state, ɤ·v(S[t+1]).
Suppose there is a robot in some state (s) and it then moves from this state to some other state (s'). Now, the question is: how good was it for the robot to be in state s? Using the Bellman equation, we can say that it is the expectation of the reward it got on leaving state s plus the value of the state s' it moved to.
We want to know the value of state s. The value of state s is the reward we get upon leaving that state, plus the discounted value of each state we may land in multiplied by the transition probability of moving into it:
v(s) = R(s) + ɤ · Σ (over s') P(s, s')·v(s')
where v(s) is the value of the state we were in, which is equal to the immediate reward plus the discounted value of the next state multiplied by the probability of moving into that state.
In matrix form this is v = R + ɤPv, which can be solved directly as v = (I − ɤP)⁻¹R. The running time complexity of this computation is O(n³). Therefore, it is clearly not a practical solution for solving larger MRPs (the same holds for MDPs). Later we will look at more efficient methods such as dynamic programming (value iteration and policy iteration), Monte Carlo methods, and TD learning.
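The O(n³) cost comes from solving that linear system directly. A minimal numpy sketch, using a made-up 3-state MRP (its transition matrix and rewards are illustrative, not taken from the text):

```python
import numpy as np

# Direct solution of the Bellman equation for an MRP: v = R + gamma * P v,
# i.e. v = (I - gamma * P)^(-1) R.  The 3-state chain below is made up.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])      # each row sums to 1 (the last state is absorbing)
R = np.array([1.0, 2.0, 0.0])        # immediate reward in each state

v = np.linalg.solve(np.eye(3) - gamma * P, R)   # the O(n^3) step
print(v)                                        # value of each state under this chain
```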
V(s3) = max[R(s,a) + γV(s')]; here V(s') = 0 because there is no further state to move to, and R(s,a) = 1 for moving into the goal, so V(s3) = max[1] = 1.
V(s2) = max[R(s,a) + γV(s')]; here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0, because there is no reward at this state. V(s2) = max[0.9(1)] = 0.9.
V(s1) = max[R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward at this state either. V(s1) = max[0.9(0.9)] = 0.81.
V(s5) = max[R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, so V(s5) = max[0.9(0.81)] = 0.73.
V(s9) = max[R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, so V(s9) = max[0.9(0.73)] = 0.66.
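The chain of calculations above can be reproduced with a few lines of Python; the path ordering (s3 is next to the goal, then s2, s1, s5, s9) and the single reward of 1 for moving from s3 into the goal are read off from the numbers used above.

```python
# Back up the values along the path: V(s) = max[R(s,a) + gamma * V(s')], gamma = 0.9.
gamma = 0.9
path = ["s3", "s2", "s1", "s5", "s9"]   # ordered from closest to farthest from the goal

V, v_next = {}, 0.0                     # v_next starts at 0: nothing beyond the goal
for s in path:
    r = 1.0 if s == "s3" else 0.0       # reward only when moving from s3 into the goal
    V[s] = round(r + gamma * v_next, 2)
    v_next = V[s]

print(V)   # {'s3': 1.0, 's2': 0.9, 's1': 0.81, 's5': 0.73, 's9': 0.66}
```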
A Markov Decision Process is a tuple (S, A, P, R, ɤ) where:
● S is a set of states,
● A is the set of actions the agent can choose to take,
● P is the state transition probability matrix,
● R is the reward function, and
● ɤ is the discount factor.
Reward Function: R(s) = E[ r[t+1] | S[t] = s ]
Reward Function w.r.t. action: R(s, a) = E[ r[t+1] | S[t] = s, A[t] = a ]
Till now we have talked about getting a reward (r) when our agent goes through a set of states (s) following a policy π. In a Markov Decision Process (MDP), the policy is the mechanism for taking decisions, so now we have a mechanism that chooses which action to take.
Policies in an MDP depend only on the current state, not on the history; that is the Markov property. So, the current state we are in characterizes the history.
We have already seen how good it is for the agent to be in a particular state (the state-value function). Now let's see how good it is to take a particular action a in state s while following a policy π (the action-value function):
qπ(s, a) = Eπ[ G[t] | S[t] = s, A[t] = a ]
This function specifies how good it is for the agent to take action a in state s under policy π; basically, it tells us the value of performing a certain action in a state with the policy π.
Now we can see that the agent no longer simply follows transition probabilities; it has choices to make, like choosing to watch Netflix or to code and debug after waking up. Of course, the actions of the agent are defined with respect to some policy π, and it will be rewarded accordingly.
MDP uses Markov property, and to better understand the MDP, we need to learn about it.
Markov Property:
It says that "If the agent is present in the current state S1, performs an action a1
and move to the state s2, then the state transition from s1 to s2 only depends on the current
state and future action and states do not depend on past actions, rewards, or states."
Or, in other words, as per Markov Property, the current state transition does not
depend on any past action or state. Hence, MDP is an RL problem that satisfies the Markov
property. Such as in a Chess game, the players only focus on the current state and do not
need to remember past actions or states.
Finite MDP:
A finite MDP is when there are finite states, finite rewards, and finite actions. In RL, we
consider only the finite MDP.
Markov Process:
Markov Process is a memoryless process with a sequence of random states S1, S2, .....,
St that uses the Markov Property. Markov process is also known as Markov chain, which is a
tuple (S, P) on state S and transition function P. These two components (S and P) can define
the dynamics of the system.
o Q-Learning:
o Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal difference learning methods are ways of comparing temporally successive predictions.
o It learns the action-value function Q(s, a), which tells us how good it is to take action "a" in a particular state "s".
o The flowchart below explains the working of Q-learning:
State Action Reward State Action (SARSA):
SARSA is an on-policy TD method that updates Q(s, a) from the observed quintuple (s, a, r, s', a') using
Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]   (α is the learning rate)
s, a: original state and action
r: reward observed while following the states
s', a': new state-action pair.
Q-Learning Explanation:
o Q-learning is a popular model-free reinforcement learning algorithm based on the
Bellman equation.
o The main objective of Q-learning is to learn a policy that informs the agent what actions should be taken to maximize the reward under what circumstances.
o It is an off-policy RL method that attempts to find the best action to take in the current state.
o The goal of the agent in Q-learning is to maximize the value of Q.
o The value used in Q-learning can be derived from the Bellman equation. Consider the Bellman equation given below:
V(s) = max over a of [ R(s, a) + γ Σ (over s') P(s'|s, a)·V(s') ]
In this equation we have various components: the reward, the discount factor (γ), the transition probability, and the end state s'. But no Q-value appears yet, so first consider the image below:
In the above image, we can see an agent who has three value options: V(s1), V(s2), V(s3). As this is an MDP, the agent only cares about the current state and the future states. The agent can go in any direction (up, left, or right), so it needs to decide where to go for the optimal path. The agent will move according to the probabilities and change its state, but if we want some exact control over the moves, we need to work in terms of Q-values. Consider the image below:
Q represents the quality of the actions at each state. So instead of using a value for each state, we use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value the agent takes its next move. The Bellman equation can be used to derive the Q-value. To perform any action, the agent will get a reward R(s, a) and will also end up in a certain state, so the Q-value equation will be:
Q(s, a) = R(s, a) + γ Σ (over s') P(s, a, s')·V(s')
The Q stands for quality in Q-learning, which means it specifies the quality of an action
taken by the agent.
Q-table:
A Q-table or matrix is created while performing the Q-learning. The table follows the
state and action pair, i.e., [s, a], and initializes the values to zero. After each action, the table
is updated, and the q-values are stored within the table.
The RL agent uses this Q-table as a reference table to select the best action based on the
q-values.
1. The robot loses 1 point at each step. This is done so that the robot takes the shortest
path and reaches the goal as fast as possible.
2. If the robot steps on a mine, the point loss is 100 and the game ends.
Now, the obvious question is: How do we train a robot to reach the end goal with the
shortest path without stepping on a mine?
So, how do we solve this?
There are four possible actions at each non-edge tile. When the robot is in a state, it can move up, down, right, or left.
Each Q-table score will be the maximum expected future reward that the robot will get if it
takes that action at that state. This is an iterative process, as we need to improve the Q-Table
at each iteration.
Q-function:
The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a). The update used to fill in the table is
New Q(s, a) = Q(s, a) + α [ R(s, a) + γ · max over a' of Q(s', a') − Q(s, a) ]
where α is the learning rate. Using the above function, we get the values of Q for the cells in the table. When we start, all the values in the Q-table are zeros.
Each of the colored boxes is one step. Let’s understand each of these steps in detail.
Step 1: initialize the Q-Table
We will first build a Q-table. There are n columns, where n= number of actions. There are m
rows, where m= number of states. We will initialise the values at 0.
In our robot example, we have four actions (a=4) and five states (s=5). So we will build a
table with four columns and five rows.
We will choose an action (a) in the state (s) based on the Q-Table. But, as mentioned earlier,
when the episode initially starts, every Q-value is 0.
So now the concept of exploration and exploitation trade-off comes into play.
We’ll use something called the epsilon greedy strategy. In the beginning, the epsilon rates
will be higher. The robot will explore the environment and randomly choose actions. The
logic behind this is that the robot does not know anything about the environment. As the
robot explores the environment, the epsilon rate decreases and the robot starts to exploit the
environment. During the process of exploration, the robot progressively becomes more
confident in estimating the Q-values.
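A minimal sketch of the epsilon-greedy strategy with a decaying epsilon; the table size, decay rate and episode count are illustrative assumptions, not values from the text.

```python
import random
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    n_actions = Q.shape[1]
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore: pick a random action
    return int(np.argmax(Q[state]))          # exploit: pick the best Q-value so far

Q = np.zeros((5, 4))                         # 5 states x 4 actions, as in the robot example
epsilon, min_epsilon, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    action = epsilon_greedy(Q, state=0, epsilon=epsilon)   # (episode logic omitted)
    epsilon = max(min_epsilon, epsilon * decay)            # explore less as learning proceeds
```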
For the robot example, there are four actions to choose from: up, down, left, and
right. We are starting the training now — our robot knows nothing about the environment. So
the robot chooses a random action, say right.
We can now update the Q-values for being at the start and moving right using the Bellman
equation.
Now we have taken an action and observed an outcome and reward. We need to update the function Q(s, a).
In the case of the robot game, to reiterate the scoring/reward structure is:
● power = +1
● mine = -100
● end = +100
We will repeat this again and again until the learning is stopped. In this way the Q-Table will
be updated.
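Putting the pieces together, here is a hedged Q-learning sketch for a simplified, made-up 1-D version of the robot game (states laid out in a row with the mine behind the start and the end goal ahead; this layout is an assumption for illustration, not the grid described above). It uses the scoring listed earlier: -1 per step, -100 for the mine, +100 for the end.

```python
import random
import numpy as np

n_states, n_actions = 5, 2               # assumed 1-D layout; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))      # Step 1: initialise the Q-table to zero

def step(s, a):
    """Made-up environment: state 0 is the mine, state 4 is the end goal."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    if s_next == 0:
        return s_next, -100, True        # stepped on the mine: lose 100 points, game over
    if s_next == n_states - 1:
        return s_next, +100, True        # reached the end goal: +100 points
    return s_next, -1, False             # every ordinary step costs 1 point

for episode in range(2000):
    s, done = 1, False                   # each episode starts next to the mine
    while not done:
        # epsilon-greedy action selection (see the strategy described above)
        a = random.randrange(n_actions) if random.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Bellman-based update: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q)   # the learned Q-values should favour moving right, away from the mine
```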
What is meant by passive and active reinforcement learning and how do we compare
the two?
Both active and passive reinforcement learning are types of RL. In case of passive RL, the
agent’s policy is fixed which means that it is told what to do. In contrast to this, in active RL,
an agent needs to decide what to do as there’s no fixed policy that it can act on. Therefore,
the goal of a passive RL agent is to execute a fixed policy (sequence of actions) and evaluate
it while that of an active RL agent is to act and learn an optimal policy.
Passive Learning:
As the goal of the agent is to evaluate how good the given policy is, the agent needs to learn the expected utility Uπ(s) for each state s.
Fig 21.1: (a) A policy π for the 4 x 3 world; this policy happens to be optimal with rewards of R(s) = -0.04 in the nonterminal states and no discounting. (b) The utilities of the states in the 4 x 3 world, given policy π.
The agent executes a set of trials in the environment using its policy π. In each trial, the agent
starts in state (1, 1) and experiences a sequence of state transitions until it reaches one of the
terminal states, (4,2) or (4,3). Its percepts supply both the current state and the reward
received in that state. Typical trials might look like this:
The utility of a state is the expected (discounted) sum of rewards obtained when policy π is followed:
Uπ(s) = E[ Σ (t = 0 to ∞) ϒᵗ R(S[t]) ]
where R(s) is the reward for a state, S[t] (a random variable) is the state reached at time t when executing policy π, and S[0] = s. We will include a discount factor ϒ in all of our equations, but for the 4 x 3 world we will set ϒ = 1.
This can be done in three ways.
1. Direct Utility Estimation:
In this method, the agent executes a sequence of trials or runs (sequences of state-action transitions that continue until the agent reaches the terminal state). Each trial gives a sample value, and the agent estimates the utility based on the sample values; the utilities can be calculated as running averages of the sample values. The main drawback is that this method makes the wrong assumption that state utilities are independent, while in reality they are Markovian. It is also slow to converge.
Suppose we have a 4 x 3 grid as the environment in which the agent can move either Left, Right, Up or Down (the set of available actions). An example of a run:
Total reward starting at (1,1) = 0.72
2. Adaptive Dynamic Programming (ADP):
ADP is a smarter method than direct utility estimation. It runs trials to learn the model of the environment and estimates the utility of a state as the sum of the reward for being in that state and the expected discounted reward of being in the next state:
Uπ(s) = R(s) + γ Σ (over s') P(s'|s, π(s))·Uπ(s')
where R(s) is the reward for being in state s, P(s'|s, π(s)) is the transition model, γ is the discount factor and Uπ(s') is the utility of being in state s'. This system of equations can be solved using the value-iteration algorithm. The algorithm converges fast but can become quite costly to compute for large state spaces. ADP is a model-based approach and requires the transition model of the environment. A model-free alternative is Temporal Difference Learning.
3. Temporal Difference (TD) Learning:
TD learning does not require the agent to learn the transition model. The update occurs between successive states, and the agent only updates states that are directly affected:
Uπ(s) ← Uπ(s) + α ( R(s) + γ Uπ(s') − Uπ(s) )
where α is the learning rate.
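A minimal sketch of a single TD(0) update on the 4 x 3 world; the learning rate α = 0.1 is an assumed value, while the -0.04 reward, the start state (1, 1) and ϒ = 1 come from the description above.

```python
def td_update(U, s, r, s_next, alpha=0.1, gamma=1.0):
    """TD(0): U(s) <- U(s) + alpha * (r + gamma * U(s') - U(s))."""
    u_s = U.get(s, 0.0)
    U[s] = u_s + alpha * (r + gamma * U.get(s_next, 0.0) - u_s)
    return U

# One observed transition under the fixed policy: from (1, 1) to (1, 2) with reward -0.04
U = td_update({}, (1, 1), -0.04, (1, 2))
print(U[(1, 1)])   # -0.004 after this single update
```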
Active Learning:
As the goal of an active agent is to learn an optimal policy, the agent needs to learn the expected utility of each state and update its policy.
1. ADP with an exploration function:
This can be done using a passive ADP agent and then, using value or policy iteration, learning the optimal actions. But this approach results in a greedy agent. Hence, we use an approach that gives higher weights to unexplored actions and lower weights to actions with lower utilities:
U⁺(s) ← R(s) + γ max over a of f( Σ (over s') P(s'|s, a)·U⁺(s'), N(s, a) )
where f(u, n) is the exploration function, which increases with the expected value u and decreases with the number of tries n.
2. Q-Learning:
Q-learning is a TD learning method which does not require the agent to learn the transition model; instead, it learns the Q-value function Q(s, a) directly, using the update
Q(s, a) ← Q(s, a) + α ( R(s) + γ max over a' of Q(s', a') − Q(s, a) ).
POLICY SEARCH:
Suppose you are in a new town and you have no map or GPS, and you need to reach downtown. You can try to assess your current position relative to your destination, as well as the effectiveness (value) of each direction you take. You can think of this as computing the value function. Or you can ask a local, and he tells you to go straight, turn left when you see a fountain, and continue until you reach downtown. He has given you a policy to follow. Naturally, in this case, following the given policy is much less complicated than computing the value function on your own.
In another example, consider that you are managing an inventory, and you decide that when the count of an item drops below a certain limit you issue a buy order to refill your stock. This is a policy that is far simpler than studying the activity of the customers and their buying habits and preferences in order to predict the impact on your stock. Surely enough, value functions will lead to determining a policy, but there are other methods that can learn a policy that selects actions using parameters, without consulting a value function (this is not entirely accurate, since a value function may still be needed to increase accuracy). So the main idea is to be able to determine, at a state (s), which action to take in order to maximize the reward.
The way to achieve this objective is to fine-tune a vector of parameters, denoted 𝜽, in order to select the best action to take under policy 𝜋. The policy is written 𝜋(a|s, 𝜽) = Pr{At = a | St = s, 𝜽t = 𝜽}, which means that the policy 𝜋 is the probability of taking action a when in state s and the parameters are 𝜽.
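As a small illustration of 𝜋(a|s, 𝜽), here is a sketch of a linear softmax policy; the feature vectors, parameter values and action names are made-up assumptions.

```python
import numpy as np

def softmax_policy(theta, features, actions):
    """pi(a | s, theta): probability of each action, given per-action features of state s."""
    prefs = np.array([theta @ features[a] for a in actions])   # linear preference per action
    exp = np.exp(prefs - prefs.max())                          # subtract max for numerical stability
    return exp / exp.sum()

theta = np.array([0.5, -0.2, 0.1])                             # parameter vector (illustrative)
features = {"left": np.array([1.0, 0.0, 1.0]),                 # phi(s, left), made up
            "right": np.array([0.0, 1.0, 1.0])}                # phi(s, right), made up
probs = softmax_policy(theta, features, ["left", "right"])
print(dict(zip(["left", "right"], probs)))                     # a stochastic policy over the actions
```

Tuning 𝜽 (for example with a policy-gradient method) shifts these probabilities toward actions that earn more reward.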
Advantages:
● Better convergence properties
● Effective in high-dimensional or continuous action spaces. When the space is large, memory and computation consumption grow rapidly; policy-based RL avoids this because the objective is to learn a set of parameters that is far smaller than the space itself.
● Can learn stochastic policies. Stochastic policies are better than deterministic policies in some settings, especially in two-player games, where a player who acts deterministically allows the other player to develop countermeasures in order to win.
Disadvantages:
● Typically converge to a local rather than a global optimum.
● Evaluating a policy is typically inefficient and has high variance. Policy-based RL has high variance, but there are techniques to reduce it.
Stochastic Policy:
At first it is important to note that stochastic does not mean randomness in all states; a policy can be stochastic only in the states where it makes sense. Usually, maximizing reward leads to a deterministic policy, but in some cases a deterministic policy is not a good fit for the problem. For example, in any two-player game, playing deterministically means the other player will be able to come up with countermeasures in order to win all the time. In the Rock-Paper-Scissors game, if we play deterministically, meaning the same shape every time, then the other player can easily counter our policy and win every game. So in this game the optimal policy is stochastic, and it will be better than any deterministic one.
Self-driving cars:
Various papers have proposed Deep Reinforcement Learning for autonomous
driving. In self-driving cars, there are various aspects to consider, such as speed limits at
various places, drivable zones, avoiding collisions — just to mention a few. Some of the
autonomous driving tasks where reinforcement learning could be applied include trajectory
optimization, motion planning, dynamic pathing, controller optimization, and scenario-based
learning policies for highways.
For example, parking can be achieved by learning automatic parking policies. Lane
changing can be achieved using Q-Learning while overtaking can be implemented by
learning an overtaking policy while avoiding collision and maintaining a steady speed
thereafter.
AWS DeepRacer is an autonomous racing car that has been designed to test out RL in a
physical track. It uses cameras to visualize the runway and a reinforcement learning model to
control the throttle and direction.
Wayve.ai has successfully applied reinforcement learning to training a car on how to drive in
a day. They used a deep reinforcement learning algorithm to tackle the lane following task.
Their network architecture was a deep network with 4 convolutional layers and 3 fully
connected layers. The example below shows the lane following task. The image in the middle
represents the driver’s perspective.
What is NLP:
NLP stands for Natural Language Processing, a field at the intersection of computer science, human language, and artificial intelligence. It is the technology used by machines to understand, analyse, manipulate, and interpret human languages. It helps developers organize knowledge for performing tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.
History of NLP:
(1940-1960) - Focused on Machine Translation (MT)
Work on natural language processing started in the 1940s.
1948 - In 1948, the first recognisable NLP application was introduced at Birkbeck College, London.
1950s - In the 1950s, there was a conflicting view between linguistics and computer science. Chomsky published his first book, Syntactic Structures, and claimed that language is generative in nature.
In 1957, Chomsky also introduced the idea of Generative Grammar, which gives rule-based descriptions of syntactic structures.
(1960-1980) - Flavoured with Artificial Intelligence (AI)
In the year 1960 to 1980, the key developments were:
Augmented Transition Networks (ATN):
An Augmented Transition Network (ATN) is a finite-state machine that is capable of recognizing regular languages.
Case Grammar:
Case Grammar was developed by the linguist Charles J. Fillmore in 1968. Case Grammar uses languages such as English to express the relationship between nouns and verbs by using prepositions.
In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.
For example: "Neha broke the mirror with the hammer". In this example, case grammar identifies Neha as an agent, the mirror as a theme, and the hammer as an instrument.
In the year 1960 to 1980, key systems were:
SHRDLU
SHRDLU is a program written by Terry Winograd in 1968-70. It lets users communicate with the computer and move objects. It can handle instructions such as "pick up the green ball" and also answer questions like "What is inside the black box?". The main importance of SHRDLU is that it shows that syntax, semantics, and reasoning about the world can be combined to produce a system that understands natural language.
LUNAR
LUNAR is the classic example of a natural language database interface system; it used ATNs and Woods' Procedural Semantics. It was capable of translating elaborate natural language expressions into database queries and handled 78% of requests without errors.
1980 - Current
Until 1980, natural language processing systems were based on complex sets of hand-written rules. After 1980, NLP introduced machine learning algorithms for language processing.
In the early 1990s, NLP started growing faster and achieved good accuracy, especially for English grammar. In the 1990s, electronic text collections were also introduced, which provided a good resource for training and examining natural language programs. Other factors include the availability of computers with fast CPUs and more memory. The major factor behind the advancement of natural language processing was the Internet.
Modern NLP consists of various applications, like speech recognition, machine translation, and machine text reading. When we combine all these applications, we allow artificial intelligence to gain knowledge of the world. Consider the example of Amazon Alexa: you can ask Alexa a question, and it will reply to you.
Advantages of NLP:
o NLP helps users to ask questions about any subject and get a direct response within
seconds.
o NLP offers exact answers to questions; it does not offer unnecessary or unwanted information.
o NLP helps computers to communicate with humans in their languages.
o It is very time efficient.
o Most companies use NLP to improve the efficiency of documentation processes and the accuracy of documentation, and to identify information in large databases.
Disadvantages of NLP:
A list of disadvantages of NLP is given below:
o NLP may not show context.
o NLP is unpredictable
o NLP may require more keystrokes.
o NLP is unable to adapt to new domains, and it has limited functionality; that is why NLP is usually built for a single, specific task.
Components of NLP:
There are the following two components of NLP -
1. Natural Language Understanding (NLU)
Natural Language Understanding (NLU) helps the machine to understand and analyse human
language by extracting the metadata from content such as concepts, entities, keywords,
emotion, relations, and semantic roles.
NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.
NLU involves the following tasks -
o It is used to map the given input into useful representation.
o It is used to analyze different aspects of the language.
2. Natural Language Generation (NLG)
Natural Language Generation (NLG) acts as a translator that converts the computerized data
into natural language representation. It mainly involves Text planning, Sentence planning,
and Text Realization.
NLU: the process of reading and interpreting language.
NLG: the process of writing or generating language.
Applications of NLP:
There are the following applications of NLP -
1. Question Answering
Question Answering focuses on building systems that automatically answer the questions
asked by humans in a natural language.
2. Spam Detection
Spam detection is used to detect unwanted e-mails getting to a user's inbox.
3. Sentiment Analysis
Sentiment Analysis is also known as opinion mining. It is used on the web to analyse the attitude, behaviour, and emotional state of the sender. This application is implemented through a combination of NLP (Natural Language Processing) and statistics, by assigning values to the text (positive, negative, or neutral) and identifying the mood of the context (happy, sad, angry, etc.).
4. Machine Translation
Machine translation is used to translate text or speech from one natural language to another
natural language.
Example: Google Translator
5. Spelling correction
Word processors such as Microsoft Word and presentation software such as PowerPoint use NLP for spelling correction.
6. Speech Recognition
Speech recognition is used for converting spoken words into text. It is used in applications,
such as mobile, home automation, video recovery, dictating to Microsoft Word, voice
biometrics, voice user interface, and so on.
7. Chatbot
Implementing chatbots is one of the important applications of NLP. Chatbots are used by many companies to provide chat-based customer services.
8. Information extraction
Information extraction is one of the most important applications of NLP. It is used for
extracting structured information from unstructured or semi-structured machine-readable
documents.
9. Natural Language Understanding (NLU)
It converts large sets of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate than raw natural language.
Note: When you are building a search engine for rock bands, you should not ignore the stop word "The" (think of band names such as "The Who").
Phases of NLP:
There are the following five phases of NLP:
1. Lexical and Morphological Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Ambiguity
There are the following three types of ambiguity -
o Lexical Ambiguity
Lexical ambiguity exists when a single word has two or more possible meanings within a sentence.
Example:
Manya is looking for a match.
In the above example, the word match is ambiguous: either Manya is looking for a partner (a match), or Manya is looking for a match in the sense of a cricket or other match.
o Syntactic Ambiguity
Syntactic Ambiguity exists in the presence of two or more possible meanings within the
sentence.
Example:
I saw the girl with the binoculars.
In the above example, did I have the binoculars, or did the girl have the binoculars?
o Referential Ambiguity
Referential Ambiguity exists when you are referring to something using the pronoun.
Example: Kiran went to Sunita. She said, "I am hungry."
In the above sentence, you do not know who is hungry, Kiran or Sunita.
NLP Libraries:
Scikit-learn: It provides a wide range of algorithms for building machine learning models in
Python.
Natural language Toolkit (NLTK): NLTK is a complete toolkit for all NLP techniques.
Pattern: It is a web mining module for NLP and machine learning.
TextBlob: It provides an easy interface to learn basic NLP tasks like sentiment analysis, noun
phrase extraction, or pos-tagging.
Quepy: Quepy is used to transform natural language questions into queries in a database
query language.
SpaCy: SpaCy is an open-source NLP library which is used for Data Extraction, Data
Analysis, Sentiment Analysis, and Text Summarization.
Gensim: Gensim works with large datasets and processes data streams.
Natural language has a very large vocabulary, whereas a computer language has a very limited vocabulary.
1. Text classification
Text classification is the process of categorizing text into groups of categories. Using NLP, text classification can automatically analyze text and then assign a set of predefined tags or categories based on its context. It is used for sentiment analysis, topic detection, and language detection. There are mainly three text classification approaches:
o Rule-based System,
o Machine-based System,
o Hybrid System.
In the rule-based approach, texts are separated into organized groups using a set of handcrafted linguistic rules. These handcrafted rules have users define lists of words that characterize each group. For example, words like Donald Trump and Boris Johnson would be categorized into politics, while names like LeBron James and Ronaldo would be categorized into sports.
A machine-based classifier learns to classify text based on past observations from data sets. The user data is pre-labelled as train and test data. The classifier learns the classification strategy from these previous inputs and keeps learning continuously. Machine-based classifiers use a bag of words for feature extraction, as in the sketch below.
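Here is a hedged sketch of a machine-based classifier using scikit-learn (listed among the NLP libraries earlier in this unit): a bag-of-words feature extractor followed by a Naive Bayes model. The tiny labelled dataset is made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["Donald Trump met Boris Johnson",
               "Parliament passed the new bill",
               "LeBron James scored 40 points",
               "Ronaldo signed for a new club"]
train_labels = ["politics", "politics", "sports", "sports"]

# Bag-of-words features feeding a simple probabilistic classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Boris Johnson gave a speech",
                     "LeBron James scored again"]))   # likely ['politics', 'sports'] on this toy data
```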
Matching :
It is the process of finding a measure of similarity between two text representations.
The relevance of a document is computed based on the following parameters:
1. TF: It stands for Term Frequency which is simply the number of times a given term
appears in that document.
2. IDF: It stands for Inverse Document Frequency which is a measure of the general
importance of the term.
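A small sketch of these two measures; the log(N / document frequency) form used for IDF is one common formulation (an assumption here, since the text only describes IDF informally), and the three documents are made up.

```python
import math

documents = [
    "reinforcement learning is learning by trial and error",
    "natural language processing is a part of artificial intelligence",
    "information retrieval ranks documents by relevance",
]

def tf(term, document):
    """TF: the number of times the term appears in the document."""
    return document.split().count(term)

def idf(term, documents):
    """IDF: log(N / number of documents containing the term), a common formulation."""
    df = sum(1 for d in documents if term in d.split())
    return math.log(len(documents) / df) if df else 0.0

print(tf("learning", documents[0]), idf("learning", documents))   # TF = 2, IDF = log(3/1)
```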
1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository that is typically a database. It concerns all data which can be stored in an SQL database in a table with rows and columns. Structured data has relational keys and can easily be mapped into pre-designed fields. Today, such data is the most commonly processed and the simplest to manage. Example: relational data.
2. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database (though this can be very hard for some kinds of semi-structured data), but semi-structured formats exist to save space. Example: XML data.
3. Unstructured data –
Unstructured data is data which is not organized in a predefined manner or does not have a predefined data model; thus it is not a good fit for a mainstream relational database. For unstructured data, there are alternative platforms for storing and managing it. Such data is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, text, media logs.
Property: Transaction management
● Structured data: matured transaction management and various concurrency techniques.
● Semi-structured data: transaction management adapted from the DBMS, not matured.
● Unstructured data: no transaction management and no concurrency.
Boolean Model
Boolean Model is the oldest model for Information Retrieval (IR). These models are based on
set theory and Boolean algebra, where
● Documents: Sets of terms
● Queries: Boolean expressions on terms
As a response to the query, the set of documents that satisfy the Boolean expression is retrieved.
The boolean model can be defined as:
D: It represents a set of words, i.e, the indexing terms present in a document. Here, each term
is either present (1) or absent (0) in the document.
Q: It represents a Boolean expression, where the terms are index terms and the operators are logical connectives:
● Logical product − AND,
● Logical sum − OR,
● Logical difference − NOT.
F: It represents a Boolean algebra over sets of terms as well as over sets of documents.
If we talk about relevance feedback, then in the Boolean IR model the relevance prediction can be defined as follows:
R: A document is predicted to be relevant to the query expression if and only if it satisfies the query expression.
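A tiny sketch of the Boolean model: documents as sets of terms, and a query expressed with AND / OR / NOT over those sets. The documents and the query are made up for illustration.

```python
docs = {
    "d1": {"information", "retrieval", "boolean", "model"},
    "d2": {"reinforcement", "learning", "reward"},
    "d3": {"information", "extraction", "learning"},
}

# Query: information AND (retrieval OR extraction) AND NOT reward
def query(terms):
    return ("information" in terms
            and ("retrieval" in terms or "extraction" in terms)
            and "reward" not in terms)

retrieved = [doc_id for doc_id, terms in docs.items() if query(terms)]
print(retrieved)   # ['d1', 'd3'] — the documents satisfying the Boolean expression
```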
Evaluation of IR Systems:
The two common effective measures for evaluating IR systems are as follows:
● Precision
● Recall
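A short sketch of how these two measures are computed (the retrieved and relevant document sets are made up): precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]     # what the IR system returned (illustrative)
relevant = ["d1", "d3", "d7"]            # what was actually relevant (illustrative)
print(precision_recall(retrieved, relevant))   # (0.5, 0.666...): 2/4 retrieved are relevant, 2/3 relevant found
```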
INFORMATION EXTRACTION:
Information Extraction is the process of parsing through unstructured data and
extracting essential information into more editable and structured data formats. For example,
consider we're going through a company’s financial information from a few documents.
Usually, we search for some required information when the data is digital or manually check
the same. But with information extraction NLP algorithms, we can automate the data
extraction of all required information such as tables, company growth metrics, and other
financial details from various kinds of documents (PDFs, Docs, Images etc.).
Below is a screenshot explaining how we can extract information from an Invoice.
#1 Information Collection:
Firstly, we’ll need to collect the data from different sources to build an information extraction
model. Usually, we see documents on emails, cloud drives, scanned copies, computer
software, and many other sources for business. Hence, we’ll have to write different scripts to
collect and store information in one place. This is usually done by either using APIs on the
web or building RPA (Robotic Process Automation) pipelines.
#2 Process Data:
After we collect the data, the next step is to process it. Usually, documents are of two types: electronically generated (editable) documents and non-electronically generated (scanned) documents. Electronically generated documents can be sent directly into the preprocessing pipelines, while for scanned copies we first need OCR to read all the data from the images and then send it into the preprocessing pipelines. We can either use open-source tools like Tesseract or online services like Nanonets or Textract. Once all the data is in editable or electronic format, we can apply preprocessing steps like tokenization and POS tagging and then use data loaders to load the data into the NLP information extraction models.
#3 Choosing the Right Model:
As discussed in the above sections, choosing a suitable model mostly depends on the type of
data we’re working with. Today, there are several state-of-the-art models we could rely on.
Below are some of the frequently used open-source models:
1. Named Entity Recognition on CoNLL 2003 (English)
2. Key Information Extraction From Documents: Evaluation And Generator
3. Deep Reader: Information extraction from Document images via relation extraction
and Natural Language
These are some of the information extraction models. However, they are trained on particular datasets; if we utilise them on our own data, we will need to experiment with the hyperparameters and fine-tune the models accordingly.
The other way is to utilize pre-trained models and fine-tune them on our data. For information extraction from text in particular, BERT models are widely used; a hedged sketch using a pre-trained pipeline is given below.
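As a hedged example of using a pre-trained model for information extraction, here is a sketch with spaCy (one of the libraries listed earlier in this unit). The sentence is made up, and the small English model en_core_web_sm must be downloaded beforehand (python -m spacy download en_core_web_sm).

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # pre-trained English pipeline with a NER component
text = "Acme Corp reported revenue of $12.5 million for the quarter ending March 2023."

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)      # e.g. Acme Corp ORG, $12.5 million MONEY, March 2023 DATE
```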