REINFORCEMENT LEARNING
Introduction
Reinforcement learning involves no supervisor; only a reward signal tells the agent whether it is doing well or not. Time is a key component in RL: the process is sequential, with delayed feedback, and each action the agent takes affects the next data it receives.
It is a type of machine learning technique where a computer agent learns to perform a
task through repeated trial and error interactions with a dynamic environment. This learning
approach enables the agent to make a series of decisions that maximize a reward metric for
the task without human intervention and without being explicitly programmed to achieve the
task.
Through a series of trial-and-error interactions, an agent keeps learning continuously in an interactive environment from its own actions and experiences. Its only goal is to find a suitable action model that maximizes the agent's total cumulative reward. It learns via interaction and feedback.
How does reinforcement learning work? Well, let me explain with an example.
Fig: Reinforcement Learning Example
You can see a dog and its master. Imagine you are training your dog to fetch a stick. Each time the dog fetches the stick successfully, you offer it a treat (say, a bone). Eventually, the dog understands the pattern: whenever the master throws a stick, it should fetch it as quickly as it can to earn the reward (a bone) from the master in less time.
Fig: Terminologies in RL
Reward – For each action selected by the agent, the environment gives a reward. It is usually a scalar value and is nothing but feedback from the environment.
Value Function – The value of a state is the total reward the agent can expect to accumulate starting from that state while executing the policy.
Model – Not every RL agent uses a model of its environment. The model is the agent's view of the environment: it maps state-action pairs to probability distributions over next states.
In supervised learning, the model is trained with a training dataset that has a correct answer key. Decisions are made on the given input because all the data required to train the machine is available up front. The decisions are independent of each other, so each decision is represented by its own label. Example: object recognition.
In reinforcement learning, there is no answer key; the reinforcement agent decides what to do in order to perform the required task. As no training dataset is available, the agent has to learn from its own experience. It is all about making decisions sequentially: in simpler words, the output depends on the current input state, and the next input depends on the output of the previous step. Decisions are therefore dependent, and labels are given to sequences of dependent decisions. Example: a chess game.
An RL problem can best be explained through games. Let's take the game of PacMan, where the goal of the agent (PacMan) is to eat the food in the grid while avoiding the ghosts on its way. In this case, the grid world is the interactive environment in which the agent acts. The agent receives a reward for eating food and a punishment if it gets killed by a ghost (loses the game). The states are the locations of the agent in the grid world, and maximizing the total cumulative reward corresponds to the agent winning the game.
So far we have said that the agent needs to find the “right” action. The right action
depends on the rewards.
Reward: The reward R[t] is a scalar feedback signal which indicates how well the agent is doing at time step t.
In reinforcement learning we need to define our problem such that it can be applied to
satisfy our reward hypothesis. An example would be playing a game of chess where the
agent gets a positive reward for winning a game and a negative reward for losing a game.
Reward Hypothesis: All goals can be described by the maximisation of expected cumulative
reward.
Since the process involves sequential decision making, actions we take early on may have long-term consequences for our overall goal. Sometimes it may be better to sacrifice immediate reward (the reward R[t] at time step t) to gain more long-term reward. An example applied to chess would be to sacrifice a pawn to capture a rook at a later stage.
Goal: The goal is to select actions to maximise total future reward.
In reinforcement learning the agent decides which action A[t] to take at each time step t. The agent makes these decisions based on the scalar reward R[t] it receives and the observation O[t] of the environment.
The environment receives the action from the agent and emits a new observation O[t+1] and scalar reward R[t+1]. What happens next in the environment depends on the history.
State: The state is the information used to determine what happens next.
The key difference between the history and the state is that the state is a function of the history. States can be divided into three main types:
● Environment State (S[t]ᵉ) — the environment's private representation, usually not visible to the agent.
● Agent State (S[t]ᵃ) — the agent's internal representation, used by the agent to pick the next action.
● Information (Markov) State — a state that contains all useful information from the history.
What we believe will happen next depends on the agent's representation of state.
The environment is also divided into fully observable environment and partially
observable environment.
So far, we have defined how to set up RL problems but not how the RL agent learns or what makes up an agent. An RL agent can have one or more of three major components: a policy, a value function, and a model (all introduced above).
● Policy: The agent's behaviour function, which is a map from state to action. It can be a deterministic policy or a stochastic policy.
● Value Function: A prediction of future reward, used to evaluate how good each state is.
● Model: The agent's representation of the environment, predicting what the environment will do next.
Based on which of these components they use, RL agents can be categorized as:
● Value Based: Has no policy; picks actions greedily based on state values.
● Policy Based: Has no value function; uses the policy function to pick actions.
● Actor Critic: Uses both value and policy functions.
● Model Free: Uses policy and/or value functions but has no model.
● Model Based: Uses policy and/or value functions and has a model.
Reinforcement Learning Algorithms
There are two settings in which a reinforcement learning algorithm can improve its policy:
● Learning: The environment is initially unknown and the agent needs to interact with the environment to improve its policy.
● Planning: If the environment (its model) is known, the agent performs computations with its model and then improves its policy.
Since reinforcement learning is a trial-and-error learning approach, the agent needs to learn from its experiences in the environment to make decisions without losing too much reward. It must balance exploitation and exploration: exploring new information versus using known information to maximise reward.
1. Positive Reinforcement
Positive reinforcement means strengthening a behaviour because a favourable event or reward follows that behaviour.
Advantages:
Disadvantage:
– Excess reinforcement can lead to an overload of states, which can diminish the results.
2. Negative Reinforcement
Negative reinforcement means strengthening a behaviour because a negative condition is stopped or avoided as a consequence of that behaviour.
Advantages:
– Maximized behaviour
Disadvantage:
1. Markov Decision Processes (MDPs) are mathematical frameworks for mapping out solutions in RL. An MDP is described by a set of parameters: a set of finite states S, a set of possible actions in each state A, a reward R, a model (transition model) T, and a policy π. The outcome of applying an action in a state does not depend on previous actions or states, only on the current state and action.
A State is a set of tokens that represent every state that the agent can be in.
What is a Model?
A Model (sometimes called Transition Model) gives an action’s effect in a state. In particular,
T(S, a, S’) defines a transition T where being in state S and taking an action ‘a’ takes us to
state S’ (S and S’ may be the same). For stochastic actions (noisy, non-deterministic) we also
define a probability P(S’|S,a) which represents the probability of reaching a state S’ if action
‘a’ is taken in state S. Note that the Markov property states that the effects of an action taken in a state depend only on that state and not on the prior history.
An Action A is a set of all possible actions. A(s) defines the set of actions that can be taken
being in state S.
What is a Reward?
A Reward is a real-valued reward function. R(s) indicates the reward for simply being in the
state S. R(S,a) indicates the reward for being in a state S and taking an action ‘a’. R(S,a,S’)
indicates the reward for being in a state S, taking an action ‘a’ and ending up in a state S’.
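To make these definitions concrete, here is a minimal Python sketch of how the transition model T(S, a, S') and the reward R(S, a, S') of a tiny MDP could be represented; the two-state MDP, its actions, probabilities and rewards are made up purely for illustration.

```python
# A toy MDP; every state, action, probability and reward below is illustrative.
states = ["S0", "S1"]
actions = {"S0": ["stay", "go"], "S1": ["stay", "go"]}   # A(s): actions available in state s

# Transition model T(S, a, S'), stored as P(S' | S, a)
T = {
    ("S0", "stay"): {"S0": 1.0},
    ("S0", "go"):   {"S0": 0.2, "S1": 0.8},   # a stochastic (noisy) action
    ("S1", "stay"): {"S1": 1.0},
    ("S1", "go"):   {"S0": 0.9, "S1": 0.1},
}

# Reward R(S, a, S') for being in S, taking action a and ending up in S'
R = {("S0", "go", "S1"): 1.0}                 # reaching S1 from S0 is rewarded

def transition_prob(s, a, s_next):
    """P(S' | S, a) as defined by the transition model T."""
    return T[(s, a)].get(s_next, 0.0)

def reward(s, a, s_next):
    """R(S, a, S'); zero for transitions not listed above."""
    return R.get((s, a, s_next), 0.0)

print(transition_prob("S0", "go", "S1"), reward("S0", "go", "S1"))   # 0.8 1.0
```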
What is a Policy?
A Policy π(s) defines which action the agent takes in each state S; it is a mapping from states to actions and is the solution we are looking for.
In the grid world, the agent can take any one of these actions: UP, DOWN, LEFT, RIGHT. Walls block the agent's path, i.e., if there is a wall in the direction the agent would have moved, the agent stays in the same place. So, for example, if the agent chooses LEFT in the START grid, it stays put in the START grid.
First Aim: To find the shortest sequence getting from START to the Diamond. Two such
sequences can be found:
● RIGHT RIGHT UP UP RIGHT
● UP UP RIGHT RIGHT RIGHT
Let us take the second one (UP UP RIGHT RIGHT RIGHT) for the subsequent discussion. The moves are now noisy: 80% of the time the intended action works correctly, and 20% of the time the action taken causes the agent to move at right angles. For example, if the agent chooses UP, the probability of going UP is 0.8, while the probability of going LEFT is 0.1 and the probability of going RIGHT is 0.1 (since LEFT and RIGHT are at right angles to UP). A small sketch of this noisy action model is given after the list below.
● There is a small reward at each step (it can be negative, in which case it can also be termed a punishment; in the above example, entering the Fire state can have a reward of -1).
● Big rewards come at the end (good or bad).
● The goal is to maximize the sum of rewards.
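Here is a minimal Python sketch of the noisy transition model, assuming the 0.8/0.1/0.1 split described above; the helper and its sampling loop are purely illustrative.

```python
import random

# Perpendicular (right-angle) directions for each intended action
RIGHT_ANGLES = {
    "UP": ("LEFT", "RIGHT"),
    "DOWN": ("LEFT", "RIGHT"),
    "LEFT": ("UP", "DOWN"),
    "RIGHT": ("UP", "DOWN"),
}

def noisy_action(intended):
    """Return the action actually executed: 0.8 intended, 0.1 for each right angle."""
    r = random.random()
    if r < 0.8:
        return intended
    first, second = RIGHT_ANGLES[intended]
    return first if r < 0.9 else second

# Sample the outcome of choosing UP many times; the fractions approach 0.8 / 0.1 / 0.1
samples = [noisy_action("UP") for _ in range(10000)]
print({a: samples.count(a) / len(samples) for a in ("UP", "LEFT", "RIGHT")})
```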
Fig: Q Learning
– Training systems that issue custom instructions and materials tailored to the requirements of students (an example application of RL).
Conclusion:
Reinforcement learning helps us discover which actions yield the highest reward over the long run. Realistic environments can be partially observable and non-stationary as well. RL is not very useful when you already have enough data to solve the problem using supervised learning. The main challenge of this method is that its parameters can affect the speed of learning.
Let's consider a problem where an agent can be in various states and can choose an action from a set of actions. Such problems are called sequential decision problems. An MDP is the mathematical framework which captures such a fully observable, non-deterministic environment with a Markovian transition model and additive rewards in which the agent acts. The solution to an MDP is an optimal policy, which refers to the choice of action for every state that maximizes the overall cumulative reward. Thus, the transition model that represents the agent's environment (when the environment is known) and the optimal policy which decides what action the agent needs to perform in each state are the required elements for training the agent to learn a specific behaviour.
Fig : Markov Decision Process
● Markov Property
● Bellman Equation
Agent: Software programs that make intelligent decisions; they are the learners in RL. These agents interact with the environment through actions and receive rewards based on their actions.
Environment: It is the demonstration of the problem to be solved. We can have a real-world environment or a simulated environment with which our agent will interact. Anything that the agent cannot change arbitrarily is considered to be part of the environment. In simple terms, actions can be any decisions we want the agent to learn, and states can be anything that is useful in choosing actions. We do not assume that everything in the environment is unknown to the agent; for example, reward calculation is considered to be part of the environment even though the agent knows a bit about how its reward is calculated as a function of its actions and the states in which they are taken. This is because rewards cannot be arbitrarily changed by the agent. Sometimes the agent might be fully aware of its environment but still find it difficult to maximize the reward, just as we might know how to play a Rubik's cube but still be unable to solve it. So we can safely say that the agent-environment relationship represents the limit of the agent's control and not of its knowledge.
Transition Probability: The probability that the agent will move from one state to another is called the transition probability. The Markov property behind it can be written as
P[S[t+1] | S[t]] = P[S[t+1] | S[1], S[2], ..., S[t]]
where S[t] denotes the current state of the agent and S[t+1] denotes the next state. What this equation means is that the transition from state S[t] to S[t+1] is entirely independent of the past: the RHS of the equation means the same as the LHS when the system has the Markov property. Intuitively, our current state already captures the information of the past states.
As we now know about transition probability, we can define the state transition probability as follows: for a Markov state s = S[t] and any successor state s' = S[t+1], the state transition probability is given by
P(s, s') = P[S[t+1] = s' | S[t] = s]
These probabilities form the state transition probability matrix P, in which each row represents the probability of moving from an original (starting) state to each successor state. The sum of each row is equal to 1.
A Markov process is a memoryless random process, i.e., a sequence of random states S[1], S[2], ..., S[n] with the Markov property. So it is basically a sequence of states with the Markov property. It can be defined using a set of states (S) and a transition probability matrix (P); the dynamics of the environment can be fully defined using these two.
The edges of the chain denote transition probabilities. Let's take some samples from this chain. Suppose we are sleeping; according to the probability distribution there is a 0.6 chance that we will run, a 0.2 chance that we sleep more, and a 0.2 chance that we will eat ice cream. Similarly, we can think of other sequences that we can sample from this chain.
In any two sampled sequences we get a random set of states (e.g., Sleep, Ice-cream, Sleep) every time we run the chain. Hopefully it is now clear why a Markov process is called a random sequence of states.
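As a small illustration, the following Python sketch samples sequences from such a chain. Only the probabilities out of the Sleep state (0.2 sleep, 0.6 run, 0.2 ice-cream) come from the text above; the remaining rows of the transition matrix are assumptions added just so the chain can be run end to end.

```python
import random

# Transition probability matrix P for the Sleep / Run / Ice-cream chain.
P = {
    "Sleep":     {"Sleep": 0.2, "Run": 0.6, "Ice-cream": 0.2},   # from the text
    "Run":       {"Sleep": 0.5, "Run": 0.3, "Ice-cream": 0.2},   # assumed for illustration
    "Ice-cream": {"Sleep": 0.7, "Run": 0.2, "Ice-cream": 0.1},   # assumed for illustration
}

def sample_chain(start, steps):
    """Sample a random sequence of states from the Markov chain."""
    state, sequence = start, [start]
    for _ in range(steps):
        next_states = list(P[state])
        weights = list(P[state].values())
        state = random.choices(next_states, weights=weights, k=1)[0]
        sequence.append(state)
    return sequence

print(sample_chain("Sleep", 5))   # e.g. ['Sleep', 'Run', 'Sleep', 'Ice-cream', ...]
```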
Before moving on to the Markov reward process, let's look at some important concepts that will help us understand MRPs.
Reward and Returns:
Rewards are the numerical values that the agent receives on performing some action
at some state(s) in the environment. The numerical value can be positive or negative based
on the actions of the agent.
r[t+1] is the reward received by the agent at time step t[0] while performing an action (a) to move from one state to another. Similarly, r[t+2] is the reward received by the agent at time step t[1] by performing an action to move to another state, and r[T] is the reward received by the agent at the final time step by performing an action to move to another state.
Episodic Tasks: These are tasks that have a terminal state (end state); we can say they have a finite number of states. For example, in racing games, we start the game (start the race) and play it until the game is over (the race ends!). This is called an episode. Once we restart the game, it starts from an initial state; hence, every episode is independent.
Continuous Tasks : These are the tasks that have no ends i.e. they don’t have any terminal
state. These types of tasks will never end. For example, Learning how to code!
Now, it's easy to calculate the returns from episodic tasks as they eventually end, but what about continuous tasks, which go on forever? Their returns would sum up to infinity! So how do we define returns for continuous tasks?
Discount Factor (ɤ): It determines how much importance is given to immediate versus future rewards, and it helps us avoid an infinite return in continuous tasks. Its value lies between 0 and 1. A value close to 0 means more importance is given to the immediate reward, and a value close to 1 means more importance is given to future rewards. In practice, an agent with a discount factor of 0 will never learn anything beyond the immediate reward, while a discount factor of 1 keeps adding future rewards, which may lead to infinity. Therefore, the optimal value for the discount factor typically lies between 0.2 and 0.8.
So, we can define the return using the discount factor as follows (let's call this equation 1, as we will use it later for deriving the Bellman equation):
G[t] = r[t+1] + ɤ·r[t+2] + ɤ²·r[t+3] + ... = Σ (k ≥ 0) ɤᵏ·r[t+k+1]
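A short sketch of equation 1, computing the discounted return for a made-up reward sequence (the rewards and the ɤ value are illustrative only):

```python
def discounted_return(rewards, gamma):
    """G[t] = r[t+1] + gamma*r[t+2] + gamma^2*r[t+3] + ...   (equation 1)"""
    G = 0.0
    for k, r in enumerate(rewards):
        G += (gamma ** k) * r
    return G

rewards = [1, 0, 0, 2, 1]                      # hypothetical rewards received after time step t
print(discounted_return(rewards, gamma=0.9))   # 1 + 0 + 0 + 2*0.9**3 + 1*0.9**4
```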
Let's understand it with an example. Suppose you live at a place where you face water scarcity, and someone tells you that he will give you 100 litres of water (assume, please!) over each of the next 15 hours, discounted hour by hour as a function of some parameter (ɤ). Let's look at two possibilities:
Possibility 1 (ɤ close to 1): The decrease per hour is not very significant, so it is still worth waiting until the 15th hour. This means that we are also interested in future rewards. So, if the discount factor is close to 1, we will make an effort to go to the end, as the rewards remain of significant importance.
Possibility 2 (ɤ close to 0): We are more interested in early rewards, as the rewards get significantly smaller with every hour. We might not want to wait until the end (the 15th hour), as it would be worthless. So, if the discount factor is close to zero, immediate rewards are more important than future ones.
Which regime is appropriate depends on the task we want to train an agent for. Suppose that in a chess game the goal is to defeat the opponent's king. If we give too much importance to immediate rewards, such as a reward for capturing any opponent piece, the agent will learn to pursue these sub-goals even if its own pieces are captured as a result. So, in this task, future rewards are more important. In other tasks, we might prefer immediate rewards, as in the water example we saw earlier.
Markov Reward Process:
Till now we have seen how a Markov chain defines the dynamics of an environment using a set of states (S) and a transition probability matrix (P). But we know that reinforcement learning is all about the goal of maximizing reward. So let's add rewards to our Markov chain. This gives us the Markov Reward Process.
Markov Reward Process: As the name suggests, MRPs are Markov chains with a value judgement; basically, we get a value from every state our agent is in. The reward function of an MRP is
R(s) = E[ r[t+1] | S[t] = s ]
which tells us the immediate (expected) reward we get from a particular state S[t]. As we will see next, the aim is to maximize these rewards from each state our agent is in; in simple terms, to maximize the cumulative reward we get from the states.
A Markov Reward Process is therefore a tuple (S, P, R, ɤ) where:
● S is a set of states,
● P is the state transition probability matrix,
● R is the reward function, and
● ɤ is the discount factor.
Now, let’s develop our intuition for Bellman Equation and Markov Decision Process.
Policy Function and Value Function:
Value Function determines how good it is for the agent to be in a particular state.
Of course, to determine how good it will be to be in a particular state it must depend on some
actions that it will take. This is where policy comes in. A policy defines what actions to
perform in a particular state s.
Policy Function:
A policy π is a distribution over actions given a state:
π(a | s) = P[ A[t] = a | S[t] = s ]
Now, how do we find the value of a state? The value of a state s when the agent is following a policy π, denoted vπ(s), is the expected return starting from s and following the policy π for the next states until we reach the terminal state. We can formulate this as follows (this function is also called the state-value function):
Value Function:
vπ(s) = Eπ[ G[t] | S[t] = s ]
This equation gives us the expected return starting from state s and going to successor states thereafter, following the policy π. One thing to note is that the returns we get are stochastic, whereas the value of a state is not: it is the expectation of the returns starting from state s. Also note that the value of the terminal state (if there is any) is zero. Let's look at an example:
Example
Suppose our start state is Class 2, and we move to Class 3, then Pass, then Sleep. In short: Class 2 > Class 3 > Pass > Sleep.
Bellman Equation helps us to find optimal policies and value functions. We know
that our policy changes with experience so we will have different value functions according
to different policies. The optimal value function is one that gives maximum value
compared to all other value functions.
The Bellman equation states that the value function can be decomposed into two parts: the immediate reward r[t+1], and the discounted value of the successor state, ɤ·v(S[t+1]).
Suppose there is a robot in some state (s) and it then moves from this state to some other state (s'). Now, the question is: how good was it for the robot to be in state s? Using the Bellman equation, we can say that it is the expectation of the reward it got on leaving state s plus the value of the state s' it moved to.
We want to know the value of state s. The value of state s is the reward we get upon leaving that state, plus the discounted value of each state we may land in multiplied by the transition probability of moving into it:
v(s) = R(s) + ɤ · Σ (over s') P(s, s')·v(s')
where v(s) is the value of the state we were in, which is equal to the immediate reward plus the discounted value of the next state multiplied by the probability of moving into that state.
In matrix form this is v = R + ɤPv, which can be solved directly as v = (I − ɤP)⁻¹R. The running time complexity of this computation is O(n³). Therefore, it is clearly not a practical solution for solving larger MRPs (the same holds for MDPs). Later we will look at more efficient methods such as dynamic programming (value iteration and policy iteration), Monte Carlo methods, and TD learning.
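The O(n³) cost comes from solving that linear system directly. A minimal numpy sketch, using a made-up 3-state MRP (its transition matrix and rewards are illustrative, not taken from the text):

```python
import numpy as np

# Direct solution of the Bellman equation for an MRP: v = R + gamma * P v,
# i.e. v = (I - gamma * P)^(-1) R.  The 3-state chain below is made up.
gamma = 0.9
P = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5],
              [0.0, 0.0, 1.0]])      # each row sums to 1 (the last state is absorbing)
R = np.array([1.0, 2.0, 0.0])        # immediate reward in each state

v = np.linalg.solve(np.eye(3) - gamma * P, R)   # the O(n^3) step
print(v)                                        # value of each state under this chain
```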
V(s3) = max[R(s,a) + γV(s')]; here V(s') = 0 because there is no further state to move to, and R(s,a) = 1 for moving into the goal, so V(s3) = max[1] = 1.
V(s2) = max[R(s,a) + γV(s')]; here γ = 0.9 (say), V(s') = 1, and R(s,a) = 0, because there is no reward at this state. V(s2) = max[0.9(1)] = 0.9.
V(s1) = max[R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.9, and R(s,a) = 0, because there is no reward at this state either. V(s1) = max[0.9(0.9)] = 0.81.
V(s5) = max[R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.81, and R(s,a) = 0, so V(s5) = max[0.9(0.81)] = 0.73.
V(s9) = max[R(s,a) + γV(s')]; here γ = 0.9, V(s') = 0.73, and R(s,a) = 0, so V(s9) = max[0.9(0.73)] = 0.66.
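The chain of calculations above can be reproduced with a few lines of Python; the path ordering (s3 is next to the goal, then s2, s1, s5, s9) and the single reward of 1 for moving from s3 into the goal are read off from the numbers used above.

```python
# Back up the values along the path: V(s) = max[R(s,a) + gamma * V(s')], gamma = 0.9.
gamma = 0.9
path = ["s3", "s2", "s1", "s5", "s9"]   # ordered from closest to farthest from the goal

V, v_next = {}, 0.0                     # v_next starts at 0: nothing beyond the goal
for s in path:
    r = 1.0 if s == "s3" else 0.0       # reward only when moving from s3 into the goal
    V[s] = round(r + gamma * v_next, 2)
    v_next = V[s]

print(V)   # {'s3': 1.0, 's2': 0.9, 's1': 0.81, 's5': 0.73, 's9': 0.66}
```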
A Markov Decision Process is a tuple (S, A, P, R, ɤ) where:
● S is a set of states,
● A is the set of actions the agent can choose to take,
● P is the state transition probability matrix,
● R is the reward function, and
● ɤ is the discount factor.
Reward Function: R(s) = E[ r[t+1] | S[t] = s ]
Reward Function w.r.t. action: R(s, a) = E[ r[t+1] | S[t] = s, A[t] = a ]
Till now we have talked about getting a reward (r) when our agent goes through a set of states (s) following a policy π. In a Markov Decision Process (MDP), the policy is the mechanism for taking decisions, so now we have a mechanism that chooses which action to take.
Policies in an MDP depend only on the current state, not on the history; that is the Markov property. So, the current state we are in characterizes the history.
We have already seen how good it is for the agent to be in a particular state (the state-value function). Now let's see how good it is to take a particular action a in state s while following a policy π (the action-value function):
qπ(s, a) = Eπ[ G[t] | S[t] = s, A[t] = a ]
This function specifies how good it is for the agent to take action a in state s under policy π; basically, it tells us the value of performing a certain action in a state with the policy π.
Now we can see that the agent no longer simply follows transition probabilities; it has choices to make, like choosing to watch Netflix or to code and debug after waking up. Of course, the actions of the agent are defined with respect to some policy π, and it will be rewarded accordingly.
MDP uses Markov property, and to better understand the MDP, we need to learn about it.
Markov Property:
It says that "If the agent is present in the current state S1, performs an action a1
and move to the state s2, then the state transition from s1 to s2 only depends on the current
state and future action and states do not depend on past actions, rewards, or states."
Or, in other words, as per Markov Property, the current state transition does not
depend on any past action or state. Hence, MDP is an RL problem that satisfies the Markov
property. Such as in a Chess game, the players only focus on the current state and do not
need to remember past actions or states.
Finite MDP:
A finite MDP is when there are finite states, finite rewards, and finite actions. In RL, we
consider only the finite MDP.
Markov Process:
Markov Process is a memoryless process with a sequence of random states S1, S2, .....,
St that uses the Markov Property. Markov process is also known as Markov chain, which is a
tuple (S, P) on state S and transition function P. These two components (S and P) can define
the dynamics of the system.
o Q-Learning:
o Q-learning is an off-policy RL algorithm used for temporal difference learning. Temporal difference learning methods are ways of comparing temporally successive predictions.
o It learns the action-value function Q(s, a), which tells us how good it is to take action "a" in a particular state "s".
o The flowchart below explains the working of Q-learning:
State Action Reward State Action (SARSA):
SARSA is an on-policy TD method that updates Q(s, a) from the observed quintuple (s, a, r, s', a') using
Q(s, a) ← Q(s, a) + α [ r + γ Q(s', a') − Q(s, a) ]   (α is the learning rate)
s, a: original state and action
r: reward observed while following the states
s', a': new state-action pair.
Q-Learning Explanation:
o Q-learning is a popular model-free reinforcement learning algorithm based on the
Bellman equation.
o The main objective of Q-learning is to learn a policy that informs the agent what actions should be taken to maximize the reward under what circumstances.
o It is an off-policy RL method that attempts to find the best action to take in the current state.
o The goal of the agent in Q-learning is to maximize the value of Q.
o The value used in Q-learning can be derived from the Bellman equation. Consider the Bellman equation given below:
V(s) = max over a of [ R(s, a) + γ Σ (over s') P(s'|s, a)·V(s') ]
In this equation we have various components: the reward, the discount factor (γ), the transition probability, and the end state s'. But no Q-value appears yet, so first consider the image below:
In the above image, we can see an agent who has three value options: V(s1), V(s2), V(s3). As this is an MDP, the agent only cares about the current state and the future states. The agent can go in any direction (up, left, or right), so it needs to decide where to go for the optimal path. The agent will move according to the probabilities and change its state, but if we want some exact control over the moves, we need to work in terms of Q-values. Consider the image below:
Q represents the quality of the actions at each state. So instead of using a value for each state, we use a pair of state and action, i.e., Q(s, a). The Q-value specifies which action is more lucrative than the others, and according to the best Q-value the agent takes its next move. The Bellman equation can be used to derive the Q-value. To perform any action, the agent will get a reward R(s, a) and will also end up in a certain state, so the Q-value equation will be:
Q(s, a) = R(s, a) + γ Σ (over s') P(s, a, s')·V(s')
The Q stands for quality in Q-learning, which means it specifies the quality of an action
taken by the agent.
Q-table:
A Q-table or matrix is created while performing the Q-learning. The table follows the
state and action pair, i.e., [s, a], and initializes the values to zero. After each action, the table
is updated, and the q-values are stored within the table.
The RL agent uses this Q-table as a reference table to select the best action based on the
q-values.
1. The robot loses 1 point at each step. This is done so that the robot takes the shortest
path and reaches the goal as fast as possible.
2. If the robot steps on a mine, the point loss is 100 and the game ends.
Now, the obvious question is: How do we train a robot to reach the end goal with the
shortest path without stepping on a mine?
So, how do we solve this?
There are four possible actions at each non-edge tile. When the robot is in a state, it can move up, down, right, or left.
Each Q-table score will be the maximum expected future reward that the robot will get if it
takes that action at that state. This is an iterative process, as we need to improve the Q-Table
at each iteration.
Q-function:
The Q-function uses the Bellman equation and takes two inputs: state (s) and action (a). The update used to fill in the table is
New Q(s, a) = Q(s, a) + α [ R(s, a) + γ · max over a' of Q(s', a') − Q(s, a) ]
where α is the learning rate. Using the above function, we get the values of Q for the cells in the table. When we start, all the values in the Q-table are zeros.
Each of the colored boxes is one step. Let’s understand each of these steps in detail.
Step 1: initialize the Q-Table
We will first build a Q-table. There are n columns, where n= number of actions. There are m
rows, where m= number of states. We will initialise the values at 0.
In our robot example, we have four actions (a=4) and five states (s=5). So we will build a
table with four columns and five rows.
We will choose an action (a) in the state (s) based on the Q-Table. But, as mentioned earlier,
when the episode initially starts, every Q-value is 0.
So now the concept of exploration and exploitation trade-off comes into play.
We’ll use something called the epsilon greedy strategy. In the beginning, the epsilon rates
will be higher. The robot will explore the environment and randomly choose actions. The
logic behind this is that the robot does not know anything about the environment. As the
robot explores the environment, the epsilon rate decreases and the robot starts to exploit the
environment. During the process of exploration, the robot progressively becomes more
confident in estimating the Q-values.
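A minimal sketch of the epsilon-greedy strategy with a decaying epsilon; the table size, decay rate and episode count are illustrative assumptions, not values from the text.

```python
import random
import numpy as np

def epsilon_greedy(Q, state, epsilon):
    """Explore with probability epsilon, otherwise exploit the best known action."""
    n_actions = Q.shape[1]
    if random.random() < epsilon:
        return random.randrange(n_actions)   # explore: pick a random action
    return int(np.argmax(Q[state]))          # exploit: pick the best Q-value so far

Q = np.zeros((5, 4))                         # 5 states x 4 actions, as in the robot example
epsilon, min_epsilon, decay = 1.0, 0.05, 0.995
for episode in range(1000):
    action = epsilon_greedy(Q, state=0, epsilon=epsilon)   # (episode logic omitted)
    epsilon = max(min_epsilon, epsilon * decay)            # explore less as learning proceeds
```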
For the robot example, there are four actions to choose from: up, down, left, and
right. We are starting the training now — our robot knows nothing about the environment. So
the robot chooses a random action, say right.
We can now update the Q-values for being at the start and moving right using the Bellman
equation.
Now we have taken an action and observed an outcome and reward. We need to update the function Q(s, a).
In the case of the robot game, to reiterate the scoring/reward structure is:
● power = +1
● mine = -100
● end = +100
We will repeat this again and again until the learning is stopped. In this way the Q-Table will
be updated.
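Putting the pieces together, here is a hedged Q-learning sketch for a simplified, made-up 1-D version of the robot game (states laid out in a row with the mine behind the start and the end goal ahead; this layout is an assumption for illustration, not the grid described above). It uses the scoring listed earlier: -1 per step, -100 for the mine, +100 for the end.

```python
import random
import numpy as np

n_states, n_actions = 5, 2               # assumed 1-D layout; actions: 0 = left, 1 = right
alpha, gamma, epsilon = 0.1, 0.9, 0.1    # learning rate, discount factor, exploration rate
Q = np.zeros((n_states, n_actions))      # Step 1: initialise the Q-table to zero

def step(s, a):
    """Made-up environment: state 0 is the mine, state 4 is the end goal."""
    s_next = max(0, s - 1) if a == 0 else min(n_states - 1, s + 1)
    if s_next == 0:
        return s_next, -100, True        # stepped on the mine: lose 100 points, game over
    if s_next == n_states - 1:
        return s_next, +100, True        # reached the end goal: +100 points
    return s_next, -1, False             # every ordinary step costs 1 point

for episode in range(2000):
    s, done = 1, False                   # each episode starts next to the mine
    while not done:
        # epsilon-greedy action selection (see the strategy described above)
        a = random.randrange(n_actions) if random.random() < epsilon else int(np.argmax(Q[s]))
        s_next, r, done = step(s, a)
        # Bellman-based update: Q(s,a) += alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
        Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
        s = s_next

print(Q)   # the learned Q-values should favour moving right, away from the mine
```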
What is meant by passive and active reinforcement learning and how do we compare
the two?
Both active and passive reinforcement learning are types of RL. In case of passive RL, the
agent’s policy is fixed which means that it is told what to do. In contrast to this, in active RL,
an agent needs to decide what to do as there’s no fixed policy that it can act on. Therefore,
the goal of a passive RL agent is to execute a fixed policy (sequence of actions) and evaluate
it while that of an active RL agent is to act and learn an optimal policy.
Passive Learning:
As the goal of the agent is to evaluate how good the given policy is, the agent needs to learn the expected utility Uπ(s) for each state s.
Fig 21.1: (a) A policy π for the 4 x 3 world; this policy happens to be optimal with rewards of R(s) = -0.04 in the nonterminal states and no discounting. (b) The utilities of the states in the 4 x 3 world, given policy π.
The agent executes a set of trials in the environment using its policy π. In each trial, the agent
starts in state (1, 1) and experiences a sequence of state transitions until it reaches one of the
terminal states, (4,2) or (4,3). Its percepts supply both the current state and the reward
received in that state. Typical trials might look like this:
The utility of a state is the expected (discounted) sum of rewards obtained when policy π is followed:
Uπ(s) = E[ Σ (t = 0 to ∞) ϒᵗ R(S[t]) ]
where R(s) is the reward for a state, S[t] (a random variable) is the state reached at time t when executing policy π, and S[0] = s. We will include a discount factor ϒ in all of our equations, but for the 4 x 3 world we will set ϒ = 1.
This can be done in three ways.
1. Direct Utility Estimation:
In this method, the agent executes a sequence of trials or runs (sequences of state-action transitions that continue until the agent reaches the terminal state). Each trial gives a sample value, and the agent estimates the utility based on the sample values; the utilities can be calculated as running averages of the sample values. The main drawback is that this method makes the wrong assumption that state utilities are independent, while in reality they are Markovian. It is also slow to converge.
Suppose we have a 4 x 3 grid as the environment in which the agent can move either Left, Right, Up or Down (the set of available actions). An example of a run:
Total reward starting at (1,1) = 0.72
2. Adaptive Dynamic Programming (ADP):
ADP is a smarter method than direct utility estimation. It runs trials to learn the model of the environment and estimates the utility of a state as the sum of the reward for being in that state and the expected discounted reward of being in the next state:
Uπ(s) = R(s) + γ Σ (over s') P(s'|s, π(s))·Uπ(s')
where R(s) is the reward for being in state s, P(s'|s, π(s)) is the transition model, γ is the discount factor and Uπ(s') is the utility of being in state s'. This system of equations can be solved using the value-iteration algorithm. The algorithm converges fast but can become quite costly to compute for large state spaces. ADP is a model-based approach and requires the transition model of the environment. A model-free alternative is Temporal Difference Learning.
3. Temporal Difference (TD) Learning:
TD learning does not require the agent to learn the transition model. The update occurs between successive states, and the agent only updates states that are directly affected:
Uπ(s) ← Uπ(s) + α ( R(s) + γ Uπ(s') − Uπ(s) )
where α is the learning rate.
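A minimal sketch of a single TD(0) update on the 4 x 3 world; the learning rate α = 0.1 is an assumed value, while the -0.04 reward, the start state (1, 1) and ϒ = 1 come from the description above.

```python
def td_update(U, s, r, s_next, alpha=0.1, gamma=1.0):
    """TD(0): U(s) <- U(s) + alpha * (r + gamma * U(s') - U(s))."""
    u_s = U.get(s, 0.0)
    U[s] = u_s + alpha * (r + gamma * U.get(s_next, 0.0) - u_s)
    return U

# One observed transition under the fixed policy: from (1, 1) to (1, 2) with reward -0.04
U = td_update({}, (1, 1), -0.04, (1, 2))
print(U[(1, 1)])   # -0.004 after this single update
```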
Active Learning:
As the goal of an active agent is to learn an optimal policy, the agent needs to learn the expected utility of each state and update its policy.
1. ADP with an exploration function:
This can be done using a passive ADP agent and then, using value or policy iteration, learning the optimal actions. But this approach results in a greedy agent. Hence, we use an approach that gives higher weights to unexplored actions and lower weights to actions with lower utilities:
U⁺(s) ← R(s) + γ max over a of f( Σ (over s') P(s'|s, a)·U⁺(s'), N(s, a) )
where f(u, n) is the exploration function, which increases with the expected value u and decreases with the number of tries n.
2. Q-Learning:
Q-learning is a TD learning method which does not require the agent to learn the transition model; instead, it learns the Q-value function Q(s, a) directly, using the update
Q(s, a) ← Q(s, a) + α ( R(s) + γ max over a' of Q(s', a') − Q(s, a) ).
POLICY SEARCH:
Suppose you are in a new town and you have no map or GPS, and you need to reach downtown. You can try to assess your current position relative to your destination, as well as the effectiveness (value) of each direction you take. You can think of this as computing the value function. Or you can ask a local, and he tells you to go straight, turn left when you see a fountain, and continue until you reach downtown. He has given you a policy to follow. Naturally, in this case, following the given policy is much less complicated than computing the value function on your own.
In another example, consider that you are managing an inventory, and you decide that when the count of an item drops below a certain limit you issue a buy order to refill your stock. This is a policy that is far simpler than studying the activity of the customers and their buying habits and preferences in order to predict the impact on your stock. Surely enough, value functions will lead to determining a policy, but there are other methods that can learn a policy that selects actions using parameters, without consulting a value function (this is not entirely accurate, since a value function may still be needed to increase accuracy). So the main idea is to be able to determine, at a state (s), which action to take in order to maximize the reward.
The way to achieve this objective is to fine-tune a vector of parameters, denoted 𝜽, in order to select the best action to take under policy 𝜋. The policy is written 𝜋(a|s, 𝜽) = Pr{At = a | St = s, 𝜽t = 𝜽}, which means that the policy 𝜋 is the probability of taking action a when in state s and the parameters are 𝜽.
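As a small illustration of 𝜋(a|s, 𝜽), here is a sketch of a linear softmax policy; the feature vectors, parameter values and action names are made-up assumptions.

```python
import numpy as np

def softmax_policy(theta, features, actions):
    """pi(a | s, theta): probability of each action, given per-action features of state s."""
    prefs = np.array([theta @ features[a] for a in actions])   # linear preference per action
    exp = np.exp(prefs - prefs.max())                          # subtract max for numerical stability
    return exp / exp.sum()

theta = np.array([0.5, -0.2, 0.1])                             # parameter vector (illustrative)
features = {"left": np.array([1.0, 0.0, 1.0]),                 # phi(s, left), made up
            "right": np.array([0.0, 1.0, 1.0])}                # phi(s, right), made up
probs = softmax_policy(theta, features, ["left", "right"])
print(dict(zip(["left", "right"], probs)))                     # a stochastic policy over the actions
```

Tuning 𝜽 (for example with a policy-gradient method) shifts these probabilities toward actions that earn more reward.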
Advantages:
● Better convergence properties
● Effective in high-dimensional or continuous action spaces. When the space is large, memory and computation consumption grow rapidly; policy-based RL avoids this because the objective is to learn a set of parameters that is far smaller than the space itself.
● Can learn stochastic policies. Stochastic policies are better than deterministic policies in some settings, especially in two-player games, where a player who acts deterministically allows the other player to develop countermeasures in order to win.
Disadvantages:
● Typically converge to a local rather than a global optimum.
● Evaluating a policy is typically inefficient and has high variance. Policy-based RL has high variance, but there are techniques to reduce it.
Stochastic Policy:
At first it is important to note that stochastic does not mean randomness in all states; a policy can be stochastic only in the states where it makes sense. Usually, maximizing reward leads to a deterministic policy, but in some cases a deterministic policy is not a good fit for the problem. For example, in any two-player game, playing deterministically means the other player will be able to come up with countermeasures in order to win all the time. In the Rock-Paper-Scissors game, if we play deterministically, meaning the same shape every time, then the other player can easily counter our policy and win every game. So in this game the optimal policy is stochastic, and it will be better than any deterministic one.
Self-driving cars:
Various papers have proposed Deep Reinforcement Learning for autonomous
driving. In self-driving cars, there are various aspects to consider, such as speed limits at
various places, drivable zones, avoiding collisions — just to mention a few. Some of the
autonomous driving tasks where reinforcement learning could be applied include trajectory
optimization, motion planning, dynamic pathing, controller optimization, and scenario-based
learning policies for highways.
For example, parking can be achieved by learning automatic parking policies. Lane
changing can be achieved using Q-Learning while overtaking can be implemented by
learning an overtaking policy while avoiding collision and maintaining a steady speed
thereafter.
AWS DeepRacer is an autonomous racing car that has been designed to test out RL in a
physical track. It uses cameras to visualize the runway and a reinforcement learning model to
control the throttle and direction.
Wayve.ai has successfully applied reinforcement learning to training a car on how to drive in
a day. They used a deep reinforcement learning algorithm to tackle the lane following task.
Their network architecture was a deep network with 4 convolutional layers and 3 fully
connected layers. The example below shows the lane following task. The image in the middle
represents the driver’s perspective.
What is NLP:
NLP stands for Natural Language Processing, a field at the intersection of computer science, human language, and artificial intelligence. It is the technology used by machines to understand, analyse, manipulate, and interpret human languages. It helps developers organize knowledge for performing tasks such as translation, automatic summarization, Named Entity Recognition (NER), speech recognition, relationship extraction, and topic segmentation.
History of NLP:
(1940-1960) - Focused on Machine Translation (MT)
Work on natural language processing started in the 1940s.
1948 - In 1948, the first recognisable NLP application was introduced at Birkbeck College, London.
1950s - In the 1950s, there was a conflicting view between linguistics and computer science. Chomsky published his first book, Syntactic Structures, and claimed that language is generative in nature.
In 1957, Chomsky also introduced the idea of Generative Grammar, which gives rule-based descriptions of syntactic structures.
(1960-1980) - Flavoured with Artificial Intelligence (AI)
In the year 1960 to 1980, the key developments were:
Augmented Transition Networks (ATN):
An Augmented Transition Network (ATN) is a finite-state machine that is capable of recognizing regular languages.
Case Grammar:
Case Grammar was developed by the linguist Charles J. Fillmore in 1968. Case Grammar uses languages such as English to express the relationship between nouns and verbs by using prepositions.
In Case Grammar, case roles can be defined to link certain kinds of verbs and objects.
For example: "Neha broke the mirror with the hammer". In this example, case grammar identifies Neha as an agent, the mirror as a theme, and the hammer as an instrument.
In the year 1960 to 1980, key systems were:
SHRDLU
SHRDLU is a program written by Terry Winograd in 1968-70. It lets users communicate with the computer and move objects. It can handle instructions such as "pick up the green ball" and also answer questions like "What is inside the black box?". The main importance of SHRDLU is that it shows that syntax, semantics, and reasoning about the world can be combined to produce a system that understands natural language.
LUNAR
LUNAR is the classic example of a natural language database interface system; it used ATNs and Woods' Procedural Semantics. It was capable of translating elaborate natural language expressions into database queries and handled 78% of requests without errors.
1980 - Current
Until 1980, natural language processing systems were based on complex sets of hand-written rules. After 1980, NLP introduced machine learning algorithms for language processing.
In the early 1990s, NLP started growing faster and achieved good accuracy, especially for English grammar. In the 1990s, electronic text collections were also introduced, which provided a good resource for training and examining natural language programs. Other factors include the availability of computers with fast CPUs and more memory. The major factor behind the advancement of natural language processing was the Internet.
Modern NLP consists of various applications, like speech recognition, machine translation, and machine text reading. When we combine all these applications, we allow artificial intelligence to gain knowledge of the world. Consider the example of Amazon Alexa: you can ask Alexa a question, and it will reply to you.
Advantages of NLP:
o NLP helps users to ask questions about any subject and get a direct response within
seconds.
o NLP offers exact answers to questions; it does not offer unnecessary or unwanted information.
o NLP helps computers to communicate with humans in their languages.
o It is very time efficient.
o Most companies use NLP to improve the efficiency of documentation processes and the accuracy of documentation, and to identify information in large databases.
Disadvantages of NLP:
A list of disadvantages of NLP is given below:
o NLP may not show context.
o NLP is unpredictable
o NLP may require more keystrokes.
o NLP is unable to adapt to new domains, and it has limited functionality; that is why NLP is usually built for a single, specific task.
Components of NLP:
There are the following two components of NLP -
1. Natural Language Understanding (NLU)
Natural Language Understanding (NLU) helps the machine to understand and analyse human
language by extracting the metadata from content such as concepts, entities, keywords,
emotion, relations, and semantic roles.
NLU is mainly used in business applications to understand the customer's problem in both spoken and written language.
NLU involves the following tasks -
o It is used to map the given input into useful representation.
o It is used to analyze different aspects of the language.
2. Natural Language Generation (NLG)
Natural Language Generation (NLG) acts as a translator that converts the computerized data
into natural language representation. It mainly involves Text planning, Sentence planning,
and Text Realization.
NLU: the process of reading and interpreting language.
NLG: the process of writing or generating language.
Applications of NLP:
There are the following applications of NLP -
1. Question Answering
Question Answering focuses on building systems that automatically answer the questions
asked by humans in a natural language.
2. Spam Detection
Spam detection is used to detect unwanted e-mails getting to a user's inbox.
3. Sentiment Analysis
Sentiment Analysis is also known as opinion mining. It is used on the web to analyse the attitude, behaviour, and emotional state of the sender. This application is implemented through a combination of NLP (Natural Language Processing) and statistics, by assigning values to the text (positive, negative, or neutral) and identifying the mood of the context (happy, sad, angry, etc.).
4. Machine Translation
Machine translation is used to translate text or speech from one natural language to another
natural language.
Example: Google Translator
5. Spelling correction
Word processors such as Microsoft Word and presentation software such as PowerPoint use NLP for spelling correction.
6. Speech Recognition
Speech recognition is used for converting spoken words into text. It is used in applications,
such as mobile, home automation, video recovery, dictating to Microsoft Word, voice
biometrics, voice user interface, and so on.
7. Chatbot
Implementing chatbots is one of the important applications of NLP. Chatbots are used by many companies to provide chat-based customer services.
8. Information extraction
Information extraction is one of the most important applications of NLP. It is used for
extracting structured information from unstructured or semi-structured machine-readable
documents.
9. Natural Language Understanding (NLU)
It converts large sets of text into more formal representations, such as first-order logic structures, that are easier for computer programs to manipulate than raw natural language.
Note: When you are building a search engine for rock bands, you should not ignore the stop word "The" (think of band names such as "The Who").
Phases of NLP:
There are the following five phases of NLP:
1. Lexical and Morphological Analysis
2. Syntactic Analysis (Parsing)
3. Semantic Analysis
4. Discourse Integration
5. Pragmatic Analysis
Ambiguity
There are the following three types of ambiguity -
o Lexical Ambiguity
Lexical ambiguity exists when a single word has two or more possible meanings within a sentence.
Example:
Manya is looking for a match.
In the above example, the word match is ambiguous: either Manya is looking for a partner (a match), or Manya is looking for a match in the sense of a cricket or other match.
o Syntactic Ambiguity
Syntactic Ambiguity exists in the presence of two or more possible meanings within the
sentence.
Example:
I saw the girl with the binoculars.
In the above example, did I have the binoculars, or did the girl have the binoculars?
o Referential Ambiguity
Referential Ambiguity exists when you are referring to something using the pronoun.
Example: Kiran went to Sunita. She said, "I am hungry."
In the above sentence, you do not know who is hungry, Kiran or Sunita.
NLP Libraries:
Scikit-learn: It provides a wide range of algorithms for building machine learning models in
Python.
Natural language Toolkit (NLTK): NLTK is a complete toolkit for all NLP techniques.
Pattern: It is a web mining module for NLP and machine learning.
TextBlob: It provides an easy interface to learn basic NLP tasks like sentiment analysis, noun
phrase extraction, or pos-tagging.
Quepy: Quepy is used to transform natural language questions into queries in a database
query language.
SpaCy: SpaCy is an open-source NLP library which is used for Data Extraction, Data
Analysis, Sentiment Analysis, and Text Summarization.
Gensim: Gensim works with large datasets and processes data streams.
Natural language has a very large vocabulary, whereas a computer language has a very limited vocabulary.
1. Text classification
Text classification is the process of categorizing text into groups of categories. Using NLP, text classification can automatically analyze text and then assign a set of predefined tags or categories based on its context. It is used for sentiment analysis, topic detection, and language detection. There are mainly three text classification approaches:
o Rule-based System,
o Machine-based System,
o Hybrid System.
In the rule-based approach, texts are separated into organized groups using a set of handcrafted linguistic rules. These handcrafted rules have users define lists of words that characterize each group. For example, words like Donald Trump and Boris Johnson would be categorized into politics, while names like LeBron James and Ronaldo would be categorized into sports.
A machine-based classifier learns to classify text based on past observations from data sets. The user data is pre-labelled as train and test data. The classifier learns the classification strategy from these previous inputs and keeps learning continuously. Machine-based classifiers use a bag of words for feature extraction, as in the sketch below.
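Here is a hedged sketch of a machine-based classifier using scikit-learn (listed among the NLP libraries earlier in this unit): a bag-of-words feature extractor followed by a Naive Bayes model. The tiny labelled dataset is made up purely for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

train_texts = ["Donald Trump met Boris Johnson",
               "Parliament passed the new bill",
               "LeBron James scored 40 points",
               "Ronaldo signed for a new club"]
train_labels = ["politics", "politics", "sports", "sports"]

# Bag-of-words features feeding a simple probabilistic classifier
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

print(model.predict(["Boris Johnson gave a speech",
                     "LeBron James scored again"]))   # likely ['politics', 'sports'] on this toy data
```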
Matching :
It is the process of finding a measure of similarity between two text representations.
The relevance of a document is computed based on the following parameters:
1. TF: It stands for Term Frequency which is simply the number of times a given term
appears in that document.
2. IDF: It stands for Inverse Document Frequency which is a measure of the general
importance of the term.
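A small sketch of these two measures; the log(N / document frequency) form used for IDF is one common formulation (an assumption here, since the text only describes IDF informally), and the three documents are made up.

```python
import math

documents = [
    "reinforcement learning is learning by trial and error",
    "natural language processing is a part of artificial intelligence",
    "information retrieval ranks documents by relevance",
]

def tf(term, document):
    """TF: the number of times the term appears in the document."""
    return document.split().count(term)

def idf(term, documents):
    """IDF: log(N / number of documents containing the term), a common formulation."""
    df = sum(1 for d in documents if term in d.split())
    return math.log(len(documents) / df) if df else 0.0

print(tf("learning", documents[0]), idf("learning", documents))   # TF = 2, IDF = log(3/1)
```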
1. Structured data –
Structured data is data whose elements are addressable for effective analysis. It has been organized into a formatted repository that is typically a database. It concerns all data which can be stored in an SQL database in a table with rows and columns. Structured data has relational keys and can easily be mapped into pre-designed fields. Today, such data is the most commonly processed and the simplest to manage. Example: relational data.
2. Semi-Structured data –
Semi-structured data is information that does not reside in a relational database but has some organizational properties that make it easier to analyze. With some processing, it can be stored in a relational database (though this can be very hard for some kinds of semi-structured data), but semi-structured formats exist to save space. Example: XML data.
3. Unstructured data –
Unstructured data is data which is not organized in a predefined manner or does not have a predefined data model; thus it is not a good fit for a mainstream relational database. For unstructured data, there are alternative platforms for storing and managing it. Such data is increasingly prevalent in IT systems and is used by organizations in a variety of business intelligence and analytics applications. Example: Word, PDF, text, media logs.
Property: Transaction management
● Structured data: matured transaction management and various concurrency techniques.
● Semi-structured data: transaction management adapted from the DBMS, not matured.
● Unstructured data: no transaction management and no concurrency.
Boolean Model
Boolean Model is the oldest model for Information Retrieval (IR). These models are based on
set theory and Boolean algebra, where
● Documents: Sets of terms
● Queries: Boolean expressions on terms
As a response to the query, the set of documents that satisfy the Boolean expression is retrieved.
The boolean model can be defined as:
D: It represents a set of words, i.e, the indexing terms present in a document. Here, each term
is either present (1) or absent (0) in the document.
Q: It represents a Boolean expression, where the terms are index terms and the operators are logical connectives:
● Logical product − AND,
● Logical sum − OR,
● Logical difference − NOT.
F: It represents a Boolean algebra over sets of terms as well as over sets of documents.
If we talk about relevance feedback, then in the Boolean IR model the relevance prediction can be defined as follows:
R: A document is predicted to be relevant to the query expression if and only if it satisfies the query expression.
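A tiny sketch of the Boolean model: documents as sets of terms, and a query expressed with AND / OR / NOT over those sets. The documents and the query are made up for illustration.

```python
docs = {
    "d1": {"information", "retrieval", "boolean", "model"},
    "d2": {"reinforcement", "learning", "reward"},
    "d3": {"information", "extraction", "learning"},
}

# Query: information AND (retrieval OR extraction) AND NOT reward
def query(terms):
    return ("information" in terms
            and ("retrieval" in terms or "extraction" in terms)
            and "reward" not in terms)

retrieved = [doc_id for doc_id, terms in docs.items() if query(terms)]
print(retrieved)   # ['d1', 'd3'] — the documents satisfying the Boolean expression
```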
Evaluation of IR Systems:
The two common effective measures for evaluating IR systems are as follows:
● Precision
● Recall
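A short sketch of how these two measures are computed (the retrieved and relevant document sets are made up): precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved.

```python
def precision_recall(retrieved, relevant):
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)                      # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

retrieved = ["d1", "d2", "d3", "d4"]     # what the IR system returned (illustrative)
relevant = ["d1", "d3", "d7"]            # what was actually relevant (illustrative)
print(precision_recall(retrieved, relevant))   # (0.5, 0.666...): 2/4 retrieved are relevant, 2/3 relevant found
```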
INFORMATION EXTRACTION:
Information Extraction is the process of parsing through unstructured data and
extracting essential information into more editable and structured data formats. For example,
consider we're going through a company’s financial information from a few documents.
Usually, we search for some required information when the data is digital or manually check
the same. But with information extraction NLP algorithms, we can automate the data
extraction of all required information such as tables, company growth metrics, and other
financial details from various kinds of documents (PDFs, Docs, Images etc.).
Below is a screenshot explaining how we can extract information from an Invoice.
#1 Information Collection:
Firstly, we’ll need to collect the data from different sources to build an information extraction
model. Usually, we see documents on emails, cloud drives, scanned copies, computer
software, and many other sources for business. Hence, we’ll have to write different scripts to
collect and store information in one place. This is usually done by either using APIs on the
web or building RPA (Robotic Process Automation) pipelines.
#2 Process Data:
After we collect the data, the next step is to process it. Usually, documents are of two types: electronically generated (editable) documents and non-electronically generated (scanned) documents. Electronically generated documents can be sent directly into the preprocessing pipelines, while for scanned copies we first need OCR to read all the data from the images and then send it into the preprocessing pipelines. We can either use open-source tools like Tesseract or online services like Nanonets or Textract. Once all the data is in editable or electronic format, we can apply preprocessing steps like tokenization and POS tagging and then use data loaders to load the data into the NLP information extraction models.
#3 Choosing the Right Model:
As discussed in the above sections, choosing a suitable model mostly depends on the type of
data we’re working with. Today, there are several state-of-the-art models we could rely on.
Below are some of the frequently used open-source models:
1. Named Entity Recognition on CoNLL 2003 (English)
2. Key Information Extraction From Documents: Evaluation And Generator
3. Deep Reader: Information extraction from Document images via relation extraction
and Natural Language
These are some of the information extraction models. However, they are trained on particular datasets; if we utilise them on our own data, we will need to experiment with the hyperparameters and fine-tune the models accordingly.
The other way is to utilize pre-trained models and fine-tune them on our data. For information extraction from text in particular, BERT models are widely used; a hedged sketch using a pre-trained pipeline is given below.
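As a hedged example of using a pre-trained model for information extraction, here is a sketch with spaCy (one of the libraries listed earlier in this unit). The sentence is made up, and the small English model en_core_web_sm must be downloaded beforehand (python -m spacy download en_core_web_sm).

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # pre-trained English pipeline with a NER component
text = "Acme Corp reported revenue of $12.5 million for the quarter ending March 2023."

doc = nlp(text)
for ent in doc.ents:
    print(ent.text, ent.label_)      # e.g. Acme Corp ORG, $12.5 million MONEY, March 2023 DATE
```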