ML Unit 5
JNTU Hyderabad
Unit – 5
Markov Chain Monte Carlo Methods – Sampling – Proposal Distribution – Markov Chain Monte
Carlo – Graphical Models – Bayesian Networks – Markov Random Fields – Hidden Markov Models –
Tracking Methods
A classic example of reinforcement learning built around the idea of "getting lost" is an autonomous robot navigating an unfamiliar environment, such as a maze. The robot must learn to reach a specific destination by trial and error, receiving positive rewards for moving closer to the goal and negative rewards for moving further away or hitting obstacles. Through repeated attempts and feedback from the environment, it essentially "learns" how to navigate without getting lost.
Positive Reinforcement:
Positive reinforcement means adding something desirable to increase the tendency that the expected behavior will occur again. It has a positive impact on the agent's behavior and increases the strength of that behavior.
This type of reinforcement can sustain changes for a long time, but too much positive reinforcement may lead to an overload of states that diminishes the results.
Negative Reinforcement:
Negative reinforcement is the opposite of positive reinforcement: it increases the tendency that a specific behavior will occur again by removing or avoiding a negative condition.
Depending on the situation and behavior, it can be more effective than positive reinforcement, but it only reinforces behavior up to the minimum required.
There are four main elements of Reinforcement Learning, which are given below:
1. Policy
2. Reward Signal
3. Value Function
4. Model
1) Policy:
A policy defines the way an agent behaves at a given time. It maps the perceived states of the environment to the actions taken in those states. The policy is the core element of RL, as it alone can define the behavior of the agent. In some cases it may be a simple function or a lookup table, whereas in other cases it may involve general computation such as a search process. A policy can be deterministic or stochastic.
2) Reward Signal: The goal of reinforcement learning is defined by the reward signal. At each state, the environment sends an immediate signal to the learning agent, and this signal is known as a reward signal. These rewards are given according to the good and bad actions taken by the agent. The agent's main objective is to maximize the total reward it receives for good actions. The reward signal can change the policy; for example, if an action selected by the agent leads to a low reward, then the policy may change to select other actions in the future.
3) Value Function: The value function gives information about how good a situation or action is and how much reward an agent can expect. A reward indicates the immediate signal for each good or bad action, whereas the value function specifies which states and actions are good in the long run. The value function depends on the reward, as without reward there could be no value. The goal of estimating values is to achieve more reward.
4) Model: The last element of reinforcement learning is the model, which mimics the behavior of the environment. With the help of the model, one can make inferences about how the environment will behave. For example, if a state and an action are given, the model can predict the next state and reward.
The model is used for planning, which means it provides a way to choose a course of action by considering possible future situations before actually experiencing them. Approaches that solve RL problems with the help of a model are termed model-based approaches; comparatively, an approach without a model is called a model-free approach.
• State: At any point in time, the robot's "state" is its current location within the maze, including information like proximity to walls and potential pathways.
• Action: The robot can choose actions like "move forward", "turn left", or "turn right" based on its current state.
• Reward: The robot receives a positive reward when it moves closer to the goal and a negative reward when it moves further away or hits a wall.
• Learning process: Through repeated trials, the robot learns which actions in each state lead to the highest cumulative reward, effectively building a strategy to navigate the maze without getting lost.
How it works:
Initially, the robot might explore different paths randomly to understand the maze layout.
• Policy Improvement:
Over time, the robot gradually starts to prioritize actions that have led to positive rewards in
the past, leading to a more efficient navigation strategy.
• Q-Learning:
A common reinforcement learning algorithm that can be used in this scenario involves
creating a "Q-table" where each cell represents a state-action pair, storing the expected future
reward for taking a particular action in a given state.
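As a hedged illustration of this idea, the sketch below runs tabular Q-learning on a toy 4x4 grid maze. The maze layout, reward values, and hyperparameters (learning rate, discount factor, exploration rate) are assumptions chosen for the example, not values taken from these notes.

```python
import random

# Toy 4x4 grid maze: states are (row, col); goal at (3, 3).
# Rewards, learning rate, discount and epsilon are illustrative assumptions.
ACTIONS = ["up", "down", "left", "right"]
GOAL = (3, 3)
ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

def step(state, action):
    """Move in the grid; bumping into a wall keeps the state and costs -1."""
    r, c = state
    moves = {"up": (r - 1, c), "down": (r + 1, c),
             "left": (r, c - 1), "right": (r, c + 1)}
    nr, nc = moves[action]
    if not (0 <= nr < 4 and 0 <= nc < 4):
        return state, -1.0                      # hit a wall
    new_state = (nr, nc)
    return new_state, (10.0 if new_state == GOAL else -0.1)

# Q-table: expected future reward for each (state, action) pair.
Q = {((r, c), a): 0.0 for r in range(4) for c in range(4) for a in ACTIONS}

for episode in range(500):
    state = (0, 0)
    while state != GOAL:
        # Epsilon-greedy choice: explore sometimes, otherwise exploit Q.
        if random.random() < EPSILON:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: Q[(state, a)])
        next_state, reward = step(state, action)
        # Q-learning update rule.
        best_next = max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (reward + GAMMA * best_next - Q[(state, action)])
        state = next_state
```

After training, following the highest-valued action in each state traces an efficient path to the goal, which is exactly the "policy improvement" behavior described above.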
Real-world applications:
• Robot navigation: This concept is directly applicable to robots navigating complex environments like warehouses or disaster zones.
• Autonomous vehicles: Self-driving cars can use reinforcement learning to learn how to navigate roads, avoiding obstacles and making optimal route decisions.
• Game playing: AI agents in video games can learn to play effectively by receiving rewards for achieving goals within the game.
Markov Chain Monte Carlo Methods – Sampling – Proposal Distribution – Markov Chain Monte
Carlo – Graphical Models – Bayesian Networks – Markov Random Fields – Hidden Markov Models –
Tracking Methods
Monte Carlo methods are mainly used in three distinct problem classes: optimization, numerical integration, and generating draws from a probability distribution. They can also be used to model phenomena with significant uncertainty in inputs, such as calculating the risk of a nuclear power plant failure. Monte Carlo methods are often implemented using computer simulations, and they can provide approximate solutions to problems that are otherwise intractable or too complex to analyze mathematically.
Sampling distribution is essential in various aspects of real life. Sampling distributions are important
for inferential statistics. A sampling distribution represents the distribution of a statistic, like the
mean or standard deviation, which is calculated from multiple samples of a population. It shows how
these statistics vary across different samples drawn from the same population.
• Statistic: A numerical summary of a sample, such as mean, median, standard deviation, etc.
• Population: Entire group of individuals or observations that a study aims to describe or draw
conclusions about.
• Central Limit Theorem (CLT): A fundamental theorem in statistics stating that the sampling distribution of the sample mean tends to be approximately normal as the sample size increases, regardless of the shape of the population distribution.
• Confidence Interval: A range of values calculated from sample data that is likely to contain
the population parameter with a certain level of confidence.
• Sampling Method: Technique used to select a sample from a population, such as simple
random sampling, stratified sampling, cluster sampling, etc.
• Inferential Statistics: Statistical methods and techniques used to draw conclusions or make
inferences about a population based on sample data.
• Hypothesis Testing: A statistical method for making decisions or drawing conclusions about a
population parameter based on sample data and assumptions about the population.
1. Number Observed in a Population: The symbol for this variable is "N." It is the measure of
observed activity in a given group of data.
2. Number Observed in Sample: The symbol for this variable is "n." It is the measure of
observed activity in a random sample of data that is part of the larger grouping.
3. Method of Choosing Sample: How you choose the samples can account for variability in some cases.
• The sampling distribution of the sample mean x̄ is obtained by repeatedly drawing samples chosen from the population, computing the mean of each sample, and plotting the data points. The graph shows a normal distribution whose center is the mean of the sampling distribution, which represents the mean of the entire population.
• The mean, or center, of the sampling distribution of x̄ is equal to the population mean: μ_x̄ = μ.
• The standard deviation (standard error) of the sampling distribution of x̄ is σ_x̄ = σ/√n.
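The short simulation below is a hedged illustration of these two facts; the population (exponential with mean 2), the sample size, and the number of samples are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed non-normal population: exponential with mean 2 (so sigma = 2).
population_mean, population_sigma = 2.0, 2.0
n, num_samples = 30, 10_000

# Draw many samples of size n and record each sample mean.
sample_means = rng.exponential(scale=2.0, size=(num_samples, n)).mean(axis=1)

print("mean of sample means:", sample_means.mean())   # close to mu = 2
print("std of sample means:", sample_means.std())     # close to sigma / sqrt(n)
print("sigma / sqrt(n):", population_sigma / np.sqrt(n))
```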
T-Distribution
The t-distribution is used when dealing with a small sample, or a population about which little is known. It is used to estimate the mean of the population and other statistics such as confidence intervals, statistical differences, and linear regression coefficients. The t-distribution uses a t-score to evaluate data that would not be appropriate for a normal distribution.
t = (x̄ - μ) / (s / √n)
where:
• x̄ is the sample mean
• μ is the population mean
• s is the sample standard deviation
• n is the sample size
This formula calculates the difference between the sample mean and the population mean, scaled by
the standard error of the sample mean. The t-score helps to assess whether the observed difference
between the sample and population means is statistically significant.
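As a hedged example, the snippet below computes a t-score for a made-up sample; the data values and the hypothesized population mean are assumptions used only for illustration.

```python
import numpy as np

# Hypothetical sample and hypothesized population mean (assumed values).
sample = np.array([4.8, 5.1, 5.4, 4.9, 5.2, 5.0, 5.3])
mu_0 = 5.0

x_bar = sample.mean()
s = sample.std(ddof=1)          # sample standard deviation
n = len(sample)

t_score = (x_bar - mu_0) / (s / np.sqrt(n))
print("t =", t_score)
```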
Markov chain Monte Carlo methods create samples from a continuous random variable,
with probability density proportional to a known function. These samples can be used to evaluate an
integral over that variable, as its expected value or variance.
Practically, an ensemble of chains is generally developed, starting from a set of points arbitrarily
chosen and sufficiently distant from each other. These chains are stochastic processes of "walkers"
which move around randomly according to an algorithm that looks for places with a reasonably high
contribution to the integral to move into next, assigning them higher probabilities.
Random walk Monte Carlo methods are a kind of random simulation or Monte Carlo method. However, whereas the random samples of the integrand used in conventional Monte Carlo integration are statistically independent, those used in MCMC are autocorrelated. The correlation between samples introduces the need to use the Markov chain central limit theorem when estimating the error of mean values.
These algorithms create Markov chains such that they have an equilibrium distribution which is
proportional to the function given.
In statistics, Markov chain Monte Carlo (MCMC) is a class of algorithms used to draw samples from
a probability distribution. Given a probability distribution, one can construct a Markov chain whose
elements' distribution approximates it – that is, the Markov chain's equilibrium distribution matches
the target distribution. The more steps that are included, the more closely the distribution of the
sample matches the actual desired distribution.
Markov chain Monte Carlo methods are used to study probability distributions that are too complex
or too highly dimensional to study with analytic techniques alone. Various algorithms exist for
constructing such Markov chains, including the Metropolis–Hastings algorithm.
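As a hedged sketch, the code below implements a basic random-walk Metropolis–Hastings sampler for a one-dimensional target density known only up to a constant; the target (an unnormalized standard normal), the proposal width, and the number of steps are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def target_unnormalized(x):
    """Target density known up to a constant (here an unnormalized standard normal)."""
    return np.exp(-0.5 * x * x)

def metropolis_hastings(num_steps=50_000, proposal_scale=1.0):
    samples = np.empty(num_steps)
    x = 0.0                                   # arbitrary starting point
    for i in range(num_steps):
        # Symmetric random-walk proposal.
        x_new = x + rng.normal(scale=proposal_scale)
        # Accept with probability min(1, p(x_new) / p(x)).
        if rng.random() < target_unnormalized(x_new) / target_unnormalized(x):
            x = x_new
        samples[i] = x
    return samples

chain = metropolis_hastings()
print("sample mean:", chain.mean())          # close to 0
print("sample variance:", chain.var())       # close to 1
```

Because consecutive samples are autocorrelated, the effective number of independent samples is smaller than the chain length, which is exactly why the Markov chain central limit theorem is needed when estimating errors.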
Graphical Model
One major difference between Machine Learning (ML) and Deep Learning (DL) is the amount of domain knowledge needed to solve a problem. ML algorithms regularly exploit domain knowledge; however, the solution can be biased if that knowledge is incomplete. If it is used correctly, we can solve problems more efficiently.
The graphical model (GM) is a branch of ML which uses a graph to represent a domain problem. Many ML and DL algorithms, including the Naive Bayes algorithm, the Hidden Markov Model, the Restricted Boltzmann Machine and Neural Networks, belong to the GM family. Studying it gives us a bird's-eye view of many ML algorithms. Here we focus on the basics of representing a problem using a graphical model; how inference is made and how the model is trained are discussed later.
Each node in the graph represents a variable, and each edge represents the dependency or correlation between two variables. With such a model, we can answer queries such as:
• The likelihood of the observations, P(E): the chance of the lab results (evidence E).
• The marginal probability, P(X₁), P(X₁, X₃), etc.: for example, the chance of having Alzheimer's when you are 70.
• The conditional probability (a posterior belief based on evidence), P(Y|E): for example, the chance of having Alzheimer's given your parent has it.
• Maximum a Posteriori (MAP), arg max_X P(X, E): the most likely disease given the lab results.
For example, the marginal probability of having the grass wet is computed by
summing over other variables. For the conditional probability, we can first apply
Bayes’ Theorem followed by the corresponding marginal probabilities.
Joint probability
Which of the probability distributions below is more complex? Let's give this some serious thought, because it helps us understand machine learning (ML) better. So what is the complexity of these models? The full joint p(d, i, g, s, l) has 5 variables, with 2⁵ = 32 possible combinations; to model p directly, we would need data for all 32 possibilities. On the contrary, the factorized graph above has 4 tables with only 26 parameters.
Don't underestimate the curse of exponential growth: 64 binary variables already give 2⁶⁴ possible combinations.
Bayesian networks are probabilistic because these networks are built from a probability distribution, and they also use probability theory for prediction and anomaly detection. Real-world applications are probabilistic in nature, and to represent the relationships between multiple events, we need a Bayesian network. It can also be used in various tasks including prediction, anomaly detection, diagnostics, automated insight, reasoning, time series prediction, and decision making under uncertainty.
A Bayesian network can be used for building models from data and expert opinions, and it consists of two parts:
o Directed Acyclic Graph
o Table of conditional probabilities.
The generalized form of a Bayesian network that represents and solves decision problems under uncertain knowledge is known as an influence diagram.
A Bayesian network graph is made up of nodes and arcs (directed links), where:
o Each node represents a random variable; in the diagram above, A, B, C, and D are random variables represented by the nodes of the network graph.
o If we are considering node B, which is connected to node A by a directed arrow, then node A is called the parent of node B.
o If there is no directed link between two nodes, those nodes are independent of each other; for example, node C is independent of node A.
The Bayesian network has mainly two components:
o Causal Component
o Actual numbers
Each node in the Bayesian network has a conditional probability distribution P(Xi | Parent(Xi)), which determines the effect of the parent on that node.
A Bayesian network is based on the joint probability distribution and conditional probability, so let's first understand the joint probability distribution:
Joint probability distribution:
If we have variables x1, x2, x3, ....., xn, then the probability of a particular combination of x1, x2, x3, ..., xn is known as the joint probability distribution.
P[x1, x2, x3, ....., xn] can be written in the following way in terms of the joint probability distribution (the chain rule):
P[x1, x2, x3, ....., xn] = P[x1 | x2, x3, ....., xn] · P[x2, x3, ....., xn]
= P[x1 | x2, x3, ....., xn] · P[x2 | x3, ....., xn] · .... · P[xn-1 | xn] · P[xn]
In general, for each variable Xi we can write the equation as:
P(Xi | Xi-1, ........., X1) = P(Xi | Parents(Xi))
Explanation of Bayesian network:
Let's understand the Bayesian network through an example by creating a directed acyclic graph:
Problem:
Calculate the probability that the alarm has sounded, but there is neither a burglary nor an earthquake, and David and Sophia have both called Harry.
Solution:
o The Bayesian network for the above problem is given below. The network structure shows that Burglary and Earthquake are the parent nodes of Alarm and directly affect the probability of the alarm going off, while David's and Sophia's calls depend on the alarm probability.
o The network represents that our assumptions do not directly perceive the burglary, do not notice a minor earthquake, and that David and Sophia do not confer before calling.
o The conditional distribution for each node is given as a conditional probability table, or CPT.
o Each row in a CPT must sum to 1 because the entries in the row represent an exhaustive set of cases for the variable.
We can write the events of the problem statement in the form of a probability, P[D, S, A, B, E], and we can rewrite this probability using the joint probability distribution:
P[D, S, A, B, E] = P[D | S, A, B, E] · P[S, A, B, E]
= P[D | S, A, B, E] · P[S | A, B, E] · P[A, B, E]
= P[D | A] · P[S | A, B, E] · P[A, B, E]
= P[D | A] · P[S | A] · P[A | B, E] · P[B, E]
= P[D | A] · P[S | A] · P[A | B, E] · P[B | E] · P[E]
Let's take the observed probabilities for the Burglary and Earthquake components from their CPTs.
Hence, a Bayesian network can answer any query about the domain by using the joint distribution.
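As a hedged sketch, the code below evaluates this query numerically using the factorization above; since the CPT values are not reproduced in these notes, the numbers used here are assumed illustrative values, not the ones from the original tables.

```python
# Assumed illustrative CPT values (the original tables are not shown in these notes).
P_B = 0.001                      # P(Burglary)
P_E = 0.002                      # P(Earthquake)
P_A = {                          # P(Alarm | Burglary, Earthquake)
    (True, True): 0.95, (True, False): 0.94,
    (False, True): 0.29, (False, False): 0.001,
}
P_D = {True: 0.91, False: 0.05}  # P(David calls | Alarm)
P_S = {True: 0.75, False: 0.02}  # P(Sophia calls | Alarm)

# Query: alarm sounded, no burglary, no earthquake, David and Sophia both called.
b, e, a, d, s = False, False, True, True, True

p = (P_D[a] * P_S[a] * P_A[(b, e)]
     * (P_B if b else 1 - P_B)
     * (P_E if e else 1 - P_E))
print("P(D, S, A, ~B, ~E) =", p)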
Gibbs Sampling
• Perform a random walk through the space of complete variable assignments. On each move:
  1. Pick a non-evidence variable X.
  2. Resample its value conditioned on the current values of all the other variables.
• Repeat many times. The frequency with which any variable X is true is an estimate of its posterior probability.
For a network whose joint factorizes as P(A, X, B, C) = P(A) P(X|A) P(C) P(B|X, C), the conditional used to resample X is
P(X | A, B, C) = P(A) P(X|A) P(C) P(B|X, C) / P(A, B, C) ∝ P(X|A) P(B|X, C) = α P(X|A) P(B|X, C)
and the posterior is estimated from the samples as
P(X = x | E) ≈ (number of samples with X = x) / (total number of samples)
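As a hedged sketch, the code below runs Gibbs sampling on a tiny sprinkler-style network (Cloudy influences Sprinkler and Rain, which both influence WetGrass) with WetGrass observed true; the structure and all CPT values are assumptions invented for the example.

```python
import random

random.seed(0)

# Assumed toy network: Cloudy -> Sprinkler, Cloudy -> Rain, (Sprinkler, Rain) -> WetGrass.
P_C = 0.5
P_S = {True: 0.1, False: 0.5}                      # P(Sprinkler | Cloudy)
P_R = {True: 0.8, False: 0.2}                      # P(Rain | Cloudy)
P_W = {(True, True): 0.99, (True, False): 0.9,     # P(WetGrass | Sprinkler, Rain)
       (False, True): 0.9, (False, False): 0.01}

def bernoulli(p):
    return random.random() < p

def joint(c, s, r, w):
    """Unnormalized joint probability of a complete assignment."""
    return ((P_C if c else 1 - P_C)
            * (P_S[c] if s else 1 - P_S[c])
            * (P_R[c] if r else 1 - P_R[c])
            * (P_W[(s, r)] if w else 1 - P_W[(s, r)]))

# Evidence: WetGrass = True. Gibbs-sample the non-evidence variables C, S, R.
c, s, r, w = True, False, True, True
rain_true, num_samples = 0, 20_000
for _ in range(num_samples):
    for var in ("c", "s", "r"):
        # Resample one variable from P(var | everything else) by normalizing the joint.
        if var == "c":
            p_t, p_f = joint(True, s, r, w), joint(False, s, r, w)
            c = bernoulli(p_t / (p_t + p_f))
        elif var == "s":
            p_t, p_f = joint(c, True, r, w), joint(c, False, r, w)
            s = bernoulli(p_t / (p_t + p_f))
        else:
            p_t, p_f = joint(c, s, True, w), joint(c, s, False, w)
            r = bernoulli(p_t / (p_t + p_f))
    rain_true += r

print("P(Rain | WetGrass=true) ~", rain_true / num_samples)
```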
Markov Random Fields
Both Bayesian networks and Markov random fields are graphical models in which each node captures the (discrete or Gaussian) probability distribution of a variable and the edges represent dependencies between those variables, weighted to represent the relative strengths of the dependencies.
The defining difference between a Bayesian network and a Markov random field is that a Bayesian network is directed while a Markov random field is undirected. For example, a hidden Markov model is directed and is thus a sub-type of Bayesian network rather than a sub-type of Markov random field.
A conditional random field is a Markov random field that has specifically been set up to allow the values of a specific variable or variables to be predicted based on the values of the other variables.
In undirected graphical models, an edge represents a potential between two variables, and the joint distribution factorizes into a product of these potentials over the variables.
Just as we have CPDs for Bayesian networks, we have tables (potentials) to incorporate relations between nodes in Markov networks. However, there are two crucial differences between these tables and CPDs.
✓ If we take 'A' as one subset and 'C' as another subset, then there is no edge between them, so there is no way to go between 'A' and 'C' without passing through the separating subset; hence 'A' and 'C' are conditionally independent given that subset.
According to Sajja [2012], a Markov chain is a special case of a weighted automaton in which the input sequence uniquely determines which states the automaton will go through for that input sequence. Since they cannot represent inherently ambiguous problems, a Markov chain is only useful for assigning probabilities to unambiguous sequences.
A Markov chain is specified by the following components:
Q = q1, q2, ..., qN : a set of N states
A = [aij]N×N : a transition probability matrix A, each aij representing the probability of moving from state i to state j
q0, qend : a special start state and end state which are not associated with observations
π = π1, π2, ..., πN : an initial probability distribution over states, where πi is the probability that the Markov chain will start in state i; some states j may have πj = 0, meaning that they cannot be initial states
A Markov chain embodies an important assumption about these probabilities. In a first-order Markov chain, the probability of a particular state depends only on the immediately previous state. Symbolically, this Markov assumption is:
P(qi | q1 ... qi−1) = P(qi | qi−1)
Since each aij expresses the probability P(qj | qi), the laws of probability require that the outgoing transition probabilities from any state must sum to 1: Σj aij = 1.
Similarly, each πi expresses the probability P(qi | START), and the initial probabilities must also sum to 1: Σi πi = 1.
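As a hedged sketch of these components, the code below defines a small weather Markov chain and uses it to sample a sequence and score a sequence under the first-order Markov assumption; the states and all probability values are assumed illustrative numbers.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed illustrative Markov chain over weather states.
states = ["cold", "hot", "rain"]
pi = np.array([0.5, 0.3, 0.2])                 # initial distribution, sums to 1
A = np.array([[0.6, 0.2, 0.2],                 # rows sum to 1: P(next | current)
              [0.3, 0.5, 0.2],
              [0.4, 0.3, 0.3]])

def sample_sequence(length=10):
    seq = [rng.choice(len(states), p=pi)]
    for _ in range(length - 1):
        seq.append(rng.choice(len(states), p=A[seq[-1]]))
    return [states[i] for i in seq]

def sequence_probability(seq):
    """P(q1) * product of P(q_t | q_{t-1}) under the first-order Markov assumption."""
    idx = [states.index(s) for s in seq]
    p = pi[idx[0]]
    for prev, cur in zip(idx, idx[1:]):
        p *= A[prev, cur]
    return p

print(sample_sequence())
print(sequence_probability(["cold", "cold", "rain"]))
```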
A Markov chain is useful when we need to compute the probability of a sequence of events that we can observe in the world. It is also useful for events in which one is interested but which may not be directly observable in the world. For example, in NER the named entity tags are not observed in the real world the way we observe a sequence of weather states such as cold, hot, rain. Instead, a sequence of words is observed in the real world, and NER has to infer the correct named entity tags from this word sequence. The named entity tags are said to be hidden because they are not observed.
HMM Assumptions
Markov Assumption: A first-order Hidden Markov Model instantiates two simplifying assumptions. First, as in the case of a first-order Markov chain, the probability of a particular state depends only on the immediately previous state:
P(qi | q1 ... qi−1) = P(qi | qi−1)
Output Independence Assumption: The probability of an output observation oi depends only on the state that produced the observation, qi, and not on any other states or any other observations:
P(oi | q1 ... qi ... qn, o1 ... oi ... on) = P(oi | qi)
Stationarity Assumption: Here it is assumed that state transition probabilities are independent of the actual time at which the transitions take place. Mathematically,
P(q(t1+1) = j | q(t1) = i) = P(q(t2+1) = j | q(t2) = i) for any times t1 and t2.
Having described the structure of an HMM, it is logical to introduce the algorithms for computing things with them. An influential tutorial by Rabiner [1989] introduced the idea that Hidden Markov Models should be characterized by three fundamental problems. The evaluation problem can be used for isolated (word) recognition. The decoding problem is related to continuous recognition as well as to segmentation. The learning problem must be solved if we want to train an HMM for subsequent use in recognition tasks.
Evaluation: By evaluation we mean: what is the probability that a particular sequence of symbols is produced by a particular model? That is, given an HMM λ = (A, B, π) and an observation sequence O, compute P(O | λ).
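As a hedged sketch, the code below computes P(O | λ) with the standard forward algorithm on a tiny two-state HMM; the state names, transition matrix A, emission matrix B, and initial distribution π are assumed illustrative values, not parameters from these notes.

```python
import numpy as np

# Assumed illustrative two-state HMM (hot/cold states emitting symbols 1, 2 or 3).
states = ["hot", "cold"]
obs_symbols = [1, 2, 3]
pi = np.array([0.8, 0.2])                  # initial distribution
A = np.array([[0.6, 0.4],                  # transition probabilities
              [0.5, 0.5]])
B = np.array([[0.2, 0.4, 0.4],             # emission probabilities B[state, symbol]
              [0.5, 0.4, 0.1]])

def forward(observations):
    """Return P(O | lambda) via the forward algorithm."""
    obs_idx = [obs_symbols.index(o) for o in observations]
    # alpha[i] = P(o1..ot, q_t = i), updated step by step.
    alpha = pi * B[:, obs_idx[0]]
    for t in range(1, len(obs_idx)):
        alpha = (alpha @ A) * B[:, obs_idx[t]]
    return alpha.sum()

print(forward([3, 1, 3]))
```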
Advantages and Disadvantages of HMM
Advantages:
• The underlying theoretical basis is much more sound, elegant and easy to understand.
• It is easier to implement and analyze.
• HMM taggers are very simple to train (just need to compile counts from the training corpus).
• Performs relatively well (over 90% performance on named entities).
• Statisticians are comfortable with the theoretical base of HMM.
• Liberty to manipulate the training and verification processes.
• Mathematical/theoretical analysis of the results and processes.
• Incorporates prior knowledge into the architecture with good design.
• The model can be initialized close to something believed to be correct.
• It eliminates the label bias problem.
• It has also proved effective for a number of other tasks, such as speech recognition, handwriting recognition and sign language recognition.
• Because each HMM uses only positive data, they scale well, since new words can be added without affecting the learnt HMMs.
Disadvantages:
• In order to define the joint probability over observation and label sequences, an HMM needs to enumerate all possible observation sequences.
• The main difficulty is that modeling the probability of assigning a tag to a word can be very difficult if the "words" are complex.
• It is not practical to represent multiple overlapping features and long-term dependencies.
• The number of parameters to be evaluated is huge, so it needs a large data set for training.
• It requires a huge amount of training in order to obtain better results.
• HMMs only use positive data to train. In other words, HMM training involves maximizing the observed probabilities for examples belonging to a class, but it does not minimize the probability of observing instances from other classes.
• It adopts the Markovian assumption that the emission and the transition probabilities depend only on the current state, which does not map well to many real-world domains.
Tracking Methods
Object tracking aims at estimating the bounding boxes and identities of objects in videos. It takes in a set of initial object detections, develops a visual model for the objects, and tracks the objects as they move around in a video. Furthermore, object tracking enables us to assign a unique ID to each tracked object, making it possible for us to count unique objects in a video. Often, there is an indication around the object being tracked, for example a surrounding square that follows the object, showing the user where the object is.
SORT
For the ID assignment, i.e. the data association task, the new target states are used to predict the bounding boxes, which are later compared with the detected boxes in the current time frame. The IOU metric and the Hungarian algorithm are used to choose the optimum box to pass on the identity. SORT achieves 74.6 MOTA and 76.9 IDF1 on the MOT17 dataset with 291 ID switches at 30 FPS.
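As a hedged sketch of the association step described above, the code below builds an IOU-based cost matrix between predicted track boxes and detected boxes and matches them with the Hungarian algorithm (scipy's linear_sum_assignment); the boxes and the IOU threshold are assumed illustrative values, not part of the original SORT code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def associate(predicted, detected, iou_threshold=0.3):
    """Match predicted track boxes to detections, maximizing total IOU."""
    cost = np.array([[1.0 - iou(p, d) for d in detected] for p in predicted])
    rows, cols = linear_sum_assignment(cost)      # Hungarian algorithm
    matches = [(r, c) for r, c in zip(rows, cols)
               if 1.0 - cost[r, c] >= iou_threshold]
    return matches

# Illustrative predicted track boxes and current-frame detections.
tracks = [(10, 10, 50, 50), (100, 100, 150, 160)]
detections = [(102, 98, 148, 158), (12, 14, 52, 55)]
print(associate(tracks, detections))              # e.g. [(0, 1), (1, 0)]
```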
DeepSORT
Despite achieving overall good performance in terms of tracking precision and accuracy, SORT has a high number of identity switches. It fails in many challenging scenarios like occlusions, different camera angles, etc. To overcome these limitations, DeepSORT replaces the association metric with a more informed metric that combines motion and appearance information. In particular, a "deep appearance" distance metric is added. The core idea is to obtain a vector that can be used to represent a given image. To do this, DeepSORT creates a classifier and strips the final classification layer; this leaves a dense layer that produces a single feature vector.
In addition, DeepSORT adds extra dimensions to its motion tracking model. The state of each target is defined on the eight-dimensional state space (u, v, γ, h, ẋ, ẏ, γ̇, ḣ), which contains the bounding box center position (u, v), aspect ratio γ, height h, and their respective velocities in image coordinates. These additions enable DeepSORT to effectively handle challenging scenarios and reduce the number of identity switches by 45%. DeepSORT achieves 75.4 MOTA and 77.2 IDF1 on the MOT17 dataset with 239 ID switches, but a lower FPS of 13.
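As a hedged sketch of the idea of combining motion and appearance cues, the code below blends a motion cost (for example 1 - IOU) with a cosine distance between appearance embeddings; the weighting factor and the embedding values are assumptions for illustration, not DeepSORT's exact formulation.

```python
import numpy as np

def cosine_distance(a, b):
    """1 - cosine similarity between two appearance feature vectors."""
    return 1.0 - float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def combined_cost(motion_cost, track_embedding, detection_embedding, weight=0.5):
    """Blend a motion cost (e.g. 1 - IOU) with an appearance distance."""
    appearance_cost = cosine_distance(track_embedding, detection_embedding)
    return weight * motion_cost + (1.0 - weight) * appearance_cost

# Illustrative values: a track and a detection with similar appearance embeddings.
track_emb = np.array([0.1, 0.9, 0.3])
det_emb = np.array([0.12, 0.88, 0.31])
print(combined_cost(motion_cost=0.4, track_embedding=track_emb,
                    detection_embedding=det_emb))
```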
FairMOT
Approaching object tracking from a multi-task learning perspective of object detection and re-ID in a single network is appealing, since it allows shared optimization of the two tasks. However, the two tasks tend to compete with each other. In particular, previous works usually treat re-ID (re-identification) as a secondary task whose accuracy is heavily affected by the primary detection task. As a result, the network is biased towards the primary detection task, which is not fair to the re-ID task.
The detection branch is built on top of CenterNet: three parallel heads are appended to DLA-34 to estimate heatmaps, object center offsets, and bounding box sizes, respectively. The re-ID branch aims to generate features that can distinguish objects. Ideally, the affinity among different objects should be smaller than that between instances of the same object. To achieve this goal, FairMOT applies a convolution layer with 128 kernels on top of the backbone features to extract re-ID features for each location. The re-ID features are learned through a classification task: all object instances of the same identity in the training set are treated as the same class. All these optimizations help FairMOT achieve 77.2 MOTA and 79.8 IDF1 on the MOT17 dataset at a running speed of 25.9 FPS.
TransMOT