ML Basics Unit 5
UNIT-V
Reinforcement Learning – Overview – Getting Lost Example - Markov Chain Monte Carlo Methods
– Sampling – Proposal Distribution – Markov Chain Monte Carlo – Hidden Markov Models
Reinforcement Learning
Reinforcement Learning (RL) is a branch of machine learning that focuses on how agents can
learn to make decisions through trial and error to maximize cumulative rewards. RL allows
machines to learn by interacting with an environment and receiving feedback based on their
actions. This feedback comes in the form of rewards or penalties.
Reinforcement learning is a reward/punishment-based learning technique in which a teacher or critic does not guide the learner as in supervised learning, but instead punishes wrong actions and rewards correct ones.
Reinforcement Learning revolves around the idea that an agent (the learner or decision-
maker) interacts with an environment to achieve a goal. The agent performs actions and
receives feedback to optimize its decision-making over time.
Agent: The decision-maker that performs actions.
Environment: The world or system in which the agent operates.
State: The situation or condition the agent is currently in.
Action: The possible moves or decisions the agent can make.
Reward: The feedback or result from the environment based on the agent’s action.
How Reinforcement Learning Works
The RL process involves an agent performing actions in an environment, receiving rewards or
penalties based on those actions, and adjusting its behavior accordingly. This loop helps the
agent improve its decision-making over time to maximize the cumulative reward.
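To make this loop concrete before breaking down the components, here is a minimal sketch of the agent-environment interaction in Python. The toy "corridor" environment, its reward values, and the `step` helper are hypothetical and exist only for this illustration; the agent follows a purely random policy and does no learning yet.

```python
import random

# A minimal sketch of the agent-environment loop described above.
# Hypothetical environment: a 1-D "corridor" where the agent starts at position 0,
# receives +10 for reaching position 5, and pays -1 per step.

def step(state, action):
    """Environment dynamics: return (next_state, reward, done)."""
    next_state = max(0, state + (1 if action == "right" else -1))
    if next_state == 5:
        return next_state, +10, True   # reached the goal
    return next_state, -1, False       # small penalty per step

state, done, total_reward = 0, False, 0
while not done:
    action = random.choice(["left", "right"])   # random policy (no learning yet)
    state, reward, done = step(state, action)
    total_reward += reward                      # cumulative reward the agent tries to maximize

print("Cumulative reward:", total_reward)
```

The components used in this loop are broken down next.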
Here’s a breakdown of RL components:
Policy: A strategy that the agent uses to determine the next action based on the
current state.
Reward Function: A function that provides feedback on the actions taken, guiding the
agent towards its goal.
Value Function: Estimates the future cumulative rewards the agent will receive from
a given state (written out formally after this list).
Model of the Environment: A representation of the environment that predicts future
states and rewards, aiding in planning.
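For reference, the cumulative reward (return) and the value function mentioned above are commonly written with a discount factor $\gamma \in [0, 1]$ as follows; this is a standard formulation added here for clarity, not tied to a specific algorithm in these notes.

```latex
% Discounted return from time t, and the value of state s under policy \pi
G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^{2} r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^{k}\, r_{t+k+1},
\qquad
V^{\pi}(s) = \mathbb{E}_{\pi}\!\left[\, G_t \mid s_t = s \,\right]
```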
Reinforcement Learning Example: Navigating a Maze
Imagine a robot navigating a maze to reach a diamond while avoiding fire hazards. The goal is
to find the optimal path with the least number of hazards while maximizing the reward:
Each time the robot moves correctly, it receives a reward.
If the robot takes the wrong path, it loses points.
The robot learns by exploring different paths in the maze. By trying various moves, it
evaluates the rewards and penalties for each path. Over time, the robot determines the best
route by selecting the actions that lead to the highest cumulative reward.
1. Exploration: The robot starts by exploring all possible paths in the maze, taking different
actions at each step (e.g., move left, right, up, or down).
2. Feedback: After each move, the robot receives feedback from the environment:
A positive reward for moving closer to the diamond.
A penalty for moving into a fire hazard.
3. Adjusting Behavior: Based on this feedback, the robot adjusts its behavior to maximize
the cumulative reward, favoring paths that avoid hazards and bring it closer to the
diamond.
4. Optimal Path: Eventually, the robot discovers the optimal path with the least number of
hazards and the highest reward by selecting the right actions based on past experiences.
Getting Lost Example: Navigating a Maze
Imagine you’re lost in a maze (or a robot navigating a maze-like environment). The goal is to
find the exit as quickly as possible. In RL, this scenario is modeled as follows:
1. RL Components in the Maze Example
Agent: You (or the robot) trying to find the exit.
Environment: The maze, with walls, paths, and an exit.
States: Each position in the maze (e.g., a specific intersection or coordinate). For
example, state $S_1$ might be “at the entrance,” and state $S_n$ might be “at a
dead end.”
Actions: Possible moves at each state (e.g., go left, right, forward, or backward).
Rewards: Feedback from the environment:
o Positive reward (e.g., +100) for reaching the exit.
o Small negative reward (e.g., -1) for each step to encourage efficiency.
o Optional: Larger negative reward (e.g., -10) for hitting a wall or dead end.
Policy: The strategy the agent learns to choose actions (e.g., “in state $S_1$, go right
with 80% probability”). Initially random, it improves over time.
2. RL Process in the Maze
The agent learns through trial-and-error:
1. Initialization: Starts at the maze entrance with no prior knowledge (random policy).
2. Interaction:
o In state $S_1$ (entrance), the agent chooses an action (e.g., go right).
o The environment responds with a new state (e.g., $S_2$, a new position) and a
reward (e.g., -1 for a step).
o If the agent hits a wall, it gets a negative reward (-10) and stays in the same state.
3. Learning:
o The agent updates its policy based on rewards using an RL algorithm (e.g., Q-
learning):
Q-learning maintains a table of Q-values, estimating the expected future
reward for each state-action pair.
Example: $Q(S_1, \text{right})$ increases if “going right” leads to the exit faster; a minimal sketch of this kind of update is given below.
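As a rough sketch of the Q-value update just described (not the exact maze from the example), here is a minimal tabular Q-learning loop. The tiny hand-coded maze, the transition table, and the parameter values (learning rate, discount factor, exploration rate) are hypothetical choices made only for this illustration.

```python
import random
from collections import defaultdict

# Hypothetical maze: "G" is the exit (reward +100), every other move costs -1.
# Transitions are encoded by hand for brevity: (state, action) -> (next_state, reward).
transitions = {
    ("S1", "right"): ("S2", -1), ("S1", "down"): ("S3", -1),
    ("S2", "down"): ("G", 100),  ("S2", "left"): ("S1", -1),
    ("S3", "right"): ("G", 100), ("S3", "up"): ("S1", -1),
}

alpha, gamma, epsilon = 0.5, 0.9, 0.2      # learning rate, discount, exploration rate
Q = defaultdict(float)                      # Q-table: (state, action) -> estimated return

def actions(state):
    return [a for (s, a) in transitions if s == state]

for episode in range(500):
    state = "S1"
    while state != "G":
        # Epsilon-greedy action selection: mostly exploit, occasionally explore.
        if random.random() < epsilon:
            action = random.choice(actions(state))
        else:
            action = max(actions(state), key=lambda a: Q[(state, a)])
        next_state, reward = transitions[(state, action)]
        # Q-learning update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        best_next = max((Q[(next_state, a)] for a in actions(next_state)), default=0.0)
        Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
        state = next_state

print(max(actions("S1"), key=lambda a: Q[("S1", a)]))  # learned first move from the entrance
```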
This is the Markov property: the probability of being in state s at time t+1, given all previous states, is the same as the probability of being in state s at time t+1 given only the current state.
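Written out (a standard statement of the first-order Markov assumption, added here for reference), this is:

```latex
P\big(s_{t+1} = s \mid s_t, s_{t-1}, \dots, s_0\big) = P\big(s_{t+1} = s \mid s_t\big)
```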
MCMC Algorithms:
MCMC works by constructing a Markov chain (a sequence of states where each state depends only on the previous one) that has π(x) as its stationary distribution. By simulating this chain for many iterations, the samples produced approximate the target distribution, allowing you to estimate properties like means, variances, or probabilities.
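One standard way to guarantee that $\pi(x)$ is the stationary distribution (and the one exploited by Metropolis-Hastings below) is to make the chain's transition kernel $T(x \to x')$ satisfy detailed balance; the condition is stated here for reference:

```latex
\pi(x)\, T(x \to x') \;=\; \pi(x')\, T(x' \to x) \qquad \text{for all } x, x'
```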
The Metropolis-Hastings (MH) Algorithm is a popular MCMC method that generates samples from a target distribution π(n) by constructing a Markov chain. It uses a proposal distribution P(n*∣n) to suggest new states and accepts or rejects them based on an acceptance probability, ensuring the chain converges to the target distribution:
$$A(n, n^{*}) = \min\!\left(1,\; \frac{\pi(n^{*})\, P(n \mid n^{*})}{\pi(n)\, P(n^{*} \mid n)}\right)$$
Here, P(n ∣ n*) is the probability of proposing n given n* (the reverse proposal), and P(n* ∣ n) is the probability of proposing n* given n.
If the proposal distribution is symmetric (i.e., P(n ∣ n*) = P(n* ∣ n)), the ratio P(n ∣ n*)/P(n* ∣ n) = 1, simplifying the acceptance probability to:
$$A(n, n^{*}) = \min\!\left(1,\; \frac{\pi(n^{*})}{\pi(n)}\right)$$
Steps 8–11: Accept or reject: The proposed state n* is accepted with probability A(n,n∗). If
rejected, the chain stays at n. The proposal distribution influences the acceptance rate
(computed in Step 16), as a poorly chosen proposal can lead to frequent rejections or slow
exploration.
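To make the procedure concrete, here is a minimal Metropolis-Hastings sketch in Python with a symmetric Gaussian random-walk proposal. The target density (an unnormalized standard normal), the proposal scale, and the number of iterations are arbitrary choices for illustration; this is a sketch of the general recipe, not the numbered algorithm listing referenced in the steps above.

```python
import math
import random

def target(x):
    """Unnormalized target density pi(x); here an (unnormalized) standard normal."""
    return math.exp(-0.5 * x * x)

def metropolis_hastings(n_iters=10_000, proposal_scale=1.0, x0=0.0):
    x = x0
    samples, accepted = [], 0
    for _ in range(n_iters):
        # Symmetric Gaussian proposal: P(x* | x) = Normal(x, proposal_scale^2).
        x_star = random.gauss(x, proposal_scale)
        # Symmetric proposal => acceptance ratio reduces to pi(x*) / pi(x).
        acceptance = min(1.0, target(x_star) / target(x))
        if random.random() < acceptance:
            x, accepted = x_star, accepted + 1   # accept: move to the proposed state
        samples.append(x)                        # rejected proposals repeat the current state
    return samples, accepted / n_iters

samples, acceptance_rate = metropolis_hastings()
print("acceptance rate:", round(acceptance_rate, 2))
print("sample mean:", round(sum(samples) / len(samples), 2))  # should be near 0
```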
Proposal Distribution in MH
Several proposal distributions are commonly used in the MH algorithm, including:
Gaussian (Normal) distribution:
This variant, commonly employed as a proposal distribution, is especially effective when the target distribution is unimodal and symmetric. Centred on the current state n with scale σ, it is denoted as
$$P(n^{*} \mid n) = \mathcal{N}\big(n^{*};\, n, \sigma^{2}\big)$$
Cauchy distribution:
Suitable for target distributions with heavy tails. Its density, with location $x_0$ and scale $\gamma$, is
$$f(x) = \frac{1}{\pi \gamma \left[ 1 + \left( \frac{x - x_0}{\gamma} \right)^{2} \right]}$$
Exponential distribution:
Suited for target distributions that are non-negative with a long right tail. Its density, with rate $\lambda$, is
$$f(x) = \lambda e^{-\lambda x}, \quad x \ge 0$$
Student’s t-distribution:
Applied when the target distribution has heavy tails and the Gaussian distribution is not suitable. Its density, with $\nu$ degrees of freedom, is
$$f(x) = \frac{\Gamma\!\left(\tfrac{\nu + 1}{2}\right)}{\sqrt{\nu \pi}\;\Gamma\!\left(\tfrac{\nu}{2}\right)} \left( 1 + \frac{x^{2}}{\nu} \right)^{-\tfrac{\nu + 1}{2}}$$
The selection among the proposal distributions above hinges on the properties of the target distribution (TD) and the desired acceptance rate.
Typically, the proposal distribution should enable the generation of a diverse set of candidate states, but not to an extent that results in an excessively low acceptance rate. A commonly employed strategy involves refining the proposal distribution during the MH algorithm by adjusting parameters such as variance or scale until the acceptance rate aligns with the desired range; a minimal sketch of such tuning is given below.
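As one illustration of this tuning strategy, the sketch below adapts the Gaussian proposal scale between short pilot runs until the empirical acceptance rate falls in a chosen window. It reuses the hypothetical `metropolis_hastings` function from the earlier sketch; the target window (0.2–0.5) and the scaling factors are arbitrary choices made for illustration.

```python
# Simple pilot-run tuning of the proposal scale (reuses metropolis_hastings from above).
scale = 1.0
for _ in range(20):                       # a few short pilot runs
    _, acc_rate = metropolis_hastings(n_iters=2_000, proposal_scale=scale)
    if acc_rate < 0.2:
        scale *= 0.8                      # too many rejections: propose smaller steps
    elif acc_rate > 0.5:
        scale *= 1.2                      # accepting almost everything: propose larger steps
    else:
        break                             # acceptance rate in the desired range
print("tuned proposal scale:", round(scale, 2), "acceptance rate:", round(acc_rate, 2))
```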
Advantages of MCMC
1. Handles Complex Distributions: MCMC can sample from high-dimensional, complex,
or non-standard probability distributions (e.g., posterior distributions in Bayesian
inference) where analytical solutions or direct sampling are impractical.
2. Flexibility: Applicable to a wide range of problems, from statistical inference to
physics, as it doesn’t require specific distributional assumptions.
3. Approximates Posterior Distributions: In Bayesian statistics, MCMC estimates
posterior distributions by generating samples, enabling uncertainty quantification in
model parameters.
4. Scalable to High Dimensions: Effective in high-dimensional spaces (e.g., thousands of
parameters), common in modern machine learning and scientific modeling.
5. Convergence to True Distribution: Given enough iterations, MCMC converges to the
target distribution, ensuring theoretically sound results. Provides reliable samples for
inference, unlike heuristic methods that may not guarantee convergence.
Disadvantages of MCMC
1. Computational Cost: MCMC can be computationally intensive, requiring many iterations
to converge, especially for complex or high-dimensional distributions.
2. Convergence Issues: Convergence is not guaranteed in finite time, and diagnosing
convergence (e.g., using Gelman-Rubin statistics) can be challenging.
Step 6: Decode the most likely sequence of hidden states: Given the observed data, the
Viterbi algorithm is used to compute the most likely sequence of hidden states. This can be
used to predict future observations, classify sequences, or detect patterns in sequential data.
Step 7: Evaluate the model: The performance of the HMM can be evaluated using various
metrics, such as accuracy, precision, recall, or F1 score.
Baum-Welch algorithm
The Baum-Welch algorithm is an expectation-maximization (EM) algorithm used to train Hidden Markov Models (HMMs) by estimating their parameters (initial state probabilities, transition probabilities, and emission probabilities) from unlabeled sequence data, even when the hidden states are unknown. Built on the forward-backward procedure, it iteratively refines these parameters to maximize the likelihood of the observed data; the quantities it iterates are summarized below.
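For reference, the Baum-Welch updates can be summarized as follows, using the forward variables $\alpha_t(i)$ and backward variables $\beta_t(i)$ computed by the forward-backward procedure (standard formulation; the notation is chosen here for illustration).

```latex
% E-step: state and transition posteriors from the forward/backward variables
\gamma_t(i) = \frac{\alpha_t(i)\,\beta_t(i)}{\sum_{j}\alpha_t(j)\,\beta_t(j)},
\qquad
\xi_t(i,j) = \frac{\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j)}
                  {\sum_{k}\sum_{l} \alpha_t(k)\, a_{kl}\, b_l(o_{t+1})\, \beta_{t+1}(l)}

% M-step: re-estimated initial, transition, and emission probabilities
\hat{\pi}_i = \gamma_1(i),
\qquad
\hat{a}_{ij} = \frac{\sum_{t=1}^{T-1}\xi_t(i,j)}{\sum_{t=1}^{T-1}\gamma_t(i)},
\qquad
\hat{b}_j(k) = \frac{\sum_{t\,:\,o_t = k}\gamma_t(j)}{\sum_{t=1}^{T}\gamma_t(j)}
```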
Viterbi algorithm:
The Viterbi Algorithm is a dynamic programming algorithm used to find the most likely
sequence of hidden states—called the Viterbi path—in a Hidden Markov Model (HMM), given a
sequence of observed events. It’s widely applied in areas like speech recognition, natural
language processing (e.g., part-of-speech tagging), and bioinformatics (e.g., gene prediction).
Purpose of the Viterbi Algorithm
HMMs model systems where observations (e.g., words, sounds) are generated by underlying
hidden states (e.g., grammatical tags, phonemes) that follow a Markov process. The Viterbi
Algorithm solves the decoding problem in HMMs: it identifies the most probable sequence of
hidden states that could have produced a given sequence of observations, maximizing the joint
probability of the state sequence and observations.
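Formally, writing $o_{1:T}$ for the observation sequence, $q_{1:T}$ for a candidate hidden-state sequence, and $\lambda$ for the HMM parameters (initial, transition, and emission probabilities), the decoding problem solved by the Viterbi Algorithm can be stated as (standard notation, shown here for reference):

```latex
q_{1:T}^{*} \;=\; \arg\max_{q_{1:T}} \; P\big(q_{1:T},\, o_{1:T} \mid \lambda\big)
```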
HMM Viterbi Algorithm:
How the Viterbi Algorithm Works: The algorithm uses dynamic programming to efficiently
compute the most likely state sequence without evaluating all possible sequences (which
would be exponential). It builds a trellis (a grid of probabilities over time and states) and
finds the optimal path through it. Here is the step-by-step process:
Initialization:
For each state $i$, initialize $\delta_{i,0}$ as the probability of starting in state $i$ and observing the first observation $o(0)$:
$$\delta_{i,0} = \pi_i \, b_i\big(o(0)\big)$$
where $\pi_i$ is the initial state probability, and $b_i(o(0))$ is the emission probability of observing $o(0)$ in state $i$. Set $\phi_{i,0} = 0$, which tracks the most likely previous state (not needed at $t = 0$).
Forward Recursion (for each time step $t$):
For each possible state $s$ at time $t$:
o Compute $\delta_{s,t}$, the highest probability of being in state $s$ at time $t$, considering all possible previous states $i$:
$$\delta_{s,t} = \max_i \big( \delta_{i,t-1} \, a_{i,s} \big) \, b_s\big(o(t)\big)$$
where $a_{i,s}$ is the transition probability from state $i$ to state $s$, and $b_s(o(t))$ is the emission probability of observing $o(t)$ in state $s$.
o Store the previous state $i$ that maximizes this probability:
$$\phi_{s,t} = \arg\max_i \big( \delta_{i,t-1} \, a_{i,s} \big)$$
Termination: At the final time step $T$, find the most likely end state $q_T^{*}$:
$$q_T^{*} = \arg\max_i \, \delta_{i,T}$$
Backtracking: Trace back the most likely state sequence using the stored $\phi$ values:
$$q_{t-1}^{*} = \phi_{q_t^{*},\, t}$$
Repeat this for $t = T$ down to $t = 1$, building the sequence $q_0^{*}, q_1^{*}, \dots, q_T^{*}$. A minimal implementation sketch of these steps follows.
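The steps above translate directly into a short dynamic-programming routine. The sketch below assumes a discrete HMM with states and observations indexed by integers; the toy two-state model used to exercise it is invented purely for illustration.

```python
import numpy as np

def viterbi(pi, A, B, obs):
    """Most likely hidden-state sequence for observation indices `obs`.

    pi: initial state probabilities, shape (N,)
    A:  transition probabilities A[i, j] = P(state j | state i), shape (N, N)
    B:  emission probabilities  B[i, k] = P(observation k | state i), shape (N, M)
    """
    N, T = len(pi), len(obs)
    delta = np.zeros((T, N))           # delta[t, s]: best path probability ending in s at t
    phi = np.zeros((T, N), dtype=int)  # phi[t, s]: best previous state for s at t

    delta[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                             # forward recursion
        for s in range(N):
            scores = delta[t - 1] * A[:, s]
            phi[t, s] = np.argmax(scores)
            delta[t, s] = scores[phi[t, s]] * B[s, obs[t]]

    path = [int(np.argmax(delta[T - 1]))]             # termination
    for t in range(T - 1, 0, -1):                     # backtracking
        path.append(int(phi[t, path[-1]]))
    return path[::-1]

# Toy model (hypothetical numbers): 2 hidden states, 3 observation symbols.
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(viterbi(pi, A, B, obs=[0, 1, 2]))   # e.g. [0, 0, 1]
```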
Advantages of Hidden Markov Models (HMMs)
1. Effective for Sequential and Time-Series Data: Captures temporal dependencies and
transitions between states, which are common in real-world sequential data.
2. Probabilistic Framework: HMMs use a probabilistic approach to model uncertainty,
with well-defined algorithms (e.g., Forward-Backward, Viterbi, Baum-Welch) for
inference, decoding, and learning.
Provides a robust way to handle noisy or incomplete data, offering probabilities for
predictions rather than deterministic outputs.
3. Flexibility Across Domains: Applicable to diverse fields like speech processing (e.g.,
modeling phonemes), bioinformatics (e.g., gene prediction), and finance (e.g., stock price
modeling).
The general framework can be adapted to various sequential tasks with appropriate
state and observation definitions.
4. Efficient Algorithms: Algorithms like Viterbi (for finding the most likely state
sequence) and Baum-Welch (for parameter estimation) are computationally efficient for
moderate-sized problems.
Enables practical implementation even with large sequences, balancing accuracy and
speed.
5. Handles Variable-Length Sequences: HMMs can process sequences of varying lengths
without requiring fixed-size inputs, unlike some neural network models.
Suitable for real-world data where sequence lengths differ (e.g., varying sentence
lengths in NLP).
Disadvantages of Hidden Markov Models (HMMs)
1. Assumption of Markov Property: HMMs assume that the current state depends only
on the previous state (first-order Markov property), which may not capture long-term
dependencies in complex sequences.
In tasks like language modeling, where context spans multiple steps, HMMs may
underperform compared to models like LSTMs or Transformers.
2. Scalability Issues: Training HMMs on large datasets or with many hidden states can be
computationally expensive, as the Baum-Welch algorithm scales poorly with state space
size.
Less suitable for very large or high-dimensional datasets compared to deep learning
models.
3. Local Optima in Training: The Expectation-Maximization (EM) approach used in
Baum-Welch can converge to local optima, leading to suboptimal model parameters.
May result in lower accuracy if initialization is poor or the model is complex.
4. Limited Expressiveness: HMMs struggle with highly non-linear or complex patterns in
data, as they rely on simple probabilistic transitions and emission distributions (often
Gaussian or discrete).
Deep learning models (e.g., RNNs, Transformers) often outperform HMMs in tasks
requiring complex feature interactions, like advanced NLP or image sequence analysis.
5. Requires Careful Design: Defining the number of hidden states and appropriate
emission distributions requires domain knowledge and experimentation.
Mis-specified models (e.g., too few or too many states) can lead to poor performance,
and tuning is not always straightforward.
Applications of the HMM:
Natural Language Processing (NLP): HMMs are employed across diverse natural language
processing tasks, including named entity recognition, part-of-speech tagging, and machine
translation. Their ability to capture sequential dependencies in language contributes to
the improved accuracy of these applications.
Speech Recognition (SP): HMMs play a crucial role in contemporary speech recognition
systems. They adeptly capture the intricate relationship between phonemes and acoustic
features, facilitating precise speech recognition and transcription.
Financial Time Series Analysis (FTSA): HMMs find utility in modeling and forecasting time series data, such as exchange rates or stock market prices. Through the capture of hidden states and transitions, HMMs provide valuable insights into market trends, facilitating well-informed investment decisions.