Machine Learning
About The Knowledge Academy
• World Class Training Solutions
• Subject Matter Experts
• Highest Quality Training Material
• Accelerated Learning Techniques
• Project, Programme, and Change Management, ITIL® Consultancy
• Bespoke Tailor Made Training Solutions
• PRINCE2®, MSP®, ITIL®, Soft Skills, and More
Administration
• Trainer
• Fire Procedures
• Facilities
• Days/Times
• Breaks
• Special Needs
• Delegate ID check
• Phones and Mobile devices
Outline
• Module 1: Machine Learning - Introduction
• Module 2: Importance of Machine Learning and its Techniques
• Module 3: Data Preprocessing
• Module 4: Machine Learning Mathematics
Outline
• Module 5: Supervised Learning
• Module 6: Classification
• Module 7: Regression
• Module 8: Neural Networks
• Module 9: Unsupervised Learning
• Module 10: Clustering
• Module 11: Deep Learning - Introduction
Machine Learning - Introduction
Machine Learning - Introduction
• Machine Learning refers to the study of algorithms and statistical models that computer systems use to perform tasks effectively without explicit instructions, relying instead on patterns and inference
• A system can improve in two ways:
1) By acquiring new knowledge, facts, and skills
2) By adapting its behaviour, solving problems more accurately and more efficiently
Machine Learning - Introduction
• Three main elements comprise Machine Learning:
1) Base knowledge, in which the system knows the correct answers, enabling it to learn
2) The computational algorithm, which is at the core of making determinations
3) The variables and features used to make decisions
Machine Learning - Introduction
• Machine Learning is the main subarea of artificial intelligence
• Machine Learning allows computers or machines to adjust and customise themselves automatically instead of being explicitly programmed to carry out specific tasks
• These programs or algorithms are specifically designed to improve their performance P at some task T with experience E:
T: recognising hand-written words
P: the percentage of words correctly classified
E: a database of human-labelled images of handwritten words
Machine Learning - Introduction
Difference Between Traditional Programming and Machine Learning
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
Machine Learning - Introduction
Real-Life Examples of Machine Learning
• The following are real-life examples of Machine Learning:
o While shopping on the internet, users are presented with advertisements related to their purchases
o When a person checks a product online, the site recommends similar products
o When using an app to book a cab ride, the app provides an estimate of the price of that ride. How do these services minimise detours? The answer is machine learning
Machine Learning - Introduction
Some Other Real-Life Examples of Machine Learning
• Virtual Personal Assistants
o Siri and Alexa are two of the most popular examples of virtual personal assistants
o Virtual assistants are integrated into a variety of platforms. For example:
• Smartphones: Samsung Bixby on the Samsung S8
• Smart speakers: Amazon Echo and Google Home
• Mobile apps: Google Allo
Machine Learning - Introduction
Social Media Services
o Social media platforms use machine learning for their own benefit as well as for the benefit of the user. Below are a few examples:
• Face recognition: upload a picture of you with a friend, and Facebook instantly recognises that friend
• Similar pins: Pinterest uses computer vision to recognise objects in images and recommend similar pins accordingly
Machine Learning - Introduction
Online Fraud Detection
o Machine learning is proving its potential to make cyberspace secure; tracking monetary fraud online is one example
o For instance, PayPal uses ML for protection against money laundering
Online Customer Support
o Most websites offer the option to chat with customer support. In most cases, you talk to a chatbot rather than a live agent
o These bots tend to extract information from the website and present it to customers
Importance of Machine Learning and its Techniques
Importance of Machine Learning
• Machine Learning is used to complete complex tasks that are difficult for humans to complete, such as complex coding
• We provide a machine learning algorithm with a massive amount of data
o It explores the data and searches for a model that achieves what the programmers have set out to achieve
Importance of Machine Learning
• Machine learning has become a key technique for problem solving in a
variety of fields:
• Computational biology: drug discovery, tumour detection, DNA sequencing
• Computational finance: credit scoring, algorithmic trading
• Image processing and computer vision: motion detection, object detection
• Energy production: price and load forecasting
• Automotive, aerospace, and manufacturing: predictive maintenance
• Natural language processing: voice recognition applications
Types of Machine Learning
Machine Learning is of three types:
• Supervised Learning - task driven (predict the next value)
o Classification (categorical output): Support Vector Machines, Discriminant Analysis, Naïve Bayes, Nearest Neighbour
o Regression (continuous output): Linear Regression and GLM, SVR and GPR, Ensemble Methods, Decision Trees, Neural Networks
• Unsupervised Learning - data driven (identify structure in the data)
o Clustering: K-Means and K-Medoids, Fuzzy C-Means, Hierarchical, Gaussian Mixture, Neural Networks, Hidden Markov Model
• Reinforcement Learning - learn from mistakes
How Does Machine Learning Work?
• Machine Learning uses both supervised and unsupervised learning. Supervised learning trains a model on known input and output data so that it can predict future outputs. Unsupervised learning identifies hidden patterns or intrinsic structures in input data
• Unsupervised Learning: group and interpret data based only on input data (Clustering)
• Supervised Learning: develop a predictive model based on both input and output data (Classification and Regression)
How Does Machine Learning Work?
Training the Machine Learning Algorithm
1) START: the ML algorithm is trained on the training data set (model + input data)
2) If the accuracy is not acceptable, the ML algorithm is trained again
3) If the accuracy is acceptable, the machine learning algorithm is deployed
4) New input data is introduced to the deployed algorithm to make a prediction
This loop is sketched in code below.
Machine Learning Mathematics
Machine Learning Mathematics
• Machine Learning theory is a field that draws on probability, statistics, computer science, and algorithms to learn iteratively from data and to identify hidden patterns that can later be used to build intelligent applications
Why is mathematics significant for machine learning?
o Selecting the right algorithm
o Identifying underfitting and overfitting
o Choosing parameter settings and validation strategies
o Estimating the right confidence interval and uncertainty
Machine Learning Mathematics
Importance of Maths Topics Required For Machine Learning
Data Preprocessing
Data Preprocessing
• Data Preprocessing is a technique used to transform raw data into an understandable format
• Real-world data gathered from various sources arrives in a raw format that is likely to contain many errors and is not feasible for analysis
• Data Preprocessing includes the following:
o Data Cleaning: removing outliers and noisy data, resolving any inconsistencies, and filling in missing values
Data Preprocessing
• Data Preprocessing helps to resolve these issues. It also includes:
o Data Integration: using data cubes, multiple databases, or files
o Data Transformation: normalisation and aggregation
o Data Reduction: diminishing the volume of the data while producing the same or similar analytical results
o Data Discretisation: part of data reduction; replacing numerical attributes with nominal ones
Data Preprocessing
To handle missing values:
• Data Collection
o Here we use a dataset that contains information on sales professionals
o The dataset is in .csv format and is named Employee_Record
o Make sure you leave empty cells in the dataset, as we have done in our example:

Nationality  Age  Salary   Gender
Spain        28   40,000   Female
Poland       38   50,000   Female
Germany           70,000   Male
Poland       32   100,000  Male
Spain        19   13,000   Female
Germany      26   38,000   Male
Germany      33   64,000   Female
Spain        35            Male
Poland       24   46,000   Female
Germany      20   60,000   Male
Spain        31   44,000   Female
Poland       27   54,000   Male
Data Preprocessing
Importing the Libraries
• We use two main libraries, numpy and pandas, where:
o numpy includes mathematical tools, so we can use any type of mathematics
o pandas is used to import and manage datasets
o Use the following code to import the libraries (np and pd are aliases):

#importing the libraries
import numpy as np
import pandas as pd
Data Preprocessing
Importing the Dataset
o Now we import our dataset. To do so, run the following command:

emp = pd.read_csv("Employee_Record.csv")

o Once the dataset has been imported, it appears in the variable explorer environment
Data Preprocessing
Setting the Dataset into Dependent and Independent Variables
o The next step is to determine the dependent (y) and independent (x) variables
o From the data, we can conclude that the nationality, age, and salary variables are our independent variables, and the gender variable is the dependent variable
o We then determine the gender of the employees based on their salary, age, and nationality:

#setting the dependent and independent variables
x = emp.iloc[:, :-1].values
y = emp.iloc[:, -1].values
Data Preprocessing
Program 1: Importing the dataset and displaying "True" in place of every empty record (a sketch of the program follows)
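The program itself appears only as a screenshot in the original slides; a minimal sketch of what it likely contains, assuming the Employee_Record.csv file created earlier is in the working directory:

#importing the dataset and displaying True in place of every empty record
import pandas as pd

emp = pd.read_csv("Employee_Record.csv")
print(emp.isnull())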
Data Preprocessing
Output:
Data Preprocessing
Step 1: Import important packages and the data set
Data Preprocessing
Step 2: Let's take a look at the imported data set
Data Preprocessing
Step 3: Plot the distribution of all the continuous variables in our data set
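The plotting code is shown only as a screenshot; a hedged sketch of Step 3 using pandas and matplotlib (the numeric-column selection is an assumption):

#plotting the distribution of the continuous variables
import pandas as pd
import matplotlib.pyplot as plt

emp = pd.read_csv("Employee_Record.csv")
emp.select_dtypes(include="number").hist(bins=10)  # histogram per numeric column
plt.tight_layout()
plt.show()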
Data Preprocessing
(Continued)
Output:
Supervised Learning
Supervised Learning
• As the name indicates, supervised learning involves the presence of a supervisor acting as a trainer
• In supervised learning, we train the machine using labelled data
• Once the machine has understood the data, it is given a new dataset. The supervised learning algorithm analyses the training data (the examples) and produces a correct outcome from the labelled data
• The algorithm then continuously makes predictions based on the training data, which has been corrected by the supervisor
Supervised Learning
• For instance, let's assume there is a basket filled with different kinds of fruit. The first step is to train the machine on all the different fruits, one by one:
o If the object is round with a depression at the top and red in colour, it is labelled as an apple
o If the object is a bunch of black round ovals, it is labelled as grapes
Supervised Learning
• Now assume that, after training, we give the machine a new fruit from the basket and ask it to identify it. The fruit it must identify is an apple:
o Because the machine has previously learned the physical characteristics of fruit from the training data, it must now use that knowledge to recognise the apple
o First, the machine classifies the fruit by its colour and shape. Then it confirms the name of the fruit (the response variable) and puts the fruit in the apple category
• Consequently, the machine learns the information from the training data (the basket of fruits) and applies this knowledge to the test data (the new fruit). This is supervised learning
Supervised Learning
In Mathematical Terms:
• In supervised learning, you have an input variable (X) and an output variable (Y). An algorithm is used to learn the mapping function from the input to the output:
Y = f(X)
• The primary goal is to approximate the mapping function so precisely that, when you have new input data X, you can predict the output variable Y for that data
Supervised Learning
• Supervised learning algorithms fall into two categories:
o Classification: the primary goal of a classification algorithm is to categorise data into a desired, distinct number of classes and to assign a label to each class
o Regression: this type of algorithm is used to predict continuous output values
Classification
Classification
• In machine learning, classification is a crucial concept that gives the machine the knowledge needed to group data by specific criteria
• Classification is the process of predicting the class of data, where the classes are also known as targets, labels, or categories
• In the supervised version of classification, machines group data according to predetermined characteristics
• In the unsupervised version of classification, also known as clustering, computers identify shared characteristics and use them to group data when categories have not been specified
Classification
• Real-life examples of classification include your inbox filtering received emails into spam/junk and important email
• Another example of classification is categorising transaction data as fraudulent or authorised
• Classification predicts categorical class labels: it classifies data based on a training set and uses the knowledge it has acquired from the training set to classify new data
• It includes a number of models, such as logistic regression, decision trees, random forests, gradient-boosted trees, multilayer perceptrons, one-vs-rest, and Naive Bayes
Classification
For example:
• Choose the classification problem(s) from the following options:
a) Predicting apartment price based on area
b) Predicting the gender of a person by his/her handwriting style
c) Predicting the number of copies of a book that will be sold next month
d) Predicting whether the monsoon will be normal next year
• Solution: b) predicting the gender of a person, and d) predicting whether the monsoon will be normal next year
• The other two, a) and c), are examples of regression
Classification
• In classification, there are two types of learners: lazy learners and eager learners
1) Eager Learners
o These learners build a classification model from the given training data before receiving new data to classify
o Accuracy: an eager learner must commit to a single hypothesis that covers the entire instance space
o Because of the model construction, eager learners often take much longer to train but less time to predict
e.g. Naive Bayes, Decision Tree, Artificial Neural Networks
Classification
2) Lazy Learners
• Lazy learners store the training data and wait until they are given a test tuple
• Accuracy: this type of learner uses a richer hypothesis space that draws on many local linear functions to form its implicit global approximation to the target function
• Unlike eager learners, lazy learners take less time to train but more time to predict
e.g. Case-based reasoning, k-nearest neighbour
Support Vector Machines
• "Support Vector Machine" (SVM) is a supervised machine learning algorithm that can be used for both regression and classification challenges
• However, it is most commonly used to solve classification problems. In this algorithm, we plot every data item as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate
• We then perform classification by finding the hyperplane that best differentiates the two classes
How does SVM work?
• In the next few slides we discuss different scenarios, each of which involves segregating two classes with a hyperplane
Scenario 1: Identify the right hyperplane
• In this scenario, there are three hyperplanes: S, T, and V. Now, identify the right hyperplane
• In the given scenario, hyperplane T performs this job best
How does SVM work?
Scenario 2: Identify the right hyperplane
• In this scenario, we have three hyperplanes (S, T, and V), all of which segregate the classes well. Now, how can we identify the right hyperplane?
How does SVM work?
(Continued)
• To identify the right hyperplane, maximise the distance between the hyperplane and the nearest data point of either class. This will determine the right hyperplane
• This distance is known as the margin
How does SVM work?
Scenario 3: Identify the right hyperplane
• In this scenario, use the same rules as in the previous scenario to identify the right hyperplane
• According to those rules, hyperplane T would be the right hyperplane, as it has a higher margin than S
• However, SVM selects the hyperplane that classifies the classes accurately before maximising the margin
• Here, hyperplane S has classified everything correctly, while T has a classification error. So the right hyperplane is S
How does SVM work?
Scenario 4: Can we classify the two classes?
• In this scenario, we cannot segregate the two classes with a straight line, because one of the stars lies in the territory of the other class as an outlier
• The star at the other end is effectively an outlier for the star class
• SVM ignores outliers and finds the hyperplane that has the maximum margin
• Hence, SVM is robust to outliers
How does SVM work?
Scenario 5: Find the hyperplane that segregates the two classes
• In this scenario, we cannot find a linear hyperplane between the two classes
• SVM resolves this issue by introducing an additional feature
How does SVM work?
(Continued)
• Here, we add a new feature z = x² + y² and plot the values on the x and z axes
• When plotting the values, the following points need to be considered:
o Every value of z is positive, as z is the squared sum of x and y
o In the original plot, the red circles appear close to the origin of the x and y axes, which leads to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z
How does SVM work?
(Continued)
• The hyperplane in the original input space looks like a circle:
Discriminant Analysis
• Linear Discriminant Analysis (LDA) is a technique commonly used for dimensionality reduction. In machine learning it is applied as a data-preparation step prior to modelling, and it is also used in pattern-classification applications
• The primary purpose of the technique is to reduce dimensionality by eliminating features that are redundant or dependent, transforming the features from a higher-dimensional space to a lower-dimensional space
• Dimensionality reduction divides into supervised learning (LDA) and unsupervised learning (PCA)
• This category of dimensionality reduction is used in bioinformatics, chemistry, and biometrics
Discriminant Analysis
How does it work?
• Linear Discriminant Analysis's main goal is to project features from a higher-dimensional space onto a lower-dimensional space
• Discriminant analysis works in the following steps:
o Step 1: Calculate the distance between the means of the different classes, known as the between-class variance:
S_b = Σ_i N_i (x̄_i − x̄)(x̄_i − x̄)^T
Discriminant Analysis
Step 2: Calculate the distance between the mean and the samples of every class, known as the within-class variance:
S_w = Σ_i (N_i − 1) S_i = Σ_i Σ_j (x_{i,j} − x̄_i)(x_{i,j} − x̄_i)^T
Step 3: Construct the lower-dimensional space that minimises the within-class variance and maximises the between-class variance
o Let P be the projection onto the lower-dimensional space; the objective, known as Fisher's criterion, is:
P_lda = arg max_P |P^T S_b P| / |P^T S_w P|
Discriminant Analysis
[Figure: the best (LDA) projection axis vs a poor projection axis for separating the classes]
Discriminant Analysis
Extensions to Linear Discriminant Analysis (LDA)
• LDA is a simple and effective method for classification. It has various extensions and variations, including:
o Flexible Discriminant Analysis (FDA): uses non-linear combinations of the inputs, such as splines
o Quadratic Discriminant Analysis (QDA): each class uses its own estimate of the variance
o Regularised Discriminant Analysis (RDA): introduces regularisation into the estimate of the variance, moderating the influence of different variables on LDA
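A short sketch of LDA used both as a classifier and for dimensionality reduction, assuming scikit-learn (the slides describe only the mathematics):

#LDA for dimensionality reduction and classification
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_low = lda.fit_transform(X, y)   # project the 4-D features onto 2 dimensions
print(X_low.shape)                # (150, 2)
print(lda.score(X, y))            # accuracy when used directly as a classifier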
Naive Bayes
• The Naive Bayes classifier is a family of classification algorithms based on Bayes' theorem
• These algorithms share a common principle: every pair of features is treated as independent of every other
• Here we consider a fictional dataset that describes the weather conditions for playing a game of football
o Each tuple classifies the conditions as fit ("Yes") or unfit ("No") for playing football
Naive Bayes
Tabular Representation of our dataset:
Naive Bayes
• The dataset is divided into two parts: the feature matrix and the response vector
o The feature matrix contains all the rows (vectors) of the dataset, in which each vector consists of the values of the dependent features; 'Outlook', 'Temperature', 'Humidity', and 'Windy' are the features
o The response vector contains the value of the class variable (the prediction or output) for each row of the feature matrix. The class variable's name is 'Play football'
Assumption:
• The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome
Naive Bayes
• In terms of our dataset, this Naive Bayes concept can be understood as follows:
• First, we assume that no pair of features is dependent
o For instance, the temperature being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' has no effect on the wind. Hence, the features are assumed to be independent
• Secondly, each feature is given the same weight
o For instance, knowing only the humidity and temperature cannot predict the outcome correctly. All attributes are assumed to contribute equally to the outcome
Naive Bayes
Bayes' Theorem
• Bayes' theorem finds the probability of an event occurring given the probability of another event that has already occurred
• Bayes' theorem is represented by the following equation:
P(A|B) = P(B|A) P(A) / P(B)
Naive Bayes
• With regard to our dataset, we can apply Bayes' theorem as follows:
P(y|X) = P(X|y) P(y) / P(X)
• where y is the class variable and X is a dependent feature vector:
X = (x1, x2, x3, …, xn)
• An instance of a feature vector and its corresponding class variable might be:
X = (Rainy, Hot, High, False)
y = No
o Here, P(y|X) represents the probability of "not playing football" given that the weather conditions are "rainy outlook", "hot temperature", "high humidity", and "no wind"
Naive Bayes
Naive Assumption
• Now we add the naive assumption to Bayes' theorem: independence among the features
• First, split the evidence into its independent parts
• If any two events A and B are independent, then:
P(A, B) = P(A) P(B)
Naive Bayes
• Hence, we reach the result:
P(y|x1,…,xn) = P(x1|y) P(x2|y) … P(xn|y) P(y) / (P(x1) P(x2) … P(xn))
• which can be expressed as:
P(y|x1,…,xn) = P(y) Π_{i=1}^{n} P(xi|y) / (P(x1) P(x2) … P(xn))
• Removing the denominator, as it remains constant for a given input:
P(y|x1,…,xn) ∝ P(y) Π_{i=1}^{n} P(xi|y)
Naive Bayes
• Now we need to create a classifier model. First, find the probability of the given set of inputs for every possible value of the class variable y, and select the output with maximum probability. This can be expressed as:
y = argmax_y P(y) Π_{i=1}^{n} P(xi|y)
• Finally, we are left with the task of calculating P(y) and P(xi|y)
• P(y) is called the class probability, and P(xi|y) is called the conditional probability
Naive Bayes
• To apply the formula given on the previous slide manually to our weather dataset, we find P(xi|yj) for each xi in X and yj in y
• The calculations are shown in the tables below:

Table 1 (Outlook):      Yes  No  P(yes)  P(no)
Sunny                    2    1   2/6     1/4
Overcast                 3    0   3/6     0/4
Rainy                    1    3   1/6     3/4
Total                    6    4   100%    100%

Table 2 (Temperature):  Yes  No  P(yes)  P(no)
Hot                      2    2   2/7     2/4
Mild                     2    1   2/7     1/4
Cool                     3    1   3/7     1/4
Total                    7    4   100%    100%
Naive Bayes
Table 3 (Humidity):  Yes  No  P(yes)  P(no)
High                  3    3   3/7     3/4
Normal                4    1   4/7     1/4
Total                 7    4   100%    100%

Table 4 (Wind):      Yes  No  P(yes)  P(no)
False                 5    2   5/6     2/4
True                  1    2   1/6     2/4
Total                 6    4   100%    100%

• We have calculated P(xi|yj) for each xi in X and yj in y manually in tables 1 to 4

Table 5 (Play):  Count  P(yes)/P(no)
Yes               7      7/13
No                4      4/13
Total             13     100%
Naive Bayes
• For instance, the probability of playing football given that the temperature is cool is:
P(temp = Cool | play football = Yes) = 3/7
• We also need the class probabilities P(y), which are calculated in table 5. For instance:
P(play football = Yes) = 7/13
• Let's test this on a new set of features: today = (Sunny, Hot, Normal, False)
• The probability of playing football is given by:
P(Yes|today) = P(Sunny|Yes) P(Hot|Yes) P(Normal|Yes) P(False|Yes) P(Yes) / P(today)
Naive Bayes
• The probability of not playing football is given by:
P(No|today) = P(Sunny|No) P(Hot|No) P(Normal|No) P(False|No) P(No) / P(today)
• Since P(today) is common to both probabilities, we can ignore it and compute proportional probabilities:
P(Yes|today) ∝ (2/6) · (2/7) · (4/7) · (5/6) · (7/13) ≈ 0.0244
P(No|today) ∝ (1/4) · (2/4) · (1/4) · (2/4) · (4/13) ≈ 0.0048
• Since we require
P(Yes|today) + P(No|today) = 1
• these numbers can be converted into probabilities by normalising them so that their sum equals 1:
P(Yes|today) = 0.0244 / (0.0244 + 0.0048) = 0.84
Naive Bayes
P(No|today) = 0.0048 / (0.0244 + 0.0048) = 0.16
• Since
P(Yes|today) > P(No|today)
• the prediction is that football would be played: 'Yes'
Naive Bayes
Gaussian Naive Bayes Classifier
• In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution
• A Gaussian distribution is also known as a normal distribution
• When plotted, the Gaussian distribution gives a bell-shaped curve f(x) that is symmetric about the mean µ of the feature values
Naive Bayes
• The likelihood of the features is assumed to be Gaussian; hence, the conditional probability is given by:
P(xi|y) = (1 / √(2π σ_y²)) exp(−(xi − µ_y)² / (2σ_y²))
Naive Bayes
Example
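The worked example is shown as a screenshot in the slides; a minimal sketch in the same spirit, assuming scikit-learn and its bundled iris data:

#Gaussian Naive Bayes on a labelled dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = GaussianNB()              # feature likelihoods assumed Gaussian
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))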
Naive Bayes
Example
Output
Nearest Neighbour
• K-Nearest Neighbours (KNN) is one of the simplest fundamental machine learning algorithms; it is used to solve both classification and regression problems
• The algorithm is readily applicable in real-life scenarios because it is non-parametric: it makes no underlying assumptions about the distribution of the data, unlike algorithms that assume, for example, a Gaussian distribution of the given data
• We are given some prior data, known as training data, which classifies coordinates into groups identified by an attribute
Nearest Neighbour
• Consider the following data points given in the figure:
Nearest Neighbour
• The following figure shows another set of data points, known as testing data. Allocate each of these points to a group by analysing the training set
• The unclassified points are marked in white
Nearest Neighbour
Algorithm
• Let p be an unknown point and m the number of training data samples
1) Store the training samples in an array arr[] of data points, where each element is a tuple (x, y)
2) For i = 0 to m − 1: calculate the Euclidean distance d(arr[i], p)
3) Make a set S of the K smallest distances obtained
4) Return the majority label among S
Nearest Neighbour
Example:
Output
Nearest Neighbour
(Continued)
• To measure the accuracy of the model
Nearest Neighbour
(Continued)
• To test the model for each candidate k value (a sketch of all three steps follows)
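The example, the accuracy check, and the k-value sweep are shown as screenshots in the slides; a hedged sketch of all three steps, assuming scikit-learn and its bundled iris data:

#KNN: fit, measure accuracy, and test every candidate k value
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))   # accuracy of the model for this k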
Regression
Regression
• Regression problems are scenarios in which the output variable is a real or continuous value, such as "salary" or "weight"
• You can use a number of models; the simplest is linear regression
• Linear regression attempts to fit the data with the best hyperplane through the points, where:
o x is the independent variable (input)
o y is the dependent variable (output)
Regression
Types of Regression Models:
• Simple regression: linear or non-linear
• Multiple regression: linear or non-linear
Regression
Example
• Choose the regression task from the following options:
o Predicting the nationality of a person
o Predicting whether a document is related to sightings of UFOs
o Predicting the age of a person
o Predicting whether the stock price of a company will increase tomorrow
• Solution: predicting the age of a person, because age is a real value. Predicting nationality is categorical, whether the stock price will increase is discrete (a yes/no answer), and whether a document is related to UFOs is also discrete (a yes/no answer)
Linear Regression and GLM
GLM (Generalised Linear Model)
• A GLM represents the dependent variable as a linear combination of independent variables
• Simple linear regression is the traditional form of GLM. It works adequately when the dependent variable is normally distributed
• In real circumstances, the assumption of a normally distributed dependent variable is often violated
Linear Regression and GLM
Linear Regression
• Linear regression is a machine learning algorithm in which the predicted output is continuous
• Regression models a target prediction value as a function of the independent variables
• It is often used to find relationships between variables and for forecasting
• Regression models vary based on the type of relationship they consider between the independent and dependent variables, and on the number of independent variables used
Linear Regression and GLM
Linear Regression
• It performs the task of predicting a dependent variable value (y) from a given independent variable (x)
• This regression technique finds a linear relationship between x (the input) and y (the output), which is why it is called linear regression
• In the accompanying figure, X (input) is work experience and Y (output) is the salary of an employee
Linear Regression and GLM
• The hypothesis function for linear regression, in mathematical form, is:
y = θ1 + θ2·x
• When training the model we are given:
o x: input training data (univariate: one input variable or parameter)
o y: labels for the data (supervised learning)
• During training, the model fits the best line for predicting the value of y for a given value of x. The model obtains the best regression fit line by finding the best θ1 and θ2 values:
o θ1: the intercept
o θ2: the coefficient of x
Linear Regression and GLM
Cost Function (J):
• In finding the best-fit regression line, the model aims to predict y values such that the difference between the predicted values and the actual values is minimal
• It is therefore essential to update the θ1 and θ2 values to reach the values that minimise the error between the predicted y value (pred) and the actual y value (y):
minimise (1/n) Σ_{i=1}^{n} (pred_i − y_i)²
J = (1/n) Σ_{i=1}^{n} (pred_i − y_i)²
Linear Regression and GLM
• The cost function J of linear regression is therefore the mean squared error (MSE) between the predicted y value (pred) and the true y value (y)
Gradient Descent:
• Gradient descent is used by the model to update the θ1 and θ2 values so as to reduce the cost function (minimising the MSE) and achieve the best-fit line
• The idea is to start with random θ1 and θ2 values and then update them iteratively, reaching the minimum cost (a sketch follows)
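A from-scratch sketch of gradient descent for θ1 and θ2, minimising the cost J above (the toy data, learning rate, and iteration count are illustrative assumptions):

#gradient descent for simple linear regression
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # e.g. years of experience
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])  # e.g. salary in thousands

theta1, theta2, lr = 0.0, 0.0, 0.01           # start with arbitrary values
for _ in range(5000):
    pred = theta1 + theta2 * x                # current hypothesis
    error = pred - y
    theta1 -= lr * 2 * error.mean()           # dJ/dtheta1
    theta2 -= lr * 2 * (error * x).mean()     # dJ/dtheta2

print(theta1, theta2)                         # approaches the best-fit line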
SVR
• SVR stands for Support Vector Machine Regression
• SVR uses the same principles as the support vector machine for classification, with only a few minor differences
• Because the output is a real number, prediction becomes difficult, as there are infinitely many possible values
• In the case of regression, a margin of tolerance ε is set as an approximation to the margin of the SVM formulation of the problem
SVR
• However, the central idea is always the same: minimise the error while individualising the hyperplane that maximises the margin, keeping in mind that part of the error is tolerated
• For a linear function y = wx + b with tolerance ε:
Solution: min (1/2)||w||²
Constraints: y_i − w·x_i − b ≤ ε
             w·x_i + b − y_i ≤ ε
SVR
• With slack variables ξ_i, ξ_i* for points that fall outside the ε-tube (the standard soft-margin formulation):
Minimise: (1/2)||w||² + C Σ_i (ξ_i + ξ_i*)
Constraints: y_i − w·x_i − b ≤ ε + ξ_i
             w·x_i + b − y_i ≤ ε + ξ_i*
             ξ_i, ξ_i* ≥ 0
Linear SVR prediction:
y = Σ_i (a_i − a_i*) ⟨x_i, x⟩ + b
SVR
Non-linear SVR
• The kernel function is a technique used to transform the data into a higher-dimensional feature space in which linear separation becomes possible:
y = Σ_i (a_i − a_i*) ⟨φ(x_i), φ(x)⟩ + b
y = Σ_i (a_i − a_i*) K(x_i, x) + b
SVR
Kernel Functions
Polynomial:
K(x_i, x_j) = (x_i · x_j)^d
Gaussian radial basis function:
K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))
Decision Tree
• Decision trees build classification or regression models in the form of a tree structure
• They break a dataset down into smaller and smaller subsets while the associated decision tree is incrementally developed
• The final result is a tree with decision nodes and leaf nodes, where:
o A decision node has two or more branches, each representing a value of the attribute tested
o A leaf node represents a decision on the numerical target; the topmost decision node in the tree, which corresponds to the best predictor, is known as the root node
Decision Tree
Predictors: Outlook, Temp, Humidity, Windy. Target: Hours Played

Outlook   Temp  Humidity  Windy  Hours Played
Rainy     Hot   High      False  26
Rainy     Hot   High      True   30
Overcast  Hot   High      False  46
Sunny     Mild  High      False  45
Sunny     Cool  Normal    False  52
Sunny     Cool  Normal    True   23
Overcast  Cool  Normal    True   43
Rainy     Mild  High      False  35
Rainy     Cool  Normal    False  38
Sunny     Mild  Normal    False  46
Rainy     Mild  Normal    True   48
Overcast  Mild  High      True   52
Overcast  Hot   Normal    False  44
Sunny     Mild  High      True   30

The resulting tree:
• Outlook = Sunny → Windy: False → 47.7, True → 26.5
• Outlook = Overcast → 46.3
• Outlook = Rainy → Temp: Cool → 38, Hot → 27.5, Mild → 41.5
Decision Tree
Decision Tree Algorithm
• Decision trees can handle both categorical and numerical data
• ID3 is the primary algorithm used to build decision trees. It performs a top-down greedy search through the space of possible branches, with no backtracking
• Decision trees can manage categorical and numerical variables simultaneously as features
Decision Tree
Standard Deviation
• A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous subsets)
• The standard deviation is used to calculate the homogeneity of a numerical sample
• If the numerical sample is entirely homogeneous, its standard deviation is zero
Decision Tree
Standard Deviation
a) Standard deviation for one attribute, Hours Played (26, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30):
Count: n = 14
Average: x̄ = Σx / n = 39.8
Standard deviation: S = √(Σ(x − x̄)² / n) = 9.32
Coefficient of variation: CV = (S / x̄) · 100% = 23%
• The standard deviation (S) is used for branching
• The coefficient of variation (CV) helps to decide when to stop branching
• The average (Avg) is the value in the leaf nodes
Decision Tree
Standard Deviation
b) Standard deviation for two attributes (target and predictor):
S(T, X) = Σ_c P(c) · S(c)

Outlook   Hours Played (StDev)  Count
Overcast  3.49                  4
Rainy     7.78                  5
Sunny     10.87                 5
Total                           14

S(Hours, Outlook) = P(Overcast)·S(Overcast) + P(Rainy)·S(Rainy) + P(Sunny)·S(Sunny)
= (4/14)·3.49 + (5/14)·7.78 + (5/14)·10.87 = 7.66
Decision Tree
Standard Deviation Reduction
• The standard deviation reduction (SDR) is the decrease in standard deviation after a dataset is split on an attribute
• Building a decision tree is all about finding the attribute that returns the highest standard deviation reduction
Step 1: Calculate the standard deviation of the target:
Standard deviation (Hours Played) = 9.32
Decision Tree
Step 2: Calculate the standard deviation for each branch
• The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction:
SDR(T, X) = S(T) − S(T, X)
S(Hours, Outlook) = (4/14)·3.49 + (5/14)·7.78 + (5/14)·10.87 = 7.66
SDR(Hours, Outlook) = S(Hours) − S(Hours, Outlook) = 9.32 − 7.66 = 1.66
Decision Tree
• In the same way, calculate the SDR for the remaining attributes:
Temp (Hours Played StDev): Cool 10.51, Hot 8.95, Mild 7.65 → SDR = 0.17
Humidity (Hours Played StDev): High 9.36, Normal 8.37 → SDR = 0.28
Windy (Hours Played StDev): False 7.87, True 10.59 → SDR = 0.26
Decision Tree
Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node:
Outlook (Hours Played StDev): Overcast 3.49, Rainy 7.78, Sunny 10.87 → SDR = 1.66
Step 4 (a): The dataset is divided based on the values of the selected attribute
o This process runs recursively on the non-leaf branches until all the data has been processed
Decision Tree
Step 4 (b): The "Overcast" subset does not require any more splitting, because its CV (8%) is less than the threshold (10%). The associated leaf node gets the average of the "Overcast" subset:

Outlook   StDev  Avg   CV   Count
Overcast  3.49   46.3  8%   4
Rainy     7.78   35.2  22%  5
Sunny     10.87  39.2  28%  5

Tree so far: Outlook = Overcast → 46.3; the Sunny and Rainy branches still need splitting
Decision Tree
Step 4 (c): The "Sunny" branch has a CV (28%) greater than the threshold (10%), so it needs further splitting. We select "Windy" as the best node after "Outlook" because it has the largest SDR

The Sunny subset (S = 10.87, Avg = 39.2, CV = 28%):
Temp  Humidity  Windy  Hours Played
Mild  High      False  45
Cool  Normal    False  52
Cool  Normal    True   23
Mild  Normal    False  46
Mild  High      True   30

Temp: Cool 14.50 (count 2), Mild 7.32 (count 3)
SDR = 10.87 − ((2/5)·14.50 + (3/5)·7.32) = 0.678
Humidity: High 7.50 (count 2), Normal 12.50 (count 3)
SDR = 10.87 − ((2/5)·7.50 + (3/5)·12.50) = 0.370
Windy: False 3.09 (count 3), True 3.50 (count 2)
SDR = 10.87 − ((3/5)·3.09 + (2/5)·3.50) = 7.62
Decision Tree
• Because the number of data points in both branches (False and True) is equal to or less than 3, we stop further branching and assign the average of each branch to the related leaf node
Decision Tree
Step 4 (d): The "rainy" branch has a CV (22%), which is more than the threshold (10%).
This branch needs additional splitting. Here we are selecting "Windy" as the best node
because it has the largest SDR
Hours Played (StDev) Count
Cool 0 1
Temp Humidity Windy Hours Played Temp
Hot 2.5 2
Hot High False 25
Mild 6.5 2
Hot High True 30 SDR = 7.87 – ((1/5)*0 + (2/5)*2.5 + (2/5)*6.5) = 4.18
Mild High False 35
Hours Played (StDev) Count
Cool Normal False 38
High 4.1 3
Humidity
Mild Normal True 48
Normal 5.0 2
S = 7.78 SDR = 7.87 – ((3/5)*4.3 + (2/5)*5.0) = 3.32
Avg = 35.2 Hours Played (StDev) Count
CV = 22% False 5.6 3
Windy
True 9.0 2
SDR = 7.87 – ((3/5)*5.6 + (2/5)*9.0) = 0.8 2
© 2021 The Knowledge Academy Ltd 11
Decision Tree
• Now we stop further branching, as the number of data points in all three branches (Cool, Hot, and Mild) is equal to or less than 3. We assign the average of each branch to the related leaf node. The final tree (a code sketch follows):
• Outlook = Sunny → Windy: False → 47.7, True → 26.5
• Outlook = Overcast → 46.3
• Outlook = Rainy → Temp: Cool → 38, Hot → 27.5, Mild → 41.5
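A sketch of the same regression-tree idea with scikit-learn, which splits on variance reduction, the squared-error analogue of the SDR used above (the one-hot encoding and max_depth are our own choices):

#regression tree on a categorical toy dataset
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

data = pd.DataFrame({
    "Outlook": ["Rainy", "Overcast", "Sunny", "Sunny", "Rainy", "Overcast"],
    "Windy":   [False, False, False, True, True, True],
    "Hours":   [26, 46, 45, 23, 30, 43],
})
X = pd.get_dummies(data[["Outlook", "Windy"]])   # encode categorical features
tree = DecisionTreeRegressor(max_depth=2).fit(X, data["Hours"])
print(tree.predict(X))                           # leaf averages, as in the example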
Neural Networks
• Neural networks are a class of models within the broader machine learning literature
• Neural networks are a group of algorithms that have had a massive impact on machine learning
• Today's deep neural networks are inspired by biological neural networks and have proven to work quite well
• They are general function approximators, meaning they can be applied to almost any machine learning problem that involves learning a complex mapping from the input space to the output space
Neural Networks
• The following are some reasons to study neural computation:
o To understand how the brain actually works
o To understand a style of parallel computation inspired by neurons and their adaptive connections
o To solve practical problems using novel learning algorithms inspired by the brain
Neural Networks
Building Blocks of Neurons
• The basic unit of a neural network is the neuron, which takes inputs and produces an output
• In the figure, the inputs x1 and x2 feed a weighted sum, which produces the output y
Neural Networks
• The mathematical formulation involves the following steps:
o First, each input is multiplied by a weight:
x1 → x1 * w1
x2 → x2 * w2
o Next, all the weighted inputs are added together with a bias b:
(x1 * w1) + (x2 * w2) + b
o Finally, the sum is passed through an activation function:
y = f(x1 * w1 + x2 * w2 + b)
Neural Networks
• The activation function turns an unbounded input into an output with a predictable form. A commonly used activation function is the sigmoid function:
σ(x) = 1 / (1 + e^(−x))
• The sigmoid function only outputs numbers in the range (0, 1); a sketch of a neuron using it follows
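A from-scratch sketch of the neuron described above (weighted sum plus bias, then the sigmoid); the inputs and weights are illustrative:

#a single neuron with a sigmoid activation
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # outputs numbers in the range (0, 1)

def neuron(x, w, b):
    return sigmoid(np.dot(x, w) + b)   # y = f(x1*w1 + x2*w2 + b)

x = np.array([2.0, 3.0])               # inputs x1, x2
w = np.array([0.5, -1.0])              # weights w1, w2
print(neuron(x, w, b=1.0))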
Neural Networks
• The following are some of the different neural network architectures:
o Perceptrons
o Convolutional Neural Networks
o Recurrent Neural Networks
o Long / Short-Term Memory
o Gated Recurrent Unit
o Hopfield Network
o Boltzmann Machine
o Deep Belief Networks
o Autoencoders
o Generative Adversarial Network
Unsupervised Learning
Unsupervised Learning
• In unsupervised learning, the machine is trained using information that is neither labelled nor classified, and the algorithm is allowed to act on that information without guidance
• The machine's main task is to group unsorted information based on patterns, similarities, and differences, without any prior training on the data
• Because the machine is not provided with a teacher, it is left to find the hidden structure in the unlabelled data by itself
Unsupervised Learning
Difference Between Supervised and Unsupervised Learning
Unsupervised Learning
• For example, suppose there is an image containing both dogs and cats that the machine has never seen before
• The machine is not aware of the features of cats and dogs, so we cannot categorise the data for it
• But the machine can categorise the animals according to their patterns, similarities, and differences, i.e. it can easily divide the picture into two parts
Unsupervised Learning
• Unsupervised learning can be divided into two categories of algorithms:
o Clustering: a clustering problem is one where you want to find the inherent groupings in the data, such as grouping customers by purchasing behaviour
o Association: an association rule learning problem is one where you want to find rules that describe large portions of the data
Clustering
Clustering
• Clustering is the task of dividing data points into groups so that the data points in the same group are more similar to each other than to the data points in other groups
• Essentially, clustering is a grouping of objects based on the similarity and dissimilarity between them
• For instance, the data points in the graph that lie close together can be placed into a single group; in the example graph, we can identify three clusters
Clustering
• Clusters do not have to be spherical, as DBSCAN clustering on density-based data shows
• Such data points are clustered using the fundamental notion that each data point lies within a given constraint of the cluster centre
Clustering
Types of Clustering
• Broadly speaking, clustering can be divided into two subgroups:
o Hard clustering: each data point either belongs to a cluster completely or not at all
o Soft clustering: instead of putting each data point into a separate cluster, a probability or likelihood of the data point belonging to each cluster is assigned
Clustering
The following are some methods of clustering:
• Density-based methods
• Partitioning methods
• Hierarchical methods
• Grid-based methods
K-Means
• Suppose we are given a dataset of items, each with certain features and values for those features
• The task is to categorise the items into groups
• The k-means algorithm (an unsupervised learning algorithm) achieves this task
• The algorithm categorises the items into k groups by similarity
• To calculate that similarity, we use the Euclidean distance as the measurement
K-Means
The algorithm works as follows:
1) First, initialise k points, known as means, randomly
2) Second, assign each item to its closest mean and update that mean's coordinates, which are the averages of the items assigned to it so far
3) Repeat the process for a given number of iterations. At the end, we have our clusters (a sketch follows)

Algorithm in pseudocode:
Initialise k means with random values
For a given number of iterations:
    Iterate through items:
        Find the mean closest to the item
        Assign item to mean
        Update mean
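A compact from-scratch sketch of this pseudocode using Euclidean distance (the toy points and k are illustrative; a batch update per iteration stands in for the per-item update above):

#k-means from scratch
import numpy as np

def k_means(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    means = points[rng.choice(len(points), k, replace=False)]  # initialise k means
    for _ in range(iterations):
        # assign every item to its closest mean (Euclidean distance)
        dists = np.linalg.norm(points[:, None] - means[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each mean to the average of the items assigned to it
        means = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return means, labels

points = np.array([[1, 1], [1.5, 2], [0.5, 1.5], [8, 8], [8, 9], [9, 8.5]])
print(k_means(points, k=2))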
K-Medoids
• K-medoids is a clustering algorithm related to k-means and to the medoid-shift algorithm
• A medoid can be defined as the point in the cluster whose dissimilarity to all the other points in the cluster is minimal
• The dissimilarity of an object Pi and a medoid Ci is calculated using:
E = |Pi − Ci|
• The cost in the k-medoids algorithm is given by:
c = Σ_Ci Σ_{Pi∈Ci} |Pi − Ci|
K-Medoids
Algorithm:
1) Initialise: select k random points out of the n data points as the medoids
2) Associate each data point with its closest medoid, using any common distance metric
3) While the cost decreases:
4) For each medoid m and each data point o that is not a medoid:
a) Swap m and o, associate each data point with its closest medoid, and recompute the cost
b) If the total cost is more than that of the previous step, undo the swap
K-Medoids
• A medoid of a finite dataset is a data point from the set whose average dissimilarity to all the data points is minimal (the most centrally located point in the set)
• The Partitioning Around Medoids (PAM) algorithm is the most common realisation of k-medoid clustering. It works as outlined below:
1) Initialise: randomly select k of the n data points as the medoids
2) Assignment step: associate each data point with its closest medoid
3) Update step: for every medoid m and each data point o associated with m, swap m and o and compute the total cost of the configuration. Select the medoid o with the lowest configuration cost
Fuzzy
• The term fuzzy refers to things that are not clear, i.e. vague
• Sometimes we encounter a situation where we cannot decide whether a statement is true or false. In such situations, fuzzy logic provides flexibility for reasoning
• A fuzzy logic algorithm solves a problem after analysing all the available data, and then takes the best possible decision for the given input
• The fuzzy logic method imitates a human's decision-making ability by considering all the possibilities between the digital values true (T) and false (F)
Fuzzy
Fuzzy Logic Architecture
• It has four main parts (Fuzzifier, Rules, Intelligence, and Defuzzifier), connected as follows:
Crisp Input → Fuzzifier → Fuzzy Input Set → Intelligence (applies the Rules) → Fuzzy Output Set → Defuzzifier → Crisp Output
Hierarchical
• The hierarchical clustering technique is one of the most popular clustering techniques in machine learning
• It groups similar data points, and each group of related data points is known as a cluster (the original figure shows the unclustered data and the clustered data)
• This clustering technique is divided into two types:
o Agglomerative
o Divisive
Hierarchical
1) Agglomerative
• In the agglomerative technique, every data point is initially considered an individual cluster. At each iteration, similar clusters merge with other clusters until K clusters are formed
• The steps of the basic agglomerative algorithm are as follows:
o Compute the proximity matrix
o Let each data point be a cluster
o Repeat: merge the two closest clusters and update the proximity matrix
o Until only a single cluster remains
Hierarchical
2) Divisive Hierarchical Clustering Technique
• This clustering technique is the opposite of the agglomerative hierarchical clustering technique
• In divisive hierarchical clustering, we consider all the data points as a single cluster, and in each iteration we separate from the cluster the data points that are not similar
• Each separated data point is considered an individual cluster. In the end, we are left with n clusters
• Because a single cluster is divided into n clusters, the technique is named divisive hierarchical clustering
Gaussian Mixture
• Suppose there are K clusters, and we estimate µ and σ for each of them
o These could be estimated by the maximum-likelihood method if there were only one distribution
o But since there are K such clusters, the probability density is defined as a linear combination of the densities of all K distributions:
p(X) = Σ_{k=1}^{K} π_k G(X | µ_k, Σ_k)
o where π_k is the mixing coefficient for the k-th distribution
Gaussian Mixture
• To estimate the parameters by the maximum log-likelihood method, compute:
ln p(X | µ, Σ, π) = Σ_{i=1}^{N} ln p(X_i) = Σ_{i=1}^{N} ln Σ_{k=1}^{K} π_k G(X_i | µ_k, Σ_k)
• Now define a random variable γ_k(X) such that γ_k(X) = p(k|X)
• From Bayes' theorem:
γ_k(X) = p(X|k) p(k) / Σ_{k=1}^{K} p(k) p(X|k)
       = π_k p(X|k) / Σ_{k=1}^{K} π_k p(X|k)
Gaussian Mixture
• For the log-likelihood function to be at a maximum, its derivatives with respect to µ, Σ, and π must be zero. Setting the derivative with respect to µ to zero and rearranging the terms gives:
µ_k = Σ_{n=1}^{N} γ_k(x_n) x_n / Σ_{n=1}^{N} γ_k(x_n)
• Similarly, taking the derivatives with respect to σ and π, one obtains the following expressions:
Σ_k = Σ_{n=1}^{N} γ_k(x_n) (x_n − µ_k)(x_n − µ_k)^T / Σ_{n=1}^{N} γ_k(x_n)
π_k = (1/N) Σ_{n=1}^{N} γ_k(x_n)
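These update equations are exactly what the EM algorithm iterates; a brief sketch using scikit-learn's GaussianMixture (an assumption; the slides do not name a library):

#fitting a two-component Gaussian mixture with EM
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # samples around (0, 0)
               rng.normal(5, 1, (100, 2))])   # samples around (5, 5)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)              # the mixing coefficients pi_k
print(gmm.means_)                # the estimated means mu_k
print(gmm.predict_proba(X[:3]))  # gamma_k(x), the responsibilities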
Hidden Markov Model
• HMM stands for Hidden Markov Model
• It is based on augmenting the Markov chain
• A Markov chain is a model that tells us the probabilities of sequences of random variables (states), each of which can take on values from some set
• These sets can be words, tags, or symbols representing anything, such as the weather
• A Markov chain embodies a powerful assumption: if we want to predict the future of the sequence, all that matters is the current state
Hidden Markov Model
• To predict tomorrow's weather, you may examine today's weather, but you are not allowed to look at yesterday's weather
• Consider a sequence of state variables q1, q2, …, qi. A Markov model embodies the Markov assumption about the probabilities of this sequence: when predicting the future, the past does not matter, only the present
Markov assumption: P(qi = a | q1 … qi−1) = P(qi = a | qi−1)
Hidden Markov Model
• The following components specify a Markov chain:
o q = q1 q2 … qN: a set of N states
o A = a11 a12 … ann: a transition probability matrix, each aij representing the probability of moving from state i to state j, such that Σ_{j=1}^{n} aij = 1 for all i
o π = π1, π2, …, πN: an initial probability distribution over the states
• πi is the probability that the Markov chain will start in state i
• Some states j may have πj = 0, meaning they cannot be initial states. Also, Σ_{i=1}^{n} πi = 1 (a sketch of these components follows)
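A small sketch of these components for a two-state weather chain; the transition matrix and initial distribution are illustrative numbers:

#sampling from a simple Markov chain
import numpy as np

states = ["sunny", "rainy"]        # q: the set of N states
A = np.array([[0.8, 0.2],          # A: transition matrix, each row sums to 1
              [0.4, 0.6]])
pi = np.array([0.5, 0.5])          # initial probability distribution

rng = np.random.default_rng(0)
q = rng.choice(2, p=pi)            # start in a state drawn from pi
for _ in range(7):
    q = rng.choice(2, p=A[q])      # the next state depends only on the current one
    print(states[q])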
Hidden Markov Model
• A hidden Markov model includes both observed events and hidden events that act as causal factors in the probabilistic model
• The following components specify an HMM:
o q = q1 q2 … qN: a set of N states
o A = a11 … aij … aNN: a transition probability matrix, each aij representing the probability of moving from state i to state j, such that Σ_{j=1}^{N} aij = 1 for all i
o o = o1 o2 … oT: a sequence of T observations, each drawn from a vocabulary V = v1, v2, …, vV
Hidden Markov Model
• b = bi(ot): a sequence of observation likelihoods, also known as emission probabilities, each expressing the probability of an observation ot being generated from a state i
• π = π1, π2, …, πN: an initial probability distribution over the states
o πi is the probability that the Markov chain will start in state i
o Some states j may have πj = 0, meaning they cannot be initial states. Also, Σ_{i=1}^{n} πi = 1
Hidden Markov Model
• A first-order hidden Markov model instantiates two simplifying assumptions
o First, the probability of a particular state depends only on the previous state:
Markov assumption: P(qi | q1 … qi−1) = P(qi | qi−1)
o Second, the probability of an output observation oi depends only on the state qi that produced the observation, and not on any other states or observations:
Output independence: P(oi | q1 … qi, …, qT, o1, …, oi, …, oT) = P(oi | qi)
Deep Learning
Deep Learning
• Deep learning is a machine learning technique that trains machines to do what comes naturally to humans: learn by example
• It is a key technology behind driverless cars, allowing them to distinguish a pedestrian from a lamppost or to recognise a stop sign
• It powers voice control in consumer devices such as tablets, phones, TVs, and hands-free speakers
Deep Learning
• Deep learning has been getting attention lately because it is achieving results that were not possible before
• In deep learning, a computer model learns to perform classification tasks directly from text, images, or sound
• Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance
• The models are trained using large sets of labelled data and neural network architectures that contain many layers
Importance of Deep Learning
• As the name suggests, artificial intelligence aims to make a machine artificially intelligent, i.e. to make machines that act and think like humans
• The amount of useful data available and the increase in computational speed are the two factors that have made the whole world invest in this field
• If a robot is hard-coded, i.e. all of its logic has been manually coded into the system, it is not AI; simple robots do not, by themselves, constitute AI
• Machine learning means making a machine learn from experience and enhance its performance over time, as a human baby does
• The concept of machine learning became feasible only when an adequate amount of data was made available for training machines. It assists in dealing with complex systems
Importance of Deep Learning
(Continued)
• Deep learning is a subset of machine learning, but here the machine learns in the way humans are believed to learn
• The structure of a deep learning model resembles the human brain: a large number of nodes plays the role of the brain's neurons, which is why the result is called an artificial neural network
• When traditional machine learning algorithms are applied, we need to select input features manually from a complex dataset and then train on them, which is a tedious job for a machine learning scientist; with neural networks, we do not need to select useful input features manually
Importance of Deep Learning
(Continued)
• There are several types of neural networks for managing the complexity of the dataset and the algorithm
• Deep learning has allowed industry experts to overcome challenges that were impossible a decade ago, such as image and speech recognition and natural language processing
• Industries such as entertainment, journalism, manufacturing, the digital sector, healthcare, banking and finance, and automotive depend on it
• Recent successes of deep learning include voice assistants, mail services, self-driving cars, video recommendations, and intelligent chatbots
How Deep Learning Works
• Neural networks are composed of layers of nodes, much as the human brain is made of neurons. Nodes in one layer are connected to nodes in the adjacent layers
• In the human brain, a single neuron receives thousands of signals from other neurons. In an artificial neural network, signals travel between nodes and are assigned weights
• A more heavily weighted node exerts more influence on the next layer of nodes. The final layer combines the weighted inputs to produce an output
• Deep learning systems require powerful hardware, because they process huge amounts of data and perform many complex mathematical calculations
• Even with such advanced hardware, deep learning training can take weeks
How Deep Learning Works
(Continued)
• Deep learning systems need a large amount of data to return accurate results; accordingly, information is fed to them as huge datasets
• When processing the data, artificial neural networks are able to classify it using the answers to a series of true/false questions involving highly complex mathematical computations
• For instance, facial recognition programs work by learning to detect the edges and lines of faces, then the more significant parts of the faces, and finally complete representations of the faces
• As the program trains itself, the probability of correct answers increases over time
Congratulations
Congratulations on completing this course!
Keep in touch
info@theknowledgeacademy.com
Thank you