Machine Learning
About The Knowledge Academy
• World Class Training Solutions
• Subject Matter Experts
• Highest Quality Training Material
• Accelerated Learning Techniques
• Project, Programme, and Change Management, ITIL® Consultancy
• Bespoke Tailor Made Training Solutions
• PRINCE2®, MSP®, ITIL®, Soft Skills, and More
Administration
• Trainer
• Fire Procedures
• Facilities
• Days/Times
• Breaks
• Special Needs
• Delegate ID check
• Phones and Mobile devices
Outline
• Module 1: Machine Learning - Introduction
• Module 2: Importance of Machine Learning and its Techniques
• Module 3: Data Preprocessing
• Module 4: Machine Learning Mathematics
Outline
• Module 5: Supervised Learning
• Module 6: Classification
• Module 7: Regression
• Module 8: Neural Networks
• Module 9: Unsupervised Learning
• Module 10: Clustering
• Module 11: Deep Learning - Introduction
Machine Learning - Introduction
Machine Learning - Introduction
• Machine Learning refers to the study of algorithms and statistical models that computer systems use to perform tasks effectively without explicit instructions, relying instead on patterns and inference
• A system can improve in two ways:
1) By acquiring new knowledge, facts, and skills
2) By adapting its behaviour, solving problems more accurately and more efficiently
Machine Learning - Introduction
• Three main elements comprise Machine Learning:
1) Base knowledge, in which the system knows the correct answers, enabling it to learn
2) The computational algorithm, which is at the core of making determinations
3) The variables and features used to make decisions
Machine Learning - Introduction
• Machine Learning is the main subarea of artificial intelligence
• Machine Learning allows computers or machines to adjust and customise themselves automatically instead of being explicitly programmed to carry out specific tasks
• These programs or algorithms are specifically designed to improve their performance P at some task T with experience E:
T: recognising hand-written words
P: the percentage of words correctly classified
E: a database of human-labelled images of handwritten words
Machine Learning - Introduction
Difference Between Traditional Programming and Machine Learning
Traditional Programming: Data + Program → Computer → Output
Machine Learning: Data + Output → Computer → Program
Machine Learning - Introduction
Real-Life Examples of Machine Learning
• The following are real-life examples of Machine Learning:
o While shopping on the internet, users are presented with advertisements related to their purchases
o When a person checks a product online, the site recommends similar products
o When using an app to book a cab ride, the app provides an estimate of the price of that ride. How do these services minimise detours? The answer is machine learning
Machine Learning - Introduction
Some Other Real-Life Examples of Machine Learning
• Virtual Personal Assistants
o Siri and Alexa are two of the most popular examples of virtual personal assistants
o Virtual assistants are integrated into a variety of platforms. For example:
• Smartphones: Samsung Bixby on the Samsung S8
• Smart speakers: Amazon Echo and Google Home
• Mobile apps: Google Allo
Machine Learning - Introduction
Social Media Services
o Social media platforms use machine learning for their own benefit as well as for the benefit of the user. Below are a few examples:
• Face recognition: upload a picture of you with a friend, and Facebook instantly recognises that friend
• Similar pins: Pinterest uses computer vision to recognise objects in images and recommend similar pins accordingly
Machine Learning - Introduction
Online Fraud Detection
o Machine learning is proving its potential to make cyberspace secure; tracking monetary fraud online is one example
o For instance, PayPal uses ML for protection against money laundering
Online Customer Support
o Most websites offer the option to chat with customer support. In most cases, you talk to a chatbot rather than a live agent
o These bots tend to extract information from the website and present it to customers
Importance of Machine Learning and its Techniques
Importance of Machine Learning
• Machine Learning is used to complete complex tasks that are difficult for humans to complete, such as complex coding
• We provide a machine learning algorithm with a massive amount of data
o It explores the data and searches for a model that achieves what the programmers have set out to achieve
Importance of Machine Learning
• Machine learning has become a key technique for problem solving in a
variety of fields:
• Computational biology: drug discovery, tumour detection, DNA sequencing
• Computational finance: credit scoring, algorithmic trading
• Image processing and computer vision: motion detection, object detection
• Energy production: price and load forecasting
• Automotive, aerospace, and manufacturing: predictive maintenance
• Natural language processing: voice recognition applications
Types of Machine Learning
Machine Learning is of three types:
• Supervised Learning - task driven (predict the next value)
o Classification (categorical output): Support Vector Machines, Discriminant Analysis, Naïve Bayes, Nearest Neighbour
o Regression (continuous output): Linear Regression and GLM, SVR and GPR, Ensemble Methods, Decision Trees, Neural Networks
• Unsupervised Learning - data driven (identify structure in the data)
o Clustering: K-Means and K-Medoids, Fuzzy C-Means, Hierarchical, Gaussian Mixture, Neural Networks, Hidden Markov Model
• Reinforcement Learning - learn from mistakes
How Does Machine Learning Work?
• Machine Learning uses both supervised and unsupervised learning. Supervised learning trains a model on known input and output data so that it can predict future outputs. Unsupervised learning identifies hidden patterns or intrinsic structures in input data
• Unsupervised Learning: group and interpret data based only on input data (Clustering)
• Supervised Learning: develop a predictive model based on both input and output data (Classification and Regression)
How Does Machine Learning Work?
Training the Machine Learning Algorithm
1) START: the ML algorithm is trained on the training data set (model + input data)
2) If the accuracy is not acceptable, the ML algorithm is trained again
3) If the accuracy is acceptable, the machine learning algorithm is deployed
4) New input data is introduced to the deployed algorithm to make a prediction
This loop is sketched in code below.
Machine Learning Mathematics
Machine Learning Mathematics
• Machine Learning theory is a field that draws on probability, statistics, computer science, and algorithms to learn iteratively from data and to identify hidden patterns that can later be used to build intelligent applications
Why is mathematics significant for machine learning?
o Selecting the right algorithm
o Identifying underfitting and overfitting
o Choosing parameter settings and validation strategies
o Estimating the right confidence interval and uncertainty
Machine Learning Mathematics
Importance of Maths Topics Required For Machine Learning
Data Preprocessing
Data Preprocessing
• Data Preprocessing is a technique used to transform raw data into an understandable format
• Real-world data gathered from various sources arrives in a raw format that is likely to contain many errors and is not feasible for analysis
• Data Preprocessing includes the following:
o Data Cleaning: removing outliers and noisy data, resolving any inconsistencies, and filling in missing values
Data Preprocessing
• Data Preprocessing helps to resolve these issues. It also includes:
o Data Integration: using data cubes, multiple databases, or files
o Data Transformation: normalisation and aggregation
o Data Reduction: diminishing the volume of the data while producing the same or similar analytical results
o Data Discretisation: part of data reduction; replacing numerical attributes with nominal ones
Data Preprocessing
To handle missing values:
• Data Collection
o Here we use a dataset that contains information on sales professionals
o The dataset is in .csv format and is named Employee_Record
o Make sure you leave empty cells in the dataset, as we have done in our example:

Nationality  Age  Salary   Gender
Spain        28   40,000   Female
Poland       38   50,000   Female
Germany           70,000   Male
Poland       32   100,000  Male
Spain        19   13,000   Female
Germany      26   38,000   Male
Germany      33   64,000   Female
Spain        35            Male
Poland       24   46,000   Female
Germany      20   60,000   Male
Spain        31   44,000   Female
Poland       27   54,000   Male
Data Preprocessing
Importing the Libraries
• We use two main libraries, numpy and pandas, where:
o numpy includes mathematical tools, so we can use any type of mathematics
o pandas is used to import and manage datasets
o Use the following code to import the libraries (np and pd are aliases):

#importing the libraries
import numpy as np
import pandas as pd
Data Preprocessing
Importing the Dataset
o Now we import our dataset. To do so, run the following command:

emp = pd.read_csv("Employee_Record.csv")

o Once the dataset has been imported, it appears in the variable explorer environment
Data Preprocessing
Setting the Dataset into Dependent and Independent Variables
o The next step is to determine the dependent (y) and independent (x) variables
o From the data, we can conclude that the nationality, age, and salary variables are our independent variables, and the gender variable is the dependent variable
o We then determine the gender of the employees based on their salary, age, and nationality:

#setting the dependent and independent variables
x = emp.iloc[:, :-1].values
y = emp.iloc[:, -1].values
Data Preprocessing
Program 1: Importing the dataset and displaying "True" in place of every empty record (a sketch of the program follows)
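The program itself appears only as a screenshot in the original slides; a minimal sketch of what it likely contains, assuming the Employee_Record.csv file created earlier is in the working directory:

#importing the dataset and displaying True in place of every empty record
import pandas as pd

emp = pd.read_csv("Employee_Record.csv")
print(emp.isnull())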
Data Preprocessing
Output:
Data Preprocessing
Step 1: Import important packages and the data set
Data Preprocessing
Step 2: Let's take a look at the imported data set
Data Preprocessing
Step 3: Plot the distribution of all the continuous variables in our data set
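The plotting code is shown only as a screenshot; a hedged sketch of Step 3 using pandas and matplotlib (the numeric-column selection is an assumption):

#plotting the distribution of the continuous variables
import pandas as pd
import matplotlib.pyplot as plt

emp = pd.read_csv("Employee_Record.csv")
emp.select_dtypes(include="number").hist(bins=10)  # histogram per numeric column
plt.tight_layout()
plt.show()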
Data Preprocessing
(Continued)
Output:
Supervised Learning
Supervised Learning
• As the name indicates, supervised learning involves the presence of a supervisor acting as a trainer
• In supervised learning, we train the machine using labelled data
• Once the machine has understood the data, it is given a new dataset. The supervised learning algorithm analyses the training data (the examples) and produces a correct outcome from the labelled data
• The algorithm then continuously makes predictions based on the training data, which has been corrected by the supervisor
Supervised Learning
• For instance, let's assume there is a basket filled with different kinds of fruit. The first step is to train the machine on all the different fruits, one by one:
o If the object is round with a depression at the top and red in colour, it is labelled as an apple
o If the object is a bunch of black round ovals, it is labelled as grapes
Supervised Learning
• Now assume that, after training, we give the machine a new fruit from the basket and ask it to identify it. The fruit it must identify is an apple:
o Because the machine has previously learned the physical characteristics of fruit from the training data, it must now use that knowledge to recognise the apple
o First, the machine classifies the fruit by its colour and shape. Then it confirms the name of the fruit (the response variable) and puts the fruit in the apple category
• Consequently, the machine learns the information from the training data (the basket of fruits) and applies this knowledge to the test data (the new fruit). This is supervised learning
Supervised Learning
In Mathematical Terms:
• In supervised learning, you have an input variable (X) and an output variable (Y). An algorithm is used to learn the mapping function from the input to the output:
Y = f(X)
• The primary goal is to approximate the mapping function so precisely that, when you have new input data X, you can predict the output variable Y for that data
Supervised Learning
• Supervised learning algorithms fall into two categories:
o Classification: the primary goal of a classification algorithm is to categorise data into a desired, distinct number of classes and to assign a label to each class
o Regression: this type of algorithm is used to predict continuous output values
Classification
Classification
• In machine learning, classification is a crucial concept that gives the machine the knowledge needed to group data by specific criteria
• Classification is the process of predicting the class of data, where the classes are also known as targets, labels, or categories
• In the supervised version of classification, machines group data according to predetermined characteristics
• In the unsupervised version of classification, also known as clustering, computers identify shared characteristics and use them to group data when categories have not been specified
Classification
• Real-life examples of classification include your inbox filtering received emails into spam/junk and important email
• Another example of classification is categorising transaction data as fraudulent or authorised
• Classification predicts categorical class labels: it classifies data based on a training set and uses the knowledge it has acquired from the training set to classify new data
• It includes a number of models, such as logistic regression, decision trees, random forests, gradient-boosted trees, multilayer perceptrons, one-vs-rest, and Naive Bayes
Classification
For example:
• Choose the classification problem(s) from the following options:
a) Predicting apartment price based on area
b) Predicting the gender of a person by his/her handwriting style
c) Predicting the number of copies of a book that will be sold next month
d) Predicting whether the monsoon will be normal next year
• Solution: b) predicting the gender of a person, and d) predicting whether the monsoon will be normal next year
• The other two, a) and c), are examples of regression
Classification
• In classification, there are two types of learners: lazy learners and eager learners
1) Eager Learners
o These learners build a classification model from the given training data before receiving new data to classify
o Accuracy: an eager learner must commit to a single hypothesis that covers the entire instance space
o Because of the model construction, eager learners often take much longer to train but less time to predict
e.g. Naive Bayes, Decision Tree, Artificial Neural Networks
Classification
2) Lazy Learners
• Lazy learners store the training data and wait until they are given a test tuple
• Accuracy: this type of learner uses a richer hypothesis space that draws on many local linear functions to form its implicit global approximation to the target function
• Unlike eager learners, lazy learners take less time to train but more time to predict
e.g. Case-based reasoning, k-nearest neighbour
Support Vector Machines
• "Support Vector Machine" (SVM) is a supervised machine learning algorithm that can be used for both regression and classification challenges
• However, it is most commonly used to solve classification problems. In this algorithm, we plot every data item as a point in n-dimensional space, with the value of each feature being the value of a particular coordinate
• We then perform classification by finding the hyperplane that best differentiates the two classes
How does SVM work?
• In the next few slides we discuss different scenarios, each of which involves segregating two classes with a hyperplane
Scenario 1: Identify the right hyperplane
• In this scenario, there are three hyperplanes: S, T, and V. Now, identify the right hyperplane
• In the given scenario, hyperplane T performs this job best
How does SVM work?
Scenario 2: Identify the right hyperplane
• In this scenario, we have three hyperplanes (S, T, and V), all of which segregate the classes well. Now, how can we identify the right hyperplane?
How does SVM work?
(Continued)
• To identify the right hyperplane, maximise the distance between the hyperplane and the nearest data point of either class. This will determine the right hyperplane
• This distance is known as the margin
How does SVM work?
Scenario 3: Identify the right hyperplane
• In this scenario, use the same rules as in the previous scenario to identify the right hyperplane
• According to those rules, hyperplane T would be the right hyperplane, as it has a higher margin than S
• However, SVM selects the hyperplane that classifies the classes accurately before maximising the margin
• Here, hyperplane S has classified everything correctly, while T has a classification error. So the right hyperplane is S
How does SVM work?
Scenario 4: Can we classify the two classes?
• In this scenario, we cannot segregate the two classes with a straight line, because one of the stars lies in the territory of the other class as an outlier
• The star at the other end is effectively an outlier for the star class
• SVM ignores outliers and finds the hyperplane that has the maximum margin
• Hence, SVM is robust to outliers
How does SVM work?
Scenario 5: Find the hyperplane that segregates the two classes
• In this scenario, we cannot find a linear hyperplane between the two classes
• SVM resolves this issue by introducing an additional feature
How does SVM work?
(Continued)
• Here, we add a new feature z = x² + y² and plot the values on the x and z axes
• When plotting the values, the following points need to be considered:
o Every value of z is positive, as z is the squared sum of x and y
o In the original plot, the red circles appear close to the origin of the x and y axes, which leads to lower values of z, while the stars lie relatively far from the origin, resulting in higher values of z
How does SVM work?
(Continued)
• The hyperplane in the original input space looks like a circle:
Discriminant Analysis
• Linear Discriminant Analysis (LDA) is a technique commonly used for dimensionality reduction. In machine learning it is applied as a data-preparation step prior to modelling, and it is also used in pattern-classification applications
• The primary purpose of the technique is to reduce dimensionality by eliminating features that are redundant or dependent, transforming the features from a higher-dimensional space to a lower-dimensional space
• Dimensionality reduction divides into supervised learning (LDA) and unsupervised learning (PCA)
• This category of dimensionality reduction is used in bioinformatics, chemistry, and biometrics
Discriminant Analysis
How does it work?
• Linear Discriminant Analysis's main goal is to project features from a higher-dimensional space onto a lower-dimensional space
• Discriminant analysis works in the following steps:
o Step 1: Calculate the distance between the means of the different classes, known as the between-class variance:
S_b = Σ_i N_i (x̄_i − x̄)(x̄_i − x̄)^T
Discriminant Analysis
Step 2: Calculate the distance between the mean and the samples of every class, known as the within-class variance:
S_w = Σ_i (N_i − 1) S_i = Σ_i Σ_j (x_{i,j} − x̄_i)(x_{i,j} − x̄_i)^T
Step 3: Construct the lower-dimensional space that minimises the within-class variance and maximises the between-class variance
o Let P be the projection onto the lower-dimensional space; the objective, known as Fisher's criterion, is:
P_lda = arg max_P |P^T S_b P| / |P^T S_w P|
Discriminant Analysis
[Figure: the best (LDA) projection axis vs a poor projection axis for separating the classes]
Discriminant Analysis
Extensions to Linear Discriminant Analysis (LDA)
• LDA is a simple and effective method for classification. It has various extensions and variations, including:
o Flexible Discriminant Analysis (FDA): uses non-linear combinations of the inputs, such as splines
o Quadratic Discriminant Analysis (QDA): each class uses its own estimate of the variance
o Regularised Discriminant Analysis (RDA): introduces regularisation into the estimate of the variance, moderating the influence of different variables on LDA
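A short sketch of LDA used both as a classifier and for dimensionality reduction, assuming scikit-learn (the slides describe only the mathematics):

#LDA for dimensionality reduction and classification
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)
X_low = lda.fit_transform(X, y)   # project the 4-D features onto 2 dimensions
print(X_low.shape)                # (150, 2)
print(lda.score(X, y))            # accuracy when used directly as a classifier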
Naive Bayes
• The Naive Bayes classifier is a family of classification algorithms based on Bayes' theorem
• These algorithms share a common principle: every pair of features is treated as independent of every other
• Here we consider a fictional dataset that describes the weather conditions for playing a game of football
o Each tuple classifies the conditions as fit ("Yes") or unfit ("No") for playing football
Naive Bayes
Tabular Representation of our dataset:
Naive Bayes
• The dataset is divided into two parts: the feature matrix and the response vector
o The feature matrix contains all the rows (vectors) of the dataset, in which each vector consists of the values of the dependent features; 'Outlook', 'Temperature', 'Humidity', and 'Windy' are the features
o The response vector contains the value of the class variable (the prediction or output) for each row of the feature matrix. The class variable's name is 'Play football'
Assumption:
• The fundamental Naive Bayes assumption is that each feature makes an independent and equal contribution to the outcome
Naive Bayes
• In terms of our dataset, this Naive Bayes concept can be understood as follows:
• First, we assume that no pair of features is dependent
o For instance, the temperature being 'Hot' has nothing to do with the humidity, and the outlook being 'Rainy' has no effect on the wind. Hence, the features are assumed to be independent
• Secondly, each feature is given the same weight
o For instance, knowing only the humidity and temperature cannot predict the outcome correctly. All attributes are assumed to contribute equally to the outcome
Naive Bayes
Bayes' Theorem
• Bayes' theorem finds the probability of an event occurring given the probability of another event that has already occurred
• Bayes' theorem is represented by the following equation:
P(A|B) = P(B|A) P(A) / P(B)
Naive Bayes
• With regard to our dataset, we can apply Bayes' theorem as follows:
P(y|X) = P(X|y) P(y) / P(X)
• where y is the class variable and X is a dependent feature vector:
X = (x1, x2, x3, …, xn)
• An instance of a feature vector and its corresponding class variable might be:
X = (Rainy, Hot, High, False)
y = No
o Here, P(y|X) represents the probability of "not playing football" given that the weather conditions are "rainy outlook", "hot temperature", "high humidity", and "no wind"
Naive Bayes
Naive Assumption
• Now we add the naive assumption to Bayes' theorem: independence among the features
• First, split the evidence into its independent parts
• If any two events A and B are independent, then:
P(A, B) = P(A) P(B)
Naive Bayes
• Hence, we reach the result:
P(y|x1,…,xn) = P(x1|y) P(x2|y) … P(xn|y) P(y) / (P(x1) P(x2) … P(xn))
• which can be expressed as:
P(y|x1,…,xn) = P(y) Π_{i=1}^{n} P(xi|y) / (P(x1) P(x2) … P(xn))
• Removing the denominator, as it remains constant for a given input:
P(y|x1,…,xn) ∝ P(y) Π_{i=1}^{n} P(xi|y)
Naive Bayes
• Now we need to create a classifier model. First, find the probability of the given set of inputs for every possible value of the class variable y, and select the output with maximum probability. This can be expressed as:
y = argmax_y P(y) Π_{i=1}^{n} P(xi|y)
• Finally, we are left with the task of calculating P(y) and P(xi|y)
• P(y) is called the class probability, and P(xi|y) is called the conditional probability
Naive Bayes
• To apply the formula given on the previous slide manually to our weather dataset, we find P(xi|yj) for each xi in X and yj in y
• The calculations are shown in the tables below:

Table 1 (Outlook):      Yes  No  P(yes)  P(no)
Sunny                    2    1   2/6     1/4
Overcast                 3    0   3/6     0/4
Rainy                    1    3   1/6     3/4
Total                    6    4   100%    100%

Table 2 (Temperature):  Yes  No  P(yes)  P(no)
Hot                      2    2   2/7     2/4
Mild                     2    1   2/7     1/4
Cool                     3    1   3/7     1/4
Total                    7    4   100%    100%
Naive Bayes
Table 3 (Humidity):  Yes  No  P(yes)  P(no)
High                  3    3   3/7     3/4
Normal                4    1   4/7     1/4
Total                 7    4   100%    100%

Table 4 (Wind):      Yes  No  P(yes)  P(no)
False                 5    2   5/6     2/4
True                  1    2   1/6     2/4
Total                 6    4   100%    100%

• We have calculated P(xi|yj) for each xi in X and yj in y manually in tables 1 to 4

Table 5 (Play):  Count  P(yes)/P(no)
Yes               7      7/13
No                4      4/13
Total             13     100%
Naive Bayes
• For instance, the probability of playing football given that the temperature is cool is:
P(temp = Cool | play football = Yes) = 3/7
• We also need the class probabilities P(y), which are calculated in table 5. For instance:
P(play football = Yes) = 7/13
• Let's test this on a new set of features: today = (Sunny, Hot, Normal, False)
• The probability of playing football is given by:
P(Yes|today) = P(Sunny|Yes) P(Hot|Yes) P(Normal|Yes) P(False|Yes) P(Yes) / P(today)
Naive Bayes
• The probability of not playing football is given by:
P(No|today) = P(Sunny|No) P(Hot|No) P(Normal|No) P(False|No) P(No) / P(today)
• Since P(today) is common to both probabilities, we can ignore it and compute proportional probabilities:
P(Yes|today) ∝ (2/6) · (2/7) · (4/7) · (5/6) · (7/13) ≈ 0.0244
P(No|today) ∝ (1/4) · (2/4) · (1/4) · (2/4) · (4/13) ≈ 0.0048
• Since we require
P(Yes|today) + P(No|today) = 1
• these numbers can be converted into probabilities by normalising them so that their sum equals 1:
P(Yes|today) = 0.0244 / (0.0244 + 0.0048) = 0.84
Naive Bayes
P(No|today) = 0.0048 / (0.0244 + 0.0048) = 0.16
• Since
P(Yes|today) > P(No|today)
• the prediction is that football would be played: 'Yes'
Naive Bayes
Gaussian Naive Bayes Classifier
• In Gaussian Naive Bayes, the continuous values associated with each feature are assumed to be distributed according to a Gaussian distribution
• A Gaussian distribution is also known as a normal distribution
• When plotted, the Gaussian distribution gives a bell-shaped curve f(x) that is symmetric about the mean µ of the feature values
Naive Bayes
• The likelihood of the features is assumed to be Gaussian; hence, the conditional probability is given by:
P(xi|y) = (1 / √(2π σ_y²)) exp(−(xi − µ_y)² / (2σ_y²))
Naive Bayes
Example
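The worked example is shown as a screenshot in the slides; a minimal sketch in the same spirit, assuming scikit-learn and its bundled iris data:

#Gaussian Naive Bayes on a labelled dataset
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

model = GaussianNB()              # feature likelihoods assumed Gaussian
model.fit(X_train, y_train)
print(accuracy_score(y_test, model.predict(X_test)))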
Naive Bayes
Example
Output
Nearest Neighbour
• K-Nearest Neighbours (KNN) is one of the simplest fundamental machine learning algorithms; it is used to solve both classification and regression problems
• The algorithm is readily applicable in real-life scenarios because it is non-parametric: it makes no underlying assumptions about the distribution of the data, unlike algorithms that assume, for example, a Gaussian distribution of the given data
• We are given some prior data, known as training data, which classifies coordinates into groups identified by an attribute
Nearest Neighbour
• Consider the following data points given in the figure:
Nearest Neighbour
• The following figure shows another set of data points, known as testing data. Allocate each of these points to a group by analysing the training set
• The unclassified points are marked in white
Nearest Neighbour
Algorithm
• Let p be an unknown point and m the number of training data samples
1) Store the training samples in an array arr[] of data points, where each element is a tuple (x, y)
2) For i = 0 to m − 1: calculate the Euclidean distance d(arr[i], p)
3) Make a set S of the K smallest distances obtained
4) Return the majority label among S
Nearest Neighbour
Example:
Output
Nearest Neighbour
(Continued)
• To measure the accuracy of the model
Nearest Neighbour
(Continued)
• To test the model for each candidate k value (a sketch of all three steps follows)
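The example, the accuracy check, and the k-value sweep are shown as screenshots in the slides; a hedged sketch of all three steps, assuming scikit-learn and its bundled iris data:

#KNN: fit, measure accuracy, and test every candidate k value
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=4)

for k in range(1, 11):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train, y_train)
    print(k, knn.score(X_test, y_test))   # accuracy of the model for this k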
Regression
Regression
• Regression problems are scenarios in which the output variable is a real or continuous value, such as "salary" or "weight"
• You can use a number of models; the simplest is linear regression
• Linear regression attempts to fit the data with the best hyperplane through the points, where:
o x is the independent variable (input)
o y is the dependent variable (output)
Regression
Types of Regression Models:
• Simple regression: linear or non-linear
• Multiple regression: linear or non-linear
Regression
Example
• Choose the regression task from the following options:
o Predicting the nationality of a person
o Predicting whether a document is related to sightings of UFOs
o Predicting the age of a person
o Predicting whether the stock price of a company will increase tomorrow
• Solution: predicting the age of a person, because age is a real value. Predicting nationality is categorical, whether the stock price will increase is discrete (a yes/no answer), and whether a document is related to UFOs is also discrete (a yes/no answer)
Linear Regression and GLM
GLM (Generalised Linear Model)
• A GLM represents the dependent variable as a linear combination of independent variables
• Simple linear regression is the traditional form of GLM. It works adequately when the dependent variable is normally distributed
• In real circumstances, the assumption of a normally distributed dependent variable is often violated
Linear Regression and GLM
Linear Regression
• Linear regression is a machine learning algorithm in which the predicted output is continuous
• Regression models a target prediction value as a function of the independent variables
• It is often used to find relationships between variables and for forecasting
• Regression models vary based on the type of relationship they consider between the independent and dependent variables, and on the number of independent variables used
Linear Regression and GLM
Linear Regression
• It performs the task of predicting a dependent variable value (y) from a given independent variable (x)
• This regression technique finds a linear relationship between x (the input) and y (the output), which is why it is called linear regression
• In the accompanying figure, X (input) is work experience and Y (output) is the salary of an employee
Linear Regression and GLM
• The hypothesis function for linear regression, in mathematical form, is:
y = θ1 + θ2·x
• When training the model we are given:
o x: input training data (univariate: one input variable or parameter)
o y: labels for the data (supervised learning)
• During training, the model fits the best line for predicting the value of y for a given value of x. The model obtains the best regression fit line by finding the best θ1 and θ2 values:
o θ1: the intercept
o θ2: the coefficient of x
Linear Regression and GLM
Cost Function (J):
• In finding the best-fit regression line, the model aims to predict y values such that the difference between the predicted values and the actual values is minimal
• It is therefore essential to update the θ1 and θ2 values to reach the values that minimise the error between the predicted y value (pred) and the actual y value (y):
minimise (1/n) Σ_{i=1}^{n} (pred_i − y_i)²
J = (1/n) Σ_{i=1}^{n} (pred_i − y_i)²
Linear Regression and GLM
• The cost function J of linear regression is therefore the mean squared error (MSE) between the predicted y value (pred) and the true y value (y)
Gradient Descent:
• Gradient descent is used by the model to update the θ1 and θ2 values so as to reduce the cost function (minimising the MSE) and achieve the best-fit line
• The idea is to start with random θ1 and θ2 values and then update them iteratively, reaching the minimum cost (a sketch follows)
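A from-scratch sketch of gradient descent for θ1 and θ2, minimising the cost J above (the toy data, learning rate, and iteration count are illustrative assumptions):

#gradient descent for simple linear regression
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])       # e.g. years of experience
y = np.array([30.0, 35.0, 41.0, 44.0, 50.0])  # e.g. salary in thousands

theta1, theta2, lr = 0.0, 0.0, 0.01           # start with arbitrary values
for _ in range(5000):
    pred = theta1 + theta2 * x                # current hypothesis
    error = pred - y
    theta1 -= lr * 2 * error.mean()           # dJ/dtheta1
    theta2 -= lr * 2 * (error * x).mean()     # dJ/dtheta2

print(theta1, theta2)                         # approaches the best-fit line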
SVR
• SVR stands for Support Vector Machine Regression
• SVR uses the same principles as the support vector machine for classification, with only a few minor differences
• Because the output is a real number, prediction becomes difficult, as there are infinitely many possible values
• In the case of regression, a margin of tolerance ε is set as an approximation to the margin of the SVM formulation of the problem
SVR
• However, the central idea is always the same: minimise the error while individualising the hyperplane that maximises the margin, keeping in mind that part of the error is tolerated
• For a linear function y = wx + b with tolerance ε:
Solution: min (1/2)||w||²
Constraints: y_i − w·x_i − b ≤ ε
             w·x_i + b − y_i ≤ ε
SVR
• With slack variables ξ_i, ξ_i* for points that fall outside the ε-tube (the standard soft-margin formulation):
Minimise: (1/2)||w||² + C Σ_i (ξ_i + ξ_i*)
Constraints: y_i − w·x_i − b ≤ ε + ξ_i
             w·x_i + b − y_i ≤ ε + ξ_i*
             ξ_i, ξ_i* ≥ 0
Linear SVR prediction:
y = Σ_i (a_i − a_i*) ⟨x_i, x⟩ + b
SVR
Non-linear SVR
• The kernel function is a technique used to transform the data into a higher-dimensional feature space in which linear separation becomes possible:
y = Σ_i (a_i − a_i*) ⟨φ(x_i), φ(x)⟩ + b
y = Σ_i (a_i − a_i*) K(x_i, x) + b
SVR
Kernel Functions
Polynomial:
K(x_i, x_j) = (x_i · x_j)^d
Gaussian radial basis function:
K(x_i, x_j) = exp(−||x_i − x_j||² / (2σ²))
Decision Tree
• Decision trees build classification or regression models in the form of a tree structure
• They break a dataset down into smaller and smaller subsets while the associated decision tree is incrementally developed
• The final result is a tree with decision nodes and leaf nodes, where:
o A decision node has two or more branches, each representing a value of the attribute tested
o A leaf node represents a decision on the numerical target; the topmost decision node in the tree, which corresponds to the best predictor, is known as the root node
Decision Tree
Predictors: Outlook, Temp, Humidity, Windy. Target: Hours Played

Outlook   Temp  Humidity  Windy  Hours Played
Rainy     Hot   High      False  26
Rainy     Hot   High      True   30
Overcast  Hot   High      False  46
Sunny     Mild  High      False  45
Sunny     Cool  Normal    False  52
Sunny     Cool  Normal    True   23
Overcast  Cool  Normal    True   43
Rainy     Mild  High      False  35
Rainy     Cool  Normal    False  38
Sunny     Mild  Normal    False  46
Rainy     Mild  Normal    True   48
Overcast  Mild  High      True   52
Overcast  Hot   Normal    False  44
Sunny     Mild  High      True   30

The resulting tree:
• Outlook = Sunny → Windy: False → 47.7, True → 26.5
• Outlook = Overcast → 46.3
• Outlook = Rainy → Temp: Cool → 38, Hot → 27.5, Mild → 41.5
Decision Tree
Decision Tree Algorithm
• Decision trees can handle both categorical and numerical data
• ID3 is the primary algorithm used to build decision trees. It performs a top-down greedy search through the space of possible branches, with no backtracking
• Decision trees can manage categorical and numerical variables simultaneously as features
Decision Tree
Standard Deviation
• A decision tree is built top-down from a root node and involves partitioning the data into subsets that contain instances with similar values (homogeneous subsets)
• The standard deviation is used to calculate the homogeneity of a numerical sample
• If the numerical sample is entirely homogeneous, its standard deviation is zero
Decision Tree
Standard Deviation
a) Standard deviation for one attribute, Hours Played (26, 30, 46, 45, 52, 23, 43, 35, 38, 46, 48, 52, 44, 30):
Count: n = 14
Average: x̄ = Σx / n = 39.8
Standard deviation: S = √(Σ(x − x̄)² / n) = 9.32
Coefficient of variation: CV = (S / x̄) · 100% = 23%
• The standard deviation (S) is used for branching
• The coefficient of variation (CV) helps to decide when to stop branching
• The average (Avg) is the value in the leaf nodes
Decision Tree
Standard Deviation
b) Standard deviation for two attributes (target and predictor):
S(T, X) = Σ_c P(c) · S(c)

Outlook   Hours Played (StDev)  Count
Overcast  3.49                  4
Rainy     7.78                  5
Sunny     10.87                 5
Total                           14

S(Hours, Outlook) = P(Overcast)·S(Overcast) + P(Rainy)·S(Rainy) + P(Sunny)·S(Sunny)
= (4/14)·3.49 + (5/14)·7.78 + (5/14)·10.87 = 7.66
Decision Tree
Standard Deviation Reduction
• The standard deviation reduction (SDR) is the decrease in standard deviation after a dataset is split on an attribute
• Building a decision tree is all about finding the attribute that returns the highest standard deviation reduction
Step 1: Calculate the standard deviation of the target:
Standard deviation (Hours Played) = 9.32
Decision Tree
Step 2: Calculate the standard deviation for each branch
• The resulting standard deviation is subtracted from the standard deviation before the split. The result is the standard deviation reduction:
SDR(T, X) = S(T) − S(T, X)
S(Hours, Outlook) = (4/14)·3.49 + (5/14)·7.78 + (5/14)·10.87 = 7.66
SDR(Hours, Outlook) = S(Hours) − S(Hours, Outlook) = 9.32 − 7.66 = 1.66
Decision Tree
• In the same way, calculate the SDR for the remaining attributes:
Temp (Hours Played StDev): Cool 10.51, Hot 8.95, Mild 7.65 → SDR = 0.17
Humidity (Hours Played StDev): High 9.36, Normal 8.37 → SDR = 0.28
Windy (Hours Played StDev): False 7.87, True 10.59 → SDR = 0.26
Decision Tree
Step 3: The attribute with the largest standard deviation reduction is chosen for the decision node:
Outlook (Hours Played StDev): Overcast 3.49, Rainy 7.78, Sunny 10.87 → SDR = 1.66
Step 4 (a): The dataset is divided based on the values of the selected attribute
o This process runs recursively on the non-leaf branches until all the data has been processed
Decision Tree
Step 4 (b): The "Overcast" subset does not require any more splitting, because its CV (8%) is less than the threshold (10%). The associated leaf node gets the average of the "Overcast" subset:

Outlook   StDev  Avg   CV   Count
Overcast  3.49   46.3  8%   4
Rainy     7.78   35.2  22%  5
Sunny     10.87  39.2  28%  5

Tree so far: Outlook = Overcast → 46.3; the Sunny and Rainy branches still need splitting
Decision Tree
Step 4 (c): The "Sunny" branch has a CV (28%) greater than the threshold (10%), so it needs further splitting. We select "Windy" as the best node after "Outlook" because it has the largest SDR

The Sunny subset (S = 10.87, Avg = 39.2, CV = 28%):
Temp  Humidity  Windy  Hours Played
Mild  High      False  45
Cool  Normal    False  52
Cool  Normal    True   23
Mild  Normal    False  46
Mild  High      True   30

Temp: Cool 14.50 (count 2), Mild 7.32 (count 3)
SDR = 10.87 − ((2/5)·14.50 + (3/5)·7.32) = 0.678
Humidity: High 7.50 (count 2), Normal 12.50 (count 3)
SDR = 10.87 − ((2/5)·7.50 + (3/5)·12.50) = 0.370
Windy: False 3.09 (count 3), True 3.50 (count 2)
SDR = 10.87 − ((3/5)·3.09 + (2/5)·3.50) = 7.62
Decision Tree
• Because the number of data points in both branches (False and True) is equal to or less than 3, we stop further branching and assign the average of each branch to the related leaf node
Decision Tree
Step 4 (d): The "rainy" branch has a CV (22%), which is more than the threshold (10%).
This branch needs additional splitting. Here we are selecting "Windy" as the best node
because it has the largest SDR
Hours Played (StDev) Count
Cool 0 1
Temp Humidity Windy Hours Played Temp
Hot 2.5 2
Hot High False 25
Mild 6.5 2
Hot High True 30 SDR = 7.87 – ((1/5)*0 + (2/5)*2.5 + (2/5)*6.5) = 4.18
Mild High False 35
Hours Played (StDev) Count
Cool Normal False 38
High 4.1 3
Humidity
Mild Normal True 48
Normal 5.0 2
S = 7.78 SDR = 7.87 – ((3/5)*4.3 + (2/5)*5.0) = 3.32
Avg = 35.2 Hours Played (StDev) Count
CV = 22% False 5.6 3
Windy
True 9.0 2
SDR = 7.87 – ((3/5)*5.6 + (2/5)*9.0) = 0.8 2
© 2021 The Knowledge Academy Ltd 11
Decision Tree
• Now we stop further branching, as the number of data points in all three branches (Cool, Hot, and Mild) is equal to or less than 3. We assign the average of each branch to the related leaf node. The final tree (a code sketch follows):
• Outlook = Sunny → Windy: False → 47.7, True → 26.5
• Outlook = Overcast → 46.3
• Outlook = Rainy → Temp: Cool → 38, Hot → 27.5, Mild → 41.5
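A sketch of the same regression-tree idea with scikit-learn, which splits on variance reduction, the squared-error analogue of the SDR used above (the one-hot encoding and max_depth are our own choices):

#regression tree on a categorical toy dataset
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

data = pd.DataFrame({
    "Outlook": ["Rainy", "Overcast", "Sunny", "Sunny", "Rainy", "Overcast"],
    "Windy":   [False, False, False, True, True, True],
    "Hours":   [26, 46, 45, 23, 30, 43],
})
X = pd.get_dummies(data[["Outlook", "Windy"]])   # encode categorical features
tree = DecisionTreeRegressor(max_depth=2).fit(X, data["Hours"])
print(tree.predict(X))                           # leaf averages, as in the example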
Neural Networks
• Neural networks are a class of models within the broader machine learning literature
• Neural networks are a group of algorithms that have had a massive impact on machine learning
• Today's deep neural networks are inspired by biological neural networks and have proven to work quite well
• They are general function approximators, meaning they can be applied to almost any machine learning problem that involves learning a complex mapping from the input space to the output space
Neural Networks
• The following are some reasons to study neural computation:
o To understand how the brain actually works
o To understand a style of parallel computation inspired by neurons and their adaptive connections
o To solve practical problems using novel learning algorithms inspired by the brain
Neural Networks
Building Blocks of Neurons
• The basic unit of a neural network is the neuron, which takes inputs and produces an output
• In the figure, the inputs x1 and x2 feed a weighted sum, which produces the output y
Neural Networks
• The mathematical formulation involves the following steps:
o First, each input is multiplied by a weight:
x1 → x1 * w1
x2 → x2 * w2
o Next, all the weighted inputs are added together with a bias b:
(x1 * w1) + (x2 * w2) + b
o Finally, the sum is passed through an activation function:
y = f(x1 * w1 + x2 * w2 + b)
Neural Networks
• The activation function turns an unbounded input into an output with a predictable form. A commonly used activation function is the sigmoid function:
σ(x) = 1 / (1 + e^(−x))
• The sigmoid function only outputs numbers in the range (0, 1); a sketch of a neuron using it follows
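A from-scratch sketch of the neuron described above (weighted sum plus bias, then the sigmoid); the inputs and weights are illustrative:

#a single neuron with a sigmoid activation
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))    # outputs numbers in the range (0, 1)

def neuron(x, w, b):
    return sigmoid(np.dot(x, w) + b)   # y = f(x1*w1 + x2*w2 + b)

x = np.array([2.0, 3.0])               # inputs x1, x2
w = np.array([0.5, -1.0])              # weights w1, w2
print(neuron(x, w, b=1.0))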
Neural Networks
• The following are some of the different neural network architectures:
o Perceptrons
o Convolutional Neural Networks
o Recurrent Neural Networks
o Long / Short-Term Memory
o Gated Recurrent Unit
o Hopfield Network
o Boltzmann Machine
o Deep Belief Networks
o Autoencoders
o Generative Adversarial Network
Unsupervised Learning
Unsupervised Learning
• In unsupervised learning, the machine is trained using information that is neither labelled nor classified, and the algorithm is allowed to act on that information without guidance
• The machine's main task is to group unsorted information based on patterns, similarities, and differences, without any prior training on the data
• Because the machine is not provided with a teacher, it is left to find the hidden structure in the unlabelled data by itself
Unsupervised Learning
Difference Between Supervised and Unsupervised Learning
Unsupervised Learning
• For example, suppose there is an image containing both dogs and cats that the machine has never seen before
• The machine is not aware of the features of cats and dogs, so we cannot categorise the data for it
• But the machine can categorise the animals according to their patterns, similarities, and differences, i.e. it can easily divide the picture into two parts
Unsupervised Learning
• Unsupervised learning can be divided into two categories of algorithms:
o Clustering: a clustering problem is one where you want to find the inherent groupings in the data, such as grouping customers by purchasing behaviour
o Association: an association rule learning problem is one where you want to find rules that describe large portions of the data
Clustering
Clustering
• Clustering is the task of dividing data points into groups so that the data points in the same group are more similar to each other than to the data points in other groups
• Essentially, clustering is a grouping of objects based on the similarity and dissimilarity between them
• For instance, the data points in the graph that lie close together can be placed into a single group; in the example graph, we can identify three clusters
Clustering
• Clusters do not have to be spherical, as DBSCAN clustering on density-based data shows
• Such data points are clustered using the fundamental notion that each data point lies within a given constraint of the cluster centre
Clustering
Types of Clustering
• Broadly speaking, clustering can be divided into two subgroups:
o Hard clustering: each data point either belongs to a cluster completely or not at all
o Soft clustering: instead of putting each data point into a separate cluster, a probability or likelihood of the data point belonging to each cluster is assigned
Clustering
The following are some methods of clustering:
• Density-based methods
• Partitioning methods
• Hierarchical methods
• Grid-based methods
K-Means
• Suppose we are given a dataset of items, each with certain features and values for those features
• The task is to categorise the items into groups
• The k-means algorithm (an unsupervised learning algorithm) achieves this task
• The algorithm categorises the items into k groups by similarity
• To calculate that similarity, we use the Euclidean distance as the measurement
K-Means
The algorithm works as follows:
1) First, initialise k points, known as means, randomly
2) Second, assign each item to its closest mean and update that mean's coordinates, which are the averages of the items assigned to it so far
3) Repeat the process for a given number of iterations. At the end, we have our clusters (a sketch follows)

Algorithm in pseudocode:
Initialise k means with random values
For a given number of iterations:
    Iterate through items:
        Find the mean closest to the item
        Assign item to mean
        Update mean
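A compact from-scratch sketch of this pseudocode using Euclidean distance (the toy points and k are illustrative; a batch update per iteration stands in for the per-item update above):

#k-means from scratch
import numpy as np

def k_means(points, k, iterations=100, seed=0):
    rng = np.random.default_rng(seed)
    means = points[rng.choice(len(points), k, replace=False)]  # initialise k means
    for _ in range(iterations):
        # assign every item to its closest mean (Euclidean distance)
        dists = np.linalg.norm(points[:, None] - means[None, :], axis=2)
        labels = dists.argmin(axis=1)
        # update each mean to the average of the items assigned to it
        means = np.array([points[labels == j].mean(axis=0) for j in range(k)])
    return means, labels

points = np.array([[1, 1], [1.5, 2], [0.5, 1.5], [8, 8], [8, 9], [9, 8.5]])
print(k_means(points, k=2))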
K-Medoids
• K-medoids is a clustering algorithm related to k-means and to the medoid-shift algorithm
• A medoid can be defined as the point in the cluster whose dissimilarity to all the other points in the cluster is minimal
• The dissimilarity of an object Pi and a medoid Ci is calculated using:
E = |Pi − Ci|
• The cost in the k-medoids algorithm is given by:
c = Σ_Ci Σ_{Pi∈Ci} |Pi − Ci|
K-Medoids
Algorithm:
1) Initialise: select k random points out of the n data points as the medoids
2) Associate each data point with its closest medoid, using any common distance metric
3) While the cost decreases:
4) For each medoid m and each data point o that is not a medoid:
a) Swap m and o, associate each data point with its closest medoid, and recompute the cost
b) If the total cost is more than that of the previous step, undo the swap
K-Medoids
• A medoid of a finite dataset is a data point from the set whose average dissimilarity to all the data points is minimal (the most centrally located point in the set)
• The Partitioning Around Medoids (PAM) algorithm is the most common realisation of k-medoid clustering. It works as outlined below:
1) Initialise: randomly select k of the n data points as the medoids
2) Assignment step: associate each data point with its closest medoid
3) Update step: for every medoid m and each data point o associated with m, swap m and o and compute the total cost of the configuration. Select the medoid o with the lowest configuration cost
Fuzzy
• The term fuzzy refers to things that are not clear, i.e. vague
• Sometimes we encounter a situation where we cannot decide whether a statement is true or false. In such situations, fuzzy logic provides flexibility for reasoning
• A fuzzy logic algorithm solves a problem after analysing all the available data, and then takes the best possible decision for the given input
• The fuzzy logic method imitates a human's decision-making ability by considering all the possibilities between the digital values true (T) and false (F)
Fuzzy
Fuzzy Logic Architecture
• It has four main parts (Fuzzifier, Rules, Intelligence, and Defuzzifier), connected as follows:
Crisp Input → Fuzzifier → Fuzzy Input Set → Intelligence (applies the Rules) → Fuzzy Output Set → Defuzzifier → Crisp Output
Hierarchical
• The hierarchical clustering technique is one of the most popular clustering techniques in machine learning
• It groups similar data points, and each group of related data points is known as a cluster (the original figure shows the unclustered data and the clustered data)
• This clustering technique is divided into two types:
o Agglomerative
o Divisive
Hierarchical
1) Agglomerative
• In the agglomerative technique, every data point is initially considered an individual cluster. At each iteration, similar clusters merge with other clusters until K clusters are formed
• The steps of the basic agglomerative algorithm are as follows:
o Compute the proximity matrix
o Let each data point be a cluster
o Repeat: merge the two closest clusters and update the proximity matrix
o Until only a single cluster remains
Hierarchical
2) Divisive Hierarchical Clustering Technique
• This clustering technique is the opposite of the agglomerative hierarchical clustering technique
• In divisive hierarchical clustering, we consider all the data points as a single cluster, and in each iteration we separate from the cluster the data points that are not similar
• Each separated data point is considered an individual cluster. In the end, we are left with n clusters
• Because a single cluster is divided into n clusters, the technique is named divisive hierarchical clustering
Gaussian Mixture
• Suppose there are K clusters, and we estimate µ and σ for each of them
o These could be estimated by the maximum-likelihood method if there were only one distribution
o But since there are K such clusters, the probability density is defined as a linear combination of the densities of all K distributions:
p(X) = Σ_{k=1}^{K} π_k G(X | µ_k, Σ_k)
o where π_k is the mixing coefficient for the k-th distribution
Gaussian Mixture
• To estimate the parameters by the maximum log-likelihood method, compute:
ln p(X | µ, Σ, π) = Σ_{i=1}^{N} ln p(X_i) = Σ_{i=1}^{N} ln Σ_{k=1}^{K} π_k G(X_i | µ_k, Σ_k)
• Now define a random variable γ_k(X) such that γ_k(X) = p(k|X)
• From Bayes' theorem:
γ_k(X) = p(X|k) p(k) / Σ_{k=1}^{K} p(k) p(X|k)
       = π_k p(X|k) / Σ_{k=1}^{K} π_k p(X|k)
Gaussian Mixture
• For the log-likelihood function to be at a maximum, its derivatives with respect to µ, Σ, and π must be zero. Setting the derivative with respect to µ to zero and rearranging the terms gives:
µ_k = Σ_{n=1}^{N} γ_k(x_n) x_n / Σ_{n=1}^{N} γ_k(x_n)
• Similarly, taking the derivatives with respect to σ and π, one obtains the following expressions:
Σ_k = Σ_{n=1}^{N} γ_k(x_n) (x_n − µ_k)(x_n − µ_k)^T / Σ_{n=1}^{N} γ_k(x_n)
π_k = (1/N) Σ_{n=1}^{N} γ_k(x_n)
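These update equations are exactly what the EM algorithm iterates; a brief sketch using scikit-learn's GaussianMixture (an assumption; the slides do not name a library):

#fitting a two-component Gaussian mixture with EM
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)),    # samples around (0, 0)
               rng.normal(5, 1, (100, 2))])   # samples around (5, 5)

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
print(gmm.weights_)              # the mixing coefficients pi_k
print(gmm.means_)                # the estimated means mu_k
print(gmm.predict_proba(X[:3]))  # gamma_k(x), the responsibilities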
Hidden Markov Model
• HMM stands for Hidden Markov Model
• It is based on augmenting the Markov chain
• A Markov chain is a model that tells us the probabilities of sequences of random variables (states), each of which can take on values from some set
• These sets can be words, tags, or symbols representing anything, such as the weather
• A Markov chain embodies a powerful assumption: if we want to predict the future of the sequence, all that matters is the current state
Hidden Markov Model
• To predict tomorrow's weather, you may examine today's weather, but you are not allowed to look at yesterday's weather
• Consider a sequence of state variables q1, q2, …, qi. A Markov model embodies the Markov assumption about the probabilities of this sequence: when predicting the future, the past does not matter, only the present
Markov assumption: P(qi = a | q1 … qi−1) = P(qi = a | qi−1)
Hidden Markov Model
• The following components specify a Markov chain:
o q = q1 q2 … qN: a set of N states
o A = a11 a12 … ann: a transition probability matrix, each aij representing the probability of moving from state i to state j, such that Σ_{j=1}^{n} aij = 1 for all i
o π = π1, π2, …, πN: an initial probability distribution over the states
• πi is the probability that the Markov chain will start in state i
• Some states j may have πj = 0, meaning they cannot be initial states. Also, Σ_{i=1}^{n} πi = 1 (a sketch of these components follows)
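A small sketch of these components for a two-state weather chain; the transition matrix and initial distribution are illustrative numbers:

#sampling from a simple Markov chain
import numpy as np

states = ["sunny", "rainy"]        # q: the set of N states
A = np.array([[0.8, 0.2],          # A: transition matrix, each row sums to 1
              [0.4, 0.6]])
pi = np.array([0.5, 0.5])          # initial probability distribution

rng = np.random.default_rng(0)
q = rng.choice(2, p=pi)            # start in a state drawn from pi
for _ in range(7):
    q = rng.choice(2, p=A[q])      # the next state depends only on the current one
    print(states[q])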
Hidden Markov Model
• A hidden Markov model includes both observed events and hidden events that act as causal factors in the probabilistic model
• The following components specify an HMM:
o q = q1 q2 … qN: a set of N states
o A = a11 … aij … aNN: a transition probability matrix, each aij representing the probability of moving from state i to state j, such that Σ_{j=1}^{N} aij = 1 for all i
o o = o1 o2 … oT: a sequence of T observations, each drawn from a vocabulary V = v1, v2, …, vV
Hidden Markov Model
• b = bi(ot): a sequence of observation likelihoods, also known as emission probabilities, each expressing the probability of an observation ot being generated from a state i
• π = π1, π2, …, πN: an initial probability distribution over the states
o πi is the probability that the Markov chain will start in state i
o Some states j may have πj = 0, meaning they cannot be initial states. Also, Σ_{i=1}^{n} πi = 1
Hidden Markov Model
• A first-order hidden Markov model instantiates two simplifying assumptions
o First, the probability of a particular state depends only on the previous state:
Markov assumption: P(qi | q1 … qi−1) = P(qi | qi−1)
o Second, the probability of an output observation oi depends only on the state qi that produced the observation, and not on any other states or observations:
Output independence: P(oi | q1 … qi, …, qT, o1, …, oi, …, oT) = P(oi | qi)
Deep Learning
Deep Learning
• Deep learning is a machine learning technique that trains machines to do what comes naturally to humans: learn by example
• It is a key technology behind driverless cars, allowing them to distinguish a pedestrian from a lamppost or to recognise a stop sign
• It powers voice control in consumer devices such as tablets, phones, TVs, and hands-free speakers
Deep Learning
• Deep learning has been getting attention lately because it is achieving results that were not possible before
• In deep learning, a computer model learns to perform classification tasks directly from text, images, or sound
• Deep learning models can achieve state-of-the-art accuracy, sometimes exceeding human-level performance
• The models are trained using large sets of labelled data and neural network architectures that contain many layers
Importance of Deep Learning
• As the name suggests, artificial intelligence aims to make a machine artificially intelligent, i.e. to make machines that act and think like humans
• The amount of useful data available and the increase in computational speed are the two factors that have made the whole world invest in this field
• If a robot is hard-coded, i.e. all of its logic has been manually coded into the system, it is not AI; simple robots do not, by themselves, constitute AI
• Machine learning means making a machine learn from experience and enhance its performance over time, as a human baby does
• The concept of machine learning became feasible only when an adequate amount of data was made available for training machines. It assists in dealing with complex systems
Importance of Deep Learning
(Continued)
• Deep learning is a subset of machine learning, but here the machine learns in the way humans are believed to learn
• The structure of a deep learning model resembles the human brain: a large number of nodes plays the role of the brain's neurons, which is why the result is called an artificial neural network
• When traditional machine learning algorithms are applied, we need to select input features manually from a complex dataset and then train on them, which is a tedious job for a machine learning scientist; with neural networks, we do not need to select useful input features manually
Importance of Deep Learning
(Continued)
• There are several types of neural networks for managing the complexity of the dataset and the algorithm
• Deep learning has allowed industry experts to overcome challenges that were impossible a decade ago, such as image and speech recognition and natural language processing
• Industries such as entertainment, journalism, manufacturing, the digital sector, healthcare, banking and finance, and automotive depend on it
• Recent successes of deep learning include voice assistants, mail services, self-driving cars, video recommendations, and intelligent chatbots
How Deep Learning Works
• Neural networks are composed of layers of nodes, much as the human brain is made of neurons. Nodes in one layer are connected to nodes in the adjacent layers
• In the human brain, a single neuron receives thousands of signals from other neurons. In an artificial neural network, signals travel between nodes and are assigned weights
• A more heavily weighted node exerts more influence on the next layer of nodes. The final layer combines the weighted inputs to produce an output
• Deep learning systems require powerful hardware, because they process huge amounts of data and perform many complex mathematical calculations
• Even with such advanced hardware, deep learning training can take weeks
How Deep Learning Works
(Continued)
• Deep learning systems need a large amount of data to return accurate results; accordingly, information is fed to them as huge datasets
• When processing the data, artificial neural networks are able to classify it using the answers to a series of true/false questions involving highly complex mathematical computations
• For instance, facial recognition programs work by learning to detect the edges and lines of faces, then the more significant parts of the faces, and finally complete representations of the faces
• As the program trains itself, the probability of correct answers increases over time
Congratulations
Congratulations on completing this course!
Keep in touch
info@theknowledgeacademy.com
Thank you