AI ML Important Questions
10.5.3 Multi-Layer Perceptron (MLP)
This ANN consists of multiple layers with one input layer, one output layer and one or more hidden layers. Every neuron in a layer is connected to all neurons in the next layer and thus they are fully connected. The information flows in both directions. In the forward direction, the inputs are multiplied by the weights of the neurons and forwarded to the activation function of the neuron, and the output is passed to the next layer. If the output is incorrect, then in the backward direction the error is back-propagated to adjust the weights and biases to get the correct output. Thus, the network learns with the training data. This type of ANN is used in deep learning for complex classification, speech recognition, medical diagnosis, forecasting, etc. They are comparatively complex and slow. The model of an MLP is shown in Figure 10.9.
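To make the forward and backward passes concrete, the following is a minimal sketch (not the textbook's code) of a one-hidden-layer MLP trained with backpropagation in NumPy; the layer sizes, the sigmoid activation, the learning rate and the toy data are assumptions made only for illustration.

import numpy as np

# Minimal MLP sketch: one hidden layer, sigmoid activations, squared-error loss.
# All sizes, data and the learning rate are illustrative assumptions, not values from the text.
rng = np.random.default_rng(0)
X = rng.random((8, 3))                                   # 8 training samples, 3 inputs
y = (X.sum(axis=1, keepdims=True) > 1.5).astype(float)   # toy target

W1, b1 = rng.random((3, 4)), np.zeros(4)   # input -> hidden weights and biases
W2, b2 = rng.random((4, 1)), np.zeros(1)   # hidden -> output weights and biases
lr = 0.5

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for epoch in range(1000):
    # Forward direction: weighted sums pass through the activation function of each layer.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward direction: the error is back-propagated to adjust the weights and biases.
    err = out - y
    d_out = err * out * (1 - out)          # gradient at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)     # gradient at the hidden layer
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)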
3. The modelling with ANN is also extremely complicated and the development takes a much longer time.
4. Generally, neural networks require more data than traditional machine learning algorithms, and they do not perform well on small datasets.
5. They are also more computationally expensive than traditional learning techniques.
Challenges of Clustering Algorithms
A huge collection of data with higher dimensions (i.e., features or attributes) can pose a problem for clustering algorithms. With the arrival of the Internet, billions of data items are available for clustering algorithms. This is a difficult task, as scaling is always an issue with clustering algorithms. Scaling is an issue where some algorithms work with lower-dimension data but do not perform well for higher-dimension data. Also, units of data can pose a problem; for example, some weights given in kilograms and some in pounds can create difficulties in clustering. Designing a proximity measure is also a big challenge.
The advantages and disadvantages of the cluster analysis algorithms are given in Table 13.2.
Table 13.2: Advantages and Disadvantages of Clustering Algorithms
S.No.  Advantages                                               Disadvantages
1.     Cluster analysis algorithms can handle missing           Cluster analysis algorithms are sensitive to
       data and outliers.                                       initialization and to the order of the input data.
2.     Can help classifiers in labelling unlabelled data.       Often, the number of clusters present in the
       Semi-supervised algorithms use cluster analysis          data has to be specified by the user.
       algorithms to label the unlabelled data and then
       use classifiers to classify them.
Within-cluster variation = Σ_{i=1..N} Σ_{m_j ∈ C_i} ||m_j - x_i||²    (13.17)

Here, N is the number of clusters, C is the set of centroids, x_i is the centroid of cluster C_i and m_j are its samples. A lower within-cluster variation is a necessary condition for greater compactness and high cohesion.
Separation indicates how well a sample differs from other clusters. This is measured as the
weighted sum of the differences of the centroid of the dataset and the centroid of the generated
clusters. This is given as:
Separation = Σ_{i=1..N} |C_i| × ||x - x_i||²    (13.18)
Here, x is the centroid of the entire dataset, x_i is the centroid of cluster i and |C_i| is the size of the cluster. A larger distance is required for well-separated clusters so that the clusters are perfectly distinct. Sometimes, the connectivity between a sample and the other members of its cluster may be important, indicating the sort of samples that can be put into the clusters. The connectivity value ranges from 0 to infinity. The Dunn index can be computed as:
Dunn Index = (α × separation) / (β × compactness)    (13.19)

Here, α and β are parameters. The Dunn index is a useful measure that can combine both cohesion and separation.
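A small sketch of how Eqs. (13.17)-(13.19) could be computed for a given clustering is shown below; the toy data, the cluster labels and the parameter choice α = β = 1 are assumptions made only for illustration.

import numpy as np

# Illustrative data and cluster labels (assumed, not taken from the text).
X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.2, 4.9], [9.0, 1.0], [8.8, 1.2]])
labels = np.array([0, 0, 1, 1, 2, 2])

overall_centroid = X.mean(axis=0)
compactness = 0.0     # within-cluster variation, Eq. (13.17)
separation = 0.0      # weighted distance of cluster centroids from the dataset centroid, Eq. (13.18)
for c in np.unique(labels):
    members = X[labels == c]
    centroid = members.mean(axis=0)
    compactness += ((members - centroid) ** 2).sum()
    separation += len(members) * ((centroid - overall_centroid) ** 2).sum()

alpha = beta = 1.0                                               # assumed parameter values
dunn_style_index = (alpha * separation) / (beta * compactness)   # Eq. (13.19)
print(compactness, separation, dunn_style_index)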
Silhouette Coefficient
The silhouette coefficient combines both cohesion and separation. It measures, for each sample, how close the sample is to its own cluster compared with the nearest other cluster. It is given as follows:

s_i = (b_i - a_i) / max(b_i, a_i)    (13.20)

Here, a_i is the distance between the sample and the centroid of its own cluster, and b_i is the distance between the sample and the nearest other centroid. The silhouette coefficients of the individual objects can be combined (averaged) to get a value S for the entire clustering, given as:

S = (1/N) Σ_{i=1..N} s_i    (13.21)
The value of the silhouette coefficient s_i is between -1 and +1. When it is closer to 1, the clusters are well formed. The value is zero when the data points are between two clusters, and negative when the clusters are not formed correctly.
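A minimal sketch of computing silhouette values with scikit-learn is given below; the toy data and the use of KMeans are assumptions for illustration. Note that scikit-learn computes a_i and b_i from average distances to the other points rather than to centroids, which differs slightly from Eq. (13.20) but captures the same idea.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score, silhouette_samples

# Toy data (assumed): two well-separated blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(20, 2)), rng.normal(size=(20, 2)) + 6])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

s_i = silhouette_samples(X, labels)   # per-sample coefficients, (b_i - a_i) / max(a_i, b_i)
S = silhouette_score(X, labels)       # mean coefficient over the whole clustering
print(round(S, 3), round(s_i.min(), 3), round(s_i.max(), 3))   # values near +1 mean well-formed clusters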
Summary
1. Clustering is a technique of partitioning the objects with many attributes into meaningful disjoint
subgroups.
2. Quantitative variables use distance measures, Euclidean distance, Manhattan distance and
Example 6.3: Assess a student's performance during his course of study and predict whether a student will get a job offer or not in his final year of the course. The training dataset T consists of 10 data instances with attributes such as 'CGPA', 'Interactiveness', 'Practical Knowledge' and 'Communication Skills' as shown in Table 6.3. The target class attribute is 'Job Offer'.
Table 6.3: Training Dataset T

S.No.  CGPA  Interactiveness  Practical Knowledge  Communication Skills  Job Offer
1.     ≥9    Yes              Very good            Good                  Yes
2.     ≥8    No               Good                 Moderate              Yes
3.     ≥9    No               Average              Poor                  No
4.     <8    No               Average              Good                  No
5.     ≥8    Yes              Good                 Moderate              Yes
6.     ≥9    Yes              Good                 Moderate              Yes
7.     <8    Yes              Good                 Poor                  No
8.     ≥9    No               Very good            Good                  Yes
9.     ≥8    Yes              Good                 Good                  Yes
10.    ≥8    Yes              Average              Good                  Yes
Solution:
Iteration 1:
Step 1: Calculate the Entropy for the target class 'Job Offer'.
Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3)
= -(7/10) log2(7/10) - (3/10) log2(3/10)
= 0.3599 + 0.5208
= 0.8807
Step 2: Calculate the Entropy_Info and Gain (Information_Gain) for each of the attributes in the training dataset.
Table 6.4 shows the number of data instances classified with Job Offer as Yes or No for the attribute CGPA.
Table 6.4: Entropy Information for CGPA
CGPA  Job Offer = Yes  Job Offer = No  Total  Entropy
≥9    3                1               4      0.8108
≥8    4                0               4      0
<8    0                2               2      0
Entropy_Info(T, CGPA) = (4/10) × [-(3/4) log2(3/4) - (1/4) log2(1/4)] + (4/10) × [-(4/4) log2(4/4)] + (2/10) × [-(2/2) log2(2/2)]
= (4/10) × (0.3111 + 0.4997) + 0 + 0
= 0.3243
Gain (CGPA) = 0.8807-0.3243
=0.5564
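The arithmetic of Steps 1 and 2 can be checked with a short script; this sketch is not part of the textbook, the counts are read off Tables 6.3 and 6.4, and because the script keeps full precision its results differ from the rounded figures above in the third or fourth decimal place (0.8813 vs 0.8807, and so on).

import math

def entropy(counts):
    # Entropy_Info of a list of class counts, e.g. [7, 3].
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

target_entropy = entropy([7, 3])   # Step 1: ~0.8813

# Step 2 for CGPA: partitions >=9 -> (3 Yes, 1 No), >=8 -> (4, 0), <8 -> (0, 2)
partitions = [[3, 1], [4, 0], [0, 2]]
n = sum(sum(p) for p in partitions)
entropy_info_cgpa = sum(sum(p) / n * entropy(p) for p in partitions)   # ~0.3245
gain_cgpa = target_entropy - entropy_info_cgpa                         # ~0.5568
print(round(target_entropy, 4), round(entropy_info_cgpa, 4), round(gain_cgpa, 4))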
Table 6.5 shows the number of data instances classified with Job Offer as Yes or No for the
attribute Interactiveness.
Table 6.5: Entropy Information for Interactiveness
Interactiveness  Job Offer = Yes  Job Offer = No  Total  Entropy
Yes              5                1               6      0.6497
No               2                2               4      0.9994
Entropy_Info(T, Interactiveness) = (6/10) × [-(5/6) log2(5/6) - (1/6) log2(1/6)] + (4/10) × [-(2/4) log2(2/4) - (2/4) log2(2/4)]
= (6/10) × (0.2191 + 0.4306) + (4/10) × (0.4997 + 0.4997)
= 0.3898 + 0.3998
= 0.7896
Gain(Interactiveness) = 0.8807- 0.7896
=0.0911
Table 6.6 shows the number of data instances classified with Job Offer as Yes or No for the
attribute Practical Knowledge.
Table 6.6: Entropy Information for Practical Knowledge

Practical Knowledge  Job Offer = Yes  Job Offer = No  Total  Entropy
Very good            2                0               2      0
Average              1                2               3      0.9177
Good                 4                1               5      0.7215

Entropy_Info(T, Practical Knowledge) = (2/10) × [-(2/2) log2(2/2)] + (3/10) × [-(1/3) log2(1/3) - (2/3) log2(2/3)] + (5/10) × [-(4/5) log2(4/5) - (1/5) log2(1/5)]
= 0 + (3/10) × (0.5280 + 0.3897) + (5/10) × (0.2574 + 0.4641)
= 0 + 0.2753 + 0.3608
= 0.6361
Gain(Practical Knowledge) = 0.8807 - 0.6361 = 0.2446

Similarly, for the attribute Communication Skills the instances are distributed as Good (4 Yes, 1 No), Moderate (3 Yes, 0 No) and Poor (0 Yes, 2 No), so
Entropy_Info(T, Communication Skills) = (5/10) × (0.2574 + 0.4641) + 0 + 0 = 0.3609
Gain(Communication Skills) = 0.8807 - 0.3609 = 0.5198

Since 'CGPA' has the highest information gain, it is chosen as the best splitting attribute and becomes the root node.
[Figure 6.3: Decision Tree After Iteration 1. 'CGPA' is at the root; the ≥8 branch ends in Job Offer = Yes, the <8 branch ends in Job Offer = No, and the ≥9 branch carries the remaining instances (Interactiveness, Practical Knowledge, Communication Skills, Job Offer) forward.]
Now, continue the same process for the subset of data instances branched with CGPA ≥9.
Iteration 2:
In this iteration, the same process of computing the Entropy_Info and Gain is repeated for the subset of the training set. The subset consists of 4 data instances, as shown in Figure 6.3 above.
Entropy_Info(T) = Entropy_Info(3, 1)
= -(3/4) log2(3/4) - (1/4) log2(1/4)
= 0.3111 + 0.4997
= 0.8108
Entropy_Info(T, Interactiveness) = (2/4) × 0 + (2/4) × [-(1/2) log2(1/2) - (1/2) log2(1/2)]
= 0 + 0.4997
Gain(Interactiveness) = 0.8108 - 0.4997
= 0.3111
Entropy_Info(T, Practical Knowledge) = 0
Gain(Practical Knowledge) = 0.8108
Entropy_Info(T, Communication Skills) = 0
Gain(Communication Skills) = 0.8108
Here, both the attributes 'Practical Knowledge' and 'Communication Skills' have the same Gain. So, we can construct the decision tree using either 'Practical Knowledge' or 'Communication Skills'. The final decision tree is shown in Figure 6.4.
[Figure 6.4: Final Decision Tree. 'CGPA' is at the root with branches ≥9, ≥8 and <8; the ≥9 branch splits further on Practical Knowledge (Very good, Good, Average).]
Step 1: Calculate the Entropy for the target class 'Job Offer'.
Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3) = 0.3599 + 0.5208 = 0.8807
Step 2: Calculate the Entropy_Info, Gain (Info_Gain), Split_Info and Gain_Ratio for each of the attributes in the training dataset.
CGPA:
Entropy_Info(T, CGPA) = (4/10) × [-(3/4) log2(3/4) - (1/4) log2(1/4)] + (4/10) × [-(4/4) log2(4/4)] + (2/10) × [-(2/2) log2(2/2)]
= (4/10) × (0.3111 + 0.4997) + 0 + 0
= 0.3243
Gain(CGPA) = 0.8807 - 0.3243 = 0.5564
Split_Info(T, CGPA) = -(4/10) log2(4/10) - (4/10) log2(4/10) - (2/10) log2(2/10)
= 0.5285 + 0.5285 + 0.4641
= 1.5211
Gain_Ratio(CGPA) = Gain(CGPA) / Split_Info(T, CGPA) = 0.5564 / 1.5211 = 0.3658
Interactiveness:
Entropy_Info(T, Interactiveness) = (6/10) × (0.2191 + 0.4306) + (4/10) × (0.4997 + 0.4997) = 0.7896
Gain(Interactiveness) = 0.8807 - 0.7896 = 0.0911
Split_Info(T, Interactiveness) = -(6/10) log2(6/10) - (4/10) log2(4/10) = 0.9704
Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)
= 0.0911 / 0.9704
= 0.0939
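The Split_Info and Gain_Ratio arithmetic can be sketched the same way; the helper below is illustrative only, the class counts per attribute value come from the tables above, and small differences from the rounded values in the text are again due to precision.

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def gain_ratio(partitions, target_counts):
    # partitions: class counts per attribute value, e.g. CGPA -> [[3,1],[4,0],[0,2]].
    n = sum(sum(p) for p in partitions)
    entropy_info = sum(sum(p) / n * entropy(p) for p in partitions)
    gain = entropy(target_counts) - entropy_info
    split_info = entropy([sum(p) for p in partitions])   # entropy of the partition sizes
    return gain / split_info

print(gain_ratio([[3, 1], [4, 0], [0, 2]], [7, 3]))   # CGPA            ~0.366
print(gain_ratio([[5, 1], [2, 2]], [7, 3]))           # Interactiveness ~0.094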
Practical Knowledge:
Entropy_Info(T, Practical Knowledge) = 0.6361
Gain(Practical Knowledge) = 0.8807 - 0.6361 = 0.2446
Split_Info(T, Practical Knowledge) = -(2/10) log2(2/10) - (3/10) log2(3/10) - (5/10) log2(5/10) = 1.4855
Gain_Ratio(Practical Knowledge) = 0.2446 / 1.4855 = 0.1646
Communication Skills:
Entropy_Info(T, Communication Skills) = (5/10) × [-(4/5) log2(4/5) - (1/5) log2(1/5)] + (3/10) × [-(3/3) log2(3/3)] + (2/10) × [-(2/2) log2(2/2)]
= (5/10) × (0.2574 + 0.4641) + 0 + 0
= 0.3609
Gain(Communication Skills) = 0.8807 - 0.3609 = 0.5198
Split_Info(T, Communication Skills) = -(5/10) log2(5/10) - (3/10) log2(3/10) - (2/10) log2(2/10)
= 1.4853
Gain_Ratio(Communication Skills) = Gain(Communication Skills) / Split_Info(T, Communication Skills)
= 0.5198 / 1.4853
= 0.3500
Since 'CGPA' has the highest Gain_Ratio, it is chosen as the best splitting attribute at the root node. The same process is now repeated (Iteration 2) for the subset of 4 data instances in the branch CGPA ≥9, for which Entropy_Info(T) = 0.8108 and Gain(Interactiveness) = 0.3112.

Interactiveness:
Split_Info(T, Interactiveness) = -(2/4) log2(2/4) - (2/4) log2(2/4) = 0.5 + 0.5 = 1
Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness)
= 0.3112 / 1
= 0.3112
Practical Knowledge:
Entropy_Info(T, Practical Knowledge) = (2/4) × [-(2/2) log2(2/2)] + (1/4) × [-(1/1) log2(1/1)] + (1/4) × [-(1/1) log2(1/1)] = 0
Gain(Practical Knowledge) = 0.8108 - 0 = 0.8108
Split_Info(T, Practical Knowledge) = -(2/4) log2(2/4) - (1/4) log2(1/4) - (1/4) log2(1/4) = 1.5
Gain_Ratio(Practical Knowledge) = 0.8108 / 1.5 = 0.5408
Communication Skills:
Entropy_Info(T, Communication Skills) = 0
Gain(Communication Skills) = 0.8108 - 0 = 0.8108
Split_Info(T, Communication Skills) = -(2/4) log2(2/4) - (1/4) log2(1/4) - (1/4) log2(1/4) = 1.5
Gain_Ratio(Communication Skills) = 0.8108 / 1.5 = 0.5408
Table 6.11 shows the Gain_Ratio computed for all the attributes.
Table 6.11: Gain-Ratio
Attributes Gain Ratio
Interactiveness 0.3112
Practical Knowledge 0.5408
Communication Skills 0.5408
Both 'Practical Knowledge' and 'Communication Skills' have the highest gain ratio. So, the best splitting attribute can be either 'Practical Knowledge' or 'Communication Skills', and therefore, the split can be based on any one of these.
Here, we split based on 'Practical Knowledge'. The final decision tree is shown in Figure 6.6.
[Figure 6.6: Final Decision Tree. 'CGPA' is at the root; the ≥8 branch gives Job Offer = Yes, the <8 branch gives Job Offer = No, and the ≥9 branch splits further on Practical Knowledge (Very good, Good, Average).]
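For readers who want to reproduce a tree on the data of Table 6.3 in code, the sketch below uses scikit-learn's DecisionTreeClassifier with the entropy criterion; note that scikit-learn implements binary CART splits on one-hot encoded features rather than the multiway ID3/C4.5 splits worked out above, so the printed tree may differ in shape even though it separates the same classes.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Training dataset T from Table 6.3.
data = pd.DataFrame({
    "CGPA":                [">=9", ">=8", ">=9", "<8", ">=8", ">=9", "<8", ">=9", ">=8", ">=8"],
    "Interactiveness":     ["Yes", "No", "No", "No", "Yes", "Yes", "Yes", "No", "Yes", "Yes"],
    "PracticalKnowledge":  ["Very good", "Good", "Average", "Average", "Good",
                            "Good", "Good", "Very good", "Good", "Average"],
    "CommunicationSkills": ["Good", "Moderate", "Poor", "Good", "Moderate",
                            "Moderate", "Poor", "Good", "Good", "Good"],
    "JobOffer":            ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"],
})

X = pd.get_dummies(data.drop(columns="JobOffer"))   # one-hot encode the categorical attributes
y = data["JobOffer"]
tree = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(tree, feature_names=list(X.columns)))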
Advantages
1. Simple
2. Easy to implement
Disadvantages
1. It is sensitive to the initialization process, as a change of the initial points leads to different clusters.
2. If the number of samples is large, then the algorithm takes a lot of time.
How to Choose the Value of k?
It is obvious that k is the user-specified value giving the number of clusters that are present. Obviously, there are no standard rules available to pick the value of k. Normally, the k-means algorithm is run with multiple values of k, and the within-group variance (sum of squared distances of the samples from their centroid) is plotted as a line graph. This plot is called the Elbow curve. The optimal or best value of k can be determined from the graph: it is identified at the 'elbow', the point where the curve bends and starts to become flat or horizontal.
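A sketch of the Elbow approach with scikit-learn is shown below; the toy data and the range of k values are assumptions, and the inertia_ attribute of KMeans is used as the within-group sum of squares.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Toy data (assumed): three blobs, so the elbow should appear near k = 3.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ([0, 0], [5, 5], [9, 0])])

ks = range(1, 9)
wss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(list(ks), wss, marker="o")      # line graph of within-group variance vs k
plt.xlabel("k (number of clusters)")
plt.ylabel("Within-group sum of squares")
plt.title("Elbow curve")
plt.show()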
Complexity
The complexity of k-means algorithm is dependent on the parameters like n, the number of
samples, k, the number of clusters, O(nkd). I is the number of iterations and d is the number of
attributes. The complexity of k-means algorithm is O().
hence it exhibits the same initial conditions every time the model is run and is likely to get a single possible outcome as the solution.
Bayesian learning differs from probabilistic learning as it uses subjective probabilities (i.e., probability that is based on an individual's belief or interpretation about the outcome of an event, and it can change over time) to infer the parameters of a model. Two practical learning algorithms called Naive Bayes learning and Bayesian Belief Network (BBN) form the major part of Bayesian learning. These algorithms use prior probabilities and apply Bayes rule to infer useful information. Bayesian Belief Networks (BBN) are explained in detail in Chapter 9.
Posterior Probability
It is the updated or revised probability of an event taking into account the observations from the training data. P(Hypothesis | Evidence) is the posterior distribution representing the belief about the hypothesis, given the evidence from the training data. Therefore,
Posterior probability = prior probability + new evidence
P(Hypothesis h | Evidence E) is calculated from the prior probability P(Hypothesis h), the likelihood probability P(Evidence E | Hypothesis h) and the marginal probability P(Evidence E). It can be written as:

P(Hypothesis h | Evidence E) = P(Evidence E | Hypothesis h) × P(Hypothesis h) / P(Evidence E)    (8.1)

where Hypothesis h is the target class to be classified and Evidence E is the given test instance.
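As a small numeric illustration of Eq. (8.1), the posterior can be computed directly from the prior, the likelihood and the marginal probability; the numbers below are invented for this sketch and are not taken from the text.

# Hypothetical numbers chosen only to illustrate Bayes rule, Eq. (8.1).
p_h = 0.3                      # P(Hypothesis h): prior
p_e_given_h = 0.8              # P(Evidence E | Hypothesis h): likelihood
p_e_given_not_h = 0.2          # likelihood under the alternative hypothesis
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)   # P(Evidence E): marginal

p_h_given_e = p_e_given_h * p_h / p_e   # posterior P(Hypothesis h | Evidence E)
print(round(p_h_given_e, 4))            # 0.24 / 0.38 = 0.6316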
Here, P(Hypothesis h) is the prior probability of the hypothesis h without observing the training data or considering any evidence. It denotes the prior belief or the initial probability that the hypothesis h is correct. P(Evidence E) is the prior probability of the evidence E from the training dataset without any knowledge of which hypothesis holds. It is also called the marginal probability.

P(Evidence E | Hypothesis h) is the prior probability of Evidence E given Hypothesis h. It is the likelihood probability of the Evidence E after observing the training data, given that the hypothesis h is correct. P(Hypothesis h | Evidence E) is the posterior probability of the hypothesis h after observing the training data, given that the evidence E is correct.
In other words, from the equation of Bayes rule, Eq. (8.1), one can observe that:

Posterior Probability ∝ Prior Probability × Likelihood Probability

Bayes theorem helps in calculating the posterior probability for a number of hypotheses, from which the hypothesis with the highest probability can be selected.
This selection of the most probable hypothesis from a set of hypotheses is formally defined as the Maximum A Posteriori (MAP) Hypothesis.
Maximum A Posteriori (MAP) Hypothesis, h_MAP
Given a set of candidate hypotheses, the hypothesis which has the maximum value is considered as the maximum probable hypothesis or most probable hypothesis. This most probable hypothesis is called the Maximum A Posteriori (MAP) Hypothesis, h_MAP. Bayes theorem Eq. (8.1) can be used to find h_MAP:

h_MAP = max_{h ∈ H} P(Hypothesis h | Evidence E)
      = max_{h ∈ H} [P(Evidence E | Hypothesis h) × P(Hypothesis h) / P(Evidence E)]
      = max_{h ∈ H} P(Evidence E | Hypothesis h) × P(Hypothesis h)    (8.2)
Maximum Likelihood (ML) Hypothesis, h_ML
Given a set of candidate hypotheses, if every hypothesis is equally probable, only P(E | h) is used to find the most probable hypothesis. The hypothesis that gives the maximum likelihood for P(E | h) is called the Maximum Likelihood (ML) Hypothesis, h_ML:

h_ML = max_{h ∈ H} P(Evidence E | Hypothesis h)    (8.3)
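The following sketch contrasts the MAP and ML hypotheses of Eqs. (8.2) and (8.3) over a small set of candidate hypotheses; the priors and likelihoods are assumed values used only for illustration.

# Candidate hypotheses with assumed priors P(h) and likelihoods P(E | h).
hypotheses = {
    "h1": {"prior": 0.6, "likelihood": 0.3},
    "h2": {"prior": 0.3, "likelihood": 0.7},
    "h3": {"prior": 0.1, "likelihood": 0.9},
}

# MAP hypothesis, Eq. (8.2): maximise P(E | h) * P(h); the marginal P(E) is a common factor.
h_map = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"] * hypotheses[h]["prior"])

# ML hypothesis, Eq. (8.3): treat all hypotheses as equally probable and maximise P(E | h) alone.
h_ml = max(hypotheses, key=lambda h: hypotheses[h]["likelihood"])

print(h_map, h_ml)   # h2 (0.21 > 0.18 > 0.09) and h3 respectively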
[Figure: A biological neuron. Input is received through the dendrites and synapse, processed in the cell body, and passed along the axon as output.]