AI/ML Important Questions

The document discusses different types of artificial neural networks including feed forward neural networks, fully connected neural networks, multi-layer perceptrons, and feedback neural networks. It describes the structure and information flow of each type of neural network. The document also covers the advantages and disadvantages of using artificial neural networks.


where T is the training dataset, and O_d and Ô_d are the desired target output and the estimated actual output, respectively, for a training instance d.


The principle of gradient descent is an optimization approach that is used to minimize the cost function by converging to a local minimum point, moving in the negative direction of the gradient; each step size during the movement is determined by the learning rate and the slope of the gradient.
Gradient descent learning is the foundation of the back propagation algorithm used in MLPs. Before we study an MLP, let us first understand the different types of neural networks that differ in their structure, activation function and learning mechanism.
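As a minimal illustrative sketch (not from the text), the gradient descent update rule can be written in Python as follows; the cost function, its gradient and the learning rate used here are assumptions made purely for this example.

```python
import numpy as np

def gradient_descent(gradient_fn, w_init, learning_rate=0.1, n_steps=100):
    """Move in the negative direction of the gradient; step size = learning rate * slope."""
    w = np.array(w_init, dtype=float)
    for _ in range(n_steps):
        grad = gradient_fn(w)            # slope of the cost surface at the current point
        w = w - learning_rate * grad     # step towards a local minimum
    return w

# Example cost J(w) = (w - 3)^2 with gradient dJ/dw = 2 * (w - 3); the minimum is at w = 3.
w_min = gradient_descent(lambda w: 2 * (w - 3), w_init=[0.0])
print(w_min)   # converges close to [3.]
```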

10.5 TYPES OF ARTIFICIAL NEURAL NETWORKS


ANNs consist of multiple neurons arranged in layers. There are different types of ANNs that differ by the network structure, the activation function involved and the learning rules used. In an ANN, there are three kinds of layers called the input layer, hidden layer and output layer. Any general ANN would consist of one input layer, one output layer and zero or more hidden layers.


10.5.1 Feed Forward Neural Network


This is the simplest neural network, consisting of neurons arranged in layers, where information is propagated only in the forward direction. This model may or may not contain a hidden layer, and there is no back propagation. Based on the number of hidden layers, they are further classified into single-layered and multi-layered feed forward networks. These ANNs are simple to design and easy to maintain. They are fast but cannot be used for complex learning. They are used for simple classification, simple image processing, etc. The model of a Feed Forward Neural Network is shown in Figure 10.7.

Figure 10.7: Model of a Feed Forward Neural Network

10.5.2 Fully Connected Neural Network


Fully connected neural networks are the ones in which all the neurons in a layer are connected
to all other neurons in the next layer. The model of a fully connected neural network is shown in
Figure 10.8.

Figure 10.8: Model of a Fully Connected Neural Network

10.5.3 Multi-Layer Perceptron (MLP)


This ANN consists of multiple layers, with one input layer, one output layer and one or more hidden layers. Every neuron in a layer is connected to all neurons in the next layer, and thus they are fully connected. The information flows in both directions. In the forward direction, the inputs are multiplied by the weights of the neurons and forwarded to the activation function of the neuron, and the output is passed to the next layer. If the output is incorrect, then in the backward direction the error is back propagated to adjust the weights and biases to get the correct output. Thus, the network learns with the training data. This type of ANN is used in deep learning for complex classification, speech recognition, medical diagnosis, forecasting, etc. They are comparatively complex and slow. The model of an MLP is shown in Figure 10.9.

Figure 10.9: Model of a Multi-Layer Perceptron
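To make the forward flow of an MLP concrete, the following is a small illustrative sketch; the layer sizes, random weights and the sigmoid activation are assumptions for this example, not values from the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
x = rng.random(3)                      # input layer with 3 features

W_hidden = rng.random((4, 3))          # fully connected: 4 hidden neurons, 3 inputs each
b_hidden = rng.random(4)
W_output = rng.random((1, 4))          # 1 output neuron connected to all 4 hidden neurons
b_output = rng.random(1)

# Forward direction: weighted sums followed by the activation function
hidden = sigmoid(W_hidden @ x + b_hidden)
output = sigmoid(W_output @ hidden + b_output)
print(output)
```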

10.5.4 Feedback Neural Network


Feedback neural networks have feedback connections between neurons that allow information
flow in both directions in the network. The output signals can be sent back to the neurons in the
same layer or to the neurons in the preceding layers. Hence, this network is more dynamic during
training. The model of a feedback neural network is shown in Figure 10.10.

Figure 10.10: Model of a Feedback Neural Network
10.10 ADVANTAGES AND DISADVANTAGES OF ANN

Advantages of ANN
1. ANNs can solve complex problems involving non-linear processes.
2. ANNs can learn and recognize complex patterns and solve problems as humans solve a problem.
3. ANNs have a parallel processing capability and can predict in less time.
4. They have an ability to work with inadequate knowledge and can even handle incomplete and noisy data.
5. They can scale well to larger datasets and outperform other learning mechanisms.

Limitations of ANN
1. An ANN requires processors with parallel processing capability to train the network for many epochs. The function of each node requires CPU capability, which is difficult for very large networks with a large amount of data.
2. They work like a 'black box' and it is exceedingly difficult to understand their working in the inner layers. Moreover, it is hard to understand the relationship between the representations learned at each layer.
3. Modelling with ANNs is also extremely complicated and the development takes a much longer time.
4. Generally, neural networks require more data than traditional machine learning algorithms and they do not perform well on small datasets.
5. They are also more computationally expensive than traditional learning techniques.
Challenges of Clustering Algorithms
A huge collection of data with higher dimensions (i.e., features or attributes) can pose a problem for clustering algorithms. With the arrival of the Internet, billions of data items are available for clustering algorithms. This is a difficult task, as scaling is always an issue with clustering algorithms. Scaling is an issue where some algorithms work well with lower dimension data but do not perform well for higher dimension data. Also, the units of data can pose a problem; for example, some weights given in kilograms and some in pounds can pose a problem in clustering. Designing a proximity measure is also a big challenge.
The advantages and disadvantages of the cluster analysis algorithms are given in Table 13.2.
Table 13.2: Advantages and Disadvantages of Clustering Algorithms

S.No. | Advantages | Disadvantages
1. | Cluster analysis algorithms can handle missing data and outliers. | Cluster analysis algorithms are sensitive to initialization and the order of the input data.
2. | Can help classifiers in labelling the unlabelled data. Semi-supervised algorithms use cluster analysis algorithms to label the unlabelled data and then use classifiers to classify them. | Often, the number of clusters present in the data has to be specified by the user.
3. | It is easy to explain the cluster analysis algorithms and to implement them. | Scaling is a problem.
4. | Clustering is the oldest technique in statistics and it is easy to explain. It is also relatively easy to implement. | Designing a proximity measure for the given data is an issue.
Step 4: Repeat Steps 2-3 till the change is minimal within the threshold value or the parameters do not change at all.

13.8 CLUSTER EVALUATION METHODS

Scan for information on 'Purity', 'Evaluation based on Ground Truth', and 'Similarity-based Measures'.

Evaluating clustering algorithms is a difficult task, as often no benchmark data is available as in classification. Also, in clustering algorithms, domain knowledge is absent most of the time. So, the validation of clustering algorithms is difficult as compared to the validation of classification algorithms. There are three types of measures that can be used for cluster validation:
1. Internal
2. External
3. Relative
Internal metrics quantify the quality of clustering without the use of any external information or knowledge. External metrics use the ground truth or externally supplied labels to quantify the quality of the validation. In the relative measure, different cluster algorithms are compared, or the algorithm is run with multiple parameter values. This measure helps in finding optimal clusters.
Basically, two measures of information, that is, cohesion and separation, are based on the idea that the objects within a cluster should be similar and objects across clusters should be distinct. Alternatively, the average distance within the cluster should be small and the average distance across the clusters should be large.

Cohesion and Separation


Cohesion (or compactness) measures how close the samples are inside the cluster. This ensures that the clusters are homogeneous. Cohesion is measured as the sum of squared errors between the samples and the centroid. The within-cluster sum is given as:

$\text{Cohesion} = \sum_{i=1}^{N} \sum_{m_j \in C_i} (m_j - x_i)^2$    (13.17)

Here, N is the number of clusters, $C_i$ is the i-th cluster, $x_i$ is its centroid and $m_j$ are the samples. A lower within-cluster variation is a necessary condition for greater compactness and high cohesion.


Separation indicates how well a sample differs from other clusters. This is measured as the weighted sum of the differences between the centroid of the dataset and the centroids of the generated clusters. This is given as:

$\text{Separation} = \sum_{i=1}^{N} |C_i| \,(x - x_i)^2$    (13.18)

Here, x is the centroid of the entire dataset, $x_i$ is the centroid of cluster i and $|C_i|$ is the size of the cluster. A larger distance is required for well-separated clusters so that the clusters are perfectly distinct. Sometimes, the connectivity between a sample and the other members of its cluster may be important, indicating the sort of samples that can be put into the clusters. The connectivity value ranges from 0 to infinity. The Dunn index can be computed as:

$\text{Dunn Index} = \dfrac{\alpha \times \text{separation}}{\beta \times \text{compactness}}$    (13.19)

Here, α and β are parameters. The Dunn index is a useful measure that can combine both cohesion and separation.

Silhouette Coefficient
The Silhouette coefficient combines both cohesion and separation. The Silhouette coefficient measures the average distance between clusters. It is given as follows:

$s_i = \dfrac{b_i - a_i}{\max(b_i, a_i)}$    (13.20)

Here, $a_i$ is the distance between the sample and the centroid of the same cluster, and $b_i$ is the distance between the sample and the nearest centroid. The silhouette coefficients of the individual objects can be combined to get the coefficient for the entire clustering as S, given as:

$S = \dfrac{1}{N} \sum_{i=1}^{N} s_i$    (13.21)

The value of the silhouette coefficient $s_i$ is between -1 and +1. When it is closer to 1, the clusters are well formed. The value is zero when the data points are between two clusters, and negative when the clusters are not formed correctly.
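For illustration only, the silhouette coefficient of Eqs. (13.20)-(13.21) can be computed with scikit-learn; the points and cluster labels below are made up for the sketch.

```python
import numpy as np
from sklearn.metrics import silhouette_samples, silhouette_score

# Two toy clusters (made-up points purely for illustration)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
labels = np.array([0, 0, 0, 1, 1, 1])

s_i = silhouette_samples(X, labels)   # per-sample coefficients, as in Eq. (13.20)
S = silhouette_score(X, labels)       # average over all samples, as in Eq. (13.21)
print(s_i, S)                         # values close to +1 indicate well-formed clusters
```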

Summary
1. Clustering is a technique of partitioning the objects with many attributes into meaningful disjoint
subgroups.
2. Quantitative variables use distance measures such as Euclidean distance, Manhattan distance and ...
Example 6.3: Assess a student's performance during his course of study and predict whether a student will get a job offer or not in his final year of the course. The training dataset T consists of 10 data instances with attributes such as 'CGPA', 'Interactiveness', 'Practical Knowledge' and 'Communication Skills' as shown in Table 6.3. The target class attribute is 'Job Offer'.

Table 6.3: Training Dataset T

S.No. | CGPA | Interactiveness | Practical Knowledge | Communication Skills | Job Offer
1. | ≥9 | Yes | Very good | Good | Yes
2. | ≥8 | No | Good | Moderate | Yes
3. | ≥9 | No | Average | Poor | No
4. | <8 | No | Average | Good | No
5. | ≥8 | Yes | Good | Moderate | Yes
6. | ≥9 | Yes | Good | Moderate | Yes
7. | <8 | Yes | Good | Poor | No
8. | ≥9 | No | Very good | Good | Yes
9. | ≥8 | Yes | Good | Good | Yes
10. | ≥8 | Yes | Average | Good | Yes

Solution:
Iteration 1:
Step 1: Calculate the Entropy for the target class 'Job Offer'.
Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3)
= -[(7/10) log₂(7/10) + (3/10) log₂(3/10)] = -(-0.3599 - 0.5208) = 0.8807
Step 2: Calculate the Entropy_Info and Gain (Information_Gain) for each of the attributes in the training dataset.
Table 6.4 shows the number of data instances classified with Job Offer as Yes or No for the attribute CGPA.

Table 6.4: Entropy Information for CGPA
CGPA | Job Offer = Yes | Job Offer = No | Total | Entropy
≥9 | 3 | 1 | 4 | 0.8108
≥8 | 4 | 0 | 4 | 0
<8 | 0 | 2 | 2 | 0
Entropy_Info(T, CGPA)
= (4/10)[-(3/4) log₂(3/4) - (1/4) log₂(1/4)] + (4/10)[-(4/4) log₂(4/4)] + (2/10)[-(2/2) log₂(2/2)]
= (4/10)(0.3111 + 0.4997) + 0 + 0
= 0.3243
Gain(CGPA) = 0.8807 - 0.3243 = 0.5564
Table 6.5 shows the number of data instances classified with Job Offer as Yes or No for the attribute Interactiveness.

Table 6.5: Entropy Information for Interactiveness
Interactiveness | Job Offer = Yes | Job Offer = No | Total | Entropy
Yes | 5 | 1 | 6 | 0.6497
No | 2 | 2 | 4 | 0.9994

Entropy_Info(T, Interactiveness)
= (6/10)[-(5/6) log₂(5/6) - (1/6) log₂(1/6)] + (4/10)[-(2/4) log₂(2/4) - (2/4) log₂(2/4)]
= (6/10)(0.2191 + 0.4306) + (4/10)(0.4997 + 0.4997)
= 0.3898 + 0.3998 = 0.7896
Gain(Interactiveness) = 0.8807 - 0.7896 = 0.0911
Table 6.6 shows the number of data instances classified with Job Offer as Yes or No for the attribute Practical Knowledge.

Table 6.6: Entropy Information for Practical Knowledge
Practical Knowledge | Job Offer = Yes | Job Offer = No | Total | Entropy
Very good | 2 | 0 | 2 | 0
Average | 1 | 2 | 3 | 0.9177
Good | 4 | 1 | 5 | 0.7215

Entropy_Info(T, Practical Knowledge)
= (2/10)[-(2/2) log₂(2/2)] + (3/10)[-(1/3) log₂(1/3) - (2/3) log₂(2/3)] + (5/10)[-(4/5) log₂(4/5) - (1/5) log₂(1/5)]
= 0 + (3/10)(0.5280 + 0.3897) + (5/10)(0.2574 + 0.4641)
= 0 + 0.2753 + 0.3608
= 0.6361
Gain(Practical Knowledge) = 0.8807 - 0.6361 = 0.2446
Table 6.7 shows the number of data instances classified with Job Offer as Yes or No for the attribute Communication Skills.

Table 6.7: Entropy Information for Communication Skills
Communication Skills | Job Offer = Yes | Job Offer = No | Total | Entropy
Good | 4 | 1 | 5 | 0.7215
Moderate | 3 | 0 | 3 | 0
Poor | 0 | 2 | 2 | 0

Entropy_Info(T, Communication Skills)
= (5/10)[-(4/5) log₂(4/5) - (1/5) log₂(1/5)] + (3/10)[-(3/3) log₂(3/3)] + (2/10)[-(2/2) log₂(2/2)]
= (5/10)(0.2574 + 0.4641) + 0 + 0
= 0.3609
Gain(Communication Skills) = 0.8807 - 0.3609 = 0.5198
The Gain calculated for all the attributes is shown in Table 6.8:

Table 6.8: Gain
Attributes | Gain
CGPA | 0.5564
Interactiveness | 0.0911
Practical Knowledge | 0.2446
Communication Skills | 0.5198
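The entropy and gain computations above can be reproduced with a short script. The sketch below is written for this illustration (it is not part of the original solution) and recomputes Entropy_Info and Gain for the CGPA attribute of Table 6.3; it keeps full precision, so the results differ slightly from the text's rounded intermediate terms.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy_Info of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(rows, attr_index, labels):
    """Gain = Entropy_Info(T) - weighted entropy of the subsets split on one attribute."""
    total = len(labels)
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attr_index], []).append(label)
    weighted = sum(len(subset) / total * entropy(subset) for subset in subsets.values())
    return entropy(labels) - weighted

# CGPA column of Table 6.3 and the Job Offer target class
cgpa = [[">=9"], [">=8"], [">=9"], ["<8"], [">=8"], [">=9"], ["<8"], [">=9"], [">=8"], [">=8"]]
job  = ["Yes", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "Yes", "Yes"]

print(round(entropy(job), 4))             # ~0.8813 (the text rounds terms and reports 0.8807)
print(round(info_gain(cgpa, 0, job), 4))  # ~0.5568 (the text reports 0.5564)
```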
Step 3: From Table 6.8, choose the attribute for which the entropy is minimum, and therefore the gain is maximum, as the best split attribute.
The best split attribute is CGPA since it has the maximum gain. So, we choose CGPA as the root node. There are three distinct values for CGPA with outcomes ≥9, ≥8 and <8. The entropy value is 0 for ≥8 and <8, with all instances classified as Job Offer = Yes for ≥8 and Job Offer = No for <8. Hence, both ≥8 and <8 end up in leaf nodes. The tree grows with the subset of instances with CGPA ≥9, as shown in Figure 6.3.
Figure 6.3: Decision Tree After Iteration 1
Now, continue the same process for the subset of data instances branched with CGPA ≥9.

Iteration 2:
In this iteration, the same process of computing the Entropy_Info and Gain is repeated for the subset of the training set. The subset consists of 4 data instances, as shown in Figure 6.3.
Entropy_Info(T) = Entropy_Info(3, 1) = -[(3/4) log₂(3/4) + (1/4) log₂(1/4)] = -(-0.3111 - 0.4997) = 0.8108

Entropy_Info(T, Interactiveness) = (2/4)(0) + (2/4)[-(1/2) log₂(1/2) - (1/2) log₂(1/2)]
= 0 + 0.4997 = 0.4997
Gain(Interactiveness) = 0.8108 - 0.4997 = 0.3111

Entropy_Info(T, Practical Knowledge) = 0
Gain(Practical Knowledge) = 0.8108 - 0 = 0.8108

Entropy_Info(T, Communication Skills) = 0
Gain(Communication Skills) = 0.8108 - 0 = 0.8108
The Gain calculated for all the attributes is shown in Table 6.9.

Table 6.9: Gain
Attributes | Gain
Interactiveness | 0.3111
Practical Knowledge | 0.8108
Communication Skills | 0.8108

Here, both the attributes 'Practical Knowledge' and 'Communication Skills' have the same Gain. So, we can construct the decision tree using either 'Practical Knowledge' or 'Communication Skills'. The final decision tree is shown in Figure 6.4.
Figure 6.4: Final Decision Tree

6.2.2 C4.5 Construction


C4.5 is an improvement over ID3. C4.5 works with continuous and discrete attributes and missing values, and it also supports post-pruning. C5.0 is the successor of C4.5; it is more efficient and is used for building smaller decision trees. C4.5 handles missing values by marking them as '?', and these missing attribute values are not considered in the calculations.
Example 6.4: Make use of the Information Gain of the attributes, which was calculated in the ID3 algorithm in Example 6.3, to construct a decision tree using C4.5.
Solution:
Iteration 1:
Step 1: Calculate the Entropy for the target class 'Job Offer'.
Entropy_Info(Target Attribute = Job Offer) = Entropy_Info(7, 3)
= -[(7/10) log₂(7/10) + (3/10) log₂(3/10)] = -(-0.3599 - 0.5208) = 0.8807
Step 2: Calculate the Entropy_Info, Gain (Info_Gain), Split_Info and Gain_Ratio for each of the attributes in the training dataset.
CGPA:
Entropy_Info(T, CGPA) = (4/10)[-(3/4) log₂(3/4) - (1/4) log₂(1/4)] + (4/10)[-(4/4) log₂(4/4)] + (2/10)[-(2/2) log₂(2/2)]
= (4/10)(0.3111 + 0.4997) + 0 + 0
= 0.3243
Gain(CGPA) = 0.8807 - 0.3243 = 0.5564
Split_Info(T, CGPA) = -(4/10) log₂(4/10) - (4/10) log₂(4/10) - (2/10) log₂(2/10)
= 0.5285 + 0.5285 + 0.4641 = 1.5211
Gain_Ratio(CGPA) = Gain(CGPA) / Split_Info(T, CGPA) = 0.5564 / 1.5211 = 0.3658
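A small illustrative sketch (not the textbook's own code) of the C4.5-style Split_Info and Gain_Ratio computation for CGPA; the gain value is taken from the worked example above, and exact values differ slightly from the text's rounding.

```python
import math
from collections import Counter

def split_info(values):
    """Split_Info of an attribute: the entropy of the split proportions themselves."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

# CGPA takes the value >=9 four times, >=8 four times and <8 twice (Table 6.3)
cgpa = [">=9"] * 4 + [">=8"] * 4 + ["<8"] * 2

si = split_info(cgpa)              # ~1.5219 (the text reports 1.5211)
gain_cgpa = 0.5564                 # Gain(CGPA) from the worked example
print(round(gain_cgpa / si, 4))    # Gain_Ratio(CGPA) ~ 0.3656 (the text reports 0.3658)
```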
Interactiveness:
Entropy_Info(T, Interactiveness) = (6/10)[-(5/6) log₂(5/6) - (1/6) log₂(1/6)] + (4/10)[-(2/4) log₂(2/4) - (2/4) log₂(2/4)]
= (6/10)(0.2191 + 0.4306) + (4/10)(0.4997 + 0.4997)
= 0.3898 + 0.3998 = 0.7896
Gain(Interactiveness) = 0.8807 - 0.7896 = 0.0911
Split_Info(T, Interactiveness) = -(6/10) log₂(6/10) - (4/10) log₂(4/10) = 0.9704
Gain_Ratio(Interactiveness) = Gain(Interactiveness) / Split_Info(T, Interactiveness) = 0.0911 / 0.9704 = 0.0939
Practical Knowledge:
Entropy_Info(T, Practical Knowledge) = (2/10)[-(2/2) log₂(2/2)] + (3/10)[-(1/3) log₂(1/3) - (2/3) log₂(2/3)] + (5/10)[-(4/5) log₂(4/5) - (1/5) log₂(1/5)]
= 0 + (3/10)(0.5280 + 0.3897) + (5/10)(0.2574 + 0.4641)
= 0 + 0.2753 + 0.3608 = 0.6361
Gain(Practical Knowledge) = 0.8807 - 0.6361 = 0.2446
Split_Info(T, Practical Knowledge) = -(2/10) log₂(2/10) - (3/10) log₂(3/10) - (5/10) log₂(5/10) = 1.4853
Gain_Ratio(Practical Knowledge) = Gain(Practical Knowledge) / Split_Info(T, Practical Knowledge) = 0.2446 / 1.4853 = 0.1647

Communication Skills:
Entropy_Info(T, Communication Skills) = (5/10)[-(4/5) log₂(4/5) - (1/5) log₂(1/5)] + (3/10)[-(3/3) log₂(3/3)] + (2/10)[-(2/2) log₂(2/2)]
= (5/10)(0.2574 + 0.4641) + 0 + 0
= 0.3609
Gain(Communication Skills) = 0.8807 - 0.3609 = 0.5198
Split_Info(T, Communication Skills) = -(5/10) log₂(5/10) - (3/10) log₂(3/10) - (2/10) log₂(2/10) = 1.4853
Gain_Ratio(Communication Skills) = Gain(Communication Skills) / Split_Info(T, Communication Skills) = 0.5198 / 1.4853 = 0.3500
Table 6.10 shows the Gain_Ratio computed for all the attributes.

Table 6.10: Gain_Ratio
Attribute | Gain_Ratio
CGPA | 0.3658
Interactiveness | 0.0939
Practical Knowledge | 0.1647
Communication Skills | 0.3500

Step 3: Choose the attribute for which the Gain_Ratio is maximum as the best split attribute.
From Table 6.10, we can see that CGPA has the highest gain ratio and it is selected as the best split attribute. We can construct the decision tree placing CGPA as the root node, as shown in Figure 6.5. The training dataset is split into subsets, with 4 data instances in the CGPA ≥9 branch.
Figure 6.5: Decision Tree after Iteration 1


Iteration 2:
Total Samples: 4
Repeat the same process for this resultant dataset with 4 data instances.
Job Offer has 3 instances as Yes and 1 instance as No.

Entropy Info(Target Class =Job Offer) =-log,-lo


4
-0.3112 +0.5
-0.8112
Interactiveness:
Entropy_lnto(T, Interactiveness) 02 1
2 2 2
-0+0.4997
Gain(lnteractiveness) =0.8108 -0.4997 0.3111

2 =0.5 +0.5 =1
Split_Info(T, Interactiveness) =-log, -1
Gain(Interactiveness)
Gain_Ratio(Interactiveness) =
SplitInfo(T, Interactiveness)
0.3112
0.3112
1

Practical Knowledge:
Entropy Into(T, Pracical Knowiedge)
2
8,-lo8,o,-.

=0
Gain(Practical Knowledge) =0.8108

Split Info(T, Pracical Knowledge) =-log,-log, -log, ; =1.5


Gain(Practical Knowledge) 0.8108
=0.5408
Gain_ Ratio(Practical Knowledge) Split_Info(T, Practical Knowledge) 15

Communication Skills:
Entropy_Info(T, Communication Skills) = (2/4)[-(2/2) log₂(2/2)] + (1/4)[-(1/1) log₂(1/1)] + (1/4)[-(1/1) log₂(1/1)] = 0
Gain(Communication Skills) = 0.8112 - 0 = 0.8112
Split_Info(T, Communication Skills) = -(2/4) log₂(2/4) - (1/4) log₂(1/4) - (1/4) log₂(1/4) = 1.5
Gain_Ratio(Communication Skills) = Gain(Communication Skills) / Split_Info(T, Communication Skills) = 0.8112 / 1.5 = 0.5408

Table 6.11 shows the Gain_Ratio computed for all the attributes.

Table 6.11: Gain_Ratio
Attributes | Gain_Ratio
Interactiveness | 0.3112
Practical Knowledge | 0.5408
Communication Skills | 0.5408

Both 'Practical Knowledge' and 'Communication Skills' have the highest gain ratio. So, the best splitting attribute can either be 'Practical Knowledge' or 'Communication Skills', and therefore, the split can be based on any one of these. Here, we split based on 'Practical Knowledge'. The final decision tree is shown in Figure 6.6.


Figure 6.6: Final Decision Tree


Algorithm 13.3: k-means Algorithm
Step 1: Determine the number of clusters before the algorithm is started. This is called k.
Step 2: Choose k instances randomly. These are initial cluster centers.
Step 3: Compute the mean of the initial clusters and assign the remaining samples to the closest cluster based on Euclidean distance or any other distance measure between the instances and the centroids of the clusters.
Step 4: Compute the new centroids again, considering the newly added samples.
Step 5: Repeat Steps 3-4 till the algorithm becomes stable with no more changes in the assignment of instances to clusters.

k-means can also be viewed as a greedy algorithm, as it involves partitioning n samples into k clusters so as to minimize the Sum of Squared Errors (SSE). SSE is a metric that is a measure of error, giving the sum of the squared Euclidean distances of each data point to its closest centroid. It is given as:

$\text{SSE} = \sum_{i=1}^{k} \sum_{x \in C_i} \text{dist}(c_i, x)^2$    (13.14)

Here, $c_i$ is the centroid of the cluster, x is the sample or data point and dist is the Euclidean distance. The aim of the k-means algorithm is to minimize the SSE.
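For illustration, k-means and its SSE objective can be run with scikit-learn as sketched below; the data points and the choice k = 2 are assumptions for the example, and KMeans exposes the SSE of Eq. (13.14) in its inertia_ attribute.

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up 2-D samples forming two rough groups
X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.3, 7.7], [7.9, 8.4]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(kmeans.cluster_centers_)   # the k centroids
print(kmeans.labels_)            # cluster assignment of each sample
print(kmeans.inertia_)           # SSE: sum of squared distances to the closest centroid
```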


Advantages
1. Simple
2. Easy to implement
Disadvantages
1. It is sensitive to the initialization process, as a change of the initial points leads to different clusters.
2. If the samples are large, then the algorithm takes a lot of time.
How to Choose the Value of k?
It is obvious that kis the user specified value specifying the number of clusters that are present.
Obviously, there are no standard rules availalble to pick the value of k. Normally, the k-means
algorithm is run with multiple values of kand within group variance (sum of squares of samples
with its cerntroid) and plotted as a line graph. This plot is called Elbow curve. The optimal or best
value of kcan be determined from the graph. The optimal value of k is identified by the flat or
horizontal part of the Elbow curve.
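A small sketch of how the Elbow curve is typically produced (the data and the range of k values are assumptions for the example):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.random.default_rng(0).random((60, 2))   # made-up 2-D samples

sse = []
k_values = range(1, 8)
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(model.inertia_)                 # within-group sum of squares for this k

plt.plot(list(k_values), sse, marker="o")      # the bend ("elbow") suggests a good k
plt.xlabel("k")
plt.ylabel("SSE")
plt.show()
```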

Complexity
The complexity of k-means algorithm is dependent on the parameters like n, the number of
samples, k, the number of clusters, O(nkd). I is the number of iterations and d is the number of
attributes. The complexity of k-means algorithm is O().




... hence it exhibits the same initial conditions every time the model is run and is likely to get a single possible outcome as the solution.
Bayesian learning differs from probabilistic learning as it uses subjective probabilities (i.e., probability that is based on an individual's belief or interpretation about the outcome of an event, and it can change over time) to infer the parameters of a model. Two practical learning algorithms called Naïve Bayes learning and Bayesian Belief Network (BBN) form the major part of Bayesian learning. These algorithms use prior probabilities and apply Bayes rule to infer useful information. Bayesian Belief Networks (BBN) are explained in detail in Chapter 9.

Scan for information on 'Probability Theory' and for 'Additional Examples'.

8.2 FUNDAMENTALS OF BAYES THEOREM


The Naïve Bayes model relies on Bayes theorem, which works on the principle of three kinds of probabilities called prior probability, likelihood probability, and posterior probability.

Prior Probability
It is the general probability of an uncertain event before an observation is seen or some evidence is collected. It is the initial probability that is believed before any new information is collected.

Likelihood Probability
Likelihood probability is the relative probability of the observation occurring for each class, or the sampling density for the evidence given the hypothesis. It is stated as P(Evidence | Hypothesis), which denotes the likeliness of the occurrence of the evidence given the parameters.

Posterior Probability
It is the updated or revised probability of an event taking into account the observations from the training data. P(Hypothesis | Evidence) is the posterior distribution representing the belief about the hypothesis, given the evidence from the training data. Therefore,
Posterior probability = Prior probability + New evidence

8.3 CLASSIFICATION USING BAYES MODEL


Naïve Bayes classification models work on the principle of Bayes theorem. Bayes' rule is a mathematical formula used to determine the posterior probability, given the prior probabilities of events. Generally, Bayes theorem is used to select the most probable hypothesis from data, considering both prior knowledge and posterior distributions. It is based on the calculation of the posterior probability and is stated as:
P(Hypothesis h | Evidence E)
where Hypothesis h is the target class to be classified and Evidence E is the given test instance.

P(Hypothesis h | Evidence E) is calculated from the prior probability P(Hypothesis h), the likelihood probability P(Evidence E | Hypothesis h) and the marginal probability P(Evidence E). It can be written as:

P(Hypothesis h | Evidence E) = [P(Evidence E | Hypothesis h) × P(Hypothesis h)] / P(Evidence E)    (8.1)
P (Hypothesis h I the training
probability of the hypothesis h without observing
probability that the
the prior
where, P(Hypothesis h) isevidence. It denotes the prior belief or the initial from the training
E
data or considering any (Evidence E) is the prior probability of the evidencethe marginal proba
hypothesish is correct. P is also called
without any knowledge of which hypothesis holds. It
dataset
bility. prior probability of Evidence E given Hypothesistheh.
Hypothesis h) is the training data that
P (Evidence ETprobability of the Evidence E after observing the of Hypothesis h
likelihood probability
It is the (Hypothesis h I Evidence E) is the posterior training data that the
hypothesis his correct. P probability of the hypothesish after observing the
can observe that:
given Evidence E. It is theother words, by the equation of Bayes Eq. (8.1), one
evidence E is correct. In Probability
Probability xLikelihood from
Posterior Probability a Prior posterior probability for a number of hypotheses,
helps in calculating the
Bayes theorem highest probability can be
selected.
formally defined as
which the hypothesis with the probable hypothesis from a set of hypotheses is
This selection of the most Hypothesis.
Maximum A Posteriori (MAP)
Posteriori (MAP) Hypothesis, haP
Maximum A value is considered as
which has the maximum
hypotheses, the hypothesis
Given a set of candidate probable hypothesis is called
hypothesis or most probable hypothesis. This most can be used to find the h
the maximum probable Hypothesis h, Bayes theorem Eq.
(8.1)
the Maximum APosteriori P(Hypothesishl Evidence E)
, =max,, h)
P(Evidence ElHypothesis h)P(Hypothesis
= max,H P(Evidence E)
(8.2)
lHypothesis h)P(Hypothesis h)
= max,, P(Evidence E

h,M.
Maximum Likelihood (ML) Hypothesis, probable, only P(EIh) is used
candidate hypotheses, if every hypothesis is equallymaximum likelihood for P (E Ih)
Given a set of gives the
hypothesis.The hypothesis that
to find the most probable Likelihood (ML) Hypothesis, h,: (8.3)
is called the Maximum h)
max,, P(Evidence EIHypothesis
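As a minimal numeric sketch (the priors and likelihoods below are invented purely to show how Eqs. (8.1) and (8.2) are applied):

```python
# Two candidate hypotheses with assumed prior and likelihood values
priors      = {"h1": 0.6, "h2": 0.4}           # P(h)
likelihoods = {"h1": 0.2, "h2": 0.7}           # P(E | h)

# Unnormalized posteriors P(E | h) * P(h); P(E) is the same for every h
scores = {h: likelihoods[h] * priors[h] for h in priors}

p_evidence = sum(scores.values())              # marginal probability P(E)
posteriors = {h: s / p_evidence for h, s in scores.items()}   # Eq. (8.1)

h_map = max(scores, key=scores.get)            # MAP hypothesis, Eq. (8.2)
print(posteriors, h_map)
```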

Correctness of Bayes Theorem


Consider two events A and B in a sample space S.
A: T F T T F T T F
B: F T T F T F T F
P(A) = 5/8
P(B) = 4/8
P(A | B) = 2/4
P(B | A) = 2/5
P(A | B) = P(B | A) P(A) / P(B) = (2/5 × 5/8) / (4/8) = 2/4
P(B | A) = P(A | B) P(B) / P(A) = (2/4 × 4/8) / (5/8) = 2/5
Let us now consider a numerical example to illustrate the use of Bayes theorem:
Example 8.1: Consider a boy who has a volleyball tournament on the next day, but today he feels sick. It is unusual that there is only a 40% chance he would fall sick since he is a healthy boy. Now, find the probability of the boy participating in the tournament. The boy is very much interested in ...
Figure 10.4: Artificial Neural Network Structure

10.3.3 Activation Functions


Activation functions are mathematical functions associated with each neuron in the neural network that map input signals to output signals. The activation function decides whether to fire a neuron or not based on the input signals the neuron receives. These functions normalize the output value of each neuron either between 0 and 1 or between -1 and +1. Typical activation functions can be linear or non-linear.
Linear functions are useful when the input values can be classified into any one of the two groups and are generally used in binary perceptrons. Non-linear functions, on the other hand, are continuous functions that map the input in the range of (0, 1) or (-1, 1), etc. These functions are useful in learning high-dimensional data or complex data such as audio, video and images.
Below are some of the activation functions used in ANNs (a short code sketch of a few of them follows the list):
1. Identity Function or Linear Function

$f(x) = x$    (10.4)

The value of f(x) increases linearly or proportionally with the value of x. This function is useful when we do not want to apply any threshold. The output would be just the weighted sum of the input values. The output value ranges between -∞ and +∞.

2. Binary Step Function

$y = \begin{cases} 1 & \text{if } f(x) \geq \theta \\ 0 & \text{if } f(x) < \theta \end{cases}$    (10.5)

The output value is binary, i.e., 0 or 1, based on the threshold value θ. If the value of f(x) is greater than or equal to θ, it outputs 1, or else it outputs 0.

3. Bipolar Step Function

$y = \begin{cases} +1 & \text{if } f(x) \geq \theta \\ -1 & \text{if } f(x) < \theta \end{cases}$    (10.6)

The output value is bipolar, i.e., +1 or -1, based on the threshold value θ. If the value of f(x) is greater than or equal to θ, it outputs +1, or else it outputs -1.

4. Sigmoidal Function or Logistic Function

$\sigma(x) = \dfrac{1}{1 + e^{-x}}$    (10.7)

It is a widely used non-linear activation function which produces an S-shaped curve, and the output values are in the range of 0 and 1. It has a vanishing gradient problem, i.e., no change in the prediction for very low input values and very high input values.

5. Bipolar Sigmoid Function

$\sigma(x) = \dfrac{1 - e^{-x}}{1 + e^{-x}}$    (10.8)

It outputs values between -1 and +1.

6. Ramp Function

$f(x) = \begin{cases} 1 & \text{if } x > 1 \\ x & \text{if } 0 \leq x \leq 1 \\ 0 & \text{if } x < 0 \end{cases}$    (10.9)

It is a linear function whose upper and lower limits are fixed.

7. Tanh - Hyperbolic Tangent Function

The Tanh function is a scaled version of the sigmoid function which is also non-linear. It also suffers from the vanishing gradient problem. The output values range between -1 and 1.

$\tanh(x) = \dfrac{2}{1 + e^{-2x}} - 1$    (10.10)



8. ReLU - Rectified Linear Unit Function

This activation function is typically used in deep learning neural network models in the hidden layers. It avoids or reduces the vanishing gradient problem. This function outputs a value of 0 for negative input values and works like a linear function if the input values are positive.

$f(x) = \max(0, x) = \begin{cases} x & \text{if } x \geq 0 \\ 0 & \text{if } x < 0 \end{cases}$    (10.11)

9. Softmax Function

This is a non-linear function used in the output layer that can handle multiple classes. It calculates the probability of each target class, which ranges between 0 and 1. The probability of the input belonging to a particular class is computed by dividing the exponential of the given input value by the sum of the exponential values of all the inputs.

$s(z_i) = \dfrac{e^{z_i}}{\sum_{j=0}^{k} e^{z_j}}$, where i = 0 ... k    (10.12)
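A few of the listed functions, sketched with NumPy for illustration (the input vector is arbitrary):

```python
import numpy as np

def binary_step(x, theta=0.0):
    return np.where(x >= theta, 1, 0)          # Eq. (10.5)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))            # Eq. (10.7)

def tanh(x):
    return 2.0 / (1.0 + np.exp(-2 * x)) - 1.0  # Eq. (10.10)

def relu(x):
    return np.maximum(0.0, x)                  # Eq. (10.11)

def softmax(z):
    e = np.exp(z - np.max(z))                  # subtract the max for numerical stability
    return e / e.sum()                         # Eq. (10.12); the outputs sum to 1

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(binary_step(x), sigmoid(x), tanh(x), relu(x), softmax(x), sep="\n")
```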
10.2 BIOLOGICAL NEURONS
A typical biological neuron has four parts called dendrites, soma, axon and synapse. The body of the neuron is called the soma. Dendrites accept the input information and pass it for processing to the cell body called the soma. A single neuron is connected by axons to around 10,000 neurons, and through these axons the processed information is passed from one neuron to another neuron. A neuron gets fired if the input information crosses a threshold value, and it transmits signals to another neuron through a synapse. A synapse gets fired with electrical impulses called spikes, which are transmitted to another neuron. A single neuron can receive synaptic inputs from one neuron or multiple neurons. These neurons form a network structure which processes input information and gives out a response. The simple structure of a biological neuron is shown in Figure 10.1.
Figure 10.1: A Biological Neuron


10.3 ARTIFICIAL NEURONS


Artificial neurons are similar to biological neurons and are called nodes. A node or a neuron can receive one or more input information and process it. Artificial neurons or nodes are connected by connection links to one another. Each connection link is associated with a synaptic weight. The structure of a single neuron is shown in Figure 10.2.
Figure 10.2: An Artificial Neuron

10.3.1 Simple Model of an Artificial Neuron


The first mathematical model of a biological neuron was designed by McCulloch and Pitts in 1943. It includes two steps:
1. It receives weighted inputs from other neurons.
2. It operates with a threshold function or activation function.
The received inputs are computed as a weighted sum, which is given to the activation function, and if the sum exceeds the threshold value, the neuron gets fired.