ML Unit-2
UNIT-II
Multi-Layer Perceptron: Going Forwards, Going Backwards, Back Propagation Error, Multi-
layer perceptron in practice, Examples of using the MLP, Deriving Back-propagation.
Radial Basis Functions and Splines: Concepts, RBF Network, Curse of Dimensionality,
Interpolations and Basis Functions, Support Vector Machine
Input: A = 1, B = 0
At neuron C:
1×1 + 0×1 + 1×(-0.5) = 1 + 0 - 0.5 = 0.5 > threshold 0
Neuron C fires, so its output is 1.
At neuron D:
1×1 + 0×1 + 1×(-1) = 1 + 0 - 1 = 0, which is not above the threshold 0
Neuron D does not fire, so its output is 0.
At neuron E:
1×1 + 0×(-1) + 1×(-0.5) = 1 - 0 - 0.5 = 0.5 > threshold 0
Neuron E fires, so its output is 1.
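The same forward pass can be written as a short program. A minimal sketch with simple threshold units, using the weights and bias from the worked example above (the step function and variable names are illustrative):

def step(weighted_sum, threshold=0.0):
    # Fire (output 1) only if the weighted sum exceeds the threshold
    return 1 if weighted_sum > threshold else 0

A, B = 1, 0     # input pattern from the example
bias = 1        # constant bias input

# Hidden layer
C = step(A * 1 + B * 1 + bias * (-0.5))      # 0.5 > 0, so C = 1
D = step(A * 1 + B * 1 + bias * (-1.0))      # 0 is not > 0, so D = 0

# Output layer
E = step(C * 1 + D * (-1) + bias * (-0.5))   # 0.5 > 0, so E = 1

print(C, D, E)   # 1 0 1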
Going Forwards:
   Training the MLP consists of two parts: working out what the outputs are for the given
    inputs and the current weights, and then updating the weights according to the error, which
    is a function of the difference between the outputs and the targets.
   These are generally known as going forwards and backwards through the network.
   Each neuron in the network (whether in a hidden layer or the output layer) has one extra input with a fixed value, called the bias.
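A minimal sketch of the forward pass for a one-hidden-layer MLP with sigmoid activations (the bias is implemented here as an extra input fixed at -1; array names and shapes are illustrative, not taken from the notes):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mlp_forward(inputs, v, w):
    # inputs: (n_samples, n_in); v: input-to-hidden weights; w: hidden-to-output weights
    inputs = np.concatenate((inputs, -np.ones((inputs.shape[0], 1))), axis=1)
    hidden = sigmoid(inputs @ v)     # hidden-layer activations
    hidden = np.concatenate((hidden, -np.ones((hidden.shape[0], 1))), axis=1)
    outputs = sigmoid(hidden @ w)    # network outputs y
    return hidden, outputs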
Going Backwards- Back Propagation of Error:
   Back-propagation of error makes it clear that the errors are sent backwards through the
    network.
   It is a form of gradient descent.
   The problem is that when we try to adapt the weights of the Multi-layer Perceptron, we
    have to work out which weights caused the error.
   This could be the weights connecting the inputs to the hidden layer, or the weights
    connecting the hidden layer to the output layer.
   We use the sum-of-squares error function, which calculates the difference between the output y and the target t for each output node, squares these differences, and adds them all together (a sketch follows this list).
       We need an activation function that looks like a threshold function but is differentiable
        so that we can compute the gradient.
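A minimal sketch of the sum-of-squares error mentioned above (y are the network outputs, t the targets; the names are illustrative):

import numpy as np

def sum_of_squares_error(y, t):
    # E = ½ Σ_k (y_k - t_k)², summed over the output nodes
    return 0.5 * np.sum((y - t) ** 2)

E = sum_of_squares_error(np.array([0.8, 0.1, 0.3]), np.array([1.0, 0.0, 0.0]))
print(E)   # 0.5 * (0.04 + 0.01 + 0.09) = 0.07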
    Activation Functions:
       The activation function basically decides whether a neuron should be activated or not.
       The activation function is a non-linear transformation that we do over the input before
        sending it to the next layer of neurons or finalizing it as output.
    Sigmoid Function:
           The Sigmoid activation function, also known as the logistic activation function,
            takes inputs and turns them into outputs ranging between 0 and 1.
           For this reason, sigmoid is referred to as the “squashing function” and is
            differentiable.
           Larger, more positive inputs should produce output values close to 1.0, with
            smaller, more negative inputs producing outputs closer to 0.0.
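A minimal sketch of the sigmoid (logistic) activation and its derivative; the simple derivative is what makes the gradient easy to compute:

import numpy as np

def sigmoid(x):
    # Squashes any real input into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # d/dx sigmoid(x) = sigmoid(x) * (1 - sigmoid(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(np.array([-5.0, 0.0, 5.0])))   # ≈ [0.0067, 0.5, 0.9933]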
       A local minimum is a point in the parameter space where the loss function is minimized
        in a local neighborhood.
       A global minimum is a point in the parameter space where the loss function is
        minimized globally.
Picking Up Momentum:
       Momentum in neural networks is a parameter optimization technique that accelerates
        gradient descent by adding a fraction of the previous weight update to the current
        weight update.
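A minimal sketch of the momentum update in its standard form (η is the learning rate, α the momentum, e.g. 0.9; the names are illustrative):

def momentum_step(w, grad, prev_update, eta=0.1, alpha=0.9):
    # New update = gradient-descent step plus a fraction of the previous update
    update = -eta * grad + alpha * prev_update
    return w + update, update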
       The training of the MLP requires that the algorithm runs over the entire dataset many
        times, with the weights changing as the network makes errors in each iteration.
       Two options
            o Predefined number of Iterations
            o Predefined minimum error reached
       Using both of these options together can help, as can terminating the learning once the
        error stops decreasing.
       We train the network for some predetermined amount of time, and then use the
        validation set to estimate how well the network is generalising.
       We then carry on training for a few more iterations, and repeat the whole process.
       At some stage the error on the validation set will start increasing again, because the
        network has stopped learning about the function that generated the data, and started to
        learn about the noise that is in the data itself.
       At this stage we stop the training. This technique is called early stopping.
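A minimal sketch of early stopping as described above (the network object, its train and error methods, and the patience value are illustrative assumptions):

def train_with_early_stopping(net, train_set, valid_set, chunk_epochs=10, patience=2):
    # Train in short bursts; stop once the validation error keeps rising
    best_err = float("inf")
    rises = 0
    while rises < patience:
        net.train(train_set, epochs=chunk_epochs)   # a few more iterations
        err = net.error(valid_set)                  # estimate of generalisation
        if err < best_err:
            best_err, rises = err, 0
        else:
            rises += 1                              # validation error increasing
    return net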
Regression:
    The loss functions that can be used in a regression MLP include Mean Squared Error (MSE) and Mean Absolute Error (MAE).
    MSE is suitable for datasets with few outliers, while MAE is a better measure for datasets with many outliers.
    Example: Rainfall prediction, Stock price prediction
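A minimal sketch of the two loss functions (y are predictions, t targets; the numbers are made up for illustration):

import numpy as np

def mse(y, t):
    return np.mean((y - t) ** 2)      # penalises large errors heavily

def mae(y, t):
    return np.mean(np.abs(y - t))     # more robust to outliers

y = np.array([2.5, 0.0, 2.1])
t = np.array([3.0, -0.5, 2.0])
print(mse(y, t), mae(y, t))           # 0.17  0.3666...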
Classification:
        If the output variable is categorical, then we have to use classification for prediction.
Example: Iris Flower classification
        The aim is to classify iris flowers among three species (Setosa, Versicolor, or Virginica)
         from the sepals’ and petals’ length and width measurements.
        The neural network used for this task has one input layer, two hidden layers and one output layer.
        In the hidden layers we use sigmoid as an activation function for all neurons.
        In the output layer, we use softmax as an activation function for the three output
         neurons.
        In this regard, all outputs are between 0 and 1, and their sum is 1.
        The neural network has three outputs since the target variable contains three classes
         (Setosa, Versicolor, and Virginica).
Working of Softmax:
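A minimal sketch of softmax, written in a numerically stable form (the raw scores are made up for illustration):

import numpy as np

def softmax(z):
    # Turn raw output scores into probabilities that sum to 1
    e = np.exp(z - np.max(z))   # subtracting the max avoids overflow
    return e / np.sum(e)

scores = np.array([2.0, 1.0, 0.1])   # raw outputs for the three classes
print(softmax(scores))               # ≈ [0.659, 0.242, 0.099], sums to 1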
        An auto-associative network finds a different representation of the input data, one that extracts the important components of the data and ignores the noise.
        Such a network can be used to compress images and other data.
Deriving Back-propagation:
Things to know:
     1. Derivative of ½x² is x
     2. Chain rule:
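Written out (a standard statement of the chain rule, since the original formula is not reproduced here): if E depends on x only through an intermediate variable y, then

    dE/dx = (dE/dy) · (dy/dx)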
        Note that i is an index over the input nodes, j is an index over the hidden layer neurons,
         and k is an index over the output neurons.
The Error of the Network:
        The notation E(v, w) reminds us that the only things we can change are the weights v and w.
        We will choose the sum-of-squares error function.
        We are going to use a gradient descent algorithm that adjusts each weight.
        The gradient that we want to know is how the error function changes with respect to
         the different weights
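In the notation above (t_k the targets, y_k the outputs, η the learning rate), a sketch of the error function and the gradient-descent update for a second-layer weight:

    E(v, w) = ½ Σ_k (y_k - t_k)²,        w_jk ← w_jk - η · ∂E/∂w_jk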
There is a family of functions, called sigmoid functions because they are S-shaped, that satisfies all of these criteria perfectly.
We don't know much about the inputs to a neuron, only about its output; that is fine, because we can use the chain rule again.
The important thing that we need to remember is that inputs to the output layer neurons come
from the activations of the hidden layer neurons multiplied by the second layer weights:
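In symbols (using the index convention above, writing a_j for the hidden activations, w_jk for the second-layer weights and g for the activation function):

    input to output neuron k:  h_k = Σ_j a_j · w_jk,        output:  y_k = g(h_k)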
Radial Basis Function (RBF) Networks:
        RBF networks are conceptually similar to k-Nearest Neighbour (k-NN) models, though their implementation is distinct.
        The fundamental idea is that an item’s predicted target value is influenced by nearby
         items with similar predictor variable values.
        Here’s how RBF Networks operate:
             o Input Vector: The network receives an n-dimensional input vector that needs
                 classification or regression.
             o RBF Neurons: Each neuron in the hidden layer represents a prototype vector
                 (center, radius/spread) from the training set. The network computes the
                 Euclidean distance between the input vector and each neuron’s center.
             o Activation Function: The Euclidean distance is transformed using a Radial
                 Basis Function (typically a Gaussian function) to compute the neuron’s
                 activation value. This value decreases exponentially as the distance increases.
            o Output Nodes: Each output node calculates a score based on a weighted sum of
              the activation values from all RBF neurons. For classification, the category with
              the highest score is chosen.
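A minimal sketch of an RBF network's forward pass with Gaussian basis functions (the centres, spread and output weights are illustrative; in practice they are learned or chosen from the training data):

import numpy as np

def rbf_forward(x, centres, sigma, weights):
    # centres: (n_rbf, n_dim) prototype vectors; weights: (n_rbf, n_out) output weights
    dists = np.linalg.norm(centres - x, axis=1)                 # Euclidean distances
    activations = np.exp(-(dists ** 2) / (2 * sigma ** 2))      # Gaussian RBF
    return activations @ weights                                # weighted sum per output

centres = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]])        # 3 RBF neurons, 2 outputs
scores = rbf_forward(np.array([0.9, 1.1]), centres, sigma=0.5, weights=weights)
print(scores.argmax())   # index of the highest-scoring output (the predicted class)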
Interpolation:
        Interpolation is the process of estimating unknown values that lie between known data points.
        Example: if a child's height was measured at age 5 and age 6, interpolation could be used to estimate the child's height at age 5.5.
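For example, with linear interpolation (the heights 100 cm and 106 cm are made-up numbers for illustration):

    h(5.5) ≈ h(5) + (5.5 - 5)/(6 - 5) · (h(6) - h(5)) = 100 + 0.5 × 6 = 103 cm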
Basis Function:
        Radial basis functions and several other machine learning algorithms can be written in
         this form:
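A common way of writing this weighted-sum-of-basis-functions form (φ is the basis function and the c_i are its centres; the exact notation of the original notes is not reproduced here):

    f(x) = Σ_i w_i · φ(‖x - c_i‖)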
Curse of Dimensionality:
        The Curse of Dimensionality refers to the phenomenon where the efficiency and effectiveness of algorithms deteriorate rapidly as the dimensionality of the data increases.
        It is crucial to understand this concept because as the number of features or dimensions
         in a dataset increases, the amount of data we need to generalize accurately grows
         exponentially.
        Dimensions refer to the features or attributes of data.
        For instance, if we consider a dataset of houses, the dimensions could include the
         house's price, size, number of bedrooms, location, and so on.
What problems does it cause?
     1. Data sparsity. As mentioned, data becomes sparse, meaning that most of the high-
        dimensional space is empty. This makes clustering and classification tasks challenging.
     2. Increased computation. More dimensions mean more computational resources and
        time to process the data.
     3. Overfitting. With higher dimensions, models can become overly complex, fitting to
        the noise rather than the underlying pattern. This reduces the model's ability to
        generalize to new data.
     4. Distances lose meaning. In high dimensions, the difference in distances between data
        points tends to become negligible, making measures like Euclidean distance less
        meaningful.
     5. Performance degradation. Algorithms, especially those relying on distance
        measurements like k-nearest neighbors, can see a drop in performance.
     6. Visualization challenges. High-dimensional data is hard to visualize, making
        exploratory data analysis more difficult.
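A small sketch illustrating point 4 (distances losing meaning): for random data, the relative gap between the farthest and nearest neighbour of a query point shrinks as the dimensionality grows.

import numpy as np

rng = np.random.default_rng(0)
for d in (2, 10, 100, 1000):
    points = rng.random((1000, d))                        # 1000 random points in d dimensions
    dists = np.linalg.norm(points - rng.random(d), axis=1)
    contrast = (dists.max() - dists.min()) / dists.min()  # relative spread of distances
    print(d, round(contrast, 3))                          # shrinks as d increases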
Support Vector Machine (SVM):
        Support Vector Machine (SVM) is a supervised learning algorithm whose goal is to find the best decision boundary (hyperplane) separating the classes.
        The SVM algorithm can be used for face detection, image classification, text categorization, etc.
Types of SVM:
        Linear SVM: Linear SVM is used for linearly separable data; if a dataset can be classified into two classes using a single straight line, it is termed linearly separable data, and the classifier used is called a Linear SVM classifier.
        Non-linear SVM: Non-linear SVM is used for non-linearly separable data; if a dataset cannot be classified using a straight line, it is termed non-linear data, and the classifier used is called a Non-linear SVM classifier.
Hyperplane:
        A hyperplane is the decision boundary that separates the classes in the feature space; in two dimensions it is a line, and in three dimensions a plane.
        We always create the hyperplane that has the maximum margin, i.e. the maximum distance between the hyperplane and the nearest data points of either class.
Support Vectors:
        The data points or vectors that are closest to the hyperplane, and which affect its position, are termed support vectors. Since these vectors support the hyperplane, they are called support vectors.
Linear SVM:
        Suppose we have a dataset with two classes (green and blue), and two features, x1 and x2. We want a classifier that can classify a pair (x1, x2) of coordinates as either green or blue.
        Since this is a 2-D space, we can separate the two classes with a straight line. However, there can be multiple lines that separate these classes.
        Hence, the SVM algorithm helps to find the best line or decision boundary; this best boundary is called a hyperplane.
        The SVM algorithm finds the points from both classes that are closest to this line. These points are called support vectors.
        The distance between these vectors and the hyperplane is called the margin.
        And the goal of SVM is to maximize this margin.
        The hyperplane with maximum margin is called the optimal hyperplane.
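A minimal sketch using scikit-learn's SVC with a linear kernel (the tiny 2-D dataset is made up for illustration):

import numpy as np
from sklearn.svm import SVC

# Toy 2-D dataset: class 0 clustered near the origin, class 1 further away
X = np.array([[1, 2], [2, 3], [2, 1], [6, 5], [7, 7], [8, 6]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = SVC(kernel="linear", C=1.0)
clf.fit(X, y)

print(clf.support_vectors_)            # the points that define the margin
print(clf.predict([[3, 3], [7, 5]]))   # -> [0 1]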
Non-Linear SVM:
        If data is linearly separable, then we can separate it using a straight line, but for non-linear data we cannot draw a single straight line.
        So to separate these data points, we need to add one more dimension. For linear data,
         we have used two dimensions x and y, so for non-linear data, we will add a third
         dimension z. It can be calculated as:
        z = x² + y²
        By adding this third dimension, the points that could not be separated in two dimensions become separable.
        So now, SVM can divide the dataset into classes using a linear boundary (a plane) in this higher-dimensional space.
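A minimal sketch of this idea: points on two concentric circles are not linearly separable in (x, y), but after adding z = x² + y² a linear SVM separates them (the generated data is illustrative):

import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Add the extra dimension z = x² + y²
z = (X[:, 0] ** 2 + X[:, 1] ** 2).reshape(-1, 1)
X3 = np.hstack([X, z])

clf = SVC(kernel="linear").fit(X3, y)
print(clf.score(X3, y))   # close to 1.0: the classes are now linearly separable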
Kernels:
        The most interesting feature of SVM is that it can work even with a non-linear dataset; for this we use the "Kernel Trick", which makes it easier to classify the points. Suppose we have a dataset that is not linearly separable.
        We cannot draw a single line (or hyperplane) that classifies these points correctly.
        So we convert this lower-dimensional space into a higher-dimensional space using, for example, quadratic functions, which allows us to find a decision boundary that clearly divides the data points.
        The functions that help us do this are called kernels, and which kernel to use is determined by hyperparameter tuning.
               For a degree-2 (quadratic) mapping we need x1², x2² and x1·x2 in addition to x1 and x2, so the 2 original dimensions get converted into 5 dimensions.
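A small sketch of this explicit degree-2 feature mapping (the kernel trick computes the same inner products without ever building these features explicitly):

import numpy as np

def quadratic_features(x1, x2):
    # Map a 2-D point into the 5-D space {x1, x2, x1², x2², x1·x2}
    return np.array([x1, x2, x1 ** 2, x2 ** 2, x1 * x2])

print(quadratic_features(2.0, 3.0))   # [2. 3. 4. 9. 6.]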
Sigmoid Kernel
          It takes the inputs and maps them to values between 0 and 1 so that they can be separated by a simple straight line.
RBF Kernel
        It creates non-linear combinations of the features to lift the samples into a higher-dimensional feature space where a linear decision boundary can separate the classes.
        It is the most widely used kernel in SVM classification; the following formula expresses it mathematically:
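The standard Gaussian RBF kernel (γ controls the width of the Gaussian; x and x′ are two input vectors):

    K(x, x′) = exp(-γ · ‖x - x′‖²)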
          – identify the support vectors as those that are within some specified distance of the closest point, and dispose of the rest of the training data
          – compute b* using the corresponding equation
Advantages of SVM:
        SVM works well when the data is linearly separable
        It is more effective in high dimensions
        With the help of the kernel trick, we can solve any complex problem
        SVM is not sensitive to outliers
        Can help us with Image classification
Disadvantages of SVM:
        Choosing a good kernel is not easy
        It doesn’t show good results on a large dataset
        The main SVM hyperparameters are the cost (C) and gamma; they are not easy to fine-tune, and it is hard to visualize their impact.
******