
SCHOOL OF ELECTRICAL AND ELECTRONICS

DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

Machine Learning Techniques – SEC1630


SCHOOL OF ELECTRICAL AND ELECTRONICS
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

UNIT I – INTRODUCTION TO MACHINE LEARNING



Machine Learning vs Statistical Modelling, Applications of Machine Learning, Supervised vs


Unsupervised Learning, Supervised Learning Classification, Unsupervised Learning Classification,
Python libraries suitable for Machine Learning.

Definition of Machine learning:

Well-posed learning problem: "A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E." (Tom Mitchell)

"Field of study that gives computers the ability to learn without being explicitly programmed".

Learning = Improving with experience at some task

- Improve over task T,


- with respect to performance measure P,
- based on experience E.

E.g., Learn to play checkers

- T : Play checkers
- P : % of games won in world tournament
- E: opportunity to play against self
Examples of tasks that are best solved by using a learning algorithm:

• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences (demo)
• Recognizing anomalies:
– Unusual sequences of credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant or unusual sound in
your car engine.
• Prediction:
– Future stock prices or currency exchange rates
• The web contains a lot of data. Tasks with very big datasets often use machine learning
– especially if the data is noisy or non-stationary.
• Spam filtering, fraud detection:
– The enemy adapts so we must adapt too.
• Recommendation systems:
– Lots of noisy data. Million dollar prize!
• Information retrieval:
– Find documents or images with similar content.
• Data Visualization:
- Display a huge database in a revealing way

Types of learning algorithms:

 Supervised learning
o Training data includes desired outputs. Examples include,
o Prediction
o Classification (discrete labels), Regression (real values)
 Unsupervised learning
o Training data does not include desired outputs, Examples include,
o Clustering
o Probability distribution estimation
o Finding association (in features)
o Dimension reduction
 Semi-supervised learning
o Training data includes a few desired outputs
 Reinforcement learning
o Rewards from sequence of actions
o Policies: what actions should an agent take in a particular situation
o Utility estimation: how good is a state (→used by policy)
o No supervised output but delayed reward
o Credit assignment problem (what was responsible for the outcome)
o Applications:
o Game playing
o Robot in a maze
o Multiple agents, partial observability, ...
Hypothesis Space
• One way to think about a supervised learning machine is as a device that explores a
“hypothesis space”.
– Each setting of the parameters in the machine is a different hypothesis about the
function that maps input vectors to output vectors.
– If the data is noise-free, each training example rules out a region of hypothesis
space.
– If the data is noisy, each training example scales the posterior probability of each
point in the hypothesis space in proportion to how likely the training example is
given that hypothesis.
• The art of supervised machine learning is in:
– Deciding how to represent the inputs and outputs
– Selecting a hypothesis space that is powerful enough to represent the relationship
between inputs and outputs but simple enough to be searched.

Generalization
• The real aim of supervised learning is to do well on test data that is not known during
learning.
• Choosing the values for the parameters that minimize the loss function on the training data
is not necessarily the best policy.
• We want the learning machine to model the true regularities in the data and to ignore the
noise in the data.
– But the learning machine does not know which regularities are real and which are
accidental quirks of the particular set of training examples we happen to pick.
• So how can we be sure that the machine will generalize correctly to new data?

Training set, Test set and Validation set


• Divide the total dataset into three subsets:
– Training data is used for learning the parameters of the model.
– Validation data is not used for learning but is used for deciding what type of model
and what amount of regularization works best.
– Test data is used to get a final, unbiased estimate of how well the network works.
We expect this estimate to be worse than on the validation data.
• We could then re-divide the total dataset to get another unbiased estimate of the true error
rate.
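A minimal sketch of such a three-way split, assuming scikit-learn is available; the arrays and split proportions are illustrative only:

```python
# Split a dataset into training, validation and test subsets.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)          # hypothetical feature matrix
y = np.random.randint(0, 2, 1000)    # hypothetical labels

# First carve out the test set, then split the remainder into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_val), len(X_test))
```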

Learning Associations:

• Basket analysis:
P (Y | X ) probability that somebody who buys X also buys Y where X and Y are
products/services.

Example: P ( chips | beer ) = 0.7 // 70 percent of customers who buy beer also buy chips.

We may want to make a distinction among customers and toward this, estimate P(Y|X,D) where
D is the set of customer attributes, for example, gender, age, marital status, and so on, assuming
that we have access to this information. If this is a bookseller instead of a supermarket, products
can be books or authors. In the case of a Web portal, items correspond to links to Web pages, and
we can estimate the links a user is likely to click and use this information to download such
pages in advance for faster access.
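A minimal sketch of estimating such a conditional probability from transaction data with pandas; the tiny transaction table is hypothetical:

```python
# Estimate P(chips | beer) as the fraction of beer-buying transactions that also contain chips.
import pandas as pd

transactions = pd.DataFrame({
    "beer":  [1, 1, 1, 1, 0, 0, 1, 0, 1, 1],   # 1 = item present in the basket
    "chips": [1, 0, 1, 1, 0, 1, 1, 0, 0, 1],
})

p_chips_given_beer = transactions[transactions["beer"] == 1]["chips"].mean()
print(f"P(chips | beer) = {p_chips_given_beer:.2f}")   # roughly 0.7 for this toy table
```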

Classification:

• Example: Credit scoring


• Differentiating between low-risk and high-risk customers from their income and savings
• Discriminant: IF income > θ1 AND savings > θ2 THEN low-risk ELSE high-risk
Prediction - Regression:
• Example: Predict Price of a used car
• x : car attributes
y : price
y = g (x | θ )
where, g ( ) is the model, θ are the parameters

A training dataset of used cars and the function fitted. For simplicity, mileage is taken as the only
input attribute and a linear model is used.
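A minimal sketch of this fit, assuming scikit-learn; the mileage/price pairs are made up for illustration:

```python
# Fit y = g(x | θ) with a linear model, where x is mileage and y is price.
import numpy as np
from sklearn.linear_model import LinearRegression

mileage = np.array([[15000], [30000], [60000], [90000], [120000]])  # x : car attribute
price   = np.array([18000, 15500, 12000, 9000, 7000])               # y : price

model = LinearRegression().fit(mileage, price)    # θ = (intercept, slope)
print(model.intercept_, model.coef_)
print(model.predict([[45000]]))                   # predicted price for a 45,000 km car
```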

Supervised Learning:

Learning a Class from Examples:

• Class C of a “family car”

o Prediction: Is car x a family car?


o Knowledge extraction: What do people expect from a family car?
• Output:

Positive (+) and negative (–) examples


• Input representation:
x1: price, x2 : engine power
Probably Approximately Correct (PAC):

• Cannot expect a learner to learn a concept exactly.


• Cannot always expect to learn a close approximation to the target concept
• Therefore, the only realistic expectation of a good learner is that with high probability it
will learn a close approximation to the target concept.
• In Probably Approximately Correct (PAC) learning, one requires that given small
parameters ε and δ, with probability at least (1- δ) a learner produces a hypothesis with
error at most ε.
• The reason we can hope for that is the Consistent Distribution assumption.
PAC Learnability

• Consider a concept class C defined over an instance space X (containing instances of length
n), and a learner L using a hypothesis space H.
• C is PAC learnable by L using H if
 for all f ϵ C,

 for all distributions D over X, and fixed 0< ε , δ < 1,


• L, given a collection of m examples sampled independently according to D produces

 with probability at least (1 − δ) a hypothesis h ϵ H with error at most ε, where Error_D = Pr_D[ f(x) ≠ h(x) ]
• where m is polynomial in 1/ ε, 1/ δ, n and size(H)
• C is efficiently learnable if L can produce the hypothesis in time polynomial in 1/ ε, 1/ δ, n
and size(H)
• We impose two limitations:
• Polynomial sample complexity (information theoretic constraint)

 Is there enough information in the sample to distinguish a hypothesis h that
approximates f ?
• Polynomial time complexity (computational complexity)

 Is there an efficient algorithm that can process the sample and produce a good
hypothesis h ?
• To be PAC learnable, there must be a hypothesis h ϵ H with arbitrarily small error for every
f ϵ C. We generally assume H ⊇ C. (C is properly PAC learnable if H = C.)
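A standard quantitative version of this requirement (a well-known bound for finite hypothesis classes, added here as an illustration): for a consistent learner and a finite hypothesis class H, a sample of size

m ≥ (1/ε) ( ln |H| + ln (1/δ) )

suffices so that, with probability at least (1 − δ), every hypothesis consistent with the sample has error at most ε.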

Occam’s Razor:
Noise and Model Complexity:

Noise is any unwanted anomaly in the data and due to noise, the class may be more difficult to
learn and zero error may be infeasible with a simple hypothesis class. There are several
interpretations of noise:
• There may be imprecision in recording the input attributes, which may shift the data points
in the input space.
• There may be errors in labeling the data points, which may relabel positive instances as
negative and vice versa. This is sometimes called teacher noise.
• There may be additional attributes, which we have not taken into account, that affect the
label of an instance. Such attributes may be hidden or latent in that they may be
unobservable. The effect of these neglected attributes is thus modeled as a random
component and is included in “noise.”
Use the simpler one because
• Simpler to use (lower computational complexity)
• Easier to train (lower space complexity)
• Easier to explain (more interpretable)
• Generalizes better (lower variance - Occam’s razor)
Learning Multiple Classes:

In our example of learning a family car, we have positive examples belonging to the class family
car and the negative examples belonging to all other cars. This is a two-class problem. In the
general case, we have K classes denoted as Ci, i = 1, . . . , K, and an input instance belongs to one
and exactly one of them. The training set is now of the form X = { x^t, r^t }, t = 1, …, N, where r^t is a K-dimensional label vector with r^t_i = 1 if x^t ϵ C_i and 0 otherwise.

There are three classes: family car, sports car, and luxury sedan. There are three hypotheses
induced, each one covering the instances of one class and leaving outside the instances of the
other two classes. '?' marks the reject regions where no class, or more than one class, is chosen.
Model Selection & Generalization:

• Learning is an ill-posed problem; data is not sufficient to find a unique


solution
• The need for inductive bias, assumptions about H
• Generalization: How well a model performs on new data
• Overfitting: H more complex than C or f
• Underfitting: H less complex than C or f
Triple Trade-Off:
• There is a trade-off between three factors (Dietterich, 2003):
1. Complexity of H, c (H),
2. Training set size, N,
3. Generalization error, E, on new data
As N increases, E decreases. As c(H) increases, E first decreases and then increases.
Cross-Validation:
• To estimate generalization error, we need data unseen during training. We
split the data as

• Training set (50%)


• Validation set (25%)
• Test (publication) set (25%)
• Resampling when there is little data
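A minimal sketch of estimating generalization error by resampling (k-fold cross-validation) with scikit-learn; the data here is random and purely illustrative:

```python
# 5-fold cross-validation: each fold serves once as held-out data.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

X = np.random.rand(200, 4)           # hypothetical features
y = np.random.randint(0, 2, 200)     # hypothetical labels

scores = cross_val_score(LogisticRegression(), X, y, cv=5)
print(scores.mean(), scores.std())   # average accuracy and its spread across folds
```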

Dimensions of a Supervised Learner:
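Following the standard formulation (Alpaydın), a supervised learner is specified by three choices:

1. A model, g(x | θ), with parameters θ.
2. A loss function measuring the error on the training set, E(θ | X) = Σ_t L( r^t, g(x^t | θ) ).
3. An optimization procedure that finds θ* = argmin_θ E(θ | X).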

Applications of Machine learning

Machine learning is a buzzword in today's technology, and it is growing very rapidly. We use
machine learning in our daily lives, often without knowing it, through services such as Google
Maps, Google Assistant, and Alexa. Below are some of the most popular real-world applications of
machine learning:
1. Image Recognition:

Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. A popular use case of image recognition
and face detection is automatic friend tagging suggestions.

Facebook provides a feature of automatic friend tagging suggestions: whenever we upload a photo
with our Facebook friends, we automatically get a tagging suggestion with their names, and the
technology behind this is machine learning's face detection and recognition algorithms.

It is based on Facebook's project named "DeepFace," which is responsible for face
recognition and person identification in pictures.

2. Speech Recognition

Google's "Search by voice" option is an example of speech recognition, a popular application of
machine learning.

Speech recognition is the process of converting voice instructions into text; it is also known
as "speech to text" or "computer speech recognition." At present, machine learning
algorithms are widely used in speech recognition applications. Google Assistant, Siri, Cortana,
and Alexa use speech recognition technology to follow voice instructions.

3. Traffic prediction:

If we want to visit a new place, we take the help of Google Maps, which shows us the correct path
with the shortest route and predicts the traffic conditions.

It predicts traffic conditions, such as whether traffic is clear, slow-moving, or heavily
congested, in two ways:

o Real-time location of the vehicle from the Google Maps app and sensors
o Average time taken on past days at the same time of day.

Everyone who uses Google Maps is helping to make the app better. It takes information
from users and sends it back to its database to improve performance.

4. Product recommendations:

Machine learning is widely used by e-commerce and entertainment companies such
as Amazon and Netflix for product recommendations. Whenever we search for a product on Amazon,
we start getting advertisements for the same product while surfing the internet in the same
browser, and this is because of machine learning.

Google understands user interest using various machine learning algorithms and suggests
products according to customer interest.

Similarly, when we use Netflix, we find recommendations for series, movies, etc., and this is
also done with the help of machine learning.
5. Self-driving cars:

One of the most exciting applications of machine learning is self-driving cars, in which machine
learning plays a significant role. Tesla, one of the most prominent car manufacturers working on
self-driving cars, uses machine learning (largely supervised deep learning on labelled camera and
sensor data) to train its models to detect people and objects while driving.

6. Email Spam and Malware Filtering:

Whenever we receive a new email, it is filtered automatically as important, normal, or spam.
We receive important mail in our inbox with the important symbol and spam emails in our spam
box, and the technology behind this is machine learning. Below are some spam
filters used by Gmail:

o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters

Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.

7. Virtual Personal Assistant:

We have various virtual personal assistants such as Google assistant, Alexa, Cortana, Siri. As
the name suggests, they help us in finding the information using our voice instruction. These
assistants can help us in various ways just by our voice instructions such as Play music, call
someone, Open an email, Scheduling an appointment, etc.

These virtual assistants rely on machine learning algorithms as an important component.

They record our voice instructions, send them to a server in the cloud, decode them
using ML algorithms, and act accordingly.

8. Online Fraud Detection:

Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there are various ways a
fraudulent transaction can take place, such as fake accounts, fake IDs, and money stolen in the
middle of a transaction. To detect this, a feed-forward neural network helps by checking
whether a transaction is genuine or fraudulent.

For each genuine transaction, the output is converted into hash values, and these values
become the input for the next round. Genuine transactions follow a specific pattern
that changes for fraudulent transactions; the network detects this and makes our online
transactions more secure.
9. Stock Market trading:

Machine learning is widely used in stock market trading. In the stock market, there is always a
risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM) neural
networks are used for the prediction of stock market trends.

10. Medical Diagnosis:

In medical science, machine learning is used for disease diagnosis. With it, medical
technology is growing very fast and is able to build 3D models that can predict the exact position
of lesions in the brain.

It helps in finding brain tumors and other brain-related diseases easily.


11. Automatic Language Translation:

Nowadays, if we visit a new place and are not aware of the language, it is not a problem
at all, as machine learning helps us by converting the text into languages we know.
Google's GNMT (Google Neural Machine Translation) provides this feature: a neural machine
translation system that translates text into a language familiar to us, known as
automatic translation.

The technology behind automatic translation is a sequence-to-sequence learning algorithm,
which is used with image recognition to translate text from one language to another.

Supervised learning Vs Unsupervised Learning


Supervised Learning

In supervised learning, algorithms learn from labeled data. After understanding the data, the
algorithm determines which label should be given to new data by associating patterns to the
unlabeled new data.
Supervised learning can be divided into two categories: classification and regression.

Classification

Classification is a technique for determining which class the dependent variable belongs to based
on one or more independent variables.

Classification is used for predicting discrete responses.

1. LOGISTIC REGRESSION

Logistic regression is similar to linear regression, but is used when the dependent variable is
not a number but something else (e.g., a "yes/no" response). Although it is called regression,
it performs classification: based on the regression output, it assigns the dependent variable
to one of the classes.

Logistic regression is used for prediction of a binary output, as stated above. For
example, if a credit card company builds a model to decide whether or not to issue a credit card
to a customer, it will model whether the customer is going to "default" or "not default" on
their card.

Linear Regression
Firstly, linear regression is performed on the relationship between variables to get the model.
The threshold for the classification line is assumed to be at 0.5.

Logistic Sigmoid Function


The logistic (sigmoid) function is applied to the regression output to get the probability of
belonging to either class.

The model works with the log-odds: the log of the ratio of the probability of the event occurring
to the probability of it not occurring. In the end, it assigns the variable to the class with the
higher probability.
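A minimal sketch of such a binary classifier with scikit-learn's LogisticRegression; the age/income data (income in thousands) and the "default" labels are hypothetical:

```python
# Predict default (1) vs. no default (0) from [age, income in thousands].
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[25, 20], [40, 60], [35, 30], [50, 90], [23, 15], [45, 80]])
y = np.array([1, 0, 1, 0, 1, 0])

clf = LogisticRegression(max_iter=1000).fit(X, y)
print(clf.predict_proba([[30, 40]]))   # class probabilities from the sigmoid
print(clf.predict([[30, 40]]))         # class with the higher probability
```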
2. K-NEAREST NEIGHBORS (K-NN)

K-NN algorithm is one of the simplest classification algorithms and it is used to identify the
data points that are separated into several classes to predict the classification of a new sample
point. K-NN is a non-parametric, lazy learning algorithm. It classifies new cases based on a
similarity measure (i.e., distance functions).
K-NN works well with a small number of input variables (p), but struggles when the number of
inputs is very large.

3. SUPPORT VECTOR MACHINE (SVM)

Support vector machines are used for both regression and classification. They are based on the
concept of decision planes that define decision boundaries. A decision plane (hyperplane) is one
that separates a set of objects having different class memberships.
It performs classification by finding the hyperplane that maximizes the margin between the two
classes with the help of support vectors.

The learning of the hyperplane in SVM is done by transforming the problem using some linear
algebra (the example above uses a linear kernel, which assumes linear separability between the
classes).

For higher-dimensional data whose points cannot be separated easily by a linear boundary, other
kernels are used. They are specified in the next section.
Kernel SVM
Kernel SVM takes in a kernel function in the SVM algorithm and transforms it into the required
form that maps data on a higher dimension which is separable.
Types of kernel function are:

Type of kernel functions

1. Linear SVM is the one we discussed earlier.


2. In polynomial kernel, the degree of the polynomial should be specified. It allows for curved
lines in the input space.
3. The radial basis function (RBF) kernel is used for non-linearly separable variables. For the
distance metric, squared Euclidean distance is used. Using a typical value of the parameter can
lead to overfitting our data. It is used by default in sklearn.
4. The sigmoid kernel, similar to logistic regression, is used for binary classification.

Kernel trick uses the kernel function to transform data into a higher dimensional feature space
and makes it possible to perform the linear separation for classification.
Radial Basis Function (RBF) Kernel
The RBF kernel SVM decision region is actually also a linear decision region. What RBF kernel
SVM actually does is create non-linear combinations of features to uplift the samples onto a
higher-dimensional feature space where a linear decision boundary can be used to separate
classes.
So, the rule of thumb is: use linear SVMs for linear problems, and nonlinear kernels such as the
RBF kernel for non-linear problems.
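A minimal sketch contrasting a linear kernel with the default RBF kernel on a toy non-linear dataset, assuming scikit-learn:

```python
# Compare a linear-kernel SVM with an RBF-kernel SVM on a non-linearly separable problem.
from sklearn.datasets import make_moons
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)

linear_svm = SVC(kernel="linear").fit(X, y)
rbf_svm    = SVC(kernel="rbf", gamma="scale").fit(X, y)   # RBF is sklearn's default kernel

print("linear accuracy:", linear_svm.score(X, y))
print("RBF accuracy:   ", rbf_svm.score(X, y))
```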

4. NAIVE BAYES

The naive Bayes classifier is based on Bayes' theorem with independence assumptions
between predictors (i.e., it assumes the presence of a feature in a class is unrelated to any other
feature). Even if these features depend on each other, or upon the existence of the other features,
all of these properties are treated as contributing independently; thus the name naive Bayes.
Based on naive Bayes, Gaussian naive Bayes is used for classification when the features are
assumed to follow a Gaussian (normal) distribution.

 P(class|data) is the posterior probability of class(target) given predictor(attribute). The


probability of a data point having either class, given the data point. This is the value that we are
looking to calculate.
 P(class) is the prior probability of class.
 P(data|class) is the likelihood, which is the probability of predictor given class.
 P(data) is the prior probability of predictor or marginal likelihood.
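In symbols, these quantities are related by Bayes' theorem:

P(class | data) = P(data | class) × P(class) / P(data)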
Steps
1. Calculate Prior Probability
P(class) = Number of data points in the class/Total no. of observations
P(yellow) = 10/17
P(green) = 7/17
2. Calculate Marginal Likelihood
P(data) = Number of data points similar to observation/Total no. of observations
P(?) = 4/17
This value appears in the denominator when computing the posterior probability of both classes.

3. Calculate Likelihood
P(data/class) = Number of similar observations to the class/Total no. of points in the class.
P(?/yellow) = 1/10
P(?/green) = 3/7
4. Posterior Probability for Each Class
P(yellow|?) = (1/10 × 10/17) / (4/17) = 0.25
P(green|?) = (3/7 × 7/17) / (4/17) = 0.75
5. Classification
The class with the higher posterior probability is chosen; from the above, with 75% probability
the point belongs to the green class.

Multinomial, Bernoulli naive Bayes are the other models used in calculating probabilities. Thus,
a naive Bayes model is easy to build, with no complicated iterative parameter estimation, which
makes it particularly useful for very large datasets.
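A minimal sketch of Gaussian naive Bayes with scikit-learn; the two-class data is hypothetical:

```python
# Fit Gaussian naive Bayes and inspect the posterior probabilities for a new point.
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[1.0, 2.1], [1.2, 1.9], [0.9, 2.2],    # class 0
              [3.0, 3.9], [3.2, 4.1], [2.9, 4.0]])   # class 1
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB().fit(X, y)
print(model.predict([[1.1, 2.0]]))         # most probable class
print(model.predict_proba([[1.1, 2.0]]))   # posterior probability per class
```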

5. DECISION TREE CLASSIFICATION

Decision tree builds classification or regression models in the form of a tree structure. It breaks
down a dataset into smaller and smaller subsets while at the same time an associated decision
tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. It
follows the Iterative Dichotomiser 3 (ID3) algorithm structure for determining the split.

Entropy and information gain are used to construct a decision tree.

Entropy
Entropy is the degree or amount of uncertainty in the randomness of elements. In other words, it
is a measure of impurity.

Intuitively, it tells us about the predictability of a certain event. Entropy calculates the
homogeneity of a sample. If the sample is completely homogeneous the entropy is zero, and if
the sample is equally divided it has an entropy of one.
Information Gain
Information gain measures the relative change in entropy with respect to the independent
attribute. It tries to estimate the information contained by each attribute. Constructing a decision
tree is all about finding the attribute that returns the highest information gain (i.e., the most
homogeneous branches).
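In symbols (the usual definitions, with p_c the fraction of examples of class c at a node and T_v the subset of T for which attribute X takes value v):

Entropy(T) = − Σ_c p_c log2 p_c

Gain(T, X) = Entropy(T) − Σ_v ( |T_v| / |T| ) Entropy(T_v)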

Where Gain(T, X) is the information gain by applying feature X. Entropy(T) is the entropy of
the entire set, while the second term calculates the entropy after applying the feature X.
Information gain ranks attributes for splitting at a given node in the tree; the attribute giving
the highest information gain is chosen for each split.

The disadvantage of a decision tree model is overfitting, as it tries to fit the model by going
deeper in the training set and thereby reducing test accuracy.

Overfitting in decision trees can be minimized by pruning nodes.

Ensemble Methods for Classification

An ensemble model is a team of models. Technically, ensemble models comprise several


supervised learning models that are individually trained and the results merged in various ways
to achieve the final prediction. This result has higher predictive power than the results of any of
its constituting learning algorithms independently.
1. RANDOM FOREST CLASSIFICATION

Random forest classifier is an ensemble algorithm based on bagging, i.e., bootstrap aggregation.
Ensemble methods combine more than one algorithm of the same or different kind for
classifying objects (for example, an ensemble of SVM, naive Bayes, or decision trees).

The general idea is that a combination of learning models improves the overall result.
Deep decision trees may suffer from overfitting, but random forests prevent overfitting by
creating trees on random subsets. The main reason is that the forest takes the average of all the
predictions, which cancels out the biases. Random forest also adds additional randomness to the
model while growing the trees: instead of searching for the most important feature while
splitting a node, it searches for the best feature among a random subset of features. This results
in a wide diversity that generally yields a better model.

2. GRADIENT BOOSTING CLASSIFICATION

Gradient boosting classifier is a boosting ensemble method. Boosting is a way to combine


(ensemble) weak learners, primarily to reduce prediction bias. Instead of creating a pool of
predictors, as in bagging, boosting produces a cascade of them, where each output is the input
for the following learner. Typically, in a bagging algorithm trees are grown in parallel to get the
average prediction across all trees, where each tree is built on a sample of original data.
Gradient boosting, on the other hand, takes a sequential approach to obtaining predictions
instead of parallelizing the tree building process. In gradient boosting, each decision tree
predicts the error of the previous decision tree — thereby boosting (improving) the error
(gradient).
Working of Gradient Boosting

1. Initialize predictions with a simple decision tree.


2. Calculate residual (actual-prediction) value.
3. Build another shallow decision tree that predicts residual based on all the independent values.
4. Update the original prediction with the new prediction multiplied by learning rate.
5. Repeat steps two through four for a certain number of iterations (the number of iterations will
be the number of trees).
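A minimal sketch of these five steps using shallow regression trees; the data, the 0.1 learning rate, and the 50 iterations are illustrative assumptions, not prescriptions:

```python
# Hand-rolled gradient boosting on a toy regression problem.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(0, 0.1, 200)     # hypothetical target

learning_rate, n_trees = 0.1, 50
prediction = np.full_like(y, y.mean())              # step 1: simple initial prediction
trees = []

for _ in range(n_trees):
    residual = y - prediction                                     # step 2: actual - prediction
    tree = DecisionTreeRegressor(max_depth=2).fit(X, residual)    # step 3: shallow tree on residuals
    prediction += learning_rate * tree.predict(X)                 # step 4: update prediction
    trees.append(tree)                                            # step 5: repeat for n_trees rounds

print("final training MSE:", np.mean((y - prediction) ** 2))
```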
Unsupervised Learning Classification:

Unsupervised classification is where the outcomes (groupings of pixels with common
characteristics) are based on the software analysis of an image without the user providing
sample classes. The computer uses techniques to determine which pixels are related and groups
them into classes. The user can specify which algorithm the software will use and the desired
number of output classes but otherwise does not aid in the classification process. However, the
user must have knowledge of the area being classified so that the groupings of pixels with
common characteristics produced by the computer can be related to actual features on the
ground (such as wetlands, developed areas, coniferous forests, etc.).

Python libraries suitable for Machine Learning.

Python has grown to become the most preferred language for implementing machine learning
algorithms, thanks to its rich ecosystem of libraries. Let's have a look at the main Python
libraries used for machine learning.

Top Python Machine Learning Libraries


1) NumPy
NumPy is a well-known general-purpose array-processing package. An extensive collection of
complex mathematical functions makes NumPy powerful for processing large multi-dimensional
arrays and matrices. NumPy is very useful for handling linear algebra, Fourier transforms, and
random numbers. Other libraries, like TensorFlow, use NumPy at the backend for manipulating
tensors.
With NumPy, you can define arbitrary data types and easily integrate with most databases.
NumPy can also serve as an efficient multi-dimensional container for generic data of any
datatype. The key features of NumPy include a powerful N-dimensional array object,
broadcasting functions, and out-of-the-box tools to integrate C/C++ and Fortran code.
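A minimal sketch of these NumPy features (arrays, broadcasting, linear algebra, Fourier transforms, random numbers):

```python
import numpy as np

A = np.arange(6).reshape(2, 3)     # 2x3 array: [[0 1 2], [3 4 5]]
b = np.array([10, 20, 30])

print(A + b)                 # broadcasting: b is added to every row of A
print(A @ A.T)               # matrix product (linear algebra)
print(np.fft.fft(b))         # Fourier transform
print(np.random.rand(2, 2))  # random numbers
```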
2) SciPy
With machine learning growing at supersonic speed, many Python developers were
creating Python libraries for machine learning, especially for scientific and analytical
computing. In 2001, Travis Oliphant, Eric Jones, and Pearu Peterson decided to merge most of
these bits and pieces of code and standardize them. The resulting library was named the SciPy
library.
The current development of the SciPy library is supported and sponsored by an open
community of developers and distributed under the free BSD license.
The SciPy library offers modules for linear algebra, optimization, integration, interpolation,
special functions, fast Fourier transforms, signal and image processing, ordinary differential
equation (ODE) solvers, and other computational tasks in science and analytics.
The underlying data structure used by SciPy is a multi-dimensional array provided by the
NumPy module. SciPy depends on NumPy for the array manipulation subroutines. The SciPy
library was built to work with NumPy arrays along with providing user-friendly and efficient
numerical functions.
3) Scikit-learn
In 2007, David Cournapeau developed the Scikit-learn library as part of a Google Summer
of Code project. In 2010, INRIA got involved and made the first public release in January 2010.
Scikit-learn was built on top of two Python libraries, NumPy and SciPy, and has become the most
popular Python machine learning library for developing machine learning algorithms.
Scikit-learn has a wide range of supervised and unsupervised learning algorithms that work
through a consistent interface in Python. The library can also be used for data mining and data
analysis. The main machine learning tasks that the Scikit-learn library can handle are
classification, regression, clustering, dimensionality reduction, model selection, and preprocessing.
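A minimal sketch of that consistent interface, chaining preprocessing and a classifier in a pipeline on a built-in toy dataset:

```python
# Every scikit-learn estimator follows the same fit/predict/score pattern.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)            # same interface as any other estimator
print(model.score(X_test, y_test))     # accuracy on held-out data
```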
4) Theano
Theano is a Python machine learning library that can act as an optimizing compiler for
evaluating and manipulating mathematical expressions and matrix calculations. Built on
NumPy, Theano exhibits tight integration with NumPy and has a very similar interface.
Theano can work on Graphics Processing Units (GPUs) and CPUs.
Working on GPU architecture yields faster results. Theano can perform data-intensive
computations up to 140x faster on GPU than on a CPU. Theano can automatically avoid errors
and bugs when dealing with logarithmic and exponential functions. Theano has built-in tools for
unit-testing and validation, thereby avoiding bugs and problems.
5) TensorFlow
TensorFlow was developed for Google’s internal use by the Google Brain team. Its first release
came in November 2015 under Apache License 2.0. TensorFlow is a popular computational
framework for creating machine learning models. TensorFlow supports a variety of different
toolkits for constructing models at varying levels of abstraction.
TensorFlow exposes stable Python and C++ APIs. It can also expose backward-compatible
APIs for other languages, but these might be unstable. TensorFlow has a flexible architecture
with which it can run on a variety of computational platforms: CPUs, GPUs, and TPUs. TPU
stands for Tensor Processing Unit, a hardware chip built around TensorFlow for machine
learning and artificial intelligence.
6) Keras
Keras had over 200,000 users as of November 2017. Keras is an open-source library used for
neural networks and machine learning. Keras can run on top of TensorFlow, Theano, Microsoft
Cognitive Toolkit, R, or PlaidML. Keras can also run efficiently on CPUs and GPUs.
Keras works with neural-network building blocks like layers, objectives, activation functions,
and optimizers. Keras also has a number of features for working with images and text, which
come in handy when writing deep neural network code.
Apart from standard feed-forward networks, Keras supports convolutional and recurrent neural
networks.
7) PyTorch
PyTorch has a range of tools and libraries that support computer vision, machine learning, and
natural language processing. The PyTorch library is open-source and is based on the Torch
library. The most significant advantage of the PyTorch library is its ease of learning and use.
PyTorch integrates smoothly with the Python data science stack, including NumPy; you will
hardly notice a difference between NumPy and PyTorch. PyTorch also allows developers to
perform computations on tensors. PyTorch has a robust framework for building computational
graphs on the fly and even changing them at runtime. Other advantages of PyTorch include
multi-GPU support, simplified preprocessors, and custom data loaders.
8) Pandas
Pandas has turned out to be the most popular Python library for data analysis, with
support for fast, flexible, and expressive data structures designed to work with both "relational"
and "labeled" data. Pandas today is an indispensable library for solving practical, real-world data
analysis problems in Python. Pandas is highly stable and provides highly optimized performance.
The backend code is written purely in C or Python.
The two main types of data structures used by pandas are :
Series (1-dimensional)
DataFrame (2-dimensional)
These two put together can handle a vast majority of data requirements and use cases from most
sectors like science, statistics, social, finance, and of course, analytics and other areas of
engineering.
Pandas supports and performs well with different kinds of data, including the following:

Tabular data with columns of heterogeneous types, for instance data coming from an SQL table
or an Excel spreadsheet.
Ordered and unordered time series data. The frequency of the time series need not be fixed,
unlike in other libraries and tools; pandas is exceptionally robust in handling uneven time-series data.
Arbitrary matrix data with homogeneous or heterogeneous types in the rows and columns.
Any other form of statistical or observational data set. The data need not be labeled at all;
pandas data structures can process it even without labeling.
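A minimal sketch of the two data structures; the values are made up for illustration:

```python
import pandas as pd

s = pd.Series([10, 20, 30], index=["a", "b", "c"])   # 1-dimensional, labeled

df = pd.DataFrame({                                  # 2-dimensional, tabular
    "city": ["Chennai", "Delhi", "Mumbai"],
    "temp_c": [34.0, 29.5, 31.2],
})
print(df.describe())            # quick summary statistics
print(df[df["temp_c"] > 30])    # label/condition-based selection
```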
9) Matplotlib
Matplotlib is a data visualization library that is used for 2D plotting to produce publication-
quality image plots and figures in a variety of formats. The library helps to generate histograms,
plots, error charts, scatter plots, and bar charts with just a few lines of code.
It provides a MATLAB-like interface and is exceptionally user-friendly. It works by using
standard GUI toolkits like GTK+, wxPython, Tkinter, or Qt to provide an object-oriented API
that helps programmers to embed graphs and plots into their applications.
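A minimal sketch of a 2D plot; the data is synthetic and the output filename is arbitrary:

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")               # line plot
plt.scatter(x[::10], np.cos(x[::10]), label="cos samples")   # scatter plot
plt.xlabel("x")
plt.ylabel("y")
plt.legend()
plt.savefig("example_plot.png", dpi=300)             # publication-quality figure
```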

TEXT / REFERENCE BOOKS

1. Chris Albon, Machine Learning with Python Cookbook, O'Reilly Media, Inc., 2018
2. Stephen Marsland, “Machine Learning – An Algorithmic Perspective”, Second Edition,
Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014
3. Tom M. Mitchell, Machine Learning, India Edition 2013, McGraw Hill Education
4. Machine Learning: The art and Science of algorithms that make sense of data, Peter Flach,
Cambridge University Press, 2012
5. EthemAlpaydın, Introduction to machine learning, second edition, MIT press.
6. T. Hastie, R. Tibshirani and J. Friedman, “Elements of Statistical Learning”, Springer
Series , 2nd edition
7. Sebastian Raschka, "Python Machine Learning",Second Edition.Packt Publication
SCHOOL OF ELECTRICAL AND ELECTRONICS
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

UNIT- II CLASSIFIERS

Classification, K- nearest neighbour, Decision Trees, Implementing Decision Tree, building a


Tree, Random Forests - Working of Random Forest, Pros and Cons of Random Forest, Naive
Bayes, Building a Model Using Naive Bayes.

KNN:
In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric classification
method first developed by Evelyn Fix and Joseph Hodges in 1951 and later expanded by
Thomas Cover.
It is used for classification and regression. In both cases, the input
consists of the k closest training examples in the data set. The output depends on whether k-NN is
used for classification or regression.
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine
learning algorithm.

It is also called a lazy learner algorithm because it does not learn from the training
set immediately; instead, it stores the dataset and, at the time of classification, performs
an action on the dataset.
KNN algorithm at the training phase just stores the dataset and when it gets new data, then it
classifies that data into a category that is much similar to the new data.
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used
for both classification as well as regression predictive problems. However, it is mainly used for
classification predictive problems in industry. The following two properties would define KNN
well −
• Lazy learning algorithm − KNN is a lazy learning algorithm because it does
not have a specialized training phase and uses all the data for training during
classification.
• Non-parametric learning algorithm − KNN is also a non-parametric
learning algorithm because it doesn’t assume anything about the underlying
data.
Working of KNN Algorithm
K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new
datapoints which further means that the new data point will be assigned a value based on how
closely it matches the points in the training set. We can understand its working with the help of
following steps −
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN,
we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e. the number of nearest data points to
consider. K can be any integer.
Step 3 − For each point in the test data do the following −
3.1 − Calculate the distance between the test data point and each row of the training data
using any of the methods, namely Euclidean, Manhattan or Hamming distance. The
most commonly used method to calculate distance is Euclidean.
3.2 − Now, based on the distance value, sort them in ascending order.
3.3 − Next, it will choose the top K rows from the sorted array.
3.4 − Now, it will assign a class to the test point based on most frequent class of these
rows.
Step 4 − End
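A minimal from-scratch sketch of these steps using Euclidean distance and majority voting; the toy points and labels are hypothetical:

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    # Step 3.1: distance from the test point to every training point
    distances = np.linalg.norm(X_train - x_new, axis=1)
    # Steps 3.2-3.3: sort by distance and take the top K rows
    nearest = np.argsort(distances)[:k]
    # Step 3.4: assign the most frequent class among those K neighbours
    return Counter(y_train[nearest]).most_common(1)[0][0]

X_train = np.array([[1, 1], [2, 1], [1, 2], [7, 8], [8, 8], [8, 7]])
y_train = np.array(["red", "red", "red", "blue", "blue", "blue"])
print(knn_predict(X_train, y_train, np.array([7, 7]), k=3))   # -> "blue"
```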
Example
The following is an example to understand the concept of K and working of KNN algorithm −
Suppose we have a dataset which can be plotted as follows −

Now, we need to classify a new data point, shown as a black dot (at point (60, 60)), into the blue
or red class. We assume K = 3, i.e., the algorithm finds the three nearest data points, as shown
in the next diagram.

We can see in the above diagram the three nearest neighbors of the black data point.
Among those three, two lie in the red class, hence the black dot will also be assigned to the
red class.
Pros and Cons of KNN
Pros
 It is a very simple algorithm to understand and interpret.
 It is very useful for nonlinear data because the algorithm makes no assumption about the
data.
 It is a versatile algorithm, as we can use it for classification as well as regression.
 It has relatively high accuracy, but there are much better supervised learning models
than KNN.
Cons
 It is a computationally somewhat expensive algorithm because it stores all the training data.
 High memory storage is required compared to other supervised learning algorithms.
 Prediction is slow when N is large.
 It is very sensitive to the scale of the data as well as to irrelevant features.

Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in the banking system to predict whether an individual is fit for loan approval,
i.e., whether that individual has characteristics similar to those of defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual's credit rating by comparing it with
persons having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into various classes like
"Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'", "Will Vote for Party 'BJP'".
Other areas in which the KNN algorithm can be used are speech recognition, handwriting
detection, image recognition, and video recognition.
Decision Trees
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
o In a decision tree, there are two types of nodes, the Decision Node and the Leaf
Node. Decision nodes are used to make decisions and have multiple branches,
whereas leaf nodes are the outputs of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits
the tree into subtrees.
Why use Decision Trees?

There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:

o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.

Decision Tree Terminologies


 Root Node: Root node is from where the decision tree starts. It represents the entire
dataset, which further gets divided into two or more homogeneous sets.
 Leaf Node: Leaf nodes are the final output node, and the tree cannot be segregated
further after getting a leaf node.
 Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
 Branch/Sub Tree: A tree formed by splitting the tree.
 Pruning: Pruning is the process of removing the unwanted branches from the tree.
 Parent/Child node: The root node of the tree is called the parent node, and other nodes
are called the child nodes.

How does the Decision Tree algorithm Work?

In a decision tree, for predicting the class of the given dataset, the algorithm starts from the root
node of the tree. This algorithm compares the values of root attribute with the record (real
dataset) attribute and, based on the comparison, follows the branch and jumps to the next node.

For the next node, the algorithm again compares the attribute value with those of the other
sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree. The complete
process can be better understood using the below algorithm:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure
(ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
step 3. Continue this process until a stage is reached where you cannot further classify
the nodes; the final node is called a leaf node.

Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node. Finally,
the decision node splits into two leaf nodes (Accepted offers and Declined offer). Consider the
below diagram:

Attribute Selection Measures

While implementing a decision tree, the main issue that arises is how to select the best attribute
for the root node and for the sub-nodes. To solve such problems there is a technique called
the Attribute Selection Measure (ASM). With this measurement, we can easily select the
best attribute for the nodes of the tree. There are two popular techniques for ASM, which are:

o Information Gain
o Gini Index

1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the below formula:

1. Information Gain = Entropy(S) − [(Weighted Avg.) × Entropy(each feature)]

Entropy: Entropy is a metric to measure the impurity in a given attribute. It specifies


randomness in data. Entropy can be calculated as:

Entropy(S) = −P(yes) log2 P(yes) − P(no) log2 P(no)

Where,

o S= Total number of samples


o P(yes)= probability of yes
o P(no)= probability of no

2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o Gini index can be calculated using the below formula:

Gini Index = 1 − ∑_j P_j²
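A minimal sketch computing both measures from the class counts at a node; the counts are illustrative:

```python
import numpy as np

def entropy(counts):
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()               # class proportions (skip empty classes)
    return -(p * np.log2(p)).sum()

def gini(counts):
    p = np.asarray(counts, dtype=float) / np.sum(counts)
    return 1.0 - np.sum(p ** 2)

print(entropy([5, 5]))    # 1.0  (equally divided sample)
print(entropy([10, 0]))   # ≈ 0  (completely homogeneous sample)
print(gini([5, 5]))       # 0.5
```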


Pruning: Getting an Optimal Decision tree

Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.

A tree that is too large increases the risk of overfitting, while a small tree may not capture all the
important features of the dataset. A technique that decreases the size of the learning
tree without reducing accuracy is known as pruning. There are mainly two types of
tree pruning techniques used:

o Cost Complexity Pruning


o Reduced Error Pruning.

Advantages of the Decision Tree


o It is simple to understand as it follows the same process which a human follows while
making any decision in real life.
o It can be very useful for solving decision-related problems.
o It helps to think about all the possible outcomes for a problem.
o There is less requirement of data cleaning compared to other algorithms.

Disadvantages of the Decision Tree


o The decision tree contains lots of layers, which makes it complex.
o It may have an overfitting issue, which can be resolved using the Random Forest
algorithm.
o For more class labels, the computational complexity of the decision tree may increase.

Random Forest Algorithm

Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on
the concept of ensemble learning, which is a process of combining multiple classifiers to solve
a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of predictions, predicts the final
output.

The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.

The below diagram explains the working of the Random Forest algorithm:

Note: To better understand the Random Forest Algorithm, you should have knowledge of
the Decision Tree Algorithm.
Assumptions for Random Forest

Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all
the trees predict the correct output. Therefore, below are two assumptions for a better Random
forest classifier:

o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.

Why use Random Forest?

Below are some points that explain why we should use the Random Forest algorithm:

<="" li="">
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, even for the large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision
trees, and the second is to make predictions using each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select K random data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given
to the Random forest classifier. The dataset is divided into subsets and given to each decision
tree. During the training phase, each decision tree produces a prediction result, and when a new
data point occurs, then based on the majority of results, the Random Forest classifier predicts
the final decision. Consider the below image:
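A minimal sketch of this workflow with scikit-learn's RandomForestClassifier; a built-in toy dataset stands in for the fruit images:

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = N decision trees, each grown on a bootstrap subset of the data
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)
print(forest.predict(X_test[:5]))     # majority vote across the trees
print(forest.score(X_test, y_test))   # accuracy on unseen data
```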
Applications of Random Forest

There are mainly four sectors where Random forest is mostly used:

1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest


o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest


o Although random forest can be used for both classification and regression tasks, it is not
as well suited to regression tasks.
Naive Bayes
Bayesian Decision Theory

The Bayesian framework assumes that we always have a prior distribution for everything.
– The prior may be very vague.
– When we see some data, we combine our prior distribution with a likelihood term to get a
posterior distribution.
– The likelihood term takes into account how probable the observed data is given the
parameters of the model.
• It favors parameter settings that make the data likely.
• It fights the prior.
• With enough data, the likelihood term always wins.
Given database:
Example:
Losses and Risks:
Discriminant Functions

Classification can also be seen as implementing a set of discriminant functions,
gi(x), i = 1, . . ., K, such that we choose class Ci if gi(x) = max_k gk(x).

Building a Model Using Naive Bayes


Bayesian networks
A Bayesian network, Bayes network, belief network, Bayes(ian)
model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a
type of statistical model) that represents a set of variables and their conditional
dependencies via a directed acyclic graph (DAG). For example, a Bayesian network could
represent the probabilistic relationships between diseases and symptoms. Given symptoms, the
network can be used to compute the probabilities of the presence of various diseases.
Bayesian Net Example:
Consider the following Bayesian network:

Thus, the independence relations expressed in this Bayesian net are that A and B
are (absolutely) independent.
C is independent of B given A.
D is independent of C given A and B.
E is independent of A, B, and D given C.

Suppose that the net further records the following probabilities:


Prob(A=T) = 0.3
Prob(B=T) = 0.6
Prob(C=T|A=T) = 0.8
Prob(C=T|A=F) = 0.4
Prob(D=T|A=T,B=T) = 0.7
Prob(D=T|A=T,B=F) = 0.8
Prob(D=T|A=F,B=T) = 0.1
Prob(D=T|A=F,B=F) = 0.2
Prob(E=T|C=T) = 0.7
Prob(E=T|C=F) = 0.2

Some sample computations:

Prob(D=T):

P(D=T) =

P(D=T,A=T,B=T) + P(D=T,A=T,B=F) + P(D=T,A=F,B=T) + P(D=T,A=F,B=F) =


P(D=T|A=T,B=T) P(A=T,B=T) + P(D=T|A=T,B=F) P(A=T,B=F) +P(D=T|A=F,B=T)
P(A=F,B=T) + P(D=T|A=F,B=F) P(A=F,B=F) =
(since A and B are independent absolutely)

P(D=T|A=T,B=T) P(A=T) P(B=T) + P(D=T|A=T,B=F) P(A=T) P(B=F) +P(D=T|A=F,B=T)


P(A=F) P(B=T) + P(D=T|A=F,B=F) P(A=F) P(B=F) =

0.7*0.3*0.6 + 0.8*0.3*0.4 + 0.1*0.7*0.6 + 0.2*0.7*0.4 = 0.32


Prob(A=T|C=T):
P(A=T|C=T) = P(C=T|A=T)P(A=T) / P(C=T).

Now, P(C=T) = P(C=T,A=T) + P(C=T,A=F) =


P(C=T|A=T)P(A=T) + P(C=T|A=F)P(A=F) =
0.8*0.3+ 0.4*0.7 = 0.52

So P(C=T|A=T)P(A=T) / P(C=T) = 0.8*0.3/0.52= 0.46.
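A minimal sketch that reproduces these two computations by summing over the conditional probability tables listed above:

```python
# Conditional probability tables of the example network.
P_A = {True: 0.3, False: 0.7}
P_B = {True: 0.6, False: 0.4}
P_C_given_A = {True: 0.8, False: 0.4}                      # P(C=T | A)
P_D_given_AB = {(True, True): 0.7, (True, False): 0.8,
                (False, True): 0.1, (False, False): 0.2}   # P(D=T | A, B)

# P(D=T) = sum over A, B of P(D=T|A,B) P(A) P(B)   (A and B are independent)
p_d = sum(P_D_given_AB[(a, b)] * P_A[a] * P_B[b]
          for a in (True, False) for b in (True, False))
print(round(p_d, 2))                                       # 0.32

# P(A=T | C=T) = P(C=T|A=T) P(A=T) / P(C=T)
p_c = sum(P_C_given_A[a] * P_A[a] for a in (True, False))
print(round(P_C_given_A[True] * P_A[True] / p_c, 2))       # 0.46
```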

Association rule
Association rule mining is explained using the Apriori Algorithm.
Apriori Algorithm:
Confidence:
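For completeness, the two standard measures used by Apriori for a rule X ⇒ Y (with N total transactions) are:

Support(X ⇒ Y) = (number of transactions containing both X and Y) / N

Confidence(X ⇒ Y) = (number of transactions containing both X and Y) / (number of transactions containing X), which estimates P(Y | X).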

Parametric Methods

Parametric Estimation

 X = { x^t }, t = 1, …, N, where x^t ~ p(x)
 Parametric estimation:

Assume a form for p(x | θ) and estimate θ, its sufficient statistics, using X

e.g., N(μ, σ²) where θ = { μ, σ² }

Maximum Likelihood Estimation:

Likelihood of θ given the sample X

l(θ | X) = p(X | θ) = ∏_t p(x^t | θ)

Log likelihood

L(θ | X) = log l(θ | X) = ∑_t log p(x^t | θ)

Maximum likelihood estimator (MLE)

θ* = argmax_θ L(θ | X)

Examples: Bernoulli/Multinomial:

Gaussian (Normal) Distribution:
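As a worked instance of the MLE defined above (standard closed-form result): for a sample assumed to come from N(μ, σ²), maximizing the log likelihood gives the sample mean and sample variance,

m = (1/N) ∑_t x^t,  s² = (1/N) ∑_t (x^t − m)²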


Bias and Variance:

Classification
Regression
Linear Regression:

Other Error Measures:


Multivariate Methods

Data:

 Multiple measurements (sensors)


 d inputs/features/attributes: d-variate
 N instances/observations/examples

Multivariate Parameters:
Parameter Estimation

Classification
Estimating the mean and Variance,

Quadratic Discriminant:
Tuning Complexity

Discrete Features
Dimensionality Reduction

Necessity:

1. Reduces time complexity: Less computation


2. Reduces space complexity: Less parameters
3. Saves the cost of observing the feature
4. Simpler models are more robust on small datasets
5. More interpretable; simpler explanation
6. Data visualization (structure, groups, outliers, etc.) if plotted in 2 or 3 dimensions.
Feature Selection and Extraction:

Principal Component Analysis (PCA)
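A minimal sketch of PCA as a feature-extraction step with scikit-learn; the choice of two components and the built-in dataset are illustrative:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to feature scale

pca = PCA(n_components=2)                      # keep 2 principal components
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                         # (150, 2)
print(pca.explained_variance_ratio_)           # variance captured per component
```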


Factor Analysis
Multidimensional Scaling

Linear Discriminant Analysis


Fisher’s Linear Discriminant:
For K>2 Classes,

TEXT / REFERENCE BOOKS

1. Chris Albon, Machine Learning with Python Cookbook, O'Reilly Media, Inc., 2018
2. Stephen Marsland, “Machine Learning – An Algorithmic Perspective”, Second
Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014
3. Tom M. Mitchell, Machine Learning, India Edition 2013, McGraw Hill Education
4. Machine Learning: The art and Science of algorithms that make sense of data, Peter
Flach, Cambridge University Press, 2012
5. EthemAlpaydın, Introduction to machine learning, second edition, MIT press.
6. T. Hastie, R. Tibshirani and J. Friedman, “Elements of Statistical Learning”,
Springer Series , 2nd edition
7. Sebastian Raschka, "Python Machine Learning",Second Edition.Packt Publication
SCHOOL OF ELECTRICAL AND ELECTRONICS
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

UNIT-III SUPERVISED LEARNING


UNIT III

UNIT III SUPERVISED LEARNING

Regression, Types of Regression model, Building a Regressor in Python, Types of ML Algorithm, Linear Regression, Multiple Linear Regression, Non-linear Regression, Model evaluation methods.

REGRESSION:
Regression is a statistical modelling technique that describes the relationship between two or more variables; in linear regression this relationship is modelled by a straight line.
Regression analysis is a form of predictive modelling technique which investigates the relationship between a dependent variable and one or more independent variables. Typical questions it can answer include:
What is the strength of the relationship between sales and marketing spending?
What is the relationship between age and income?
Agricultural scientists often use linear regression to measure the effect of fertilizer and water on crop yields.
Data scientists for professional sports teams often use linear regression to measure the effect that different training regimens have on player performance.
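A minimal sketch of building a regressor in Python with scikit-learn; the synthetic data below stands in for, say, spending versus sales, and the true slope and intercept are made up for illustration.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Synthetic data: y is roughly 3x + 4 plus Gaussian noise
rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=(100, 1))
y = 3 * X[:, 0] + 4 + rng.normal(0, 1, size=100)

# Fit an ordinary least-squares line and inspect the learned parameters
model = LinearRegression()
model.fit(X, y)
y_pred = model.predict(X)

print("slope:", model.coef_[0], "intercept:", model.intercept_)
print("MSE:", mean_squared_error(y, y_pred), "R^2:", r2_score(y, y_pred))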

Example:
Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want to know whether it is a cat or a dog. For this identification we can use the KNN algorithm, since it works on a similarity measure. Our KNN model will find the features of the new image that are most similar to those of the cat and dog images and, based on the most similar features, assign it to either the cat or the dog category.
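A sketch of the same idea with scikit-learn's KNeighborsClassifier; the bundled Iris measurements stand in here for image features, purely for illustration.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

# k = 5 neighbours, majority vote with Euclidean distance (the default metric)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)
print("test accuracy:", knn.score(X_test, y_test))

# Classify a new, unseen instance by looking at its 5 nearest training instances
print("predicted class:", knn.predict([[5.0, 3.4, 1.6, 0.4]]))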

Clustering

• Cluster: A collection of data objects


o similar (or related) to one another within the same group
o dissimilar (or unrelated) to the objects in other groups
• Cluster analysis (or clustering, data segmentation, …)
o Finding similarities between data according to the characteristics found in the data
and grouping similar data objects into clusters
• Unsupervised learning: no predefined classes (i.e., learning by observations vs. learning
by examples: supervised)
• A good clustering method will produce high quality clusters
o high intra-class similarity: cohesive within clusters
o low inter-class similarity: distinctive between clusters
• The quality of a clustering method depends on
o the similarity measure used by the method
o its implementation, and
o its ability to discover some or all of the hidden patterns.

Major Clustering Approaches

• Partitioning approach:
o Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
o Typical methods: k-means, k-medoids, CLARANS

• Hierarchical approach:
o Create a hierarchical decomposition of the set of data (or objects) using some
criterion
o Typical methods: DIANA, AGNES, BIRCH, CAMELEON

Mixture densities
Partitioning method

• Partitioning a database D of n objects into a set of k clusters, such that the sum of squared
distances is minimized (where ci is the centroid or medoid of cluster Ci)
• Given k, find a partition of k clusters that optimizes the chosen partitioning criterion

The K-Means Clustering Method

• Given k, the k-means algorithm is implemented in four steps:


o Partition objects into k nonempty subsets
o Compute seed points as the centroids of the clusters of the current partitioning
(the centroid is the center, i.e., mean point, of the cluster)
o Assign each object to the cluster with the nearest seed point
o Go back to Step 2, stop when the assignment does not change
Example: starting from an initial partition, the steps above are repeated; the iteration stops when a pass produces the same set of clusters as the previous step.
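A sketch of the four steps above using scikit-learn's KMeans on synthetic 2-D data; the blobs generated below are illustrative, not the worked example from the notes.

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

# Three well-separated 2-D blobs (illustrative data)
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.8, random_state=7)

# k-means alternates assignment and centroid re-computation until assignments stop changing
kmeans = KMeans(n_clusters=3, n_init=10, random_state=7)
labels = kmeans.fit_predict(X)

print("cluster centroids:\n", kmeans.cluster_centers_)
print("within-cluster sum of squares:", kmeans.inertia_)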
Agglomerative Clustering:

Start with N groups, each containing one instance, and merge the two closest groups at each iteration.
Distance between two groups G_i and G_j:

Single-link clustering:
d(G_i, G_j) = min_{x^r ∈ G_i, x^s ∈ G_j} d(x^r, x^s)

Complete-link clustering:
d(G_i, G_j) = max_{x^r ∈ G_i, x^s ∈ G_j} d(x^r, x^s)

Average-link clustering uses the average pairwise distance; centroid clustering uses the distance between group centroids.


Dendrogram: Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram.
A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster.

Example of Single Linkage Clustering:

1. Given a dataset, first find the distance matrix using the Euclidean distance measure.
2. After finding the distance matrix, find the smallest element in the distance matrix.
3. Merge these two points to form a cluster.
4. Next, find the minimum distance between the new cluster and all other points.
5. Now frame the new distance matrix, find the smallest value in it, and merge these points to form a cluster. The process repeats until all points are grouped into clusters.
6. Finally, draw the dendrogram.
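These steps can be reproduced with SciPy's hierarchical clustering routines; the tiny 2-D dataset below is made up for illustration.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Tiny illustrative dataset
X = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [5.5, 5.0], [3.0, 4.0]])

# Steps 1-2: Euclidean distance matrix (condensed form)
D = pdist(X, metric="euclidean")

# Steps 3-5: repeatedly merge the closest pair using single (minimum) linkage
Z = linkage(D, method="single")

# Step 6: the merge history Z encodes the dendrogram; cut it to obtain flat clusters
# (scipy.cluster.hierarchy.dendrogram(Z) would draw the tree when matplotlib is available)
print(fcluster(Z, t=2, criterion="maxclust"))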

Nonparametric Methods

⚫ Parametric (single global model), semiparametric (small number of local models)


⚫ Nonparametric: Similar inputs have similar outputs
⚫ Functions (pdf, discriminant, regression) change smoothly
⚫ Keep the training data;“let the data speak for itself”
⚫ Given x, find a small number of closest training instances and interpolate from these
⚫ Aka lazy/memory-based/case-based/instance-based learning

Density Estimation

⚫ Given the training set X={xt}t drawn iid from p(x)


⚫ Divide data into bins of size h
⚫ Histogram estimator:

p̂(x) = #{x^t in the same bin as x} / (N h)

⚫ Naive estimator:

p̂(x) = #{x − h < x^t ≤ x + h} / (2 N h)

or, equivalently,

p̂(x) = (1/(N h)) Σ_{t=1}^{N} w((x − x^t)/h)

with the weight function defined as w(u) = 1/2 if |u| < 1, and 0 otherwise.

Kernel Estimator:
⚫ Kernel function, e.g., the Gaussian kernel:

K(u) = (1/√(2π)) exp(−u²/2)

⚫ Kernel estimator (Parzen windows):

p̂(x) = (1/(N h)) Σ_{t=1}^{N} K((x − x^t)/h)




k-Nearest Neighbor Estimator:

⚫ Instead of fixing the bin width h and counting the number of instances, fix the number of instances (neighbors) k and let the bin width adapt:

p̂(x) = k / (2 N d_k(x))

where d_k(x) is the distance to the k-th closest instance to x.

The nearest neighbor class of estimators adapts the amount of smoothing to the local density of data. The degree of smoothing is controlled by k, the number of neighbors taken into account, which is much smaller than N, the sample size. Let us define a distance between a and b, for example, |a − b|; then for each x we define d_k(x) as the distance from x to its k-th nearest sample.
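A NumPy sketch of the univariate estimators above (naive window and Gaussian-kernel Parzen windows) on a synthetic sample; the data and bandwidth are made up for illustration.

import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=500)     # training data drawn from p(x)

def naive_estimator(x, data, h):
    # p_hat(x) = #{x-h < x_t <= x+h} / (2 N h)
    return np.mean((data > x - h) & (data <= x + h)) / (2 * h)

def kernel_estimator(x, data, h):
    # Parzen windows with a Gaussian kernel: p_hat(x) = (1/(N h)) sum K((x - x_t)/h)
    u = (x - data) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
    return K.mean() / h

for x in (-1.0, 0.0, 1.0):
    print(x, naive_estimator(x, sample, h=0.5), kernel_estimator(x, sample, h=0.5))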
Generalization to multivariate data

• Kernel density estimator:

p̂(x) = (1/(N h^d)) Σ_{t=1}^{N} K((x − x^t)/h)

• Multivariate Gaussian kernel:

• Spheric: K(u) = (1/(2π)^{d/2}) exp(−‖u‖²/2)

• Ellipsoid: K(u) = (1/((2π)^{d/2} |S|^{1/2})) exp(−(1/2) uᵀ S⁻¹ u)


where S is the sample covariance matrix. This corresponds to using the Mahalanobis distance instead of the Euclidean distance. It is also possible to make the distance metric local, where S is calculated from instances in the vicinity of x, for example, some k closest instances.
Note that S calculated locally may be singular and PCA (or LDA, in the case of classification) may be needed.

Hamming distance:

When the inputs are discrete, we can use Hamming distance, which counts the number of
nonmatching attributes.
Nonparametric Classification

• Estimate p(x|C_i) and use Bayes' rule.

• Kernel estimator:

p̂(x|C_i) = (1/(N_i h^d)) Σ_{t=1}^{N} K((x − x^t)/h) r_i^t,     P̂(C_i) = N_i / N

g_i(x) = p̂(x|C_i) P̂(C_i) = (1/(N h^d)) Σ_{t=1}^{N} K((x − x^t)/h) r_i^t

• k-NN estimator:

p̂(x|C_i) = k_i / (N_i V^k(x)),     P̂(C_i|x) = p̂(x|C_i) P̂(C_i) / p̂(x) = k_i / k

where r_i^t is 1 if x^t ∈ C_i and 0 otherwise, k_i is the number of the k nearest neighbors of x that belong to class C_i, and V^k(x) is the volume of the d-dimensional hypersphere whose radius is the distance from x to its k-th nearest neighbor.
The k-nn classifier assigns the input to the class having most examples among the k neighbors of
the input. All neighbors have equal vote, and the class having the maximum number of voters
among the k neighbors is chosen. Ties are broken arbitrarily or a weighted vote is taken. K is
generally taken to be an odd number to minimize ties: confusion is generally between two
neighboring classes. Again, the use of Euclidean distance corresponds to assuming uncorrelated
inputs with equal variances, and when this is not the case a suitable discriminant metric should
be used. One example is discriminant adaptive nearest neighbor where the optimal distance to
separate classes is estimated locally.

Nonparametric Regression

⚫ Aka smoothing models


⚫ Regressogram
Running Mean/Kernel Smoother:

Instead of taking an average and giving a constant fit at a point, we can take into account one
more term in the Taylor expansion and calculate a linear fit. In the running line smoother, we can
use the data points in the neighborhood, as defined by h or k, and fit a local regression line. In
the locally weighted running line smoother, known as loess, instead of a hard definition of
neighborhoods, we use kernel weighting such that distant points have less effect on error.
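A NumPy sketch of such a kernel smoother with Gaussian weights on synthetic data; loess additionally fits a local regression line, which is omitted here for brevity.

import numpy as np

rng = np.random.default_rng(1)
x = np.sort(rng.uniform(0, 10, size=200))
y = np.sin(x) + rng.normal(0, 0.3, size=200)     # noisy regression data

def kernel_smoother(x_query, x_train, y_train, h=0.5):
    # Weighted average of the y's, with nearby points weighted more heavily
    w = np.exp(-0.5 * ((x_query - x_train) / h) ** 2)
    return np.sum(w * y_train) / np.sum(w)

for xq in (1.0, 2.5, 5.0):
    print(xq, kernel_smoother(xq, x, y))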

Decision Trees
Algorithm:
Attribute Selection Measure: commonly information gain, gain ratio, or the Gini index; at each node the attribute with the best score is chosen to split the data.

Example: (worked example on the buys_computer data) After further iterations, the final decision tree is obtained, from which classification rules can be read off.
Univariate Trees
Multivariate Trees

Learning rules from data

From the decision tree, the following rules are identified.

If age=youth and student=no then buys_computer=no.
If age=youth and student=yes then buys_computer=yes.
If age=middle_aged then buys_computer=yes.
If age=senior and credit_rating=fair then buys_computer=no.
If age=senior and credit_rating=excellent then buys_computer=yes.

• Rule induction is similar to tree induction but


o tree induction is breadth-first,
o rule induction is depth-first; one rule at a time
• Rule set contains rules; rules are conjunctions of terms
• Rule covers an example if all terms of the rule evaluate to true for the example
• Sequential covering: Generate rules one at a time until all positive examples are covered.
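A sketch of learning such a tree and printing its rules with scikit-learn; the numeric encoding of the buys_computer attributes below is made up for illustration, with the labels chosen to follow the rules listed above.

import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy encoding: age (0=youth, 1=middle_aged, 2=senior), student (0/1), credit_rating (0=fair, 1=excellent)
X = np.array([[0, 0, 0], [0, 1, 0], [1, 0, 0], [1, 1, 1],
              [2, 0, 0], [2, 0, 1], [2, 1, 0], [0, 0, 1]])
y = np.array([0, 1, 1, 1, 0, 1, 0, 0])   # buys_computer: 0 = no, 1 = yes (labels follow the rules above)

# Grow a tree using the entropy (information gain) criterion
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X, y)

# Print the learned tree as nested if/then rules
print(export_text(tree, feature_names=["age", "student", "credit_rating"]))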
Linear Discrimination

Likelihood- vs. Discriminant-based Classification:

⚫ Likelihood-based: Assume a model for p(x|C_i) and use Bayes' rule to calculate P(C_i|x); the discriminant is g_i(x) = log P(C_i|x).

⚫ Discriminant-based: Assume a model directly for g_i(x|Φ_i); no density estimation.

⚫ Estimating the boundaries is enough; there is no need to accurately estimate the densities inside the boundaries.

Linear Discriminant:

⚫ Linear discriminant:

g_i(x | w_i, w_i0) = w_iᵀ x + w_i0 = Σ_{j=1}^{d} w_ij x_j + w_i0

⚫ Advantages:

⚫ Simple: O(d) space/computation

⚫ Knowledge extraction: Weighted sum of attributes; positive/negative weights and their magnitudes (e.g., credit scoring)

⚫ Optimal when p(x|Ci) are Gaussian with shared cov matrix; useful when classes
are (almost) linearly separable

Generalized Linear Model:

⚫ Quadratic discriminant:

g_i(x | W_i, w_i, w_i0) = xᵀ W_i x + w_iᵀ x + w_i0

⚫ Higher-order (product) terms:

z_1 = x_1, z_2 = x_2, z_3 = x_1², z_4 = x_2², z_5 = x_1 x_2

⚫ Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:

g_i(x) = Σ_{j=1}^{k} w_ij φ_j(x)


Two Classes:

g(x) = g_1(x) − g_2(x)
     = (w_1ᵀ x + w_10) − (w_2ᵀ x + w_20)
     = (w_1 − w_2)ᵀ x + (w_10 − w_20)
     = wᵀ x + w_0

choose C_1 if g(x) > 0, C_2 otherwise
Multiple Classes:

Sigmoid (Logistic) Function: sigmoid(a) = 1 / (1 + exp(−a)), which maps a real-valued discriminant onto (0, 1) so that it can be interpreted as a posterior probability.

Gradient-Descent: the weights are updated iteratively in the direction of the negative gradient of the error, w ← w − η ∂E/∂w, where η is the learning rate.
Logistic Discrimination: model the log ratio of the class posteriors (the logit) as a linear function of x and learn the weights by maximizing the likelihood (equivalently, minimizing the cross-entropy error) using gradient descent.
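A NumPy sketch of two-class logistic discrimination trained by gradient descent; the two Gaussian classes, learning rate, and number of epochs below are made up for illustration, and the gradient used is the standard cross-entropy gradient.

import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes in 2-D (synthetic, roughly linearly separable)
X0 = rng.normal([-1, -1], 1.0, size=(100, 2))
X1 = rng.normal([+1, +1], 1.0, size=(100, 2))
X = np.vstack([X0, X1])
r = np.hstack([np.zeros(100), np.ones(100)])           # class labels

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

w = np.zeros(2)
w0 = 0.0
eta = 0.1                                              # learning rate

for epoch in range(200):
    y = sigmoid(X @ w + w0)                            # P(C1 | x)
    grad_w = X.T @ (y - r) / len(r)                    # gradient of the cross-entropy error
    grad_w0 = np.mean(y - r)
    w -= eta * grad_w
    w0 -= eta * grad_w0

pred = (sigmoid(X @ w + w0) > 0.5).astype(int)
print("training accuracy:", np.mean(pred == r))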

TEXT / REFERENCE BOOKS

1. Chris Albon, Machine Learning with Python Cookbook, O'Reilly Media, Inc., 2018
2. Stephen Marsland, “Machine Learning – An Algorithmic Perspective”, Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014
3. Tom M. Mitchell, Machine Learning, India Edition 2013, McGraw Hill Education
4. Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data, Cambridge University Press, 2012
5. Ethem Alpaydın, Introduction to Machine Learning, Second Edition, MIT Press
6. T. Hastie, R. Tibshirani and J. Friedman, “Elements of Statistical Learning”, Springer Series, 2nd Edition
7. Sebastian Raschka, "Python Machine Learning", Second Edition, Packt Publication
SCHOOL OF ELECTRICAL AND ELECTRONICS
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

UNIT-IV K-MEANS CLUSTERING


UNIT IV

UNIT IV K-MEANS CLUSTERING

Working of K-Means Clustering Algorithm, Advantages and Disadvantages, Applications of K-Means Clustering Algorithm, Hierarchical Clustering, Steps to Perform Agglomerative Hierarchical Clustering, Role of Dendrograms in Agglomerative Hierarchical Clustering, Density-Based Clustering.
Working of the K-Means clustering algorithm: as described in Unit III, the algorithm alternates between assigning each point to its nearest centroid and recomputing the centroids; the iteration stops when a pass produces the same set of clusters as the previous step.
Advantages and Disadvantages
Advantages
The following are some advantages of the K-Means clustering algorithm −
 It is very easy to understand and implement.
 If we have a large number of variables, K-means is faster than hierarchical clustering.
 On re-computation of centroids, an instance can change its cluster.
 Tighter clusters are formed with K-means as compared to hierarchical clustering.
Disadvantages
The following are some disadvantages of the K-Means clustering algorithm −
 It is somewhat difficult to predict the number of clusters, i.e. the value of k.
 The output is strongly impacted by the initial inputs, such as the number of clusters (value of k).
 The order of the data can have a strong impact on the final output.
 It is very sensitive to rescaling: if we rescale the data by normalization or standardization, the output can change completely.
 It does not perform well when the clusters have a complicated geometric shape.
Applications of K-Means Clustering Algorithm
The main goals of cluster analysis are −
 To get a meaningful intuition from the data we are working with.
 Cluster-then-predict where different models will be built for different subgroups.
To fulfill the above-mentioned goals, K-means clustering performs well enough. It can be used in the following applications −

 Market segmentation
 Document Clustering
 Image segmentation
 Image compression
 Customer segmentation
 Analyzing the trend on dynamic data
Hierarchical Clustering

Cluster based on similarities/distances


Distance measure between instances xr and xs

Minkowski distance (L_p) (Euclidean for p = 2):

d_m(x^r, x^s) = [ Σ_{j=1}^{d} (x_j^r − x_j^s)^p ]^{1/p}

City-block distance (L_1):

d_cb(x^r, x^s) = Σ_{j=1}^{d} | x_j^r − x_j^s |
• Two important paradigms:
– Bottom-up agglomerative clustering(Agglomerative Nesting-AGNES)
– Top-down divisive clustering(Divisive Analysis-DIANA)

Introduction to Hierarchical Clustering


Hierarchical clustering is another unsupervised learning algorithm that is used to
group together the unlabeled data points having similar characteristics.
Hierarchical clustering algorithms fall into the following two categories.
Agglomerative hierarchical algorithms − In agglomerative hierarchical algorithms, each data point is initially treated as a single cluster, and pairs of clusters are then successively merged or agglomerated (bottom-up approach). The hierarchy of the clusters is represented as a dendrogram or tree structure.
Divisive hierarchical algorithms − On the other hand, in divisive hierarchical algorithms, all the data points are treated as one big cluster, and the process of clustering involves dividing (top-down approach) the one big cluster into various small clusters.
Steps to Perform Agglomerative Hierarchical Clustering
We are going to explain the most used and important hierarchical clustering, i.e. agglomerative. The steps to perform it are as follows −
 Step 1 − Treat each data point as a single cluster. Hence, we will have, say, K clusters at the start; the number of data points will also be K.
 Step 2 − Now, in this step we form a bigger cluster by joining the two closest data points. This results in a total of K-1 clusters.
 Step 3 − Next, to form more clusters, we join the two closest clusters. This results in a total of K-2 clusters.
 Step 4 − Repeat the above steps until all points have been merged into one big cluster, i.e. there are no more clusters left to join.
 Step 5 − At last, after making one single big cluster, a dendrogram is used to divide it into multiple clusters depending upon the problem.
Role of Dendrograms in Agglomerative Hierarchical Clustering
As we discussed in the last step, the role of dendrogram starts once the big cluster
is formed. Dendrogram will be used to split the clusters into multiple cluster of
related data points depending upon our problem.
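The steps above map directly onto scikit-learn's AgglomerativeClustering, as in the sketch below on synthetic data; the dendrogram itself is usually drawn with SciPy, as in the earlier single-linkage example.

from sklearn.datasets import make_blobs
from sklearn.cluster import AgglomerativeClustering

# Synthetic 2-D data with three groups (illustrative only)
X, _ = make_blobs(n_samples=150, centers=3, random_state=3)

# Bottom-up (AGNES-style) merging until 3 clusters remain, using average linkage
agg = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = agg.fit_predict(X)
print(labels[:20])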

What is Agglomerative Hierarchical Clustering


Agglomerative Hierarchical Clustering (AHC) is a clustering (or classification) method which
has the following advantages:
It works from the dissimilarities between the objects to be grouped together. A type of
dissimilarity can be suited to the subject studied and the nature of the data.
One of the results is the dendrogram which shows the progressive grouping of the data. It is
then possible to gain an idea of a suitable number of classes into which the data can be
grouped.
How does Agglomerative Hierarchical Clustering work
Agglomerative Hierarchical Clustering (AHC) is an iterative classification method whose
principle is simple.
The process starts by calculating the dissimilarity between the N objects.
Then two objects which when clustered together minimize a given agglomeration criterion, are
clustered together thus creating a class comprising these two objects.
Then the dissimilarity between this class and the N-2 other objects is calculated using the
agglomeration criterion. The two objects or classes of objects whose clustering together
minimizes the agglomeration criterion are then clustered together.

Agglomerative clustering is the most common type of hierarchical clustering used to group objects into clusters based on their similarity. It is also known as AGNES (Agglomerative Nesting). The algorithm starts by treating each object as a singleton cluster. Next, pairs of clusters are successively merged until all clusters have been merged into one big cluster containing all objects. The result is a tree-based representation of the objects, named a dendrogram.

Clustering analysis, or simply clustering, is basically an unsupervised learning method that divides the data points into a number of specific batches or groups, such that the data points in the same group have similar properties and data points in different groups have different properties in some sense. It comprises many different methods based on different notions of similarity, e.g. K-Means (distance between points), affinity propagation (graph distance), mean-shift (distance between points), DBSCAN (distance between nearest points), Gaussian mixtures (Mahalanobis distance to centers), spectral clustering (graph distance), etc.
Fundamentally, all clustering methods use the same approach: first we calculate similarities and then we use them to cluster the data points into groups or batches. Here we will focus on the Density-Based Spatial Clustering of Applications with Noise (DBSCAN) clustering method.
Clusters are dense regions in the data space, separated by regions of the lower density of
points. The DBSCAN algorithm is based on this intuitive notion of “clusters” and “noise”.
The key idea is that for each point of a cluster, the neighborhood of a given radius has to
contain at least a minimum number of points.

Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for compact
and well-separated clusters. Moreover, they are also severely affected by the presence of noise
and outliers in the data.
Real life data may contain irregularities, like –
i) Clusters can be of arbitrary shape such as those shown in the figure below.
ii) Data may contain noise.

The figure below shows a data set containing non-convex clusters and outliers/noise. Given such data, the k-means algorithm has difficulty identifying these clusters with arbitrary shapes.
DBSCAN algorithm requires two parameters –
1. eps : It defines the neighborhood around a data point i.e. if the distance between two
points is lower or equal to ‘eps’ then they are considered as neighbors. If the eps value is
chosen too small then large part of the data will be considered as outliers. If it is chosen very
large then the clusters will merge and majority of the data points will be in the same clusters.
One way to find the eps value is based on the k-distance graph.
2. MinPts: Minimum number of neighbors (data points) within eps radius. Larger the
dataset, the larger value of MinPts must be chosen. As a general rule, the minimum MinPts
can be derived from the number of dimensions D in the dataset as, MinPts >= D+1. The
minimum value of MinPts must be chosen at least 3.

In this algorithm, we have 3 types of data points.


Core Point: A point is a core point if it has more than MinPts points within eps.
Border Point: A point which has fewer than MinPts within eps but it is in the neighborhood
of a core point.
Noise or outlier: A point which is not a core point or border point.

DBSCAN algorithm can be abstracted in the following steps –


1. Find all the neighbor points within eps and identify the core points or visited with
more than MinPts neighbors.
2. For each core point if it is not already assigned to a cluster, create a new cluster.
3. Find recursively all its density connected points and assign them to the same cluster
as the core point.
Two points a and b are said to be density connected if there exists a point c which has a sufficient number of points in its neighborhood and both a and b are within the eps distance of it. This is a chaining process: if b is a neighbor of c, c is a neighbor of d, and d is a neighbor of e, which in turn is a neighbor of a, then b is connected to a.
4. Iterate through the remaining unvisited points in the dataset. Those points that do not
belong to any cluster are noise.
Why do we need a Density-Based clustering algorithm like DBSCAN when we already
have K-means clustering?
K-Means clustering may cluster loosely related observations together. Every observation
becomes a part of some cluster eventually, even if the observations are scattered far away in
the vector space. Since clusters depend on the mean value of cluster elements, each data point
plays a role in forming the clusters. A slight change in data points might affect the clustering
outcome. This problem is greatly reduced in DBSCAN due to the way clusters are formed.
This is usually not a big problem unless we come across some odd shape data.
Another challenge with k-means is that you need to specify the number of clusters (“k”) in
order to use it. Much of the time, we won’t know what a reasonable k value is a priori.
What’s nice about DBSCAN is that you don’t have to specify the number of clusters to use it.
All you need is a function to calculate the distance between values and some guidance for what
amount of distance is considered “close”. DBSCAN also produces more reasonable results
than k-means across a variety of different distributions. Below figure illustrates the fact:

Density-Based Clustering Algorithms

Density-Based Clustering refers to unsupervised learning methods that identify


distinctive groups/clusters in the data, based on the idea that a cluster in data
space is a contiguous region of high point density, separated from other such
clusters by contiguous regions of low point density.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a
base algorithm for density-based clustering. It can discover clusters of different
shapes and sizes from a large amount of data, which is containing noise and
outliers.
The DBSCAN algorithm uses two parameters:
 minPts: The minimum number of points (a threshold) clustered together for a region to
be considered dense.
 eps (ε): A distance measure that will be used to locate the points in the neighborhood of
any point.
These parameters can be understood if we explore two concepts called Density Reachability
and Density Connectivity.
Reachability in terms of density establishes a point to be reachable from another if it lies
within a particular distance (eps) from it.
Connectivity, on the other hand, involves a transitivity based chaining-approach to determine
whether points are located in a particular cluster. For example, p and q points could be
connected if p->r->s->t->q, where a->b means b is in the neighborhood of a.
There are three types of points after the DBSCAN clustering is complete:
 Core — This is a point that has at least m points within distance n from itself.
 Border — This is a point that has at least one Core point at a distance n.
 Noise — This is a point that is neither a Core nor a Border. And it has less
than m points within distance n from itself.

Algorithmic steps for DBSCAN clustering


 The algorithm proceeds by arbitrarily picking up a point in the dataset (until all points
have been visited).
 If there are at least ‘minPoint’ points within a radius of ‘ε’ to the point then we consider
all these points to be part of the same cluster.
 The clusters are then expanded by recursively repeating the neighborhood calculation
for each neighboring point
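A sketch of these steps with scikit-learn's DBSCAN on a non-convex dataset (two interleaving half-moons) where k-means typically struggles; the eps and min_samples values below are illustrative choices, not tuned parameters.

from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Non-convex clusters plus noise
X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

# eps = neighborhood radius, min_samples = MinPts
db = DBSCAN(eps=0.2, min_samples=5)
labels = db.fit_predict(X)               # label -1 marks noise points

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("noise points  :", list(labels).count(-1))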

DBSCAN in action
Parameter Estimation

Every data mining task has the problem of parameters. Every parameter influences the
algorithm in specific ways. For DBSCAN, the parameters ε and minPts are needed.
 minPts: As a rule of thumb, a minimum minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1. The low value minPts = 1 does not make
sense, as then every point on its own will already be a cluster. With minPts ≤ 2, the result will
be the same as of hierarchical clustering with the single link metric, with the dendrogram cut at
height ε. Therefore, minPts must be chosen at least 3. However, larger values are usually better
for data sets with noise and will yield more significant clusters. As a rule of thumb, minPts =
2·dim can be used, but it may be necessary to choose larger values for very large data, for
noisy data or for data that contains many duplicates.
 ε: The value for ε can then be chosen by using a k-distance graph, plotting the distance
to the k = minPts-1 nearest neighbor ordered from the largest to the smallest value. Good
values of ε are where this plot shows an “elbow”: if ε is chosen much too small, a large part of
the data will not be clustered; whereas for a too high value of ε, clusters will merge and the
majority of objects will be in the same cluster. In general, small values of ε are preferable, and
as a rule of thumb, only a small fraction of points should be within this distance of each other.
 Distance function: The choice of distance function is tightly linked to the choice of ε,
and has a major impact on the outcomes. In general, it will be necessary to first identify a
reasonable measure of similarity for the data set, before the parameter ε can be chosen. There is
no estimation for this parameter, but the distance functions need to be chosen appropriately for
the data set.

TEXT / REFERENCE BOOKS

1. Chris Albon, Machine Learning with Python Cookbook, O'Reilly Media, Inc., 2018
2. Stephen Marsland, “Machine Learning – An Algorithmic Perspective”, Second Edition, Chapman and Hall/CRC Machine Learning and Pattern Recognition Series, 2014
3. Tom M. Mitchell, Machine Learning, India Edition 2013, McGraw Hill Education
4. Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data, Cambridge University Press, 2012
5. Ethem Alpaydın, Introduction to Machine Learning, Second Edition, MIT Press
6. T. Hastie, R. Tibshirani and J. Friedman, “Elements of Statistical Learning”, Springer Series, 2nd Edition
7. Sebastian Raschka, "Python Machine Learning", Second Edition, Packt Publication
SCHOOL OF ELECTRICAL AND ELECTRONICS
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING

UNIT V – DIMENSIONALITY REDUCTIONS & COLLABORATIVE FILTERING
UNIT V

UNIT V DIMENSIONALITY REDUCTIONS & COLLABORATIVE FILTERING

Dimensionality Reduction, Feature Extraction & Selection, Linear Discriminant Analysis, Principal Component Analysis, Factor Analysis, Independent Component Analysis, Locally Linear Embedding, Least Squares Optimization, Collaborative Filtering & Its Challenges.

What is Dimensionality Reduction?


In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features. The
higher the number of features, the harder it gets to visualize the training set and then work on
it. Sometimes, most of these features are correlated, and hence redundant. This is where
dimensionality reduction algorithms come into play. Dimensionality reduction is the process of
reducing the number of random variables under consideration, by obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.
Why is Dimensionality Reduction important in Machine Learning and
Predictive Modeling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the
content of the e-mail, whether the e-mail uses a template, etc. However, some of these
features may overlap. In another condition, a classification problem that relies on both
humidity and rainfall can be collapsed into just one underlying feature, since both of the
aforementioned are correlated to a high degree. Hence, we can reduce the number of features
in such problems. A 3-D classification problem can be hard to visualize, whereas a 2-D one
can be mapped to a simple 2 dimensional space, and a 1-D problem to a simple line. The
below figure illustrates this concept, where a 3-D feature space is split into two 1-D feature
spaces, and later, if found to be correlated, the number of features can be reduced even
further.
Components of Dimensionality Reduction
There are two components of dimensionality reduction:
 Feature selection: In this, we try to find a subset of the original set of variables, or
features, to get a smaller subset which can be used to model the problem. It usually involves
three ways:
1. Filter
2. Wrapper
3. Embedded
 Feature extraction: This reduces the data in a high-dimensional space to a lower-dimensional space, i.e. a space with a smaller number of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear or non-linear, depending upon the method used.
The prime linear method, called Principal Component Analysis, or PCA, is discussed below.
Principal Component Analysis
This method was introduced by Karl Pearson. It works on a condition that while the data in a
higher dimensional space is mapped to data in a lower dimension space, the variance of the
data in the lower dimensional space should be maximum.
It involves the following steps:
 Construct the covariance matrix of the data.
 Compute the eigenvectors of this matrix.
 Eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large
fraction of variance of the original data.
Hence, we are left with a lesser number of eigenvectors, and there might have been some data
loss in the process. But, the most important variances should be retained by the remaining
eigenvectors.
Advantages of Dimensionality Reduction
 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction
 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is sometimes
undesirable.
 PCA fails in cases where mean and covariance are not enough to define datasets.
 We may not know how many principal components to keep; in practice, some rules of thumb are applied.

What is Feature selection (or Variable Selection)?


Problem of selecting some subset of a learning algorithm’s input variables upon which it should
focus attention, while ignoring the rest. In other words, Dimensionality Reduction. As Humans,
we constantly do that!
What is Feature selection (or Variable Selection)?
Mathematically speaking,
Given a set of features F = { f1 ,…, fi ,…, fn } the Feature Selection problem is to find a subset
that “maximizes the learner’s ability to classify patterns”
Formally F’ should maximize some scoring function
This general definition subsumes feature selection (i.e. a feature selection algorithm also
performs a mapping but can only map to subsets of the input variables)
Feature Selection : The Two Schools of Thoughts
There are two schools of thoughts on Feature Selection.

Feature Selection : The Two Schools of Thoughts


While the above two schools of thought co-exist, we'd focus here on the motivation for Feature Selection.
Especially when dealing with a large number of variables there is a need for Dimensionality
Reduction
Feature Selection can significantly improve a learning algorithm’s performance
The Curse of Dimensionality
The required number of samples (to achieve the same accuracy) grows exponentially with the
number of variables.
In practice, the number of training examples is fixed, and the classifier's performance usually degrades for a large number of features.
In many cases the information that is lost by discarding variables is made up for by a more
accurate mapping/sampling in the lower-dimensional space.
Feature Selection — Optimality?
In theory, the goal is to find an optimal feature-subset (one that maximizes the scoring
function).
In real world applications this is usually not possible.
For most problems it is computationally intractable to search the whole space of possible feature
subsets
One usually has to settle for approximations of the optimal subset
Most of the research in this area is devoted to finding efficient search-heuristics

Feature Selection and Feature Extraction

Optimal Feature Subset:


Often, the definition of optimal feature subset in terms of classifier’s performance
The best one can hope for theoretically is the Bayes error rate
Relevance of Features
There are several definitions of relevance in literature.
Relevance of 1 variable, Relevance of a variable given other variables, Relevance given a certain
learning algorithm ,..
Most definitions are problematic, because there are problems where all features would be
declared to be irrelevant
This can be defined through two degrees of relevance: weak and strong relevance.
A feature is relevant iff it is weakly or strongly relevant and irrelevant (redundant) otherwise.
Strong Relevance of a variable/feature:
Let Si = {f1, …, fi-1, fi+1, …fn} be the set of all features except fi. Denote by si a value-
assignment to all features in Si.
A feature fi is strongly relevant, iff there exists some xi, y and si for which p(fi = xi, Si = si) >
0 such that
p(Y = y | fi = xi; Si = si) ≠ p(Y = y | Si = si)
This means that removal of fi alone will always result in a performance deterioration of an
optimal Bayes classifier.
Weak Relevance of a variable/feature:
A feature fi is weakly relevant, iff it is not strongly relevant, and there exists a subset of features
Si‘ of Si for which there exists some xi, y and si’ with p(fi = xi, Si’ = si’) > 0 such that
p(Y = y | fi = xi; Si’ = si’) ≠ p(Y = y | Si’ = si’)
This means that there exists a subset of features Si’, such that the performance of an optimal
Bayes classifier on Si’ is worse than Si’ U { fi }
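For practical filter-style feature selection (as opposed to the formal relevance definitions above), here is a minimal scikit-learn sketch on synthetic data; the scoring function and value of k are illustrative choices.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# 20 features, only 5 of which are informative (synthetic)
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Score each feature with an ANOVA F-test and keep the 5 best
selector = SelectKBest(score_func=f_classif, k=5)
X_reduced = selector.fit_transform(X, y)

print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)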

Feature extraction is about extracting/deriving information from the original features set to
create a new features subspace. The primary idea behind feature extraction is to compress the
data with the goal of maintaining most of the relevant information. As with feature selection
techniques, these techniques are also used for reducing the number of features from the original
features set to reduce model complexity, model overfitting, enhance model computation
efficiency and reduce generalization error. The following are different types of feature extraction
techniques:

Principal component analysis (PCA) for unsupervised data compression. PCA finds the directions of maximum variance in high-dimensional data and projects the data onto a new subspace with equal or fewer dimensions than the original one. Intuitively, this is like identifying the Taj Mahal (the 7th wonder of the world) from a top view or a side view based on the dimensions in which there is maximum variance; the first two such directions are the principal components PCA1 and PCA2.

Linear discriminant analysis (LDA) as a supervised dimensionality reduction


technique for maximizing class separability
Nonlinear dimensionality reduction via kernel principal component
analysis (KPCA)
When to use Feature Selection & Feature Extraction
The key difference between feature selection and feature extraction techniques
used for dimensionality reduction is that while the original features are maintained in the case
of feature selection algorithms, the feature extraction algorithms transform the data onto a new
feature space.
Feature selection techniques can be used if the requirement is to maintain the
original features, unlike the feature extraction techniques which derive useful information from
data to construct a new feature subspace. Feature selection techniques are used when model
explainability is a key requirement. Feature extraction techniques can be used to improve the
predictive performance of the models, especially, in the case of algorithms that don’t support
regularization.
Unlike feature selection, feature extraction usually needs to transform the original
data to features with strong pattern recognition ability, where the original data can be regarded as
features with weak recognition ability.

Linear Discriminant Analysis, also called Normal Discriminant Analysis or Discriminant Function Analysis, is a dimensionality reduction technique which is commonly used for supervised classification problems. It is used for modeling differences between groups, i.e. separating two or more classes, by projecting the features from a higher-dimensional space onto a lower-dimensional space.
For example, we have two classes and we need to separate them efficiently. Classes can have
multiple features. Using only a single feature to classify them may result in some overlapping as
shown in the below figure. So, we will keep on increasing the number of features for proper
classification.

Example:
Suppose we have two sets of data points belonging to two different classes that we want to
classify. As shown in the given 2D graph, when the data points are plotted on the 2D plane,
there’s no straight line that can separate the two classes of the data points completely. Hence, in
this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph into a 1D
graph in order to maximize the separability between the two classes.
Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and
projects data onto a new axis in a way to maximize the separation of the two categories and
hence, reducing the 2D graph into a 1D graph.
Two criteria are used by LDA to create a new axis:

1. Maximize the distance between means of the two classes.


2. Minimize the variation within each class.

In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D
graph such that it maximizes the distance between the means of the two classes and minimizes
the variation within each class. In simple terms, this newly generated axis increases the
separation between the data points of the two classes. After generating this new axis using the
above-mentioned criteria, all the data points of the classes are plotted on this new axis and are
shown in the figure given below.
But Linear Discriminant Analysis fails when the means of the distributions are shared, as it becomes impossible for LDA to find a new axis that makes both classes linearly separable. In such cases, we use non-linear discriminant analysis.
Mathematics
Let us suppose we have two classes and d-dimensional samples x1, x2, …, xn, where n1 samples come from the first class (c1) and n2 samples come from the second class (c2).

If xi is a data point, then its projection onto the line represented by the unit vector v can be written as vᵀxi.
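A sketch of LDA as a supervised projection using scikit-learn; the bundled wine dataset is used purely for illustration, and the learned projection directions correspond to the vector v discussed above.

from sklearn.datasets import load_wine
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_wine(return_X_y=True)

# Project the 13-dimensional data onto at most (n_classes - 1) = 2 discriminant axes
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X, y)

print("projected shape:", X_lda.shape)
print("classification accuracy on the training data:", lda.score(X, y))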
Extensions to LDA:

1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of
variance (or covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs is
used such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the
estimate of the variance (actually covariance), moderating the influence of different
variables on LDA.

What is linear discriminant analysis?


It is one of the most used dimensionality reduction techniques. It is used in machine learning as
well as applications that have anything to do with the classification of patterns. LDA serves a
very specific purpose, which is to project features that exist in a high dimensional space onto
space at a lower dimension.
This is done to do away with common dimensionality issues and bring down dimensional costs
and resources. Ronald A Fisher holds the credit for the development of the original concept
in 1936 –Fisher’s Discriminant Analysis or Linear Discriminant. Originally, linear
discriminant was a two-class technique. The multi-class version came in later.
Linear discriminant analysis is a supervised classification method that is used to create machine
learning models. These models based on dimensionality reduction are used in the application,
such as marketing predictive analysis and image recognition, amongst others. We will discuss
applications a little later.
So what exactly are we looking for with LDA? There are two things that this dimensionality reduction technique helps in discovering: the parameters that can be used to explain the relationship between a group and an object, and the classification predictor model that can help in separating the groups. This is why LDA is widely used to model differences between groups; you can use this technique with two or more classes of a variable.
Extensions to linear discriminant analysis
LDA is considered one of the simplest and most effective methods available for classification.
As the method is so simple and easy to understand, we have a few variations as well as
extensions available for it. Some of these include:
1. Regularized discriminant analysis or RDA
RDA is used for bringing regularization into variance or covariance estimation. This is done to
moderate the impact that variables have on LDA.
2. Quadratic discriminant analysis or QDA
In QDA, different classes use their own variance estimate. In case the number of the input
variable is more than usual, every class uses its covariance estimate.
3. Flexible discriminant analysis or FDA
FDA makes use of inputs with non-linear combinations. Splines are a good example.

Common LDA applications


LDA finds its use in several applications. It can be used in any problem that can be turned into a classification problem. Common examples include speech recognition, face recognition, chemistry, microarray data classification, image retrieval, biometrics, and bioinformatics, to name a few. Let's discuss a few of these.
1. Face recognition
In computer vision, face recognition is considered one of the most popular applications. Face
recognition is carried out by representing faces using large amounts of pixel values. LDA is used
to trim down the number of features to prepare grounds for using the classification method. The
new dimensions are combinations of pixel values that are used to create a template.
2. Customer identification
If you want to identify customers on the basis of the likelihood that they will buy a product, you
can use LDA to collect customer features. You can identify and choose those features that
describe the group of customers that are showing higher chances of buying a product.
3. Medical
LDA can be used to put diseases into different categories, such as severe, mild, or moderate.
There are several patient parameters that will go into conducting this classification task. This
classification allows doctors to define the pace of the treatment.

Principal Component Analysis

Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique to draw strong patterns from the given dataset by reducing
the variances.

PCA generally tries to find the lower-dimensional surface onto which to project the high-dimensional data. PCA works by considering the variance of each attribute, because an attribute with high variance indicates a good split between the classes, and hence it reduces the dimensionality. Some real-world applications of PCA are image processing, movie recommendation systems, and optimizing the power allocation in various communication channels. It is a feature extraction technique, so it keeps the important variables and drops the least important ones.

The PCA algorithm is based on some mathematical concepts such as:

o Variance and Covariance


o Eigenvalues and Eigenvectors

Some common terms used in PCA algorithm:

o Dimensionality: It is the number of features or variables present in the given dataset.


More easily, it is the number of columns present in the dataset.
o Correlation: It signifies that how strongly two variables are related to each other. Such
as if one changes, the other variable also gets changed. The correlation value ranges from -1 to
+1. Here, -1 occurs if variables are inversely proportional to each other, and +1 indicates that
variables are directly proportional to each other.
o Orthogonal: It defines that variables are not correlated to each other, and hence the
correlation between the pair of variables is zero.
o Eigenvectors: If M is a square matrix and v a non-zero vector, then v is an eigenvector of M if Mv is a scalar multiple of v.
o Covariance Matrix: A matrix containing the covariance between the pair of variables is
called the Covariance Matrix.

Principal Components in PCA

As described above, the transformed new features or the output of PCA are the Principal
Components. The number of these PCs are either equal to or less than the original features
present in the dataset. Some properties of these principal components are given below:

o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is zero.
o The importance of each component decreases when going from 1 to n; that is, the 1st PC has the most importance, and the n-th PC has the least importance.

Steps for PCA algorithm


1. Getting the dataset
Firstly, we need to take the input dataset and divide it into two subparts X and Y, where X is the
training set, and Y is the validation set.
2. Representing data into a structure
Now we will represent our dataset into a structure. Such as we will represent the two-
dimensional matrix of independent variable X. Here each row corresponds to the data items, and
the column corresponds to the Features. The number of columns is the dimensions of the dataset.
3. Standardizing the data
In this step, we will standardize our dataset. Such as in a particular column, the features with
high variance are more important compared to the features with lower variance.
If the importance of features is independent of the variance of the feature, then we will divide
each data item in a column with the standard deviation of the column. Here we will name the
matrix as Z.
4. Calculating the Covariance of Z
To calculate the covariance of Z, we will take the matrix Z, and will transpose it. After transpose,
we will multiply it by Z. The output matrix will be the Covariance matrix of Z.
5. Calculating the Eigen Values and Eigen Vectors
Now we need to calculate the eigenvalues and eigenvectors of the resultant covariance matrix Z. The eigenvectors of the covariance matrix are the directions of the axes carrying the highest information, and the corresponding eigenvalues measure the variance along those directions.
6. Sorting the Eigen Vectors
In this step, we will take all the eigenvalues and will sort them in decreasing order, which means
from largest to smallest. And simultaneously sort the eigenvectors accordingly in matrix P of
eigenvalues. The resultant matrix will be named as P*.
7. Calculating the new features Or Principal Components
Here we will calculate the new features. To do this, we will multiply the P* matrix to the Z. In
the resultant matrix Z*, each observation is the linear combination of original features. Each
column of the Z* matrix is independent of each other.
8. Remove less important features from the new dataset.
The new feature set is now available, so we decide what to keep and what to remove: we keep only the relevant or important features in the new dataset, and the unimportant features are removed.
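A NumPy sketch of steps 3 to 8 above (standardize, covariance, eigendecomposition, sorting, projection); the synthetic correlated data and the choice of keeping two components are purely for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))   # correlated synthetic data

# Step 3: standardize each column
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# Step 4: covariance matrix of Z
C = np.cov(Z, rowvar=False)

# Step 5: eigenvalues and eigenvectors
eigvals, eigvecs = np.linalg.eigh(C)

# Step 6: sort eigenvectors by decreasing eigenvalue
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Steps 7-8: keep the top-2 principal components and project the data
P = eigvecs[:, :2]
Z_star = Z @ P

print("explained variance ratio:", eigvals[:2] / eigvals.sum())
print("projected shape:", Z_star.shape)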

Applications of Principal Component Analysis


o PCA is mainly used as the dimensionality reduction technique in various AI applications
such as computer vision, image compression, etc.
o It can also be used for finding hidden patterns if data has high dimensions. Some fields
where PCA is used are Finance, data mining, Psychology, etc.

Factor Analysis

Factor analysis is a technique that reduces a large number of variables to a small number of underlying factors; this reduction is known as factoring the data. It is a statistical approach that describes variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors.
The factor analysis technique extracts the maximum common variance from all the variables
and puts them into a common score. It is a theory that is used in training the machine learning
model and so it is quite related to data mining. The belief behind factor analytic techniques is
that the information gained about the interdependencies between observed variables can be
used later to reduce the set of variables in a dataset.
Factor analysis is a very effective tool for inspecting relationships among variables for complex concepts such as social status, economic status, dietary patterns, psychological scales, biology, psychometrics, personality theories, marketing, product management, operations research, finance, etc. It can help a researcher to investigate concepts that are not easily measured directly, in a much easier and quicker way, by collapsing a large number of variables into a few easily interpretable fundamental factors.
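For a quick hands-on sketch, scikit-learn's FactorAnalysis can extract a small number of latent factors from correlated variables; the synthetic data below (two hidden factors driving six observed variables) is made up for illustration, and dedicated packages such as factor_analyzer add rotations and fit statistics.

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Synthetic data: 6 observed variables driven by 2 latent factors plus noise
F = rng.normal(size=(300, 2))                # hidden factors
W = rng.normal(size=(2, 6))                  # loadings
X = F @ W + 0.3 * rng.normal(size=(300, 6))  # observed variables

fa = FactorAnalysis(n_components=2, random_state=0)
fa.fit(X)

print("estimated loadings:\n", fa.components_)
print("noise variances:", fa.noise_variance_)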
Types of factor analysis:
1. Exploratory factor analysis (EFA) :
It is used to identify composite inter-relationships among items and group items that are the
part of uniting concepts. The Analyst can’t make any prior assumptions about the relationships
among factors. It is also used to find the fundamental structure of a huge set of variables. It
lessens the large data to a much smaller set of summary variables. It is almost similar to the
Confirmatory Factor Analysis(CFA).
Similarities are:
1. Both evaluate the internal reliability of a measure.
2. Both examine the factors represented by item sets, and presume that the factors are not correlated.
3. Both investigate the grade/class of each item.

However, some common differences, most of them are concerned about how factors are used.
Basically, EFA is a data-driven approach, which allows all items to load on all the factors,
while in CFA you need to specify which factors are required to load. EFA is really a nice
choice if you have no idea about what common factors might exist. EFA is able to generate a
huge number of possible models for your data, something which is not possible is, if a
researcher has to specify factors. If you have a bit idea about what actually the models look
like, and then afterwards you want to test your hypotheses about the data structure, in that case,
the CFA is a better approach.
2. Confirmatory factor analysis (CFA) :
It is a more complex(composite) approach that tests the theory that the items are associated
with specific factors. Confirmatory Factor Analysis uses a properly structured equation model
to test a measurement model whereby loading on the factors allows for the evaluation of
relationships between observed variables and unobserved variables.
As we know, structural equation modelling approaches can accommodate measurement error easily, and they are much less restrictive than least-squares estimation, thus providing more flexibility to accommodate errors. Hypothesized models are tested against actual data, and the analysis demonstrates the loadings of observed variables on the latent variables (factors), as well as the correlation between the latent variables.
Confirmatory Factor Analysis allows an analyst and researcher to figure out if a relationship
between a set of observed variables (also known as manifest variables) and their underlying
constructs exists. It is similar to the Exploratory Factor Analysis.
The main difference between the two is:
1. Simply use Exploratory Factor Analysis to explore the pattern.
2. Use Confirmatory Factor Analysis to perform hypothesis testing.

Confirmatory Factor Analysis provides information about the quality of the number of factors that are required to represent the data set. Using Confirmatory Factor Analysis, you can define the total number of factors required. For example, Confirmatory Factor Analysis is able to answer questions such as: can my thousand-question survey accurately measure one specific factor? Even though it is technically applicable to any kind of discipline, it is typically used in the social sciences.
3. Multiple Factor Analysis :
This type of Factor Analysis is used when your variables are structured in changeable groups.
For example, you may have a teenager’s health questionnaire with several points like sleeping
patterns, wrong addictions, psychological health, mobile phone addiction, or learning
disabilities.
The Multiple Factor Analysis is performed in two steps which are:-
1. Firstly, the Principal Component Analysis will perform on each and every section of the
data. Further, this can give a useful eigenvalue, which is actually used to normalize the
data sets for further use.
2. The newly formed data sets are going to merge into a distinctive matrix and then global
PCA is performed.
4. Generalized Procrustes Analysis (GPA) :
Procrustes analysis is a way to compare two approximate sets of configurations and shapes. It was originally developed to match two solutions from factor analysis, and the technique was later extended (generalized) so that more than two shapes can be compared in many ways. The shapes are properly aligned to achieve a target shape. Mainly, GPA (Generalized Procrustes Analysis) uses geometric transformations.
Geometric progressions are :
1. Isotropic rescaling,
2. Reflection,
3. Rotation,
4. Translation of matrices to compare the sets of data.

Independent Component Analysis

Independent Component Analysis (ICA) is a machine learning technique to separate independent
sources from a mixed signal. Unlike principal component analysis, which focuses on maximizing the
variance of the data points, independent component analysis focuses on independence, i.e. the
independent components.
Problem: To extract independent sources’ signals from a mixed signal composed of the
signals from those sources.
Given: Mixed signal from five different independent sources.
Aim: To decompose the mixed signal into independent sources:
 Source 1
 Source 2
 Source 3
 Source 4
 Source 5

Solution: Independent Component Analysis (ICA).


Consider the Cocktail Party Problem, or Blind Source Separation problem, to understand the
problem that is solved by independent component analysis.

Here, a party is going on in a room full of people. There are 'n' speakers in the room and they
are speaking simultaneously. In the same room there are also 'n' microphones placed at different
distances from the speakers, each recording the speakers' voice signals. Hence, the number of
speakers is equal to the number of microphones in the room.
Now, using these microphones' recordings, we want to separate all the 'n' speakers' voice signals
in the room, given that each microphone recorded the voice signals coming from each speaker at a
different intensity due to the differences in distance between them. Decomposing the mixed signal
of each microphone's recording into the independent sources' speech signals can be done using the
machine learning technique of independent component analysis.
[ X1, X2, ….., Xn ] => [ Y1, Y2, ….., Yn ]
where X1, X2, …, Xn are the mixed signals recorded by the microphones and Y1, Y2, …, Yn are the
new features: the independent components, which are independent of each other.
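A minimal Python sketch of this decomposition with scikit-learn's FastICA, using two synthetic sources in place of real microphone recordings (the signals and the mixing matrix below are made up for illustration):

# Blind source separation with FastICA on a synthetic two-speaker example.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)
s1 = np.sin(2 * t)                           # source 1 (speaker 1)
s2 = np.sign(np.sin(3 * t))                  # source 2 (speaker 2)
S = np.c_[s1, s2]

A = np.array([[1.0, 0.5],                    # mixing matrix: how strongly each
              [0.5, 2.0]])                   # microphone picks up each speaker
X = S @ A.T                                  # observed mixed signals X1, X2

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)                     # recovered independent components Y1, Y2
print(Y.shape)                               # (2000, 2)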

Restrictions on ICA

1. The independent components generated by the ICA are assumed to be statistically independent
of each other.
2. The independent components generated by the ICA must have a non-Gaussian distribution.
3. The number of independent components generated by the ICA is equal to the number of
observed mixtures.

Locally Linear Embedding (LLE)

Locally Linear Embedding (LLE) is a Manifold Learning technique that is used for non-linear
dimensionality reduction. It is an unsupervised learning algorithm that produces low-dimensional
embeddings of high-dimensional inputs while preserving the relationship of each training instance
to its closest neighbors.

How does LLE work?

For each training instance x(i), the algorithm first finds its k nearest neighbors and then tries to
express x(i) as a linear function of them. In general, if there are m training instances in total, then
it tries to find the set of weights w which minimizes the squared distance between x(i) and its
linear representation.

So, the cost function is given by

Ŵ = argmin over W of Σi ‖ x(i) − Σj wi,j x(j) ‖²

where wi,j = 0 if j is not included in the k closest neighbors of i.

Also, it normalizes the weights for each training instance x(i), so that Σj wi,j = 1 for
i = 1, 2, …, m.
Finally, each high-dimensional training instance x(i) is mapped to a low-dimensional
(say, d-dimensional) vector y(i) while preserving the neighborhood relationships. This is done by
choosing the d-dimensional coordinates which minimize the cost function

Σi ‖ y(i) − Σj wi,j y(j) ‖²

Here the weights wi,j are kept fixed while we try to find the optimum coordinates y(i).
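A minimal Python sketch using scikit-learn's LocallyLinearEmbedding on the classic Swiss-roll data set (the neighbour count and target dimension below are arbitrary choices for illustration):

# Unroll a 3-D Swiss roll into 2 dimensions while preserving local neighbourhoods.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Y = lle.fit_transform(X)                     # low-dimensional embedding y(i)
print(Y.shape)                               # (1000, 2)
print(lle.reconstruction_error_)             # reconstruction error of the embedding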

Least Square optimization


What Is Collaborative Filtering?
Collaborative filtering is a technique that can filter out items that a user might like on the basis of
reactions by similar users.

It works by searching a large group of people and finding a smaller set of users with tastes
similar to a particular user. It looks at the items they like and combines them to create a ranked
list of suggestions.

There are many ways to decide which users are similar and combine their choices to create a list
of recommendations.

Steps Involved in Collaborative Filtering


To build a system that can automatically recommend items to users based on the preferences of
other users, the first step is to find similar users or items. The second step is to predict the ratings
of the items that are not yet rated by a user. So, you will need the answers to these questions:

 How do you determine which users or items are similar to one another?
 Given that you know which users are similar, how do you determine the rating that a user
would give to an item based on the ratings of similar users?
 How do you measure the accuracy of the ratings you calculate?

Collaborative filtering is a family of algorithms in which there are multiple ways to find similar
users or items and multiple ways to calculate ratings based on the ratings of similar users. One
important thing to keep in mind is that in an approach based purely on collaborative filtering,
the similarity is not calculated using factors like the age of users, the genre of the movie, or
any other data about users or items. It is calculated only on the basis of the rating (explicit or
implicit) a user gives to an item. For example, two users can be considered similar if they give
the same ratings to ten movies despite there being a big difference in their age.

The third question, how to measure the accuracy of your predictions, also has multiple answers,
which include error calculation techniques that can be used in many places and not just in
recommenders based on collaborative filtering. One approach to measuring the accuracy of your
results is the Root Mean Square Error (RMSE), in which you predict ratings for a test dataset of
user-item pairs whose rating values are already known. The difference between the known value and
the predicted value is the error. Square all the error values for the test set, find the average
(or mean), and then take the square root of that average to get the RMSE. Another metric for
measuring accuracy is the Mean Absolute Error (MAE), in which you find the magnitude of each
error by taking its absolute value and then average all the error values.
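A small NumPy sketch of both metrics, using made-up known and predicted ratings:

# RMSE and MAE on a toy test set of known vs. predicted ratings.
import numpy as np

known = np.array([4.0, 3.0, 5.0, 2.0, 4.5])        # ratings already known for the test pairs
predicted = np.array([3.8, 3.4, 4.6, 2.5, 4.0])    # ratings produced by the recommender

errors = known - predicted
rmse = np.sqrt(np.mean(errors ** 2))   # square the errors, average them, take the square root
mae = np.mean(np.abs(errors))          # take absolute errors, then average them
print(rmse, mae)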

Memory Based
The first category includes algorithms that are memory based, in which statistical techniques are
applied to the entire dataset to calculate the predictions.

To find the rating R that a user U would give to an item I, the approach includes:

 Finding users similar to U who have rated the item I
 Calculating the rating R based on the ratings of the users found in the previous step

User-Based vs Item-Based Collaborative Filtering


The technique in the examples explained above, where the rating matrix is used to find similar
users based on the ratings they give, is called user-based or user-user collaborative filtering. If
you use the rating matrix to find similar items based on the ratings given to them by users, then
the approach is called item-based or item-item collaborative filtering.

The two approaches are mathematically quite similar, but there is a conceptual difference
between the two. Here's how the two compare (a toy user-based sketch in Python follows the list):

 User-based: For a user U, with a set of similar users determined based on rating vectors
consisting of given item ratings, the rating for an item I, which hasn’t been rated, is found
by picking out N users from the similarity list who have rated the item I and calculating
the rating based on these N ratings.
 Item-based: For an item I, with a set of similar items determined based on rating vectors
consisting of received user ratings, the rating by a user U, who hasn’t rated it, is found by
picking out N items from the similarity list that have been rated by U and calculating the
rating based on these N ratings.
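A toy Python sketch of the user-based procedure, using cosine similarity on a small made-up rating matrix (the matrix values and the choice of cosine similarity are illustrative assumptions):

# Predict user 0's rating for item 2 from the ratings of similar users.
import numpy as np

R = np.array([                 # rows = users, columns = items, 0 = not yet rated
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 1.0],
    [1.0, 1.0, 5.0, 4.0],
])
target_user, target_item = 0, 2

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

similarities = np.array([cosine(R[target_user], R[u]) for u in range(len(R))])
similarities[target_user] = 0.0                     # exclude the target user themselves
raters = [u for u in range(len(R)) if R[u, target_item] > 0]

# Similarity-weighted average of the ratings given to the item by similar users.
prediction = (sum(similarities[u] * R[u, target_item] for u in raters)
              / sum(similarities[u] for u in raters))
print(round(prediction, 2))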

Item-based collaborative filtering was developed by Amazon. In a system where there are more
users than items, item-based filtering is faster and more stable than user-based. It is effective
because usually, the average rating received by an item doesn’t change as quickly as the average
rating given by a user to different items. It’s also known to perform better than the user-based
approach when the ratings matrix is sparse.

The item-based approach, however, performs poorly for datasets with browsing or entertainment-
related items such as MovieLens, where the recommendations it gives out can seem very obvious to
the target users.

Model Based
The second category covers the model-based approaches, which involve a step to reduce or
compress the large but sparse user-item matrix. A basic grasp of dimensionality reduction is very
helpful for understanding this step.
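As one hedged example of such a compression step, the sketch below factorizes a sparse, made-up rating matrix with truncated SVD from scikit-learn; other model-based methods (for example regularized matrix factorization) follow the same idea.

# Compress a sparse user-item matrix into low-dimensional user and item factors.
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD

rng = np.random.default_rng(2)
dense = rng.integers(1, 6, size=(200, 50)) * (rng.random((200, 50)) < 0.1)  # ~90% unrated (zeros)
R = csr_matrix(dense.astype(float))

svd = TruncatedSVD(n_components=10, random_state=0)
user_factors = svd.fit_transform(R)          # (200, 10) compressed user representation
item_factors = svd.components_.T             # (50, 10) compressed item representation
scores = user_factors @ item_factors.T       # low-rank reconstruction used to rank items
print(scores.shape)                          # (200, 50)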

When Can Collaborative Filtering Be Used?


Collaborative filtering works around the interactions that users have with items. These
interactions can help find patterns that the data about the items or users itself can’t. Here are
some points that can help you decide if collaborative filtering can be used:

 Collaborative filtering doesn’t require features about the items or users to be known. It is
suited for a set of different types of items, for example, a supermarket’s inventory where
items of various categories can be added. In a set of similar items such as that of a
bookstore, though, known features like writers and genres can be useful and might
benefit from content-based or hybrid approaches.
 Collaborative filtering can help recommenders to not overspecialize in a user’s profile
and recommend items that are completely different from what they have seen before. If
you want your recommender to not suggest a pair of sneakers to someone who just
bought another similar pair of sneakers, then try to add collaborative filtering to your
recommender system.

Although collaborative filtering is very commonly used in recommenders, some of the challenges
faced while using it are the following:

 Collaborative filtering can lead to some problems like cold start for new items that are
added to the list. Until someone rates them, they don’t get recommended.
 Data sparsity can affect the quality of user-based recommenders and also add to the cold
start problem mentioned above.
 Scaling can be a challenge for growing datasets as the complexity can become too large.
Item-based recommenders are faster than user-based when the dataset is large.
 With a straightforward implementation, you might observe that the recommendations
tend to be already popular, and the items from the long tail section might get ignored.

With every type of recommender algorithm having its own list of pros and cons, it’s usually a
hybrid recommender that comes to the rescue. The benefits of multiple algorithms working
together or in a pipeline can help you set up more accurate recommenders. In fact, the solution of
the winner of the Netflix prize was also a complex mix of multiple algorithms.

Collaborative Filtering Advantages & Disadvantages

Advantages
No domain knowledge necessary
We don't need domain knowledge because the embeddings are automatically learned.

Serendipity
The model can help users discover new interests. In isolation, the ML system may not know the
user is interested in a given item, but the model might still recommend it because similar users
are interested in that item.

Great starting point
To some extent, the system needs only the feedback matrix to train a matrix factorization model.
In particular, the system doesn't need contextual features. In practice, this can be used as one
of multiple candidate generators.

Disadvantages
Cannot handle fresh items
The prediction of the model for a given (user, item) pair is the dot product of the corresponding
embeddings. So, if an item is not seen during training, the system can't create an embedding for
it and can't query the model with this item. This issue is often called the cold-start problem.
However, the following techniques can address the cold-start problem to some extent:

 Projection in WALS. Given a new item i0 not seen in training, if the system has a few
interactions with users, then the system can easily compute an embedding vi0 for this item
without having to retrain the whole model. The system simply has to solve the following
equation or the weighted version:
min over vi0 ∈ R^d of ‖ Ai0 − U vi0 ‖
The preceding equation corresponds to one iteration in WALS: the user embeddings are kept
fixed, and the system solves for the embedding of item i0 (a numerical sketch of this projection
step follows the list). The same can be done for a new user.
 Heuristics to generate embeddings of fresh items. If the system does not have interactions, the
system can approximate its embedding by averaging the embeddings of items from the same
category, from the same uploader (in YouTube), and so on.
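The projection step above is an ordinary least-squares solve, so it can be sketched in a few lines of NumPy (the shapes and data below are made up, and a real WALS system would use the weighted version):

# Solve min over v_i0 of ||A_i0 - U v_i0|| with the user embeddings U held fixed.
import numpy as np

rng = np.random.default_rng(3)
U = rng.normal(size=(1000, 16))      # learned user embeddings (n_users x embedding_dim)
A_i0 = rng.normal(size=1000)         # feedback column observed for the new item i0

v_i0, *_ = np.linalg.lstsq(U, A_i0, rcond=None)
print(v_i0.shape)                    # (16,): embedding for the fresh item, no retraining needed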

Hard to include side features for query/item
Side features are any features beyond the query or item ID. For movie recommendations, the
side features might include country or age. Including available side features improves the quality
of the model. Although it may not be easy to include side features in WALS, a generalization of
WALS makes this possible.
To generalize WALS, augment the input matrix with features by defining a block matrix Ā, where
(a NumPy sketch follows this list):
 Block (0, 0) is the original feedback matrix A.
 Block (0, 1) is a multi-hot encoding of the user features.
 Block (1, 0) is a multi-hot encoding of the item features.
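A hedged NumPy sketch of assembling such a block matrix; the feature matrices are made up, and filling block (1, 1) with zeros is an assumption made purely so the blocks tile into a rectangle.

# Assemble the augmented matrix A_bar from the feedback matrix and feature blocks.
import numpy as np

rng = np.random.default_rng(4)
A = rng.random((100, 40))                    # block (0, 0): feedback matrix (users x items)
user_features = rng.integers(0, 2, (100, 5)) # block (0, 1): multi-hot user features
item_features = rng.integers(0, 2, (40, 8))  # multi-hot item features (items x features)

A_bar = np.block([
    [A, user_features],
    [item_features.T, np.zeros((8, 5))],     # block (1, 0) plus a zero filler block
])
print(A_bar.shape)                           # (108, 45)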

TEXT / REFERENCE BOOKS

1. Chris Albon, Machine Learning with Python Cookbook, O'Reilly Media, Inc., 2018.
2. Stephen Marsland, Machine Learning – An Algorithmic Perspective, Second Edition, Chapman and
Hall/CRC Machine Learning and Pattern Recognition Series, 2014.
3. Tom M. Mitchell, Machine Learning, India Edition 2013, McGraw Hill Education.
4. Peter Flach, Machine Learning: The Art and Science of Algorithms that Make Sense of Data,
Cambridge University Press, 2012.
5. Ethem Alpaydın, Introduction to Machine Learning, Second Edition, MIT Press.
6. T. Hastie, R. Tibshirani and J. Friedman, The Elements of Statistical Learning, Springer
Series, 2nd Edition.
7. Sebastian Raschka, Python Machine Learning, Second Edition, Packt Publishing.
