Well-posed learning problem: "A computer program is said to learn from experience E with respect
to some class of tasks T and performance measure P, if its performance at tasks in T, as measured
by P, improves with experience E." (Tom Mitchell)
"Field of study that gives computers the ability to learn without being explicitly programmed." (Arthur Samuel)
- T : Play checkers
- P : % of games won in world tournament
- E: opportunity to play against self
Examples of tasks that are best solved by using a learning algorithm:
• Recognizing patterns:
– Facial identities or facial expressions
– Handwritten or spoken words
– Medical images
• Generating patterns:
– Generating images or motion sequences (demo)
• Recognizing anomalies:
– Unusual sequences of credit card transactions
– Unusual patterns of sensor readings in a nuclear power plant or unusual sound in
your car engine.
• Prediction:
– Future stock prices or currency exchange rates
• The web contains a lot of data. Tasks with very big datasets often use machine learning
– especially if the data is noisy or non-stationary.
• Spam filtering, fraud detection:
– The enemy adapts so we must adapt too.
• Recommendation systems:
– Lots of noisy data. Million dollar prize!
• Information retrieval:
– Find documents or images with similar content.
• Data Visualization:
- Display a huge database in a revealing way
Supervised learning
o Training data includes desired outputs. Examples include,
o Prediction
o Classification (discrete labels), Regression (real values)
Unsupervised learning
o Training data does not include desired outputs, Examples include,
o Clustering
o Probability distribution estimation
o Finding association (in features)
o Dimension reduction
Semi-supervised learning
o Training data includes a few desired outputs
Reinforcement learning
o Rewards from sequence of actions
o Policies: what actions should an agent take in a particular situation
o Utility estimation: how good is a state (→used by policy)
o No supervised output but delayed reward
o Credit assignment problem (what was responsible for the outcome)
o Applications:
o Game playing
o Robot in a maze
o Multiple agents, partial observability, ...
Hypothesis Space
• One way to think about a supervised learning machine is as a device that explores a
“hypothesis space”.
– Each setting of the parameters in the machine is a different hypothesis about the
function that maps input vectors to output vectors.
– If the data is noise-free, each training example rules out a region of hypothesis
space.
– If the data is noisy, each training example scales the posterior probability of each
point in the hypothesis space in proportion to how likely the training example is
given that hypothesis.
• The art of supervised machine learning is in:
– Deciding how to represent the inputs and outputs
– Selecting a hypothesis space that is powerful enough to represent the relationship
between inputs and outputs but simple enough to be searched.
Generalization
• The real aim of supervised learning is to do well on test data that is not known during
learning.
• Choosing the values for the parameters that minimize the loss function on the training data
is not necessarily the best policy.
• We want the learning machine to model the true regularities in the data and to ignore the
noise in the data.
– But the learning machine does not know which regularities are real and which are
accidental quirks of the particular set of training examples we happen to pick.
• So how can we be sure that the machine will generalize correctly to new data?
Learning Associations:
• Basket analysis:
P (Y | X ) probability that somebody who buys X also buys Y where X and Y are
products/services.
Example: P ( chips | beer ) = 0.7 // 70 percent of customers who buy beer also buy chips.
We may want to make a distinction among customers and toward this, estimate P(Y|X,D) where
D is the set of customer attributes, for example, gender, age, marital status, and so on, assuming
that we have access to this information. If this is a bookseller instead of a supermarket, products
can be books or authors. In the case of a Web portal, items correspond to links to Web pages, and
we can estimate the links a user is likely to click and use this information to download such
pages in advance for faster access.
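A minimal Python sketch of this kind of estimate (the transactions below are invented toy data, not figures from this text) shows how such a conditional probability can be computed from a basket list:

# Estimate P(Y | X) from a toy transaction list; the basket contents are made up.
transactions = [
    {"beer", "chips", "diapers"},
    {"beer", "chips"},
    {"beer", "bread"},
    {"chips", "soda"},
    {"beer", "chips", "soda"},
]

def conditional_prob(y, x, baskets):
    """P(y | x) = (# baskets containing both x and y) / (# baskets containing x)."""
    with_x = [b for b in baskets if x in b]
    if not with_x:
        return 0.0
    return sum(1 for b in with_x if y in b) / len(with_x)

print(conditional_prob("chips", "beer", transactions))  # 0.75 for this toy data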
Classification:
A training dataset of used cars and the function fitted. For simplicity, mileage is taken as the only
input attribute and a linear model is used.
Supervised Learning:
• Consider a concept class C defined over an instance space X (containing instances of length
n), and a learner L using a hypothesis space H.
• C is PAC learnable by L using H if, for every f ∈ C, every distribution over X, and every ε, δ ∈ (0, 1/2), the learner L, given a sample of size polynomial in 1/ε, 1/δ and n, outputs with probability at least 1 − δ a hypothesis h ∈ H whose error is at most ε.
• Is there an efficient algorithm that can process the sample and produce a good hypothesis h?
• To be PAC learnable, there must be a hypothesis h ∈ H with arbitrarily small error for every f ∈ C. We generally assume H ⊇ C (properly PAC learnable if H = C).
Occam’s Razor:
Noise and Model Complexity:
Noise is any unwanted anomaly in the data and due to noise, the class may be more difficult to
learn and zero error may be infeasible with a simple hypothesis class. There are several
interpretations of noise:
• There may be imprecision in recording the input attributes, which may shift the data points
in the input space.
• There may be errors in labeling the data points, which may relabel positive instances as
negative and vice versa. This is sometimes called teacher noise.
• There may be additional attributes, which we have not taken into account, that affect the
label of an instance. Such attributes may be hidden or latent in that they may be
unobservable. The effect of these neglected attributes is thus modeled as a random
component and is included in “noise.”
Given two models with comparable empirical error, use the simpler one because it is:
• Simpler to use (lower computational complexity)
• Easier to train (lower space complexity)
• Easier to explain (more interpretable)
• Generalizes better (lower variance - Occam’s razor)
Learning Multiple Classes:
In our example of learning a family car, we have positive examples belonging to the class family
car and the negative examples belonging to all other cars. This is a two-class problem. In the
general case, we have K classes denoted as Ci, i = 1, . . . , K, and an input instance belongs to one
and exactly one of them. The training set is now of the form X = {x^t, r^t}, t = 1, …, N, where r^t is a K-dimensional vector with r^t_i = 1 if x^t ∈ C_i and 0 otherwise.
There are three classes: family car, sports car, and luxury sedan. Three hypotheses are
induced, each one covering the instances of one class and leaving outside the instances of the
other two classes. '?' marks reject regions where no class, or more than one class, is chosen.
Model Selection & Generalization:
Applications of Machine Learning:
Machine learning is growing very rapidly, and we use it in daily life without even knowing it, for
example in Google Maps, Google Assistant, and Alexa. Below are some of the most trending
real-world applications of machine learning:
1. Image Recognition:
Image recognition is one of the most common applications of machine learning. It is used to
identify objects, persons, places, digital images, etc. The popular use case of image recognition
and face detection is, Automatic friend tagging suggestion:
Facebook provides us a feature of auto friend tagging suggestion. Whenever we upload a photo
with our Facebook friends, then we automatically get a tagging suggestion with name, and the
technology behind this is machine learning's face detection and recognition algorithm.
It is based on the Facebook project named "Deep Face," which is responsible for face
recognition and person identification in the picture.
2. Speech Recognition
While using Google, we get the option "Search by voice"; this comes under speech recognition
and is a popular application of machine learning.
Speech recognition is a process of converting voice instructions into text, and it is also known
as "Speech to text", or "Computer speech recognition." At present, machine learning
algorithms are widely used by various applications of speech recognition. Google
assistant, Siri, Cortana, and Alexa are using speech recognition technology to follow the voice
instructions.
3. Traffic prediction:
If we want to visit a new place, we take the help of Google Maps, which shows us the best path
with the shortest route and predicts the traffic conditions.
It predicts whether traffic is clear, slow-moving, or heavily congested in two ways:
o Real-time location of vehicles from the Google Maps app and sensors
o Average time taken on past days at the same time of day
Everyone who uses Google Maps is helping to make the app better: it takes information
from users and sends it back to its database to improve performance.
4. Product recommendations:
Machine learning is widely used by various e-commerce and entertainment companies such
as Amazon, Netflix, etc., for product recommendation to the user. Whenever we search for a
product on Amazon, we start getting advertisements for the same product while surfing the
internet in the same browser, and this is because of machine learning.
Google understands the user's interests using various machine learning algorithms and suggests
products according to those interests.
Similarly, when we use Netflix, we get recommendations for series, movies, etc., and this is
also done with the help of machine learning.
5. Self-driving cars:
One of the most exciting applications of machine learning is self-driving cars. Machine learning
plays a significant role in self-driving cars. Tesla, the most popular self-driving car manufacturer,
is working on self-driving cars and uses deep learning methods to train the car's models to detect
people and objects while driving.
6. Email Spam and Malware Filtering:
Whenever we receive a new email, it is automatically filtered as important, normal, or spam.
We always receive important mail in our inbox, marked with the important symbol, and spam
emails in our spam box, and the technology behind this is machine learning. Below are some
spam filters used by Gmail:
o Content Filter
o Header filter
o General blacklists filter
o Rules-based filters
o Permission filters
Some machine learning algorithms such as Multi-Layer Perceptron, Decision tree, and Naïve
Bayes classifier are used for email spam filtering and malware detection.
7. Virtual Personal Assistant:
We have various virtual personal assistants such as Google Assistant, Alexa, Cortana, and Siri.
As the name suggests, they help us find information using voice instructions. These assistants
can help us in various ways just by our voice instructions, such as playing music, calling
someone, opening an email, scheduling an appointment, etc.
These assistants record our voice instructions, send them to a server in the cloud, decode them
using ML algorithms, and act accordingly.
8. Online Fraud Detection:
Machine learning is making our online transactions safe and secure by detecting fraudulent
transactions. Whenever we perform an online transaction, there are various ways a fraudulent
transaction can take place, such as fake accounts, fake IDs, and money being stolen in the
middle of a transaction. To detect this, a feed-forward neural network helps us check whether a
transaction is genuine or fraudulent.
For each genuine transaction, the output is converted into hash values, and these values become
the input for the next round. Each genuine transaction follows a specific pattern, which changes
for a fraudulent transaction; the network detects this and makes our online transactions more
secure.
9. Stock Market Trading:
Machine learning is widely used in stock market trading. In the stock market, there is always a
risk of ups and downs in share prices, so machine learning's long short-term memory (LSTM)
neural network is used for the prediction of stock market trends.
10. Medical Diagnosis:
In medical science, machine learning is used for disease diagnosis. With this, medical
technology is growing very fast and is able to build 3D models that can predict the exact
position of lesions in the brain.
11. Automatic Language Translation:
Nowadays, if we visit a new place and are not aware of the language, it is not a problem at all;
machine learning helps us by converting text into languages we know. Google's GNMT (Google
Neural Machine Translation) provides this feature: a neural machine translation system that
translates text into a familiar language, a capability known as automatic translation.
The technology behind automatic translation is a sequence-to-sequence learning algorithm,
which is also used with image recognition to translate text from one language to another.
In supervised learning, algorithms learn from labeled data. After understanding the data, the
algorithm determines which label should be given to new data by associating patterns to the
unlabeled new data.
Supervised learning can be divided into two categories: classification and regression.
Classification
Classification is a technique for determining which class the dependent variable belongs to
based on one or more independent variables.
1. LOGISTIC REGRESSION
Logistic regression is kind of like linear regression, but is used when the dependent variable is
not a number but something else (e.g., a "yes/no" response). It's called regression but performs
classification based on the regression and it classifies the dependent variable into either of the
classes.
Logistic regression is used for prediction of output which is binary, as stated above. For
example, if a credit card company builds a model to decide whether or not to issue a credit card
to a customer, it will model for whether the customer is going to “default” or “not default” on
their card.
Linear Regression
Firstly, linear regression is performed on the relationship between variables to get the model.
The threshold for the classification line is assumed to be at 0.5.
It models the log of the ratio of the probability of the event occurring to the probability of it not
occurring (the log-odds). In the end, it classifies the variable based on the higher probability of
either class.
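As an illustration, here is a minimal scikit-learn sketch of such a binary classifier, assuming scikit-learn is available; the two features (income and debt) and the synthetic labeling rule are hypothetical, not taken from this text:

# Logistic regression for a binary "default / not default" decision on synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# Two hypothetical features: income (in $1000s) and existing debt (in $1000s).
X = rng.uniform(low=[20, 0], high=[120, 60], size=(200, 2))
# Synthetic rule: higher debt relative to income raises the chance of default.
y = (X[:, 1] > 0.4 * X[:, 0]).astype(int)

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[50, 30]]))   # P(not default), P(default)
print(model.predict([[50, 30]]))         # class label using the 0.5 threshold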
2. K-NEAREST NEIGHBORS (K-NN)
K-NN algorithm is one of the simplest classification algorithms and it is used to identify the
data points that are separated into several classes to predict the classification of a new sample
point. K-NN is a non-parametric, lazy learning algorithm. It classifies new cases based on a
similarity measure (i.e., distance functions).
K-NN works well with a small number of input variables (p), but struggles when the number of
inputs is very large.
3. SUPPORT VECTOR MACHINE (SVM)
The support vector machine (SVM) is used for both regression and classification. It is based on the concept of
decision planes that define decision boundaries. A decision plane (hyperplane) is one that
separates between a set of objects having different class memberships.
It performs classification by finding the hyperplane that maximizes the margin between the two
classes with the help of support vectors.
The learning of the hyperplane in SVM is done by transforming the problem using some linear
algebra (i.e., the example above is a linear kernel which has a linear separability between each
variable).
For higher-dimensional data, where points cannot be separated easily by a linear boundary,
other kernels are used; they are described in the next section.
Kernel SVM
Kernel SVM takes in a kernel function in the SVM algorithm and transforms it into the required
form that maps data on a higher dimension which is separable.
Types of kernel functions include linear, polynomial, radial basis function (RBF), and sigmoid.
Kernel trick uses the kernel function to transform data into a higher dimensional feature space
and makes it possible to perform the linear separation for classification.
Radial Basis Function (RBF) Kernel
The RBF kernel SVM decision region is actually also a linear decision region. What RBF kernel
SVM actually does is create non-linear combinations of features to uplift the samples onto a
higher-dimensional feature space where a linear decision boundary can be used to separate
classes.
So, the rule of thumb is: use linear SVMs for linear problems, and nonlinear kernels such as the
RBF kernel for non-linear problems.
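A small, hedged sketch of this rule of thumb using scikit-learn's SVC on a non-linear toy dataset (make_moons); the C and gamma values are arbitrary illustrative choices:

# Compare a linear SVM with an RBF-kernel SVM on non-linearly separable data.
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_moons(n_samples=300, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

linear_svm = SVC(kernel="linear", C=1.0).fit(X_train, y_train)
rbf_svm = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X_train, y_train)

print("linear kernel accuracy:", linear_svm.score(X_test, y_test))
print("RBF kernel accuracy:   ", rbf_svm.score(X_test, y_test))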
4. NAIVE BAYES
The naive Bayes classifier is based on Bayes' theorem with independence assumptions between
predictors (i.e., it assumes the presence of a feature in a class is unrelated to any other feature).
Even if these features depend on each other, or upon the existence of the other features, all of
these properties are treated as contributing independently to the probability. Hence the name
naive Bayes.
Based on naive Bayes, Gaussian naive Bayes is used for classification when the features are
assumed to follow a Gaussian (normal) distribution.
3. Calculate Likelihood
P(data | class) = number of similar observations in the class / total number of points in the class.
P(? | yellow) = 1/7
P(? | green) = 3/10
4. Calculate the Posterior Probability for Each Class
5. Classification
The point is assigned to the class with the higher posterior probability; in the example above,
the point belongs to the green class with about 75% probability.
Multinomial, Bernoulli naive Bayes are the other models used in calculating probabilities. Thus,
a naive Bayes model is easy to build, with no complicated iterative parameter estimation, which
makes it particularly useful for very large datasets.
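A minimal Gaussian naive Bayes sketch, assuming scikit-learn; the iris dataset is used only as a convenient built-in stand-in:

# Gaussian naive Bayes on a built-in dataset.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

nb = GaussianNB().fit(X_train, y_train)
print("accuracy:", nb.score(X_test, y_test))
print("class posteriors for one sample:", nb.predict_proba(X_test[:1]))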
5. DECISION TREE
A decision tree builds classification or regression models in the form of a tree structure. It breaks
down a dataset into smaller and smaller subsets while at the same time an associated decision
tree is incrementally developed. The final result is a tree with decision nodes and leaf nodes. It
follows the Iterative Dichotomiser 3 (ID3) algorithm structure for determining the split.
Entropy
Entropy is the degree or amount of uncertainty in the randomness of elements. In other words, it
is a measure of impurity.
Intuitively, it tells us about the predictability of a certain event. For a set S with class
proportions p_i, Entropy(S) = − Σ_i p_i log2 p_i. Entropy measures the homogeneity of a sample:
if the sample is completely homogeneous the entropy is zero, and if the sample is equally
divided between two classes it has an entropy of one.
Information Gain
Information gain measures the relative change in entropy with respect to the independent
attribute. It tries to estimate the information contained by each attribute. Constructing a decision
tree is all about finding the attribute that returns the highest information gain (i.e., the most
homogeneous branches).
Gain(T, X) = Entropy(T) − Entropy(T, X)
where Gain(T, X) is the information gain from applying feature X, Entropy(T) is the entropy of
the entire set, and the second term is the weighted entropy after splitting on feature X.
Information gain ranks attributes for filtering at a given node in the tree. The ranking is based
on the highest information gain in each split.
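The calculation described above can be sketched directly in Python/NumPy; the "outlook"/"play" columns below are a made-up toy example, not data from this text:

# Entropy and information gain for a categorical feature.
import numpy as np

def entropy(labels):
    """Entropy(S) = -sum_i p_i * log2(p_i) over the class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(feature_values, labels):
    """Gain(T, X) = Entropy(T) - weighted entropy of the subsets split on X."""
    total = entropy(labels)
    n = len(labels)
    remainder = 0.0
    for v in np.unique(feature_values):
        mask = feature_values == v
        remainder += (mask.sum() / n) * entropy(labels[mask])
    return total - remainder

outlook = np.array(["sunny", "sunny", "overcast", "rain", "rain", "overcast"])
play    = np.array(["no",    "no",    "yes",      "yes",  "no",   "yes"])
print(information_gain(outlook, play))   # about 0.667 for this toy column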
The disadvantage of a decision tree model is overfitting, as it tries to fit the model by going
deeper in the training set and thereby reducing test accuracy.
6. RANDOM FOREST
The random forest classifier is an ensemble algorithm based on bagging, i.e., bootstrap
aggregation. Ensemble methods combine more than one algorithm of the same or different kind
for classifying objects (e.g., an ensemble of SVM, naive Bayes, and decision tree classifiers).
The general idea is that a combination of learning models improves the overall result.
Deep decision trees may suffer from overfitting, but random forests prevent overfitting by
creating trees on random subsets. The main reason is that the forest takes the average of all the
predictions, which cancels out the biases. Random forest also adds additional randomness to the
model while growing the trees: instead of searching for the most important feature while
splitting a node, it searches for the best feature among a random subset of features. This results
in a wide diversity that generally yields a better model.
Python has grown to become the most preferred language for machine learning algorithm
implementations, thanks to its rich set of libraries. Let's have a look at the main Python libraries
used for machine learning.
Pandas
Pandas provides data structures that can handle many kinds of data:
• Tabular data with columns of heterogeneous types, for instance data coming from an SQL table or an Excel spreadsheet.
• Ordered and unordered time-series data. The frequency of a time series need not be fixed, unlike in other libraries and tools; Pandas is exceptionally robust in handling uneven time-series data.
• Arbitrary matrix data with homogeneous or heterogeneous types in the rows and columns.
• Any other form of statistical or observational data. The data need not be labeled at all; Pandas data structures can process it even without labels.
9) Matplotlib
Matplotlib is a data visualization library that is used for 2D plotting to produce publication-
quality image plots and figures in a variety of formats. The library helps to generate histograms,
plots, error charts, scatter plots, bar charts with just a few lines of code.
It provides a MATLAB-like interface and is exceptionally user-friendly. It works by using
standard GUI toolkits like GTK+, wxPython, Tkinter, or Qt to provide an object-oriented API
that helps programmers to embed graphs and plots into their applications.
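For illustration, a few lines of the kind Matplotlib is known for; the plotted data is arbitrary:

# Produce a simple line plot and scatter plot and save the figure to a file.
import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0, 2 * np.pi, 100)
plt.plot(x, np.sin(x), label="sin(x)")
plt.scatter(x[::10], np.cos(x[::10]), label="cos(x) samples")
plt.xlabel("x")
plt.ylabel("value")
plt.legend()
plt.savefig("example_plot.png")   # or plt.show() in an interactive session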
SCHOOL OF ELECTRICAL AND ELECTRONICS
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
UNIT II - CLASSIFIERS
KNN:
In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric classification
method first developed by Evelyn Fix and Joseph Hodges in 1951 and later expanded by
Thomas Cover.
It is used for classification and regression. In both cases, the input consists of the k closest
training examples in the data set. The output depends on whether k-NN is used for
classification or regression.
The k-nearest neighbors (KNN) algorithm is a simple, easy-to-implement supervised machine
learning algorithm.
It is also called a lazy learner algorithm because it does not learn from the training set
immediately; instead it stores the dataset and, at the time of classification, performs an action
on the dataset.
The KNN algorithm at the training phase just stores the dataset, and when it gets new data, it
classifies that data into the category that is most similar to the new data.
K-nearest neighbors (KNN) algorithm is a type of supervised ML algorithm which can be used
for both classification as well as regression predictive problems. However, it is mainly used for
classification predictive problems in industry. The following two properties would define KNN
well −
• Lazy learning algorithm − KNN is a lazy learning algorithm because it does
not have a specialized training phase and uses all the data during classification.
• Non-parametric learning algorithm − KNN is also a non-parametric
learning algorithm because it doesn’t assume anything about the underlying
data.
Working of KNN Algorithm
K-nearest neighbors (KNN) algorithm uses ‘feature similarity’ to predict the values of new
datapoints which further means that the new data point will be assigned a value based on how
closely it matches the points in the training set. We can understand its working with the help of
following steps −
Step 1 − For implementing any algorithm, we need a dataset. So during the first step of KNN,
we must load the training as well as the test data.
Step 2 − Next, we need to choose the value of K, i.e., the number of nearest data points to
consider. K can be any integer.
Step 3 − For each point in the test data do the following −
3.1 − Calculate the distance between the test data and each row of the training data using
any distance metric, namely Euclidean, Manhattan, or Hamming distance. The most
commonly used method is Euclidean distance.
3.2 − Now, based on the distance value, sort them in ascending order.
3.3 − Next, it will choose the top K rows from the sorted array.
3.4 − Now, it will assign a class to the test point based on most frequent class of these
rows.
Step 4 − End
Example
The following is an example to understand the concept of K and working of KNN algorithm −
Suppose we have a dataset which can be plotted as follows −
Now, we need to classify a new data point, shown as a black dot (at point 60,60), into the blue
or red class. We assume K = 3, i.e., the algorithm finds the three nearest data points, as shown
in the next diagram.
We can see in the diagram the three nearest neighbors of the black data point. Among those
three, two lie in the red class; hence the black dot is also assigned to the red class.
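A minimal scikit-learn version of this worked example, assuming made-up coordinates for the blue and red training points:

# Classify the point (60, 60) with K = 3 using scikit-learn's KNN classifier.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

X_train = np.array([[20, 35], [25, 40], [30, 30],      # class 0 ("blue")
                    [55, 60], [60, 55], [65, 65]])     # class 1 ("red")
y_train = np.array([0, 0, 0, 1, 1, 1])

knn = KNeighborsClassifier(n_neighbors=3)   # Euclidean distance by default
knn.fit(X_train, y_train)
print(knn.predict([[60, 60]]))   # -> [1], the "red" class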
Pros and Cons of KNN
Pros
It is a very simple algorithm to understand and interpret.
It is very useful for nonlinear data because there is no assumption about the data in this
algorithm.
It is a versatile algorithm, as we can use it for classification as well as regression.
It has relatively high accuracy, although there are better supervised learning models than KNN.
Cons
It is a computationally somewhat expensive algorithm because it stores all the training data.
High memory storage is required compared to other supervised learning algorithms.
Prediction is slow when N is large.
It is very sensitive to the scale of the data as well as to irrelevant features.
Applications of KNN
The following are some of the areas in which KNN can be applied successfully −
Banking System
KNN can be used in the banking system to predict whether an individual is fit for loan
approval, i.e., whether that individual has characteristics similar to those of defaulters.
Calculating Credit Ratings
KNN algorithms can be used to find an individual’s credit rating by comparing with the
persons having similar traits.
Politics
With the help of KNN algorithms, we can classify a potential voter into various classes like
"Will Vote", "Will Not Vote", "Will Vote for Party 'Congress'", or "Will Vote for Party 'BJP'".
Other areas in which KNN algorithm can be used are Speech Recognition, Handwriting
Detection, Image Recognition and Video Recognition.
Decision Trees
o Decision Tree is a Supervised learning technique that can be used for both
classification and Regression problems, but mostly it is preferred for solving
Classification problems. It is a tree-structured classifier, where internal nodes
represent the features of a dataset, branches represent the decision rules and each
leaf node represents the outcome.
o In a decision tree, there are two types of nodes, the decision node and the leaf
node. Decision nodes are used to make decisions and have multiple branches,
whereas leaf nodes are the outputs of those decisions and do not contain any further
branches.
o The decisions or the test are performed on the basis of features of the given dataset.
o It is a graphical representation for getting all the possible solutions to a
problem/decision based on given conditions.
o It is called a decision tree because, similar to a tree, it starts with the root node, which
expands on further branches and constructs a tree-like structure.
o In order to build a tree, we use the CART algorithm, which stands for Classification
and Regression Tree algorithm.
o A decision tree simply asks a question and, based on the answer (Yes/No), further splits
the tree into subtrees.
Why use Decision Trees?
There are various algorithms in Machine learning, so choosing the best algorithm for the given
dataset and problem is the main point to remember while creating a machine learning model.
Below are the two reasons for using the Decision tree:
o Decision Trees usually mimic human thinking ability while making a decision, so it is
easy to understand.
o The logic behind the decision tree can be easily understood because it shows a tree-like
structure.
In a decision tree, to predict the class of a given dataset, the algorithm starts from the root node
of the tree. It compares the value of the root attribute with the corresponding attribute of the
record (real dataset) and, based on the comparison, follows the branch and jumps to the next
node. For the next node, the algorithm again compares the attribute value with the other
sub-nodes and moves further. It continues the process until it reaches a leaf node of the tree.
The complete process can be better understood using the algorithm below:
o Step-1: Begin the tree with the root node, say S, which contains the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values for the best attribute.
o Step-4: Generate the decision tree node, which contains the best attribute.
o Step-5: Recursively make new decision trees using the subsets of the dataset created in
Step-3. Continue this process until a stage is reached where you cannot further classify
the nodes; the final nodes are called leaf nodes.
Example: Suppose there is a candidate who has a job offer and wants to decide whether he
should accept the offer or Not. So, to solve this problem, the decision tree starts with the root
node (Salary attribute by ASM). The root node splits further into the next decision node
(distance from the office) and one leaf node based on the corresponding labels. The next
decision node further gets split into one decision node (Cab facility) and one leaf node. Finally,
the decision node splits into two leaf nodes (Accepted offers and Declined offer). Consider the
below diagram:
While implementing a decision tree, the main issue is how to select the best attribute for the
root node and for the sub-nodes. To solve such problems there is a technique called the
attribute selection measure (ASM). With this measure, we can easily select the best attribute
for the nodes of the tree. There are two popular techniques for ASM, which are:
o Information Gain
o Gini Index
1. Information Gain:
o Information gain is the measurement of changes in entropy after the segmentation of a
dataset based on an attribute.
o It calculates how much information a feature provides us about a class.
o According to the value of information gain, we split the node and build the decision tree.
o A decision tree algorithm always tries to maximize the value of information gain, and a
node/attribute having the highest information gain is split first. It can be calculated using
the formula below:
Information Gain = Entropy(S) − [(Weighted Avg) × Entropy(each feature)]
where Entropy(S) is the entropy of the full set S and the second term is the weighted average
entropy of the subsets produced by splitting on the feature.
2. Gini Index:
o Gini index is a measure of impurity or purity used while creating a decision tree in the
CART(Classification and Regression Tree) algorithm.
o An attribute with the low Gini index should be preferred as compared to the high Gini
index.
o It only creates binary splits, and the CART algorithm uses the Gini index to create
binary splits.
o The Gini index can be calculated using the formula below:
Gini Index = 1 − Σ_j p_j²
where p_j is the proportion of instances of class j at the node.
Pruning is a process of deleting the unnecessary nodes from a tree in order to get the optimal
decision tree.
A tree that is too large increases the risk of overfitting, while a small tree may not capture all
the important features of the dataset. A technique that decreases the size of the learning tree
without reducing accuracy is known as pruning. There are mainly two types of tree pruning
technology used:
o Cost complexity pruning
o Reduced error pruning
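As a hedged sketch, cost complexity pruning is exposed in scikit-learn through the ccp_alpha parameter of DecisionTreeClassifier; the alpha value and the built-in dataset below are illustrative choices only:

# Compare an unpruned tree with a cost-complexity-pruned tree.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_train, y_train)

print("unpruned depth:", full_tree.get_depth(), "test accuracy:", full_tree.score(X_test, y_test))
print("pruned depth:  ", pruned_tree.get_depth(), "test accuracy:", pruned_tree.score(X_test, y_test))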
Random Forest is a popular machine learning algorithm that belongs to the supervised learning
technique. It can be used for both Classification and Regression problems in ML. It is based on
the concept of ensemble learning, which is a process of combining multiple classifiers to solve
a complex problem and to improve the performance of the model.
As the name suggests, "Random forest is a classifier that contains a number of decision trees
on various subsets of the given dataset and takes the average to improve the predictive
accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the
prediction from each tree and, based on the majority vote of predictions, predicts the final
output.
The greater number of trees in the forest leads to higher accuracy and prevents the
problem of overfitting.
The below diagram explains the working of the Random Forest algorithm:
Note: To better understand the Random Forest algorithm, you should have knowledge of the
Decision Tree algorithm.
Assumptions for Random Forest
Since the random forest combines multiple trees to predict the class of the dataset, it is possible
that some decision trees may predict the correct output, while others may not. But together, all
the trees predict the correct output. Therefore, below are two assumptions for a better Random
forest classifier:
o There should be some actual values in the feature variable of the dataset so that the
classifier can predict accurate results rather than a guessed result.
o The predictions from each tree must have very low correlations.
Below are some points that explain why we should use the Random Forest algorithm:
<="" li="">
o It takes less training time as compared to other algorithms.
o It predicts output with high accuracy, and even for a large dataset it runs efficiently.
o It can also maintain accuracy when a large proportion of data is missing.
Random Forest works in two-phase first is to create the random forest by combining N decision
tree, and second is to make predictions for each tree created in the first phase.
The working process can be explained in the following steps and diagram:
Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Step-1 and Step-2.
Step-5: For new data points, find the predictions of each decision tree, and assign the new data
points to the category that wins the majority vote.
The working of the algorithm can be better understood by the below example:
Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is given
to the Random forest classifier. The dataset is divided into subsets and given to each decision
tree. During the training phase, each decision tree produces a prediction result, and when a new
data point occurs, then based on the majority of results, the Random Forest classifier predicts
the final decision. Consider the below image:
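A minimal scikit-learn sketch mirroring this idea (a built-in dataset stands in for the fruit images; N = 100 trees, majority vote):

# Random forest: many trees on bootstrap subsets, prediction by majority vote.
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(n_estimators=100, random_state=0)  # N = 100 trees
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
print("majority-vote prediction for one sample:", forest.predict(X_test[:1]))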
Applications of Random Forest
There are mainly four sectors where the random forest is mostly used:
1. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
2. Medicine: With the help of this algorithm, disease trends and risks of the disease can be
identified.
3. Land Use: We can identify the areas of similar land use by this algorithm.
4. Marketing: Marketing trends can be identified using this algorithm.
Thus, the independences expressed in this Bayesian net are that A and B are (absolutely)
independent;
C is independent of B given A;
D is independent of C given A and B;
E is independent of A, B, and D given C.
Prob(D=T):
P(D=T) =
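Since the network's conditional probability tables are not reproduced here, the following sketch uses hypothetical CPT values; the factorization itself follows from the independences listed above, which give P(A,B,C,D,E) = P(A) P(B) P(C|A) P(D|A,B) P(E|C), so P(D=T) = Σ over a, b of P(A=a) P(B=b) P(D=T | A=a, B=b):

# Marginalize over A and B to obtain P(D=T); all numbers are hypothetical placeholders.
p_A = {True: 0.3, False: 0.7}                      # hypothetical P(A)
p_B = {True: 0.6, False: 0.4}                      # hypothetical P(B)
p_D_given_AB = {                                   # hypothetical P(D=T | A, B)
    (True, True): 0.9, (True, False): 0.5,
    (False, True): 0.4, (False, False): 0.1,
}

p_D_true = sum(p_A[a] * p_B[b] * p_D_given_AB[(a, b)]
               for a in (True, False) for b in (True, False))
print("P(D=T) =", p_D_true)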
Association rule
Association rule mining is explained using the Apriori Algorithm.
Apriori Algorithm:
Confidence:
Confidence(X ⇒ Y) = support(X ∪ Y) / support(X), where the support of an itemset is the
fraction of transactions that contain it; confidence is therefore the fraction of transactions
containing X that also contain Y.
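A small Python sketch of the support and confidence calculations behind Apriori, on invented transactions:

# Support and confidence of association rules; the basket contents are made up.
transactions = [
    {"milk", "bread"},
    {"milk", "diapers", "beer"},
    {"bread", "diapers", "beer"},
    {"milk", "bread", "diapers", "beer"},
    {"milk", "bread", "diapers"},
]

def support(itemset, baskets):
    return sum(1 for b in baskets if itemset <= b) / len(baskets)

def confidence(lhs, rhs, baskets):
    return support(lhs | rhs, baskets) / support(lhs, baskets)

print("support({diapers, beer}) =", support({"diapers", "beer"}, transactions))   # 0.6
print("confidence(diapers -> beer) =", confidence({"diapers"}, {"beer"}, transactions))  # 0.75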
Parametric Methods
Parametric Estimation
X = {x^t}, t = 1, …, N, where x^t ~ p(x)
Parametric estimation: assume a parametric form p(x | θ) and estimate θ from the sample X.
Log likelihood: L(θ | X) = log ∏_t p(x^t | θ) = Σ_t log p(x^t | θ)
Maximum likelihood estimate: θ* = argmax_θ L(θ | X)
Examples: Bernoulli/Multinomial:
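For the Bernoulli case, the maximum likelihood estimate is simply the sample mean, p̂ = (1/N) Σ_t x^t; a minimal sketch on synthetic data:

# Maximum likelihood estimation for a Bernoulli sample.
import numpy as np

rng = np.random.default_rng(0)
x = rng.binomial(n=1, p=0.3, size=1000)      # x^t ~ Bernoulli(0.3)

p_hat = x.mean()                             # MLE of p
log_likelihood = np.sum(x * np.log(p_hat) + (1 - x) * np.log(1 - p_hat))
print(p_hat, log_likelihood)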
Classification
Regression
Linear Regression:
Data:
Multivariate Parameters:
Parameter Estimation
Classification
Estimating the mean and variance (maximum likelihood estimates):
m = (1/N) Σ_t x^t,   s² = (1/N) Σ_t (x^t − m)²
Quadratic Discriminant:
Tuning Complexity
Discrete Features
Dimensionality Reduction
Necessity:
SCHOOL OF ELECTRICAL AND ELECTRONICS
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
REGRESSION:
Linear regression is a statistical modeling technique for showing the relationship between two
variables with a straight line.
Regression analysis is a form of predictive modeling technique which investigates the
relationship between a dependent variable and an independent variable.
What is the strength of relationship between sales and marketing spending?
What is the relationship between age and income?
Agricultural scientists often use linear regression to measure the effect of fertilizer and
water on crop yields.
Data scientists for professional sports teams often use linear regression to measure the effect
that different training regimens have on player performance
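A minimal linear regression sketch with scikit-learn; the fertilizer-versus-yield numbers are invented for illustration:

# Fit a straight line relating fertilizer amount to crop yield.
import numpy as np
from sklearn.linear_model import LinearRegression

fertilizer = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])   # kg per plot
yield_tons = np.array([1.2, 1.9, 3.1, 3.9, 5.1])             # tons per plot

reg = LinearRegression().fit(fertilizer, yield_tons)
print("slope:", reg.coef_[0], "intercept:", reg.intercept_)
print("predicted yield for 6 kg:", reg.predict([[6.0]])[0])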
Example:
Suppose we have an image of a creature that looks similar to both a cat and a dog, and we want
to know whether it is a cat or a dog.
For this identification, we can use the KNN algorithm, as it works on a similarity measure.
Our KNN model will find the features of the new image that are similar to the cat and dog
images and, based on the most similar features, put it in either the cat or the dog category.
Clustering
• Partitioning approach:
o Construct various partitions and then evaluate them by some criterion, e.g.,
minimizing the sum of square errors
o Typical methods: k-means, k-medoids, CLARANS
• Hierarchical approach:
o Create a hierarchical decomposition of the set of data (or objects) using some
criterion
o Typical methods: DIANA, AGNES, BIRCH, CAMELEON
Mixture densities
Partitioning method
• Partitioning a database D of n objects into a set of k clusters, such that the sum of squared
distances E = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − c_i||² is minimized (where c_i is the centroid or
medoid of cluster C_i)
• Given k, find a partition into k clusters that optimizes the chosen partitioning criterion (see
the k-means sketch below)
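A minimal k-means sketch of this criterion, assuming scikit-learn; k is set to match the number of synthetic blobs:

# k-means on synthetic blob data; inertia_ is the sum of squared distances.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster centroids:\n", kmeans.cluster_centers_)
print("sum of squared distances (inertia):", kmeans.inertia_)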
Start with N groups, each with one instance, and merge the two closest groups at each iteration.
Distance between two groups G_i and G_j:
Single-link clustering: d(G_i, G_j) = min { d(x^r, x^s) : x^r ∈ G_i, x^s ∈ G_j }
Complete-link clustering: d(G_i, G_j) = max { d(x^r, x^s) : x^r ∈ G_i, x^s ∈ G_j }
1. Given a dataset, first find the distance matrix using the Euclidean distance measure.
2. After finding the distance matrix, find the smallest element in the distance matrix.
3. Merge these two points to form a cluster.
4. Next, find the minimum distance between the new cluster and all other points.
5. Now form the new distance matrix, find the smallest value in it, and merge those points to
form a cluster. The process repeats until all points are grouped into clusters.
6. Finally, draw the dendrogram (see the SciPy sketch below).
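A hedged SciPy sketch of this procedure (single-link linkage and a dendrogram) on five invented 2-D points:

# Agglomerative clustering: compute single-link merges and plot the dendrogram.
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage
import matplotlib.pyplot as plt

points = np.array([[1.0, 1.0], [1.5, 1.5], [5.0, 5.0], [5.5, 6.0], [3.0, 4.0]])

Z = linkage(points, method="single", metric="euclidean")  # pairwise merge history
dendrogram(Z)
plt.savefig("dendrogram.png")   # or plt.show()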
Example:
Nonparametric Methods
Density Estimation
⚫ Histogram estimator: p̂(x) = #{x^t in the same bin as x} / (N h)
⚫ Naive estimator: p̂(x) = #{x − h < x^t ≤ x + h} / (2 N h)
or, equivalently, p̂(x) = (1/(N h)) Σ_{t=1}^{N} w((x − x^t)/h), where w(u) = 1/2 if |u| < 1 and 0 otherwise
Kernel Estimator:
⚫ Kernel function, e.g., Gaussian kernel:
K(u) = (1/√(2π)) exp(−u²/2)
⚫ Kernel estimator:
p̂(x) = (1/(N h)) Σ_{t=1}^{N} K((x − x^t)/h)
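A direct NumPy implementation of this kernel estimator; the sample and the bandwidth h are arbitrary illustrative choices:

# Gaussian kernel density estimate at a single query point.
import numpy as np

rng = np.random.default_rng(0)
sample = rng.normal(0.0, 1.0, size=200)   # x^t, t = 1..N
h = 0.3                                   # bandwidth

def kde(x, data, h):
    u = (x - data) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)   # Gaussian kernel K(u)
    return K.sum() / (len(data) * h)                 # p_hat(x)

print(kde(0.0, sample, h))   # estimated density at x = 0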
k-Nearest Neighbor Estimator:
⚫ Instead of fixing the bin width h and counting the number of instances, fix the number of
instances (neighbors) k and compute the bin width:
p̂(x) = k / (2 N d_k(x))
where d_k(x) is the distance from x to its kth closest instance.
The nearest neighbor class of estimators adapts the amount of smoothing to the local density
of data. The degree of smoothing is controlled by k, the number of neighbors taken into
account, which is much smaller than N, the sample size. Let us define a distance between a
and b, for example, |a − b|, and for each x, we define d_1(x) ≤ d_2(x) ≤ … ≤ d_N(x) to be the
distances from x to the points in the sample, arranged in ascending order; the k-nn density
estimate then uses d_k(x) as given above.
Generalization to multivariate data
where S is the sample covariance matrix. This corresponds to using Mahalanobis distance instead
of the Euclidean distance. It is also possible to have the distance metric local where S is
calculated from instances in the vicinity of x, for example, some k closest instances.
Note that S calculated locally may be singular and PCA (or LDA, in the case of classification)
may be needed.
Hamming distance:
When the inputs are discrete, we can use Hamming distance, which counts the number of
nonmatching attributes.
Nonparametric Classification
• Kernel estimator:
p̂(x | C_i) = (1/(N_i h^d)) Σ_{t=1}^{N} K((x − x^t)/h) r_i^t,   P̂(C_i) = N_i / N
Discriminant: g_i(x) = p̂(x | C_i) P̂(C_i) = (1/(N h^d)) Σ_{t=1}^{N} K((x − x^t)/h) r_i^t
• k-NN estimator:
p̂(x | C_i) = k_i / (N_i V^k(x)),   P̂(C_i) = N_i / N,   p̂(x) = k / (N V^k(x))
P̂(C_i | x) = p̂(x | C_i) P̂(C_i) / p̂(x) = k_i / k
where k_i is the number of the k nearest neighbors that belong to class C_i and V^k(x) is the
volume of the hypersphere centered at x with radius equal to the distance to the kth nearest
neighbor.
The k-nn classifier assigns the input to the class having most examples among the k neighbors of
the input. All neighbors have equal vote, and the class having the maximum number of voters
among the k neighbors is chosen. Ties are broken arbitrarily or a weighted vote is taken. K is
generally taken to be an odd number to minimize ties: confusion is generally between two
neighboring classes. Again, the use of Euclidean distance corresponds to assuming uncorrelated
inputs with equal variances, and when this is not the case a suitable discriminant metric should
be used. One example is discriminant adaptive nearest neighbor where the optimal distance to
separate classes is estimated locally.
Nonparametric Regression
Instead of taking an average and giving a constant fit at a point, we can take into account one
more term in the Taylor expansion and calculate a linear fit. In the running line smoother, we can
use the data points in the neighborhood, as defined by h or k, and fit a local regression line. In
the locally weighted running line smoother, known as loess, instead of a hard definition of
neighborhoods, we use kernel weighting such that distant points have less effect on error.
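A compact sketch of a locally weighted running-line smoother of this kind (a loess-like local linear fit with Gaussian kernel weights); the data and the bandwidth are synthetic:

# Local linear fit at a query point x0 with kernel weights on the neighbors.
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 100))
y = np.sin(x) + rng.normal(0, 0.2, size=x.size)

def local_linear(x0, x, y, h=1.0):
    w = np.exp(-0.5 * ((x - x0) / h) ** 2)          # kernel weights: distant points count less
    X = np.column_stack([np.ones_like(x), x - x0])  # local design matrix (intercept, slope)
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return beta[0]                                  # fitted value at x0

print(local_linear(5.0, x, y))   # smoothed estimate near x = 5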
Decision Trees
Algorithm:
Attribute Selection Measure:
Example:
After further iterations, the final decision tree would be,
Univariate Trees
Multivariate Trees
Linear Discriminant:
⚫ Linear discriminant:
g_i(x | w_i, w_i0) = w_i^T x + w_i0 = Σ_{j=1}^{d} w_ij x_j + w_i0
⚫ Advantages:
⚫ Optimal when p(x|Ci) are Gaussian with shared cov matrix; useful when classes
are (almost) linearly separable
⚫ Quadratic discriminant:
g_i(x | W_i, w_i, w_i0) = x^T W_i x + w_i^T x + w_i0
Writing the higher-order terms as new variables:
z_1 = x_1, z_2 = x_2, z_3 = x_1², z_4 = x_2², z_5 = x_1 x_2
⚫ Map from x to z using nonlinear basis functions and use a linear discriminant in z-space:
g_i(x) = Σ_{j=1}^{k} w_ij φ_j(x), where the φ_j(x) are the basis functions
Two Classes:
g(x) = g_1(x) − g_2(x)
     = (w_1^T x + w_10) − (w_2^T x + w_20)
     = (w_1 − w_2)^T x + (w_10 − w_20)
     = w^T x + w_0
Choose C_1 if g(x) > 0, and C_2 otherwise.
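A numeric sketch of this two-class rule with assumed (hypothetical) weight vectors and biases:

# Two-class linear discriminant: classify x by the sign of g(x) = w^T x + w0.
import numpy as np

w1, w10 = np.array([2.0, -1.0]), 0.5    # hypothetical parameters for class C1
w2, w20 = np.array([0.5, 1.0]), -0.2    # hypothetical parameters for class C2

w = w1 - w2
w0 = w10 - w20

def classify(x):
    g = w @ x + w0
    return "C1" if g > 0 else "C2"

print(classify(np.array([1.0, 0.0])))   # g = 1.5*1 + (-2.0)*0 + 0.7 > 0 -> C1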
Multiple Classes:
Choose C_i if g_i(x) = max_j g_j(x), i.e., assign x to the class with the highest discriminant value.
Market segmentation
Document Clustering
Image segmentation
Image compression
Customer segmentation
Analyzing the trend on dynamic data
Hierarchical Clustering
City-block distance:
d_cb(x^r, x^s) = Σ_{j=1}^{d} |x_j^r − x_j^s|
• Two important paradigms:
– Bottom-up agglomerative clustering(Agglomerative Nesting-AGNES)
– Top-down divisive clustering(Divisive Analysis-DIANA)
Partitioning methods (K-means, PAM clustering) and hierarchical clustering work for finding
spherical-shaped clusters or convex clusters. In other words, they are suitable only for compact
and well-separated clusters. Moreover, they are also severely affected by the presence of noise
and outliers in the data.
Real life data may contain irregularities, like –
i) Clusters can be of arbitrary shape such as those shown in the figure below.
ii) Data may contain noise.
The figure below shows a data set containing nonconvex clusters and outliers/noise. Given
such data, the k-means algorithm has difficulty identifying these clusters with arbitrary shapes.
DBSCAN algorithm requires two parameters –
1. eps: It defines the neighborhood around a data point, i.e., if the distance between two
points is less than or equal to 'eps' then they are considered neighbors. If the eps value is
chosen too small, then a large part of the data will be considered outliers. If it is chosen very
large, then the clusters will merge and the majority of the data points will be in the same
cluster. One way to find the eps value is based on the k-distance graph.
2. MinPts: The minimum number of neighbors (data points) within the eps radius. The larger
the dataset, the larger the value of MinPts that must be chosen. As a general rule, the minimum
MinPts can be derived from the number of dimensions D in the dataset as MinPts >= D+1.
MinPts must be chosen to be at least 3.
DBSCAN in action
Parameter Estimation
Every data mining task has the problem of parameters. Every parameter influences the
algorithm in specific ways. For DBSCAN, the parameters ε and minPts are needed.
minPts: As a rule of thumb, a minimum minPts can be derived from the number of
dimensions D in the data set, as minPts ≥ D + 1. The low value minPts = 1 does not make
sense, as then every point on its own will already be a cluster. With minPts ≤ 2, the result will
be the same as of hierarchical clustering with the single link metric, with the dendrogram cut at
height ε. Therefore, minPts must be chosen at least 3. However, larger values are usually better
for data sets with noise and will yield more significant clusters. As a rule of thumb, minPts =
2·dim can be used, but it may be necessary to choose larger values for very large data, for
noisy data or for data that contains many duplicates.
ε: The value for ε can then be chosen by using a k-distance graph, plotting the distance
to the k = minPts-1 nearest neighbor ordered from the largest to the smallest value. Good
values of ε are where this plot shows an “elbow”: if ε is chosen much too small, a large part of
the data will not be clustered; whereas for a too high value of ε, clusters will merge and the
majority of objects will be in the same cluster. In general, small values of ε are preferable, and
as a rule of thumb, only a small fraction of points should be within this distance of each other.
Distance function: The choice of distance function is tightly linked to the choice of ε,
and has a major impact on the outcomes. In general, it will be necessary to first identify a
reasonable measure of similarity for the data set, before the parameter ε can be chosen. There is
no estimation for this parameter, but the distance functions need to be chosen appropriately for
the data set.
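A minimal DBSCAN sketch with scikit-learn; eps and min_samples (MinPts) are illustrative values that would normally be chosen using the k-distance plot described above:

# DBSCAN on a non-convex toy dataset; label -1 marks noise points.
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

db = DBSCAN(eps=0.2, min_samples=5).fit(X)
labels = db.labels_
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, "noise points:", list(labels).count(-1))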
SCHOOL OF ELECTRICAL AND ELECTRONICS
DEPARTMENT OF ELECTRONICS AND COMMUNICATION ENGINEERING
Feature extraction is about extracting/deriving information from the original features set to
create a new features subspace. The primary idea behind feature extraction is to compress the
data with the goal of maintaining most of the relevant information. As with feature selection
techniques, these techniques are also used for reducing the number of features from the original
features set to reduce model complexity, model overfitting, enhance model computation
efficiency and reduce generalization error. The following are different types of feature extraction
techniques:
Principal component analysis (PCA) is used for unsupervised data compression. PCA finds the
directions of maximum variance in high-dimensional data and projects the data onto a new
subspace with equal or fewer dimensions than the original one. This can be explained with the
example of identifying the Taj Mahal from a top view or a side view based on the dimensions
in which there is maximum variance. The diagram below shows the dimensions of maximum
variance (PCA1 and PCA2) as a result of PCA.
Example:
Suppose we have two sets of data points belonging to two different classes that we want to
classify. As shown in the given 2D graph, when the data points are plotted on the 2D plane,
there’s no straight line that can separate the two classes of the data points completely. Hence, in
this case, LDA (Linear Discriminant Analysis) is used which reduces the 2D graph into a 1D
graph in order to maximize the separability between the two classes.
Here, Linear Discriminant Analysis uses both the axes (X and Y) to create a new axis and
projects data onto a new axis in a way to maximize the separation of the two categories and
hence, reducing the 2D graph into a 1D graph.
Two criteria are used by LDA to create a new axis:
1. Maximize the distance between the means of the two classes.
2. Minimize the variation (scatter) within each class.
In the above graph, it can be seen that a new axis (in red) is generated and plotted in the 2D
graph such that it maximizes the distance between the means of the two classes and minimizes
the variation within each class. In simple terms, this newly generated axis increases the
separation between the data points of the two classes. After generating this new axis using the
above-mentioned criteria, all the data points of the classes are plotted on this new axis and are
shown in the figure given below.
But Linear Discriminant Analysis fails when the means of the distributions are shared, as it
becomes impossible for LDA to find a new axis that makes both classes linearly separable. In
such cases, we use non-linear discriminant analysis.
Mathematics
Let us suppose we have two classes and d-dimensional samples x1, x2, …, xn, where n1
samples come from class c1 and n2 samples come from class c2.
If xi is a data point, then its projection onto the line represented by the unit vector v can be
written as v^T xi.
Extensions to LDA:
1. Quadratic Discriminant Analysis (QDA): Each class uses its own estimate of
variance (or covariance when there are multiple input variables).
2. Flexible Discriminant Analysis (FDA): Where non-linear combinations of inputs is
used such as splines.
3. Regularized Discriminant Analysis (RDA): Introduces regularization into the
estimate of the variance (actually covariance), moderating the influence of different
variables on LDA.
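A hedged scikit-learn sketch of LDA projecting data onto the discriminant axes; the iris dataset is only a convenient stand-in:

# LDA as a supervised dimensionality reduction: project onto 2 discriminant axes.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis(n_components=2)
X_projected = lda.fit_transform(X, y)     # project onto the new discriminant axes
print(X_projected.shape)                  # (150, 2)
print("classification accuracy on training data:", lda.score(X, y))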
Principal Component Analysis is an unsupervised learning algorithm that is used for the
dimensionality reduction in machine learning. It is a statistical process that converts the
observations of correlated features into a set of linearly uncorrelated features with the help of
orthogonal transformation. These new transformed features are called the Principal
Components. It is one of the popular tools that is used for exploratory data analysis and
predictive modeling. It is a technique for drawing out the strong patterns in a given dataset by
keeping the directions of largest variance and discarding the rest.
PCA generally tries to find a lower-dimensional surface onto which to project the
high-dimensional data. PCA works by considering the variance of each attribute, because
attributes with high variance show a good split between the classes, and hence it reduces the
dimensionality. Some real-world applications of PCA are image processing, movie
recommendation systems, and optimizing power allocation in various communication channels.
It is a feature extraction technique, so it keeps the important variables and drops the least
important ones.
As described above, the transformed new features or the output of PCA are the Principal
Components. The number of these PCs are either equal to or less than the original features
present in the dataset. Some properties of these principal components are given below:
o The principal component must be the linear combination of the original features.
o These components are orthogonal, i.e., the correlation between a pair of variables is zero.
o The importance of each component decreases when going from 1 to n; the first PC has
the most importance, and the nth PC has the least importance.
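A minimal PCA sketch with scikit-learn showing the projection onto two principal components and their explained variance:

# Standardize, project onto two principal components, inspect explained variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = load_iris(return_X_y=True)
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_std)
print("explained variance ratio:", pca.explained_variance_ratio_)
print("shape after projection:", X_pca.shape)   # (150, 2)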
Factor Analysis
Factor analysis is a technique for reducing a huge number of variables into a few factors; this is
known as factoring the data. It is a statistical approach that is used to describe variability
among observed, correlated variables in terms of a potentially lower number of unobserved
variables called factors.
The factor analysis technique extracts the maximum common variance from all the variables
and puts them into a common score. It is a theory that is used in training the machine learning
model and so it is quite related to data mining. The belief behind factor analytic techniques is
that the information gained about the interdependencies between observed variables can be
used later to reduce the set of variables in a dataset.
Factor analysis is a very effective tool for inspecting relationships among variables for complex
concepts such as social status, economic status, dietary patterns, psychological scales, biology,
psychometrics, personality theories, marketing, product management, operations research,
finance, etc. It can help a researcher investigate concepts that are not easily measured directly,
in a much easier and quicker way, by collapsing a large number of variables into a few easily
interpretable fundamental factors.
Types of factor analysis:
1. Exploratory factor analysis (EFA) :
It is used to identify composite inter-relationships among items and to group items that are part
of unified concepts. The analyst cannot make any prior assumptions about the relationships
among factors. It is also used to find the fundamental structure of a huge set of variables,
reducing the large data to a much smaller set of summary variables. It is almost similar to
Confirmatory Factor Analysis (CFA).
Similarities are:
1. Evaluate the internal reliability of a measure.
2. Examine the factors represented by item sets. They presume that the factors
aren’t correlated.
3. Investigate the grade/class of each item.
However, there are some common differences, most of them concerned with how the factors
are used. Basically, EFA is a data-driven approach which allows all items to load on all the
factors, while in CFA you need to specify which factors each item loads on. EFA is a good
choice if you have no idea what common factors might exist: it is able to generate a huge
number of possible models for your data, something which is not possible if a researcher has to
specify the factors. If you have some idea of what the models actually look like and you then
want to test your hypotheses about the data structure, CFA is the better approach.
2. Confirmatory factor analysis (CFA) :
It is a more complex(composite) approach that tests the theory that the items are associated
with specific factors. Confirmatory Factor Analysis uses a properly structured equation model
to test a measurement model whereby loading on the factors allows for the evaluation of
relationships between observed variables and unobserved variables.
As we know, structural equation modelling approaches can accommodate measurement error
easily and are much less restrictive than least-squares estimation, and thus provide more scope
to accommodate errors. Hypothesized models are tested against actual data, and the analysis
demonstrates the loadings of observed variables on the latent variables (factors), as well as the
correlations between the latent variables.
Confirmatory Factor Analysis allows an analyst and researcher to figure out if a relationship
between a set of observed variables (also known as manifest variables) and their underlying
constructs exists. It is similar to the Exploratory Factor Analysis.
The main difference between the two is:
1. Simply use Exploratory Factor Analysis to explore the pattern.
2. Use Confirmatory Factor Analysis to perform hypothesis testing.
Confirmatory factor analysis provides information about the quality of the chosen number of
factors for representing the data set; using CFA, you can define the total number of factors
required. For example, CFA can answer questions like "Can my thousand-question survey
accurately measure one specific factor?" Even though it is technically applicable to any
discipline, it is typically used in the social sciences.
3. Multiple Factor Analysis :
This type of factor analysis is used when your variables are structured in separate groups. For
example, you may have a teenager's health questionnaire with several sections like sleeping
patterns, addictions, psychological health, mobile phone addiction, or learning disabilities.
Multiple Factor Analysis is performed in two steps (a rough sketch follows this list):
1. First, a Principal Component Analysis is performed on each section of the data. This
gives a leading eigenvalue for each section, which is used to normalize that data set for
further use.
2. The normalized data sets are then merged into a single matrix and a global PCA is
performed.
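The two steps above can be sketched roughly as follows with scikit-learn's PCA. The questionnaire groups and their sizes are hypothetical, and the eigenvalue-based weighting is a simplification; dedicated packages provide fuller MFA implementations.

# A rough sketch of the two-step Multiple Factor Analysis procedure described above.
# The groups and their sizes are made-up assumptions for illustration only.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
groups = {                       # hypothetical groups of questionnaire items
    "sleep": rng.normal(size=(100, 4)),
    "phone_use": rng.normal(size=(100, 3)),
    "learning": rng.normal(size=(100, 5)),
}

normalized_blocks = []
for block in groups.values():
    block = block - block.mean(axis=0)
    first_eig = PCA(n_components=1).fit(block).explained_variance_[0]
    # Step 1: scale each group using its first eigenvalue so no group dominates.
    normalized_blocks.append(block / np.sqrt(first_eig))

# Step 2: merge the scaled blocks into one matrix and run a global PCA.
merged = np.hstack(normalized_blocks)
global_scores = PCA(n_components=2).fit_transform(merged)
print(global_scores.shape)       # (100, 2)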
4. Generalized Procrustes Analysis (GPA):
Procrustes analysis is a way of comparing two approximate sets of configurations and shapes. It
was originally developed to match two solutions from factor analysis; the technique was later
extended to Generalized Procrustes Analysis so that more than two shapes can be compared.
The shapes are aligned to a common target shape. GPA mainly uses geometric transformations
for this alignment.
The geometric transformations used are (a pairwise illustration follows this list):
1. Isotropic rescaling,
2. Reflection,
3. Rotation,
4. Translation of matrices to compare the sets of data.
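For the pairwise case, SciPy ships a procrustes function that applies exactly these transformations (translation, rotation, reflection, isotropic rescaling) to align two shapes. The two triangles below are made-up data; generalized (more-than-two-shape) GPA would need a dedicated implementation.

# A small, hedged illustration of pairwise Procrustes analysis with SciPy.
# The two triangles are made-up data for illustration only.
import numpy as np
from scipy.spatial import procrustes

shape_a = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
shape_b = np.array([[1.0, 1.0], [3.0, 1.0], [1.0, 3.0]])  # shifted and scaled copy of shape_a

aligned_a, aligned_b, disparity = procrustes(shape_a, shape_b)
print(disparity)   # ~0.0, because the shapes differ only by the allowed transformations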
Independent Component Analysis (ICA)
Consider a party in a room full of people. There are ‘n’ speakers in the room, all speaking
simultaneously. In the same room there are also ‘n’ microphones placed at different distances
from the speakers, each recording the ‘n’ speakers’ voice signals. Hence, the number of
speakers is equal to the number of microphones in the room.
Now, using these microphones’ recordings, we want to separate the ‘n’ speakers’ voice signals,
given that each microphone recorded the voice signal coming from each speaker at a different
intensity because of the different distances between them. Decomposing the mixed signal of
each microphone’s recording into the independent source speech signals can be done using the
machine learning technique independent component analysis.
[ X1, X2, ….., Xn ] => [ Y1, Y2, ….., Yn ]
where X1, X2, …, Xn are the original source signals present in the mixed recordings, and Y1,
Y2, …, Yn are the recovered independent components, which are independent of each other.
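A minimal sketch of this cocktail-party separation with scikit-learn's FastICA is shown below. The two synthetic source signals and the mixing matrix are assumptions made only for illustration; in practice only the microphone recordings X would be available.

# A minimal sketch of the cocktail-party separation described above, using FastICA.
# The source signals and the mixing matrix A are synthetic, illustration-only assumptions.
import numpy as np
from sklearn.decomposition import FastICA

t = np.linspace(0, 8, 2000)

# Two independent "speakers": a sine wave and a square-like signal.
s1 = np.sin(2 * t)
s2 = np.sign(np.sin(3 * t))
S = np.c_[s1, s2]                      # true sources (unknown in practice)

A = np.array([[1.0, 0.5], [0.5, 1.0]]) # assumed mixing matrix (microphone gains)
X = S @ A.T                            # what the two microphones record

ica = FastICA(n_components=2, random_state=0)
Y = ica.fit_transform(X)               # estimated independent components Y1, Y2
print(Y.shape)                         # (2000, 2): one recovered signal per column

Note that ICA recovers the sources only up to ordering and scaling, so the columns of Y may appear permuted or rescaled relative to the original sources.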
Restrictions on ICA
1. The independent components are assumed to be statistically independent of each other.
2. The independent components must have non-Gaussian distributions (at most one source
may be Gaussian).
3. The number of observed mixtures is assumed to equal the number of independent
sources, so that the mixing matrix is square and invertible.
A different dimensionality reduction technique, Locally Linear Embedding (LLE), works as
follows. For each training instance x(i), the algorithm first finds its k nearest neighbors and
then tries to express x(i) as a linear function of them. In general, if there are m training
instances in total, it tries to find the set of weights w that minimizes the squared distance
between x(i) and its linear reconstruction.
In the second step, the weights w(i,j) are kept fixed while the algorithm searches for the optimal
low-dimensional coordinates y(i).
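A hedged sketch of this neighbor-based procedure with scikit-learn's LocallyLinearEmbedding is given below. The Swiss-roll data set, k = 10 neighbors, and the 2-D target dimension are arbitrary choices made only for illustration.

# A minimal sketch of the neighbor-based reconstruction described above, using LLE.
# The data set and the parameter choices are illustration-only assumptions.
from sklearn.datasets import make_swiss_roll
from sklearn.manifold import LocallyLinearEmbedding

X, _ = make_swiss_roll(n_samples=1000, random_state=0)

lle = LocallyLinearEmbedding(n_neighbors=10, n_components=2, random_state=0)
Y = lle.fit_transform(X)   # low-dimensional coordinates y(i), found with the weights kept fixed
print(Y.shape)             # (1000, 2)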
Collaborative Filtering
Collaborative filtering works by searching a large group of people and finding a smaller set of
users with tastes similar to a particular user. It looks at the items those similar users like and
combines them to create a ranked list of suggestions.
There are many ways to decide which users are similar and to combine their choices into a list
of recommendations. The key questions are:
How do you determine which users or items are similar to one another?
Given that you know which users are similar, how do you determine the rating that a user
would give to an item based on the ratings of similar users?
How do you measure the accuracy of the ratings you calculate?
Collaborative filtering is a family of algorithms in which there are multiple ways to find similar
users or items and multiple ways to calculate a rating based on the ratings of similar users. One
important thing to keep in mind is that in an approach based purely on collaborative filtering, the
similarity is not calculated using factors like the age of users, genre of the movie, or any other
data about users or items. It is calculated only on the basis of the rating (explicit or implicit) a
user gives to an item. For example, two users can be considered similar if they give the same
ratings to ten movies despite there being a big difference in their age. The third question for how
to measure the accuracy of your predictions also has multiple answers, which include error
calculation techniques that can be used in many places and not just recommenders based on
collaborative filtering. One of the approaches to measure the accuracy of your result is the Root
Mean Square Error (RMSE), in which you predict ratings for a test dataset of user-item pairs
whose rating values are already known. The difference between the known value and the
predicted value would be the error. Square all the error values for the test set, find the average (or
mean), and then take the square root of that average to get the RMSE. Another metric to measure
the accuracy is Mean Absolute Error (MAE), in which you find the magnitude of error by
finding its absolute value and then taking the average of all error values.
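Both error measures are easy to compute directly; the sketch below uses NumPy on a handful of made-up known and predicted ratings.

# A small sketch of the RMSE and MAE calculations described above, using NumPy.
# The known and predicted ratings are made-up numbers for illustration.
import numpy as np

known = np.array([4.0, 3.0, 5.0, 2.0, 4.5])       # ratings held out for testing
predicted = np.array([3.5, 3.0, 4.0, 2.5, 4.0])   # ratings produced by the recommender

errors = known - predicted
rmse = np.sqrt(np.mean(errors ** 2))   # square, average, then take the square root
mae = np.mean(np.abs(errors))          # average of the absolute error magnitudes
print(rmse, mae)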
Memory Based
The first category includes memory-based algorithms, in which statistical techniques are
applied to the entire dataset to calculate the predictions.
To find the rating R that a user U would give to an item I, the approach includes:
1. Finding users similar to U who have rated the item I.
2. Calculating the rating R based on the ratings of those similar users.
This can be done from the user’s side (user-based) or from the item’s side (item-based).
The two approaches are mathematically quite similar, but there is a conceptual difference
between the two. Here’s how the two compare:
User-based: For a user U, with a set of similar users determined based on rating vectors
consisting of given item ratings, the rating for an item I, which hasn’t been rated, is found
by picking out N users from the similarity list who have rated the item I and calculating
the rating based on these N ratings.
Item-based: For an item I, with a set of similar items determined based on rating vectors
consisting of received user ratings, the rating by a user U, who hasn’t rated it, is found by
picking out N items from the similarity list that have been rated by U and calculating the
rating based on these N ratings.
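To make the user-based variant concrete, here is a small hedged sketch: cosine similarity between rating vectors picks the N most similar users who have rated item I, and their ratings are averaged. The tiny ratings matrix (rows are users, columns are items, 0 means not rated) is made-up data, and a real system would typically weight the average by similarity.

# A hedged sketch of the user-based idea: find users similar to U from their rating
# vectors and average the ratings they gave to item I. The matrix is made-up data.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

target_user, target_item, n_neighbors = 0, 2, 2

sims = cosine_similarity(ratings)[target_user]            # similarity of U to every user
sims[target_user] = -1                                    # exclude U itself
rated = np.where(ratings[:, target_item] > 0)[0]          # users who have rated item I
top = rated[np.argsort(sims[rated])[::-1][:n_neighbors]]  # N most similar of those users
predicted = ratings[top, target_item].mean()              # simple unweighted average
print(predicted)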
Item-based collaborative filtering was developed by Amazon. In a system where there are more
users than items, item-based filtering is faster and more stable than user-based. It is effective
because usually, the average rating received by an item doesn’t change as quickly as the average
rating given by a user to different items. It’s also known to perform better than the user-based
approach when the ratings matrix is sparse.
However, the item-based approach performs poorly for datasets with browsing- or
entertainment-related items such as MovieLens, where the recommendations it gives can seem
very obvious to the target users.
Model Based
The second category covers the Model based approaches, which involve a step to reduce or
compress the large but sparse user-item matrix. For understanding this step, a basic
understanding of dimensionality reduction can be very helpful.
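As a rough sketch of that compression step, the example below factors a small user-item matrix into two latent dimensions with scikit-learn's TruncatedSVD; the matrix and the number of factors are assumptions made only for illustration.

# A minimal sketch of the compression step used by model-based approaches:
# factor the sparse user-item matrix into a small number of latent dimensions.
import numpy as np
from sklearn.decomposition import TruncatedSVD

ratings = np.array([
    [5, 4, 0, 1, 0],
    [4, 5, 1, 0, 0],
    [1, 0, 5, 4, 5],
    [0, 1, 4, 5, 4],
], dtype=float)

svd = TruncatedSVD(n_components=2, random_state=0)
user_factors = svd.fit_transform(ratings)   # one low-dimensional vector per user
item_factors = svd.components_.T            # one low-dimensional vector per item
approx = user_factors @ item_factors.T      # reconstructed (dense) rating matrix
print(approx.round(1))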
• Collaborative filtering doesn’t require features of the items or users to be known. It is
suited for a set of different types of items, for example, a supermarket’s inventory where
items of various categories can be added. In a set of similar items such as that of a
bookstore, though, known features like writers and genres can be useful and might
benefit from content-based or hybrid approaches.
• Collaborative filtering can help recommenders avoid overspecializing in a user’s profile
and recommend items that are completely different from what the user has seen before.
If you want your recommender to not suggest a pair of sneakers to someone who just
bought another similar pair of sneakers, then try adding collaborative filtering to your
recommender.
• Collaborative filtering can lead to problems like cold start for new items that are added
to the list: until someone rates them, they don’t get recommended.
• Data sparsity can affect the quality of user-based recommenders and also add to the
cold-start problem mentioned above.
• Scaling can be a challenge for growing datasets, as the complexity can become too
large. Item-based recommenders are faster than user-based ones when the dataset is
large.
• With a straightforward implementation, you might observe that the recommendations
tend to be already-popular items, and items from the long-tail section might get ignored.
With every type of recommender algorithm having its own list of pros and cons, it’s usually a
hybrid recommender that comes to the rescue. The benefits of multiple algorithms working
together or in a pipeline can help you set up more accurate recommenders. In fact, the solution of
the winner of the Netflix prize was also a complex mix of multiple algorithms.
Advantages
No domain knowledge necessary
We don't need domain knowledge because the embeddings are automatically learned.
Serendipity
The model can help users discover new interests. In isolation, the ML system may not know the
user is interested in a given item, but the model might still recommend it because similar users
are interested in that item.
Disadvantages
Cannot handle fresh items
The prediction of the model for a given (user, item) pair is the dot product of the corresponding
embeddings. So, if an item is not seen during training, the system can't create an embedding for
it and can't query the model with this item. This issue is often called the cold-start problem.
However, the following techniques can address the cold-start problem to some extent:
Projection in WALS. Given a new item i0 not seen in training, if the system has a few
interactions with users, then the system can easily compute an embedding vi0 for this item
without having to retrain the whole model. The system simply has to solve the following
equation or the weighted version:
\min_{v_{i_0} \in \mathbb{R}^d} \left\lVert A_{i_0} - U\, v_{i_0} \right\rVert
The preceding equation corresponds to one iteration in WALS: the user embeddings are kept
fixed, and the system solves for the embedding of item i0. The same can be done for a new user.
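A hedged sketch of this projection step with NumPy is shown below: the user embedding matrix U is kept fixed and the new item's embedding is obtained from an ordinary least-squares solve. The embedding values and the feedback vector are made-up numbers.

# A hedged sketch of the projection step above: with the user embeddings U kept
# fixed, solve a least-squares problem for the embedding v_i0 of a new item from
# its few observed interactions A_i0. U and A_i0 are made-up numbers.
import numpy as np

U = np.array([[0.9, 0.1],          # fixed user embeddings (one row per user)
              [0.2, 0.8],
              [0.7, 0.3]])
A_i0 = np.array([1.0, 0.0, 1.0])   # observed feedback of these users on item i0

# v_i0 = argmin ||A_i0 - U v||, solved directly without retraining the model.
v_i0, *_ = np.linalg.lstsq(U, A_i0, rcond=None)
print(v_i0)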
Heuristics to generate embeddings of fresh items. If the system does not have any interactions
for the item, it can approximate the item’s embedding by averaging the embeddings of items
from the same category, from the same uploader (on YouTube), and so on.
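A tiny sketch of this averaging heuristic, on made-up embeddings, could look like the following.

# Approximate the embedding of a fresh item by averaging the embeddings of
# existing items in the same category. The embeddings here are made-up data.
import numpy as np

category_item_embeddings = np.array([
    [0.8, 0.1, 0.3],
    [0.7, 0.2, 0.4],
    [0.9, 0.0, 0.2],
])
fresh_item_embedding = category_item_embeddings.mean(axis=0)
print(fresh_item_embedding)   # [0.8 0.1 0.3]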