UNIT 3: TREE AND PROBABILISTIC MODELS: Learning with Trees – Decision Trees – Constructing
Decision Trees – Classification and Regression Trees – Ensemble Learning – Boosting – Bagging –
Different ways to Combine Classifiers – Probability and Learning – Data into Probabilities – Basic
Statistics – Gaussian Mixture Models – Nearest Neighbour Methods – Unsupervised Learning – K
means Algorithms – Vector Quantization – Self Organizing Feature Map.
1. Learning with Trees
Learning with trees refers to the process of building tree-structured models to make decisions or
predictions based on data. These models are popular in both classification (categorical outputs) and
regression (continuous outputs) tasks because of their simplicity, interpretability, and versatility.
Decision Trees
A Decision Tree is a supervised learning algorithm used to model decisions and their possible
outcomes. It splits the dataset into subsets based on feature values, eventually leading to a
prediction.
Components of a Decision Tree
1. Root Node: The starting point of the tree that represents the entire dataset.
2. Internal Nodes: Represent decision points based on feature splits.
3. Branches: Show the outcome of a decision (e.g., if age > 30).
4. Leaf Nodes: Represent the final outcomes (e.g., a class label or a value).
How is a Decision Tree formed?
The process of forming a decision tree involves recursively partitioning the data based on the values
of different attributes. The algorithm selects the best attribute to split the data at each internal node,
based on certain criteria such as information gain or Gini impurity. This splitting process continues
until a stopping criterion is met, such as reaching a maximum depth or having a minimum number of
instances in a leaf node.
Why Decision Tree?
Decision trees are widely used in machine learning for a number of reasons:
Because of their interpretability and versatility, decision trees can simulate intricate decision-making processes.
Their hierarchical structure makes it possible to portray complex choice scenarios that take into account a variety of causes and outcomes.
Because they provide comprehensible insights into the decision logic, decision trees are especially helpful for tasks involving classification and regression.
They are proficient with both numerical and categorical data, and they can easily adapt to a variety of datasets thanks to their built-in feature selection capability.
Decision trees also provide simple visualization, which helps to comprehend and explain the underlying decision process of a model.
Decision Tree Approach
A decision tree uses the tree representation to solve the problem: each leaf node corresponds
to a class label, and attributes are tested at the internal nodes of the tree. We can represent any
Boolean function on discrete attributes using a decision tree.
Below are some assumptions that we made while using the decision tree:
At the beginning, we consider the whole training set as the root.
Feature values are preferred to be categorical. If the values are continuous then they are
discretized prior to building the model.
On the basis of attribute values, records are distributed recursively.
We use statistical measures (such as information gain) to decide which attribute becomes the root or an internal node.
Types of Decision Trees
Hunt’s algorithm, which was developed in the 1960s to model
human learning in Psychology, forms the foundation of many
popular decision tree algorithms, such as the following:
- ID3: Ross Quinlan is credited with the development of ID3, which
is shorthand for “Iterative Dichotomiser 3.” This algorithm leverages
entropy and information gain as metrics to evaluate candidate
splits; Quinlan published his research on the algorithm in 1986.
- C4.5: This algorithm is considered a later iteration of ID3, which
was also developed by Quinlan. It can use information gain or gain
ratios to evaluate split points within the decision trees.
- CART: The term, CART, is an abbreviation for “classification and
regression trees” and was introduced by Leo Breiman. This
algorithm typically utilizes Gini impurity to identify the ideal
attribute to split on. Gini impurity measures how often a randomly
chosen data point would be misclassified if it were labelled according
to the class distribution. When evaluating splits using Gini
impurity, a lower value is more ideal.
How to choose the best attribute at each node
While there are multiple ways to select the best attribute at each node,
two methods, information gain and Gini impurity, act as popular splitting
criteria for decision tree models. They help to evaluate the quality of
each test condition and how well it will be able to classify samples into a
class.
Entropy and Information Gain
It’s difficult to explain information gain without first discussing entropy.
Entropy is a concept that stems from information theory, which measures
the impurity of the sample values. It is defined by the following
formula:
Entropy(S) = - Σ p(c) log2 p(c), summed over the classes c in S, where:
S represents the data set for which entropy is calculated
c represents a class in set S
p(c) represents the proportion of data points that belong to class c
relative to the total number of data points in set S
For a two-class problem, entropy values fall between 0 and 1. If all samples in data set S
belong to one class, then entropy equals zero. If half of the samples are
classified as one class and the other half as the other class, entropy is
at its highest value of 1. In order to select the best feature to split on and
find the optimal decision tree, the split that produces subsets with the
smallest entropy should be used. Information gain represents the difference in
entropy before and after a split on a given attribute. The attribute with the
highest information gain produces the best split, as it does the best
job of classifying the training data according to its target classification.
Gini Impurity
Gini impurity is the probability of incorrectly classifying a random data point
in the dataset if it were labeled based on the class distribution of the
dataset. Similar to entropy, if set S is pure (i.e., all points belong to one class),
its impurity is zero. It is given by the following formula:
Gini(S) = 1 - Σ p(c)^2, summed over the classes c in S.
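A minimal Python sketch, not taken from the notes, of the entropy, Gini impurity, and information gain formulas above; the small label array used to try them out is made up for illustration.

import numpy as np

def entropy(labels):
    # p(c): proportion of points in each class c of set S
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))           # Entropy(S) = -sum p(c) log2 p(c)

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)               # Gini(S) = 1 - sum p(c)^2

def information_gain(parent_labels, left_labels, right_labels):
    # entropy before the split minus the weighted entropy after the split
    n = len(parent_labels)
    weighted = (len(left_labels) / n) * entropy(left_labels) + \
               (len(right_labels) / n) * entropy(right_labels)
    return entropy(parent_labels) - weighted

labels = np.array(["yes", "yes", "no", "no", "no", "yes"])
print(entropy(labels), gini(labels))
print(information_gain(labels, labels[:3], labels[3:]))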
Advantages and disadvantages of Decision Trees
While decision trees can be used in a variety of use cases, other
algorithms typically outperform decision tree algorithms. That said,
decision trees are particularly useful for data mining and knowledge
discovery tasks. Let’s explore the key benefits and challenges of utilizing
decision trees more below:
Advantages
- Easy to interpret: The Boolean logic and visual representations of
decision trees make them easier to understand and consume. The
hierarchical nature of a decision tree also makes it easy to see which
attributes are most important, which isn’t always clear with other
algorithms, like neural networks.
- Little to no data preparation required: Decision trees have a
number of characteristics that make them more flexible than other
classifiers. They can handle various data types (discrete or continuous
values), and continuous values can be converted into categorical values
through the use of thresholds. Additionally, they can handle data with
missing values, which can be problematic for other classifiers, like Naïve
Bayes.
- More flexible: Decision trees can be leveraged for both classification
and regression tasks, making them more flexible than some other algorithms.
They are also insensitive to underlying relationships between attributes; this
means that if two variables are highly correlated, the algorithm will only
choose one of the features to split on.
Disadvantages
- Prone to overfitting: Complex decision trees tend to overfit and do not
generalize well to new data. This scenario can be avoided through the
processes of pre-pruning or post-pruning. Pre-pruning halts tree growth
when there is insufficient data while post-pruning removes subtrees with
inadequate data after tree construction.
- High variance estimators: Small variations within data can produce a
very different decision tree. Bagging, or the averaging of estimates, can
be a method of reducing variance of decision trees. However, this
approach is limited as it can lead to highly correlated predictors.
- More costly: Given that decision trees take a greedy search approach
during construction, they can be more expensive to train compared to
other algorithms.
2. CART (Classification and Regression Trees)
CART is a predictive algorithm used in machine learning that explains
how the target variable’s values can be predicted from the other variables.
It is a decision tree in which each fork is a split on a predictor variable and
each leaf node holds a prediction for the target variable.
The term CART serves as a generic term for the following categories of
decision trees:
Classification Trees: The tree is used to determine which “class”
the target variable is most likely to fall into when the target is categorical.
Regression trees: These are used to predict a continuous
variable’s value.
In the decision tree, nodes are split into sub-nodes based on a threshold
value of an attribute. The root node contains the entire training set and is split
into two by considering the best attribute and threshold value. The resulting
subsets are then split using the same logic. This continues until pure
subsets are reached or the maximum number of leaves allowed for the
growing tree is hit.
CART Algorithm
Classification and Regression Trees (CART) is a decision tree algorithm
that is used for both classification and regression tasks. It is a supervised
learning algorithm that learns from labelled data to predict unseen data.
Tree structure: CART builds a tree-like structure consisting of
nodes and branches. The nodes represent different decision points,
and the branches represent the possible outcomes of those
decisions. The leaf nodes in the tree contain a predicted class label
or value for the target variable.
Splitting criteria: CART uses a greedy approach to split the data
at each node. It evaluates all possible splits and selects the one that
best reduces the impurity of the resulting subsets. For classification
tasks, CART uses Gini impurity as the splitting criterion; the lower
the Gini impurity, the purer the subset. For regression tasks,
CART chooses the split that gives the greatest reduction in residual
(squared) error; the greater the reduction, the better the fit of the model to the data.
Pruning: To prevent overfitting of the data, pruning is a technique
used to remove the nodes that contribute little to the model's
accuracy. Cost-complexity pruning and information-gain pruning are
two popular pruning techniques. Cost-complexity pruning removes
subtrees whose reduction in error does not justify their added
complexity. Information-gain pruning removes nodes whose
information gain is low.
How does the CART algorithm work?
The CART algorithm works via the following process:
The best-split point of each input is obtained.
Based on the best-split points of each input in Step 1, the new
“best” split point is identified.
Split the chosen input according to the “best” split point.
Continue splitting until a stopping rule is satisfied or no further
desirable splitting is available.
The CART algorithm uses Gini impurity to split the dataset into a decision
tree. It does this by searching for the split that gives the best homogeneity
of the sub-nodes, with the help of the Gini index criterion.
Gini index/Gini impurity: as defined earlier; the formula given above
applies here as well.
The Gini index is a metric for classification tasks in CART. It is computed
from the sum of squared class probabilities (Gini = 1 - Σ p(c)^2) and
measures the probability that a randomly chosen element would be wrongly
classified; it is a variation of the Gini coefficient. It works on
categorical variables, treats each split as a binary “success/failure” outcome,
and hence performs binary splitting only.
The degree of the Gini index varies from 0 to 1:
A value of 0 indicates that all the elements belong to a single class, or
that only one class exists.
A Gini index close to 1 indicates a high level of impurity, where the
elements are spread over many classes and each class contains only a
small fraction of the elements.
The maximum value 1 - 1/n occurs when the elements are uniformly distributed
over n classes and each class has an equal probability of 1/n. For
example, with two classes, the maximum Gini impurity is 1 - 1/2 = 0.5.
In conclusion, Gini impurity is the probability of misclassification,
assuming independent selection of the element and its class based on the
class probabilities.
CART for Classification
A classification tree is an algorithm where the target variable is
categorical. The algorithm is then used to identify the “class” within which
the target variable is most likely to fall. Classification trees are used when
the dataset needs to be split into classes that belong to the response
variable (like yes or no).
CART for classification is a decision tree learning algorithm that creates a tree-
like structure to predict class labels. The tree consists of nodes, which
represent different decision points, and branches, which represent the
possible results of those decisions. Predicted class labels are present at
each leaf node of the tree.
How Does CART for Classification Work?
CART for classification works by recursively splitting the training data into
smaller and smaller subsets based on certain criteria. The goal is to split
the data in a way that minimizes the impurity within each subset. Impurity
is a measure of how mixed up the data is in a particular subset. For
classification tasks, CART uses Gini impurity
Gini Impurity- Gini impurity measures the probability of
misclassifying a random instance from a subset labeled according to
the majority class. Lower Gini impurity means more purity of the
subset.
Splitting Criteria- The CART algorithm evaluates all potential splits
at every node and chooses the one that best decreases the Gini
impurity of the resultant subsets. This process continues until a
stopping criterion is reached, like a maximum tree depth or a
minimum number of instances in a leaf node.
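A minimal sketch of CART-style classification using scikit-learn's DecisionTreeClassifier with the Gini criterion; the toy feature matrix X (age, income) and labels y below are illustrative assumptions, not data from these notes.

from sklearn.tree import DecisionTreeClassifier

X = [[25, 40000], [35, 60000], [45, 80000], [20, 20000], [52, 110000], [23, 25000]]
y = ["no", "yes", "yes", "no", "yes", "no"]

clf = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=1)
clf.fit(X, y)                       # recursively splits on the threshold that most reduces Gini impurity
print(clf.predict([[30, 50000]]))   # predicted class label at the reached leaf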
CART for Regression
A regression tree is an algorithm where the target variable is continuous
and the tree is used to predict its value. Regression trees are used when
the response variable is continuous, for example, when the response variable
is the temperature of the day.
CART for regression is a decision tree learning method that creates a tree-
like structure to predict continuous target variables. The tree consists of
nodes that represent different decision points and branches that represent
the possible outcomes of those decisions. Predicted values for the target
variable are stored in each leaf node of the tree.
How does CART work for Regression?
Regression CART works by splitting the training data recursively into
smaller subsets based on specific criteria. The objective is to split the data
in a way that minimizes the residual error within each subset.
Residual Reduction- Residual reduction measures how much
the average squared difference between the predicted values and
the actual values for the target variable decreases when the
subset is split. The greater the residual reduction, the better the model fits
the data.
Splitting Criteria- CART evaluates every possible split at each
node and selects the one that results in the greatest reduction of
residual error in the resulting subsets. This process is repeated until
a stopping criterion is met, such as reaching the maximum tree
depth or having too few instances in a leaf node.
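A minimal sketch of CART-style regression using scikit-learn's DecisionTreeRegressor, which grows the tree by choosing splits that most reduce squared error; the X values (hour of day) and y values (temperature) are made up for illustration.

from sklearn.tree import DecisionTreeRegressor

X = [[6], [9], [12], [15], [18], [21]]
y = [14.0, 18.5, 24.0, 26.5, 22.0, 17.5]

reg = DecisionTreeRegressor(max_depth=2, min_samples_leaf=1)
reg.fit(X, y)                 # each leaf stores the mean target value of its subset
print(reg.predict([[13]]))    # predicted continuous value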
https://www.ibm.com/topics/decision-trees
https://www.geeksforgeeks.org/decision-tree-introduction-example/
https://www.geeksforgeeks.org/cart-classification-and-regression-tree-in-machine-learning/
3. Ensemble Learning
Ensemble methods combine several base models (often tree-based) to achieve
better predictive performance than a single base model. The main
principle behind the ensemble model is that a group of weak learners
come together to form a strong learner, thus increasing the accuracy of
the model. When we try to predict the target variable using any machine
learning technique, the main causes of difference between actual and predicted
values are noise, variance, and bias. Ensembles help to reduce these
factors (except noise, which is irreducible error).
There are various ensemble approaches, and each combines models in a
different way. The most typical kinds consist of:
1. Bagging, also known as bootstrap aggregating
Bagging is the process of training several models concurrently, each on a random subset
of the data sampled with replacement, and then averaging (or voting on) the predictions
from each model. Random Forest is a well-known example of bagging.
https://www.analyticsvidhya.com/blog/2023/01/ensemble-learning-methods-bagging-boosting-and-stacking/
Steps:
These are the steps involved in bagging:
An initial training dataset with n instances is available to us.
We create m subsets (bootstrap samples) of the training set.
For every subset, we draw N sample points from the original dataset.
Sampling is done with replacement, which means the same data point can
appear more than once in a subset.
We train one weak learner independently on each subset of the data. Since
these models are all of the same type, they are homogeneous.
Every model makes its own prediction.
A single prediction is created by combining all of the individual predictions,
using either majority voting or averaging.
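A minimal sketch of bagging with scikit-learn's BaggingClassifier, whose default base learner is a decision tree; each tree is trained on a bootstrap sample drawn with replacement and the predictions are combined by voting. The synthetic dataset from make_classification is only an illustrative assumption.

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

bagger = BaggingClassifier(n_estimators=10, random_state=0)
bagger.fit(X, y)               # trains 10 trees on 10 bootstrap samples
print(bagger.predict(X[:3]))   # combined (voted) predictions of the 10 trees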
2. Boosting
Boosting involves successively training models so that each one tries to
fix the mistakes of the one before it. Predictions are derived from the
weighted sum of the models, which are ranked according to how accurate
they are. AdaBoost, gradient boosting, and XGBoost are a few examples.
Boosting involves sequentially training weak learners. Here, each
subsequent learner improves the errors of previous learners in the
sequence. A sample of data is first taken from the initial dataset.
This sample is used to train the first model, and the model makes its
prediction. The samples can either be correctly or incorrectly predicted.
The samples that are wrongly predicted are reused for training the next
model. In this way, subsequent models can improve on the errors of
previous models.
Unlike bagging, which aggregates prediction results at the end, boosting
aggregates the results at each step. They are aggregated using weighted
averaging.
With weighted averaging, each model is assigned a different weight
based on its predictive performance. Stated differently, greater weight is
given to the model with the highest predictive power, because the learner
deemed most significant is the one that predicts best.
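A minimal sketch of boosting with scikit-learn's AdaBoostClassifier: weak learners (shallow trees by default) are trained sequentially, misclassified samples are given more weight for the next learner, and predictions are a weighted vote. The synthetic dataset is illustrative only.

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

booster = AdaBoostClassifier(n_estimators=50, random_state=0)
booster.fit(X, y)               # each new weak learner focuses on previous errors
print(booster.predict(X[:3]))   # weighted vote of the sequence of weak learners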
3. Stacking (stacked generalization)
Stacking, also known as stacked generalization, is the process of training
several models on the same set of data, followed by the training of a
meta-model to produce a final prediction that is based on the earlier
models’ predictions.
Here are the steps involved in stacking:
We train m different algorithms on the original training set.
Every algorithm’s output is used to generate a fresh training set.
We train a meta-model algorithm on this fresh training set.
Our final prediction is based on the meta-model’s output; weighted
averaging is used to aggregate the outcomes.
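A minimal sketch of stacking with scikit-learn's StackingClassifier: two base models are trained on the data and a logistic-regression meta-model is trained on their cross-validated predictions. The dataset and the choice of base models are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import StackingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

stack = StackingClassifier(
    estimators=[("tree", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    final_estimator=LogisticRegression(),   # meta-model trained on base-model outputs
)
stack.fit(X, y)
print(stack.predict(X[:3]))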
4. Voting
Voting: To arrive at a final prediction, a majority vote (in classification
problems) or an average (in regression problems) is used to aggregate the
predictions of multiple independently trained models. This technique
makes use of the differences between the models to enhance
performance overall.
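A minimal sketch of majority voting with scikit-learn's VotingClassifier: independently trained models vote on the class ('hard' voting). The dataset and the three chosen models are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

voter = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("tree", DecisionTreeClassifier(max_depth=3)),
                ("knn", KNeighborsClassifier(n_neighbors=5))],
    voting="hard",              # each model casts one vote; majority class wins
)
voter.fit(X, y)
print(voter.predict(X[:3]))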
When to use Bagging vs Boosting vs Stacking?
https://medium.com/@sumbatilinda/ensemble-learning-in-machine-learning-bagging-boosting-and-stacking-a00c6bae971f
4. Probability and Learning
Probability plays a central role in machine learning, providing a framework for reasoning under
uncertainty. Many machine learning algorithms, especially in probabilistic models, rely on
incorporating probabilities to make predictions, infer hidden variables, or learn patterns from
data.
Key Concepts in Probability and Learning:
1. Probability Distribution:
o Represents the likelihood of different outcomes.
o Discrete Distribution: Outcomes are distinct (e.g., rolling a die).
o Continuous Distribution: Outcomes are within a range (e.g., height of individuals).
2. Conditional Probability:
o The probability of an event A given that B has occurred:
P(A | B) = P(A ∩ B) / P(B)
3. Bayes' Theorem:
o A key formula in probabilistic learning:
P(A | B) = P(B | A) · P(A) / P(B)
o Often used in models like Naïve Bayes.
4. Maximum Likelihood Estimation (MLE):
o A method to estimate model parameters by maximizing the likelihood of observed
data.
5. Expectation-Maximization (EM):
o An iterative optimization technique for probabilistic models involving latent variables
(e.g., Gaussian Mixture Models).
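A small worked example of Bayes' theorem with made-up numbers (not from the notes): a test that detects a condition 90% of the time, gives 5% false positives, and a prior probability of 0.02 for the condition.

p_A = 0.02                 # prior P(A)
p_B_given_A = 0.90         # likelihood P(B | A)
p_B_given_notA = 0.05      # false-positive rate P(B | not A)

# total probability of observing B
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

# Bayes' theorem: P(A | B) = P(B | A) * P(A) / P(B)
p_A_given_B = p_B_given_A * p_A / p_B
print(round(p_A_given_B, 3))   # roughly 0.269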
Data into Probabilities
In probabilistic learning, data is transformed into probabilities to model uncertainty, patterns, and
structure. Key steps include:
1. Fitting a Probability Distribution:
o Estimating parameters (mean, variance) of a distribution from data.
o Example: Fitting a normal distribution to model heights.
2. Probability Density Functions (PDFs):
o For continuous variables, the PDF describes the likelihood of a value within a range.
o Example: The bell curve for Gaussian distributions.
3. Discretization:
o Continuous data can be converted into discrete categories for modeling.
4. Learning Probabilistic Models:
o Supervised Learning: Models like Logistic Regression estimate probabilities of
classes.
o Unsupervised Learning: Models like Gaussian Mixture Models cluster data based on
likelihoods.
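A minimal sketch of turning data into probabilities by fitting a normal distribution via maximum likelihood (sample mean and standard deviation) and evaluating its probability density function; the height values are illustrative assumptions.

import numpy as np

heights = np.array([160.0, 165.0, 170.0, 172.0, 175.0, 180.0, 168.0, 171.0])

mu = heights.mean()            # MLE of the mean
sigma = heights.std()          # MLE of the standard deviation

def gaussian_pdf(x, mu, sigma):
    # PDF of the normal distribution N(mu, sigma^2)
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

print(mu, sigma)
print(gaussian_pdf(173.0, mu, sigma))   # relative likelihood of a height of 173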
Basic Statistics in Machine Learning
Statistics provide tools to summarize and analyze data, forming the foundation of probabilistic
reasoning.
1. Descriptive Statistics:
o Mean: Average of the data.
o Median: Middle value in sorted data.
o Variance and Standard Deviation: Measure of data spread.
2. Inferential Statistics:
o Drawing conclusions about a population from a sample.
o Hypothesis Testing: Determining whether observed results are statistically
significant.
3. Correlation and Covariance:
o Correlation measures the strength of a linear relationship between two variables.
o Covariance indicates how two variables change together.
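A minimal NumPy sketch of the descriptive statistics and the correlation/covariance measures listed above; the two small arrays are made-up values for illustration.

import numpy as np

x = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])
y = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 6.0, 8.0, 9.0])

print(np.mean(x), np.median(x))       # mean and median
print(np.var(x), np.std(x))           # variance and standard deviation
print(np.cov(x, y)[0, 1])             # covariance of x and y
print(np.corrcoef(x, y)[0, 1])        # Pearson correlation coefficient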
5. Gaussian Mixture Models
Suppose there are a set of data points that need to be grouped into several parts or clusters based
on their similarity. In Machine Learning, this is known as Clustering. There are several methods
available for clustering:
K Means Clustering
Hierarchical Clustering
Gaussian Mixture Models
A Gaussian mixture model is a soft clustering technique used in unsupervised learning to determine
the probability that a given data point belongs to a cluster. It is composed of several Gaussians, each
identified by k ∈ {1,…, K}, where K is the number of clusters in a data set.
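A minimal sketch of soft clustering with scikit-learn's GaussianMixture: K = 2 Gaussians are fitted with EM, and predict_proba gives the probability that each point belongs to each cluster. The 2-D points are illustrative assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.2], [0.8, 1.0], [1.1, 0.9],
              [5.0, 5.2], [5.1, 4.8], [4.9, 5.0]])

gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(X)                    # EM estimates the means, covariances, and mixing weights
print(gmm.predict_proba(X))   # soft cluster memberships for each point
print(gmm.means_)             # estimated Gaussian means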
K-Nearest Neighbour (KNN) Algorithm for Machine Learning
o K-Nearest Neighbour is one of the simplest Machine Learning algorithms based on
Supervised Learning technique.
o K-NN algorithm assumes similarity between the new case/data and the available
cases and puts the new case into the category that is most similar to the available
categories.
o K-NN algorithm stores all the available data and classifies a new data point based on
similarity. This means that when new data appears, it can easily be classified into
a well-suited category using the K-NN algorithm.
o K-NN algorithm can be used for regression as well as for classification, but it is mostly
used for classification problems.
o K-NN is a non-parametric algorithm, which means it does not make any assumption
about the underlying data.
o It is also called a lazy learner algorithm because it does not learn from the training
set immediately; instead, it stores the dataset and, at the time of classification,
performs an action on the dataset.
Example: Suppose we have an image of a creature that looks similar to both a cat and a dog, and we
want to know whether it is a cat or a dog. For this identification, we can use the KNN algorithm, as it
works on a similarity measure. Our KNN model will find the features of the new image that are most
similar to those of the cat and dog images and, based on the most similar features, will put it in either
the cat or the dog category.
Suppose there are two categories, Category A and Category B, and we have a new data point
x1; we need to decide which of these categories this data point belongs to. To solve this type of problem,
we need a K-NN algorithm. With the help of K-NN, we can easily identify the category or class of a
particular data point.
Distance Metrics Used in KNN Algorithm
As we know, the KNN algorithm helps us identify the nearest points or the groups for
a query point. But to determine the closest groups or the nearest points for a query point,
we need some metric. For this purpose, we use the distance metrics below:
Euclidean Distance
This is nothing but the Cartesian distance between two points in the
plane/hyperplane. Euclidean distance can also be visualized as the length of the straight
line joining the two points under consideration. This metric helps us calculate
the net displacement between two states of an object.
Manhattan Distance
Manhattan Distance metric is generally used when we are interested in the total distance
traveled by the object instead of the displacement. This metric is calculated by summing
the absolute difference between the coordinates of the points in n-dimensions.
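A minimal sketch of the two distance metrics described above, written with NumPy; the example points are arbitrary.

import numpy as np

def euclidean_distance(a, b):
    # straight-line (L2) distance between two points
    return np.sqrt(np.sum((np.asarray(a) - np.asarray(b)) ** 2))

def manhattan_distance(a, b):
    # sum of absolute coordinate differences (L1 distance)
    return np.sum(np.abs(np.asarray(a) - np.asarray(b)))

print(euclidean_distance([0, 0], [3, 4]))   # 5.0
print(manhattan_distance([0, 0], [3, 4]))   # 7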
How does K-NN work?
The K-NN working can be explained on the basis of the below algorithm:
o Step-1: Select the number K of the neighbors
o Step-2: Calculate the Euclidean distance of K number of neighbors
o Step-3: Take the K nearest neighbors as per the calculated Euclidean distance.
o Step-4: Among these k neighbors, count the number of the data points in each
category.
o Step-5: Assign the new data points to that category for which the number of the
neighbor is maximum.
o Step-6: Our model is ready.
Suppose we have a new data point and we need to put it in the required category.
o Firstly, we will choose the number of neighbors, so we will choose k = 5.
o Next, we will calculate the Euclidean distance between the data points. The
Euclidean distance is the distance between two points, which we have already
studied in geometry. For two points (x1, y1) and (x2, y2) it is calculated as
sqrt((x2 - x1)^2 + (y2 - y1)^2).
o By calculating the Euclidean distances we get the nearest neighbors: three nearest
neighbors in category A and two nearest neighbors in category B.
o As 3 of the 5 nearest neighbors are from category A, this new data
point must belong to category A.
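A minimal sketch of the K-NN procedure above using scikit-learn's KNeighborsClassifier with k = 5 and Euclidean distance. The 2-D points for categories A and B are illustrative stand-ins, not the points from the original worked example.

from sklearn.neighbors import KNeighborsClassifier

X = [[1, 2], [2, 3], [3, 3], [6, 5], [7, 7], [8, 6], [2, 1], [7, 8]]
y = ["A", "A", "A", "B", "B", "B", "A", "B"]

knn = KNeighborsClassifier(n_neighbors=5, metric="euclidean")
knn.fit(X, y)                   # lazy learner: this step just stores the data
print(knn.predict([[3, 4]]))    # majority class among the 5 nearest points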
How to select the value of K in the K-NN Algorithm?
Below are some points to remember while selecting the value of K in the K-NN
algorithm:
o There is no particular way to determine the best value for "K", so we need to try
several values to find the best one. A commonly preferred value for K is 5.
o A very low value for K, such as K=1 or K=2, can be noisy and sensitive to
outliers in the model.
o Larger values for K reduce the effect of noise, but very large values can blur the
class boundaries and increase computation.
Advantages of KNN Algorithm:
o It is simple to implement.
o It is robust to the noisy training data
o It can be more effective if the training data is large.
Disadvantages of KNN Algorithm:
o We always need to determine the value of K, which can sometimes be complex.
o The computation cost is high because the distance between the new point and all
the training samples must be calculated.
What is K-Means Clustering?
K-means clustering is a way of grouping data based on how similar or close the data
points are to each other. Imagine you have a bunch of points, and you want to group
them into clusters. The algorithm works by first randomly picking some central points
(called centroids) and then assigning every data point to the nearest centroid.
Once that’s done, it recalculates the centroids based on the new groupings and repeats
the process until the clusters make sense. It’s a pretty fast and efficient method, but it
works best when the clusters are distinct and not too mixed up.
Objective of K-Means Clustering
K-Means clustering primarily aims to organize similar data points into distinct groups.
Here’s a look at its key objectives:
Grouping Similar Data Points
K-Means is designed to cluster data points that share common traits, allowing patterns
or trends to emerge. Whether analyzing customer behavior or images, the method helps
reveal hidden relationships within your dataset.
Minimizing Within-Cluster Distance
Another objective is to keep data points in each group as close to the cluster's centroid
as possible. Reducing this internal distance results in compact, cohesive clusters,
enhancing the accuracy of your results.
Maximizing Between-Cluster Distance
K-Means also aims to maintain clear separation between different clusters. By
maximizing the distance between groups, the algorithm ensures that each cluster
remains distinct, providing a better understanding of data categories without overlap.
Properties of K-Means Clustering
Now, let’s look at the key properties that make K-means clustering algorithm effective in
creating meaningful groups:
Similarity Within a Cluster
One of the main things K Means aims for is that all the data points in a cluster should be
pretty similar to each other. Imagine a bank that wants to group its customers based on
income and debt. If customers within the same cluster have vastly different financial
situations, then a one-size-fits-all approach to offers might not work. For example, a
customer with high income and high debt might have different needs compared to
someone with low income and low debt. By making sure the customers in each cluster
are similar, the bank can create more tailored and effective strategies.
Differences Between Clusters
Another important aspect is that the clusters themselves should be as distinct from each
other as possible. Going back to our bank example, if one cluster consists of high-
income, high-debt customers and another cluster has high-income, low-debt customers,
the differences between the clusters are clear. This separation helps the bank create
different strategies for each group. If the clusters are too similar, it can be challenging to
treat them as separate segments, which can make targeted marketing less effective.
How does k-means clustering work?
We are given a data set of items, with certain features, and values for these features (like
a vector). The task is to categorize those items into groups. To achieve this, we will use
the K-means algorithm, an unsupervised learning algorithm. ‘K’ in the name of the
algorithm represents the number of groups/clusters we want to classify our items into.
(It will help if you think of items as points in an n-dimensional space). The algorithm will
categorize the items into k groups or clusters of similarity. To calculate that similarity, we
will use the Euclidean distance as a measurement.
The algorithm works as follows:
1. First, we randomly initialize k points, called means or cluster centroids.
2. We categorize each item to its closest mean, and we update the mean’s coordinates,
which are the averages of the items categorized in that cluster so far.
3. We repeat the process for a given number of iterations and at the end, we have our
clusters.
The “points” mentioned above are called means because they are the mean values of
the items categorized in them.
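A minimal sketch of the algorithm above using scikit-learn's KMeans with k = 2; the 2-D points are made up for illustration.

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[1.0, 2.0], [1.5, 1.8], [1.2, 2.2],
              [8.0, 8.0], [8.5, 7.5], [7.8, 8.3]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)                   # alternates assignment and centroid updates until convergence
print(kmeans.labels_)           # cluster index assigned to each point
print(kmeans.cluster_centers_)  # final centroids (means of each cluster)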
Applications of Clustering in Real-World Scenarios
Clustering is a widely used technique in the industry. It is being used in almost every
domain, from banking and recommendation engines to document clustering and image
segmentation.
Customer Segmentation
We covered this earlier – one of the most common applications of clustering is customer
segmentation. And it isn’t just limited to banking. This strategy is across functions,
including telecom, e-commerce, sports, advertising, sales, etc.
Document Clustering
This is another common application of clustering. Let’s say you have multiple documents
and you need to cluster similar documents together. Clustering helps us group these
documents such that similar documents are in the same clusters.
Image Segmentation
We can also use clustering to perform image segmentation. Here, we try to club similar
pixels in the image together. We can apply clustering to create clusters having similar
pixels in the same group.
Recommendation Engines
Clustering can also be used in recommendation engines. Let’s say you want to
recommend songs to your friends. You can look at the songs liked by that person and
then use clustering to find similar songs and finally recommend the most similar songs.
There are many more applications of clustering beyond these. Next, let’s look at how to
apply the K-Means algorithm step by step.
How to Apply K-Means Clustering Algorithm?
Let’s now take an example to understand how K-Means actually works:
Suppose we have 8 points, and we want to apply k-means to create clusters for these
points. Here’s how we can do it.
1. Choose the number of clusters k
The first step in k-means is to pick the number of clusters, k.
2. Select k random points from the data as centroids
Next, we randomly select the centroid for each cluster. Let’s say we want to have 2
clusters, so k is equal to 2 here. We then randomly select the centroid:
Here, the red and green circles represent the centroid for these clusters.
3. Assign all the points to the closest cluster centroid
Once we have initialized the centroids, we assign each point to the closest cluster
centroid:
Here you can see that the points closer to the red point are assigned to the red cluster,
whereas the points closer to the green point are assigned to the green cluster.
4. Recompute the centroids of newly formed clusters
Now, once we have assigned all of the points to either cluster, the next step is to
compute the centroids of newly formed clusters:
Here, the red and green crosses are the new centroids.
5. Repeat steps 3 and 4
We then repeat steps 3 and 4:
The step of computing the centroid and assigning all the points to the cluster based on
their distance from the centroid is a single iteration.
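A minimal NumPy sketch of steps 1 to 5 above: pick k points as centroids, assign every point to the nearest centroid, recompute the centroids, and repeat. The 8 two-dimensional points are illustrative stand-ins for the points in the example.

import numpy as np

X = np.array([[1.0, 1.0], [1.5, 2.0], [2.0, 1.5], [1.2, 1.8],
              [8.0, 8.0], [8.5, 7.5], [7.5, 8.5], [8.2, 8.2]])
k = 2

rng = np.random.default_rng(0)
centroids = X[rng.choice(len(X), size=k, replace=False)]   # step 2: random initial centroids

for _ in range(10):                                        # step 5: repeat steps 3 and 4
    # step 3: assign each point to the closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # step 4: recompute each centroid as the mean of its assigned points
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

print(labels)       # cluster assignment of each point
print(centroids)    # final cluster means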
https://www.analyticsvidhya.com/blog/2019/08/comprehensive-guide-k-means-clustering/
https://www.simplilearn.com/tutorials/machine-learning-tutorial/k-means-clustering-algorithm
Self Organizing Feature Map
Self Organizing Maps (SOM), or Kohonen’s map, is a type of artificial neural network
introduced by Teuvo Kohonen in the 1980s.
A SOM is an unsupervised learning algorithm that produces a low-dimensional
(typically two-dimensional), discretized representation of the input space of the training
samples, called a map. It differs from other ANNs in that it applies competitive learning
rather than error-correction learning (like backpropagation with gradient descent). It uses
a neighborhood function to preserve the topological properties of the input space, reduces
the data by creating a spatially organized representation, and also helps to discover
correlations in the data.
Although SOM has been initially proposed for data visualization, it has been applied to
different problems, including a solution to the Traveling Salesman Problem (TSP).
SOM is used for clustering and mapping (or dimensionality reduction) to project
multidimensional data onto a lower-dimensional space, which makes complex
problems easier to interpret. SOM has two layers: one is the input layer and the
other one is the output layer.
The architecture of the Self Organizing Map with two clusters and n input features of any
sample is given below:
How does SOM work?
Consider input data of size (m, n), where m is the number of training examples and n is
the number of features in each example. First, the algorithm initializes the weights of size (n, C),
where C is the number of clusters. Then, iterating over the input data, for each training
example, it updates the winning vector (the weight vector with the shortest distance, e.g.
Euclidean distance, from the training example). The weight update rule is given by:
wij = wij(old) + alpha(t) * (xik - wij(old))
where alpha is a learning rate at time t, j denotes the winning vector, i denotes the
ith feature of training example and k denotes the kth training example from the input data.
After training the SOM network, trained weights are used for clustering new examples. A
new example falls in the cluster of winning vectors.
What really happens in SOM?
Each data point in the data set competes to get recognition for representation. SOM
mapping steps start from initializing the weight vectors. From there, a sample vector is
selected randomly, and the map of the weight vectors is searched to find the weight
which can best represent that sample. Each weight vector maintains neighboring weights
that are close to it. The weight that is selected is rewarded by being able to become
more like that randomly selected sample vector. The neighbors of that weight are also
considered by being able to become more like the selected sample vector. This allows
the map to form different shapes. Most generally, they have
square/rectangular/hexagonal/L shapes in the 2D feature space.
Algorithm
Training:
Step 1: Initialize the weights wij (small random values may be used). Initialize the learning
rate α.
Step 2: Calculate the squared Euclidean distance for each output unit j:
D(j) = Σ (wij – xi)^2, where i = 1 to n and j = 1 to m
Step 3: Find the index J for which D(j) is minimum; this is the winning unit.
Step 4: For each unit j within a specified neighborhood of J, and for all i, calculate the new
weight:
wij(new) = wij(old) + α[xi – wij(old)]
Step 5: Update the learning rate using:
α(t+1) = 0.5 * α(t)
Step 6: Test the stopping condition.
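A minimal NumPy sketch of the training loop above, simplified to update only the winning unit (a neighborhood of size zero) and to halve the learning rate each epoch; the data points and the number of clusters C are illustrative assumptions.

import numpy as np

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],   # group near the origin
              [0.9, 0.8], [0.85, 0.95], [0.95, 0.9]]) # group near (1, 1)
n_features, C = X.shape[1], 2

rng = np.random.default_rng(0)
W = rng.random((C, n_features))      # Step 1: random weights, one row per output unit
alpha = 0.5                          # initial learning rate

for epoch in range(20):              # Step 6: stop after a fixed number of epochs
    for x in X:
        d = np.sum((W - x) ** 2, axis=1)        # Step 2: squared Euclidean distances
        j = np.argmin(d)                        # Step 3: winning unit J
        W[j] += alpha * (x - W[j])              # Step 4: w(new) = w(old) + alpha (x - w(old))
    alpha *= 0.5                                # Step 5: decay the learning rate

print(W)                             # trained weight vectors (cluster prototypes)
print([int(np.argmin(np.sum((W - x) ** 2, axis=1))) for x in X])  # cluster of each point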
Applications
Dimensionality reduction and data visualization: In terms of dimensionality
reduction, Principal Component Analysis (PCA) is actually one of the most popular
tools and has been broadly used. Compared to PCA, SOM has an advantage in
maintaining the topological (structural) information of the training data and is not
inherently linear. Using PCA on high-dimensional data may cause a loss of data when
the dimension is reduced to two. If the target data has a lot of dimensions and every
dimension is equally essential, SOM can be very useful over PCA.
Seismic facies analysis for oil and gas exploration: Seismic facies analysis refers to
the interpretation of facies type from the seismic reflector information. It generates
groups based on the identification of different individual features.
These methods find an organization in the dataset and form organized relational
clusters. However, these clusters may or may not have any physical analogs. A
calibration method to relate SOM clusters to physical reality is needed. This
calibration method must define the mapping between the groups and the measured
physical properties; it should also provide an estimate of the validity of the
relationships.
Text Clustering: Text clustering is an unsupervised learning process, i.e., not
dependent on the prior knowledge of data, and based solely on the similarity
relationship between documents in the collection to separate the document
collection into some clusters
Learning Vector Quantization (LVQ) is a type of Artificial Neural
Network that is also inspired by biological models of neural systems. It is a
prototype-based supervised classification algorithm that trains its network through a
competitive learning algorithm similar to the Self Organizing Map. It can also deal with
multiclass classification problems. LVQ has two layers: one is the Input layer and the other
one is the Output layer, with one output unit per class in the input data and n input
features for any sample.
How does Learning Vector Quantization work?
Consider input data of size (m, n), where m is the number of training examples
and n is the number of features in each example, together with a label vector of size (m, 1).
First, the algorithm initializes the weights of size (n, c) from the first c training samples that
have distinct labels; these c samples are then removed from the training set. Here, c is the
number of classes. Then, iterating over the remaining input data, for each training example, it
updates the winning vector (the weight vector with the shortest distance, e.g. Euclidean
distance, from the training example).
The weight updation rule is given by:
if correctly_classified:
wij(new) = wij(old) + alpha(t) * (xik - wij(old))
else:
wij(new) = wij(old) - alpha(t) * (xik - wij(old))
where alpha is a learning rate at time t, j denotes the winning vector, i denotes the
ith feature of training example and k denotes the kth training example from the input data.
After training the LVQ network, trained weights are used for classifying new examples. A
new example is labelled with the class of the winning vector.
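A minimal NumPy sketch of the LVQ update rule above: prototypes are initialized from the first example of each class (then excluded from training), and each step moves the winning prototype towards a correctly classified example or away from a misclassified one. The 2-D data, labels, and learning-rate schedule are illustrative assumptions.

import numpy as np

X = np.array([[0.1, 0.2], [0.2, 0.1], [0.15, 0.25],
              [0.9, 0.8], [0.85, 0.95], [0.95, 0.9]])
y = np.array([0, 0, 0, 1, 1, 1])
classes = np.unique(y)

# one prototype per class, taken from the first sample of that class
proto_idx = [np.where(y == c)[0][0] for c in classes]
W = X[proto_idx].astype(float)
proto_labels = y[proto_idx]
mask = np.ones(len(X), dtype=bool)
mask[proto_idx] = False
X_train, y_train = X[mask], y[mask]          # remaining samples are used for training

alpha = 0.3
for epoch in range(20):
    for x, label in zip(X_train, y_train):
        j = np.argmin(np.sum((W - x) ** 2, axis=1))    # winning prototype
        if proto_labels[j] == label:                   # correctly classified: move towards x
            W[j] += alpha * (x - W[j])
        else:                                          # incorrectly classified: move away from x
            W[j] -= alpha * (x - W[j])
    alpha *= 0.9                                       # slowly decay the learning rate

# classify a new example with the label of the nearest prototype
x_new = np.array([0.2, 0.2])
print(proto_labels[np.argmin(np.sum((W - x_new) ** 2, axis=1))])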
Difference Between Self organizing Map and Learning Vector Quantization
Self-organizing in networks is one of the most fascinating topics in the neural network
field. Such networks can learn to detect regularities and correlations in their input and
adapt their future responses to that input accordingly. The neurons of competitive
networks learn to recognize groups of similar input vectors. Self-organizing maps learn to
recognize groups of similar input vectors in such a way that neurons physically near each
other in the neuron layer respond to similar input vectors. Self-organizing maps do not
have target vectors, since their purpose is to divide the input vectors into clusters of
similar vectors. There is no desired output for these types of networks.
Learning vector quantization (LVQ) is a method for training competitive layers in a
supervised manner (with target outputs). A competitive layer automatically learns to
classify input vectors. However, the classes that the competitive layer finds are
dependent only on the distance between input vectors. If two input vectors are very
similar, the competitive layer probably will put them in the same class. There is no
mechanism in a strictly competitive layer design to say whether or not any two input
vectors are in the same class or different classes.
LVQ networks, on the other hand, learn to classify input vectors into target classes
chosen by the user.