
Data Mining and Business Intelligence

ITC 601
INFORMATION TECHNOLOGY
SEMESTER - VI (CBCGS R 2019)

MODULE 4
CLUSTERING AND OUTLIER DETECTION

Syllabus Topics (08 Hours)

Topic                                                          Hrs   CO
4.1  Cluster Analysis: Basic Concepts                          02    CO4, CO5
     Partitioning Methods: K-Means, K-Medoids
4.2  Hierarchical Methods: Agglomerative, Divisive, BIRCH      02    CO4, CO5
4.3  Density-Based Methods: DBSCAN                             02    CO4, CO5
4.4  What are outliers? Types, Challenges                      02    CO4, CO5
     Outlier Detection Methods: Supervised, Semi-Supervised,
     Unsupervised, Proximity-based, Clustering-based

Reference: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition.

Supervised vs. Unsupervised Learning

n Supervised learning (classification)


n Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
n New data is classified based on the training set
n Unsupervised learning (clustering)
n The class labels of the training data are unknown
n Given a set of measurements, observations, etc. with the aim
of establishing the existence of classes or clusters in the data

What is Cluster Analysis?
n Cluster: A collection of data objects
n similar (or related) to one another within the same group
n dissimilar (or unrelated) to the objects in other groups
n Cluster analysis (or clustering, data segmentation, …)
n Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
n Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
n Typical applications
n As a stand-alone tool to get insight into data distribution
n As a preprocessing step for other algorithms
Clustering for Data Understanding and
Applications
n Biology: taxonomy of living things: kingdom, phylum, class, order, family,
genus and species
n Information retrieval: document clustering
n Land use: Identification of areas of similar land use in an earth observation
database
n Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
n City-planning: Identifying groups of houses according to their house type,
value, and geographical location
n Earth-quake studies: Observed earth quake epicenters should be clustered
along continent faults
n Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
n Economic science: market research
Clustering as a Preprocessing Tool (Utility)

n Summarization:
n Preprocessing for regression, PCA, classification, and
association analysis
n Compression:
n Image processing: vector quantization
n Finding K-nearest Neighbors
n Localizing search to one or a small number of clusters
n Outlier detection
n Outliers are often viewed as those “far away” from any cluster

Quality: What Is Good Clustering?

n A good clustering method will produce high quality clusters


n high intra-class similarity: cohesive within clusters
n low inter-class similarity: distinctive between clusters

n The quality of a clustering method depends on


n the similarity measure used by the method
n its implementation, and
n its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering
n Dissimilarity/Similarity metric
n Similarity is expressed in terms of a distance function,
typically metric: d(i,j)
n The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal, ratio,
and vector variables
n Weights should be associated with different variables based
on applications and data semantics

n Quality of clustering
n There is usually a separate “quality” function that measures
the “goodness” of a cluster.
n It is hard to define “similar enough” or “good enough”

Considerations for Cluster Analysis
n Partitioning criteria
n Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
n Separation of clusters
n Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive
(e.g., one document may belong to more than one class)
n Similarity measure
n Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based
(e.g., density or contiguity)
n Clustering space
n Full space (often when low dimensional) vs. subspaces (often in high-
dimensional clustering)

Requirements and Challenges
n Scalability
n Clustering all the data instead of only on samples
n Ability to deal with different types of attributes
n Numerical, binary, categorical, ordinal, linked, and mixture of these
n Constraint-based clustering
n User may give inputs on constraints
n Use domain knowledge to determine input parameters
n Interpretability and usability
n Others
n Discovery of clusters with arbitrary shape
n Ability to deal with noisy data
n Incremental clustering and insensitivity to input order
n High dimensionality

Major Clustering Approaches (I)
n Partitioning approach
n Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
n Typical methods: k-means, k-medoids, CLARANS
n Hierarchical approach
n Create a hierarchical decomposition of the set of data (or objects)
using some criterion
n Typical methods: Diana, Agnes, BIRCH, CAMELEON
n Density-based approach
n Based on connectivity and density functions
n Typical methods: DBSCAN, OPTICS, DenClue
n Grid-based approach
n Based on a multiple-level granularity structure
n Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)
n Model-based methods
n A model is hypothesized for each of the clusters, and the method
tries to find the best fit of the data to the given model
n Typical methods: EM, SOM, COBWEB
n Frequent pattern-based
n Based on the analysis of frequent patterns
n Typical methods: p-Cluster
n User-guided or constraint-based
n Clustering by considering user-specified or application-specific
constraints
n Typical methods: COD (obstacles), constrained clustering
n Link-based clustering
n Objects are often linked together in various ways
n Massive links can be used to cluster objects: SimRank, LinkClus

Partitioning Algorithms: Basic Concept
n Partitioning method: Partitioning a database D of n objects into a set of k clusters,
such that the sum of squared distances is minimized (where ci is the centroid or
medoid of cluster Ci)
$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
n Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
n Global optimal: exhaustively enumerate all partitions
n Heuristic methods: k-means and k-medoids algorithms
n k-means : Each cluster is represented by the center of the cluster
n k-medoids or PAM (Partition around medoids): Each cluster is represented
by one of the objects in the cluster

The K-Means Clustering Method

n Given k, the k-means algorithm is implemented in four steps:


n Partition objects into k nonempty subsets
n Compute seed points as the centroids of the clusters of the
current partitioning (the centroid is the center, i.e., mean
point, of the cluster)
n Assign each object to the cluster with the nearest seed point
n Go back to Step 2, stop when the assignment does not
change

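This loop maps directly onto code. Below is a minimal NumPy sketch of the four steps; the toy data X and the choice k = 2 are illustrative assumptions, and the empty-cluster edge case is ignored for brevity:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily pick k objects as the initial seed points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean point of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the assignment (hence the centroids) no longer changes
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)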
An Example of K-Means Clustering

K = 2

[Figure: the initial data set is arbitrarily partitioned into k groups; the cluster centroids are updated, objects are reassigned to the nearest centroid, and the loop repeats if needed.]

n Partition objects into k nonempty subsets
n Repeat
n Compute the centroid (i.e., mean point) of each partition
n Assign each object to the cluster of its nearest centroid
n Until no change

Characteristics of the K-Means Method

n Strength: Efficient
n O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally, k, t << n.
n Comment: Often terminates at a local optimum.
n Weakness
n Applicable only to objects in a continuous n-dimensional space
n Using the k-modes method for categorical data
n In comparison, k-medoids can be applied to a wide range of data
n Need to specify k, the number of clusters, in advance (there are ways to
automatically determine the best k; see Hastie et al., 2009)
n Sensitive to noisy data and outliers
n Not suitable to discover clusters with non-convex shapes

Variations of the K-Means Method
n Most variants of k-means differ in

n Selection of the initial k means

n Dissimilarity calculations

n Strategies to calculate cluster means

n Handling categorical data: k-modes

n Replacing means of clusters with modes

n Using new dissimilarity measures to deal with categorical objects

n Using a frequency-based method to update modes of clusters

n A mixture of categorical and numerical data: k-prototype method

What Is the Problem of the K-Means Method?

n The k-means algorithm is sensitive to outliers!

n Since an object with an extremely large value may substantially distort the distribution of the data

n K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster
PAM: A Typical K-Medoids Algorithm
[Figure: PAM with K = 2]

n Arbitrarily choose k objects as the initial medoids (total cost = 20)
n Assign each remaining object to the nearest medoid
n Randomly select a non-medoid object, O_random
n Compute the total cost of swapping a medoid with O_random (here, total cost = 26)
n Swap the medoid with O_random if the quality is improved
n Repeat the loop until no change
The K-Medoid Clustering Method
n K-Medoids Clustering: Find representative objects (medoids) in clusters
n PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
n Starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering

n PAM works effectively for small data sets, but does not scale well for
large data sets (due to the computational complexity)

n Efficiency improvement on PAM

n CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples


n CLARANS (Ng & Han, 1994): Randomized re-sampling
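As a concrete illustration, here is a naive PAM-style sketch in NumPy: repeatedly try swapping a medoid with a non-medoid and keep the swap only if it lowers the total distance. The toy data and k are assumptions for the example, and the full swap search is exactly why plain PAM does not scale to large data sets:

import numpy as np

def total_cost(X, medoids):
    # Sum of distances from every object to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best = total_cost(X, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):                    # each current medoid
            for o in range(len(X)):           # each candidate non-medoid
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                cost = total_cost(X, trial)
                if cost < best:               # keep the swap only if it improves quality
                    medoids, best = trial, cost
                    improved = True
    return medoids, best

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [100.0, 100.0]])  # last point: an outlier
print(pam(X, k=2))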

Hierarchical Clustering
n Use the distance matrix as the clustering criterion. This method does not require
the number of clusters k as an input, but needs a termination condition
[Figure: agglomerative clustering (AGNES) proceeds from step 0 to step 4, merging objects a, b, c, d, e bottom-up (a, b into ab; d, e into de; c, de into cde; finally abcde), while divisive clustering (DIANA) traverses the same steps in reverse, top-down.]

AGNES (Agglomerative Nesting)

n Introduced in Kaufmann and Rousseeuw (1990)


n Implemented in statistical packages, e.g., S-Plus
n Use the single-link method and the dissimilarity matrix
n Merge nodes that have the least dissimilarity
n Go on in a non-descending fashion
n Eventually all nodes belong to the same cluster

Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
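A short SciPy sketch ties these two slides together: build a single-link (AGNES-style) dendrogram, then cut it at a distance level to obtain flat clusters. The toy data are an illustrative assumption:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [3.0, 3.0],
              [3.1, 2.9], [8.0, 8.0]])

# Single link: repeatedly merge the pair of clusters with the least
# dissimilarity until all objects belong to one cluster
Z = linkage(X, method="single")

# Cutting the dendrogram at distance 1.0 yields the flat clustering;
# each connected component below the cut becomes a cluster
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)   # e.g., [1 1 2 2 3]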

DIANA (Divisive Analysis)
n Introduced in Kaufmann and Rousseeuw (1990)
n Implemented in statistical analysis packages, e.g., S-Plus
n Inverse order of AGNES
n Eventually each node forms a cluster on its own

Distance between Clusters

n Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)

n Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)

n Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)

n Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)

n Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)

n Medoid: a chosen, centrally located object in the cluster
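The first four measures can be written directly from their definitions. A minimal NumPy sketch, with two toy clusters assumed for illustration:

import numpy as np

def pairwise(Ki, Kj):
    # All pairwise Euclidean distances between elements of Ki and Kj
    return np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[4.0, 3.0], [5.0, 3.0]])

d = pairwise(Ki, Kj)
print("single  :", d.min())                                  # smallest pair distance
print("complete:", d.max())                                  # largest pair distance
print("average :", d.mean())                                 # average pair distance
print("centroid:", np.linalg.norm(Ki.mean(0) - Kj.mean(0)))  # distance of centroids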
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)

n Centroid: the "middle" of a cluster:
$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$

n Radius: square root of the average distance from any point of the cluster to
its centroid:
$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$

n Diameter: square root of the average mean squared distance between all
pairs of points in the cluster:
$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
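A small sketch computing the three quantities above for a toy numeric cluster (the points are an illustrative assumption):

import numpy as np

T = np.array([[3.0, 4.0], [2.0, 6.0], [4.0, 5.0], [4.0, 7.0], [3.0, 8.0]])
N = len(T)

centroid = T.sum(axis=0) / N                        # C_m
radius = np.sqrt(((T - centroid) ** 2).sum() / N)   # R_m

# Sum of squared distances over all ordered pairs of points
pair_sq = ((T[:, None, :] - T[None, :, :]) ** 2).sum()
diameter = np.sqrt(pair_sq / (N * (N - 1)))         # D_m

print(centroid, radius, diameter)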

Extensions to Hierarchical Clustering
n Major weakness of agglomerative clustering methods

n Can never undo what was done previously

n Do not scale well: time complexity of at least O(n^2), where n is the
total number of objects

n Integration of hierarchical & distance-based clustering

n BIRCH (1996): uses CF-tree and incrementally adjusts the quality of


sub-clusters

n CHAMELEON (1999): hierarchical clustering using dynamic modeling

BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
n Zhang, Ramakrishnan & Livny, SIGMOD’96
n Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
n Phase 1: scan DB to build an initial in-memory CF tree (a multi-
level compression of the data that tries to preserve the inherent
clustering structure of the data)
n Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
n Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
n Weakness: handles only numeric data, and is sensitive to the order of
the data records

Clustering Feature Vector in BIRCH

Clustering Feature (CF): CF = (N, LS, SS)

n N: number of data points
n LS: linear sum of the N points: $\sum_{i=1}^{N} X_i$
n SS: square sum of the N points: $\sum_{i=1}^{N} X_i^2$

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))

CF-Tree in BIRCH
n Clustering feature
n Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd
moments of the subcluster from the statistical point of view
n Registers crucial measurements for computing clusters and utilizes
storage efficiently
n A CF tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering
n A nonleaf node in a tree has descendants or “children”
n The nonleaf nodes store sums of the CFs of their children
n A CF tree has two parameters
n Branching factor: max # of children
n Threshold: max diameter of sub-clusters stored at the leaf nodes

The CF Tree Structure
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1 … CF6, each with a child pointer; a non-leaf node holds entries CF1 … CF5 with child pointers; leaf nodes hold CF entries and are chained by prev/next pointers.]
The Birch Algorithm
n Cluster diameter: $D = \sqrt{\frac{1}{n(n-1)} \sum_{i} \sum_{j} (x_i - x_j)^2}$
n For each point in the input
n Find closest leaf entry
n Add point to leaf entry and update CF
n If entry diameter > max_diameter, then split leaf, and possibly
parents
n Algorithm is O(n)
n Concerns
n Sensitive to insertion order of data points
n Since the size of leaf nodes is fixed, the clusters may not be
natural
n Clusters tend to be spherical given the radius and diameter
measures
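A hedged sketch using scikit-learn's Birch implementation; the data, threshold, and branching_factor values are illustrative assumptions:

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
               rng.normal(loc=5.0, scale=0.3, size=(50, 2))])

# threshold bounds the diameter of leaf subclusters; branching_factor
# bounds the number of CF entries per node; n_clusters sets the Phase-2
# global clustering applied to the leaf entries
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(np.bincount(labels))   # roughly [50 50]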

Probabilistic Hierarchical Clustering
n Algorithmic hierarchical clustering
n Nontrivial to choose a good distance measure
n Hard to handle missing attribute values
n Optimization goal not clear: heuristic, local search
n Probabilistic hierarchical clustering
n Use probabilistic models to measure distances between clusters
n Generative model: Regard the set of data objects to be clustered as a
sample of the underlying data generation mechanism to be analyzed
n Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
n In practice, assume the generative models adopt common distribution
functions, e.g., the Gaussian or Bernoulli distribution, governed by
parameters

Density-Based Clustering: Basic Concepts
n Two parameters
n Eps - Maximum radius of the neighbourhood
n MinPts - Minimum number of points in an Eps-neighbourhood of
that point
n NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
n Directly density-reachable - A point p is directly density-reachable from
a point q w.r.t. Eps, MinPts if
n p belongs to NEps(q)
n core point condition: |NEps(q)| ≥ MinPts

[Figure: p is directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]

Density-Reachable and Density-Connected

n Density-reachable
n A point p is density-reachable from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …, pn, with p1 = q and pn = p,
such that pi+1 is directly density-reachable from pi
n Density-connected
n A point p is density-connected to a point q w.r.t. Eps, MinPts if
there is a point o such that both p and q are density-reachable
from o w.r.t. Eps and MinPts

[Figure: density-reachability along a chain q = p1, …, pn = p; density-connectedness of p and q through a common point o]

DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
n Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
n Discovers clusters of arbitrary shape in spatial databases
with noise

[Figure: a DBSCAN cluster with Eps = 1 cm and MinPts = 5, showing core points (dense interior), border points (on the cluster edge), and an outlier belonging to no cluster]

DBSCAN: The Algorithm
n Arbitrarily select a point p

n Retrieve all points density-reachable from p w.r.t. Eps


and MinPts

n If p is a core point, a cluster is formed

n If p is a border point, no points are density-reachable


from p and DBSCAN visits the next point of the database

n Continue the process until all of the points have been


processed
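A short sketch with scikit-learn's DBSCAN; eps, min_samples, and the toy data are illustrative assumptions. Points labeled -1 are noise (outliers):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(40, 2)),   # dense cluster 1
               rng.normal(4.0, 0.2, size=(40, 2)),   # dense cluster 2
               [[2.0, 10.0]]])                       # an isolated point

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # e.g., {0, 1, -1}: two clusters plus noise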

OPTICS: A Cluster-Ordering Method (1999)

n OPTICS: Ordering Points To Identify the Clustering Structure


n Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
n Produces a special order of the database w.r.t. its density-
based clustering structure
n This cluster-ordering contains information equivalent to the
density-based clusterings corresponding to a broad range of
parameter settings
n Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
n Can be represented graphically or using visualization
techniques
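A hedged sketch of OPTICS via scikit-learn; the reachability values in cluster order are what the reachability plot visualizes, and the data are assumed for illustration:

import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(40, 2)),
               rng.normal(4.0, 0.2, size=(40, 2))])

opt = OPTICS(min_samples=5).fit(X)
# Reachability distances in the computed cluster ordering: valleys in this
# sequence correspond to clusters at varying density thresholds
print(opt.reachability_[opt.ordering_][:10])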

What Are Outliers?
n Outlier: A data object that deviates significantly from the normal objects as if
it were generated by a different mechanism
n Ex.: Unusual credit card purchase
n Outliers are different from the noise data
n Noise is random error or variance in a measured variable
n Noise should be removed before outlier detection
n Outliers are interesting: they violate the mechanism that generates the normal data
n Outlier detection vs. novelty detection: at an early stage an object may be reported as an
outlier, but later it may be merged into the model as a novel pattern
n Applications:
n Credit card fraud detection
n Telecom fraud detection
n Customer segmentation
n Medical analysis

Types of Outliers
n Three kinds: global, contextual and collective outliers
n Global outlier (or point anomaly)
n Object is Og if it significantly deviates from the rest of the data set [Figure: global outlier]
n Ex. Intrusion detection in computer networks
n Issue: Find an appropriate measurement of deviation
n Contextual outlier (or conditional outlier)
n Object is Oc if it deviates significantly based on a selected context
n Ex. 80°F in Urbana: outlier? (depending on summer or winter?)
n Attributes of data objects should be divided into two groups
n Contextual attributes: define the context, e.g., time & location
n Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
n Can be viewed as a generalization of local outliers, whose density significantly
deviates from that of their local area
n Issue: How to define or formulate meaningful context?
Types of Outliers
n Collective Outliers
n A subset of data objects collectively deviate significantly
from the whole data set, even if the individual data
objects may not be outliers
n Applications: E.g., intrusion detection:
n When a number of computers keep sending denial-of-service packets
to each other [Figure: collective outlier]
n Detection of collective outliers
n Consider not only behavior of individual objects, but also that of groups of objects
n Need to have the background knowledge on the relationship among data objects,
such as a distance or similarity measure on objects.
n A data set may have multiple types of outlier
n One object may belong to more than one type of outlier

Challenges of Outlier Detection
n Modeling normal objects and outliers properly
n Hard to enumerate all possible normal behaviors in an application
n The border between normal and outlier objects is often a gray area
n Application-specific outlier detection
n Choice of distance measure among objects and the model of relationship among
objects are often application-dependent
n E.g., in clinical data a small deviation could be an outlier, while marketing analysis
tolerates larger fluctuations
n Handling noise in outlier detection
n Noise may distort the normal objects and blur the distinction between normal objects
and outliers. It may help hide outliers and reduce the effectiveness of outlier detection
n Understandability
n Understand why these are outliers: Justification of the detection
n Specify the degree of an outlier: the unlikelihood of the object being generated by a
normal mechanism

Outlier Detection I: Supervised Methods
n Two ways to categorize outlier detection methods:
n Based on whether user-labeled examples of outliers can be obtained
n Supervised, semi-supervised vs. unsupervised methods
n Based on assumptions about normal data and outliers
n Statistical, proximity-based, and clustering-based methods
n Outlier Detection I: Supervised Methods
n Modeling outlier detection as a classification problem
n Samples examined by domain experts used for training & testing
n Methods for learning a classifier for outlier detection effectively:
n Model normal objects & report those not matching the model as outliers, or
n Model outliers and treat those not matching the model as normal
n Challenges
n Imbalanced classes, i.e., outliers are rare: Boost the outlier class and make up
some artificial outliers
n Catch as many outliers as possible, i.e., recall is more important than accuracy
(i.e., not mislabeling normal objects as outliers)
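A minimal sketch of the classification framing, using class weighting to boost the rare outlier class. The data and the choice of logistic regression are illustrative assumptions, not from the slides:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(980, 2))    # abundant normal samples
X_outlier = rng.normal(6.0, 1.0, size=(20, 2))    # rare labeled outliers
X = np.vstack([X_normal, X_outlier])
y = np.array([0] * 980 + [1] * 20)

# class_weight='balanced' reweights errors inversely to class frequency,
# which favors recall on the rare outlier class
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print((clf.predict(X_outlier) == 1).mean())   # fraction of outliers caught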

Outlier Detection II: Unsupervised Methods
n Assume the normal objects are somewhat "clustered" into multiple groups, each having
some distinct features
n An outlier is expected to be far away from any groups of normal objects
n Weakness: Cannot detect collective outlier effectively
n Normal objects may not share any strong patterns, but the collective outliers may share
high similarity in a small area
n Ex. In some intrusion or virus detection, normal activities are diverse
n Unsupervised methods may have a high false positive rate but still miss many real
outliers
n Supervised methods can be more effective, e.g., at identifying attacks on key resources
n Many clustering methods can be adapted for unsupervised methods
n Find clusters, then outliers: not belonging to any cluster
n Problem 1: Hard to distinguish noise from outliers
n Problem 2: Costly, since clustering comes first, yet there are far fewer outliers than normal objects
n Newer methods tackle outliers directly

Outlier Detection III: Semi-Supervised Methods
n Situation: In many applications, the number of labeled data is often small: Labels could
be on outliers only, normal objects only, or both
n Semi-supervised outlier detection: Regarded as applications of semi-supervised learning
n If some labeled normal objects are available
n Use the labeled examples and the proximate unlabeled objects to train a model for
normal objects
n Those not fitting the model of normal objects are detected as outliers
n If only some labeled outliers are available, a small number of labeled outliers may not
cover the possible outliers well
n To improve the quality of outlier detection, one can get help from models for
normal objects learned from unsupervised methods

Outlier Detection (1) - Statistical Methods

n Statistical methods (also known as model-based methods) assume that


the normal data follow some statistical model (a stochastic model)
n The data not following the model are outliers.
n Example (right figure): First use Gaussian distribution
to model the normal data
n For each object y in region R, estimate gD(y), the
probability that y fits the Gaussian distribution
n If gD(y) is very low, y is unlikely generated by the
Gaussian model, thus an outlier
n Effectiveness of statistical methods: highly depends on whether the
assumption of statistical model holds in the real data
n There are rich alternatives to use various statistical models
n E.g., parametric vs. non-parametric
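A sketch of the Gaussian-model approach above: fit a normal distribution to the data, then flag objects whose estimated density is very low. The data and the density cutoff are illustrative assumptions:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(10.0, 2.0, size=200), [35.0]])  # one outlier

mu, sigma = y.mean(), y.std()
density = norm.pdf(y, loc=mu, scale=sigma)   # gD(y) under the fitted model

threshold = 1e-4                             # "very low" density cutoff
print(y[density < threshold])                # typically just the outlier, 35.0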

Outlier Detection (2) - Proximity-Based Methods
n An object is an outlier if the nearest neighbors of the object are far away, i.e.,
the proximity of the object significantly deviates from the proximity of
most of the other objects in the same data set

n Example (see figure) - Model the proximity of an


object using its 3 nearest neighbors
n Objects in region R are substantially different
from other objects in the data set.
n Thus the objects in R are outliers

n The effectiveness of proximity-based methods highly relies on the proximity measure.


n In some applications, proximity or distance measures cannot be obtained easily.
n Often have a difficulty in finding a group of outliers which stay close to each other
n Two major types of proximity-based outlier detection
n Distance-based vs. density-based
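A sketch of a distance-based proximity method: score each object by the distance to its k-th nearest neighbor and flag the largest scores. The choice k = 3 and the toy data are illustrative assumptions:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(60, 2)), [[6.0, 6.0]]])

k = 3
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own 0-NN
dist, _ = nn.kneighbors(X)
score = dist[:, k]                                # distance to the k-th neighbor

print(np.argsort(score)[-1])   # index of the strongest outlier (here, 60)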

Outlier Detection (3) - Clustering-Based Methods

n Normal data belong to large and dense clusters, whereas outliers belong
to small or sparse clusters, or do not belong to any clusters

n Example (right figure): two clusters


n All points not in R form a large cluster
n The two points in R form a tiny cluster, thus are
outliers

n Since there are many clustering methods, there are many clustering-
based outlier detection methods as well
n Clustering is expensive: a straightforward adaptation of a clustering method
for outlier detection can be costly and does not scale up well for large
data sets
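A sketch of the clustering-based idea: cluster the data, then flag points in tiny clusters (or in no cluster) as outliers. DBSCAN is one convenient choice here since it labels non-cluster points -1; the data and the min_size cutoff are illustrative assumptions:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(80, 2)),
               [[5.0, 5.0], [5.1, 5.0]]])            # a tiny, isolated group

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

min_size = 5
sizes = {lab: (labels == lab).sum() for lab in set(labels)}
outliers = [i for i, lab in enumerate(labels)
            if lab == -1 or sizes[lab] < min_size]   # small/sparse clusters
print(outliers)   # typically the last two points (indices 80, 81)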

Summary
n Cluster analysis groups objects based on their similarity and has wide applications
n Measure of similarity can be computed for various types of data
n Clustering algorithms can be categorized into partitioning methods, hierarchical
methods, density-based methods, grid-based methods, and model-based methods
n K-means and K-medoids algorithms are popular partitioning-based clustering
algorithms
n Birch and Chameleon are interesting hierarchical clustering algorithms, and there are
also probabilistic hierarchical clustering algorithms
n DBSCAN, OPTICS, and DenClue are interesting density-based algorithms
n STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace
clustering algorithm
n Quality of clustering results can be evaluated in various ways

Summary
n Types of outliers
n global, contextual & collective outliers
n Outlier detection
n supervised, semi-supervised, or unsupervised
n Statistical (or model-based) approaches
n Proximity-based approaches
n Clustering-based approaches
n Classification approaches
n Mining contextual and collective outliers
n Outlier detection in high dimensional data
