
Data Mining and Business Intelligence

ITC 601
INFORMATION TECHNOLOGY
SEMESTER - VI (CBCGS R 2019)

MODULE 4
CLUSTERING AND OUTLIER DETECTION

Syllabus Topics (08 Hours)

Topic                                                          Hrs   CO
4.1  Cluster Analysis: Basic Concepts                          02    CO4, CO5
     Partitioning Methods: K-Means, K-Medoids
4.2  Hierarchical Methods: Agglomerative, Divisive, BIRCH      02    CO4, CO5
4.3  Density-Based Methods: DBSCAN                             02    CO4, CO5
4.4  What are outliers? Types, Challenges                      02    CO4, CO5
     Outlier Detection Methods: Supervised, Semi-Supervised,
     Unsupervised, Proximity-based, Clustering-based

Reference: Han, Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 3rd Edition.

Supervised vs. Unsupervised Learning

n Supervised learning (classification)


n Supervision: The training data (observations, measurements,
etc.) are accompanied by labels indicating the class of the
observations
n New data is classified based on the training set
n Unsupervised learning (clustering)
n The class labels of the training data are unknown
n Given a set of measurements, observations, etc. with the aim
of establishing the existence of classes or clusters in the data

What is Cluster Analysis?
n Cluster: A collection of data objects
n similar (or related) to one another within the same group
n dissimilar (or unrelated) to the objects in other groups
n Cluster analysis (or clustering, data segmentation, …)
n Finding similarities between data according to the
characteristics found in the data and grouping similar
data objects into clusters
n Unsupervised learning: no predefined classes (i.e., learning
by observations vs. learning by examples: supervised)
n Typical applications
n As a stand-alone tool to get insight into data distribution
n As a preprocessing step for other algorithms
Clustering for Data Understanding and
Applications
n Biology: taxonomy of living things: kingdom, phylum, class, order, family,
genus and species
n Information retrieval: document clustering
n Land use: Identification of areas of similar land use in an earth observation
database
n Marketing: Help marketers discover distinct groups in their customer bases,
and then use this knowledge to develop targeted marketing programs
n City-planning: Identifying groups of houses according to their house type,
value, and geographical location
n Earth-quake studies: Observed earth quake epicenters should be clustered
along continent faults
n Climate: understanding Earth's climate; finding patterns in atmospheric and ocean data
n Economic science: market research
Clustering as a Preprocessing Tool (Utility)

n Summarization:
n Preprocessing for regression, PCA, classification, and
association analysis
n Compression:
n Image processing: vector quantization
n Finding K-nearest Neighbors
n Localizing search to one or a small number of clusters
n Outlier detection
n Outliers are often viewed as those “far away” from any cluster

Quality: What Is Good Clustering?

n A good clustering method will produce high quality clusters


n high intra-class similarity: cohesive within clusters
n low inter-class similarity: distinctive between clusters

n The quality of a clustering method depends on


n the similarity measure used by the method
n its implementation, and
n its ability to discover some or all of the hidden patterns

Measure the Quality of Clustering
n Dissimilarity/Similarity metric
n Similarity is expressed in terms of a distance function,
typically metric: d(i,j)
n The definitions of distance functions are usually rather
different for interval-scaled, boolean, categorical, ordinal, ratio,
and vector variables
n Weights should be associated with different variables based
on applications and data semantics

n Quality of clustering
n There is usually a separate “quality” function that measures
the “goodness” of a cluster.
n It is hard to define “similar enough” or “good enough”

Considerations for Cluster Analysis
n Partitioning criteria
n Single level vs. hierarchical partitioning (often, multi-level hierarchical
partitioning is desirable)
n Separation of clusters
n Exclusive (e.g., one customer belongs to only one region) vs. non-exclusive
(e.g., one document may belong to more than one class)
n Similarity measure
n Distance-based (e.g., Euclidean, road network, vector) vs. connectivity-based
(e.g., density or contiguity)
n Clustering space
n Full space (often when low dimensional) vs. subspaces (often in high-
dimensional clustering)

Requirements and Challenges
n Scalability
n Clustering all the data instead of only on samples
n Ability to deal with different types of attributes
n Numerical, binary, categorical, ordinal, linked, and mixture of these
n Constraint-based clustering
n User may give inputs on constraints
n Use domain knowledge to determine input parameters
n Interpretability and usability
n Others
n Discovery of clusters with arbitrary shape
n Ability to deal with noisy data
n Incremental clustering and insensitivity to input order
n High dimensionality

Major Clustering Approaches (I)
n Partitioning approach
n Construct various partitions and then evaluate them by some
criterion, e.g., minimizing the sum of square errors
n Typical methods: k-means, k-medoids, CLARANS
n Hierarchical approach
n Create a hierarchical decomposition of the set of data (or objects)
using some criterion
n Typical methods: Diana, Agnes, BIRCH, CAMELEON
n Density-based approach
n Based on connectivity and density functions
n Typical methods: DBSCAN, OPTICS, DenClue
n Grid-based approach
n Based on a multiple-level granularity structure
n Typical methods: STING, WaveCluster, CLIQUE

Major Clustering Approaches (II)
n Model-based methods
n A model is hypothesized for each of the clusters, and the method
tries to find the best fit of the data to the given model
n Typical methods: EM, SOM, COBWEB
n Frequent pattern-based
n Based on the analysis of frequent patterns
n Typical methods: p-Cluster
n User-guided or constraint-based
n Clustering by considering user-specified or application-specific
constraints
n Typical methods: COD (obstacles), constrained clustering
n Link-based clustering
n Objects are often linked together in various ways
n Massive links can be used to cluster objects: SimRank, LinkClus

Partitioning Algorithms: Basic Concept
n Partitioning method: Partitioning a database D of n objects into a set of k clusters,
such that the sum of squared distances is minimized (where ci is the centroid or
medoid of cluster Ci)
$E = \sum_{i=1}^{k} \sum_{p \in C_i} (p - c_i)^2$
n Given k, find a partition of k clusters that optimizes the chosen partitioning criterion
n Global optimal: exhaustively enumerate all partitions
n Heuristic methods: k-means and k-medoids algorithms
n k-means : Each cluster is represented by the center of the cluster
n k-medoids or PAM (Partition around medoids): Each cluster is represented
by one of the objects in the cluster

The K-Means Clustering Method

n Given k, the k-means algorithm is implemented in four steps:


n Partition objects into k nonempty subsets
n Compute seed points as the centroids of the clusters of the
current partitioning (the centroid is the center, i.e., mean
point, of the cluster)
n Assign each object to the cluster with the nearest seed point
n Go back to Step 2, stop when the assignment does not
change

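This loop maps directly onto code. Below is a minimal NumPy sketch of the four steps; the toy data X and the choice k = 2 are illustrative assumptions, and the empty-cluster edge case is ignored for brevity:

import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: arbitrarily pick k objects as the initial seed points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each object to the cluster with the nearest seed point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean point of its cluster
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 4: stop when the assignment (hence the centroids) no longer changes
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [4.5, 5.0]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)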
An Example of K-Means Clustering

K = 2

[Figure: the initial data set is arbitrarily partitioned into k groups; the cluster centroids are updated, objects are reassigned to the nearest centroid, and the loop repeats if needed.]

n Partition objects into k nonempty subsets
n Repeat
n Compute the centroid (i.e., mean point) of each partition
n Assign each object to the cluster of its nearest centroid
n Until no change

Characteristics of the K-Means Method

n Strength: Efficient
n O(tkn), where n is # objects, k is # clusters, and t is # iterations; normally, k, t << n.
n Comment: Often terminates at a local optimum.
n Weakness
n Applicable only to objects in a continuous n-dimensional space
n Using the k-modes method for categorical data
n In comparison, k-medoids can be applied to a wide range of data
n Need to specify k, the number of clusters, in advance (there are ways to
automatically determine the best k; see Hastie et al., 2009)
n Sensitive to noisy data and outliers
n Not suitable to discover clusters with non-convex shapes

Variations of the K-Means Method
n Most variants of k-means differ in

n Selection of the initial k means

n Dissimilarity calculations

n Strategies to calculate cluster means

n Handling categorical data: k-modes

n Replacing means of clusters with modes

n Using new dissimilarity measures to deal with categorical objects

n Using a frequency-based method to update modes of clusters

n A mixture of categorical and numerical data: k-prototype method

What Is the Problem of the K-Means Method?

n The k-means algorithm is sensitive to outliers!

n Since an object with an extremely large value may substantially distort the distribution of the data

n K-Medoids: Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in the cluster
PAM: A Typical K-Medoids Algorithm
[Figure: PAM with K = 2]

n Arbitrarily choose k objects as the initial medoids (total cost = 20)
n Assign each remaining object to the nearest medoid
n Randomly select a non-medoid object, O_random
n Compute the total cost of swapping a medoid with O_random (here, total cost = 26)
n Swap the medoid with O_random if the quality is improved
n Repeat the loop until no change
The K-Medoid Clustering Method
n K-Medoids Clustering: Find representative objects (medoids) in clusters
n PAM (Partitioning Around Medoids, Kaufmann & Rousseeuw 1987)
n Starts from an initial set of medoids and iteratively replaces one of the
medoids by one of the non-medoids if it improves the total distance of
the resulting clustering

n PAM works effectively for small data sets, but does not scale well for
large data sets (due to the computational complexity)

n Efficiency improvement on PAM

n CLARA (Kaufmann & Rousseeuw, 1990): PAM on samples


n CLARANS (Ng & Han, 1994): Randomized re-sampling
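As a concrete illustration, here is a naive PAM-style sketch in NumPy: repeatedly try swapping a medoid with a non-medoid and keep the swap only if it lowers the total distance. The toy data and k are assumptions for the example, and the full swap search is exactly why plain PAM does not scale to large data sets:

import numpy as np

def total_cost(X, medoids):
    # Sum of distances from every object to its nearest medoid
    d = np.linalg.norm(X[:, None, :] - X[medoids][None, :, :], axis=2)
    return d.min(axis=1).sum()

def pam(X, k, seed=0):
    rng = np.random.default_rng(seed)
    medoids = list(rng.choice(len(X), size=k, replace=False))
    best = total_cost(X, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):                    # each current medoid
            for o in range(len(X)):           # each candidate non-medoid
                if o in medoids:
                    continue
                trial = medoids.copy()
                trial[i] = o
                cost = total_cost(X, trial)
                if cost < best:               # keep the swap only if it improves quality
                    medoids, best = trial, cost
                    improved = True
    return medoids, best

X = np.array([[1.0, 1.0], [1.5, 2.0], [3.0, 4.0],
              [5.0, 7.0], [3.5, 5.0], [100.0, 100.0]])  # last point: an outlier
print(pam(X, k=2))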

Hierarchical Clustering
n Use the distance matrix as the clustering criterion. This method does not require
the number of clusters k as an input, but needs a termination condition
[Figure: agglomerative clustering (AGNES) proceeds from step 0 to step 4, merging objects a, b, c, d, e bottom-up (a, b into ab; d, e into de; c, de into cde; finally abcde), while divisive clustering (DIANA) traverses the same steps in reverse, top-down.]

AGNES (Agglomerative Nesting)

n Introduced in Kaufmann and Rousseeuw (1990)


n Implemented in statistical packages, e.g., S-Plus
n Use the single-link method and the dissimilarity matrix
n Merge nodes that have the least dissimilarity
n Go on in a non-descending fashion
n Eventually all nodes belong to the same cluster

Dendrogram: Shows How Clusters are Merged
Decompose data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; each connected component then forms a cluster
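A short SciPy sketch ties these two slides together: build a single-link (AGNES-style) dendrogram, then cut it at a distance level to obtain flat clusters. The toy data are an illustrative assumption:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [3.0, 3.0],
              [3.1, 2.9], [8.0, 8.0]])

# Single link: repeatedly merge the pair of clusters with the least
# dissimilarity until all objects belong to one cluster
Z = linkage(X, method="single")

# Cutting the dendrogram at distance 1.0 yields the flat clustering;
# each connected component below the cut becomes a cluster
labels = fcluster(Z, t=1.0, criterion="distance")
print(labels)   # e.g., [1 1 2 2 3]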

DIANA (Divisive Analysis)
n Introduced in Kaufmann and Rousseeuw (1990)
n Implemented in statistical analysis packages, e.g., S-Plus
n Inverse order of AGNES
n Eventually each node forms a cluster on its own

Distance between Clusters

n Single link: smallest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = min(tip, tjq)

n Complete link: largest distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = max(tip, tjq)

n Average: average distance between an element in one cluster and an element in the other, i.e., dist(Ki, Kj) = avg(tip, tjq)

n Centroid: distance between the centroids of two clusters, i.e., dist(Ki, Kj) = dist(Ci, Cj)

n Medoid: distance between the medoids of two clusters, i.e., dist(Ki, Kj) = dist(Mi, Mj)

n Medoid: a chosen, centrally located object in the cluster
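The first four measures can be written directly from their definitions. A minimal NumPy sketch, with two toy clusters assumed for illustration:

import numpy as np

def pairwise(Ki, Kj):
    # All pairwise Euclidean distances between elements of Ki and Kj
    return np.linalg.norm(Ki[:, None, :] - Kj[None, :, :], axis=2)

Ki = np.array([[0.0, 0.0], [1.0, 0.0]])
Kj = np.array([[4.0, 3.0], [5.0, 3.0]])

d = pairwise(Ki, Kj)
print("single  :", d.min())                                  # smallest pair distance
print("complete:", d.max())                                  # largest pair distance
print("average :", d.mean())                                 # average pair distance
print("centroid:", np.linalg.norm(Ki.mean(0) - Kj.mean(0)))  # distance of centroids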
Centroid, Radius and Diameter of a Cluster
(for numerical data sets)

n Centroid: the "middle" of a cluster:
$C_m = \frac{\sum_{i=1}^{N} t_{ip}}{N}$

n Radius: square root of the average distance from any point of the cluster to
its centroid:
$R_m = \sqrt{\frac{\sum_{i=1}^{N} (t_{ip} - c_m)^2}{N}}$

n Diameter: square root of the average mean squared distance between all
pairs of points in the cluster:
$D_m = \sqrt{\frac{\sum_{i=1}^{N} \sum_{j=1}^{N} (t_{ip} - t_{jq})^2}{N(N-1)}}$
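A small sketch computing the three quantities above for a toy numeric cluster (the points are an illustrative assumption):

import numpy as np

T = np.array([[3.0, 4.0], [2.0, 6.0], [4.0, 5.0], [4.0, 7.0], [3.0, 8.0]])
N = len(T)

centroid = T.sum(axis=0) / N                        # C_m
radius = np.sqrt(((T - centroid) ** 2).sum() / N)   # R_m

# Sum of squared distances over all ordered pairs of points
pair_sq = ((T[:, None, :] - T[None, :, :]) ** 2).sum()
diameter = np.sqrt(pair_sq / (N * (N - 1)))         # D_m

print(centroid, radius, diameter)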

Extensions to Hierarchical Clustering
n Major weakness of agglomerative clustering methods

n Can never undo what was done previously

n Do not scale well: time complexity of at least O(n^2), where n is the
total number of objects

n Integration of hierarchical & distance-based clustering

n BIRCH (1996): uses CF-tree and incrementally adjusts the quality of


sub-clusters

n CHAMELEON (1999): hierarchical clustering using dynamic modeling

BIRCH (Balanced Iterative Reducing and
Clustering Using Hierarchies)
n Zhang, Ramakrishnan & Livny, SIGMOD’96
n Incrementally construct a CF (Clustering Feature) tree, a hierarchical
data structure for multiphase clustering
n Phase 1: scan DB to build an initial in-memory CF tree (a multi-
level compression of the data that tries to preserve the inherent
clustering structure of the data)
n Phase 2: use an arbitrary clustering algorithm to cluster the leaf
nodes of the CF-tree
n Scales linearly: finds a good clustering with a single scan and improves
the quality with a few additional scans
n Weakness: handles only numeric data, and is sensitive to the order of
the data records

Clustering Feature Vector in BIRCH

Clustering Feature (CF): CF = (N, LS, SS)

n N: number of data points
n LS: linear sum of the N points: $\sum_{i=1}^{N} X_i$
n SS: square sum of the N points: $\sum_{i=1}^{N} X_i^2$

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))

CF-Tree in BIRCH
n Clustering feature
n Summary of the statistics for a given subcluster: the 0-th, 1st, and 2nd
moments of the subcluster from the statistical point of view
n Registers crucial measurements for computing clusters and utilizes
storage efficiently
n A CF tree is a height-balanced tree that stores the clustering features for a
hierarchical clustering
n A nonleaf node in a tree has descendants or “children”
n The nonleaf nodes store sums of the CFs of their children
n A CF tree has two parameters
n Branching factor: max # of children
n Threshold: max diameter of sub-clusters stored at the leaf nodes

The CF Tree Structure
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF1 … CF6, each with a child pointer; a non-leaf node holds entries CF1 … CF5 with child pointers; leaf nodes hold CF entries and are chained by prev/next pointers.]
The Birch Algorithm
n Cluster diameter: $D = \sqrt{\frac{1}{n(n-1)} \sum_{i} \sum_{j} (x_i - x_j)^2}$
n For each point in the input
n Find closest leaf entry
n Add point to leaf entry and update CF
n If entry diameter > max_diameter, then split leaf, and possibly
parents
n Algorithm is O(n)
n Concerns
n Sensitive to insertion order of data points
n Since the size of leaf nodes is fixed, the clusters may not be
natural
n Clusters tend to be spherical given the radius and diameter
measures
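A hedged sketch using scikit-learn's Birch implementation; the data, threshold, and branching_factor values are illustrative assumptions:

import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=0.0, scale=0.3, size=(50, 2)),
               rng.normal(loc=5.0, scale=0.3, size=(50, 2))])

# threshold bounds the diameter of leaf subclusters; branching_factor
# bounds the number of CF entries per node; n_clusters sets the Phase-2
# global clustering applied to the leaf entries
model = Birch(threshold=0.5, branching_factor=50, n_clusters=2)
labels = model.fit_predict(X)
print(np.bincount(labels))   # roughly [50 50]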

Probabilistic Hierarchical Clustering
n Algorithmic hierarchical clustering
n Nontrivial to choose a good distance measure
n Hard to handle missing attribute values
n Optimization goal not clear: heuristic, local search
n Probabilistic hierarchical clustering
n Use probabilistic models to measure distances between clusters
n Generative model: Regard the set of data objects to be clustered as a
sample of the underlying data generation mechanism to be analyzed
n Easy to understand, same efficiency as algorithmic agglomerative
clustering method, can handle partially observed data
n In practice, assume the generative models adopt common distribution
functions, e.g., the Gaussian or Bernoulli distribution, governed by
parameters

Density-Based Clustering: Basic Concepts
n Two parameters
n Eps - Maximum radius of the neighbourhood
n MinPts - Minimum number of points in an Eps-neighbourhood of
that point
n NEps(p): {q belongs to D | dist(p,q) ≤ Eps}
n Directly density-reachable - A point p is directly density-reachable from
a point q w.r.t. Eps, MinPts if
n p belongs to NEps(q)
n core point condition: |NEps(q)| ≥ MinPts

[Figure: p is directly density-reachable from q, with MinPts = 5 and Eps = 1 cm]

Density-Reachable and Density-Connected

n Density-reachable
n A point p is density-reachable from a point q w.r.t. Eps, MinPts if
there is a chain of points p1, …, pn, with p1 = q and pn = p,
such that pi+1 is directly density-reachable from pi
n Density-connected
n A point p is density-connected to a point q w.r.t. Eps, MinPts if
there is a point o such that both p and q are density-reachable
from o w.r.t. Eps and MinPts

[Figure: density-reachability along a chain q = p1, …, pn = p; density-connectedness of p and q through a common point o]

DBSCAN: Density-Based Spatial Clustering of
Applications with Noise
n Relies on a density-based notion of cluster: A cluster is
defined as a maximal set of density-connected points
n Discovers clusters of arbitrary shape in spatial databases
with noise

[Figure: a DBSCAN cluster with Eps = 1 cm and MinPts = 5, showing core points (dense interior), border points (on the cluster edge), and an outlier belonging to no cluster]

DBSCAN: The Algorithm
n Arbitrarily select a point p

n Retrieve all points density-reachable from p w.r.t. Eps


and MinPts

n If p is a core point, a cluster is formed

n If p is a border point, no points are density-reachable


from p and DBSCAN visits the next point of the database

n Continue the process until all of the points have been


processed
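A short sketch with scikit-learn's DBSCAN; eps, min_samples, and the toy data are illustrative assumptions. Points labeled -1 are noise (outliers):

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(40, 2)),   # dense cluster 1
               rng.normal(4.0, 0.2, size=(40, 2)),   # dense cluster 2
               [[2.0, 10.0]]])                       # an isolated point

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)
print(set(labels))   # e.g., {0, 1, -1}: two clusters plus noise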

OPTICS: A Cluster-Ordering Method (1999)

n OPTICS: Ordering Points To Identify the Clustering Structure


n Ankerst, Breunig, Kriegel, and Sander (SIGMOD’99)
n Produces a special order of the database w.r.t. its density-
based clustering structure
n This cluster-ordering contains information equivalent to the
density-based clusterings corresponding to a broad range of
parameter settings
n Good for both automatic and interactive cluster analysis,
including finding intrinsic clustering structure
n Can be represented graphically or using visualization
techniques
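A hedged sketch of OPTICS via scikit-learn; the reachability values in cluster order are what the reachability plot visualizes, and the data are assumed for illustration:

import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.2, size=(40, 2)),
               rng.normal(4.0, 0.2, size=(40, 2))])

opt = OPTICS(min_samples=5).fit(X)
# Reachability distances in the computed cluster ordering: valleys in this
# sequence correspond to clusters at varying density thresholds
print(opt.reachability_[opt.ordering_][:10])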

What Are Outliers?
n Outlier: A data object that deviates significantly from the normal objects as if
it were generated by a different mechanism
n Ex.: Unusual credit card purchase
n Outliers are different from the noise data
n Noise is random error or variance in a measured variable
n Noise should be removed before outlier detection
n Outliers are interesting: they violate the mechanism that generates the normal data
n Outlier detection vs. novelty detection: at an early stage an object may be reported as an
outlier, but later it may be merged into the model as a novel pattern
n Applications:
n Credit card fraud detection
n Telecom fraud detection
n Customer segmentation
n Medical analysis

Types of Outliers
n Three kinds: global, contextual and collective outliers
n Global outlier (or point anomaly)
n Object is Og if it significantly deviates from the rest of the data set [Figure: global outlier]
n Ex. Intrusion detection in computer networks
n Issue: Find an appropriate measurement of deviation
n Contextual outlier (or conditional outlier)
n Object is Oc if it deviates significantly based on a selected context
n Ex. 80°F in Urbana: outlier? (depending on summer or winter?)
n Attributes of data objects should be divided into two groups
n Contextual attributes: define the context, e.g., time & location
n Behavioral attributes: characteristics of the object, used in outlier
evaluation, e.g., temperature
n Can be viewed as a generalization of local outliers, whose density significantly
deviates from that of their local area
n Issue: How to define or formulate meaningful context?
Types of Outliers
n Collective Outliers
n A subset of data objects collectively deviate significantly
from the whole data set, even if the individual data
objects may not be outliers
n Applications: E.g., intrusion detection:
n When a number of computers keep sending denial-of-service packets
to each other [Figure: collective outlier]
n Detection of collective outliers
n Consider not only behavior of individual objects, but also that of groups of objects
n Need to have the background knowledge on the relationship among data objects,
such as a distance or similarity measure on objects.
n A data set may have multiple types of outlier
n One object may belong to more than one type of outlier

Challenges of Outlier Detection
n Modeling normal objects and outliers properly
n Hard to enumerate all possible normal behaviors in an application
n The border between normal and outlier objects is often a gray area
n Application-specific outlier detection
n Choice of distance measure among objects and the model of relationship among
objects are often application-dependent
n E.g., in clinical data a small deviation could be an outlier, while marketing analysis
tolerates larger fluctuations
n Handling noise in outlier detection
n Noise may distort the normal objects and blur the distinction between normal objects
and outliers. It may help hide outliers and reduce the effectiveness of outlier detection
n Understandability
n Understand why these are outliers: Justification of the detection
n Specify the degree of an outlier: the unlikelihood of the object being generated by a
normal mechanism

Outlier Detection I: Supervised Methods
n Two ways to categorize outlier detection methods:
n Based on whether user-labeled examples of outliers can be obtained
n Supervised, semi-supervised vs. unsupervised methods
n Based on assumptions about normal data and outliers
n Statistical, proximity-based, and clustering-based methods
n Outlier Detection I: Supervised Methods
n Modeling outlier detection as a classification problem
n Samples examined by domain experts used for training & testing
n Methods for learning a classifier for outlier detection effectively:
n Model normal objects & report those not matching the model as outliers, or
n Model outliers and treat those not matching the model as normal
n Challenges
n Imbalanced classes, i.e., outliers are rare: Boost the outlier class and make up
some artificial outliers
n Catch as many outliers as possible, i.e., recall is more important than accuracy
(i.e., not mislabeling normal objects as outliers)
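A minimal sketch of the classification framing, using class weighting to boost the rare outlier class. The data and the choice of logistic regression are illustrative assumptions, not from the slides:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(980, 2))    # abundant normal samples
X_outlier = rng.normal(6.0, 1.0, size=(20, 2))    # rare labeled outliers
X = np.vstack([X_normal, X_outlier])
y = np.array([0] * 980 + [1] * 20)

# class_weight='balanced' reweights errors inversely to class frequency,
# which favors recall on the rare outlier class
clf = LogisticRegression(class_weight="balanced").fit(X, y)
print((clf.predict(X_outlier) == 1).mean())   # fraction of outliers caught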

Outlier Detection II: Unsupervised Methods
n Assume the normal objects are somewhat "clustered" into multiple groups, each having
some distinct features
n An outlier is expected to be far away from any groups of normal objects
n Weakness: Cannot detect collective outlier effectively
n Normal objects may not share any strong patterns, but the collective outliers may share
high similarity in a small area
n Ex. In some intrusion or virus detection, normal activities are diverse
n Unsupervised methods may have a high false positive rate but still miss many real
outliers
n Supervised methods can be more effective, e.g., at identifying attacks on key resources
n Many clustering methods can be adapted for unsupervised methods
n Find clusters, then outliers: not belonging to any cluster
n Problem 1: Hard to distinguish noise from outliers
n Problem 2: Costly, since clustering comes first, yet there are far fewer outliers than normal objects
n Newer methods tackle outliers directly

Outlier Detection III: Semi-Supervised Methods
n Situation: In many applications, the number of labeled data is often small: Labels could
be on outliers only, normal objects only, or both
n Semi-supervised outlier detection: Regarded as applications of semi-supervised learning
n If some labeled normal objects are available
n Use the labeled examples and the proximate unlabeled objects to train a model for
normal objects
n Those not fitting the model of normal objects are detected as outliers
n If only some labeled outliers are available, a small number of labeled outliers may not
cover the possible outliers well
n To improve the quality of outlier detection, one can get help from models for
normal objects learned from unsupervised methods

Outlier Detection (1) - Statistical Methods

n Statistical methods (also known as model-based methods) assume that


the normal data follow some statistical model (a stochastic model)
n The data not following the model are outliers.
n Example (right figure): First use Gaussian distribution
to model the normal data
n For each object y in region R, estimate gD(y), the
probability that y fits the Gaussian distribution
n If gD(y) is very low, y is unlikely generated by the
Gaussian model, thus an outlier
n Effectiveness of statistical methods: highly depends on whether the
assumption of statistical model holds in the real data
n There are rich alternatives to use various statistical models
n E.g., parametric vs. non-parametric
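A sketch of the Gaussian-model approach above: fit a normal distribution to the data, then flag objects whose estimated density is very low. The data and the density cutoff are illustrative assumptions:

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(10.0, 2.0, size=200), [35.0]])  # one outlier

mu, sigma = y.mean(), y.std()
density = norm.pdf(y, loc=mu, scale=sigma)   # gD(y) under the fitted model

threshold = 1e-4                             # "very low" density cutoff
print(y[density < threshold])                # typically just the outlier, 35.0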

Outlier Detection (2) - Proximity-Based Methods
n An object is an outlier if the nearest neighbors of the object are far away, i.e.,
the proximity of the object significantly deviates from the proximity of
most of the other objects in the same data set

n Example (see figure) - Model the proximity of an


object using its 3 nearest neighbors
n Objects in region R are substantially different
from other objects in the data set.
n Thus the objects in R are outliers

n The effectiveness of proximity-based methods highly relies on the proximity measure.


n In some applications, proximity or distance measures cannot be obtained easily.
n Often have a difficulty in finding a group of outliers which stay close to each other
n Two major types of proximity-based outlier detection
n Distance-based vs. density-based
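A sketch of a distance-based proximity method: score each object by the distance to its k-th nearest neighbor and flag the largest scores. The choice k = 3 and the toy data are illustrative assumptions:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(60, 2)), [[6.0, 6.0]]])

k = 3
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)   # +1: each point is its own 0-NN
dist, _ = nn.kneighbors(X)
score = dist[:, k]                                # distance to the k-th neighbor

print(np.argsort(score)[-1])   # index of the strongest outlier (here, 60)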

Outlier Detection (3) - Clustering-Based Methods

n Normal data belong to large and dense clusters, whereas outliers belong
to small or sparse clusters, or do not belong to any clusters

n Example (right figure): two clusters


n All points not in R form a large cluster
n The two points in R form a tiny cluster, thus are
outliers

n Since there are many clustering methods, there are many clustering-
based outlier detection methods as well
n Clustering is expensive: a straightforward adaptation of a clustering method
for outlier detection can be costly and does not scale up well for large
data sets
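A sketch of the clustering-based idea: cluster the data, then flag points in tiny clusters (or in no cluster) as outliers. DBSCAN is one convenient choice here since it labels non-cluster points -1; the data and the min_size cutoff are illustrative assumptions:

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, size=(80, 2)),
               [[5.0, 5.0], [5.1, 5.0]]])            # a tiny, isolated group

labels = DBSCAN(eps=0.5, min_samples=2).fit_predict(X)

min_size = 5
sizes = {lab: (labels == lab).sum() for lab in set(labels)}
outliers = [i for i, lab in enumerate(labels)
            if lab == -1 or sizes[lab] < min_size]   # small/sparse clusters
print(outliers)   # typically the last two points (indices 80, 81)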

Summary
n Cluster analysis groups objects based on their similarity and has wide applications
n Measure of similarity can be computed for various types of data
n Clustering algorithms can be categorized into partitioning methods, hierarchical
methods, density-based methods, grid-based methods, and model-based methods
n K-means and K-medoids algorithms are popular partitioning-based clustering
algorithms
n Birch and Chameleon are interesting hierarchical clustering algorithms, and there are
also probabilistic hierarchical clustering algorithms
n DBSCAN, OPTICS, and DenClue are interesting density-based algorithms
n STING and CLIQUE are grid-based methods, where CLIQUE is also a subspace
clustering algorithm
n Quality of clustering results can be evaluated in various ways

Summary
n Types of outliers
n global, contextual & collective outliers
n Outlier detection
n supervised, semi-supervised, or unsupervised
n Statistical (or model-based) approaches
n Proximity-based approaches
n Clustering-based approaches
n Classification approaches
n Mining contextual and collective outliers
n Outlier detection in high dimensional data
