DATA MINING
NIVASHINI G
TEACHING ASSISTANT
Chapter-1
Data mining refers to extracting or mining knowledge from large amounts of data. The term is
actually a misnomer; the process would be more appropriately named knowledge mining,
which emphasizes mining knowledge from large amounts of data.
It is the computational process of discovering patterns in large data sets involving methods at the
intersection of artificial intelligence, machine learning, statistics, and database systems.
The overall goal of the data mining process is to extract information from a data set and transform
it into an understandable structure for further use.
Data mining derives its name from the similarities between searching for valuable business
information in a large database — for example, finding linked products in gigabytes of store
scanner data — and mining a mountain for a vein of valuable ore. Both processes require either
sifting through an immense amount of material, or intelligently probing it to find exactly where
the value resides. Given databases of sufficient size and quality, data mining technology can
generate new business opportunities by providing these capabilities:
Automated prediction of trends and behaviors. Data mining automates the process of finding
predictive information in large databases. Questions that traditionally required extensive hands-on
analysis can now be answered quickly and directly from the data. A typical example of a
predictive problem is targeted marketing: data mining uses data on past promotional mailings to
identify the customers most likely to respond to future mailings.
Automated discovery of previously unknown patterns. Data mining tools sweep through
databases and identify previously hidden patterns in one step. An example of pattern discovery is
the analysis of retail sales data to identify seemingly unrelated products that are often purchased
together. Other pattern discovery problems include detecting fraudulent credit card transactions
and identifying anomalous data that could represent data entry keying errors.
Classification – the task of generalizing known structure to apply to new data. For
example, an e-mail program might attempt to classify an e-mail as "legitimate" or as
"spam" (see the sketch after this list).
Regression – attempts to find a function which models the data with the least error.
Summarization – providing a more compact representation of the data set, including
visualization and report generation.
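To make the classification task concrete, here is a minimal Python sketch (not part of the original notes) that trains a naive Bayes spam filter with scikit-learn; the example messages and labels are invented for illustration.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented training messages labelled "spam" or "legitimate"
messages = ["win a free prize now", "cheap meds limited offer",
            "meeting agenda for Monday", "lunch at noon tomorrow?"]
labels = ["spam", "spam", "legitimate", "legitimate"]

# Generalize the known structure and apply it to a new, unseen e-mail
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(messages, labels)
print(model.predict(["free offer, claim your prize"]))   # -> ['spam']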
A typical data mining system may have the following major components.
1. Knowledge Base:
This is the domain knowledge that is used to guide the search or evaluate the
interestingness of resulting patterns. Such knowledge can include concept hierarchies,
used to organize attributes or attribute values into different levels of abstraction.
2. Data Mining Engine:
This is essential to the data mining system and ideally consists of a set of functional
modules for tasks such as characterization, association and correlation analysis,
classification, prediction, cluster analysis, outlier analysis, and evolution analysis.
3. Pattern Evaluation Module:
This component typically employs interestingness measures and interacts with the data
mining modules so as to focus the search toward interesting patterns. It may use
interestingness thresholds to filter out discovered patterns. Alternatively, the pattern
evaluation module may be integrated with the mining module, depending on the
implementation of the data mining method used. For efficient data mining, it is highly
recommended to push the evaluation of pattern interestingness as deep as possible into
the mining process so as to confine the search to only the interesting patterns.
4. User interface:
This module communicates between users and the data mining system, allowing the
user to interact with the system by specifying a data mining query or task, providing
information to help focus the search, and performing exploratory data mining based
on the intermediate data mining results. In addition, this component allows the user to
browse database and data warehouse schemas or data structures, evaluate mined
patterns, and visualize the patterns in different forms.
This step is concerned with how the data are generated and collected. In general, there are
two distinct possibilities. The first is when the data-generation process is under the
control of an expert (modeler): this approach is known as a designed experiment. The
second possibility is when the expert cannot influence the data-generation process: this is
known as the observational approach. An observational setting, namely, random data
generation, is assumed in most data-mining applications. Typically, the sampling
distribution is completely unknown after data are collected, or it is partially and
implicitly given in the data-collection procedure. It is very important, however, to
understand how data collection affects its theoretical distribution, since such a priori
knowledge can be very useful for modeling and, later, for the final interpretation of results.
In the observational setting, data are usually "collected" from existing databases, data
warehouses, and data marts. Data preprocessing usually includes at least two common
tasks:
1. Outlier detection (and removal) – Outliers are unusual data values that are not
consistent with most observations. Commonly, outliers result from measurement
errors or coding and recording errors, and sometimes they are natural, abnormal values.
Such nonrepresentative samples can seriously affect the model produced later. There
are two strategies for dealing with outliers: detect and eventually remove them as part
of the preprocessing phase, or develop robust modeling methods that are insensitive to them.
2. Scaling, encoding, and selecting features – Data preprocessing includes several steps,
such as variable scaling and different types of encoding. For example, one feature with
the range [0, 1] and another with the range [−100, 1000] will not have the same weight
in the applied technique, and they will influence the final data-mining results differently.
Therefore, it is recommended to scale them so that both features carry the same weight
in further analysis (a brief sketch of both preprocessing tasks follows this list). Also,
application-specific encoding methods usually achieve dimensionality reduction by
providing a smaller number of informative features for subsequent data modeling.
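A brief Python sketch of the two preprocessing tasks above; the sample values, the interquartile-range outlier rule, and min-max scaling are illustrative assumptions rather than the only possible choices.

import numpy as np

# Illustrative sample (invented values): one feature in [0, 1],
# another roughly in [-100, 1000] with one abnormal value (9000)
x1 = np.array([0.10, 0.35, 0.50, 0.80, 0.95])
x2 = np.array([-50.0, 120.0, 300.0, 980.0, 9000.0])

# 1. Outlier detection (and removal) using the interquartile-range rule
q1, q3 = np.percentile(x2, [25, 75])
iqr = q3 - q1
keep = (x2 >= q1 - 1.5 * iqr) & (x2 <= q3 + 1.5 * iqr)
x1, x2 = x1[keep], x2[keep]          # the row containing 9000 is dropped

# 2. Scaling: bring both features to the same [0, 1] range
def min_max(v):
    return (v - v.min()) / (v.max() - v.min())

print(min_max(x1))
print(min_max(x2))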
The selection and implementation of the appropriate data-mining technique is the main
task in this phase. This process is not straightforward; usually, in practice, the
implementation is based on several models, and selecting the best one is an additional
task. The basic principles of learning and discovery from data are given in Chapter 4 of
this book. Later, Chapters 5 through 13 explain and analyze specific techniques that are
applied to perform a successful learning process from data and to develop an appropriate
model.
In most cases, data-mining models should help in decision making. Hence, such models
need to be interpretable in order to be useful because humans are not likely to base their
decisions on complex "black-box" models. Note that the goals of accuracy of the model
and accuracy of its interpretation are somewhat contradictory. Usually, simple models are
more interpretable, but they are also less accurate. Modern data-mining methods are
expected to yield highly accurate results using high-dimensional models. The problem of
interpreting these models, also very important, is considered a separate task, with specific
techniques to validate the results. A user does not want hundreds of pages of numeric
output; the results need to be summarized and interpreted before they can support decision making.
The data mining system can be classified according to the following criteria:
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
We can classify a data mining system according to the kind of databases mined. Database systems
can be classified according to different criteria such as data models or types of data, and the
data mining system can be classified accordingly. For example, if we classify a database
according to the data model, then we may have a relational, transactional, object-relational, or data
warehouse mining system.
We can classify a data mining system according to the kind of knowledge mined, that is, on the
basis of functionalities such as:
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
We can classify a data mining system according to the applications adapted. These applications are
as follows:
Finance
Telecommunications
DNA
Stock Markets
E-mail
Interactive mining of knowledge at multiple levels of abstraction. - The data mining process
needs to be interactive because it allows users to focus the search for patterns, providing and
refining data mining requests based on returned results.
Handling noisy or incomplete data. - Data cleaning methods are required that can handle
noise and incomplete objects while mining the data regularities. Without such methods, the
accuracy of the discovered patterns will be poor.
Pattern evaluation. - This refers to the interestingness of the discovered patterns: a pattern may
be uninteresting if it represents common knowledge or lacks novelty, so interestingness
measures are needed to evaluate what is mined.
Efficiency and scalability of data mining algorithms. - In order to effectively extract the
information from huge amount of data in databases, data mining algorithm must be efficient
and scalable.
Parallel, distributed, and incremental mining algorithms. - Factors such as the huge size
of databases, the wide distribution of data, and the complexity of data mining methods motivate the
development of parallel and distributed data mining algorithms. These algorithms divide the
data into partitions that are processed in parallel, and the results from the partitions are then
merged. Incremental algorithms update existing mining results when the database is updated,
without having to mine the data again from scratch.
Data Cleaning - In this step, noise and inconsistent data are removed.
Architecture of KDD
Time-Variant: Historical data is kept in a data warehouse. For example, one can retrieve data
from 3 months, 6 months, 12 months, or even older data from a data warehouse. This contrasts
with a transaction system, where often only the most recent data are kept. For example, a
transaction system may hold the most recent address of a customer, whereas a data warehouse can
hold all addresses associated with that customer.
Non-volatile: Once data is in the data warehouse, it will not change. So, historical data in a data
warehouse should never be altered.
The top-down approach starts with the overall design and planning. It is useful in cases
where the technology is mature and well known, and where the business problems that must
be solved are clear and well understood.
The bottom-up approach starts with experiments and prototypes. This is useful in the early
stage of business modeling and technology development. It allows an organization to move
forward at considerably less expense and to evaluate the benefits of the technology before
making significant commitments.
In the combined approach, an organization can exploit the planned and strategic nature of
the top-down approach while retaining the rapid implementation and opportunistic
application of the bottom-up approach.
Choose a business process to model, for example, orders, invoices, shipments, inventory,
account administration, sales, or the general ledger. If the business process is organizational
and involves multiple complex object collections, a data warehouse model should be
followed. However, if the process is departmental and focuses on the analysis of one kind of
business process, a data mart model should be chosen.
Choose the grain of the business process. The grain is the fundamental, atomic level of data
to be represented in the fact table for this process, for example, individual transactions,
individual daily snapshots, and so on.
Choose the dimensions that will apply to each fact table record. Typical dimensions are
time, item, customer, supplier, warehouse, transaction type, and status.
Choose the measures that will populate each fact table record. Typical measures are numeric
additive quantities like dollars sold and units sold.
Tier-1:
The bottom tier is a warehouse database server that is almost always a relational database
system. Back-end tools and utilities are used to feed data into the bottom tier from
operational databases or other external sources (such as customer profile information
provided by external consultants). These tools and utilities perform data extraction,
cleaning, and transformation (e.g., to merge similar data from different sources into a
unified format), as well as load and refresh functions to update the data warehouse. The
data are extracted using application program interfaces known as gateways. A gateway is
supported by the underlying DBMS and allows client programs to generate SQL code to
be executed at a server.
Tier-2:
The middle tier is an OLAP server that is typically implemented using either a relational
OLAP (ROLAP) model or a multidimensional OLAP (MOLAP) model.
Tier-3:
The top tier is a front-end client layer, which contains query and reporting tools, analysis
tools, and/or data mining tools (e.g., trend analysis, prediction, and so on).
1. Enterprise warehouse:
An enterprise warehouse collects all of the information about subjects spanning the entire
organization.
It provides corporate-wide data integration, usually from one or more operational systems
or external information providers, and is cross-functional in scope.
It typically contains detailed data as well as summarized data, and can range in size from a
few gigabytes to hundreds of gigabytes, terabytes, or beyond.
An enterprise data warehouse may be implemented on traditional mainframes, computer
superservers, or parallel architecture platforms. It requires extensive business modeling
and may take years to design and build.
2. Data mart:
A data mart contains a subset of corporate-wide data that is of value to a specific group of
users. The scope is confined to specific selected subjects. For example, a marketing data
mart may confine its subjects to customer, item, and sales. The data contained in data marts
tend to be summarized.
Metadata are data about data. When used in a data warehouse, metadata are the data that define
warehouse objects. Metadata are created for the data names and definitions of the given
warehouse. Additional metadata are created and captured for timestamping any extracted data,
the source of the extracted data, and missing fields that have been added by data cleaning or
integration processes.
Operational metadata, which include data lineage (history of migrated data and
the sequence of transformations applied to it), currency of data (active, archived,
or purged), and monitoring information (warehouse usage statistics, error
reports, and audit trails).
The algorithms used for summarization, which include measure and dimension
definition algorithms, data on granularity, partitions, subject areas, aggregation,
summarization, and predefined queries and reports.
Data related to system performance, which include indices and profiles that
improve data access and retrieval performance, in addition to rules for the timing
and scheduling of refresh, update, and replication cycles.
Consolidation (Roll-Up)
Drill-Down
Slicing And Dicing
Consolidation (roll-up) involves the aggregation of data along one or more dimensions,
for example rolling up product-level sales figures to totals per region or per quarter.
The drill-down is a technique that allows users to navigate through the details.
For instance, users can view the sales by individual products that make up a
region’s sales.
Slicing and dicing is a feature whereby users can take out (slicing) a specific set
of data of the OLAP cube and view (dicing) the slices from different viewpoints.
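A small pandas sketch of the three operations above on an invented sales table; the column names and figures are made up for illustration.

import pandas as pd

# Invented sales records: time, region, and product dimensions with a "dollars_sold" measure
sales = pd.DataFrame({
    "quarter": ["Q1", "Q1", "Q1", "Q2", "Q2", "Q2"],
    "region":  ["East", "West", "East", "East", "West", "West"],
    "product": ["TV", "TV", "Phone", "TV", "Phone", "Phone"],
    "dollars_sold": [1200, 800, 950, 1100, 700, 650],
})

# Roll-up (consolidation): aggregate product-level detail up to region totals per quarter
rollup = sales.pivot_table(values="dollars_sold", index="quarter",
                           columns="region", aggfunc="sum")

# Drill-down: navigate back to the product detail that makes up a region's sales
drill = sales.groupby(["quarter", "region", "product"])["dollars_sold"].sum()

# Slice: fix one dimension (quarter == "Q1"); dice: view the slice by region and product
slice_q1 = sales[sales["quarter"] == "Q1"]
dice = slice_q1.pivot_table(values="dollars_sold", index="region",
                            columns="product", aggfunc="sum")
print(rollup, drill, dice, sep="\n\n")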
ROLAP works directly with relational databases. The base data and the
dimension tables are stored as relational tables and new tables are created to
hold the aggregated information. It depends on a specialized schema design.
This methodology relies on manipulating the data stored in the relational
database to give the appearance of traditional OLAP's slicing and dicing
functionality. In essence, each action of slicing and dicing is equivalent to
adding a "WHERE" clause in the SQL statement.
ROLAP tools do not use pre-calculated data cubes but instead pose the query
to the standard relational database and its tables in order to bring back the
data required to answer the question.
ROLAP tools feature the ability to ask any question because the methodology is not
limited to the contents of a cube. ROLAP also has the ability to drill down to the lowest
level of detail in the database.
MOLAP tools have a very fast response time and the ability to quickly write
back data into the data set.
1. Entity Identification Problem:
How can the data analyst or the computer be sure that customer id in one database and
customer number in another refer to the same attribute?
2. Redundancy:
For the same real-world entity, attribute values from different sources may differ.
Smoothing, which works to remove noise from the data. Such techniques
include binning, regression, and clustering (a short sketch of binning follows this list).
Aggregation, where summary or aggregation operations are applied to the data.
For example, the daily sales data may be aggregated so as to compute monthly
and annual total amounts. This step is typically used in constructing a data cube
for analysis of the data at multiple granularities.
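The smoothing and aggregation steps can be sketched as follows in Python; the price values and the daily sales series are illustrative, and equal-frequency bins of size three are an assumption.

import numpy as np
import pandas as pd

# Smoothing by bin means: sorted, equal-frequency bins of size 3 (illustrative prices)
prices = np.array([4, 8, 15, 21, 21, 24, 25, 28, 34], dtype=float)
bins = prices.reshape(3, 3)                      # three bins of three sorted values
smoothed = np.repeat(bins.mean(axis=1), 3)       # each value replaced by its bin mean
print(smoothed)                                  # [ 9.  9.  9. 22. 22. 22. 29. 29. 29.]

# Aggregation: roll illustrative daily sales up to monthly totals, e.g., for a data cube
daily = pd.DataFrame({"date": pd.date_range("2024-01-01", periods=90, freq="D"),
                      "sales": range(90)})
monthly = daily.groupby(daily["date"].dt.to_period("M"))["sales"].sum()
print(monthly)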
Chapter-2
Problem Definition:
The problem of association rule mining is defined as follows. Let I = {i1, i2, …, in} be a set
of items and let D = {t1, t2, …, tm} be a set of transactions called the database, where each
transaction contains a subset of the items in I. An association rule is an implication of the
form X => Y, where X, Y ⊆ I and X ∩ Y = ∅.
The sets of items (itemsets for short) X and Y are called the antecedent (left-hand side, LHS) and
the consequent (right-hand side, RHS) of the rule, respectively.
Example:
To illustrate the concepts, we use a small example from the supermarket domain. The set of
items is I = {milk, bread, butter, beer}, and a small database containing the items is shown
below, where a 1 codes the presence and a 0 the absence of an item in a transaction.

Transaction ID | milk | bread | butter | beer
       1       |  1   |   1   |   0    |  0
       2       |  0   |   0   |   1    |  0
       3       |  0   |   0   |   0    |  1
       4       |  1   |   1   |   1    |  0
       5       |  0   |   1   |   0    |  0

An example rule for the supermarket could be {milk, bread} => {butter}, meaning that if milk
and bread are bought, customers also buy butter.
Another measure is conviction, defined as conv(X => Y) = (1 − supp(Y)) / (1 − conf(X => Y)),
and it can be interpreted as the ratio of the expected frequency that X occurs without Y
(that is to say, the frequency that the rule makes an incorrect prediction) if X and Y were
independent, divided by the observed frequency of incorrect predictions.
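Using the table above, the following Python sketch computes support, confidence, lift, and conviction for the illustrative rule {milk, bread} => {butter}; the helper names are ours, not from any particular library.

# Transactions from the table above; 1 means the item is present in the transaction
items = ["milk", "bread", "butter", "beer"]
rows = [(1, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1), (1, 1, 1, 0), (0, 1, 0, 0)]
transactions = [{i for i, flag in zip(items, r) if flag} for r in rows]
n = len(transactions)

def support(itemset):
    return sum(itemset <= t for t in transactions) / n

X, Y = {"milk", "bread"}, {"butter"}
supp = support(X | Y)                        # 1/5 = 0.2
conf = supp / support(X)                     # 0.2 / 0.4 = 0.5
lift = conf / support(Y)                     # 0.5 / 0.4 = 1.25
conviction = (1 - support(Y)) / (1 - conf)   # 0.6 / 0.5 = 1.2
print(supp, conf, lift, conviction)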
If customers who purchase computers also tend to buy antivirus software at the same time, then
placing the hardware display close to the software display may help increase the sales of both
items. In an alternative strategy, placing hardware and software at opposite ends of the store may
entice customers who purchase such items to pick up other items along the way. For instance,
after deciding on an expensive computer, a customer may observe security systems for sale while
heading toward the software display to purchase antivirus software and may decide to purchase a
home security system as well. Market basket analysis can also help retailers plan which items to
put on sale at reduced prices. If customers tend to purchase computers and printers together,
then having a sale on printers may encourage the sale of printers as well as computers.
Some methods for association rule mining can find rules at differing levels of abstraction.
For example, suppose that a set of association rules mined includes the following rules,
where X is a variable representing a customer:
buys(X, "computer") => buys(X, "HP printer")         (1)
buys(X, "laptop computer") => buys(X, "HP printer")  (2)
In rules (1) and (2), the items bought are referenced at different levels of abstraction (e.g.,
"computer" is a higher-level abstraction of "laptop computer").
3. Based on the number of data dimensions involved in the rule:
If the items or attributes in an association rule reference only one dimension, then
it is a single-dimensional association rule.
buys(X, "computer") => buys(X, "antivirus software")
If a rule references two or more dimensions, such as the dimensions age, income,
and buys, then it is a multidimensional association rule. The following rule is an
example of a multidimensional rule:
age(X, "30…39") ^ income(X, "42K…48K") => buys(X, "high resolution TV")
7. The transactions in D are scanned in order to determine L3, consisting of those candidate
3-itemsets in C3 having minimum support.
8. The algorithm uses L3 × L3 to generate a candidate set of 4-itemsets, C4.
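A compact Python sketch of the Apriori level-wise idea described in these steps (candidate generation by self-join followed by a support scan); for brevity it omits the subset-based pruning step, and the transactions are invented.

def apriori(transactions, min_support):
    """Level-wise Apriori sketch: generate candidate k-itemsets by joining
    frequent (k-1)-itemsets, then scan the transactions to keep those with
    support >= min_support. The subset-based pruning step is omitted."""
    n = len(transactions)
    def support(s):
        return sum(s <= t for t in transactions) / n
    items = {frozenset([i]) for t in transactions for i in t}
    L = {s for s in items if support(s) >= min_support}     # L1
    frequent, k = set(L), 2
    while L:
        C = {a | b for a in L for b in L if len(a | b) == k}  # candidate generation (join)
        L = {c for c in C if support(c) >= min_support}       # scan D for frequent k-itemsets
        frequent |= L
        k += 1
    return frequent

transactions = [{"milk", "bread"}, {"butter"}, {"beer"},
                {"milk", "bread", "butter"}, {"bread"}]
print(apriori(transactions, min_support=0.4))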
For many applications, it is difficult to find strong associations among data items at low or
primitive levels of abstraction due to the sparsity of data at those levels.
Strong associations discovered at high levels of abstraction may represent commonsense
knowledge.
Therefore, data mining systems should provide capabilities for mining association rules at
multiple levels of abstraction, with sufficient flexibility for easy traversal
among different abstraction spaces.
Association rules that involve two or more dimensions or predicates can be referred to as
multidimensional association rules.
Lift is a simple correlation measure that is given as follows. The occurrence of itemset A
is independent of the occurrence of itemset B if P(A ∪ B) = P(A)P(B); otherwise,
itemsets A and B are dependent and correlated as events. This definition can easily be
extended to more than two itemsets. The lift between the occurrence of A and B is defined as
lift(A, B) = P(A ∪ B) / (P(A)P(B)).
If lift(A, B) is less than 1, then the occurrence of A is negatively correlated with the
occurrence of B.
If the resulting value is greater than 1, then A and B are positively correlated, meaning that the
occurrence of one implies the occurrence of the other.
If the resulting value is equal to 1, then A and B are independent and there is no correlation
between them.
Chapter-3
Classification and prediction are two forms of data analysis that can be used to extract models
describing important data classes or to predict future data trends.
Classification predicts categorical (discrete, unordered) labels, while prediction
models continuous-valued functions.
For example, we can build a classification model to categorize bank loan applications as
either safe or risky, or a prediction model to predict the expenditures of potential customers
on computer equipment given their income and occupation.
A predictor is constructed that predicts a continuous-valued function, or ordered value, as
opposed to a categorical label.
Regression analysis is a statistical methodology that is most often used for numeric prediction.
Many classification and prediction methods have been proposed by researchers in machine
learning, pattern recognition, and statistics.
Most algorithms are memory resident, typically assuming a small data size. Recent data
mining research has built on such work, developing scalable classification and prediction
techniques capable of handling large disk-resident data.
(i) Data cleaning:
This refers to the preprocessing of data in order to remove or reduce noise (by applying
smoothing techniques) and the treatment of missing values (e.g., by replacing a missing
value with the most commonly occurring value for that attribute).
There are three possible scenarios. Let A be the splitting attribute. A has v distinct values,
{a1, a2, …, av}, based on the training data.
1. A is discrete-valued:
In this case, the outcomes of the test at node N correspond directly to the known values
of A.
A branch is created for each known value, aj, of A and labeled with that value.
A need not be considered in any future partitioning of the tuples.
2. A is continuous-valued:
In this case, the test at node N has two possible outcomes, corresponding to the conditions
A <= split_point and A > split_point, respectively, where split_point is the split-point returned by
the attribute selection method as part of the splitting criterion.
3. A is discrete-valued and a binary tree must be produced:
In this case, the test at node N is of the form "A ∈ SA?", where SA is the splitting subset for A,
returned by the attribute selection method as part of the splitting criterion.
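As a small illustration of how a splitting attribute can be chosen at a node, the following Python sketch computes information gain for discrete-valued attributes on a few invented loan tuples; information gain is one of several possible attribute selection measures.

import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, label="class"):
    """Expected reduction in entropy obtained by splitting on a discrete attribute."""
    base = entropy([r[label] for r in rows])
    total = len(rows)
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[label] for r in rows if r[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return base - remainder

# Invented loan-application tuples
rows = [
    {"income": "high", "credit": "fair",      "class": "safe"},
    {"income": "high", "credit": "excellent", "class": "safe"},
    {"income": "low",  "credit": "fair",      "class": "risky"},
    {"income": "low",  "credit": "excellent", "class": "safe"},
    {"income": "low",  "credit": "fair",      "class": "risky"},
]
print(info_gain(rows, "income"), info_gain(rows, "credit"))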
1.Let D be a training set of tuples and their associated class labels. As usual, each tuple is
represented by an n-dimensional attribute vector, X = (x1, x2, …,xn), depicting n
measurements made on the tuple from n attributes, respectively, A1, A2, …, An.
2. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple, X, the classifier will
predict that X belongs to the class having the highest posterior probability, conditioned on X.
That is, the naive Bayesian classifier predicts that tuple X belongs to the class Ci if and only if
P(Ci|X) > P(Cj|X) for 1 <= j <= m, j != i.
Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the
maximum posteriori hypothesis. By Bayes' theorem,
P(Ci|X) = P(X|Ci) P(Ci) / P(X).
3.As P(X) is constant for all classes, only P(X|Ci)P(Ci) need be maximized. If the class prior
probabilities are not known, then it is commonly assumed that the classes are equally likely,
that is, P(C1) = P(C2) = …= P(Cm), and we would therefore maximize P(X|Ci).
Otherwise, we maximize P(X|Ci)P(Ci).
4. Given data sets with many attributes, it would be extremely computationally expensive to
compute P(X|Ci). In order to reduce computation in evaluating P(X|Ci), the naive assumption
of class-conditional independence is made. This presumes that the values of the attributes
are conditionally independent of one another, given the class label of the tuple. Thus,
P(X|Ci) = P(x1|Ci) × P(x2|Ci) × … × P(xn|Ci).
5. In order to predict the class label of X, P(X|Ci)P(Ci) is evaluated for each class Ci.
The classifier predicts that the class label of tuple X is the class Ci if and only if
P(X|Ci)P(Ci) > P(X|Cj)P(Cj) for 1 <= j <= m, j != i.
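The steps above can be sketched directly in Python for categorical attributes; the classes and training tuples below are invented, and no smoothing of the probability estimates is applied.

from collections import Counter, defaultdict

class NaiveBayes:
    """Categorical naive Bayes sketch: P(Ci|X) is proportional to
    P(Ci) * product over attributes of P(xk|Ci), assuming class-conditional
    independence of the attributes."""
    def fit(self, X, y):
        self.priors = {c: n / len(y) for c, n in Counter(y).items()}   # P(Ci)
        self.counts = defaultdict(Counter)   # (class, attribute index) -> value counts
        for xs, c in zip(X, y):
            for k, v in enumerate(xs):
                self.counts[(c, k)][v] += 1
        return self

    def predict(self, xs):
        def score(c):
            p = self.priors[c]
            for k, v in enumerate(xs):
                counts = self.counts[(c, k)]
                p *= counts[v] / sum(counts.values())   # P(xk | Ci), no smoothing
            return p
        return max(self.priors, key=score)

X = [("youth", "high"), ("youth", "low"), ("senior", "low"), ("senior", "high")]
y = ["no", "yes", "yes", "no"]
print(NaiveBayes().fit(X, y).predict(("youth", "low")))   # -> 'yes'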
Example:
Process:
Initialize the weights:
The weights (and biases) in the network are initialized to small random numbers.
Propagate the inputs forward:
The net input Ij to a unit j in a hidden or output layer is computed as Ij = Σi wij Oi + θj,
where wij is the weight of the connection from unit i in the previous layer to unit j;
Oi is the output of unit i from the previous layer; and
θj is the bias of the unit, which acts as a threshold in that it serves to vary the activity of the unit.
Each unit in the hidden and output layers takes its net input and then applies an activation
function to it, typically the sigmoid (logistic) function Oj = 1 / (1 + e^(−Ij)).
The error is propagated backward by updating the weights and biases to reflect the error of the
network's prediction. For a unit j in the output layer, the error Errj is computed by
Errj = Oj (1 − Oj) (Tj − Oj),
where Tj is the known target value of the given training tuple. For a unit j in a hidden layer, the error is
Errj = Oj (1 − Oj) Σk Errk wjk,
where wjk is the weight of the connection from unit j to a unit k in the next higher layer,
and Errk is the error of unit k.
Weights are updated by the following equations, where Δwij is the change in weight wij:
Δwij = (l) Errj Oi,   wij = wij + Δwij,
where l is the learning rate.
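A minimal NumPy sketch of forward propagation and backpropagation using the formulas above, with a sigmoid activation, one hidden layer, and an invented XOR-style training set; the layer sizes, learning rate, and number of epochs are arbitrary choices.

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)          # target values

W1, b1 = rng.uniform(-0.5, 0.5, (2, 3)), np.zeros(3)     # input -> hidden
W2, b2 = rng.uniform(-0.5, 0.5, (3, 1)), np.zeros(1)     # hidden -> output
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5                                                 # learning rate l

for _ in range(20000):
    # Propagate the inputs forward
    O1 = sigmoid(X @ W1 + b1)
    O2 = sigmoid(O1 @ W2 + b2)
    # Backpropagate the error: Err_j = O_j(1-O_j)(T_j-O_j) at the output layer,
    # Err_j = O_j(1-O_j) * sum_k Err_k w_jk at the hidden layer
    err2 = O2 * (1 - O2) * (T - O2)
    err1 = O1 * (1 - O1) * (err2 @ W2.T)
    # Update weights and biases: delta w_ij = l * Err_j * O_i
    W2 += lr * O1.T @ err2;  b2 += lr * err2.sum(axis=0)
    W1 += lr * X.T @ err1;   b1 += lr * err1.sum(axis=0)

print(np.round(O2.ravel(), 2))   # usually converges toward [0, 1, 1, 0]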
The Euclidean distance between two tuples, X1 = (x11, x12, …, x1n) and X2 = (x21, x22, …, x2n), is
dist(X1, X2) = sqrt( Σi (x1i − x2i)^2 ).
In other words, for each numeric attribute, we take the difference between the corresponding
values of that attribute in tuple X1 and in tuple X2, square this difference, and accumulate it.
The square root is taken of the total accumulated distance count.
Min-max normalization can be used to transform a value v of a numeric attribute A to v' in
the range [0, 1] by computing
v' = (v − minA) / (maxA − minA),
where minA and maxA are the minimum and maximum values of attribute A.
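A short Python sketch of min-max normalization followed by a Euclidean nearest-neighbor lookup; the tuples and attribute ranges are invented for illustration.

import math

def min_max(v, lo, hi):
    """Min-max normalization of value v of an attribute with range [lo, hi] to [0, 1]."""
    return (v - lo) / (hi - lo)

def euclidean(x1, x2):
    """Square each attribute difference, accumulate, then take the square root."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))

# Invented tuples: (income in dollars, age in years), normalized before comparison
raw = [(30_000, 25), (90_000, 60), (35_000, 30)]
ranges = [(20_000, 100_000), (18, 80)]
norm = [tuple(min_max(v, lo, hi) for v, (lo, hi) in zip(t, ranges)) for t in raw]

query = norm[2]
neighbors = sorted(norm[:2], key=lambda t: euclidean(query, t))
print(neighbors[0])   # the normalized tuple nearest to the query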
Fuzzy set theory allows us to deal with the degree of membership that a certain value has in a
given category. Each category then represents a fuzzy set.
Fuzzy logic systems typically provide graphical tools to assist users in converting attribute
values to fuzzy truth values.
Fuzzy set theory is also known as possibility theory.
Example:
The regression coefficients can be estimated by the method of least squares:
w1 = Σi (xi − x̄)(yi − ȳ) / Σi (xi − x̄)^2,   w0 = ȳ − w1 x̄,
where x̄ is the mean value of x1, x2, …, x|D|, and ȳ is the mean value of y1, y2, …, y|D|. The
coefficients w0 and w1 often provide good approximations to otherwise complicated
regression equations.
Polynomial regression can be handled by a transformation to linear regression. For example,
for a cubic polynomial we can define new variables
x1 = x,   x2 = x^2,   x3 = x^3.
The model can then be converted to linear form by applying the above assignments, resulting in
the equation y = w0 + w1 x1 + w2 x2 + w3 x3, which is solvable by the method of least squares.
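The least-squares formulas above can be applied directly, as in this Python sketch on a small illustrative data set of (years of experience, salary in $1000s):

# Straight-line regression y = w0 + w1 * x fitted by least squares
xs = [3, 8, 9, 13, 3, 6, 11, 21, 1, 16]
ys = [30, 57, 64, 72, 36, 43, 59, 90, 20, 83]

x_bar, y_bar = sum(xs) / len(xs), sum(ys) / len(ys)
w1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
w0 = y_bar - w1 * x_bar
print(round(w0, 2), round(w1, 2))     # intercept (about 23.2) and slope (about 3.5)
print(round(w0 + w1 * 10, 1))         # predicted salary for 10 years of experience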
4.1.1 Applications:
Cluster analysis has been widely used in numerous applications, including market research,
pattern recognition, data analysis, and image processing.
In business, clustering can help marketers discover distinct groups in their customer bases
and characterize customer groups based on purchasing patterns.
In biology, it can be used to derive plant and animal taxonomies, categorize genes with
similar functionality, and gain insight into structures inherent in populations.
Clustering may also help in the identification of areas of similar land use in an earth
observation database and in the identification of groups of houses in a city according to
house type, value, and geographic location, as well as the identification of groups of
automobile insurance policy holders with a high average claim cost.
Clustering is also called data segmentation in some applications because clustering
partitions large data sets into groups according to their similarity.
Hierarchical Methods
Density-Based Methods
Grid-Based Methods
Model-Based Methods
The general criterion of a good partitioning is that objects in the same cluster are close or
related to each other, whereas objects of different clusters are far apart or very different.
The agglomerative approach, also called the bottom-up approach, starts with each
object forming a separate group. It successively merges the objects or groups that are
close to one another, until all of the groups are merged into one or until a termination
condition holds.
The divisive approach, also called the top-down approach, starts with all of the objects
in the same cluster. In each successive iteration, a cluster is split into smaller clusters,
until eventually each object is in one cluster, or until a termination condition holds.
Hierarchical methods suffer from the fact that once a step (merge or split) is done, it can never
be undone. This rigidity is useful in that it leads to smaller computation costs by not having
to worry about a combinatorial number of different choices.
The k-means method uses the square-error criterion
E = Σ (i = 1..k) Σ (p ∈ Ci) |p − mi|^2,
where E is the sum of the square error for all objects in the data set,
p is the point in space representing a given object, and mi is the mean of
cluster Ci.
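A plain NumPy sketch of the k-means procedure that this criterion leads to: assign each object to its nearest mean, recompute the means, and iterate. The sample points and the number of iterations are illustrative choices.

import numpy as np

def k_means(points, k, iters=20, seed=0):
    """Assign each point to its nearest cluster mean, then recompute the means,
    iterating so as to reduce the square-error criterion E."""
    rng = np.random.default_rng(seed)
    means = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # distance of every point to every current mean
        d = np.linalg.norm(points[:, None, :] - means[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        means = np.array([points[labels == j].mean(axis=0) if np.any(labels == j)
                          else means[j] for j in range(k)])
    return labels, means

pts = np.array([[1.0, 1.0], [1.5, 2.0], [8.0, 8.0], [9.0, 8.5], [1.2, 0.8], [8.5, 9.0]])
labels, means = k_means(pts, k=2)
print(labels, means, sep="\n")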
The k-medoids partitioning method is then performed based on the principle of minimizing the sum
of the dissimilarities between each object and its corresponding reference point. That is, an
absolute-error criterion is used:
E = Σ (j = 1..k) Σ (p ∈ Cj) |p − oj|,
where E is the sum of the absolute error for all objects in the data set,
p is the point in space representing a given object in cluster Cj, and oj is the
representative object of Cj.
The initial representative objects are chosen arbitrarily. The iterative process of replacing
representative objects by non-representative objects continues as long as the quality of the
resulting clustering is improved.
This quality is estimated using a cost function that measures the average
dissimilarity between an object and the representative object of its cluster.
To determine whether a non-representative object, o_random, is a good replacement for a
current representative object, oj, the following four cases are examined for each of the
non-representative objects p.
Case 1:
p currently belongs to representative object oj. If oj is replaced by o_random as a representative
object and p is closest to one of the other representative objects, oi, i != j, then p is reassigned to oi.
Case 2:
p currently belongs to representative object oj. If oj is replaced by o_random as a representative
object and p is closest to o_random, then p is reassigned to o_random.
Case 3:
p currently belongs to representative object oi, i != j. If oj is replaced by o_random as a
representative object and p is closest to o_random, then p is reassigned to o_random.
Case 4:
p currently belongs to representative object oi, i != j. If oj is replaced by o_random as a
representative object and p is still closest to oi, then the assignment does not change.
4.4.2 The k-Medoids Algorithm:
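The algorithm can be sketched as the swap-based (PAM-style) procedure implied by the four cases above: repeatedly try to replace a representative object by a non-representative one and keep the swap whenever it lowers the total absolute error. The following Python sketch is an illustration under that reading, not a verbatim reproduction of the algorithm as stated in the source; the sample points are invented.

import numpy as np

def k_medoids(points, k, seed=0):
    """Start from arbitrary representative objects and keep swapping a medoid
    with a non-medoid object whenever the swap lowers the absolute error E."""
    rng = np.random.default_rng(seed)
    n = len(points)
    d = np.linalg.norm(points[:, None] - points[None, :], axis=2)  # pairwise distances
    medoids = list(rng.choice(n, k, replace=False))

    def cost(meds):
        return d[:, meds].min(axis=1).sum()   # sum of distances to the closest medoid

    improved = True
    while improved:
        improved = False
        for m in list(medoids):
            for o in range(n):
                if o in medoids:
                    continue
                candidate = [o if x == m else x for x in medoids]
                if cost(candidate) < cost(medoids):
                    medoids, improved = candidate, True
    return medoids, d[:, medoids].argmin(axis=1)

pts = np.array([[1.0, 1.0], [1.5, 2.0], [1.2, 0.8], [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
print(k_medoids(pts, k=2))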
We can specify constraints on the objects to be clustered. In a real estate application, for
example, one may like to spatially cluster only those luxury mansions worth over a million
dollars. This constraint confines the set of objects to be clustered. It can easily be handled
by preprocessing, after which the problem reduces to an instance of unconstrained
clustering.
A user may like to set a desired range for each clustering parameter. Clustering parameters
are usually quite specific to the given clustering algorithm. Examples of parameters include
k, the desired number of clusters in a k-means algorithm, or ε, the radius, and the minimum
number of points in the DBSCAN algorithm. Although such user-specified parameters may
strongly influence the clustering results, they are usually confined to the algorithm itself.
Thus, their fine tuning and processing are usually not considered a form of constraint-based
clustering.
Constraints on distance or similarity functions:
We can specify different distance or similarity functions for specific attributes of the objects
to be clustered, or different distance measures for specific pairs of objects. When clustering
sportsmen, for example, we may use different weighting schemes for height, body weight,
age, and skill level. Although this will likely change the mining results, it may not alter the
clustering process per se. However, in some cases, such changes may make the evaluation of
the distance function nontrivial, especially when it is tightly intertwined with the clustering
process.
Outlier mining can be described as follows: Given a set of n data points or objects and k, the
expected number of outliers, find the top k objects that are considerably dissimilar,
exceptional, or inconsistent with respect to the remaining data. The outlier-mining problem
can be viewed as two subproblems:
Define what data can be considered as inconsistent in a given data set, and
Find an efficient method to mine the outliers so defined.
In this method, the data space is partitioned into cells with a side length equal to
dmin/(2·sqrt(k)), where k is the dimensionality of the data. Each cell has two layers surrounding it.
The first layer is one cell thick, while the second is ⌈2·sqrt(k) − 1⌉ cells thick.
Let M be the maximum number of outliers that can exist in the dmin-neighborhood of an
outlier.
An object, o, in the current cell is considered an outlier only if the cell + 1 layer count is less
than or equal to M. If this condition does not hold, then all of the objects in the cell can be
removed from further investigation as they cannot be outliers.
If the cell + 2 layers count is less than or equal to M, then all of the objects in the cell are
considered outliers. Otherwise, if this number is more than M, then it is possible that some
of the objects in the cell may be outliers. To detect these outliers, object-by-object
processing is used where, for each object, o, in the cell, objects in the second layer of o
are examined. For objects in the cell, only those objects having no more than M points in
their dmin-neighborhoods are outliers. The dmin-neighborhood of an object consists of
the object's cell, all of its first layer, and some of its second layer.
A variation of the algorithm is linear with respect to n and guarantees that no more than three
passes over the data set are required. It can be used for large disk-resident data sets, yet does
not scale well for high dimensions.
That is, there are at most k − 1 objects that are closer to p than o. You may be wondering at this
point how k is determined. The LOF method links to density-based clustering in that it sets k
to the parameter MinPts, which specifies the minimum number of points for use in identifying
clusters based on density.
Here, MinPts (as k) is used to define the local neighborhood of an object, p.
The k-distance neighborhood of an object p is denoted N_k-distance(p)(p), or Nk(p) for short. By
setting k to MinPts, we get N_MinPts(p). It contains the MinPts-nearest neighbors of p. That is, it
contains every object whose distance is not greater than the MinPts-distance of p.
The reachability distance of an object p with respect to object o (where o is within the
MinPts-nearest neighbors of p) is defined as
reach_dist_MinPts(p, o) = max{MinPts-distance(o), d(p, o)}.
Intuitively, if an object p is far away from o, then the reachability distance between the two is simply
their actual distance. However, if they are sufficiently close (i.e., where p is within the
MinPts-distance neighborhood of o), then the actual distance is replaced by the
MinPts-distance of o. This helps to significantly reduce the statistical fluctuations of d(p, o)
for all of the p close to o.
The higher the value of MinPts is, the more similar is the reachability distance for objects
within the same neighborhood.
Intuitively, the local reachability density of p is the inverse of the average reachability
distance based on the MinPts-nearest neighbors of p. It is defined as
lrd_MinPts(p) = |N_MinPts(p)| / Σ (o ∈ N_MinPts(p)) reach_dist_MinPts(p, o).
The local outlier factor (LOF) of p captures the degree to which we call p an outlier.
It is defined as the average, over the MinPts-nearest neighbors of p, of the ratio of their
local reachability density to that of p:
LOF_MinPts(p) = ( Σ (o ∈ N_MinPts(p)) lrd_MinPts(o) / lrd_MinPts(p) ) / |N_MinPts(p)|.
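In practice, LOF is usually computed with a library routine; the following sketch uses scikit-learn's LocalOutlierFactor on invented two-dimensional data, with n_neighbors playing the role of MinPts.

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Invented 2-D data: a dense cluster plus one point far from its neighbors
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, size=(30, 2)), [[6.0, 6.0]]])

lof = LocalOutlierFactor(n_neighbors=10)    # n_neighbors ~ MinPts
labels = lof.fit_predict(X)                 # -1 marks outliers, 1 marks inliers
scores = -lof.negative_outlier_factor_      # LOF values; larger means more outlying
print(labels[-1], round(scores[-1], 2))     # the far point gets label -1 and a high LOF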
In the deviation-based sequential exception technique, dissimilarities are assessed between subsets
in a sequence of sets of objects. The technique introduces the following key terms.
Exception set:
This is the set of deviations or outliers. It is defined as the smallest subset of objects whose
removal results in the greatest reduction of dissimilarity in the residual set.
Dissimilarity function:
This function does not require a metric distance between the objects. It is any function that, if
given a set of objects, returns a low value if the objects are similar to one another; the greater
the dissimilarity among the objects, the higher the value returned. For a set of numbers, a
common choice of dissimilarity function is the variance,
(1/n) Σ (i = 1..n) (xi − x̄)^2,
where x̄ is the mean of the n numbers in the set. For character strings, the dissimilarity function
may be in the form of a pattern string (e.g., containing wildcard characters) that is used to cover
all of the patterns seen so far. The dissimilarity increases when the pattern covering all of the
strings in Dj−1 does not cover any string in Dj that is not in Dj−1.
Cardinality function:
This is typically the count of the number of objects in a given set.
Smoothing factor:
This function is computed for each subset in the sequence. It assesses how much the
dissimilarity can be reduced by removing the subset from the original set of objects.
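A toy Python sketch of the smoothing-factor idea, assuming variance as the dissimilarity function and a hand-picked list of candidate exception sets; the data values are invented.

def dissimilarity(values):
    """Variance of a set of numbers, used here as the dissimilarity function."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def smoothing_factor(values, subset):
    """Cardinality of the subset times the reduction in dissimilarity obtained
    by removing it from the original set of objects."""
    rest = [v for v in values if v not in subset]
    return len(subset) * (dissimilarity(values) - dissimilarity(rest))

data = [12, 11, 13, 14, 95, 10]          # 95 is the obvious deviation
candidates = [[95], [12], [10, 14]]      # hand-picked candidate exception sets
best = max(candidates, key=lambda s: smoothing_factor(data, s))
print(best)                              # -> [95]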