Data Mining Tutorials
Data Mining Applications
Data mining is highly useful in the following domains −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Fraud Detection
Data mining is also used in the fields of credit card services and
telecommunication to detect fraud. For fraudulent telephone calls, it helps to find
the destination of the call, the duration of the call, the time of day or week, etc.
It also analyzes patterns that deviate from expected norms.
Data mining deals with the kinds of patterns that can be mined. On the
basis of the kind of patterns to be mined, there are two categories of functions
involved in data mining −
Descriptive
Classification and Prediction
Descriptive Function
The descriptive function deals with the general properties of data in the
database. Here is the list of descriptive functions −
Class/Concept Description
Mining of Frequent Patterns
Mining of Associations
Mining of Correlations
Mining of Clusters
Class/Concept Description
Class/Concept refers to the data to be associated with the classes or
concepts. For example, in a company, the classes of items for sales include
computer and printers, and concepts of customers include big spenders and
budget spenders. Such descriptions of a class or a concept are called
class/concept descriptions. These descriptions can be derived by the
following two ways −
Data Characterization − This refers to summarizing data of class
under study. This class under study is called as Target Class.
Data Discrimination − It refers to the mapping or classification of a
class with some predefined group or class.
Mining of Associations
Associations are used in retail sales to identify items that are
frequently purchased together. Association mining refers to the process of
uncovering relationships among data and determining association rules.
For example, a retailer may generate an association rule showing that 70%
of the time milk is sold with bread and only 30% of the time biscuits are sold
with bread.
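The support and confidence behind such a rule can be computed directly. Below is
a minimal Python sketch on a made-up transaction list; the data and the rule are
purely illustrative −
# Minimal sketch: support and confidence of an association rule on toy transactions.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "eggs"},
    {"bread", "biscuits"},
    {"milk", "eggs"},
]

def support(itemset):
    # fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    # of the transactions containing the antecedent, the fraction that also contain the consequent
    return support(antecedent | consequent) / support(antecedent)

print(support({"bread", "milk"}))        # 0.5
print(confidence({"bread"}, {"milk"}))   # about 0.67: milk is sold with bread about 67% of the time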
Mining of Correlations
This is a kind of additional analysis performed to uncover interesting statistical
correlations between associated attribute-value pairs or between two item
sets, in order to analyze whether they have a positive, negative, or no effect on
each other.
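A commonly used correlation measure for two itemsets A and B is lift −
lift(A, B) = P(A ∪ B) / ( P(A) × P(B) )
where P(A ∪ B) denotes the fraction of transactions containing both itemsets. A
lift greater than 1 indicates a positive correlation, a lift less than 1 a negative
correlation, and a lift of 1 indicates independence.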
Mining of Clusters
Cluster refers to a group of similar objects. Cluster analysis refers to
forming groups of objects that are very similar to each other but are highly
different from the objects in other clusters.
We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner
with the data mining system. Here is the list of data mining task primitives −
Set of task-relevant data to be mined −
Database attributes
Data warehouse dimensions of interest
Kind of knowledge to be mined
It refers to the kind of functions to be performed. These functions are −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Clustering
Outlier Analysis
Evolution Analysis
Background knowledge
The background knowledge allows data to be mined at multiple levels
of abstraction. For example, the Concept hierarchies are one of the
background knowledge that allows data to be mined at multiple levels of
abstraction.
Presentation and visualization of discovered patterns
This refers to the form in which discovered patterns are to be displayed −
Rules
Tables
Charts
Graphs
Decision Trees
Cubes
Data mining is not an easy task, as the algorithms used can get very
complex, and data is not always available in one place. It needs to be
integrated from various heterogeneous data sources. These factors also
create some issues. Here in this tutorial, we will discuss the major issues
regarding data mining −
Performance Issues
There can be performance-related issues such as follows −
Efficiency and scalability of data mining algorithms − In order to
effectively extract the information from huge amount of data in
databases, data mining algorithm must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors
such as the huge size of databases, the wide distribution of data, and
the complexity of data mining methods motivate the development of
parallel and distributed data mining algorithms. These algorithms
divide the data into partitions, which are then processed in parallel.
The results from the partitions are then merged. Incremental
algorithms update databases without mining the data
again from scratch.
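As a rough illustration of the partition-and-merge idea (not any particular
published algorithm), the Python sketch below counts item frequencies on
partitions of made-up data in parallel processes and then merges the partial
counts −
# Rough sketch: mine partitions in parallel, then merge the partial results.
from collections import Counter
from multiprocessing import Pool

def count_items(partition):
    # mine one partition independently (here: simple item-frequency counting)
    counts = Counter()
    for transaction in partition:
        counts.update(transaction)
    return counts

if __name__ == "__main__":
    data = [["milk", "bread"], ["bread"], ["milk", "eggs"], ["bread", "milk"]]
    partitions = [data[:2], data[2:]]                    # divide the data into partitions
    with Pool(processes=2) as pool:
        partial = pool.map(count_items, partitions)      # process the partitions in parallel
    merged = sum(partial, Counter())                     # merge the results from the partitions
    print(merged)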
Data Warehouse
A data warehouse exhibits the following characteristics to support the
management's decision-making process −
Subject Oriented − A data warehouse is subject oriented because it
provides information around a subject rather than the
organization's ongoing operations. These subjects can be products,
customers, suppliers, sales, revenue, etc. The data warehouse does
not focus on ongoing operations; rather, it focuses on modelling
and analysis of data for decision-making.
Integrated − A data warehouse is constructed by integrating data
from heterogeneous sources such as relational databases, flat files,
etc. This integration enhances the effective analysis of data.
Time Variant − The data collected in a data warehouse is identified
with a particular time period. The data in a data warehouse provides
information from a historical point of view.
Non-volatile − Non-volatile means that previous data is not removed
when new data is added. The data warehouse is kept separate
from the operational database; therefore frequent changes in the
operational database are not reflected in the data warehouse.
Data Warehousing
Data warehousing is the process of constructing and using the data
warehouse. A data warehouse is constructed by integrating the data from
multiple heterogeneous sources. It supports analytical reporting, structured
and/or ad hoc queries, and decision making.
Data warehousing involves data cleaning, data integration, and data
consolidation. To integrate heterogeneous databases, we have the
following two approaches −
Query-Driven Approach
This is the traditional approach to integrating heterogeneous databases. Queries
are translated and dispatched to the individual heterogeneous sources at query
time, and the results are integrated into a global answer set.
Disadvantages
This approach has the following disadvantages −
The Query Driven Approach needs complex integration and filtering
processes.
It is very inefficient and very expensive for frequent queries.
This approach is expensive for queries that require aggregations.
Update-Driven Approach
Today's data warehouse systems follow update-driven approach rather than
the traditional approach discussed earlier. In the update-driven approach,
the information from multiple heterogeneous sources is integrated in
advance and stored in a warehouse. This information is available for direct
querying and analysis.
Advantages
This approach has the following advantages −
This approach provides high performance.
The data can be copied, processed, integrated, annotated,
summarized and restructured in the semantic data store in advance.
Query processing does not require interfacing with the processing at local
sources.
From Data Warehousing (OLAP) to Data Mining (OLAM)
Online Analytical Mining (OLAM) integrates Online Analytical Processing (OLAP)
with data mining and mines knowledge in multidimensional databases.
Importance of OLAM
OLAM is important for the following reasons −
High quality of data in data warehouses − The data mining tools
are required to work on integrated, consistent, and cleaned data.
These steps are very costly in the preprocessing of data. The data
warehouses constructed by such preprocessing are valuable sources
of high quality data for OLAP and data mining as well.
Available information processing infrastructure surrounding
data warehouses − Information processing infrastructure refers to
accessing, integration, consolidation, and transformation of multiple
heterogeneous databases, web-accessing and service facilities,
reporting and OLAP analysis tools.
OLAP-based exploratory data analysis − Exploratory data
analysis is required for effective data mining. OLAM provides the facility
for data mining on various subsets of data and at different levels of
abstraction.
Online selection of data mining functions − Integrating OLAP with
multiple data mining functions and online analytical mining provides
users with the flexibility to select desired data mining functions and
swap data mining tasks dynamically.
Data Mining
Data mining is defined as extracting information from a huge set
of data. In other words, we can say that data mining is mining
knowledge from data. This information can be used for any of the following
applications −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Data Mining Engine
Data mining engine is very essential to the data mining system. It
consists of a set of functional modules that perform the following functions
−
Characterization
Association and Correlation Analysis
Classification
Prediction
Cluster analysis
Outlier analysis
Evolution analysis
Knowledge Base
This is the domain knowledge. This knowledge is used to guide the
search or evaluate the interestingness of the resulting patterns.
Knowledge Discovery
Some people treat data mining same as knowledge discovery, while
others view data mining as an essential step in the process of knowledge
discovery. Here is the list of steps involved in the knowledge discovery
process −
Data Cleaning
Data Integration
Data Selection
Data Transformation
Data Mining
Pattern Evaluation
Knowledge Presentation
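A toy Python sketch of these steps on a made-up sales table is shown below; the
column names and thresholds are hypothetical, and real pipelines are considerably
more involved −
# Toy sketch of the knowledge discovery steps on a made-up sales table.
import pandas as pd

raw = pd.DataFrame({
    "customer": ["a", "a", "b", None, "c"],
    "item":     ["milk", "bread", "milk", "milk", "bread"],
    "amount":   [2.5, 3.0, 2.5, 1.0, -1.0],
})

clean = raw.dropna(subset=["customer"])         # data cleaning: drop noisy rows
clean = clean[clean["amount"] > 0]              # data cleaning: remove inconsistent amounts
selected = clean[["customer", "item"]]          # data selection: task-relevant columns only
transformed = selected.groupby("item").size()   # data transformation: aggregate to item counts
patterns = transformed[transformed >= 2]        # data mining: keep frequently bought items
print(patterns)                                 # pattern evaluation and knowledge presentation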
User interface
The user interface is the module of the data mining system that enables
communication between users and the data mining system. It allows users
to specify data mining tasks, provide information to help focus the search,
and browse the mining results.
Data Cleaning
Data cleaning is a technique that is applied to remove the noisy data
and correct the inconsistencies in data. Data cleaning involves
transformations to correct the wrong data. Data cleaning is performed as a
data preprocessing step while preparing the data for a data warehouse.
Data Selection
Data Selection is the process where data relevant to the analysis task
are retrieved from the database. Sometimes data transformation and
consolidation are performed before the data selection process.
Data Transformation
In this step, data is transformed or consolidated into forms
appropriate for mining, by performing summary or aggregation operations.
A data mining system can be classified according to the disciplines involved,
such as −
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Apart from these, a data mining system can also be classified based
on the kind of (a) databases mined, (b) knowledge mined, (c) techniques
utilized, and (d) applications adapted.
Classification Based on the Kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge
mined, that is, on the basis of functionalities such as −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Outlier Analysis
Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of
techniques used. We can describe these techniques according to the degree
of user interaction involved or the methods of analysis employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted,
such as −
Finance
Telecommunications
DNA
Stock Markets
E-mail
Integrating a Data Mining System with a DB/DW System
If a data mining system is not integrated with a database or a data
warehouse system, then there will be no system to communicate with. This
scheme is known as the non-coupling scheme. In this scheme, the main
focus is on data mining design and on developing efficient and effective
algorithms for mining the available data sets.
Loose Coupling − In this scheme, the data mining system may use
some of the functions of the database and data warehouse system. It
fetches the data from the data repository managed by these systems
and performs data mining on that data. It then stores the mining
results either in a file or in a designated place in a database or
data warehouse.
Characterization
The syntax for characterization is −
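A DMQL-style sketch of this statement, recalled from Han and Kamber's Data
Mining Query Language and given only as an approximation −
mine characteristics [as pattern_name]
analyze {measure(s)}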
Discrimination
The syntax for Discrimination is −
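A DMQL-style sketch, again recalled from Han and Kamber's language and meant
only as an approximation −
mine comparison [as pattern_name]
for target_class where target_condition
versus contrast_class_1 where contrast_condition_1
analyze {measure(s)}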
Classification
The syntax for Classification is −
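A DMQL-style sketch of this statement, recalled from Han and Kamber's language
and meant only as an approximation −
mine classification [as pattern_name]
analyze classifying_attribute_or_dimension
For example, a classifier for customers' credit ratings could be mined with an
analyze clause such as −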
analyze credit_rating
Prediction
The syntax for prediction is −
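A DMQL-style sketch, recalled from Han and Kamber's language and meant only as
an approximation −
mine prediction [as pattern_name]
analyze prediction_attribute_or_dimension
{set {attribute_or_dimension_i = value_i}}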
Schema hierarchies −
define hierarchy time_hierarchy on date as [date, month, quarter, year]
Set-grouping hierarchies −
define hierarchy age_hierarchy for age on customer as
level1: {young, middle_aged, senior} < level0: all
level2: {20, ..., 39} < level1: young
level2: {40, ..., 59} < level1: middle_aged
level2: {60, ..., 89} < level1: senior
Operation-derived hierarchies −
define hierarchy age_hierarchy for age on customer as
{age_category(1), ..., age_category(5)}
:= cluster(default, age, 5) < all(age)
Rule-based hierarchies −
define hierarchy profit_margin_hierarchy on item as
level_1: low_profit_margin < level_0: all
Presentation of Discovered Patterns −
display as <result_form>
For example −
display as table
Classification and Prediction
Classification models predict categorical class labels, while prediction
models predict continuous-valued functions. For example, we can build a
classification model to categorize bank loan applications as either safe or
risky, or a prediction model to predict the expenditures in dollars of
potential customers on computer equipment, given their income and
occupation.
What is classification?
Following are examples of cases where the data analysis task is
classification −
A bank loan officer wants to analyze the data in order to know which
customers (loan applicants) are risky and which are safe.
A marketing manager at a company needs to analyze whether a customer with
a given profile will buy a new computer.
In both of the above examples, a model or classifier is constructed to
predict the categorical labels. These labels are risky or safe for the loan
application data and yes or no for the marketing data.
What is prediction?
Following is an example of a case where the data analysis task is
prediction −
Suppose the marketing manager needs to predict how much a given
customer will spend during a sale at his company. In this example we are
required to predict a numeric value. Therefore the data analysis task is an
example of numeric prediction. In this case, a model or a predictor is
constructed that predicts a continuous-valued function or ordered value.
Note − Regression analysis is a statistical methodology that is most often
used for numeric prediction.
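A minimal scikit-learn sketch of the two tasks on made-up data is shown below;
the features (income and an occupation code) and labels are purely illustrative −
# Minimal sketch: a classifier for safe/risky loans and a numeric predictor for spending.
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LinearRegression

X = [[30000, 0], [80000, 1], [20000, 0], [95000, 1]]   # [income, occupation code]
risk = ["risky", "safe", "risky", "safe"]               # categorical class labels
spend = [200.0, 950.0, 150.0, 1200.0]                   # continuous-valued target (dollars)

classifier = DecisionTreeClassifier().fit(X, risk)      # classification
predictor = LinearRegression().fit(X, spend)            # numeric prediction (regression)

print(classifier.predict([[50000, 1]]))                 # a class label such as 'safe' or 'risky'
print(predictor.predict([[50000, 1]]))                  # an estimated dollar amount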
Clustering, by contrast, is applied to a set of measurements or observations
with the aim of establishing the existence of classes or clusters in the data,
without predefined class labels.
Classification − classifies data (constructs a model) based on a training set
and the values (class labels) of a classifying attribute, and uses the model to
classify new data.
Numeric Prediction − models continuous-valued functions, i.e., predicts
unknown or missing values.
Typical applications include credit/loan approval.
Classification is a two-step process: first a model is constructed from the
training set (learning), and then the model is used for prediction on new data.
To evaluate the model, the known label of each test sample is compared with
the result produced by the model; the accuracy rate is the percentage of
test-set samples that are correctly classified by the model.
Note − If the test set is used to select among models, it is called a
validation (test) set.
Decision Tree Induction
The following algorithm generates a decision tree from the training tuples of a
data partition D −
Input:
Data partition D, which is a set of training tuples
and their associated class labels;
attribute_list, the set of candidate attributes;
Attribute_selection_method, a procedure to determine the
splitting criterion that best partitions the data
tuples into individual classes. This criterion includes a
splitting_attribute and either a splitting point or a splitting subset.
Output:
A Decision Tree
Method:
create a node N;
if the tuples in D are all of the same class C then
   return N as a leaf node labeled with class C;
if attribute_list is empty then
   return N as a leaf node labeled with the majority class in D;
apply Attribute_selection_method(D, attribute_list) to find
   the best splitting_criterion and label node N with it;
for each outcome j of the splitting_criterion
   let Dj be the set of tuples in D satisfying outcome j;
   if Dj is empty then
      attach a leaf labeled with the majority
      class in D to node N;
   else
      attach the node returned by Generate_
      decision_tree(Dj, attribute_list) to node N;
end for
return N;
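The same induction idea is available in libraries; a brief scikit-learn sketch on a
made-up training partition (the attribute names are hypothetical) −
# Brief sketch: induce a decision tree from a small made-up training partition.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[25, 0], [45, 1], [35, 1], [50, 0], [23, 0], [40, 1]]   # [age, owns_home]
y = ["no", "yes", "yes", "yes", "no", "yes"]                  # buys_computer

tree = DecisionTreeClassifier(criterion="entropy", max_depth=2).fit(X, y)
print(export_text(tree, feature_names=["age", "owns_home"]))  # print the induced tree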
Tree Pruning
Tree pruning is performed in order to remove anomalies in the
training data due to noise or outliers. The pruned trees are smaller and less
complex.
Tree Pruning Approaches
There are two approaches to prune a tree −
Pre-pruning − The tree is pruned by halting its construction early.
Post-pruning − This approach removes a sub-tree from a fully grown
tree.
Cost Complexity
The cost complexity is measured by the following two parameters −
Number of leaves in the tree, and
Error rate of the tree.
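Cost-complexity post-pruning is available in scikit-learn; a brief sketch using the
iris dataset bundled with the library −
# Brief sketch of cost-complexity (post-)pruning with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
# larger ccp_alpha values prune more aggressively, giving smaller and less complex trees
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=path.ccp_alphas[-2]).fit(X, y)
print(pruned.get_n_leaves())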
Bayes' Theorem
Bayes' Theorem is named after Thomas Bayes. There are two types of
probabilities −
Posterior Probability [P(H|X)]
Prior Probability [P(H)]
where X is a data tuple and H is some hypothesis.
According to Bayes' Theorem,
P(H|X) = P(X|H) P(H) / P(X)
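A tiny worked example in Python, with made-up probabilities −
# Tiny worked example of Bayes' Theorem; all probabilities are made up.
p_h = 0.01          # prior probability of the hypothesis H
p_x_given_h = 0.9   # likelihood of observing the data tuple X when H holds
p_x = 0.05          # overall probability of observing X

p_h_given_x = p_x_given_h * p_h / p_x   # posterior probability P(H|X)
print(p_h_given_x)                      # about 0.18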
Rule Extraction
Here we will learn how to build a rule-based classifier by extracting
IF-THEN rules from a decision tree.
Points to remember −
One rule is created for each path from the root to a leaf node.
Each attribute-value pair along a path forms a conjunction in the rule
antecedent (the IF part).
The leaf node holds the class prediction, forming the rule consequent.
For example, a path may yield the rule IF age = youth AND student = yes
THEN buys_computer = yes.
Rule Induction Using Sequential Covering Algorithm
A sequential covering algorithm can be used to extract IF-THEN rules
from the training data. We do not need to generate a decision tree first.
In this algorithm, each rule for a given class covers many of the tuples of
that class.
Some of the sequential covering algorithms are AQ, CN2, and RIPPER.
As per the general strategy, the rules are learned one at a time. Each
time a rule is learned, the tuples covered by the rule are removed, and the
process continues for the remaining tuples. (In decision-tree-based
extraction, by contrast, the path to each leaf corresponds to one rule.)
Input:
D, a data set of class-labeled tuples;
Att_vals, the set of all attributes and their possible values.
Output: A set of IF-THEN rules.
Method:
Rule_set = { };   // initially, the set of learned rules is empty
for each class c do
   repeat
      Rule = Learn_One_Rule(D, Att_vals, c);
      remove tuples covered by Rule from D;
   until termination condition;
   Rule_set = Rule_set + Rule;   // add the new rule to the rule set
end for
return Rule_set;
FOIL is a simple and effective method for rule pruning. For a
given rule R,
FOIL_Prune(R) = (pos - neg) / (pos + neg)
where pos and neg are the numbers of positive and negative tuples covered by R,
respectively.
Note − This value increases with the accuracy of R on the pruning set.
Hence, if the FOIL_Prune value is higher for the pruned version of R, then
we prune R.
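In code, the measure is straightforward; the pos and neg counts below are made
up −
# FOIL_Prune for a rule R: pos and neg are the numbers of positive and negative
# tuples covered by R on the pruning set (values are made up for illustration).
def foil_prune(pos, neg):
    return (pos - neg) / (pos + neg)

print(foil_prune(pos=40, neg=10))   # 0.6 for the original rule
print(foil_prune(pos=38, neg=4))    # about 0.81 for a pruned version, so pruning is preferred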
Here we will discuss other classification methods such as Genetic
Algorithms, Rough Set Approach, and Fuzzy Set Approach.
Genetic Algorithms
The idea of genetic algorithms is derived from natural evolution. In
a genetic algorithm, first of all, an initial population is created. This initial
population consists of randomly generated rules. We can represent each
rule by a string of bits (a toy sketch of the whole procedure follows the
points below).
For example, in a given training set, the samples are described by two
Boolean attributes such as A1 and A2. And this given training set contains
two classes such as C1 and C2.
We can encode the rule IF A1 AND NOT A2 THEN C2 into a bit
string 100. In this bit representation, the two leftmost bits represent the
attribute A1 and A2, respectively.
Likewise, the rule IF NOT A1 AND NOT A2 THEN C1 can be encoded
as 001.
Note − If an attribute has K values where K > 2, then K bits can be used
to encode the attribute values. Classes are also encoded in the same
manner.
Points to remember −
Based on the notion of survival of the fittest, a new population is
formed that consists of the fittest rules in the current population, as well as
offspring of these rules.
The fitness of a rule is assessed by its classification accuracy on a set
of training samples.
The genetic operators such as crossover and mutation are applied to
create offspring.
In crossover, substrings from a pair of rules are swapped to form a
new pair of rules.
In mutation, randomly selected bits in a rule's string are inverted.
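A toy Python sketch of the whole procedure, assuming the two-attribute encoding
described above; the training samples, the behaviour for non-matching samples,
and all parameters are illustrative assumptions −
# Toy genetic-algorithm sketch for evolving 3-bit rules (A1 bit, A2 bit, class bit).
import random

samples = [(1, 0, 1), (1, 0, 1), (0, 0, 0), (0, 1, 0)]   # (A1, A2, class)

def fitness(rule):
    # classification accuracy of the rule on the training samples
    a1_bit, a2_bit, class_bit = rule
    correct = 0
    for a1, a2, label in samples:
        if a1 == a1_bit and a2 == a2_bit:
            predicted = class_bit          # the rule fires
        else:
            predicted = 1 - class_bit      # assumption: otherwise predict the other class
        correct += (predicted == label)
    return correct / len(samples)

def crossover(r1, r2):
    # swap the substrings of a pair of rules after a random cut point
    cut = random.randint(1, 2)
    return r1[:cut] + r2[cut:], r2[:cut] + r1[cut:]

def mutate(rule, p=0.1):
    # invert randomly selected bits in the rule's string
    return tuple(1 - b if random.random() < p else b for b in rule)

population = [tuple(random.randint(0, 1) for _ in range(3)) for _ in range(6)]
for _ in range(20):
    population.sort(key=fitness, reverse=True)
    survivors = population[:3]                       # survival of the fittest
    children = []
    while len(children) < 3:
        c1, c2 = crossover(*random.sample(survivors, 2))
        children += [mutate(c1), mutate(c2)]
    population = survivors + children[:3]

print("best rule (A1 bit, A2 bit, class bit):", max(population, key=fitness))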
What is Clustering?
Clustering is the process of grouping a set of abstract objects into
classes of similar objects.
Points to Remember
A cluster of data objects can be treated as one group.
While doing cluster analysis, we first partition the set of data into
groups based on data similarity and then assign the labels to the
groups.
The main advantage of clustering over classification is that it is
adaptable to changes and helps single out useful features that
distinguish different groups.
Clustering Methods
Clustering methods can be classified into the following categories −
Partitioning Method
Hierarchical Method
Density-based Method
Grid-Based Method
Model-Based Method
Constraint-based Method
Partitioning Method
Suppose we are given a database of ‘n’ objects and the partitioning
method constructs ‘k’ partitions of the data. Each partition will represent a
cluster and k ≤ n. It means that it will classify the data into k groups, which
satisfy the following requirements −
Each group contains at least one object.
Each object must belong to exactly one group.
Points to remember −
For a given number of partitions (say k), the partitioning method
creates an initial partitioning.
Then it uses an iterative relocation technique to improve the
partitioning by moving objects from one group to another (k-means,
sketched below, is a typical example).
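k-means is a typical partitioning method; a short scikit-learn sketch on made-up
two-dimensional points −
# Short sketch of a partitioning method (k-means) on made-up points.
from sklearn.cluster import KMeans

points = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 9.1], [8.8, 9.3]]
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(points)   # k = 3 groups, k <= n
print(km.labels_)            # each object belongs to exactly one group
print(km.cluster_centers_)   # centers found by iterative relocation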
Hierarchical Methods
This method creates a hierarchical decomposition of the given set of
data objects. We can classify hierarchical methods on the basis of how the
hierarchical decomposition is formed. There are two approaches here −
Agglomerative Approach
Divisive Approach
Agglomerative Approach
This approach is also known as the bottom-up approach. In this, we
start with each object forming a separate group. It keeps on merging the
objects or groups that are close to one another. It keep on doing so until all
of the groups are merged into one or until the termination condition holds.
Divisive Approach
This approach is also known as the top-down approach. In this, we
start with all of the objects in the same cluster. In each successive iteration,
a cluster is split into smaller clusters. This continues until each object is in
a cluster by itself or the termination condition holds. This method is rigid,
i.e., once a merging or splitting is done, it can never be undone.
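An agglomerative (bottom-up) example with scikit-learn, on the same kind of
made-up points −
# Bottom-up (agglomerative) hierarchical clustering sketch on made-up points.
from sklearn.cluster import AgglomerativeClustering

points = [[1.0, 1.1], [1.2, 0.9], [5.0, 5.2], [5.1, 4.8], [9.0, 9.1], [8.8, 9.3]]
# start from each object as its own group and keep merging the closest groups
agg = AgglomerativeClustering(n_clusters=3, linkage="average").fit(points)
print(agg.labels_)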
Density-based Method
This method is based on the notion of density. The basic idea is to
continue growing the given cluster as long as the density in the
neighborhood exceeds some threshold, i.e., for each data point within a
given cluster, the radius of a given cluster has to contain at least a
minimum number of points.
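DBSCAN is a well-known density-based method; a brief scikit-learn sketch, with
the neighborhood radius and minimum point count chosen arbitrarily for the toy
data −
# Density-based clustering sketch (DBSCAN) on made-up points.
from sklearn.cluster import DBSCAN

points = [[1.0, 1.1], [1.2, 0.9], [1.1, 1.0], [5.0, 5.2], [5.1, 4.8], [20.0, 20.0]]
# eps is the neighborhood radius; min_samples is the minimum number of points required in it
db = DBSCAN(eps=0.8, min_samples=2).fit(points)
print(db.labels_)   # -1 marks points in low-density regions (noise/outliers)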
Grid-based Method
In this method, the objects together form a grid. The object space is quantized
into a finite number of cells that form a grid structure.
Advantages
The major advantage of this method is fast processing time.
It is dependent only on the number of cells in each dimension in the
quantized space.
Model-based methods
In this method, a model is hypothesized for each cluster to find the
best fit of data for a given model. This method locates the clusters by
clustering the density function. It reflects spatial distribution of the data
points.
This method also provides a way to automatically determine the
number of clusters based on standard statistics, taking outliers or noise into
account. It therefore yields robust clustering methods.
Constraint-based Method
In this method, the clustering is performed by the incorporation of
user or application-oriented constraints. A constraint refers to the user
expectation or the properties of desired clustering results. Constraints
provide us with an interactive way of communication with the clustering
process. Constraints can be specified by the user or the application
requirement.
Text databases consist of huge collections of documents. They collect
this information from several sources such as news articles, books, digital
libraries, e-mail messages, web pages, etc. Due to the increase in the amount
of information, text databases are growing rapidly. In many text
databases, the data is semi-structured.
For example, a document may contain a few structured fields, such as
title, author, publishing_date, etc. But along with the structured data, the
document also contains unstructured text components, such as the abstract
and contents. Without knowing what is in the documents, it is difficult to
formulate effective queries for analyzing and extracting useful information
from the data. Users require tools to compare documents and rank their
importance and relevance. Therefore, text mining has become a popular and
essential theme in data mining.
Information Retrieval
Information retrieval deals with the retrieval of information from a
large number of text-based documents. Some database system features are
not usually present in information retrieval systems because the two handle
different kinds of data. Basic measures for assessing the quality of text
retrieval include −
Precision
Recall
F-score
Precision
Precision is the percentage of retrieved documents that are in fact
relevant to the query. Precision can be defined as −
Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
Recall
Recall is the percentage of documents that are relevant to the query
and were in fact retrieved. Recall is defined as −
Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
F-score
F-score is the commonly used trade-off. An information retrieval
system often needs to trade recall for precision or vice versa. F-score is
defined as the harmonic mean of recall and precision −
F-score = 2 × Precision × Recall / (Precision + Recall)
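A tiny Python sketch computing the three measures for a made-up retrieval
result −
# Tiny sketch of precision, recall and F-score for a made-up query result.
retrieved = {"d1", "d2", "d3", "d4"}   # documents returned by the system
relevant  = {"d2", "d3", "d5"}         # documents actually relevant to the query

precision = len(retrieved & relevant) / len(retrieved)    # 2/4 = 0.5
recall    = len(retrieved & relevant) / len(relevant)     # 2/3, about 0.67
f_score   = 2 * precision * recall / (precision + recall)
print(precision, recall, f_score)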
Retail Industry
Data mining has great application in the retail industry because it
collects large amounts of data on sales, customer purchasing history,
goods transportation, consumption, and services. It is natural that the
quantity of data collected will continue to expand rapidly because of the
increasing ease, availability, and popularity of the web.
Data mining in the retail industry helps in identifying customer buying
patterns and trends, which leads to improved quality of customer service and
good customer retention and satisfaction. Here is a list of examples of
data mining in the retail industry −
Design and Construction of data warehouses based on the benefits of
data mining.
Multidimensional analysis of sales, customers, products, time and
region.
Analysis of effectiveness of sales campaigns.
Customer Retention.
Product recommendation and cross-referencing of items.
Telecommunication Industry
Today the telecommunication industry is one of the most emerging
industries providing various services such as fax, pager, cellular phone,
internet messenger, images, e-mail, web data transmission, etc. Due to the
development of new computer and communication technologies, the
telecommunication industry is rapidly expanding. This is the reason why
data mining is become very important to help and understand the business.
Data mining in the telecommunication industry helps identify
telecommunication patterns, catch fraudulent activities, make better use of
resources, and improve quality of service. Here is a list of examples for
which data mining improves telecommunication services −
Multidimensional Analysis of Telecommunication data.
Fraudulent pattern analysis.
Identification of unusual patterns.
Multidimensional association and sequential patterns analysis.
Mobile Telecommunication services.
Use of visualization tools in telecommunication data analysis.