1.1 Motivation
Data mining is the procedure of finding useful new correlations, patterns, and trends by sifting through large amounts of data stored in repositories, using
pattern recognition technologies including statistical and mathematical techniques.
1.2 Importance of data mining
The information or knowledge extracted in this way can be used for any of the following applications −
Market Analysis
Fraud Detection
Customer Retention
Production Control
Science Exploration
Market Analysis and Management
Listed below are the various fields of market where data mining is used −
Customer Profiling − Data mining helps determine what kind of people buy what kind of products.
Identifying Customer Requirements − Data mining helps in identifying the best products for different customers. It uses prediction to find the factors that may
attract new customers.
Cross Market Analysis − Data mining finds associations/correlations between the sales of different products.
Target Marketing − Data mining helps to find clusters of model customers who share the same characteristics such as interests, spending habits, income, etc.
Determining Customer Purchasing Patterns − Data mining helps determine customers' purchasing patterns.
Providing Summary Information − Data mining provides various multidimensional summary reports.
Corporate Analysis and Risk Management
Data mining is used in the following fields of the Corporate Sector −
Finance Planning and Asset Evaluation − It involves cash flow analysis and prediction, and contingent claim analysis to evaluate assets.
Resource Planning − It involves summarizing and comparing the resources and spending.
Competition − It involves monitoring competitors and market directions.
Fraud Detection
Data mining is also used in the fields of credit card services and telecommunications to detect fraud. For fraudulent telephone calls, it helps analyze the destination of the call, duration of the call, time of the day or week, etc. It also analyzes patterns that deviate from expected norms.
1.3 What is Data Mining?
Data Mining is defined as extracting information from huge sets of data. In other words, data mining is the procedure of mining knowledge from data.
Figure 1.3 Data mining—searching for knowledge (interesting patterns) in data
❖ “The process of extracting information from huge sets of data to identify patterns, trends, and useful insights that allow a business to make data-driven decisions is called Data Mining.”
❖ “Data mining is the process of analyzing massive volumes of data to discover business
intelligence that helps companies solve problems, mitigate risks, and seize new opportunities.”
❖ “Data mining is the process of finding anomalies, patterns and correlations within large data sets to
predict outcomes. Using a broad range of techniques, you can use this information to increase
revenues, cut costs, improve customer relationships, reduce risks and more.”
1.4 Kinds of Data That Can Be Mined
Data mining can be performed on the following types of data:
1. Relational Database
A relational database is a collection of multiple data sets formally organized into tables, records, and columns from which data can be accessed in various ways without having to reorganize the database tables. Tables convey and share information, which facilitates data searchability, reporting, and organization.
2. Data Warehouse
A Data Warehouse is the technology that collects the data from various sources within the
organization to provide meaningful business insights. The huge amount of data comes from
multiple places such as Marketing and Finance. The extracted data is utilized for analytical
purposes and helps in decision-making for a business organization. The data warehouse is
designed for the analysis of data rather than transaction processing.
3. Data Repositories
The Data Repository generally refers to a destination for data storage. However, many IT professionals use the term more specifically to refer to a particular kind of setup within an IT structure, for example, a group of databases where an organization has kept various kinds of information.
4. Object-Relational Database
A combination of an object-oriented database model and relational database model is called an
object-relational model. It supports Classes, Objects, Inheritance, etc. One of the primary
objectives of the Object-relational data model is to close the gap between the Relational database
and the object-oriented model practices frequently utilized in many programming languages, for
example, C++, Java, C#, and so on.
5. Transactional Database
A transactional database refers to a database management system (DBMS) that has the potential
to undo a database transaction if it is not performed appropriately. Even though this was a unique capability long ago, today most relational database systems support transactional database activities.
1.5 Data Mining Functionalities
Data mining functionalities are used to represent the types of patterns to be discovered in data mining tasks. In general, data mining tasks can be classified into two types: descriptive and predictive. Descriptive mining tasks characterize the general features of the data in the database, while predictive mining tasks perform inference on the current data in order to make predictions.
There are various data mining functionalities which are as follows −
Data characterization − It is a summarization of the general characteristics of an object class of data. The data corresponding to the user-specified class is
generally collected by a database query. The output of data characterization can be presented in multiple forms.
Data discrimination − It is a comparison of the general characteristics of target class data objects with the general characteristics of objects from one or a set of contrasting classes. The target and contrasting classes are specified by the user, and the corresponding data objects are fetched through database queries.
Association Analysis − It analyses the sets of items that generally occur together in a transactional dataset. Two parameters are used for determining association rules (a sketch of their computation follows below) −
Support, which identifies the item sets that occur frequently in the database.
Confidence, which is the conditional probability that an item occurs in a transaction when another item occurs.
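To make these two parameters concrete, here is a minimal Python sketch that computes support and confidence for a toy candidate rule; the transactions and item names are invented for illustration, not taken from the text.

```python
# A minimal sketch of support and confidence for the candidate rule
# {bread} -> {butter}; the transactions and items are invented.

transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "eggs"},
    {"bread", "butter", "eggs"},
]
n = len(transactions)

def support(itemset):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(1 for t in transactions if itemset <= t) / n

antecedent, consequent = {"bread"}, {"butter"}

rule_support = support(antecedent | consequent)   # P(bread and butter)
confidence = rule_support / support(antecedent)   # P(butter | bread)

print(f"support = {rule_support:.2f}, confidence = {confidence:.2f}")
# support = 0.60, confidence = 0.75
```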
Classification − Classification is the procedure of discovering a model that describes and distinguishes data classes or concepts, with the objective of using the model to predict the class of objects whose class label is unknown. The derived model is based on the analysis of a set of training data (i.e., data objects whose class label is known).
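As an illustration of this idea (learning from labelled training objects, then predicting an unknown label), here is a minimal sketch using a simple 1-nearest-neighbour rule; the (age, income) features and labels are invented.

```python
# A minimal 1-nearest-neighbour classifier; training objects have known
# class labels, and we predict the label of a new, unlabelled object.

def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

training = [                       # (features, known class label)
    ((25, 30000), "no"),           # (age, income) -> buys_computer
    ((35, 60000), "yes"),
    ((45, 80000), "yes"),
    ((22, 20000), "no"),
]

def classify(features):
    """Predict the class of an object whose class label is unknown."""
    nearest = min(training, key=lambda pair: euclidean(pair[0], features))
    return nearest[1]

print(classify((40, 70000)))       # -> "yes"
```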
Prediction − It predicts some unavailable data values or pending trends. An object can be anticipated based on the attribute values of the object and the attribute values of the classes. It can be a prediction of missing numerical values or of increase/decrease trends in time-related data.
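A minimal sketch of numeric prediction, assuming a simple least-squares trend line fitted to an invented time series:

```python
# Fit a least-squares line to a short time series and predict the
# next (unavailable) value; the sales figures are invented.

xs = [1, 2, 3, 4, 5]                      # time periods
ys = [10.0, 12.0, 15.0, 15.5, 18.0]       # observed values, e.g. sales

n = len(xs)
mean_x, mean_y = sum(xs) / n, sum(ys) / n

# Slope and intercept of the least-squares regression line.
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
intercept = mean_y - slope * mean_x

print(intercept + slope * 6)              # forecast for period 6: 19.95
```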
Clustering − It is similar to classification, but the classes are not predefined; they are derived from the data attributes. It is unsupervised learning. The objects are clustered or grouped based on the principle of maximizing the intraclass similarity and minimizing the interclass similarity.
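A minimal sketch of this grouping principle, using the well-known k-means procedure (one of many clustering methods) on invented one-dimensional data:

```python
# k-means with k = 2: points are grouped so that each cluster is
# internally similar; data and initialisation are illustrative.

points = [1.0, 1.5, 2.0, 8.0, 9.0, 9.5]
centroids = [points[0], points[-1]]        # naive initialisation

for _ in range(10):                        # a few fixed iterations suffice
    clusters = [[], []]
    for p in points:                       # assign each point to the
        i = min((0, 1), key=lambda c: abs(p - centroids[c]))
        clusters[i].append(p)              # nearest centroid's cluster
    centroids = [sum(c) / len(c) for c in clusters]  # recompute means

print(centroids)                           # ~[1.5, 8.83]: two groups
```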
Outlier analysis − Outliers are data elements that cannot be grouped into a given class or cluster. They are data objects whose behaviour differs from the general behaviour of other data objects. The analysis of this type of data can be essential for mining knowledge.
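A minimal sketch of one simple outlier-analysis rule, flagging values more than two standard deviations from the mean; the data values are invented:

```python
# Flag values whose z-score exceeds 2, i.e. values that deviate
# from the general behaviour of the rest of the data.

data = [10, 12, 11, 13, 12, 11, 95]        # 95 deviates from the norm

n = len(data)
mean = sum(data) / n
std = (sum((x - mean) ** 2 for x in data) / n) ** 0.5

outliers = [x for x in data if abs(x - mean) / std > 2]
print(outliers)                             # [95]
```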
Evolution analysis − It describes trends for objects whose behaviour changes over time.
1.6 Kinds of patterns
Using the most relevant data (which may come from organizational databases or may be obtained from outside sources), data mining builds models to identify patterns among the attributes (i.e., variables or characteristics) that exist in a data set.
Associations find commonly co-occurring groupings of things, such as “beers and diapers” or “bread and butter” commonly purchased and observed together in
a shopping cart (i.e., market-basket analysis). Another type of association pattern captures the sequences of things. These sequential relationships can discover
time-ordered events, such as predicting that an existing banking customer who already has a checking account will open a savings account followed by an
investment account within a year.
Predictions tell the nature of future occurrences of certain events based on what has happened in the past, such as predicting the winner of the Super Bowl or
forecasting the absolute temperature on a particular day.
Clusters identify natural groupings of things based on their known characteristics, such as assigning customers in different segments based on their
demographics and past purchase behaviors.
1.7 Data Mining System Classification
A data mining system can be classified according to the following criteria −
Database Technology
Statistics
Machine Learning
Information Science
Visualization
Other Disciplines
Apart from these, a data mining system can also be classified based on the kind of (a) databases mined, (b) knowledge mined, (c) techniques utilized, and (d)
applications adapted.
Classification Based on the Databases Mined
We can classify a data mining system according to the kind of databases mined. Database systems can be classified according to different criteria such as data models, types of data, etc., and the data mining system can be classified accordingly.
For example, if we classify a database according to the data model, then we may have a relational, transactional, object-relational, or data warehouse mining
system.
Classification Based on the kind of Knowledge Mined
We can classify a data mining system according to the kind of knowledge mined. It means the data mining system is classified on the basis of functionalities such
as −
Characterization
Discrimination
Association and Correlation Analysis
Classification
Prediction
Outlier Analysis
Evolution Analysis
Classification Based on the Techniques Utilized
We can classify a data mining system according to the kind of techniques used. We can describe these techniques according to the degree of user interaction
involved or the methods of analysis employed.
Classification Based on the Applications Adapted
We can classify a data mining system according to the applications adapted. These applications are as follows −
Finance
Telecommunications
DNA
Stock Markets
E-mail
1.8 Data Mining Task Primitives
We can specify a data mining task in the form of a data mining query.
This query is input to the system.
A data mining query is defined in terms of data mining task primitives.
Note − These primitives allow us to communicate in an interactive manner with the data mining system. Here is the list of Data Mining Task Primitives −
Set of task relevant data to be mined.
Kind of knowledge to be mined.
Background knowledge to be used in discovery process.
Interestingness measures and thresholds for pattern evaluation.
Representation for visualizing the discovered patterns.
1.9 Integrating a Data Mining System with a DB/DW System
If a data mining system is not integrated with a database or a data warehouse system, then there will be no system to communicate with. This scheme is known
as the non-coupling scheme. In this scheme, the main focus is on data mining design and on developing efficient and effective algorithms for mining the
available data sets.
The list of Integration Schemes is as follows −
No Coupling − In this scheme, the data mining system does not utilize any of the database or data warehouse functions. It fetches the data from a particular
source and processes that data using some data mining algorithms. The data mining result is stored in another file.
Loose Coupling − In this scheme, the data mining system may use some of the functions of the database and data warehouse system. It fetches the data from the data repository managed by these systems and performs data mining on that data. It then stores the mining result either in a file or in a designated place in a database or data warehouse.
Semi−tight Coupling − In this scheme, the data mining system is linked with a database or a data warehouse system and in addition to that, efficient
implementations of a few data mining primitives can be provided in the database.
Tight coupling − In this coupling scheme, the data mining system is smoothly integrated into the database or data warehouse system. The data mining subsystem
is treated as one functional component of an information system.
1.10 Major Issues in Data Mining
Data mining is not an easy task, as the algorithms used can get very complex and data is not always available in one place; it needs to be integrated from various heterogeneous data sources. These factors also create some issues. Here in this tutorial, we will discuss the major issues regarding −
Mining Methodology and User Interaction
Performance Issues
Diverse Data Types Issues
The following diagram describes the major issues.
Mining Methodology and User Interaction Issues
It refers to the following kinds of issues −
Mining different kinds of knowledge in databases − Different users may be interested in different kinds of knowledge. Therefore it is necessary for data mining to cover a broad range of knowledge discovery tasks.
Interactive mining of knowledge at multiple levels of abstraction − The data mining process needs to be interactive because it allows users to focus
the search for patterns, providing and refining data mining requests based on the returned results.
Incorporation of background knowledge − Background knowledge can be used to guide the discovery process and to express the discovered patterns, not only in concise terms but at multiple levels of abstraction.
Data mining query languages and ad hoc data mining − Data Mining Query language that allows the user to describe ad hoc mining tasks, should be
integrated with a data warehouse query language and optimized for efficient and flexible data mining.
Presentation and visualization of data mining results − Once the patterns are discovered, they need to be expressed in high-level languages and visual representations. These representations should be easily understandable.
Handling noisy or incomplete data − Data cleaning methods are required to handle noise and incomplete objects while mining the data regularities. Without such methods, the accuracy of the discovered patterns will be poor.
Pattern evaluation − The patterns discovered may turn out to be uninteresting because they represent common knowledge or lack novelty; interestingness measures are needed to evaluate them.
Performance Issues
There can be performance-related issues such as follows −
Efficiency and scalability of data mining algorithms − In order to effectively extract information from the huge amounts of data in databases, data mining algorithms must be efficient and scalable.
Parallel, distributed, and incremental mining algorithms − Factors such as the huge size of databases, the wide distribution of data, and the complexity of data mining methods motivate the development of parallel and distributed data mining algorithms. These algorithms divide the data into partitions, which are processed in parallel, and the results from the partitions are then merged. Incremental algorithms incorporate database updates without mining the data again from scratch.
Diverse Data Types Issues
Handling of relational and complex types of data − A database may contain complex data objects, multimedia data objects, spatial data, temporal data, etc. It is not possible for one system to mine all these kinds of data.
Mining information from heterogeneous databases and global information systems − The data is available at different data sources on a LAN or WAN. These data sources may be structured, semi-structured, or unstructured. Therefore, mining knowledge from them adds challenges to data mining.
1.11 Types of Data Sets and Attribute Values
Attributes
Attribute (or dimensions, features, variables): a data field, representing a characteristic or feature of a data object.
E.g., customer_ID, name, address
Types:
Nominal: “red”, “black”, “blue”, …
Binary: 1/0, TRUE/FALSE
Numeric: quantitative
Interval-scaled
Ratio-scaled
Attribute Types
Nominal: categories, states, or “names of things”
Hair_color = {auburn, black, blond, brown, grey, red, white}
marital status, occupation, ID numbers, zip codes
Binary
Nominal attribute with only 2 states (0 and 1)
Symmetric binary: both outcomes equally important
e.g., gender
Asymmetric binary: outcomes not equally important.
e.g., medical test (positive vs. negative)
Convention: assign 1 to most important outcome (e.g., HIV positive)
Ordinal
Values have a meaningful order (ranking) but magnitude between successive values is not known.
Size = {small, medium, large}, grades, army rankings
Numeric Attribute Types
Quantity (integer or real-valued)
Interval
Measured on a scale of equal-sized units
Values have order
E.g., temperature in C° or F°, calendar dates
No true zero-point
Ratio
Inherent zero-point
We can speak of values as being an order of magnitude larger than the unit of measurement (10 K is twice as high as 5 K).
e.g., temperature in Kelvin, length, counts, monetary quantities
Discrete vs. Continuous Attributes
Discrete Attribute
Has only a finite or countably infinite set of values
E.g., zip codes, profession, or the set of words in a collection of documents
Sometimes, represented as integer variables
Note: Binary attributes are a special case of discrete attributes
Continuous Attribute
Has real numbers as attribute values
E.g., temperature, height, or weight
Practically, real values can only be measured and represented using a finite number of digits
Continuous attributes are typically represented as floating-point variables
1.12.1 Basic Statistical Descriptions of Data
Motivation
To better understand the data: central tendency, variation and spread
Data dispersion characteristics
median, max, min, quantiles, outliers, variance, etc.
Numerical dimensions correspond to sorted intervals
Data dispersion: analyzed with multiple granularities of precision
Boxplot or quantile analysis on sorted intervals
Dispersion analysis on computed measures
Folding measures into numerical dimensions
Boxplot or quantile analysis on the transformed cube
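A minimal sketch of these descriptions (central tendency, dispersion, and the quartiles behind a boxplot) using Python's standard statistics module; the data values are invented, and statistics.quantiles requires Python 3.8+.

```python
# Basic statistical descriptions with the standard library only.

import statistics

data = [4, 8, 15, 16, 23, 42, 7, 9, 12, 30]

print("mean   :", statistics.mean(data))     # central tendency
print("median :", statistics.median(data))
print("min/max:", min(data), max(data))      # spread of the data
print("stdev  :", statistics.stdev(data))    # sample standard deviation

# Quartiles: the basis of a boxplot's five-number summary.
q1, q2, q3 = statistics.quantiles(data, n=4)
print("Q1 Q2 Q3:", q1, q2, q3)
print("IQR    :", q3 - q1)                   # interquartile range
```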
1.12.2 Data Visualization
Why data visualization?
Gain insight into an information space by mapping data onto graphical primitives
Provide qualitative overview of large data sets
Search for patterns, trends, structure, irregularities, relationships among data
Help find interesting regions and suitable parameters for further quantitative analysis
Provide a visual proof of computer representations derived
Categorization of visualization methods: pixel-oriented techniques, geometric projection techniques, icon-based techniques, hierarchical techniques, and visualizing complex data and relations.
1.12.3 Measuring Data Similarity
Similarity
Numerical measure of how alike two data objects are
Value is higher when objects are more alike
Often falls in the range [0,1]
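One widely used measure with these properties is cosine similarity; here is a minimal sketch on invented vectors (for non-negative data the result falls in [0, 1]):

```python
# Cosine similarity between two data objects represented as vectors.

import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

x = [3, 2, 0, 5]
y = [1, 0, 0, 2]
print(round(cosine_similarity(x, y), 3))   # 0.943: quite alike
```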
1.13 PREPROCESSING
1.13.1 Data quality
Data Quality: Why do we preprocess the data?
Many characteristics act as deciding factors for data quality, such as incompleteness and inconsistency, which are common properties of large real-world databases. Factors used for data quality assessment are:
Accuracy:
There are many possible reasons for flawed or inaccurate data, e.g., attributes having incorrect values due to human or computer errors.
Completeness:
Incomplete data can occur for several reasons; attributes of interest, such as customer information for sales and transaction data, may not always be available.
Consistency:
Incorrect data can also result from inconsistencies in naming conventions or data codes, or from inconsistent formats in input fields. Duplicate tuples also require data cleaning.
Timeliness:
It also affects the quality of the data. At the end of the month, several sales representatives fail to file their sales records on time, and several corrections and adjustments flow in after the end of the month. The data stored in the database are therefore incomplete for a time after each month.
Believability:
It is reflective of how much users trust the data.
Interpretability:
It is a reflection of how easy the users can understand the data.
1.13.2 Major Tasks in Data Preprocessing
The major steps involved in data preprocessing are data cleaning, data integration, data reduction, and data transformation.
1. Data Cleaning
Data cleaning routines work to “clean” the data by filling in missing values, smoothing noisy data, identifying or removing outliers, and resolving inconsistencies. If users
believe the data are dirty, they are unlikely to trust the results of any data mining that
has been applied. Furthermore, dirty data can cause confusion for the mining
procedure, resulting in unreliable output. Although most mining routines have some
procedures for dealing with incomplete or noisy data, they are not always robust.
Instead, they may concentrate on avoiding overfitting the data to the function being
modeled. Therefore, a useful preprocessing step is to run your data through some data
cleaning routines.
Data cleaning is the process of removing incorrect, incomplete, and inaccurate data from datasets; it also replaces missing values. Missing values can be handled in the following ways (a sketch follows the list):
✓ Ignore the tuple
✓ Fill the missing value manually
✓ Use global constant to fill the missing values
✓ Use a measure of central tendency for the attribute (e.g., the mean or median) to fill in the
missing value.
✓ Use the most probable value to fill in the missing value.
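A minimal sketch contrasting two of the strategies above: filling with a measure of central tendency (the mean) versus a global constant. The values and the None marker standing in for a missing entry are invented.

```python
# Fill missing values with the attribute mean or a global constant.

values = [12.0, None, 15.0, 11.0, None, 14.0]

known = [v for v in values if v is not None]
mean = sum(known) / len(known)                     # = 13.0

filled_mean = [v if v is not None else mean for v in values]
filled_const = [v if v is not None else -1.0 for v in values]

print(filled_mean)    # [12.0, 13.0, 15.0, 11.0, 13.0, 14.0]
print(filled_const)   # [12.0, -1.0, 15.0, 11.0, -1.0, 14.0]
```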
Noisy data can be handled in the following ways (a binning sketch follows the figure):
✓ Binning method
✓ Regression method
✓ Clustering
Figure: Binning methods for data smoothing
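As a concrete companion to the figure, here is a minimal sketch of the binning method: sort the data, split it into equal-frequency bins, then smooth by bin means and by bin boundaries. The price values follow the classic textbook example.

```python
# Binning methods for data smoothing on sorted price data.

data = sorted([4, 8, 15, 21, 21, 24, 25, 28, 34])
size = 3
bins = [data[i:i + size] for i in range(0, len(data), size)]

# Smoothing by bin means: every value becomes its bin's mean.
by_means = [[sum(b) / len(b)] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snaps to the nearer edge.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b]
             for b in bins]

print(by_means)    # [[9.0, 9.0, 9.0], [22.0, 22.0, 22.0], [29.0, 29.0, 29.0]]
print(by_bounds)   # [[4, 4, 15], [21, 21, 24], [25, 25, 34]]
```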
2. Data Integration
Data integration is the process of combining data from multiple sources into a single dataset. It is one of the main components of data management, and it assists when information collected from diverse data origins must be merged into consistent, unified information.
3. Data Transformation
The change made to the format or structure of the data is called data transformation. This step can be simple or complex based on the requirements. Data transformation involves the following methods (a normalization sketch follows this list):
✓ Smoothing
✓ Aggregation
✓ Normalization
✓ Attribute Selection
✓ Discretization
✓ Concept hierarchy generation
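A minimal sketch of normalization, one of the methods listed above, showing min-max scaling and z-score standardization; the income values are invented.

```python
# Two common normalization schemes for a numeric attribute.

data = [12000.0, 73600.0, 54000.0, 98000.0, 30000.0]

# Min-max normalization: (v - min) / (max - min) maps into [0, 1].
lo, hi = min(data), max(data)
min_max = [(v - lo) / (hi - lo) for v in data]

# Z-score normalization: (v - mean) / std centres and rescales the data.
mean = sum(data) / len(data)
std = (sum((v - mean) ** 2 for v in data) / len(data)) ** 0.5
z = [(v - mean) / std for v in data]

print([round(v, 3) for v in min_max])   # 12000 -> 0.0, 98000 -> 1.0
print([round(v, 2) for v in z])         # centred on 0
```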
4. Data Reduction
Data mining is a technique used to handle huge amounts of data, and analysis becomes harder as the volume of data grows. To address this, we use data reduction techniques, which aim to increase storage efficiency and reduce data storage and analysis costs.
When the volume of data is huge, databases can become slower, costly to access, and challenging to
properly store. Data reduction aims to present a reduced representation of the data in a data
warehouse.
The various steps of data reduction are (a sampling sketch follows this list):
✓ Data Cube aggregation
✓ Attribute Subset Selection
✓ Numerosity Reduction
✓ Dimensionality Reduction
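A minimal sketch of numerosity reduction, one of the steps listed above, via simple random sampling without replacement; the "table" and the sample size are invented for illustration.

```python
# Represent a large table by a small random sample of its tuples.

import random

random.seed(42)                       # reproducible for the example

dataset = list(range(1, 101))         # 100 tuples standing in for a table
sample = random.sample(dataset, 10)   # reduced representation: a 10% sample

print(sample)                         # analysis now runs on far less data
```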
Although numerous methods of data preprocessing have been developed, data preprocessing remains
an active area of research, due to the huge amount of inconsistent or dirty data and the complexity of
the problem.
Data Discretization
Data discretization refers to a method of converting a huge number of data values into a smaller set so that the evaluation and management of data become easier. In other words, data discretization is a method of converting the values of a continuous attribute into a finite set of intervals with minimal data loss.
We can understand this concept with the help of an example.
Suppose we have an attribute Age with the given values (table before discretization):
Age: 1, 5, 9, 4, 7, 11, 14, 17, 13, 18, 19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77
A concept hierarchy for a given numeric attribute defines a discretization of that attribute. Concept hierarchies can be used to reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior). Although detail is lost by such generalization, the generalized data are more meaningful and easier to interpret (a sketch follows).
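A minimal sketch discretizing the Age values from the example above into the higher-level concepts just mentioned; the interval boundaries (≤18, 19-60, >60) are assumptions chosen for illustration.

```python
# Map raw Age values to concept-hierarchy labels and count each interval.

from collections import Counter

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18,
        19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

def discretize(age):
    if age <= 18:          # assumed boundary for "young"
        return "young"
    if age <= 60:          # assumed boundary for "middle-aged"
        return "middle-aged"
    return "senior"

labels = [discretize(a) for a in ages]
print(Counter(labels))   # {'young': 10, 'middle-aged': 7, 'senior': 4}
```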
Some well-known techniques of data discretization:
➢ Histogram analysis
A histogram is a plot used to represent the underlying frequency distribution of a continuous data set. Histograms assist in inspecting the data distribution, for example for outliers, skewness, or an approximately normal shape (a sketch follows).
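A minimal sketch of histogram analysis, counting the Age values from the earlier example into equal-width bins and printing a text histogram; the bin width of 20 is an assumption for illustration.

```python
# Equal-width histogram of the Age values, standard library only.

ages = [1, 5, 9, 4, 7, 11, 14, 17, 13, 18,
        19, 31, 33, 36, 42, 44, 46, 70, 74, 78, 77]

width = 20                                  # bins 0-19, 20-39, ...
counts = {}
for a in ages:
    lo = (a // width) * width
    key = f"{lo}-{lo + width - 1}"
    counts[key] = counts.get(key, 0) + 1

for label in sorted(counts, key=lambda s: int(s.split("-")[0])):
    print(f"{label:>6}: {'*' * counts[label]}")
#   0-19: ***********
#  20-39: ***
#  40-59: ***
#  60-79: ****
```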