Data Warehousing and Data Mining Reference Note
Unit-2
Introduction to Data Mining
Data Mining
Data mining is the process of discovering interesting patterns and knowledge from large
amounts of data. Data mining is one of the essential steps in the process of KDD (Knowledge
Discovery in Databases).
Why Data Mining? (Motivation)
+ Data mining helps to turn the huge amount of data into useful information and knowledge
that can have different applications.
+ Data mining helps in
a. Automatic discovery of patterns
b. Prediction of likely outcomes
c. Creation of actionable information
+ Data mining can answer questions that cannot be addressed through simple query and
reporting techniques.
Types of Data That Can Be Mined
Different kinds of data can be mined. Some of the examples are mentioned below:
+ Flat Files: Flat files are in binary or text form and have a structure that can be
easily extracted by data mining algorithms. The records stored in a flat file have no
relationships or links to one another. Flat files are described by a data dictionary. E.g. CSV
files. They are often used in data warehousing to store data, to carry data to and from
servers, etc.
+ Relational Databases: A relational database is a data collection organized into tables with
rows and columns. The physical schema of a relational database defines the structure of
the tables. The logical schema of a relational database defines the relationships between
tables.
+ Data Warehouses: A data warehouse is defined as a collection of data integrated from
multiple sources (often heterogeneous) that supports queries and decision making. Data
warehouses are of three types: enterprise data warehouses, data marts, and virtual
warehouses. They are widely used in everyday business decision-making.
+ Transaction Databases: A transaction database is a set of records representing
transactions, each with a time stamp, an identifier and a set of items. This type of database
has the capability to roll back or undo its operation when a transaction is not completed
or committed. Object databases, ATMs, banking, and distributed systems are
well-known applications of a transactional database.
+ Multimedia Databases: Multimedia databases include video, images, audio and text media.
They can be stored on object-oriented databases. E-book databases, video website
databases, news website databases, etc. are well-known applications of multimedia databases.
+ Spatial Databases: Spatial databases are databases that store geographical information like
maps and global or regional positioning. They store data in the form of coordinates, topology,
lines, polygons, etc.
Collegenote Prepared By: Jayanta Poudel
Data Mining Architecture
The major components of a data mining system architecture are as follows:
Fig: Architecture of typical data mining system
+ Database, Data Warehouse or Other Information Repository: This is one or a set of
databases, data warehouses, spreadsheets, or other kinds of information repositories. Data
cleaning and data integration techniques may be performed on the data.
+ Database or Data Warehouse Server: It fetches the relevant data based on the user's
data mining request.
+ Knowledge Base: This is the domain knowledge that is used to guide the search or
evaluate the interestingness of resulting patterns. It is typically stored in the form of a set of
rules.
+ Data Mining Engine: It performs data mining tasks such as characterization,
association, classification, prediction, cluster analysis, etc.
+ Pattern Evaluation Module: It is responsible for finding interesting patterns in the
data using a threshold value. It interacts with the data mining engine to focus the search on
interesting patterns.
+ Graphical User Interface: This module handles communication between the user and the data
mining system and allows users to browse database or data warehouse schemas and to
specify a data mining query or task.
Data Mining Functionalities: What Kinds of Patterns Can Be Mined?
Data mining functionalities are used to specify the kinds of patterns to be found in data mining
tasks. In general, such tasks can be classified into two categories: descriptive and predictive.
+ Descriptive mining tasks characterize the general properties of the data in the database.
+ Predictive mining tasks perform inference on the current data in order to make predictions.
Data mining functionalities or the kinds of patterns that can be mined are as follows:
1. Class/Concept Description: Data can be associated with classes or concepts that can be
described in summarized, concise and yet precise, terms. Such descriptions of a concept or
class are called class/concept descriptions. These descriptions can be derived via:
+ Data Characterization: Characterization is a summarization of the general
characteristics or features of a target class of data, which creates what is called a
characteristic rule.
+ Data Discrimination: Data discrimination is a comparison of the general features of
target class data objects with the general features of objects from one or a set of
contrasting classes.
2. Association analysis on frequent patterns: Frequent patterns are patterns that occur
frequently in data. Association analysis aims to discover associations between items
occurring together frequently.
E.g. buys(X, "computer") => buys(X, "software") [support = 1%, confidence = 50%]
where X is a variable representing a customer. Support = 1% means that 1% of all the
transactions under analysis show a computer and software being purchased together.
Confidence = 50% means that if a customer buys a computer, there is a 50% chance that
she will buy software as well.
3. Classification and Prediction: Classification is the process of finding a model (or function)
that describes and distinguishes data classes or concepts. This model is derived based on
the analysis of a set of training data and used to predict the class label of objects for which
the class label is unknown.
Prediction is used to predict missing or unavailable numeric data values rather than class
labels. Regression analysis is a statistical methodology that is most often used for numeric
prediction, although other methods exist as well.
4. Cluster Analysis / Clustering: Clustering analyzes data objects without consulting class
labels. It can be used to generate class labels for a group of data which did not exist at the
beginning. The objects are clustered or grouped based on the principle of maximizing the
intra-class similarity and minimizing the inter-class similarity. That is, clusters of objects
are formed so that objects within a cluster have high similarity in comparison to one
another, but are very dissimilar to objects in other clusters.
5. Outlier Analysis: Outliers are objects that do not comply with the general behavior or
model of the data. Most data mining methods discard outliers as noise or exceptions.
However, in some applications, such as fraud detection, these rare events are more
interesting than the regularly occurring ones. The analysis of outlier data is referred to
as outlier analysis.
6. Evolution Analysis: Data evolution analysis describes and models regularities or trends for
objects whose behavior changes over time. This may include characterization,
discrimination, association and correlation analysis, classification, prediction or clustering
of time-related data. Distinct features of such analysis include time-series data analysis,
sequence or periodicity pattern matching, and similarity-based data analysis.
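The support and confidence figures quoted for the association rule in item 2 can be computed directly from a list of transactions. The following is a minimal Python sketch, not a full association-mining algorithm; the function names and the toy transactions are hypothetical and chosen only for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and C) / support(A)."""
    a = frozenset(antecedent)
    both = a | frozenset(consequent)
    return support(transactions, both) / support(transactions, a)


# Hypothetical toy data: each transaction is the set of items one customer bought.
transactions = [
    frozenset({"computer", "software"}),
    frozenset({"computer"}),
    frozenset({"printer"}),
    frozenset({"computer", "software", "printer"}),
]

# Rule: buys(X, "computer") => buys(X, "software")
rule_support = support(transactions, {"computer", "software"})
rule_confidence = confidence(transactions, {"computer"}, {"software"})
print(f"support={rule_support:.2f}, confidence={rule_confidence:.2f}")
```

In this toy data, two of the four transactions contain both items and three contain a computer, so the rule has support 0.50 and confidence of about 0.67.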
Knowledge Discovery in Database (KDD)
Knowledge discovery in databases (KDD) is the process of discovering useful knowledge from
a collection of data.
Fig: KDD process
The steps involved in knowledge discovery process:
1. Data Cleaning: Data cleaning is the process of removing unnecessary and inconsistent
data from the databases. The main purpose of cleaning is to improve the quality of the
data by filling in missing values and making sure that the data is in a consistent
format.
2. Data Integration: In this step, data from various sources such as databases, data warehouses
and transactional data are combined.
3. Data Selection: Data required for the data mining process can be extracted from
multiple and heterogeneous data sources such as databases, files, etc. Data selection is the
process where the appropriate data required for analysis is fetched from the databases.
4. Data Transformation: In the transformation stage data extracted from multiple data
sources are converted into an appropriate format for data mining process. Data reduction
or summarization is used to decrease the number of possible values of data without
affecting the integrity of data.
5. Data Mining: It is the most essential step of the KDD process, where intelligent methods are
applied in order to extract hidden patterns from data stored in databases.
6. Pattern Evaluation: This step identifies the truly interesting patterns representing
knowledge on the basis of some interestingness measures. Support and confidence are two
widely used interestingness measures. These patterns are helpful for decision support
systems.
7. Knowledge Presentation: In this step, visualization and knowledge representation
techniques are used to present mined knowledge to users. Visualizations can be in the form of
graphs, charts or tables.
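The seven steps above can be sketched as a chain of small functions. This is only an illustrative toy in Python under stated assumptions: the two "sources", the record layout, the function names and the choice of frequent item pairs as the mining task are all hypothetical, not part of the notes.

```python
from collections import Counter
from itertools import combinations

# Hypothetical raw data from two sources; None marks a missing value.
store_a = [{"id": 1, "items": ["milk", "bread"]}, {"id": 2, "items": None}]
store_b = [{"id": 3, "items": ["Milk", "butter", "bread"]},
           {"id": 4, "items": ["bread", "milk"]}]


def clean(records):
    """1. Data cleaning: drop records whose item list is missing or empty."""
    return [r for r in records if r["items"]]


def integrate(*sources):
    """2. Data integration: combine records from several sources."""
    return [r for source in sources for r in source]


def select(records):
    """3. Data selection: keep only the field needed for the analysis."""
    return [r["items"] for r in records]


def transform(baskets):
    """4. Data transformation: normalize case and convert to sets."""
    return [frozenset(item.lower() for item in basket) for basket in baskets]


def mine(baskets):
    """5. Data mining: count how often each pair of items occurs together."""
    counts = Counter()
    for basket in baskets:
        counts.update(combinations(sorted(basket), 2))
    return counts


def evaluate(pair_counts, n_baskets, min_support):
    """6. Pattern evaluation: keep pairs whose support meets the threshold."""
    return {pair: count / n_baskets
            for pair, count in pair_counts.items()
            if count / n_baskets >= min_support}


def present(patterns):
    """7. Knowledge presentation: render the patterns as readable lines."""
    return [f"{a} & {b}: support={s:.2f}" for (a, b), s in sorted(patterns.items())]


baskets = transform(select(integrate(clean(store_a), clean(store_b))))
report = present(evaluate(mine(baskets), len(baskets), min_support=0.5))
print("\n".join(report))
```

Each function maps to one step, so a real system could replace any stage (for example, a different mining method in step 5) without changing the others.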
Classification of Data Mining System
Data mining systems can be classified according to the following criteria:
1. Classification according to kind of databases mined
We can classify a data mining system according to the kind of databases mined. Database
systems can be classified according to different criteria such as data models, types of data,
etc., and the data mining system can be classified accordingly. For example, if we classify
the database according to data model, then we may have a relational, transactional,
object-relational, or data warehouse mining system.
2. Classification according to kind of knowledge mined
We can classify a data mining system according to the kind of knowledge mined. This means
data mining systems are classified on the basis of functionalities such as characterization,
discrimination, association and correlation analysis, classification, prediction,
clustering, outlier analysis, and evolution analysis.
3. Classification according to kinds of techniques utilized
We can classify a data mining system according to the kind of techniques used. We can
describe these techniques according to the degree of user interaction involved or the methods
of analysis employed.
4. Classification according to applications adapted
We can classify a data mining system according to the application it is adapted to. These
applications include finance, telecommunications, DNA, stock markets, and e-mail.
Issues in Data Mining
Data mining algorithms are complex, and the data is often not available from a single source.
These factors give rise to several issues. The major issues are:
Fig: Major issues in data mining: mining methodology and user interaction issues, performance issues, and diverse data types issues
1. Mining Methodology and User Interaction Issues
a) Mining different kinds of knowledge in databases: Different users may be interested
in different kinds of knowledge. Therefore, data mining needs to cover a
broad range of knowledge discovery tasks.
b) Interactive mining of knowledge at multiple levels of abstraction: The data mining
process needs to be interactive so that users can focus the search for patterns,
providing and refining data mining requests based on the returned results.
c) Incorporation of background knowledge: Background knowledge can be used to guide
the discovery process and to express the discovered patterns, not only in concise terms
but also at multiple levels of abstraction.
d) Data mining query languages and ad hoc data mining: A data mining query language
that allows the user to describe ad hoc mining tasks should be integrated with a data
warehouse query language and optimized for efficient and flexible data mining.
e) Presentation and visualization of data mining results: Once patterns are
discovered, they need to be expressed in high-level languages and visual representations.
These representations should be easily understandable.
f) Handling noisy or incomplete data: Data cleaning methods are required to handle
noise and incomplete objects while mining data regularities. Without data cleaning
methods, the accuracy of the discovered patterns will be poor.
g) Pattern evaluation: Many discovered patterns may be uninteresting because they
represent common knowledge or lack novelty, so interestingness measures are needed
to evaluate them.
2. Performance Issues
a) Efficiency and scalability of data mining algorithms: In order to effectively extract
information from the huge amount of data in databases, data mining algorithms must be
efficient and scalable.
b) Parallel, distributed, and incremental mining algorithms: Factors such as the huge
size of databases, the wide distribution of data, and the complexity of data mining methods
motivate the development of parallel and distributed data mining algorithms. These
algorithms divide the data into partitions, which are then processed in parallel.
The results from the partitions are then merged. Incremental algorithms
incorporate database updates without mining the entire data again from scratch.
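The partition, process and merge idea in item b) can be sketched with a simple frequency count. This is a minimal Python illustration under stated assumptions: the function names and toy transactions are hypothetical, and a thread pool stands in for the separate processes or machines that a real parallel or distributed miner would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def count_items(partition):
    """Count, for one partition, how many transactions contain each item."""
    counts = Counter()
    for transaction in partition:
        counts.update(set(transaction))
    return counts


def partitioned_counts(transactions, n_parts):
    """Split the data into partitions, count each one, then merge the results."""
    partitions = [transactions[i::n_parts] for i in range(n_parts)]
    with ThreadPoolExecutor(max_workers=n_parts) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    merged = Counter()
    for partial in partial_counts:
        merged.update(partial)  # Counter.update adds counts together
    return merged


# Hypothetical toy transactions.
transactions = [["a", "b"], ["a", "c"], ["b", "c"], ["a"]]
counts = partitioned_counts(transactions, n_parts=2)

# Incremental step: fold in newly arrived transactions
# without recounting the transactions already processed.
new_transactions = [["a", "d"]]
counts.update(count_items(new_transactions))
print(counts)
```

Counting merges cleanly because partial counts simply add up; more elaborate mining methods need a more careful merge step, which is part of what makes parallel and incremental mining a research issue.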
3. Diverse Data Types Issues
a) Handling of relational and complex types of data: The database may contain complex
data objects, multimedia data objects, spatial data, temporal data, etc. It is not realistic
to expect one system to mine all these kinds of data.
b) Mining information from heterogeneous databases and global information systems:
The data is available at different data sources on a LAN or WAN. These data sources may
be structured, semi-structured or unstructured. Therefore, mining knowledge from
them adds challenges to data mining.
Data Objects and Attribute Types
Data Objects
Data sets are made up of data objects. A data object represents an entity - in a sales database,
the objects may be customers, store items, and sales. Data objects are typically described by
attributes. If the data objects are stored in a database, they are data tuples.
Attribute
An attribute is a data field, representing a characteristic or feature of a data object. Attributes
describing a customer object can include, for example, customer ID, name, and address.
Based on the set of possible values, attributes can be divided into the following types:
Fig: Attribute types: qualitative (nominal, ordinal) and quantitative (discrete, continuous)
1) Nominal Attributes: Nominal means “relating to names.” The values of a nominal attribute
are symbols or names of things. Each value represents some kind of category, code, or state,
and so nominal attributes are also referred to as categorical. The values do not have any
meaningful order. E.g.
- Hair_color: possible values are: {black, brown, red, grey, white}
- Marital_status: possible values are: {Married, Single, Divorced, Widowed}
2) Binary Attributes: A binary attribute is a nominal attribute with only two categories or
states: 0 or 1, where 0 typically means that the attribute is absent, and 1 means that it
is present. E.g. Given the attribute smoker describing a patient object, 1 indicates that
the patient smokes, while 0 indicates that the patient does not.
- A binary attribute is symmetric if both of its states are equally valuable. E.g. the
attribute gender having the states male and female.
- A binary attribute is asymmetric if the outcomes of the states are not equally important,
such as the positive (1) and negative (0) outcomes of a medical test for HIV.
3) Ordinal Attributes: An ordinal attribute is an attribute with possible values that have a
meaningful order or ranking among them, but the magnitude between successive values is
not known. E.g. Height: possible values are: {Short, Medium, Tall}. The values have a
meaningful sequence (which corresponds to increasing height); however, we cannot tell
from the values how much bigger, say, a medium is than a short. Other examples of ordinal
attributes include grade (e.g., A+, A, A-, B+, and so on).
4) Numeric Attributes: A numeric attribute is quantitative; that is, it is a measurable quantity,
represented in integer or real values. Numeric attributes can be interval-scaled or
ratio-scaled.
- Interval-Scaled Attributes: Interval-scaled attributes are measured on a scale of equal-
size units. The values of interval-scaled attributes have order and can be positive, 0, or
negative. E.g. Calendar date (2002 and 2010 are 8 years apart).
- Ratio-Scaled Attributes: A ratio-scaled attribute has an inherent zero point, so a value
can be expressed as a multiple (or ratio) of another value. In addition, the values are
ordered, and we can also compute the difference between values, as well as the mean,
median, and mode. E.g. Frequency of words in a document.
5) Discrete versus Continuous Attributes: A discrete attribute has a finite or countably
infinite set of values, which may or may not be represented as integers. The attributes
hair_color, smoker, and medical test each have a finite number of values, and so are discrete.
A continuous attribute has an infinite number of possible values. Continuous attributes are
typically represented as floating-point variables. E.g. The attribute Height having the
values 5.4, ..., 6.5, ..., etc.
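The practical difference between these attribute types is which operations on their values are meaningful. The short Python sketch below illustrates this; the variable names and sample values are illustrative assumptions based loosely on the examples in these notes.

```python
from enum import IntEnum


class Height(IntEnum):
    """Ordinal attribute: the ranks encode order only, not real distances."""
    SHORT = 1
    MEDIUM = 2
    TALL = 3


# Nominal: values are names, so only equality tests are meaningful.
hair_a, hair_b = "black", "brown"
same_hair = hair_a == hair_b

# Binary: a nominal attribute with two states (1 = present, 0 = absent).
smoker = 1
is_smoker = smoker == 1

# Ordinal: comparisons are meaningful, but TALL - MEDIUM is not a real
# measurement, because the gap between successive ranks is unknown.
taller = Height.TALL > Height.MEDIUM

# Interval-scaled: differences are meaningful (calendar years).
year_gap = 2010 - 2002

# Ratio-scaled: there is a true zero, so ratios are meaningful (word counts).
word_count_ratio = 20 / 5  # one word occurs 4 times as often as the other

print(same_hair, is_smoker, taller, year_gap, word_count_ratio)
```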
Statistical Description of Data
The basic statistical description of data can be used to identify properties of the data and
highlight which data values should be treated as noise or outliers. Basic statistical descriptions
include Measure of Central Tendency and Measure of Dispersion.
Measure of Central Tendency
A measure of central tendency locates the middle or center of a data distribution.
Measures of central tendency include the mean, median, mode, and midrange.
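+ Mean: The mean is the most common and effective numeric measure of the “center” of a set
of data. Let $x_1, x_2, \ldots, x_N$ be the set of N observed values for a numeric attribute X.
The mean of this set of values is
$$\bar{x} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
If each $x_i$ is associated with a weight $w_i$ for $i = 1, \ldots, N$, then the weighted mean is
$$\bar{x} = \frac{\sum_{i=1}^{N} w_i x_i}{\sum_{i=1}^{N} w_i} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_N x_N}{w_1 + w_2 + \cdots + w_N}$$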
+ Median: For skewed data, a better measure of the center is the median, which is the middle
value in a set of ordered data values. It is the value that separates the higher half of a data
set from the lower half.
Suppose that a given data set of N values for an attribute X is sorted in increasing order. If
N is odd, then the median is the middle value of the ordered set. If N is even, then the
median is not unique; it is the two middlemost values and any value in between. If X is a
numeric attribute in this case, by convention, the median is taken as the average of the two
middlemost values.
+ Mode: The mode for a set of data is the value that occurs most frequently in the set.
Therefore, it can be determined for qualitative and quantitative attributes. Data sets with
one, two, or three modes are respectively called unimodal, bimodal, and trimodal. In
general, a data set with two or more modes is multimodal. At the other extreme, if each
data value occurs only once, then there is no mode.
For unimodal numeric data that are moderately skewed, we have the following empirical relation:
$$\text{mean} - \text{mode} \approx 3 \times (\text{mean} - \text{median})$$
+ Midrange: The midrange can also be used to assess the central tendency of a numeric data
set. It is the average of the largest and smallest values in the set.
Example:
Suppose the values are 30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110.
> Mean ($\bar{x}$) = (30 + 36 + 47 + 50 + 52 + 52 + 56 + 60 + 63 + 70 + 70 + 110)/12 = 696/12 = 58
> Median = (52 + 56)/2 = 54
> Mode: The given data are bimodal. The two modes are 52 and 70.
> Midrange = (30 + 110)/2 = 70
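The worked example above can be checked with Python's standard `statistics` module. This is a small verification sketch; `weighted_mean` is a hypothetical helper written here for illustration, and the weights passed to it are made up.

```python
import statistics

values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]

mean = statistics.mean(values)             # 696 / 12
median = statistics.median(values)         # average of the two middle values
modes = statistics.multimode(values)       # all most-frequent values
midrange = (min(values) + max(values)) / 2


def weighted_mean(xs, ws):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)


print(mean, median, modes, midrange)
# Hypothetical weights: the second value counts three times as much as the first.
print(weighted_mean([8, 10], [1, 3]))
```

`statistics.multimode` requires Python 3.8 or later and returns every value that shares the highest frequency, which makes it suitable for bimodal data like this set.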
Measure of Dispersion
Measures of dispersion indicate how much the observed data is spread out around a measure
of central tendency. The measures include range, quantiles, quartiles, percentiles, and the
interquartile range. Variance and standard deviation also indicate the spread of a data
distribution.
+ Range: The range of the set is the difference between the largest (max()) and smallest
(min()) values. Example: 1, 3, 5, 6, 7 => Range = 7 - 1 = 6
+ Quantiles: Suppose that the data for attribute X are sorted in increasing numeric order.
Quantiles are points taken at regular intervals of a data distribution, dividing it into
essentially equal-size consecutive sets.
- The 2-quantile is the data point dividing the lower and upper halves of the data
distribution. It corresponds to the median.
+ Quartiles: The 4-quantiles are the three data points that split the data distribution into four
equal parts; each part represents one-fourth of the data distribution. They are more
commonly referred to as quartiles.
+ Percentiles: The 100-quantiles are more commonly referred to as percentiles; they divide
the data distribution into 100 equal-sized consecutive sets.
Fig: 25th percentile, median, and 75th percentile of a data distribution
+ Interquartile Range: The distance between the first (25th percentile) and third (75th
percentile) quartiles is called the interquartile range (IQR).
$$IQR = Q_3 - Q_1$$
+ Variance: The variance of N observations, $x_1, x_2, \ldots, x_N$, for a numeric attribute X is
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \bar{x})^2$$
where $\bar{x}$ is the mean value of the observations.
+ Standard Deviation: The standard deviation, $\sigma$, of the observations is the square root of
the variance, $\sigma^2$. A low standard deviation means that the data observations tend to be very
close to the mean, while a high standard deviation indicates that the data are spread out
over a large range of values.
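Example:
Marks: 8, 10, 15, 20
> Mean of marks ($\bar{x}$) = (8 + 10 + 15 + 20)/4 = 13.25
> Variance ($\sigma^2$) = [(8 - 13.25)^2 + (10 - 13.25)^2 + (15 - 13.25)^2 + (20 - 13.25)^2]/4 = 86.75/4 ≈ 21.69
> Standard Deviation ($\sigma$) = √21.69 ≈ 4.66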
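The dispersion measures above can also be computed with Python's standard `statistics` module. This is a minimal sketch: `pvariance` and `pstdev` divide by N, matching the variance formula in these notes, while the quartile values depend on the interpolation convention used by `statistics.quantiles` (its default "exclusive" method is assumed here), so other tools may report slightly different quartiles.

```python
import math
import statistics

marks = [8, 10, 15, 20]

data_range = max(marks) - min(marks)
variance = statistics.pvariance(marks)   # population variance: divides by N
std_dev = statistics.pstdev(marks)       # square root of the population variance

# Quartiles and interquartile range for the 12-value data set used in the
# central tendency example.
values = [30, 36, 47, 50, 52, 52, 56, 60, 63, 70, 70, 110]
q1, q2, q3 = statistics.quantiles(values, n=4)
iqr = q3 - q1

print(f"range={data_range}, variance={variance:.4f}, std_dev={std_dev:.4f}")
print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
```

The second quartile returned here coincides with the median, which is consistent with the 2-quantile definition given earlier.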
Applications of Data Mining
Data mining can be applied in almost every field. Some of the major applications of data mining
are briefly discussed below.
1. Market Analysis and Management
Listed below are the various fields of marketing where data mining is used:
+ Customer Profiling: Data mining helps determine what kind of people buy what kind
of products.
+ Identifying Customer Requirements: Data mining helps in identifying the best
products for different customers. It uses prediction to find the factors that may attract
new customers.
+ Cross Market Analysis: Data mining performs association/correlation analysis between
product sales.
+ Target Marketing: Data mining helps to find clusters of model customers who share
the same characteristics such as interests, spending habits, income, etc.
+ Determining Customer Purchasing Patterns: Data mining helps in determining
customer purchasing patterns.
+ Providing Summary Information: Data mining provides us with various multidimensional
summary reports.
2. Corporate Analysis and Risk Management
Data mining is used in the following fields of the corporate sector:
+ Finance Planning and Asset Evaluation: It involves cash flow analysis and prediction, and
contingent claim analysis to evaluate assets.
+ Resource Planning: It involves summarizing and comparing the resources and
spending.
+ Competition: It involves monitoring competitors and market directions.
3. Fraud Detection
Data mining is also used in the fields of credit card services and telecommunication to
detect fraud. For fraudulent telephone calls, it helps to find the destination of the call, duration
of the call, time of the day or week, etc. It also analyzes the patterns that deviate from
expected norms.
4. Intrusion Detection
Data mining can help improve intrusion detection by adding a level of focus to anomaly
detection. It helps an analyst to distinguish unusual activity from common everyday network
activity.
5. Web Search Engines
Web search engines are essentially very large data mining applications. Various data
mining techniques are used in all aspects of search engines, ranging from crawling
and indexing to searching.
6. Social Web and Networks
There is a growing number of highly popular user-centric applications such as blogs,
wikis and Web communities that generate a lot of structured and semi-structured
information. In these applications, data mining can be used to explain and predict the
evolution of social networks, personalize search for social interaction, predict user
behavior, etc.
7. Space Science
Data mining can be used to automate the analysis of image data collected from sky surveys
with better accuracy.
Please let me know if I missed anything or
anything is incorrect.
poudeljayanta99@gmail.com