8530521.doc Created by Chethan.
Data Mining
Goal of Data Mining
Simplification and automation of the overall statistical process, from data source(s) to
model application
Changed over the years
— Replace statistician → better models, less grunge work
— Many different data mining algorithms / tools available
— Statistical expertise required to compare different techniques
— Build intelligence into the software
Data Mining Is…
Decision Trees
Nearest Neighbor Classification
Neural Networks
Rule Induction
K-means Clustering
Data Mining is Not...
Data warehousing
SQL / Ad Hoc Queries / Reporting
Software Agents
Online Analytical Processing (OLAP)
Data Visualization
Why data-mining now?
Data mining is an increasingly popular topic
(If the number of new textbooks is anything to go by).
Two main reasons:
With computers now mediating most aspects of our lives, there has been a
large increase in the accumulation of electronic data.
With computers being increasingly up to the demands of complex modeling, it
is getting easier to process larger datasets.
Why Mine Data? Commercial Viewpoint
Data volumes are too large for classical analysis approaches:
Large number of records
High dimensional data
Leverage organization’s data assets
Only a small portion of the collected data is ever analyzed
Data that may never be analyzed continues to be collected, at great
expense, for fear that something which may prove important in the
future would otherwise be missed.
Lots of data is being collected and warehoused
Web data, e-commerce
ISiM
purchases at department/grocery stores
Bank/Credit Card transactions
Computers have become cheaper and more powerful
Competitive Pressure is Strong
Provide better, customized services for an edge (e.g. in Customer
Relationship Management)
Scientific Viewpoint
Data collected and stored at enormous speeds (GB/hour)
remote sensors on a satellite
telescopes scanning the skies
microarrays generating gene expression data
scientific simulations generating terabytes of data
Traditional techniques infeasible for raw data
Data mining may help scientists
In classifying and segmenting data
In Hypothesis Formation
Origins of Data Mining
Draws ideas from machine learning/AI, pattern recognition, statistics, and
database systems
Traditional Techniques may be unsuitable due to
Enormity of data
High dimensionality of data
Heterogeneous, distributed nature of data
Mining Large Data Sets - Motivation
There is often information “hidden” in the data that is not readily evident
Human analysts may take weeks to discover useful information
Much of the data is never analyzed at all
What is Data Mining? ----- Many Definitions
Data processing using sophisticated data search capabilities and statistical
algorithms to discover patterns and correlations in large preexisting databases; a
way to discover new meaning in data.
Non-trivial extraction of implicit, previously unknown and potentially useful
information from data.
Exploration & analysis, by automatic or semi-automatic means, of large quantities
of data in order to discover meaningful patterns.
The process of identifying commercially useful patterns or relationships in
databases or other computer repositories through the use of advanced statistical
tools.
The automated extraction of predictive information from (large) databases.
A step in the knowledge discovery process consisting of particular algorithms
(methods) that, under some acceptable objective, produce a particular
enumeration of patterns (models) over the data.
Data mining is the process of discovering interesting knowledge from large amounts
of data stored either in databases, data warehouses, or other information
repositories.
What is (not) Data Mining?
What is Data Mining?
Certain names are more prevalent in certain US locations (O’Brien, O’Rourke,
O’Reilly… in the Boston area)
Group together similar documents returned by a search engine according to their
context (e.g. Amazon rainforest, Amazon.com)
What is not Data Mining?
Look up a phone number in a phone directory
Query a Web search engine for information about “Amazon”
[Figure: Data Mining shown at the overlap of Statistics/AI, Machine Learning/
Pattern Recognition, and Database systems]
Data Mining Types:
Predictive data mining: This produces a model of the system described by the
given data. It uses some variables or fields in the data set to predict unknown
or future values of other variables of interest.
Descriptive data mining: This produces new, nontrivial information based on
the available data set. It focuses on finding patterns describing the data that
can be interpreted by humans.
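The two types can be contrasted on a tiny example. The sketch below is only an illustration under stated assumptions: the (age, salary) records are hypothetical, the descriptive step is reduced to a single correlation coefficient, and the predictive step to a least-squares line. Real descriptive and predictive mining would use richer patterns and models.

```python
from math import sqrt
from statistics import mean

# Hypothetical records (not from the notes): (age, salary in thousands).
records = [(25, 28.0), (30, 33.0), (35, 38.0), (40, 43.0)]

def correlation(rows):
    """Descriptive: a human-readable pattern, here the Pearson correlation."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in rows)
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

def fit_line(rows):
    """Predictive: least-squares line y = slope * x + intercept."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in rows) / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def predict(model, x):
    """Use the fitted line to estimate an unseen value."""
    slope, intercept = model
    return slope * x + intercept

pattern = correlation(records)        # describes the data already collected
model = fit_line(records)             # a model built from that data
estimate = predict(model, 50)         # a value for an age not in the data
```

The descriptive result says something about the records that already exist; the predictive result is used on a value that does not appear in them.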
Defining `data'
By `data', we mean sets of variable values, e.g.,
Annual rainfall in Sussex for the last twenty years;
Age, salary and IQ for all members of Sussex faculty.
Records
Values are organised in combinations called records.
Each record has a particular context, e.g., age, salary and IQ specifically for the
Informatics HoD.
Combinations may also be called vectors (esp. in neural-networks) and data-points
(esp. in statistics).
A single record is a datum.
Tabulation
Data are often presented in a tabulated form, with one datum per row, and one
variable per column.
NAME     AGE   SALARY   IQ
smith    42    36K      130
bloggs   29    30K      140
bush     50    60K      120
...
Where data are used for prediction, the to-be-predicted variable normally appears in
the final column (and is often called `class').
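As a rough sketch of how such a table might be held in a program, the block below stores each row as one record and separates the final column from the rest. Treating IQ as the to-be-predicted variable is an assumption made only because it is the last column of the sample table; the helper names are hypothetical.

```python
# Column names and rows copied from the sample table; one datum per row.
COLUMNS = ["NAME", "AGE", "SALARY", "IQ"]
ROWS = [
    ["smith", 42, "36K", 130],
    ["bloggs", 29, "30K", 140],
    ["bush", 50, "60K", 120],
]

def parse_salary(text):
    """Convert a string such as '36K' into a plain integer (36000)."""
    return int(text.rstrip("K")) * 1000

def to_records(columns, rows):
    """Turn each row into a record mapping variable name -> value."""
    return [dict(zip(columns, row)) for row in rows]

def split_features_and_class(columns, row):
    """Separate the final column (the 'class') from the other variables."""
    *features, target = row
    return dict(zip(columns[:-1], features)), target

records = to_records(COLUMNS, ROWS)
features, target = split_features_and_class(COLUMNS, ROWS[0])
```

Here `records[0]` is the datum for smith, and `features`/`target` show the same row split into predictor variables and the final-column value.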
Basic data-types
Data may be classified according to the number and character of variables involved.
• Univariate/discrete: one variable with integer/symbolic values.
• Univariate/continuous: one variable with real/continuous values.
• Multivariate/discrete: more than one variable with integer/symbolic values.
• Multivariate/continuous: more than one variable with real/continuous values.
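The four categories above can be sketched as a small labelling function. This is an illustration only: it assumes a dataset is given as a non-empty mapping from variable name to values, treats any Python float as "continuous" and everything else (integers, strings) as "discrete", and adds a "mixed" label, which is not one of the four categories in the notes, for datasets that combine both.

```python
def variable_kind(values):
    """'continuous' if any value is a float, otherwise 'discrete'."""
    return "continuous" if any(isinstance(v, float) for v in values) else "discrete"

def dataset_kind(columns):
    """Label a dataset given as {variable name: list of values}."""
    arity = "univariate" if len(columns) == 1 else "multivariate"
    kinds = {variable_kind(values) for values in columns.values()}
    # "mixed" is an extra label (an assumption) for discrete + continuous data.
    character = kinds.pop() if len(kinds) == 1 else "mixed"
    return f"{arity}/{character}"

# Hypothetical values for illustration.
rainfall = {"rainfall_mm": [801.5, 760.2, 912.0]}
faculty = {"age": [42, 29, 50], "iq": [130, 140, 120]}

rainfall_kind = dataset_kind(rainfall)
faculty_kind = dataset_kind(faculty)
```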
Data Mining Tasks...
Classification [Predictive]
Clustering [Descriptive]
Association Rule Discovery [Descriptive]
Sequential Pattern Discovery [Descriptive]
Regression [Predictive]
Deviation Detection [Predictive]
Classification: Definition
Given a collection of records (the training set):
Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other
attributes.
Goal: previously unseen records should be assigned a class as accurately as
possible.
A test set is used to determine the accuracy of the model. Usually, the given
data set is divided into training and test sets, with training set used to build
the model and test set used to validate it.
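The train/test procedure described above can be sketched as follows. Everything specific here is an assumption for illustration: the data are hypothetical (one numeric attribute, class "high" when the value is at least 10), the split is 70/30 with a fixed seed, and the model is a 1-nearest-neighbour classifier standing in for whatever model is actually built.

```python
import random

def train_test_split(records, test_fraction=0.3, seed=0):
    """Shuffle a copy of the records and divide it into (training, test)."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]

def predict_1nn(training, features):
    """Return the class of the training record closest to `features`."""
    def squared_distance(record):
        return sum((a - b) ** 2 for a, b in zip(record[0], features))
    return min(training, key=squared_distance)[1]

def accuracy(training, test):
    """Fraction of test records whose predicted class equals the true class."""
    correct = sum(predict_1nn(training, feats) == cls for feats, cls in test)
    return correct / len(test)

# Hypothetical records: ((attribute,), class).
records = [((float(x),), "high" if x >= 10 else "low") for x in range(20)]

training, test = train_test_split(records)
test_accuracy = accuracy(training, test)
print(f"{len(training)} training records, {len(test)} test records")
print(f"accuracy on the test set: {test_accuracy:.2f}")
```

The test records play no part in building the model, so the accuracy measured on them estimates how well previously unseen records would be classified.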
Explicit and implicit structure
A dataset is a body of data.
Any dataset has explicit structure, i.e., the numbers/values in the records.
Generally, there is also implicit structure.
Data mining is the task of identifying and modeling implicit structure, either as an end
in itself or as a means of obtaining new information.
Example: A-level grades
Dataset containing average A-level grades for the past ten years.
Explicit structure is the mapping between years and average grades.
(Explicit structure = `what you see')
There is also implicit structure: a gradual increase in values over time. (Average
grades are increasing by approximately 3% per year.)
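One way to make such a trend explicit is to estimate the average growth rate per year from the series itself. The sketch below assumes hypothetical grade values (an index starting at 100 and growing by exactly 3% each year) and uses the geometric-mean growth between the first and last value; the real A-level figures are not given in these notes.

```python
def average_annual_growth(values):
    """Geometric-mean growth rate per step between the first and last value."""
    if len(values) < 2:
        raise ValueError("need at least two values to estimate a trend")
    steps = len(values) - 1
    return (values[-1] / values[0]) ** (1 / steps) - 1

# Hypothetical ten-year series: explicit structure is the list of values,
# implicit structure is the growth rate hidden in them.
grades = [100 * 1.03 ** year for year in range(10)]

rate = average_annual_growth(grades)
print(f"estimated growth per year: {rate:.1%}")
```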
Classification Example
Attribute types: Refund (categorical), Marital Status (categorical),
Taxable Income (continuous), Cheat (class)

Training Set
Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set
Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set.
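To make the example concrete, the sketch below encodes the training and test records and applies one possible model to them. The decision tree is hand-written and is an assumption: the notes do not say which model is learned, and the 80K income threshold is simply one value that separates the training records. It is not the output of a learning algorithm.

```python
# Training records: (refund, marital_status, taxable_income_k, cheat).
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]

# Test records: the class (Cheat) is unknown.
TEST = [
    ("No", "Single", 75),
    ("Yes", "Married", 50),
    ("No", "Married", 150),
    ("Yes", "Divorced", 90),
    ("No", "Single", 40),
    ("No", "Married", 80),
]

def predict_cheat(refund, marital_status, income_k):
    """Hand-written decision tree (an assumption, not given in the notes)."""
    if refund == "Yes":
        return "No"
    if marital_status == "Married":
        return "No"
    # No refund, Single or Divorced: split on income at an assumed 80K.
    return "Yes" if income_k >= 80 else "No"

training_accuracy = sum(
    predict_cheat(r, m, i) == cheat for r, m, i, cheat in TRAINING
) / len(TRAINING)
test_predictions = [predict_cheat(*row) for row in TEST]

print(training_accuracy)
print(test_predictions)
```

This particular tree classifies all ten training records correctly. Because the true classes of the test records are unknown, its accuracy on the test set cannot be checked from the table alone.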
Challenges of Data Mining
Scalability
Dimensionality
Complex and Heterogeneous Data
Data Quality
Data Ownership and Distribution
Privacy Preservation
Streaming Data
Statistical methods
Case-based reasoning
Neural networks
Decision trees
DM & DW:
Data Warehousing + Data Mining = Increased performance of decision making
process + Knowledgeable decision makers
Data Mining Applications
Data Mining For Financial Data Analysis
Data Mining For Telecommunications Industry
Data Mining For The Retail Industry
Data Mining In Healthcare and Biomedical Research
Data Mining In Science and Engineering
References:
1. Tan, Steinbach, Kumar. Introduction to Data Mining.
2. Jiawei Han and Micheline Kamber. Data Mining: Concepts and Techniques.
3. Kurt Thearling, Ph.D. An Introduction to Data Mining. www.thearling.com