Module 2 (A) - Introduction To Data Mining

The document provides an introduction to data mining, discussing its motivation, definitions, and the types of data and patterns involved. It outlines the evolution of database technology and the knowledge discovery process, emphasizing the importance of data mining in various applications such as market analysis, risk management, and fraud detection. Key data mining functions include classification, clustering, and outlier analysis, while also addressing the interdisciplinary nature of the field and the major issues faced in data mining.

Uploaded by

ahujaayush973

Chp-1: Introduction to Data Mining

Data Mining: Concepts and Techniques
August 20, 2025
Chapter 1. Introduction

 Motivation: Why data mining?


 What is data mining?
 Data Mining: On what kind of data?
 Kind of patterns to be mined
 Technologies used
 Major issues in data mining

Why Data Mining?
 The explosive growth of data: from terabytes to petabytes (1 petabyte = 1000 terabytes)
 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 We are drowning in data, but starving for knowledge!
 “Necessity is the mother of invention”: data mining is the automated analysis of massive data sets
Evolution of Database
Technology
 1960s:
 Data collection, database creation and network DBMS
 1970s:
 Relational data model, relational DBMS implementation
 1980s:
 RDBMS, advanced data models (extended-relational, OO, deductive,
etc.)
 Application-oriented DBMS (spatial, scientific, engineering, etc.)
 1990s:
 Data mining, data warehousing, multimedia databases, and Web
databases
 2000s:
 Stream data management and mining
 Data mining and its applications
 Web technology (XML, data integration) and global information
systems
Evolution of database system technology
What Is Data Mining?

 Data mining (knowledge discovery from data)


 Extraction of interesting (non-trivial, implicit, previously
unknown, and potentially useful) patterns or knowledge
from huge amounts of data
 Data mining: a misnomer?
 Alternative names
 Knowledge discovery (mining) in databases (KDD),
knowledge extraction, data/pattern analysis, data
archeology, data dredging, information harvesting,
business intelligence, etc.

Decision making

Knowledge Discovery (KDD) Process

 Data mining: core of the knowledge discovery process

[Figure: the KDD process as a sequence of stages, from bottom to top: Databases, Data Cleaning and Data Integration, Data Warehouse, Selection & Transformation, Data Mining, Pattern Evaluation]
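The stages of the KDD process can be sketched as a chain of small functions. This is only an illustrative toy, not a real system: the two store data sets, the function names, and the `min_count` threshold are all assumptions made up for the example.

```python
from collections import Counter

# Hypothetical raw sources; names and values are illustrative only.
store_a = [{"customer": "c1", "items": ["milk", "bread"]},
           {"customer": "c2", "items": ["milk", None]}]
store_b = [{"customer": "c3", "items": ["bread", "milk"]},
           {"customer": "c1", "items": ["milk", "bread"]}]

def clean(records):
    """Data cleaning: drop missing item entries."""
    return [{**r, "items": [i for i in r["items"] if i is not None]}
            for r in records]

def integrate(*sources):
    """Data integration: combine several sources into one collection."""
    return [r for source in sources for r in source]

def select_and_transform(records):
    """Selection and transformation: keep only the item sets, as sorted tuples."""
    return [tuple(sorted(r["items"])) for r in records]

def mine(baskets):
    """Data mining: count how often each basket occurs."""
    return Counter(baskets)

def evaluate(patterns, min_count):
    """Pattern evaluation: keep patterns seen at least min_count times."""
    return {p: n for p, n in patterns.items() if n >= min_count}

baskets = select_and_transform(integrate(clean(store_a), clean(store_b)))
interesting = evaluate(mine(baskets), min_count=2)
print(interesting)
```

Each function corresponds to one stage of the figure, so a stage can be swapped out (for example, a different mining step) without touching the others.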
Data Mining in Business Intelligence

[Figure: business intelligence pyramid. The potential to support business decisions increases from bottom to top. Layers from top to bottom, with the typical role for each:]
 Decision Making (End User)
 Data Presentation: Visualization Techniques (Business Analyst)
 Data Mining: Information Discovery (Data Analyst)
 Data Exploration: Statistical Summary, Querying, and Reporting (Data Analyst)
 Data Preprocessing/Integration, Data Warehouses (DBA)
 Data Sources: Paper, Files, Web documents, Scientific experiments, Database Systems (DBA)
Why Data Mining?—Potential
Applications
 Data analysis and decision support
 Market analysis and management: target marketing, customer relationship management (CRM), market basket analysis, cross selling
 Risk analysis and management: forecasting, customer retention, quality control, competitive analysis
 Fraud detection and detection of unusual patterns (outliers)
 Other applications
 Text mining (newsgroups, email, documents) and Web mining
 Stream data mining
 Bioinformatics and bio-data analysis
Ex. 1: Market Analysis and
Management
 Where does the data come from?—Credit card
transactions, loyalty cards, discount coupons,
customer complaint calls, plus (public) lifestyle
studies

 Target marketing

Find clusters of “model” customers who share the same
characteristics: interest, income level, spending habits,
etc.

Determine customer purchasing patterns over time
 Cross-market analysis: find associations/correlations
between product sales, and predict based
on such associations
 Customer profiling—What types of customers buy
what products (clustering or classification)
Ex. 1: Market Analysis and
Management
 Customer requirement analysis

Identify the best products for different groups of
customers

Predict what factors will attract new customers
 Provision of summary information

Multidimensional summary reports

Statistical summary information (data central tendency
and variation)

Data Mining: On What Kinds of
Data?
 Database-oriented data sets and applications
 Relational database, data warehouse, transactional database
 Advanced data sets and advanced applications
 Data streams and sensor data
 Time-series data, temporal data, sequence data (incl. bio-
sequences)
 Structured data, graphs, social networks and multi-linked data
 Object-relational databases
 Heterogeneous databases and legacy databases
 Spatial data and spatiotemporal data (geographical data)
 Multimedia databases
 Text databases
 The World-Wide Web
Data Mining: On What Kinds of
Data?
 Mining relational databases

 E.g., analyze customer data to predict the credit
risk of new customers based on their income,
age, and previous credit information.
 Data warehouses
 E.g., sales per item type per branch for the third quarter.
 Data is stored to provide information from a
historical perspective, e.g., summarized data for
the past 6 to 12 months.
 Modeled by a multidimensional data structure
called a data cube.
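A data cube can be pictured as a table of aggregated cells keyed by dimension values. The sketch below is a toy under stated assumptions: the `sales` records, the branch names, and the helper functions `build_cube` and `roll_up` are all hypothetical, and a real warehouse would use an OLAP engine, not a Python dictionary.

```python
from collections import defaultdict

# Hypothetical sales facts: (item_type, branch, quarter, amount).
sales = [
    ("computer", "Mumbai", "Q3", 400),
    ("computer", "Delhi", "Q3", 300),
    ("printer", "Mumbai", "Q3", 100),
    ("computer", "Mumbai", "Q2", 250),
    ("printer", "Mumbai", "Q3", 50),
]

def build_cube(facts):
    """Sum the measure for every (item_type, branch, quarter) cell."""
    cube = defaultdict(int)
    for item_type, branch, quarter, amount in facts:
        cube[(item_type, branch, quarter)] += amount
    return dict(cube)

def roll_up(cube, keep):
    """Aggregate away dimensions; keep lists the dimension positions to retain."""
    result = defaultdict(int)
    for cell, total in cube.items():
        result[tuple(cell[i] for i in keep)] += total
    return dict(result)

cube = build_cube(sales)

# "Sales per item type per branch for the third quarter": slice on quarter.
q3_by_item_and_branch = {
    (item, branch): total
    for (item, branch, quarter), total in cube.items()
    if quarter == "Q3"
}

# Roll up to item type only (summing over all branches and quarters).
by_item = roll_up(cube, keep=[0])
print(q3_by_item_and_branch)
print(by_item)
```

The slice answers the third-quarter query from the slide, while `roll_up` shows how the same cube yields coarser summaries.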

Data Mining: On What Kinds of
Data?
 Transactional data

 E.g., analyze which items sell well together.
 Printers are normally purchased together with
computers.

Kinds of Patterns to be mined

What Kinds of Patterns Can Be
Mined?

1) Generalization

2) Association and Correlation Analysis


3) Classification

4) Cluster Analysis

5) Outlier Analysis

Data Mining Function: (1)
Generalization
 Multidimensional concept description:
Characterization and discrimination

Generalize, summarize, and contrast data
characteristics, e.g., summarize the
characteristics of customers who spend more
than Rs. 50,000 a year at an electronics store

Data characterization is a summarization of
the general characteristics or features of a
target class of data
 Data cube technology for computing

OLAP (online analytical processing)
 Examples of output forms: pie charts, multidimensional data (MDD)
cubes, bar charts, curves, etc.
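Data characterization can be illustrated with a few lines of Python. This is a minimal sketch: the `customers` records, their fields, and the `characterize` helper are invented for the example, and only the Rs. 50,000 threshold comes from the slide.

```python
from collections import Counter
from statistics import mean

# Hypothetical customer records; spend is yearly spending in rupees.
customers = [
    {"age": 34, "occupation": "engineer", "spend": 72000},
    {"age": 41, "occupation": "doctor", "spend": 65000},
    {"age": 30, "occupation": "engineer", "spend": 58000},
    {"age": 52, "occupation": "teacher", "spend": 12000},
    {"age": 23, "occupation": "student", "spend": 8000},
]

def characterize(rows, threshold=50000):
    """Summarize the target class: customers spending more than threshold."""
    target = [r for r in rows if r["spend"] > threshold]
    return {
        "count": len(target),
        "mean_age": mean(r["age"] for r in target),
        "occupations": Counter(r["occupation"] for r in target),
    }

summary = characterize(customers)
print(summary)
```

The resulting counts and averages are the kind of figures that would feed a pie chart or bar chart of the target class.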

Data Mining Function: (1)
Generalization contd.
 Data discrimination is a comparison of the general
features of the target class data objects against the
general features of objects from one or multiple
contrasting classes.
 E.g., compare two groups of customers: those who
shop for computer products regularly (more than
twice a month) and those who rarely shop for such
products (fewer than three times a year)
 Data cube technology for computing
 Drill down on any dimension
 Discriminant rules: discrimination descriptions
expressed in the form of rules
 Output forms: same as those of data characterization,
along with discrimination descriptions
Data Mining Function: (2) Association
and Correlation Analysis
 Frequent patterns (or frequent itemsets)

 What items are frequently purchased together in
your mart? E.g., milk and bread
 Association, correlation vs. causality
 A typical association rule:
 computer → software [1%, 50%] (support,
confidence)
 A confidence of 50% means that if a customer buys a computer,
there is a 50% chance that the customer will buy software too.
A support of 1% means that 1% of all transactions under analysis
show computer and software purchased together.
 Association rules are discarded as uninteresting if
they do not satisfy both a minimum support
threshold and a minimum confidence
threshold
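Support and confidence can be computed directly from a list of transactions. The sketch below is illustrative only: the `transactions` data, the threshold values, and the `strong_rules` helper (restricted to single-item rules) are assumptions for the example. A practical miner would use a frequent-itemset algorithm instead of enumerating every pair.

```python
from itertools import permutations

# Hypothetical transactions; each is the set of items in one purchase.
transactions = [
    {"computer", "software"},
    {"computer", "printer"},
    {"milk", "bread"},
    {"computer", "software", "printer"},
    {"milk", "bread", "butter"},
    {"computer"},
    {"bread"},
    {"milk", "bread"},
]

def support(itemset, baskets):
    """Fraction of all transactions that contain every item in itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent, baskets):
    """Fraction of transactions with the antecedent that also hold the consequent."""
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)

def strong_rules(baskets, min_support, min_confidence):
    """Keep single-item rules A -> B that meet both thresholds."""
    items = sorted(set().union(*baskets))
    rules = {}
    for a, b in permutations(items, 2):
        s = support({a, b}, baskets)
        if s < min_support:
            continue  # discarded: fails the minimum support threshold
        c = confidence({a}, {b}, baskets)
        if c >= min_confidence:
            rules[(a, b)] = (s, c)
    return rules

print(support({"computer", "software"}, transactions))       # 0.25
print(confidence({"computer"}, {"software"}, transactions))  # 0.5
print(strong_rules(transactions, min_support=0.25, min_confidence=0.7))
```

In this toy data, computer → software has 25% support and 50% confidence, so it passes the support threshold but is discarded by the 70% confidence threshold.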
Data Mining Function: (3)
Classification

 Classification and label prediction



Construct models (functions) based on some training examples

Describe and distinguish classes or concepts for future
prediction

E.g., classify countries based on climate, or classify cars
based on gas mileage

Predict some unknown class labels
 Typical methods

Decision trees, naïve Bayesian classification, support vector
machines, neural networks, rule-based classification, pattern-
based classification, logistic regression, …
 Typical applications:

Credit card fraud detection, direct marketing, classifying stars,
diseases, web-pages, …
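The idea of constructing a model from training examples and then predicting unknown labels can be shown with a one-level decision tree (a "decision stump"). This is a deliberately tiny sketch: the mileage values, the label names, and the `train_stump` function are assumptions for illustration, and it expects at least two distinct feature values.

```python
def train_stump(examples):
    """Learn a one-feature threshold rule from (value, label) pairs.

    Tries every midpoint between adjacent distinct values and every
    assignment of labels to the two sides, keeping the rule with the
    fewest training errors.
    """
    values = sorted({x for x, _ in examples})
    labels = sorted({y for _, y in examples})
    best = None
    for lo, hi in zip(values, values[1:]):
        threshold = (lo + hi) / 2
        for below in labels:
            for above in labels:
                errors = sum(
                    (below if x <= threshold else above) != y
                    for x, y in examples
                )
                if best is None or errors < best[0]:
                    best = (errors, threshold, below, above)
    _, threshold, below, above = best
    return lambda x: below if x <= threshold else above

# Hypothetical training set: (gas mileage in km per litre, class label).
training = [
    (12, "low efficiency"), (14, "low efficiency"), (15, "low efficiency"),
    (28, "high efficiency"), (30, "high efficiency"), (35, "high efficiency"),
]

predict = train_stump(training)
print(predict(18))  # low efficiency
print(predict(25))  # high efficiency
```

The learned rule splits at 21.5, the midpoint between the two groups, which is the only threshold that classifies every training example correctly.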
Various forms of a classification
model

Data Mining Function: (4) Cluster
Analysis
 Unsupervised learning (i.e., Class label is
unknown)
 Group data to form new categories (i.e.,
clusters), e.g., cluster houses to find
distribution patterns

 Data objects are clustered or grouped
based on the principle of maximizing
intraclass similarity and minimizing
interclass similarity
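One common way to group unlabeled objects is k-means, shown here as a minimal sketch. The slides do not name a specific algorithm, so k-means is an assumed choice, as are the `houses` coordinates and the naive initialisation from the first `k` points.

```python
from math import dist

def k_means(points, k, max_iter=100):
    """Group points into k clusters by repeatedly reassigning and re-centering."""
    centroids = list(points[:k])  # assumption: naive initialisation
    labels = None
    for _ in range(max_iter):
        # Assign each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: dist(p, centroids[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the clustering has converged
        labels = new_labels
        # Move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids

# Hypothetical house locations (x, y) forming two visible groups.
houses = [(1, 1), (8, 8), (1, 2), (9, 8), (2, 1), (8, 9)]

labels, centroids = k_means(houses, k=2)
print(labels)  # [0, 1, 0, 1, 0, 1]
```

No class labels are given to the algorithm; the two groups emerge purely from the distances between points, which is what makes this unsupervised.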

Data Mining Function: (4) Cluster
Analysis

Data Mining Function: (5) Outlier
Analysis
 Outlier analysis (anomaly mining)
 Outlier: A data object that does not comply
with the general behaviour of the data
 Noise or exception? One person’s
garbage could be another person’s treasure
 Methods: by-product of clustering or
regression analysis, …
 Useful in fraud detection, rare events
analysis
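As a small illustration of flagging objects that do not comply with the general behaviour of the data, the sketch below uses a median-based "modified z-score" rule. This is an assumed statistical technique, not one of the methods named on the slide, and the `amounts` values, the function name, and the conventional 0.6745 and 3.5 constants are choices made for the example.

```python
from statistics import median

def mad_outliers(values, threshold=3.5):
    """Flag values whose modified z-score exceeds threshold.

    The score measures distance from the median in units of the median
    absolute deviation (MAD), so a single extreme value cannot hide itself
    by inflating the spread estimate.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread to measure against; a sketch-level simplification
    return [v for v in values if 0.6745 * abs(v - med) / mad > threshold]

# Hypothetical card transaction amounts, with one suspicious purchase.
amounts = [120, 135, 128, 140, 132, 125, 5000]

print(mad_outliers(amounts))  # [5000]
```

Whether the flagged value is noise or a genuine case of fraud is a separate judgment that the rule itself cannot make.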

Are All the “Discovered” Patterns
Interesting?
 Data mining may generate thousands of patterns: Not all of
them are interesting
 Suggested approach: Human-centered, query-based, focused
mining
 Interestingness measures
 A pattern is interesting if it is easily understood by humans, valid
on new or test data with some degree of certainty, potentially
useful, novel, or validates some hypothesis that a user seeks to
confirm
 Objective vs. subjective interestingness measures
 Objective: based on statistics and structures of patterns, e.g.,
support, confidence, etc.
 Subjective: based on user’s belief in the data, e.g., a large
earthquake often follows a cluster of small earthquakes.
Find All and Only Interesting
Patterns?
 Find all the interesting patterns: Completeness
 Can a data mining system find all the interesting patterns?
Do we need to find all of the interesting patterns?
 Association vs. classification vs. clustering
 Search for only interesting patterns: An optimization problem
 Can a data mining system find only the interesting
patterns?
 Approaches

First generate all the patterns and then filter out the
uninteresting ones

Generate only the interesting patterns

Technologies Used

 As a highly application-driven domain, data mining has
incorporated many techniques from other domains

 The interdisciplinary nature of data mining research and
development contributes significantly to the success of data
mining and its extensive applications
Data Mining: Confluence of Multiple
Disciplines

[Figure: Data Mining at the center, surrounded by the contributing disciplines: Machine Learning, Pattern Recognition, Statistics, Applications, Visualization, Algorithm, Database Technology, High-Performance Computing]
Data Mining: Confluence of Multiple
Disciplines

 Statistics

Statistical models are widely used to model data
and data classes.

E.g., we can use statistics to model noise and
missing data.
 Machine learning

Computer programs automatically learn to
recognize complex patterns and make intelligent
decisions based on data.

E.g., recognizing handwritten postal codes
Why Confluence of Multiple
Disciplines?
 Tremendous amount of data
 Algorithms must be highly scalable to handle, e.g., terabytes
of data
 High-dimensionality of data
 Micro-array may have tens of thousands of dimensions
 High complexity of data
 Data streams and sensor data
 Time-series data, temporal data, sequence data
 Structured data, graphs, social networks and multi-linked
data
 Heterogeneous databases and legacy databases
 Spatial, spatiotemporal, multimedia, text and Web data
 Software programs, scientific simulations
 New and sophisticated applications
Major Issues in Data Mining
 Mining methodology
 Mining different kinds of knowledge from diverse data
types, e.g., files in PDF or DOC format
 Mining knowledge in multi-dimensional space
 Data mining: an interdisciplinary effort (e.g., mining data with
natural language text)
 Pattern evaluation: the interestingness problem
 Handling noise, uncertainty, and incompleteness of data
 Integration of the discovered knowledge with existing
knowledge: knowledge fusion
 Pattern evaluation and pattern- or constraint-guided
mining

Major Issues in Data Mining
(1)
 User interaction
 Interactive mining (dynamically change the focus of the search)
 Incorporation of background knowledge (constraints, rules)
 Presentation and visualization of data mining results
 Efficiency and scalability
 Efficiency and scalability of data mining algorithms (run time
should be predictable, short, and acceptable)
 Parallel, distributed, stream, and incremental mining
methods
 Diversity of data types
 Handling complex types of data (from simple to temporal data
objects)
 Mining dynamic, networked, and global data repositories
Major Issues in Data Mining
(2)

 Data mining and society


 Social impacts of data mining(benefit to society)
 Privacy-preserving data mining
 Invisible data mining (systems have built-in data mining functions,
e.g., triggered by a mouse click)

Architecture: Typical Data Mining
System

[Figure: layered architecture, from top to bottom]
 Graphical User Interface
 Pattern Evaluation
 Data Mining Engine
 Database or Data Warehouse Server
 Data cleaning, integration, and selection
 Data sources: Database, Data Warehouse, World-Wide Web, Other Info Repositories
 Knowledge Base (shown beside the Pattern Evaluation and Data Mining Engine layers)
Summary

 Data mining: Discovering interesting patterns from large
amounts of data
 A natural evolution of database technology, in great demand,
with wide applications
 A KDD process includes data cleaning, data integration, data
selection, transformation, data mining, pattern evaluation, and
knowledge presentation
 Mining can be performed in a variety of information repositories
 Data mining functionalities: characterization, discrimination,
association, classification, clustering, outlier and trend analysis,
etc.
 Data mining systems and architectures
 Major issues in data mining