•Knowledge Discovery in Databases (KDD): process of finding useful
information and patterns in data.
•Data Mining: Use of algorithms to extract the information and patterns
derived by the KDD process.
•KDD process involves many steps
•Input is the data and output is the desired useful information
•Five steps of the KDD process
•Selection: obtain data from various databases.
•Pre-processing: the data may be incorrect or missing; wrong data must
be corrected or removed, and missing data must be supplied or predicted.
•Transformation: transformation techniques make the data easier to
mine and more useful, and provide more meaningful results.
•Data mining: applies algorithms to the transformed data to
generate the desired results.
•Interpretation/Evaluation: results are presented to users through
various visualization and GUI strategies.
Visualization refers to the visual representation of the data.
It includes the following techniques:
• Graphical: graphs and charts
• Geometric: box plots, scatter diagrams
• Icon-based: colors and icons
• Pixel-based: each data value is shown as a uniquely colored pixel
• Hierarchical: divides the screen into regions based on data values
• Hybrid: combines any of these methods
Any of these displays may be 2D or 3D.
KDD Process Example: Web Log
• Selection:
Select log data (dates and locations) to use
• Preprocessing:
Remove identifying URLs
Remove error logs
• Transformation:
Sessionize (sort and group)
• Data Mining:
Identify and count patterns
Construct data structure
• Interpretation/Evaluation:
Identify and display frequently accessed sequences.
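The web log steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the record layout (visitor, timestamp, page, status), the sample entries in LOG, the 30-minute SESSION_GAP, and the helper names are all hypothetical. Removal of identifying URLs is left out for brevity.

```python
from collections import Counter, defaultdict

# Hypothetical selected log records: (visitor_id, time_in_seconds, page, http_status)
LOG = [
    ("u1", 0, "/home", 200),
    ("u1", 30, "/products", 200),
    ("u1", 45, "/missing", 404),
    ("u1", 60, "/cart", 200),
    ("u2", 10, "/home", 200),
    ("u2", 50, "/products", 200),
    ("u1", 4000, "/home", 200),
    ("u1", 4020, "/products", 200),
]

SESSION_GAP = 1800  # assumed inactivity timeout (seconds) that ends a session


def preprocess(records):
    """Preprocessing: drop error entries (HTTP status 400 and above)."""
    return [r for r in records if r[3] < 400]


def sessionize(records, gap=SESSION_GAP):
    """Transformation: sort by visitor and time, then group into sessions."""
    by_visitor = defaultdict(list)
    for visitor, ts, page, _status in sorted(records, key=lambda r: (r[0], r[1])):
        by_visitor[visitor].append((ts, page))
    sessions = []
    for visits in by_visitor.values():
        current = [visits[0][1]]
        for (prev_ts, _), (ts, page) in zip(visits, visits[1:]):
            if ts - prev_ts > gap:  # long pause: close the session, start a new one
                sessions.append(current)
                current = []
            current.append(page)
        sessions.append(current)
    return sessions


def count_sequences(sessions, length=2):
    """Data mining: count each run of `length` consecutive pages."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return counts


sessions = sessionize(preprocess(LOG))
# Interpretation/Evaluation: display the most frequently accessed sequences
for seq, n in count_sequences(sessions).most_common(3):
    print(" -> ".join(seq), n)
```

A Counter keyed by page tuples serves as the "data structure" of the mining step; a real system would more likely use a trie or suffix structure for long sequences.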
Data mining draws on developments in several areas:
• Information retrieval (IR): similarity measures, hierarchical
clustering, IR systems, web search engines
• Databases (DB): relational data model, SQL, association rules,
data warehousing
• Algorithms: algorithm design, algorithm analysis, data structures
• Machine learning: neural networks, decision trees
• Statistics: regression, EM algorithm, k-means clustering, time
series analysis
Human interaction: technical experts are needed to formulate the queries
and to assist in interpreting the results.
Overfitting: occurs when a model fits the current data so closely that it
does not fit future data.
Outliers: data entries that do not fit the derived model.
Interpretation of results: an expert is needed to interpret the results
correctly.
Visualization of results: visualization is needed so that the results can
be viewed and understood easily.
Large datasets: massive datasets create problems when an algorithm designed
for smaller datasets is applied to them. Sampling can mitigate this.
High dimensionality: many attributes are involved and it is difficult to
determine which ones should be used (the dimensionality curse). The solution
is to reduce the number of attributes (dimensionality reduction).
Multimedia data: the different data types involved affect how algorithms
can be applied.
Missing data: during pre-processing, missing data must be replaced with
estimates.
Irrelevant data: some of the data may not be relevant to the mining task.
Noisy data: some values might be incorrect or invalid.
Changing data: databases cannot be assumed to be static.
Integration: integrating data mining functions into the database system is
important.
Application: the algorithms must be used effectively to obtain useful results.
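Three of the issues listed above (noisy data, missing data, and large datasets) have simple first-line remedies that can be sketched as follows. This is only an illustration under assumptions: the `ages` values, the valid range 0 to 120, mean imputation as the estimate, and the helper names are all hypothetical choices, and other estimators or sampling schemes may suit a given dataset better.

```python
import random
from statistics import mean

# Hypothetical attribute values: None marks a missing value, -5 is a noisy entry
ages = [34, None, 29, -5, 41, None, 38, 52, 47, 31]


def drop_noisy(values, low=0, high=120):
    """Noisy data: treat values outside an assumed valid range as missing."""
    return [v if v is None or low <= v <= high else None for v in values]


def impute_mean(values):
    """Missing data: replace each missing value with the mean of the known ones."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


def sample(values, k, seed=0):
    """Large datasets: mine a random sample of k values instead of all of them."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(values, k)


cleaned = impute_mean(drop_noisy(ages))
subset = sample(cleaned, k=5)
print(cleaned)
print(subset)
```

Mean imputation is the simplest way to "supply" a missing value; predicting it from other attributes is the alternative mentioned in the pre-processing step.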
The effectiveness or usefulness of data mining should be measured using
some metrics:
ROI (Return on Investment): examines the difference between what the data
mining technique costs and the savings or benefits it produces, for example
in sales or advertising.
Traditional metrics: space and time requirements, based on complexity
analysis.
Accuracy
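Two of these metrics reduce to simple arithmetic. The sketch below assumes ROI is expressed as net gain relative to cost, which is one common convention rather than something the notes specify, and it takes accuracy to be the fraction of predictions that match the actual values. The function names and the figures are hypothetical.

```python
def roi(benefit, cost):
    """Return on investment as (benefit - cost) / cost (assumed convention)."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


def accuracy(predicted, actual):
    """Fraction of predictions that equal the corresponding actual value."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical figures: the project cost 100000 and produced 150000 in savings
print(roi(benefit=150000, cost=100000))  # 0.5
print(accuracy(["y", "n", "y", "y"], ["y", "n", "n", "y"]))  # 0.75
```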
Social implications: profiling is the process of evaluating data from past
sources and analyzing and summarizing useful information about it.
Example: identifying similar credit card purchases.
Implementation issues
Scalability: algorithms that do not scale up to large real-world datasets
are of limited use.
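Real-world data: real-world data is noisy and has missing values.
Updates: many algorithms work only with static datasets.
Ease of use: users may find the algorithms and their results difficult or
impossible to understand.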