Basic Data Mining Tasks

The document outlines the Knowledge Discovery in Databases (KDD) process, which involves several steps including selection, pre-processing, transformation, data mining, and interpretation/evaluation to extract useful information from data. It discusses various visualization techniques and highlights challenges in data mining such as overfitting, outliers, and high dimensionality. Additionally, it emphasizes the importance of expert interpretation and the need for effective algorithms to handle large and diverse datasets.

Uploaded by

devipriya210387

• Knowledge Discovery in Databases (KDD): the process of finding useful information and patterns in data.

• Data Mining: the use of algorithms to extract the information and patterns derived by the KDD process.

• The KDD process involves many steps.

• The input is the data and the output is the desired useful information.

• Five steps of the KDD process:

• Selection: obtain the data from various databases.
• Pre-processing: the data may be incorrect or missing; wrong data must be corrected or removed, and missing data must be supplied or predicted.
• Transformation: transformation techniques make the data easier to mine and more useful, and provide more meaningful results.
• Data mining: apply algorithms to the transformed data to generate the desired results.
• Interpretation/Evaluation: present the results through various visualization and GUI strategies.
Visualization refers to the visual representation of the data.
It includes the following techniques:
• Graphical: graphs
• Geometric: box plots, scatter diagrams
• Icon based: colors, icons
• Pixel based: each data value is shown as a uniquely colored pixel
• Hierarchical: divides the screen into regions based on data values
• Hybrid: combines any of these methods
• Displays may be 2D or 3D
KDD Process Ex: Web Log
• Selection:
Select log data (dates and locations) to use
• Preprocessing:
Remove identifying URLs
Remove error logs
• Transformation:
Sessionize (sort and group)
• Data Mining:
Identify and count patterns
Construct data structure
• Interpretation/Evaluation:
Identify and display frequently accessed sequences.
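
The web log example above can be sketched in code as follows. This is only a minimal illustration, not a definitive implementation: the log records, the 30-minute session gap, the "/account/" prefix used to mark identifying URLs, and all function names are assumptions made for this sketch and do not come from the slides.

```python
from collections import Counter
from datetime import datetime, timedelta

# Hypothetical log records: (user, timestamp, url, HTTP status).
LOG = [
    ("u1", "2024-01-01 10:00", "/home", 200),
    ("u1", "2024-01-01 10:02", "/products", 200),
    ("u1", "2024-01-01 10:03", "/cart", 200),
    ("u1", "2024-01-01 10:04", "/account/u1", 200),
    ("u2", "2024-01-01 11:00", "/home", 200),
    ("u2", "2024-01-01 11:01", "/products", 200),
    ("u2", "2024-01-01 11:02", "/missing", 404),
    ("u2", "2024-01-01 11:05", "/cart", 200),
    ("u1", "2024-01-01 15:00", "/home", 200),
    ("u1", "2024-01-01 15:01", "/products", 200),
    ("u3", "2024-02-01 09:00", "/home", 200),
]


def parse(ts):
    return datetime.strptime(ts, "%Y-%m-%d %H:%M")


def select(log, start, end):
    """Selection: keep only records whose date falls in [start, end)."""
    return [(u, parse(ts), url, st) for u, ts, url, st in log
            if start <= parse(ts) < end]


def preprocess(records, private_prefix="/account/"):
    """Pre-processing: drop error entries and identifying URLs.

    Treating status >= 400 as an error and private_prefix as
    'identifying' are assumptions of this sketch.
    """
    return [(u, t, url) for u, t, url, st in records
            if st < 400 and not url.startswith(private_prefix)]


def sessionize(records, gap=timedelta(minutes=30)):
    """Transformation: sort by user and time, then group into sessions."""
    sessions = []
    last_seen = {}  # user -> (last timestamp, current session list)
    for user, t, url in sorted(records, key=lambda r: (r[0], r[1])):
        prev = last_seen.get(user)
        if prev is None or t - prev[0] > gap:
            session = []  # start a new session for this user
            sessions.append(session)
        else:
            session = prev[1]
        session.append(url)
        last_seen[user] = (t, session)
    return sessions


def count_sequences(sessions, length=2):
    """Data mining: count contiguous page sequences of a given length."""
    counts = Counter()
    for s in sessions:
        for i in range(len(s) - length + 1):
            counts[tuple(s[i:i + length])] += 1
    return counts


def frequent_sequences(log, start, end, length=2, min_count=2):
    """Interpretation: return sequences seen at least min_count times."""
    sessions = sessionize(preprocess(select(log, start, end)))
    counts = count_sequences(sessions, length)
    return [(seq, n) for seq, n in counts.most_common() if n >= min_count]


result = frequent_sequences(LOG, datetime(2024, 1, 1), datetime(2024, 2, 1))
for seq, n in result:
    print(" -> ".join(seq), n)
```

Each function corresponds to one KDD step, so a step can be replaced (for example, a different sessionizing rule) without touching the others.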
Data mining development in IR: similarity measures, hierarchical clustering, IR systems, web search engines

Data mining development in DB: relational data model, SQL, association rules, data warehousing

Data mining development in ALG: algorithm design, algorithm analysis, data structures

Data mining development in machine learning: neural networks, decision trees

Data mining development in statistics: regression, EM algorithm, k-means clustering, time series analysis


Human interaction: technical experts are needed to formulate the queries and to assist in interpreting the results.

Overfitting: occurs when a model fits the current data so closely that it does not fit future states of the data.

Outliers: data entries that do not fit the derived model.

Interpretation of results: an expert is needed to interpret the results correctly.

Visualization of results: visualization is needed so that the results can be viewed and understood easily.

Large datasets: massive data creates problems when an algorithm designed for smaller datasets is applied. Sampling can mitigate this.

High dimensionality: many attributes are involved and it is difficult to determine which ones should be used (the dimensionality curse). The solution is to reduce the number of attributes (dimensionality reduction).

Multimedia data: different data types affect how the algorithms can be applied.

Missing data: during pre-processing, missing data must be replaced, for example with estimated values.

Irrelevant data: some data may not be relevant to the mining task.

Noisy data: some values may be incorrect or invalid.

Changing data: databases cannot be assumed to be static.

Integration: introducing data mining functions into the database is important.

Application: the algorithms must be used effectively to obtain results.
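
Two of the issues listed above have simple, commonly used remedies that can be sketched in a few lines: sampling for large datasets and value replacement for missing data. The sketch below is illustrative only. Reservoir sampling and mean replacement are techniques chosen for this example rather than ones named in the slides, and the function names are hypothetical.

```python
import random
from statistics import mean


def reservoir_sample(rows, k, seed=0):
    """Draw a uniform random sample of up to k rows in a single pass.

    Useful when the dataset is too large to hold in memory: only the
    k sampled rows are kept at any time.
    """
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            sample.append(row)       # fill the reservoir first
        else:
            j = rng.randint(0, i)    # inclusive on both ends
            if j < k:
                sample[j] = row      # replace with probability k / (i + 1)
    return sample


def fill_missing_with_mean(values):
    """Replace None entries with the mean of the known values."""
    known = [v for v in values if v is not None]
    if not known:
        raise ValueError("no known values to estimate from")
    m = mean(known)
    return [m if v is None else v for v in values]


sample = reservoir_sample(range(1_000_000), k=10)
print(len(sample))
print(fill_missing_with_mean([1.0, None, 3.0]))
```

A smaller sample lets an algorithm designed for small datasets run at all, at the cost of possibly missing rare patterns. Mean replacement is only one way to estimate a missing value; predicting it from other attributes is another.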


The effectiveness or usefulness of data mining should be measured using metrics:

ROI (Return on Investment): examines the difference between what the data mining technique costs and the savings or benefits it brings.

Sales/advertising

Traditional metrics: space and time, based on complexity analysis

Accuracy
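
As a rough illustration of two of the metrics above, the sketch below computes ROI and accuracy. Expressing ROI as the net benefit divided by the cost is the conventional definition and an assumption here, since the slides only mention the difference between cost and benefit. The function names and figures are hypothetical.

```python
def roi(benefit, cost):
    """Return on investment as a fraction of cost: (benefit - cost) / cost."""
    if cost <= 0:
        raise ValueError("cost must be positive")
    return (benefit - cost) / cost


def accuracy(predicted, actual):
    """Fraction of predictions that match the actual labels."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Hypothetical figures: the project cost 100,000 and saved 150,000.
print(roi(150_000, 100_000))
print(accuracy(["a", "b", "a", "a"], ["a", "b", "b", "a"]))
```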

Social implications: profiling is the process of evaluating data from past sources, then analyzing and summarizing useful information about the data.

Example: similar credit card purchases


Implementation issues

Scalability: algorithms may not scale up to the size of the data.

Real-world data: contains noisy data and missing values.

Update: many algorithms work only with static data.

Ease of use: algorithms may be difficult to use or understand.
