
Data mining

Unit-1: An idea on Data warehouse, Data mining - KDD versus data mining, Stages of the data mining process - Task primitives, Data Mining Techniques - Data mining knowledge representation.
What is Data warehouse?
A data warehouse is an information system that contains historical and cumulative data from single or multiple sources. It simplifies an organization's reporting and analysis processes, and it serves as a single version of truth for decision making and forecasting.
Characteristics of Data warehouse

 Subject-Oriented

 Integrated

 Time-variant

 Non-volatile
Subject-Oriented A data warehouse is subject-oriented as it offers information regarding a theme instead of a company's ongoing operations. These subjects can be sales, marketing, distribution, etc. A data warehouse never focuses on the ongoing operations. Instead, it puts emphasis on modelling and analysis of data for decision making. It also provides a simple and concise view around the specific subject by excluding data which is not helpful to support the decision process.

Integrated In a data warehouse, integration means the establishment of a common unit of measure for all similar data from dissimilar databases. The data also needs to be stored in the data warehouse in a common and universally acceptable manner. A data warehouse is developed by integrating data from varied sources like mainframes, relational databases, flat files, etc. Moreover, it must keep consistent naming conventions, format, and coding.
This integration helps in effective analysis of data. Consistency in naming conventions, attribute measures, encoding structure, etc. has to be ensured.
Time-Variant
The time horizon for a data warehouse is quite extensive compared with operational systems. The data collected in a data warehouse is associated with a particular period and offers information from a historical point of view. It contains an element of time, explicitly or implicitly. One place where data warehouse data displays time variance is in the structure of the record key: every primary key within the data warehouse should contain, implicitly or explicitly, an element of time, such as the day, week, or month. Another aspect of time variance is that once data is inserted in the warehouse, it can't be updated or changed.
Non-volatile
A data warehouse is also non-volatile, meaning that previous data is not erased when new data is entered. Data is read-only and periodically refreshed. This also helps to analyze historical data and understand what happened and when. It does not require transaction processing, recovery, or concurrency control mechanisms. Activities like delete, update, and insert, which are performed in an operational application environment, are omitted in the data warehouse environment. Only two types of data operations are performed in a data warehouse:
1. Data loading
2. Data access

Types of Data Warehouses


Data warehouses can be classified based on their architecture, data flow, and purpose. The major types are:
1. Enterprise Data Warehouse (EDW)

 Definition: A centralized data warehouse that consolidates data from various departments and
functions across an organization into a single repository. EDW is designed to provide a comprehensive
and unified view of enterprise data.

2. Operational Data Store (ODS)


 Definition: An ODS is a data store that integrates data from multiple operational systems for real-
time, day-to-day operations. It contains current or near-real-time data and is often used as a staging
area before the data is loaded into a data warehouse for analysis.
3. Data Mart
 Definition: A data mart is a smaller, specialized version of a data warehouse, designed for specific
departments or business functions such as sales, marketing, or finance. Data marts are usually
subsets of a larger data warehouse, focused on a specific area of interest.

Definition of mining:
In general terms, “mining” is the process of extracting some valuable material from the earth, e.g., coal mining, gold mining, etc.
In the context of computer science, “Data mining” refers to the extraction of useful information from a bulk
of data or data warehouses.

What is Data mining?


 Data mining is defined as the procedure of extracting information from huge sets of data. It is also known as mining knowledge from data.
 There are a number of components involved in the data mining process. These components constitute
the architecture of a data mining system.
Data Mining Architecture
The major components of any data mining system are data source, data warehouse server, data mining
engine, pattern evaluation module, graphical user interface and knowledge base.
a) Data Sources: Database, data warehouse, World Wide Web (WWW), text files and other documents
are the actual sources of data. You need large volumes of historical data for data mining to be
successful. Organizations usually store data in databases or data warehouses. Data warehouses may
contain one or more databases, text files, spreadsheets or other kinds of information repositories.
Sometimes, data may reside even in plain text files or spreadsheets. World Wide Web or the Internet
is another big source of data.
Different Processes: The data needs to be cleaned, integrated and selected before passing it to the
database or data warehouse server. As the data is from different sources and in different formats, it
cannot be used directly for the data mining process because the data might not be complete and reliable.
So, first data needs to be cleaned and integrated. Again, more data than required will be collected from
different data sources and only the data of interest needs to be selected and passed to the server. These
processes are not as simple as we think. A number of techniques may be performed on the data as part
of cleaning, integration and selection.
b) Database or Data Warehouse Server
The database or data warehouse server contains the actual data that is ready to be processed. Hence,
the server is responsible for retrieving the relevant data based on the data mining request of the user.
c) Data Mining Engine
The data mining engine is the core component of any data mining system. It consists of a number of
modules for performing data mining tasks including association, classification, characterization,
clustering, prediction, time-series analysis etc.
d) Pattern Evaluation Modules
The pattern evaluation module is mainly responsible for the measure of interestingness of the pattern
by using a threshold value. It interacts with the data mining engine to focus the search towards
interesting patterns.
e) Graphical User Interface
The graphical user interface module communicates between the user and the data mining system.
This module helps the user use the system easily and efficiently without knowing the real complexity
behind the process. When the user specifies a query or a task, this module interacts with the data
mining system and displays the result in an easily understandable manner.
f) Knowledge Base
The knowledge base is helpful in the whole data mining process. It might be useful for guiding the
search or evaluating the interestingness of the result patterns. The knowledge base might even
contain user beliefs and data from user experiences that can be useful in the process of data mining.
The data mining engine might get inputs from the knowledge base to make the result more accurate
and reliable. The pattern evaluation module interacts with the knowledge base on a regular basis to
get inputs and also to update it.
Summary: Each component of a data mining system has its own role and importance in completing data mining efficiently.

Knowledge Discovery in Database(KDD)


Steps involved in KDD process:
Data mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.

Fig: KDD PROCESS


1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.

 Cleaning in case of Missing values.

 Cleaning noisy data, where noise is a random or variance error.

 Cleaning with Data discrepancy detection and Data transformation tools.
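The cleaning step above can be sketched in plain Python; the sample records, field names, fill rule (mean imputation), and plausibility range are all hypothetical:

```python
# A minimal data-cleaning sketch: fill missing values, drop noisy rows.
records = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},   # missing value
    {"age": 31, "income": -99999},    # noisy / abnormal value
]

# Fill missing ages with the mean of the known ages.
known = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in records:
    if r["age"] is None:
        r["age"] = mean_age

# Drop rows whose income falls outside a plausible range (noise removal).
cleaned = [r for r in records if 0 <= r["income"] <= 1_000_000]
print(cleaned)
```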


2. Data Integration: Data integration is defined as the process of combining heterogeneous data from multiple sources into a common source (a data warehouse).

 Data integration using Data Migration(the process of moving data from one storage system to another)
tools.

 Data integration using Data Synchronization(the process of keeping data consistent and up to date across
multiple devices or system) tools.
3. Data Selection: Data selection is defined as the process where the data relevant to the analysis is decided upon and retrieved from the data collection.

 Data selection using Neural network.

 Data selection using Decision Trees.

 Data selection using Naive bayes.

 Data selection using Clustering, Regression, etc.


4. Data Transformation: Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. It is a two-step process:
 Data Mapping: Assigning elements from source base to destination to capture transformations.

 Code generation: Creation of the actual transformation program.
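The data-mapping step can be sketched as follows; the source and destination field names are hypothetical:

```python
# A minimal data-mapping sketch: source fields are renamed to the
# destination schema; unmapped fields pass through unchanged.
mapping = {"cust_nm": "customer_name", "dob": "date_of_birth"}

def transform(source_row):
    # Apply the field mapping to every key of the source row.
    return {mapping.get(k, k): v for k, v in source_row.items()}

row = {"cust_nm": "Alice", "dob": "1990-04-01", "city": "Ottawa"}
print(transform(row))
# {'customer_name': 'Alice', 'date_of_birth': '1990-04-01', 'city': 'Ottawa'}
```

In a real ETL pipeline the "code generation" step would emit such transformation programs automatically from the mapping specification.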


5. Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns.

 Transforms task relevant data into patterns.

 Decides purpose of model using classification or characterization.


6. Pattern Evaluation: Pattern evaluation is defined as identifying truly interesting patterns representing knowledge, based on given measures.

 Find interestingness score of each pattern.

 Uses summarization and Visualization to make data understandable by user.


7. Knowledge Representation: Knowledge representation is defined as the technique that utilizes visualization tools to represent data mining results.

 Generate reports.

 Generate tables.

 Generate discriminant rules, classification rules, characterization rules, etc.


Note:

 KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, new data
can be integrated and transformed in order to get different and more appropriate results.

 Preprocessing of databases consists of Data cleaning and Data Integration.

KDD VERSUS DATA MINING


Definition
 KDD: The complete process of discovering useful knowledge from data.
 Data mining: A step in the KDD process that focuses on applying algorithms to extract patterns.

Stages involved
 KDD: Consists of multiple stages: selection, preprocessing, transformation, data mining, evaluation, and presentation.
 Data mining: Consists mainly of the application of algorithms like clustering, classification, and association rule mining.

Objective
 KDD: The broader goal is to extract meaningful knowledge from raw data across various steps.
 Data mining: The specific goal is to find patterns, rules, and models in data.

Emphasis
 KDD: Emphasizes the end-to-end process, from raw data to knowledge extraction.
 Data mining: Emphasizes pattern discovery and modeling using techniques like machine learning and statistics.

Data preparation
 KDD: Requires detailed data preprocessing, transformation, and cleaning before mining.
 Data mining: Assumes data is relatively clean and focuses on the application of mining algorithms.

Output
 KDD: Produces knowledge that can be used for decision-making.
 Data mining: Produces patterns that need further evaluation and validation.

Focus on evaluation
 KDD: Involves evaluating the results for interestingness, relevance, and usefulness.
 Data mining: Limited evaluation; focuses on the correctness and efficiency of algorithms.

Examples
 KDD: The end-to-end process in a customer segmentation project.
 Data mining: Applying classification or clustering algorithms to classify customers into groups.

Techniques used
 KDD: Uses a broader set of tools including data cleaning, data mining, and visualization techniques.
 Data mining: Primarily uses data mining techniques like clustering, classification, association rules, etc.

Tools
 KDD: May involve multiple tools for ETL (Extract, Transform, Load), data mining, and visualization.
 Data mining: Typically involves specific data mining software or algorithms like decision trees, neural networks, etc.

Users
 KDD: Used by data scientists, analysts, and domain experts who manage the entire knowledge discovery process.
 Data mining: Primarily used by data scientists and machine learning engineers to apply mining techniques.

Stages of data mining process:


Data Mining Process
State problem and formulate hypothesis

In this step, the problem is stated and an initial hypothesis (an informed guess) is formulated. An in-depth conversation between the data mining expert and the application expert is needed to formulate hypotheses, and it usually continues throughout the whole data mining process.

Data Collection(gathering data from various resources)


This step deals with how data is collected from various sources. There are two scenarios. In the first, an expert controls the data generation process, which is well designed and understood. In the second, experts cannot influence the data generation process, and an observational approach is used in which data is generated randomly. In some cases, the sampling distribution implicit in the data collection procedure is only partially known, or unknown. It is important to know how the data collection affects the data's distribution, because this knowledge is necessary for estimating a model, using it, and interpreting the final results.

Data Preprocessing
In this step, raw data is converted into an understandable format and made ready for further analysis. The motive is to improve data quality and make it fit for the specific task. It usually involves at least two tasks:
Outlier detection and removal: Outliers are atypical data points that deviate markedly from the rest of the observations. They often stem from errors or abnormal values and can harm the model. They are handled either by detecting and removing them, or by using robust modeling methods that are insensitive to outliers.
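A simple, commonly used detection rule flags values far from the mean in standard-deviation units (a z-score test); the data and the 2-sigma threshold below are illustrative:

```python
import statistics

# Outlier removal sketch: drop values more than 2 standard deviations
# from the mean of the sample.
values = [10, 12, 11, 13, 12, 95]   # 95 is an obvious outlier

mean = statistics.mean(values)
stdev = statistics.stdev(values)
kept = [v for v in values if abs(v - mean) / stdev <= 2]
print(kept)
```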
Scaling and encoding: Variables measured on different scales need to be scaled so that they carry equivalent weight in the analysis. Application-specific encoding can also reduce the amount of information by achieving dimensionality reduction.
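Two common forms of this step are min-max scaling of numeric variables and label encoding of categorical ones; the sample values are hypothetical:

```python
# Min-max scaling: map each numeric value into [0, 1].
incomes = [20000, 50000, 80000]
lo, hi = min(incomes), max(incomes)
scaled = [(x - lo) / (hi - lo) for x in incomes]
print(scaled)          # [0.0, 0.5, 1.0]

# Label encoding: map each category to an integer code.
colors = ["red", "blue", "red", "green"]
codes = {c: i for i, c in enumerate(sorted(set(colors)))}
encoded = [codes[c] for c in colors]
print(encoded)
```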

Estimate model
This phase selects the data mining techniques that are most suitable. Implementation is first tried on different models, and then the simplest adequate one is selected for further processing.
Interpret model and draw conclusions
Simple models are easy to interpret but may be less accurate. Newer-generation data mining models are expected to provide high accuracy by using high-dimensional models. Specific techniques are used to validate the results by interpreting these models.

DATA MINING TASK PRIMITIVES


A data mining task can be specified in the form of a data mining query, which is input to the data mining
system. A data mining query is defined in terms of data mining task primitives. These primitives allow the
user to interactively communicate with the data mining system during the mining process to discover
interesting patterns.
Here is the list of Data Mining Task Primitives
 Set of task relevant data to be mined.

 Kind of knowledge to be mined.


 Background knowledge to be used in discovery process.
 Interestingness measures and thresholds for pattern evaluation.
 Representation for visualizing the discovered patterns.
Set of task relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested.
This portion includes the following
 Database Attributes
 Data Warehouse dimensions of interest
For example, suppose that you are a manager of All Electronics in charge of sales in the United States and Canada, and you would like to study the buying trends of customers in Canada. Rather than mining the entire database, you can specify that only the relevant portions of the data be considered. These are referred to as relevant attributes.
Kind of knowledge to be mined
This specifies the data mining functions to be performed, such as
 Characterization& Discrimination

 Association
 Classification
 Clustering
 Prediction
 Outlier analysis

For instance, if studying the buying habits of customers in Canada, you may choose to mine associations
between customer profiles and the items that these customers like to buy.

Background knowledge to be used in discovery process


Users can specify background knowledge, or knowledge about the domain to be mined. This knowledge is
useful for guiding the knowledge discovery process, and for evaluating the patterns found. User beliefs about
relationship in the data.
There are several kinds of background knowledge. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction.
Example:
For example, a concept hierarchy for the attribute (or dimension) age might roll raw age values up into ranges such as 20-29, then into categories such as young, middle-aged, and senior, and finally into the root node, which represents the most general abstraction level, denoted as all.
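Such a hierarchy can be sketched as a function that maps a raw age to the concept at a requested abstraction level; the level names, ranges, and category cut-offs below are illustrative, not standard:

```python
# Concept hierarchy sketch for the attribute "age":
# raw value -> range -> group -> all (the root, most general level).
def age_concept(age, level):
    if level == "all":
        return "all"
    if level == "group":
        return "young" if age < 40 else "middle_aged" if age < 60 else "senior"
    if level == "range":
        lo = (age // 10) * 10
        return f"{lo}-{lo + 9}"
    return age   # most specific level: the raw value itself

print(age_concept(27, "range"))   # 20-29
print(age_concept(27, "group"))   # young
print(age_concept(27, "all"))     # all
```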
Interestingness measures and thresholds for pattern evaluation
The Interestingness measures are used to separate interesting and uninteresting patterns from the
knowledge. They may be used to guide the mining process, or after discovery, to evaluate the discovered
patterns. Different kinds of knowledge may have different interestingness measures.
For example, interesting measures for association rules include support and confidence.
Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. Users can choose from different forms for knowledge presentation, such as
 rules, tables, reports, charts, graphs, decision trees, and cubes.

DATA MINING TECHNIQUES- DATA MINING KNOWLEDGE REPRESENTATION :-


Data mining uses algorithms and various other techniques to convert large collections of data into useful
output. The most popular types of data mining techniques include association rules, classification, clustering,
and predictive analysis.

 Association rules: also referred to as market basket analysis, search for relationships between
variables. This relationship in itself creates additional value within the data set as it strives to link
pieces of data.
For example, association rules would search a company's sales history to see which products are most
commonly purchased together; with this information, stores can plan, promote, and forecast.
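The core of such a market basket search is counting how often product pairs co-occur across transactions; the transactions below are hypothetical:

```python
from itertools import combinations
from collections import Counter

# Market basket sketch: count co-occurring product pairs.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in transactions:
    # Sort so each unordered pair is counted under one canonical key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))   # [(('bread', 'milk'), 3)]
```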

 Classification: uses predefined classes to assign to objects. These classes describe the characteristics
of items or represent what the data points have in common with each other. This data mining
technique allows the underlying data to be more neatly categorized and summarized across similar
features or product lines.
For example, the process of organising items into groups based on a set of criteria.
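A minimal sketch of classification with predefined classes, here as a hand-written rule set (real classifiers learn such rules from data; the product names and rules are hypothetical):

```python
# Rule-based classification sketch: assign each item to a predefined class.
def classify(product_name):
    name = product_name.lower()
    if "shampoo" in name or "conditioner" in name:
        return "hair care"
    if "toothpaste" in name or "floss" in name:
        return "dental health"
    return "other"

items = ["Herbal Shampoo", "Mint Toothpaste", "Bar Soap"]
print([classify(i) for i in items])   # ['hair care', 'dental health', 'other']
```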

 Clustering: is similar to classification. However, clustering identifies similarities between objects and groups them by those similarities, which also distinguishes them from the items in other groups. While classification may result in groups such as "shampoo," "conditioner," "soap," and "toothpaste," clustering may identify groups such as "hair care" and "dental health."
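Clustering can be sketched with a tiny one-dimensional k-means: points are grouped by similarity alone, with no predefined classes. The data, starting centers, and iteration count are illustrative:

```python
# 1-D k-means sketch: alternate between assigning points to their
# nearest center and moving each center to the mean of its cluster.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [sum(v) / len(v) if v else c for c, v in clusters.items()]
    return clusters

points = [1, 2, 3, 20, 21, 22]
print(kmeans_1d(points, [1, 20]))
```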
 Predictive analysis: strives to leverage historical information to build graphical or mathematical models to forecast future outcomes. Overlapping with regression analysis, this technique aims to predict an unknown future figure based on the current data on hand.
 Regression analysis: This technique discovers relationships in data by predicting outcomes based on
predetermined variables. This can include decision trees and multivariate and linear regression.
Results can be prioritized by the closeness of the relationship to help determine what data is most or
least significant.
An example would be for a soft drink manufacturer to estimate the needed inventory of drinks before
the arrival of predicted hot summer weather.
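The soft drink scenario can be sketched as a least-squares linear regression of past sales on temperature, used to predict demand at a forecast temperature; all figures are hypothetical:

```python
# Linear regression sketch: fit y = a + b*x by ordinary least squares.
temps = [20, 25, 30, 35]          # past forecast temperatures (x)
sales = [100, 150, 200, 250]      # drinks sold at those temperatures (y)

n = len(temps)
mx = sum(temps) / n
my = sum(sales) / n
b = (sum((x - mx) * (y - my) for x, y in zip(temps, sales))
     / sum((x - mx) ** 2 for x in temps))
a = my - b * mx
print(a + b * 40)   # predicted sales at 40 degrees
```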

DATA MINING FUNCTIONALITIES


Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. Data mining tasks can be classified into two categories: descriptive and predictive.

Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts.
Data characterization is a summarization of the general characteristics of a target class of data.

Data discrimination
Data discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that occur frequently in data (things found most commonly in the data).
There are many kinds of frequent patterns, including itemsets, subsequences, and substructures.
Association analysis Suppose, as a marketing manager, you would like to determine which items are
frequently purchased together within the same transactions.
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
Where X is a variable representing a customer. Confidence=50% means that if a customer buys a computer,
there is a 50% chance that she will buy software as well. Support=1% means that 1% of all of the transactions
under analysis showed that computer and software were purchased together.
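Support and confidence for such a rule can be computed directly from transaction data; the four transactions below are hypothetical, so the resulting percentages differ from the 1%/50% in the rule above:

```python
# Support and confidence sketch for buys(computer) => buys(software).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / len(transactions)   # fraction of all transactions with both
confidence = both / antecedent       # P(software | computer)
print(support, confidence)
```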
Correlation Analysis: Correlation is a mathematical technique that can show whether, and how strongly, pairs of attributes are related to each other.
For example, taller people tend to have more weight.
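The height-weight relationship can be quantified with Pearson's correlation coefficient; the small sample below is hypothetical:

```python
# Pearson correlation sketch: r close to +1 means a strong positive
# linear relationship between the two attributes.
heights = [150, 160, 170, 180]
weights = [50, 58, 65, 74]

n = len(heights)
mh = sum(heights) / n
mw = sum(weights) / n
cov = sum((h - mh) * (w - mw) for h, w in zip(heights, weights))
var_h = sum((h - mh) ** 2 for h in heights)
var_w = sum((w - mw) ** 2 for w in weights)
r = cov / (var_h * var_w) ** 0.5
print(round(r, 3))   # close to 1: taller people tend to weigh more
```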
Classification of Data Mining Systems: There is a large variety of data mining systems available. Data mining systems may integrate techniques from the following:
 Pattern Recognition
 Image Analysis
 Computer Graphics
 Web Technology
 Business

DATA MINING APPLICATIONS:


Here is the list of areas where data mining is widely used –

 Financial Data Analysis

 Retail Industry

 Telecommunication Industry

 Biological Data Analysis

 Other Scientific Applications

 Intrusion Detection
