
Data mining

Unit-1: An idea on Data warehouse, Data mining - KDD versus data mining, Stages of the data mining process - Task primitives, Data Mining Techniques - Data mining knowledge representation.
What is Data warehouse?
A data warehouse is an information system that contains historical and cumulative data from single or multiple sources. It simplifies an organization's reporting and analysis processes, and it serves as a single version of truth for decision making and forecasting.
Characteristics of Data warehouse

 Subject-Oriented

 Integrated

 Time-variant

 Non-volatile
Subject-Oriented A data warehouse is subject-oriented as it offers information regarding a theme instead of a company's ongoing operations. These subjects can be sales, marketing, distribution, etc. A data warehouse never focuses on the ongoing operations. Instead, it puts emphasis on modelling and analysis of data for decision making. It also provides a simple and concise view around the specific subject by excluding data which is not helpful to support the decision process.

Integrated In a data warehouse, integration means the establishment of a common unit of measure for all similar data from dissimilar databases. The data also needs to be stored in the data warehouse in a common and universally acceptable manner. A data warehouse is developed by integrating data from varied sources like mainframes, relational databases, flat files, etc. Moreover, it must keep consistent naming conventions, format, and coding.
This integration helps in effective analysis of data. Consistency in naming conventions, attribute measures, encoding structure, etc. has to be ensured.
Time-Variant
The time horizon for a data warehouse is quite extensive compared with operational systems. The data collected in a data warehouse is associated with a particular period and offers information from a historical point of view. It contains an element of time, explicitly or implicitly. One place where data warehouse data displays time variance is in the structure of the record key: every primary key within the data warehouse should contain, implicitly or explicitly, an element of time, such as the day, week, or month. Another aspect of time variance is that once data is inserted in the warehouse, it can't be updated or changed.
Non-volatile
A data warehouse is also non-volatile, meaning that previous data is not erased when new data is entered. Data is read-only and periodically refreshed. This also helps to analyze historical data and understand what happened and when. It does not require transaction processing, recovery, or concurrency control mechanisms. Activities like delete, update, and insert, which are performed in an operational application environment, are omitted in the data warehouse environment. Only two types of data operations are performed in a data warehouse:
1. Data loading
2. Data access

Types of Data Warehouses


Data warehouses can be classified based on their architecture, data flow, and purpose. The major types are:
1. Enterprise Data Warehouse (EDW)

 Definition: A centralized data warehouse that consolidates data from various departments and
functions across an organization into a single repository. EDW is designed to provide a comprehensive
and unified view of enterprise data.

2. Operational Data Store (ODS)


 Definition: An ODS is a data store that integrates data from multiple operational systems for real-
time, day-to-day operations. It contains current or near-real-time data and is often used as a staging
area before the data is loaded into a data warehouse for analysis.
3. Data Mart
 Definition: A data mart is a smaller, specialized version of a data warehouse, designed for specific
departments or business functions such as sales, marketing, or finance. Data marts are usually
subsets of a larger data warehouse, focused on a specific area of interest.

Definition of mining:
In general terms, “mining” is the process of extracting some valuable material from the earth, e.g., coal mining, gold mining, etc.
In the context of computer science, “Data mining” refers to the extraction of useful information from a bulk
of data or data warehouses.

What is Data mining?


 Data mining is defined as the procedure of extracting information from huge sets of data. It is also known as mining knowledge from data.
 There are a number of components involved in the data mining process. These components constitute
the architecture of a data mining system.
Data Mining Architecture
The major components of any data mining system are data source, data warehouse server, data mining
engine, pattern evaluation module, graphical user interface and knowledge base.
a) Data Sources: Database, data warehouse, World Wide Web (WWW), text files and other documents
are the actual sources of data. You need large volumes of historical data for data mining to be
successful. Organizations usually store data in databases or data warehouses. Data warehouses may
contain one or more databases, text files, spreadsheets or other kinds of information repositories.
Sometimes, data may reside even in plain text files or spreadsheets. World Wide Web or the Internet
is another big source of data.
Different Processes: The data needs to be cleaned, integrated and selected before passing it to the
database or data warehouse server. As the data is from different sources and in different formats, it
cannot be used directly for the data mining process because the data might not be complete and reliable.
So, first data needs to be cleaned and integrated. Again, more data than required will be collected from
different data sources and only the data of interest needs to be selected and passed to the server. These
processes are not as simple as we think. A number of techniques may be performed on the data as part
of cleaning, integration and selection.
b) Database or Data Warehouse Server
The database or data warehouse server contains the actual data that is ready to be processed. Hence,
the server is responsible for retrieving the relevant data based on the data mining request of the user.
c) Data Mining Engine
The data mining engine is the core component of any data mining system. It consists of a number of
modules for performing data mining tasks including association, classification, characterization,
clustering, prediction, time-series analysis etc.
d) Pattern Evaluation Modules
The pattern evaluation module is mainly responsible for the measure of interestingness of the pattern
by using a threshold value. It interacts with the data mining engine to focus the search towards
interesting patterns.
e) Graphical User Interface
The graphical user interface module communicates between the user and the data mining system.
This module helps the user use the system easily and efficiently without knowing the real complexity
behind the process. When the user specifies a query or a task, this module interacts with the data
mining system and displays the result in an easily understandable manner.
f) Knowledge Base
The knowledge base is helpful in the whole data mining process. It might be useful for guiding the
search or evaluating the interestingness of the result patterns. The knowledge base might even
contain user beliefs and data from user experiences that can be useful in the process of data mining.
The data mining engine might get inputs from the knowledge base to make the result more accurate
and reliable. The pattern evaluation module interacts with the knowledge base on a regular basis to
get inputs and also to update it.
Summary: Each component of a data mining system has its own role and importance in completing data mining efficiently.

Knowledge Discovery in Database(KDD)


Steps involved in KDD process:
Data mining, also known as Knowledge Discovery in Databases (KDD), refers to the nontrivial extraction of implicit, previously unknown, and potentially useful information from data stored in databases.

Fig: KDD PROCESS


1. Data Cleaning: Data cleaning is defined as the removal of noisy and irrelevant data from the collection.

 Cleaning in case of Missing values.

 Cleaning noisy data, where noise is a random or variance error.

 Cleaning with Data discrepancy detection and Data transformation tools.
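The cleaning step above can be sketched in plain Python; the sample records, field names, fill rule (mean imputation), and plausibility range are all hypothetical:

```python
# A minimal data-cleaning sketch: fill missing values, drop noisy rows.
records = [
    {"age": 25, "income": 40000},
    {"age": None, "income": 52000},   # missing value
    {"age": 31, "income": -99999},    # noisy / abnormal value
]

# Fill missing ages with the mean of the known ages.
known = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(known) / len(known)
for r in records:
    if r["age"] is None:
        r["age"] = mean_age

# Drop rows whose income falls outside a plausible range (noise removal).
cleaned = [r for r in records if 0 <= r["income"] <= 1_000_000]
print(cleaned)
```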


2. Data Integration: Data integration is defined as the process of combining heterogeneous data from multiple sources into a common source (a data warehouse).

 Data integration using Data Migration(the process of moving data from one storage system to another)
tools.

 Data integration using Data Synchronization(the process of keeping data consistent and up to date across
multiple devices or system) tools.
3. Data Selection: Data selection is defined as the process where the data relevant to the analysis is decided upon and retrieved from the data collection.

 Data selection using Neural network.

 Data selection using Decision Trees.

 Data selection using Naive bayes.

 Data selection using Clustering, Regression, etc.


4. Data Transformation: Data transformation is defined as the process of transforming data into the appropriate form required by the mining procedure. It is a two-step process:
 Data Mapping: Assigning elements from source base to destination to capture transformations.

 Code generation: Creation of the actual transformation program.
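The data-mapping step can be sketched as follows; the source and destination field names are hypothetical:

```python
# A minimal data-mapping sketch: source fields are renamed to the
# destination schema; unmapped fields pass through unchanged.
mapping = {"cust_nm": "customer_name", "dob": "date_of_birth"}

def transform(source_row):
    # Apply the field mapping to every key of the source row.
    return {mapping.get(k, k): v for k, v in source_row.items()}

row = {"cust_nm": "Alice", "dob": "1990-04-01", "city": "Ottawa"}
print(transform(row))
# {'customer_name': 'Alice', 'date_of_birth': '1990-04-01', 'city': 'Ottawa'}
```

In a real ETL pipeline the "code generation" step would emit such transformation programs automatically from the mapping specification.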


5. Data Mining: Data mining is defined as the application of clever techniques to extract potentially useful patterns.

 Transforms task relevant data into patterns.

 Decides purpose of model using classification or characterization.


6. Pattern Evaluation: Pattern evaluation is defined as identifying truly interesting patterns representing knowledge, based on given measures.

 Find interestingness score of each pattern.

 Uses summarization and Visualization to make data understandable by user.


7. Knowledge Representation: Knowledge representation is defined as the technique that utilizes visualization tools to represent data mining results.

 Generate reports.

 Generate tables.

 Generate discriminant rules, classification rules, characterization rules, etc.


Note:

 KDD is an iterative process where evaluation measures can be enhanced, mining can be refined, new data
can be integrated and transformed in order to get different and more appropriate results.

 Preprocessing of databases consists of Data cleaning and Data Integration.

KDD VERSUS DATA MINING


Definition
 KDD: The complete process of discovering useful knowledge from data.
 Data mining: A step in the KDD process that focuses on applying algorithms to extract patterns.

Stages involved
 KDD: Consists of multiple stages: selection, preprocessing, transformation, data mining, evaluation, and presentation.
 Data mining: Consists mainly of the application of algorithms like clustering, classification, and association rule mining.

Objective
 KDD: The broader goal is to extract meaningful knowledge from raw data across various steps.
 Data mining: The specific goal is to find patterns, rules, and models in data.

Emphasis
 KDD: Emphasizes the end-to-end process, from raw data to knowledge extraction.
 Data mining: Emphasizes pattern discovery and modeling using techniques like machine learning and statistics.

Data preparation
 KDD: Requires detailed data preprocessing, transformation, and cleaning before mining.
 Data mining: Assumes data is relatively clean and focuses on the application of mining algorithms.

Output
 KDD: Produces knowledge that can be used for decision-making.
 Data mining: Produces patterns that need further evaluation and validation.

Focus on evaluation
 KDD: Involves evaluating the results for interestingness, relevance, and usefulness.
 Data mining: Limited evaluation; focuses on the correctness and efficiency of algorithms.

Examples
 KDD: The end-to-end process in a customer segmentation project.
 Data mining: Applying classification or clustering algorithms to classify customers into groups.

Techniques used
 KDD: Uses a broader set of tools including data cleaning, data mining, and visualization techniques.
 Data mining: Primarily uses data mining techniques like clustering, classification, association rules, etc.

Tools
 KDD: May involve multiple tools for ETL (Extract, Transform, Load), data mining, and visualization.
 Data mining: Typically involves specific data mining software or algorithms like decision trees, neural networks, etc.

Users
 KDD: Used by data scientists, analysts, and domain experts who manage the entire knowledge discovery process.
 Data mining: Primarily used by data scientists and machine learning engineers to apply mining techniques.

Stages of data mining process:


Data Mining Process
State problem and formulate hypothesis

In this step, the problem is stated and an initial hypothesis (an informed guess) is formulated. An in-depth conversation between the data mining expert and the application expert is needed to formulate hypotheses, and it usually continues throughout the whole data mining process.

Data Collection(gathering data from various resources)


This step deals with how data is collected from various sources. There are two scenarios. In the first, an expert controls the data generation process, which is well designed and understood. In the second, experts cannot influence the data generation process, and an observational approach is used in which data is generated randomly. In some cases, the sampling distribution implicit in the data collection procedure is only partially known, or unknown. It is important to know how the data collection affects the data's distribution, because this knowledge is necessary for estimating a model, using it, and interpreting the final results.

Data Preprocessing
In this step, raw data is converted into an understandable format and made ready for further analysis. The motive is to improve data quality and make it fit for the specific task. It usually involves at least two tasks:
Outlier detection and removal: Outliers are atypical data points that deviate markedly from the rest of the observations. They often stem from errors or abnormal values and can harm the model. They are handled either by detecting and removing them, or by using robust modeling methods that are insensitive to outliers.
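A simple, commonly used detection rule flags values far from the mean in standard-deviation units (a z-score test); the data and the 2-sigma threshold below are illustrative:

```python
import statistics

# Outlier removal sketch: drop values more than 2 standard deviations
# from the mean of the sample.
values = [10, 12, 11, 13, 12, 95]   # 95 is an obvious outlier

mean = statistics.mean(values)
stdev = statistics.stdev(values)
kept = [v for v in values if abs(v - mean) / stdev <= 2]
print(kept)
```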
Scaling and encoding: Variables measured on different scales need to be scaled so that they carry equivalent weight in the analysis. Application-specific encoding can also reduce the amount of information by achieving dimensionality reduction.
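Two common forms of this step are min-max scaling of numeric variables and label encoding of categorical ones; the sample values are hypothetical:

```python
# Min-max scaling: map each numeric value into [0, 1].
incomes = [20000, 50000, 80000]
lo, hi = min(incomes), max(incomes)
scaled = [(x - lo) / (hi - lo) for x in incomes]
print(scaled)          # [0.0, 0.5, 1.0]

# Label encoding: map each category to an integer code.
colors = ["red", "blue", "red", "green"]
codes = {c: i for i, c in enumerate(sorted(set(colors)))}
encoded = [codes[c] for c in colors]
print(encoded)
```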

Estimate model
This phase selects the data mining techniques that are most suitable. Implementation is first tried on different models, and then the simplest adequate one is selected for further processing.
Interpret model and draw conclusions
Simple models are easy to interpret but may be less accurate. Newer-generation data mining models are expected to provide high accuracy by using high-dimensional models. Specific techniques are used to validate the results by interpreting these models.

DATA MINING TASK PRIMITIVES


A data mining task can be specified in the form of a data mining query, which is input to the data mining
system. A data mining query is defined in terms of data mining task primitives. These primitives allow the
user to interactively communicate with the data mining system during the mining process to discover
interesting patterns.
Here is the list of Data Mining Task Primitives
 Set of task relevant data to be mined.

 Kind of knowledge to be mined.


 Background knowledge to be used in discovery process.
 Interestingness measures and thresholds for pattern evaluation.
 Representation for visualizing the discovered patterns.
Set of task relevant data to be mined

This specifies the portions of the database or the set of data in which the user is interested.
This portion includes the following
 Database Attributes
 Data Warehouse dimensions of interest
For example, suppose that you are a manager of All Electronics in charge of sales in the United States and Canada, and you would like to study the buying trends of customers in Canada. Rather than mining the entire database, you can specify that only the relevant portions of the data be considered. These are referred to as relevant attributes.
Kind of knowledge to be mined
This specifies the data mining functions to be performed, such as
 Characterization& Discrimination

 Association
 Classification
 Clustering
 Prediction
 Outlier analysis

For instance, if studying the buying habits of customers in Canada, you may choose to mine associations
between customer profiles and the items that these customers like to buy.

Background knowledge to be used in discovery process


Users can specify background knowledge, or knowledge about the domain to be mined. This knowledge is
useful for guiding the knowledge discovery process, and for evaluating the patterns found. User beliefs about
relationship in the data.
There are several kinds of background knowledge. Concept hierarchies are a popular form of background
knowledge, which allow data to be mined at multiple levels of abstraction.
Example:
For example, a concept hierarchy for the attribute (or dimension) age might roll raw age values up into ranges such as 20-29, then into categories such as young, middle-aged, and senior, and finally into the root node, which represents the most general abstraction level, denoted as all.
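Such a hierarchy can be sketched as a function that maps a raw age to the concept at a requested abstraction level; the level names, ranges, and category cut-offs below are illustrative, not standard:

```python
# Concept hierarchy sketch for the attribute "age":
# raw value -> range -> group -> all (the root, most general level).
def age_concept(age, level):
    if level == "all":
        return "all"
    if level == "group":
        return "young" if age < 40 else "middle_aged" if age < 60 else "senior"
    if level == "range":
        lo = (age // 10) * 10
        return f"{lo}-{lo + 9}"
    return age   # most specific level: the raw value itself

print(age_concept(27, "range"))   # 20-29
print(age_concept(27, "group"))   # young
print(age_concept(27, "all"))     # all
```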
Interestingness measures and thresholds for pattern evaluation
The Interestingness measures are used to separate interesting and uninteresting patterns from the
knowledge. They may be used to guide the mining process, or after discovery, to evaluate the discovered
patterns. Different kinds of knowledge may have different interestingness measures.
For example, interesting measures for association rules include support and confidence.
Representation for visualizing the discovered patterns

This refers to the form in which discovered patterns are to be displayed. Users can choose from different forms for knowledge presentation, such as
 rules, tables, reports, charts, graphs, decision trees, and cubes.

DATA MINING TECHNIQUES- DATA MINING KNOWLEDGE REPRESENTATION :-


Data mining uses algorithms and various other techniques to convert large collections of data into useful
output. The most popular types of data mining techniques include association rules, classification, clustering,
and predictive analysis.

 Association rules: also referred to as market basket analysis, search for relationships between
variables. This relationship in itself creates additional value within the data set as it strives to link
pieces of data.
For example, association rules would search a company's sales history to see which products are most
commonly purchased together; with this information, stores can plan, promote, and forecast.
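The core of such a market basket search is counting how often product pairs co-occur across transactions; the transactions below are hypothetical:

```python
from itertools import combinations
from collections import Counter

# Market basket sketch: count co-occurring product pairs.
transactions = [
    {"bread", "milk"},
    {"bread", "milk", "butter"},
    {"bread", "milk"},
    {"bread", "butter"},
]

pair_counts = Counter()
for basket in transactions:
    # Sort so each unordered pair is counted under one canonical key.
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

print(pair_counts.most_common(1))   # [(('bread', 'milk'), 3)]
```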

 Classification: uses predefined classes to assign to objects. These classes describe the characteristics
of items or represent what the data points have in common with each other. This data mining
technique allows the underlying data to be more neatly categorized and summarized across similar
features or product lines.
For example, the process of organising items into groups based on a set of criteria.
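A minimal sketch of classification with predefined classes, here as a hand-written rule set (real classifiers learn such rules from data; the product names and rules are hypothetical):

```python
# Rule-based classification sketch: assign each item to a predefined class.
def classify(product_name):
    name = product_name.lower()
    if "shampoo" in name or "conditioner" in name:
        return "hair care"
    if "toothpaste" in name or "floss" in name:
        return "dental health"
    return "other"

items = ["Herbal Shampoo", "Mint Toothpaste", "Bar Soap"]
print([classify(i) for i in items])   # ['hair care', 'dental health', 'other']
```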

 Clustering: is similar to classification. However, clustering identifies similarities between objects and groups them by those similarities, which also distinguishes them from the items in other groups. While classification may result in groups such as "shampoo," "conditioner," "soap," and "toothpaste," clustering may identify groups such as "hair care" and "dental health."
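Clustering can be sketched with a tiny one-dimensional k-means: points are grouped by similarity alone, with no predefined classes. The data, starting centers, and iteration count are illustrative:

```python
# 1-D k-means sketch: alternate between assigning points to their
# nearest center and moving each center to the mean of its cluster.
def kmeans_1d(points, centers, iters=10):
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(p - c))
            clusters[nearest].append(p)
        # Recompute each center; keep the old one if its cluster is empty.
        centers = [sum(v) / len(v) if v else c for c, v in clusters.items()]
    return clusters

points = [1, 2, 3, 20, 21, 22]
print(kmeans_1d(points, [1, 20]))
```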
 Predictive analysis: strives to leverage historical information to build graphical or mathematical models to forecast future outcomes. Overlapping with regression analysis, this technique aims to predict an unknown future figure based on the current data on hand.
 Regression analysis: This technique discovers relationships in data by predicting outcomes based on
predetermined variables. This can include decision trees and multivariate and linear regression.
Results can be prioritized by the closeness of the relationship to help determine what data is most or
least significant.
An example would be for a soft drink manufacturer to estimate the needed inventory of drinks before
the arrival of predicted hot summer weather.
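The soft drink scenario can be sketched as a least-squares linear regression of past sales on temperature, used to predict demand at a forecast temperature; all figures are hypothetical:

```python
# Linear regression sketch: fit y = a + b*x by ordinary least squares.
temps = [20, 25, 30, 35]          # past forecast temperatures (x)
sales = [100, 150, 200, 250]      # drinks sold at those temperatures (y)

n = len(temps)
mx = sum(temps) / n
my = sum(sales) / n
b = (sum((x - mx) * (y - my) for x, y in zip(temps, sales))
     / sum((x - mx) ** 2 for x in temps))
a = my - b * mx
print(a + b * 40)   # predicted sales at 40 degrees
```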

DATA MINING FUNCTIONALITIES


Data mining functionalities are used to specify the kind of patterns to be found in data mining
tasks. Data mining tasks can be classified into two categories: descriptive and predictive.

Descriptive mining tasks characterize the general properties of the data in the database.
Predictive mining tasks perform inference on the current data in order to make predictions.
Concept/Class Description: Characterization and Discrimination
Data can be associated with classes or concepts.
Data characterization is a summarization of the general characteristics of a target class of data.

Data discrimination
Data discrimination is a comparison of the general features of target class data objects with the general
features of objects from one or a set of contrasting classes.
Mining Frequent Patterns, Associations, and Correlations
Frequent patterns are patterns that occur frequently in data (things found most commonly in the data).
There are many kinds of frequent patterns, including itemsets, subsequences, and substructures.
Association analysis Suppose, as a marketing manager, you would like to determine which items are
frequently purchased together within the same transactions.
buys(X, “computer”) ⇒ buys(X, “software”) [support = 1%, confidence = 50%]
Where X is a variable representing a customer. Confidence=50% means that if a customer buys a computer,
there is a 50% chance that she will buy software as well. Support=1% means that 1% of all of the transactions
under analysis showed that computer and software were purchased together.
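Support and confidence for such a rule can be computed directly from transaction data; the four transactions below are hypothetical, so the resulting percentages differ from the 1%/50% in the rule above:

```python
# Support and confidence sketch for buys(computer) => buys(software).
transactions = [
    {"computer", "software"},
    {"computer"},
    {"printer"},
    {"computer", "software", "printer"},
]

both = sum(1 for t in transactions if {"computer", "software"} <= t)
antecedent = sum(1 for t in transactions if "computer" in t)

support = both / len(transactions)   # fraction of all transactions with both
confidence = both / antecedent       # P(software | computer)
print(support, confidence)
```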
Correlation Analysis: Correlation is a mathematical technique that can show whether, and how strongly, pairs of attributes are related to each other.
For example, taller people tend to have more weight.
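The height-weight relationship can be quantified with Pearson's correlation coefficient; the small sample below is hypothetical:

```python
# Pearson correlation sketch: r close to +1 means a strong positive
# linear relationship between the two attributes.
heights = [150, 160, 170, 180]
weights = [50, 58, 65, 74]

n = len(heights)
mh = sum(heights) / n
mw = sum(weights) / n
cov = sum((h - mh) * (w - mw) for h, w in zip(heights, weights))
var_h = sum((h - mh) ** 2 for h in heights)
var_w = sum((w - mw) ** 2 for w in weights)
r = cov / (var_h * var_w) ** 0.5
print(round(r, 3))   # close to 1: taller people tend to weigh more
```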
Classification of Data Mining Systems: There is a large variety of data mining systems available. Data mining systems may integrate techniques from the following:
 Pattern Recognition
 Image Analysis
 Computer Graphics
 Web Technology
 Business

DATA MINING APPLICATIONS:


Here is the list of areas where data mining is widely used –

 Financial Data Analysis

 Retail Industry

 Telecommunication Industry

 Biological Data Analysis

 Other Scientific Applications

 Intrusion Detection
