Data Integration and Data Reduction
INTRODUCTION:
Data integration in data mining refers to the process of combining
data from multiple sources into a single, unified view. This can
involve cleaning and transforming the data, as well as resolving any
inconsistencies or conflicts that may exist between the different
sources. The goal of data integration is to make the data more
useful and meaningful for the purposes of analysis and decision
making. Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and data
federation.
Data Integration is a data preprocessing technique that combines data from
multiple heterogeneous data sources into a coherent data store and provides
a unified view of the data. These sources may include multiple data cubes,
databases, or flat files.
The data integration approach is formally defined as a triple <G, S, M>,
where,
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between queries over the source and global schemas.
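As a toy illustration (the source names, tables, and attributes below are made up), the triple <G, S, M> could be sketched in Python as:

```python
# G: the global schema exposed to users.
G = {"customer": ["customer_id", "name", "country"]}

# S: the heterogeneous source schemas (two hypothetical systems).
S = {
    "crm": {"customers": ["id", "full_name", "country"]},
    "erp": {"clients": ["client_id", "name", "nation"]},
}

# M: mappings from source attributes onto global-schema attributes.
M = {
    ("crm", "customers", "id"): ("customer", "customer_id"),
    ("crm", "customers", "full_name"): ("customer", "name"),
    ("crm", "customers", "country"): ("customer", "country"),
    ("erp", "clients", "client_id"): ("customer", "customer_id"),
    ("erp", "clients", "name"): ("customer", "name"),
    ("erp", "clients", "nation"): ("customer", "country"),
}
```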
What is data integration?
Data integration is the process of combining data from multiple sources
into a cohesive and consistent view. This process involves identifying and
accessing the different data sources, mapping the data to a common format,
and reconciling any inconsistencies or discrepancies between the sources.
The goal of data integration is to make it easier to access and analyze data
that is spread across multiple systems or platforms, in order to gain a more
complete and accurate understanding of the data.
There are two major approaches to data integration: the “tight
coupling” approach and the “loose coupling” approach.
Tight Coupling:
This approach involves creating a centralized repository or data warehouse
to store the integrated data. The data is extracted from various sources,
transformed and loaded into a data warehouse. Data is integrated in a tightly
coupled manner, meaning that the data is integrated at a high level, such as
at the level of the entire dataset or schema. This approach is also known as
data warehousing, and it enables data consistency and integrity, but it can
be inflexible and difficult to change or update.
Here, a data warehouse is treated as an information retrieval
component.
In this coupling, data is combined from different sources into a
single physical location through the process of ETL – Extraction,
Transformation, and Loading.
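A minimal, hypothetical ETL sketch of this tight-coupling approach is shown below; the source systems (crm, erp), their columns, and the SQLite file standing in for the warehouse are all illustrative assumptions.

```python
# Tight coupling: extract from heterogeneous sources, transform onto a
# common schema, and load into one physical store (the warehouse).
import sqlite3
import pandas as pd

def extract():
    # In-memory stand-ins for two heterogeneous operational sources.
    crm = pd.DataFrame({"id": [1, 2], "full_name": ["Ann", "Bob"],
                        "country": ["IN", "US"]})
    erp = pd.DataFrame({"client_id": [2, 3], "name": ["Bob", "Cara"],
                        "nation": ["US", "UK"]})
    return crm, erp

def transform(crm, erp):
    # Map both sources onto one common (global) schema and resolve conflicts.
    crm = crm.rename(columns={"id": "customer_id", "full_name": "name"})
    erp = erp.rename(columns={"client_id": "customer_id", "nation": "country"})
    unified = pd.concat([crm, erp], ignore_index=True)
    return unified.drop_duplicates(subset="customer_id")

def load(unified):
    # The warehouse is a single physical store holding the integrated view.
    with sqlite3.connect("warehouse.db") as conn:
        unified.to_sql("dim_customer", conn, if_exists="replace", index=False)

load(transform(*extract()))
```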
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the
level of individual data elements or records. Data is integrated in a loosely
coupled manner, meaning that the data is integrated at a low level, and it
allows data to be integrated without having to create a central repository or
data warehouse. This approach is also known as data federation, and it
enables data flexibility and easy updates, but it can be difficult to maintain
consistency and integrity across multiple data sources.
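For contrast, a loose-coupling (federation) sketch might look like the following; the sources and column mappings are again made-up stand-ins, and nothing is persisted in a central store.

```python
# Loose coupling: the unified view is built in memory only when a query
# is issued, directly from the individual sources.
import pandas as pd

def read_crm():
    return pd.DataFrame({"id": [1, 2], "full_name": ["Ann", "Bob"],
                         "country": ["IN", "US"]})

def read_erp():
    return pd.DataFrame({"client_id": [3], "name": ["Cara"],
                         "nation": ["UK"]})

def federated_query(country):
    crm = read_crm().rename(columns={"id": "customer_id", "full_name": "name"})
    erp = read_erp().rename(columns={"client_id": "customer_id",
                                     "nation": "country"})
    view = pd.concat([crm, erp], ignore_index=True)   # built only on demand
    return view[view["country"] == country]

print(federated_query("US"))
```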
Data reduction:
Data reduction techniques obtain a condensed description of the original
data that is much smaller in volume yet preserves the quality of the
original data.
2. Dimension reduction:
Whenever we come across data that is only weakly important, we keep just
the attributes required for our analysis. Dimension reduction reduces the
data size by eliminating outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the
best of the remaining original attributes is added to the set based on
its relevance (in statistics this is often judged with a p-value).
Suppose the data set contains several attributes, a few of which are
redundant.
Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original
data and, at each step, eliminates the worst remaining attribute from the
set.
Combination of Forward and Backward Selection –
It combines both approaches, selecting the best attributes and removing
the worst ones at each step, which saves time and makes the process
faster. A small sketch of step-wise selection follows this list.
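As a sketch, assuming scikit-learn is available, step-wise forward and backward selection could be performed as below; the iris data and the logistic regression model are only placeholders.

```python
# Forward selection starts from an empty attribute set, backward selection
# from the full set, matching the descriptions above.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

forward = SequentialFeatureSelector(model, n_features_to_select=2,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction="backward").fit(X, y)

print("forward keeps attributes :", forward.get_support())
print("backward keeps attributes:", backward.get_support())
```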
3. Data Compression:
The data compression technique reduces the size of files using different
encoding mechanisms (Huffman encoding and run-length encoding). It can be
divided into two types based on the compression technique used.
Lossless Compression –
Encoding techniques (such as run-length encoding) give a simple but
modest reduction in data size. Lossless data compression uses algorithms
that restore the exact original data from the compressed data.
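A minimal run-length encoding sketch (purely illustrative) shows the lossless property, since the original sequence is restored exactly:

```python
# Run-length encoding: consecutive repeated values are stored once,
# together with their run length; decoding rebuilds the exact original.
from itertools import groupby

def rle_encode(data):
    return [(value, len(list(run))) for value, run in groupby(data)]

def rle_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]

original = "AAAABBBCCDAA"
encoded = rle_encode(original)
assert "".join(rle_decode(encoded)) == original   # exact reconstruction
print(encoded)   # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
```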
Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal
component analysis) are examples of this type of compression. For
example, the JPEG image format uses lossy compression, yet the
decompressed image still conveys essentially the same meaning as the
original. In lossy compression, the decompressed data may differ from the
original data but remain useful enough to retrieve information from.
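A small lossy-compression sketch using PCA (assuming scikit-learn and NumPy) illustrates that the reconstruction is close to, but not exactly, the original data:

```python
# PCA as lossy compression: project onto fewer components, then
# reconstruct; some information is lost in the process.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

pca = PCA(n_components=3)
compressed = pca.fit_transform(X)            # 100 x 3 instead of 100 x 10
reconstructed = pca.inverse_transform(compressed)

print("mean reconstruction error:", np.mean((X - reconstructed) ** 2))
```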
4. Numerosity Reduction:
In this technique, the actual data is replaced with a mathematical model
or a smaller representation of the data. For parametric methods it is
enough to store only the model parameters; non-parametric methods include
clustering, histograms, and sampling.
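A brief numerosity-reduction sketch (illustrative only) replaces a large column by a random sample and by histogram bin counts, both far smaller than the original:

```python
# Non-parametric numerosity reduction: sampling and a histogram summary.
import numpy as np

rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=100_000)   # made-up large column

sample = rng.choice(values, size=1_000, replace=False)   # sampling
counts, bin_edges = np.histogram(values, bins=20)         # histogram

print("original size :", values.size)
print("sample size   :", sample.size)
print("histogram bins:", counts.size)
```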
5. Discretization & Concept Hierarchy Operation:
Data discretization techniques are used to divide continuous attributes
into data with intervals. Many constant values of an attribute are
replaced by labels of small intervals, so that mining results can be
presented in a concise and easily understandable way.
Top-down discretization –
If you first pick one or a few points (so-called breakpoints or split
points) to divide the whole range of values and then repeat this on the
resulting intervals until the end, the process is known as top-down
discretization, also called splitting.
Bottom-up discretization –
If you first consider all the constant values as split points and then
discard some of them by merging neighbouring values into intervals, the
process is called bottom-up discretization.
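A minimal discretization sketch with pandas (the ages and interval labels are made up) replaces continuous values by interval labels using equal-width bins, a simple top-down style split:

```python
# Equal-width binning: the continuous 'ages' attribute is replaced by
# four interval labels.
import pandas as pd

ages = pd.Series([3, 17, 25, 31, 46, 52, 58, 67, 74, 89])
labels = ["child", "young", "middle-aged", "senior"]

binned = pd.cut(ages, bins=4, labels=labels)
print(binned.value_counts())
```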
Data-warehouse:
A data-warehouse is a heterogeneous collection of different data sources
organised under a unified schema. There are two approaches for
constructing a data-warehouse, the top-down approach and the bottom-up
approach, explained below.
1. Top-down approach:
1. External Sources –
An external source is a source from which data is collected,
irrespective of the type of data. The data can be structured,
semi-structured or unstructured.
2. Stage Area –
Since the data extracted from the external sources does not follow a
particular format, it needs to be validated before being loaded into the
data-warehouse. For this purpose, an ETL tool is recommended:
E (Extract): Data is extracted from the external data sources.
T (Transform): Data is transformed into the standard format.
L (Load): Data is loaded into the data-warehouse after being
transformed into the standard format.
3. Data-warehouse –
After cleansing, the data is stored in the data-warehouse as the central
repository. It actually stores the metadata, while the actual data is
stored in the data marts. Note that in this top-down approach the
data-warehouse stores the data in its purest form.
4. Data Marts –
A data mart is also part of the storage component. It stores the
information of a particular function of an organisation which is handled
by a single authority. There can be as many data marts in an organisation
as there are functions. We can also say that a data mart contains a
subset of the data stored in the data-warehouse.
5. Data Mining –
Data mining is the practice of analysing the big data present in the
data-warehouse. It is used to find the hidden patterns present in the
database or data-warehouse with the help of data mining algorithms.
This approach is defined by Inmon as: the data-warehouse is built as a
central repository for the complete organisation, and data marts are
created from it after the complete data-warehouse has been created.
2. Bottom-up approach:
1. First, the data is extracted from the external sources (as in the
top-down approach).
2. Then, the data goes through the staging area (as explained above) and
is loaded into data marts instead of the data-warehouse. The data marts
are created first and provide reporting capability; each data mart
addresses a single business area.
This approach is given by Kimball as: data marts are created first and
provide a thin view for analysis, and the data-warehouse is created after
the complete set of data marts has been created.
MULTILEVEL ASSOCIATION RULES:
Introduction:
We all use the decision tree technique in day-to-day life to make
decisions. Organisations use supervised machine learning techniques such
as decision trees to make better decisions and to generate more surplus
and profit.
The techniques given below are used to build ensembles of decision
trees.
Bagging
Bagging is used when our objective is to reduce the variance of a decision
tree. The idea is to create several subsets of data from the training
sample, chosen randomly with replacement. Each subset is then used to
train its own decision tree, so we end up with an ensemble of models. The
average of the predictions from the numerous trees is used, which is more
robust than a single decision tree.
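A brief bagging sketch, assuming scikit-learn, compares a single tree with an ensemble of bagged trees; the iris data is used only as a placeholder dataset.

```python
# Bagging: many trees trained on bootstrap samples, predictions combined.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=50, random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```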
K-Means Clustering
It allows us to cluster the data into different groups and is a
convenient way to discover the categories of groups in an unlabeled
dataset on its own, without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset
into k clusters, and repeats the process until it finds the best
clusters. The value of k should be predetermined in this algorithm.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they can be points other
than those in the input dataset).
Step-3: Assign each data point to its closest centroid, which will form
the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new
closest centroid of each cluster.
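A minimal K-means sketch, assuming scikit-learn and NumPy, follows the steps above on some synthetic two-cluster data:

```python
# K-means: K is fixed in advance, centroids are initialized, then points
# are reassigned and centroids recomputed until assignments stabilize.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # cluster around (0, 0)
               rng.normal(5, 1, (50, 2))])   # cluster around (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:\n", kmeans.cluster_centers_)
print("first ten labels:", kmeans.labels_[:10])
```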
Decision tree terminology:
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further
after a leaf node is reached.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
In a decision tree, to predict the class of a given record, the
algorithm starts from the root node of the tree. It compares the value of
the root attribute with the corresponding attribute of the record (from
the real dataset) and, based on the comparison, follows the branch and
jumps to the next node.
o Step-1: Begin the tree with the root node, say S, which contains
the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values of the
best attribute.
o Step-4: Generate the decision tree node which contains the best
attribute.
o Step-5: Recursively make new decision trees using the subsets of
the dataset created in Step-3. Continue this process until a stage is
reached where the nodes cannot be classified further; these final
nodes are called leaf nodes.
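A small decision-tree sketch, assuming scikit-learn and using the iris data as a placeholder, grows a tree with the Gini index as the attribute selection measure and prints its structure:

```python
# A decision tree grown from the root by repeatedly choosing the best
# attribute (Gini impurity as the ASM) until leaf nodes are reached.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              random_state=0).fit(X, y)

print(export_text(tree, feature_names=load_iris().feature_names))
```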
BUSINESS INTELLIGENCE
→ Business intelligence is a set of mathematical models and analysis
methodologies that exploit the available data to generate information and
knowledge useful for complex decision-making processes.
→ i.e. it is a broad category of applications and technologies for
gathering, storing, analysing and providing access to data to help
clients make better decisions.
For Example: Reducing the staff in the company
If I want to reduce the staff in my company, I need the right
information: how many staff do I have, what type of staff do I have,
what is my business doing with this staff, how much do I earn, what is
my growth, what are my operational expenses, and do I need to reduce the
staff or not?
So if the right information can be provided at the right time, in the
right format, matching the business data, then the right decisions can
be taken on that data.
MAJOR COMPONENTS OF BI
DATA SOURCES
DATA WAREHOUSES AND DATA MARTS
BI METHODOLOGIES
Data sources:
In a first stage, it is necessary to gather and integrate the data stored in the various
primary and secondary sources, which are heterogeneous in origin and type. The sources
consist for the most part of data belonging to operational systems, but may also include
unstructured documents, such as emails and data received from external providers.
Data warehouses and data marts:
Using extraction and transformation tools known as extract, transform, load (ETL), the
data originating from the different sources are stored in databases intended to support
business intelligence analyses. These databases are usually referred to as data
warehouses and data marts.
Business intelligence methodologies:
Data are finally extracted and used to feed mathematical models and analysis
methodologies intended to support decision makers. In a business intelligence system,
several decision support applications may be implemented, most of which will be
described in the following chapters:
• multidimensional cube analysis;
• exploratory data analysis;
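As a rough illustration of multidimensional cube analysis, a pandas pivot table (with made-up sales data) aggregates a measure along two dimensions, much like a simple OLAP cross-tab:

```python
# Aggregate a sales measure along the region and quarter dimensions.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south", "north", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 120, 90, 110, 80, 95],
})

cube = sales.pivot_table(values="amount", index="region",
                         columns="quarter", aggfunc="sum")
print(cube)
```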
Data Exploration:
It is an informative search used by data consumers to form a real and
true analysis of the information collected. It is about describing the
data by means of statistical and visualization techniques. We explore
data in order to bring its important aspects into focus for further
analysis, since data is often gathered in large bulks in a non-rigid or
uncontrolled manner.
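A quick data-exploration sketch with pandas (the small data frame is made up) computes summary statistics and counts missing values as a first look at the data:

```python
# Summary statistics and missing-value counts as a first exploration step.
import pandas as pd

df = pd.DataFrame({
    "revenue": [120, 95, 130, None, 110],
    "region": ["north", "south", "north", "east", "south"],
})

print(df.describe(include="all"))   # basic statistics per column
print(df.isna().sum())              # missing values per column
```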
Data mining:
Data mining technique has to be chosen based on the type of business and the type of
problem your business faces. A generalized approach has to be used to improve the
accuracy and cost effectiveness of using data mining techniques.
Optimization: By moving up one level in the pyramid we find optimization models that
allow us to determine the best solution out of a set of alternative actions, which is usually
fairly extensive and sometimes even infinite.
Decisions: Choice of a decision pertains to the decision makers, who may also take
advantage of informal and unstructured information available to adapt and modify the
recommendations and the conclusions achieved through the use of mathematical
models.
o Management within our organization are not convinced that data-driven or
evidence-based decisions really work for them.
o There is no clear overall business strategy laid out with objectives and
measures related to those objectives to assess business progress.
o There are no incentives for the staff within the organization to improve the
performance of the business, whether using BI or not.
o The eventual consumers of the BI system do not really know what they
want from a BI system until they see it.
o IT experts building the system do not really understand the business, and so
many changes are needed before the system is accepted by the organization.
o The company does not have sufficient expertise or it is not able to hire such
expertise to manage a project implementation on time and within budget
or to design the system adequately.
2. Data and Technology:
The data of the organization is not clean, and the time and effort needed to
correct or handle this destroys the success of the BI project.
The BI technology chosen turns out to be so rigid and painstaking to change that it
takes too long and costs too much to complete the project on time.