Data Integration and Data Reduction
INTRODUCTION:
Data integration in data mining refers to the process of combining
data from multiple sources into a single, unified view. This can
involve cleaning and transforming the data, as well as resolving any
inconsistencies or conflicts that may exist between the different
sources. The goal of data integration is to make the data more
useful and meaningful for the purposes of analysis and decision
making. Techniques used in data integration include data
warehousing, ETL (extract, transform, load) processes, and data
federation.
Data Integration is a data preprocessing technique that combines data from
multiple heterogeneous data sources into a coherent data store and provides
a unified view of the data. These sources may include multiple data cubes,
databases, or flat files.
The data integration approach is formally defined as a triple <G, S, M>,
where,
G stands for the global schema,
S stands for the heterogeneous source schemas,
M stands for the mappings between queries over the source and global schemas.
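As a toy illustration (the source names, tables, and attributes below are made up), the triple <G, S, M> could be sketched in Python as:

```python
# G: the global schema exposed to users.
G = {"customer": ["customer_id", "name", "country"]}

# S: the heterogeneous source schemas (two hypothetical systems).
S = {
    "crm": {"customers": ["id", "full_name", "country"]},
    "erp": {"clients": ["client_id", "name", "nation"]},
}

# M: mappings from source attributes onto global-schema attributes.
M = {
    ("crm", "customers", "id"): ("customer", "customer_id"),
    ("crm", "customers", "full_name"): ("customer", "name"),
    ("crm", "customers", "country"): ("customer", "country"),
    ("erp", "clients", "client_id"): ("customer", "customer_id"),
    ("erp", "clients", "name"): ("customer", "name"),
    ("erp", "clients", "nation"): ("customer", "country"),
}
```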
What is data integration?
Data integration is the process of combining data from multiple sources
into a cohesive and consistent view. This process involves identifying and
accessing the different data sources, mapping the data to a common format,
and reconciling any inconsistencies or discrepancies between the sources.
The goal of data integration is to make it easier to access and analyze data
that is spread across multiple systems or platforms, in order to gain a more
complete and accurate understanding of the data.
There are two major approaches to data integration: the “tight
coupling” approach and the “loose coupling” approach.
Tight Coupling:
This approach involves creating a centralized repository or data warehouse
to store the integrated data. The data is extracted from various sources,
transformed and loaded into a data warehouse. Data is integrated in a tightly
coupled manner, meaning that the data is integrated at a high level, such as
at the level of the entire dataset or schema. This approach is also known as
data warehousing, and it enables data consistency and integrity, but it can
be inflexible and difficult to change or update.
Here, a data warehouse is treated as an information retrieval
component.
In this coupling, data is combined from different sources into a
single physical location through the process of ETL – Extraction,
Transformation, and Loading.
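A minimal, hypothetical ETL sketch of this tight-coupling approach is shown below; the source systems (crm, erp), their columns, and the SQLite file standing in for the warehouse are all illustrative assumptions.

```python
# Tight coupling: extract from heterogeneous sources, transform onto a
# common schema, and load into one physical store (the warehouse).
import sqlite3
import pandas as pd

def extract():
    # In-memory stand-ins for two heterogeneous operational sources.
    crm = pd.DataFrame({"id": [1, 2], "full_name": ["Ann", "Bob"],
                        "country": ["IN", "US"]})
    erp = pd.DataFrame({"client_id": [2, 3], "name": ["Bob", "Cara"],
                        "nation": ["US", "UK"]})
    return crm, erp

def transform(crm, erp):
    # Map both sources onto one common (global) schema and resolve conflicts.
    crm = crm.rename(columns={"id": "customer_id", "full_name": "name"})
    erp = erp.rename(columns={"client_id": "customer_id", "nation": "country"})
    unified = pd.concat([crm, erp], ignore_index=True)
    return unified.drop_duplicates(subset="customer_id")

def load(unified):
    # The warehouse is a single physical store holding the integrated view.
    with sqlite3.connect("warehouse.db") as conn:
        unified.to_sql("dim_customer", conn, if_exists="replace", index=False)

load(transform(*extract()))
```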
Loose Coupling:
This approach involves integrating data at the lowest level, such as at the
level of individual data elements or records. Data is integrated in a loosely
coupled manner, meaning that the data is integrated at a low level, and it
allows data to be integrated without having to create a central repository or
data warehouse. This approach is also known as data federation, and it
enables data flexibility and easy updates, but it can be difficult to maintain
consistency and integrity across multiple data sources.
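For contrast, a loose-coupling (federation) sketch might look like the following; the sources and column mappings are again made-up stand-ins, and nothing is persisted in a central store.

```python
# Loose coupling: the unified view is built in memory only when a query
# is issued, directly from the individual sources.
import pandas as pd

def read_crm():
    return pd.DataFrame({"id": [1, 2], "full_name": ["Ann", "Bob"],
                         "country": ["IN", "US"]})

def read_erp():
    return pd.DataFrame({"client_id": [3], "name": ["Cara"],
                         "nation": ["UK"]})

def federated_query(country):
    crm = read_crm().rename(columns={"id": "customer_id", "full_name": "name"})
    erp = read_erp().rename(columns={"client_id": "customer_id",
                                     "nation": "country"})
    view = pd.concat([crm, erp], ignore_index=True)   # built only on demand
    return view[view["country"] == country]

print(federated_query("US"))
```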
Data reduction:
Data reduction techniques obtain a condensed description of the original
data that is much smaller in volume yet preserves the quality of the
original data.
2. Dimension reduction:
Whenever we come across data that is only weakly important, we keep just
the attributes required for our analysis. Dimension reduction reduces the
data size by eliminating outdated or redundant features.
Step-wise Forward Selection –
The selection begins with an empty set of attributes; at each step, the
best of the remaining original attributes is added to the set based on
its relevance (in statistics this is often judged with a p-value).
Suppose the data set contains several attributes, a few of which are
redundant.
Step-wise Backward Selection –
This selection starts with the complete set of attributes in the original
data and, at each step, eliminates the worst remaining attribute from the
set.
Combination of Forward and Backward Selection –
It combines both approaches, selecting the best attributes and removing
the worst ones at each step, which saves time and makes the process
faster. A small sketch of step-wise selection follows this list.
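As a sketch, assuming scikit-learn is available, step-wise forward and backward selection could be performed as below; the iris data and the logistic regression model are only placeholders.

```python
# Forward selection starts from an empty attribute set, backward selection
# from the full set, matching the descriptions above.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

forward = SequentialFeatureSelector(model, n_features_to_select=2,
                                    direction="forward").fit(X, y)
backward = SequentialFeatureSelector(model, n_features_to_select=2,
                                     direction="backward").fit(X, y)

print("forward keeps attributes :", forward.get_support())
print("backward keeps attributes:", backward.get_support())
```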
3. Data Compression:
The data compression technique reduces the size of files using different
encoding mechanisms (Huffman encoding and run-length encoding). It can be
divided into two types based on the compression technique used.
Lossless Compression –
Encoding techniques (such as run-length encoding) give a simple but
modest reduction in data size. Lossless data compression uses algorithms
that restore the exact original data from the compressed data.
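A minimal run-length encoding sketch (purely illustrative) shows the lossless property, since the original sequence is restored exactly:

```python
# Run-length encoding: consecutive repeated values are stored once,
# together with their run length; decoding rebuilds the exact original.
from itertools import groupby

def rle_encode(data):
    return [(value, len(list(run))) for value, run in groupby(data)]

def rle_decode(pairs):
    return [value for value, count in pairs for _ in range(count)]

original = "AAAABBBCCDAA"
encoded = rle_encode(original)
assert "".join(rle_decode(encoded)) == original   # exact reconstruction
print(encoded)   # [('A', 4), ('B', 3), ('C', 2), ('D', 1), ('A', 2)]
```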
Lossy Compression –
Methods such as the discrete wavelet transform and PCA (principal
component analysis) are examples of this type of compression. For
example, the JPEG image format uses lossy compression, yet the
decompressed image still conveys essentially the same meaning as the
original. In lossy compression, the decompressed data may differ from the
original data but remain useful enough to retrieve information from.
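A small lossy-compression sketch using PCA (assuming scikit-learn and NumPy) illustrates that the reconstruction is close to, but not exactly, the original data:

```python
# PCA as lossy compression: project onto fewer components, then
# reconstruct; some information is lost in the process.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))

pca = PCA(n_components=3)
compressed = pca.fit_transform(X)            # 100 x 3 instead of 100 x 10
reconstructed = pca.inverse_transform(compressed)

print("mean reconstruction error:", np.mean((X - reconstructed) ** 2))
```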
4. Numerosity Reduction:
In this technique, the actual data is replaced with a mathematical model
or a smaller representation of the data. For parametric methods it is
enough to store only the model parameters; non-parametric methods include
clustering, histograms, and sampling.
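A brief numerosity-reduction sketch (illustrative only) replaces a large column by a random sample and by histogram bin counts, both far smaller than the original:

```python
# Non-parametric numerosity reduction: sampling and a histogram summary.
import numpy as np

rng = np.random.default_rng(0)
values = rng.exponential(scale=2.0, size=100_000)   # made-up large column

sample = rng.choice(values, size=1_000, replace=False)   # sampling
counts, bin_edges = np.histogram(values, bins=20)         # histogram

print("original size :", values.size)
print("sample size   :", sample.size)
print("histogram bins:", counts.size)
```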
5. Discretization & Concept Hierarchy Operation:
Data discretization techniques are used to divide continuous attributes
into data with intervals. Many constant values of an attribute are
replaced by labels of small intervals, so that mining results can be
presented in a concise and easily understandable way.
Top-down discretization –
If you first pick one or a few points (so-called breakpoints or split
points) to divide the whole range of values and then repeat this on the
resulting intervals until the end, the process is known as top-down
discretization, also called splitting.
Bottom-up discretization –
If you first consider all the constant values as split points and then
discard some of them by merging neighbouring values into intervals, the
process is called bottom-up discretization.
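A minimal discretization sketch with pandas (the ages and interval labels are made up) replaces continuous values by interval labels using equal-width bins, a simple top-down style split:

```python
# Equal-width binning: the continuous 'ages' attribute is replaced by
# four interval labels.
import pandas as pd

ages = pd.Series([3, 17, 25, 31, 46, 52, 58, 67, 74, 89])
labels = ["child", "young", "middle-aged", "senior"]

binned = pd.cut(ages, bins=4, labels=labels)
print(binned.value_counts())
```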
Data-warehouse:
A data-warehouse is a heterogeneous collection of different data sources
organised under a unified schema. There are two approaches for
constructing a data-warehouse, the top-down approach and the bottom-up
approach, explained below.
1. Top-down approach:
1. External Sources –
An external source is a source from which data is collected,
irrespective of the type of data. The data can be structured,
semi-structured or unstructured.
2. Stage Area –
Since the data extracted from the external sources does not follow a
particular format, it needs to be validated before being loaded into the
data-warehouse. For this purpose, an ETL tool is recommended:
E (Extract): Data is extracted from the external data sources.
T (Transform): Data is transformed into the standard format.
L (Load): Data is loaded into the data-warehouse after being
transformed into the standard format.
3. Data-warehouse –
After cleansing, the data is stored in the data-warehouse as the central
repository. It actually stores the metadata, while the actual data is
stored in the data marts. Note that in this top-down approach the
data-warehouse stores the data in its purest form.
4. Data Marts –
A data mart is also part of the storage component. It stores the
information of a particular function of an organisation which is handled
by a single authority. There can be as many data marts in an organisation
as there are functions. We can also say that a data mart contains a
subset of the data stored in the data-warehouse.
5. Data Mining –
Data mining is the practice of analysing the big data present in the
data-warehouse. It is used to find the hidden patterns present in the
database or data-warehouse with the help of data mining algorithms.
This approach is defined by Inmon as: the data-warehouse is built as a
central repository for the complete organisation, and data marts are
created from it after the complete data-warehouse has been created.
2. Bottom-up approach:
1. First, the data is extracted from the external sources (as in the
top-down approach).
2. Then, the data goes through the staging area (as explained above) and
is loaded into data marts instead of the data-warehouse. The data marts
are created first and provide reporting capability; each data mart
addresses a single business area.
This approach is given by Kimball as: data marts are created first and
provide a thin view for analysis, and the data-warehouse is created after
the complete set of data marts has been created.
MULTILEVEL ASSOCIATION RULES:
Introduction:
We all use the decision tree technique in day-to-day life to make
decisions. Organisations use supervised machine learning techniques such
as decision trees to make better decisions and to generate more surplus
and profit.
The techniques given below are used to build ensembles of decision
trees.
Bagging
Bagging is used when our objective is to reduce the variance of a decision
tree. The idea is to create several subsets of data from the training
sample, chosen randomly with replacement. Each subset is then used to
train its own decision tree, so we end up with an ensemble of models. The
average of the predictions from the numerous trees is used, which is more
robust than a single decision tree.
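A brief bagging sketch, assuming scikit-learn, compares a single tree with an ensemble of bagged trees; the iris data is used only as a placeholder dataset.

```python
# Bagging: many trees trained on bootstrap samples, predictions combined.
from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

single_tree = DecisionTreeClassifier(random_state=0)
bagged_trees = BaggingClassifier(DecisionTreeClassifier(),
                                 n_estimators=50, random_state=0)

print("single tree :", cross_val_score(single_tree, X, y, cv=5).mean())
print("bagged trees:", cross_val_score(bagged_trees, X, y, cv=5).mean())
```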
K-Means Clustering
It allows us to cluster the data into different groups and is a
convenient way to discover the categories of groups in an unlabeled
dataset on its own, without the need for any training.
The algorithm takes the unlabeled dataset as input, divides the dataset
into k clusters, and repeats the process until it finds the best
clusters. The value of k should be predetermined in this algorithm.
Step-1: Select the number K to decide the number of clusters.
Step-2: Select K random points or centroids (they can be points other
than those in the input dataset).
Step-3: Assign each data point to its closest centroid, which will form
the predefined K clusters.
Step-4: Calculate the variance and place a new centroid for each cluster.
Step-5: Repeat the third step, i.e. reassign each data point to the new
closest centroid of each cluster.
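A minimal K-means sketch, assuming scikit-learn and NumPy, follows the steps above on some synthetic two-cluster data:

```python
# K-means: K is fixed in advance, centroids are initialized, then points
# are reassigned and centroids recomputed until assignments stabilize.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)),    # cluster around (0, 0)
               rng.normal(5, 1, (50, 2))])   # cluster around (5, 5)

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:\n", kmeans.cluster_centers_)
print("first ten labels:", kmeans.labels_[:10])
```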
Decision tree terminology:
Leaf Node: Leaf nodes are the final output nodes; the tree cannot be split further
after a leaf node is reached.
Splitting: Splitting is the process of dividing the decision node/root node into sub-nodes
according to the given conditions.
Pruning: Pruning is the process of removing the unwanted branches from the tree.
Parent/Child node: The root node of the tree is called the parent node, and other
nodes are called the child nodes.
In a decision tree, to predict the class of a given record, the
algorithm starts from the root node of the tree. It compares the value of
the root attribute with the corresponding attribute of the record (from
the real dataset) and, based on the comparison, follows the branch and
jumps to the next node.
o Step-1: Begin the tree with the root node, say S, which contains
the complete dataset.
o Step-2: Find the best attribute in the dataset using an Attribute
Selection Measure (ASM).
o Step-3: Divide S into subsets that contain possible values of the
best attribute.
o Step-4: Generate the decision tree node which contains the best
attribute.
o Step-5: Recursively make new decision trees using the subsets of
the dataset created in Step-3. Continue this process until a stage is
reached where the nodes cannot be classified further; these final
nodes are called leaf nodes.
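A small decision-tree sketch, assuming scikit-learn and using the iris data as a placeholder, grows a tree with the Gini index as the attribute selection measure and prints its structure:

```python
# A decision tree grown from the root by repeatedly choosing the best
# attribute (Gini impurity as the ASM) until leaf nodes are reached.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(criterion="gini", max_depth=3,
                              random_state=0).fit(X, y)

print(export_text(tree, feature_names=load_iris().feature_names))
```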
BUSINESS INTELLIGENCE
→ Business intelligence is a set of mathematical models and analysis
methodologies that exploit the available data to generate information and
knowledge useful for complex decision-making processes.
→ i.e. it is a broad category of applications and technologies for
gathering, storing, analysing and providing access to data to help
clients make better decisions.
For Example: Reducing the staff in the company
If I want to reduce the staff in my company, I need the right
information: how many staff do I have, what type of staff do I have,
what is my business doing with this staff, how much do I earn, what is
my growth, what are my operational expenses, and do I need to reduce the
staff or not?
So if the right information can be provided at the right time, in the
right format, matching the business data, then the right decisions can
be taken on that data.
MAJOR COMPONENTS OF BI
DATA SOURCES
DATA WAREHOUSES AND DATA MARTS
BI METHODOLOGIES
Data sources:
In a first stage, it is necessary to gather and integrate the data stored in the various
primary and secondary sources, which are heterogeneous in origin and type. The sources
consist for the most part of data belonging to operational systems, but may also include
unstructured documents, such as emails and data received from external providers.
Data warehouses and data marts:
Using extraction and transformation tools known as extract, transform, load (ETL), the
data originating from the different sources are stored in databases intended to support
business intelligence analyses. These databases are usually referred to as data
warehouses and data marts.
Business intelligence methodologies:
Data are finally extracted and used to feed mathematical models and analysis
methodologies intended to support decision makers. In a business intelligence system,
several decision support applications may be implemented, most of which will be
described in the following chapters:
• multidimensional cube analysis;
• exploratory data analysis;
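As a rough illustration of multidimensional cube analysis, a pandas pivot table (with made-up sales data) aggregates a measure along two dimensions, much like a simple OLAP cross-tab:

```python
# Aggregate a sales measure along the region and quarter dimensions.
import pandas as pd

sales = pd.DataFrame({
    "region":  ["north", "north", "south", "south", "north", "south"],
    "quarter": ["Q1", "Q2", "Q1", "Q2", "Q1", "Q2"],
    "amount":  [100, 120, 90, 110, 80, 95],
})

cube = sales.pivot_table(values="amount", index="region",
                         columns="quarter", aggfunc="sum")
print(cube)
```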
Data Exploration:
It is an informative search used by data consumers to form a real and
true analysis of the information collected. It is about describing the
data by means of statistical and visualization techniques. We explore
data in order to bring its important aspects into focus for further
analysis, since data is often gathered in large bulks in a non-rigid or
uncontrolled manner.
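A quick data-exploration sketch with pandas (the small data frame is made up) computes summary statistics and counts missing values as a first look at the data:

```python
# Summary statistics and missing-value counts as a first exploration step.
import pandas as pd

df = pd.DataFrame({
    "revenue": [120, 95, 130, None, 110],
    "region": ["north", "south", "north", "east", "south"],
})

print(df.describe(include="all"))   # basic statistics per column
print(df.isna().sum())              # missing values per column
```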
Data mining:
Data mining technique has to be chosen based on the type of business and the type of
problem your business faces. A generalized approach has to be used to improve the
accuracy and cost effectiveness of using data mining techniques.
Optimization: By moving up one level in the pyramid we find optimization models that
allow us to determine the best solution out of a set of alternative actions, which is usually
fairly extensive and sometimes even infinite.
Decisions: Choice of a decision pertains to the decision makers, who may also take
advantage of informal and unstructured information available to adapt and modify the
recommendations and the conclusions achieved through the use of mathematical
models.
o Management within our organization are not convinced that data-driven or
evidence-based decisions really work for them.
o There is no clear overall business strategy laid out with objectives and
measures related to those objectives to assess business progress.
o There are no incentives for the staff within the organization to improve the
performance of the business, whether using BI or not.
o The eventual consumers of the BI system do not really know what they
want from a BI system until they see it.
o IT experts building the system do not really understand the business, and so
many changes are needed before the system is accepted by the organization.
o The company does not have sufficient expertise or it is not able to hire such
expertise to manage a project implementation on time and within budget
or to design the system adequately.
2. Data and Technology:
The data of the organization is not clean, and the time and effort needed to
correct or handle this destroys the success of the BI project.
The BI technology chosen turns out to be so rigid and painstaking to change that it
takes too long and costs too much to complete the project on time.