DATA WAREHOUSE
A data warehouse is kept separate from the operational DBMS. It stores a huge amount of data, which is typically collected from multiple heterogeneous sources such as files, DBMSs, etc.
The goal is to produce statistical results that may help in
decision-making. For example, a college might want to quickly see different results, such as how the placement of CS students has improved over the last 10 years in terms of salaries, counts, etc.
Issues that Occur while Building the Warehouse:
When and how to gather data: In a source-driven
architecture for gathering data, the data sources
transmit new information, either continually (as
transaction processing takes place), or
periodically (nightly, for example).
In a destination-driven architecture, the data warehouse
periodically sends requests for new data to the sources.
Unless updates at the sources are replicated at the warehouse via two-phase commit, the warehouse will
never be quite up to date with the sources. Two-phase
commit is usually far too expensive to be an option, so
data warehouses typically have slightly out-of-date data.
That, however, is usually not a problem for decision-
support systems.
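The destination-driven style can be pictured with a short Python sketch. This is only an illustration: the fetch_new_rows() helper on each source object and the sales table in the warehouse are assumptions, not part of any standard API.

```python
# Illustrative sketch of a destination-driven (pull-based) refresh; the
# fetch_new_rows() helper and the sales table are hypothetical.
import sqlite3

def refresh_warehouse(sources, warehouse: sqlite3.Connection) -> None:
    """One periodic (e.g. nightly) pull: ask every source for its new rows."""
    for source in sources:
        rows = source.fetch_new_rows()   # hypothetical: rows added since the last pull
        warehouse.executemany(
            "INSERT INTO sales (item, category, amount) VALUES (?, ?, ?)", rows
        )
    warehouse.commit()  # between refreshes the warehouse lags the sources slightly
```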
What schema to use: Data sources that have been
constructed independently are likely to have
different schemas.
In fact, they may even use different data models. Part of
the task of a warehouse is to perform schema integration,
and to convert data to the integrated schema before they
are stored.
As a result, the data stored in the warehouse are not just a
copy of the data at the sources. Instead, they can be
thought of as a materialized view of the data at the
sources.
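As a small, hedged sketch of schema integration (the column names are invented for illustration), two independently designed sources can be renamed into a common schema before being stored, so the warehouse content behaves like a materialized view over the sources:

```python
# Two sources describing the same placement information under different schemas
# are converted to one integrated schema before being stored in the warehouse.
import pandas as pd

src1 = pd.DataFrame({"stud_name": ["Asha", "Ravi"], "pkg_lakhs": [12.0, 8.5]})
src2 = pd.DataFrame({"name": ["Meena"], "salary_lpa": [10.0]})

integrated = pd.concat(
    [
        src1.rename(columns={"stud_name": "student", "pkg_lakhs": "salary_lpa"}),
        src2.rename(columns={"name": "student"}),
    ],
    ignore_index=True,
)
print(integrated)
```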
Data transformation and cleansing: The task of
correcting and preprocessing data is called
data cleansing. Data sources often deliver data
with numerous minor inconsistencies, which
can be corrected.
For example, names are often misspelled, and addresses
may have street, area, or city names misspelled, or postal
codes entered incorrectly.
These can be corrected to a reasonable extent by consulting
a database of street names and postal codes in each city.
The approximate matching of data required for this task is
referred to as fuzzy lookup.
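A minimal sketch of fuzzy lookup using Python's standard difflib module; the city list and the similarity cutoff are illustrative choices, not values from the text:

```python
# Correct misspelled city names by approximate matching against a reference list.
from difflib import get_close_matches

valid_cities = ["Mumbai", "Delhi", "Bengaluru", "Hyderabad", "Chennai"]

def clean_city(raw_name: str) -> str:
    """Return the closest valid city name, or the input if no good match exists."""
    matches = get_close_matches(raw_name, valid_cities, n=1, cutoff=0.75)
    return matches[0] if matches else raw_name

print(clean_city("Mumbia"))   # corrected to Mumbai
print(clean_city("Chenai"))   # corrected to Chennai
```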
How to propagate updates: Updates on relations at the
data sources must be propagated to the data
warehouse. If the relations at the data warehouse are
exactly the same as those at the data source, the
propagation is straightforward.
If they are not, the problem of propagating updates is
basically the view-maintenance problem.
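The view-maintenance idea can be sketched as follows: when a sale is inserted at a source, the warehouse's summary relation is updated incrementally rather than recomputed from scratch. The table and column names are illustrative assumptions:

```python
# Incremental maintenance of a materialized aggregate (total sales per item/category).
from collections import defaultdict

total_sales = defaultdict(float)   # warehouse-side summary relation

def propagate_insert(item: str, category: str, amount: float) -> None:
    """Apply one source-side insert to the warehouse-side summary."""
    total_sales[(item, category)] += amount

propagate_insert("T-shirt", "clothing", 499.0)
propagate_insert("T-shirt", "clothing", 599.0)
propagate_insert("Jeans", "clothing", 1299.0)
print(dict(total_sales))
```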
What data to summarize: The raw data generated by a
transaction-processing system may be too large to
store online.
However, we can answer many queries by maintaining just
summary data obtained by aggregation on a relation,
rather than maintaining the entire relation. For example,
instead of storing data about every sale of clothing, we
can store total sales of clothing by item name and
category.
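A short pandas sketch of this kind of summarization (the sales figures are made up): instead of one row per sale, the warehouse keeps total sales by item name and category:

```python
# Summarize individual clothing sales into totals per (item, category).
import pandas as pd

sales = pd.DataFrame({
    "item":     ["T-shirt", "T-shirt", "Jeans", "Jeans", "Jacket"],
    "category": ["clothing"] * 5,
    "amount":   [499.0, 599.0, 1299.0, 1499.0, 2999.0],
})

summary = sales.groupby(["item", "category"], as_index=False)["amount"].sum()
print(summary)   # one row per (item, category) instead of one row per sale
```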
Benefits of a Data Warehouse:
Better business analytics: A data warehouse plays an important role in a business by storing and analyzing all of the company's past data and records, which further improves the company's understanding and analysis of its data.
Faster queries: A data warehouse is designed to handle large analytical queries, which is why it runs such queries faster than an operational database.
Improved data quality: The data gathered from different sources is stored and analyzed in the warehouse without the warehouse altering or adding data on its own, so data quality is maintained; if any data quality issue arises, the data warehouse team resolves it.
Historical insight: The warehouse stores all historical data, which contains details about the business, so that one can analyze it at any time and extract insights from it.
MULTIDIMENSIONAL DATA MODEL
The Multidimensional Data Model is a method used for organizing data in a database, with a good arrangement and assembly of the database's contents.
The Multidimensional Data Model allows users to ask analytical questions associated with market or business trends, unlike relational databases, which allow users to access data only in the form of individual queries.
It allows users to receive answers to their requests rapidly, since the data can be created and examined comparatively quickly.
Working of the Multidimensional Data Model:
A Multidimensional Data Model is built by following a set of pre-decided steps.
The following stages should be followed by every project that builds a Multidimensional Data Model:
Stage 1: Assembling data from the client: In the first stage, correct data is collected from the client. Typically, software professionals explain to the client, in simple terms, the range of data that can be obtained with the selected technology, and then collect the complete data in detail.
Stage 2: Grouping different segments of the system: In the second stage, all the data is recognized and classified into the respective segments it belongs to, which also makes it problem-free to apply step by step.
Stage 3: Noticing the different proportions: The third stage forms the basis on which the design of the system rests. Here, the main factors are recognized according to the user's point of view. These factors are also known as "dimensions".
Stage 4: Preparing the actual-time factors and their respective qualities: In the fourth stage, the factors recognized in the previous step are used to identify the related qualities. These qualities are also known as "attributes" in the database.
Stage 5: Finding the actuality (facts) of the factors listed previously and their qualities: In the fifth stage, the facts are separated and distinguished from the factors collected so far. These facts play a significant role in the arrangement of a Multidimensional Data Model.
Stage 6: Building the schema to place the data, with respect to the information collected in the steps above: In the sixth stage, a schema is built on the basis of the information collected previously.
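One common way to realize such a schema is a star-style layout with a fact table and dimension tables. The sketch below mocks this up with pandas DataFrames; the table and column names are illustrative assumptions, not part of the stages above:

```python
# A star-style schema: dimension tables hold attributes, the fact table holds
# foreign keys plus measures, and analytical questions join and aggregate them.
import pandas as pd

dim_item = pd.DataFrame({"item_id": [1, 2], "item": ["T-shirt", "Jeans"],
                         "category": ["clothing", "clothing"]})
dim_time = pd.DataFrame({"time_id": [1, 2], "year": [2023, 2024]})

fact_sales = pd.DataFrame({"item_id": [1, 1, 2], "time_id": [1, 2, 2],
                           "amount": [499.0, 599.0, 1299.0]})

report = (fact_sales.merge(dim_item, on="item_id")
                    .merge(dim_time, on="time_id")
                    .groupby(["category", "year"], as_index=False)["amount"].sum())
print(report)
```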
DATA CLEANING
Data cleaning is the process of correcting or deleting
inaccurate, damaged, improperly formatted, duplicated, or
insufficient data from a dataset.
Even if results and algorithms appear to be correct, they
are unreliable if the data is inaccurate.
There are numerous ways for data to be duplicated or
incorrectly labeled when merging multiple data sources.
In general, data cleaning lowers errors and raises the
caliber of the data. Although it might be a time-consuming
and laborious operation, fixing data mistakes and
removing incorrect information must be done.
Data mining, a method for finding useful information in data, is a crucial aid in cleaning up data.
Data quality mining is a novel methodology that uses data
mining methods to find and fix data quality issues in
sizable databases.
Data mining automatically extracts intrinsic and hidden information from large data sets. Data cleansing can be accomplished using a variety of data mining approaches.
Steps for Cleaning Data
You can follow these fundamental stages to clean your data, even though the techniques employed may vary depending on the sorts of data your firm stores:
1. Remove duplicate or irrelevant observations
Remove duplicate or pointless observations as well as
undesirable observations from your dataset. The majority
of duplicate observations will occur during data gathering.
Duplicate data can be produced when you merge data
sets from several sources, scrape data, or get data from
clients or other departments.
De-duplication is one of the most important factors to take into account in this procedure. Observations are deemed irrelevant when they do not pertain to the particular issue you are attempting to analyze; a short sketch of both removals follows.
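A minimal pandas sketch of this step; the column names and the retail-only filter are assumptions made for illustration:

```python
# Drop exact duplicate rows, then drop observations irrelevant to the analysis.
import pandas as pd

df = pd.DataFrame({
    "customer": ["Asha", "Asha", "Ravi", "Meena"],
    "city":     ["Delhi", "Delhi", "Mumbai", "Chennai"],
    "segment":  ["retail", "retail", "retail", "wholesale"],
})

df = df.drop_duplicates()             # remove duplicate observations
df = df[df["segment"] == "retail"]    # keep only observations relevant to a retail-only analysis
print(df)
```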
2. Fix structural errors
Structural errors arise when you measure or transfer data and notice odd naming conventions, typos, or inconsistent capitalization. These inconsistencies can result in mislabelled categories or classes.
For instance, "N/A" and "Not Applicable" might be
present on any given sheet, but they ought to be
analyzed under the same heading.
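A small sketch of fixing such structural errors with pandas, where mapping everything onto "N/A" is simply the choice made for illustration:

```python
# Normalize capitalization and map label variants onto one consistent category.
import pandas as pd

df = pd.DataFrame({"status": ["N/A", "Not Applicable", "n/a", "Approved", "approved"]})

df["status"] = (df["status"]
                .str.strip()
                .str.lower()
                .replace({"not applicable": "n/a"})   # analyze variants under one heading
                .str.title())                         # restore consistent capitalization
print(df["status"].value_counts())
```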
3. Filter unwanted outliers
There will frequently be isolated findings that, at first
glance, do not seem to fit the data you are analyzing.
If you have a good reason to remove an outlier, such as incorrect data entry, doing so will improve the performance of the analysis you are working on.
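One common rule of thumb (not prescribed by the text) is to drop values lying more than 1.5 times the interquartile range outside the quartiles, sketched here with pandas:

```python
# Filter rows whose amount falls outside the 1.5 * IQR bounds.
import pandas as pd

df = pd.DataFrame({"amount": [480, 510, 495, 505, 500, 9999]})  # 9999 looks like a data-entry error

q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df["amount"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
print(df[mask])   # the suspicious 9999 row is filtered out
```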
4. Handle missing data
Because many algorithms won't tolerate missing values,
you can't overlook missing data. There are a few options
for handling missing data. While none of them is ideal, each can be taken into account, for example:
Although you can remove observations with missing
values, doing so will result in the loss of information, so
proceed with caution.
Again, there is a chance of undermining the integrity of the data, since you may be working from assumptions rather than actual observations when you fill in missing values based on other observations.
To handle null values effectively, you may also need to change the way the data is used.
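The three options above can be sketched briefly with pandas; which one is appropriate depends on the analysis, so treat this purely as an illustration:

```python
# Three ways of handling missing values: drop, impute, or flag and keep.
import pandas as pd

df = pd.DataFrame({"age": [25, None, 40], "salary_lpa": [8.0, 10.0, None]})

dropped = df.dropna()                                        # option 1: drop rows with missing values
imputed = df.fillna(df.mean(numeric_only=True))              # option 2: impute from other observations (column means)
flagged = df.assign(salary_missing=df["salary_lpa"].isna())  # option 3: change how the data is used (keep a flag)

print(dropped, imputed, flagged, sep="\n\n")
```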
DATA INTEGRATION AND TRANSFORMATION
Data Integration is a data preprocessing technique that
combines data from multiple heterogeneous data sources
into a coherent data store and provides a unified view of
the data. These sources may include multiple data cubes,
databases, or flat files.
The data integration approach is formally defined as a triple <G, S, M>, where:
G stands for the global schema,
S stands for the heterogeneous source schemas, and
M stands for the mappings between queries over the source schemas and the global schema.
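A toy sketch of the <G, S, M> triple in Python: S holds source tables with their own schemas, M records how each source's columns map onto the global schema G, and the unified view is produced by applying M. All names and values are invented for illustration:

```python
# Apply per-source mappings M to sources S to obtain data under global schema G.
import pandas as pd

G = ["student", "dept", "salary_lpa"]                       # global schema

S = {
    "placements_csv": pd.DataFrame({"stud_name": ["Asha"], "branch": ["CS"], "pkg": [12.0]}),
    "hr_db":          pd.DataFrame({"name": ["Ravi"], "department": ["IT"], "ctc_lpa": [9.0]}),
}

M = {                                                        # per-source column mappings
    "placements_csv": {"stud_name": "student", "branch": "dept", "pkg": "salary_lpa"},
    "hr_db":          {"name": "student", "department": "dept", "ctc_lpa": "salary_lpa"},
}

unified = pd.concat(
    [table.rename(columns=M[name])[G] for name, table in S.items()],
    ignore_index=True,
)
print(unified)
```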
Data integration can be challenging due to the variety of
data formats, structures, and semantics used by different
data sources.
Different data sources may use different data types,
naming conventions, and schemas, making it difficult to
combine the data into a single view.
Data integration typically involves a combination of
manual and automated processes, including data profiling,
data mapping, data transformation, and data reconciliation.
Data Integration Approaches
There are mainly two types of approaches for data
integration. These are as follows:
Tight Coupling
It is the process of using ETL (Extraction, Transformation,
and Loading) to combine data from various sources into a
single physical location.
Loose Coupling
With loose coupling, the data is kept in the actual source databases. This approach provides an interface that takes a query from the user, converts it into a format that the source databases can understand, and then sends the query directly to the source databases to obtain the result.
Issues in Data Integration:
There are several issues that can arise when integrating
data from multiple sources, including:
Data Quality: Inconsistencies and errors in the data
can make it difficult to combine and analyze.
Data Semantics: Different sources may use different
terms or definitions for the same data, making it
difficult to combine and understand the data.
Data Heterogeneity: Different sources may use
different data formats, structures, or schemas, making
it difficult to combine and analyze the data.
Data Privacy and Security: Protecting sensitive
information and maintaining security can be
difficult when integrating data from multiple
sources.
Scalability: Integrating large amounts of data from
multiple sources can be computationally
expensive and time-consuming.
Data Governance: Managing and maintaining the
integration of data from multiple sources can be
difficult, especially when it comes to ensuring
data accuracy, consistency, and timeliness.
Performance: Integrating data from multiple
sources can also affect the performance of the
system.
Integration with existing systems: Integrating new
data sources with existing systems can be a
complex task, requiring significant effort and
resources.
Complexity: The complexity of integrating data
from multiple sources can be high, requiring
specialized skills and knowledge.
DATA TRANSFORMATION
Data transformation in data mining refers to the process of
converting raw data into a format that is suitable for
analysis and modeling.
The goal of data transformation is to prepare the data for
data mining so that it can be used to extract useful insights
and knowledge. Data transformation typically involves
several steps, including:
Data cleaning: Removing or correcting errors,
inconsistencies, and missing values in the
data.
Data integration: Combining data from multiple
sources, such as databases and spreadsheets, into a
single format.
Data normalization: Scaling the data to a common
range of values, such as between 0 and 1, to facilitate
comparison and analysis.
Data reduction: Reducing the dimensionality of the
data by selecting a subset of relevant features or
attributes.
Data discretization: Converting continuous data
into discrete categories or bins.
Data aggregation: Combining data at different
levels of granularity, such as by summing or
averaging, to create new features or attributes.
Data transformation is an important step in the data
mining process as it helps to ensure that the data is in a
format that is suitable for analysis and modeling, and that
it is free of errors and inconsistencies.
Data transformation can also help to improve the
performance of data mining algorithms, by reducing the
dimensionality of the data, and by scaling the data to a
common range of values.
The data is transformed in ways that are ideal for mining it. Data transformation involves the following steps:
1. Smoothing: Smoothing is a process used to remove noise from the dataset using some algorithm. It allows the important features present in the dataset to be highlighted and helps in predicting patterns. When collecting data, the data can be manipulated to eliminate or reduce variance or any other form of noise (see the sketch after this list).
2. Aggregation: Data aggregation is the method of storing and presenting data in a summary format. The data may be obtained from multiple data sources and integrated into a single data analysis description. This is a crucial step, since the accuracy of data analysis insights is highly dependent on the quantity and quality of the data used. Gathering accurate, high-quality data in a large enough quantity is necessary to produce relevant results.
3. Discretization: Discretization is a process of transforming continuous data into a set of small intervals. Most real-world data mining activities involve continuous attributes, yet many existing data mining frameworks are unable to handle them. Moreover, even when a data mining task can manage a continuous attribute, its efficiency can be improved significantly by replacing a continuous-valued attribute with its discrete values, for example age intervals (1-10, 11-20, ...) or labels (young, middle age, senior); see the sketch after this list.
4. Attribute Construction: New attributes are constructed from the given set of attributes and applied to assist the mining process. This simplifies the original data and makes mining more efficient.
5. Generalization: Generalization converts low-level data attributes into high-level data attributes using a concept hierarchy. For example, age initially in numerical form (22, 25) is converted into categorical values (young, old). Similarly, categorical attributes such as house addresses may be generalized to higher-level definitions such as town or country.
6. Normalization: Data normalization involves converting all data variables into a given range, such as 0 to 1.
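A combined sketch of smoothing, discretization, and normalization on one small column, as referenced in steps 1, 3, and 6 above; the window size, bin edges, and labels are illustrative choices:

```python
# Smoothing (moving average), discretization (interval labels), and
# min-max normalization applied to a single age column.
import pandas as pd

ages = pd.Series([22, 25, 31, 38, 45, 52, 61, 70])

smoothed = ages.rolling(window=3, center=True).mean()        # smoothing: 3-point moving average
age_group = pd.cut(ages, bins=[0, 30, 55, 120],
                   labels=["young", "middle age", "senior"]) # discretization: labelled intervals
normalized = (ages - ages.min()) / (ages.max() - ages.min()) # normalization: scale into [0, 1]

print(pd.DataFrame({"age": ages, "smoothed": smoothed,
                    "group": age_group, "normalized": normalized}))
```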
DATA REDUCTION
Data reduction is a technique used in data mining to reduce
the size of a dataset while still preserving the most
important information.
This can be beneficial in situations where the dataset is too
large to be processed efficiently, or where the dataset
contains a large amount of irrelevant or redundant
information.
There are several different data reduction techniques that
can be used in data mining, including:
Data Sampling: This technique involves selecting a
subset of the data to work with, rather than using the
entire dataset. This can be useful for reducing the size
of a dataset while still preserving the overall trends
and patterns in the data.
Dimensionality Reduction: This technique involves
reducing the number of features in the dataset,
either by removing features that are not relevant or
by combining multiple features into a single feature.
Data Compression: This technique involves using
techniques such as lossy or lossless compression to
reduce the size of a dataset.
Data Discretization: This technique involves
converting continuous data into discrete data by
partitioning the range of possible values into intervals
or bins.
Feature Selection: This technique involves selecting a subset of features from the dataset that are most relevant to the task at hand (see the sketch after this list).
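A small sketch of two of the techniques above, random sampling and a simple variance-based feature selection; the sampling fraction and the variance threshold are arbitrary illustrative choices:

```python
# Data sampling (keep a fraction of rows) and feature selection (drop near-constant columns).
import pandas as pd

df = pd.DataFrame({
    "salary_lpa": [8.0, 12.0, 9.5, 15.0, 7.0, 11.0],
    "bonus_lpa":  [1.0, 2.0, 1.5, 3.0, 0.5, 2.5],
    "constant":   [1, 1, 1, 1, 1, 1],          # carries no information
})

sample = df.sample(frac=0.5, random_state=0)    # data sampling: keep half the rows

variances = df.var()
selected = df[variances[variances > 0.1].index]  # feature selection: keep informative columns
print(sample, selected.columns.tolist(), sep="\n\n")
```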
It is important to note that data reduction involves a trade-off between accuracy and the size of the data: the more the data is reduced, the less accurate and the less generalizable the resulting model may be.
DISCRETIZATION
Data discretization refers to a method of converting a huge
number of data values into smaller ones so that the
evaluation and management of data become easy.
In other words, data discretization is a method of
converting attributes values of continuous data into a finite
set of intervals with minimum data loss.
There are two forms of data discretization: the first is supervised discretization, and the second is unsupervised discretization.
Supervised discretization refers to a method in which class information is used. Unsupervised discretization refers to a method that depends on the way the operation proceeds; it works with a top-down splitting strategy or a bottom-up merging strategy.
Discretization Technique:
Discretization is one form of data transformation technique. It transforms numeric values into interval labels or conceptual labels. For example, age can be transformed into intervals (0-10, 11-20, ...) or into conceptual labels such as youth, adult, senior.
There are different techniques of discretization:
Discretization by binning: This is an unsupervised method of partitioning the data into equal partitions, either by equal width or by equal frequency (see the sketch after this list).
Discretization by clustering: Clustering can be applied to discretize numeric attributes. It partitions the values into different clusters or groups by following a top-down or bottom-up strategy.
Discretization by decision tree: This employs a top-down splitting strategy. It is a supervised technique that uses class information.
Discretization by correlation analysis: ChiMerge employs a bottom-up approach, finding the best neighboring intervals and then merging them recursively to form larger intervals.
Discretization by histogram: Histogram analysis is unsupervised because, like binning, it does not use any class information. Various partitioning rules are used to define histograms.
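A brief sketch of the binning flavours mentioned in the list above: pandas' cut gives equal-width intervals, while qcut gives (approximately) equal-frequency bins. The ages and the number of bins are illustrative:

```python
# Equal-width binning (cut) versus equal-frequency binning (qcut).
import pandas as pd

age = pd.Series([18, 22, 25, 31, 38, 45, 52, 61, 70, 85])

equal_width = pd.cut(age, bins=3)    # 3 intervals of equal width
equal_freq = pd.qcut(age, q=3)       # 3 bins with roughly equal counts

print(pd.DataFrame({"age": age, "equal_width": equal_width, "equal_freq": equal_freq}))
```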
Importance of Discretization:
Discretization is important because it is useful:
To generate concept hierarchies.
To transform numeric data.
To ease evaluation and management of data.
To minimize data loss.
To produce better results.
To generate a more understandable structure, such as a decision tree.