Chapter 2: Data Warehouse
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities with respect to which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the
store's sales with respect to the dimensions time, item, and location.
These dimensions allow the store to keep track of things such as the monthly sales of
items and the locations at which the items were sold. Each dimension has a table related to
it, called a dimensional table, which describes the dimension further. For example, a
dimensional table for item may contain the attributes item name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This theme
is represented by a fact table. Facts are numerical measures. The fact table contains the names of
the facts or measures, as well as keys to each of the related dimensional tables.
Now suppose we want to view the sales data with a third dimension. For example, in addition to
time and item, we also consider the location dimension for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3D data are shown in the table, where the 3D data are represented
as a series of 2D tables.
Conceptually, the same data may also be represented in the form of a 3D data cube, as shown
in the figure.
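A small pandas sketch can make this slicing concrete. The sales figures below are invented purely for illustration; the script prints one 2D (time x item) table per location, mirroring how a 3D cube is viewed as a series of 2D tables.

```python
# A minimal sketch (illustrative, made-up sales figures): a 3D view of
# sales along time, item and location, shown as one 2D table per city.
import pandas as pd

sales = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["TV", "Phone", "TV", "Phone", "TV", "Phone"],
    "location": ["Chennai", "Chennai", "Chennai", "Chennai", "Delhi", "Delhi"],
    "sales":    [120, 300, 150, 280, 90, 210],
})

# One 2D slice (time x item) of the cube for each location.
for city, slice_2d in sales.groupby("location"):
    table = slice_2d.pivot_table(index="time", columns="item",
                                 values="sales", aggfunc="sum")
    print(f"\nLocation = {city}\n{table}")
```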
When you measure or transfer data and notice odd naming conventions, typos, or incorrect
capitalization, these are structural faults. Such inconsistencies can result in mislabeled
categories or classes. For instance, "N/A" and "Not Applicable" might both appear on a
given sheet, but they ought to be analyzed under the same heading.
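A minimal pandas sketch of fixing such a structural fault; the column name and the label variants are assumptions chosen for illustration.

```python
# A minimal sketch, assuming a 'status' column with inconsistent labels.
import pandas as pd

df = pd.DataFrame({"status": ["N/A", "Not Applicable", "approved", "Approved ", "n.a."]})

# Trim whitespace and unify capitalization first.
df["status"] = df["status"].str.strip().str.lower()

# Map the different spellings of "not applicable" onto one category.
df["status"] = df["status"].replace({"n/a": "not applicable", "n.a.": "not applicable"})

print(df["status"].value_counts())
```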
There will frequently be isolated observations that, at first glance, do not seem to fit the data you
are analyzing. Removing an outlier when you have a good reason to, such as incorrect data
entry, will improve the quality of the data you are working with.
Because many algorithms will not tolerate missing values, you cannot simply overlook missing data. There
are a few options for handling it; none of them is ideal, but they can be taken into account, for
example, by dropping the affected observations or by filling in the missing values (see the methods listed further below).
As part of fundamental validation, you ought to be able to answer the following questions
once the data cleansing procedure is complete:
• Does the data abide by the rules that apply to its particular field?
• Does it support or refute your working theory? Does it offer any new information?
• Can you identify any trends in the data that support your next theory?
Inaccurate or noisy data can lead to false conclusions, which in turn inform poor company strategy
and decision-making. False conclusions can also result in an embarrassing situation in a reporting
meeting when you find out your data could not withstand further scrutiny.
Before you get there, it is crucial to establish a culture of quality data in your organization. To
achieve this, you should document the tools you might employ to develop this plan.
2. Fill in the missing value: This strategy is not always practical or efficient, and it can be
time-consuming. In this approach, one fills in the missing values. The most common way to
do this is manually, but other options include using the attribute mean or the most
probable value.
3. Binning method: This strategy is fairly easy to understand. The data is first sorted and
then divided into several equal-sized parts (bins). The values in each bin are then
smoothed using nearby values, for example by replacing them with the bin mean, the bin
median, or the bin boundaries.
4. Regression: The data is smoothed out by fitting it to a regression function. The regression
may be linear or multiple: a linear regression has only one independent variable, while a
multiple regression has more than one.
5. Clustering: This technique focuses on grouping. Clustering organizes comparable values
into a "group" or "cluster", and values that fall outside the clusters can then be identified
as outliers. (A short code sketch of methods 2-5 is given after this list.)
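The following is a minimal, illustrative sketch of how methods 2-5 might look with pandas, NumPy, and scikit-learn. The price values, the number of bins, and the number of clusters are all assumptions chosen only to keep the example small.

```python
# A minimal, illustrative sketch of methods 2-5 on a small made-up series.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

prices = pd.Series([4, 8, np.nan, 15, 21, 21, 24, 25, 28, 34, 200.0])

# 2. Fill in the missing value, here with the attribute mean.
prices = prices.fillna(prices.mean())

# 3. Binning: sort the data, split it into (roughly) equal-sized bins, and
#    smooth each bin by replacing its values with the bin mean.
sorted_vals = prices.sort_values().reset_index(drop=True)
bins = np.array_split(sorted_vals, 4)
smoothed = pd.concat([pd.Series(b.mean(), index=b.index) for b in bins])

# 4. Regression: fit a simple linear trend and use the fitted line as the
#    smoothed version of the data.
x = np.arange(len(prices))
slope, intercept = np.polyfit(x, prices, deg=1)
regression_smoothed = slope * x + intercept

# 5. Clustering: group the values with k-means; a value left alone in its
#    own cluster is treated as a possible outlier.
X = prices.to_numpy().reshape(-1, 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
cluster_sizes = np.bincount(labels)
outliers = prices[cluster_sizes[labels] == 1]

print(smoothed.values)
print(regression_smoothed.round(1))
print("Possible outliers:", outliers.tolist())
```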
The data cleaning process for data mining is carried out in the following steps.
1. Monitoring the errors: Keep track of the areas where errors seem to occur most frequently.
This will make it simpler to identify and correct inaccurate or corrupt data. This is
particularly important when integrating a potential substitute with current management
software.
2. Standardize the mining process: To help lower the likelihood of duplication, standardize the
place of insertion.
3. Validate data accuracy: Analyze the data and invest in data cleaning
software. Artificial intelligence-based tools can be used to check the data thoroughly for accuracy.
5. Communicate with the team: Keeping the team informed will help with developing and
strengthening relationships with clients, as well as with sending more focused information
to prospective clients.
The following are some examples of how data cleaning is used in data mining:
• To ensure that the resulting data has the correct format, structure, and consistency
without any duplication at the destination, it is crucial to maintain the data's quality,
security, and consistency while it is in transit.
Data cleaning is a requirement for ensuring the correctness, integrity, and security of corporate data,
which may be of varying quality depending on its properties or attributes. Some commonly
used data cleaning tools in data mining are as follows:
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. DataCleaner
6. Cloudingo
8. TIBCO Clarity
9. WinPure
• Clients are happier and employees are less annoyed when there are fewer mistakes.
• The capacity to map out the different functions of your data and what it is intended to do.
Data integration is the process of merging data from several disparate sources. While performing
data integration, you must deal with issues such as data redundancy, inconsistency, and duplication.
In data mining, data integration is a data preprocessing technique that merges data from
multiple heterogeneous data sources into a coherent data store and provides a unified
view of the data.
These sources may include several data cubes, databases, or flat files. The data
integration approach is formally stated as a triple (G, S, M), where G represents the global
schema, S represents the heterogeneous source schemas, and M represents the mapping between
the source and global schema queries.
Tight Coupling
It is the process of using ETL (Extraction, Transformation, and Loading) to combine data from
various sources into a single physical location.
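As a rough sketch of tight coupling, the snippet below extracts records from two assumed sources, transforms them into one schema, and loads them into a single warehouse table; the file name, table names, and connection strings are placeholders rather than real resources.

```python
# A minimal ETL sketch for tight coupling (all resource names are placeholders).
import pandas as pd
from sqlalchemy import create_engine

source_db = create_engine("sqlite:///shop_b.db")      # one heterogeneous source
warehouse = create_engine("sqlite:///warehouse.db")   # single physical target

# Extract: pull records from a CSV export and a source database.
shop_a = pd.read_csv("shop_a_sales.csv")              # assumed columns: date, item, amount
shop_b = pd.read_sql("SELECT sale_date, product, total FROM sales", source_db)

# Transform: bring both sources to one schema and drop obvious duplicates.
shop_b = shop_b.rename(columns={"sale_date": "date", "product": "item", "total": "amount"})
combined = pd.concat([shop_a, shop_b], ignore_index=True).drop_duplicates()

# Load: store the unified view in the warehouse.
combined.to_sql("unified_sales", warehouse, if_exists="replace", index=False)
```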
Loose Coupling
With loose coupling, the data is kept in the actual source databases. This approach
provides an interface that takes a query from the user, transforms it into a format the source
databases can understand, and then sends the query directly to the source databases to obtain
the result.
Tuple Duplication
In addition to redundancy, data integration must also handle duplicate tuples. Duplicate
tuples may appear in the resulting data if a denormalized table was used as a source
for data integration.
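A sketch of removing such duplicate tuples after integration; the column names and rows below are invented for illustration.

```python
# A minimal sketch: removing duplicate tuples after integration.
import pandas as pd

integrated = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 103],
    "city":        ["Mumbai", "Mumbai", "Delhi", "Chennai", "Chennai"],
    "amount":      [250, 250, 400, 150, 150],
})

# Keep only one copy of each fully identical tuple.
deduplicated = integrated.drop_duplicates()
print(deduplicated)
```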
Manual Integration
This method avoids using automation during data integration. The data analyst collects, cleans,
and integrates the data to produce meaningful information. This strategy is suitable for a mini
organization with a limited data set. Although, it will be time-consuming for the huge,
Middleware Integration
Middleware software is used to take data from multiple sources, normalize it, and store it in the
resulting data set. This technique is used when an enterprise needs to integrate data from legacy
systems into modern systems; the middleware acts as a translator between the legacy and
advanced systems. Think of it as an adapter that allows two systems with different interfaces to be
connected. It is only applicable to certain systems.
Application-based integration
This approach uses software applications to extract, transform, and load data from disparate sources.
It saves time and effort, but it is a little more complicated because building such an
application requires technical understanding.
There are several data transformation techniques that can help structure and clean up the data
before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some
algorithms.
The concept behind data smoothing is that it can identify simple changes that help
predict different trends and patterns. This helps analysts and traders who need to
look at a lot of data, which can often be difficult to digest, to find patterns they
would not otherwise see.
We have seen how noise is removed from the data using techniques such as binning,
regression, and clustering.
2. Attribute Construction
In the attribute construction method, new attributes are constructed from the existing attributes
to ease data mining. The new attributes are created from the given attributes and applied to
assist the mining process. This simplifies the original data and makes the mining more
efficient.
For example, suppose we have a data set containing measurements of different plots, i.e.,
we may have the height and width of each plot. We can then construct a new attribute
'area' from the attributes 'height' and 'width'. This also helps in understanding the relations
among the attributes in a data set.
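A minimal sketch of this example, assuming a small plots table with 'height' and 'width' columns:

```python
# A minimal sketch of attribute construction: derive 'area' from the
# existing 'height' and 'width' attributes of each plot.
import pandas as pd

plots = pd.DataFrame({"height": [10, 12, 8], "width": [20, 15, 25]})
plots["area"] = plots["height"] * plots["width"]   # new, constructed attribute
print(plots)
```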
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources and integrated into a summary
suitable for data analysis. This is a crucial step, since the accuracy of data analysis
insights is highly dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and in large enough quantity is necessary to produce
relevant results. The aggregated data is useful for everything from decisions concerning the
financing or business strategy of a product to pricing, operations, and marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of
each year. We can aggregate the data to get the enterprise's annual sales report.
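A minimal sketch of this aggregation, with made-up quarterly figures:

```python
# A minimal sketch of data aggregation: roll quarterly sales up to annual sales.
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [120, 135, 150, 170, 160, 155, 180, 190],
})

annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```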
4. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or
[0.0, 1.0]. There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A and we have n number of observed values for attribute
A that are V1, V2, V3, ….Vn.
o Min-max normalization: This method maps a value Vi of A to V'i in a new, smaller range
[new_minA, new_maxA]. The formula for min-max normalization is:
V'i = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
For example, suppose $12,000 and $98,000 are the minimum and maximum values for the
attribute income, and we have to map the value $73,600 into the range [0.0, 1.0].
Min-max normalization gives (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716.
o Z-score normalization: Here the values of attribute A are normalized using the mean (Ᾱ)
and standard deviation (σA) of A:
V'i = (Vi - Ᾱ) / σA
For example, suppose the mean and standard deviation of attribute A are $54,000 and $16,000,
and we have to normalize the value $73,600 using z-score normalization. The result is
(73,600 - 54,000) / 16,000 = 1.225.
o Decimal Scaling: This method normalizes the value of attribute A by moving the decimal
point in the value. The movement of the decimal point depends on the maximum absolute
value of A. The formula for decimal scaling is:
V'i = Vi / 10^j
where j is the smallest integer such that max(|V'i|) < 1.
For example, the observed values for attribute A range from -986 to 917, and the maximum
absolute value for attribute A is 986. Here, to normalize each value of attribute
A using decimal scaling, we have to divide each value of attribute A by 1000, i.e., j=3. So,
the value -986 would be normalized to -0.986, and 917 would be normalized to 0.917. The
normalization parameters, such as the mean, standard deviation, and maximum absolute value,
must be preserved so that future data can be normalized uniformly.
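The three methods can be reproduced in a few lines; the sketch below simply recomputes the worked examples from the text.

```python
# A minimal sketch of min-max, z-score and decimal-scaling normalization,
# using the worked examples from the text.
import numpy as np

# Min-max normalization: income value 73,600 with min 12,000, max 98,000,
# mapped into the new range [0.0, 1.0].
v, min_a, max_a = 73_600, 12_000, 98_000
new_min, new_max = 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
print(round(v_minmax, 3))          # 0.716

# Z-score normalization: mean 54,000 and standard deviation 16,000.
mean_a, std_a = 54_000, 16_000
v_zscore = (v - mean_a) / std_a
print(round(v_zscore, 3))          # 1.225

# Decimal scaling: maximum absolute value 986, so divide by 10**3.
values = np.array([-986, 917])
j = int(np.floor(np.log10(np.max(np.abs(values)))) + 1)   # smallest j with max(|v|)/10**j < 1
print(values / 10 ** j)            # [-0.986  0.917]
```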
5. Data Discretization
This is the process of converting continuous data into a set of data intervals. Continuous
attribute values are substituted by small interval labels, which makes the data easier to study
and analyze. If a data mining task handles a continuous attribute, its continuous values
can be replaced by these discrete interval labels, which improves the efficiency of the task.
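A minimal sketch of discretization, where the bin edges and interval labels are assumptions chosen for illustration:

```python
# A minimal sketch of discretization: replace continuous ages by interval labels.
import pandas as pd

ages = pd.Series([19, 23, 31, 38, 47, 55, 62])
age_groups = pd.cut(ages, bins=[0, 30, 50, 100],
                    labels=["young", "middle-aged", "senior"])
print(age_groups.tolist())
```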
6. Data Generalization
It converts low-level data attributes to high-level data attributes using concept hierarchy.
This conversion from a lower level to a higher conceptual level is useful to get a clearer
picture of the data. Data generalization can be divided into two approaches:
o Data cube process (OLAP) approach.
o Attribute-oriented induction (AOI) approach.
For example, age data in a dataset may take values in an interval such as (20, 30). These
values can be transformed to a higher conceptual level, such as the categorical values (young, old).
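As an additional sketch, a concept hierarchy can also lift city-level values to the state level; the small mapping below is ordinary geography used purely for illustration.

```python
# A minimal sketch of data generalization through a concept hierarchy:
# replace low-level city values with higher-level state values.
import pandas as pd

sales = pd.DataFrame({"city": ["Chennai", "Kolkata", "Mumbai", "Delhi"],
                      "amount": [150, 90, 200, 120]})

city_to_state = {"Chennai": "Tamil Nadu", "Kolkata": "West Bengal",
                 "Mumbai": "Maharashtra", "Delhi": "Delhi"}

sales["state"] = sales["city"].map(city_to_state)   # climb one level of the hierarchy
print(sales.groupby("state", as_index=False)["amount"].sum())
```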
The entire process of transforming data is known as ETL (Extract, Transform, and Load).
Through the ETL process, analysts can convert data to its desired format. Here are the steps
involved in the data transformation process:
1. Data Discovery: During the first stage, analysts work to understand and identify data in its
source format. To do this, they will use data profiling tools. This step helps analysts decide
what they need to do to get data into its desired format.
2. Data Mapping: During this phase, analysts perform data mapping to determine how
individual fields are modified, mapped, filtered, joined, and aggregated. Data mapping is
essential to many data processes, and one misstep can lead to incorrect analysis and ripple
through your entire organization.
3. Data Extraction: During this phase, analysts extract the data from its original source.
These may include structured sources such as databases or streaming sources such as
customer log files from web applications.
4. Code Generation and Execution: Once the data has been extracted, analysts need to create
code to complete the transformation. Often, analysts generate this code with the help of data
transformation platforms or tools.
5. Review: After transforming the data, analysts need to check it to ensure everything has
been formatted correctly.
6. Sending: The final step involves sending the data to its target destination. The target might
be a data warehouse or a database that handles both structured and unstructured data.
• Better Organization: Transformed data is easier for both humans and computers to use.
• Improved Data Quality: There are many risks and costs associated with bad data. Data
transformation can help your organization eliminate quality issues such as missing values
and other inconsistencies.
• Perform Faster Queries: You can quickly and easily retrieve transformed data because it
is stored and standardized in a single source location.
• Better Data Management: Businesses are constantly generating data from more and more
sources. If there are inconsistencies in the metadata, it can be challenging to organize and
understand it. Data transformation refines your metadata, so it's easier to organize and
understand.
• More Use Out of Data: While businesses may be collecting data constantly, a lot of that
data sits around unanalyzed. Transformation makes it easier to get the most out of your
data by standardizing it and making it more usable.
While data transformation comes with many benefits, there are still some challenges to
transforming data effectively, such as:
• Data transformation can be expensive. The cost is dependent on the specific
infrastructure, software, and tools used to process data. Expenses may include licensing,
computing resources, and hiring necessary personnel.
• Data transformation processes can be resource-intensive. Performing transformations
in an on-premises data warehouse after loading or transforming data before feeding it into
applications can create a computational burden that slows down other operations. If you
use a cloud-based data warehouse, you can do the transformations after loading because
the platform can scale up to meet demand.
• Lack of expertise and carelessness can introduce problems during transformation. Data
analysts without appropriate subject matter expertise are less likely to notice incorrect data
because they are less familiar with the range of accurate and permissible values.
• Enterprises can perform transformations that don't suit their needs. A business might
change information to a specific format for one application only to then revert the
information to its prior format for a different application.
• Data mining is applied to selected data in large databases. When data analysis
and mining are performed on a huge amount of data, processing takes a very long time,
making it impractical and infeasible.
• Data reduction is a process that reduces the volume of the original data and represents it
in a much smaller volume. Data reduction techniques are used to obtain a reduced
representation of the dataset that is much smaller in volume while maintaining the integrity
of the original data.