Chapter 2: Data Warehouse
A multidimensional model views data in the form of a data-cube. A data cube enables data to be
modeled and viewed in multiple dimensions. It is defined by dimensions and facts.
The dimensions are the perspectives or entities with respect to which an organization keeps
records. For example, a shop may create a sales data warehouse to keep records of the
store's sales with respect to the dimensions time, item, and location.
These dimensions allow the store to keep track of things such as the monthly sales of
items and the locations at which the items were sold. Each dimension has a table related to
it, called a dimensional table, which describes the dimension further. For example, a
dimensional table for item may contain the attributes item name, brand, and type.
A multidimensional data model is organized around a central theme, for example, sales. This theme
is represented by a fact table. Facts are numerical measures. The fact table contains the names of
the facts or measures, as well as keys to each of the related dimensional tables.
Now suppose we want to view the sales data with a third dimension. For example, in addition to
time and item, we also consider the location dimension for the cities Chennai, Kolkata,
Mumbai, and Delhi. These 3D data are shown in the table, where the 3D data are represented
as a series of 2D tables.
Conceptually, the same data may also be represented in the form of a 3D data cube, as shown
in the figure.
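A small pandas sketch can make this slicing concrete. The sales figures below are invented purely for illustration; the script prints one 2D (time x item) table per location, mirroring how a 3D cube is viewed as a series of 2D tables.

```python
# A minimal sketch (illustrative, made-up sales figures): a 3D view of
# sales along time, item and location, shown as one 2D table per city.
import pandas as pd

sales = pd.DataFrame({
    "time":     ["Q1", "Q1", "Q2", "Q2", "Q1", "Q2"],
    "item":     ["TV", "Phone", "TV", "Phone", "TV", "Phone"],
    "location": ["Chennai", "Chennai", "Chennai", "Chennai", "Delhi", "Delhi"],
    "sales":    [120, 300, 150, 280, 90, 210],
})

# One 2D slice (time x item) of the cube for each location.
for city, slice_2d in sales.groupby("location"):
    table = slice_2d.pivot_table(index="time", columns="item",
                                 values="sales", aggfunc="sum")
    print(f"\nLocation = {city}\n{table}")
```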
When you measure or transfer data and notice odd naming conventions, typos, or incorrect
capitalization, these are structural faults. Such inconsistencies can result in mislabeled
categories or classes. For instance, "N/A" and "Not Applicable" might both appear on a
given sheet, but they ought to be analyzed under the same heading.
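A minimal pandas sketch of fixing such a structural fault; the column name and the label variants are assumptions chosen for illustration.

```python
# A minimal sketch, assuming a 'status' column with inconsistent labels.
import pandas as pd

df = pd.DataFrame({"status": ["N/A", "Not Applicable", "approved", "Approved ", "n.a."]})

# Trim whitespace and unify capitalization first.
df["status"] = df["status"].str.strip().str.lower()

# Map the different spellings of "not applicable" onto one category.
df["status"] = df["status"].replace({"n/a": "not applicable", "n.a.": "not applicable"})

print(df["status"].value_counts())
```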
There will frequently be isolated observations that, at first glance, do not seem to fit the data you
are analyzing. Removing an outlier when you have a good reason to, such as incorrect data
entry, will improve the quality of the data you are working with.
Because many algorithms will not tolerate missing values, you cannot simply overlook missing data. There
are a few options for handling it; none of them is ideal, but they can be taken into account, for
example, by dropping the affected observations or by filling in the missing values (see the methods listed further below).
As part of fundamental validation, you ought to be able to answer the following questions
once the data cleansing procedure is complete:
• Does the data abide by the rules that apply to its particular field?
• Does it support or refute your working theory? Does it offer any new information?
• Can you identify any trends in the data that support your next theory?
Inaccurate or noisy data can lead to false conclusions, which in turn inform poor company strategy
and decision-making. False conclusions can also result in an embarrassing situation in a reporting
meeting when you find out your data could not withstand further scrutiny.
Before you get there, it is crucial to establish a culture of quality data in your organization. To
achieve this, you should document the tools you might employ to develop this plan.
2. Fill in the missing value: This strategy is not always practical or efficient, and it can be
time-consuming. In this approach, one fills in the missing values. The most common way to
do this is manually, but other options include using the attribute mean or the most
probable value.
3. Binning method: This strategy is fairly easy to understand. The data is first sorted and
then divided into several equal-sized parts (bins). The values in each bin are then
smoothed using nearby values, for example by replacing them with the bin mean, the bin
median, or the bin boundaries.
4. Regression: The data is smoothed out by fitting it to a regression function. The regression
may be linear or multiple: a linear regression has only one independent variable, while a
multiple regression has more than one.
5. Clustering: This technique focuses on grouping. Clustering organizes comparable values
into a "group" or "cluster", and values that fall outside the clusters can then be identified
as outliers. (A short code sketch of methods 2-5 is given after this list.)
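The following is a minimal, illustrative sketch of how methods 2-5 might look with pandas, NumPy, and scikit-learn. The price values, the number of bins, and the number of clusters are all assumptions chosen only to keep the example small.

```python
# A minimal, illustrative sketch of methods 2-5 on a small made-up series.
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

prices = pd.Series([4, 8, np.nan, 15, 21, 21, 24, 25, 28, 34, 200.0])

# 2. Fill in the missing value, here with the attribute mean.
prices = prices.fillna(prices.mean())

# 3. Binning: sort the data, split it into (roughly) equal-sized bins, and
#    smooth each bin by replacing its values with the bin mean.
sorted_vals = prices.sort_values().reset_index(drop=True)
bins = np.array_split(sorted_vals, 4)
smoothed = pd.concat([pd.Series(b.mean(), index=b.index) for b in bins])

# 4. Regression: fit a simple linear trend and use the fitted line as the
#    smoothed version of the data.
x = np.arange(len(prices))
slope, intercept = np.polyfit(x, prices, deg=1)
regression_smoothed = slope * x + intercept

# 5. Clustering: group the values with k-means; a value left alone in its
#    own cluster is treated as a possible outlier.
X = prices.to_numpy().reshape(-1, 1)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
cluster_sizes = np.bincount(labels)
outliers = prices[cluster_sizes[labels] == 1]

print(smoothed.values)
print(regression_smoothed.round(1))
print("Possible outliers:", outliers.tolist())
```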
The data cleaning process for data mining is carried out in the following steps.
1. Monitoring the errors: Keep track of the areas where errors seem to occur most frequently.
This will make it simpler to identify and correct inaccurate or corrupt data. This is
particularly important when integrating a potential substitute with current management
software.
2. Standardize the mining process: To help lower the likelihood of duplication, standardize the
place of insertion.
3. Validate data accuracy: Analyze the data and invest in data cleaning
software. Artificial intelligence-based tools can be used to check the data thoroughly for accuracy.
5. Communicate with the team: Keeping the team informed will help with developing and
strengthening relationships with clients, as well as with sending more focused information
to prospective clients.
The following are some examples of how data cleaning is used in data mining:
• To ensure that the resulting data has the correct format, structure, and consistency
without any duplication at the destination, it is crucial to maintain the data's quality,
security, and consistency while it is in transit.
Data cleaning is a requirement for ensuring the correctness, integrity, and security of corporate data,
which may be of varying quality depending on its properties or attributes. Some commonly
used data cleaning tools in data mining are as follows:
1. OpenRefine
2. Trifacta Wrangler
3. Drake
4. Data Ladder
5. DataCleaner
6. Cloudingo
8. TIBCO Clarity
9. WinPure
• Clients are happier and employees are less annoyed when there are fewer mistakes.
• The capacity to map out the different functions of your data and what it is intended to do.
Data integration is the process of merging data from several disparate sources. While performing
data integration, you must deal with issues such as data redundancy, inconsistency, and duplication.
In data mining, data integration is a data preprocessing technique that merges data from
multiple heterogeneous data sources into a coherent data store and provides a unified
view of the data.
These sources may include several data cubes, databases, or flat files. The data
integration approach is formally stated as a triple (G, S, M), where G represents the global
schema, S represents the heterogeneous source schemas, and M represents the mapping between
the source and global schema queries.
Tight Coupling
It is the process of using ETL (Extraction, Transformation, and Loading) to combine data from
various sources into a single physical location.
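As a rough sketch of tight coupling, the snippet below extracts records from two assumed sources, transforms them into one schema, and loads them into a single warehouse table; the file name, table names, and connection strings are placeholders rather than real resources.

```python
# A minimal ETL sketch for tight coupling (all resource names are placeholders).
import pandas as pd
from sqlalchemy import create_engine

source_db = create_engine("sqlite:///shop_b.db")      # one heterogeneous source
warehouse = create_engine("sqlite:///warehouse.db")   # single physical target

# Extract: pull records from a CSV export and a source database.
shop_a = pd.read_csv("shop_a_sales.csv")              # assumed columns: date, item, amount
shop_b = pd.read_sql("SELECT sale_date, product, total FROM sales", source_db)

# Transform: bring both sources to one schema and drop obvious duplicates.
shop_b = shop_b.rename(columns={"sale_date": "date", "product": "item", "total": "amount"})
combined = pd.concat([shop_a, shop_b], ignore_index=True).drop_duplicates()

# Load: store the unified view in the warehouse.
combined.to_sql("unified_sales", warehouse, if_exists="replace", index=False)
```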
Loose Coupling
With loose coupling, the data is kept in the actual source databases. This approach
provides an interface that takes a query from the user, transforms it into a format the source
databases can understand, and then sends the query directly to the source databases to obtain
the result.
Tuple Duplication
In addition to redundancy, data integration must also handle duplicate tuples. Duplicate
tuples may appear in the resulting data if a denormalized table was used as a source
for data integration.
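A sketch of removing such duplicate tuples after integration; the column names and rows below are invented for illustration.

```python
# A minimal sketch: removing duplicate tuples after integration.
import pandas as pd

integrated = pd.DataFrame({
    "customer_id": [101, 101, 102, 103, 103],
    "city":        ["Mumbai", "Mumbai", "Delhi", "Chennai", "Chennai"],
    "amount":      [250, 250, 400, 150, 150],
})

# Keep only one copy of each fully identical tuple.
deduplicated = integrated.drop_duplicates()
print(deduplicated)
```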
Manual Integration
This method avoids using automation during data integration. The data analyst collects, cleans,
and integrates the data to produce meaningful information. This strategy is suitable for a mini
organization with a limited data set. Although, it will be time-consuming for the huge,
Middleware Integration
Middleware software is used to take data from multiple sources, normalize it, and store it in the
resulting data set. This technique is used when an enterprise needs to integrate data from legacy
systems into modern systems; the middleware acts as a translator between the legacy and
advanced systems. Think of it as an adapter that allows two systems with different interfaces to be
connected. It is only applicable to certain systems.
Application-based integration
This approach uses software applications to extract, transform, and load data from disparate sources.
It saves time and effort, but it is a little more complicated because building such an
application requires technical understanding.
There are several data transformation techniques that can help structure and clean up the data
before analysis or storage in a data warehouse. Let's study all techniques used for data
transformation, some of which we have already studied in data reduction and data cleaning.
1. Data Smoothing
Data smoothing is a process that is used to remove noise from the dataset using some
algorithms.
The concept behind data smoothing is that it can identify simple changes that help
predict different trends and patterns. This helps analysts and traders who need to
look at a lot of data, which can often be difficult to digest, to find patterns they
would not otherwise see.
We have seen how noise is removed from the data using techniques such as binning,
regression, and clustering.
2. Attribute Construction
In the attribute construction method, new attributes are constructed from the existing attributes
to ease data mining. The new attributes are created from the given attributes and applied to
assist the mining process. This simplifies the original data and makes the mining more
efficient.
For example, suppose we have a data set containing measurements of different plots, i.e.,
we may have the height and width of each plot. We can then construct a new attribute
'area' from the attributes 'height' and 'width'. This also helps in understanding the relations
among the attributes in a data set.
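A minimal sketch of this example, assuming a small plots table with 'height' and 'width' columns:

```python
# A minimal sketch of attribute construction: derive 'area' from the
# existing 'height' and 'width' attributes of each plot.
import pandas as pd

plots = pd.DataFrame({"height": [10, 12, 8], "width": [20, 15, 25]})
plots["area"] = plots["height"] * plots["width"]   # new, constructed attribute
print(plots)
```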
3. Data Aggregation
Data collection or aggregation is the method of storing and presenting data in a summary
format. The data may be obtained from multiple data sources and integrated into a summary
suitable for data analysis. This is a crucial step, since the accuracy of data analysis
insights is highly dependent on the quantity and quality of the data used.
Gathering accurate data of high quality and in large enough quantity is necessary to produce
relevant results. The aggregated data is useful for everything from decisions concerning the
financing or business strategy of a product to pricing, operations, and marketing strategies.
For example, we have a data set of sales reports of an enterprise that has quarterly sales of
each year. We can aggregate the data to get the enterprise's annual sales report.
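A minimal sketch of this aggregation, with made-up quarterly figures:

```python
# A minimal sketch of data aggregation: roll quarterly sales up to annual sales.
import pandas as pd

quarterly = pd.DataFrame({
    "year":    [2022, 2022, 2022, 2022, 2023, 2023, 2023, 2023],
    "quarter": ["Q1", "Q2", "Q3", "Q4", "Q1", "Q2", "Q3", "Q4"],
    "sales":   [120, 135, 150, 170, 160, 155, 180, 190],
})

annual = quarterly.groupby("year", as_index=False)["sales"].sum()
print(annual)
```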
4. Data Normalization
Normalizing the data refers to scaling the data values to a much smaller range such as [-1, 1] or
[0.0, 1.0]. There are different methods to normalize the data, as discussed below.
Consider that we have a numeric attribute A and we have n number of observed values for attribute
A that are V1, V2, V3, ….Vn.
o Min-max normalization: This method maps a value Vi of A to V'i in a new, smaller range
[new_minA, new_maxA]. The formula for min-max normalization is:
V'i = ((Vi - minA) / (maxA - minA)) * (new_maxA - new_minA) + new_minA
For example, suppose $12,000 and $98,000 are the minimum and maximum values for the
attribute income, and we have to map the value $73,600 into the range [0.0, 1.0].
Min-max normalization gives (73,600 - 12,000) / (98,000 - 12,000) * (1.0 - 0.0) + 0.0 = 0.716.
o Z-score normalization: Here the values of attribute A are normalized using the mean (Ᾱ)
and standard deviation (σA) of A:
V'i = (Vi - Ᾱ) / σA
For example, suppose the mean and standard deviation of attribute A are $54,000 and $16,000,
and we have to normalize the value $73,600 using z-score normalization. The result is
(73,600 - 54,000) / 16,000 = 1.225.
o Decimal Scaling: This method normalizes the value of attribute A by moving the decimal
point in the value. The movement of the decimal point depends on the maximum absolute
value of A. The formula for decimal scaling is:
V'i = Vi / 10^j
where j is the smallest integer such that max(|V'i|) < 1.
For example, the observed values for attribute A range from -986 to 917, and the maximum
absolute value for attribute A is 986. Here, to normalize each value of attribute
A using decimal scaling, we have to divide each value of attribute A by 1000, i.e., j=3. So,
the value -986 would be normalized to -0.986, and 917 would be normalized to 0.917. The
normalization parameters, such as the mean, standard deviation, and maximum absolute value,
must be preserved so that future data can be normalized uniformly.
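The three methods can be reproduced in a few lines; the sketch below simply recomputes the worked examples from the text.

```python
# A minimal sketch of min-max, z-score and decimal-scaling normalization,
# using the worked examples from the text.
import numpy as np

# Min-max normalization: income value 73,600 with min 12,000, max 98,000,
# mapped into the new range [0.0, 1.0].
v, min_a, max_a = 73_600, 12_000, 98_000
new_min, new_max = 0.0, 1.0
v_minmax = (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min
print(round(v_minmax, 3))          # 0.716

# Z-score normalization: mean 54,000 and standard deviation 16,000.
mean_a, std_a = 54_000, 16_000
v_zscore = (v - mean_a) / std_a
print(round(v_zscore, 3))          # 1.225

# Decimal scaling: maximum absolute value 986, so divide by 10**3.
values = np.array([-986, 917])
j = int(np.floor(np.log10(np.max(np.abs(values)))) + 1)   # smallest j with max(|v|)/10**j < 1
print(values / 10 ** j)            # [-0.986  0.917]
```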
5. Data Discretization
This is the process of converting continuous data into a set of data intervals. Continuous
attribute values are substituted by small interval labels, which makes the data easier to study
and analyze. If a data mining task handles a continuous attribute, its continuous values
can be replaced by these discrete interval labels, which improves the efficiency of the task.
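A minimal sketch of discretization, where the bin edges and interval labels are assumptions chosen for illustration:

```python
# A minimal sketch of discretization: replace continuous ages by interval labels.
import pandas as pd

ages = pd.Series([19, 23, 31, 38, 47, 55, 62])
age_groups = pd.cut(ages, bins=[0, 30, 50, 100],
                    labels=["young", "middle-aged", "senior"])
print(age_groups.tolist())
```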
6. Data Generalization
It converts low-level data attributes to high-level data attributes using concept hierarchy.
This conversion from a lower level to a higher conceptual level is useful to get a clearer
picture of the data. Data generalization can be divided into two approaches:
o Data cube process (OLAP) approach.
o Attribute-oriented induction (AOI) approach.
For example, age data in a dataset may take values in an interval such as (20, 30). These
values can be transformed to a higher conceptual level, such as the categorical values (young, old).
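As an additional sketch, a concept hierarchy can also lift city-level values to the state level; the small mapping below is ordinary geography used purely for illustration.

```python
# A minimal sketch of data generalization through a concept hierarchy:
# replace low-level city values with higher-level state values.
import pandas as pd

sales = pd.DataFrame({"city": ["Chennai", "Kolkata", "Mumbai", "Delhi"],
                      "amount": [150, 90, 200, 120]})

city_to_state = {"Chennai": "Tamil Nadu", "Kolkata": "West Bengal",
                 "Mumbai": "Maharashtra", "Delhi": "Delhi"}

sales["state"] = sales["city"].map(city_to_state)   # climb one level of the hierarchy
print(sales.groupby("state", as_index=False)["amount"].sum())
```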
The entire process of transforming data is known as ETL (Extract, Transform, and Load).
Through the ETL process, analysts can convert data to its desired format. Here are the steps
involved in the data transformation process:
1. Data Discovery: During the first stage, analysts work to understand and identify data in its
source format. To do this, they will use data profiling tools. This step helps analysts decide
what they need to do to get data into its desired format.
2. Data Mapping: During this phase, analysts perform data mapping to determine how
individual fields are modified, mapped, filtered, joined, and aggregated. Data mapping is
essential to many data processes, and one misstep can lead to incorrect analysis and ripple
through your entire organization.
3. Data Extraction: During this phase, analysts extract the data from its original source.
These may include structured sources such as databases or streaming sources such as
customer log files from web applications.
4. Code Generation and Execution: Once the data has been extracted, analysts need to create
code to complete the transformation. Often, analysts generate this code with the help of data
transformation platforms or tools.
5. Review: After transforming the data, analysts need to check it to ensure everything has
been formatted correctly.
6. Sending: The final step involves sending the data to its target destination. The target might
be a data warehouse or a database that handles both structured and unstructured data.
• Better Organization: Transformed data is easier for both humans and computers to use.
• Improved Data Quality: There are many risks and costs associated with bad data. Data
transformation can help your organization eliminate quality issues such as missing values
and other inconsistencies.
• Perform Faster Queries: You can quickly and easily retrieve transformed data because it
is stored and standardized in a single source location.
• Better Data Management: Businesses are constantly generating data from more and more
sources. If there are inconsistencies in the metadata, it can be challenging to organize and
understand it. Data transformation refines your metadata, so it's easier to organize and
understand.
• More Use Out of Data: While businesses may be collecting data constantly, a lot of that
data sits around unanalyzed. Transformation makes it easier to get the most out of your
data by standardizing it and making it more usable.
While data transformation comes with many benefits, there are still some challenges to
transforming data effectively, such as:
• Data transformation can be expensive. The cost is dependent on the specific
infrastructure, software, and tools used to process data. Expenses may include licensing,
computing resources, and hiring necessary personnel.
• Data transformation processes can be resource-intensive. Performing transformations
in an on-premises data warehouse after loading or transforming data before feeding it into
applications can create a computational burden that slows down other operations. If you
use a cloud-based data warehouse, you can do the transformations after loading because
the platform can scale up to meet demand.
• Lack of expertise and carelessness can introduce problems during transformation. Data
analysts without appropriate subject matter expertise are less likely to notice incorrect data
because they are less familiar with the range of accurate and permissible values.
• Enterprises can perform transformations that don't suit their needs. A business might
change information to a specific format for one application only to then revert the
information to its prior format for a different application.
• Data mining is applied to selected data in large databases. When data analysis
and mining are performed on a huge amount of data, processing takes a very long time,
making it impractical and infeasible.
• Data reduction is a process that reduces the volume of the original data and represents it
in a much smaller volume. Data reduction techniques are used to obtain a reduced
representation of the dataset that is much smaller in volume while maintaining the integrity
of the original data.