Reading Material Mod 4 Data Integration - Data Warehouse


Course Name: Business Analytics

Module 4: Data Integration - Data Warehouse

Data warehouse
The concept of the data warehouse (Figure 1.1) is a single system that acts as the repository of all
of the organization's data in a form that can be efficiently analysed, so
that meaningful reports can be prepared for management and other information workers.

However, meeting this goal involves some challenges:


 Data must be acquired from a variety of incompatible systems.
 The same piece of data might reside in the databases of different systems in different
forms. A given data item might not only be represented in different formats, but its
values might also differ from one database to another. Which value is the correct one?
 Data is continually changing. How often should the data warehouse be refreshed to
reflect a reasonably current view?
 The amount of data is massive. How can it be analysed and presented easily so that it is
useful?
To meet these needs, a broad range of powerful tools was developed over the years and
became productized. They included:
Extract, Transform, and Load (ETL) utilities for moving data from the diverse data
sources to the common data warehouse.
Data-mining engines for complex predefined analysis and ad hoc queries of the data
retained in the data warehouse.
Reporting tools to provide management with the results of the analysis in
easy-to-absorb formats.

Goals of a Data Warehouse

The prime goal of a data warehouse is to give users appropriate access to a homogenized
and comprehensive view of the organization. This in turn supports forecasting and
decision-making at the enterprise level.
Information accessibility: Data in a data warehouse must be easy to comprehend, both by
the business users and developers alike. It should be properly labelled to facilitate easy
access. The business users should be allowed to slice and dice the data in every possible way
(slicing and dicing refers to the separation and combination of data in infinite combinations).
Information credibility: The data in the data warehouse should be credible, complete, and
of desired quality.
Flexible to change: Business situations change, users’ requirements
change, technology changes, and tools to access data may also change. The data warehouse
must be adaptable to change. Addition of new data from disparate sources or new queries
against the data warehouse should not invalidate the existing information in the data
warehouse.
Support for more fact-based decision making: “Manage by fact” seems to be the
buzzword these days. The data warehouse should have enough pertinent data to support more
precise decision making. What is also required is that the business users should be able to
access the data easily.
Support for data security: The data warehouse maintains the company’s confidential
information. This information falling into the wrong hands would do more damage than having
no data warehouse at all. There should be mechanisms in place to enable the provision of
information in the required format only to those who are supposed to receive it.
Information consistency: Information consistency is about a single/consistent version of
truth. A data warehouse brings data from disparate data sources into a centralized repository.
Users from across the organization make use of the data warehouse to view a single and
consistent version of truth.

Extract, Transform and Load (ETL)


Early on, the one common interface provided between the disparate systems in an
organization was magnetic tape. Tape formats were standardized, and any system could
write tapes that could be read by other systems. Thus, the first data warehouses were fed
by magnetic tapes prepared by the various systems within the organization. However, that left
the difficulty of data disparity: the data written by the different systems reflected their native
data organizations.
Even more important was that the data warehouse’s database was designed to support the
analytical functions needed for the business intelligence function. This database design was
typically a highly structured database with complex indices to support Online Analytical
Processing (OLAP). Databases configured for OLAP allowed complex analytical and ad hoc
queries with rapid execution time. The data fed to the data warehouse from the enterprise
systems therefore had to be converted to a format meaningful to the data warehouse.
To ease the difficulty of initially loading this data into a data warehouse, keeping it
updated, and resolving discrepancies, Extract, Transform and Load (ETL) utilities were
developed. As their name suggests, these utilities extract data from source databases,
transform it into the common data warehouse format, and load it into the
data warehouse, as shown in the diagram below.

The transform function is the key to the success of this approach. Its job is to apply a
series of rules to the extracted data so that it is properly formatted for loading into the data
warehouse. Examples of transformation rules include:
 The selection of data to load.
 The translation of encoded items (for example, 1 for male, 2 for female to M, F).
 Deriving new calculated values (sale price = price - discount).
 Merging data from multiple sources.
 Summarizing (aggregating) certain rows and columns.
 Splitting a column into multiple columns.
 Resolving discrepancies between similar data items.
 Validating the data.
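
As an illustration, the hedged sketch below expresses a few of these rules (selection, code translation, a derived value, and summarization) as a single SQL load step. The table and column names (src_sales, stg_sales, gender_code, etc.) are hypothetical and used only for illustration.

-- Hypothetical transform step: reads raw extracted rows from src_sales
-- and loads a cleaned, summarized version into the staging table stg_sales.
INSERT INTO stg_sales (sale_date, store_id, gender, net_sales, units_sold)
SELECT
    s.sale_date,
    s.store_id,
    -- Translate encoded items: 1 -> 'M', 2 -> 'F'
    CASE s.gender_code WHEN 1 THEN 'M' WHEN 2 THEN 'F' ELSE 'U' END,
    -- Derive a new calculated value: sale price = price - discount
    SUM(s.price - COALESCE(s.discount, 0)),
    -- Summarize (aggregate) detail rows up to one row per day/store/gender
    COUNT(*)
FROM src_sales s
-- Select only the data to load (e.g., completed transactions)
WHERE s.status = 'COMPLETED'
GROUP BY s.sale_date, s.store_id,
         CASE s.gender_code WHEN 1 THEN 'M' WHEN 2 THEN 'F' ELSE 'U' END;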

Figure: Representation of the ETL model


The ETL function enables the consolidation of multiple data sources into a well-structured
database for use in complex analysis. The ETL process is performed periodically, such as
daily, weekly, or monthly, depending upon enterprise needs. This method is called offline
ETL because the target database is not continuously updated; it is refreshed on a periodic batch
basis. Though offline ETL serves its purpose well, it has some drawbacks:
The data in the data warehouse is not fresh; it could be weeks old. Though useful for
strategic functions, it is not especially suited to tactical use.
The source database typically must be temporarily inactive during the extract.
Otherwise, the target database is left in an inconsistent state following the load. As a result,
the applications must be shut down, often for hours.
To support real-time business intelligence, the ETL function must be
continuous and non-invasive, which is called online ETL and is described later. In contrast to
offline ETL, which supplies data that is not fresh but gives stable answers to queries, online
ETL supplies current but changing answers to successive queries, since the data it is using
is constantly being updated to reflect the current state of the business.

Data Integration

Process of coherent merging of data from various data sources and presenting a
cohesive/consolidated view to the user

 Involves combining data residing at different sources and providing users with a
unified view of the data.
 Significant in a variety of situations, both:
 Commercial (e.g., two similar companies trying to merge their databases)
 Scientific (e.g., combining research results from different bioinformatics
research repositories)

Challenges in Data Integration


• Development challenges
 Translation of relational database to object-oriented applications
 Consistent and inconsistent metadata
 Handling redundant and missing data
 Normalization of data from different sources
• Technological challenges
 Various formats of data
 Structured and unstructured data
 Huge volumes of data
• Organizational challenges
 Unavailability of data
 Manual integration risk, failure

Technologies in Data integration

Integration is divided into two main approaches:

 Schema integration – reconciles schema elements. Multiple data sources may provide
data on the same entity type. The main goal is to allow applications to transparently
view and query this data as one uniform data source, and this is done using various
mapping rules to handle structural differences.
 Instance integration – matches tuples and attribute values. Data integration from
multiple heterogeneous data sources has become a high-priority task in many large
enterprises. Hence, to obtain accurate semantic information on the data content, the
information is retrieved directly from the data. Instance integration identifies and integrates all
the instances of data items that represent the same real-world entity, as distinct from
schema integration. Entity Identification (EI) and attribute-value conflict
resolution (AVCR) comprise the instance-integration task. When common key
attributes are not available across different data sources, the rules for EI and the rules
for AVCR are expressed as combinations of constraints on their attribute values.
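
As a minimal, hedged sketch of entity identification when no common key exists, the query below pairs records from two hypothetical source tables (crm_customers and billing_clients) using constraints on attribute values (normalized name, date of birth, postal code). Real EI rules are usually richer, for example using fuzzy matching or weighted constraints.

-- Hypothetical entity identification: pair up records from two sources
-- that likely describe the same real-world customer, using attribute
-- constraints instead of a shared key.
SELECT a.customer_id  AS crm_id,
       b.client_no    AS billing_id,
       a.full_name,
       b.name         AS billing_name
FROM crm_customers a
JOIN billing_clients b
  ON LOWER(TRIM(a.full_name)) = LOWER(TRIM(b.name))   -- name constraint
 AND a.date_of_birth = b.birth_date                   -- supporting constraint
 AND a.postal_code   = b.zip;                         -- supporting constraint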

• Electronic Data Interchange (EDI) :


– It refers to the structured transmission of data between organizations by electronic means.
It is used to transfer electronic documents from one computer system to another, i.e., from one
trading partner to another.

– It is more than mere E-mail; for instance, organizations might replace bills of lading and
even checks with appropriate EDI messages.

• Object Brokering/Object Request Broker (ORB):

– An ORB is a piece of middleware software that allows programmers to make program
calls from one computer to another via a network.

– It handles the transformation of in-process data structures to and from the byte sequence.

• The technologies that are used for data integration include:

 Data interchange
 Object Brokering
 Modeling techniques

• Entity-Relationship Modeling

• Dimensional Modeling

Need for Data Integration

It is done to provide data in a specific view, as requested by users, applications, etc.

The bigger the organization gets, the more data there is and the more data needs integration.

The need for integration increases with the need for data sharing.

Advantages of Using Data Integration

Of benefit to decision-makers, who have access to important information from past studies

Reduces cost, overlaps and redundancies; reduces exposure to risks

Helps to monitor key variables like trends and consumer behaviour, etc.

Common Approaches to Data integration

There are currently various methods for performing data integration.

• The most popular ones are:

 Federated databases
 Memory-mapped data structure
 Data warehousing

Data Integration Approaches

 Federated database (virtual database):


 Type of meta-database management system which transparently integrates multiple
autonomous databases into a single federated database
 The constituent databases are interconnected via a computer network and may be geographically
decentralized.
 A federated database is the fully integrated, logical composite of all constituent
databases in a federated database management system.

 Memory-mapped data structure:


 Useful when in-memory data manipulation is needed and the data structure is large.
It is mainly used on the .NET platform and is typically implemented with C# or
VB.NET.
 It is a much faster way of accessing the data than using a MemoryStream.

Data Warehousing

The various primary concepts used in data warehousing would be:

 ETL (Extract Transform Load)


 Component-based (Data Mart)
 Dimensional Models and Schemas
 Metadata driven

 Data profiling and data quality maintenance


Definition – What Does Data Integration Mean?

Data integration is a process in which heterogeneous data is retrieved and combined in an
incorporated form and structure. Data integration allows different data types (such as data
sets, documents and tables) to be merged by users, organizations and applications, for use in
personal or business processes and/or functions.

Data integration involves combining data from several disparate sources, which are stored
using various technologies, to provide a unified view of the data. Data integration becomes
increasingly important in cases of merging the systems of two companies or consolidating
applications within one company to provide a unified view of the company's data assets. The
latter initiative is often called a data warehouse.

Probably the best-known implementation of data integration is building an enterprise's
data warehouse. The benefit of a data warehouse is that it enables a business to perform analyses
based on the data in the data warehouse that would not be possible on the data
available only in the source systems. The reason is that the source systems may not contain
corresponding data; even when data items are identically named, they may refer to different
entities.

Data Integration Areas

Data integration is a term covering several distinct sub-areas such as:

Data warehousing

Data migration

Enterprise application/information integration

Master data management

Challenges of Data Integration

At first glance, the biggest challenge is the technical implementation of integrating data from
disparate, often incompatible, sources. However, a much bigger challenge lies in the entirety
of data integration. It has to include the following phases:

Design

The data integration initiative within a company must be an initiative of the business, not of IT.
There should be a champion who understands the data assets of the enterprise and will be
able to lead the discussion about the long-term data integration initiative in order to make it
consistent, successful and beneficial.

Analysis of the requirements (BRS), i.e. why is the data integration being done, what are the
objectives and deliverables. From what systems will the data be sourced? Is all the data
available to fulfill the requirements? What are the business rules? What is the support model
and SLA?

Analysis of the source systems, i.e. what are the options of extracting the data from the
systems (update notification, incremental extracts, full extracts), what is the
required/available frequency of the extracts? What is the quality of the data? Are the required
data fields populated properly and consistently? Is the documentation available? What are the
data volumes being processed? Who is the system owner?

Any other non-functional requirements such as data processing window, system response
time, estimated number of (concurrent) users, data security policy, backup policy.

What is the support model for the new system?

What are the SLA requirements? And last but not least, who will be the owner of the system
and what is the funding of the maintenance and upgrade expenses?

The results of the above steps need to be documented in the form of an SRS document, confirmed
and signed off by all parties that will be participating in the data integration project.

Implementation

Based on the BRS and SRS, a feasibility study should be performed to select the tools to
implement the data integration system. Small companies and enterprises which are starting
with data warehousing are faced with making a decision about the set of tools they will need
to implement the solution. Larger enterprises, or enterprises that have already started
other data integration projects, are in an easier position, as they already have experience and
can extend the existing system and exploit the existing knowledge to implement the system
more effectively. There are cases, however, when using a new, better suited platform or
technology makes a system more effective compared to staying with existing company
standards. For example, finding a more suitable tool which provides better scaling for future
growth/expansion, a solution that lowers the implementation/support cost, lowering the
license costs, migrating the system to a new/modern platform, etc.

Testing

Along with the implementation, the proper testing is a must to ensure that the unified data are
correct, complete and up-to-date.

Both technical IT and business users need to participate in the testing to ensure that the results are
as expected/required. Therefore, the testing should incorporate at least Performance Stress
Testing (PST), Technical Acceptance Testing (TAT) and User Acceptance Testing (UAT).

Data Integration Techniques

There are several organizational levels on which the integration can be performed. As we go
down the list, the level of automation increases.
Manual Integration or Common User Interface - users work with all the relevant
information by accessing all the source systems directly or through a web page interface. No
unified view of the data exists.

Application Based Integration - requires the particular applications to implement all the
integration efforts. This approach is manageable only for a very limited number of
applications.

Middleware Data Integration - transfers the integration logic from particular applications to a
new middleware layer. Although the integration logic is not implemented in the applications
anymore, there is still a need for the applications to partially participate in the data
integration.

Uniform Data Access or Virtual Integration - leaves data in the source systems and defines a
set of views to provide a unified view to the consumer across the whole enterprise.
For example, when a user accesses customer information, the particular details of the
customer are transparently acquired from the respective system. The main benefits of
virtual integration are nearly zero latency in propagating data updates from the source
system to the consolidated view, and no need for a separate store for the consolidated data.
However, the drawbacks include limited support for data history and version
management, the limitation that the method applies only to 'similar' data sources (e.g., the same type of
database), and the fact that access to the user data generates extra load on the source
systems, which may not have been designed to accommodate it.
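
A minimal sketch of uniform data access, assuming two hypothetical source tables (crm_customers and web_customers) that stay in their source systems: a view presents one unified customer list without copying any data.

-- Hypothetical unified view over two source systems; data stays at the source,
-- so changes are visible with near-zero latency, but every query adds load
-- to the underlying systems.
CREATE VIEW unified_customer AS
SELECT customer_id, full_name, email, 'CRM' AS source_system
FROM crm_customers
UNION ALL
SELECT web_user_id, display_name, email_address, 'WEB' AS source_system
FROM web_customers;

-- Consumers query the consolidated view as if it were one table.
SELECT source_system, COUNT(*) AS customers
FROM unified_customer
GROUP BY source_system;

Because the view is evaluated at query time, updates in the sources are visible almost immediately, but every query against unified_customer adds load to the underlying systems, which matches the trade-off described above.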

Common Data Storage or Physical Data Integration - usually means creating a new system
which keeps a copy of the data from the source systems to store and manage it independently
of the original system. The best-known example of this approach is the Data
Warehouse (DW). The benefits comprise data version management and the ability to combine data from
very different sources (mainframes, databases, flat files, etc.). Physical integration,
however, requires a separate system to handle the vast volumes of data.


Approaches to Integration

In this section, we apply an architectural perspective to give an overview of the different
ways to address the integration problem. The presented classification is based on [12] and
distinguishes integration approaches according to the level of abstraction where integration is
performed. Information systems can be described using a layered architecture, as shown in
Figure 1.1: On the topmost layer, users access data and services through various interfaces
that run on top of different applications. Applications may use middleware, such as transaction
processing (TP) monitors, message-oriented middleware (MOM), and SQL middleware, to
access data via a data access layer. The data itself is managed by a data storage system.
Usually, database management systems (DBMS) are used to combine the data access and
storage layers. In general, the integration problem can be addressed on each of the presented
system layers.

Manual Integration

Common User Interface

Integration by Applications

Integration by middleware

Uniform Data Access

Common Data Storage

Data Integration tools

 Alteryx

 Analytics Canvas

 Cloud Elements API Integration

 DataWatch Denodo Platform

 elastic.io Integration Platform

 HiperFabric
 Lavastorm

 Informatica Platform (www.informatica.com)

 Oracle Data Integration Services

 ParseKit (enigma.io)

 Paxata

 RapidMiner Studio

 Red Hat JBoss Data Virtualization. Community project: teiid.

 Azure Data Factory (ADF)

 SQL Server Integration Services (SSIS)

 TMMData

Data quality
Data quality refers to the level of quality of data. There are many definitions of data quality
but data are generally considered high quality if "they are fit for their intended uses in
operations, decision making and planning." (Tom Redman). Alternatively, data is deemed of
high quality if it correctly represents the real-world construct to which it refers. Furthermore,
apart from these definitions, as data volume increases, the question of internal consistency
within data becomes significant, regardless of fitness for use for any particular external
purpose. People's views on data quality can often be in disagreement, even when discussing
the same set of data used for the same purpose.

The market is going some way towards providing data quality assurance. A number of vendors
make tools for analyzing and repairing poor quality data in situ, service providers can clean
the data on a contract basis, and consultants can advise on fixing processes or systems to
avoid data quality problems in the first place. Most data quality tools offer a series of tools
for improving data, which may include some or all of the following:

1. Data profiling - initially assessing the data to understand its quality challenges

2. Data standardization - a business rules engine that ensures that data conforms to quality rules

3. Geocoding - for name and address data. Corrects data to U.S. and worldwide postal
standards

4. Matching or Linking - a way to compare data so that similar, but slightly different records
can be aligned. Matching may use "fuzzy logic" to find duplicates in the data. It often
recognizes that "Bob" and "Robert" may be the same individual. It might be able to manage
"householding", or finding links between spouses at the same address, for example. Finally, it
often can build a "best of breed" record, taking the best components from multiple data
sources and building a single super-record.

5. Monitoring - keeping track of data quality over time and reporting variations in the quality
of data. Software can also auto-correct the variations based on pre-defined business rules.

6. Batch and Real time - Once the data is initially cleansed (batch), companies often want to
build the processes into enterprise applications to keep it clean.
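
As a minimal sketch of the matching step, the query below finds candidate duplicates in a hypothetical contacts table by grouping records on standardized (lower-cased, trimmed) name and address values. Real data quality tools add fuzzy logic, for example treating "Bob" and "Robert" as equivalent, which goes beyond plain SQL.

-- Hypothetical duplicate detection: group contacts on standardized
-- name + address and report groups that contain more than one record.
SELECT LOWER(TRIM(first_name)) AS std_first,
       LOWER(TRIM(last_name))  AS std_last,
       LOWER(TRIM(address))    AS std_address,
       COUNT(*)                AS record_count
FROM contacts
GROUP BY LOWER(TRIM(first_name)),
         LOWER(TRIM(last_name)),
         LOWER(TRIM(address))
HAVING COUNT(*) > 1;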

There are several well-known authors and self-styled experts, with Larry English perhaps the
most popular guru. In addition, IQ International - the International Association for
Information and Data Quality was established in 2004 to provide a focal point for
professionals and researchers in this field. ISO 8000 is an international standard for data
quality.

Definitions

This list is taken from the online book "Data Quality: High-impact Strategies". See also the
glossary of data quality terms.

 Degree of excellence exhibited by the data in relation to the portrayal of the actual
scenario
 The state of completeness, validity, consistency, timeliness and accuracy that makes
data appropriate for a specific use

Data quality is a perception or an assessment of data's fitness to serve its purpose in a given
context. It is described by several dimensions such as:

• Correctness / Accuracy: Accuracy of data is the degree to which the captured data correctly
describes the real-world entity.

• Consistency: This is about the single version of truth. Consistency means data throughout
the enterprise should be in sync with each other.

• Completeness: It is the extent to which the expected attributes of data are provided.

• Timeliness: Right data to the right person at the right time is important for business.

• Metadata: Data about data.


1. Maintain complete data.

2. Clean up your data by standardizing it using rules.

3. Use matching algorithms to detect duplicates, e.g., "ICS" and "Informatics Computer System" may refer to the same record.

4. Avoid entry of duplicate leads and contacts.

5. Merge existing duplicate records.

6. Use roles for security.

What is data profiling?

It is the process of statistically examining and analyzing the content in a data source, and
hence collecting information about the data. It consists of techniques used to analyze the data
we have for accuracy and completeness.

1. Data profiling helps us make a thorough assessment of data quality.

2. It assists the discovery of anomalies in data.

3. It helps us understand content, structure, relationships, etc. about the data in the data source
we are analyzing.

4. It helps us know whether the existing data can be applied to other areas or purposes.

5. It helps us understand the various issues/challenges we may face in a database project


much before the actual work begins. This enables us to make early decisions and act
accordingly.

6. It is also used to assess and validate metadata

Data profiling is the process of examining the data available in an existing information data
source (e.g. a database or a file) and collecting statistics or small but informative summaries
about that data.

The purpose of these statistics may be to:

1. Find out whether existing data can easily be used for other purposes

2. Improve the ability to search the data by tagging it with keywords, descriptions, or
assigning it to a category

3. Give metrics on data quality including whether the data conforms to particular standards or
patterns
4. Assess the risk involved in integrating data for new applications, including the challenges
of joins

5. Discover metadata of the source database, including value patterns and
distributions, key candidates, foreign-key candidates, and functional dependencies

6. Assess whether known metadata accurately describes the actual values in the source
database

7. Understanding data challenges early in any data intensive project, so that late project
surprises are avoided. Finding data problems late in the project can lead to delays and cost
overruns.

8. Have an enterprise view of all data, for uses such as master data management where key
data is needed, or data governance for improving data quality.

How to conduct Data Profiling?

Data profiling involves statistical analysis of the data at source and the data being loaded, as
well as analysis of metadata. These statistics may be used for various analysis purposes.
Common examples of analyses to be done are:

Data quality: Analyze the quality of data at the data source

NULL values: Look out for the number of NULL values in an attribute.

Candidate keys: Analysis of the extent to which certain columns are distinct will give the
developer useful information with respect to the selection of candidate keys.

Primary key selection: To check that the candidate key column does not violate the basic
requirements of having no NULL values or duplicate values.

Empty string values: A string column may contain NULL or even empty string values that
may create problems later.

String length: An analysis of the largest and shortest possible lengths, as well as the average string
length, of a string-type column can help us decide what data type would be most suitable for
the said column.

Identification of cardinality: The cardinality relationships are important for inner and outer
join considerations with regard to several BI tools.

Data format: Sometimes, the format in which certain data is written in some columns may or
may not be user-friendly.
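
The checks above map directly onto simple SQL probes. The hedged sketch below, against a hypothetical source table src_customer, covers NULL values, empty strings, candidate-key distinctness, and string lengths; the column names are illustrative only.

-- Hypothetical profiling probes against a source table src_customer.

-- NULL values and empty strings in an attribute
SELECT COUNT(*)                                          AS total_rows,
       SUM(CASE WHEN email IS NULL THEN 1 ELSE 0 END)    AS null_emails,
       SUM(CASE WHEN email = ''   THEN 1 ELSE 0 END)     AS empty_emails
FROM src_customer;

-- Candidate key / primary key check: distinct count should equal row count
SELECT COUNT(*) AS total_rows, COUNT(DISTINCT customer_no) AS distinct_keys
FROM src_customer;

-- String length analysis to help choose a suitable data type
SELECT MIN(LENGTH(full_name)) AS min_len,
       MAX(LENGTH(full_name)) AS max_len,
       AVG(LENGTH(full_name)) AS avg_len
FROM src_customer;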

Common Data Profiling Software


Most data-integration/analysis software has data profiling built in.
Alternatively, various independent data profiling tools are also available. Some popular ones
are:
• Trillium Enterprise Data quality
• Datiris Profiler
• Talend Data Profiler
• IBM Infosphere Information Analyzer
• SSIS Data Profiling Task
• Oracle Warehouse Builder

Benefits

The benefits of data profiling are to improve data quality, shorten the implementation cycle
of major projects, and improve understanding of data for users. Discovering business
knowledge embedded in data itself is one of the significant benefits derived from data
profiling. Data profiling is one of the most effective technologies for improving data accuracy
in corporate databases.

Although data profiling is effective and useful for each sector of our daily life, it can be
challenging not to slip into "analysis paralysis".

Data modelling
Data modeling is the process of creating a visual representation of either a whole information
system or parts of it to communicate connections between data points and structures. The
goal is to illustrate the types of data used and stored within the system, the relationships
among these data types, the ways the data can be grouped and organized and its formats and
attributes.

Data models are built around business needs. Rules and requirements are defined upfront
through feedback from business stakeholders so they can be incorporated into the design of a
new system or adapted in the iteration of an existing one.

Data can be modeled at various levels of abstraction. The process begins by collecting
information about business requirements from stakeholders and end users. These business
rules are then translated into data structures to formulate a concrete database design. A data
model can be compared to a roadmap, an architect’s blueprint or any formal diagram that
facilitates a deeper understanding of what is being designed.

Data modeling employs standardized schemas and formal techniques. This provides a
common, consistent, and predictable way of defining and managing data resources across an
organization, or even beyond.

Ideally, data models are living documents that evolve along with changing business needs.
They play an important role in supporting business processes and planning IT architecture
and strategy. Data models can be shared with vendors, partners, and/or industry peers.

Types of data models


Like any design process, database and information system design begins at a high level of
abstraction and becomes increasingly more concrete and specific. Data models can generally
be divided into three categories, which vary according to their degree of abstraction. The
process will start with a conceptual model, progress to a logical model and conclude with a
physical model. Each type of data model is discussed in more detail below:

 Conceptual data models. They are also referred to as domain models and offer a
big-picture view of what the system will contain, how it will be organized, and
which business rules are involved. Conceptual models are usually created as part of
the process of gathering initial project requirements. Typically, they include entity
classes (defining the types of things that are important for the business to represent
in the data model), their characteristics and constraints, the relationships between
them and relevant security and data integrity requirements. Any notation is
typically simple.

 Logical data models. They are less abstract and provide greater detail about the
concepts and relationships in the domain under consideration. One of several
formal data modeling notation systems is followed. These indicate data attributes,
such as data types and their corresponding lengths, and show the relationships
among entities. Logical data models don’t specify any technical system
requirements. This stage is frequently omitted in agile or DevOps practices.
Logical data models can be useful in highly procedural implementation
environments, or for projects that are data-oriented by nature, such as data
warehouse design or reporting system development.
 Physical data models.

They provide a schema for how the data will be physically stored within a database. As
such, they’re the least abstract of all. They offer a finalized design that can be
implemented as a relational database, including associative tables that illustrate the
relationships among entities as well as the primary keys and foreign keys that will be used
to maintain those relationships. Physical data models can include database management
system (DBMS)-specific properties, including performance tuning.

Data modeling process

As a discipline, data modeling invites stakeholders to evaluate data processing and storage in
painstaking detail. Data modeling techniques have different conventions that dictate which
symbols are used to represent the data, how models are laid out, and how business
requirements are conveyed. All approaches provide formalized workflows that include a
sequence of tasks to be performed in an iterative manner. Those workflows generally look
like this:

1. Identify the entities. The process of data modeling begins with the identification
of the things, events or concepts that are represented in the data set that is to be
modeled. Each entity should be cohesive and logically discrete from all others.
2. Identify key properties of each entity. Each entity type can be differentiated from
all others because it has one or more unique properties, called attributes. For
instance, an entity called “customer” might possess such attributes as a first name,
last name, telephone number and salutation, while an entity called “address” might
include a street name and number, a city, state, country and zip code.
3. Identify relationships among entities. The earliest draft of a data model will
specify the nature of the relationships each entity has with the others. In the above
example, each customer “lives at” an address. If that model were expanded to
include an entity called “orders,” each order would be shipped to and billed to an
address as well. These relationships are usually documented via unified modeling
language (UML).
4. Map attributes to entities completely. This will ensure the model reflects how the
business will use the data. Several formal data modeling patterns are in widespread
use. Object-oriented developers often apply analysis patterns or design patterns,
while stakeholders from other business domains may turn to other patterns.
5. Assign keys as needed, and decide on a degree of normalization that balances
the need to reduce redundancy with performance requirements. Normalization
is a technique for organizing data models (and the databases they represent) in
which numerical identifiers, called keys, are assigned to groups of data to represent
relationships between them without repeating the data. For instance, if customers
are each assigned a key, that key can be linked to both their address and their order
history without having to repeat this information in the table of customer names.
Normalization tends to reduce the amount of storage space a database will require,
but it can come at a cost to query performance (a minimal SQL sketch of keyed customer, address, and order tables follows this list).
6. Finalize and validate the data model. Data modeling is an iterative process that
should be repeated and refined as business needs change.
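
A minimal sketch of step 5, assuming hypothetical customer, address and orders tables: each customer receives a key, and the address and order tables reference that key instead of repeating the customer's details.

-- Hypothetical normalized structures: the customer key links address and
-- order history without repeating customer names in those tables.
CREATE TABLE customer (
    customer_key  INTEGER PRIMARY KEY,
    first_name    VARCHAR(50),
    last_name     VARCHAR(50),
    phone         VARCHAR(20)
);

CREATE TABLE address (
    address_key   INTEGER PRIMARY KEY,
    customer_key  INTEGER REFERENCES customer(customer_key),
    street        VARCHAR(100),
    city          VARCHAR(50),
    state         VARCHAR(50),
    zip_code      VARCHAR(10)
);

CREATE TABLE orders (
    order_key     INTEGER PRIMARY KEY,
    customer_key  INTEGER REFERENCES customer(customer_key),
    order_date    DATE,
    order_total   DECIMAL(10,2)
);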

Types of data modelling (techniques)

Data modeling has evolved alongside database management systems, with model types
increasing in complexity as businesses' data storage needs have grown. Here are several
model types:

 Hierarchical data models represent one-to-many relationships in a treelike format.


In this type of model, each record has a single root or parent which maps to one or
more child tables. This model was implemented in the IBM Information
Management System (IMS), which was introduced in 1966 and rapidly found
widespread use, especially in banking. Though this approach is less efficient than
more recently developed database models, it’s still used in Extensible Markup
Language (XML) systems and geographic information systems (GISs).
 Relational data models were initially proposed by IBM researcher E.F. Codd in
1970. They are still implemented today in the many different relational databases
commonly used in enterprise computing. Relational data modeling doesn’t require a
detailed understanding of the physical properties of the data storage being used. In it, data
segments are explicitly joined through the use of tables, reducing database complexity.

Relational databases frequently employ structured query language (SQL) for data
management. These databases work well for maintaining data integrity and minimizing
redundancy. They’re often used in point-of-sale systems, as well as for other types of
transaction processing.

 Entity-relationship (ER) data models use formal diagrams to represent the


relationships between entities in a database. Several ER modeling tools are used by
data architects to create visual maps that convey database design objectives.
 Object-oriented data models gained traction as object-oriented programming
became popular in the mid-1990s. The “objects” involved are abstractions of
real-world entities. Objects are grouped in class hierarchies, and have associated
features. Object-oriented databases can incorporate tables, but can also support
more complex data relationships. This approach is employed in multimedia and
hypertext databases as well as other use cases.
 Dimensional data models were developed by Ralph Kimball, and they were
designed to optimize data retrieval speeds for analytic purposes in a data
warehouse. While relational and ER models emphasize efficient storage,
dimensional models increase redundancy in order to make it easier to locate
information for reporting and retrieval. This modeling is typically used
across OLAP systems.

Two popular dimensional data models are the star schema and the snowflake schema. In the
star schema, data is organized into facts (measurable items) and dimensions (reference
information), and each fact is surrounded by its associated dimensions in a star-like pattern.
The snowflake schema resembles the star schema but includes additional layers of associated
dimensions, making the branching pattern more complex.

Benefits of data modeling

Data modeling makes it easier for developers, data architects, business analysts, and other
stakeholders to view and understand relationships among the data in a database or data
warehouse. In addition, it can:

 Reduce errors in software and database development.


 Increase consistency in documentation and system design across the enterprise.
 Improve application and database performance.
 Ease data mapping throughout the organization.
 Improve communication between developers and business intelligence teams.
 Ease and speed the process of database design at the conceptual, logical and
physical levels.
Data modeling tools

Numerous commercial and open source computer-aided software engineering (CASE)


solutions are widely used today, including multiple data modeling, diagramming and
visualization tools. Here are several examples:

 erwin Data Modeler is a data modeling tool based on the Integration DEFinition
for information modeling (IDEF1X) data modeling language that now supports
other notation methodologies, including a dimensional approach.
 Enterprise Architect is a visual modeling and design tool that supports the
modeling of enterprise information systems and architectures as well as software
applications and databases. It’s based on object-oriented languages and standards.
 ER/Studio is database design software that’s compatible with several of today’s
most popular database management systems. It supports both relational and
dimensional data modeling.
 Free data modeling tools include open source solutions such as Open
ModelSphere.

Fact Table

In data warehousing, a fact table consists of the measurements, metrics or facts of a business
process. It is located at the center of a star schema or a snowflake schema surrounded
by dimension tables. Where multiple fact tables are used, these are arranged as a fact
constellation schema. A fact table typically has two types of columns: those that contain facts
and those that are a foreign key to dimension tables. The primary key of a fact table is usually
a composite key that is made up of all of its foreign keys. Fact tables contain the content of
the data warehouse and store different types of measures, such as additive, non-additive, and
semi-additive measures.
Fact tables provide the (usually) additive values that act as independent variables by which
dimensional attributes are analyzed. Fact tables are often defined by their grain. The grain of
a fact table represents the most atomic level by which the facts may be defined. The grain of
a sales fact table might be stated as "sales volume by day by product by store". Each record in
this fact table is therefore uniquely defined by a day, product and store. Other dimensions
might be members of this fact table (such as location/region) but these add nothing to the
uniqueness of the fact records. These "affiliate dimensions" allow for additional slices of the
independent facts but generally provide insights at a higher level of aggregation (a region
contains many stores).

Example of a fact table


In the schema below, we have a fact table FACT_SALES that has a grain that gives us the
number of units sold by date, by store, and product.

All other tables such as DIM_DATE, DIM_STORE and DIM_PRODUCT are dimensions
tables. This schema is known as the star schema.
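
A hedged SQL sketch of such a star schema, with illustrative column names: FACT_SALES holds the measure (units_sold) at the grain of date, store and product, and its primary key is the composite of its three foreign keys.

-- Hypothetical star schema around FACT_SALES; column names are illustrative.
CREATE TABLE DIM_DATE (
    date_key     INTEGER PRIMARY KEY,
    full_date    DATE,
    month_name   VARCHAR(10),
    year_no      INTEGER
);

CREATE TABLE DIM_STORE (
    store_key    INTEGER PRIMARY KEY,
    store_name   VARCHAR(50),
    region       VARCHAR(50)
);

CREATE TABLE DIM_PRODUCT (
    product_key  INTEGER PRIMARY KEY,
    product_name VARCHAR(50),
    category     VARCHAR(50)
);

CREATE TABLE FACT_SALES (
    date_key     INTEGER REFERENCES DIM_DATE(date_key),
    store_key    INTEGER REFERENCES DIM_STORE(store_key),
    product_key  INTEGER REFERENCES DIM_PRODUCT(product_key),
    units_sold   INTEGER,                           -- additive measure
    PRIMARY KEY (date_key, store_key, product_key)  -- composite of foreign keys
);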

Measure types

 Additive - measures that can be added across any dimension.


 Non-additive - measures that cannot be added across any dimension.
 Semi-additive - measures that can be added across some dimensions.
A fact table might contain either detail level facts or facts that have been aggregated (fact
tables that contain aggregated facts are often instead called summary tables).
Special care must be taken when handling ratios and percentages. One good design rule[1] is to
never store percentages or ratios in fact tables but only calculate these in the data access tool.
Thus, only the numerator and denominator are stored in the fact table; these can be
aggregated, and the aggregated stored values can then be used for calculating the ratio or
percentage in the data access tool.
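
A small sketch of this rule, using hypothetical column names: the fact table stores the numerator (discount_amount) and denominator (gross_amount), and the percentage is computed only at query time, after aggregation.

-- Hypothetical fact table storing numerator and denominator, never the ratio.
CREATE TABLE fact_order_line (
    date_key        INTEGER,
    product_key     INTEGER,
    discount_amount DECIMAL(12,2),   -- numerator
    gross_amount    DECIMAL(12,2)    -- denominator
);

-- The ratio is calculated in the data access tool / query, after aggregation.
SELECT product_key,
       SUM(discount_amount) / NULLIF(SUM(gross_amount), 0) AS discount_ratio
FROM fact_order_line
GROUP BY product_key;
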
In the real world, it is possible to have a fact table that contains no measures or facts. These
tables are called "factless fact tables", or "junction tables".
The factless fact tables may be used for modeling many-to-many relationships or for
capturing timestamps of events.[1]

Types of fact tables

There are four fundamental measurement events, which characterize all fact tables. [2]

Transactional
A transactional table is the most basic and fundamental. The grain associated with a
transactional fact table is usually specified as "one row per line in a transaction", e.g., every
line on a receipt. Typically a transactional fact table holds data of the most detailed level,
causing it to have a great number of dimensions associated with it.

Periodic snapshots
The periodic snapshot, as the name implies, takes a "picture of the moment", where the
moment could be any defined period of time, e.g. a performance summary of a salesman over
the previous month. A periodic snapshot table is dependent on the transactional table, as it
needs the detailed data held in the transactional fact table in order to deliver the chosen
performance output.

Accumulating snapshots
This type of fact table is used to show the activity of a process that has a well-defined
beginning and end, e.g., the processing of an order. An order moves through specific steps
until it is fully processed. As steps towards fulfilling the order are completed, the associated
row in the fact table is updated. An accumulating snapshot table often has multiple date
columns, each representing a milestone in the process. Therefore, it's important to have an
entry in the associated date dimension that represents an unknown date, as many of the
milestone dates are unknown at the time of the creation of the row.

Temporal snapshots
By applying temporal database theory and modeling techniques, the temporal snapshot fact
table [3] makes it possible to have the equivalent of daily snapshots without actually taking daily
snapshots. It introduces the concept of time intervals into a fact table, which saves a lot
of space and optimizes performance while allowing the end user to have the logical equivalent
of the "picture of the moment" they are interested in.

Steps in designing a fact table

Identify a business process for analysis (like sales).

Identify measures of facts (sales dollar), by asking questions like 'what number of X are
relevant for the business process?', replacing the X with various options that make sense
within the context of the business.

Identify dimensions for facts (product dimension, location dimension, time dimension,
organization dimension), by asking questions that make sense within the context of the
business, like 'analyse by X', where X is replaced with the subject to test.

List the columns that describe each dimension (region name, branch name, business unit
name).

Determine the lowest level (granularity) of summary in a fact table (e.g. sales dollars).
An alternative approach is the four step design process described in Kimball:[1] select the
business process, declare the grain, identify the dimensions, identify the facts.

Dimensional Modeling

Dimensional Modeling (DM) is a data structure technique optimized for data storage in a
Data warehouse. The purpose of dimensional modeling is to optimize the database for faster
retrieval of data. The concept of Dimensional Modelling was developed by Ralph Kimball
and consists of “fact” and “dimension” tables.

A dimensional model in a data warehouse is designed to read, summarize, and analyze numeric
information like values, balances, counts, weights, etc. In contrast,
relational models are optimized for the addition, updating and deletion of data in a real-time Online
Transaction Processing system.

These dimensional and relational models have their unique way of data storage that has
specific advantages.

For instance, in the relational model, normalization and ER models reduce redundancy in data.
On the contrary, a dimensional model in a data warehouse arranges data in such a way that it is
easier to retrieve information and generate reports.

Hence, dimensional models are used in data warehouse systems and are not a good fit for
online transaction processing (OLTP) systems.

Elements of Dimensional Data Model

Fact

Facts are the measurements/metrics or facts from your business process. For a Sales business
process, a measurement would be quarterly sales number

Dimension

Dimension provides the context surrounding a business process event. In simple terms, they
give who, what, where of a fact. In the Sales business process, for the fact quarterly sales
number, dimensions would be

 Who – Customer Names


 Where – Location
 What – Product Name

In other words, a dimension is a window to view information in the facts.


Attributes

The Attributes are the various characteristics of the dimension in dimensional data modeling.

In the Location dimension, the attributes can be

 State
 Country
 Zipcode etc.

Attributes are used to search, filter, or classify facts. Dimension Tables contain Attributes

Fact Table

A fact table is a primary table in dimension modelling.

A Fact Table contains

1. Measurements/facts
2. Foreign key to dimension table

Dimension Table

 A dimension table contains the dimensions of a fact.


 They are joined to the fact table via a foreign key.
 Dimension tables are de-normalized tables.
 The dimension attributes are the various columns in a dimension table.
 Dimensions offer descriptive characteristics of the facts with the help of their
attributes.
 There is no set limit on the number of dimensions.
 A dimension can also contain one or more hierarchical relationships.

Types of Dimensions in Data Warehouse

Following are the Types of Dimensions in Data Warehouse:

 Conformed Dimension
 Outrigger Dimension
 Shrunken Dimension
 Role-playing Dimension
 Dimension to Dimension Table
 Junk Dimension
 Degenerate Dimension
 Swappable Dimension
 Step Dimension

Steps of Dimensional Modelling


The accuracy in creating your Dimensional modeling determines the success of your data
warehouse implementation. Here are the steps to create Dimension Model

1. Identify Business Process


2. Identify Grain (level of detail)
3. Identify Dimensions
4. Identify Facts
5. Build Star

Step 1) Identify the Business Process

Identify the actual business process the data warehouse should cover. This could be Marketing,
Sales, HR, etc., as per the data analysis needs of the organization. The selection of the
business process also depends on the quality of data available for that process. It is the most
important step of the data modelling process, and a failure here would have cascading and
irreparable defects.

To describe the business process, you can use plain text or use basic Business Process
Modelling Notation (BPMN) or Unified Modelling Language (UML).

Step 2) Identify the Grain

The Grain describes the level of detail for the business problem/solution. It is the process of
identifying the lowest level of information for any table in your data warehouse. If a table
contains sales data for every day, then it should be daily granularity. If a table contains total
sales data for each month, then it has monthly granularity.

During this stage, you answer questions like

1. Do we need to store all the available products or just a few types of products? This
decision is based on the business processes selected for the data warehouse
2. Do we store the product sale information on a monthly, weekly, daily or hourly basis?
This decision depends on the nature of reports requested by executives
3. How do the above two choices affect the database size?

Example of Grain:

The CEO at an MNC wants to find the sales for specific products in different locations on a
daily basis.

So, the grain is "product sale information by location by the day."

Step 3) Identify the Dimensions

Dimensions are nouns like date, store, inventory, etc. These dimensions are where all the data
should be stored. For example, the date dimension may contain data like a year, month and
weekday.

Example of Dimensions:

The CEO at an MNC wants to find the sales for specific products in different locations on a
daily basis.

Dimensions: Product, Location and Time

Attributes: For Product: Product key (Foreign Key), Name, Type, Specifications

Hierarchies: For Location: Country, State, City, Street Address, Name

Step 4) Identify the Fact

This step is co-associated with the business users of the system because this is where they get
access to data stored in the data warehouse. Most of the fact table rows are numerical values
like price or cost per unit, etc.

Example of Facts:

The CEO at an MNC wants to find the sales for specific products in different locations on a
daily basis.

The fact here is Sum of Sales by product by location by time.


Step 5) Build Schema

In this step, you implement the Dimension Model. A schema is nothing but the database
structure (arrangement of tables). There are two popular schemas

1. Star Schema

The star schema architecture is easy to design. It is called a star schema because the diagram
resembles a star, with points radiating from a center. The center of the star consists of the fact
table, and the points of the star are the dimension tables.

The fact table in a star schema is in third normal form, whereas the dimension tables are
de-normalized.

In computing, the star schema is the simplest style of data mart schema and is the approach
most widely used to develop data warehouses and dimensional data marts. The star schema
consists of one or more fact tables referencing any number of dimension tables. The star
schema is an important special case of the snowflake schema, and is more effective for
handling simpler queries.
The star schema gets its name from the physical model's resemblance to a star shape with a
fact table at its center and the dimension tables surrounding it representing the star's points.
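
To show how the single layer of joins works in practice, the hedged query below rolls up the hypothetical FACT_SALES table sketched earlier by year, month and region; each dimension table attaches to the fact table through exactly one join.

-- Hypothetical star-schema query: one join per dimension, straight to the fact table.
SELECT d.year_no,
       d.month_name,
       s.region,
       SUM(f.units_sold) AS total_units
FROM FACT_SALES f
JOIN DIM_DATE  d ON f.date_key  = d.date_key
JOIN DIM_STORE s ON f.store_key = s.store_key
GROUP BY d.year_no, d.month_name, s.region;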

2. Snowflake Schema

The snowflake schema is an extension of the star schema. In a snowflake schema, each
dimension is normalized and connected to additional dimension tables.

In computing, a snowflake schema is a logical arrangement of tables in a multidimensional


database such that the entity relationship diagram resembles a snowflake shape. The
snowflake schema is represented by centralized fact tables which are connected to
multiple dimensions. "Snowflaking" is a method of normalizing the dimension tables in a star
schema. When it is completely normalized along all the dimension tables, the resultant
structure resembles a snowflake with the fact table in the middle. The principle behind
snowflaking is normalization of the dimension tables by removing low cardinality attributes
and forming separate tables.[1]
The snowflake schema is similar to the star schema. However, in the snowflake schema,
dimensions are normalized into multiple related tables, whereas the star schema's dimensions
are denormalized with each dimension represented by a single table. A complex snowflake
shape emerges when the dimensions of a snowflake schema are elaborate, having multiple
levels of relationships, and the child tables have multiple parent tables ("forks in the road").
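
A minimal sketch of snowflaking the hypothetical product dimension used earlier: the low-cardinality category attribute is moved into its own table, which the product dimension then references, so queries that need the category name require one extra join.

-- Hypothetical snowflaked product dimension: category is normalized out
-- of the product dimension into its own table.
CREATE TABLE DIM_CATEGORY (
    category_key  INTEGER PRIMARY KEY,
    category_name VARCHAR(50)
);

CREATE TABLE DIM_PRODUCT_SF (
    product_key   INTEGER PRIMARY KEY,
    product_name  VARCHAR(50),
    category_key  INTEGER REFERENCES DIM_CATEGORY(category_key)
);

-- Queries now need an extra join to reach the category name.
SELECT c.category_name, SUM(f.units_sold) AS total_units
FROM FACT_SALES f
JOIN DIM_PRODUCT_SF p ON f.product_key  = p.product_key
JOIN DIM_CATEGORY   c ON p.category_key = c.category_key
GROUP BY c.category_name;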

Rules for Dimensional Modelling

Following are the rules and principles of Dimensional Modeling:


 Load atomic data into dimensional structures.
 Build dimensional models around business processes.
 Need to ensure that every fact table has an associated date dimension table.
 Ensure that all facts in a single fact table are at the same grain or level of detail.
 It's essential to store report labels and filter domain values in dimension tables
 Need to ensure that dimension tables use a surrogate key
 Continuously balance requirements and realities to deliver business solution to
support their decision-making

Benefits of Dimensional Modeling

 Standardization of dimensions allows easy reporting across areas of the business.


 Dimension tables store the history of the dimensional information.
 It allows introducing an entirely new dimension without major disruption to the fact
table.
 Dimensional models also store data in such a fashion that it is easier to retrieve
information from the database once the data is stored.
 Compared to the normalized model, dimensional tables are easier to understand.
 Information is grouped into clear and simple business categories.
 The dimensional model is very understandable by the business. This model is based
on business terms, so that the business knows what each fact, dimension, or attribute
means.
 Dimensional models are de-normalized and optimized for fast data querying. Many
relational database platforms recognize this model and optimize query execution plans
to aid in performance.
 Dimensional modelling in a data warehouse creates a schema which is optimized for
high performance. It means fewer joins and helps with minimized data redundancy.
 The dimensional model also helps to boost query performance. It is more
denormalized and is therefore optimized for querying.
 Dimensional models can comfortably accommodate change. Dimension tables can
have more columns added to them without affecting existing business intelligence
applications using these tables.

The model should describe the Why, How much, When/Where/Who and What of your
business process

The life cycle of a dimensional model includes design, test, transform, and production
phases.
Life cycle of a dimensional model
Throughout the life of a dimensional model, there are four key phases:

1. Design: You can use the workbench to create dimensional models.


2. Test: You can use the workbench and other querying tools to analyze and test a
dimensional model against a database (usually a test database).
3. Transform: You can transform your data models to work with other querying tools
or to continue developing your data models and applications.
4. Production: You deploy the dimensional model, and users can query against the
database.
Design a dimensional data model

To build a dimensional database, you start by designing a dimensional data model for your
business.

You will learn how a dimensional model differs from a transactional model, what fact tables
and dimension tables are and how to design them effectively. You will learn how to analyze
the business processes in your organization where data is gathered and use that analysis to
design a model for your dimensional data.

IBM® Informix® includes several demonstration databases that are the basis for many
examples in Informix publications, including examples in the IBM Informix Data Warehouse
Guide. The stores_demo database illustrates a relational schema with information about a
fictitious wholesale sporting-goods distributor. You will use SQL and the data in
the stores_demo database to populate a new dimensional database. The dimensional database
is based on the simple dimensional data model that you learned about.

 Concepts of dimensional data modeling


 Building a dimensional data model
 Handle common dimensional data-modeling problems

Concepts of dimensional data modeling


To build a dimensional database, you start with a dimensional data model. The dimensional
data model provides a method for making databases simple and understandable. You can
conceive of a dimensional database as a database cube of three or four dimensions where
users can access a slice of the database along any of its dimensions. To create a dimensional
database, you need a model that lets you visualize the data.
Suppose your business sells products in different markets and you want to evaluate the
performance over time. It is easy to conceive of this business process as a cube of data,
which contains dimensions for time, products, and markets. The following figure shows
this dimensional model. The various intersections along the lines of the cube would
contain combinations of product, market, and time data.

Figure 1. A dimensional model of a business that has time, product, and market
dimensions
Another name for the dimensional model is the star schema. The database designers use
this name because the diagram for this model looks like a star with one central table
around which a set of other tables are displayed. The central table is the only table in the
schema with multiple joins connecting it to all the other tables. This central table is called
the fact table and the other tables are called dimension tables. The dimension tables all
have only a single join that attaches them to the fact table, regardless of the query. The
following figure shows a simple dimensional model of a business that sells products in
different markets and evaluates business performance over time.

Figure 2. A typical dimensional model

 The fact table


The fact table stores the measures of the business and points to the key value at the
lowest level of each dimension table. The measures are quantitative or factual data
about the subject.

 Dimensions of the data model

Building a dimensional data model

To build a dimensional data model, you need a methodology that outlines the decisions you
need to make to complete the database design. This methodology uses a top-down approach
because it first identifies the major processes in your organization where data is collected. An
important task of the database designer is to start with the existing sources of data that your
organization uses. After the processes are identified, one or more fact tables are built from
each business process. The following steps describe the methodology you use to build the
data model.

A dimensional database can be based on multiple business processes and can contain many
fact tables. However, to focus on the concepts, the data model that this section describes is
based on a single business process and has one fact table.

To build a dimensional database:

1. Choose the business processes that you want to use to analyze the subject area to be
modeled.
2. Determine the granularity of the fact tables.
3. Identify dimensions and hierarchies for each fact table.
4. Identify measures for the fact tables.
5. Determine the attributes for each dimension table.
6. Get users to verify the data model.

 A business process
A business process is an important operation in your organization that some legacy
system supports. You collect data from this system to use in your dimensional
database.
 Summary of a business process
 Determine the granularity of the fact table
After you gather all the relevant information about the subject area, the next step in
the design process is to determine the granularity of the fact table.
 Identify the dimensions and hierarchies
After you determine the granularity of the fact table, it is easy to identify the primary
dimensions for the data model because each component that defines the granularity
corresponds to a dimension.
 Establish referential relationships
For the database server to support the dimensional data model, you must define
logical dependencies between the fact tables and their dimension tables.
 Fragmentation: Storage distribution strategies
The performance of data warehousing applications can typically benefit from
distributed storage allocation designs for partitioning a database table into two or
more fragments. Each fragment has the same schema as the table, and stores a subset
of the rows in the table (rather than a subset of its columns).

Handle common dimensional data-modeling problems

The dimensional model that the previous sections describe illustrates only the most basic
concepts and techniques of dimensional data modeling. The data model you build to address
the business needs of your enterprise typically involves additional problems and difficulties
that you must resolve to achieve the best possible query performance from your database.
This section describes various methods you can use to resolve some of the most common
problems that arise when you build a dimensional data model.
 Minimize the number of attributes in a dimension table
 Dimensions that occasionally change
 Use the snowflake schema for hierarchical dimension tables
