[go: up one dir, main page]

0% found this document useful (0 votes)
167 views15 pages

ETL Process Overview for Data Warehousing

The document discusses the extract, transform, and load (ETL) process used to synthesize data from multiple sources into a data warehouse. It describes the traditional ETL process of extracting data from transactional systems, transforming it in a staging area, and loading it into a data warehouse for analysis. It then explains how modern ETL processes extract data directly from various sources, transform it within powerful analytics databases using SQL, and load it for analysis. Finally, it outlines some critical components of ETL systems and the three phases of ETL: extraction, cleansing, and transformation.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
167 views15 pages

ETL Process Overview for Data Warehousing

The document discusses the extract, transform, and load (ETL) process used to synthesize data from multiple sources into a data warehouse. It describes the traditional ETL process of extracting data from transactional systems, transforming it in a staging area, and loading it into a data warehouse for analysis. It then explains how modern ETL processes extract data directly from various sources, transform it within powerful analytics databases using SQL, and load it for analysis. Finally, it outlines some critical components of ETL systems and the three phases of ETL: extraction, cleansing, and transformation.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 15

DM104 – Evaluation of Business Performance Page 1 of 15

Extract, Transform, and Load Process

EXTRACT, TRANSFORM, AND LOAD PROCESS


 Data Warehouse ETL Process
 ETL Components
 ETL Phases
 ETL Strength and Weaknesses
 ETL vs ELT

Welcome Notes:
WELCOME BSIS STUDENTS!

I. INTRODUCTION:

This module contains the lesson about extraction, transform, and load process. ETL process is used to
synthesize data from multiple sources many times to build a Data Warehouse, Data Hub, or Data Lake. This
module will help you to understand the process of transmitting data from the source database to the
destination data warehouse.

II. OBJECTIVES:

At the end of this module, you should be able to:


1. Explain the extraction, transform, and load process.
2. Identify the components of ETL.
3. Differentiate ETL process and ELT process.
4. List-down the strengths and weaknesses of ETL.
DM104 – Evaluation of Business Performance Page 2 of 15
Extract, Transform, and Load Process

III. PRELIMINARY ACTIVITIES:

Before you proceed to the main lesson, test yourself in this activity.

Direction: Write TRUE if the statement is correct, otherwise write FALSE.


_______ 1. ETL stands for Extract, Transform, and Load.
_______ 2. Loading is the operation of getting the information from a source system for further use in a
data warehouse environment.
_______ 3. The cleansing stage is crucial in a data warehouse technique because it is supposed to
improve data quality.
_______ 4. Extraction is the core of the reconciliation phase.
_______ 5. Load is the process of writing the data into the target database.

GREAT!!!
You may now proceed to the main lesson.

IV. DEVELOPMENT OF THE LESSON

Based on the preliminary activities, what did you notice about it?
________________________________________________________
CONGRATULATIONS!
You may now proceed to the lesson.

ETL (Extract, Transform, and Load) Process


The mechanism of extracting information from source systems and bringing it into the data warehouse
is commonly called ETL, which stands for Extraction, Transformation, and Loading.
The ETL process requires active inputs from various stakeholders, including developers, analysts,
testers, top executives and is technically challenging.
DM104 – Evaluation of Business Performance Page 3 of 15
Extract, Transform, and Load Process

To maintain its value as a tool for decision-makers, Data warehouse technique needs to change with
business changes. ETL is a recurring method (daily, weekly, monthly) of a Data warehouse system and
needs to be agile, automated, and well documented.

Traditional ETL process


The ETL process has looked like this:

Data is extracted from online transaction processing (OLTP) databases, today more commonly known
just as 'transactional databases', and other data sources. OLTP applications have high throughput, with large
numbers of read and write requests. They do not lend themselves well to data analysis or business
intelligence tasks. Data is then transformed in a staging area. These transformations cover both data
cleansing and optimizing the data for analysis. The transformed data is then loaded into an online analytical
processing (OLAP) database, today more commonly known as just an analytics database.
Business intelligence (BI) teams then run queries on that data, which are eventually presented to end
users, or to individuals responsible for making business decisions, or used as input for machine learning
algorithms or other data science projects. One common problem encountered here is if the OLAP summaries
can’t support the type of analysis the BI team wants to do, then the whole process needs to run again, this
time with different transformations.

Data Warehouse ETL process


Modern technology has changed most organizations’ approach to ETL, for several reasons. The biggest
is the advent of powerful analytics warehouses like Amazon Redshift and Google BigQuery. These newer
cloud-based analytics databases have the horsepower to perform transformations in place rather than
requiring a special staging area.
DM104 – Evaluation of Business Performance Page 4 of 15
Extract, Transform, and Load Process

Another is the rapid shift to cloud-based SaaS applications that now house significant amounts of
business-critical data in their own databases, accessible through different technologies such as APIs and
webhooks. Also, data today is frequently analyzed in raw form rather than from preloaded OLAP summaries.
This has led to the development of lightweight, flexible, and transparent ETL systems with processes that
look something like this:

The biggest advantage to this setup is that transformations and data modeling happen in the analytics
database, in SQL. This gives the BI team, data scientists, and analysts greater control over how they work
with it, in a common language they all understand.

Critical ETL components


Regardless of the exact ETL process you choose, there are some critical components you’ll want to consider:
 Support for change data capture (CDC): Incremental loading allows you to update your analytics
warehouse with new data without doing a full reload of the entire data set.
 Auditing and logging: You need detailed logging within the ETL pipeline to ensure that data can be
audited after it’s loaded and that errors can be debugged.
 Handling of multiple source formats: To pull in data from diverse sources such as Salesforce’s API,
your back-end financials application, and databases such as MySQL and MongoDB, your process needs
to be able to handle a variety of data formats.
 Fault tolerance: In any system, problems inevitably occur. ETL systems need to be able to recover
gracefully, making sure that data can make it from one end of the pipeline to the other even when the
first run encounters problems.
 Notification support: If you want your organization to trust its analyses, you have to build in notification
systems to alert you when data isn’t accurate. These might include:
DM104 – Evaluation of Business Performance Page 5 of 15
Extract, Transform, and Load Process

 Proactive notification directly to end users when API credentials expire


 Passing along an error from a third-party API with a description that can help developers debug and fix
an issue
 If there’s an unexpected error in a connector, automatically creating a ticket to have an engineer look
into it
 Utilizing systems-level monitoring for things like errors in networking or databases
 Low latency: Some decisions need to be made in real time, so data freshness is critical. While there will
be latency constraints imposed by particular source data integrations, data should flow through your ETL
process with as little latency as possible.
 Scalability: As your company grows, so will your data volume. All components of an ETL process should
scale to support arbitrarily large throughput.
 Accuracy: Data cannot be dropped or changed in a way that corrupts its meaning. Every data point
should be auditable at every stage in your process.

ETL Phases
ETL consists of three separate phases:

Extraction
 Extraction is the operation of extracting information from a source system for further use in a data
warehouse environment. This is the first stage of the ETL process.
 Extraction process is often one of the most time-consuming tasks in the ETL.
 The source systems might be complicated and poorly documented, and thus determining which data
needs to be extracted can be difficult.
 The data has to be extracted several times in a periodic manner to supply all changed data to the
warehouse and keep it up-to-date.
DM104 – Evaluation of Business Performance Page 6 of 15
Extract, Transform, and Load Process

Cleansing
The cleansing stage is crucial in a data warehouse technique because it is supposed to improve data
quality. The primary data cleansing features found in ETL tools are rectification and homogenization. They
use specific dictionaries to rectify typing mistakes and to recognize synonyms, as well as rule-based
cleansing to enforce domain-specific rules and defines appropriate associations between values.
The following examples show the essential of data cleaning:
 If an enterprise wishes to contact its users or its suppliers, a complete, accurate and up-to-date list of
contact addresses, email addresses and telephone numbers must be available.
 If a client or supplier calls, the staff responding should be quickly able to find the person in the enterprise
database, but this need that the caller's name or his/her company name is listed in the database.
 If a user appears in the databases with two or more slightly different names or different account numbers,
it becomes difficult to update the customer's information.

Transformation
Transformation is the core of the reconciliation phase. It converts records from its operational source
format into a particular data warehouse format. If we implement a three-layer architecture, this phase outputs
our reconciled data layer.
The following points must be rectified in this phase:
 Loose texts may hide valuable information. For example, XYZ PVT Ltd does not explicitly show that
this is a Limited Partnership company.
 Different formats can be used for individual data. For example, data can be saved as a string or as
three integers.
Following are the main transformation processes aimed at populating the reconciled data layer:
 Conversion and normalization that operate on both storage formats and units of measure to make
data uniform.
 Matching that associates equivalent fields in different sources.
 Selection that reduces the number of source fields and records.
Cleansing and Transformation processes are often closely linked in ETL tools.
DM104 – Evaluation of Business Performance Page 7 of 15
Extract, Transform, and Load Process

Loading
The Load is the process of writing the data into the
target database. During the load step, it is necessary to
ensure that the load is performed correctly and with as little
resources as possible.
Loading can be carried in two ways:
1. Refresh: Data Warehouse data is completely
rewritten. This means that older file is replaced.
Refresh is usually used in combination with static
extraction to populate a data warehouse initially.
2. Update: Only those changes applied to source
information are added to the Data Warehouse. An
update is typically carried out without deleting or
modifying preexisting data. This method is used in
combination with incremental extraction to update
data warehouses regularly.

ETL vs ELT
Extract, Transform and Load (ETL) is the technique of extracting the record from sources (which is
present outside or on-premises, etc.) to a staging area, then transforming or reformatting with business
manipulation performed on it in order to fit the operational needs or data analysis, and later loading into the
goal or destination databases or data warehouse.

Strengths
 Development Time: Designing from the output backwards provide that only information applicable to
the solution is extracted and processed, potentially decreasing development, delete, and processing
overhead.
DM104 – Evaluation of Business Performance Page 8 of 15
Extract, Transform, and Load Process

 Targeted data: Due to the targeted feature of the load process, the warehouse contains only information
relevant to the presentation. Reduced warehouse content simplify the security regime enforce and hence
the administration overheads.
 Tools Availability: The number of tools available that implement ETL provides the flexibility of approach
and the opportunity to identify the most appropriate tool. The proliferation of tools has to lead to a
competitive functionality war, which often results in loss of maintainability.

Weaknesses
 Flexibility: Targeting only relevant information for output means that any future requirements that may
need data that was not included in the original design will need to be added to the ETL routines. Due to
the nature of tight dependency between the methods developed, this often leads to a need for
fundamental redesign and development. As a result, this increase the time and cost involved.
 Hardware: Most third-party tools utilize their engine to implement the ETL phase. Regardless of the
estimate of the solution, this can necessitate the investment in additional hardware to implement the
tool's ETL engine. The use of third-party tools to achieve the ETL process compels the information of
new scripting languages and processes.
 Learning Curve: Implementing a third-party tools that uses foreign processes and languages results in
the learning curve that is implicit in all technologies new to an organization and can often lead to
consecutive blind alleys in their use due to shortage of experience.

Extract, Load and Transform (ELT) is the various sight while looking at data migration or movement.
ELT involves the extraction of aggregate information from the source system and loading to the target method
instead of transformation between the extraction and loading phase. Once the data is copied or loaded into
the target method, then change takes place.
DM104 – Evaluation of Business Performance Page 9 of 15
Extract, Transform, and Load Process

The extract and load step can be isolated from the transformation process. Isolating the load phase from
the transformation process delete an inherent dependency between these phases. In addition to containing
the data necessary for the transformations, the extract and load process can include components of data that
may be essential in the future. The load phase could take the entire source and loaded it into the warehouses.
Separating the phases enables the project to be damaged down into smaller chunks, thus making it more
specific and manageable.
Performing the data integrity analysis in the staging method enables a further phase in the process
to be isolated and dealt with at the most appropriate point in the process. This method also helps to ensure
that only cleaned and checked information is loaded into the warehouse for transformation. Isolating the
transformations from the load steps helps to encourage a more staged way to the warehouse design and
implementation.

Strengths
 Project Management: Being able to divide the warehouse method into specific and isolated functions,
enables a project to be designed on a smaller function basis, therefore the project can be broken down
into feasible chunks.
 Flexible & Future Proof: In general, in an ELT implementation, all record from the sources are loaded
into the data warehouse as part of the extract and loading process. This, linked with the isolation of the
transformation phase, means that future requirements can easily be incorporated into the data
warehouse architecture.
 Risk minimization: Deleting the close interdependencies between each technique of the warehouse
build system enables the development method to be isolated, and the individual process design can thus
also be separated. This provides a good platform for change, maintenance and management.
 Utilize Existing Hardware: In implementing ELT as a warehouse build process, the essential tools
provided with the database engine can be used.
 Utilize Existing Skill sets: By using the functionality support by the database engine, the existing
investment in database functions are re-used to develop the warehouse. No new skills need to be
learned, and the full weight of the experience in developing the engine’s technology is utilized, further
reducing the cost and risk in the development process.
DM104 – Evaluation of Business Performance Page 10 of 15
Extract, Transform, and Load Process

Weaknesses
 Against the Norm: ELT is a new method to data warehouse design and development. While it has
proven itself many times over through its abundant use in implementations throughout the world, it does
require a change in mentality and design approach against traditional methods.
 Tools Availability: Being an emergent technology approach, ELT has limited availability of tools.

Difference between ETL and ELT


Basics ETL ELT
Process Data is transferred to the ETL server and Data remains in the DB except for
moved back to DB. High network bandwidth cross Database loads (e.g. source to
required. object).
Transformation Transformations are performed in ETL Server. Transformations are performed (in the
source or) in the target.
Code Usage Typically used for Typically used for
 Source to target transfer  High amounts of data
 Compute-intensive Transformations
 Small amount of data
Time- It needs highs maintenance as you need to Low maintenance as data is always
Maintenance select data to load and transform. available.
Calculations Overwrites existing column or Need to append Easily add the calculated column to
the dataset and push to the target platform. the existing table.
Analysis

We had just finished the discussion on extract, transform, and load


process. Let’s move on to the next higher level of activity/ies or exercise/s that
demonstrate your potential skills/knowledge of what you have learned.
DM104 – Evaluation of Business Performance Page 11 of 15
Extract, Transform, and Load Process

V. ANALYSIS, APPLICATION AND EXPLORATION


ACTIVITY 1
Name: _________________________________ Date: __________________
Year & Section: _________________________ Score: _________________
Direction: Read each statement below and fill in the missing words from the word bank.

Refresh Load Cleansing ELT Extraction


Transformation Data Update ETL Server

1. The mechanism of extracting information from source systems and bringing it into the data warehouse
is commonly called __________.
2. __________ is extracted from online transaction processing (OLTP) databases, today more commonly
known just as 'transactional databases', and other data sources.
3. __________ is the operation of extracting information from a source system for further use in a data
warehouse environment.
4. __________ stage is crucial in a data warehouse technique because it is supposed to improve data
quality.
5. __________ is the core of the reconciliation phase.
6. __________is the process of writing the data into the target database.
7. __________ involves the extraction of aggregate information from the source system and loading to
the target method instead of transformation between the extraction and loading phase.
8. __________ is usually used in combination with static extraction to populate a data warehouse initially.
9. __________ is typically carried out without deleting or modifying preexisting data.
10. Transformations are performed in ETL __________.

Finally, let us summarize the lesson of what we had discussed today.


DM104 – Evaluation of Business Performance Page 12 of 15
Extract, Transform, and Load Process

VI. GENERALIZATION

Name: _________________________________ Date: __________________


Year & Section: _________________________ Score: _________________
Direction: Explain the three phases of ETL process.
1. Extract
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________

2. Transform
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________

3. Load
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________
__________________________________________________________________________________

KUDOS!
You have come to an end of Module 3.
OOPS! Don’t forget that you have still an assignment to do.
Here it is….
DM104 – Evaluation of Business Performance Page 13 of 15
Extract, Transform, and Load Process

VII. ASSIGNMENT

Name: _________________________________ Date: __________________


Year & Section: _________________________ Score: _________________
Direction: Write TRUE if the statement is correct, otherwise write FALSE.
_______ 1. ETL stands for Extract, Transform, and Load.
_______ 2. In ELT process the transformations are performed in the server.
_______ 3. Loading is the operation of getting the information from a source system for further use in a
data warehouse environment.
_______ 4. The cleansing stage is crucial in a data warehouse technique because it is supposed to improve
data quality.
_______ 5. Extraction is the core of the reconciliation phase.
_______ 6. Load is the process of writing the data into the target database.
_______ 7. OLTP applications have high throughput, with large numbers of read and write requests.
_______ 8. System analyst run queries on that data, which are eventually presented to end users, or to
individuals responsible for making business decisions
_______ 9. Every data point should be auditable at every stage in your process.
_______ 10. Refresh is used in combination with incremental extraction to update data warehouses
regularly.

After your long journey of reading and accomplishing the module, let us now
challenge your mind by answering the evaluation part of this module.
DM104 – Evaluation of Business Performance Page 14 of 15
Extract, Transform, and Load Process

VIII. EVALUATION

Name: _________________________________ Date: __________________


Year & Section: _________________________ Score: _________________
Direction: Read each sentence/ situation carefully and select the BEST answer among the choices and
encircle its corresponding letter.
1. It is the process of extracting information from source systems and bringing it into the data warehouse.
A. ETL Process B. ELT Process C. Data Warehousing D. None of these.
2. What does ETL stands for?
A. Extreme, Transformation, and Load C. Extract, Transform, and Load
B. Extract, Transfer, and Load D. Exact, Transform, and Load
3. It refers to the operation of taking information from a source system for further use in a data warehouse
environment.
A. Extract B. Transform C. Load D. Cleansing
4. This stage is crucial in a data warehouse technique because it is supposed to improve data quality.
A. Extract B. Transform C. Load D. Cleansing
5. This phase of ETL process converts records from its operational source format into a particular data
warehouse format.
A. Extract B. Transform C. Load D. Cleansing
6. It is the process of writing the data into the target database.
A. Extract B. Load C. Cleansing D. Staging Area
7. This method of loading process is used in combination with incremental extraction to update data
warehouses regularly.
A. Refresh B. Update C. Cleansing D. Staging Area
8. This loading process usually used in combination with static extraction to populate a data warehouse
initially.
A. Refresh B. Update C. Cleansing D. Staging Area
9. It is an intermediate storage used for data processing during the extract, transform and load process.
A. Refresh B. Update C. Cleansing D. Staging Area
10. It involves the extraction of aggregate information from the source system and loading to the target
method instead of transformation between the extraction and loading phase
A. ETL Process B. ELT Process C. Data Warehousing D. None of these.
DM104 – Evaluation of Business Performance Page 15 of 15
Extract, Transform, and Load Process

CONGRATULATIONS on reaching the end of this module!


You may now proceed to the next module.
Don’t forget to submit all the exercises, activities and portfolio
on ___________________.
KEEP UP THE GOOD WORK.
Well Done!!!

You might also like