ETL Testing Concepts iCEDQ

1. ETL testing validates that an extract, transform, and load (ETL) process correctly extracts data from source systems, transforms it as required, and loads it into the target systems.
2. ETL testing is done by comparing the input and output data from the ETL process to ensure data is handled as specified.
3. Traditional manual ETL testing is inefficient for large amounts of data and dynamic data sources. Rules-based ETL testing uses data rules to automate cross-system comparison testing.

What is ETL Testing?

ETL testing verifies that an ETL process is correctly extracting, transforming, and loading data as per the specifications. ETL testing is done by validating and/or comparing the input and output data transformed by the ETL process.

ETL testing is used in data-centric projects that have a huge amount of data or a substantial number of data pipelines. It should not be confused with application testing, which usually involves a small amount of transactional data.

ETL Testing Full Form


ETL in testing means an extract, transform, and load process that reads data from multiple source systems, transports it to a data transformation layer for further processing, which includes cleaning, consolidating, and integrating the data, and then loads it into the target system.

Why is ETL Testing Required?
Anytime a piece of software is developed, it must be tested. The ETL process is ultimately a piece of software written by a developer. An ETL process is at the heart of any data-centric system or project, and mistakes in the ETL process directly impact the data and the downstream applications.

1. Without ETL testing there is no way of knowing whether the process is built to specification.

2. Without ETL testing the code cannot be released or deployed in production.

3. ETL testing enables root cause analysis to identify whether data issues are due to the source data or the ETL process.

4. ETL testing ensures that data issues are caught early in the development lifecycle.
ETL Testing Basics
ETL processes read data from a source, transform the data, and then load it into the destination databases. An organization can easily have thousands of such ETL processes.

“Just like any piece of code that is developed, the ETL code must be tested to ensure that the developers have written the code correctly.”

ETL testing is unique because:

– ETL processes are background processes and don’t have user screens.
– ETL testing involves a large amount of data.
– ETL processes are like functions, where testing requires execution of the ETL process and then comparison of the input and output data.
– Defects in ETL processes cannot be detected by simply reviewing the ETL code.

How to do ETL Testing?


ETL processes are evaluated indirectly through a black-box testing approach, wherein the ETL process is first executed to create the output data, and then the quality of the ETL process is determined by verifying the output data.
The ETL testing process is summarized in the following three steps:
1. First, the ETL code is executed to generate the output data.
2. Then the output data is compared with the predetermined expected data.
3. Based on the comparison results, the quality of the ETL process is determined.
For ETL testing you can follow the legacy approach, which is outdated, or the newer rules-based ETL testing pioneered by iceDQ.

Difference between Manual Testing and ETL Testing

Legacy manual ETL testing has several drawbacks:
– It is resource-intensive, hence very costly.
– Testing is based on a few sampled records.
– It is ad hoc, hence not repeatable.
– The tester must do all the testing on his desktop.
– Test coverage is incomplete.
– Regression testing is not possible.

The list below contrasts legacy manual ETL testing with the rules-based approach.
1. Manual Testing
2. Manual- Pseudo code-based ETL Testing
3. Manual- Golden Copy based ETL Testing
4. Rules Based ETL Testing

1. Manual ETL Testing

In this approach, the data created by the ETL process is sampled and inspected visually by a tester. If the data output is as expected, the ETL process passes the quality test.

This manual testing approach is not scalable, as humans are not capable of dealing with more than a few hundred records.

2. Manual- Pseudo code-based ETL Testing:

QA recreates pseudo ETL code in parallel to the developers’ actual ETL process. This pseudo ETL code processes a subset of data and generates an output. The actual ETL process also uses the same input data and generates data. Then the ETL tester compares the data generated by both processes; if the two match, the ETL process passes the quality test.

Parallel development of pseudo ETL code by the QA team is ridiculous because:

– A crazy amount of time, money, and resources is wasted in reproducing the pseudo ETL code by the QA team.
– Since the pseudo ETL code is also development, there is no guarantee that the pseudo ETL code developed by the QA team is error-free, and hence neither is the data output generated by it.

3. Manual- Golden Copy based ETL Testing:

In this method, QA manually creates the expected data output based on their understanding of the data processing logic. First the manual data, also called the golden copy, is created and stored in a database. Next, the ETL process is executed, and the data generated by the ETL process is compared to the golden copy of data that was created by the QA team. The ETL process is certified if the two match.

This approach is also severely limited because:

– It only works with sampled data.
– It does not ensure test coverage.
– It does not work if the input data changes.

4. Rules Based ETL Testing:


An ETL process is a set of actions that transforms input data into a desired data output. Rules-based ETL testing understands these data transformation requirements and derives data audit rules, which are later used to test the ETL processes.

The prior three legacy methods have inherent limitations. Beyond the obvious inefficiencies, there are fundamental reasons why a conventional approach toward ETL testing will never work:

Large Data Volume: The recent exponential growth in data volumes, as well as growth in the number of ETL processes, has made the above three approaches pretty much useless. It is no longer feasible to manually test the ETL processes.
Data Sampling: Data sampling does not work because there are many corner cases that will only be discovered if almost all the data is processed and the output generated by the ETL process is inspected.
Dynamic Input Data: The output generated by an ETL process is totally dependent on the input data, and the input is dynamic. Hence any conventional ETL testing that uses a predetermined output will not work.

Cross-database comparisons: Since an ETL process reads data from one system and loads it into another, it is almost impossible to bring the data into one place and then do the comparison.

Rules-based ETL testing is designed to avoid all the above pitfalls. Its inspiration comes from double-entry accounting: when money is transferred between accounts, the balance in account ‘A’ must decrease while the balance in account ‘B’ must simultaneously increase by the same amount.

Account ‘A’ (Original Balance – New Balance) = Account ‘B’ (New Balance – Original Balance)
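The accounting identity above can be written as a one-line audit check. This is a minimal sketch; the function name and balance values are illustrative, not part of any real accounting API.

```python
def transfer_is_balanced(a_orig, a_new, b_orig, b_new):
    # The amount that left account A must equal the amount that entered account B.
    return (a_orig - a_new) == (b_new - b_orig)

# 30 left account A and 30 entered account B, so the rule holds.
print(transfer_is_balanced(100, 70, 50, 80))  # True
```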

This same concept has been extended to ETL testing, wherein the logic of the ETL transformation is understood and audit rules are created to certify the ETL process.

Advantages of Rules-Based ETL Testing:

– The data can change, but the rules don’t.
– No need to recreate the ETL processes.
– There is no limit on data volume.
– No manual interventions.
– No data sampling is required.
– The rules are stored in a central knowledge repository.
– The rules are executed on the server via a schedule or on demand.
– Cross-database testing is possible.
– The rules can be stateless to support dynamic changes in the input data.

ETL Testing Concepts with Examples


Rules-based ETL testing operates under two assumptions:

– The input data provided will change for each ETL run.
– The data transformation rules and conditions remain the same unless there are changes in business requirements.

For ETL testing you must understand the concepts of static ETL testing and dynamic ETL testing.
1. Static ETL Testing

For static ETL testing, only the output data is used for certifying the ETL process. The output value generated by the ETL process is compared against fixed expectations defined by the tester. The input data used by the ETL process is not taken into consideration.

Here are a few examples:

1. Static Value Example: The Account Category column can only have one of three values: ‘Checking’, ‘Savings’, ‘Trading’.

2. Static Condition Example: The Net amount must equal the Gross Amount minus the sum of Tax, Commission, and Fees.
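The two static rules above can be sketched as a check that looks only at output rows. The column names and sample values are illustrative assumptions, not a real schema.

```python
# A minimal sketch of the two static rules above; column names are illustrative.
VALID_CATEGORIES = {"Checking", "Savings", "Trading"}

def check_row(row):
    """Return a list of static-rule violations for one output row."""
    errors = []
    if row["account_category"] not in VALID_CATEGORIES:
        errors.append("invalid account category")
    expected_net = row["gross"] - (row["tax"] + row["commission"] + row["fees"])
    if row["net"] != expected_net:
        errors.append("net amount mismatch")
    return errors

row = {"account_category": "Savings", "gross": 100.0, "tax": 5.0,
       "commission": 2.0, "fees": 1.0, "net": 92.0}
print(check_row(row))  # []
```

Note that neither rule needs the input data: the output rows alone are enough to certify or fail them.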

2. Dynamic ETL Testing

For dynamic ETL testing, both input and output data are taken into consideration while testing the ETL process. In many cases the output data of an ETL process is totally dependent on the input data provided at runtime to the ETL process. Even though the transformation logic is fixed, the output changes with the input data used by the ETL process at runtime.

Thus, the ETL testing must support the dynamic nature of input data that is
provided during the execution. This can be represented by the following
simplistic equation.
Input Data + Transformation = Output Data

The ETL testing example below explains the dynamic ETL testing concept:
An ETL process is loading a customer list. The list contains two types of customers: corporate and individual. The ETL developer is asked to load only individual customers and not corporate customers. To test this ETL process, the count of individual customers in the source must exactly match the count of customers in the target.

How many customers should be in the target table? That can only be known by
counting individual customers in the source that were provided to the ETL process
at runtime.
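The customer-filter example above can be sketched as a dynamic rule: the expected count is computed from the source data at runtime rather than fixed in advance. The records and field names below are illustrative.

```python
# Illustrative data: the rule counts individual customers in the source at
# runtime and compares that count with the rows the ETL loaded into the target.
source = [
    {"name": "Acme Corp", "type": "corporate"},
    {"name": "Jane Doe", "type": "individual"},
    {"name": "John Roe", "type": "individual"},
]
target = [{"name": "Jane Doe"}, {"name": "John Roe"}]  # simulated ETL output

expected = sum(1 for c in source if c["type"] == "individual")
assert len(target) == expected, "row count mismatch: the ETL filter is wrong"
print("dynamic row-count rule passed:", expected, "rows")
```

Because `expected` is recomputed on every run, the same rule keeps working no matter how the input data changes.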

Types of ETL Testing


The types of ETL testing are listed below

1. ETL Source Data Validation Testing


2. ETL Source to Target Data Reconciliation Testing
3. ETL Data Transformation testing
4. ETL Data Validation
5. ETL Referential Integrity Testing
6. ETL Integration Testing
7. ETL Performance Testing
8. ETL Functional Testing
9. ETL Unit Testing
10. ETL Validation

1. ETL Source Data Validation Testing


This ETL testing checks the input data for validity. If the input data itself is not valid, you cannot expect the ETL process to transform the data correctly, or even to execute at all.

The test involves checking for nulls, formats, reference values, duplicates, etc.
For example,

– Verify that there are no null values in the “Name” attribute.


– The format of the date in the DOB column should be “YYYY-MM-DD”
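The two example checks above can be sketched as follows. The row data and column names are illustrative assumptions.

```python
import re

# Illustrative source-validity checks for nulls and date format.
rows = [
    {"Name": "Jane Doe", "DOB": "1990-04-12"},
    {"Name": None, "DOB": "12/04/1990"},
]

DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")  # YYYY-MM-DD
issues = []
for i, row in enumerate(rows):
    if row["Name"] is None:
        issues.append((i, "Name is null"))
    if not DATE_RE.match(row["DOB"] or ""):
        issues.append((i, "DOB not in YYYY-MM-DD format"))
print(issues)  # [(1, 'Name is null'), (1, 'DOB not in YYYY-MM-DD format')]
```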

2. ETL Source to Target Data Reconciliation Testing


This test is mostly done to prove that there is no leakage while transporting or staging the data. Comparing the source (input) data and the target (output) data ensures that data completeness and consistency are not lost because of any issues in the ETL process. For example,

– Make sure the row count between the source and the target table is matching.
– Compare all the customer data in the source and the target table to ensure that the ETL loaded the data into the destination table as per the mapping rules.
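A row-count reconciliation can be sketched with two tables. The in-memory SQLite tables below are illustrative stand-ins for the real source and target systems, which would normally live in different databases.

```python
import sqlite3

# In-memory tables stand in for the real source and target systems.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE src_customers (id INTEGER, name TEXT)")
con.execute("CREATE TABLE tgt_customers (id INTEGER, name TEXT)")
con.executemany("INSERT INTO src_customers VALUES (?, ?)",
                [(1, "Jane"), (2, "John"), (3, "Ada")])
con.executemany("INSERT INTO tgt_customers VALUES (?, ?)",
                [(1, "Jane"), (2, "John"), (3, "Ada")])

src_count = con.execute("SELECT COUNT(*) FROM src_customers").fetchone()[0]
tgt_count = con.execute("SELECT COUNT(*) FROM tgt_customers").fetchone()[0]
assert src_count == tgt_count, f"leakage: {src_count} source vs {tgt_count} target rows"
print("row counts reconcile:", src_count)
```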
3. ETL Data Transformation Testing
Data transformation tests ensure that every row is transformed successfully based on the mapping document. Testing data transformations involves reconciling the data between source and destination to verify that the ETL is transforming the data as expected. For example,

– Make sure the data lands in the correct target column.
– Make sure the ETL is calculating the values correctly.

4. ETL Data Validation


An ETL process loads data into a destination data store. With ETL data validation, the data is validated only after the ETL loads the data. Data validation tests are used to validate a single data source, be it a database table, a data extract, a dimension table, or a fact table. Check some of the examples below:

– Check if there are any nulls in the column.
– Validate that the format of the email is valid.
– There should be only one active record in a dimension table.
– The date in the birth date column should be a valid date.
– Check that the Net amount is not less than zero.
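The "only one active record in a dimension table" rule above can be sketched like this. The dimension rows and key names are illustrative assumptions.

```python
from collections import Counter

# Illustrative dimension rows; each business key should have exactly one
# active record (a common slowly-changing-dimension rule).
dim_rows = [
    {"cust_key": 1, "active": True},
    {"cust_key": 1, "active": False},   # retired historical record
    {"cust_key": 2, "active": True},
]
active_counts = Counter(r["cust_key"] for r in dim_rows if r["active"])
violations = [k for k in {r["cust_key"] for r in dim_rows} if active_counts[k] != 1]
print(violations)  # [] means every key has exactly one active record
```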
5. ETL Referential Integrity Testing
Referential integrity testing ensures that the child table only has foreign key values that exist in the parent table.

For example, if the gender table has M, F, and Others, the ETL testing involves reconciling so that the Gender attribute in the customer table will only have one of those three values.
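The gender example above can be sketched as an orphan-value check. The customer rows are illustrative.

```python
# Illustrative reference check: every foreign-key value in the child table
# must exist in the parent (reference) table.
gender_codes = {"M", "F", "Others"}            # parent table values
customers = [
    {"id": 1, "gender": "F"},
    {"id": 2, "gender": "M"},
    {"id": 3, "gender": "X"},                  # orphan value
]
orphans = [c["id"] for c in customers if c["gender"] not in gender_codes]
print(orphans)  # [3]
```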

6. ETL Integration Testing


ETL integration testing is done to verify that the ETL process has integrated the data
correctly. One of the key purposes of an ETL process is to integrate data from multiple
data sources or multiple subject areas.

Vertical Integration Testing: In this case, data is brought in from multiple data sources and integrated into a single table. For example, in this type of integration, the customer list from a CRM system and an accounting system is merged into one customer table. Vertical integration testing must ensure that:

– Attributes from multiple sources are mapped correctly to the destination.
– No duplicate records exist.

Horizontal Integration Testing: In this scenario, data from multiple subject areas and sources is linked together to form a meaningful relationship. A typical example is to link the salesperson data with sales data to calculate the commission. Mostly, referential integrity or foreign keys are used to link the data together.

ETL integration testing involves the creation of multiple ETL testing rules to verify that the data integration is done correctly. This is true because even though there might be one ETL process that integrates the data, it nevertheless contains multiple business rules for data transformation. ETL testing must ensure that each of those integration rules is implemented correctly. This testing includes all the above types of testing.

– Ensure the data is going to the respective attributes.
– Ensure the entities are linked correctly.

7. ETL Performance Testing


Even if the ETL process is coded correctly, it is possible that, when executed, it takes an unacceptably long time to finish. ETL performance metrics are usually measured in the number of rows processed per second.

To measure performance, three metrics are needed: the ETL process start time, the ETL process end time, and the number of records processed. The sources for the above metrics are:
– A special ETL log table which captures all the ETL process execution stats.
– Some of the metrics are derived from the target table with row-level logging attributes such as record insert datetime and record update datetime.
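The rows-per-second metric follows directly from the three values above. The timestamps and row count below are illustrative stand-ins for values read from an ETL run-log table.

```python
from datetime import datetime

# Timestamps and row count are illustrative stand-ins for values read
# from an ETL run-log table.
start = datetime(2024, 1, 1, 2, 0, 0)
end = datetime(2024, 1, 1, 2, 5, 0)
rows_processed = 1_500_000

elapsed_s = (end - start).total_seconds()
throughput = rows_processed / elapsed_s
print(f"{throughput:,.0f} rows/sec")  # 5,000 rows/sec
```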

There is no universal standard for performance testing numbers, so it all depends on the expectations. However, some parameters that affect the numbers must be taken into consideration:

– Number of records inserted.
– Number of records updated or deleted.
– Whether logging is enabled in the target database.
– Row-level locking settings in destination tables.
– Presence of indexes.
– The size of the processing machine.

8. ETL Functional Testing


The ETL functional tests are designed to ensure that the ETL is following the business rules. Consider an example: there are processes that load orders and shipments data from a source. The tester can compare the source orders with the target orders and the source shipments with the target shipments, and the data is a perfect match. While this is technically correct and the test is a success, it could still be a failure from a business-rules or functional point of view. Here are a few of the scenarios:
If there are more shipments than orders, that means shipments were done without orders. It is also possible that the shipment quantity and the order quantity or the prices don’t match between them.

These kinds of issues are tested by reconciling orders and shipments, and they fall under ETL functional testing.
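The order-versus-shipment reconciliation above can be sketched as a functional rule. The order and shipment data are illustrative.

```python
# Illustrative functional rule: every shipment must trace back to an order,
# and the shipped quantity must not exceed the ordered quantity.
orders = {101: 5, 102: 3}        # order_id -> ordered quantity
shipments = {101: 5, 103: 2}     # order_id -> shipped quantity

issues = []
for order_id, qty in shipments.items():
    if order_id not in orders:
        issues.append(f"shipment without order: {order_id}")
    elif qty > orders[order_id]:
        issues.append(f"over-shipment on order {order_id}")
print(issues)  # ['shipment without order: 103']
```

Both source-to-target comparisons could pass individually while this cross-entity rule still fails, which is exactly the gap functional testing covers.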

9. ETL Unit Testing


The ETL testing done by the developer during development is called ETL unit testing.
A unit ETL test is a test written by the programmer to verify that a relatively small piece
of ETL code is doing what it is intended to do. They are narrow in scope. While testing activity is expected from the QA team, some basic testing tasks are executed by the ETL developer during the development phase itself, much before the code is transitioned to the QA team. This ensures basic data transformations are done correctly.

10. ETL Validation


ETL validation is a technical test done as part of unit testing to ensure that the ETL executes without crashing. It is usually done by the developer as part of the sanity checks, as explained above.

ETL Testing Scenarios


Following ETL testing scenarios should be considered for any data projects.

1. Record level ETL tests


2. Attribute Data level ETL tests
3. Aggregate Data level ETL tests
4. Execution Level ETL tests

The ETL testing scenarios repeat in multiple situations regardless of the type of data
being processed.

ETL Test Scenarios and Descriptions

Record Level Scenarios: These are record level ETL tests.

Record Count Testing: This is a primary test, to check that all the available records are populated – nothing more, nothing less. This test ensures that the ETL process has loaded all the records, but it does not know if the data in the records is correct.

Duplicate Records Testing: Duplicate records happen if primary key or unique key constraints are not implemented in the database. In such cases, this test ensures that duplicate records are not generated by the ETL process.

Record Aggregation Test: In many scenarios transaction-level records are aggregated by time or other dimensions. Tests are needed to ensure that the dimensions chosen for the aggregation of records are correct.

Row Filter Testing: This test ensures that row filters are implemented as per requirements.

Type II Dimension Testing: The Type II dimension ETL logic retires old records and inserts new records. This test ensures that only one valid record is present and the expiry dates don’t overlap.

Attribute Level Scenarios: These are attribute level tests.

Data Mapping Testing: During the development of the ETL process the developer might make a mistake in mapping the source and target attributes. This ETL test ensures that the data is getting populated in the correct target attributes.

Calculations – Numeric and Date: There are many mathematical calculations used to transform the data. This test ensures the calculations are done correctly by the ETL process.

Expressions – String: Various string manipulations and operations such as CONCAT, SUBSTRING, and TRIM are done on strings. This test ensures string transformations are done correctly by the ETL process.

Data Truncation: Many times the data processed by the ETL process is truncated, and/or if the target column has a shorter size the data can get truncated. This ETL test ensures string data is not truncated by the ETL process or during load time.

Data Rounding – Numbers and Dates: This can happen if the datatype is not chosen correctly in either the ETL process variables or the target table datatypes. Numbers can get rounded; dates can lose time or second components. Ensure decimal data is not rounded incorrectly.

Formatting Issues – Dates and Strings: This mostly happens with string datatypes, as they accept data in almost any format. Ensure the date or string data is formatted correctly.

Reference Data or Dimension Lookup: Ensure that the child or transaction attributes have reference data that is present in the master.

Aggregate Scenarios: This involves testing of summarized (balances, snapshot, aggregates) data.

Aggregate Calculation: Ensure the aggregations of data are done correctly.

Simple Row Counts: Ensure the number of records populated is not more and/or less than the expected number of records. The row count in the destination must match the source system.

Simple Sums: Match the sums of numeric values between source and target to ensure the numbers are correct.

Grouped Row Counts: Match the row counts, grouped by key attributes, between the source and target.

Group Sums: Match the sums of numeric values, grouped by key attributes, between the source and target.

Execution Scenarios: This involves testing of ETL processes related to their executions.

Incremental Load: Often data is loaded in increments based on delta logic. This ETL test ensures the incremental loads reconcile correctly with the source and no gaps or overlaps are generated.

Multi Execution Tests: Normally you won’t expect the same data to be processed again, but in many situations the data is reprocessed or accidentally executed. This test ensures multiple reruns of the ETL process with the same data do not generate extra records.

ETL Performance Test: The ETL process must complete within an acceptable timeframe. The ETL performance test ensures that the ETL processing time is acceptable by checking the run logs.

Scope of ETL Testing


The scope of ETL testing is restricted to ensuring the ETL process is correctly developed,
and it is processing data as per the business requirements.

– Data Transformation and data loading is correct.


– ETL process does not create duplicate data.
– ETL process execution is done in proper order.
– The ETL process has correct incremental logic for processing data.
– The ETL emits proper exit codes on errors.
– The ETL Processes do not crash due to data exceptions.
– The ETL process logs metadata about its process.
There are many ways to do ETL testing. Some do it manually while others use legacy approaches. We hope it is clear that, going forward, rules-based ETL testing is the only viable solution for doing ETL testing at scale. Let us know what you think in the comments below.

DataOps Platform for Testing and Monitoring


Identify data issues in your Data Warehouse, Big Data and Data Migration Projects.

Let’s talk and see how iCEDQ can help you!

Request a Demo
