ETL Testing Concepts iCEDQ
ETL Testing means verifying that an ETL process is correctly extracting, transforming, and
loading data as per the specifications. ETL testing is done by validating and/or
comparing the input and output data transformed by the ETL process.
Without ETL testing there is no way of knowing whether the process is built correctly.
ETL testing ensures that data issues are caught early in the development lifecycle.
ETL Testing Basics
ETL processes read data from a source, transform the data, and then load it
into the destination databases. An organization can easily have thousands of
such ETL processes.
The diagram below shows legacy manual ETL testing vs. the rules-based
approach.
1. Manual Testing
2. Manual- Pseudo code-based ETL Testing
3. Manual- Golden Copy based ETL Testing
4. Rules Based ETL Testing
This manual testing approach is not scalable as humans are not capable of
dealing with more than a few hundred records.
QA recreates pseudo ETL code in parallel to the developers’ actual ETL process.
This pseudo ETL code processes a subset of data and generates an output.
The actual ETL process also uses the same input data and generates data.
Then the ETL tester compares the data generated by both the processes. If the
two outputs match, the ETL process passes the quality test.
An enormous amount of time, money, and resources is wasted in reproducing the
pseudo ETL code by the QA team.
Since the pseudo ETL code is itself a development effort, there is no guarantee that
the pseudo ETL code developed by the QA team is error free, and hence neither is
the data output generated by it.
The prior three legacy methods have inherent limitations. Beyond the obvious
scalability problems, consider the following:
Large Data Volume: The recent exponential growth in data volumes, as well as
in the number of ETL processes, has made the above three approaches largely
ineffective. It is no longer feasible to manually test the ETL processes.
Data Sampling: Data sampling does not work because many corner cases will
only be discovered if almost all the data is processed and the output
generated by the ETL process is inspected.
Dynamic Input Data: The output generated by an ETL process is totally
dependent on the input data, and the input is dynamic. Hence any conventional
ETL testing that uses a predetermined output will not work.
Cross Database Comparisons: Since an ETL process reads data from one system
and loads it into another, it is almost impossible to bring the data into one
place and then do the comparison.
Rules-based ETL testing is designed to avoid all the above pitfalls. It works
like the double-entry principle in accounting: when an amount is transferred, the
balance in account ‘A’ must decrease while the balance in account ‘B’ must
simultaneously increase by the same amount.
Account ‘A’ (Original Balance – New Balance) = Account ‘B’ (New Balance
– Original Balance)
This same concept has been extended to ETL testing: the logic of the ETL
transformation is understood, and audit rules are created to certify
the ETL process.
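As a minimal sketch, the audit concept above can be written as a check over the four balances (the function name and the sample amounts are illustrative, not from the article):

```python
def transfer_is_balanced(a_original, a_new, b_original, b_new):
    """Audit rule: the amount leaving account A must equal the amount entering B."""
    return (a_original - a_new) == (b_new - b_original)

# A transfer of 100 from A to B satisfies the rule.
print(transfer_is_balanced(500, 400, 200, 300))  # True
# A mismatched transfer violates it.
print(transfer_is_balanced(500, 400, 200, 290))  # False
```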
For ETL testing you must understand the concepts of static ETL testing and
dynamic ETL testing.
1. Static ETL Testing
1. Static Value Example: The Account Category column can only have one of
three values: ‘Checking’, ‘Savings’, or ‘Trading’.
2. Static Condition Example: Net Amount must equal Gross Amount minus the sum
of Tax, Commission, and Fees.
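A minimal Python sketch of these two static rules, assuming illustrative column names (`account_category`, `gross`, `net`, etc.) that are not specified in the article:

```python
# Static value rule: the column value must come from a fixed domain.
VALID_CATEGORIES = {"Checking", "Savings", "Trading"}

def check_account_category(row):
    return row["account_category"] in VALID_CATEGORIES

def check_net_amount(row, tolerance=0.01):
    # Static condition rule: Net = Gross - (Tax + Commission + Fees).
    expected = row["gross"] - (row["tax"] + row["commission"] + row["fees"])
    return abs(row["net"] - expected) <= tolerance

row = {"account_category": "Savings", "gross": 100.0, "tax": 5.0,
       "commission": 2.0, "fees": 1.0, "net": 92.0}
print(check_account_category(row), check_net_amount(row))  # True True
```

Static rules like these never change between runs, which is what distinguishes them from the dynamic tests below.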
Thus, the ETL testing must support the dynamic nature of input data that is
provided during the execution. This can be represented by the following
simplistic equation.
Input Data + Transformation = Output Data
The ETL testing example below will explain the dynamic ETL testing
concept:
An ETL process is loading a customer list. The list contains two types of customers: corporate
and individual. The ETL developer is asked to load only individual customers and not
corporate customers. To test this ETL process, the total number of individual customers in the
source must exactly match the number of customers in the target.
How many customers should be in the target table? That can only be known by
counting individual customers in the source that were provided to the ETL process
at runtime.
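The runtime counting logic can be sketched as follows (the row structures and field names are assumptions for illustration):

```python
# Illustrative source rows; only "individual" customers should be loaded.
source_customers = [
    {"id": 1, "type": "individual"},
    {"id": 2, "type": "corporate"},
    {"id": 3, "type": "individual"},
]
# Simulated target table after the ETL ran.
target_customers = [{"id": 1}, {"id": 3}]

# The expected count is computed from the input at runtime,
# never hard-coded in advance.
expected = sum(1 for c in source_customers if c["type"] == "individual")
actual = len(target_customers)
assert expected == actual, f"expected {expected} rows, target has {actual}"
print("row counts reconcile:", expected)  # row counts reconcile: 2
```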
The test involves checking for nulls, formats, reference values, duplicates, etc.
For example,
– Make sure the row count between the source and the target table matches.
– Compare all the customer data in the source and the target table to ensure that the ETL
loaded the data in the destination table as per the mapping rules.
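A minimal sketch of the null, format, and duplicate checks mentioned above, over illustrative customer rows (the column names and the email pattern are assumptions):

```python
import re

# Illustrative extracted rows; column names are assumptions.
rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": "b@example.com"},
    {"id": 2, "email": None},  # duplicate id and a null email
]

# Null check: every customer must have an email.
nulls = [r["id"] for r in rows if r["email"] is None]

# Format check: non-null emails must look like an address.
bad_format = [r["id"] for r in rows
              if r["email"] is not None
              and not re.match(r"[^@]+@[^@]+\.[^@]+$", r["email"])]

# Duplicate check: the id column must be unique.
seen, dups = set(), []
for r in rows:
    if r["id"] in seen:
        dups.append(r["id"])
    seen.add(r["id"])

print(nulls, bad_format, dups)  # [2] [] [2]
```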
3. ETL Data Transformation Testing
Data Transformation Tests ensure that every row has been transformed successfully
based on the mapping document. Testing data transformations involves reconciling
the data between source and destination to verify that the ETL is transforming the
data as expected. For example,
– Make sure the data is loaded in the correct target column.
– Make sure the ETL is calculating the values correctly.
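A sketch of such a calculation check, assuming a hypothetical mapping rule (not from the article) that converts a source amount in cents to a target amount in dollars:

```python
# Hypothetical mapping rule for illustration:
# target.amount_usd = source.amount_cents / 100.
source = {101: 2500, 102: 999}    # order_id -> amount in cents (source)
target = {101: 25.00, 102: 9.99}  # order_id -> amount in dollars (target)

mismatches = []
for order_id, cents in source.items():
    expected = cents / 100
    # Flag rows that are missing or whose calculated value is off.
    if order_id not in target or abs(target[order_id] - expected) > 0.005:
        mismatches.append(order_id)

print("calculation mismatches:", mismatches)  # []
```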
4. ETL Data Validation Testing
The data is validated only after the ETL loads it. Data Validation Tests are used
to validate a single data source, be it a database table, a data extract, a dimension table,
or a fact table. Check some of the examples below:
In the case above, the gender table has M, F, and Others. The ETL testing involves
reconciling so that the Gender attribute in the customer table will only have one
of those three values.
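The gender reconciliation above can be sketched as a simple domain check (the row structures are illustrative):

```python
# Reference values from the gender table in the example above.
gender_reference = {"M", "F", "Others"}

# Illustrative customer rows; the third violates the rule.
customers = [
    {"id": 1, "gender": "M"},
    {"id": 2, "gender": "F"},
    {"id": 3, "gender": "Unknown"},
]

violations = [c["id"] for c in customers
              if c["gender"] not in gender_reference]
print("invalid gender rows:", violations)  # [3]
```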
5. ETL Integration Testing
ETL integration testing involves the creation of multiple ETL testing rules to verify that the
data integration is done correctly. This is necessary because even though there might be
only one ETL process that integrates the data, it nevertheless contains multiple business
rules for data transformation. ETL testing must ensure that each of those integration
rules is implemented correctly. This testing includes all the above types of testing.
6. ETL Performance Testing
ETL performance metrics are usually measured in the number of rows processed per
second.
To measure performance, three metrics are needed: the ETL process start time, the ETL
process end time, and the number of records processed. The sources for these
metrics are:
– A special ETL log table which captures all the ETL process execution stats.
– Some metrics are derived from the target table using row-level logging
attributes such as record insert datetime and record update datetime.
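Given those three metrics, the rows-per-second figure can be derived as follows (the log-table field names are assumptions):

```python
from datetime import datetime

# Illustrative log entry; the field names are assumptions about the log table.
log_entry = {
    "process": "load_customers",
    "start_time": datetime(2024, 1, 1, 2, 0, 0),
    "end_time": datetime(2024, 1, 1, 2, 5, 0),
    "rows_processed": 1_500_000,
}

elapsed = (log_entry["end_time"] - log_entry["start_time"]).total_seconds()
rows_per_second = log_entry["rows_processed"] / elapsed
print(f"{rows_per_second:.0f} rows/sec")  # 5000 rows/sec
```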
7. ETL Functional Testing
Consider processes that load orders and shipments data from a source. The tester
can compare the source orders with the target orders and the source shipments with the
target shipments, and the data is a perfect match. While this is technically correct and the
test is a success, it could still be a failure from a business rules or functional point of view.
Here are a few of the scenarios:
If there are more shipments than orders, that means shipments were made without
orders. It is also possible that the shipment quantity and the order quantity, or the
prices, don’t match between them.
These kinds of issues are tested by reconciling orders and shipments, and they fall under
ETL functional testing.
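These functional reconciliation rules can be sketched as follows (the data shapes and quantities are illustrative):

```python
# Illustrative data: order_id -> quantity.
orders = {1001: 10, 1002: 5}
shipments = {1001: 10, 1003: 2}  # 1003 has no matching order

# Functional rule 1: every shipment must trace back to an order.
shipments_without_orders = [oid for oid in shipments if oid not in orders]

# Functional rule 2: shipped quantity must not exceed ordered quantity.
over_shipped = [oid for oid, qty in shipments.items()
                if oid in orders and qty > orders[oid]]

print("shipments without orders:", shipments_without_orders)  # [1003]
print("over-shipped orders:", over_shipped)  # []
```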
The ETL testing scenarios repeat in multiple situations regardless of the type of data
being processed.
Group Sums: Compare aggregated sums, grouped by key attributes, between the
source and the target.
ETL Performance Test: Ensures that the ETL processing time stays within an
acceptable timeframe by checking the run logs.