[go: up one dir, main page]

0% found this document useful (0 votes)
13 views15 pages

Lecture-9 Extraction Transformation Loading

Uploaded by

samkh866n
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
13 views15 pages

Lecture-9 Extraction Transformation Loading

Uploaded by

samkh866n
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PPTX, PDF, TXT or read online on Scribd
You are on page 1/ 15

ETL PROCESS

Lecture # 9
Instructor: Mr. Sharjeel Ahmed
ETL PROCESS
Introduction to ETL
The User want to compare and analyze sales by store, by
product, and by month?
• The sale figures are available in several sales applications in the
company.
• Product information is available in a master file.
• Each sales transaction refers to a specific store where it occurred.

Before information is ready for analysis, what actions need to performed?


Introduction to ETL
• What constitutes the difference between the data in the source
operational systems and the information in the data warehouse?

• ETL –
• Extraction, Transformation & Loading
Process of Data Extraction
Source Identification → Identify source applications and source structures.
Method of Extraction → Define the extraction process for each data source
Manual/ Automatic
Immediate or Deferred
Extraction frequency → Establish how frequently the data extraction must be
done i.e. daily, weekly, quarterly.
Time Window → Time Window for extraction process.
Job sequencing → Whether the beginning of one job has to wait until the
previous job has finished successfully.
Exception handling → Determine how to handle input records that cannot be
extracted.
Source
Identification

Identify source
applications and
source structures.
Challenges of Source Systems
ETL Functions are challenging primarily because of the nature of the
source systems.
• Sour systems are very diverse and disparate.

• Source systems on multiple platforms and different operating systems.


• Computing platforms, operating systems, data is management systems, network
protocols, and source legacy systems.

• Older legacy applications running on obsolete database techniques.

• Quality of data is dubious in many old source systems that have


evolved overtime.
• Source system structures keep changing over time because of new business
conditions. ETL functions must also be modified accordingly
Challenges of Source Systems (Cont. )
• Generally, historical data on changes in value are not preserved in
source operational systems.

• Same data is likely to be represented differently in the various source


systems.
• For example, data on salary may be represented as monthly salary, weekly salary,
and bi-monthly salary in different source payroll systems.

• Even when inconsistent data is detected among disparate source


systems, lack of a means, for resolving mismatches escalates the
problem of inconsistency.
Considerations in Extraction
• The metadata on the source systems must contain information on
every database and every data structure that are needed from the
source systems.

• Detailed information, including database size and volatility of the data.

• Schedule jobs for the time window during each day when you can
extract data without impacting the usage of the operational systems.

• Mechanism for capturing the changes to data in each of the relevant


source systems.
Data Transformation
Basic Tasks
• Selection
• Splitting/Joining
• Conversion
• Summarization
• Enrichment
Considerations in Transformation
• Reformat internal data structures

• Re-sequence data

• Apply various form of conversion techniques

• Supply default values wherever values are missing

• Design aggregations that are needed for performance improvement


Data Loading
Initial Load
Populating all the data warehouse tables for the very first time

Incremental Load
Applying ongoing changes as necessary in a periodic manner

Full Refresh
Completely erasing the contents of one or more tables and
reloading with fresh data
Considerations in Loading
• Initial loading can populate millions of rows in the data warehouse.
• Creating and managing load images for such large numbers are not easy
tasks.
• Even more difficult is the task of testing and applying the load images to
actually populate the physical files in the data warehouse.
• May take up to few weeks to complete the initial physical loading.
• For extracting and applying the ongoing incremental changes.
• Finding a time window to apply the changes to the data warehouse can be
tricky if your data warehouse cannot suffer long downtimes.
ETL Summary
• Determine the best method to capture the ongoing changes from each
source system.

• Execute the capture without impacting the source systems.

• Schedule incremental loads without impacting the usage of ware


house by the users.
Major Steps
for ETL
Process

You might also like