Data Engineering - Curriculum
1
2 Should be able to write basic queries following best practices
3
4 Should be able to understand the need for a data warehouse (DW)
5 Should understand different ETL design patterns
6 Should be able to navigate and perform the basic tasks needed to execute scripts
7 Should be able to analyze and understand existing shell scripts and develop simple shell scripts
8 Should be able to understand the concept of NoSQL
9 Should have very good knowledge and understanding of Pyspark
11 Should have very good knowledge and understanding of Data Preprocessing in Python
12 Should be able to understand the concept of Cloud fundamentals
13 Should be able to understand the concept of Azure Data Factory
14 Should be able to understand the concept of AWS
15 Should be able to understand the concept of Data Analytics & Data Science techniques and tools
16 Should be able to understand the concept of Machine learning algorithms & Predictive models
Learning Outcome    Level (Awareness, Skill and Knowledge)
Should be able to understand database operations / various database manipulations (DML, DDL, DQL, DCL, TCL) using SQL. Should be able to query using joins and subqueries    Skill level
Should have a clear understanding of the Spark execution model, Structured API, and DataFrames    Skill level
Should have a clear understanding of data cleaning and transformations of numerical features    Skill level
Should be able to describe the concepts in Cloud fundamentals    Knowledge
Should be able to describe the concepts in Azure SQL, Azure Blob storage, Azure Data Factory, Azure Synapse Analytics    Knowledge
Should be able to describe the concepts in AWS S3, AWS Glue    Knowledge
Should be able to describe the concepts in Data Analytics & Data Science techniques and tools    Knowledge
Should be able to describe the concepts in Machine learning algorithms & Predictive models    Knowledge
Sub Track
SQL
Data Warehouse
ETL - Basics
NoSQL
Pyspark, Spark
53.75
10.75
Understanding ANSI SQL : Table of Contents
Module Name: Understanding ANSI SQL
Coverage of Each Module
Topic #    Topic Name    Learning Objective #
1 Understanding SQL
1
2
4 SQL Operators
1
2
3
4
5
6
5 SQL Functions
1
2
3
4
5
6
7
6 Clauses in SQL
1
2
3
4
5
8 Sub-queries
1
2
3
4
5
6
7
8
9
10
11
12
13
Data Integrity
Integrity Constraints
Entity integrity
PRIMARY KEY Constraint
Sequence generators
Referential Integrity
FOREIGN KEY Constraint
Domain Integrity
NOT NULL Constraint
UNIQUE KEY Constraint
CHECK Constraint
User Defined Integrity
Enabling and Disabling Constraints
Case Study
Estimated Time Duration for this Topic 60
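The integrity constraints listed in this topic can be tried out with a small script. The following is a minimal sketch, assuming SQLite through Python's standard `sqlite3` module as a stand-in for whatever database the course uses; the `department` and `employee` tables and the function name `demo_constraints` are hypothetical. SQLite has no statement for enabling or disabling individual constraints, so that item is not shown.

```python
import sqlite3


def demo_constraints():
    """Create two small tables and report which bad inserts are rejected."""
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces FOREIGN KEY constraints when this pragma is on.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        """
        CREATE TABLE department (
            dept_id INTEGER PRIMARY KEY,          -- entity integrity
            name    TEXT NOT NULL UNIQUE          -- domain integrity
        )
        """
    )
    conn.execute(
        """
        CREATE TABLE employee (
            emp_id  INTEGER PRIMARY KEY,
            name    TEXT NOT NULL,
            salary  REAL CHECK (salary > 0),
            dept_id INTEGER NOT NULL
                    REFERENCES department(dept_id)  -- referential integrity
        )
        """
    )
    conn.execute("INSERT INTO department VALUES (1, 'Engineering')")
    conn.execute("INSERT INTO employee VALUES (1, 'Asha', 50000, 1)")

    # Each statement below breaks exactly one constraint.
    bad_statements = {
        "primary_key": "INSERT INTO employee VALUES (1, 'Ben', 40000, 1)",
        "not_null": "INSERT INTO employee VALUES (2, NULL, 40000, 1)",
        "unique": "INSERT INTO department VALUES (2, 'Engineering')",
        "check": "INSERT INTO employee VALUES (3, 'Cy', -5, 1)",
        "foreign_key": "INSERT INTO employee VALUES (4, 'Dee', 40000, 99)",
    }
    rejected = {}
    for label, statement in bad_statements.items():
        try:
            conn.execute(statement)
            rejected[label] = False
        except sqlite3.IntegrityError:
            rejected[label] = True
    conn.close()
    return rejected


if __name__ == "__main__":
    print(demo_constraints())
```

Each entry in the returned dictionary is `True` when the database refused the offending row, which is the behaviour the constraint is meant to guarantee.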
Group By Clause
Having Clause
Order By Clause
Order of Execution of Clauses in SELECT Statement
Case study
Estimated Time Duration for this Topic 30
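The clauses covered in this topic, and the logical order in which a SELECT statement evaluates them, can be illustrated as follows. This is a sketch only, assuming SQLite via Python's `sqlite3` module; the `sales` table, its rows, and the function name `top_departments` are made up for illustration. The numbered SQL comments give the standard logical evaluation order, which differs from the order in which the clauses are written.

```python
import sqlite3


def top_departments():
    """Return departments whose positive sales total at least 10, largest first."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (dept TEXT, amount INTEGER)")
    conn.executemany(
        "INSERT INTO sales VALUES (?, ?)",
        [
            ("books", 10), ("books", 30),
            ("toys", 5), ("toys", 4),
            ("games", 50), ("games", -1),
        ],
    )
    rows = conn.execute(
        """
        SELECT dept, SUM(amount) AS total   -- 5. project the output columns
        FROM sales                          -- 1. pick the source rows
        WHERE amount > 0                    -- 2. filter individual rows
        GROUP BY dept                       -- 3. form groups
        HAVING SUM(amount) >= 10            -- 4. filter whole groups
        ORDER BY total DESC                 -- 6. sort the final result
        """
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(top_departments())  # [('games', 50), ('books', 40)]
```

WHERE removes the negative `games` row before grouping, while HAVING removes the `toys` group afterwards because its total of 9 is below the threshold.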
Understanding Subqueries
Advantages of subqueries
Rules of subqueries
Using Subqueries With SELECT, INSERT, UPDATE, DELETE
Subqueries Types
Scalar Subquery
Single Row Subquery
Multiple Row Subquery
Usage of IN, NOT IN, ALL, ANY, and SOME
Correlated Subqueries
Usage of EXISTS, NOT EXISTS
Difference between Correlated & Non-Correlated Subquery
Case study
Estimated Time Duration for this Topic 60
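Three of the subquery types named above (scalar, multiple row with IN, and correlated with NOT EXISTS) can be compared side by side. The sketch below assumes SQLite via Python's `sqlite3` module; the `customer` and `orders` tables and the function name `subquery_examples` are hypothetical. SQLite does not implement the ALL, ANY, and SOME operators, so those are not shown here.

```python
import sqlite3


def subquery_examples():
    """Run one scalar, one multiple-row, and one correlated subquery."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY,
                             customer_id INTEGER, amount INTEGER);
        INSERT INTO customer VALUES (1, 'Asha'), (2, 'Ben'), (3, 'Cy');
        INSERT INTO orders VALUES (1, 1, 100), (2, 1, 300), (3, 2, 50);
        """
    )
    # Scalar subquery: the inner query returns one value (the average, 150).
    above_average = conn.execute(
        "SELECT id FROM orders "
        "WHERE amount > (SELECT AVG(amount) FROM orders) ORDER BY id"
    ).fetchall()
    # Multiple-row subquery: the inner query returns a set used with IN.
    with_orders = conn.execute(
        "SELECT name FROM customer "
        "WHERE id IN (SELECT customer_id FROM orders) ORDER BY name"
    ).fetchall()
    # Correlated subquery: the inner query refers to the outer row (c.id),
    # so it is evaluated once per customer.
    without_orders = conn.execute(
        "SELECT name FROM customer c "
        "WHERE NOT EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id) "
        "ORDER BY name"
    ).fetchall()
    conn.close()
    return {
        "above_average": above_average,
        "with_orders": with_orders,
        "without_orders": without_orders,
    }


if __name__ == "__main__":
    print(subquery_examples())
```

The first two inner queries can run on their own (non-correlated), whereas the NOT EXISTS query cannot, because it depends on a column from the outer query.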
Database Objects
What is a View?
Advantages of Views
Inline View
What is an Index?
Index Architecture : Non-clustered & Clustered
Unique Index
Case study
Estimated Time Duration for this Topic 40
Total Duration in Mins 520
Total Duration in Hours 9
Estimated Duration In Mins for Hands-on    Estimated Duration In Mins Total
0 0
0 15
0 15
0 30
0
0 60
120 120
120 180
0
0 60
180 180
180 240
0
0 60
240 240
240 300
0
0 60
0 60
240 240
240 300
0
0 30
240 240
240 270
0
0 120
480 480
480 600
0
0 60
320 320
320 380
0
0 40
0 40
60 60
60 100
1880 2400
31 40
DBMS & Data Modeling : Table of Contents
Module Name: DBMS & Data Modeling
Coverage of Each Module
Topic #    Topic Name    Learning Objective #
2 DBMS Architecture
1
2
3
3 Types of Databases
1
2
9 Demo on ErwinTool
1
2
11 Requirement Analysis
1
2
3
structure of data 10
process of data access in the various data models 20
Estimated Time Duration for this Topic 30
OLTP 10
Dimensional Modeling 10
Estimated Time Duration for this Topic 0 20
Conceptual Modeling 20
Logical Modeling
Physical Modeling
Estimated Time Duration for this Topic 0 20
Entity 20
Attribute
Relationship
Notation
Keys - PK, FK, AK, etc.
Estimated Time Duration for this Topic 0 20
Creating Entities, Attributes 25
Creating different types of relationships 25
Estimated Time Duration for this Topic 50
Steps for logical to physical data model conversion 25
Physical Model - Primary Keys & Constraints 25
Estimated Time Duration for this Topic 50
Why Normalization? 10
Normalization Forms - First Normal Form (1NF) 10
Second Normal Form (2NF) 10
Third Normal Form (3NF) 10
Boyce-Codd Normal Form (BCNF) 10
Why do we need to de-normalize? 10
Pros & Cons of de-normalization
Estimated Time Duration for this Topic 0 60
10
10
15
15
10
60
10
5
15
30
10
20
30
10
10
20
10
10
20
20
0
0
20
20
0
0
0
0
20
25
25
50
25
25
50
10
0
0
10
10
10
10
10
10
10
0
60
20
10
20
50
420
7
NoSQL : Table of Contents
Coverage of Each Module
30 30
30 30
120 120
30 30
180 180
30 30
30 30
60 60
30 30
540 0 540
9
DW Basics : Table of Contents
Module Name: DW Basics
Coverage of Each Module
Topic #    Topic Name    Learning Objective #
3 Data Marts
1
2
3
4
5
6
7
0 20
0 10
0 20
0 10
0 60
0 10
0 15
0 25
0 10
0 60
0 10
0 5
0 5
0 20
0 5
0 5
0 10
0 60
0 10
0 5
0 5
0 5
0 5
0 15
0 10
0 5
0 60
0 20
0 20
0 20
0 60
60 60
60 60
60 360
1 6
ETL Concepts : Table of Contents
Module Name: ETL CONCEPTS
1
Introduction to ETL Concepts
1
2
3
4
5
6
7
What is ETL 20 20
ETL Architecture 20 20
Transformation Options 20 20
ETL Standards 20 20
ETL and metadata 20 20
FACT and Dimension Tables 20 20
SCD I/II/III 30 30
Estimated Time Duration for this Topic 150 0 150
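Of the slowly changing dimension (SCD) types listed in this topic, Type II is the one that preserves history by closing the current row and adding a new one. The following is a minimal in-memory sketch of that idea, not a production ETL pattern; the function name `apply_scd2`, the column names (`valid_from`, `valid_to`, `is_current`), and the sample customer data are all assumptions made for illustration.

```python
from datetime import date


def apply_scd2(dimension, incoming, load_date):
    """Apply one incoming record to a dimension table using SCD Type II.

    dimension: list of dict rows (mutated in place and returned)
    incoming:  dict with 'customer_id' and the tracked attribute 'city'
    load_date: date on which the change takes effect
    """
    # Find the active version of this business key, if any.
    current = next(
        (
            row
            for row in dimension
            if row["customer_id"] == incoming["customer_id"] and row["is_current"]
        ),
        None,
    )
    if current is not None:
        if current["city"] == incoming["city"]:
            return dimension  # nothing changed, keep history as it is
        # Expire the old version instead of overwriting it (Type I would overwrite).
        current["valid_to"] = load_date
        current["is_current"] = False
    # Insert the new version as the active row.
    dimension.append(
        {
            "customer_id": incoming["customer_id"],
            "city": incoming["city"],
            "valid_from": load_date,
            "valid_to": None,
            "is_current": True,
        }
    )
    return dimension


dim = [
    {
        "customer_id": 1,
        "city": "Pune",
        "valid_from": date(2023, 1, 1),
        "valid_to": None,
        "is_current": True,
    }
]
apply_scd2(dim, {"customer_id": 1, "city": "Chennai"}, date(2024, 6, 1))
```

After the call, `dim` holds two rows for customer 1: the expired Pune row and the current Chennai row. Applying the same incoming record again leaves the table unchanged.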
1
Introduction to Unix and Basic Concepts
1
2
3
4
60
60
90
90
300
60
30
150
120
360
90
180
120
390
1050
17.5
Pyspark : Table of Contents
Module Name: Pyspark
Coverage of Each Module
Topic #    Topic Name    Learning Objective #
1 Spark
1
2
3
4
5
6
7
8
9
10
Learning Objective for the Topics    Estimated Duration In Mins for Theory
Introduction to Spark
Transformations, Actions, RDD, DataSet
Key Value Methods and Caching Data
120
Distribution and Parallelism
Spark Streaming
Optimization
Data Exploration and Analysis
Transforming and Cleaning Unstructured Data
120
Summarizing Data Along Dimensions
Broadcasting and Accumulator
Estimated Time Duration for this Topic 240
Introduction
Querying Data with the DataFrames
Improving Type Safety with Datasets
Processing Data with the Streaming API
Optimizing, Structured Streaming, and Spark 2.x
Estimated Time Duration for this Topic 240
180 300
180 300
360 600
360 600
360 600
720 1200
12 20
Spark : Table of Contents
Module Name: Spark
Coverage of Each Module
Topic #    Topic Name    Learning Objective #    Learning Objective for the Topics    Estimated Duration In Minutes for Theory
1 Spark Programming
1 Introduction to Spark 60
2 Why do we need Spark 60
3 Installing and using Apache Spark 240
4 Spark execution model and architecture 240
5 Spark programming model 240
6 Structured API foundation 300
7 Data sources and sinks 300
8 Dataframe and dataset transformations 300
9 Aggregations in Spark 300
10 Dataframe joins 300
11 Alternatives for Spark 60
Estimated Time Duration for this Topic 2400
60
60
240
240
240
300
300
300
300
300
60
0 2400
0 2400
0 40
Data Preprocessing in Python : Table of Contents
Module Name: Data Preprocessing in Python
Coverage of Each Module
Topic #    Topic Name    Learning Objective #    Learning Objective for the Topics    Estimated Duration In Minutes for Theory
1 Data Preprocessing in Python
1 Data Cleaning 60
2 Encoding of the categorical features 45
3 Transformations of the numerical features 45
4 Pipelines 30
5 Scaling 30
6 Principal Component Analysis 30
7 Filter-based feature selection 60
8 A complete pipeline 30
9 Oversampling 30
Estimated Time Duration for this Topic 360
60
45
45
30
30
30
60
30
30
900 1260
900 1260
15 21
Cloud Fundamentals : Table of Contents
2 S3
1 Core concepts of object store
2 S3-storage class, Lifecycle, replication
Estimated Time Duration for this Topic
20 20
40 40
30 30
60 60
60 60
60 60
30 30
30 30
30 30
60 60
420 0 420
Table of Contents
30 30
60 60
60 60
60 60
210 0 210
4 4
4 0 4
634 0 634
10.5666666666667 0 10.5666666666667
ADF, ADLS: Table of Contents
Module Name: ADF, ADLS
Coverage of Each Module
Topic #    Topic Name    Learning Objective #
1 Introduction - Understanding Core Data Concepts
1
2
3
4
5
6
7
8
23
24
4
5
6
7
8
9
10
11
12
13
14
Introduction
Data - A simple definition
Introduction to Structured data
Introduction to Non Relational Data
Introduction to Data Ingestion
Introduction to Data Processing
Batch Processing vs Stream Processing
Introduction to Data Analytics
Estimated Time Duration for this Topic 120
Section Intro
Create Datasets
Create Pipeline and Activities
Create Mapping Data Flow and Adding Sources
Mapping Data Flow - Joining Sources
Mapping Data Flow - Aggregate Data
Mapping Data Flow Execution
Mapping Data Flow and Apache Spark Execution
Estimated Time Duration for this Topic 150
Introduction
Cost Warning - Data Pipeline Pricing
Azure SQL - Contained Users
Azure Key Vault - Store SQL Server Secrets
Azure Key Vault - Linked Service
Create Azure Storage Account
Azure Managed Identity - Create a Linked Service To Azure Blob Storage
Azure Role Based Access Control - Grant Access To Managed Identity
Create a Dataset for the Lookup Activity
Azure Data Factory - Lookup Activity
Azure Data Factory - ForEach Activity & Pipeline Expressions
Azure Data Factory - ForEach Activity - Part II
Parameterize a Dataset Part I - Container Name
Parameterize a Dataset Part II - Directory Name
Parameterize a Dataset Part III - File Name
Mapping Data Flow - JSON Source
Mapping Data Flow - Parquet Source
Mapping Data Flow - JOIN & Derived Column Transformations
Mapping Data Flow - Aggregate Transformation
Mapping Data Flow - Parameterized CSV File Sink
Azure Data Factory - Store SAS In Azure Key Vault
Azure Data Factory - Copy Activity Merge Behaviour
Azure Data Factory - End To End Pipeline Execution
Azure Data Factory - Storage Event Triggers
Estimated Time Duration for this Topic 240
Capstone Project
Estimated Time Duration for this Topic 0
Total Time Duration 1200
Total Time Duration (In Hours) 20
Total Estimated Duration In Mins
120
120
120
150
120
120
150
150
240
240
180
180
120
120
120
120
120
0
0
1200
20
Data Analytics & Data Science - Introduction : Table of Contents
Coverage of Each Module
30 30
10 10
10 10
10 10
10 10
15 15
10 10
10 10
75 75
180 0 180
3
Machine learning algorithms & Predictive models : Table of Contents
Learning Objective for the Topics    Estimated Duration In Mins for Theory
Introduction 30
Software used in this course: R-Studio and Introduction to R 30
R Crash Course - get started with R-programming in R-Studio 60
Fundamentals of predictive modelling with Machine Learning: Theory 90
Unsupervised Machine Learning and Cluster Analysis in R 90
Supervised Machine Learning in R: Classification in R 60
Supervised Machine Learning in R: Linear Regression Analysis 60
More types of regression models in R 60
Working With Non-Parametric and Non-Linear Data (Supervised Machine Learning) 60
Estimated Time Duration for this Topic 540
30
30
60
90
90
60
60
60
60
0 540