BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI
WORK INTEGRATED LEARNING PROGRAMMES
Part A: Course Design
Course Title Data Management for Machine Learning
Course No(s) DSE* ZG529 / AIML* ZG529
Credit Units 4
Content Authors Pravin Y Pawar
Version 1.1
Course Description
Data Models and Query Languages: Relational, Object-Relational, NoSQL data models; Declarative (SQL)
and Imperative (MapReduce) Querying; Data Encoding: Evolution, Formats, Models of dataflow; Machine
learning workflow; Data management challenges in ML workflow; Data Pipelines and patterns; Data Pipeline
Stages: Data extraction, ingestion, cleaning, wrangling, versioning, transformation, exploration, feature
management; Modern Data Infrastructure: Diverse data sources, Cloud data warehouses and lakes, Data
Ingestion tools, Data transformation and modelling tools, Workflow orchestration platforms; ML model
metadata and Registry, ML Observability, Data privacy and anonymity.
Course Objectives
The course aims at providing:
CO1 An introduction to the data models, storage systems and query languages used in data management,
with emphasis on machine learning aspects
CO2 Guidance on the architecture of modern data platforms and on the types and usage of data pipelines
CO3 Hands-on exposure to the common techniques and tools used by data engineers to build,
test, deploy and automate machine learning pipelines
CO4 Exposure to the industry best practices essential for dealing with data privacy, metadata and
observability
Text Book(s)
T1 Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Reis and Housley
T2 Reliable Machine Learning by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd
Underwood
Reference Book(s) & other resources
R1 Designing Data-Intensive Applications by Martin Kleppmann
R2 Data Pipelines Pocket Reference by Densmore
R3 Building Machine Learning Pipelines by Hapke, Nelson
Learning Outcomes:
Students will be able to:
LO1 Understand the necessity, position and role of the data management components that appear in
modern data stacks
LO2 Recognize the patterns, challenges and possible solutions associated with data ingestion,
flow, storage and processing on data platforms
LO3 Gain experience in designing and handling the dataflow in a machine learning pipeline by
means of state-of-the-art tools
LO4 Apply the acquired conceptual data management knowledge and practices to a real-world
machine learning workflow, addressing the model metadata, privacy and monitoring aspects
Part B: Course Handout
Academic Term II Semester 2022-2023
Course Title Data Management for Machine Learning
Course No DSE* ZG529 / AIML* ZG529
Lead Instructor Pravin Y Pawar
Glossary of Terms
Module            M    A module is a standalone quantum of designed content. A typical course is
                       delivered using a string of modules. M2 means module 2.
Contact Hour      CH   Contact Hour (CH) stands for an hour-long live session with students,
                       conducted either in a physical classroom or enabled through technology.
                       In this model of instruction, instructor-led sessions will be for 32 CH.
Recorded Lecture  RL   RL stands for Recorded Lecture or Recorded Lesson. It is presented to the
                       student through an online portal. A given RL unfolds as a sequence of
                       video segments interleaved with exercises.
Lab Exercises     LE   Lab exercises associated with various modules
Self-Study        SS   Specific content assigned for self-study
Homework          HW   Specific problems/design/lab exercises assigned as homework
Modular Structure
Module Summary
No. Content of the Module
M1 Foundations of data management
M2 Modern Data Platform
M3 Data Management in ML Workflow
M4 Advanced Topics in Data Management
Detailed Structure
M1: Foundations of data management
Contact Session 1-2
Session   Type        Description/Plan                             Reference
1         CH1-CH2     Data Management Principles                   T2
                      Data Management Components
2         CH3-CH4     Data Models and Query Languages              R1
                      Data Encoding                                T1
Post CS   LE          Lab 1
M2: Modern Data Platform
Contact Session 3-4
Session   Type        Description/Plan                             Reference
3         CH5-CH6     Data Architectures                           T1
                      Modern Data Stack
                      Data Pipelines and patterns
4         CH7-CH8     Data Storage                                 T1
                      Data Science Infrastructure
                      Serving Data for Analytics and ML
Post CS   LE          Lab 2
M3: Data Management in ML Workflow
Contact Session 5-12
Session   Type        Description/Plan                             Reference
5         CH9-CH10    ML Workflow/lifecycle                        T2, R3
                      Data Pipeline vs ML Pipeline
                      ML Pipeline Stages
                      Training / Serving pipeline
                      Data management challenges in ML workflow
6         CH11-CH12   Data Collection / Ingestion                  T1
                      Diverse data sources
                      Data generation in source systems
                      Batch Ingestion
                      Message and Stream Ingestion
                      Ingestion strategies
7         CH13-CH14   Data Validation                              R3
                      Common problems with data
                      Data skew and drift
                      Bias and Fairness
                      Data leakage
                      Data validation approaches
8         CH15-CH16   Analytics Engineering                        Instructor-supplied material
                      Data Integration
                      Data Transformation
                      Data Partitioning
                      Data Versioning
                      Test data management challenges
9         CH17-CH18   Data Analysis                                Instructor-supplied material
                      Types of Analytics
                      Data Exploration and Visualizations
                      Data Cubes and OLAP
                      Data Cube Operations
                      Data Cubes and ML
10        CH19-CH20   Feature Preparations                         T2
                      Feature life cycle
                      Data Annotation / labeling
                      Data augmentation and Data Synthesis
                      Common Feature Engineering Operations
                      Feature Importance
                      Feature Generalization
                      Feature Stores
11-12     CH21-CH24   ML Experimentation & Metadata                Instructor-supplied material
                      Model training & experimentation
                      Model Analysis & Validation
                      ML Metadata Store
                      Dataset, Feature, Label, Pipeline metadata
                      ML Experiment Tracking data
                      ML model metadata and Registry
Post CS   LE          Lab 3, 4
M4: Advanced Topics in Data Management
Contact Session 13-16
Session   Type        Description/Plan                             Reference
13        CH25-CH26   Distributed Data Processing
                      Big Data Analytics
                      Technologies for big data processing
                      Distributed and Parallel data processing
                      In-memory data processing
                      Hadoop, Spark, Kafka as exemplar architecture
14        CH27-CH28   Data Privacy and anonymity                   T2, Instructor-supplied material
                      Data privacy issues
                      Differential privacy
                      Anonymization
                      Methods to preserve privacy
                      Federated learning
                      Encrypted ML
15        CH29-CH30   Data Observability                           T2, Instructor-supplied material
                      Data Observability
                      Data downtime
                      Five pillars
                      Tools selection
16        CH31-CH32   ML Monitoring & Observability                T2, Instructor-supplied material
                      Causes of ML System failure
                      Data Distribution Shifts
                      Problems with ML Production Monitoring
                      ML-specific metric
                      Monitoring Toolbox
                      Monitoring vs Observability
Post CS   SS          To be identified
Experiential Learning Component
Lab  Topic
1    Design and implement simple data flows involving various data formats           Virtual Labs
     Modes of data flows
     a) Through Databases – use SQL / Custom Program to read/write into databases
     b) Through REST/RPC – Synchronous mechanism for data exchange
     c) Through Message Brokers / Queues – Asynchronous mechanism for data exchange
2    Build a Modern Data Stack                                                       Virtual Labs
     Components
     a) a fully managed ELT data pipeline
     b) a cloud-based columnar warehouse or data lake as a destination
     c) a data transformation tool
     d) a business intelligence or data visualization platform
3    Manage Machine Learning Model Metadata using MLflow / Neptune                   Virtual Labs
     Components
     a) Projects
     b) Experiments
     c) Model metadata
     d) Model tracking / logging
     e) Model Registry
4    Construct a Machine Learning Pipeline with Data Versioning Tool                 Virtual Labs
     Components
     a) Data Pipeline
     b) Data Versioning Tool
     c) Feature Store
     d) ML Pipeline
     e) Prediction Service
Evaluation Scheme:
Legend: EC = Evaluation Component; AN = Afternoon Session; FN = Forenoon Session
No    Name                                        Type         Duration  Weight  Day, Date, Session, Time
EC-1  Experiential learning Assignment / Quiz-I   Take Home    15 days   10%     TBA
      Experiential learning Assignment-II         Take Home    15 days   20%     TBA
EC-2  Mid-Semester Test                           Closed Book  2 hours   30%     Per programme schedule
EC-3  Comprehensive Exam                          Open Book    3 hours   40%     Per programme schedule
Syllabus for Mid-Semester Test (Closed Book): Topics in Session Nos. 1 to 7
Syllabus for Comprehensive Exam (Open Book): All topics (Session Nos. 1 to 16)
Important links and information:
Elearn portal: https://elearn.bits-pilani.ac.in
Students are expected to visit the Elearn portal on a regular basis and stay up to date with the latest
announcements and deadlines.
Contact sessions: Students should attend the online lectures as per the schedule provided on the Elearn portal.
Evaluation Guidelines:
1. EC-1 consists of two assignments. Announcements will be made available on the portal in a timely
manner.
2. For Closed Book tests: No books or reference material of any kind will be permitted.
3. For Open Book exams: Use of books and any printed / written reference material (filed or bound) is
permitted. However, loose sheets of paper will not be allowed. Use of calculators is permitted in all
exams. Laptops/Mobiles of any kind are not allowed. Exchange of any material is not allowed.
4. If a student is unable to appear for the Regular Test/Exam due to genuine exigencies, the student
should follow the procedure to apply for the Make-Up Test/Exam which will be made available on the
Elearn portal. The Make-Up Test/Exam will be conducted only at selected exam centres on the dates to
be announced later.
It shall be the responsibility of the individual student to be regular in maintaining the self-study schedule as
given in the course handout, attend the online lectures, and take all the prescribed evaluation components such
as Assignment/Quiz, Mid-Semester Test and Comprehensive Exam according to the evaluation scheme
provided in the handout.