
Data Management For Machine Learning

The document outlines the course design for 'Data Management for Machine Learning' at Birla Institute of Technology & Science, Pilani, detailing its objectives, content, and evaluation scheme. It covers various data models, querying languages, and modern data infrastructure, while providing hands-on experience with data pipelines and machine learning workflows. The course includes a modular structure with specific learning outcomes and evaluation components, emphasizing the importance of data management in machine learning.



BIRLA INSTITUTE OF TECHNOLOGY & SCIENCE, PILANI

WORK INTEGRATED LEARNING PROGRAMMES

Part A: Course Design

Course Title Data Management for Machine Learning

Course No(s) DSE* ZG529 / AIML* ZG529
Credit Units 4
Content Authors Pravin Y Pawar
Version 1.1

Course Description

Data Models and Query Languages: Relational, Object-Relational, NoSQL data models; Declarative (SQL)
and Imperative (MapReduce) Querying; Data Encoding: Evolution, Formats, Models of dataflow; Machine
learning workflow; Data management challenges in ML workflow; Data Pipelines and patterns; Data Pipeline
Stages: Data extraction, ingestion, cleaning, wrangling, versioning, transformation, exploration, feature
management; Modern Data Infrastructure: Diverse data sources, Cloud data warehouses and lakes, Data
Ingestion tools, Data transformation and modelling tools, Workflow orchestration platforms; ML model
metadata and Registry, ML Observability, Data privacy and anonymity.

Course Objectives

The course aims at providing:

CO1 Introduction to the data models, storage systems and query languages used in data management,
with emphasis on machine learning aspects

CO2 Guidance on the architecture of modern data platforms and on the types and usage of data pipelines

CO3 Hands-on exposure to the common techniques and tools used by data engineers to build,
test, deploy and automate machine learning pipelines

CO4 Exposure to the industry best practices essential for dealing with data privacy, metadata and
observability

Text Book(s)

T1 Fundamentals of Data Engineering: Plan and Build Robust Data Systems by Reis and Housley

T2 Reliable Machine Learning by Cathy Chen, Niall Richard Murphy, Kranti Parisa, D. Sculley, Todd
Underwood

Reference Book(s) & other resources

R1 Designing Data-Intensive Applications by Martin Kleppmann

R2 Data Pipelines Pocket Reference by Densmore

R3 Building Machine Learning Pipelines by Hapke, Nelson

Learning Outcomes:

Students will be able to:

LO1 Understand the necessity, position and role of the data management components in
modern data stacks

LO2 Recognize the patterns, challenges and possible solutions associated with data ingestion,
flow, storage and processing on data platforms

LO3 Gain experience in designing and handling the dataflow in a machine learning pipeline
using state-of-the-art tools

LO4 Apply the acquired conceptual data management knowledge and practices to a real-world
machine learning workflow, addressing the model metadata, privacy and monitoring aspects

Part B: Course Handout

Academic Term II Semester 2022-2023

Course Title Data Management for Machine Learning
Course No DSE* ZG529 / AIML* ZG529
Lead Instructor Pravin Y Pawar

Glossary of Terms

Module (M): Module is a standalone quantum of designed content. A typical course is
delivered using a string of modules. M2 means module 2.

Contact Hour (CH): Contact Hour (CH) stands for an hour-long live session with students,
conducted either in a physical classroom or enabled through technology.
In this model of instruction, instructor-led sessions will be for 32 CH.

Recorded Lecture (RL): RL stands for Recorded Lecture or Recorded Lesson. It is presented to the
student through an online portal. A given RL unfolds as a sequence of
video segments interleaved with exercises.

Lab Exercises (LE): Lab exercises associated with various modules

Self-Study (SS): Specific content assigned for self-study

Homework (HW): Specific problems/design/lab exercises assigned as homework

Modular Structure

Module Summary
No. Content of the Module
M1 Foundations of data management

M2 Modern Data Platform

M3 Data Management in ML Workflow

M4 Advanced Topic in Data Management

Detailed Structure

M1: Foundations of data management

Contact Session 1-2

Session   Type        Description/Plan                              Reference
1         CH1, CH2    • Data Management Principles                  T2
                      • Data Management Components
2         CH3, CH4    • Data Models and Query Languages             R1
                      • Data Encoding                               T1
Post CS   LE          • Lab 1

M2: Modern Data Platform

Contact Session 3-4

Session   Type        Description/Plan                              Reference
3         CH5, CH6    • Data Architectures                          T1
                      • Modern Data Stack
                      • Data Pipelines and patterns
4         CH7, CH8    • Data Storage                                T1
                      • Data Science Infrastructure
                      • Serving Data for Analytics and ML
Post CS   LE          • Lab 2

M3: Data Management in ML Workflow

Contact Session 5-12

Session   Type        Description/Plan                              Reference
5         CH9, CH10   ML Workflow/lifecycle                         T2, R3
                      • Data Pipeline vs ML Pipeline
                      • ML Pipeline Stages
                      • Training / Serving pipeline
                      • Data management challenges in ML workflow
6         CH11, CH12  Data Collection / Ingestion                   T1
                      • Diverse data sources
                      • Data generation in source systems
                      • Batch Ingestion
                      • Message and Stream Ingestion
                      • Ingestion strategies
7         CH13, CH14  Data Validation                               R3
                      • Common problems with data
                      • Data skew and drift
                      • Bias and Fairness
                      • Data leakage
                      • Data validation approaches
8         CH15, CH16  Analytics Engineering                         Instructor-supplied material
                      • Data Integration
                      • Data Transformation
                      • Data Partitioning
                      • Data Versioning
                      • Test data management challenges
9         CH17, CH18  Data Analysis                                 Instructor-supplied material
                      • Types of Analytics
                      • Data Exploration and Visualizations
                      • Data Cubes and OLAP
                      • Data Cube Operations
                      • Data Cubes and ML
10        CH19, CH20  Feature Preparations                          T2
                      • Feature life cycle
                      • Data Annotation / labeling
                      • Data augmentation and Data Synthesis
                      • Common Feature Engineering Operations
                      • Feature Importance
                      • Feature Generalization
                      • Feature Stores
11-12     CH21-CH24   ML Experimentation & Metadata                 Instructor-supplied material
                      • Model training & experimentation
                      • Model Analysis & Validation
                      • ML Metadata Store
                      • Dataset, Feature, Label, Pipeline metadata
                      • ML Experiment Tracking data
                      • ML model metadata and Registry
Post CS   LE          • Lab 3, 4

M4: Advanced Topic in Data Management

Contact Session 13-16

Session   Type        Description/Plan                              Reference
13        CH25, CH26  Distributed Data Processing
                      • Big Data Analytics
                      • Technologies for big data processing
                      • Distributed and Parallel data processing
                      • In-memory data processing
                      • Hadoop, Spark, Kafka as exemplar architecture
14        CH27, CH28  Data Privacy and anonymity                    T2, Instructor-supplied material
                      • Data privacy issues
                      • Differential privacy
                      • Anonymization
                      • Methods to preserve privacy
                      • Federated learning
                      • Encrypted ML
15        CH29, CH30  Data Observability                            T2, Instructor-supplied material
                      • Data Observability
                      • Data downtime
                      • Five pillars
                      • Tools selection
16        CH31, CH32  ML Monitoring & Observability                 T2, Instructor-supplied material
                      • Causes of ML System failure
                      • Data Distribution Shifts
                      • Problems with ML Production Monitoring
                      • ML-specific metrics
                      • Monitoring Toolbox
                      • Monitoring vs Observability
Post CS   SS          • To be identified

Experiential Learning Component

Lab Topic

1 Design and implement simple data flows involving various data formats (Virtual Labs)
Modes of data flows
a) Through Databases – use SQL / Custom Program to read/write into databases
b) Through REST/RPC – Synchronous mechanism for data exchange
c) Through Message Brokers / Queues – Asynchronous mechanism for data exchange

2 Build a Modern Data Stack (Virtual Labs)
Components
a) A fully managed ELT data pipeline
b) A cloud-based columnar warehouse or data lake as a destination
c) A data transformation tool
d) A business intelligence or data visualization platform

3 Manage Machine Learning Model Metadata using MLflow / Neptune (Virtual Labs)
Components
a) Projects
b) Experiments
c) Model metadata
d) Model tracking / logging
e) Model Registry

4 Construct a Machine Learning Pipeline with Data Versioning Tool (Virtual Labs)
Components
a) Data Pipeline
b) Data Versioning Tool
c) Feature Store
d) ML Pipeline
e) Prediction Service
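
As a starting point for Lab 1, the following is a minimal sketch of two of the listed modes of data flow: mode (a), through a database, and mode (c), through a queue. It makes several assumptions that are not part of the handout: the sample records, the table name `samples`, and the function names are hypothetical, an in-memory SQLite database stands in for whatever database the lab uses, and Python's in-process `queue.Queue` stands in for a real message broker. Mode (b), REST/RPC, is left out because it needs a running service.

```python
import json
import queue
import sqlite3
import threading

# Hypothetical sample records; the handout does not prescribe a dataset.
RECORDS = [
    {"id": 1, "feature": 0.5, "label": "a"},
    {"id": 2, "feature": 1.5, "label": "b"},
]


def flow_through_database(records):
    """Mode (a): a writer stores rows with SQL and a reader queries them back."""
    conn = sqlite3.connect(":memory:")  # in-memory stand-in for a real database
    conn.execute(
        "CREATE TABLE samples (id INTEGER PRIMARY KEY, feature REAL, label TEXT)"
    )
    # Named placeholders let each dict supply its own column values.
    conn.executemany(
        "INSERT INTO samples (id, feature, label) VALUES (:id, :feature, :label)",
        records,
    )
    conn.commit()
    rows = conn.execute(
        "SELECT id, feature, label FROM samples ORDER BY id"
    ).fetchall()
    conn.close()
    return [dict(zip(("id", "feature", "label"), row)) for row in rows]


def flow_through_queue(records):
    """Mode (c): a producer thread sends JSON messages that a consumer reads."""
    broker = queue.Queue()  # in-process stand-in for a real message broker
    sentinel = None  # marks the end of the stream

    def producer():
        for record in records:
            broker.put(json.dumps(record))  # encode each record as a JSON message
        broker.put(sentinel)

    received = []
    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        message = broker.get()  # blocks until the producer has sent something
        if message is sentinel:
            break
        received.append(json.loads(message))
    worker.join()
    return received
```

Calling `flow_through_database(RECORDS)` or `flow_through_queue(RECORDS)` returns the records as the reader or consumer sees them, which makes it easy to compare what was sent with what arrived. The consumer does not need to be ready when the producer sends, which is the asynchronous property that mode (c) refers to.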

Evaluation Scheme:

Legend: EC = Evaluation Component; AN = Afternoon Session; FN = Forenoon Session

No    Name                                        Type         Duration  Weight  Day, Date, Session, Time
EC-1  Experiential learning Assignment / Quiz-I   Take Home    15 days   10%     TBA
EC-1  Experiential learning Assignment-II         Take Home    15 days   20%     TBA
EC-2  Mid-Semester Test                           Closed Book  2 hours   30%     Per programme schedule
EC-3  Comprehensive Exam                          Open Book    3 hours   40%     Per programme schedule
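
The weights in the evaluation scheme add up to 100%. The sketch below shows how a total could be worked out from them, assuming that each component is scored out of 100 and that the components are combined as a plain weighted sum. The handout states only the weights, so the scoring scale, the function name `weighted_total`, and the example scores are all hypothetical.

```python
# Weights from the evaluation scheme, in percent of the final total.
WEIGHTS = {
    "EC-1 Assignment / Quiz-I": 10,
    "EC-1 Assignment-II": 20,
    "EC-2 Mid-Semester Test": 30,
    "EC-3 Comprehensive Exam": 40,
}


def weighted_total(scores):
    """Combine per-component scores (each assumed out of 100) into a total out of 100."""
    if set(scores) != set(WEIGHTS):
        raise ValueError("scores must cover exactly the components in WEIGHTS")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical scores, for illustration only.
example_scores = {
    "EC-1 Assignment / Quiz-I": 80,
    "EC-1 Assignment-II": 70,
    "EC-2 Mid-Semester Test": 60,
    "EC-3 Comprehensive Exam": 90,
}
total = weighted_total(example_scores)  # (800 + 1400 + 1800 + 3600) / 100 = 76.0
```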

Syllabus for Mid-Semester Test (Closed Book): Topics in Session Nos. 1 to 7

Syllabus for Comprehensive Exam (Open Book): All topics (Session Nos. 1 to 16)

Important links and information:

Elearn portal: https://elearn.bits-pilani.ac.in
Students are expected to visit the Elearn portal on a regular basis and stay up to date with the latest
announcements and deadlines.
Contact sessions: Students should attend the online lectures as per the schedule provided on the Elearn portal.
Evaluation Guidelines:
1. EC1 consists of two assignments. Announcements will be made available on the portal, in a timely
manner.
2. For Closed Book tests: No books or reference material of any kind will be permitted.
3. For Open Book exams: Use of books and any printed / written reference material (filed or bound) is
permitted. However, loose sheets of paper will not be allowed. Use of calculators is permitted in all
exams. Laptops/Mobiles of any kind are not allowed. Exchange of any material is not allowed.
4. If a student is unable to appear for the Regular Test/Exam due to genuine exigencies, the student
should follow the procedure to apply for the Make-Up Test/Exam which will be made available on the
Elearn portal. The Make-Up Test/Exam will be conducted only at selected exam centres on the dates to
be announced later.

It shall be the responsibility of the individual student to be regular in maintaining the self-study schedule as
given in the course handout, attend the online lectures, and take all the prescribed evaluation components such
as Assignment/Quiz, Mid-Semester Test and Comprehensive Exam according to the evaluation scheme
provided in the handout.

