CampusX DSMP 2.0 Syllabus
Week 5: Numpy
1. Session 13: Numpy Fundamentals
● Numpy Theory
● Numpy array
● Matrix in numpy
● Numpy array attributes
● Array operations
● Scalar and Vector operations
● Numpy array functions
i. Dot product
ii. Log, exp, mean, median, std, prod, min, max, trigonometric functions, variance, ceil, floor, slicing, iteration
iii. Reshaping
iv. Stacking and splitting
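A minimal sketch of these fundamentals (assumes NumPy is installed; the array values are illustrative):

    import numpy as np

    a = np.array([[1, 2, 3], [4, 5, 6]])            # 2-D array (matrix)
    print(a.ndim, a.shape, a.size, a.dtype)         # array attributes
    print(a * 2, a + a)                             # scalar and vector operations
    print(a.dot(a.T))                               # dot product -> 2x2 matrix
    print(np.exp(a), a.mean(), a.std())             # exp, mean, std, etc.
    print(a[0, 1:], [row for row in a])             # slicing and iteration
    b = a.reshape(3, 2)                             # reshaping
    top, bottom = np.vsplit(np.vstack([a, a]), 2)   # stacking and splitting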
2. Session 14: Advanced Numpy
● Numpy array vs Python List
● Advanced, Fancy and Boolean Indexing
● Broadcasting
● Mathematical operations in numpy
● Sigmoid in numpy
● Mean Squared Error in numpy
● Working with missing values
● Plotting graphs
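A short sketch of the sigmoid and mean-squared-error pieces of this session (a minimal version, assuming NumPy only; the sample values are made up):

    import numpy as np

    def sigmoid(z):
        # element-wise logistic function; broadcasting handles any array shape
        return 1 / (1 + np.exp(-z))

    def mse(y_true, y_pred):
        # mean squared error over all elements
        return np.mean((y_true - y_pred) ** 2)

    print(sigmoid(np.array([-1.0, 0.0, 1.0])))
    print(mse(np.array([3.0, 5.0, 2.5]), np.array([2.5, 5.0, 3.0])))  # 0.1666...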
3. Session 15: Numpy Tricks
● Various numpy functions like sort, append, concatenate, percentile, flip, set functions, etc.
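A quick sketch of the kind of functions covered (illustrative values):

    import numpy as np

    x = np.array([3, 1, 2, 2])
    print(np.sort(x), np.flip(x))                   # sorted and reversed copies
    print(np.append(x, 4), np.concatenate([x, x]))  # both return new arrays
    print(np.percentile(x, 50))                     # 50th percentile = median
    print(np.union1d(x, [9]), np.intersect1d(x, [1, 9]))   # set functions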
4. Session on Web Development using Flask
● What is the Flask library?
● Why use Flask?
● Building a login system and named entity recognition with an API
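A minimal Flask sketch of a login-style endpoint (the route, field names, and check are placeholders, not the session's actual project code):

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route('/login', methods=['POST'])
    def login():
        # placeholder check; a real system would verify against a database
        data = request.get_json()
        ok = data.get('user') == 'admin' and data.get('password') == 'secret'
        return jsonify({'logged_in': ok})

    if __name__ == '__main__':
        app.run(debug=True)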
Week 6: Pandas
1. Session 16: Pandas Series
● What is Pandas?
● Introduction to Pandas Series
● Series Methods
● Series Math Methods
● Series with Python functionalities
● Boolean Indexing on Series
● Plotting graphs on series
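A minimal Series sketch (values illustrative; plotting assumes matplotlib is installed):

    import pandas as pd

    s = pd.Series([10, 20, 30, 40], index=['a', 'b', 'c', 'd'])
    print(s.head(2), s.value_counts())   # Series methods
    print(s.mean(), s.sum(), s.max())    # math methods
    print('a' in s, len(s), sorted(s))   # plays well with Python built-ins
    print(s[s > 15])                     # boolean indexing
    s.plot(kind='bar')                   # plotting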
2. Session 17: Pandas DataFrame
● Introduction to Pandas DataFrame
● Creating DataFrame and read_csv()
● DataFrame attributes and methods
● DataFrame Math Methods
● Selecting columns and rows from a DataFrame
● Filtering a DataFrame
● Adding new columns
● DataFrame function: astype()
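A compact DataFrame sketch covering these points (the file and column names are placeholders):

    import pandas as pd

    # df = pd.read_csv('movies.csv')   # 'movies.csv' is a placeholder file name
    df = pd.DataFrame({'title': ['A', 'B', 'C'], 'rating': [8.5, 6.9, 9.1]})
    print(df.shape, df.columns.tolist())            # attributes
    print(df.describe())                            # math methods
    print(df.loc[0:1, ['title', 'rating']])         # selecting rows and columns
    print(df[df['rating'] > 8])                     # filtering
    df['is_hit'] = df['rating'] > 8                 # adding a new column
    df['rating'] = df['rating'].astype('float32')   # astype()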
3. Session 18: Important DataFrame Methods
● Various DataFrame Methods
● sort, index, reset_index, isnull, dropna, fillna, drop_duplicates, value_counts, apply, etc.
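A compact sketch touching most of these methods (toy data):

    import pandas as pd

    df = pd.DataFrame({'city': ['Delhi', 'Pune', 'Delhi', None],
                       'price': [100.0, 80.0, 100.0, None]})
    df = df.sort_values('price').reset_index(drop=True)
    print(df.isnull().sum())                              # missing values per column
    df['price'] = df['price'].fillna(df['price'].mean())
    df = df.dropna().drop_duplicates()
    print(df['city'].value_counts())
    print(df['price'].apply(lambda p: p * 1.18))          # apply a function per value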
4. Session on API Development using Flask
● What is an API?
● Building API using Flask
● Hands-on project
5. Session on Numpy Interview Questions
Capstone Project
1. Session 1 on Capstone Project | Data Gathering
a. Project overview in detail
b. Gather data for the project
c. Details of the data
2. Session 2 on Capstone Project | Data Cleaning
a. Merging House and Flats Data
b. Basic Level Data Cleaning
3. Session 3 on Capstone Project | Feature Engineering
a. Feature Engineering on Columns:
i. additionalRoom
ii. areaWithType
iii. agePossession
iv. furnishDetails
v. features: luxury score
4. Session 4 on Capstone Project | EDA
a. Univariate Analysis
b. PandasProfiling
c. Multivariate Analysis
5. Session 5 on Capstone Project | Outlier Detection and Removal
a. Outlier Detection And Removal
6. Session 6 on Capstone Project | Missing Value Imputation
a. Outlier Detection and Removal on area and bedroom
b. Missing Value Imputation
7. Session 7 on Capstone Project | Feature Selection
a. Feature Selection
i. Correlation Technique
ii. Random Forest Feature Importance
iii. Gradient Boosting Feature Importance
iv. Permutation Importance
v. LASSO
vi. Recursive Feature Elimination
vii. Linear Regression with Weights
viii. SHAP (Explainable AI)
b. Linear Regression - Base Model
i. One-Hot Encoding
ii. Transformation
iii. Pipeline for Linear Regression
c. SVR
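A hedged sketch of the base-model idea above (one-hot encoding, a log transformation, and a pipeline for linear regression); the column names and values are placeholders standing in for the real-estate data:

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder
    from sklearn.linear_model import LinearRegression
    from sklearn.pipeline import Pipeline
    from sklearn.model_selection import cross_val_score

    X = pd.DataFrame({'sector': ['s1', 's2', 's1', 's3'],
                      'area': [1200, 800, 1500, 950]})
    y = np.log1p([150, 90, 210, 110])   # log transformation of price

    pre = ColumnTransformer(
        [('ohe', OneHotEncoder(handle_unknown='ignore'), ['sector'])],
        remainder='passthrough')
    pipe = Pipeline([('pre', pre), ('lr', LinearRegression())])
    print(cross_val_score(pipe, X, y, cv=2, scoring='r2'))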
8. Session 8 on Capstone Project | Model Selection & Productionalization
a. Price Prediction Pipeline
i. Encoding Selection
1. Ordinal Encoding
2. OHE
3. OHE with PCA
4. Target Encoding
ii. Model Selection
b. Price Prediction Web Interface (Streamlit)
9. Session 9 on Capstone Project | Building the Analytics Module
a. Geo map
b. Word cloud of amenities
c. Scatter plot: area vs. price
d. Pie chart: BHK, filtered by sector
e. Side-by-side box plots: bedrooms vs. price
f. Distplot of prices for flats vs. houses
10. Session 10 on Capstone Project | Building the Recommender System
a. Recommender System using TopFacilities
b. Recommender System using Price Details
c. Recommender System using LocationAdvantages
11. Session 11 on Capstone Project | Building the Recommender System Part 2
a. Evaluating Recommendation Results
b. Web Interface for Recommendation (Streamlit)
12. Session 12 on Capstone Project | Building the Insights Module
13. Session 13 on Capstone Project | Deploying the application on AWS
------------------------------------------------------------------------
4. Session 3: Reproducibility
a. Story
b. Industry Tools
c. Cookiecutter
i. Step 1: Install the Cookiecutter Library and start a project
ii. Step 2: Explore the Template Structure
iii. Step 3: Customize the Cookiecutter Variables
iv. Step 4: Benefits of Using Cookiecutter Templates in Data Science
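The steps above map to roughly these commands (the DrivenData template URL is one common example; newer versions of that template also ship their own ccds CLI):

    pip install cookiecutter
    cookiecutter https://github.com/drivendataorg/cookiecutter-data-science
    # answer the prompts (project_name, author_name, ...) to customize the
    # template variables, then explore the generated folder structure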
5. Session 4: Data Versioning Control
a. Introduction
b. Prerequisites
c. Setup
i. Step 1: Initialize a Git repository
ii. Step 2: Set up DVC in your project
iii. Step 3: Add a dataset to your project
iv. Step 4: Commit changes to Git
v. Step 5: Create and version your machine learning pipeline
vi. Step 6: Track changes and reproduce experiments
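The six steps correspond roughly to these commands (paths and stage names are placeholders):

    git init
    dvc init
    dvc add data/dataset.csv                  # writes data/dataset.csv.dvc
    git add .dvc data/dataset.csv.dvc data/.gitignore
    git commit -m "Track dataset with DVC"
    # version a pipeline stage (dvc stage add writes dvc.yaml)
    dvc stage add -n train -d train.py -d data/dataset.csv -o model.pkl python train.py
    dvc repro                                 # re-run/reproduce the pipeline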
6. Doubt Clearance Session 2
a. Assignment Solution on DVC: 10:19
b. Doubt Clearance
i. DVC with G-Drive: 42:50
ii. DVC Setup Error: 48:45
iii. Containerization with Virtual Environment: 49:40
iv. Create Version and ML Pipeline: 56:50
v. DVC Checkout: 57:50
vi. How to know which commit (ID) to go to, using commit messages: 1:00:00
vii. What is Kubernetes?
viii. Not able to understand by reading documentation: 1:04:30
ix. Getting the number of commits (11k+): 1:09:40
Unsupervised Learning
1. KMeans Clustering
a. Session 1 on KMeans Clustering
i. Plan of Attack (Getting Started with Clustering)
ii. Types of ML Learning
iii. Applications of Clustering
iv. Geometric Intuition of K-Means
v. Elbow Method for Deciding Number of Clusters
1. Code Example
2. Limitation of Elbow Method
vi. Assumptions of KMeans
vii. Limitations of K Means
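A minimal elbow-method sketch, as referenced in the code-example item above (random toy data; assumes scikit-learn and matplotlib):

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    X = np.random.rand(200, 2)            # toy data
    ks = range(1, 11)
    inertias = []
    for k in ks:
        km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
        inertias.append(km.inertia_)      # within-cluster sum of squares
    plt.plot(ks, inertias, marker='o')    # look for the 'elbow' in this curve
    plt.xlabel('k'); plt.ylabel('inertia')
    plt.show()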
b. Session 2 on KMeans Clustering
i. Recap of Last class
ii. Assignment Solution
iii. Silhouette Score
iv. Kmeans Hyperparameters
1. Number of Clusters (k)
2. Initialization Method (k-means++)
3. Number of Initialization Runs (n_init)
4. Maximum Number of Iterations (max_iter)
5. Tolerance (tol)
6. Algorithm (auto, full, etc.)
7. Random State
v. K-Means++
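A small sketch combining the silhouette score with the hyperparameters listed above (toy blobs data; assumes scikit-learn):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import silhouette_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=42)
    for k in (2, 3, 4, 5):
        labels = KMeans(n_clusters=k, init='k-means++', n_init=10,
                        max_iter=300, tol=1e-4, random_state=42).fit_predict(X)
        print(k, silhouette_score(X, labels))   # higher is better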
c. Session 3 on KMeans Clustering
i. K-Means Mathematical Formulation (Lloyd's Algorithm)
ii. K-Means Time and Space Complexity
iii. Mini-Batch K-Means
iv. Types of Clustering
1. Partitional Clustering
2. Hierarchical Clustering
3. Density Based Clustering
4. Distribution/Model-based Clustering
d. K-Means Clustering Algorithm from Scratch in Python
i. Algorithm implementation from scratch in Python
2. Apriori
a. Introduction: Principles of association rule mining
b. Key Concepts: Support, Confidence, Lift
c. Algorithm Steps: Candidate generation, Pruning
d. Applications: Market Basket Analysis, Recommender Systems
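A sketch of support, confidence, and lift in code, assuming the mlxtend library (an assumption; the session may use another implementation). The basket matrix is illustrative:

    import pandas as pd
    from mlxtend.frequent_patterns import apriori, association_rules

    # one-hot basket matrix: rows = transactions, columns = items
    baskets = pd.DataFrame({'milk':  [1, 1, 0, 1],
                            'bread': [1, 1, 1, 0],
                            'eggs':  [0, 1, 1, 1]}).astype(bool)
    freq = apriori(baskets, min_support=0.5, use_colnames=True)
    rules = association_rules(freq, metric='lift', min_threshold=1.0)
    print(rules[['antecedents', 'consequents', 'support', 'confidence', 'lift']])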
Feature Engineering
1. Session on Encoding Categorical Features - 1
a. Feature Engineering Roadmap
b. What is Feature Encoding?
c. Ordinal Encoding
i. Code examples in Python
ii. Handling Rare Categories
d. Label Encoding
i. Code Example using Sklearn LabelEncoder
e. One Hot Encoding
i. Code Examples using Sklearn OneHotEncoder
ii. Handling unknown categories
f. LabelBinarizer
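A compact sketch of the main encoders above (sparse_output is the recent scikit-learn spelling; older versions use sparse=False):

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder, LabelEncoder, OneHotEncoder

    df = pd.DataFrame({'size': ['S', 'M', 'L', 'M']})
    ord_enc = OrdinalEncoder(categories=[['S', 'M', 'L']])   # explicit order
    print(ord_enc.fit_transform(df[['size']]))

    le = LabelEncoder()                                      # for targets, not features
    print(le.fit_transform(['spam', 'ham', 'spam']))

    ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
    print(ohe.fit_transform(df[['size']]))                   # unknowns -> all zeros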
2. Session on Sklearn ColumnTransformer & Pipeline
a. What is a ColumnTransformer?
b. Code implementation of ColumnTransformer
i. OHE
ii. Ordinal
c. SKLearn Pipelines
i. Implementing multiple transformations in Pipeline
1. Missing value imputation
2. Encoding Categorical Variables
a. Handling rare Categories
3. Scaling
4. Feature Selection
5. Model building
6. Prediction
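A hedged sketch wiring the listed transformations into one Pipeline (column names are placeholders; the fit call is left commented since there is no data here):

    from sklearn.compose import ColumnTransformer
    from sklearn.pipeline import Pipeline
    from sklearn.impute import SimpleImputer
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
    from sklearn.feature_selection import SelectKBest, f_classif
    from sklearn.linear_model import LogisticRegression

    pre = ColumnTransformer([
        ('ohe', OneHotEncoder(handle_unknown='ignore'), ['city']),
        ('ord', OrdinalEncoder(), ['size']),
        ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                          ('scale', StandardScaler())]), ['age', 'fare']),
    ])
    pipe = Pipeline([('pre', pre),                       # impute, encode, scale
                     ('select', SelectKBest(f_classif, k=5)),   # feature selection
                     ('model', LogisticRegression())])          # model building
    # pipe.fit(X_train, y_train); pipe.predict(X_test)   # prediction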
3. Session on Sklearn Deep Dive
a. Estimators
b. Custom Estimators
c. Mixins
d. Transformers
e. Custom Transformer
f. Composite Transformers
g. ColumnTransformer
h. Feature Union
i. Pipeline
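A minimal custom-transformer sketch showing how the mixins fit together (the transformer itself is illustrative):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class Log1pTransformer(BaseEstimator, TransformerMixin):
        # get_params/set_params come from BaseEstimator;
        # fit_transform comes from the TransformerMixin mixin
        def fit(self, X, y=None):
            return self              # stateless: nothing to learn
        def transform(self, X):
            return np.log1p(X)

    print(Log1pTransformer().fit_transform(np.array([[0.0], [9.0]])))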
6. Session 2 on Discretization
a. Types of Discretization
i. Uniform Binning
ii. Quantile Binning
iii. K-Means Binning
iv. Decision Tree Based Binning
v. Custom Binning
vi. Threshold Binning (Binarization)
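A sketch of the first three strategies via scikit-learn's KBinsDiscretizer, plus Binarizer for thresholding (decision-tree and custom binning need hand-rolled code; values illustrative):

    import numpy as np
    from sklearn.preprocessing import KBinsDiscretizer, Binarizer

    X = np.array([[1.0], [7.0], [8.0], [25.0], [30.0]])
    for strategy in ('uniform', 'quantile', 'kmeans'):
        kbd = KBinsDiscretizer(n_bins=3, encode='ordinal', strategy=strategy)
        print(strategy, kbd.fit_transform(X).ravel())
    print(Binarizer(threshold=10.0).fit_transform(X).ravel())  # binarization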
7. Session 1 on Handling Missing Data
a. Missing Values
b. The missingno library
c. Why do missing values occur?
d. Types of missing values
e. How do missing values impact ML models?
f. How to handle missing values?
i. Removing
ii. Imputing
g. Removing Missing Data
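A short sketch of inspecting and handling missing values (assumes the missingno library is installed; toy data):

    import pandas as pd
    import missingno as msno

    df = pd.DataFrame({'age': [25, None, 40], 'city': ['Pune', 'Delhi', None]})
    msno.matrix(df)                    # visualize the missingness pattern
    print(df.isnull().mean())          # fraction missing per column
    dropped = df.dropna()              # option 1: removing
    imputed = df.fillna({'age': df['age'].mean(), 'city': 'Unknown'})  # option 2: imputing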
Advanced XGBoost
Session on Revisiting XGBoost
1. Supervised ML
2. Stagewise Additive Modelling
3. XGBoost Objective Function
Session on XGBoost Regularization
1. Recap
2. Ways to reduce overfitting in XGBoost
3. Parameters
a. Gamma
b. Max Depth
c. Num Estimators
d. Early Stopping
e. Shrinkage
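A sketch mapping these parameters onto the XGBoost scikit-learn API (early_stopping_rounds is a constructor argument in recent xgboost; older versions pass it to fit()):

    from xgboost import XGBRegressor
    from sklearn.datasets import make_regression
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=500, n_features=10, random_state=42)
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=42)
    model = XGBRegressor(
        gamma=1.0,                 # min loss reduction to make a further split
        max_depth=4,               # cap on tree depth
        n_estimators=500,          # num estimators (upper bound with early stopping)
        learning_rate=0.1,         # shrinkage (eta)
        early_stopping_rounds=20,  # stop when validation score stops improving
    )
    model.fit(X_tr, y_tr, eval_set=[(X_val, y_val)], verbose=False)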
1. AdaBoost
a. Introduction: Overview and intuition of the algorithm
b. Components: Weak Learners, Weights, Final Model
c. Hyperparameters: Learning Rate, Number of Estimators
d. Applications: Use Cases in Classification and Regression
2. Stacking
a. Introduction: Concept of model ensembling
b. Steps: Base Models, Meta-Model, Final Prediction
c. Variations: Different approaches and modifications
d. Best Practices: Tips for effective stacking
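A minimal stacking sketch with scikit-learn (the base models and meta-model here are illustrative choices):

    from sklearn.ensemble import StackingClassifier, RandomForestClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.svm import SVC
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=300, random_state=42)
    stack = StackingClassifier(
        estimators=[('rf', RandomForestClassifier(random_state=42)),
                    ('svc', SVC(probability=True, random_state=42))],  # base models
        final_estimator=LogisticRegression(),                          # meta-model
        cv=5,   # out-of-fold predictions feed the meta-model
    )
    print(cross_val_score(stack, X, y, cv=3).mean())   # final prediction quality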
3. LightGBM
Session 1 on Introduction to LightGBM
a. Introduction and core features
b. Boosting and Objective Function
c. Histogram-Based Split finding
d. Best-fit Tree (Leaf-wise growth strategy)
e. Gradient-based One-Side Sampling (GOSS)
f. Exclusive Feature Bundling (EFB)
Session 2 on LightGBM (GOSS & EFB)
a. Recap - Features and Technical Aspects
b. Revisiting GOSS
c. EFB
4. CatBoost
Session 1 on CatBoost - Practical Introduction
a. Introduction
b. Advantages and Technical Aspects
c. Practical Implementation of CatBoost on Medical Cost Dataset
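A tiny sketch of the practical side (the rows are made-up stand-ins for the medical cost data; assumes the catboost package):

    from catboost import CatBoostRegressor, Pool

    X = [[19, 'female', 27.9, 'yes'], [33, 'male', 22.7, 'no'],
         [28, 'male', 33.0, 'no'],   [45, 'female', 25.2, 'yes']]
    y = [16884.92, 1725.55, 4449.46, 8010.0]
    train = Pool(X, y, cat_features=[1, 3])   # categorical columns passed as-is
    model = CatBoostRegressor(iterations=100, verbose=0)
    model.fit(train)
    print(model.predict([[30, 'female', 24.0, 'no']]))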
Miscellaneous Topics
1. NoSQL
a. Introduction: Overview of NoSQL databases
b. Types: Document, Key-Value, Column-Family, Graph
c. Use Cases: When to use NoSQL over SQL databases
d. Popular Databases: MongoDB, Cassandra, Redis, Neo4j
2. Model Explainability
a. Introduction: Importance of interpretable models
b. Techniques: LIME, SHAP, Feature Importance
c. Application: Applying techniques to various models
d. Best Practices: Ensuring reliable and accurate explanations
3. FastAPI
a. Introduction: Modern, fast web framework for building APIs
b. Features: Type checking, Automatic validation, Documentation
c. Building APIs: Steps and best practices
d. Deployment: Hosting and scaling FastAPI applications
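A minimal FastAPI sketch showing type-hint validation and the auto-generated docs (the endpoint and fields are placeholders):

    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()   # interactive docs auto-generated at /docs

    class House(BaseModel):        # automatic validation via type hints
        area: float
        bedrooms: int

    @app.post('/predict')
    def predict(house: House):
        # placeholder logic standing in for a real model
        return {'price': house.area * 100 + house.bedrooms * 5000}

    # run with: uvicorn main:app --reload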
4. AWS Sagemaker
a. Introduction: Fully managed service for machine learning
b. Features: Model building, Training, Deployment
c. Usage: Workflow from data preprocessing to model deployment
d. Best Practices: Optimizing costs and performance
Note: The schedule is tentative; topics may be added to or removed from it in the future.