Applied Machine Learning
Chapter Two: Classification
Introduction
The Importance of Data Prep!
80% of ML and DS work is data prep! See: Machine Learning with Python: Classification (complete tutorial) | by Mauro Di Pietro | Towards Data Science, or Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by Susan Li | Towards Data Science.
Feature Engineering
Feature engineering applies to non-deep-learning tasks and algorithms.
Deep learning learns the underlying patterns in your dataset itself, so you need little to no manual feature engineering.
You still need data prep, though.
● Environment setup: import libraries and read data
● Exploratory Data Analysis & Visualization: understand the meaning and the predictive power of the variables
● Data Prep (see the code sketch after this list):
○ data partitioning
○ handle missing values
○ encode categorical variables
○ standard scaling
○ normalization
● Feature engineering:
○ extract features from raw data
○ feature selection: keep only the most relevant variables
○ feature importance
○ entropy, Gini, signal/noise
● Model Dev: train, tune hyperparameters, validate, test
● Performance evaluation: read the metrics
● Explainability/Interpretability: understand how the model produces results
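The data-prep bullets above can be sketched in code. This is a minimal, hedged sketch only: the file name fruits.csv and the label column are hypothetical placeholders for your own dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# hypothetical file and target column -- substitute your own dataset
df = pd.read_csv("fruits.csv")
X, y = df.drop(columns=["label"]), df["label"]

# data partitioning: hold out a test set before fitting anything
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# handle missing values, encode categorical variables, scale numeric ones
prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

X_train_prep = prep.fit_transform(X_train)   # fit on the training split only
X_test_prep = prep.transform(X_test)         # reuse the same fitted transform on the test split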
Data Preparation
Source: Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by
Susan Li | Towards Data Science
Remember to check: size (.shape), unique values (.unique()), value counts, and distributions.
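As a minimal sketch (assuming the data is already loaded into a pandas DataFrame named df, with a hypothetical target column called label):

import pandas as pd  # assumes df has already been read with pd.read_csv(...)

print(df.shape)                     # size: (rows, columns)
print(df.nunique())                 # number of unique values per column
print(df["label"].value_counts())   # class counts -- check for imbalance
print(df.describe())                # summary statistics for numeric columns
df.hist(figsize=(10, 8))            # quick view of the numeric distributions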
The Classification Algorithms
There are many classification algorithms available in machine learning. Here are some of the
most common ones:
1. Decision Trees
2. Random Forest
3. Naive Bayes
4. Logistic Regression
5. Support Vector Machines (SVM)
6. k-Nearest Neighbors (k-NN)
7. Gradient Boosting
Here's a table that summarizes the advantages and disadvantages of each algorithm:
Algorithm | Advantages | Disadvantages
Decision Trees | Easy to understand and interpret. Can handle both categorical and numerical data. Can handle missing data. | Prone to overfitting. Can be sensitive to small changes in the data.
Random Forest | Can handle large datasets with many features. Reduces the risk of overfitting. Good for categorical data. | Can be slow to train. Can be difficult to interpret.
Naive Bayes | Very fast and requires minimal training data. Works well with high-dimensional datasets. Good for text analysis. | Assumes independence of features, which may not be true in reality.
Logistic Regression / Binary Classification | Simple and easy to interpret. Can work well with small datasets. Good for binary classification. | Can underperform when there are non-linear relationships between features and the target variable.
Support Vector Machines (SVM) | Effective for high-dimensional datasets. Works well with both linear and non-linear relationships between features and the target variable. Good for image classification. | Can be slow to train. Can be difficult to interpret.
k-Nearest Neighbors (k-NN) | Easy to understand and implement. Good for non-linear relationships between features and the target variable. | Can be slow to predict. Sensitive to irrelevant features.
Gradient Boosting | Good for datasets with many features. Reduces the risk of overfitting. Can handle missing data. | Can be slow to train. Can be difficult to interpret.
It's important to note that the performance of each algorithm will vary depending on the specific
dataset and problem at hand. Therefore, it's recommended to experiment with multiple
algorithms and compare their performance on the same dataset.
Feature Engineering
Basic
1. Normalization
2. Scaling
3. Cleaning
Advanced
Feature Importance: What are the important columns/features in my dataset?
1. Feature importance: imblearn, covariance, model.coef_
2. Explainability: SHAP
We want to understand the impact of each feature on the ultimate model outcome, e.g. “your application was accepted!” But why? Because you had a good credit score, good references, a stable job, and no past criminal record.
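One rough sketch of the first idea, using a linear model’s coefficients as an importance signal; it assumes X_train is a DataFrame and y_train the target from an already prepared split (names are placeholders):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# assumes X_train (DataFrame of features) and y_train (labels) already exist
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# coefficient magnitude as a rough importance signal (features should be on the same scale)
coef_importance = pd.Series(model.coef_[0], index=X_train.columns)
print(coef_importance.abs().sort_values(ascending=False).head(10))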
SHAP (Shapley) values score the importance or relevance of each feature for predicting the target variable.
We want to detect and identify the importance of each feature used in making the final prediction.
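A short SHAP sketch, assuming a tree-based model and a feature DataFrame X with labels y (variable names are placeholders):

import shap
from sklearn.ensemble import RandomForestClassifier

# assumes X (DataFrame of features) and y (target) are already prepared
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)    # Shapley values for tree ensembles
shap_values = explainer.shap_values(X)

# global view: which features matter most for the predictions
shap.summary_plot(shap_values, X)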
Objectives:
1. EDAV: understand the distribution of the dataset (is it skewed?) ⇒ plot dataset distributions
2. Remove/reduce noise
a. Get rid of unnecessary or irrelevant features (columns)
b. Methods for determining feature importance
3. Increase signal
a. Up- or down-sampling (5 techniques)
b. Handling imbalanced data
c. Combining techniques
4. XAI: explainability, interpretability, transparency, trustworthy AI
a. Use SHAP
Source: The Boston Housing Dataset
The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following describes the dataset columns:
● CRIM - per capita crime rate by town
● ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
● INDUS - proportion of non-retail business acres per town.
● CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
● NOX - nitric oxides concentration (parts per 10 million)
● RM - average number of rooms per dwelling
● AGE - proportion of owner-occupied units built prior to 1940
● DIS - weighted distances to five Boston employment centres
● RAD - index of accessibility to radial highways
● TAX - full-value property-tax rate per $10,000
● PTRATIO - pupil-teacher ratio by town
● B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
● LSTAT - % lower status of the population
● MEDV - Median value of owner-occupied homes in $1000's
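One way to load it into a DataFrame (a sketch; note that load_boston was removed from recent scikit-learn releases, so this assumes the OpenML copy of the dataset is available):

from sklearn.datasets import fetch_openml

# pulls the classic Boston housing data from OpenML (assumed to be hosted there)
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.frame                 # 506 rows, 13 features plus the MEDV target
print(df.head())
print(df.describe())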
Tree splits are scored with Gini impurity for classification tasks and with variance reduction in the case of regression.
Source : SHAP (SHapley Additive exPlanations) | by Cory Maklin | Medium
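As a refresher, the Gini impurity of a node is 1 minus the sum of the squared class proportions; a small sketch:

import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_i^2) over class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))   # 0.0 -> pure node
print(gini_impurity([0, 0, 1, 1]))   # 0.5 -> maximally mixed two-class node

A split is good when the weighted impurity of the child nodes is lower than that of the parent node.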
Dealing with Imbalanced Datasets
Source: Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by
Susan Li | Towards Data Science
See Machine Learning — Multiclass Classification with Imbalanced Dataset | by Javaid Nabi |
Towards Data Science
SMOTE can be used to partially rebalance a skewed dataset that has underrepresented classes.
imblearn helps you resample or generate additional data:
● Upsample the minority class (e.g., mandarins) toward the level of the majority class (e.g., apples)
● Downsample everyone else down to the minority class’s (mandarins’) representation level among the target classes
Or set the class_weight parameter when constructing the classifier in sklearn.
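A hedged sketch of both options, assuming imbalanced-learn (imblearn) is installed and X_train / y_train are the training split:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Option 1: synthesize minority-class samples with SMOTE (resample the training split only)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_res))

# Option 2: leave the data alone and reweight classes inside the estimator
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)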
Source: Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by
Susan Li | Towards Data Science
For a complete, soup-to-nuts reference, see Machine Learning with Python: Classification (complete tutorial) | by Mauro Di Pietro | Towards Data Science.
The Muller Loop
Let’s approach classification as a set of experiments. Please fork / clone:
https://github.com/aarsanjani/applied-ml-2020/blob/master/MullerLoop.ipynb
names = [
"Nearest Neighbors", "Linear SVM", "RBF SVM", #"Gaussian Process",
"Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
"Naive Bayes", "QDA"
]
Classification
# Modified by Ali Arsanjani from ...
# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
from time import time

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM",  # "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(2),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    # GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

# X_data_reshape and y_data are your prepared features and labels from earlier in the notebook
X, y = X_data_reshape, y_data
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.2)

# TODO (Apply): Add cross-validation
max_score = 0.0
max_class = ''

# iterate over classifiers, keeping track of the best one
for name, clf in zip(names, classifiers):
    start_time = time()
    clf.fit(X_train, y_train)
    score = 100.0 * clf.score(X_test, y_test)
    print('Classifier = %s, Score (test, accuracy) = %.2f,' % (name, score),
          'Training time = %.2f seconds' % (time() - start_time))
    if score > max_score:
        clf_best = clf
        max_score = score
        max_class = name

print(80 * '-')
print('Best --> Classifier = %s, Score (test, accuracy) = %.2f'
      % (max_class, max_score))

# plot the output of the various algorithms
Muller Loop for Regression [next topic]
The Notion of Experiments and your AI/ML Lab
Every data science activity, whether training and testing or a PoC, is essentially an experiment.
1. Formulate a narrative of what you are trying to do: what is the benefit to the business / humankind?
2. Formulate some questions among your group: what are the questions that we want to answer by running these experiments/algorithms with these datasets?
3. Run the experiments.
4. Log the results of all experiments.
5. We are trying to determine which experiment works best.
6. Put the results in a table (see the logging sketch after this list).
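One simple way to keep that log, sketched with pandas; the metric values below are placeholders, not real results:

import pandas as pd

results = []   # append one row per experiment as you run it
results.append({"algorithm": "knn", "f1": 0.0, "precision": 0.0, "recall": 0.0, "auc_roc": 0.0})
results.append({"algorithm": "svc", "f1": 0.0, "precision": 0.0, "recall": 0.0, "auc_roc": 0.0})

results_table = pd.DataFrame(results)
print(results_table)
results_table.to_csv("experiment_log.csv", index=False)   # running log of all experiments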
Clustering
Euclidean Distance vs Fractal Distance
For every iteration, show the data and the objective-function results for your clusters.
Look at the results of your multiple experiments and then write up a hypothesis explaining why you think the explanation you have produced is a valid one.
algorithm | f1 | precision | recall | auc/roc
knn | | | |
svc | | | |
Example for classification
Algorithm | Metric 1 | Metric 2
● Data Visualization and Distributions
● Feature Importance (SHAP, permutations)
● Gini score, entropy, signal vs. noise
● Class imbalance
● Upsampling / downsampling (SMOTE)
How well did our classifications do?
The 5 Classification Evaluation metrics every Data Scientist must know | by Rahul Agarwal
Confusion Matrix: Performance Metrics for Classification problems in Machine Learning | by
Mohammed Sunasra | Medium
Accuracy alone can be misleading (especially on imbalanced data) → prefer the F1 Score
AUC / ROC
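A minimal sketch of computing these metrics, assuming a fitted binary classifier clf and a held-out test split (names are placeholders):

from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_auc_score

y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))        # rows: true classes, columns: predicted classes
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class
print("F1:", f1_score(y_test, y_pred))
print("AUC/ROC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))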
Use the Random Forest classifier’s Gini-based importance score to identify which features contribute most to the predictions (a sketch follows).
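A sketch of that, assuming X_train is a DataFrame so the feature names line up with the importances:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# feature_importances_ is the mean decrease in Gini impurity attributed to each feature
gini_importance = pd.Series(rf.feature_importances_, index=X_train.columns)
print(gini_importance.sort_values(ascending=False))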
Experimentation
Project work
1. Find datasets 2 and 3 and amalgamate them; run classification (each team member should try to pick a different classification algorithm).
a. Show how performance is enhanced with each amalgamation.
Dataset 1
Algorithm | Scores
Dataset 1+2
Algorithm | Scores
Dataset 1+2+3
Algorithm | Scores
2. Go in depth with at least one algorithm; you can use a classification algorithm.
3. Run a Muller loop on your dataset 1, ds1+ds2, and ds1+ds2+ds3, i.e., your incrementally amalgamated datasets, for classification.
a. Put the results in a table.
Section 2: Dealing with Class Imbalance
● Feature Importance (SHAP, permutations)
● Gini score, entropy, signal vs. noise
● Class imbalance
● Upsampling / downsampling (SMOTE)
© 2018-2023, Dr. Ali Arsanjani