Applied Machine Learning
Chapter Two: Classification
Introduction
The Importance of Data Prep!
80% of ML and DS work is data prep! See: Machine Learning with Python: Classification (complete tutorial) | by Mauro Di Pietro | Towards Data Science, or Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by Susan Li | Towards Data Science.
Feature Engineering
Feature engineering applies to non-deep-learning tasks and algorithms.
Deep learning learns the underlying patterns in your dataset itself, so you need little to no manual feature engineering.
You still need data prep, though.
● Environment setup: import libraries and read data
● Exploratory Data Analysis & Visualization: understand the meaning and the predictive power of the variables
● Data Prep (see the code sketch after this list):
○ data partitioning
○ handle missing values
○ encode categorical variables
○ standard scaling
○ normalization
● Feature engineering:
○ extract features from raw data
○ feature selection: keep only the most relevant variables
○ feature importance
○ entropy, Gini, signal/noise
● Model Dev: train, tune hyperparameters, validate, test
● Performance evaluation: read the metrics
● Explainability/Interpretability: understand how the model produces results
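The data-prep bullets above can be sketched in code. This is a minimal, hedged sketch only: the file name fruits.csv and the label column are hypothetical placeholders for your own dataset.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

# hypothetical file and target column -- substitute your own dataset
df = pd.read_csv("fruits.csv")
X, y = df.drop(columns=["label"]), df["label"]

# data partitioning: hold out a test set before fitting anything
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

numeric_cols = X.select_dtypes(include="number").columns
categorical_cols = X.select_dtypes(exclude="number").columns

# handle missing values, encode categorical variables, scale numeric ones
prep = ColumnTransformer([
    ("num", Pipeline([("impute", SimpleImputer(strategy="median")),
                      ("scale", StandardScaler())]), numeric_cols),
    ("cat", Pipeline([("impute", SimpleImputer(strategy="most_frequent")),
                      ("encode", OneHotEncoder(handle_unknown="ignore"))]), categorical_cols),
])

X_train_prep = prep.fit_transform(X_train)   # fit on the training split only
X_test_prep = prep.transform(X_test)         # reuse the same fitted transform on the test split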
Data Preparation
Source: Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by
Susan Li | Towards Data Science
Remember to check: size (.shape), unique values (.unique()), value counts, and distributions.
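As a minimal sketch (assuming the data is already loaded into a pandas DataFrame named df, with a hypothetical target column called label):

import pandas as pd  # assumes df has already been read with pd.read_csv(...)

print(df.shape)                     # size: (rows, columns)
print(df.nunique())                 # number of unique values per column
print(df["label"].value_counts())   # class counts -- check for imbalance
print(df.describe())                # summary statistics for numeric columns
df.hist(figsize=(10, 8))            # quick view of the numeric distributions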
The Classification Algorithms
There are many classification algorithms available in machine learning. Here are some of the
most common ones:
1. Decision Trees
2. Random Forest
3. Naive Bayes
4. Logistic Regression
5. Support Vector Machines (SVM)
6. k-Nearest Neighbors (k-NN)
7. Gradient Boosting
Here's a table that summarizes the advantages and disadvantages of each algorithm:
Algorithm | Advantages | Disadvantages
Decision Trees | Easy to understand and interpret. Can handle both categorical and numerical data. Can handle missing data. | Prone to overfitting. Can be sensitive to small changes in the data.
Random Forest | Can handle large datasets with many features. Reduces the risk of overfitting. Good for categorical data. | Can be slow to train. Can be difficult to interpret.
Naive Bayes | Very fast and requires minimal training data. Works well with high-dimensional datasets. Good for text analysis. | Assumes independence of features, which may not be true in reality.
Logistic Regression / Binary Classification | Simple and easy to interpret. Can work well with small datasets. Good for binary classification. | Can underperform when there are non-linear relationships between features and the target variable.
Support Vector Machines (SVM) | Effective for high-dimensional datasets. Works well with both linear and non-linear relationships between features and the target variable. Good for image classification. | Can be slow to train. Can be difficult to interpret.
k-Nearest Neighbors (k-NN) | Easy to understand and implement. Good for non-linear relationships between features and the target variable. | Can be slow to predict. Sensitive to irrelevant features.
Gradient Boosting | Good for datasets with many features. Reduces the risk of overfitting. Can handle missing data. | Can be slow to train. Can be difficult to interpret.
It's important to note that the performance of each algorithm will vary depending on the specific
dataset and problem at hand. Therefore, it's recommended to experiment with multiple
algorithms and compare their performance on the same dataset.
Feature Engineering
Basic
1. Normalization
2. Scaling
3. Cleaning
Advanced
Feature Importance: What are the important columns/features in my dataset?
1. Feature importance: imblearn, covariance, model.coef_
2. Explainability: SHAP
We want to understand the impact of each feature on the ultimate model outcome, e.g. “your application was accepted!” But why? Because you had a good credit score, good references, a stable job, and no past criminal record.
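One rough sketch of the first idea, using a linear model’s coefficients as an importance signal; it assumes X_train is a DataFrame and y_train the target from an already prepared split (names are placeholders):

import pandas as pd
from sklearn.linear_model import LogisticRegression

# assumes X_train (DataFrame of features) and y_train (labels) already exist
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# coefficient magnitude as a rough importance signal (features should be on the same scale)
coef_importance = pd.Series(model.coef_[0], index=X_train.columns)
print(coef_importance.abs().sort_values(ascending=False).head(10))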
SHAP (Shapley) values score the importance or relevance of each feature for predicting the target variable.
We want to detect and identify the importance of each feature used in making the final prediction.
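A short SHAP sketch, assuming a tree-based model and a feature DataFrame X with labels y (variable names are placeholders):

import shap
from sklearn.ensemble import RandomForestClassifier

# assumes X (DataFrame of features) and y (target) are already prepared
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

explainer = shap.TreeExplainer(model)    # Shapley values for tree ensembles
shap_values = explainer.shap_values(X)

# global view: which features matter most for the predictions
shap.summary_plot(shap_values, X)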
Objectives:
1. EDAV: understand the distribution of the dataset (is it skewed?) ⇒ plot dataset distributions
2. Remove/reduce noise
a. Get rid of unnecessary or irrelevant features (columns)
b. Methods for determining feature importance
3. Increase signal
a. Up- or down-sampling (5 techniques)
b. Handling imbalanced data
c. Combining techniques
4. XAI: explainability, interpretability, transparency, trustworthy AI
a. Use SHAP
Source: The Boston Housing Dataset
The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following describes the dataset columns:
● CRIM - per capita crime rate by town
● ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
● INDUS - proportion of non-retail business acres per town.
● CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
● NOX - nitric oxides concentration (parts per 10 million)
● RM - average number of rooms per dwelling
● AGE - proportion of owner-occupied units built prior to 1940
● DIS - weighted distances to five Boston employment centres
● RAD - index of accessibility to radial highways
● TAX - full-value property-tax rate per $10,000
● PTRATIO - pupil-teacher ratio by town
● B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
● LSTAT - % lower status of the population
● MEDV - Median value of owner-occupied homes in $1000's
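One way to load it into a DataFrame (a sketch; note that load_boston was removed from recent scikit-learn releases, so this assumes the OpenML copy of the dataset is available):

from sklearn.datasets import fetch_openml

# pulls the classic Boston housing data from OpenML (assumed to be hosted there)
boston = fetch_openml(name="boston", version=1, as_frame=True)
df = boston.frame                 # 506 rows, 13 features plus the MEDV target
print(df.head())
print(df.describe())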
Tree splits are scored with Gini impurity for classification tasks and with variance reduction in the case of regression.
Source : SHAP (SHapley Additive exPlanations) | by Cory Maklin | Medium
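As a refresher, the Gini impurity of a node is 1 minus the sum of the squared class proportions; a small sketch:

import numpy as np

def gini_impurity(labels):
    """Gini impurity of a node: 1 - sum(p_i^2) over class proportions p_i."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 0, 0]))   # 0.0 -> pure node
print(gini_impurity([0, 0, 1, 1]))   # 0.5 -> maximally mixed two-class node

A split is good when the weighted impurity of the child nodes is lower than that of the parent node.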
Dealing with Imbalanced Datasets
Source: Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by
Susan Li | Towards Data Science
See Machine Learning — Multiclass Classification with Imbalanced Dataset | by Javaid Nabi |
Towards Data Science
SMOTE can be used to partially rebalance a skewed dataset that has underrepresented classes.
imblearn helps you resample or generate additional data:
● Upsample the minority class (e.g., mandarins) toward the level of the majority class (e.g., apples)
● Downsample everyone else down to the minority class’s (mandarins’) representation level among the target classes
Or set the class_weight parameter when constructing the classifier in sklearn.
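A hedged sketch of both options, assuming imbalanced-learn (imblearn) is installed and X_train / y_train are the training split:

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression

# Option 1: synthesize minority-class samples with SMOTE (resample the training split only)
X_res, y_res = SMOTE(random_state=42).fit_resample(X_train, y_train)
print(Counter(y_train), "->", Counter(y_res))

# Option 2: leave the data alone and reweight classes inside the estimator
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)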
Source: Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by
Susan Li | Towards Data Science
For a complete, soup-to-nuts reference, see Machine Learning with Python: Classification (complete tutorial) | by Mauro Di Pietro | Towards Data Science.
The Muller Loop
Let’s approach classification as a set of experiments. Please fork / clone:
https://github.com/aarsanjani/applied-ml-2020/blob/master/MullerLoop.ipynb
names = [
"Nearest Neighbors", "Linear SVM", "RBF SVM", #"Gaussian Process",
"Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
"Naive Bayes", "QDA"
]
Classification
# Modified by Ali Arsanjani from ...
# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause
from time import time

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM",  # "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(2),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    # GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

# X_data_reshape and y_data are your prepared features and labels from earlier in the notebook
X, y = X_data_reshape, y_data
X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.2)

# TODO (Apply): Add cross-validation
max_score = 0.0
max_class = ''

# iterate over classifiers, keeping track of the best one
for name, clf in zip(names, classifiers):
    start_time = time()
    clf.fit(X_train, y_train)
    score = 100.0 * clf.score(X_test, y_test)
    print('Classifier = %s, Score (test, accuracy) = %.2f,' % (name, score),
          'Training time = %.2f seconds' % (time() - start_time))
    if score > max_score:
        clf_best = clf
        max_score = score
        max_class = name

print(80 * '-')
print('Best --> Classifier = %s, Score (test, accuracy) = %.2f'
      % (max_class, max_score))

# plot the output of the various algorithms
Muller Loop for Regression [next topic]
The Notion of Experiments and your AI/ML Lab
Every data science activity, whether training and testing or a PoC, is essentially an experiment.
1. Formulate a narrative of what you are trying to do: what is the benefit to the business / humankind?
2. Formulate some questions among your group: what are the questions that we want to answer by running these experiments/algorithms with these datasets?
3. Run the experiments.
4. Log the results of all experiments.
5. We are trying to determine which experiment works best.
6. Put the results in a table (see the logging sketch after this list).
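One simple way to keep that log, sketched with pandas; the metric values below are placeholders, not real results:

import pandas as pd

results = []   # append one row per experiment as you run it
results.append({"algorithm": "knn", "f1": 0.0, "precision": 0.0, "recall": 0.0, "auc_roc": 0.0})
results.append({"algorithm": "svc", "f1": 0.0, "precision": 0.0, "recall": 0.0, "auc_roc": 0.0})

results_table = pd.DataFrame(results)
print(results_table)
results_table.to_csv("experiment_log.csv", index=False)   # running log of all experiments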
Clustering
Euclidean Distance vs Fractal Distance
For every iteration, show the data and the objective-function results for your clusters.
Look at the results of your multiple experiments and then write up a hypothesis explaining why you think the explanation you have produced is a valid one.
algorithm | f1 | precision | recall | auc/roc
knn | | | |
svc | | | |
Example for classification
Algorithm | Metric 1 | Metric 2
● Data Visualization and Distributions
● Feature Importance (SHAP, permutations)
● Gini score, entropy, signal vs. noise
● Class imbalance
● Upsampling / downsampling (SMOTE)
How well did our classifications do?
The 5 Classification Evaluation metrics every Data Scientist must know | by Rahul Agarwal
Confusion Matrix: Performance Metrics for Classification problems in Machine Learning | by
Mohammed Sunasra | Medium
Accuracy alone can be misleading (especially on imbalanced data) → prefer the F1 Score
AUC / ROC
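A minimal sketch of computing these metrics, assuming a fitted binary classifier clf and a held-out test split (names are placeholders):

from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_auc_score

y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))        # rows: true classes, columns: predicted classes
print(classification_report(y_test, y_pred))   # precision, recall, F1 per class
print("F1:", f1_score(y_test, y_pred))
print("AUC/ROC:", roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1]))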
Use the Random Forest classifier’s Gini-based importance score to identify which features contribute most to the predictions (a sketch follows).
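A sketch of that, assuming X_train is a DataFrame so the feature names line up with the importances:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_train, y_train)

# feature_importances_ is the mean decrease in Gini impurity attributed to each feature
gini_importance = pd.Series(rf.feature_importances_, index=X_train.columns)
print(gini_importance.sort_values(ascending=False))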
Experimentation
Project work
1. Find datasets 2 and 3 and amalgamate them; run classification (each team member should try to pick a different classification algorithm).
a. Show how performance is enhanced with each amalgamation.
Dataset 1
Algorithm | Scores
Dataset 1+2
Algorithm | Scores
Dataset 1+2+3
Algorithm | Scores
2. Go in depth with at least one algorithm; you can use a classification algorithm.
3. Run a Muller loop on your dataset 1, ds1+ds2, and ds1+ds2+ds3, i.e., your incrementally amalgamated datasets, for classification.
a. Put the results in a table.
Section 2: Dealing with Class Imbalance
● Feature Importance (SHAP, permutations)
● Gini score, entropy, signal vs. noise
● Class imbalance
● Upsampling / downsampling (SMOTE)
© 2018-2023, Dr. Ali Arsanjani