
Applied Machine Learning

Chapter Two: Classification

Introduction

The Importance of Data Prep!


80% of ML and DS is Data Prep! See: Machine Learning with Python: Classification (complete
tutorial) | by Mauro Di Pietro | Towards Data Science or Solving A Simple Classification Problem
with Python — Fruits Lovers’ Edition | by Susan Li | Towards Data Science

Feature Engineering

Feature engineering is for non-deep-learning tasks and algorithms.

DL learns the underlying patterns in your dataset, so you don't need much feature engineering; in fact, you need none at all! You still need data prep, though. The typical workflow (a minimal code sketch follows the list below):

● Environment setup: import libraries and read data
● Exploratory Data Analysis & Visualization: understand the meaning and the predictive power of the variables
● Data Prep:
  ○ data partitioning
  ○ handle missing values
  ○ encode categorical variables
  ○ standard scaling
  ○ normalization
● Feature engineering:
  ○ extract features from raw data
  ○ feature selection: keep only the most relevant variables
  ○ feature importance
  ○ entropy, Gini, signal/noise
● Model Dev: train, tune hyperparameters, validation, test
● Performance evaluation: read the metrics
● Explainability/Interpretability: understand how the model produces results
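As an illustration of these steps end to end, here is a minimal sketch (my own addition, not the chapter's code). The file name data.csv and the column name "target" are assumptions for illustration.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# Environment setup: read data (hypothetical file and column names)
df = pd.read_csv("data.csv")

# EDA & Visualization: size, summary statistics, distributions
print(df.shape)
print(df.describe())

# Data Prep: handle missing values, separate target, encode categoricals, partition, scale
df = df.dropna()
y = df["target"]
X = pd.get_dummies(df.drop(columns=["target"]), drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Model Dev + Performance evaluation
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))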


Data Preparation
Source: Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by
Susan Li | Towards Data Science

Remember to check: size, unique values (unique()), counts, and distributions.
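For example (a sketch assuming the fruits dataset from the Susan Li tutorial; the file and column names are assumptions):

import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_table("fruit_data_with_colors.txt")   # assumed file from the fruits tutorial

print(df.shape)                          # size: rows x columns
print(df["fruit_name"].unique())         # distinct classes
print(df["fruit_name"].value_counts())   # counts per class
print(df.describe())                     # distributions of numeric features
df.hist(figsize=(8, 6))                  # quick distribution plots
plt.show()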


The Classification Algorithms


There are many classification algorithms available in machine learning. Here are some of the
most common ones:
1. Decision Trees
2. Random Forest
3. Naive Bayes
4. Logistic Regression
5. Support Vector Machines (SVM)
6. k-Nearest Neighbors (k-NN)
7. Gradient Boosting
Here's a table that summarizes the advantages and disadvantages of each algorithm:

Algorithm | Advantages | Disadvantages
Decision Trees | Easy to understand and interpret. Can handle both categorical and numerical data. Can handle missing data. | Prone to overfitting. Can be sensitive to small changes in the data.
Random Forest | Can handle large datasets with many features. Reduces the risk of overfitting. Good for categorical data. | Can be slow to train. Can be difficult to interpret.
Naive Bayes | Very fast and requires minimal training data. Works well with high-dimensional datasets. Good for text analysis. | Assumes independence of features, which may not be true in reality.
Logistic Regression / Binary Classification | Simple and easy to interpret. Can work well with small datasets. Good for binary classification. | Can underperform when there are non-linear relationships between features and the target variable.
Support Vector Machines (SVM) | Effective for high-dimensional datasets. Works well with both linear and non-linear relationships between features and the target variable. Good for image classification. | Can be slow to train. Can be difficult to interpret.
k-Nearest Neighbors (k-NN) | Easy to understand and implement. Good for non-linear relationships between features and the target variable. | Can be slow to predict. Sensitive to irrelevant features.
Gradient Boosting | Good for datasets with many features. Reduces the risk of overfitting. Can handle missing data. | Can be slow to train. Can be difficult to interpret.

It's important to note that the performance of each algorithm will vary depending on the specific
dataset and problem at hand. Therefore, it's recommended to experiment with multiple
algorithms and compare their performance on the same dataset.

Feature Engineering

Basic
1. Normalization
2. Scaling
3. Cleaning

Advanced

Feature Importance: What are the important columns / features in my data set?
1. Feature importance: imblearn, covariance, model.coef_
2. Explainability: SHAP
SHAP scores the impact of each feature on the ultimate outcome, i.e., the model outcome ("your application was accepted!"). But why? Because you had a good credit score, good references, a stable job, and no past criminal record.

Shapley (SHAP) values score the importance or relevance of a feature in predicting the target variable.

We want to detect and identify the importance of each feature used in making the final prediction.
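A minimal sketch of both approaches (my illustration, not the chapter's code), assuming X is a pandas DataFrame of features and y the target, with the shap library installed:

import pandas as pd
import shap
from sklearn.ensemble import RandomForestClassifier

# Assume X (DataFrame of features) and y (target) already exist
model = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# 1. Built-in (Gini-based) feature importance
importances = pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)
print(importances)

# 2. SHAP values: per-feature contribution to each individual prediction
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)   # for multi-class models this may be one array per class
shap.summary_plot(shap_values, X)        # global view of feature impact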

Objectives:
1. EDAV: distribution of the data set → skewed ⇒ plot dataset distributions
2. Remove/reduce noise
a. Get rid of unnecessary or irrelevant features (columns)
b. Methods for determining feature importance
3. Increase signal
a. Up- or down-sampling (5 techniques)
b. Handling imbalanced data
c. Combining techniques
4. XAI: explainability, interpretability, transparency, trustworthy AI
a. Use SHAP


Source: The Boston Housing Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following describes the dataset columns (a loading sketch follows the list):

● CRIM - per capita crime rate by town


● ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
● INDUS - proportion of non-retail business acres per town.
● CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
● NOX - nitric oxides concentration (parts per 10 million)
● RM - average number of rooms per dwelling
● AGE - proportion of owner-occupied units built prior to 1940
● DIS - weighted distances to five Boston employment centres
● RAD - index of accessibility to radial highways
● TAX - full-value property-tax rate per $10,000
● PTRATIO - pupil-teacher ratio by town
● B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
● LSTAT - % lower status of the population
● MEDV - Median value of owner-occupied homes in $1000's
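A sketch of loading the dataset and deriving a binary target for classification (my illustration; the CSV file name boston_housing.csv and the median threshold on MEDV are assumptions):

import pandas as pd

# Assumed local CSV with the columns described above
boston = pd.read_csv("boston_housing.csv")

print(boston.shape)
print(boston.describe())

# For classification, one option is to binarize MEDV (above/below the median)
boston["EXPENSIVE"] = (boston["MEDV"] > boston["MEDV"].median()).astype(int)
print(boston["EXPENSIVE"].value_counts())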


Tree-based feature importance uses Gini impurity for classification tasks and variance reduction in the case of regression.
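For reference, a small sketch of computing Gini impurity from class labels (my own illustration, not from the chapter):

import numpy as np

def gini_impurity(labels):
    """Gini impurity = 1 - sum(p_k^2) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

print(gini_impurity([0, 0, 1, 1]))   # 0.5 (maximally mixed, two classes)
print(gini_impurity([0, 0, 0, 0]))   # 0.0 (pure node)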

Source : SHAP (SHapley Additive exPlanations) | by Cory Maklin | Medium


Dealing with Imbalanced Datasets

Source: Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by
Susan Li | Towards Data Science

See Machine Learning — Multiclass Classification with Imbalanced Dataset | by Javaid Nabi |
Towards Data Science

SMOTE can be used to partially rebalance a skewed dataset that has under-represented classes.
imblearn helps you resample or generate additional data:
● Upsample mandarin up to the level of apples
● Downsample everything else down to the mandarin representation level of the classes/targets
Or pass the class_weight parameter when constructing the estimator in sklearn (sample weights can also be passed to .fit() via sample_weight). A sketch of both options follows.
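A minimal sketch of both options (my illustration; assumes X_train and y_train exist and imbalanced-learn is installed):

from collections import Counter
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

# Option 1: oversample minority classes with SMOTE
sm = SMOTE(random_state=42)
X_res, y_res = sm.fit_resample(X_train, y_train)
print("before:", Counter(y_train), "after:", Counter(y_res))

# Option 2: reweight classes instead of resampling
clf = RandomForestClassifier(class_weight="balanced", random_state=42)
clf.fit(X_train, y_train)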


Source: Solving A Simple Classification Problem with Python — Fruits Lovers’ Edition | by
Susan Li | Towards Data Science

For a complete, soup-to-nuts reference, see Machine Learning with Python: Classification (complete tutorial) | by Mauro Di Pietro | Towards Data Science


The Muller Loop


Let’s approach classification as a set of experiments. Please fork / clone:
https://github.com/aarsanjani/applied-ml-2020/blob/master/MullerLoop.ipynb

names = [
"Nearest Neighbors", "Linear SVM", "RBF SVM", #"Gaussian Process",
"Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
"Naive Bayes", "QDA"
]

Classification

# Modified by Ali Arsanjani from ...
# Code source: Gaël Varoquaux
#              Andreas Müller
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

from time import time

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import make_moons, make_circles, make_classification
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

names = ["Nearest Neighbors", "Linear SVM", "RBF SVM",  # "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [
    KNeighborsClassifier(2),
    SVC(kernel="linear", C=0.025),
    SVC(gamma=2, C=1),
    # GaussianProcessClassifier(1.0 * RBF(1.0)),
    DecisionTreeClassifier(max_depth=5),
    RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
    MLPClassifier(alpha=1, max_iter=1000),
    AdaBoostClassifier(),
    GaussianNB(),
    QuadraticDiscriminantAnalysis()]

# X_data_reshape and y_data are assumed to be defined in earlier notebook cells
# (your feature matrix and target vector).
X, y = X_data_reshape, y_data

X = StandardScaler().fit_transform(X)
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=.2)

# TODO (Apply): add cross-validation

max_score = 0.0
max_class = ''
# Müller-style loop: fit each classifier, score it on the held-out test set,
# and keep track of the best one.
for name, clf in zip(names, classifiers):
    start_time = time()
    clf.fit(X_train, y_train)
    score = 100.0 * clf.score(X_test, y_test)
    print('Classifier = %s, Score (test, accuracy) = %.2f,' % (name, score),
          'Training time = %.2f seconds' % (time() - start_time))

    if score > max_score:
        clf_best = clf
        max_score = score
        max_class = name

print(80 * '-')
print('Best --> Classifier = %s, Score (test, accuracy) = %.2f'
      % (max_class, max_score))

# plot the output of the various algorithms
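As a sketch for the cross-validation TODO above (my addition, not part of the original notebook), the hold-out score in the loop can be replaced or complemented by cross_val_score, reusing the names, classifiers, X, and y defined above:

from sklearn.model_selection import cross_val_score

for name, clf in zip(names, classifiers):
    scores = 100.0 * cross_val_score(clf, X, y, cv=5)   # 5-fold cross-validation
    print('Classifier = %s, CV accuracy = %.2f +/- %.2f'
          % (name, scores.mean(), scores.std()))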

Muller Loop for Regression [next topic]

The Notion of Experiments and your AI/ML Lab


Every data science activity, training/testing run, or PoC is essentially an experiment.

1. Formulate a narrative of what you are trying to do: what is the benefit to the business / humankind?
2. Formulate some questions within your group: what are the questions that we want to answer by running these experiments/algorithms with these datasets?
3. Run the experiments and log the results of all experiments!
4. We are trying to determine which experiment works best.
5. Put the results in a table (a logging sketch follows this list).
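A simple sketch of such an experiment log using pandas (my illustration; the column names are assumptions and the numbers are placeholders, not real results):

import pandas as pd

results = []   # one record per experiment

# Append a record after each run; the zeros below are placeholders, not real results.
results.append({"dataset": "ds1", "algorithm": "k-NN", "f1": 0.0, "precision": 0.0, "recall": 0.0, "auc_roc": 0.0})
results.append({"dataset": "ds1", "algorithm": "SVC", "f1": 0.0, "precision": 0.0, "recall": 0.0, "auc_roc": 0.0})

results_table = pd.DataFrame(results)
print(results_table.sort_values("f1", ascending=False))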

Clustering
Euclidean Distance vs Fractal Distance

For every iteration, show the data and the objective-function results for your clusters.
Look at the results of your multiple experiments and then write up a hypothesis explaining why you think the explanation you have produced is a valid one.

Algorithm | F1 | Precision | Recall | AUC/ROC
knn | | | |
svc | | | |

Example for classification


Algorithm | Metric | Metric 2


● Data Visualization and Distributions
● Feature Importance (SHAP, permutations)
● Gini score, entropy, signal vs. noise
● Class imbalance
● Upsampling / downsampling (SMOTE)

How well did our classifications do?


The 5 Classification Evaluation metrics every Data Scientist must know | by Rahul Agarwal

Confusion Matrix: Performance Metrics for Classification problems in Machine Learning | by Mohammed Sunasra | Medium

Accuracy?? → F1 Score

AUC / ROC

Use the Random Forest classifier's Gini-based feature importance to identify which features contribute most to the predictions.
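A sketch of computing these metrics with scikit-learn (my illustration; assumes a fitted clf and a test split, and a binary target; multi-class problems need an averaging strategy for f1 and ROC):

from sklearn.metrics import (accuracy_score, f1_score, confusion_matrix,
                             classification_report, roc_auc_score)

y_pred = clf.predict(X_test)

print(confusion_matrix(y_test, y_pred))
print("accuracy:", accuracy_score(y_test, y_pred))
print("f1:", f1_score(y_test, y_pred))           # use average='macro' for multi-class
print(classification_report(y_test, y_pred))

# AUC/ROC needs predicted probabilities, so the classifier must support predict_proba
y_proba = clf.predict_proba(X_test)[:, 1]
print("AUC/ROC:", roc_auc_score(y_test, y_proba))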


Experimentation
Project work
1. Find datasets 2 and 3 and amalgamate them; run classification (each team member should try to pick a different classification algorithm).
a. Show how performance is enhanced with each amalgamation.

Dataset 1
Algorithm | Scores

Dataset 1+2
Algorithm | Scores

Dataset 1+2+3
Algorithm | Scores


2. Go in depth with at least one algorithm: you can use a classification algorithm of your choice.
3. Run a Muller loop on your dataset 1, ds1+ds2, ds1+ds2+ds3, i.e., your incrementally amalgamated datasets, for classification.
a. Put the results in a table.


Section 2: Dealing with Class Imbalance


● Feature Importance (SHAP, permutations)
● Gini score, entropy, signal vs. noise
● Class imbalance
● Upsampling / downsampling (SMOTE)

© 2018-2023, Dr. Ali Arsanjani
