Machine Learning (2): Inteligência Artificial e Cibersegurança (INACS)
• Feature Imputation
• Dealing with missing values (NaNs, NULLs, Empty Strings…)
• Feature Encoding
• ML algorithms deal with a mathematical representation of the world
• How can we transform categorical data into an equivalent numeric format?
Data Preparation
• Feature Normalization
• Algorithms can be sensitive to different scales of values
• How can we deal with such disparities?
• Feature Selection
• Not every element of a dataset is equally important
• How to decide?
• In practice:
• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
• pandas’ apply function: DataFrame.apply with axis=0 applies the function to each column, returning a new Series or DataFrame; Series.apply maps the function over every value of a column
• E.g., df['greater_than_ten'] = df['values'].apply(lambda x: 1 if x > 10 else 0)
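• A minimal sketch of both uses of apply (the DataFrame and column names below are hypothetical, for illustration only):

```python
import pandas as pd

# Hypothetical toy DataFrame, used only to illustrate apply
df = pd.DataFrame({"values": [3, 12, 7, 25]})

# Series.apply maps a function over every value of a single column (no axis argument)
df["greater_than_ten"] = df["values"].apply(lambda x: 1 if x > 10 else 0)

# DataFrame.apply with axis=0 passes each column (as a Series) to the function
column_maxima = df.apply(lambda col: col.max(), axis=0)

print(df)
print(column_maxima)
```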
Feature Imputation
• Feature imputation is the process of filling in missing values in a dataset, since most ML models cannot handle them on their own
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
• The “strategy” parameter of sklearn’s “SimpleImputer” supports multiple imputation strategies ('mean', 'median', 'most_frequent', 'constant')
• E.g., imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
• https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html
• sklearn’s “KNNImputer” imputes missing values using the values of each sample’s nearest neighbors
• E.g., imputer = KNNImputer(n_neighbors=2)
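• A minimal sketch of both imputers on a toy feature matrix (the data is illustrative only):

```python
import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

# Small illustrative feature matrix with missing values
X = np.array([[1.0, 2.0],
              [np.nan, 3.0],
              [7.0, np.nan],
              [4.0, 6.0]])

# SimpleImputer: replace NaNs in each column with that column's mean
imp_mean = SimpleImputer(missing_values=np.nan, strategy="mean")
X_mean = imp_mean.fit_transform(X)

# KNNImputer: fill NaNs using the values of the 2 nearest samples
imputer = KNNImputer(n_neighbors=2)
X_knn = imputer.fit_transform(X)

print(X_mean)
print(X_knn)
```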
Feature Encoding
• Feature encoding is the process of transforming categorical variables into equivalent numeric
representations – ML algorithms live in a mathematical world
• Ordinal Encoding
• Ordinal encoding is the process of converting each distinct string into a distinct
number incrementally
• E.g., “A” is 1, “B” is 2, “C” is 3, etc…
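• A minimal sketch using sklearn’s OrdinalEncoder (an assumption, not one of the links in these slides; note that it numbers categories from 0 rather than 1):

```python
from sklearn.preprocessing import OrdinalEncoder

# Illustrative categorical column; OrdinalEncoder assigns 0, 1, 2, ... per distinct value
X = [["A"], ["C"], ["B"], ["A"]]

encoder = OrdinalEncoder()
X_encoded = encoder.fit_transform(X)  # "A" -> 0.0, "B" -> 1.0, "C" -> 2.0
print(X_encoded)
```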
Feature Normalization
• Min-Max Normalization
• Re-scales every feature into the [0, 1] interval
• Standard Scaling
• Re-scales data to have zero mean and a standard deviation of one
Feature Normalization
• How to choose?
• Normalization is good when your data does not follow a Gaussian distribution.
• Good for K-Nearest Neighbors and Neural Networks (do not assume any distribution)
• Standardization is good when the data follows a Gaussian distribution.
• Gaussian Naive Bayes (assumes normal distribution)
• Most of the time, the best approach is to experiment and draw insightful conclusions from the results…
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
• Sklearn’s “MinMaxScaler”
• https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
• Sklearn’s “StandardScaler”
Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization, https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/ [Online]
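• A minimal sketch of both scalers on illustrative data with very different scales:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Illustrative features with very different scales
X = np.array([[1.0, 1000.0],
              [2.0, 1500.0],
              [3.0, 500.0]])

# Min-Max normalization: each feature is re-scaled into the [0, 1] interval
X_minmax = MinMaxScaler().fit_transform(X)

# Standard scaling: each feature gets zero mean and unit standard deviation
X_standard = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_standard)
```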
Feature Selection
• Feature selection is the process of identifying the most valuable/significant features of a
dataset
• It can be useful to reduce the size of the algorithm’s input and clear out some noise, since variables
that are not useful are discarded from the learning process
• Feature Importance
• Some models (e.g., Random Forest) allow you to determine which features
contributed the most to predicting the target variable’s values
• Quickly creating one of such models can be useful for understanding which features
are most valuable
Feature Selection
• Dimensionality Reduction
• Some techniques, such as Principal Component Analysis, take many features and use analytical
methods (linear algebra) to reduce them to fewer representative features while keeping as much
information as possible
• Be aware that the resulting features are numerical values that can no longer be mapped to
anything meaningful in the domain
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
• Sklearn’s “RandomForestClassifier”
• https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
• Sklearn’s “RandomForestClassifier” demo for feature importance measures
• https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
• Sklearn’s “PCA”
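• A minimal sketch of both approaches on a synthetic dataset (the dataset and parameter values are illustrative, not taken from the slides):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

# Synthetic dataset standing in for a prepared INACS dataset
X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)

# Feature importance: a quickly trained Random Forest scores each feature's contribution
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
print(forest.feature_importances_)  # one score per feature, summing to 1

# Dimensionality reduction: PCA projects the 10 original features onto 3 components
pca = PCA(n_components=3)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_)  # variance retained by each component
```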
Data Imbalance
• “The hitch with imbalanced datasets is that standard classification learning algorithms are
often biased towards the majority classes (known as “negative”) and therefore there is a
higher misclassification rate in the minority class instances (called the “positive” class).”
• Data imbalance refers to a severe skew in the class distribution (e.g., a 1:1000 or 1:10000 ratio)
• Most standard ML models struggle with imbalanced datasets
• Algorithms learn that minority classes are not as important as the majority class (!)
• How to solve?
• Random Oversampling
• Randomly duplicate examples of minority classes
• Random Undersampling
• Randomly delete examples of the majority class
• Oversampling + Undersampling
• Sometimes it is best to perform both…
Data Imbalance
• In practice:
• https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
• imblearn’s “SMOTE”
• https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html
• imblearn’s “RandomOverSampler”
• https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html
• imblearn’s “RandomUnderSampler”
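• A minimal sketch of all three resamplers on an artificially imbalanced dataset (the synthetic data is illustrative only):

```python
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

# Heavily imbalanced synthetic dataset (roughly 99:1), for illustration only
X, y = make_classification(n_samples=5000, weights=[0.99], random_state=0)
print(Counter(y))

# Oversampling: SMOTE synthesizes new minority samples, RandomOverSampler duplicates them
X_sm, y_sm = SMOTE(random_state=0).fit_resample(X, y)
X_ov, y_ov = RandomOverSampler(random_state=0).fit_resample(X, y)

# Undersampling: randomly drop majority-class samples
X_un, y_un = RandomUnderSampler(random_state=0).fit_resample(X, y)

print(Counter(y_sm), Counter(y_ov), Counter(y_un))
```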
Modeling
• Modeling is the process of finding which algorithm best suits the data
• Select Modeling Techniques
• Based on the use case and dataset some algorithms can be more appropriate than
others…
• Examples
• Support Vector Machines
• Computationally expensive and do not deal well with large datasets
• However, they perform well for large feature spaces (many features)
• Artificial Neural Networks
• Typically achieve modest performance for small datasets
• For large datasets they generally perform better than any other model
Modeling
• We will only work with classification algorithms in INACS
• Some will be analyzed in detail in T3 to give some insights about what happens under
the hood – K-Nearest Neighbors, Decision Trees, Tree-Ensembles
• We have a dataset…
• How can we train the model?
• How can we evaluate the model reliably?
• How can we compare multiple versions of the same model?
• How can we compare multiple models against each other?
Modeling
• Recommended Methodology for INACS
• From 100% of data
• Set 30% aside to serve as the test set (hold-out method)
• The test set is an unseen portion of the dataset that is used to assess how
a given algorithm would perform for “real world” unseen data…
• Different algorithms should be compared against the test set
• So:
• We use the train set to fit multiple algorithms with different
combinations of hyperparameters
• Then we evaluate their results on the validation set (a portion held out from the train set)
• We choose the algorithm with the best combination of
hyperparameters to evaluate on the test set
• We repeat this process for multiple algorithms
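• A minimal sketch of this split (the 20% validation fraction and the stratify option are choices of this sketch, not prescribed by the slides; the synthetic data stands in for your prepared dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder data; in INACS, X and y come from your prepared dataset
X, y = make_classification(n_samples=1000, random_state=0)

# Hold-out method: set 30% aside as the unseen test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# A further split of the training data gives a validation set for hyperparameter tuning
X_fit, X_val, y_fit, y_val = train_test_split(
    X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)
```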
Modeling
[Figure: the dataset split into a train set and a test set]
• Don’t forget: the test set should be as representative as possible
of the “real world”
Modeling
• Tuning Hyperparameters
• Hyperparameters are degrees of freedom inherent to any ML algorithm
• E.g., number of estimators in a Random Forest, maximum depth of a Decision Tree
• With some experience, intuition and educated guesses let us determine
which hyperparameters are likely to be more important than others for a given situation
• E.g., If model is overfitting, then reducing the maximum depth of a Decision Tree
can be helpful…
• overfitting vs underfitting related to bias-variance trade-off
• high variance and low bias vs high bias and low variance
• reducing the depth of a tree helps to reduce high variance
• Most of the time we still get several possible values for multiple hyperparameters that
we could experiment with for better performance
• How to get the best model configuration?
Modeling
• Grid Search
• Brute-force approach where every possible configuration is tried
• E.g., with 3, 5 and 2 candidate values for three hyperparameters, we have 3 * 5 * 2 = 30 possibilities
• If fit(X, y) takes 30 minutes for each possibility…
• Hyperparameter tuning would take 900 minutes or 15 hours
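• A minimal sketch using sklearn’s GridSearchCV (an assumption, not linked in these slides); the candidate values below are hypothetical, chosen only to match the 3 * 5 * 2 = 30 count:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the prepared train set
X_train, y_train = make_classification(n_samples=1000, random_state=0)

param_grid = {
    "n_estimators": [50, 100, 200],    # 3 values
    "max_depth": [2, 4, 6, 8, 10],     # 5 values
    "criterion": ["gini", "entropy"],  # 2 values
}

# Grid search: every one of the 30 configurations is fitted and scored (here with 3-fold CV)
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=3)
search.fit(X_train, y_train)
print(search.best_params_)
```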
Evaluation
• Example:
• Given: normal = class 0, attack = class 1
• Precision: Measure of samples that we correctly identified as class 1 out of all the samples
we predicted to be class 1
• Of all the instances the model predicted as attacks, how many really were attacks?
• Recall: For all the samples who belong to class 1, recall tells us how many we correctly
identified as belonging to class 1
• How many of all the attack instances was the algorithm able to identify?
Evaluation
• F1-score: It is the harmonic mean between precision and recall
• It is widely used when we don’t want to favor one over another
• It is a very reliable metric for scenarios with high class imbalance
• In practice:
• https://scikit-learn.org/stable/modules/model_evaluation.html
• Multiple evaluation metrics that can be used
• https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
• Classification report that provides a summary comprised of multiple metrics
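• A minimal sketch of these metrics on illustrative labels (the label vectors below are made up for the example):

```python
from sklearn.metrics import classification_report, f1_score, precision_score, recall_score

# Illustrative labels: normal = class 0, attack = class 1
y_true = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred = [0, 0, 1, 1, 1, 0, 1, 0]

print(precision_score(y_true, y_pred))  # of everything flagged as attack, how much really was
print(recall_score(y_true, y_pred))     # of all real attacks, how many were found
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall

# Summary of precision, recall, F1-score and support per class
print(classification_report(y_true, y_pred, target_names=["normal", "attack"]))
```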
Deployment
• After the whole process in the “Notebook” we get a trained model saved as a temporary
variable in RAM
• If we want to use it, we need to persist it and include it in an already existing or a completely
new software system
• And then, we have to manage the algorithm’s behavior over time…
• Deploying and managing ML algorithms over time is a complex topic
• It can involve retraining routines, redlines (thresholds) on the model’s performance, A/B testing, …
• For INACS you are encouraged to export your models and pipelines to build a small
prototype that can work in a simulated manner
• E.g., by continuously feeding data from the test set into a small program that runs the
inference and displays the results in the command line
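• A minimal sketch of such a prototype, using joblib (one of the persistence options discussed in sklearn’s documentation); the file name and the toy training pipeline are hypothetical:

```python
import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder training, standing in for the full notebook pipeline
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# Persist the trained model to disk, then reload it inside the "prototype"
joblib.dump(model, "model.joblib")
loaded_model = joblib.load("model.joblib")

# Simulated deployment: feed test-set rows one by one and print each prediction
for row in X_test[:5]:
    print(loaded_model.predict(row.reshape(1, -1))[0])
```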
Deployment
• In practice:
• https://scikit-learn.org/stable/modules/model_persistence.html
• Sklearn’s official persistence suggestions
• https://cloud.google.com/ai-platform/prediction/docs/exporting-for-prediction
• Deploying models of different frameworks into Google Cloud’s AI Platform