
Machine Learning (2)

ARTIFICIAL INTELLIGENCE AND CYBERSECURITY (INACS)


nunal@isep.ipp.pt
oms@isep.ipp.pt
Data Preparation
• Most of the time, data can’t be directly used by algorithms, requiring several steps of
preprocessing…
• Feature Engineering
• Transform data into meaningful representations that help to better understand the
problem at hand

• Feature Imputation
• Dealing with missing values (NaNs, NULLs, Empty Strings…)

• Feature Encoding
• ML algorithms deal with a mathematical representation of the world
• How can we transform categorical data into an equivalent numeric format?
Data Preparation
• Feature Normalization
• Algorithms can be sensitive to different scales of values
• How can we deal with such disparities?

• Feature Selection
• Not every element of a dataset is equally important
• How to decide?

• Dealing with Data Imbalances


• An imbalanced dataset can cause a classifier to be biased
Feature Engineering
• Feature engineering is the process of changing existing features or deriving new ones to
better reflect the problem at hand
• Example 1:
• When dealing with seasonal data (energy forecasting), time-based information can
be valuable
• However, algorithms are unable to deal with Python's datetime objects directly
• We can convert one datetime column into multiple columns, deriving features such
as day of the week, month of the year, etc ...
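• A minimal pandas sketch of this idea (the DataFrame and the 'timestamp' column name are hypothetical, not from the slides):

import pandas as pd

# Hypothetical DataFrame with a datetime column named 'timestamp'
df = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-01 10:00', '2024-06-15 23:30'])})

df['day_of_week'] = df['timestamp'].dt.dayofweek   # 0 = Monday, ..., 6 = Sunday
df['month'] = df['timestamp'].dt.month             # 1-12
df['hour'] = df['timestamp'].dt.hour               # 0-23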
Feature Engineering
• Example 2:
• In the titanic dataset
• https://www.kaggle.com/c/titanic
• The goal is to predict whether people are likely to survive or not based on several
features: age, gender, socio-economic class, etc…
• One of the most accurate models for this data set engineered a feature:
• “Is_women_or_child” which was True if the person was a woman or a child and
False otherwise

• In practice:
• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
• pandas' apply function: DataFrame.apply with axis=0 applies a function to each column,
returning a new Series or DataFrame; for element-wise transformations of a single column, use Series.apply
• E.g., df['greater_than_ten'] = df['values'].apply(lambda x: 1 if x > 10 else 0)
Feature Imputation
• Feature imputation is the process of filling in missing values in datasets, as most ML models
can't handle them on their own

• Single value imputation


• Fill missing values for each individual column
• Mean or median (numerical)
• Most frequent element (numerical or categorical)
• Constant (numerical or categorical)

• Multiple value imputation


• Fill multiple missing values for each row at once with an algorithm that finds
similarities with existing examples
• K-Nearest Neighbors imputation: filling data with values from other, similar samples
Feature Imputation
• Delete the row
• If only a few rows have missing values and your dataset is broad enough not to lose
representativeness, just delete the rows with missing values

• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
• The “strategy” parameter of sklearn's “SimpleImputer” supports several single-value strategies (mean, median, most_frequent, constant)
• E.g., imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

• https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html
• sklearn's “KNNImputer” performs multiple value imputation
• E.g., imputer = KNNImputer(n_neighbors=2)
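• A minimal sketch of both imputers (the toy matrix X is hypothetical):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])   # hypothetical matrix with missing values

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X_single = imp_mean.fit_transform(X)     # each NaN replaced by its column's mean

imputer = KNNImputer(n_neighbors=2)
X_multi = imputer.fit_transform(X)       # each NaN replaced using the most similar rows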
Feature Encoding
• Feature encoding is the process of transforming categorical variables into equivalent numeric
representations – ML algorithms live in a mathematical world
• Ordinal Encoding
• Ordinal encoding is the process of converting each distinct string into a distinct
number incrementally
• E.g., “A” is 1, “B” is 2, “C” is 3, etc…

• Does it suit all kinds of categorical variables?


• What about nominal categorical variables?
• E.g., blue, red, green, white, etc…
Feature Encoding
• Ordinal Encoding (Practical Example)
• Numbers hold meaningful relationships
• Distance and Order (!)
• Is White twice as much as Red?
• Is Blue = White – (Red + Blue)?

Color   Encoding
Blue    1
Red     2
Red     2
Green   3
White   4

• Making a model “run” is insufficient to grasp all insights of the underlying domain (!)
• This kind of encoding messes up some algorithms, for example K-Nearest Neighbors,
which relies on distance measures between Cartesian coordinates

• How to encode nominal features correctly?


Feature Encoding
• One-Hot Encoding
• Each distinct categorical variable assumes a new column with binary value, 1 if
present, 0 if absent
Row Blue Red Green White
0 1 0 0 0
1 0 1 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
• Can it go wrong?
• Too many distinct values can lead to an unmanageable number of columns…
• Memory issues, sparse datasets, curse of dimensionality…
Feature Encoding
• Binary Encoding
• Most of the time One-Hot Encoding is preferable, but the number of resulting columns
can make it an impossible choice
• Binary encoding allows encoding the same number of distinct categorical values
into fewer columns
1. Every categorical value is encoded into a number (same as Ordinal Encoding)
2. Each number is transformed into binary
3. The number of bits required to represent that set of values equals the
number of resulting columns
Feature Encoding
• Binary Encoding (Practical Example)
• Blue = 001
• Red = 010
• Green = 011
• White = 100
• 3 bits are required = 3 columns

Row   Bin1  Bin2  Bin3
0     0     0     1
1     0     1     0
2     0     1     0
3     0     1     1
4     1     0     0

• If too many columns are generated anyway, one must search for alternatives
• E.g., Target Encoding

• Ordinal Encoding can still be used if we run out of options


• However, we need to understand its implications and
choose an algorithm that can be robust to such noise…
Feature Encoding
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
• Sklearn’s “OrdinalEncoder”
• https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
• Sklearn’s “OneHotEncoder”
• https://contrib.scikit-learn.org/category_encoders/
• Category Encoders family (Ordinal, One-Hot, Binary)

• Note: Sklearn's “LabelEncoder” is different from “OrdinalEncoder”

• They provide nearly the same functionality but should be used for different purposes
• “LabelEncoder” assumes a 1D structure (it is meant for target labels), while “OrdinalEncoder” works on 2D feature arrays
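• A minimal sketch of the three encoders (the toy DataFrame is hypothetical; BinaryEncoder comes from the category_encoders package):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
import category_encoders as ce   # pip install category_encoders

df = pd.DataFrame({'color': ['Blue', 'Red', 'Red', 'Green', 'White']})   # hypothetical nominal feature

ordinal = OrdinalEncoder().fit_transform(df[['color']])           # one integer per distinct value
onehot = OneHotEncoder().fit_transform(df[['color']]).toarray()   # one binary column per distinct value
binary = ce.BinaryEncoder(cols=['color']).fit_transform(df)       # distinct values packed into a few bit columns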
Feature Normalization
• Some algorithms don’t work well when features have distinct scales
• E.g., Salary is in the “thousands”, Age is in the “tens”, etc …

• Gradient-Descent based algorithms


• Artificial Neural Networks, Logistic Regression, Linear Regression, etc…
• Convergence issues
• Distance-based algorithms
• K-Nearest Neighbors, Support Vector Machines
• Implicit weighting in decision making
• Bigger values have greater impact in distance calculations

• What about Tree-based algorithms?


• They represent knowledge as if-then rules, so they are insensitive to the scale of their
features…

Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization, https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/ [Online]
Feature Normalization
• Min-Max Normalization
• Re-scales every feature into [0, 1] interval

• Standard Scaling
• Re-scales data into having 0 mean and a standard deviation of one
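• As a reference, the two rescalings correspond to the standard formulas, written here as a small numpy sketch (the values are illustrative, not from the slides):

import numpy as np

x = np.array([10.0, 20.0, 40.0])                 # hypothetical feature values
x_minmax = (x - x.min()) / (x.max() - x.min())   # Min-Max: rescaled into [0, 1]
x_standard = (x - x.mean()) / x.std()            # Standard Scaling: mean 0, standard deviation 1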

Feature Normalization
• How to choose?
• Normalization is good when your data does not follow a Gaussian distribution.
• Good for K-Nearest Neighbors and Neural Networks (do not assume any distribution)
• Standardization is good when the data follows a Gaussian distribution.
• Gaussian Naive Bayes (assumes normal distribution)

• Most of the time, the best way is to experiment and draw insightful conclusions from the results…

• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
• Sklearn’s “MinMaxScaler”
• https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
• Sklearn’s “StandardScaler”
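• A minimal usage sketch of both scalers, assuming a hypothetical numeric feature matrix X:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1000.0, 25.0], [2000.0, 40.0], [3000.0, 60.0]])   # hypothetical salary/age features

X_minmax = MinMaxScaler().fit_transform(X)       # each column rescaled into [0, 1]
X_standard = StandardScaler().fit_transform(X)   # each column with mean 0 and std 1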

Feature Selection
• Feature selection is the process of identifying the most valuable/significant features of a
dataset
• It can be useful to reduce the size of the algorithm's input and clear some noise, since variables
that are not useful are discarded from the learning process

• Feature Importance
• Some models (e.g., Random Forest) allow you to determine which features
contributed the most to predicting the target variable's values
• Quickly creating one of such models can be useful for understanding which features
are most valuable
Feature Selection
• Dimensionality Reduction
• Some methods, such as Principal Component Analysis (PCA), take many features and use analytic
techniques (linear algebra) to reduce them to fewer representative features while keeping as much
meaning as possible
• Be aware that the resulting features are numerical values that can no longer be mapped to
anything meaningful in the domain

• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
• Sklearn’s “RandomForestClassifier”
• https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
• Sklearn’s “RandomForestClassifier” demo for feature importance measures
• https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
• Sklearn’s “PCA”
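• A minimal sketch of both ideas on a placeholder synthetic dataset (make_classification is used only to have something to fit):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)   # placeholder dataset

# Feature importance: fit a quick Random Forest and inspect feature_importances_
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(rf.feature_importances_)   # one importance score per feature

# Dimensionality reduction: project the 10 features onto 3 principal components
X_reduced = PCA(n_components=3).fit_transform(X)
print(X_reduced.shape)           # (500, 3)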
Data Imbalance
• “The hitch with imbalanced datasets is that standard classification learning algorithms are
often biased towards the majority classes (known as “negative”) and therefore there is a
higher misclassification rate in the minority class instances (called the “positive” class).”
• Deals with severe skew in class distribution (e.g., 1:1000 or 1:10000 ratio)
• Most standard ML models struggle with imbalanced datasets
• Algorithms learn that minority classes are not as important as the majority class (!)

• In the context of Cybersecurity (NIDS),


• Is it easier to capture examples of normal traffic or of attack-related traffic?
• Is it more important for an algorithm to recognize normal behavior or anomalous
behavior?

• How to solve?

Tour of Data Sampling Methods for Imbalanced Classification, https://machinelearningmastery.com/data-sampling-methods-for-imbalanced-classification/ [Online]


Data Imbalance
• Oversampling
• Oversampling methods duplicate or synthesize new examples for minority classes

• Random Oversampling
• Randomly duplicate examples of minority classes

• Synthetic Minority Oversampling Technique (SMOTE)


• Selects examples that are close in the feature space, draws a line between
them, and generates a new sample as a point along that line
Data Imbalance
• SMOTE (Practical Example)
Data Imbalance
• Undersampling
• Undersampling methods delete elements of the majority class

• Random Undersampling
• Randomly delete examples of the majority class

• Condensed Nearest Neighbor Rule (CNN)


• Remove redundant examples from high-density regions of the majority class

• Oversampling + Undersampling
• Sometimes it is best to perform both…
Data Imbalance
• In practice:
• https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
• imblearn’s “SMOTE”
• https://imbalanced-
learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html
• imblearn’s “RandomOverSampler”
• https://imbalanced-
learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html
• imblearn’s “RandomUnderSampler”
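• A minimal resampling sketch on a placeholder imbalanced dataset (the roughly 9:1 class ratio is illustrative; imblearn must be installed):

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)   # hypothetical imbalanced dataset

X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)                      # synthesize minority samples
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)      # duplicate minority samples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)   # delete majority samples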
Modeling
• Modeling is the process of finding which algorithm best suits the data
• Select Modeling Techniques
• Based on the use case and dataset some algorithms can be more appropriate than
others…
• Examples
• Support Vector Machines
• Computationally expensive and do not deal well with large datasets
• However, they perform well for large feature spaces (many features)
• Artificial Neural Networks
• Typically achieve modest performance for small datasets
• For large datasets they generally perform better than any other model
Modeling
• We will only work with classification algorithms in INACS
• Some will be analyzed in detail in T3 to give some insights about what happens under
the hood – K-Nearest Neighbors, Decision Trees, Tree-Ensembles

• Good general choices for Practical Work:


• K-Nearest Neighbors
• Decision Tree
• Tree-Ensembles
• Random Forest
• AdaBoost
• XGBoost
• Support Vector Machines
• If the dataset is not too big
• Naïve Bayes
Modeling
• Generate Test Design
• This step is part of the Modeling phase in CRISP-DM but in fact, it should be the
first thing to be done

• We have a dataset…
• How can we train the model?
• How can we evaluate the model reliably?
• How can we compare multiple versions of the same model?
• How can we compare multiple models against each other?
Modeling
• Recommended Methodology for INACS
• From 100% of data
• Set 30% aside to serve as the test set (hold-out method)
• The test set is an unseen portion of the dataset that is used to assess how
a given algorithm would perform for “real world” unseen data…
• Different algorithms should be compared against the test set

• What are those so-called results?


• Mathematical measures are used to determine performance
• For example, Accuracy measures the proportion of correctly classified
samples out of all test samples
• 70% Accuracy: the algorithm guessed right 70% of the time (70 in
100)
• More on that later…
Modeling
• Recommended Methodology for INACS
• From 100% of data
• Remaining 70% will be used as train set
• However, most algorithms have Hyperparameters
• Number of neighbors in K-Nearest Neighbors algorithm
• Number of trees of a Random Forest
• We can have multiple versions of the same algorithm
• Comparing these versions relates to a process called
Hyperparameter Tuning
• These versions should not be compared using the test set, as that
would make the result biased (i.e., finding the best parameters
for that particular set of data…)
Modeling
• Recommended Methodology for INACS
• From 100% of data
• Remaining 70% will be used as train set
• A simple strategy for Hyperparameter Tuning is to split off another 30% of
the train set to make a validation set (hold-out method)

• So:
• We use the train set to fit multiple algorithms with different
combinations of hyperparameters
• Then we evaluate their results in the validation set
• We choose the algorithm with the best combination of
hyperparameters to evaluate in the test set
• We repeat this process for multiple algorithms
Modeling
[Diagram: hold-out splits — train set | test set; and train set | validation set | test set]
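• A minimal sketch of the two hold-out splits with train_test_split (the dataset is a placeholder):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)   # placeholder dataset

# 1st split: set 30% of all data aside as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# 2nd split: 30% of the remaining train set becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42, stratify=y_train)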


Modeling
• Recommended Methodology for INACS
• What is the problem with this methodology?
• The train set is reduced by a whole lot…
• Less data usually contributes to less powerful models
• The hold-out method is only suitable for large datasets where lack of
representativeness isn't a problem

• We can solve this using the k-fold cross validation method

• k-fold cross validation can help to replace hold-out for hyperparameter tuning
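• A minimal cross-validation sketch, evaluating one hyperparameter configuration on the train set only (k = 5 folds; the estimator choice and the placeholder train set are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_train, y_train = make_classification(n_samples=700, random_state=42)   # placeholder train set

# 5-fold cross validation: each fold takes a turn as the validation set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5)
print(scores.mean())   # average validation score across the 5 folds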
Modeling
• Recommended Methodology for INACS
• Be careful with preprocessing operations such as normalization, undersampling…
• Why?
• Most preprocessing operations can't be applied carelessly to the entire dataset,
or they make the whole methodology biased anyway…
• Fitting sklearn's “MinMaxScaler” on the entire dataset means that we are
predicting the future
• Finding the minimum and maximum of all columns of the dataset
• The resulting scaling will be overly perfect…
• In a “real world” scenario it does not usually work like that

• Don't forget: the test set should be as representative as possible
of the “real world”
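• A minimal leakage-free sketch: the scaler is fit on the train set only, and putting it in a Pipeline makes this automatic (the dataset and step names are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=42)   # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The scaler only ever sees the train set; the test set is just transformed
pipe = Pipeline([('scaler', MinMaxScaler()), ('classifier', KNeighborsClassifier())])
pipe.fit(X_train, y_train)          # scaler and model fitted on train data only
print(pipe.score(X_test, y_test))   # test data is scaled with the train set's min/max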
Modeling
• Tuning Hyperparameters
• Hyperparameters are degrees of freedom inherent to any ML algorithm
• E.g., number of estimators in a Random Forest, maximum depth of a Decision Tree
• By intuition/educated guesses we can, with a certain level of experience, determine
which hyperparameters may be more important than others for a given situation
• E.g., if the model is overfitting, then reducing the maximum depth of a Decision Tree
can be helpful…
• overfitting vs underfitting relates to the bias-variance trade-off
• high variance and low bias vs high bias and low variance
• reducing the depth of a tree helps to reduce high variance

• Most of the time we still get several possible values for multiple hyperparameters that
we could experiment with for better performance
• How to get the best model configuration?
Modeling
• Grid Search
• Brute-force approach where every configuration is tried

• For the following grid of a pipeline that considers PCA + KNN:


• "dimreduction__n_components": [8, 16, 32]
• "classifier__n_neighbors": [4, 8, 16, 32, 64]
• "classifier__weights": ["uniform", "distance"]

• We have 3 * 5 * 2 = 30 possibilities
• If fit(X, y) takes 30 minutes for each possibility…
• Hyperparameter tuning would take 900 minutes or 15 hours

• The best configuration is guaranteed (!)


• Computation time can get out of hand (!)
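• A minimal sketch of the grid above with GridSearchCV and a PCA + KNN pipeline (the step names match the parameter prefixes on the slide; the dataset is a placeholder):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=1000, n_features=64, random_state=42)   # placeholder train set

pipe = Pipeline([('dimreduction', PCA()), ('classifier', KNeighborsClassifier())])
param_grid = {
    'dimreduction__n_components': [8, 16, 32],
    'classifier__n_neighbors': [4, 8, 16, 32, 64],
    'classifier__weights': ['uniform', 'distance'],
}

# Exhaustive search over the 30 combinations, each evaluated with 5-fold cross validation
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
print(search.best_params_)

• RandomizedSearchCV (see next slide) has the same interface but takes param_distributions plus an n_iter budget and tries only a sample of the grid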
Modeling
• Random Search
• Experiment with n random configurations of a grid
• The developer can set the number of experiments to be made
• But the best configuration is not guaranteed

• Random Search is a good way of exploring a large space of configurations


• Grid Search is preferred but quickly becomes impractical

• Other Optimization Methods


• There are other not so straightforward approaches
• Usually much more efficient but also more complex

• Bayesian Optimization, Genetic Algorithms, Particle Swarm Optimization, etc…


Modeling
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
• Splitting the dataset (hold-out)
• https://scikit-learn.org/stable/modules/cross_validation.html
• Splitting the dataset (k-fold cross validation)
• https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
• Column Transformers and Pipeline
• https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
• https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
• Grid Search/Random Search + K-Fold Cross Validation
• https://scikit-learn.org/stable/supervised_learning.html
• Sklearn’s supervised learning algorithms
• https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
• Comprehensive example
Evaluation
• The evaluation phase of CRISP-DM is more related to determining how well a model or set of
models is fitting the business objectives

• However, any evaluation, whether model-level or business-level,
depends on well-established metrics
• There is a whole plethora of metrics available; we will dive into some of them:
• Accuracy
• Precision
• Recall
• F1-Score
• Other Metrics:
• FPR (False Positive Rate)
• TPR (True Positive Rate)
• ROC/AUC (Area Under the Curve of Receiver Operating Characteristic)
Evaluation
• When comparing the predicted values with the real values, we can build a confusion matrix
• By applying a set of mathematical formulas to the cells of the matrix it is possible to easily
determine the intended metrics
Evaluation
• True Positive (TP): When the model correctly predicts an occurrence of class 1
• True Negative (TN): When the model correctly predicts an occurrence of class 0
• False Positive (FP): When the model incorrectly predicts an occurrence of class 1
• Type 1 Error
• False Negative (FN): When the model incorrectly predicts an occurrence of class 0
• Type 2 Error

• Example:
• Given, normal=class 0, attack=class1

• Predicted attack but it was normal


• FP
• Predicted normal but it was attack
• FN
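• A minimal sketch with sklearn's confusion_matrix, using hypothetical normal(0)/attack(1) labels; in sklearn's binary layout the counts come out as [[TN, FP], [FN, TP]]:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]   # hypothetical real labels (0 = normal, 1 = attack)
y_pred = [0, 1, 1, 0, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 2 true negatives, 1 false positive, 1 false negative, 2 true positives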
Evaluation
• Accuracy: The number of correct predictions divided by the total number of samples
• It is usually a good standard measure
• However, for imbalanced datasets it can be highly biased
• Consider a dataset of malware files
• 1 is malicious, 99 are benign
• The algorithm always predicts benign
• It will have an accuracy of 99%; however, is it reliable?

• Precision: Measure of samples that we correctly identified as class 1 out of all the samples
we predicted to be class 1
• When the model predicted attack, how often was it actually an attack?
• Recall: For all the samples who belong to class 1, recall tells us how many we correctly
identified as belonging to class 1
• How many of all the attack instances was the algorithm able to identify?
Evaluation
• F1-score: It is the harmonic mean of precision and recall
• It is widely used when we don't want to favor one over the other
• It is a very reliable metric for scenarios with high class imbalance
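• The four metrics written as formulas over the confusion-matrix counts (the counts below are hypothetical, purely for illustration):

tp, tn, fp, fn = 80, 90, 10, 20   # hypothetical confusion-matrix counts

accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # 0.85
precision = tp / (tp + fp)                                  # ~0.89
recall    = tp / (tp + fn)                                  # 0.80
f1        = 2 * precision * recall / (precision + recall)   # ~0.84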

• In practice:
• https://scikit-learn.org/stable/modules/model_evaluation.html
• Multiple evaluation metrics that can be used
• https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
• Classification report that provides a summary comprised of multiple metrics
Deployment
• After the whole process in the “Notebook” we get a trained model saved as a temporary
variable in RAM
• If we want to use it, we need to persist it and include it in an already existing or completely
new software system
• And then, we have to manage the algorithm's behavior over time…
• Deploying and managing ML algorithms over time is a complex topic
• It can involve retraining routines, performance thresholds (“redlines”) for the model, A/B testing, …

• For INACS you are encouraged to export your models and pipelines to build a small
prototype that can work in a simulated manner
• E.g., by continuously feeding data from the test set into a small program that runs the
inference and displays the results in the command line
Deployment
• In practice:
• https://scikit-learn.org/stable/modules/model_persistence.html
• Sklearn’s official persistence suggestions
• https://cloud.google.com/ai-platform/prediction/docs/exporting-for-prediction
• Deploying models of different frameworks into Google Cloud’s AI Platform
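• A minimal persistence sketch with joblib, one of the options suggested by sklearn (the model, training data, and file name are hypothetical):

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=42)   # placeholder training data
model = RandomForestClassifier(random_state=42).fit(X, y)

joblib.dump(model, 'model.joblib')     # persist the fitted model to disk
loaded = joblib.load('model.joblib')   # reload it later, e.g., inside a small prototype
print(loaded.predict(X[:5]))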
