
Machine Learning (2)

ARTIFICIAL INTELLIGENCE AND CYBERSECURITY (INACS)


nunal@isep.ipp.pt
oms@isep.ipp.pt
Data Preparation
• Most of the time, data can’t be directly used by algorithms, requiring several steps of
preprocessing…
• Feature Engineering
• Transform data into meaningful representations that help to better understand the
problem at hand

• Feature Imputation
• Dealing with missing values (NaNs, NULLs, Empty Strings…)

• Feature Encoding
• ML algorithms deal with a mathematical representation of the world
• How can we transform categorical data into an equivalent numeric format?
Data Preparation
• Feature Normalization
• Algorithms can be sensitive to different scales of values
• How can we deal with such disparities?

• Feature Selection
• Not every element of a dataset is equally important
• How to decide?

• Dealing with Data Imbalances


• An imbalanced dataset can cause a classifier to be biased
Feature Engineering
• Feature engineering is the process of changing existing features or deriving new ones to
better reflect the problem at hand
• Example 1:
• When dealing with seasonal data (energy forecasting), time-based information can
be valuable
• However, algorithms are unable to deal with Python's datetime objects directly
• We can convert one datetime column into multiple columns, deriving features such
as day of the week, month of the year, etc ...
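• A minimal pandas sketch of this idea (the DataFrame and the 'timestamp' column name are hypothetical, not from the slides):

import pandas as pd

# Hypothetical DataFrame with a datetime column named 'timestamp'
df = pd.DataFrame({'timestamp': pd.to_datetime(['2024-01-01 10:00', '2024-06-15 23:30'])})

df['day_of_week'] = df['timestamp'].dt.dayofweek   # 0 = Monday, ..., 6 = Sunday
df['month'] = df['timestamp'].dt.month             # 1-12
df['hour'] = df['timestamp'].dt.hour               # 0-23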
Feature Engineering
• Example 2:
• In the titanic dataset
• https://www.kaggle.com/c/titanic
• The goal is to predict whether people are likely to survive or not based on several
features: age, gender, socio-economic class, etc…
• One of the most accurate models for this data set engineered a feature:
• “Is_women_or_child” which was True if the person was a woman or a child and
False otherwise

• In practice:
• https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.apply.html
• pandas' apply function: DataFrame.apply with axis=0 applies a function to each column,
returning a new Series or DataFrame; for element-wise transformations of a single column, use Series.apply
• E.g., df['greater_than_ten'] = df['values'].apply(lambda x: 1 if x > 10 else 0)
Feature Imputation
• Feature imputation is the process of filling in missing values in datasets, as most ML models
can't handle them on their own

• Single value imputation


• Fill missing values for each individual column
• Mean or median (numerical)
• Most frequent element (numerical or categorical)
• Constant (numerical or categorical)

• Multiple value imputation


• Fill multiple missing values for each row at once with an algorithm that finds
similarities with existing examples
• K-Nearest Neighbors imputation: filling data with values from other, similar samples
Feature Imputation
• Delete the row
• If only a few rows have missing values and your dataset is broad enough not to lose
representativeness, just delete the rows with missing values

• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html
• The “strategy” parameter of sklearn's “SimpleImputer” supports several single-value strategies (mean, median, most_frequent, constant)
• E.g., imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')

• https://scikit-learn.org/stable/modules/generated/sklearn.impute.KNNImputer.html
• sklearn's “KNNImputer” performs multiple value imputation
• E.g., imputer = KNNImputer(n_neighbors=2)
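• A minimal sketch of both imputers (the toy matrix X is hypothetical):

import numpy as np
from sklearn.impute import SimpleImputer, KNNImputer

X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan]])   # hypothetical matrix with missing values

imp_mean = SimpleImputer(missing_values=np.nan, strategy='mean')
X_single = imp_mean.fit_transform(X)     # each NaN replaced by its column's mean

imputer = KNNImputer(n_neighbors=2)
X_multi = imputer.fit_transform(X)       # each NaN replaced using the most similar rows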
Feature Encoding
• Feature encoding is the process of transforming categorical variables into equivalent numeric
representations – ML algorithms live in a mathematical world
• Ordinal Encoding
• Ordinal encoding is the process of converting each distinct string into a distinct
number incrementally
• E.g., “A” is 1, “B” is 2, “C” is 3, etc…

• Does it suit all kinds of categorical variables?


• What about nominal categorical variables?
• E.g., blue, red, green, white, etc…
Feature Encoding
• Ordinal Encoding (Practical Example)
• Numbers hold meaningful relationships
• Distance and Order (!)
• Is White twice as much as Red?
• Is Blue = White – (Red + Blue)?

Color   Encoding
Blue    1
Red     2
Red     2
Green   3
White   4

• Making a model “run” is insufficient to grasp all insights of the underlying domain (!)
• This kind of encoding messes up some algorithms, for example K-Nearest Neighbors,
which relies on distance measures between Cartesian coordinates

• How to encode nominal features correctly?


Feature Encoding
• One-Hot Encoding
• Each distinct categorical variable assumes a new column with binary value, 1 if
present, 0 if absent
Row Blue Red Green White
0 1 0 0 0
1 0 1 0 0
2 0 1 0 0
3 0 0 1 0
4 0 0 0 1
• Can it go wrong?
• Too many distinct values can lead to an unmanageable number of columns…
• Memory issues, sparse datasets, curse of dimensionality…
Feature Encoding
• Binary Encoding
• Most of the time One-Hot Encoding is preferable, but the number of resulting columns
can make it an impossible choice
• Binary encoding allows encoding the same number of distinct categorical values
into fewer columns
1. Every categorical value is encoded into a number (same as Ordinal Encoding)
2. Each number is transformed into binary
3. The number of bits required to represent that set of values equals the
number of resulting columns
Feature Encoding
• Binary Encoding (Practical Example)
• Blue = 001
• Red = 010
• Green = 011
• White = 100
• 3 bits are required = 3 columns

Row   Bin1  Bin2  Bin3
0     0     0     1
1     0     1     0
2     0     1     0
3     0     1     1
4     1     0     0

• If too many columns are generated anyway, one must search for alternatives
• E.g., Target Encoding

• Ordinal Encoding can still be used if we run out of options


• However, we need to understand its implications and
choose an algorithm that can be robust to such noise…
Feature Encoding
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OrdinalEncoder.html
• Sklearn’s “OrdinalEncoder”
• https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.OneHotEncoder.html
• Sklearn’s “OneHotEncoder”
• https://contrib.scikit-learn.org/category_encoders/
• Category Encoders family (Ordinal, One-Hot, Binary)

• Note: Sklearn's “LabelEncoder” is different from “OrdinalEncoder”

• They provide nearly the same functionality but should be used for different purposes
• “LabelEncoder” assumes a 1D structure (it is meant for target labels), while “OrdinalEncoder” works on 2D feature arrays
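• A minimal sketch of the three encoders (the toy DataFrame is hypothetical; BinaryEncoder comes from the category_encoders package):

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
import category_encoders as ce   # pip install category_encoders

df = pd.DataFrame({'color': ['Blue', 'Red', 'Red', 'Green', 'White']})   # hypothetical nominal feature

ordinal = OrdinalEncoder().fit_transform(df[['color']])           # one integer per distinct value
onehot = OneHotEncoder().fit_transform(df[['color']]).toarray()   # one binary column per distinct value
binary = ce.BinaryEncoder(cols=['color']).fit_transform(df)       # distinct values packed into a few bit columns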
Feature Normalization
• Some algorithms don’t work well when features have distinct scales
• E.g., Salary is in the “thousands”, Age is in the “tens”, etc …

• Gradient-Descent based algorithms


• Artificial Neural Networks, Logistic Regression, Linear Regression, etc…
• Convergence issues
• Distance-based algorithms
• K-Nearest Neighbors, Support Vector Machines
• Implicit weighting in decision making
• Bigger values have greater impact in distance calculations

• What about Tree-based algorithms?


• They represent knowledge as if-then rules, so they are insensitive to the scale of their
features…

Feature Scaling for Machine Learning: Understanding the Difference Between Normalization vs. Standardization, https://www.analyticsvidhya.com/blog/2020/04/feature-scaling-machine-learning-normalization-standardization/ [Online]
Feature Normalization
• Min-Max Normalization
• Re-scales every feature into [0, 1] interval

• Standard Scaling
• Re-scales data into having 0 mean and a standard deviation of one
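• As a reference, the two rescalings correspond to the standard formulas, written here as a small numpy sketch (the values are illustrative, not from the slides):

import numpy as np

x = np.array([10.0, 20.0, 40.0])                 # hypothetical feature values
x_minmax = (x - x.min()) / (x.max() - x.min())   # Min-Max: rescaled into [0, 1]
x_standard = (x - x.mean()) / x.std()            # Standard Scaling: mean 0, standard deviation 1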

Feature Normalization
• How to choose?
• Normalization is good when your data does not follow a Gaussian distribution.
• Good for K-Nearest Neighbors and Neural Networks (do not assume any distribution)
• Standardization is good when the data follows a Gaussian distribution.
• Gaussian Naive Bayes (assumes normal distribution)

• Most of the time, the best way is to experiment and draw insightful conclusions from the results…

• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html
• Sklearn’s “MinMaxScaler”
• https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
• Sklearn’s “StandardScaler”
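• A minimal usage sketch of both scalers, assuming a hypothetical numeric feature matrix X:

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1000.0, 25.0], [2000.0, 40.0], [3000.0, 60.0]])   # hypothetical salary/age features

X_minmax = MinMaxScaler().fit_transform(X)       # each column rescaled into [0, 1]
X_standard = StandardScaler().fit_transform(X)   # each column with mean 0 and std 1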

Feature Selection
• Feature selection is the process of identifying the most valuable/significant features of a
dataset
• It can be useful to reduce the size of the algorithm's input and clear some noise, since variables
that are not useful are discarded from the learning process

• Feature Importance
• Some models (e.g., Random Forest) allow you to determine which features
contributed the most to predicting the target variable's values
• Quickly creating one of such models can be useful for understanding which features
are most valuable
Feature Selection
• Dimensionality Reduction
• Some methods, such as Principal Component Analysis (PCA), take many features and use analytic
techniques (linear algebra) to reduce them to fewer representative features while keeping as much
meaning as possible
• Be aware that the resulting features are numerical values that can no longer be mapped to
anything meaningful in the domain

• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
• Sklearn’s “RandomForestClassifier”
• https://scikit-learn.org/stable/auto_examples/ensemble/plot_forest_importances.html
• Sklearn’s “RandomForestClassifier” demo for feature importance measures
• https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
• Sklearn’s “PCA”
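• A minimal sketch of both ideas on a placeholder synthetic dataset (make_classification is used only to have something to fit):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=42)   # placeholder dataset

# Feature importance: fit a quick Random Forest and inspect feature_importances_
rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)
print(rf.feature_importances_)   # one importance score per feature

# Dimensionality reduction: project the 10 features onto 3 principal components
X_reduced = PCA(n_components=3).fit_transform(X)
print(X_reduced.shape)           # (500, 3)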
Data Imbalance
• “The hitch with imbalanced datasets is that standard classification learning algorithms are
often biased towards the majority classes (known as “negative”) and therefore there is a
higher misclassification rate in the minority class instances (called the “positive” class).”
• Deals with severe skew in class distribution (e.g., 1:1000 or 1:10000 ratio)
• Most standard ML models struggle with imbalanced datasets
• Algorithms learn that minority classes are not as important as the majority class (!)

• In the context of Cybersecurity (NIDS),


• Is it easier to capture examples of normal traffic or of attack-related traffic?
• Is it more important for an algorithm to recognize normal behavior or anomalous
behavior?

• How to solve?

Tour of Data Sampling Methods for Imbalanced Classification, https://machinelearningmastery.com/data-sampling-methods-for-imbalanced-classification/ [Online]


Data Imbalance
• Oversampling
• Oversampling methods duplicate or synthesize new examples for minority classes

• Random Oversampling
• Randomly duplicate examples of minority classes

• Synthetic Minority Oversampling Technique (SMOTE)


• Selects examples that are close in the feature space, draws a line between
them, and generates a new sample as a point along that line
Data Imbalance
• SMOTE (Practical Example)
Data Imbalance
• Undersampling
• Undersampling methods delete elements of the majority class

• Random Undersampling
• Randomly delete examples of the majority class

• Condensed Nearest Neighbor Rule (CNN)


• Remove redundant examples from high-density regions of the majority class

• Oversampling + Undersampling
• Sometimes it is best to perform both…
Data Imbalance
• In practice:
• https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html
• imblearn’s “SMOTE”
• https://imbalanced-
learn.org/stable/references/generated/imblearn.over_sampling.RandomOverSampler.html
• imblearn’s “RandomOverSampler”
• https://imbalanced-
learn.org/stable/references/generated/imblearn.under_sampling.RandomUnderSampler.html
• imblearn’s “RandomUnderSampler”
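• A minimal resampling sketch on a placeholder imbalanced dataset (the roughly 9:1 class ratio is illustrative; imblearn must be installed):

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.under_sampling import RandomUnderSampler

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)   # hypothetical imbalanced dataset

X_sm, y_sm = SMOTE(random_state=42).fit_resample(X, y)                      # synthesize minority samples
X_over, y_over = RandomOverSampler(random_state=42).fit_resample(X, y)      # duplicate minority samples
X_under, y_under = RandomUnderSampler(random_state=42).fit_resample(X, y)   # delete majority samples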
Modeling
• Modeling is the process of finding which algorithm best suits the data
• Select Modeling Techniques
• Based on the use case and dataset some algorithms can be more appropriate than
others…
• Examples
• Support Vector Machines
• Computationally expensive and do not deal well with large datasets
• However, they perform well for large feature spaces (many features)
• Artificial Neural Networks
• Typically achieve modest performance for small datasets
• For large datasets they generally perform better than any other model
Modeling
• We will only work with classification algorithms in INACS
• Some will be analyzed in detail in T3 to give some insights about what happens under
the hood – K-Nearest Neighbors, Decision Trees, Tree-Ensembles

• Good general choices for Practical Work:


• K-Nearest Neighbors
• Decision Tree
• Tree-Ensembles
• Random Forest
• AdaBoost
• XGBoost
• Support Vector Machines
• If the dataset is not too big
• Naïve Bayes
Modeling
• Generate Test Design
• This step is part of the Modeling phase in CRISP-DM but in fact, it should be the
first thing to be done

• We have a dataset…
• How can we train the model?
• How can we evaluate the model reliably?
• How can we compare multiple versions of the same model?
• How can we compare multiple models against each other?
Modeling
• Recommended Methodology for INACS
• From 100% of data
• Set 30% aside to serve as the test set (hold-out method)
• The test set is an unseen portion of the dataset that is used to assess how
a given algorithm would perform for “real world” unseen data…
• Different algorithms should be compared against the test set

• What are those so-called results?


• Mathematical measures are used to determine performance
• For example, Accuracy measures the proportion of correctly classified
samples out of all test samples
• 70% Accuracy: the algorithm guessed right 70% of the time (70 in
100)
• More on that later…
Modeling
• Recommended Methodology for INACS
• From 100% of data
• Remaining 70% will be used as train set
• However, most algorithms have Hyperparameters
• Number of neighbors in K-Nearest Neighbors algorithm
• Number of trees of a Random Forest
• We can have multiple versions of the same algorithm
• Comparing these versions relates to a process called
Hyperparameter Tuning
• These versions should not be compared using the test set, as that
would make the result biased (i.e., finding the best parameters
for that particular set of data…)
Modeling
• Recommended Methodology for INACS
• From 100% of data
• Remaining 70% will be used as train set
• A simple strategy for Hyperparameter Tuning is to split off another 30% of
the train set to make a validation set (hold-out method)

• So:
• We use the train set to fit multiple algorithms with different
combinations of hyperparameters
• Then we evaluate their results in the validation set
• We choose the algorithm with the best combination of
hyperparameters to evaluate in the test set
• We repeat this process for multiple algorithms
Modeling
[Diagram: hold-out splits — train set | test set; and train set | validation set | test set]
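• A minimal sketch of the two hold-out splits with train_test_split (the dataset is a placeholder):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=42)   # placeholder dataset

# 1st split: set 30% of all data aside as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)

# 2nd split: 30% of the remaining train set becomes the validation set
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.3, random_state=42, stratify=y_train)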


Modeling
• Recommended Methodology for INACS
• What is the problem with this methodology?
• The train set is reduced by a whole lot…
• Less data usually contributes to less powerful models
• The hold-out method is only suitable for large datasets where lack of
representativeness isn't a problem

• We can solve this using the k-fold cross validation method

• k-fold cross validation can help to replace hold-out for hyperparameter tuning
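• A minimal cross-validation sketch, evaluating one hyperparameter configuration on the train set only (k = 5 folds; the estimator choice and the placeholder train set are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X_train, y_train = make_classification(n_samples=700, random_state=42)   # placeholder train set

# 5-fold cross validation: each fold takes a turn as the validation set
scores = cross_val_score(KNeighborsClassifier(n_neighbors=5), X_train, y_train, cv=5)
print(scores.mean())   # average validation score across the 5 folds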
Modeling
• Recommended Methodology for INACS
• Be careful with preprocessing operations such as normalization, undersampling…
• Why?
• Most preprocessing operations can't be applied carelessly to the entire dataset,
or they make the whole methodology biased anyway…
• Fitting sklearn's “MinMaxScaler” on the entire dataset means that we are
predicting the future
• Finding the minimum and maximum of all columns of the dataset
• The resulting scaling will be overly perfect…
• In a “real world” scenario it does not usually work like that

• Don't forget: the test set should be as representative as possible
of the “real world”
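• A minimal leakage-free sketch: the scaler is fit on the train set only, and putting it in a Pipeline makes this automatic (the dataset and step names are illustrative):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, random_state=42)   # placeholder dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# The scaler only ever sees the train set; the test set is just transformed
pipe = Pipeline([('scaler', MinMaxScaler()), ('classifier', KNeighborsClassifier())])
pipe.fit(X_train, y_train)          # scaler and model fitted on train data only
print(pipe.score(X_test, y_test))   # test data is scaled with the train set's min/max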
Modeling
• Tuning Hyperparameters
• Hyperparameters are degrees of freedom inherent to any ML algorithm
• E.g., number of estimators in a Random Forest, maximum depth of a Decision Tree
• By intuition/educated guesses we can, with a certain level of experience, determine
which hyperparameters may be more important than others for a given situation
• E.g., if the model is overfitting, then reducing the maximum depth of a Decision Tree
can be helpful…
• overfitting vs underfitting relates to the bias-variance trade-off
• high variance and low bias vs high bias and low variance
• reducing the depth of a tree helps to reduce high variance

• Most of the time we still get several possible values for multiple hyperparameters that
we could experiment with for better performance
• How to get the best model configuration?
Modeling
• Grid Search
• Brute-force approach where every configuration is tried

• For the following grid of a pipeline that considers PCA + KNN:


• "dimreduction__n_components": [8, 16, 32]
• "classifier__n_neighbors": [4, 8, 16, 32, 64]
• "classifier__weights": ["uniform", "distance"]

• We have 3 * 5 * 2 = 30 possibilities
• If fit(X, y) takes 30 minutes for each possibility…
• Hyperparameter tuning would take 900 minutes or 15 hours

• The best configuration is guaranteed (!)


• Computation time can get out of hand (!)
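• A minimal sketch of the grid above with GridSearchCV and a PCA + KNN pipeline (the step names match the parameter prefixes on the slide; the dataset is a placeholder):

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

X_train, y_train = make_classification(n_samples=1000, n_features=64, random_state=42)   # placeholder train set

pipe = Pipeline([('dimreduction', PCA()), ('classifier', KNeighborsClassifier())])
param_grid = {
    'dimreduction__n_components': [8, 16, 32],
    'classifier__n_neighbors': [4, 8, 16, 32, 64],
    'classifier__weights': ['uniform', 'distance'],
}

# Exhaustive search over the 30 combinations, each evaluated with 5-fold cross validation
search = GridSearchCV(pipe, param_grid, cv=5).fit(X_train, y_train)
print(search.best_params_)

• RandomizedSearchCV (see next slide) has the same interface but takes param_distributions plus an n_iter budget and tries only a sample of the grid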
Modeling
• Random Search
• Experiment with n random configurations of a grid
• The developer can set the number of experiments to be made
• But the best configuration is not guaranteed

• Random Search is a good way of exploring a large space of configurations


• Grid Search is preferred but quickly becomes impractical

• Other Optimization Methods


• There are other not so straightforward approaches
• Usually much more efficient but also more complex

• Bayesian Optimization, Genetic Algorithms, Particle Swarm Optimization, etc…


Modeling
• In practice:
• https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
• Splitting the dataset (hold-out)
• https://scikit-learn.org/stable/modules/cross_validation.html
• Splitting the dataset (k-fold cross validation)
• https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html
• Column Transformers and Pipeline
• https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
• https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html
• Grid Search/Random Search + K-Fold Cross Validation
• https://scikit-learn.org/stable/supervised_learning.html
• Sklearn’s supervised learning algorithms
• https://scikit-learn.org/stable/tutorial/statistical_inference/putting_together.html
• Comprehensive example
Evaluation
• The evaluation phase of CRISP-DM is more related to determining how well a model or set of
models is fitting the business objectives

• However, any evaluation, whether model-level or business-level,
depends on well-established metrics
• There is a whole plethora of metrics available; we will dive into some of them:
• Accuracy
• Precision
• Recall
• F1-Score
• Other Metrics:
• FPR (False Positive Rate)
• TPR (True Positive Rate)
• ROC/AUC (Area Under the Curve of Receiver Operating Characteristic)
Evaluation
• When comparing the predicted values with the real values, we can build a confusion matrix
• By applying a set of mathematical formulas to the cells of the matrix it is possible to easily
determine the intended metrics
Evaluation
• True Positive (TP): When the model correctly predicts an occurrence of class 1
• True Negative (TN): When the model correctly predicts an occurrence of class 0
• False Positive (FP): When the model incorrectly predicts an occurrence of class 1
• Type 1 Error
• False Negative (FN): When the model incorrectly predicts an occurrence of class 0
• Type 2 Error

• Example:
• Given, normal=class 0, attack=class1

• Predicted attack but it was normal


• FP
• Predicted normal but it was attack
• FN
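• A minimal sketch with sklearn's confusion_matrix, using hypothetical normal(0)/attack(1) labels; in sklearn's binary layout the counts come out as [[TN, FP], [FN, TP]]:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]   # hypothetical real labels (0 = normal, 1 = attack)
y_pred = [0, 1, 1, 0, 1, 0]   # hypothetical model predictions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)   # 2 true negatives, 1 false positive, 1 false negative, 2 true positives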
Evaluation
• Accuracy: The number of correct predictions divided by the total number of samples
• It is usually a good standard measure
• However, for imbalanced datasets it can be highly biased
• Consider a dataset of malware files
• 1 is malicious, 99 are benign
• The algorithm always predicts benign
• It will have an accuracy of 99%; however, is it reliable?

• Precision: Measure of samples that we correctly identified as class 1 out of all the samples
we predicted to be class 1
• When the model predicted attack, how often was it actually an attack?
• Recall: For all the samples who belong to class 1, recall tells us how many we correctly
identified as belonging to class 1
• How many of all the attack instances was the algorithm able to identify?
Evaluation
• F1-score: It is the harmonic mean of precision and recall
• It is widely used when we don't want to favor one over the other
• It is a very reliable metric for scenarios with high class imbalance
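• The four metrics written as formulas over the confusion-matrix counts (the counts below are hypothetical, purely for illustration):

tp, tn, fp, fn = 80, 90, 10, 20   # hypothetical confusion-matrix counts

accuracy  = (tp + tn) / (tp + tn + fp + fn)                 # 0.85
precision = tp / (tp + fp)                                  # ~0.89
recall    = tp / (tp + fn)                                  # 0.80
f1        = 2 * precision * recall / (precision + recall)   # ~0.84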

• In practice:
• https://scikit-learn.org/stable/modules/model_evaluation.html
• Multiple evaluation metrics that can be used
• https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html
• Classification report that provides a summary comprised of multiple metrics
Deployment
• After the whole process in the “Notebook” we get a trained model saved as a temporary
variable in RAM
• If we want to use it, we need to persist it and include it in an already existing or completely
new software system
• And then, we have to manage the algorithm's behavior over time…
• Deploying and managing ML algorithms over time is a complex topic
• It can involve retraining routines, performance thresholds (“redlines”) for the model, A/B testing, …

• For INACS you are encouraged to export your models and pipelines to build a small
prototype that can work in a simulated manner
• E.g., by continuously feeding data from the test set into a small program that runs the
inference and displays the results in the command line
Deployment
• In practice:
• https://scikit-learn.org/stable/modules/model_persistence.html
• Sklearn’s official persistence suggestions
• https://cloud.google.com/ai-platform/prediction/docs/exporting-for-prediction
• Deploying models of different frameworks into Google Cloud’s AI Platform
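• A minimal persistence sketch with joblib, one of the options suggested by sklearn (the model, training data, and file name are hypothetical):

import joblib
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=200, random_state=42)   # placeholder training data
model = RandomForestClassifier(random_state=42).fit(X, y)

joblib.dump(model, 'model.joblib')     # persist the fitted model to disk
loaded = joblib.load('model.joblib')   # reload it later, e.g., inside a small prototype
print(loaded.predict(X[:5]))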
