Multivariate Regression is a supervised machine learning algorithm involving multiple data variables for analysis. Multivariate regression is an extension of multiple regression, which has one dependent variable and multiple independent variables. Based on the independent variables, we try to predict the output.

Machine Learning Model Evaluation
Model evaluation is the process that uses
some metrics which help us to analyze the
performance of the model. Model development is a multi-step process, and a check should be kept on how well the model generalizes to future predictions. Evaluating a model therefore plays a vital role in judging its performance. The evaluation also helps to analyze a
model’s key weaknesses. There are many
metrics like Accuracy, Precision, Recall, F1
score, Area under Curve, Confusion Matrix,
and Mean Square Error. Cross Validation is
one technique that is followed during the
training phase and it is a model evaluation
technique as well.

Cross Validation and Holdout
Cross Validation is a method in which we do
not use the whole dataset for training. In this
technique, some part of the dataset is
reserved for testing the model. There are
many types of Cross-Validation out of which K
Fold Cross Validation is mostly used. In K Fold
Cross Validation the original dataset is
divided into k subsets. The subsets are known
as folds. The process is repeated k times, where one fold is used for testing and the remaining k-1 folds are used for training the model. So each data
point acts as a test subject for the model as
well as acts as the training subject. It is seen
that this technique generalizes the model well
and reduces the error rate.
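To make the k-fold procedure concrete, here is a minimal sketch using scikit-learn; the synthetic dataset, the choice of k=5, and the logistic regression model are illustrative assumptions, not part of the text above.

    # Minimal k-fold cross-validation sketch (assumptions: synthetic data, k=5, logistic regression).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import KFold
    from sklearn.metrics import accuracy_score

    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    kf = KFold(n_splits=5, shuffle=True, random_state=0)

    scores = []
    for train_idx, test_idx in kf.split(X):
        model = LogisticRegression(max_iter=1000)
        model.fit(X[train_idx], y[train_idx])      # train on k-1 folds
        preds = model.predict(X[test_idx])         # test on the held-out fold
        scores.append(accuracy_score(y[test_idx], preds))

    print("fold accuracies:", scores)
    print("mean accuracy:", sum(scores) / len(scores))

Each observation is used exactly once for testing and k-1 times for training, which is what gives the averaged score its lower variance.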
Holdout is the simplest approach. It is used in neural networks as well as in many classifiers.
In this technique, the dataset is divided into
train and test datasets. The dataset is usually
divided into ratios like 70:30 or 80:20.
Normally a large percentage of data is used
for training the model and a small portion of
the dataset is used for testing the model.Confusion Matrix
Confusion Matrix

A confusion matrix is an N x N matrix, where N
is the number of target classes. It represents
the number of actual outputs and the
predicted outputs. Some terminologies in the
matrix are as follows:
* True Positives: Also known as TP. It is the output in which both the actual and the predicted values are YES.
* True Negatives: Also known as TN. It is the output in which both the actual and the predicted values are NO.
* False Positives: Also known as FP. It is the output in which the actual value is NO but the predicted value is YES.
* False Negatives: Also known as FN. It is the output in which the actual value is YES but the predicted value is NO.
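A small sketch of how these four counts can be read off a confusion matrix with scikit-learn; the hard-coded label vectors are made up for illustration.

    # Confusion matrix sketch for a binary problem (labels are illustrative).
    from sklearn.metrics import confusion_matrix

    y_actual    = [1, 0, 1, 1, 0, 0, 1, 0]   # 1 = YES, 0 = NO
    y_predicted = [1, 0, 0, 1, 0, 1, 1, 0]

    # With labels=[0, 1], ravel() returns TN, FP, FN, TP in that order.
    tn, fp, fn, tp = confusion_matrix(y_actual, y_predicted, labels=[0, 1]).ravel()
    print("TP:", tp, "TN:", tn, "FP:", fp, "FN:", fn)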
Least Square Regression in Machine Learning

Least Square Regression is a statistical method
commonly used in machine learning for analyzing and
modelling data. It involves finding the line of best fit
that minimizes the sum of the squared residuals (the
difference between the actual values and the
predicted values) between the independent
variable(s) and the dependent variable.
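To illustrate the idea of minimizing squared residuals, here is a minimal least-squares fit on made-up one-dimensional data; the data values and the use of numpy.polyfit are assumptions for the example.

    # Least-squares line of best fit on toy data (data values are illustrative).
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

    slope, intercept = np.polyfit(x, y, deg=1)   # minimizes the sum of squared residuals
    residuals = y - (slope * x + intercept)

    print("slope:", slope, "intercept:", intercept)
    print("sum of squared residuals:", np.sum(residuals ** 2))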
Linear Regression

Linear regression is one of the basic statistical
techniques in regression analysis. People use it for
investigating and modelling the relationship between
variables (i.e., a dependent variable and one or more
independent variables).
Before being promptly adopted into machine learning
and data science, linear models were used as basic
tools in statistics to assist prediction analysis and
data mining. If the model involves only one regressor
variable (independent variable), it is called simple
linear regression and if the model has more than one
regressor variable, the process is called multiple
linear regression.
What is Regularization in Machine Learning?

Regularization refers to techniques that are used to calibrate machine learning models in order to minimize the adjusted loss function and prevent overfitting or underfitting.

[Figure: over-fitting vs. appropriate fitting]

Regularization in Machine Learning
What is Regularization?
Regularization is one of the most important
concepts of machine learning. It is a technique
to prevent the model from overfitting by adding
extra information to it.
Sometimes the machine learning model
performs well with the training data but does
not perform well with the test data. It means the model is not able to predict the output for unseen data because it has learned noise from the training data, and hence the model is called overfitted. This problem can be dealt with using a regularization technique. Regularization allows the model to keep all variables or features while reducing the magnitude of their coefficients. Hence, it maintains accuracy as well as the generalization of the model.

Techniques of Regularization
There are mainly two types of regularization
techniques, which are given below:
* Ridge Regression
* Lasso Regression

Lasso Regression:
* Lasso regression is another regularization technique to reduce the complexity of the model. It stands for Least Absolute Shrinkage and Selection Operator.
* It is similar to Ridge Regression, except that the penalty term contains only the absolute weights instead of the square of the weights.
* Since it takes absolute values, it can shrink the slope to 0, whereas Ridge Regression can only shrink it near to 0.
* It is also called L1 regularization.

The cost function of Lasso regression is:

Cost = Σ(y_i - ŷ_i)² + λ * Σ|w_j|

that is, the sum of squared residuals plus λ times the sum of the absolute values of the weights.
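To see how the two penalties behave, the following sketch contrasts scikit-learn's Ridge and Lasso on a toy regression problem; the dataset and the alpha value are arbitrary assumptions, chosen only to show that lasso can drive some coefficients exactly to zero.

    # Ridge vs. Lasso sketch: L2 shrinks coefficients, L1 can zero some out.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Ridge, Lasso

    X, y = make_regression(n_samples=100, n_features=8, n_informative=3,
                           noise=5.0, random_state=0)

    ridge = Ridge(alpha=1.0).fit(X, y)
    lasso = Lasso(alpha=1.0).fit(X, y)

    print("ridge coefficients:", ridge.coef_.round(2))
    print("lasso coefficients:", lasso.coef_.round(2))   # several are exactly 0.0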
Key Difference between Ridge Regression and Lasso Regression
* Ridge regression is mostly used to reduce the overfitting in the model, and it includes all the features present in the model. It reduces the complexity of the model by shrinking the coefficients.
* Lasso regression helps to reduce the overfitting in the model as well as to perform feature selection.

What are the applications of regression?
The main uses of regression analysis are
forecasting, time series modeling and
finding the cause and effect relationship
between variables.

What is Classification?
Classification is defined as the process of
recognition, understanding, and grouping of
objects and ideas into preset categories a.k.a
“sub-populations.” With the help of these pre-categorized training datasets, classification programs in machine learning leverage a wide range of algorithms to classify future datasets into the relevant categories.

Classification algorithms used in machine
learning utilize input training data for the
purpose of predicting the likelihood or
probability that the data that follows will fall
into one of the predetermined categories. One
of the most common applications of
classification is for filtering emails into “spam”
or “non-spam”, as used by today’s top email
service providers.

[Figure 2: Classification of vegetables and groceries; independent input variables pass through a classification model to produce a categorical output variable.]

Cross-Validation in Machine Learning
Cross-validation is a technique for validating
the model efficiency by training it on the subset
of input data and testing on previously unseen
subset of the input data. We can also say that it
is a technique to check how a statistical model
generalizes to an independent dataset.
In machine learning, there is always the need to
test the stability of the model. It means that we cannot judge our model based only on the training dataset. For this purpose, we reserve a particular sample of the dataset which was not part of the training dataset. After that, we test our model on that sample before deployment, and this complete process comes under cross-validation. This is something different from the general train-test split.

Hence the basic steps of cross-validation are:

* Reserve a subset of the dataset as a
validation set.
* Provide the training to the model using the training dataset.
* Now, evaluate model performance using the validation set. If the model performs well with the validation set, perform the further steps; else, check for the issues.
Methods used for Cross-Validation
There are some common methods that are
used for cross-validation. These methods are
given below:
1. Validation Set Approach
2. Leave-P-out cross-validation
3. Leave one out cross-validation
4. K-fold cross-validation
5. Stratified k-fold cross-validation
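As a sketch of how these methods are typically run in practice, scikit-learn can apply k-fold (or stratified k-fold) splitting automatically via cross_val_score; the synthetic dataset and the decision tree model below are assumptions for illustration.

    # Automated k-fold scoring sketch (synthetic data and model choice are assumptions).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=300, n_features=10, random_state=0)

    # cv=5 uses stratified 5-fold splitting for classifiers by default.
    scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
    print("per-fold accuracy:", scores)
    print("mean accuracy:", scores.mean())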
Comparison of Cross-validation to train/test split in Machine Learning
* Train/test split: The input data is divided into two parts, the training set and the test set, in a ratio of 70:30, 80:20, etc. It provides a high variance, which is one of the biggest disadvantages.
    * Training Data: The training data is used to train the model, and the dependent variable is known.
    * Test Data: The test data is used to make predictions from the model that is already trained on the training data. It has the same features as the training data but is not part of it.
* Cross-Validation dataset: It is used to
overcome the disadvantage of train/test
split by splitting the dataset into groups
of train/test splits, and averaging the
result. It can be used if we want to
optimize our model that has been
trained on the training dataset for the
best performance. It is more efficient than the train/test split, as every observation is used for both training and testing.

Limitations of Cross-Validation
There are some limitations of the cross-
validation technique, which are given below:
* Under ideal conditions, it provides the optimum output. But for inconsistent data, it may produce drastically poor results. This is one of the big disadvantages of cross-validation, as there is no certainty about the type of data in machine learning.
* In predictive modeling, the data evolves over a period of time, due to which differences may arise between the training and validation sets. For example, if we create a model for the prediction of stock market values and train it on the previous 5 years of stock values, the realistic future values for the next 5 years may be drastically different, so it is difficult to expect the correct output in such situations.

Applications of Cross-Validation
* This technique can be used to compare the performance of different predictive modeling methods.
* It has great scope in the medical research field.
* It can also be used for meta-analysis, as it is already being used by data scientists in the field of medical statistics.

F1 Score
The F1 score is a measure of a model's
accuracy that takes into account both
precision and recall, where the goal is to
classify instances correctly as positive or
negative. Precision measures how many
of the predicted positive instances were
actually positive, while recall measures
how many of the actual positive
instances were correctly predicted. A
high precision score means that the
model has a low rate of false positives,
while a high recall score means that the
model has a low rate of false negatives.

Mathematically speaking, the F1 score is
a weighted harmonic mean of precision
and recall. It ranges from 0 to 1, with 1
being the best possible score. The
formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall)

The harmonic mean is used to give more
weight to low values. This means that if
either precision or recall is low, the F1
score will also be low, even if the other
value is high. For example, if a model has
high precision but low recall, it will have a
low F1 score because it is not correctly
identifying all of the positive instances.
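A short sketch of the formula in code; the label vectors are invented for illustration, and scikit-learn's f1_score is used only to confirm the manual harmonic-mean calculation.

    # F1 score sketch: manual harmonic mean vs. sklearn (labels are illustrative).
    from sklearn.metrics import precision_score, recall_score, f1_score

    y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
    y_pred = [1, 1, 0, 0, 0, 0, 0, 0, 1, 0]

    p = precision_score(y_true, y_pred)   # TP / (TP + FP)
    r = recall_score(y_true, y_pred)      # TP / (TP + FN)

    manual_f1 = 2 * (p * r) / (p + r)
    print("precision:", p, "recall:", r)
    print("manual F1:", manual_f1, "sklearn F1:", f1_score(y_true, y_pred))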
Accuracy

Accuracy is an ML metric that measures
the proportion of correct predictions
made by a model over the total number
of predictions made. It is one of the most
widely used metrics to evaluate the
performance of a classification model.
Accuracy can be calculated using the
following formula:
Accuracy = (number of correct predictions) / (total number of predictions)

Accuracy is a simple and intuitive metric
that is easy to understand and interpret.
It is particularly useful when the classes
are balanced, meaning that there are
roughly equal numbers of positive and
negative samples. In such cases,
accuracy can provide a good overall
assessment of the model's performance.

However, accuracy can be misleading
when the classes are imbalanced. For
example, if 95% of the samples are
negative and only 5% are positive, a
model that always predicts negative
would achieve an accuracy of 95%. Still, it
would be useless for the positive class.
In such cases, other metrics such as
precision, recall, F1 score, and area under
the precision-recall curve should be used
to evaluate the model's performance.
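A brief sketch of the accuracy formula and of the imbalance pitfall described above; the 95/5 class split and the always-negative predictions are illustrative assumptions.

    # Accuracy sketch: an always-negative model on a 95:5 imbalanced dataset.
    from sklearn.metrics import accuracy_score, f1_score

    y_true = [0] * 95 + [1] * 5      # 95 negative samples, 5 positive samples
    y_pred = [0] * 100               # model that always predicts negative

    print("accuracy:", accuracy_score(y_true, y_pred))                          # 0.95
    print("F1 for the positive class:", f1_score(y_true, y_pred, zero_division=0))  # 0.0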
ROC-AUC

The ROC (Receiver Operating
Characteristic) curve and AUC (Area
Under the Curve) are ML metrics used to
evaluate the performance of binary
classification models. The ROC curve is a
plot of the true positive rate (TPR)
against the false positive rate (FPR) at
various threshold settings, and it is
created by varying the threshold to
predict a positive or negative outcome
and plotting the TPR against the FPR for
each threshold. The TPR is the proportion of actual
correctly identified as positive by the
model. In contrast, the FPR is the
proportion of actual negative samples
that are incorrectly identified as positive
by the model. In the figure below, each
coloured line represents the ROC curve of
a different binary classifier system. The
axes represent the FPR and TPR. The
diagonal line represents a random
classifier, while the top-left corner
represents a perfect classifier with
TPR=1 and FPR=0.

[Figure: ROC curve (Source). The axes are the false positive rate (x) and the true positive rate (y); each coloured line is a different classifier, and the perfect classifier sits at the top-left corner.]

At the same time, the AUC represents the
overall performance of the model. The
AUC is the area under the ROC curve,
representing the probability that a
randomly chosen positive sample will be
ranked higher by the model than a
randomly chosen negative sample. A
perfect model would have an AUC of 1,
while a random model would have an
AUC of 0.5. The AUC provides a single
value that summarizes the model's
overall performance and is particularly
useful when comparing the performance
of multiple models.

The true and false positive rates at
different thresholds are particularly
useful when the classes are imbalanced,
meaning there are significantly more
negative samples than positive ones. In
such cases, the ROC curve and AUC can
provide a more accurate assessment of
the model's performance than metrics
such as accuracy or F1 score, which may
be biased towards the majority class.
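A minimal sketch of computing the ROC curve and AUC with scikit-learn; the synthetic imbalanced dataset, the train/test split, and the logistic regression scorer are assumptions made only to produce predicted probabilities.

    # ROC-AUC sketch (synthetic data and model are assumptions).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import roc_curve, roc_auc_score
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
    scores = model.predict_proba(X_te)[:, 1]          # probability of the positive class

    fpr, tpr, thresholds = roc_curve(y_te, scores)    # TPR/FPR at each threshold
    print("AUC:", roc_auc_score(y_te, scores))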
PR-AUC

PR-AUC (Precision-Recall Area Under the
Curve) is an ML metric used to evaluate
the performance of binary classification
models, mainly when the classes are
imbalanced. Unlike the ROC curve and
AUC, which plot the TPR against the FPR,
the PR curve plots the precision against
the recall at different threshold settings.

Precision is the proportion of true
positive predictions out of all positive
predictions made by the model, while
recall is the proportion of true positive
predictions from all actual positive
samples in the dataset. The PR curve is
created by varying the threshold for
predicting a positive or negative outcome
and plotting the precision against the
recall for each threshold.

[Figure: PR curve (Source). The axes are recall (x) and precision (y); the plot shows a perfect classifier, a "pretty good" classifier, and a baseline classifier.]

The PR-AUC is the area under the PR
curve, and represents the overall
performance of the model. A perfect
model would have a PR-AUC of 1, while a
random model would have a PR-AUC
equal to the ratio of positive samples in
the dataset. Like the AUC, the PR-AUC
provides a single value that summarizes
the model's overall performance and is
particularly useful when comparing the
performance of multiple models. In the
figure above, the grey dotted line represents a “baseline” classifier; this
classifier would simply predict that all
instances belong to the positive class.
The purple line represents an ideal
classifier with perfect precision and recall
at all thresholds.
The PR curve and PR-AUC provide a more
accurate assessment of the model's
performance than metrics such as
accuracy or F1 score, which may be
biased towards the majority class. In
addition, they can provide insight into the
trade-off between precision and recall
and help to identify the optimal threshold
for making predictions.
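A matching sketch for the PR curve, reusing the same kind of synthetic, imbalanced data; using average_precision_score as a summary of the PR curve is an assumption about tooling rather than something stated above.

    # PR curve and PR-AUC sketch (synthetic imbalanced data is an assumption).
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import precision_recall_curve, average_precision_score, auc
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=500, weights=[0.9, 0.1], random_state=1)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

    scores = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

    precision, recall, thresholds = precision_recall_curve(y_te, scores)
    print("PR-AUC (trapezoidal):", auc(recall, precision))
    print("average precision:", average_precision_score(y_te, scores))
    print("baseline (positive ratio):", y_te.mean())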
2.5 Lasso

LASSO is an acronym for Least Absolute Shrinkage and Selection Operator. Lasso regression is a form of regularization. For a more precise forecast, it is favoured over plain regression approaches. Shrinkage is used in this model: data values are shrunk towards a central point, known as the mean, in shrinkage.

Easy, sparse models are encouraged by
the lasso technique (i.e., models with
fewer parameters). This method of
regression is suitable for models with a
lot of multicollinearity or when you want
to automate parts of the model selection
process, such as variable selection and
parameter elimination.
The L1 regularization technique is used in
Lasso Regression. It is used when there
are a large number of features because it
performs feature selection automatically.

Lasso regression performs L1
regularization, which means it adds a penalty to the optimization objective equal to the sum of the absolute values of the coefficients. As a result, lasso regression minimizes the following:

Objective = RSS + α * (sum of absolute values of coefficients)
In this case, α (alpha) functions similarly to ridge and offers a trade-off between balancing RSS and coefficient magnitude. Similarly to ridge, α may take a range of values:

* α = 0: same coefficients as simple linear regression
* α = ∞: all coefficients are zero
* 0 < α < ∞: coefficients somewhere between zero and those of simple linear regression
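To make the role of α visible, the sketch below fits scikit-learn's Lasso for a few alpha values and counts the non-zero coefficients; the data and the specific alpha grid are assumptions.

    # Effect of alpha on lasso coefficients (data and alpha grid are illustrative).
    from sklearn.datasets import make_regression
    from sklearn.linear_model import Lasso

    X, y = make_regression(n_samples=100, n_features=10, n_informative=4,
                           noise=5.0, random_state=0)

    for alpha in [0.01, 1.0, 10.0, 100.0]:
        coef = Lasso(alpha=alpha, max_iter=10000).fit(X, y).coef_
        print(f"alpha={alpha:>6}: non-zero coefficients = {(coef != 0).sum()}")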
100K, it will not scale. In comparison to SVM or simple logistic regression, it requires higher runtime memory for prediction. It consumes much time to compute, especially for models with a lot of variables.

3.2. Decision Tree
A Decision Tree is a supervised learning technique that can be used to perform classification and regression tasks, while it is most typically employed for classification.

A decision tree has a root node, branch nodes, and leaf nodes, similar to a tree, with each node representing a characteristic or attribute, each branch representing a decision or rule, and each leaf representing a result. To split the features, decision tree algorithms are used. At each node, the splitting is tested to see if it is the most suited for the respective classes. A decision tree is a graphical layout that allows you to get all of the various answers for a decision based on the current situation. It only focuses on one question, and the tree is split into subtrees based on the answer.

The following are some of the benefits of using a Decision Tree: it is effective for both regression and classification problems, with ease of interpretation, the ability to fill incomplete data in attributes with the most likely value, and handling of categorical and quantitative values. It also has superior productivity due to the efficiency of the tree traversal algorithm. Over-fitting is a problem that Decision Tree may experience, and the answer is Random Forest, which is based on an ensemble modelling technique.

The following are the downsides of using a Decision Tree: it is unstable, it is difficult to manage tree size, it is prone to errors in sampling, and it provides a locally optimal answer rather than a globally ideal solution.
3.3. K-Nearest Neighbour
K-nearest neighbours (KNN) is a supervised machine learning algorithm that can be utilised to solve both classification and regression problems. With the K-NN model, fresh data can be quickly sorted into well-defined categories. To estimate the values of any new data points, the KNN algorithm makes use of "feature similarity." It evaluates the distances between a query and each example in the data, picks the K examples that are closest to the query, and then selects the label with the highest frequency (in the case of classification) or averages the labels (in the case of regression).

KNN analyses a given test tuple against comparable training tuples in the process of learning. An n-dimensional pattern space is used to hold all of the training tuples. When given an unidentified tuple, a k-nearest-neighbour classifier examines the pattern space for the k training tuples that are nearest to it. These k training tuples are the unknown tuple's k "nearest neighbours." [2]

Advantages of the KNN algorithm are the following: it is a simple technique that may be implemented quickly; it is inexpensive to construct the model; it is a very adaptable categorisation technique that is ideal for multi-modal classes and for records with several class labels; its error rate is at most twice the Bayes error rate; and it is sometimes the most effective method. When it came to predicting protein function based on expression profiles, KNN outperformed SVM.

Disadvantages of KNN are the following: it is relatively costly to classify unknown records; it requires calculating the distance to the k nearest neighbours; the algorithm becomes more computationally costly as the size of the training set grows; and accuracy will degrade as a result of noisy or irrelevant features.
3.4. Support Vector Machine
In Supervised Learning, Support Vector Machines (SVMs) are widely used for dealing with classification and regression problems. The purpose of SVM is to find the optimal line or decision boundary for dividing n-dimensional space into sections so that successive data points may be classified conveniently. These boundaries are known as hyperplanes. SVM can handle unstructured, semi-structured and structured data. Kernel functions ease the complexities in the data type.

This algorithm is divided into two categories: linear data and non-linear data. Mathematical programming and kernel functions are the two main implementations of SVM technology. In a high-dimensional space, the hyperplane divides data points of distinct kinds [4].

SVM has a number of limitations, including the following: because of the longer training time, it performs poorly when working with large data sets; the correct kernel function can be tough to locate; when a dataset is noisy, SVM does not perform well; probability estimates are not provided by SVM; and it is difficult to interpret the final SVM model.

WHAT IS CLASSIFICATION?
Classification predicts the category the
data belongs to. Some examples of
classification include spam detection,
churn prediction, sentiment analysis,
dog breed detection and so on.

WHAT IS A CLASSIFIER?
A classifier is a type of machine
learning algorithm that assigns a
label to a data input. Classifier
algorithms use labeled data and
statistical methods to produce
predictions about data input
classifications.
Classification is used for predicting
discrete responses.

2. K-NEAREST NEIGHBORS (K-NN)
K-NN algorithm is one of the simplest
classification algorithms and it is used
to identify the data points that are
separated into several classes to predict
the classification of a new sample point.
K-NN is a non-parametric, lazy learning
algorithm. It classifies new cases based
on a similarity measure (i.e., distance
functions).

[Flashcard notes on K-NN (Chris Albon):
* K is the number of neighbors to consider.
* Scaling the features is important.
* K should be odd to avoid ties.
* If we have binary features, we can use the Hamming distance.
* Voting can be weighted by the distance to each neighbor, so closer observations' votes are worth more.
* Try a variety of distance measurements.
* K-NN does not "learn" per se; it is lazy and just memorizes the data.
* It does not scale to large datasets.
* Neighborhood (K) size: small K gives low bias and high variance; large K gives high bias and low variance.]
K-NN works well with a small number
of input variables (p), but struggles
when the number of inputs is very
large.
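A compact sketch of the K-NN workflow described above, using scikit-learn with K=5, feature scaling, and distance-weighted voting; the synthetic dataset and those particular settings are assumptions for illustration.

    # K-NN sketch: scale features, use an odd K, optionally weight votes by distance.
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = make_classification(n_samples=300, n_features=6, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    knn = make_pipeline(StandardScaler(),                       # scaling matters for K-NN
                        KNeighborsClassifier(n_neighbors=5,     # odd K to avoid ties
                                             weights="distance"))
    knn.fit(X_tr, y_tr)
    print("test accuracy:", knn.score(X_te, y_te))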
3. SUPPORT VECTOR MACHINE (SVM)
Support vector machines are used for both
regression and classification. It is based
on the concept of decision planes that
define decision boundaries. A decision
plane (hyperplane) is one that separates
between a set of objects having
different class memberships.

It performs classification by finding the
hyperplane that maximizes the margin
between the two classes with the help of
support vectors.
[Flashcard note on SVC (Chris Albon): it finds the linear hyperplane that separates the classes with the maximum margin, defined by the support vectors.]

The learning of the hyperplane in SVM
is done by transforming the problem
using some linear algebra (i.e., the
example above is a linear kernel which
has a linear separability between each
variable).
For higher dimensional data, other kernels are used, as the points cannot be classified easily. They are specified in the next section.
Kernel SVM
Kernel SVM takes in a kernel function in
the SVM algorithm and transforms it
into the required form that maps data
on a higher dimension which is
separable.Types of kernel functions:
K(X_i, X_j) = X_i · X_j                (Linear)
K(X_i, X_j) = (γ X_i · X_j + C)^d      (Polynomial)
K(X_i, X_j) = exp(-γ |X_i - X_j|²)     (RBF)
K(X_i, X_j) = tanh(γ X_i · X_j + C)    (Sigmoid)
1. Linear SVM is the one we
discussed earlier.
2. In polynomial kernel, the degree
of the polynomial should be
specified. It allows for curved lines
in the input space.3. In the radial basis function (RBF)
kernel, it is used for non-linearly
separable variables. For distance,
metric squared Euclidean
distance is used. Using a typical
value of the parameter can lead to
overfitting our data. It is used by
default in sklearn.
4. The sigmoid kernel, similar to logistic regression, is used for binary classification.
[Flashcard note on the kernel trick (Chris Albon): support vector classifiers can be written in terms of dot products of the observations; the kernel trick is to replace the dot product with a kernel function, which allows non-linear decision boundaries and is computationally efficient.]

Kernel trick uses the kernel function to
transform data into a higher
dimensional feature space and makes it
possible to perform the linear
separation for classification.
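The sketch below trains scikit-learn's SVC with a linear and an RBF kernel on a toy non-linear problem (two concentric circles); the dataset and parameter values are assumptions chosen only to show why a non-linear kernel can help.

    # Kernel SVM sketch: linear vs. RBF kernel on non-linearly separable data.
    from sklearn.datasets import make_circles
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = make_circles(n_samples=300, factor=0.4, noise=0.1, random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

    for kernel in ["linear", "rbf"]:
        clf = SVC(kernel=kernel, C=1.0).fit(X_tr, y_tr)
        print(kernel, "accuracy:", clf.score(X_te, y_te))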
4. NAIVE BAYES

The naive Bayes classifier is based on
Bayes’ theorem with the independence
assumptions between predictors (i.e., it
assumes the presence of a feature in a class is unrelated to any other feature). Even if these features depend on each other, or upon the existence of the other features, all of these properties contribute to the probability independently. Thus, the name naive Bayes.

[Flashcard note on Bayes' theorem (Chris Albon): posterior = likelihood * prior / evidence, i.e. P(A|B) = P(B|A) * P(A) / P(B).]
Based on naive Bayes, Gaussian naive
Bayes is used for classification based on
the Gaussian (normal) distribution of data.

[Flashcard note on the Gaussian naive Bayes classifier (Chris Albon): "Gaussian" because it assumes a normal distribution; P(class | data) = P(data | class) * P(class) / P(data).]
* P(class/data) is the posterior probability of the class (target) given the predictor (attribute): the probability of a data point having either class, given the data point. This is the value that we are looking to calculate.
* P(class) is the prior probability of the class.
* P(data/class) is the likelihood, which is the probability of the predictor given the class.
* P(data) is the prior probability of the predictor, or marginal likelihood.

[Figure: NB classification example]

Naive Bayes Steps
1. Calculate Prior Probability
P(class) = Number of data points in the
class/Total no. of observations
P(yellow) = 10/17
P(green) = 7/17
2. Calculate Marginal Likelihood
P(data) = Number of data points similar
to observation/Total no. of observations
P(data) = 4/17

This value is the same when checking both of the class probabilities.

3. Calculate Likelihood
P(data/class) = Number of similar
observations to the class/Total no. of
points in the class.
P(?/yellow) = 1/10
P(?/green) = 3/7

4. Posterior Probability for Each Class

P(class/data) = P(data/class) * P(class) / P(data)

P(yellow/?) = (1/10 * 10/17) / (4/17) = 0.25

P(green/?) = (3/7 * 7/17) / (4/17) = 0.75
5. Classification
P(class1/data) > P(class2/data)
P(green/?) > P(yellow/?)
The point belongs to the class with the higher posterior probability; from the above, with 75% probability the point belongs to class green.

Multinomial and Bernoulli naive Bayes are
the other models used in calculating
probabilities. Thus, a naive Bayes model
is easy to build, with no complicated
iterative parameter estimation, which
makes it particularly useful for very
large datasets.
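As a quick check of the worked example above, the few lines below redo the posterior arithmetic from the stated counts (10 yellow points and 7 green points, with 1 and 3 similar neighbouring points respectively, 4 similar points in total); the variable names are mine.

    # Recomputing the naive Bayes posteriors from the counts in the worked example.
    total = 17
    yellow, green = 10, 7            # class counts
    near_yellow, near_green = 1, 3   # similar points from each class (4 in total)

    p_data = (near_yellow + near_green) / total          # marginal likelihood = 4/17

    p_yellow_given_q = (near_yellow / yellow) * (yellow / total) / p_data
    p_green_given_q = (near_green / green) * (green / total) / p_data

    print("P(yellow | ?):", round(p_yellow_given_q, 2))  # 0.25
    print("P(green  | ?):", round(p_green_given_q, 2))   # 0.75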
5. DECISION TREE CLASSIFICATION
Decision tree builds classification or
regression models in the form of a tree
structure. It breaks down a dataset into
smaller and smaller subsets while at the
same time an associated decision tree is
incrementally developed. The final
result is a tree with decision
nodes and leaf nodes. It follows Iterative
Dichotomiser 3 (ID3) algorithm
structure for determining the split.

[Flashcard note on decision trees (Chris Albon): decision trees have high interpretability after being trained, you can literally draw them; split the data on the feature that provides the highest information gain.]
Entropy and information gain are used
to construct a decision tree.
Entropy
Entropy is the degree or amount of
uncertainty in the randomness of
elements. In other words, it is a
measure of impurity.
E(S) = -Σ p_i log2(p_i), summed over the classes i.

Intuitively, it tells us about the
predictability of a certain event.
Entropy calculates the homogeneity of a
sample. If the sample is completely
homogeneous the entropy is zero, and if
the sample is equally divided it has an
entropy of one.
Information Gain
Information gain measures the relative
change in entropy with respect to the
independent attribute. It tries to
estimate the information contained by
each attribute. Constructing a decision
tree is all about finding the attribute
that returns the highest information
gain (i.e., the most homogeneous
branches).Gain(T, X) = Entropy(T)— Entropy(T.X)
Where Gain(TZ, X) is the information gain
by applying feature X. Entropy(T) is the
entropy of the entire set, while the
second term calculates the entropy after
applying the feature x.
Information gain ranks attributes for
filtering at a given node in the tree. The
ranking is based on the highest information gain in each split.
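To ground the two formulas, here is a small sketch that computes entropy and the information gain of a candidate split on a made-up set of binary labels; the label values and the split itself are assumptions.

    # Entropy and information gain sketch on made-up binary labels.
    import math
    from collections import Counter

    def entropy(labels):
        """E(S) = -sum(p_i * log2(p_i)) over the classes present in `labels`."""
        counts = Counter(labels)
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in counts.values())

    parent = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]         # equally divided -> entropy 1.0
    left, right = [1, 1, 1, 1, 0], [1, 0, 0, 0, 0]  # a candidate split of the parent

    weighted_child_entropy = (len(left) / len(parent)) * entropy(left) \
                           + (len(right) / len(parent)) * entropy(right)
    gain = entropy(parent) - weighted_child_entropy

    print("parent entropy:", entropy(parent))
    print("information gain of the split:", round(gain, 3))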
The disadvantage of a decision tree
model is overfitting, as it tries to fit the
model by going deeper in the training
set and thereby reducing test accuracy.Overfitting occurs when
G model Starts to memorize overfit model
the aspects of the training set
Qnd inturn loses the ability,
to generalize
ChrisAlbon
Overfitting in decision trees can be
minimized by pruning nodes.Accuracy, Precision, Recall and F-1
Score
From the confusion matrix, we can infer
accuracy, precision, recall and F-1 score.
Accuracy
Accuracy is the fraction of predictions
our model got right.
[Flashcard note on accuracy (Chris Albon): accuracy = number of correct predictions / number of observations. A common metric in classification. It fails when we have highly imbalanced classes; in those cases F1 is more appropriate.]

Accuracy can also be written as
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Accuracy alone doesn’t tell the full story
when working with a class-imbalanced
data set, where there is a significant
disparity between the number of
positive and negative labels. Precision
and recall are better metrics for
evaluating class-imbalanced problems.
Precision
Out of all the predicted positive classes, precision measures how many we predicted correctly.

[Flashcard note on precision (Chris Albon): precision is the ability of a classifier to not label a true negative observation as positive. Precision = True Positives / (True Positives + False Positives).]
Precision should be as high as possible.
Recall
Out of all the positive classes, recall is
how much we predicted correctly. It is
also called sensitivity or true positive
rate (TPR).

[Flashcard note on recall (Chris Albon): recall is about the real positives. Recall = True Positives / (True Positives + False Negatives). Recall is the ability of the classifier to find positive examples; if we wanted to be certain to find all positive examples, we could maximize recall.]
Recall should be as high as possible.
F-1 Score
It is often convenient to combine
precision and recall into a single metric
called the F-1 score, particularly if you
need a simple way to compare two
classifiers. The F-1 score is the harmonic
mean of precision and recall.

[Flashcard note on the F1 score (Chris Albon): F1 = 2 * (Precision * Recall) / (Precision + Recall). The F1 score is the harmonic mean of precision and recall; values range from 0 (bad) to 1 (good).]
The regular mean treats all values
equally, while the harmonic mean gives
much more weight to low values
thereby punishing the extreme values
more. As a result, the classifier will only
get a high F-1 score if both recall and
precision are high.
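Tying the four formulas above together, the snippet below computes accuracy, precision, recall and F-1 from a made-up set of confusion-matrix counts; the TP/TN/FP/FN values are purely illustrative.

    # Accuracy, precision, recall and F-1 from illustrative confusion-matrix counts.
    tp, tn, fp, fn = 40, 45, 5, 10

    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * (precision * recall) / (precision + recall)

    print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
          f"recall={recall:.2f} f1={f1:.2f}")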
[Figure: a two-phase pipeline for tweets: 1 - training (feature extraction feeds machine learning classifiers) and 2 - prediction.]

3. Test set (20% of the original data set):
Now we have chosen our preferred
prediction algorithm but we don't know
yet how it's going to perform on
completely unseen real-world data. So, we
apply our chosen prediction algorithm on
our test set in order to see how it's going
to perform so we can have an idea about
our algorithm's performance on unseen
data. I suppose that if your algorithms did
not have any parameters then you would
not need a third step. In that case, your
validation step would be your test step.
This data set is used only for testing the
final solution in order to confirm the actual
predictive power of the network.1. Training set (60% of the original data
set): This is used to build up our
prediction algorithm and to adjust the
weights on the neural network. Our
algorithm tries to tune itself to the quirks
of the training data sets. In this phase we
usually create multiple algorithms in order
to compare their performances during the
Cross-Validation Phase. Each type of
algorithm has its own parameter options
(the number of layers in a Neural Network,
the number of trees in a Random Forest,
etc). For each of your algorithms, you
must pick one option. That's why you
have a training set.
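A small sketch of carving out the three sets in the proportions mentioned above (60/20/20) with two successive train_test_split calls; the synthetic data is an assumption.

    # 60/20/20 train/validation/test split sketch (synthetic data is an assumption).
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split

    X, y = make_classification(n_samples=1000, random_state=0)

    # First carve off 20% for the final test set...
    X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
    # ...then split the remaining 80% into 60% train / 20% validation (0.25 of the rest).
    X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25,
                                                      random_state=0)

    print("train:", len(X_train), "validation:", len(X_val), "test:", len(X_test))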
How Testing and Training Data Are Used

Algorithms that examine your training dataset,
classify the inputs and outputs, and then analyze it
again are the foundation for machine learning
models.
An issue arises when an algorithm needs to take
into account data from other sources, such as real-
world consumers, because a sufficiently trained
algorithm will effectively memorize all of the inputs
and outputs in a training dataset.
There are three steps in the training data process:
* Feed - supplying data to a model.
* Define - the model creates text vectors from the training data (numbers that represent data features).
* Test - feed the model test data (unseen data) to complete the process.
After training is finished, you can test the model
using the 20% of the original dataset that you
saved (without labeled results, if using supervised
learning). Here, the model is adjusted to ensure
that it performs as intended.

You don't have to bother about fine-tuning in Obviously AI because the entire procedure (training and testing percentages) is completed in a matter of
seconds. To ensure that it's not a black box, we
constantly advise knowing what's going on in the
background.
The amount of training data required
This is a frequently asked question, and the
response is: It depends.
This is the type of response you'll receive from the
majority of data scientists; we don't mean to be
evasive. This is because the answer depends on different variables, such as:

* The difficulty of the issue
* The degree of the learning algorithm's complexity

We constantly say at Obviously AI: the more data,
the better. That's because your model will get
smarter the more you train it. However, you can
still get reliable results if your data is well-prepared,
adheres to a simple data prep checklist, and is
prepared for machine learning. And thanks to our technology, those precise findings can be produced in a matter of seconds.