Machine Learning General: Definition
Here, the validation loss is much better than the training loss, which reflects that the validation dataset is easier to predict than the training dataset. An explanation could be that the validation data is scarce but well represented by the training dataset, so the model performs extremely well on these few examples.

Notes
→ Ablation: An ablation study is turning off components of a model (e.g. features or sub-models) one at a time, to see how much each contributes to the model's performance.
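A minimal sketch of such an ablation loop (the estimator, cross-validation setup, and data are illustrative assumptions, not from the original notes):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def ablation_study(X, y, feature_names):
    # Baseline score with all features turned on
    base = cross_val_score(RandomForestClassifier(), X, y, cv=5).mean()
    print(f"baseline accuracy: {base:.3f}")
    for j, name in enumerate(feature_names):
        X_drop = np.delete(X, j, axis=1)          # turn off one feature
        score = cross_val_score(RandomForestClassifier(), X_drop, y, cv=5).mean()
        print(f"without {name}: {score:.3f} (drop = {base - score:+.3f})")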
Model Evaluation

Classification Problems

Confusion Matrix
Type I error: The null hypothesis H0 is rejected when it is true.
Type II error: The null hypothesis H0 is not rejected when it is false.
→ False positive (Type I error): incorrectly decide yes
→ False negative (Type II error): incorrectly decide no
Ex: We assume the null hypothesis H0 is true.
→ H0: Person is not guilty
→ H1: Person is guilty

1. Accuracy: (TP + TN) / (TP + TN + FP + FN)
→ Ratio of correct predictions over total predictions.
Estimate of P[D = Y], the probability that the decision equals the outcome.

2. Recall or Sensitivity or True Positive Rate: TP / (TP + FN)
Completeness of model → Out of total actual positive (1) values, how often the classifier is correct.
Probability: P[D = 1 | Y = 1]
Example: "Fraudulent transaction detector" or "Person has cancer" → +ve (1) class is "fraud": Optimize for sensitivity, because false positives (FP, normal transactions that are flagged as possible fraud) are more acceptable than false negatives (FN, fraudulent transactions that are not detected).

3. Precision: TP / (TP + FP)
Exactness of model → Out of total predicted positive (1) values, how often the classifier is correct.
Probability: P[Y = 1 | D = 1]; if our model says positive, how likely it is correct in that judgement.
Example: "Spam Filter" → +ve (1) class is spam → Optimize for precision or specificity, because false negatives (FN, spam goes to the inbox) are more acceptable than false positives (FP, non-spam caught by the spam filter).
Example: "Hotel booking canceled" → +ve (1) class is isCanceled → Optimize for precision or specificity, because false negatives (FN, isCanceled labeled as "not canceled" 0) are more acceptable than false positives (FP, isNotCanceled labeled as "canceled" 1).

4. F1-Score = 2 × (Precision × Recall) / (Precision + Recall)
Use when false positives (FP) and false negatives (FN) are equally important.

5. False Positive Rate: FP / (TN + FP)
Fraction of negatives wrongly classified positive.
Probability: P[D = 1 | Y = 0]

6. False Negative Rate: FN / (TP + FN) = 1 − Recall
Fraction of positives wrongly classified negative.
Probability: P[D = 0 | Y = 1]

7. Specificity: TN / (TN + FP) = 1 − FPR
Fraction of negatives rightly classified negative.
Probability: P[D = 0 | Y = 0]

• ROC curve: The curve illustrates the trade-off between the true positive rate TPR (sensitivity or recall) and the false positive rate FPR across classification thresholds α.
→ Lowering the classification threshold classifies more items as positive, thus increasing both False Positives and True Positives.
Note: We can think of the plot as the fraction of correct predictions for the positive class (y-axis) versus the fraction of errors for the negative class (x-axis).
How to choose the threshold for logistic regression? The choice of a threshold depends on the importance of TPR and FPR for the classification problem. If there is no external concern about low TPR or high FPR, one option is to weight them equally by choosing the threshold that maximizes TPR − FPR.

import numpy as np
from sklearn.metrics import roc_curve
# model, X_test, y_test: an already fitted classifier and held-out data
# Get predicted probabilities
y_prob = model.predict_proba(X_test)[:, 1]
# Calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_prob)
# Optimal threshold using Youden's J statistic
youden_index = tpr - fpr
optimal = thresholds[np.argmax(youden_index)]

• Area Under the ROC Curve (AUC): the points of an ROC curve can be summarized efficiently with a sorting-based algorithm; the resulting area is the AUC. AUC ranges in value from 0 to 1 and measures how well the model separates positives from negatives (perfect AUC = 1, baseline = 0.5).
• Precision-Recall curve: Focuses on the correct prediction of the minority class, useful when data is imbalanced. Plots precision against recall at different thresholds.
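A minimal scikit-learn sketch of the metrics above (y_test and y_pred are assumed to come from an already fitted classifier):

from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, confusion_matrix)

tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()
print("accuracy   ", accuracy_score(y_test, y_pred))    # (TP+TN)/(TP+TN+FP+FN)
print("precision  ", precision_score(y_test, y_pred))   # TP/(TP+FP)
print("recall     ", recall_score(y_test, y_pred))      # TP/(TP+FN)
print("f1         ", f1_score(y_test, y_pred))          # harmonic mean of the two
print("specificity", tn / (tn + fp))                    # TN/(TN+FP) = 1 - FPR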
Regression Problems

1. Mean Squared Error: MSE = (1/n) Σ_i (y_i − ŷ_i)²

2. Root Mean Squared Error: RMSE = √( Σ_{i=1}^{N} (y_i − ŷ_i)² / N )

3. Mean Absolute Error: MAE = (1/n) Σ_i |y_i − ŷ_i|

4. Sum of Squared Error: SSE = Σ_i (y_i − ŷ_i)²

5. Total Sum of Squares: SST = Σ_i (y_i − ȳ)²

6. R² Error: R² = 1 − MSE(model)/MSE(baseline) = 1 − SSE/SST

7. Adjusted R²: R²_adj = 1 − (1 − R²) · (n − 1)/(n − k − 1)
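A minimal numpy/scikit-learn sketch of the regression metrics above (the toy arrays and k, the assumed number of predictors, are illustrative):

import numpy as np
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

y_true = np.array([3.0, 5.0, 7.5, 10.0])
y_pred = np.array([2.8, 5.4, 7.0, 10.5])

mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)
mae = mean_absolute_error(y_true, y_pred)
r2 = r2_score(y_true, y_pred)                     # 1 - SSE/SST
n, k = len(y_true), 1                             # k = number of predictors (assumed)
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)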
Variance, R² and the Sum of Squares
• The total sum of squares: SS_total = Σ_i (y_i − ȳ)²
• This scales with variance: var(Y) = (1/n) Σ_i (y_i − ȳ)²
• The regression sum of squares: SS_regression = Σ_i (ŷ_i − ȳ)² → nVar(predictions)
• The residual sum of squares (squared error): SS_residual = Σ_i (y_i − ŷ_i)² → nVar(ϵ)
Note: ϵ̄ = 0, E[ŷ] = ȳ
SS_total = SS_regression + SS_residual
R² = 1 − SS_residual/SS_total = SS_regression/SS_total = nVar(preds)/nVar(Y) = Var(preds)/Var(Y)
→ Explained Variance: R² quantifies how much of the variability in the outcome (dependent) variable can be explained by the predictor (independent) variables.
→ Goodness of Fit: A higher R² value generally suggests a better fit of the model to the data, meaning the model's predictions are closer to the actual values.
→ An R² of 1 indicates a perfect fit, where the model explains all the variability, while an R² of 0 indicates that the model explains none of the variability.
→ R² is not valid for nonlinear models, as SS_regression + SS_residual ≠ SS_total.

→ Cost function: The cost function J is commonly used to assess the performance of a model, and is defined with the loss function L as follows:
J(θ) = Σ_{i=1}^{m} L(h_θ(x^(i)), y^(i))
• Cost functions for regression: Mean Squared Error (MSE), Mean Absolute Error (MAE), Huber Loss, Log-Cosh Loss
• Cost functions for classification: Binary Cross-Entropy Loss, Categorical Cross-Entropy Loss, Sparse Categorical Cross-Entropy Loss, Hinge Loss, Squared Hinge Loss

Convex & Non-convex
A convex function is one where a line drawn between any two points on the graph lies on or above the graph. It has one minimum. A non-convex function is one where a line drawn between any two points on the graph may intersect other points on the graph. It is characterized as "wavy".
→ When a cost function is non-convex, the optimizer may find a local minimum instead of the global minimum, which is typically undesired in machine learning models from an optimization perspective.

General Optimization Steps
1. Understand data (features and outcome variables) → 2. Define loss (or gain/utility) function → 3. Define predictive model → 4. Search for parameters that minimize the loss function

Gradient Descent
Tips: For effective gradient descent, select an optimal learning rate, scale features, initialize parameters wisely, utilize mini-batch processing, monitor convergence, experiment with various optimizers, apply regularization techniques, avoid local minima, visualize loss trends, and tune hyperparameters diligently.
• Stochastic Gradient Descent uses a single point to compute gradients, leading to noisier convergence but faster updates.
→ Time Complexity: O(k · m), where m is the number of features. In each iteration, SGD computes the gradient using only one data point, leading to O(k · m) for m features.
• Mini-batch Gradient Descent trains on small subsets (batches of size b) of the data, striking a balance between the two approaches. → Time Complexity: O(k · b · m)

Ordinary Least Squares
General Linear Regression Model: ŷ = β₀ + Σ_j β_j x_j + ϵ
Here, β_j is the j-th coefficient and x_j is the j-th feature.
Ordinary Least Squares - find the coefficient vector β that minimizes the squared error:
arg min_β Σ_i (y_i − ŷ_i)²
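A minimal numpy sketch contrasting the closed-form OLS solution with a plain batch gradient-descent loop (the synthetic data, learning rate, and iteration count are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                       # two features
y = 3.0 + 2.0 * X[:, 0] - 1.0 * X[:, 1] + rng.normal(scale=0.1, size=100)
Xb = np.c_[np.ones(len(X)), X]                      # add intercept column

# Closed-form OLS solution of arg min ||Xb @ beta - y||^2
beta_ols, *_ = np.linalg.lstsq(Xb, y, rcond=None)

# Batch gradient descent on the squared-error cost
beta = np.zeros(Xb.shape[1])
lr = 0.1                                            # illustrative learning rate
for _ in range(500):
    grad = 2 / len(y) * Xb.T @ (Xb @ beta - y)      # gradient of mean squared error
    beta -= lr * grad

print(beta_ols)   # approximately [3, 2, -1]
print(beta)       # should be close to the OLS solution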
Procedure:
1. Calculate the entropy of the outcome classes (c):
E(T) = − Σ_{i=1}^{c} p_i log₂ p_i
Convolutional Neural Network
CNN is a neural network architecture that is well-suited for image classification and object recognition tasks. The general CNN architecture is as follows:
→ A convolutional neural network starts by taking an input image, represented as a matrix of pixel values.
→ This input image is passed through convolutional layers. Here, a set of filters is applied to the input image to detect features like edges, textures, and patterns. Each filter produces a feature map that highlights a specific aspect of the input image.
→ After each convolution, an activation function (like ReLU) is applied to introduce non-linearity, enabling the network to learn more complex patterns.
→ This produces feature maps. Different weights lead to different feature maps.
→ The feature maps are then passed through pooling layers, which downsample the spatial dimensions by taking the maximum or average value in small regions. This reduces the size of the feature maps and retains essential information, making the network more efficient and less sensitive to slight changes in the input.
→ The feature maps produced by the convolutional and pooling layers are then passed through multiple additional convolutional and pooling layers, each layer learning increasingly complex features of the input image.
→ Finally, the output obtained above is fed into a fully connected layer for classification, object detection, or other structural analyses. The final output of the network is a predicted class label or probability score for each class, depending on the task.
Question: Describe the difference between batch normalization and layer normalization.

Recurrent Neural Network (RNN)
→ RNNs memorize information from previous data with feedback loops inside them, which help keep information over time.
→ An RNN block has an arrow pointing to itself, indicating that the data inside block "A" will be used recursively. Once expanded, its structure is equivalent to a chain-like structure.
→ Learning to store information over long time intervals via recurrent backpropagation takes a very long time, and the gradient gradually vanishes as it propagates to earlier time steps. These downstream gradients rely on parameter (weight) sharing for efficiency, and repeatedly multiplying values greater than or less than 1 leads to:
– Exploding gradients - model instability and overflows
– Vanishing gradients - loss of learning ability
→ This can be solved using:
– Gradient clipping - cap the maximum value of gradients
– ReLU - its derivative prevents gradient shrinkage for x > 0
– Gated cells - regulate the flow of information
• Vanishing gradient problem for RNNs (figure): the sensitivity to early inputs decays as the network backpropagates through time; the darker the shade, the greater the sensitivity.
Also, because the problem is non-convex, RNN training can get stuck in a local minimum instead of the global minimum. To overcome these problems, the LSTM was introduced as an RNN-based learning algorithm for sequence and language modelling.
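As a small illustration of the gated-cell idea above, a minimal PyTorch sketch of an LSTM followed by a linear head (layer sizes, sequence length, and the classification head are illustrative assumptions, not from the original notes):

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    def __init__(self, n_features, hidden_size=32, n_classes=2):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, n_classes)

    def forward(self, x):                 # x: (batch, time, n_features)
        out, _ = self.lstm(x)             # gated cells regulate information flow
        return self.head(out[:, -1])      # use the last time step for classification

model = LSTMClassifier(n_features=8)
x = torch.randn(4, 20, 8)                 # 4 sequences, 20 time steps, 8 features
logits = model(x)                          # shape: (4, 2)

# Gradient clipping (mentioned above) would be applied during training:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)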
Time Series
Time Series is generally data which is collected over time and is dependent on it. It is a random sequence {Xt} of real values recorded at successive equally spaced points in time.
→ Not every dataset collected with respect to time represents a time series.
→ Methods of prediction and forecasting on time-based data constitute Time Series Modeling.
• Examples of time series: Stock Market Price, Passenger Count of airlines, Temperature over time, Monthly Sales Data, Quarterly/Annual Revenue, Hourly Weather Data/Wind Speed, IoT sensors in Industries and Smart Devices, Energy Forecasting

→ Difference between Time Series and Regression
• Time Series is time dependent. However, the basic assumption of a linear regression model is that the observations are independent.
• Along with an increasing or decreasing trend, most Time Series have some form of seasonality.
Note:
→ Predicting a time series using regression techniques is not a good approach.
→ Time series forecasting is the use of a model to predict future values based on previously observed values.

→ A stochastic process is defined as a collection of random variables X = {Xt : t ∈ T} defined on a common probability space, taking values in a common set S (the state space), and indexed by a set T, often either N or [0, ∞) and thought of as time (discrete or continuous respectively) (Oliver, 2009).

Time Series Statistical Models
A time series model specifies the joint distribution of the sequence {Xt} of random variables; e.g.,
P(X1 ≤ x1, . . . , Xt ≤ xt) for all t and x1, . . . , xt
Typically, a time series model can be described as
Xt = mt + st + Yt
where mt: trend component; st: seasonal component; Yt: zero-mean error.

Note: The following are some zero-mean models.
→ iid noise: The simplest time series model is one with no trend or seasonal component, where the observations Xt are simply independent and identically distributed random variables with zero mean. Such a sequence of random variables {Xt} is referred to as iid noise:
P(X1 ≤ x1, . . . , Xt ≤ xt) = Π_t P(Xt ≤ xt) = Π_t F(xt)
where F(·) is the cdf of each Xt. Further, E(Xt) = 0 for all t. We denote such a sequence as Xt ∼ IID(0, σ²). IID noise is not interesting for forecasting, since Xt | X1, . . . , Xt−1 has the same distribution as Xt.
→ iid noise example: A binary (discrete) process {Xt} is a sequence of iid random variables Xt with
P(Xt = 1) = 0.5, P(Xt = −1) = 0.5
→ Gaussian noise example: A continuous process: Gaussian noise {Xt} is a sequence of iid normal random variables with zero mean and variance σ²; i.e., Xt ∼ N(0, σ²).
→ Random walk: The random walk {St, t = 0, 1, 2, . . .} (starting at zero, S0 = 0) is obtained by cumulatively summing (or "integrating") random variables; i.e., S0 = 0 and
St = X1 + · · · + Xt, for t = 1, 2, . . .
where {Xt} is iid noise with zero mean and variance σ². Note that by differencing we can recover Xt; i.e.,
∇St = St − St−1 = Xt
Further, we have
E(St) = E(Σ_t Xt) = Σ_t E(Xt) = 0
Var(St) = Var(Σ_t Xt) = Σ_t Var(Xt) = tσ²
→ White noise: We say {Xt} is white noise, i.e., Xt ∼ WN(0, σ²), if {Xt} is uncorrelated, i.e., Cov(Xt1, Xt2) = 0 for any t1 and t2, with E[Xt] = 0 and Var(Xt) = σ².
Note: Every IID(0, σ²) sequence is WN(0, σ²) but not conversely.
• Moving Average Smoother: This is an essentially non-parametric method for trend estimation. It takes averages of observations around t; i.e., it smooths the series. For example, let
Xt = (1/3)(Wt−1 + Wt + Wt+1)
which is a three-point moving average of the white noise series Wt.
→ AR(1) model (Autoregression of order 1): Let
Xt = 0.6 Xt−1 + Wt
where Wt is a white noise series. It represents a regression or prediction of the current value Xt of the time series as a function of the previous value of the series.
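A minimal numpy sketch that simulates the zero-mean models above (the sample size and random seed are illustrative; the AR coefficient 0.6 follows the example):

import numpy as np

rng = np.random.default_rng(42)
n = 200
w = rng.normal(0.0, 1.0, size=n)             # Gaussian / white noise, WN(0, 1)

random_walk = np.cumsum(w)                    # S_t = X_1 + ... + X_t
recovered = np.diff(random_walk)              # differencing recovers X_t (for t >= 2)

ma3 = np.convolve(w, np.ones(3) / 3, mode="valid")   # three-point moving average smoother

# AR(1): X_t = 0.6 * X_{t-1} + W_t
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.6 * x[t - 1] + w[t]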
Stationary Process
Extracts characteristics from time-sequenced data, which may exhibit the following characteristics:
– Stationarity - statistical properties such as mean, variance and auto-correlation are constant over time, the autocovariance does not depend on time, and there is no trend or seasonality
– Non-Stationarity - there are 2 major reasons behind the non-stationarity of a Time Series:
– Trend - varying mean over time (mean is not constant)
– Seasonality - variations at specific time-frames (standard deviation is not constant)
– Trend - a general direction in which something is developing or changing
– Seasonality - any predictable change or pattern in a time series that recurs or repeats over a specific time period (calendar times), occurring at regular intervals of less than a year
– Cyclicality - variations without a fixed time length, occurring in periods of greater or less than one year
– Autocorrelation - degree of linear similarity between current and lagged values
• Cross-validation must account for the time aspect, such as for each fold Fx (see the sketch below):
– Sliding Window - train F1, test F2; then train F2, test F3
– Forward Chain - train F1, test F2; then train F1, F2, test F3
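A minimal scikit-learn sketch of time-aware cross-validation (TimeSeriesSplit implements the forward-chain idea; the placeholder arrays are illustrative):

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(20).reshape(-1, 1)          # placeholder time-ordered features
y = np.arange(20)                          # placeholder targets

# Forward chain: each split trains on all earlier folds and tests on the next one
for train_idx, test_idx in TimeSeriesSplit(n_splits=4).split(X):
    print("train", train_idx, "-> test", test_idx)

# A sliding window can be approximated with the max_train_size argument:
# TimeSeriesSplit(n_splits=4, max_train_size=5)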
• Exponential Smoothing - uses an exponentially decreasing weight on observations over time, and takes a moving average. The time t output is st = α xt + (1 − α) st−1, where 0 < α < 1.
• Double Exponential Smoothing - applies a recursive exponential filter to capture trends within a time series:
st = α xt + (1 − α)(st−1 + bt−1)
bt = β(st − st−1) + (1 − β) bt−1
Triple exponential smoothing adds a third variable γ that accounts for seasonality.
• ARIMA - models a time series using three parameters (p, d, q):
– Autoregressive - the past p values affect the next value
– Integrated - values are replaced with the difference between current and previous values, using the difference degree d (0 for stationary data, and 1 for non-stationary)
– Moving Average - the number of lagged forecast errors and the size of the moving average window q
• SARIMA - models seasonality through four additional seasonality-specific parameters: P, D, Q, and the season length s
• Prophet - additive model that uses non-linear trends to account for multiple seasonalities such as yearly, weekly, and daily.
→ Robust to missing data and handles outliers well.
→ Can be represented as: y(t) = g(t) + s(t) + h(t) + ϵ(t), with four distinct components for the growth over time, seasonality, holiday effects, and error. This specification is similar to a generalized additive model.
• Generalized Additive Model - combines predictive methods while preserving additivity across variables, in a form such as y = β0 + f1(x1) + · · · + fm(xm), where the functions can be non-linear.
→ GAMs also provide regularized and interpretable solutions for regression and classification problems.
Tutorial: Complete Guide on Time Series Analysis in Python

Natural Language Processing
NLP is the discipline of building machines that can manipulate human language (or data that resembles human language) in the way that it is written, spoken, and organized. It evolved from computational linguistics.
(Figures: Evolution of NLP, NLP Applications, Challenges in NLP)
• The 3 stages of an NLP pipeline are: Text Processing → Feature Extraction → Modeling.

Text Processing
Take raw input text, clean it, normalize it, and convert it into a form that is suitable for feature extraction.
Libraries: nltk, spacy
– Lower casing
– Removing punctuation, tags, URLs, etc., depending on the problem
– Converting chat words used in social media to normal words
– Spelling correction using libraries like TextBlob
– Stop words - removes common and irrelevant words (the, is)
Note: Do not remove stop words when using POS Tagging in text processing.
– Tokenization - splits text into individual words (tokens) and word fragments.
• Sentence-level tokenization involves splitting a text into individual sentences.
• Word-level tokenization involves splitting each sentence into individual words or tokens.
– Lemmatization - reduces words to their base form based on dictionary definition (am, are, is → be)
– Stemming - reduces words to their base form without context (ended → end)
– Language Detection

Advanced Text Processing
– POS Tagging
– Parse Tree
– Coreference Resolution
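A minimal nltk sketch of the text-processing steps above (the example sentence is illustrative, and the one-time nltk resource downloads are assumed to have been run):

from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads (assumed): nltk.download('punkt'), nltk.download('stopwords'),
# nltk.download('wordnet')

text = "The cats are running faster than the dogs!"
tokens = word_tokenize(text.lower())                                  # lower casing + tokenization
tokens = [t for t in tokens if t.isalpha()]                           # drop punctuation
tokens = [t for t in tokens if t not in stopwords.words("english")]   # stop-word removal

print([PorterStemmer().stem(t) for t in tokens])                      # stemming: running -> run
print([WordNetLemmatizer().lemmatize(t, pos="v") for t in tokens])    # lemmatization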
Feature Extraction
→ Feature Extraction = Text Representation = Text Vectorization
Common Terms: • Corpus • Vocabulary • Document • Word
→ Most conventional machine learning techniques work on features (generally numbers that describe a document in relation to the corpus that contains it) created by either Bag-of-Words, TF-IDF, or generic (custom) feature engineering such as document length, word polarity, and metadata (for instance, if the text has associated tags or scores).
Note: Deep learning does not require manual feature engineering.
– Bag-of-Words - counts the number of times each word or n-gram (combination of n words) appears in a document.
– n-gram - predicts the next term in a sequence of n terms based on Markov chains
→ Markov Chain - a stochastic and memoryless process that predicts future events based only on the current state
Note: A word is important if it occurs many times in a document. But that creates a problem: words like "a" and "the" appear often, and as such their TF score will always be high. We resolve this issue by using Inverse Document Frequency, which is high if the word is rare and low if the word is common across the corpus. The TF-IDF score of a term is the product of TF and IDF.
→ CountVectorizer - Bag of Words
→ TfidfTransformer - TF-IDF values
→ TfidfVectorizer - Bag of Words AND TF-IDF values
Cosine Similarity - measures similarity between vectors, calculated as cos(θ) = A·B / (||A|| ||B||), which ranges from 0 to 1 for non-negative vectors such as TF-IDF.
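A minimal scikit-learn sketch of Bag-of-Words, TF-IDF, and cosine similarity on a toy corpus (the documents are illustrative):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = ["the cat sat on the mat",
          "the dog sat on the log",
          "cats and dogs are pets"]

bow = CountVectorizer().fit_transform(corpus)        # Bag-of-Words counts
tfidf = TfidfVectorizer().fit_transform(corpus)      # TF-IDF weights

print(cosine_similarity(tfidf[0], tfidf[1]))         # similarity of doc 0 and doc 1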
Sentiment Analysis
Extracts the attitudes and emotions from text.
• Polarity - measures positive, negative, or neutral opinions
– Valence shifters - capture amplifiers or negators such as 'really fun' or 'hardly fun'
• Sentiment - measures emotional states such as happy or sad
• Subject-Object Identification - classifies sentences as either subjective or objective

Topic Modelling
Captures the underlying themes that appear in documents (see the sketch after this list).
• Latent Dirichlet Allocation (LDA) - generates k topics by first assigning each word to a random topic, then iteratively updating assignments based on parameters α, the mix of topics per document, and β, the distribution of words per topic
• Latent Semantic Analysis (LSA) - identifies patterns using tf-idf scores and reduces data to k dimensions through SVD
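A minimal scikit-learn LDA sketch (the toy corpus and the choice of k = 2 topics are illustrative; assumes a recent scikit-learn with get_feature_names_out):

from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["stock prices fell on weak earnings",
        "the team won the football match",
        "markets rallied after the earnings report",
        "the coach praised the players after the game"]

counts = CountVectorizer(stop_words="english")
X = counts.fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)   # k = 2 topics

terms = counts.get_feature_names_out()
for k, topic in enumerate(lda.components_):        # word distribution per topic
    top = [terms[i] for i in topic.argsort()[-4:]]
    print(f"topic {k}:", top)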
Word Embedding
Word embeddings are numerical representations of words and phrases that allow similar words to have similar vector representations. They are often based on neural network models in deep learning.
→ Based on CBOW and Skip-gram: Word2vec, GloVe, fastText
– Continuous bag-of-words (CBOW) - predicts the word given its context
– Skip-gram - predicts the context given a word
• word2vec - trains iteratively over a corpus of text to learn the associations between words, and preserves the semantic information as well as the contextual meanings of words within a given corpus of text.
→ It uses the cosine similarity metric to measure semantic similarity. If the cosine angle is one, the words overlap completely, and analogies hold such that king − man + woman ≈ queen.
Note: In practice, CBOW tends to be used when only a small dataset is available.
• GloVe - based on the idea that words appearing in similar contexts tend to have related meanings; it builds a global co-occurrence matrix that captures the frequency of word co-occurrences within a context window across the entire corpus.
• BERT - accounts for word order and trains on subwords; unlike word2vec and GloVe, BERT outputs different vectors for different uses of a word (cell phone vs. blood cell).
→ Based on the transformer architecture.

NLP Tutorial
Duplicate Question Pairs - Quora Question Pairs: NLP Pipeline
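To make the word-embedding ideas above concrete, a minimal gensim sketch (assumes gensim ≥ 4 for the vector_size argument; the tiny corpus is illustrative and far too small for meaningful vectors):

from gensim.models import Word2Vec

sentences = [["the", "king", "rules", "the", "kingdom"],
             ["the", "queen", "rules", "the", "kingdom"],
             ["the", "man", "walks", "the", "dog"],
             ["the", "woman", "walks", "the", "dog"]]

# sg=0 -> CBOW (often preferred for small datasets), sg=1 -> skip-gram
model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0, epochs=100)

print(model.wv.similarity("king", "queen"))            # cosine similarity of two words
print(model.wv.most_similar(positive=["king", "woman"], negative=["man"], topn=1))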