3.1 Sentence Classification
• It is the task where an NLP model receives a sentence and assigns some label to it.
• A spam filter is an application of sentence classification. It receives an email message
and assigns whether or not it is spam.
• If you want to classify news articles into different topics (business, politics, sports, and
so on), it’s also a sentence-classification task.
• Sentence classification is one of the simplest NLP tasks that has a wide range of
applications, including document classification, spam filtering, and sentiment analysis.
• With the emergence of deep learning, performing complex operations on text has become
faster and easier.
• The first step in sentence classification is to represent variable-length sentences using
neural networks.
• Many modern NLP models use RNNs in some way.
• NLP is a field of Artificial Intelligence in which we try to process human
language, as text or speech, so that computers can understand it.
• To make a machine learn from the raw text we need to transform this data into a vector
format which then can easily be processed by our computers. This transformation of
raw text into a vector format is known as word representation.
3.2 Recurrent neural networks (RNNs)
• Recurrent neural networks are one of the most important concepts in deep NLP.
• RNNs are a powerful and robust type of neural network. Because of their internal memory,
they can remember important things about the input they received, which allows them to be
very precise in predicting what’s coming next.
• This is why they’re the preferred algorithm for sequential data like time series, speech,
text, financial data, audio, video, weather and much more. Recurrent neural networks
can form a much deeper understanding of a sequence and its context compared to other
algorithms.
Why Recurrent Neural Networks?
RNNs were created to address a few limitations of feed-forward neural networks. A feed-forward network:
• Cannot handle sequential data
• Considers only the current input
• Cannot memorize previous inputs
The solution to these issues is the RNN. An RNN can handle sequential data, accepting the
current input data, and previously received inputs. RNNs can memorize previous inputs due to
their internal memory.
How Do Recurrent Neural Networks Work?
Sequential data is basically just ordered data in which related things follow each other.
The most popular type of sequential data is perhaps time series data, which is just a series of
data points that are listed in time order.
Recurrent vs. Feed-forward neural networks
RNNs and feed-forward neural networks get their names from the way they channel
information.
In a feed-forward neural network, the information only moves in one direction — from the
input layer, through the hidden layers, to the output layer. The information moves straight
through the network.
Feed-forward neural networks have no memory of the input they receive and are bad at
predicting what’s coming next. Because a feed-forward network only considers the current
input, it has no notion of order in time. It simply can’t remember anything about what happened
in the past except its training.
In an RNN, the information cycles through a loop. When it makes a decision, it considers the
current input and also what it has learned from the inputs it received previously.
The two images below illustrate the difference in information flow between an RNN and a
feed-forward neural network.
A recurrent neural network, however, is able to remember earlier inputs because of its
internal memory. It produces output, copies that output and loops it back into the network.
Simply put: Recurrent neural networks add the immediate past to the present.
Therefore, an RNN has two inputs: the present and the recent past. This is important because
the sequence of data contains crucial information about what is coming next, which is why an
RNN can do things other algorithms can’t.
Recurrent Neuron and RNN Unfolding
The fundamental processing unit in a Recurrent Neural Network (RNN) is a Recurrent Unit,
which is not explicitly called a “Recurrent Neuron.” This unit has the unique ability to maintain
a hidden state, allowing the network to capture sequential dependencies by remembering
previous inputs while processing.
By unrolling we mean that we write out the network for the complete sequence. For example,
if the sequence we care about is a sentence of 3 words, the network would be unrolled into a 3-
layer neural network, one layer for each word.
Recurrent Neural Networks and Backpropagation Through Time
To understand the concept of backpropagation through time (BPTT), you’ll need to understand
the concepts of forward and backpropagation first. We could spend an entire article discussing
these concepts, so I will attempt to provide as simple a definition as possible.
What Is Backpropagation Through Time?
Backpropagation (backprop) is known as a workhorse algorithm in machine learning, and
backpropagation through time (BPTT) is its application to recurrent networks.
Backpropagation is used for calculating the gradient of an error function with respect to a neural
network’s weights. The algorithm works its way backwards through the layers of the network
to find the partial derivatives of the error with respect to the weights. Training then
uses these derivatives to adjust the weights so that the error decreases.
In neural networks, you basically do forward-propagation to get the output of your model and
check if this output is correct or incorrect, to get the error. Backpropagation is nothing but
going backwards through your neural network to find the partial derivatives of the error with
respect to the weights, which enables you to subtract this value from the weights.
Those derivatives are then used by gradient descent, an algorithm that can iteratively minimize
a given function. Then it adjusts the weights up or down, depending on which decreases the
error. That is exactly how a neural network learns during the training process.
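The forward pass, gradient computation, and weight update described above can be sketched for a model with a single weight. This is a minimal illustration, assuming a squared-error loss and a hand-picked learning rate; the function name and all numbers are hypothetical.

```python
def train_single_weight(x, y, w=0.0, learning_rate=0.1, steps=50):
    """Fit prediction = w * x to the target y by gradient descent."""
    for _ in range(steps):
        prediction = w * x                 # forward propagation
        # Squared error is (prediction - y) ** 2; by the chain rule its
        # derivative with respect to w is 2 * (prediction - y) * x.
        grad = 2 * (prediction - y) * x
        w -= learning_rate * grad          # move against the gradient
    return w

w = train_single_weight(x=2.0, y=6.0)
print(round(w, 3))  # close to 3.0, because 3.0 * 2.0 == 6.0
```

Each step subtracts a fraction of the derivative from the weight, which is the "adjusts the weights up or down" behavior described above.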
So, with backpropagation you basically try to tweak the weights of your model while training.
The image below illustrates the concept of forward propagation and backpropagation in a feed-
forward neural network:
BPTT is basically just a fancy buzzword for doing backpropagation on an unrolled recurrent
neural network. Unrolling is a visualization and conceptual tool, which helps you understand
what’s going on within the network. Most of the time when implementing a recurrent neural
network in the common programming frameworks, backpropagation is automatically taken
care of, but you need to understand how it works to troubleshoot problems that may arise during
the development process.
You can view an RNN as a sequence of neural networks that you train one after another with
backpropagation.
Types of Recurrent Neural Networks (RNNs)
• One to One
• One to Many
• Many to One
• Many to Many
One-to-one:
This is also called Plain Neural networks. It deals with a fixed size of the input to the fixed
size of output, where they are independent of previous information/output.
Example: Image classification.
One-to-Many:
It deals with a fixed size of information as input that gives a sequence of data as output.
Example: Image Captioning takes the image as input and outputs a sentence of words.
Many-to-One:
It takes a sequence of information as input and outputs a fixed size of the output.
Example: sentiment analysis where any sentence is classified as expressing the positive or
negative sentiment.
Many-to-Many:
It takes a sequence of information as input, processes it recurrently, and outputs a
sequence of data.
Example: Machine Translation, where the RNN reads any sentence in English and then
outputs the sentence in French.
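The difference between the many-to-one and many-to-many patterns can be sketched with a toy cell. The `update` function below is a hypothetical stand-in (a running sum), not a trained network, and the many-to-many version shown here emits one output per input, which is only one of several ways such models are built.

```python
def update(state, x):
    """Toy cell, assumed for illustration: the state is a running sum."""
    return state + x

def many_to_one(inputs, state=0):
    """Consume the whole sequence and return only the final state."""
    for x in inputs:
        state = update(state, x)
    return state

def many_to_many(inputs, state=0):
    """Return one output per input: the state after each step."""
    outputs = []
    for x in inputs:
        state = update(state, x)
        outputs.append(state)
    return outputs

print(many_to_one([1, 2, 3]))   # 6
print(many_to_many([1, 2, 3]))  # [1, 3, 6]
```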
Advantages of Recurrent Neural Network
• RNN can model a sequence of data so that each sample can be assumed to be
dependent on previous ones.
• A recurrent neural network is even used with convolutional layers to extend the active
pixel neighborhood.
Disadvantages of Recurrent Neural Network
• Gradient vanishing and exploding problems.
• Training an RNN is a complicated task.
• It cannot process very long sequences well.
• Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) variants were developed to
address these problems and improve the RNN’s ability to handle long-term dependencies.
What are some variants of recurrent neural network architecture?
The RNN architecture laid the foundation for ML models to have language processing
capabilities. Several variants have emerged that share its memory retention principle and
improve on its original functionality.
• Bidirectional recurrent neural network (BRNN)
• Long short-term memory (LSTM)
• A gated recurrent unit (GRU)
Introduction
The first step in sentence classification is to represent variable-length sentences using neural
networks (RNNs).
Handling variable-length input
• The Skip-gram network structure takes a word vector of a fixed size, runs it through a
linear layer, and obtains a distribution of scores over all the context words. The structure
and the size of the input, output, and the network are all fixed throughout the training.
• However, many, if not most, of what we deal with in NLP are sequences of variable
lengths. For example, words, which are sequences of characters, can be short (“a,” “in”)
or long (“internationalization”). Sentences (sequences of words) and documents
(sequences of sentences) can be of any lengths.
• Neural networks can handle only numbers and arithmetic operations. That was why we
needed to convert words and documents to numbers through embeddings. We used
linear layers to convert a fixed-length vector into another.
• But to do something similar with variable-length inputs, we need to figure out how to
structure the neural networks so that they can handle them.
One idea is to first convert the input (e.g., a sequence of words) to embeddings, that is, a
sequence of vectors of floating-point numbers, then average them.
Let’s assume the input sentence is sentence = ["john", "loves", "mary", "."]
result = (v("john") + v("loves") + v("mary") + v(".")) / 4
This method is quite simple and it’s used in many NLP applications, but it has one critical
issue, which is that it can’t take word order into account. Because the order of input elements
doesn’t affect the result of averaging, you’d get the same vector for both “Mary loves John”
and “John loves Mary.”
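The order-insensitivity of averaging can be checked with a minimal sketch. The two-dimensional embedding values below are made up for illustration and do not come from any trained model.

```python
# Hypothetical 2-dimensional embeddings, chosen only for illustration.
EMBEDDINGS = {
    "john": [1.0, 0.0],
    "loves": [0.0, 2.0],
    "mary": [3.0, 1.0],
    ".": [0.0, 0.0],
}

def average_embedding(words):
    """Average the word vectors element-wise, ignoring word order."""
    vectors = [EMBEDDINGS[w] for w in words]
    dim = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]

a = average_embedding(["john", "loves", "mary", "."])
b = average_embedding(["mary", "loves", "john", "."])
print(a, b, a == b)  # both sentences map to the same vector
```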
Now, if we step back and reflect on how we humans read language, this “averaging” is far from
the reality. We usually scan the sentence from the beginning, one word at a time, while holding
in our short-term memory what the “partial” sentence means up to the part we’re reading.
We maintain some sort of mental representation of the sentence as we read it.
Can we design a neural network structure that simulates this incremental reading of the input?
The answer is a resounding yes. That structure is called Recurrent Neural Networks (RNNs),
which I’ll explain in detail below.
RNN abstraction
If you break down the reading process mentioned above, its core is the repetition of the
following series of operations:
• Read a word
• Based on what has been read this far (your “mental state”), figure out what the word
means
• Update the mental state
• Move on to the next word
Let’s see how this works using a concrete example. Suppose the input sentence is sentence = ["john",
"loves", "mary", "."] and each word is already represented as a word embedding vector.
Also, let’s denote your “mental state” as state, which is initialized by init_state(). Then, the
reading process is represented by the following incremental operations:
state = init_state()
state = update(state, v("john"))
state = update(state, v("loves"))
state = update(state, v("mary"))
state = update(state, v("."))
The final value of state becomes the representation of the entire sentence from this process.
Notice that if you change the order in which these words are processed (for example, by flipping
“John” and “Mary”), the final value of state also changes, meaning that the state also
encodes some information about the word order.
In pseudo-Python, it’d look like this:
def rnn(words):
    state = init_state()
    for word in words:
        state = update(state, word)
    return state
Notice that there’s state that gets initialized first and passed around during the iteration. For
every input word, state is updated based on the previous state and the input using the function
update. The network substructure corresponding to this step (the code block inside the loop) is
called a cell. This stops when the input is exhausted, and the final value of state becomes the
result of this RNN. See figure 2 for the illustration.
Figure 2: RNN abstraction
Now you see the parallelism here. When you’re reading a sentence (sequence of words), your
internal mental representation of the sentence, state, gets updated after reading each word. You
can assume that the final state encodes the representation of the entire sentence.
The only remaining work is to design two functions — init_state() and update(). The state is
usually initialized with zero (a vector filled with zeros), and you usually don’t have to worry
about how to go about defining the former. The more important issue is how you design
update(), which determines the characteristics of the RNN.
Simple RNN and Nonlinearity
Here, we’re going to implement update(), which is a function that takes two input variables and
produces one output variable. Can we use a neural network for this? After all, a cell is a neural
network with its own input and output. The answer is yes, and it’d look like this:
def update_simple(state, word):
    return f(w1 * state + w2 * word + b)
An RNN defined by this type of the update function is called a simple RNN.
The function f(), called an activation function or nonlinearity, takes a single input (or a vector) and
transforms it (or every element of a vector) in a non-linear fashion.
Let’s imagine we are building an RNN that recognizes “grammatical” English sentences.
Distinguishing grammatical sentences from ungrammatical ones is a difficult NLP problem
and a well-established research area, but let’s simplify it and only consider agreement between
the subject and the verb.
Let’s further simplify it and assume that there are only four words in this “language” —
“I”, “you”, “am”, and “are.”
• If the sentence is either “I am” or “you are,” it’s grammatical.
• Other two combinations, “I are” and “you am,” are incorrect.
What you want to build is an RNN that outputs 1 for the correct sentences and 0
for the incorrect ones.
The first step in almost every modern NLP model is to represent words with embeddings.
Embeddings are usually learned from a large dataset of natural language text, but we’re going
to give them some pre-defined values, as shown in figure 3.
Figure 3: Recognizing grammatical English sentences using an RNN
Without activation function:
The update_simple() function above simplifies to:
def update_simple_linear(state, word):
    return w1 * state + w2 * word + b
We assume the initial value of state is [0, 0], because the specific initial values aren’t relevant
to the discussion here.
• The RNN takes the first word embedding, x1, updates state, takes the second word
embedding, x2, then produces the final state, which is a two-dimensional vector.
• Finally, the two elements in this vector are summed up and converted to result.
• If result is close to 1, the sentence is grammatical. Otherwise, it’s not.
Unrolling the two updates from a zero initial state, the final state is:
w1 * w2 * x1 + w2 * x2 + w1 * b + b
Remember, w1, w2, and b are parameters of the model (aka “magic constants”) that need to
be trained (adjusted). Here, instead of adjusting these parameters using a training dataset, let’s
assign some arbitrary values and see what happens. For example, when w1 = [1, 0], w2 = [0,
1], and b = [0, 0], the input and the output of this RNN is shown in figure 4.
Figure 4: Input and output when w1 = [1, 0], w2 = [0, 1], and b = [0, 0] without an activation
function
If you look at the values of result, this RNN groups ungrammatical sentences (for example, “I
are”) with grammatical ones (for example, “you are”), which isn’t the desired behavior. How
about we try another set of values for the parameters? Let’s use w1 = [1, 0], w2 = [-1, 0], and
b = [0, 0] and see what happens (figure 5).
Figure 5: Input and output when w1 = [1, 0], w2 = [-1, 0], and b = [0, 0] without an activation
function
This is much better, because the RNN is successful in grouping ungrammatical sentences by
assigning 0 to both “I are” and “you am.” It also assigns completely opposite values (2 and -2)
to grammatical sentences (“I am” and “you are”).
However, this is still not the desired behavior, because both grammatical sentences should
receive a result close to 1. In fact, you can’t use this neural network to separate grammatical
sentences from ungrammatical ones no matter how hard you try. No matter what values you
assign to the parameters, this RNN can’t produce results that are close enough to the desired
values and that group sentences by their grammaticality.
With activation function:
Now, let’s put the activation function f() back and see what happens. The specific activation
function we’ll use is called the hyperbolic tangent function, or more commonly, tanh, which is
one of the most commonly used activation functions in neural networks. tanh doesn’t do much
to the input when it’s close to zero, for example, 0.3 or -0.2. The input passes through the
function almost unchanged. When the input is far from zero, tanh squeezes it between
-1 and 1.
When w1 = [-1, 2], w2 = [-1, 2], b = [0, 1], and the tanh activation function is used, the result
of the RNN becomes a lot closer to what we desire (see figure 6). If you round them to the
closest integers, the RNN successfully groups sentences by their grammaticality.
Figure 6: Input and output when w1 = [-1, 2], w2 = [-1, 2], and b = [0, 1] with an activation
function
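The two experiments above can be reproduced with a short sketch. Figure 3 is not reproduced in these notes, so the embedding values below are an assumption: an illustrative choice that is consistent with the results described in the text. The helper names `run_rnn` and `identity` are also hypothetical.

```python
import math

# Assumed embeddings (illustrative; the original figure 3 values may differ).
EMBEDDINGS = {"I": [-1, 1], "you": [1, -1], "am": [-1, -1], "are": [1, 1]}

def identity(x):
    """Stand-in for 'no activation function'."""
    return x

def run_rnn(sentence, w1, w2, b, f):
    """Apply state = f(w1 * state + w2 * word + b) element-wise per word."""
    state = [0.0, 0.0]
    for word in sentence:
        x = EMBEDDINGS[word]
        state = [f(w1[i] * state[i] + w2[i] * x[i] + b[i]) for i in range(2)]
    return sum(state)  # the two elements are summed into result

SENTENCES = [["I", "am"], ["you", "are"], ["I", "are"], ["you", "am"]]

linear = [run_rnn(s, [1, 0], [-1, 0], [0, 0], identity) for s in SENTENCES]
with_tanh = [run_rnn(s, [-1, 2], [-1, 2], [0, 1], math.tanh) for s in SENTENCES]

print(linear)                         # [2.0, -2.0, 0.0, 0.0]
print([round(r) for r in with_tanh])  # [1, 1, 0, 0]
```

Without the activation function the two grammatical sentences land on opposite values, while with tanh both round to 1 and both ungrammatical sentences round to 0.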
RNNs are trained like any other neural network. The final outcome is compared with the
desired outcome using the loss function, then the difference between the two, the loss, is used
for updating the “magic constants.” The magic constants are, in this case, w1, w2, and b in the
update_simple() function. Note that the update function and its magic constants are identical
across all the timesteps in the loop. This means that what RNNs are learning is a general form
of updates that can be applied to any situation.
The simple RNNs are rarely used in real-world NLP applications due to one problem called the
vanishing gradients problem.
Vanishing gradients problem
Just as in any programming language, if you know the length of the input, you can rewrite a
loop without using one. An RNN can also be rewritten without using a loop, which makes it
look just like a regular neural network with many layers. For example, if you know that there
are only six words in the input, the rnn() from earlier can be rewritten as follows:
def rnn(sentence):
    word1, word2, word3, word4, word5, word6 = sentence
    state = init_state()
    state = update(state, word1)
    state = update(state, word2)
    state = update(state, word3)
    state = update(state, word4)
    state = update(state, word5)
    state = update(state, word6)
    return state
Representing RNNs without loops is called unrolling.
Now we know what update() looks like for a simple RNN (update_simple), so we can replace
the function calls with their bodies, as shown here:
def rnn_simple(sentence):
    word1, word2, word3, word4, word5, word6 = sentence
    state = init_state()
    state = f(w1 * f(w1 * f(w1 * f(w1 * f(w1 * f(w1 * state + w2 * word1 + b)
        + w2 * word2 + b) + w2 * word3 + b) + w2 * word4 + b)
        + w2 * word5 + b) + w2 * word6 + b)
    return state
Let’s say the input is sentence = ["The", "books", "I", "read", "yesterday", "were"].
In this case, the innermost function call processes the first word “The,” the next one processes
the second word “books,” and so on, all the way to the outermost function call, which
processes “were.”
To see where the difficulty comes from, let’s rewrite the previous pseudocode slightly, giving each
nested call a name that reflects the word it processes, as shown here:
def is_grammatical(sentence):
    word1, word2, word3, word4, word5, word6 = sentence
    state = init_state()
    state = process_main_verb(w1 *
        process_adverb(w1 *
            process_relative_clause_verb(w1 *
                process_relative_clause_subject(w1 *
                    process_main_subject(w1 *
                        process_article(w1 * state + w2 * word1 + b) +
                        w2 * word2 + b) +
                    w2 * word3 + b) +
                w2 * word4 + b) +
            w2 * word5 + b) +
        w2 * word6 + b)
    return state
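To decide whether the verb “were” is correct, the outermost function needs information about the
main subject “books,” which was processed several function calls deeper. When the final output is
wrong, the feedback that travels back from the outermost function gets weaker at every nested call
it passes through, so the innermost functions barely learn. The deep learning literature calls this the
vanishing gradients problem. A gradient is a mathematical term that corresponds to the message
signal that each function receives from the next one, stating how exactly it should improve its
process (how to change its magic constants).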
Because of the vanishing gradients problem, simple RNNs are difficult to train and rarely used
in practice nowadays.
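The shrinking signal can be made concrete with a scalar tanh RNN. This is a minimal sketch under assumed values: the weights w1 = 0.5, w2 = 1.0, b = 0.0 and the constant inputs are arbitrary, and the function name is hypothetical. By the chain rule, each timestep multiplies the gradient by w1 * (1 - state ** 2), the derivative of tanh(w1 * prev + w2 * x + b) with respect to prev.

```python
import math

def gradient_wrt_initial_state(inputs, w1=0.5, w2=1.0, b=0.0):
    """Return d(final state) / d(initial state) for a scalar tanh RNN."""
    state, grad = 0.0, 1.0
    for x in inputs:
        state = math.tanh(w1 * state + w2 * x + b)
        # One chain-rule factor per timestep; its magnitude is at most |w1|.
        grad *= w1 * (1.0 - state ** 2)
    return grad

for length in (1, 5, 10, 20):
    print(length, gradient_wrt_initial_state([1.0] * length))
```

With these values every factor is smaller than 0.5 in magnitude, so the printed gradient shrinks roughly geometrically as the sequence gets longer.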
3.3 Long short-term memory (LSTM)
The way the nested functions process information about grammar seems too inefficient. The
outermost function cannot tell which function was responsible for which part of the message
from only the final output.
So, instead of passing the information through an activation function every time and changing
its shape completely, how about adding and subtracting information relevant to the part of
the sentence being processed at each step?
Long short-term memory units (LSTMs) are a type of RNN cell that was proposed based on this
insight. Instead of passing around states, LSTM cells share a “memory” that each cell can
remove old information from and/or add new information to.
Specifically, LSTM RNNs use the following function for updating states:
def update_lstm(state, word):
    cell_state, hidden_state = state
    cell_state *= forget(hidden_state, word)
    cell_state += add(hidden_state, word)
    hidden_state = update_hidden(hidden_state, cell_state, word)
    return (cell_state, hidden_state)
Structure of LSTM
An LSTM (Long Short-Term Memory) network is a type of recurrent neural network (RNN) that
is capable of handling and processing sequential data. The structure of an LSTM network
consists of a series of LSTM cells, each of which has a set of gates (input, output, and forget
gates) that control the flow of information into and out of the cell. The gates are used to
selectively forget or retain information from the previous time steps, allowing the LSTM to
maintain long-term dependencies in the input data.
A common LSTM unit is composed of a cell, an input gate, an output gate and a forget gate.
The cell remembers values over arbitrary time intervals and the three gates regulate the flow
of information into and out of the cell.
• Forget gates decide what information to discard from the previous state by assigning it,
based on the previous state and the current input, a value between 0 and 1. A (rounded) value
of 1 means to keep the information, and a value of 0 means to discard it.
• Input gates decide which pieces of new information to store in the current state, using
the same system as forget gates.
• Output gates control which pieces of information in the current state to output by
assigning a value from 0 to 1 to the information, considering the previous and current
states. Selectively outputting relevant information from the current state allows the
LSTM network to maintain useful, long-term dependencies to make predictions, both
in current and future time-steps.
How do LSTMs Work?
The LSTM architecture is similar to RNN, but instead of the feedback loop it has an LSTM
cell. The sequence of LSTM cells in each layer is fed with the output of the last cell. This
enables the cell to get the previous inputs and sequence information. A cyclic set of steps
happens in each LSTM cell:
• The Forget gate is computed.
• The Input gate value is computed.
• The Cell state is updated using the above two outputs.
• The output (hidden state) is computed using the output gate.
• This series of steps occurs in every LSTM cell. The intuition behind LSTM is that the
Cell and Hidden states carry the previous information and pass it on to future time steps.
• The Cell state aggregates information from all the past data and is the long-term
information retainer. The Hidden state carries the output of the last cell, i.e. short-term
memory. This combination of long-term and short-term memory enables LSTMs to
perform well on time series and sequence data.
The LSTM states comprise two halves—the cell state (the “memory” part) and the hidden
state (the “mental representation” part).
• The function forget() returns a value between 0 and 1, so multiplying by this number
means erasing old memory from cell_state. How much to erase is determined from
hidden_state and word (input). Controlling the flow of information by multiplying by a
value between 0 and 1 is called gating. LSTMs are the first RNN architecture that uses
this gating mechanism.
• The function add() returns a new value added to the memory. The value again is
determined from hidden_state and word.
• Finally, hidden_state is updated using a function, whose value is computed from the
previous hidden state, the updated memory, and the input word.
Because LSTMs have this cell state that stays constant across different timesteps unless
explicitly modified, they are easier to train and relatively well behaved. Because you have a
shared “memory” and functions are adding and removing information related to different
parts of the input sentence, it is easier to pinpoint which function did what and what went
wrong. The error signal from the outermost function can reach responsible functions more
directly.
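The update_lstm() pseudocode can be turned into a runnable scalar sketch. It assumes the commonly used LSTM formulation (sigmoid gates, a tanh candidate, and hidden = output gate * tanh(cell)) as one way to fill in forget(), add(), and update_hidden(). The `PARAMS` table, the `gate` helper, and all numeric values are hypothetical; a trained model would learn its parameters and would use vectors rather than scalars.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical (weight on hidden state, weight on input, bias) per gate.
PARAMS = {
    "forget": (0.5, 0.5, 1.0),
    "input": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
    "output": (0.5, 0.5, 0.0),
}

def gate(name, hidden_state, word, activation):
    w_h, w_x, bias = PARAMS[name]
    return activation(w_h * hidden_state + w_x * word + bias)

def update_lstm(state, word):
    cell_state, hidden_state = state
    # forget(): a value in (0, 1) that scales down the old memory
    cell_state *= gate("forget", hidden_state, word, sigmoid)
    # add(): new candidate content, scaled by the input gate
    cell_state += (gate("input", hidden_state, word, sigmoid)
                   * gate("candidate", hidden_state, word, math.tanh))
    # update_hidden(): expose a gated view of the updated memory
    hidden_state = (gate("output", hidden_state, word, sigmoid)
                    * math.tanh(cell_state))
    return (cell_state, hidden_state)

state = (0.0, 0.0)
for word in [1.0, -0.5, 0.25]:  # scalar stand-ins for word embeddings
    state = update_lstm(state, word)
print(state)
```

Note that the cell state is only scaled and added to, never passed through an activation function on its way to the next timestep, which is the property the paragraph above credits for easier training.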
Advantages:
1. Long-term dependency handling: LSTMs are capable of learning long-term
dependencies in sequential data, making them suitable for tasks where context over a
long sequence is important.
2. Gradient flow: LSTMs use a gating mechanism to control the flow of information,
allowing gradients to flow more easily during training and mitigating the vanishing
gradient problem.
Disadvantages:
1. Complexity: LSTMs are more complex than traditional RNNs, which can make them
harder to train and tune for optimal performance.
2. Computational cost: LSTMs can be more computationally expensive than simpler
models, especially for large datasets or complex architectures.
3.4 What is a Gated Recurrent Unit (GRU)?
• GRU stands for Gated Recurrent Unit, which is a type of recurrent neural network
(RNN) architecture that is similar to LSTM (Long Short-Term Memory).
• Like LSTM, GRU is designed to model sequential data by allowing information to be
selectively remembered or forgotten over time. However, GRU has a simpler
architecture than LSTM, with fewer parameters, which can make it easier to train and
more computationally efficient.
The main difference between GRU and LSTM is the way they handle the memory cell state.
• In LSTM, the memory cell state is maintained separately from the hidden state and is
updated using three gates: the input gate, output gate, and forget gate.
• In GRU, the memory cell state is replaced with a “candidate activation vector,” which
is updated using two gates: the reset gate and update gate.
• The reset gate determines how much of the previous hidden state to forget, while the
update gate determines how much of the candidate activation vector to incorporate into
the new hidden state.
• Overall, GRU is a popular alternative to LSTM for modeling sequential data, especially
in cases where computational resources are limited or where a simpler architecture is
desired.
GRU architecture
• The architecture of the gated recurrent unit (GRU) is designed with two specific gates
- the update gate and the reset gate.
• Each gate serves a unique purpose, significantly contributing to the GRU's high
efficiency. The reset gate identifies short-term relationships, while the update gate
recognizes long-term connections.
The various components of the architecture are:
• Update gate (Z): Determines the degree of past information forwarded to the future.
• Reset gate (R): Decides the amount of past information to discard.
• Candidate hidden state (H'): Creates new representations, considering both the input
and the past hidden state.
• Final hidden state (H): A blend of the new and old memories governed by the update
gate.
The philosophy behind GRUs is similar to that of LSTMs, but GRUs use only one set of states
instead of two halves. The update function for GRUs is shown next:
def update_gru(state, word):
    new_state = update_hidden(state, word)
    switch = get_switch(state, word)
    state = switch * new_state + (1 - switch) * state
    return state
• Instead of erasing or updating the memory, GRUs use a switching mechanism. The cell
first computes the new state from the old state and the input.
• It then computes switch, a value between 0 and 1.
• The state is chosen between the new state and the old one based on the value of switch.
If it’s 0, the old state passes through intact. If it’s 1, it’s overwritten by the new state.
• If it’s somewhere in between, the state will be a mix of two. See figure below for an
illustration of the GRU update function.
• Notice that the update function for GRUs is much simpler than that for LSTMs.
Indeed, it has fewer parameters (magic constants) that need to be trained compared to
LSTMs. Because of this, GRUs are faster to train than LSTMs.
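The switching mechanism can be sketched with scalars. The version below also includes the reset gate (R) from the component list above and follows the convention of update_gru(), where a switch of 1 selects the new state; some formulations swap the roles of the two terms. The parameter triples, the `blend` and `affine` helpers, and the input values are all hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def blend(switch, new_state, old_state):
    """switch = 0 keeps the old state; switch = 1 overwrites it."""
    return switch * new_state + (1.0 - switch) * old_state

def update_gru(state, word, p):
    """One scalar GRU step; p maps gate names to (w_state, w_word, bias)."""
    def affine(name, s):
        w_s, w_x, bias = p[name]
        return w_s * s + w_x * word + bias

    reset = sigmoid(affine("reset", state))                     # R
    new_state = math.tanh(affine("candidate", reset * state))   # H'
    switch = sigmoid(affine("update", state))                   # Z
    return blend(switch, new_state, state)                      # H

PARAMS = {
    "reset": (0.5, 0.5, 0.0),
    "candidate": (1.0, 1.0, 0.0),
    "update": (0.5, 0.5, 0.0),
}
state = 0.0
for word in [1.0, -0.5, 0.25]:  # scalar stand-ins for word embeddings
    state = update_gru(state, word, PARAMS)
print(state)
```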
3.5 Accuracy, Precision, Recall and F-measure
Accuracy
• Accuracy is probably the simplest of all the evaluation metrics.
• In a classification setting, accuracy is the fraction of instances that your model got right.
• For example, if there are 10 emails and your spam-filtering model got 8 of them correct,
the accuracy of your prediction is 0.8, or 80%.
• Though simple, accuracy is not without its limitations.
• Specifically, accuracy can be misleading when the test set is imbalanced.
• An imbalanced dataset contains multiple class labels that greatly differ in their
numbers.
• For example, if a spam-filtering dataset is imbalanced, it may contain 90% nonspam
emails and 10% spams.
• In such a case, even a naive classifier that labels everything as nonspam would be able
to achieve an accuracy of 90%.
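The imbalanced-dataset pitfall can be reproduced in a few lines. The test set below is made up to match the 90%/10% split in the example, and the function name is an illustrative choice.

```python
def accuracy(predictions, labels):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

# A made-up imbalanced test set: 90 nonspam emails and 10 spam emails.
labels = ["nonspam"] * 90 + ["spam"] * 10
always_nonspam = ["nonspam"] * len(labels)

print(accuracy(always_nonspam, labels))  # 0.9
```

The classifier never detects a single spam email, yet its accuracy is 90%.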
Precision and recall
• The rest of the metrics—precision, recall, and F-measure—are used in a binary
classification setting.
• The goal of a binary classification task is to identify one class (called a positive class)
from the other (called a negative class).
• In the spam-filtering setting, the positive class is spam, whereas the negative class is
nonspam
The Venn diagram in the figure above contains four subregions: true positives, false
positives, false negatives, and true negatives.
• True positives (TP) are instances that are predicted as positive (= spam) and are
indeed in the positive class.
• False positives (FP) are instances that are predicted as positive (= spam) but are
actually not in the positive class. These are noises in the prediction, that is, innocent
nonspam emails that are mistakenly caught by the spam filter and end up in the spam
folder of your email client.
• On the other hand, false negatives (FN) are instances that are predicted as negative
but are actually in the positive class. These are spam emails that slip through the spam
filter and end up in your inbox. Finally, true negatives (TN) are instances that are
predicted as negative and are indeed in the negative class (nonspam emails in your
inbox).
• Precision is the fraction of instances that the model classifies as positive that are indeed
positive. For example, if your spam filter identifies three emails as spam, and two of
them are indeed spam, the precision will be 2/3, or about 67%.
• Recall is somewhat the opposite of precision. It is the fraction of positive instances in your
dataset that are identified as positive by your model. Again, using spam filtering as an
example, if your dataset contains three spam emails and your model identifies two of
them as spam successfully, the recall will be 2/3, or about 67%.
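Precision and recall follow directly from the TP, FP, and FN counts defined above. The sketch below uses a made-up five-email example, and returning 0.0 when a denominator is zero is an assumed convention, not something the definitions themselves specify.

```python
def precision_recall(predictions, labels, positive="spam"):
    """Compute (precision, recall) for the given positive class."""
    pairs = list(zip(predictions, labels))
    tp = sum(p == positive and y == positive for p, y in pairs)
    fp = sum(p == positive and y != positive for p, y in pairs)
    fn = sum(p != positive and y == positive for p, y in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0  # assumed convention
    recall = tp / (tp + fn) if tp + fn else 0.0     # assumed convention
    return precision, recall

# Three emails are flagged as spam (two correctly); three are truly spam.
labels      = ["spam", "spam", "spam",    "nonspam", "nonspam"]
predictions = ["spam", "spam", "nonspam", "spam",    "nonspam"]

print(precision_recall(predictions, labels))  # both are 2/3
```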
F-measure
• Notice that there is a tradeoff between precision and recall.
• Imagine there’s a spam filter that is very, very careful in classifying emails. It outputs
only one out of several thousand emails as spam, but when it does, it is always correct.
This is not a difficult task, because some spam emails are pretty obvious. Such a filter has
a precision of 100%, but its recall is very low, because it misses most of the spam.
• Improving precision or recall alone while ignoring the other is not a good practice,
because of the tradeoff between them.
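The F-measure combines the two metrics into one number. Its most common form, the F1 score, is the harmonic mean of precision and recall, which stays low unless both values are high. The sketch below applies it to an overly careful filter with made-up numbers; returning 0.0 when both inputs are zero is an assumed convention.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Overly careful filter: always right when it flags spam, but it catches
# only 1 of every 100 spam emails (made-up numbers).
print(f1_score(1.0, 0.01))

# A filter with precision and recall both at 2/3.
print(f1_score(2 / 3, 2 / 3))
```

The careful filter scores far below the balanced one, even though its precision is perfect.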