Deep Net Performance (Metrics)
Data scientists use a variety of metrics to objectively determine the performance of a model.
This file provides an overview of some of the most common metrics, such as error, precision, and
recall, that are used to measure the performance of Deep Nets.
Measurements of Deep Net Performance
Error
Error is a straightforward measurement that captures the proportion of data points classified incorrectly,
typically on the test data set. The calculation is simply the number of incorrect classifications made
by the net divided by the total number of classifications:
Error = the number of incorrect classifications made by the net / the total number of classifications
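As a minimal sketch (the function and variable names are illustrative, not part of the original text), the error rate over a test set can be computed as follows:

def error_rate(predictions, labels):
    # Fraction of test points the net classifies incorrectly.
    incorrect = sum(1 for p, y in zip(predictions, labels) if p != y)
    return incorrect / len(labels)

# Example: 2 mistakes out of 5 test points gives an error of 0.4
print(error_rate([1, 0, 1, 1, 0], [1, 1, 1, 0, 0]))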
Disadvantage
Unfortunately, error has a significant drawback as a performance measurement, especially when the
data points are skewed toward one class over another.
The problem arises when we measure error globally across the set of all classes, rather than taking a
granular view of the model's performance at the class level. Consider a two-class classification
problem where data points are considered positive or negative. If the model makes a positive
classification, there are two possibilities: the model can be correct, in which case we have a true positive,
or the model can be incorrect and the point is actually negative, in which case we have a false positive.
True negatives and false negatives are defined similarly when the model makes a negative prediction. If
the number of occurrences of these four values is plotted in a 2-by-2 grid, we get a valuable tool known
as the confusion matrix.
Confusion Matrix
Each square in the matrix is a placeholder for one of the four counts (TP, FP, FN, TN), so a filled-in
confusion matrix simply records how many times each outcome occurred.
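A minimal sketch of how the four counts can be tallied for a two-class problem (labels are assumed to be 1 for positive and 0 for negative; the names are illustrative):

def confusion_matrix(predictions, labels):
    # Count TP, FP, FN, TN for a binary classifier (1 = positive, 0 = negative).
    tp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(predictions, labels) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 1)
    tn = sum(1 for p, y in zip(predictions, labels) if p == 0 and y == 0)
    return {"TP": tp, "FP": fp, "FN": fn, "TN": tn}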
Using the confusion matrix, we can devise new measurements that overcome some of the issues of the error
metric. We can do this by asking two related questions about the model's performance.
Recall
We first look at the positives in the data set and ask: what percentage of these positives is the model
able to identify correctly? This metric is known as Recall.
Recall = TP / the total number of positives in the data set
= TP / (TP + FN)
We can also look at the number of times the model predicted positive and ask: what percentage of
these predictions were actually positive? This metric is known as Precision.
Precision
Expressed as
Precision = TP / the number of data points classified as positive
= TP / (TP + FP)
These two questions sound similar, but put another way, we want to know how many of the positive
data points the model is able to recall, and how precise its positive predictions are. Also keep in mind
that precision and recall can be defined for the negative data points as well.
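A minimal sketch of both metrics computed from the confusion counts (this assumes the illustrative confusion_matrix helper sketched above; division-by-zero guards are omitted for brevity):

def precision_recall(counts):
    # Precision and recall for the positive class from the TP/FP/FN counts.
    precision = counts["TP"] / (counts["TP"] + counts["FP"])
    recall = counts["TP"] / (counts["TP"] + counts["FN"])
    return precision, recall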
A very important metric is the Accuracy:
Accuracy = (TP + TN) / Total Population = (TP + TN) / (TP + FP + TN + FN)
We can use hybrid measures in order to achieve a balance between precision and recall. One of these
measures, the F1 score, is calculated as the harmonic mean of precision and recall, which is more
informative here than the standard arithmetic mean: the harmonic mean is pulled toward the smaller of
the two values, which suits quantities like precision and recall that lie between 0 and 1.
F1 score = (2 x Recall x Precision) / (Recall + Precision)
We can also use the Specificity or selectivity metric which is:
Specificity or True Negative Rate (TNR) = TN / (FP + TN)
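Continuing the same illustrative sketch, accuracy, F1 score, and specificity follow directly from the four counts:

def accuracy(c):
    # Fraction of all classifications that were correct.
    return (c["TP"] + c["TN"]) / (c["TP"] + c["FP"] + c["TN"] + c["FN"])

def f1_score(precision, recall):
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)

def specificity(c):
    # True negative rate: fraction of actual negatives identified correctly.
    return c["TN"] / (c["FP"] + c["TN"])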
The following map summarizes the deep net's performance metrics.
We can also examine precision and recall graphically by plotting these values and examining the area
under the curve. The model that maximizes this area will typically be the best performing.
Multi Class Classification Metrics
We can extend these concepts to classification problems with more than two classes.
Example: a multi-class confusion matrix
The definitions of precision and recall remain the same as in the two-class case. The
difference is that data points can now be misclassified in multiple ways, so false positives and false
negatives must be summed over all possible misclassification pairs.
For multiclass classification:
Precision is the ratio of the number of true positives over all points classified as positive.
Recall is the ratio of the number of true positives over the total number of positives.
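A minimal one-vs-rest sketch for the multiclass case (for a chosen class c, every other class counts as negative; the names are illustrative):

def per_class_precision_recall(predictions, labels, c):
    # One-vs-rest precision and recall for class c in a multiclass problem.
    tp = sum(1 for p, y in zip(predictions, labels) if p == c and y == c)
    fp = sum(1 for p, y in zip(predictions, labels) if p == c and y != c)  # summed over all other classes
    fn = sum(1 for p, y in zip(predictions, labels) if p != c and y == c)  # summed over all other classes
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall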
Deep Net Performance Hardware Tools
Training a large-scale deep net is a computationally expensive process, and common CPUs are generally
insufficient for the task. GPUs are a great tool for speeding up training, but there are several other
options available.
A CPU is a versatile tool that can be used across many domains of computation. However, the cost of
this versatility is the dependence on sophisticated control mechanisms needed to manage the flow of
tasks. CPUs also perform tasks serially, requiring the use of a limited number of cores in order to build
in parallelism. Even though CPU speeds and memory limits have increased over the years, a CPU is still
an impractical choice for training large deep nets.
Vector implementations can be used to speed up the deep net training process. Generally, parallelism
comes in the form of both parallel processing and parallel programming. Parallel processing can either
involve shared resources on a single computer, or distributed computing across a cluster of nodes.
The GPU is a common tool for parallel processing. As opposed to a CPU, GPUs tend to hold large
numbers of cores – anywhere from 100s to even 1000s. Each of these cores is capable of general
purpose computing, and the core structure allows for large amounts of parallelism. As a result, GPUs
are a popular choice for training large deep nets. The Deep Learning community provides GPU support
through various libraries, implementations, and a vibrant ecosystem fostered by nVidia. The main
downside of a GPU is the amount of power required to run one relative to the alternatives.
The “Field Programmable Gate Array”, or FPGA, is another choice for training a deep net. FPGAs were
originally used by electrical engineers to design mock-ups for different computer chips without having
to custom build a chip for each solution. With an FPGA, chip function can be programmed at the lowest
level – the logic gate. With this flexibility, an FPGA can be tailored for deep nets so as to require less
power than a GPU. Aside from speeding up the training process, FPGAs can also be used to run the
resultant models. For example, FPGAs would be useful for running a complex convolutional net over
thousands of images every second. The downside of an FPGA is the specialized knowledge required
during design, setup, and configuration.
Another option is the “Application Specific Integrated Circuit”, or ASIC. ASICs are highly specialized,
with designs built in at the hardware and integrated circuit level. Once built, they will perform very well
at the task they were designed for, but are generally unusable in any other task. Compared to GPUs
and FPGAs, ASICs tend to have the lowest power consumption requirements. There are several Deep
Learning ASICs such as the Google Tensor Processing Unit (TPU), and the chip being built by Nervana
Systems.
There are a few parallelism options available with distributed computing such as data parallelism, model
parallelism, and pipeline parallelism. With data parallelism, different subsets of the data are trained on
different nodes in parallel for each training pass, followed by parameter averaging and replacement
across the cluster. Libraries like TensorFlow support model parallelism, where different portions of the
model are trained on different devices in parallel. With pipeline parallelism, workers are dedicated to
tasks, like in an assembly line. The main idea is to ensure that each worker is relatively well-utilized. A
worker starts the next job as soon as the current one is complete, a strategy that minimizes the total
amount of wasted time.
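As a minimal illustration of the data-parallel idea described above (the function and parameter names are hypothetical, not a specific library's API), each worker trains a copy of the model on its own shard of the data and the resulting parameters are then averaged:

import copy

def average_parameters(worker_params):
    # Average a list of parameter dictionaries produced by different workers.
    return {name: sum(p[name] for p in worker_params) / len(worker_params)
            for name in worker_params[0]}

def data_parallel_pass(model_params, shards, local_train_step):
    # One data-parallel training pass: each shard trains a replica, then the averaged
    # parameters replace the parameters across the cluster.
    updated = []
    for shard in shards:
        replica = copy.deepcopy(model_params)             # replicate the current model on a worker
        updated.append(local_train_step(replica, shard))  # train on that worker's subset of the data
    return average_parameters(updated)

In a real cluster the loop over shards would run on different nodes in parallel; it is written sequentially here only to keep the sketch self-contained.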
Parallel programming research has been active for decades, and many advanced techniques have been
developed. Generally, algorithms should be designed with parallelism in mind in order to take full
advantage of the hardware. One such way to do this is to decompose the data model into independent
chunks that each perform one instance of a task. Another option is to group all the tasks by their
dependencies, so that each group is completely independent of the others. As an addition, you can
implement threads or processes that handle different task groups. These threads can be used as a
standalone solution, but will provide significant speed improvements when combined with the grouping
method. To learn more about this topic, see the Open HPI Massive Open Online Course (MOOC) on
parallel programming linked below.
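As a small sketch of the chunk-decomposition idea using Python's standard library (the chunking scheme and the work function are illustrative assumptions, not part of the original material):

from concurrent.futures import ProcessPoolExecutor

def work(chunk):
    # An independent task applied to one chunk of the data (placeholder computation).
    return sum(x * x for x in chunk)

def split_into_chunks(data, n_chunks):
    # Decompose the data into roughly equal, independent chunks.
    size = max(1, len(data) // n_chunks)
    return [data[i:i + size] for i in range(0, len(data), size)]

if __name__ == "__main__":
    data = list(range(1_000_000))
    chunks = split_into_chunks(data, n_chunks=8)
    # Each chunk is handled by a separate process; results are combined at the end.
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(work, chunks))
    print(sum(results))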
For more information URLs:
https://www.facebook.com/DeepLearningTV/
https://twitter.com/deeplearningtv
https://en.wikipedia.org/wiki/Precisi...
https://open.hpi.de/courses/parprog2014.
Receiver Operating Characteristics (ROC) Curve
Area under Curve (AUC)
What does the AUC of a ROC curve mean?
In a ROC curve, the true positive rate (sensitivity) is plotted as a function of the false positive rate (1 - specificity)
for different cut-off points of a parameter. The area under the ROC curve (AUC) is a measure of how well a
parameter can distinguish between two diagnostic groups (e.g., diseased vs. normal).
Until now:
• A strategy table/curve still requires making an assumption (a particular threshold).
• What is the "overall" best model?
The Receiver Operating Characteristics (ROC) Curve (plot)
AUC: Area Under the Curve
Example plot: AUC of ROC curve A = 0.75; AUC of ROC curve B = 0.78.
Classification: ROC Curve and AUC
ROC curve
An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a
classification model at all classification thresholds. This curve plots two parameters:
True Positive Rate
False Positive Rate
True Positive Rate (TPR) is a synonym for recall and is therefore defined as follows:
TPR = TP / (TP + FN)
False Positive Rate (FPR) is defined as follows:
FPR = FP / (FP + TN)
An ROC curve plots TPR vs. FPR at different classification thresholds. Lowering the classification threshold
classifies more items as positive, thus increasing both False Positives and True Positives. The following figure
shows a typical ROC curve.
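As a rough illustration of this plotting procedure (the scores, labels, and thresholds below are hypothetical example values), the following sketch computes (FPR, TPR) points by sweeping a list of thresholds over predicted scores:

def roc_points(y_true, y_score, thresholds):
    # Compute one (FPR, TPR) pair per classification threshold.
    points = []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, y_score) if s < t and y == 1)
        tn = sum(1 for y, s in zip(y_true, y_score) if s < t and y == 0)
        tpr = tp / (tp + fn) if (tp + fn) else 0.0
        fpr = fp / (fp + tn) if (fp + tn) else 0.0
        points.append((fpr, tpr))
    return points

labels = [1, 0, 1, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.1]
print(roc_points(labels, scores, thresholds=[0.0, 0.25, 0.5, 0.75, 1.0]))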
To compute the points in an ROC curve, we could evaluate a logistic regression model many times with
different classification thresholds, but this would be inefficient. Fortunately, there's an efficient, sorting-
based algorithm that can provide this information for us, called AUC.
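As a sketch of the idea behind such a sorting-based computation (this uses the rank-based, Mann-Whitney formulation, offered here as an assumption rather than the exact algorithm the text refers to), AUC can be obtained from the ranks of the scores:

def roc_auc(y_true, y_score):
    # AUC as the probability that a random positive example is scored above a random negative one.
    order = sorted(range(len(y_score)), key=lambda i: y_score[i])
    ranks = [0.0] * len(y_score)
    i = 0
    while i < len(order):
        j = i
        # Give tied scores their average rank (ranks are 1-based).
        while j + 1 < len(order) and y_score[order[j + 1]] == y_score[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, y_true) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)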
AUC: Area Under the ROC Curve
AUC stands for "Area under the ROC Curve." That is, AUC measures the entire two-dimensional area
underneath the entire ROC curve (think integral calculus) from (0,0) to (1,1).
AUC provides an aggregate measure of performance across all possible classification thresholds. One
way of interpreting AUC is as the probability that the model ranks a random positive example more
highly than a random negative example. For example, if the examples are arranged from left to right in
ascending order of logistic regression score, AUC represents the probability that a randomly chosen
positive example is positioned to the right of a randomly chosen negative example.
AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one
whose predictions are 100% correct has an AUC of 1.0.
AUC is desirable for the following two reasons:
• AUC is scale-invariant. It measures how well predictions are ranked, rather than their absolute
values.
• AUC is classification-threshold-invariant. It measures the quality of the model's predictions
irrespective of what classification threshold is chosen.
However, both these reasons come with caveats, which may limit the usefulness of AUC in certain
use cases:
• Scale invariance is not always desirable. For example, sometimes we really do need well-calibrated
probability outputs, and AUC won't tell us about that.
• Classification-threshold invariance is not always desirable. In cases where there are wide
disparities in the cost of false negatives vs. false positives, it may be critical to minimize one
type of classification error. For example, when doing email spam detection, you likely want to
prioritize minimizing false positives (even if that results in a significant increase of false
negatives). AUC isn't a useful metric for this type of optimization.
Differences between Receiver Operating Characteristic AUC (ROC AUC) and Precision
Recall AUC (PR AUC)
Use the Precision-Recall area under curve for class imbalance problems; otherwise, use the Receiver
Operating Characteristic area under curve.
Introduction
When confronted with the class imbalance problem, accuracy is the wrong metric to use. Usually,
there are two candidate metrics:
1. Receiver Operating Characteristic area under curve (ROC AUC)
2. Precision Recall area under curve (PR AUC)
Which is better? What are the differences?
Receiver Operating Characteristic Curve (ROC curve)
A ROC curve is plotting True Positive Rate (TPR) against False Positive Rate (FPR).
TPR is defined as: TPR = TP / (TP + FN)
FPR is defined as: FPR = FP / (FP + TN)
where TP = true positive, TN = true negative, FP = false positive, FN = false negative.
A typical ROC curve looks like this, which shows two ROC curves for Algorithm 1 and Algorithm 2.
The goal is to have the model at the upper left corner, which corresponds to a true positive rate of 1
with no false positives: a perfect classifier.
The receiver operating characteristic area under curve (ROC AUC) is just the area under
the ROC curve. The higher it is, the better the model is.
Precision Recall Curve (PR Curve)
A PR curve is plotting Precision against Recall.
Precision is defined as: Precision = TP / (TP + FP)
Recall is defined as: Recall = TP / (TP + FN)
A typical PR curve looks like this, which shows two PR curves for Algorithm
1 and Algorithm 2.
The goal is to have a model be at the upper right corner, which is basically getting only the true
positives with no false positives and no false negatives – a perfect classifier.
The precision recall area under curve (PR AUC) is just the area under the PR curve. The
higher it is, the better the model is.
Differences between the ROC AUC and PR AUC
PR does not account for true negatives (TN is not a component of either precision or recall). So if true
negatives do not matter for the problem, or there are many more negatives than positives (a characteristic
of the class imbalance problem), use PR. Otherwise, use ROC.
For illustration, let’s take an example of an information retrieval problem where we want to find a set
of, say, 100 relevant documents out of a list of 1 million possibilities based on some query. Let’s say
we’ve got two algorithms we want to compare with the following performance:
• Method 1: 100 retrieved documents, 90 relevant. Thus, TP = 90, TN = 999890, FP = 10,
FN = 10.
• Method 2: 2000 retrieved documents, 90 relevant. Thus, TP = 90, TN = 997990, FP = 1910,
FN = 10.
Clearly, Method 1’s result is preferable since they both come back with the same number of relevant
results, but Method 2 brings a ton of false positives with it. The ROC measures of TPR and FPR will
reflect that, but since the number of irrelevant documents dwarfs the number of relevant ones, the
difference is mostly lost:
• Method 1: 0.9 TPR, 0.00001 FPR
o TPR = TP/(TP + FN) = 90/(90 + 10) = 0.9
o FPR = FP/(FP + TN) = 10/(10 + 999890) = 0.00001
• Method 2: 0.9 TPR, 0.00191 FPR (difference of 0.0019)
o TPR = TP/(TP + FN) = 90/(90 + 10) = 0.9
o FPR = FP/(FP + TN) = 1910/(1910 + 997990) = 0.00191
Precision and recall, however, don’t consider true negatives and thus won’t be affected by the relative
imbalance (which is precisely why they’re used for these types of problems):
• Method 1: 0.9 precision, 0.9 recall
o Precision = TP/(TP + FP) = 90/(90 + 10) = 0.9
o Recall = TP/(TP + FN) = 90/(90 + 10) = 0.9
• Method 2: 0.045 precision (difference of 0.855), 0.9 recall
o Precision = TP/(TP + FP) = 90/(90 + 1910) = 0.045
o Recall = TP/(TP + FN) = 90/(90 + 10) = 0.9
Obviously, those are just single points in ROC and PR space, but if these differences persist across
various scoring thresholds, using ROC AUC, we’d see a very small difference between the two
algorithms, whereas PR AUC would show quite a large difference.
Comparing the two methods with ROC AUC in mind, the FPR differs by only 0.0019, which is very small.
Looking at PR AUC, however, the precision differs by 0.855, which is much more pronounced.
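As a quick check of the arithmetic above (the counts are taken directly from the example), a small sketch recomputes these quantities:

def rates(tp, tn, fp, fn):
    # Return (TPR, FPR, precision, recall) from raw confusion counts.
    tpr = tp / (tp + fn)
    fpr = fp / (fp + tn)
    precision = tp / (tp + fp)
    recall = tpr  # recall and TPR are the same quantity
    return tpr, fpr, precision, recall

# Method 1: 100 retrieved documents, 90 relevant
print(rates(tp=90, tn=999890, fp=10, fn=10))    # TPR 0.9, FPR ~0.00001, precision 0.9, recall 0.9
# Method 2: 2000 retrieved documents, 90 relevant
print(rates(tp=90, tn=997990, fp=1910, fn=10))  # TPR 0.9, FPR ~0.00191, precision 0.045, recall 0.9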
Clearly, PR is much better at illustrating the differences between the algorithms when there are many
more negative examples than positive examples.
Conclusion
Use PR AUC for cases where the class imbalance problem occurs, otherwise use ROC AUC.
One note though: if your problem set is small (and so the PR curve has few points), the PR AUC metric
could be over-optimistic, because AUC is calculated via the trapezoid rule and linear interpolation on
the PR curve does not work very well, which is why the PR curve example above looks very wiggly. Interested
readers may consult the paper The Relationship between Precision-Recall and ROC Curves. In practice,
though, a class imbalance problem nowadays should come with plenty of examples, so this should not
be a problem.