Interpretable
Interpretable
ABSTRACT modern car [2] could be performing simple lane keeping function
The use of learning based components in cyber-physical systems from the video feed it receives from the camera. The other end of the
(CPS) has created a gamut of possible avenues to use high dimen- spectrum would be a fully self-driving car equipped with an auto-
sional real world signals generated from sensors like camera and mated driving system (ADS). Here, the software is in full control of
LiDAR. The ability to process such signals can be largely attributed the car, and is capable of making all decision concerning navigation
to the adoption of high-capacity function approximators like deep and maneuvering the vehicle. This has the potential to dramati-
neural networks. However, this does not come without its potential cally reduce accidents and general vehicle safety. Additionally, such
perils. The pitfalls arise from possible over-fitting, and subsequent technologies have the potential to bolster the independence and
unsafe behavior when exposed to unknown environments. One mobility of seniors and those who cannot drive.
challenge is that, in high dimensional input spaces it is almost im- Deep Neural networks (DNNs) perform most of the the heavy
possible to experience enough training data in the design phase. lifting regarding LEC’s with rich sensors like camera and LiDAR.
What is required here, is an efficient way to flag out-of-distribution DNNs are high-capacity function approximators, and form a funda-
(OOD) samples that is precise enough to not raise too many false mental building block for most machine learning applications. The
alarms. In addition, the system needs to be able to detect these downside is that DNNs do not learn to interpret an image input
in a computationally efficient manner at runtime. In this paper, the way humans do. It largely happens through an effort to fit a
our proposal is to build good representations for in-distribution high-capacity function to the training data by reducing its classifi-
data. We introduce the idea of a memory bank to store prototypical cation error rate. DNNs are complex computation graphs and can
samples from the input space. We use these memories to compute potentially have millions of nodes and parameters. This makes them
probability density estimates using kernel density estimation tech- particularly difficult to be analyzable by a human expert. What this
niques. We evaluate our technique on two challenging scenarios : a translates to is that computer vision systems are prone to errors
self-driving car setting implemented inside the simulator CARLA in a way that people are not. For instance it is fairly easy to come
with image inputs, and an autonomous racing car navigation set- up with imperceptible changes to an input image that can fool the
ting, with LiDAR inputs. In both settings, it was observed that a ADS system. It can be shown that its fairly simple to change the
deviation from in-distribution setting can potentially lead to devia- number on a speed limit sign, or even change a stop sign to a speed
tion from safe behavior. An added benefit of using training samples limit sign [24].
as memories to detect out-of-distribution inputs is that the system The challenge with most LECs is that it is necessarily the case
is interpretable to a human operator. Explanation of this nature that the input space is insufficiently sampled. That is, during the
is generally hard to obtain from pure deep learning based alter- design phase the system does not see all the training samples in the
natives. Our code for reproducing the experiments is available at vicinity of the operating region of the system. Hence it is almost
https:// github.com/ yangy96/ interpretable_ood_detection.git impossible to provide rigorous guarantees on the operating limits at
design time. This is where efficient runtime monitoring techniques
KEYWORDS can play a significant role in ensuring safe operations of the system.
Neural networks due to the nature of the training algorithms do not
anomalous inputs, autonomy, safety, vision, perception systems perform well when pushed outside of its training zone. Even though
statistical machine learning tools can provide upper bounds on the
1 INTRODUCTION test error rate of a learnt model, they are often too conservative
and do not capture a realistic error rate.
Learning-enabled components (LEC) are being increasingly used in
Detecting out-of-distribution (OOD) [16, 23] samples has gained
modern cyber-physical systems (CPS) especially self-driving cars.
widespread popularity in the recent years to counter the fragility
Designing such systems typically rely on data-driven techniques to
of DNNs when operating in unknown environments. However, it
achieve desired performance. An LEC learns to operate by having
could not been extended easily to a CPS setting, and can have high
access to a large corpus of human-labelled input-output data during
false alarm rates. The main contribution of this paper is a novel
the design phase. For instance, a learning-enabled component in a
226
explanation when samples are in-distribution or out-of-distribution. Where 𝐾 is the Kernel function, and ℎ is a smoothing term. The
We demonstrate this in Figure 1b, the system not only flags the kernel function measures the influence of sample 𝑠𝑖 on the query
image with the biker ahead as OOD, but selects a set of pixels de- point 𝑥. The choice of kernel, smoothing parameter, and distance
marcating the biker to communicate as to why it decided to label it function influences the quality of estimates that one obtains from
as an OOD. this method. We refer the reader to [29] for further details on KDE.
4.2.1 Abstract Kernel Centers. Standard KDE is often difficult to es-
4 PRELIMINARIES timate at runtime. Placing demands on both memory requirements
4.1 Clustering with Medoids and computational efficiency. To handle this, we use the concept of
Similar to k-means clustering, we wish to form partitions of the abstract kernel centers introduced in [26]. Instead of using the full
data into distinct groups or clusters. Clustering with k-means is a data set, the idea is to use a smaller representative set to estimate
well known tool but has its challenges when used in the context of the probability density. In order to obtain useful density estimates,
images. An issue with k-means is that it can potentially produce the first step is to identify centrally located data points called kernel
virtual cluster centers which are absent in the original data set. centers, which are able to capture the distribution closely. Let us
This is essentially because a simple mean of two (or more) images denote these kernel centers with the set M𝑟 : (𝑚 1, 𝑚 2, . . . , 𝑚𝑟 ).
might not correspond to a real image. The other issue being that The density estimates are computed using the k-nearest-neighbors
it is often susceptible to outliers in the data. Hence, we restrict (kNN) from the test point 𝑥 in the set M𝑟 . Since, the kernel centers
ourselves to partitioning around points which are present in the are identified using the partitioning algorithm mentioned above,
data. The algorithm which achieves this is PAM [3], which is short we can alternately call it the 𝑘 nearest memories as well. Suppose
for Partitioning Around Medoids. Intuitively the algorithm tries to this set of 𝑘 nearest memories is M𝑘 ⊂ M𝑟 . The weighted kernel
search for centrally located objects called medoids, and are used to density estimate [15] is given by,
define the cluster boundaries in a nearest medoid sense.
Let us assume that the set S is equipped with a distance metric
𝑘
𝑓ˆ𝐴 (𝑥) = 𝑤 𝑗 𝐾ℎ (|𝑥 − 𝑚 𝑗 |) (3)
D : (𝑠 1, 𝑠 2 ) → R, for 𝑠𝑖 ∈ S and 𝑛 = |S|. Given a data set S, PAM
𝑗=1
tries to select a set of 𝑟 medoids - 𝑀𝑟 : {𝑚 1, 𝑚 2, . . . , 𝑚𝑟 } such that
the following cost is minimized, # points with closest memory as 𝑚 𝑗
𝑤𝑗 =
# points which have closest memory in M𝑘
𝑛
𝐶𝑜𝑠𝑡 (𝑀𝑟 ) = 𝑚𝑖𝑛 D (𝑚 𝑗 , 𝑠𝑖 ) (1) Thus, the data partitions built around the memories can now be
𝑚 𝑗 ∈𝑀𝑟
𝑖=1 used to compute the density estimation function 𝑓ˆ𝐴 . In this paper
we use the 𝐸𝑝𝑎𝑛𝑒𝑐ℎ𝑛𝑖𝑘𝑜𝑣 kernel. Since, this kernel permits us to
We assume that the inner minimization is always possible, and
demarcate a boundary around the center beyond which the distance
we are able to break ties arbitrarily among distinct members of the
function stops being meaningful. The bandwidth parameter ℎ can
set S.
be chosen depending on the application. Note that, for us the kernel
Algorithms : The challenge with PAM is that the naive implemen-
centers are not abstract as compared to [26], but in our case we
tation has a runtime complexity of 𝑂 (𝑛 2𝑟 2 ) [28]. Even though there
keep the terminology to be coherent with the original idea.
exists faster variants, but is still largely inaccessible for applications
at the scale of image data sets generated from autonomous driving
4.3 Structural Similarity Index Metric
scenarios. In order to circumvent this challenge we introduce a
variant of the Clustering Large Applications based upon Random- A fundamental challenge in dealing with images is to capture human
ized Search (CLARANS) [25] algorithm in Section 5.2. It combines perceptual similarity with a mathematically meaningful distance
randomized global search with local cluster improvement method function. Techniques like KDE necessitate the use of a distance met-
to improve the quality of clustering. The medoids identified by ric to compute the probability density. To the best of our knowledge
minimizing the objective in Equation 1, are referred to as memories the right candidate for this purpose is the structural similarity index
from here on. metric (SSIM). This was first introduced in [35], and has gained
widespread popularity. It computes the degree to which two images
are similar to a human eye, and was used to compute the degra-
4.2 Kernel Density Estimation
dation quality of the image. SSIM metric is designed to capture
Kernel Density Estimation (KDE) is a non-parametric way to es- statistical similarity between images. This makes our system more
timate the probability density function of a random variable. Let robust to random noise in comparison to vanilla DNNs. It also has
us assume that the set S = (𝑠 1, 𝑠 2, . . . , 𝑠𝑛 ) is independently and been used to capture image similarity for adversarial sticker attacks
identically drawn from a fixed but unknown distribution 𝑓 , and as well [22].
we wish to estimate the probability density for an element 𝑥. This We state the original SSIM distance function next. Assume we
according to kernel density estimation methods it is given by the have two images A1 ∈ R𝑁 and A2 ∈ R𝑁 . This allows us to compute
following equation, three terms : a luminance distortion term, a contrast distortion term,
and a correlation term.
1
𝑛
𝑓ˆ(𝑥) = 𝐾 (|𝑥 − 𝑠𝑖 |) (2) 2A¯1 A¯2 + 𝑐 1
𝑛 𝑖=1 ℎ 𝑙 (A1, A2 ) = 2 (4)
A¯1 + A¯2 2 + 𝑐 1
227
2𝑠 A 𝑠 A + 𝑐 2 point at random, and compute the distance score across all the
𝑐 (A1, A2 ) = 2 1 22 (5)
𝑠A + 𝑠A + 𝑐 2 samples in the currently RejectedSet in a single linear pass. The data
1 2
points which are similar enough are admitted as being close to a
𝑠 A ,A + 𝑐 3
𝑠 (A1, A2 ) = 1 2 (6) memory, and are not considered as candidates for new memories in
𝑠 A1 𝑠 A2 + 𝑐 3 the next iteration. We continue this process until all data points are
Where A¯1, A¯2, 𝑠 A2 , 𝑠 A2 and 𝑠 A1 ,A2 are the local mean, local vari- admitted into the set of memories 𝑀. This allows the subsequent
1 2
ance, and local covariance between A1 and A2 . The scalar terms algorithms to have a warm start. Algorithm 1 always terminates.
𝑐 1, 𝑐 2, 𝑐 3 aim to capture the saturation effects of the visual system, This is because the RejectedSet decreases by at least 1 at each step.
and provide numerical stability. The terms computed above cap- In the worst case we have as many memories as the number of data
ture the local difference in some chosen window in the image. The points. But in most practical datasets this is not the case.
combination across all such local windows gives the SSIM index.
With 𝑐 3 = 𝑐 2 /2, SSIM index can be written in the following form : 5.2 Learning Memories
𝑆𝑆𝐼𝑀 (A1, A2 ) = 𝑆 1 (A1, A2 )𝑆 2 (A1, A2 ) To restate, we are given a dataset S, with 𝑛 elements, and we wish to
𝑆 1 (A1, A2 ) = 𝑙 (A1, A2 ) compute an 𝑟 size memory set 𝑀 = {𝑚 1, 𝑚 2, . . . , 𝑚𝑟 } with certain
(7)
desirable properties. The search for memories can be simplified
𝑆 2 (A1, A2 ) = 𝑐 (A1, A2 )𝑠 (A1, A2 ) by viewing this as a search through a graph G [25] with subsets
SSIM can be implemented efficiently in tools like Pytorch [1] and S𝑟 ⊂ S as its nodes. Each subset of size 𝑟 defines a choice for the
accelerated using a GPU. This permits a scalable and efficient imple- memory set 𝑀.
mentation inside our OOD detection framework. On the downside,
Definition 5.1 (Memory Search Graph G). The undirected graph
SSIM does not have the mathematical properties to be a distance
G is represented by an ordered pair (𝑉 , 𝐸). The set of nodes 𝑉 is
metric. But with some modifications it can be turned into one. The
the collection of subsets of original dataset S𝑟 ⊂ S. An edge 𝑒 ∈ 𝐸
details of this modification and the associated proof can be found in
exists between two nodes S𝑟1 and S𝑟2 iff |S𝑟1 ∩ S𝑟2 | = 𝑟 − 1. That is,
[8]. We use the modified SSIM to define a distance metric D (𝐴1, 𝐴2 )
they differ by at most one memory.
in this paper and hyperparameters (e.g. 𝑐 1, 𝑐 2, 𝑐 3 ) are set same as in
[1]. The use of a proper distance metric for images allows us to com- Each node of the graph has an associated cost given by Equation
pute the probability density function, and subsequently distribution 1. Hence starting from some node it is possible to visit neighboring
shifts in a more meaningful way. nodes with decreasing costs in the search process. What we present
next is a combination of 𝐺𝑙𝑜𝑏𝑎𝑙 resets and 𝐿𝑜𝑐𝑎𝑙 minimization to
5 METHODOLOGY approximate the optimal choice.
5.1 Initializing the Memory Set
Algorithm 2 Generate Memories
Input: S : {𝑠 1, 𝑠 2, . . . , 𝑠𝑛 }
Algorithm 1 Generate Initial Memories
Output: Memories 𝑀 : {𝑚 1, 𝑚 2, 𝑚 3, . . . , 𝑚𝑟 }
Input: Data set S : {𝑠 1, 𝑠 2, . . . , 𝑠𝑛 } Parameter : (Max Global Steps : 𝑍𝑔 , Max Local Steps : 𝑍𝑙 ,
Output: Memories 𝑀 : {𝑚 1, 𝑚 2, 𝑚 3, . . . , 𝑚𝑟 } Distance Threshold 𝑑)
Parameter : Distance Threshold 𝑑 1: BestCost = ∞
1: 𝑀 = 𝜙 2: for 1 ≤ 𝑔 ≤ 𝑍𝑔 do
2: RejectedSet = S 3: Memory Set 𝑀 = GenerateInitialMemories(𝑆, 𝑑)
3: while RejectedSet ≠ 𝜙 do 4: 𝑣 = FindNode(𝑀, G)
4: 𝑠𝑚 = pickRandomPoint(RejectedSet) 5: G = CreateGraph(S, |𝑀 |) ⊲ The memory search graph
5: for 𝑠𝑖 ∈ RejectedSet do 6: CurrentCost = ComputeCost(𝑣)
6: if D (𝑠𝑚 , 𝑠𝑖 ) < 𝑑 then 7: for 1 ≤ 𝑙 ≤ 𝑍𝑙 do
7: RejectedSet = RejectedSet \𝑠𝑖 8: 𝑣 ′ = PickNeighbor(𝑣, G)
8: 𝑀 = 𝑀 ∪ {𝑠𝑚 } 9: NewCost = ComputeCost(𝑣 ′ )
9: return 𝑀 10: if NewCost < CurrentCost then
11: 𝑣 ← 𝑣′
12: CurrentCost ← NewCost
The intuition here is that high dimensional data like images, and 13: if CurrentCost < BestCost then
LiDAR scans generated from a real world setting, cluster well in 14: BestCost = CurrentCost
practice. Hence, the first step is to identify these broad categories in
15: return 𝑀
a quick and efficient fashion. One of the questions however is that
the number of partitions to be made is often not known apriori. But
drawing on the intuitions from an image distance metric, only small Algorithm 2 picks the eventual memories used in OOD detection.
enough distances have perceptual meaning. Thus, the intuition here Similar to standard CLARANS algorithm each node in G has 𝑟 (𝑛 −𝑟 )
is to populate the input space densely enough with memories such neighbors, where 𝑟 is the number of memories. Which can be quite
that, every training point is within a threshold distance 𝑑 of some large given the scale of modern machine learning data sets with
memory. Algorithm 1 summarizes our approach. We pick a data large 𝑛. What we do here is that start with a reasonable choice
228
Figure 2: This figure summarizes our approach. The memorization phase of the algorithm picks prototypical samples as memo-
ries. At run-time the algorithm computes kernel density estimates to assess the likelihood of a new data being in-distribution.
for initial node in G, and greedily look for local improvements for the 𝑘 nearest memories. The intuition being that for sufficiently
for a fixed number of iterations. The global search starts by using different memories computing a single distance pair can be used to
Algorithm 1, in order to generate the initial set of memories as reject other memories from further consideration.
node 𝑣 in G. Notice that we do not choose the number of memories We are interested in computing the nearest neighbor, that is
apriori but instead gets picked as a consequence of distance score 𝑑. 𝑘 = 1 in the set M S for a test point 𝑥𝑡 . Assume that we wish to
The partitioning cost for the choice of memories is computed by compute the distance between a test point 𝑥𝑡 , and some memory 𝑚 𝑗
the function 𝐶𝑜𝑚𝑝𝑢𝑡𝑒𝐶𝑜𝑠𝑡 which evaluates Equation 1. Note this and the distance D (𝑥𝑡 , 𝑚𝑖 ) is known. Then in the triangle formed
can be expensive since it needs a total of 𝑟 ×𝑛 distance computation by the triplet (𝑚𝑖 , 𝑥𝑡 , 𝑚 𝑗 ), the following two equations are true:
operations. The local search (Lines 7 − 12) implements a greedy
strategy to pick the neighborhood node which produces a descent. D (𝑚𝑖 , 𝑥𝑡 ) − D (𝑚𝑖 , 𝑚 𝑗 ) ≤ D (𝑚 𝑗 , 𝑥𝑡 )
The outer loop of the algorithm keeps track of the node with the
minimum cost for each such reset produced in line 3. Algorithm 2 and
trivially terminates, since each search proceeds for a fixed number D (𝑚𝑖 , 𝑚 𝑗 ) − D (𝑚𝑖 , 𝑥𝑡 ) ≤ D (𝑚 𝑗 , 𝑥𝑡 )
of steps.
Meaning that D (𝑚 ′, 𝑥𝑡 ) is lower bounded by : |D (𝑚𝑖 , 𝑚 𝑗 ) −
Definition 5.2 (Memory System M S ). A memory system is a D (𝑚𝑖 , 𝑥𝑡 )|. Since, if we are interested in memories which are within
collection of pairs M S := {(𝑚 1, 𝑞 1 ), (𝑚 2, 𝑞 2 ), . . . , (𝑚𝑟 , 𝑞𝑟 )}, where, a certain threshold ( say ℎ) of 𝑚 ′ , we do not actually need to compute
𝑞𝑖 = |𝑄𝑚𝑖 | the distance D (𝑚 ′, 𝑥𝑡 ) if the following equation holds True.
𝑄𝑚𝑖 = I(𝑚𝑖 = 𝑎𝑟𝑔𝑚𝑖𝑛 D (𝑠, 𝑚𝑖 )) (8) |D (𝑚𝑖 , 𝑚 𝑗 ) − D (𝑚𝑖 , 𝑥𝑡 )| > ℎ (9)
𝑠 ∈S 𝑖
For each memory 𝑚𝑖 , we can pre-compute a look-up table for
Thus the memory system M S keeps track of the number of the inter-memory distance Q : {(𝑚 𝑗 , D (𝑚𝑖 , 𝑚 𝑗 ))|1 ≤ 𝑗 ≤ 𝑙, 𝑗 ≠ 𝑖}.
points for every memory which belongs to the cluster defined by it. This can lead to reduction in the search space in practice by pruning
The OOD detection algorithm which follows uses these memories out memories from further consideration each time the distance of a
as abstract kernel centers in Equation 3 to compute the probability memory from 𝑥𝑡 gets measured. For 𝑘 > 1, similar reasoning holds.
density at a test point. The only difference being that, in this case the search algorithm
tracks the distance of the 𝑘 𝑡ℎ -memory furthest from the test point.
5.3 Scaling Memory Search
In order to compute the probability density estimates given by 5.4 Detecting Distribution Shifts
Equation 3, we need to do a linear time search through the current To summarize, we know how to go from the set of training data S
set of memories in M S . Even though the number of memories to the set of memories M S . This happens through a smart initial-
produced in Algorithm 2 might be small enough compared to the ization of the set of memories (Algorithm 1), followed by a further
full data set, S a search through the list of memories might still be refinement using a medoid based partitioning technique discussed
challenging. To remedy this potential drawback we deploy a simple in Algorithm 2. Additionally, to handle any potential slow downs,
hashing technique first introduced in [14] . The distance metric we briefly discuss how one can use the inter-memory distance to
D discussed in Section 4.3, was a proper distance metric, which prune out large parts of the search space. Thus allowing the system
implies that the distance function respects triangle inequality. In to scale to larger memory systems. We are now at a stage to discuss
what follows, we describe a possible avenue to speed up the search our runtime algorithm for detecting distribution shifts.
229
Algorithm 3 Detect Distribution Shifts (3) OOD-ness due to change in the front obstacles from cars in
the in-distribution traces to bikes in the OOD traces.
Input: Time Series Data 𝑥𝑡 , Memory System M S (4) OOD-ness due to perturbation of in-distribution frames with
Output: Distribution Shift Flags F𝑡 adversarial attack.
Parameter : Window Threshold - 𝜏, Window - 𝑊 , probability
threshold - 𝛼
Evaluation metrics: We refer to OOD traces as positive and in-
1: Frame = 𝜙
distribution as negative. We report false positive (FP) as the number
2: for 1 ≤ 𝑡 ≤ ∞ do
of in-distribution traces that were falsely detected as OOD. False
3: 𝐹𝑙𝑎𝑔𝑂𝑂𝐷 = DetectOOD(M S , 𝑥𝑡 , 𝛼)
negatives (FN) are the number of OOD traces falsely detected as in-
4: Frame ← UpdateFrame(𝐹𝑙𝑎𝑔𝑂𝑂𝐷 , 𝑊 , Frame)
distribution. We also report an average delay in the OOD detection
5: F𝑡 ← CountOOD(Frame) ≥ 𝜏
as the number of windows required to detect the start of the OOD-
6: return F𝑡 ness in the traces averaged over the total number of detected OOD
traces. We conduct all our experiments in this case study on a single
GPU (NVIDIA GeForce RTX 2080 Ti).
In practical scenarios, detecting a shift in distribution needs a 6.1 OOD-ness due to change in weather and
robust mechanism. We achieve this using a sliding window based lighting
implementation to track the number of out of distribution sam-
ples it sees. Algorithm 3 summarizes our approach. For each input Here we generate OOD traces in which OOD-ness gradually in-
sample at runtime, the function 𝐷𝑒𝑡𝑒𝑐𝑡𝑂𝑂𝐷 simply computes the creases with time. We increase the precipitation (or fog) parameter
probability estimates for a test point 𝑥𝑡 , from a memory system gradually in sequential frames of a trace to generate heavy rain (or
M S using Equation 3. Additionally it compares this probability foggy) OOD trace. Similarly, we gradually increase the darkness
density with a threshold 𝛼 to Flag a sample as OOD. The function parameter to generate night time OOD traces. Examples of these
𝑈 𝑝𝑑𝑎𝑡𝑒𝐹𝑟𝑎𝑚𝑒 keeps track of the OOD Flags in the last 𝑊 frames. OOD traces are shown in Appendix.
The algorithm outputs a distribution shift once this count cross 6.1.1 Results on OOD detection for heavy rain. We define frames
threshold 𝜏. with precipitation level greater than 20 as OODs due to heavy rain.
There are 4488 in-distribution images with precipitation parame-
6 CASE STUDY 1 - SIMULATED ter from 0 to 10. The test dataset contains 26 in-distribution and
AUTONOMOUS DRIVING SCENARIO USING 74 out-of-distribution traces. Our approach involves a few hyper-
parameters like window length 𝑊 , threshold count 𝜏, probability
CARLA
threshold 𝛼, and distance threshold 𝑑. In the current experiments,
System Description: Here we consider an advanced emergency we choose the hyperparameters empirically. In practice, the desired
braking system (AEBS) from [9]. The system overview is shown in sensitivity of the system would dictate the parameter choice. We
Figure 11 of A.2. It is a closed loop system composed of a perception report some of the top performers in Table 1. A more exhaustive
based LEC, which estimates distance of the object ahead of the ego study has been reported in Figure 10 of the Appendix.
vehicle. This distance combined with the velocity is the input to the
braking controller. The objective of the AEBS system is to brake the
(𝑾 , 𝝉 , 𝜶 , 𝒅) Mem FP FNAvg Exec
ego vehicle to avoid a collision. The controller is trained using stan-
Delay Time (ms)
dard reinforcement learning on in-distribution data. The distance
(5,5,0.92,0.2) 145 0/26 2/74 0.42 21.39
estimating LEC is trained using supervised learning techniques. For
(10,5,0.92,0.2) 145 0/26 2/74 0.04 19.80
details about the model architecture and training hypeparameters,
(5,5,0.78,0.3) 36 1/26 1/74 0 6.26
please refer to [9].
(10,5,0.78,0.3) 36 1/26 0/74 0 5.88
In-distribution data: We evaluate our approach on OOD detection Table 1: Results on heavy rain traces as OODs. 𝑀𝑒𝑚: Total
with the dataset provided in [9]. The dataset is generated using number of memories, 𝐸𝑥𝑒𝑐 𝑡𝑖𝑚𝑒: the time for OOD detection
CARLA [11], an open-source simulator for autonomous driving. on each frame (calculated on 3 random traces)
The in-distribution traces consists of daytime frames with slight
rain (i.e. precipitation level in {0, 1, . . . 10}), and with cars as the
front object. The sampling rate is 20𝐻𝑧 [9]. The experimental results shows that Algorithm 2 efficiently com-
pressed the training data (4488→ 145/36) into memories and can
Types of OODs: We evaluate our approach on the following four perform OOD detection under autonomous driving setting. With
different sources of OOD-ness in the traces. the parameters as shown in the table, the results in Table 1 indicate
(1) OOD-ness due to weather change from slight rain in the in- that our OOD detector could successfully detect traces with heavy
distribution traces to heavy rain (precipitation level greater rain precipitation and without delay. Performance of the VAE based
than or equal to 20) and foggy traces in OOD traces. OOD detector [9] is comparable to ours in terms of false positive
(2) OOD-ness due to change in the lighting conditions from day and false negative rates but their average detection delay (as re-
in the in-distribution traces to night in the OOD traces. ported) is higher than ours. We also report the average execution
230
Night traces Foggy traces (𝑾 , 𝝉 , 𝜶 , 𝒅) FP FN Avg Delay
(𝑾 , 𝝉 , 𝜶 , 𝒅) FN Avg Delay FN Avg Delay (5,5,0.92,0.2) 0/26 0/27 0
(5,5,0.92,0.2) 0/27 0.15 0/27 5 (10,5,0.92,0.2) 0/26 0/27 0
(10,5,0.92,0.2) 0/27 0.15 0/27 10 (5,5,0.78,0.3) 1/26 2/27 0.96
(5,5,0.78,0.3) 0/27 1.89 0/27 9 (10,5,0.78,0.3) 1/26 2/27 0
(10,5,0.78,0.3) 0/27 0.15 0/27 11.15 Table 3: Results on OOD traces with bikers.
Table 2: Results on night and foggy OOD traces.
231
(a) Match the input test im- (b) Highlight the least similar
(a) True prediction rate (b) False prediction rate age with memories in training part compared to the memory
data
Figure 5: Out-of-distribution traces detection results for de-
tecting adversarial attack on the road with different hyper- Figure 6: OOD detection reasoning for sticker detection (The
parameters test image is Figure 4b)
232
Figure 7: Left : We show a setting where the car should take a right turn on an L-shaped track. Middle : The dots show the
distance estimates as provided by the sensor. It matches well with the position of the obstacles. Right : Due to reflection from
the left wall, it gives a false impression of no obstacle to the left of the car when deployed in the real world.
233
that state of the art results in OOD detection for self-driving car [16] Dan Hendrycks and Kevin Gimpel. 2016. A baseline for detecting misclassified and
out-of-distribution examples in neural networks. arXiv preprint arXiv:1610.02136
applications, can go hand in hand with overall interpretablity, with- (2016).
out compromising on execution times. In the future, we would like [17] Radoslav Ivanov, Taylor J. Carpenter, James Weimer, Rajeev Alur, George J. Pap-
to extend this technique on applications beyond self-driving cars pas, and Insup Lee. 2020. Case Study: Verifying the Safety of an Autonomous
Racing Car with a Neural Network Controller. In Proceedings of the 23rd Inter-
where anomalous inputs are challenging to handle. national Conference on Hybrid Systems: Computation and Control (Sydney, New
South Wales, Australia) (HSCC ’20). Association for Computing Machinery, New
York, NY, USA, Article 28, 7 pages. https://doi.org/10.1145/3365365.3382216
9 ACKNOWLEDGEMENT [18] Radoslav Ivanov, James Weimer, Rajeev Alur, George J. Pappas, and Insup
This work was supported in part by ARO W911NF-20-1-0080, AFRL Lee. 2019. Verisig: Verifying Safety Properties of Hybrid Systems with Neu-
ral Network Controllers. In Proceedings of the 22nd ACM International Confer-
and DARPA FA8750-18-C-0090, NSF-1915398, NSF-2125561 and ence on Hybrid Systems: Computation and Control (Montreal, Quebec, Canada)
SRC Task 2894.001. Any opinions, findings and conclusions or rec- (HSCC ’19). Association for Computing Machinery, New York, NY, USA, 169–178.
https://doi.org/10.1145/3302504.3311806
ommendations expressed in this material are those of the authors [19] Ramneet Kaur, Susmit Jha, Anirban Roy, Sangdon Park, Edgar Dobriban, Oleg
and do not necessarily reflect the views of the Air Force Research Sokolsky, and Insup Lee. 2022. iDECODe: In-distribution Equivariance for Confor-
Laboratory (AFRL), the Army Research Office (ARO), the Defense mal Out-of-distribution Detection, Association for the Advancement of Artificial
Intelligence. arXiv:2201.02331 [cs.LG]
Advanced Research Projects Agency (DARPA), or the Department [20] Ramneet Kaur, Susmit Jha, Anirban Roy, Sangdon Park, Oleg Sokolsky, and Insup
of Defense, or the United States Government. Lee. 2021. Detecting OODs as datapoints with High Uncertainty. arXiv preprint
Additionally, we would like to thank Radoslav Ivanov, postdoc- arXiv:2108.06380 (2021).
[21] Rikard Laxhammar and Göran Falkman. 2015. Inductive conformal anomaly
toral scholar from the University of Pennsylvania for discussions detection for sequential detection of anomalous sub-trajectories. Annals of
on the LiDAR experiments and sharing the data. We also especially Mathematics and Artificial Intelligence 74, 1 (2015), 67–94.
[22] Juncheng Li, Frank R. Schmidt, and J. Zico Kolter. 2019. Adversarial camera
appreciate Feiyang Cai, PhD candidate from Vanderbilt University stickers: A physical camera-based attack on deep learning systems. CoRR
for sharing the CARLA dataset and simulation code. abs/1904.00759 (2019). arXiv:1904.00759 http://arxiv.org/abs/1904.00759
[23] David Macêdo, Tsang Ing Ren, Cleber Zanchettin, Adriano LI Oliveira, and Teresa
Ludermir. 2021. Entropic out-of-distribution detection. In 2021 International Joint
REFERENCES Conference on Neural Networks (IJCNN). IEEE, 1–8.
[1] [n. d.]. pytorch-msssim. https://pypi.org/project/pytorch-msssim/ [24] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. 2015.
[2] [n. d.]. Toyota Safety Sense. https://www.toyota.com/safety-sense/ DeepFool: a simple and accurate method to fool deep neural networks. CoRR
[3] 1990. Partitioning Around Medoids (Program PAM). Chap- abs/1511.04599 (2015). arXiv:1511.04599 http://arxiv.org/abs/1511.04599
ter 2, 68–125. https://doi.org/10.1002/9780470316801.ch2 [25] R.T. Ng and Jiawei Han. 2002. CLARANS: a method for clustering objects for
arXiv:https://onlinelibrary.wiley.com/doi/pdf/10.1002/9780470316801.ch2 spatial data mining. IEEE Transactions on Knowledge and Data Engineering 14, 5
[4] 2021. F1TENTH. https://f1tenth.org (2002), 1003–1016. https://doi.org/10.1109/TKDE.2002.1033770
[5] Alexander A. Alemi, Ben Poole, Ian Fischer, Joshua V. Dillon, Rif A. Saurous, and [26] Xiao Qin, Lei Cao, Elke A. Rundensteiner, and Samuel Madden. 2019. Scalable
Kevin Murphy. 2017. An Information-Theoretic Analysis of Deep Latent-Variable Kernel Density Estimation-based Local Outlier Detection over Large Data Streams.
Models. CoRR abs/1711.00464 (2017). arXiv:1711.00464 http://arxiv.org/abs/1711. In Advances in Database Technology - 22nd International Conference on Extending
00464 Database Technology, EDBT 2019, Lisbon, Portugal, March 26-29, 2019. 421–432.
[6] Michele Basseville, Igor V Nikiforov, et al. 1993. Detection of abrupt changes: [27] Shreyas Ramakrishna, Zahra Rahiminasab, Gabor Karsai, Arvind Easwaran, and
theory and application. Vol. 104. prentice Hall Englewood Cliffs. Abhishek Dubey. 2021. Efficient Out-of-Distribution Detection Using Latent
[7] Adith Boloor, Karthik Garimella, Xin He, Christopher Gill, Yevgeniy Vorobey- Space of 𝛽 -VAE for Cyber-Physical Systems. arXiv preprint arXiv:2108.11800
chik, and Xuan Zhang. 2020. Attacking vision-based perception in end-to-end (2021).
autonomous driving models. Journal of Systems Architecture 110 (2020), 101766. [28] Erich Schubert and Peter J. Rousseeuw. 2018. Faster k-Medoids Clustering:
https://doi.org/10.1016/j.sysarc.2020.101766 Improving the PAM, CLARA, and CLARANS Algorithms. CoRR abs/1810.05691
[8] Dominique Brunet, Edward R. Vrscay, and Zhou Wang. 2012. On the Mathemat- (2018). arXiv:1810.05691 http://arxiv.org/abs/1810.05691
ical Properties of the Structural Similarity Index. IEEE Transactions on Image [29] B.W. Silverman. 2018. Density Estimation for Statistics and Data Analysis. 1–175
Processing 21, 4 (2012), 1488–1499. https://doi.org/10.1109/TIP.2011.2173206 pages. https://doi.org/10.1201/9781315140919
[9] Feiyang Cai and Xenofon Koutsoukos. 2020. Real-time out-of-distribution detec- [30] Karen Simonyan, Andrea Vedaldi, and Andrew Zisserman. 2014. Deep Inside
tion in learning-enabled cyber-physical systems. In 2020 ACM/IEEE 11th Interna- Convolutional Networks: Visualising Image Classification Models and Saliency
tional Conference on Cyber-Physical Systems (ICCPS). IEEE, 174–183. Maps. CoRR abs/1312.6034 (2014).
[10] Chaofan Chen, Oscar Li, Daniel Tao, Alina Barnett, Cynthia Rudin, and Jonathan [31] Vijaya Kumar Sundar, Shreyas Ramakrishna, Zahra Rahiminasab, Arvind
Su. 2019. This Looks Like That: Deep Learning for Interpretable Image Recog- Easwaran, and Abhishek Dubey. 2020. Out-of-distribution detection in multi-
nition. In NeurIPS 2019, 8-14 December 2019, Vancouver, BC, Canada, Hanna M. label datasets using latent space of 𝛽 -vae. In 2020 IEEE Security and Privacy
Wallach, Hugo Larochelle, Alina Beygelzimer, Florence d’Alché Buc, Edward A. Workshops (SPW). IEEE, 250–255.
Fox, and Roman Garnett (Eds.). 8928–8939. http://papers.nips.cc/paper/9095- [32] Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. 2020. Csi: Novelty
this-looks-like-that-deep-learning-for-interpretable-image-recognition detection via contrastive learning on distributionally shifted instances. arXiv
[11] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen preprint arXiv:2007.08176 (2020).
Koltun. 2017. CARLA: An Open Urban Driving Simulator. In Proceedings of the [33] Ashish Tiwari, Bruno Dutertre, Dejan Jovanović, Thomas de Candia, Patrick D
1st Annual Conference on Robot Learning. 1–16. Lincoln, John Rushby, Dorsa Sadigh, and Sanjit Seshia. 2014. Safety envelope
[12] Souradeep Dutta, Xin Chen, Susmit Jha, Sriram Sankaranarayanan, and Ashish for security. In Proceedings of the 3rd international conference on High confidence
Tiwari. 2019. Sherlock - A Tool for Verification of Neural Network Feedback networked systems. 85–94.
Systems: Demo Abstract. In Proceedings of the 22nd ACM International Conference [34] Vladimir Vovk, Ilia Nouretdinov, and Alexander Gammerman. 2003. Testing
on Hybrid Systems: Computation and Control (Montreal, Quebec, Canada) (HSCC exchangeability on-line. In Proceedings of the 20th International Conference on
’19). Association for Computing Machinery, New York, NY, USA, 262–263. https: Machine Learning (ICML-03). 768–775.
//doi.org/10.1145/3302504.3313351 [35] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli. 2004. Image Quality
[13] Yeli Feng, Daniel Jun Xian Ng, and Arvind Easwaran. 2021. Improving Varia- Assessment: From Error Visibility to Structural Similarity. IEEE Transactions on
tional Autoencoder based Out-of-Distribution Detection for Embedded Real-time Image Processing 13, 4 (April 2004), 600–612. https://doi.org/10.1109/TIP.2003.
Applications. ACM Transactions on Embedded Computing Systems (TECS) 20, 5s 819861
(2021), 1–26. [36] Ev Zisselman and Aviv Tamar. 2020. Deep residual flow for out of distribution
[14] K. Fukunaga and P.M. Narendra. 1975. A Branch and Bound Algorithm for detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and
Computing k-Nearest Neighbors. IEEE Trans. Comput. C-24, 7 (1975), 750–753. Pattern Recognition. 13994–14003.
https://doi.org/10.1109/T-C.1975.224297
[15] Francisco José Gisbert. 2003. Weighted samples, kernel density estimators and
convergence. Empirical Economics 28 (02 2003), 335–351. https://doi.org/10.1007/
s001810200134
234
A APPENDIX Night dataset: the OOD frame starts from the 10th frame. The
average length of night episodes is 123 frames.
A.1 Hyperparameter experiments
Figure 10: Out-of-distribution episode detection results for Figure 13: Example sequence in Night Dataset (we gradually
detecting OODs due to heavy rain increase the darkness parameter)
235