Fully Convolutional Cross-Scale-Flows For Image-Based Defect Detection
Abstract

In industrial manufacturing processes, errors frequently occur at unpredictable times and in unknown manifestations. We tackle the problem of automatic defect detection without requiring any image samples of defective parts. Recent works model the distribution of defect-free image data, using either strong statistical priors or overly simplified data representations. In contrast, our approach handles fine-grained representations incorporating the global and local image context while flexibly estimating the density. To this end, we propose a novel fully convolutional cross-scale normalizing flow (CS-Flow) that jointly processes multiple feature maps of different scales. Using normalizing flows to assign meaningful likelihoods to input samples allows for efficient defect detection on image level. Moreover, due to the preserved spatial arrangement, the latent space of the normalizing flow is interpretable, which enables localizing defective regions in the image. Our work sets a new state-of-the-art in image-level defect detection on the benchmark datasets Magnetic Tile Defects and MVTec AD, showing a 100% AUROC on 4 out of 15 classes.

Figure 1. Our method detects and localizes defects based on the density estimation of feature maps from the differently sized input images. We process the multi-scale feature maps jointly, using a fully convolutional normalizing flow with cross-connections between scales.

1. Introduction

During the industrial production of components, defects occur over time. They must be detected to ensure safety standards and product quality. Since manual inspection by humans is very costly and error-prone, reliable and efficient automatic defect detection is in high demand. In most real-world scenarios, however, there exist no examples of such defects. Moreover, even if a small set of known defects is available, new and formerly unseen types of defects occur at unpredictable times, which makes it impossible to apply standard classification approaches. Instead, it is inevitable to let the defect detector learn only from non-defective examples. This problem is commonly called semi-supervised anomaly detection (AD), novelty detection or one-class classification.

These terms describe the objective of deciding whether a data sample belongs to the class of the given set X of normal (in our case non-defective) data. The problem is interpreted in terms of whether a data sample lies out of the distribution pX of the set of normal images X, also named out-of-distribution (OOD) detection. It is assumed that defects X̄ are out-of-distribution, i.e. have a small likelihood given pX. We propose a method that models the distribution on feature level with a normalizing flow.

Most research [14, 37, 32] in the AD field focuses on image datasets with high intra-class and high inter-class variance. The setting in defect detection is different: Since the non-defective components are similar to themselves and to the defects, there is a small intra-class and a small inter-class variance. Hence, most AD approaches are not suitable for defect detection. Common approaches based on autoencoders [42, 12, 6, 15] or generative adversarial networks (GANs) [34, 1, 7] perform poorly in this setting, as described in detail in Section 2. Thus, recent works rely on density estimation of image features obtained from models pretrained on ImageNet [9], e.g. ResNet [17] or EfficientNet [38]. However, either information is lost due to the averaging of feature maps [31] or strong statistical priors are required, limiting their flexibility in density estimation [29, 8]. To alleviate these issues, we propose a normalizing flow (NF) that is able to process multi-scale feature maps to estimate their density, as shown in Figure 1. NFs are generative models that transform the training set distribution
pX to a latent space with a predefined distribution pZ via maximum-likelihood optimization. In contrast to other generative models, for instance VAEs [21] and GANs [16, 4], the likelihoods of latent space vectors in NFs are directly interpreted as likelihoods of the input data, since the network maps bijectively. Thus, the regions in the latent space with high likelihood represent the normal examples, while defective examples are projected to latent variables outside of the learned distribution. Conversely, the injective mapping of autoencoders potentially results in projecting untrained anomalies to indeterminate latent space regions, which may overlap with the regions of the normal samples.

Figure 2. Histogram of different features from MVTec AD images extracted with EfficientNet [38]. Each histogram contains the values from the same position of one feature map. The blue line shows the best fitting normal distribution. Assuming a normal distribution of the features, as done by [29, 8], appears to be insufficient to capture the feature distribution.

However, applying NFs to images for OOD detection is not straightforward, as shown by Kirichenko et al. [22]. With RGB data, the network fails to learn a useful distribution, focusing on local pixel correlations instead of semantics. For this reason, we perform the density estimation on feature maps obtained by pretrained feature extractors which provide compressed semantic information. Our cross-scale flow (CS-Flow) simultaneously processes the features of the image at different scales by propagating them in parallel through the NF while interacting with each other. Keeping in mind that the discriminability regarding defectiveness is unknown during training, our model utilizes the full potential of the information and correlations in both local and global contexts to learn the distribution precisely in order to identify defective examples. In addition to identification, the fully convolutional architecture also preserves the spatial arrangement, which allows for a visualization of the defective regions on the image. In contrast to models using densely connected layers and thus many parameters [31], our approach still achieves good performance even with a low number of training samples.

We summarize our contributions as follows:

• Our novel cross-scale normalizing flow (CS-Flow) detects defects by jointly estimating likelihoods on multi-scale feature maps.

• Our method maintains the image structure to obtain an interpretable latent space, which enables precise defect detection.

• We set a new state-of-the-art in image-level defect detection on the MVTec AD and Magnetic Tile Defects datasets.

• Code is available on GitHub1.

1 https://github.com/marco-rudolph/cs-flow

2. Related Work

In the following, we review previous work in the field of anomaly detection and normalizing flows as the basis of our methodology.

2.1. Anomaly Detection

State-of-the-art work can be roughly divided into approaches that are based on generative models or pretrained networks. Alternative methods that do not fall into one of these categories are described separately.

2.1.1 Generative Models

Many anomaly detection methods are based on generative models, such as autoencoders [24, 21, 30] and GANs [16], which are optimized to generate the normal data. These approaches detect anomalies by the inability of the generative model to reconstruct them. In the simplest case, the input and the reconstruction of an autoencoder are compared [42]. In this context, a high reconstruction error is interpreted as an indicator of an anomaly. Bergmann et al. [6] replace the common l2 error with SSIM to have a better metric for visual similarity. Gong et al. [15] use memory modules in the latent space to prevent the autoencoder from generalizing to anomalous data. Zhai et al. [41] combine energy-based models and regularized autoencoders to model the data distribution. Denoising autoencoders are used by Huang et al. [12] by letting autoencoders learn to restore transformed images.

Similar to the decoding part of autoencoders, generators of GANs are utilized for anomaly detection. Schlegl et al. [34] propose to learn an inverse generator after training a GAN, utilizing both together for reconstruction and the error consideration. A combination of autoencoders and GANs is proposed by Akcay et al. [1]. They apply the autoencoder directly as the GAN's generator to ensure the generation of normal data only.

As shown in Section 4.3, autoencoders and GANs perform poorly on defect detection tasks. Since different types of anomalies with individual size, shape and structure have inconsistent characteristics regarding reconstruction errors,
they are not widely applicable. For example, structures with high frequency cannot be represented and reconstructed accurately in general, and small defect areas cause smaller errors.

Figure 3. Architecture of one block inside the normalizing flow: After a fixed random permutation, every input tensor is split into two parts across the channel dimension, where each ensemble is used to estimate scale and shift parameters that transform the respective counterpart. Symbols ⊙ and ⊕ denote element-wise multiplication and addition, respectively.

2.1.2 Methods Based on Pretrained Networks

Instead of working on the image directly, many methods perform defect detection on features of pretrained networks. Pretraining on a large-scale database, such as ImageNet, ensures the extraction of universal features that are expected to differ in the presence of defects. In this way, discriminant features are considered which cannot be learned from non-defective data, since they do not necessarily occur in it. Detecting defects in the feature space is commonly done using traditional statistical approaches.

Andrews et al. [2] fit a one-class Support Vector Machine to the feature distribution. Rippel et al. [29] model the features as an unimodal Gaussian distribution and utilize the Mahalanobis distance as scoring function. This approach was further refined by Defard et al. [8] by applying it to image patches, utilizing feature maps at different semantic levels. However, these approaches are limited to normal distributions, which are inappropriate in many cases, as shown in Figure 2. In contrast, we do not assume any predefined feature distribution, but learn the true distribution via maximum likelihood estimation (MLE). Assuming that distances within the feature space are semantically expressive, the distance to the nearest neighbour is used as an anomaly score in [27]. The only deep-learning-based image feature density estimation method, proposed by Rudolph et al. [31] and the most comparable to our work, is also based on normalizing flows. However, they do not process full-sized feature maps, but rather vectors after applying average pooling. As a result, important contextual and positional information is lost. The authors partially compensate this weakness by passing 64 different rotations of each image through the network, which, however, significantly increases computational complexity. In contrast, our method utilizes the fine-grained information of the full-sized feature maps while requiring only a single pass and outperforms DifferNet [31] in almost all experiments by a large margin.

2.1.3 Other Approaches

Besides generative and pretrained models, there are alternative approaches to perform anomaly detection. Liznerski et al. [26] propose a learnable hypersphere classifier using exemplar outlier exposures as anomaly substitute. Contrastive learning on augmentations of the same image is used by Tack et al. [37] by defining in-distribution and out-of-distribution transformations. In contrast, Golan and El-Yaniv [14] augment images to classify the specific transformation, assuming that this does not work as clearly on anomalies as it does on normal data.

2.2. Normalizing Flows

A normalizing flow (NF) [28] is a generative model that transforms data into tractable distributions. Unlike conventional neural networks, their mapping is bijective, which allows them to train and evaluate in both directions [39]. The forward pass projects data into a latent space to calculate exact likelihoods for the data given the predefined latent distribution. Conversely, data sampled from the predefined distribution can be mapped back into the original space to generate data. Bijectivity and bidirectional execution are ensured by using invertible affine transformations. There are different types of normalizing flows, which differ in the architecture of the affine transformations in order to efficiently enable the forward or backward direction. The affine blocks are realized either by learning fixed or autoregressive transformations. A popular type of autoregressive flows is
MADE (Germain et al. [13]). The density calculation based on the Bayesian chain rule is efficient in this case. However, sampling is costly. In contrast, inverse autoregressive flows (Kingma et al. [20]) are usually efficient at sampling, but not at computing likelihoods. Real-NVP [11], a variant of inverse autoregressive flows, simplifies both passes to be efficient in both directions. We enhanced Real-NVP to operate on multiple scales that can interact with each other. This leverages NFs for defect detection by introducing fully convolutional cross-scale flows, whose architecture is explained in detail in Section 3.1.

Normalizing flows are successfully used for anomaly detection on non-image data [33, 35, 10]. With image data, the problem arises that the network mainly focuses on local pixel correlation without taking semantics into account. Recent works [31, 22] found that semantic information is better captured when working on image features instead of full images. In contrast to [22], we use features from multiple scales and refrain from the usage of fully connected layers and squeeze layers2. In this way, our latent space preserves the spatial arrangement and therefore enables precise defect localization. Furthermore, we lower the number of parameters, which enables us to process high-dimensional feature maps and train with few data samples.

2 Squeeze layers reshape the tensor, e.g. by aggregating the channels of 4 neighboring pixels to one pixel with fourfold channel number.

Figure 4. Architecture of the internal networks r inside the coupling blocks. Convolutions are performed at two levels, with cross-connections between scales at the second level. Feature map resizing is implemented by upsampling and strided convolutions. Aggregation is implemented by summation. The output is split across the channel dimension to obtain the scale and shift parameters. (Channel widths in the depicted configuration: 304 → 1024, 1024 → 608, 608 → 2 × 304.)

3. Method

To detect defects in images, we first learn a statistical model of features y ∈ Y of defect-free images x ∈ X, similar to DifferNet [31]. During inference, we assign a likelihood to the input image x by using a density estimation on image features y, assuming that a low likelihood is an indicator for a defect. The density estimation is learned via a bijective mapping of the unknown distribution pY of the feature space Y to a latent space Z with a Gaussian distribution pZ. Thus, as shown in Figure 1, our method is divided into the steps feature extraction X → Y and density estimation Y → Z.

From the input image x we extract the features y using a pretrained neural network ffe(x) = y, which remains unchanged during training. To have a more descriptive representation of x, feature maps of different scales are included in y by extracting features from s different resolutions of the image. In contrast to [31], our proposed NF architecture is able to perform density estimation on differently scaled full-sized feature maps in parallel instead of on concatenated feature vectors. Thus, important fine-grained positional and contextual information is maintained. We define y = [y(1), ..., y(s)] with y(i) as the 3D feature tensor of the image x(i) at scale i ∈ {1, ..., s}. Our proposed cross-scale flow fcsf transforms the feature tensors bijectively and in parallel to

$$f_{\mathrm{csf}}(y^{(1)}, \dots, y^{(s)}) = [z^{(1)}, \dots, z^{(s)}] = z \in Z \qquad (1)$$

with the same dimensionality3 as y. The likelihood pZ(z) is measured according to the target distribution, which in our case is a multivariate standard normal distribution N(0, I). We use the likelihood pZ(z) to decide whether x is anomalous according to a threshold θ:

$$\mathcal{A}(x) = \begin{cases} 1 & \text{for } p_Z(z) < \theta \\ 0 & \text{else.} \end{cases} \qquad (2)$$

3 For better readability, in the following z without any index represents a vector which is the concatenation of the flattened tensors [z(1), ..., z(s)].
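To make the two-stage pipeline concrete, the following is a minimal sketch of the feature extraction X → Y and the decision rule of Eq. 2, not the authors' reference implementation. The backbone choice, the truncation depth, and the use of the resolutions 768, 384 and 192 (the single-scale sizes that appear in the ablation later in this paper) are illustrative assumptions.

```python
# Minimal sketch of feature extraction and the decision rule of Eq. 2
# (our illustration, not the authors' reference code). Backbone choice,
# truncation depth and input resolutions are assumptions.
import torch
import torch.nn.functional as F
import torchvision

# frozen pretrained extractor f_fe; slicing .features is an assumed cut
backbone = torchvision.models.efficientnet_b5(weights="DEFAULT").features[:6]
backbone.eval()  # remains unchanged during training

@torch.no_grad()
def extract_features(x, scales=(768, 384, 192)):
    """X -> Y: one feature map y(i) per input resolution of the image x."""
    return [backbone(F.interpolate(x, size=(r, r), mode="bilinear",
                                   align_corners=False)) for r in scales]

def is_anomalous(z_list, theta):
    """Eq. 2: under N(0, I), p_Z(z) < theta is equivalent to the mean
    squared norm of the latent tensors exceeding a matching threshold."""
    score = torch.stack([z.pow(2).mean() for z in z_list]).mean()
    return (score > theta).item()
```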
3.1. Cross-Scale Flow

We extend traditional normalizing flows with our novel cross-scale flow to allow for effective defect detection on images. It processes feature maps of different sizes which interact with each other. In this way, information between the scales is shared to obtain a likelihood for the compound of y = [y(1), ..., y(s)]. Moreover, we design it fully convolutional and preserve the spatial dimensions. This allows determining the positions of the anomalies in Z, as shown in Section 3.3. An additional benefit of our approach compared to [31] is a practicable handling of very high-dimensional input spaces while having few training samples, as shown in Section 4.
The cross-scale flow is a chain of so-called coupling blocks, each performing affine transformations. As a basis for the frame architecture of the coupling block we chose Real-NVP [11]. The detailed structure of one block with s = 3 is shown in Figure 3. Inside, each input tensor y(i)_in is first randomly permuted and evenly split across its channel dimension into the two parts y(i)_in,1 and y(i)_in,2. These parts manipulate each other by regressing element-wise scale and shift parameters which are successively applied to their respective counterparts to obtain the output [y(i)_out,1, y(i)_out,2]. The scale and shift parameters are estimated by the coupling-block-individual subnetworks r1 and r2, whose output is split into [s1, t1] and [s2, t2] and is then used as follows:

$$y_{\mathrm{out},2} = y_{\mathrm{in},2} \odot e^{\gamma_1 s_1(y_{\mathrm{in},1})} + \gamma_1 t_1(y_{\mathrm{in},1})$$
$$y_{\mathrm{out},1} = y_{\mathrm{in},1} \odot e^{\gamma_2 s_2(y_{\mathrm{out},2})} + \gamma_2 t_2(y_{\mathrm{out},2}), \qquad (3)$$

with ⊙ as the element-wise product. To initialize the model in a stable way, we introduce the learnable block-individual scalar coefficients γ1 and γ2. They are initialized to 0 and thus cause yout = yin. The affinity property is preserved by having non-zero scaling coefficients with the exponentiation in Equation 3.
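As an illustration (our sketch, not the authors' code), the following single-scale simplification implements Eq. 3 together with the γ-initialization described above; in the full model, the subnetworks r1 and r2 receive the split halves of all s scales jointly (see Figure 4 and the sketch after Eq. 4). The inverse pass is included to show the bijectivity, although only the forward direction is needed for density estimation.

```python
import torch
import torch.nn as nn

class CouplingBlock(nn.Module):
    """Sketch of one affine coupling block (Eq. 3), single scale for
    brevity. `channels` must be even; `make_subnet(c_in, c_out)` builds
    a network regressing scale and shift (c_out = 2 * c_in here)."""
    def __init__(self, channels, make_subnet):
        super().__init__()
        perm = torch.randperm(channels)            # fixed random permutation
        self.register_buffer("perm", perm)
        self.register_buffer("inv_perm", torch.argsort(perm))
        self.r1 = make_subnet(channels // 2, channels)
        self.r2 = make_subnet(channels // 2, channels)
        self.gamma1 = nn.Parameter(torch.zeros(1))  # 0-init -> identity block
        self.gamma2 = nn.Parameter(torch.zeros(1))

    def forward(self, y):
        y = y[:, self.perm]                         # permute channels
        y1, y2 = y.chunk(2, dim=1)                  # even channel split
        s1, t1 = self.r1(y1).chunk(2, dim=1)
        y2 = y2 * torch.exp(self.gamma1 * s1) + self.gamma1 * t1
        s2, t2 = self.r2(y2).chunk(2, dim=1)
        y1 = y1 * torch.exp(self.gamma2 * s2) + self.gamma2 * t2
        # log|det J| of the element-wise affine maps (needed for Eq. 6)
        logdet = (self.gamma1 * s1).flatten(1).sum(1) + \
                 (self.gamma2 * s2).flatten(1).sum(1)
        return torch.cat([y1, y2], dim=1), logdet

    def inverse(self, z):
        z1, z2 = z.chunk(2, dim=1)
        s2, t2 = self.r2(z2).chunk(2, dim=1)
        z1 = (z1 - self.gamma2 * t2) * torch.exp(-self.gamma2 * s2)
        s1, t1 = self.r1(z1).chunk(2, dim=1)
        z2 = (z2 - self.gamma1 * t1) * torch.exp(-self.gamma1 * s1)
        return torch.cat([z1, z2], dim=1)[:, self.inv_perm]

# usage sketch with an assumed convolutional subnet:
# block = CouplingBlock(608, lambda ci, co: nn.Sequential(
#     nn.Conv2d(ci, 128, 3, padding=1), nn.LeakyReLU(),
#     nn.Conv2d(128, co, 3, padding=1)))
```

A full CS-Flow chains nblocks such blocks (cf. Figure 3) and accumulates their log-determinants for the objective in Section 3.2.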
The internal networks r1 and r2 do not need to be invertible and can be any differentiable function, which in our case is implemented as a fully convolutional network that regresses both components by splitting the output (see Figure 4 for details of the architecture). Features are processed with one hidden layer per scale on which the number of channels is increased. Motivated by HRNet [36], we adjust the size of individual feature maps of different scales by bilinear upsampling or strided convolutions before aggregation by summation.
We apply soft-clamping to the scale components s, as proposed by Ardizzone et al. [3], to preserve model stability in spite of the exponentiation. This clamping is applied as the last layer to the outputs s1 and s2 by the activation

$$\sigma_\alpha(h) = \frac{2\alpha}{\pi} \arctan\frac{h}{\alpha}. \qquad (4)$$

This prevents extreme scaling components by restricting the values to the interval (−α, α).
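The sketch below combines the two pieces just described: an internal network r in the spirit of Figure 4 for s = 3 scales whose spatial sizes differ by factors of two, with the soft-clamping of Eq. 4 as the last operation on the scale outputs. The default channel widths follow the annotation in Figure 4 (304 → 1024 → 608 = 2 × 304), while the exact wiring, kernel sizes and the value of α are our assumptions.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

def soft_clamp(h, alpha=3.0):
    """Eq. 4: sigma_alpha(h) = (2*alpha/pi) * arctan(h/alpha),
    restricting scale components to (-alpha, alpha). alpha is assumed."""
    return (2.0 * alpha / math.pi) * torch.atan(h / alpha)

class CrossScaleSubnet(nn.Module):
    """Sketch of an internal network r (Figure 4) for s = 3 scales:
    one hidden conv layer per scale; cross-connections via stride-2
    convolutions (downward) and bilinear upsampling (upward);
    aggregation by summation; output split into scale and shift."""
    def __init__(self, c_in=304, c_out=608, c_hid=1024):
        super().__init__()
        conv = lambda ci, co, s=1: nn.Conv2d(ci, co, 3, stride=s, padding=1)
        self.inp  = nn.ModuleList([conv(c_in, c_hid) for _ in range(3)])
        self.same = nn.ModuleList([conv(c_hid, c_out) for _ in range(3)])
        self.down = nn.ModuleList([conv(c_hid, c_out, s=2) for _ in range(2)])
        self.up   = nn.ModuleList([conv(c_hid, c_out) for _ in range(2)])

    def forward(self, ys):                      # ys: [fine, mid, coarse]
        h = [F.leaky_relu(c(y)) for c, y in zip(self.inp, ys)]
        out = [self.same[i](h[i]) for i in range(3)]
        out[1] = out[1] + self.down[0](h[0])    # fine  -> mid
        out[2] = out[2] + self.down[1](h[1])    # mid   -> coarse
        for i in (0, 1):                        # mid -> fine, coarse -> mid
            out[i] = out[i] + F.interpolate(self.up[i](h[i + 1]),
                                            size=out[i].shape[-2:],
                                            mode="bilinear",
                                            align_corners=False)
        # split each output across channels; clamp the scale part (Eq. 4)
        return [torch.cat([soft_clamp(s), t], dim=1)
                for s, t in (o.chunk(2, dim=1) for o in out)]
```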
3.2. Learning Objective

During training, we want the cross-scale flow fcsf to maximize the likelihoods of feature tensors pY(y), which we obtain by mapping them to the latent space Z where we model a well-defined density pZ. Using the change-of-variables formula in Eq. 5 and z = fcsf(y), this likelihood is defined by

$$p_Y(y) = p_Z(z) \left|\det \frac{\partial z}{\partial y}\right|. \qquad (5)$$

We optimize the log-likelihood, since it is equivalent and more convenient for a density pZ of a Gaussian distribution. Thus, we formulate our objective as the minimization of the negative log-likelihood −log pY(y):

$$\log p_Y(y) = \log p_Z(z) + \log \left|\det \frac{\partial z}{\partial y}\right|$$
$$\mathcal{L}(y) = -\log p_Y(y) = \frac{\lVert z \rVert_2^2}{2} - \log \left|\det \frac{\partial z}{\partial y}\right|, \qquad (6)$$

with |det ∂z/∂y| denoting the absolute determinant of the Jacobian. The logarithm of this term simplifies in our case to the sum of all values of s, since the Jacobian of the element-wise product operator in Equation 3 is a diagonal matrix. The training is conducted over a fixed number of epochs. To stabilize it further, we limit the l2-norm of the gradients to 1. Section 4.2 describes the training in more detail.
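A minimal sketch of this objective and of one training step follows. It assumes a flow that returns the latent tensors together with the accumulated log-determinant (as the coupling-block sketch above does per block); the optimizer choice is an assumption, while the gradient-norm limit of 1 follows the stabilization described above.

```python
import torch

def nll_loss(z_list, logdet):
    """Eq. 6: L(y) = ||z||^2 / 2 - log|det dz/dy|, averaged over a batch.
    z_list holds the latents [z(1), ..., z(s)]; logdet has shape (B,)."""
    zz = sum(z.flatten(1).pow(2).sum(1) for z in z_list)  # ||z||^2 per sample
    return (0.5 * zz - logdet).mean()

def train_step(flow, optimizer, y_list):
    """One maximum-likelihood step on a batch of multi-scale feature maps."""
    optimizer.zero_grad()
    z_list, logdet = flow(y_list)
    loss = nll_loss(z_list, logdet)
    loss.backward()
    # stabilization described in Section 3.2: clip gradient l2-norm to 1
    torch.nn.utils.clip_grad_norm_(flow.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()
```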
3.3. Localization

In previous work [31], the latent space of the normalizing flow has only been used such that all entries of z are considered to produce a score at the image level. Since our method processes the feature maps fully convolutionally, positional information is preserved. This allows for an interpretation of the output in terms of the likelihood of individual image regions, which in our application is the localization of the defect.

Analogous to the definition of the anomaly score of the entire image, we define an anomaly score for each local position (i, j) of the feature map y(s) by aggregating the values along the channel dimension with $\lVert z^{(s)}_{i,j}\rVert_2^2$. Thus, we can localize the defect by marking image regions with a high norm in the output feature tensors z(s).
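A small sketch of this localization step (our illustration): the channel-wise sum of squares of a latent tensor yields a spatial score map, which can be upsampled to image resolution for overlays such as those shown in Figure 7.

```python
import torch
import torch.nn.functional as F

def localization_map(z, image_size):
    """Aggregate a latent tensor z(s) of shape (B, C, H, W) along the
    channel dimension by the sum of squares (||z_ij||^2 per position),
    then upsample the map to image resolution for visualization."""
    score_map = z.pow(2).sum(dim=1, keepdim=True)
    return F.interpolate(score_map, size=image_size,
                         mode="bilinear", align_corners=False)
```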
4. Experiments

4.1. Datasets

We evaluate our method on a wide range of realistic defect detection scenarios to demonstrate the advantage of our contributions and the superiority over previous approaches. For this purpose, we measure the performance on the challenging and diverse MVTec AD [5] and Magnetic Tile Defects (MTD) [18] datasets.

MVTec AD comprises 10 object and 5 texture classes with overall 3629 defect-free training and 1725 testing images. Each class contains 60 to 320 high-resolution images with a range from 700 × 700 to 1024 × 1024 pixels. The test set includes defects of different sizes, shapes and types such as cracks, scratches and displacements, with up to 8 different defect types per class and 70 defect types in total. To the best of our knowledge, MVTec AD currently acts as the only dataset with multi-object and multi-defect data for anomaly detection.

As a common choice, we also evaluate on the MTD dataset, which includes gray-scale images of magnetic tiles.

| Category | ARNet [12] | Geom. [14] | GAN [1] | DSEBM [41] | Mahal. [29] | 1-NN [27] | DifferNet [31] | PaDiM [8] | CS-Flow (16 shots) | CS-Flow (full set) |
|---|---|---|---|---|---|---|---|---|---|---|
| Grid | 88.3 | 61.9 | 70.8 | 71.7 | 93.7 | 81.8 | 84.0 | - | 93.3 | 99.0 |
| Leather | 86.2 | 84.1 | 84.2 | 41.6 | 100 | 100 | 97.1 | - | 100 | 100 |
| Tile | 73.5 | 41.7 | 79.4 | 69.0 | 100 | 100 | 99.4 | - | 99.9 | 100 |
| Carpet | 70.6 | 43.7 | 69.9 | 41.3 | 99.6 | 98.5 | 92.9 | - | 100 | 100 |
| Wood | 92.3 | 61.1 | 83.4 | 95.2 | 99.3 | 95.8 | 99.8 | - | 99.5 | 100 |
| Avg. Textures | 82.2 | 59.6 | 77.5 | 63.8 | 98.5 | 96.1 | 94.6 | 99.0 | 98.5 | 99.8 |
| Bottle | 94.1 | 74.4 | 89.2 | 81.8 | 99.0 | 99.6 | 99.0 | - | 100 | 99.8 |
| Capsule | 68.1 | 67.0 | 73.2 | 59.4 | 96.3 | 89.4 | 86.9 | - | 83.1 | 97.1 |
| Pill | 78.6 | 63.0 | 74.3 | 80.6 | 91.4 | 79.9 | 88.8 | - | 90.9 | 98.6 |
| Transistor | 84.3 | 86.9 | 79.2 | 74.1 | 98.2 | 95.4 | 91.1 | - | 98.0 | 99.3 |
| Zipper | 87.6 | 82.0 | 74.5 | 58.4 | 98.8 | 97.1 | 95.1 | - | 95.3 | 99.7 |
| Cable | 83.2 | 78.3 | 75.7 | 68.5 | 99.1 | 95.1 | 95.9 | - | 94.4 | 99.1 |
| Hazelnut | 85.5 | 35.9 | 78.5 | 76.2 | 100 | 98.2 | 99.3 | - | 97.9 | 99.6 |
| Metal Nut | 66.7 | 81.3 | 70.0 | 67.9 | 97.4 | 91.1 | 96.1 | - | 99.1 | 99.1 |
| Screw | 100 | 50.0 | 74.6 | 99.9 | 94.5 | 91.4 | 96.3 | - | 65.2 | 97.6 |
| Toothbrush | 100 | 97.2 | 65.3 | 78.1 | 94.1 | 94.7 | 98.6 | - | 85.6 | 91.9 |
| Avg. Objects | 84.8 | 71.6 | 75.5 | 74.5 | 96.9 | 93.2 | 94.7 | 97.2 | 91.0 | 98.2 |
| Average | 83.9 | 67.2 | 76.2 | 70.9 | 97.5 | 93.9 | 94.7 | 97.9 | 93.5 | 98.7 |

Table 1. Area under ROC in % for detecting defects of all categories of MVTec AD [5] on image level, grouped into textures and objects. Best results are in bold. "16 shots" denotes that a subset of only 16 random images per category was used in training. Besides the average value, detailed results of PaDiM [8] were not provided by the authors.
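For reference, the image-level AUROC reported in Table 1 is computed from raw anomaly scores and binary labels without fixing the threshold θ. A minimal sketch, with scikit-learn as our assumed library choice:

```python
from sklearn.metrics import roc_auc_score

def image_level_auroc(labels, scores):
    """AUROC in % as reported in Table 1. labels: 1 = defective,
    0 = defect-free; scores: per-image anomaly scores, e.g. the mean
    of ||z||^2 over all latent positions and scales."""
    return 100.0 * roc_auc_score(labels, scores)
```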
[Figure: ROC curves; only the y-axis label "True Positive Rate" is recoverable from the extraction.]

[Ablation table; caption not recovered from the extraction:]

| Method | AUROC [%] ↑ |
|---|---|
| single-scale NF (768 × 768) | 97.8 |
| single-scale NF (384 × 384) | 96.8 |
| single-scale NF (192 × 192) | 96.1 |
| separate multi-scale | 98.2 |
[Figure 7 panel categories: bottle, capsule, carpet, hazelnut, leather, cable, screw, wood]

Figure 7. Defect localization of one defective example per category of MVTec AD and MTD. The rows each show the original image, the localization and the overlay of both images, from top to bottom. The localization maps show the sum of squares along the channel dimension of the network's output at the highest scale.
References

[1] Samet Akcay, Amir Atapour-Abarghouei, and Toby P. Breckon. Ganomaly: Semi-supervised anomaly detection via adversarial training. In Computer Vision – ACCV 2018, pages 622–637, Cham, 2019. Springer International Publishing.
[2] Jerone Andrews, Thomas Tanay, Edward Morton, and Lewis Griffin. Transfer representation-learning for anomaly detection. In NeurIPS, 2019.
[3] Lynton Ardizzone, Carsten Lüth, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Guided image generation with conditional invertible neural networks. arXiv preprint arXiv:1907.02392, 2019.
[4] Maren Awiszus, Frederik Schubert, and Bodo Rosenhahn. Toad-gan: Coherent style level generation from a single example. In Proceedings of the AAAI Conference on Artificial Intelligence and Interactive Digital Entertainment, volume 16, pages 10–16, 2020.
[5] Paul Bergmann, Michael Fauser, David Sattlegger, and Carsten Steger. MVTec AD – a comprehensive real-world dataset for unsupervised anomaly detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 9592–9600, 2019.
[6] Paul Bergmann, Sindy Löwe, Michael Fauser, David Sattlegger, and Carsten Steger. Improving unsupervised defect segmentation by applying structural similarity to autoencoders. In VISIGRAPP, 2019.
[7] Haoqing Cheng, Heng Liu, Fei Gao, and Zhuo Chen. Adgan: A scalable gan-based architecture for image anomaly detection. In 2020 IEEE 4th Information Technology, Networking, Electronic and Automation Control Conference (ITNEC), volume 1, pages 987–993. IEEE, 2020.
[8] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier. Padim: A patch distribution modeling framework for anomaly detection and localization. In Pattern Recognition, ICPR International Workshops and Challenges, 2021.
[9] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
[10] Madson L. D. Dias, César Lincoln C. Mattos, Ticiana L. C. da Silva, José Antônio F. de Macedo, and Wellington C. P. Silva. Anomaly detection in trajectory data with normalizing flows. arXiv preprint arXiv:2004.05958, 2020.
[11] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In ICLR, 2017.
[12] Ye Fei, Chaoqin Huang, Cao Jinkun, Maosen Li, Ya Zhang, and Cewu Lu. Attribute restoration framework for anomaly detection. IEEE Transactions on Multimedia, 2020.
[13] Mathieu Germain, Karol Gregor, Iain Murray, and Hugo Larochelle. MADE: Masked autoencoder for distribution estimation. In International Conference on Machine Learning, pages 881–889, 2015.
[14] Izhak Golan and Ran El-Yaniv. Deep anomaly detection using geometric transformations. In Advances in Neural Information Processing Systems, pages 9758–9769, 2018.
[15] Dong Gong, Lingqiao Liu, Vuong Le, Budhaditya Saha, Moussa Reda Mansour, Svetha Venkatesh, and Anton van den Hengel. Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection. In Proceedings of the IEEE International Conference on Computer Vision, pages 1705–1714, 2019.
[16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.
[18] Yibin Huang, Congying Qiu, and Kui Yuan. Surface defect saliency of magnetic tile. The Visual Computer, 36(1):85–96, 2020.
[19] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.
[20] Durk P. Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pages 4743–4751, 2016.
[21] Diederik P. Kingma and Max Welling. Auto-encoding variational bayes. CoRR, abs/1312.6114, 2013.
[22] Polina Kirichenko, Pavel Izmailov, and Andrew Gordon Wilson. Why normalizing flows fail to detect out-of-distribution data. In NeurIPS, 2020.
[23] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.
[24] Y. LeCun. Generalization and network design strategies. In R. Pfeifer, Z. Schreter, F. Fogelman, and L. Steels, editors, Connectionism in Perspective, Zurich, Switzerland, 1989. Elsevier. An extended version was published as a technical report of the University of Toronto.
[25] Wentong Liao, Bodo Rosenhahn, and Michael Ying Yang. Gaussian process for activity modeling and anomaly detection. In International Society for Photogrammetry and Remote Sensing ISA Workshop, La Grande Motte, France, Sept. 2015.
[26] Philipp Liznerski, Lukas Ruff, Robert A. Vandermeulen, Billy Joe Franks, Marius Kloft, and Klaus-Robert Müller. Explainable deep one-class classification. In ICLR, 2021.
[27] Tiago Nazare, Rodrigo de Mello, and Moacir Ponti. Are pre-trained cnns good feature extractors for anomaly detection in surveillance videos? arXiv preprint arXiv:1811.08495, 2018.
[28] Danilo Rezende and Shakir Mohamed. Variational inference with normalizing flows. In International Conference on Machine Learning, pages 1530–1538. PMLR, 2015.
[29] Oliver Rippel, Patrick Mertens, and Dorit Merhof. Modeling the distribution of normal data in pre-trained deep features for anomaly detection. arXiv preprint arXiv:2005.14140, 2020.
[30] Marco Rudolph, Bastian Wandt, and Bodo Rosenhahn. Structuring autoencoders. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[31] Marco Rudolph, Bastian Wandt, and Bodo Rosenhahn. Same same but differnet: Semi-supervised defect detection with normalizing flows. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1907–1916, 2021.
[32] Lukas Ruff, Robert Vandermeulen, Nico Goernitz, Lucas Deecke, Shoaib Ahmed Siddiqui, Alexander Binder, Emmanuel Müller, and Marius Kloft. Deep one-class classification. In International Conference on Machine Learning, pages 4393–4402. PMLR, 2018.
[33] Artem Ryzhikov, Maxim Borisyak, Andrey Ustyuzhanin, and Denis Derkach. Normalizing flows for deep anomaly detection. arXiv preprint arXiv:1912.09323, 2019.
[34] Thomas Schlegl, Philipp Seeböck, Sebastian M. Waldstein, Georg Langs, and Ursula Schmidt-Erfurth. f-AnoGAN: Fast unsupervised anomaly detection with generative adversarial networks. Medical Image Analysis, 54:30–44, 2019.
[35] Maximilian Schmidt and Marko Simic. Normalizing flows for novelty detection in industrial time series data. arXiv preprint arXiv:1906.06904, 2019.
[36] Ke Sun, Bin Xiao, Dong Liu, and Jingdong Wang. Deep high-resolution representation learning for human pose estimation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5693–5703, 2019.
[37] Jihoon Tack, Sangwoo Mo, Jongheon Jeong, and Jinwoo Shin. CSI: Novelty detection via contrastive learning on distributionally shifted instances. In NeurIPS, 2020.
[38] Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pages 6105–6114. PMLR, 2019.
[39] Tom Wehrbein, Marco Rudolph, Bodo Rosenhahn, and Bastian Wandt. Probabilistic monocular 3d human pose estimation with normalizing flows. In Proceedings of the IEEE International Conference on Computer Vision, 2021.
[40] Michael Ying Yang, Wentong Liao, Yanpeng Cao, and Bodo Rosenhahn. Video event recognition and anomaly detection by combining gaussian process and hierarchical dirichlet process models. In Photogrammetric Engineering & Remote Sensing, 2018.
[41] Shuangfei Zhai, Yu Cheng, Weining Lu, and Zhongfei Zhang. Deep structured energy based models for anomaly detection. In Proceedings of the 33rd International Conference on Machine Learning, pages 1100–1109, 2016.
[42] Chong Zhou and Randy C. Paffenroth. Anomaly detection with robust deep autoencoders. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 665–674, 2017.