Object Detection Survey
Abstract—Object detection, as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. Its development in the past two decades can be regarded as an epitome of computer vision history. If we think of today's object detection as a technical aesthetics under the power of deep learning, then turning back the clock 20 years we would witness the wisdom of the cold weapon era. This paper extensively reviews 400+ papers of object detection in the light of its technical evolution, spanning over a quarter-century's time (from the 1990s to 2019). A number of topics have been covered in this paper, including the milestone detectors in history, detection datasets, metrics, fundamental building blocks of the detection system, speed-up techniques, and the recent state-of-the-art detection methods. This paper also reviews some important detection applications, such as pedestrian detection, face detection, text detection, etc., and makes an in-depth analysis of their challenges as well as technical improvements in recent years.
Index Terms—Object detection, Computer vision, Deep learning, Convolutional neural networks, Technical evolution.
1 INTRODUCTION
Fig. 2. A road map of object detection. Milestone detectors in this figure: VJ Det. [10, 11], HOG Det. [12], DPM [13–15], RCNN [16], SPPNet [17],
Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], Pyramid Networks [22], Retina-Net [23].
regression", etc. However, previous reviews lack fundamental analysis to help readers understand the nature of these sophisticated techniques, e.g., "Where did they come from and how did they evolve?" and "What are the pros and cons of each group of methods?" This paper makes an in-depth analysis for readers of the above concerns.

3. A comprehensive analysis of detection speed-up techniques: The acceleration of object detection has long been a crucial but challenging task. This paper makes an extensive review of the speed-up techniques developed over 20 years of object detection history at multiple levels, including the "detection pipeline" (e.g., cascaded detection, feature map shared computation), the "detection backbone" (e.g., network compression, lightweight network design), and "numerical computation" (e.g., integral image, vector quantization). This topic is rarely covered by previous reviews.

• Difficulties and Challenges in Object Detection

Although people frequently ask "what are the difficulties and challenges in object detection?", this question is not easy to answer and may even be over-generalized. As different detection tasks have totally different objectives and constraints, their difficulties may vary from each other. In addition to some challenges common to other computer vision tasks, such as objects under different viewpoints, illuminations, and intra-class variations, the challenges in object detection include, but are not limited to, the following aspects: object rotation and scale changes (e.g., small objects), accurate object localization, dense and occluded object detection, and speed-up of detection. In Sections 4 and 5, we will give a more detailed analysis of these topics.

The rest of this paper is organized as follows. In Section 2, we review the 20 years' evolutionary history of object detection. Speed-up techniques in object detection are introduced in Section 3. State-of-the-art detection methods of the recent three years are summarized in Section 4. Some important detection applications are reviewed in Section 5. In Section 6, we conclude this paper and analyze further research directions.

2 OBJECT DETECTION IN 20 YEARS

In this section, we will review the history of object detection in multiple aspects, including milestone detectors, object detection datasets, metrics, and the evolution of key techniques.

2.1 A Road Map of Object Detection

In the past two decades, it is widely accepted that the progress of object detection has generally gone through two historical periods: the "traditional object detection period (before 2014)" and the "deep learning based detection period (after 2014)", as shown in Fig. 2.

2.1.1 Milestones: Traditional Detectors

If we think of today's object detection as a technical aesthetics under the power of deep learning, then turning back the clock 20 years we would witness "the wisdom of the cold weapon era". Most of the early object detection algorithms were built based on handcrafted features. Due to the lack of effective image representations at that time, people had no choice but to design sophisticated feature representations, along with a variety of speed-up skills to exhaust the usage of limited computing resources.

• Viola-Jones Detectors

18 years ago, P. Viola and M. Jones achieved real-time detection of human faces for the first time without any constraints (e.g., skin color segmentation) [10, 11]. Running
on a 700MHz Pentium III CPU, the detector was tens or even hundreds of times faster than any other algorithm of its time under comparable detection accuracy. The detection algorithm, which was later referred to as the "Viola-Jones (VJ) detector", was named after the authors in memory of their significant contributions.

The VJ detector follows the most straightforward way of detection, i.e., sliding windows: go through all possible locations and scales in an image to see if any window contains a human face. Although it seems to be a very simple process, the computation behind it was far beyond the computer's power of its time. The VJ detector dramatically improved its detection speed by incorporating three important techniques: "integral image", "feature selection", and "detection cascades".

1) Integral image: The integral image is a computational method to speed up box filtering or convolution processes. Like other object detection algorithms of its time [29–31], the Haar wavelet is used in the VJ detector as the feature representation of an image. The integral image makes the computational complexity of each window in the VJ detector independent of its window size.

2) Feature selection: Instead of using a set of manually selected Haar basis filters, the authors used the Adaboost algorithm [32] to select the small set of features that are most helpful for face detection from a huge pool of random features (about 180k-dimensional).

3) Detection cascades: A multi-stage detection paradigm (a.k.a. the "detection cascades") was introduced in the VJ detector to reduce its computational overhead by spending less computation on background windows and more on face targets.

The DPM follows the detection philosophy of "divide and conquer", where the training can be simply considered as learning a proper way of decomposing an object, and the inference can be considered as an ensemble of detections on different object parts. For example, the problem of detecting a "car" can be considered as the detection of its window, body, and wheels. This part of the work, a.k.a. the "star-model", was completed by P. Felzenszwalb et al. [13]. Later on, R. Girshick further extended the star-model to the "mixture models" [14, 15, 37, 38] to deal with objects in the real world under more significant variations.

A typical DPM detector consists of a root-filter and a number of part-filters. Instead of manually specifying the configurations of the part filters (e.g., size and location), a weakly supervised learning method is developed in DPM where all configurations of part filters can be learned automatically as latent variables. R. Girshick further formulated this process as a special case of Multi-Instance learning [39], and some other important techniques such as "hard negative mining", "bounding box regression", and "context priming" are also applied for improving detection accuracy (to be introduced in Section 2.3). To speed up the detection, Girshick developed a technique for "compiling" detection models into a much faster one that implements a cascade architecture, which achieved over 10 times acceleration without sacrificing any accuracy [14, 38].

Although today's object detectors have far surpassed DPM in terms of detection accuracy, many of them are still deeply influenced by its valuable insights, e.g., mixture models, hard negative mining, bounding box regression, etc. In 2010, P. Felzenszwalb and R. Girshick were awarded the "lifetime achievement" prize by PASCAL VOC.
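Both the VJ detector's detection cascades and the "compiled" cascade DPM mentioned above rest on the same early-rejection idea: most background windows are dismissed by cheap early stages, so expensive computation is reserved for promising windows. The following is a minimal, dependency-light sketch of this idea, not the original Haar/AdaBoost implementation; the stage classifiers and thresholds are hypothetical placeholders.

# Schematic sketch of cascaded sliding-window detection (early rejection).
# The stage classifiers below are hypothetical placeholders.
import numpy as np

def sliding_windows(h, w, win=24, stride=4, scale=1.25):
    """Yield (x, y, size) for all positions and scales of a square window."""
    size = win
    while size <= min(h, w):
        for y in range(0, h - size + 1, stride):
            for x in range(0, w - size + 1, stride):
                yield x, y, size
        size = int(size * scale)

def cascade_detect(image, stages):
    """stages: list of (score_fn, threshold); a window must pass every stage."""
    detections = []
    for x, y, s in sliding_windows(*image.shape[:2]):
        patch = image[y:y + s, x:x + s]
        for score_fn, thr in stages:          # early rejection of background
            if score_fn(patch) < thr:
                break
        else:                                  # survived all stages
            detections.append((x, y, s))
    return detections

# toy usage with two dummy stages of increasing cost
img = np.random.rand(96, 96)
stages = [(lambda p: p.mean(), 0.45),          # very cheap first stage
          (lambda p: p.std(), 0.25)]           # slightly more expensive stage
print(len(cascade_detect(img, stages)), "windows survived the cascade")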
RCNN yields a significant performance boost on VOC07, with a large improvement of mean Average Precision (mAP) from 33.7% (DPM-v5 [43]) to 58.5%.

Although RCNN has made great progress, its drawbacks are obvious: the redundant feature computations on a large number of overlapped proposals (over 2000 boxes from one image) lead to an extremely slow detection speed (14s per image with a GPU). Later in the same year, SPPNet [17] was proposed and overcame this problem.

• SPPNet

In 2014, K. He et al. proposed Spatial Pyramid Pooling Networks (SPPNet) [17]. Previous CNN models required a fixed-size input, e.g., a 224x224 image for AlexNet [40]. The main contribution of SPPNet is the introduction of a Spatial Pyramid Pooling (SPP) layer, which enables a CNN to generate a fixed-length representation regardless of the size of the image/region of interest without rescaling it. When using SPPNet for object detection, the feature maps can be computed from the entire image only once, and then fixed-length representations of arbitrary regions can be generated for training the detectors, which avoids repeatedly computing the convolutional features. SPPNet is more than 20 times faster than R-CNN without sacrificing any detection accuracy (VOC07 mAP=59.2%).

Although SPPNet effectively improved the detection speed, there are still some drawbacks: first, the training is still multi-stage; second, SPPNet only fine-tunes its fully connected layers while simply ignoring all previous layers. Later in the next year, Fast RCNN [18] was proposed and solved these problems.

• Fast RCNN

In 2015, R. Girshick proposed the Fast RCNN detector [18], which is a further improvement of R-CNN and SPPNet [16, 17]. Fast RCNN enables us to simultaneously train a detector and a bounding box regressor under the same network configuration. On the VOC07 dataset, Fast RCNN increased the mAP from 58.5% (RCNN) to 70.0%, with a detection speed over 200 times faster than R-CNN.

Although Fast-RCNN successfully integrates the advantages of R-CNN and SPPNet, its detection speed is still limited by the proposal detection (see Section 2.3.2 for more details). Then, a question naturally arises: "can we generate object proposals with a CNN model?" Later, Faster R-CNN [19] answered this question.

• Faster RCNN

In 2015, S. Ren et al. proposed the Faster RCNN detector [19, 44] shortly after Fast RCNN. Faster RCNN is the first end-to-end, and the first near-realtime, deep learning detector (COCO mAP@.5=42.7%, COCO mAP@[.5,.95]=21.9%, VOC07 mAP=73.2%, VOC12 mAP=70.4%, 17fps with ZF-Net [45]). The main contribution of Faster-RCNN is the introduction of the Region Proposal Network (RPN), which enables nearly cost-free region proposals. From R-CNN to Faster RCNN, most individual blocks of an object detection system, e.g., proposal detection, feature extraction, bounding box regression, etc., have been gradually integrated into a unified, end-to-end learning framework.

Although Faster RCNN breaks through the speed bottleneck of Fast RCNN, there is still computation redundancy at the subsequent detection stage. Later, a variety of improvements have been proposed, including RFCN [46] and Light-head RCNN [47]. (See more details in Section 3.)

• Feature Pyramid Networks

In 2017, T.-Y. Lin et al. proposed Feature Pyramid Networks (FPN) [22] on the basis of Faster RCNN. Before FPN, most deep learning based detectors ran detection only on a network's top layer. Although the features in the deeper layers of a CNN are beneficial for category recognition, they are not conducive to localizing objects. To this end, a top-down architecture with lateral connections is developed in FPN for building high-level semantics at all scales. Since a CNN naturally forms a feature pyramid through its forward propagation, FPN shows great advances for detecting objects with a wide variety of scales. Using FPN in a basic Faster R-CNN system, it achieves state-of-the-art single-model detection results on the MSCOCO dataset without bells and whistles (COCO mAP@.5=59.1%, COCO mAP@[.5, .95]=36.2%). FPN has now become a basic building block of many of the latest detectors.

2.1.3 Milestones: CNN based One-stage Detectors

• You Only Look Once (YOLO)

YOLO was proposed by R. Joseph et al. in 2015. It was the first one-stage detector in the deep learning era [20]. YOLO is extremely fast: a fast version of YOLO runs at 155fps with VOC07 mAP=52.7%, while its enhanced version runs at 45fps with VOC07 mAP=63.4% and VOC12 mAP=57.9%. YOLO is the abbreviation of "You Only Look Once". It can be seen from its name that the authors have completely abandoned the previous detection paradigm of "proposal detection + verification". Instead, it follows a totally different philosophy: to apply a single neural network to the full image. This network divides the image into regions and predicts bounding boxes and probabilities for each region simultaneously. Later, R. Joseph made a series of improvements on the basis of YOLO and proposed its v2 and v3 editions [48, 49], which further improve the detection accuracy while keeping a very high detection speed.

In spite of its great improvement of detection speed, YOLO suffers from a drop in localization accuracy compared with two-stage detectors, especially for some small objects. YOLO's subsequent versions [48, 49] and the later proposed SSD [21] have paid more attention to this problem.

• Single Shot MultiBox Detector (SSD)

SSD [21] was proposed by W. Liu et al. in 2015. It was the second one-stage detector in the deep learning era. The main contribution of SSD is the introduction of the multi-reference and multi-resolution detection techniques (to be introduced in Section 2.3.2), which significantly improve the detection accuracy of a one-stage detector, especially for some small objects. SSD has advantages in terms of both detection speed and accuracy (VOC07 mAP=76.8%, VOC12 mAP=74.9%, COCO mAP@.5=46.5%, mAP@[.5,.95]=26.8%, a fast version runs at 59fps). The main difference between SSD and any previous detector is that SSD detects objects of different scales on different layers of the network, whereas the previous detectors only run detection on their top layers.
Fig. 4. Some example images and annotations in (a) PASCAL-VOC07, (b) ILSVRC, (c) MS-COCO, and (d) Open Images.
Open Images: 1) the standard object detection, and 2) the visual relationship detection, which detects paired objects in particular relations. For the object detection task, the dataset consists of 1,910k images with 15,440k annotated bounding boxes on 600 object categories.

• Datasets of Other Detection Tasks

In addition to general object detection, the past 20 years have also witnessed the prosperity of detection applications in specific areas, such as pedestrian detection, face detection, text detection, traffic sign/light detection, and remote sensing target detection. Tables 2-6 list some of the popular datasets of these detection tasks⁵. A detailed introduction of the detection methods of these tasks can be found in Section 5.

5. The #Cites shows statistics as of Feb. 2019.

2.2.1 Metrics

How can we evaluate the effectiveness of an object detector? This question may even have different answers at different times.

In the early time's detection community, there were no widely accepted evaluation criteria for detection performance. For example, in the early research of pedestrian detection [12], the "miss rate vs. false positives per-window (FPPW)" was usually used as a metric. However, the per-window measurement (FPPW) can be flawed and fails to predict full image performance in certain cases [59]. In 2009, the Caltech pedestrian detection benchmark was created [59, 60] and since then, the evaluation metric has changed from per-window (FPPW) to false positives per-image (FPPI).

In recent years, the most frequently used evaluation for object detection is "Average Precision (AP)", which was originally introduced in VOC2007. AP is defined as the average detection precision under different recalls, and is usually evaluated in a category-specific manner. To compare performance over all object categories, the mean AP (mAP) averaged over all object categories is usually used as the final metric of performance. To measure object localization accuracy, the Intersection over Union (IoU) is used to check whether the IoU between the predicted box and the ground truth box is greater than a predefined threshold, say, 0.5. If yes, the object is identified as "successfully detected", otherwise it is identified as "missed". The 0.5-IoU based mAP has then become the de facto metric for object detection problems for years.

After 2014, due to the popularity of the MS-COCO dataset, researchers started to pay more attention to the accuracy of the bounding box location. Instead of using a fixed IoU threshold, MS-COCO AP is averaged over multiple IoU thresholds between 0.5 (coarse localization) and 0.95 (perfect localization). This change of the metric has encouraged more accurate object localization and may be of great importance for some real-world applications (e.g., imagine there is a
robot arm trying to grasp a spanner).

Recently, there have been some further developments of the evaluation in the Open Images dataset, e.g., by considering the group-of boxes and the non-exhaustive image-level category hierarchies. Some researchers have also proposed alternative metrics, e.g., "localization recall precision" [94]. Despite the recent changes, the VOC/COCO-based mAP is still the most frequently used evaluation metric for object detection.

2.3 Technical Evolution in Object Detection

In this section, we will introduce some important building blocks of a detection system and their technical evolution in the past 20 years.

2.3.1 Early Time's Dark Knowledge

Early time object detection (before 2000) did not follow a unified detection philosophy like sliding window detection. Detectors at that time were usually designed based on low-level and mid-level vision as follows.

• Components, shapes and edges

"Recognition-by-components", as an important cognitive theory [98], has long been the core idea of image recognition and object detection [13, 99, 100]. Some early researchers framed object detection as a measurement of similarity between object components, shapes and contours, including Distance Transforms [101], Shape Contexts [35], and Edgelet [102], etc. Despite promising initial results, things did not work out well on more complicated
detection problems. Therefore, machine learning based detection methods began to prosper.

Machine learning based detection has gone through multiple periods, including the statistical models of appearance (before 1998), wavelet feature representations (1998-2005), and gradient-based representations (2005-2012).

Building statistical models of an object, like Eigenfaces [95, 106] as shown in Fig 5 (a), was the first wave of learning based approaches in object detection history. In 1991, M. Turk et al. achieved real-time face detection in a lab environment by using Eigenface decomposition [95]. Compared with the rule-based or template-based approaches of its time [107, 108], a statistical model better provides holistic descriptions of an object's appearance by learning task-specific knowledge from data.

Wavelet feature transforms started to dominate visual recognition and object detection after 2000. The essence of this group of methods is learning by transforming an image from pixels to a set of wavelet coefficients. Among these methods, the Haar wavelet, owing to its high computational efficiency, has been widely used in many object detection tasks, such as general object detection [29], face detection [10, 11, 109], pedestrian detection [30, 31], etc. Fig 5 (d) shows a set of Haar wavelet basis functions learned by a VJ detector [10, 11] for human faces.

• Early time's CNN for object detection

The history of using CNNs to detect objects can be
traced back to the 1990s [96], when Y. LeCun et al. made great contributions. Due to limitations in computing resources, CNN models at the time were much smaller and shallower than those of today. Despite this, computational efficiency was still considered one of the tough nuts to crack in early CNN based detection models. Y. LeCun et al. made a series of improvements like the "shared-weight replicated neural network" [96] and the "space displacement network" [97] to reduce the computations by extending each layer of the convolutional network so as to cover the entire input image, as shown in Fig. 5 (b)-(c). In this way, the feature of any location in the entire image can be extracted with only a single forward propagation of the network. This can be considered as the prototype of today's fully convolutional networks (FCN) [110, 111], which were proposed almost 20 years later. CNNs were also applied to other tasks such as face detection [112, 113] and hand tracking [114] at that time.

2.3.2 Technical Evolution of Multi-Scale Detection

Multi-scale detection of objects with "different sizes" and "different aspect ratios" is one of the main technical challenges in object detection. In the past 20 years, multi-scale detection has gone through multiple historical periods: "feature pyramids and sliding windows (before 2014)", "detection with object proposals (2010-2015)", "deep regression (2013-2016)", "multi-reference detection (after 2015)", and "multi-resolution detection (after 2016)", as shown in Fig. 6.

• Feature pyramids + sliding windows (before 2014)

With the increase of computing power after the VJ detector, researchers started to pay more attention to an intuitive way of detection by building "feature pyramid + sliding windows". From 2004 to 2014, a number of milestone detectors were built based on this detection paradigm, including the HOG detector, DPM, and even the Overfeat detector [103] of the deep learning era (winner of the ILSVRC-13 localization task).

Early detection models like the VJ detector and the HOG detector were specifically designed to detect objects with a "fixed aspect ratio" (e.g., faces and upright pedestrians) by simply building the feature pyramid and sliding a fixed-size detection window on it. The detection of "various aspect ratios" was not considered at that time. To detect objects with a more complex appearance, like those in PASCAL VOC, R. Girshick et al. began to seek better solutions outside the feature pyramid. The "mixture model" [15] was one of the best solutions at that time, training multiple models to detect objects with different aspect ratios. Apart from this, exemplar-based detection [36, 115] provided another solution by training individual models for every object instance (exemplar) of the training set.

As objects in modern datasets (e.g., MS-COCO) become more diversified, the mixture model or exemplar-based methods inevitably lead to more miscellaneous detection models. A question then naturally arises: is there a unified multi-scale approach to detect objects of different aspect ratios? The introduction of "object proposals" (to be introduced) answered this question.
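As a concrete reference for the "feature pyramid + sliding windows" paradigm described above, here is a minimal, dependency-light sketch: a fixed-size window is slid over progressively down-scaled copies of the image, so that large objects are caught at coarse scales and small ones at fine scales. The window scorer is a hypothetical placeholder, not a HOG or DPM model.

# Minimal sketch of multi-scale detection with an image pyramid and a
# fixed-size sliding window; `score_window` is a hypothetical placeholder.
import numpy as np

def build_pyramid(image, scale=1.2, min_size=32):
    level, pyramid = image, []
    while min(level.shape[:2]) >= min_size:
        pyramid.append(level)
        new_h = int(level.shape[0] / scale)
        new_w = int(level.shape[1] / scale)
        # nearest-neighbour down-sampling keeps the sketch dependency-free
        ys = (np.arange(new_h) * scale).astype(int)
        xs = (np.arange(new_w) * scale).astype(int)
        level = level[ys][:, xs]
    return pyramid

def detect_multiscale(image, score_window, win=32, stride=8, thr=0.5):
    detections = []
    for im in build_pyramid(image):
        factor = image.shape[0] / im.shape[0]   # map back to the original scale
        for y in range(0, im.shape[0] - win + 1, stride):
            for x in range(0, im.shape[1] - win + 1, stride):
                if score_window(im[y:y + win, x:x + win]) > thr:
                    detections.append((int(x * factor), int(y * factor),
                                       int(win * factor)))
    return detections

img = np.random.rand(160, 160)
boxes = detect_multiscale(img, score_window=lambda p: p.mean())
print(len(boxes), "candidate boxes across all pyramid levels")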
Fig. 6. Evolution of multi-scale detection techniques in object detection from 2001 to 2019: 1) feature pyramids and sliding windows, 2) detection
with object proposals, 3) deep regression, 4) multi-reference detection, and 5) multi-resolution detection. Detectors in this figure: VJ Det. [10], HOG
Det. [12], DPM [13, 15], Exemplar SVM [36], Overfeat [103], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], DNN Det. [104], YOLO
[20], YOLO-v2 [48], SSD [21], Unified Det. [105], FPN [22], RetinaNet [23], RefineDet [55], TridentNet [56].
2.3.3 Technical Evolution of Bounding Box Regression

Bounding Box (BB) regression is an important technique in object detection. It aims to refine the location of a predicted bounding box based on the initial proposal or the anchor box. In the past 20 years, the evolution of BB regression has gone through three historical periods: "without BB regression (before 2008)", "from BB to BB (2008-2013)", and "from features to BB (after 2013)". Fig. 7 shows the evolution of bounding box regression.

• Without BB regression (before 2008)

Most of the early detection methods such as the VJ detector and the HOG detector do not use BB regression, and usually directly consider the sliding window as the detection result. To obtain accurate locations of an object, researchers had no choice but to build a very dense pyramid and slide the detector densely on each location.

• From BB to BB (2008-2013)

The first time that BB regression was introduced to an object detection system was in DPM [15]. The BB regression at that time usually acted as a post-processing block, and thus it was optional. As the goal in PASCAL VOC is to predict a single bounding box for each object, the simplest way for a DPM to generate a final detection would be to directly use its root filter locations. Later, R. Girshick et al. introduced a more complex way to predict a bounding box based on the complete configuration of an object hypothesis and formulated this process as a linear least-squares regression problem [15]. This method yields noticeable improvements of the detection under the PASCAL criteria.

• From features to BB (after 2013)

After the introduction of Faster RCNN in 2015, BB regression no longer serves as an individual post-processing block but has been integrated with the detector and trained in an end-to-end fashion. At the same time, BB regression has evolved to predicting the BB directly based on CNN features. In order to get more robust predictions, the smooth-L1 function [19] is commonly used,

    L(t) = 5t^2,          if |t| <= 0.1
           |t| - 0.05,    otherwise                          (2)

or the root-square function [20],

    L(x, x*) = (sqrt(x) - sqrt(x*))^2,                       (3)

as the regression loss, which are more robust to outliers than the least-squares loss used in DPM. Some researchers also choose to normalize the coordinates to get more robust results [18, 19, 21, 23].

2.3.4 Technical Evolution of Context Priming

Visual objects are usually embedded in a typical context with their surrounding environments. Our brain takes advantage of the associations among objects and environments to facilitate visual perception and cognition [160]. Context priming has long been used to improve detection. There are three common approaches in its evolutionary history: 1) detection with local context, 2) detection with global context, and 3) context interactives, as shown in Fig. 8.

• Detection with local context
Fig. 7. Evolution of bounding box regression techniques in object detection from 2001 to 2019. Detectors in this figure: VJ Det. [10], HOG Det. [12],
Exemplar SVM [36], DPM [13, 15], Overfeat [103], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], YOLO-v2
[48], Unified Det. [105], FPN [22], RetinaNet [23], RefineDet [55], TridentNet [56].
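For concreteness, the regression losses written in Eqs. (2)-(3) above can be sketched as follows. The break-point at 0.1 and the constants follow those equations as printed here, and are not a claim about any particular detector's default settings.

# Minimal sketch of the regression losses of Eqs. (2)-(3).
import numpy as np

def smooth_l1(t):
    """Eq. (2): quadratic near zero, linear for large residuals."""
    t = np.asarray(t, dtype=float)
    return np.where(np.abs(t) <= 0.1, 5.0 * t ** 2, np.abs(t) - 0.05)

def root_square(x, x_star):
    """Eq. (3): squared difference of square roots, as used for size terms in [20]."""
    return (np.sqrt(x) - np.sqrt(x_star)) ** 2

print(smooth_l1([0.05, 0.5]))        # [0.0125, 0.45]
print(root_square(4.0, 9.0))         # 1.0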
Fig. 8. Evolution of context priming in object detection from 2001 to 2019: 1) detection with local context, 2) detection with global context, 3)
detection with context interactives. Detectors in this figure: Face Det. [139], MultiPath [140], GBDNet [141, 142], CC-Net [143], MultiRegion-CNN
[144], CoupleNet [145], DPM [14, 15], StructDet [146], YOLO [20], RFCN++ [147], ION [148], AttenContext [149], CtxSVM [150], PersonContext
[151], SMN [152], RetinaNet [23], SIN [153].
Local context refers to the visual information in the area that surrounds the object to detect. It has long been acknowledged that local context helps improve object detection. In the early 2000s, Sinha and Torralba [139] found that the inclusion of local contextual regions such as the facial bounding contour substantially improves face detection performance. Dalal and Triggs also found that incorporating a small amount of background information improves the accuracy of pedestrian detection [12]. Recent deep learning based detectors can also be improved with local context by simply enlarging the networks' receptive field or the size of the object proposals [140–145, 161].

• Detection with global context

Global context exploits scene configuration as an additional source of information for object detection. For early time's object detectors, a common way of integrating global context was to integrate a statistical summary of the elements that comprise the scene, like Gist [160]. For modern deep learning based detectors, there are two methods to integrate global context. The first way is to take advantage of a large receptive field (even larger than the input image) [20] or a global pooling operation on a CNN feature [147]. The second way is to think of the global context as a kind of sequential information and to learn it with recurrent neural networks [148, 149].

• Context interactive

Context interactive refers to the piece of information that is conveyed by the interactions of visual elements, such as constraints and dependencies. For most object detectors, object instances are detected and recognized individually without exploiting their relations. Some recent research has suggested that modern object detectors can be improved by considering context interactives. The recent improvements can be grouped into two categories, where the first one is to explore the relationship between individual objects [15, 146, 150, 152, 162], and the second one is to explore modeling the dependencies between objects and scenes [151, 151, 153].

Greedy selection has several drawbacks, as shown in Fig 11. First of all, the top-scoring box may not be the best fit. Second, it may suppress nearby objects. Finally, it does not suppress false positives. In recent years, in spite of the fact that some manual modifications have been made to improve its performance [158, 159, 163] (see Section 4.4 for more details), to the best of our knowledge, greedy selection still performs as the strongest baseline for today's object detection.

Fig. 9. Evolution of non-max suppression (NMS) techniques in object detection from 1994 to 2019: 1) Greedy selection, 2) Bounding box aggregation, and 3) Learn to NMS. Detectors in this figure: VJ Det. [10], Face Det. [96], HOG Det. [12], DPM [13, 15], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], FPN [22], RetinaNet [23], LearnNMS [154], MAP-Det [155], End2End-DPM [136], StrucDet [146], Overfeat [103], APC-NMS [156], MAPC [157], SoftNMS [158], FitnessNMS [159].
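The greedy selection procedure discussed above (see also Fig. 9) can be summarized in a minimal sketch: repeatedly keep the top-scoring box and suppress every remaining box whose IoU with it exceeds a threshold. This is the generic algorithm, not any particular detector's implementation.

# Minimal sketch of greedy non-maximum suppression with an IoU helper.
import numpy as np

def iou(box, boxes):
    """box: [x1, y1, x2, y2]; boxes: (N, 4). IoU of `box` with each row."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def greedy_nms(boxes, scores, iou_thr=0.5):
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        best = order[0]
        keep.append(best)
        rest = order[1:]
        order = rest[iou(boxes[best], boxes[rest]) <= iou_thr]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(greedy_nms(boxes, scores))   # -> [0, 2]: the near-duplicate box is suppressed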
Fig. 10. Evolution of hard negative mining techniques in object detection from 1994 to 2019. Detectors in this figure: Face Det. [164], Haar Det. [29],
VJ Det. [10], HOG Det. [12], DPM [13, 15], RCNN [16], SPPNet [17], Fast RCNN [18], Faster RCNN [19], YOLO [20], SSD [21], FasterPed [165],
OHEM [166], RetinaNet [23], RefineDet [55].
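The discussion that follows mentions, as an alternative to hard negative mining, reshaping the standard cross-entropy loss so that it focuses on hard, misclassified examples [23]. A minimal sketch of such a focal-loss-style weighting is given below; the gamma and alpha values are illustrative only, not prescribed by the text.

# Minimal sketch of a "reshaped cross entropy" (focal-loss-style) weighting:
# easy, well-classified examples are down-weighted by (1 - p_t)^gamma.
import numpy as np

def focal_loss(p, y, gamma=2.0, alpha=0.25):
    """p: predicted foreground probability, y: 1 for object, 0 for background."""
    p = np.clip(np.asarray(p, float), 1e-7, 1 - 1e-7)
    y = np.asarray(y, float)
    p_t = np.where(y == 1, p, 1 - p)
    alpha_t = np.where(y == 1, alpha, 1 - alpha)
    return -alpha_t * (1 - p_t) ** gamma * np.log(p_t)

# an easy negative (p=0.01) contributes almost nothing; a hard one (p=0.9) dominates
print(focal_loss([0.01, 0.9], [0, 0]))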
windows to every object. Modern detection datasets also require the prediction of the object aspect ratio, further increasing the imbalance ratio to 10^6–10^7 [129]. In this case, using all background data will be harmful to training as the vast number of easy negatives will overwhelm the learning process. Hard negative mining (HNM) aims to deal with the problem of imbalanced data during training. The technical evolution of HNM in object detection is shown in Fig. 10.

• Bootstrap

Bootstrap in object detection refers to a group of training techniques in which the training starts with a small part of the background samples and then iteratively adds new misclassified backgrounds during the training process. In early object detectors, bootstrap was initially introduced with the purpose of reducing the training computations over millions of background samples [10, 29, 164]. Later it became a standard training technique in DPM and HOG detectors [12, 13] for solving the data imbalance problem.

• HNM in deep learning based detectors

Later, in the deep learning era, due to the improvement of computing power, bootstrap was briefly discarded in object detection during 2014-2016 [16–20]. To ease the data imbalance problem during training, detectors like Faster RCNN and YOLO simply balance the weights between the positive and negative windows. However, researchers later noticed that weight-balancing cannot completely solve the imbalanced data problem [23]. To this end, after 2016, bootstrap was re-introduced to deep learning based detectors [21, 165–168]. For example, in SSD [21] and OHEM [166], only the gradients of a very small part of the samples (those with the largest loss values) will be back-propagated. In RefineDet [55], an "anchor refinement module" is designed to filter easy negatives. An alternative improvement is to design new loss functions [23, 169, 170], by reshaping the standard cross entropy loss so that it puts more focus on hard, misclassified examples [23].

3 SPEED-UP OF DETECTION

The acceleration of object detection has long been an important but challenging problem. In the past 20 years, the object detection community has developed sophisticated acceleration techniques. These techniques can be roughly divided into three levels: "speed up of detection pipeline", "speed up of detection engine", and "speed up of numerical computation", as shown in Fig 12.
Fig. 14. An overview of speed-up methods for a CNN's convolutional layer and a comparison of their computational complexity: (a) Standard convolution: O(dk^2 c). (b) Factoring convolutional filters (k × k → (k' × k')^2 or 1 × k, k × 1): O(dk'^2 c) or O(dkc). (c) Factoring convolutional channels: O(d'k^2 c) + O(dk^2 d'). (d) Group convolution (#groups = m): O(dk^2 c/m). (e) Depth-wise separable convolution: O(ck^2) + O(dc).
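To make the complexity comparison of Fig. 14 (a) and (e) concrete, the following is a minimal PyTorch sketch (torch is assumed to be available; the channel and kernel sizes are arbitrary examples). The parameter counts mirror the O(dk^2 c) versus O(ck^2) + O(dc) terms listed in the caption.

# Standard vs. depth-wise separable convolution, cf. Fig. 14 (a) and (e).
import torch
import torch.nn as nn

c, d, k = 64, 128, 3                      # input channels, filters, kernel size

standard = nn.Conv2d(c, d, k, padding=1, bias=False)

depthwise_separable = nn.Sequential(
    nn.Conv2d(c, c, k, padding=1, groups=c, bias=False),  # per-channel k x k filter
    nn.Conv2d(c, d, 1, bias=False),                        # 1x1 filters restore d channels
)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

x = torch.randn(1, c, 32, 32)
assert standard(x).shape == depthwise_separable(x).shape
print("standard:", n_params(standard))             # d * k * k * c  = 73728
print("separable:", n_params(depthwise_separable)) # c * k * k + d * c = 8768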
3.4.1 Network Pruning

The research on "network pruning" can be traced back to as early as the 1980s. At that time, Y. LeCun et al. proposed a method called "optimal brain damage" to compress the parameters of a multi-layer perceptron network [186]. In this method, the loss function of a network is approximated with second-order derivatives so as to remove some unimportant weights. Following this idea, the network pruning methods of recent years usually take an iterative training and pruning process, i.e., removing only a small group of unimportant weights after each stage of training, and repeating those operations [187]. As traditional network pruning simply removes unimportant weights, which may result in sparse connectivity patterns in a convolutional filter, it cannot be directly applied to compress a CNN model. A simple solution to this problem is to remove whole filters instead of independent weights [188, 189].

3.4.2 Network Quantification

Recent works on network quantification mainly focus on network binarization, which aims to accelerate a network by quantifying its activations or weights to binary variables (say, 0/1) so that the floating-point operations are converted to AND, OR, NOT logical operations. Network binarization can significantly speed up computations and reduce the network's storage so that it becomes much easier to deploy on mobile devices. One possible implementation of the above ideas is to approximate the convolution by binary variables with the least squares method [190]. A more accurate approximation can be obtained by using linear combinations of multiple binary convolutions [191]. In addition, some researchers have further developed GPU acceleration libraries for binarized computation, which obtained more significant acceleration results [192].

3.4.3 Network Distillation

Network distillation is a general framework to compress the knowledge of a large network ("teacher net") into a small one ("student net") [193, 194]. Recently, this idea has been used in the acceleration of object detection [195, 196]. One straightforward approach of this idea is to use a teacher net to instruct the training of a (light-weight) student net so that the latter can be used to speed up detection [195]. Another approach is to transform the candidate regions so as to minimize the feature distance between the student net and the teacher net. This method makes the detection model 2 times faster while achieving comparable accuracy [196].

3.5 Lightweight Network Design

The last group of methods to speed up a CNN based detector is to directly design a lightweight network instead of using off-the-shelf detection engines. Researchers have long been exploring the right configurations of a network so as to gain accuracy under a constrained time cost. In addition to some general design principles like "fewer channels and more layers" [197], some other approaches have been proposed in recent years: 1) factorizing convolutions, 2) group convolution, 3) depth-wise separable convolution, 4) bottle-neck design, and 5) neural architecture search.

3.5.1 Factorizing Convolutions

Factorizing convolutions is the simplest and most straightforward way to build a lightweight CNN model. There are two groups of factorizing methods.

The first group of methods is to factorize a large convolution filter into a set of small ones in their spatial dimension [47, 147, 198], as shown in Fig. 14 (b). For example, one can factorize a 7x7 filter into three 3x3 filters, where they share the same receptive field but the latter is more efficient. Another example is to factorize a k×k filter into a k×1 filter and a 1×k filter [198, 199], which could be more efficient for very large filters, say 15x15 [199]. This idea has been recently used in object detection [200].

The second group of methods is to factorize a large group of convolutions into two small groups in their
channel dimension [201, 202], as shown in Fig. 14 (c). For example, one can approximate a convolution layer with d filters and a feature map of c channels by d' filters + a nonlinear activation + another d filters (d' < d). In this case, the complexity O(dk^2 c) of the original layer can be reduced to O(d'k^2 c) + O(dd').

3.5.2 Group Convolution

Group convolution aims to reduce the number of parameters in a convolution layer by dividing the feature channels into many different groups, and then convolving on each group independently [189, 203], as shown in Fig. 14 (d). If we evenly divide the feature channels into m groups, without changing other configurations, the computational complexity of the convolution will theoretically be reduced to 1/m of that before.

3.5.3 Depth-wise Separable Convolution

Depth-wise separable convolution, as shown in Fig. 14 (e), is a recently popular way of building lightweight convolution networks [204]. It can be viewed as a special case of group convolution where the number of groups is set equal to the number of channels.

Suppose we have a convolutional layer with d filters and a feature map of c channels. The size of each filter is k × k. For a depth-wise separable convolution, every k × k × c filter is first split into c slices, each with the size of k × k × 1, and then the convolutions are performed individually in each channel with each slice of the filter. Finally, a number of 1x1 filters are used to make a dimension transform so that the final output has d channels. By using depth-wise separable convolution, the computational complexity can be reduced from O(dk^2 c) to O(ck^2) + O(dc). This idea has recently been applied to object detection and fine-grained classification [205–207].

3.5.4 Bottle-neck Design

A bottleneck layer in a neural network contains few nodes compared to the previous layers. It can be used to learn efficient data encodings of the input with reduced dimensionality, and has been commonly used in deep autoencoders [208]. In recent years, the bottle-neck design has been widely used for designing lightweight networks [47, 209–212]. Among these methods, one common approach is to compress the input layer of a detector to reduce the amount of computation from the very beginning of the detection pipeline [209–211]. Another approach is to compress the output of the detection engine to make the feature map thinner, so as to make it more efficient for subsequent detection stages [47, 212].

3.5.5 Neural Architecture Search

More recently, there has been significant interest in designing network architectures automatically by neural architecture search (NAS) instead of relying heavily on expert experience and knowledge. NAS has been applied to large-scale image classification [213, 214], object detection [215] and image segmentation [216] tasks. NAS also shows promising results in designing lightweight networks very recently, where constraints on both the prediction accuracy and the computational complexity are considered during the searching process [217, 218].

3.6 Numerical Acceleration

In this section, we mainly introduce four important numerical acceleration methods that are frequently used in object detection: 1) speed up with the integral image, 2) speed up in the frequency domain, 3) vector quantization, and 4) reduced rank approximation.

3.6.1 Speed Up with Integral Image

The integral image is an important method in image processing. It helps to rapidly calculate summations over image sub-regions. The essence of the integral image is the integral-differential separability of convolution in signal processing:

    f(x) * g(x) = ( ∫ f(x) dx ) * ( dg(x)/dx ),              (4)

where, if dg(x)/dx is a sparse signal, the convolution can be accelerated by the right-hand side of this equation. Although the VJ detector [10] is well known for the integral image acceleration, before it was born, the integral image had already been used to speed up a CNN model [219] and achieved more than 10 times acceleration.

In addition to the above examples, the integral image can also be used to speed up more general features in object detection, e.g., the color histogram, gradient histogram [171, 177, 220, 221], etc. A typical example is to speed up HOG by computing integral HOG maps [177, 220]. Instead of accumulating pixel values as in a traditional integral image, the integral HOG map accumulates gradient orientations in an image, as shown in Fig. 15. As the histogram of a cell can be viewed as the summation of the gradient vectors in a certain region, by using the integral image, it is possible to compute a histogram in a rectangular region of arbitrary position and size with a constant computational overhead. The integral HOG map has been used in pedestrian detection and has achieved dozens of times' acceleration without losing any accuracy [177].

Later, in 2009, P. Dollár et al. proposed a new type of image feature called Integral Channel Features (ICF), which can be considered as a more general case of the integral image features, and has been successfully used in pedestrian detection [171]. ICF achieved state-of-the-art detection accuracy at near real-time detection speed in its time.

3.6.2 Speed Up in Frequency Domain

Convolution is an important type of numerical operation in object detection. As the detection of a linear detector can be viewed as the window-wise inner product between the feature map and the detector's weights, this process can be implemented by convolutions.

Convolution can be accelerated in many ways, where the Fourier transform is a very practical choice, especially for speeding up large filters. The theoretical basis for accelerating convolution in the frequency domain is the convolution theorem in signal processing, that is, under suitable conditions, the Fourier transform of a convolution of two signals is the point-wise product in their Fourier space:

    I * W = F^{-1}( F(I) ⊙ F(W) ),                           (5)

where F is the Fourier transform, F^{-1} is the inverse Fourier transform, I and W are the input image and the filter, * is the convolution operator, and ⊙ is the element-wise product.
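A minimal NumPy sketch of Eq. (5) is given below: the convolution is computed as a point-wise product in the Fourier domain and checked against a direct computation. Zero-padding to (H + h - 1, W + w - 1) turns the circular convolution implied by the FFT into the ordinary (linear) one; the image and filter sizes are arbitrary examples.

# Convolution via the convolution theorem, cf. Eq. (5).
import numpy as np

def fft_convolve2d(image, kernel):
    H, W = image.shape
    h, w = kernel.shape
    shape = (H + h - 1, W + w - 1)
    F_img = np.fft.rfft2(image, shape)
    F_ker = np.fft.rfft2(kernel, shape)
    return np.fft.irfft2(F_img * F_ker, shape)   # F^{-1}( F(I) . F(W) )

def direct_convolve2d(image, kernel):
    H, W = image.shape
    h, w = kernel.shape
    out = np.zeros((H + h - 1, W + w - 1))
    for i in range(h):
        for j in range(w):
            out[i:i + H, j:j + W] += kernel[i, j] * image
    return out

img = np.random.rand(64, 64)
ker = np.random.rand(15, 15)                     # large filters benefit most
assert np.allclose(fft_convolve2d(img, ker), direct_convolve2d(img, ker))
print("FFT convolution matches direct convolution")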
Fig. 15. An illustration of how to compute the “Integral HOG Map” [177]. With integral image techniques, we can efficiently compute the histogram
feature of any location and any size with constant computational complexity.
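The integral-image trick illustrated in Fig. 15 can be sketched in a few lines: after a single cumulative-sum pass, the sum over any rectangle needs only four lookups, independent of the rectangle's size, and applying it per orientation bin gives an integral-histogram ("integral HOG map")-style structure. The channel data below are random placeholders.

# Minimal sketch of an integral image and constant-time box sums.
import numpy as np

def integral_image(channel):
    """Zero-padded cumulative sum so that ii[y, x] = sum(channel[:y, :x])."""
    ii = np.zeros((channel.shape[0] + 1, channel.shape[1] + 1))
    ii[1:, 1:] = channel.cumsum(0).cumsum(1)
    return ii

def box_sum(ii, y1, x1, y2, x2):
    """Sum of channel[y1:y2, x1:x2] in O(1) time (four lookups)."""
    return ii[y2, x2] - ii[y1, x2] - ii[y2, x1] + ii[y1, x1]

grad_bins = np.random.rand(9, 128, 128)              # e.g. 9 orientation channels
integrals = [integral_image(ch) for ch in grad_bins]

# histogram of an arbitrary cell: one O(1) lookup per orientation bin
cell_hist = np.array([box_sum(ii, 40, 40, 56, 56) for ii in integrals])
assert np.allclose(cell_hist, grad_bins[:, 40:56, 40:56].sum(axis=(1, 2)))
print(cell_hist)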
Fig. 17. A comparison of detection accuracy of three detectors: Faster RCNN [19], R-FCN [46] and SSD [21] on the MS-COCO dataset with different detection engines. Image from J. Huang et al. CVPR2017 [27].

Fig. 18. An illustration of different feature fusion methods: (a) bottom-up fusion, (b) top-down fusion, (c) element-wise sum, (d) element-wise product, and (e) concatenation.
AlexNet: AlexNet [40], an eight-layer deep network, was the first CNN model that started the deep learning revolution in computer vision. AlexNet famously won the ImageNet LSVRC-2012 competition by a large margin [15.3% vs. 26.2% (second place) error rates]. As of Feb. 2019, the AlexNet paper has been cited over 30,000 times.

VGG: VGG was proposed by Oxford's Visual Geometry Group (VGG) in 2014 [230]. VGG increased the model's depth to 16-19 layers and used very small (3x3) convolution filters instead of the 5x5 and 7x7 filters previously used in AlexNet. VGG achieved the state-of-the-art performance on the ImageNet dataset of its time.

GoogLeNet: GoogLeNet, a.k.a. Inception [198, 231–233], is a big family of CNN models proposed by Google Inc. since 2014. GoogLeNet increased both a CNN's width and depth (up to 22 layers). The main contribution of the Inception family is the introduction of factorized convolutions and batch normalization.

ResNet: The Deep Residual Networks (ResNet) [234], proposed by K. He et al. in 2015, is a new type of convolutional network architecture that is substantially deeper (up to 152 layers) than those used previously. ResNet aims to ease the training of networks by reformulating its layers as learning residual functions with reference to the layer inputs. ResNet won multiple computer vision competitions in 2015, including ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.

DenseNet: DenseNet [235] was proposed by G. Huang and Z. Liu et al. in 2017. The success of ResNet suggested that the shortcut connections in a CNN enable us to train deeper and more accurate models. The authors embraced this observation and introduced a densely connected block, which connects each layer to every other layer in a feed-forward fashion.

SENet: Squeeze and Excitation Networks (SENet) was proposed by J. Hu and L. Shen et al. in 2018 [236]. Its main contribution is the integration of global pooling and shuffling to learn the channel-wise importance of the feature map. SENet won 1st place in the ILSVRC 2017 classification competition.

• Object detectors with new engines

In the last three years, many of the latest engines have been applied to object detection. For example, some recent object detection models such as STDN [237], DSOD [238], TinyDSOD [207], and Pelee [209] choose DenseNet [235] as their detection engine. Mask RCNN [4], as the state-of-the-art model for instance segmentation, applied the next generation of ResNet, ResNeXt [239], as its detection engine. Besides, to speed up detection, the depth-wise separable convolution operation, which was introduced by Xception [204], an improved version of Inception, has also been used in detectors such as MobileNet [205] and LightHead RCNN [47].

4.2 Detection with Better Features

The quality of feature representations is critical for object detection. In recent years, many researchers have made efforts to further improve the quality of image features on the basis of the latest engines, where the two most important groups of methods are: 1) feature fusion and 2) learning high-resolution features with large receptive fields.

4.2.1 Why Feature Fusion is Important?

Invariance and equivariance are two important properties of image feature representations. Classification desires invariant feature representations since it aims at learning high-level semantic information. Object localization desires equivariant representations since it aims at discriminating position and scale changes. As object detection consists of the two sub-tasks of object recognition and localization, it is crucial for a detector to learn both invariance and equivariance at the same time.

Feature fusion has been widely used in object detection in the last three years. As a CNN model consists of a series of convolutional and pooling layers, features in deeper layers have stronger invariance but less equivariance. Although this can be beneficial to category recognition, it suffers from low localization accuracy in object detection. On the contrary, features in shallower layers are not conducive to learning semantics, but they help object localization as they contain more information about edges and contours. Therefore, the integration of deep and shallow features in a CNN model helps improve both invariance and equivariance.
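As a concrete illustration of the fusion operations shown in Fig. 18 (c)-(e) and discussed in Section 4.2.2 below, the following sketch combines a deep, low-resolution feature map with a shallow, high-resolution one. The shapes and the nearest-neighbour upsampling are illustrative simplifications, not a specific detector's design.

# Minimal sketch of element-wise sum, product, and concatenation fusion.
# Feature map shapes are (channels, H, W); values are random placeholders.
import numpy as np

def upsample_nearest(x, factor):
    return x.repeat(factor, axis=1).repeat(factor, axis=2)

shallow = np.random.rand(256, 64, 64)      # fine resolution, weaker semantics
deep = np.random.rand(256, 32, 32)         # coarse resolution, stronger semantics
deep_up = upsample_nearest(deep, 2)        # accommodate the spatial sizes first

fused_sum = shallow + deep_up                           # (256, 64, 64)
fused_prod = shallow * deep_up                          # (256, 64, 64)
fused_cat = np.concatenate([shallow, deep_up], axis=0)  # (512, 64, 64): more memory

print(fused_sum.shape, fused_prod.shape, fused_cat.shape)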
4.2.2 Feature Fusion in Different Ways

There are many ways to perform feature fusion in object detection. Here we introduce some recent methods in two aspects: 1) processing flow and 2) element-wise operation.

• Processing flow

Recent feature fusion methods in object detection can be divided into two categories: 1) bottom-up fusion and 2) top-down fusion, as shown in Fig. 18 (a)-(b). Bottom-up fusion feeds forward shallow features to deeper layers via skip connections [237, 240–242]. In comparison, top-down fusion feeds back the features of deeper layers into the shallower ones [22, 55, 243–246]. Apart from these methods, there are more complex approaches proposed recently, e.g., weaving features across different layers [247].

As the feature maps of different layers may have different sizes in terms of both their spatial and channel dimensions, one may need to accommodate the feature maps, such as by adjusting the number of channels, up-sampling low-resolution maps, or down-sampling high-resolution maps to a proper size. The easiest way to do this is to use nearest- or bilinear-interpolation [22, 244]. Besides, fractionally strided convolution (a.k.a. transpose convolution) [45, 248] is another recent popular way to resize the feature maps and adjust the number of channels. The advantage of using fractionally strided convolution is that it can learn an appropriate way to perform up-sampling by itself [55, 212, 241–243, 245, 246, 249].

• Element-wise operation

From a local point of view, feature fusion can be considered as an element-wise operation between different feature maps. There are three groups of methods: 1) element-wise sum, 2) element-wise product, and 3) concatenation, as shown in Fig. 18 (c)-(e).

The element-wise sum is the easiest way to perform feature fusion. It has been frequently used in many recent object detectors [22, 55, 241, 243, 246]. The element-wise product [245, 249–251] is very similar to the element-wise sum, the only difference being the use of multiplication instead of summation. An advantage of the element-wise product is that it can be used to suppress or highlight the features within a certain area, which may further benefit small object detection [245, 250, 251]. Feature concatenation is another way of feature fusion [212, 237, 240, 244]. Its advantage is that it can be used to integrate context information from different regions [105, 144, 149, 161], while its disadvantage is the increase in memory [235].

4.2.3 Learning High Resolution Features with Large Receptive Fields

The receptive field and the feature resolution are two important characteristics of a CNN based detector: the former refers to the spatial range of input pixels that contribute to the calculation of a single pixel of the output, and the latter corresponds to the down-sampling rate between the input and the feature map. A network with a larger receptive field is able to capture a larger scale of context information, while one with a smaller receptive field may concentrate more on the local details.

As we mentioned before, the lower the feature resolution, the harder it will be to detect small objects. The most straightforward way to increase the feature resolution is to remove the pooling layer or to reduce the convolution down-sampling rate. But this will cause a new problem: the receptive field will become too small due to the decreased output stride. In other words, this will narrow a detector's "sight" and may result in missed detection of some large objects.

A practical method to increase both the receptive field and the feature resolution at the same time is to introduce dilated convolution (a.k.a. atrous convolution, or convolution with holes). Dilated convolution was originally proposed for semantic segmentation tasks [252, 253]. Its main idea is to expand the convolution filter and use sparse parameters. For example, a 3x3 filter with a dilation rate of 2 will have the same receptive field as a 5x5 kernel but only 9 parameters. Dilated convolution has now been widely used in object detection [21, 56, 254, 255], and proves to be effective for improving accuracy without any additional parameters or computational cost [56].

4.3 Beyond Sliding Window

Although object detection has evolved from using handcrafted features to deep neural networks, the detection still follows a paradigm of "sliding window on feature maps" [137]. Recently, some detectors have been built beyond sliding windows.

• Detection as sub-region search

Sub-region search [184, 256–258] provides a new way of performing detection. One recent method is to think of detection as a path planning process that starts from initial grids and finally converges to the desired ground truth boxes [256]. Another method is to think of detection as an iterative updating process to refine the corners of a predicted bounding box [257].

• Detection as key points localization

Key points localization is an important computer vision task that has extensively broad applications, such as facial expression recognition [259], human pose identification [260], etc. As any object in an image can be uniquely determined by the upper left corner and the lower right corner of its ground truth box, the detection task can therefore be equivalently framed as a pair-wise key points localization problem. One recent implementation of this idea is to predict a heat-map for the corners [261]. The advantage of this approach is that it can be implemented under a semantic segmentation framework, and there is no need to design multi-scale anchor boxes.

4.4 Improvements of Localization

To improve localization accuracy, there are two groups of methods in recent detectors: 1) bounding box refinement, and 2) designing new loss functions for accurate localization.
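The dilated convolution described in Section 4.2.3 above can be checked with a short PyTorch sketch (torch assumed to be available; the channel counts are arbitrary): a 3x3 kernel with dilation rate 2 still has 9 weights per channel pair, but its taps span a 5x5 region of the input.

# Dilated 3x3 vs. dense 5x5 convolution: same receptive field, fewer parameters.
import torch
import torch.nn as nn

dilated = nn.Conv2d(1, 1, kernel_size=3, dilation=2, padding=2, bias=False)
dense5x5 = nn.Conv2d(1, 1, kernel_size=5, padding=2, bias=False)

print(sum(p.numel() for p in dilated.parameters()))    # 9 parameters
print(sum(p.numel() for p in dense5x5.parameters()))   # 25 parameters

x = torch.randn(1, 1, 32, 32)
# both preserve the spatial size and cover a 5x5 input region per output pixel
print(dilated(x).shape, dense5x5(x).shape)              # (1, 1, 32, 32) each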
4.4.1 Bounding Box Refinement the boundary of an object, segmentation may be helpful for
The most intuitive way to improve localization accuracy category recognition.
is bounding box refinement, which can be considered as • Segmentation helps accurate localization
a post-processing of the detection results. Although the
bounding box regression has been integrated into most of The ground-truth bounding box of an object is deter-
the modern object detectors, there are still some objects mined by its well-defined boundary. For some objects with
with unexpected scales that cannot be well captured by any a special shape (e.g., imagine a cat with a very long tail),
of the predefined anchors. This will inevitably lead to an it will be difficult to predict high IoU locations. As object
inaccurate prediction of their locations. For this reason, the boundaries can be well encoded in semantic segmentation
“iterative bounding box refinement” [262–264] has been in- features, learning with segmentation would be helpful for
troduced recently by iteratively feeding the detection results accurate object localization.
into a BB regressor until the prediction converges to a correct • Segmentation can be embedded as context
location and size. However, some researchers also claimed
that this method does not guarantee the monotonicity of Objects in daily life are surrounded by different back-
localization accuracy [262], in other words, the BB regression grounds, such as the sky, water, grass, etc, and all these
may degenerate the localization if it is applied for multiple elements constitute the context of an object. Integrating the
times. context of semantic segmentation will be helpful for object
4.4.2 Improving Loss Functions for Accurate Localization

In most modern detectors, object localization is treated as a coordinate regression problem. However, this paradigm has two drawbacks. First, the regression loss function does not correspond to the final evaluation of localization: we cannot guarantee that a lower regression error will always produce a higher IoU prediction, especially when the object has a very large aspect ratio. Second, the traditional bounding box regression method does not provide the confidence of its localization. When multiple BB's overlap with each other, this may lead to failures in non-maximum suppression (see more details in subsection 2.3.5).

The above problems can be alleviated by designing new loss functions. The most intuitive design is to directly use IoU as the localization loss function [265]. Some other researchers have further proposed an IoU-guided NMS to improve localization in both the training and detection stages [163]. Besides, some researchers have also tried to improve localization under a probabilistic inference framework [266]. Different from the previous methods that directly predict the box coordinates, this method predicts the probability distribution of a bounding box location.
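As a sketch of the first idea, the loss below is computed directly from the overlap between the predicted and ground-truth boxes, here in a -log(IoU) form; the exact formulation used in [265] may differ, and the epsilon and box format are illustrative choices.

```python
# A minimal sketch of an IoU-based localization loss in the spirit of [265]: the loss is
# driven directly by the overlap between predicted and ground-truth boxes rather than by
# coordinate-wise regression errors. Boxes are (x1, y1, x2, y2) tensors.
import torch

def iou_loss(pred, target, eps=1e-7):
    """pred, target: (N, 4) tensors of boxes; returns the mean -log(IoU) loss."""
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_p = (pred[:, 2] - pred[:, 0]).clamp(min=0) * (pred[:, 3] - pred[:, 1]).clamp(min=0)
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)
    return (-torch.log(iou + eps)).mean()   # lower loss <=> higher IoU, by construction

if __name__ == "__main__":
    p = torch.tensor([[10., 10., 50., 50.]], requires_grad=True)
    t = torch.tensor([[12., 8., 48., 52.]])
    loss = iou_loss(p, t)
    loss.backward()          # gradients flow directly through the IoU computation
    print(loss.item(), p.grad)
```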
4.5 Learning with Segmentation

Object detection and semantic segmentation are both important tasks in computer vision. Recent research suggests that object detection can be improved by learning with semantic segmentation.

4.5.1 Why Segmentation Improves Detection?

There are three reasons why semantic segmentation improves object detection.

• Segmentation helps category recognition

Edges and boundaries are the basic elements that constitute human visual cognition [267, 268]. In computer vision, the difference between an object (e.g., a car, a person) and stuff (e.g., sky, water, grass) is that the former usually has a closed and well defined boundary while the latter does not. As the features of semantic segmentation tasks capture object boundaries well, segmentation may be helpful for category recognition.

• Segmentation helps accurate localization

The ground-truth bounding box of an object is determined by its well-defined boundary. For some objects with a special shape (e.g., imagine a cat with a very long tail), it is difficult to predict locations with high IoU. As object boundaries can be well encoded in semantic segmentation features, learning with segmentation would be helpful for accurate object localization.

• Segmentation can be embedded as context

Objects in daily life are surrounded by different backgrounds, such as the sky, water, and grass, and all these elements constitute the context of an object. Integrating the context of semantic segmentation will be helpful for object detection; say, an aircraft is more likely to appear in the sky than on the water.

4.5.2 How Segmentation Improves Detection?

There are two main approaches to improving object detection with segmentation: 1) learning with enriched features and 2) learning with multi-task loss functions.

• Learning with enriched features

The simplest way is to treat the segmentation network as a fixed feature extractor and to integrate it into the detection framework as additional features [144, 269, 270]. The advantage of this approach is that it is easy to implement, while the disadvantage is that the segmentation network may bring additional computation.

• Learning with multi-task loss functions

Another way is to introduce an additional segmentation branch on top of the original detection framework and to train this model with a multi-task loss function (segmentation loss + detection loss) [4, 269]. In most cases, the segmentation branch is removed at the inference stage. The advantage is that the detection speed is not affected, but the disadvantage is that training requires pixel-level image annotations. To this end, some researchers have followed the idea of "weakly supervised learning": instead of training based on pixel-wise annotation masks, they simply train the segmentation branch based on bounding-box level annotations [250, 271].
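A minimal sketch of this second approach is given below, assuming a toy backbone, a placeholder detection loss, and an auxiliary segmentation branch that is simply skipped at inference time; the module sizes, loss weight, and dummy targets are illustrative assumptions, not values from the cited papers.

```python
# A minimal sketch of multi-task training with an auxiliary segmentation branch: a shared
# backbone feeds both a detection head and a segmentation branch, and the two losses are
# summed (with a weight) during training. All sizes and targets here are illustrative.
import torch
import torch.nn as nn

class DetectorWithSegBranch(nn.Module):
    def __init__(self, num_classes=21):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3, padding=1)        # stand-in backbone
        self.det_head = nn.Conv2d(16, 4 + num_classes, 1)     # box offsets + class scores
        self.seg_branch = nn.Conv2d(16, num_classes, 1)       # auxiliary branch, dropped at inference

    def forward(self, x, with_seg=True):
        feat = torch.relu(self.backbone(x))
        det_out = self.det_head(feat)
        seg_out = self.seg_branch(feat) if with_seg else None
        return det_out, seg_out

if __name__ == "__main__":
    model = DetectorWithSegBranch()
    images = torch.randn(2, 3, 64, 64)
    det_out, seg_out = model(images, with_seg=True)

    # Dummy targets, just to show how the multi-task objective is composed.
    det_target = torch.randn_like(det_out)
    seg_target = torch.randint(0, 21, (2, 64, 64))
    det_loss = nn.functional.mse_loss(det_out, det_target)        # placeholder detection loss
    seg_loss = nn.functional.cross_entropy(seg_out, seg_target)   # pixel-wise segmentation loss
    loss = det_loss + 1.0 * seg_loss                               # detection loss + weighted seg loss
    loss.backward()

    # At inference time the segmentation branch is simply skipped, so speed is unaffected.
    det_only, _ = model(images, with_seg=False)
```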
4.6 Robust Detection of Rotation and Scale Changes

Object rotation and scale changes are important challenges in object detection. As the features learned by a CNN are not invariant to rotation and to large scale changes, many efforts have been made on this problem in recent years.

4.6.1 Rotation Robust Detection

Object rotation is very common in detection tasks such as face detection, text detection, etc. The most straightforward solution to this problem is data augmentation, so that an object in any orientation can be well covered by the augmented data [88]. Another solution is to train independent detectors for every orientation [272, 273]. Apart from these traditional approaches, some new improvements have been proposed recently.

• Rotation invariant loss functions

The idea of learning with a rotation invariant loss function can be traced back to the 1990s [274]. Some recent works have introduced a constraint on the original detection loss function so as to keep the features of rotated objects unchanged [275, 276].

• Rotation calibration

Another way of improving rotation invariant detection is to apply geometric transformations to the object candidates [277-279]. This will be especially helpful for multi-stage detectors, where the correlation at early stages will benefit the subsequent detections. The representative of this idea is the Spatial Transformer Network (STN) [278]. STN has now been used in rotated text detection [278] and rotated face detection [279].

• Rotation RoI Pooling

In a two-stage detector, feature pooling aims to extract a fixed-length feature representation for an object proposal of any location and size by first dividing the proposal evenly into a set of grids and then concatenating the grid features. As the grid meshing is performed in Cartesian coordinates, the features are not invariant to rotation transforms. A recent improvement is to mesh the grids in polar coordinates so that the features are robust to rotation changes [272].

4.6.2 Scale Robust Detection

Recent improvements have been made at both the training and detection stages for scale robust detection.

• Scale adaptive training

Most modern detectors re-scale the input image to a fixed size and back-propagate the loss of objects at all scales, as shown in Fig. 19 (a). A drawback of doing this is a "scale imbalance" problem. Building an image pyramid during detection could alleviate this problem, but not fundamentally [46, 234]. A recent improvement is Scale Normalization for Image Pyramids (SNIP) [280], which builds image pyramids at both the training and detection stages and only back-propagates the loss of some selected scales, as shown in Fig. 19 (b). Some researchers have further proposed a more efficient training strategy, SNIP with Efficient Resampling (SNIPER) [281], i.e., cropping and re-scaling an image to a set of sub-regions so as to benefit from large-batch training.

Fig. 19. Different training strategies for multi-scale object detection: (a) Training on a single resolution image, back-propagating objects of all scales [17-19, 21]. (b) Training on multi-resolution images (image pyramid), back-propagating objects of selected scales. If an object is too large or too small, its gradient will be discarded [56, 280, 281].

• Scale adaptive detection

Most modern detectors use fixed configurations for detecting objects of different sizes. For example, in a typical CNN based detector, we need to carefully define the sizes of the anchors. A drawback of doing this is that the configurations cannot adapt to unexpected scale changes. To improve the detection of small objects, some "adaptive zoom-in" techniques have been proposed in recent detectors to adaptively enlarge small objects into "larger ones" [184, 258]. Another recent improvement is to learn to predict the scale distribution of objects in an image, and then adaptively re-scale the image according to this distribution [282, 283].
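The selective back-propagation used in scale adaptive training can be sketched as follows: at a given pyramid scale, only ground-truth objects whose re-scaled size falls inside a valid range contribute to the loss. The valid range, the square-root-area size measure, and the example scales below are illustrative assumptions, not the settings of SNIP [280].

```python
# A minimal sketch of SNIP-style scale-selective training: at each image-pyramid scale,
# only objects whose re-scaled size lies in a valid range keep their gradients.
import numpy as np

def valid_object_mask(boxes, image_scale, valid_range=(64.0, 256.0)):
    """boxes: (N, 4) array of (x1, y1, x2, y2) in the original image.
    Returns a boolean mask of objects whose re-scaled size lies in `valid_range`."""
    boxes = np.asarray(boxes, dtype=np.float32)
    sizes = np.sqrt((boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])) * image_scale
    return (sizes >= valid_range[0]) & (sizes <= valid_range[1])

if __name__ == "__main__":
    gt_boxes = [(0, 0, 20, 20), (0, 0, 100, 100), (0, 0, 500, 500)]   # small, medium, large
    for scale in (0.5, 1.0, 4.0):   # three levels of an image pyramid
        mask = valid_object_mask(gt_boxes, scale)
        print(f"scale {scale}: back-propagate objects {np.nonzero(mask)[0].tolist()}")
    # Large objects are trained at the small scale and small objects at the large scale,
    # so each object is seen at a "normalized" size.
```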
4.7 Training from Scratch

Most deep learning based detectors are first pre-trained on large scale datasets, say ImageNet, and then fine-tuned on specific detection tasks. People have long believed that pre-training helps to improve generalization ability and training speed, so a natural question is: do we really need to pre-train a detector on ImageNet? In fact, there are some limitations to adopting pre-trained networks for object detection. The first limitation is the divergence between ImageNet classification and object detection, including their loss functions and scale/category distributions. The second limitation is the domain mismatch: images in ImageNet are RGB images, while detection is sometimes applied to depth (RGB-D) or 3D medical images, so the pre-trained knowledge cannot be transferred well to these detection tasks.

In recent years, some researchers have tried to train an object detector from scratch. To speed up training and improve stability, some researchers introduce dense connections and batch normalization to accelerate the back-propagation in shallow layers [238, 284]. The recent work by K. He et al. [285] has questioned the paradigm of pre-training even further by exploring the opposite regime: they reported competitive results on COCO object detection using standard models trained from random initialization, with the sole exception of increasing the number of training iterations so that the randomly initialized models may converge. Training from random initialization is also surprisingly robust even when using only 10% of the training data, which indicates that ImageNet pre-training may speed up convergence, but does not necessarily provide regularization or improve final detection accuracy.

4.8 Adversarial Training

The Generative Adversarial Network (GAN) [286], introduced by I. Goodfellow et al. in 2014, has received great attention in recent years. A typical GAN consists of two neural networks, a generator and a discriminator, contesting with each other in a minimax optimization framework. Typically, the generator learns to map from a latent space to a particular data distribution of interest, while the discriminator aims to discriminate between instances from the true data distribution and those produced by the generator. GANs have been widely used for many computer vision tasks such as image generation [286, 287], image style transfer [288], and image super-resolution [289]. In the recent two years, GANs have also been applied to object detection, especially for improving the detection of small and occluded objects.
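For readers unfamiliar with the minimax game just described, the sketch below trains a toy generator/discriminator pair on a one-dimensional distribution; the network sizes, latent dimension, and optimizer settings are illustrative and unrelated to any of the detection works cited in this section.

```python
# A minimal sketch of the generator/discriminator minimax game on a toy 1-D distribution.
import torch
import torch.nn as nn

latent_dim = 8
G = nn.Sequential(nn.Linear(latent_dim, 32), nn.ReLU(), nn.Linear(32, 1))   # generator
D = nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, 1))            # discriminator (logits)
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

for step in range(2000):
    real = torch.randn(64, 1) * 0.5 + 2.0          # samples from the "true" distribution
    fake = G(torch.randn(64, latent_dim))          # samples produced by the generator

    # Discriminator step: push real samples towards label 1 and generated ones towards 0.
    d_loss = bce(D(real), torch.ones(64, 1)) + bce(D(fake.detach()), torch.zeros(64, 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generator step: fool the discriminator into labeling generated samples as real.
    g_loss = bce(D(fake), torch.ones(64, 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()

# Mean of generated samples; it moves towards the real data mean (2.0) as training proceeds.
print(G(torch.randn(1000, latent_dim)).mean().item())
```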
GANs have been used to enhance the detection of small objects by narrowing the representation gap between small and large ones [290, 291]. To improve the detection of occluded objects, one recent idea is to generate occlusion masks by using adversarial training [292]. Instead of generating examples in pixel space, the adversarial network directly modifies the features to mimic occlusion.

In addition to these works, "adversarial attacks" [293], which study how to attack a detector with adversarial examples, have drawn increasing attention recently. Research on this topic is especially important for autonomous driving, as a detector cannot be fully trusted before its robustness to adversarial attacks is guaranteed.

4.9 Weakly Supervised Object Detection

The training of a modern object detector usually requires a large amount of manually labeled data, while the labeling process is time-consuming, expensive, and inefficient. Weakly Supervised Object Detection (WSOD) aims to solve this problem by training a detector with only image-level annotations instead of bounding boxes.

Recently, multi-instance learning has been used for WSOD [294, 295]. Multi-instance learning is a group of supervised learning methods [39, 296]. Instead of learning with a set of individually labeled instances, a multi-instance learning model receives a set of labeled bags, each containing many instances. If we consider the object candidates in one image as a bag and the image-level annotation as the label, then WSOD can be formulated as a multi-instance learning process.
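A minimal sketch of this multi-instance formulation is given below: the proposals of one image form a bag, per-proposal class scores are aggregated by max-pooling into an image-level prediction, and only the image-level label supervises it. The shapes, the scoring layer, and the aggregation operator are illustrative choices rather than the designs of [294, 295].

```python
# A minimal sketch of WSOD as multi-instance learning: proposals form a "bag" and only
# the image-level (multi-hot) label provides supervision; no box-level labels are used.
import torch
import torch.nn as nn

num_classes, feat_dim, num_proposals = 20, 128, 300
proposal_scorer = nn.Linear(feat_dim, num_classes)           # per-proposal class logits

def wsod_mil_loss(proposal_feats, image_labels):
    """proposal_feats: (num_proposals, feat_dim) features of one image's proposals.
    image_labels: (num_classes,) multi-hot image-level annotation."""
    logits = proposal_scorer(proposal_feats)                  # (num_proposals, num_classes)
    bag_logits, _ = logits.max(dim=0)                         # a class is present if ANY proposal fires
    return nn.functional.binary_cross_entropy_with_logits(bag_logits, image_labels)

if __name__ == "__main__":
    feats = torch.randn(num_proposals, feat_dim)
    labels = torch.zeros(num_classes); labels[3] = 1.0; labels[7] = 1.0
    loss = wsod_mil_loss(feats, labels)
    loss.backward()
    print(loss.item())
```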
Class activation mapping is another recent group of methods for WSOD [297, 298]. Research on CNN visualization has shown that the convolutional layers of a CNN behave as object detectors even though there is no supervision on the location of the object. Class activation mapping sheds light on how to enable a CNN to have localization ability despite being trained only on image-level labels [299].

In addition to the above approaches, some other researchers have considered WSOD as a proposal ranking process, selecting the most informative regions and then training on these regions with image-level annotations [300]. Another simple method for WSOD is to mask out different parts of the image: if the detection score drops sharply, an object is covered with high probability [301]. Besides, interactive annotation [295] takes human feedback into consideration during training so as to improve WSOD. More recently, generative adversarial training has been used for WSOD [302].

5 APPLICATIONS

In this section, we will review some important detection applications of the past 20 years, including pedestrian detection, face detection, text detection, traffic sign/light detection, and remote sensing target detection.

5.1 Pedestrian Detection

Pedestrian detection, as an important object detection application, has received extensive attention in many areas such as autonomous driving, video surveillance, and criminal investigation. Some early pedestrian detection methods, such as the HOG detector [12] and the ICF detector [171], laid a solid foundation for general object detection in terms of feature representation [12, 171], the design of classifiers [174], and detection acceleration [177]. In recent years, some general object detection algorithms, e.g., Faster RCNN [19], have been introduced to pedestrian detection [165] and have greatly promoted progress in this area.

5.1.1 Difficulties and Challenges

The challenges and difficulties in pedestrian detection can be summarized as follows.

Small pedestrians: Fig. 20 (a) shows some examples of small pedestrians captured far from the camera. In the Caltech Dataset [59, 60], 15% of the pedestrians are less than 30 pixels in height.

Hard negatives: Some backgrounds in street view images are very similar to pedestrians in their visual appearance, as shown in Fig. 20 (b).

Dense and occluded pedestrians: Fig. 20 (c) shows some examples of dense and occluded pedestrians. In the Caltech Dataset [59, 60], pedestrians that have not been occluded account for only 29% of the total pedestrian instances.

Real-time detection: Real-time pedestrian detection from HD video is crucial for applications such as autonomous driving and video surveillance.
recognition,” in European conference on computer vision. detection with structured models. Citeseer, 2012.
Springer, 2014, pp. 346–361. [39] S. Andrews, I. Tsochantaridis, and T. Hofmann, “Support
[18] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE inter- vector machines for multiple-instance learning,” in Ad-
national conference on computer vision, 2015, pp. 1440–1448. vances in neural information processing systems, 2003, pp.
[19] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: 577–584.
Towards real-time object detection with region proposal [40] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
networks,” in Advances in neural information processing classification with deep convolutional neural networks,”
systems, 2015, pp. 91–99. in Advances in neural information processing systems, 2012,
[20] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi, “You pp. 1097–1105.
only look once: Unified, real-time object detection,” in [41] R. Girshick, J. Donahue, T. Darrell, and J. Malik, “Region-
Proceedings of the IEEE conference on computer vision and based convolutional networks for accurate object de-
pattern recognition, 2016, pp. 779–788. tection and segmentation,” IEEE transactions on pattern
[21] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. analysis and machine intelligence, vol. 38, no. 1, pp. 142–
Fu, and A. C. Berg, “Ssd: Single shot multibox detector,” 158, 2016.
in European conference on computer vision. Springer, 2016, [42] K. E. Van de Sande, J. R. Uijlings, T. Gevers, and A. W.
pp. 21–37. Smeulders, “Segmentation as selective search for object
[22] T.-Y. Lin, P. Dollár, R. B. Girshick, K. He, B. Hariharan, recognition,” in Computer Vision (ICCV), 2011 IEEE Inter-
and S. J. Belongie, “Feature pyramid networks for object national Conference on. IEEE, 2011, pp. 1879–1886.
detection.” in CVPR, vol. 1, no. 2, 2017, p. 4. [43] R. B. Girshick, P. F. Felzenszwalb, and D. McAllester,
[23] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Discriminatively trained deformable part models, re-
“Focal loss for dense object detection,” IEEE transactions lease 5,” http://people.cs.uchicago.edu/ rbg/latent-
on pattern analysis and machine intelligence, 2018. release5/.
[24] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, [44] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn:
and M. Pietikäinen, “Deep learning for generic object de- towards real-time object detection with region proposal
tection: A survey,” arXiv preprint arXiv:1809.02165, 2018. networks,” IEEE Transactions on Pattern Analysis & Ma-
[25] S. Agarwal, J. O. D. Terrail, and F. Jurie, “Recent advances chine Intelligence, no. 6, pp. 1137–1149, 2017.
in object detection in the age of deep convolutional neural [45] M. D. Zeiler and R. Fergus, “Visualizing and understand-
networks,” arXiv preprint arXiv:1809.03193, 2018. ing convolutional networks,” in European conference on
[26] A. Andreopoulos and J. K. Tsotsos, “50 years of object computer vision. Springer, 2014, pp. 818–833.
recognition: Directions forward,” Computer vision and im- [46] J. Dai, Y. Li, K. He, and J. Sun, “R-fcn: Object detection via
age understanding, vol. 117, no. 8, pp. 827–891, 2013. region-based fully convolutional networks,” in Advances
[27] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, in neural information processing systems, 2016, pp. 379–387.
A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama [47] Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun,
et al., “Speed/accuracy trade-offs for modern convolu- “Light-head r-cnn: In defense of two-stage object detec-
tional object detectors,” in IEEE CVPR, vol. 4, 2017. tor,” arXiv preprint arXiv:1711.07264, 2017.
[28] K. Grauman and B. Leibe, “Visual object recognition [48] J. Redmon and A. Farhadi, “Yolo9000: better, faster,
(synthesis lectures on artificial intelligence and machine stronger,” arXiv preprint, 2017.
learning),” Morgan & Claypool, 2011. [49] ——, “Yolov3: An incremental improvement,” arXiv
[29] C. P. Papageorgiou, M. Oren, and T. Poggio, “A general preprint arXiv:1804.02767, 2018.
framework for object detection,” in Computer vision, 1998. [50] M. Everingham, L. Van Gool, C. K. Williams, J. Winn,
sixth international conference on. IEEE, 1998, pp. 555–562. and A. Zisserman, “The pascal visual object classes (voc)
[30] C. Papageorgiou and T. Poggio, “A trainable system for challenge,” International journal of computer vision, vol. 88,
object detection,” International journal of computer vision, no. 2, pp. 303–338, 2010.
vol. 38, no. 1, pp. 15–33, 2000. [51] M. Everingham, S. A. Eslami, L. Van Gool, C. K. Williams,
[31] A. Mohan, C. Papageorgiou, and T. Poggio, “Example- J. Winn, and A. Zisserman, “The pascal visual object
based object detection in images by components,” IEEE classes challenge: A retrospective,” International journal of
Transactions on Pattern Analysis & Machine Intelligence, computer vision, vol. 111, no. 1, pp. 98–136, 2015.
no. 4, pp. 349–361, 2001. [52] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh,
[32] Y. Freund, R. Schapire, and N. Abe, “A short introduction S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein
to boosting,” Journal-Japanese Society For Artificial Intelli- et al., “Imagenet large scale visual recognition challenge,”
gence, vol. 14, no. 771-780, p. 1612, 1999. International Journal of Computer Vision, vol. 115, no. 3, pp.
[33] D. G. Lowe, “Object recognition from local scale-invariant 211–252, 2015.
features,” in Computer vision, 1999. The proceedings of the [53] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona,
seventh IEEE international conference on, vol. 2. Ieee, 1999, D. Ramanan, P. Dollár, and C. L. Zitnick, “Microsoft coco:
pp. 1150–1157. Common objects in context,” in European conference on
[34] ——, “Distinctive image features from scale-invariant computer vision. Springer, 2014, pp. 740–755.
keypoints,” International journal of computer vision, vol. 60, [54] M. A. Sadeghi and D. Forsyth, “30hz object detection
no. 2, pp. 91–110, 2004. with dpm v5,” in European Conference on Computer Vision.
[35] S. Belongie, J. Malik, and J. Puzicha, “Shape matching and Springer, 2014, pp. 65–79.
object recognition using shape contexts,” CALIFORNIA [55] S. Zhang, L. Wen, X. Bian, Z. Lei, and S. Z. Li, “Single-
UNIV SAN DIEGO LA JOLLA DEPT OF COMPUTER shot refinement neural network for object detection,” in
SCIENCE AND ENGINEERING, Tech. Rep., 2002. IEEE CVPR, 2018.
[36] T. Malisiewicz, A. Gupta, and A. A. Efros, “Ensemble [56] Y. Li, Y. Chen, N. Wang, and Z. Zhang, “Scale-aware
of exemplar-svms for object detection and beyond,” in trident networks for object detection,” arXiv preprint
Computer Vision (ICCV), 2011 IEEE International Conference arXiv:1901.01892, 2019.
on. IEEE, 2011, pp. 89–96. [57] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei,
[37] R. B. Girshick, P. F. Felzenszwalb, and D. A. Mcallester, “Imagenet: A large-scale hierarchical image database,” in
“Object detection with grammar models,” in Advances in Computer Vision and Pattern Recognition, 2009. CVPR 2009.
Neural Information Processing Systems, 2011, pp. 442–450. IEEE Conference on. Ieee, 2009, pp. 248–255.
[38] R. B. Girshick, From rigid templates to grammars: Object [58] I. Krasin and T. e. a. Duerig, “Openimages: A
public dataset for large-scale multi-label and multi- European Conference on Computer Vision. Springer, 2010,
class image classification.” Dataset available from pp. 591–604.
https://storage.googleapis.com/openimages/web/index.html, [75] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts
2017. of arbitrary orientations in natural images,” in 2012 IEEE
[59] P. Dollár, C. Wojek, B. Schiele, and P. Perona, “Pedestrian Conference on Computer Vision and Pattern Recognition.
detection: A benchmark,” in Computer Vision and Pattern IEEE, 2012, pp. 1083–1090.
Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, [76] A. Mishra, K. Alahari, and C. Jawahar, “Scene text recog-
2009, pp. 304–311. nition using higher order language priors,” in BMVC-
[60] P. Dollar, C. Wojek, B. Schiele, and P. Perona, “Pedestrian British Machine Vision Conference. BMVA, 2012.
detection: An evaluation of the state of the art,” IEEE [77] M. Jaderberg, K. Simonyan, A. Vedaldi, and A. Zis-
transactions on pattern analysis and machine intelligence, serman, “Synthetic data and artificial neural net-
vol. 34, no. 4, pp. 743–761, 2012. works for natural scene text recognition,” arXiv preprint
[61] A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for arXiv:1406.2227, 2014.
autonomous driving? the kitti vision benchmark suite,” [78] A. Veit, T. Matera, L. Neumann, J. Matas, and S. Be-
in Computer Vision and Pattern Recognition (CVPR), 2012 longie, “Coco-text: Dataset and benchmark for text de-
IEEE Conference on. IEEE, 2012, pp. 3354–3361. tection and recognition in natural images,” arXiv preprint
[62] S. Zhang, R. Benenson, and B. Schiele, “Citypersons: A arXiv:1601.07140, 2016.
diverse dataset for pedestrian detection,” in The IEEE [79] R. De Charette and F. Nashashibi, “Real time visual
Conference on Computer Vision and Pattern Recognition traffic lights recognition based on spot light detection and
(CVPR), vol. 1, no. 2, 2017, p. 3. adaptive traffic lights templates,” in Intelligent Vehicles
[63] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, Symposium, 2009 IEEE. IEEE, 2009, pp. 358–363.
R. Benenson, U. Franke, S. Roth, and B. Schiele, “The [80] A. Møgelmose, M. M. Trivedi, and T. B. Moeslund,
cityscapes dataset for semantic urban scene understand- “Vision-based traffic sign detection and analysis for in-
ing,” in Proceedings of the IEEE conference on computer telligent driver assistance systems: Perspectives and sur-
vision and pattern recognition, 2016, pp. 3213–3223. vey.” IEEE Trans. Intelligent Transportation Systems, vol. 13,
[64] M. Braun, S. Krebs, F. Flohr, and D. M. Gavrila, “The no. 4, pp. 1484–1497, 2012.
eurocity persons dataset: A novel benchmark for object [81] S. Houben, J. Stallkamp, J. Salmen, M. Schlipsing, and
detection,” arXiv preprint arXiv:1805.07193, 2018. C. Igel, “Detection of traffic signs in real-world images:
[65] V. Jain and E. Learned-Miller, “Fddb: A benchmark The german traffic sign detection benchmark,” in Neural
for face detection in unconstrained settings,” Technical Networks (IJCNN), The 2013 International Joint Conference
Report UM-CS-2010-009, University of Massachusetts, on. IEEE, 2013, pp. 1–8.
Amherst, Tech. Rep., 2010. [82] R. Timofte, K. Zimmermann, and L. Van Gool, “Multi-
[66] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof, view traffic sign detection, recognition, and 3d localisa-
“Annotated facial landmarks in the wild: A large-scale, tion,” Machine vision and applications, vol. 25, no. 3, pp.
real-world database for facial landmark localization,” in 633–647, 2014.
Computer Vision Workshops (ICCV Workshops), 2011 IEEE [83] Z. Zhu, D. Liang, S. Zhang, X. Huang, B. Li, and S. Hu,
International Conference on. IEEE, 2011, pp. 2144–2151. “Traffic-sign detection and classification in the wild,” in
[67] B. F. Klare, B. Klein, E. Taborsky, A. Blanton, J. Cheney, Proceedings of the IEEE Conference on Computer Vision and
K. Allen, P. Grother, A. Mah, and A. K. Jain, “Pushing the Pattern Recognition, 2016, pp. 2110–2118.
frontiers of unconstrained face detection and recognition: [84] K. Behrendt, L. Novak, and R. Botros, “A deep learning
Iarpa janus benchmark a,” in Proceedings of the IEEE approach to traffic lights: Detection, tracking, and clas-
conference on computer vision and pattern recognition, 2015, sification,” in Robotics and Automation (ICRA), 2017 IEEE
pp. 1931–1939. International Conference on. IEEE, 2017, pp. 1370–1377.
[68] S. Yang, P. Luo, C.-C. Loy, and X. Tang, “Wider face: [85] G. Heitz and D. Koller, “Learning spatial context: Using
A face detection benchmark,” in Proceedings of the IEEE stuff to find things,” in European conference on computer
conference on computer vision and pattern recognition, 2016, vision. Springer, 2008, pp. 30–43.
pp. 5525–5533. [86] F. Tanner, B. Colder, C. Pullen, D. Heagy, M. Eppolito,
[69] H. Nada, V. A. Sindagi, H. Zhang, and V. M. Patel, V. Carlan, C. Oertel, and P. Sallee, “Overhead imagery
“Pushing the limits of unconstrained face detection: a research data setan annotated data library & tools to aid
challenge dataset and baseline results,” arXiv preprint in the development of computer vision algorithms,” in
arXiv:1804.10275, 2018. 2009 IEEE Applied Imagery Pattern Recognition Workshop
[70] M. K. Yucel, Y. C. Bilge, O. Oguz, N. Ikizler-Cinbis, (AIPR 2009). IEEE, 2009, pp. 1–8.
P. Duygulu, and R. G. Cinbis, “Wildest faces: Face de- [87] K. Liu and G. Mattyus, “Fast multiclass vehicle detec-
tection and recognition in violent settings,” arXiv preprint tion on aerial images.” IEEE Geosci. Remote Sensing Lett.,
arXiv:1805.07566, 2018. vol. 12, no. 9, pp. 1938–1942, 2015.
[71] S. M. Lucas, A. Panaretos, L. Sosa, A. Tang, S. Wong, and [88] H. Zhu, X. Chen, W. Dai, K. Fu, Q. Ye, and J. Jiao, “Ori-
R. Young, “Icdar 2003 robust reading competitions,” in entation robust object detection in aerial images using
null. IEEE, 2003, p. 682. deep convolutional neural network,” in Image Processing
[72] D. Karatzas, L. Gomez-Bigorda, A. Nicolaou, S. Ghosh, (ICIP), 2015 IEEE International Conference on. IEEE, 2015,
A. Bagdanov, M. Iwamura, J. Matas, L. Neumann, V. R. pp. 3735–3739.
Chandrasekhar, S. Lu et al., “Icdar 2015 competition on [89] S. Razakarivony and F. Jurie, “Vehicle detection in aerial
robust reading,” in Document Analysis and Recognition imagery: A small target detection benchmark,” Journal of
(ICDAR), 2015 13th International Conference on. IEEE, Visual Communication and Image Representation, vol. 34, pp.
2015, pp. 1156–1160. 187–203, 2016.
[73] B. Shi, C. Yao, M. Liao, M. Yang, P. Xu, L. Cui, S. Belongie, [90] G. Cheng and J. Han, “A survey on object detection in
S. Lu, and X. Bai, “Icdar2017 competition on reading optical remote sensing images,” ISPRS Journal of Pho-
chinese text in the wild (rctw-17),” in Document Analysis togrammetry and Remote Sensing, vol. 117, pp. 11–28, 2016.
and Recognition (ICDAR), 2017 14th IAPR International [91] Z. Zou and Z. Shi, “Random access memories: A new
Conference on, vol. 1. IEEE, 2017, pp. 1429–1434. paradigm for target detection in high resolution aerial
[74] K. Wang and S. Belongie, “Word spotting in the wild,” in remote sensing images,” IEEE Transactions on Image Pro-
cessing, vol. 27, no. 3, pp. 1100–1111, 2018. and A. L. Yuille, “Semantic image segmentation with
[92] G.-S. Xia, X. Bai, J. Ding, Z. Zhu, S. Belongie, J. Luo, deep convolutional nets and fully connected crfs,” arXiv
M. Datcu, M. Pelillo, and L. Zhang, “Dota: A large-scale preprint arXiv:1412.7062, 2014.
dataset for object detection in aerial images,” in Proc. [112] C. Garcia and M. Delakis, “A neural architecture for fast
CVPR, 2018. and robust face detection,” in Pattern Recognition, 2002.
[93] D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, Proceedings. 16th International Conference on, vol. 2. IEEE,
M. Klaric, Y. Bulatov, and B. McCord, “xview: Ob- 2002, pp. 44–47.
jects in context in overhead imagery,” arXiv preprint [113] M. Osadchy, M. L. Miller, and Y. L. Cun, “Synergistic face
arXiv:1802.07856, 2018. detection and pose estimation with energy-based mod-
[94] K. Oksuz, B. C. Cam, E. Akbas, and S. Kalkan, “Local- els,” in Advances in Neural Information Processing Systems,
ization recall precision (lrp): A new performance metric 2005, pp. 1017–1024.
for object detection,” in European Conference on Computer [114] S. J. Nowlan and J. C. Platt, “A convolutional neural
Vision (ECCV), vol. 6, 2018. network hand tracker,” Advances in neural information
[95] M. Turk and A. Pentland, “Eigenfaces for recognition,” processing systems, pp. 901–908, 1995.
Journal of cognitive neuroscience, vol. 3, no. 1, pp. 71–86, [115] T. Malisiewicz, Exemplar-based representations for object
1991. detection, association and beyond. Carnegie Mellon Uni-
[96] R. Vaillant, C. Monrocq, and Y. Le Cun, “Original ap- versity, 2011.
proach for the localisation of objects in images,” IEE [116] B. Alexe, T. Deselaers, and V. Ferrari, “What is an object?”
Proceedings-Vision, Image and Signal Processing, vol. 141, in Computer Vision and Pattern Recognition (CVPR), 2010
no. 4, pp. 245–250, 1994. IEEE Conference on. IEEE, 2010, pp. 73–80.
[97] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient- [117] J. R. Uijlings, K. E. Van De Sande, T. Gevers, and A. W.
based learning applied to document recognition,” Pro- Smeulders, “Selective search for object recognition,” In-
ceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998. ternational journal of computer vision, vol. 104, no. 2, pp.
[98] I. Biederman, “Recognition-by-components: a theory 154–171, 2013.
of human image understanding.” Psychological review, [118] J. Carreira and C. Sminchisescu, “Constrained parametric
vol. 94, no. 2, p. 115, 1987. min-cuts for automatic object segmentation,” in Computer
[99] M. A. Fischler and R. A. Elschlager, “The representation Vision and Pattern Recognition (CVPR), 2010 IEEE Confer-
and matching of pictorial structures,” IEEE Transactions ence on. IEEE, 2010, pp. 3241–3248.
on computers, vol. 100, no. 1, pp. 67–92, 1973. [119] P. Arbeláez, J. Pont-Tuset, J. T. Barron, F. Marques, and
[100] B. Leibe, A. Leonardis, and B. Schiele, “Robust object J. Malik, “Multiscale combinatorial grouping,” in Proceed-
detection with interleaved categorization and segmenta- ings of the IEEE conference on computer vision and pattern
tion,” International journal of computer vision, vol. 77, no. recognition, 2014, pp. 328–335.
1-3, pp. 259–289, 2008. [120] B. Alexe, T. Deselaers, and V. Ferrari, “Measuring the ob-
[101] D. M. Gavrila and V. Philomin, “Real-time object detec- jectness of image windows,” IEEE transactions on pattern
tion for” smart” vehicles,” in Computer Vision, 1999. The analysis and machine intelligence, vol. 34, no. 11, pp. 2189–
Proceedings of the Seventh IEEE International Conference on, 2202, 2012.
vol. 1. IEEE, 1999, pp. 87–93. [121] M.-M. Cheng, Z. Zhang, W.-Y. Lin, and P. Torr, “Bing:
[102] B. Wu and R. Nevatia, “Detection of multiple, partially Binarized normed gradients for objectness estimation at
occluded humans in a single image by bayesian combi- 300fps,” in Proceedings of the IEEE conference on computer
nation of edgelet part detectors,” in null. IEEE, 2005, pp. vision and pattern recognition, 2014, pp. 3286–3293.
90–97. [122] C. L. Zitnick and P. Dollár, “Edge boxes: Locating object
[103] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, proposals from edges,” in European conference on computer
and Y. LeCun, “Overfeat: Integrated recognition, localiza- vision. Springer, 2014, pp. 391–405.
tion and detection using convolutional networks,” arXiv [123] C. Szegedy, S. Reed, D. Erhan, D. Anguelov, and S. Ioffe,
preprint arXiv:1312.6229, 2013. “Scalable, high-quality object detection,” arXiv preprint
[104] C. Szegedy, A. Toshev, and D. Erhan, “Deep neural net- arXiv:1412.1441, 2014.
works for object detection,” in Advances in neural informa- [124] D. Erhan, C. Szegedy, A. Toshev, and D. Anguelov, “Scal-
tion processing systems, 2013, pp. 2553–2561. able object detection using deep neural networks,” in
[105] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos, “A uni- Proceedings of the IEEE Conference on Computer Vision and
fied multi-scale deep convolutional neural network for Pattern Recognition, 2014, pp. 2147–2154.
fast object detection,” in European Conference on Computer [125] A. Ghodrati, A. Diba, M. Pedersoli, T. Tuytelaars, and
Vision. Springer, 2016, pp. 354–370. L. Van Gool, “Deepproposal: Hunting objects by cas-
[106] A. Pentland, B. Moghaddam, T. Starner et al., “View- cading deep convolutional layers,” in Proceedings of the
based and modular eigenspaces for face recognition,” IEEE International Conference on Computer Vision, 2015, pp.
1994. 2578–2586.
[107] G. Yang and T. S. Huang, “Human face detection in a [126] W. Kuo, B. Hariharan, and J. Malik, “Deepbox: Learning
complex background,” Pattern recognition, vol. 27, no. 1, objectness with convolutional networks,” in Proceedings of
pp. 53–63, 1994. the IEEE International Conference on Computer Vision, 2015,
[108] I. Craw, D. Tock, and A. Bennett, “Finding face features,” pp. 2479–2487.
in European Conference on Computer Vision. Springer, 1992, [127] S. Gidaris and N. Komodakis, “Attend refine repeat:
pp. 92–96. Active box proposal generation via in-out localization,”
[109] R. Xiao, L. Zhu, and H.-J. Zhang, “Boosting chain learn- arXiv preprint arXiv:1606.04446, 2016.
ing for object detection,” in Computer Vision, 2003. Pro- [128] H. Li, Y. Liu, W. Ouyang, and X. Wang, “Zoom out-and-
ceedings. Ninth IEEE International Conference on. IEEE, in network with recursive training for object proposal,”
2003, pp. 709–715. arXiv preprint arXiv:1702.05711, 2017.
[110] J. Long, E. Shelhamer, and T. Darrell, “Fully convolu- [129] J. Hosang, R. Benenson, P. Dollár, and B. Schiele, “What
tional networks for semantic segmentation,” in Proceed- makes for effective detection proposals?” IEEE transac-
ings of the IEEE conference on computer vision and pattern tions on pattern analysis and machine intelligence, vol. 38,
recognition, 2015, pp. 3431–3440. no. 4, pp. 814–830, 2016.
[111] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, [130] J. Hosang, R. Benenson, and B. Schiele, “How
good are detection proposals, really?” arXiv preprint S. Yan, “Attentive contexts for object detection,” IEEE
arXiv:1406.6962, 2014. Transactions on Multimedia, vol. 19, no. 5, pp. 944–954,
[131] J. Carreira and C. Sminchisescu, “Cpmc: Automatic object 2017.
segmentation using constrained parametric min-cuts,” [150] Q. Chen, Z. Song, J. Dong, Z. Huang, Y. Hua, and
IEEE Transactions on Pattern Analysis & Machine Intelli- S. Yan, “Contextualizing object detection and classifica-
gence, no. 7, pp. 1312–1328, 2011. tion,” IEEE transactions on pattern analysis and machine
[132] N. Chavali, H. Agrawal, A. Mahendru, and D. Batra, intelligence, vol. 37, no. 1, pp. 13–27, 2015.
“Object-proposal evaluation protocol is’ gameable’,” in [151] S. Gupta, B. Hariharan, and J. Malik, “Exploring person
Proceedings of the IEEE conference on computer vision and context and local scene context for object detection,”
pattern recognition, 2016, pp. 835–844. arXiv preprint arXiv:1511.08177, 2015.
[133] K. Lenc and A. Vedaldi, “R-cnn minus r,” arXiv preprint [152] X. Chen and A. Gupta, “Spatial memory for con-
arXiv:1506.06981, 2015. text reasoning in object detection,” arXiv preprint
[134] P.-A. Savalle, S. Tsogkas, G. Papandreou, and I. Kokkinos, arXiv:1704.04224, 2017.
“Deformable part models with cnn features,” in European [153] Y. Liu, R. Wang, S. Shan, and X. Chen, “Structure infer-
Conference on Computer Vision, Parts and Attributes Work- ence net: Object detection using scene-level context and
shop, 2014. instance-level relationships,” in Proceedings of the IEEE
[135] N. Zhang, J. Donahue, R. Girshick, and T. Darrell, “Part- Conference on Computer Vision and Pattern Recognition,
based r-cnns for fine-grained category detection,” in Eu- 2018, pp. 6985–6994.
ropean conference on computer vision. Springer, 2014, pp. [154] J. H. Hosang, R. Benenson, and B. Schiele, “Learning non-
834–849. maximum suppression.” in CVPR, 2017, pp. 6469–6477.
[136] L. Wan, D. Eigen, and R. Fergus, “End-to-end integration [155] P. Henderson and V. Ferrari, “End-to-end training of
of a convolution network, deformable parts model and object class detectors for mean average precision,” in
non-maximum suppression,” in Proceedings of the IEEE Asian Conference on Computer Vision. Springer, 2016, pp.
Conference on Computer Vision and Pattern Recognition, 198–213.
2015, pp. 851–859. [156] R. Rothe, M. Guillaumin, and L. Van Gool, “Non-
[137] R. Girshick, F. Iandola, T. Darrell, and J. Malik, “De- maximum suppression for object detection by passing
formable part models are convolutional neural net- messages between windows,” in Asian Conference on Com-
works,” in Proceedings of the IEEE conference on Computer puter Vision. Springer, 2014, pp. 290–306.
Vision and Pattern Recognition, 2015, pp. 437–446. [157] D. Mrowca, M. Rohrbach, J. Hoffman, R. Hu, K. Saenko,
[138] B. Li, T. Wu, S. Shao, L. Zhang, and R. Chu, “Object and T. Darrell, “Spatial semantic regularisation for large
detection via end-to-end integration of aspect ratio and scale object detection,” in Proceedings of the IEEE interna-
context aware part-based models and fully convolutional tional conference on computer vision, 2015, pp. 2003–2011.
networks,” arXiv preprint arXiv:1612.00534, 2016. [158] N. Bodla, B. Singh, R. Chellappa, and L. S. Davis, “Soft-
[139] A. Torralba and P. Sinha, “Detecting faces in impov- nmsimproving object detection with one line of code,” in
erished images,” MASSACHUSETTS INST OF TECH Computer Vision (ICCV), 2017 IEEE International Conference
CAMBRIDGE ARTIFICIAL INTELLIGENCE LAB, Tech. on. IEEE, 2017, pp. 5562–5570.
Rep., 2001. [159] L. Tychsen-Smith and L. Petersson, “Improving object
[140] S. Zagoruyko, A. Lerer, T.-Y. Lin, P. O. Pinheiro, S. Gross, localization with fitness nms and bounded iou loss,”
S. Chintala, and P. Dollár, “A multipath network for arXiv preprint arXiv:1711.00164, 2017.
object detection,” arXiv preprint arXiv:1604.02135, 2016. [160] S. K. Divvala, D. Hoiem, J. H. Hays, A. A. Efros, and
[141] X. Zeng, W. Ouyang, B. Yang, J. Yan, and X. Wang, “Gated M. Hebert, “An empirical study of context in object de-
bi-directional cnn for object detection,” in European Con- tection,” in Computer Vision and Pattern Recognition, 2009.
ference on Computer Vision. Springer, 2016, pp. 354–369. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 1271–
[142] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, 1278.
Y. Zhou, B. Yang, Z. Wang et al., “Crafting gbd-net for [161] C. Chen, M.-Y. Liu, O. Tuzel, and J. Xiao, “R-cnn for small
object detection,” IEEE transactions on pattern analysis and object detection,” in Asian conference on computer vision.
machine intelligence, vol. 40, no. 9, pp. 2109–2123, 2018. Springer, 2016, pp. 214–230.
[143] W. Ouyang, K. Wang, X. Zhu, and X. Wang, “Learning [162] H. Hu, J. Gu, Z. Zhang, J. Dai, and Y. Wei, “Relation
chained deep features and classifiers for cascade in object networks for object detection,” in Computer Vision and
detection,” arXiv preprint arXiv:1702.07054, 2017. Pattern Recognition (CVPR), vol. 2, no. 3, 2018.
[144] S. Gidaris and N. Komodakis, “Object detection via [163] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang, “Acquisition
a multi-region and semantic segmentation-aware cnn of localization confidence for accurate object detection,”
model,” in Proceedings of the IEEE International Conference in Proceedings of the European Conference on Computer Vi-
on Computer Vision, 2015, pp. 1134–1142. sion, Munich, Germany, 2018, pp. 8–14.
[145] Y. Zhu, C. Zhao, J. Wang, X. Zhao, Y. Wu, H. Lu et al., [164] H. A. Rowley, S. Baluja, and T. Kanade, “Human face de-
“Couplenet: Coupling global structure with local parts tection in visual scenes,” in Advances in Neural Information
for object detection,” in Proc. of Intl Conf. on Computer Processing Systems, 1996, pp. 875–881.
Vision (ICCV), vol. 2, 2017. [165] L. Zhang, L. Lin, X. Liang, and K. He, “Is faster r-
[146] C. Desai, D. Ramanan, and C. C. Fowlkes, “Discrimina- cnn doing well for pedestrian detection?” in European
tive models for multi-class object layout,” International Conference on Computer Vision. Springer, 2016, pp. 443–
journal of computer vision, vol. 95, no. 1, pp. 1–12, 2011. 457.
[147] Z. Li, Y. Chen, G. Yu, and Y. Deng, “R-fcn++: Towards [166] A. Shrivastava, A. Gupta, and R. Girshick, “Training
accurate region-based fully convolutional networks for region-based object detectors with online hard example
object detection.” in AAAI, 2018. mining,” in Proceedings of the IEEE Conference on Computer
[148] S. Bell, C. Lawrence Zitnick, K. Bala, and R. Girshick, Vision and Pattern Recognition, 2016, pp. 761–769.
“Inside-outside net: Detecting objects in context with skip [167] T. Tang, S. Zhou, Z. Deng, H. Zou, and L. Lei, “Vehicle
pooling and recurrent neural networks,” in Proceedings of detection in aerial images based on region convolutional
the IEEE conference on computer vision and pattern recogni- neural networks and hard negative example mining,”
tion, 2016, pp. 2874–2883. Sensors, vol. 17, no. 2, p. 336, 2017.
[149] J. Li, Y. Wei, X. Liang, J. Dong, T. Xu, J. Feng, and [168] X. Sun, P. Wu, and S. C. Hoi, “Face detection using deep
learning: An improved faster rcnn approach,” Neurocom- tems, 1990, pp. 598–605.
puting, vol. 299, pp. 42–50, 2018. [187] S. Han, H. Mao, and W. J. Dally, “Deep compres-
[169] J. Jin, K. Fu, and C. Zhang, “Traffic sign recognition with sion: Compressing deep neural networks with pruning,
hinge loss trained convolutional neural networks,” IEEE trained quantization and huffman coding,” arXiv preprint
Transactions on Intelligent Transportation Systems, vol. 15, arXiv:1510.00149, 2015.
no. 5, pp. 1991–2000, 2014. [188] H. Li, A. Kadav, I. Durdanovic, H. Samet, and H. P. Graf,
[170] M. Zhou, M. Jing, D. Liu, Z. Xia, Z. Zou, and Z. Shi, “Pruning filters for efficient convnets,” arXiv preprint
“Multi-resolution networks for ship detection in infrared arXiv:1608.08710, 2016.
remote sensing images,” Infrared Physics & Technology, [189] G. Huang, S. Liu, L. van der Maaten, and K. Q.
2018. Weinberger, “Condensenet: An efficient densenet using
[171] P. Dollár, Z. Tu, P. Perona, and S. Belongie, “Integral learned group convolutions,” group, vol. 3, no. 12, p. 11,
channel features,” 2009. 2017.
[172] P. Dollár, R. Appel, S. Belongie, and P. Perona, “Fast [190] M. Rastegari, V. Ordonez, J. Redmon, and A. Farhadi,
feature pyramids for object detection,” IEEE Transactions “Xnor-net: Imagenet classification using binary convolu-
on Pattern Analysis and Machine Intelligence, vol. 36, no. 8, tional neural networks,” in European Conference on Com-
pp. 1532–1545, 2014. puter Vision. Springer, 2016, pp. 525–542.
[173] R. Benenson, M. Mathias, R. Timofte, and L. Van Gool, [191] X. Lin, C. Zhao, and W. Pan, “Towards accurate binary
“Pedestrian detection at 100 frames per second,” in Com- convolutional neural network,” in Advances in Neural
puter Vision and Pattern Recognition (CVPR), 2012 IEEE Information Processing Systems, 2017, pp. 345–353.
Conference on. IEEE, 2012, pp. 2903–2910. [192] I. Hubara, M. Courbariaux, D. Soudry, R. El-Yaniv, and
[174] S. Maji, A. C. Berg, and J. Malik, “Classification using Y. Bengio, “Binarized neural networks,” in Advances in
intersection kernel support vector machines is efficient,” neural information processing systems, 2016, pp. 4107–4115.
in Computer Vision and Pattern Recognition, 2008. CVPR [193] G. Hinton, O. Vinyals, and J. Dean, “Distilling
2008. IEEE Conference on. IEEE, 2008, pp. 1–8. the knowledge in a neural network,” arXiv preprint
[175] A. Vedaldi and A. Zisserman, “Sparse kernel approxima- arXiv:1503.02531, 2015.
tions for efficient classification and detection,” in Com- [194] A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta,
puter Vision and Pattern Recognition (CVPR), 2012 IEEE and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv
Conference on. IEEE, 2012, pp. 2320–2327. preprint arXiv:1412.6550, 2014.
[176] F. Fleuret and D. Geman, “Coarse-to-fine face detection,” [195] G. Chen, W. Choi, X. Yu, T. Han, and M. Chandraker,
International Journal of computer vision, vol. 41, no. 1-2, pp. “Learning efficient object detection models with knowl-
85–107, 2001. edge distillation,” in Advances in Neural Information Pro-
[177] Q. Zhu, M.-C. Yeh, K.-T. Cheng, and S. Avidan, “Fast hu- cessing Systems, 2017, pp. 742–751.
man detection using a cascade of histograms of oriented [196] Q. Li, S. Jin, and J. Yan, “Mimicking very efficient network
gradients,” in Computer Vision and Pattern Recognition, for object detection,” in 2017 IEEE Conference on Computer
2006 IEEE Computer Society Conference on, vol. 2. IEEE, Vision and Pattern Recognition (CVPR). IEEE, 2017, pp.
2006, pp. 1491–1498. 7341–7349.
[178] A. Vedaldi, V. Gulshan, M. Varma, and A. Zisserman, [197] K. He and J. Sun, “Convolutional neural networks at con-
“Multiple kernels for object detection,” in Computer Vi- strained time cost,” in Proceedings of the IEEE conference
sion, 2009 IEEE 12th International Conference on. IEEE, on computer vision and pattern recognition, 2015, pp. 5353–
2009, pp. 606–613. 5360.
[179] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua, “A con- [198] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wo-
volutional neural network cascade for face detection,” in jna, “Rethinking the inception architecture for computer
Proceedings of the IEEE Conference on Computer Vision and vision,” in Proceedings of the IEEE conference on computer
Pattern Recognition, 2015, pp. 5325–5334. vision and pattern recognition, 2016, pp. 2818–2826.
[180] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao, “Joint face [199] C. Peng, X. Zhang, G. Yu, G. Luo, and J. Sun, “Large
detection and alignment using multitask cascaded convo- kernel mattersimprove semantic segmentation by global
lutional networks,” IEEE Signal Processing Letters, vol. 23, convolutional network,” in Computer Vision and Pattern
no. 10, pp. 1499–1503, 2016. Recognition (CVPR), 2017 IEEE Conference on. IEEE, 2017,
[181] Z. Cai, M. Saberian, and N. Vasconcelos, “Learning pp. 1743–1751.
complexity-aware cascades for deep pedestrian detec- [200] K.-H. Kim, S. Hong, B. Roh, Y. Cheon, and M. Park,
tion,” in Proceedings of the IEEE International Conference “Pvanet: deep but lightweight neural networks for real-
on Computer Vision, 2015, pp. 3361–3369. time object detection,” arXiv preprint arXiv:1608.08021,
[182] B. Yang, J. Yan, Z. Lei, and S. Z. Li, “Craft objects from 2016.
images,” in Proceedings of the IEEE Conference on Computer [201] X. Zhang, J. Zou, X. Ming, K. He, and J. Sun, “Efficient
Vision and Pattern Recognition, 2016, pp. 6043–6051. and accurate approximations of nonlinear convolutional
[183] F. Yang, W. Choi, and Y. Lin, “Exploit all the layers: Fast networks,” in Proceedings of the IEEE Conference on Com-
and accurate cnn object detector with scale dependent puter Vision and Pattern Recognition, 2015, pp. 1984–1992.
pooling and cascaded rejection classifiers,” in Proceedings [202] X. Zhang, J. Zou, K. He, and J. Sun, “Accelerating very
of the IEEE conference on computer vision and pattern recog- deep convolutional networks for classification and de-
nition, 2016, pp. 2129–2137. tection,” IEEE transactions on pattern analysis and machine
[184] M. Gao, R. Yu, A. Li, V. I. Morariu, and L. S. Davis, intelligence, vol. 38, no. 10, pp. 1943–1955, 2016.
“Dynamic zoom-in network for fast object detection in [203] X. Zhang, X. Zhou, M. Lin, and J. Sun, “Shufflenet:
large images,” in IEEE Conference on Computer Vision and An extremely efficient convolutional neural network for
Pattern Recognition (CVPR), 2018. mobile devices,” 2017.
[185] W. Ouyang, K. Wang, X. Zhu, and X. Wang, “Chained [204] F. Chollet, “Xception: Deep learning with depthwise
cascade network for object detection,” in Computer Vision separable convolutions,” arXiv preprint, pp. 1610–02 357,
(ICCV), 2017 IEEE International Conference on. IEEE, 2017, 2017.
pp. 1956–1964. [205] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko,
[186] Y. LeCun, J. S. Denker, and S. A. Solla, “Optimal brain W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mo-
damage,” in Advances in neural information processing sys- bilenets: Efficient convolutional neural networks for mo-
bile vision applications,” arXiv preprint arXiv:1704.04861, fbfft: A gpu performance evaluation,” arXiv preprint
2017. arXiv:1412.7580, 2014.
[206] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and [225] O. Rippel, J. Snoek, and R. P. Adams, “Spectral represen-
L.-C. Chen, “Mobilenetv2: Inverted residuals and linear tations for convolutional neural networks,” in Advances in
bottlenecks,” in 2018 IEEE/CVF Conference on Computer neural information processing systems, 2015, pp. 2449–2457.
Vision and Pattern Recognition. IEEE, 2018, pp. 4510–4520. [226] C. Dubout and F. Fleuret, “Exact acceleration of linear ob-
[207] Y. Li, J. Li, W. Lin, and J. Li, “Tiny-dsod: Lightweight ject detectors,” in European Conference on Computer Vision.
object detection for resource-restricted usages,” arXiv Springer, 2012, pp. 301–311.
preprint arXiv:1807.11013, 2018. [227] M. A. Sadeghi and D. Forsyth, “Fast template evaluation
[208] G. E. Hinton and R. R. Salakhutdinov, “Reducing the with vector quantization,” in Advances in neural informa-
dimensionality of data with neural networks,” science, tion processing systems, 2013, pp. 2949–2957.
vol. 313, no. 5786, pp. 504–507, 2006. [228] I. Kokkinos, “Bounding part scores for rapid detection
[209] R. J. Wang, X. Li, S. Ao, and C. X. Ling, “Pelee: A real-time with deformable part models,” in European Conference on
object detection system on mobile devices,” arXiv preprint Computer Vision. Springer, 2012, pp. 41–50.
arXiv:1804.06882, 2018. [229] J. Gu, Z. Wang, J. Kuen, L. Ma, A. Shahroudy, B. Shuai,
[210] F. N. Iandola, S. Han, M. W. Moskewicz, K. Ashraf, T. Liu, X. Wang, L. Wang, G. Wang et al., “Recent ad-
W. J. Dally, and K. Keutzer, “Squeezenet: Alexnet-level vances in convolutional neural networks,” arXiv preprint
accuracy with 50x fewer parameters and <0.5MB model [230] K. Simonyan and A. Zisserman, “Very deep convolu-
size,” arXiv preprint arXiv:1602.07360, 2016. [230] K. Simonyan and A. Zisserman, “Very deep convolu-
[211] B. Wu, F. N. Iandola, P. H. Jin, and K. Keutzer, tional networks for large-scale image recognition,” arXiv
“Squeezedet: Unified, small, low power fully convolu- preprint arXiv:1409.1556, 2014.
tional neural networks for real-time object detection for [231] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed,
autonomous driving.” in CVPR Workshops, 2017, pp. 446– D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich,
454. “Going deeper with convolutions,” in Proceedings of the
[212] T. Kong, A. Yao, Y. Chen, and F. Sun, “Hypernet: Towards IEEE conference on computer vision and pattern recognition,
accurate region proposal generation and joint object de- 2015, pp. 1–9.
tection,” in Proceedings of the IEEE conference on computer [232] S. Ioffe and C. Szegedy, “Batch normalization: Accelerat-
vision and pattern recognition, 2016, pp. 845–853. ing deep network training by reducing internal covariate
[213] B. Zoph, V. Vasudevan, J. Shlens, and Q. V. Le, “Learning shift,” arXiv preprint arXiv:1502.03167, 2015.
transferable architectures for scalable image recognition,” [233] C. Szegedy, S. Ioffe, V. Vanhoucke, and A. A. Alemi,