
IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, VOL. 10, NO. 2, JUNE 2009

Real-Time Pedestrian Detection and Tracking at Nighttime for Driver-Assistance Systems

Junfeng Ge, Student Member, IEEE, Yupin Luo, and Gyomei Tei

Abstract—Pedestrian detection is one of the most important components in driver-assistance systems. In this paper, we propose a monocular vision system for real-time pedestrian detection and tracking during nighttime driving with a near-infrared (NIR) camera. Three modules (region-of-interest (ROI) generation, object classification, and tracking) are integrated in a cascade, and each utilizes complementary visual features to distinguish the objects from the cluttered background in the range of 20–80 m. Based on the common fact that the objects appear brighter than the nearby background in nighttime NIR images, efficient ROI generation is done based on the dual-threshold segmentation algorithm. As there is large intraclass variability in the pedestrian class, a tree-structured, two-stage detector is proposed to tackle the problem through training separate classifiers on disjoint subsets of different image sizes and arranging the classifiers based on Haar-like and histogram-of-oriented-gradients (HOG) features in a coarse-to-fine manner. To suppress the false alarms and fill the detection gaps, template-matching-based tracking is adopted, and multiframe validation is used to obtain the final results. Results from extensive tests on both urban and suburban videos indicate that the algorithm can produce a detection rate of more than 90% at the cost of about 10 false alarms/h and perform as fast as the frame rate (30 frames/s) on a Pentium IV 3.0-GHz personal computer, which also demonstrates that the proposed system is feasible for practical applications and enjoys the advantage of low implementation cost.

Index Terms—AdaBoost, histogram of oriented gradients (HOG), Kalman filter, near-infrared camera, pedestrian detection, template matching.

Manuscript received April 10, 2008; revised January 5, 2009. First published May 5, 2009; current version published June 2, 2009. The Associate Editor for this paper was R. Hammoud. J. Ge and Y. Luo are with the Tsinghua National Laboratory for Information Science and Technology, Department of Automation, Tsinghua University, Beijing 100084, China (e-mail: gejf03@mails.tsinghua.edu.cn; luo@tsinghua.edu.cn). G. Tei is with INF Technologies Ltd., Beijing 100080, China (e-mail: tei@infbj.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TITS.2009.2018961

I. INTRODUCTION

PEDESTRIANS are the most vulnerable traffic participants, because they are often seriously injured in traffic accidents, particularly at night [1]. Pedestrian detection for driver-assistance systems aims to detect these potentially dangerous situations with pedestrians in advance to either warn the driver or initiate appropriate protective measures (e.g., automatic vehicle braking) in time.

Various types of sensors, such as an ultrasonic sensor, a laser scanner, a microwave radar, and different kinds of cameras, have been employed to detect pedestrians in the literature [2]. Although active sensors like a laser scanner and radar can operate in different environments and directly output accurate depth information, it is difficult for them to distinguish pedestrians from other obstacles. Thus, video cameras are more suitable for detecting pedestrians, as they are similar to the human visual perception system and provide rich information for applying discriminative pattern recognition techniques.

However, robust and efficient vision-based pedestrian detection is a challenging task in real traffic and cluttered environments, due to the movement of cameras, the variable illumination conditions, the wide range of possible human appearances and poses, the strict performance criteria, and the hard real-time constraints.

Recently, many interesting approaches for vision-based pedestrian detection have been proposed. Most of them tend to utilize expensive far-infrared (FIR) cameras or stereo vision to facilitate the extraction of regions of interest (ROIs), because conventional background subtraction, as frequently used in surveillance applications, completely fails with moving cameras, and the sliding-window approach, which searches all possible sizes at all locations over the images, is currently too computationally intensive for real-time application. On the contrary, considering the low-cost benefit of monocular vision and its potential practical value, increasingly more research takes account of detecting pedestrians based on a monocular normal or near-infrared (NIR) camera with rough appearance cues (e.g., symmetry, intensity, and texture) for ROI generation.

In this paper, we present a real-time pedestrian detection and tracking system based on a monocular NIR camera with illumination from full-beam headlights during nighttime driving. As the camera has great visual range and high resolution, the low-cost system can detect pedestrians and bicyclists in a wider range, where the objects are 20–80 m away from the camera, in contrast with the state-of-the-art systems described in Table I, whose detection range is limited by the FIR camera or stereo vision.

The main contributions of this paper are as follows. 1) An efficient dual-threshold segmentation method is introduced for ROI generation, based on the common fact that the objects appear brighter than the background along a horizontal scan line in nighttime NIR images. 2) A tree-structured, two-stage classifier based on Haar-like and histogram-of-oriented-gradients (HOG) features is proposed to deal with the large variance in pedestrian size and appearance. 3) Template-matching-based Kalman tracking is adopted to suppress the false alarms (FAs), and multiframe validation is used to obtain the final results.

The rest of the paper is organized as follows. After reviewing the related work in Section II, we provide a brief description of the pedestrian-detection system in Section III. Then, from Section IV to Section VI, we discuss further details about each module. Experimental results that validate the proposed system are presented in Section VII. Finally, conclusions are given in Section VIII.


TABLE I. OVERVIEW OF CURRENT PEDESTRIAN DETECTION SYSTEMS. DR IS IN TERMS OF THE NUMBER OF PEDESTRIANS. THE FA PER FRAME IS ESTIMATED FROM THE ORIGINAL DATA IN THE RESPECTIVE PAPERS.

II. RELATED WORK

Generally, a vision-based pedestrian detection system can be divided into three modules, as described in [2] and [10]: ROI generation, object classification, and tracking. ROI generation is the first and important step, since it provides candidates for the following processes and directly affects the system performance. Thus, both efficiency and effectiveness are mandatory. Fortunately, the ROI generation module can be facilitated by a special hardware configuration.

Nighttime pedestrian detection is usually carried out using FIR cameras, as they provide special cues for candidate selection. In [6], the system starts with the search for hot spots assumed to be parts of the human body through a dynamic threshold of each frame and then selects upper-body and full-body candidates after getting the road information. However, pedestrians are not always brighter than the background in FIR images during the summer or when they are wearing well-insulated clothing. With the help of stereo FIR vision, Bertozzi et al. [7] exploit three different approaches for candidate generation, i.e., warm-area detection for hot spots and vertical-edge detection and disparity computation for detecting cold objects that can potentially be pedestrians.

Stereo vision is also widely used in daytime pedestrian detection. Zhao and Thorpe [3] obtain the foreground regions by clustering in the disparity space. Gavrila and Munder [4] scan the depth map with windows of appropriate sizes and take into account the ground-plane constraints to obtain the ROIs where pedestrians are likely. Alonso et al. [5] propose a candidate-selection method based on the direct computation of the 3-D coordinates of relevant points in the scene.

Considering the monocular daylight or NIR camera, which is cheaper than stereo vision and holds a better signal-to-noise ratio and resolution than the FIR camera, although it loses the depth information, there are still efficient approaches for candidate selection. Broggi et al. [11], [12] use vertical symmetry derived from gray levels and vertical gradient magnitude to select candidate regions around each relevant symmetry axis. Shashua et al. [8] obtain 75 ROIs/frame by filtering out windows based on the lack of distinctive texture properties and noncompliance with perspective constraints on the range and size of the candidates. Cao et al. [9] perform an exhaustive scan on the particular rectangular region in which the pedestrians might cause a collision with the vehicle to generate the candidates. For night videos from the NIR camera, Tian et al. [13] extract the target regions from the raw data by using an adaptive-thresholding-based image segmentation algorithm.

Once the ROIs have been obtained, different combinations of features and pattern classifiers can be applied to distinguish pedestrians from nonpedestrian candidates. The shape, appearance, and motion features are the most important cues for pedestrian detection, such as the raw image intensity [6], gradient magnitude [3], edge [4], binary head model subimage [7], [11], Haar wavelets [5], [14], Gabor filter outputs [15], local receptive fields [4], [16], motion filter output [9], [17], and HOGs [18], [19].

Referring to the classification techniques, typical approaches like template matching [4], [20], artificial neural networks (ANNs) [3], [4], support vector machines (SVMs) [5], [6], [14], [18], [19], and AdaBoost [8], [9], [17] have been adopted to find pedestrians in conjunction with the above features. Obviously, SVM is the most popular learning method and often produces accurate classifications with most of the features, but it requires heavy computation if there are lots of candidates. Conversely, AdaBoost, which combines several weak classifiers into a strong classifier using a weighted majority vote, can quickly identify the candidates with simple features but with
several more FAs. Usually, in practical applications, a single classifier cannot be expected to give effective and reliable detection.

To make use of the advantages of different classifiers, the structure of the object classification module should be well designed, and only in that way is it possible to fulfill the requirement of a high detection rate (DR) and a low FA rate while maintaining real-time operation. The cascaded structure, which arranges the classifiers based on different features, or different classifiers trained on a hybrid of features, in a two-stage or multistage coarse-to-fine manner, can greatly reduce the computational complexity and achieve high performance [4], [9], [17]. Moreover, the features of specific nonpedestrians can be applied in or before the cascade to reduce the FAs [8], [21]. Furthermore, the tree structure, whose individual classifiers are trained on manually separated subsets of the training set in accordance with poses or views, is a useful choice for dealing with the large intraclass variance of pedestrians caused by different poses or views [4], [22].

On the other hand, in view of the discriminative characteristics of the local features, part-based approaches have been utilized to reduce the complexity of pedestrian classification and help to handle occlusions and body pose variations. Shashua et al. [8] extract orientation histogram features from 13 fixed overlapping parts and use ridge regression to reduce the dimension to form compact features according to the different subregions and clustered training subsets. In [5], a two-stage classifier based on SVM is used to detect pedestrians. The first stage is trained to detect parts of the body, such as the head, torso, and legs, with discussions on selecting the best feature for each component. Individual results are integrated in the second stage to verify the pedestrian's presence. However, a high image resolution of the candidate region is required to capture sufficient information about the human components, which is impractical for detecting distant pedestrians.

Although the classification on selected candidates can have a very low false-positive rate (FPR), e.g., 0.1%, there could still be too many FAs due to the large number of ROIs during video-rate operation. Assuming that there are only 75 ROIs per frame, an FA may appear two or three times per second, which is too serious for practical use. Thus, the tracking module is indispensable in any applied system, as it is able to filter the results and greatly decrease the FA rate. Bertozzi et al. [12] propose a Kalman-filter-based tracker to reject the spurious detections and compute the trajectory of the pedestrian. In [4], a simple α–β tracker is applied on 2.5-D bounding boxes of the objects to improve the reliability of the detection results. In [5], the tracking algorithm relies on the Kalman filter to predict the spatial positions of the detected pedestrians and on Bayesian probability to provide an estimate of pedestrian classification certainty over time.

Table I summarizes the main pedestrian detection systems, distinguished by image sensor type, classification method, and detection performance. Although some of them have achieved promising results, their farthest detection distance is only around 25 m. Very recently, the detection of distant pedestrians [23] and the development of low-cost systems [9], [24], which may be the new trends, have attracted increasing attention.

III. SYSTEM OVERVIEW

The proposed system consists of the three modules mentioned in the previous section, which are denoted by ROI generation, object classification, and tracking. Fig. 1 shows the block diagram of the system.

Fig. 1. Overview of the system modules.

The videos in our system are captured by an NIR camera fixed on the ceiling of an ordinary car with illumination from full-beam headlights. The preprocessing step discards the even-indexed or odd-indexed lines to remove the blur caused by interlaced scanning of the camera and reduces the noise by a 5 × 5 Gaussian filter. Fig. 2 shows the captured image after preprocessing.

Fig. 2. Captured image after preprocessing.

The ROI generation module contains two steps: image segmentation and candidate selection. The image segmentation step adopts two thresholds for each pixel that are calculated from the horizontal scan line to determine the foreground, based on the fact that pedestrians appear brighter than the nearby background in NIR images. Then, erosion and dilation operations are applied to refine the segmentation results. Considering the large variance in pedestrian size, the original image is downsampled to half size to find the pedestrians in the near range. The main problem here is determining the parameters of the two thresholds, as described in the following section.

In the candidate-selection step, the connected regions that satisfy the scene-related size and position constraints are selected as ROIs for the subsequent classification module. Furthermore, multiple additional candidates are added to the ROI list based on the selected ROIs to compensate for the bias and errors in image segmentation.

For pedestrian classification, the main challenge is the required high performance and real-time constraint versus the
large intraclass variability in the pedestrian class. To deal with this problem, we propose a tree-structured two-stage detector based on Haar-like and HOG features to distinguish the objects from nonpedestrian candidates. Gentle AdaBoost is used to select the critical features and learn the classifiers from the training images. The classifier based on Haar-like features is used for rough classification, focusing on rejecting the nonpedestrian candidates and selecting the well-bounded candidates. As the size of the pedestrians varies over a wide range, three HOG-based classifiers are trained on three separate sets containing images of different size ranges to give precise classification. In this way, the classification complexity is reduced, which helps to improve system performance.

Although the object classification can achieve an FA rate as low as 1%, some flashing FAs still exist due to the huge number of candidates in real-time processing. To suppress the spurious detections and fill the detection gap between frames, pedestrian tracking based on the Kalman filter and template matching is adopted to filter and optimize the detection results. The tracking algorithm relies on the Kalman filter to provide a spatial estimation of the detected pedestrians, and the detection confidence in each frame is accumulated to determine the detection certainty over time. For data association, the nearest overlapped neighbor following the combined distance criterion is selected as the observation. If the nearest-neighbor method fails, template matching based on appearance is used to search for the possible observation. The tracking process is divided into two stages: pretracking and tracking. Newly detected objects enter the pretracking stage. Only after passing the multiframe validation in the pretracking stage do they start to be tracked as pedestrians by the system and be shown as output alarms.

IV. ROI GENERATION

The ROI generation module, which tries to get regions that potentially contain pedestrians, can be regarded as a rough classifier operated on the entire original image. However, most learning-based approaches are time consuming and, thus, unsuitable, even if we adopt the most efficient AdaBoost classifier based on simple Haar-like features. The hard real-time constraint means that rule-based methods are the only choice.

Different rule-based methods can be applied to different types of images. In the gray images captured at night by an NIR or normal camera, the fact that pedestrians always appear brighter than the surrounding background is usually utilized to extract the ROIs through thresholding.

A. Image Segmentation

Thresholding is the common and simple way to divide a gray image into foreground and background. Under uneven lighting conditions, the popular solution is adaptive thresholding, where different thresholds are used for different pixels or subregions in the image [25].

Generally, the adaptive threshold for each pixel is individually calculated based on its local neighborhood [13], [21]. However, in cluttered scenes, the segmented object regions may connect with the bright background and be split by the nonuniform brightness of pedestrians. Such false segmentation often makes the classification fail and decreases the DR.

Fig. 3. Analysis of a typical pedestrian area. (a) Original image. (b) Topographic surface of (a). (c) and (d) The intensity values of the scan lines marked with arrows.

To take advantage of the low computation of the thresholding method while cutting down the faults in segmentation, we propose an adaptive dual-threshold segmentation algorithm to efficiently segment the foreground. Unlike Tian et al. [13], who calculate the thresholds on a square neighborhood, we locally determine the two thresholds in horizontal scan lines and optimize the parameters by experiments.

If the pedestrians appear brighter than the surrounding background, the situation will remain the same from the view of the horizontal scan lines, even when the pedestrians have nonuniform brightness. Fig. 3 presents an example of a pedestrian with a dark upper body and a bright lower body. The two scan lines show that the pixels from the pedestrian area are brighter than the nearby background pixels on both sides of the person. Obviously, this condition is easier to satisfy than that of common adaptive thresholding algorithms based on local regions, where the pixels that belong to objects must be brighter than the background in a large square neighborhood.

Meanwhile, calculating the thresholds from the scan lines has another advantage: the algorithm is inclined to segment vertical bright regions of proper width, which not only helps to break the connection to the background but also can prevent segmenting bright regions of large horizontal size. Fig. 4(b) and (c) gives a comparison of the results from [13] and (1), where the thresholds calculated from the square neighborhood produce a large bright region on the road that does not exist in the result of the thresholds calculated from the scan lines.

However, because the brightness of the background and that of the pedestrians vary over a wide range, the 1-D signals from the scan lines are always contaminated by noise, and employing a single threshold for each pixel may easily cause a failure, as shown in Fig. 4(c). Thus, two thresholds are adopted to suppress
the noise and make the algorithm robust to the brightness variation of object regions. A high threshold (T_H) and a low one (T_L) can simply be set by the following equations, letting T_L be the minimum value for the object pixels, T_H be the maximum value for the background pixels, and T_L < T_H:

T_L(i, j) = \frac{\sum_{x=i-w}^{i+w} I(x, j)}{2w + 1} + \alpha    (1)

T_H(i, j) = T_L(i, j) + \beta    (2)

where I(i, j) indicates the intensity of the pixel (i, j). w is initially set to 12 according to the width distribution of pedestrians; α = 2, and β = 8.

The two thresholds are employed like the nonmaximum suppression step in the Canny edge detector [26]. We assume that object pixels generally have higher values than background pixels. For a given pixel I(i, j) in the image, let S(i, j) indicate the corresponding segmentation result:

S(i, j) = \begin{cases} 1, & \text{if } I(i, j) > T_H(i, j) \\ 0, & \text{if } I(i, j) < T_L(i, j) \\ 1, & \text{otherwise, if } S(i-1, j) = 1 \\ 0, & \text{otherwise, if } S(i-1, j) = 0. \end{cases}    (3)

The direction of scanning the pixels for segmentation in an image can be arbitrary; normally, it is from left to right and then from top to bottom. Fig. 4(d) shows the result of the dual-threshold algorithm, which contains less noise than Fig. 4(c).

Fig. 4. Segmentation results of different methods. (a) Original image extracted from Fig. 2. (b) Binarization result of the method in [13]. (c) Thresholding result based on (1). (d) Segmentation result of the dual-threshold algorithm using (1) and (2).

However, the thresholds without optimization tend to produce false segmentation when the object and background are both dark or both bright. In these cases, the contrast between them is small; setting β = 8 makes T_H too high to obtain a correct segmentation. Figs. 5 and 6 present two examples of this phenomenon. Therefore, T_H should be adaptive to the different values of T_L.

Fig. 5. False segmentation caused by a large T_H, when the pixel values of the pedestrian area are near the dark background. (a) Original image. (b) Intensity of the scan line, T_H, T_L, and the optimized T'_H. (c) False segmentation of T_H and T_L. (d) Acceptable result of the optimized T'_H and T_L.

Fig. 6. If both the object region and background are bright in scan lines, the original T_H will become too high for correct segmentation. (a) Original image. (b) Pixel values along the scan line, T_H, T_L, and the optimized T'_H. (c) Bad result of T_H and T_L. (d) Result of T'_H and T_L.

After extensive trials, we let β be a piecewise function of T_L. In this way, both dark and bright object regions can correctly be handled; that is, pedestrians with uniform or nonuniform brightness can properly be segmented, as shown in Figs. 5(d) and 6(d). The final optimized T'_H can be calculated using the following equations:

T'_H(i, j) = \max\{T_1(i, j), T_L(i, j)\}
T_1(i, j) = \min\{T_2(i, j), 230\}
T_2(i, j) = \min\{T_3(i, j), T_L(i, j) + 8\}
T_3(i, j) = \max\{1.06 \times (T_L(i, j) - \alpha),\; T_L(i, j) + 2\}.    (4)

For the threshold T_L, the key parameter is w, which will be optimized according to the instance-based DR defined in Section VII. However, as the size of nearby pedestrians is several times larger than that of distant ones, a globally optimized w may still be poor for the segmentation of pedestrians in the near range. The original image is downsampled to half size to deal with this defect. Moreover, the morphological operations of erosion and dilation with a specified mask of 5 × 5 pixels are applied to remove noise pixels and split the weakly connected regions, which is beneficial to correct segmentation in cluttered scenes.
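To make the scan-line procedure concrete, the following Python sketch mirrors (1)–(4): the local-mean low threshold T_L, the piecewise optimized high threshold T'_H, and the hysteresis-style labeling of (3), applied to a synthetic demo image. The function name dual_threshold_segment and the NumPy implementation are ours; only the constants (w, α, and the limits in (4)) come from the text, so it should be read as a minimal illustrative sketch rather than the authors' implementation, which additionally applies the 5 × 5 morphological refinement and half-size downsampling.

```python
import numpy as np

def dual_threshold_segment(img, w=11, alpha=2):
    """Segment bright, roughly vertical structures one horizontal scan line at a time."""
    img = img.astype(np.float64)
    height, width = img.shape
    seg = np.zeros((height, width), dtype=np.uint8)
    kernel = np.ones(2 * w + 1) / (2 * w + 1)
    for j in range(height):                      # j indexes scan lines (image rows)
        row = img[j]
        # Low threshold T_L: local mean over a (2w+1)-pixel window plus alpha, Eq. (1).
        local_mean = np.convolve(np.pad(row, w, mode='edge'), kernel, mode='valid')
        t_low = local_mean + alpha
        # Optimized high threshold T_H', Eq. (4); it replaces the fixed beta of Eq. (2).
        t3 = np.maximum(1.06 * (t_low - alpha), t_low + 2)
        t2 = np.minimum(t3, t_low + 8)
        t1 = np.minimum(t2, 230)
        t_high = np.maximum(t1, t_low)
        # Hysteresis-style labeling along the scan line, Eq. (3).
        label = 0
        for i in range(width):
            if row[i] > t_high[i]:
                label = 1
            elif row[i] < t_low[i]:
                label = 0
            # otherwise: keep the label of the previously scanned pixel on this line
            seg[j, i] = label
    return seg

if __name__ == "__main__":
    demo = np.random.randint(0, 180, (60, 120))   # synthetic background
    demo[20:55, 50:60] += 70                      # a bright, pedestrian-like vertical stripe
    mask = dual_threshold_segment(demo)
    print(mask.sum(), "foreground pixels")
```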
Fig. 7. Distribution of pedestrians' aspect ratios.

B. Candidate Selection

Although the adaptive dual-threshold algorithm gives a satisfying segmentation, it also produces a large number of nonpedestrian candidates. In particular, some noise pixels can be segmented as foreground regions of small size. Furthermore, a lot of objects with high reflectance, as well as light sources, are also selected as pedestrians. We should take measures to reject these nonhuman subimages before the object classification module. Moreover, considering the low accuracy of the candidate bounding boxes caused by segmentation bias and errors, it is helpful to regenerate several candidates near the filtered ones.

1) Candidate Filtering: Scene-related prior knowledge is a good choice for filtering the nonpedestrian candidates. Even if we do not know the camera's internal and external parameters, the possible size and position of pedestrians can be estimated from the labeled ground truth, and the corresponding limit values can be applied to remove the nonpedestrian foreground regions.

In fact, among the segmented candidates, most of the nonhuman regions differ from the pedestrian areas in size and position features such as height, width, aspect ratio, and the height limits at different distances (that is, at different vertical positions in images) [27]. All these features can be used to design simple filters to reject the unwanted candidates.

Take the aspect ratio, for example: only histogram-based statistics are required. Fig. 7 shows the distribution of the aspect ratios of the positive samples for training. To preserve no less than 99% of the pedestrian samples, the threshold range for the aspect-ratio filter can be set to 1.3–3.8; after being relaxed for tolerance of unexpected cases, the range of the pedestrian aspect ratio is set to 1–4. Other parameters for filtering the candidates can be selected in a similar way.
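As a sketch of such rule-based filtering, the snippet below keeps a candidate only when its height-to-width ratio lies in the relaxed 1–4 range mentioned above and its height lies within plausible bounds. The Box container, the helper name keep_candidate, and the concrete height limits are illustrative assumptions, not the values actually tuned in the paper.

```python
from dataclasses import dataclass

@dataclass
class Box:
    x: int   # left column of the bounding box
    y: int   # top row (larger y means lower in the image, i.e., a closer object)
    w: int   # width in pixels
    h: int   # height in pixels

def keep_candidate(box, min_height=10, max_height=240,
                   min_aspect=1.0, max_aspect=4.0):
    """Return True if the box satisfies the size and aspect-ratio constraints."""
    if box.w <= 0 or box.h <= 0:
        return False
    aspect = box.h / box.w                 # pedestrians are taller than they are wide
    if not (min_aspect <= aspect <= max_aspect):
        return False
    return min_height <= box.h <= max_height

candidates = [Box(100, 200, 12, 30), Box(40, 180, 60, 20), Box(300, 150, 8, 90)]
rois = [b for b in candidates if keep_candidate(b)]
print(len(rois), "of", len(candidates), "candidates kept")
```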
2) Candidate Regenerating: Generating several candidates for each extracted candidate is a useful strategy to compensate for the low DR caused by poor bounding-box accuracy in the classification step [5]. Badly bounded pedestrians will be rejected as nonpedestrians if the positive samples for training are well fitted. Thus, producing additional candidates by slightly shifting the original candidate bounding box in the image plane can add redundancy and help to increase the chances of correct detection. To make up for the segmentation errors, we also add extra split (half-height) and merged (double-height) candidates from the original ones of special sizes. Obviously, there is a tradeoff between the number of additional candidates and the computational complexity. For each original candidate, only two or three extra bounding boxes are generated for further classification. Fig. 8 depicts the segmentation results of Fig. 2 in blue rectangles and the selected candidates in yellow rectangles, which contain the additional ones.

Fig. 8. Segmentation results (blue/dark rectangles) and selected candidates (yellow/bright rectangles) of Fig. 2, containing the regenerated ones.

There are three advantages in the candidate-selection step.
1) Decreasing the number of candidates passed to the following object classification module, which is usually computationally expensive in classifying a single candidate, tends to save computation and helps to improve system performance.
2) Rejecting most of the nonhuman candidates in advance is conducive to reducing the FPR in the classification step.
3) Producing additional candidates compensates for the segmentation bias and errors, which can increase the DR and robustness of the system.

V. OBJECT CLASSIFICATION

The object-classification module is the key component of the pedestrian-detection system. Both high classification accuracy and high efficiency are required for practical real-time applications. Therefore, learning-based approaches (e.g., SVM, ANN, and AdaBoost), which try to model the pedestrian pattern with distribution functions or discriminant functions under the probabilistic framework, are more favorable than rule-based methods (e.g., template matching and the shape-independent method [28]), whose accuracy is limited by the difficulties in exactly translating all human knowledge into executable explicit rules.

Since the pedestrians have large within-class variance and different learning approaches hold different characteristics for classification, three issues should be taken into account in learning the classifiers.
1) What kinds of features and learning approaches are appropriate for the pedestrian-classification task?
2) How can we take advantage of the merits of different kinds of features and classifiers?
3) How can we reduce the complexity of pedestrian classification and improve detection accuracy?
For the first point, we make a systematic study of the popular features under the preferred AdaBoost learning framework, which is more efficient and more resistant to overfitting than SVM or ANN algorithms. Referring to the second issue, a hybrid of features and a combination of classifiers in a cascade can be utilized, and the best strategy is determined through experimental analysis. Considering the third problem, we can manually divide the pedestrian training set into nonoverlapping subsets, where each cluster represents a sample collection from a particular pose, view, or size. Furthermore, in this way, the overall variability in the pedestrian class is split into manageable pieces that can be handled by relatively simple classifiers.

A. Feature Extraction

The performance of a classifier commonly depends strongly on the adopted features. In the past decades, many features have been proposed for pedestrian classification, e.g., image intensity, binary image, gradient magnitude, Haar-like features, and HOG. Here, we give a brief introduction to these features as used in our approach.

The simplest feature for object classification is obviously the image intensity, which can directly be extracted from the raw data, with or without histogram equalization to reduce the influence of illumination. However, this feature is very redundant and contains a large variability in the pedestrian class and, hence, is less discriminative.

Other simple features like the binary image and gradient magnitude can easily be derived from the intensity image. These two features can capture the shape information of objects, but they are sensitive to noise and to slight shifts of the bounding box. Moreover, all the simple features should be resized to a fixed common length for comparison, and the resizing step, including interpolation, will modify the original features and decrease their discriminating quality between the objects and the background.

Thus, these simple features are not suitable for accurate detection. However, we can use them in rough classification due to their low computational cost.

Haar-like features, which add local statistics to the intensity data, are widely used in many object detection tasks, e.g., for faces, cars, and pedestrians, because they can effectively capture the different appearance details of objects and have a fast algorithm with the help of integral images [29]. We employ five types of Haar-like features, as illustrated in Fig. 9, including the edge (horizontal and vertical) and diagonal features used in [30] and the line (horizontal and vertical) features proposed in [29].

Fig. 9. Prototypes of the employed Haar-like features. (a) Horizontal-edge feature. (b) Vertical-edge feature. (c) Diagonal feature. (d) Horizontal-line feature. (e) Vertical-line feature.

A Haar-like feature is defined by a block with a feasible position, size, and aspect ratio in the detection window, as well as the difference of intensity summations in two defined areas, i.e., black and white regions, inside the block. Therefore, in each block B_i, a Haar-like feature can be calculated as follows:

F^{Haar}_{i,r} = S_{white}(B_i, type_r) - S_{black}(B_i, type_r)    (5)

where type_r denotes the rth type of Haar-like feature in Fig. 9 (r = 1, ..., 5), F^{Haar}_{i,r} is the corresponding feature value in B_i, and S_{white}(B_i, type_r) and S_{black}(B_i, type_r) are the intensity summations in the white and black regions inside B_i, respectively.
which is more efficient and resistant to overfitting than SVM or Although the Haar-like features have successfully been ap-
ANN algorithms. Referring to the second issue, a hybrid of fea- plied to face detection [29], the results for pedestrian detection
tures and combining classifiers into a cascade can be utilized, are not satisfactory in static images [17], [18], [31]. To solve
and the best strategy will be available under experimental analy- the problem, both gradient magnitude and orientation have
sis. Considering the third problem, we can manually divide the been taken into account with local statistics using a histogram,
pedestrian training set into nonoverlapping subsets, where each obtaining the HOG features of high discriminating power for
cluster represents a sample collection from a particular pose, human detection.
view, or size. Furthermore, this way, the overall variability in The HOG feature is first used for pedestrian detection by
the pedestrian class is split into manageable pieces that can be Shashua et al. [8]. They extract orientation histogram features
handled by relatively simple classifiers. from 13 fixed overlapping parts and use ridge regression to
reduce the dimension to form compact features according to the
different subregions and clustered training subsets. Dalal and
A. Feature Extraction
Triggs [18] and Zhu et al. [31] extend the use of histograms with
The performance of a classifier commonly strongly depends a dense scan approach. They divide each normalized detection
on the adopted features. In the past decades, many features have window of 64 × 128 into cells of size 8 × 8 pixels, and
been proposed for pedestrian classification, e.g., image inten- each group of 2 × 2 cells is integrated into a block with an
sity, binary image, gradient magnitude, Haar-like features, and overlap of one cell in both horizontal and vertical directions.
HOG. Now, we give a brief introduction about these features A nine-bin HOG is constructed in each cell, and each block
used in our approach. contains a concatenated vector of all its cells. Thus, each block
The simplest feature for object classification is obviously the is represented by a 36-D feature vector that is normalized to
image intensity, which can directly be extracted from raw data an L2 unit length, and each sample is represented by 7 ×
without or with histogram equalization to reduce the influence 15 blocks, that is, a feature vector of 3780 dimensions.
of illumination. However, this feature is very redundant and We compute the HOG feature in a similar way but with
contains a large variability in the pedestrian class and, hence, a reference image size of 24 × 60, reference cell size of
is less discriminative. 4 × 4 pixels, and, thus, a total of 6 × 15 cells. Instead of
Other simple features like binary image and gradient mag- normalizing the size of the detection window, we adjust the
nitude can easily be derived from the intensity image. The cell size according to the image size and reference size to
two features can capture the shape information of objects, but prevent the loss of aspect information caused by image resizing.
they are sensitive to noise and the slight shift of the bounding Fig. 10(c) presents an example to determine the size of each
box. Moreover, all the simple features should be resized to cell. The other settings for HOG are based on the optimal set of
a fixed same length for comparison, and the resizing step parameters discussed in [18] and are presented as follows:
including interpolation will modify the original features and • size of the block: 2 × 2 cells;
decrease their discriminating quality between the objects and • overlapping between blocks: one cell;
the background. • number of bins for the histogram: nine bins over 0◦ –180◦ ;
Thus, these simple features are not suitable for accurate • normalization type for the block: L2.
detection. However, we can use them in rough classification due
to their low computational cost. The computation of the HOG descriptor is done through the
Haar-like features, which add local statistics into intensity following steps, as shown in Fig. 10.
data, are widely used in many object detection tasks, e.g., face, 1) Compute the horizontal and vertical gradient of the image
car, and pedestrian, because they can effectively capture the by Sobel filters.
different appearance details of objects and have a fast algorithm 2) Compute both the magnitude and orientation of the
with the help of integral images [29]. We employ five types of gradient.
Haar-like features, as illustrated in Fig. 9, including the edge 3) Split the image into 6 × 15 cells according to the refer-
(horizontal and vertical) and diagonal features used in [30] and ence size and image size.
the line (horizontal and vertical) features proposed in [29]. 4) Compute a nine-bin histogram for each cell.
Fig. 10. Procedure for calculating a HOG descriptor. (a) Original image. (b) Gradient orientation and magnitude of each pixel are illustrated by the arrows' direction and length. (c) Splitting the detection window into 6 × 15 cells. (d) HOG computation in each cell. (e) Normalizing all the histograms inside a block of 2 × 2 cells. (f) Concatenating all the normalized histograms into an HOG descriptor.

The normalization is used to reduce the illumination variability. The final HOG descriptor is obtained by concatenating all the normalized histograms into a single 2520-D (36 × 5 × 14) vector.
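A compact NumPy sketch of these six steps is given below, using the settings quoted above (Sobel gradients, 4 × 4-pixel cells on a 24 × 60 window, nine unsigned bins, 2 × 2-cell blocks with one-cell overlap, and L2 normalization). Bin assignment is simplified (hard binning, no interpolation), so it should be read as an illustration of the procedure rather than a faithful reimplementation of the authors' descriptor.

```python
import numpy as np

def hog_descriptor(img, cells=(15, 6), bins=9):
    img = img.astype(np.float64)
    # 1)-2) Sobel gradients, then magnitude and unsigned orientation (0-180 deg).
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=np.float64)
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    pad = np.pad(img, 1, mode='edge')
    for r in range(img.shape[0]):
        for c in range(img.shape[1]):
            patch = pad[r:r + 3, c:c + 3]
            gx[r, c] = (patch * kx).sum()
            gy[r, c] = (patch * kx.T).sum()
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    # 3)-4) split into cells and build a magnitude-weighted nine-bin histogram per cell.
    n_rows, n_cols = cells
    ch, cw = img.shape[0] // n_rows, img.shape[1] // n_cols
    hists = np.zeros((n_rows, n_cols, bins))
    for r in range(n_rows):
        for c in range(n_cols):
            m = mag[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            a = ang[r * ch:(r + 1) * ch, c * cw:(c + 1) * cw]
            idx = np.minimum((a / (180.0 / bins)).astype(int), bins - 1)
            for b in range(bins):
                hists[r, c, b] = m[idx == b].sum()
    # 5)-6) group 2 x 2 cells into blocks (one-cell overlap), L2-normalize, concatenate.
    blocks = []
    for r in range(n_rows - 1):
        for c in range(n_cols - 1):
            v = hists[r:r + 2, c:c + 2].ravel()
            blocks.append(v / (np.linalg.norm(v) + 1e-6))
    return np.concatenate(blocks)

desc = hog_descriptor(np.random.randint(0, 255, (60, 24)))
print(desc.shape)   # (2520,): 14 x 5 blocks of 36 values each
```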
B. AdaBoost Learning

Given the feature set and a training set of positive and negative samples, many machine learning approaches can be used to learn the classification function. In our system, a variant of the AdaBoost algorithm, i.e., Gentle AdaBoost [32], is used to select the relevant features and train the classifiers.

Compared with other statistical learning approaches (e.g., SVM and ANN), which try to learn a single powerful discriminant function from all the specified features extracted from the training samples, the AdaBoost algorithm combines a collection of simple weak classifiers on a small set of critical features to form a strong classifier using a weighted majority vote. Meanwhile, as a kind of large-margin classifier [33], AdaBoost provides strong bounds on generalization and guarantees performance comparable with SVM. Thus, AdaBoost is an effective learning algorithm for real-time application.

Moreover, Gentle AdaBoost [32] has been proven to be superior to Real AdaBoost [34] in most cases; both adopt weak learners (WLs) with real-valued outputs to overcome the disadvantage of binary-valued stump functions in discriminating the complex distribution of the positive and negative samples, and both yield superior performance compared with the original discrete AdaBoost.

As the HOG descriptor and the simple features are high-dimensional vectors, we regard each dimension of the vector as a 1-D feature. Thus, all the features can be treated in the same way. In the AdaBoost learning procedure, we employ classification and regression trees (CARTs) [35] as the real-valued WLs that divide the instance space into a set of disjoint subregions with the lowest error in each round. The real-valued output in each subregion can be calculated from the sample weights falling into it. The WL output then depends only on the subregion that the instance belongs to. The detailed training procedure is shown in Algorithm 1.

Algorithm 1: Training procedure of Gentle AdaBoost.
• Given the data set S = (x_1, y_1), ..., (x_m, y_m), where (x_i, y_i) ∈ X × {−1, +1}, and the number of weak classifiers to be selected T.
• Initialize the sample weight distribution D_1(i) = 1/m.
• For t = 1, ..., T:
  1) Learn a CART with the lowest error as the best weak classifier to partition X into several disjoint subregions X_1, ..., X_n.
  2) Under the weight distribution D_t, calculate

     W_l^j = P(x_i ∈ X_j, y_i = l) = \sum_{i: x_i ∈ X_j ∧ y_i = l} D_t(i),  l = ±1.    (6)

  3) Set the output of h on each X_j as

     ∀ x_i ∈ X_j,  h(x) = \frac{W_{+1}^j - W_{-1}^j}{W_{+1}^j + W_{-1}^j}.    (7)

  4) Update the sample weight distribution

     D_{t+1}(i) = \frac{D_t(i) \exp[-y_i h_t(x_i)]}{Z_t}    (8)

     where Z_t is a normalization factor.
• The final strong classifier H is

     H(x) = \operatorname{sign}\left(\sum_{t=1}^{T} h_t(x) - b\right)    (9)

  where b is a threshold whose default value is zero. The confidence of H is defined as conf_H(x) = |R(x)|, with R(x) = \sum_t h_t(x) - b.
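The toy sketch below runs the loop of Algorithm 1 with the simplest possible CART, a two-leaf stump on a single feature and threshold, so that the region outputs of (7), the weight update of (8), and the strong classifier of (9) can be seen end to end. It is a didactic reimplementation on synthetic data, not the authors' training code, and it only reports training accuracy.

```python
import numpy as np

def region_outputs(feature, thr, y, D):
    """Two-leaf CART: split on one feature/threshold and give each region the
    Gentle AdaBoost output (W+1 - W-1) / (W+1 + W-1) of Eq. (7)."""
    h = np.zeros(len(y))
    left = feature <= thr
    for region in (left, ~left):
        if region.any():
            w_pos = D[region & (y == 1)].sum()
            w_neg = D[region & (y == -1)].sum()
            h[region] = (w_pos - w_neg) / (w_pos + w_neg + 1e-12)
    return h

def gentle_adaboost_train(X, y, T=10):
    m = len(y)
    D = np.full(m, 1.0 / m)            # initial weights D_1(i) = 1/m
    score = np.zeros(m)
    for _ in range(T):
        best = None
        for f in range(X.shape[1]):    # pick the weak learner with the lowest weighted error
            for thr in np.percentile(X[:, f], np.arange(10, 100, 10)):
                h = region_outputs(X[:, f], thr, y, D)
                err = (D * (y - h) ** 2).sum()
                if best is None or err < best[0]:
                    best = (err, h)
        h_t = best[1]
        score += h_t                   # accumulate sum_t h_t(x)
        D *= np.exp(-y * h_t)          # weight update, Eq. (8)
        D /= D.sum()                   # normalization by Z_t
    return np.sign(score)              # strong classifier of Eq. (9), with b = 0

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
print("training accuracy:", (gentle_adaboost_train(X, y) == y).mean())
```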
C. Tree-Structured Two-Stage Detector

The structure of the pedestrian detector is shown in Fig. 11. Unlike the cascaded framework proposed by Viola and Jones [29] for face detection, which consists of multiple stages and rejects lots of nonface candidates in the early stages to reduce the computational cost, our tree-structured two-stage architecture aims to improve the classification performance by integrating classifiers trained on different features and on sample sets of different image sizes.

Fig. 11. Structure of the AdaBoost detector.

The proposed classification framework arranges two stages in a coarse-to-fine manner according to the issues mentioned at the beginning of this section, instead of learning a cascaded detector of multiple stages based on a hybrid of features [36], for efficiency reasons.

The classifier based on Haar-like features is selected for its higher performance compared with those based on other simple features. Furthermore, it is used for rough classification, focusing on rejecting the nonpedestrian candidates and selecting the well-bounded pedestrian ROIs. On the other hand, the classifiers based on HOG features are utilized to give a precise determination.

As the pedestrian class holds high intraclass variability due to different sizes, illuminations, poses, or clothes, the instance space is partitioned into subregions of reduced variability according to the distribution of the width and height of the pedestrians, and three HOG-based classifiers are trained on three separate sets, which are denoted by AdaBoost-S, AdaBoost-M, and AdaBoost-L in Fig. 11. The results from the experiments indicate that this strategy not only reduces the complexity of the classifiers but also improves the detection performance.

The parameters that should be optimized in the detector are the thresholds of the first and second stages, denoted T1 and T2, for the combination. Furthermore, the optimal parameter settings can be obtained through the sequential parameter-optimization method described in [4].
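Schematically, the decision flow of Fig. 11 can be written as the small routing class below: every candidate is first scored by the Haar-like stage against T1 and, if accepted, is passed to one of the size-specific HOG classifiers (S/M/L) and compared against T2. The classifier callables, the class name, and the height cut points used for routing are placeholders for the trained detectors and are not taken from the paper.

```python
from typing import Callable

class TwoStageDetector:
    def __init__(self, haar_clf: Callable, hog_small: Callable,
                 hog_medium: Callable, hog_large: Callable,
                 t1: float = 0.0, t2: float = 0.5):
        self.haar_clf = haar_clf                   # stage 1: rough rejection
        self.hog_by_size = {"S": hog_small, "M": hog_medium, "L": hog_large}
        self.t1, self.t2 = t1, t2                  # stage thresholds to be optimized

    def _size_branch(self, height: int) -> str:
        # Route by candidate height; the cut points here are illustrative only.
        if height < 40:
            return "S"
        return "M" if height < 80 else "L"

    def is_pedestrian(self, candidate, height: int) -> bool:
        if self.haar_clf(candidate) < self.t1:     # coarse Haar-like stage
            return False
        hog_clf = self.hog_by_size[self._size_branch(height)]
        return hog_clf(candidate) >= self.t2       # fine, size-specific HOG stage

# Dummy scoring functions standing in for the trained AdaBoost classifiers.
detector = TwoStageDetector(lambda c: 0.3, lambda c: 0.9, lambda c: 0.1, lambda c: 0.7)
print(detector.is_pedestrian(candidate=None, height=30))   # True  (small branch)
print(detector.is_pedestrian(candidate=None, height=60))   # False (medium branch)
```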
VI. TRACKING

Once the candidates are validated by the AdaBoost detector, a tracking stage takes place. The tracking module will fill the detection gap between frames and discard the spurious detections, thus helping to minimize both false-positive (FP) and false-negative detections based on temporal coherence.

The tracking algorithm relies on the detection results to initiate the tracks and on the candidates to provide possible observations (or measurements). If the tracked object fails to obtain the associated observation from the selected candidates, template matching is utilized to give the complementary measurement. The details of the tracking process are presented in Fig. 12.

Fig. 12. Detailed procedure of the tracking module.

For a pedestrian in the video, the state variables that we are concerned with are its centroid position (X, Y), width (W), and height (H), as well as their differentials between two successive frames. Thus, the state vector for Kalman tracking is x = (X, dX, Y, dY, W, dW, H, dH)^T. If we consider that the movement of the camera is straight with constant speed, which is reasonable in most cases, the state transition matrix is simple, and because we can directly observe all the state variables of the tracked object in each frame, the measurement matrix will be an identity matrix. Taking the state variables X and dX, for example, the process equation and the measurement equation can be described as follows:

\begin{bmatrix} X_{k+1} \\ dX_{k+1} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_k \\ dX_k \end{bmatrix} + w_k    (10)

\begin{bmatrix} X_k^m \\ dX_k^m \end{bmatrix} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} X_k \\ dX_k \end{bmatrix} + v_k    (11)

where w_k is the process noise and v_k is the measurement noise, both of which are assumed to be additive, white, and Gaussian, with zero mean. Given the estimated covariance matrices of w_k and v_k, the update of the Kalman filter is straightforward.
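A minimal constant-velocity Kalman setup matching (10) and (11) is sketched below: the 8-D state is propagated with a block-diagonal transition matrix built from [[1, 1], [0, 1]], and the measurement matrix is the identity because every state variable is observed directly. The noise covariances Q and R are placeholder values, not the covariances estimated by the authors.

```python
import numpy as np

def make_constant_velocity_kf(q=1.0, r=4.0):
    block = np.array([[1.0, 1.0], [0.0, 1.0]])
    F = np.kron(np.eye(4), block)      # transition for (X, dX, Y, dY, W, dW, H, dH)
    Hm = np.eye(8)                     # every state variable is observed directly
    Q = q * np.eye(8)                  # process noise covariance (assumed value)
    R = r * np.eye(8)                  # measurement noise covariance (assumed value)
    return F, Hm, Q, R

def kf_step(x, P, z, F, Hm, Q, R):
    # Predict.
    x_pred = F @ x
    P_pred = F @ P @ F.T + Q
    # Update with the associated measurement z (candidate box and its differentials).
    S = Hm @ P_pred @ Hm.T + R
    K = P_pred @ Hm.T @ np.linalg.inv(S)
    x_new = x_pred + K @ (z - Hm @ x_pred)
    P_new = (np.eye(len(x)) - K @ Hm) @ P_pred
    return x_new, P_new

F, Hm, Q, R = make_constant_velocity_kf()
x = np.array([100, 0, 200, 0, 20, 0, 50, 0], dtype=float)   # initial track state
P = np.eye(8) * 10.0
z = np.array([103, 3, 198, -2, 21, 1, 51, 1], dtype=float)  # observed box + differentials
x, P = kf_step(x, P, z, F, Hm, Q, R)
print(np.round(x, 1))
```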
In addition to the update of the Kalman filter, another important issue of object tracking is the data association, which tries to find the associated observation of the tracked object to correct the filter's prediction.

Since the object is represented by its position and size information, the predicted object's nearest neighbor can be regarded as the corresponding measurement of the tracked object. We define the distance criterion as follows:

D(x_1, x_2) = (X_1 - X_2)^2 + (Y_1 - Y_2)^2 + (|W_1 - W_2| + |H_1 - H_2|).    (12)

A candidate will not be considered as the observation of the associated track unless it is the prediction's nearest neighbor and the overlap ratio between them is greater than 0.5.

However, the nearest-neighbor method may fail to find the object's observation due to bad segmentation or nonlinear movement of the camera. Then, template matching is adopted for improvement. The template is initiated at the beginning of the track and updated when the nearest-neighbor method works. Once the nearest neighbor cannot be found among the candidates in the current frame, template matching is used to search for the observation. If the best matching confidence is greater than 0.85, the resulting region is accepted as the measurement; otherwise, the track's corresponding observation is missing. The matching confidence is defined as follows:

d(T, Φ) = \frac{\sum_{x,y} (T(x, y) - \bar{t}) (Φ(x, y) - \bar{φ})}{\sqrt{\sum_{x,y} (T(x, y) - \bar{t})^2 \sum_{x,y} (Φ(x, y) - \bar{φ})^2}}    (13)

where T(x, y) is the template, and Φ(x, y) indicates the image signal of the same size. \bar{t} and \bar{φ} are the mean values of T(x, y) and Φ(x, y), respectively.
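The matching confidence of (13) is the zero-mean normalized cross-correlation, which the sketch below implements directly together with a brute-force search over an image; OpenCV's cv2.matchTemplate with cv2.TM_CCOEFF_NORMED computes the same score far more efficiently. The helper names and the synthetic test image are ours; in the tracker, a match would be accepted as a measurement only when the score exceeds 0.85.

```python
import numpy as np

def ncc(template, patch):
    """Zero-mean normalized cross-correlation of Eq. (13)."""
    t = template.astype(np.float64) - template.mean()
    p = patch.astype(np.float64) - patch.mean()
    denom = np.sqrt((t ** 2).sum() * (p ** 2).sum())
    return float((t * p).sum() / denom) if denom > 0 else 0.0

def search_best_match(image, template):
    """Exhaustively slide the template and return the best-matching position and score."""
    th, tw = template.shape
    best_score, best_pos = -1.0, (0, 0)
    for y in range(image.shape[0] - th + 1):
        for x in range(image.shape[1] - tw + 1):
            score = ncc(template, image[y:y + th, x:x + tw])
            if score > best_score:
                best_score, best_pos = score, (x, y)
    return best_pos, best_score

rng = np.random.default_rng(1)
img = rng.integers(0, 255, (40, 60))
tpl = img[10:30, 20:32].copy()
pos, score = search_best_match(img, tpl)
print(pos, round(score, 3))   # expect (20, 10) with a score close to 1.0
```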
An active track will be terminated for one of the following three reasons: 1) the tracked object moves out of the frame; 2) the matching confidence in the template-matching step is less than 0.7; or 3) the corresponding observation has been missing in three successive frames.

To improve the system performance, the tracking process is divided into two stages: pretracking and tracking. The pretracking stage is applied to reconfirm the existence of pedestrians. A pretracked object will be validated as a pedestrian and moved to the tracking stage only if it has been detected more than Np times and its average confidence in ten consecutive frames is above a threshold THp. The pedestrian confidence in each frame is obtained from the detection confidence (which is mapped from R(x) by f(x) = 1/(1 + exp(-x))) or by template matching. Only the objects in the tracking stage will be shown as output alarms. The parameters Np and THp can also be optimized by the sequential parameter-optimization method [4].

Thus, in the tracking procedure, the pretracking and multiframe validation can reject the spurious detections, while the tracking stage tends to fill the detection gaps of the corresponding objects.

VII. EXPERIMENTAL RESULTS

To evaluate the integrated pedestrian-detection and tracking system, make a comparison of different approaches, and optimize the parameter settings, we generate both the training and testing sets from the raw video data captured by the NIR camera during about 10 h of suburban and urban driving at speeds of no more than 70 km/h.

The positive samples, including pedestrians and bicyclists that cover a wide variety of sizes, poses, illuminations, and backgrounds, are extracted not only from the manually labeled results but also from the candidates generated by the ROI generation module. These roughly bounded samples make the trained classifiers insensitive to the bounding-box accuracy. Accordingly, the negative samples are initially randomly selected from the nonpedestrian candidates but are further increased by following the bootstrap strategy in [18] and [37] to obtain more representative samples, because easy negative samples are useless for improving or validating the discriminating capability of the trained classifiers.

Finally, the training set consists of 4754 positive samples and 5929 negative samples, while there are 1977 positive and 2116 negative instances in the testing set. All the samples can be divided into three groups (small, middle, and large) according to their size for the three HOG-based classifiers. Moreover, three video sequences in 720 × 480 resolution containing hundreds of video clips are prepared for parameter optimization and performance evaluation. Both suburban and urban scenes (indicated by V1-SU, V2-S, and V3-U) are considered in the testing videos; in addition, the pedestrians' positions in each frame are manually labeled by bounding boxes for comparison with the segmentation and detection results.

To inspect how the proposed system would work on pedestrians in different poses and views, all the pedestrians in the image sets and videos are divided into three groups, i.e., along the road, across the road, and bicyclist, for statistics. The details of the data set are presented in Table II.

TABLE II. DETAILS ABOUT THE TRAINING AND TESTING DATA SET.

A. Evaluation Criteria

Commonly, there are distinct differences between the classifiers' classification performance and the system's detection performance. The true-positive (TP) rate (also called the hit
rate, DR, or recall) and the FPR (also called the FA rate), defined in (14) and (15), or the miss rate (defined as 1 - recall) versus FP per window (FPPW) [38], defined in (16), are widely used for the comparison of classification performance. However, at the system level, the huge number of total negatives is hard to determine or is of no value for performance evaluation. The equations for the rates mentioned are given as follows:

TP rate = \frac{\text{correctly classified positive samples (TP)}}{\text{total number of positives}}    (14)

FPR = \frac{\text{falsely classified negative samples (FP)}}{\text{total number of negatives}}    (15)

FPPW = \frac{\text{FP}}{\text{total number of negative windows}}.    (16)

Although the recall-precision curves (precision is defined in the following equation) give a feasible way to measure the detection performance in image-based applications, they do not directly clarify how a system performs over time, particularly whether the number of FAs is acceptable in video-based applications, as the number of positive instances may be quite large:

precision = \frac{\text{TP}}{\text{TP} + \text{FP}}.    (17)

Hence, we adopt FP per frame (FPPF), which is defined in the following equation, to measure the FAs produced by the system:

FPPF = \frac{\text{FP}}{\text{total number of frames}}.    (18)

A detection is considered true if the area of overlap a_o, defined in the following equation, between the detection region b_d and the ground-truth region b_gt exceeds 50%:

a_o = \frac{\text{area}(b_d ∩ b_{gt})}{\text{area}(b_d ∪ b_{gt})}.    (19)

Once the true detections are determined by the ground truth, any remaining results are considered to be false. Thus, FPPF estimates the frequency of FAs, and if there is an estimate of the candidate number per frame (CNPF), we can easily obtain an approximation of the FPR from FPPF.
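The frame-level bookkeeping behind (14)–(19) can be sketched as follows: a detection is matched to a ground-truth box when the overlap ratio a_o exceeds 0.5, unmatched detections count as FPs, and FPPF is the FP count divided by the number of frames. The box format (x, y, w, h), the helper names, and the toy data are assumptions made for the example; the instance-based DR computed here is simply TP divided by the number of labeled instances.

```python
def overlap_ratio(a, b):
    """Intersection over union of two (x, y, w, h) boxes, Eq. (19)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix = max(0, min(ax + aw, bx + bw) - max(ax, bx))
    iy = max(0, min(ay + ah, by + bh) - max(ay, by))
    inter = ix * iy
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def evaluate(detections_per_frame, truths_per_frame, thr=0.5):
    tp = fp = total_gt = 0
    for dets, gts in zip(detections_per_frame, truths_per_frame):
        total_gt += len(gts)
        matched = set()
        for d in dets:
            hit = next((i for i, g in enumerate(gts)
                        if i not in matched and overlap_ratio(d, g) > thr), None)
            if hit is None:
                fp += 1            # no ground-truth box explains this detection
            else:
                tp += 1
                matched.add(hit)
    frames = len(detections_per_frame)
    dr = tp / total_gt if total_gt else 0.0     # instance-based DR (recall)
    fppf = fp / frames if frames else 0.0       # false positives per frame, Eq. (18)
    return dr, fppf

dets = [[(10, 10, 20, 40)], [(5, 5, 20, 40), (100, 100, 10, 20)]]
gts = [[(12, 12, 20, 40)], [(6, 4, 20, 40)]]
print(evaluate(dets, gts))    # (1.0, 0.5) on this toy example
```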
Regarding the DR, the trajectory-based (or pedestrian-number-based) criterion is rough but widely used for system evaluation, as mentioned in Table I, where the "Detected" concept does not guarantee a successive detection for every frame that contains the pedestrian, because once a pedestrian is detected, it will be enough to alert the driver.

Relatively, the instance-based DR is more strict and precise, as it involves all the pedestrian instances labeled in each frame. Thus, it is advisable to use this criterion to evaluate the segmentation or the tracking module, since they focus on the exact positions of objects in all frames.

In the following experiments, we adopt the instance-based DR versus FPPF curves to evaluate the system performance, employ the trajectory-based criterion to make comparisons of different systems, and choose traditional receiver operating characteristic (ROC) curves composed of the TP rate and FPR to quantify the capability of different classifiers.

B. Validation of the ROI Generation Module

The first test aims to validate the proposed ROI generation module using the real-world data V1-SU. Image segmentation, half-size downsampling, candidate filtering, candidate regenerating, and the optimization of w for T_L are all considered in the testing procedure. The instance-based DR and FPPF are used to evaluate the performance.

Fig. 13(a)–(d) presents the performance change, following the steps in the ROI generation module using diverse settings of w. We select w = 11 for its highest performance while doing the other experiments. These results demonstrate that half-size downsampling indeed improves the segmentation performance, while candidate filtering greatly reduces the number of nonpedestrian candidates, and candidate regenerating only adds a small redundancy for classification while improving the DR.

Fig. 13. Performance of four steps in the ROI generation module using diverse settings of w, i.e., w = 9, 10, ..., 15.

C. Performance Comparison of Different Classifiers

As previously mentioned, the classification performance greatly varies along with the adopted features and the structure of the classifiers. In the following experiments, we give a comprehensive comparison of the different classifiers using the extracted training and testing samples. The CART-based Gentle AdaBoost is used as the learning method in all the classifiers.

In the implementation, the node number of each CART is set to three, and all the samples are normalized to 24 × 60 for the intensity (gray), binary, gradient, and Haar-like features. To validate the new HOG computation method, both normalized and original samples are used for comparison. Each classifier stops learning when all the training samples are correctly classified by it.

After training, the classifiers based on different features contain different numbers of weak classifiers, as shown in Fig. 14. The HOG- and Haar-like-feature-based classifiers require fewer weak classifiers than the others, which indicates that these features are more suitable and discriminating for pedestrian classification than the other ones. Moreover, as we considered, calculating the HOG feature from the original images achieves higher performance than calculating it from normalized samples, which can be seen in Fig. 15. Since a low FPR (1%–2%) is required in practical applications, we focus on the performance comparison in the low-FPR range.

Fig. 14. Number of weak classifiers in each strong classifier based on different features. HOG-N indicates computing HOG features from normalized samples. Small, middle, and large refer to the individual classifiers of the size-sensitive classifier.

Fig. 15. Performance comparison of classifiers using different features. HOG-3 indicates the size-sensitive detector that consists of three classifiers.

Obviously, the HOG-based classifiers outperform the others, and the HOG-3 detector not only needs fewer weak classifiers but also produces the highest performance on the testing set. Furthermore, to compare the performance of the combination of two classifiers based on different features, we present the testing results of typical two-stage detectors in Fig. 16. The value of the threshold for the first stage, T1, is set to zero. The ROC curves are then obtained by varying the threshold T2.

As we expected, the cascaded classifier can greatly improve the performance of the individual ones, although the two-stage classifiers without HOG features do not perform better than the single HOG-3 detector. The best performance comes from the combination of the Haar-like-feature-based classifier and the HOG-3 detector, whose TP rate is greater than 90% when its FPR is equal to 2%.
GE et al.: REAL-TIME PEDESTRIAN DETECTION AND TRACKING AT NIGHTTIME 295

Fig. 16. Performance comparison of two-stage classifiers using different Fig. 17. Detection performance of each module on V2-S and V3-U.
features.
TABLE III
EFFECTS OF PARAMETER OPTIMIZATION ON THE SYSTEM PERFORMANCE,
WHICH IS TESTED USING THE VIDEO V1-SU

The above results demonstrate that the two-stage detectors


can get much benefit from the complementarity between differ-
ent features. Thus, arranging the classifiers trained on diverse
features in a cascaded structure is a good strategy to improve the
classification performance. Moreover, if the simpler classifier
operates earlier, the computational cost can be reduced at the
same time.

D. System Performance Evaluation


Before evaluating the system performance, we retrain the selected classifiers with more WLs to improve their generalization ability. The number of WLs is no longer increased once adding more yields only a small improvement in the testing performance. The final tree-structured detector contains four classifiers: Haar-AdaBoost (135 WLs), HOG-AdaBoost-L (100 WLs), HOG-AdaBoost-M (250 WLs), and HOG-AdaBoost-S (200 WLs).

Fig. 18. Detected pedestrians from urban and suburban videos. Segmentation results (blue/dark rectangles), generated candidates (yellow/bright rectangles), detected objects by the classification module (3, 4, 6, 8, 10), and objects added by the tracking module (1, 2, 5, 7, 9). All the rectangles numbered from 1 to 10 are output alarms. (a) Urban scene. (b) Suburban scene.

First, we optimize the parameters for classification and tracking using the video V1-SU. The sequential parameter optimization method [4] is employed in the experiment, and the results from the classification module (without tracking) and the tracking module are presented in Table III.
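As a rough picture of what a sequential parameter optimization loop looks like, the sketch below tunes one parameter at a time while holding the others fixed; the interface (a parameter dictionary, candidate value lists, and an evaluate callback) is an assumption for illustration, and the actual procedure follows [4].

```python
# Hedged sketch of sequential (one-parameter-at-a-time) tuning on a validation
# video. The candidate value ranges and the evaluate() callback are assumed
# placeholders, not the procedure of [4] in detail.

def sequential_optimize(params, candidate_values, evaluate):
    """params: dict of current values; candidate_values: dict of value lists;
    evaluate: callable returning a scalar score (higher is better)."""
    for name, values in candidate_values.items():
        best_value, best_score = params[name], evaluate(params)
        for value in values:
            trial = dict(params, **{name: value})   # vary only this parameter
            score = evaluate(trial)
            if score > best_score:
                best_value, best_score = value, score
        params[name] = best_value                   # fix it before moving on
    return params
```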
The default values of T1 and T2 are set to 0 and 0.5, respectively. Although the increase of T2 considerably decreases the DR, the missed detections can be recovered by the tracking module. Meanwhile, the surviving FAs are likely to be further suppressed in multiframe validation. The parameters for tracking are also important because improper values of Np and THp can even increase the FPPF.
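One simple way to realize multiframe validation is a k-out-of-n confirmation rule, sketched below; the rule and its parameters are assumptions for illustration, and the exact use of Np and THp in the tracking module is not reproduced here.

```python
# Hedged sketch of multiframe validation: an object is reported only if it has
# been detected (or matched by the tracker) in at least min_hits of its last
# n_frames frames. The k-out-of-n rule itself is an assumption.
from collections import deque

class TrackValidator:
    def __init__(self, n_frames=5, min_hits=3):
        self.history = deque(maxlen=n_frames)
        self.min_hits = min_hits

    def update(self, detected_this_frame):
        """Record one frame and return True once the track is confirmed."""
        self.history.append(bool(detected_this_frame))
        return sum(self.history) >= self.min_hits
```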
After the optimization procedure, we run the resulting detector on both suburban (V2-S) and cluttered urban (V3-U) videos for further insight into the system performance. Fig. 17 presents the detection performance of each module on V2-S and V3-U.

In suburban scenes, as the background is not so heavily corrupted with clutter, fewer useless ROIs are generated, and more accurate segmentation is achieved, which results in much higher detection performance than in an urban environment.

Some detected pedestrians from urban and suburban videos are shown in Fig. 18, where intermediate results are also displayed.

TABLE IV
SYSTEM PERFORMANCE EVALUATED ON THE VIDEOS V2-S AND V3-U

Fig. 19. False alarms and missed pedestrians in testing videos. (a)–(c) Three pedestrians are missed. (d)–(f) Three false alarms.

All the pedestrians and bicyclists that are brighter than the nearby background can be detected by the system.

Concerning the TP rate and FPR (estimated by FPPF/CNPF) of the detector on the videos, both are strikingly smaller than those previously reported on the testing set. The difference may arise from the increase of T2 on one hand; more exactly, however, the degraded DR is caused by rough segmentation or poor bounding-box accuracy, as shown in Fig. 18, while the large number of easy negative candidates that are extracted tends to keep the FPR at a much lower level.

Therefore, the specific values of the instance-based DR and FPPF, or of the TP rate and FPR, are influenced by many factors (e.g., training samples, test criteria, and testing data) and cannot directly be used for performance comparison between different systems.

From the perspective of the application, we are more interested in the number of detected or missed pedestrians, the number of FAs in a time interval, and the running speed. These quantities directly reflect the performance perceived by users; thus, they are reasonable for evaluating the performance of different systems.

Table IV presents the detailed experimental results on V2-S and V3-U. All the experiments are carried out on a Pentium IV 3.0-GHz computer with 1-GB RAM. The average running time per frame reveals that the proposed system is fast enough for real-time constraints, and this advantage also leaves room for further improvement.

In the urban video, three FAs occur in about 350 frames, and six pedestrians (two bicyclists, three crossing the road, and one walking along the road) are missed because of their low average detection confidence. Although there are no false detections in the suburban testing, two pedestrians (one bicyclist and one crossing the road) are still missed by the system. These results indicate that pedestrians walking across the road are not only the most dangerous candidates but are also the hardest targets to detect.

Fig. 19 shows several examples of the FAs and missed pedestrians. Most of the undetected objects suffer from nonuniform brightness or a bright background, which leads to bad segmentation and inaccurate bounding boxes. Furthermore, all the FAs are caused by their shape similarity to pedestrians; traffic signs, lights, and trees are the three main sources of incorrect detection.
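Converting frame-level counts into these user-perceived figures is straightforward arithmetic, as sketched below under the assumption of a constant frame rate; the function and variable names are illustrative placeholders.

```python
# Hedged sketch of the user-perceived metrics discussed above, assuming a
# constant frame rate; names are illustrative placeholders.

def user_perceived_metrics(detected_peds, total_peds, false_alarms,
                           num_frames, fps=30.0):
    detection_rate = detected_peds / total_peds if total_peds else 0.0
    hours = num_frames / (fps * 3600.0)
    false_alarms_per_hour = false_alarms / hours if hours > 0 else 0.0
    fppf = false_alarms / num_frames if num_frames else 0.0
    return {"DR": detection_rate,
            "FA_per_hour": false_alarms_per_hour,
            "FPPF": fppf}
```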

Notice that although there are no FAs in the suburban video, the corresponding FPPF is not zero, because bounding boxes with an inaccurate position or scale are regarded as FAs in the instance-based statistics. This effect is also included in the FPPF of V3-U.

E. Validation of the Detection Range

To validate the system's detection range, we capture additional video while the car is stationary and pedestrians in different clothes walk or run at specific distances, e.g., 25, 30, 40, 60, and 80 m. Fig. 20 shows some examples of the detected pedestrians at different distances and their varied sizes in the images. Meanwhile, this experiment indicates that the best detection range for the proposed system is 35–70 m.

Fig. 20. Size of the detected pedestrians at different distances varies in a wide range.

On the other hand, the range information of the detected pedestrians can be derived from both the size and the bottom position of their bounding boxes and helps to determine when to issue an alert to the driver. Estimating the distance from the bounding-box size requires statistics of the size–distance distribution, as shown in Fig. 20. If the intrinsic and extrinsic parameters of the camera have been calibrated, the relationship between the 3-D world coordinates and the image pixels is straightforward, and under the assumption of a flat road, the distance can easily be estimated from the pixel position of the pedestrians' feet. However, due to the low accuracy of the bounding boxes and the vehicle pitch, both estimates will be quite rough but sufficient for issuing alerts.
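Under the flat-road assumption, the foot-position estimate reduces to a single pinhole-camera relation, as sketched below with hypothetical calibration values; vehicle pitch is ignored, which is one reason the estimate is rough in practice.

```python
# Hedged sketch of flat-road distance estimation from the bounding-box bottom:
# for a camera with focal length f (pixels) mounted at height H (m) with its
# optical axis roughly parallel to the ground, a foot pixel at image row v
# below the principal point c_y lies at distance Z ≈ f * H / (v - c_y).
# The calibration values in the example are hypothetical.

def distance_from_foot_row(v, f_pixels, cam_height_m, c_y):
    dv = v - c_y
    if dv <= 0:           # at or above the horizon: no valid ground intersection
        return float("inf")
    return f_pixels * cam_height_m / dv

# Example with assumed calibration: f = 1000 px, H = 1.2 m, c_y = 240.
# A foot row of v = 300 then gives 1000 * 1.2 / 60 = 20 m.
```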
VIII. CONCLUSION AND FUTURE WORK

This paper has introduced a nighttime vision system for real-time pedestrian detection and tracking from a moving vehicle. A cascade of three modules is involved in the system, and each module utilizes complementary visual features to successively distinguish the objects from the cluttered background.

To balance robustness and efficiency at a high performance level, some novel approaches have been proposed, including the efficient adaptive dual-threshold segmentation for candidate generation, the tree-structured two-stage detector to reduce the complexity of pedestrian classification caused by large intraclass variability, and template-matching-based tracking for multiframe validation.

The extensive experiments in both urban and suburban traffic environments showed that the proposed system succeeded in reaching a correct recognition percentage of more than 90% at the cost of about 10 false classifications/h. The satisfying performance is due to the optimized combination of the appearance-based detector and the tracking method, which benefits from the strengths of the different techniques and overcomes their respective disadvantages.

In contrast to most other systems, in addition to its performance, the proposed system enjoys the advantage of low implementation cost, as only one NIR camera is required, and the core algorithm can run as fast as the frame rate on a common PC platform.

Regarding future work, more research is needed to reduce the missed detections, particularly for pedestrians walking across the road and for partly segmented objects. View-based shape–texture models and part-based detection and tracking approaches seem helpful for solving these problems. To refine the bounding-box accuracy, minimize the average delay in detecting individual pedestrians, and extend the current algorithm to tasks under different weather conditions, online learning and tracking techniques may be a worthwhile direction for further research, as they would enable the algorithm to adjust and adapt to individuals or special scenes during online processing.

REFERENCES

[1] Traffic Safety Facts 2006 Data, Nat. Highway Traffic Safety Admin., Washington, DC, Tech. Rep., 2006. [Online]. Available: http://www.safercar.gov/staticfiles/DOT/NHTSA/TrafficInjuryControl/Articles/AssociatedFiles/TSF2006 810810.pdf
[2] T. Gandhi and M. Trivedi, “Pedestrian protection systems: Issues, survey, and challenges,” IEEE Trans. Intell. Transp. Syst., vol. 8, no. 3, pp. 413–430, Sep. 2007.
[3] L. Zhao and C. Thorpe, “Stereo- and neural network-based pedestrian detection,” IEEE Trans. Intell. Transp. Syst., vol. 1, no. 3, pp. 148–154, Sep. 2000.
[4] D. Gavrila and S. Munder, “Multi-cue pedestrian detection and tracking from a moving vehicle,” Int. J. Comput. Vis., vol. 73, no. 1, pp. 41–59, Jun. 2007.
[5] P. Alonso, I. Llorca, D. Sotelo, M. Bergasa, L. de Toro, P. Nuevo, J. Ocana, and M. Garrido, “Combination of feature extraction methods for SVM pedestrian detection,” IEEE Trans. Intell. Transp. Syst., vol. 8, no. 2, pp. 292–307, Jun. 2007.
[6] F. Xu, X. Liu, and K. Fujimura, “Pedestrian detection and tracking with night vision,” IEEE Trans. Intell. Transp. Syst., vol. 6, no. 1, pp. 63–71, Mar. 2005.
[7] M. Bertozzi, A. Broggi, C. Caraffi, M. Del Rose, M. Felisa, and G. Vezzoni, “Pedestrian detection by means of far-infrared stereo vision,” Comput. Vis. Image Underst., vol. 106, no. 2/3, pp. 194–204, May 2007.
[8] A. Shashua, Y. Gdalyahu, and G. Hayun, “Pedestrian detection for driving assistance systems: Single-frame classification and system level performance,” in Proc. IEEE Intell. Vehicles Symp., Jun. 2004, pp. 1–6.
[9] X. Cao, H. Qiao, and J. Keane, “A low-cost pedestrian-detection system with a single optical camera,” IEEE Trans. Intell. Transp. Syst., vol. 9, no. 1, pp. 58–67, Mar. 2008.
[10] D. Gerónimo, A. López, and A. Sappa, “Computer vision approaches to pedestrian detection: Visible spectrum survey,” in Proc. 3rd Iberian Conf. Pattern Recog. Image Anal., Jun. 2007, vol. 4477, pp. 547–554.
[11] A. Broggi, M. Bertozzi, A. Fascioli, and M. Sechi, “Shape-based pedestrian detection,” in Proc. IEEE Intell. Vehicles Symp., Oct. 2000, pp. 200–215.
[12] M. Bertozzi, A. Broggi, A. Fascioli, A. Tibaldi, R. Chapuis, and F. Chausse, “Pedestrian localization and tracking system with Kalman filtering,” in Proc. IEEE Intell. Vehicles Symp., Jun. 2004, pp. 584–589.

[13] Q. Tian, H. Sun, Y. Luo, and D. Hu, “Nighttime pedestrian detection with a normal camera using SVM classifier,” in Proc. 2nd ISNN, May 2005, vol. 3497, pp. 189–194.
[14] C. Papageorgiou and T. Poggio, “A trainable system for object detection,” Int. J. Comput. Vis., vol. 38, no. 1, pp. 15–33, Jun. 2000.
[15] H. Cheng, N. Zheng, and J. Qin, “Pedestrian detection using sparse Gabor filter and support vector machine,” in Proc. IEEE Intell. Vehicles Symp., Jun. 2005, pp. 583–587.
[16] S. Munder and M. Gavrila, “An experimental study on pedestrian classification,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 28, no. 11, pp. 1863–1868, Nov. 2006.
[17] P. Viola, M. Jones, and D. Snow, “Detecting pedestrians using patterns of motion and appearance,” Int. J. Comput. Vis., vol. 63, no. 2, pp. 153–161, Jul. 2005.
[18] N. Dalal and B. Triggs, “Histograms of oriented gradients for human detection,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2005, pp. 886–893.
[19] M. Bertozzi, A. Broggi, M. Rose, M. Felisa, A. Rakotomamonjy, and F. Suard, “A pedestrian detector using histograms of oriented gradients and a support vector machine classifier,” in Proc. IEEE Conf. Intell. Transp. Syst., Sep. 2007, pp. 143–148.
[20] H. Nanda and L. Davis, “Probabilistic template based pedestrian detection in infrared videos,” in Proc. IEEE Intell. Vehicles Symp., Jun. 2002, pp. 15–20.
[21] B. Zhang, Q. Tian, and Y. Luo, “An improved pedestrian detection approach for cluttered background in nighttime,” in Proc. IEEE Intell. Conf. Veh. Electron. Safety, Oct. 2005, pp. 143–148.
[22] C. Hou, H. Ai, and S. Lao, “Multiview pedestrian detection based on vector boosting,” in Proc. ACCV, Nov. 2007, vol. 4843, pp. 210–219.
[23] M. Bertozzi, A. Broggi, S. Ghidoni, and M. Meinecke, “A night vision module for the detection of distant pedestrians,” in Proc. IEEE Intell. Vehicles Symp., Jun. 2007, pp. 25–30.
[24] A. Broggi, R. Fedriga, A. Tagliati, T. Graf, and M. Meinecke, “Pedestrian detection on a moving vehicle: An investigation about near infra-red images,” in Proc. IEEE Intell. Vehicles Symp., Sep. 2006, pp. 431–436.
[25] M. Sezgin and B. Sankur, “Survey over image thresholding techniques and quantitative performance evaluation,” J. Electron. Imag., vol. 13, no. 1, pp. 146–165, Jan. 2004.
[26] J. F. Canny, “A computational approach to edge detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. PAMI-8, no. 6, pp. 679–698, Nov. 1986.
[27] M. Bertozzi, A. Broggi, A. Fascioli, T. Graf, and M. Meinecke, “Pedestrian detection for driver assistance using multiresolution infrared vision,” IEEE Trans. Veh. Technol., vol. 53, no. 6, pp. 1666–1678, Nov. 2004.
[28] Y. Fang, K. Yamada, Y. Ninomiya, B. Horn, and I. Masaki, “A shape-independent method for pedestrian detection with far-infrared images,” IEEE Trans. Veh. Technol., vol. 53, no. 6, pp. 1679–1697, Nov. 2004.
[29] P. Viola and M. Jones, “Rapid object detection using a boosted cascade of simple features,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Dec. 2001, pp. 511–518.
[30] C. Papageorgiou, M. Oren, and T. Poggio, “A general framework for object detection,” in Proc. IEEE Int. Conf. Comput. Vis., Jan. 1998, pp. 555–562.
[31] Q. Zhu, C. Yeh, T. Cheng, and S. Avidan, “Fast human detection using a cascade of histograms of oriented gradients,” in Proc. IEEE Conf. Comput. Vis. Pattern Recog., Jun. 2006, pp. 1491–1498.
[32] J. Friedman, T. Hastie, and R. Tibshirani, “Additive logistic regression: A statistical view of boosting,” Ann. Stat., vol. 28, no. 2, pp. 337–374, 2000.
[33] R. E. Schapire, Y. Freund, P. Barlett, and W. S. Lee, “Boosting the margin: A new explanation for the effectiveness of voting methods,” Ann. Stat., vol. 26, no. 5, pp. 1651–1686, 1998.
[34] R. E. Schapire and Y. Singer, “Improved boosting algorithms using confidence-rated predictions,” Mach. Learn., vol. 37, no. 3, pp. 297–336, Dec. 1999.
[35] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. New York: Wiley-Interscience, Nov. 2000.
[36] Y. Chen and C. Chen, “Fast human detection using a novel boosted cascading structure with meta stages,” IEEE Trans. Image Process., vol. 17, no. 8, pp. 1452–1464, Aug. 2008.
[37] K. Sung and T. Poggio, “Example-based learning for view-based human face detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 1, pp. 39–51, Jan. 1998.
[38] N. Dalal, “Finding people in images and videos,” Ph.D. dissertation, Institut National Polytechnique de Grenoble, Grenoble, France, Jul. 2006.

Junfeng Ge (S’09) received the B.Sc. degree in measuring and control technology and instrumentations from the Huazhong University of Science and Technology, Wuhan, China, in 2003. He is currently working toward the Ph.D. degree in control science and engineering with the Tsinghua National Laboratory for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China.
His research interests include machine learning and computer vision.

Yupin Luo received the B.Sc. degree from Hunan University, Hunan, China, in 1982 and the M.Sc. and Ph.D. degrees from Nagoya Institute of Technology, Nagoya, Japan, in 1987 and 1990, respectively.
He is currently a Professor with the Tsinghua National Laboratory for Information Science and Technology, Department of Automation, Tsinghua University, Beijing, China. His research interests include image processing, computer vision, and pattern recognition.

Gyomei Tei received the B.Sc. and M.Sc. degrees from the Toyohashi University of Technology, Toyohashi, Japan, in 1988 and 1990, respectively.
He is currently the President of INF Technologies Ltd., Beijing, China. His research interests include pattern recognition and artificial intelligence.
