RetinaFace: Single-stage Dense Face Localisation in the Wild

Jiankang Deng*1,2,4   Jia Guo*2   Yuxiang Zhou1   Jinke Yu2   Irene Kotsia3   Stefanos Zafeiriou1,4
1Imperial College London   2InsightFace   3Middlesex University London   4FaceSoft

arXiv:1905.00641v2 [cs.CV] 4 May 2019

Abstract

Though tremendous strides have been made in uncontrolled face detection, accurate and efficient face localisation in the wild remains an open challenge. This paper presents a robust single-stage face detector, named RetinaFace, which performs pixel-wise face localisation on various scales of faces by taking advantage of joint extra-supervised and self-supervised multi-task learning. Specifically, we make contributions in the following five aspects: (1) We manually annotate five facial landmarks on the WIDER FACE dataset and observe significant improvement in hard face detection with the assistance of this extra supervision signal. (2) We further add a self-supervised mesh decoder branch for predicting pixel-wise 3D face shape information in parallel with the existing supervised branches. (3) On the WIDER FACE hard test set, RetinaFace outperforms the state of the art average precision (AP) by 1.1% (achieving AP equal to 91.4%). (4) On the IJB-C test set, RetinaFace enables state of the art methods (ArcFace) to improve their results in face verification (TAR=89.59% for FAR=1e-6). (5) By employing light-weight backbone networks, RetinaFace can run real-time on a single CPU core for a VGA-resolution image. Extra annotations and code have been made available at: https://github.com/deepinsight/insightface/tree/master/RetinaFace.

Figure 1. The proposed single-stage pixel-wise face localisation method employs extra-supervised and self-supervised multi-task learning in parallel with the existing box classification and regression branches. Each positive anchor outputs (1) a face score, (2) a face box, (3) five facial landmarks, and (4) dense 3D face vertices projected on the image plane.

1. Introduction

Automatic face localisation is the prerequisite step of facial image analysis for many applications such as facial attribute (e.g. expression [64] and age [38]) and facial identity recognition [45, 31, 55, 11]. A narrow definition of face localisation may refer to traditional face detection [53, 62], which aims at estimating the face bounding boxes without any scale and position prior. Nevertheless, in this paper we refer to a broader definition of face localisation which includes face detection [39], face alignment [13], pixel-wise face parsing [48] and 3D dense correspondence regression [2, 12]. That kind of dense face localisation provides accurate facial position information for all different scales.

Inspired by generic object detection methods [16, 43, 30, 41, 42, 28, 29], which embraced all the recent advances in deep learning, face detection has recently achieved remarkable progress [23, 36, 68, 8, 49]. Different from generic object detection, face detection features smaller ratio variations (from 1:1 to 1:1.5) but much larger scale variations (from several pixels to thousands of pixels). The most recent state-of-the-art methods [36, 68, 49] focus on single-stage [30, 29] design which densely samples face locations and scales on feature pyramids [28], demonstrating promising performance and yielding faster speed compared to two-stage methods [43, 63, 8]. Following this route, we improve the single-stage face detection framework and propose a state-of-the-art dense face localisation method by exploiting multi-task losses coming from strongly supervised and self-supervised signals. Our idea is exemplified in Fig. 1.

* Equal contributions. Email: j.deng16@imperial.ac.uk; guojia@gmail.com. InsightFace is a nonprofit Github project for 2D and 3D face analysis.
Typically, the face detection training process contains both classification and box regression losses [16]. Chen et al. [6] proposed to combine face detection and alignment in a joint cascade framework based on the observation that aligned face shapes provide better features for face classification. Inspired by [6], MTCNN [66] and STN [5] simultaneously detected faces and five facial landmarks. Due to training data limitation, JDA [6], MTCNN [66] and STN [5] have not verified whether tiny face detection can benefit from the extra supervision of five facial landmarks. One of the questions we aim at answering in this paper is whether we can push forward the current best performance (90.3% [67]) on the WIDER FACE hard test set [60] by using an extra supervision signal built of five facial landmarks.

In Mask R-CNN [20], the detection performance is significantly improved by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition and regression. That confirms that dense pixel-wise annotations are also beneficial to improve detection. Unfortunately, for the challenging faces of WIDER FACE it is not possible to conduct dense face annotation (either in the form of more landmarks or semantic segments). Since supervised signals cannot be easily obtained, the question is whether we can apply unsupervised methods to further improve face detection.

In FAN [56], an anchor-level attention map is proposed to improve the occluded face detection. Nevertheless, the proposed attention map is quite coarse and does not contain semantic information. Recently, self-supervised 3D morphable models [14, 51, 52, 70] have achieved promising 3D face modelling in-the-wild. Especially, Mesh Decoder [70] achieves over real-time speed by exploiting graph convolutions [10, 40] on joint shape and texture. However, the main challenges of applying the mesh decoder [70] in the single-stage detector are: (1) camera parameters are hard to estimate accurately, and (2) the joint latent shape and texture representation is predicted from a single feature vector (1 × 1 Conv on the feature pyramid) instead of the RoI pooled feature, which indicates the risk of feature shift. In this paper, we employ a mesh decoder [70] branch through self-supervised learning for predicting a pixel-wise 3D face shape in parallel with the existing supervised branches.

To summarise, our key contributions are:
• Based on a single-stage design, we propose a novel pixel-wise face localisation method named RetinaFace, which employs a multi-task learning strategy to simultaneously predict face score, face box, five facial landmarks, and 3D position and correspondence of each facial pixel.
• On the WIDER FACE hard subset, RetinaFace outperforms the AP of the state of the art two-stage method (ISRN [67]) by 1.1% (AP equal to 91.4%).
• On the IJB-C dataset, RetinaFace helps to improve ArcFace's [11] verification accuracy (with TAR equal to 89.59% when FAR=1e-6). This indicates that better face localisation can significantly improve face recognition.
• By employing light-weight backbone networks, RetinaFace can run real-time on a single CPU core for a VGA-resolution image.
• Extra annotations and code have been released to facilitate future research.

2. Related Work

Image pyramid v.s. feature pyramid: The sliding-window paradigm, in which a classifier is applied on a dense image grid, can be traced back to past decades. The milestone work of Viola-Jones [53] explored a cascade chain to reject false face regions from an image pyramid with real-time efficiency, leading to the widespread adoption of such scale-invariant face detection frameworks [66, 5]. Even though the sliding-window on image pyramid was the leading detection paradigm [19, 32], with the emergence of feature pyramids [28], sliding-anchor [43] on multi-scale feature maps [68, 49] quickly dominated face detection.

Two-stage v.s. single-stage: Current face detection methods have inherited some achievements from generic object detection approaches and can be divided into two categories: two-stage methods (e.g. Faster R-CNN [43, 63, 72]) and single-stage methods (e.g. SSD [30, 68] and RetinaNet [29, 49]). Two-stage methods employed a "proposal and refinement" mechanism featuring high localisation accuracy. By contrast, single-stage methods densely sampled face locations and scales, which resulted in extremely unbalanced positive and negative samples during training. To handle this imbalance, sampling [47] and re-weighting [29] methods were widely adopted. Compared to two-stage methods, single-stage methods are more efficient and have higher recall rate but at the risk of achieving a higher false positive rate and compromising the localisation accuracy.

Context Modelling: To enhance the model's contextual reasoning power for capturing tiny faces [23], SSH [36] and PyramidBox [49] applied context modules on feature pyramids to enlarge the receptive field from Euclidean grids. To enhance the non-rigid transformation modelling capacity of CNNs, the deformable convolution network (DCN) [9, 74] employed a novel deformable layer to model geometric transformations. The champion solution of the WIDER Face Challenge 2018 [33] indicates that rigid (expansion) and non-rigid (deformation) context modelling are complementary and orthogonal for improving the performance of face detection.
Multi-task Learning: Joint face detection and alignment is widely used [6, 66, 5] as aligned face shapes provide better features for face classification. In Mask R-CNN [20], the detection performance was significantly improved by adding a branch for predicting an object mask in parallel with the existing branches. Densepose [1] adopted the architecture of Mask-RCNN to obtain dense part labels and coordinates within each of the selected regions. Nevertheless, the dense regression branch in [20, 1] was trained by supervised learning. In addition, the dense branch was a small FCN applied to each RoI to predict a pixel-to-pixel dense mapping.

3. RetinaFace

3.1. Multi-task Loss

For any training anchor i, we minimise the following multi-task loss:

    L = L_cls(p_i, p_i*) + λ1 p_i* L_box(t_i, t_i*) + λ2 p_i* L_pts(l_i, l_i*) + λ3 p_i* L_pixel.        (1)

(1) Face classification loss L_cls(p_i, p_i*), where p_i is the predicted probability of anchor i being a face and p_i* is 1 for the positive anchor and 0 for the negative anchor. The classification loss L_cls is the softmax loss for binary classes (face/not face). (2) Face box regression loss L_box(t_i, t_i*), where t_i = {t_x, t_y, t_w, t_h}_i and t_i* = {t_x*, t_y*, t_w*, t_h*}_i represent the coordinates of the predicted box and ground-truth box associated with the positive anchor. We follow [16] to normalise the box regression targets (i.e. centre location, width and height) and use L_box(t_i, t_i*) = R(t_i − t_i*), where R is the robust loss function (smooth-L1) defined in [16]. (3) Facial landmark regression loss L_pts(l_i, l_i*), where l_i = {l_x1, l_y1, ..., l_x5, l_y5}_i and l_i* = {l_x1*, l_y1*, ..., l_x5*, l_y5*}_i represent the predicted five facial landmarks and the ground truth associated with the positive anchor. Similar to the box centre regression, the five facial landmark regression also employs the target normalisation based on the anchor centre. (4) Dense regression loss L_pixel (refer to Eq. 3). The loss-balancing parameters λ1, λ2 and λ3 are set to 0.25, 0.1 and 0.01, which means that we increase the significance of better box and landmark locations from the supervision signals.

3.2. Dense Regression Branch

Mesh Decoder. We directly employ the mesh decoder (mesh convolution and mesh up-sampling) from [70, 40], which is a graph convolution method based on fast localised spectral filtering [10]. In order to achieve further acceleration, we also use a joint shape and texture decoder similarly to the method in [70], contrary to [40] which only decoded shape.

Below we will briefly explain the concept of graph convolutions and outline why they can be used for fast decoding. As illustrated in Fig. 3(a), a 2D convolutional operation is a "kernel-weighted neighbour sum" within the Euclidean grid receptive field. Similarly, graph convolution also employs the same concept as shown in Fig. 3(b). However, the neighbour distance is calculated on the graph by counting the minimum number of edges connecting two vertices. We follow [70] to define a coloured face mesh G = (V, E), where V ∈ R^{n×6} is a set of face vertices containing the joint shape and texture information, and E ∈ {0, 1}^{n×n} is a sparse adjacency matrix encoding the connection status between vertices. The graph Laplacian is defined as L = D − E ∈ R^{n×n}, where D ∈ R^{n×n} is a diagonal matrix with D_ii = Σ_j E_ij.

Following [10, 40, 70], the graph convolution with kernel g_θ can be formulated as a recursive Chebyshev polynomial truncated at order K,

    y = g_θ(L)x = Σ_{k=0}^{K−1} θ_k T_k(L̃)x,        (2)

where θ ∈ R^K is a vector of Chebyshev coefficients and T_k(L̃) ∈ R^{n×n} is the Chebyshev polynomial of order k evaluated at the scaled Laplacian L̃. Denoting x̄_k = T_k(L̃)x ∈ R^n, we can recurrently compute x̄_k = 2L̃x̄_{k−1} − x̄_{k−2} with x̄_0 = x and x̄_1 = L̃x. The whole filtering operation is extremely efficient, including K sparse matrix-vector multiplications and one dense matrix-vector multiplication y = g_θ(L)x = [x̄_0, ..., x̄_{K−1}]θ.

Differentiable Renderer. After we predict the shape and texture parameters P_ST ∈ R^128, we employ an efficient differentiable 3D mesh renderer [14] to project a coloured mesh D_{P_ST} onto a 2D image plane with camera parameters P_cam = [x_c, y_c, z_c, x'_c, y'_c, z'_c, f_c] (i.e. camera location, camera pose and focal length) and illumination parameters P_ill = [x_l, y_l, z_l, r_l, g_l, b_l, r_a, g_a, b_a] (i.e. location of the point light source, colour values and colour of ambient lighting).

Dense Regression Loss. Once we get the rendered 2D face R(D_{P_ST}, P_cam, P_ill), we compare the pixel-wise difference of the rendered and the original 2D face using the following function:

    L_pixel = (1 / (W · H)) Σ_i^W Σ_j^H || R(D_{P_ST}, P_cam, P_ill)_{i,j} − I*_{i,j} ||_1,        (3)

where W and H are the width and height of the anchor crop I*_{i,j}, respectively.
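To make the filtering of Eq. 2 concrete, the following is a minimal sketch of the Chebyshev recursion (our illustration, not the released RetinaFace or mesh-decoder code; the variable names and the toy rescaling of the Laplacian are assumptions):

    import numpy as np
    import scipy.sparse as sp

    def chebyshev_graph_conv(x, L_scaled, theta):
        """Evaluate y = sum_k theta[k] * T_k(L_tilde) x via the three-term recurrence.

        x        : (n,) signal on the mesh vertices.
        L_scaled : (n, n) sparse scaled Laplacian L_tilde.
        theta    : (K,) Chebyshev coefficients of the learned kernel.
        """
        K = len(theta)
        x_bar = [x]                          # x_bar_0 = x
        if K > 1:
            x_bar.append(L_scaled @ x)       # x_bar_1 = L_tilde x
        for k in range(2, K):
            # Recurrence: x_bar_k = 2 L_tilde x_bar_{k-1} - x_bar_{k-2}
            x_bar.append(2 * (L_scaled @ x_bar[-1]) - x_bar[-2])
        # One dense combination: y = [x_bar_0, ..., x_bar_{K-1}] theta
        return np.stack(x_bar, axis=1) @ theta

    # Toy usage on a 4-vertex path graph.
    E = sp.csr_matrix(np.array([[0, 1, 0, 0],
                                [1, 0, 1, 0],
                                [0, 1, 0, 1],
                                [0, 0, 1, 0]], dtype=float))
    D = sp.diags(np.asarray(E.sum(axis=1)).ravel())
    L = D - E                                # graph Laplacian L = D - E
    L_tilde = L - sp.identity(4)             # crude rescaling, for illustration only
    y = chebyshev_graph_conv(np.arange(4, dtype=float), L_tilde,
                             theta=np.array([0.5, 0.3, 0.2]))

The cost is dominated by K sparse matrix-vector products, which is why the decoder can run far above real-time speed.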
Figure 2. An overview of the proposed single-stage dense face localisation approach. RetinaFace is designed based on the feature pyramids with independent context modules. Following the context modules, we calculate a multi-task loss for each anchor.

Figure 3. (a) 2D convolution is a kernel-weighted neighbour sum within the Euclidean grid receptive field. Each convolutional layer has Kernel_H × Kernel_W × Channel_in × Channel_out parameters. (b) Graph convolution is also in the form of a kernel-weighted neighbour sum, but the neighbour distance is calculated on the graph by counting the minimum number of edges connecting two vertices. Each convolutional layer has K × Channel_in × Channel_out parameters and the Chebyshev coefficients θ_{i,j} ∈ R^K are truncated at order K.

4. Experiments

4.1. Dataset

The WIDER FACE dataset [60] consists of 32,203 images and 393,703 face bounding boxes with a high degree of variability in scale, pose, expression, occlusion and illumination. The WIDER FACE dataset is split into training (40%), validation (10%) and testing (50%) subsets by randomly sampling from 61 scene categories. Based on the detection rate of EdgeBox [76], three levels of difficulty (i.e. Easy, Medium and Hard) are defined by incrementally incorporating hard samples.

Extra Annotations. As illustrated in Fig. 4 and Tab. 1, we define five levels of face image quality (according to how difficult it is to annotate landmarks on the face) and annotate five facial landmarks (i.e. eye centres, nose tip and mouth corners) on faces that can be annotated from the WIDER FACE training and validation subsets. In total, we have annotated 84.6k faces on the training set and 18.5k faces on the validation set.

Level   Face Number   Criterion
1       4,127         indisputable 68 landmarks [44]
2       12,636        annotatable 68 landmarks [44]
3       38,140        indisputable 5 landmarks
4       50,024        annotatable 5 landmarks
5       94,095        distinguish by context

Table 1. Five levels of face image quality. In the indisputable category a human can, without a lot of effort, locate the landmarks. In the annotatable category finding an approximate location requires some effort.

Figure 4. We add extra annotations of five facial landmarks on faces that can be annotated (we call them "annotatable") from the WIDER FACE training and validation sets.

4.2. Implementation details

Feature Pyramid. RetinaFace employs feature pyramid levels from P2 to P6, where P2 to P5 are computed from the output of the corresponding ResNet residual stage (C2 through C5) using top-down and lateral connections as in [28, 29]. P6 is calculated through a 3×3 convolution with stride=2 on C5. C1 to C5 form a ResNet-152 [21] classification network pre-trained on the ImageNet-11k dataset, while P6 is randomly initialised with the "Xavier" method [17].

Context Module. Inspired by SSH [36] and PyramidBox [49], we also apply independent context modules on the five feature pyramid levels to increase the receptive field and enhance the rigid context modelling power. Drawing lessons from the champion of the WIDER Face Challenge 2018 [33], we also replace all 3 × 3 convolution layers within the lateral connections and context modules by the deformable convolution network (DCN) [9, 74], which further strengthens the non-rigid context modelling capacity.

Loss Head. For negative anchors, only the classification loss is applied. For positive anchors, the proposed multi-task loss is calculated. We employ a shared loss head (1 × 1 conv) across the different feature maps H_n × W_n × 256, n ∈ {2, ..., 6}. For the mesh decoder, we apply the pre-trained model [70], which is a small computational overhead that allows for efficient inference.
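For illustration, the per-anchor combination of Eq. 1 produced by the loss head can be sketched as below (a hedged NumPy sketch with hypothetical tensor names, not the released implementation; it assumes the normalised regression residuals and L_pixel are already computed per anchor):

    import numpy as np

    def smooth_l1(x):
        """Elementwise smooth-L1 (Huber) penalty used for box and landmark regression."""
        ax = np.abs(x)
        return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

    def multitask_loss(cls_logits, labels, box_res, pts_res, pixel_loss,
                       w_box=0.25, w_pts=0.1, w_pixel=0.01):
        """Eq. 1 averaged over a batch of anchors.

        cls_logits : (N, 2) face / non-face logits.
        labels     : (N,)   int, 1 for positive anchors, 0 for negatives.
        box_res    : (N, 4) normalised box residuals t_i - t_i*.
        pts_res    : (N, 10) normalised landmark residuals l_i - l_i*.
        pixel_loss : (N,)   dense regression loss L_pixel per anchor (Eq. 3).
        """
        # Softmax cross-entropy for the face / non-face classification branch.
        z = cls_logits - cls_logits.max(axis=1, keepdims=True)
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        l_cls = -log_prob[np.arange(len(labels)), labels]

        pos = (labels == 1).astype(float)   # p_i*: regression terms apply to positives only
        l_box = smooth_l1(box_res).sum(axis=1)
        l_pts = smooth_l1(pts_res).sum(axis=1)
        return (l_cls + pos * (w_box * l_box + w_pts * l_pts + w_pixel * pixel_loss)).mean()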
Anchor Settings. As illustrated in Tab. 2, we employ scale-specific anchors on the feature pyramid levels from P2 to P6 as in [56]. Here, P2 is designed to capture tiny faces by tiling small anchors at the cost of more computational time and at the risk of more false positives. We set the scale step at 2^{1/3} and the aspect ratio at 1:1. With the input image size at 640 × 640, the anchors can cover scales from 16 × 16 to 406 × 406 on the feature pyramid levels. In total, there are 102,300 anchors, and 75% of these anchors are from P2.

Feature Pyramid          Stride   Anchor
P2 (160 × 160 × 256)     4        16, 20.16, 25.40
P3 (80 × 80 × 256)       8        32, 40.32, 50.80
P4 (40 × 40 × 256)       16       64, 80.63, 101.59
P5 (20 × 20 × 256)       32       128, 161.26, 203.19
P6 (10 × 10 × 256)       64       256, 322.54, 406.37

Table 2. The details of the feature pyramid, stride size and anchors in RetinaFace. For a 640 × 640 input image, there are 102,300 anchors in total, and 75% of these anchors are tiled on P2.

During training, anchors are matched to a ground-truth box when the IoU is larger than 0.5, and to the background when the IoU is less than 0.3. Unmatched anchors are ignored during training. Since most of the anchors (> 99%) are negative after the matching step, we employ standard OHEM [47, 68] to alleviate the significant imbalance between the positive and negative training examples. More specifically, we sort negative anchors by the loss values and select the top ones so that the ratio between the negative and positive samples is at least 3:1.

Data Augmentation. Since there are around 20% tiny faces in the WIDER FACE training set, we follow [68, 49] and randomly crop square patches from the original images and resize these patches into 640 × 640 to generate larger training faces. More specifically, square patches are cropped from the original image with a random size between [0.3, 1] of the short edge of the original image. For the faces on the crop boundary, we keep the overlapped part of the face box if its centre is within the crop patch. Besides random crop, we also augment the training data by random horizontal flip with a probability of 0.5 and photo-metric colour distortion [68].

Training Details. We train RetinaFace using the SGD optimiser (momentum at 0.9, weight decay at 0.0005, batch size of 8 × 4) on four NVIDIA Tesla P40 (24GB) GPUs. The learning rate starts from 10^-3, rising to 10^-2 after 5 epochs, then divided by 10 at 55 and 68 epochs. The training process terminates at 80 epochs.

Testing Details. For testing on WIDER FACE, we follow the standard practices of [36, 68] and employ flip as well as multi-scale (the short edge of the image at [500, 800, 1100, 1400, 1700]) strategies. Box voting [15] is applied on the union set of predicted face boxes using an IoU threshold of 0.4.
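As a quick sanity check on the anchor layout of Tab. 2 above, the following sketch (a hypothetical re-derivation, not the released code) reproduces the anchor scales and counts for a 640 × 640 input:

    import numpy as np

    def retinaface_anchor_layout(img_size=640, strides=(4, 8, 16, 32, 64), scales_per_level=3):
        """Tile scale-specific, 1:1 anchors on P2-P6 as described in Tab. 2."""
        total, per_level = 0, {}
        for stride in strides:
            base = 4 * stride                                           # 16 on P2, ..., 256 on P6
            sizes = [base * 2 ** (k / 3) for k in range(scales_per_level)]  # scale step 2^(1/3)
            fmap = img_size // stride                                   # 160, 80, 40, 20, 10
            count = fmap * fmap * scales_per_level
            per_level[f"stride {stride}"] = (np.round(sizes, 2).tolist(), count)
            total += count
        return per_level, total

    layout, total = retinaface_anchor_layout()
    print(layout)    # e.g. stride 4 -> ([16.0, 20.16, 25.4], 76800)
    print(total)     # 102300 anchors in total; the stride-4 anchors account for ~75%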
4.3. Ablation Study

To achieve a better understanding of the proposed RetinaFace, we conduct extensive ablation experiments to examine how the annotated five facial landmarks and the proposed dense regression branch quantitatively affect the performance of face detection. Besides the standard evaluation metric of average precision (AP) at IoU=0.5 on the Easy, Medium and Hard subsets, we also make use of the development server (Hard validation subset) of the WIDER Face Challenge 2018 [33], which employs a more strict evaluation metric of mean AP (mAP) for IoU=0.5:0.05:0.95, rewarding more accurate face detectors.

As illustrated in Tab. 3, we evaluate the performance of several different settings on the WIDER FACE validation set and focus on the observations of AP and mAP on the Hard subset. By applying the practices of state-of-the-art techniques (i.e. FPN, context module, and deformable convolution), we set up a strong baseline (91.286%), which is slightly better than ISRN [67] (90.9%). Adding the branch of five facial landmark regression significantly improves the face box AP (0.408%) and mAP (0.775%) on the Hard subset, suggesting that landmark localisation is crucial for improving the accuracy of face detection. By contrast, adding the dense regression branch increases the face box AP on the Easy and Medium subsets but slightly deteriorates the results on the Hard subset, indicating the difficulty of dense regression under challenging scenarios. Nevertheless, learning landmark and dense regression jointly enables a further improvement compared to adding landmark regression only. This demonstrates that landmark regression does help dense regression, which in turn boosts face detection performance even further.

Method              Easy     Medium   Hard     mAP [33]
FPN+Context         95.532   95.134   90.714   50.842
+DCN                96.349   95.833   91.286   51.522
+L_pts              96.467   96.075   91.694   52.297
+L_pixel            96.413   95.864   91.276   51.492
+L_pts + L_pixel    96.942   96.175   91.857   52.318

Table 3. Ablation experiments of the proposed methods on the WIDER FACE validation subset.

4.4. Face box Accuracy

Following the standard evaluation protocol of the WIDER FACE dataset, we only train the model on the training set and test on both the validation and test sets. To obtain the evaluation results on the test set, we submit the detection results to the organisers for evaluation. As shown in Fig. 5, we compare the proposed RetinaFace with other 24 state-of-the-art face detection algorithms (i.e. Multiscale Cascade CNN [60], Two-stage CNN [60], ACF-WIDER [58], Faceness-WIDER [59], Multitask Cascade CNN [66], CMS-RCNN [72], LDCF+ [37], HR [23], Face R-CNN [54], ScaleFace [61], SSH [36], SFD [68], Face R-FCN [57], MSCNN [4], FAN [56], Zhu et al. [71], PyramidBox [49], FDNet [63], SRN [8], FANet [65], DSFD [27], DFS [50], VIM-FD [69], ISRN [67]). Our approach outperforms these state-of-the-art methods in terms of AP. More specifically, RetinaFace produces the best AP in all subsets of both validation and test sets, i.e., 96.9% (Easy), 96.1% (Medium) and 91.8% (Hard) for the validation set, and 96.3% (Easy), 95.6% (Medium) and 91.4% (Hard) for the test set. Compared to the recent best performing method [67], RetinaFace sets up a new impressive record (91.4% v.s. 90.3%) on the Hard subset, which contains a large number of tiny faces.

Figure 5. Precision-recall curves on the WIDER FACE validation and test subsets: (a) Val: Easy, (b) Val: Medium, (c) Val: Hard, (d) Test: Easy, (e) Test: Medium, (f) Test: Hard.

In Fig. 6, we illustrate qualitative results on a selfie with dense faces. RetinaFace successfully finds about 900 faces (threshold at 0.5) out of the reported 1,151 faces. Besides accurate bounding boxes, the five facial landmarks predicted by RetinaFace are also very robust under the variations of pose, occlusion and resolution. Even though there are some failure cases of dense face localisation under heavy occlusion, the dense regression results on some clear and large faces are good and even show expression variations.

Figure 6. RetinaFace can find around 900 faces (threshold at 0.5) out of the reported 1,151 people, by taking advantage of the proposed joint extra-supervised and self-supervised multi-task learning. Detector confidence is given by the colour bar on the right. Dense localisation masks are drawn in blue. Please zoom in to check the detailed detection, alignment and dense regression results on tiny faces.

4.5. Five Facial Landmark Accuracy

To evaluate the accuracy of five facial landmark localisation, we compare RetinaFace with MTCNN [66] on the AFLW dataset [26] (24,386 faces) as well as the WIDER FACE validation set (18.5k faces). Here, we employ the face box size (√(W × H)) as the normalisation distance. As shown in Fig. 7(a), we give the mean error of each facial landmark on the AFLW dataset [73]. RetinaFace significantly decreases the normalised mean error (NME) from 2.72% to 2.21% when compared to MTCNN. In Fig. 7(b), we show the cumulative error distribution (CED) curves on the WIDER FACE validation set. Compared to MTCNN, RetinaFace significantly decreases the failure rate from 26.31% to 9.37% (at the NME threshold of 10%).

Figure 7. Qualitative comparison between MTCNN and RetinaFace on five facial landmark localisation: (a) per-landmark NME on AFLW, (b) CED curves (NME normalised by bounding box size) on the WIDER FACE validation set.
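For reference, the NME and failure rate reported in Sec. 4.5 above can be computed as in the following sketch (hypothetical array names, not the official evaluation script):

    import numpy as np

    def five_point_nme(pred, gt, box):
        """Normalised mean error for five facial landmarks.

        pred, gt : (N, 5, 2) predicted and ground-truth landmark coordinates.
        box      : (N, 4) face boxes (x1, y1, x2, y2); sqrt(W * H) is the normalisation distance.
        """
        norm = np.sqrt((box[:, 2] - box[:, 0]) * (box[:, 3] - box[:, 1]))  # sqrt(W * H) per face
        err = np.linalg.norm(pred - gt, axis=2).mean(axis=1)               # mean point-to-point error
        nme = err / norm
        failure_rate = (nme > 0.10).mean()                                  # share of faces with NME > 10%
        return nme.mean(), failure_rate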
4.6. Dense Facial Landmark Accuracy

Besides the box and five facial landmarks, RetinaFace also outputs dense face correspondence, but the dense regression branch is trained by self-supervised learning only. Following [12, 70], we evaluate the accuracy of dense facial landmark localisation on the AFLW2000-3D dataset [75] considering (1) 68 landmarks with the 2D projection coordinates and (2) all landmarks with 3D coordinates. Here, the mean error is still normalised by the bounding box size [75]. In Fig. 8(a) and 8(b), we give the CED curves of state-of-the-art methods [12, 70, 75, 25, 3] as well as RetinaFace. Even though a performance gap exists between supervised and self-supervised methods, the dense regression results of RetinaFace are comparable with these state-of-the-art methods. More specifically, we observe that (1) five facial landmark regression can alleviate the training difficulty of the dense regression branch and significantly improve the dense regression results, and (2) using single-stage features (as in RetinaFace) to predict dense correspondence parameters is much harder than employing Region of Interest (RoI) features (as in Mesh Decoder [70]). As illustrated in Fig. 8(c), RetinaFace can easily handle faces with pose variations but has difficulty under complex scenarios. This indicates that the mis-aligned and over-compacted feature representation (1 × 1 × 256 in RetinaFace) impedes the single-stage framework from achieving highly accurate dense regression outputs. Nevertheless, the projected face regions in the dense regression branch still have the effect of attention [56], which can help to improve face detection as confirmed in the ablation study.

Figure 8. CED curves on AFLW2000-3D. Evaluation is performed on (a) 68 landmarks with the 2D coordinates (NME: PRNet 3.27, 3D-FAN 3.48, Mesh Decoder 3.99, DeFA 4.37, 3DDFA 6.03, RetinaFace with 5 pts 7.17, RetinaFace without 5 pts 8.53) and (b) all landmarks with the 3D coordinates (NME: PRNet 4.41, Mesh Decoder 5.41, DeFA 6.04, 3DDFA 6.56, RetinaFace with 5 pts 8.42, RetinaFace without 5 pts 10.21). In (c), we compare the dense regression results from RetinaFace and Mesh Decoder [70] (upper: Mesh Decoder; lower: RetinaFace). RetinaFace can easily handle faces with pose variations but has difficulty predicting accurate dense correspondence under complex scenarios.

4.7. Face Recognition Accuracy

Face detection plays a crucial role in robust face recognition, but its effect is rarely explicitly measured. In this paper, we demonstrate how our face detection method can boost the performance of a state-of-the-art publicly available face recognition method, i.e. ArcFace [11]. ArcFace [11] studied how different aspects of the training process of a deep convolutional neural network (i.e., choice of the training set, the network and the loss function) affect large-scale face recognition performance. However, the ArcFace paper did not study the effect of face detection, applying only MTCNN [66] for detection and alignment. In this paper, we replace MTCNN by RetinaFace to detect and align all of the training data (i.e. MS1M [18]) and test data (i.e. LFW [24], CFP-FP [46], AgeDB-30 [35] and IJB-C [34]), and keep the embedding network (i.e. ResNet100 [21]) and the loss function (i.e. additive angular margin) exactly the same as ArcFace.

In Tab. 4, we show the influence of face detection and alignment on deep face recognition (i.e. ArcFace) by comparing the widely used MTCNN [66] and the proposed RetinaFace. The results on CFP-FP demonstrate that RetinaFace can boost ArcFace's verification accuracy from 98.37% to 99.49%. This result shows that the performance of frontal-profile face verification is now approaching that of frontal-frontal face verification (e.g. 99.86% on LFW).

Methods               LFW     CFP-FP   AgeDB-30
MTCNN+ArcFace [11]    99.83   98.37    98.15
RetinaFace+ArcFace    99.86   99.49    98.60

Table 4. Verification performance (%) of different methods on LFW, CFP-FP and AgeDB-30.
In Fig. 9, we show the ROC curves of the 1:1 verification protocol on the IJB-C dataset as well as the TAR at FAR=1e-6 at the end of each legend. We employ two tricks (i.e. flip test and face detection score to weigh samples within templates) to progressively improve the face verification accuracy. Under fair comparison, TAR (at FAR=1e-6) significantly improves from 88.29% to 89.59% simply by replacing MTCNN with RetinaFace. This indicates that (1) face detection and alignment significantly affect face recognition performance and (2) RetinaFace is a much stronger baseline than MTCNN for face recognition applications.

Figure 9. ROC curves of the 1:1 verification protocol on the IJB-C dataset. "+F" refers to flip test during feature embedding and "+S" denotes the face detection score used to weigh samples within templates. We also give the TAR for FAR=1e-6 at the end of each legend.

4.8. Inference Efficiency

During testing, RetinaFace performs face localisation in a single stage, which is flexible and efficient. Besides the above-explored heavy-weight model (ResNet-152, size of 262MB, and AP 91.8% on the WIDER FACE hard set), we also resort to a light-weight model (MobileNet-0.25 [22], size of 1MB, and AP 78.2% on the WIDER FACE hard set) to accelerate the inference.

For the light-weight model, we can quickly reduce the data size by using a 7 × 7 convolution with stride=4 on the input image, tile dense anchors on P3, P4 and P5 as in [36], and remove the deformable layers. In addition, the first two convolutional layers initialised by the ImageNet pre-trained model are fixed to achieve higher accuracy.

Tab. 5 gives the inference time of the two models with respect to different input sizes. We omit the time cost of the dense regression branch, thus the time statistics are irrelevant to the face density of the input image. We take advantage of TVM [7] to accelerate the model inference, and timing is performed on the NVIDIA Tesla P40 GPU, Intel i7-6700K CPU and ARM-RK3399, respectively. RetinaFace-ResNet-152 is designed for highly accurate face localisation, running at 13 FPS for VGA images (640 × 480). By contrast, RetinaFace-MobileNet-0.25 is designed for highly efficient face localisation, demonstrating considerable real-time speed of 40 FPS on GPU for 4K images (4096 × 2160), 20 FPS on multi-thread CPU for HD images (1920 × 1080), and 60 FPS on single-thread CPU for VGA images (640 × 480). Even more impressively, 16 FPS on ARM for VGA images (640 × 480) allows for a fast system on mobile devices.

Backbones                 VGA     HD      4K
ResNet-152 (GPU)          75.1    443.2   1742
MobileNet-0.25 (GPU)      1.4     6.1     25.6
MobileNet-0.25 (CPU-m)    5.5     50.3    -
MobileNet-0.25 (CPU-1)    17.2    130.4   -
MobileNet-0.25 (ARM)      61.2    434.3   -

Table 5. Inference time (ms) of RetinaFace with different backbones (ResNet-152 and MobileNet-0.25) on different input sizes (VGA@640x480, HD@1920x1080 and 4K@4096x2160). "CPU-1" and "CPU-m" denote single-thread and multi-thread tests on the Intel i7-6700K CPU, respectively. "GPU" refers to the NVIDIA Tesla P40 GPU and the "ARM" platform is RK3399 (A72x2).
5. Conclusions

We studied the challenging problem of simultaneous dense localisation and alignment of faces of arbitrary scales in images and we proposed the first, to the best of our knowledge, one-stage solution (RetinaFace). Our solution outperforms state of the art methods in the current most challenging benchmarks for face detection. Furthermore, when RetinaFace is combined with state-of-the-art practices for face recognition it obviously improves the accuracy. The data and models have been made publicly available to facilitate further research on the topic.

6. Acknowledgements

Jiankang Deng acknowledges financial support from the Imperial President's PhD Scholarship and GPU donations from NVIDIA. Stefanos Zafeiriou acknowledges support from EPSRC Fellowship DEFORM (EP/S010203/1), FACER2VM (EP/N007743/1) and a Google Faculty Fellowship.

References

[1] R. Alp Güler, N. Neverova, and I. Kokkinos. Densepose: Dense human pose estimation in the wild. In CVPR, 2018.
[2] R. Alp Guler, G. Trigeorgis, E. Antonakos, P. Snape, S. Zafeiriou, and I. Kokkinos. Densereg: Fully convolutional dense shape regression in-the-wild. In CVPR, 2017.
[3] A. Bulat and G. Tzimiropoulos. How far are we from solving the 2d & 3d face alignment problem? (and a dataset of 230,000 3d facial landmarks). In ICCV, 2017.
[4] Z. Cai, Q. Fan, R. S. Feris, and N. Vasconcelos. A unified multi-scale deep convolutional neural network for fast object detection. In ECCV, 2016.
[5] D. Chen, G. Hua, F. Wen, and J. Sun. Supervised transformer network for efficient face detection. In ECCV, 2016.
[6] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In ECCV, 2014.
[7] T. Chen, T. Moreau, Z. Jiang, L. Zheng, E. Yan, H. Shen, M. Cowan, L. Wang, Y. Hu, L. Ceze, et al. TVM: An automated end-to-end optimizing compiler for deep learning. In OSDI, 2018.
[8] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou. Selective refinement network for high performance face detection. In AAAI, 2019.
[9] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks. In ICCV, 2017.
[10] M. Defferrard, X. Bresson, and P. Vandergheynst. Convolutional neural networks on graphs with fast localized spectral filtering. In NeurIPS, 2016.
[11] J. Deng, J. Guo, N. Xue, and S. Zafeiriou. Arcface: Additive angular margin loss for deep face recognition. In CVPR, 2019.
[12] Y. Feng, F. Wu, X. Shao, Y. Wang, and X. Zhou. Joint 3d face reconstruction and dense alignment with position map regression network. In ECCV, 2018.
[13] Z.-H. Feng, J. Kittler, M. Awais, P. Huber, and X.-J. Wu. Wing loss for robust facial landmark localisation with convolutional neural networks. In CVPR, 2018.
[14] K. Genova, F. Cole, A. Maschinot, A. Sarna, D. Vlasic, and W. T. Freeman. Unsupervised training for 3d morphable model regression. In CVPR, 2018.
[15] S. Gidaris and N. Komodakis. Object detection via a multi-region and semantic segmentation-aware cnn model. In ICCV, 2015.
[16] R. Girshick. Fast r-cnn. In ICCV, 2015.
[17] X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
[18] Y. Guo, L. Zhang, Y. Hu, X. He, and J. Gao. Ms-celeb-1m: A dataset and benchmark for large-scale face recognition. In ECCV, 2016.
[19] Z. Hao, Y. Liu, H. Qin, J. Yan, X. Li, and X. Hu. Scale-aware face detection. In CVPR, 2017.
[20] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
[21] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[22] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv:1704.04861, 2017.
[23] P. Hu and D. Ramanan. Finding tiny faces. In CVPR, 2017.
[24] G. B. Huang, M. Ramesh, T. Berg, and E. Learned-Miller. Labeled faces in the wild: A database for studying face recognition in unconstrained environments. Technical report, 2007.
[25] A. Jourabloo and X. Liu. Large-pose face alignment via cnn-based dense 3d model fitting. In CVPR, 2016.
[26] M. Koestinger, P. Wohlhart, P. M. Roth, and H. Bischof. Annotated facial landmarks in the wild: A large-scale, real-world database for facial landmark localization. In ICCV Workshops, 2011.
[27] J. Li, Y. Wang, C. Wang, Y. Tai, J. Qian, J. Yang, C. Wang, J. Li, and F. Huang. Dsfd: Dual shot face detector. arXiv:1810.10220, 2018.
[28] T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
[29] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár. Focal loss for dense object detection. In ICCV, 2017.
[30] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. Ssd: Single shot multibox detector. In ECCV, 2016.
[31] W. Liu, Y. Wen, Z. Yu, M. Li, B. Raj, and L. Song. Sphereface: Deep hypersphere embedding for face recognition. In CVPR, 2017.
[32] Y. Liu, H. Li, J. Yan, F. Wei, X. Wang, and X. Tang. Recurrent scale approximation for object detection in cnn. In CVPR, 2017.
[33] C. C. Loy, D. Lin, W. Ouyang, Y. Xiong, S. Yang, Q. Huang, D. Zhou, W. Xia, Q. Li, P. Luo, et al. Wider face and pedestrian challenge 2018: Methods and results. arXiv:1902.06854, 2019.
[34] B. Maze, J. Adams, J. A. Duncan, N. Kalka, T. Miller, C. Otto, A. K. Jain, W. T. Niggel, J. Anderson, J. Cheney, et al. Iarpa janus benchmark-c: Face dataset and protocol. In ICB, 2018.
[35] S. Moschoglou, A. Papaioannou, C. Sagonas, J. Deng, I. Kotsia, and S. Zafeiriou. Agedb: The first manually collected in-the-wild age database. In CVPR Workshops, 2017.
[36] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis. Ssh: Single stage headless face detector. In ICCV, 2017.
[37] E. Ohn-Bar and M. M. Trivedi. To boost or not to boost? on the limits of boosted trees for object detection. In ICPR, 2016.
[38] H. Pan, H. Han, S. Shan, and X. Chen. Mean-variance loss for deep age estimation from a face. In CVPR, 2018.
[39] D. Ramanan and X. Zhu. Face detection, pose estimation, and landmark localization in the wild. In CVPR, 2012.
[40] A. Ranjan, T. Bolkart, S. Sanyal, and M. J. Black. Generating 3d faces using convolutional mesh autoencoders. In ECCV, 2018.
[41] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In CVPR, 2016.
[42] J. Redmon and A. Farhadi. Yolo9000: Better, faster, stronger. In CVPR, 2017.
[43] S. Ren, K. He, R. Girshick, and J. Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[44] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic. 300 faces in-the-wild challenge: The first facial landmark localization challenge. In ICCV Workshops, 2013.
[45] F. Schroff, D. Kalenichenko, and J. Philbin. Facenet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[46] S. Sengupta, J.-C. Chen, C. Castillo, V. M. Patel, R. Chellappa, and D. W. Jacobs. Frontal to profile face verification in the wild. In WACV, 2016.
[47] A. Shrivastava, A. Gupta, and R. Girshick. Training region-based object detectors with online hard example mining. In CVPR, 2016.
[48] B. M. Smith, L. Zhang, J. Brandt, Z. Lin, and J. Yang. Exemplar-based face parsing. In CVPR, 2013.
[49] X. Tang, D. K. Du, Z. He, and J. Liu. Pyramidbox: A context-assisted single shot face detector. In ECCV, 2018.
[50] W. Tian, Z. Wang, H. Shen, W. Deng, B. Chen, and X. Zhang. Learning better features for face detection with feature fusion and segmentation supervision. arXiv:1811.08557, 2018.
[51] L. Tran and X. Liu. Nonlinear 3d face morphable model. In CVPR, 2018.
[52] L. Tran and X. Liu. On learning 3d face morphable model from in-the-wild images. arXiv:1808.09560, 2018.
[53] P. Viola and M. J. Jones. Robust real-time face detection. IJCV, 2004.
[54] H. Wang, Z. Li, X. Ji, and Y. Wang. Face r-cnn. arXiv:1706.01061, 2017.
[55] H. Wang, Y. Wang, Z. Zhou, X. Ji, D. Gong, J. Zhou, Z. Li, and W. Liu. Cosface: Large margin cosine loss for deep face recognition. In CVPR, 2018.
[56] J. Wang, Y. Yuan, and G. Yu. Face attention network: An effective face detector for the occluded faces. arXiv:1711.07246, 2017.
[57] Y. Wang, X. Ji, Z. Zhou, H. Wang, and Z. Li. Detecting faces using region-based fully convolutional networks. arXiv:1709.05256, 2017.
[58] B. Yang, J. Yan, Z. Lei, and S. Z. Li. Aggregate channel features for multi-view face detection. In ICB, 2014.
[59] S. Yang, P. Luo, C.-C. Loy, and X. Tang. From facial parts responses to face detection: A deep learning approach. In ICCV, 2015.
[60] S. Yang, P. Luo, C.-C. Loy, and X. Tang. Wider face: A face detection benchmark. In CVPR, 2016.
[61] S. Yang, Y. Xiong, C. C. Loy, and X. Tang. Face detection through scale-friendly deep convolutional networks. arXiv:1706.02863, 2017.
[62] S. Zafeiriou, C. Zhang, and Z. Zhang. A survey on face detection in the wild: Past, present and future. CVIU, 2015.
[63] C. Zhang, X. Xu, and D. Tu. Face detection using improved faster rcnn. arXiv:1802.02142, 2018.
[64] F. Zhang, T. Zhang, Q. Mao, and C. Xu. Joint pose and expression modeling for facial expression recognition. In CVPR, 2018.
[65] J. Zhang, X. Wu, J. Zhu, and S. C. Hoi. Feature agglomeration networks for single stage face detection. arXiv:1712.00721, 2017.
[66] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded convolutional networks. SPL, 2016.
[67] S. Zhang, R. Zhu, X. Wang, H. Shi, T. Fu, S. Wang, and T. Mei. Improved selective refinement network for face detection. arXiv:1901.06651, 2019.
[68] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3fd: Single shot scale-invariant face detector. In ICCV, 2017.
[69] Y. Zhang, X. X. Xu, and X. Liu. Robust and high performance face detector. arXiv:1901.02350, 2019.
[70] Y. Zhou, J. Deng, I. Kotsia, and S. Zafeiriou. Dense 3d face decoding over 2500fps: Joint texture and shape convolutional mesh decoders. arXiv, 2019.
[71] C. Zhu, R. Tao, K. Luu, and M. Savvides. Seeing small faces from robust anchor's perspective. In CVPR, 2018.
[72] C. Zhu, Y. Zheng, K. Luu, and M. Savvides. Cms-rcnn: Contextual multi-scale region-based cnn for unconstrained face detection. In Deep Learning for Biometrics, 2017.
[73] S. Zhu, C. Li, C.-C. Loy, and X. Tang. Unconstrained face alignment via cascaded compositional learning. In CVPR, 2016.
[74] X. Zhu, H. Hu, S. Lin, and J. Dai. Deformable convnets v2: More deformable, better results. arXiv:1811.11168, 2018.
[75] X. Zhu, Z. Lei, X. Liu, H. Shi, and S. Z. Li. Face alignment across large poses: A 3d solution. In CVPR, 2016.
[76] C. L. Zitnick and P. Dollár. Edge boxes: Locating object proposals from edges. In ECCV, 2014.
2 Related Works
2.1 Cascaded CNN methods
Cascaded convolutional neural network (CNN) methods [29, 30, 31] use a cascaded CNN framework to learn features in order to improve performance while maintaining efficiency. However, cascaded CNN based detectors have some problems: 1) their runtime is coupled to the number of faces in the input image, so the speed degrades dramatically when the number of faces increases; 2) because these methods optimize each module separately, the training process becomes extremely complicated.

2.2 Anchor methods


Inspired by generic object detection methods [16, 17, 25, 26, 35, 36, 37, 38], which embraced all the recent advances in deep learning, face detection has recently achieved remarkable progress [7, 8, 9, 10]. Different from generic object detection, the aspect ratio of faces usually ranges from 1:1 to 1:1.5. The latest methods [9, 11] focus on a single-stage design, which densely samples face locations and scales on feature pyramids, demonstrating promising performance and yielding faster speed compared to two-stage methods [18, 19].

2.3 Anchor free methods


In our view, cascaded CNN methods are also a kind of anchor-free method. However, they use a sliding window to detect faces and rely on image pyramids, which brings shortcomings such as slow speed and a complicated training process. LFFD [12] regards receptive fields (RFs) as natural anchors that can cover continuous face scales, which is just another way to define anchors, but its training takes about 5 days with two NVIDIA GTX1080TI GPUs. Our CenterFace simply represents a face by a single point at the center of its bounding box; the facial box size and landmarks are then regressed directly from image features at the center location. Face detection is thus transformed into a standard keypoint estimation problem, and the training time with one NVIDIA GTX2080TI is only one day.

2.4 Multitask Learning


Multitask learning uses multiple supervisory labels to improve the accuracy of each task by utilizing the correlation between tasks. Joint face detection and alignment [27, 29] is widely used because the alignment task, running in parallel with the backbone, provides better features for the face classification task through facial point information. Similarly, Mask R-CNN [5] significantly improves the detection performance by adding a branch for predicting an object mask.

3 CenterFace
3.1 Mobile Feature Pyramid Network
We adopted MobileNetV2 [32] as the backbone and Feature Pyramid Network (FPN) [25] as the neck for the subsequent detection. In general, FPN uses a top-down architecture with lateral connections to build a feature pyramid from a single-scale input. CenterFace represents the face through the center point of the face box, and the face size and facial landmarks are then regressed directly from image features at the center location. Therefore only one layer in the pyramid is used for face detection and alignment. We construct a pyramid with levels {P_L}, L = 3, 4, 5, where L indicates the pyramid level. P_L has 1/2^L the resolution of the input. All pyramid levels have C = 24 channels.

3.2 Face as Point


Let [x1, y1, x2, y2] be the bounding box of a face. The facial center point lies at c = [(x1 + x2)/2, (y1 + y2)/2]. Let I ∈ R^{W×H×3} be an input image of width W and height H. Our aim is to produce the heatmap Y ∈ [0, 1]^{W/R × H/R}, where R is the output stride. We use the default output stride of R = 4 from the literature [23]. A prediction Ŷ_{x,y} = 1 corresponds to a face center, while Ŷ_{x,y} = 0 is background.
The face classification branch is trained following Law and Deng [23]. For each ground-truth face center, we generate an equivalent heatmap by splatting the ground truth with a Gaussian kernel. The training loss is a variant of the focal loss [26]:

- (1 − Yˆxyc )α log(Yˆxyc ) if Yxyc = 1


Lc =  (1)
- (1 − Yxyc ) (Yˆxyc ) log(1 − Yˆxyc ) otherwise
β α

where α and β are hyper-parameters of the focal loss, which are designated as α = 2 and β = 4 in all our experiments.
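A minimal sketch of this heatmap loss (our illustration, not the authors' code; it assumes a pre-computed Gaussian ground-truth heatmap Y and a predicted heatmap Ŷ, and normalizes by the number of face centers, a common convention not written in Eq. (1)):

    import numpy as np

    def heatmap_focal_loss(y_hat, y, alpha=2.0, beta=4.0, eps=1e-6):
        """Variant focal loss of Eq. (1) over a ground-truth Gaussian heatmap.

        y_hat : predicted heatmap, values in (0, 1).
        y     : ground-truth heatmap; exactly 1 at face centers, Gaussian-decayed elsewhere.
        """
        y_hat = np.clip(y_hat, eps, 1.0 - eps)
        pos = (y == 1)
        pos_loss = ((1 - y_hat) ** alpha) * np.log(y_hat) * pos
        neg_loss = ((1 - y) ** beta) * (y_hat ** alpha) * np.log(1 - y_hat) * (~pos)
        num_pos = max(pos.sum(), 1)    # assumed normalization by the number of face centers
        return -(pos_loss.sum() + neg_loss.sum()) / num_pos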
To gather global information and to reduce memory usage, downsampling is applied to the image convolutionally, so the output is usually smaller than the image. Hence, a location (x, y) in the image is mapped to the location (x/n, y/n) in the heatmaps, where n is the downsampling factor. When we remap locations from the heatmaps to the input image, some pixels may be misaligned, which can greatly affect the accuracy of the facial boxes. To address this issue, we predict position offsets to adjust the center position slightly before remapping it to the input resolution:

    o_k = ( x_k/n - ⌊x_k/n⌋ ,  y_k/n - ⌊y_k/n⌋ ),        (2)

where o_k is the offset, and x_k and y_k are the x and y coordinates of face center k. We apply the L1 loss [5] at the ground-truth center position:
    L_off = (1/N) Σ_{k=1}^{N} SmoothL1_Loss(o_k, ô_k).        (3)
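The offset targets of Eq. (2) and the loss of Eq. (3) can be computed as in this short sketch (hypothetical names, assuming an output stride n = 4):

    import numpy as np

    def offset_targets(centers, n=4):
        """Eq. (2): sub-pixel offsets lost when mapping face centers to the stride-n heatmap."""
        centers = np.asarray(centers, dtype=float)     # (N, 2) face centers (x, y) in pixels
        return centers / n - np.floor(centers / n)     # (N, 2) offsets o_k in [0, 1)

    def smooth_l1(x):
        ax = np.abs(x)
        return np.where(ax < 1.0, 0.5 * ax ** 2, ax - 0.5)

    def offset_loss(pred_offsets, centers, n=4):
        """Eq. (3): mean smooth-L1 between predicted and ground-truth offsets."""
        o = offset_targets(centers, n)
        return smooth_l1(pred_offsets - o).sum(axis=1).mean()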

3.3 Box and Landmark Prediction


To reduce the computational burden, we use a single size prediction S ∈ R^{W/4 × H/4} for the facial box and landmarks. Each ground-truth bounding box is specified as G = (x1, y1, x2, y2). Our goal is to learn the size targets (ĥ, ŵ) that the network outputs at the center position (x, y) in the feature maps:

    ĥ = log( x2/R - x1/R ),
    ŵ = log( y2/R - y1/R ).        (4)
Different from the box regression, the regression of the five facial landmarks adopts the target normalization method based on the center position:

    lm̂_x = lm_x / box_w - c_x / box_w,
    lm̂_y = lm_y / box_h - c_y / box_h.        (5)
We also use the smooth L1 loss for the facial box and landmark predictions at the center location. The overall training objective is

    L = L_c + λ_off L_off + λ_box L_box + λ_lm L_lm,        (6)

where λ_off, λ_box and λ_lm are used to scale the losses. We use 1, 0.1 and 0.1, respectively, in all our experiments.
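Putting Eqs. (2)-(5) together, decoding one face from the network outputs at a heatmap peak (cx, cy) looks roughly like the following sketch (hypothetical variable names, not the released CenterFace code):

    import numpy as np

    def decode_face(cx, cy, offset, size, lm, R=4):
        """Recover a face box and five landmarks from per-pixel outputs at peak (cx, cy).

        offset : (2,) predicted offset (Eq. 2), added back before rescaling by R.
        size   : (2,) predicted log-sizes (h_hat, w_hat) from Eq. (4).
        lm     : (5, 2) predicted normalized landmarks (Eq. 5), relative to the box center.
        """
        # Refined center in input-image coordinates.
        x_c = (cx + offset[0]) * R
        y_c = (cy + offset[1]) * R
        # Invert Eq. (4): h_hat = log((x2 - x1)/R), w_hat = log((y2 - y1)/R).
        box_w = np.exp(size[0]) * R
        box_h = np.exp(size[1]) * R
        box = (x_c - box_w / 2, y_c - box_h / 2, x_c + box_w / 2, y_c + box_h / 2)
        # Invert Eq. (5): lm_x = lm_x_hat * box_w + x_c, lm_y = lm_y_hat * box_h + y_c.
        landmarks = np.stack([lm[:, 0] * box_w + x_c, lm[:, 1] * box_h + y_c], axis=1)
        return box, landmarks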

3.4. Training Details


Dataset. The proposed method is trained on the training set of the WIDER FACE benchmark, which includes 12,880 images with more than 150,000 valid faces exhibiting a high degree of variability in scale, pose, expression, occlusion and illumination. RetinaFace [11] introduces five levels of face image quality and annotates five landmarks on faces.
Data augmentation. Data augmentation is important for improving generalization. We use random flip, random scaling [33], color jittering, and random cropping of square patches from the original images, which are resized to 800 × 800 to generate larger training faces. Faces smaller than 8 pixels are discarded directly.
Training parameters. We train CenterFace using the Adam optimizer with a batch size of 8 and a learning rate of 5e-4 for 140 epochs, with the learning rate dropped 10× at 90 and 120 epochs, respectively. The down-sampling layers of MobileNetV2 are initialized with ImageNet pre-trained weights and the up-sampling layers are randomly initialized. The training time is about one day with one NVIDIA GTX2080TI.

4 Experiments
In this section, we firstly introduce the runtime efficiency of CenterFace, then evaluate it on the common face
detection benchmarks.

4.1 Running Efficiency


The existing CNN face detectors can be accelerated by GPUs, but they are not fast enough in most practical applications, especially CPU based applications. As described below, our CenterFace is efficient enough to meet practical requirements and its model size is only 7.2MB. As shown in Table 1, compared with other detectors, our method exceeds real-time running speed (> 100 FPS) at different resolutions using a single NVIDIA GTX2080TI.
Since DSFD, PyramidBox, S3FD and SSH are too slow when running on CPU platforms, we only evaluate the proposed CenterFace, FaceBoxes, MTCNN and CasCNN on VGA-resolution images on the CPU; here the mAP denotes the true positive rate at 1,000 false positives on FDDB. As listed in Table 2, our CenterFace can run at 30 FPS on the CPU with state-of-the-art accuracy.
Table 1. Running efficiency on GTX2080TI
Approach 640*480 1280*720 1920*1080
DSFD 78.08ms 187.78ms 393.82ms
PyramidBox 50.51ms 142.34ms 331.93ms
S3FD 21.75ms 55.73ms 119.53ms
LFFD 7.60ms 16.37ms 31.41ms
CenterFace 5.51ms 6.47ms 8.79 ms

Table 2. Running efficiency on CPU


Approach CPU-model mAP(%) FPS
CasCNN E5-2620@2.00 85.7 14
MTCNN N/A@2.60 94.4 16
Faceboxes3.2 E5-2660v3@2.60 96.0 20
CenterFace I7-6700@2.6 98.0 30

4.2 Evaluation on Benchmarks


FDDB dataset. FDDB contains 2,845 images with 5,171 unconstrained faces collected from the Yahoo news website. We evaluate our face detector on FDDB against the other state-of-the-art methods and show the results in Table 3 and Fig. 2, respectively. We also include the DSFD, PyramidBox and S3FD detectors, although these detectors are much slower due to their larger backbones and denser anchors. Our CenterFace achieves good performance on both the discontinuous and continuous ROC curves, i.e. 98.0% and 72.9% when the number of false positives equals 1,000, and it evidently outperforms LFFD, FaceBoxes and MTCNN.
Table 3. Evaluation results on FDDB
Method Disc ROC curves score Cont ROC curves score
DSFD 0.984 0.754
PyramidBox 0.982 0.757
S3FD 0.981 0.754
MTCNN 0.944 0.708
Faceboxes 0.960 0.729
LFFD 0.973 0.724
CenterFace 0.980 0.732

Figure 2. Evaluation on the FDDB dataset: (a) discontinuous ROC curves; (b) continuous ROC curves.
WIDER FACE dataset. WIDER FACE is currently the most widely used benchmark for face detection. The dataset is split into training (40%), validation (10%) and testing (50%) subsets by randomly sampling from 61 scene categories. All the compared methods are trained on the training set. For testing on WIDER FACE, we follow the standard practice of [11] and employ flip as well as multi-scale testing strategies; box voting [13] is then applied on the union set of predicted face boxes using an IoU threshold of 0.4. We report the results on the validation and testing sets in Tables 4 and 5, respectively. The proposed CenterFace achieves 0.935 (Easy), 0.924 (Medium) and 0.875 (Hard) on the validation set, and 0.932 (Easy), 0.921 (Medium) and 0.873 (Hard) on the testing set. Although there is still a gap to the state-of-the-art methods, CenterFace consistently outperforms SSH (which uses VGG16 as its backbone), LFFD, FaceBoxes and MTCNN. In addition, CenterFace surpasses S3FD, which also relies on a VGG16 backbone and dense anchors, on the Hard subsets.
Furthermore, when testing on WIDER FACE with only the original image and a single inference (without flip or multi-scale strategies), CenterFace still produces good average precision (AP) on all subsets of the validation set, i.e. 92.2% (Easy), 91.1% (Medium) and 78.2% (Hard).
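For clarity, the following NumPy sketch illustrates the box-voting step applied to the union of flip/multi-scale predictions; the (x1, y1, x2, y2, score) layout and the preceding NMS step are assumptions, not the exact test-time code.

```python
# A minimal NumPy sketch of box voting: each box kept by NMS is replaced by the
# score-weighted average of all predicted boxes overlapping it with IoU >= 0.4.
import numpy as np

def iou_one_to_many(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2) format."""
    ix1 = np.maximum(box[0], boxes[:, 0])
    iy1 = np.maximum(box[1], boxes[:, 1])
    ix2 = np.minimum(box[2], boxes[:, 2])
    iy2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(ix2 - ix1, 0, None) * np.clip(iy2 - iy1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def box_voting(keep, dets, iou_thresh=0.4):
    """dets: (N, 5) boxes from every scale/flip; keep: indices surviving NMS."""
    voted = dets[keep].copy()
    for i, k in enumerate(keep):
        support = iou_one_to_many(dets[k, :4], dets[:, :4]) >= iou_thresh
        weights = dets[support, 4:5]                 # detection scores act as votes
        voted[i, :4] = (dets[support, :4] * weights).sum(axis=0) / weights.sum()
    return voted
```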
Table 4. Performance results on the validation set of WIDER FACE.
Method Easy Medium Hard
RetinaFace 0.969 0.961 0.918
DSFD 0.966 0.957 0.904
PyramidBox 0.961 0.950 0.889
S3FD 0.937 0.924 0.852
SSH 0.931 0.921 0.845
MTCNN 0.848 0.825 0.598
FaceBoxes 0.840 0.766 0.395
LFFD 0.910 0.881 0.780
CenterFace 0.935 0.924 0.875

Table 5. Performance results on the testing set of WIDER FACE.


Method Easy Medium Hard
RetinaFace 0.963 0.956 0.914
DSFD 0.960 0.953 0.900
PyramidBox 0.956 0.946 0.887
S3FD 0.928 0.913 0.840
SSH 0.927 0.915 0.844
MTCNN 0.851 0.820 0.607
FaceBoxes 0.839 0.763 0.396
LFFD 0.896 0.865 0.770
CenterFace 0.932 0.921 0.873

5 Conclusion
This paper introduces CenterFace, a detector that performs well in both speed and accuracy and simultaneously predicts the face box and landmark locations. Our method overcomes the drawbacks of previous anchor-based methods by translating face detection and alignment into a standard keypoint estimation problem: CenterFace represents a face through the center point of its box, and the face size and facial landmarks are then regressed directly from the image features at that center location. Comprehensive and extensive experiments are conducted to fully analyze the proposed method. The results demonstrate that it achieves real-time speed and high accuracy with a small model size, making it an ideal alternative for most face detection and alignment applications.
Acknowledgments. This work was supported in part by the National Key R&D Program of China (2018YFC0809200) and the Natural Science Foundation of Shanghai (16ZR1416500).

Figure 3. Face Detection Results on WIDER FACE.


References
[1] H. Law and J. Deng. Cornernet: Detecting objects as paired keypoints. In ECCV, 2018.
[2] X. Zhou, J. Zhuo, and P. Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In
CVPR, 2019.
[3] C. Zhu, Y. He, and M. Savvides. Feature selective anchor-free module for single-shot object detection. arXiv
preprint arXiv:1903.00621, 2019.
[4] Z. Cao, G. Hidalgo, T. Simon, S.-E. Wei, and Y. Sheikh. OpenPose: Realtime multi-person 2D pose estimation using Part Affinity Fields. arXiv preprint arXiv:1812.08008, 2018.
[5] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-CNN. In ICCV, 2017.
[6] J. Redmon and A. Farhadi. YOLOv3: An incremental improvement. arXiv preprint arXiv:1804.02767, 2018.
[7] P. Hu and D. Ramanan. Finding tiny faces. In CVPR, 2017.
[8] M. Najibi, P. Samangouei, R. Chellappa, and L. S. Davis. Ssh: Single stage headless face detector. In ICCV,
2017.
[9] X. Tang, D. K. Du, Z. He, and J. Liu. Pyramidbox: A context-assisted single shot face detector. In ECCV, 2018.
[10] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. S3fd: Single shot scale-invariant face detector. In ICCV,
2017.
[11] J. Deng, J. Guo, Y. Zhou, J. Yu, I. Kotsia, and S. Zafeiriou. RetinaFace: Single-stage dense face localisation in the wild. arXiv preprint arXiv:1905.00641, 2019.
[12] Y. He, D. Xu, L. Wu, M. Jian, S. Xiang, and C. Pan. LFFD: A light and fast face detector for edge devices. arXiv preprint arXiv:1904.10633, 2019.
[13] S. Zhang, X. Zhu, Z. Lei, H. Shi, X. Wang, and S. Z. Li. Faceboxes: A cpu real-time face detector with high
accuracy. In Proceedings of IEEE International Joint Conference on Biometrics, pages 1–9, 2017.
[14] R. Girshick. Fast R-CNN. In ICCV, 2015.
[15] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[16] J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time object detection. In
CVPR, 2016.
[17] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox
detector. In ECCV, 2016.
[18] C. Chi, S. Zhang, J. Xing, Z. Lei, S. Z. Li, and X. Zou. Selective refinement network for high performance face detection. In AAAI, 2019.
[19] C. Zhang, X. Xu, and D. Tu. Face detection using improved faster rcnn. arXiv:1802.02142, 2018.
[20] S. Yang, P. Luo, C.-C. Loy, and X. Tang. Wider face: A face detection benchmark. In CVPR, 2016.
[21] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition.
arXiv:1409.1556, 2014.
[22] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
[23] X. Zhou, D. Wang, and P. Krahenbuhl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[24] V. Jain and E. G. Learned-Miller. Fddb: A benchmark for face detection in unconstrained settings. UMass
Amherst Technical Report, 2010.
[25] T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object
detection. In CVPR, 2017.
[26] T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar. Focal loss for dense object detection. In ICCV, 2017.
[27] A. Newell, Z. Huang, and J. Deng. Associative embedding: End-to-end learning for joint detection and grouping.
In NIPS, 2017.
[28] X. Zhou, A. Karpur, L. Luo, and Q. Huang. Starmap for category-agnostic keypoint and viewpoint estimation.
In ECCV, 2018.
[29] K. Zhang, Z. Zhang, Z. Li, and Y. Qiao. Joint face detection and alignment using multitask cascaded
convolutional networks. SPL, 2016
[30] D. Chen, S. Ren, Y. Wei, X. Cao, and J. Sun. Joint cascade face detection and alignment. In ECCV, 2014.
[31] H. Li, Z. Lin, X. Shen, J. Brandt, and G. Hua. A convolutional neural network cascade for face detection. In
CVPR, 2015.
[32] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov and L. Chen. MobileNetV2: Inverted Residuals and Linear
Bottlenecks. In CVPR, 2018.
[33] B. Singh, M. Najibi, and L. S. Davis. SNIPER: Efficient multi-scale training. NIPS, 2018.
[34] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. Dssd: Deconvolutional single shot detector. arXiv preprint
arXiv:1701.06659, 2017.
[35] J. Huang, V. Rathod, C. Sun, M. Zhu, A. Korattikara, A. Fathi, I. Fischer, Z. Wojna, Y. Song, S. Guadarrama, et
al. Speed/accuracy trade-offs for modern convolutional object detectors. In CVPR, 2017.
[36] L. Liu, W. Ouyang, X. Wang, P. Fieguth, J. Chen, X. Liu, and M. Pietikäinen. Deep learning for generic object detection: A survey. arXiv preprint arXiv:1809.02165, 2018.
[37] B. Jiang, R. Luo, J. Mao, T. Xiao, and Y. Jiang. Acquisition of localization confidence for accurate object
detection. In ECCV, 2018.
[38] T. Kong, F. Sun, H. Liu, Y. Jiang, and J. Shi. Consistent optimization for single-shot object detection. arXiv
preprint arXiv:1901.06563, 2019.
