RetinaFace: Single-Stage Dense Face Localisation in the Wild
… face shapes provide better features for face classification. Inspired by [6], MTCNN [66] and STN [5] simultaneously detected faces and five facial landmarks. Due to training data limitations, JDA [6], MTCNN [66] and STN [5] have not verified whether tiny face detection can benefit from the extra supervision of five facial landmarks. One of the questions we aim at answering in this paper is whether we can push forward the current best performance (90.3% [67]) on the WIDER FACE hard test set [60] by using an extra supervision signal built of five facial landmarks.

In Mask R-CNN [20], the detection performance is significantly improved by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition and regression. This confirms that dense pixel-wise annotations are also beneficial for improving detection. Unfortunately, for the challenging faces of WIDER FACE it is not possible to conduct dense face annotation (either in the form of more landmarks or semantic segments). Since supervised signals cannot be easily obtained, the question is whether we can apply unsupervised methods to further improve face detection.

In FAN [56], an anchor-level attention map is proposed to improve occluded face detection. Nevertheless, the proposed attention map is quite coarse and does not contain semantic information. Recently, self-supervised 3D morphable models [14, 51, 52, 70] have achieved promising 3D face modelling in-the-wild. In particular, Mesh Decoder [70] achieves faster-than-real-time speed by exploiting graph convolutions [10, 40] on joint shape and texture. However, the main challenges of applying the mesh decoder [70] in a single-stage detector are: (1) camera parameters are hard to estimate accurately, and (2) the joint latent shape and texture representation is predicted from a single feature vector (1×1 convolution on the feature pyramid) instead of an RoI-pooled feature, which indicates a risk of feature shift. In this paper, we employ a mesh decoder [70] branch through self-supervised learning to predict a pixel-wise 3D face shape in parallel with the existing supervised branches.

To summarise, our key contributions are:

• Based on a single-stage design, we propose a novel pixel-wise face localisation method named RetinaFace, which employs a multi-task learning strategy to simultaneously predict the face score, face box, five facial landmarks, and the 3D position and correspondence of each facial pixel.
• On the WIDER FACE hard subset, RetinaFace outperforms the AP of the state-of-the-art two-stage method (ISRN [67]) by 1.1% (AP equal to 91.4%).
• On the IJB-C dataset, RetinaFace helps to improve ArcFace's [11] verification accuracy (with TAR equal to 89.59% when FAR=1e-6). This indicates that better face localisation can significantly improve face recognition.
• By employing light-weight backbone networks, RetinaFace can run in real time on a single CPU core for a VGA-resolution image.
• Extra annotations and code have been released to facilitate future research.

2. Related Work

Image pyramid v.s. feature pyramid: The sliding-window paradigm, in which a classifier is applied on a dense image grid, can be traced back to past decades. The milestone work of Viola-Jones [53] explored a cascade chain to reject false face regions from an image pyramid with real-time efficiency, leading to the widespread adoption of this scale-invariant face detection framework [66, 5]. Even though the sliding window on an image pyramid was the leading detection paradigm [19, 32], with the emergence of the feature pyramid [28], sliding anchors [43] on multi-scale feature maps [68, 49] quickly dominated face detection.

Two-stage v.s. single-stage: Current face detection methods have inherited some achievements from generic object detection approaches and can be divided into two categories: two-stage methods (e.g. Faster R-CNN [43, 63, 72]) and single-stage methods (e.g. SSD [30, 68] and RetinaNet [29, 49]). Two-stage methods employ a “proposal and refinement” mechanism featuring high localisation accuracy. By contrast, single-stage methods densely sample face locations and scales, which results in extremely unbalanced positive and negative samples during training. To handle this imbalance, sampling [47] and re-weighting [29] methods are widely adopted. Compared to two-stage methods, single-stage methods are more efficient and have a higher recall rate, but at the risk of a higher false positive rate and compromised localisation accuracy.

Context Modelling: To enhance the model's contextual reasoning power for capturing tiny faces [23], SSH [36] and PyramidBox [49] applied context modules on feature pyramids to enlarge the receptive field from Euclidean grids. To enhance the non-rigid transformation modelling capacity of CNNs, deformable convolution networks (DCN) [9, 74] employ a novel deformable layer to model geometric transformations. The champion solution of the WIDER Face Challenge 2018 [33] indicates that rigid (expansion) and non-rigid (deformation) context modelling are complementary and orthogonal ways to improve the performance of face detection.

Multi-task Learning: Joint face detection and alignment is widely used [6, 66, 5], as aligned face shapes provide better features for face classification. In Mask R-CNN [20], the detection performance was significantly improved by adding a branch for predicting an object mask in parallel with the existing branches. DensePose [1] adopted the architecture of Mask R-CNN to obtain dense part labels and coordinates within each of the selected regions. Nevertheless, the dense regression branch in [20, 1] was trained by supervised learning. In addition, the dense branch was a small FCN applied to each RoI to predict a pixel-to-pixel dense mapping.
3. RetinaFace

3.1. Multi-task Loss

For any training anchor i, we minimise the following multi-task loss:

L = L_cls(p_i, p_i^*) + λ_1 p_i^* L_box(t_i, t_i^*) + λ_2 p_i^* L_pts(l_i, l_i^*) + λ_3 p_i^* L_pixel.   (1)

(1) Face classification loss L_cls(p_i, p_i^*), where p_i is the predicted probability of anchor i being a face and p_i^* is 1 for the positive anchor and 0 for the negative anchor. The classification loss L_cls is the softmax loss for binary classes (face/not face). (2) Face box regression loss L_box(t_i, t_i^*), where t_i = {t_x, t_y, t_w, t_h}_i and t_i^* = {t_x^*, t_y^*, t_w^*, t_h^*}_i represent the coordinates of the predicted box and of the ground-truth box associated with the positive anchor. We follow [16] to normalise the box regression targets (i.e. centre location, width and height) and use L_box(t_i, t_i^*) = R(t_i − t_i^*), where R is the robust loss function (smooth-L1) defined in [16]. (3) Facial landmark regression loss L_pts(l_i, l_i^*), where l_i = {l_x1, l_y1, . . . , l_x5, l_y5}_i and l_i^* = {l_x1^*, l_y1^*, . . . , l_x5^*, l_y5^*}_i represent the predicted five facial landmarks and the ground truth associated with the positive anchor. Similar to the box centre regression, the five facial landmark regression also employs target normalisation based on the anchor centre. (4) Dense regression loss L_pixel (refer to Eq. 3). The loss-balancing parameters λ_1, λ_2 and λ_3 are set to 0.25, 0.1 and 0.01, which means that we increase the significance of better box and landmark locations from the supervision signals.
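To make the composition of Eq. 1 concrete, here is a minimal PyTorch-style sketch of the per-batch loss, assuming hypothetical tensor names for predictions and targets and assuming the per-anchor dense term of Eq. 3 has already been computed; it illustrates the weighting scheme and is not the released RetinaFace implementation.

```python
import torch
import torch.nn.functional as F

# Loss-balancing parameters from Eq. 1.
LAMBDA_BOX, LAMBDA_PTS, LAMBDA_PIXEL = 0.25, 0.1, 0.01

def multi_task_loss(cls_logits, box_pred, pts_pred, pixel_loss,
                    cls_target, box_target, pts_target):
    """Sketch of Eq. 1 for one batch of anchors.

    cls_logits:  (N, 2) face / non-face scores for all anchors
    box_pred:    (N, 4) normalised box targets (t_x, t_y, t_w, t_h)
    pts_pred:    (N, 10) five normalised landmark offsets
    pixel_loss:  (N,) per-anchor dense regression loss (Eq. 3)
    cls_target:  (N,) long tensor, 1 for positive and 0 for negative anchors
    """
    # Softmax (cross-entropy) loss over face / non-face for every anchor.
    l_cls = F.cross_entropy(cls_logits, cls_target)

    # Box, landmark and dense terms only contribute on positive anchors
    # (the p_i^* factor in Eq. 1).
    pos = cls_target == 1
    l_box = F.smooth_l1_loss(box_pred[pos], box_target[pos])
    l_pts = F.smooth_l1_loss(pts_pred[pos], pts_target[pos])
    l_pix = pixel_loss[pos].mean()

    return l_cls + LAMBDA_BOX * l_box + LAMBDA_PTS * l_pts + LAMBDA_PIXEL * l_pix
```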
3.2. Dense Regression Branch

Mesh Decoder. We directly employ the mesh decoder (mesh convolution and mesh up-sampling) from [70, 40], which is a graph convolution method based on fast localised spectral filtering [10]. In order to achieve further acceleration, we also use a joint shape and texture decoder similarly to the method in [70], contrary to [40] which only decoded shape.

Below we will briefly explain the concept of graph convolutions and outline why they can be used for fast decoding. As illustrated in Fig. 3(a), a 2D convolutional operation is a “kernel-weighted neighbour sum” within the Euclidean grid receptive field. Similarly, graph convolution employs the same concept, as shown in Fig. 3(b). However, the neighbour distance is calculated on the graph by counting the minimum number of edges connecting two vertices. We follow [70] to define a coloured face mesh G = (V, E), where V ∈ R^{n×6} is a set of face vertices containing the joint shape and texture information, and E ∈ {0, 1}^{n×n} is a sparse adjacency matrix encoding the connection status between vertices. The graph Laplacian is defined as L = D − E ∈ R^{n×n}, where D ∈ R^{n×n} is a diagonal matrix with D_ii = Σ_j E_ij.

Following [10, 40, 70], the graph convolution with kernel g_θ can be formulated as a recursive Chebyshev polynomial truncated at order K,

y = g_θ(L)x = Σ_{k=0}^{K−1} θ_k T_k(L̃) x,   (2)

where θ ∈ R^K is a vector of Chebyshev coefficients and T_k(L̃) ∈ R^{n×n} is the Chebyshev polynomial of order k evaluated at the scaled Laplacian L̃. Denoting x̄_k = T_k(L̃) x ∈ R^n, we can recurrently compute x̄_k = 2 L̃ x̄_{k−1} − x̄_{k−2} with x̄_0 = x and x̄_1 = L̃ x. The whole filtering operation is extremely efficient, involving K sparse matrix-vector multiplications and one dense matrix-vector multiplication, y = g_θ(L)x = [x̄_0, . . . , x̄_{K−1}] θ.
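The recurrence in Eq. 2 is straightforward to implement with sparse matrices. Below is a minimal NumPy/SciPy sketch; the rescaling L̃ = 2L/λ_max − I used in scaled_laplacian is the common choice from [10] and is an assumption here, since the text above does not define the scaled Laplacian explicitly.

```python
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import eigsh

def scaled_laplacian(E: sp.spmatrix) -> sp.spmatrix:
    """Graph Laplacian L = D − E with D_ii = Σ_j E_ij, rescaled to L̃ = 2L/λ_max − I (assumed)."""
    d = np.asarray(E.sum(axis=1)).ravel()
    L = sp.diags(d) - E
    lam_max = eigsh(L.asfptype(), k=1, return_eigenvectors=False)[0]
    return (2.0 / lam_max) * L - sp.eye(E.shape[0])

def chebyshev_graph_filter(L_tilde: sp.spmatrix, x: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """y = g_θ(L)x via the truncated Chebyshev recurrence of Eq. 2.

    L_tilde: (n, n) sparse scaled Laplacian
    x:       (n,) or (n, f) signal on the mesh vertices (e.g. joint shape and texture)
    theta:   (K,) Chebyshev coefficients
    """
    K = len(theta)
    x_bars = [x]                        # x̄_0 = x
    if K > 1:
        x_bars.append(L_tilde @ x)      # x̄_1 = L̃ x
    for _ in range(2, K):
        # x̄_k = 2 L̃ x̄_{k−1} − x̄_{k−2}: one sparse matrix-vector product per order.
        x_bars.append(2 * (L_tilde @ x_bars[-1]) - x_bars[-2])
    # One dense combination: y = [x̄_0, ..., x̄_{K−1}] θ.
    return sum(t * xb for t, xb in zip(theta, x_bars))
```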
Differentiable Renderer. After we predict the shape and texture parameters P_ST ∈ R^128, we employ an efficient differentiable 3D mesh renderer [14] to project a coloured mesh D_{P_ST} onto a 2D image plane with camera parameters P_cam = [x_c, y_c, z_c, x'_c, y'_c, z'_c, f_c] (i.e. camera location, camera pose and focal length) and illumination parameters P_ill = [x_l, y_l, z_l, r_l, g_l, b_l, r_a, g_a, b_a] (i.e. location of the point light source, colour values and colour of the ambient lighting).

Dense Regression Loss. Once we get the rendered 2D face R(D_{P_ST}, P_cam, P_ill), we compare the pixel-wise difference between the rendered face and the original 2D face using the following function:

L_pixel = (1 / (W · H)) Σ_i^W Σ_j^H ‖ R(D_{P_ST}, P_cam, P_ill)_{i,j} − I*_{i,j} ‖_1,   (3)

where W and H are the width and height of the anchor crop I*_{i,j}, respectively.
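Eq. 3 itself reduces to a mean per-pixel L1 difference over the anchor crop. A minimal NumPy sketch, assuming the rendered face and the original crop are aligned H×W×3 arrays in the same value range:

```python
import numpy as np

def dense_regression_loss(rendered: np.ndarray, crop: np.ndarray) -> float:
    """Pixel-wise dense regression loss of Eq. 3.

    rendered: (H, W, 3) face rendered from the predicted mesh, camera and lighting parameters
    crop:     (H, W, 3) original anchor crop I*
    """
    h, w = crop.shape[:2]
    # L1 norm of the colour difference at each pixel, averaged over the W*H pixels.
    per_pixel_l1 = np.abs(rendered - crop).sum(axis=-1)
    return float(per_pixel_l1.sum() / (w * h))
```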
4. Experiments

4.1. Dataset

The WIDER FACE dataset [60] consists of 32,203 images and 393,703 face bounding boxes with a high degree of variability in scale, pose, expression, occlusion and illumination. The WIDER FACE dataset is split into training (40%), validation (10%) and testing (50%) subsets by randomly sampling from 61 scene categories. Based on the detection rate of EdgeBox [76], three levels of difficulty (i.e. Easy, Medium and Hard) are defined by incrementally incorporating hard samples.

Extra Annotations. As illustrated in Fig. 4 and Tab. 1, we define five levels of face image quality (according to how …
Figure 2. An overview of the proposed single-stage dense face localisation approach. RetinaFace is designed based on the feature pyramids
with independent context modules. Following the context modules, we calculate a multi-task loss for each anchor.
[Figure residue: two rows of precision-recall panels (Recall on the x-axis, Precision on the y-axis) for the WIDER FACE Easy, Medium and Hard settings. Each legend lists the competing detectors (Zhu et al., Face R-FCN, SFD, Face R-CNN, SSH, HR, MSCNN, CMS-RCNN, ScaleFace, Multitask Cascade CNN, LDCF+, Faceness-WIDER, Multiscale Cascade CNN, Two-stage CNN, ACF-WIDER, FAN) together with their AP values.]
Figure 6. RetinaFace can find around 900 faces (threshold at 0.5) out of the reported 1151 people, by taking advantage of the proposed joint extra-supervised and self-supervised multi-task learning. Detector confidence is given by the colour bar on the right. Dense localisation masks are drawn in blue. Please zoom in to check the detailed detection, alignment and dense regression results on tiny faces.
[Figure and table residue: (a) per-landmark localisation error bars comparing MTCNN and RetinaFace and a CED curve over NME normalised by bounding box size (%); (b) CED curves for dense face alignment with per-method NME values (PRNet 3.2699 / 4.4079, 3D-FAN 3.479, Mesh Decoder 3.986 / 5.4144, DeFA 4.3651 / 6.0409); (c) a table-caption fragment noting that “+F” refers to the flip test during feature embedding, “+S” denotes the face detection score used to weigh samples within templates, and that TAR at FAR=1e-6 is given at the end of each legend.]
3 CenterFace
3.1 Mobile Feature Pyramid Network
We adopt MobileNetV2 [32] as the backbone and a Feature Pyramid Network (FPN) [25] as the neck for the subsequent detection. In general, FPN uses a top-down architecture with lateral connections to build a feature pyramid from a single-scale input. CenterFace represents the face through the center point of the face box; the face size and facial landmarks are then regressed directly from the image features at the center location, so only one layer of the pyramid is used for face detection and alignment. We construct a pyramid with levels {P_L}, L = 3, 4, 5, where L indicates the pyramid level. P_L has 1/2^L of the resolution of the input. All pyramid levels have C = 24 channels.
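For illustration, a rough sketch of such a 24-channel P3-P5 neck is given below; the layer names, the MobileNetV2-style backbone channel counts and the nearest-neighbour upsampling are assumptions made for the sketch, not details taken from the text above.

```python
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    """Sketch of a 24-channel FPN neck over backbone features C3, C4, C5 (strides 8, 16, 32)."""

    def __init__(self, in_channels=(32, 96, 320), out_channels=24):  # channel counts are illustrative
        super().__init__()
        # 1x1 lateral convolutions project each backbone stage to the pyramid width.
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, 1) for c in in_channels)
        # 3x3 convolutions smooth the merged maps.
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, c3, c4, c5):
        p5 = self.lateral[2](c5)
        # Top-down pathway: upsample coarser levels and add the lateral connections.
        p4 = self.lateral[1](c4) + F.interpolate(p5, scale_factor=2, mode="nearest")
        p3 = self.lateral[0](c3) + F.interpolate(p4, scale_factor=2, mode="nearest")
        return self.smooth[0](p3), self.smooth[1](p4), self.smooth[2](p5)
```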
where α and β are hyper-parameters of the focal loss, which are set to α = 2 and β = 4 in all our experiments.
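The classification-loss equation referred to here is not reproduced in the text above. For reference, the sketch below assumes the penalty-reduced pixel-wise focal loss commonly used for center-point heatmaps (CenterNet-style), which is consistent with the α = 2, β = 4 setting; treat it as an assumption rather than the paper's exact formulation.

```python
import torch

def heatmap_focal_loss(pred, gt, alpha=2, beta=4, eps=1e-6):
    """Penalty-reduced focal loss over a center heatmap (assumed CenterNet-style form).

    pred: (B, C, H, W) predicted heatmap, values in (0, 1)
    gt:   (B, C, H, W) ground-truth heatmap with Gaussian peaks, exactly 1 at face centers
    """
    pos = gt.eq(1).float()
    neg = 1.0 - pos

    pos_loss = torch.log(pred + eps) * (1 - pred) ** alpha * pos
    # Negatives near a center are down-weighted by (1 - gt)^beta.
    neg_loss = torch.log(1 - pred + eps) * pred ** alpha * (1 - gt) ** beta * neg

    num_pos = pos.sum().clamp(min=1)
    return -(pos_loss.sum() + neg_loss.sum()) / num_pos
```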
To gather global information and to reduce memory usage, the image is downsampled convolutionally, so the output is usually smaller than the input image. Hence, a location (x, y) in the image is mapped to the location (x/n, y/n) in the heatmaps, where n is the downsampling factor. When we remap locations from the heatmaps back to the input image, some pixels may be misaligned, which can greatly affect the accuracy of the facial boxes. To address this issue, we predict a position offset to adjust the center position slightly before remapping it to the input resolution:
o_k = (x_k/n − ⌊x_k/n⌋, y_k/n − ⌊y_k/n⌋),   (2)
where o_k is the offset, and x_k and y_k are the x and y coordinates of face center k. We apply the smooth-L1 loss [5] at the ground-truth center positions:

L_off = (1/N) Σ_{k=1}^{N} SmoothL1(o_k, ô_k),   (3)
where λ_off, λ_box and λ_lm are used to scale the corresponding losses; we use 1, 0.1 and 0.1, respectively, in all our experiments.
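A small sketch of how the offset targets of Eq. 2 and the offset loss of Eq. 3 could be computed; the tensor names and the default downsampling factor n = 4 are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def offset_targets(centers, n=4):
    """Eq. 2: sub-pixel offsets lost when mapping face centers to the stride-n heatmap.

    centers: (N, 2) face-center coordinates (x_k, y_k) in input-image pixels
    """
    scaled = centers / n
    return scaled - scaled.floor()      # o_k = (x_k/n - floor(x_k/n), y_k/n - floor(y_k/n))

def offset_loss(pred_offsets, centers, n=4):
    """Eq. 3: smooth-L1 loss between predicted and ground-truth offsets at center positions.

    pred_offsets: (N, 2) offsets predicted at the ground-truth center locations
    """
    target = offset_targets(centers, n)
    return F.smooth_l1_loss(pred_offsets, target, reduction="mean")
```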
4 Experiments
In this section, we first introduce the runtime efficiency of CenterFace and then evaluate it on common face detection benchmarks.
5 Conclusion
This paper introduces CenterFace, which performs well in both speed and accuracy and simultaneously predicts the facial box and landmark locations. Our method overcomes the drawbacks of previous anchor-based methods by translating face detection and alignment into a standard keypoint estimation problem. CenterFace represents the face through the center point of the face box; the face size and facial landmarks are then regressed directly from the image features at the center location. Comprehensive and extensive experiments are conducted to fully analyze the proposed method. The results demonstrate that our method achieves real-time speed and high accuracy with a smaller model size, making it an ideal alternative for most face detection and alignment applications.
Acknowledgments This work was supported in part by the National Key R&D Program of China (2018YFC0809200)
and the Natural Science Foundation of Shanghai (16ZR1416500).