
2019 4th International Conference on Computing, Communications and Security (ICCCS)

Facial Mask Detection using Semantic Segmentation


Toshanlal Meenpal
Dept. of Electronics and Telecomm.
National Institute of Technology
Raipur, India
tmeenpal.etc@nitrr.ac.in

Ashutosh Balakrishnan
Dept. of Electronics and Telecomm.
National Institute of Technology
Raipur, India
abalakrishnan1909@gmail.com

Amit Verma
Dept. of Electronics and Telecomm.
National Institute of Technology
Raipur, India
averma.phd2016.etc@nitrr.ac.in

Abstract—Face detection has evolved as a very popular problem in image processing and computer vision. Many new algorithms are being devised using convolutional architectures to make the algorithm as accurate as possible. These convolutional architectures have made it possible to extract even pixel-level details. We aim to design a binary face classifier which can detect any face present in the frame irrespective of its alignment. We present a method to generate accurate face segmentation masks from an input image of any arbitrary size. Beginning with the RGB image of any size, the method uses predefined training weights of the VGG-16 architecture for feature extraction. Training is performed through a Fully Convolutional Network to semantically segment out the faces present in the image. Gradient Descent is used for training, while Binomial Cross Entropy is used as the loss function. Further, the output image from the FCN is processed to remove unwanted noise, avoid false predictions (if any), and draw bounding boxes around the faces. Furthermore, the proposed model has also shown great results in recognizing non-frontal faces. Along with this, it is also able to detect multiple facial masks in a single frame. Experiments were performed on the Multi Parsing Human Dataset, obtaining a mean pixel-level accuracy of 93.884% for the segmented face masks.

Index Terms—Fully Convolutional Network, Semantic Segmentation, Face Segmentation and Detection

I. INTRODUCTION

Face detection has emerged as a very interesting problem in image processing and computer vision. It has a range of applications, from facial motion capture to face recognition, which at the start needs the face to be detected with very good accuracy. Face detection is more relevant today because it is used not only on images but also in video applications like real-time surveillance and face detection in videos. High-accuracy image classification is now possible with the advancement of convolutional networks. Pixel-level information is often required after face detection, which most face detection methods fail to provide. Obtaining pixel-level details has been a challenging part of semantic segmentation. Semantic segmentation is the process of assigning a label to each pixel of the image. In our case the labels are either face or non-face. Semantic segmentation is thus used to separate out the face by classifying each pixel of the image as face or background. Also, most of the widely used face detection algorithms tend to focus on the detection of frontal faces.

This paper proposes a model for face detection using semantic segmentation in an image by classifying each pixel as face or non-face, i.e. effectively creating a binary classifier and then detecting the segmented area. The model works very well not only for images having frontal faces but also for non-frontal faces. The paper also focuses on removing the erroneous predictions which are bound to occur. Semantic segmentation of the human face is performed with the help of a fully convolutional network.

The next section discusses the related work done in the domain of face detection. In Section III we describe the method followed for face segmentation and detection using semantic segmentation on any arbitrary RGB image. Finally, the generated facial masks are demonstrated in the experimental results in Section IV. Post-processing on the predicted images has also been discussed at length, which also entails the removal of erroneous predictions.

II. RELATED WORKS

Initially, researchers focused on the edge and gray values of the face image. [1] was based on a pattern recognition model having prior information of the face model. Adaboost [2] was a good training classifier. Face detection technology got a breakthrough with the famous Viola-Jones detector [3], which greatly improved real-time face detection. The Viola-Jones detector optimized Haar features [4], but failed to tackle real-world problems and was influenced by various factors like face brightness and face orientation. Viola-Jones could only detect frontal, well-lit faces. It failed to work well in dark conditions and with non-frontal images. These issues have made independent researchers work on developing new face detection models based on deep learning, to obtain better results for different facial conditions. We have developed our face detection model using the Multi Human Parsing Dataset [5], based on fully convolutional networks, such that it can detect the face in any geometric condition, frontal or non-frontal. Convolutional networks have long been used for image classification tasks. Typical architectures like AlexNet [6] and VGGNet [7] comprise stacked convolutional layers. AlexNet, with 5 convolutional layers and 3 fully connected layers, was the winner of the ImageNet LSVRC-2012 competition, while VGGNet is an improvement over AlexNet as it replaces large kernels with multiple consecutive 3×3 kernels. The ILSVRC-2014 winning architecture GoogleNet [8] uses parallel convolution kernels and concatenates the resulting feature maps; in it, 1×1, 3×3 and 5×5 convolutions and 3×3 max-pooling have been used. Smaller convolutions extract the

978-1-7281-0875-9/19/$31.00 © 2019 IEEE

local features whereas larger convolutions extract high-level features. More recent architectures such as ResNet [9] have introduced skip connections, which allow deeper networks to avoid saturation of training accuracy. These architectures are often used for initial feature extraction in face detection networks. In our method, we use the VGG-16 architecture as the base network for face detection and a Fully Convolutional Network for segmentation. The VGG-16 network is sufficiently deep to extract features and is computationally inexpensive for our case. Though the majority of segmentation architectures rely on downsampling and consecutive upsampling of the input image, Fully Convolutional Networks [10], [11], [12] remain a modest and significantly accurate approach for segmentation.

Fig. 1. Flowchart of the proposed method.

III. METHODOLOGY

We propose this paper with the twin objectives of creating a binary face classifier which can detect faces in any orientation irrespective of alignment, and training it in an appropriate neural network to get accurate results. The model takes an RGB image of any arbitrary size as input. The model's basic function is feature extraction and class prediction. The output of the model is a feature vector which is optimized using Gradient Descent, and the loss function used is Binomial Cross Entropy. Figure 1 represents the end-to-end pipeline of our method along with a sample demonstration of the output obtained at each step.

A. Proposed Work Flow

We propose a method of obtaining segmentation masks directly from images containing one or more faces in different orientations. The input image of any arbitrary size is resized to 224 × 224 × 3 and fed to the FCN for feature extraction and prediction. The output of the network is then subjected to post-processing. Initially, the pixel values of the face and background are subjected to global thresholding. The result is then passed through a median filter to remove high-frequency noise and subjected to a Closing operation to fill the gaps in the segmented area. Finally, a bounding box is drawn around the segmented area.

B. Architecture

Feature extraction and prediction are performed using pre-trained weights of the VGG-16 architecture. The basic VGG-16 architecture is depicted in Figure 2. Our proposed model consists of a total of 17 convolutional layers and 5 max-pooling layers. The initial image fed to the model is of size 224 × 224 × 3. As the image is processed through the layers for feature extraction, it passes through convolutional layers and max-pooling layers.

A convolutional layer convolves the input with a sliding kernel, while the max-pooling operation ensures that the size of the feature vector produced in every layer is halved so as to reduce the number of parameters. This is a very crucial step in feature extraction; if the number of parameters is not reduced, it becomes very difficult to predict the classes of each pixel in a fully convolutional network. The initial layers extract the lower-level features, while the subsequent layers extract the mid-level and higher-level features. The segmentation task requires that the spatial information be preserved for pixel-wise classification; we have achieved this by converting the fully connected layers of VGG to convolutional layers. After the final max-pooling layer, the image size is reduced to 28 × 28 × 2. This is further upsampled to bring the image back to the standard size of 224 × 224 × 2; since it is a binary classifier, two channels are created, one for each class, face and background.

C. Face Detection and Avoiding Erroneous Prediction

Post-processing is performed on the predicted mask so that irregularities in the segmented region can be filled and unwanted errors (which may have crept in during processing) removed. We perform this by first passing the mask through a median filter and then performing the Closing operation.
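As a concrete illustration, the post-processing chain just described (global thresholding, median filtering, morphological Closing, then a bounding box per connected region) can be sketched as below. This is a minimal sketch using SciPy's ndimage module, not the authors' implementation; the threshold value, filter size, and structuring element are illustrative assumptions.

```python
import numpy as np
from scipy import ndimage

def postprocess(prob_map, thresh=0.5):
    """Post-process a face-probability map: global threshold,
    median filter, morphological closing, then bounding boxes."""
    # Global thresholding separates face from background pixels
    mask = prob_map > thresh
    # Median filter suppresses high-frequency (salt-and-pepper) noise
    mask = ndimage.median_filter(mask.astype(np.uint8), size=3).astype(bool)
    # Closing fills small gaps inside the segmented face regions
    mask = ndimage.binary_closing(mask, structure=np.ones((5, 5)))
    # Label connected regions and compute a bounding box for each
    labels, _ = ndimage.label(mask)
    boxes = []
    for rows, cols in ndimage.find_objects(labels):
        boxes.append((rows.start, cols.start, rows.stop, cols.stop))
    return mask, boxes
```

On a probability map containing a single high-confidence blob, this returns one bounding box (top, left, bottom, right) covering the blob.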


Fig. 2. The complete architecture of the Fully Convolutional Network used for generating segmentation masks.

Fig. 3. (a) Actual Image (b) Erroneous Prediction (c) False Face Detection (d) Correct Face Detection.

This ensures that the gaps in the segmented region are filled and most of the unwanted erroneous predictions are removed. In spite of this, there is a possibility that some large error may not have been removed. We have designed the model such that all such erroneous predictions are not considered while showing the final detected faces. We compute the following parameters for each region: Centroid, Major Axis Length and Minor Axis Length. These values for Figure 3 are listed in Table I for all the facial (segmented) regions detected (including false predictions).

TABLE I
SEGMENTED REGION PARAMETER VALUES

S. No.   Centroid   Major Axis Length   Minor Axis Length
1.       9.414      11.62               7.2
2.       18.00      22.65               13.36
3.       13.18      14.84               11.51
4.       22.81      32.09               13.52
5.       18.07      27.35               8.8
6.       20.67      30.55               10.7
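For reference, the per-region parameters of Table I (centroid, major and minor axis lengths) can be measured from a binary region mask via the second-order central moments of its pixel coordinates. The sketch below uses the common equivalent-ellipse convention (axis length = 4·√eigenvalue of the coordinate covariance); the paper does not state its exact formula, so this convention is an assumption.

```python
import numpy as np

def region_params(mask):
    """Centroid and equivalent-ellipse major/minor axis lengths of a
    binary region, from second-order central moments of its pixels."""
    ys, xs = np.nonzero(mask)
    cy, cx = ys.mean(), xs.mean()                    # centroid (row, col)
    # Covariance of the pixel coordinates (np.cov centers them itself)
    cov = np.cov(np.stack([ys, xs]))
    evals = np.sort(np.linalg.eigvalsh(cov))[::-1]   # descending
    # Ellipse with the same second moments: length = 4 * sqrt(eigenvalue)
    major, minor = 4.0 * np.sqrt(evals)
    return (cy, cx), major, minor
```

For an elongated region the major axis length exceeds the minor axis length, which is what the filtering step below exploits.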
In Figure 3(b), even after post-processing through the median filter and dilation, the unwanted erroneous predictions have not completely disappeared. This results in the false face detection shown in Figure 3(c).

Using the centroid c_x, major axis length ma_x and minor axis length mi_x of each segmented region, we calculate the diameter d_x of each region. We compute the mean D̄ and standard deviation η_D of the diameter vector D. Finally, we keep the most

probable diameters, i.e. those lying within the first standard deviation. The detailed procedure is shown in Algorithm 1.

Algorithm 1 Detailed Distance Computing Procedure
    X ← number of regions
    D ∈ ℝ^X
    for x ← 1, X do
        d_x ← (ma_x + mi_x) / 2
        D[x] ← d_x
    end for
    D̄ ← (1/X) Σ_{x=1}^{X} D[x]
    η_D ← sqrt( (1/X) Σ_{x=1}^{X} (D[x] − D̄)² )
    D_true ← [ ]
    c ← 0
    for x ← 1, X do
        if D̄ − η_D < D[x] < D̄ + η_D then
            c ← c + 1
            D_true[c] ← D[x]
        end if
    end for

Fig. 4. Pixel level accuracy for predicted facial masks.

IV. EXPERIMENTAL RESULTS

All the experiments have been performed on the Multi Human Parsing Dataset, containing about 5000 images, each with at least two persons. Out of these, 2500 images were used for training and validation, while the remaining were used for testing the model. Figure 4 shows the true and predicted class for a given input image of any arbitrary size. It also shows the detected faces inside a bounding circle with the respective pixel-level accuracy. We have also shown the refined predicted mask after it is subjected to post-processing. The designed FCN semantically segments out the facial spatial locations with a specific label. Furthermore, the proposed model has also shown great results in recognizing non-frontal faces. Along with this, it is also able to detect multiple facial masks in a single frame. The post-processing provides a large boost to pixel-level accuracy. The mean pixel-level accuracy for the facial masks is 93.884%.

V. CONCLUSION

We were able to generate accurate face masks for human subjects from RGB-channel images containing localized objects. We demonstrated our results on the Multi Human Parsing Dataset with mean pixel-level accuracy. The problem of erroneous predictions has also been solved, and a proper bounding box has been drawn around the segmented region. The proposed network can detect non-frontal faces and multiple faces from a single image. The method can find applications in advanced tasks such as facial part detection.

REFERENCES

[1] T. Ojala, M. Pietikainen, and T. Maenpaa, "Multiresolution gray-scale and rotation invariant texture classification with local binary patterns," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 7, pp. 971–987, July 2002.
[2] T.-H. Kim, D.-C. Park, D.-M. Woo, T. Jeong, and S.-Y. Min, "Multi-class classifier-based adaboost algorithm," in Proceedings of the Second Sino-foreign-interchange Conference on Intelligent Science and Intelligent Data Engineering (IScIDE'11). Berlin, Heidelberg: Springer-Verlag, 2012, pp. 122–127.
[3] P. Viola and M. J. Jones, "Robust real-time face detection," International Journal of Computer Vision, vol. 57, no. 2, pp. 137–154, May 2004.
[4] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001), vol. 1, Dec 2001, pp. I–I.
[5] J. Li, J. Zhao, Y. Wei, C. Lang, Y. Li, and J. Feng, "Towards real world human parsing: Multiple-human parsing in the wild," CoRR, vol. abs/1705.07206.
[6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems 25, F. Pereira, C. J. C. Burges, L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc., 2012, pp. 1097–1105.
[7] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," CoRR, vol. abs/1409.1556, 2014.
[8] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," 2015.
[9] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016, pp. 770–778.
[10] K. Li, G. Ding, and H. Wang, "L-FCN: A lightweight fully convolutional network for biomedical semantic segmentation," in 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), Dec 2018, pp. 2363–2367.
[11] X. Fu and H. Qu, "Research on semantic segmentation of high-resolution remote sensing image based on full convolutional neural network," in 2018 12th International Symposium on Antennas, Propagation and EM Theory (ISAPE), Dec 2018, pp. 1–4.
[12] S. Kumar, A. Negi, J. N. Singh, and H. Verma, "A deep learning for brain tumor MRI images semantic segmentation using FCN," in 2018 4th International Conference on Computing Communication and Automation (ICCCA), Dec 2018, pp. 1–4.
