Abstract
Recognizing persons in unconstrained settings is challenging because of variations in pose and viewpoint, partial occlusion, and motion blur. Relying on face-based recognition alone would fail in these cases. Previous studies mainly address this problem in still images and cannot handle the temporal variations in videos. In this work, we tackle these challenges by proposing a Multi-Cue and Temporal Attention (MCTA) framework for recognizing persons in videos. In the spatial domain, we extract features from multiple visual cue regions and integrate them with a Multi-Cue Attention Module. In the temporal domain, we adopt a Temporal Attention Module to combine the video frames; it learns to assess the quality of different frames adaptively. In this way, MCTA can comprehensively exploit the complementary information across the spatial and temporal dimensions for person recognition in videos. Moreover, we introduce Character Recognition in Videos (CRV), a new video dataset for character recognition under challenging settings. Extensive experiments on CRV demonstrate the effectiveness of the proposed framework. The dataset with annotations and all code used in this paper are publicly available at https://github.com/zhezheey/MCTA.
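The two-stage fusion described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that each attention module reduces to a softmax-weighted average, that cue features share one dimensionality, and that the attention scores are given as plain numbers (in a trained model they would be predicted by learned layers). The names `softmax`, `attention_pool`, and `fuse_video` are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of scalar scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(features: Sequence[Vector], scores: Sequence[float]) -> Vector:
    """Weighted average of feature vectors, with weights = softmax(scores)."""
    if len(features) != len(scores):
        raise ValueError("need exactly one score per feature vector")
    weights = softmax(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


def fuse_video(
    cue_features: Sequence[Sequence[Vector]],
    cue_scores: Sequence[Sequence[float]],
    frame_scores: Sequence[float],
) -> Vector:
    """cue_features[t][c] is the feature of cue c (e.g. face, body) in frame t.

    Stage 1 (assumed stand-in for multi-cue attention): pool cues per frame.
    Stage 2 (assumed stand-in for temporal attention): pool the frame vectors.
    """
    per_frame = [attention_pool(f, s) for f, s in zip(cue_features, cue_scores)]
    return attention_pool(per_frame, frame_scores)


# Two frames, two cues per frame, 2-dimensional toy features.
video = [
    [[1.0, 0.0], [0.0, 1.0]],  # frame 0: cue 0, cue 1
    [[1.0, 1.0], [1.0, 1.0]],  # frame 1: cue 0, cue 1
]
uniform_cues = [[0.0, 0.0], [0.0, 0.0]]

# Equal scores reduce to plain averaging at both stages.
fused_uniform = fuse_video(video, uniform_cues, [0.0, 0.0])

# A much higher score for frame 1 makes that frame dominate the result.
fused_sharp = fuse_video(video, uniform_cues, [0.0, 50.0])
```

With equal scores the sketch falls back to average pooling; raising one frame's score shifts the fused descriptor toward that frame, which is the sense in which a quality-aware weighting can down-weight blurred or occluded frames.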
Supported by the National Key R&D Program of China (2018YFC0831500), the National Natural Science Foundation of China (No. 61972047), and the NSFC-General Technology Basic Research Joint Funds (No. U1936220).
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Wang, W., Wu, B., Li, F., Liu, Z. (2020). Multi-Cue and Temporal Attention for Person Recognition in Videos. In: Peng, Y., et al. Pattern Recognition and Computer Vision. PRCV 2020. Lecture Notes in Computer Science, vol 12306. Springer, Cham. https://doi.org/10.1007/978-3-030-60639-8_31
DOI: https://doi.org/10.1007/978-3-030-60639-8_31
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-60638-1
Online ISBN: 978-3-030-60639-8
eBook Packages: Computer Science, Computer Science (R0)