Abstract
General visual object tracking models have advanced rapidly thanks to large annotated datasets and progressive network architectures. However, a general tracker often suffers from domain shift when directly applied to specific testing scenarios. In this paper, we address the animal tracking problem by proposing a spatio-temporal inference module and a coarse-to-fine tracking strategy. Non-rigid deformation is a typical challenge in tracking animals. We therefore design a novel transformer-based inference structure in which the changing animal state is transmitted across consecutive frames. By explicitly transmitting appearance variations, this spatio-temporal module enables adaptive target learning, boosting animal tracking performance compared with fixed template-matching approaches. In addition, considering the changing contours of animals across frames, we propose a coarse-to-fine tracking scheme that obtains a fine-grained animal bounding box with a dedicated distribution-aware regression module. The coarse tracking phase focuses on distinguishing the target from potential distractors in the background, while the fine-grained phase accurately regresses the final animal bounding box. To facilitate animal tracking evaluation, we captured and annotated 145 video sequences covering 20 categories at the zoo, forming a new test set for animal tracking, coined ZOO145. We also collected AnimalSOT, a dataset of 162 video sequences drawn from existing tracking benchmarks. Experimental results on the MoCA, ZOO145, and AnimalSOT animal tracking datasets demonstrate the merit of the proposed approach against advanced general tracking approaches, providing a baseline for future animal tracking studies.
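As an illustration of the adaptive template idea described above, the sketch below shows one plausible PyTorch realization of a transformer-based temporal update, in which the initial template attends to target features from the previous frame so that appearance changes flow into the matching template. This is a minimal sketch under assumed shapes and hyper-parameters; the module name TemporalTemplateUpdater and all dimensions are illustrative choices, not the authors' implementation.

```python
# Minimal sketch, assuming a PyTorch implementation: the first-frame template
# attends to target features from the previous frame, so the matching template
# adapts to non-rigid appearance changes. Names and shapes are illustrative.
import torch
import torch.nn as nn

class TemporalTemplateUpdater(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, template, prev_target):
        # Query: fixed initial template tokens; key/value: previous-frame
        # target tokens, transmitting the changing target state forward.
        fused, _ = self.attn(template, prev_target, prev_target)
        x = self.norm1(template + fused)
        return self.norm2(x + self.ffn(x))

# Toy usage: 64 template tokens with embedding dimension 256.
updater = TemporalTemplateUpdater()
template = torch.randn(1, 64, 256)     # first-frame template tokens
prev_target = torch.randn(1, 64, 256)  # previous-frame target tokens
adaptive_template = updater(template, prev_target)  # (1, 64, 256)
```

Similarly, distribution-aware box regression is commonly realized (for example, in generalized-focal-loss-style heads) by predicting each box side as a discrete distribution over bins and decoding the offset as its expectation. The sketch below shows that general pattern with an assumed bin count and feature stride; it is not the paper's exact regression head.

```python
# Hedged sketch of distribution-aware box regression: each box side is
# predicted as a discrete distribution over reg_max + 1 bins, and the final
# offset is the distribution's expectation. Bin count and stride are assumed.
import torch

def expected_offsets(logits, reg_max=16, stride=8.0):
    """Decode per-side offsets from predicted distributions.

    logits: (B, 4, reg_max + 1) unnormalized scores, one discrete
    distribution per box side (left, top, right, bottom).
    """
    probs = torch.softmax(logits, dim=-1)
    bins = torch.arange(reg_max + 1, dtype=probs.dtype)
    return (probs * bins).sum(dim=-1) * stride  # (B, 4) offsets in pixels

# Toy usage: batch of 2 predictions, 17 bins per side.
offsets = expected_offsets(torch.randn(2, 4, 17))
```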
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (62020106012, U1836218, 62106089).
Additional information
Communicated by Hyun Soo Park.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Xu, T., Kang, Z., Zhu, X. et al. Learning Adaptive Spatio-Temporal Inference Transformer for Coarse-to-Fine Animal Visual Tracking: Algorithm and Benchmark. Int J Comput Vis 132, 2698–2712 (2024). https://doi.org/10.1007/s11263-024-02008-8