Abstract
General visual object tracking models have advanced rapidly thanks to large annotated datasets and progressive network architectures. However, a general tracker often suffers from domain shift when directly applied to specific testing scenarios. In this paper, we address the animal tracking problem by proposing a spatio-temporal inference module and a coarse-to-fine tracking strategy. Non-rigid deformation is a typical challenge in tracking animals. We therefore design a novel transformer-based inference structure in which the changing animal state is transmitted across consecutive frames. By explicitly transmitting appearance variations, this spatio-temporal module enables adaptive target learning, boosting animal tracking performance compared with fixed template-matching approaches. In addition, considering the changing contours of animals across frames, we propose a coarse-to-fine tracking scheme that obtains a fine-grained animal bounding box with a dedicated distribution-aware regression module. The coarse tracking phase focuses on distinguishing the target from potential distractors in the background, while the fine-grained phase accurately regresses the final animal bounding box. To facilitate animal tracking evaluation, we captured and annotated 145 video sequences covering 20 categories at the zoo, forming a new test set for animal tracking, coined ZOO145. We also collected AnimalSOT, a dataset of 162 video sequences drawn from existing tracking benchmarks. Experimental results on the MoCA, ZOO145, and AnimalSOT animal tracking datasets demonstrate the merit of the proposed approach against advanced general tracking approaches, providing a baseline for future animal tracking studies.
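As an illustration of the adaptive template idea described above, the sketch below shows one plausible PyTorch realization of a transformer-based temporal update, in which the initial template attends to target features from the previous frame so that appearance changes flow into the matching template. This is a minimal sketch under assumed shapes and hyper-parameters; the module name TemporalTemplateUpdater and all dimensions are illustrative choices, not the authors' implementation.

```python
# Minimal sketch, assuming a PyTorch implementation: the first-frame template
# attends to target features from the previous frame, so the matching template
# adapts to non-rigid appearance changes. Names and shapes are illustrative.
import torch
import torch.nn as nn

class TemporalTemplateUpdater(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, template, prev_target):
        # Query: fixed initial template tokens; key/value: previous-frame
        # target tokens, transmitting the changing target state forward.
        fused, _ = self.attn(template, prev_target, prev_target)
        x = self.norm1(template + fused)
        return self.norm2(x + self.ffn(x))

# Toy usage: 64 template tokens with embedding dimension 256.
updater = TemporalTemplateUpdater()
template = torch.randn(1, 64, 256)     # first-frame template tokens
prev_target = torch.randn(1, 64, 256)  # previous-frame target tokens
adaptive_template = updater(template, prev_target)  # (1, 64, 256)
```

Similarly, distribution-aware box regression is commonly realized (for example, in generalized-focal-loss-style heads) by predicting each box side as a discrete distribution over bins and decoding the offset as its expectation. The sketch below shows that general pattern with an assumed bin count and feature stride; it is not the paper's exact regression head.

```python
# Hedged sketch of distribution-aware box regression: each box side is
# predicted as a discrete distribution over reg_max + 1 bins, and the final
# offset is the distribution's expectation. Bin count and stride are assumed.
import torch

def expected_offsets(logits, reg_max=16, stride=8.0):
    """Decode per-side offsets from predicted distributions.

    logits: (B, 4, reg_max + 1) unnormalized scores, one discrete
    distribution per box side (left, top, right, bottom).
    """
    probs = torch.softmax(logits, dim=-1)
    bins = torch.arange(reg_max + 1, dtype=probs.dtype)
    return (probs * bins).sum(dim=-1) * stride  # (B, 4) offsets in pixels

# Toy usage: batch of 2 predictions, 17 bins per side.
offsets = expected_offsets(torch.randn(2, 4, 17))
```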
Acknowledgements
This work was supported in part by the National Natural Science Foundation of China (62020106012, U1836218, 62106089).
Additional information
Communicated by Hyun Soo Park.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Xu, T., Kang, Z., Zhu, X. et al. Learning Adaptive Spatio-Temporal Inference Transformer for Coarse-to-Fine Animal Visual Tracking: Algorithm and Benchmark. Int J Comput Vis 132, 2698–2712 (2024). https://doi.org/10.1007/s11263-024-02008-8