Abstract
As UAVs become increasingly common in urban security monitoring, they must accurately identify small targets that are easily obscured by complex backgrounds, dense buildings, and dynamic pedestrian flows. To address these challenges and the demands of real-world applications, we introduce SMT-Net, a system tailored to UAV patrolling. SMT-Net combines the Compact Axial Transformer Block with Scale-Adaptive Modulation to balance detection precision against computational cost. The Compact Axial Transformer Block comprises two components: Compact Axial Attention and Fine-grained Feature Enhancement. Compact Axial Attention reduces parameter count and model complexity while preserving crucial feature information. Fine-grained Feature Enhancement strengthens the model's ability to capture target details, improving classification and detection efficiency for small objects. Scale-Adaptive Modulation captures semantic information across different feature levels, increasing detection accuracy for small targets. Furthermore, to improve boundary precision in small object detection, we introduce the shape-IoU method. On our DRP-Dataset of UAV road-patrol imagery, SMT-Net achieves 88.0% mAP, shows a particularly clear advantage in small object detection, and outperforms all mainstream methods. The experiments indicate that SMT-Net can meet the demands for accurate and efficient detection across various UAV platforms in diverse complex scenarios.
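As a rough illustration of why boundary precision matters for small targets, the sketch below computes plain intersection-over-union (IoU) for axis-aligned boxes. It is not the shape-IoU formulation used in SMT-Net, which additionally takes the shape and scale of the boxes into account; the function name `box_iou` and the `(x1, y1, x2, y2)` box layout are assumptions made for this example only.

```python
def box_iou(a, b):
    """Plain IoU of two axis-aligned boxes given as (x1, y1, x2, y2).

    Illustrative baseline only; this is NOT the shape-IoU method
    named in the abstract.
    """
    # Corners of the intersection rectangle
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    # Clamp to zero when the boxes do not overlap
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The same 2-pixel horizontal shift applied to a small and a large box
small = box_iou((0, 0, 10, 10), (2, 0, 12, 10))
large = box_iou((0, 0, 100, 100), (2, 0, 102, 100))
print(f"small box IoU: {small:.3f}, large box IoU: {large:.3f}")
```

Under these assumptions, the identical localization error costs the 10×10 box far more overlap than the 100×100 box, which is the kind of sensitivity that motivates a shape- and scale-aware IoU variant for small objects.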
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Wang, Y., Zhang, J., Yan, J., Zhou, J., Miao, H. (2025). Scale-Adaptive Modulation Meet Compact Axial Transformer for Small Object Detection in UAV-Vision. In: Lin, Z., et al. Pattern Recognition and Computer Vision. PRCV 2024. Lecture Notes in Computer Science, vol 15043. Springer, Singapore. https://doi.org/10.1007/978-981-97-8493-6_1
DOI: https://doi.org/10.1007/978-981-97-8493-6_1
Publisher Name: Springer, Singapore
Print ISBN: 978-981-97-8492-9
Online ISBN: 978-981-97-8493-6
eBook Packages: Computer Science, Computer Science (R0)