
CN118505974A - Sonar image target detection method and system based on deep learning - Google Patents


Info

Publication number
CN118505974A
CN118505974A
Authority
CN
China
Prior art keywords
module
sonar image
model
passes
sonar
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410644682.3A
Other languages
Chinese (zh)
Inventor
刘锡祥
彭子通
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Southeast University
Original Assignee
Southeast University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Southeast University filed Critical Southeast University
Priority to CN202410644682.3A priority Critical patent/CN118505974A/en
Publication of CN118505974A publication Critical patent/CN118505974A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/082Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/096Transfer learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/05Underwater scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a sonar image target detection method and system based on deep learning, belonging to the technical field of underwater recognition. The method comprises the following steps: acquiring sonar images and screening and sorting them to form a sonar image dataset; constructing a sonar image target detection model, YOLO_EC, which on the basis of the YOLOv5 network introduces ELAN modules to replace some of the C3 modules for feature extraction and MP modules to replace convolutional layers for downsampling, and introduces C2f modules into the feature fusion network for feature fusion; pre-training the model on an optical image dataset by means of style transfer and transfer learning, then training it with the sonar image dataset; and inputting the sonar image to be detected into the trained detection model to obtain the sonar image target detection result. The improved network structure enhances the model's feature extraction and parallel computing capabilities, and transfer learning improves the model's detection accuracy on sonar image targets.

Description

A sonar image target detection method and system based on deep learning

Technical Field

The present invention relates to the technical field of underwater target recognition, and in particular to a sonar image target detection method and system based on deep learning.

Background Art

According to the working principle, underwater target detection technology can be divided into sonar-echo-based detection and sonar-image-based detection. Echo-based detection processes the echo signal, analyzing its characteristics to detect and interpret seabed information. Sonar-image-based target detection offers strong expressive power, high precision, high resolution, wider applicability, multi-beam support, and strong real-time performance. Although echo-based target detection remains important in underwater exploration, image-based target detection has more advantages and greater development potential and will see wider use in underwater exploration in the future.

The imaging sonars currently used for underwater detection mainly include side-scan sonar, synthetic aperture sonar, multi-beam sonar, and forward-looking sonar. Multi-beam sonar mostly produces three-dimensional acoustic images; after processing, the generated images are generally hypsometrically tinted topographic maps that depict terrain intuitively, so this sonar is often used in terrain surveying and buried-object detection tasks. Synthetic aperture sonar (SAS) is a high-resolution imaging system. Its principle is that, as the transducer travels along a given course, the echo data sampled in a continuous time series are coherently combined in order, which is equivalent to moving a small physical aperture to form a larger virtual aperture, thereby improving the along-track (azimuth) resolution of the sound waves. In theory this resolution can reach 1/2 of the physical aperture size, making the azimuth resolution independent of propagation distance. Side-scan sonar and synthetic aperture sonar produce side-looking acoustic images, while forward-looking sonar produces forward-looking acoustic images; both are two-dimensional. A two-dimensional acoustic image consists of pixels whose grayscale varies with the strength of the received acoustic signal. Side-scan sonar mainly consists of transmitting and receiving array elements. In operation, the transmitting elements emit acoustic pulses that propagate outward spherically; the waves are scattered when they meet underwater targets or the seabed, and the backscattered waves are received by the receiving elements on the sonar device. Regions of different intensity are drawn on the acoustic image according to the strength of the received waves. Three types of regions exist in a sonar image: the target highlight region, the shadow region, and the seabed reverberation region. Acoustic reflection from an object produces its highlight region; the shadow region is caused by insufficient acoustic backscattering behind the object; the remaining information is contained in the region known as the seabed reverberation region.

Owing to the complex and changeable nature of seawater, the acquired sonar image sequences are characterized by high noise, low contrast, and poor quality. Much effort has already been devoted to underwater sonar target detection, falling mainly into traditional methods and deep learning methods. Traditional sonar image target detection methods are usually based on image processing techniques such as threshold segmentation, edge detection, and morphology, and mainly rely on manual approaches such as physical modeling to separate foreground targets from background pixels and locate targets. Traditional methods are cumbersome because they usually require manually choosing the types of features to extract, and they cannot extract the deep features of sonar images.

With the development of computer hardware, convolutional neural networks (CNNs), supported by high-compute devices, have performed outstandingly in image processing, and using deep neural networks to improve performance on various image classification problems has become standard practice. These efforts have produced state-of-the-art results in many applications, such as vehicle license plate recognition, medical diagnostic imaging, lip reading, and fruit and vegetable recognition. Naturally, deep learning has also been introduced into the sonar image field and has become one of the hot topics in active sonar image classification research in recent years, so research on deep learning for sonar image target recognition has important practical significance.

In recent years, researchers have brought deep learning target detection techniques that perform very well on optical imagery into the sonar image field and have achieved certain results. Galusha A, Dale J, Keller J M, et al., "Deep Convolutional Neural Network Target Classification for Underwater Synthetic Aperture Sonar Imagery" (Detection and Sensing of Mines, Explosive Objects, and Obscured Targets XXIV, 2019, 11012) proposed classifying synthetic aperture sonar images with a 4-layer CNN, but the model's complexity is extremely high, which is unacceptable in engineering. Williams D P, Dugelay S, "Multi-view SAS Image Classification Using Deep Learning" (OCEANS 2016 MTS/IEEE Monterey, 2016) first addressed the active sonar image classification problem using a DCNN. Park J, Jung D J, "Identifying Tonal Frequencies in a Lofargram with Convolutional Neural Networks" (2019 19th International Conference on Control, Automation and Systems (ICCAS 2019), 2019: 338-341) proposed a Lofargram recognition system using a CNN; although its precision and recall are acceptable, it is time-consuming.

Summary of the Invention

Purpose of the invention: To overcome the defects of the prior art described above, the present invention aims to provide a sonar image target detection method and system based on deep learning that effectively improves the accuracy of sonar image target detection.

Technical solution: To achieve the above objectives, the present invention adopts the following technical solution:

A sonar image target detection method based on deep learning, comprising the following steps:

S1. Acquire sonar images and screen and sort them; after classification, apply median-filter denoising and data augmentation to the images to form a sonar image dataset;

S2. Construct a sonar image target detection model YOLO_EC, wherein, based on the YOLOv5 network, the YOLO_EC model introduces ELAN modules into the feature extraction network to replace some of the C3 modules for feature extraction and MP modules to replace convolutional layers for downsampling, and introduces C2f modules into the feature fusion network for feature fusion;

S3. Based on the established YOLO_EC sonar image target detection model, pre-train the model on an optical image dataset through style transfer plus transfer learning to obtain the model's base parameter set, then input the sonar image dataset into the model for training to obtain a trained detection model;

S4. Input the sonar image to be detected into the trained detection model to obtain the sonar image target detection result.

Further, in step S1, median filtering the image comprises: for each pixel in the image, obtaining the grayscale values of its neighborhood pixels, sorting them into a group, and storing the median of the group to replace the original pixel value;

after the sonar image dataset is annotated in YOLO format, data augmentation is applied to the dataset, including one or more of random cropping, rotation/affine transformation, random rotation, scaling, translation, perspective transformation, and Mosaic augmentation.

Further, the ELAN module splits the parameters passed down from the upper layer into two branches: the first branch passes through a 1×1 convolutional layer that halves the number of channels; the second branch first passes through a 1×1 convolutional layer that halves the number of channels and then through four 3×3 convolutional layers for feature extraction with the channel count unchanged, and skip connections extract the features after every two convolutional layers; finally, the four feature maps are concatenated into a result with twice the original number of channels, yielding a feature map of unchanged size and doubled channel count.

Further, the first branch of the MP module first passes through a max pooling layer for downsampling and then through a 1×1 convolutional layer that halves the number of channels; the second branch first passes through a 1×1 convolutional layer that halves the number of channels and then through a 3×3 convolutional layer with stride 2 for downsampling; finally, the results of the two branches are combined to obtain the downsampled result.

Further, the main branch of the C2f module consists of a convolutional layer with k=1, s=1, c=c, a split module, and several bottleneck modules, where k is the kernel size, s is the stride, and c is the number of channels; c=c means the layer keeps the channel count c of the incoming feature map unchanged. The split module divides the channels of a feature map evenly into two parts of c/2 each: one part connects directly to the Concat stage, and the other continues into the bottleneck modules. After each bottleneck module, its output is additionally passed to the Concat stage. Finally, the fusion module fuses the X+1 tensors, producing a feature map with (X+1)*(c/2) channels.

Further, the YOLO_EC model processes an image as follows: after input, the image passes through two convolutional layers, a C2f module, and another convolutional layer, then through an ELAN module to obtain the first feature map P1; it then passes through an MP module and an ELAN module to obtain the second feature map P2, and finally through MP and ELAN modules into the SPPF module to obtain feature map P3. P3 passes through a convolutional layer to obtain P4; P4 is upsampled and fused with P2 to obtain T1; T1 passes through a C2f module and a convolutional layer to obtain P5; P5 is upsampled and fused with P1 to obtain T2; T2 passes through a C2f module to obtain output feature map D1. D1 passes through a convolutional layer and is fused with P5 to obtain T3; T3 passes through a C2f module to obtain output feature map D2. D2 passes through a convolutional layer and is fused with P4 to obtain T4; T4 passes through a C2f module to obtain output feature map D3. Feature maps D1, D2, and D3 go through output inference to produce the model's final inference result.

Further, in step S3, adaptive instance normalization (AdaIN) is used for style transfer, expressed as:

AdaIN(x,y) = σ(y)·((x − μ(x))/σ(x)) + μ(y)

where x is the optical image, y is the sonar image, μ and σ are respectively the mean and standard deviation of image pixel values, μ(y) is the mean of the sonar image, and σ(y) is the standard deviation of the sonar image.

The present invention also provides a sonar image target detection system based on deep learning, comprising:

a sonar image preprocessing module, which acquires sonar images, screens and sorts them, classifies them, and applies median-filter denoising and data augmentation to form a sonar image dataset;

a target detection model construction module, which constructs the sonar image target detection model YOLO_EC, wherein, based on the YOLOv5 network, the YOLO_EC model introduces ELAN modules into the feature extraction network to replace some of the C3 modules for feature extraction and MP modules to replace convolutional layers for downsampling, and introduces C2f modules into the feature fusion network for feature fusion;

a model training module, which, based on the established YOLO_EC sonar image target detection model, pre-trains the model on an optical image dataset through style transfer plus transfer learning to obtain the model's base parameter set, then inputs the sonar image dataset into the model for training to obtain a trained detection model; and

a target detection module, which inputs the sonar image to be detected into the trained detection model to obtain the sonar image target detection result.

The present invention also provides a computer device, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and when executed by the processors the programs implement the steps of the deep-learning-based sonar image target detection method described above.

The present invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the steps of the deep-learning-based sonar image target detection method described above.

Beneficial effects: The present invention applies deep learning to the sonar image field and improves the network model accordingly. By improving the feature extraction network, it effectively fuses features at different levels to optimize information flow, improves the model's accuracy and robustness when handling targets of different sizes and complexities, and achieves more efficient feature transfer in the network. By improving the feature fusion network, it uses lower resolution early on to reduce the computational burden and then gradually raises the resolution to enhance detail recognition, improving precision while maintaining efficiency. It also applies transfer learning with style transfer: by performing targeted style transfer on the source-domain dataset, the dependence of transfer learning on dataset scale can be effectively alleviated. The present invention effectively improves the accuracy and efficiency of sonar image target detection.

Brief Description of the Drawings

FIG. 1 is a flow chart of the deep-learning-based sonar image target detection method of the present invention;

FIG. 2 is a network structure diagram of the YOLO_EC model of the present invention;

FIG. 3 is a network structure diagram of the ELAN module of the present invention;

FIG. 4 is a network structure diagram of the MP module of the present invention;

FIG. 5 is a network structure diagram of the C2f module of the present invention;

FIG. 6 is a schematic diagram of the transfer learning principle of the present invention;

FIG. 7 is a schematic diagram of the style transfer principle of the present invention;

FIG. 8 shows target detection results of an embodiment of the present invention.

Detailed Description

To make the features and advantages of the technical solution of the present invention clearer, the composition and implementation of the specific solution are explained below with reference to the accompanying drawings.

Referring to FIG. 1, the present invention proposes a sonar image target detection method based on deep learning, comprising the following steps:

Step S1: acquire sonar images and screen and sort them to obtain a sonar image dataset.

Sonar images are collected from the web, from experiments, and from open-source datasets, then screened and sorted: the collected pictures are cropped, segmented, and classified, pictures that are oversized or poorly imaged are discarded, and the remainder are divided into four categories: shipwrecks, crashed aircraft, human bodies, and unknown targets. The images are then median filtered. Median filtering is a nonlinear signal processing technique based on order statistics that effectively suppresses noise. Its core idea is to replace a pixel's grayscale value with the median of the grayscale values in its neighborhood: the neighborhood pixels are sorted by grayscale, and the median of the group is stored to replace the original pixel value at that point. The mathematical expression of median filtering is given in formula (1):

g(x,y) = median{f(x−i, y−j)}, (i,j) ∈ S (1)

where f(x,y) is a pixel value in the original image, g(x,y) is the corresponding pixel value in the processed image, S is the filter window, and i and j are offsets within the window.
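The following NumPy sketch implements formula (1) directly; the 3×3 window size and the reflective border handling are illustrative choices, not values fixed by the patent.

```python
import numpy as np

def median_filter(img: np.ndarray, win: int = 3) -> np.ndarray:
    """Apply a win x win median filter to a 2-D grayscale sonar image."""
    pad = win // 2
    padded = np.pad(img, pad, mode="reflect")   # handle edge pixels by reflection
    out = np.empty_like(img)
    h, w = img.shape
    for y in range(h):
        for x in range(w):
            window = padded[y:y + win, x:x + win]   # neighborhood S around (x, y)
            out[y, x] = np.median(window)           # sort and take the median
    return out
```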

After annotating the sonar image dataset in YOLO format, data augmentation is applied to the dataset, such as random cropping, rotation/affine transformation, random rotation, scaling, translation, perspective transformation, and Mosaic augmentation.
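One illustrative way to realize most of the listed augmentations is with torchvision; the parameter ranges below are assumptions, Mosaic is not in torchvision and is typically supplied by the YOLO training pipeline, and for detection data the bounding boxes must be transformed consistently with the image.

```python
from torchvision import transforms

# Illustrative augmentation pipeline for the image alone (labels handled separately)
augment = transforms.Compose([
    transforms.RandomResizedCrop(640, scale=(0.5, 1.0)),        # random cropping + scaling
    transforms.RandomAffine(degrees=15, translate=(0.1, 0.1)),  # rotation / translation (affine)
    transforms.RandomPerspective(distortion_scale=0.3, p=0.5),  # perspective transform
])
```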

Step S2: construct the sonar image target detection model YOLO_EC based on the YOLOv5 network. The ELAN, MP, and C2f modules are introduced into YOLO_EC to improve the YOLOv5 network and achieve accurate detection of underwater targets.

The network structure improvements of the YOLO_EC model in the present invention comprise improvements to the feature extraction network and to the feature fusion network, and the FReLU activation function is adopted.

In the feature extraction network, ELAN modules replace a certain number of C3 modules, and MP modules perform downsampling in place of the convolutional layers used in the original network. This strengthens the network's ability to learn sonar image targets and, while enabling multi-branch computation, avoids excessive transition layers so that the shortest gradient path of the whole network lengthens quickly, effectively remedying the problem in the ResNet structure that stacking residual layers only lengthens the longest gradient path without changing the shortest one. The ELAN module optimizes information flow by effectively fusing features at different levels, improving the model's accuracy and robustness on targets of different sizes and complexities and enabling more efficient feature transfer in the network. The ELAN structure is also better suited to inference and training on highly parallel computing devices such as GPUs. The network structure of the ELAN module is shown in FIG. 3: the module splits the parameters passed down from the upper layer into two branches. The first branch passes through a 1×1 convolutional layer that halves the number of channels. The second branch first passes through a 1×1 convolutional layer that halves the number of channels, then through four 3×3 convolutional layers for feature extraction with the channel count unchanged, and skip connections extract the features after every two convolutional layers. Finally, the four feature maps are concatenated into a result with twice the original number of channels, yielding a feature map of unchanged size and doubled channels. The network structure of the MP module is shown in FIG. 4. The MP module is a downsampling module: its first branch passes through a max pooling layer, whose role is downsampling, and then a 1×1 convolutional layer that halves the number of channels; the second branch first passes through a 1×1 convolutional layer that halves the number of channels and then a 3×3 convolutional layer with stride 2 for downsampling. Finally, the results of the two branches are combined to obtain the downsampled output. The module fuses the two downsampling modes, producing a feature map with the channel count unchanged and the image area reduced to one quarter of the original.
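The following PyTorch sketch gives one plausible reading of the ELAN and MP structures described above and in FIG. 3 and FIG. 4; the Conv-BN-SiLU building block and the exact skip-connection tap points are assumptions, not the authors' reference code.

```python
import torch
import torch.nn as nn

class ConvBNAct(nn.Module):
    """k x k convolution followed by BatchNorm and SiLU activation."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, padding=k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class ELAN(nn.Module):
    """Input c channels -> output 2c channels, spatial size unchanged."""
    def __init__(self, c):
        super().__init__()
        c_half = c // 2
        self.branch1 = ConvBNAct(c, c_half, k=1)   # 1x1, halves channels
        self.branch2 = ConvBNAct(c, c_half, k=1)   # 1x1, halves channels
        self.conv3x3 = nn.ModuleList(
            ConvBNAct(c_half, c_half, k=3) for _ in range(4)
        )

    def forward(self, x):
        y1 = self.branch1(x)
        t = y2 = self.branch2(x)
        taps = []
        for i, m in enumerate(self.conv3x3):
            t = m(t)
            if i % 2 == 1:          # skip connection taps after every two 3x3 convs
                taps.append(t)
        # four maps of c/2 channels each -> 2c channels, same spatial size
        return torch.cat([y1, y2, *taps], dim=1)

class MP(nn.Module):
    """Downsampling module: channels unchanged, H and W each halved."""
    def __init__(self, c):
        super().__init__()
        c_half = c // 2
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.cv1 = ConvBNAct(c, c_half, k=1)            # 1x1 after max pooling
        self.cv2 = ConvBNAct(c, c_half, k=1)            # 1x1, halves channels
        self.cv3 = ConvBNAct(c_half, c_half, k=3, s=2)  # strided 3x3 downsampling

    def forward(self, x):
        y1 = self.cv1(self.pool(x))     # branch 1: max pool then 1x1 conv
        y2 = self.cv3(self.cv2(x))      # branch 2: 1x1 conv then strided 3x3 conv
        return torch.cat([y1, y2], dim=1)
```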

The Neck serves as the feature fusion network, and the present invention uses the C2f module for feature fusion. The C2f network structure is shown in FIG. 5. The main branch of the C2f module consists of a convolutional layer with k=1, s=1, c=c, a split module, and several bottleneck modules, where k is the kernel size, s is the stride, and c is the number of channels; c=c means the layer keeps the channel count c of the incoming feature map unchanged. The split module divides the channels of a feature map evenly into two parts of c/2 each: one part connects directly to the Concat stage, and the other continues into the bottleneck modules. After each bottleneck module, its output is additionally passed to the Concat stage. Finally, the fusion module fuses the X+1 tensors, producing a feature map with c=(X+1)*(c/2) channels. The C2f module yields richer gradient-flow information. On top of the bottleneck layers, it adds more skip connections and an extra split operation; the split operation divides a larger feature map into two smaller ones to reduce computation and parameter count. The C2f structure can use lower resolution early on to reduce the computational burden and then gradually raise the resolution to enhance detail recognition, improving precision while maintaining efficiency.
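A sketch of the C2f module following the description above, reusing the ConvBNAct block from the previous sketch. The internal bottleneck design (two 3×3 convolutions with a residual addition) is assumed from the YOLOv8-style C2f, since the text does not spell it out; note that the sketch keeps X+1 tensors as the patent states, whereas the YOLOv8 C2f keeps X+2.

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Two 3x3 convs with a residual connection; channel count preserved."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNAct(c, c, k=3)
        self.cv2 = ConvBNAct(c, c, k=3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))

class C2f(nn.Module):
    """Input c channels -> output (X+1) * (c/2) channels."""
    def __init__(self, c, num_bottlenecks=2):
        super().__init__()
        self.cv_in = ConvBNAct(c, c, k=1, s=1)   # the k=1, s=1, c=c layer
        self.blocks = nn.ModuleList(
            Bottleneck(c // 2) for _ in range(num_bottlenecks)
        )

    def forward(self, x):
        x = self.cv_in(x)
        direct, y = torch.chunk(x, 2, dim=1)   # split into two c/2 halves
        outs = [direct]                        # one half goes straight to Concat
        for block in self.blocks:
            y = block(y)
            outs.append(y)                     # every bottleneck output is also kept
        return torch.cat(outs, dim=1)          # (X+1) * (c/2) channels
```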

The overall structure of the improved YOLO_EC sonar image target detection model is shown in FIG. 2. After input, the image passes through two Conv layers, a C2f module, and a convolutional (Conv) layer, then through an ELAN module to obtain the first feature map P1; it then passes through an MP module and an ELAN module to obtain the second feature map P2, and finally through MP and ELAN modules into the SPPF module to obtain feature map P3. P3 passes through a Conv layer to obtain P4; P4 is upsampled and fused with P2 to obtain T1; T1 passes through a C2f module and a Conv layer to obtain P5; P5 is upsampled and fused with P1 to obtain T2; T2 passes through a C2f module to obtain output feature map D1; D1 passes through a Conv layer and is fused with P5 to obtain T3; T3 passes through a C2f module to obtain output feature map D2; D2 passes through a Conv layer and is fused with P4 to obtain T4; T4 passes through a C2f module to obtain output feature map D3. Feature maps D1, D2, and D3 go through output inference to produce the model's final result.
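The data flow above can be summarized in a structural sketch that reuses the ConvBNAct, ELAN, MP, and C2f classes from the previous sketches, with fusion realized as channel concatenation as in the YOLOv5 neck. The channel widths, the SPPF internals, and the bare 1×1 detection heads are illustrative assumptions; the actual configuration is fixed by FIG. 2.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """YOLOv5-style spatial pyramid pooling (fast): three chained 5x5 max pools."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_mid = c_in // 2
        self.cv1 = ConvBNAct(c_in, c_mid, k=1)
        self.cv2 = ConvBNAct(c_mid * 4, c_out, k=1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)
        y2 = self.pool(y1)
        return self.cv2(torch.cat([x, y1, y2, self.pool(y2)], dim=1))

class YOLO_EC(nn.Module):
    def __init__(self, num_outputs=27):   # e.g. 3 anchors x (4 classes + 5), an assumption
        super().__init__()
        # backbone: Conv, Conv, C2f, Conv, ELAN -> P1; MP, ELAN -> P2; MP, ELAN, SPPF -> P3
        self.stem = nn.Sequential(
            ConvBNAct(3, 32, k=3, s=2), ConvBNAct(32, 64, k=3, s=2),
            C2f(64, num_bottlenecks=3),            # 64 -> 128 channels
            ConvBNAct(128, 128, k=3, s=2),
        )
        self.elan1 = ELAN(128)                     # -> P1: 256 ch, stride 8
        self.mp1, self.elan2 = MP(256), ELAN(256)  # -> P2: 512 ch, stride 16
        self.mp2, self.elan3 = MP(512), ELAN(512)  # -> 1024 ch, stride 32
        self.sppf = SPPF(1024, 512)                # -> P3: 512 ch
        # neck: upsample and fuse, then downsample and fuse, with C2f after each fusion
        self.cv_p4 = ConvBNAct(512, 256, k=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")
        self.c2f_t1 = C2f(768, 1)                  # cat(up(P4), P2): 768 -> 768
        self.cv_p5 = ConvBNAct(768, 128, k=1)
        self.c2f_d1 = C2f(384, 1)                  # cat(up(P5), P1): 384 -> 384
        self.down1 = ConvBNAct(384, 256, k=3, s=2)
        self.c2f_d2 = C2f(384, 1)                  # cat(down(D1), P5): 384 -> 384
        self.down2 = ConvBNAct(384, 256, k=3, s=2)
        self.c2f_d3 = C2f(512, 1)                  # cat(down(D2), P4): 512 -> 512
        self.heads = nn.ModuleList(
            nn.Conv2d(c, num_outputs, 1) for c in (384, 384, 512)
        )

    def forward(self, x):
        p1 = self.elan1(self.stem(x))
        p2 = self.elan2(self.mp1(p1))
        p3 = self.sppf(self.elan3(self.mp2(p2)))
        p4 = self.cv_p4(p3)
        t1 = torch.cat([self.up(p4), p2], dim=1)
        p5 = self.cv_p5(self.c2f_t1(t1))
        t2 = torch.cat([self.up(p5), p1], dim=1)
        d1 = self.c2f_d1(t2)                                      # output map, stride 8
        d2 = self.c2f_d2(torch.cat([self.down1(d1), p5], dim=1))  # stride 16
        d3 = self.c2f_d3(torch.cat([self.down2(d2), p4], dim=1))  # stride 32
        return [h(d) for h, d in zip(self.heads, (d1, d2, d3))]
```

With the classes above, YOLO_EC()(torch.randn(1, 3, 640, 640)) yields three maps at strides 8, 16, and 32, matching the D1/D2/D3 outputs described in the text.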

Step S3: pre-train the established YOLO_EC sonar image target detection model by means of style transfer plus transfer learning.

The principle of transfer learning is shown in FIG. 6. The model is first pre-trained on the source domain, an optical image dataset, and the pre-trained parameters obtained can then serve target detection in sonar images. The optical image dataset is the mainstream ImageNet dataset; the copy obtained for the present invention contains 1000 classes of common optical targets with 1300 images per class. To strengthen the link between the optical dataset and the sonar image dataset while reducing the training load, 16 classes were selected, including optical image classes similar to sonar targets, such as ships and aircraft, and pictures related to sonar operating scenarios, such as beaches and coasts. The officially annotated images of these 16 classes were selected to form a new training dataset named IN-16. IN-16 contains 9728 pictures in 16 classes: 3 classes of aircraft, 11 classes of various ships, and 2 classes of beaches and coasts; every image in the dataset is annotated. Before transfer-learning pre-training, the present invention applies style transfer to the optical dataset to increase the similarity between optical and sonar images, i.e., to make the optical image dataset emphasize the shape features of targets, so that during pre-training the model can better learn and retain the ability to extract shape features and, once the parameters are transferred to the sonar image domain, exhibit stronger feature extraction. The style transfer scheme is shown in FIG. 7. The encoder in the style transfer network is a trained VGG convolutional neural network; the optical image serves as the content image and the sonar image as the style image, both fed to the encoder to extract their image features. These features then pass through the AdaIN layer, where the sonar image's style is computed onto the optical image; finally, the upsampling layers in the decoder restore the image size. The present invention uses AdaIN style transfer, i.e., adaptive instance normalization (AdaIN). The affine parameters in AdaIN need not be learned; they are computed on the fly from the input style image and adapt to different style images. AdaIN enables arbitrary style transfer, as given in formula (2):

AdaIN(x,y) = σ(y)·((x − μ(x))/σ(x)) + μ(y) (2)

where x is the content image, in the present invention the optical image, and y is the style image, in the present invention the sonar image. μ and σ are the mean and standard deviation of image pixel values. μ(y) is the mean of the sonar image, obtained by summing all its pixel values and averaging. σ(y) is the standard deviation of the sonar image, obtained by subtracting the mean from each pixel value, squaring, summing, dividing by the number of pixels, and taking the square root. The normalized input is simply scaled by σ(y) and shifted by μ(y).
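A minimal PyTorch sketch of formula (2); in the network of FIG. 7 it is applied to VGG encoder features rather than raw images, and the eps term is an implementation detail added to avoid division by zero, not part of the patent.

```python
import torch

def adain(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """x: content (optical) features, y: style (sonar) features, shape (N, C, H, W)."""
    mu_x = x.mean(dim=(2, 3), keepdim=True)
    sd_x = x.std(dim=(2, 3), keepdim=True) + eps
    mu_y = y.mean(dim=(2, 3), keepdim=True)   # mu(y): mean of the sonar features
    sd_y = y.std(dim=(2, 3), keepdim=True)    # sigma(y): std of the sonar features
    return sd_y * (x - mu_x) / sd_x + mu_y

# In the full style transfer network this sits between encoder and decoder,
# roughly: stylized = decoder(adain(vgg_encoder(optical), vgg_encoder(sonar)))
```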

Step S4: input the sonar image dataset into the YOLO_EC model for training to obtain the detection model, then input the image to be detected into the detection model to obtain the sonar image target detection result.

With the dataset produced by applying sonar-image style transfer to the optical images, the accuracy of the model after transfer learning improves over the original. This shows that style transfer plays a role in bridging the features of two datasets with a large gap. Targeted style transfer of the source-domain dataset can effectively alleviate transfer learning's dependence on large-scale datasets and further improve the model's performance on the sonar image detection task. The final sonar image target detection results are shown in FIG. 8.

In the embodiment of the present invention, the training set is a self-built dataset processed by screening, cropping, data augmentation, and filter-based denoising. On this dataset, the application of deep learning methods to sonar image target detection was studied. For model construction, considering the application requirements of actual projects, YOLOv5 was chosen as the reference target detection model and its network structure was improved. To effectively fuse features at different levels, optimize information flow, and improve the model's accuracy and robustness on targets of different sizes and complexities, the ELAN module is used to optimize the feature extraction network; to reduce the computational burden while enhancing detail recognition, the C2f module is used to modify the feature fusion network. Verification shows that these improvements effectively raise the detection performance of the sonar image target detection model.

For the overfitting problem, a common remedy is to pre-train the model via transfer learning in a domain with more data to improve feature extraction. Given the large gap between optical and sonar datasets, the present invention converts the optical dataset into a sonar-style dataset through AdaIN style transfer, bringing the features of the optical dataset closer to those of sonar images and thereby greatly increasing the benefit of the transferred parameters for sonar image target detection.

Based on the same inventive concept as the method, the present invention also provides a sonar image target detection system based on deep learning, comprising:

a sonar image preprocessing module, which acquires sonar images, screens and sorts them, classifies them, and applies median-filter denoising and data augmentation to form a sonar image dataset;

a target detection model construction module, which constructs the sonar image target detection model YOLO_EC, wherein, based on the YOLOv5 network, the YOLO_EC model introduces ELAN modules into the feature extraction network to replace some of the C3 modules for feature extraction and MP modules to replace convolutional layers for downsampling, and introduces C2f modules into the feature fusion network for feature fusion;

a model training module, which, based on the established YOLO_EC sonar image target detection model, pre-trains the model on an optical image dataset through style transfer plus transfer learning to obtain the model's base parameter set, then inputs the sonar image dataset into the model for training to obtain a trained detection model; and

a target detection module, which inputs the sonar image to be detected into the trained detection model to obtain the sonar image target detection result.

It should be understood that the deep-learning-based sonar image target detection system in the embodiment of the present invention can implement all the technical solutions of the above method embodiments, and the functions of its functional modules can be implemented according to the methods in the above method embodiments; for the specific implementation process, reference may be made to the relevant descriptions in the above embodiments, which will not be repeated here.

The present invention also provides a computer device, comprising: one or more processors; a memory; and one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, and when executed by the processors the programs implement the steps of the deep-learning-based sonar image target detection method described above.

The present invention also provides a computer-readable storage medium on which a computer program is stored; when the computer program is executed by a processor, it implements the steps of the deep-learning-based sonar image target detection method described above.

Those skilled in the art will appreciate that embodiments of the present invention may be provided as a method, an apparatus (system), a computer device, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The present invention is described with reference to flow charts of methods according to embodiments of the present invention. It should be understood that each flow in the flow charts, and combinations of flows, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flow chart.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flow chart.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be executed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flow chart.

Claims (10)

1.一种基于深度学习的声呐图像目标检测方法,其特征在于,包括以下步骤:1. A sonar image target detection method based on deep learning, characterized in that it includes the following steps: S1、获取声呐图像并进行筛选和整理,进行分类后,对图像进行中值滤波降噪和数据增强,形成声呐图像数据集;S1, obtain sonar images and screen and sort them, and after classification, perform median filtering and data enhancement on the images to form a sonar image dataset; S2、构建声呐图像目标检测模型YOLO_EC,所述YOLO_EC模型在YOLOv5网络的基础上在特征提取网络引入ELAN模块代替一部分C3模块进行特征提取、引入MP模块代替卷积层进行下采样;在特征融合网络引入C2f模块进行特征融合;S2. Construct a sonar image target detection model YOLO_EC. The YOLO_EC model introduces an ELAN module in the feature extraction network to replace a part of the C3 module for feature extraction and an MP module to replace the convolutional layer for downsampling based on the YOLOv5 network; and introduces a C2f module in the feature fusion network for feature fusion. S3、根据建立的YOLO_EC声呐图像目标检测模型,利用光学图像数据集通过风格转换加迁移学习的方式对模型进行预训练,得到模型的基础参数集,再将声呐图像数据集输入模型进行训练后得到训练好的检测模型;S3. Based on the established YOLO_EC sonar image target detection model, the model is pre-trained using the optical image dataset through style conversion and transfer learning to obtain the basic parameter set of the model, and then the sonar image dataset is input into the model for training to obtain a trained detection model; S4、将待检测声呐图像输入训练好的检测模型后得到声呐图像目标检测结果。S4. Input the sonar image to be detected into the trained detection model to obtain the sonar image target detection result. 2.根据权利要求1所述的方法,其特征在于,所述步骤S1中,对图像进行中值滤波降噪包括:针对图像中的像素,获取其邻域像素灰度值并按顺序排列为一组,并存储该组的中值以替换原有的像素值;2. The method according to claim 1, characterized in that in step S1, performing median filtering to reduce noise on the image comprises: for a pixel in the image, obtaining the grayscale value of its neighboring pixels and arranging them in order into a group, and storing the median value of the group to replace the original pixel value; 在对声呐图像数据集标注为yolo格式后,对数据集进行数据增强,包括随机裁剪,旋转/仿射变换,随机旋转、缩放、平移、透视变换以及Mosaic增强中的一种或多种。After the sonar image dataset is annotated in YOLO format, data enhancement is performed on the dataset, including one or more of random cropping, rotation/affine transformation, random rotation, scaling, translation, perspective transformation, and Mosaic enhancement. 3.根据权利要求1所述的方法,其特征在于,所述ELAN模块将上层传递过来的参数分成两条分支进行处理,第一条分支是经过一个1×1的卷积层,将通道数变为原来的二分之一;第二条分支先经过一个1×1的卷积层,将通道数变为原来的二分之一,然后再经过四个3×3的卷积层做特征提取,通道数不变,并按引入跳层链接方式,将经过两个卷积层的特征提取出来;最后将四个特征图叠加到一起,通道数为原来二倍的特征提取结果,这样得到尺寸不变,通道数变为两倍的特征图。3. The method according to claim 1 is characterized in that the ELAN module divides the parameters passed from the upper layer into two branches for processing, the first branch passes through a 1×1 convolution layer to reduce the number of channels to half of the original number; the second branch first passes through a 1×1 convolution layer to reduce the number of channels to half of the original number, and then passes through four 3×3 convolution layers for feature extraction, the number of channels remains unchanged, and by introducing a skip-layer link method, the features passed through two convolution layers are extracted; finally, the four feature maps are superimposed together, and the number of channels is twice the original number of feature extraction results, so that a feature map with the same size and twice the number of channels is obtained. 4.根据权利要求1所述的方法,其特征在于,所述MP模块第一条分支先经过一个最大池化层进行下采样,然后再经过一个卷积核为1×1的卷积层将通道数变为原来的一半;第二条分支先经过一个卷积核为1×1的卷积层,将通道数变为原来的一半,然后再经过一个个卷积核为3*3,步长为2的卷积层,做下采样;最后把第一个分支和第二分支的结果结合在一起,得到下采样的结果。4. 
The method according to claim 1 is characterized in that the first branch of the MP module first passes through a maximum pooling layer for downsampling, and then passes through a convolution layer with a convolution kernel of 1×1 to reduce the number of channels to half of the original number; the second branch first passes through a convolution layer with a convolution kernel of 1×1 to reduce the number of channels to half of the original number, and then passes through convolution layers with a convolution kernel of 3*3 and a step size of 2 for downsampling; finally, the results of the first branch and the second branch are combined to obtain the downsampling result. 5.根据权利要求1所述的方法,其特征在于,所述C2f模块的主支线由一个k=1,s=1,c=c的卷积层,分裂模块,和若干瓶颈层模块组成,k为卷积核大小,s为步幅大小,c为通道数,c=c表示该卷积层对进入的特征图的通道数c保持不变,分裂模块通过分裂操作将一个特征图的通道数平均分为两部分即c/2,一部分直接连接到Concat部分,另一部分继续输入到瓶颈层模块,每经过一个瓶颈层模块,都会额外的将输出传到Concat部分,最后融合模块对X+1个参数进行融合,得到通道数为(X+1)*(c/2)的特征图。5. The method according to claim 1 is characterized in that the main branch of the C2f module consists of a convolutional layer with k=1, s=1, c=c, a splitting module, and several bottleneck layer modules, k is the convolution kernel size, s is the stride size, c is the number of channels, c=c means that the convolution layer keeps the number of channels c of the incoming feature map unchanged, the splitting module divides the number of channels of a feature map into two parts, namely c/2, through a splitting operation, one part is directly connected to the Concat part, and the other part continues to be input into the bottleneck layer module, and each time it passes through a bottleneck layer module, the output will be additionally transmitted to the Concat part, and finally the fusion module fuses X+1 parameters to obtain a feature map with a channel number of (X+1)*(c/2). 6.根据权利要求5所述的方法,其特征在于,所述YOLO_EC模型对图像的处理过程如下:图像输入后,经过两个卷积层,一个C2f模块与一个卷积层,随后经过ELAN模块,得到第一个特征图P1,随后经过MP模块与ELAN模块得到第二个特征图P2,最后经过MP与ELAN模块后输入到SPPF模块中,得到特征图P3;特征图P3经过卷积层得到特征图P4,P4上采样后,与P2进行融合得到特征图T1,随后T1经过C2f模块与卷积层得到特征图P5,P5上采样后与P1进行融合得到特征图T2,随后T2经过C2f模块得到输出特征图D1;D1经过卷积层与P5融合得到特征图T3,T3经过C2f模块得到输出特征图D2;D2经过与卷积层后与P4进行融合,得到特征图T4,T4经过C2f模块后得到输出特征图D3;特征图D1,D2,D3经过输出推理,得到模型最终推理结果。6. The method according to claim 5 is characterized in that the YOLO_EC model processes the image as follows: after the image is input, it passes through two convolutional layers, a C2f module and a convolutional layer, and then passes through the ELAN module to obtain the first feature map P1, and then passes through the MP module and the ELAN module to obtain the second feature map P2, and finally after passing through the MP and ELAN modules, it is input into the SPPF module to obtain the feature map P3; the feature map P3 passes through the convolutional layer to obtain the feature map P4, and after P4 is upsampled, it is fused with P2 to obtain To feature map T1, then T1 passes through the C2f module and the convolution layer to obtain feature map P5, P5 is upsampled and fused with P1 to obtain feature map T2, then T2 passes through the C2f module to obtain output feature map D1; D1 passes through the convolution layer and is fused with P5 to obtain feature map T3, T3 passes through the C2f module to obtain output feature map D2; D2 passes through the convolution layer and is fused with P4 to obtain feature map T4, T4 passes through the C2f module to obtain output feature map D3; feature maps D1, D2, and D3 are output inference to obtain the final inference result of the model. 7.根据权利要求1所述的方法,其特征在于,所述步骤S3中,采用自适应实例正则化AdaIN进行风格转换,表示如下:7. 
7. The method according to claim 1, characterized in that, in step S3, adaptive instance normalization (AdaIN) is used for the style transfer, expressed as follows:

AdaIN(x, y) = σ(y) · ((x − μ(x)) / σ(x)) + μ(y)

where x is the optical image, y is the sonar image, μ and σ denote the mean and standard deviation of an image's pixel values, and μ(y) and σ(y) are the mean and standard deviation of the sonar image (see the AdaIN sketch after claim 10).

8. A deep-learning-based sonar image target detection system, characterized in that it comprises:

a sonar image preprocessing module, which acquires sonar images, screens and sorts them, and, after classification, applies median-filter denoising and data augmentation to the images to form a sonar image dataset;

a target detection model construction module, which constructs the sonar image target detection model YOLO_EC, wherein, on the basis of the YOLOv5 network, the YOLO_EC model introduces ELAN modules into the feature extraction network to replace part of the C3 modules for feature extraction and MP modules to replace convolutional layers for downsampling, and introduces C2f modules into the feature fusion network for feature fusion;

a model training module, which, on the basis of the established YOLO_EC sonar image target detection model, pre-trains the model on an optical image dataset by means of style transfer plus transfer learning to obtain the model's base parameter set, and then inputs the sonar image dataset into the model for training to obtain the trained detection model;

a target detection module, which inputs the sonar image to be detected into the trained detection model to obtain the sonar image target detection result.

9. A computer device, characterized in that it comprises:

one or more processors;

a memory; and

one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the programs, when executed by the processors, implementing the steps of the deep-learning-based sonar image target detection method according to any one of claims 1-7.

10. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the deep-learning-based sonar image target detection method according to any one of claims 1-7.
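For illustration only, a minimal PyTorch sketch of the AdaIN transform in claim 7, applied per channel to batched images; the function name and the eps stabilizer are assumptions:

```python
import torch

def adain(x: torch.Tensor, y: torch.Tensor, eps: float = 1e-5) -> torch.Tensor:
    """AdaIN(x, y) = sigma(y) * (x - mu(x)) / sigma(x) + mu(y).

    x: optical image batch, y: sonar image batch, both (N, C, H, W).
    Re-normalizes x with the per-channel statistics of y so that the
    optical images take on the sonar style before pre-training.
    """
    mu_x, sd_x = x.mean(dim=(2, 3), keepdim=True), x.std(dim=(2, 3), keepdim=True)
    mu_y, sd_y = y.mean(dim=(2, 3), keepdim=True), y.std(dim=(2, 3), keepdim=True)
    return sd_y * (x - mu_x) / (sd_x + eps) + mu_y
```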
CN202410644682.3A 2024-05-23 2024-05-23 Sonar image target detection method and system based on deep learning Pending CN118505974A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410644682.3A CN118505974A (en) 2024-05-23 2024-05-23 Sonar image target detection method and system based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410644682.3A CN118505974A (en) 2024-05-23 2024-05-23 Sonar image target detection method and system based on deep learning

Publications (1)

Publication Number Publication Date
CN118505974A true CN118505974A (en) 2024-08-16

Family

ID=92236236

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410644682.3A Pending CN118505974A (en) 2024-05-23 2024-05-23 Sonar image target detection method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN118505974A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119128502A (en) * 2024-08-30 2024-12-13 中国科学院声学研究所 Active sonar anti-reverberation method and system based on neural network


Similar Documents

Publication Publication Date Title
Han et al. Underwater image processing and object detection based on deep CNN method
US11402494B2 (en) Method and apparatus for end-to-end SAR image recognition, and storage medium
CN109949255B (en) Image reconstruction method and device
CN114565860B (en) A Multi-dimensional Reinforcement Learning Synthetic Aperture Radar Image Target Detection Method
CN114202696A (en) SAR target detection method and device based on context vision and storage medium
CN113298818A (en) Remote sensing image building segmentation method based on attention mechanism and multi-scale features
CN113052200B (en) Sonar image target detection method based on yolov3 network
CN106778821A (en) Classification of Polarimetric SAR Image method based on SLIC and improved CNN
Liu et al. Survey of road extraction methods in remote sensing images based on deep learning
CN112102197A (en) Underwater target detection system and method for assisting diver
CN116468995A (en) Sonar image classification method combining SLIC super-pixel and graph annotation meaning network
CN117079117A (en) Underwater image processing and target identification method and device, storage medium and electronic equipment
He et al. SonarNet: Hybrid CNN-transformer-HOG framework and multifeature fusion mechanism for forward-looking sonar image segmentation
Stephens et al. Using three dimensional convolutional neural networks for denoising echosounder point cloud data
CN118505974A (en) Sonar image target detection method and system based on deep learning
Tang et al. SDRNet: An end-to-end shadow detection and removal network
Singh et al. Deep learning-based semantic segmentation of three-dimensional point cloud: a comprehensive review
Tang et al. Side-scan sonar underwater target segmentation using the BHP-UNet
Zhang et al. Enhanced visual perception for underwater images based on multistage generative adversarial network
CN115965788B (en) Point Cloud Semantic Segmentation Method Based on Multi-view Graph Structural Feature Attention Convolution
US20240185423A1 (en) System and method of feature detection for airborne bathymetry
Li et al. A new algorithm of vehicle license plate location based on convolutional neural network
CN115471676A (en) Multi-mode offshore target detection method based on multi-scale capsules and Bi-FPN
CN115861818A (en) Small water body extraction method based on attention mechanism combined convolution neural network
Jiao et al. Research on convolutional neural network model for sonar image segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination