Disclosure of Invention
In order to overcome the defects of the prior art, the invention provides a 3D medical image segmentation method, system, electronic device and computer-readable storage medium based on paired attention. A 3D medical image segmentation model is designed around a Paired Attention Transformer (PAT) module, which reduces the spatial dimension and effectively learns channel and spatial information in the 3D feature map, thereby improving segmentation performance while reducing the model parameter count and increasing the model computation speed.
In a first aspect, the present invention provides a paired attention based 3D medical image segmentation method;
A paired attention based 3D medical image segmentation method comprising:
Acquiring a 3D medical image to be segmented;
inputting the 3D medical image to be segmented into a trained 3D medical image segmentation model for processing so as to obtain a segmentation result;
the 3D medical image segmentation model comprises an encoder and a decoder, wherein the encoder is connected with the decoder; the encoder comprises a first encoding module and a plurality of second encoding modules which are sequentially connected, the first encoding module comprising a patch embedding layer and a paired attention transformer module, and each second encoding module comprising a paired attention transformer module and a downsampling layer; the decoder comprises a plurality of decoding modules which are sequentially connected, each decoding module comprising a skip connection module, a paired attention transformer module and an upsampling module.
Further, the paired attention transformer module is composed of a normalization layer, a multi-layer perceptron and a multi-head paired attention module, wherein data input into the paired attention transformer module sequentially passes through the normalization layer, the multi-head paired attention module and the multi-layer perceptron.
Preferably, the multi-head paired attention module is used for capturing the channel dependencies of the input data through channel attention to obtain a channel attention output feature map, capturing the spatial dependencies of the input data through spatial attention to obtain a spatial attention output feature map, merging the channel attention output feature map and the spatial attention output feature map with the original 3D voxel features of the input data, and applying a 3D convolution to obtain a deep feature representation of the input data.
Preferably, the channel attention formula in the multi-head paired attention module is expressed as:

X_C = V_channel · Softmax( (Q_channel)^T · K_channel / √d )

wherein X_C represents the output obtained through channel attention, Q_channel is the channel query vector, K_channel is the channel key vector, V_channel is the channel value, and d is the size of each vector;
The spatial attention formula in the multi-head paired attention module is expressed as:

X_S = Softmax( Q_spatial · (K_spatial_proj)^T / √d ) · V_spatial_proj

wherein X_S is the output obtained through spatial attention, Q_spatial is the spatial query vector, K_spatial_proj is the projection of the spatial key vector, V_spatial_proj is the projection of the spatial value, and d is the size of each vector.
Further, the first encoding module is used for carrying out embedding processing and segmentation on the 3D medical image to be segmented, obtaining a 3D voxel feature map and adding position codes. The second encoding modules are used for carrying out paired attention transformation and downsampling operations on the 3D voxel feature map so as to reduce its dimensions stage by stage. The decoding modules are used for upsampling the dimension-reduced 3D voxel feature map, splicing it with 3D voxel feature maps of different dimensions, carrying out paired attention transformation so as to increase the dimensions of the spliced 3D voxel feature map stage by stage, and outputting the predicted final segmentation result through a convolution operation.
Further, the inputting the 3D medical image to be segmented into the trained 3D medical image segmentation model for processing includes:
Carrying out embedding processing and segmentation on the 3D medical image to be segmented, obtaining a 3D voxel feature map and adding a position code;
performing paired attention transformation and downsampling operations on the 3D voxel feature map so as to reduce its dimensions stage by stage;
and upsampling the dimension-reduced 3D voxel feature map, splicing it with 3D voxel feature maps of different dimensions, carrying out paired attention transformation to increase the dimensions of the spliced 3D voxel feature map stage by stage, and outputting the predicted final segmentation result through a convolution operation.
Further, the training mode for the 3D medical image segmentation model includes:
acquiring training data;
setting an AdamW optimizer and adaptively adjusting the learning rate;
And training the 3D medical image segmentation model according to training data, the learning rate and a preset loss function.
In a second aspect, the present invention provides a paired attention based 3D medical image segmentation system;
A paired attention based 3D medical image segmentation system comprising:
the acquisition module is used for acquiring the 3D medical image to be segmented;
The 3D medical image segmentation module is used for inputting the 3D medical image to be segmented into a trained 3D medical image segmentation model for processing so as to obtain a segmentation result;
the 3D medical image segmentation model comprises an encoder and a decoder, wherein the encoder is connected with the decoder; the encoder comprises a first encoding module and a plurality of second encoding modules which are sequentially connected, the first encoding module comprising a patch embedding layer and a paired attention transformer module, and each second encoding module comprising a paired attention transformer module and a downsampling layer; the decoder comprises a plurality of decoding modules which are sequentially connected, each decoding module comprising a skip connection module, a paired attention transformer module and an upsampling module.
In a third aspect, the present invention provides an electronic device;
An electronic device comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the paired attention-based 3D medical image segmentation method described above.
In a fourth aspect, the present invention provides a computer-readable storage medium;
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the paired attention based 3D medical image segmentation method described above.
Compared with the prior art, the invention has the beneficial effects that:
According to the technical scheme provided by the invention, the 3D medical image segmentation model PAT-Unet is designed based on the Paired Attention Transformer module, which effectively combines the dependency information among channels with the rich information in the spatial dimension, improving the segmentation effect while reducing the parameter count of the model and accelerating model inference.
Compared with existing methods, the technical scheme provided by the invention can capture fine texture details in the image; while producing more accurate segmentation maps, the model's parameter count and number of operations are both reduced by more than 67%.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit exemplary embodiments according to the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well. Furthermore, it is to be understood that the terms "comprises" and "comprising" and any variations thereof are intended to cover non-exclusive inclusions; for example, processes, methods, systems, products or devices that comprise a series of steps or units are not necessarily limited to those steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such processes, methods, products or devices.
Embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
Prior-art 3D medical image segmentation methods improve segmentation precision, but do so at the cost of increased model parameter and computation counts, resulting in poor model robustness and high demand for computing resources, which affects both the efficiency and the precision of 3D medical image segmentation. Accordingly, the present invention provides a 3D medical image segmentation method based on paired attention.
Next, a detailed description will be given of the paired attention-based 3D medical image segmentation method disclosed in this embodiment with reference to fig. 1 to 5. The 3D medical image segmentation method based on the paired attention comprises the following steps of:
S1, acquiring a 3D medical image to be segmented.
S2, inputting the 3D medical image to be segmented into a trained 3D medical image segmentation model for processing so as to obtain a segmentation result. The 3D medical image segmentation model comprises an encoder and a decoder, the encoder being connected with the decoder. The encoder comprises a first encoding module and 3 second encoding modules which are sequentially connected: the first encoding module comprises a patch embedding layer and a Paired Attention Transformer (PAT) module, and each second encoding module comprises a PAT module and a downsampling layer. The decoder comprises 4 decoding modules which are sequentially connected, each decoding module comprising a skip connection module, a PAT module and an upsampling module.
The specific flow of inputting the 3D medical image to be segmented into the trained 3D medical image segmentation model for processing is as follows:
inputting the 3D medical image to be segmented into the encoder, and processing it through the encoder and the decoder, comprising the following steps:
In this embodiment, the 3D medical image to be segmented is a first 3D voxel feature map with a size of 128×128×64×1, i.e. Height×Width×Depth×Channel format, where 128×128 represents the height and width of the data volume, 64 is the depth of the data volume, and 1 is the channel number of the feature volume.
In the first stage of the encoder, i.e. the first encoding module, the first 3D voxel feature map is first subjected to patch embedding by the patch embedding layer: the three-dimensional data volume is split into a number of small data blocks with a low-dimensional representation, and position encoding is added to these data blocks. The encoded first 3D voxel feature map is then input into the Paired Attention Transformer module for processing, which focuses on the feature regions of the medical image, yielding a second 3D voxel feature map with a size of 32×32×16×32.
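The patch-embedding step above can be sketched in a few lines of numpy. This is a minimal illustration, not the model's actual code: the 4×4×4 patch size is inferred from the stated shapes (128×128×64×1 in, 32×32×16 tokens out), and the projection matrix and position codes are random stand-ins for the learned parameters.

```python
import numpy as np

# Assumed patch size P=4 (128/4=32, 64/4=16) and embedding dimension 32.
H, W, D, C_in, C_embed, P = 128, 128, 64, 1, 32, 4

x = np.random.rand(H, W, D, C_in)

# Split the volume into non-overlapping PxPxP patches and flatten each patch.
patches = x.reshape(H // P, P, W // P, P, D // P, P, C_in)
patches = patches.transpose(0, 2, 4, 1, 3, 5, 6).reshape(-1, P * P * P * C_in)

# Linear projection to the embedding dimension, plus a position code
# (both random here, standing in for learned weights).
W_embed = np.random.rand(P * P * P * C_in, C_embed)
pos = np.random.rand(patches.shape[0], C_embed)
tokens = patches @ W_embed + pos

print(tokens.shape)  # (16384, 32): 32*32*16 tokens, each of embedding dim 32
```

Each of the 16384 tokens corresponds to one 4×4×4 block of the input volume, matching the 32×32×16 spatial grid of the second 3D voxel feature map.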
In the second stage of the encoder, i.e. the first second encoding module, the second 3D voxel feature map first passes through a downsampling layer formed by a 3D convolution with a stride of 2 and a kernel size of 3×3×3 followed by a normalization operation, and is then processed by a Paired Attention Transformer module to obtain a third 3D voxel feature map with a size of 16×16×8×64.
In the third stage of the encoder, i.e. the second second encoding module, the third 3D voxel feature map first passes through a downsampling layer formed by a 3D convolution with a stride of 2 and a kernel size of 3×3×3 followed by a normalization operation, and is then processed by a Paired Attention Transformer module to obtain a fourth 3D voxel feature map with a size of 8×8×4×128.
In the fourth stage of the encoder, i.e. the third second encoding module, the fourth 3D voxel feature map first passes through a downsampling layer formed by a 3D convolution with a stride of 2 and a kernel size of 3×3×3 followed by a normalization operation, and is then processed by a Paired Attention Transformer module to obtain a fifth 3D voxel feature map with a size of 4×4×2×256.
The processing of the 3D voxel feature map by the Paired Attention Transformer module is the same as the processing of the Paired Attention Transformer module in the decoder section described below, and is described in detail below, and is not repeated here.
The first through fifth 3D voxel feature maps obtained by the encoder are input into the decoder for processing. In the first stage of the decoder (the first decoding module), the fifth 3D voxel feature map is upsampled through an upsampling layer to a 3D voxel feature map with a size of 8×8×4×128, which is then spliced with the fourth 3D voxel feature map through the skip connection module and processed by a Paired Attention Transformer module to obtain a sixth 3D voxel feature map with a size of 8×8×4×128.
In the second stage of the decoder (the second decoding module), the sixth 3D voxel feature map is upsampled through an upsampling layer to a size of 16×16×8×64, spliced with the third 3D voxel feature map through the skip connection module, and processed by a Paired Attention Transformer module to obtain a seventh 3D voxel feature map with a size of 16×16×8×64.
In the third stage of the decoder (the third decoding module), the seventh 3D voxel feature map is first upsampled through an upsampling layer to a size of 32×32×16×32, then spliced with the second 3D voxel feature map through the skip connection module, and the spliced map is processed by a Paired Attention Transformer module to obtain an eighth 3D voxel feature map with a size of 32×32×16×32.
In the fourth stage of the decoder (the fourth decoding module), the eighth 3D voxel feature map is first upsampled through an upsampling layer to a size of 128×128×64, and is spliced through the skip connection module with the result of processing the first 3D voxel feature map by a 3D convolution with a kernel size of 3×3×1. The spliced result is then processed by a 3D convolution with a kernel size of 3×3×3 to obtain the final prediction output of the model, i.e. the final segmentation result of the medical feature region: a ninth 3D voxel feature map.
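The stage-by-stage shapes described above can be verified with a small arithmetic sketch. The channel widths and the halving/doubling pattern are taken from the sizes stated in the text; this is a bookkeeping check, not the model implementation.

```python
# Track (H, W, D, C) shapes through the encoder and decoder stages.
def down(shape, c_out):
    """Stride-2 downsampling: halve the spatial dims, set the channel count."""
    h, w, d, _ = shape
    return (h // 2, w // 2, d // 2, c_out)

def up(shape, c_out):
    """Upsampling: double the spatial dims, set the channel count."""
    h, w, d, _ = shape
    return (h * 2, w * 2, d * 2, c_out)

s1 = (32, 32, 16, 32)   # second feature map: after patch embedding + PAT
s2 = down(s1, 64)       # third feature map (encoder stage 2)
s3 = down(s2, 128)      # fourth feature map (encoder stage 3)
s4 = down(s3, 256)      # fifth feature map (encoder stage 4, bottleneck)

d1 = up(s4, 128)        # sixth feature map, skip-connected to s3's level
d2 = up(d1, 64)         # seventh feature map
d3 = up(d2, 32)         # eighth feature map
print(s2, s3, s4, d1, d2, d3)
```

Each decoder output matches the encoder feature map it is spliced with, which is exactly what the skip connections require.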
The Paired Attention Transformer module used in the above operations is shown in fig. 3 and mainly consists of a normalization layer (Layer Norm), a multi-layer perceptron (MLP), and a multi-head paired attention (MPA) module. The multi-head paired attention module, shown in fig. 4, consists of two parts, channel attention and spatial attention, which capture channel dependencies and spatial dependencies respectively.
The channel attention operation in the multi-head paired attention first transposes the vector Q_channel and performs a scaled dot-product operation with the vector K_channel, then uses Softmax to measure the similarity between each channel feature and the remaining channel features, yielding the channel attention map. A dot-product operation between the channel attention map and the vector V_channel captures the dependencies among the channels in the feature map, giving the output of the channel attention. The channel attention formula in the multi-head paired attention is given in equation (1):
X_C = V_channel · Softmax( (Q_channel)^T · K_channel / √d )    (1)

where X_C represents the output obtained through channel attention, Q_channel, K_channel and V_channel represent the channel query vector, channel key vector and channel value respectively, and d is the size of each vector.
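A minimal numpy sketch of the channel attention operation may help make the shapes concrete: the attention map is C×C (from Q^T K), so its cost scales with the channel count rather than the voxel count. The dimensions below are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

HWD, C, d = 64, 8, 8  # flattened voxel count, channels, vector size (assumed)

Q = np.random.rand(HWD, C)  # channel query
K = np.random.rand(HWD, C)  # channel key
V = np.random.rand(HWD, C)  # channel value

attn = softmax(Q.T @ K / np.sqrt(d), axis=-1)  # C x C channel-attention map
X_c = V @ attn                                 # HWD x C channel-attention output
print(attn.shape, X_c.shape)                   # (8, 8) (64, 8)
```

Note how the softmax normalizes similarity across channels, so each output channel is a weighted mixture of the value channels.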
The spatial attention operation in the multi-head paired attention projects V_spatial and K_spatial, each of dimension HWD×C, onto a spatial dimension P×C to obtain V_spatial_proj and K_spatial_proj respectively. A scaled dot-product operation is performed between Q_spatial and the transpose of K_spatial_proj (of dimension P×C), and the result is processed with Softmax to obtain a spatial attention map of dimension HWD×P. Finally, a dot-product operation between this spatial attention map and the projected V_spatial_proj generates a spatial attention output feature map of dimension HWD×C. The spatial attention formula in the multi-head paired attention is given in equation (2):
X_S = Softmax( Q_spatial · (K_spatial_proj)^T / √d ) · V_spatial_proj    (2)

in equation (2), X_S is the output obtained through spatial attention, Q_spatial, K_spatial_proj and V_spatial_proj represent the spatial query vector, the projection of the spatial key vector and the projection of the spatial value respectively, and d is the size of each vector.
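The spatial-attention projection can be sketched the same way: because K and V are first projected from HWD×C down to P×C, the attention map is HWD×P instead of HWD×HWD, which is where the computational saving comes from. The projection matrix below is a random stand-in for the learned projection, and all dimensions are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

HWD, C, P, d = 64, 8, 16, 8  # voxel count, channels, projected tokens, vector size

Q = np.random.rand(HWD, C)   # spatial query
K = np.random.rand(HWD, C)   # spatial key (before projection)
V = np.random.rand(HWD, C)   # spatial value (before projection)

proj = np.random.rand(P, HWD)   # stand-in for the learned spatial projection
K_proj = proj @ K               # P x C
V_proj = proj @ V               # P x C

attn = softmax(Q @ K_proj.T / np.sqrt(d), axis=-1)  # HWD x P spatial-attention map
X_s = attn @ V_proj                                  # HWD x C spatial output
print(attn.shape, X_s.shape)                         # (64, 16) (64, 8)
```

With P fixed (16 here), the attention cost grows linearly in the number of voxels rather than quadratically.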
The outputs of the channel and spatial attention are then merged with the original 3D voxel features, and a 3D convolution operation is applied to the merged result to extract a deeper feature representation. The final output of the multi-head paired attention module is given in equation (3):
X = Conv1( Conv3( X_C + X_S ) )    (3)

where X_C and X_S represent the output feature maps of the channel and spatial attention respectively, Conv1 represents a 3D convolution with a kernel size of 1×1×1, and Conv3 represents a 3D convolution with a kernel size of 3×3×3.
Further, training the 3D medical image segmentation model includes:
step 1, training data are acquired.
Two disclosed 3D medical image segmentation datasets, synapse and ACDC, were selected as training data.
The Synapse dataset consists of abdominal CT scans of 30 patients. Following the dataset split of the TransUnet model, 18 scans are used as the training set and the remaining 12 as the test set. In the experimental results section, the Dice similarity coefficient (DSC) and 95% Hausdorff distance (HD95) are reported for 8 abdominal organs: spleen, left kidney, pancreas, stomach, aorta, liver, gall bladder and right kidney. The Automated Cardiac Diagnosis Challenge (ACDC) dataset is split into 70 training samples, 10 validation samples and 20 test samples.
And 2, preprocessing and enhancing data.
First, the Synapse and ACDC datasets are acquired and the input three-dimensional data volumes are cropped to a size of 128×128×64.
Second, random rotation and random flipping operations, each with 50% probability, are applied to the cropped training images and the corresponding ground-truth segmentation images. These preprocessing and augmentation operations effectively compensate for the small number of training images in the original datasets, strengthening the model's resistance to overfitting and improving its robustness.
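The augmentation described above can be sketched as follows. The text specifies random rotation and flipping at 50% probability but not their exact parameters, so the flip axis and in-plane 90-degree rotation here are assumptions; the key point is that the same transform is applied to the image and its segmentation mask.

```python
import numpy as np

def augment(image, mask, rng):
    """Apply the same random flip/rotation (each with 50% probability)
    to a 3D image and its ground-truth segmentation mask."""
    if rng.random() < 0.5:                    # random flip along a spatial axis
        axis = int(rng.integers(0, 3))
        image = np.flip(image, axis=axis)
        mask = np.flip(mask, axis=axis)
    if rng.random() < 0.5:                    # random in-plane 90-degree rotation
        k = int(rng.integers(1, 4))
        image = np.rot90(image, k=k, axes=(0, 1))
        mask = np.rot90(mask, k=k, axes=(0, 1))
    return image, mask

rng = np.random.default_rng(0)
img = np.random.rand(128, 128, 64)            # cropped volume, as in the text
seg = (img > 0.5).astype(np.int64)            # toy binary ground-truth mask
img_aug, seg_aug = augment(img, seg, rng)
print(img_aug.shape, seg_aug.shape)           # (128, 128, 64) (128, 128, 64)
```

Because the height and width are equal (128×128), the in-plane rotation preserves the crop size.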
And step 3, inputting the data after data enhancement into a 3D medical image segmentation model for processing to obtain a final model prediction graph.
The loss is calculated using a combination of loss functions, where the loss function measures the error between the model's prediction and the ground-truth segmentation image.
This embodiment uses the sum of the cross-entropy loss and the soft Dice loss to calculate the loss between the 3D voxel result predicted by the model and the ground truth, thereby combining the advantages of both loss functions. The loss function is given in equation (4):
L(Z, P) = 1 − (2/N) Σ_{j=1}^{N} ( Σ_{v=1}^{V} Z_{v,j} P_{v,j} ) / ( Σ_{v=1}^{V} Z_{v,j}² + Σ_{v=1}^{V} P_{v,j}² ) − (1/V) Σ_{v=1}^{V} Σ_{j=1}^{N} Z_{v,j} log P_{v,j}    (4)

where V is the total number of voxels in the 3D voxel feature map, N is the number of predicted classes, Z_{v,j} is the true value of the j-th class at voxel v, and P_{v,j} is the predicted probability output by the model for the j-th class at voxel v.
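The combined loss can be sketched in numpy. This follows the common soft-Dice-plus-cross-entropy formulation from one-hot targets Z and softmax probabilities P; the epsilon term and shapes are illustrative assumptions, not the document's exact implementation.

```python
import numpy as np

def combined_loss(Z, P, eps=1e-8):
    """Soft Dice loss plus voxel-wise cross entropy.
    Z: one-hot targets of shape (V, N); P: predicted probabilities (V, N)."""
    dice_per_class = (2.0 * (Z * P).sum(axis=0)) / (
        (Z ** 2).sum(axis=0) + (P ** 2).sum(axis=0) + eps
    )
    dice_loss = 1.0 - dice_per_class.mean()          # averaged over N classes
    ce_loss = -(Z * np.log(P + eps)).sum(axis=1).mean()  # averaged over V voxels
    return dice_loss + ce_loss

rng = np.random.default_rng(0)
V, N = 100, 3
logits = rng.normal(size=(V, N))
P = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # softmax probs
Z = np.eye(N)[rng.integers(0, N, size=V)]                        # one-hot targets
print(round(float(combined_loss(Z, P)), 4))
```

A perfect prediction (P equal to Z) drives both terms to approximately zero, which is the sanity check one would run on any loss of this form.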
An AdamW optimizer is set with an initial learning rate of 0.001 and a weight decay of 3e-5; the weight decay coefficient prevents the model from overfitting, and adaptive learning-rate adjustment speeds up model convergence.
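The optimizer setup above can be illustrated with a learning-rate schedule sketch. The text does not name the adaptive schedule, so polynomial decay (a common choice for this kind of segmentation training) is assumed here; only the initial learning rate of 0.001 and the 1000-epoch budget come from the text.

```python
def poly_lr(epoch, total_epochs=1000, base_lr=1e-3, power=0.9):
    """Polynomial learning-rate decay (assumed schedule, not from the source):
    starts at base_lr and decays smoothly toward 0 over training."""
    return base_lr * (1 - epoch / total_epochs) ** power

print(poly_lr(0))    # 0.001 at the start of training
print(poly_lr(500))  # decayed below the base rate by the halfway point
```

The weight decay of 3e-5 would be passed to the optimizer separately (e.g. as AdamW's weight-decay parameter); it is independent of the learning-rate schedule.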
The loss between the model's segmentation prediction map and the ground truth is calculated according to the combined loss function, and the AdamW optimizer is used for gradient updates and adaptive learning-rate adjustment, with 8 samples per batch and 1000 training epochs in total. For the Synapse dataset, the average Dice score and the 95% Hausdorff distance are reported as evaluation metrics; the ACDC dataset uses only the average Dice score.
The comparison models used in the experiments are the state-of-the-art medical image segmentation models UNETR and nnFormer; experimental comparison data for these models are shown in table 1, and a visual comparison with other models on the ACDC dataset is shown in fig. 5.
Table 1 experimental comparison of the method with other models
Example two
The embodiment discloses a 3D medical image segmentation system based on paired attention, comprising:
the acquisition module is used for acquiring the 3D medical image to be segmented;
The 3D medical image segmentation module is used for inputting the 3D medical image to be segmented into a trained 3D medical image segmentation model for processing so as to obtain a segmentation result;
the 3D medical image segmentation model comprises an encoder and a decoder, wherein the encoder is connected with the decoder; the encoder comprises a first encoding module and a plurality of second encoding modules which are sequentially connected, the first encoding module comprising a patch embedding layer and a paired attention transformer module, and each second encoding module comprising a paired attention transformer module and a downsampling layer; the decoder comprises a plurality of decoding modules which are sequentially connected, each decoding module comprising a skip connection module, a paired attention transformer module and an upsampling module.
It should be noted that the acquisition module and the 3D medical image segmentation module correspond to the steps in the first embodiment; the modules share the same examples and application scenarios as their corresponding steps, but are not limited to the disclosure of the first embodiment. It should also be noted that the modules described above may be implemented as part of a system in a computer system, for example as a set of computer-executable instructions.
Example III
An electronic device according to a third embodiment of the present invention includes a memory, a processor, and computer instructions stored in the memory and executable on the processor; when executed by the processor, the computer instructions complete the steps of the paired attention-based 3D medical image segmentation method described above.
Example IV
A fourth embodiment of the present invention provides a computer readable storage medium storing computer instructions that, when executed by a processor, perform the steps of the paired attention-based 3D medical image segmentation method described above.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The embodiments in this specification are described in a progressive manner; for details of one embodiment, reference may be made to the related description of another embodiment.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.