
CN115393670B - Method for training lung endoscope image recognition model and recognition method - Google Patents

Method for training lung endoscope image recognition model and recognition method

Info

Publication number
CN115393670B
Authority
CN
China
Prior art keywords
model
image
training
information
seqyolo
Prior art date
Legal status
Active
Application number
CN202211003172.5A
Other languages
Chinese (zh)
Other versions
CN115393670A (en)
Inventor
方传煜
严建祺
刘淳奇
Current Assignee
Quanbao Network Technology Co ltd
Original Assignee
Quanbao Network Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Quanbao Network Technology Co ltd
Priority to CN202211003172.5A
Publication of CN115393670A
Application granted
Publication of CN115393670B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/774: Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/08: Learning methods
    • G06N 3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/03: Recognition of patterns in medical or anatomical images
    • G06V 2201/031: Recognition of patterns in medical or anatomical images of internal organs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00: Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07: Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for training a lung endoscope image recognition model and a recognition method. The method comprises constructing a data set and a neural network model, the neural network model being SeqYOLO, built by combining a YOLOv5 model with an LSTM (Long Short-Term Memory) network so as to recognize endoscope video, and the constructed recognition model then performs lung endoscope image recognition. In other words, a model dedicated to lung endoscope recognition is trained by an artificial-intelligence deep learning algorithm; when a doctor uses an endoscope equipped with this technology, the position of the endoscope can be read intuitively from the terminal screen, which reduces the time spent on manual judgment, improves operating efficiency, shortens the operation, and reduces the patient's pain.

Description

Method for training lung endoscope image recognition model and recognition method
Technical Field
The invention relates to the technical field of neural networks, in particular to a method for training a lung endoscope image recognition model and a recognition method.
Background
In pulmonary endoscopy, a slender endoscope is introduced into the patient's lower respiratory tract through the mouth or nose: it passes through the glottis into the trachea and bronchi and on toward their more distal ends, so that lesions of the trachea and bronchi can be observed directly and the corresponding examination and treatment can be performed.
The structure of the pulmonary trachea and bronchi resembles a large inverted tree with many branches, so even a specially trained surgeon can easily become lost during an examination, which lengthens the operation and prolongs the patient's discomfort; as a result, only doctors with abundant experience can be responsible for the operation.
At present, the position is judged and the advancing airway is found by manually observing the endoscope image, so a doctor with insufficient experience cannot accurately judge the current position of the endoscope.
Disclosure of Invention
The invention aims to provide a method for training a lung endoscope image recognition model, and a recognition method, that automatically recognize and mark the airway position at which the endoscope probe is currently located from the endoscope image.
In order to achieve the above purpose, the technical scheme of the invention is as follows:
The method comprises constructing a data set and a neural network model, training the neural network model on 80% of the sample data of the constructed data set, verifying the trained model on the remaining 20% of the sample data of the data set and then deploying it, and using mAP0.5 as the evaluation index for measuring accuracy on the endoscopic image target detection task.
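As a minimal sketch of this 80/20 division (the directory layout, file extension and fixed random seed are assumptions made for illustration, not details given by the invention):

    import random
    from pathlib import Path

    def split_dataset(image_dir: str, train_ratio: float = 0.8, seed: int = 0):
        """Shuffle all annotated bronchoscope images and split them 80/20."""
        images = sorted(Path(image_dir).glob("*.jpg"))
        rng = random.Random(seed)          # fixed seed keeps the split reproducible
        rng.shuffle(images)
        cut = int(len(images) * train_ratio)
        return images[:cut], images[cut:]  # training set, validation set

    train_imgs, val_imgs = split_dataset("dataset/images")  # hypothetical path
    print(f"train: {len(train_imgs)}, val: {len(val_imgs)}")

The validation images are held out untouched during training and used only for the mAP0.5 evaluation.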
The data set is constructed by intercepting a number of clear bronchoscope images from several bronchoscope videos, dividing them into categories, annotating them by drawing one or more bounding boxes, and expanding them by data enhancement;
The data enhancement of the data set comprises three different enhancement intensities: high, medium and low;
the neural network model is SeqYOLO, constructed by combining a YOLOv5 model with an LSTM;
SeqYOLO takes YOLOv5 as its baseline: the YOLOv5 part provides the target detection capability, while an LSTM module is introduced to learn the temporally ordered information in the video;
SeqYOLO obtains an image feature value for each frame of a video sequence through the YOLO Backbone, feeds these feature values to an LSTM model so that the temporally ordered information in the video is learned, and, once the LSTM has read the image feature values of the whole sequence, outputs the image feature value corresponding to the image to be inferred; the Head component of the YOLO series algorithm then uses this feature value to infer the specific detection boxes in the frame and the object classes.
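The pipeline just described, one shared backbone per frame, an LSTM over the frame features, and a detection head on the final timestep, can be summarized in the following PyTorch sketch. The global pooling of backbone features into a single vector is a simplifying assumption of the sketch; the real YOLOv5 Backbone and Head work on multi-scale feature maps.

    import torch
    import torch.nn as nn

    class SeqYOLO(nn.Module):
        """Sketch: per-frame YOLO-style backbone -> LSTM over time -> detection head."""
        def __init__(self, backbone: nn.Module, head: nn.Module, feat_dim: int = 512):
            super().__init__()
            self.backbone = backbone                      # per-frame feature extractor
            self.lstm = nn.LSTM(feat_dim, feat_dim, batch_first=True)
            self.head = head                              # box/class predictor

        def forward(self, clip: torch.Tensor):
            # clip: (batch, T, C, H, W), e.g. T = 20 frames
            b, t = clip.shape[:2]
            feats = self.backbone(clip.flatten(0, 1))     # (b*t, feat_dim, h, w) assumed
            feats = feats.mean(dim=(2, 3)).view(b, t, -1) # pool to (b, t, feat_dim)
            seq_out, _ = self.lstm(feats)                 # learn temporal context
            return self.head(seq_out[:, -1])              # predict only the last frame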
From the above it can be seen that a number of clear bronchoscope images are intercepted from several bronchoscope videos, and each image corresponds to one frame of its video; the frames before and after it on the video time axis contain information useful for predicting the current frame, so a video segment is used to train the model: the middle-most frame of the segment carries the annotation information, and the other frames of the segment are the context that assists prediction for that frame. Common data enhancement expands the data set with image transformations such as left-right flipping, up-down flipping, zooming, translation and rotation, which addresses the fact that the endoscope is not always positioned in the middle of the airway and that captured images differ in angle and orientation. The intensity of the data enhancement is divided into three levels; the highest level has more enhancement items and makes heavier changes to the original image. The three levels of enhancement were compared to optimize for this data set, and the lowest-level enhancement achieved the best mAP0.5 detection effect. In this way the completeness of the data set is ensured and the target detection neural network adapts to different human-body environments; mAP0.5 is used as the evaluation index for measuring accuracy on the endoscopic image target detection task and objectively reflects the model's performance on the task.
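A sketch of the three enhancement intensities is given below, assuming torchvision transforms (the patent does not name an augmentation library, and a real detection pipeline would also have to transform the bounding boxes together with the image):

    from torchvision import transforms as T

    def make_augmentation(level: str) -> T.Compose:
        if level == "low":        # mild: flips only
            return T.Compose([T.RandomHorizontalFlip(), T.RandomVerticalFlip()])
        if level == "medium":     # adds rotation and translation
            return T.Compose([
                T.RandomHorizontalFlip(), T.RandomVerticalFlip(),
                T.RandomAffine(degrees=30, translate=(0.1, 0.1)),
            ])
        # "high": more items and heavier changes, incl. blur for endoscope motion
        return T.Compose([
            T.RandomHorizontalFlip(), T.RandomVerticalFlip(),
            T.RandomAffine(degrees=90, translate=(0.2, 0.2), scale=(0.8, 1.2)),
            T.GaussianBlur(kernel_size=5, sigma=(0.1, 2.0)),
        ])

The specific degree and translation values are illustrative assumptions; the invention only fixes the three-level structure, not the exact parameters.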
Because the detection target data are video data, SeqYOLO, an algorithm capable of video target detection, is constructed: continuous video frames can be input, and the context information in the video is learned to improve detection of the specified frame. The target detection model used in the system is built from the lightweight YOLOv5n model, and its mAP0.5 value can be higher than that of other high-latency large models; that is, detection accuracy is improved while inference time is shortened.
Preferably, the data set classification comprises 18 categories: epiglottis, vocal cords, main trachea, left main bronchus, carina, right main bronchus, left upper lobe bronchus, left lower lobe bronchus, right upper lobe bronchus, right intermediate bronchus, right lower lobe bronchus, left upper lobe proper bronchus, left lingular bronchus, left lower lobe dorsal segmental bronchus, left lower lobe basal segmental bronchus, right middle lobe bronchus, right lower lobe basal segmental bronchus, and right lower lobe dorsal segmental bronchus.
In this way the data set is classified and, through data enhancement and frame interception from the input videos, the number of samples in each class is conveniently kept balanced, avoiding the problem of unbalanced data samples and making the trained model recognize more accurately.
Preferably, the YOLOv5 model structure is divided mainly into two parts: the YOLOv5 Backbone, which extracts information, and a separate YOLOv5 module that regresses the detection box, called the YOLOv5 Head.
It follows that the functions of the YOLOv5 Backbone and YOLOv5 Head modules are clearly defined; with corresponding adjustments they act as independent components of the SeqYOLO model, and because these modules retain their functionality from YOLOv5, model weights pre-trained in YOLOv5 can be used as their initial weights to speed up the later training on video data.
Preferably, after SeqYOLO reads in the continuous video frame information it gives a prediction result for the last frame image; that is, the YOLO Backbone first extracts features from the images, the resulting feature values are input to the LSTM model for temporal learning, the feature information of each frame is learned in the LSTM and the information useful for predicting the following frame is retained, and after the last frame has been read in, the LSTM outputs an overall video-segment feature value that finally integrates all preceding frame information with the last-frame information; this feature value is passed to the YOLO Head structure to identify the detection box and the object class.
Preferably, training of the SeqYOLO model is split into two stages. In the first stage, a YOLOv5 model is trained on single labeled pictures until mAP0.5 reaches a relatively accurate value; the weights of the model's two components, the Backbone and the Head, obtained in this stage are used to initialize the model weights of the next stage. In the second stage, the constructed SeqYOLO model receives the YOLO Backbone and YOLO Head weights obtained in the previous stage in its corresponding parts. This stage trains on video data, i.e. continuous video frames are input to the model: 20 frames of images, obtained by taking frames at intervals, are input to SeqYOLO for learning, and the last of the 20 frames is the target image to be predicted, so only the last frame carries annotation information.
In this way, the SeqYOLO model constructed in the second stage is given the YOLO Backbone and YOLO Head weights obtained in the previous stage in its corresponding parts, which markedly improves the model's training efficiency.
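A sketch of this stage-two preparation, i.e. migrating the stage-one weights and sampling a 20-frame clip that ends at the labeled frame (the checkpoint key names and the sampling interval are assumptions made for illustration):

    import torch

    def transfer_stage1_weights(seqyolo, ckpt_path: str) -> None:
        """Copy stage-one YOLOv5 Backbone/Head weights into SeqYOLO."""
        ckpt = torch.load(ckpt_path, map_location="cpu")   # assumed checkpoint layout
        seqyolo.backbone.load_state_dict(ckpt["backbone"])
        seqyolo.head.load_state_dict(ckpt["head"])

    def sample_clip(frames, labeled_idx: int, n: int = 20, step: int = 2):
        """Take n frames at a fixed interval, ending at the labeled target frame."""
        idx = [max(labeled_idx - step * i, 0) for i in range(n)]
        return [frames[i] for i in reversed(idx)]          # chronological order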
Preferably, when the SeqYOLO model continues training, the labeled last frame of each video is learned with the first-stage YOLOv5 model, and random data enhancement is applied as the data are loaded;
in the forward propagation of the model, the augmented image first undergoes feature extraction by the YOLOv5 Backbone and then enters the YOLOv5 Head, which predicts the specific detection box position and object class;
in the back propagation of the model, the PyTorch deep learning framework performs automatic differentiation and updates the model weights by stochastic mini-batch gradient descent; an SGD optimizer is adopted during training with the learning rate set to 0.01 and the momentum to 0.937, and a OneCycleLR learning-rate schedule is used, so the learning rate first rises gradually from near zero to the set value of 0.01 and then falls slowly during the subsequent convergence.
Random data enhancement of this kind, compared with traditional offline enhancement, saves storage space and generates large numbers of distinct, non-repeating samples. The OneCycleLR schedule used during training raises the learning rate gradually from near zero to the set value of 0.01 and then lowers it slowly during the subsequent convergence; this schedule avoids the gradient instability, and the resulting failure to converge, that an excessively high learning rate can cause at the start of training.
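The stage-one optimization described above, SGD with learning rate 0.01 and momentum 0.937 under a OneCycleLR schedule, corresponds to the following PyTorch sketch (the epoch count and the loss function are placeholders):

    import torch

    def train_stage1(model, loader, criterion, epochs: int = 100) -> None:
        opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.937)
        sched = torch.optim.lr_scheduler.OneCycleLR(
            opt, max_lr=0.01, total_steps=epochs * len(loader))
        for _ in range(epochs):
            for images, targets in loader:
                loss = criterion(model(images), targets)
                opt.zero_grad()
                loss.backward()   # PyTorch autograd computes the gradients
                opt.step()        # stochastic mini-batch gradient descent update
                sched.step()      # ramp up toward 0.01, then decay slowly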
Preferably, in the SeqYOLO training stage, the model weights trained in the previous stage are first migrated into the YOLOv5 Backbone and YOLOv5 Head components of SeqYOLO; after the previous stage of training, these two components are able to extract endoscopic image features and to perform target prediction and target classification from the extracted features. Each input to SeqYOLO is a video sequence 20 frames long whose last frame is the endoscope image to be predicted, and the output of the SeqYOLO model is the target box, i.e. the target prediction result for the last frame of the 20-frame input sequence. After the model receives a 20-frame image sequence, all images are passed through the same YOLOv5 Backbone simultaneously for feature extraction, finally yielding 20 sets of different feature values that correspond in order to the original 20 frames. These 20 sets of feature values then enter the LSTM module for sequence-information learning; at this stage the model attends to the image information from before the endoscope moved to its current position. When the LSTM has received all 20 sets of feature values, it outputs a feature value of the same form as that of a single image, which is handed to the YOLOv5 Head to compute the detection box position and the target class, and the loss between this prediction and the expected (annotated) value is then calculated;
after forward propagation the SeqYOLO model is optimized by back propagation to update the model weights; the optimizer adopts the Adam method with the learning rate set to 0.001, and OneCycleLR is used as the learning-rate schedule.
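Stage two differs mainly in the optimizer configuration; a sketch mirroring the training function above, with Adam at learning rate 0.001 and the same OneCycleLR policy:

    import torch

    def make_stage2_optimizer(seqyolo, total_steps: int):
        opt = torch.optim.Adam(seqyolo.parameters(), lr=0.001)
        sched = torch.optim.lr_scheduler.OneCycleLR(
            opt, max_lr=0.001, total_steps=total_steps)
        return opt, sched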
Preferably, an attention module is added to SeqYOLO, and three attention modules are included, namely a CBAM-based attention module, the Vision Transformer model ViT, and the Swin Transformer.
In this way the model structure is improved: adding an attention module gives the model stronger learning ability, since attention lets the model actively focus on the important context information in the picture, and this context information plays an important role in the accuracy of the target detection model; at the same time, the attention mechanism can be used to generate an image attention heat map that prompts the doctor to notice details in the image that may need attention.
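Of the three variants, the CBAM-style block is the most compact; a PyTorch sketch follows (the reduction ratio of 16 and the 7x7 spatial kernel are the usual defaults from the CBAM paper, assumed here rather than specified by the invention):

    import torch
    import torch.nn as nn

    class CBAM(nn.Module):
        """Channel attention followed by spatial attention, CBAM-style."""
        def __init__(self, channels: int, reduction: int = 16, kernel_size: int = 7):
            super().__init__()
            self.mlp = nn.Sequential(            # shared MLP for channel attention
                nn.Conv2d(channels, channels // reduction, 1),
                nn.ReLU(),
                nn.Conv2d(channels // reduction, channels, 1),
            )
            self.spatial = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # channel attention: squeeze H and W with average and max pooling
            avg = self.mlp(x.mean(dim=(2, 3), keepdim=True))
            mx = self.mlp(x.amax(dim=(2, 3), keepdim=True))
            x = x * torch.sigmoid(avg + mx)
            # spatial attention: squeeze channels, then convolve the 2-channel map
            s = torch.cat([x.mean(1, keepdim=True), x.amax(1, keepdim=True)], dim=1)
            return x * torch.sigmoid(self.spatial(s))

The sigmoid maps computed in the forward pass are also the kind of signal from which an attention heat map for the doctor could be rendered.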
A method of identifying a lung endoscopic image, the method comprising: acquiring a lung endoscope image to be identified; and identifying the lung endoscope image to be identified with a lung endoscope image recognition model, wherein the lung endoscope image recognition model is trained by the method of training a lung endoscope image recognition model described above.
Preferably, the method also provides inference calculation, whose rules are written according to the structural diagram of the human lung.
The invention is used to assist a doctor in performing endoscopic operations and therefore places high demands on real-time computation, so a lightweight model is used to reach millisecond-level inference latency. Through adaptation to the data set and optimization of the inference deployment, the target detection model used in the system is built from the lightweight YOLOv5n model and is optimized specifically for this task, so its mAP0.5 value is higher than that of other high-latency large models; that is, detection accuracy is improved while inference time is shortened.
According to the invention, a model dedicated to recognizing lung endoscope images is trained by an artificial-intelligence deep learning algorithm. When a doctor uses an endoscope equipped with this technology, the position of the endoscope can be read intuitively from the terminal screen, which reduces the time spent on manual judgment, improves operating efficiency, shortens the operation, and reduces the patient's pain.
Drawings
Fig. 1 is an overall working-principle diagram of the invention.
Fig. 2 is a representation of the data set of the invention.
Fig. 3 is a diagram of the real-time inference effect of the invention.
Fig. 4 is a schematic representation of data enhancement at the three different intensity levels of the invention.
Fig. 5 is a diagram of the SeqYOLO network structure of the invention.
Fig. 6 is a diagram of the structure of the inference assistance model of the invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings.
Example 1
Referring to Figs. 1-6, a method for training a lung endoscope image recognition model comprises constructing a data set and a neural network model, training the neural network model on 80% of the sample data of the constructed data set, verifying the trained model on the remaining 20% of the sample data before deployment, and using mAP0.5 as the evaluation index for measuring accuracy on the endoscopic image target detection task.
The data set is constructed by capturing a number of clear bronchoscopic images from several bronchoscopic videos, dividing them into categories, annotating them by drawing one or more bounding boxes, and expanding them by data enhancement;
The data enhancement of the data set comprises three different enhancement intensities: high, medium and low;
in this embodiment, the method is further provided with inference calculation, which writes its rules according to the structural diagram of the human lung.
The specific training process is as follows:
1. Extract clear bronchoscope images from a pediatric institution database: 2451 bronchoscope images were cut from the bronchoscope videos. Positions are marked manually on the images using the key positional features of the bronchi that doctors rely on for judgment, and the accuracy, usability and completeness of the manual annotation are checked. Each image is annotated jointly by a medical expert and a data analyst: the medical expert marks one or more exact pieces of location information based on clinical experience; the annotation tool LabelImg is then used to draw one or more bounding boxes on the bronchoscope image and record the correct class, and the coordinates of each bounding box are recorded.
2. Sort the data set for each site. The data set comprises example image data for 18 categories: epiglottis, vocal cords, main trachea, left main bronchus, carina, right main bronchus, left upper lobe bronchus, left lower lobe bronchus, right upper lobe bronchus, right intermediate bronchus, right lower lobe bronchus, left upper lobe proper bronchus, left lingular bronchus, left lower lobe dorsal segmental bronchus, left lower lobe basal segmental bronchus, right middle lobe bronchus, right lower lobe basal segmental bronchus and right lower lobe dorsal segmental bronchus. The amount of sample data in each category is controlled to keep the samples balanced. At the same time, the specific frame position of each image in its video is determined, and information from a fixed time before and after that frame is intercepted as the image's context; the number of samples per category is kept at about 300, avoiding the problem of unbalanced data samples.
3. In this application scenario, the angle and orientation of the image shown on the screen differ each time the endoscope enters the body, so the diversity of samples is increased with up-down flip, left-right flip and rotation data enhancement. Likewise, the endoscope cannot be guaranteed to sit in the middle of the trachea after entering the body, so translation of the input data is used to ensure sample diversity. A dynamic Gaussian-blur data enhancement is also used to simulate the motion blur caused by the endoscope as the lungs move.
4. Model training: the model is trained on 80% of the collected picture samples, using the data-enhancement configuration combination best suited to the sample data set; the specific training logic is described below.
The neural network model is SeqYOLO, constructed by combining a YOLOv5 model with an LSTM;
SeqYOLO takes YOLOv5 as its baseline: the YOLOv5 part provides the target detection capability, while a new LSTM module is introduced to learn the temporally ordered information in the video;
SeqYOLO obtains an image feature value for each frame of a video sequence through the YOLO Backbone, feeds these feature values to an LSTM model so that the temporally ordered information in the video is learned, and, once the LSTM has read the image feature values of the whole sequence, outputs the image feature value corresponding to the image to be inferred; the Head component of the YOLO series algorithm then uses this feature value to infer the specific detection boxes in the frame and the object classes.
In this embodiment, the data set classification includes 18 categories: epiglottis, vocal cords, main trachea, left main bronchus, carina, right main bronchus, left upper lobe bronchus, left lower lobe bronchus, right upper lobe bronchus, right intermediate bronchus, right lower lobe bronchus, left upper lobe proper bronchus, left lingular bronchus, left lower lobe dorsal segmental bronchus, left lower lobe basal segmental bronchus, right middle lobe bronchus, right lower lobe basal segmental bronchus, and right lower lobe dorsal segmental bronchus.
In this embodiment, the YOLOv5 model structure is divided mainly into two parts: the YOLOv5 Backbone, which extracts information, and a separate YOLOv5 module that regresses the detection box, called the YOLOv5 Head.
In this embodiment, after SeqYOLO reads in the continuous video frame information it gives a prediction result for the last frame image; that is, the YOLO Backbone first extracts features from the images, the resulting feature values are input to the LSTM model for temporal learning, the feature information of each frame is learned in the LSTM and the information useful for predicting the following frame is retained, and after the last frame has been read in, the LSTM outputs an overall video-segment feature value that finally integrates all preceding frame information with the last-frame information; this feature value is passed to the YOLO Head structure to identify the detection box and the object class.
In this embodiment, training of the SeqYOLO model is divided into two stages: in the first stage a YOLOv5 model is trained on single labeled pictures until mAP0.5 reaches a relatively accurate value, and the weights of the model's two components, the Backbone and the Head, obtained in this stage are used to initialize the model weights of the next stage; in the second stage the constructed SeqYOLO model receives the YOLO Backbone and YOLO Head weights obtained in the previous stage in its corresponding parts, and this stage trains on video data, i.e. continuous video frames are input to the model: 20 frames of images obtained by taking frames at intervals are input to SeqYOLO for learning, and the last of the 20 frames is the target image to be predicted, so only the last frame carries annotation information.
In this embodiment, when the SeqYOLO model continues training, the labeled last frame of each video is learned with the first-stage YOLOv5 model, and random data enhancement is applied as the data are loaded;
in the forward propagation of the model, the augmented image first undergoes feature extraction by the YOLOv5 Backbone and then enters the YOLOv5 Head, which predicts the specific detection box position and object class;
in the back propagation of the model, the PyTorch deep learning framework performs automatic differentiation and updates the model weights by stochastic mini-batch gradient descent; an SGD optimizer is adopted during training with the learning rate set to 0.01 and the momentum to 0.937, and a OneCycleLR learning-rate schedule is used, so the learning rate first rises gradually from near zero to the set value of 0.01 and then falls slowly during the subsequent convergence.
In this embodiment, in the SeqYOLO training stage, the model weights trained in the previous stage are first migrated into the YOLOv5 Backbone and YOLOv5 Head components of SeqYOLO; after the previous stage of training, these two components are able to extract endoscopic image features and to perform target prediction and target classification from the extracted features. Each input to SeqYOLO is a video sequence 20 frames long whose last frame is the endoscope image to be predicted, and the output of the SeqYOLO model is the target box, i.e. the target prediction result for the last frame of the 20-frame input sequence. After the model receives a 20-frame image sequence, all images are passed through the same YOLOv5 Backbone simultaneously for feature extraction, finally yielding 20 sets of different feature values that correspond in order to the original 20 frames. These 20 sets of feature values then enter the LSTM module for sequence-information learning; at this stage the model attends to the image information from before the endoscope moved to its current position. When the LSTM has received all 20 sets of feature values, it outputs a feature value of the same form as that of a single image, which is handed to the YOLOv5 Head to compute the detection box position and the target class, and the loss between this prediction and the expected (annotated) value is then calculated;
after forward propagation the SeqYOLO model is optimized by back propagation to update the model weights; the optimizer adopts the Adam method with the learning rate set to 0.001, and OneCycleLR is used as the learning-rate schedule.
In this embodiment, an attention module is added to SeqYOLO, and three attention modules are included, namely a CBAM-based attention module, the Vision Transformer model ViT, and the Swin Transformer.
5. Model validation: the trained model is checked with the other 20% of the sample data to verify its recognition accuracy. The validation set is used to check the model's state and convergence during training, and is generally used for tuning hyper-parameters: whichever set of hyper-parameters performs best on the validation set is chosen. The validation set can also monitor whether the model is overfitting during training: generally, once validation performance has stabilized, continued training will keep improving training-set performance while validation performance stalls or declines, which is usually judged to be overfitting. This shows whether the training is proceeding in the desired direction (a monitoring sketch is given after step 6).
6. After the model has been successfully deployed at the terminal, real-time inference and display of the endoscope images can begin. During inference the model output is post-processed to improve prediction accuracy: since the bronchial structure of the human body is known, it is compiled into rules that assist the model in judging the position shown in the current image during continuous video inference, as sketched below.
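As a sketch of the overfitting monitoring described in step 5 (the patience threshold and the evaluation callback are assumptions; the patent describes the criterion only qualitatively):

    def train_with_validation(model, train_one_epoch, eval_map50,
                              epochs: int, patience: int = 10) -> float:
        """Stop when validation mAP0.5 stops improving while training continues."""
        best, stale = 0.0, 0
        for _ in range(epochs):
            train_one_epoch(model)
            m = eval_map50(model)      # mAP0.5 on the held-out 20% split
            if m > best:
                best, stale = m, 0     # validation still improving
            else:
                stale += 1             # flat or falling: possible overfitting
                if stale >= patience:
                    break
        return best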
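And for the rule-based assistance in step 6, one way to encode the known bronchial structure is an adjacency map over the recognized sites, used to reject predictions that are anatomically unreachable from the previous position. The entries shown are illustrative; a full implementation would cover all 18 categories:

    BRONCHIAL_TREE = {
        "main trachea": {"carina"},
        "carina": {"left main bronchus", "right main bronchus"},
        "left main bronchus": {"left upper lobe bronchus", "left lower lobe bronchus"},
        # ... remaining sites follow the human lung structure diagram
    }

    def plausible(prev_site, new_site: str) -> bool:
        """Keep a prediction only if it stays at, or moves to, an adjacent site."""
        if prev_site is None or new_site == prev_site:
            return True
        neighbors = set(BRONCHIAL_TREE.get(prev_site, set()))
        # also allow moving back toward the parent site
        neighbors |= {p for p, kids in BRONCHIAL_TREE.items() if prev_site in kids}
        return new_site in neighbors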
A method of identifying a lung endoscope image: in actual use, the doctor acquires the lung endoscope image to be identified, and the lung endoscope image recognition model obtained by the above training method is used to recognize it.
The foregoing detailed description of the invention has been presented in conjunction with a specific embodiment, and it is not intended that the invention be limited to such detailed description. Several equivalent substitutions or obvious modifications will occur to those skilled in the art to which this invention pertains without departing from the spirit of the invention, and the same should be considered to be within the scope of this invention as defined in the appended claims.

Claims (10)

1. A method of training a lung endoscope image recognition model, characterized in that: the method comprises constructing a data set and a neural network model, training the neural network model on 80% of the sample data of the constructed data set, verifying the trained model on the remaining 20% of the sample data before deployment, and using mAP0.5 as the evaluation index for measuring accuracy on the endoscopic image target detection task;
the data set is constructed by intercepting a number of clear bronchoscope images from several bronchoscope videos, dividing them into categories, annotating them by drawing one or more bounding boxes, and expanding them by data enhancement;
the data enhancement of the data set comprises three different levels of enhancement intensity;
the neural network model is SeqYOLO, constructed by combining a YOLOv5 model with an LSTM;
SeqYOLO takes YOLOv5 as its baseline: the YOLOv5 part provides the target detection capability, while an LSTM module is introduced to learn the temporally ordered information in the video;
and SeqYOLO obtains an image feature value for each frame of a video sequence through the YOLO Backbone, feeds these feature values to an LSTM model so that the temporally ordered information in the video is learned, and, once the LSTM has read the image feature values of the whole sequence, outputs the image feature value corresponding to the image to be inferred; the Head component of the YOLO series algorithm then uses this feature value to infer the specific detection boxes in the frame and the object classes.
2. A method of training a lung endoscope image recognition model according to claim 1, wherein: the data set classification comprises 18 categories: epiglottis, vocal cords, main trachea, left main bronchus, carina, right main bronchus, left upper lobe bronchus, left lower lobe bronchus, right upper lobe bronchus, right intermediate bronchus, right lower lobe bronchus, left upper lobe proper bronchus, left lingular bronchus, left lower lobe dorsal segmental bronchus, left lower lobe basal segmental bronchus, right middle lobe bronchus, right lower lobe basal segmental bronchus, and right lower lobe dorsal segmental bronchus.
3. A method of training a lung endoscope image recognition model according to claim 1, wherein: the YOLOv5 model structure comprises two parts, namely the YOLOv5 Backbone, for extracting information, and a separate YOLOv5 module that regresses the detection box, called the YOLOv5 Head.
4. A method of training a lung endoscope image recognition model according to claim 1, wherein: after SeqYOLO reads in the continuous video frame information, it gives a prediction result for the last frame image; that is, the YOLO Backbone first extracts features from the images, the resulting feature values are input to the LSTM model for temporal learning, the feature information of each frame is learned in the LSTM and the information useful for predicting the following frame is retained, and after the last frame has been read in, the LSTM outputs an overall video-segment feature value that finally integrates all preceding frame information with the last-frame information; this feature value is passed to the YOLO Head structure to identify the detection box and the object class.
5. A method of training a lung endoscope image recognition model according to claim 1, wherein: training of the SeqYOLO model is divided into two stages: in the first stage a YOLOv5 model is trained on single labeled pictures until mAP0.5 reaches a relatively accurate value, and the weights of the model's two components, the Backbone and the Head, obtained in this stage are used to initialize the model weights of the next stage; in the second stage the constructed SeqYOLO model receives the YOLO Backbone and YOLO Head weights obtained in the previous stage in its corresponding parts, and this stage trains on video data, i.e. continuous video frames are input to the model: 20 frames of images obtained by taking frames at intervals are input to SeqYOLO for learning, and the last of the 20 frames is the target image to be predicted, so only the last frame carries annotation information.
6. A method of training a lung endoscope image recognition model according to claim 5, wherein: when the SeqYOLO model continues training, the labeled last frame of each video is learned with the first-stage YOLOv5 model, and random data enhancement is applied as the data are loaded;
in the forward propagation of the model, the augmented image first undergoes feature extraction by the YOLOv5 Backbone and then enters the YOLOv5 Head, which predicts the specific detection box position and object class;
in the back propagation of the model, the PyTorch deep learning framework performs automatic differentiation and updates the model weights by stochastic mini-batch gradient descent; an SGD optimizer is adopted during training with the learning rate set to 0.01 and the momentum to 0.937, and a OneCycleLR learning-rate schedule is used, so the learning rate first rises gradually from near zero to the set value of 0.01 and then falls slowly during the subsequent convergence.
7. A method of training a lung endoscope image recognition model according to claim 1, wherein: in the SeqYOLO training stage, the model weights trained in the previous stage are first migrated into the YOLOv5 Backbone and YOLOv5 Head components of SeqYOLO; after the previous stage of training, these two components are able to extract endoscopic image features and to perform target prediction and target classification from the extracted features; each input to SeqYOLO is a video sequence 20 frames long whose last frame is the endoscope image to be predicted, and the output of the SeqYOLO model is the target box, i.e. the target prediction result for the last frame of the 20-frame input sequence; after the model receives a 20-frame image sequence, all images are passed through the same YOLOv5 Backbone simultaneously for feature extraction, finally yielding 20 sets of different feature values that correspond in order to the original 20 frames; these 20 sets of feature values then enter the LSTM module for sequence-information learning, at which stage the model attends to the image information from before the endoscope moved to its current position; when the LSTM has received all 20 sets of feature values, it outputs a feature value of the same form as that of a single image, which is handed to the YOLOv5 Head to compute the detection box position and the target class, and the loss between this prediction and the expected (annotated) value is then calculated;
after forward propagation the SeqYOLO model is optimized by back propagation to update the model weights, the optimizer adopts the Adam method with the learning rate set to 0.001, and OneCycleLR is used as the learning-rate schedule.
8. A method of training a lung endoscope image recognition model according to claim 1, wherein: an attention module is added to SeqYOLO, and three attention modules are included, namely a CBAM-based attention module, the Vision Transformer model ViT, and the Swin Transformer.
9. A lung endoscope image recognition method, characterized in that the method comprises: acquiring a lung endoscope image to be identified; and recognizing the lung endoscope image to be identified with a lung endoscope image recognition model, wherein the lung endoscope image recognition model is trained by the method of training a lung endoscope image recognition model according to any one of claims 1-8.
10. A method of identifying a pulmonary endoscopic image according to claim 9, wherein: the method further provides inference calculation, whose rules are written according to the structural diagram of the human lung.
CN202211003172.5A 2022-08-19 2022-08-19 Method for training lung endoscope image recognition model and recognition method Active CN115393670B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211003172.5A CN115393670B (en) 2022-08-19 2022-08-19 Method for training lung endoscope image recognition model and recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211003172.5A CN115393670B (en) 2022-08-19 2022-08-19 Method for training lung endoscope image recognition model and recognition method

Publications (2)

Publication Number Publication Date
CN115393670A CN115393670A (en) 2022-11-25
CN115393670B true CN115393670B (en) 2024-07-19

Family

ID=84121094

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211003172.5A Active CN115393670B (en) 2022-08-19 2022-08-19 Method for training lung endoscope image recognition model and recognition method

Country Status (1)

Country Link
CN (1) CN115393670B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118737390B (en) * 2024-09-02 2025-02-11 北京大学第三医院(北京大学第三临床医学院) A visual intelligent intubation assistance system for automatic identification of left and right bronchi
CN118845225A (en) * 2024-09-25 2024-10-29 北京万特福医疗器械有限公司 A method and storage medium for recognizing hematoma status in images during minimally invasive puncture surgery

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288597A (en) * 2019-07-01 2019-09-27 哈尔滨工业大学 Attention mechanism-based video saliency detection method for wireless capsule endoscopy
CN111027461A (en) * 2019-12-06 2020-04-17 长安大学 Vehicle track prediction method based on multi-dimensional single-step LSTM network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9760806B1 (en) * 2016-05-11 2017-09-12 TCL Research America Inc. Method and system for vision-centric deep-learning-based road situation analysis
US11216954B2 (en) * 2018-04-18 2022-01-04 Tg-17, Inc. Systems and methods for real-time adjustment of neural networks for autonomous tracking and localization of moving subject
JP7377769B2 (en) * 2020-06-08 2023-11-10 Hoya株式会社 Program, information processing method, and information processing device
CN113076683B (en) * 2020-12-08 2023-08-08 国网辽宁省电力有限公司锦州供电公司 Modeling method of convolutional neural network model for transformer substation behavior monitoring
CN112686856B (en) * 2020-12-29 2024-07-09 杭州优视泰信息技术有限公司 Real-time enteroscopy polyp detection device based on deep learning
CN112932663B (en) * 2021-03-02 2021-10-22 成都与睿创新科技有限公司 Intelligent auxiliary system for improving safety of laparoscopic cholecystectomy

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110288597A (en) * 2019-07-01 2019-09-27 哈尔滨工业大学 Attention mechanism-based video saliency detection method for wireless capsule endoscopy
CN111027461A (en) * 2019-12-06 2020-04-17 长安大学 Vehicle track prediction method based on multi-dimensional single-step LSTM network

Also Published As

Publication number Publication date
CN115393670A (en) 2022-11-25

Similar Documents

Publication Publication Date Title
CN115393670B (en) Method for training lung endoscope image recognition model and recognition method
CN109190540B (en) Biopsy region prediction method, image recognition device, and storage medium
JP2021519663A (en) Endoscopic image processing methods, systems, computer devices and computer programs
JP2019525786A (en) System and method for automatically detecting, locating, and semantic segmentation of anatomical objects
CN108765392B (en) Digestive tract endoscope lesion detection and identification method based on sliding window
CN110367913B (en) Wireless capsule endoscope image pylorus and ileocecal valve positioning method
CN111161290A (en) Image segmentation model construction method, image segmentation method and image segmentation system
CN111524138A (en) Microscopic image cell identification method and device based on multitask learning
CN117524402A (en) Method for analyzing endoscope image and automatically generating diagnostic report
CN107368859A (en) Training method, verification method and the lesion pattern recognition device of lesion identification model
CN109948671B (en) Image classification method, device, storage medium and endoscopic imaging equipment
CN111340937A (en) Brain tumor medical image three-dimensional reconstruction display interaction method and system
CN111341437B (en) Digestive tract disease judgment auxiliary system based on tongue image
CN110742690A (en) Method for configuring endoscope and terminal equipment
CN112734707B (en) Auxiliary detection method, system and device for 3D endoscope and storage medium
CN113222932B (en) Small intestine endoscope picture feature extraction method based on multi-convolution neural network integrated learning
CN111462082A (en) Focus picture recognition device, method and equipment and readable storage medium
CN115719334A (en) Medical image evaluation method, device, equipment and medium based on artificial intelligence
KR102427171B1 (en) Method and Apparatus for providing object labeling within Video
CN113763360A (en) Method and system for quality assessment of digestive endoscopy simulator
CN115553925A (en) Endoscope manipulation model training method and device, equipment, storage medium
CN114913173A (en) Endoscopy-assisted inspection system, method, device and storage medium
WO2022182263A1 (en) Automatic detection and differentiation of biliary lesions in cholangioscopy images
CN117322865B (en) MRI examination and diagnosis system for temporomandibular joint disc displacement based on deep learning
CN112967246B (en) X-ray image assisting device and method for clinical decision support system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant