CN111260608A - Tongue region detection method and system based on deep learning - Google Patents
- Publication number
- CN111260608A (application No. CN202010017676.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/0012 — Image analysis; biomedical image inspection
- G06F18/214 — Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
- G06F18/23213 — Non-hierarchical clustering techniques using statistics or function optimisation, with a fixed number of clusters, e.g. k-means clustering
- G06N3/045 — Neural networks; combinations of networks
- G06N3/08 — Neural networks; learning methods
Abstract
The invention discloses a tongue region detection method and system based on deep learning. The method comprises the following steps: labeling an acquired image data set containing the tongue, and preprocessing the labeled image data set to obtain a first image data set; setting the proportional sizes of multiple fixed reference frames and clustering them by k-means to obtain a plurality of cluster-centre reference frames; training based on the DarkNet network structure, determining the output layer dimension from the cluster-centre reference frames, and training on the first image data set to determine a first detection model; detecting an image data set containing no tongue with the first detection model to obtain a false-detection image data set; and adjusting the network structure of the first detection model, modifying the output layer dimension, and retraining with the first image data set and the false-detection image data set to determine a tongue detection model for detecting the tongue region.
Description
Technical Field
The present invention relates to the technical field of deep learning algorithms, and more particularly, to a tongue region detection method and system based on deep learning.
Background
At present, many tongue diagnosis algorithms based on traditional Chinese medicine theory require the subject to extend the tongue at a fixed distance and within a fixed area, held by a fixing device or equipment, before tongue diagnosis analysis is carried out; the pixels inside that fixed area of the picture are then analysed. In practice, different people extend their tongues to different positions and sizes, so with a fixed area the background pixels may outnumber the tongue-region pixels, which greatly degrades the actual tongue diagnosis analysis.
Furthermore, the application scenarios of the prior art are severely limited. The main problems are: 1. the method imposes many picture-shooting requirements on users, who must extend the tongue inside a prompt box on the shooting interface, which is very inconvenient and restricts the application scenarios; 2. the prompt-box area is used, very roughly, as the real tongue area for tongue diagnosis analysis, while in practice the background may occupy a very large proportion of its pixels, degrading the accuracy of the analysis.
Therefore, there is a need for a tongue region detection method to accurately and intelligently determine the tongue region.
Disclosure of Invention
The invention provides a tongue region detection method and system based on deep learning, aiming to determine the tongue region accurately and intelligently.
In order to solve the above problem, according to an aspect of the present invention, there is provided a tongue region detection method based on deep learning, the method including:
labeling the acquired image data set containing the tongue part, and preprocessing the labeled image data set to acquire a first image data set;
setting the proportional sizes of various fixed reference frames, and clustering by adopting a k-means clustering mode to obtain a plurality of clustering center reference frames;
training based on a DarkNet network structure, determining the dimension of an output layer according to the plurality of clustering center reference frames, and training according to the first image data set to determine a first detection model;
detecting an image data set containing no tongue with the first detection model to obtain a false-detection image data set;
and adjusting the network structure of the first detection model, modifying the dimensionality of an output layer, initializing the model by using the parameters of the first detection model except the last layer, and retraining by using the first image data set and the false detection image data set to determine a tongue detection model for detecting a tongue region.
Preferably, the labeling of the acquired image dataset containing the tongue comprises:
labeling the tongue region in the acquired image data set containing the tongue by using the labeling tool LabelImg, marking the position of the tongue with a rectangular frame.
Preferably, the preprocessing the annotated image data set to obtain the first image data set comprises:
selecting data in the labeled image data set according to a first preset quantity threshold value to perform data enhancement processing so as to obtain an expanded data set;
and carrying out equal-scale scaling and filling processing on the marked image data set and the expanded data set according to a preset image proportion, and synchronously adjusting the coordinate position of the marking frame in the image to obtain a first image data set.
Preferably, wherein the data enhancement processing comprises:
at least one of: horizontal flipping; clockwise and anticlockwise rotation within a preset angle-range threshold; translation by a preset proportion threshold in the vertical and horizontal directions; random cropping of pixels within the picture edge according to a preset cropping-proportion threshold; Gaussian filtering; and scaling.
Preferably, the training based on the DarkNet network structure, determining the output layer dimension according to the plurality of cluster center reference frames, and training according to the first image data set to determine the first detection model includes:
the model backbone network adopts DarkNet-53 and fuses three feature maps of different scales, 13×13, 26×26 and 52×52, for target prediction; each scale of feature map uses 3 fixed reference frames of different sizes, giving 9 fixed reference frames in total;
determining the output layer dimension to be 3 × (1 + 5) = 18, where 3 denotes the three frames predicted per position, 1 denotes the single prediction category, and 5 denotes the predicted target's centre coordinates, width and height, and target score;
and selecting data from the first image data set according to a second preset quantity threshold as the training set, taking the remaining data as the verification set, and training the model to determine the first detection model.
According to another aspect of the present invention, there is provided a tongue region detection system based on deep learning, the system comprising:
the data processing unit is used for labeling the acquired image data set containing the tongue part and preprocessing the labeled image data set to acquire a first image data set;
the clustering unit is used for setting the proportional sizes of various fixed reference frames and clustering by adopting a k-means clustering mode to obtain a plurality of clustering center reference frames;
the first detection model determining unit is used for training based on a DarkNet network structure, determining the dimensionality of an output layer according to the multiple clustering center reference frames, and training according to the first image data set to determine a first detection model;
the false detection data acquisition unit is used for detecting an image data set containing no tongue with the first detection model, so as to obtain a false-detection image data set;
and the tongue detection model determining unit is used for adjusting the network structure of the first detection model, modifying the dimension of an output layer, initializing the model by using the parameters of the first detection model except the last layer, and retraining by using the first image data set and the false detection image data set to determine a tongue detection model for detecting a tongue region.
Preferably, the data processing unit labeling the acquired image data set including the tongue portion includes:
labeling the tongue region in the acquired image data set containing the tongue by using the labeling tool LabelImg, marking the position of the tongue with a rectangular frame.
Preferably, the data processing unit, which pre-processes the annotated image data set to obtain the first image data set, comprises:
selecting data in the labeled image data set according to a first preset quantity threshold value to perform data enhancement processing so as to obtain an expanded data set;
and carrying out equal-scale scaling and filling processing on the marked image data set and the expanded data set according to a preset image proportion, and synchronously adjusting the coordinate position of the marking frame in the image to obtain a first image data set.
Preferably, wherein the data enhancement processing comprises:
at least one of: horizontal flipping; clockwise and anticlockwise rotation within a preset angle-range threshold; translation by a preset proportion threshold in the vertical and horizontal directions; random cropping of pixels within the picture edge according to a preset cropping-proportion threshold; Gaussian filtering; and scaling.
Preferably, the training of the first detection model determining unit based on the DarkNet network structure, the determination of the output layer dimension according to the plurality of cluster center reference frames, and the training of the first image data set to determine the first detection model includes:
the model backbone network adopts DarkNet-53 and fuses three feature maps of different scales, 13×13, 26×26 and 52×52, for target prediction; each scale of feature map uses 3 fixed reference frames of different sizes, giving 9 fixed reference frames in total;
determining the output layer dimension to be 3 × (1 + 5) = 18, where 3 denotes the three frames predicted per position, 1 denotes the single prediction category, and 5 denotes the predicted target's centre coordinates, width and height, and target score;
and selecting data from the first image data set according to a second preset quantity threshold as the training set, taking the remaining data as the verification set, and training the model to determine the first detection model.
The invention provides a tongue region detection method and system based on deep learning, in which a tongue detection model is trained with a deep convolutional network. Tongue pictures collected under different scenes, illumination, resolutions and image sizes can then be judged, determining whether a tongue is present and the size and position of the tongue region. With only one detection target and a relatively simple convolutional structure of just 53 layers, real-time detection can be achieved while accuracy is guaranteed, which is very convenient for applications on all kinds of devices; the shooting requirements on the user are reduced, and the tongue can be correctly detected at any position in the image.
Drawings
A more complete understanding of exemplary embodiments of the present invention may be had by reference to the following drawings in which:
fig. 1 is a flowchart of a tongue region detection method 100 based on deep learning according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of cluster distance setting according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of determining a tongue detection model according to an embodiment of the invention; and
fig. 4 is a schematic structural diagram of a tongue region detection system 400 based on deep learning according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present invention will now be described with reference to the accompanying drawings. The invention may, however, be embodied in many different forms and is not limited to the embodiments described herein, which are provided so that the disclosure will be thorough and complete and will fully convey the scope of the invention to those skilled in the art. The terminology used in the exemplary embodiments illustrated in the accompanying drawings is not intended to limit the invention. In the drawings, the same units/elements are denoted by the same reference numerals.
Unless otherwise defined, terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Further, it will be understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the relevant art and will not be interpreted in an idealized or overly formal sense.
Fig. 1 is a flowchart of a tongue region detection method 100 based on deep learning according to an embodiment of the present invention. As shown in fig. 1, the method trains a tongue detection model with a deep convolutional network, so that tongue photographs acquired under different scenes, illumination, resolutions and image sizes can be judged, determining whether a tongue exists and the size and position of the tongue region. With only one detection target and a relatively simple convolutional structure of just 53 layers, real-time detection can be achieved while accuracy is guaranteed, which is very convenient for applications on all kinds of devices; the shooting requirements on the user are reduced, and the tongue can be correctly detected at any position in the image. The method 100 starts from step 101, in which the acquired image data set containing the tongue is labeled, and the labeled image data set is preprocessed to obtain a first image data set.
Preferably, the labeling of the acquired image dataset containing the tongue comprises:
labeling the tongue region in the acquired image data set containing the tongue by using the labeling tool LabelImg, marking the position of the tongue with a rectangular frame.
Preferably, the preprocessing the annotated image data set to obtain the first image data set comprises:
selecting data in the labeled image data set according to a first preset quantity threshold value to perform data enhancement processing so as to obtain an expanded data set;
and carrying out equal-scale scaling and filling processing on the marked image data set and the expanded data set according to a preset image proportion, and synchronously adjusting the coordinate position of the marking frame in the image to obtain a first image data set.
Preferably, wherein the data enhancement processing comprises:
at least one of: horizontal flipping; clockwise and anticlockwise rotation within a preset angle-range threshold; translation by a preset proportion threshold in the vertical and horizontal directions; random cropping of pixels within the picture edge according to a preset cropping-proportion threshold; Gaussian filtering; and scaling.
In an embodiment of the invention, a picture data set A containing the tongue is acquired, the tongue region is labeled with the labeling tool LabelImg, and the position of the tongue is marked with a rectangular frame. Then, data are selected with a first preset quantity threshold of 40% for data enhancement, which includes horizontal flipping, rotation from 15° anticlockwise to 15° clockwise, translation of 1% to 10% up, down, left and right, random cropping of pixels within 20% of the picture edge, Gaussian filtering with filter sizes 3×3, 5×5, …, 17×17, scaling, and any combination of the above. Finally, the image is scaled so that its long edge becomes 416. If the scaled height is h and h is less than 416, pixel regions of height 208 − 0.5h and width 416 are filled above and below the image, with the pixel value fixed at 128; the coordinates of the labeling frame are adjusted synchronously. This determines the first image data set.
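The scale-and-pad step above (letterboxing to 416×416 with gray fill and synchronous box adjustment) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `letterbox` and the dependency-free nearest-neighbour resize are assumptions.

```python
import numpy as np

def letterbox(image, boxes, target=416, fill=128):
    """Scale so the long edge equals `target`, pad the short side with
    gray (128) pixels, and shift box coordinates accordingly.
    `image` is an HxWx3 uint8 array; `boxes` is an (N, 4) array of
    [x_min, y_min, x_max, y_max] pixel coordinates."""
    h, w = image.shape[:2]
    scale = target / max(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    # Nearest-neighbour resize keeps the sketch dependency-free.
    rows = (np.arange(new_h) / scale).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / scale).astype(int).clip(0, w - 1)
    resized = image[rows][:, cols]
    canvas = np.full((target, target, 3), fill, dtype=np.uint8)
    # For h < 416 this pad is (416 - h') / 2, i.e. the 208 - 0.5h of the text.
    pad_y, pad_x = (target - new_h) // 2, (target - new_w) // 2
    canvas[pad_y:pad_y + new_h, pad_x:pad_x + new_w] = resized
    shifted = boxes * scale + np.array([pad_x, pad_y, pad_x, pad_y])
    return canvas, shifted
```

A 400×200 picture, for example, scales to 416×208 and receives 104 rows of gray padding above and below, with every box shifted down by 104 pixels.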
In step 102, the proportional sizes of the multiple fixed reference frames are set, and clustering is performed in a k-means clustering manner to obtain multiple clustering center reference frames.
In the embodiment of the invention, in order to detect tongues of different proportions and sizes in the picture, 9 fixed reference frames (anchors) of different proportions and sizes are set. Specifically, k-means clustering is adopted: the heights and widths of 9 labeling frames, as ratios to the height and width of the original image, are randomly selected from all labeled images as the initial cluster-centre anchors. The clustering distance is set as shown in fig. 2: the lengths and widths of B1 and B2 are the ratios of labeling-frame height and width to original-image height and width in different pictures; the centres of B1 and B2 are made to coincide, and the distance for k-means clustering is derived from the ratio of the intersection area of B1 and B2 to their union area. After multiple rounds of clustering, the 9 cluster-centre anchors are obtained.
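The anchor selection described above can be sketched as k-means over normalised (width, height) pairs. One hedge: the text names the intersection-over-union ratio as the distance basis; the sketch below uses the standard YOLOv2/v3 formulation d = 1 − IoU, which assigns each box to the anchor of maximum IoU. Function names are illustrative, not the patent's.

```python
import numpy as np

def iou_wh(boxes, anchors):
    """IoU between centre-aligned boxes given only (w, h) pairs,
    normalised to the original image size. boxes: (N, 2); anchors: (K, 2)."""
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0])
             * np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] \
          + (anchors[:, 0] * anchors[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    """k-means with distance d = 1 - IoU; returns k (w, h) cluster centres."""
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Minimising 1 - IoU is equivalent to maximising IoU.
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)
        new = np.array([boxes[assign == j].mean(axis=0) if np.any(assign == j)
                        else anchors[j] for j in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors
```

Because widths and heights are stored as ratios to the original image, the resulting anchors transfer directly to the 416×416 network input.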
In step 103, training is performed based on the DarkNet network structure, an output layer dimension is determined according to the plurality of clustering center reference frames, and training is performed according to the first image data set to determine a first detection model.
Preferably, the training based on the DarkNet network structure, determining the output layer dimension according to the plurality of cluster center reference frames, and training according to the first image data set to determine the first detection model includes:
the model backbone network adopts DarkNet-53 and fuses three feature maps of different scales, 13×13, 26×26 and 52×52, for target prediction; each scale of feature map uses 3 fixed reference frames of different sizes, giving 9 fixed reference frames in total;
determining the output layer dimension to be 3 × (1 + 5) = 18, where 3 denotes the three frames predicted per position, 1 denotes the single prediction category, and 5 denotes the predicted target's centre coordinates, width and height, and target score;
and selecting data from the first image data set according to a second preset quantity threshold as the training set, taking the remaining data as the verification set, and training the model to determine the first detection model.
In the embodiment of the invention, the backbone of the training model adopts DarkNet-53, fusing feature maps at three different scales, 13×13, 26×26 and 52×52, for target prediction; each scale of feature map uses 3 anchors of different sizes, 9 anchors in total. Since there is only one target to detect, the output layer dimension is 3 × (1 + 5) = 18, where 3 denotes the three predicted bounding boxes, 1 denotes the single prediction category, and 5 denotes the predicted target's centre coordinates, width and height, and target score. Model training is then performed with 80% of the first image data set as the training set and the remaining 20% as the verification set, yielding the first detection model.
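The output-dimension arithmetic above generalises to A × (C + 5) channels per grid cell, where A is the number of anchors per scale and C the number of classes. A small sketch (the function name is illustrative):

```python
def head_dim(anchors_per_scale=3, num_classes=1):
    # Each anchor predicts 4 box values (centre x/y, width, height),
    # 1 objectness score, and one confidence per class.
    return anchors_per_scale * (num_classes + 4 + 1)

# Single-class tongue detector (first model): 3 * (1 + 5) = 18
print(head_dim(3, 1))                     # 18
# After the false-detection background class is added: 3 * (2 + 5) = 21
print(head_dim(3, 2))                     # 21
# Grid sizes for a 416x416 input at strides 32, 16, 8:
print([416 // s for s in (32, 16, 8)])    # [13, 26, 52]
```

This makes explicit why the 13×13, 26×26 and 52×52 maps all carry 18 channels in the first model and 21 after retraining.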
In step 104, an image data set containing no tongue is detected with the first detection model to obtain a false-detection image data set.
In step 105, the network structure of the first detection model is adjusted, the output layer dimension is modified, the model is initialized by using the parameters of the first detection model except the last layer, and the tongue detection model is determined by using the first image data set and the false detection image data set for detection of the tongue region.
The first detection model obtained by the embodiment of the invention has a high target-detection recall rate, but also a high false-detection rate on image data containing no tongue. Therefore, after normal image data containing no tongue are acquired, this data set is detected with the first detection model, the data falsely detected by the model are collected, and the falsely detected regions are set as a second target class for target detection.
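The false-detection mining described above amounts to a simple filtering loop, sketched below. `detect` is a hypothetical callable (not from the patent) standing in for a forward pass that returns (box, score) pairs for one image:

```python
def collect_false_detections(model, background_images, detect, conf_thresh=0.5):
    """Run the first detection model over images known to contain no
    tongue; every detection above the confidence threshold is by
    definition a false positive, and is collected as a training sample
    for the second (background) target class."""
    hard_negatives = []
    for image in background_images:
        false_boxes = [(box, score) for box, score in detect(model, image)
                       if score >= conf_thresh]
        if false_boxes:
            hard_negatives.append((image, false_boxes))
    return hard_negatives
```

The confidence threshold is an assumption; in practice it would match whatever threshold the deployed detector uses, so that exactly the detections a user would see are mined as hard negatives.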
Then, the network structure of the first detection model is adjusted: the output layer dimension is changed to 3 × (2 + 5) = 21 while the other structures are kept unchanged; a new model is initialized with the parameters of the first detection model except for the last layer; the false-detection data are added as training data; and the model is retrained to obtain the final tongue detection model.
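Initializing the new model from the first model's parameters, except for the last layer, can be sketched framework-agnostically with a name-to-array mapping. The `"head"` key, the shapes, and the random re-initialization scale are assumptions for illustration, not the patent's actual layer names or values:

```python
import numpy as np

def init_from_first_model(first_params, new_head_shape, head_key="head", seed=0):
    """Copy every parameter of the first detection model except the final
    detection layer, which is re-created with the enlarged output size
    3 * (2 + 5) = 21 for the added background class."""
    rng = np.random.default_rng(seed)
    new_params = {}
    for name, value in first_params.items():
        if name == head_key:
            # Last layer: output channels change from 18 to 21, so the
            # old weights cannot be reused and are re-initialized.
            new_params[name] = rng.normal(0.0, 0.01, size=new_head_shape)
        else:
            # Backbone and intermediate layers: transferred unchanged.
            new_params[name] = value.copy()
    return new_params
```

Transferring everything but the head lets the retraining start from features that already localise tongues well, so only the enlarged classifier needs to learn from scratch.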
FIG. 3 is a schematic diagram of determining a tongue detection model according to an embodiment of the invention. As shown in fig. 3, the process of determining the tongue detection model includes:
(1) a picture data set a containing the tongue and a normal data set B not containing the tongue are acquired.
(2) Data annotation. Data set A is labeled: the tongue region is labeled with the labeling tool LabelImg, and the position of the tongue is marked with a rectangular frame.
(3) Data enhancement. About 40% of the pictures in data set A undergo data enhancement processing to expand the existing data set.
(4) Data scaling. The image is scaled so that its long edge becomes 416. If the scaled height is h and h is less than 416, pixel regions of height 208 − 0.5h and width 416 are filled above and below the image, the pixel value is set to 128, and the coordinates of the labeling frame are adjusted synchronously.
(5) Model construction and training. In order to detect tongues of different proportions and sizes in the picture, 9 anchors of different proportions and sizes are set, and k-means clustering determines the 9 cluster-centre anchors. The backbone adopts DarkNet-53, fusing feature maps at three scales, 13×13, 26×26 and 52×52, for target prediction, with 3 anchors of different sizes per scale, 9 in total. Since there is only one target to detect, the output layer dimension is 3 × (1 + 5) = 18, where 3 denotes the three predicted bounding boxes, 1 denotes the single prediction category, and 5 denotes the predicted target's centre coordinates, width and height, and target score. Model training is performed using 80% of data set A as the training set and 10% as the validation set, yielding model M0.
(6) False-detection data determination. M0 has a high false-detection rate on the non-tongue data set B; the data falsely detected by M0 in data set B are collected, and the falsely detected regions are taken as the second target class for target detection.
(7) The network structure of M0 is modified: the output layer dimension is changed to 3 × (2 + 5) = 21, the other structures are left unchanged, and the new model is initialized with the parameters of M0 except for the last layer. The M0 false-detection data from data set B are added as training data, and the model is retrained to obtain model M1.
According to the embodiment of the invention, the target to be detected is first defined and the basic network model M0 is trained; the regions falsely detected by M0 are then used as a background target to fine-tune the model, iteratively generating a new model. The presence and position of the tongue in a picture are judged intelligently by the deep convolutional network, so real-time detection can be achieved, which is very convenient for use on all kinds of devices.
Fig. 4 is a schematic structural diagram of a tongue region detection system 400 based on deep learning according to an embodiment of the present invention. As shown in fig. 4, a tongue region detection system 400 based on deep learning according to an embodiment of the present invention includes: data processing section 401, clustering section 402, first detection model determining section 403, false detection data acquiring section 404, and tongue detection model determining section 405.
Preferably, the data processing unit 401 is configured to label the acquired image data set including the tongue portion, and pre-process the labeled image data set to acquire the first image data set.
Preferably, the data processing unit 401, labeling the acquired image data set containing the tongue, includes: labeling the tongue region in the acquired image data set with the labeling tool LabelImg, marking the position of the tongue with a rectangular frame.
Preferably, the data processing unit 401, performing preprocessing on the annotated image data set to obtain a first image data set, includes: selecting data in the labeled image data set according to a first preset quantity threshold value to perform data enhancement processing so as to obtain an expanded data set; and carrying out equal-scale scaling and filling processing on the marked image data set and the expanded data set according to a preset image proportion, and synchronously adjusting the coordinate position of the marking frame in the image to obtain a first image data set.
Preferably, the data enhancement processing comprises at least one of: horizontal flipping; clockwise or anticlockwise rotation within a preset angle range threshold; translation by a preset proportion threshold in the vertical and horizontal directions; random cropping of the picture within its edges according to a preset cropping proportion threshold; Gaussian filtering; and scaling.
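As one concrete example from the enhancement list above, a horizontal flip must mirror the labeling box along with the pixels; a minimal sketch (the function name is illustrative):

```python
# One concrete example from the enhancement list above: a horizontal flip must
# mirror the labeling box along with the pixels, or the annotation no longer
# matches the image. Function name is illustrative.
import numpy as np

def hflip_with_box(image, box):
    """Flip an HxWxC image left-right and mirror an (xmin, ymin, xmax, ymax) box."""
    w = image.shape[1]
    flipped = image[:, ::-1]
    xmin, ymin, xmax, ymax = box
    return flipped, (w - xmax, ymin, w - xmin, ymax)

img = np.zeros((480, 640, 3), dtype=np.uint8)
_, new_box = hflip_with_box(img, (100, 50, 300, 200))
print(new_box)  # (340, 50, 540, 200)
```

The rotation, translation, crop, Gaussian filter and scaling transforms would adjust the box coordinates in the same synchronized way.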
Preferably, the clustering unit 402 is configured to set proportional sizes of multiple fixed reference frames, and perform clustering by using a k-means clustering method to obtain multiple clustering center reference frames.
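The k-means step is commonly run on box widths and heights with a 1 - IoU distance rather than Euclidean distance, as in the YOLO family; the patent only names k-means, so this is an assumption. A sketch with made-up box sizes:

```python
# Sketch of clustering fixed reference-frame sizes with k-means, using the
# 1 - IoU distance between (width, height) pairs that the YOLO family uses
# (an assumption; the patent only names k-means). Box sizes are made up.
import numpy as np

def iou_wh(boxes, centers):
    """IoU between (w, h) shapes, treating boxes as sharing a corner."""
    inter = (np.minimum(boxes[:, None, 0], centers[None, :, 0]) *
             np.minimum(boxes[:, None, 1], centers[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centers[:, 0] * centers[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    centers = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        # Nearest centre under 1 - IoU distance = highest IoU.
        assign = np.argmax(iou_wh(boxes, centers), axis=1)
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else centers[i] for i in range(k)])
        if np.allclose(new, centers):
            break
        centers = new
    return centers

boxes = np.array([[30, 40], [32, 42], [100, 120], [98, 118], [200, 260], [210, 250]], float)
anchors = kmeans_anchors(boxes, k=3)
print(sorted(anchors.tolist()))  # three (w, h) cluster-centre reference frames
```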
Preferably, the first detection model determining unit 403 is configured to perform training based on a DarkNet network structure, determine an output layer dimension according to the plurality of clustering center reference frames, and perform training according to the first image data set to determine a first detection model.
Preferably, the training of the first detection model determining unit 403 based on the DarkNet network structure, determining the output layer dimension according to the plurality of cluster center reference frames, and training according to the first image data set to determine the first detection model includes:
the model backbone network adopts DarkNet-53, fusing three feature maps of different scales, 13 × 13, 26 × 26 and 52 × 52, for target prediction, with each scale using 3 fixed reference frames of different sizes, for 9 fixed reference frames in total;
determining the output layer dimension to be 3 × (1 + 5) = 18, wherein 3 denotes the three reference frames used for prediction at each scale, 1 denotes the single prediction category, and 5 denotes the center coordinates, the width and height, and the target score of the predicted target;
and selecting training data of a second preset quantity threshold from the first image data set, using the remaining data as a verification set, and performing model training to determine the first detection model.
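The output-layer arithmetic above can be written out directly; the three scale sizes assume a 416 × 416 input:

```python
# The output-layer arithmetic written out: with 3 reference frames per scale,
# 1 category and 5 box terms (center x, center y, width, height, target
# score), each grid cell predicts 3 * (1 + 5) = 18 values, and the three
# scales give 13x13x18, 26x26x18 and 52x52x18 tensors for a 416x416 input.
def output_layer_dims(anchors_per_scale=3, num_classes=1, scales=(13, 26, 52)):
    depth = anchors_per_scale * (num_classes + 5)
    return depth, [(s, s, depth) for s in scales]

depth, shapes = output_layer_dims()
print(depth)   # 18
print(shapes)  # [(13, 13, 18), (26, 26, 18), (52, 52, 18)]
```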
Preferably, the false detection data acquiring unit 404 is configured to detect an image data set that does not contain a tongue by using the first detection model, so as to acquire a false detection image data set.
Preferably, the tongue detection model determining unit 405 is configured to adjust the network structure of the first detection model, modify the output layer dimension, initialize the model with the parameters of the first detection model except the last layer, and retrain with the first image data set and the false detection image data set to determine a tongue detection model for detecting the tongue region.
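The warm start described here, copying every parameter except the re-shaped last layer, can be sketched framework-agnostically; treating the false-detection background as a second category would give a head depth of 3 × (2 + 5) = 21. All names and shapes below are illustrative:

```python
# Framework-agnostic sketch of the warm start described above: copy every
# parameter of the first model into the adjusted model except the last
# layer, which changed shape and keeps its fresh initialization. Treating
# the false-detection background as a second category gives a head depth of
# 3 * (2 + 5) = 21. Layer names and shapes are illustrative.
import numpy as np

def warm_start(old_params, new_params, last_layer="head"):
    """Return new_params with all layers but `last_layer` taken from old_params."""
    merged = dict(new_params)          # keep the re-initialized head
    for name, value in old_params.items():
        if name != last_layer:
            merged[name] = value       # reuse trained weights elsewhere
    return merged

rng = np.random.default_rng(0)
old = {"backbone": rng.normal(size=(4, 4)), "head": rng.normal(size=(18,))}
new = {"backbone": np.zeros((4, 4)), "head": np.zeros((21,))}
merged = warm_start(old, new)
print(merged["backbone"] is old["backbone"], merged["head"].shape)  # True (21,)
```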
The tongue region detection system 400 based on deep learning according to the embodiment of the present invention corresponds to the tongue region detection method 100 based on deep learning according to another embodiment of the present invention, and is not described herein again.
The invention has been described with reference to a few embodiments. However, as would be apparent to a person skilled in the art from the appended patent claims, other embodiments than those disclosed above are equally possible within the scope of the invention.
Generally, all terms used in the claims are to be interpreted according to their ordinary meaning in the technical field, unless explicitly defined otherwise herein. All references to "a/an/the [ device, component, etc ]" are to be interpreted openly as referring to at least one instance of said device, component, etc., unless explicitly stated otherwise. The steps of any method disclosed herein do not have to be performed in the exact order disclosed, unless explicitly stated.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the above embodiments, those of ordinary skill in the art should understand that modifications and equivalents may be made to the embodiments without departing from the spirit and scope of the invention, which is to be covered by the claims.
Claims (10)
1. A tongue region detection method based on deep learning is characterized by comprising the following steps:
labeling the acquired image data set containing the tongue part, and preprocessing the labeled image data set to acquire a first image data set;
setting the proportional sizes of various fixed reference frames, and clustering by adopting a k-means clustering mode to obtain a plurality of clustering center reference frames;
training based on a DarkNet network structure, determining the dimension of an output layer according to the plurality of clustering center reference frames, and training according to the first image data set to determine a first detection model;
detecting an image data set that does not contain a tongue by using the first detection model to obtain a false detection image data set;
and adjusting the network structure of the first detection model, modifying the dimensionality of an output layer, initializing the model by using the parameters of the first detection model except the last layer, and retraining by using the first image data set and the false detection image data set to determine a tongue detection model for detecting a tongue region.
2. The method of claim 1, wherein said labeling the acquired image dataset containing the tongue comprises:
labeling the tongue region in the acquired image data set containing the tongue by using the labeling tool LabelImg, and marking the position of the tongue in the form of a rectangular box.
3. The method of claim 1, wherein pre-processing the annotated image dataset to obtain the first image dataset comprises:
selecting data from the labeled image data set according to a first preset quantity threshold and performing data enhancement on it to obtain an expanded data set;
and scaling the labeled image data set and the expanded data set in equal proportion to a preset image size with padding, while synchronously adjusting the coordinates of the labeling boxes in the images, to obtain the first image data set.
4. The method of claim 3, wherein the data enhancement process comprises:
at least one of: horizontal flipping; clockwise or anticlockwise rotation within a preset angle range threshold; translation by a preset proportion threshold in the vertical and horizontal directions; random cropping of the picture within its edges according to a preset cropping proportion threshold; Gaussian filtering; and scaling.
5. The method of claim 1, wherein the training based on a DarkNet network structure, determining output layer dimensions from the plurality of cluster center reference boxes, and training from the first image dataset to determine a first detection model comprises:
the model backbone network adopts DarkNet-53, fusing three feature maps of different scales, 13 × 13, 26 × 26 and 52 × 52, for target prediction, with each scale using 3 fixed reference frames of different sizes, for 9 fixed reference frames in total;
determining the output layer dimension to be 3 × (1 + 5) = 18, wherein 3 denotes the three reference frames used for prediction at each scale, 1 denotes the single prediction category, and 5 denotes the center coordinates, the width and height, and the target score of the predicted target;
and selecting training data of a second preset quantity threshold from the first image data set, using the remaining data as a verification set, and performing model training to determine the first detection model.
6. A deep learning based tongue region detection system, the system comprising:
the data processing unit is used for labeling the acquired image data set containing the tongue part and preprocessing the labeled image data set to acquire a first image data set;
the clustering unit is used for setting the proportional sizes of various fixed reference frames and clustering by adopting a k-means clustering mode to obtain a plurality of clustering center reference frames;
the first detection model determining unit is used for training based on a DarkNet network structure, determining the dimensionality of an output layer according to the multiple clustering center reference frames, and training according to the first image data set to determine a first detection model;
the false detection data acquisition unit is used for detecting the image dataset which does not contain the tongue part by using the first detection model so as to acquire a false detection image dataset for false detection;
and the tongue detection model determining unit is used for adjusting the network structure of the first detection model, modifying the dimension of an output layer, initializing the model by using the parameters of the first detection model except the last layer, and retraining by using the first image data set and the false detection image data set to determine a tongue detection model for detecting a tongue region.
7. The system of claim 6, wherein the data processing unit, labeling the acquired image dataset containing the tongue, comprises:
labeling the tongue region in the acquired image data set containing the tongue by using the labeling tool LabelImg, and marking the position of the tongue in the form of a rectangular box.
8. The system of claim 6, wherein the data processing unit pre-processes the annotated image dataset to obtain the first image dataset, comprising:
selecting data from the labeled image data set according to a first preset quantity threshold and performing data enhancement on it to obtain an expanded data set;
and scaling the labeled image data set and the expanded data set in equal proportion to a preset image size with padding, while synchronously adjusting the coordinates of the labeling boxes in the images, to obtain the first image data set.
9. The system of claim 8, wherein the data enhancement process comprises:
at least one of: horizontal flipping; clockwise or anticlockwise rotation within a preset angle range threshold; translation by a preset proportion threshold in the vertical and horizontal directions; random cropping of the picture within its edges according to a preset cropping proportion threshold; Gaussian filtering; and scaling.
10. The system of claim 6, wherein the first detection model determination unit, trained based on a DarkNet network structure, determines output layer dimensions from the plurality of cluster center reference boxes, and is trained from the first image dataset to determine a first detection model, comprises:
the model backbone network adopts DarkNet-53, fusing three feature maps of different scales, 13 × 13, 26 × 26 and 52 × 52, for target prediction, with each scale using 3 fixed reference frames of different sizes, for 9 fixed reference frames in total;
determining the output layer dimension to be 3 × (1 + 5) = 18, wherein 3 denotes the three reference frames used for prediction at each scale, 1 denotes the single prediction category, and 5 denotes the center coordinates, the width and height, and the target score of the predicted target;
and selecting training data of a second preset quantity threshold from the first image data set, using the remaining data as a verification set, and performing model training to determine the first detection model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010017676.7A CN111260608A (en) | 2020-01-08 | 2020-01-08 | Tongue region detection method and system based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111260608A true CN111260608A (en) | 2020-06-09 |
Family
ID=70954122
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010017676.7A Pending CN111260608A (en) | 2020-01-08 | 2020-01-08 | Tongue region detection method and system based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111260608A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112149684A (en) * | 2020-08-19 | 2020-12-29 | 北京豆牛网络科技有限公司 | Image processing method and image preprocessing method for target detection |
CN114445682A (en) * | 2022-01-28 | 2022-05-06 | 北京百度网讯科技有限公司 | Method, device, electronic equipment, storage medium and product for training model |
CN115512241A (en) * | 2022-11-01 | 2022-12-23 | 中国科学院半导体研究所 | Data enhancement method and device for target detection in large-scale sparse sample remote sensing images |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108470138A (en) * | 2018-01-24 | 2018-08-31 | 博云视觉(北京)科技有限公司 | Method for target detection and device |
CN108960076A (en) * | 2018-06-08 | 2018-12-07 | 东南大学 | Ear recognition and tracking based on convolutional neural networks |
CN109063594A (en) * | 2018-07-13 | 2018-12-21 | 吉林大学 | Remote sensing images fast target detection method based on YOLOv2 |
CN109522963A (en) * | 2018-11-26 | 2019-03-26 | 北京电子工程总体研究所 | A kind of the feature building object detection method and system of single-unit operation |
CN110210391A (en) * | 2019-05-31 | 2019-09-06 | 合肥云诊信息科技有限公司 | Tongue picture grain quantitative analysis method based on multiple dimensioned convolutional neural networks |
CN110263660A (en) * | 2019-05-27 | 2019-09-20 | 魏运 | A kind of traffic target detection recognition method of adaptive scene changes |
CN110490073A (en) * | 2019-07-15 | 2019-11-22 | 浙江省北大信息技术高等研究院 | Object detection method, device, equipment and storage medium |
WO2019233297A1 (en) * | 2018-06-08 | 2019-12-12 | Oppo广东移动通信有限公司 | Data set construction method, mobile terminal and readable storage medium |
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200609 |