Disclosure of Invention
The invention provides an interpretable fine-grained image classification method and system based on an improved neural prototype tree, realized through an interpretable fine-grained image classification model, MBC-Prototree. The model improves the extraction of image depth features by adopting a multi-granularity feature fusion network that can extract depth features of different granularities; a new Background Prototype Removing Mechanism (BPRM) is designed to reduce the influence of erroneous background prototypes on image classification; and finally, a new loss function, comprising both a leaf node loss function and a fully connected layer loss function, is designed to enhance the generalization capability of the model.
In order to achieve the above object, the present invention is realized by the following technical scheme:
in a first aspect, the present invention provides an interpretable fine-grained image classification method based on an improved neural prototype tree, comprising:
acquiring an image to be classified;
obtaining a classification result according to the acquired image to be classified and a preset interpretable fine-grained image classification model;
the interpretable fine-grained image classification model comprises a multi-granularity feature extraction layer, a prototype layer and a soft neural binary decision tree layer. The image to be classified passes through the multi-granularity feature extraction layer to obtain a feature representation of the image and generate a depth feature map. The prototype layer calculates the similarity between each prototype and the patches of the depth feature map to find the patch closest to the prototype; each prototype is replaced by its closest latent patch and displayed visually. Taking the prototypes as evidence, the soft neural binary decision tree layer trains the prototypes, removes erroneous prototypes with a background prototype removing mechanism, and makes a prototype path decision over the screened prototypes to obtain the image classification result.
Further, an Inception-v4 network and an Inception-ResNet-v2 network are introduced to obtain multi-granularity feature representations of the image; the features undergo parallel convolution and pooling operations with convolution kernels of different sizes, and a Res2Net network is utilized to combine multiple receptive fields at a fine granularity.
Further, the Euclidean distance between the prototype and each receptive field is calculated by sliding the prototype over the feature map using a generalized convolution operation without bias; the patch in the feature map closest to the prototype is selected by minimization, and the distance between the prototype and its closest patch determines the extent to which the prototype is present in the input image.
Further, the patch nearest to the prototype is selected as a prototype projection to approximate the prototype, each prototype being projected onto the nearest latent feature patch belonging to the same class as that prototype.
Further, the image is divided into patches with the same shape as the prototype layer patches, and a dominant color judgment method is used to select background patches: the three most frequent colors in a patch are compared for similarity, and if these three colors are similar, the image patch is regarded as a background patch. The background patches are hierarchically clustered using color features as clustering vectors, and a representative patch of each cluster is selected to construct a background patch data set.
Further, the soft neural binary decision tree layer comprises a set of internal nodes, a set of leaf nodes and a set of edges. The internal nodes represent prototypes, the leaf nodes represent prediction classes, and the edges represent the routing probabilities from a prototype of the current layer to the prototypes of its child nodes. The probability of routing through a child node is calculated by comparing the latent patches with each prototype, and the prediction class probabilities are obtained by traversing all edges so that all nodes participate in the routing. When the soft neural binary decision tree is trained, each time an internal node is updated, similarity scores between the image patch corresponding to the updated prototype and the background patches in the background patch data set are calculated; if the image patch of the updated prototype is similar to a background patch in the data set, the internal node must be updated again.
Further, model training is carried out using both the leaf node loss function and the fully connected layer loss function; time-varying weights are respectively assigned to the two loss functions, and the classification decision is optimized using these time-varying weights.
In a second aspect, the present invention also provides an interpretable fine-grained image classification system based on an improved neural prototype tree, comprising:
the image acquisition module is configured to acquire images to be classified;
the classification module is configured to obtain a classification result according to the acquired image to be classified and a preset interpretable fine-grained image classification model;
The interpretable fine-grained image classification model comprises a multi-granularity feature extraction layer, a prototype layer and a soft neural binary decision tree layer. The image to be classified passes through the multi-granularity feature extraction layer to obtain a feature representation of the image and generate a depth feature map. The prototype layer calculates the similarity between each prototype and the patches of the depth feature map to find the patch closest to the prototype; each prototype is replaced by its closest latent patch and displayed visually. Taking the prototypes as evidence, the soft neural binary decision tree layer trains the prototypes, removes erroneous prototypes with a background prototype removal mechanism, and makes a prototype path decision over the screened prototypes to obtain the image classification result. The selected prototypes provide evidence for image classification; model training is performed using a leaf node loss function and a fully connected layer loss function, time-varying weights are respectively assigned to the two loss functions, and the image classification decision is optimized to obtain the classification result.
In a third aspect, the present invention also provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the interpretable fine-grained image classification method based on an improved neural prototype tree of the first aspect.
In a fourth aspect, the present invention also provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing, when executing the program, the steps of the interpretable fine-grained image classification method based on an improved neural prototype tree of the first aspect.
Compared with the prior art, the invention has the beneficial effects that:
1. The invention designs an interpretable fine-grained image classification model comprising a multi-granularity feature extraction layer, a prototype layer and a soft neural binary decision tree layer. The image to be classified passes through the multi-granularity feature extraction layer to obtain a feature representation and generate a depth feature map; the prototype layer calculates the similarity between each prototype and the patches of the depth feature map to find the patch nearest to the prototype, replaces each prototype with its nearest latent patch and displays it visually; taking the nearest latent patches as evidence, the soft neural binary decision tree layer trains the prototypes, removes erroneous prototypes with a background prototype removing mechanism, and makes a prototype path decision over the screened prototypes to obtain the image classification result;
2. by introducing the Inception-v4 network and the Inception-ResNet-v2 network, the invention obtains complementary feature extraction: Inception-v4 performs better on wider-level feature extraction and is set with a larger depth feature output, while Inception-ResNet-v2 performs better on deeper-level network features and is set with a smaller depth feature output;
3. in the method, the image is segmented into patches with the same shape as the prototype layer patches, and a dominant color judgment method is used to select background patches: the three most frequent colors in a patch are compared for similarity, and if these three colors are similar, the image patch is regarded as a background patch. The background patches are hierarchically clustered using color features as clustering vectors. When an internal node of the soft neural binary decision tree is updated, the similarity score between the image patch corresponding to the updated prototype and the background patches in the background patch data set is calculated; if the image patch of the updated prototype is similar to a background patch in the data set, the internal node must be updated again. The new background prototype removing mechanism thus reduces the influence of erroneous background prototypes on image classification;
4. The invention designs a new loss function comprising both a leaf node loss function and a fully connected layer loss function, to which time-varying weights are respectively assigned. Combining the leaf node loss and the fully connected loss as the joint objective function of the multi-granularity image classification training process overcomes the tendency of a single loss function to bias the optimization of the classification, and enhances the generalization capability of the model.
Detailed Description
The invention will be further described with reference to the drawings and examples.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the application. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs.
Example 1:
the embodiment provides an interpretable fine-grained image classification method based on an improved neural prototype tree, which comprises the following steps:
acquiring an image to be classified;
obtaining a classification result according to the acquired image to be classified and a preset interpretable fine-grained image classification model;
the interpretable fine-grained image classification model comprises a multi-granularity feature extraction layer, a prototype layer and a soft neural binary decision tree layer. The image to be classified passes through the multi-granularity feature extraction layer to obtain a feature representation of the image and generate a depth feature map. The prototype layer calculates the similarity between each prototype and the patches of the depth feature map to find the patch closest to the prototype; each prototype is replaced by its closest latent patch and displayed visually. Taking the prototypes as evidence, the soft neural binary decision tree layer trains the prototypes, removes erroneous prototypes with a background prototype removing mechanism, and makes a prototype path decision over the screened prototypes to obtain the image classification result.
The method obtains the feature representation of the image through the multi-granularity feature extraction layer to generate a depth feature map; the prototype layer calculates the similarity between each prototype and the patches of the depth feature map to find the patch nearest to the prototype, replaces each prototype with its nearest latent patch and displays it visually; taking the latent patches as evidence, the soft neural binary decision tree layer trains the prototypes, removes erroneous prototypes with a background prototype removing mechanism, and makes a prototype path decision over the screened prototypes to obtain the image classification result. This solves the problems of traditional models in which depth feature extraction is incomplete, depth features cannot be fully mined, and classification errors easily arise.
As shown in fig. 1, the interpretable fine-grained image classification model in this embodiment includes 3 layers: a multi-granularity feature extraction layer (Multi-grained Feature Extraction Network Layer), a prototype layer (Prototype Layer) and a soft neural binary decision tree layer (Soft Neural Binary Decision Tree Layer).
The method obtains the feature representation of the image through the multi-granularity feature extraction layer to generate a depth feature map, and inputs the generated depth feature map into the prototype layer. The prototype layer calculates the similarity between prototypes and patches and finds the patch nearest to each prototype to provide evidence for image classification. The soft neural binary decision tree layer trains the prototypes and makes prototype path decisions, with a background prototype removing mechanism added to optimize those decisions; a fully connected layer loss function and a leaf node loss function are combined to improve generalization during model training. Finally, the multi-granularity feature extraction layer, the prototype layer and the soft neural binary decision tree layer are combined to fuse the networks into a probability prediction.
The specific implementation process comprises the following steps:
First, an input image x enters the multi-granularity feature extraction layer f(x; ω), resulting in D two-dimensional (H×W) feature maps, where ω represents the trainable parameters of f. The conversion is expressed as follows:
f(x; ω) → R^(H×W×D) (1)
Conventional convolution networks ignore the importance of parallel processing of image features. To solve this problem, the present embodiment introduces the Inception-v4 network and the Inception-ResNet-v2 network; the Inception modules specific to the two networks are shown in fig. 2.
Features passing through the Inception modules can be convolved and pooled in parallel by convolution kernels of different sizes. The Inception-v4 network performs better in extracting wider-level network features and is set with a larger depth feature output, while Inception-ResNet-v2 performs better in extracting deeper-level network features and is set with a smaller depth feature output. By introducing Inception-v4 and Inception-ResNet-v2, depth features at both deeper and wider levels can be extracted, and a better feature extraction effect is obtained.
The Res2Net network can combine multiple receptive fields at a fine granularity without increasing the number of layers, which improves the effective receptive field and avoids redundant information. The kernel module architecture of Res2Net is shown in fig. 3. After the image passes a 1×1 convolution, the feature map is divided into 4 parts. The first path is simple: x_1 is not processed and is passed directly to y_1. In the second path, x_2 is split into two branches after a 3×3 convolution: one branch continues forward to y_2, and the other is passed to x_3. The third path thus receives the information of the second path; after a 3×3 convolution, x_3 is again split into two branches, one continuing to y_3 and the other passed to x_4. The fourth path receives the information of the third path, and x_4 is passed to y_4 after a 3×3 convolution.
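The split-and-forward wiring described above can be sketched in a few lines. This is an illustrative numpy toy, not the actual Res2Net implementation: the `convs` callables stand in for the 3×3 convolutions.

```python
import numpy as np

def res2net_split(x, convs):
    """Toy sketch of the Res2Net hierarchical split described above.

    x is a (C, H, W) feature map, split into 4 channel groups x1..x4.
    convs is a list of 3 callables standing in for the 3x3 convolutions:
    y1 = x1 (identity), y2 = conv(x2), y_i = conv(x_i + y_{i-1}) for i > 2.
    """
    xs = np.split(x, 4, axis=0)
    ys = [xs[0]]                      # first group passes through unchanged
    for i, conv in enumerate(convs):
        inp = xs[i + 1] if i == 0 else xs[i + 1] + ys[-1]
        ys.append(conv(inp))          # each y_i also feeds the next group
    return np.concatenate(ys, axis=0)
```

Because each group's output is added into the next group's input before its convolution, later groups see progressively larger receptive fields, which is the effect the text attributes to Res2Net.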
The image is processed into depth feature blocks of different sizes by the three different multi-granularity feature extraction layers, so that features of different granularities can be extracted and the prototypes obtain receptive fields of different sizes during calculation. The multi-granularity feature extraction layer serves the later comparison between prototypes and depth features and the visualization of prototypes, overcoming the limitation of the single receptive field of the neural prototype tree.
Then, the feature map generated by the multi-granularity feature extraction layer is input into the prototype layer, which finds the patch nearest to each prototype to provide evidence for image classification. A prototype is defined as a trainable tensor of shape (H_1×W_1×D), where H_1 ≤ H and W_1 ≤ W. A generalized convolution form without bias can be used, in which each prototype p_n ∈ P acts as a kernel by sliding over the feature map and calculating the Euclidean distance between p_n and its current receptive field z of shape (H_1×W_1×D), called a patch, as shown in fig. 4.
Minimum pooling is applied to select the patch of shape (H_1×W_1×D) closest to prototype p_n, where the calculation formula is as follows:
z̃ = argmin_{z ∈ patches(f(x; ω))} ||z − p_n||_2 (2)
The distance ||z̃ − p_n||_2 between the nearest latent patch z̃ and the prototype p_n determines the extent to which the prototype is present in the input image.
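The sliding-window distance computation with minimum pooling described above can be sketched as follows — a minimal numpy illustration; the real model would compute this as a generalized convolution on GPU.

```python
import numpy as np

def prototype_presence(feature_map, prototype):
    """Slide a (H1, W1, D) prototype over a (H, W, D) feature map, compute
    the Euclidean distance to every patch, and min-pool over all positions.
    Returns the top-left index of the nearest patch and its distance, which
    indicates how strongly the prototype is present in the image."""
    H, W, _ = feature_map.shape
    H1, W1, _ = prototype.shape
    best_d, best_ij = None, None
    for i in range(H - H1 + 1):
        for j in range(W - W1 + 1):
            patch = feature_map[i:i + H1, j:j + W1, :]
            d = np.linalg.norm(patch - prototype)   # Euclidean distance
            if best_d is None or d < best_d:
                best_d, best_ij = d, (i, j)
    return best_ij, best_d
```

A small distance at the returned location means the prototype is strongly present in the input image.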
The prototype is visualized after being selected. To support the visualization of decision interpretability, the prototype is projected using the same strategy as ProtoPNet: the patch nearest to the prototype is selected as the prototype projection, approximating the prototype and thereby achieving prototype visualization. Each prototype is projected onto the nearest latent feature patch of the same class as that prototype, so that any prototype contributing to an image classification decision can be viewed. The projection performs the update:
p_n ← argmin_{z̃ ∈ Z_n} ||z̃ − p_n||_2 (3)
where Z_n denotes the set of latent patches of the training images belonging to the same class as p_n.
In the present embodiment, x̃_n is used to represent the training image containing the nearest patch z̃_n, so that prototype p_n can be visualized as a patch of x̃_n. Passing x̃_n through f creates a two-dimensional similarity map S_n containing the similarity scores between p_n and all patches of f(x̃_n; ω):
S_n(i, j) = exp(−||patches(f(x̃_n; ω))_(i, j) − p_n||_2) (4)
wherein (i, j) indexes the patches of f(x̃_n; ω). S_n is upsampled to the input shape of x̃_n by bicubic interpolation, and p_n is visualized as the region of x̃_n at the same location as the nearest latent patch, as shown in fig. 5.
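The visualization step — building a similarity map over all patches and upsampling it to the input size — might look like the sketch below. Nearest-neighbor upsampling via `np.kron` stands in for the bicubic interpolation mentioned above, which would need an image library.

```python
import numpy as np

def similarity_map(feature_map, prototype):
    """2-D map of similarity scores between a prototype and every patch of
    a training image's feature map, using similarity = exp(-distance)."""
    H, W, _ = feature_map.shape
    H1, W1, _ = prototype.shape
    S = np.zeros((H - H1 + 1, W - W1 + 1))
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            d = np.linalg.norm(feature_map[i:i + H1, j:j + W1, :] - prototype)
            S[i, j] = np.exp(-d)    # distance 0 -> similarity 1
    return S

def upsample_nearest(S, factor):
    """Nearest-neighbor upsampling; a stand-in for bicubic interpolation."""
    return np.kron(S, np.ones((factor, factor)))
```

The argmax of the upsampled map marks the image region where the prototype is visualized.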
Finally, the output of the prototype layer is used as input to the soft neural binary decision tree layer. The soft neural binary decision tree layer consists of a set of internal nodes N, a set of leaf nodes L and a set of edges E. The internal nodes represent tensor prototypes, the leaf nodes represent prediction classes, and the edges represent the routing probabilities from the current-layer prototypes to the child-node prototypes. In the soft neural binary decision tree, the number of prototypes to learn, i.e. |P|, depends on the size of the tree. A binary tree structure is initialized by defining a maximum height h, which creates 2^h leaves and 2^h − 1 prototypes. Thus, the computational complexity of learning the prototypes P grows exponentially with the height h of the tree.
Some prototypes of ProtoTree appear to focus on the background, meaning that ProtoTree may exploit a learned bias, since such prototypes are prone to prototype path decision errors. For example, ProtoTree may use green leaves to distinguish between gray catbirds and black gulls. The present embodiment therefore designs a new Background Prototype Removal Mechanism (BPRM) that reduces the impact of erroneous prototype selection. First, the images in the data set are divided into patches of the same shape as the prototype layer patches, and background patches are selected using a dominant color judgment method. The colors of a background patch are typically relatively uniform, so the three most frequent colors in an image patch are selected and compared for similarity. If these three colors are similar, the image patch can be considered a background patch. The background patches are then hierarchically clustered: color features of the R, G and B channels are used as cluster vectors, and a multi-dimensional histogram is calculated. This embodiment sets a distance threshold to control the clustering process. The image patch closest to the center of each cluster is then selected as the representative patch of that cluster and used to construct the background patch data set. The principle diagram of the clustering process is shown in fig. 6.
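The dominant-color test can be sketched as below. The color quantization step and the tolerance `tol` are illustrative assumptions, since the text does not give the exact similarity threshold used.

```python
import numpy as np

def is_background_patch(patch_rgb, top_k=3, tol=30.0):
    """Hypothetical dominant-color test sketched above: quantize the patch,
    take the three most frequent colors, and call the patch 'background'
    if those colors are mutually similar (close in RGB space)."""
    pixels = patch_rgb.reshape(-1, 3)
    quant = (pixels // 32) * 32      # coarse quantization before counting
    colors, counts = np.unique(quant, axis=0, return_counts=True)
    top = colors[np.argsort(-counts)[:top_k]].astype(float)
    # all pairwise distances among the top colors must fall under tol
    for a in range(len(top)):
        for b in range(a + 1, len(top)):
            if np.linalg.norm(top[a] - top[b]) > tol:
                return False
    return True
```

Patches passing this test would then be clustered by their color histograms, and one representative per cluster kept for the background patch data set.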
By clustering the background patches, patches of the same type are grouped together; as shown in fig. 6, the gray lake-water patches, the green grass patches and the blue sky patches each form their own cluster.
During prototype tree training, each time an internal node is updated, the similarity between the image patch corresponding to the updated prototype and the background patches in the background patch data set is computed; if the updated prototype's image patch is highly similar to a background patch, the learned prototype is an erroneous prototype, and the internal node of the tree should be updated again.
In the soft neural binary decision tree, node n adopts a soft routing mode: the fuzzy weight for routing z to the two child nodes lies in [0, 1] and is given a probabilistic interpretation. Under this interpretation, defined in terms of z̃ and p_n, the probability that sample z is routed through the right edge of internal node n is as follows:
p_{e(n, n.right)}(z) = exp(−||z̃ − p_n||_2) (5)
where z represents the depth feature map and p_n represents the prototype of node n. The expression indicates the extent to which sample z reaches the right child node of node n.
Therefore, the probability of routing sample z through the left edge of the internal node can be derived as follows:
p_{e(n, n.left)}(z) = 1 − p_{e(n, n.right)}(z) (6)
Sample z traverses all edges and reaches each leaf node l ∈ L with a certain probability. The path P_l represents the edge sequence from the root node to leaf node l, and the probability of sample z arriving at leaf l, denoted π_l, is the product of the probabilities of the edges on path P_l. The probability function of sample z arriving at leaf l is formulated as follows:
π_l(z) = ∏_{e ∈ P_l} p_e(z) (7)
Each leaf node l ∈ L carries a trainable parameter k_l representing the distribution over the K classes that needs to be learned in the leaf. The softmax function σ normalizes k_l into the class probability distribution of leaf l. Traversing the latent representation z = f(x; ω) across all edges in the tree makes all leaves contribute to the final prediction, giving the predicted class probability distribution ŷ for the input image x:
ŷ(x) = ∑_{l ∈ L} π_l(z) · σ(k_l) (8)
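The soft routing and leaf aggregation described above can be combined into a small sketch for a perfect binary tree whose internal nodes are stored in heap order; the right-edge probabilities are assumed to be precomputed from the prototype distances. This is an illustrative toy, not the paper's implementation.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def tree_predict(right_probs, leaf_params):
    """right_probs[n] is the probability of taking the right edge at
    internal node n (heap order, root at index 0, perfect binary tree);
    leaf_params[l] is the trainable k_l of leaf l. The arrival probability
    pi_l is the product of edge probabilities along the root-to-leaf path,
    and the prediction is sum_l pi_l * softmax(k_l)."""
    n_leaves = len(right_probs) + 1
    pred = np.zeros_like(leaf_params[0], dtype=float)
    for leaf in range(n_leaves):
        # walk from the root to this leaf, multiplying edge probabilities
        pi, node, lo, hi = 1.0, 0, 0, n_leaves
        while hi - lo > 1:
            mid = (lo + hi) // 2
            if leaf < mid:                       # go left
                pi *= 1.0 - right_probs[node]
                node, hi = 2 * node + 1, mid
            else:                                # go right
                pi *= right_probs[node]
                node, lo = 2 * node + 2, mid
        pred += pi * softmax(leaf_params[leaf])
    return pred
```

Because the left and right edge probabilities at every node sum to 1, the leaf arrival probabilities sum to 1 and the output is itself a probability distribution.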
Meanwhile, the contribution of the leaf nodes must also be learned, i.e. the distribution of each leaf node, which is a global learning problem. The update function of the leaf node parameter k, applied alongside the learning of ω and P, is:
k_l^(t) = k_l^(t−1) + ∑_{(x, y)} (σ(k_l^(t−1)) ⊙ y ⊙ π_l(z)) ⊘ ŷ(x) (9)
wherein t represents a training epoch, ⊙ represents element-wise multiplication, ⊘ represents element-wise division, and k_l is a vector of size K representing the class distribution in leaf l.
Model Training:
Prior to model training, the Inception-v4, Inception-ResNet-v2 and Res2Net models in the multi-granularity feature extraction layer need to be pre-trained. In the training process, the three networks Inception-v4, Inception-ResNet-v2 and Res2Net process the image in parallel, extracting multi-granularity feature patches of different sizes. Next, the patch closest to each prototype is found at the prototype layer, providing evidence for image classification. Finally, the soft neural binary decision tree layer carries out the prototype path decision and predicts the class probability distribution of the input image. During training, the prototypes p_n ∈ P, the class distributions of the leaves and the CNN parameters ω are learned.
In order to improve the generalization capability of the model, the leaf node loss function and the fully connected layer loss function are used for model training simultaneously, and time-varying weights are used in the classification decision. Different time-varying weights are respectively assigned to the fully connected loss and the leaf node loss; suitable time-varying weight parameters, found through repeated experiments, yield higher image classification accuracy. The loss function of the model is as follows:
L = α_t · L_leaf(k, y_label) + β_t · CrsEnt(y_full, y_label) (10)
wherein α_t represents the time-varying weight of the tree's leaf node loss, β_t represents the time-varying weight of the fully connected loss, k represents the leaf node parameter and t represents the number of rounds; L_leaf(k, y_label) represents the leaf node loss and CrsEnt(y_full, y_label) represents the loss of the fully connected layer.
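The joint objective with time-varying weights can be sketched as follows. Both predictions are taken here as class-probability vectors and both losses as cross-entropy; the schedule for α_t and β_t is left to the caller, since the text determines the weights experimentally.

```python
import numpy as np

def combined_loss(pred_tree, pred_fc, label, alpha_t, beta_t, eps=1e-12):
    """Time-varying combination of the tree's leaf-node loss (weight
    alpha_t) and the fully connected layer's loss (weight beta_t).
    pred_tree / pred_fc are probability vectors, label is a class index;
    eps guards the logarithm against zero probabilities."""
    leaf_loss = -np.log(pred_tree[label] + eps)   # cross-entropy, tree head
    fc_loss = -np.log(pred_fc[label] + eps)       # cross-entropy, FC head
    return alpha_t * leaf_loss + beta_t * fc_loss
```

A typical schedule would shift weight between the two terms as training progresses, which is what makes the weights "time-varying".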
Finally, utilizing the architectural advantages of the different complementary networks, the fusion probability prediction score is calculated from the results of the three feature extraction networks processing the image. This fusion strategy effectively avoids the limitation of a single network selecting certain prototypes inaccurately, so that the prototypes selected through training are more accurate. The fusion probability prediction score formula is as follows:
ŷ_fusion(x) = (1/n) ∑_{i=1}^{n} ŷ_i(x) (11)
wherein ŷ_i represents the predicted class probability distribution of the i-th feature extraction network, n represents the number of backbone networks (here n = 3), and ŷ_i(x) is the prediction score of a single network.
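The fusion step above is a simple average of the class-probability distributions produced by the n backbone networks; a minimal sketch:

```python
import numpy as np

def fuse_predictions(dists):
    """Average the predicted class-probability distributions of the n
    backbone networks (here n = 3) into one fused prediction score.
    Since each input sums to 1, the fused vector also sums to 1."""
    return np.mean(np.asarray(dists), axis=0)
```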
In the experiments, MBC-Prototree was applied to two authoritative fine-grained datasets, CUB-200 and FGVC-air, and one medical dataset, chest x-ray. Each of the three datasets was divided into a training set, a test set and a validation set, and the validity of the model was verified against a series of baseline models.
The model in this example was compared with other baseline models on the CUB-200, FGVC-air and chest x-ray datasets, with significant improvements in the experimental results.
Example 2:
The present embodiment provides an interpretable fine-grained image classification system based on an improved neural prototype tree, comprising:
the image acquisition module is configured to acquire images to be classified;
the classification module is configured to obtain a classification result according to the acquired image to be classified and a preset interpretable fine-grained image classification model;
the interpretable fine-grained image classification model comprises a multi-granularity feature extraction layer, a prototype layer and a soft neural binary decision tree layer. The image to be classified passes through the multi-granularity feature extraction layer to obtain a feature representation of the image and generate a depth feature map. The prototype layer calculates the similarity between each prototype and the patches of the depth feature map to find the patch closest to the prototype; each prototype is replaced by its closest latent patch and displayed visually. Taking the prototypes as evidence, the soft neural binary decision tree layer trains the prototypes, removes erroneous prototypes with a background prototype removing mechanism, and makes a prototype path decision over the screened prototypes to obtain the image classification result.
The working method of the system is the same as the interpretable fine-grained image classification method based on the improved neural prototype tree of embodiment 1, and is not described here again.
Example 3:
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the interpretable fine-grained image classification method based on an improved neural prototype tree described in embodiment 1.
Example 4:
the present embodiment provides an electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing, when executing the program, the steps of the interpretable fine-grained image classification method based on an improved neural prototype tree of embodiment 1.
The above description is only a preferred embodiment of the present invention and is not intended to limit the invention; various modifications and variations can be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in its protection scope.