Disclosure of Invention
In order to overcome the defects of the prior art, the present invention aims to provide a hierarchical semantic embedding model for fine-grained object recognition and an implementation method thereof, so as to solve the problem of the high cost of labeling additional information in fine-grained recognition schemes that rely on such information to guide learning.
To achieve the above and other objects, the present invention provides a hierarchical semantic embedding model for fine-grained object recognition, comprising:
a backbone network for extracting shallow features of the input image and outputting them to each branch network in the form of a feature map;
and branch networks, each of which further extracts deep features from the shallow feature map output by the backbone network, so that the output feature map suits the recognition task of the level to which that branch corresponds; a semantic knowledge embedding mechanism allows upper-level semantic knowledge to guide the feature learning of the lower-level branch networks.
Preferably, the branch network performs a secondary characterization of the feature map from the backbone network to generate a new branch feature map, learns an attention weight map by combining the score vector predicted at the upper level with the branch feature map of the lower level, applies the attention weight map to the branch feature map to generate a weighted branch feature map, and from it predicts the label distribution of the corresponding hierarchy level.
Preferably, the backbone network adopts the layer4_x layer of the ResNet-50 network structure and all layers before it, comprising 41 parameter layers; the parameters of the backbone network are shared by the prediction networks of all levels.
Preferably, the branch network includes:
a deep feature extraction submodule for performing deep feature extraction on the feature map output by the backbone network and outputting both a feature expression guided by upper-level semantic knowledge and an unguided feature expression;
an upper-level semantic knowledge embedding submodule for mapping the score vector s_{i-1} predicted at the upper level into a semantic knowledge expression vector through a fully connected layer, splicing this vector with each site on the W × H plane of the feature map output by the deep feature extraction submodule (W and H denote width and height, respectively), learning an attention coefficient vector from the spliced feature map through an attention model, and applying the attention coefficient vector to the feature map output by the deep feature extraction submodule to obtain a weighted feature map;
and a score fusion submodule for outputting the corresponding score vector by performing a score fusion operation on the feature maps output by the upper-level semantic knowledge embedding submodule and the deep feature extraction submodule.
Preferably, the deep feature extraction submodule adopts the layer5_x layer structure of the ResNet-50 network, the layer5_x structure consisting of 3 residual modules; the layer5_x structure is instantiated twice, one copy serving the upper-level semantic knowledge embedding submodule and the other serving the expression of global features.
Preferably, the attention model uses two successive fully connected layers to map each position point on the W × H plane of the spliced feature map step by step into the corresponding dimensions, finally obtaining the attention coefficient vector.
Preferably, the score fusion sub-module performs the score fusion process as follows:
S=(fc_1+fc_2+fc_cat)/3
where fc_1, fc_2 and fc_cat are all c × 1 dimensional vectors; the former two are obtained by passing the feature maps output by the upper-level semantic knowledge embedding submodule and the deep feature extraction submodule through respective fully connected layers, and the latter is obtained by concatenating fc_1 and fc_2 and passing the result through a fully connected layer fc_concate, giving the same dimension as fc_1 and fc_2.
Preferably, for the branch network of the top-level category, except for the last fully connected layer, which corresponds to the number of categories at that level, the parameter settings of the other layers are consistent with the original ResNet-50 network.
In order to achieve the above object, the present invention further provides a method for implementing a hierarchical semantic embedding model for fine-grained object recognition, comprising the following steps:
step S1, carrying out hierarchical labeling on each piece of training data;
step S2, adopting a weighted combination of the classification loss function and the regularization constraint loss function as the objective function for optimizing the HSE model, and training step by step the branch networks corresponding to each level, from the 1st-level category to the Nth-level category;
and step S3, after all branch networks have been preliminarily trained, jointly optimizing all parameters of the complete HSE model.
Preferably, the optimization objective function of the branch network is the weighted combination

L = L_cls + γ · L_reg

where γ is a balance parameter for balancing the impact of the classification loss function term L_cls and the regularization constraint loss function term L_reg on the network parameters.
Compared with the prior art, the hierarchical semantic embedding model for fine-grained object recognition and its implementation method adopt the hierarchical structure of object categories as semantic information and embed this information into the feature expression of a deep neural network model, thereby solving the problem of the high cost of labeling additional information in fine-grained recognition schemes that rely on such information to guide learning, and reducing the complexity of the model.
Detailed Description
Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the disclosure herein, which describes the invention by way of specific embodiments in conjunction with the accompanying drawings. The invention is capable of other and different embodiments, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
FIG. 1 is a system architecture diagram of the hierarchical semantic embedding model for fine-grained object recognition according to the present invention. In the invention, the Hierarchical Semantic Embedding model (HSE for short) covers three aspects: image depth feature extraction, semantic-knowledge-embedded expression learning, and regularization of the prediction-result semantic space by semantic knowledge. The HSE model is based on deep learning technology: the whole HSE framework relies on a deep neural network for deep expression learning, and it exploits hierarchical semantic knowledge in two ways, namely embedding semantic knowledge during feature expression and using semantic knowledge to regularize the prediction results during model training. Specifically, as shown in FIG. 1, the hierarchical semantic embedding (HSE) model for fine-grained object recognition according to the present invention comprises:
the backbone network 1, used for extracting shallow features of an input image and outputting them to each branch network 2 in the form of a feature map; that is, the backbone network 1 performs a preliminary extraction of image features from the input image and passes them to the branch networks 2 as a feature map;
and the branch networks 2, used for further extracting deep features from the shallow feature map output by the backbone network, so that the output feature map suits the recognition task of the level to which each branch corresponds; a semantic knowledge embedding mechanism allows upper-level semantic knowledge to guide the feature learning of the lower-level branch networks. That is, the feature map output by the backbone network 1 is fed into the branch network 2 of each hierarchy level for further feature expression and output as a feature vector, from which a softmax classifier computes the predicted label distribution of that level.
in the invention, a semantic knowledge embedding mechanism is embodied in a branch network 2, the branch network 2 adopts an attention mechanism guided by semantic knowledge, specifically, the branch network 2 firstly carries out secondary characterization expression on a feature map from a main network 1 to generate a new branch feature map, which is essentially a stack of a plurality of feature maps and is a 3-dimensional tensor, the branch feature map learns an attention weight map by combining a score vector predicted by a superior level and a branch feature map predicted by a subordinate level, and is essentially a stack of a plurality of feature maps and is a three-dimensional tensor, the attention weight map represents that each space position of the new feature map generated by the branch network has an important degree for identifying a target class, the higher the discriminant is, more attention is attracted, and the weight corresponding to the corresponding position of the weight map is larger, the weight map is applied to the branch feature map to generate a weighted branch feature map, so as to predict the label distribution of the hierarchy type.
Therefore, the trunk-branch multi-branch network structure reduces computation overhead by sharing shallow network parameters, while the multiple independent branches allow the model to accommodate the optimization targets of different tasks.
It should be noted here that the regularization of the prediction-result semantic space by semantic knowledge is embodied in the training process of the HSE model: the present invention uses the upper-level prediction score vector as a soft target to constrain the lower-level prediction results to conform to the semantic rules of the classification tree, thereby regularizing the semantic space of the lower-level predictions; this will be described in detail later.
Specifically, the following is a rough workflow of the hierarchical semantic embedding model of the present invention:
(1) input an image I;
(2) the backbone network Tr extracts a feature map of the image I, denoted f_I;
(3) f_I is input into the top-level branch network Br_1;
(4) Br_1 performs a forward computation on f_I to obtain the prediction score vector s_1 of the top-level category;
(5) for each level i from 2 to n:
(5.1) f_I is input into the i-th level branch network Br_i;
(5.2) the prediction score vector s_{i-1} of the upper-level category is input into Br_i;
(5.3) under the guidance of s_{i-1}, Br_i performs a forward computation on f_I to obtain the prediction score vector s_i of the i-th level.
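The workflow above can be summarized in the following minimal sketch (the trunk and branch callables are assumed interfaces for illustration, not the invention's exact code):

```python
def hse_forward(image, trunk, branches):
    """Minimal sketch of steps (1)-(5): the trunk extracts the shared
    feature map f_I, the top-level branch predicts s_1, and each lower
    branch Br_i predicts s_i under the guidance of s_{i-1}."""
    f_I = trunk(image)                      # step (2)
    scores = [branches[0](f_I)]             # steps (3)-(4): s_1
    for br in branches[1:]:                 # step (5): levels 2..n
        scores.append(br(f_I, scores[-1]))  # (5.1)-(5.3): s_i from f_I and s_{i-1}
    return scores
```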
Compared with the prior art, the accuracy of the invention on the Caltech-UCSD Birds dataset is 1.6% higher than the previous best algorithm, and its accuracy on the VegFru dataset is 2.3% higher than the previous best algorithm. In addition, on the Caltech-UCSD Birds dataset, the invention achieves accuracy equivalent to the previous best algorithm while saving 20% of the training data.
The invention will be further illustrated by the following specific examples:
Example:
1. Hierarchical annotation of data
Taking images of birds as an example, hierarchical annotation information must be prepared in addition to the images. For example, if birds are labeled at 4 category levels, i.e. order, family, genus and species, each piece of training/testing data provided should include: an image, an order-level category label, a family-level category label, a genus-level category label and a species-level category label.
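As an illustration only (the field names and the example bird are hypothetical, not the invention's prescribed format), one such hierarchically annotated record might be organized as:

```python
# A hypothetical hierarchically annotated training record.
sample = {
    "image": "images/0001.jpg",
    "order": "Procellariiformes",   # order-level category label
    "family": "Diomedeidae",        # family-level category label
    "genus": "Phoebastria",         # genus-level category label
    "species": "Laysan Albatross",  # species-level category label
}
```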
2. Implementation of HSE model
The HSE model includes a backbone network (trunk net) and branch networks (branch nets). The backbone network mainly extracts shallow features of the input image. The branch network has two functions: first, it further extracts deep features from the shallow image feature map output by the backbone network, so that the output feature map suits the recognition task of the corresponding level; second, it realizes the guidance of upper-level semantic knowledge over lower-level branch feature learning by introducing a knowledge embedding module. The trunk-branch multi-branch network structure reduces computation cost by sharing shallow network parameters, while the multiple independent branches allow the model to accommodate the optimization targets of different tasks. The network structures of the backbone network and the branch networks are described in detail below.
1) Backbone network
In the embodiment of the invention, the backbone network is built on the layer structure of a residual network; FIG. 2 shows the network structure of ResNet-50 for comparison. In the figure, conv1 is a single convolution layer, and layer2_x to layer5_x are stages formed by stacking several residual modules, each comprising several convolution layers. In ResNet-50, layer2_x to layer5_x consist of 3, 4, 6 and 3 residual modules respectively; each residual module contains 3 convolution layers, for 48 layers in total, which together with the bottom conv1 and the output fully connected layer fc constitute the 50-layer network structure.
The backbone network of the HSE model of the invention only adopts the ResNet-50 structure up to and including layer4_x, comprising 41 parameter layers. Table 1 lists the specific network parameters of the backbone network. Since the HSE model predicts categories at multiple levels, the parameters of the backbone network are shared by the prediction networks of all levels. Given an input picture, the backbone network performs a preliminary shallow feature extraction and outputs the result as a feature map. In the specific embodiment of the present invention, for a picture with an input resolution of 448 × 448, the spatial dimension of the backbone network's output feature map is 28 × 28.
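As a sketch of this construction (assuming a PyTorch/torchvision implementation, whose stage naming differs: the patent's layer4_x corresponds to torchvision's layer3), the truncated backbone could be built as follows:

```python
import torch
import torchvision

# Truncate ResNet-50 after the patent's layer4_x stage (torchvision's
# "layer3"): a 448x448 input then yields a 1024-channel 28x28 feature
# map that is shared by all branch networks.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,
)
x = torch.randn(8, 3, 448, 448)  # batch size 8, as in the embodiment
f_I = backbone(x)
print(f_I.shape)                 # torch.Size([8, 1024, 28, 28])
```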
TABLE 1. Key parameters of the HSE model backbone network
2) Branch network
FIG. 3 compares ResNet-50 with the branch network structure of the present invention. In the branch network structure, each hierarchy level of categories corresponds to one branch network. Specifically, the branch network includes: a deep feature extraction submodule 201, an upper-level semantic knowledge embedding submodule 202 and a score fusion submodule 203.
To maintain consistency with the ResNet-50 network structure and facilitate fair comparison in subsequent experiments, the deep feature extraction submodule 201 follows the layer5_x layer structure of ResNet-50. In the embodiment of the present invention, the layer5_x structure consists of 3 residual modules and is used to perform deep feature extraction on the feature map output by the backbone network, producing both a feature expression guided by upper-level semantic knowledge (emphasizing local discrimination) and an unguided feature expression (emphasizing global discrimination). Specifically, the feature map output by the backbone network (28 × 28 resolution) is taken as input and, after the layer5_x operations of the deep feature extraction submodule 201, a feature map of size 14 × 14 is output. The dimension of the output feature map is actually n × C × W × H; in the specific embodiment of the present invention, n = 8 is the batch size, C = 2048 is the number of channels, and W = H = 14 are the width and height. Note in particular that, in a branch network, the layer5_x structure is instantiated twice: one copy serves the upper-level semantic knowledge embedding submodule and the other serves the expression of global features. To distinguish the two layer5_x structures, the former is denoted φ_i(·) and the latter ψ_i(·); φ_i(·) and ψ_i(·) are independent of each other and do not share parameters.
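A sketch of this duplication (again mapping the patent's layer5_x to torchvision's layer4; this also matches the pretrained initialization described in step S2 below):

```python
import copy
import torch
import torchvision

# Two independent, non-shared copies of ResNet-50's last stage (the
# patent's layer5_x): phi feeds the knowledge-guided path, psi the
# global (unguided) path.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
phi = copy.deepcopy(resnet.layer4)
psi = copy.deepcopy(resnet.layer4)
f_I = torch.randn(8, 1024, 28, 28)     # backbone output, as sketched above
print(phi(f_I).shape, psi(f_I).shape)  # both torch.Size([8, 2048, 14, 14])
```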
In the upper-level semantic knowledge embedding submodule 202, the score vector s_{i-1} predicted at the upper level is first mapped into a 1024-dimensional semantic knowledge expression vector through a fully connected layer. This vector is then spliced with each site on the W × H plane of the feature map output by φ_i(·); a concatenation symbol in FIG. 4 indicates this splicing operation. In implementation, the knowledge expression vector may simply be replicated over the W × H plane for convenience. FIG. 4 illustrates the above process.
The spliced feature map then learns an attention coefficient vector through an attention model α(·). FIG. 5 illustrates the processing of the attention model: each position on the W × H plane of the spliced feature map is mapped by two successive fully connected layers into 1024 and then 2048 dimensions, finally yielding the attention coefficient vector (as shown in the rightmost part of FIG. 5).
The obtained attention coefficient vector then acts on the feature map output by φ_i(·): the multiplication symbol in FIG. 5 indicates multiplying the attention coefficient vector with the values at each corresponding location of the output feature map, yielding the weighted feature map f_i.
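A minimal sketch of submodule 202 follows, under two assumptions not fixed by the text: the per-position fully connected layers are realized as 1 × 1 convolutions, and the attention coefficients are squashed with a sigmoid (the normalization is not specified above).

```python
import torch
import torch.nn as nn

class UpperSemanticEmbedding(nn.Module):
    """Sketch of submodule 202: embed the upper-level score vector,
    tile it over the W x H plane, concatenate it with the phi_i feature
    map, and learn per-position attention coefficients alpha(.)."""
    def __init__(self, num_parent_classes, channels=2048, embed_dim=1024):
        super().__init__()
        self.embed = nn.Linear(num_parent_classes, embed_dim)
        # two per-position FC layers (1x1 convs): 2048+1024 -> 1024 -> 2048
        self.alpha = nn.Sequential(
            nn.Conv2d(channels + embed_dim, embed_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, channels, kernel_size=1),
            nn.Sigmoid(),  # assumption: coefficients squashed to (0, 1)
        )

    def forward(self, feat, parent_scores):  # feat: (n, 2048, 14, 14)
        k = self.embed(parent_scores)        # (n, 1024)
        k = k[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        a = self.alpha(torch.cat([feat, k], dim=1))
        return feat * a                      # weighted feature map f_i
```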
In the score fusion submodule 203, the feature maps output by φ_i(·) and ψ_i(·) undergo a score fusion operation to output the corresponding score vector.
Specifically, the score fusion process is expressed as follows:
S=(fc_1+fc_2+fc_cat)/3
where fc_1, fc_2 and fc_cat are all c × 1 dimensional vectors. The former two are obtained by passing the feature maps output by φ_i(·) and ψ_i(·) through respective fully connected layers, and the latter is obtained by concatenating fc_1 and fc_2 and then applying the fully connected layer fc_concate, giving the same dimension as fc_1 and fc_2, namely c × 1.
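A sketch of submodule 203, under the assumption that the guided and unguided feature maps are first globally average-pooled into 2048-dimensional vectors (the class count default is illustrative):

```python
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """Sketch of submodule 203: fc_1 and fc_2 classify the pooled guided
    and unguided features; fc_concate classifies their concatenated
    scores; the three c-dimensional vectors are averaged."""
    def __init__(self, dim=2048, num_classes=200):
        super().__init__()
        self.fc_1 = nn.Linear(dim, num_classes)
        self.fc_2 = nn.Linear(dim, num_classes)
        self.fc_concate = nn.Linear(2 * num_classes, num_classes)

    def forward(self, guided_feat, unguided_feat):
        # global average pooling over the 14x14 plane -> (n, 2048)
        g1 = guided_feat.mean(dim=(2, 3))
        g2 = unguided_feat.mean(dim=(2, 3))
        fc_1, fc_2 = self.fc_1(g1), self.fc_2(g2)
        fc_cat = self.fc_concate(torch.cat([fc_1, fc_2], dim=1))
        return (fc_1 + fc_2 + fc_cat) / 3.0  # S = (fc_1 + fc_2 + fc_cat)/3
```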
In particular, in the embodiment of the present invention, since the top-level category has no upper-level semantics to guide it, its network structure is actually as shown in FIG. 6. Except for the last fully connected layer fc1, which corresponds to the number of categories at this level, the parameter settings of its other layers are consistent with the original ResNet-50 and are not repeated here.
FIG. 7 is a flowchart illustrating the steps of the implementation method of the hierarchical semantic embedding model for fine-grained object recognition according to the present invention. When the HSE model is trained, the ordinary category labels serve as the optimization targets, and the cross-entropy loss function serves as the optimization objective. Specifically, the prediction score vector of the i-th level is normalized by the softmax function:

p_i(c) = exp(s_i(c)/T) / Σ_j exp(s_i(j)/T)

It is to be noted in particular that the softmax function here differs from the aforementioned softmax function only in the setting of the temperature coefficient: here T is set to 1. For a certain image sample whose correct label in the current hierarchy category is c_i, its loss value can be expressed as:

L(s_i, c_i) = -log p_i(c_i)

Similarly, summing over all training samples yields the classification loss value L_cls of the whole training set.
Specifically, as shown in FIG. 7, the implementation method of the hierarchical semantic embedding model for fine-grained object recognition according to the present invention includes the following steps:
and step S1, performing hierarchical labeling on each piece of training data.
Taking images of birds as an example, hierarchical annotation information must be prepared in addition to the images. For example, if birds are labeled at 4 category levels, i.e. order, family, genus and species, each piece of training/testing data provided should include: an image, an order-level category label, a family-level category label, a genus-level category label and a species-level category label.
Step S2: adopt a weighted combination of the classification loss function and the regularization constraint loss function as the objective function for optimizing the HSE model, and train step by step the branch networks corresponding to each level, from the 1st-level category to the Nth-level category.
When training the branch network corresponding to a certain level, the prediction score vector of the previous level must be obtained first. Therefore, in this step, the branch networks are trained step by step from the 1st-level category to the Nth-level category. Since the parameters of the backbone network are shared by all branches, they are not optimized in this step; instead, the backbone parameters are simply initialized with the ResNet-50 network model parameters pre-trained on the ImageNet dataset and kept fixed throughout this step, with no optimization updates.
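With the backbone from the earlier sketch, fixing its parameters during this step could look as follows:

```python
# Keep the shared backbone fixed: exclude its parameters from updates.
for p in backbone.parameters():
    p.requires_grad = False
```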
When training the branch network corresponding to the i-th level category, the HSE model integrates the network structures of the first i-1 levels, so the parameters of the first i-1 branch networks in the HSE model are initialized with the previously trained branch network models of those levels. For the branch network of level i, the parameters of the 9 relevant layers involved in each of the subnetworks ψ_i(·) and φ_i(·) are likewise initialized with the ResNet-50 network model parameters pre-trained on the ImageNet dataset. In addition, α(·) is realized by fully connected layers whose parameters are initialized using the Xavier algorithm. The optimization objective function of the branch network is:

L = L_cls + γ · L_reg
where γ is a balance parameter used to balance the impact of the classification loss function term and the regularization constraint loss function term on the network parameters. Because the gradient values produced by the regularization constraint loss term are relatively small in magnitude, a relatively large weight value needs to be set (γ = 2 is used in the embodiment of the present invention).
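A sketch of the assumed objective form L = L_cls + γ·L_reg follows. Based on the soft-target description above and the KL divergence mentioned later, the regularization term is taken here as the KL divergence between the upper-level soft target and the lower-level prediction aggregated into the upper-level label space; the child-to-parent mapping matrix is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def branch_objective(child_logits, labels, parent_scores,
                     child_to_parent, gamma=2.0):
    """Sketch: L = L_cls + gamma * L_reg (assumed form).  child_to_parent
    is an assumed (num_child x num_parent) 0/1 matrix encoding the
    classification tree."""
    l_cls = F.cross_entropy(child_logits, labels)
    child_prob = F.softmax(child_logits, dim=1)
    parent_pred = child_prob @ child_to_parent        # aggregate to parent space
    soft_target = F.softmax(parent_scores, dim=1).detach()
    l_reg = F.kl_div((parent_pred + 1e-8).log(), soft_target,
                     reduction="batchmean")
    return l_cls + gamma * l_reg
```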
It should be noted that, because the top-level category introduces no upper-level semantic knowledge, its branch network only needs the classification loss term as the objective function for model parameter optimization.
In the embodiment of the present invention, the training images are scaled to 512 × 512, and the data augmentation means include randomly cropping 448 × 448 regions for training and applying a horizontal flip transformation to the training samples. For optimization, the invention uses the SGD algorithm with a mini-batch strategy, where the batch size is 8, the SGD momentum term is 0.9, the weight decay factor is 0.00005, and the initial learning rate is 0.001; after about 300 passes over the training set, the learning rate is reduced by a factor of 10 and training continues.
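The configuration above could be sketched as follows (a placeholder model stands in for the branch network under training):

```python
import torch
import torchvision
import torchvision.transforms as T

model = torchvision.models.resnet50()  # placeholder for the branch network
transform = T.Compose([
    T.Resize((512, 512)),
    T.RandomCrop(448),        # crop 448x448 training regions
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.00005)
# after ~300 passes over the training set, divide the learning rate by 10
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300,
                                            gamma=0.1)
```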
Step S3: after all branch networks have been preliminarily trained, jointly optimize all parameters of the complete HSE model. The objective function of the joint optimization is the sum of the optimization objective functions of the branch networks at all levels:

L_joint = Σ_i L_i
in training, the present invention still uses the same data increment method and the same super-parameter configuration as in step 1, except that the learning rate is smaller than 0.00001, which is not described herein.
It should be noted here that the backbone network of the present invention adopts the ResNet-50 structure; similarly, other general convolutional neural network structures, such as VGG16, may be adopted instead.
The network structure illustrated by the invention has 4 levels; in fact, the number of branches depends only on the number of levels in the dataset's classification hierarchy, and the invention applies to any number of levels.
One of the loss functions adopted when the method trains the model is the KL divergence; in fact, any general distance measurement function, such as the Euclidean distance, is also applicable.
In summary, the hierarchical semantic embedding model for fine-grained object recognition and its implementation method adopt the hierarchical structure of object categories as semantic information and embed this information into the feature expression of a deep neural network model, thereby solving the problem of the high cost of labeling additional information in fine-grained recognition schemes that rely on such information to guide learning, and reducing the complexity of the model.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.