Disclosure of Invention
In order to overcome the defects of the prior art, the present invention aims to provide a hierarchical semantic embedding model for fine-grained object recognition and an implementation method thereof, so as to solve the problem of the high cost of labeling additional information in fine-grained recognition schemes that rely on such information to guide learning.
To achieve the above and other objects, the present invention provides a hierarchical semantic embedding model for fine-grained object recognition, comprising:
a backbone network for extracting shallow features of the input image and outputting them to each branch network in the form of a feature map;
and branch networks, each of which further extracts deep features from the shallow feature map output by the backbone network, so that the output feature map suits the recognition task of the level to which that branch corresponds; a semantic knowledge embedding mechanism allows upper-level semantic knowledge to guide the feature learning of the lower-level branch networks.
Preferably, the branch network performs a secondary characterization of the feature map from the backbone network to generate a new branch feature map, learns an attention weight map by combining the score vector predicted at the upper level with the branch feature map of the lower level, applies the attention weight map to the branch feature map to generate a weighted branch feature map, and from it predicts the label distribution of the corresponding hierarchy level.
Preferably, the backbone network adopts the layer4_x layer of the ResNet-50 network structure and all layers before it, comprising 41 parameter layers; the parameters of the backbone network are shared by the prediction networks of all levels.
Preferably, the branch network includes:
a deep feature extraction submodule for performing deep feature extraction on the feature map output by the backbone network and outputting both a feature expression guided by upper-level semantic knowledge and an unguided feature expression;
an upper-level semantic knowledge embedding submodule for mapping the score vector s_{i-1} predicted at the upper level into a semantic knowledge expression vector through a fully connected layer, splicing this vector with each site on the W × H plane of the feature map output by the deep feature extraction submodule (W and H denote width and height, respectively), learning an attention coefficient vector from the spliced feature map through an attention model, and applying the attention coefficient vector to the feature map output by the deep feature extraction submodule to obtain a weighted feature map;
and a score fusion submodule for outputting the corresponding score vector by performing a score fusion operation on the feature maps output by the upper-level semantic knowledge embedding submodule and the deep feature extraction submodule.
Preferably, the deep feature extraction submodule adopts the layer5_x layer structure of the ResNet-50 network, the layer5_x structure consisting of 3 residual modules; the layer5_x structure is instantiated twice, one copy serving the upper-level semantic knowledge embedding submodule and the other serving the expression of global features.
Preferably, the attention model uses two successive fully connected layers to map each position point on the W × H plane of the spliced feature map step by step into the corresponding dimensions, finally obtaining the attention coefficient vector.
Preferably, the score fusion sub-module performs the score fusion process as follows:
S=(fc_1+fc_2+fc_cat)/3
where fc_1, fc_2 and fc_cat are all c × 1 dimensional vectors; the former two are obtained by passing the feature maps output by the upper-level semantic knowledge embedding submodule and the deep feature extraction submodule through respective fully connected layers, and the latter is obtained by concatenating fc_1 and fc_2 and passing the result through a fully connected layer fc_concate, giving the same dimension as fc_1 and fc_2.
Preferably, for the branch network of the top-level category, except for the last fully connected layer, which corresponds to the number of categories at that level, the parameter settings of the other layers are consistent with the original ResNet-50 network.
In order to achieve the above object, the present invention further provides a method for implementing a hierarchical semantic embedding model for fine-grained object recognition, comprising the following steps:
step S1, carrying out hierarchical labeling on each piece of training data;
step S2, adopting a weighted combination of the classification loss function and the regularization constraint loss function as the objective function for optimizing the HSE model, and training step by step the branch networks corresponding to each level, from the 1st-level category to the Nth-level category;
and step S3, after all branch networks have been preliminarily trained, jointly optimizing all parameters of the complete HSE model.
Preferably, the optimization objective function of the branch network is the weighted combination

L = L_cls + γ · L_reg

where γ is a balance parameter for balancing the impact of the classification loss function term L_cls and the regularization constraint loss function term L_reg on the network parameters.
Compared with the prior art, the hierarchical semantic embedding model for fine-grained object recognition and its implementation method adopt the hierarchical structure of object categories as semantic information and embed this information into the feature expression of a deep neural network model, thereby solving the problem of the high cost of labeling additional information in fine-grained recognition schemes that rely on such information to guide learning, and reducing the complexity of the model.
Detailed Description
Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the disclosure herein, which describes the invention by way of specific embodiments in conjunction with the accompanying drawings. The invention is capable of other and different embodiments, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present invention.
FIG. 1 is a system architecture diagram of the hierarchical semantic embedding model for fine-grained object recognition according to the present invention. In the invention, the Hierarchical Semantic Embedding model (HSE for short) covers three aspects: image depth feature extraction, semantic-knowledge-embedded expression learning, and regularization of the prediction-result semantic space by semantic knowledge. The HSE model is based on deep learning technology: the whole HSE framework relies on a deep neural network for deep expression learning, and it exploits hierarchical semantic knowledge in two ways, namely embedding semantic knowledge during feature expression and using semantic knowledge to regularize the prediction results during model training. Specifically, as shown in FIG. 1, the hierarchical semantic embedding (HSE) model for fine-grained object recognition according to the present invention comprises:
the backbone network 1, used for extracting shallow features of an input image and outputting them to each branch network 2 in the form of a feature map; that is, the backbone network 1 performs a preliminary extraction of image features from the input image and passes them to the branch networks 2 as a feature map;
and the branch networks 2, used for further extracting deep features from the shallow feature map output by the backbone network, so that the output feature map suits the recognition task of the level to which each branch corresponds; a semantic knowledge embedding mechanism allows upper-level semantic knowledge to guide the feature learning of the lower-level branch networks. That is, the feature map output by the backbone network 1 is fed into the branch network 2 of each hierarchy level for further feature expression and output as a feature vector, from which a softmax classifier computes the predicted label distribution of that level.
in the invention, a semantic knowledge embedding mechanism is embodied in a branch network 2, the branch network 2 adopts an attention mechanism guided by semantic knowledge, specifically, the branch network 2 firstly carries out secondary characterization expression on a feature map from a main network 1 to generate a new branch feature map, which is essentially a stack of a plurality of feature maps and is a 3-dimensional tensor, the branch feature map learns an attention weight map by combining a score vector predicted by a superior level and a branch feature map predicted by a subordinate level, and is essentially a stack of a plurality of feature maps and is a three-dimensional tensor, the attention weight map represents that each space position of the new feature map generated by the branch network has an important degree for identifying a target class, the higher the discriminant is, more attention is attracted, and the weight corresponding to the corresponding position of the weight map is larger, the weight map is applied to the branch feature map to generate a weighted branch feature map, so as to predict the label distribution of the hierarchy type.
Therefore, the trunk-branch multi-branch network structure reduces computation overhead by sharing shallow network parameters, while the multiple independent branches allow the model to accommodate the optimization targets of different tasks.
It should be noted here that the regularization of the prediction-result semantic space by semantic knowledge is embodied in the training process of the HSE model: the present invention uses the upper-level prediction score vector as a soft target to constrain the lower-level prediction results to conform to the semantic rules of the classification tree, thereby regularizing the semantic space of the lower-level predictions; this will be described in detail later.
Specifically, the following is a rough workflow of the hierarchical semantic embedding model of the present invention:
(1) input an image I;
(2) the backbone network Tr extracts a feature map of the image I, denoted f_I;
(3) f_I is input into the top-level branch network Br_1;
(4) Br_1 performs a forward computation on f_I to obtain the prediction score vector s_1 of the top-level category;
(5) for each level i from 2 to n:
(5.1) f_I is input into the i-th level branch network Br_i;
(5.2) the prediction score vector s_{i-1} of the upper-level category is input into Br_i;
(5.3) under the guidance of s_{i-1}, Br_i performs a forward computation on f_I to obtain the prediction score vector s_i of the i-th level.
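The workflow above can be summarized in the following minimal sketch (the trunk and branch callables are assumed interfaces for illustration, not the invention's exact code):

```python
def hse_forward(image, trunk, branches):
    """Minimal sketch of steps (1)-(5): the trunk extracts the shared
    feature map f_I, the top-level branch predicts s_1, and each lower
    branch Br_i predicts s_i under the guidance of s_{i-1}."""
    f_I = trunk(image)                      # step (2)
    scores = [branches[0](f_I)]             # steps (3)-(4): s_1
    for br in branches[1:]:                 # step (5): levels 2..n
        scores.append(br(f_I, scores[-1]))  # (5.1)-(5.3): s_i from f_I and s_{i-1}
    return scores
```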
Compared with the prior art, the accuracy of the invention on the Caltech-UCSD Birds dataset is 1.6% higher than the previous best algorithm, and its accuracy on the VegFru dataset is 2.3% higher than the previous best algorithm. In addition, on the Caltech-UCSD Birds dataset, the invention achieves accuracy equivalent to the previous best algorithm while saving 20% of the training data.
The invention will be further illustrated by the following specific examples:
Example:
1. Hierarchical annotation of data
Taking images of birds as an example, hierarchical annotation information must be prepared in addition to the images. For example, if birds are labeled at 4 category levels, i.e. order, family, genus and species, each piece of training/testing data provided should include: an image, an order-level category label, a family-level category label, a genus-level category label and a species-level category label.
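As an illustration only (the field names and the example bird are hypothetical, not the invention's prescribed format), one such hierarchically annotated record might be organized as:

```python
# A hypothetical hierarchically annotated training record.
sample = {
    "image": "images/0001.jpg",
    "order": "Procellariiformes",   # order-level category label
    "family": "Diomedeidae",        # family-level category label
    "genus": "Phoebastria",         # genus-level category label
    "species": "Laysan Albatross",  # species-level category label
}
```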
2. Implementation of HSE model
The HSE model includes a backbone network (trunk net) and branch networks (branch nets). The backbone network mainly extracts shallow features of the input image. The branch network has two functions: first, it further extracts deep features from the shallow image feature map output by the backbone network, so that the output feature map suits the recognition task of the corresponding level; second, it realizes the guidance of upper-level semantic knowledge over lower-level branch feature learning by introducing a knowledge embedding module. The trunk-branch multi-branch network structure reduces computation cost by sharing shallow network parameters, while the multiple independent branches allow the model to accommodate the optimization targets of different tasks. The network structures of the backbone network and the branch networks are described in detail below.
1) Backbone network
In the embodiment of the invention, the backbone network is built on the layer structure of a residual network; FIG. 2 shows the network structure of ResNet-50 for comparison. In the figure, conv1 is a single convolution layer, and layer2_x to layer5_x are stages formed by stacking several residual modules, each comprising several convolution layers. In ResNet-50, layer2_x to layer5_x consist of 3, 4, 6 and 3 residual modules respectively; each residual module contains 3 convolution layers, for 48 layers in total, which together with the bottom conv1 and the output fully connected layer fc constitute the 50-layer network structure.
The backbone network of the HSE model of the invention only adopts the ResNet-50 structure up to and including layer4_x, comprising 41 parameter layers. Table 1 lists the specific network parameters of the backbone network. Since the HSE model predicts categories at multiple levels, the parameters of the backbone network are shared by the prediction networks of all levels. Given an input picture, the backbone network performs a preliminary shallow feature extraction and outputs the result as a feature map. In the specific embodiment of the present invention, for a picture with an input resolution of 448 × 448, the spatial dimension of the backbone network's output feature map is 28 × 28.
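As a sketch of this construction (assuming a PyTorch/torchvision implementation, whose stage naming differs: the patent's layer4_x corresponds to torchvision's layer3), the truncated backbone could be built as follows:

```python
import torch
import torchvision

# Truncate ResNet-50 after the patent's layer4_x stage (torchvision's
# "layer3"): a 448x448 input then yields a 1024-channel 28x28 feature
# map that is shared by all branch networks.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1, resnet.layer2, resnet.layer3,
)
x = torch.randn(8, 3, 448, 448)  # batch size 8, as in the embodiment
f_I = backbone(x)
print(f_I.shape)                 # torch.Size([8, 1024, 28, 28])
```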
TABLE 1. Key parameters of the HSE model backbone network
2) Branch network
FIG. 3 compares ResNet-50 with the branch network structure of the present invention. In the branch network structure, each hierarchy level of categories corresponds to one branch network. Specifically, the branch network includes: a deep feature extraction submodule 201, an upper-level semantic knowledge embedding submodule 202 and a score fusion submodule 203.
To maintain consistency with the ResNet-50 network structure and facilitate fair comparison in subsequent experiments, the deep feature extraction submodule 201 follows the layer5_x layer structure of ResNet-50. In the embodiment of the present invention, the layer5_x structure consists of 3 residual modules and is used to perform deep feature extraction on the feature map output by the backbone network, producing both a feature expression guided by upper-level semantic knowledge (emphasizing local discrimination) and an unguided feature expression (emphasizing global discrimination). Specifically, the feature map output by the backbone network (28 × 28 resolution) is taken as input and, after the layer5_x operations of the deep feature extraction submodule 201, a feature map of size 14 × 14 is output. The dimension of the output feature map is actually n × C × W × H; in the specific embodiment of the present invention, n = 8 is the batch size, C = 2048 is the number of channels, and W = H = 14 are the width and height. Note in particular that, in a branch network, the layer5_x structure is instantiated twice: one copy serves the upper-level semantic knowledge embedding submodule and the other serves the expression of global features. To distinguish the two layer5_x structures, the former is denoted φ_i(·) and the latter ψ_i(·); φ_i(·) and ψ_i(·) are independent of each other and do not share parameters.
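A sketch of this duplication (again mapping the patent's layer5_x to torchvision's layer4; this also matches the pretrained initialization described in step S2 below):

```python
import copy
import torch
import torchvision

# Two independent, non-shared copies of ResNet-50's last stage (the
# patent's layer5_x): phi feeds the knowledge-guided path, psi the
# global (unguided) path.
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
phi = copy.deepcopy(resnet.layer4)
psi = copy.deepcopy(resnet.layer4)
f_I = torch.randn(8, 1024, 28, 28)     # backbone output, as sketched above
print(phi(f_I).shape, psi(f_I).shape)  # both torch.Size([8, 2048, 14, 14])
```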
In the upper-level semantic knowledge embedding submodule 202, the score vector s_{i-1} predicted at the upper level is first mapped into a 1024-dimensional semantic knowledge expression vector through a fully connected layer. This vector is then spliced with each site on the W × H plane of the feature map output by φ_i(·); a concatenation symbol in FIG. 4 indicates this splicing operation. In implementation, the knowledge expression vector may simply be replicated over the W × H plane for convenience. FIG. 4 illustrates the above process.
The spliced feature map then learns an attention coefficient vector through an attention model α(·). FIG. 5 illustrates the processing of the attention model: each position on the W × H plane of the spliced feature map is mapped by two successive fully connected layers into 1024 and then 2048 dimensions, finally yielding the attention coefficient vector (as shown in the rightmost part of FIG. 5).
The obtained attention coefficient vector then acts on the feature map output by φ_i(·): the multiplication symbol in FIG. 5 indicates multiplying the attention coefficient vector with the values at each corresponding location of the output feature map, yielding the weighted feature map f_i.
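A minimal sketch of submodule 202 follows, under two assumptions not fixed by the text: the per-position fully connected layers are realized as 1 × 1 convolutions, and the attention coefficients are squashed with a sigmoid (the normalization is not specified above).

```python
import torch
import torch.nn as nn

class UpperSemanticEmbedding(nn.Module):
    """Sketch of submodule 202: embed the upper-level score vector,
    tile it over the W x H plane, concatenate it with the phi_i feature
    map, and learn per-position attention coefficients alpha(.)."""
    def __init__(self, num_parent_classes, channels=2048, embed_dim=1024):
        super().__init__()
        self.embed = nn.Linear(num_parent_classes, embed_dim)
        # two per-position FC layers (1x1 convs): 2048+1024 -> 1024 -> 2048
        self.alpha = nn.Sequential(
            nn.Conv2d(channels + embed_dim, embed_dim, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(embed_dim, channels, kernel_size=1),
            nn.Sigmoid(),  # assumption: coefficients squashed to (0, 1)
        )

    def forward(self, feat, parent_scores):  # feat: (n, 2048, 14, 14)
        k = self.embed(parent_scores)        # (n, 1024)
        k = k[:, :, None, None].expand(-1, -1, feat.size(2), feat.size(3))
        a = self.alpha(torch.cat([feat, k], dim=1))
        return feat * a                      # weighted feature map f_i
```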
In the score fusion submodule 203, the feature maps output by φ_i(·) and ψ_i(·) undergo a score fusion operation to output the corresponding score vector.
Specifically, the score fusion process is expressed as follows:
S=(fc_1+fc_2+fc_cat)/3
where fc_1, fc_2 and fc_cat are all c × 1 dimensional vectors. The former two are obtained by passing the feature maps output by φ_i(·) and ψ_i(·) through respective fully connected layers, and the latter is obtained by concatenating fc_1 and fc_2 and then applying the fully connected layer fc_concate, giving the same dimension as fc_1 and fc_2, namely c × 1.
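A sketch of submodule 203, under the assumption that the guided and unguided feature maps are first globally average-pooled into 2048-dimensional vectors (the class count default is illustrative):

```python
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """Sketch of submodule 203: fc_1 and fc_2 classify the pooled guided
    and unguided features; fc_concate classifies their concatenated
    scores; the three c-dimensional vectors are averaged."""
    def __init__(self, dim=2048, num_classes=200):
        super().__init__()
        self.fc_1 = nn.Linear(dim, num_classes)
        self.fc_2 = nn.Linear(dim, num_classes)
        self.fc_concate = nn.Linear(2 * num_classes, num_classes)

    def forward(self, guided_feat, unguided_feat):
        # global average pooling over the 14x14 plane -> (n, 2048)
        g1 = guided_feat.mean(dim=(2, 3))
        g2 = unguided_feat.mean(dim=(2, 3))
        fc_1, fc_2 = self.fc_1(g1), self.fc_2(g2)
        fc_cat = self.fc_concate(torch.cat([fc_1, fc_2], dim=1))
        return (fc_1 + fc_2 + fc_cat) / 3.0  # S = (fc_1 + fc_2 + fc_cat)/3
```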
In particular, in the embodiment of the present invention, since the top-level category has no upper-level semantics to guide it, its network structure is actually as shown in FIG. 6. Except for the last fully connected layer fc1, which corresponds to the number of categories at this level, the parameter settings of its other layers are consistent with the original ResNet-50 and are not repeated here.
FIG. 7 is a flowchart illustrating the steps of the implementation method of the hierarchical semantic embedding model for fine-grained object recognition according to the present invention. When the HSE model is trained, the ordinary category labels serve as the optimization targets, and the cross-entropy loss function serves as the optimization objective. Specifically, the prediction score vector of the i-th level is normalized by the softmax function:

p_i(c) = exp(s_i(c)/T) / Σ_j exp(s_i(j)/T)

It is to be noted in particular that the softmax function here differs from the aforementioned softmax function only in the setting of the temperature coefficient: here T is set to 1. For a certain image sample whose correct label in the current hierarchy category is c_i, its loss value can be expressed as:

L(s_i, c_i) = -log p_i(c_i)

Similarly, summing over all training samples yields the classification loss value L_cls of the whole training set.
Specifically, as shown in FIG. 7, the implementation method of the hierarchical semantic embedding model for fine-grained object recognition according to the present invention includes the following steps:
and step S1, performing hierarchical labeling on each piece of training data.
Taking images of birds as an example, hierarchical annotation information must be prepared in addition to the images. For example, if birds are labeled at 4 category levels, i.e. order, family, genus and species, each piece of training/testing data provided should include: an image, an order-level category label, a family-level category label, a genus-level category label and a species-level category label.
Step S2: adopt a weighted combination of the classification loss function and the regularization constraint loss function as the objective function for optimizing the HSE model, and train step by step the branch networks corresponding to each level, from the 1st-level category to the Nth-level category.
When training the branch network corresponding to a certain level, the prediction score vector of the previous level must be obtained first. Therefore, in this step, the branch networks are trained step by step from the 1st-level category to the Nth-level category. Since the parameters of the backbone network are shared by all branches, they are not optimized in this step; instead, the backbone parameters are simply initialized with the ResNet-50 network model parameters pre-trained on the ImageNet dataset and kept fixed throughout this step, with no optimization updates.
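With the backbone from the earlier sketch, fixing its parameters during this step could look as follows:

```python
# Keep the shared backbone fixed: exclude its parameters from updates.
for p in backbone.parameters():
    p.requires_grad = False
```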
When training the branch network corresponding to the i-th level category, the HSE model integrates the network structures of the first i-1 levels, so the parameters of the first i-1 branch networks in the HSE model are initialized with the previously trained branch network models of those levels. For the branch network of level i, the parameters of the 9 relevant layers involved in each of the subnetworks ψ_i(·) and φ_i(·) are likewise initialized with the ResNet-50 network model parameters pre-trained on the ImageNet dataset. In addition, α(·) is realized by fully connected layers whose parameters are initialized using the Xavier algorithm. The optimization objective function of the branch network is:

L = L_cls + γ · L_reg
where γ is a balance parameter used to balance the impact of the classification loss function term and the regularization constraint loss function term on the network parameters. Because the gradient values produced by the regularization constraint loss term are relatively small in magnitude, a relatively large weight value needs to be set (γ = 2 is used in the embodiment of the present invention).
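A sketch of the assumed objective form L = L_cls + γ·L_reg follows. Based on the soft-target description above and the KL divergence mentioned later, the regularization term is taken here as the KL divergence between the upper-level soft target and the lower-level prediction aggregated into the upper-level label space; the child-to-parent mapping matrix is an assumption of this sketch.

```python
import torch
import torch.nn.functional as F

def branch_objective(child_logits, labels, parent_scores,
                     child_to_parent, gamma=2.0):
    """Sketch: L = L_cls + gamma * L_reg (assumed form).  child_to_parent
    is an assumed (num_child x num_parent) 0/1 matrix encoding the
    classification tree."""
    l_cls = F.cross_entropy(child_logits, labels)
    child_prob = F.softmax(child_logits, dim=1)
    parent_pred = child_prob @ child_to_parent        # aggregate to parent space
    soft_target = F.softmax(parent_scores, dim=1).detach()
    l_reg = F.kl_div((parent_pred + 1e-8).log(), soft_target,
                     reduction="batchmean")
    return l_cls + gamma * l_reg
```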
It should be noted that, because the top-level category introduces no upper-level semantic knowledge, its branch network only needs the classification loss term as the objective function for model parameter optimization.
In the embodiment of the present invention, the training images are scaled to 512 × 512, and the data augmentation means include randomly cropping 448 × 448 regions for training and applying a horizontal flip transformation to the training samples. For optimization, the invention uses the SGD algorithm with a mini-batch strategy, where the batch size is 8, the SGD momentum term is 0.9, the weight decay factor is 0.00005, and the initial learning rate is 0.001; after about 300 passes over the training set, the learning rate is reduced by a factor of 10 and training continues.
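The configuration above could be sketched as follows (a placeholder model stands in for the branch network under training):

```python
import torch
import torchvision
import torchvision.transforms as T

model = torchvision.models.resnet50()  # placeholder for the branch network
transform = T.Compose([
    T.Resize((512, 512)),
    T.RandomCrop(448),        # crop 448x448 training regions
    T.RandomHorizontalFlip(),
    T.ToTensor(),
])
optimizer = torch.optim.SGD(model.parameters(), lr=0.001,
                            momentum=0.9, weight_decay=0.00005)
# after ~300 passes over the training set, divide the learning rate by 10
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300,
                                            gamma=0.1)
```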
Step S3: after all branch networks have been preliminarily trained, jointly optimize all parameters of the complete HSE model. The objective function of the joint optimization is the sum of the optimization objective functions of the branch networks at all levels:

L_joint = Σ_i L_i
in training, the present invention still uses the same data increment method and the same super-parameter configuration as in step 1, except that the learning rate is smaller than 0.00001, which is not described herein.
It should be noted here that the backbone network of the present invention adopts the ResNet-50 structure; similarly, other general convolutional neural network structures, such as VGG16, may be adopted instead.
The network structure illustrated by the invention has 4 levels; in fact, the number of branches depends only on the number of levels in the dataset's classification hierarchy, and the invention applies to any number of levels.
One of the loss functions adopted when the method trains the model is the KL divergence; in fact, any general distance measurement function, such as the Euclidean distance, is also applicable.
In summary, the hierarchical semantic embedding model for fine-grained object recognition and its implementation method adopt the hierarchical structure of object categories as semantic information and embed this information into the feature expression of a deep neural network model, thereby solving the problem of the high cost of labeling additional information in fine-grained recognition schemes that rely on such information to guide learning, and reducing the complexity of the model.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.