
CN109102024B - A Hierarchical Semantic Embedding Model for Object Recognition and Its Implementation - Google Patents


Info

Publication number
CN109102024B
CN109102024B
Authority
CN
China
Prior art keywords
network
branch
feature
feature map
layer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810924288.XA
Other languages
Chinese (zh)
Other versions
CN109102024A (en)
Inventor
聂琳
吴文熙
陈添水
王青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sun Yat Sen University
Original Assignee
Sun Yat Sen University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Yat Sen University filed Critical Sun Yat Sen University
Priority to CN201810924288.XA priority Critical patent/CN109102024B/en
Publication of CN109102024A publication Critical patent/CN109102024A/en
Application granted granted Critical
Publication of CN109102024B publication Critical patent/CN109102024B/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/23: Clustering techniques
    • G06F18/231: Hierarchical techniques, i.e. dividing or merging pattern sets so as to obtain a dendrogram
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks


Abstract



The invention discloses a hierarchical semantic embedding model for fine-grained object recognition and an implementation method thereof. The hierarchical semantic embedding model comprises: a backbone network, which extracts shallow features of an input image and outputs them in the form of feature maps to each branch network; and several branch networks, which perform further deep feature extraction on the shallow feature maps output by the backbone network, so that each branch's output feature map suits the recognition task of its corresponding hierarchy level, and which, by introducing a semantic knowledge embedding mechanism, realize the guidance of upper-level semantic knowledge over the feature learning of lower-level branch networks. The invention solves the problem of the high labeling cost of extra information in fine-grained object recognition schemes that rely on extra information to guide learning.


Description

Hierarchical semantic embedding model for fine-grained object recognition and implementation method thereof
Technical Field
The invention relates to the technical field of fine-grained object recognition, in particular to a Hierarchical Semantic Embedding (HSE) model for fine-grained object recognition and a training method thereof.
Background
In recent years, the rapid development of deep visual computing has driven demand in many fields for visual understanding and analysis technologies: e-commerce urgently needs accurate online retrieval of clothing images, the security industry urgently needs accurate matching of related vehicles, and the agriculture, forestry and environmental protection fields urgently need fine-grained identification of wild animals and plants. These applications require recognition algorithms that can carefully distinguish the subordinate classes of a basic class, a technique commonly referred to as fine-grained object recognition.
Generally, the technical difficulties of fine-grained object recognition are:
1) subtle inter-class differences: objects from similar categories often differ only slightly in appearance, and some are difficult even for humans to distinguish;
2) significant intra-class differences: objects from the same category can exhibit very large visual differences due to scale, viewpoint, occlusion, and diverse backgrounds.
At present, fine-grained recognition techniques mainly distinguish objects based on a number of discriminative regions, and two main schemes exist:
first, automatically mining discriminative regions with an attention mechanism;
second, guiding model learning with additional information, so as to better express the features of the discriminative regions.
However, the former is usually implemented with multiple networks, whose repeated computation increases model complexity, while the localization of discriminative regions remains blurry for lack of effective supervision or guidance; the latter effectively improves the discriminability of key regions, but the labeling cost of the introduced extra information is often high.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention aims to provide a hierarchical semantic embedding model for fine-grained object recognition and an implementation method thereof, so as to solve the problem of the high labeling cost of additional information in fine-grained recognition schemes that rely on additional information to guide learning.
To achieve the above and other objects, the present invention provides a hierarchical semantic embedding model for fine object recognition, comprising:
the backbone network, which extracts shallow features of the input image and outputs them to each branch network in the form of a feature map;
and several branch networks, which perform further deep feature extraction on the shallow feature map output by the backbone network, so that the output feature map suits the recognition task of the branch network's corresponding hierarchy level, and which realize the guidance of upper-level semantic knowledge over lower-level branch network feature learning by introducing a semantic knowledge embedding mechanism.
Preferably, the branch network performs a secondary characterization of the feature map from the backbone network to generate a new branch feature map, learns an attention weight map by combining the score vector predicted by the upper level with the branch feature map of the lower level, applies the attention weight map to the branch feature map, and finally generates a weighted branch feature map from which the label distribution of that hierarchy level is predicted.
Preferably, the backbone network adopts layer4_x of the ResNet-50 network structure and the input layers before it, comprising 41 parameter layers; the parameters of the backbone network are shared by the prediction networks of every level.
Preferably, the branched network includes:
the deep feature extraction submodule, which performs deep feature extraction on the feature map output by the backbone network and outputs both a feature expression guided by upper-level semantic knowledge and an unguided feature expression;
the upper-level semantic knowledge embedding submodule, which maps the score vector s_{i-1} predicted by the upper level into a semantic knowledge expression vector through a fully connected layer, splices this vector with each site on the W×H plane of the feature map output by the deep feature extraction submodule, learns an attention coefficient vector from the spliced feature map through an attention model, and applies the attention coefficient vector to the feature map output by the deep feature extraction submodule to obtain a weighted feature map, where W and H refer to width and height respectively;
and the score fusion submodule, which outputs the corresponding score vector through a score fusion operation on the feature maps output by the upper-level semantic knowledge embedding submodule and the deep feature extraction submodule.
Preferably, the deep feature extraction submodule adopts the layer5_x layer structure of the ResNet-50 network, which is composed of 3 residual modules; the layer5_x structure is instantiated twice, one instance feeding the upper-level semantic knowledge embedding submodule and the other producing the expression of global features.
Preferably, the attention model successively maps each position point on the W×H plane of the spliced feature map into the corresponding dimensions through two fully connected layers, finally obtaining the attention coefficient vector.
Preferably, the score fusion sub-module performs the score fusion process as follows:
S=(fc_1+fc_2+fc_cat)/3
where fc_1, fc_2 and fc_cat are all c×1-dimensional vectors; the first two are obtained by passing the feature maps output by the upper-level semantic knowledge embedding submodule and the deep feature extraction submodule through a fully connected layer respectively, and the latter is obtained by concatenating fc_1 and fc_2 and passing the result through a fully connected layer fc_concate to obtain the same dimensions as fc_1 and fc_2.
Preferably, for the top-level categories of the branch network, except for the last fully connected layer, whose output corresponds to the number of categories at that level, the parameter settings of the other layers are consistent with the original ResNet-50 network.
In order to achieve the above object, the present invention further provides a method for implementing a hierarchical semantic embedded model for fine object recognition, comprising the following steps:
step S1, carrying out hierarchical labeling on each piece of training data;
step S2, adopting a weighted combination of the classification loss function and the regularization constraint loss function as the objective function for optimizing the HSE model, and training the branch networks corresponding to each level of categories step by step, from the 1st-level categories to the Nth-level categories;
and step S3, after all the branch networks are preliminarily trained, performing joint optimization on all the parameters of the whole complete HSE model.
Preferably, the optimization objective function of the branch network is:

L_i = L_i^cls + γ·L_i^reg

where γ is a balance parameter that balances the impact of the classification loss function term L_i^cls and the regularization constraint loss function term L_i^reg on the network parameters.
Compared with the prior art, the hierarchical semantic embedding model for fine-grained object recognition and its implementation method adopt the hierarchical structure of object classification as semantic information and embed it into the feature expression of a deep neural network model, thereby solving the problem of the high labeling cost of additional information in fine-grained recognition schemes that rely on additional information to guide learning, while also reducing model complexity.
Drawings
FIG. 1 is a system architecture diagram of a hierarchical semantic embedding model for fine-grained object recognition according to the present invention;
FIG. 2 is a diagram of a backbone network in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of a comparison of ResNet-50 and the branched network structure of the present invention;
FIG. 4 is a diagram illustrating a process of embedding and expressing upper semantics according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an attention mechanism for a branch network in an embodiment of the present invention;
FIG. 6 is a block diagram of a top level branching network in accordance with an embodiment of the present invention;
FIG. 7 is a flowchart illustrating the steps of a method for implementing a hierarchical semantic embedding model for fine-grained object recognition according to the present invention.
Detailed Description
Other advantages and capabilities of the present invention will be readily apparent to those skilled in the art from the present disclosure by describing the embodiments of the present invention with specific embodiments thereof in conjunction with the accompanying drawings. The invention is capable of other and different embodiments and its several details are capable of modification in various other respects, all without departing from the spirit and scope of the present invention.
FIG. 1 is a system architecture diagram of a hierarchical semantic embedding model for fine-grained object recognition according to the present invention. In the invention, the Hierarchical Semantic knowledge Embedding algorithm model (HSE for short) comprises three aspects: image deep feature extraction, embedding of semantic knowledge into expression learning, and regularization of the semantic space of prediction results by semantic knowledge. The HSE model is an algorithm model based on deep learning technology; the whole HSE framework relies on a deep neural network for deep expression learning. The HSE framework utilizes hierarchical semantic knowledge in two ways: embedding semantic knowledge during feature expression, and using semantic knowledge to regularize prediction results during model training. Specifically, as shown in FIG. 1, the Hierarchical Semantic Embedding (HSE) model for fine-grained object recognition according to the present invention comprises:
the backbone network 1, which extracts shallow features of an input image and outputs them to each branch network 2 in the form of a feature map; that is, the input image first passes through the backbone network 1 for preliminary feature extraction, and the result is output to the branch networks 2 as a feature map;
and the branch networks 2, which perform further deep feature extraction on the shallow feature map output by the backbone network, so that the output feature map suits the recognition task of each branch's hierarchy level, and which realize the guidance of upper-level semantic knowledge over lower-level branch feature learning through a semantic knowledge embedding mechanism. That is, the feature map output by the backbone network 1 is input into the branch network 2 corresponding to each hierarchy level for further feature expression and is output as a feature vector, which a softmax classifier then turns into the predicted label distribution of that level.
In the invention, the semantic knowledge embedding mechanism is embodied in the branch network 2, which adopts an attention mechanism guided by semantic knowledge. Specifically, the branch network 2 first performs a secondary characterization of the feature map from the backbone network 1 to generate a new branch feature map, which is essentially a stack of feature maps, i.e. a 3-dimensional tensor. An attention weight map is then learned by combining the score vector predicted by the upper level with this lower-level branch feature map. The attention weight map represents how important each spatial position of the new feature map is for identifying the target class: the more discriminative a position is, the more attention it attracts and the larger the weight at the corresponding position of the weight map. Applying the weight map to the branch feature map produces a weighted branch feature map, from which the label distribution of the hierarchy level is predicted.
Therefore, the trunk-branch multi-branch network structure reduces computational overhead by sharing shallow network parameters, while the multiple independent branches allow the model to accommodate the optimization targets of different tasks.
It should be noted here that the regularization of the prediction-result semantic space by semantic knowledge is embodied in the training process of the HSE model: the invention uses the score vector predicted by the upper level as a soft target to constrain the lower-level prediction results to conform to the semantic rules of the classification tree, thereby regularizing the semantic space of lower-level predictions, as described in detail later.
Specifically, the rough workflow of the hierarchical semantic embedding model of the present invention is as follows:
(1) an image I is input;
(2) the backbone network Tr extracts a feature map of the image I, denoted f_I;
(3) f_I is input into the top-level branch network Br_1;
(4) Br_1 performs forward computation on f_I to obtain the prediction score vector s_1 of the top-level categories;
(5) for each level i from 2 to n (i ≥ 2):
(5.1) f_I is input into the i-th level branch network Br_i;
(5.2) the prediction score vector s_{i-1} of the upper-level categories is input into Br_i;
(5.3) under the guidance of s_{i-1}, Br_i performs forward computation on f_I to obtain the prediction score vector s_i of the i-th level.
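The following is a minimal sketch of this top-down inference flow, assuming a PyTorch-style implementation; `trunk`, `branches`, and the call signatures are illustrative assumptions, not code from the patent:

```python
import torch

def hse_forward(trunk, branches, image):
    """Sketch of the HSE top-down inference flow described above.

    `trunk` is the shared backbone Tr; `branches` is a list [Br_1, ..., Br_n]
    ordered from the coarsest to the finest hierarchy level. All names here
    are assumptions for illustration.
    """
    f_I = trunk(image)               # (2) shared shallow feature map f_I
    scores = [branches[0](f_I)]      # (3)-(4) top level: no upper-level guidance
    for br in branches[1:]:          # (5) levels i = 2 ... n
        scores.append(br(f_I, scores[-1]))  # (5.1)-(5.3) guided by s_{i-1}
    return scores                    # one prediction score vector per level
```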
Compared with the prior art, the invention improves accuracy on the Caltech-UCSD Birds dataset by 1.6% over the previous best algorithm, and on the Vegfru dataset by 2.3%. In addition, on the Caltech-UCSD Birds dataset, the invention matches the accuracy of the previous best algorithm while using 20% less training data.
The invention will be further illustrated by the following specific examples:
Example:
1. Hierarchical annotation of data
Taking images of birds as an example, hierarchical annotation information must be prepared in addition to the images. For example, if birds are labeled at 4 hierarchy levels, namely order, family, genus and species, each training/testing sample to be provided should include: an image, an order category label, a family category label, a genus category label, and a species category label.
2. Implementation of HSE model
The HSE model includes a backbone network (trunk net) and branch networks (branch nets). The backbone network mainly extracts shallow features of the input image. Each branch network has two functions: first, it performs further deep feature extraction on the shallow feature map output by the backbone network, so that the output feature map suits the recognition task of the branch's hierarchy level; second, it realizes the guidance of upper-level semantic knowledge over lower-level branch feature learning by introducing a knowledge embedding module. The trunk-branch multi-branch structure reduces computational cost by sharing shallow network parameters, while the multiple independent branches let the model accommodate the optimization targets of different tasks. The network structures of the backbone network and the branch networks are described in detail below.
1) Backbone network
In the embodiment of the invention, the backbone network is built on the layer structure of a residual network; FIG. 2 shows it alongside the network structure of ResNet-50. In the figure, conv1 is a single-layer convolution operation, and layer2_x to layer5_x are layers formed by stacking several residual modules, each comprising multiple convolution layers. In the ResNet-50 structure, layer2_x to layer5_x consist of 3, 4, 6 and 3 residual modules respectively; each residual module contains 3 convolution layers, 48 layers in total, which together with the bottom conv1 and the output fully connected layer fc form the 50-layer network structure.
The backbone network of the HSE model adopts only layer4_x of the ResNet-50 structure and the input layers before it, comprising 41 parameter layers. Table 1 describes the specific network parameters of the backbone network. The HSE model performs category prediction at multiple levels, and the parameters of the backbone network are shared by the prediction networks of every level. Given an input picture, the backbone network performs preliminary shallow feature extraction and outputs the result as a feature map. In a specific embodiment of the invention, for a picture with an input resolution of 448×448, the backbone network outputs a feature map of spatial size 28×28.
TABLE 1 Key parameters of the HSE model backbone network (following the standard ResNet-50 layout for a 448×448 input)

layer name | structure | output size
conv1 | 7×7, 64, stride 2 | 224×224
max pool | 3×3, stride 2 | 112×112
layer2_x | [1×1, 64; 3×3, 64; 1×1, 256] × 3 | 112×112
layer3_x | [1×1, 128; 3×3, 128; 1×1, 512] × 4 | 56×56
layer4_x | [1×1, 256; 3×3, 256; 1×1, 1024] × 6 | 28×28
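A minimal sketch of constructing such a backbone by truncating torchvision's ResNet-50 follows; note that torchvision names the residual stages layer1 to layer4, while the patent counts conv1 as the first layer and calls them layer2_x to layer5_x. The weights argument is an assumption about the torchvision API version:

```python
import torch
import torchvision

# Shared backbone: ResNet-50 up to the patent's layer4_x (torchvision's layer3).
resnet = torchvision.models.resnet50(weights="IMAGENET1K_V1")
backbone = torch.nn.Sequential(
    resnet.conv1, resnet.bn1, resnet.relu, resnet.maxpool,
    resnet.layer1,   # layer2_x in the patent's naming
    resnet.layer2,   # layer3_x
    resnet.layer3,   # layer4_x
)

x = torch.randn(1, 3, 448, 448)
print(backbone(x).shape)  # -> torch.Size([1, 1024, 28, 28]), the 28x28 map
```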
2) Branch network
FIG. 3 is a diagram comparing ResNet-50 with the branch network structure of the present invention. In the branch network structure, each hierarchy level corresponds to one branch network. Specifically, the branch network includes: a deep feature extraction sub-module 201, an upper-level semantic knowledge embedding sub-module 202, and a score fusion sub-module 203.
To maintain consistency with the ResNet-50 network structure and facilitate fair comparison in subsequent experiments, the deep feature extraction sub-module 201 follows the layer5_x layer structure of ResNet-50. In the embodiment of the present invention, the layer5_x structure is composed of 3 residual modules; it performs deep feature extraction on the feature map output by the backbone network and outputs both a feature expression guided by upper-level semantic knowledge (emphasizing local discriminability) and an unguided feature expression (emphasizing global discriminability). Specifically, the feature map output by the backbone network (28×28 resolution) is taken as input, and after the layer5_x processing of the deep feature extraction module 201, a feature map of size 14×14 is output. The dimension of the output feature map is actually n×C×W×H; in the specific embodiment of the present invention, n = 8 is the batch size, C = 2048 is the number of channels, and W and H are the width and height, both 14. Note in particular that in a branch network the layer5_x structure is instantiated twice: one instance feeds the upper-level semantic knowledge embedding submodule, and the other produces the expression of global features. To distinguish the two layer5_x structures, the former is denoted φ_i(·) and the latter ψ_i(·); φ_i(·) and ψ_i(·) are independent of each other and do not share parameters.
In the upper-level semantic knowledge embedding sub-module 202, the score vector s_{i-1} predicted by the upper level is first mapped into a 1024-dimensional semantic knowledge expression vector through a fully connected layer. This vector is then spliced with each site on the W×H plane of the feature map output by φ_i(·); the splicing operator shown in FIG. 4 denotes this operation. In implementation, the knowledge expression vector can simply be replicated W×H times for convenience. FIG. 4 demonstrates the above process.

The spliced feature map then learns an attention coefficient vector through an attention model α(·). FIG. 5 illustrates the processing of the attention model: each position on the W×H plane of the spliced feature map is successively mapped through two fully connected layers fc into 1024 and then 2048 dimensions, finally yielding the attention coefficient vector (as shown in the rightmost diagram of FIG. 5).

The obtained attention coefficient vector acts on the feature map output by φ_i(·): the multiplication operator in FIG. 5 indicates that the attention coefficient vector is multiplied with the values at each corresponding location of the φ_i(·) output feature map, yielding the weighted feature map f_i.
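A minimal sketch of this submodule follows, assuming PyTorch; the layer names, the ReLU between the two attention layers, and the absence of any extra normalization are assumptions of this sketch, not details stated in the patent:

```python
import torch
import torch.nn as nn

class SemanticGuidedAttention(nn.Module):
    """Sketch of the upper-level semantic knowledge embedding sub-module 202."""
    def __init__(self, num_parent_classes, channels=2048, embed_dim=1024):
        super().__init__()
        self.embed = nn.Linear(num_parent_classes, embed_dim)  # s_{i-1} -> knowledge vector
        self.att_fc1 = nn.Linear(channels + embed_dim, 1024)   # first attention FC (to 1024)
        self.att_fc2 = nn.Linear(1024, channels)               # second attention FC (to 2048)

    def forward(self, feat, parent_scores):
        # feat: n x C x W x H map from phi_i; parent_scores: n x num_parent_classes
        n, c, w, h = feat.shape
        k = self.embed(parent_scores)                          # n x embed_dim
        k = k[:, :, None, None].expand(n, k.shape[1], w, h)    # replicate over the W x H plane
        cat = torch.cat([feat, k], dim=1)                      # splice at every site
        cat = cat.permute(0, 2, 3, 1)                          # n x W x H x (C + embed_dim)
        a = self.att_fc2(torch.relu(self.att_fc1(cat)))        # per-position coefficient vector
        a = a.permute(0, 3, 1, 2)                              # back to n x C x W x H
        return feat * a                                        # weighted feature map f_i
```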
In the score fusion submodule 203, the feature maps output by φ_i(·) and ψ_i(·) are combined through a score fusion operation to output the corresponding score vector.

Specifically, the score fusion process is expressed as follows:

S = (fc_1 + fc_2 + fc_cat)/3

where fc_1, fc_2 and fc_cat are all c×1-dimensional vectors; the first two are obtained by passing the feature maps output by φ_i(·) and ψ_i(·) through the fully connected layers fc2 and fc1′ respectively, and the latter is obtained by concatenating fc_1 and fc_2 and passing the result through the fully connected layer fc_concate, yielding the same dimension as fc_1 and fc_2, namely c×1.
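A corresponding sketch of the score fusion, under the same PyTorch assumptions; global average pooling before the fully connected layers and the default c=200 (the CUB species count) are assumptions for illustration:

```python
import torch
import torch.nn as nn

class ScoreFusion(nn.Module):
    """Sketch of score fusion: S = (fc_1 + fc_2 + fc_cat) / 3."""
    def __init__(self, channels=2048, c=200):
        super().__init__()
        self.fc_guided = nn.Linear(channels, c)   # from the phi_i (guided) branch
        self.fc_global = nn.Linear(channels, c)   # from the psi_i (global) branch
        self.fc_concate = nn.Linear(2 * c, c)     # maps [fc_1; fc_2] back to c x 1

    def forward(self, f_guided, f_global):
        # f_*: n x channels x W x H feature maps; pool each to a vector
        fc_1 = self.fc_guided(f_guided.mean(dim=(2, 3)))
        fc_2 = self.fc_global(f_global.mean(dim=(2, 3)))
        fc_cat = self.fc_concate(torch.cat([fc_1, fc_2], dim=1))
        return (fc_1 + fc_2 + fc_cat) / 3         # fused score vector S
```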
In particular, in the embodiment of the present invention, since the top-level categories have no upper-level semantics to guide them, the corresponding network structure is actually as shown in FIG. 6. Except for the last fully connected layer fc1, whose output corresponds to the number of categories at that level, the parameter settings of the other layers are consistent with the original ResNet-50 and are not repeated here.
FIG. 7 is a flowchart illustrating the steps of the method for implementing a hierarchical semantic embedding model for fine-grained object recognition according to the present invention. When the HSE model is trained, the ordinary category labels are taken as the optimization target and the cross-entropy loss function as the optimization objective. Specifically, the prediction score vector of the i-th level is normalized by the softmax function:

s̃_i^(k) = exp(s_i^(k)/T) / Σ_j exp(s_i^(j)/T)

It should be noted that this softmax differs from the aforementioned softmax function only in the setting of the temperature coefficient; in this implementation the temperature coefficient T is set to 1.

For an image sample whose correct label at the current hierarchy level is c_i, its loss value can be expressed as:

ℓ_i = −log s̃_i^(c_i)

Summing ℓ_i over all samples yields the classification loss value L_i^cls of the whole training set.
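A minimal sketch of this temperature softmax plus cross-entropy computation, assuming PyTorch; the function name is illustrative:

```python
import torch.nn.functional as F

def classification_loss(scores, labels, T=1.0):
    """Temperature softmax followed by cross-entropy, per the formulas above.

    scores: n x c prediction score vectors s_i; labels: n correct class
    indices c_i. With T = 1 this reduces to standard cross-entropy.
    """
    log_probs = F.log_softmax(scores / T, dim=1)  # log of the normalized scores
    return F.nll_loss(log_probs, labels)          # mean of -log s_i^(c_i)
```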
Specifically, as shown in fig. 7, the implementation method of a hierarchical semantic embedded model for fine object recognition according to the present invention includes the following steps:
and step S1, performing hierarchical labeling on each piece of training data.
Taking images of birds as an example, hierarchical annotation information must be prepared in addition to the images. For example, if birds are labeled at 4 hierarchy levels, namely order, family, genus and species, each training/testing sample to be provided should include: an image, an order category label, a family category label, a genus category label, and a species category label.
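As a concrete illustration, one hierarchically labeled record might be organized as below; the field names and index values are hypothetical, not a format prescribed by the patent:

```python
# One training record with labels at all four hierarchy levels (hypothetical layout).
sample = {
    "image": "images/bird_000001.jpg",
    "order": 3,      # index into the order-level category list
    "family": 17,    # index into the family-level category list
    "genus": 52,     # index into the genus-level category list
    "species": 121,  # index into the species-level category list
}
# Level-by-level label list used when training Br_1 ... Br_4:
labels = [sample["order"], sample["family"], sample["genus"], sample["species"]]
```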
And step S2, adopting a weighted combination of the classification loss function and the regularization constraint loss function as the objective function for optimizing the HSE model, and training the branch networks corresponding to each level of categories step by step, from the 1st-level categories to the Nth-level categories.
When training the branch network corresponding to a certain level of categories, the prediction score vector of the upper-level categories must first be obtained. Therefore, in this step, the branch networks are trained one by one, from the 1st-level categories to the Nth-level categories. Since the parameters of the backbone network are shared by all branches, they do not need to be optimized in this step; they are simply initialized with the ResNet-50 network model parameters pre-trained on the ImageNet dataset and kept fixed throughout this step, without optimization or updating.
When training the branch network corresponding to the i-th level categories, the HSE model integrates the network structures corresponding to the first i−1 levels, so the parameters of the first i−1 branch networks in the HSE model are initialized with the previously trained branch network models of those levels. For the i-th branch network, the parameters of the 9 relevant layers involved in the sub-networks ψ_i(·) and φ_i(·) are likewise initialized with the ResNet-50 network model parameters pre-trained on the ImageNet dataset. In addition, the semantic knowledge embedding mapping and the attention model α(·) are realized by fully connected layers, whose parameters are initialized with the Xavier algorithm. The optimization objective function of the branch network is:

L_i = L_i^cls + γ·L_i^reg

where γ is a balance parameter used to balance the impact of the classification loss term L_i^cls and the regularization constraint loss term L_i^reg on the network parameters. Because the gradient values produced by the regularization constraint loss term L_i^reg are smaller in magnitude than those produced by L_i^cls, a relatively large weight value needs to be set (γ = 2 is used in the embodiment of the present invention).
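A minimal sketch of this per-branch objective follows, assuming PyTorch. The patent states that the upper-level prediction serves as a soft target (with KL divergence as one usable distance, per the later remark); the exact scheme for comparing the two levels is not spelled out here, so the child-to-parent aggregation via a `parent_of` index map is an assumption of this sketch:

```python
import torch
import torch.nn.functional as F

def branch_loss(scores_i, labels_i, scores_parent, parent_of, gamma=2.0, T=1.0):
    """Sketch of L_i = L_i^cls + gamma * L_i^reg under stated assumptions.

    parent_of: LongTensor of length c_i mapping each child class to its
    parent class index. The aggregation scheme is illustrative only.
    """
    cls = F.cross_entropy(scores_i / T, labels_i)

    # Soft-target regularization: pull the parent distribution induced by the
    # child prediction toward the upper-level prediction (treated as fixed).
    child_probs = F.softmax(scores_i / T, dim=1)          # n x c_i
    n, num_parents = scores_parent.shape
    induced = child_probs.new_zeros(n, num_parents)
    induced.index_add_(1, parent_of, child_probs)         # sum children per parent
    parent_probs = F.softmax(scores_parent.detach() / T, dim=1)
    reg = F.kl_div(induced.clamp_min(1e-8).log(), parent_probs,
                   reduction="batchmean")

    return cls + gamma * reg
```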
It should be noted that, because the top-level categories do not introduce upper-level semantic knowledge, the corresponding branch network only needs the classification loss function term as the objective function for model parameter optimization.
In an embodiment of the present invention, the training images are scaled to 512×512, and the data augmentation includes randomly cropping 448×448 regions for training and applying horizontal flip transformations to the training samples. For optimization, the invention uses the SGD algorithm with a mini-batch strategy: the batch size is 8, the SGD momentum term is 0.9, the weight decay factor is 0.00005, and the initial learning rate is 0.001; after about 300 passes over the training set, the learning rate is reduced by a factor of 10 and training continues.
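These hyper-parameters translate directly into an optimizer configuration; the sketch below assumes PyTorch, with `hse_model` as a placeholder name and the epoch-based scheduler step as an assumption:

```python
import torch

# Optimizer configuration matching the hyper-parameters stated above.
optimizer = torch.optim.SGD(
    (p for p in hse_model.parameters() if p.requires_grad),  # backbone stays frozen
    lr=0.001,               # initial learning rate
    momentum=0.9,           # the SGD momentum term
    weight_decay=0.00005,   # weight decay factor
)
# After roughly 300 passes over the training set, divide the learning rate by 10
# (call scheduler.step() once per pass over the training set):
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=300, gamma=0.1)
```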
And step S3, after all branch networks have been preliminarily trained, all parameters of the entire HSE model are jointly optimized. The objective function of the joint optimization combines the losses of all levels:

L = Σ_{i=1}^{N} (L_i^cls + γ·L_i^reg)

In training, the same data augmentation methods and the same hyper-parameter configuration as in step S2 are still used, except that a smaller learning rate of 0.00001 is adopted, which is not repeated here.
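A corresponding sketch of the joint fine-tuning stage, under the same assumptions as the previous snippet (`hse_model` remains a placeholder):

```python
import torch

# Joint optimization: unfreeze every parameter (including the shared backbone)
# and fine-tune the whole HSE model with the smaller learning rate.
for p in hse_model.parameters():
    p.requires_grad = True
optimizer = torch.optim.SGD(hse_model.parameters(),
                            lr=0.00001, momentum=0.9, weight_decay=0.00005)
```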
It should be noted here that the backbone network of the present invention adopts a network structure of ResNet-50, and similarly, other general convolutional neural network structures such as VGG16 may be adopted instead.
The network structure illustrated by the invention has 4 levels; in fact, the number of levels depends only on the depth of the dataset's classification hierarchy, and the invention applies to any number of levels.
One of the loss functions adopted when training the model is the KL divergence; in fact, any general distance measurement function, such as the Euclidean distance, is also applicable.
In summary, the hierarchical semantic embedding model for fine-grained object recognition and its implementation method adopt the hierarchical structure of object classification as semantic information and embed it into the feature expression of a deep neural network model, thereby solving the problem of the high labeling cost of additional information in fine-grained recognition schemes that rely on additional information to guide learning, while also reducing model complexity.
The foregoing embodiments are merely illustrative of the principles and utilities of the present invention and are not intended to limit the invention. Modifications and variations can be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the present invention. Therefore, the scope of the invention should be determined from the following claims.

Claims (7)

1. A hierarchical semantic embedding model for fine-grained identification of objects, comprising:
the main network is used for extracting shallow features of the input image and outputting the shallow features to each branch network in the form of a feature map;
the branch networks are used for further extracting deep features of the image shallow feature map output by the main network, so that the output feature map is suitable for the recognition task of the corresponding hierarchy of the branch network, and the guidance of upper-layer semantic knowledge on the feature learning of the lower-layer branch network is realized by introducing a semantic knowledge embedding mechanism;
the branched network includes:
the deep feature extraction submodule is used for carrying out deep feature extraction on the feature map output by the backbone network and outputting feature expression guided by superior semantic knowledge and unguided feature expression;
the upper-level semantic knowledge embedding submodule, which maps the score vector s_{i-1} predicted by the upper level into a semantic knowledge expression vector through a fully connected layer, splices this vector with each site on the W×H plane of the feature map output by the deep feature extraction submodule, learns an attention coefficient vector from the spliced feature map through an attention model, and applies the attention coefficient vector to the feature map output by the deep feature extraction submodule to obtain a weighted feature map, wherein W and H refer to width and height respectively;
and the score fusion submodule is used for outputting corresponding score vectors through score fusion operation on the feature maps output by the upper semantic knowledge embedding submodule and the deep feature extraction submodule.
2. A hierarchical semantic embedding model for fine-grained identification of objects according to claim 1, characterized in that: the branch network performs a secondary characterization of the feature map from the backbone network to generate a new branch feature map; an attention weight map is learned by combining the score vector predicted by the upper level with the branch feature map of the lower level; the attention weight map is applied to the branch feature map, finally generating a weighted branch feature map from which the label distribution of the hierarchy level is predicted.
3. A hierarchical semantic embedding model for fine-grained identification of objects according to claim 1, characterized in that: the backbone network adopts layer4_x of the ResNet-50 network structure and the input layers before it, comprising 41 parameter layers; the parameters of the backbone network are shared by the prediction networks of every level.
4. A hierarchical semantic embedding model for fine-grained identification of objects according to claim 1, characterized in that: the deep feature extraction sub-module adopts the layer5_x layer structure of the ResNet-50 network, which is composed of 3 residual modules; the layer5_x structure is instantiated twice, one instance feeding the upper-level semantic knowledge embedding sub-module and the other producing the expression of global features.
5. A hierarchical semantic embedding model for fine-grained identification of objects according to claim 1, characterized in that: the attention model successively maps each position point on the W×H plane of the spliced feature map into the corresponding dimensions through two fully connected layers, finally obtaining the attention coefficient vector.
6. A hierarchical semantic embedding model for fine-grained identification of objects according to claim 1, characterized in that: the score fusion process of the score fusion submodule is as follows:
S=(fc_1+fc_2+fc_cat)/3
where fc_1, fc_2 and fc_cat are all c×1-dimensional vectors; the first two are obtained by passing the feature maps output by the upper-level semantic knowledge embedding submodule and the deep feature extraction submodule through a fully connected layer respectively, and the latter is obtained by concatenating fc_1 and fc_2 and passing the result through a fully connected layer fc_concate to obtain the same dimensions as fc_1 and fc_2.
7. A hierarchical semantic embedding model for fine-grained identification of objects according to claim 1, characterized in that: for the top-level categories of the branch network, except for the last fully connected layer, whose output corresponds to the number of categories at that level, the parameter settings of the other layers are consistent with the original ResNet-50 network.
CN201810924288.XA 2018-08-14 2018-08-14 A Hierarchical Semantic Embedding Model for Object Recognition and Its Implementation Active CN109102024B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810924288.XA CN109102024B (en) 2018-08-14 2018-08-14 A Hierarchical Semantic Embedding Model for Object Recognition and Its Implementation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810924288.XA CN109102024B (en) 2018-08-14 2018-08-14 A Hierarchical Semantic Embedding Model for Object Recognition and Its Implementation

Publications (2)

Publication Number Publication Date
CN109102024A CN109102024A (en) 2018-12-28
CN109102024B true CN109102024B (en) 2021-08-31

Family

ID=64849727

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810924288.XA Active CN109102024B (en) 2018-08-14 2018-08-14 A Hierarchical Semantic Embedding Model for Object Recognition and Its Implementation

Country Status (1)

Country Link
CN (1) CN109102024B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109961107B (en) * 2019-04-18 2022-07-19 北京迈格威科技有限公司 Training method and device for target detection model, electronic equipment and storage medium
CN110097108B (en) * 2019-04-24 2021-03-02 佳都新太科技股份有限公司 Method, device, equipment and storage medium for identifying non-motor vehicle
CN110288049B (en) * 2019-07-02 2022-05-24 北京字节跳动网络技术有限公司 Method and apparatus for generating image recognition model
CN110321970A (en) * 2019-07-11 2019-10-11 山东领能电子科技有限公司 A kind of fine-grained objective classification method of multiple features based on branch neural network
CN110837856B (en) * 2019-10-31 2023-05-30 深圳市商汤科技有限公司 Neural network training and target detection method, device, equipment and storage medium
CN113095349A (en) * 2020-01-09 2021-07-09 北京沃东天骏信息技术有限公司 Image identification method and device
CN111242222B (en) * 2020-01-14 2023-12-19 北京迈格威科技有限公司 Classification model training method, image processing method and device
CN111711821B (en) * 2020-06-15 2022-06-10 南京工程学院 Information hiding method based on deep learning
CN111814920B (en) * 2020-09-04 2021-01-05 中国科学院自动化研究所 Image fine classification method and system based on multi-granularity feature learning of graph network
CN112990147A (en) * 2021-05-06 2021-06-18 北京远鉴信息技术有限公司 Method and device for identifying administrative-related images, electronic equipment and storage medium
CN113642415B (en) * 2021-07-19 2024-06-04 南京南瑞信息通信科技有限公司 Face feature expression method and face recognition method
CN114648760A (en) * 2022-01-19 2022-06-21 美的集团(上海)有限公司 Image segmentation method, image segmentation device, electronic device, and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100215277A1 (en) * 2009-02-24 2010-08-26 Huntington Stephen G Method of Massive Parallel Pattern Matching against a Progressively-Exhaustive Knowledge Base of Patterns
CN106682060A (en) * 2015-11-11 2017-05-17 奥多比公司 Structured Knowledge Modeling, Extraction and Localization from Images
CN107979606A (en) * 2017-12-08 2018-05-01 电子科技大学 It is a kind of that there is adaptive distributed intelligence decision-making technique
CN108229543A (en) * 2017-12-22 2018-06-29 中国科学院深圳先进技术研究院 Image classification design methods and device

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Tianshui Chen et al.; Knowledge-Embedded Representation Learning for Fine-Grained Image Recognition; 27th International Joint Conference on Artificial Intelligence; 2018-07-02; pp. 1-8 *
Tianshui Chen et al.; Recurrent Attentional Reinforcement Learning for Multi-label Image Recognition; AAAI Conference on Artificial Intelligence; 2017-12-20; pp. 1-9 *
Yao Xiang; Cross-view action recognition based on non-linear knowledge transfer; Journal of Chongqing University of Posts and Telecommunications (Natural Science Edition); 2017-02-15; Vol. 29, No. 1; pp. 121-128 *

Also Published As

Publication number Publication date
CN109102024A (en) 2018-12-28

Similar Documents

Publication Publication Date Title
CN109102024B (en) A Hierarchical Semantic Embedding Model for Object Recognition and Its Implementation
CN109325952B (en) Fashionable garment image segmentation method based on deep learning
Komorowski et al. Minkloc++: lidar and monocular image fusion for place recognition
CN112200161A (en) A Face Recognition Detection Method Based on Hybrid Attention Mechanism
CN111476806B (en) Image processing method, image processing device, computer equipment and storage medium
Kumar et al. Enhancing Face Mask Detection Using Data Augmentation Techniques
Manssor et al. Real-time human detection in thermal infrared imaging at night using enhanced Tiny-yolov3 network
CN113420827A (en) Semantic segmentation network training and image semantic segmentation method, device and equipment
CN117635628B (en) A land-sea segmentation method based on contextual attention and boundary perception guidance
Huang et al. Attention-guided label refinement network for semantic segmentation of very high resolution aerial orthoimages
CN119559403B (en) High-resolution remote sensing image semantic segmentation method based on multi-scale depth supervision
CN111400572A (en) Content safety monitoring system and method for realizing image feature recognition based on convolutional neural network
Ataş Performance evaluation of jaccard-dice coefficient on building segmentation from high resolution satellite images
CN112149612A (en) Marine organism recognition system and recognition method based on deep neural network
CN113111740A (en) Characteristic weaving method for remote sensing image target detection
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
CN119649062A (en) A remote sensing image change detection method based on fine-tuning CLIP
CN118297119B (en) Neural network and method for microkernel recognition
CN115049833A (en) Point cloud component segmentation method based on local feature enhancement and similarity measurement
CN114022906A (en) Pedestrian re-identification method based on multi-level features and attention mechanism
CN113837015A (en) A method and system for face detection based on feature pyramid
CN117392508A (en) Target detection method and device based on coordinate attention mechanism
CN116824330A (en) Small sample cross-domain target detection method based on deep learning
CN119068080A (en) Method, electronic device and computer program product for generating an image
Li et al. FeatFlow: Learning geometric features for 3D motion estimation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant