
CN108734210B - An object detection method based on cross-modal multi-scale feature fusion


Info

Publication number
CN108734210B
CN108734210B (application CN201810474925.8A)
Authority
CN
China
Prior art keywords
network model
rgb
fusion
training
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810474925.8A
Other languages
Chinese (zh)
Other versions
CN108734210A (en)
Inventor
刘盛
尹科杰
刘儒瑜
陈一彬
沈康
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201810474925.8A priority Critical patent/CN108734210B/en
Publication of CN108734210A publication Critical patent/CN108734210A/en
Application granted granted Critical
Publication of CN108734210B publication Critical patent/CN108734210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/56Extraction of image or video features relating to colour

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an object detection method based on cross-modal multi-scale feature fusion, in which a depth map detection network model is initialized with the network parameters of an RGB detection network model; the feature extraction weights of a fusion network model are then initialized from the resulting RGB detection network model and depth map detection network model, and the fusion network model that fuses multi-scale cross-modal features is finally obtained by training. The method does not depend on a large labeled depth image dataset, can fuse depth image and RGB image features across modalities, and completes object recognition, localization, and detection accurately, in real time, and efficiently. The fusion network model designed by the invention can reach real-time detection speed using only a consumer-grade graphics card and a CPU as hardware.

Description

Object detection method based on cross-modal multi-scale feature fusion
Technical Field
The invention relates to the technical field of image recognition, and in particular to an object detection method based on cross-modal multi-scale feature fusion, which can complete the tasks of detecting, localizing, and accurately recognizing objects in a color-depth image (RGB-D image, containing both color and depth information).
Background
Faster, more accurate, and more generalizable object detection methods are a constant and urgent need in industry. RGB images are severely affected in some special environments; for example, motion or glare can degrade the image data, and detection using RGB image features alone often cannot reach the desired accuracy. It is therefore necessary to use information from different sensors, such as depth information, to improve object detection performance.
Since convolutional neural networks were first applied to object recognition and detection tasks, most high-accuracy object detection methods have been implemented on top of them. These networks can learn generic feature representations of objects from large-scale labeled RGB image datasets. If depth map data is to be used to improve detection accuracy, a generic depth feature representation of the object must be extracted. However, no large-scale, fully labeled depth image dataset with a sufficient number of classes exists in industry, so a generic feature representation of depth information cannot be obtained directly.
On the other hand, existing fused-feature detection methods are limited in speed and often require long computation on a high-performance GPU to produce a result, so they cannot meet the hard real-time requirements of industrial systems.
Disclosure of Invention
The invention aims to provide an RGB-D image detection method based on cross-modal multi-scale feature fusion: a real-time, high-precision fusion model is designed that uses the multi-modal features of an object to detect it accurately, completing the tasks of detecting, localizing, and accurately recognizing objects in the image.
In order to achieve the purpose, the technical scheme of the invention is as follows:
an object detection method based on cross-modal multi-scale feature fusion comprises the following steps:
training a pre-training model with RGB images from a first dataset in which object classes are labeled, and initializing a single-modality RGB detection network model from the pre-training model;
training the RGB detection network model with RGB images from a second dataset in which object classes and positions are labeled;
initializing a single-modality depth map detection network model from the trained RGB detection network model;
training the depth map detection network model with the depth images in the second dataset that correspond to the RGB images and carry the same class and position labels;
initializing a fusion network model and performing multi-scale feature fusion based on the trained RGB detection network model and the depth map detection network model;
training the fusion network model with paired RGB and depth images labeled with object classes and positions;
and detecting objects in a color-depth image with the trained fusion network model.
Further, initializing the single-modality depth map detection network model based on the trained RGB detection network model includes:
copying the network parameters of the RGB detection network model as the network parameters of the depth map detection network model.
Further, initializing the fusion network model and performing multi-scale feature fusion based on the trained RGB detection network model and depth map detection network model includes:
copying the network parameters of the RGB detection network model and the depth map detection network model as the weights of the two feature extraction parts in the fusion network model;
combining the multi-scale features extracted by the two feature extraction parts with a plurality of fusion layers.
Further, the fusion network model adopts Multibox Loss as the loss function during training.
Further, training the RGB detection network model, training the depth map detection network model, and training the fusion network model each further include:
performing data augmentation on the input data.
Further, training the fusion network model further includes:
freezing the weights of the feature extraction parts.
The invention provides an object detection method based on cross-modal multi-scale feature fusion that combines RGB and depth features to improve detection performance: the depth map detection network model is initialized with the network parameters of the RGB detection network model; the feature extraction weights of the fusion network model are then initialized from the resulting RGB detection network model and depth map detection network model, and the fusion network model that fuses multi-scale cross-modal features is finally obtained by training. The method does not depend on a large labeled depth image dataset, can fuse depth image and RGB image features across modalities, and completes object recognition, localization, and detection accurately, in real time, and efficiently. The fusion network model designed by the invention can reach real-time detection speed with only a consumer-grade graphics card and a CPU as hardware, for example a GTX1080 graphics card and an Intel 7700K CPU.
Drawings
FIG. 1 is a flowchart of an object detection method based on cross-modal multi-scale feature fusion according to the present invention;
FIG. 2 is a schematic structural diagram of the fusion network model.
Detailed Description
The technical solutions of the present invention are further described in detail below with reference to the drawings and examples, which should not be construed as limiting the present invention.
The general idea of the invention is as follows: the method does not depend on a large labeled depth image dataset, can fuse depth image and RGB image features across modalities, and completes object recognition, localization, and detection accurately, in real time, and efficiently. Training produces a fusion model that accepts cross-modal RGB and depth image input and obtains the positions and classes of multiple objects in real time. The solution requires a cross-modal feature transfer: a depth map network is initialized with the RGB model parameters and trained to obtain a depth map model; the feature extraction parts of the fusion network proposed by the invention are then initialized from the resulting RGB model and depth map model, and finally a network model fusing multi-scale cross-modal features is obtained by training. The multi-scale cross-modal fusion network with high real-time performance and detection accuracy designed by the invention is the core element of the solution.
As shown in fig. 1, an object detection method based on cross-modal multi-scale feature fusion includes:
training a pre-training model with RGB images from a first dataset in which object classes are labeled, and initializing a single-modality RGB detection network model from the pre-training model;
training the RGB detection network model with RGB images from a second dataset in which object classes and positions are labeled;
initializing a single-modality depth map detection network model from the trained RGB detection network model;
training the depth map detection network model with the depth images in the second dataset that correspond to the RGB images and carry the same class and position labels;
initializing a fusion network model and performing multi-scale feature fusion based on the trained RGB detection network model and the depth map detection network model;
training the fusion network model with paired RGB and depth images labeled with object classes and positions;
and detecting objects in a color-depth image with the trained fusion network model.
The steps of the above method are described in detail below. The model training of this embodiment includes three stages: the first stage trains the RGB detection network model, the second stage trains the depth map detection network model by cross-modal supervised transfer training, and the third stage trains the fusion network model based on the trained RGB detection network model and depth map detection network model.
Since convolutional neural networks were first applied to object recognition and detection tasks, most high-accuracy object detection methods have been implemented on top of them. These networks can learn generic feature representations of objects from large-scale labeled RGB image datasets. The technical scheme improves object detection accuracy using depth map data, which requires extracting a generic depth feature representation of the object. However, no large-scale, fully labeled depth image dataset with a sufficient number of classes exists in industry, so a generic feature representation of depth information cannot be obtained directly. In this embodiment, the single-modality RGB detection network model is trained first, and the depth map detection network model is then trained by cross-modal supervised transfer, so that the depth map detection network model can be obtained using only a small-scale dataset.
The first stage: first, the pre-training model is trained with RGB images from the first dataset in which object classes are labeled, and the single-modality RGB detection network model is initialized from the pre-training model; the RGB detection network model is then trained with the RGB images from the second dataset in which object classes and positions are labeled.
Pre-trained models obtained from labeled large-scale RGB image datasets are already mature; for example, this technical scheme can directly adopt a VGG16 model pre-trained on the ImageNet dataset. The pre-training model is trained on a labeled large-scale RGB image dataset (also referred to as the first dataset), in which the object classes in the RGB images are typically labeled.
After the pre-training model is selected, the RGB detection network model is initialized from it, that is, the neural network parameters of the pre-training model are copied into the RGB detection network model. The RGB detection network model is then fine-tuned with the RGB images of a small-scale dataset (also called the second dataset); the classes and positions of the objects to be detected must be labeled in advance in these RGB images. The small-scale dataset also contains depth images corresponding to the RGB images, in which the classes and positions of the objects to be detected are likewise labeled.
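As an illustration only, the following sketch shows how this first stage could look in code, assuming a PyTorch/torchvision implementation (the patent does not name a framework); `RGBDetector`, its head layers, and `num_classes=20` are illustrative placeholders rather than the exact SSD-VGG16 structure.

```python
import torch
import torchvision

# Load a VGG16 backbone pre-trained on ImageNet (the "first dataset").
vgg16 = torchvision.models.vgg16(weights="IMAGENET1K_V1")

class RGBDetector(torch.nn.Module):
    """Single-modality detector sketch; a real SSD-VGG16 has extra layers and multi-scale heads."""
    def __init__(self, backbone, num_classes):
        super().__init__()
        self.features = backbone                     # copied pre-trained feature extractor
        self.loc_head = torch.nn.Conv2d(512, 4, kernel_size=3, padding=1)
        self.cls_head = torch.nn.Conv2d(512, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        f = self.features(x)                         # shared convolutional features
        return self.loc_head(f), self.cls_head(f)    # box regression and class scores

rgb_model = RGBDetector(vgg16.features, num_classes=20)
# Fine-tuning on the second (small-scale, fully labeled) dataset follows here.
```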
The second stage: in this embodiment, the single-modality depth map detection network model is initialized from the trained RGB detection network model, and is then trained with the depth images in the second dataset that correspond to the RGB images and carry the same class and position labels.
In this embodiment, the RGB detection network model and the depth map detection network model are both single-modality models, and both are represented layer by layer with a neural network. The RGB image modality is represented as

$$\Phi = \{\varphi_i\}_{i=1}^{\#l},$$

where $\varphi_i$ is the $i$-th layer feature representation learned from the large-scale labeled dataset, $\#l$ is the number of layers of the neural network, and the parameters of the neural network are denoted by $W_{\varphi}^{[1\ldots\#l]}$.

The depth image modality is represented as

$$\Psi = \{\psi_i\}_{i=1}^{\#u},$$

where $\psi_i$ is the $i$-th layer feature representation, $\#u$ is the number of layers of the neural network, and likewise $W_{\psi}^{[1\ldots\#u]}$ denotes the parameters of the layered neural network representation.
In this embodiment, initializing the single-modality depth map detection network model from the trained RGB detection network model means copying the network parameters $W_{\varphi}^{[1\ldots\#l]}$ of the RGB detection network model as the network parameters of the depth map detection network model. The depth map detection network model is then fine-tuned with the depth image part of the small-scale dataset; its trained network parameters are denoted $W_{\psi}^{[1\ldots\#u]}$. The resulting depth map detection network model can identify the object classes and positions in the depth map.
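Continuing the illustrative sketch above (same hypothetical objects, not the patent's own code), the cross-modal initialization amounts to copying the RGB network parameters into an identically structured depth network before fine-tuning on the depth images:

```python
import copy

# Build a depth detector with the same architecture and copy the RGB network's parameters
# into it (W_phi used to initialize W_psi), then fine-tune on HHA-encoded depth images.
depth_model = RGBDetector(copy.deepcopy(vgg16.features), num_classes=20)
depth_model.load_state_dict(rgb_model.state_dict())
```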
A cross-modal Supervision Transfer method is adopted: the neural network of the depth modality is initialized from the neural network representation of the RGB modality, and this cross-modal transfer has been verified on the convolution-pooling layers. Assume that a large paired, but unlabeled, dataset $P_{l,u}$ covering the two modalities is available. The feature representations $\varphi_{\#l}$ and $\psi_{\#u}$ are matched on the paired RGB image $I_l$ and depth image $I_u$ ($\varphi_{\#l}$ is the partial representation of the RGB network; $I_l$ and $I_u$ are the RGB and depth images in the dataset, respectively), so that a rich representation of the depth map can be learned from the RGB network. A transformation function $t$ is used to bring the two representations to the same dimension, and a loss function $f$ (which may take any functional form) is defined on the above networks; the parameters of the depth map network can then be obtained by training:

$$\min_{W_{\psi}^{[1\ldots\#u]},\, t}\ \sum_{(I_l, I_u) \in P_{l,u}} f\big(\varphi_{\#l}(I_l),\; t(\psi_{\#u}(I_u))\big)$$
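A minimal sketch of this objective, assuming PyTorch and choosing $f$ as a mean squared error and $t$ as a 1×1 convolution (the patent leaves $f$ and $t$ general), could look as follows:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SupervisionTransferLoss(nn.Module):
    """f(phi(I_l), t(psi(I_u))) with f = mean squared error and t = a 1x1 convolution
    that maps the depth feature channels onto the RGB feature channels."""
    def __init__(self, depth_channels, rgb_channels):
        super().__init__()
        self.t = nn.Conv2d(depth_channels, rgb_channels, kernel_size=1)   # transformation t

    def forward(self, phi_rgb, psi_depth):
        # phi_rgb: features of the (frozen) RGB network on I_l;
        # psi_depth: features of the depth network on the paired I_u.
        return F.mse_loss(self.t(psi_depth), phi_rgb)

# During training, only the depth network and t are updated; the RGB features act as the
# supervisory signal on paired but unlabeled RGB-D images.
```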
the single-mode detection network of the invention comprises an element _ wise-sum layer, a permate layer, a flatten layer and a private layer besides the conventional convolution pooling and full connection layer. First, the element _ wise-sum layer sums the feature maps, which can be regarded as a summation operation of the multidimensional matrix. Second, the permate layer changes the order of the data dimensions, which is the same as multiplying by an identity matrix that has undergone row swapping. The flatten layer then merges the multidimensional matrices into one dimension. Finally, the priorbox layer is used to process the bounding box and has no effect on the rich features of the image. All these layers can be unified into a conversion function s, and the loss function of the network can be described as follows:
Figure BDA0001664185870000065
based on the loss function, the cross-modal model transfer of the single-modal network can be realized.
It should be noted that the small-scale dataset contains RGB images labeled with object classes and positions, together with corresponding depth maps labeled with the same classes and positions. When the small-scale dataset is used to fine-tune the RGB detection network model, its RGB images are used; when it is used to fine-tune the depth map detection network model, its depth maps are used, represented in HHA format (HHA encoding has three channels: horizontal disparity, height above ground, and angle with gravity), whose dimensions are consistent with the RGB images.
In this embodiment, when the VGG16 model is used as the pre-training model, the RGB detection network model and the depth map detection network model share the same network structure and can take the form of SSD-VGG16, i.e., an SSD network that uses VGG16 as its feature extraction part. These networks are not fixed, however, and other network forms can be used; for example, with ResNet as the pre-training model, the corresponding RGB detection network model and depth map detection network model can take the form of SSD-ResNet.
The third stage: based on the trained RGB detection network model and depth map detection network model, the fusion network model is initialized and multi-scale feature fusion is performed; the fusion network model is then trained with paired RGB and depth images labeled with object classes and positions.
In this embodiment, the trained network parameters of the RGB detection network model and the depth map detection network model are used to initialize the weights of the feature extraction parts in the fusion network model, that is, those network parameters are copied as the weights of the feature extraction parts of the fusion network. The input data for fine-tuning training are then paired RGB and depth images, which may take the form of the second dataset. As shown in fig. 2, taking SSD-VGG16 as the feature extraction part as an example, the fusion network copies all parameters of the SSD-VGG16 modules obtained from the RGB detection network and the depth map detection network as the weights of its two feature extraction parts (the RGB feature extraction part and the depth map feature extraction part). The fusion network thus takes an RGB image and a depth image as input and, after the two feature extraction parts process them separately, obtains multi-level generic RGB features and generic depth image features (multi-scale features).
After the two feature extraction parts (the RGB feature extraction part and the depth map feature extraction part) have each extracted the multi-scale features of their modality, the fusion network model fuses features from different scales with a multi-layer structure (a plurality of fusion layers combine the multi-scale features extracted by the two feature extraction parts); these features come from convolution-pooling layers with more pronounced semantic features, corresponding to higher layers of the network structure. There are two feasible fusion points in the fusion network architecture: one is at the lower layers of the network, before the feature extraction part; the other is after the feature extraction layers, at higher layers of the network. Lower network layers carry more spatial features, while higher layers carry more semantic features. If two objects to be detected are the same object, their high-level generic feature representations are close, but their low-level representations may differ greatly. The invention therefore chooses high-level rather than low-level fusion, and experiments also show that high-level fusion achieves better results. In the architecture of the fusion network, the invention fuses features from several specific network layers rather than from a single layer, namely the multi-scale features of several convolution-pooling layers with more pronounced semantic features. As shown in fig. 2, taking SSD-VGG16 as the feature extraction part as an example, the fusion layers can fuse the features of the conv4-3, fc7, conv6_2, conv7_2, conv8_2 and conv9_2 layers to obtain the fused features of the upper network layers. Note that fig. 2 shows only conv4-3 and fc7; the fusion of the other layers is analogous and is omitted from the figure.
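For illustration, a sketch of the element-wise-sum fusion of the multi-scale features can be as simple as the following; the layer names follow the SSD-VGG16 example above, and the tensors of the two branches are assumed to have matching shapes:

```python
# Layer names follow the SSD-VGG16 feature extractor described above.
FUSED_LAYERS = ["conv4_3", "fc7", "conv6_2", "conv7_2", "conv8_2", "conv9_2"]

def fuse_multiscale(rgb_feats, depth_feats):
    """Element-wise-sum fusion of multi-scale features from the RGB and depth branches.
    Both arguments map layer names to feature tensors of identical shape."""
    return {name: rgb_feats[name] + depth_feats[name] for name in FUSED_LAYERS}
```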
Experimental results show that replacing the element-wise-sum layers with concatenation layers gives worse results, so the cross-modal features of the RGB and depth maps are fused with a corresponding number of element-wise-sum merging layers. The fused features are used to predict object classes and positions: once the features are obtained, regression predictions are produced by convolution layers with two 3×3 convolution kernels, where the first kernel predicts the position (1×4 dimensions) and the other predicts the object class (1×the number of classes to be predicted). Finally, the resulting predictions are filtered by non-maximum suppression (NMS) to obtain the final result. The fusion network adopts Multibox Loss as its loss function during training.
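A hedged sketch of such a prediction head and the final NMS step, assuming PyTorch/torchvision with illustrative parameter names (`num_priors` and `iou_threshold=0.45` are not specified by the patent):

```python
import torch
import torch.nn as nn
from torchvision.ops import nms

class PredictionHead(nn.Module):
    """Two 3x3 convolutions per fused scale: one regresses box offsets (4 values per
    prior box), the other predicts class scores (num_classes values per prior box)."""
    def __init__(self, in_channels, num_priors, num_classes):
        super().__init__()
        self.loc = nn.Conv2d(in_channels, num_priors * 4, kernel_size=3, padding=1)
        self.cls = nn.Conv2d(in_channels, num_priors * num_classes, kernel_size=3, padding=1)

    def forward(self, fused_feat):
        return self.loc(fused_feat), self.cls(fused_feat)

# After decoding the prior boxes and gathering per-class scores (omitted), the detections
# are filtered with non-maximum suppression, e.g.:
# keep = nms(boxes, scores, iou_threshold=0.45)
```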
In order to make better use of the input data, this embodiment applies data augmentation to the input data, for example rotation, mirroring, and cropping, to introduce spatial diversity into the images, so that the trained model is more robust.
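As an illustration, paired augmentation can be implemented by drawing the random parameters once and applying them to both modalities; the following NumPy sketch (mirroring and cropping only, with an assumed 90% crop ratio) is not part of the patented scheme:

```python
import numpy as np

def paired_augment(rgb, depth, rng=None):
    """Apply the same randomly drawn mirror/crop to an RGB image and its paired depth
    (HHA) image; rotation is analogous. Bounding-box labels must be adjusted the same
    way (omitted here). Inputs are HxWxC NumPy arrays."""
    rng = rng or np.random.default_rng()
    if rng.random() < 0.5:                               # shared horizontal mirror
        rgb, depth = rgb[:, ::-1], depth[:, ::-1]
    h, w = rgb.shape[:2]
    ch, cw = int(0.9 * h), int(0.9 * w)                  # crop to 90% of each side
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    return (rgb[top:top + ch, left:left + cw],
            depth[top:top + ch, left:left + cw])
```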
When the fusion network model is trained, the weights of the feature extraction parts are frozen and only the fusion part is trained, that is, the learning rate of the feature extraction parts is set below a threshold (0 or a very small value, such as 10e-8), so that the training process concentrates on the fusion part without changing the weights of the RGB and depth feature extraction parts too much. By freezing the module weights copied from the RGB and depth models and training only the fusion part, the fine-tuning of the fusion network is completed. The number of training iterations is generally more than forty thousand, and the base learning rate is set to about 0.001. The fusion part is the part of the fusion network model other than the feature extraction parts.
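A minimal sketch of this freezing strategy, assuming PyTorch and an illustrative `fusion_model` with `rgb_features`/`depth_features` submodules (names not taken from the patent):

```python
import torch

# fusion_model with .rgb_features / .depth_features submodules is an illustrative name;
# the patent only specifies that the copied feature extractors are frozen.
for branch in (fusion_model.rgb_features, fusion_model.depth_features):
    for p in branch.parameters():
        p.requires_grad = False                     # freeze the copied extractor weights

optimizer = torch.optim.SGD(
    (p for p in fusion_model.parameters() if p.requires_grad),   # fusion part and heads only
    lr=0.001, momentum=0.9)                         # base learning rate ~0.001, as stated above
# Train for on the order of 40k+ iterations, updating only the fusion layers and heads.
```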
The technical scheme of the invention combines RGB and depth features to improve detection performance. Before the features are combined, the RGB and depth images are converted into generic feature representations by the two feature extraction parts of the fusion network; each feature extraction part consists of several convolution-pooling layers, one for RGB and one for depth images, and their weights are obtained by initializing and training the two single-modality models, the RGB detection network model and the depth map detection network model. The two single-modality networks are individually fine-tuned before the fusion network is trained and use the same architecture. This embodiment can therefore obtain a generic feature representation of depth images without a large depth-annotated dataset. In addition, during training of the fusion network, the input data of the two modalities must have the same dimensions, and the data augmentation operations applied to the two input images (RGB and depth) must also be identical.
The above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and those skilled in the art can make various corresponding changes and modifications according to the present invention without departing from the spirit and the essence of the present invention, but these corresponding changes and modifications should fall within the protection scope of the appended claims.

Claims (6)

1. An object detection method based on cross-modal multi-scale feature fusion, characterized in that the method comprises the following steps:
training a pre-training model with a first dataset, wherein the first dataset comprises RGB images in which object classes are labeled, and initializing a single-modality RGB detection network model from the pre-training model;
training the RGB detection network model with the RGB images in a second dataset, wherein the second dataset comprises RGB images in which object classes and positions are labeled, together with corresponding depth images;
initializing a single-modality depth map detection network model from the trained RGB detection network model;
training the depth map detection network model with the depth images in the second dataset;
initializing a fusion network model and performing multi-scale feature fusion based on the trained RGB detection network model and the depth map detection network model;
training the fusion network model with paired RGB and depth images labeled with object classes and positions;
and detecting objects in a color-depth image with the trained fusion network model.
2. The method for object detection based on cross-modal multi-scale feature fusion according to claim 1, wherein initializing a single-modal depth map detection network model based on the trained RGB detection network model comprises:
copying the network parameters of the RGB detection network model as the network parameters of the depth map detection network model.
3. The method for detecting an object based on cross-modal multi-scale feature fusion of claim 1, wherein initializing a fusion network model and performing multi-scale feature fusion based on the trained RGB detection network model and the depth map detection network model comprises:
copying network parameters of the RGB detection network model and the depth map detection network model as weights of two feature extraction parts in the fusion network model;
the multi-scale features extracted by the two feature extraction parts are combined by a plurality of fusion layers.
4. The method for object detection based on cross-modal multi-scale feature fusion of claim 3, wherein the fusion network model adopts Multibox Loss as a Loss function during training.
5. The method for object detection based on cross-modal multi-scale feature fusion of claim 1, wherein training the RGB detection network model, training the depth map detection network model, and training the fusion network model each further comprise:
performing data augmentation on the input data.
6. The method for detecting an object based on cross-modal multi-scale feature fusion according to claim 3, wherein when training the fusion network model, the method further comprises:
the weights of the feature extraction sections are frozen.
CN201810474925.8A 2018-05-17 2018-05-17 An object detection method based on cross-modal multi-scale feature fusion Active CN108734210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810474925.8A CN108734210B (en) 2018-05-17 2018-05-17 An object detection method based on cross-modal multi-scale feature fusion

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810474925.8A CN108734210B (en) 2018-05-17 2018-05-17 An object detection method based on cross-modal multi-scale feature fusion

Publications (2)

Publication Number Publication Date
CN108734210A CN108734210A (en) 2018-11-02
CN108734210B true CN108734210B (en) 2021-10-15

Family

ID=63938564

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810474925.8A Active CN108734210B (en) 2018-05-17 2018-05-17 An object detection method based on cross-modal multi-scale feature fusion

Country Status (1)

Country Link
CN (1) CN108734210B (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110070072A (en) * 2019-05-05 2019-07-30 厦门美图之家科技有限公司 A method of generating object detection model
CN110334708A (en) * 2019-07-03 2019-10-15 中国科学院自动化研究所 Method, system and device for automatic calibration of differences in cross-modal target detection
CN110334769A (en) * 2019-07-09 2019-10-15 北京华捷艾米科技有限公司 Target identification method and device
CN110852350B (en) * 2019-10-21 2022-09-09 北京航空航天大学 Pulmonary nodule benign and malignant classification method and system based on multi-scale migration learning
CN110956094B (en) * 2019-11-09 2023-12-01 北京工业大学 RGB-D multi-mode fusion personnel detection method based on asymmetric double-flow network
CN113033258B (en) * 2019-12-24 2024-05-28 百度国际科技(深圳)有限公司 Image feature extraction method, device, equipment and storage medium
CN111242238B (en) * 2020-01-21 2023-12-26 北京交通大学 RGB-D image saliency target acquisition method
CN111540343B (en) * 2020-03-17 2021-02-05 北京捷通华声科技股份有限公司 Corpus identification method and apparatus
CN111723649B (en) * 2020-05-08 2022-08-12 天津大学 A short video event detection method based on semantic decomposition
CN112183619A (en) * 2020-09-27 2021-01-05 南京三眼精灵信息技术有限公司 Digital model fusion method and device
CN113077491B (en) * 2021-04-02 2023-05-02 安徽大学 RGBT target tracking method based on cross-modal sharing and specific representation form
CN114519377A (en) * 2021-12-14 2022-05-20 中煤科工集团信息技术有限公司 Cross-modal coal gangue sorting method and device
CN114581838B (en) * 2022-04-26 2022-08-26 阿里巴巴达摩院(杭州)科技有限公司 Image processing method, device and cloud device
CN115965846B (en) * 2023-01-18 2025-04-01 重庆邮电大学 Curtain wall frame real-time detection method and device based on frame-aware cross-modal fusion network

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102800079B (en) * 2012-08-03 2015-01-28 西安电子科技大学 Multimode image fusion method based on SCDPT transformation and amplitude-phase combination thereof
EP2910187B1 (en) * 2014-02-24 2018-04-11 Université de Strasbourg (Etablissement Public National à Caractère Scientifique, Culturel et Professionnel) Automatic multimodal real-time tracking of a moving marker for image plane alignment inside a MRI scanner
CN106981059A (en) * 2017-03-30 2017-07-25 中国矿业大学 With reference to PCNN and the two-dimensional empirical mode decomposition image interfusion method of compressed sensing
CN107066583B (en) * 2017-04-14 2018-05-25 华侨大学 A kind of picture and text cross-module state sensibility classification method based on the fusion of compact bilinearity
CN107463952B (en) * 2017-07-21 2020-04-03 清华大学 An object material classification method based on multimodal fusion deep learning
CN107403201A (en) * 2017-08-11 2017-11-28 强深智能医疗科技(昆山)有限公司 Tumour radiotherapy target area and jeopardize that organ is intelligent, automation delineation method

Also Published As

Publication number Publication date
CN108734210A (en) 2018-11-02

Similar Documents

Publication Publication Date Title
CN108734210B (en) An object detection method based on cross-modal multi-scale feature fusion
CN113056743B (en) Training a neural network for vehicle re-identification
CN111797893B (en) Neural network training method, image classification system and related equipment
CN107424159B (en) Image semantic segmentation method based on super-pixel edge and full convolution network
CN108288088B (en) A scene text detection method based on end-to-end fully convolutional neural network
Liu et al. 3D Point cloud analysis
CN111027576B (en) Co-saliency detection method based on co-saliency generative adversarial network
CN110929080B (en) An Optical Remote Sensing Image Retrieval Method Based on Attention and Generative Adversarial Networks
JP2017062781A (en) Similarity-based detection of prominent objects using deep cnn pooling layers as features
CN106815323B (en) Cross-domain visual retrieval method based on significance detection
CN110555420B (en) Fusion model network and method based on pedestrian regional feature extraction and re-identification
Yu et al. A content-adaptively sparse reconstruction method for abnormal events detection with low-rank property
CN116563840B (en) Scene text detection and recognition method based on weak supervision cross-mode contrast learning
CN114359709A (en) Target detection method and device for remote sensing image
Guo et al. UDTIRI: An online open-source intelligent road inspection benchmark suite
CN117475228A (en) A 3D point cloud classification and segmentation method based on dual-domain feature learning
CN114972947B (en) Depth scene text detection method and device based on fuzzy semantic modeling
Qi et al. Unstructured road detection via combining the model‐based and feature‐based methods
CN110852102B (en) Chinese part-of-speech tagging method and device, storage medium and electronic equipment
CN118351144A (en) Target tracking method, target tracking system, storage medium and electronic equipment
Geng et al. SANet: A novel segmented attention mechanism and multi-level information fusion network for 6D object pose estimation
CN117197577A (en) Adversarial training method for target detection model based on contrastive learning
CN114913330B (en) Point cloud component segmentation method and device, electronic equipment and storage medium
Yang et al. UAV Landmark Detection Based on Convolutional Neural Network
Farsi et al. Improving Deep Learning-based Saliency Detection Using Channel Attention Module

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant