
CN110298262B - Object identification method and device - Google Patents

Object identification method and device

Info

Publication number
CN110298262B
CN110298262B (application number CN201910493331.6A)
Authority
CN
China
Prior art keywords
task
box
network
header
area
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910493331.6A
Other languages
Chinese (zh)
Other versions
CN110298262A (en)
Inventor
江立辉
屈展
张维
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to CN201910493331.6A (CN110298262B)
Application filed by Huawei Technologies Co Ltd
Priority to CN202410168653.4A (CN118196828A)
Publication of CN110298262A
Priority to EP20817904.4A (EP3916628B1)
Priority to EP24204585.4A (EP4542444A1)
Priority to JP2021538658A (JP7289918B2)
Priority to PCT/CN2020/094803 (WO2020244653A1)
Priority to US17/542,497 (US20220165045A1)
Application granted
Publication of CN110298262B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/235 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on user input or interaction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects
    • G06V20/647 Three-dimensional objects by matching two-dimensional images to three-dimensional objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451 Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454 Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/758 Involving statistics of pixels or of feature values, e.g. histogram matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/24 Character recognition characterised by the processing or recognition method
    • G06V30/248 Character recognition characterised by the processing or recognition method involving plural approaches, e.g. verification by template match; Resolving confusion among similar patterns, e.g. "O" versus "Q"
    • G06V30/2504 Coarse or fine approaches, e.g. resolution of ambiguities or multiscale approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the field of artificial intelligence and discloses a perception network comprising a backbone network and a plurality of parallel Headers connected to the backbone network. The backbone network is configured to receive an input picture, perform convolution processing on it, and output feature maps of different resolutions corresponding to the picture. Each of the parallel Headers is configured to detect the task objects of one task according to the feature maps output by the backbone network, and to output the 2D boxes of the areas where the task objects are located together with a confidence corresponding to each 2D box; different parallel Headers detect the task objects of different tasks. A task object is an object to be detected in the corresponding task, and the higher the confidence, the greater the probability that an object of the corresponding task exists in the 2D box associated with that confidence.

Description

Object Recognition Method and Device

Technical Field

The present application relates to the field of artificial intelligence, and in particular to an object recognition method and device.

Background Art

Computer vision is an integral part of intelligent/autonomous systems across application fields such as manufacturing, inspection, document analysis, medical diagnosis, and the military. It is the study of how to use cameras and computers to obtain the data and information we need about a photographed subject. Figuratively speaking, it equips a computer with eyes (cameras or video cameras) and a brain (algorithms) so that it can identify, track, and measure targets in place of human eyes, thereby enabling the computer to perceive its environment. Because perception can be viewed as extracting information from sensory signals, computer vision can also be viewed as the science of how to make artificial systems "perceive" from images or multi-dimensional data. In general, computer vision uses various imaging systems in place of the visual organs to obtain input information, and then uses the computer in place of the brain to process and interpret that information. The ultimate research goal of computer vision is to enable computers to observe and understand the world through vision as humans do, and to adapt to the environment autonomously.

Visual perception networks can now perform more and more functions, including image classification, 2D detection, semantic segmentation (Mask), keypoint detection, linear-object detection (such as lane-line or stop-line detection in autonomous driving), and drivable-area detection. Visual perception systems also have the advantages of low cost, contactless operation, small size, and high information content. As the accuracy of visual perception algorithms continues to improve, they have become a key technology in many modern artificial intelligence systems and are increasingly widely used, for example: recognizing dynamic obstacles (people or vehicles) and static objects (traffic lights, traffic signs, or traffic cones) on the road in Advanced Driving Assistant Systems (ADAS) and Autonomous Driving Systems (ADS), and producing slimming effects in the camera beautification function of terminals by recognizing the Mask and keypoints of the human body.

Most current mainstream visual perception networks focus on a single detection task, such as 2D detection, 3D detection, semantic segmentation, or keypoint detection. If multiple functions are to be implemented, multiple networks are usually required. Running multiple networks simultaneously significantly increases the computation and power consumption of the hardware, reduces the running speed of the model, and makes real-time detection difficult.

Summary of the Invention

To reduce the computation and power consumption of the hardware and to increase the running speed of the perception network model, embodiments of the present invention provide a perception network based on multiple heads (Headers). The perception network includes a backbone network and multiple parallel Headers, and the multiple parallel Headers are connected to the backbone network.

The backbone network is configured to receive an input picture, perform convolution processing on the input picture, and output feature maps of different resolutions corresponding to the picture.

Each parallel Header is configured to detect the task objects of one task according to the feature maps output by the backbone network, and to output the 2D boxes of the areas where the task objects are located together with a confidence corresponding to each 2D box. Different parallel Headers detect the task objects of different tasks, where a task object is an object that needs to be detected in that task; the higher the confidence, the greater the probability that an object of the corresponding task exists in the 2D box associated with that confidence. This description applies to any one of the multiple parallel Headers, and each parallel Header functions in a similar way.
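As a concrete illustration of this structure, the following is a minimal PyTorch-style sketch of a backbone shared by several parallel Headers. The layer sizes, the number of Headers, and the per-task class counts are assumptions for illustration, not the implementation described in this application.

```python
import torch
import torch.nn as nn

class TinyBackbone(nn.Module):
    """Stand-in backbone: outputs feature maps at two resolutions."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Conv2d(3, 16, 3, stride=2, padding=1)   # 1/2 resolution
        self.stage2 = nn.Conv2d(16, 32, 3, stride=2, padding=1)  # 1/4 resolution

    def forward(self, img):
        f1 = torch.relu(self.stage1(img))
        f2 = torch.relu(self.stage2(f1))
        return [f1, f2]  # feature maps of different resolutions

class ParallelHeader(nn.Module):
    """One task-specific Header: 2D boxes plus per-class confidences."""
    def __init__(self, in_ch, num_classes):
        super().__init__()
        self.box_head = nn.Conv2d(in_ch, 4, 1)            # (x, y, w, h) per location
        self.cls_head = nn.Conv2d(in_ch, num_classes, 1)  # per-class confidence

    def forward(self, feat):
        return self.box_head(feat), self.cls_head(feat).sigmoid()

backbone = TinyBackbone()
# e.g. one Header each for vehicles, static obstacles, traffic signs
headers = nn.ModuleList(ParallelHeader(32, c) for c in (6, 5, 8))

img = torch.randn(1, 3, 64, 64)
feats = backbone(img)                      # computed once, shared by every Header
outputs = [h(feats[-1]) for h in headers]  # each Header detects its own task
```

Because every Header consumes the same backbone features, extending the network to a new detection task only requires appending one more Header to the list.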

Optionally, each parallel Header includes a region proposal network (RPN) module, a region-of-interest extraction (ROI-ALIGN) module, and a region convolutional neural network (RCNN) module. The RPN module of one parallel Header is independent of the RPN modules of the other parallel Headers; the ROI-ALIGN module of one parallel Header is independent of the ROI-ALIGN modules of the other parallel Headers; and the RCNN module of one parallel Header is independent of the RCNN modules of the other parallel Headers. For each parallel Header:

The RPN module is configured to predict, on one or more of the feature maps provided by the backbone network, the areas where the task objects are located, and to output candidate 2D boxes matching those areas.

The ROI-ALIGN module is configured to extract, according to the areas predicted by the RPN module, the features of the area where each candidate 2D box is located from one of the feature maps provided by the backbone network.

The RCNN module is configured to perform convolution processing, through a neural network, on the features of the area where a candidate 2D box is located, to obtain the confidence that the candidate 2D box belongs to each object category, where the object categories are those of the task corresponding to this parallel Header. It also adjusts the coordinates of the candidate 2D box through the neural network so that the adjusted 2D candidate box matches the shape of the actual object more closely than the original candidate 2D box, and selects adjusted 2D candidate boxes whose confidence is greater than a preset threshold as the 2D boxes of the areas.
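A hedged sketch of this RPN, ROI-ALIGN, RCNN flow inside one parallel Header follows, using torchvision's roi_align for the region-feature extraction step. Proposal generation is stubbed out with a fixed candidate box, and the layer widths and the 0.5 confidence threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class TinyRCNN(nn.Module):
    """RCNN stage: classifies a region and refines its box coordinates."""
    def __init__(self, in_ch, num_classes, pool=7):
        super().__init__()
        self.fc = nn.Sequential(nn.Flatten(),
                                nn.Linear(in_ch * pool * pool, 128), nn.ReLU())
        self.cls = nn.Linear(128, num_classes)  # confidence per object category
        self.reg = nn.Linear(128, 4)            # coordinate adjustment (deltas)

    def forward(self, region_feats):
        h = self.fc(region_feats)
        return self.cls(h).softmax(-1), self.reg(h)

feat = torch.randn(1, 32, 32, 32)  # one feature map from the backbone
# a candidate 2D box such as the RPN would propose: (batch_idx, x1, y1, x2, y2)
proposals = torch.tensor([[0., 4., 4., 20., 20.]])
region = roi_align(feat, proposals, output_size=7, spatial_scale=0.5)
scores, deltas = TinyRCNN(32, num_classes=4)(region)
keep = scores.max(-1).values > 0.5  # keep boxes above a preset threshold
```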

Optionally, the 2D box is a rectangular box.

Optionally, in another aspect of the embodiments of this application, the RPN module is configured to: based on the template boxes (Anchors) of the objects of its task, predict the areas where the task objects exist on one or more feature maps provided by the backbone network to obtain candidate areas, and output candidate 2D boxes matching the candidate areas. The template boxes are obtained from the statistical features of the task objects they belong to, and the statistical features include the shapes and sizes of the objects.
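The "template box from statistical features" idea can be sketched as follows: per-class mean width and height computed over a hypothetical set of labeled boxes stand in for the statistical features. Production systems often derive Anchors by k-means clustering instead; both the data and the averaging choice here are illustrative assumptions.

```python
import numpy as np

def anchors_from_stats(boxes_per_class):
    """boxes_per_class: {class_name: (N, 2) array of ground-truth (w, h)}."""
    return {cls: sizes.mean(axis=0) for cls, sizes in boxes_per_class.items()}

stats = {"car": np.array([[90., 40.], [110., 50.]]),
         "pedestrian": np.array([[20., 60.], [25., 70.]])}
print(anchors_from_stats(stats))  # one mean (w, h) template box per class
```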

Optionally, in another aspect of the embodiments of this application, the perception network further includes one or more serial Headers, and a serial Header is connected to one of the parallel Headers.

The serial Header is configured to: using the 2D box of a task object provided by the parallel Header it is connected to, extract the features of the area where the 2D box is located from one or more feature maps of the backbone network, and predict the 3D information, Mask information, or Keypoint information of the task object according to the features of the area where the 2D box is located.
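The following sketch shows one possible serial Header for the 3D task: it crops the region of the 2D box supplied by the connected parallel Header from a backbone feature map and regresses extra quantities (here 3D dimensions and an orientation angle; Mask and Keypoint heads would be analogous). All sizes and the choice of outputs are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import roi_align

class SerialHeader3D(nn.Module):
    """Predicts 3D attributes from the features inside a given 2D box."""
    def __init__(self, in_ch, pool=7):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(),
                                 nn.Linear(in_ch * pool * pool, 64), nn.ReLU())
        self.dims = nn.Linear(64, 3)  # 3D length / width / height
        self.yaw = nn.Linear(64, 1)   # orientation angle

    def forward(self, feat, boxes2d):
        region = roi_align(feat, boxes2d, output_size=7)  # 2D-box region features
        h = self.net(region)
        return self.dims(h), self.yaw(h)

feat = torch.randn(1, 32, 32, 32)
boxes2d = torch.tensor([[0., 4., 4., 20., 20.]])  # from the parallel Header
dims, yaw = SerialHeader3D(32)(feat, boxes2d)
```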

Optionally, the RPN module predicts the areas where objects of different sizes are located on feature maps of different resolutions.

Optionally, the RPN module detects the areas where large objects are located on low-resolution feature maps, and detects the areas where small objects are located on high-resolution feature maps.
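One common heuristic for this size-to-resolution routing, borrowed from the feature pyramid network (FPN) literature as an illustrative assumption rather than taken from this application, maps an object of size (w, h) to a pyramid level:

```python
import math

def fpn_level(w, h, k0=4, canonical=224, k_min=2, k_max=5):
    """Route an object of size (w, h) to a feature-pyramid level:
    k = k0 + log2(sqrt(w*h) / canonical), clamped to [k_min, k_max]."""
    k = k0 + math.log2(math.sqrt(w * h) / canonical)
    return max(k_min, min(k_max, round(k)))

print(fpn_level(400, 300))  # large object  -> level 5 (low resolution)
print(fpn_level(30, 40))    # small object  -> level 2 (high resolution)
```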

In another aspect, embodiments of the present invention further provide an object detection method, and the method includes:

receiving an input picture;

performing convolution processing on the input picture, and outputting feature maps of different resolutions corresponding to the picture;

according to the feature maps, independently detecting the task objects of each task for the different tasks, and outputting the 2D boxes of the areas where the task objects are located together with a confidence corresponding to each 2D box, where a task object is an object that needs to be detected in that task; the higher the confidence, the greater the probability that an object of the corresponding task exists in the 2D box associated with that confidence.

Optionally, the independently detecting of the task objects of each task for the different tasks according to the feature maps, and the outputting of the 2D boxes of the areas where the task objects are located together with a confidence corresponding to each 2D box, include:

predicting the areas where the task objects are located on one or more feature maps, and outputting candidate 2D boxes matching those areas;

extracting, according to the areas where the task objects are located, the features of the area where each candidate 2D box is located from one of the feature maps;

performing convolution processing on the features of the area where the candidate 2D box is located to obtain the confidence that the candidate 2D box belongs to each object category, where the object categories are the object categories of the one task;

adjusting the coordinates of the candidate 2D box through a neural network so that the adjusted 2D candidate box matches the shape of the actual object more closely than the original candidate 2D box, and selecting adjusted 2D candidate boxes whose confidence is greater than a preset threshold as the 2D boxes of the areas.

Optionally, the 2D box is a rectangular box.

Optionally, predicting the areas where the task objects are located on one or more feature maps and outputting candidate 2D boxes matching those areas includes:

based on the template boxes (Anchors) of the objects of the task, predicting the areas where the task objects exist on one or more feature maps provided by the backbone network to obtain candidate areas, and outputting candidate 2D boxes matching the candidate areas, where the template boxes are obtained from the statistical features of the task objects they belong to, and the statistical features include the shapes and sizes of the objects.

Optionally, the method further includes:

based on the 2D box of a task object of the task, extracting the features of the area where the 2D box is located from one or more feature maps of the backbone network, and predicting the 3D information, Mask information, or Keypoint information of the task object according to the features of the area where the 2D box is located.

Optionally, the detection of the areas where large objects are located is completed on low-resolution feature maps, and the detection of the areas where small objects are located is completed on high-resolution feature maps.

In another aspect, embodiments of the present application provide a method for training a multi-task perception network based on partially annotated data, where the perception network includes a backbone network and multiple parallel heads (Headers), and the method includes:

determining, according to the annotated data type(s) of each picture, the task to which the picture belongs, where each picture is annotated with one or more data types, the one or more data types are a subset of all data types, and one data type corresponds to one task;

determining, according to the task to which each picture belongs, the Header(s) to be trained for the picture;

calculating the loss values of the Header(s) to be trained for each picture;

for each picture, performing gradient backpropagation through the Header(s) to be trained, and adjusting the parameters of the Header(s) to be trained and of the backbone network based on the loss values.
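Putting the four steps together, a hedged sketch of one training step follows: only the Headers whose task is annotated on the current picture contribute to the loss, so gradient backpropagation updates exactly those Headers plus the shared backbone. The data format, the stand-in losses, and the module interfaces are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def train_step(backbone, headers, optimizer, image, labels_by_task):
    """headers: dict {task_name: Header module}; labels_by_task holds entries
    only for the tasks this picture is annotated with."""
    feats = backbone(image)
    losses = []
    for task, header in headers.items():
        if task not in labels_by_task:   # picture not labeled for this task:
            continue                     # skip its Header entirely
        boxes, scores = header(feats[-1])
        target_boxes, target_scores = labels_by_task[task]
        losses.append(F.l1_loss(boxes, target_boxes) +
                      F.binary_cross_entropy(scores, target_scores))
    if not losses:
        return 0.0
    loss = torch.stack(losses).sum()
    optimizer.zero_grad()
    loss.backward()   # gradients reach only the trained Headers + backbone
    optimizer.step()
    return loss.item()
```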

Optionally, data balancing is performed on the pictures belonging to different tasks.

Embodiments of the present invention further provide a device for training a multi-task perception network based on partially annotated data, where the perception network includes a backbone network and multiple parallel heads (Headers), and the device includes:

a task determination module, configured to determine, according to the annotated data type(s) of each picture, the task to which the picture belongs, where each picture is annotated with one or more data types, the one or more data types are a subset of all data types, and one data type corresponds to one task;

a Header determination module, configured to determine, according to the task to which each picture belongs, the Header(s) to be trained for the picture;

a loss value calculation module, configured to calculate, for each picture, the loss values of the Header(s) determined by the Header determination module;

an adjustment module, configured to perform, for each picture, gradient backpropagation through the Header(s) determined by the Header determination module, and to adjust the parameters of the Header(s) to be trained and of the backbone network based on the loss values obtained by the loss value calculation module.

Optionally, the device further includes a data balancing module, configured to perform data balancing on the pictures belonging to different tasks.

Embodiments of the present invention further provide a perception network application system, which includes at least one processor, at least one memory, at least one communication interface, and at least one display device. The processor, the memory, the display device, and the communication interface are connected through a communication bus and communicate with each other.

The communication interface is configured to communicate with other devices or communication networks.

The memory is configured to store the application program code that executes the above solutions, and the execution is controlled by the processor. The processor is configured to execute the application program code stored in the memory.

The code stored in the memory 2002 may execute the Multi-Header-based object perception method provided above, or the method for training the perception network provided in the above embodiments.

The display device is configured to display the image to be recognized and information such as the 2D, 3D, Mask, and keypoint information of the objects of interest in the image.

In the perception network provided by the embodiments of this application, all perception tasks share the same backbone network, which saves computation severalfold and increases the running speed of the perception network model. The network structure is also easy to extend: the 2D detection types can be extended simply by adding one or more Headers. Each parallel Header has independent RPN and RCNN modules and only needs to detect the objects of its own task, so that during training, objects of other tasks that are left unlabeled are not mistakenly penalized.

These and other aspects of the application will be more clearly understood from the following description of the embodiments.

Brief Description of the Drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the following briefly introduces the accompanying drawings required for describing the embodiments or the prior art. Evidently, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from these accompanying drawings without creative effort.

Figure 1 is a schematic structural diagram of the system architecture provided by an embodiment of the present application;

Figure 2 is a schematic diagram of the CNN feature extraction model provided by an embodiment of the present application;

Figure 3 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application;

Figure 4 is a schematic framework diagram of a perception network application system based on multiple parallel Headers provided by an embodiment of the present application;

Figure 5 is a schematic structural diagram of the perception network based on multiple parallel Headers provided by an embodiment of the present application;

Figure 6 is a schematic structural diagram of an ADAS/AD perception system based on multiple parallel Headers provided by an embodiment of the present application;

Figure 7 is a schematic flowchart of basic feature generation provided by an embodiment of the present application;

Figure 8 is a schematic structural diagram of an RPN layer provided by an embodiment of the present application;

Figure 9 is a schematic diagram of the Anchors corresponding to objects at an RPN layer provided by an embodiment of the present application;

Figure 10 is a schematic diagram of the ROI-ALIGN process provided by an embodiment of the present application;

Figure 11 is a schematic diagram of the implementation and structure of an RCNN provided by an embodiment of the present application;

Figure 12 is a schematic diagram of the implementation and structure of a serial Header provided by an embodiment of the present application;

Figure 13 is a schematic diagram of the implementation and structure of another serial Header provided by an embodiment of the present application;

Figure 14 is a schematic diagram of the implementation and structure of another serial Header provided by an embodiment of the present application;

Figure 15 is a schematic diagram of a training method for partially annotated data provided by an embodiment of the present application;

Figure 16 is a schematic diagram of another training method for partially annotated data provided by an embodiment of the present application;

Figure 17 is a schematic diagram of another training method for partially annotated data provided by an embodiment of the present application;

Figure 18 is a schematic diagram of another training method for partially annotated data provided by an embodiment of the present application;

Figure 19 is a schematic application diagram of the perception network based on multiple parallel Headers provided by an embodiment of the present application;

Figure 20 is another schematic application diagram of the perception network based on multiple parallel Headers provided by an embodiment of the present application;

Figure 21 is a schematic flowchart of a perception method provided by an embodiment of the present application;

Figure 22 is a schematic diagram of a 2D detection process provided by an embodiment of the present application;

Figure 23 is a schematic diagram of a 3D detection process of a terminal device provided by an embodiment of the present application;

Figure 24 is a schematic diagram of a Mask prediction process provided by an embodiment of the present application;

Figure 25 is a schematic flowchart of keypoint coordinate prediction provided by an embodiment of the present application;

Figure 26 is a schematic diagram of the training process of a perception network provided by an embodiment of the present application;

Figure 27 is a schematic structural diagram of an implementation of the perception network based on multiple parallel Headers provided by an embodiment of the present application;

Figure 28 is another schematic structural diagram of an implementation of the perception network based on multiple parallel Headers provided by an embodiment of the present application;

Figure 29 is a diagram of a device for training a multi-task perception network based on partially annotated data provided by an embodiment of the present application;

Figure 30 is a schematic flowchart of an object detection method provided by an embodiment of the present application;

Figure 31 is a flowchart of training a multi-task perception network based on partially annotated data provided by an embodiment of the present application.

Detailed Description of Embodiments

First, the abbreviations used in the embodiments of this application are listed as follows:

Table 1

It should be noted that, to be more consistent with the terminology used in the industry, some of the accompanying drawings of the embodiments of the present invention use English descriptions, and the corresponding Chinese definitions are also given in the embodiments. The embodiments of the present application are described below with reference to the accompanying drawings.

The embodiments of this application are mainly applied in fields that need to complete multiple perception tasks, such as driving assistance, autonomous driving, and mobile phone terminals. The application system framework of the present invention is shown in Figure 4: frames are extracted from a video to obtain single pictures, each picture is fed into the Multi-Header perception network of the present invention, and the 2D, 3D, Mask, keypoint, and other information of the objects of interest in the picture is obtained. These detection results are output to a post-processing module; for example, they are sent to the planning control unit of an autonomous driving system for decision-making, or to the beautification algorithm of a mobile phone terminal to produce a beautified picture. The two application scenarios of the ADAS/ADS visual perception system and mobile phone beautification are briefly introduced below.

Application Scenario 1: ADAS/ADS visual perception system

As shown in Figure 19, ADAS and ADS need to perform multiple types of 2D object detection in real time, including detection of dynamic obstacles (Pedestrian, Cyclist, Tricycle, Car, Truck, Bus), static obstacles (TrafficCone, TrafficStick, FireHydrant, Motocycle, Bicycle), and traffic signs (TrafficSign, GuideSign, Billboard, TrafficLight_Red/TrafficLight_Yellow/TrafficLight_Green/TrafficLight_Black, RoadSign). In addition, to accurately obtain the region a dynamic obstacle occupies in 3D space, 3D estimation must be performed on the dynamic obstacle to output a 3D box. To fuse with lidar data, the Mask of the dynamic obstacle must be obtained so that the laser points hitting the dynamic obstacle can be filtered out of the point cloud. For accurate parking, the 4 keypoints of the parking space must be detected simultaneously; for composition-based positioning, the keypoints of static targets must be detected. Using the technical solutions provided by the embodiments of this application, all of the above functions can be completed in a single perception network.

Application Scenario 2: Mobile phone beautification function

As shown in Figure 20, in a mobile phone, after the Mask and keypoints of the human body are detected by the perception network provided by the embodiments of this application, the corresponding parts of the human body can be enlarged or reduced, for example to slim the waist or shape the hips, so as to output a beautified picture.

Application Scenario 3: Image classification

After obtaining an image to be classified, the object recognition device uses the object recognition method of this application to obtain the categories of the objects in the image, and the image can then be classified according to those categories. Photographers take many photos every day, of animals, of people, and of plants. With the method of this application, photos can be quickly classified according to their content, for example into photos containing animals, photos containing people, and photos containing plants.

When the number of images is very large, manual classification is inefficient, and people easily become fatigued when doing the same task for a long time, in which case the classification results contain large errors. With the method of this application, images can be classified quickly and without such errors.

Application Scenario 4: Commodity classification

After the object recognition device obtains an image of a commodity, it uses the object recognition method of this application to obtain the category of the commodity in the image, and then classifies the commodity according to its category. For the wide variety of commodities in large shopping malls or supermarkets, the object recognition method of this application can complete commodity classification quickly, reducing time overhead and labor costs.

The methods and devices provided by the embodiments of the present application can also be used to expand a training database. As shown in Figure 1, the I/O interface 112 of the execution device 120 can send an image processed by the execution device (such as an image block or image containing an object) together with the object category input by the user to the database 130 as a training data pair, so that the training data maintained by the database 130 is richer, thereby providing richer training data for the training work of the training device 130.

The method provided by this application is described below from the model training side and the model application side.

The method for training a CNN feature extraction model provided by the embodiments of this application involves computer vision processing and may be specifically applied to data processing methods such as data training, machine learning, and deep learning. It performs symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on training data (such as the images or image blocks of objects and the object categories in this application), and finally obtains a trained CNN feature extraction model. Furthermore, in the embodiments of this application, input data (such as the image of an object in this application) is fed into the trained CNN feature extraction model to obtain output data (such as the 2D, 3D, Mask, keypoint, and other information of the objects of interest in the picture in this application).

Since the embodiments of this application involve extensive application of neural networks, for ease of understanding, the relevant terms and concepts such as neural networks involved in the embodiments of this application are first introduced below.

(1) Object recognition: using image processing, machine learning, computer graphics, and other related methods to determine the category of an object in an image.

(2) Neural network

A neural network may be composed of neural units. A neural unit may be an operation unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the operation unit may be:

$$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which introduces nonlinearity into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network is a network formed by joining many such single neural units together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field may be a region composed of several neural units.

(3) Deep neural network

A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with many hidden layers; there is no particular threshold for "many" here. Dividing a DNN by the positions of its layers, the neural network inside the DNN can be divided into three categories: input layer, hidden layers, and output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all layers in between are hidden layers. The layers are fully connected, that is, any neuron in the $i$-th layer is connected to every neuron in the $(i+1)$-th layer. Although a DNN looks complicated, the work of each layer is actually simple; it is the following linear relationship expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called the coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply applies this operation to its input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has many layers, there are also many coefficients $W$ and offset vectors $\vec{b}$. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example. Suppose that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^3_{24}$: the superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer. In summary, the coefficient from the $k$-th neuron of the $(L-1)$-th layer to the $j$-th neuron of the $L$-th layer is defined as $W^L_{jk}$. Note that the input layer has no $W$ parameters. In a deep neural network, more hidden layers allow the network to better model complex real-world situations. In theory, a model with more parameters has higher complexity and larger "capacity", which means it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrices; its final goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors $W$ of many layers).

(4) Convolutional neural network

A convolutional neural network (CNN) is a deep neural network with a convolutional structure. A convolutional neural network contains a feature extractor composed of convolutional layers and subsampling layers. The feature extractor can be regarded as a filter, and the convolution process can be regarded as convolving a trainable filter with an input image or a convolutional feature map. A convolutional layer is a layer of neurons in a convolutional neural network that performs convolution processing on the input signal. In a convolutional layer, a neuron may be connected to only some of the neurons of the neighboring layer. A convolutional layer usually contains several feature planes, and each feature plane may be composed of rectangularly arranged neural units. Neural units in the same feature plane share weights, and the shared weights here are the convolution kernel. Weight sharing can be understood as extracting image information in a position-independent way. The underlying principle is that the statistics of one part of an image are the same as those of the other parts, which means that image information learned in one part can also be used in another part, so the same learned image information can be used for all positions on the image. In the same convolutional layer, multiple convolution kernels can be used to extract different image information; generally, the greater the number of convolution kernels, the richer the image information reflected by the convolution operation.

A convolution kernel can be initialized as a matrix of random size, and during the training of the convolutional neural network the kernel can learn reasonable weights. In addition, a direct benefit of weight sharing is reducing the connections between the layers of the convolutional neural network while also lowering the risk of overfitting.
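The effect of weight sharing can be seen directly in the short sketch below: a single set of 3x3 kernels is slid over the whole image, so the parameter count depends only on the kernel and channel sizes, not on the image size. All sizes are illustrative.

```python
import torch
import torch.nn as nn

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
img = torch.randn(1, 3, 32, 32)
fmap = conv(img)   # 8 feature planes with the same spatial size as the input
print(fmap.shape)  # torch.Size([1, 8, 32, 32])
# the shared kernel weights: 3*3*3*8 + 8 biases = 224 parameters,
# regardless of how large the input image is
print(sum(p.numel() for p in conv.parameters()))  # 224
```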

(5) Recurrent neural networks (RNN) are used to process sequence data. In the traditional neural network model, from the input layer to the hidden layer and then to the output layer, the layers are fully connected, while the nodes within each layer are unconnected. Although this ordinary neural network solves many difficult problems, it is still powerless for many others. For example, to predict the next word of a sentence, the preceding words are generally needed, because the words in a sentence are not independent of one another. An RNN is called a recurrent neural network because the current output for a sequence is also related to the previous outputs. Concretely, the network memorizes previous information and applies it to the computation of the current output; that is, the nodes within the hidden layer are no longer unconnected but connected, and the input of the hidden layer includes not only the output of the input layer but also the output of the hidden layer at the previous moment. In theory, an RNN can process sequence data of any length. Training an RNN is the same as training a traditional CNN or DNN: the error backpropagation algorithm is also used, but with one difference: if the RNN is unrolled, its parameters, such as W, are shared across the unrolled steps, which is not the case with the traditional neural networks exemplified above. Also, when using the gradient descent algorithm, the output of each step depends not only on the network of the current step but also on the network states of several previous steps. This learning algorithm is called backpropagation through time (BPTT).
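A minimal sketch of the recurrence: the hidden state at each step is computed from the current input and the previous hidden state, with one shared weight set across all time steps (the layer sizes are illustrative assumptions).

```python
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=4, hidden_size=8, batch_first=True)
seq = torch.randn(1, 5, 4)   # a sequence of 5 inputs
out, h_n = rnn(seq)          # out holds the hidden state at every step
print(out.shape, h_n.shape)  # torch.Size([1, 5, 8]) torch.Size([1, 1, 8])
```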

Why use recurrent neural networks when convolutional neural networks already exist? The reason is simple: a convolutional neural network assumes that elements are independent of one another, and that inputs and outputs are also independent, like cats and dogs. But in the real world, many elements are interconnected, such as stock prices changing over time, or a person saying: "I like traveling, and my favorite place is Yunnan; I will definitely go there when I have the chance in the future." Asked to fill in the blank here, humans all know the answer is "Yunnan", because humans infer from the context. But how can a machine do this? RNNs came into being for this purpose: they are intended to give machines the ability to remember as humans do. Therefore, the output of an RNN must depend on the current input information and the memorized historical information.

(6) Loss function

In the process of training a deep neural network, because we want the output of the deep neural network to be as close as possible to the value we actually want to predict, we can compare the current network's prediction with the desired target value and then update the weight vectors of each layer of the neural network according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer of the deep neural network). For example, if the network's prediction is too high, the weight vectors are adjusted to make it predict lower, and the adjustment continues until the deep neural network can predict the actually desired target value or a value very close to it. It is therefore necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the purpose of the loss function or objective function: important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so the training of the deep neural network becomes the process of reducing this loss as much as possible.
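Two common concrete loss functions that realize this "measure the difference" idea, with illustrative values:

```python
import torch
import torch.nn.functional as F

pred = torch.tensor([2.5, 0.0, 2.1])
target = torch.tensor([3.0, -0.5, 2.0])
print(F.mse_loss(pred, target))        # squared-error loss for regression

logits = torch.tensor([[1.2, 0.3, -0.8]])
label = torch.tensor([0])
print(F.cross_entropy(logits, label))  # classification loss; lower = closer
```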

(7) Backpropagation algorithm

A convolutional neural network may use the error backpropagation (back propagation, BP) algorithm to correct the parameter values of the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward propagation of the input signal through to the output produces an error loss, and the parameters of the initial super-resolution model are updated by propagating the error loss information backward, so that the error loss converges. The backpropagation algorithm is a backward pass dominated by the error loss and aims to obtain the optimal parameters of the super-resolution model, for example, the weight matrices.
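The following sketch shows the idea for a single weight w of the toy model y = w*x under the squared-error loss: the error is propagated backward to obtain dL/dw, and gradient descent moves w until the loss converges (all numbers are illustrative):

```python
x, t = 2.0, 10.0   # input and target
w, lr = 1.0, 0.05  # initial weight and learning rate

for step in range(50):
    y = w * x                 # forward pass
    grad = 2.0 * (y - t) * x  # backward pass: dL/dw for L = (y - t)**2
    w -= lr * grad            # gradient-descent update moves the loss downhill
print(round(w, 4))            # approaches t / x = 5.0
```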

The following describes the system architecture provided by the embodiments of this application.

Referring to FIG. 1, an embodiment of this application provides a system architecture 110. As shown in the system architecture 110, a data collection device 170 is configured to collect training data. In this embodiment of this application, the training data includes images or image blocks of objects and the categories of the objects; the training data is stored in a database 130, and a training device 130 obtains a CNN feature extraction model 101 through training based on the training data maintained in the database 130 (note: 101 here is the model obtained through the training stage described above, and may be a perception network used for feature extraction or the like). Embodiment 1 below describes in more detail how the training device 130 obtains the CNN feature extraction model 101 based on the training data. The CNN feature extraction model 101 can be used to implement the perception network provided in the embodiments of this application: after related preprocessing, an image or image block to be recognized is input into the CNN feature extraction model 101, and information such as the 2D boxes, 3D information, Masks, and key points of the objects of interest in the image or image block to be recognized can be obtained. The CNN feature extraction model 101 in this embodiment of this application may specifically be a CNN convolutional neural network. It should be noted that, in actual applications, the training data maintained in the database 130 is not necessarily all collected by the data collection device 170, and may also be received from other devices. It should further be noted that the training device 130 does not necessarily train the CNN feature extraction model 101 entirely based on the training data maintained in the database 130, and may also obtain training data from the cloud or elsewhere for model training. The foregoing description shall not be construed as a limitation on the embodiments of this application.

The CNN feature extraction model 101 obtained through training by the training device 130 may be applied to different systems or devices, for example, to the execution device 120 shown in FIG. 1. The execution device 120 may be a terminal, such as a mobile phone terminal, a tablet computer, a laptop computer, an AR/VR device, or a vehicle-mounted terminal, or may be a server, a cloud, or the like. In FIG. 1, the execution device 120 is provided with an I/O interface 112 for data interaction with external devices. A user may input data to the I/O interface 112 through a client device 150; in this embodiment of this application, the input data may include an image, an image block, or a picture to be recognized.

When the execution device 120 preprocesses the input data, or when the computing module 111 of the execution device 120 performs computation or other related processing (for example, implementing the functions of the perception network in this application), the execution device 120 may invoke the data, code, and the like in a data storage system 160 for the corresponding processing, and may also store the data, instructions, and the like obtained from the corresponding processing into the data storage system 160.

Finally, the I/O interface 112 returns the processing results, such as the 2D, 3D, Mask, and key point information of the objects of interest in the image, image block, or picture obtained above, to the client device 150, thereby providing them to the user.

Optionally, the client device 150 may be a planning control unit in an autonomous driving system or a beauty algorithm module in a mobile phone terminal.

It is worth mentioning that the training device 130 may generate, for different goals or different tasks, corresponding target models/rules 101 based on different training data, and the corresponding target models/rules 101 may then be used to achieve the foregoing goals or complete the foregoing tasks, thereby providing the user with the desired results.

In the case shown in FIG. 1, the user may manually specify the input data, and this manual specification may be performed through an interface provided by the I/O interface 112. In another case, the client device 150 may automatically send input data to the I/O interface 112; if the user's authorization is required for the client device 150 to automatically send the input data, the user may set the corresponding permission in the client device 150. The user may view, on the client device 150, the results output by the execution device 120, and the specific presentation form may be display, sound, action, or the like. The client device 150 may also serve as a data collection end that collects, as new sample data, the input data fed into the I/O interface 112 and the output results of the I/O interface 112 as shown in the figure, and stores them into the database 130. Certainly, the data may alternatively not be collected through the client device 150; instead, the I/O interface 112 directly stores, as new sample data, the input data fed into the I/O interface 112 and the output results of the I/O interface 112 as shown in the figure into the database 130.

It is worth noting that FIG. 1 is merely a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationships between the devices, components, modules, and the like shown in the figure do not constitute any limitation. For example, in FIG. 1, the data storage system 160 is an external memory relative to the execution device 120; in other cases, the data storage system 160 may instead be placed inside the execution device 120.

As shown in FIG. 1, the CNN feature extraction model 101 is obtained through training by the training device 130. In this embodiment of this application, the CNN feature extraction model 101 may be a CNN convolutional neural network, or may be the multi-Header-based perception network to be introduced in the following embodiments.

As described in the introduction to the basic concepts above, a convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture. A deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through machine learning algorithms. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which each neuron can respond to the image input into it.

As shown in FIG. 2, a convolutional neural network (CNN) 210 may include an input layer 220, a convolutional layer/pooling layer 230 (the pooling layer is optional), and a neural network layer 230.

Convolutional layer/pooling layer 230:

Convolutional layer:

As shown in FIG. 2, the convolutional layer/pooling layer 230 may include layers 221-226 as an example. In one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, layers 221 and 222 are convolutional layers, layer 223 is a pooling layer, layers 224 and 225 are convolutional layers, and layer 226 is a pooling layer. That is, the output of a convolutional layer may serve as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.

The following uses convolutional layer 221 as an example to introduce the internal working principle of a convolutional layer.

The convolutional layer 221 may include many convolution operators. A convolution operator, also called a kernel, acts in image processing as a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually slid across the input image in the horizontal direction pixel by pixel (or two pixels by two pixels, depending on the value of the stride), thereby extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends through the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolved output with a single depth dimension. In most cases, however, a single weight matrix is not used; instead, multiple weight matrices of the same size (rows x columns), that is, multiple matrices of the same shape, are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where this dimension can be understood as being determined by the "multiple" mentioned above. Different weight matrices can be used to extract different features from the image: for example, one weight matrix extracts image edge information, another extracts a specific color of the image, and yet another blurs unwanted noise in the image. The multiple weight matrices have the same size (rows x columns), so the feature maps extracted by them also have the same size, and the extracted feature maps of the same size are then combined to form the output of the convolution operation.
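The sliding-window computation described above can be sketched for one kernel over one input as follows (a naive loop for clarity; real implementations are heavily optimized, and the sizes are illustrative):

```python
import numpy as np

def conv2d_single(image, kernel, stride=1):
    # Valid convolution of one H*W*C input with one k*k*C kernel. The kernel
    # spans the full input depth, so each output position is a single number;
    # stacking the outputs of several kernels would form the depth dimension
    # of the next feature map.
    H, W, C = image.shape
    k = kernel.shape[0]
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+k, j*stride:j*stride+k, :]
            out[i, j] = np.sum(patch * kernel)
    return out

img = np.random.rand(8, 8, 3)
kern = np.random.rand(3, 3, 3)  # kernel depth matches the input depth
print(conv2d_single(img, kern).shape)  # (6, 6)
```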

In practical applications, the weight values in these weight matrices need to be obtained through extensive training. The weight matrices formed by the trained weight values can be used to extract information from the input image, thereby enabling the convolutional neural network 210 to make correct predictions.

When the convolutional neural network 210 has multiple convolutional layers, the initial convolutional layers (for example, 221) often extract more general features, which may also be called low-level features. As the depth of the convolutional neural network 210 increases, the features extracted by later convolutional layers (for example, 226) become increasingly complex, such as high-level semantic features; features with higher-level semantics are more applicable to the problem to be solved.

Pooling layer:

Since it is often necessary to reduce the number of training parameters, a pooling layer often needs to be introduced periodically after a convolutional layer. Among the layers 221-226 exemplified by 230 in FIG. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. In image processing, the sole purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator computes the average of the pixel values within a specific window of the image as the result of average pooling; the max pooling operator takes the pixel with the largest value within a specific window as the result of max pooling. In addition, just as the size of the weight matrix in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the input image.
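A minimal max-pooling sketch of the sub-region rule described above (sizes are illustrative):

```python
import numpy as np

def max_pool2d(feat, size=2, stride=2):
    # Each output pixel is the maximum of the corresponding sub-region,
    # shrinking the spatial size of the feature map.
    H, W = feat.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = feat[i*stride:i*stride+size,
                             j*stride:j*stride+size].max()
    return out

print(max_pool2d(np.arange(16.0).reshape(4, 4)))  # 4x4 -> 2x2
```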

Neural network layer 230:

After processing by the convolutional layer/pooling layer 230, the convolutional neural network 210 is not yet able to output the required output information. As described above, the convolutional layer/pooling layer 230 only extracts features and reduces the parameters introduced by the input image. However, to generate the final output information (the required class information or other related information), the convolutional neural network 210 needs to use the neural network layer 230 to generate the output of one or a set of required classes. Therefore, the neural network layer 230 may include multiple hidden layers (231, 232 to 23n as shown in FIG. 2) and an output layer 240. The parameters contained in the multiple hidden layers may be obtained through pre-training based on the relevant training data of a specific task type; for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.

After the multiple hidden layers in the neural network layer 230, that is, as the final layer of the entire convolutional neural network 210, comes the output layer 240. The output layer 240 has a loss function similar to categorical cross-entropy, specifically used to compute the prediction error. Once the forward propagation of the entire convolutional neural network 210 (propagation in the direction from 220 to 240 in FIG. 2) is completed, backpropagation (propagation in the direction from 240 to 220 in FIG. 2) begins to update the weight values and biases of the layers mentioned above, so as to reduce the loss of the convolutional neural network 210, that is, the error between the result output by the convolutional neural network 210 through the output layer and the ideal result.

It should be noted that the convolutional neural network 210 shown in FIG. 2 is merely an example of a convolutional neural network; in specific applications, the convolutional neural network may also exist in the form of other network models.

The following introduces a chip hardware structure provided by an embodiment of this application.

FIG. 3 shows a chip hardware structure provided by an embodiment of the present invention. The chip includes a neural network processor 30. The chip may be provided in the execution device 120 shown in FIG. 1 to complete the computation work of the computing module 111. The chip may also be provided in the training device 130 shown in FIG. 1 to complete the training work of the training device 130 and output the target model/rules 101. The algorithms of all the layers of the convolutional neural network shown in FIG. 2 can be implemented in the chip shown in FIG. 3.

The neural network processor (NPU) 30 is mounted on a host CPU as a coprocessor, and the host CPU allocates tasks to it. The core part of the NPU is an arithmetic circuit 303; a controller 304 controls the arithmetic circuit 303 to fetch data from memory (a weight memory or an input memory) and perform operations.

In some implementations, the arithmetic circuit 303 internally includes multiple processing engines (Process Engine, PE). In some implementations, the arithmetic circuit 303 is a two-dimensional systolic array. The arithmetic circuit 303 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 303 is a general-purpose matrix processor.

For example, assume there are an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 302 and caches it on each PE in the arithmetic circuit. The arithmetic circuit fetches the data of matrix A from the input memory 301 and performs matrix operations with matrix B, and the partial or final results of the matrix are stored in an accumulator 308.
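The accumulate-partial-results pattern can be sketched as follows (pure NumPy; the tiling is only a schematic stand-in for how a systolic array streams data, not a model of the actual circuit):

```python
import numpy as np

def matmul_accumulate(A, B, tile=2):
    # C = A @ B computed the way a systolic array would: partial products
    # over the inner dimension K are summed into an accumulator, tile by tile.
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    acc = np.zeros((M, N))        # plays the role of the accumulator 308
    for k0 in range(0, K, tile):  # stream weight/input tiles through the PEs
        acc += A[:, k0:k0+tile] @ B[k0:k0+tile, :]
    return acc

A = np.random.rand(4, 6)
B = np.random.rand(6, 3)
print(np.allclose(matmul_accumulate(A, B), A @ B))  # True
```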

The vector computation unit 307 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operations, logarithmic operations, and magnitude comparison. For example, the vector computation unit 307 can be used for the network computations of the non-convolutional/non-FC layers of a neural network, such as pooling, batch normalization, and local response normalization.

In some implementations, the vector computation unit 307 can store the processed output vectors into a unified buffer 306. For example, the vector computation unit 307 may apply a nonlinear function to the output of the arithmetic circuit 303, such as a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit 307 generates normalized values, merged values, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 303, for example for use in a subsequent layer of the neural network.

The operations of the perception network provided by the embodiments of this application may be performed by 303 or 307.

The unified memory 306 is used to store input data and output data.

A direct memory access controller (Direct Memory Access Controller, DMAC) 305 transfers the input data in the external memory to the input memory 301 and/or the unified memory 306, stores the weight data in the external memory into the weight memory 302, and stores the data in the unified memory 306 into the external memory.

A bus interface unit (Bus Interface Unit, BIU) 310 is used to implement interaction among the host CPU, the DMAC, and an instruction fetch buffer 309 through a bus.

An instruction fetch buffer 309 connected to the controller 304 is used to store the instructions used by the controller 304.

The controller 304 is used to invoke the instructions cached in the instruction fetch buffer 309 to control the working process of the computation accelerator.

Optionally, in this application, the input data here is a picture, and the output data is information such as the 2D, 3D, Mask, and key point information of the objects of interest in the picture.

Generally, the unified memory 306, the input memory 301, the weight memory 302, and the instruction fetch buffer 309 are all on-chip memories, and the external memory is memory external to the NPU. The external memory may be a double data rate synchronous dynamic random access memory (Double Data Rate Synchronous Dynamic Random Access Memory, DDR SDRAM), a high bandwidth memory (High Bandwidth Memory, HBM), or another readable and writable memory.

The program algorithms in FIG. 1 and FIG. 2 are completed jointly by the host CPU and the NPU.

The operations of the layers of the convolutional neural network shown in FIG. 2 may be performed by the arithmetic circuit 303 or the vector computation unit 307.

Refer to FIG. 5, which is a schematic structural diagram of a multi-Header perception network provided by an embodiment of this application. As shown in FIG. 5, the perception network includes:

The network mainly consists of two parts: a backbone network (Backbone) 401 and multiple parallel Headers 0-N.

The backbone network 401 is used to receive an input picture, perform convolution processing on the input picture, and output feature maps with different resolutions corresponding to the picture; in other words, it outputs feature maps of different sizes corresponding to the picture.

That is, the Backbone completes the extraction of basic features and provides the corresponding features for subsequent detection.

Each parallel head end is used to detect the task objects of one task based on the feature maps output by the backbone network, and to output the 2D boxes of the regions where the task objects are located as well as the confidence corresponding to each 2D box. Each parallel Header completes the detection of different task objects, where a task object is an object that needs to be detected in that task. A higher confidence indicates a greater probability that an object corresponding to the task exists in the 2D box corresponding to that confidence.

That is, the parallel Headers complete different 2D detection tasks. For example, parallel Header0 completes vehicle detection and outputs the 2D boxes and confidences of Car/Truck/Bus; parallel Header1 completes person detection and outputs the 2D boxes and confidences of Pedestrian/Cyclist/Tricycle; parallel Header2 completes traffic light detection and outputs the 2D boxes and confidences of Red_TrafficLight/Green_TrafficLight/Yellow_TrafficLight/Black_TrafficLight.

Optionally, as shown in FIG. 5, the perception network may further include one or more serial Headers (serial head ends), each connected to one parallel head end. It should be emphasized here that although FIG. 5 shows multiple serial Headers for better illustration, serial Headers are in fact not mandatory; for scenarios that only require 2D box detection, no serial Header needs to be included.

A serial Header is used to: utilize the 2D boxes, provided by the parallel Header it is connected to, of the task objects of the task it belongs to; extract, on one or more feature maps of the backbone network, the features of the regions where the 2D boxes are located; and predict the 3D information, Mask information, or Keypoint information of the task objects of that task based on the features of the regions where the 2D boxes are located.

A serial Header is optionally connected in series behind a parallel Header; on the basis of the detected 2D boxes of the task, it completes the 3D/Mask/Keypoint detection of the objects inside the 2D boxes.

For example, serial 3D_Header0 completes the estimation of the orientation, centroid, and length/width/height of a vehicle, thereby outputting the 3D box of the vehicle; serial Mask_Header0 predicts the fine mask of the vehicle, thereby segmenting the vehicle; and serial Keypoint_Header0 completes the estimation of the key points of the vehicle.

Serial Headers are not mandatory. For tasks that do not require 3D/Mask/Keypoint detection, no serial Header needs to be connected; for example, traffic light detection only requires 2D boxes, so no serial Header is connected. In addition, some tasks may connect one or more serial Headers according to their specific requirements. For example, parking slot (Parkingslot) detection requires, in addition to the 2D box, the key points of the parking slot; therefore, in this task only one serial Keypoint_Header needs to be connected, and no 3D or Mask Header is needed.

Each module is described in detail below.

Backbone: The backbone network performs a series of convolution operations on the input picture to obtain feature maps at different scales. These feature maps provide the basic features for the subsequent detection modules. The backbone network may take many forms, such as VGG (Visual Geometry Group), Resnet (Residual Neural Network), or Inception-net (the core structure of GoogLeNet).

Parallel Header: Based mainly on the basic features provided by the Backbone, a parallel Header completes the 2D box detection of one task and outputs the 2D boxes of the objects of that task together with the corresponding confidences.

Optionally, the parallel Header of each task includes three modules: RPN, ROI-ALIGN, and RCNN.

RPN module: used to predict the regions where the task objects are located on one or more feature maps provided by the backbone network, and to output candidate 2D boxes matching those regions.

In other words: RPN stands for Region Proposal Network; it predicts, on one or more feature maps of the Backbone, the regions where the objects of the task may exist and gives the boxes of those regions, which are called candidate regions (Proposals).

For example, when parallel Header0 is responsible for detecting vehicles, its RPN layer predicts candidate boxes in which vehicles may exist; when parallel Header1 is responsible for detecting persons, its RPN layer predicts candidate boxes in which persons may exist. Of course, these Proposals are inaccurate: on the one hand they do not necessarily contain the objects of the task, and on the other hand these boxes are not tight.

ROI-ALIGN module: used to extract, based on the regions predicted by the RPN module, the features of the regions where the candidate 2D boxes are located from one feature map provided by the backbone network.

That is, based on the Proposals provided by the RPN module, the ROI-ALIGN module extracts, on a certain feature map of the Backbone, the features of the region where each Proposal is located, and resizes them to a fixed size to obtain the features of each Proposal. It can be understood that the ROI-ALIGN module may use, but is not limited to, feature extraction methods such as ROI-POOLING (region of interest pooling), ROI-ALIGN (region of interest extraction), PS-ROIPOOLING (position-sensitive region of interest pooling), and PS-ROIALIGN (position-sensitive region of interest extraction).

RCNN module: used to perform convolution processing, through a neural network, on the features of the regions where the candidate 2D boxes are located, to obtain the confidence that each candidate 2D box belongs to each object category, where the object categories are the object categories of the task corresponding to the parallel head end; and to adjust, through the neural network, the coordinates of the candidate 2D boxes so that the adjusted 2D candidate boxes match the shapes of the actual objects better than the candidate 2D boxes do, and to select the adjusted 2D candidate boxes whose confidence is greater than a preset threshold as the 2D boxes of the regions.

That is, the RCNN module mainly refines the features of each Proposal provided by the ROI-ALIGN module to obtain the confidence that each Proposal belongs to each category (for example, for the vehicle task, four scores are given: Background/Car/Truck/Bus), and at the same time adjusts the coordinates of the Proposal's 2D box to output a tighter 2D box. After being merged by NMS (Non Maximum Suppression), these 2D boxes are output as the final 2D boxes.
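The NMS merging step referred to above is a standard algorithm; a common implementation looks like the following sketch (the box format and threshold are illustrative):

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # Greedily keep the highest-scoring box and drop any remaining box whose
    # IoU with it exceeds the threshold. boxes: N*4 array of (x1, y1, x2, y2).
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 10, 10], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))  # [0, 2], the overlapping box 1 is suppressed
```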

As described above, in some practical application scenarios the perception network may further include serial Headers. A serial Header is connected in series behind a parallel Header and, on the basis of the detected 2D boxes, further performs 3D/Mask/Keypoint detection. There are therefore three types of serial Headers:

Serial 3D Header: based on the 2D boxes provided by the preceding parallel Header (at this point the 2D boxes are accurate, tight 2D boxes), the serial 3D Header extracts, through the ROI-ALIGN module, the features of the regions where these 2D boxes are located on a certain feature map of the Backbone, and then regresses, through a small network (3D_Header in FIG. 5), the centroid point coordinates, orientation angle, and length/width/height of the object inside each 2D box, thereby obtaining the complete 3D information.

Serial Mask Header: based on the 2D boxes provided by the preceding parallel Header (at this point the 2D boxes are accurate, tight 2D boxes), the serial Mask Header extracts, through the ROI-ALIGN module, the features of the regions where these 2D boxes are located on a certain feature map of the Backbone, and then regresses, through a small network (Mask_Header in FIG. 5), the mask of the object inside each 2D box, thereby segmenting the object.

Serial Keypoint Header: based on the 2D boxes provided by the preceding parallel Header (at this point the 2D boxes are accurate, tight 2D boxes), the serial Keypoint Header extracts, through the ROI-ALIGN module, the features of the regions where these 2D boxes are located on a certain feature map of the Backbone, and then regresses, through a small network (Keypoint_Header in FIG. 5), the key point coordinates of the object inside each 2D box.

Optionally, as shown in FIG. 29, an embodiment of the present invention further provides an apparatus for training a multi-task perception network based on partially annotated data, the perception network including a backbone network and multiple parallel head ends (Headers). The structure of the perception network has been described in detail in the foregoing embodiments and is not repeated here. The apparatus includes:

a task determination module 2900, configured to determine, based on the annotated data type of each picture, the task to which each picture belongs, where each picture is annotated with one or more data types, the multiple data types are a subset of all data types, and one data type corresponds to one task;

a Header decision module 2901, configured to decide, based on the task to which each picture belongs as determined by the task determination module 2900, the Header that needs to be trained for each picture;

a loss value computation module 2902, configured to compute, for each picture, the loss value of the Header decided by the Header decision module 2901; and

an adjustment module 2903, configured to perform, for each picture, gradient backpropagation through the Header decided by the Header decision module 2901, and to adjust, based on the loss value obtained by the loss value computation module 2902, the parameters of the Header that needs to be trained and of the backbone network.

Optionally, in one embodiment, as shown by the dashed box in FIG. 29, the apparatus may further include:

a data balancing module 2904, configured to perform data balancing on the pictures belonging to different tasks.
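The flow implemented by modules 2900-2903 can be sketched as a training step in which only the Header matching a picture's annotated task contributes loss, so gradients reach only that Header and the shared backbone (PyTorch is used for illustration; the tiny linear modules and all names are stand-ins, not the actual perception network):

```python
import torch
import torch.nn as nn

backbone = nn.Linear(16, 8)                                   # stand-in Backbone
headers = nn.ModuleList([nn.Linear(8, 4) for _ in range(3)])  # 3 task Headers
optimizer = torch.optim.SGD(
    list(backbone.parameters()) + list(headers.parameters()), lr=0.01)
criterion = nn.CrossEntropyLoss()

def train_step(image, task_id, label):
    features = backbone(image)           # shared basic-feature extraction
    logits = headers[task_id](features)  # only the Header decided by the task
    loss = criterion(logits, label)      # loss value of that Header alone
    optimizer.zero_grad()
    loss.backward()                      # gradients reach this Header + backbone
    optimizer.step()
    return loss.item()

x = torch.randn(2, 16)                   # a mini-batch annotated for task 1
y = torch.tensor([0, 3])
print(train_step(x, task_id=1, label=y))
```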

As shown in FIG. 6, the following uses the visual perception system of ADAS/AD as an example to describe an embodiment of the present invention in detail.

In the visual perception system of ADAS/AD, multiple types of 2D targets need to be detected in real time, including dynamic obstacles (Pedestrian, Cyclist, Tricycle, Car, Truck, Bus), static obstacles (TrafficCone, TrafficStick, FireHydrant, Motocycle, Bicycle), and traffic signs (TrafficSign, GuideSign, Billboard). In addition, to accurately obtain the region a vehicle occupies in 3D space, 3D estimation also needs to be performed on dynamic obstacles to output 3D boxes. To fuse with lidar data, the Masks of dynamic obstacles need to be obtained so as to filter out the laser point clouds that hit the dynamic obstacles. For accurate parking, the four key points of a parking slot need to be detected simultaneously. Using the technical solution provided by this embodiment, all of the above functions can be completed in one network. This embodiment is described in detail below.

1. Division of the tasks of each Header and the overall block diagram of the network

Based on the similarity of the objects to be detected and the abundance or scarcity of the training samples, the 20 classes of objects to be detected are divided into 8 major categories in this embodiment, as shown in Table 2.

Table 2: Object categories to be detected by each Header and the extended functions

According to the service requirements, Header0 needs to further complete 3D and Mask detection in addition to the 2D detection of vehicles; Header1 needs to further complete Mask detection in addition to the 2D detection of persons; and Header2 needs to complete the detection of the key points of parking slots in addition to the detection of the 2D boxes of parking slots.

It should be noted that the task division in Table 2 is merely an example in this embodiment; different task divisions may be used in other embodiments, and the task division is not limited to that in Table 2.

According to the task division in Table 2, the overall structure of the perception network in this embodiment is shown in FIG. 6.

The perception network mainly includes three parts: the Backbone, the parallel Headers, and the serial Headers. It should be noted that, as described in the foregoing embodiments, the serial Headers are not mandatory; the reason has been described above and is not repeated here. The 8 parallel Headers simultaneously complete the 2D detection of the 8 major categories in Table 2, and several serial Headers are connected in series behind Headers 0-2 to further complete 3D/Mask/Keypoint detection. As can be seen from FIG. 6, the present invention can flexibly add and remove Headers according to service requirements, thereby implementing different functional configurations.

2. Basic feature generation

The basic feature generation process is implemented by the Backbone in FIG. 6, which performs convolution processing on the input image to generate several convolutional feature maps of different scales. Each feature map is an H*W*C matrix, where H is the height of the feature map, W is the width of the feature map, and C is the number of channels of the feature map.

The Backbone may adopt any of a variety of existing convolutional network frameworks, such as VGG16, Resnet50, or Inception-Net. The following uses Resnet18 as the Backbone to describe the basic feature generation process, which is shown in FIG. 7.

Assume the resolution of the input picture is H*W*3 (height H, width W, 3 channels, namely the three RGB channels). The input picture first undergoes a convolution operation through the first convolution module of Resnet18 (Res18-Conv1 in the figure; this convolution module consists of several convolutional layers, and the subsequent convolution modules are similar) to generate Featuremap (feature map) C1. This feature map is downsampled twice relative to the input image and its channel count is expanded to 64, so the resolution of C1 is H/4*W/4*64. C1 then undergoes a convolution operation through the second convolution module of Resnet18 (Res18-Conv2) to obtain Featuremap C2, whose resolution is the same as that of C1. C2 is further processed by the third convolution module of Resnet18 (Res18-Conv3) to generate Featuremap C3; this feature map is further downsampled relative to C2 with the channel count doubled, and its resolution is H/8*W/8*128. Finally, C3 is processed by Res18-Conv4 to generate Featuremap C4, whose resolution is H/16*W/16*256.
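The resulting shapes can be checked with simple arithmetic (the 480*640 input size is an assumption for the example):

```python
# Downsampling factors and channel counts of the four Resnet18 stages above.
H, W = 480, 640
stages = [("C1", 4, 64), ("C2", 4, 64), ("C3", 8, 128), ("C4", 16, 256)]
for name, down, channels in stages:
    print(f"{name}: {H // down} x {W // down} x {channels}")
# C1: 120 x 160 x 64
# C2: 120 x 160 x 64
# C3: 60 x 80 x 128
# C4: 30 x 40 x 256
```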

As can be seen from FIG. 7, Resnet18 performs multiple levels of convolution processing on the input picture to obtain feature maps at different scales: C1/C2/C3/C4. The lower-level feature maps have larger width and height and fewer channels, and mainly contain low-level image features (such as image edges and texture features); the higher-level feature maps have smaller width and height and more channels, and mainly contain high-level image features (such as shape and object features). The subsequent 2D detection process makes further predictions based on these feature maps.

3. 2D candidate region prediction process

The 2D candidate region prediction process is implemented by the RPN module of each parallel Header in FIG. 6, which, based on the feature maps (C1/C2/C3/C4) provided by the Backbone, predicts the regions where the objects of the corresponding task may exist and gives the candidate boxes (also called candidate regions, or Proposals) of these regions. In this embodiment, parallel Header0 is responsible for detecting vehicles, so its RPN layer predicts candidate boxes in which vehicles may exist; parallel Header1 is responsible for detecting persons, so its RPN layer predicts candidate boxes in which persons may exist; and so on, which is not repeated here.

The basic structure of the RPN layer is shown in FIG. 8. A 3*3 convolution on C4 generates the feature map RPN Hidden. The RPN layer of each parallel Header then predicts Proposals from the RPN Hidden. Specifically, the RPN layer of parallel Header0 predicts, through two 1*1 convolutions, the coordinates and confidence of the Proposal at each position of the RPN Hidden. A higher confidence indicates a greater probability that an object of the task exists in that Proposal; for example, the larger the score of a Proposal in parallel Header0, the greater the probability that a vehicle exists in it. The Proposals predicted by each RPN layer then pass through a Proposal merging module, which removes redundant Proposals according to the degree of overlap between them (this process may use, but is not limited to, the NMS algorithm) and selects, from the remaining K Proposals, the N (N<K) Proposals with the largest scores as the candidate regions where objects may exist. As can be seen from FIG. 8, these Proposals are inaccurate: on the one hand they do not necessarily contain the objects of the task, and on the other hand the boxes are not tight. Therefore, the RPN module is only a coarse detection process, and the subsequent RCNN module is needed for refinement.

When the RPN module regresses the coordinates of a Proposal, it does not directly regress the absolute values of the coordinates, but regresses the coordinates relative to Anchors. The better these Anchors match the actual objects, the greater the probability that the RPN can detect the objects. With the multi-Header framework of the present invention, corresponding Anchors can be designed for the scale and aspect ratio of the objects of each RPN layer, thereby improving the recall rate of each RPN layer, as shown in FIG. 9.

Parallel Header1 is responsible for detecting persons, whose dominant shape is tall and slender; therefore, its Anchors can be designed to be tall and slender. Parallel Header4 is responsible for detecting traffic signs, whose dominant shape is square; therefore, its Anchors can be designed to be square.
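The text does not give the exact regression encoding, but a widely used Faster R-CNN-style parameterization of "coordinates relative to an Anchor" looks like the following sketch (the encoding is an assumption, and the anchor shapes are chosen to echo the slender/square examples above):

```python
import numpy as np

def decode(anchor, deltas):
    # Decode predicted offsets (tx, ty, tw, th) relative to an anchor given
    # as (cx, cy, w, h): shift the center scaled by the anchor size, and
    # rescale the width/height multiplicatively.
    ax, ay, aw, ah = anchor
    tx, ty, tw, th = deltas
    cx = ax + tx * aw
    cy = ay + ty * ah
    w = aw * np.exp(tw)
    h = ah * np.exp(th)
    return cx, cy, w, h

tall_anchor = (100.0, 100.0, 32.0, 96.0)    # slender anchor, e.g. pedestrians
square_anchor = (100.0, 100.0, 64.0, 64.0)  # square anchor, e.g. traffic signs
print(decode(tall_anchor, (0.1, -0.2, 0.0, 0.1)))
```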

4. 2D candidate region feature extraction process

The 2D candidate region feature extraction process is mainly implemented by the ROI-ALIGN module in each parallel Header in FIG. 6, which, based on the coordinates of the Proposals provided by the RPN layer, extracts the features of the region where each Proposal is located from a feature map provided by the Backbone. The ROI-ALIGN process is shown in FIG. 10.

In this embodiment, the features are extracted from the C4 feature map of the Backbone. The region of each Proposal on C4 is the dark region indicated by the arrow in the figure; interpolation and sampling are used in this region to extract features of a fixed resolution. Assuming that the number of Proposals is N and that the width and height of the features extracted by ROI-ALIGN are both 14, the size of the features output by ROI-ALIGN is N*14*14*256 (the number of channels of the features extracted by ROI-ALIGN is the same as the number of channels of C4, namely 256). These features are sent to the subsequent RCNN module for refinement.
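As an illustration of this fixed-size extraction from C4, torchvision's off-the-shelf roi_align operator can play the role of the ROI-ALIGN module (the operator choice and tensor values are assumptions for the example; a stride of 16 gives spatial_scale = 1/16):

```python
import torch
from torchvision.ops import roi_align

# C4 stand-in: batch of 1, 256 channels, spatial size H/16 * W/16.
c4 = torch.randn(1, 256, 30, 40)
# Proposals in image coordinates: (batch_index, x1, y1, x2, y2).
proposals = torch.tensor([[0., 48., 32., 160., 128.],
                          [0., 200., 80., 360., 300.]])
feats = roi_align(c4, proposals, output_size=(14, 14), spatial_scale=1 / 16)
print(feats.shape)  # torch.Size([2, 256, 14, 14]): a fixed 14*14*256 per Proposal
```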

5. Fine classification of 2D candidate regions

The fine classification of the 2D candidate regions is mainly implemented by the RCNN module of each parallel Header in FIG. 6, which, based on the features of each Proposal extracted by the ROI-ALIGN module, further regresses more compact 2D box coordinates, and at the same time classifies the Proposal and outputs the confidence that it belongs to each category.

RCNN can be implemented in many forms, one of which is shown in FIG. 11 and analyzed below.

The size of the features output by the ROI-ALIGN module is N*14*14*256. In the RCNN module they are first processed by the fifth convolution module of Resnet18 (Res18-Conv5), and the output feature size is N*7*7*512. They are then processed by a Global Avg Pool (average pooling layer), which averages the 7*7 features within each channel of the input features to obtain N*512 features, where each 1*512-dimensional feature vector represents the features of one Proposal. Next, two fully connected layers (FC) respectively regress the precise coordinates of the box (outputting an N*4 vector, where the 4 values respectively represent the x/y coordinates of the box center and the width and height of the box) and the confidence of the box category (in Header0, the scores of this box being Background/Car/Truck/Bus need to be given). Finally, through the box merging operation, the several boxes with the largest scores are selected, and duplicate boxes are removed through the NMS operation, thereby obtaining a compact box output.
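A simplified stand-in for this refinement head is sketched below (the Res18-Conv5 block is replaced by a single strided convolution for brevity; the layer sizes follow the text, everything else is illustrative):

```python
import torch
import torch.nn as nn

class RCNNHead(nn.Module):
    def __init__(self, num_classes=4):          # e.g. Background/Car/Truck/Bus
        super().__init__()
        self.conv5 = nn.Sequential(              # 14x14x256 -> 7x7x512
            nn.Conv2d(256, 512, 3, stride=2, padding=1), nn.ReLU())
        self.pool = nn.AdaptiveAvgPool2d(1)      # global average pool -> 512
        self.fc_box = nn.Linear(512, 4)          # center x/y, width, height
        self.fc_cls = nn.Linear(512, num_classes)  # per-class confidence

    def forward(self, roi_feats):                # roi_feats: N*256*14*14
        x = self.pool(self.conv5(roi_feats)).flatten(1)  # N*512 per Proposal
        return self.fc_box(x), self.fc_cls(x)

head = RCNNHead()
boxes, scores = head(torch.randn(8, 256, 14, 14))
print(boxes.shape, scores.shape)  # torch.Size([8, 4]) torch.Size([8, 4])
```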

6. 3D detection process

The 3D detection process is completed by the serial 3D_Header0 in FIG. 6, which, based on the 2D boxes provided by the "2D detection" process and the feature maps provided by the Backbone, predicts 3D information such as the centroid point coordinates, orientation angle, length, width, and height of the object inside each 2D box. One possible implementation of the serial 3D_Header is shown in FIG. 12.

Based on the accurate 2D boxes provided by the parallel Header, the ROI-ALIGN module extracts the features of the region where each 2D box is located on C4. Assuming that the number of 2D boxes is M, the size of the features output by the ROI-ALIGN module is M*14*14*256. They are first processed by the fifth convolution module of Resnet18 (Res18-Conv5), and the output feature size is M*7*7*512; they are then processed by a Global Avg Pool (average pooling layer), which averages the 7*7 features of each channel of the input features to obtain M*512 features, where each 1*512-dimensional feature vector represents the features of one 2D box. Next, three fully connected layers (FC) respectively regress the orientation angle of the object in the box (orientation in the figure, an M*1 vector), the centroid point coordinates (centroid in the figure, an M*2 vector, where the 2 values represent the x/y coordinates of the centroid), and the length, width, and height (dimention in the figure).

7. Mask detection process

The Mask detection process is completed by the serial Mask_Header0 in FIG. 6, which, based on the 2D boxes provided by the "2D detection" process and the feature maps provided by the Backbone, predicts the fine mask of the object inside each 2D box. One possible implementation of the serial Mask_Header is shown in FIG. 13.

Based on the accurate 2D boxes provided by the parallel Header, the ROI-ALIGN module extracts the features of the region where each 2D box is located on C4. Assuming that the number of 2D boxes is M, the size of the features output by the ROI-ALIGN module is M*14*14*256. They are first processed by the fifth convolution module of Resnet18 (Res18-Conv5), and the output feature size is M*7*7*512; they are then further convolved through a deconvolution layer (Deconv) to obtain M*14*14*512 features, and finally pass through one convolution to obtain an M*14*14*1 Mask confidence output. In this output, each 14*14 matrix represents the confidence of the mask of the object in one 2D box: each 2D box is equally divided into 14*14 regions, and the 14*14 matrix indicates the possibility that the object is present in each region. Thresholding this confidence matrix (for example, outputting 1 when a value is greater than the threshold 0.5 and 0 otherwise) yields the mask of the object.
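The final thresholding step amounts to a single comparison (0.5 is the example threshold from the text; the confidence matrix here is random stand-in data):

```python
import numpy as np

confidence = np.random.rand(14, 14)         # stand-in for one Mask_Header output
mask = (confidence > 0.5).astype(np.uint8)  # 1 = object present in the region
print(mask.shape, mask.dtype)               # (14, 14) uint8
```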

8. Keypoint detection flow

The Keypoint detection flow is performed by the serial Keypoint_Header2 in Figure 6. Based on the 2D boxes produced by the "2D detection" flow and the feature maps provided by the Backbone, it predicts the keypoint coordinates of the object inside each 2D box. A possible implementation of the serial Keypoint_Header is shown in Figure 14.

The ROI-ALIGN module uses the accurate 2D boxes provided by the parallel Header to extract, from C4, the features of the region covered by each 2D box. Assuming the number of 2D boxes is M, the ROI-ALIGN module outputs features of size M*14*14*256. These are first processed by the fifth convolution block of Resnet18 (Res18-Conv5), producing features of size M*7*7*512, and then by a Global Avg Pool that averages the 7*7 features of each channel, yielding M*512 features in which each 1*512-dimensional vector represents one 2D box. One fully connected layer (FC) then regresses the keypoint coordinates of the object in the box (Keypoint in the figure, an M*8 vector whose 8 values are the x/y coordinates of the four corner points of a parking slot).
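For clarity, decoding the M*8 keypoint vector into four (x, y) corners can be sketched as below; the (x1, y1, ..., x4, y4) layout is an assumption, since the ordering is not fixed here.

```python
# Hypothetical sketch: reshape the keypoint regression output into corners.
import torch

keypoints = torch.randn(8, 8)        # M*8 output of the keypoint FC layer
corners = keypoints.view(-1, 4, 2)   # M*4*2: four (x, y) corners per slot
```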

Based on the task division of Table 2, the embodiments of this application also describe the training process of the perception network in detail.

A. Preparation of training data

According to the task division of Table 2, annotation data must be provided for each task. For example, the training of Header0 requires vehicle annotations: the 2D boxes and class labels of Car/Truck/Bus are labeled on its data set. The training of Header1 requires person annotations: the 2D boxes and class labels of Pedestrian/Cyclist/Tricycle are labeled. Header3 requires traffic-light annotations: the 2D boxes and class labels of TrafficLight_Red/Yellow/Green/Black are labeled, and so on.

Each kind of data only needs to be labeled with its specific object types, so collection can be targeted and there is no need to label every object of interest in every picture, which lowers the cost of data collection and annotation. In addition, preparing data this way scales flexibly: to add new detection types, it suffices to add one or more Headers and supply annotation data for the new objects; the new objects do not have to be labeled in the existing data.

In addition, to train the 3D detection function of Header0, independent 3D annotation data is required: the 3D information of every vehicle (centroid coordinates, orientation angle, length, width, height) is labeled on a data set. To train the Mask detection function of Header0, independent Mask annotation data is required: the mask of every vehicle is labeled. In particular, the Parkingslot detection of Header2 requires keypoint detection; this task requires the data set to label both the 2D box and the keypoints of each parking slot (in practice only the keypoints need to be labeled, since the 2D box of a slot can be generated automatically from the keypoint coordinates).

In general it suffices to provide independent training data for each task, but mixed annotations may also be provided. For example, the 2D boxes and class labels of Car/Truck/Bus/Pedestrian/Cyclist/Tricycle can be labeled on the same data set, so that it trains the parallel Headers of Header0 and Header1 simultaneously; likewise, the 2D/3D/Mask data of Car/Truck/Bus can be labeled together, so that one data set trains parallel Header0, serial 3D_Header0, and serial Mask_Header0 at the same time.

Each picture can be assigned a label that determines which Headers in the network the picture may be used to train; this is described in detail in the training flow below.

To ensure that every Header receives an equal training opportunity, the data must be balanced; concretely, the smaller data sets are expanded, by means including but not limited to replication. The balanced data is randomly shuffled and then fed into the network for training, as shown in Figure 15.
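A minimal sketch of this replicate-and-shuffle balancing is given below; the per-task image lists and the choice to match the largest set are illustrative assumptions.

```python
# Hypothetical sketch: expand small per-task data sets by replication,
# then shuffle the merged pool before feeding it to the network.
import random

def balance_by_replication(datasets):
    """datasets: dict mapping task name -> list of image identifiers."""
    target = max(len(items) for items in datasets.values())
    pool = []
    for task, items in datasets.items():
        # Replicate the smaller sets until they match the largest one.
        expanded = (items * (target // len(items) + 1))[:target]
        pool.extend((task, image) for image in expanded)
    random.shuffle(pool)   # random shuffle, as in Figure 15
    return pool

mixed = balance_by_replication({"car": ["c1.jpg", "c2.jpg", "c3.jpg"],
                                "traffic_light": ["t1.jpg"]})
```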

B. Training the full-featured network from partially annotated data

When training the full-featured network from partially annotated data, the Loss of the corresponding Header is computed according to the task(s) each input picture belongs to; this Loss is back-propagated, the gradients of the parameters of the corresponding Header(s) and the Backbone are computed, and those parameters are adjusted according to the gradients. Headers that are not in the annotation task of the current input picture are not adjusted.

If a picture is annotated with 2D data for a single task only, then when it is fed into the network for training, only the corresponding parallel Header is trained, as shown in Figure 16.

Suppose the current picture is labeled only with traffic-light 2D boxes. During training, the traffic-light predictions for this picture are obtained through parallel Header3 alone and compared with the ground truth, giving this Header's loss 2D_Loss3. Since only one loss is produced, the overall loss is Final Loss = 2D_Loss3. In other words, the traffic-light picture flows only through the Backbone and parallel Header3; the remaining Headers do not take part in training, as indicated by the thick arrows without an "X" in Figure 16. After Final Loss is obtained, the gradients in parallel Header3 and the Backbone are computed along the reverse direction of those arrows and used to update the parameters of Header3 and the Backbone, adjusting the network so that it predicts traffic lights better.

If a picture is annotated with 2D data for multiple tasks, then when it is fed into the network for training, all the corresponding parallel Headers are trained, as shown in Figure 17.

Suppose the current picture is labeled with both person and vehicle 2D boxes. During training, the person and vehicle predictions for this picture are obtained through parallel Header0 and parallel Header1 and compared with the ground truth, giving the losses 2D_Loss0 and 2D_Loss1 of these two Headers. Since multiple losses are produced, the overall loss is the mean of the individual losses, that is, Final Loss = (2D_Loss0 + 2D_Loss1)/2. In other words, a picture labeled with persons and vehicles flows only through the Backbone and parallel Header0/1; the remaining Headers do not take part in training, as indicated by the thick arrows without an "X" in Figure 17. After Final Loss is obtained, the gradients in parallel Header0/1 and the Backbone are computed along the reverse direction of those arrows and used to update their parameters, adjusting the network so that it predicts persons and vehicles better.
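Because only the Headers of the labeled tasks contribute to Final Loss, the gradient routing falls out of automatic differentiation for free. A minimal PyTorch-style sketch of one such training step follows; backbone, headers, criterion, and the labels dictionary are assumed names, not identifiers from the patent.

```python
# Hypothetical sketch of one partially-labeled training step (PyTorch).
import torch

def train_step(backbone, headers, criterion, optimizer, image, labels):
    """headers: dict task -> Header module; labels: dict task -> ground
    truth, containing only the tasks annotated on this image."""
    features = backbone(image)
    losses = [criterion(headers[task](features), gt)
              for task, gt in labels.items()]   # only labeled tasks run
    final_loss = torch.stack(losses).mean()     # e.g. (Loss0 + Loss1) / 2
    optimizer.zero_grad()
    final_loss.backward()       # gradients reach only the used Headers
    optimizer.step()            # and the Backbone; others stay untouched
    return final_loss.item()
```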

Training a serial Header requires an independent data set; this is explained below using 3D training for vehicles as an example, as shown in Figure 18.

In this case the input picture is labeled with both the 2D and the 3D ground truth of vehicles. During training, the data flows along the thick arrows without an "X" in the figure, while the thick arrows with an "X" mark the Headers the data does not reach. When this picture is fed into the network, the 2D and 3D loss functions are computed simultaneously, giving Final Loss = (2D_Loss0 + 3D_Loss0)/2. The gradients in serial 3D_Header0, parallel Header0, and the Backbone are then computed along the reverse direction of the arrows without an "X" and used to update their parameters, adjusting the network so that it predicts the 2D and 3D properties of vehicles better.

Each picture fed into the network adjusts only its corresponding Header(s) and the Backbone, improving the corresponding task. In the process, the performance of other tasks degrades somewhat, but when pictures of those tasks come up later, the degraded Headers are adjusted back. Because the training data of all tasks is balanced in advance and every task receives an equal training opportunity, no single task is over-trained. With this training method, the Backbone learns the features shared by all tasks, while each Header learns its task-specific features.

Perception nowadays must implement more and more functions, and using a separate network for each single-point function makes the total computation excessive. Embodiments of the present invention propose a high-performance, scalable Multi-Header perception network in which all perception tasks share the same backbone network, saving computation and network parameters several-fold. Table 3 gives the computation and parameter statistics of Single-Header networks, each implementing a single function.

Single-Header-Model@720p                                 GFlops    Parameters(M)
Vehicle (Car/Truck/Tram)                                 235.5     17.76
Vehicle+Mask+3D                                          235.6     32.49
Person (Pedestrian/Cyclist/Tricycle)                     235.5     17.76
Person+Mask                                              235.6     23.0
Motorcycle/Bicycle                                       235.5     17.76
TrafficLight (Red/Green/Yellow/Black)                    235.6     17.76
TrafficSign (Trafficsign/Guideside/Billboard)            235.5     17.75
TrafficCone/TrafficStick/FireHydrant                     235.5     17.75
Parkingslot (with keypoint)                              235.6     18.98
Full-featured network (multiple Single-Header networks)  1648.9    145.49

Table 3. Computation and parameter statistics of Single-Header networks

As the table shows, implementing all the functions of this embodiment with 8 separate networks requires a total of 1648.9 GFlops of computation and 145.49M network parameters. Such computation and parameter volumes are enormous and would put great pressure on the hardware.

Table 4 gives the computation and parameters required to implement all the functions of this embodiment with a single Multi-Header network.

Multi-Header-Model@720p                                GGFlops → GFlops    Parameters(M)
Full-featured network (single Multi-Header network)    236.6               42.16

Table 4. Computation and parameter statistics of the Multi-Header network

As the table shows, the computation and parameter volumes of the Multi-Header network are only about 1/7 and 1/3 of those of the Single-Header networks, greatly reducing computational cost.

Moreover, the Multi-Header network achieves the same detection performance as the Single-Header networks. Table 5 compares Multi-Header and Single-Header performance on several categories.

Category                                     Single-Header   Multi-Header
Car                                          91.7            91.6
Tram                                         81.8            80.1
Pedestrian                                   73.6            75.2
Cyclist                                      81.8            83.3
TrafficLight                                 98.3            97.5
TrafficSign                                  95.1            94.5
Parkingslot (point precision/recall)         94.01/80.61     95.17/78.89
3D (mean_orien_err/mean_centroid_dist_err)   2.95/6.78       2.88/6.34

Table 5. Detection performance comparison between Single-Header and Multi-Header networks

As the table shows, the two perform comparably; the Multi-Header network therefore saves computation and memory without degrading performance.

Embodiments of the present invention propose a high-performance, scalable Multi-Header perception network that performs different perception tasks (2D/3D/keypoints/semantic segmentation, etc.) simultaneously on one network. All perception tasks share the same backbone network, saving computation, and the network structure is easy to extend: adding a function only requires adding a Header. Embodiments of the present invention also propose a method for training a multi-task perception network from partially annotated data: each task uses an independent data set, full-task annotation of the same picture is unnecessary, the training data of different tasks is easy to balance, and the data of different tasks does not suppress one another.

As shown in Figure 30, an embodiment of the present invention further provides an object detection method, including:

S3001: receiving an input picture;

S3002: performing convolution processing on the input picture and outputting feature maps of different resolutions corresponding to the picture;

S3003: according to the feature maps, independently detecting, for different tasks, the task objects of each task, and outputting the 2D box of the region where each task object is located together with the confidence of each 2D box; the task objects are the objects to be detected in that task, and a higher confidence indicates a higher probability that an object of the task exists in the 2D box corresponding to that confidence.

Optionally, in one embodiment, S3003 may include the following four steps:

1. Predicting, on one or more feature maps, the region where the task object is located, and outputting candidate 2D boxes matching the region;

Optionally, based on the template box (Anchor) of the objects corresponding to the task, the regions where the task objects exist can be predicted on one or more feature maps provided by the backbone network to obtain candidate regions, and candidate 2D boxes matching the candidate regions are output. The template box is derived from the statistical characteristics of the task objects it belongs to, the statistical characteristics including the objects' shape and size (a sketch of deriving such template boxes from object statistics appears after this list);

2. Extracting, from one feature map and according to the region where the task object is located, the features of the region covered by the candidate 2D box;

3. Performing convolution processing on the features of the region covered by the candidate 2D box to obtain the confidence that the candidate 2D box belongs to each object category, the object categories being those of the one task;

4. Adjusting the coordinates of the candidate 2D box through a neural network so that the adjusted 2D candidate box matches the shape of the actual object better than the candidate 2D box did, and selecting the adjusted 2D candidate boxes whose confidence exceeds a preset threshold as the 2D boxes of the region (one common decoding of this adjustment is sketched after this list).
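As referenced in steps 1 and 4 above, two small sketches follow. First, one plausible way (an assumption, since the text only states that template boxes come from shape and size statistics) to derive per-task Anchors from the labeled objects of that task:

```python
# Hypothetical sketch: per-task Anchor sizes from object statistics.
import statistics

def make_anchors(box_sizes, scales=(0.5, 1.0, 2.0)):
    """box_sizes: list of (width, height) of one task's labeled objects."""
    mean_w = statistics.mean(w for w, _ in box_sizes)
    mean_h = statistics.mean(h for _, h in box_sizes)
    # One anchor per scale, preserving the task's typical aspect ratio.
    return [(mean_w * s, mean_h * s) for s in scales]

# E.g. traffic lights are small and tall; their anchors stay small and tall:
anchors = make_anchors([(20, 45), (18, 40), (22, 50)])
```

Second, the coordinate adjustment of step 4 is commonly realized by regressing offsets relative to the candidate box and decoding them as below; this delta parameterization (familiar from Faster R-CNN style detectors) is one possible instantiation, not necessarily the exact one used here:

```python
# Hypothetical sketch of decoding predicted box offsets (dx, dy, dw, dh).
import math

def refine_box(box, deltas):
    """box: (cx, cy, w, h) candidate; deltas: network-predicted offsets."""
    cx, cy, w, h = box
    dx, dy, dw, dh = deltas
    return (cx + dx * w,        # shift center by a fraction of the box size
            cy + dy * h,
            w * math.exp(dw),   # rescale width/height multiplicatively
            h * math.exp(dh))

refined = refine_box((100.0, 80.0, 40.0, 30.0), (0.1, -0.05, 0.2, 0.0))
```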

Optionally, the 2D box may be a rectangular box.

Optionally, the method further includes:

S3004: based on the 2D box of the task object of a task, extracting the features of the region covered by the 2D box from one or more feature maps of the backbone network, and predicting the 3D information, Mask information, or Keypoint information of the task object of that task according to those features.

Optionally, the detection of regions containing large objects can be completed on low-resolution feature maps, while the RPN module completes the detection of regions containing small objects on high-resolution feature maps.
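One common realization of this size-based routing (an assumption here; the text does not fix the formula) is the FPN-style level assignment, which maps each box to a pyramid level by its area:

```python
# Hypothetical sketch: route a box to a feature-pyramid level by its size.
import math

def assign_pyramid_level(box_w, box_h, k0=4, canonical=224):
    """Large boxes go to high (coarse, low-resolution) levels; small boxes
    go to low (fine, high-resolution) levels. Constants follow FPN practice."""
    k = k0 + math.log2(math.sqrt(box_w * box_h) / canonical)
    return max(2, min(5, round(k)))   # clamp to levels P2..P5

assign_pyramid_level(400, 300)   # large object -> level 5 (low resolution)
assign_pyramid_level(30, 40)     # small object -> level 2 (high resolution)
```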

As shown in Figure 31, an embodiment of the present invention further provides a method for training a multi-task perception network from partially annotated data, the method including:

S3101: determining, according to the annotated data types of each picture, the task(s) the picture belongs to; each picture is annotated with one or more data types, the data types are a subset of all data types, and one data type corresponds to one task;

S3102: deciding, according to the task(s) each picture belongs to, which Header(s) each picture is to train;

S3103: computing the loss value of the Header(s) each picture is to train;

S3104: for each picture, back-propagating gradients through the Header(s) to be trained, and adjusting the parameters of those Header(s) and the backbone network based on the loss value.

Optionally, as indicated by the dashed box in Figure 31, before step S3102 the method further includes:

S31020: performing data balancing on the pictures belonging to different tasks.

An embodiment of the present invention further provides a Multi-Header based object perception method. The flow of this perception method consists of two parts, an "inference" flow and a "training" flow, introduced separately below.

I. Perception flow:

The flow of the perception method provided by an embodiment of the present invention is shown in Figure 21.

In step S210, a picture is input into the network.

In step S220, the "basic feature generation" flow is entered.

In this flow, basic features are extracted from the picture by the Backbone of Figure 5, yielding feature maps of different scales. After the basic features are generated, the core flow, located in the dashed box of Figure 21, is entered. In the core flow, each task has an independent "2D detection" flow plus optional "3D detection", "Mask detection", and "Keypoint detection" flows. The core flow is described below.

1. 2D detection flow

The "2D detection" flow predicts the 2D boxes and confidences of each task from the feature maps produced by the "basic feature generation" flow. Specifically, it can be subdivided into a "2D candidate region prediction" flow, a "2D candidate region feature extraction" flow, and a "2D candidate region fine classification" flow, as shown in Figure 22.

The "2D candidate region prediction" flow is implemented by the RPN module of Figure 5, which predicts, on one or more feature maps provided by the "basic feature generation" flow, the regions where the task's objects may exist, and outputs boxes (Proposals) for those regions.

The "2D candidate region feature extraction" flow is implemented by the ROI-ALIGN module of Figure 5. According to the Proposals provided by the "2D candidate region prediction" flow, it extracts the features of the region covered by each Proposal from a feature map provided by the "basic feature generation" flow and resizes them to a fixed size, obtaining the features of each Proposal.

The "2D candidate region fine classification" flow is implemented by the RCNN module of Figure 5, which uses a neural network to make further predictions from each Proposal's features, outputting the confidence that the Proposal belongs to each category while adjusting the coordinates of the Proposal's 2D box to output a tighter 2D box.

2. 3D detection flow

The "3D detection" flow predicts the 3D information of the object inside each 2D box, namely centroid coordinates, orientation angle, and length/width/height, from the 2D boxes provided by the "2D detection" flow and the feature maps produced by the "basic feature generation" flow. Specifically, "3D detection" consists of two sub-flows, as shown in Figure 23.

The sub-flows are as follows:

The "2D candidate region feature extraction" flow is implemented by the ROI-ALIGN module of Figure 5. According to the coordinates of each 2D box, it extracts the features of the region covered by the box from a feature map provided by the "basic feature generation" flow and resizes them to a fixed size, obtaining the features of each 2D box.

The "3D centroid/orientation/dimension prediction" flow is implemented by the 3D_Header of Figure 5, which regresses, mainly from the features of each 2D box, the 3D information of the object inside the box: centroid coordinates, orientation angle, and length/width/height.

3. Mask detection flow

The "Mask detection" flow predicts a detailed mask of the object inside each 2D box from the 2D boxes provided by the "2D detection" flow and the feature maps produced by the "basic feature generation" flow. Specifically, "Mask detection" consists of two sub-flows, as shown in Figure 24.

The sub-flows are as follows:

The "2D candidate region feature extraction" flow is implemented by the ROI-ALIGN module of Figure 5. According to the coordinates of each 2D box, it extracts the features of the region covered by the box from a feature map provided by the "basic feature generation" flow and resizes them to a fixed size, obtaining the features of each 2D box.

The "mask prediction" flow is implemented by the Mask_Header of Figure 5, which regresses, mainly from the features of each 2D box, the mask covering the object inside the box.

4. Keypoint detection flow

The "Keypoint prediction" flow predicts the keypoints of the object inside each 2D box from the 2D boxes provided by the "2D detection" flow and the feature maps produced by the "basic feature generation" flow. Specifically, "Keypoint prediction" consists of two sub-flows, as shown in Figure 25.

The sub-flows are as follows:

The "2D candidate region feature extraction" flow is implemented by the ROI-ALIGN module of Figure 5. According to the coordinates of each 2D box, it extracts the features of the region covered by the box from a feature map provided by the "basic feature generation" flow and resizes them to a fixed size, obtaining the features of each 2D box.

The "keypoint coordinate prediction" flow is implemented by the Keypoint_Header of Figure 5, which regresses, mainly from the features of each 2D box, the coordinates of the keypoints of the object inside the box.

II. Training flow

The training flow of an embodiment of the present invention is shown in Figure 26.

The part in the red box is the core training flow, introduced below.

1. Inter-task data balancing flow

The data volumes of the tasks are extremely unbalanced; for instance, there are far more pictures containing persons than pictures containing traffic signs. For every task's Header to receive an equal training opportunity, the data must be balanced across tasks: concretely, the smaller data sets are expanded, by means including but not limited to replication.

2. Computing the Loss according to the task(s) a picture belongs to

Each picture belongs to one or more tasks according to the data types annotated on it. For example, a picture annotated only with traffic signs belongs solely to the traffic-sign task, while a picture annotated with both persons and vehicles belongs to both the person and vehicle tasks. When the Loss is computed, only the Loss of the Header(s) corresponding to the tasks the current picture belongs to is computed; the Losses of the remaining tasks are not. For instance, if the current training picture belongs to the person and vehicle tasks, only the Losses of the person and vehicle Headers are computed, and the Losses of the rest (such as traffic lights and traffic signs) are not.

3. Back-propagating gradients according to the task(s) a picture belongs to

After the Loss is computed, gradients must be back-propagated. Gradients are propagated only through the Header(s) of the current tasks; Headers outside the current tasks do not participate. In this way, the relevant Header is adjusted for the current picture so that it learns the current task better. Because the data of all tasks has been balanced and every Header receives an equal training opportunity, through this repeated adjustment each Header learns task-specific features, while the Backbone learns the features shared across tasks.

Considering the shortcomings of existing methods, embodiments of this application propose a high-performance, scalable Multi-Header perception network that performs different perception tasks (2D/3D/keypoints/semantic segmentation, etc.) simultaneously on one network. All perception tasks share the same backbone network, which significantly saves computation, and the network structure is easy to extend: functions are added simply by adding one or more Headers.

In addition, embodiments of this application propose a method for training a multi-task perception network from partially annotated data. Each task uses an independent data set, full-task annotation of the same picture is unnecessary, the training data of different tasks is easy to balance, and the data of different tasks does not suppress one another.

The perception network of Figure 5 can be implemented with the structure of Figure 27, which is a schematic diagram of an application system of the perception network. As shown in the figure, the perception network 2000 includes at least one processor 2001, at least one memory 2002, at least one communication interface 2003, and at least one display device 2004. The processor 2001, memory 2002, display device 2004, and communication interface 2003 are connected through a communication bus and communicate with one another.

The communication interface 2003 is used to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), or wireless local area networks (WLAN).

The memory 2002 may be a read-only memory (ROM) or another type of static storage device capable of storing static information and instructions, a random access memory (RAM) or another type of dynamic storage device capable of storing information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the bus, or may be integrated with the processor.

The memory 2002 stores the application program code for executing the above solutions, and execution is controlled by the processor 2001. The processor 2001 executes the application program code stored in the memory 2002.

The code stored in the memory 2002 can execute the Multi-Header based object perception method provided above.

The display device 2004 is used to display the image to be recognized and the 2D, 3D, Mask, keypoint, and other information of the objects of interest in that image.

The processor 2001 may also employ one or more integrated circuits for executing relevant programs to implement the Multi-Header based object perception method or the model training method of the embodiments of this application.

The processor 2001 may also be an integrated circuit chip with signal processing capability. During implementation, the steps of the perception method of this application, as well as the steps of the training method of the embodiments of this application, can be completed by integrated logic circuits of hardware in the processor 2001 or by instructions in software form. The processor 2001 may also be a general-purpose processor, a digital signal processor (DSP), an ASIC, a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and can implement or execute the methods, steps, and module block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium mature in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or register. The storage medium resides in the memory 2002; the processor 2001 reads the information in the memory 2002 and, in combination with its hardware, completes the object perception method or model training method of the embodiments of this application.

The communication interface 2003 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the apparatus and other devices or communication networks. For example, the picture to be recognized or the training data can be obtained through the communication interface 2003.

The bus may include a path for transferring information between the components of the apparatus (e.g., the memory 2002, processor 2001, communication interface 2003, and display device 2004). In one possible embodiment, the processor 2001 specifically performs the following steps: receiving an input picture; performing convolution processing on the input picture and outputting feature maps of different resolutions corresponding to the picture; and, according to the feature maps provided by the backbone network, independently detecting, for different tasks, the objects corresponding to each task, and outputting the 2D boxes of the candidate regions of the objects corresponding to each task together with the confidence of each 2D box.

In one possible embodiment, when performing the step of independently detecting, for different tasks and according to the feature maps provided by the backbone network, the objects corresponding to each task, and outputting the 2D boxes of the candidate regions of those objects together with the confidence of each 2D box, the processor 2001 specifically performs the following steps: predicting, on one or more feature maps, the regions where the task's objects exist to obtain candidate regions, and outputting candidate 2D boxes matching the candidate regions; extracting, from one feature map and according to the candidate regions obtained by the RPN module, the features of the regions where the candidate regions are located; refining the features of the candidate regions to obtain the confidence that each candidate region corresponds to each object category, the objects being those of the corresponding task; adjusting the coordinates of the candidate regions to obtain second candidate 2D boxes that match the actual objects better than the original candidate 2D boxes did; and selecting the 2D candidate boxes whose confidence exceeds a preset threshold as the 2D boxes of the candidate regions.

In one possible embodiment, when performing the prediction, on one or more feature maps, of the regions where the task's objects exist to obtain candidate regions and outputting candidate 2D boxes matching the candidate regions, the processor 2001 specifically performs the following steps:

Based on the template box (Anchor) of the objects of the corresponding task, predicting, on one or more feature maps, the regions where the objects of that task exist to obtain candidate regions, and outputting candidate 2D boxes matching the candidate regions; the template box is derived from the statistical characteristics of the task objects it belongs to, the statistical characteristics including the objects' shape and size.

In one possible embodiment, the processor 2001 further performs the following steps:

Based on the 2D box of an object of the corresponding task, extracting the object's features from one or more feature maps of the backbone network, and predicting the 3D, Mask, or Keypoint of the object.

In one possible embodiment, the detection of candidate regions of large objects is completed on low-resolution feature maps, and the detection of candidate regions of small objects is completed on high-resolution feature maps.

In one possible embodiment, the 2D box is a rectangular box.

Optionally, as shown in Figure 28, the structure of the perception network may be implemented by a server, and the server may be implemented with the structure of Figure 28. The server 2110 includes at least one processor 2101, at least one memory 2102, and at least one communication interface 2103. The processor 2101, memory 2102, and communication interface 2103 are connected through a communication bus and communicate with one another.

The communication interface 2103 is used to communicate with other devices or communication networks, such as Ethernet, RAN, or WLAN.

The memory 2102 may be a ROM or another type of static storage device capable of storing static information and instructions, a RAM or another type of dynamic storage device capable of storing information and instructions, an EEPROM, a CD-ROM or other optical disc storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through the bus, or may be integrated with the processor.

The memory 2102 stores the application program code for executing the above solutions, and execution is controlled by the processor 2101. The processor 2101 executes the application program code stored in the memory 2102.

The code stored in the memory 2102 can execute the Multi-Header based object perception method provided above.

The processor 2101 may also employ one or more integrated circuits for executing relevant programs to implement the Multi-Header based object perception method or the model training method of the embodiments of this application.

The processor 2101 may also be an integrated circuit chip with signal processing capability. During implementation, the steps of the perception method of this application, as well as the steps of the training method of the embodiments of this application, can be completed by integrated logic circuits of hardware in the processor 2101 or by instructions in software form. The processor 2101 may also be a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, and can implement or execute the methods, steps, and module block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may reside in a storage medium mature in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or register. The storage medium resides in the memory 2102; the processor 2101 reads the information in the memory 2102 and, in combination with its hardware, completes the object perception method or model training method of the embodiments of this application.

The communication interface 2103 uses a transceiver apparatus, such as but not limited to a transceiver, to implement communication between the apparatus and other devices or communication networks. For example, the picture to be recognized or the training data can be obtained through the communication interface 2103.

The bus may include a path for transferring information between the components of the apparatus (e.g., the memory 2102, processor 2101, and communication interface 2103). In one possible embodiment, the processor 2101 specifically performs the following steps: predicting, on one or more feature maps, the regions where the task's objects exist to obtain candidate regions, and outputting candidate 2D boxes matching the candidate regions; extracting, from one feature map and according to the candidate regions obtained by the RPN module, the features of the regions where the candidate regions are located; refining the features of the candidate regions to obtain the confidence that each candidate region corresponds to each object category, the objects being those of the corresponding task; adjusting the coordinates of the candidate regions to obtain second candidate 2D boxes that match the actual objects better than the original candidate 2D boxes did; and selecting the 2D candidate boxes whose confidence exceeds a preset threshold as the 2D boxes of the candidate regions.

This application provides a computer-readable medium storing program code for execution by a device, the program code including content related to executing the object perception method of the embodiments shown in Figure 21, 22, 23, 24, or 25.

This application provides a computer-readable medium storing program code for execution by a device, the program code including content related to executing the training method of the embodiment shown in Figure 26.

This application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the content related to the perception method of the embodiments shown in Figure 21, 22, 23, 24, or 25.

This application provides a computer program product containing instructions that, when run on a computer, cause the computer to execute the content related to the training method of the embodiment shown in Figure 26.

This application provides a chip including a processor and a data interface; the processor reads instructions stored in a memory through the data interface and executes the content related to the perception method of the embodiments shown in Figure 21, 22, 23, 24, 25, or 26.

This application provides a chip including a processor and a data interface; the processor reads instructions stored in a memory through the data interface and executes the content related to the training method of the embodiment shown in Figure 26.

Optionally, as one implementation, the chip may further include a memory in which instructions are stored; the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the content related to the perception method of the embodiments shown in Figure 21, 22, 23, 24, or 25, or the content related to the training method of the embodiment shown in Figure 26.

It should be noted that, for brevity of description, the foregoing method embodiments are expressed as a series of action combinations; however, those skilled in the art should understand that the present invention is not limited by the described order of actions, since according to the present invention some steps may be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.

In the above embodiments, each embodiment is described with its own emphasis; for parts not detailed in one embodiment, refer to the relevant descriptions of the other embodiments.

In summary, the beneficial effects of the embodiments of this application are as follows:

(1) All perception tasks share the same backbone network, saving computation several-fold; the network structure is easy to extend, and 2D detection types can be extended simply by adding one or more Headers. Each parallel Header has independent RPN and RCNN modules and only needs to detect the objects of its own task, so that during training, unlabeled objects of other tasks are not mistakenly penalized. Moreover, with an independent RPN layer, dedicated Anchors can be customized for the scales and aspect ratios of each task's objects, improving the overlap between Anchors and objects and hence the recall of the RPN layer.

(2) The 3D, Mask, and Keypoint detection functions can be implemented in a flexible and convenient way. These functional extensions share the same backbone network as the 2D part and do not significantly increase computation. Implementing multiple functions with one network is also easy to realize on a chip.

(3) Each task uses an independent data set, and there is no need to label all tasks on the same picture, saving annotation cost. Task extension is flexible and simple: when adding a new task, only the data of the new task needs to be provided, and the new objects do not have to be labeled in the original data. The training data of different tasks is easy to balance, so every task receives an equal training opportunity and large data sets do not overwhelm small ones.

In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; the division of the units is only a logical functional division, and there may be other divisions in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical or in other forms.

The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable memory. Based on such an understanding, the technical solutions of the present invention essentially, or the part contributing to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a memory and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned memory includes any medium that can store program code, such as a USB flash drive, a ROM, a RAM, a removable hard disk, a magnetic disk, or an optical disc.

A person of ordinary skill in the art may understand that all or some of the steps in the methods of the foregoing embodiments may be completed by a program instructing relevant hardware. The program may be stored in a computer-readable memory, and the memory may include a flash disk, a ROM, a RAM, a magnetic disk, an optical disc, or the like.

The embodiments of this application are described in detail above. Specific examples are used herein to illustrate the principles and implementations of this application, and the descriptions of the foregoing embodiments are merely intended to help understand the method and core idea of this application. Meanwhile, a person of ordinary skill in the art may make modifications to the specific implementations and the application scope based on the idea of the present invention. In conclusion, the content of this specification shall not be construed as a limitation on the present invention.

Claims (17)

1. A perception network based on multiple headers (Headers), characterized in that the perception network comprises a backbone network and a plurality of parallel Headers, the plurality of parallel Headers being connected to the backbone network;
the backbone network is configured to receive an input picture, perform convolution processing on the input picture, and output feature maps with a plurality of different resolutions corresponding to the picture;
each of the plurality of parallel Headers is configured to detect a task object in one task based on the feature maps of different resolutions output by the backbone network, and to output 2D boxes of the regions where the task objects are located together with the confidence that each 2D box belongs to each object category; wherein each parallel Header detects task objects of different categories, the task object being an object of the category to be detected in that task; a higher confidence indicates a higher probability that the task object corresponding to the task exists in the 2D box corresponding to that confidence, and each object category is an object category in the task corresponding to the parallel Header;
the perception network further comprises one or more serial Headers, a serial Header being connected to one parallel Header, and the serial Header is configured to predict the 3D information, Mask information, or Keypoint information of the task object of the task to which it belongs based on the features of the region where the 2D box is located.
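For illustration only and not part of the claims: a minimal PyTorch-style sketch of the topology recited in claim 1. The backbone object, the header classes, and the `name` attribute of a serial header are assumed placeholders, not modules defined in this application:

    import torch.nn as nn

    class MultiHeaderPerceptionNet(nn.Module):
        """Sketch: one shared backbone, several parallel Headers (one per task),
        and optional serial Headers attached to a parallel Header."""
        def __init__(self, backbone, parallel_headers, serial_headers):
            super().__init__()
            self.backbone = backbone                     # shared by all tasks
            self.parallel_headers = nn.ModuleList(parallel_headers)
            self.serial_headers = serial_headers         # {header index: [serial, ...]}

        def forward(self, image):
            feature_maps = self.backbone(image)          # multi-resolution feature maps
            outputs = {}
            for i, header in enumerate(self.parallel_headers):
                boxes_2d, confidences = header(feature_maps)
                outputs[i] = {"boxes_2d": boxes_2d, "conf": confidences}
                for serial in self.serial_headers.get(i, []):
                    # 3D / Mask / Keypoint prediction from the 2D box regions
                    outputs[i][serial.name] = serial(feature_maps, boxes_2d)
            return outputs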
2. The perception network according to claim 1, characterized in that each parallel Header comprises a region proposal network (RPN) module, a region-of-interest extraction (ROI-ALIGN) module, and a region convolutional neural network (RCNN) module; the RPN module of each parallel Header is independent of the RPN modules of the other parallel Headers; the ROI-ALIGN module of each parallel Header is independent of the ROI-ALIGN modules of the other parallel Headers; and the RCNN module of each parallel Header is independent of the RCNN modules of the other parallel Headers, wherein, for each parallel Header:
the RPN module is configured to predict the region where the task object is located on one or more feature maps provided by the backbone network, and to output candidate 2D boxes matching the region;
the ROI-ALIGN module is configured to extract, based on the region predicted by the RPN module, the features of the region where a candidate 2D box is located from one feature map provided by the backbone network;
the RCNN module is configured to perform convolution processing, through a neural network, on the features of the region where the candidate 2D box is located to obtain the confidence that the candidate 2D box belongs to each object category, each object category being an object category in the task corresponding to the parallel Header; and to adjust the coordinates of the candidate 2D box through a neural network so that the adjusted 2D candidate box matches the shape of the actual object more closely than the candidate 2D box does, and to select an adjusted 2D candidate box whose confidence is greater than a preset threshold as the 2D box of the region.

3. The perception network according to claim 1 or 2, characterized in that the 2D box is a rectangular box.

4. The perception network according to claim 2, characterized in that the RPN module is configured to: based on a template box (Anchor) of the object corresponding to the task to which it belongs, predict, on one or more feature maps provided by the backbone network, the region where the task object exists to obtain a candidate region, and output a candidate 2D box matching the candidate region; wherein the template box is obtained based on statistical features of the task object to which it belongs, the statistical features including the shape and size of the object.

5. The perception network according to claim 1, characterized in that the serial Header is specifically configured to: using the 2D box of the task object provided by the parallel Header to which it is connected, extract the features of the region where the 2D box is located from one or more feature maps on the backbone network, and predict the 3D information, Mask information, or Keypoint information of the task object of the task based on the features of the region where the 2D box is located.
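Again for illustration only: a sketch of the per-Header pipeline recited in claim 2, assuming `rpn`, `roi_align`, and `rcnn` are callables operating on PyTorch tensors:

    def parallel_header_forward(rpn, roi_align, rcnn, feature_maps, score_thresh=0.5):
        """Sketch of one parallel Header: the RPN proposes candidate 2D boxes,
        ROI-ALIGN crops their features from one feature map, and the RCNN
        classifies the boxes and refines their coordinates."""
        candidate_boxes = rpn(feature_maps)                 # candidate 2D boxes
        roi_features = roi_align(feature_maps[0], candidate_boxes)
        class_conf, refined_boxes = rcnn(roi_features)      # confidences + adjusted boxes
        keep = class_conf.max(dim=1).values > score_thresh  # confidence above preset threshold
        return refined_boxes[keep], class_conf[keep]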
6. The perception network according to claim 4, characterized in that the RPN module is configured to predict the regions where objects of different sizes are located on feature maps of different resolutions.

7. The perception network according to claim 6, characterized in that the RPN module is configured to detect the regions where large objects are located on low-resolution feature maps and to detect the regions where small objects are located on high-resolution feature maps.

8. An object detection method, characterized in that it is applied to a perception network, the perception network comprising a backbone network and a plurality of parallel Headers, and further comprising one or more serial Headers, a serial Header being connected to one parallel Header; the method comprising:
receiving, by the backbone network, an input picture, performing convolution processing on the input picture, and outputting feature maps with a plurality of different resolutions corresponding to the picture;
detecting, by each parallel header (Header), a task object in one task based on the feature maps of different resolutions output by the backbone network, and outputting 2D boxes of the regions where the task objects are located together with the confidence that each 2D box belongs to each object category; wherein each parallel Header detects task objects of different categories, the task object being an object of the category to be detected in that task; a higher confidence indicates a higher probability that the task object corresponding to the task exists in the 2D box corresponding to that confidence, and each object category is an object category in the task corresponding to the parallel Header;
predicting, by the serial Header, the 3D information, Mask information, or Keypoint information of the task object of the task to which it belongs based on the features of the region where the 2D box is located.
9. The object detection method according to claim 8, characterized in that independently detecting, based on the feature maps, the task object in each task for different tasks, and outputting the 2D box of the region where each task object is located and the confidence corresponding to each 2D box, comprises:
predicting the region where the task object is located on one or more feature maps, and outputting candidate 2D boxes matching the region;
extracting, based on the region where the task object is located, the features of the region where a candidate 2D box is located from one feature map;
performing convolution processing on the features of the region where the candidate 2D box is located to obtain the confidence that the candidate 2D box belongs to each object category, each object category being an object category in the one task;
adjusting the coordinates of the candidate 2D box through a neural network so that the adjusted 2D candidate box matches the shape of the actual object more closely than the candidate 2D box does, and selecting an adjusted 2D candidate box whose confidence is greater than a preset threshold as the 2D box of the region.

10. The object detection method according to claim 9, characterized in that the 2D box is a rectangular box.

11. The object detection method according to claim 9, characterized in that predicting the region where the task object is located on one or more feature maps and outputting candidate 2D boxes matching the region comprises: based on a template box (Anchor) of the object corresponding to the task to which it belongs, predicting, on one or more feature maps provided by the backbone network, the region where the task object exists to obtain candidate regions, and outputting candidate 2D boxes matching the candidate regions; wherein the template box is obtained based on statistical features of the task object to which it belongs, the statistical features including the shape and size of the object.

12. The object detection method according to any one of claims 8 to 11, characterized in that the method further comprises: extracting, based on the 2D box of the task object of the task, the features of the region where the 2D box is located from one or more feature maps on the backbone network.

13. The object detection method according to claim 8, characterized in that the detection of the regions where large objects are located is performed on low-resolution feature maps, and the detection of the regions where small objects are located is performed on high-resolution feature maps.
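Claims 6, 7, and 13 assign objects of different sizes to feature maps of different resolutions. One common heuristic for such an assignment, taken from FPN-style detectors as an assumption rather than a formula recited in the claims, selects the pyramid level from the box area:

    import math

    def assign_fpn_level(box_w, box_h, k0=4, canonical=224, k_min=2, k_max=5):
        """Route a box to a pyramid level: large boxes go to coarse,
        low-resolution maps; small boxes go to fine, high-resolution maps."""
        k = k0 + math.log2(math.sqrt(box_w * box_h) / canonical)
        return int(min(max(round(k), k_min), k_max))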
14. A method for training a multi-task perception network based on partially annotated data, characterized in that it is applied to the perception network according to any one of claims 1 to 7, the method comprising:
determining, according to the annotated data type of each sample picture, the task to which each sample picture belongs; wherein each sample picture is annotated with one or more data types, the plurality of data types being a subset of all data types, and each of all the data types corresponding to one task;
determining, according to the task to which each sample picture belongs, the Header to be trained for each sample picture;
calculating the loss value of the Header to be trained for each sample picture;
for each sample picture, performing gradient back-propagation through the Header to be trained, and adjusting the parameters of the Header to be trained and of the backbone network based on the loss value, the remaining Headers not participating in the training.

15. The method for training a multi-task perception network according to claim 14, characterized in that, before calculating the loss value of the Header to be trained for each sample picture, the method further comprises: performing data balancing on sample pictures belonging to different tasks.

16. An apparatus for training a multi-task perception network based on partially annotated data, characterized in that it is applied to the perception network according to any one of claims 1 to 7, the apparatus comprising:
a task determination module, configured to determine, according to the annotated data type of each sample picture, the task to which each sample picture belongs; wherein each sample picture is annotated with one or more data types, the plurality of data types being a subset of all data types, and each of all the data types corresponding to one task;
a Header determination module, configured to determine, according to the task to which each sample picture belongs, the Header to be trained for each sample picture;
a loss value calculation module, configured to calculate, for each sample picture, the loss value of the Header determined by the Header determination module;
an adjustment module, configured to perform, for each sample picture, gradient back-propagation through the Header determined by the Header determination module, and to adjust the parameters of the Header to be trained and of the backbone network based on the loss value obtained by the loss value calculation module, the remaining Headers not participating in the training.
17. The apparatus for training a multi-task perception network based on partially annotated data according to claim 16, characterized in that the apparatus further comprises: a data balancing module, configured to perform data balancing on sample pictures belonging to different tasks.
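To illustrate the training of claims 14-16, a minimal sketch follows, assuming PyTorch-style modules and hypothetical helpers such as `header.loss`, `sample.image`, and `sample.labels`. Only the Header matching the sample's annotation type, together with the shared backbone, receives gradients; the remaining Headers do not participate:

    def train_step(model, optimizer, sample, annotation_type, type_to_task):
        """One training step on a partially annotated sample."""
        task = type_to_task[annotation_type]      # task the sample belongs to
        header = model.parallel_headers[task]     # the Header to be trained
        feature_maps = model.backbone(sample.image)
        loss = header.loss(header(feature_maps), sample.labels)
        optimizer.zero_grad()
        loss.backward()      # gradients flow only through this Header and the backbone
        optimizer.step()     # the other Headers receive no gradients and stay unchanged
        return loss.item()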
CN201910493331.6A 2019-06-06 2019-06-06 Object identification method and device Active CN110298262B (en)

Priority Applications (7)

Application Number Priority Date Filing Date Title
CN202410168653.4A CN118196828A (en) 2019-06-06 2019-06-06 Object identification method and device
CN201910493331.6A CN110298262B (en) 2019-06-06 2019-06-06 Object identification method and device
EP20817904.4A EP3916628B1 (en) 2019-06-06 2020-06-08 Object identification method and device
EP24204585.4A EP4542444A1 (en) 2019-06-06 2020-06-08 Object recognition method and apparatus
JP2021538658A JP7289918B2 (en) 2019-06-06 2020-06-08 Object recognition method and device
PCT/CN2020/094803 WO2020244653A1 (en) 2019-06-06 2020-06-08 Object identification method and device
US17/542,497 US20220165045A1 (en) 2019-06-06 2021-12-06 Object recognition method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910493331.6A CN110298262B (en) 2019-06-06 2019-06-06 Object identification method and device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202410168653.4A Division CN118196828A (en) 2019-06-06 2019-06-06 Object identification method and device

Publications (2)

Publication Number Publication Date
CN110298262A (en) 2019-10-01
CN110298262B (en) 2024-01-02

Family

ID=68027699

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202410168653.4A Pending CN118196828A (en) 2019-06-06 2019-06-06 Object identification method and device
CN201910493331.6A Active CN110298262B (en) 2019-06-06 2019-06-06 Object identification method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN202410168653.4A Pending CN118196828A (en) 2019-06-06 2019-06-06 Object identification method and device

Country Status (5)

Country Link
US (1) US20220165045A1 (en)
EP (2) EP3916628B1 (en)
JP (1) JP7289918B2 (en)
CN (2) CN118196828A (en)
WO (1) WO2020244653A1 (en)

Families Citing this family (33)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11922314B1 (en) * 2018-11-30 2024-03-05 Ansys, Inc. Systems and methods for building dynamic reduced order physical models
US11462112B2 (en) * 2019-03-07 2022-10-04 Nec Corporation Multi-task perception network with applications to scene understanding and advanced driver-assistance system
CN118196828A (en) * 2019-06-06 2024-06-14 华为技术有限公司 Object identification method and device
CN110675635B (en) * 2019-10-09 2021-08-03 北京百度网讯科技有限公司 Method and device for acquiring external parameters of camera, electronic equipment and storage medium
WO2021114031A1 (en) * 2019-12-09 2021-06-17 深圳市大疆创新科技有限公司 Target detection method and apparatus
CN112989900B (en) * 2019-12-13 2024-09-24 小米汽车科技有限公司 Method for accurately detecting traffic sign or marking
CN111291809B (en) * 2020-02-03 2024-04-12 华为技术有限公司 A processing device, method and storage medium
WO2021162692A1 (en) * 2020-02-12 2021-08-19 Hewlett-Packard Development Company, L.P. Multiple-task neural networks
CN111598000A (en) * 2020-05-18 2020-08-28 中移(杭州)信息技术有限公司 Multitask-based face recognition method, device, server and readable storage medium
CN113795847B (en) * 2020-07-21 2025-04-11 深圳市卓驭科技有限公司 3D box annotation method, device and computer readable storage medium
CN112434552B (en) * 2020-10-13 2024-12-06 广州视源电子科技股份有限公司 Neural network model adjustment method, device, equipment and storage medium
WO2022126523A1 (en) * 2020-12-17 2022-06-23 深圳市大疆创新科技有限公司 Object detection method, device, movable platform, and computer-readable storage medium
CN112614105B (en) * 2020-12-23 2022-08-23 东华大学 Depth network-based 3D point cloud welding spot defect detection method
CN112869829B (en) * 2021-02-25 2022-10-21 北京积水潭医院 Intelligent under-mirror carpal tunnel cutter
CN117172285A (en) * 2021-02-27 2023-12-05 华为技术有限公司 A perceptual network and data processing method
FR3121110A1 (en) * 2021-03-24 2022-09-30 Psa Automobiles Sa Method and system for controlling a plurality of driver assistance systems on board a vehicle
CN117157679A (en) * 2021-04-12 2023-12-01 华为技术有限公司 Perception network, training method of perception network, object recognition method and device
CN113191401A (en) * 2021-04-14 2021-07-30 中国海洋大学 Method and device for three-dimensional model recognition based on visual saliency sharing
CN113255445A (en) * 2021-04-20 2021-08-13 杭州飞步科技有限公司 Multitask model training and image processing method, device, equipment and storage medium
CN113762326B (en) * 2021-05-26 2025-02-25 腾讯云计算(北京)有限责任公司 A data identification method, device, equipment and readable storage medium
JP7670570B2 (en) * 2021-07-26 2025-04-30 トヨタ自動車株式会社 MODEL GENERATION METHOD, MODEL GENERATION DEVICE, MODEL GENERATION PROGRAM, MOBILE BODY ATTITUDE ESTIMATION METHOD, AND MOBILE BODY ATTITUDE ESTIMATION DEVICE
CN113657486B (en) * 2021-08-16 2023-11-07 浙江新再灵科技股份有限公司 Multi-label multi-attribute classification model building method based on elevator picture data
CN114387435A (en) * 2021-12-14 2022-04-22 阿里巴巴达摩院(杭州)科技有限公司 Image processing method, storage medium, and computer terminal
CN114332590B (en) * 2022-03-08 2022-06-17 北京百度网讯科技有限公司 Joint perception model training, joint perception method, apparatus, equipment and medium
CN114723966B (en) * 2022-03-30 2023-04-07 北京百度网讯科技有限公司 Multi-task recognition method, training method, device, electronic equipment and storage medium
CN114972182B (en) * 2022-04-15 2025-03-25 华为技术有限公司 Object detection method and device
CN114596624B (en) * 2022-04-20 2022-08-05 深圳市海清视讯科技有限公司 Human eye state detection method and device, electronic equipment and storage medium
CN114821269B (en) * 2022-05-10 2024-11-26 安徽蔚来智驾科技有限公司 Multi-task target detection method, device, autonomous driving system and storage medium
CN115122311B (en) * 2022-06-09 2024-10-18 成都卡诺普机器人技术股份有限公司 Training method and running state detection method for support vector data description model
JP7317246B1 (en) * 2022-08-02 2023-07-28 三菱電機株式会社 Reasoning device, reasoning method and reasoning program
CN115661784B (en) * 2022-10-12 2023-08-22 北京惠朗时代科技有限公司 Intelligent traffic-oriented traffic sign image big data identification method and system
CN116385949B (en) * 2023-03-23 2023-09-08 广州里工实业有限公司 Mobile robot region detection method, system, device and medium
CN116543163B (en) * 2023-05-15 2024-01-26 哈尔滨市科佳通用机电股份有限公司 Brake connecting pipe break fault detection method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108520229A (en) * 2018-04-04 2018-09-11 北京旷视科技有限公司 Image detecting method, device, electronic equipment and computer-readable medium
WO2019028725A1 (en) * 2017-08-10 2019-02-14 Intel Corporation Convolutional neural network framework using reverse connections and objectness priors for object detection
CN109712118A (en) * 2018-12-11 2019-05-03 武汉三江中电科技有限责任公司 A kind of substation isolating-switch detection recognition method based on Mask RCNN
CN109784194A (en) * 2018-12-20 2019-05-21 上海图森未来人工智能科技有限公司 Target detection network establishing method and training method, object detection method
CN109815922A (en) * 2019-01-29 2019-05-28 卡斯柯信号有限公司 Video recognition method of rail transit ground target based on artificial intelligence neural network

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170124409A1 (en) * 2015-11-04 2017-05-04 Nec Laboratories America, Inc. Cascaded neural network with scale dependent pooling for object detection
US10679351B2 (en) * 2017-08-18 2020-06-09 Samsung Electronics Co., Ltd. System and method for semantic segmentation of images
US10223610B1 (en) * 2017-10-15 2019-03-05 International Business Machines Corporation System and method for detection and classification of findings in images
CN108399362B (en) * 2018-01-24 2022-01-07 中山大学 Rapid pedestrian detection method and device
CN109145898A (en) * 2018-07-26 2019-01-04 清华大学深圳研究生院 A kind of object detecting method based on convolutional neural networks and iterator mechanism
CN109145979B (en) * 2018-08-15 2022-06-21 上海嵩恒网络科技股份有限公司 Sensitive image identification method and terminal system
CN109598186A (en) * 2018-10-12 2019-04-09 高新兴科技集团股份有限公司 A kind of pedestrian's attribute recognition approach based on multitask deep learning
CN118196828A (en) * 2019-06-06 2024-06-14 华为技术有限公司 Object identification method and device

Also Published As

Publication number Publication date
EP3916628A4 (en) 2022-07-13
CN118196828A (en) 2024-06-14
US20220165045A1 (en) 2022-05-26
EP3916628A1 (en) 2021-12-01
EP4542444A1 (en) 2025-04-23
CN110298262A (en) 2019-10-01
JP7289918B2 (en) 2023-06-12
WO2020244653A1 (en) 2020-12-10
EP3916628B1 (en) 2024-12-18
JP2022515895A (en) 2022-02-22

Similar Documents

Publication Publication Date Title
CN110298262B (en) Object identification method and device
CN110378381B (en) Object detection method, device and computer storage medium
WO2021218786A1 (en) Data processing system, object detection method and apparatus thereof
CN108509978B (en) Multi-class target detection method and model based on CNN (CNN) multi-level feature fusion
CN111401517B (en) Method and device for searching perceived network structure
Mendes et al. Exploiting fully convolutional neural networks for fast road detection
CN111368972B (en) Convolutional layer quantization method and device
WO2021147325A1 (en) Object detection method and apparatus, and storage medium
WO2022217434A1 (en) Cognitive network, method for training cognitive network, and object recognition method and apparatus
CN112446398A (en) Image classification method and device
CN111291809A (en) Processing device, method and storage medium
CN112529904B (en) Image semantic segmentation method, device, computer readable storage medium and chip
CN115375781A (en) Data processing method and device
CN115631344A (en) Target detection method based on feature adaptive aggregation
CN114972182B (en) Object detection method and device
CN110599521A (en) Method for generating trajectory prediction model of vulnerable road user and prediction method
CN114764856A (en) Image semantic segmentation method and image semantic segmentation device
EP4296896A1 (en) Perceptual network and data processing method
Yu et al. YOLOv5-Based Dense Small Target Detection Algorithm for Aerial Images Using DIOU-NMS.
Liu et al. High-precision real-time autonomous driving target detection based on YOLOv8
CN115731530A (en) Model training method and device
AU2019100967A4 (en) An environment perception system for unmanned driving vehicles based on deep learning
CN115272992B (en) Vehicle attitude estimation method
WO2023029704A1 (en) Data processing method, apparatus and system
Borade et al. Multi-class object detection system using hybrid convolutional neural network architecture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant