CN108268815B - Method and device for understanding image scene - Google Patents
- Publication number
- CN108268815B (granted from application CN201611254544A)
- Authority
- CN
- China
- Prior art keywords
- convolution
- result
- boundary
- processing
- results
- Prior art date
- 2016-12-30
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/25—Fusion techniques
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
An embodiment of the present invention provides a method for image scene understanding, including: acquiring an original image of a scene; performing a convolution operation on the original image to obtain a convolution output; processing the convolution output through a global convolutional network to obtain a processing result; and performing boundary refinement on the processing result to obtain an image scene understanding result. By using a global convolutional network, the embodiment of the present invention effectively enlarges the effective receptive field, and boundary refinement further increases the discriminability of boundaries, so that the overall performance of the system is effectively improved.
Description
Technical Field
The present invention relates to the field of video surveillance, and more particularly to a method and device for image scene understanding.
Background Art
The concept of deep learning originated in research on artificial neural networks. Deep learning combines low-level features to form more abstract high-level representations of attribute categories or features, in order to discover distributed feature representations of data. In computer vision and related fields, emerging deep learning methods have made great strides compared with earlier traditional methods. The convolutional neural network (CNN) is a machine learning model under deep supervised learning and the core operation of deep learning: it convolves a kernel with the input image to produce an output.
Scene understanding has important applications in the field of video surveillance. Traditional scene understanding systems are often implemented with a fully convolutional network (FCN), but this implementation does not take the limitations of the effective receptive field into account. In general, the theoretical receptive field of a fully convolutional network is the entire image, but the actual effective receptive field is often a limited region. Objects within this limited region can be understood well, whereas objects outside it suffer larger errors. For example, a small car can be segmented and understood well, but a large truck may, because of its size, be understood as a combination of different objects. Moreover, traditional methods badly misjudge object boundaries; for example, part of a person leaning against a car may be understood as the car body. It can be seen that existing methods degrade the overall performance of the system.
Summary of the Invention
The present invention has been made in view of the above problems. The present invention provides a method for image scene understanding that can improve the overall performance of the system.
According to a first aspect of the present invention, a method for image scene understanding is provided, comprising:
acquiring an original image of a scene;
performing a convolution operation on the original image to obtain a convolution output;
processing the convolution output through a global convolutional network to obtain a processing result; and
performing boundary refinement on the processing result to obtain an image scene understanding result.
Exemplarily, performing a convolution operation on the original image to obtain a convolution output includes:
performing a convolution operation on the original image through N convolutional neural networks to obtain N convolution outputs;
wherein the spatial dimensions of the N convolution outputs are pairwise unequal, the spatial dimensions of the N convolution outputs are all smaller than the spatial dimension of the original image, and N is a positive integer greater than 1.
Exemplarily, processing the convolution output through a global convolutional network to obtain a processing result includes:
processing the N convolution outputs respectively through N global convolutional networks to obtain N processing results in one-to-one correspondence with the N convolution outputs.
Exemplarily, processing the N convolution outputs respectively through the N global convolutional networks to obtain N processing results in one-to-one correspondence with the N convolution outputs includes:
performing multi-branch convolution on the i-th convolution output of the N convolution outputs, adding the outputs of the multi-branch convolution, and determining the result of the addition as the i-th processing result corresponding to the i-th convolution output;
wherein i ranges from 1 to N.
Exemplarily, performing boundary refinement on the processing result to obtain an image scene understanding result includes:
performing boundary refinement on the N processing results respectively to obtain N boundary refinement results in one-to-one correspondence with the N processing results; and
obtaining the image scene understanding result according to the N boundary refinement results.
Exemplarily, performing boundary refinement on the N processing results respectively to obtain N boundary refinement results in one-to-one correspondence with the N processing results includes:
correcting the i-th processing result of the N processing results, and combining the corrected result with the i-th processing result to obtain the i-th boundary refinement result;
wherein i ranges from 1 to N.
Exemplarily, obtaining the image scene understanding result according to the N boundary refinement results includes:
performing a fusion operation on the N boundary refinement results to obtain the image scene understanding result.
Exemplarily, before acquiring the original image, the method further includes:
training the global convolutional network.
According to a second aspect of the present invention, a device for image scene understanding is provided, comprising:
an acquisition module, configured to acquire an original image of a scene;
a convolution module, configured to perform a convolution operation on the original image acquired by the acquisition module to obtain a convolution output;
a global convolutional network module, configured to process, through a global convolutional network, the convolution output obtained by the convolution module to obtain a processing result; and
a boundary refinement module, configured to perform boundary refinement on the processing result obtained by the global convolutional network module to obtain an image scene understanding result.
Exemplarily, the convolution module is configured to:
perform a convolution operation on the original image through N convolutional neural networks to obtain N convolution outputs;
wherein the spatial dimensions of the N convolution outputs are pairwise unequal, the spatial dimensions of the N convolution outputs are all smaller than the spatial dimension of the original image, and N is a positive integer greater than 1.
Exemplarily, the global convolutional network module is configured to:
process the N convolution outputs respectively through N global convolutional networks to obtain N processing results in one-to-one correspondence with the N convolution outputs.
Exemplarily, the i-th global convolutional network of the N global convolutional networks is configured to:
perform multi-branch convolution on the i-th convolution output of the N convolution outputs, add the outputs of the multi-branch convolution, and determine the result of the addition as the i-th processing result corresponding to the i-th convolution output;
wherein i ranges from 1 to N.
Exemplarily, the boundary refinement module includes N boundary refinement submodules and a fusion submodule:
the N boundary refinement submodules are configured to perform boundary refinement on the N processing results respectively, to obtain N boundary refinement results in one-to-one correspondence with the N processing results; and
the fusion submodule is configured to obtain the image scene understanding result according to the N boundary refinement results.
Exemplarily, the i-th boundary refinement submodule of the N boundary refinement submodules is configured to:
correct the i-th processing result of the N processing results, and combine the corrected result with the i-th processing result to obtain the i-th boundary refinement result;
wherein i ranges from 1 to N.
Exemplarily, the fusion submodule is configured to:
perform a fusion operation on the N boundary refinement results to obtain the image scene understanding result.
Exemplarily, the device further includes a training module configured to train the global convolutional network.
The device of the second aspect can be used to implement the image scene understanding method of the foregoing first aspect.
According to a third aspect of the present invention, a computer chip is provided, the computer chip including a processor and a memory. The memory stores instruction code, and the processor is configured to execute the instruction code; when the processor executes the instruction code, the image scene understanding method of the foregoing first aspect can be implemented.
By using a global convolutional network, the embodiments of the present invention effectively enlarge the effective receptive field, and boundary refinement further increases the discriminability of boundaries, so that the overall performance of the system is effectively improved.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present invention will become more apparent from the following detailed description of the embodiments of the present invention in conjunction with the accompanying drawings. The drawings provide a further understanding of the embodiments of the present invention, constitute a part of the specification, and together with the embodiments serve to explain the present invention without limiting it. In the drawings, the same reference numerals generally denote the same components or steps.
FIG. 1 is a schematic block diagram of an electronic device according to an embodiment of the present invention;
FIG. 2 is a schematic flowchart of a method for image scene understanding according to an embodiment of the present invention;
FIG. 3 is a schematic flowchart of a method for image scene understanding according to an embodiment of the present invention;
FIG. 4 is another schematic flowchart of a method for image scene understanding according to an embodiment of the present invention;
FIG. 5 is a schematic flowchart of a global convolutional network processing method according to an embodiment of the present invention;
FIG. 6 is a schematic flowchart of a boundary refinement processing method according to an embodiment of the present invention;
FIG. 7 is a schematic block diagram of a device for image scene understanding according to an embodiment of the present invention; and
FIG. 8 is another schematic block diagram of a device for image scene understanding according to an embodiment of the present invention.
Detailed Description
To make the objects, technical solutions, and advantages of the present invention more apparent, exemplary embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present invention, and it should be understood that the present invention is not limited by the exemplary embodiments described herein. Based on the embodiments of the present invention described herein, all other embodiments obtained by those skilled in the art without creative effort shall fall within the protection scope of the present invention.
The embodiments of the present invention may be applied to an electronic device. FIG. 1 is a schematic block diagram of an electronic device according to an embodiment of the present invention. The electronic device 10 shown in FIG. 1 includes one or more processors 102, one or more storage devices 104, an input device 106, an output device 108, an image sensor 110, and one or more non-image sensors 114, which are interconnected through a bus system 112 and/or in other forms. It should be noted that the components and structure of the electronic device 10 shown in FIG. 1 are merely exemplary rather than limiting; the electronic device may also have other components and structures as required.
The processor 102 may include a central processing unit (CPU) 1021 and/or a graphics processing unit (GPU) 1022, or other forms of processing units having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 10 to perform desired functions.
The storage device 104 may include one or more computer program products, which may include various forms of computer-readable storage media, such as a volatile memory 1041 and/or a non-volatile memory 1042. The volatile memory 1041 may include, for example, a random access memory (RAM) and/or a cache. The non-volatile memory 1042 may include, for example, a read-only memory (ROM), a hard disk, a flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 102 may run the program instructions to implement various desired functions. Various applications and various data, such as data used and/or generated by the applications, may also be stored in the computer-readable storage medium.
The input device 106 may be a device used by a user to input instructions, and may include one or more of a keyboard, a mouse, a microphone, a touch screen, and the like.
The output device 108 may output various information (e.g., images or sounds) to the outside (e.g., a user), and may include one or more of a display, a speaker, and the like.
The image sensor 110 may capture images desired by the user (e.g., photos, videos, etc.) and store the captured images in the storage device 104 for use by other components.
Exemplarily, the electronic device 10 may be implemented as, for example, a smartphone, a tablet computer, or the image acquisition end of an access control system.
The electronic device in the embodiments of the present invention may be a device in the field of video surveillance, and the embodiments of the present invention do not limit its specific form.
FIG. 2 is a schematic flowchart of a method for image scene understanding according to an embodiment of the present invention. The method shown in FIG. 2 includes:
S101: acquiring an original image of a scene.
In one example, the original image containing the scene may be an image acquired in real time. For example, the original image may be a frame of a video captured by a surveillance camera or a picture captured by a still camera. In other examples, the original image containing the scene may also be an image from any source. Here, the original image containing the scene may be video data or picture data.
In the embodiments of the present invention, the spatial dimension of the original image may be expressed as w×h. As an example, it may be assumed that the spatial dimension of the original image is 512×512. The original image may also have a third dimension, for example 1 (indicating a grayscale image) or 3 (indicating an RGB color image). It should be understood that, depending on the nature of the image, the third dimension may also take other values, which is not limited by the present invention.
S102: performing a convolution operation on the original image to obtain a convolution output.
Exemplarily, a convolution operation may be performed directly on the original image to obtain a convolution output. However, since the dimensions of the original image are large, this operation places relatively high demands on hardware such as the processor.
Exemplarily, S102 may include: performing a convolution operation on the original image through N convolutional neural networks to obtain N convolution outputs, wherein the spatial dimensions (w×h) of the N convolution outputs are pairwise unequal, the spatial dimensions of the N convolution outputs are all smaller than the spatial dimension of the original image, and N is a positive integer greater than 1.
Generally, a convolution output takes the form of a tensor, so it can be understood that the dimensions of the convolution output are those of a tensor. Exemplarily, the dimensions of a convolution output may be expressed as w×h×c, where the third dimension c is the channel dimension.
For example, taking N=4, i.e., assuming that four convolution operations are performed on the original image through four convolutional neural networks to obtain four convolution outputs, as shown in FIG. 3, the four convolution operations are S1021, S1022, S1023, and S1024 in sequence, and the dimensions of the four convolution outputs are 128×128×256, 64×64×512, 32×32×1024, and 16×16×2048, respectively. It can be seen that the spatial dimensions (w×h) of the four convolution outputs (128×128, 64×64, 32×32, and 16×16, respectively) are all smaller than the spatial dimension of the original image (512×512). The third (c) dimensions of the convolution outputs (256, 512, 1024, and 2048 in this example) denote the numbers of channels of the convolution outputs.
Optionally, before S1021, a further convolution operation may be performed on the original image: for example, convolution operation 0 is performed first, as shown in S1020 in FIG. 4, with output dimensions of 256×256×64. In this way, the dimensions of the convolution outputs change gradually, which helps guarantee the processing accuracy of this stage. A sketch of this four-branch extraction follows.
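The description later notes (see S103) that these convolutional stages may be implemented with ResNet. Purely as an illustration — the framework, backbone, and module names below are assumptions, not specified by the patent — a minimal PyTorch sketch of the multi-branch extraction might look as follows; with a 512×512 input, the stages of a torchvision ResNet-50 happen to produce exactly the dimensions listed above:

```python
import torch
from torchvision.models import resnet50

class MultiScaleBackbone(torch.nn.Module):
    """Four-branch feature extraction (S1020-S1024), assuming a ResNet-50."""
    def __init__(self):
        super().__init__()
        # ImageNet pre-training, as the patent suggests; the argument name
        # for pretrained weights varies across torchvision versions.
        r = resnet50(pretrained=True)
        # "Convolution operation 0" (S1020): stem producing 256x256x64.
        self.stem = torch.nn.Sequential(r.conv1, r.bn1, r.relu)
        self.pool = r.maxpool
        # Convolution operations 1-4 (S1021-S1024).
        self.stage1, self.stage2 = r.layer1, r.layer2
        self.stage3, self.stage4 = r.layer3, r.layer4

    def forward(self, x):            # x: (B, 3, 512, 512)
        x = self.pool(self.stem(x))  # (B, 64, 128, 128)
        c1 = self.stage1(x)          # (B, 256, 128, 128)
        c2 = self.stage2(c1)         # (B, 512, 64, 64)
        c3 = self.stage3(c2)         # (B, 1024, 32, 32)
        c4 = self.stage4(c3)         # (B, 2048, 16, 16)
        return c1, c2, c3, c4
```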
S103: processing the convolution output through a global convolutional network (GCN) to obtain a processing result.
The global convolutional network (GCN) may also be referred to as a global convolutional neural network (GCNN), which is not limited by the present invention.
Optionally, adding a global convolutional network after the convolution operation of S102 can enlarge the effective receptive field. In a convolutional neural network, the receptive field may be defined as the region of the feature map output by each layer that is affected by a change in a pixel of the original image.
Exemplarily, if N convolution outputs are obtained in S102, S103 may include: processing the N convolution outputs respectively through N global convolutional networks to obtain N processing results in one-to-one correspondence with the N convolution outputs.
It can be seen that, for the N convolution outputs, the N processing results can be obtained in parallel.
For example, taking N=4, i.e., assuming four convolution operations as shown in FIG. 3, S103 may include four global convolutional network processing steps S1031-S1034 corresponding to the four convolution outputs. In each of the four resulting processing results, the w and h dimensions are the same as those of the corresponding convolution output, and the c dimension denotes the number of object classes that the global convolutional network can recognize.
As an example, the dimensions of the convolution output of convolution operation 1 (S1021) are 128×128×256. For the processing result of the corresponding GCN 1 step (S1031), the w dimension equals the w dimension of the corresponding convolution output, i.e., 128; the h dimension equals the h dimension of the corresponding convolution output, i.e., 128; and the c dimension equals the number of object classes that the global convolutional network can recognize, for example 21. Thus, the dimensions of the processing result of GCN 1 are 128×128×21. The dimensions of the processing results of GCN 2 through GCN 4 are determined similarly and are not listed one by one here. It should be understood that the value of the c dimension may be set according to the actual application and is not limited by the present invention.
Specifically, processing the N convolution outputs respectively through the N global convolutional networks to obtain N processing results in one-to-one correspondence with the N convolution outputs may include: performing multi-branch convolution on the i-th convolution output of the N convolution outputs, adding the outputs of the multi-branch convolution, and determining the result of the addition as the i-th processing result corresponding to the i-th convolution output, wherein i ranges from 1 to N.
It should be understood that each of the N global convolutional networks may include multiple convolution branches to perform the multi-branch convolution, each branch performing its corresponding convolution operations. The branches may run in parallel, and their number is not limited: for example, two or three branches may be used, and the number may be preset according to the required accuracy and the computing capability of the processor, which is not limited by the present invention. In addition, any branch may include at least one convolution operation, and any two branches may perform the same or different numbers of convolution operations. For example, with two branches, the first branch may include two convolution operations, and the second branch may include one or two convolution operations, and so on.
For example, suppose the multi-branch convolution has two branches, each including two convolution operations. Then, for any one of the GCN processing steps (any of GCN 1 through GCN 4), the specific process may be as shown in FIG. 5. Assuming that the dimensions of the convolution output received by GCN i (1≤i≤N) are w1×h1×c1, (a) and (b) are performed in parallel: (a) two convolution operations are performed in sequence, where the input c dimension of the first convolution operation is c1 and its output c dimension is 21, and the input and output c dimensions of the second convolution operation are both 21; in one embodiment, the first convolution operation may be performed along the horizontal direction and the second along the vertical direction; (b) two convolution operations with the same c dimensions as in (a) are performed in sequence; in one embodiment, the first convolution operation may be performed along the vertical direction and the second along the horizontal direction.
The results of (a) and (b) are then summed to obtain the processing result, whose dimensions are w1×h1×21: for example, 128×128×21 for GCN 1, 64×64×21 for GCN 2, 32×32×21 for GCN 3, and 16×16×21 for GCN 4.
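As an illustration of the two-branch structure of FIG. 5, a minimal sketch follows, assuming 1×k ("horizontal") and k×1 ("vertical") convolutions; the patent does not fix the kernel size, so k=15 here is purely an illustrative choice:

```python
import torch.nn as nn

class GCN(nn.Module):
    """Two-branch global convolution (FIG. 5): branch (a) is horizontal then
    vertical, branch (b) is vertical then horizontal, and their outputs are
    summed. The kernel size k is illustrative, not fixed by the patent."""
    def __init__(self, c_in, num_classes=21, k=15):
        super().__init__()
        p = k // 2  # padding that preserves the spatial dimensions w1 x h1
        # Branch (a): 1xk (horizontal) then kx1 (vertical); c1 -> 21 -> 21.
        self.a1 = nn.Conv2d(c_in, num_classes, (1, k), padding=(0, p))
        self.a2 = nn.Conv2d(num_classes, num_classes, (k, 1), padding=(p, 0))
        # Branch (b): kx1 (vertical) then 1xk (horizontal); c1 -> 21 -> 21.
        self.b1 = nn.Conv2d(c_in, num_classes, (k, 1), padding=(p, 0))
        self.b2 = nn.Conv2d(num_classes, num_classes, (1, k), padding=(0, p))

    def forward(self, x):  # x: (B, c1, w1, h1)
        # Sum of the two branches: (B, 21, w1, h1).
        return self.a2(self.a1(x)) + self.b2(self.b1(x))
```

A design note on this factorization: each branch effectively covers a k×k window while using only O(k) weights per channel pair, so a large k enlarges the effective receptive field at modest cost — consistent with the purpose stated above.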
It can be seen that the low-level network information learned from general images in S102 is very effective for scene understanding. The convolutional neural network in S102 may be implemented with ResNet, so that the characteristics of ResNet itself are used effectively, and the global convolutional network in S103 may also be implemented with ResNet, which can effectively enlarge the effective receptive field and thereby improve the performance of the system.
S104: performing boundary refinement on the processing result to obtain an image scene understanding result.
Optionally, a linear correction may be applied to the processing result, and the linearly corrected result may be combined with the processing result to obtain the image scene understanding result.
Exemplarily, if N processing results are obtained in S103, S104 may include:
performing boundary refinement on the N processing results respectively to obtain N boundary refinement results in one-to-one correspondence with the N processing results; and obtaining the image scene understanding result according to the N boundary refinement results.
It can be seen that, for the N processing results, the N boundary refinement results can be obtained in parallel.
For example, taking N=4, i.e., assuming four processing results as shown in FIG. 3, S104 may include four boundary refinement steps S1041-S1044 corresponding to the four processing results. Among the four resulting boundary refinement results, the dimensions of each boundary refinement result are the same as those of the corresponding processing result.
As an example, the dimensions of the boundary refinement result obtained by boundary refinement step S1041 (i.e., BR 1) are the same as those of the processing result of GCN 1, namely 128×128×21. The dimensions of the boundary refinement results of S1042-S1044 (i.e., BR 2 through BR 4) are determined similarly and are not listed one by one here.
Specifically, performing boundary refinement on the N processing results respectively to obtain N boundary refinement results in one-to-one correspondence with the N processing results may include: correcting the i-th processing result of the N processing results, and combining the corrected result with the i-th processing result to obtain the i-th boundary refinement result, wherein i ranges from 1 to N.
Optionally, the correction may be a linear correction or a nonlinear correction. The present invention does not limit the specific way in which the correction is performed.
For example, the correction may include, in sequence, a convolution operation, a rectified linear unit (ReLU) operation, and another convolution operation. The correction may then be as shown in S1040 in FIG. 6, comprising a first convolution operation, a ReLU operation, and a second convolution operation in sequence.
As shown in FIG. 6, assuming that the dimensions of the processing result are w1×h1×21, the dimensions of the boundary refinement result remain w1×h1×21. S1040 comprises the first convolution operation, the ReLU operation, and the second convolution operation in sequence, and the input and output c dimensions of each of these operations are all 21; the value of the c dimension denotes the number of recognized object classes. Exemplarily, the kernel size of the first and second convolution operations may be 3×3; it should be understood that other values, such as 5×5 or 7×7, may be used according to computational needs, which is not limited by the present invention.
The corrected result is combined with the corresponding processing result (a summation operation) to obtain the corresponding boundary refinement result, whose dimensions are equal to those of the corresponding processing result.
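The residual correction of FIG. 6 can be sketched directly from the description above; 3×3 kernels are used here per the example, though 5×5 or 7×7 would fit equally:

```python
import torch.nn as nn

class BoundaryRefinement(nn.Module):
    """Residual boundary refinement (S1040 in FIG. 6): the correction branch
    is conv -> ReLU -> conv (all c dimensions 21), and its output is summed
    with the input, so the dimensions w1 x h1 x 21 are unchanged."""
    def __init__(self, num_classes=21, k=3):
        super().__init__()
        self.correct = nn.Sequential(
            nn.Conv2d(num_classes, num_classes, k, padding=k // 2),  # first conv, 21 -> 21
            nn.ReLU(inplace=True),
            nn.Conv2d(num_classes, num_classes, k, padding=k // 2),  # second conv, 21 -> 21
        )

    def forward(self, x):           # x: (B, 21, w1, h1)
        return x + self.correct(x)  # summation with the processing result
```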
Specifically, obtaining the image scene understanding result according to the N boundary refinement results may include: performing a fusion operation on the N boundary refinement results to obtain the image scene understanding result.
As shown in FIG. 3, with N=4, S1045 includes performing a fusion operation on the four boundary refinement results.
Exemplarily, the fusion operation may include addition operations and upscaling operations. In the embodiments of the present invention, an upscaling operation may also be referred to as a deconvolution operation, which is not limited by the present invention.
Exemplarily, performing a fusion operation on the N boundary refinement results to obtain the image scene understanding result may include: performing a fusion operation on the N boundary refinement results to obtain a score map, and then obtaining the image scene understanding result based on the score map.
Since the dimensions of the N boundary refinement results are unequal, the fusion operation on the N boundary refinement results includes repeating the following step (a1) until only one result remains: (a1) upscale the lowest-dimension result to the next-lowest dimension, and add it to the result of the next-lowest dimension. The spatial dimension (w×h) of the single remaining result is then upscaled to the same spatial dimension (w×h) as the original image to obtain the score map, and the image scene understanding result is obtained based on the score map.
For example, suppose N=4 and the dimensions of the four boundary refinement results are 128×128×21, 64×64×21, 32×32×21, and 16×16×21, respectively. The fusion operation, as shown in FIG. 4, may include:
1. First, S1051 is performed to upscale the lowest-dimension result, 16×16×21 (i.e., the boundary refinement result of S1044), to the next-lowest dimension, 32×32×21; then S1052 is performed to add the upscaled result to the result of the next-lowest dimension (i.e., the boundary refinement result of S1043).
2. After step 1, the lowest dimension is 32×32×21. S1053 is performed to upscale the now-lowest-dimension result, 32×32×21 (i.e., the result of S1052), to the now next-lowest dimension, 64×64×21; then S1054 is performed to add the upscaled result to the result of the next-lowest dimension (i.e., the boundary refinement result of S1042).
3. After step 2, the lowest dimension is 64×64×21. S1055 is performed to upscale the now-lowest-dimension result, 64×64×21 (i.e., the result of S1054), to the now next-lowest dimension, 128×128×21; then S1056 is performed to add the upscaled result to the result of the next-lowest dimension (i.e., the boundary refinement result of S1041).
4. After step 3, the four boundary refinement results have been merged and only one result remains, namely the result of S1056 with dimensions 128×128×21. Since the spatial dimension of the original image is 512×512, the result of S1056 is further upscaled from 128×128×21 to 512×512×21 to obtain the score map.
Exemplarily, in step 4 above, upscaling the result from 128×128×21 to 512×512×21 may include S1057 and S1058 as shown in FIG. 4; that is, the upscaling may be performed in a single operation or step by step in multiple operations, which is not limited by the present invention.
In addition, exemplarily, a boundary refinement operation may also be included before or after each upscaling operation, which can increase the discriminability of boundaries.
Exemplarily, boundary refinement operations are included between steps 1 and 2, between steps 2 and 3, and between steps 3 and 4 above (i.e., after each addition operation and before the next upscaling operation), as in S10530, S10550, and S10570 in FIG. 4, and after each upscaling operation in step 4, as in S10580 and S10590 in FIG. 4. This increases the discriminability of boundaries, as the sketch below illustrates.
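Putting the steps together, the fusion path of FIG. 4 might be sketched as follows, reusing the GCN and BoundaryRefinement sketches above; transposed convolutions are assumed for the ×2 upscaling (deconvolution) steps, and all names are illustrative:

```python
import torch.nn as nn

def upx2(num_classes=21):
    # One x2 upscaling ("deconvolution") step; a transposed convolution is
    # assumed here, though any learned or fixed upsampling would also fit.
    return nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)

class FusionHead(nn.Module):
    """Fusion of FIG. 4: repeatedly upscale the lowest-resolution result, add
    it to the next one, refine the boundary, then upscale twice more (S1057,
    S1058) with refinement after each step to reach the 512x512 score map."""
    def __init__(self, num_classes=21):
        super().__init__()
        self.up = nn.ModuleList([upx2(num_classes) for _ in range(5)])
        self.br = nn.ModuleList([BoundaryRefinement(num_classes) for _ in range(5)])

    def forward(self, r1, r2, r3, r4):
        # r1..r4: refinement results at 128^2, 64^2, 32^2, 16^2 (all c=21).
        x = self.br[0](r3 + self.up[0](r4))  # S1051, S1052, S10530 -> 32x32
        x = self.br[1](r2 + self.up[1](x))   # S1053, S1054, S10550 -> 64x64
        x = self.br[2](r1 + self.up[2](x))   # S1055, S1056, S10570 -> 128x128
        x = self.br[3](self.up[3](x))        # S1057, S10580 -> 256x256
        x = self.br[4](self.up[4](x))        # S1058, S10590 -> 512x512 score map
        return x
```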
Exemplarily, in the embodiments of the present invention, before the method shown in FIG. 2 is performed, the method may further include: training the global convolutional network.
It should be noted that the embodiments of the present invention do not limit the training process or method. For example, M sample images may be acquired, and training may be performed based on the M sample images to obtain the parameters of the global convolutional network. For example, the N global convolutional networks may be trained simultaneously: M sample images with equal spatial dimensions (e.g., all 512×512) are acquired, and the N global convolutional networks are trained based on the M sample images to obtain their parameters, where M is a positive integer and the class of each object in the sample images is known.
Optionally, pre-training may be performed based on sample images obtained from the ImageNet dataset, and the global convolutional network may then be fine-tuned on datasets relevant to scene understanding. For example, the information learned from a large dataset can be used when training on a small dataset, and scene understanding is then performed on the basis of the network trained on the small dataset. This not only discards some useless information from the large dataset but also speeds up convergence, thereby improving processing efficiency.
It can be understood that, for the convolutional neural network in step S102 and the global convolutional network in step S103, the method may include: training the convolutional neural network and the global convolutional network, as sketched below.
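To make the pretrain-then-fine-tune scheme concrete, a minimal training-loop sketch follows; the loss, optimizer, and hyperparameters are illustrative assumptions, not prescribed by the patent:

```python
import torch
import torch.nn as nn

def fine_tune(model, loader, epochs=30, lr=1e-3):
    """Fine-tune the full network (ImageNet-pretrained backbone plus the
    GCN/BR/fusion heads) on a scene-understanding dataset with known
    per-pixel class labels. Hyperparameters are assumptions."""
    criterion = nn.CrossEntropyLoss()  # per-pixel classification over 21 classes
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for images, labels in loader:   # images: (B,3,512,512); labels: (B,512,512)
            optimizer.zero_grad()
            score_map = model(images)   # (B, 21, 512, 512)
            loss = criterion(score_map, labels)
            loss.backward()
            optimizer.step()
```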
It can be seen that the embodiments of the present invention combine a global convolutional network and boundary refinement: the global convolutional network effectively enlarges the effective receptive field, and boundary refinement increases the discriminability of object boundaries. Compared with traditional scene understanding algorithms, the method of the embodiments of the present invention can meet the performance requirements of scene understanding, so that the overall performance of the system is greatly improved.
FIG. 7 is a schematic block diagram of a device for image scene understanding according to an embodiment of the present invention. The device 70 shown in FIG. 7 includes an acquisition module 701, a convolution module 702, a global convolutional network module 703, and a boundary refinement module 704. These modules may respectively perform the steps/functions of the image scene understanding method described above in conjunction with FIGS. 2-6. Only the main functions of the modules of the device 70 are described below; details already described above are omitted.
The acquisition module 701 is configured to acquire an original image of a scene.
The convolution module 702 is configured to perform a convolution operation on the original image acquired by the acquisition module 701 to obtain a convolution output.
The global convolutional network module 703 is configured to process, through a global convolutional network, the convolution output obtained by the convolution module 702 to obtain a processing result.
The boundary refinement module 704 is configured to perform boundary refinement on the processing result obtained by the global convolutional network module 703 to obtain the image scene understanding result.
Exemplarily, the convolution module 702 may be specifically configured to: perform a convolution operation on the original image through N convolutional neural networks to obtain N convolution outputs, wherein the spatial dimensions of the N convolution outputs are pairwise unequal, the spatial dimensions of the N convolution outputs are all smaller than the spatial dimension of the original image, and N is a positive integer greater than 1.
Exemplarily, the global convolutional network module 703 may be specifically configured to: process the N convolution outputs respectively through N global convolutional networks to obtain N processing results in one-to-one correspondence with the N convolution outputs.
Exemplarily, the i-th global convolutional network of the N global convolutional networks may be configured to: perform multi-branch convolution on the i-th convolution output of the N convolution outputs, add the outputs of the multi-branch convolution, and determine the result of the addition as the i-th processing result corresponding to the i-th convolution output, wherein i ranges from 1 to N.
Exemplarily, the boundary refinement module 704 may include N boundary refinement submodules and a fusion submodule. The N boundary refinement submodules may be configured to perform boundary refinement on the N processing results respectively, to obtain N boundary refinement results in one-to-one correspondence with the N processing results. The fusion submodule may be configured to obtain the image scene understanding result according to the N boundary refinement results.
Exemplarily, the i-th boundary refinement submodule of the N boundary refinement submodules may be configured to: correct the i-th processing result of the N processing results, and combine the corrected result with the i-th processing result to obtain the i-th boundary refinement result, wherein i ranges from 1 to N.
Exemplarily, the fusion submodule may be configured to: perform a fusion operation on the N boundary refinement results to obtain the image scene understanding result. The fusion operation includes repeating the following step (a1) until only one result remains: (a1) upscale the lowest-dimension result to the next-lowest dimension, and add it to the result of the next-lowest dimension. The spatial dimension (w×h) of the single remaining result is then upscaled to the same spatial dimension (w×h) as the original image to obtain the score map, and the image scene understanding result is obtained based on the score map.
Exemplarily, as shown in FIG. 8, the device 70 further includes a training module 700, which may be configured to train the global convolutional network.
The device 70 shown in FIGS. 7 and 8 can be used to implement the image scene understanding method shown in the foregoing FIGS. 2-6.
In addition, an embodiment of the present invention further provides another device for image scene understanding, which may include a processor and a memory, wherein the memory is configured to store instruction code; when the processor executes the instruction code, the image scene understanding method shown in the foregoing FIGS. 2-6 can be implemented.
In addition, an embodiment of the present invention further provides an electronic device, which may include the device 70 shown in FIG. 7 or FIG. 8.
The embodiments of the present invention combine a global convolutional network module and a boundary refinement module: the global convolutional network module effectively enlarges the effective receptive field, and the boundary refinement module increases the discriminability of object boundaries. Compared with traditional scene understanding algorithms, the embodiments of the present invention can meet the performance requirements of scene understanding, so that the overall performance of the system is greatly improved.
Although exemplary embodiments have been described herein with reference to the accompanying drawings, it should be understood that the above exemplary embodiments are merely exemplary and are not intended to limit the scope of the present invention thereto. Those of ordinary skill in the art may make various changes and modifications therein without departing from the scope and spirit of the present invention. All such changes and modifications are intended to be included within the scope of the present invention as claimed in the appended claims.
Those of ordinary skill in the art may appreciate that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; the division of the units is only a logical functional division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another device, or some features may be omitted or not performed.
Numerous specific details are set forth in the description provided herein. However, it will be understood that the embodiments of the present invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the present invention and aid the understanding of one or more of the various inventive aspects, in the description of the exemplary embodiments of the present invention, various features of the present invention are sometimes grouped together into a single embodiment, figure, or description thereof. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the corresponding claims reflect, the inventive aspect lies in solving the corresponding technical problem with fewer than all features of a single disclosed embodiment. Thus, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the present invention.
Those skilled in the art will understand that all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination, except where such features are mutually exclusive. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the present invention and to form different embodiments. For example, in the claims, any one of the claimed embodiments may be used in any combination.
The various component embodiments of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art should understand that a microprocessor or a digital signal processor (DSP) may be used in practice to implement some or all of the functions of some modules of the device according to the embodiments of the present invention. The present invention may also be implemented as device programs (e.g., computer programs and computer program products) for performing part or all of the methods described herein. Such a program implementing the present invention may be stored on a computer-readable medium or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the present invention, and those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The present invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and the like does not denote any order; these words may be interpreted as names.
The above are merely specific embodiments of the present invention or descriptions thereof, and the protection scope of the present invention is not limited thereto. Any changes or substitutions readily conceivable by those skilled in the art within the technical scope disclosed by the present invention shall be covered by the protection scope of the present invention. The protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (16)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611254544.6A | 2016-12-30 | 2016-12-30 | Method and device for understanding image scene |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611254544.6A | 2016-12-30 | 2016-12-30 | Method and device for understanding image scene |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108268815A CN108268815A (en) | 2018-07-10 |
CN108268815B true CN108268815B (en) | 2020-12-25 |
Family
ID=62754441
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611254544.6A (Active) | Method and device for understanding image scene | 2016-12-30 | 2016-12-30 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108268815B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112052861B (en) * | 2019-06-05 | 2024-07-05 | 高新兴科技集团股份有限公司 | Method for calculating effective receptive field of deep convolutional neural network and storage medium |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105678248A (en) * | 2015-12-31 | 2016-06-15 | 上海科技大学 | Face key point alignment algorithm based on deep learning |
CN105787488A (en) * | 2016-03-02 | 2016-07-20 | 浙江宇视科技有限公司 | Image feature extraction method and device realizing transmission from whole to local |
CN106157307A (en) * | 2016-06-27 | 2016-11-23 | 浙江工商大学 | A kind of monocular image depth estimation method based on multiple dimensioned CNN and continuous CRF |
Non-Patent Citations (3)
Title |
---|
Gedas Bertasius et al., "Semantic Segmentation with Boundary Neural Fields," 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. * |
Guosheng Lin et al., "RefineNet: Multi-Path Refinement Networks for High-Resolution Semantic Segmentation," 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. * |
Jonathan Long et al., "Fully Convolutional Networks for Semantic Segmentation," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. * |
Also Published As
Publication number | Publication date |
---|---|
CN108268815A (en) | 2018-07-10 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20241213
Address after: No. 257, 2nd Floor, Building 9, No. 2 Huizhu Road, Liangjiang New District, Yubei District, Chongqing 401120
Patentee after: Force Map New (Chongqing) Technology Co.,Ltd. (China)
Address before: Room 313, Block A, Building 2, South Road, Haidian District Academy of Sciences, Beijing 100190
Patentee before: BEIJING KUANGSHI TECHNOLOGY Co.,Ltd.; MEGVII (BEIJING) TECHNOLOGY Co.,Ltd. (China)