
CN119032566A - Video encoding and decoding method, device, equipment, system and storage medium - Google Patents

Info

Publication number: CN119032566A
Application number: CN202280094956.5A
Authority: CN (China)
Prior art keywords: image, predicted, quantized, feature information
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 马展, 刘浩杰
Current Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee: Guangdong Oppo Mobile Telecommunications Corp Ltd
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 19/00: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N 19/42: Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

Embodiments of the present application provide a video encoding and decoding method, device, equipment, system and storage medium, aimed at improving the accuracy of the reconstructed image. The quantized first feature information is fused not only with the feature information of the reconstructed image immediately preceding the current image, but with the feature information of multiple reconstructed images before the current image. Thus, when some information in the immediately preceding reconstructed image is occluded, the occluded information can still be obtained from earlier reconstructed images, so that the generated mixed spatiotemporal representation contains more accurate, rich and detailed feature information. When motion compensation is performed on the preceding reconstructed image based on this mixed spatiotemporal representation, P high-precision predicted images can be generated, from which the reconstructed image of the current image can be accurately obtained, thereby improving the video compression effect.

Description

Video encoding and decoding method, device, equipment, system and storage medium

Technical Field

The present application relates to the field of video coding and decoding technology, and in particular to a video coding and decoding method, device, equipment, system and storage medium.

Background Art

Digital video technology can be incorporated into a variety of video devices, such as digital televisions, smartphones, computers, e-readers and video players. With the development of video technology, video data involves large amounts of data; to facilitate its transmission, video devices apply video compression technology so that video data can be transmitted or stored more efficiently.

With the rapid development of neural network technology, neural networks have been widely applied in video compression, for example in loop filtering, coding block partitioning and coding block prediction. However, current neural-network-based video compression delivers a poor compression effect.

Summary of the Invention

Embodiments of the present application provide a video encoding and decoding method, apparatus, device, system and storage medium to improve the video compression effect.

In a first aspect, the present application provides a video decoding method, comprising:

decoding a first bitstream to determine quantized first feature information, the first feature information being obtained by feature fusion of a current image and the reconstructed image preceding the current image;

performing multi-level temporal fusion on the quantized first feature information to obtain a mixed spatiotemporal representation;

performing motion compensation on the preceding reconstructed image according to the mixed spatiotemporal representation to obtain P predicted images of the current image, P being a positive integer;

determining a reconstructed image of the current image according to the P predicted images.

In a second aspect, an embodiment of the present application provides a video encoding method, comprising:

performing feature fusion on a current image and the reconstructed image preceding the current image to obtain first feature information;

quantizing the first feature information to obtain quantized first feature information;

encoding the quantized first feature information to obtain a first bitstream.

In a third aspect, the present application provides a video encoder for performing the method in the second aspect or any of its implementations. Specifically, the encoder comprises functional units for performing the method in the second aspect or any of its implementations.

In a fourth aspect, the present application provides a video decoder for performing the method in the first aspect or any of its implementations. Specifically, the decoder comprises functional units for performing the method in the first aspect or any of its implementations.

In a fifth aspect, a video encoder is provided, comprising a processor and a memory. The memory stores a computer program, and the processor calls and runs the computer program stored in the memory to perform the method in the second aspect or any of its implementations.

In a sixth aspect, a video decoder is provided, comprising a processor and a memory. The memory stores a computer program, and the processor calls and runs the computer program stored in the memory to perform the method in the first aspect or any of its implementations.

In a seventh aspect, a video encoding and decoding system is provided, comprising a video encoder and a video decoder. The video encoder performs the method in the second aspect or any of its implementations, and the video decoder performs the method in the first aspect or any of its implementations.

In an eighth aspect, a chip is provided for implementing the method in any one of the first to second aspects or any of their implementations. Specifically, the chip comprises a processor for calling and running a computer program from a memory, so that a device on which the chip is installed performs the method in any one of the first to second aspects or any of their implementations.

In a ninth aspect, a computer-readable storage medium is provided for storing a computer program, the computer program causing a computer to perform the method in any one of the first to second aspects or any of their implementations.

In a tenth aspect, a computer program product is provided, comprising computer program instructions that cause a computer to perform the method in any one of the first to second aspects or any of their implementations.

In an eleventh aspect, a computer program is provided which, when run on a computer, causes the computer to perform the method in any one of the first to second aspects or any of their implementations.

In a twelfth aspect, a bitstream is provided, comprising a bitstream generated by the method of the second aspect or any of its implementations.

Based on the above technical solutions, in order to improve the accuracy of the reconstructed image, the present application performs multi-level temporal fusion on the quantized first feature information: the quantized first feature information is fused not only with the feature information of the reconstructed image immediately preceding the current image, but also with the feature information of multiple reconstructed images before the current image. Thus, when some information in the immediately preceding reconstructed image is occluded, the occluded information can still be obtained from several earlier reconstructed images, so that the generated mixed spatiotemporal representation contains more accurate, rich and detailed feature information. When motion compensation is performed on the preceding reconstructed image based on this mixed spatiotemporal representation, P high-precision predicted images can be generated, from which the reconstructed image of the current image can be accurately obtained, thereby improving the video compression effect.

Brief Description of the Drawings

FIG. 1 is a schematic block diagram of a video encoding and decoding system according to an embodiment of the present application;

FIG. 2 is a schematic flowchart of a video decoding method provided in an embodiment of the present application;

FIG. 3 is a schematic diagram of the network structure of the inverse transform module involved in an embodiment of the present application;

FIG. 4 is a schematic diagram of the network structure of the recursive aggregation module involved in an embodiment of the present application;

FIG. 5 is a schematic diagram of the network structure of the first decoder involved in an embodiment of the present application;

FIG. 6 is a schematic diagram of the network structure of the second decoder involved in an embodiment of the present application;

FIG. 7 is a schematic diagram of the network structure of the third decoder involved in an embodiment of the present application;

FIG. 8 is a schematic diagram of the network structure of the fourth decoder involved in an embodiment of the present application;

FIG. 9 is a schematic diagram of the network structure of a neural-network-based decoder according to an embodiment of the present application;

FIG. 10 is a schematic diagram of a video decoding process provided by an embodiment of the present application;

FIG. 11 is a schematic flowchart of a video encoding method provided in an embodiment of the present application;

FIG. 12 is a schematic diagram of the network structure of a neural-network-based encoder according to an embodiment of the present application;

FIG. 13 is a schematic diagram of a video encoding process provided by an embodiment of the present application;

FIG. 14 is a schematic block diagram of a video decoding apparatus provided in an embodiment of the present application;

FIG. 15 is a schematic block diagram of a video encoding apparatus provided in an embodiment of the present application;

FIG. 16 is a schematic block diagram of an electronic device provided in an embodiment of the present application;

FIG. 17 is a schematic block diagram of a video encoding system provided in an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application are described below with reference to the drawings of the embodiments of the present application.

The present application can be applied to the fields of image coding and decoding, video coding and decoding, hardware video coding and decoding, dedicated-circuit video coding and decoding, real-time video coding and decoding, and so on. Alternatively, the solutions of the present application may operate in conjunction with other proprietary or industry standards, including ITU-T H.261, ISO/IEC MPEG-1 Visual, ITU-T H.262 or ISO/IEC MPEG-2 Visual, ITU-T H.263, ISO/IEC MPEG-4 Visual, and ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), including its Scalable Video Coding (SVC) and Multi-view Video Coding (MVC) extensions. It should be understood that the techniques of the present application are not limited to any specific coding standard or technology.

For ease of understanding, the video encoding and decoding system involved in the embodiments of the present application is first introduced with reference to FIG. 1.

FIG. 1 is a schematic block diagram of a video encoding and decoding system involved in an embodiment of the present application. It should be noted that FIG. 1 is only an example, and the video encoding and decoding system of the embodiments of the present application includes but is not limited to what is shown in FIG. 1. As shown in FIG. 1, the video encoding and decoding system 100 comprises an encoding device 110 and a decoding device 120. The encoding device encodes (which can be understood as compressing) video data to generate a bitstream and transmits the bitstream to the decoding device. The decoding device decodes the bitstream generated by the encoding device to obtain decoded video data.

The encoding device 110 of the embodiments of the present application can be understood as a device with a video encoding function, and the decoding device 120 as a device with a video decoding function; that is, the encoding device 110 and the decoding device 120 cover a broad range of apparatuses, including smartphones, desktop computers, mobile computing devices, notebook (e.g., laptop) computers, tablet computers, set-top boxes, televisions, cameras, display devices, digital media players, video game consoles, vehicle-mounted computers, and the like.

In some embodiments, the encoding device 110 may transmit the encoded video data (e.g., a bitstream) to the decoding device 120 via a channel 130. The channel 130 may include one or more media and/or devices capable of transmitting the encoded video data from the encoding device 110 to the decoding device 120.

In one example, the channel 130 includes one or more communication media that enable the encoding device 110 to transmit the encoded video data directly to the decoding device 120 in real time. In this example, the encoding device 110 may modulate the encoded video data according to a communication standard and transmit the modulated video data to the decoding device 120. The communication media include wireless communication media such as the radio frequency spectrum and, optionally, wired communication media such as one or more physical transmission lines.

In another example, the channel 130 includes a storage medium that can store the video data encoded by the encoding device 110. Storage media include a variety of locally accessible data storage media such as optical discs, DVDs and flash memory. In this example, the decoding device 120 can obtain the encoded video data from the storage medium.

In another example, the channel 130 may include a storage server that can store the video data encoded by the encoding device 110. In this example, the decoding device 120 may download the stored encoded video data from the storage server. Optionally, the storage server may store the encoded video data and transmit it to the decoding device 120; it may be, for example, a web server (e.g., for a website) or a file transfer protocol (FTP) server.

In some embodiments, the encoding device 110 includes a video encoder 112 and an output interface 113. The output interface 113 may include a modulator/demodulator (modem) and/or a transmitter.

In some embodiments, the encoding device 110 may further include a video source 111 in addition to the video encoder 112 and the output interface 113.

The video source 111 may include at least one of a video capture device (e.g., a video camera), a video archive, a video input interface, and a computer graphics system, where the video input interface is used to receive video data from a video content provider and the computer graphics system is used to generate video data.

The video encoder 112 encodes the video data from the video source 111 to generate a bitstream. The video data may include one or more pictures or a sequence of pictures. The bitstream contains the coding information of a picture or sequence of pictures in the form of a bit stream. The coding information may include encoded picture data and associated data. The associated data may include a sequence parameter set (SPS), a picture parameter set (PPS) and other syntax structures. An SPS may contain parameters applied to one or more sequences. A PPS may contain parameters applied to one or more pictures. A syntax structure is a set of zero or more syntax elements arranged in a specified order in the bitstream.

The video encoder 112 transmits the encoded video data directly to the decoding device 120 via the output interface 113. The encoded video data may also be stored in a storage medium or on a storage server for subsequent reading by the decoding device 120.

In some embodiments, the decoding device 120 includes an input interface 121 and a video decoder 122.

In some embodiments, the decoding device 120 may further include a display device 123 in addition to the input interface 121 and the video decoder 122.

The input interface 121 includes a receiver and/or a modem. The input interface 121 can receive the encoded video data through the channel 130.

The video decoder 122 decodes the encoded video data to obtain decoded video data, and transmits the decoded video data to the display device 123.

The display device 123 displays the decoded video data. The display device 123 may be integrated with the decoding device 120 or external to it. The display device 123 may comprise a variety of display devices, such as a liquid crystal display (LCD), a plasma display, an organic light-emitting diode (OLED) display, or other types of display devices.

In addition, FIG. 1 is only an example; the technical solutions of the embodiments of the present application are not limited to FIG. 1. For example, the techniques of the present application can also be applied to one-sided video encoding or one-sided video decoding.

In some embodiments, the video encoder 112 may operate on image data in a luminance-chrominance (YCbCr, YUV) format, with a YUV ratio of 4:2:0, 4:2:2 or 4:4:4, where Y denotes luminance (Luma), Cb (U) denotes blue chrominance, Cr (V) denotes red chrominance, and U and V together denote chrominance (Chroma), describing color and saturation. In these color formats, 4:2:0 means 4 luminance components and 2 chrominance components per 4 pixels (YYYYCbCr), 4:2:2 means 4 luminance components and 4 chrominance components per 4 pixels (YYYYCbCrCbCr), and 4:4:4 means full-resolution chrominance (YYYYCbCrCbCrCbCrCbCr).
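
By way of illustration of the data volumes these formats imply, the following Python sketch counts the samples per frame for each subsampling ratio; the function name and the 1920x1080 example are illustrative only and not part of the application.

def yuv_sample_count(width: int, height: int, fmt: str) -> int:
    """Total samples per frame for common chroma subsampling formats."""
    luma = width * height
    if fmt == "4:2:0":    # chroma subsampled 2x horizontally and vertically
        chroma = 2 * (width // 2) * (height // 2)
    elif fmt == "4:2:2":  # chroma subsampled 2x horizontally only
        chroma = 2 * (width // 2) * height
    elif fmt == "4:4:4":  # no chroma subsampling
        chroma = 2 * width * height
    else:
        raise ValueError(fmt)
    return luma + chroma

print(yuv_sample_count(1920, 1080, "4:2:0"))  # 3110400, i.e. 1.5x the luma-only count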

Since adjacent pixels within a video frame are strongly correlated, intra prediction is used in video coding and decoding to remove spatial redundancy between adjacent pixels. Since adjacent frames of a video are strongly similar, inter prediction is used to remove temporal redundancy between adjacent frames, thereby improving coding efficiency.

The embodiments of the present application can be applied to inter coding to improve its efficiency.

Video coding technology mainly encodes serialized video data and is chiefly used for data storage, transmission and presentation in the Internet era. Video currently accounts for more than 85% of traffic and access, and as user demands on resolution, frame rate and dimensionality grow, the role and value carried by video coding technology will increase substantially; improving it presents both huge opportunities and challenges. Traditional video coding has evolved over decades and has served the world's video services well in every era. It has been iteratively updated within a hybrid coding framework based on multi-scale block-level processing and is still in use today; with the rapid development of hardware, improvements to its sub-techniques have brought large coding gains at the cost of some complexity. However, this strategy of trading complexity for performance is increasingly limited by hardware bottlenecks and places higher demands on hardware design and upgrades, so commercially deployed traditional codecs usually have to be simplified to some extent.

At the same time, deep learning, and deep neural networks in particular, has matured and is widely studied and used in many video tasks, including video enhancement, video detection and video segmentation. Deep learning applied to video coding initially focused on studying and replacing sub-techniques of traditional video coding: the relevant modules of a traditional codec are studied, the original coding framework is used as a data-generation tool to obtain paired training data, the corresponding neural network is trained, and once it converges it replaces the corresponding module, such as in-loop filtering, out-of-loop filtering, coding block partitioning or coding block prediction. However, current neural-network-based video compression still delivers a poor compression effect.

To further improve video compression, the present application proposes a purely data-driven neural network coding framework: the entire codec system is designed and trained on the basis of deep neural networks and is ultimately used for video coding, and a new hybrid lossy motion representation is adopted to realize neural-network-based inter coding and decoding.

The technical solutions provided by the embodiments of the present application are described in detail below with reference to specific embodiments.

First, with reference to FIG. 2, the decoding side is taken as an example.

FIG. 2 is a schematic flowchart of a video decoding method provided by an embodiment of the present application; the embodiment is applied to the video decoder shown in FIG. 1. As shown in FIG. 2, the method of this embodiment includes:

S201. Decode a first bitstream to determine quantized first feature information.

The first feature information is obtained by feature fusion of the current image and the reconstructed image preceding the current image.

An embodiment of the present application proposes a neural-network-based decoder, which is obtained through end-to-end training together with a neural-network-based encoder.

In the embodiments of the present application, the reconstructed image preceding the current image can be understood as the frame immediately before the current image in the video sequence, which has already been decoded and reconstructed.

Since the current image and its preceding reconstructed image are two adjacent frames with strong similarity, the encoding side performs feature fusion on the current image and the preceding reconstructed image during encoding to obtain the first feature information. For example, the encoding side concatenates the current image with the preceding reconstructed image and performs feature extraction on the concatenated result to obtain the first feature information; illustratively, this is done by a feature extraction module, whose specific network structure is not limited by the present application. The first feature information obtained above is of floating-point type, for example represented as 32-bit floating-point numbers. Further, to reduce the coding cost, the encoding side quantizes it to obtain the quantized first feature information, and then encodes the quantized first feature information, for example by arithmetic coding, to obtain the first bitstream. After obtaining the first bitstream, the decoding side decodes it to obtain the quantized first feature information, and derives the reconstructed image of the current image from it.
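
As a rough illustration of the encoder-side flow just described (concatenation, feature extraction, quantization), the following PyTorch sketch may help. The layer shapes, channel counts and the module name FusionEncoder are assumptions made for illustration, since the application does not fix the network structure, and the subsequent arithmetic coding stage is omitted.

import torch
import torch.nn as nn

class FusionEncoder(nn.Module):
    # Hypothetical feature-extraction module; the application does not fix its structure.
    def __init__(self, channels: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            # 6 input channels: the current image and the preceding reconstruction, 3 + 3
            nn.Conv2d(6, channels, kernel_size=5, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=5, stride=2, padding=2),
        )

    def forward(self, x_t, x_prev_rec):
        fused = torch.cat([x_t, x_prev_rec], dim=1)  # feature fusion by concatenation
        y_t = self.net(fused)                        # first feature information (float)
        y_hat = torch.round(y_t)                     # quantization before arithmetic coding
        return y_hat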

In the embodiments of the present application, the ways in which the decoding side decodes the first bitstream in S201 to determine the quantized first feature information include, but are not limited to, the following.

Mode one: if the encoding side directly uses the probability distribution of the quantized first feature information to encode it and obtain the first bitstream, then correspondingly the decoding side directly decodes the first bitstream to obtain the quantized first feature information.

The quantized first feature information contains a large amount of redundancy, so encoding it directly requires many codewords and is expensive. To reduce the coding cost, in some embodiments the encoding side performs a feature transform on the first feature information to obtain second feature information, quantizes and encodes the second feature information to obtain a second bitstream, decodes the second bitstream to obtain the quantized second feature information, determines the probability distribution of the quantized first feature information from it, and then encodes the quantized first feature information with that distribution to obtain the first bitstream. In other words, to reduce the coding cost, the encoding side determines hyper-prior feature information of the first feature information, namely the second feature information, and determines the probability distribution of the quantized first feature information based on it. Since the second feature information is the hyper-prior of the first feature information and contains less redundancy, determining the probability distribution of the quantized first feature information from this low-redundancy second feature information and encoding the first feature information with that distribution reduces the coding cost of the first feature information.

Based on the above description, the decoding side can determine the quantized first feature information through the steps of the following mode two.

Mode two: S201 includes the following steps S201-A to S201-C.

S201-A. Decode a second bitstream to obtain quantized second feature information.

The second feature information is obtained by performing a feature transform on the first feature information.

As described above, to reduce the coding cost the encoding side applies a feature transform to the first feature information to obtain its hyper-prior feature information, namely the second feature information, uses it to determine the probability distribution of the quantized first feature information, and encodes the quantized first feature information with that distribution to obtain the first bitstream. Meanwhile, so that the decoding side can decode the first bitstream with the same probability distribution used for encoding, the second feature information is encoded to obtain the second bitstream. That is, in mode two the encoding side generates two bitstreams, the first bitstream and the second bitstream.

After obtaining the first and second bitstreams, the decoding side first decodes the second bitstream to determine the probability distribution of the quantized first feature information; specifically, it decodes the second bitstream to obtain the quantized second feature information and determines the probability distribution of the quantized first feature information from it. The decoding side then decodes the first bitstream with this distribution to obtain the quantized first feature information, achieving accurate decoding of the first feature information.
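
The decoding order of mode two can be summarized in the following sketch. The helper names (decode_hyper, inverse_transform, build_distribution, arithmetic_decode) are hypothetical placeholders for the modules described in the following steps, not interfaces defined by the application.

def decode_frame_features(first_bitstream, second_bitstream, codec):
    # 1) The second (hyper-prior) bitstream is decoded first.
    z_hat = codec.decode_hyper(second_bitstream)
    # 2) An inverse transform of z_hat yields the reconstructed feature information.
    recon_feat = codec.inverse_transform(z_hat)
    # 3) A probability model for the quantized first feature information is predicted from it.
    dist = codec.build_distribution(recon_feat)
    # 4) The same distribution the encoder used drives arithmetic decoding of the first bitstream.
    y_hat = codec.arithmetic_decode(first_bitstream, dist)
    return y_hat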

In the present application, since the second feature information is the hyper-prior of the first feature information and contains little redundancy, the encoding side can directly encode the quantized second feature information using its own probability distribution to obtain the second bitstream. Correspondingly, the decoding side directly decodes the second bitstream to obtain the quantized second feature information.

S201-B. Determine the probability distribution of the quantized first feature information according to the quantized second feature information.

After determining the quantized second feature information through the above steps, the decoding side determines the probability distribution of the quantized first feature information from it.

The embodiments of the present application do not limit the specific way in which, in S201-B, the probability distribution of the quantized first feature information is determined from the quantized second feature information.

In some embodiments, since the second feature information is obtained by a feature transform of the first feature information, S201-B includes the following steps S201-B1 to S201-B3.

S201-B1. Apply an inverse transform to the quantized second feature information to obtain reconstructed feature information.

In this implementation, the decoding side applies an inverse transform to the quantized second feature information to obtain reconstructed feature information, where the inverse transform can be understood as the inverse of the transform used by the encoding side. For example, if the encoding side performs N feature extractions on the first feature information to obtain the second feature information, the decoding side correspondingly performs N inverse feature extractions on the quantized second feature information to obtain the inversely transformed feature information, recorded as the reconstructed feature information.

The embodiments of the present application do not limit the inverse transform used by the decoding side.

In some embodiments, the inverse transform includes N feature extractions; that is, the decoding side performs N feature extractions on the quantized second feature information to obtain the reconstructed feature information.

In some embodiments, the inverse transform includes N feature extractions and N upsamplings; that is, the decoding side performs N feature extractions and N upsamplings on the quantized second feature information to obtain the reconstructed feature information.

The embodiments of the present application do not limit the execution order of the N feature extractions and N upsamplings.

In one example, the decoding side may first perform N consecutive feature extractions on the quantized second feature information and then N consecutive upsamplings.

In another example, the N feature extractions and N upsamplings are interleaved, i.e., each feature extraction is followed by an upsampling. For example, with N = 2, the decoding side inversely transforms the quantized second feature information as follows: the quantized second feature information is fed into the first feature extraction module for the first feature extraction to obtain feature information 1; feature information 1 is upsampled to obtain feature information 2; feature information 2 is fed into the second feature extraction module for the second feature extraction to obtain feature information 3; feature information 3 is upsampled to obtain feature information 4, which is recorded as the reconstructed feature information.

It should be noted that the embodiments of the present application do not limit the N feature extraction methods used by the decoding side; they include, for example, at least one of multi-layer convolution, residual connections, dense connections, and the like.

In some embodiments, the decoding side performs feature extraction by means of non-local attention, in which case S201-B1 includes the following step S201-B11.

S201-B11. Perform N non-local attention transforms and N upsamplings on the quantized second feature information to obtain the reconstructed feature information, N being a positive integer.

Since non-local attention enables more efficient feature extraction, retains more information in the extracted features and is computationally efficient, in the embodiments of the present application the decoding side uses non-local attention to extract features from the quantized second feature information quickly and accurately. In addition, since the encoding side performed N downsamplings when generating the second feature information from the first feature information, the decoding side correspondingly performs N upsamplings so that the reconstructed feature information has the same size as the first feature information.

In some embodiments, as shown in FIG. 3, the decoding side obtains the reconstructed feature information through an inverse transform module comprising N non-local attention modules and N upsampling modules, where a non-local attention module implements the non-local attention transform and an upsampling module implements the upsampling; illustratively, each non-local attention module is followed by an upsampling module. In practice, the decoding side feeds the decoded quantized second feature information into the inverse transform module: the first non-local attention module applies a non-local attention transform to it to obtain feature information 1, which the first upsampling module upsamples to obtain feature information 2; the second non-local attention module then produces feature information 3, which the second upsampling module upsamples to obtain feature information 4; and so on, until the feature information output by the N-th upsampling module is obtained and determined as the reconstructed feature information.
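
A minimal PyTorch sketch of such an inverse transform module is given below, assuming a simplified non-local attention block in the style of Wang et al. and nearest-neighbor 2x upsampling; the exact attention structure and channel counts are assumptions, as the application does not limit them.

import torch
import torch.nn as nn
import torch.nn.functional as F

class NonLocalBlock(nn.Module):
    """Simplified non-local attention; a stand-in for the patent's module."""
    def __init__(self, c):
        super().__init__()
        self.theta = nn.Conv2d(c, c // 2, 1)
        self.phi = nn.Conv2d(c, c // 2, 1)
        self.g = nn.Conv2d(c, c // 2, 1)
        self.out = nn.Conv2d(c // 2, c, 1)

    def forward(self, x):
        b, c, h, w = x.shape
        q = self.theta(x).flatten(2).transpose(1, 2)   # (b, hw, c/2)
        k = self.phi(x).flatten(2)                     # (b, c/2, hw)
        v = self.g(x).flatten(2).transpose(1, 2)       # (b, hw, c/2)
        attn = torch.softmax(q @ k, dim=-1)            # pairwise affinities over all positions
        y = (attn @ v).transpose(1, 2).reshape(b, c // 2, h, w)
        return x + self.out(y)                         # residual connection

class InverseTransform(nn.Module):
    """N alternating (non-local attention, 2x upsampling) stages, as in FIG. 3."""
    def __init__(self, c, n_stages=2):
        super().__init__()
        self.stages = nn.ModuleList([NonLocalBlock(c) for _ in range(n_stages)])

    def forward(self, z_hat):
        feat = z_hat
        for block in self.stages:
            feat = block(feat)                                          # feature extraction
            feat = F.interpolate(feat, scale_factor=2, mode="nearest")  # upsampling
        return feat  # reconstructed feature information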

S201-B2. Determine the probability distribution of the reconstructed feature information.

As described above, the quantized second feature information is obtained by transforming the first feature information, and through the above steps the decoding side inversely transforms it to obtain the reconstructed feature information. The reconstructed feature information can therefore be understood as reconstructed information of the first feature information; that is, its probability distribution is similar or related to that of the quantized first feature information. The decoding side can thus first determine the probability distribution of the reconstructed feature information and then use it to predict the probability distribution of the quantized first feature information.

In some embodiments, the probability distribution of the reconstructed feature information is a normal (Gaussian) distribution. In this case it is determined as follows: from the values in the reconstructed feature information, determine its mean and variance matrix, and from the mean and variance matrix generate the Gaussian distribution of the reconstructed feature information.
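
One common realization, assumed here for illustration since the application only states that a mean and variance matrix are derived, is to predict a per-element mean and standard deviation from the reconstructed feature information with small convolutions and form an element-wise Gaussian:

import torch
import torch.nn as nn

class GaussianParamHead(nn.Module):
    # Hypothetical parameter head: maps reconstructed feature information to the
    # mean and standard deviation of an element-wise Gaussian model.
    def __init__(self, c):
        super().__init__()
        self.mean = nn.Conv2d(c, c, 1)
        self.log_scale = nn.Conv2d(c, c, 1)

    def forward(self, recon_feat):
        mu = self.mean(recon_feat)
        sigma = torch.exp(self.log_scale(recon_feat)).clamp(min=1e-6)
        return torch.distributions.Normal(mu, sigma)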

S201-B3. Predict the probability distribution of the quantized first feature information from the probability distribution of the reconstructed feature information.

Since the reconstructed feature information is reconstructed information of the first feature information, and their probability distributions are similar or related, the embodiments of the present application can accurately predict the probability distribution of the quantized first feature information from the probability distribution of the reconstructed feature information.

The embodiments of the present application do not limit the specific implementation of S201-B3.

In one possible implementation, the probability distribution of the reconstructed feature information is determined to be the probability distribution of the quantized first feature information.

In another possible implementation, the probability of each coded pixel in the quantized first feature information is predicted from the probability distribution of the reconstructed feature information, and the probability distribution of the quantized first feature information is obtained from these probabilities.
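
For the second implementation, a standard way to obtain the probability of each integer-quantized value (assumed here; it is common in hyper-prior entropy models) is to integrate the element-wise Gaussian, such as the dist returned by the sketch above, over the corresponding quantization bin:

import torch

def symbol_probability(y_hat, dist):
    """P(y_hat) for integer-quantized symbols: the Gaussian mass of the
    bin [y_hat - 0.5, y_hat + 0.5] around each value."""
    upper = dist.cdf(y_hat + 0.5)
    lower = dist.cdf(y_hat - 0.5)
    return (upper - lower).clamp(min=1e-9)  # floor avoids zero-probability symbols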

S201-C. Decode the first bitstream according to the probability distribution of the quantized first feature information to obtain the quantized first feature information.

After the probability distribution of the quantized first feature information has been determined through the above steps, the first bitstream is decoded using this distribution, achieving accurate decoding of the quantized first feature information.

In the embodiments of the present application, after the decoding side decodes the first bitstream according to mode one or mode two above and determines the quantized first feature information, it performs the following step S202.

S202. Perform multi-level temporal fusion on the quantized first feature information to obtain a mixed spatiotemporal representation.

In the embodiments of the present application, to improve the accuracy of the reconstructed image, multi-level temporal fusion is performed on the quantized first feature information: it is fused not only with the feature information of the reconstructed image immediately preceding the current image, but also with multiple reconstructed images before the current image, for example the reconstructed images at times t-1, t-2, ..., t-k. Thus, when some information in the immediately preceding reconstructed image is occluded, the occluded information can be obtained from several earlier reconstructed images, so that the generated mixed spatiotemporal representation contains more accurate, rich and detailed feature information. When motion compensation is then performed on the preceding reconstructed image based on this representation to generate the P predicted images of the current image, the accuracy of the predicted images improves, the reconstructed image of the current image can be obtained accurately from them, and the video compression effect is improved.

The embodiments of the present application do not limit the specific way in which the decoding side performs multi-level temporal fusion on the quantized first feature information to obtain the mixed spatiotemporal representation.

In some embodiments, the decoding side produces the mixed spatiotemporal representation through a recursive aggregation module; that is, S202 includes the following step S202-A.

S202-A. Through the recursive aggregation module, the decoding side fuses the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous time step to obtain the mixed spatiotemporal representation.

Each time the recursive aggregation module of the embodiments of the present application generates a mixed spatiotemporal representation, it learns and retains the deep feature information learned from the current input, and applies the learned deep features as implicit feature information to the next generation of the mixed spatiotemporal representation, thereby improving its accuracy. In other words, the implicit feature information of the recursive aggregation module at the previous time step includes the feature information of multiple reconstructed images before the current image that the module has learned; by fusing the quantized first feature information with it, the decoding side can generate a more accurate, rich and detailed mixed spatiotemporal representation.

The embodiments of the present application do not limit the specific network structure of the recursive aggregation module; it may be any network structure that can realize the above function.

In some embodiments, the recursive aggregation module is formed by stacking at least one spatiotemporal recurrent network (ST-LSTM), in which case the mixed spatiotemporal representation G_t can be expressed as in formula (1):

G_t = ST-LSTM(ŷ_t, h) + ŷ_t    (1)

where ŷ_t is the quantized first feature information and h is the implicit feature information carried by the ST-LSTM.

In one example, suppose the recursive aggregation module consists of 2 ST-LSTMs, as shown in FIG. 4. The decoding side feeds the reconstructed quantized first feature information ŷ_t into the recursive aggregation module, and the 2 ST-LSTMs process ŷ_t in turn to generate feature information. Specifically, as shown in FIG. 4, the implicit feature information h1 generated by the first ST-LSTM serves as input to the next ST-LSTM, and during this pass the two ST-LSTMs generate updated conveyor-belt values c1 and c2 to update their respective conveyor-belt states, while the memory information m is passed between the two ST-LSTMs, finally yielding the feature information h2 output by the second ST-LSTM. Further, to improve the accuracy of the generated mixed spatiotemporal representation, the feature information h2 generated by the second ST-LSTM is residually connected with the quantized first feature information ŷ_t, i.e., h2 and ŷ_t are added to generate the mixed spatiotemporal representation G_t.
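
The following sketch approximates this recursive aggregation with two stacked convolutional recurrent cells and a residual connection to ŷ_t. It is a simplification: the ST-LSTM of FIG. 4 additionally passes the spatiotemporal memory m between the cells, which is omitted here, and all shapes are assumptions.

import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # Simplified recurrent cell; the ST-LSTM also carries a memory m between layers.
    def __init__(self, c):
        super().__init__()
        self.gates = nn.Conv2d(2 * c, 4 * c, kernel_size=3, padding=1)

    def forward(self, x, h, cell):
        i, f, o, g = self.gates(torch.cat([x, h], dim=1)).chunk(4, dim=1)
        cell = torch.sigmoid(f) * cell + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(cell)
        return h, cell

class RecursiveAggregation(nn.Module):
    """Two stacked recurrent cells with a residual connection to y_hat,
    mirroring the structure described around FIG. 4."""
    def __init__(self, c):
        super().__init__()
        self.cell1 = ConvLSTMCell(c)
        self.cell2 = ConvLSTMCell(c)
        self.state = None  # (h1, c1, h2, c2), carried across frames

    def forward(self, y_hat):
        if self.state is None:
            z = torch.zeros_like(y_hat)
            self.state = (z, z, z, z)
        h1, c1, h2, c2 = self.state
        h1, c1 = self.cell1(y_hat, h1, c1)
        h2, c2 = self.cell2(h1, h2, c2)
        self.state = (h1, c1, h2, c2)
        return y_hat + h2  # mixed spatiotemporal representation G_t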

解码端根据上述方法,得到混合时空表征后,执行如下S203。After the decoder obtains the mixed spatiotemporal representation according to the above method, it executes the following S203.

S203. Perform motion compensation on the previous reconstructed image according to the mixed spatiotemporal representation to obtain P predicted images of the current image.

Here, P is a positive integer.

As can be seen from the above, the mixed spatiotemporal representation of the embodiments of the present application fuses the feature information of the current image and of multiple reconstructed images preceding it. Therefore, performing motion compensation on the previous reconstructed image according to this representation can yield P accurate predicted images of the current image.

The embodiments of the present application do not limit the specific number P of generated predicted images; the decoding end may use different methods to perform motion compensation on the previous reconstructed image according to the mixed spatiotemporal representation to obtain the P predicted images of the current image.

The embodiments of the present application likewise do not limit the specific manner in which the decoding end performs this motion compensation.

In some embodiments, the P predicted images include a first predicted image, which is obtained by the decoding end using optical-flow motion compensation. In this case, S203 includes the following steps S203-A1 and S203-A2:

S203-A1. Determine optical flow motion information according to the mixed spatiotemporal representation.

S203-A2. Perform motion compensation on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.

The embodiments of the present application do not limit the specific manner in which the decoding end determines the optical flow motion information from the mixed spatiotemporal representation.

In some embodiments, the decoding end obtains the optical flow motion information through a pre-trained neural network model, i.e., the model predicts the optical flow motion information from the mixed spatiotemporal representation. In some embodiments, this model is called the first decoder, or the optical flow signal decoder Df. The decoding end inputs the mixed spatiotemporal representation G_t into the optical flow signal decoder Df to predict the optical flow motion information, obtaining the optical flow motion information f_xy output by Df. Optionally, f_xy is 2-channel optical flow motion information.

Exemplarily, f_xy is generated as shown in formula (2):

f_xy = Df(G_t)  (2)

The embodiments of the present application do not limit the specific network structure of the optical flow signal decoder Df.

In some embodiments, the optical flow signal decoder Df is composed of multiple NLAMs and multiple upsampling modules. Exemplarily, as shown in FIG. 5, Df includes 1 NLAM, 3 LAMs and 4 upsampling modules, where the NLAM is followed by an upsampling module and each LAM is followed by an upsampling module. Optionally, the NLAM includes multiple convolutional layers, for example 3 convolutional layers, each with a 3*3 kernel and 192 channels. Optionally, each of the 3 LAMs includes multiple convolutional layers, for example 3 convolutional layers each with a 3*3 kernel, the channel numbers of the three LAMs being 128, 96 and 64 in turn. Optionally, each of the 4 upsampling modules includes one convolutional layer Conv with a 5*5 kernel, the channel numbers of the four modules being 128, 96, 64 and 2 in turn. In this way, the decoding end inputs the mixed spatiotemporal representation G_t into Df; the NLAM extracts features from G_t to obtain feature information a with 192 channels, which is input into the first upsampling module to obtain feature information b with 128 channels. Next, b is input into the first LAM for further feature extraction to obtain feature information c with 128 channels, which is input into the second upsampling module to obtain feature information d with 96 channels. Next, d is input into the second LAM to obtain feature information e with 96 channels, which is input into the third upsampling module to obtain feature information f with 64 channels. Next, f is input into the third LAM to obtain feature information g with 64 channels, which is input into the fourth upsampling module to obtain feature information j with 2 channels; feature information j is the optical flow motion information.
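The following sketch mirrors the channel schedule of this walkthrough (192 → 128 → 96 → 64 → 2). It is only a rough approximation: the NLAM/LAM attention modules are replaced by plain 3*3 convolution stacks, the input is assumed to already have 192 channels, and strided transposed convolutions stand in for the 5*5 upsampling modules:

```python
import torch.nn as nn

def attn_up_block(in_ch: int, out_ch: int) -> nn.Sequential:
    # One attention stage (approximated here by a 3-layer conv stack)
    # followed by a 5x5 upsampling stage that doubles the resolution.
    attn = nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(),
        nn.Conv2d(in_ch, in_ch, 3, padding=1),
    )
    up = nn.ConvTranspose2d(in_ch, out_ch, 5, stride=2, padding=2, output_padding=1)
    return nn.Sequential(attn, up)

flow_decoder = nn.Sequential(
    attn_up_block(192, 128),  # NLAM + first upsampling module
    attn_up_block(128, 96),   # LAM + second upsampling module
    attn_up_block(96, 64),    # LAM + third upsampling module
    attn_up_block(64, 2),     # LAM + fourth upsampling module -> 2-channel f_xy
)
```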

It should be noted that FIG. 5 is only an example, and the parameter settings in FIG. 5 are also only examples; the network structure of the optical flow signal decoder Df in the embodiments of the present application includes but is not limited to that shown in FIG. 5.

After generating the optical flow motion information f_xy, the decoding end uses f_xy to perform motion compensation on the previous reconstructed image X̂_{t-1}, obtaining the first predicted image X_1.

The embodiments of the present application do not limit the specific manner in which the decoding end performs this motion compensation. For example, the decoding end uses f_xy to linearly interpolate the previous reconstructed image X̂_{t-1} and records the interpolated image as the first predicted image X_1.

In a possible implementation, the decoding end obtains the first predicted image X_1 by the following formula (3):

X_1 = Warp(X̂_{t-1}, f_xy)  (3)

In this implementation, as shown in FIG. 5, the decoding end performs motion compensation on the previous reconstructed image X̂_{t-1} through a warping operation using the optical flow motion information f_xy, obtaining the first predicted image X_1.
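A warping operation of this kind is commonly implemented with bilinear sampling. The sketch below assumes the flow is expressed in pixel units at the resolution of the reconstructed image; the function name and conventions are illustrative:

```python
import torch
import torch.nn.functional as F

def warp(prev_recon: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # prev_recon: (N, C, H, W) previous reconstructed image
    # flow:       (N, 2, H, W) optical flow f_xy in pixel units
    n, _, h, w = prev_recon.shape
    ys, xs = torch.meshgrid(
        torch.arange(h, device=flow.device, dtype=flow.dtype),
        torch.arange(w, device=flow.device, dtype=flow.dtype),
        indexing="ij",
    )
    # Displace the sampling grid by the flow, then normalise to [-1, 1]
    # as required by grid_sample (bilinear interpolation).
    x = (xs[None] + flow[:, 0]) / max(w - 1, 1) * 2 - 1
    y = (ys[None] + flow[:, 1]) / max(h - 1, 1) * 2 - 1
    grid = torch.stack((x, y), dim=-1)  # (N, H, W, 2)
    return F.grid_sample(prev_recon, grid, mode="bilinear", align_corners=True)
```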

In some embodiments, the P predicted images include a second predicted image, which is obtained by the decoding end using offset-based motion compensation. In this case, S203 includes the following steps S203-B1 to S203-B3:

S203-B1. Obtain the offset corresponding to the current image according to the mixed spatiotemporal representation.

S203-B2. Perform spatial feature extraction on the previous reconstructed image to obtain reference feature information.

S203-B3. Perform motion compensation on the reference feature information using the offset to obtain the second predicted image.

The embodiments of the present application do not limit the specific manner in which the decoding end obtains the offset corresponding to the current image from the mixed spatiotemporal representation.

In some embodiments, the decoding end obtains the offset through a pre-trained neural network model, i.e., the model predicts the offset, which is lossy offset information, from the mixed spatiotemporal representation. In some embodiments, this model is called the second decoder, or the variable convolution decoder Dm. The decoding end inputs the mixed spatiotemporal representation G_t into Dm to predict the offset information.

Meanwhile, the decoding end performs spatial feature extraction on the previous reconstructed image to obtain reference feature information, for example through the spatial feature extraction module SFE.

Next, the decoding end performs motion compensation on the extracted reference feature information using the offset, obtaining the second predicted image of the current image.

The embodiments of the present application do not limit the specific manner in which the decoding end performs this offset-based motion compensation on the reference feature information.

In a possible implementation, the decoding end uses the offset to perform deformable-convolution-based motion compensation on the reference feature information to obtain the second predicted image.

In some embodiments, since the deformable convolution can generate the offset corresponding to the current image from the mixed spatiotemporal representation, the decoding end inputs the mixed spatiotemporal representation G_t and the reference feature information into the deformable convolution, which generates the offset from G_t and applies it to the reference feature information for motion compensation, thereby obtaining the second predicted image.

Based on this, exemplarily, as shown in FIG. 6, the variable convolution decoder Dm of the embodiments of the present application includes a deformable convolution DCN. The decoding end inputs the previous reconstructed image X̂_{t-1} into the spatial feature extraction module SFE for spatial feature extraction, obtaining the reference feature information. Next, the mixed spatiotemporal representation G_t and the reference feature information are input into the deformable convolution DCN for offset extraction and motion compensation, obtaining the second predicted image X_2.

Exemplarily, the decoding end generates the second predicted image X_2 as in formula (4):

X_2 = DCN(SFE(X̂_{t-1}), G_t)  (4)

The embodiments of the present application do not limit the specific network structure of the variable convolution decoder Dm.

In some embodiments, as shown in FIG. 6, to further improve the accuracy of the second predicted image, the variable convolution decoder Dm includes, in addition to the deformable convolution DCN, 1 NLAM, 3 LAMs and 4 upsampling modules, where the NLAM is followed by an upsampling module and each LAM is followed by an upsampling module. Optionally, the network structure of the 1 NLAM, 3 LAMs and the first 3 upsampling modules included in Dm is the same as that of the corresponding modules of the optical flow signal decoder Df described above, and is not repeated here. Optionally, the last upsampling module of Dm has 5 channels.

It should be noted that FIG. 6 is only an example, and the parameter settings in FIG. 6 are also only examples; the network structure of the variable convolution decoder Dm in the embodiments of the present application includes but is not limited to that shown in FIG. 6.

In the embodiments of the present application, as shown in FIG. 6, the decoding end first inputs the previous reconstructed image X̂_{t-1} into the spatial feature extraction module SFE for spatial feature extraction, obtaining the reference feature information. Next, the mixed spatiotemporal representation G_t and the reference feature information are input into the deformable convolution DCN in the variable convolution decoder Dm for offset extraction and motion compensation, yielding a feature map that is input into the NLAM; after feature extraction by the NLAM, the 3 LAMs and the 4 upsampling modules, it is finally restored to the second predicted image X_2.
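A minimal sketch of the offset-extraction-plus-compensation step, using torchvision's DeformConv2d. It assumes G_t and the reference features share the same spatial size and that the offsets are predicted from G_t by a single convolution; channel sizes and names are illustrative:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableCompensation(nn.Module):
    def __init__(self, feat_ch: int, g_ch: int, k: int = 3):
        super().__init__()
        # Offsets (2 values per kernel tap) are derived from G_t.
        self.offset_pred = nn.Conv2d(g_ch, 2 * k * k, 3, padding=1)
        self.dcn = DeformConv2d(feat_ch, feat_ch, k, padding=k // 2)

    def forward(self, ref_feat, g_t):
        offsets = self.offset_pred(g_t)     # offset map from the mixed representation
        return self.dcn(ref_feat, offsets)  # motion-compensated reference features
```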

According to the above methods, the decoding end can determine the P predicted images, for example the first predicted image and the second predicted image, and then performs the following step S204.

S204. Determine the reconstructed image of the current image according to the P predicted images.

In some embodiments, if the P predicted images consist of one predicted image, the reconstructed image of the current image is determined according to that predicted image.

For example, the predicted image is compared with one or several previous reconstructed images of the current image to compute a loss. If the loss is small, the prediction accuracy of the predicted image is high, and the predicted image can be determined as the reconstructed image of the current image.

For another example, if the loss is large, the prediction accuracy of the predicted image is low. In this case, the reconstructed image of the current image can be determined from the previous one or several reconstructed images and the predicted image; for example, the predicted image and the previous one or several reconstructed images are input into a neural network to obtain the reconstructed image of the current image.

In some embodiments, S204 includes the following steps S204-A and S204-B:

S204-A. Determine the target predicted image of the current image according to the P predicted images.

In this implementation, the decoding end first determines the target predicted image of the current image from the P predicted images, and then obtains the reconstructed image of the current image from the target predicted image, thereby improving the accuracy with which the reconstructed image is determined.

The embodiments of the present application do not limit the specific manner of determining the target predicted image from the P predicted images.

In some embodiments, if P = 1, the single predicted image is determined as the target predicted image of the current image.

In some embodiments, if P is greater than 1, S204-A includes S204-A11 and S204-A12:

S204-A11. Determine a weighted image according to the P predicted images.

In this implementation, if multiple predicted images of the current image are generated according to the above methods, for example the first predicted image and the second predicted image, the P predicted images are weighted to generate a weighted image, and the target predicted image is obtained from the weighted image.

The embodiments of the present application do not limit the specific manner of determining the weighted image from the P predicted images.

For example, the weights corresponding to the P predicted images are determined, and the P predicted images are weighted according to those weights to obtain the weighted image.

Exemplarily, if the P predicted images include the first predicted image and the second predicted image, the decoding end determines a first weight corresponding to the first predicted image and a second weight corresponding to the second predicted image, and weights the two predicted images according to the first weight and the second weight to obtain the weighted image.

The manner of determining the weights corresponding to the P predicted images includes but is not limited to the following:

Mode 1: the weights corresponding to the P predicted images are preset weights. Assuming P = 2, the first weight corresponding to the first predicted image and the second weight corresponding to the second predicted image may be equal, or the ratio of the first weight to the second weight may be 1/2, 1/3, 1/4, 2/1, 3/1, 4/1, and so on.

Mode 2: the decoding end performs adaptive masking according to the mixed spatiotemporal representation to obtain the weights corresponding to the P predicted images.

Exemplarily, the decoding end generates the weights corresponding to the P predicted images through a pre-trained neural network model. In some embodiments, this model is also called the third decoder, or the adaptive mask compensation decoder Dw. Specifically, the decoding end inputs the mixed spatiotemporal representation into Dw for adaptive masking, obtaining the weights corresponding to the P predicted images. For example, the decoding end inputs the mixed spatiotemporal representation G_t into Dw, which outputs the first weight w1 of the first predicted image and the second weight w2 of the second predicted image; weighting the first predicted image X_1 and the second predicted image X_2 obtained above by w1 and w2 adaptively selects the information representing different regions of the predicted frame, thereby generating the weighted image.

Exemplarily, the weighted image X_3 is generated according to the following formula (5):

X_3 = w1*X_1 + w2*X_2  (5)

In some embodiments, the weights corresponding to the P predicted images form a matrix containing the weight of each pixel in the predicted image. In this case, when generating the weighted image, for each pixel of the current image, the predicted values of that pixel in the P predicted images are weighted by their respective weights to obtain the weighted predicted value of the pixel; the weighted predicted values of all pixels of the current image form the weighted image of the current image.
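Per-pixel weighting then reduces to an element-wise computation. A brief sketch, assuming the weight maps w1 and w2 are the (e.g. sigmoid-activated) outputs of Dw, broadcast over the colour channels:

```python
import torch

def fuse_predictions(x1: torch.Tensor, x2: torch.Tensor,
                     w1: torch.Tensor, w2: torch.Tensor) -> torch.Tensor:
    # Formula (5): X_3 = w1 * X_1 + w2 * X_2, applied pixel by pixel.
    return w1 * x1 + w2 * x2
```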

The embodiments of the present application do not limit the specific network structure of the adaptive mask compensation decoder Dw.

In some embodiments, as shown in FIG. 7, the adaptive mask compensation decoder Dw includes 1 NLAM, 3 LAMs, 4 upsampling modules and a sigmoid function, where the NLAM is followed by an upsampling module and each LAM is followed by an upsampling module. Optionally, the network structure of the 1 NLAM, 3 LAMs and 4 upsampling modules included in Dw is consistent with that of the corresponding modules of the variable convolution decoder Dm described above, and is not repeated here.

It should be noted that FIG. 7 is only an example, and the parameter settings in FIG. 7 are also only examples; the network structure of the adaptive mask compensation decoder Dw in the embodiments of the present application includes but is not limited to that shown in FIG. 7.

In this implementation, after weighting the P predicted images according to the above method to obtain the weighted image, the decoding end performs the following S204-A12.

S204-A12. Obtain the target predicted image according to the weighted image.

For example, the weighted image is determined as the target predicted image.

In some embodiments, the decoding end may also obtain a residual image of the current image according to the mixed spatiotemporal representation.

Exemplarily, the decoding end obtains the residual image of the current image through a pre-trained neural network model. In some embodiments, this model is also called the fourth decoder, or the spatial texture enhancement decoder Dt. Specifically, the decoding end inputs the mixed spatiotemporal representation into Dt for spatial texture enhancement, obtaining the residual image X_r = Dt(G_t) of the current image; X_r can perform texture enhancement on the predicted image.

The embodiments of the present application do not limit the specific network structure of the spatial texture enhancement decoder Dt.

In some embodiments, as shown in FIG. 8, the spatial texture enhancement decoder Dt includes 1 NLAM, 3 LAMs and 4 upsampling modules, where the NLAM is followed by an upsampling module and each LAM is followed by an upsampling module. Optionally, the 1 NLAM, 3 LAMs and the first 3 upsampling modules of Dt share the network structure of the corresponding modules of the optical flow signal decoder Df described above, and are not repeated here. The last upsampling module of Dt has 3 channels.

It should be noted that FIG. 8 is only an example, and the parameter settings in FIG. 8 are also only examples; the network structure of the spatial texture enhancement decoder Dt in the embodiments of the present application includes but is not limited to that shown in FIG. 8.

Since the residual image X_r can perform texture enhancement on the predicted image, in some embodiments, determining the target predicted image of the current image from the P predicted images in S204-A includes the following step S204-A21:

S204-A21. Obtain the target predicted image according to the P predicted images and the residual image.

For example, if P = 1, the target predicted image is obtained from the predicted image and the residual image; for example, the predicted image and the residual image are added to generate the target predicted image.

For another example, if P is greater than 1, a weighted image is first determined from the P predicted images, and the target predicted image is then determined from the weighted image and the residual image.

The specific process by which the decoding end determines the weighted image from the P predicted images can refer to the description of S204-A11 above and is not repeated here.

By way of example, taking P = 2, the first weight w1 of the first predicted image and the second weight w2 of the second predicted image are determined according to the above method. Optionally, the first and second predicted images are weighted according to formula (5) to obtain the weighted image X_3, and the weighted image X_3 is then enhanced using the residual image X_r to obtain the target predicted image.

Exemplarily, the target predicted image X_4 is generated according to the following formula (6):

X_4 = X_3 + X_r  (6)

After the decoding end determines the target predicted image of the current image according to the above method, the following step S204-B is performed.

S204-B. Determine the reconstructed image of the current image according to the target predicted image.

In some embodiments, the target predicted image is compared with one or several previous reconstructed images of the current image to compute a loss. If the loss is small, the prediction accuracy of the target predicted image is high, and it can be determined as the reconstructed image of the current image. If the loss is large, the prediction accuracy is low; in this case, the reconstructed image of the current image can be determined from the previous one or several reconstructed images and the target predicted image, for example by inputting them into a neural network to obtain the reconstructed image of the current image.

In some embodiments, to further improve the accuracy with which the reconstructed image is determined, the embodiments of the present application also include residual decoding. In this case, S204-B includes the following steps S204-B1 and S204-B2:

S204-B1. Decode the residual code stream to obtain the residual value of the current image.

S204-B2. Obtain the reconstructed image according to the target predicted image and the residual value.

In the embodiments of the present application, to improve the quality of the reconstructed image, the encoding end also generates a residual code stream by residual coding. Specifically, the encoding end determines the residual value of the current image and encodes it to generate the residual code stream. Correspondingly, the decoding end decodes the residual code stream to obtain the residual value of the current image, and obtains the reconstructed image according to the target predicted image and the residual value.

The embodiments of the present application do not limit the specific representation of the residual value of the current image.

In a possible implementation, the residual value of the current image is a matrix in which each element is the residual value of a pixel of the current image. In this way, the decoding end can, pixel by pixel, add the residual value of each pixel to its predicted value in the target predicted image to obtain the reconstruction value of each pixel, and hence the reconstructed image of the current image. Taking the i-th pixel of the current image as an example, the predicted value of the i-th pixel is obtained from the target predicted image, and its residual value is obtained from the residual value of the current image; the two are then added to obtain the reconstruction value of the i-th pixel. Applying the same procedure to every pixel of the current image yields the reconstruction value of each pixel, and these reconstruction values form the reconstructed image of the current image.

The embodiments of the present application do not limit the specific manner in which the decoding end obtains the residual value of the current image; that is, the residual encoding and decoding methods used at the two ends are not limited.

In one example, the encoding end determines the target predicted image of the current image in the same manner as the decoding end, and then obtains the residual value of the current image from the current image and the target predicted image, for example by taking their difference. The residual value is then encoded to generate the residual code stream. Optionally, the residual value may be transformed to obtain transform coefficients, the transform coefficients quantized to obtain quantized coefficients, and the quantized coefficients encoded to obtain the residual code stream. Correspondingly, the decoding end decodes the residual code stream to obtain the residual value of the current image, for example by decoding the quantized coefficients and applying inverse quantization and inverse transform. The residual value is then added to the target predicted image according to the above method to obtain the reconstructed image of the current image.
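As a rough sketch of this transform/quantize/encode round trip, the chain can be reduced to scalar quantization of the residual (the transform and the entropy coder are abstracted away, and the step size is an assumed parameter):

```python
import torch

def encode_residual(current: torch.Tensor, target_pred: torch.Tensor, step: float = 0.1):
    residual = current - target_pred
    return torch.round(residual / step)  # quantized coefficients to entropy-code

def decode_residual(coeffs: torch.Tensor, target_pred: torch.Tensor, step: float = 0.1):
    residual = coeffs * step             # dequantize (inverse transform omitted)
    return target_pred + residual        # reconstructed image, pixel by pixel
```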

In some embodiments, the encoding end may use a neural network to process the current image and its target predicted image, generate the residual value of the current image, and encode it to generate the residual code stream. Correspondingly, the decoding end decodes the residual code stream to obtain the residual value, and then adds it to the target predicted image according to the above method to obtain the reconstructed image of the current image.

In the embodiments of the present application, the decoding end can obtain the reconstructed image of the current image according to the above method.

Optionally, the reconstructed image may be displayed directly.

Optionally, the reconstructed image may also be stored in a buffer for the decoding of subsequent images.

In the video decoding method provided by the embodiments of the present application, the decoding end determines the quantized first feature information by decoding the first code stream, the first feature information being obtained by feature fusion of the current image and the previous reconstructed image of the current image; performs multi-level temporal fusion on the quantized first feature information to obtain a mixed spatiotemporal representation; performs motion compensation on the previous reconstructed image according to the mixed spatiotemporal representation to obtain P predicted images of the current image, P being a positive integer; and determines the reconstructed image of the current image according to the P predicted images. In the present application, to improve the accuracy of the reconstructed image, multi-level temporal fusion is performed on the quantized first feature information: it is fused not only with the feature information of the previous reconstructed image of the current image, but also with the features of multiple reconstructed images preceding the current image. Thus, when some information in the previous reconstructed image is occluded, the occluded information can be obtained from the several reconstructed images preceding the current image, so that the generated mixed spatiotemporal representation includes more accurate, rich and detailed feature information. When motion compensation is performed on the previous reconstructed image based on this representation, P high-precision predicted images can be generated, from which the reconstructed image of the current image can be obtained accurately, thereby improving the video compression effect.

The embodiments of the present application propose an end-to-end neural-network-based codec framework comprising a neural-network-based encoder and a neural-network-based decoder. The decoding process of the embodiments of the present application is introduced below with reference to one possible neural-network-based decoder of the present application.

FIG. 9 is a schematic diagram of the network structure of a neural-network-based decoder according to an embodiment of the present application, comprising an inverse transform module, a recursive aggregation module and a hybrid motion compensation module.

The inverse transform module is used to inversely transform the quantized second feature information to obtain the reconstructed feature information of the first feature information; exemplarily, its network structure is shown in FIG. 3.

The recursive aggregation module is used to perform multi-level temporal fusion on the quantized first feature information to obtain the mixed spatiotemporal representation; exemplarily, its network structure is shown in FIG. 4.

The hybrid motion compensation module is used to perform hybrid motion compensation on the mixed spatiotemporal representation to obtain the target predicted image of the current image. Exemplarily, the hybrid motion compensation module may include the first decoder shown in FIG. 5 and/or the second decoder shown in FIG. 6; optionally, if it includes both the first decoder and the second decoder, it may also include the third decoder shown in FIG. 7. In some embodiments, it may also include the fourth decoder shown in FIG. 8.

Exemplarily, the embodiments of the present application are described taking the case where the motion compensation module includes the first decoder, the second decoder, the third decoder and the fourth decoder.

On the basis of the neural-network-based decoder shown in FIG. 9, a possible video decoding method of an embodiment of the present application is introduced with reference to FIG. 10.

FIG. 10 is a schematic diagram of a video decoding process provided by an embodiment of the present application. As shown in FIG. 10, it includes:

S301. Decode the second code stream to obtain the quantized second feature information.

For the specific implementation of S301, refer to the description of S201-A above, which is not repeated here.

S302. Inversely transform the quantized second feature information through the inverse transform module to obtain the reconstructed feature information.

Exemplarily, the specific network structure of the inverse transform module is shown in FIG. 3, including 2 non-local self-attention modules and 2 upsampling modules.

For example, the decoding end inputs the quantized second feature information into the inverse transform module for inverse transformation, and the module outputs the reconstructed feature information. For the specific implementation of S302, refer to the description of S201-B1 above, which is not repeated here.

S303. Determine the probability distribution of the reconstructed feature information.

S304. Predict the probability distribution of the quantized first feature information according to the probability distribution of the reconstructed feature information.

S305. Decode the first code stream according to the probability distribution of the quantized first feature information to obtain the quantized first feature information.

For the specific implementation of S303 to S305, refer to the descriptions of S201-B2, S201-B3 and S201-C above, which are not repeated here.

S306. Perform multi-level temporal fusion on the quantized first feature information through the recursive aggregation module to obtain the mixed spatiotemporal representation.

Optionally, the recursive aggregation module is formed by stacking at least one spatiotemporal recurrent network.

Exemplarily, the network structure of the recursive aggregation module is shown in FIG. 4.

For example, the decoding end inputs the quantized first feature information into the recursive aggregation module, so that the module fuses it with the implicit feature information of the recursive aggregation module at the previous moment and outputs the mixed spatiotemporal representation. For the specific implementation of S306, refer to the description of S202-A above, which is not repeated here.

S307. Process the mixed spatiotemporal representation through the first decoder to obtain the first predicted image.

After the mixed spatiotemporal representation is obtained in S306, it and the previous reconstructed image are input into the hybrid motion compensation module for hybrid motion compensation to obtain the target predicted image of the current image.

Specifically, the mixed spatiotemporal representation is processed by the first decoder to determine the optical flow motion information, and motion compensation is performed on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.

Optionally, the network structure of the first decoder is shown in FIG. 5.

For the specific implementation of S307, refer to the descriptions of S203-A1 and S203-A2 above, which are not repeated here.

S308. Process the mixed spatiotemporal representation through the second decoder to obtain the second predicted image.

Specifically, spatial feature extraction is performed on the previous reconstructed image through the SFE to obtain the reference feature information; the reference feature information and the mixed spatiotemporal representation are input into the second decoder, so that the offset performs motion compensation on the reference feature information to obtain the second predicted image.

Optionally, the network structure of the second decoder is shown in FIG. 6.

For the specific implementation of S308, refer to the descriptions of S203-B1 to S203-B3 above, which are not repeated here.

S309. Process the mixed spatiotemporal representation through the third decoder to obtain the first weight corresponding to the first predicted image and the second weight corresponding to the second predicted image.

Specifically, the mixed spatiotemporal representation is input into the third decoder for adaptive masking, obtaining the first weight corresponding to the first predicted image and the second weight corresponding to the second predicted image.

Optionally, the network structure of the third decoder is shown in FIG. 7.

For the specific implementation of S309, refer to the description of Mode 2 in S204-A11 above, which is not repeated here.

S310. Weight the first predicted image and the second predicted image according to the first weight and the second weight to obtain the weighted image.

For example, the product of the first weight and the first predicted image is added to the product of the second weight and the second predicted image to obtain the weighted image.

S311. Process the mixed spatiotemporal representation through the fourth decoder to obtain the residual image of the current image.

Specifically, the mixed spatiotemporal representation is input into the fourth decoder for processing to obtain the residual image of the current image.

Optionally, the network structure of the fourth decoder is shown in FIG. 8.

For the specific implementation of S311, refer to the description of S204-A12 above, which is not repeated here.

S312. Determine the target predicted image according to the weighted image and the residual image.

For example, the weighted image and the residual image are added to determine the target predicted image.

S313. Decode the residual code stream to obtain the residual value of the current image.

S314. Obtain the reconstructed image according to the target predicted image and the residual value.

For the specific implementation of S313 and S314, refer to the descriptions of S204-B1 and S204-B2 above, which are not repeated here.

In the embodiments of the present application, when decoding is performed by the neural-network-based decoder shown in FIG. 9, multi-level temporal fusion is performed on the quantized first feature information, i.e., the quantized first feature information is fused with the features of multiple reconstructed images preceding the current image, so that the generated mixed spatiotemporal representation includes more accurate, rich and detailed feature information. Motion compensation on the previous reconstructed image based on this representation then produces multiple pieces of decoded information, for example the first predicted image, the second predicted image, their respective weights, and the residual image. Determining the target predicted image of the current image from these pieces of decoded information effectively improves its accuracy; based on this accurate predicted image, the reconstructed image of the current image can be obtained accurately, thereby improving the video compression effect.
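Pulling S301–S314 together, the decoding pass can be summarised by the following schematic glue code. Every module and attribute name here is hypothetical (the patent does not fix an API), the entropy coder is abstracted behind `modules.entropy`, and `warp` refers to the warping sketch given earlier:

```python
def decode_frame(bitstreams, prev_recon, rnn_state, modules):
    y2_hat = modules.entropy.decode_hyper(bitstreams.second)            # S301
    recon_feat = modules.inverse_transform(y2_hat)                      # S302
    pdf = modules.entropy.predict_pdf(recon_feat)                       # S303-S304
    y1_hat = modules.entropy.decode_main(bitstreams.first, pdf)         # S305
    g_t, rnn_state = modules.recursive_aggregation(y1_hat, rnn_state)   # S306
    x1 = warp(prev_recon, modules.flow_decoder(g_t))                    # S307
    x2 = modules.deform_decoder(g_t, prev_recon)                        # S308
    w1, w2 = modules.mask_decoder(g_t)                                  # S309
    x3 = w1 * x1 + w2 * x2                                              # S310
    x_r = modules.texture_decoder(g_t)                                  # S311
    x4 = x3 + x_r                                                       # S312
    res = modules.entropy.decode_residual(bitstreams.residual)          # S313
    return x4 + res, rnn_state                                          # S314
```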

The video decoding method of the embodiments of the present application has been described above; on this basis, the video encoding method of the present application is described below from the encoding end.

FIG. 11 is a schematic flowchart of the video encoding method provided by an embodiment of the present application. The execution subject of this embodiment may be the encoder shown in FIG. 1 above.

As shown in FIG. 11, the method of the embodiments of the present application includes:

S401. Perform feature fusion on the current image and the previous reconstructed image of the current image to obtain the first feature information.

The embodiments of the present application propose a neural-network-based encoder obtained by end-to-end training with the neural-network-based decoder.

In the embodiments of the present application, the previous reconstructed image of the current image can be understood as the frame immediately preceding the current image in the video sequence, which has already been decoded and reconstructed.

Since there is strong similarity between the current image X_t and the previous reconstructed image X̂_{t-1} of the current image, the encoding end performs feature fusion on X_t and X̂_{t-1} during encoding to obtain the first feature information. For example, the encoding end concatenates X_t and X̂_{t-1} along the channel dimension to obtain the concatenated input data X_cat: X_t and X̂_{t-1} are 3-channel video frames in the sRGB domain, and X_cat stacks the two frames channel by channel into an input signal with 6 channels. Next, feature extraction is performed on the concatenated image X_cat to obtain the first feature information.
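The channel-wise concatenation is a single tensor operation; a brief sketch (the frame resolution here is arbitrary):

```python
import torch

x_t = torch.randn(1, 3, 256, 448)        # current frame X_t (3-channel sRGB)
x_prev = torch.randn(1, 3, 256, 448)     # previous reconstruction
x_cat = torch.cat([x_t, x_prev], dim=1)  # channel-wise stacking -> 6 channels
assert x_cat.shape[1] == 6
```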

The embodiments of the present application do not limit the specific manner in which the encoding end extracts features from X_cat; for example, it may include at least one of multi-layer convolution, residual connection, dense connection and other feature extraction methods.

In some embodiments, the encoding end performs Q non-local attention transforms and Q downsamplings on the concatenated image to obtain the first feature information, Q being a positive integer.

For example, the encoding end inputs the concatenated 6-channel high-dimensional input signal X_cat into the spatiotemporal feature extraction module (Spatiotemporal Feature Extraction, STFE) for multi-layer feature transformation and extraction.

Optionally, the spatiotemporal feature extraction module includes Q non-local attention modules and Q downsampling modules, where the non-local attention modules implement the non-local attention transform and the downsampling modules implement downsampling. Exemplarily, as shown in FIG. 12, each non-local attention module is followed by a downsampling module. In practice, the encoding end inputs the concatenated 6-channel high-dimensional input signal X_cat into the STFE; the first non-local attention module performs non-local attention feature transformation and extraction on X_cat to obtain feature information 11, which is input into the first downsampling module to obtain feature information 12. Next, feature information 12 is input into the second non-local attention module to obtain feature information 13, which is input into the second downsampling module to obtain feature information 14. Proceeding in this manner, the feature information output by the Q-th downsampling module is obtained and determined as the first feature information X_F.
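A minimal sketch of such a Q-stage extractor, again approximating the non-local attention modules with plain convolutions and assuming a constant channel width:

```python
import torch.nn as nn

def build_stfe(q: int = 4, in_ch: int = 6, width: int = 192) -> nn.Sequential:
    stages = []
    ch = in_ch
    for _ in range(q):
        stages += [
            nn.Conv2d(ch, width, 3, padding=1), nn.ReLU(),    # attention stand-in
            nn.Conv2d(width, width, 5, stride=2, padding=2),  # downsampling module
        ]
        ch = width
    return nn.Sequential(*stages)  # with q=4, output X_F is at 1/16 resolution
```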

The embodiments of the present application do not limit the specific value of Q.

Optionally, Q = 4.

S402. Quantize the first feature information to obtain the quantized first feature information.

The first feature information obtained above is floating-point, for example represented by 32-bit floating-point numbers. Further, to reduce the coding cost, the encoding end quantizes it to obtain the quantized first feature information.

Exemplarily, the encoding end quantizes the first feature information using the rounding function Round(.).

在一些实施例中,在模型训练过程中,对正向传播时,使用如下公式(7)所示的方法对第一特征信息进行量化:In some embodiments, during the model training process, during forward propagation, the first feature information is quantized using the method shown in the following formula (7):

其中,U(-0.5,0.5)为正负0.5的均匀噪声分布用于近似实际的四舍五入量化函数Round(.)。Among them, U(-0.5,0.5) is a uniform noise distribution of plus or minus 0.5, which is used to approximate the actual rounding quantization function Round(.).

在训练过程对公式(7)进行求导得到对应的反向传播梯度为1,并将其作为反向传播的梯度对模型进行更新。During the training process, the derivative of formula (7) is used to obtain the corresponding back-propagation gradient of 1, which is used as the back-propagation gradient to update the model.
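A minimal sketch of this quantization scheme follows: uniform noise replaces rounding in the training forward pass per formula (7), so the backward gradient is exactly 1, while hard rounding is used at inference; the function name is an illustrative assumption:

```python
import torch

def quantize(x_f: torch.Tensor, training: bool) -> torch.Tensor:
    if training:
        # Formula (7): additive uniform noise approximates Round(.) so the
        # backward gradient with respect to x_f is exactly 1.
        return x_f + torch.empty_like(x_f).uniform_(-0.5, 0.5)
    return torch.round(x_f)   # hard rounding at inference
```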

S403、对量化后的第一特征信息进行编码,得到第一码流。S403: Encode the quantized first feature information to obtain a first code stream.

Method 1: the encoding end directly uses the probability distribution of the quantized first feature information to encode the quantized first feature information, obtaining the first bitstream.

上述量化后的第一特征信息所包括冗余信息量较多,直接对量化后的第一特征信息进行编码时,编码所需的码字多,编码代价大。为了降低编码代价,在一些实施例中,编码端根据第一特征信息进行特征变换,得到第二特征信息,并对第二特征信息进行量化后再编码,得到第二码流;对该第二码流进行解码,得到量化后的第二特征信息,并根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布;进而根据量化后的第一特征信息的概率分布,对量化后的第一特征信息进行编码,得到第一码流。也就是说,为了降低编码代价,则编码端确定第一特征信息对应的超先验特征信息,即第二特征信息,并基于该第二特征信息确定量化后的第一特征信息的概率分布,由于第二特征信息为第一特征信息的超先验特征信息,所包括的冗余量较少,这样基于该冗余量较少的第二特征信息确定量化后的第一特征信息的概率分布,并使用该概率分布对第一特征信息进行编码,可以降低第一特征信息的编码代价。The quantized first feature information includes a large amount of redundant information. When the quantized first feature information is directly encoded, many codewords are required for encoding and the encoding cost is large. In order to reduce the encoding cost, in some embodiments, the encoding end performs feature transformation according to the first feature information to obtain the second feature information, and quantizes the second feature information and then encodes it to obtain the second code stream; decodes the second code stream to obtain the quantized second feature information, and determines the probability distribution of the quantized first feature information according to the quantized second feature information; and then encodes the quantized first feature information according to the probability distribution of the quantized first feature information to obtain the first code stream. In other words, in order to reduce the encoding cost, the encoding end determines the super-prior feature information corresponding to the first feature information, that is, the second feature information, and determines the probability distribution of the quantized first feature information based on the second feature information. Since the second feature information is the super-prior feature information of the first feature information, it includes less redundancy. In this way, the probability distribution of the quantized first feature information is determined based on the second feature information with less redundancy, and the first feature information is encoded using the probability distribution, which can reduce the encoding cost of the first feature information.

基于上述描述,编码端可以通过如下方式二的步骤,对量化后的第一特征信息进行编码,得到第一码流。Based on the above description, the encoding end may encode the quantized first feature information through the following steps of method 2 to obtain a first code stream.

方式二,上述S403包括如下S403-A1至S403-A4的步骤:Mode 2, the above S403 includes the following steps S403-A1 to S403-A4:

S403-A1、根据第一特征信息进行特征变换,得到第二特征信息。S403-A1. Perform feature transformation according to the first feature information to obtain second feature information.

在该方式二中,编码端为了降低编码代价,对第一特征信息进行特征变换,得到该第一特征信息的超先验特征信息,即第二特征信息,使用该第二特征信息确定量化后的第一特征信息的概率分布,并使用该概率分布对量化后的第一特征信息进行编码,得到第一码流。同时,为了使解码端采用与编码相同的概率分布对第一码流进行解码,则对上述第二特征信息进行编码,得到第二码流。也就是说,在该方式二中,编码端生成两个码流,分别为第一码流和第二码流。In the second method, in order to reduce the coding cost, the encoder performs feature transformation on the first feature information to obtain the super-prior feature information of the first feature information, that is, the second feature information, and uses the second feature information to determine the probability distribution of the quantized first feature information, and encodes the quantized first feature information using the probability distribution to obtain the first code stream. At the same time, in order to enable the decoder to decode the first code stream using the same probability distribution as the encoding, the second feature information is encoded to obtain the second code stream. That is to say, in the second method, the encoder generates two code streams, namely the first code stream and the second code stream.
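The two-bitstream layout of method two can be sketched as follows; the four helper callables (hyper_encoder, hyper_pmf, cond_pmf, range_encode) are hypothetical placeholders, since the text does not name a specific hyper-prior network or arithmetic coder:

```python
import torch

def encode_frame(x_f, hyper_encoder, hyper_pmf, cond_pmf, range_encode):
    # hyper_encoder maps the first feature info X_F to the hyper-prior Z;
    # hyper_pmf / cond_pmf return per-symbol probability tables;
    # range_encode(symbols, pmf) is the arithmetic coder.
    y_hat = torch.round(x_f)                         # quantized first feature info
    z_hat = torch.round(hyper_encoder(x_f))          # quantized second feature info
    stream2 = range_encode(z_hat, hyper_pmf(z_hat))  # second bitstream
    stream1 = range_encode(y_hat, cond_pmf(z_hat))   # first bitstream, coded with the
    return stream1, stream2                          # distribution derived from z_hat
```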

本申请实施例中,编码端根据第一特征信息进行特征变换,得到第二特征信息的方式包括但不限于如下几种:In the embodiment of the present application, the encoding end performs feature transformation according to the first feature information to obtain the second feature information in the following ways, but not limited to:

方式1,对第一特征信息进行N次非局部注意力变换和N次下采样,得到第二特征信息。Method 1, performing N non-local attention transformations and N downsampling on the first feature information to obtain the second feature information.

方式2,对量化后的第一特征信息进行N次非局部注意力变换和N次下采样,得到第二特征信息。Method 2, performing N non-local attention transformations and N downsampling on the quantized first feature information to obtain the second feature information.

也就是说,编码端可以对第一特征信息或者量化后的第一特征信息进行N次非局部注意力变换和N次下采样,得到第二特征信息。That is to say, the encoding end can perform N non-local attention transformations and N downsamplings on the first feature information or the quantized first feature information to obtain the second feature information.

S403-A2、对第二特征信息进行量化后再编码,得到第二码流。S403-A2, quantize and then encode the second characteristic information to obtain a second code stream.

例如,对第二特征信息进行量化,得到量化后的第二特征信息;确定量化后的第二特征信息的概率分布;根据量化后的第二特征信息的概率分布,对量化后的第二特征信息进行编码,得到第二码流。For example, the second feature information is quantized to obtain quantized second feature information; the probability distribution of the quantized second feature information is determined; and according to the probability distribution of the quantized second feature information, the quantized second feature information is encoded to obtain a second code stream.

本申请中,由于第二特征信息为第一特征信息的超先验特征信息,所包括的冗余信息较少,因此,编码端在编码时,直接使用量化后的第二特征信息的概率分布,对量化后的第二特征信息进行编码,得到第二码流。In the present application, since the second feature information is the super-prior feature information of the first feature information and includes less redundant information, the encoding end directly uses the probability distribution of the quantized second feature information during encoding to encode the quantized second feature information to obtain a second code stream.

S403-A3、对第二码流进行解码,得到量化后的第二特征信息,并根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布。S403-A3, decode the second code stream to obtain quantized second feature information, and determine the probability distribution of the quantized first feature information according to the quantized second feature information.

In the embodiments of the present application, the encoding end performs arithmetic decoding on the hyper-prior second bitstream to recover the quantized hyper-prior spatiotemporal feature, i.e., the quantized second feature information. Then, according to the quantized second feature information, it determines the probability distribution of the quantized first feature information, and encodes the quantized first feature information according to that probability distribution to obtain the first bitstream.

下面对上述S403-A3中根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布的过程进行介绍。The following is an introduction to the process of determining the probability distribution of the quantized first feature information according to the quantized second feature information in the above S403-A3.

在一些实施例中,上述S403-A3中根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布包括如下步骤:In some embodiments, determining the probability distribution of the quantized first feature information according to the quantized second feature information in S403-A3 comprises the following steps:

S403-A31、对量化后的第二特征信息进行反变换,得到重建特征信息。S403-A31. Perform an inverse transformation on the quantized second feature information to obtain reconstructed feature information.

在该实现方式中,编码端对量化后的第二特征信息进行反变换,得到重建特征信息,其中,编码端所采用的反变换方式可以理解为编码端采用的变换方式的逆运算。例如,编码端对第一特征信息进行N次特征提取,得到第二特征信息,对应的,此时编码端对量化后的第二特征信息进行N次反向的特征提取,得到反变换后的特征信息,记为重建特征信息。In this implementation, the encoding end performs an inverse transformation on the quantized second feature information to obtain the reconstructed feature information, wherein the inverse transformation method adopted by the encoding end can be understood as the inverse operation of the transformation method adopted by the encoding end. For example, the encoding end performs N times of feature extraction on the first feature information to obtain the second feature information, and correspondingly, at this time, the encoding end performs N times of reverse feature extraction on the quantized second feature information to obtain the inversely transformed feature information, which is recorded as the reconstructed feature information.

本申请实施例对编码端采用反变换方式不做限制。The embodiment of the present application does not limit the inverse transformation method adopted by the encoding end.

在一些实施例中,编码端采用的反变换方式包括N次特征提取。也就是说,编码端对得到的量化后的第二特征信息进行N次特征提取,得到重建特征信息。In some embodiments, the inverse transformation method adopted by the encoding end includes N times of feature extraction. That is, the encoding end performs N times of feature extraction on the obtained quantized second feature information to obtain reconstructed feature information.

在一些实施例中,编码端采用的反变换方式包括N次特征提取和N次上采样。也就是说,编码端对得到的量化后的第二特征信息进行N次特征提取和N次上采样,得到重建特征信息。In some embodiments, the inverse transformation method adopted by the encoding end includes N times of feature extraction and N times of upsampling. That is, the encoding end performs N times of feature extraction and N times of upsampling on the obtained quantized second feature information to obtain reconstructed feature information.

本申请实施例对上述N次特征提取和N次上采样的具体执行顺序不做限制。The embodiment of the present application does not limit the specific execution order of the above-mentioned N feature extractions and N upsamplings.

在一种示例中,编码端可以先对量化后的第二特征信息进行N次连续的特征提取后,再进行N次连续的上采样。In an example, the encoding end may first perform N consecutive feature extractions on the quantized second feature information, and then perform N consecutive upsamplings.

在另一种示例中,上述N次特征提取和N次上采样穿插进行,即执行一次特征提取后执行一次上采样。In another example, the N times of feature extraction and N times of upsampling are performed alternately, that is, one upsampling is performed after one feature extraction.

需要说明的是,本申请实施例对编码端所采用的N次特征提取方式不做限制,例如包括多层卷积、残差连接、密集连接等特征提取方式中的至少一种。It should be noted that the embodiment of the present application does not limit the N-time feature extraction method adopted by the encoding end, for example, at least one of the feature extraction methods including multi-layer convolution, residual connection, dense connection, etc.

在一些实施例中,编码端对量化后的第二特征信息进行N次非局部注意力变换和N次上采样,得到重建特征信息,N为正整数。In some embodiments, the encoding end performs N non-local attention transformations and N upsamplings on the quantized second feature information to obtain reconstructed feature information, where N is a positive integer.

由于非局部注意力方式可以实现更高效的特征提取,能使得提取的特征保留更多的信息,且计算效率高,因此,本申请实施例中,编码端采用非局部注意力的方式对量化后的第二特征信息进行特征提取,以实现对量化后的第二特征信息的快速和准确特征提取。另外,编码端在根据第一特征信息生成第二特征信息时,进行了N次下采样,因此,此时,在反变换时编码端对应的执行N次上采样,以使重建得到的重建特征信息与第一特征信息的大小一致。Since the non-local attention method can achieve more efficient feature extraction, the extracted features can retain more information, and the calculation efficiency is high, therefore, in the embodiment of the present application, the encoding end uses the non-local attention method to extract features from the quantized second feature information to achieve fast and accurate feature extraction of the quantized second feature information. In addition, when the encoding end generates the second feature information based on the first feature information, it performs N downsampling. Therefore, at this time, the encoding end performs N upsampling correspondingly during the inverse transformation, so that the reconstructed feature information obtained by reconstruction is consistent with the size of the first feature information.

在一些实施例中,如图3所示,编码端通过反变换模块得到重建特征信息,该反变换模块包括N个非局部注意力模块和N个上采样模块。In some embodiments, as shown in FIG3 , the encoding end obtains the reconstructed feature information through an inverse transformation module, and the inverse transformation module includes N non-local attention modules and N upsampling modules.
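A minimal sketch of such an inverse transformation module follows, again with plain convolutions standing in for the non-local attention blocks; alternating attention and transposed-convolution upsampling is one possible arrangement, and N=2 is an illustrative assumption:

```python
import torch.nn as nn

def make_inverse_transform(ch=64, n=2):
    layers = []
    for _ in range(n):
        layers.append(nn.Conv2d(ch, ch, 3, padding=1))  # attention stand-in
        layers.append(nn.ConvTranspose2d(ch, ch, 4, stride=2, padding=1))  # 2x upsample
    return nn.Sequential(*layers)
```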

S403-A32、确定重建特征信息的概率分布。S403-A32, determine the probability distribution of the reconstructed feature information.

As can be seen from the above, the quantized second feature information is obtained by transforming the first feature information, and through the above steps the encoding end inverse-transforms the quantized second feature information to obtain the reconstructed feature information. This reconstructed feature information can therefore be understood as reconstruction information of the first feature information; that is, the probability distribution of the reconstructed feature information is similar or related to the probability distribution of the quantized first feature information. In this way, the encoding end can first determine the probability distribution of the reconstructed feature information, and then predict the probability distribution of the quantized first feature information from it.

在一些实施例中,重建特征信息的概率分布为正态分布或高斯分布,此时,确定重建特征信息的概率分布的过程为,根据重建特征信息中的各特征值,确定该重建特征信息的均值和方差矩阵,根据均值和方差矩阵,生成该重建特征信息的高斯分布。In some embodiments, the probability distribution of the reconstructed feature information is a normal distribution or a Gaussian distribution. In this case, the process of determining the probability distribution of the reconstructed feature information is to determine the mean and variance matrix of the reconstructed feature information based on each eigenvalue in the reconstructed feature information, and generate a Gaussian distribution of the reconstructed feature information based on the mean and variance matrix.
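One common realization of this step is sketched below, under the assumption that the reconstructed feature information carries the mean and (log-)scale in its two channel halves — one possible reading of "mean and variance matrix":

```python
import torch

def gaussian_from_features(recon_feat: torch.Tensor) -> torch.distributions.Normal:
    # Split the reconstructed feature info channel-wise into mean and
    # log-scale maps, then build the Gaussian probability model.
    mu, log_sigma = recon_feat.chunk(2, dim=1)
    sigma = torch.exp(log_sigma).clamp(min=1e-9)   # positive standard deviation
    return torch.distributions.Normal(mu, sigma)
```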

S403-A33、根据重建特征信息的概率分布,确定量化后的第一特征信息的概率分布。S403-A33. Determine the probability distribution of the quantized first feature information according to the probability distribution of the reconstructed feature information.

例如,根据重建特征信息的概率分布,预测量化后的第一特征信息中编码像素的概率;根据量化后的第一特征信息中编码像素的概率,得到量化后的第一特征信息的概率分布。For example, according to the probability distribution of the reconstructed feature information, the probability of the encoded pixel in the quantized first feature information is predicted; according to the probability of the encoded pixel in the quantized first feature information, the probability distribution of the quantized first feature information is obtained.
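For the quantized (integer) symbols, the per-symbol probability can then be obtained by integrating this distribution over each rounding bin — a standard discretization in learned compression, and an assumption here, since the text only states that pixel probabilities are predicted:

```python
def symbol_probs(dist, y_hat):
    # Probability mass of each quantized symbol: integrate the Gaussian
    # over the rounding bin [y_hat - 0.5, y_hat + 0.5].
    return dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
```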

S403-A4、根据量化后的第一特征信息的概率分布,对量化后的第一特征信息进行编码,得到第一码流。S403-A4. Encode the quantized first feature information according to the probability distribution of the quantized first feature information to obtain a first code stream.

根据上述步骤,确定出量化后的第一特征信息的概率分布后,使用该概率分布对量化后的第一特征信息进行编码,得到第一码流。According to the above steps, after the probability distribution of the quantized first feature information is determined, the quantized first feature information is encoded using the probability distribution to obtain a first code stream.

在一些实施例中,本申请实施例还包括确定当前图像的重建图像的步骤,即本申请实施例还包括如下S404:In some embodiments, the embodiment of the present application further includes a step of determining a reconstructed image of the current image, that is, the embodiment of the present application further includes the following S404:

S404、确定当前图像的重建图像。S404: Determine a reconstructed image of the current image.

在一些实施例中,上述S404包括如下步骤:In some embodiments, the above S404 includes the following steps:

S404-A、对量化后的第一特征信息进行多级时域融合,得到混合时空表征。S404-A, performing multi-level time domain fusion on the quantized first feature information to obtain a mixed time-space representation.

在一些实施例中,上述量化后的第一特征信息为编码端对第一特征信息进行量化后的特征信息。In some embodiments, the quantized first feature information is feature information obtained by quantizing the first feature information at the encoding end.

在一些实施例中,上述量化后的第一特征信息为编码端重建后的,例如,编码端对第二码流进行解码,得到量化后的第二特征信息,并根据量化后的第二特征信息,确定量化后的第一特征信息的概率分布,示例性的,编码端根据上述S403-A31至S403-A33的方法,得到量化后的第一特征信息的概率分布,进而使用量化后的第一特征信息的概率分布对第一码流进行解码,得到量化后的第一特征信息。In some embodiments, the quantized first feature information is reconstructed by the encoding end. For example, the encoding end decodes the second code stream to obtain the quantized second feature information, and determines the probability distribution of the quantized first feature information based on the quantized second feature information. Exemplarily, the encoding end obtains the probability distribution of the quantized first feature information according to the above methods S403-A31 to S403-A33, and then uses the probability distribution of the quantized first feature information to decode the first code stream to obtain the quantized first feature information.

接着,编码端对上述得到的量化后的第一特征信息进行多级时域融合,得到混合时空表征。Next, the encoding end performs multi-level time domain fusion on the quantized first feature information obtained above to obtain a mixed time-space representation.

本申请实施例中,为了提高重建图像的准确性,对量化后的第一特征信息进行多级的时域融合,即将量化后的第一特征信息不仅与当前图像的前一重建图像的特征信息进行融合,并且将量化后的第一特征信息与当前图像之前的多个重建图像进行特征融合,例如将t-1时刻、t-2时刻…、t-k时刻等多个时刻的重建图像与量化后的第一特征信息进行融合。这样可以避免当前图像的前一重建图像中的某信息被遮挡时,被遮挡的信息可以从当前图像之前的几张重建图像中得到,进而使得生成的混合时空表征包括更加准确、丰富和详细的特征信息。这样基于该混合时空表征实现对前一重建图像进行运动补偿生成当前图像的P个预测图像时,可以提高生成的预测图像的准确性,进而基于该准确的预测图像可以准确得到当前图像的重建图像,进而提高视频压缩效果。In an embodiment of the present application, in order to improve the accuracy of the reconstructed image, the quantized first feature information is subjected to multi-level time domain fusion, that is, the quantized first feature information is not only fused with the feature information of the previous reconstructed image of the current image, but also the quantized first feature information is fused with the multiple reconstructed images before the current image, for example, the reconstructed images at multiple times such as time t-1, time t-2..., time t-k are fused with the quantized first feature information. In this way, when certain information in the previous reconstructed image of the current image is blocked, the blocked information can be obtained from several reconstructed images before the current image, thereby making the generated mixed spatiotemporal representation include more accurate, rich and detailed feature information. In this way, when motion compensation is performed on the previous reconstructed image based on the mixed spatiotemporal representation to generate P predicted images of the current image, the accuracy of the generated predicted image can be improved, and then the reconstructed image of the current image can be accurately obtained based on the accurate predicted image, thereby improving the video compression effect.

本申请实施例对编码端对量化后的第一特征信息进行多级时域融合,得到混合时空表征的具体方式不做限制。The embodiment of the present application does not limit the specific method of performing multi-level time-domain fusion on the quantized first feature information at the encoding end to obtain a mixed time-space representation.

在一些实施例中,编码端通过递归聚合模块混合时空表征,即上述S404-A包括如下S404-A1的步骤:In some embodiments, the encoding end mixes the spatiotemporal representations through a recursive aggregation module, that is, the above S404-A includes the following step S404-A1:

S404-A1、编码端通过递归聚合模块将量化后的第一特征信息,与前一时刻递归聚合模块的隐式特征信息进行融合,得到混合时空表征。S404-A1. The encoding end fuses the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment through the recursive aggregation module to obtain a mixed spatiotemporal representation.

本申请实施例的递归聚合模块在每次生成混合时空表示时,会学习且保留从本次特征信息中所学习到的深层次特征信息,且将学习到的深层次特征作为隐式特征信息作用于下一次的混合时空表征生成,进而提高生成的混合时空表征的准确性。也就是说,本申请实施例中,前一时刻递归聚合模块的隐式特征信息包括了递归聚合模块所学习到的当前图像之前的多张重建图像的特征信息,这样,编码端通过递归聚合模块将量化后的第一特征信息,与前一时刻递归聚合模块的隐式特征信息进行融合,可以生成更加准确、丰富和详细的混合时空表征。The recursive aggregation module of the embodiment of the present application will learn and retain the deep-level feature information learned from the feature information each time it generates a mixed spatiotemporal representation, and use the learned deep-level features as implicit feature information for the next generation of the mixed spatiotemporal representation, thereby improving the accuracy of the generated mixed spatiotemporal representation. That is to say, in the embodiment of the present application, the implicit feature information of the recursive aggregation module at the previous moment includes the feature information of multiple reconstructed images before the current image learned by the recursive aggregation module. In this way, the encoding end fuses the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment through the recursive aggregation module, so as to generate a more accurate, rich and detailed mixed spatiotemporal representation.

本申请实施例对递归聚合模块的具体网络结构不做限制,例如为可以实现上述功能的任意网络结构。The embodiment of the present application does not limit the specific network structure of the recursive aggregation module, for example, it can be any network structure that can implement the above functions.

在一些实施例中,递归聚合模块由至少一个时空递归网络ST-LSTM堆叠而成,此时,上述混合时空表征Gt的表达公式如上述公式(1)所示。In some embodiments, the recursive aggregation module is formed by stacking at least one spatiotemporal recursive network ST-LSTM. In this case, the expression formula of the above-mentioned mixed spatiotemporal representation Gt is as shown in the above formula (1).
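A minimal sketch of such a recursive aggregation cell follows: a ConvLSTM-style cell, used here as a simplified stand-in for the stacked ST-LSTM of formula (1), keeps hidden state (h, c) across frames so that each output G_t also reflects features learned from earlier reconstructions; the initial state may be zero tensors:

```python
import torch
import torch.nn as nn

class RecursiveAggregator(nn.Module):
    def __init__(self, ch=64):
        super().__init__()
        self.gates = nn.Conv2d(2 * ch, 4 * ch, 3, padding=1)

    def forward(self, y_hat, state):
        h, c = state   # implicit feature information carried over from t-1
        i, f, o, g = self.gates(torch.cat([y_hat, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)   # h serves as the mixed representation G_t
        return h, (h, c)
```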

S404-B、根据混合时空表征对前一重建图像进行运动补偿,得到当前图像的P个预测图像,P为正整数。S404-B, performing motion compensation on the previous reconstructed image according to the hybrid spatiotemporal representation to obtain P predicted images of the current image, where P is a positive integer.

由上述可知,本申请实施例的混合时空表征融合的当前图像以及当前图像之前的多个重建图像的特征信息,这样根据该混合时空表征对前一重建图像进行运动补偿,可以得到精确的当前图像的P个预测图像。From the above, it can be seen that the hybrid spatiotemporal representation of the embodiment of the present application fuses the feature information of the current image and multiple reconstructed images before the current image, so that motion compensation is performed on the previous reconstructed image according to the hybrid spatiotemporal representation, and P accurate predicted images of the current image can be obtained.

本申请实施例对生成的P个预测图像的具体数量不做限制。即本申请实施例中,编码端可以采用不同的方式,根据混合时空表征对前一重建图像进行运动补偿,得到当前图像的P个预测图像。The embodiment of the present application does not limit the specific number of the generated P predicted images. That is, in the embodiment of the present application, the encoding end can use different methods to perform motion compensation on the previous reconstructed image according to the mixed spatiotemporal representation to obtain P predicted images of the current image.

本申请实施例对上述编码端根据混合时空表征对前一重建图像进行运动补偿的具体的方式不做限制。The embodiment of the present application does not limit the specific manner in which the encoding end performs motion compensation on the previous reconstructed image according to the mixed spatiotemporal representation.

在一些实施例中,上述P个预测图像中包括第一预测图像,该第一预测图像是编码端采用光流运动补偿方式得到 的,此时,上述S404-B包括如下S404-B1和S404-B2的步骤:In some embodiments, the P predicted images include a first predicted image, and the first predicted image is obtained by the encoder using optical flow motion compensation. In this case, S404-B includes the following steps S404-B1 and S404-B2:

S404-B1、根据混合时空表征,确定光流运动信息;S404-B1, determining optical flow motion information according to the mixed spatiotemporal representation;

S404-B2、根据光流运动信息对前一重建图像进行运动补偿,得到第一预测图像。S404-B2. Perform motion compensation on the previous reconstructed image according to the optical flow motion information to obtain a first predicted image.

本申请实施例对编码端根据混合时空表征,确定光流运动信息的具体方式不做限制。The embodiment of the present application does not limit the specific manner in which the encoding end determines the optical flow motion information based on the mixed spatiotemporal representation.

In some embodiments, the encoding end obtains the optical flow motion information through a pre-trained neural network model; that is, the neural network model can predict the optical flow motion information from the mixed spatiotemporal representation. In some embodiments, this neural network model may be called the first decoder, or the optical flow signal decoder Df. The encoding end inputs the mixed spatiotemporal representation G_t into the optical flow signal decoder Df to predict the optical flow motion information, obtaining the optical flow motion information f_{x,y} output by Df. Optionally, f_{x,y} is two-channel optical flow motion information.

示例性的,f x,y的生成公式如上述公式(2)所示。 Exemplarily, the generation formula of f x,y is shown in the above formula (2).

本申请实施例对上述光流信号解码器Df的具体网络结构不做限制。The embodiment of the present application does not limit the specific network structure of the optical flow signal decoder Df.

In some embodiments, the optical flow signal decoder Df is composed of multiple NLAMs and multiple upsampling modules. Exemplarily, as shown in FIG. 5, the optical flow signal decoder Df includes 1 NLAM, 3 LAMs and 4 upsampling modules, where an upsampling module is connected after the NLAM and an upsampling module is connected after each LAM.

需要说明的是,上述图5只是一种示例中,且图5中各参数的设定也仅为示例,本申请实施例的光流信号解码器Df的网络结构包括但不限于图5所示。It should be noted that the above FIG. 5 is only an example, and the settings of the parameters in FIG. 5 are also only examples. The network structure of the optical flow signal decoder Df in the embodiment of the present application includes but is not limited to that shown in FIG. 5 .

After the encoding end generates the optical flow motion information f_{x,y}, it uses f_{x,y} to perform motion compensation on the previous reconstructed image, obtaining the first predicted image X_1.

The embodiments of the present application do not limit the specific manner in which the encoding end performs motion compensation on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image. For example, the encoding end may use the optical flow motion information f_{x,y} to linearly interpolate the previous reconstructed image, and record the interpolated image as the first predicted image X_1.

In one possible implementation, the encoding end obtains the first predicted image X_1 through formula (3); from the warping operation described below, formula (3) is the backward warp X_1 = Warp(X̂_{t-1}, f_{x,y}), where X̂_{t-1} denotes the previous reconstructed image.

In this implementation, as shown in FIG. 5, the encoding end performs motion compensation on the previous reconstructed image through a Warping operation, using the optical flow motion information f_{x,y}, to obtain the first predicted image X_1.
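A minimal sketch of this Warping step is given below: bilinear backward warping of the previous reconstruction with the 2-channel flow via grid_sample, assuming flow channel 0 holds the horizontal displacement and channel 1 the vertical displacement:

```python
import torch
import torch.nn.functional as F

def warp(prev_rec: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    # prev_rec: (B, C, H, W) previous reconstructed image
    # flow:     (B, 2, H, W) optical flow f_{x,y} in pixels
    b, _, h, w = prev_rec.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).float().to(prev_rec.device)  # (H, W, 2)
    coords = base.unsqueeze(0) + flow.permute(0, 2, 3, 1)             # displaced positions
    # Normalize absolute pixel coordinates to [-1, 1] for grid_sample.
    gx = 2.0 * coords[..., 0] / max(w - 1, 1) - 1.0
    gy = 2.0 * coords[..., 1] / max(h - 1, 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)
    return F.grid_sample(prev_rec, grid, mode="bilinear", align_corners=True)
```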

在一些实施例中,上述P个预测图像中包括第二预测图像,该第二预测图像是解码端采用偏移运动补偿方式得到的,此时,上述S404-B包括如下S404-B-1至S404-B-3的步骤:In some embodiments, the P predicted images include a second predicted image, and the second predicted image is obtained by the decoding end using an offset motion compensation method. In this case, the S404-B includes the following steps S404-B-1 to S404-B-3:

S404-B-1、根据混合时空表征,得到当前图像对应的偏移量;S404-B-1. Obtain an offset corresponding to the current image according to the mixed spatiotemporal representation;

S404-B-2、对前一重建图像进行空间特征提取,得到参考特征信息;S404-B-2, extracting spatial features from the previous reconstructed image to obtain reference feature information;

S404-B-3、使用偏移量对参考特征信息进行运动补偿,得到第二预测图像。S404-B-3. Use the offset to perform motion compensation on the reference feature information to obtain a second predicted image.

本申请实施例对编码端根据混合时空表征,得到当前图像对应的偏移量的具体方式不做限制。The embodiment of the present application does not limit the specific manner in which the encoding end obtains the offset corresponding to the current image based on the mixed spatiotemporal representation.

在一些实施例中,编码端通过预先训练好的神经网络模型得到当前图像对应的偏移量,即该神经网络模型可以基于混合时空表征,预测出偏移量,该偏移量为有损的偏移量信息。在一些实施例中,该神经网络模型可以称为第二解码器,或可变卷积解码器Dm。编码端将混合时空表征Gt输入该可变卷积解码器Dm中进行偏移量信息的预测。In some embodiments, the encoder obtains the offset corresponding to the current image through a pre-trained neural network model, that is, the neural network model can predict the offset based on the mixed spatiotemporal representation, and the offset is lossy offset information. In some embodiments, the neural network model can be called a second decoder, or a variable convolution decoder Dm. The encoder inputs the mixed spatiotemporal representation Gt into the variable convolution decoder Dm to predict the offset information.

同时,编码端对前一重建图像进行空间特征提取,得到参考特征信息。例如,编码端通过空间特征提取模块SFE对前一重建图像进行空间特征提取,得到参考特征信息。At the same time, the encoding end extracts spatial features from the previous reconstructed image to obtain reference feature information. For example, the encoding end extracts spatial features from the previous reconstructed image through a spatial feature extraction module SFE to obtain reference feature information.

接着,编码端使用偏移量对提取得到的参考特征信息进行运动补偿,得到当前图像的第二预测图像。Next, the encoder uses the offset to perform motion compensation on the extracted reference feature information to obtain a second predicted image of the current image.

本申请实施例对编码端使用偏移量对提取得到的参考特征信息进行运动补偿,得到当前图像的第二预测图像的具体方式不做限制。In the embodiment of the present application, the encoding end uses the offset to perform motion compensation on the extracted reference feature information, and the specific method of obtaining the second predicted image of the current image is not limited.

在一种可能的实现方式中,编码端使用偏移量,对参考特征信息进行基于可变形卷积的运动补偿,得到第二预测图像。In a possible implementation, the encoder uses the offset to perform deformable convolution-based motion compensation on the reference feature information to obtain a second predicted image.

在一些实施例中,由于可变换卷积可以基于混合时空表征,生成当前图像对应的偏移量,因此,本申请实施例中,编码端将混合时空表征Gt,以及参考特征信息输入该可变换卷积中,该可变换卷积基于混合时空表征Gt生成当前图像对应的偏移量,且将该偏移量作用在参考特征信息上进行运动补偿,进而得到第二预测图像。In some embodiments, since the transformable convolution can generate an offset corresponding to the current image based on the mixed space-time representation, in an embodiment of the present application, the encoding end inputs the mixed space-time representation Gt and the reference feature information into the transformable convolution. The transformable convolution generates an offset corresponding to the current image based on the mixed space-time representation Gt, and applies the offset to the reference feature information for motion compensation, thereby obtaining a second predicted image.

Based on this, exemplarily, as shown in FIG. 6, the variable convolution decoder Dm of the embodiments of the present application includes a deformable convolution DCN. The encoding end inputs the previous reconstructed image into the spatial feature extraction module SFE for feature extraction, obtaining the reference feature information; then the mixed spatiotemporal representation G_t and the reference feature information are input into the deformable convolution DCN for offset extraction and motion compensation, obtaining the second predicted image X_2.

示例性的,编码端通过上述公式(4)生成第二预测图像X 2Exemplarily, the encoder generates the second predicted image X 2 by using the above formula (4).

The embodiments of the present application do not limit the specific network structure of the above variable convolution decoder Dm.

In some embodiments, as shown in FIG. 6, in order to further improve the accuracy of the second predicted image, the variable convolution decoder Dm includes, in addition to the deformable convolution DCN, 1 NLAM, 3 LAMs and 4 upsampling modules, where an upsampling module is connected after the NLAM and an upsampling module is connected after each LAM.

需要说明的是,上述图6只是一种示例中,且图6中各参数的设定也仅为示例,本申请实施例的可变卷积解码器Dm的网络结构包括但不限于图6所示。It should be noted that the above-mentioned Figure 6 is only an example, and the settings of the parameters in Figure 6 are also only examples. The network structure of the variable convolution decoder Dm of the embodiment of the present application includes but is not limited to that shown in Figure 6.

In the embodiments of the present application, as shown in FIG. 6, the encoding end first inputs the previous reconstructed image into the spatial feature extraction module SFE for feature extraction, obtaining the reference feature information. Then the mixed spatiotemporal representation G_t and the reference feature information are input into the deformable convolution DCN in the variable convolution decoder Dm for offset extraction and motion compensation, yielding a piece of feature information. This feature information is input into the NLAM and, after feature extraction by the NLAM, the 3 LAMs and the 4 upsampling modules, is finally restored as the second predicted image X_2.
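A minimal sketch of this offset branch follows, with torchvision's DeformConv2d standing in for the DCN block; predicting the offsets from G_t with a plain convolution is an illustrative assumption:

```python
import torch.nn as nn
from torchvision.ops import DeformConv2d

class OffsetCompensation(nn.Module):
    def __init__(self, ch=64, k=3):
        super().__init__()
        # Predict one (dx, dy) pair per kernel tap from the mixed
        # spatiotemporal representation G_t: 2 * k * k offset channels.
        self.offset_pred = nn.Conv2d(ch, 2 * k * k, 3, padding=1)
        self.dcn = DeformConv2d(ch, ch, k, padding=k // 2)

    def forward(self, g_t, ref_feat):
        offsets = self.offset_pred(g_t)        # offset for the current image
        return self.dcn(ref_feat, offsets)     # motion-compensated reference features
```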

According to the above method, the encoding end can determine P predicted images, for example a first predicted image and a second predicted image, and then performs the following step S404-C.

S404-C、根据P个预测图像,确定所述当前图像的重建图像。S404-C. Determine a reconstructed image of the current image according to the P predicted images.

在一些实施例中,若上述P个预测图像包括一个预测图像时,则根据该预测图像,确定当前图像的重建图像。In some embodiments, if the P predicted images include one predicted image, a reconstructed image of the current image is determined according to the predicted image.

例如,将该预测图像与当前图像的前一个或几个重建图像进行比较,计算损失,若该损失小,则说明该预测图像的预测精度较高,可以将该预测图像确定为当前图像的重建图像。For example, the predicted image is compared with one or several previous reconstructed images of the current image to calculate the loss. If the loss is small, it means that the prediction accuracy of the predicted image is high, and the predicted image can be determined as the reconstructed image of the current image.

再例如,若上述损失大,则说明该预测图像的预测精度较低,此时,可以根据当前图像的前一个或几个重建图像和该预测图像,确定当前图像的重建图像,例如,将该预测图像和当前图像的前一个或几个重建图像输入一神经网络中,得到当前图像的重建图像。For another example, if the above-mentioned loss is large, it means that the prediction accuracy of the predicted image is low. At this time, the reconstructed image of the current image can be determined based on the previous one or several reconstructed images of the current image and the predicted image. For example, the predicted image and the previous one or several reconstructed images of the current image are input into a neural network to obtain the reconstructed image of the current image.

在一些实施例中,上述S404-C包括如下S404-C-A和S404-C-B的步骤:In some embodiments, the above S404-C includes the following steps S404-C-A and S404-C-B:

S404-C-A、根据P个预测图像,确定当前图像的目标预测图像。S404-C-A. Determine a target predicted image for the current image based on P predicted images.

在该实现方式中,编码端首先根据P个预测图像,确定当前图像的目标预测图像,接着,根据该当前图像的目标预测图像实现当前图像的重建图像,进而提高重建图像的确定准确性。In this implementation, the encoding end first determines the target prediction image of the current image based on P prediction images, and then realizes the reconstructed image of the current image based on the target prediction image of the current image, thereby improving the determination accuracy of the reconstructed image.

本申请实施例对根据P个预测图像,确定当前图像的目标预测图像的具体方式不做限制。The embodiment of the present application does not limit the specific method of determining the target predicted image of the current image based on P predicted images.

在一些实施例中,若P=1,则将该一个预测图像确定为当前图像的目标预测图像。In some embodiments, if P=1, the one predicted image is determined as the target predicted image of the current image.

在一些实施例中,若P大于1,则S404-C-A包括S404-C-A11和S404-C-A12:In some embodiments, if P is greater than 1, S404-C-A includes S404-C-A11 and S404-C-A12:

S404-C-A11、根据P个预测图像,确定加权图像;S404-C-A11, determining a weighted image according to the P predicted images;

在该实现方式中,若根据上述方法,生成当前图像的多个预测图像,例如生成第一预测图像和第二预测图像时,则对这P个预测图像进行加权,生成加权图像,则根据该加权图像,得到目标预测图像。In this implementation, if multiple predicted images of the current image are generated according to the above method, for example, when a first predicted image and a second predicted image are generated, these P predicted images are weighted to generate a weighted image, and then the target predicted image is obtained based on the weighted image.

本申请实施例对根据P个预测图像,确定加权图像的具体方式不做限制。The embodiment of the present application does not limit the specific method of determining the weighted image based on P predicted images.

例如,确定P个预测图像对应的权重;并根据P个预测图像对应的权重,对P个预测图像进行加权,得到加权图像。For example, weights corresponding to P predicted images are determined; and according to the weights corresponding to the P predicted images, the P predicted images are weighted to obtain a weighted image.

示例性的,若P个预测图像包括第一预测图像和第二预测图像,则编码端确定第一预测图像对应的第一权重和第二预测图像对应的第二权重,根据第一权重和所述第二权重,对第一预测图像和第二预测图像进行加权,得到加权图像。Exemplarily, if the P predicted images include a first predicted image and a second predicted image, the encoding end determines a first weight corresponding to the first predicted image and a second weight corresponding to the second predicted image, and weights the first predicted image and the second predicted image according to the first weight and the second weight to obtain a weighted image.

其中,确定P个预测图像对应的权重的方式包括但不限于如下几种:The methods for determining the weights corresponding to the P predicted images include but are not limited to the following:

Method 1: the weights corresponding to the above P predicted images are preset weights. Assuming P=2, the first weight corresponding to the first predicted image and the second weight corresponding to the second predicted image may satisfy, for example, that the first weight equals the second weight, or that the ratio of the first weight to the second weight is 1/2, 1/3, 1/4, 2/1, 3/1, 4/1, and so on.

方式二,编码端根据混合时空表征进行自适应掩膜,得到P个预测图像对应的权重。In the second method, the encoder performs adaptive masking based on the mixed spatiotemporal representation to obtain weights corresponding to P predicted images.

Exemplarily, the encoding end generates the weights corresponding to the P predicted images through a neural network model that has been pre-trained for this purpose. In some embodiments, this neural network model is also called the third decoder, or the adaptive mask compensation decoder D_w. Specifically, the encoding end inputs the mixed spatiotemporal representation into the adaptive mask compensation decoder D_w for adaptive masking, obtaining the weights corresponding to the P predicted images. For example, the encoding end inputs the mixed spatiotemporal representation G_t into D_w, which outputs the first weight w1 for the first predicted image and the second weight w2 for the second predicted image; weighting the first predicted image X_1 and the second predicted image X_2 obtained above by w1 and w2 adaptively selects the information representing different regions of the predicted frame, thereby generating the weighted image.

示例性的,根据上述公式(5)生成加权图像X 3Exemplarily, the weighted image X 3 is generated according to the above formula (5).

在一些实施例中,上述P个预测图像对应的权重为一个矩阵,包括了预测图像中每个像素点对应的权重,这样在生成加权图像时,针对当前图像中的每个像素点,将P个预测图像中该像素点分别对应的预测值及其权重进行加权运算,得到该像素点的加权预测值,这样当前图像中每个像素点对应的加权预测值组成当前图像的加权图像。In some embodiments, the weights corresponding to the above-mentioned P predicted images are a matrix, including the weights corresponding to each pixel in the predicted image. When generating a weighted image, for each pixel in the current image, the predicted values and their weights corresponding to the pixel in the P predicted images are weighted to obtain the weighted predicted value of the pixel. In this way, the weighted prediction values corresponding to each pixel in the current image constitute the weighted image of the current image.
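A minimal sketch of this per-pixel weighting follows: with a sigmoid at the end of the mask decoder, w1 lies in (0, 1) per pixel, and taking w2 = 1 - w1 is one natural choice (an assumption; the text only requires per-pixel weights for the two predictions):

```python
import torch

def weighted_prediction(x1: torch.Tensor, x2: torch.Tensor, w1: torch.Tensor) -> torch.Tensor:
    # Per-pixel blend of the two motion-compensated predictions (formula (5));
    # w1 comes from the sigmoid output of the mask decoder, so w1 is in (0, 1).
    return w1 * x1 + (1.0 - w1) * x2   # weighted image X_3
```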

本申请实施例对上述自适应掩膜补偿解码器D w的具体网络结构不做限制。 The embodiment of the present application does not limit the specific network structure of the adaptive mask compensation decoder Dw .

In some embodiments, as shown in FIG. 7, the adaptive mask compensation decoder D_w includes 1 NLAM, 3 LAMs, 4 upsampling modules and a sigmoid function, where an upsampling module is connected after the NLAM and an upsampling module is connected after each LAM.

需要说明的是,上述图7只是一种示例中,且图7中各参数的设定也仅为示例,本申请实施例的自适应掩膜补偿解码器D w的网络结构包括但不限于图7所示。 It should be noted that FIG. 7 is only an example, and the settings of the parameters in FIG. 7 are also only examples. The network structure of the adaptive mask compensation decoder Dw in the embodiment of the present application includes but is not limited to that shown in FIG. 7.

在该实现方式中,编码端根据上述方法,对P个预测图像进行加权,得到加权图像后,执行如下S404-C-A12。In this implementation, the encoding end weights the P predicted images according to the above method, and after obtaining the weighted images, executes the following S404-C-A12.

S404-C-A12、根据加权图像,得到目标预测图像。S404-C-A12. Obtain a target prediction image based on the weighted image.

例如,将该加权图像,确定为目标预测图像。For example, the weighted image is determined as the target prediction image.

在一些实施例中,编码端还可以根据混合时空表征,得到当前图像的残差图像。In some embodiments, the encoding end may also obtain a residual image of the current image according to the mixed spatiotemporal representation.

Exemplarily, the encoding end obtains the residual image of the current image through a pre-trained neural network model that can be used to generate the residual image of the current image. In some embodiments, this neural network model is also called the fourth decoder, or the spatial texture enhancement decoder Dt. Specifically, the encoding end inputs the mixed spatiotemporal representation into the spatial texture enhancement decoder Dt for spatial texture enhancement, obtaining the residual image X_r = D_t(G_t) of the current image; this residual image X_r can perform texture enhancement on the predicted image.

本申请实施例中,对上述空间纹理增强解码器Dt的具体网络结构不做限制。In the embodiment of the present application, there is no limitation on the specific network structure of the above-mentioned spatial texture enhancement decoder Dt.

In some embodiments, as shown in FIG. 8, the spatial texture enhancement decoder Dt includes 1 NLAM, 3 LAMs and 4 upsampling modules, where an upsampling module is connected after the NLAM and an upsampling module is connected after each LAM.

需要说明的是,上述图8只是一种示例中,且图8中各参数的设定也仅为示例,本申请实施例的空间纹理增强解码器Dt的网络结构包括但不限于图8所示。It should be noted that the above FIG. 8 is only an example, and the settings of the parameters in FIG. 8 are also only examples. The network structure of the spatial texture enhancement decoder Dt in the embodiment of the present application includes but is not limited to that shown in FIG. 8 .

由于上述残差图像X r可以对预测图像进行纹理增强。基于此,在一些实施例中,上述S404-C-A中根据P个预测 图像,确定当前图像的目标预测图像包括如下S404-C-A21的步骤: Since the residual image Xr can perform texture enhancement on the predicted image, in some embodiments, determining the target predicted image of the current image according to the P predicted images in S404-CA includes the following step S404-C-A21:

S404-C-A21、根据P个预测图像和残差图像,得到目标预测图像。S404-C-A21. Obtain a target prediction image based on P prediction images and residual images.

例如,若P=1,则根据该预测图像和残差图像,得到目标预测图像,例如,将该预测图像与残差图像进行相加,生成目标预测图像。For example, if P=1, a target predicted image is obtained based on the predicted image and the residual image. For example, the predicted image and the residual image are added to generate the target predicted image.

再例如,若P大于1时,则首先根据P个预测图像,确定加权图像;再根据加权图像和残差图像,确定目标预测图像。For another example, if P is greater than 1, a weighted image is first determined based on the P predicted images; and then a target predicted image is determined based on the weighted image and the residual image.

The specific process by which the encoding end determines the weighted image according to the P predicted images may refer to the specific description of S404-C-A11 above, and is not repeated here.

举例说明,以P=2为例,根据上述方法,确定出第一预测图像对应的第一权重w1和第二预测图像对应的第二权重w2,可选的,根据上述公式(5)对第一预测图像和第二预测图像进行加权,得到加权图像X 3,接着,使用残差图像X r对加权图像X 3进行增强,得到目标预测图像。 For example, taking P=2 as an example, according to the above method, a first weight w1 corresponding to the first predicted image and a second weight w2 corresponding to the second predicted image are determined. Optionally, the first predicted image and the second predicted image are weighted according to the above formula (5) to obtain a weighted image X3 . Then, the weighted image X3 is enhanced using the residual image Xr to obtain a target predicted image.

示例性的,根据上述公式(6)生成目标预测图像X 4Exemplarily, the target predicted image X 4 is generated according to the above formula (6).

根据上述方法,编码端确定出当前图像的目标预测图像后,执行如下S404-C-B的步骤。According to the above method, after the encoding end determines the target predicted image of the current image, the following steps S404-C-B are executed.

S404-C-B、根据目标预测图像,确定当前图像的重建图像。S404-C-B. Determine a reconstructed image of the current image based on the target predicted image.

在一些实施例中,将该目标预测图像与当前图像的前一个或几个重建图像进行比较,计算损失,若该损失小,则说明该目标预测图像的预测精度较高,可以将该目标预测图像确定为当前图像的重建图像。若上述损失大,则说明该目标预测图像的预测精度较低,此时,可以根据当前图像的前一个或几个重建图像和该目标预测图像,确定当前图像的重建图像,例如,将该目标预测图像和当前图像的前一个或几个重建图像输入一神经网络中,得到当前图像的重建图像。In some embodiments, the target predicted image is compared with one or several previous reconstructed images of the current image to calculate the loss. If the loss is small, it means that the prediction accuracy of the target predicted image is high, and the target predicted image can be determined as the reconstructed image of the current image. If the above loss is large, it means that the prediction accuracy of the target predicted image is low. At this time, the reconstructed image of the current image can be determined based on one or several previous reconstructed images of the current image and the target predicted image. For example, the target predicted image and one or several previous reconstructed images of the current image are input into a neural network to obtain the reconstructed image of the current image.

在一些实施例中,为了进一步提高重建图像的确定准确性,则编码端根据当前图像和目标预测图像,确定当前图像的残差值;对残差值进行编码,得到残差码流。此时,则本申请实施例还包括残差解码,上述S404-C-B包括如下S404-C-B1和S404-C-B2的步骤:In some embodiments, in order to further improve the accuracy of determining the reconstructed image, the encoder determines the residual value of the current image based on the current image and the target predicted image; encodes the residual value to obtain a residual code stream. At this time, the embodiment of the present application also includes residual decoding, and the above S404-C-B includes the following steps S404-C-B1 and S404-C-B2:

S404-C-B1、对残差码流进行解码,得到当前图像的残差值;S404-C-B1, decoding the residual code stream to obtain a residual value of the current image;

S404-C-B2、根据目标预测图像和残差值,得到重建图像。S404-C-B2. Obtain a reconstructed image based on the target predicted image and the residual value.

本申请实施例中,为了提高重建图像的效果,则编码端还通过残差编码的方式,生成残差码流,具体是,编码端确定当前图像的残差值,对该残差值进行编码生成残差码流。对应的,编码端对残差码流进行解码,得到当前图像的残差值,并根据目标预测图像和残差值,得到重建图像。In the embodiment of the present application, in order to improve the effect of reconstructing the image, the encoding end also generates a residual code stream by residual coding. Specifically, the encoding end determines the residual value of the current image, encodes the residual value to generate a residual code stream. Correspondingly, the encoding end decodes the residual code stream to obtain the residual value of the current image, and obtains the reconstructed image according to the target predicted image and the residual value.

本申请实施例对上述当前图像的残差值的具体表示形式不做限制。The embodiment of the present application does not limit the specific representation form of the residual value of the current image.

在一种可能的实现方式中,当前图像的残差值为一个矩阵,该矩阵中的每个元素为当前图像中每个像素点对应的残差值。这样,编码端可以逐像素的,将目标预测图像中每个像素点对应的残差值和预测值进行相加,得到每个像素点的重建值,进而得到当前图像的重建图像。以当前图像中的第i个像素点为例,在目标预测图像中,得到该第i个像素点对应的预测值,以及从当前图像的残差值中得到该第i个像素点对应的残差值,接着,将该第i个像素点对应的预测值和残差值进行相加,得到该第i个像素点对应的重建值。针对当前图像中的每个像素点,参照上述第i个像素点,可以得到当前图像中每个像素点对应的重建值,当前图像中每个像素点对应的重建值,组成当前图像的重建图像。In a possible implementation, the residual value of the current image is a matrix, and each element in the matrix is the residual value corresponding to each pixel in the current image. In this way, the encoding end can add the residual value and the predicted value corresponding to each pixel in the target predicted image pixel by pixel to obtain the reconstruction value of each pixel, and then obtain the reconstructed image of the current image. Taking the i-th pixel in the current image as an example, in the target predicted image, the predicted value corresponding to the i-th pixel is obtained, and the residual value corresponding to the i-th pixel is obtained from the residual value of the current image. Then, the predicted value and the residual value corresponding to the i-th pixel are added to obtain the reconstruction value corresponding to the i-th pixel. For each pixel in the current image, referring to the above i-th pixel, the reconstruction value corresponding to each pixel in the current image can be obtained, and the reconstruction value corresponding to each pixel in the current image constitutes the reconstructed image of the current image.
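This pixel-wise reconstruction amounts to a simple addition, sketched below; clamping the result to the valid sample range is an added assumption, not stated above:

```python
import torch

def reconstruct(pred: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    # Pixel-wise: reconstruction = target predicted image + decoded residual,
    # clamped to a [0, 1] sample range (the clamp is an added assumption).
    return (pred + residual).clamp(0.0, 1.0)
```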

本申请实施例对编码端得到当前图像的残差值的具体方式不做限制,也就是说,本申请实施例对编解码两端所采用的残差编解码的方式不做限制。The embodiment of the present application does not limit the specific method in which the encoding end obtains the residual value of the current image, that is, the embodiment of the present application does not limit the residual encoding and decoding method adopted by both ends of the encoding and decoding.

在一种示例中,编码端确定出当前图像的目标预测图像,接着,根据当前图像和目标预测图像,得到当前图像的残差值,例如,将当前图像和目标预测图像的差值确定为当前图像的残差值。接着,对当前图像的残差值进行编码,生成残差编码。可选的,可以对当前图像的残差值进行变换,得到变换系数,对变换系数进行量化得到量化系数,对量化系数进行编码,得到残差码流。对应的,编码端解码残差码流,得到当前图像的残差值,例如解码残差码流,得到量化系数,对量化系数进行反量化和反变换,得到当前图像的残差值。接着,再根据上述方法,将目标预测图像和当前图像对应的残差值进行相加,得到当前图像的重建图像。In one example, the encoding end determines the target prediction image of the current image, and then obtains the residual value of the current image based on the current image and the target prediction image, for example, the difference between the current image and the target prediction image is determined as the residual value of the current image. Then, the residual value of the current image is encoded to generate residual coding. Optionally, the residual value of the current image can be transformed to obtain a transformation coefficient, the transformation coefficient is quantized to obtain a quantization coefficient, and the quantization coefficient is encoded to obtain a residual code stream. Correspondingly, the encoding end decodes the residual code stream to obtain the residual value of the current image, for example, decodes the residual code stream to obtain a quantization coefficient, dequantizes and de-transforms the quantization coefficient to obtain the residual value of the current image. Then, according to the above method, the residual values corresponding to the target prediction image and the current image are added to obtain a reconstructed image of the current image.

在一些实施例中,编码端可以采用神经网络的方法,对当前图像和当前图像的目标预测图像进行处理,生成当前图像的残差值,并对当前图像的残差值进行编码,生成残差码流。In some embodiments, the encoding end may use a neural network method to process the current image and the target predicted image of the current image to generate a residual value of the current image, and encode the residual value of the current image to generate a residual code stream.

本申请实施例中,编码端根据上述方法,可以得到当前图像的重建图像。In the embodiment of the present application, the encoding end can obtain a reconstructed image of the current image according to the above method.

可选的,可以将该重建图像进行直接显示。Optionally, the reconstructed image may be displayed directly.

可选的,还可以将该重建图像存入缓存中,用于后续图像的编码。Optionally, the reconstructed image may also be stored in a cache for use in encoding subsequent images.

本申请实施例提供的视频编码方法,编码端通过对当前图像以及当前图像的前一重建图像进行特征融合,得到第一特征信息;对第一特征信息进行量化,得到量化后第一特征信息;对量化后的第一特征信息进行编码,得到第一码流,以使解码端解码第一码流,确定量化后的第一特征信息,对量化后的第一特征信息进行多级时域融合,得到混合时空表征;根据混合时空表征对所述前一重建图像进行运动补偿,得到当前图像的P个预测图像;进而根据P个预测图像,确定当前图像的重建图像。即本申请,为了提高重建图像的准确性,对量化后的第一特征信息进行多级时域融合,例如将量化后的第一特征信息与当前图像之前的多个重建图像进行特征融合,这样可以避免当前图像的前一重建图像中的某信息被遮挡时,被遮挡的信息可以从当前图像之前的几张重建图像中得到,进而使得生成的混合时空表征包括更加准确、丰富和详细的特征信息。这样基于该混合时空表征对前一重建图像进行运动补偿时,可以生成高精度的P个预测图像时,基于该高精度的P个预测图像可以准确得到当前图像的重建图像,进而提高视频压缩效果。The video encoding method provided in the embodiment of the present application is that the encoding end obtains the first feature information by performing feature fusion on the current image and the previous reconstructed image of the current image; quantizes the first feature information to obtain the quantized first feature information; encodes the quantized first feature information to obtain the first code stream, so that the decoding end decodes the first code stream, determines the quantized first feature information, performs multi-level time domain fusion on the quantized first feature information to obtain a mixed time-space representation; performs motion compensation on the previous reconstructed image according to the mixed time-space representation to obtain P predicted images of the current image; and then determines the reconstructed image of the current image according to the P predicted images. That is, in order to improve the accuracy of the reconstructed image, the present application performs multi-level time domain fusion on the quantized first feature information, for example, the quantized first feature information is feature fused with multiple reconstructed images before the current image, so that when certain information in the previous reconstructed image of the current image is blocked, the blocked information can be obtained from several reconstructed images before the current image, so that the generated mixed time-space representation includes more accurate, rich and detailed feature information. In this way, when motion compensation is performed on the previous reconstructed image based on the mixed spatiotemporal representation, P high-precision predicted images can be generated. Based on the high-precision P predicted images, the reconstructed image of the current image can be accurately obtained, thereby improving the video compression effect.

本申请实施例中,提出一种端到端的基于神经网络的编解码框架,该基于神经网络的编解码框架包括基于神经网络的编码器和基于神经网络的解码器。下面结合的本申请一种可能的基于神经网络的编码器,对本申请实施例的编码 过程进行介绍。In an embodiment of the present application, an end-to-end neural network-based encoding and decoding framework is proposed, and the neural network-based encoding and decoding framework includes a neural network-based encoder and a neural network-based decoder. The encoding process of the embodiment of the present application is introduced below in combination with a possible neural network-based encoder of the present application.

图12为本申请一实施例涉及的一种基于神经网络的编码器的网络结构示意图,包括:时空特征提取模块、反变换模块、递归聚合模块和混合运动补偿模块。12 is a schematic diagram of the network structure of a neural network-based encoder according to an embodiment of the present application, including: a spatiotemporal feature extraction module, an inverse transformation module, a recursive aggregation module and a hybrid motion compensation module.

其中,时空特征提取模块用于对级联后的当前图像和前一重建图像进行特征提取和下采样,得到第一特征信息。The spatiotemporal feature extraction module is used to perform feature extraction and down-sampling on the cascaded current image and the previous reconstructed image to obtain first feature information.

反变换模块用于对量化后的第二特征信息进行反变换,得到第一特征信息的重建特征信息,示例性的,其网络结构如图3所示。The inverse transformation module is used to perform an inverse transformation on the quantized second feature information to obtain reconstructed feature information of the first feature information. Exemplarily, its network structure is shown in FIG3 .

递归聚合模块用于对量化后的第一特征信息进行多级时域融合,得到混合时空表征,示例性的,其网络结构如图4所示。The recursive aggregation module is used to perform multi-level time-domain fusion on the quantized first feature information to obtain a mixed spatiotemporal representation. Exemplarily, its network structure is shown in FIG4 .

混合运动补偿模块用于对混合时空表征进行混合运动补偿,得到当前图像的目标预测图像,示例性的,混合运动补偿模块可以包括图5所示的第一解码器、和/或图6所示的第二解码器,可选的,若混合运动补偿模块包括第一解码器和第二解码器时,则该混合运动补偿模块还可以包括图7所示的第三解码器。在一些实施例中,该混合运动补偿模块还可以包括如图8所示的第四解码器。The hybrid motion compensation module is used to perform hybrid motion compensation on the hybrid spatiotemporal representation to obtain a target predicted image of the current image. Exemplarily, the hybrid motion compensation module may include the first decoder shown in FIG5 and/or the second decoder shown in FIG6. Optionally, if the hybrid motion compensation module includes the first decoder and the second decoder, the hybrid motion compensation module may also include the third decoder shown in FIG7. In some embodiments, the hybrid motion compensation module may also include a fourth decoder as shown in FIG8.

示例性的，本申请实施例以混合运动补偿模块包括第一解码器、第二解码器、第三解码器和第四解码器为例进行说明。Exemplarily, the embodiments of the present application are described by taking as an example that the hybrid motion compensation module includes a first decoder, a second decoder, a third decoder and a fourth decoder.

在上述图12所示的基于神经网络的编码器的基础上,结合图13对本申请实施例一种可能的视频编码方法进行介绍。Based on the neural network-based encoder shown in FIG. 12 above, a possible video encoding method of an embodiment of the present application is introduced in combination with FIG. 13 .

图13为本申请一实施例提供的视频编码流程示意图,如图13所示,包括:FIG. 13 is a schematic diagram of a video encoding process provided by an embodiment of the present application, as shown in FIG. 13 , including:

S501、对当前图像以及当前图像的前一重建图像进行特征融合,得到第一特征信息。S501: Perform feature fusion on a current image and a previous reconstructed image of the current image to obtain first feature information.

例如，编码端将当前图像X_t和当前图像的前一重建图像X̂_{t-1}进行通道间的级联，得到级联图像X_cat；接着，对级联后的图像X_cat进行特征提取，得到第一特征信息。For example, the encoding end concatenates the current image X_t and the previous reconstructed image X̂_{t-1} of the current image along the channel dimension to obtain X_cat, and then performs feature extraction on the concatenated image X_cat to obtain the first feature information.

上述S501的具体实现过程参照上述S401的描述,在此不再赘述。The specific implementation process of the above S501 refers to the description of the above S401 and will not be repeated here.
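
As a concrete illustration of S501, the sketch below shows channel concatenation followed by convolutional feature extraction in PyTorch. It is a minimal sketch under stated assumptions: the channel widths and the two stride-2 stages are illustrative, and plain convolutions stand in for the non-local attention blocks that the spatiotemporal feature extraction module actually uses.

```python
import torch
import torch.nn as nn

class SpatioTemporalFeatureExtractor(nn.Module):
    """S501 sketch: concatenate the current image X_t and the previous
    reconstruction along the channel axis, then extract downsampled
    features. Channel widths and stride-2 stages are assumptions."""
    def __init__(self, in_ch=3, feat_ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2 * in_ch, feat_ch, 5, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(feat_ch, feat_ch, 5, stride=2, padding=2),
        )

    def forward(self, x_t, x_prev_rec):
        x_cat = torch.cat([x_t, x_prev_rec], dim=1)  # channel concatenation
        return self.net(x_cat)  # first feature information
```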

S502、对第一特征信息进行量化,得到量化后的第一特征信息。S502: quantize the first feature information to obtain quantized first feature information.

上述S502的具体实现过程参照上述S402的描述,在此不再赘述。The specific implementation process of the above S502 refers to the description of the above S402 and will not be repeated here.
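
A hedged sketch of S502 follows. The text does not fix a quantizer; the sketch assumes the common learned-codec practice of integer rounding at inference with additive uniform noise as the differentiable training-time proxy.

```python
import torch

def quantize(y: torch.Tensor, training: bool = False) -> torch.Tensor:
    """S502 sketch: round features to integers at inference; during
    training, add uniform noise in [-0.5, 0.5) so gradients can flow
    (a standard surrogate, assumed rather than mandated by the text)."""
    if training:
        return y + torch.empty_like(y).uniform_(-0.5, 0.5)
    return torch.round(y)
```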

S503、根据第一特征信息进行特征变换,得到第二特征信息。S503: Perform feature transformation according to the first feature information to obtain second feature information.

上述S503的具体实现过程参照上述S403-A1的描述,在此不再赘述。The specific implementation process of the above S503 refers to the description of the above S403-A1, which will not be repeated here.

S504、对第二特征信息进行量化后再编码,得到第二码流。S504: quantize and then encode the second feature information to obtain a second bit stream.

上述S504的具体实现过程参照上述S403-A2的描述,在此不再赘述。The specific implementation process of the above S504 refers to the description of the above S403-A2, which will not be repeated here.

S505、对第二码流进行解码,得到量化后的第二特征信息。S505: Decode the second code stream to obtain quantized second feature information.

上述S505的具体实现过程参照上述S403-A3的描述,在此不再赘述。The specific implementation process of the above S505 refers to the description of the above S403-A3, which will not be repeated here.
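
Steps S503–S505 follow the familiar hyperprior pattern: a second transform maps the first feature information to side information, which is quantized, written to the second code stream, and decoded back. The sketch below assumes a small strided-convolution hyper-encoder in place of the scheme's non-local attention stages, and elides the actual entropy coder.

```python
import torch.nn as nn

class HyperEncoder(nn.Module):
    """S503 sketch: map the first feature information y to second
    (side) feature information z. Strided convolutions stand in for
    the N non-local attention transforms (an assumption)."""
    def __init__(self, feat_ch=128, hyper_ch=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(feat_ch, hyper_ch, 5, stride=2, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(hyper_ch, hyper_ch, 5, stride=2, padding=2),
        )

    def forward(self, y):
        # S504/S505 would quantize the result, entropy-code it into the
        # second code stream, and decode it back to the quantized second
        # feature information; the coder itself is elided in this sketch.
        return self.net(y)
```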

S506、通过反变换模块对量化后的第二特征信息进行反变换,得到重建特征信息。S506 . Perform an inverse transformation on the quantized second feature information through an inverse transformation module to obtain reconstructed feature information.

示例性的,该反变换模块的具体网络结构如图3所示,包括2个非局部自注意力模块和2个上采样模块。Exemplarily, the specific network structure of the inverse transformation module is shown in FIG3 , including two non-local self-attention modules and two upsampling modules.

例如，编码端将量化后的第二特征信息输入反变换模块进行反变换，该反变换模块输出重建特征信息。For example, the encoding end inputs the quantized second feature information into the inverse transformation module for inverse transformation, and the inverse transformation module outputs the reconstructed feature information.

上述S506的具体实现过程参照上述S403-A31的描述,在此不再赘述。The specific implementation process of the above S506 refers to the description of the above S403-A31, which will not be repeated here.
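
FIG. 3 describes the inverse transform as non-local self-attention plus upsampling stages; below is a minimal sketch that keeps the two exact 2x upsampling stages and, as an assumption, replaces the attention blocks with transposed convolutions.

```python
import torch.nn as nn

class InverseTransform(nn.Module):
    """S506 sketch: map the quantized second feature information back
    to reconstructed feature information via two exact 2x upsampling
    stages (attention blocks omitted for brevity)."""
    def __init__(self, hyper_ch=64, feat_ch=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(hyper_ch, feat_ch, 5, stride=2,
                               padding=2, output_padding=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(feat_ch, feat_ch, 5, stride=2,
                               padding=2, output_padding=1),
        )

    def forward(self, z_hat):
        return self.net(z_hat)
```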

S507、确定重建特征信息的概率分布。S507: Determine the probability distribution of the reconstructed feature information.

S508、根据重建特征信息的概率分布,预测得到量化后的第一特征信息的概率分布。S508: Predict the probability distribution of the quantized first feature information according to the probability distribution of the reconstructed feature information.

S509、根据量化后的第一特征信息的概率分布,对量化后的第一特征信息进行编码,得到第一码流。S509: Encode the quantized first feature information according to the probability distribution of the quantized first feature information to obtain a first code stream.

上述S507至S509的具体实现过程参照上述S403-A32、S403-A33和S403-A4的描述，在此不再赘述。The specific implementation process of the above S507 to S509 refers to the description of the above S403-A32, S403-A33 and S403-A4, which will not be repeated here.
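
In S507–S509, the reconstructed feature information parameterizes a per-element probability model for the quantized first feature information, and those probabilities drive the entropy coder. A common concrete choice, assumed here rather than stated in the text, is a conditional Gaussian whose mean and scale are predicted by a small network; the sketch computes the per-symbol probability mass that an arithmetic coder would consume.

```python
import torch
import torch.nn as nn

class EntropyParams(nn.Module):
    """S508 sketch: predict per-element Gaussian parameters (mu, sigma)
    for the quantized first feature information from the reconstructed
    feature information (the Gaussian model is an assumption)."""
    def __init__(self, feat_ch=128):
        super().__init__()
        self.net = nn.Conv2d(feat_ch, 2 * feat_ch, 3, padding=1)

    def forward(self, rec_feat):
        mu, log_sigma = self.net(rec_feat).chunk(2, dim=1)
        return mu, log_sigma.exp()

def symbol_pmf(y_hat, mu, sigma):
    """Probability mass of each integer symbol, integrated over its
    quantization bin [y_hat - 0.5, y_hat + 0.5); an arithmetic coder
    (S509) would encode/decode against these probabilities."""
    dist = torch.distributions.Normal(mu, sigma)
    return dist.cdf(y_hat + 0.5) - dist.cdf(y_hat - 0.5)
```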

本申请实施例还包括确定重建图像的过程。The embodiment of the present application also includes a process of determining a reconstructed image.

S510、根据量化后的第一特征信息的概率分布,对第一码流进行解码,得到量化后的第一特征信息。S510: Decode the first bitstream according to the probability distribution of the quantized first feature information to obtain the quantized first feature information.

S511、通过递归聚合模块,对量化后的第一特征信息进行多级时域融合,得到混合时空表征。S511. Perform multi-level time-domain fusion on the quantized first feature information through a recursive aggregation module to obtain a mixed time-space representation.

可选的,递归聚合模块由至少一个时空递归网络堆叠而成。Optionally, the recursive aggregation module is formed by stacking at least one spatiotemporal recursive network.

示例性的,递归聚合模块的网络结构如图4所示。Exemplarily, the network structure of the recursive aggregation module is shown in FIG4 .

例如，编码端将量化后的第一特征信息输入递归聚合模块，以使递归聚合模块将量化后的第一特征信息与前一时刻递归聚合模块的隐式特征信息进行融合，进而输出混合时空表征。上述S511的具体实现过程参照上述S404-A的描述，在此不再赘述。For example, the encoding end inputs the quantized first feature information into the recursive aggregation module, so that the recursive aggregation module fuses the quantized first feature information with the implicit feature information of the recursive aggregation module at the previous moment, and then outputs the mixed spatiotemporal representation. The specific implementation process of the above S511 refers to the description of the above S404-A, which will not be repeated here.
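
A ConvLSTM cell is one standard spatiotemporal recursive network and is used below as an illustrative stand-in for the stacked structure of FIG. 4: the quantized first feature information of frame t is fused with the hidden (implicit) state carried over from frame t-1, so the output mixes information from all earlier reconstructed frames.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """S511 sketch: one recursive aggregation cell. The (h, c) state is
    carried across frames, so each output fuses the current quantized
    features with implicit features of all previous time steps."""
    def __init__(self, in_ch=128, hid_ch=128):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, 3, padding=1)

    def forward(self, y_hat, state):
        h, c = state
        i, f, o, g = self.gates(torch.cat([y_hat, h], dim=1)).chunk(4, dim=1)
        c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)  # mixed spatiotemporal repr.
        return h, (h, c)
```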

S512、通过第一解码器对混合时空表征进行处理,得到第一预测图像。S512: Process the mixed spatiotemporal representation through a first decoder to obtain a first predicted image.

根据上述S511得到混合时空表征后，将该混合时空表征和前一重建图像输入混合运动补偿模块进行混合运动补偿，得到当前图像的目标预测图像。After the mixed spatiotemporal representation is obtained in the above S511, the mixed spatiotemporal representation and the previous reconstructed image are input into the hybrid motion compensation module for hybrid motion compensation to obtain the target predicted image of the current image.

具体是,通过第一解码器对混合时空表征进行处理,确定光流运动信息,并根据光流运动信息对前一重建图像进行运动补偿,得到第一预测图像。Specifically, the mixed spatiotemporal representation is processed by a first decoder to determine optical flow motion information, and motion compensation is performed on a previous reconstructed image according to the optical flow motion information to obtain a first predicted image.

可选的,第一解码器的网络结构如图5所示。Optionally, the network structure of the first decoder is shown in FIG5 .

上述S512的具体实现过程,参照上述S404-B1和S404-B2的具体描述,在此不再赘述。For the specific implementation process of the above S512, please refer to the specific description of the above S404-B1 and S404-B2, which will not be repeated here.
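
The flow-regression head of the first decoder is elided below; the sketch shows only the backward-warping step that applies the regressed optical flow to the previous reconstruction, implemented with grid_sample as is common practice (an assumed realization, not the patent's mandated one).

```python
import torch
import torch.nn.functional as F

def flow_warp(prev_rec: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """S512 sketch: backward-warp the previous reconstructed image
    (N, C, H, W) with a dense flow field (N, 2, H, W) in pixel units
    to obtain the first predicted image."""
    n, _, h, w = prev_rec.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(prev_rec)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                         # displaced
    # normalize sampling coordinates to [-1, 1] as grid_sample expects
    gx = 2.0 * coords[:, 0] / (w - 1) - 1.0
    gy = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((gx, gy), dim=-1)                      # (N, H, W, 2)
    return F.grid_sample(prev_rec, grid, align_corners=True)
```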

S513、通过第二解码器对混合时空表征进行处理,得到第二预测图像。S513: Process the mixed spatiotemporal representation through a second decoder to obtain a second predicted image.

具体是，通过SFE对前一重建图像进行空间特征提取，得到参考特征信息；将参考特征信息和混合时空表征输入第二解码器，以使第二解码器根据混合时空表征得到偏移量，并使用该偏移量对参考特征信息进行运动补偿，得到第二预测图像。Specifically, spatial feature extraction is performed on the previous reconstructed image through the SFE to obtain reference feature information; the reference feature information and the mixed spatiotemporal representation are input into the second decoder, so that the second decoder derives an offset from the mixed spatiotemporal representation and uses the offset to perform motion compensation on the reference feature information to obtain the second predicted image.

可选的,第二解码器的网络结构如图6所示。Optionally, the network structure of the second decoder is shown in FIG6 .

上述S513的具体实现过程,参照上述S404-B-1至S404-B-3的具体描述,在此不再赘述。For the specific implementation process of the above S513, please refer to the specific description of S404-B-1 to S404-B-3, which will not be repeated here.
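
For S513, torchvision's deform_conv2d provides the core deformable-convolution operator; the offset head, kernel size and channel widths below are assumptions, and the sketch assumes the mixed spatiotemporal representation has the same spatial size as the reference features.

```python
import torch
import torch.nn as nn
from torchvision.ops import deform_conv2d

class DeformableCompensation(nn.Module):
    """S513 sketch: offsets regressed from the mixed spatiotemporal
    representation steer a 3x3 deformable convolution over the
    reference feature information, yielding the second prediction."""
    def __init__(self, feat_ch=128, k=3):
        super().__init__()
        self.offset_head = nn.Conv2d(feat_ch, 2 * k * k, 3, padding=1)
        self.weight = nn.Parameter(torch.randn(feat_ch, feat_ch, k, k) * 0.01)

    def forward(self, ref_feat, mixed_repr):
        offset = self.offset_head(mixed_repr)  # (N, 2*k*k, H, W)
        return deform_conv2d(ref_feat, offset, self.weight, padding=1)
```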

S514、通过第三解码器对混合时空表征进行处理,得到第一预测图像对应的第一权重和第二预测图像对应的第二权重。S514 . Process the mixed spatiotemporal representation through a third decoder to obtain a first weight corresponding to the first predicted image and a second weight corresponding to the second predicted image.

具体是，将混合时空表征输入第三解码器进行自适应掩膜，得到第一预测图像对应的第一权重和第二预测图像对应的第二权重。Specifically, the mixed spatiotemporal representation is input into the third decoder for adaptive masking to obtain a first weight corresponding to the first predicted image and a second weight corresponding to the second predicted image.

可选的,第三解码器的网络结构如图7所示。Optionally, the network structure of the third decoder is shown in FIG7 .

上述S514的具体实现过程,参照上述S404-C-A11中方式二的具体描述,在此不再赘述。For the specific implementation process of the above S514, please refer to the specific description of method 2 in the above S404-C-A11, which will not be repeated here.

S515、根据第一权重和第二权重,对第一预测图像和第二预测图像进行加权,得到加权图像。S515 . Weight the first predicted image and the second predicted image according to the first weight and the second weight to obtain a weighted image.

例如,将第一权重与第一预测图像的乘积,与第二权重与第二预测图像的乘积相加,得到加权图像。For example, the product of the first weight and the first predicted image is added to the product of the second weight and the second predicted image to obtain a weighted image.

S516、通过第四解码器对混合时空表征进行处理,得到当前图像的残差图像。S516: Process the mixed spatiotemporal representation through a fourth decoder to obtain a residual image of the current image.

具体是,将混合时空表征输入第四解码器进行处理,得到当前图像的残差图像。Specifically, the mixed spatiotemporal representation is input into the fourth decoder for processing to obtain a residual image of the current image.

可选的,第四解码器的网络结构如图8所示。Optionally, the network structure of the fourth decoder is shown in FIG8 .

上述S516的具体实现过程,参照上述S404-C-A12的具体描述,在此不再赘述。For the specific implementation process of the above S516, please refer to the specific description of the above S404-C-A12, which will not be repeated here.

S517、根据加权图像和残差图像,确定目标预测图像。S517: Determine a target prediction image according to the weighted image and the residual image.

例如,将加权图像和残差图像相加,确定为目标预测图像。For example, the weighted image and the residual image are added together to determine the target prediction image.
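
Steps S514–S517 can be summarized in a few lines. The sketch assumes a sigmoid-activated adaptive mask m, so the first weight is m and the second weight is 1 - m, which is one common realization of the third decoder's masking; the fourth decoder's residual image is then added to the weighted image.

```python
import torch

def fuse_predictions(pred1, pred2, mask_logits, residual_image):
    """S514-S517 sketch: adaptive per-pixel mask m in (0, 1) weights
    the two predictions (w1 = m, w2 = 1 - m); adding the residual
    image from the fourth decoder gives the target predicted image."""
    m = torch.sigmoid(mask_logits)             # S514: adaptive masking
    weighted = m * pred1 + (1.0 - m) * pred2   # S515: weighted image
    return weighted + residual_image           # S516/S517: target image
```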

S518、对残差码流进行解码,得到当前图像的残差值。S518: Decode the residual code stream to obtain a residual value of the current image.

S519、根据目标预测图像和残差值,得到重建图像。S519: Obtain a reconstructed image according to the target predicted image and the residual value.

上述S518和S519的具体实现过程,参照上述S404-C-B1和S404-C-B2的具体描述,在此不再赘述。For the specific implementation process of the above S518 and S519, please refer to the specific description of the above S404-C-B1 and S404-C-B2, which will not be repeated here.

本申请实施例，通过图12所示的基于神经网络的编码器进行编码时，对量化后的第一特征信息进行多级时域融合，即将量化后的第一特征信息与当前图像之前的多个重建图像进行特征融合，使得生成的混合时空表征包括更加准确、丰富和详细的特征信息。这样，基于该混合时空表征对前一重建图像进行运动补偿，可以生成多个解码信息，例如该多个解码信息包括第一预测图像、第二预测图像、第一预测图像和第二预测图像分别对应的权重、以及残差图像；基于这多个解码信息确定当前图像的目标预测图像时，可以有效提高目标预测图像的准确性，进而基于该准确的预测图像可以准确得到当前图像的重建图像，提高视频压缩效果。In the embodiments of the present application, when encoding is performed by the neural network-based encoder shown in FIG. 12, multi-level time-domain fusion is performed on the quantized first feature information, that is, the quantized first feature information is fused with features of multiple reconstructed images before the current image, so that the generated mixed spatiotemporal representation contains more accurate, rich and detailed feature information. Motion compensation is then performed on the previous reconstructed image based on the mixed spatiotemporal representation to generate multiple pieces of decoding information, for example a first predicted image, a second predicted image, the weights corresponding to the first and second predicted images, and a residual image. Determining the target predicted image of the current image from these multiple pieces of decoding information effectively improves the accuracy of the target predicted image, so that the reconstructed image of the current image can be accurately obtained from this accurate predicted image, thereby improving the video compression effect.

应理解,图2至图13仅为本申请的示例,不应理解为对本申请的限制。It should be understood that Figures 2 to 13 are merely examples of the present application and should not be construed as limitations to the present application.

以上结合附图详细描述了本申请的优选实施方式,但是,本申请并不限于上述实施方式中的具体细节,在本申请的技术构思范围内,可以对本申请的技术方案进行多种简单变型,这些简单变型均属于本申请的保护范围。例如,在上述具体实施方式中所描述的各个具体技术特征,在不矛盾的情况下,可以通过任何合适的方式进行组合,为了避免不必要的重复,本申请对各种可能的组合方式不再另行说明。又例如,本申请的各种不同的实施方式之间也可以进行任意组合,只要其不违背本申请的思想,其同样应当视为本申请所公开的内容。The preferred embodiments of the present application are described in detail above in conjunction with the accompanying drawings. However, the present application is not limited to the specific details in the above embodiments. Within the technical concept of the present application, the technical solution of the present application can be subjected to a variety of simple modifications, and these simple modifications all belong to the protection scope of the present application. For example, the various specific technical features described in the above specific embodiments can be combined in any suitable manner without contradiction. In order to avoid unnecessary repetition, the present application will not further explain various possible combinations. For another example, the various different embodiments of the present application can also be arbitrarily combined, as long as they do not violate the ideas of the present application, they should also be regarded as the contents disclosed in the present application.

还应理解,在本申请的各种方法实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。另外,本申请实施例中,术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系。具体地,A和/或B可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。It should also be understood that in the various method embodiments of the present application, the size of the sequence number of the above-mentioned processes does not mean the order of execution, and the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application. In addition, in the embodiments of the present application, the term "and/or" is merely a description of the association relationship of associated objects, indicating that three relationships may exist. Specifically, A and/or B can represent: A exists alone, A and B exist at the same time, and B exists alone. In addition, the character "/" in this article generally indicates that the objects associated before and after are in an "or" relationship.

上文结合图2至图13,详细描述了本申请的方法实施例,下文结合图14至图17,详细描述本申请的装置实施例。The above text describes in detail a method embodiment of the present application in conjunction with Figures 2 to 13 , and the following text describes in detail a device embodiment of the present application in conjunction with Figures 14 to 17 .

图14是本申请实施例提供的视频解码装置的示意性框图。FIG. 14 is a schematic block diagram of a video decoding device provided in an embodiment of the present application.

如图14所示,视频解码装置10包括:As shown in FIG. 14 , the video decoding device 10 includes:

解码单元11,用于解码第一码流,确定量化后的第一特征信息,所述第一特征信息是对当前图像和所述当前图像的前一重建图像进行特征融合得到的;A decoding unit 11 is used to decode the first bit stream and determine quantized first feature information, where the first feature information is obtained by performing feature fusion of a current image and a previous reconstructed image of the current image;

融合单元12,用于对量化后的所述第一特征信息进行多级时域融合,得到混合时空表征;A fusion unit 12, configured to perform multi-level time-domain fusion on the quantized first feature information to obtain a mixed time-space representation;

补偿单元13,用于根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,所述P为正整数;A compensation unit 13, configured to perform motion compensation on the previous reconstructed image according to the mixed spatiotemporal representation to obtain P predicted images of the current image, where P is a positive integer;

重建单元14,用于根据所述P个预测图像,确定所述当前图像的重建图像。The reconstruction unit 14 is used to determine a reconstructed image of the current image according to the P predicted images.

在一些实施例中,融合单元12,具体用于通过递归聚合模块将量化后的所述第一特征信息,与前一时刻所述递归聚合模块的隐式特征信息进行融合,得到所述混合时空表征。In some embodiments, the fusion unit 12 is specifically configured to fuse the quantized first feature information with the implicit feature information of the recursive aggregation module at a previous moment through a recursive aggregation module to obtain the mixed spatiotemporal representation.

可选的,所述递归聚合模块由至少一个时空递归网络堆叠而成。Optionally, the recursive aggregation module is formed by stacking at least one spatiotemporal recursive network.

在一些实施例中,所述P个预测图像包括第一预测图像,补偿单元13,具体用于根据所述混合时空表征,确定光流运动信息;根据所述光流运动信息对所述前一重建图像进行运动补偿,得到所述第一预测图像。In some embodiments, the P predicted images include a first predicted image, and the compensation unit 13 is specifically used to determine the optical flow motion information according to the mixed spatiotemporal representation; perform motion compensation on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.

在一些实施例中,所述P个预测图像包括第二预测图像,补偿单元13,具体用于根据所述混合时空表征,得到所述当前图像对应的偏移量;对所述前一重建图像进行空间特征提取,得到参考特征信息;使用所述偏移量对所述参考特征信息进行运动补偿,得到所述第二预测图像。In some embodiments, the P predicted images include a second predicted image, and the compensation unit 13 is specifically used to obtain an offset corresponding to the current image based on the mixed spatiotemporal representation; perform spatial feature extraction on the previous reconstructed image to obtain reference feature information; and use the offset to perform motion compensation on the reference feature information to obtain the second predicted image.

在一些实施例中,补偿单元13,具体用于使用所述偏移量,对所述参考特征信息进行基于可变形卷积的运动补偿,得到所述第二预测图像。In some embodiments, the compensation unit 13 is specifically configured to use the offset to perform deformable convolution-based motion compensation on the reference feature information to obtain the second predicted image.

在一些实施例中,重建单元14,用于根据所述P个预测图像,确定所述当前图像的目标预测图像;根据所述目标预测图像,确定所述当前图像的重建图像。In some embodiments, the reconstruction unit 14 is used to determine a target prediction image of the current image based on the P prediction images; and determine a reconstructed image of the current image based on the target prediction image.

在一些实施例中,重建单元14,用于根据所述P个预测图像,确定加权图像;根据所述加权图像,得到所述目标预测图像。In some embodiments, the reconstruction unit 14 is used to determine a weighted image according to the P predicted images; and obtain the target predicted image according to the weighted image.

在一些实施例中,重建单元14,还用于根据所述混合时空表征,得到所述当前图像的残差图像;根据所述P个预测图像和所述残差图像,得到所述目标预测图像。In some embodiments, the reconstruction unit 14 is further configured to obtain a residual image of the current image according to the mixed spatiotemporal representation; and obtain the target predicted image according to the P predicted images and the residual image.

在一些实施例中,重建单元14,具体用于根据所述P个预测图像,确定加权图像;根据所述加权图像和所述残差图像,确定所述目标预测图像。In some embodiments, the reconstruction unit 14 is specifically configured to determine a weighted image according to the P predicted images; and determine the target predicted image according to the weighted image and the residual image.

在一些实施例中，重建单元14，具体用于确定所述P个预测图像对应的权重；根据所述P个预测图像对应的权重，对所述P个预测图像进行加权，得到所述加权图像。In some embodiments, the reconstruction unit 14 is specifically used to determine the weights corresponding to the P predicted images; and weight the P predicted images according to the weights corresponding to the P predicted images to obtain the weighted image.

在一些实施例中,重建单元14,具体用于根据所述混合时空表征进行自适应掩膜,得到所述P个预测图像对应的权重。In some embodiments, the reconstruction unit 14 is specifically configured to perform adaptive masking according to the mixed spatiotemporal representation to obtain weights corresponding to the P predicted images.

在一些实施例中,若所述P个预测图像包括第一预测图像和第二预测图像,重建单元14,具体用于确定所述第一预测图像对应的第一权重和所述第二预测图像对应的第二权重;根据所述第一权重和所述第二权重,对所述第一预测图像和所述第二预测图像进行加权,得到所述加权图像。In some embodiments, if the P predicted images include a first predicted image and a second predicted image, the reconstruction unit 14 is specifically used to determine a first weight corresponding to the first predicted image and a second weight corresponding to the second predicted image; and weight the first predicted image and the second predicted image according to the first weight and the second weight to obtain the weighted image.

在一些实施例中,重建单元14,具体用于对残差码流进行解码,得到所述当前图像的残差值;根据所述目标预测图像和所述残差值,得到所述重建图像。In some embodiments, the reconstruction unit 14 is specifically configured to decode the residual code stream to obtain the residual value of the current image; and obtain the reconstructed image according to the target predicted image and the residual value.

在一些实施例中，解码单元11，具体用于解码第二码流，得到量化后的第二特征信息，所述第二特征信息是对所述第一特征信息进行特征变换得到的；根据量化后的所述第二特征信息，确定量化后的所述第一特征信息的概率分布；根据量化后的所述第一特征信息的概率分布，对所述第一码流进行解码，得到量化后的所述第一特征信息。In some embodiments, the decoding unit 11 is specifically used to decode the second code stream to obtain quantized second feature information, where the second feature information is obtained by performing feature transformation on the first feature information; determine the probability distribution of the quantized first feature information based on the quantized second feature information; and decode the first code stream based on the probability distribution of the quantized first feature information to obtain the quantized first feature information.

在一些实施例中,解码单元11,具体用于对量化后的所述第二特征信息进行反变换,得到重建特征信息;确定所述重建特征信息的概率分布;根据所述重建特征信息的概率分布,预测得到量化后的所述第一特征信息的概率分布。In some embodiments, the decoding unit 11 is specifically used to perform an inverse transformation on the quantized second feature information to obtain reconstructed feature information; determine the probability distribution of the reconstructed feature information; and predict the probability distribution of the quantized first feature information based on the probability distribution of the reconstructed feature information.

在一些实施例中,解码单元11,具体用于对量化后的所述第二特征信息进行N次非局部注意力变换和N次上采样,得到所述重建特征信息,所述N为正整数。In some embodiments, the decoding unit 11 is specifically used to perform N non-local attention transformations and N upsamplings on the quantized second feature information to obtain the reconstructed feature information, where N is a positive integer.

在一些实施例中,解码单元11,具体用于根据所述重建特征信息的概率分布,预测量化后的所述第一特征信息中编码像素的概率;根据量化后的所述第一特征信息中编码像素的概率,得到量化后的所述第一特征信息的概率分布。In some embodiments, the decoding unit 11 is specifically used to predict the probability of the encoded pixels in the first feature information after quantization according to the probability distribution of the reconstructed feature information; and obtain the probability distribution of the first feature information after quantization according to the probability of the encoded pixels in the first feature information after quantization.

应理解，装置实施例与方法实施例可以相互对应，类似的描述可以参照方法实施例。为避免重复，此处不再赘述。具体地，图14所示的视频解码装置10可以对应于执行本申请实施例的方法中的相应主体，并且视频解码装置10中的各个单元的前述和其它操作和/或功能分别用于实现上述各个方法实施例中的相应流程，为了简洁，在此不再赘述。It should be understood that the apparatus embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, details are not repeated here. Specifically, the video decoding apparatus 10 shown in FIG. 14 may correspond to the corresponding subject performing the methods of the embodiments of the present application, and the foregoing and other operations and/or functions of the units in the video decoding apparatus 10 are respectively intended to implement the corresponding flows in the foregoing method embodiments; for brevity, details are not repeated here.

图15是本申请实施例提供的视频编码装置的示意性框图。FIG15 is a schematic block diagram of a video encoding device provided in an embodiment of the present application.

如图15所示,视频编码装置20包括:As shown in FIG. 15 , the video encoding apparatus 20 includes:

融合单元21,用于对当前图像以及所述当前图像的前一重建图像进行特征融合,得到第一特征信息;A fusion unit 21 is used to perform feature fusion on a current image and a previous reconstructed image of the current image to obtain first feature information;

量化单元22,用于对所述第一特征信息进行量化,得到量化后的所述第一特征信息;A quantization unit 22, configured to quantize the first feature information to obtain quantized first feature information;

编码单元23,用于对量化后的所述第一特征信息进行编码,得到所述第一码流。The encoding unit 23 is used to encode the quantized first feature information to obtain the first code stream.

在一些实施例中,融合单元21,具体用于将所述当前图像和所述重建图像进行通道级联,得到级联后的图像;对所述级联后的图像进行特征提取,得到所述第一特征信息。In some embodiments, the fusion unit 21 is specifically used to perform channel cascading on the current image and the reconstructed image to obtain a cascaded image; and perform feature extraction on the cascaded image to obtain the first feature information.

在一些实施例中,融合单元21,具体用于对所述级联后的图像进行Q次非局部注意力变换和Q次下采样,得到所述第一特征信息,所述Q为正整数。In some embodiments, the fusion unit 21 is specifically used to perform Q non-local attention transformations and Q downsampling on the cascaded image to obtain the first feature information, where Q is a positive integer.

在一些实施例中,编码单元23,还用于根据所述第一特征信息进行特征变换,得到第二特征信息;对所述第二特征信息进行量化后再编码,得到第二码流;对所述第二码流进行解码,得到量化后的所述第二特征信息,并根据量化后的所述第二特征信息,确定量化后的所述第一特征信息的概率分布;根据量化后的所述第一特征信息的概率分布,对量化后的所述第一特征信息进行编码,得到第一码流。In some embodiments, the encoding unit 23 is further used to perform feature transformation according to the first feature information to obtain second feature information; quantize and then encode the second feature information to obtain a second code stream; decode the second code stream to obtain the quantized second feature information, and determine the probability distribution of the quantized first feature information according to the quantized second feature information; encode the quantized first feature information according to the probability distribution of the quantized first feature information to obtain a first code stream.

在一些实施例中,编码单元23,具体用于对所述第一特征信息进行N次非局部注意力变换和N次下采样,得到所述第二特征信息,所述N为正整数。In some embodiments, the encoding unit 23 is specifically used to perform N non-local attention transformations and N downsampling on the first feature information to obtain the second feature information, where N is a positive integer.

在一些实施例中,编码单元23,具体用于对量化后的所述第一特征信息进行N次非局部注意力变换和N次下采样,得到所述第二特征信息。In some embodiments, the encoding unit 23 is specifically used to perform N non-local attention transformations and N downsamplings on the quantized first feature information to obtain the second feature information.

在一些实施例中,编码单元23,还用于对所述第二特征信息进行量化,得到量化后的所述第二特征信息;确定量化后的所述第二特征信息的概率分布;根据量化后的所述第二特征信息的概率分布,对量化后的所述第二特征信息进行编码,得到所述第二码流。In some embodiments, the encoding unit 23 is further used to quantize the second feature information to obtain the quantized second feature information; determine the probability distribution of the quantized second feature information; and encode the quantized second feature information according to the probability distribution of the quantized second feature information to obtain the second code stream.

在一些实施例中,编码单元23,具体用于对量化后的所述第二特征信息进行反变换,得到重建特征信息;确定所述重建特征信息的概率分布;根据所述重建特征信息的概率分布,确定量化后的所述第一特征信息的概率分布。In some embodiments, the encoding unit 23 is specifically used to perform an inverse transformation on the quantized second feature information to obtain reconstructed feature information; determine the probability distribution of the reconstructed feature information; and determine the probability distribution of the quantized first feature information based on the probability distribution of the reconstructed feature information.

在一些实施例中,编码单元23,具体用于对量化后的所述第二特征信息进行N次非局部注意力变换和N次上采样,得到所述重建特征信息,所述N为正整数。In some embodiments, the encoding unit 23 is specifically used to perform N non-local attention transformations and N upsamplings on the quantized second feature information to obtain the reconstructed feature information, where N is a positive integer.

在一些实施例中,编码单元23,具体用于根据所述重建特征信息的概率分布,确定量化后的所述第一特征信息中编码像素的概率;根据量化后的所述第一特征信息中编码像素的概率,得到量化后的所述第一特征信息的概率分布。In some embodiments, the encoding unit 23 is specifically used to determine the probability of the encoded pixels in the quantized first feature information according to the probability distribution of the reconstructed feature information; and obtain the probability distribution of the quantized first feature information according to the probability of the encoded pixels in the quantized first feature information.

在一些实施例中,编码单元23,还用于确定所述当前图像的重建图像。In some embodiments, the encoding unit 23 is further configured to determine a reconstructed image of the current image.

在一些实施例中,编码单元23,具体用于对量化后的所述第一特征信息进行多级时域融合,得到混合时空表征;根据所述混合时空表征对所述前一重建图像进行运动补偿,得到所述当前图像的P个预测图像,所述P为正整数;根据所述P个预测图像,确定所述当前图像的重建图像。In some embodiments, the encoding unit 23 is specifically used to perform multi-level time-domain fusion on the quantized first feature information to obtain a mixed space-time representation; perform motion compensation on the previous reconstructed image according to the mixed space-time representation to obtain P predicted images of the current image, where P is a positive integer; and determine the reconstructed image of the current image based on the P predicted images.

在一些实施例中,编码单元23,具体用于通过递归聚合模块将量化后的所述第一特征信息,与前一时刻所述递归聚合模块的隐式特征信息进行融合,得到所述混合时空表征。In some embodiments, the encoding unit 23 is specifically configured to fuse the quantized first feature information with the implicit feature information of the recursive aggregation module at a previous moment through a recursive aggregation module to obtain the mixed spatiotemporal representation.

可选的,所述递归聚合模块由至少一个时空递归网络堆叠而成。Optionally, the recursive aggregation module is formed by stacking at least one spatiotemporal recursive network.

在一些实施例中,所述P个预测图像包括第一预测图像,编码单元23,具体用于根据所述混合时空表征,确定光流运动信息;根据所述光流运动信息对所述前一重建图像进行运动补偿,得到所述第一预测图像。In some embodiments, the P predicted images include a first predicted image, and the encoding unit 23 is specifically used to determine optical flow motion information according to the mixed spatiotemporal representation; perform motion compensation on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.

在一些实施例中,所述P个预测图像包括第二预测图像,编码单元23,具体用于根据所述混合时空表征,得到所述当前图像对应的偏移量;对所述前一重建图像进行空间特征提取,得到参考特征信息;使用所述偏移量对所述参考特征信息进行运动补偿,得到所述第二预测图像。In some embodiments, the P predicted images include a second predicted image, and the encoding unit 23 is specifically used to obtain an offset corresponding to the current image based on the mixed spatiotemporal representation; perform spatial feature extraction on the previous reconstructed image to obtain reference feature information; and use the offset to perform motion compensation on the reference feature information to obtain the second predicted image.

在一些实施例中,编码单元23,具体用于使用所述偏移量,对所述参考特征信息进行基于可变形卷积的运动补偿,得到所述第二预测图像。In some embodiments, the encoding unit 23 is specifically configured to use the offset to perform deformable convolution-based motion compensation on the reference feature information to obtain the second predicted image.

在一些实施例中,编码单元23,具体用于根据所述P个预测图像,确定所述当前图像的目标预测图像;根据所述目标预测图像,确定所述当前图像的重建图像。In some embodiments, the encoding unit 23 is specifically configured to determine a target prediction image of the current image according to the P prediction images; and determine a reconstructed image of the current image according to the target prediction image.

在一些实施例中,编码单元23,具体用于根据所述P个预测图像,确定加权图像;根据所述加权图像,得到所述目标预测图像。In some embodiments, the encoding unit 23 is specifically configured to determine a weighted image according to the P predicted images; and obtain the target predicted image according to the weighted image.

在一些实施例中,编码单元23,还用于根据所述混合时空表征,得到所述当前图像的残差图像;根据所述P个预测图像和所述残差图像,得到所述目标预测图像。In some embodiments, the encoding unit 23 is further configured to obtain a residual image of the current image according to the mixed spatiotemporal representation; and obtain the target predicted image according to the P predicted images and the residual image.

在一些实施例中,若所述P大于1,编码单元23,具体用于根据所述P个预测图像,确定加权图像;根据所述加权图像和所述残差图像,确定所述目标预测图像。In some embodiments, if the P is greater than 1, the encoding unit 23 is specifically configured to determine a weighted image according to the P predicted images; and determine the target predicted image according to the weighted image and the residual image.

在一些实施例中,编码单元23,具体用于确定所述P个预测图像对应的权重;根据所述P个预测图像对应的权重,对所述P个预测图像进行加权,得到所述加权图像。In some embodiments, the encoding unit 23 is specifically configured to determine weights corresponding to the P predicted images; and weight the P predicted images according to the weights corresponding to the P predicted images to obtain the weighted image.

在一些实施例中,编码单元23,具体用于根据所述混合时空表征进行自适应掩膜,得到所述P个预测图像对应的权重。In some embodiments, the encoding unit 23 is specifically configured to perform adaptive masking according to the mixed spatiotemporal representation to obtain weights corresponding to the P predicted images.

在一些实施例中，若所述P个预测图像包括第一预测图像和第二预测图像，编码单元23，具体用于确定所述第一预测图像对应的第一权重和所述第二预测图像对应的第二权重；根据所述第一权重和所述第二权重，对所述第一预测图像和所述第二预测图像进行加权，得到所述加权图像。In some embodiments, if the P predicted images include a first predicted image and a second predicted image, the encoding unit 23 is specifically used to determine a first weight corresponding to the first predicted image and a second weight corresponding to the second predicted image; and weight the first predicted image and the second predicted image according to the first weight and the second weight to obtain the weighted image.

在一些实施例中,编码单元23,还用于根据所述当前图像和所述目标预测图像,确定所述当前图像的残差值;对所述残差值进行编码,得到残差码流。In some embodiments, the encoding unit 23 is further configured to determine a residual value of the current image according to the current image and the target predicted image; and encode the residual value to obtain a residual code stream.

在一些实施例中,编码单元23,具体用于对所述残差码流进行解码,得到所述当前图像的残差值;根据所述目标预测图像和所述残差值,得到所述重建图像。In some embodiments, the encoding unit 23 is specifically configured to decode the residual code stream to obtain a residual value of the current image; and obtain the reconstructed image according to the target predicted image and the residual value.

应理解，装置实施例与方法实施例可以相互对应，类似的描述可以参照方法实施例。为避免重复，此处不再赘述。具体地，图15所示的视频编码装置20可以对应于执行本申请实施例的方法中的相应主体，并且视频编码装置20中的各个单元的前述和其它操作和/或功能分别用于实现上述各个方法实施例中的相应流程，为了简洁，在此不再赘述。It should be understood that the apparatus embodiments and the method embodiments may correspond to each other, and similar descriptions may refer to the method embodiments. To avoid repetition, details are not repeated here. Specifically, the video encoding apparatus 20 shown in FIG. 15 may correspond to the corresponding subject performing the methods of the embodiments of the present application, and the foregoing and other operations and/or functions of the units in the video encoding apparatus 20 are respectively intended to implement the corresponding flows in the foregoing method embodiments; for brevity, details are not repeated here.

上文中结合附图从功能单元的角度描述了本申请实施例的装置和系统。应理解,该功能单元可以通过硬件形式实现,也可以通过软件形式的指令实现,还可以通过硬件和软件单元组合实现。具体地,本申请实施例中的方法实施例的各步骤可以通过处理器中的硬件的集成逻辑电路和/或软件形式的指令完成,结合本申请实施例公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件单元组合执行完成。可选地,软件单元可以位于随机存储器,闪存、只读存储器、可编程只读存储器、电可擦写可编程存储器、寄存器等本领域的成熟的存储介质中。该存储介质位于存储器,处理器读取存储器中的信息,结合其硬件完成上述方法实施例中的步骤。The above describes the device and system of the embodiment of the present application from the perspective of the functional unit in conjunction with the accompanying drawings. It should be understood that the functional unit can be implemented in hardware form, can be implemented by instructions in software form, and can also be implemented by a combination of hardware and software units. Specifically, the steps of the method embodiment in the embodiment of the present application can be completed by the hardware integrated logic circuit and/or software form instructions in the processor, and the steps of the method disclosed in the embodiment of the present application can be directly embodied as a hardware decoding processor to perform, or a combination of hardware and software units in the decoding processor to perform. Optionally, the software unit can be located in a mature storage medium in the field such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, a register, etc. The storage medium is located in a memory, and the processor reads the information in the memory, and completes the steps in the above method embodiment in conjunction with its hardware.

图16是本申请实施例提供的电子设备的示意性框图。FIG. 16 is a schematic block diagram of an electronic device provided in an embodiment of the present application.

如图16所示,该电子设备30可以为本申请实施例所述的视频编码器,或者视频解码器,该电子设备30可包括:As shown in FIG. 16 , the electronic device 30 may be a video encoder or a video decoder as described in the embodiment of the present application, and the electronic device 30 may include:

存储器33和处理器32，该存储器33用于存储计算机程序34，并将该计算机程序34传输给该处理器32。换言之，该处理器32可以从存储器33中调用并运行计算机程序34，以实现本申请实施例中的方法。The memory 33 and the processor 32, where the memory 33 is used to store a computer program 34 and transmit the computer program 34 to the processor 32. In other words, the processor 32 can call and run the computer program 34 from the memory 33 to implement the methods in the embodiments of the present application.

例如,该处理器32可用于根据该计算机程序34中的指令执行上述方法中的步骤。For example, the processor 32 may be configured to execute the steps in the above method according to the instructions in the computer program 34 .

在本申请的一些实施例中,该处理器32可以包括但不限于:In some embodiments of the present application, the processor 32 may include but is not limited to:

通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现场可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等等。General-purpose processor, digital signal processor (DSP), application-specific integrated circuit (ASIC), field programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, etc.

在本申请的一些实施例中,该存储器33包括但不限于:In some embodiments of the present application, the memory 33 includes but is not limited to:

易失性存储器和/或非易失性存储器。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synch link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。Volatile memory and/or non-volatile memory. Among them, the non-volatile memory can be read-only memory (ROM), programmable ROM (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM) or flash memory. The volatile memory can be random access memory (RAM), which is used as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link DRAM (SLDRAM), and direct RAM bus random access memory (DR RAM).

在本申请的一些实施例中,该计算机程序34可以被分割成一个或多个单元,该一个或者多个单元被存储在该存储器33中,并由该处理器32执行,以完成本申请提供的方法。该一个或多个单元可以是能够完成特定功能的一系列计算机程序指令段,该指令段用于描述该计算机程序34在该电子设备30中的执行过程。In some embodiments of the present application, the computer program 34 may be divided into one or more units, which are stored in the memory 33 and executed by the processor 32 to complete the method provided by the present application. The one or more units may be a series of computer program instruction segments capable of completing specific functions, and the instruction segments are used to describe the execution process of the computer program 34 in the electronic device 30.

如图16所示,该电子设备30还可包括:As shown in FIG. 16 , the electronic device 30 may further include:

收发器33,该收发器33可连接至该处理器32或存储器33。The transceiver 33 may be connected to the processor 32 or the memory 33 .

其中,处理器32可以控制该收发器33与其他设备进行通信,具体地,可以向其他设备发送信息或数据,或接收其他设备发送的信息或数据。收发器33可以包括发射机和接收机。收发器33还可以进一步包括天线,天线的数量可以为一个或多个。The processor 32 may control the transceiver 33 to communicate with other devices, specifically, to send information or data to other devices, or to receive information or data sent by other devices. The transceiver 33 may include a transmitter and a receiver. The transceiver 33 may further include an antenna, and the number of antennas may be one or more.

应当理解,该电子设备30中的各个组件通过总线系统相连,其中,总线系统除包括数据总线之外,还包括电源总线、控制总线和状态信号总线。It should be understood that the various components in the electronic device 30 are connected via a bus system, wherein the bus system includes not only a data bus but also a power bus, a control bus and a status signal bus.

图17是本申请实施例提供的视频编解码系统40的示意性框图。FIG. 17 is a schematic block diagram of a video encoding and decoding system 40 provided in an embodiment of the present application.

如图17所示,该视频编解码系统40可包括:视频编码器41和视频解码器42,其中视频编码器41用于执行本申请实施例涉及的视频编码方法,视频解码器42用于执行本申请实施例涉及的视频解码方法。As shown in FIG. 17 , the video encoding and decoding system 40 may include: a video encoder 41 and a video decoder 42 , wherein the video encoder 41 is used to execute the video encoding method involved in the embodiment of the present application, and the video decoder 42 is used to execute the video decoding method involved in the embodiment of the present application.

在一些实施例中,本申请还提供一种码流,该码流通过上述编码方法得到。In some embodiments, the present application also provides a code stream, which is obtained by the above encoding method.

本申请还提供了一种计算机存储介质,其上存储有计算机程序,该计算机程序被计算机执行时使得该计算机能够执行上述方法实施例的方法。或者说,本申请实施例还提供一种包含指令的计算机程序产品,该指令被计算机执行时使得计算机执行上述方法实施例的方法。The present application also provides a computer storage medium on which a computer program is stored, and when the computer program is executed by a computer, the computer can perform the method of the above method embodiment. In other words, the present application embodiment also provides a computer program product containing instructions, and when the instructions are executed by a computer, the computer can perform the method of the above method embodiment.

当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。该计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行该计算机程序指令时,全部或部分地产生按照本申请实施例该的流程或功能。该计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。该计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,该计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(digital subscriber line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。该计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。该可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如数字视频光盘(digital video disc,DVD))、或者半导体介质(例如固态硬盘(solid state disk,SSD))等。When software is used for implementation, it can be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the process or function according to the embodiment of the present application is generated in whole or in part. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices. The computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions can be transmitted from a website site, computer, server or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (digital subscriber line, DSL)) or wireless (e.g., infrared, wireless, microwave, etc.) mode to another website site, computer, server or data center. The computer-readable storage medium can be any available medium that a computer can access or a data storage device such as a server or data center that includes one or more available media integration. The available medium can be a magnetic medium (e.g., a floppy disk, a hard disk, a tape), an optical medium (e.g., a digital video disc (digital video disc, DVD)), or a semiconductor medium (e.g., a solid state drive (solid state disk, SSD)), etc.

本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。Those of ordinary skill in the art will appreciate that the units and algorithm steps of each example described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Professional and technical personnel can use different methods to implement the described functions for each specific application, but such implementation should not be considered to be beyond the scope of this application.

在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,该单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices and methods can be implemented in other ways. For example, the device embodiments described above are only schematic. For example, the division of the unit is only a logical function division. There may be other division methods in actual implementation, such as multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. Another point is that the mutual coupling or direct coupling or communication connection shown or discussed can be through some interfaces, indirect coupling or communication connection of devices or units, which can be electrical, mechanical or other forms.

作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。例如,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the scheme of this embodiment. For example, each functional unit in each embodiment of the present application may be integrated into a processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.

以上内容,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以该权利要求的保护范围为准。The above contents are only specific implementation methods of the present application, but the protection scope of the present application is not limited thereto. Any technician familiar with the technical field can easily think of changes or substitutions within the technical scope disclosed in the present application, which should be included in the protection scope of the present application. Therefore, the protection scope of the present application should be based on the protection scope of the claims.

Claims (50)

  1. A video decoding method, comprising:
    decoding a first code stream, and determining quantized first characteristic information, wherein the first characteristic information is obtained by carrying out characteristic fusion on a current image and a previous reconstructed image of the current image;
    performing multi-stage time domain fusion on the quantized first characteristic information to obtain a hybrid space-time representation;
    performing motion compensation on the previous reconstructed image according to the hybrid space-time representation to obtain P predicted images of the current image, wherein P is a positive integer;
    and determining a reconstructed image of the current image according to the P predicted images.
  2. The method of claim 1, wherein the performing multi-stage time domain fusion on the quantized first characteristic information to obtain a hybrid space-time representation comprises:
    fusing, through a recursive aggregation module, the quantized first characteristic information with implicit characteristic information of the recursive aggregation module at a previous moment to obtain the hybrid space-time representation.
  3. The method of claim 2, wherein the recursive aggregation module is formed by stacking at least one spatio-temporal recursive network.
  4. The method of claim 1, wherein the P predicted images comprise a first predicted image, and the performing motion compensation on the previous reconstructed image according to the hybrid space-time representation to obtain P predicted images of the current image comprises:
    determining optical flow motion information according to the hybrid space-time representation;
    and performing motion compensation on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.
  5. The method of claim 1, wherein the P predicted images comprise a second predicted image, and the performing motion compensation on the previous reconstructed image according to the hybrid space-time representation to obtain P predicted images of the current image comprises:
    obtaining an offset corresponding to the current image according to the hybrid space-time representation;
    extracting spatial features of the previous reconstructed image to obtain reference characteristic information;
    and performing motion compensation on the reference characteristic information by using the offset to obtain the second predicted image.
  6. The method of claim 5, wherein the performing motion compensation on the reference characteristic information by using the offset to obtain the second predicted image comprises:
    performing deformable convolution-based motion compensation on the reference characteristic information by using the offset to obtain the second predicted image.
  7. The method according to any one of claims 1-6, wherein the determining a reconstructed image of the current image according to the P predicted images comprises:
    determining a target predicted image of the current image according to the P predicted images;
    and determining a reconstructed image of the current image according to the target predicted image.
  8. The method of claim 7, wherein if P is greater than 1, the determining the target predicted image of the current image according to the P predicted images comprises:
    determining a weighted image according to the P predicted images;
    and obtaining the target predicted image according to the weighted image.
  9. The method of claim 7, wherein the method further comprises:
    obtaining a residual image of the current image according to the hybrid space-time representation;
    wherein the determining the target predicted image of the current image according to the P predicted images comprises:
    obtaining the target predicted image according to the P predicted images and the residual image.
  10. The method according to claim 9, wherein if the P is greater than 1, the obtaining the target predicted image according to the P predicted images and the residual image comprises:
    determining a weighted image according to the P predicted images;
    and determining the target predicted image according to the weighted image and the residual image.
  11. The method according to claim 8 or 10, wherein the determining a weighted image according to the P predicted images comprises:
    determining weights corresponding to the P predicted images;
    and weighting the P predicted images according to the weights corresponding to the P predicted images to obtain the weighted image.
  12. The method of claim 11, wherein the determining weights corresponding to the P predicted images comprises:
    performing adaptive masking according to the hybrid space-time representation to obtain the weights corresponding to the P predicted images.
  13. The method of claim 11, wherein if the P predicted images comprise a first predicted image and a second predicted image, the determining weights corresponding to the P predicted images comprises:
    determining a first weight corresponding to the first predicted image and a second weight corresponding to the second predicted image;
    and the weighting the P predicted images according to the weights corresponding to the P predicted images to obtain the weighted image comprises:
    weighting the first predicted image and the second predicted image according to the first weight and the second weight to obtain the weighted image.
  14. The method of claim 7, wherein the determining a reconstructed image of the current image according to the target predicted image comprises:
    decoding a residual code stream to obtain a residual value of the current image;
    and obtaining the reconstructed image according to the target predicted image and the residual value.
  15. The method of any of claims 1-6, wherein decoding the first code stream to determine quantized first characteristic information comprises:
    decoding a second code stream to obtain quantized second characteristic information, wherein the second characteristic information is obtained by carrying out characteristic transformation on the first characteristic information;
    determining a probability distribution of the quantized first characteristic information according to the quantized second characteristic information;
    and decoding the first code stream according to the probability distribution of the quantized first characteristic information to obtain the quantized first characteristic information.
  16. The method of claim 15, wherein the determining a probability distribution of the quantized first characteristic information according to the quantized second characteristic information comprises:
    inversely transforming the quantized second characteristic information to obtain reconstructed characteristic information;
    determining a probability distribution of the reconstructed characteristic information;
    and predicting the probability distribution of the quantized first characteristic information according to the probability distribution of the reconstructed characteristic information.
  17. The method of claim 16, wherein the inversely transforming the quantized second characteristic information to obtain reconstructed characteristic information comprises:
    carrying out N times of non-local attention transformation and N times of up-sampling on the quantized second characteristic information to obtain the reconstructed characteristic information, wherein N is a positive integer.
  18. The method of claim 16, wherein the predicting the probability distribution of the quantized first characteristic information according to the probability distribution of the reconstructed characteristic information comprises:
    predicting the probability of encoded pixels in the quantized first characteristic information according to the probability distribution of the reconstructed characteristic information;
    and obtaining the probability distribution of the quantized first characteristic information according to the probability of the encoded pixels in the quantized first characteristic information.
  19. A video encoding method, comprising:
    performing feature fusion on a current image and a reconstructed image before the current image to obtain first feature information;
    quantizing the first feature information to obtain quantized first feature information;
    and encoding the quantized first feature information to obtain a first code stream.
  20. The method of claim 19, wherein the performing feature fusion on the current image and the reconstructed image before the current image to obtain first feature information comprises:
    concatenating the current image and the reconstructed image along the channel dimension to obtain a concatenated image;
    and performing feature extraction on the concatenated image to obtain the first feature information.
  21. The method of claim 20, wherein the performing feature extraction on the concatenated image to obtain the first feature information comprises:
    performing Q non-local attention transformations and Q downsampling operations on the concatenated image to obtain the first feature information, wherein Q is a positive integer.
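Claims 20-21 mirror the decoder-side transforms. A sketch of this analysis path, reusing the NonLocalBlock defined earlier; the channel counts and Q are illustrative assumptions.

```python
import torch
import torch.nn as nn

def analysis_transform(in_channels: int, feat_channels: int, q: int = 2) -> nn.Sequential:
    layers = [nn.Conv2d(in_channels, feat_channels, 3, padding=1)]
    for _ in range(q):
        layers += [NonLocalBlock(feat_channels),
                   # Strided convolution halves the spatial resolution.
                   nn.Conv2d(feat_channels, feat_channels, 4, stride=2, padding=1)]
    return nn.Sequential(*layers)

# Usage: two RGB frames -> 6 input channels after channel concatenation.
# x_t, x_prev = torch.rand(1, 3, 64, 64), torch.rand(1, 3, 64, 64)
# y = analysis_transform(6, 128)(torch.cat([x_t, x_prev], dim=1))
```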
  22. The method of claim 19, wherein the encoding the quantized first feature information to obtain a first code stream comprises:
    performing feature transformation according to the first feature information to obtain second feature information;
    quantizing and then encoding the second feature information to obtain a second code stream;
    decoding the second code stream to obtain quantized second feature information, and determining a probability distribution of the quantized first feature information according to the quantized second feature information;
    and encoding the quantized first feature information according to the probability distribution of the quantized first feature information to obtain the first code stream.
  23. The method of claim 22, wherein the performing feature transformation according to the first feature information to obtain second feature information comprises:
    performing N non-local attention transformations and N downsampling operations on the first feature information to obtain the second feature information, wherein N is a positive integer.
  24. The method of claim 22, wherein the performing feature transformation according to the first feature information to obtain second feature information comprises:
    performing N non-local attention transformations and N downsampling operations on the quantized first feature information to obtain the second feature information.
  25. The method of claim 22, wherein the quantizing and then encoding the second feature information to obtain a second code stream comprises:
    quantizing the second feature information to obtain quantized second feature information;
    determining a probability distribution of the quantized second feature information;
    and encoding the quantized second feature information according to the probability distribution of the quantized second feature information to obtain the second code stream.
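A sketch of claim 25's quantize-then-encode step; entropy_encode and factorized_prior are hypothetical stand-ins, and rounding is the assumed quantizer (common practice, but not fixed by the patent).

```python
import torch

def encode_hyper_features(z: torch.Tensor, factorized_prior, entropy_encode) -> bytes:
    z_hat = torch.round(z)               # quantization
    probs = factorized_prior(z_hat)      # P(z_hat) under a learned prior
    return entropy_encode(z_hat, probs)  # arithmetic/range coding
```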
  26. The method of claim 22, wherein the determining a probability distribution of the quantized first feature information according to the quantized second feature information comprises:
    inversely transforming the quantized second feature information to obtain reconstructed feature information;
    determining a probability distribution of the reconstructed feature information;
    and determining the probability distribution of the quantized first feature information according to the probability distribution of the reconstructed feature information.
  27. The method of claim 26, wherein the inversely transforming the quantized second feature information to obtain reconstructed feature information comprises:
    performing N non-local attention transformations and N upsampling operations on the quantized second feature information to obtain the reconstructed feature information, wherein N is a positive integer.
  28. The method of claim 26, wherein the determining the probability distribution of the quantized first feature information according to the probability distribution of the reconstructed feature information comprises:
    determining the probability of encoded pixels in the quantized first feature information according to the probability distribution of the reconstructed feature information;
    and obtaining the probability distribution of the quantized first feature information according to the probability of the encoded pixels in the quantized first feature information.
  29. The method of any one of claims 19-28, further comprising:
    determining a reconstructed image of the current image.
  30. The method of claim 29, wherein the determining a reconstructed image of the current image comprises:
    performing multi-level temporal fusion on the quantized first feature information to obtain a hybrid spatio-temporal representation;
    performing motion compensation on the previous reconstructed image according to the hybrid spatio-temporal representation to obtain P predicted images of the current image, wherein P is a positive integer;
    and determining a reconstructed image of the current image according to the P predicted images.
  31. The method of claim 30, wherein the performing multi-level temporal fusion on the quantized first feature information to obtain a hybrid spatio-temporal representation comprises:
    fusing, by a recursive aggregation module, the quantized first feature information with hidden feature information of the recursive aggregation module at the previous time step to obtain the hybrid spatio-temporal representation.
  32. The method of claim 31, wherein the recursive aggregation module is formed by stacking at least one spatio-temporal recurrent network.
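Claims 31-32 only require a stacked spatio-temporal recurrent network. One common cell that fits the description, sketched here as an assumption, is a convolutional GRU: the quantized first feature information is fused with the hidden state carried over from the previous time step.

```python
import torch
import torch.nn as nn

class ConvGRUCell(nn.Module):
    def __init__(self, channels: int):
        super().__init__()
        self.gates = nn.Conv2d(2 * channels, 2 * channels, 3, padding=1)
        self.cand = nn.Conv2d(2 * channels, channels, 3, padding=1)

    def forward(self, x: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        zr = torch.sigmoid(self.gates(torch.cat([x, h_prev], dim=1)))
        z, r = zr.chunk(2, dim=1)                      # update / reset gates
        h_tilde = torch.tanh(self.cand(torch.cat([x, r * h_prev], dim=1)))
        return (1 - z) * h_prev + z * h_tilde          # new hidden state / hybrid representation
```

Stacking several such cells, each feeding the next, yields the multi-level temporal fusion of claim 30.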
  33. The method of claim 30, wherein the P predicted images comprise a first predicted image, and the performing motion compensation on the previous reconstructed image according to the hybrid spatio-temporal representation to obtain P predicted images of the current image comprises:
    determining optical flow motion information according to the hybrid spatio-temporal representation;
    and performing motion compensation on the previous reconstructed image according to the optical flow motion information to obtain the first predicted image.
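A sketch of the optical-flow branch of claim 33: given a dense flow field predicted from the hybrid representation, the previous reconstruction is backward-warped with grid_sample. The flow-prediction head itself is omitted here.

```python
import torch
import torch.nn.functional as F

def warp(prev_recon: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """flow: (B, 2, H, W) displacements in pixels; prev_recon: (B, C, H, W)."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack([xs, ys], dim=0).float().to(flow.device)  # (2, H, W)
    coords = base.unsqueeze(0) + flow                            # absolute sample positions
    # Normalize to [-1, 1] for grid_sample (x against width, y against height).
    coords_x = 2.0 * coords[:, 0] / max(w - 1, 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / max(h - 1, 1) - 1.0
    grid = torch.stack([coords_x, coords_y], dim=-1)             # (B, H, W, 2)
    return F.grid_sample(prev_recon, grid, align_corners=True)
```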
  34. The method of claim 30, wherein the P predicted images comprise a second predicted image, and the performing motion compensation on the previous reconstructed image according to the hybrid spatio-temporal representation to obtain P predicted images of the current image comprises:
    obtaining an offset corresponding to the current image according to the hybrid spatio-temporal representation;
    performing spatial feature extraction on the previous reconstructed image to obtain reference feature information;
    and performing motion compensation on the reference feature information by using the offset to obtain the second predicted image.
  35. The method of claim 34, wherein the performing motion compensation on the reference feature information by using the offset to obtain the second predicted image comprises:
    performing deformable-convolution-based motion compensation on the reference feature information by using the offset to obtain the second predicted image.
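A sketch of claims 34-35 built on torchvision's DeformConv2d: offsets are predicted from the hybrid representation, reference features are extracted from the previous reconstruction, and the deformable convolution performs the feature-domain compensation. The channel sizes and the final image-space projection are illustrative assumptions.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableCompensation(nn.Module):
    def __init__(self, repr_channels: int, feat_channels: int = 64, k: int = 3):
        super().__init__()
        self.extract = nn.Conv2d(3, feat_channels, 3, padding=1)  # reference features
        # One (dy, dx) pair per kernel position -> 2 * k * k offset channels.
        self.to_offset = nn.Conv2d(repr_channels, 2 * k * k, 3, padding=1)
        self.deform = DeformConv2d(feat_channels, feat_channels, k, padding=k // 2)
        self.to_image = nn.Conv2d(feat_channels, 3, 3, padding=1)

    def forward(self, hybrid_repr: torch.Tensor, prev_recon: torch.Tensor) -> torch.Tensor:
        ref = self.extract(prev_recon)
        offset = self.to_offset(hybrid_repr)
        return self.to_image(self.deform(ref, offset))            # second predicted image
```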
  36. The method of any one of claims 30-35, wherein the determining a reconstructed image of the current image according to the P predicted images comprises:
    determining a target predicted image of the current image according to the P predicted images;
    and determining a reconstructed image of the current image according to the target predicted image.
  37. The method of claim 36, wherein, if P is greater than 1, the determining a target predicted image of the current image according to the P predicted images comprises:
    determining a weighted image according to the P predicted images;
    and obtaining the target predicted image according to the weighted image.
  38. The method of claim 36, further comprising:
    obtaining a residual image of the current image according to the hybrid spatio-temporal representation;
    wherein the determining a target predicted image of the current image according to the P predicted images comprises:
    obtaining the target predicted image according to the P predicted images and the residual image.
  39. The method of claim 38, wherein, if P is greater than 1, the obtaining the target predicted image according to the P predicted images and the residual image comprises:
    determining a weighted image according to the P predicted images;
    and determining the target predicted image according to the weighted image and the residual image.
  40. The method of claim 37 or 39, wherein the determining a weighted image according to the P predicted images comprises:
    determining weights corresponding to the P predicted images;
    and weighting the P predicted images according to the weights corresponding to the P predicted images to obtain the weighted image.
  41. The method of claim 40, wherein the determining weights corresponding to the P predicted images comprises:
    performing adaptive masking according to the hybrid spatio-temporal representation to obtain the weights corresponding to the P predicted images.
  42. The method of claim 41, wherein, if the P predicted images comprise a first predicted image and a second predicted image, the determining weights corresponding to the P predicted images comprises:
    determining a first weight corresponding to the first predicted image and a second weight corresponding to the second predicted image;
    and the weighting the P predicted images according to the weights corresponding to the P predicted images to obtain the weighted image comprises:
    weighting the first predicted image and the second predicted image according to the first weight and the second weight to obtain the weighted image.
  43. The method of claim 36, further comprising:
    determining a residual value of the current image according to the current image and the target predicted image;
    and encoding the residual value to obtain a residual code stream.
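The residual path of claims 43-44 round-trips as follows; entropy_encode and entropy_decode are hypothetical placeholders for the actual residual coder.

```python
import torch

def encode_residual(x: torch.Tensor, x_pred: torch.Tensor, entropy_encode) -> bytes:
    return entropy_encode(x - x_pred)                 # residual code stream

def reconstruct(x_pred: torch.Tensor, residual_stream: bytes, entropy_decode) -> torch.Tensor:
    return x_pred + entropy_decode(residual_stream)   # reconstructed image
```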
  44. The method of claim 43, wherein the determining a reconstructed image of the current image according to the target predicted image comprises:
    decoding the residual code stream to obtain the residual value of the current image;
    and obtaining the reconstructed image according to the target predicted image and the residual value.
  45. A video decoding apparatus, comprising:
    a decoding unit, configured to decode a first code stream to determine quantized first feature information, wherein the first feature information is obtained by performing feature fusion on a current image and a reconstructed image before the current image;
    a fusion unit, configured to perform multi-level temporal fusion on the quantized first feature information to obtain a hybrid spatio-temporal representation;
    a compensation unit, configured to perform motion compensation on the previous reconstructed image according to the hybrid spatio-temporal representation to obtain P predicted images of the current image, wherein P is a positive integer;
    and a reconstruction unit, configured to determine a reconstructed image of the current image according to the P predicted images.
  46. A video encoding apparatus, comprising:
    a fusion unit, configured to perform feature fusion on a current image and a reconstructed image before the current image to obtain first feature information;
    a quantization unit, configured to quantize the first feature information to obtain quantized first feature information;
    and an encoding unit, configured to encode the quantized first feature information to obtain a first code stream.
  47. A video codec system, comprising a video encoder and a video decoder, wherein:
    the video decoder is configured to perform the video decoding method of any one of claims 1-18;
    and the video encoder is configured to perform the video encoding method of any one of claims 19-44.
  48. An electronic device, comprising a memory and a processor, wherein:
    the memory is configured to store a computer program;
    and the processor is configured to execute the computer program to implement the method of any one of claims 1 to 18 or 19 to 44.
  49. A computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the method of any one of claims 1 to 18 or 19 to 44.
  50. A code stream, obtained by the method of any one of claims 19 to 44.
CN202280094956.5A 2022-04-29 2022-04-29 Video encoding and decoding method, device, equipment, system and storage medium Pending CN119032566A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/090468 WO2023206420A1 (en) 2022-04-29 2022-04-29 Video encoding and decoding method and apparatus, device, system and storage medium

Publications (1)

Publication Number Publication Date
CN119032566A (en) 2024-11-26

Family

ID=88517008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202280094956.5A Pending CN119032566A (en) 2022-04-29 2022-04-29 Video encoding and decoding method, device, equipment, system and storage medium

Country Status (2)

Country Link
CN (1) CN119032566A (en)
WO (1) WO2023206420A1 (en)

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019208677A1 (en) * 2018-04-27 2019-10-31 パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ Coding device, decoding device, coding method, and decoding method
CN111263161B (en) * 2020-01-07 2021-10-26 北京地平线机器人技术研发有限公司 Video compression processing method and device, storage medium and electronic equipment
CN112767534B (en) * 2020-12-31 2024-02-09 北京达佳互联信息技术有限公司 Video image processing method, device, electronic equipment and storage medium
CN113068041B (en) * 2021-03-12 2022-02-08 天津大学 Intelligent affine motion compensation coding method
CN113298894B (en) * 2021-05-19 2023-03-28 北京航空航天大学 Video compression method based on deep learning feature space
CN113269133B (en) * 2021-06-16 2025-01-03 大连理工大学 A semantic segmentation method for drone-viewed videos based on deep learning
CN114049258B (en) * 2021-11-15 2025-01-07 Oppo广东移动通信有限公司 A method, chip, device and electronic device for image processing

Also Published As

Publication number Publication date
WO2023206420A1 (en) 2023-11-02

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination