CN119402654A - Video encoding method and device, electronic device and non-transitory computer-readable storage medium - Google Patents
Info
- Publication number
- CN119402654A (application CN202411521146.0A)
- Authority
- CN
- China
- Prior art keywords
- motion
- motion compensation
- current block
- frame
- video
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/136—Incoming video signal characteristics or properties
- H04N19/137—Motion inside a coding unit, e.g. average field, frame or block difference
- H04N19/139—Analysis of motion vectors, e.g. their magnitude, direction, variance or reliability
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/513—Processing of motion vectors
- H04N19/517—Processing of motion vectors by encoding
- H04N19/52—Processing of motion vectors by encoding by predictive encoding
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/61—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding
- H04N19/615—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding in combination with predictive coding using motion compensated temporal filtering [MCTF]
Abstract
The present disclosure provides a video encoding method, a video encoding apparatus, an electronic device, and a non-transitory computer-readable storage medium. The video encoding method includes: for each neighboring frame of a current frame, performing motion estimation on a current block in the current frame to search a plurality of motion vectors from the neighboring frame, and performing motion compensation on the current block based on the plurality of motion vectors searched from the neighboring frame, thereby obtaining a motion compensation result for each neighboring frame; performing temporal filtering on the current block based on the motion compensation result for each of the at least one neighboring frame; and encoding the current block based on the temporal filtering result of the current block. In the present disclosure, each neighboring frame is no longer limited to a single motion compensated prediction; multiple motion compensated predictions may be used, i.e., each neighboring frame may correspond to multiple motion compensation results. The prediction accuracy of motion compensation can thus be significantly improved, which further improves the filtering effect and coding performance of MCTF.
Description
Technical Field
The present disclosure relates to video codec and compression. More particularly, the present disclosure relates to video encoding methods and apparatus, electronic devices, and non-transitory computer readable storage media.
Background
Various electronic devices (e.g., digital televisions, laptop or desktop computers, tablet computers, digital cameras, digital recording devices, digital media players, video game consoles, smart phones, video teleconferencing devices, video streaming devices, etc.) support digital video. The electronic devices send, receive, or otherwise communicate digital video data over a communication network and/or store the digital video data on a storage device. Because of the limited bandwidth capacity of the communication network and the limited storage resources of the storage device, video data may be compressed according to one or more video coding standards before it is transmitted or stored. For example, video coding standards include Versatile Video Coding (VVC), the Joint Exploration Test Model (JEM), High Efficiency Video Coding (HEVC/H.265), Advanced Video Coding (AVC/H.264), Moving Picture Experts Group (MPEG) coding, and so on. Video coding typically employs prediction methods (e.g., inter prediction, intra prediction, etc.) that exploit the redundancy inherent in video data. Video coding aims to compress video data into a form that uses a lower bit rate while avoiding or minimizing degradation of video quality.
Motion compensated temporal filtering (Motion Compensated Temporal Filter, MCTF) is an encoder-only video filtering technique that exploits the temporal correlation of video and uses inter-block reference relationships to perform temporal filtering on reference frames, reducing the temporal redundancy generated during video block referencing and thereby improving overall coding efficiency.
In the related art, when MCTF is used for temporal filtering, a classical block-based motion compensation technique is applied to each frame, i.e., only one motion compensation block is searched for each neighboring frame, and bilateral weighted filtering is then performed on the current block using the motion compensation blocks corresponding one-to-one to all neighboring frames of the current frame. However, using only single-hypothesis compensation per frame may result in low prediction quality; in particular, the prediction requirements of some complex scenes may not be met, leading to a poor temporal filtering effect and low coding efficiency.
Disclosure of Invention
Examples of the present disclosure provide video encoding methods and apparatuses thereof, electronic devices, and non-transitory computer-readable storage media.
According to a first aspect of the present disclosure, there is provided a video encoding method, including: for each of at least one neighboring frame of a current frame, performing motion estimation on a current block in the current frame to search a plurality of motion vectors from the neighboring frame, and performing motion compensation on the current block based on the plurality of motion vectors searched from the neighboring frame, thereby obtaining a motion compensation result for each of the at least one neighboring frame; performing temporal filtering on the current block based on the motion compensation result for each of the at least one neighboring frame, thereby obtaining a temporal filtering result of the current block; and encoding the current block based on the temporal filtering result of the current block.
Optionally, performing motion compensation on the current block based on the plurality of motion vectors searched from the neighboring frame includes: performing weighted summation on the plurality of motion vectors searched from the neighboring frame to obtain a final motion vector; and performing motion compensation on the current block based on the final motion vector.
Optionally, performing motion compensation on the current block based on the plurality of motion vectors searched from the neighboring frame includes: performing motion compensation on the current block based on each of the plurality of motion vectors searched from the neighboring frame, respectively, to obtain a plurality of motion compensation results; and performing weighted summation on the plurality of motion compensation results to obtain a final motion compensation result, wherein performing temporal filtering on the current block based on the motion compensation result for each of the at least one neighboring frame includes performing temporal filtering on the current block based on the final motion compensation result obtained for each of the at least one neighboring frame.
Optionally, performing motion compensation on the current block based on the plurality of motion vectors searched from the neighboring frame includes performing motion compensation on the current block based on each of the plurality of motion vectors, respectively, to obtain a plurality of motion compensation results, wherein performing temporal filtering on the current block based on the motion compensation result for each of the at least one neighboring frame includes performing temporal filtering on the current block based on the plurality of motion compensation results obtained for each of the at least one neighboring frame.
Optionally, searching the plurality of motion vectors from the neighboring frame includes: searching a plurality of motion vector candidates from the neighboring frame by the same motion vector search method; and selecting a predetermined number of optimal motion vector candidates from the plurality of motion vector candidates according to a predetermined rule as the plurality of motion vectors.
Optionally, searching the plurality of motion vectors from the neighboring frame includes searching a plurality of motion vector candidates from the neighboring frame by different motion vector search methods to determine the plurality of motion vectors.
Optionally, searching the plurality of motion vectors from the neighboring frame includes: searching a first motion vector candidate from the neighboring frame by a first motion vector search method; searching at least one second motion vector candidate from the neighboring frame by at least one second motion vector search method; selecting a predetermined number of optimal second motion vector candidates from the at least one second motion vector candidate according to a predetermined rule; and determining the first motion vector candidate and the predetermined number of optimal second motion vector candidates as the plurality of motion vectors.
Optionally, the weights for the weighted summation may be determined according to the prediction features corresponding to each motion vector, or in a machine learning or deep learning manner.
Optionally, the weights for the weighted summation may be determined according to the prediction features corresponding to each motion compensation result, or in a machine learning or deep learning manner.
Optionally, the filtering weights for the temporal filtering may be determined according to the prediction features corresponding to each motion compensation result.
According to a second aspect of the present disclosure, there is provided a video encoding apparatus, including: a motion estimation module and a motion compensation module configured to, for each of at least one neighboring frame of a current frame, perform motion estimation on a current block in the current frame to search a plurality of motion vectors from the neighboring frame, and to perform motion compensation on the current block based on the plurality of motion vectors searched from the neighboring frame, thereby obtaining a motion compensation result for each of the at least one neighboring frame; a temporal filtering module configured to perform temporal filtering on the current block based on the motion compensation result for each of the at least one neighboring frame, thereby obtaining a temporal filtering result of the current block; and an encoding module configured to encode the current block based on the temporal filtering result of the current block.
Optionally, the motion estimation module and the motion compensation module are configured to perform weighted summation on the plurality of motion vectors searched from the adjacent frames to obtain a final motion vector, and perform motion compensation on the current block based on the final motion vector.
Optionally, the motion estimation module and the motion compensation module are configured to perform motion compensation on the current block based on the plurality of motion vectors searched from the adjacent frames respectively to obtain a plurality of motion compensation results, perform weighted summation on the plurality of motion compensation results to obtain a final motion compensation result, and the temporal filtering module is configured to perform temporal filtering on the current block based on the final motion compensation result obtained for each of the at least one adjacent frame.
Optionally, the motion estimation module and the motion compensation module are configured to perform motion compensation on the current block based on the plurality of motion vectors searched from the adjacent frames respectively to obtain a plurality of motion compensation results, and the temporal filtering module is configured to perform temporal filtering on the current block based on the plurality of motion compensation results obtained for each of the at least one adjacent frame.
Optionally, the motion estimation module and the motion compensation module are configured to search a plurality of motion vector candidates from the neighboring frame by the same motion vector search method, and to select a predetermined number of optimal motion vector candidates from the plurality of motion vector candidates as the plurality of motion vectors according to a predetermined rule.
Optionally, the motion estimation module and the motion compensation module are configured to search a plurality of motion vector candidates from the neighboring frame by different motion vector search methods to determine the plurality of motion vectors.
Optionally, the motion estimation module and the motion compensation module are configured to search a first motion vector candidate from the neighboring frame by a first motion vector search method, search at least one second motion vector candidate from the neighboring frame by at least one second motion vector search method, select a predetermined number of optimal second motion vector candidates from the at least one second motion vector candidate according to a predetermined rule, and determine the first motion vector candidate and the predetermined number of optimal second motion vector candidates as the plurality of motion vectors.
Optionally, the weights for the weighted summation may be determined according to the prediction features corresponding to each motion vector, or in a machine learning or deep learning manner.
Optionally, the weights for the weighted summation may be determined according to the prediction features corresponding to each motion compensation result, or in a machine learning or deep learning manner.
Optionally, the filtering weights for the temporal filtering may be determined according to the prediction features corresponding to each motion compensation result.
According to a third aspect of the present disclosure there is provided an apparatus for video encoding comprising one or more processors and a memory coupled to the one or more processors and configured to store instructions executable by the one or more processors, wherein the one or more processors, when executing the instructions, are configured to perform a video encoding method according to the present disclosure.
According to a fourth aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium for storing computer-executable instructions which, when executed by one or more computer processors, cause the one or more computer processors to perform a video encoding method according to the present disclosure.
According to a fifth aspect of the present disclosure there is provided a computer program product having instructions for storing a bitstream, wherein the bitstream comprises video data generated by a video encoding method according to the present disclosure.
According to a sixth aspect of the present disclosure, there is provided a computer program product comprising computer instructions, characterized in that the computer instructions, when executed by at least one processor, implement a video encoding method according to the present disclosure.
According to a seventh aspect of the present disclosure, there is provided a method of generating a bitstream, comprising generating a bitstream according to the video encoding method of the present disclosure.
The technical solutions provided by the embodiments of the present disclosure bring at least the following beneficial effects:
in the present disclosure, using the multi-hypothesis motion compensation concept, an encoding temporal filtering method based on multi-hypothesis weighted prediction is proposed. When performing motion compensation based temporal filtering (MCTF), not only one motion vector (MV) is searched for each neighboring frame; instead, a plurality of MVs are allowed to be searched, and temporal filtering with multi-hypothesis weighted prediction is performed based on the searched MVs, for example multi-hypothesis based motion estimation, multi-hypothesis based motion compensation, and multi-hypothesis based temporal filtering. The prediction accuracy of motion estimation, motion compensation, or temporal filtering can thus be significantly improved, and the filtering effect of MCTF can be improved. In particular, in complex and highly dynamic scenes, higher coding efficiency can be obtained, coding performance is significantly improved, and error propagation distortion caused by channel errors can be reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate examples in accordance with the disclosure and together with the description, serve to explain the principles of the disclosure.
Fig. 1 is a block diagram illustrating an exemplary system for encoding and decoding video blocks in parallel according to some embodiments of the present disclosure.
Fig. 2 is a block diagram illustrating an exemplary video encoder according to some embodiments described in this disclosure.
Fig. 3 is a block diagram illustrating an exemplary video decoder according to some embodiments of the present application.
Fig. 4 is a flowchart illustrating a video encoding method according to an exemplary embodiment of the present disclosure.
Fig. 5 is a schematic diagram illustrating motion estimation and motion compensation according to an exemplary embodiment of the present disclosure.
Fig. 6 is a diagram illustrating multi-hypothesis weighted fusion at the motion estimation stage according to an exemplary embodiment of the present disclosure.
Fig. 7 is a diagram illustrating multi-hypothesis weighted fusion in a motion compensation phase according to an exemplary embodiment of the present disclosure.
Fig. 8 is a diagram illustrating multi-hypothesis weighted fusion in the time-domain filtering stage according to an exemplary embodiment of the present disclosure.
Fig. 9 is a block diagram illustrating a video encoding apparatus according to an exemplary embodiment of the present disclosure.
FIG. 10 is a diagram illustrating a computing environment coupled with a user interface according to some embodiments of the present disclosure.
Detailed Description
Reference will now be made in detail to the present embodiments, examples of which are illustrated in the accompanying drawings. In the following detailed description, numerous non-limiting specific details are set forth in order to provide an understanding of the subject matter presented herein. However, various alternatives may be used and the subject matter may be practiced without these specific details without departing from the scope of the claims. For example, the subject matter presented herein may be implemented on many types of electronic devices having digital video capabilities.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the figures are used for distinguishing between objects and not for describing any particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate, such that embodiments of the disclosure described herein may be implemented in other sequences than those illustrated in the figures or otherwise described in the disclosure.
Fig. 1 is a block diagram illustrating an exemplary system 10 for encoding and decoding video blocks in parallel according to some embodiments of the present disclosure. As shown in fig. 1, the system 10 includes a source device 12, the source device 12 generating and encoding video data to be later decoded by a target device 14. Source device 12 and destination device 14 may comprise any of a wide variety of electronic devices including cloud servers, server computers, desktop or laptop computers, tablet computers, smart phones, set-top boxes, digital televisions, cameras, display devices, digital media players, video gaming machines, video streaming devices, and the like. In some implementations, the source device 12 and the target device 14 are equipped with wireless communication capabilities.
In some implementations, the target device 14 may receive encoded video data to be decoded via the link 16. Link 16 may comprise any type of communication medium or device capable of moving encoded video data from source device 12 to destination device 14.
In other embodiments, encoded video data may be sent from output interface 22 to storage device 32. The encoded video data in the storage device 32 may then be accessed by the target device 14 via the input interface 28.
As shown in fig. 1, source device 12 includes a video source 18, a video encoder 20, and an output interface 22. Video source 18 may include sources such as a video capture device (e.g., a video camera), a video archive containing previously captured video, a video feed interface for receiving video from a video content provider, and/or a computer graphics system for generating computer graphics data as source video, or a combination of such sources.
Captured, pre-captured, or computer-generated video may be encoded by video encoder 20. The encoded video data may be sent directly to the target device 14 via the output interface 22 of the source device 12. The encoded video data may also (or alternatively) be stored on the storage device 32 for later access by the target device 14 or other device for decoding and/or playback.
The target device 14 includes an input interface 28, a video decoder 30, and a display device 34. Input interface 28 may include a receiver and/or modem and receives encoded video data over link 16. The encoded video data transmitted over link 16 or provided on storage device 32 may include various syntax elements generated by video encoder 20 for use by video decoder 30 in decoding the video data. Such syntax elements may be included in encoded video data transmitted over a communication medium, stored on a storage medium, or stored on a file server.
Video encoder 20 and video decoder 30 may operate in accordance with a proprietary standard or an industry standard (e.g., VVC, HEVC, MPEG-4 Part 10 AVC) or an extension of such a standard. It should be appreciated that the present application is not limited to a particular video encoding/decoding standard and may be applicable to other video encoding/decoding standards. It is generally contemplated that video encoder 20 of source device 12 may be configured to encode video data according to any of these current or future standards. Similarly, it is also generally contemplated that the video decoder 30 of the target device 14 may be configured to decode video data according to any of these current or future standards.
Video encoder 20 and video decoder 30 may each be implemented as any of a variety of suitable encoder and/or decoder circuits, such as one or more microprocessors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), discrete logic devices, software, hardware, firmware, or any combination thereof. When implemented in part in software, the electronic device can store instructions for the software in a suitable non-transitory computer readable medium and execute the instructions in hardware using one or more processors to perform the video encoding/decoding operations disclosed in the present disclosure. Each of video encoder 20 and video decoder 30 may be included in one or more encoders or decoders, any of which may be integrated as part of a combined encoder/decoder (CODEC) in the respective device.
Fig. 2 is a block diagram illustrating an exemplary video encoder 20 according to some embodiments described in this disclosure. Video encoder 20 may perform intra-prediction encoding and inter-prediction encoding on video blocks within video frames. Intra-prediction encoding relies on spatial prediction to reduce or remove spatial redundancy in video data within a given video frame or picture. Inter-prediction encoding relies on temporal prediction to reduce or remove temporal redundancy in video data within adjacent video frames or pictures of a video sequence. It should be noted that in the field of video coding, the term "frame" may be used as a synonym for the term "image" or "picture".
As shown in fig. 2, video encoder 20 includes a video data memory 40, a prediction processing unit 41, a Decoded Picture Buffer (DPB) 64, an adder 50, a transform processing unit 52, a quantization unit 54, and an entropy encoding unit 56. The prediction processing unit 41 further includes a motion estimation unit 42, a motion compensation unit 44, a segmentation unit 45, an intra prediction processing unit 46, and an intra Block Copy (BC) unit 48. In some implementations, video encoder 20 also includes an inverse quantization unit 58, an inverse transform processing unit 60, and an adder 62 for video block reconstruction. A loop filter 63, such as a deblocking filter, may be located between adder 62 and DPB 64 to filter block boundaries to remove blocking artifacts from the reconstructed video. In addition to the deblocking filter, another loop filter, such as a Sample Adaptive Offset (SAO) filter, a cross-component sample adaptive offset (CCSAO) filter, and/or an Adaptive Loop Filter (ALF), may be used to filter the output of adder 62. In some examples, the loop filter may be omitted and the decoded video block may be provided directly to DPB 64 by adder 62. Video encoder 20 may take the form of fixed or programmable hardware units, or may be dispersed in one or more of the fixed or programmable hardware units described.
Video data memory 40 may store video data to be encoded by components of video encoder 20. The video data in video data store 40 may be obtained, for example, from video source 18 shown in fig. 1. DPB 64 is a buffer that stores reference video data (e.g., reference frames or pictures) for use by video encoder 20 in encoding the video data (e.g., in intra or inter prediction encoding modes).
As shown in fig. 2, after receiving video data, a dividing unit 45 within the prediction processing unit 41 divides the video data into video blocks. This partitioning may also include partitioning the video frame into slices, tiles (tiles) (e.g., a set of video blocks), or other larger Coding Units (CUs) according to a predefined split structure (e.g., a Quadtree (QT) structure) associated with the video data. It should be noted that the term "block" or "video block" as used herein may be a portion of a frame or picture, especially a rectangular (square or non-square) portion. Referring to HEVC and VVC, a block or video block may be or correspond to a Coding Tree Unit (CTU), a CU, a Prediction Unit (PU), or a Transform Unit (TU) and/or may be or correspond to a respective block (e.g., a Coding Tree Block (CTB), a Coding Block (CB), a Prediction Block (PB), or a Transform Block (TB)) and/or sub-block.
The prediction processing unit 41 may select one of a plurality of possible prediction coding modes for the current video block, for example one of a plurality of intra prediction coding modes or one of a plurality of inter prediction coding modes, based on the error results (e.g., the coding rate and the distortion level). The prediction processing unit 41 may provide the resulting intra- or inter-prediction encoded block to the adder 50 to generate a residual block and to the adder 62 to reconstruct the encoded block for subsequent use as part of a reference frame. Prediction processing unit 41 also provides syntax elements (e.g., motion vectors, intra mode indicators, partition information, and other such syntax information) to entropy encoding unit 56.
To select the appropriate intra-prediction encoding mode for the current video block, intra-prediction processing unit 46 within prediction processing unit 41 may perform intra-prediction encoding of the current video block in relation to one or more neighboring blocks in the same frame as the current block to be encoded to provide spatial prediction. Motion estimation unit 42 and motion compensation unit 44 within prediction processing unit 41 perform inter-prediction encoding of the current video block in relation to one or more prediction blocks in one or more reference frames to provide temporal prediction. Video encoder 20 may perform multiple encoding passes, for example, to select an appropriate encoding mode for each block of video data.
In some embodiments, motion estimation unit 42 determines the inter-prediction mode for the current video frame by generating a motion vector from a predetermined pattern within the sequence of video frames, the motion vector indicating a displacement of a video block within the current video frame relative to a predicted block within a reference video frame. The motion estimation performed by the motion estimation unit 42 is a process of generating a motion vector that estimates motion for a video block. For example, the motion vector may indicate the displacement of a video block within a current video frame or picture relative to a predicted block within a reference frame associated with the current block being encoded within the current frame. The predetermined pattern may designate video frames in the sequence as P-frames or B-frames. The intra BC unit 48 may determine the vector (e.g., block vector) for intra BC encoding in a similar manner as the motion vector used for inter prediction by the motion estimation unit 42, or may determine the block vector using the motion estimation unit 42.
Regardless of whether the prediction block is from the same frame according to intra-prediction or from a different frame according to inter-prediction, video encoder 20 may form a residual video block by subtracting the pixel values of the prediction block from the pixel values of the current video block being encoded. The pixel differences forming the residual video block may include both a luma component difference and a chroma component difference.
Intra-prediction processing unit 46 may encode the current block using various intra-prediction modes, for example, during separate encoding passes, and intra-prediction processing unit 46 (or a mode selection unit in some examples) may select an appropriate intra-prediction mode from the tested intra-prediction modes for use. Intra-prediction processing unit 46 may provide information to entropy encoding unit 56 indicating the intra-prediction mode selected for the block. Entropy encoding unit 56 may encode information into the bitstream that indicates the selected intra-prediction mode.
After the prediction processing unit 41 determines a prediction block for the current video block via inter prediction or intra prediction, the adder 50 forms a residual video block by subtracting the prediction block from the current video block. The residual video data in the residual block may be included in one or more TUs and provided to transform processing unit 52. Transform processing unit 52 transforms the residual video data into residual transform coefficients using a transform, such as a Discrete Cosine Transform (DCT) or a conceptually similar transform.
The transform processing unit 52 may send the resulting transform coefficients to the quantization unit 54. The quantization unit 54 quantizes the transform coefficient to further reduce the bit rate. The quantization process may also reduce the bit depth associated with some or all of the coefficients. The quantization level may be modified by adjusting the quantization parameter. In some examples, quantization unit 54 may then perform a scan on the matrix including the quantized transform coefficients. Alternatively, entropy encoding unit 56 may perform the scan.
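By way of illustration only, the following is a minimal sketch of uniform scalar quantization with a QP-dependent step size; the step-size formula and the simple rounding are assumptions of this sketch and do not reproduce the integer arithmetic or rate-distortion optimized quantization of actual HEVC/VVC encoders.

```python
import numpy as np

def quantize(coeffs: np.ndarray, qp: int) -> np.ndarray:
    # Assumed HEVC/VVC-like relationship: the step size roughly doubles every 6 QP.
    step = 2.0 ** ((qp - 4) / 6.0)
    # Uniform scalar quantization with simple rounding.
    return np.round(coeffs / step).astype(np.int32)

def dequantize(levels: np.ndarray, qp: int) -> np.ndarray:
    # Inverse operation used on the decoder side (and in the encoder's reconstruction loop).
    step = 2.0 ** ((qp - 4) / 6.0)
    return levels.astype(np.float64) * step
```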
After quantization, entropy encoding unit 56 entropy encodes the quantized transform coefficients into a video bitstream using, for example, context Adaptive Variable Length Coding (CAVLC), context Adaptive Binary Arithmetic Coding (CABAC), syntax-based context adaptive binary arithmetic coding (SBAC), probability Interval Partition Entropy (PIPE) coding, or another entropy encoding method or technique. The encoded bitstream may then be sent to a video decoder 30 as shown in fig. 1, or archived in a storage device 32 as shown in fig. 1 for later transmission to the video decoder 30 or retrieval by the video decoder 30. Entropy encoding unit 56 may also entropy encode motion vectors and other syntax elements for the current video frame being encoded.
Inverse quantization unit 58 and inverse transform processing unit 60 apply inverse quantization and inverse transforms, respectively, to reconstruct the residual video block in the pixel domain for generating reference blocks for predicting other video blocks. As noted above, motion compensation unit 44 may generate a motion compensated prediction block from one or more reference blocks of a frame stored in DPB 64. Motion compensation unit 44 may also apply one or more interpolation filters to the prediction block to calculate sub-integer pixel values for use in motion estimation.
Adder 62 adds the reconstructed residual block to the motion compensated prediction block generated by motion compensation unit 44 to generate a reference block for storage in DPB 64. The reference block may then be used by the intra BC unit 48, the motion estimation unit 42, and the motion compensation unit 44 as a prediction block to inter-predict another video block in a subsequent video frame.
Fig. 3 is a block diagram illustrating an exemplary video decoder 30 according to some embodiments of the present application. Video decoder 30 includes video data memory 79, entropy decoding unit 80, prediction processing unit 81, inverse quantization unit 86, inverse transform processing unit 88, adder 90, and DPB 92. The prediction processing unit 81 further includes a motion compensation unit 82, an intra prediction unit 84, and an intra BC unit 85. Video decoder 30 may perform a decoding process that is substantially reciprocal to the encoding process described above in connection with fig. 2 with respect to video encoder 20. For example, the motion compensation unit 82 may generate prediction data based on the motion vector received from the entropy decoding unit 80, and the intra prediction unit 84 may generate prediction data based on the intra prediction mode indicator received from the entropy decoding unit 80.
In some examples, embodiments of the present disclosure may be dispersed in one or more of the units of video decoder 30. For example, the intra BC unit 85 may perform embodiments of the present application alone or in combination with other units of the video decoder 30 (e.g., the motion compensation unit 82, the intra prediction unit 84, and the entropy decoding unit 80). In some examples, video decoder 30 may not include intra BC unit 85, and the functions of intra BC unit 85 may be performed by other components of prediction processing unit 81 (e.g., motion compensation unit 82).
Video data memory 79 may store video data, such as an encoded video bitstream, to be decoded by other components of video decoder 30. The video data stored in the video data memory 79 may be obtained, for example, from the storage device 32, from a local video source (e.g., a camera), via wired or wireless network communication of video data, or by accessing a physical data storage medium (e.g., a flash drive or hard disk).
During the decoding process, video decoder 30 receives an encoded video bitstream representing video blocks of encoded video frames and associated syntax elements. Entropy decoding unit 80 of video decoder 30 entropy decodes the bitstream to generate quantization coefficients, motion vectors, or intra-prediction mode indicators, as well as other syntax elements. Entropy decoding unit 80 then forwards the motion vector or intra prediction mode indicator, as well as other syntax elements, to prediction processing unit 81.
When a video frame is encoded as an intra prediction encoded (I) frame, or when an intra-coded prediction block is used in other types of frames, the intra prediction unit 84 of the prediction processing unit 81 may generate prediction data for a video block of the current video frame based on the signaled intra prediction mode and reference data from previously decoded blocks of the current frame.
When a video frame is encoded as an inter-prediction encoded (i.e., B or P) frame, the motion compensation unit 82 of the prediction processing unit 81 generates one or more prediction blocks for the video block of the current video frame based on the motion vectors and other syntax elements received from the entropy decoding unit 80. Each of the prediction blocks may be generated from reference frames within one of the reference frame lists. Video decoder 30 may construct a list of reference frames, i.e., list 0 and list 1, using a default construction technique based on the reference frames stored in DPB 92.
In some examples, when video blocks are encoded according to the intra BC mode described herein, intra BC unit 85 of prediction processing unit 81 generates a prediction block for the current video block based on the block vectors and other syntax elements received from entropy decoding unit 80. The prediction block may be within a reconstructed region of the same picture as the current video block defined by video encoder 20.
The motion compensation unit 82 and/or the intra BC unit 85 determine prediction information for the video block of the current video frame by parsing the motion vector and other syntax elements, and then use the prediction information to generate a prediction block for the current video block being decoded.
Motion compensation unit 82 may also perform interpolation using interpolation filters, such as those used by video encoder 20 during encoding of video blocks, to calculate interpolation values for sub-integer pixels of the reference block. In this case, motion compensation unit 82 may determine interpolation filters used by video encoder 20 from the received syntax elements and use these interpolation filters to generate the prediction block.
The inverse quantization unit 86 inverse quantizes the quantized transform coefficients provided in the bitstream and entropy decoded by the entropy decoding unit 80, using the same quantization parameter calculated by the video encoder 20 for each video block in the video frame to determine the degree of quantization. The inverse transform processing unit 88 applies an inverse transform (e.g., an inverse DCT, an inverse integer transform, or a conceptually similar inverse transform process) to the transform coefficients in order to reconstruct the residual block in the pixel domain.
After the motion compensation unit 82 or the intra BC unit 85 generates a prediction block for the current video block based on the vector and other syntax elements, the adder 90 reconstructs a decoded video block for the current video block by adding the residual block from the inverse transform processing unit 88 to the corresponding prediction block generated by the motion compensation unit 82 and the intra BC unit 85. Loop filter 91 (e.g., deblocking filter, SAO filter, CCSAO filter, and/or ALF) may be located between adder 90 and DPB 92 to further process the decoded video block. In some examples, loop filter 91 may be omitted and the decoded video block may be provided directly to DPB 92 by adder 90. The decoded video blocks in a given frame are then stored in DPB 92, and DPB 92 stores reference frames for subsequent motion compensation of the next video block. DPB 92 or a memory device separate from DPB 92 may also store decoded video for later presentation on a display device (e.g., display device 34 of fig. 1).
As described above, motion compensated temporal filtering (Motion Compensated Temporal Filter, MCTF) is an encoder-only video filtering technique that exploits the temporal correlation of video and uses inter-block reference relationships to perform temporal filtering on reference frames, reducing the temporal redundancy generated during video block referencing and thereby improving overall coding efficiency.
The MCTF technique can be mainly decomposed into two parts, namely a "motion estimation and compensation part" and a "filtering part". In the "motion estimation and compensation part", MCTF first performs block-level motion estimation (Motion Estimation, ME). Specifically, the current picture may be uniformly divided into a plurality of M×M blocks, for example 8×8 blocks, and a motion vector (MV) may be searched for each block. Classical block-based motion compensation can then be performed on each 8×8 block using its MV, and all the motion compensated 8×8 image blocks can be spliced directly to obtain a motion compensated frame. This motion compensation process can be performed for each neighboring frame of the current frame, yielding a plurality of motion compensated frames in one-to-one correspondence with the neighboring frames of the current frame. In the "filtering part", each block of the current frame and the plurality of motion compensation blocks corresponding to the neighboring frames are bilaterally filtered in certain proportions to generate the final temporal filtering result.
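As a non-normative illustration of the two parts described above, the following sketch performs full-search block-level motion estimation and block-level motion compensation for one neighboring frame. The 8×8 block size follows the description; the SAD matching criterion, the search range, and the assumption that frame dimensions are multiples of 8 are simplifications introduced here.

```python
import numpy as np

BLOCK = 8  # block size used for MCTF motion estimation, per the description above

def sad(a: np.ndarray, b: np.ndarray) -> float:
    """Sum of absolute differences between two equally sized blocks."""
    return float(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def estimate_mv(cur: np.ndarray, ref: np.ndarray, by: int, bx: int, search: int = 8):
    """Full-search motion estimation for the 8x8 block at (by, bx); returns the best (dy, dx)."""
    h, w = ref.shape
    block = cur[by:by + BLOCK, bx:bx + BLOCK]
    best_mv, best_cost = (0, 0), float("inf")
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if 0 <= y <= h - BLOCK and 0 <= x <= w - BLOCK:
                cost = sad(block, ref[y:y + BLOCK, x:x + BLOCK])
                if cost < best_cost:
                    best_cost, best_mv = cost, (dy, dx)
    return best_mv

def motion_compensate_frame(cur: np.ndarray, ref: np.ndarray) -> np.ndarray:
    """Splice the best-match block of every 8x8 block into one motion compensated frame."""
    comp = np.zeros_like(cur)
    h, w = cur.shape  # assumed to be multiples of BLOCK
    for by in range(0, h, BLOCK):
        for bx in range(0, w, BLOCK):
            dy, dx = estimate_mv(cur, ref, by, bx)
            comp[by:by + BLOCK, bx:bx + BLOCK] = ref[by + dy:by + dy + BLOCK,
                                                     bx + dx:bx + dx + BLOCK]
    return comp
```

Running motion_compensate_frame once per neighboring frame yields the motion compensated frames that the filtering part then combines with the current frame.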
However, in the related art, when MCTF is used for temporal filtering, a classical block-based motion compensation technique is applied to each frame, that is, only one motion compensation block is obtained for each neighboring frame, and bilateral weighted filtering is then performed using the motion compensation blocks corresponding one-to-one to all neighboring frames of the current frame. Using only single-hypothesis compensation per frame may result in low prediction quality, which leads to a poorer temporal filtering effect and lower coding efficiency.
To solve the above problems in the related art, the present disclosure provides a video encoding method and apparatus, an electronic device, and a non-transitory computer-readable storage medium that use the multi-hypothesis motion compensation concept and propose an encoding temporal filtering method based on multi-hypothesis weighted prediction. When performing motion compensation based temporal filtering (MCTF), not only one motion vector (MV) is searched for each neighboring frame; instead, a plurality of MVs are allowed to be searched, and based on the searched MVs, temporal filtering with multi-hypothesis weighted prediction is performed, for example multi-hypothesis based motion estimation, multi-hypothesis based motion compensation, and multi-hypothesis based temporal filtering. The prediction accuracy of motion estimation, motion compensation, or temporal filtering can thus be significantly improved, and the filtering effect of MCTF can be improved. In particular, in complex and highly dynamic scenes, higher coding efficiency can be obtained, coding performance is significantly improved, and error propagation distortion caused by channel errors can be reduced.
Next, the video encoding method of the present disclosure, and the apparatus, electronic device, and non-transitory computer readable storage medium thereof will be described in detail with reference to fig. 4 to 10.
Fig. 4 is a flowchart illustrating a video encoding method according to an exemplary embodiment of the present disclosure.
Referring to fig. 4, in step 401, for each of at least one neighboring frame of a current frame, motion estimation may be performed on a current block in the current frame to search for a plurality of motion vectors from the neighboring frame, and motion compensation may be performed on the current block based on the plurality of motion vectors searched from the neighboring frame to obtain a motion compensation result, thereby obtaining a motion compensation result for each of the at least one neighboring frame. It should be noted that the number of motion compensation results corresponding to each adjacent frame may be the same or different, which is not particularly limited in the present disclosure.
The existing MCTF technique can be mainly decomposed into two parts, namely a "motion estimation compensation part" and a "filtering part":
For the "motion estimation" part, assume that the current frame is I_t, i.e., temporal filtering is to be performed on the current frame I_t. Block-level motion estimation is then performed on the current frame I_t using several frames before and after it, for example the four frames before it and the four frames after it. Let N = 8; the neighboring frames can then be denoted I_{t-4}, I_{t-3}, I_{t-2}, I_{t-1}, I_{t+1}, I_{t+2}, I_{t+3}, I_{t+4}. As shown in fig. 5, which is a schematic diagram illustrating motion estimation and motion compensation according to an exemplary embodiment of the present disclosure, the four frames before the current frame I_t are I_{t-4}, I_{t-3}, I_{t-2}, I_{t-1}, and the four frames after the current frame I_t are I_{t+1}, I_{t+2}, I_{t+3}, I_{t+4}. In this way, for each M×M (e.g., 8×8) block in the current frame I_t, the most similar block within each of the eight neighboring frames can be searched, and the position offset of the similar block relative to the current block can be represented by a motion vector (MV). For example, to accelerate the motion estimation process, a pyramid motion estimation form may be used, i.e., a fast estimation may first be performed on a downsampled image, and motion estimation (ME) at the original resolution may then be performed based on the downsampled ME result. Specifically, at the original resolution, integer-pixel ME may be performed first, then sub-pixel ME may be performed based on the integer-pixel result, and finally one MV corresponding to each block is obtained.
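The pyramid form described above can be sketched as follows, under the assumption of a two-level pyramid with simple 2x2 averaging for downsampling and a frame of at least 16x16 samples; sub-pixel refinement, which would follow the integer-pixel stage, is omitted. The sketch reuses BLOCK, sad() and estimate_mv() from the earlier illustration.

```python
def downsample2(img):
    # 2x2 mean pooling; an actual encoder would use a dedicated downsampling filter.
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    x = img[:h, :w].astype(np.float64)
    return (x[0::2, 0::2] + x[1::2, 0::2] + x[0::2, 1::2] + x[1::2, 1::2]) / 4.0

def pyramid_mv(cur, ref, by, bx, search=4):
    """Two-level hierarchical motion estimation for the block at (by, bx):
    a coarse integer-pixel search on downsampled frames, then a local refinement
    at the original resolution around the upscaled coarse vector."""
    cur2, ref2 = downsample2(cur), downsample2(ref)
    ch, cw = cur2.shape
    # Clamp so that an 8x8 block still fits inside the half-resolution frame.
    cby = max(0, min(by // 2, ch - BLOCK))
    cbx = max(0, min(bx // 2, cw - BLOCK))
    dy, dx = estimate_mv(cur2, ref2, cby, cbx, search)
    base_dy, base_dx = 2 * dy, 2 * dx

    h, w = ref.shape
    block = cur[by:by + BLOCK, bx:bx + BLOCK]
    best_mv, best_cost = (base_dy, base_dx), float("inf")
    for rdy in range(-search, search + 1):
        for rdx in range(-search, search + 1):
            y, x = by + base_dy + rdy, bx + base_dx + rdx
            if 0 <= y <= h - BLOCK and 0 <= x <= w - BLOCK:
                cost = sad(block, ref[y:y + BLOCK, x:x + BLOCK])
                if cost < best_cost:
                    best_cost, best_mv = cost, (base_dy + rdy, base_dx + rdx)
    return best_mv
```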
For the "motion compensation" part, for the current 8×8 block, classical block-level motion compensation can be performed directly in each neighboring frame using the corresponding searched MV, so as to obtain the motion compensated image block corresponding to that neighboring frame. Performing motion compensation on the eight neighboring frames respectively thus yields the eight best-matching image blocks, i.e., eight motion compensation blocks.
For the "filtering" part, each block of the current frame and the plurality of motion compensation blocks corresponding to the neighboring frames are bilaterally filtered in certain proportions to generate the final temporal filtering result.
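In simplified form, and abstracting the exact weight derivation (which in VTM-style MCTF depends on the quantization parameter, the temporal distance, and the local block matching error) into per-sample weights w_i(x), the filtering part can be written as:

```latex
\tilde{I}_t(x) = \frac{I_t(x) + \sum_{i=1}^{N} w_i(x)\,\hat{I}_i(x)}{1 + \sum_{i=1}^{N} w_i(x)}
```

where I_t(x) is the original sample of the current frame, \hat{I}_i(x) is the co-located sample of the motion compensated block from the i-th neighboring frame, and N is the number of neighboring frames; the exact weight functions are defined by the MCTF implementation and are not reproduced here.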
However, in the above existing MCTF technique, using only single-hypothesis compensation per frame may result in low prediction quality, which leads to a poorer temporal filtering effect and lower coding efficiency. To solve this problem, the present disclosure uses the multi-hypothesis motion compensation concept and proposes an encoding temporal filtering method based on multi-hypothesis weighted prediction: when performing motion compensation based temporal filtering (MCTF), not only one motion vector (MV) is searched for each neighboring frame; a plurality of MVs are allowed to be searched, and temporal filtering with multi-hypothesis weighted prediction is performed based on the searched MVs.
Various implementations of searching a plurality of motion vectors (MVs) from a neighboring frame of the current frame will be described below.
According to an exemplary embodiment of the present disclosure, a plurality of motion vector candidates may be searched from the neighboring frame by the same motion vector search method. A predetermined number of optimal motion vector candidates may then be selected from the plurality of motion vector candidates according to a predetermined rule as the plurality of motion vectors corresponding to the neighboring frame. The number of motion vector candidates may be set based on actual conditions, for example to 2, 3, 5, etc., and the present disclosure does not specifically limit this.
For example, several MV candidates with lower distortion or rate-distortion cost may be searched from the neighboring frame by the same motion vector search method, and a predetermined number of optimal motion vector candidates may then be selected from the searched MV candidates as the plurality of motion vectors corresponding to the neighboring frame.
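A minimal sketch of this selection step is shown below, assuming SAD is the distortion measure of the predetermined rule and keep is the predetermined number; it reuses BLOCK and sad() from the earlier illustration.

```python
def select_best_mvs(candidates, cur_block, ref, by, bx, keep=2):
    """Keep the `keep` MV candidates with the lowest SAD for the current block."""
    scored = []
    for dy, dx in candidates:
        y, x = by + dy, bx + dx
        if 0 <= y <= ref.shape[0] - BLOCK and 0 <= x <= ref.shape[1] - BLOCK:
            scored.append((sad(cur_block, ref[y:y + BLOCK, x:x + BLOCK]), (dy, dx)))
    scored.sort(key=lambda item: item[0])
    return [mv for _, mv in scored[:keep]]
```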
Therefore, by searching a plurality of motion vectors, the motion characteristics of the current block can be described comprehensively and accurately, avoiding the poor motion compensation effect that may result from motion compensation based on a single motion vector.
According to an exemplary embodiment of the present disclosure, a plurality of motion vector candidates may also be searched from the neighboring frame by different motion vector search methods to determine the plurality of motion vectors corresponding to the neighboring frame. For example, similar to the bi-prediction method in inter prediction, other MVs may be searched based on the searched optimal MV to optimize the weighted prediction result of motion compensation; or the optimal MV may be searched at multiple pyramid levels of MCTF; or a plurality of MV candidates may be searched using a motion estimation (ME) method based on overlapping blocks, so that several MV candidates with superior characteristics can be selected from the searched MV candidates as the plurality of motion vectors of the neighboring frame, and so on. The present disclosure does not specifically limit the motion vector search method; the foregoing embodiments are merely exemplary illustrations.
Therefore, since each motion vector search method has its own search emphasis, searching motion vectors with a plurality of different search methods yields a plurality of motion vectors with different characteristics, which can fully and comprehensively represent the motion characteristics of the image block to obtain a better motion compensation effect, thereby further improving the temporal filtering effect and coding performance.
According to an exemplary embodiment of the present disclosure, a first motion vector candidate may further be searched from the neighboring frame by a first motion vector search method, and at least one second motion vector candidate may be searched from the neighboring frame by at least one second motion vector search method. Then, a predetermined number of optimal second motion vector candidates may be selected from the at least one second motion vector candidate according to a predetermined rule. Next, the first motion vector candidate and the predetermined number of optimal second motion vector candidates may be determined as the plurality of motion vectors.
For example, a first motion vector search method may be used to search an optimal motion vector candidate from the neighboring frame, and motion vector search methods other than the first motion vector search method may be used to obtain n sub-optimal MV candidates for the neighboring frame. Next, the distortion (e.g., error or SSIM) of each of the n sub-optimal MV candidates may be calculated, and the distortion difference between the distortion of each sub-optimal MV candidate and the distortion of the aforementioned optimal motion vector candidate may be calculated. A preset number of sub-optimal MV candidates with smaller distortion differences can then be selected from the n sub-optimal MV candidates and retained. When n = 1, it is determined whether the distortion difference between the single sub-optimal MV candidate and the optimal motion vector candidate is smaller than a preset threshold. If the distortion difference is smaller than the preset threshold, the sub-optimal MV candidate can be retained; otherwise, it can be discarded, i.e., only the optimal motion vector candidate is used for motion compensation in that case.
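A sketch of this retention rule follows; the SAD distortion measure and the threshold value are illustrative assumptions, and BLOCK and sad() come from the earlier illustration.

```python
def retain_suboptimal_mvs(best_mv, suboptimal_mvs, cur_block, ref, by, bx,
                          keep=1, max_diff=64.0):
    """Keep up to `keep` sub-optimal MV candidates whose distortion difference
    from the optimal candidate is below `max_diff`; otherwise fall back to the
    optimal MV alone."""
    def cost(mv):
        dy, dx = mv
        y, x = by + dy, bx + dx
        if not (0 <= y <= ref.shape[0] - BLOCK and 0 <= x <= ref.shape[1] - BLOCK):
            return float("inf")
        return sad(cur_block, ref[y:y + BLOCK, x:x + BLOCK])

    best_cost = cost(best_mv)
    diffs = sorted((cost(mv) - best_cost, mv) for mv in suboptimal_mvs)
    retained = [mv for diff, mv in diffs[:keep] if diff < max_diff]
    return [best_mv] + retained
```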
Therefore, through the combined application of a plurality of different motion vector search methods, the motion characteristics of the image block can be fully and accurately represented so as to obtain a better motion compensation effect, and the time domain filtering effect and the coding performance can thus be further improved.
It should be noted that, in the present disclosure, in order to accelerate the motion estimation process, a pyramid form of motion estimation may be adopted: a fast estimation may first be performed on a downsampled image, and the ME at the original resolution may then be performed based on the downsampled ME result. Specifically, at the original resolution, integer-pixel ME may be performed first, then sub-pixel ME may be performed based on the integer-pixel result, and finally a plurality of MVs corresponding to each adjacent frame may be obtained.
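As a minimal sketch of this pyramid flow, the following Python code performs a coarse search on a 2x-decimated image and then an integer-pixel refinement at the original resolution around the scaled coarse MV. The single downsampling level, SAD cost, and brute-force search are assumptions of the sketch; sub-pixel refinement on an interpolated reference would follow the same pattern.

import numpy as np

def sad(a, b):
    return float(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def full_search(cur_block, ref, x, y, search_range):
    # Exhaustive integer-pixel search around column x / row y, within +/- search_range.
    h, w = cur_block.shape
    best, best_cost = (0, 0), float("inf")
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            yy, xx = y + dy, x + dx
            if 0 <= yy and 0 <= xx and yy + h <= ref.shape[0] and xx + w <= ref.shape[1]:
                cost = sad(cur_block, ref[yy:yy + h, xx:xx + w])
                if cost < best_cost:
                    best_cost, best = cost, (dx, dy)
    return best

def pyramid_me(cur_block, ref, x, y):
    # Coarse pass on 2x-decimated images (fast estimation on the downsampled image).
    dx_c, dy_c = full_search(cur_block[::2, ::2], ref[::2, ::2], x // 2, y // 2, search_range=8)
    # Integer-pixel refinement at the original resolution around the scaled coarse MV.
    dx, dy = full_search(cur_block, ref, x + 2 * dx_c, y + 2 * dy_c, search_range=2)
    return (2 * dx_c + dx, 2 * dy_c + dy)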
In step 402, the current block may be temporally filtered based on the motion compensation result for each of the at least one neighboring frame, resulting in a temporal filtering result of the current block. For example, as described above, assuming that the number of neighboring frames of the current frame is 8, the current block may be time-domain filtered based on a plurality of motion compensation blocks obtained for each of the 8 neighboring frames, to obtain a time-domain filtered block of the current block.
In step 403, the current block may be encoded based on the time domain filtering result of the current block.
The multi-hypothesis weighting techniques proposed by the present disclosure may be applied to any stage in MCTF. In the following, an implementation of the multi-hypothesis weighting techniques proposed by the present disclosure at various stages of MCTF (e.g., without limitation, motion estimation, motion compensation, temporal filtering) will be described by way of example.
Motion estimation based on multiple hypotheses
According to an exemplary embodiment of the present disclosure, a plurality of motion vectors searched from adjacent frames may be weighted and summed to obtain a final motion vector. The current block may then be motion compensated based on the final motion vector. The current block may then be temporally filtered based on the motion compensation result of the current block.
Fig. 6 is a schematic diagram illustrating the application of multi-hypothesis weighting techniques in the motion estimation stage according to an exemplary embodiment of the present disclosure. Referring to fig. 6, assuming that the current frame has N neighboring frames in total, a plurality of motion vectors MVs may be searched out correspondingly for each neighboring frame. Then, for each adjacent frame, the multiple motion vectors searched from that adjacent frame may be weighted and summed to obtain a final motion vector corresponding to that adjacent frame, so that the N adjacent frames yield N final motion vectors in one-to-one correspondence. Next, for each adjacent frame, motion compensation may be performed on the current block of the current frame based on the final motion vector corresponding to that adjacent frame, so as to obtain a motion compensation result corresponding to that adjacent frame, namely a motion compensation block, so that the N adjacent frames yield N motion compensation blocks in one-to-one correspondence. Then, the current block can be subjected to time domain filtering based on the N motion compensation blocks corresponding to the N adjacent frames, so as to obtain a time domain filtering result of the current block, namely a time domain filtering block.
In this way, in the present disclosure, a multi-vector hypothesis motion compensation prediction algorithm may be used, that is, multi-hypothesis weighted fusion may be performed in the motion estimation stage, so that the prediction accuracy of motion compensation may be significantly improved, and further, the filtering effect of MCTF may be improved.
According to an exemplary embodiment of the present disclosure, the weights for the weighted sum of motion vectors may be determined according to the prediction characteristics corresponding to each motion vector, or may be determined according to a machine learning or deep learning manner.
For example, the weight of each motion vector among the plurality of motion vectors corresponding to the same neighboring frame may be determined based on the distortion (e.g., error) or structural similarity (Structural Similarity, SSIM) of the motion vector. For example, the smaller the distortion of a motion vector, the higher the weight assigned to that motion vector may be.
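A minimal Python sketch of this motion-estimation-stage fusion is given below: the MVs searched from one neighboring frame are combined into a single final MV using normalized inverse-distortion weights. Inverse-distortion weighting is only one plausible reading of the "smaller distortion, higher weight" rule and is an assumption of the sketch.

def fuse_motion_vectors(mvs, distortions, eps=1e-6):
    # mvs: list of (dx, dy); distortions: matching per-MV distortions for one neighbor.
    weights = [1.0 / (d + eps) for d in distortions]   # smaller distortion -> larger weight
    total = sum(weights)
    weights = [w / total for w in weights]             # normalize so the weights sum to 1
    dx = sum(w * mv[0] for w, mv in zip(weights, mvs))
    dy = sum(w * mv[1] for w, mv in zip(weights, mvs))
    return (dx, dy)                                    # final MV for this neighboring frame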
Motion compensation based on multiple hypotheses
According to the exemplary embodiments of the present disclosure, the current block may be respectively motion-compensated based on a plurality of motion vectors searched from neighboring frames, resulting in a plurality of motion compensation results, i.e., a plurality of motion compensation blocks. The multiple motion compensation results may then be weighted and summed to obtain a final motion compensation result, i.e., a final motion compensation block. Next, the current block may be temporally filtered based on a final motion compensation result obtained for each of the at least one neighboring frame. Illustratively, as previously described, assuming that the number of neighboring frames is 8, the current block may be temporally filtered based on a final motion compensation block corresponding to each of the 8 neighboring frames.
Fig. 7 is a diagram illustrating multi-hypothesis weighted fusion in the motion compensation stage according to an exemplary embodiment of the present disclosure. Referring to fig. 7, assuming that there are N neighboring frames in total for the current frame, a plurality of motion vectors MVs may be searched out correspondingly for each neighboring frame, and thus each neighboring frame may obtain a plurality of motion compensation blocks. Then, for each adjacent frame, the motion compensation blocks corresponding to that adjacent frame may be weighted and fused to obtain a unique final motion compensation block corresponding to that adjacent frame. The weighted fusion of the motion compensation blocks is thus performed for each of the N adjacent frames, so that N final motion compensation blocks corresponding to the N adjacent frames one by one can be obtained. Then, the N final motion compensation blocks can be used to perform time domain filtering on the current block, so as to obtain a time domain filtering result of the current block, namely a time domain filtering block.
Specifically, first, motion compensation may be performed for each of the plurality of MVs (denoted as M MVs) of each adjacent frame, and a plurality of motion compensated prediction results (MCP_m) corresponding to the adjacent frame, that is, a plurality of motion compensation blocks, may be obtained. Then, the plurality of motion compensation results corresponding to each adjacent frame, namely the plurality of motion compensation blocks, may be weighted and fused to obtain the final motion compensation result MCP corresponding to that adjacent frame, namely the final motion compensation block:

MCP = Σ_{m=1}^{M} w_m · MCP_m

where w_m represents the weight of the m-th motion compensation block.
Then, in the subsequent time domain filtering stage, the final motion compensation result corresponding to each adjacent frame in the plurality of adjacent frames of the current frame can be utilized to perform time domain filtering on the current block of the current frame, so as to obtain the time domain filtering result of the current block.
Thus, in the present disclosure, a multi-hypothesis motion compensation refinement algorithm may be used, i.e., a multi-hypothesis weighted fusion may be performed during the motion compensation phase, and a more accurate prediction block may be generated for each neighboring frame for subsequent temporal filtering. The prediction precision of motion compensation can be remarkably improved, and the filtering effect of MCTF can be further improved.
According to an exemplary embodiment of the present disclosure, the weight for weighted summation may be determined according to a prediction characteristic corresponding to each motion compensation result, or may be determined according to a machine learning or deep learning manner.
For example, the weight corresponding to each motion compensation block of the plurality of motion compensation blocks corresponding to the same neighboring frame may be determined according to a distortion (e.g., error or SSIM) of the motion compensation block. For example, the smaller the distortion of a motion compensation block, the higher the weight of the motion compensation block may be.
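For illustration, a minimal Python sketch of the fusion MCP = Σ_m w_m · MCP_m is shown below, again with normalized inverse-distortion weights; the weighting rule and the NumPy block representation are assumptions of the sketch.

import numpy as np

def fuse_compensation_blocks(mc_blocks, distortions, eps=1e-6):
    # mc_blocks: list of equally sized arrays (MCP_m) from one neighboring frame.
    weights = np.array([1.0 / (d + eps) for d in distortions])
    weights /= weights.sum()                       # w_m, normalized to sum to 1
    fused = np.zeros_like(mc_blocks[0], dtype=np.float64)
    for w_m, mcp_m in zip(weights, mc_blocks):
        fused += w_m * mcp_m.astype(np.float64)    # MCP = sum_m w_m * MCP_m
    return fused                                   # final motion compensation block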
Time domain filtering based on multiple hypotheses
According to the exemplary embodiments of the present disclosure, the current block may also be motion-compensated based on each of the plurality of motion vectors searched from a neighboring frame, so as to obtain a plurality of motion compensation results, i.e., a plurality of motion compensation blocks. The current block may then be temporally filtered based on the plurality of motion compensation results obtained for each of the at least one neighboring frame. Illustratively, as described above, assuming that the number of neighboring frames of the current frame is 8, the current block may be temporally filtered based on a plurality of motion compensation blocks obtained for each of the 8 neighboring frames.
Fig. 8 is a diagram illustrating multi-hypothesis weighted fusion in the time-domain filtering stage according to an exemplary embodiment of the present disclosure. Referring to fig. 8, assuming that there are N neighboring frames in total for the current frame, each neighboring frame may correspondingly search out a plurality of motion vectors MVs, and thus each neighboring frame may obtain a plurality of motion compensation blocks. Then, in the time domain filtering stage, the time domain filtering is performed on the current block of the current frame by using a plurality of motion compensation blocks corresponding to each adjacent frame in a plurality of adjacent frames of the current frame, so as to obtain a time domain filtering result of the current block, namely a time domain filtering block.
Specifically, for each adjacent frame, classical motion compensation may be performed using the plurality of MVs (denoted as M MVs) corresponding to that adjacent frame, obtaining a plurality of motion compensated prediction results (MCP_m) in one-to-one correspondence with the MVs, that is, a plurality of motion compensation blocks. Then, the plurality of motion compensation blocks corresponding to each of the plurality of adjacent frames are used in the subsequent temporal filtering stage. For example, assuming that there are 8 neighboring frames in total for the current frame and each neighboring frame corresponds to 2 motion compensation blocks, the current block of the current frame may be temporally filtered based on 2×8=16 motion compensation blocks to obtain a temporally filtered block of the current block.
In the "temporal filtering portion", the related art performs temporal filtering using only N (or less than N) motion-compensated prediction blocks corresponding to N adjacent frames of the current frame. In the present disclosure, the number of motion compensation blocks corresponding to each adjacent frame may be extended from one to a plurality of motion compensation blocks, and then temporal filtering may be performed based on more than N motion compensation prediction blocks, to obtain a final filtering result of the current block, that is, a final filtering block.
In this way, in the present disclosure, a multi-hypothesis temporal filtering improvement algorithm may be used, that is, for N adjacent frames, more than N motion compensation prediction blocks may be generated to perform temporal weighted filtering, that is, multi-hypothesis weighted fusion may be performed in the temporal filtering stage, so that the prediction precision of motion compensation may be significantly improved, and further, the filtering effect of MCTF may be improved.
According to an exemplary embodiment of the present disclosure, the filtering weights for temporal filtering may be determined according to the prediction features corresponding to each motion compensation result.
For example, the weight corresponding to each motion compensation block of the plurality of motion compensation blocks corresponding to the same neighboring frame may be determined according to a distortion (e.g., error or SSIM) of the motion compensation block. For example, the smaller the distortion of a motion compensation block, the higher the weight of the motion compensation block may be.
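The temporal-filtering-stage variant can be sketched as one weighted average over all M×N compensated blocks together with the original block, as in the Python sketch below. The inverse-distortion weights, the weight given to the original block, and the 8-bit sample range are assumptions; an actual MCTF filter (e.g., with per-pixel adaptive weights) is more elaborate.

import numpy as np

def temporal_filter(orig_block, mc_blocks, distortions, orig_weight=1.0, eps=1e-6):
    # mc_blocks: flat list of all compensated blocks from all neighboring frames
    # (M per neighbor, N neighbors); distortions: matching per-block distortions.
    weights = [orig_weight] + [1.0 / (d + eps) for d in distortions]
    blocks = [orig_block] + list(mc_blocks)
    total = sum(weights)
    filtered = np.zeros_like(orig_block, dtype=np.float64)
    for w, block in zip(weights, blocks):
        filtered += (w / total) * block.astype(np.float64)
    # Round and clip to the 8-bit sample range assumed here.
    return np.clip(np.rint(filtered), 0, 255).astype(orig_block.dtype)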
It should be noted that the present disclosure provides three multi-hypothesis weighted time domain filtering methods, namely "multi-hypothesis weighted fusion in the motion estimation stage", "multi-hypothesis weighted fusion in the motion compensation stage", and "multi-hypothesis weighted fusion in the time domain filtering stage". In addition, multi-hypothesis weighted fusion of the multiple motion compensation results of each neighboring frame may also be performed in stages of MCTF other than the three stages described above, which is not particularly limited by the present disclosure; the foregoing embodiments are merely exemplary illustrations.
Fig. 9 is a block diagram illustrating a video encoding apparatus 900 according to an exemplary embodiment of the present disclosure.
Referring to fig. 9, the video encoding apparatus 900 may include a motion estimation module 901, a motion compensation module 902, a temporal filtering module 903, and an encoding module 904.
For each of at least one neighboring frame of the current frame, the motion estimation module 901 may perform motion estimation on a current block in the current frame to search for a plurality of motion vectors from the neighboring frame, and the motion compensation module 902 may perform motion compensation on the current block based on the plurality of motion vectors searched for from the neighboring frame, resulting in a motion compensation result, thereby resulting in a motion compensation result for each of the at least one neighboring frame.
According to an exemplary embodiment of the present disclosure, the motion estimation module 901 may search for a plurality of motion vector candidates from a neighboring frame by the same motion vector search method. Then, the motion estimation module 901 may select a predetermined number of optimal motion vector candidates from the plurality of motion vector candidates according to a predetermined rule as the plurality of motion vectors corresponding to the neighboring frame. The number of motion vector candidates may be set based on actual conditions, for example, to 2, 3, 5, etc., which is not particularly limited in the present disclosure.
Therefore, by searching a plurality of motion vectors, the motion characteristics of the current block can be described comprehensively and accurately, avoiding the poor motion compensation effect that can result from performing motion compensation based on a single motion vector.
According to an exemplary embodiment of the present disclosure, the motion estimation module 901 may also search a plurality of motion vector candidates from a neighboring frame through different motion vector search methods to determine the plurality of motion vectors corresponding to the neighboring frame. Since each motion vector search method has its own search emphasis, searching with a plurality of different motion vector search methods yields a plurality of motion vectors with different characteristics, which can fully and comprehensively represent the motion characteristics of the image block, thereby achieving a better motion compensation effect and further improving the time domain filtering effect and the coding performance.
According to an exemplary embodiment of the present disclosure, the motion estimation module 901 may also search for a first motion vector candidate from a neighboring frame through a first motion vector search method, and may search for at least one second motion vector candidate from the neighboring frame through at least one second motion vector search method. Then, the motion estimation module 901 may select a predetermined number of optimal second motion vector candidates from the at least one second motion vector candidate according to a predetermined rule. Next, the motion estimation module 901 may determine the first motion vector candidate and the predetermined number of optimal second motion vector candidates as the plurality of motion vectors.
Therefore, through the combined application of a plurality of different motion vector search methods, the motion characteristics of the image block can be fully and accurately represented so as to obtain a better motion compensation effect, and the time domain filtering effect and the coding performance can thus be further improved.
It should be noted that, in the present disclosure, in addition to searching for a motion vector from an adjacent frame in the above manner, searching may be performed in other manners, which is not particularly limited in the present disclosure, and the foregoing motion vector searching manner is merely an exemplary illustration.
According to an exemplary embodiment of the present disclosure, the motion compensation module 902 may perform weighted summation on a plurality of motion vectors searched from adjacent frames to obtain a final motion vector. The motion compensation module 902 may then motion compensate the current block based on the final motion vector.
In this way, in the present disclosure, a multi-vector hypothesis motion compensation prediction algorithm may be used, that is, multi-hypothesis weighted fusion may be performed in the motion estimation stage, so that the prediction accuracy of motion compensation may be significantly improved, and further, the filtering effect of MCTF may be improved.
According to an exemplary embodiment of the present disclosure, the weights for the weighted sum of motion vectors may be determined according to the prediction characteristics corresponding to each motion vector, or may be determined according to a machine learning or deep learning manner.
According to an exemplary embodiment of the present disclosure, the motion compensation module 902 may perform motion compensation on the current block based on a plurality of motion vectors searched from adjacent frames, respectively, to obtain a plurality of motion compensation results, i.e., a plurality of motion compensation blocks. The motion compensation module 902 may then perform weighted summation on the plurality of motion compensation results to obtain a final motion compensation result, i.e., a final motion compensation block. Next, the temporal filtering module 903 may temporally filter the current block based on the final motion compensation result obtained for each of the at least one neighboring frame. Illustratively, as previously described, assuming that the number of neighboring frames is 8, the current block may be temporally filtered based on a final motion compensation block corresponding to each of the 8 neighboring frames.
Thus, in the present disclosure, a multi-hypothesis motion compensation refinement algorithm may be used, i.e., a multi-hypothesis weighted fusion may be performed during the motion compensation phase, and a more accurate prediction block may be generated for each neighboring frame for subsequent temporal filtering. The prediction precision of motion compensation can be remarkably improved, and the filtering effect of MCTF can be further improved.
According to an exemplary embodiment of the present disclosure, the weight for weighted summation may be determined according to a prediction characteristic corresponding to each motion compensation result, or may be determined according to a machine learning or deep learning manner.
According to an exemplary embodiment of the present disclosure, the motion compensation module 902 may also perform motion compensation on the current block based on each of the plurality of motion vectors searched from a neighboring frame, so as to obtain a plurality of motion compensation results, i.e., a plurality of motion compensation blocks. The temporal filtering module 903 may then temporally filter the current block based on the plurality of motion compensation results obtained for each of the at least one neighboring frame. Illustratively, as described above, assuming that the number of neighboring frames of the current frame is 8, the current block may be temporally filtered based on a plurality of motion compensation blocks obtained for each of the 8 neighboring frames.
In this way, in the present disclosure, a multi-hypothesis temporal filtering improvement algorithm may be used, that is, for N adjacent frames, more than N motion compensation prediction blocks may be generated to perform temporal weighted filtering, that is, multi-hypothesis weighted fusion may be performed in the temporal filtering stage, so that the prediction precision of motion compensation may be significantly improved, and further, the filtering effect of MCTF may be improved.
According to an exemplary embodiment of the present disclosure, the filtering weights for temporal filtering may be determined according to the prediction features corresponding to each motion compensation result.
It should be noted that the present disclosure provides three multi-hypothesis weighted time domain filtering methods, namely "multi-hypothesis weighted fusion in the motion estimation stage", "multi-hypothesis weighted fusion in the motion compensation stage", and "multi-hypothesis weighted fusion in the time domain filtering stage". In addition, multi-hypothesis weighted fusion of the multiple motion compensation results of each neighboring frame may also be performed in stages of MCTF other than the three stages described above, which is not particularly limited by the present disclosure; the foregoing embodiments are merely exemplary illustrations.
The temporal filtering module 903 may perform temporal filtering on the current block based on the motion compensation result for each of the at least one neighboring frame to obtain a temporal filtering result of the current block. For example, as described above, assuming that the number of adjacent frames of the current frame is 8, the temporal filtering module 903 may perform temporal filtering on the current block based on a plurality of motion compensation blocks obtained by each of the 8 adjacent frames to obtain a temporal filtered block of the current block.
The encoding module 904 may encode the current block based on a time-domain filtering result of the current block.
Fig. 10 illustrates a computing environment 1010 coupled to a user interface 1050. The computing environment 1010 may be part of a data processing server. The computing environment 1010 includes a processor 1020, a memory 1030, and an input/output (I/O) interface 1040.
The processor 1020 generally controls overall operation of the computing environment 1010, such as operations associated with display, data acquisition, data communication, and image processing. The processor 1020 may include one or more processors for executing instructions to perform all or some of the steps of the methods described above. Further, the processor 1020 may include one or more modules that facilitate interaction between the processor 1020 and other components. The processor may be a Central Processing Unit (CPU), a microprocessor, a single-chip microcomputer, a Graphics Processing Unit (GPU), or the like.
Memory 1030 is configured to store various types of data to support the operation of computing environment 1010. Examples of such data include instructions for any application or method operating on computing environment 1010, video data sets, image data, and the like. Memory 1030 may include predetermined software 1032. The memory 1030 may be implemented using any type of volatile or non-volatile memory device, or a combination thereof, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Erasable Programmable Read-Only Memory (EPROM), Programmable Read-Only Memory (PROM), Read-Only Memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
I/O interface 1040 provides an interface between processor 1020 and peripheral interface modules (e.g., keyboard, click wheel, buttons, etc.). Buttons may include, but are not limited to, a home button, a start scan button, and a stop scan button. I/O interface 1040 may be coupled with an encoder and a decoder.
In an embodiment, there is also provided a non-transitory computer readable storage medium comprising, for example, a plurality of programs in the memory 1030 executable by the processor 1020 in the computing environment 1010 for performing the above-described method and/or storing a bitstream generated by the above-described encoding method or a bitstream to be decoded by the above-described decoding method. In one example, the plurality of programs may be executed by the processor 1020 in the computing environment 1010 to receive a bitstream or data stream comprising encoded video information (e.g., representing video blocks of encoded video frames, and/or associated one or more syntax elements, etc.), for example, from the video encoder 20 in fig. 2, and may also be executed by the processor 1020 in the computing environment 1010 to perform the above-described decoding method in accordance with the received bitstream or data stream. In another example, the plurality of programs may be executed by the processor 1020 in the computing environment 1010 for performing the encoding methods described above to encode video information (e.g., video blocks representing video frames, and/or associated one or more syntax elements, etc.) into a bitstream or data stream, and may also be executed by the processor 1020 in the computing environment 1010 for transmitting the bitstream or data stream (e.g., to the video decoder 30 in fig. 3). Alternatively, a non-transitory computer readable storage medium may have stored therein a bitstream or data stream comprising encoded video information (e.g., video blocks representing encoded video frames, and/or associated one or more syntax elements, etc.) that is generated by an encoder (e.g., video encoder 20 of fig. 2) using, for example, the encoding methods described above, for use by a decoder (e.g., video decoder 30 of fig. 3) in decoding video data. The non-transitory computer readable storage medium may be, for example, ROM, random-access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
In an embodiment, a bitstream generated by the above-described encoding method or a bitstream to be decoded by the above-described decoding method is provided. In an embodiment, there is provided a bitstream including encoded video information generated by the above-described encoding method or encoded video information to be decoded by the above-described decoding method.
In an embodiment, a computing device is also provided that includes one or more processors (e.g., processor 1020) and a non-transitory computer-readable storage medium or memory 1030 having stored therein a plurality of programs executable by the one or more processors, wherein the one or more processors are configured to perform the above-described methods when executing the plurality of programs.
In an embodiment, there is also provided a computer program product having instructions for storing or transmitting a bitstream comprising encoded video information generated by the above-described encoding method or encoded video information to be decoded by the above-described decoding method. In an embodiment, a computer program product is also provided that includes a plurality of programs, e.g., in memory 1030, executable by processor 1020 in computing environment 1010 for performing the methods described above. For example, the computer program product may include a non-transitory computer readable storage medium.
In an embodiment, the computing environment 1010 may be implemented by one or more ASICs, DSPs, digital Signal Processing Devices (DSPDs), programmable Logic Devices (PLDs), FPGAs, GPUs, controllers, microcontrollers, microprocessors, or other electronic components for performing the methods described above.
In an embodiment there is also provided a method of storing a bitstream comprising storing said bitstream on a digital storage medium, wherein said bitstream comprises encoded video information generated by the above described encoding method or encoded video information to be decoded by the above described decoding method.
In an embodiment, there is also provided a method for transmitting a bitstream generated by the above encoder. In an embodiment, a method for receiving a bitstream to be decoded by the decoder described above is also provided.
The description of the present disclosure has been presented for purposes of illustration and is not intended to be exhaustive or limited to the disclosure. Many modifications, variations and alternative embodiments will come to mind to one skilled in the art having the benefit of the teachings presented in the foregoing descriptions and the associated drawings.
The order of steps of the method according to the present disclosure is intended to be illustrative only, unless specifically stated otherwise, and the steps of the method according to the present disclosure are not limited to the above-described order, but may be changed according to actual circumstances. Furthermore, at least one of the steps of the method according to the present disclosure may be adjusted, combined or pruned as actually needed.
The examples were chosen and described in order to explain the principles of the present disclosure and to enable others skilled in the art to understand the disclosure for various embodiments and with various modifications as are suited to the particular use contemplated. Therefore, it is to be understood that the scope of the disclosure is not to be limited to the specific examples of the disclosed embodiments, and that modifications and other embodiments are intended to be included within the scope of the disclosure.