CN119364016A - Method and apparatus for encoding and decoding video by means of inter-frame prediction - Google Patents
Method and apparatus for encoding and decoding video by means of inter-frame prediction
- Publication number
- CN119364016A (application No. CN202411528448.0A)
- Authority
- CN
- China
- Prior art keywords
- block
- prediction
- target block
- sub
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 78
- 230000033001 locomotion Effects 0.000 claims abstract description 283
- 239000013598 vector Substances 0.000 claims abstract description 198
- 230000003287 optical effect Effects 0.000 claims abstract description 67
- 230000002457 bidirectional effect Effects 0.000 claims abstract description 29
- 238000005111 flow chemistry technique Methods 0.000 claims abstract description 7
- 241000023320 Luma 0.000 claims description 7
- 238000012935 Averaging Methods 0.000 claims description 4
- 238000005286 illumination Methods 0.000 description 19
- 238000005192 partition Methods 0.000 description 19
- 238000001914 filtration Methods 0.000 description 15
- 230000006870 function Effects 0.000 description 14
- 238000010586 diagram Methods 0.000 description 12
- 238000013139 quantization Methods 0.000 description 11
- 230000006835 compression Effects 0.000 description 10
- 238000007906 compression Methods 0.000 description 10
- 239000000284 extract Substances 0.000 description 10
- 230000003044 adaptive effect Effects 0.000 description 8
- 230000011218 segmentation Effects 0.000 description 7
- 230000009466 transformation Effects 0.000 description 7
- 238000005516 engineering process Methods 0.000 description 5
- 238000006073 displacement reaction Methods 0.000 description 4
- 230000008569 process Effects 0.000 description 4
- 238000012545 processing Methods 0.000 description 4
- 230000008859 change Effects 0.000 description 3
- 230000008707 rearrangement Effects 0.000 description 3
- 238000013461 design Methods 0.000 description 2
- 239000011159 matrix material Substances 0.000 description 2
- 230000004048 modification Effects 0.000 description 2
- 238000012986 modification Methods 0.000 description 2
- 239000000047 product Substances 0.000 description 2
- 238000000638 solvent extraction Methods 0.000 description 2
- 230000001154 acute effect Effects 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 230000005540 biological transmission Effects 0.000 description 1
- 230000000903 blocking effect Effects 0.000 description 1
- 238000004364 calculation method Methods 0.000 description 1
- 239000000470 constituent Substances 0.000 description 1
- 230000001934 delay Effects 0.000 description 1
- 238000009795 derivation Methods 0.000 description 1
- 238000003709 image segmentation Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 230000009467 reduction Effects 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000011664 signaling Effects 0.000 description 1
- 239000007787 solid Substances 0.000 description 1
- 239000013589 supplement Substances 0.000 description 1
- 230000001502 supplementing effect Effects 0.000 description 1
- 238000000844 transformation Methods 0.000 description 1
- 230000001131 transforming effect Effects 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/103—Selection of coding mode or of prediction mode
- H04N19/105—Selection of the reference unit for prediction within a chosen coding or prediction mode, e.g. adaptive choice of position and number of pixels used for prediction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/103—Selection of coding mode or of prediction mode
- H04N19/11—Selection of coding mode or of prediction mode among a plurality of spatial predictive coding modes
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/176—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a block, e.g. a macroblock
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/182—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a pixel
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/186—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/513—Processing of motion vectors
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/523—Motion estimation or motion compensation with sub-pixel accuracy
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/527—Global motion vector estimation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/503—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
- H04N19/51—Motion estimation or motion compensation
- H04N19/573—Motion compensation with multiple frame prediction using two or more reference frames in a given prediction direction
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/593—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial prediction techniques
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/70—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The present invention relates to methods and apparatus for encoding and decoding video by means of inter prediction. An image decoding apparatus for predicting a target block in a current image to be decoded is provided. The image decoding apparatus includes a prediction unit that determines a first reference picture and a second reference picture for bi-prediction and a first motion vector and a second motion vector by decoding a bitstream, generates a first reference block from the first reference picture referred to by the first motion vector, generates a second reference block from the second reference picture referred to by the second motion vector, and predicts a target block by means of the first reference block and the second reference block. The prediction unit includes a first encoding tool for generating a prediction block of a target block by performing bidirectional optical flow processing by means of a first reference block and a second reference block.
Description
This application is a divisional application of Chinese patent application No. 202080070159.4, a PCT application that entered the Chinese national phase with a filing date of September 24, 2020, entitled "Method and apparatus for encoding and decoding video by means of inter-frame prediction".
Cross Reference to Related Applications
The present application claims priority from Korean patent application No. 10-2019-0123491 filed in October 2019 and Korean patent application No. 10-2019-010158564 filed in December 2019, the entire contents of which are incorporated herein by reference.
Technical Field
The present invention relates to encoding and decoding of video, and more particularly, to an encoding tool for improving compression performance of inter prediction.
Background
Since the amount of video data is larger than the amount of voice data or still image data, a large amount of hardware resources (including memory) are required to store or transmit video data without performing compression processing.
Accordingly, when video data is stored or transmitted, an encoder is generally used to compress the video data for storage or transmission. A decoder then receives the compressed video data, and decompresses and reproduces the video data. Compression techniques for such video include H.264/AVC and High Efficiency Video Coding (HEVC), which improves coding efficiency by about 40% over H.264/AVC.
However, the image size, resolution and frame rate are gradually increasing, and accordingly, the amount of data to be encoded is also increasing. Therefore, a new compression technique having better coding efficiency and higher image quality than the existing compression technique is required.
In video coding, predictive coding is mainly used to improve compression performance. There are intra prediction for predicting a target block to be encoded based on pre-reconstructed samples in a current picture and inter prediction for predicting the current block using a pre-reconstructed reference picture. In particular, inter prediction is widely used for video coding because it exhibits better compression performance than intra prediction.
The present invention proposes an encoding tool for improving the compression performance of existing inter-prediction.
Disclosure of Invention
The present invention provides an encoding tool for improving the compression performance of inter prediction, and in one aspect relates to an encoding tool capable of compensating for various motions of an object beyond block-wise translational motion.
According to an aspect of the present invention, there is provided a video decoding apparatus for predicting a target block in a current image to be decoded. The apparatus includes a predictor configured to determine a first reference picture and a second reference picture for bi-prediction and a first motion vector and a second motion vector by decoding a bitstream, generate a first reference block from the first reference picture referred to by the first motion vector and a second reference block from the second reference picture referred to by the second motion vector, and generate a prediction block of a target block using the first reference block and the second reference block. The predictor includes a first encoding tool configured to generate a predicted block of the target block by performing bidirectional optical flow processing using the first reference block and the second reference block. Herein, when the luminance weights assigned to each of the first and second reference pictures for predicting the luminance component of the target block are different from each other, the first encoding tool is not performed. Further, when the chroma weights assigned to each of the first reference picture and the second reference picture for predicting the chroma component of the target block are different from each other, the first encoding tool is not performed.
According to another aspect of the present invention, there is provided a video encoding apparatus for inter-predicting a target block in a current image to be encoded. The apparatus includes a predictor configured to determine a first motion vector and a second motion vector for bi-directional prediction, generate a first reference block from a first reference picture referenced by the first motion vector and a second reference block from a second reference picture referenced by the second motion vector, and generate a prediction block of the target block using the first reference block and the second reference block. The predictor includes a first encoding tool configured to generate a predicted block of the target block by performing bi-directional optical flow using the first reference block and the second reference block. Herein, when the luminance weights assigned to each of the first and second reference pictures for predicting the luminance component of the target block are different from each other, the first encoding tool is not performed. Also, when the chroma weights assigned to each of the first and second reference pictures for predicting the chroma component of the target block are different from each other, the first encoding tool is not performed.
According to another aspect of the present invention, a method for predicting a target block in a current image is provided. The method includes determining a first motion vector and a second motion vector for bi-directional prediction, generating a first reference block from a first reference picture referenced by the first motion vector and a second reference block from a second reference picture referenced by the second motion vector, and predicting the target block using the first reference block and the second reference block. Predicting the target block includes executing a first encoding tool configured to generate a predicted block of the target block by performing bi-directional optical flow processing with the first reference block and the second reference block. Herein, when the luminance weights assigned to each of the first and second reference pictures for predicting the luminance component of the target block are different from each other, the first encoding tool is not performed. Further, when the chroma weights assigned to each of the first reference picture and the second reference picture for predicting the chroma component of the target block are different from each other, the first encoding tool is not performed.
Drawings
Fig. 1 is an exemplary block diagram of a video encoding apparatus capable of implementing the techniques of the present invention.
Fig. 2 exemplarily shows a block partition structure using QTBTTT structures.
Fig. 3 exemplarily illustrates a plurality of intra prediction modes.
Fig. 4 exemplarily shows neighboring blocks around the current block.
Fig. 5 is an exemplary block diagram of a video decoding apparatus capable of implementing the techniques of this disclosure.
Fig. 6 is an exemplary diagram illustrating the concept of bi-predictive optical flow provided by the present invention.
FIG. 7 is an exemplary diagram illustrating a method of deriving gradients of block boundary samples in bi-directional optical flow.
Fig. 8a, 8b and 9 are exemplary diagrams illustrating affine motion prediction provided by the present invention.
Fig. 10 is an exemplary diagram showing a method of deriving a merge candidate for affine motion prediction from translational motion vectors of neighboring blocks.
Fig. 11a, 11b and 11c are exemplary diagrams illustrating a method of deriving illumination compensation parameters according to an embodiment of illumination compensation provided by the present invention.
Detailed Description
Hereinafter, some embodiments of the present invention will be described in detail with reference to the accompanying drawings. It should be noted that when reference numerals are added to constituent elements in the respective drawings, the same reference numerals denote the same elements even though the elements are shown in different drawings. Furthermore, in the following description of the present invention, a detailed description of known functions and configurations incorporated herein will be omitted to avoid obscuring the subject matter of the present invention.
Fig. 1 is an exemplary block diagram of a video encoding apparatus capable of implementing the techniques of the present invention. Hereinafter, a video encoding apparatus and elements of the apparatus will be described with reference to fig. 1.
The video encoding apparatus includes a block divider 110, a predictor 120, a subtractor 130, a transformer 140, a quantizer 145, a reordering unit 150, an entropy encoder 155, an inverse quantizer 160, an inverse transformer 165, an adder 170, a loop filtering unit 180, and a memory 190.
Each element of the video encoding apparatus may be implemented in hardware or software, or a combination of hardware and software. The functions of the respective elements may be implemented as software, and the microprocessor may be implemented to perform the software functions corresponding to the respective elements.
A video is made up of one or more sequences comprising a plurality of images. Each image is divided into a plurality of regions, and encoding is performed on each region. For example, an image is segmented into one or more tiles and/or slices. Here, one or more tiles may be defined as a tile group. Each tile or slice is partitioned into one or more Coding Tree Units (CTUs). Each CTU is partitioned into one or more Coding Units (CUs) by a tree structure. Information applied to each CU is encoded as a syntax of the CU, and information commonly applied to CUs included in one CTU is encoded as a syntax of the CTU. Further, information commonly applied to all blocks in one slice is encoded as syntax of a slice header, and information applied to all blocks constituting one or more images is encoded in a picture parameter set (PPS) or a picture header. Furthermore, information commonly referred to by a sequence composed of a plurality of images is encoded in a sequence parameter set (SPS). Furthermore, information commonly applied to one tile or group of tiles may be encoded as syntax of a tile header or a tile group header. The syntax included in the SPS, PPS, slice header, tile header, or tile group header may be referred to as high level syntax.
Each of the plurality of images may be divided into a plurality of sub-images that can be independently encoded/decoded and/or independently displayed. When sub-image segmentation is applied, information about the layout of sub-images in the image is signaled.
The block divider 110 determines the size of a Coding Tree Unit (CTU). Information about the size of the CTU (CTU size) is encoded as a syntax of the SPS or PPS and transmitted to the video decoding apparatus.
The block divider 110 divides each image constituting a video into a plurality of CTUs having a predetermined size, and then recursively divides the CTUs using a tree structure. In the tree structure, leaf nodes are used as Coding Units (CUs), which are the basic units of coding.
The tree structure may be a QuadTree (QT), in which a node (or parent node) is partitioned into four lower-layer nodes (or child nodes) of the same size; a BinaryTree (BT), in which a node is partitioned into two lower-layer nodes; a TernaryTree (TT), in which a node is partitioned into three lower-layer nodes at a ratio of 1:2:1; or a structure formed by a combination of two or more of the QT, BT, and TT structures. For example, a QuadTree plus BinaryTree (QTBT) structure may be used, or a QuadTree plus BinaryTree TernaryTree (QTBTTT) structure may be used. Here, BT and TT may be collectively referred to as a multiple-type tree (MTT).
Fig. 2 exemplarily shows a QTBTTT split tree structure. As shown in fig. 2, the CTU may first be partitioned into a QT structure. QT splitting may be repeated until the size of the split blocks reaches the minimum block size (MinQTSize) of the leaf nodes allowed in QT. A first flag (qt_split_flag) indicating whether each node of the QT structure is partitioned into four lower-layer nodes is encoded by the entropy encoder 155 and signaled to the video decoding apparatus. When the leaf node of QT is not greater than the maximum block size (MaxBTSize) of the root node allowed in BT, it may be further partitioned into one or more BT structures or TT structures. The BT structure and/or the TT structure may have a plurality of splitting directions. For example, there may be two directions, i.e., a direction of splitting the block of a node horizontally and a direction of splitting the block vertically. As shown in fig. 2, when the MTT splitting starts, a second flag (mtt_split_flag) indicating whether a node is split, a flag indicating the splitting direction (vertical or horizontal) in the case of splitting, and/or a flag indicating the splitting type (binary or ternary) are encoded by the entropy encoder 155 and signaled to the video decoding apparatus. Alternatively, a CU split flag (split_cu_flag) indicating whether a node is split may be encoded before encoding the first flag (qt_split_flag) indicating whether each node is split into four nodes of a lower layer. When the value of the CU split flag (split_cu_flag) indicates that splitting is not performed, the block of the node becomes a leaf node in the split tree structure and serves as a coding unit (CU), which is a basic unit of coding. When the value of the CU split flag (split_cu_flag) indicates that splitting is performed, the video encoding apparatus starts encoding the flags from the first flag in the above-described manner.
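To illustrate the signaling order described above, the following sketch parses the split flags of a single QTBTTT node. The bitstream reader read_flag, the size thresholds, and the simplified split conditions are illustrative assumptions and do not reproduce the exact normative syntax.

```python
def parse_qtbttt_node(read_flag, w, h, min_qt_size=16, max_mtt_size=64):
    """Parse the split flags of one QTBTTT tree node (simplified sketch).

    read_flag() is assumed to return the next flag (0 or 1) from the bitstream.
    Returns a nested dict describing how the node is split.
    """
    # qt_split_flag: tested while the node is still larger than the minimum QT leaf size.
    if w > min_qt_size and h > min_qt_size and read_flag():
        return {"split": "QT",
                "children": [parse_qtbttt_node(read_flag, w // 2, h // 2,
                                               min_qt_size, max_mtt_size)
                             for _ in range(4)]}

    # mtt_split_flag: BT/TT splitting of a QT leaf, allowed only up to a maximum size.
    if max(w, h) <= max_mtt_size and read_flag():
        vertical = bool(read_flag())          # split-direction flag
        binary = bool(read_flag())            # split-type flag (binary vs. ternary)
        if binary:
            sizes = [(w // 2, h), (w // 2, h)] if vertical else [(w, h // 2), (w, h // 2)]
        else:                                 # ternary split at a 1:2:1 ratio
            sizes = ([(w // 4, h), (w // 2, h), (w // 4, h)] if vertical
                     else [(w, h // 4), (w, h // 2), (w, h // 4)])
        return {"split": ("BT" if binary else "TT") + ("_VER" if vertical else "_HOR"),
                "children": [parse_qtbttt_node(read_flag, cw, ch, min_qt_size, max_mtt_size)
                             for cw, ch in sizes]}

    return {"split": None, "cu_size": (w, h)}  # the node becomes a CU

# Example: a 128x128 CTU split once by QT; its first 64x64 child is then split
# by a vertical binary split, and all remaining nodes become CUs.
flags = iter([1, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
print(parse_qtbttt_node(flags.__next__, 128, 128))
```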
When QTBT is utilized as another example of the tree structure, there may be two types of division, i.e., a type of dividing a block horizontally into two blocks of the same size (i.e., symmetrical horizontal division) and a type of dividing a block vertically into two blocks of the same size (i.e., symmetrical vertical division). A partition flag (split_flag) indicating whether each node of the BT structure is partitioned into lower-layer blocks and partition type information indicating a partition type are encoded by the entropy encoder 155 and transmitted to the video decoding apparatus. There may be additional types of partitioning a block of nodes into two asymmetric blocks. The asymmetric division type may include a type of dividing a block into two rectangular blocks at a size ratio of 1:3, or a type of dividing a block of a node diagonally.
The CUs may have various sizes according to QTBT or QTBTTT partitions of the CTU. Hereinafter, a block corresponding to a CU to be encoded or decoded (i.e., a leaf node of QTBTTT) is referred to as a "current block". When QTBTTT partitions are employed, the shape of the current block may be square or rectangular.
The predictor 120 predicts the current block to generate a predicted block. Predictor 120 includes an intra predictor 122 and an inter predictor 124.
The intra predictor 122 predicts samples in the current block using samples (reference samples) located around the current block in the current image including the current block. Depending on the prediction direction, there are multiple intra prediction modes. For example, as shown in fig. 3, the plurality of intra prediction modes may include 2 non-directional modes and 65 directional modes, and the 2 non-directional modes include a planar (planar) mode and a Direct Current (DC) mode. The neighboring samples and equations to be used are defined differently for each prediction mode.
The intra predictor 122 may determine an intra prediction mode to be used when encoding the current block. In some examples, intra predictor 122 may encode the current block with several intra prediction modes and select an appropriate intra prediction mode to use from the tested modes. For example, the intra predictor 122 may calculate a rate distortion value using rate-distortion (rate-distortion) analysis of several tested intra prediction modes, and may select an intra prediction mode having the best rate distortion characteristics among the tested modes.
The intra predictor 122 selects one intra prediction mode from among a plurality of intra prediction modes, and predicts the current block using neighboring pixels (reference pixels) determined according to the selected intra prediction mode and an equation. Information about the selected intra prediction mode is encoded by the entropy encoder 155 and transmitted to a video decoding device.
The inter predictor 124 generates a prediction block of the current block through motion compensation. The inter predictor 124 searches for a block most similar to the current block in a reference picture that has been encoded and decoded earlier than the current picture, and generates a prediction block of the current block using the searched block. Then, the inter predictor generates a motion vector corresponding to the displacement between the current block in the current image and the prediction block in the reference image. In general, motion estimation is performed on the luminance (luma) component, and a motion vector calculated based on the luminance component is used for both the luminance component and the chrominance component. Motion information including information on the reference picture and information on the motion vector used for predicting the current block is encoded by the entropy encoder 155 and transmitted to the video decoding device.
The inter predictor 124 may perform interpolation on the reference image or the reference block to increase prediction accuracy. In other words, the sub-samples are interpolated between two consecutive integer samples by applying the filter coefficients to a plurality of consecutive integer samples comprising the two integer samples. When an operation of searching for a block most similar to the current block is performed on the interpolated reference image, the motion vector may be expressed in the precision level of a fractional sample unit instead of the precision level of an integer sample unit. The precision or resolution of the motion vector may be set differently for each target region to be encoded, e.g., for each unit such as a slice, tile, CTU or CU. When such adaptive motion vector resolution is applied, information on the motion vector resolution to be applied to each target area should be signaled for each target area. For example, when the target area is a CU, information about the resolution of a motion vector applied to each CU is signaled.
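As a concrete illustration of the interpolation described above, the sketch below derives a half-sample value from consecutive integer samples with a one-dimensional FIR filter. The 6-tap coefficients (1, -5, 20, 20, -5, 1)/32 are the classic half-sample luma filter and are used here only as an example, not necessarily the filter of this codec.

```python
import numpy as np

def interpolate_half_sample(samples, pos):
    """Interpolate the half-sample value between integer positions pos and pos+1.

    samples: 1-D array of integer-position reference samples (8-bit range assumed).
    Uses the illustrative 6-tap filter (1, -5, 20, 20, -5, 1) / 32.
    """
    taps = np.array([1, -5, 20, 20, -5, 1])
    window = samples[pos - 2: pos + 4]            # six consecutive integer samples
    value = int(np.dot(taps, window) + 16) >> 5   # +16 for rounding, >>5 divides by 32
    return int(np.clip(value, 0, 255))            # clip to the 8-bit sample range

# Example: half-sample value between samples[4] (26) and samples[5] (30)
samples = np.array([10, 12, 15, 20, 26, 30, 33, 35, 36, 38])
print(interpolate_half_sample(samples, 4))        # 28
```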
The inter predictor 124 may perform inter prediction using bi-directional prediction. In bi-prediction, the inter predictor 124 uses two reference pictures and two motion vectors representing block positions most similar to the current block in the respective reference pictures. The inter predictor 124 selects a first reference picture and a second reference picture from the reference picture list0 (RefPicList 0) and the reference picture list1 (RefPicList 1), searches for a block similar to the current block in the respective reference pictures, and generates a first reference block and a second reference block, respectively. Then, the inter predictor 124 generates a prediction block for the current block by averaging the first reference block and the second reference block. Then, the inter predictor 124 transmits motion information including information about two reference pictures and two motion vectors for predicting the current block to the entropy encoder 155. Here, refPicList0 may be composed of images preceding the current image in display order in the reconstructed image, and RefPicList1 may be composed of images following the current image in display order in the reconstructed image. However, the embodiment is not limited thereto. The pre-reconstructed image following the current image in display order may be further included in RefPicList0, and conversely, the pre-reconstructed image preceding the current image may be further included in RefPicList 1.
The inter predictor 124 may perform bi-prediction using a weighted average, so-called weighted bi-prediction. The inter predictor 124 determines weights applied to the first reference picture and the second reference picture, respectively. The weights assigned to the first reference picture are applied to blocks in the first reference picture and the weights assigned to the second reference picture are applied to blocks in the second reference picture. The inter predictor 124 applies weights assigned to the first reference picture to the first reference block and weights assigned to the second reference picture to the second reference block, thereby generating a final prediction block of the target block through a weighted sum or weighted average operation of the first reference block and the second reference block. Weight information of a reference image used for inter prediction is signaled to a video decoding device.
On the other hand, the weight for predicting the luminance component and the weight for predicting the chrominance component may be independently determined. In this case, information on luminance weights to be applied to luminance components and information on chromaticity weights to be applied to chromaticity components are signaled separately.
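A minimal sketch of the (weighted) bi-prediction described above is given below, assuming 8-bit samples and a simple 3-bit normalization; the variable names and the normalization are illustrative assumptions. Separate (w0, w1) pairs would be used for the luma and chroma components when their weights are determined independently.

```python
import numpy as np

def weighted_bi_prediction(ref_block0, ref_block1, w0=4, w1=4, shift=3):
    """Weighted average of two reference blocks (sketch).

    With w0 == w1 == 4 and shift == 3 this reduces to a plain average,
    i.e. ordinary bi-prediction.
    """
    offset = 1 << (shift - 1)                        # rounding offset
    pred = (w0 * ref_block0.astype(np.int32) +
            w1 * ref_block1.astype(np.int32) + offset) >> shift
    return np.clip(pred, 0, 255).astype(np.uint8)

# Example: plain average versus a 5:3 weighting
p0 = np.full((4, 4), 100, dtype=np.uint8)
p1 = np.full((4, 4), 120, dtype=np.uint8)
print(weighted_bi_prediction(p0, p1)[0, 0])              # 110
print(weighted_bi_prediction(p0, p1, w0=5, w1=3)[0, 0])  # 108
```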
Motion information (motion vector, reference picture) for inter prediction should be signaled to the video decoding apparatus. Various methods may be utilized to minimize the number of bits required to encode motion information.
For example, when the reference image and the motion vector of the current block are identical to those of the neighboring block, the motion information on the current block may be transmitted to the decoding apparatus through the encoding information for identifying the neighboring block. This method is called "merge mode".
In the merge mode, the inter predictor 124 selects a predetermined number of merge candidate blocks (hereinafter referred to as "merge candidates") from neighboring blocks of the current block.
As shown in fig. 4, all or part of the left block L, the upper block A, the upper right block AR, the lower left block BL, and the upper left block AL adjacent to the current block in the current image may be used as neighboring blocks for deriving the merge candidates. Furthermore, in addition to the current picture in which the current block is located, a block located within a reference picture (which may be the same as or different from the reference picture used to predict the current block) may be used as a merge candidate. For example, a co-located block at the same position as the current block in the reference picture, or a block adjacent to the co-located block, may additionally be used as a merge candidate.
The inter predictor 124 configures a merge list including a predetermined number of merge candidates using such neighboring blocks. The inter predictor 124 selects a merge candidate to be used as motion information on the current block from among the merge candidates included in the merge list, and generates merge index information for identifying the selected candidate. The generated combined index information is encoded by the entropy encoder 155 and transmitted to a decoding apparatus.
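The merge-list construction and index signaling described above can be sketched as follows; the candidate order, the simple duplicate pruning, and the omission of the temporal (co-located) candidate are simplifying assumptions made for illustration.

```python
def build_merge_list(neighbor_mvs, max_candidates=6):
    """Collect motion information of available spatial neighbors (e.g. L, A, AR, BL, AL)
    into a merge candidate list, dropping duplicates (simplified pruning)."""
    merge_list = []
    for mv in neighbor_mvs:                   # candidates in a fixed scan order
        if mv is not None and mv not in merge_list:
            merge_list.append(mv)
        if len(merge_list) == max_candidates:
            break
    return merge_list

# Encoder side: pick one candidate and signal only its index in the list.
neighbors = [(3, -1), (3, -1), None, (0, 2), (5, 0)]   # hypothetical (mvx, mvy) per neighbor
merge_list = build_merge_list(neighbors)
merge_index = 1                                        # chosen by the encoder's search
print(merge_list, "-> signaled index:", merge_index, "=> MV", merge_list[merge_index])
```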
Another method of encoding motion information is AMVP mode.
In the AMVP mode, the inter predictor 124 derives predicted motion vector candidates for the motion vector of the current block by using neighboring blocks of the current block. As shown in fig. 4, all or part of the left block L, the upper block A, the upper right block AR, the lower left block BL, and the upper left block AL adjacent to the current block in the current image may be used as neighboring blocks for deriving the predicted motion vector candidates. Furthermore, in addition to the current picture including the current block, a block located within a reference picture (which may be the same as or different from the reference picture used to predict the current block) may be used as a neighboring block for deriving a predicted motion vector candidate. For example, a co-located block at the same position as the current block in the reference picture or a block adjacent to the co-located block may be utilized.
The inter predictor 124 derives predicted motion vector candidates using motion vectors of neighboring blocks, and determines predicted motion vectors for the motion vector of the current block using the predicted motion vector candidates. Then, a motion vector difference is calculated by subtracting the predicted motion vector from the motion vector of the current block.
The predicted motion vector may be obtained by applying a predefined function (e.g., a function for calculating a median, average, etc.) to the predicted motion vector candidates. In this case, the video decoding apparatus also knows the predefined function. Since neighboring blocks used to derive predicted motion vector candidates have already been encoded and decoded, the video decoding device has also known the motion vectors of the neighboring blocks. Accordingly, the video encoding apparatus does not need to encode information for identifying predicted motion vector candidates. In this case, therefore, information on the motion vector difference and information on the reference image used to predict the current block are encoded.
The predicted motion vector may be determined by selecting any one of the predicted motion vector candidates. In this case, information for identifying the selected predicted motion vector candidate is further encoded together with information about a motion vector difference to be used for predicting the current block and information about a reference image.
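The core of the AMVP mode, choosing a predictor and coding only the motion vector difference, can be sketched as follows; the selection criterion and the two-candidate list are illustrative assumptions.

```python
def amvp_encode(mv, mvp_candidates):
    """Encoder side: choose a predictor and compute the motion vector difference."""
    # Pick the candidate closest to the actual motion vector (illustrative criterion).
    mvp_idx = min(range(len(mvp_candidates)),
                  key=lambda i: abs(mv[0] - mvp_candidates[i][0]) +
                                abs(mv[1] - mvp_candidates[i][1]))
    mvp = mvp_candidates[mvp_idx]
    mvd = (mv[0] - mvp[0], mv[1] - mvp[1])      # signaled together with mvp_idx
    return mvp_idx, mvd

def amvp_decode(mvp_idx, mvd, mvp_candidates):
    """Decoder side: reconstruct the motion vector from the predictor and the difference."""
    mvp = mvp_candidates[mvp_idx]
    return (mvp[0] + mvd[0], mvp[1] + mvd[1])

candidates = [(4, 0), (2, 3)]                   # hypothetical predictor candidates
idx, mvd = amvp_encode((5, 1), candidates)      # -> idx 0, mvd (1, 1)
print(amvp_decode(idx, mvd, candidates))        # (5, 1)
```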
The subtractor 130 subtracts the prediction block generated by the intra predictor 122 or the inter predictor 124 from the current block to generate a residual block.
The transformer 140 may transform the residual signal in the residual block. The two-dimensional (2D) size of the residual block may be used as a transform unit (hereinafter "TU") of a block size for performing the transform. Alternatively, the residual block may be divided into a plurality of sub-blocks, and residual signals in the respective sub-blocks may be transformed by using each sub-block as a TU.
The transformer 140 may divide the residual block into one or more sub-blocks and apply the transform to the one or more sub-blocks, thereby transforming the residual values of the transformed blocks from the pixel domain to the frequency domain. In the frequency domain, the transformed block is referred to as a coefficient block or a transform block containing one or more transform coefficient values. A two-dimensional transform kernel may be used for the transform, and one-dimensional transform kernels may be used for the horizontal transform and the vertical transform, respectively. The transform kernels may be based on the Discrete Cosine Transform (DCT), the Discrete Sine Transform (DST), or the like.
The transformer 140 may separately transform the residual block in the horizontal direction and the vertical direction. For the transform, various types of transform kernels or transform matrices may be used. For example, pairs of transform kernels for the horizontal and vertical transforms may be defined as a multiple transform set (MTS). The transformer 140 may select the pair of transform kernels having the best transform efficiency in the MTS and transform the residual block in the horizontal direction and the vertical direction, respectively. Information (mts_idx) on the transform kernel pair selected from the MTS is encoded by the entropy encoder 155 and signaled to the video decoding apparatus.
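As an illustration of the separable horizontal/vertical transform, the sketch below applies orthonormal DCT-II kernels to the rows and columns of a residual block. Using DCT-II for both directions is an assumption made for simplicity; an MTS would choose among several kernel pairs such as DCT-II, DST-VII, and DCT-VIII.

```python
import numpy as np

def dct2_matrix(n):
    """Orthonormal DCT-II transform matrix of size n x n."""
    k = np.arange(n).reshape(-1, 1)
    i = np.arange(n).reshape(1, -1)
    mat = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] *= 1.0 / np.sqrt(2.0)
    return mat * np.sqrt(2.0 / n)

def separable_transform(residual, t_hor, t_ver):
    """Apply the vertical kernel to the columns and the horizontal kernel to the rows."""
    return t_ver @ residual @ t_hor.T

residual = np.random.randint(-64, 64, size=(8, 8)).astype(float)
T = dct2_matrix(8)
coeffs = separable_transform(residual, T, T)
# The inverse transform (as performed by the inverse transformer) recovers the residual.
print(np.allclose(T.T @ coeffs @ T, residual))   # True
```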
The quantizer 145 quantizes the transform coefficient output from the transformer 140 using a quantization parameter, and outputs the quantized transform coefficient to the entropy encoder 155. For some blocks or frames, the quantizer 145 may quantize the relevant residual block directly without transformation. The quantizer 145 may apply different quantization coefficients (scaling values) according to the positions of the transform coefficients in the transform block. A matrix of quantized coefficients, which is applied to quantized transform coefficients arranged in two dimensions, may be encoded and signaled to a video decoding apparatus.
The reordering unit 150 may reorder the coefficient values of the quantized residual. The reordering unit 150 may change the 2-dimensional coefficient array into a 1-dimensional coefficient sequence through coefficient scanning. For example, the reordering unit 150 may scan the coefficients from the DC coefficient to the coefficients in the high-frequency region using a zig-zag scan or a diagonal scan to output a 1-dimensional coefficient sequence. Depending on the size of the transform unit and the intra prediction mode, the zig-zag scan may be replaced with a vertical scan, i.e., scanning the two-dimensional coefficient array in the column direction, or a horizontal scan, i.e., scanning the coefficients of the two-dimensional block shape in the row direction. That is, the scan pattern to be used may be selected from among the zig-zag scan, the diagonal scan, the vertical scan, and the horizontal scan according to the size of the transform unit and the intra prediction mode.
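A plain zig-zag scan that flattens a 2-D coefficient block into the 1-D sequence described above can be sketched as follows; practical codecs additionally use sub-block-based diagonal scans, which are omitted here for simplicity.

```python
def zigzag_scan(block):
    """Flatten an n x n coefficient block into a 1-D list, scanning from the DC
    coefficient toward the high-frequency corner along anti-diagonals."""
    n = len(block)
    order = sorted(((r, c) for r in range(n) for c in range(n)),
                   key=lambda rc: (rc[0] + rc[1],                        # anti-diagonal index
                                   rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))
    return [block[r][c] for r, c in order]

# Quantized coefficients of a 4x4 transform block (DC at the top-left corner).
block = [[9, 8, 4, 0],
         [7, 5, 1, 0],
         [3, 2, 0, 0],
         [1, 0, 0, 0]]
print(zigzag_scan(block))   # [9, 8, 7, 3, 5, 4, 0, 1, 2, 1, 0, 0, 0, 0, 0, 0]
```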
The entropy encoder 155 encodes the one-dimensional quantized transform coefficients output from the reordering unit 150 using various encoding techniques such as Context-based Adaptive Binary Arithmetic Coding (CABAC) and exponential Golomb coding to generate a bitstream.
The entropy encoder 155 encodes information related to block division (e.g., CTU size, CU division flag, QT division flag, MTT division type, and MTT division direction) so that a video decoding apparatus can divide blocks in the same manner as a video encoding apparatus. Further, the entropy encoder 155 encodes information on a prediction type indicating whether the current block is encoded by intra prediction or inter prediction, and encodes intra prediction information (i.e., information on an intra prediction mode) or inter prediction information (a merge index for a merge mode, information on a reference picture index for an AMVP mode and a motion vector difference) according to the prediction type. The entropy encoder 155 also encodes information related to quantization (i.e., information about quantization parameters and information about quantization matrices).
The inverse quantizer 160 inversely quantizes the quantized transform coefficient output from the quantizer 145 to generate a transform coefficient. The inverse transformer 165 transforms the transform coefficients output from the inverse quantizer 160 from the frequency domain to the spatial domain, and reconstructs a residual block.
The adder 170 adds the reconstructed residual block and the prediction block generated by the predictor 120 to reconstruct the current block. The samples in the reconstructed current block are used as reference samples when performing intra prediction of the subsequent block.
The loop filtering unit 180 filters the reconstructed samples to reduce blocking artifacts, ringing artifacts, and blurring artifacts generated due to block-based prediction and transform/quantization. The loop filtering unit 180 may include at least one of a deblocking filter 182, a sample adaptive offset (SAO) filter 184, and an adaptive loop filter (ALF) 186.
The deblocking filter 182 filters the boundaries between reconstructed blocks to remove blocking artifacts caused by block-wise encoding/decoding, and the SAO filter 184 performs additional filtering on the deblocking-filtered video. The SAO filter 184 compensates for differences between the reconstructed samples and the original samples caused by lossy coding, and performs filtering by adding a corresponding offset to each reconstructed sample. The ALF 186 filters a target sample by applying filter coefficients to the target sample and its neighboring samples. The ALF 186 may divide the samples included in the image into predetermined groups and then determine one filter to be applied to each group, so that filtering is performed differentially for each group. Information about the filter coefficients to be used for the ALF may be encoded and signaled to the video decoding apparatus.
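The per-sample corrections performed by the SAO and ALF stages can be illustrated as follows; the band-offset-style SAO lookup and the 3x3 ALF window are simplifying assumptions rather than the normative classifications.

```python
import numpy as np

def sao_band_offset(recon, offsets, bands=32, max_val=255):
    """Add a signaled offset to each reconstructed sample according to its band."""
    band_idx = (recon.astype(np.int32) * bands) // (max_val + 1)
    corrected = recon.astype(np.int32) + offsets[band_idx]
    return np.clip(corrected, 0, max_val)

def alf_filter_sample(recon, y, x, coeffs):
    """Weighted sum of the target sample and its neighbors (3x3 window sketch)."""
    window = recon[y - 1: y + 2, x - 1: x + 2].astype(np.int64)
    return int(np.clip((window * coeffs).sum() >> 7, 0, 255))   # coefficients in 1/128 units

recon = np.full((8, 8), 100, dtype=np.uint8)
offsets = np.zeros(32, dtype=np.int32)
offsets[12] = 2                                            # the band of value 100 gets +2
coeffs = np.array([[4, 8, 4], [8, 80, 8], [4, 8, 4]])      # coefficients sum to 128
print(sao_band_offset(recon, offsets)[0, 0])               # 102
print(alf_filter_sample(recon, 3, 3, coeffs))              # 100 (flat area is unchanged)
```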
The reconstructed block filtered by the loop filtering unit 180 is stored in the memory 190. Once all blocks in one image are reconstructed, the reconstructed image can be used as a reference image for inter prediction of blocks in a subsequent image to be encoded.
Fig. 5 is an exemplary functional block diagram of a video decoding apparatus capable of implementing the techniques of this disclosure. Hereinafter, a video decoding apparatus and elements of the apparatus will be described with reference to fig. 5.
The video decoding apparatus may include an entropy decoder 510, a reordering unit 515, an inverse quantizer 520, an inverse transformer 530, a predictor 540, an adder 550, a loop filtering unit 560, and a memory 570.
Similar to the video encoding apparatus of fig. 1, each element of the video decoding apparatus may be implemented in hardware, software, or a combination of hardware and software. Furthermore, the function of each element may be implemented in software, and the microprocessor may be implemented to execute the software function corresponding to each element.
The entropy decoder 510 determines a current block to be decoded by decoding a bitstream generated by a video encoding apparatus and extracting information related to block division, and extracts prediction information required to reconstruct the current block, information on a residual signal, and the like.
The entropy decoder 510 extracts information about the size of CTUs from a Sequence Parameter Set (SPS) or a Picture Parameter Set (PPS), determines the size of CTUs, and partitions a picture into CTUs of the determined size. Then, the decoder determines the CTU as the highest layer (i.e., root node) of the tree structure, and extracts partition information about the CTU to partition the CTU using the tree structure.
For example, when the CTU is split using the QTBTTT structure, a first flag (qt_split_flag) related to QT splitting is extracted to split each node into four nodes of a lower layer. For a node corresponding to a leaf node of QT, a second flag (mtt_split_flag) related to MTT splitting and information on the splitting direction (vertical/horizontal) and/or the splitting type (binary/ternary) are extracted to split the corresponding leaf node in the MTT structure. Thereby, each node below the leaf node of QT is recursively split in the BT or TT structure.
As another example, when the CTU is partitioned using the QTBTTT structure, a CU partition flag (split_cu_flag) indicating whether to partition the CU may be extracted. When the corresponding block is divided, a first flag (qt_split_flag) may be extracted. In a partitioning operation, zero or more recursive MTT partitions may occur per node after zero or more recursive QT partitions. For example, CTUs may undergo MTT segmentation directly without QT segmentation, or only QT segmentation multiple times.
As another example, when CTUs are segmented using the QTBT structure, a first flag (qt_split_flag) related to QT segmentation is extracted, and each node is segmented into four nodes of the lower layer. Then, a split flag (split_flag) indicating whether to further split a node corresponding to a leaf node of QT with BT and split direction information are extracted.
Once the current block to be decoded is determined through tree structure segmentation, the entropy decoder 510 extracts information about a prediction type indicating whether the current block is intra-predicted or inter-predicted. When the prediction type information indicates intra prediction, the entropy decoder 510 extracts syntax elements of intra prediction information (intra prediction mode) of the current block. When the prediction type information indicates inter prediction, the entropy decoder 510 extracts syntax elements for the inter prediction information, that is, information indicating a motion vector and a reference picture referred to by the motion vector.
The entropy decoder 510 also extracts information on the transform coefficient of the quantized current block as information on quantization and information on a residual signal.
The reordering unit 515 may change the sequence of one-dimensional quantized transform coefficients entropy-decoded by the entropy decoder 510 into a 2-dimensional coefficient array (i.e., block) in the reverse order of coefficient scanning performed by the video encoding apparatus.
The inverse quantizer 520 inversely quantizes the quantized transform coefficients using quantization parameters. The inverse quantizer 520 may apply different quantization coefficients (scaling values) to the quantized transform coefficients arranged in two dimensions. The inverse quantizer 520 may perform inverse quantization by applying a quantization coefficient (scaling value) matrix from the video encoding apparatus to a 2-dimensional array of quantized transform coefficients.
The inverse transformer 530 inversely transforms the inversely quantized transform coefficients from the frequency domain to the spatial domain to reconstruct the residual signal, thereby generating a reconstructed residual block of the current block. Further, when applying MTS, the inverse transformer 530 determines transform functions or transform matrices to be applied in the horizontal and vertical directions, respectively, using MTS information (mts_idx) signaled from the video encoding apparatus, and inversely transforms transform coefficients in the transform blocks in the horizontal and vertical directions using the determined transform functions.
The predictor 540 may include an intra predictor 542 and an inter predictor 544. The intra predictor 542 is activated when the prediction type of the current block is intra prediction, and the inter predictor 544 is activated when the prediction type of the current block is inter prediction.
The intra predictor 542 determines an intra prediction mode of the current block among a plurality of intra prediction modes based on syntax elements of the intra prediction mode extracted from the entropy decoder 510, and predicts the current block using reference samples around the current block according to the intra prediction mode.
The inter predictor 544 determines a motion vector of the current block and a reference picture referenced by the motion vector using syntax elements of inter prediction extracted from the entropy decoder 510, and predicts the current block based on the motion vector and the reference picture.
Similar to the inter predictor 124 of the video encoding device, the inter predictor 544 may generate a prediction block of the current block using bi-prediction. When weighted bi-prediction is applied, the entropy decoder 510 extracts weight information applied to two reference pictures to be used for bi-prediction of the current block from the bitstream. The weight information may include weight information to be applied to the luminance component and weight information to be applied to the chrominance component. The inter predictor 544 generates a prediction block for the luma component and a prediction block for the chroma component of the current block using the weight information.
The adder 550 reconstructs the current block by adding the residual block output from the inverse transformer to the prediction block output from the inter predictor or the intra predictor. In intra prediction of a block to be decoded later, samples in the reconstructed current block are used as reference samples.
The loop filtering unit 560 may include at least one of a deblocking filter 562, an SAO filter 564, and an ALF 566. The deblocking filter 562 filters the boundaries between reconstructed blocks to remove blocking artifacts caused by block-by-block decoding. The SAO filter 564 adds a corresponding offset to the reconstructed samples after deblocking filtering in order to compensate for differences between the reconstructed samples and the original samples caused by lossy coding. The ALF 566 filters a target sample by applying filter coefficients to the target sample and its neighboring samples. The ALF 566 may divide the samples in the image into predetermined groups and then determine one filter to be applied to each group, so that filtering is performed differentially for each group. The filter coefficients of the ALF are determined based on the information about filter coefficients decoded from the bitstream.
The reconstructed block filtered by the loop filtering unit 560 is stored in the memory 570. When all blocks in one image are reconstructed, the reconstructed image is used as a reference image for inter prediction of blocks in a subsequent image to be encoded.
The following disclosure relates to an encoding tool for improving compression performance of inter prediction, which may be operated by an inter predictor 124 of a video encoding device and an inter predictor 544 of a video decoding device. As used herein, the term "target block" may have the same meaning as the term "current block" or "Coding Unit (CU)" used above, or may represent a local region of a CU.
I. Combined inter-intra prediction
As described above, the target block is predicted by one of inter prediction and intra prediction. The combined inter-intra prediction described in this disclosure is a technique that supplements an inter-prediction signal with an intra-prediction signal. When combined inter-intra prediction is applied, the inter predictor 124 of the video encoding device determines a motion vector of the target block and predicts the target block using the determined motion vector to generate an inter prediction block. The intra predictor 122 of the video encoding device predicts the target block using reference samples around the target block and generates an intra prediction block. As the intra prediction mode for generating the intra prediction block, any one of the plurality of intra prediction modes described above may be fixedly used. For example, the planar mode or the DC mode may be used as the prediction mode for generating the intra prediction block. The final prediction block is generated from an average or weighted average of the inter prediction block and the intra prediction block. The equation for calculating the final prediction block in combined inter-intra prediction is given as follows.
[ Equation 1]
P_final = ((4 - wt) * P_inter + wt * P_intra + 2) >> 2

Here, P_inter denotes the inter prediction block, P_intra denotes the intra prediction block, and wt denotes the weight. The term +2 is the offset used for the rounding operation.
The weight may be determined based on whether the pre-encoded/decoded neighboring blocks adjacent to the target block were predicted using inter prediction or intra prediction. For example, when both the left block and the upper block of the target block were intra-predicted, a greater weight is given to the intra prediction block (P_intra); for example, wt is set to 3. When only one of the left block and the upper block was intra-predicted, the inter prediction block (P_inter) and the intra prediction block (P_intra) are given the same weight; for example, wt is set to 2. When neither the left block nor the upper block was intra-predicted, a greater weight is given to the inter prediction block (P_inter); for example, wt is set to 1.
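As an informal illustration of the weight selection and of equation 1, the following Python sketch combines an inter prediction block and an intra prediction block. The function names, the NumPy representation of the blocks, and the boolean neighbor flags are assumptions made for this example only.

```python
import numpy as np

def ciip_weight(left_is_intra: bool, above_is_intra: bool) -> int:
    """Select the weight wt based on how many neighboring blocks are intra-coded."""
    if left_is_intra and above_is_intra:
        return 3   # both neighbors intra: favor the intra prediction block
    if left_is_intra or above_is_intra:
        return 2   # exactly one neighbor intra: equal weights
    return 1       # no intra neighbor: favor the inter prediction block

def combine_inter_intra(p_inter: np.ndarray, p_intra: np.ndarray, wt: int) -> np.ndarray:
    """Equation 1: P_final = ((4 - wt) * P_inter + wt * P_intra + 2) >> 2."""
    return ((4 - wt) * p_inter.astype(np.int32)
            + wt * p_intra.astype(np.int32) + 2) >> 2
```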
When the target block is predicted by combined inter-intra prediction, the inter predictor 544 of the video decoding apparatus extracts information on a motion vector of the target block from the bitstream to determine the motion vector of the target block. Then, the target block is predicted in the same manner as the video encoding device.
Combined inter-intra prediction is a technique of supplementing an inter prediction signal with an intra prediction signal, and thus may be effective when the inter prediction is relatively inaccurate, e.g., when the motion vector of the target block is determined by the merge mode. Accordingly, combined inter-intra prediction may be applied only when the motion vector of the target block is determined by the merge mode.
II. Bidirectional optical flow
Bidirectional optical flow is a technique of additionally compensating for the motion of a sample predicted using bidirectional motion prediction, assuming that a sample or object constituting a video moves at a constant speed and the sample value hardly changes.
Fig. 6 is an exemplary diagram illustrating a basic concept of BIO.
It is assumed that bidirectional motion vectors MV_0 and MV_1, which point to the corresponding regions (i.e., reference blocks) most similar to the target block to be encoded in the current picture, have been determined in the reference pictures Ref_0 and Ref_1 by (normal) bidirectional motion prediction for the target block. The two motion vectors represent the motion of the entire target block. In the example of fig. 6, P_0 is the sample in reference picture Ref_0 indicated by motion vector MV_0 and corresponding to sample P in the target block, and P_1 is the sample in reference picture Ref_1 indicated by motion vector MV_1 and corresponding to sample P in the target block. Further, it is assumed that the motion of sample P in fig. 6 is slightly different from the overall motion of the target block. For example, when an object located at sample A in Ref_0 of fig. 6 moves via sample P in the target block of the current picture to sample B in Ref_1, sample A and sample B may have very similar values. In this case, the point in Ref_0 most similar to sample P in the target block is not P_0 indicated by the bidirectional motion vector MV_0 but sample A, shifted from P_0 by a predetermined displacement vector (v_x*τ_0, v_y*τ_0). Likewise, the point in Ref_1 most similar to sample P in the target block is not P_1 indicated by the bidirectional motion vector MV_1 but sample B, shifted from P_1 by a predetermined displacement vector (-v_x*τ_1, -v_y*τ_1). Here, τ_0 and τ_1 represent the temporal distances of Ref_0 and Ref_1, respectively, from the current picture, and are calculated based on the picture order count (POC). Hereinafter, (v_x, v_y) is referred to as "optical flow" or "motion offset".
In predicting the value of sample P of the current block in the current picture, using the two reference samples A and B enables more accurate prediction than using the reference samples P_0 and P_1 indicated by the bidirectional motion vectors MV_0 and MV_1.
I^(0)(i, j) denotes the value of the sample in reference picture Ref_0, indicated by motion vector MV_0, corresponding to sample (i, j) in the target block, and I^(1)(i, j) denotes the value of the sample in reference picture Ref_1, indicated by motion vector MV_1, corresponding to sample (i, j) in the target block.
The value of sample A in reference picture Ref_0 corresponding to the sample in the target block, indicated by the BIO motion vector (v_x, v_y), may be defined as I^(0)(i + v_x*τ_0, j + v_y*τ_0), and the value of sample B in reference picture Ref_1 may be defined as I^(1)(i - v_x*τ_1, j - v_y*τ_1). When linear approximation is performed using only the first-order terms of the Taylor series, A and B may be expressed as in equation 2.
[ Equation 2]

A = I^(0)(i, j) + v_x*τ_0*I_x^(0)(i, j) + v_y*τ_0*I_y^(0)(i, j)
B = I^(1)(i, j) - v_x*τ_1*I_x^(1)(i, j) - v_y*τ_1*I_y^(1)(i, j)
Here, I_x^(k) and I_y^(k) (k = 0, 1) are the gradient values in the horizontal and vertical directions at position (i, j) of Ref_0 and Ref_1, respectively. Furthermore, τ_0 and τ_1 represent the temporal distances of Ref_0 and Ref_1, respectively, from the current picture, and are calculated based on POC: τ_0 = POC(current) - POC(Ref_0), and τ_1 = POC(Ref_1) - POC(current).
The bidirectional optical flow (v_x, v_y) for each sample in the block is determined as the solution that minimizes Δ, which is defined as the difference between sample A and sample B. Δ can be defined by equation 3 using the linear approximations of A and B derived from equation 2.
[ Equation 3]

Δ = A - B = I^(0)(i, j) - I^(1)(i, j) + v_x*(τ_0*I_x^(0)(i, j) + τ_1*I_x^(1)(i, j)) + v_y*(τ_0*I_y^(0)(i, j) + τ_1*I_y^(1)(i, j))
For simplicity, sample position (i, j) is omitted from each term of equation 3.
To achieve more robust optical flow estimation, it is assumed that the motion is locally consistent with that of neighboring samples. For the BIO motion vector of the sample (i, j) currently to be predicted, the differences Δ of equation 3 for all samples (i', j') within a mask Ω of a certain size centered on (i, j) are considered. That is, the optical flow of the current sample (i, j) may be determined as the vector minimizing the objective function Φ(v_x, v_y), which is the sum of the squares of the differences Δ[i', j'] obtained for the respective samples in the mask Ω, as shown in equation 4.
[ Equation 4]

Φ(v_x, v_y) = Σ_{(i',j')∈Ω} Δ^2[i', j']
The bidirectional optical flow of the present invention may be applied to the case where, in display order, one of the two reference pictures used for bidirectional prediction precedes the current picture and the other follows it, and where the distances from the two reference pictures to the current picture are equal to each other, that is, the differences in picture order count (POC) between each reference picture and the current picture are equal. In this case, τ_0 and τ_1 can be omitted from the equations.
Furthermore, the bidirectional optical flow of the present invention may be applied only to the luminance component.
For a target block to which bi-prediction is applied, the bidirectional optical flow of the present invention is performed on a sub-block basis rather than on a per-sample basis. Although the sub-blocks may have various sizes such as 2×2, 4×4, and 8×8, it is assumed in the following description, for simplicity, that the sub-block size is 4×4.
Before performing optical flow, the inter predictor 124 of the video encoding device generates two reference blocks for the target block using the above-described bi-prediction. The first of the two reference blocks is a block composed of predicted samples generated from reference picture Ref_0 using the first motion vector MV_0 of the target block, and the second reference block is a block composed of predicted samples generated from reference picture Ref_1 using the second motion vector MV_1.
The inter predictor 124 calculates the optical flow (v_x, v_y) for each of the 4×4 sub-blocks constituting the target block, using the horizontal and vertical gradient values of the predicted samples in the first and second reference blocks. The optical flow (v_x, v_y) is determined such that the difference between the predicted samples from reference picture Ref_0 and the predicted samples from reference picture Ref_1 is minimized. The inter predictor 124 then derives a sample offset for modifying the bi-directionally predicted samples of the 4×4 sub-block, using the (v_x, v_y) calculated for the 4×4 sub-block and the gradients of the predicted samples in the 4×4 sub-block.
Specifically, the inter predictor 124 calculates the horizontal gradient and the vertical gradient of the sample value at the position (i, j) using equation 5.
[ Equation 5]

∂I^(k)/∂x (i, j) = (I^(k)(i + 1, j) - I^(k)(i - 1, j)) >> shift1
∂I^(k)/∂y (i, j) = (I^(k)(i, j + 1) - I^(k)(i, j - 1)) >> shift1
Here, k is 0 or 1, and I^(0)(i, j) and I^(1)(i, j) denote the sample values at position (i, j) in the first and second reference blocks, respectively. shift1 is a value derived from the bit depth of the luma component, for example, shift1 = max(6, bitDepth - 6).
To derive the gradients of the samples located at the boundary of each reference block, samples outside the boundaries of the first and second reference blocks are required. Accordingly, as shown in fig. 7, each reference block is extended by one column to the left and right and by one row above and below. To reduce the amount of computation, each sample in the extended portion may be filled with the sample (an integer sample, without interpolation) at the nearest position in the reference block. Also, gradients at sample positions outside the boundary of each reference block may be filled with the gradients corresponding to the samples at the nearest positions.
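The gradient computation of equation 5 and the nearest-sample padding described above can be sketched as follows. This is a minimal illustration assuming the reference blocks are NumPy integer arrays; the helper names are hypothetical.

```python
import numpy as np

def pad_nearest(block: np.ndarray) -> np.ndarray:
    """Extend a reference block by one sample on every side using the nearest
    samples in the block, as described for the extended region above."""
    return np.pad(block, 1, mode="edge")

def bdof_gradients(ref_ext: np.ndarray, bit_depth: int = 10):
    """Horizontal and vertical gradients of equation 5 for one reference block.

    ref_ext is the prediction block already extended by one padded row/column
    on every side, so ref_ext[1:-1, 1:-1] is the block itself.
    """
    shift1 = max(6, bit_depth - 6)
    gx = (ref_ext[1:-1, 2:].astype(np.int32)
          - ref_ext[1:-1, :-2].astype(np.int32)) >> shift1
    gy = (ref_ext[2:, 1:-1].astype(np.int32)
          - ref_ext[:-2, 1:-1].astype(np.int32)) >> shift1
    return gx, gy
```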
The inter predictor 124 calculates S1, S2, S3, S5, and S6 corresponding to the auto-correlation and cross-correlation of the gradients using the horizontal gradient and the vertical gradient in the 6×6 window covering the 4×4 sub-block, as shown in fig. 7.
[ Equation 6]

S1 = Σ_{(i,j)∈Ω} ψ_x(i,j)·ψ_x(i,j),  S2 = Σ_{(i,j)∈Ω} ψ_x(i,j)·ψ_y(i,j),  S3 = Σ_{(i,j)∈Ω} θ(i,j)·ψ_x(i,j)
S5 = Σ_{(i,j)∈Ω} ψ_y(i,j)·ψ_y(i,j),  S6 = Σ_{(i,j)∈Ω} θ(i,j)·ψ_y(i,j)
Here, Ω denotes the window covering the sub-block. Furthermore, as shown in equation 7 below, ψ_x(i, j) denotes the sum of the horizontal gradient values at position (i, j) in the first and second reference blocks, ψ_y(i, j) denotes the sum of the vertical gradient values at position (i, j) in the first and second reference blocks, and θ(i, j) denotes the difference between the sample value at position (i, j) in the second reference block and the sample value at position (i, j) in the first reference block.
[ Equation 7]

ψ_x(i, j) = (∂I^(1)/∂x (i, j) + ∂I^(0)/∂x (i, j)) >> n_a
ψ_y(i, j) = (∂I^(1)/∂y (i, j) + ∂I^(0)/∂y (i, j)) >> n_a
θ(i, j) = (I^(1)(i, j) >> n_b) - (I^(0)(i, j) >> n_b)
Here, n_a and n_b are values derived from the bit depth and have the values min(1, bitDepth - 11) and min(4, bitDepth - 8), respectively.
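A sketch of equations 6 and 7 for one 4×4 sub-block is given below. It assumes the 6×6 window Ω is passed in as NumPy arrays, and the default shift values n_a and n_b are illustrative placeholders rather than values taken from the text.

```python
import numpy as np

def bdof_correlations(i0, i1, gx0, gy0, gx1, gy1, n_a: int = 1, n_b: int = 4):
    """Correlations S1, S2, S3, S5, S6 of equations 6 and 7 for one sub-block.

    All inputs cover the 6x6 window Omega around the 4x4 sub-block: i0/i1 are
    the sample values of the first/second reference block, gx*/gy* the
    corresponding horizontal/vertical gradients. n_a and n_b are the
    bit-depth-dependent shifts described in the text.
    """
    psi_x = (gx1.astype(np.int64) + gx0) >> n_a            # equation 7
    psi_y = (gy1.astype(np.int64) + gy0) >> n_a
    theta = (i1.astype(np.int64) >> n_b) - (i0.astype(np.int64) >> n_b)
    s1 = int(np.sum(psi_x * psi_x))                          # equation 6
    s2 = int(np.sum(psi_x * psi_y))
    s3 = int(np.sum(theta * psi_x))
    s5 = int(np.sum(psi_y * psi_y))
    s6 = int(np.sum(theta * psi_y))
    return s1, s2, s3, s5, s6
```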
The inter predictor 124 calculates the optical flow (v_x, v_y) of the 4×4 sub-block using equation 8, based on S1, S2, S3, S5, and S6.
[ Equation 8]
Here, th′_BIO = 2^max(5, BD-7), and ⌊·⌋ denotes the rounding-down (floor) function used in equation 8.
The optical flow calculated for the 4×4 sub-block and the gradient values at each sample position (x, y) may be used to calculate a sample offset for modifying the predicted sample at each sample position (x, y) in the 4×4 sub-block of the target block, as shown in equation 9. In equation 9, rnd() denotes a rounding operation.
[ Equation 9]

b(x, y) = rnd( (v_x * (∂I^(1)/∂x (x, y) - ∂I^(0)/∂x (x, y))) / 2 ) + rnd( (v_y * (∂I^(1)/∂y (x, y) - ∂I^(0)/∂y (x, y))) / 2 )
The inter predictor 124 generates the final predicted sample pred(x, y) using the sample offset b(x, y) at position (x, y), the predicted sample I^(0)(x, y) in the first reference block, and the predicted sample I^(1)(x, y) in the second reference block, as shown in equation 10.
[ Equation 10]
pred(x, y) = (I^(0)(x, y) + I^(1)(x, y) + b(x, y) + o_offset) >> shift
Here, shift is max(3, 15 - bitDepth), and o_offset is a value used for the rounding operation, equal to half of 2^shift (that is, 1 << (shift - 1)).
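The remaining steps, equations 8 to 10, may be sketched as follows in simplified floating-point form. The closed-form expressions for v_x and v_y follow the least-squares derivation outlined above, and the exact integer and shift arithmetic of equation 8 is deliberately omitted, so this is an approximation rather than the normative computation; the function names are assumptions.

```python
import numpy as np

def bdof_motion_offset(s1, s2, s3, s5, s6, bit_depth: int = 10):
    """Simplified floating-point solution of equation 8 for a 4x4 sub-block.

    th'_BIO = 2 ** max(5, bit_depth - 7) is used as the clipping range for the
    motion offset; the codec's exact integer/shift arithmetic is omitted.
    """
    th = 2 ** max(5, bit_depth - 7)
    vx = float(np.clip(-s3 / s1, -th, th)) if s1 > 0 else 0.0
    vy = float(np.clip(-(s6 - vx * s2 / 2) / s5, -th, th)) if s5 > 0 else 0.0
    return vx, vy

def bdof_final_prediction(i0, i1, gx0, gy0, gx1, gy1, vx, vy, bit_depth: int = 10):
    """Equations 9 and 10: per-sample offset b(x, y) and the final prediction.

    i0/i1 and the gradients are the 4x4 arrays of the sub-block.
    o_offset is half of 2**shift, i.e. 1 << (shift - 1).
    """
    b = np.rint(vx * (gx1 - gx0) / 2.0) + np.rint(vy * (gy1 - gy0) / 2.0)  # eq. 9
    shift = max(3, 15 - bit_depth)
    o_offset = 1 << (shift - 1)
    return (i0.astype(np.int64) + i1 + b.astype(np.int64) + o_offset) >> shift  # eq. 10
```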
As described above, the bidirectional optical flow technique uses the values of samples predicted with the motion information (two motion vectors and two reference pictures) used for bidirectional prediction. Accordingly, the inter predictor 544 of the video decoding device may also perform bidirectional optical flow in the same manner as the video encoding device, using the motion information (motion vectors, reference pictures) for bidirectional prediction received from the video encoding device. The video encoding device does not need to signal additional information for the bidirectional optical flow processing to the video decoding device.
The bi-directional optical flow techniques described above may be applied to the chrominance components. In this case, in order to reduce the computational complexity, the optical flow calculated for the luminance component may be used as the optical flow of the chrominance component without the need to recalculate the optical flow (v x,vy) for the chrominance component. Therefore, when applying bi-directional optical flow to chrominance components, only the horizontal and vertical gradients of the chrominance components of each sample need be calculated.
The bidirectional optical flow itself requires a large amount of computation. Furthermore, when used together with other encoding tools, bidirectional optical flow may further increase the computational complexity and delay the encoding/decoding processing. In addition, bidirectional optical flow combined with some encoding tools may not contribute to an improvement in coding efficiency. In view of this, execution of bidirectional optical flow may be restricted under certain conditions: the video encoding device and the video decoding device check whether a predefined condition is satisfied before performing bidirectional optical flow and skip it accordingly. The conditions for restricting execution of bidirectional optical flow are described below. When one or more of the conditions are met, bidirectional optical flow is not performed, i.e., it is skipped.
In some implementations, bi-directional optical flow is not used with affine motion prediction to be described later. Since both bi-directional optical flow and affine motion prediction require a large amount of computation, the combination of the two coding tools not only increases the computational complexity, but also delays the encoding/decoding process. Therefore, when affine motion prediction is used for a target block, bidirectional optical flow for the target block is not performed.
In some other implementations, bidirectional optical flow may be restricted from being used together with the combined inter-intra prediction technique. When combined inter-intra prediction is applied to the target block, bidirectional optical flow is not applied to the target block. In intra prediction, pre-reconstructed samples around the target block are used, so the target block can be intra-predicted only after decoding (sample reconstruction) of the neighboring blocks is completed. Thus, if both combined inter-intra prediction and bidirectional optical flow were applied, the bidirectional optical flow processing would have to wait until intra prediction of the target block can be performed, i.e., until decoding of all neighboring blocks of the target block is completed. This may cause a serious delay in the decoding process. Accordingly, bidirectional optical flow may not be applied to blocks to which combined inter-intra prediction is applied.
Further, the bidirectional optical flow may be restricted for use with local illumination compensation, which will be described later. For example, when local illumination compensation is applied, bi-directional optical flow is not applied.
In some other implementations, when the current image including the target block or the reference image referenced by the target block is an image segmented into sub-images, bi-directional optical flow is not applied.
In some other implementations, when weighted bi-prediction is performed on the target block and different weights are applied to the two reference blocks (the first reference block and the second reference block), bi-directional optical flow is not performed. That is, when weights to be applied to two reference blocks (luminance components in the two reference blocks) to predict the luminance components are different from each other, bidirectional optical flow of the target block is not performed. Further, when weights to be applied to two reference blocks (chrominance components in the two reference blocks) to predict the chrominance components are different from each other, bidirectional optical flow of the target block is not performed. As described above, the bidirectional optical flow is based on the assumption that the sample value hardly changes between images. In contrast, weighted bi-prediction assumes that the sample values between pictures change. Accordingly, when weighted bi-directional prediction applying different weights is performed, the bi-directional optical flow of the target block is skipped.
Alternatively, in the case of bidirectional prediction with different weights, bidirectional optical flow may be performed by applying the different weights. When the weights applied to the first reference block (I^(0)(x, y)) and the second reference block (I^(1)(x, y)) are w_0 and w_1, respectively, I^(0)(x, y) and I^(1)(x, y) in equations 5 to 10 may be replaced by w_0*I^(0)(x, y) and w_1*I^(1)(x, y), respectively. That is, instead of the sample values in the two reference blocks, the sample values multiplied by the weights respectively corresponding to the two reference blocks may be used to calculate the optical flow (v_x, v_y), the sample offset b(x, y), and the final predicted sample pred(x, y).
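The skip conditions listed above can be gathered into a single check performed before invoking bidirectional optical flow. The flag names in the following sketch are illustrative assumptions.

```python
def bdof_is_skipped(uses_affine: bool,
                    uses_combined_inter_intra: bool,
                    uses_local_illumination_compensation: bool,
                    picture_uses_subpictures: bool,
                    luma_weights_differ: bool,
                    chroma_weights_differ: bool) -> bool:
    """Return True if bidirectional optical flow must be skipped for the block.

    Each argument mirrors one of the restriction conditions described above;
    BDOF is skipped as soon as any of them holds.
    """
    return any((uses_affine,
                uses_combined_inter_intra,
                uses_local_illumination_compensation,
                picture_uses_subpictures,
                luma_weights_differ,
                chroma_weights_differ))
```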
III. Affine motion prediction
The inter prediction described above is motion prediction reflecting a translational motion model, that is, a technique for predicting motion in the horizontal direction (x-axis direction) and the vertical direction (y-axis direction). In practice, however, there may be various other types of motion, such as rotation, zoom-in, or zoom-out, in addition to translational motion. An aspect of the present invention provides affine motion prediction capable of covering such various types of motion.
Fig. 8a and 8b are exemplary diagrams illustrating affine motion prediction.
There may be two types of models for affine motion prediction. One model is a model using motion vectors (that is, four parameters) of two control points of the upper left and upper right corners of the target block currently to be encoded, as shown in fig. 8 a. Another model is a model using motion vectors (that is, six parameters) of three control points of the upper left corner, the upper right corner, and the lower left corner of the target block, as shown in fig. 8 b.
The four-parameter affine model is expressed by equation 11. The motion at the sample position (x, y) in the target block can be calculated by equation 11. Here, it is assumed that the position of the upper left sample of the target block is (0, 0).
[ Equation 11]

mv_x = ((mv_1x - mv_0x) / W) * x - ((mv_1y - mv_0y) / W) * y + mv_0x
mv_y = ((mv_1y - mv_0y) / W) * x + ((mv_1x - mv_0x) / W) * y + mv_0y
The six-parameter affine model is expressed by equation 12. The motion at the sample position (x, y) in the target block can be calculated by equation 12.
[ Equation 12]

mv_x = ((mv_1x - mv_0x) / W) * x + ((mv_2x - mv_0x) / H) * y + mv_0x
mv_y = ((mv_1y - mv_0y) / W) * x + ((mv_2y - mv_0y) / H) * y + mv_0y
Here, (mv_0x, mv_0y) is the motion vector of the upper-left corner control point, (mv_1x, mv_1y) is the motion vector of the upper-right corner control point, and (mv_2x, mv_2y) is the motion vector of the lower-left corner control point. W is a constant determined by the horizontal length (width) of the target block, and H is a constant determined by the vertical length (height) of the target block.
Affine motion prediction may be performed on each sample in the target block using the motion vector calculated by equation 11 or equation 12.
Alternatively, in order to reduce the computational complexity, prediction may be performed on each sub-block divided from the target block, as shown in fig. 9. For example, the size of the sub-block may be 4×4, 2×2, or 8×8. In the following exemplary embodiments, affine motion prediction for a target block is performed based on 4×4 sub-blocks. This example is for convenience of description only, and the present invention is not limited thereto.
In sub-block-based affine motion prediction, the motion vector (affine motion vector) of each sub-block is calculated by substituting the center position of the sub-block into (x, y) of equation 11 or equation 12. Here, the center position may be the actual center point of the sub-block or the lower-right sample position of the center point. For example, when the coordinates of the upper-left sample of a 4×4 sub-block are (0, 0), the center position of the sub-block may be (1.5, 1.5) or (2, 2). The affine motion vector (mv_x, mv_y) of each sub-block is used to generate the predicted block for that sub-block.
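Using the forms of equations 11 and 12 reproduced above, the affine motion vector of one sub-block may be sketched as follows. The function name and the floating-point arithmetic are assumptions for illustration, and the rounding to 1/16-sample precision discussed in the next paragraph is omitted.

```python
def affine_subblock_mv(cpmvs, block_w, block_h, sub_x, sub_y, sub_size=4):
    """Affine motion vector of one sub-block (equations 11 and 12).

    cpmvs holds the control point motion vectors: [(mv0x, mv0y), (mv1x, mv1y)]
    for the 4-parameter model, plus (mv2x, mv2y) for the 6-parameter model.
    (sub_x, sub_y) is the top-left sample position of the sub-block inside the
    target block; the sub-block center is substituted into (x, y).
    """
    (mv0x, mv0y), (mv1x, mv1y) = cpmvs[0], cpmvs[1]
    x = sub_x + sub_size / 2.0          # center position of the sub-block
    y = sub_y + sub_size / 2.0
    ax = (mv1x - mv0x) / block_w        # change of mv per horizontal unit
    ay = (mv1y - mv0y) / block_w
    if len(cpmvs) == 2:                 # 4-parameter model (equation 11)
        bx, by = -ay, ax
    else:                               # 6-parameter model (equation 12)
        (mv2x, mv2y) = cpmvs[2]
        bx = (mv2x - mv0x) / block_h
        by = (mv2y - mv0y) / block_h
    return ax * x + bx * y + mv0x, ay * x + by * y + mv0y
```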
The motion vector (mv_x, mv_y) may be set to have 1/16-sample precision. In this case, the motion vector (mv_x, mv_y) calculated by equation 11 or equation 12 may be rounded in units of 1/16 sample. Adaptive motion vector resolution may be applied to affine motion prediction as in normal inter prediction. In this case, information on the motion vector resolution of the target block (that is, the precision of the motion vector) is signaled for each target block.
Affine motion prediction may be performed not only on the luma component but also on the chroma components. In the case of the 4:2:0 video format, when affine motion prediction for the luma component is performed based on 4×4 sub-blocks, affine motion prediction for the chroma components may be performed based on 2×2 sub-blocks. The motion vector (mv_x, mv_y) of each chroma sub-block may be derived from the motion vector of the corresponding luma sub-block. Alternatively, the sub-block size for affine motion prediction of the chroma components may be the same as that for the luma component. When affine motion prediction for the luma component is performed based on 4×4 sub-blocks, affine motion prediction for the chroma components may also be performed based on 4×4 sub-blocks. In this case, since a 4×4 sub-block of the chroma components corresponds to four 4×4 sub-blocks of the luma component, the motion vector (mv_x, mv_y) of the chroma sub-block may be calculated as the average of the motion vectors of the four corresponding luma sub-blocks.
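A minimal sketch of the chroma sub-block motion vector derivation for the 4:2:0 case with a shared 4×4 sub-block size is shown below; the function name and the list-based motion vector representation are assumptions.

```python
def chroma_subblock_mv(luma_sub_mvs):
    """Motion vector of a 4x4 chroma sub-block as the average of the motion
    vectors of its four corresponding 4x4 luma sub-blocks (4:2:0 format)."""
    n = len(luma_sub_mvs)
    mvx = sum(mv[0] for mv in luma_sub_mvs) / n
    mvy = sum(mv[1] for mv in luma_sub_mvs) / n
    return mvx, mvy
```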
The video encoding device performs intra prediction, inter prediction (translational motion prediction), affine motion prediction, and the like, and calculates a rate-distortion (RD) cost to select the best prediction method. To perform affine motion prediction, the inter predictor 124 of the video encoding device determines which of the two model types to use and determines two or three control points according to the determined type. The inter predictor 124 calculates the motion vector (mv_x, mv_y) of each of the 4×4 sub-blocks in the target block using the motion vectors of the control points. The inter predictor 124 then performs motion compensation in the reference picture on a sub-block-by-sub-block basis using the motion vector (mv_x, mv_y) of each sub-block to generate a predicted block for each sub-block in the target block.
The entropy encoder 155 of the video encoding device encodes affine-related syntax elements including a flag indicating whether affine motion prediction is applied to the target block, type information indicating the type of affine model, and motion information indicating the motion vector of each control point, and transmits them to the video decoding device. When affine motion prediction is performed, the type information and the motion information about the control points may be signaled, and motion vectors of as many control points as determined according to the type information may be signaled. Further, when adaptive motion vector resolution is applied, motion vector resolution information about the affine motion vector of the target block is signaled.
The video decoding device determines the type of the affine model and the control point motion vectors using the signaled syntax elements, and calculates the motion vector (mv_x, mv_y) of each 4×4 sub-block in the target block using equation 11 or equation 12. When motion vector resolution information about the affine motion vector of the target block is signaled, the motion vector (mv_x, mv_y) is corrected to the precision identified by the motion vector resolution information through an operation such as rounding.
The video decoding device performs motion compensation within the reference picture using the motion vector (mv_x, mv_y) of each sub-block to generate a predicted block for each sub-block.
In order to reduce the amount of bits required to encode the motion vector of the control point, a method as used in the above-described normal inter prediction (translational motion prediction) may be applied.
As an example, in the merge mode, the inter predictor 124 of the video encoding device derives the motion vector of each control point from neighboring blocks of the target block. For example, the inter predictor 124 generates a merge candidate list by deriving a predefined number of merge candidates from the neighboring blocks L, BL, A, AR, and AL of the target block shown in fig. 4. Each merge candidate included in the list corresponds to a set of motion vectors of two or three control points.
First, the inter predictor 124 derives a merge candidate from control point motion vectors of neighboring blocks predicted in an affine mode among the neighboring blocks. In some embodiments, the number of merging candidates derived from neighboring blocks predicted in affine mode may be limited. For example, the inter predictor 124 may derive two merge candidates, one of L and BL and one of A, AR and AL, from neighboring blocks predicted in affine mode. Priorities may be assigned in the order of L and BL and A, AR and AL.
When the total number of merging candidates is greater than or equal to 3, the inter predictor 124 may derive a necessary number of merging candidates from the translational motion vectors of the neighboring blocks.
Fig. 10 is an exemplary diagram showing a method of deriving a merge candidate for affine motion prediction from translational motion vectors of neighboring blocks.
The inter predictor 124 derives control point motion vectors CPMV1, CPMV2, and CPMV3 from the neighboring block group {B2, B3, A2}, the neighboring block group {B1, B0}, and the neighboring block group {A1, A0}, respectively. As an example, the priority within each neighboring block group may be assigned in the order B2, B3, A2; B1, B0; and A1, A0. Furthermore, another control point motion vector CPMV4 is derived from the co-located block T in the reference picture. The inter predictor 124 generates as many merge candidates as required from combinations of two or three of the four control point motion vectors. The priorities of the combinations are as follows, and the elements in each combination are listed in the order of the control point motion vectors for the upper-left, upper-right, and lower-left corners.
{CPMV1,CPMV2,CPMV3},{CPMV1,CPMV2,CPMV4},{CPMV1,CPMV3,CPMV4},{CPMV2,CPMV3,CPMV4},{CPMV1,CPMV2},{CPMV1,CPMV3}
The inter predictor 124 selects a merge candidate in the merge candidate list and performs affine motion prediction on the target block. When the selected candidate includes two control point motion vectors, affine motion prediction is performed using a four parameter model. On the other hand, when the selected candidate includes three control point motion vectors, affine motion prediction is performed using a six-parameter model. The entropy encoder 155 of the video encoding apparatus encodes and signals index information indicating a merge candidate selected among the merge candidates in the merge candidate list to the video decoding apparatus.
The entropy decoder 510 of the video decoding apparatus decodes index information signaled from the video encoding apparatus. The inter predictor 544 of the video decoding apparatus constructs a merge candidate list in the same manner as the video encoding apparatus, and performs affine motion prediction using a control point motion vector corresponding to the merge candidate indicated by the index information.
As another example, in the AMVP mode, the inter predictor 124 of the video encoding device determines the type of the affine model and the control point motion vectors for the target block. Then, the inter predictor 124 calculates a motion vector difference, which is the difference between the actual control point motion vector of the target block and the predicted motion vector of the corresponding control point, and transmits the motion vector differences respectively corresponding to the control points. To this end, the inter predictor 124 of the video encoding device constructs an affine AMVP list containing a predefined number of candidates. When the target block is of the 4-parameter type, each candidate in the list consists of a pair of two control point motion vectors. When the target block is of the 6-parameter type, each candidate in the list consists of three control point motion vectors. The affine AMVP list may be derived using the control point motion vectors or translational motion vectors of neighboring blocks, in a manner similar to the above-described method of constructing the merge candidate list.
However, to derive candidates to be included in the affine AMVP list, a restriction may be imposed such that, among the neighboring blocks of fig. 4, only neighboring blocks that refer to the same reference picture as the target block are considered.
Further, in the AMVP mode, the affine model type of the target block should be considered. When the affine model type of the target block is the 4-parameter type, the video encoding device derives two control point motion vectors (the upper-left and upper-right corner control point motion vectors of the target block) using the affine model of a neighboring block. When the affine model type of the target block is the 6-parameter type, the device derives three control point motion vectors (the upper-left, upper-right, and lower-left corner control point motion vectors of the target block) using the affine model of a neighboring block.
When the neighboring block is of the 4-parameter type, two or three control point motion vectors are predicted, according to the affine model type of the target block, using the two control point motion vectors of the neighboring block. For example, the affine model of the neighboring block expressed by equation 11 may be used. In equation 11, (mv_0x, mv_0y) and (mv_1x, mv_1y) are replaced by the upper-left and upper-right corner control point motion vectors of the neighboring block, respectively, and W is replaced by the horizontal length of the neighboring block. The predicted motion vector for each control point of the target block is derived by substituting, into (x, y), the difference between the position of the corresponding control point of the target block and the position of the upper-left corner of the neighboring block.
When the neighboring block is of the 6-parameter type, two or three control point motion vectors are predicted, according to the affine model type of the target block, using the three control point motion vectors of the neighboring block. For example, the affine model of the neighboring block expressed by equation 12 may be used. In equation 12, (mv_0x, mv_0y), (mv_1x, mv_1y), and (mv_2x, mv_2y) are replaced by the control point motion vectors of the upper-left, upper-right, and lower-left corners of the neighboring block, respectively, and W and H are replaced by the horizontal and vertical lengths of the neighboring block, respectively. The predicted motion vector for each control point of the target block is derived by substituting, into (x, y), the difference between the position of the corresponding control point of the target block and the position of the upper-left corner of the neighboring block.
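The derivation of predicted control point motion vectors from a neighboring affine block, as described in the two preceding paragraphs, may be sketched as follows. The helper names and the floating-point evaluation of the neighbor's affine model are assumptions for illustration.

```python
def eval_affine_model(cpmvs, w, h, x, y):
    """Evaluate the affine model of equation 11 or 12 at offset (x, y) from the
    block's upper-left corner. cpmvs holds two (4-parameter) or three
    (6-parameter) control point motion vectors."""
    (mv0x, mv0y), (mv1x, mv1y) = cpmvs[0], cpmvs[1]
    ax, ay = (mv1x - mv0x) / w, (mv1y - mv0y) / w
    if len(cpmvs) == 2:                     # 4-parameter model
        bx, by = -ay, ax
    else:                                   # 6-parameter model
        (mv2x, mv2y) = cpmvs[2]
        bx, by = (mv2x - mv0x) / h, (mv2y - mv0y) / h
    return ax * x + bx * y + mv0x, ay * x + by * y + mv0y

def predict_target_cpmvs(neigh_cpmvs, neigh_w, neigh_h, neigh_top_left,
                         target_corners):
    """Predicted motion vector for each control point of the target block,
    obtained by substituting the offset of that control point from the
    neighboring block's upper-left corner into the neighbor's affine model."""
    nx, ny = neigh_top_left
    return [eval_affine_model(neigh_cpmvs, neigh_w, neigh_h, cx - nx, cy - ny)
            for (cx, cy) in target_corners]
```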
The inter predictor 124 of the video encoding device selects one candidate from the affine AMVP list and generates a motion vector difference between the actual motion vector of each control point and the predicted motion vector of the corresponding control point of the selected candidate. The entropy encoder 155 of the video encoding device encodes type information indicating the affine model type of the target block, index information indicating the candidate selected from among the candidates in the affine AMVP list, and the motion vector difference corresponding to each control point, and transmits them to the video decoding device.
The inter predictor 544 of the video decoding device determines the affine model type using the information signaled from the video encoding device and generates the motion vector difference of each control point. Then, the inter predictor generates an affine AMVP list in the same manner as the video encoding device and selects the candidate indicated by the signaled index information from the affine AMVP list. The inter predictor 544 of the video decoding device calculates the motion vector of each control point by adding the predicted motion vector of each control point of the selected candidate to the corresponding motion vector difference.
IV. Sample-by-sample adjustment for affine motion prediction samples
Sub-block-wise affine motion prediction for a target block has been described above. Another aspect of the invention relates to adjusting sample values of predicted samples generated by sub-block-wise affine motion prediction on a sample-by-sample basis. The motion according to the position of each sample is additionally compensated in each sub-block forming the basis of affine motion prediction.
When I(i, j) denotes the predicted sample value at position (i, j) in any one of the sub-blocks generated as a result of sub-block-by-sub-block affine motion prediction for the target block, the video encoding device calculates the horizontal gradient gx(i, j) and the vertical gradient gy(i, j) at each sample position. Equation 13 may be used to calculate the gradients.
[ Equation 13]
gx(i,j)=I(i+1,j)-I(i-1,j)
gy(i,j)=I(i,j+1)-I(i,j-1)
The sample offset ΔI(i, j) for adjusting the predicted sample is calculated by the following equation.
[ Equation 14]
ΔI(i,j)=gx(i,j)*Δmvx(i,j)+gy(i,j)*Δmvy(i,j)
Here, Δmv(i, j) denotes the motion offset, that is, the difference between the affine motion vector at sample position (i, j) and the affine motion vector at the center position of the sub-block, and can be calculated by applying equation 11 or equation 12 according to the affine model type of the target block. That is, Δmv(i, j) can be calculated by subtracting the motion vector obtained when the sub-block center is substituted into (x, y) from the motion vector obtained when (i, j) is substituted into (x, y) in equation 11 or equation 12. In other words, Δmv(i, j) can be calculated from the equation obtained by substituting, into (x, y), the horizontal and vertical offsets from the sub-block center position to the sample position (i, j) and removing the last terms "+mv_0x" and "+mv_0y" from equations 11 and 12. The center position may be the actual center point of the sub-block or the lower-right sample position of the center point.
The control point motion vectors of the target block used to calculate Δmv(i, j), and the difference between the sample position (i, j) and the center position of the sub-block, are the same for all sub-blocks. Accordingly, the values of Δmv(i, j) need to be calculated for only one sub-block (e.g., the first sub-block) and may be reused for the other sub-blocks.
The technique of the present invention is based on the assumption that an object moves at a constant speed and that the change in sample values is uniform. Therefore, the sample change in the horizontal direction and the sample change in the vertical direction are obtained by multiplying the x-component (Δmv_x) and the y-component (Δmv_y) of Δmv(i, j) by the horizontal sample gradient value and the vertical sample gradient value, respectively. The sample offset ΔI(i, j) is calculated by adding the two sample changes.
The final value of the predicted samples is calculated as follows.
[ Equation 15]
I′(i,j)=I(i,j)+ΔI(i,j)
When sample-by-sample adjustment of affine motion prediction samples is applied, the inter predictor 124 of the video encoding device and the inter predictor 544 of the video decoding device perform the above-described processing to modify the sample values of the predicted samples generated by affine motion prediction. The gradient values are derived from the predicted samples generated by affine motion prediction, and Δmv(i, j) is derived from the control point motion vectors of the target block. Therefore, the video encoding device does not need to signal additional information for this technique to the video decoding device.
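A sketch of the sample-by-sample adjustment (equations 13 to 15) for one sub-block is given below. It assumes the predicted sub-block is a NumPy array padded by one sample on each side so that boundary gradients can be formed, and it uses floating-point motion offsets, whereas an actual codec would use fixed-point arithmetic; the function name is hypothetical.

```python
import numpy as np

def prof_adjust_subblock(pred_ext, cpmvs, block_w, block_h, sub_size=4):
    """Adjust the affine prediction samples of one sub-block (equations 13-15).

    pred_ext is the predicted sub-block extended by one padded sample on each
    side; cpmvs are the control point motion vectors of the target block.
    """
    # Equation 13: horizontal and vertical gradients of the predicted samples.
    gx = pred_ext[1:-1, 2:].astype(np.int32) - pred_ext[1:-1, :-2].astype(np.int32)
    gy = pred_ext[2:, 1:-1].astype(np.int32) - pred_ext[:-2, 1:-1].astype(np.int32)

    # Affine model coefficients without the "+mv0x"/"+mv0y" terms.
    (mv0x, mv0y), (mv1x, mv1y) = cpmvs[0], cpmvs[1]
    ax, ay = (mv1x - mv0x) / block_w, (mv1y - mv0y) / block_w
    if len(cpmvs) == 2:                       # 4-parameter model
        bx, by = -ay, ax
    else:                                     # 6-parameter model
        (mv2x, mv2y) = cpmvs[2]
        bx, by = (mv2x - mv0x) / block_h, (mv2y - mv0y) / block_h

    adjusted = pred_ext[1:-1, 1:-1].astype(np.float64).copy()
    c = sub_size / 2.0                        # sub-block center position
    for j in range(sub_size):
        for i in range(sub_size):
            dx, dy = i - c, j - c             # offset from the center to (i, j);
            dmvx = ax * dx + bx * dy          # these offsets are identical for
            dmvy = ay * dx + by * dy          # every sub-block and can be reused
            # Equations 14 and 15: add the sample offset to the prediction.
            adjusted[j, i] += gx[j, i] * dmvx + gy[j, i] * dmvy
    return adjusted
```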
The sample-by-sample adjustment technique for affine motion prediction samples described above is applied to the luma component. Furthermore, the present technique may also be applied to the chroma components. In this case, the Δmv_x and Δmv_y calculated for the luma component may be used as the Δmv_x and Δmv_y of the chroma components without separate calculation. That is, the video encoding device and the video decoding device calculate gradient values for the predicted samples of the chroma components generated by affine motion prediction. Then, the predicted samples of the chroma components generated by affine motion prediction can be adjusted by substituting the chroma gradient values and the Δmv_x and Δmv_y calculated for the luma component into equations 14 and 15.
In the case of bi-prediction, equations 14 and 15 are applied to each of the two reference pictures. The video encoding device and the video decoding device generate two reference blocks by performing sample-by-sample adjustment on the affine prediction samples generated from the reference picture of reference picture list 0 and from the reference picture of reference picture list 1, respectively; the two reference blocks are generated by equations 14 and 15. The final prediction block of the target block may be generated by averaging the two reference blocks. When the bit depth is 10, the process of generating the final prediction block is expressed as follows.
[ Equation 16]
In equation 16, "I 0 (I, j) +clip3 ()" is a prediction sample in a reference block generated by applying the present technique to a reference picture of reference picture list 0, and "I 1 (I, j) +clip3 ()" is a prediction sample in a reference block generated by applying the present technique to a reference picture of reference picture list 1.
To prevent delay caused by performing the sample-by-sample adjustment technique on affine motion prediction samples, whether applying the technique is appropriate may be determined before the technique is performed, so that the technique may be skipped.
As an example, the video encoding device may determine whether to apply the present technology based on a predefined image region, and signal a flag indicating whether to apply the technology to the video decoding device. Here, the predefined image region may be an image sequence, an image or a slice. When determining the application of the present technology on a sequence-by-sequence, picture-by-picture, or slice-by-slice basis, a flag may be included in the header (SPS) of the sequence, the header (PPS) of the picture, or the slice header. The video decoding apparatus may extract a flag contained in the bitstream and determine whether to apply the present technique to a block in an image area corresponding to the extracted flag.
As another example, whether to apply the present technique to the target block may be determined based on the control point motion vectors of the target block. When the values of the control point motion vectors of the target block are all the same, the technique is not applied. In the case where the affine model type of the target block is the 4-parameter model, the technique is not performed when the control point motion vectors of the upper-left and upper-right corners are the same. In the case of the 6-parameter model, the technique is not performed when the control point motion vectors of the upper-left, upper-right, and lower-left corners are the same.
As another example, whether to apply the present technique may be determined based on an angle between control point motion vectors. For example, when the angle between the control point motion vectors is obtuse (i.e., the dot product between the vectors is negative), the present technique may not be applied. Alternatively, when the angle between the control point motion vectors is an acute angle (i.e., the dot product between the vectors is positive), the application of the present technique may be limited.
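The control-point-based skip checks of the two preceding paragraphs may be sketched as follows. The function name is illustrative; the obtuse-angle (negative dot product) variant is implemented, with the acute-angle alternative noted in the docstring.

```python
def prof_skipped_by_cpmvs(cpmvs) -> bool:
    """Return True when the sample-by-sample adjustment should be skipped.

    The technique is skipped when all control point motion vectors are equal,
    or when the angle between any two of them is obtuse (negative dot product).
    The text also mentions an alternative that restricts the technique for
    acute angles (positive dot product) instead.
    """
    if all(mv == cpmvs[0] for mv in cpmvs):
        return True
    for a in range(len(cpmvs)):
        for b in range(a + 1, len(cpmvs)):
            dot = cpmvs[a][0] * cpmvs[b][0] + cpmvs[a][1] * cpmvs[b][1]
            if dot < 0:                      # obtuse angle between two CPMVs
                return True
    return False
```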
As another example, application of the present technique may be excluded when the control point motion vector of the target block references a reference picture in a different reference picture list.
As another example, to minimize delay, the technique may be restricted from being used together with the combined inter-intra prediction technique, as in the case of bidirectional optical flow. Furthermore, application of the present technique may be excluded when the local illumination compensation described below is applied, or in the case of bi-prediction.
As another example, in unidirectional or bi-directional prediction, if the reference picture referenced by the target block is not a short-term reference picture, the technique is not performed.
As another example, when the current image including the target block or the reference image referred to by the target block is an image divided into sub-images, the technique is not applied.
As another example, the present technology is not performed when bi-prediction is performed and different weights are applied to two prediction blocks (a first reference block and a second reference block) generated by affine motion prediction. That is, when luminance weights to be applied to two reference blocks (luminance components in the two reference blocks) to predict the luminance components are different from each other, the technique is not applied to the target block. Also, when chromaticity weights to be applied to two reference blocks (chromaticity components in the two reference blocks) to predict the chromaticity components are different from each other, the technique is not applied to the target block.
Alternatively, in the case of bi-prediction with different weights, the present technique may be performed by applying the different weights. For example, the video encoding device or the video decoding device generates two reference blocks by performing sample-by-sample adjustment on the affine prediction samples generated from the reference picture of reference picture list 0 and from the reference picture of reference picture list 1, respectively. Then, the final prediction block is generated by applying the corresponding weights to the two reference blocks.
Even after the sample-by-sample adjustment technique for affine motion prediction samples has been started, execution of the technique may be stopped if the values of Δmv_x and Δmv_y are smaller than a predetermined threshold.
V. Local illumination compensation
The local illumination compensation technique is an encoding technique that compensates for illumination changes between the target block and its prediction block using a linear model. The inter predictor 124 of the video encoding device determines a reference block in a reference picture using the (translational) motion vector of the target block, and derives the parameters of the linear model for illumination compensation using the pre-reconstructed samples around the reference block (on the upper and left sides of the reference block) and the corresponding pre-reconstructed samples around the target block (on the upper and left sides of the target block).
When the pre-reconstructed samples around the reference block are denoted by x and the corresponding pre-reconstructed samples around the target block are denoted by y, the parameters "A" and "b" are derived as in equation 17 such that the sum of the squares of the differences between y and (Ax + b) is minimized.
[ Equation 17]
argmin{ ∑ (y - A·x - b)² }
The final prediction samples are generated by applying the weight A and the offset b to the samples in the prediction block (reference block) generated from the motion vector of the target block, as shown in equation 18. In equation 18, pred[x][y] is the prediction sample at position (x, y) generated from the motion vector of the target block, and pred_LIC[x][y] is the final prediction sample after illumination compensation.
[ Equation 18]
pred_LIC[x][y] = A * pred[x][y] + b
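Equations 17 and 18 amount to a least-squares line fit followed by a per-sample linear mapping. The following sketch assumes the neighboring samples are collected into NumPy arrays and uses a generic least-squares solver rather than the codec's integer derivation; the function names are assumptions.

```python
import numpy as np

def lic_parameters(ref_neighbors: np.ndarray, cur_neighbors: np.ndarray):
    """Equation 17: find (A, b) minimizing sum((y - A*x - b)**2), where x are
    the pre-reconstructed samples around the reference block and y the
    corresponding samples around the target block."""
    x = ref_neighbors.astype(np.float64).ravel()
    y = cur_neighbors.astype(np.float64).ravel()
    design = np.stack([x, np.ones_like(x)], axis=1)
    (a, b), *_ = np.linalg.lstsq(design, y, rcond=None)
    return a, b

def lic_apply(pred: np.ndarray, a: float, b: float) -> np.ndarray:
    """Equation 18: pred_LIC[x][y] = A * pred[x][y] + b."""
    return a * pred + b
```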
Another aspect of the present invention relates to techniques for combining illumination compensation techniques with affine motion prediction.
As described above, when sub-block-by-sub-block affine motion prediction is applied to the target block, a motion vector is generated for each sub-block. Illumination compensation parameters could be derived for each sub-block using the corresponding motion vector, and illumination compensation could then be performed in sub-block units. However, this not only increases the computational complexity but also causes a serious delay problem: since illumination compensation of a sub-block requires the reconstructed samples of its neighboring sub-blocks, the illumination compensation process for a sub-block would have to be suspended until the neighboring sub-blocks are reconstructed (that is, until both the prediction block and the residual block of those sub-blocks are reconstructed). The present invention aims to solve these problems.
Fig. 11a, 11b and 11c illustrate various examples of determining the position of a reference block to derive illumination compensation parameters according to an embodiment of the invention. In this embodiment, one illumination compensation parameter set (a, b) for the target block is derived, and the same parameters are applied to all sub-blocks in the target block. That is, the entire target block is modified with one illumination compensation parameter set.
As shown in fig. 11a, the inter predictor 124 of the video encoding device may determine the position of the reference block in the reference picture using the affine motion vector of the sub-block located at the upper left of the target block, or using the upper-left corner control point motion vector of the target block. The pre-reconstructed samples around the reference block determined in this way are used for parameter derivation. Alternatively, as shown in fig. 11b, the affine motion vector of the center sub-block of the target block may be used to determine the position of the reference block. Once the position of the reference block is determined, the illumination compensation parameters are derived using the pre-reconstructed samples adjacent to the upper and left sides of the reference block and the corresponding pre-reconstructed samples adjacent to the upper and left sides of the target block.
As another example, multiple sub-blocks in a target block may be used. As shown in fig. 11c, the inter predictor 124 determines a reference sub-block corresponding to each boundary sub-block using affine motion vectors of sub-blocks (boundary sub-blocks) located at boundaries in the target block. Samples for deriving illumination compensation parameters are extracted from pre-reconstructed samples adjacent to boundary sub-blocks and corresponding reference sub-blocks in the target block, respectively. For a sub-block located at the upper boundary of the target block and a corresponding reference sub-block, samples are extracted from pre-reconstructed samples adjacent to the upper side. For the sub-block located at the left boundary of the target block and the corresponding reference sub-block, samples are extracted from pre-reconstructed samples adjacent to the left.
One or more of the coding tools described above may be used to improve the prediction performance of inter prediction. To address issues such as complexity or delay, applying some encoding tools may require that other encoding tools be excluded from the application.
On the other hand, both the sample-by-sample adjustment of affine prediction samples and the bidirectional optical flow are techniques that modify the predicted samples after prediction, and both use the gradients of the samples for the modification. Accordingly, to reduce computational complexity and hardware complexity, the equations for bidirectional optical flow may be modified into the form of the equations for sample-by-sample adjustment of affine prediction samples in bi-prediction. Alternatively, the equations for sample-by-sample adjustment of affine prediction samples in bi-prediction may be modified into the form of the equations for bidirectional optical flow.
By substituting equation 9 into equation 10, the equation for obtaining the final predicted sample in the bi-directional optical flow can be expressed as follows.
[ Equation 19]
Equation 19 can be rewritten in the form of equation 16 as follows.
[ Equation 20]
That is, the final prediction samples to which bidirectional optical flow is applied may be calculated by equation 20 instead of equation 19. Since equation 20 has the same form as the sample-by-sample adjustment technique for affine prediction, no separate hardware needs to be designed to implement the bidirectional optical flow equation. Further, since equation 20 is expressed as an average of the prediction block from the reference picture in reference picture list 0 and the prediction block from the reference picture in reference picture list 1, the hardware design is simplified.
Furthermore, the motion vector precision of the motion offset (optical flow) (v_x, v_y) of the bidirectional optical flow technique and the motion vector precision of the motion offset (Δmv_x, Δmv_y) of the sample-by-sample adjustment technique for affine prediction may be matched to each other. For example, the motion offsets of both techniques may be expressed with a precision of 1/32 sample units.
It should be appreciated that the above-described exemplary embodiments may be implemented in many different ways. The functions described in one or more examples may be implemented in hardware, software, firmware, or any combination thereof. It should be understood that the functional components described herein have been labeled as "units" to particularly emphasize their implementation independence.
The various functions or methods described in the present invention may be implemented as instructions stored in a non-volatile recording medium, which may be read and executed by one or more processors. Non-volatile recording media include, for example, all types of recording devices in which data is stored in a form readable by a computer system. For example, nonvolatile recording media include storage media such as erasable programmable read-only memory (EPROM), flash memory drives, optical disk drives, magnetic hard disk drives, and Solid State Drives (SSD).
Although the exemplary embodiments have been described for illustrative purposes, those skilled in the art will appreciate that various modifications and changes are possible without departing from the spirit and scope of the embodiments. For brevity and clarity, exemplary embodiments have been described. Accordingly, it should be understood by those of ordinary skill that the scope of the embodiments is not limited by the embodiments explicitly described above, but is included within the claims and their equivalents.
Claims (13)
1. A video decoding device for reconstructing a target block in a current image to be decoded, the video decoding device comprising at least one processor configured to:
determining a first reference picture and a second reference picture and a first motion vector and a second motion vector for bi-prediction by decoding a bitstream;
Generating a first reference block from a first reference image referenced by a first motion vector, and generating a second reference block from a second reference image referenced by a second motion vector;
generating a prediction block of the target block using the first reference block and the second reference block;
Generating a residual block of the target block based on the residual signal decoded from the bitstream;
reconstructing a target block based on a prediction block of the target block and a residual block of the target block,
Wherein the at least one processor is configured to generate a predicted block of the target block using the first reference block and the second reference block based on:
Executing a first encoding tool configured to generate a predicted block of a target block by performing bidirectional optical flow processing using a first reference block and a second reference block,
Wherein the first encoding tool is applied only to the luma component of the target block,
Wherein when the luminance weights assigned to each of the first reference image and the second reference image for predicting the luminance component of the target block are different from each other, the first encoding tool is not executed, and
When the chroma weights assigned to each of the first and second reference pictures for predicting the chroma component of the target block are different from each other, the first encoding tool is not performed.
2. The video decoding device of claim 1, wherein the at least one processor is further configured to:
predicting luminance components of the target block by applying luminance weights respectively corresponding to the first and second reference blocks when the luminance weights assigned to each of the first and second reference images are different from each other, and
When the chroma weights assigned to each of the first and second reference pictures are different from each other, the chroma components of the target block are predicted by applying the chroma weights corresponding to the first and second reference blocks, respectively, to the first and second reference blocks.
3. The video decoding device of claim 1, wherein when executing the first encoding tool, the at least one processor is configured to generate a prediction block for a sub-block partitioned from a target block based on:
Generating a first horizontal gradient and a first vertical gradient for each luminance sample of a sub-block in a first reference block corresponding to a sub-block of the target block,
Generating a second horizontal gradient and a second vertical gradient for each luminance sample of a sub-block in a second reference block corresponding to the sub-block of the target block,
Calculating a motion offset corresponding to a sub-block of the target block using the first and second horizontal gradients and the first and second vertical gradients for the luminance sample, and
The luminance samples in the sub-blocks of the target block are predicted using the luminance sample values of the sub-blocks of the first reference block, the luminance sample values of the sub-blocks of the second reference block, and the motion offset.
4. The video decoding device of claim 3, wherein predicting luma samples in sub-blocks of the target block comprises:
calculating a sample offset of the luminance sample position in a sub-block of the target block using a difference between the first horizontal gradient and the second horizontal gradient corresponding to the luminance sample position, a difference between the first vertical gradient and the second vertical gradient corresponding to the luminance sample position, and a motion offset corresponding to the sub-block of the target block, and
Luminance samples of a luminance sample position are predicted using luminance sample values in a first reference block and a second reference block corresponding to the luminance sample position and a sample offset of the luminance sample position.
5. The video decoding device of claim 1, wherein the at least one processor is configured to generate the prediction block of the target block using the first reference block and the second reference block further based on:
executing a second encoding tool configured to generate an inter-prediction block using the first reference block and the second reference block, generate an intra-prediction block by performing intra-prediction on the target block, and generate the prediction block of the target block by weighted-averaging the inter-prediction block and the intra-prediction block,
wherein execution of the second encoding tool restricts execution of the first encoding tool.
6. The video decoding device of claim 5, wherein:
when the second encoding tool is executed, the intra-prediction block is generated using a planar mode among a plurality of intra-prediction modes.
7. The video decoding device of claim 5, wherein:
the weight value used for the weighted averaging is determined by the number of intra-predicted blocks among neighboring blocks including the left block and the upper block of the target block.
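Claims 5 through 7 describe a combined inter/intra prediction whose blending weight depends on how many of the left and upper neighbors are intra-coded. The sketch below mirrors that idea with a 1/4-step weighting common in combined-prediction schemes; the exact weight values and the rounding are assumptions made for illustration, not the patent's normative choices.

```python
def combined_inter_intra(inter_pred, intra_pred, left_is_intra, above_is_intra):
    """Weighted average of an inter prediction and a planar-mode intra prediction.
    The intra weight grows with the number of intra-coded neighbours
    (left and upper blocks): 1, 2 or 3 out of 4 (illustrative values only)."""
    intra_weight = 1 + int(left_is_intra) + int(above_is_intra)
    inter_weight = 4 - intra_weight
    # Integer weighted average with rounding; inputs are assumed to be
    # integer sample arrays of equal shape (e.g. numpy arrays).
    return (inter_weight * inter_pred + intra_weight * intra_pred + 2) // 4
```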
8. A video encoding device for encoding a target block in a current image to be encoded, the video encoding device comprising at least one processor configured to:
determine a first motion vector and a second motion vector for bi-directional prediction;
generate a first reference block from a first reference image referenced by the first motion vector, and generate a second reference block from a second reference image referenced by the second motion vector;
generate a prediction block of the target block using the first reference block and the second reference block;
generate a residual block based on the prediction block of the target block, and encode the residual block,
wherein the at least one processor is configured to generate the prediction block of the target block using the first reference block and the second reference block based on:
executing a first encoding tool configured to generate the prediction block of the target block by performing bi-directional optical flow using the first reference block and the second reference block,
wherein the first encoding tool is applied only to the luma component of the target block,
wherein, when the luma weights assigned to each of the first reference image and the second reference image for predicting the luma component of the target block are different from each other, the first encoding tool is not executed, and
when the chroma weights assigned to each of the first reference image and the second reference image for predicting the chroma component of the target block are different from each other, the first encoding tool is not executed.
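For orientation, the encoder-side flow of claim 8 reduces to: fetch the two reference blocks addressed by the bi-directional motion vectors, build the prediction (refined by the first encoding tool for luma when the reference-image weights are equal), and encode the difference to the target block. The fragment below sketches only that control flow; the reference blocks are assumed to be already motion-compensated, `bdof_refine` is a hypothetical stand-in for the first encoding tool, and transform, quantization, and entropy coding of the residual are abbreviated to returning the residual itself.

```python
import numpy as np

def encode_target_block(target, ref_block0, ref_block1, equal_weights, bdof_refine):
    """Illustrative encoder-side flow for one bi-directionally predicted block."""
    if equal_weights:
        # First encoding tool (bi-directional optical flow), luma only.
        pred = bdof_refine(ref_block0, ref_block1)
    else:
        # Unequal reference-image weights: the tool is not executed; here a
        # plain average stands in for the weighted bi-prediction.
        pred = (ref_block0.astype(np.int32) + ref_block1.astype(np.int32) + 1) >> 1
    residual = target.astype(np.int32) - pred
    return pred, residual
```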
9. The video encoding device of claim 8, wherein, when executing the first encoding tool, the at least one processor is configured to generate a prediction block for a sub-block partitioned from the target block based on:
generating a first horizontal gradient and a first vertical gradient for each luma sample of a sub-block of the first reference block corresponding to the sub-block of the target block,
generating a second horizontal gradient and a second vertical gradient for each luma sample of a sub-block of the second reference block corresponding to the sub-block of the target block,
calculating a motion offset corresponding to the sub-block of the target block using the first and second horizontal gradients and the first and second vertical gradients of the luma samples, and
predicting the luma samples in the sub-block of the target block using the luma sample values of the sub-block of the first reference block, the luma sample values of the sub-block of the second reference block, and the motion offset.
10. The video encoding device of claim 9, wherein predicting the luma samples in the sub-block of the target block comprises:
calculating a sample offset for a luma sample position in the sub-block of the target block using a difference between the first horizontal gradient and the second horizontal gradient corresponding to the luma sample position, a difference between the first vertical gradient and the second vertical gradient corresponding to the luma sample position, and the motion offset corresponding to the sub-block of the target block, and
predicting the luma sample at the luma sample position using the luma sample values in the first reference block and the second reference block corresponding to the luma sample position and the sample offset for the luma sample position.
11. The video encoding device of claim 8, wherein the at least one processor is configured to generate the prediction block of the target block using the first reference block and the second reference block further based on:
executing a second encoding tool configured to generate an inter-prediction block using the first reference block and the second reference block, generate an intra-prediction block by performing intra-prediction on the target block, and generate the prediction block of the target block by weighted-averaging the inter-prediction block and the intra-prediction block,
wherein execution of the second encoding tool restricts execution of the first encoding tool.
12. The video encoding device of claim 8, wherein:
the weight used for the weighted averaging is determined by the number of intra-predicted blocks among neighboring blocks including the left block and the upper block of the target block.
13. A method for transmitting a bitstream including video data to a decoding device, the method comprising:
generating a bitstream by encoding video data, the encoding comprising encoding a target block in a current image to be encoded; and
transmitting the bitstream to the decoding device,
wherein encoding the target block comprises:
determining a first motion vector and a second motion vector for bi-directional prediction;
generating a first reference block from a first reference image referenced by the first motion vector, and generating a second reference block from a second reference image referenced by the second motion vector;
generating a prediction block of the target block using the first reference block and the second reference block;
generating a residual block based on the prediction block of the target block, and encoding the residual block,
wherein generating the prediction block of the target block using the first reference block and the second reference block comprises:
executing a first encoding tool configured to generate the prediction block of the target block by performing bi-directional optical flow processing using the first reference block and the second reference block,
wherein the first encoding tool is applied only to the luma component of the target block,
wherein, when the luma weights assigned to each of the first reference image and the second reference image for predicting the luma component of the target block are different from each other, the first encoding tool is not executed, and
when the chroma weights assigned to each of the first reference image and the second reference image for predicting the chroma component of the target block are different from each other, the first encoding tool is not executed.
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
KR10-2019-0123491 | 2019-10-06 | ||
KR20190123491 | 2019-10-06 | ||
KR10-2019-0158564 | 2019-12-02 | ||
KR20190158564 | 2019-12-02 | ||
PCT/KR2020/012976 WO2021071145A1 (en) | 2019-10-06 | 2020-09-24 | Method and device for encoding and decoding video by means of inter prediction |
CN202080070159.4A CN114503560B (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter prediction |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202080070159.4A Division CN114503560B (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter prediction |
Publications (1)
Publication Number | Publication Date |
---|---|
CN119364016A true CN119364016A (en) | 2025-01-24 |
Family
ID=75477594
Family Applications (5)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202411528448.0A Pending CN119364016A (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter-frame prediction |
CN202080070159.4A Active CN114503560B (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter prediction |
CN202411528007.0A Pending CN119364015A (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter-frame prediction |
CN202411528679.1A Pending CN119364017A (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter-frame prediction |
CN202411528195.7A Pending CN119402661A (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter prediction |
Family Applications After (4)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202080070159.4A Active CN114503560B (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter prediction |
CN202411528007.0A Pending CN119364015A (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter-frame prediction |
CN202411528679.1A Pending CN119364017A (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter-frame prediction |
CN202411528195.7A Pending CN119402661A (en) | 2019-10-06 | 2020-09-24 | Method and apparatus for encoding and decoding video by means of inter prediction |
Country Status (2)
Country | Link |
---|---|
KR (1) | KR20210040787A (en) |
CN (5) | CN119364016A (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023219288A1 (en) * | 2022-05-12 | 2023-11-16 | 현대자동차주식회사 | Method for inter-prediction of chroma component using bi-prediction |
EP4354870A1 (en) * | 2022-10-12 | 2024-04-17 | Beijing Xiaomi Mobile Software Co., Ltd. | Encoding/decoding video picture data |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20130050407A (en) * | 2011-11-07 | 2013-05-16 | 오수미 | Method for generating motion information in inter prediction mode |
WO2017171107A1 (en) * | 2016-03-28 | 2017-10-05 | 엘지전자(주) | Inter-prediction mode based image processing method, and apparatus therefor |
WO2018166357A1 (en) * | 2017-03-16 | 2018-09-20 | Mediatek Inc. | Method and apparatus of motion refinement based on bi-directional optical flow for video coding |
TW201902223A (en) * | 2017-03-24 | 2019-01-01 | 聯發科技股份有限公司 | Method and apparatus of bi-directional optical flow for overlapped block motion compensation in video coding |
KR102409430B1 (en) * | 2017-04-24 | 2022-06-15 | 에스케이텔레콤 주식회사 | Method and Apparatus for Estimating Optical Flow for Motion Compensation |
WO2018221631A1 (en) * | 2017-06-02 | 2018-12-06 | パナソニック インテレクチュアル プロパティ コーポレーション オブ アメリカ | Encoding device, decoding device, encoding method, and decoding method |
CN110832858B (en) * | 2017-07-03 | 2023-10-13 | Vid拓展公司 | Apparatus and method for video encoding and decoding |
KR102580910B1 (en) * | 2017-08-29 | 2023-09-20 | 에스케이텔레콤 주식회사 | Motion Compensation Method and Apparatus Using Bi-directional Optical Flow |
- 2020-09-24 CN CN202411528448.0A patent/CN119364016A/en active Pending
- 2020-09-24 CN CN202080070159.4A patent/CN114503560B/en active Active
- 2020-09-24 KR KR1020200123619A patent/KR20210040787A/en active Pending
- 2020-09-24 CN CN202411528007.0A patent/CN119364015A/en active Pending
- 2020-09-24 CN CN202411528679.1A patent/CN119364017A/en active Pending
- 2020-09-24 CN CN202411528195.7A patent/CN119402661A/en active Pending
Also Published As
Publication number | Publication date |
---|---|
CN119402661A (en) | 2025-02-07 |
CN119364017A (en) | 2025-01-24 |
KR20210040787A (en) | 2021-04-14 |
CN114503560B (en) | 2024-11-29 |
CN119364015A (en) | 2025-01-24 |
CN114503560A (en) | 2022-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11677937B2 (en) | | Method and apparatus for encoding and decoding video using inter-prediction |
US11432002B2 (en) | | Method and apparatus for encoding and decoding video using inter-prediction |
US20240397085A1 (en) | | Method and apparatus for encoding and decoding video using inter-prediction |
CN114503560B (en) | 2024-11-29 | Method and apparatus for encoding and decoding video by means of inter prediction |
CN114128290B (en) | | Video decoding method, video encoding method and method for providing bit stream related to video data |
CN116941241A (en) | | Video encoding and decoding method and apparatus using matrix-based cross component prediction |
US11838538B2 (en) | | Method and apparatus for encoding and decoding video using inter-prediction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||