
CN114424537B - Systems and methods for performing inter-frame predictive coding in video coding - Google Patents

Systems and methods for performing inter-frame predictive coding in video coding

Info

Publication number
CN114424537B
CN114424537B (application CN202080062841.9A)
Authority
CN
China
Prior art keywords
video
array
prediction
sample values
values
Prior art date
Legal status
Active
Application number
CN202080062841.9A
Other languages
Chinese (zh)
Other versions
CN114424537A (en)
Inventor
Frank Bossen
Current Assignee
Sharp Corp
Original Assignee
Sharp Corp
Priority date
Filing date
Publication date
Application filed by Sharp Corp
Publication of CN114424537A
Application granted
Publication of CN114424537B


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/103Selection of coding mode or of prediction mode
    • H04N19/109Selection of coding mode or of prediction mode among a plurality of temporal predictive coding modes
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/157Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/186Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being a colour or a chrominance component
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/30Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/50Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
    • H04N19/503Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving temporal prediction
    • H04N19/51Motion estimation or motion compensation
    • H04N19/573Motion compensation with multiple frame prediction using two or more reference frames in a given prediction direction
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/119Adaptive subdivision aspects, e.g. subdivision of a picture into rectangular or non-rectangular coding blocks

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a method for decoding video data. The method includes receiving a first array of prediction sample values, receiving a second array of prediction sample values, determining a scaling value based on a color component index value and a video sampling format of the video data, and generating a third array of prediction sample values by applying a blending value to the first array of prediction sample values and the second array of prediction sample values, wherein the blending value is based on the scaling value.

Description

System and method for performing inter-prediction coding in video coding
Technical Field
The present disclosure relates to video coding, and more particularly to techniques for performing inter prediction.
Background
Digital video functionality may be incorporated into a wide range of devices, including digital televisions, laptop or desktop computers, tablet computers, digital recording devices, digital media players, video gaming devices, cellular telephones (including so-called smartphones), medical imaging devices, and the like. Digital video may be coded according to a video coding standard. A video coding standard defines the format of a compliant bitstream encapsulating coded video data. A compliant bitstream is a data structure that may be received and decoded by a video decoding device to generate reconstructed video data. A video coding standard may incorporate video compression techniques. Examples of video coding standards include ISO/IEC MPEG-4 Visual, ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), and High Efficiency Video Coding (HEVC). HEVC is described in High Efficiency Video Coding (HEVC), Rec. ITU-T H.265, December 2016, which is incorporated by reference and referred to herein as ITU-T H.265. Extensions and improvements to ITU-T H.265 are currently being considered for the development of next-generation video coding standards. For example, the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG), collectively referred to as the Joint Video Exploration Team (JVET), are studying the standardization of video coding technology with a compression capability that significantly exceeds that of the current HEVC standard. The Joint Exploration Model 7 (JEM 7), Algorithm Description of Joint Exploration Test Model 7 (JEM 7), ISO/IEC JTC1/SC29/WG11 document JVET-G1001 (July 2017, Torino, IT), describes the coding features under coordinated test model study by the JVET as potentially enhancing video coding technology beyond the capabilities of ITU-T H.265. It should be noted that the coding features of JEM 7 are implemented in JEM reference software. As used herein, the term JEM may collectively refer to the algorithms included in JEM 7 and implementations of JEM reference software. Further, in response to the "Joint Call for Proposals on Video Compression with Capabilities beyond HEVC" jointly issued by VCEG and MPEG, various groups proposed descriptions of video coding tools at the 10th meeting of ISO/IEC JTC1/SC29/WG11, held in San Diego, CA, 16-20 April 2018. Based on those descriptions of video coding tools, a resulting initial draft text of a video coding specification is described in "Versatile Video Coding (Draft 1)", 10th meeting of ISO/IEC JTC1/SC29/WG11, 16-20 April 2018, document JVET-J1001-v2, which is incorporated by reference herein and referred to as JVET-J1001. The current development of the next-generation video coding standard by VCEG and MPEG is referred to as the Versatile Video Coding (VVC) project. "Versatile Video Coding (Draft 6)", 15th meeting of ISO/IEC JTC1/SC29/WG11, held in Gothenburg, SE, 3-12 July 2019, document JVET-O2001-vE, which is incorporated by reference herein and referred to as JVET-O2001, represents the current iteration of the draft text of the video coding specification corresponding to the VVC project.
Video compression techniques can reduce the data requirements for storing and transmitting video data. Video compression techniques may reduce data requirements by exploiting the redundancies inherent in a video sequence. Video compression techniques may sub-divide a video sequence into successively smaller portions (i.e., groups of pictures within a video sequence, a picture within a group of pictures, regions within a picture, sub-regions within regions, etc.). Intra prediction coding techniques (e.g., spatial prediction techniques within a picture) and inter prediction techniques (i.e., temporal prediction techniques between pictures) may be used to generate difference values between a unit of video data to be coded and a reference unit of video data. The difference values may be referred to as residual data. Residual data may be coded as quantized transform coefficients. Syntax elements may relate residual data and a reference coding unit (e.g., intra prediction mode indices and motion information). Residual data and syntax elements may be entropy coded. Entropy encoded residual data and syntax elements may be included in data structures forming a compliant bitstream.
Disclosure of Invention
In one example, a method of decoding video data includes receiving a first array of predicted sample values, receiving a second array of predicted sample values, determining a scaling value based on a color component index value and a video sampling format of the video data, and generating a third array of predicted sample values by applying a blending value to the first array of predicted sample values and the second array of predicted sample values, wherein the blending value is based on the scaling value.
In one example, an apparatus includes one or more processors configured to receive a first array of predicted sample values, receive a second array of predicted sample values, determine a scaling value based on a color component index value and a video sampling format of video data, and generate a third array of predicted sample values by applying a blending value to the first array of predicted sample values and the second array of predicted sample values, wherein the blending value is based on the scaling value.
Drawings
Fig. 1 is a conceptual diagram illustrating an example of a set of pictures encoded according to a quadtree multi-way tree partitioning in accordance with one or more techniques of the present disclosure.
Fig. 2A is a conceptual diagram illustrating an example of encoding a block of video data according to one or more techniques of this disclosure.
Fig. 2B is a conceptual diagram illustrating an example of encoding a block of video data according to one or more techniques of this disclosure.
Fig. 3 is a conceptual diagram illustrating an example of a video component sampling format that may be used in accordance with one or more techniques of this disclosure.
Fig. 4 is a conceptual diagram illustrating a data structure encapsulating encoded video data and corresponding metadata according to one or more techniques of the present disclosure.
Fig. 5 is a block diagram illustrating an example of a system that may be configured to encode and decode video data in accordance with one or more techniques of the present disclosure.
Fig. 6 is a block diagram illustrating an example of a video encoder that may be configured to encode video data in accordance with one or more techniques of the present disclosure.
Fig. 7 is a block diagram illustrating an example of a video decoder that may be configured to decode video data in accordance with one or more techniques of this disclosure.
Detailed Description
In general, this disclosure describes various techniques for encoding video data. In particular, this disclosure describes techniques for performing inter prediction. It should be noted that although the techniques of this disclosure are described with respect to ITU-T H.264, ITU-T H.265, JEM, and JVET-O2001, the techniques of this disclosure are generally applicable to video coding. For example, the coding techniques described herein may be incorporated into video coding systems (including video coding systems based on future video coding standards) that include block structures, intra prediction techniques, inter prediction techniques, transform techniques, filtering techniques, and/or entropy coding techniques other than those included in ITU-T H.265, JEM, and JVET-O2001. Accordingly, references to ITU-T H.264, ITU-T H.265, JEM, and/or JVET-O2001 are for descriptive purposes and should not be construed as limiting the scope of the techniques described herein. Furthermore, it should be noted that the incorporation of documents by reference herein is for descriptive purposes and should not be construed as limiting or creating ambiguity with respect to the terms used herein. For example, where a definition of a term provided in an incorporated reference differs from that of another incorporated reference and/or from the term as used herein, the term should be interpreted broadly to include each respective definition and/or to include each particular definition in the alternative.
In one example, an apparatus for encoding video data includes one or more processors configured to receive a first array of predicted sample values, receive a second array of predicted sample values, determine a scaling value based on a color component index value and a video sampling format of the video data, generate a third array of predicted sample values by applying a mixing matrix to the first array of predicted sample values and the second array of predicted sample values, wherein the mixing matrix is based on the scaling value, and perform video encoding using the third array of predicted sample values.
In one example, a non-transitory computer-readable storage medium includes instructions stored thereon that, when executed, cause one or more processors of a device to receive a first array of predicted sample values, receive a second array of predicted sample values, determine a scaling value based on a color component index value and a video sampling format of video data, generate a third array of predicted sample values by applying a mixing matrix to the first array of predicted sample values and the second array of predicted sample values, wherein the mixing matrix is based on the scaling value, and perform video encoding using the third array of predicted sample values.
In one example, an apparatus includes means for receiving a first array of predicted sample values, means for receiving a second array of predicted sample values, means for determining a scaling value based on color component index values and a video sampling format of video data, means for generating a third array of predicted sample values by applying a mixing matrix to the first array of predicted sample values and the second array of predicted sample values, wherein the mixing matrix is based on the scaling value, and means for performing video encoding using the third array of predicted sample values.
The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Video content comprises a video sequence composed of a series of frames (or pictures). A series of frames may also be referred to as a group of pictures (GOP). Each video frame or picture may be divided into one or more regions. Regions may be defined according to a base unit (e.g., a video block) and sets of rules defining a region. For example, a rule defining a region may be that a region must be an integer number of video blocks arranged in a rectangle. Further, video blocks in a region may be ordered according to a scan pattern (e.g., a raster scan). As used herein, the term "video block" may generally refer to an area of a picture, or may more specifically refer to the largest array of sample values that may be predictively coded, sub-divisions thereof, and/or corresponding structures. Furthermore, the term "current video block" may refer to an area of a picture being encoded or decoded. A video block may be defined as an array of sample values. It should be noted that in some cases pixel values may be described as including sample values of respective components of video data, which may also be referred to as color components (e.g., luma (Y) and chroma (Cb and Cr) components, or red, green, and blue components). It should be noted that in some cases the terms "pixel value" and "sample value" are used interchangeably. Further, in some cases, a pixel or sample may be referred to as a pel. A video sampling format, which may also be referred to as a chroma format, may define the number of chroma samples included in a video block with respect to the number of luma samples included in the video block. For example, for the 4:2:0 sampling format, the sampling rate for the luma component is twice that of the chroma components for both the horizontal and vertical directions.
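To make the subsampling arithmetic concrete, the following C sketch (with hypothetical helper names, not from the patent text) derives the chroma array dimensions of a block from its luma dimensions for the common sampling formats.

```c
#include <stdio.h>

/* Hypothetical helper: derive chroma block dimensions from luma dimensions.
   For 4:2:0, chroma is subsampled by 2 both horizontally and vertically;
   for 4:2:2, only horizontally; for 4:4:4, not at all. */
static void chroma_dims(int luma_w, int luma_h, int sub_w, int sub_h,
                        int *chroma_w, int *chroma_h) {
    *chroma_w = luma_w / sub_w;
    *chroma_h = luma_h / sub_h;
}

int main(void) {
    int cw, ch;
    chroma_dims(16, 16, 2, 2, &cw, &ch); /* 4:2:0 */
    printf("4:2:0 -> %dx%d\n", cw, ch);  /* 8x8   */
    chroma_dims(16, 16, 2, 1, &cw, &ch); /* 4:2:2 */
    printf("4:2:2 -> %dx%d\n", cw, ch);  /* 8x16  */
    chroma_dims(16, 16, 1, 1, &cw, &ch); /* 4:4:4 */
    printf("4:4:4 -> %dx%d\n", cw, ch);  /* 16x16 */
    return 0;
}
```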
A video encoder may perform predictive encoding on video blocks and their sub-partitions. Video blocks and their sub-partitions may be referred to as nodes. ITU-T H.264 specifies a macroblock comprising 16 x 16 luma samples. That is, in ITU-T H.264, a picture is segmented into macroblocks. ITU-T H.265 specifies an analogous Coding Tree Unit (CTU) structure, which may be referred to as a Largest Coding Unit (LCU). In ITU-T H.265, pictures are segmented into CTUs. In ITU-T H.265, for a picture, the CTU size may be set to include 16 x 16, 32 x 32, or 64 x 64 luma samples. In ITU-T H.265, a CTU is composed of respective Coding Tree Blocks (CTBs) for each component of video data, e.g., luma (Y) and chroma (Cb and Cr). It should be noted that video having one luma component and two corresponding chroma components may be described as having two channels, i.e., a luma channel and a chroma channel. Furthermore, in ITU-T H.265, a CTU may be partitioned according to a quadtree (QT) partitioning structure, which causes the CTBs of the CTU to be partitioned into Coding Blocks (CBs). That is, in ITU-T H.265, a CTU may be partitioned into quadtree leaf nodes. According to ITU-T H.265, one luma CB together with two corresponding chroma CBs and associated syntax elements is referred to as a Coding Unit (CU). In ITU-T H.265, the minimum allowed size of a CB may be signaled. In ITU-T H.265, the smallest allowed minimum size of a luma CB is 8 x 8 luma samples. In ITU-T H.265, the decision to code a picture area using intra prediction or inter prediction is made at the CU level.
In ITU-T H.265, a CU is associated with a Prediction Unit (PU) structure having its root at the CU. In ITU-T H.265, PU structures allow luma and chroma CBs to be split for purposes of generating corresponding reference samples. That is, in ITU-T H.265, luma and chroma CBs may be split into respective luma and chroma Prediction Blocks (PBs), where a PB includes a block of sample values for which the same prediction is applied. In ITU-T H.265, a CB may be partitioned into 1, 2, or 4 PBs. ITU-T H.265 supports PB sizes from 64 x 64 samples down to 4 x 4 samples. In ITU-T H.265, square PBs are supported for intra prediction, where a CB may form the PB or the CB may be split into four square PBs. In ITU-T H.265, in addition to square PBs, rectangular PBs are supported for inter prediction, where a CB may be halved vertically or horizontally to form PBs. Further, it should be noted that in ITU-T H.265, for inter prediction, four asymmetric PB partitions are supported, where the CB is partitioned into two PBs at one quarter of the height (at the top or the bottom) or one quarter of the width (at the left or the right) of the CB. Intra prediction data (e.g., intra prediction mode syntax elements) or inter prediction data (e.g., motion data syntax elements) corresponding to a PB is used to produce reference and/or prediction sample values for the PB.
JEM specifies a CTU having a maximum size of 256 x 256 luma samples. JEM specifies a quadtree plus binary tree (QTBT) block structure. In JEM, the QTBT structure allows quadtree leaf nodes to be further partitioned by a binary tree (BT) structure. That is, in JEM, the binary tree structure allows quadtree leaf nodes to be recursively divided vertically or horizontally. In JVET-O2001, CTUs are partitioned according to a quadtree plus multi-type tree (QTMT or QT+MTT) structure. The QTMT in JVET-O2001 is similar to the QTBT in JEM. However, in JVET-O2001, in addition to indicating binary splits, the multi-type tree may indicate so-called ternary (or triple tree (TT)) splits. A ternary split divides a block vertically or horizontally into three blocks; the offsets are sketched in the code after this paragraph. In the case of a vertical TT split, the block is split at one quarter of its width from the left edge and at one quarter of its width from the right edge, and in the case of a horizontal TT split, the block is split at one quarter of its height from the top edge and at one quarter of its height from the bottom edge. Referring to Fig. 1, Fig. 1 illustrates an example of a CTU being partitioned into quadtree leaf nodes, with the quadtree leaf nodes being further partitioned according to BT splits or TT splits. That is, in Fig. 1, dashed lines indicate additional binary and ternary splits in the quadtree.
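To make the TT geometry concrete, a brief C sketch (names are ours) of the resulting sub-block widths for a vertical ternary split; horizontal splits partition the height the same way.

```c
/* Illustrative TT split: a vertical ternary split of a block of width w
   produces sub-blocks of widths w/4, w/2, and w/4, i.e., split lines at
   one quarter of the width from the left and right edges. */
static void tt_split_widths(int w, int widths[3]) {
    widths[0] = w / 4;
    widths[1] = w / 2;
    widths[2] = w / 4;
}
```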
As described above, each video frame or picture may be divided into one or more regions. For example, according to ITU-T H.265, each video frame or picture may be partitioned to include one or more slices and further partitioned to include one or more tiles, where each slice includes a sequence of CTUs (e.g., in raster scan order) and where a tile is a sequence of CTUs corresponding to a rectangular area of a picture. It should be noted that in ITU-T H.265 a slice is a sequence of one or more slice segments starting with an independent slice segment and containing all subsequent dependent slice segments (if any) that precede the next independent slice segment (if any). A slice segment, like a slice, is a sequence of CTUs. Thus, in some cases, the terms "slice" and "slice segment" may be used interchangeably to indicate a sequence of CTUs arranged in raster scan order. Further, it should be noted that in ITU-T H.265 a tile may consist of CTUs contained in more than one slice, and a slice may consist of CTUs contained in more than one tile. However, ITU-T H.265 provides that one or both of the following shall be fulfilled: (1) all CTUs in a slice belong to the same tile; and (2) all CTUs in a tile belong to the same slice.
With respect to JVET-O2001, slices are required to consist of an integer number of bricks rather than merely an integer number of CTUs. In JVET-O2001, a brick is a rectangular region of CTU rows within a particular tile in a picture. Further, in JVET-O2001, a tile may be partitioned into multiple bricks, each of which consists of one or more CTU rows within the tile. A tile that is not partitioned into multiple bricks is also referred to as a brick. However, a brick that is a proper subset of a tile is not referred to as a tile. Thus, some video coding techniques may or may not support slices comprising a set of CTUs that do not form a rectangular region of a picture. Further, it should be noted that in some cases a slice may be required to consist of an integer number of complete tiles, in which case the slice is referred to as a tile group. The techniques described herein may be applicable to bricks, slices, tiles, and/or tile groups. Fig. 1 is a conceptual diagram illustrating an example of a group of pictures including slices. In the example illustrated in Fig. 1, Pic3 is illustrated as including two slices (i.e., slice 0 and slice 1). In the example illustrated in Fig. 1, slice 0 includes one brick, i.e., brick 0, and slice 1 includes two bricks, i.e., brick 1 and brick 2. It should be noted that in some cases slice 0 and slice 1 may satisfy the requirements of, and be classified as, tiles and/or tile groups.
A video sampling format, which may also be referred to as a chroma format, may define the number of chroma samples included in a CU with respect to the number of luma samples included in a CU. For example, for the 4:2:0 sampling format, the sampling rate for the luma component is twice that of the chroma components for both the horizontal and vertical directions. As a result, for a CU formatted according to the 4:2:0 format, the width and height of the sample arrays for the luma component are twice the width and height of each sample array for the chroma components. Fig. 3 is a conceptual diagram illustrating an example of a coding unit formatted according to a 4:2:0 sample format. Fig. 3 illustrates the relative position of chroma samples with respect to luma samples within a CU. As described above, a CU is typically defined according to the number of horizontal and vertical luma samples. Thus, as illustrated in Fig. 3, a 16 x 16 CU formatted according to the 4:2:0 sample format includes 16 x 16 samples of the luma component and 8 x 8 samples for each chroma component. Further, in the example illustrated in Fig. 3, the relative position of chroma samples with respect to luma samples for video blocks neighboring the 16 x 16 CU is illustrated. For a CU formatted according to the 4:2:2 format, the width of the sample array for the luma component is twice the width of the sample array for each chroma component, but the height of the sample array for the luma component is equal to the height of the sample array for each chroma component. Further, for a CU formatted according to the 4:4:4 format, the sample array for the luma component has the same width and height as the sample array for each chroma component.
Table 1 shows how the chroma format is specified in JVET-O2001 based on the values of the syntax elements chroma_format_idc and separate_colour_plane_flag. Further, Table 1 shows how the variables SubWidthC and SubHeightC are derived from the chroma format. SubWidthC and SubHeightC are used, for example, for deblocking. With respect to Table 1, JVET-O2001 provides the following:
In monochrome sampling, there is only one sample array, which is nominally considered the luma array.
In 4:2:0 sampling, each of the two chroma arrays has half the height and half the width of the luma array.
In 4:2:2 sampling, each of the two chroma arrays has the same height and half the width of the luma array.
In 4:4:4 sampling, depending on the value of separate_colour_plane_flag, the following applies:
If separate_colour_plane_flag is equal to 0, each of the two chroma arrays has the same height and width as the luma array.
Otherwise (separate_colour_plane_flag is equal to 1), the three colour planes are separately processed as monochrome sampled pictures.
chroma_format_idc | separate_colour_plane_flag | Chroma format | SubWidthC | SubHeightC
0                 | 0                          | Monochrome    | 1         | 1
1                 | 0                          | 4:2:0         | 2         | 2
2                 | 0                          | 4:2:2         | 2         | 1
3                 | 0                          | 4:4:4         | 1         | 1
3                 | 1                          | 4:4:4         | 1         | 1
TABLE 1
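As an illustration, a direct C transcription of the Table 1 mapping might look as follows; the structure and function names are ours, not from the draft text.

```c
/* Sketch of the Table 1 mapping: derive the chroma subsampling divisors
   SubWidthC and SubHeightC from chroma_format_idc. Structure and function
   names are illustrative. */
typedef struct { int sub_width_c; int sub_height_c; } ChromaSubsampling;

static ChromaSubsampling derive_subsampling(int chroma_format_idc) {
    ChromaSubsampling s = {1, 1};        /* monochrome (0) and 4:4:4 (3) */
    if (chroma_format_idc == 1) {        /* 4:2:0 */
        s.sub_width_c = 2;
        s.sub_height_c = 2;
    } else if (chroma_format_idc == 2) { /* 4:2:2 */
        s.sub_width_c = 2;
        s.sub_height_c = 1;
    }
    return s;
}
```

Note that, per Table 1, separate_colour_plane_flag does not change the divisors; it only determines whether the three 4:4:4 colour planes are processed as separate monochrome pictures.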
It should be noted that for a given sampling format, such as the 4:2:0 sample format, a chroma location type may be specified. That is, for example, for the 4:2:0 sample format, chroma samples may be assigned a horizontal offset value and a vertical offset value relative to luma samples that indicate their relative spatial positioning. Table 2 provides the definitions of HorizontalOffsetC and VerticalOffsetC for the six chroma location types (ChromaLocType 0 to 5) provided in JVET-O2001.
ChromaLocType | HorizontalOffsetC | VerticalOffsetC
0             | 0                 | 0.5
1             | 0.5               | 0.5
2             | 0                 | 0
3             | 0.5               | 0
4             | 0                 | 1
5             | 0.5               | 1
TABLE 2
For intra prediction encoding, an intra prediction mode may specify the location of a reference sample within a picture. In ITU-T H.265, the possible intra prediction modes that have been defined include a planar (i.e., surface-fitting) prediction mode, a DC (i.e., flat ensemble average) prediction mode, and 33 angular prediction modes (predMode: 2-34). In JEM, the possible intra prediction modes that have been defined include a planar prediction mode, a DC prediction mode, and 65 angular prediction modes. It should be noted that the plane prediction mode and the DC prediction mode may be referred to as a non-directional prediction mode, and the angle prediction mode may be referred to as a directional prediction mode. It should be noted that the techniques described herein may be universally applicable regardless of the number of possible prediction modes that have been defined.
For inter prediction coding, a reference picture is determined, and a motion vector (MV) identifies samples in the reference picture that are used to generate a prediction for a current video block. For example, a current video block may be predicted using reference sample values located in one or more previously coded pictures, and a motion vector is used to indicate the location of the reference block relative to the current video block. A motion vector may describe, for example, a horizontal displacement component of the motion vector (i.e., MVx), a vertical displacement component of the motion vector (i.e., MVy), and a resolution for the motion vector (e.g., one-quarter pixel precision, one-half pixel precision, one-pixel precision, two-pixel precision, four-pixel precision). Previously decoded pictures, which may include pictures output before or after a current picture, may be organized into one or more reference picture lists and identified using a reference picture index value. Further, in inter prediction coding, uni-prediction refers to generating a prediction using sample values from a single reference picture, and bi-prediction refers to generating a prediction using respective sample values from two reference pictures. That is, in uni-prediction, a single reference picture and a corresponding motion vector are used to generate a prediction for a current video block, and in bi-prediction, a first reference picture and a corresponding first motion vector and a second reference picture and a corresponding second motion vector are used to generate a prediction for a current video block. In bi-prediction, the respective sample values are combined (e.g., added, rounded, and clipped, or averaged according to weights) to generate a prediction. Pictures and regions thereof may be classified based on which types of prediction modes may be utilized for coding their video blocks. That is, for regions having a B type (e.g., a B slice), bi-prediction, uni-prediction, and intra prediction modes may be utilized; for regions having a P type (e.g., a P slice), uni-prediction and intra prediction modes may be utilized; and for regions having an I type (e.g., an I slice), only intra prediction modes may be utilized. As described above, reference pictures are identified through reference indices. For example, for a P slice, there may be a single reference picture list, RefPicList0, and for a B slice, there may be a second independent reference picture list, RefPicList1, in addition to RefPicList0. It should be noted that for uni-prediction in a B slice, one of RefPicList0 or RefPicList1 may be used to generate a prediction. Further, it should be noted that during the decoding process, at the onset of decoding a picture, the reference picture list(s) are generated from previously decoded pictures stored in a Decoded Picture Buffer (DPB).
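Since bi-prediction combines two reference predictions per sample, a minimal C sketch of the combining step (the simple averaged combination mentioned above; names are ours, not from the draft text) is:

```c
#include <stdint.h>

/* Minimal sketch of bi-prediction sample combining: average corresponding
   samples from two reference predictions with rounding. Weighted variants
   would scale each prediction before summing. */
static void bi_predict_avg(const uint16_t *pred0, const uint16_t *pred1,
                           uint16_t *dst, int num_samples) {
    for (int i = 0; i < num_samples; i++)
        dst[i] = (uint16_t)((pred0[i] + pred1[i] + 1) >> 1); /* round half up */
}
```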
Furthermore, a coding standard may support various modes of motion vector prediction. Motion vector prediction enables the value of a motion vector for a current video block to be derived based on another motion vector. For example, a set of candidate blocks having associated motion information may be derived from spatial neighboring blocks and temporal neighboring blocks of the current video block. Further, generated (or default) motion information may be used for motion vector prediction. Examples of motion vector prediction include Advanced Motion Vector Prediction (AMVP), Temporal Motion Vector Prediction (TMVP), so-called "merge" mode, and "skip" and "direct" motion inference. Further, other examples of motion vector prediction include Advanced Temporal Motion Vector Prediction (ATMVP) and Spatial-Temporal Motion Vector Prediction (STMVP). For motion vector prediction, both a video encoder and a video decoder perform the same process to derive a set of candidates. Thus, the same set of candidates is generated during encoding and decoding for a current video block.
As described above, for inter prediction coding, reference samples in previously coded pictures are used to code video blocks in a current picture. Previously coded pictures that are available to be used as a reference when coding a current picture are referred to as reference pictures. It should be noted that the decoding order does not necessarily correspond to the picture output order, i.e., the temporal order of the pictures in a video sequence. In ITU-T H.265, when a picture is decoded, it is stored to a Decoded Picture Buffer (DPB), which may be referred to as a frame buffer, a reference picture buffer, or the like. In ITU-T H.265, pictures stored to the DPB are removed from the DPB when they have been output and are no longer needed for coding subsequent pictures. In ITU-T H.265, a determination of whether pictures should be removed from the DPB is invoked once per picture, after decoding a slice header, i.e., at the onset of decoding a picture. For example, referring to Fig. 1, Pic3 is illustrated as referencing Pic2. Similarly, Pic4 is illustrated as referencing Pic1. With respect to Fig. 1, assuming the picture numbering corresponds to the decoding order, the DPB would be populated as follows: after decoding Pic1, the DPB would include {Pic1}; at the onset of decoding Pic2, the DPB would include {Pic1}; after decoding Pic2, the DPB would include {Pic1, Pic2}; at the onset of decoding Pic3, the DPB would include {Pic1, Pic2}. Pic3 would then be decoded with reference to Pic2, and after decoding Pic3, the DPB would include {Pic1, Pic2, Pic3}. At the onset of decoding Pic4, pictures Pic2 and Pic3 would be marked for removal from the DPB, as they are not needed for decoding Pic4 (or any subsequent pictures, not shown), and assuming Pic2 and Pic3 have been output, the DPB would be updated to include {Pic1}. Pic4 would then be decoded with reference to Pic1. The process of marking pictures for removal from the DPB may be referred to as Reference Picture Set (RPS) management.
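The walkthrough above can be traced with a toy buffer model. The C sketch below is bookkeeping only (not the normative RPS management process); it simply mirrors the insert/output/remove sequence described for Pic1 through Pic4.

```c
#include <stdio.h>

/* Toy trace of the DPB walkthrough above (bookkeeping only). */
int main(void) {
    int dpb[8];
    int n = 0;

    dpb[n++] = 1;  /* Pic1 decoded            -> DPB = {Pic1}             */
    dpb[n++] = 2;  /* Pic2 decoded            -> DPB = {Pic1, Pic2}       */
    dpb[n++] = 3;  /* Pic3 decoded (ref Pic2) -> DPB = {Pic1, Pic2, Pic3} */

    /* At the onset of decoding Pic4: Pic2 and Pic3 have been output and
       are not needed as references for Pic4, so they are removed. */
    n = 1;         /* keep only Pic1          -> DPB = {Pic1}             */
    dpb[n++] = 4;  /* Pic4 decoded (ref Pic1) -> DPB = {Pic1, Pic4}       */

    printf("DPB:");
    for (int i = 0; i < n; i++) printf(" Pic%d", dpb[i]);
    printf("\n");
    return 0;
}
```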
As described above, the intra prediction data or inter prediction data is used to generate reference sample values for a block of sample values. The differences between sample values included in the current PB or another type of picture region structure and associated reference samples (e.g., those generated using prediction) may be referred to as residual data. The residual data may include a respective difference array corresponding to each component of the video data. The residual data may be in the pixel domain. A transform such as a Discrete Cosine Transform (DCT), a Discrete Sine Transform (DST), an integer transform, a wavelet transform, or a conceptually similar transform may be applied to the array of differences to generate transform coefficients. It should be noted that in ITU-T H.265 and JVET-O2001, a CU is associated with a Transform Unit (TU) structure that has its root at the CU level. That is, to generate transform coefficients, an array of differences may be partitioned (e.g., four 8 x 8 transforms may be applied to a 16 x 16 array of residual values). For each component of video data, such subdivision of the difference value may be referred to as a Transform Block (TB). It should be noted that in some cases, a core transform and a subsequent secondary transform may be applied (in a video encoder) to generate transform coefficients. For video decoders, the order of the transforms is reversed.
A quantization process may be performed on transform coefficients or residual sample values directly (e.g., in the case of palette coding). Quantization approximates transform coefficients by limiting amplitudes to a set of specified values. Quantization essentially scales transform coefficients in order to vary the amount of data required to represent a group of transform coefficients. Quantization may include division of transform coefficients (or values resulting from the addition of an offset value to transform coefficients) by a quantization scaling factor and any associated rounding functions (e.g., rounding to the nearest integer). Quantized transform coefficients may be referred to as coefficient level values. Inverse quantization (or "dequantization") may include multiplication of coefficient level values by the quantization scaling factor, and any reciprocal rounding or offset addition operations. It should be noted that, as used herein, the term quantization process may refer in some instances to division by a scaling factor to generate level values and in some instances to multiplication by a scaling factor to recover transform coefficients. That is, a quantization process may refer in some cases to quantization and in some cases to inverse quantization. Further, it should be noted that although in some of the examples below quantization processes are described with respect to arithmetic operations associated with decimal notation, such descriptions are for illustrative purposes and should not be construed as limiting. For example, the techniques described herein may be implemented in a device using binary operations and the like. For example, the multiplication and division operations described herein may be implemented using bit-shifting operations and the like.
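As a concrete, non-normative illustration of the divide-then-multiply round trip described above, consider the following C sketch; the scaling factor of 16 is an arbitrary example value.

```c
#include <stdio.h>

/* Non-normative sketch of quantization and dequantization: divide by a
   scaling factor with rounding to produce a level value, then multiply
   to recover an approximation of the transform coefficient. */
static int quantize(int coeff, int scale) {
    int sign = coeff < 0 ? -1 : 1;
    return sign * ((sign * coeff + scale / 2) / scale); /* round to nearest */
}

static int dequantize(int level, int scale) {
    return level * scale;
}

int main(void) {
    int coeff = 137, scale = 16;
    int level = quantize(coeff, scale);
    /* level = 9; reconstruction = 144, approximating the original 137 */
    printf("coeff=%d level=%d recon=%d\n", coeff, level, dequantize(level, scale));
    return 0;
}
```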
Quantized transform coefficients and syntax elements (e.g., syntax elements indicating the coding structure of a video block) may be entropy coded according to an entropy coding technique. An entropy coding process includes coding values of syntax elements using lossless data compression algorithms. Examples of entropy coding techniques include Content Adaptive Variable Length Coding (CAVLC), Context Adaptive Binary Arithmetic Coding (CABAC), Probability Interval Partitioning Entropy coding (PIPE), and the like. Entropy encoded quantized transform coefficients and corresponding entropy encoded syntax elements may form a compliant bitstream that can be used to reproduce video data at a video decoder. An entropy coding process, for example CABAC, may include performing binarization on syntax elements. Binarization refers to the process of converting the value of a syntax element into a series of one or more bits. These bits may be referred to as "bins". Binarization may include one or a combination of fixed length coding, unary coding, truncated Rice coding, Golomb coding, k-th order exponential Golomb coding, and Golomb-Rice coding. For example, binarization may include representing the integer value 5 for a syntax element as 00000101 using an 8-bit fixed length binarization technique, or representing the integer value 5 as 11110 using a unary coding binarization technique. As used herein, each of the terms fixed length coding, unary coding, truncated Rice coding, Golomb coding, k-th order exponential Golomb coding, and Golomb-Rice coding may refer to general implementations of these techniques and/or more specific implementations of these coding techniques. For example, a Golomb-Rice coding implementation may be specifically defined according to a video coding standard. In the example of CABAC, for a particular bin, a context provides a most probable state (MPS) value for the bin (i.e., an MPS for a bin is one of 0 or 1) and a probability value of the bin being the MPS or the least probable state (LPS). For example, a context may indicate that the MPS of a bin is 0 and the probability of the bin being 1 is 0.3. It should be noted that a context may be determined based on values of previously coded bins, including bins in the current syntax element and in previously coded syntax elements. For example, values of syntax elements associated with neighboring video blocks may be used to determine a context for a current bin.
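As an illustration of the two binarizations named in the example above, the following C sketch (helper names are ours) produces the fixed length and unary codes for the value 5. Unary conventions vary between texts; the sketch follows the convention implied by the example, where a positive value v is coded as v - 1 ones followed by a terminating zero.

```c
#include <stdio.h>

/* Illustrative binarization helpers (names are ours, not from the text). */

/* n-bit fixed-length binarization, most significant bit first. */
static void fixed_length_bin(unsigned v, int nbits, char *out) {
    for (int i = 0; i < nbits; i++)
        out[i] = ((v >> (nbits - 1 - i)) & 1) ? '1' : '0';
    out[nbits] = '\0';
}

/* Unary binarization under the convention implied by the example above:
   a positive value v is coded as (v - 1) ones followed by a zero, so
   5 -> 11110. Other texts code v as v ones followed by a zero. */
static void unary_bin(unsigned v, char *out) {
    unsigned i;
    for (i = 0; i + 1 < v; i++) out[i] = '1';
    out[i++] = '0';
    out[i] = '\0';
}

int main(void) {
    char buf[33];
    fixed_length_bin(5, 8, buf);
    printf("fixed-length(5, 8) = %s\n", buf); /* 00000101 */
    unary_bin(5, buf);
    printf("unary(5)           = %s\n", buf); /* 11110 */
    return 0;
}
```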
Regarding the formulas used herein, the following arithmetic operators may be used:
+ : addition
- : subtraction
* : multiplication, including matrix multiplication
x^y : exponentiation; specifies x to the power of y. In other contexts, such notation is used for superscripting and is not intended to be interpreted as exponentiation.
/ : integer division with truncation of the result toward zero. For example, 7/4 and -7/-4 are truncated to 1, and -7/4 and 7/-4 are truncated to -1.
÷ : used to denote division in mathematical formulas where no truncation or rounding is intended.
x/y : also used to denote division in mathematical formulas where no truncation or rounding is intended.
x % y : modulus; the remainder of x divided by y, defined only for integers x and y with x >= 0 and y > 0.
Furthermore, the following logical operators may be used:
x && y : Boolean logical "and" of x and y
x || y : Boolean logical "or" of x and y
! : Boolean logical "not"
x ? y : z : evaluates to y if x is TRUE or not equal to 0; otherwise, evaluates to z.
Furthermore, the following relational operators may be used:
x > y : greater than
x >= y : greater than or equal to
x < y : less than
x <= y : less than or equal to
x == y : equal to
x != y : not equal to
Furthermore, the following bitwise operators may be used:
x & y : bitwise "and". When operating on integer arguments, operates on a two's complement representation of the integer value. When operating on a binary argument that contains fewer bits than another argument, the shorter argument is extended by adding more significant bits equal to 0.
x | y : bitwise "or". When operating on integer arguments, operates on a two's complement representation of the integer value. When operating on a binary argument that contains fewer bits than another argument, the shorter argument is extended by adding more significant bits equal to 0.
x ^ y : bitwise "exclusive or". When operating on integer arguments, operates on a two's complement representation of the integer value. When operating on a binary argument that contains fewer bits than another argument, the shorter argument is extended by adding more significant bits equal to 0.
x >> y : arithmetic right shift of a two's complement integer representation of x by y binary digits. This function is defined only for non-negative integer values of y. Bits shifted into the most significant bits (MSBs) as a result of the right shift have a value equal to the MSB of x prior to the shift operation.
x << y : arithmetic left shift of a two's complement integer representation of x by y binary digits. This function is defined only for non-negative integer values of y. Bits shifted into the least significant bits (LSBs) as a result of the left shift have a value equal to 0.
Furthermore, the following assignment operators may be used:
= : assignment operator
++ : increment; i.e., x++ is equivalent to x = x + 1. When used in an array index, the value of the variable is evaluated prior to the increment operation.
-- : decrement; i.e., x-- is equivalent to x = x - 1. When used in an array index, the value of the variable is evaluated prior to the decrement operation.
+= : increment by the amount specified; i.e., x += 3 is equivalent to x = x + 3, and x += (-3) is equivalent to x = x + (-3).
-= : decrement by the amount specified; i.e., x -= 3 is equivalent to x = x - 3, and x -= (-3) is equivalent to x = x - (-3).
Furthermore, the mathematical functions defined below may be used:
Floor(x) : the largest integer less than or equal to x.
Log2(x) : the base-2 logarithm of x.
Round(x) = Sign(x) * Floor(Abs(x) + 0.5)
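The following C sketch restates these functions directly. Clip3, a clamp of a value to a range, is included as well because the wIdx formulas later in this section use it; its definition is standard in the draft text but is not reproduced in this excerpt.

```c
#include <math.h>
#include <stdio.h>

/* C sketches of the mathematical functions defined above, plus Clip3
   (clamp v to [lo, hi]), which later formulas in this section rely on. */
static double Sign(double x)  { return x > 0 ? 1 : (x < 0 ? -1 : 0); }
static double Floor(double x) { return floor(x); }  /* largest int <= x */
static double Log2(double x)  { return log2(x); }   /* base-2 logarithm */
static double Round(double x) { return Sign(x) * Floor(fabs(x) + 0.5); }
static int Clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }

int main(void) {
    printf("Floor(1.7)=%g Log2(8)=%g Round(-2.5)=%g Clip3(0,8,11)=%d\n",
           Floor(1.7), Log2(8), Round(-2.5), Clip3(0, 8, 11));
    /* prints: Floor(1.7)=1 Log2(8)=3 Round(-2.5)=-3 Clip3(0,8,11)=8 */
    return 0;
}
```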
Further, it should be noted that in the syntax descriptor used herein, the following descriptor may be applied:
b(8) : byte having any pattern of bit string (8 bits). The parsing process for this descriptor is specified by the return value of the function read_bits(8).
f(n) : fixed-pattern bit string using n bits written (from left to right) with the left bit first. The parsing process for this descriptor is specified by the return value of the function read_bits(n).
se(v) : signed integer 0-th order Exp-Golomb-coded syntax element with the left bit first.
tb(v) : truncated binary using up to maxVal bits, with maxVal defined in the semantics of the syntax element.
tu(v) : truncated unary using up to maxVal bits, with maxVal defined in the semantics of the syntax element.
u(n) : unsigned integer using n bits. When n is "v" in the syntax table, the number of bits varies in a manner dependent on the values of other syntax elements. The parsing process for this descriptor is specified by the return value of the function read_bits(n), interpreted as a binary representation of an unsigned integer with the most significant bit written first.
ue(v) : unsigned integer 0-th order Exp-Golomb-coded syntax element with the left bit first.
Fig. 2A to 2B are conceptual diagrams illustrating an example of encoding a block of video data. As shown in fig. 2A, a current block of video data (e.g., CB corresponding to a video component) is encoded by subtracting a set of prediction values from the current block of video data to generate a residual, performing a transform on the residual, and quantizing the transform coefficients to generate level values. The level values are encoded into a bit stream. As shown in fig. 2B, a block of current video data is decoded by performing inverse quantization on the level values, performing an inverse transform, and adding a set of predictors to the resulting residual. It should be noted that in the examples of fig. 2A-2B, the sample values of the reconstructed block are different from the sample values of the current video block being encoded. Thus, the coding may be considered lossy. However, the difference in sample values may be considered acceptable or imperceptible to a viewer of the reconstructed video.
Further, as illustrated in Figs. 2A-2B, coefficient level values are generated using an array of scaling factors. In ITU-T H.265, an array of scaling factors is generated by selecting a scaling matrix and multiplying each entry in the scaling matrix by a quantization scaling factor. A scaling matrix may be selected based in part on a prediction mode and a color component. It should be noted that in some examples a scaling matrix may provide the same value for each entry (i.e., all coefficients are scaled according to a single value). The value of a quantization scaling factor may be determined by a quantization parameter, QP. Further, a QP value for a set of transform coefficients may be derived using a predictive quantization parameter value (which may be referred to as a predictive QP value or a QP predictor value) and an optionally signaled quantization parameter delta value (which may be referred to as a QP delta value or a delta QP value). A quantization parameter may be updated for each CU, and a respective quantization parameter may be derived for each of the luma and chroma channels.
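For illustration, in HEVC-style designs the quantization step size roughly doubles for each increase of 6 in QP, which is commonly realized with a six-entry level-scale table indexed by QP % 6 and shifted by QP / 6. The sketch below uses the HEVC dequantization table; treat it as illustrative, not as the JVET-O2001 derivation.

```c
/* Illustrative HEVC-style dequantization scale: the step size roughly
   doubles every 6 QP. levelScale[] is the HEVC table; this is a sketch,
   not the JVET-O2001 derivation. */
static const int levelScale[6] = {40, 45, 51, 57, 64, 72};

static long long dequant_scale(int qp) {
    return (long long)levelScale[qp % 6] << (qp / 6);
}

/* e.g., dequant_scale(22) = 64 << 3 = 512, twice dequant_scale(16) = 256 */
```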
Referring to the example illustrated in Fig. 1, each slice of video data included in Pic3 (i.e., slice 0 and slice 1) is illustrated as being encapsulated in a NAL unit. In JVET-O2001, each of a video sequence, a GOP, a picture, a slice, and a CTU may be associated with metadata that describes video coding properties. JVET-O2001 defines parameter sets that may be used to describe video data and/or video coding properties. In particular, JVET-O2001 includes five types of parameter sets: the Decoding Parameter Set (DPS), the Video Parameter Set (VPS), the Sequence Parameter Set (SPS), the Picture Parameter Set (PPS), and the Adaptation Parameter Set (APS). In JVET-O2001, parameter sets may be encapsulated as special types of NAL units or may be signaled as messages. NAL units including coded video data (e.g., a slice) may be referred to as VCL (Video Coding Layer) NAL units, and NAL units including metadata (e.g., parameter sets) may be referred to as non-VCL NAL units. Further, JVET-O2001 enables Supplemental Enhancement Information (SEI) messages to be signaled. In JVET-O2001, SEI messages assist in processes related to decoding, display, or other purposes; however, SEI messages may not be required for constructing the luma or chroma samples by the decoding process. In JVET-O2001, SEI messages may be signaled in the bitstream using non-VCL NAL units. Furthermore, SEI messages may be conveyed by some means other than by presence within the bitstream (i.e., signaled out-of-band). Fig. 4 illustrates an example of a bitstream including multiple CVSs, where a CVS is represented by NAL units included in respective access units. In the example illustrated in Fig. 4, non-VCL NAL units include respective parameter set NAL units (i.e., Sequence Parameter Set (SPS) and Picture Parameter Set (PPS) NAL units), an SEI message NAL unit, and an access unit delimiter NAL unit. It should be noted that in Fig. 4, HEADER is a NAL unit header.
As described above, for inter prediction coding, a current video block may be predicted using reference sample values located in one or more previously coded pictures. JVET-O2001 includes so-called triangle-based motion compensation. In triangle-based motion compensation, a rectangular video block is predicted using two triangular predictions. That is, for prediction purposes, a rectangular video block is divided into two triangles about either its diagonal (from the top-left corner to the bottom-right corner) or its anti-diagonal (from the top-right corner to the bottom-left corner). Each triangular prediction may be generated using its own motion vector and reference frame index. Further, the rectangular prediction used to predict the rectangular video block may be generated by performing an adaptive weighting process along the diagonal edge between the two adjacent triangular predictions. That is, the triangular predictions may be blended. "CE10.3.1.b: Triangular prediction unit mode" (document JVET-L0124-v2), presented at the 12th meeting of ISO/IEC JTC1/SC29/WG11, held in Macao, China, 3-12 October 2018, provides a detailed description of an example of triangle-based motion compensation.
JVET-O2001 specifies the following decoding procedure for triangle inter blocks:
The inputs of this process are:
A luma position (xCb, yCb) specifying an upper left sample of the current coding block relative to an upper left luma sample of the current picture,
A variable cbWidth specifying the width of the current coding block in the luma samples,
A variable cbHeight specifying the height of the current coding block in the luma samples,
Luminance motion vectors mvA and mvB with a 1/16 fractional sample accuracy,
The chrominance motion vectors mvCA and mvCB,
Reference is made to the indices refIdxA and refIdxB,
The prediction list flags predListFlagA and predListFlagB.
The output of this process is:
- the (cbWidth)x(cbHeight) array predSamplesL of luma prediction samples,
- the (cbWidth/SubWidthC)x(cbHeight/SubHeightC) array predSamplesCb of chroma prediction samples for the component Cb,
- the (cbWidth/SubWidthC)x(cbHeight/SubHeightC) array predSamplesCr of chroma prediction samples for the component Cr.
Let predSamplesLAL and predSamplesLBL be (cbWidth)x(cbHeight) arrays of predicted luma sample values, and let predSamplesLACb, predSamplesLBCb, predSamplesLACr, and predSamplesLBCr be (cbWidth/SubWidthC)x(cbHeight/SubHeightC) arrays of predicted chroma sample values.
predSamplesL, predSamplesCb, and predSamplesCr are derived by the following ordered steps:
1. For N being each of A and B, the following applies:
- The reference picture consisting of an ordered two-dimensional array refPicLNL of luma samples and two ordered two-dimensional arrays refPicLNCb and refPicLNCr of chroma samples is derived by invoking the specified reference picture selection process with X set equal to predListFlagN and refIdxX set equal to refIdxN as inputs.
- The array predSamplesLNL is derived by invoking the specified fractional sample interpolation process with the luma location (xCb, yCb), the luma coding block width sbWidth set equal to cbWidth, the luma coding block height sbHeight set equal to cbHeight, the motion vector offset mvOffset set equal to (0, 0), the motion vector mvLX set equal to mvN, the reference array refPicLXL set equal to refPicLNL, the variable bdofFlag set equal to FALSE, and the variable cIdx set equal to 0 as inputs.
- The array predSamplesLNCb is derived by invoking the specified fractional sample interpolation process with the luma location (xCb, yCb), the coding block width sbWidth set equal to cbWidth/SubWidthC, the coding block height sbHeight set equal to cbHeight/SubHeightC, the motion vector offset mvOffset set equal to (0, 0), the motion vector mvLX set equal to mvCN, the reference array refPicLXCb set equal to refPicLNCb, the variable bdofFlag set equal to FALSE, and the variable cIdx set equal to 1 as inputs.
- The array predSamplesLNCr is derived by invoking the specified fractional sample interpolation process with the luma location (xCb, yCb), the coding block width sbWidth set equal to cbWidth/SubWidthC, the coding block height sbHeight set equal to cbHeight/SubHeightC, the motion vector offset mvOffset set equal to (0, 0), the motion vector mvLX set equal to mvCN, the reference array refPicLXCr set equal to refPicLNCr, the variable bdofFlag set equal to FALSE, and the variable cIdx set equal to 2 as inputs.
2. The partition direction of the merge triangle mode, variable triangleDir, is set equal to merge_triangle_split_dir[xCb][yCb] (which indicates either the diagonal or the anti-diagonal direction).
3. The prediction samples predSamples L[xL][yL],xL =0.. cbWidth-1 and y L =0.. cbHeight-1 within the current luma coded block are derived by calling the weighted sample prediction procedure specified below for the delta merge mode, with the coded block width nCbW set equal to cbWidth, the coded block height nCbH set equal to cbHeight, the sample arrays PREDSAMPLESLA L and predSamplesLB L, the variable TRIANGLEDIR, and cIdx equal to 0 as inputs.
4. The prediction samples predSamples Cb [ xc ] [ yc ], xc=0.. cbWidth/SubWidthC-1 and yc=0.. cbHeight/SubHeightC-1 within the current chroma component Cb coded block are derived by invoking the weighted sample prediction process specified below for the delta merge mode, with coded block width nCbW set equal to cbWidth/SubWidthC, coded block height nCbH set equal to cbHeight/SubHeightC, sample arrays PREDSAMPLESLA Cb and predSamplesLB Cb, variable TRIANGLEDIR and cIdx equal to 1 as inputs.
5. The prediction samples predSamples Cr [ xc ] [ yc ], xc=0.. cbWidth/SubWidthC-1 and yc=0.. cbHeight/SubHeightC-1 within the current chroma component Cr coding block are derived by invoking the weighted sample prediction process specified below for the delta merge mode, with coding block width nCbW set equal to cbWidth/SubWidthC, coding block height nCbH set equal to cbHeight/SubHeightC, sample arrays PREDSAMPLESLA Cr and predSamplesLB Cr, variable TRIANGLEDIR and cIdx equal to 2 as inputs.
6. The motion vector storage procedure for merging triangle mode is invoked using luma code block position (xCb, yCb), luma code block width cbWidth, luma code block height cbHeight, partition direction TRIANGLEDIR, luma motion vectors mvA and mvB, reference indices refIdxA and refIdxB, and prediction list flags PREDLISTFLAGA and predListFlagB as inputs.
Weighted sample prediction process for triangle merge mode
The inputs of this process are:
Two variables nCbW and nCbH specifying the width and the height of the current coding block,
Two (nCbW) x (nCbH) arrays predSamplesLA and predSamplesLB,
A variable triangleDir specifying the partition direction,
-A variable cIdx specifying the color component index.
The output of this process is the (nCbW) x (nCbH) array pbSamples of predicted sample values.
The variable nCbR is derived as follows:
nCbR=(nCbW>nCbH)?(nCbW/nCbH):(nCbH/nCbW)
The variable bitDepth is derived as follows:
- If cIdx is equal to 0, bitDepth is set equal to BitDepth Y.
- Otherwise, bitDepth is set equal to BitDepth C.
The variables shift1 and offset1 are derived as follows:
- The variable shift1 is set equal to Max(5, 17-bitDepth).
- The variable offset1 is set equal to 1<<(shift1-1).
Based on the values of triangleDir and cIdx, the prediction samples pbSamples[x][y], with x=0..nCbW-1 and y=0..nCbH-1, are derived as follows:
The variable wIdx is derived as follows:
- If cIdx is equal to 0 and triangleDir is equal to 0, the following applies:
wIdx=(nCbW>nCbH)?(Clip3(0,8,(x/nCbR-y)+4)):(Clip3(0,8,(x-y/nCbR)+4))
- Otherwise, if cIdx is equal to 0 and triangleDir is equal to 1, the following applies:
wIdx=(nCbW>nCbH)?(Clip3(0,8,(nCbH-1-x/nCbR-y)+4)):(Clip3(0,8,(nCbW-1-x-y/nCbR)+4))
- Otherwise, if cIdx is greater than 0 and triangleDir is equal to 0, the following applies:
wIdx=(nCbW>nCbH)?(Clip3(0,4,(x/nCbR-y)+2)):(Clip3(0,4,(x-y/nCbR)+2))
- Otherwise (if cIdx is greater than 0 and triangleDir is equal to 1), the following applies:
wIdx=(nCbW>nCbH)?(Clip3(0,4,(nCbH-1-x/nCbR-y)+2)):(Clip3(0,4,(nCbW-1-x-y/nCbR)+2))
- The variable wValue specifying the weight of the prediction sample is derived using wIdx and cIdx as follows:
wValue=(cIdx==0)?Clip3(0,8,wIdx):Clip3(0,8,wIdx*2)
- The prediction sample values are derived as follows:
pbSamples[x][y]=Clip3(0,(1<<bitDepth)-1,(predSamplesLA[x][y]*wValue+predSamplesLB[x][y]*(8-wValue)+offset1)>>shift1)
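The derivation above can be exercised with the following non-normative Python sketch (the function and variable names are illustrative; integer division mirrors the specification's use of "/"):

def clip3(lo, hi, v):
    return max(lo, min(hi, v))

def wvalue_jvet_o2001(n_cbw, n_cbh, triangle_dir, c_idx):
    # Computes the blending weight wValue for every sample position,
    # mirroring the nCbR/wIdx/wValue derivation specified above.
    n_cbr = n_cbw // n_cbh if n_cbw > n_cbh else n_cbh // n_cbw
    out = [[0] * n_cbw for _ in range(n_cbh)]
    for y in range(n_cbh):
        for x in range(n_cbw):
            if c_idx == 0 and triangle_dir == 0:
                w_idx = clip3(0, 8, (x // n_cbr - y + 4) if n_cbw > n_cbh
                              else (x - y // n_cbr + 4))
            elif c_idx == 0:
                w_idx = clip3(0, 8, (n_cbh - 1 - x // n_cbr - y + 4) if n_cbw > n_cbh
                              else (n_cbw - 1 - x - y // n_cbr + 4))
            elif triangle_dir == 0:
                w_idx = clip3(0, 4, (x // n_cbr - y + 2) if n_cbw > n_cbh
                              else (x - y // n_cbr + 2))
            else:
                w_idx = clip3(0, 4, (n_cbh - 1 - x // n_cbr - y + 2) if n_cbw > n_cbh
                              else (n_cbw - 1 - x - y // n_cbr + 2))
            # wValue = (cIdx == 0) ? Clip3(0,8,wIdx) : Clip3(0,8,wIdx*2)
            out[y][x] = clip3(0, 8, w_idx) if c_idx == 0 else clip3(0, 8, w_idx * 2)
    return out

For example, wvalue_jvet_o2001(16, 8, 0, 0) and wvalue_jvet_o2001(8, 8, 0, 1) reproduce the 16x8 luma and 8x8 chroma blending matrices shown further below for the 4:2:2 case.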
It should be noted that the weighted sample prediction process for triangle-based motion compensation in JVET-O2001 may be less than ideal. Specifically, the blending function for the triangle partitions does not take into account chroma sampling formats other than 4:2:0. In one example, in accordance with the techniques herein, the blending function may be simplified and/or extended to accommodate the actual chroma sampling format (which may be 4:2:2 or 4:4:4).
Fig. 5 is a block diagram illustrating an example of a system that may be configured to code (i.e., encode and/or decode) video data in accordance with one or more techniques of the present disclosure. System 100 represents an example of a system that may perform video coding using the partitioning techniques according to one or more techniques of the present disclosure. As shown in fig. 5, system 100 includes a source device 102, a communication medium 110, and a target device 120. In the example shown in fig. 5, source device 102 may include any device configured to encode video data and transmit the encoded video data to communication medium 110. Target device 120 may include any device configured to receive encoded video data via communication medium 110 and decode the encoded video data. Source device 102 and/or target device 120 may comprise computing devices equipped for wired and/or wireless communication, and may include set-top boxes, digital video recorders, televisions, desktop, laptop or tablet computers, gaming consoles, and mobile devices including, for example, "smart" phones, cellular telephones, personal gaming devices, and medical imaging devices.
Communication medium 110 may include any combination of wireless and wired communication media and/or storage devices. Communication medium 110 may include coaxial cable, fiber optic cable, twisted pair cable, wireless transmitters and receivers, routers, switches, repeaters, base stations, or any other device that may be used to facilitate communications between various devices and sites. Communication medium 110 may include one or more networks. For example, the communication medium 110 may include a network configured to allow access to the world wide web, such as the Internet. The network may operate according to a combination of one or more telecommunications protocols. The telecommunications protocol may include proprietary aspects and/or may include standardized telecommunications protocols. Examples of standardized telecommunications protocols include the Digital Video Broadcasting (DVB) standard, the Advanced Television Systems Committee (ATSC) standard, the Integrated Services Digital Broadcasting (ISDB) standard, the Data Over Cable Service Interface Specification (DOCSIS) standard, the global system for mobile communications (GSM) standard, the Code Division Multiple Access (CDMA) standard, the third generation partnership project (3 GPP) standard, the European Telecommunications Standards Institute (ETSI) standard, the Internet Protocol (IP) standard, the Wireless Application Protocol (WAP) standard, and the Institute of Electrical and Electronics Engineers (IEEE) standard.
A storage device may include any type of device or storage medium capable of storing data. A storage medium may include a tangible or non-transitory computer-readable medium. A computer-readable medium may include optical discs, flash memory, magnetic memory, or any other suitable digital storage medium. In some examples, a storage device or portions thereof may be described as non-volatile memory, and in other examples, portions of storage devices may be described as volatile memory. Examples of volatile memory may include Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), and Static Random Access Memory (SRAM). Examples of non-volatile memory may include magnetic hard disks, optical discs, floppy disks, flash memory, or forms of Electrically Programmable Memory (EPROM) or Electrically Erasable and Programmable (EEPROM) memory. Storage devices may include memory cards (e.g., Secure Digital (SD) memory cards), internal/external hard disk drives, and/or internal/external solid state drives. Data may be stored on a storage device according to a defined file format.
Referring again to fig. 5, source device 102 includes video source 104, video encoder 106, and interface 108. Video source 104 may include any device configured to capture and/or store video data. For example, video source 104 may include a camera and a storage device operatively coupled thereto. Video encoder 106 may include any device configured to receive video data and generate a compliant bitstream representing the video data. A compliant bitstream may refer to a bitstream from which a video decoder may receive and reproduce video data. Aspects of a compliant bitstream may be defined according to a video coding standard. When generating a compliant bitstream, video encoder 106 may compress the video data. Compression may be lossy (perceptible or imperceptible) or lossless. Interface 108 may include any device configured to receive a compliant video bitstream and transmit and/or store the compliant video bitstream to a communication medium. Interface 108 may include a network interface card such as an Ethernet card, and may include an optical transceiver, a radio frequency transceiver, or any other type of device that may transmit and/or receive information. Further, interface 108 may include a computer system interface that may allow a compliant video bitstream to be stored on a storage device. For example, interface 108 may include logical and physical structures supporting the Peripheral Component Interconnect (PCI) and Peripheral Component Interconnect Express (PCIe) bus protocols, proprietary bus protocols, the Universal Serial Bus (USB) protocol, I2C, or any other interconnect mechanism available for peer devices.
Referring again to fig. 5, target device 120 includes interface 122, video decoder 124, and display 126. Interface 122 may include any device configured to receive a compliant video bitstream from a communication medium. Interface 122 may include a network interface card such as an Ethernet card, and may include an optical transceiver, a radio frequency transceiver, or any other type of device that may receive and/or transmit information. Further, interface 122 may include a computer system interface that allows a compliant video bitstream to be retrieved from a storage device. For example, interface 122 may include a chipset supporting the PCI and PCIe bus protocols, proprietary bus protocols, the USB protocol, I2C, or any other logical and physical structure that may be used to interconnect peer devices. Video decoder 124 may include any device configured to receive a compliant bitstream and/or acceptable variations thereof and render video data therefrom. Display 126 may include any device configured to display video data. Display 126 may include one of a variety of display devices such as a liquid crystal display (LCD), a plasma display, an organic light emitting diode (OLED) display, or another type of display. Display 126 may include a high definition display or an ultra high definition display. It should be noted that although in the example shown in fig. 5 video decoder 124 is described as outputting data to display 126, video decoder 124 may be configured to output video data to various types of devices and/or subcomponents thereof. For example, video decoder 124 may be configured to output video data to any communication medium, as described herein.
Fig. 6 is a block diagram illustrating an example of a video encoder 200 that may implement the techniques for encoding video data described herein. It should be noted that although the exemplary video encoder 200 is shown with different functional blocks, such illustration is intended for descriptive purposes and not to limit the video encoder 200 and/or its subcomponents to a particular hardware or software architecture. The functions of video encoder 200 may be implemented using any combination of hardware, firmware, and/or software implementations. In one example, video encoder 200 may be configured to encode video data in accordance with the techniques described herein. The video encoder 200 may perform intra prediction encoding and inter prediction encoding of picture regions, and thus may be referred to as a hybrid video encoder. In the example shown in fig. 6, video encoder 200 receives a source video block. In some examples, a source video block may include picture regions that have been partitioned according to an encoding structure. For example, the source video data may include macroblocks, CTUs, CBs, sub-partitions thereof, and/or another equivalent coding unit. In some examples, video encoder 200 may be configured to perform additional subdivision of the source video block. It should be noted that some of the techniques described herein may be generally applicable to video encoding, regardless of how the source video data is partitioned prior to and/or during encoding. In the example shown in fig. 6, the video encoder 200 includes a summer 202, a transform coefficient generator 204, a coefficient quantization unit 206, an inverse quantization/transform processing unit 208, a summer 210, an intra prediction processing unit 212, an inter prediction processing unit 214, a filter unit 216, and an entropy encoding unit 218.
As shown in fig. 6, the video encoder 200 receives source video blocks and outputs a bitstream. Video encoder 200 may generate residual data by subtracting a prediction video block from a source video block. Summer 202 represents a component configured to perform the subtraction operation. In one example, the subtraction of video blocks occurs in the pixel domain. The transform coefficient generator 204 applies a transform, such as a discrete cosine transform (DCT), a discrete sine transform (DST), or a conceptually similar transform, to the residual block or sub-partitions thereof (e.g., four 8x8 transforms may be applied to a 16x16 array of residual values) to generate a set of residual transform coefficients. The transform coefficient generator 204 may be configured to perform any and all combinations of the transforms included in the family of discrete trigonometric transforms. As described above, in ITU-T H.265, TBs are restricted to the following sizes: 4x4, 8x8, 16x16, and 32x32. In one example, the transform coefficient generator 204 may be configured to perform transforms according to arrays of sizes 4x4, 8x8, 16x16, and 32x32. In one example, the transform coefficient generator 204 may be further configured to perform transforms according to arrays of other sizes. In particular, in some cases it may be useful to perform transforms on rectangular arrays of difference values. In one example, the transform coefficient generator 204 may be configured to perform transforms according to array sizes of 2x2, 2x4N, 4Mx2, and/or 4Mx4N. In one example, a two-dimensional (2D) MxN inverse transform may be implemented as a one-dimensional (1D) M-point inverse transform followed by a 1D N-point inverse transform, as sketched below. In one example, the 2D inverse transform may be implemented as a 1D N-point vertical transform followed by a 1D N-point horizontal transform. In one example, the 2D inverse transform may be implemented as a 1D N-point horizontal transform followed by a 1D N-point vertical transform. The transform coefficient generator 204 may output the transform coefficients to the coefficient quantization unit 206.
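As one illustration of the separable structure just described, the following non-normative Python sketch implements a 2D transform as a 1D transform over rows followed by a 1D transform over columns. A floating-point DCT-II is used for clarity; actual codecs use scaled integer approximations, so the constants here are assumptions for illustration only.

import numpy as np

def dct_1d(v):
    # Naive 1D DCT-II (unscaled), for illustration only.
    n = len(v)
    k = np.arange(n)
    return np.array([np.sum(v * np.cos(np.pi * (2 * k + 1) * u / (2 * n)))
                     for u in range(n)])

def transform_2d(block):
    # A 2D M x N transform realized as 1D transforms over the rows
    # followed by 1D transforms over the columns.
    rows = np.apply_along_axis(dct_1d, 1, block)
    return np.apply_along_axis(dct_1d, 0, rows)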
The coefficient quantization unit 206 may be configured to perform quantization of the transform coefficients. As described above, the degree of quantization may be modified by adjusting the quantization parameter. The coefficient quantization unit 206 may be further configured to determine quantization parameters and output QP data (e.g., data used to determine a quantization group size and/or delta QP values) that may be used by a video decoder to reconstruct the quantization parameters to perform inverse quantization during video decoding. It should be noted that in other examples, one or more additional or alternative parameters (e.g., a scaling factor) may be used to determine the level of quantization. The techniques described herein are generally applicable to determining a level of quantization for transform coefficients corresponding to one component of video data based on a level of quantization for transform coefficients corresponding to another component of video data.
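For illustration, in ITU-T H.265-style designs the quantization step size approximately doubles for every increase of 6 in the quantization parameter. The following sketch is a simplified scalar quantizer built on that relationship; the constants and names are illustrative, not normative.

def quantize(coeff, qp):
    # Illustrative scalar quantizer: the step size doubles every 6 QP.
    step = 2.0 ** ((qp - 4) / 6.0)
    return int(round(coeff / step))

def dequantize(level, qp):
    # Inverse quantization: reconstruct an approximation of the coefficient.
    step = 2.0 ** ((qp - 4) / 6.0)
    return level * step

Larger QP values yield larger step sizes and therefore coarser (more lossy) coefficient representations.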
Referring again to fig. 6, the quantized transform coefficients are output to an inverse quantization/transform processing unit 208. The inverse quantization/transform processing unit 208 may be configured to apply inverse quantization and inverse transform to generate reconstructed residual data. As shown in fig. 6, at summer 210, reconstructed residual data may be added to the predicted video block. In this way, the encoded video block may be reconstructed and the resulting reconstructed video block may be used to evaluate the coding quality of a given prediction, transform and/or quantization. The video encoder 200 may be configured to perform multiple encoding passes (e.g., perform encoding while changing one or more of the prediction, transform parameters, and quantization parameters). The rate-distortion or other system parameters of the bitstream may be optimized based on the evaluation of the reconstructed video block. Furthermore, the reconstructed video block may be stored and used as a reference for predicting a subsequent block.
As described above, video blocks may be encoded using intra prediction. Intra-prediction processing unit 212 may be configured to select an intra-prediction mode for a video block to be encoded. The intra-prediction processing unit 212 may be configured to evaluate the frame and/or regions thereof and determine an intra-prediction mode to encode the current block. As shown in fig. 6, the intra-prediction processing unit 212 outputs intra-prediction data (e.g., syntax elements) to the entropy encoding unit 218 and the transform coefficient generator 204. As described above, the transform performed on the residual data may depend on the mode. As described above, possible intra prediction modes may include a planar prediction mode, a DC prediction mode, and angular prediction modes. Further, in some examples, an intra prediction mode for a chroma component may be inferred from the intra prediction mode used for the luma component.
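As a simple illustration of one of these modes, the following non-normative sketch computes a DC intra prediction, predicting every sample of a block as the rounded mean of the reconstructed neighboring samples (the function and argument names are illustrative assumptions):

def dc_predict(above, left):
    # above: reconstructed samples of the row above the block (length = block width)
    # left:  reconstructed samples of the column left of the block (length = block height)
    n = len(above) + len(left)
    dc = (sum(above) + sum(left) + n // 2) // n  # rounded mean of the neighbors
    return [[dc] * len(above) for _ in range(len(left))]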
The inter prediction processing unit 214 may be configured to perform inter prediction encoding for the current video block. The inter prediction processing unit 214 may be configured to receive a source video block and calculate a motion vector for a PU of the video block. The motion vector may indicate the displacement of a PU (or similar coding structure) of a video block within the current video frame relative to a prediction block within a reference frame. Inter prediction coding may use one or more reference pictures. Further, motion prediction may be uni-predictive (using one motion vector) or bi-predictive (using two motion vectors). The inter prediction processing unit 214 may be configured to select a prediction block by calculating a pixel difference determined by, for example, the Sum of Absolute Differences (SAD), the Sum of Squared Differences (SSD), or other difference metrics. As described above, a motion vector may be determined and specified according to motion vector prediction. As described above, the inter prediction processing unit 214 may be configured to perform motion vector prediction. The inter prediction processing unit 214 may be configured to generate a prediction block using the motion prediction data. For example, the inter-prediction processing unit 214 may locate a predicted video block within a frame buffer (not shown in fig. 6). It should be noted that the inter prediction processing unit 214 may be further configured to apply one or more interpolation filters to the reconstructed residual block to calculate sub-integer pixel values for use in motion estimation. The inter prediction processing unit 214 may output motion prediction data for the calculated motion vector to the entropy encoding unit 218. As shown in fig. 6, the inter prediction processing unit 214 may receive the reconstructed video block via the filter unit 216.
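For illustration, the following non-normative Python sketch performs a full search over integer displacements and selects the candidate minimizing SAD (SSD would square the differences instead); the function names and search range are assumptions for illustration.

import numpy as np

def sad(a, b):
    # Sum of absolute differences between two equally sized blocks.
    return int(np.abs(a.astype(np.int64) - b.astype(np.int64)).sum())

def full_search(cur, ref, x0, y0, bw, bh, search_range=4):
    # Exhaustive integer-pel motion search around (x0, y0); returns the
    # displacement (dx, dy) with the lowest SAD together with its cost.
    block = cur[y0:y0 + bh, x0:x0 + bw]
    best = (0, 0, float("inf"))
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = y0 + dy, x0 + dx
            if 0 <= y <= ref.shape[0] - bh and 0 <= x <= ref.shape[1] - bw:
                cost = sad(block, ref[y:y + bh, x:x + bw])
                if cost < best[2]:
                    best = (dx, dy, cost)
    return best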
As described above, the weighted sample prediction process for triangle-based motion compensation in JVET-O2001 may be less than ideal. In one example, in accordance with the techniques herein, a weighted sample prediction process for triangle-based motion compensation may determine the size of a prediction sample array based on a color component index and a sampling format. In one example, in accordance with the techniques herein, a weighted sample prediction process for triangle-based motion compensation may be based on the following:
Weighted sample prediction process for triangle merge mode
The inputs of this process are:
Two variables nCbW and nCbH specifying the width and the height of the current coding block,
Two (nCbW) x (nCbH) arrays predSamplesLA and predSamplesLB,
A variable triangleDir specifying the partition direction,
-A variable cIdx specifying the color component index.
The output of this process is the (nCbW) x (nCbH) array pbSamples of predicted sample values.
The variables nCbWH and nCbHW are derived as follows:
nCbWH=Max(1,nCbW/nCbH)
nCbHW=Max(1,nCbH/nCbW)
Based on the value of cIdx, the variables bitDepth and scale are derived as follows:
- If cIdx is equal to 0, bitDepth is set equal to BitDepth Y and scale is set equal to 1.
- Otherwise, bitDepth is set equal to BitDepth C, and scale is set equal to 1 if Min(nCbW, nCbH) is equal to Min(nCbW*SubWidthC, nCbH*SubHeightC), and is set equal to 2 otherwise.
The variables shift1 and offset1 are derived as follows:
- The variable shift1 is set equal to Max(5, 17-bitDepth).
- The variable offset1 is set equal to 1<<(shift1-1).
Based on the value of triangleDir, the prediction samples pbSamples[x][y], with x=0..nCbW-1 and y=0..nCbH-1, are derived as follows:
The variable wIdx is derived as follows:
- If triangleDir is equal to 0, the following applies:
wIdx=x/nCbWH*scale-y/nCbHW*scale
- Otherwise (if triangleDir is equal to 1), the following applies:
wIdx=(nCbW-1-x)/nCbWH*scale-y/nCbHW*scale
- The variable wValue specifying the weight of the prediction sample is derived using wIdx as follows:
wValue=Clip3(0,8,wIdx+4)
- The prediction sample values are derived as follows:
pbSamples[x][y]=Clip3(0,(1<<bitDepth)-1,(predSamplesLA[x][y]*wValue+predSamplesLB[x][y]*(8-wValue)+offset1)>>shift1)
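The modified derivation above may be exercised with the following non-normative Python sketch (clip3 is repeated from the earlier sketch for self-containment; SubWidthC/SubHeightC encode the chroma format: 4:2:0 is (2, 2), 4:2:2 is (2, 1), and 4:4:4 is (1, 1); parameter names are illustrative):

def clip3(lo, hi, v):
    return max(lo, min(hi, v))

def wvalue_proposed(n_cbw, n_cbh, triangle_dir, c_idx,
                    sub_width_c=2, sub_height_c=1):
    # Mirrors the modified derivation: nCbWH/nCbHW normalize the position
    # to the block aspect ratio, and scale compensates for the chroma
    # sampling format when cIdx is greater than 0.
    n_cbwh = max(1, n_cbw // n_cbh)
    n_cbhw = max(1, n_cbh // n_cbw)
    if c_idx == 0 or min(n_cbw, n_cbh) == min(n_cbw * sub_width_c,
                                              n_cbh * sub_height_c):
        scale = 1
    else:
        scale = 2
    out = [[0] * n_cbw for _ in range(n_cbh)]
    for y in range(n_cbh):
        for x in range(n_cbw):
            if triangle_dir == 0:
                w_idx = x // n_cbwh * scale - y // n_cbhw * scale
            else:
                w_idx = (n_cbw - 1 - x) // n_cbwh * scale - y // n_cbhw * scale
            out[y][x] = clip3(0, 8, w_idx + 4)
    return out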
It should be noted that in the case of a 4:2:2 sampling format, for a video block with 16x8 luma samples, the corresponding chroma array is 8x8 (width x height). According to the weighted sample prediction process specified in JVET-O2001, for triangleDir equal to 0, the following blending matrices (of wValue values) are generated for luma and chroma, respectively:
4 4 5 5 6 6 7 7 8 8 8 8 8 8 8 8
3 3 4 4 5 5 6 6 7 7 8 8 8 8 8 8
2 2 3 3 4 4 5 5 6 6 7 7 8 8 8 8
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
0 0 0 0 1 1 2 2 3 3 4 4 5 5 6 6
0 0 0 0 0 0 1 1 2 2 3 3 4 4 5 5
0 0 0 0 0 0 0 0 1 1 2 2 3 3 4 4
4 6 8 8 8 8 8 8
2 4 6 8 8 8 8 8
0 2 4 6 8 8 8 8
0 0 2 4 6 8 8 8
0 0 0 2 4 6 8 8
0 0 0 0 2 4 6 8
0 0 0 0 0 2 4 6
0 0 0 0 0 0 2 4
It should be noted that, in this case, upscaling the chroma matrix horizontally does not produce a matrix resembling the luma matrix.
According to the weighted sample prediction process specified in accordance with the techniques herein, for triangleDir equal to 0, the following blending matrices (of wValue values) are generated for luma and chroma, respectively:
4 4 5 5 6 6 7 7 8 8 8 8 8 8 8 8
3 3 4 4 5 5 6 6 7 7 8 8 8 8 8 8
2 2 3 3 4 4 5 5 6 6 7 7 8 8 8 8
1 1 2 2 3 3 4 4 5 5 6 6 7 7 8 8
0 0 1 1 2 2 3 3 4 4 5 5 6 6 7 7
0 0 0 0 1 1 2 2 3 3 4 4 5 5 6 6
0 0 0 0 0 0 1 1 2 2 3 3 4 4 5 5
0 0 0 0 0 0 0 0 1 1 2 2 3 3 4 4
4 5 6 7 8 8 8 8
3 4 5 6 7 8 8 8
2 3 4 5 6 7 8 8
1 2 3 4 5 6 7 8
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6
0 0 0 1 2 3 4 5
0 0 0 0 1 2 3 4
Thus, the weighted sample prediction process specified in JVET-O2001 and the weighted sample prediction process specified according to the techniques herein may in some cases produce different wValue values. For example, in the case above, for chroma at (x, y) = (1, 2), wValue is equal to 2 according to the weighted sample prediction process specified in JVET-O2001, whereas wValue is equal to 3 according to the techniques herein, as the check below illustrates. Different wValue values result in different arrays pbSamples[x][y]. Arrays pbSamples[x][y] produced according to the techniques herein may in some cases increase coding efficiency.
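As a hypothetical check built on the two sketches above (wvalue_jvet_o2001 and wvalue_proposed), the chroma weight at (x, y) = (1, 2) for the 8x8 chroma block of a 16x8 luma block in the 4:2:2 format can be compared directly:

old = wvalue_jvet_o2001(8, 8, triangle_dir=0, c_idx=1)
new = wvalue_proposed(8, 8, triangle_dir=0, c_idx=1,
                      sub_width_c=2, sub_height_c=1)
print(old[2][1], new[2][1])  # arrays are indexed [y][x]; prints: 2 3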
In this manner, video encoder 200 represents an example of a device configured to receive a first array of predicted sample values, receive a second array of predicted sample values, determine a scaling value based on a color component index value and a video sampling format of the video data, generate a third array of predicted sample values by applying a blending matrix to the first array of predicted sample values and the second array of predicted sample values, wherein the blending matrix is based on the scaling value, and perform video coding using the third array of predicted sample values.
As shown in fig. 6, the filter unit 216 receives the reconstructed video block and the encoding parameters and outputs modified reconstructed video data. The filter unit 216 may be configured to perform deblocking, sample adaptive offset (SAO) filtering, adaptive loop filtering (ALF), and the like. SAO filtering is a type of nonlinear amplitude mapping that may be used to improve reconstruction by adding an offset to reconstructed video data; a sketch is given below. It should be noted that, as shown in fig. 6, the intra prediction processing unit 212 and the inter prediction processing unit 214 may receive the modified reconstructed video block via the filter unit 216. The entropy encoding unit 218 receives quantized transform coefficients and prediction syntax data (i.e., intra prediction data and motion prediction data). The entropy encoding unit 218 may be configured to perform entropy encoding according to one or more of the techniques described herein.
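SAO band offset can be illustrated with the following non-normative sketch: samples are classified into 32 equal amplitude bands, and a signaled offset is added to samples falling into a small set of consecutive bands. The band count follows ITU-T H.265-style SAO; the names and defaults are assumptions for illustration.

def sao_band_offset(samples, offsets, bit_depth=8, start_band=0):
    # Classify each sample into one of 32 amplitude bands and add the
    # signaled offset for its band (offsets cover consecutive bands
    # beginning at start_band); the result is clipped to the sample range.
    shift = bit_depth - 5  # 2**bit_depth / 32 bands
    out = []
    for s in samples:
        idx = (s >> shift) - start_band
        o = offsets[idx] if 0 <= idx < len(offsets) else 0
        out.append(min(max(s + o, 0), (1 << bit_depth) - 1))
    return out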
Fig. 7 is a block diagram illustrating an example of a video decoder that may be configured to decode video data in accordance with one or more techniques of this disclosure. In one example, the video decoder 300 may be configured to reconstruct video data based on one or more of the techniques described above. That is, the video decoder 300 may operate in a manner that is reciprocal to the video encoder 200 described above. The video decoder 300 may be configured to perform intra prediction decoding and inter prediction decoding, and thus may be referred to as a hybrid decoder. In the example shown in fig. 7, the video decoder 300 includes an entropy decoding unit 302, an inverse quantization unit 304, an inverse transform processing unit 306, an intra prediction processing unit 308, an inter prediction processing unit 310, a summer 312, a filter unit 314, and a reference buffer 316. The video decoder 300 may be configured to decode video data in a manner consistent with a video encoding system that may implement one or more aspects of a video encoding standard. It should be noted that although the exemplary video decoder 300 is shown with different functional blocks, such illustration is intended for descriptive purposes and not to limit the video decoder 300 and/or its subcomponents to a particular hardware or software architecture. The functions of video decoder 300 may be implemented using any combination of hardware, firmware, and/or software implementations.
As shown in fig. 7, the entropy decoding unit 302 receives an entropy-encoded bitstream. The entropy decoding unit 302 may be configured to determine syntax elements and quantized coefficients from the bitstream according to a process reciprocal to the entropy encoding process. The entropy decoding unit 302 may be configured to perform entropy decoding according to any of the entropy encoding techniques described above. The entropy decoding unit 302 may parse the encoded bitstream in a manner consistent with a video encoding standard. The video decoder 300 may be configured to parse an encoded bitstream, wherein the encoded bitstream is generated based on the techniques described above.
Referring again to fig. 7, the inverse quantization unit 304 receives quantized transform coefficients (i.e., level values) and quantization parameter data from the entropy decoding unit 302. The quantization parameter data may include any and all combinations of the delta QP values and/or quantization group size values described above, and the like. The video decoder 300 and/or the inverse quantization unit 304 may be configured to determine QP values for inverse quantization based on values signaled by a video encoder and/or based on video properties and/or encoding parameters. That is, the inverse quantization unit 304 may operate in a manner reciprocal to the coefficient quantization unit 206 described above. For example, the inverse quantization unit 304 may be configured to infer predetermined values, allowed quantization group sizes, derive quantization parameters, and the like, according to the techniques described above. The inverse quantization unit 304 may be configured to apply inverse quantization. The inverse transform processing unit 306 may be configured to perform an inverse transform to generate reconstructed residual data. The techniques respectively performed by the inverse quantization unit 304 and the inverse transform processing unit 306 may be similar to those performed by the inverse quantization/transform processing unit 208 described above. The inverse transform processing unit 306 may be configured to apply an inverse DCT, an inverse DST, an inverse integer transform, a non-separable secondary transform (NSST), or a conceptually similar inverse transform process to the transform coefficients in order to generate residual blocks in the pixel domain. Further, as described above, whether a particular transform is performed (or the type of a particular transform) may depend on the intra prediction mode. As shown in fig. 7, the reconstructed residual data may be provided to summer 312. Summer 312 may add the reconstructed residual data to a prediction video block and generate reconstructed video data. The prediction video block may be determined according to a predictive video technique (i.e., intra prediction and inter prediction).
The intra prediction processing unit 308 may be configured to receive intra prediction syntax elements and retrieve a prediction video block from the reference buffer 316. The reference buffer 316 may include a memory device configured to store one or more frames of video data. The intra prediction syntax elements may identify an intra prediction mode, such as the intra prediction modes described above. In one example, the intra prediction processing unit 308 may reconstruct a video block according to one or more of the intra prediction encoding techniques described herein. The inter prediction processing unit 310 may receive inter prediction syntax elements and generate motion vectors to identify prediction blocks in one or more reference frames stored in the reference buffer 316. The inter prediction processing unit 310 may produce motion compensated blocks, possibly performing interpolation based on interpolation filters. Identifiers of interpolation filters to be used for motion estimation with sub-pixel precision may be included in the syntax elements. The inter prediction processing unit 310 may use the interpolation filters to calculate interpolated values for sub-integer pixels of a reference block. In one example, the inter prediction processing unit 310 may reconstruct a video block using one or more of the inter prediction encoding techniques described herein. The filter unit 314 may be configured to perform filtering on the reconstructed video data. For example, the filter unit 314 may be configured to perform deblocking and/or SAO filtering, as described above with respect to the filter unit 216. Further, it should be noted that in some examples, the filter unit 314 may be configured to perform proprietary discretionary filtering (e.g., visual enhancements). As shown in fig. 7, the video decoder 300 may output a reconstructed video block. In this manner, video decoder 300 represents an example of a device configured to receive a first array of predicted sample values, receive a second array of predicted sample values, determine a scaling value based on a color component index value and a video sampling format of the video data, generate a third array of predicted sample values by applying a blending matrix to the first array of predicted sample values and the second array of predicted sample values, wherein the blending matrix is based on the scaling value, and perform video coding using the third array of predicted sample values.
In one or more examples, the functions described may be implemented by hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium, and executed by a hardware-based processing unit. The computer-readable medium may comprise a computer-readable storage medium corresponding to a tangible medium, such as a data storage medium, or a propagation medium comprising any medium that facilitates the transfer of a computer program from one place to another, for example, according to a communication protocol. As such, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. The computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media may comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, digital subscriber line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. However, it should be understood that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory, tangible storage media. Disk and disc, as used herein, include compact disc (CD), laser disc, optical disc, digital versatile disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more digital signal processors (DSPs), general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Accordingly, the term "processor," as used herein, may refer to any of the foregoing structures or any other structure suitable for implementation of the techniques described herein. Furthermore, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. Moreover, these techniques may be implemented entirely in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in various devices or apparatuses including a wireless handset, an Integrated Circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques but do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by an interoperating hardware unit comprising a set of one or more processors as described above, in combination with suitable software and/or firmware.
Furthermore, each functional block or various features of the base station apparatus and the terminal apparatus used in each of the above-described embodiments may be implemented or executed by a circuit, typically an integrated circuit or a plurality of integrated circuits. Circuits designed to perform the functions described herein may include a general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic devices, discrete gates or transistor logic, discrete hardware components, or a combination thereof. A general purpose processor may be a microprocessor or, in the alternative, the processor may be a conventional processor, a controller, a microcontroller, or a state machine. The general purpose processor or each of the circuits described above may be configured by digital circuitry or may be configured by analog circuitry. Further, if, as a result of advances in semiconductor technology, a circuit integration technology that replaces present integrated circuits appears, an integrated circuit produced by that technology may also be used.
Various examples have been described. These and other examples are within the scope of the following claims.
< Summary of the invention >
In one example, a method of encoding video data includes receiving a first array of prediction sample values, receiving a second array of prediction sample values, determining a scaling value based on a color component index value and a video sampling format of the video data, generating a third array of prediction sample values by applying a blending matrix to the first array of prediction sample values and the second array of prediction sample values, wherein the blending matrix is based on the scaling value, and performing video encoding using the third array of prediction sample values.
In one example, the method is provided wherein performing video encoding using the third array of prediction sample values comprises decoding video data by adding a residual to the third array of prediction sample values.
In one example, the method is provided wherein performing video encoding using the third array of prediction sample values comprises encoding video data by subtracting the third array of prediction sample values from a current video block.
In one example, the method is provided wherein the color component index value indicates a chroma component and the video sample format of the video data is 4:2:2.
In one example, the method is provided wherein the color component index value indicates a chroma component and the video sample format of the video data is 4:4:4.
In one example, an apparatus for encoding video data includes one or more processors configured to perform any and all combinations of these steps.
In one example, the apparatus is provided, wherein the apparatus comprises a video encoder.
In one example, the apparatus is provided, wherein the apparatus comprises a video decoder.
In one example, a system includes a device including a video encoder and the device including a video decoder.
In one example, an apparatus for encoding video data includes means for performing any and all combinations of steps.
In one example, a non-transitory computer-readable storage medium comprising instructions stored thereon that, when executed, cause one or more processors of a device for encoding video data to perform any and all combinations of steps.
In one example, a method of decoding video data includes receiving a first array of predicted sample values, receiving a second array of predicted sample values, determining a scaling value based on a color component index value and a video sampling format of the video data, and generating a third array of predicted sample values by applying a blending value to the first array of predicted sample values and the second array of predicted sample values, wherein the blending value is based on the scaling value.
In one example, the method is provided wherein determining the scaling value based on the color component index value and the video sample format of the video data comprises determining that the scaling value for the horizontal component is 2 and the scaling value for the vertical component is 1 if the color component index value indicates the chroma component and the video sample format of the video data is 4:2:2.
In one example, the method is provided wherein determining the scaling value based on the color component index value and the video sample format of the video data comprises determining that the scaling value for the horizontal component is 1 and the scaling value for the vertical component is 1 if the color component index value indicates the chroma component and the video sample format of the video data is 4:4:4.
In one example, an apparatus includes one or more processors configured to receive a first array of predicted sample values, receive a second array of predicted sample values, determine a scaling value based on a color component index value and a video sampling format of video data, and generate a third array of predicted sample values by applying a blending value to the first array of predicted sample values and the second array of predicted sample values, wherein the blending value is based on the scaling value.
In one example, the apparatus is provided wherein determining the scaling value based on the color component index value and the video sample format of the video data comprises determining that the scaling value for the horizontal component is 2 and the scaling value for the vertical component is 1 if the color component index value indicates the chroma component and the video sample format of the video data is 4:2:2.
In one example, the apparatus is provided wherein determining the scaling value based on the color component index value and the video sample format of the video data comprises determining that the scaling value for the horizontal component is 1 and the scaling value for the vertical component is 1 if the color component index value indicates the chroma component and the video sample format of the video data is 4:4:4.
In one example, the apparatus is provided, wherein the apparatus comprises a video decoder.
< Cross-reference >
This non-provisional application claims priority under 35 U.S.C. §119 on provisional application No. 62/896,500, filed on September 5, 2019, the entire contents of which are hereby incorporated by reference.

Claims (7)

1. A method of decoding video data, the method comprising:
Receiving a first array of predicted sample values, wherein the first array of predicted sample values is derived from a fractional sample interpolation process using a first motion vector;
receiving a second array of predicted sample values, wherein the second array of predicted sample values is derived from the fractional sample interpolation process using a second motion vector;
scaling a horizontal position and a vertical position according to a video sampling format of the video data based on a color component index value not equal to 0, and generating a blend value using the horizontal position and the vertical position, wherein the blend value specifies a weight of a prediction sample, and
A third array of predicted sample values is generated by using values obtained by adding (1) the product of the blend value and the first array of predicted sample values, and (2) the product of eight minus the blend value and the second array of predicted sample values.
2. The method of claim 1, wherein scaling the horizontal and vertical positions comprises:
in the case where the color component index value indicates a chroma component and the video sample format of the video data is 4:2:2, the horizontal position and the vertical position are scaled based on different size values.
3. The method of claim 1, wherein scaling the horizontal and vertical positions comprises:
In the case where the color component index value indicates a chroma component and the video sample format of the video data is 4:4:4, the horizontal position and the vertical position are scaled based on the same size value.
4. An apparatus comprising one or more processors, the one or more processors configured to:
Receiving a first array of predicted sample values, wherein the first array of predicted sample values is derived from a fractional sample interpolation process using a first motion vector;
receiving a second array of predicted sample values, wherein the second array of predicted sample values is derived from the fractional sample interpolation process using a second motion vector;
Scaling a horizontal position and a vertical position according to a video sampling format of video data based on a color component index value not equal to 0, and generating a mixed value using the horizontal position and the vertical position, wherein the mixed value specifies a weight of a prediction sample, and
A third array of predicted sample values is generated by using values obtained by adding (1) the product of the blend value and the first array of predicted sample values, and (2) the product of eight minus the blend value and the second array of predicted sample values.
5. The apparatus of claim 4, wherein the one or more processors scale the horizontal position and the vertical position based on different size values if the color component index value indicates a chroma component and the video sample format of the video data is 4:2:2.
6. The apparatus of claim 4, wherein the one or more processors scale the horizontal position and the vertical position based on the same size value if the color component index value indicates a chroma component and the video sample format of the video data is 4:4:4.
7. The apparatus of claim 4, wherein the apparatus is a video decoder.
CN202080062841.9A 2019-09-05 2020-09-03 Systems and methods for performing inter-frame predictive coding in video coding Active CN114424537B (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201962896500P 2019-09-05 2019-09-05
US62/896,500 2019-09-05
PCT/JP2020/033496 WO2021045171A1 (en) 2019-09-05 2020-09-03 Systems and methods for performing inter prediction coding in video coding

Publications (2)

Publication Number Publication Date
CN114424537A CN114424537A (en) 2022-04-29
CN114424537B true CN114424537B (en) 2026-01-02

Family

ID=74853202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080062841.9A Active CN114424537B (en) 2019-09-05 2020-09-03 Systems and methods for performing inter-frame predictive coding in video coding

Country Status (3)

Country Link
US (1) US12382071B2 (en)
CN (1) CN114424537B (en)
WO (1) WO2021045171A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112135144B (en) * 2019-06-24 2022-11-01 杭州海康威视数字技术股份有限公司 Encoding and decoding method, device and equipment
CN119404499A (en) * 2022-07-01 2025-02-07 交互数字Ce专利控股有限公司 Reconstruction by mixing prediction and residual

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102215396A (en) * 2010-04-09 2011-10-12 华为技术有限公司 Video coding and decoding methods and systems
US9363516B2 (en) * 2012-01-19 2016-06-07 Qualcomm Incorporated Deblocking chroma data for video coding
GB2501535A (en) * 2012-04-26 2013-10-30 Sony Corp Chrominance Processing in High Efficiency Video Codecs
US9648330B2 (en) * 2013-07-15 2017-05-09 Qualcomm Incorporated Inter-color component residual prediction
US20150373327A1 (en) * 2014-06-20 2015-12-24 Qualcomm Incorporated Block adaptive color-space conversion coding
KR20170026334A (en) * 2014-07-06 2017-03-08 엘지전자 주식회사 Method for processing video signal, and apparatus therefor
EP3202143B8 (en) * 2014-11-18 2019-09-25 MediaTek Inc. Method of bi-prediction video coding based on motion vectors from uni-prediction and merge candidate
US9749646B2 (en) * 2015-01-16 2017-08-29 Microsoft Technology Licensing, Llc Encoding/decoding of high chroma resolution details
CN107409226B (en) * 2015-03-02 2019-11-12 寰发股份有限公司 Method and apparatus for IntraBC mode with fractional pixel block vector resolution in video coding
US11228770B2 (en) * 2016-05-16 2022-01-18 Qualcomm Incorporated Loop sample processing for high dynamic range and wide color gamut video coding
US20190273943A1 (en) * 2016-10-10 2019-09-05 Sharp Kabushiki Kaisha Systems and methods for performing motion compensation for coding of video data
CN117041557A (en) * 2016-12-07 2023-11-10 株式会社Kt Method of decoding or encoding image signal and apparatus for storing video data
EP3744092A1 (en) * 2018-01-26 2020-12-02 InterDigital VC Holdings, Inc. Method and apparatus for video encoding and decoding based on a linear model responsive to neighboring samples
CN112468825B (en) * 2018-12-28 2022-04-26 杭州海康威视数字技术股份有限公司 Coding and decoding method and equipment thereof
US12022095B2 (en) * 2019-03-15 2024-06-25 Qualcomm Incorporated Video coding with unfiltered reference samples using different chroma formats
GB2582929A (en) * 2019-04-08 2020-10-14 Canon Kk Residual signalling
KR20210009351A (en) * 2019-06-21 2021-01-26 후아웨이 테크놀러지 컴퍼니 리미티드 Encoder, decoder and response method for sub-block partitioning mode
GB2585017A (en) * 2019-06-24 2020-12-30 Canon Kk Video coding and decoding
CN117319645A (en) * 2019-08-23 2023-12-29 北京字节跳动网络技术有限公司 Method, device and computer-readable storage medium for processing video data

Also Published As

Publication number Publication date
CN114424537A (en) 2022-04-29
WO2021045171A1 (en) 2021-03-11
US20220303561A1 (en) 2022-09-22
US12382071B2 (en) 2025-08-05

Similar Documents

Publication Publication Date Title
CN112204967B (en) Apparatus and method for encoding video data
CN110249629B (en) Method for determining segmentation of video data for video coding
CN114586347B (en) System and method for reducing reconstruction errors in video coding based on cross-component correlation
US12126814B2 (en) Systems and methods for performing intra prediction coding
US12160605B2 (en) Systems and methods for performing intra prediction coding in video coding
US20250088663A1 (en) Image decoding apparatus and image coding apparatus
CN112997501B (en) Method to derive predicted luma sample values for decoding video data
CN113366840B (en) A method for decoding image data
US20260012593A1 (en) Systems and methods for performing intra prediction coding in video coding
CN114424537B (en) Systems and methods for performing inter-frame predictive coding in video coding
US20240340424A1 (en) Systems and methods for deriving quantization parameters for video blocks in video coding
CN115428447B (en) System and method for performing intra prediction in video coding
WO2020166556A1 (en) Systems and methods for performing inter prediction in video coding
CN121531126A (en) Methods for encoding video data, decoding devices, encoding devices, and storage media
WO2020095984A1 (en) Systems and methods for performing binary arithmetic coding in video coding

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant