
CN118338022A - Method and apparatus for performing neural network filtering on video data - Google Patents


Info

Publication number
CN118338022A
Authority
CN
China
Prior art keywords
nnpfc
equal
picture
post
video
Prior art date
Legal status
Pending
Application number
CN202410022693.8A
Other languages
Chinese (zh)
Inventor
Sachin G. Deshpande
Ahmed Cheikh Sidiya
Current Assignee
Sharp Corp
Original Assignee
Sharp Corp
Priority date
Filing date
Publication date
Priority claimed from US18/113,345 external-priority patent/US12262060B2/en
Application filed by Sharp Corp filed Critical Sharp Corp
Publication of CN118338022A publication Critical patent/CN118338022A/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/117Filters, e.g. for pre-processing or post-processing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/13Adaptive entropy coding, e.g. adaptive variable length coding [AVLC] or context adaptive binary arithmetic coding [CABAC]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/102Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
    • H04N19/132Sampling, masking or truncation of coding units, e.g. adaptive resampling, frame skipping, frame interpolation or high-frequency transform coefficient masking
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/157Assigned coding mode, i.e. the coding mode being predefined or preselected to be further used for selection of another element or parameter
    • H04N19/159Prediction type, e.g. intra-frame, inter-frame or bidirectional frame prediction
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/70Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by syntax aspects related to video coding, e.g. related to compression standards

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

A device configurable to perform filtering based on information included in a neural network post-filter characteristic message is disclosed. In one example, the neural network post-filter characteristic message includes a syntax element indicating whether the post-processing filter accepts as input a block size having a width equal to an indicated horizontal sample count and a height equal to an indicated vertical sample count, or a block size having a width equal to a multiple of the indicated horizontal sample count and a height equal to a multiple of the indicated vertical sample count.

Description

Method and apparatus for performing neural network filtering on video data
RELATED APPLICATIONS
The present application claims the benefit of U.S. provisional patent application No. 63/438,779, filed on January 12, 2023, which application is incorporated herein by reference in its entirety.
Technical Field
The present disclosure relates to video coding, and more particularly to a system and method for signaling neural network post-filter parameter information of coded video.
Background
Digital video functionality may be incorporated into a variety of devices, including digital televisions, notebook or desktop computers, tablet computers, digital recording devices, digital media players, video gaming devices, cellular telephones (including so-called smartphones), medical imaging devices, and the like. Digital video may be encoded according to a video coding standard. A video coding standard defines the format of a compatible bitstream encapsulating coded video data. A compatible bitstream is a data structure that may be received and decoded by a video decoding device to generate reconstructed video data. A video coding standard may incorporate video compression techniques. Examples of video coding standards include ISO/IEC MPEG-4 Visual, ITU-T H.264 (also known as ISO/IEC MPEG-4 AVC), and High Efficiency Video Coding (HEVC). HEVC is described in High Efficiency Video Coding (HEVC), Rec. ITU-T H.265, December 2016, which is incorporated herein by reference and referred to herein as ITU-T H.265. Extensions and improvements to ITU-T H.265 are being considered for the development of next-generation video coding standards. For example, the ITU-T Video Coding Experts Group (VCEG) and ISO/IEC Moving Picture Experts Group (MPEG), collectively referred to as the Joint Video Exploration Team (JVET), studied video coding technology with compression capability significantly exceeding that of the current HEVC standard with a view toward standardization. The Joint Exploration Model 7 (JEM 7), described in Algorithm Description of Joint Exploration Test Model 7 (JEM 7), ISO/IEC JTC1/SC29/WG11 document JVET-G1001 (July 2017, Turin, IT), which is incorporated herein by reference, describes the coding features under coordinated test model study by JVET as potentially enhancing video coding technology beyond the capabilities of ITU-T H.265. It should be noted that the coding features of JEM 7 are implemented in JEM reference software. As used herein, the term JEM may collectively refer to the algorithms included in JEM 7 as well as implementations of JEM reference software. Further, in response to the "Joint Call for Proposals on Video Compression with Capabilities beyond HEVC" jointly issued by VCEG and MPEG, multiple descriptions of video coding tools were proposed by various groups at the 10th meeting of ISO/IEC JTC1/SC29/WG11, 16-20 April 2018, San Diego, CA. Based on the multiple descriptions of video coding tools, a resulting initial draft text of a video coding specification is described in "Versatile Video Coding (Draft 1)," 10th meeting of ISO/IEC JTC1/SC29/WG11, 16-20 April 2018, San Diego, CA, document JVET-J1001-v2, which is incorporated herein by reference and referred to as JVET-J1001. The ongoing development of a video coding standard by VCEG and MPEG is referred to as the Versatile Video Coding (VVC) project. "Versatile Video Coding (Draft 10)," 20th meeting of ISO/IEC JTC1/SC29/WG11, 7-16 October 2020, teleconference, document JVET-T2001-v2, which is incorporated herein by reference and referred to as JVET-T2001, represents the current iteration of the draft text of the video coding specification corresponding to the VVC project.
Video compression techniques can reduce the data requirements for storing and transmitting video data. Video compression techniques may reduce data requirements by exploiting the redundancy inherent in a video sequence. Video compression techniques may subdivide a video sequence into successively smaller portions (i.e., groups of pictures within the video sequence, a picture within a group of pictures, regions within a picture, sub-regions within regions, etc.). Intra prediction coding techniques (e.g., spatial prediction techniques within a picture) and inter prediction techniques (i.e., inter-picture (temporal) techniques) may be used to generate difference values between a unit of video data to be coded and a reference unit of video data. The difference values may be referred to as residual data. Residual data may be coded as quantized transform coefficients. Syntax elements may relate residual data and a reference coding unit (e.g., intra prediction mode indices and motion information). Residual data and syntax elements may be entropy coded. Entropy encoded residual data and syntax elements may be included in data structures forming a compatible bitstream.
Disclosure of Invention
In general, this disclosure describes various techniques for coding video data. In particular, this disclosure describes techniques for signaling neural network post-filter parameter information for coded video data. It should be noted that although the techniques of this disclosure are described with respect to ITU-T H.264, ITU-T H.265, JEM, and JVET-T2001, the techniques of this disclosure are generally applicable to video coding. For example, the coding techniques described herein may be incorporated into video coding systems (including video coding systems based on future video coding standards) that include video block structures, intra prediction techniques, inter prediction techniques, transform techniques, filtering techniques, and/or entropy coding techniques other than those included in ITU-T H.265, JEM, and JVET-T2001. Accordingly, references to ITU-T H.264, ITU-T H.265, JEM, and/or JVET-T2001 are for descriptive purposes and should not be construed as limiting the scope of the techniques described herein. Further, it should be noted that the incorporation of documents by reference herein is for descriptive purposes and should not be construed as limiting or creating ambiguity with respect to the terms used herein. For example, where a definition of a term provided in an incorporated reference differs from that of another incorporated reference and/or from the term as used herein, the term should be interpreted in a manner that broadly includes each respective definition and/or in a manner that includes each particular definition in the alternative.
In one example, a method of encoding video data includes signaling a neural network post-filter characteristic message, and signaling a first syntax element in the neural network post-filter characteristic message, the first syntax element indicating whether a post-processing filter accepts as input a block size having a width equal to a horizontal sample count indicated by a second syntax element and a height equal to a vertical sample count indicated by a third syntax element, or a block size having a width equal to a multiple of a horizontal sample count indicated by a fourth syntax element and a height equal to a multiple of a vertical sample count indicated by a fifth syntax element.
In one example, an apparatus includes one or more processors configured to: signal a neural network post-filter characteristic message; and signal, in the neural network post-filter characteristic message, a first syntax element indicating whether the post-processing filter accepts as input a block size having a width equal to a horizontal sample count indicated by the second syntax element and a height equal to a vertical sample count indicated by the third syntax element, or a block size having a width equal to a multiple of a horizontal sample count indicated by the fourth syntax element and a height equal to a multiple of a vertical sample count indicated by the fifth syntax element.
In one example, a non-transitory computer-readable storage medium includes instructions stored thereon that, when executed, cause one or more processors of a device to signal a neural network post-filter characteristic message and to signal, in the neural network post-filter characteristic message, a first syntax element indicating whether a post-processing filter accepts as input a block size having a width equal to a horizontal sample count indicated by a second syntax element and a height equal to a vertical sample count indicated by a third syntax element, or a block size having a width equal to a multiple of a horizontal sample count indicated by a fourth syntax element and a height equal to a multiple of a vertical sample count indicated by a fifth syntax element.
In one example, an apparatus includes: means for signaling a neural network post-filter characteristic message; and means for signaling a first syntax element in the neural network post-filter characteristic message, the first syntax element indicating whether the post-processing filter accepts as input a block size having a width equal to a horizontal sample count indicated by the second syntax element and a height equal to a vertical sample count indicated by the third syntax element, or a block size having a width equal to a multiple of a horizontal sample count indicated by the fourth syntax element and a height equal to a multiple of a vertical sample count indicated by the fifth syntax element.
In one example, a method of decoding video data includes receiving a neural network post-filter characteristic message, and parsing a first syntax element from the neural network post-filter characteristic message, the first syntax element indicating whether a post-processing filter accepts as input a block size having a width equal to a horizontal sample count indicated by the second syntax element and a height equal to a vertical sample count indicated by the third syntax element, or a block size having a width equal to a multiple of a horizontal sample count indicated by the fourth syntax element and a height equal to a multiple of a vertical sample count indicated by the fifth syntax element.
In one example, an apparatus includes one or more processors configured to: receiving a neural network post-filter characteristic message; a first syntax element is parsed from the neural network post-filter characteristic message, the first syntax element indicating whether the post-processing filter accepts as input a block size having a width equal to a horizontal sample count indicated by the second syntax element and a height equal to a vertical sample count indicated by the third syntax element, or a block size having a width equal to a multiple of a horizontal sample count indicated by the fourth syntax element and a height equal to a multiple of a vertical sample count indicated by the fifth syntax element.
In one example, a non-transitory computer-readable storage medium includes instructions stored thereon that, when executed, cause one or more processors of a device to receive a neural network post-filter characteristic message and parse a first syntax element from the neural network post-filter characteristic message, the first syntax element indicating whether a post-processing filter accepts as input a block size having a width equal to a horizontal sample count indicated by a second syntax element and a height equal to a vertical sample count indicated by a third syntax element, or a block size having a width equal to a multiple of a horizontal sample count indicated by a fourth syntax element and a height equal to a multiple of a vertical sample count indicated by a fifth syntax element.
In one example, an apparatus includes: means for receiving a neural network post-filter characteristic message; and means for parsing a first syntax element from the neural network post-filter characteristic message, the first syntax element indicating whether the post-processing filter accepts as input a block size having a width equal to a horizontal sample count indicated by the second syntax element and a height equal to a vertical sample count indicated by the third syntax element, or a block size having a width equal to a multiple of a horizontal sample count indicated by the fourth syntax element and a height equal to a multiple of a vertical sample count indicated by the fifth syntax element.
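By way of illustration only, the following Python sketch shows how the first syntax element described above could be interpreted. The function and parameter names (accepted_block_sizes, block_size_flag, horizontal_samples, vertical_samples) are hypothetical and do not correspond to the actual syntax element names; the bound on the number of multiples is likewise an assumption made purely for illustration.

```python
# Illustrative sketch (not the normative SEI syntax): derive the set of input
# block sizes a post-processing filter accepts from a hypothetical block-size
# flag and signaled horizontal/vertical sample counts.

def accepted_block_sizes(block_size_flag: bool,
                         horizontal_samples: int,
                         vertical_samples: int,
                         max_multiple: int = 4):
    """Return the (width, height) pairs the post-processing filter accepts.

    block_size_flag == False: only the exact signaled block size is accepted.
    block_size_flag == True:  any positive integer multiple of the signaled
                              width and height is accepted (bounded here by
                              max_multiple purely for illustration).
    """
    if not block_size_flag:
        return [(horizontal_samples, vertical_samples)]
    return [(m * horizontal_samples, m * vertical_samples)
            for m in range(1, max_multiple + 1)]


if __name__ == "__main__":
    # Exact block size only, e.g. 128x64 luma samples.
    print(accepted_block_sizes(False, 128, 64))
    # Multiples of the signaled size, e.g. 64x64, 128x128, 192x192, ...
    print(accepted_block_sizes(True, 64, 64))
```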
The details of one or more examples are set forth in the accompanying drawings and the description below. Other features, objects, and advantages will be apparent from the description and drawings, and from the claims.
Drawings
Fig. 1 is a block diagram illustrating an example of a system that may be configured to encode and decode video data in accordance with one or more techniques of the present disclosure.
Fig. 2 is a conceptual diagram illustrating encoded video data and corresponding data structures according to one or more techniques of the present disclosure.
Fig. 3 is a conceptual diagram illustrating a data structure encapsulating encoded video data and corresponding metadata according to one or more techniques of the present disclosure.
Fig. 4 is a conceptual diagram illustrating an example of components that may be included in a particular implementation of a system that may be configured to encode and decode video data in accordance with one or more techniques of the present disclosure.
Fig. 5 is a block diagram illustrating an example of a video encoder that may be configured to encode video data in accordance with one or more techniques of this disclosure.
Fig. 6 is a block diagram illustrating an example of a video decoder that may be configured to decode video data in accordance with one or more techniques of this disclosure.
Fig. 7 is a conceptual diagram illustrating an example of a packed data channel of a luminance component according to one or more techniques of this disclosure.
Detailed Description
Video content comprises a video sequence consisting of a series of frames (or pictures). A series of frames may also be referred to as a group of pictures (GOP). Each video frame or picture may be divided into one or more regions. The region may be defined according to a base unit (e.g., video block) and a set of rules defining the region. For example, a rule defining a region may be that the region must be an integer number of video blocks arranged in a rectangle. Further, the video blocks in the region may be ordered according to a scan pattern (e.g., raster scan). As used herein, the term "video block" may generally refer to a region of a picture, or may more particularly refer to a largest array of sample values, sub-partitions thereof, and/or corresponding structures that may be predictively encoded. Furthermore, the term "current video block" may refer to an area of a picture being encoded or decoded. A video block may be defined as an array of sample values. It should be noted that in some cases, pixel values may be described as sample values that include corresponding components of video data, which may also be referred to as color components (e.g., luminance (Y) and chrominance (Cb and Cr) components or red, green, and blue components). It should be noted that in some cases, the terms "pixel value" and "sample value" may be used interchangeably. Further, in some cases, a pixel or sample may be referred to as pel. The video sampling format (which may also be referred to as chroma format) may define the number of chroma samples included in a video block relative to the number of luma samples included in the video block. For example, for a 4:2:0 sampling format, the sampling rate of the luma component is twice the sampling rate of the chroma components in both the horizontal and vertical directions.
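As a simple illustration of the relationship between chroma format and sample counts described above, the following sketch computes chroma block dimensions for common sampling formats; the table of subsampling factors reflects the usual convention and is included here only as an assumption for illustration.

```python
# Illustrative sketch: chroma sample dimensions for a luma block under common
# chroma (sampling) formats. 4:2:0 halves both directions, 4:2:2 halves only
# horizontally, and 4:4:4 applies no subsampling.

SUBSAMPLING = {
    "4:2:0": (2, 2),  # (horizontal divisor, vertical divisor)
    "4:2:2": (2, 1),
    "4:4:4": (1, 1),
}

def chroma_dimensions(luma_width: int, luma_height: int, chroma_format: str):
    sub_w, sub_h = SUBSAMPLING[chroma_format]
    return luma_width // sub_w, luma_height // sub_h

if __name__ == "__main__":
    # For a 64x64 luma block in 4:2:0, each chroma block is 32x32.
    print(chroma_dimensions(64, 64, "4:2:0"))  # (32, 32)
```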
The video encoder may perform predictive encoding on the video block and its sub-partitions. The video block and its sub-partitions may be referred to as nodes. ITU-t h.264 specifies a macroblock comprising 16 x 16 luma samples. That is, in ITU-T H.264, pictures are segmented into macroblocks. ITU-t h.265 specifies a similar Coding Tree Unit (CTU) structure, which may be referred to as a Largest Coding Unit (LCU). In ITU-T H.265, pictures are segmented into CTUs. In ITU-t h.265, CTU sizes may be set to include 16 x 16, 32 x 32, or 64 x 64 luma samples for pictures. In ITU-t h.265, a CTU is composed of respective Coding Tree Blocks (CTBs) for each component of video data, e.g., luminance (Y) and chrominance (Cb and Cr). It should be noted that a video with one luminance component and two corresponding chrominance components may be described as having two channels, namely a luminance channel and a chrominance channel. Furthermore, in ITU-t h.265, CTUs may be divided according to a Quadtree (QT) division structure, which causes CTBs of the CTUs to be divided into Coded Blocks (CBs). That is, in ITU-T H.265, the CTU may be divided into quadtree leaf nodes. According to ITU-t h.265, one luma CB together with two corresponding chroma CBs and associated syntax elements are referred to as a Coding Unit (CU). In ITU-t h.265, the minimum allowed size of the CB may be signaled. In ITU-t h.265, the minimum allowable minimum size of luminance CB is 8 x 8 luminance samples. In ITU-t h.265, the decision to encode a picture region using intra prediction or inter prediction is made at the CU level.
In ITU-t h.265, a CU is associated with a prediction unit structure with its root at the CU. In ITU-t h.265, the prediction unit structure allows partitioning of luma CB and chroma CB to generate corresponding reference samples. That is, in ITU-t h.265, luminance CB and chrominance CB may be partitioned into respective luminance prediction blocks and chrominance Prediction Blocks (PB), where PB comprises blocks of sample values to which the same prediction is applied. In ITU-T H.265, CBs can be divided into 1, 2 or 4 PBs. ITU-t h.265 supports PB sizes from 64 x 64 samples down to 4 x 4 samples. In ITU-t h.265, square PB is supported for intra prediction, where CB may form PB or CB may be partitioned into four square PB. In ITU-t h.265, rectangular PB is supported for inter prediction in addition to square PB, where CB may be halved vertically or horizontally to form PB. Furthermore, it should be noted that in ITU-t h.265, for inter prediction, four asymmetric PB partitioning is supported, where CB is partitioned into two PB at one quarter of the height (top or bottom) or width (left or right) of CB. Intra-prediction data (e.g., intra-prediction mode syntax elements) or inter-prediction data (e.g., motion data syntax elements) corresponding to the PB are used to generate reference and/or prediction sample values for the PB.
JEM specifies a CTU having a maximum size of 256 x 256 luma samples. JEM specifies a quadtree plus binary tree (QTBT) block structure. In JEM, the QTBT structure allows quadtree leaf nodes to be further partitioned by a binary tree (BT) structure. That is, in JEM, the binary tree structure allows quadtree leaf nodes to be recursively divided vertically or horizontally. In JVET-T2001, CTUs are partitioned according to a quadtree plus multi-type tree (QTMT or QT+MTT) structure. The QTMT in JVET-T2001 is similar to the QTBT in JEM. However, in JVET-T2001, in addition to indicating binary splits, the multi-type tree may also indicate so-called ternary (or triple tree (TT)) splits. A ternary split divides a block vertically or horizontally into three blocks. In the case of a vertical TT split, a block is split at one quarter of its width from the left edge and at one quarter of its width from the right edge, and in the case of a horizontal TT split, a block is split at one quarter of its height from the top edge and at one quarter of its height from the bottom edge.
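The quarter/half/quarter geometry of a TT split described above can be illustrated with a short sketch; the function below is hypothetical and ignores the minimum block-size constraints a real encoder would enforce.

```python
# Illustrative sketch: sub-block sizes produced by a ternary-tree (TT) split
# as described above (one quarter, one half, one quarter of the split dimension).

def ternary_split(width: int, height: int, vertical: bool):
    """Return the (width, height) of the three sub-blocks of a TT split."""
    if vertical:
        # Split at one quarter of the width from the left edge and at one
        # quarter of the width from the right edge.
        return [(width // 4, height), (width // 2, height), (width // 4, height)]
    # Horizontal split: one quarter of the height from the top and bottom edges.
    return [(width, height // 4), (width, height // 2), (width, height // 4)]

if __name__ == "__main__":
    print(ternary_split(32, 16, vertical=True))   # [(8, 16), (16, 16), (8, 16)]
    print(ternary_split(32, 16, vertical=False))  # [(32, 4), (32, 8), (32, 4)]
```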
As described above, each video frame or picture may be divided into one or more regions. For example, according to ITU-t h.265, each video frame or picture may be divided to include one or more slices, and further divided to include one or more tiles, wherein each slice includes a sequence of CTUs (e.g., arranged in raster scan order), and wherein a tile is a sequence of CTUs corresponding to a rectangular region of a picture. It should be noted that in ITU-t h.265, a slice is a sequence of one or more slice segments starting from an independent slice segment and containing all subsequent dependent slice segments (if any) before the next independent slice segment (if any). A slice segment (e.g., slice) is a CTU sequence. Thus, in some cases, the terms "slice" and "slice segment" are used interchangeably to indicate a sequence of CTUs arranged in a raster scan order arrangement. Further, it should be noted that in ITU-T H.265, a tile may be composed of CTUs contained in more than one slice, and a slice may be composed of CTUs contained in more than one tile. However, ITU-t h.265 specifies that one or both of the following conditions should be met: (1) all CTUs in a slice belong to the same tile; and (2) all CTUs in a tile belong to the same slice.
With respect to JVET-T2001, a slice needs to be made up of an integer number of complete tiles or an integer number of consecutive complete CTU rows within a tile, rather than just an integer number of CTUs. It should be noted that in JVET-T2001, the slice design does not include slice segments (i.e., there are no independent/dependent slice segments). Thus, in JVET-T2001, a picture may include a single tile, where the single tile is contained within a single slice, or a picture may include multiple tiles, where the multiple tiles (or rows of CTUs thereof) may be contained within one or more slices. In JVET-T2001, a picture is specified to divide the picture into tiles by specifying a respective height of a tile row and a respective width of a tile column. Thus, in JVET-T2001, a tile is a rectangular CTU region within a particular tile row and a particular tile column location. Furthermore, it should be noted that JVET-T2001 specifies that a picture may be divided into sub-pictures, where a sub-picture is a rectangular CTU region within the picture. The upper left CTU of a sub-picture may be located at any CTU position within the picture, where the sub-picture is constrained to include one or more slices. Thus, unlike tiles, sub-pictures need not be limited to specific row and column positions. It should be noted that the sub-pictures may be used to encapsulate regions of interest within the picture, and that the sub-bitstream extraction process may be used to decode and display only specific regions of interest. That is, as described in further detail below, the bitstream of encoded video data includes a Network Abstraction Layer (NAL) unit sequence, where NAL units encapsulate encoded video data (i.e., video data corresponding to a picture slice), or NAL units encapsulate metadata (e.g., parameter sets) for decoding video data, and the sub-bitstream extraction process forms a new bitstream by removing one or more NAL units from the bitstream.
Fig. 2 is a conceptual diagram illustrating an example of pictures within a group of pictures divided according to tiles, slices, and sub-pictures. It should be noted that the techniques described herein may be applied to tiles, slices, sub-pictures, sub-partitions thereof, and/or equivalent structures thereof. That is, the techniques described herein may be universally applicable regardless of how a picture is divided into regions. For example, in some cases, the techniques described herein may be applicable where tiles may be divided into so-called bricks, where a brick is a rectangular CTU row region within a particular tile. Further, for example, in some cases, the techniques described herein may be applicable where one or more tiles may be included in a so-called tile group, where the tile group includes an integer number of adjacent tiles. In the example shown in Fig. 2, Pic 3 is shown as including 16 tiles (i.e., tile 0 through tile 15) and three slices (i.e., slice 0 through slice 2). In the example shown in Fig. 2, slice 0 includes four tiles (i.e., tile 0 through tile 3), slice 1 includes eight tiles (i.e., tile 4 through tile 11), and slice 2 includes four tiles (i.e., tile 12 through tile 15). Moreover, as shown in the example of Fig. 2, Pic 3 is shown as including two sub-pictures (i.e., sub-picture 0 and sub-picture 1), wherein sub-picture 0 includes slice 0 and slice 1, and wherein sub-picture 1 includes slice 2. As described above, a sub-picture may be used to encapsulate a region of interest within a picture, and a sub-bitstream extraction process may be used to selectively decode (and display) the region of interest. For example, referring to Fig. 2, sub-picture 0 may correspond to an action portion of a sporting event presentation (e.g., a view of the venue), and sub-picture 1 may correspond to a scrolling banner displayed during the sporting event presentation. By organizing the picture into sub-pictures in this way, a viewer may be able to disable the display of the scrolling banner. That is, through the sub-bitstream extraction process, slice 2 NAL units may be removed from the bitstream (and thus not decoded and/or displayed), while slice 0 NAL units and slice 1 NAL units may be decoded and displayed. How slices of a picture are encapsulated into corresponding NAL unit data structures, and how sub-bitstream extraction is performed, are described in further detail below.
For intra prediction encoding, an intra prediction mode may specify the location of a reference sample within a picture. In ITU-T H.265, the possible intra prediction modes that have been defined include a planar (i.e., surface-fitting) prediction mode, a DC (i.e., flat ensemble average) prediction mode, and 33 angular prediction modes (predMode: 2-34). In JEM, the possible intra prediction modes that have been defined include a planar prediction mode, a DC prediction mode, and 65 angular prediction modes. It should be noted that the plane prediction mode and the DC prediction mode may be referred to as a non-directional prediction mode, and the angle prediction mode may be referred to as a directional prediction mode. It should be noted that the techniques described herein may be universally applicable regardless of the number of possible prediction modes that have been defined.
For inter prediction encoding, a reference picture is determined, and a Motion Vector (MV) identifies samples in the reference picture that are used to generate a prediction for a current video block. For example, a current video block may be predicted using reference sample values located in one or more previously encoded pictures, and a motion vector used to indicate the position of the reference block relative to the current video block. The motion vector may describe, for example, a horizontal displacement component of the motion vector (i.e., MV x), a vertical displacement component of the motion vector (i.e., MV y), and a resolution of the motion vector (e.g., one-quarter pixel precision, one-half pixel precision, one-pixel precision, two-pixel precision, four-pixel precision). Previously decoded pictures (which may include pictures output before or after the current picture) may be organized into one or more reference picture lists and identified using reference picture index values. Furthermore, in inter prediction coding, single prediction refers to generating a prediction using sample values from a single reference picture, and double prediction refers to generating a prediction using corresponding sample values from two reference pictures. That is, in single prediction, a single reference picture and a corresponding motion vector are used to generate a prediction for a current video block, while in bi-prediction, a first reference picture and a corresponding first motion vector and a second reference picture and a corresponding second motion vector are used to generate a prediction for the current video block. In bi-prediction, the corresponding sample values are combined (e.g., added, rounded and clipped, or averaged according to weights) to generate a prediction. Pictures and their regions may be classified based on which types of prediction modes are available to encode their video blocks. That is, for a region having a B type (e.g., a B slice), bi-prediction, uni-prediction, and intra-prediction modes may be utilized, for a region having a P type (e.g., a P slice), uni-prediction and intra-prediction modes may be utilized, and for a region having an I type (e.g., an I slice), only intra-prediction modes may be utilized. As described above, the reference picture is identified by the reference index. For example, for P slices, there may be a single reference picture list RefPicList0, and for B slices, there may be a second independent reference picture list RefPicList1 in addition to RefPicList 0. It should be noted that for single prediction in a B slice, one of RefPicList0 or RefPicList1 may be used to generate the prediction. Further, it should be noted that during the decoding process, at the start of decoding a picture, a reference picture list is generated from previously decoded pictures stored in a Decoded Picture Buffer (DPB).
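As a rough illustration of combining corresponding sample values in bi-prediction, the following sketch performs a weighted average with rounding and clipping; the weights and rounding rule are assumptions for illustration and do not reproduce any particular standard's interpolation or weighted-prediction process.

```python
# Illustrative sketch: generating a bi-predicted block by combining
# corresponding sample values from two reference blocks using a (weighted)
# average with rounding and clipping.

def bi_predict(ref0, ref1, w0: int = 1, w1: int = 1, bit_depth: int = 8):
    max_val = (1 << bit_depth) - 1
    pred = []
    for row0, row1 in zip(ref0, ref1):
        pred_row = []
        for s0, s1 in zip(row0, row1):
            value = (w0 * s0 + w1 * s1 + (w0 + w1) // 2) // (w0 + w1)  # round
            pred_row.append(min(max(value, 0), max_val))               # clip
        pred.append(pred_row)
    return pred

if __name__ == "__main__":
    ref0 = [[100, 102], [104, 106]]
    ref1 = [[110, 108], [106, 104]]
    print(bi_predict(ref0, ref1))  # [[105, 105], [105, 105]]
```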
Furthermore, the coding standard may support various motion vector prediction modes. Motion vector prediction enables the value of a motion vector for a current video block to be derived based on another motion vector. For example, a set of candidate blocks with associated motion information may be derived from the spatial neighboring blocks and the temporal neighboring blocks of the current video block. In addition, the generated (or default) motion information may be used for motion vector prediction. Examples of motion vector prediction include Advanced Motion Vector Prediction (AMVP), temporal Motion Vector Prediction (TMVP), so-called "merge" mode, and "skip" and "direct" motion reasoning. Further, other examples of motion vector prediction include Advanced Temporal Motion Vector Prediction (ATMVP) and spatio-temporal motion vector prediction (STMVP). For motion vector prediction, both the video encoder and the video decoder perform the same process to derive a set of candidates. Thus, for the current video block, the same set of candidates is generated during encoding and decoding.
As described above, for inter prediction coding, reference samples in a previously coded picture are used for coding video blocks in a current picture. Previously coded pictures that are available for use as a reference when coding a current picture are referred to as reference pictures. It should be noted that the decoding order does not necessarily correspond to the picture output order, i.e., the temporal order of pictures in a video sequence. In ITU-T H.265, when a picture is decoded, it is stored to a decoded picture buffer (DPB) (which may be referred to as a frame buffer, a reference picture buffer, or the like). In ITU-T H.265, pictures stored to the DPB are removed from the DPB when they have been output and are no longer needed for coding subsequent pictures. In ITU-T H.265, a determination of whether pictures should be removed from the DPB is invoked once per picture, after decoding a slice header, i.e., at the onset of decoding a picture. For example, referring to Fig. 2, Pic 2 is shown as referencing Pic 1. Similarly, Pic 3 is shown as referencing Pic 0. With respect to Fig. 2, assuming the picture numbering corresponds to the decoding order, the DPB would be populated as follows: after decoding Pic 0, the DPB would include {Pic 0}; at the onset of decoding Pic 1, the DPB would include {Pic 0}; after decoding Pic 1, the DPB would include {Pic 0, Pic 1}; at the onset of decoding Pic 2, the DPB would include {Pic 0, Pic 1}. Pic 2 would then be decoded with reference to Pic 1, and after decoding Pic 2, the DPB would include {Pic 0, Pic 1, Pic 2}. At the onset of decoding Pic 3, pictures Pic 1 and Pic 2 would be marked for removal from the DPB, as they are not needed for decoding Pic 3 (or any subsequent pictures, not shown), and, assuming Pic 1 and Pic 2 have been output, the DPB would be updated to include {Pic 0}. Pic 3 would then be decoded with reference to Pic 0. The process of marking pictures for removal from the DPB may be referred to as reference picture set (RPS) management.
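The DPB behavior described above for Fig. 2 can be traced with a small sketch. The sketch below assumes the decoding order Pic 0 through Pic 3, the reference relationships stated above, and that every stored picture has already been output; it is a simplified trace, not an RPS implementation.

```python
# Illustrative sketch of the DPB behavior described above for Fig. 2, assuming
# decoding order Pic0, Pic1, Pic2, Pic3, where Pic2 references Pic1 and Pic3
# references Pic0, and assuming every stored picture has already been output.

references = {"Pic0": [], "Pic1": [], "Pic2": ["Pic1"], "Pic3": ["Pic0"]}
decoding_order = ["Pic0", "Pic1", "Pic2", "Pic3"]

dpb = []
for i, pic in enumerate(decoding_order):
    # Before decoding the current picture, remove pictures that are no longer
    # referenced by the current picture or any later picture.
    still_needed = set()
    for later in decoding_order[i:]:
        still_needed.update(references[later])
    dpb = [p for p in dpb if p in still_needed]
    print(f"DPB at start of decoding {pic}: {dpb}")
    dpb.append(pic)  # store the decoded picture
print(f"DPB after decoding {decoding_order[-1]}: {dpb}")
```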
As described above, the intra prediction data or inter prediction data is used to generate reference sample values for a block of sample values. The difference between the sample values included in the current PB or another type of picture region structure and the associated reference samples (e.g., those generated using prediction) may be referred to as residual data. The residual data may include a respective difference array corresponding to each component of the video data. The residual data may be in the pixel domain. A transform such as a Discrete Cosine Transform (DCT), a Discrete Sine Transform (DST), an integer transform, a wavelet transform, or a conceptually similar transform may be applied to the array of differences to generate transform coefficients. It should be noted that in ITU-T H.265 and JVET-T2001, CUs are associated with a transform tree structure that has their root at the CU level. The transform tree is divided into one or more Transform Units (TUs). That is, to generate transform coefficients, an array of differences may be partitioned (e.g., four 8 x 8 transforms may be applied to a 16 x 16 array of residual values). For each component of video data, such subdivision of the difference value may be referred to as a Transform Block (TB). It should be noted that in some cases, a core transform and a subsequent secondary transform may be applied (in a video encoder) to generate transform coefficients. For video decoders, the order of the transforms is reversed.
The quantization process may be performed directly on the transform coefficients or residual sample values (e.g., in terms of palette coded quantization). Quantization approximates transform coefficients by limiting the amplitude to a set of specified values. Quantization essentially scales the transform coefficients to change the amount of data needed to represent a set of transform coefficients. Quantization may include dividing the transform coefficients (or values resulting from adding an offset value to the transform coefficients) by a quantization scaling factor and any associated rounding function (e.g., rounding to the nearest integer). The quantized transform coefficients may be referred to as coefficient level values. Inverse quantization (or "dequantization") may include multiplying coefficient level values by quantization scaling factors, and any reciprocal rounding or offset addition operations. It should be noted that, as used herein, the term quantization process may refer in some cases to division by a scaling factor to generate level values, and may refer in some cases to multiplication by a scaling factor to recover transform coefficients. That is, the quantization process may refer to quantization in some cases, and may refer to inverse quantization in some cases. Further, it should be noted that while the quantization process is described in some of the examples below with respect to arithmetic operations associated with decimal notation, such description is for illustrative purposes and should not be construed as limiting. For example, the techniques described herein may be implemented in a device using binary operations or the like. For example, the multiply and divide operations described herein may be implemented using shift operations or the like.
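The description of quantization as scaling can be illustrated with a minimal sketch; the scaling factor below is arbitrary and stands in for the factor a codec would derive from a quantization parameter.

```python
# Illustrative sketch: quantization as division by a scaling factor with
# rounding, and inverse quantization ("dequantization") as multiplication by
# the same factor.

def quantize(coeffs, scale: float):
    return [int(round(c / scale)) for c in coeffs]

def dequantize(levels, scale: float):
    return [lvl * scale for lvl in levels]

if __name__ == "__main__":
    transform_coeffs = [220.0, -37.0, 12.0, -3.0, 1.0]
    levels = quantize(transform_coeffs, scale=8.0)
    print(levels)                          # [28, -5, 2, 0, 0]
    print(dequantize(levels, scale=8.0))   # approximation of the original coefficients
```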
The quantized transform coefficients and syntax elements (e.g., syntax elements indicating a coding structure for a video block) may be entropy coded according to an entropy coding technique. An entropy coding process includes coding values of syntax elements using lossless data compression algorithms. Examples of entropy coding techniques include content adaptive variable length coding (CAVLC), context adaptive binary arithmetic coding (CABAC), probability interval partitioning entropy coding (PIPE), and the like. Entropy encoded quantized transform coefficients and corresponding entropy encoded syntax elements may form a compatible bitstream that can be used to reproduce video data at a video decoder. An entropy coding process, for example CABAC, may include performing a binarization on syntax elements. Binarization refers to the process of converting a value of a syntax element into a series of one or more bits. These bits may be referred to as "bins." Binarization may include one or a combination of the following coding techniques: fixed length coding, unary coding, truncated Rice coding, Golomb coding, k-th order exponential Golomb coding, and Golomb-Rice coding. For example, binarization may include representing the integer value 5 of a syntax element as 00000101 using an 8-bit fixed length binarization technique, or representing the integer value 5 as 11110 using a unary coding binarization technique. As used herein, the terms fixed length coding, unary coding, truncated Rice coding, Golomb coding, k-th order exponential Golomb coding, and Golomb-Rice coding may each refer to general implementations of these techniques and/or more specific implementations of these coding techniques. For example, a Golomb-Rice coding implementation may be specifically defined according to a video coding standard. In the example of CABAC, for a particular bin, a context provides a most probable state (MPS) value for the bin (i.e., an MPS for a bin is one of 0 or 1) and a probability value of the bin being the MPS or the least probable state (LPS). For example, a context may indicate that the MPS of a bin is 0 and the probability of the bin being 1 is 0.3. It should be noted that a context may be determined based on values of previously coded bins, including bins in the current syntax element and in previously coded syntax elements. For example, values of syntax elements associated with neighboring video blocks may be used to determine a context for a current bin.
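The fixed length and unary binarization examples above can be reproduced with a short sketch; the unary convention assumed here follows the example in the text (a positive integer n is represented by n-1 one-bits followed by a terminating zero-bit), noting that other conventions write n one-bits before the terminating bit.

```python
# Illustrative sketch of two binarization techniques mentioned above:
# fixed length coding and unary coding, reproducing the examples for the
# integer value 5.

def fixed_length_binarize(value: int, num_bits: int) -> str:
    return format(value, f"0{num_bits}b")

def unary_binarize(value: int) -> str:
    # Convention assumed here: a positive integer n is represented by n-1
    # one-bits followed by a terminating zero-bit (other conventions differ).
    return "1" * (value - 1) + "0"

if __name__ == "__main__":
    print(fixed_length_binarize(5, 8))  # 00000101
    print(unary_binarize(5))            # 11110
```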
As described above, sample values of a reconstructed block may differ from the sample values of the current video block that was encoded. Further, it should be noted that in some cases, coding video data on a block-by-block basis may result in artifacts (e.g., so-called block artifacts, banding artifacts, etc.). For example, block artifacts may cause coded block boundaries of reconstructed video data to be visually perceptible to a user. In this manner, reconstructed sample values may be modified to minimize differences between the sample values of the current video block that was encoded and the reconstructed block and/or to minimize artifacts introduced by the video encoding process. Such modifications may be generally referred to as filtering. It should be noted that filtering may occur as part of an in-loop filtering process or a post-loop (or post-filter) filtering process. For an in-loop filtering process, the resulting sample values of the filtering process may be used for prediction of video blocks (e.g., stored to a reference frame buffer for subsequent encoding at a video encoder and subsequent decoding at a video decoder). For a post-loop filtering process, the resulting sample values of the filtering process are merely output as part of the decoding process (e.g., not used for subsequent coding). For example, with respect to a video decoder, for an in-loop filtering process, the sample values resulting from filtering a reconstructed block would be used for subsequent decoding (e.g., stored to a reference buffer) and would be output (e.g., to a display). For a post-loop filtering process, the reconstructed block, without modification, would be used for subsequent decoding, and the sample values resulting from filtering the reconstructed block would be output but would not be used for subsequent decoding.
Deblocking (or de-blocking), deblock filtering, or applying a deblocking filter refers to the process of smoothing the boundaries of neighboring reconstructed video blocks (i.e., making the boundaries less perceptible to a viewer). Smoothing the boundaries of neighboring reconstructed video blocks may include modifying sample values included in rows or columns adjacent to a boundary. JVET-T2001 provides that a deblocking filter is applied to reconstructed sample values as part of an in-loop filtering process. In addition to applying a deblocking filter as part of an in-loop filtering process, JVET-T2001 provides that sample adaptive offset (SAO) filtering may be applied in the in-loop filtering process. In general, SAO is a process that modifies the deblocked sample values in a region by conditionally adding an offset value. Another type of filtering process includes the so-called adaptive loop filter (ALF). An ALF with block-based adaptation is specified in JEM. In JEM, the ALF is applied after the SAO filter. It should be noted that an ALF may be applied to reconstructed samples independently of other filtering techniques. The process for applying the ALF specified in JEM at a video encoder may be summarized as follows: (1) each 2x2 block of the reconstructed luma component of a picture is classified according to a classification index; (2) sets of filter coefficients are derived for each classification index; (3) filtering decisions are determined for the luma component; (4) a filtering decision is determined for the chroma components; and (5) filter parameters (e.g., coefficients and decisions) are signaled. JVET-T2001 specifies deblocking, SAO, and ALF filters, which may be described as being generally based on the deblocking, SAO, and ALF filters provided in ITU-T H.265 and JEM.
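In the spirit of the SAO band-offset idea described above, the following sketch conditionally adds an offset to deblocked sample values based on the amplitude band each sample falls into; the band count, offsets, and clipping are assumptions for illustration, and this is not the normative SAO process.

```python
# Illustrative sketch (not the normative SAO process): conditionally add an
# offset to deblocked sample values based on which amplitude "band" each
# sample falls into.

def band_offset(samples, band_offsets, bit_depth: int = 8, num_bands: int = 32):
    max_val = (1 << bit_depth) - 1
    band_width = (max_val + 1) // num_bands
    out = []
    for s in samples:
        band = s // band_width
        offset = band_offsets.get(band, 0)  # only some bands carry an offset
        out.append(min(max(s + offset, 0), max_val))
    return out

if __name__ == "__main__":
    deblocked = [12, 40, 41, 200, 250]
    # Hypothetical offsets signaled for bands 5 and 25 only.
    print(band_offset(deblocked, {5: 2, 25: -3}))  # [12, 42, 43, 197, 250]
```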
It should be noted that JVET-T2001 is a pre-release version of ITU-T H.266, i.e., a near-final draft of the video coding standard produced by the VVC project, and thus may be referred to as the first version of the VVC standard (or VVC version 1, or ITU-T H.266). It should be noted that during the VVC project, convolutional neural network (CNN) based techniques were studied that showed potential to remove artifacts and improve objective quality, but it was decided not to include such techniques in the VVC standard. However, CNN-based techniques are currently being considered for extensions and/or improvements to VVC. Some CNN-based techniques involve post-filtering. For example, "AHG11: Content-adaptive neural network post-filter," 26th meeting of ISO/IEC JTC1/SC29/WG11, 20-29 April 2022, teleconference, document JVET-Z0082-v2, referred to herein as JVET-Z0082, describes a content-adaptive neural-network-based post-filter. It should be noted that in JVET-Z0082, content adaptation is achieved by over-fitting the NN post-filter to the test video. Further, it should be noted that the result of the over-fitting process in JVET-Z0082 is a weight update. JVET-Z0082 describes where the weight update is coded using ISO/IEC FDIS 15938-17, Information technology - Multimedia content description interface - Part 17: Compression of neural networks for multimedia content description and analysis, and the test model of incremental compression of neural networks for multimedia content description and analysis (INCTM), N0179, 2022, which may collectively be referred to as the MPEG NNR (Neural Network Representation) or neural network coding (NNC) standards. JVET-Z0082 further describes where the coded weight update is signaled within the video bitstream in an NNR post-filter SEI message. "AHG9: NNR post-filter SEI message," 26th meeting of ISO/IEC JTC1/SC29/WG11, 20-29 April 2022, teleconference, document JVET-Z0052-v1, referred to herein as JVET-Z0052, describes the NNR post-filter SEI message utilized by JVET-Z0082. Elements of the NN post-filter described in JVET-Z0082 and of the NNR post-filter SEI message described in JVET-Z0052 were adopted in "Additional SEI messages for VSEI (Draft 2)," 27th meeting of ISO/IEC JTC1/SC29/WG11, 13-22 July 2022, teleconference, document JVET-AA2006-v2, referred to herein as JVET-AA2006. JVET-AA2006 provides versatile supplemental enhancement information messages for coded video bitstreams (VSEI). JVET-AA2006 specifies syntax and semantics for a neural network post-filter characteristics SEI message and for a neural network post-filter activation SEI message. The neural network post-filter characteristics SEI message specifies a neural network that may be used as a post-processing filter. The use of a specified post-processing filter for a particular picture is indicated with a neural network post-filter activation SEI message. Further, "Information technology - MPEG video technologies - Part 7: Versatile supplemental enhancement information messages for coded video bitstreams, Amendment 1: Additional SEI messages," ISO/IEC JTC1/SC29/WG5, 28th meeting, November 2022, Mainz, Germany, document JVET-AB2006, m61498, is referred to herein as JVET-AB2006. JVET-AB2006 is described in further detail below. The techniques described herein provide techniques for signaling neural network post-filter messages.
Regarding the formulas used herein, the following arithmetic operators may be used:
+ Addition
- Subtraction
* Multiplication, including matrix multiplication
x^y Exponentiation. Specifies x to the power of y. In other contexts, such notation is used for superscripting and is not intended to be interpreted as exponentiation.
/ Integer division with truncation of the result toward zero. For example, 7/4 and -7/-4 are truncated to 1, and -7/4 and 7/-4 are truncated to -1.
÷ Used to denote division in mathematical formulas where no truncation or rounding is intended.
Furthermore, the following mathematical functions may be used:
Log2(x): the base-2 logarithm of x.
Ceil(x): the smallest integer greater than or equal to x.
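The division, Log2, and Ceil conventions above can be demonstrated with a short sketch; int_div is a hypothetical helper that reproduces truncation of the result toward zero, which differs from Python's floor division for negative operands.

```python
# Illustrative sketch of the operator and function conventions described above:
# integer division with truncation of the result toward zero, Log2, and Ceil.

import math

def int_div(a: int, b: int) -> int:
    """Integer division with truncation of the result toward zero."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q

if __name__ == "__main__":
    print(int_div(7, 4), int_div(-7, -4))   # 1 1
    print(int_div(-7, 4), int_div(7, -4))   # -1 -1
    print(math.log2(8))                      # 3.0  (Log2(8))
    print(math.ceil(2.1))                    # 3    (Ceil(2.1))
```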
Regarding the exemplary syntax used herein, the following definition of logical operators may be applied:
x && y: Boolean logical "and" of x and y.
x || y: Boolean logical "or" of x and y.
!: Boolean logical "not".
x ? y : z: if x is TRUE or not equal to 0, evaluates to the value of y; otherwise, evaluates to the value of z.
Furthermore, the following relational operators may be applied:
> Greater than
>= Greater than or equal to
< Less than
<= Less than or equal to
== Equal to
!= Not equal to
Further, it should be noted that for the syntax descriptors used herein, the following descriptors may be applied:
- b(8): a byte having any pattern of bit string (8 bits). The parsing process for this descriptor is specified by the return value of the function read_bits(8).
- f(n): a fixed-pattern bit string using n bits written (from left to right) with the left bit first. The parsing process for this descriptor is specified by the return value of the function read_bits(n).
- se(v): a signed integer 0-th order Exp-Golomb-coded syntax element, with the left bit first.
- tb(v): truncated binary using up to maxVal bits, with maxVal defined in the semantics of the syntax element.
- tu(v): truncated unary using up to maxVal bits, with maxVal defined in the semantics of the syntax element.
- u(n): an unsigned integer using n bits. When n is "v" in the syntax table, the number of bits varies in a manner dependent on the values of other syntax elements. The parsing process for this descriptor is specified by the return value of the function read_bits(n), interpreted as a binary representation of an unsigned integer with the most significant bit written first.
- ue(v): an unsigned integer 0-th order Exp-Golomb-coded syntax element, with the left bit first.
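The u(n) and ue(v) descriptors above can be illustrated with a minimal bit reader; the BitReader class below is hypothetical, operates on a string of '0'/'1' characters rather than an actual bitstream, and omits the error handling a real parser would need.

```python
# Illustrative sketch of parsing the u(n) and ue(v) descriptors described
# above from a string of bits. This is a minimal reader, not the normative
# parsing process of any standard.

class BitReader:
    def __init__(self, bits: str):
        self.bits = bits
        self.pos = 0

    def read_bits(self, n: int) -> int:
        """u(n): unsigned integer using n bits, most significant bit first."""
        value = int(self.bits[self.pos:self.pos + n], 2)
        self.pos += n
        return value

    def read_ue(self) -> int:
        """ue(v): unsigned integer 0-th order Exp-Golomb-coded value."""
        leading_zeros = 0
        while self.bits[self.pos] == "0":
            leading_zeros += 1
            self.pos += 1
        self.pos += 1  # consume the terminating '1' of the prefix
        suffix = self.read_bits(leading_zeros) if leading_zeros else 0
        return (1 << leading_zeros) - 1 + suffix

if __name__ == "__main__":
    # "00110" encodes the ue(v) value 5; "101" read as u(3) is 5.
    print(BitReader("00110").read_ue())   # 5
    print(BitReader("101").read_bits(3))  # 5
```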
As described above, video content includes video sequences composed of a series of pictures, and each picture may be divided into one or more regions. In JVET-T2001, the coded representation of a picture comprises the VCL NAL units of a particular layer within an AU and contains all CTUs of the picture. For example, referring again to Fig. 2, the coded representation of Pic 3 is encapsulated in three coded slice NAL units (i.e., a slice 0 NAL unit, a slice 1 NAL unit, and a slice 2 NAL unit). It should be noted that the term video coding layer (VCL) NAL unit is used as a collective term for coded slice NAL units, i.e., VCL NAL unit is a collective term that includes all types of slice NAL units. As described above, and as described in further detail below, a NAL unit may encapsulate metadata used for decoding video data. A NAL unit encapsulating metadata used for decoding a video sequence is generally referred to as a non-VCL NAL unit. Thus, in JVET-T2001, a NAL unit may be a VCL NAL unit or a non-VCL NAL unit. It should be noted that a VCL NAL unit includes slice header data, which provides information used for decoding the particular slice. Thus, in JVET-T2001, information used for decoding video data (which may be referred to as metadata in some cases) is not limited to being included in non-VCL NAL units. JVET-T2001 provides that a picture unit (PU) is a set of NAL units that are associated with each other according to a specified classification rule, are consecutive in decoding order, and contain exactly one coded picture, and that an access unit (AU) is a set of PUs that belong to different layers and contain coded pictures associated with the same time for output from the DPB. JVET-T2001 further provides that a layer is a set of VCL NAL units that all have a particular value of a layer identifier, together with the associated non-VCL NAL units. Further, in JVET-T2001, a PU consists of zero or one PH NAL units, one coded picture (which comprises one or more VCL NAL units), and zero or more other non-VCL NAL units. Further, in JVET-T2001, a coded video sequence (CVS) is a sequence of AUs that consists, in decoding order, of a CVSS AU followed by zero or more AUs that are not CVSS AUs, including all subsequent AUs up to but not including any subsequent AU that is a CVSS AU, where a coded video sequence start (CVSS) AU is an AU in which there is a PU for each layer in the CVS and the coded picture in each present picture unit is a coded layer video sequence start (CLVSS) picture. In JVET-T2001, a coded layer video sequence (CLVS) is a sequence of PUs within the same layer that consists, in decoding order, of a CLVSS PU followed by zero or more PUs that are not CLVSS PUs, including all subsequent PUs up to but not including any subsequent PU that is a CLVSS PU. That is, in JVET-T2001, a bitstream may be described as including a sequence of AUs forming one or more CVSs.
Multi-layer video coding enables a video presentation to be decoded/displayed as a presentation corresponding to a base layer of video data and decoded/displayed as one or more additional presentations corresponding to enhancement layers of video data. For example, a base layer may enable presentation of a video presentation having a basic level of quality (e.g., a high definition presentation and/or a 30 Hz frame rate), and an enhancement layer may enable presentation of a video presentation having an enhanced level of quality (e.g., an ultra-high definition presentation and/or a 60 Hz frame rate). An enhancement layer may be coded by referencing a base layer. That is, a picture in an enhancement layer may be coded (e.g., using inter-layer prediction techniques) by referencing one or more pictures in a base layer (including scaled versions thereof). It should be noted that layers may also be coded independently of each other. In this case, there may be no inter-layer prediction between two layers. Each NAL unit may include an identifier indicating the layer of video data with which the NAL unit is associated. As described above, a sub-bitstream extraction process may be used to decode and display only a particular region of interest of a picture. Further, a sub-bitstream extraction process may be used to decode and display only a particular layer of video. Sub-bitstream extraction may refer to a process in which a device receiving a compatible or compliant bitstream forms a new compatible or compliant bitstream by discarding and/or modifying data in the received bitstream. For example, sub-bitstream extraction may be used to form a new compatible or compliant bitstream corresponding to a particular representation of video (e.g., a high quality representation).
In JVET-T2001, each of a video sequence, a GOP, a picture, a slice, and a CTU may be associated with metadata that describes video coding properties, and some types of metadata are encapsulated in non-VCL NAL units. JVET-T2001 defines parameter sets that may be used to describe video data and/or video coding properties. Specifically, JVET-T2001 includes the following four types of parameter sets: Video Parameter Set (VPS), Sequence Parameter Set (SPS), Picture Parameter Set (PPS), and Adaptation Parameter Set (APS), where an SPS applies to zero or more entire CVSs, a PPS applies to zero or more entire coded pictures, an APS applies to zero or more slices, and a VPS may optionally be referenced by an SPS. A PPS applies to the individual coded picture that refers to it. In JVET-T2001, parameter sets may be encapsulated as non-VCL NAL units and/or may be signaled as a message. JVET-T2001 also includes a Picture Header (PH), which is encapsulated as a non-VCL NAL unit. In JVET-T2001, a picture header applies to all slices of a coded picture. JVET-T2001 further enables Decoding Capability Information (DCI) and Supplemental Enhancement Information (SEI) messages to be signaled. In JVET-T2001, DCI and SEI messages assist in processes related to decoding, display, or other purposes; however, DCI and SEI messages may not be required for constructing the luma or chroma samples according to the decoding process. In JVET-T2001, DCI and SEI messages may be signaled in a bitstream using non-VCL NAL units. Further, DCI and SEI messages may be conveyed by some mechanism other than by being present in the bitstream (i.e., signaled out-of-band).
Fig. 3 shows an example of a bitstream including multiple CVSs, where a CVS includes AUs and the AUs include picture units. The example shown in fig. 3 corresponds to an example of encapsulating the slice NAL units shown in the example of fig. 2 in a bitstream. In the example shown in fig. 3, the picture unit corresponding to Pic 3 includes three VCL NAL units, i.e., the coded slice NAL units (the Slice 0 NAL unit, the Slice 1 NAL unit, and the Slice 2 NAL unit), and two non-VCL NAL units, i.e., a PPS NAL unit and a PH NAL unit. It should be noted that in fig. 3, the header is a NAL unit header (i.e., not to be confused with a slice header). Further, it should be noted that in fig. 3, other non-VCL NAL units that are not shown may be included in the CVSs, such as SPS NAL units, VPS NAL units, SEI message NAL units, etc. Further, it should be noted that in other examples, a PPS NAL unit used for decoding Pic 3 may be included elsewhere in the bitstream, e.g., in the picture unit corresponding to Pic 0, or may be provided by an external entity. As described in further detail below, in JVET-T2001, a PH syntax structure may be present in the slice header of a VCL NAL unit or in the PH NAL unit of the current PU.
JVET-T2001 defines NAL unit header semantics that specify the type of original byte sequence payload (RBSP) data structure included in the NAL unit. Table 1 shows the syntax of the NAL unit header provided in JVET-T2001.
TABLE 1
JVET-T2001 provides the following definitions for the corresponding syntax elements shown in table 1.
The forbidden_zero_bit should be equal to 0.
Nuh_reserved_zero_bit should be equal to 0. The value 1 of nuh_reserved_zero_bit may be specified in the future by ITU-T|ISO/IEC. Although in this version of the specification the value of nuh_reserved_zero_bit is required to be equal to 0, a decoder conforming to this version of the specification should allow the value of nuh_reserved_zero_bit to be equal to 1 to appear in the syntax and should ignore (i.e., delete and discard from the bitstream) NAL units for which nuh_reserved_zero_bit is equal to 1.
Nuh_layer_id specifies an identifier of a layer to which a VCL NAL unit belongs or an identifier of a layer to which a non-VCL NAL unit applies. The value of nuh_layer_id should be in the range of 0 to 55 (inclusive). Other values of nuh_layer_id are reserved for future use by ITU-t|iso/IEC. Although in this version of the specification the value of nuh layer id is required to be in the range of 0 to 55 (inclusive), a decoder conforming to this version of the specification should allow values of nuh layer id greater than 55 to appear in the syntax and should ignore (i.e., delete and discard from the bitstream) NAL units with nuh layer id greater than 55.
The values of nuh layer id of all VCL NAL units of a coded picture should be the same. The value of nuh_layer_id of a coded picture or PU is the value of nuh_layer_id of the VCL NAL unit of the coded picture or PU.
When nal_unit_type is equal to ph_nut or fd_nut, nuh_layer_id should be equal to the nuh_layer_id of the associated VCL NAL unit.
When nal_unit_type is equal to eos_nut, nuh_layer_id should be equal to one of the nuh_layer_id values of the layers present in the CVS.
Note that the values of nuh_layer_id for DCI, OPI, VPS, AUD, and EOB NAL units are not constrained.
nuh_temporal_id_plus1 minus 1 specifies the temporal identifier of the NAL unit.
The value of nuh_temporal_id_plus1 should not be equal to 0.
The variable TemporalId is derived as follows:
TemporalId = nuh_temporal_id_plus1 - 1
When nal_unit_type is in the range of idr_w_radl to rsv_irap_11 (inclusive), the TemporalId should be equal to 0.
When nal_unit_type is equal to STSA_NUT and vps_independent_layer_flag [ GeneralLayerIdx [ nuh_layer_id ] ] is equal to 1, the TemporalId should be greater than 0.
The value of the TemporalId should be the same for all VCL NAL units of an AU. The value of the TemporalId of a coded picture, a PU, or an AU is the value of the TemporalId of the VCL NAL units of the coded picture, PU, or AU. The value of the TemporalId of a sub-layer representation is the greatest value of the TemporalId of all VCL NAL units in the sub-layer representation.
The value of the TemporalId of a non-VCL NAL unit is constrained as follows:
If nal_unit_type is equal to dci_nut, opi_nut, vps_nut, or sps_nut, the TemporalId should be equal to 0 and the TemporalId of the AU containing the NAL unit should be equal to 0.
Otherwise, if nal_unit_type is equal to ph_nut, the TemporalId should be equal to the TemporalId of the PU containing the NAL unit. Otherwise, if nal_unit_type is equal to eos_nut or eob_nut, the TemporalId should be equal to 0.
Otherwise, if nal_unit_type is equal to aud_nut, fd_nut, prefix_sei_nut, or suffix_sei_nut, the TemporalId should be equal to the TemporalId of the AU containing the NAL unit.
Otherwise, when nal_unit_type is equal to pps_nut, prefix_aps_nut, or suffix_aps_nut, the TemporalId should be greater than or equal to the TemporalId of the PU containing the NAL unit.
Note that when the NAL unit is a non-VCL NAL unit, the value of the TemporalId is equal to the minimum of the TemporalId values of all AUs to which the non-VCL NAL unit applies. When nal_unit_type is equal to pps_nut, prefix_aps_nut, or suffix_aps_nut, the TemporalId may be greater than or equal to the TemporalId of the containing AU, as all PPSs and APSs may be included at the beginning of the bitstream (e.g., when they are transported out-of-band and the receiver places them at the beginning of the bitstream), where the first coded picture has a TemporalId equal to 0.
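As an illustration of the header fields described above, the following sketch parses the two-byte NAL unit header of JVET-T2001 (forbidden_zero_bit, nuh_reserved_zero_bit, nuh_layer_id, nal_unit_type, nuh_temporal_id_plus1) and derives TemporalId; the function name and the tuple it returns are illustrative assumptions rather than part of the specification.

```python
def parse_nal_unit_header(header: bytes):
    """Parse a 2-byte NAL unit header and derive TemporalId (illustrative)."""
    assert len(header) >= 2
    b0, b1 = header[0], header[1]
    forbidden_zero_bit = (b0 >> 7) & 0x1      # u(1), required to be 0
    nuh_reserved_zero_bit = (b0 >> 6) & 0x1   # u(1), required to be 0
    nuh_layer_id = b0 & 0x3F                  # u(6), 0..55 in this version
    nal_unit_type = (b1 >> 3) & 0x1F          # u(5), see Table 2
    nuh_temporal_id_plus1 = b1 & 0x07         # u(3), required to be non-zero
    assert forbidden_zero_bit == 0 and nuh_temporal_id_plus1 != 0
    temporal_id = nuh_temporal_id_plus1 - 1   # TemporalId derivation
    return nuh_layer_id, nal_unit_type, temporal_id


# Example: layer 0, nal_unit_type 1 (a trailing slice type), TemporalId 0.
assert parse_nal_unit_header(bytes([0x00, 0x09])) == (0, 1, 0)
```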
Nal_unit_type specifies the NAL unit type, i.e., the type of RBSP data structure contained in the NAL unit as specified in table 2.
NAL units with nal_unit_type in the range of UNSPEC_28..UNSPEC_31 (inclusive), for which semantics are not specified, should not affect the decoding process specified in this specification.
Note that NAL unit types in the range of UNSPEC_28..UNSPEC_31 may be used as determined by the application. No decoding process for these values of nal_unit_type is specified in this specification. Since different applications might use these NAL unit types for different purposes, particular care is expected in the design of encoders that generate NAL units with these nal_unit_type values and in the design of decoders that interpret the content of NAL units with these nal_unit_type values. This specification does not define any management of these values. These nal_unit_type values might only be suitable for use in contexts in which "collisions" of usage (i.e., different definitions of the meaning of the NAL unit content for the same nal_unit_type value) are unimportant, or are not possible, or are managed, e.g., defined or managed in a controlling application or transport specification, or by controlling the environment in which bitstreams are distributed.
For purposes other than determining the amount of data in the DUs of the bitstream, decoders should ignore (remove from the bitstream and discard) the contents of all NAL units that use reserved values of nal_unit_type. Note that this requirement allows future definition of compatible extensions to this specification.
TABLE 2
Note that a Clean Random Access (CRA) picture may have an associated RASL or RADL picture present in the bitstream.
Note that an Instantaneous Decoding Refresh (IDR) picture with nal_unit_type equal to idr_n_lp does not have an associated leading picture present in the bitstream. An IDR picture with nal_unit_type equal to idr_w_radl does not have an associated RASL picture present in the bitstream, but may have an associated RADL picture in the bitstream.
The value of nal_unit_type should be the same for all VCL NAL units of a sub-picture. A sub-picture is referred to as having the same NAL unit type as the VCL NAL unit of the sub-picture.
For VCL NAL units of any particular picture, the following applies:
If pps_mixed_ nalu _types_in_pic_flag is equal to 0, then the value of nal_unit_type should be the same for all VCL NAL units of the picture, and the picture or PU is said to have the same NAL unit type as the coded slice NAL unit of the picture or PU.
Otherwise (pps_mixed_ nalu _types_in_pic_flag equal to 1), all the following constraints apply:
the picture should have at least two sub-pictures.
The VCL NAL units of a picture should have two or more different nal_unit_type values.
The picture should not have any VCL NAL unit with nal_unit_type equal to gdr_nut.
When the nal_unit_type of a VCL NAL unit of the picture is equal to a value nalUnitTypeA of idr_w_radl, idr_n_lp, or cra_nut, the other VCL NAL units of the picture should all have nal_unit_type equal to nalUnitTypeA or trail_nut.
The value of nal_unit_type should be the same for all pictures of IRAP or GDR AU.
When sps_video_parameter_set_id is greater than 0, vps_max_tid_il_ref_pics_plus1[ i ][ j ] is equal to 0 for j equal to GeneralLayerIdx[ nuh_layer_id ] and for any value of i in the range of j+1 to vps_max_layers_minus1 (inclusive), and pps_mixed_nalu_types_in_pic_flag is equal to 1, the value of nal_unit_type should not be equal to idr_w_radl, idr_n_lp, or cra_nut.
The following constraints apply for the bitstream compliance requirements:
When a picture is the leading picture of an IRAP picture, the picture should be a RADL or RASL picture.
When the sub-picture is the leading sub-picture of the IRAP sub-picture, the sub-picture should be a RADL or RASL sub-picture.
When the picture is not the leading picture of the IRAP picture, the picture should not be a RADL or RASL picture.
When the sub-picture is not the leading sub-picture of the IRAP sub-picture, the sub-picture should not be an RADL or RASL sub-picture.
A RASL picture associated with an IDR picture should not be present in the bitstream.
A RASL sub-picture associated with an IDR sub-picture should not be present in the bitstream.
A RADL picture associated with an IDR picture having nal_unit_type equal to idr_n_lp should not be present in the bitstream.
Note that random access at the location of the IRAP PU can be performed by discarding all PUs preceding the IRAP AU (and correctly decoding the non-RASL pictures in the IRAP AU and all subsequent AUs in decoding order), provided that each parameter set (in the bitstream or by an external means not specified in this specification) is available when referenced.
A RADL sub-picture associated with an IDR sub-picture having nal_unit_type equal to idr_n_lp should not be present in the bitstream.
- Any picture with nuh_layer_id equal to a particular value layerId that precedes, in decoding order, an IRAP picture with nuh_layer_id equal to layerId should precede, in output order, the IRAP picture and all RADL pictures associated with the IRAP picture.
Any sub-picture with nuh_layer_id equal to a particular value layerId and sub-picture index equal to a particular value subpicIdx that precedes, in decoding order, an IRAP sub-picture with nuh_layer_id equal to layerId and sub-picture index equal to subpicIdx should precede, in output order, the IRAP sub-picture and all its associated RADL sub-pictures.
- Any picture with nuh_layer_id equal to a particular value layerId that precedes, in decoding order, a recovery point picture with nuh_layer_id equal to layerId should precede the recovery point picture in output order.
- Any sub-picture with nuh_layer_id equal to a particular value layerId and sub-picture index equal to a particular value subpicIdx that precedes, in decoding order, the sub-picture with nuh_layer_id equal to layerId and sub-picture index equal to subpicIdx in a recovery point picture should precede that sub-picture in the recovery point picture in output order.
Any RASL picture associated with a CRA picture should precede, in output order, any RADL picture associated with the CRA picture.
Any RASL sub-picture associated with a CRA sub-picture should precede, in output order, any RADL sub-picture associated with the CRA sub-picture.
Any RASL picture associated with a CRA picture with nuh_layer_id equal to a particular value layerId should follow, in output order, any IRAP or GDR picture with nuh_layer_id equal to layerId that precedes the CRA picture in decoding order.
Any RASL sub-picture associated with a CRA sub-picture for which nuh layer id is equal to a particular value layerId and the sub-picture index is equal to a particular value subpicIdx should be located in output order after any IRAP or GDR sub-picture located before the CRA sub-picture in decoding order for which nuh layer id is equal to layerId and the sub-picture index is equal to subpicIdx.
- If sps_field_seq_flag is equal to 0, the following applies: when the current picture, with nuh_layer_id equal to a particular value layerId, is a leading picture associated with an IRAP picture, the current picture should precede, in decoding order, all non-leading pictures associated with the same IRAP picture. Otherwise (sps_field_seq_flag is equal to 1), let picA and picB be the first and last leading pictures, in decoding order, associated with the IRAP picture, respectively; at most one non-leading picture with nuh_layer_id equal to layerId should be present before picA in decoding order, and no non-leading picture with nuh_layer_id equal to layerId should be present between picA and picB in decoding order.
- If sps_field_seq_flag is equal to 0, the following applies: when the current sub-picture, with nuh_layer_id equal to a particular value layerId and sub-picture index equal to a particular value subpicIdx, is a leading sub-picture associated with an IRAP sub-picture, the current sub-picture should precede, in decoding order, all non-leading sub-pictures associated with the same IRAP sub-picture. Otherwise (sps_field_seq_flag is equal to 1), let subpicA and subpicB be the first and last leading sub-pictures, in decoding order, associated with the IRAP sub-picture, respectively; at most one non-leading sub-picture with nuh_layer_id equal to layerId and sub-picture index equal to subpicIdx should be present before subpicA in decoding order, and no non-leading sub-picture with nuh_layer_id equal to layerId and sub-picture index equal to subpicIdx should be present between subpicA and subpicB in decoding order.
As provided in table 2, the NAL unit may include a Supplemental Enhancement Information (SEI) syntax structure. Tables 3 and 4 show the Supplemental Enhancement Information (SEI) syntax structure provided in JVET-T2001.
TABLE 3 Table 3
TABLE 4 Table 4
For tables 3 and 4, JVET-T2001 provides the following semantics:
Each SEI message consists of the variables specifying the type payloadType and the size payloadSize of the SEI message payload. The derived SEI message payload size payloadSize is specified in bytes and should be equal to the number of RBSP bytes in the SEI message payload.
Note - the NAL unit byte sequence containing the SEI message may include one or more emulation prevention bytes (represented by emulation_prevention_three_byte syntax elements). Since the payload size of an SEI message is specified in RBSP bytes, the quantity of emulation prevention bytes is not included in the size payloadSize of the SEI payload.
Payload_type_byte is a byte of the payload type of the SEI message.
Payload_size_byte is a byte of the payload size of the SEI message.
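To illustrate how payload_type_byte and payload_size_byte accumulate into payloadType and payloadSize, the following sketch follows the conventional extended-length coding used by the sei_message() syntax, in which each 0xFF byte adds 255 and the first non-0xFF byte terminates the value; the function name and return convention are illustrative assumptions.

```python
def parse_sei_message_header(rbsp: bytes, pos: int):
    """Accumulate payloadType and payloadSize from the SEI message bytes
    (illustrative sketch of the sei_message() header parsing)."""
    payload_type = 0
    while rbsp[pos] == 0xFF:       # each 0xFF byte contributes 255
        payload_type += 255
        pos += 1
    payload_type += rbsp[pos]      # final payload_type_byte
    pos += 1

    payload_size = 0
    while rbsp[pos] == 0xFF:
        payload_size += 255
        pos += 1
    payload_size += rbsp[pos]      # final payload_size_byte
    pos += 1
    return payload_type, payload_size, pos


# A payload type of 210 (NNPFC) fits in one byte; a payload size of 300 is
# coded as 0xFF followed by 45 (255 + 45 = 300).
assert parse_sei_message_header(bytes([210, 0xFF, 45]), 0) == (210, 300, 3)
```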
It should be noted that JVET-T2001 defines payload types, and that additional payload types are defined in "Additional SEI messages for VSEI (Draft 6)," ISO/IEC JTC1/SC29/WG11 teleconference, held 12-21 January 2022, document JVET-Y2006-v1, which is incorporated herein by reference and referred to as JVET-Y2006. Table 5 shows the sei_payload() syntax structure in a general manner. That is, Table 5 shows the sei_payload() syntax structure, but for the sake of brevity, not all possible types of payloads are included in Table 5.
TABLE 5
For Table 5, JVET-T2001 provides the following semantics.
sei_reserved_payload_extension_data should not be present in bitstreams conforming to this version of this specification. However, decoders conforming to this version of this specification should ignore the presence and value of sei_reserved_payload_extension_data. When present, the length, in bits, of sei_reserved_payload_extension_data is equal to 8 * payloadSize - nEarlierBits - nPayloadZeroBits - 1, where nEarlierBits is the number of bits in the sei_payload() syntax structure that precede the sei_reserved_payload_extension_data syntax element, and nPayloadZeroBits is the number of payload_bit_equal_to_zero syntax elements at the end of the sei_payload() syntax structure.
If more_data_in_payload() is TRUE and nPayloadZeroBits is not equal to 7 after parsing the SEI message syntax structure (e.g., the buffering_period() syntax structure), PayloadBits is set equal to 8 * payloadSize - nPayloadZeroBits - 1; otherwise, PayloadBits is set equal to 8 * payloadSize.
The payload_bit_equal_to_one should be equal to 1.
The payload_bit_equal_to_zero should be equal to 0.
Note that-SEI messages with the same payloadType value are conceptually identical SEI messages, regardless of whether they are contained in prefix or suffix SEI NAL units.
Note that for the SEI messages specified in this specification and VSEI specifications (ITU-T h.274|iso/IEC 23002-7), payloadType values are aligned with similar SEI messages specified in AVC (rec.itu-T h.264|iso/IEC 14496-10) and HEVC (rec.itu-T h.265|iso/IEC 23008-2).
The semantics and persistence scope of each SEI message is specified in the semantic specification of each particular SEI message.
Note-summarize the persistence information of the SEI message.
JVET-T2001 also provides the following:
The SEI messages with the SEI payload syntax structures identified in Table 5 that are specified in Rec. ITU-T H.274 | ISO/IEC 23002-7 may be used with bitstreams specified in this specification.
When any particular Rec. ITU-T H.274 | ISO/IEC 23002-7 SEI message is included in a bitstream specified in this specification, the SEI payload syntax should be included in the sei_payload() syntax structure as specified in Table 5, the payloadType value specified in Table 5 should be used, and, in addition, any SEI-message-specific constraints specified for the particular SEI message in this annex should apply.
As described above, the PayloadBits values are passed to the parser of the SEI message syntax structure specified in Rec. ITU-T H.274|ISO/IEC 23002-7.
As described above, JVET-AB2006 provides neural network post-filter supplemental enhancement information messages. Specifically, JVET-AB2006 provides a neural network post-filter characteristics SEI message (payloadType == 210) and a neural network post-filter activation SEI message (payloadType == 211). Table 6 shows the syntax of the neural network post-filter characteristics SEI message provided in JVET-AB2006. It should be noted that the neural network post-filter characteristics SEI message may be referred to as the NNPFC SEI message.
TABLE 6
With respect to Table 6, JVET-AB2006 provides the following semantics:
The neural network post-filter characteristics (NNPFC) SEI message specifies the neural network that can be used as a post-processing filter. The use of a specified post-processing filter for a particular picture is indicated by a neural network post-filter activation SEI message.
The use of this SEI message requires the definition of the following variables:
the cropped decoded output picture width and height in luminance samples are denoted here by CroppedWidth and CroppedHeight, respectively.
Luminance sample array CroppedYPic [ idx ] and chrominance sample arrays CroppedCbPic [ idx ] and CroppedCrPic [ idx ] are used as inputs to a post-processing filter when they exist in a cropped decoded output picture with idx in the range of 0 to numInputPics-1, inclusive.
-Bit depth BitDepth Y of the luma sample array of the cropped decoded output picture.
Bit depth BitDepth C of the chroma sample array (if any) of the cropped decoded output picture.
A chroma format indicator, denoted ChromaFormatIdc herein.
When nnpfc_auxiliary_inp_idc is equal to 1, the filter strength control value StrengthControlVal should be a real number in the range of 0 to 1, inclusive.
The variables SubWidthC and SubHeightC are derived from ChromaFormatIdc as specified in Table 7.
sps_chroma_format_idc    Chroma format    SubWidthC    SubHeightC
0                        Monochrome       1            1
1                        4:2:0            2            2
2                        4:2:2            2            1
3                        4:4:4            1            1
TABLE 7
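The mapping in Table 7 can be expressed directly as a lookup; the helper below is an illustrative restatement of the table and is not part of the SEI syntax itself.

```python
# Table 7 as a lookup: chroma format indicator -> (SubWidthC, SubHeightC).
CHROMA_SUBSAMPLING = {
    0: (1, 1),  # Monochrome
    1: (2, 2),  # 4:2:0
    2: (2, 1),  # 4:2:2
    3: (1, 1),  # 4:4:4
}


def derive_sub_width_height_c(chroma_format_idc: int):
    """Return (SubWidthC, SubHeightC) for the given chroma format indicator."""
    return CHROMA_SUBSAMPLING[chroma_format_idc]


assert derive_sub_width_height_c(1) == (2, 2)  # 4:2:0
```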
Note - more than one NNPFC SEI message may be present for the same picture. When more than one NNPFC SEI message with different nnpfc_id values is present or activated for the same picture, the messages may have the same or different nnpfc_purpose and nnpfc_mode_idc values.
nnpfc_id contains an identifying number that may be used to identify the post-processing filter. The value of nnpfc_id should be in the range of 0 to 2^32 - 2, inclusive. Values of nnpfc_id from 256 to 511, inclusive, and from 2^31 to 2^32 - 2, inclusive, are reserved for future use by ITU-T | ISO/IEC. Decoders conforming to this version of this document that encounter an NNPFC SEI message with nnpfc_id in the range of 256 to 511, inclusive, or in the range of 2^31 to 2^32 - 2, inclusive, should ignore the SEI message.
When NNPFC SEI message is the first NNPFC SEI message with a particular nnpfc _id value in decoding order within the current CLVS, the following applies:
The SEI message specifies the basic post-processing filters.
The SEI message relates to the current decoded picture of the current layer and all subsequent decoded pictures in output order until the current CLVS ends.
When NNPFC SEI message is a repetition of the previous NNPFC SEI message in decoding order within current CLVS, the subsequent semantics apply, considering that the SEI message is the only NNPFC SEI message with the same content within current CLVS.
When NNPFC SEI message is not the first NNPFC SEI message with a particular nnpfc _id value in decoding order within the current CLVS, the following applies:
the SEI message defines the update with respect to the previous basic post-processing filter in decoding order with the same nnpfc _id value.
The SEI message pertains, in output order, to the current decoded picture of the current layer and all subsequent decoded pictures, until the end of the current CLVS or until the next NNPFC SEI message with the particular nnpfc_id value, in output order, within the current CLVS.
Nnpfc mode idc equal to 0 indicates that the SEI message contains an ISO/IEC 15938-17 bitstream that specifies a base post-processing filter or an update to a base post-processing filter with the same nnpfc id value.
When NNPFC SEI message is the first NNPFC SEI message having a particular nnpfc _id value in decoding order within the current CLVS, nnpfc _mode_idc equal to 1 specifies that the basic post-processing filter associated with nnpfc _id value is a neural network identified by a URI indicated by nnpfc _uri having a format identified by tag URI nnpfc _tag_uri.
When NNPFC SEI message is not the first NNPFC SEI message having a particular nnpfc _id value in decoding order within the current CLVS, nnpfc _mode_idc equals 1 specifying that the update to the basic post-processing filter having the same nnpfc _id value is defined by the URI indicated by nnpfc _uri having the format identified by tag URI nnpfc _tag_uri.
In a bitstream consistent with this version of this document, the value of nnpfc _mode_idc should be in the range of 0 to 1 (inclusive). The values of Nnpfc mode idc of 2 to 255 are reserved for future use by ITU-t|iso/IEC and should not be present in the bitstream conforming to this version of the present document. Decoders consistent with this version of this document should ignore NNPFC SEI messages with nnpfc _mode_idc in the range of 2 to 255 (inclusive). A value nnpfc _mode_idc greater than 255 should not be present in the bitstream conforming to this version of the present document and not reserved for future use.
When the SEI message is the first NNPFC SEI message having a particular nnpfc _id value in decoding order within the current CLVS, the post-processing filter PostProcessingFilter () is assigned the same as the basic post-processing filter.
When the SEI message is not the first NNPFC SEI message having a specific nnpfc _id value in decoding order within the current CLVS, the post-processing filter PostProcessingFilter () is obtained by applying the update defined by the SEI message on the basic post-processing filter.
The updates are not accumulated, but rather each update is applied to a basic post-processing filter specified by the first NNPFC SEI message with a particular nnpfc _id value in decoding order within the current CLVS.
In a bitstream consistent with this version of this document, nnpfc_reserved_zero_bit_a should be equal to 0. When nnpfc_reserved_zero_bit_a is not equal to 0, the decoder should ignore the NNPFC SEI message.
Nnpfc _tag_uri contains a tag URI with syntax and semantics as specified in IETF RFC 4151, identifying the format and associated information about a neural network that is used as a basic post-processing filter or updated to a basic post-processing filter with the same nnpfc _id value specified by nnpfc _uri.
Note that nnpfc_tag_uri enables uniquely identifying the format of the neural network specified by nnpfc_uri without needing a central registration authority.
nnpfc_tag_uri equal to "tag:iso.org,2023:15938-17" indicates that the neural network identified by nnpfc_uri conforms to ISO/IEC 15938-17.
Nnpfc _ URI contains a URI with syntax and semantics as specified in IETF internet standard 66, identifying a neural network that is used as a basic post-processing filter or an update to a basic post-processing filter with the same nnpfc _ id value.
Nnpfc _ formatting _and_purpose_flag equal to 1 specifies that there are syntax elements related to filter purpose, input format, output format and complexity. nnpfc _ formatting _and_purpose_flag equal to 0 specifies that there are no syntax elements related to filter purpose, input format, output format and complexity.
When the SEI message is the first NNPFC SEI message with a specific nnpfc _id value in decoding order within the current CLVS, nnpfc _ formatting _and_purpose_flag should be equal to 1. When the SEI message is not the first NNPFC SEI message having a specific nnpfc _id value in decoding order within the current CLVS, nnpfc _ formatting _and_purpose_flag should be equal to 0.
Nnpfc _purpose indicates the purpose of the post-processing filter as specified in table 8.
In a bitstream conforming to this version of this document, the value of nnpfc_purpose should be in the range of 0 to 5, inclusive. Values of nnpfc_purpose from 6 to 1023 are reserved for future use by ITU-T | ISO/IEC and should not be present in bitstreams conforming to this version of this document. Decoders conforming to this version of this document should ignore NNPFC SEI messages with nnpfc_purpose in the range of 6 to 1023, inclusive. A value of nnpfc_purpose greater than 1023 should not be present in bitstreams conforming to this version of this document and is not reserved for future use.
Value    Interpretation
0        May be used as determined by the application
1        Visual quality improvement
2        Chroma upsampling from the 4:2:0 chroma format to the 4:2:2 or 4:4:4 chroma format, or from the 4:2:2 chroma format to the 4:4:4 chroma format
3        Increasing the width or height of the cropped decoded output picture without changing the chroma format
4        Increasing the width or height of the cropped decoded output picture and upsampling the chroma format
5        Picture rate upsampling
TABLE 8
Note that when ITU-t|iso/IEC uses a reserved value of nnpfc _purpose in the future, the syntax of the SEI message can be extended with syntax elements, the presence of which is conditioned on nnpfc _purpose being equal to this value.
When SubWidthC is equal to 1 and SubHeightC is equal to 1, nnpfc _purpose should not be equal to 2 or 4.
nnpfc_out_sub_c_flag equal to 1 specifies that outSubWidthC is equal to 1 and outSubHeightC is equal to 1. nnpfc_out_sub_c_flag equal to 0 specifies that outSubWidthC is equal to 2 and outSubHeightC is equal to 1. When nnpfc_out_sub_c_flag is not present, outSubWidthC is inferred to be equal to SubWidthC and outSubHeightC is inferred to be equal to SubHeightC. When ChromaFormatIdc is equal to 2 and nnpfc_out_sub_c_flag is present, the value of nnpfc_out_sub_c_flag should be equal to 1.
Nnpfc _pic_width_in_luma_samples and nnpfc _pic_height_in_luma_samples specify the width and height, respectively, of the luma sample array of the picture that is generated by applying the post-processing filter identified by nnpfc _id to the cropped decoded output picture. When nnpfc _pic_width_in_luma_samples and nnpfc _pic_height_in_luma_samples are not present, it is inferred that they are equal to CroppedWidth and CroppedHeight, respectively. The value of nnpfc _pic_width_in_luma_samples should be in the range CroppedWidth to CroppedWidth x 16-1 (inclusive). The value of nnpfc _pic_height_in_luma_samples should be in the range CroppedHeight to CroppedHeight x 16-1 (inclusive).
Nnpfc _num_input_pics_minus2 plus 2 specifies the number of decoded output pictures that are used as input to the post-processing filter.
Nnpfc _ interpolated _pics [ i ] specifies the number of interpolated pictures generated by the post-processing filter between the i-th and (i+1) -th pictures that are used as inputs to the post-processing filter.
A variable numInputPics specifying the number of pictures used as input to the post-processing filter and a variable numOutputPics specifying the total number of pictures produced from the post-processing filter are derived as follows:
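The derivation itself is not reproduced above. A sketch consistent with the semantics of nnpfc_num_input_pics_minus2 and nnpfc_interpolated_pics[ i ] (the output count being the input pictures plus all pictures interpolated between consecutive inputs) would look as follows; treat it as an assumption about the omitted formula rather than a quotation of JVET-AB2006.

```python
def derive_num_pics(nnpfc_num_input_pics_minus2: int,
                    nnpfc_interpolated_pics: list[int]):
    # numInputPics: decoded output pictures fed to the post-processing filter.
    num_input_pics = nnpfc_num_input_pics_minus2 + 2
    # numOutputPics: the input pictures plus the pictures interpolated
    # between each pair of consecutive input pictures (assumed derivation).
    num_output_pics = num_input_pics + sum(
        nnpfc_interpolated_pics[i] for i in range(num_input_pics - 1))
    return num_input_pics, num_output_pics


# Two input pictures with one interpolated picture between them yield three
# output pictures, e.g., 2x picture rate upsampling.
assert derive_num_pics(0, [1]) == (2, 3)
```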
nnpfc_component_last_flag equal to 1 indicates that the last dimension in the input tensor inputTensor to the post-processing filter and in the output tensor outputTensor resulting from the post-processing filter is used for the current channel. nnpfc_component_last_flag equal to 0 indicates that the third dimension in the input tensor inputTensor to the post-processing filter and in the output tensor outputTensor resulting from the post-processing filter is used for the current channel.
Note that-the first dimension in the input tensor and the output tensor is used for batch indexing, which is a practice in some neural network frameworks. Although the formulas in the semantics of the SEI message use a batch size corresponding to a batch index equal to 0, the batch size used as input for neural network inference is determined by post-processing implementations.
Note that, for example, when nnpfc_inp_order_idc is equal to 3 and nnpfc_auxiliary_inp_idc is equal to 1, there are 7 channels in the input tensor, including four luminance matrices, two chrominance matrices, and one auxiliary input matrix. In this case, the DeriveInputTensors() process derives each of the 7 channels of the input tensor one by one, and when a particular one of these channels is being processed, that channel is referred to as the current channel during the processing.
Nnpfc _inp_format_idc indicates a method of converting sample values of a clip-decoded output picture into input values input to a post-processing filter. When nnpfc _inp_format_idc is equal to 0, the input value to the post-processing filter is a real number, and functions InpY () and InpC () are specified as follows:
InpY(x)=x÷((1<<BitDepthY)-1)
InpC(x)=x÷((1<<BitDepthC)-1)
when nnpfc _inp_format_idc is equal to 1, the input value to the post-processing filter is an unsigned integer, and functions InpY () and InpC () are specified as follows:
Variable inpTensorBitDepth is derived from syntax element nnpfc _inp_ tensor _ bitdepth _minus8, as specified below.
Values of nnpfc_inp_format_idc greater than 1 are reserved for future use by ITU-T | ISO/IEC and should not be present in bitstreams conforming to this version of this document. Decoders conforming to this version of this document should ignore NNPFC SEI messages that contain reserved values of nnpfc_inp_format_idc.
nnpfc_inp_tensor_bitdepth_minus8 plus 8 specifies the bit depth of luma sample values in the input integer tensor. The value of inpTensorBitDepth is derived as follows:
inpTensorBitDepth=nnpfc_inp_tensor_bitdepth_minus8+8
It is a requirement of bitstream conformance that the value of nnpfc_inp_tensor_bitdepth_minus8 should be in the range of 0 to 24, inclusive.
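A sketch of the sample-to-input conversion follows. The nnpfc_inp_format_idc equal to 0 branch restates the InpY()/InpC() normalization given above; the integer branch (nnpfc_inp_format_idc equal to 1) is an assumption that simply rescales sample values to inpTensorBitDepth, since the exact formulas for that case are not reproduced in this description.

```python
def make_inp_funcs(nnpfc_inp_format_idc: int, bit_depth_y: int,
                   bit_depth_c: int, inp_tensor_bit_depth: int = 0):
    """Return (InpY, InpC) conversion functions (illustrative sketch)."""
    if nnpfc_inp_format_idc == 0:
        # Real-valued inputs: InpY(x) = x / ((1 << BitDepthY) - 1), and
        # likewise for chroma, giving values in the range [0, 1].
        inp_y = lambda x: x / ((1 << bit_depth_y) - 1)
        inp_c = lambda x: x / ((1 << bit_depth_c) - 1)
    else:
        # Unsigned-integer inputs at inpTensorBitDepth =
        # nnpfc_inp_tensor_bitdepth_minus8 + 8 (assumed rescaling).
        max_out = (1 << inp_tensor_bit_depth) - 1
        inp_y = lambda x: round(x * max_out / ((1 << bit_depth_y) - 1))
        inp_c = lambda x: round(x * max_out / ((1 << bit_depth_c) - 1))
    return inp_y, inp_c


# A 10-bit luma sample of 512 maps to roughly 0.5 for real-valued tensors.
inp_y, _ = make_inp_funcs(0, 10, 10)
assert abs(inp_y(512) - 0.5) < 0.01
```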
Nnpfc _inp_order_idc indicates a method of ordering a sample array of a clip-decoded output picture that is one of the input pictures of the post-processing filter.
In a bitstream conforming to this version of this document, the value of nnpfc _inp_order_idc should be in the range of 0 to 3 (inclusive). Values in the range of 4 to 255 of nnpfc _inp_order_idc are reserved for future use by ITU-t|iso/IEC and should not be present in the bitstream conforming to this version of the present document. Decoders consistent with this version of this document should ignore NNPFC SEI messages with nnpfc _inp_order_idc in the range of 4 to 255 (inclusive). A value nnpfc _inp_order_idc greater than 255 should not be present in the bitstream conforming to this version of the present document and not reserved for future use.
When ChromaFormatIdc is not equal to 1, nnpfc _inp_order_idc should not be equal to 3.
Table 9 contains an informative description of nnpfc _inp_order_idc values.
TABLE 9
A block is a rectangular array of samples from a component of a picture (e.g., a luma or chroma component).
nnpfc_auxiliary_inp_idc greater than 0 indicates that auxiliary input data is present in the input tensor of the neural network post-filter. nnpfc_auxiliary_inp_idc equal to 0 indicates that auxiliary input data is not present in the input tensor. nnpfc_auxiliary_inp_idc equal to 1 specifies that the auxiliary input data is derived as specified in equation 82.
In a bitstream consistent with this version of this document, the value of nnpfc_auxiliary_inp_idc should be in the range of 0 to 1, inclusive. Values of nnpfc_auxiliary_inp_idc from 2 to 255 are reserved for future use by ITU-T | ISO/IEC and should not be present in a bitstream conforming to this version of this specification. Decoders consistent with this version of this document should ignore NNPFC SEI messages with nnpfc_auxiliary_inp_idc in the range of 2 to 255, inclusive. A value of nnpfc_auxiliary_inp_idc greater than 255 should not be present in a bitstream conforming to this version of this document and is not reserved for future use.
The process DeriveInputTensors() for deriving the input tensor inputTensor, for a given vertical sample coordinate cTop and a horizontal sample coordinate cLeft specifying the top-left sample position of the sample block included in the input tensor, is specified as follows:
nnpfc_separate_color_description_present_flag equal to 1 indicates that a distinct combination of color primaries, transfer characteristics, and matrix coefficients for the picture resulting from the post-processing filter is specified in the SEI message syntax structure. nnpfc_separate_color_description_present_flag equal to 0 indicates that the combination of color primaries, transfer characteristics, and matrix coefficients for the picture resulting from the post-processing filter is the same as that indicated in the VUI parameters of the CLVS.
Nnpfc _color_primaries have the same semantics as specified for the vui_color_primaries syntax element, as follows: vui_color_primary indicates chromaticity coordinates of the source color primary. Its semantics are as specified for ColourPrimaries parameters in Rec.ITU-T H.273|ISO/IEC 23091-2. When the vui_color_primaries syntax element is not present, the value of vui_color_primaries is inferred to be equal to 2 (chroma is unknown or unspecified or determined by other means not specified in this specification). The value of vui_color_primaries is identified as reserved for future use by rec.itu-t h.273|iso/IEC 23091-2 and should not be present in a bitstream conforming to this version of the specification. The decoder should interpret the reserved value of vui_color_primary as equal to the value 2.
Except for the following cases:
nnpfc _color_primaries specify the color primaries of the picture produced by applying the neural network post-filter specified in the SEI message, instead of the color primaries for CLVS.
When nnpfc _color_primaries are not present in the NNPFC SEI message, the value of nnpfc _color_primaries is inferred to be equal to vui_color_primaries.
nnpfc_transfer_characteristics has the same semantics as specified for the vui_transfer_characteristics syntax element, as follows: vui_transfer_characteristics indicates the transfer characteristics function of the color representation. Its semantics are as specified for the TransferCharacteristics parameter in Rec. ITU-T H.273 | ISO/IEC 23091-2. When the vui_transfer_characteristics syntax element is not present, the value of vui_transfer_characteristics is inferred to be equal to 2 (the transfer characteristics are unknown or unspecified or determined by other means not specified in this specification). Values of vui_transfer_characteristics identified as reserved for future use by Rec. ITU-T H.273 | ISO/IEC 23091-2 should not be present in a bitstream conforming to this version of the specification. The decoder should interpret reserved values of vui_transfer_characteristics as being equal to the value 2.
Except for the following cases:
nnpfc_transfer_characteristics specifies the transfer characteristics of the picture produced by applying the neural network post-filter specified in the SEI message, instead of the transfer characteristics for the CLVS.
When nnpfc _transfer_characteristics is not present in the NNPFC SEI message, the value of nnpfc _transfer_characteristics is inferred to be equal to vui_transfer_characteristics.
Nnpfc matrix coeffs has the same semantics as for the vui matrix coeffs syntax element as follows: vui matrix coeffs describes equations for deriving luminance and chrominance signals from the green, blue and red or Y, Z and X primary colors. Its semantics are as specified for MatrixCoefficients parameters in Rec.ITU-T H.273|ISO/IEC 23091-2.
Except for the following cases:
Nnpfc matrix coeffs specifies the matrix coefficients of the picture produced by applying the neural network post-filter specified in the SEI message, instead of the matrix coefficients for CLVS.
When nnpfc _matrix_coeffs is not present in the NNPFC SEI message, the value of nnpfc _matrix_coeffs is inferred to be equal to vui_matrix_coeffs.
The value allowed by nnpfc matrix coeffs is not constrained by the chroma format of the decoded video picture, which is indicated by the ChromaFormatIdc value for VUI parameter semantics.
When nnpfc _matrix_coeffs is equal to 0, nnpfc _out_order_idc should not be equal to 1 or 3.
nnpfc_out_format_idc equal to 0 indicates that the sample values output by the post-processing filter are real numbers in the range of 0 to 1, inclusive, which are to be linearly mapped to the unsigned integer value range of 0 to (1 << bitDepth) - 1, inclusive, for any desired bit depth bitDepth for subsequent post-processing or display.
nnpfc_out_format_idc equal to 1 indicates that the sample values output by the post-processing filter are unsigned integers in the range of 0 to (1 << (nnpfc_out_tensor_bitdepth_minus8 + 8)) - 1.
A value of nnpfc out format idc greater than 1 is reserved for future use by ITU-t|iso/IEC specifications and should not be present in bitstreams conforming to this version of the present document. Decoders consistent with this version of this document should ignore NNPFC SEI messages that contain reserved values of nnpfc _out_format_idc.
nnpfc_out_tensor_bitdepth_minus8 plus 8 specifies the bit depth of sample values in the output integer tensor. The value of nnpfc_out_tensor_bitdepth_minus8 should be in the range of 0 to 24, inclusive.
nnpfc_out_order_idc indicates the output order of the samples resulting from the post-processing filter.
In a bitstream consistent with this version of this document, the value of nnpfc_out_order_idc should be in the range of 0 to 3, inclusive. Values of nnpfc_out_order_idc in the range of 4 to 255 are reserved for future use by ITU-T | ISO/IEC and should not be present in the bitstream conforming to this version of this document. Decoders consistent with this version of this document should ignore NNPFC SEI messages with nnpfc_out_order_idc in the range of 4 to 255, inclusive. A value of nnpfc_out_order_idc greater than 255 should not be present in the bitstream conforming to this version of this document and is not reserved for future use.
When nnpfc_purpose is equal to 2 or 4, nnpfc_out_order_idc should not be equal to 3.
Table 10 contains an informative description of nnpfc out order idc values.
Table 10
The process StoreOutputTensors() for deriving sample values in the filtered output sample arrays FilteredYPic, FilteredCbPic, and FilteredCrPic from the output tensor outputTensor, for a given vertical sample coordinate cTop and a horizontal sample coordinate cLeft specifying the top-left sample position of the sample block included in the input tensor, is specified as follows:
nnpfc _constant_patch_size_flag equal to 1 indicates that the post-processing filter accepts exactly the block sizes indicated by nnpfc _patch_width_minus1 and nnpfc _patch_height_minus1 as inputs. nnpfc _constant_patch_size_flag equal to 0 indicates that the post-processing filter accepts as input any block size that is a positive integer multiple of the block sizes indicated by nnpfc _patch_width_minus1 and nnpfc _patch_height_minus1.
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_width_minus1 plus 1 specifies the horizontal sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_width_minus1 should be in the range of 0 to Min(32766, CroppedWidth - 1), inclusive.
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_height_minus1 plus 1 specifies the vertical sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_height_minus1 should be in the range of 0 to Min(32766, CroppedHeight - 1), inclusive.
Let the variables inpPatchWidth and inpPatchHeight be the block size width and the block size height, respectively.
If nnpfc_constant_patch_size_flag is equal to 0, the following applies:
- The values of inpPatchWidth and inpPatchHeight are provided by external means not specified in this document, or are set by the post-processor itself.
- The value of inpPatchWidth should be a positive integer multiple of nnpfc_patch_width_minus1 + 1 and should be less than or equal to CroppedWidth. The value of inpPatchHeight should be a positive integer multiple of nnpfc_patch_height_minus1 + 1 and should be less than or equal to CroppedHeight.
Otherwise (nnpfc_constant_patch_size_flag is equal to 1), the value of inpPatchWidth is set equal to nnpfc_patch_width_minus1 + 1, and the value of inpPatchHeight is set equal to nnpfc_patch_height_minus1 + 1.
Nnpfc _overlap specifies the overlapping horizontal and vertical sample counts of adjacent input tensors of the post-processing filter. The value nnpfc _overlap should be in the range of 0 to 16383 (inclusive).
The variables outPatchWidth, outPatchHeight, horCScaling, verCScaling, outPatchCWidth, outPatchCHeight, and overlapSize are derived as follows:
outPatchWidth = ( nnpfc_pic_width_in_luma_samples * inpPatchWidth ) / CroppedWidth
outPatchHeight = ( nnpfc_pic_height_in_luma_samples * inpPatchHeight ) / CroppedHeight
horCScaling = SubWidthC / outSubWidthC
verCScaling = SubHeightC / outSubHeightC
outPatchCWidth = outPatchWidth * horCScaling
outPatchCHeight = outPatchHeight * verCScaling
overlapSize = nnpfc_overlap
It is a requirement of bitstream conformance that outPatchWidth * CroppedWidth should be equal to nnpfc_pic_width_in_luma_samples * inpPatchWidth and that outPatchHeight * CroppedHeight should be equal to nnpfc_pic_height_in_luma_samples * inpPatchHeight.
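The derivation above and the accompanying conformance requirement can be restated as the following sketch; the function name and return convention are illustrative assumptions.

```python
def derive_patch_variables(nnpfc_pic_width_in_luma_samples: int,
                           nnpfc_pic_height_in_luma_samples: int,
                           cropped_width: int, cropped_height: int,
                           inp_patch_width: int, inp_patch_height: int,
                           sub_width_c: int, sub_height_c: int,
                           out_sub_width_c: int, out_sub_height_c: int,
                           nnpfc_overlap: int):
    """Derive outPatchWidth/Height and related variables (illustrative)."""
    out_patch_width = (nnpfc_pic_width_in_luma_samples * inp_patch_width) // cropped_width
    out_patch_height = (nnpfc_pic_height_in_luma_samples * inp_patch_height) // cropped_height
    # horCScaling = SubWidthC / outSubWidthC may be fractional (e.g., 1/2).
    hor_c_scaling = sub_width_c / out_sub_width_c
    ver_c_scaling = sub_height_c / out_sub_height_c
    out_patch_c_width = int(out_patch_width * hor_c_scaling)
    out_patch_c_height = int(out_patch_height * ver_c_scaling)
    overlap_size = nnpfc_overlap

    # Bitstream conformance: the output patch size must scale exactly with
    # the ratio of the filtered picture size to the cropped picture size.
    assert out_patch_width * cropped_width == nnpfc_pic_width_in_luma_samples * inp_patch_width
    assert out_patch_height * cropped_height == nnpfc_pic_height_in_luma_samples * inp_patch_height
    return (out_patch_width, out_patch_height,
            out_patch_c_width, out_patch_c_height, overlap_size)
```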
Nnpfc _padding_type indicates a padding process when referring to sample positions outside the boundary of the clip-decoded output picture as described in table 11. The value nnpfc _padding_type should be in the range of 0 to 15 (inclusive).
nnpfc_padding_type    Description
0                     Zero padding
1                     Replication padding
2                     Reflection padding
3                     Wrap-around padding
4                     Fixed padding
5..15                 Reserved
TABLE 11
Nnpfc _luma_padding_val indicates the luminance value for padding when nnpfc _padding_type is equal to 4.
Nnpfc _cb_padding_val indicates the Cb value for padding when nnpfc _padding_type is equal to 4.
Nnpfc _cr_padding_val indicates the Cr value for padding when nnpfc _padding_type is equal to 4.
The function InpSampleVal( y, x, picHeight, picWidth, croppedPic ), with inputs being a vertical sample position y, a horizontal sample position x, a picture height picHeight, a picture width picWidth, and a sample array croppedPic, returns the value of sampleVal derived as follows:
Note - for the inputs of the function InpSampleVal(), the vertical position is listed before the horizontal position for compatibility with the input tensor conventions of some inference engines.
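The body of InpSampleVal() is not reproduced above. The sketch below shows an intended behavior for the padding types of Table 11 that have an obvious interpretation (zero, replication, and fixed padding) and should be read as an assumption, not as a quotation of the specified derivation.

```python
def inp_sample_val(y: int, x: int, pic_height: int, pic_width: int,
                   cropped_pic, nnpfc_padding_type: int = 1,
                   fixed_padding_val: int = 0):
    """Fetch a sample, padding positions outside the picture (illustrative)."""
    if 0 <= y < pic_height and 0 <= x < pic_width:
        return cropped_pic[y][x]
    if nnpfc_padding_type == 0:            # zero padding
        return 0
    if nnpfc_padding_type == 1:            # replication padding: clamp to border
        yc = min(max(y, 0), pic_height - 1)
        xc = min(max(x, 0), pic_width - 1)
        return cropped_pic[yc][xc]
    if nnpfc_padding_type == 4:            # fixed padding value
        return fixed_padding_val
    raise NotImplementedError("reflection/wrap-around padding not sketched")


# Replication padding clamps the out-of-bounds position (-1, -1) to (0, 0).
assert inp_sample_val(-1, -1, 2, 2, [[7, 8], [9, 10]]) == 7
```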
The following example process may be used with the post-processing filter PostProcessingFilter() to filter the cropped decoded output picture block by block, producing a filtered picture that contains Y, Cb, and Cr sample arrays FilteredYPic, FilteredCbPic, and FilteredCrPic, respectively, as indicated by nnpfc_out_order_idc.
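A sketch of such a block-by-block filtering loop, stepping over the cropped picture in steps of the input block size and invoking the three processes described above, might look as follows; DeriveInputTensors, PostProcessingFilter, and StoreOutputTensors are placeholders for the processes associated with the SEI message, and the loop structure is an illustrative assumption.

```python
def filter_picture_blockwise(cropped_width: int, cropped_height: int,
                             inp_patch_width: int, inp_patch_height: int,
                             derive_input_tensors, post_processing_filter,
                             store_output_tensors):
    """Apply the post-processing filter one block at a time (illustrative)."""
    for c_top in range(0, cropped_height, inp_patch_height):
        for c_left in range(0, cropped_width, inp_patch_width):
            # Build the input tensor for the block whose top-left luma
            # sample is at (c_top, c_left), run the filter, and store
            # the filtered samples into FilteredYPic/FilteredCbPic/FilteredCrPic.
            input_tensor = derive_input_tensors(c_top, c_left)
            output_tensor = post_processing_filter(input_tensor)
            store_output_tensors(c_top, c_left, output_tensor)
```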
Nnpfc _ complexity _info_present_flag equal to 1 specifies that there are one or more syntax elements indicating the complexity of the post-processing filter associated with nnpfc _id. nnpfc _ complexity _info_present_flag equal to 0 specifies that there is no syntax element indicating the complexity of the post-processing filter associated with nnpfc _id.
nnpfc_parameter_type_idc equal to 0 indicates that the neural network uses only integer parameters. nnpfc_parameter_type_idc equal to 1 indicates that the neural network may use floating-point or integer parameters. nnpfc_parameter_type_idc equal to 2 indicates that the neural network uses only binary parameters. nnpfc_parameter_type_idc equal to 3 is reserved for future use by ITU-T | ISO/IEC and should not be present in the bitstream conforming to this version of this document. A decoder conforming to this version of this document should ignore NNPFC SEI messages with nnpfc_parameter_type_idc equal to 3.
Nnpfc _log2_parameter_bit_length_minus3 equal to 0, 1, 2, and 3 indicate that the neural network does not use parameters with bit lengths greater than 8, 16, 32, and 64, respectively. When nnpfc _parameter_type_idc is present and nnpfc _log2_parameter_bit_length_minus3 is not present, the neural network does not use parameters with bit length greater than 1.
Nnpfc _num_parameters_idc indicates the maximum number of neural network parameters for the post-processing filter in units of a power of 2048. nnpfc _num_parameters_idc equal to 0 indicates the maximum number of unknown neural network parameters. The value nnpfc _num_parameters_idc should be in the range of 0 to 52 (inclusive). A value of nnpfc _num_parameters_idc greater than 52 is reserved for future use by ITU-t|iso/IEC and should not be present in the bitstream conforming to this version of the present document. Decoders consistent with this version of this document will ignore NNPFC SEI messages with nnpfc _num_parameters_idc greater than 52.
If nnpfc _num_parameters_idc has a value greater than 0, variable maxNumParameters is derived as follows:
maxNumParameters=(2048<<nnpfc_num_parameters_idc)-1
The number of neural network parameters of the post-processing filter should be less than or equal to maxNumParameters for bit stream compliance requirements.
Nnpfc _num_ kmac _operations_idc greater than 0 indicates that the maximum number of multiply-accumulate operations per sample of the post-processing filter is less than or equal to nnpfc _num_ kmac _operations_idc x 1000.nnpfc _num_ kmac _operations_idc equal to 0 indicates the maximum number of unknown multiply-accumulate operations. The value of nnpfc _num_ kmac _operations_idc should be in the range of 0 to 2 32 -1 (inclusive).
nnpfc_total_kilobyte_size greater than 0 indicates the total size in kilobytes required to store the uncompressed parameters for the neural network. The total size in bits is a number equal to or greater than the sum of the bits used to store each parameter. nnpfc_total_kilobyte_size is the total size in bits divided by 8000, rounded up. nnpfc_total_kilobyte_size equal to 0 indicates that the total size required to store the parameters for the neural network is unknown. The value of nnpfc_total_kilobyte_size should be in the range of 0 to 2^32 - 1, inclusive.
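For example, the relationship between the total parameter size in bits and nnpfc_total_kilobyte_size can be written as the following one-line computation; the helper name is illustrative.

```python
import math


def total_kilobyte_size(total_size_in_bits: int) -> int:
    # nnpfc_total_kilobyte_size: total size in bits divided by 8000, rounded up.
    return math.ceil(total_size_in_bits / 8000)


assert total_kilobyte_size(8000) == 1
assert total_kilobyte_size(8001) == 2
```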
In a bitstream consistent with this version of this document, nnpfc _reserved_zero_bit_b should be equal to 0. When nnpfc _reserved_zero_bit_b is not equal to 0, the decoder should ignore NNPFC SEI message.
Nnpfc _payload_byte [ i ] contains the ith byte of the ISO/IEC 15938-17 compliant bitstream. The byte sequence nnpfc _payload_byte [ i ] for all current values of i should be a complete bitstream compliant with ISO/IEC 15938-17.
Table 12 shows the syntax of the neural network postfilter activation SEI message provided in JVET-AB 2006.
Table 12
For Table 12, JVET-AB2006 provides the following semantics.
The neural network post-filter activation (NNPFA) SEI message activates or deactivates the possible use of the target neural network post-processing filter identified by nnpfa _target_id for post-processing filtering of a group of pictures.
Note that several NNPFA SEI messages may exist for the same picture, for example, when the post-processing filter is intended for different purposes or to filter different color components.
nnpfa_target_id indicates the target neural network post-processing filter, which is specified by one or more neural network post-filter characteristics SEI messages that pertain to the current picture and have nnpfc_id equal to nnpfa_target_id.
The value of nnpfa_target_id should be in the range of 0 to 2^32 - 2, inclusive. Values of nnpfa_target_id from 256 to 511, inclusive, and from 2^31 to 2^32 - 2, inclusive, are reserved for future use by ITU-T | ISO/IEC. Decoders conforming to this version of this document that encounter an NNPFA SEI message with nnpfa_target_id in the range of 256 to 511, inclusive, or in the range of 2^31 to 2^32 - 2, inclusive, should ignore the SEI message.
A NNPFA SEI message with a particular value of nnpfa _target_id will not exist in the current PU unless one or both of the following conditions are true:
Within the current CLVS, there is an NNPFC SEI message having nnpfc_id equal to the particular value of nnpfa_target_id in a PU that precedes the current PU in decoding order.
There is a NNPFC SEI message with nnpfc _id equal to the specific value of nnpfa _target_id in the current PU.
When the PU contains both a NNPFC SEI message with a specific value of nnpfc _id and a NNPFA SEI message with nnpfa _target_id equal to a specific value of nnpfc _id, the NNPFC SEI message will precede the NNPFA SEI message in decoding order.
Nnpfa _cancel_flag equal to 1 indicates cancellation of persistence of the target neural network post-processing filter established by any previous NNPFA SEI message having the same nnpfa _target_id as the current SEI message, i.e., the target neural network post-processing filter is no longer used unless it is activated by another NNPFA SEI message having the same nnpfa _target_id as the current SEI message and nnpfa _cancel_flag equal to 0. nnpfa _cancel_flag equal to 0 indicates that nnpfa _persistence_flag follows.
Nnpfa _persistence_flag specifies the persistence of the target neural network post-processing filter for the current layer.
nnpfa_persistence_flag equal to 0 specifies that the target neural network post-processing filter may be used for post-processing filtering of the current picture only.
Nnpfa _persistence_flag equal to 1 specifies that the target neural network post-processing filter is available to post-process filter the current picture and all subsequent pictures of the current layer in output order until one or more of the following conditions are true:
-a new CLVS of the current layer starts.
-End of bit stream.
A picture in the current layer associated with an NNPFA SEI message having the same nnpfa_target_id as the current SEI message and nnpfa_cancel_flag equal to 1 is output, following the current picture in output order.
Note that the target neural network post-processing filter in the current layer associated with the NNPFA SEI message having the same nnpfa _target_id and nnpfa _cancel_flag equal to 1 as the current SEI message is not applied to the subsequent picture.
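A decoder-side sketch of how the activation, persistence, and cancellation semantics above might be tracked per nnpfa_target_id is shown below; the data structure and method names are illustrative assumptions and are not part of JVET-AB2006, and CLVS-boundary and end-of-bitstream handling is simplified.

```python
class NnpfaTracker:
    """Track which target post-processing filters are active (illustrative)."""

    def __init__(self):
        # Maps nnpfa_target_id -> True if the filter persists beyond the
        # picture that activated it (nnpfa_persistence_flag equal to 1).
        self.active = {}

    def on_nnpfa_sei(self, target_id: int, cancel_flag: int, persistence_flag: int):
        if cancel_flag:
            # Cancel any persistence established by earlier NNPFA SEI messages
            # with the same nnpfa_target_id.
            self.active.pop(target_id, None)
        else:
            self.active[target_id] = bool(persistence_flag)

    def filters_for_current_picture(self):
        # nnpfa_target_id values whose filters may be applied to this picture.
        return set(self.active)

    def on_picture_output(self):
        # Filters activated only for the current picture do not persist.
        self.active = {tid: True for tid, persist in self.active.items() if persist}

    def on_new_clvs(self):
        # A new CLVS of the current layer ends all persistence.
        self.active.clear()
```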
The neural network post-filter characteristics SEI message provided in JVET-AB2006 may be less than ideal. In particular, for example, signaling in JVET-AB2006 may be insufficient to support frame rate upsampling on a sub-layer by sub-layer basis. According to the techniques described herein, syntax and semantics are provided to provide better support for frame rate upsampling with respect to temporal scalability.
Fig. 1 is a block diagram illustrating an example of a system that may be configured to code (e.g., encode and/or decode) video data in accordance with one or more techniques of the present disclosure. System 100 represents an example of a video data system that may be packaged in accordance with one or more techniques of the present disclosure. As shown in fig. 1, system 100 includes a source device 102, a communication medium 110, and a target device 120. In the example shown in fig. 1, source device 102 may include any device configured to encode video data and transmit the encoded video data to communication medium 110. Target device 120 may include any device configured to receive encoded video data via communication medium 110 and decode the encoded video data. The source device 102 and/or the target device 120 may include computing devices equipped for wired and/or wireless communication, and may include, for example, set-top boxes, digital video recorders, televisions, desktop, laptop or tablet computers, gaming consoles, medical imaging devices, and mobile devices (including, for example, smart phones, cellular phones, and personal gaming devices).
Communication medium 110 may include any combination of wireless and wired communication media and/or storage devices. Communication medium 110 may include coaxial cable, fiber optic cable, twisted pair cable, wireless transmitters and receivers, routers, switches, repeaters, base stations, or any other device that may be used to facilitate communications between various devices and sites. Communication medium 110 may include one or more networks. For example, the communication medium 110 may include a network configured to allow access to the world wide web, such as the Internet. The network may operate in accordance with a combination of one or more telecommunications protocols. The telecommunications protocol may include proprietary aspects and/or may include standardized telecommunications protocols. Examples of standardized telecommunications protocols include the Digital Video Broadcasting (DVB) standard, the Advanced Television Systems Committee (ATSC) standard, the Integrated Services Digital Broadcasting (ISDB) standard, the Data Over Cable Service Interface Specification (DOCSIS) standard, the global system for mobile communications (GSM) standard, the Code Division Multiple Access (CDMA) standard, the 3 rd generation partnership project (3 GPP) standard, the European Telecommunications Standards Institute (ETSI) standard, the Internet Protocol (IP) standard, the Wireless Application Protocol (WAP) standard, and the Institute of Electrical and Electronics Engineers (IEEE) standard.
The storage device may include any type of device or storage medium capable of storing data. The storage medium may include a tangible or non-transitory computer readable medium. The computer readable medium may include an optical disk, flash memory, magnetic memory, or any other suitable digital storage medium. In some examples, the memory device or portions thereof may be described as non-volatile memory, and in other examples, portions of the memory device may be described as volatile memory. Examples of volatile memory can include Random Access Memory (RAM), dynamic Random Access Memory (DRAM), and Static Random Access Memory (SRAM). Examples of non-volatile memory may include magnetic hard disk, optical disk, floppy disk, flash memory, or forms of electrically programmable memory (EPROM) or Electrically Erasable and Programmable (EEPROM) memory. The storage devices may include memory cards (e.g., secure Digital (SD) memory cards), internal/external hard disk drives, and/or internal/external solid state drives. The data may be stored on the storage device according to a defined file format.
Fig. 4 is a conceptual diagram illustrating an example of components that may be included in a particular implementation of system 100. In the exemplary implementation shown in fig. 4, system 100 includes one or more computing devices 402A-402N, a television service network 404, a television service provider site 406, a wide area network 408, a local area network 410, and one or more content provider sites 412A-412N. The implementations shown in fig. 4 represent examples of systems that may be configured to allow digital media content (such as movies, live sporting events, etc.) and data and applications associated therewith, as well as media presentations, to be distributed to and accessed by multiple computing devices (such as computing devices 402A-402N). In the example shown in fig. 4, computing devices 402A-402N may include any device configured to receive data from one or more of television services network 404, wide area network 408, and/or local area network 410. For example, computing devices 402A-402N may be equipped for wired and/or wireless communication and may be configured to receive services over one or more data channels and may include televisions, including so-called smart televisions, set-top boxes, and digital video recorders. Further, computing devices 402A-402N may include desktop, laptop or tablet computers, gaming consoles, mobile devices (including, for example, "smart" phones, cellular phones, and personal gaming devices).
Television services network 404 is an example of a network configured to enable distribution of digital media content that may include television services. For example, television service network 404 may include a public over-the-air television network, a public or subscription-based satellite television service provider network, and a public or subscription-based cable television provider network and/or a cloud or internet service provider. It should be noted that although in some examples, television services network 404 may be primarily used to allow provision of television services, television services network 404 may also allow provision of other types of data and services according to any combination of the telecommunication protocols described herein. Further, it should be noted that in some examples, television service network 404 may allow bi-directional communication between television service provider site 406 and one or more of computing devices 402A-402N. Television services network 404 may include any combination of wireless and/or wired communication media. Television services network 404 may include coaxial cables, fiber optic cables, twisted pair cables, wireless transmitters and receivers, routers, switches, repeaters, base stations, or any other device that may be used to facilitate communications between various devices and sites. The television services network 404 may operate according to a combination of one or more telecommunications protocols. The telecommunications protocol may include proprietary aspects and/or may include standardized telecommunications protocols. Examples of standardized telecommunication protocols include DVB standards, ATSC standards, ISDB standards, DTMB standards, DMB standards, data Over Cable Service Interface Specification (DOCSIS) standards, hbbTV standards, W3C standards, and UPnP standards.
Referring again to fig. 4, television service provider site 406 may be configured to distribute television services via television service network 404. For example, television service provider site 406 may include one or more broadcast stations, cable television providers, or satellite television providers, or internet-based television providers. For example, television service provider site 406 may be configured to receive transmissions (including television programs) via satellite uplink/downlink. Further, as shown in fig. 4, television service provider site 406 may be in communication with wide area network 408 and may be configured to receive data from content provider sites 412A through 412N. It should be noted that in some examples, television service provider site 406 may include a television studio and content may originate from the television studio.
Wide area network 408 may comprise a packet-based network and operate in accordance with a combination of one or more telecommunications protocols. The telecommunications protocol may include proprietary aspects and/or may include standardized telecommunications protocols. Examples of standardized telecommunications protocols include the global system for mobile communications (GSM) standard, the Code Division Multiple Access (CDMA) standard, the 3 rd generation partnership project (3 GPP) standard, the European Telecommunications Standards Institute (ETSI) standard, the european standard (EN), the IP standard, the Wireless Application Protocol (WAP) standard, and the Institute of Electrical and Electronics Engineers (IEEE) standard, such as one or more IEEE 802 standards (e.g., wi-Fi). Wide area network 408 may include any combination of wireless and/or wired communication media. Wide area network 408 may include coaxial cables, fiber optic cables, twisted pair cables, ethernet cables, wireless transmitters and receivers, routers, switches, repeaters, base stations, or any other device that may be used to facilitate communications between various devices and sites. In one example, wide area network 408 may include the internet. The local area network 410 may comprise a packet-based network and operate in accordance with a combination of one or more telecommunications protocols. Local area network 410 may be distinguished from wide area network 408 based on access level and/or physical infrastructure. For example, local area network 410 may include a secure home network.
Referring again to fig. 4, content provider sites 412A-412N represent examples of sites that may provide multimedia content to television service provider site 406 and/or computing devices 402A-402N. For example, the content provider site may include a studio having one or more studio content servers configured to provide multimedia files and/or streams to the television service provider site 406. In one example, content provider sites 412A-412N may be configured to provide multimedia content using an IP suite. For example, the content provider site may be configured to provide multimedia content to the receiver device according to Real Time Streaming Protocol (RTSP), HTTP, or the like. Further, the content provider sites 412A-412N may be configured to provide data including hypertext-based content and the like to one or more of the receiver devices 402A-402N and/or the television service provider sites 406 via the wide area network 408. Content provider sites 412A-412N may include one or more web servers. The data provided by the data provider sites 412A through 412N may be defined according to a data format.
Referring again to fig. 1, source device 102 includes a video source 104, a video encoder 106, a data packager 107, and an interface 108. Video source 104 may include any device configured to capture and/or store video data. For example, video source 104 may include a camera and a storage device operatively coupled thereto. Video encoder 106 may include any device configured to receive video data and generate a compatible bitstream representing the video data. A compatible bitstream may refer to a bitstream from which a video decoder may receive and reproduce video data. Aspects of a compatible bitstream may be defined in accordance with a video coding standard. When generating a compatible bitstream, the video encoder 106 may compress the video data. Compression may be lossy (perceptible or imperceptible to an observer) or lossless. Fig. 5 is a block diagram illustrating an example of a video encoder 500 that may implement the techniques for encoding video data described herein. It should be noted that although the exemplary video encoder 500 is shown with different functional blocks, such illustration is intended for descriptive purposes and not to limit the video encoder 500 and/or its subcomponents to a particular hardware or software architecture. The functions of video encoder 500 may be implemented using any combination of hardware, firmware, and/or software implementations.
The video encoder 500 may perform intra prediction encoding and inter prediction encoding of picture regions, and thus may be referred to as a hybrid video encoder. In the example shown in fig. 5, a video encoder 500 receives a source video block. In some examples, a source video block may include picture regions that have been partitioned according to an encoding structure. For example, the source video data may include macroblocks, CTUs, CBs, sub-partitions thereof, and/or additional equivalent coding units. In some examples, video encoder 500 may be configured to perform additional subdivision of the source video block. It should be noted that the techniques described herein are generally applicable to video encoding, regardless of how the source video data is partitioned prior to and/or during encoding. In the example shown in fig. 5, the video encoder 500 includes an adder 502, a transform coefficient generator 504, a coefficient quantization unit 506, an inverse quantization and transform coefficient processing unit 508, an adder 510, an intra prediction processing unit 512, an inter prediction processing unit 514, a filter unit 516, and an entropy encoding unit 518. As shown in fig. 5, a video encoder 500 receives source video blocks and outputs a bitstream.
In the example shown in fig. 5, video encoder 500 may generate residual data by subtracting a predicted video block from a source video block. The selection of the predicted video block is described in detail below. Summer 502 represents a component configured to perform the subtraction operation. In one example, the subtraction of video blocks occurs in the pixel domain. The transform coefficient generator 504 applies a transform such as a Discrete Cosine Transform (DCT), a Discrete Sine Transform (DST), or a conceptually similar transform (e.g., four 8 x 8 transforms may be applied to a 16 x 16 array of residual values) to the residual block or sub-partitions thereof to produce a set of residual transform coefficients. The transform coefficient generator 504 may be configured to perform any and all combinations of the transforms included in the series of discrete trigonometric transforms, including approximations thereof. The transform coefficient generator 504 may output the transform coefficients to the coefficient quantization unit 506. The coefficient quantization unit 506 may be configured to perform quantization of the transform coefficients. The quantization process may reduce the bit depth associated with some or all of the coefficients. The degree of quantization may change the rate distortion (i.e., the relationship of bit rate to video quality) of the encoded video data. The degree of quantization may be modified by adjusting a Quantization Parameter (QP). The quantization parameter may be determined based on a slice level value and/or a CU level value (e.g., a CU delta QP value). QP data may include any data used to determine a QP for quantizing a particular set of transform coefficients. As shown in fig. 5, the quantized transform coefficients (which may be referred to as level values) are output to an inverse quantization and transform coefficient processing unit 508. The inverse quantization and transform coefficient processing unit 508 may be configured to apply inverse quantization and inverse transform to generate reconstructed residual data. As shown in fig. 5, reconstructed residual data may be added to the predicted video block at summer 510. In this way, the encoded video block may be reconstructed and the resulting reconstructed video block may be used to evaluate the coding quality of a given prediction, transform and/or quantization. The video encoder 500 may be configured to perform multiple encoding passes (e.g., perform encoding while changing one or more of the prediction, transform parameters, and quantization parameters). The rate-distortion or other system parameters of the bitstream may be optimized based on the evaluation of the reconstructed video block. Furthermore, the reconstructed video block may be stored and used as a reference for predicting a subsequent block.
Referring again to fig. 5, the intra-prediction processing unit 512 may be configured to select an intra-prediction mode for the video block to be encoded. The intra prediction processing unit 512 may be configured to evaluate the frame and determine an intra prediction mode used to encode the current block. As described above, possible intra prediction modes may include a planar prediction mode, a DC prediction mode, and an angular prediction mode. Further, it should be noted that in some examples, the prediction mode of the chroma component may be inferred from the prediction mode of the luma prediction mode. The intra-prediction processing unit 512 may select the intra-prediction mode after performing one or more encoding passes. Further, in one example, intra-prediction processing unit 512 may select a prediction mode based on rate-distortion analysis. As shown in fig. 5, the intra-prediction processing unit 512 outputs intra-prediction data (e.g., syntax elements) to the entropy encoding unit 518 and the transform coefficient generator 504. As described above, the transforms performed on the residual data may be mode dependent (e.g., a quadratic transform matrix may be determined based on the prediction mode).
Referring again to fig. 5, the inter prediction processing unit 514 may be configured to perform inter prediction encoding for the current video block. The inter prediction processing unit 514 may be configured to receive the source video block and calculate a motion vector for a PU of the video block. The motion vector may indicate a displacement of a prediction unit of a video block within the current video frame relative to a prediction block within the reference frame. Inter prediction coding may use one or more reference pictures. Further, the motion prediction may be unidirectional prediction (using one motion vector) or bidirectional prediction (using two motion vectors). The inter prediction processing unit 514 may be configured to select a prediction block by calculating pixel differences determined by, for example, sum of Absolute Differences (SAD), sum of Squared Differences (SSD), or other difference metrics. As described above, a motion vector can be determined and specified from motion vector prediction. As described above, the inter prediction processing unit 514 may be configured to perform motion vector prediction. The inter prediction processing unit 514 may be configured to generate a prediction block using the motion prediction data. For example, the inter-prediction processing unit 514 may locate a predicted video block (not shown in fig. 5) within a frame buffer. It should be noted that the inter prediction processing unit 514 may be further configured to apply one or more interpolation filters to the reconstructed residual block to calculate sub-integer pixel values for motion estimation. The inter prediction processing unit 514 may output motion prediction data of the calculated motion vector to the entropy encoding unit 518.
As shown in fig. 5, the filter unit 516 receives the reconstructed video block and the encoding parameters and outputs modified reconstructed video data. The filter unit 516 may be configured to perform deblocking and/or Sample Adaptive Offset (SAO) filtering. SAO filtering is a type of nonlinear amplitude mapping that can be used to improve reconstruction by adding an offset to the reconstructed video data. It should be noted that as shown in fig. 5, the intra prediction processing unit 512 and the inter prediction processing unit 514 may receive the modified reconstructed video block via the filter unit 516. The entropy encoding unit 518 receives quantized transform coefficients and prediction syntax data (i.e., intra prediction data and motion prediction data). It should be noted that in some examples, coefficient quantization unit 506 may perform a scan of a matrix comprising quantized transform coefficients before outputting the coefficients to entropy encoding unit 518. In other examples, entropy encoding unit 518 may perform the scanning. The entropy encoding unit 518 may be configured to perform entropy encoding according to one or more of the techniques described herein. As such, video encoder 500 represents an example of a device configured to generate encoded video data in accordance with one or more techniques of this disclosure.
Referring again to fig. 1, data encapsulator 107 can receive encoded video data and generate a compatible bitstream, e.g., a sequence of NAL units, according to a defined data structure. A device receiving a compatible bitstream may reproduce video data therefrom. Further, as described above, sub-bitstream extraction may refer to the process by which a device receiving a compatible bitstream forms a new compatible bitstream by discarding and/or modifying data in the received bitstream. It should be noted that the term compliant bitstream may be used instead of the term compatible bitstream. In one example, the data packager 107 may be configured to generate a grammar in accordance with one or more techniques described herein. It should be noted that the data packager 107 need not necessarily be located in the same physical device as the video encoder 106. For example, the functions described as being performed by the video encoder 106 and the data encapsulator 107 may be distributed among the devices shown in fig. 4.
As described above, the signaling in JVET-AB2006 may not be sufficient to support frame rate upsampling on a sub-layer by sub-layer basis. According to the techniques herein, a syntax is provided for specifying the number of interpolated pictures generated by the post-processing filter for the case where temporal sub-layers are discarded. That is, in one example, the syntax element nnpfc_interpolated_pics[i][j] is used instead of the syntax element nnpfc_interpolated_pics[i] in order to provide better support for temporal scalability, according to the techniques herein.
As an illustrative example, the original bitstream may be a 60 frames per second (fps) bitstream and include two temporal sub-layers: sub-layers 0 and 1, where sub-layer 0 alone is 30 fps and sub-layers 0 and 1 together are 60 fps. Furthermore, an NNPFC message according to JVET-AB2006 may signal nnpfc_num_input_pics_minus2 equal to 0 and nnpfc_interpolated_pics[0] equal to 1 for the original 60 fps bitstream, such that the interpolated pictures and the original bitstream together produce a set of pictures at 120 fps. In this case, if the highest temporal sub-layer is discarded from the original bitstream such that the original bitstream becomes a 30 fps bitstream, the value of nnpfc_interpolated_pics[0] would need to be changed to 3 in order to achieve 120 fps. For this case, JVET-AB2006 does not provide a mechanism for changing the value of nnpfc_interpolated_pics[0]. In accordance with the techniques herein, in this case, the additional loop allows signaling nnpfc_interpolated_pics[0][1] = 1 and nnpfc_interpolated_pics[0][0] = 3.
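The arithmetic of this example can be illustrated with a short sketch. The following Python fragment is a minimal, non-normative illustration; the helper name interpolated_pics_needed is an assumption introduced here, and it assumes the target frame rate is an integer multiple of the input frame rate.

    def interpolated_pics_needed(input_fps, target_fps):
        """Number of interpolated pictures to insert between consecutive input pictures
        so that the input plus interpolated pictures reach target_fps."""
        assert target_fps % input_fps == 0
        return target_fps // input_fps - 1

    # Example from the text: the original bitstream is 60 fps with two temporal sub-layers,
    # where sub-layer 0 alone is 30 fps. The target output rate after post-filtering is 120 fps.
    print(interpolated_pics_needed(60, 120))  # 1 -> nnpfc_interpolated_pics[0][1]
    print(interpolated_pics_needed(30, 120))  # 3 -> nnpfc_interpolated_pics[0][0]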
Tables 13A and 13B illustrate relevant syntax of an example neural network post-filter characteristics SEI message, according to the techniques herein, for specifying the number of interpolated pictures produced by the post-processing filter for the case where temporal sub-layers are discarded. With respect to Table 13A, it should be noted that JVET-T2001 provides that the syntax element sps_max_sublayers_minus1 is included in the sequence parameter set syntax seq_parameter_set_rbsp() and has the following semantics:
sps_max_sublayers_minus1 plus 1 specifies the maximum number of temporal sub-layers that may be present in each CLVS referring to the SPS.
If sps_video_parameter_set_id is greater than 0, the value of sps_max_sublayers_minus1 should be in the range of 0 to vps_max_sublayers_minus1, inclusive.
Otherwise (sps_video_parameter_set_id equal to 0), the following applies:
- The value of sps_max_sublayers_minus1 should be in the range of 0 to 6, inclusive.
- The value of vps_max_sublayers_minus1 is inferred to be equal to sps_max_sublayers_minus1.
- The value of NumSubLayersInLayerInOLS[0][0] is inferred to be equal to sps_max_sublayers_minus1 + 1.
- The value of vps_ols_ptl_idx[0] is inferred to be equal to 0, and the value of vps_ptl_max_tid[vps_ols_ptl_idx[0]] (i.e., vps_ptl_max_tid[0]) is inferred to be equal to sps_max_sublayers_minus1.
Wherein,
vps_max_sublayers_minus1 plus 1 specifies the maximum number of temporal sub-layers that may be present in a layer specified by the VPS. The value of vps_max_sublayers_minus1 should be in the range of 0 to 6, inclusive.
TABLE 13A
TABLE 13B
For tables 13A and 13B, the semantics may be based on the semantics provided above and the following semantics:
The neural network post-filter characteristics (NNPFC) SEI message specifies the neural network that can be used as a post-processing filter. The use of a specified post-processing filter for a particular picture is indicated by a neural network post-filter activation SEI message.
The use of this SEI message requires the definition of the following variables:
- The cropped decoded output picture width and height in units of luma samples, denoted herein by CroppedWidth and CroppedHeight, respectively.
- The luma sample array CroppedYPic[idx] and the chroma sample arrays CroppedCbPic[idx] and CroppedCrPic[idx], when present, of the cropped decoded output pictures with idx in the range of 0 to numInputPics - 1, inclusive, which are used as inputs to the post-processing filter.
- The highest temporal sub-layer to be decoded, HighestTid (or, in one example, the highest temporal sub-layer present, HighestTid).
- The bit depth BitDepthY of the luma sample array of the cropped decoded output picture.
- The bit depth BitDepthC of the chroma sample arrays, if any, of the cropped decoded output picture.
- A chroma format indicator, denoted herein by ChromaFormatIdc.
- When nnpfc_auxiliary_inp_idc is equal to 1, a filtering strength control value StrengthControlVal that should be a real number in the range of 0 to 1, inclusive.
nnpfc_interpolated_pics[i][j] specifies the number of interpolated pictures generated by the post-processing filter between the i-th picture and the (i+1)-th picture that are used as inputs to the post-processing filter when j is the highest temporal sub-layer present in the bitstream or decoded.
A variable numInputPics specifying the number of pictures used as input to the post-processing filter and a variable numOutputPics specifying the total number of pictures produced from the post-processing filter are derived as follows:
let Htid be the highest Tid
Wherein for JVET-T2001:
The variable Htid, identifying the highest temporal sub-layer to be decoded, is derived as follows:
- The following applies to the first AU of the bitstream:
- If some external means not specified in this description is available for setting Htid, Htid is set by that external means.
- Otherwise, if opi_htid_plus1 is present in an OPI NAL unit in the first AU of the bitstream, Htid is set equal to ( opi_htid_plus1 > 0 ) ? ( opi_htid_plus1 - 1 ) : vps_ptl_max_tid[ vps_ols_ptl_idx[ TargetOlsIdx ] ].
- Otherwise, Htid is set equal to vps_ptl_max_tid[ vps_ols_ptl_idx[ TargetOlsIdx ] ].
Note that Htid will be set equal to sps_max_sublayers_minus1 when sps_video_parameter_set_id is equal to 0.
In another example, the above code may use the variable HighestTid to indicate the highest Tid, in which case the portion of the above code
numOutputPics += nnpfc_interpolated_pics[i][Htid]
becomes
numOutputPics += nnpfc_interpolated_pics[i][HighestTid]
Wherein the syntax element opi_htid_plus1 is included in the operation point information syntax operation_point_information_rbsp() and has the following semantics:
opi_htid_plus1 equal to 0 specifies that all pictures in the current CVS and all subsequent CVSs in decoding order (up to, but not including, the next CVS for which opi_htid_plus1 is provided in an OPI NAL unit) are IRAP pictures or GDR pictures with ph_recovery_poc_cnt equal to 0. opi_htid_plus1 greater than 0 specifies that all pictures in the current CVS and all subsequent CVSs in decoding order (up to, but not including, the next CVS for which opi_htid_plus1 is provided in an OPI NAL unit) have a TemporalId that is less than opi_htid_plus1.
Wherein the syntax element vps_ptl_max_tid is included in the video parameter set syntax video_parameter_set_rbsp() and has the following semantics:
vps_ptl_max_tid[i] specifies the TemporalId of the highest sub-layer representation for which the level information is present in the i-th profile_tier_level() syntax structure in the VPS, and the TemporalId of the highest sub-layer representation present in the OLS with OLS index olsIdx such that vps_ols_ptl_idx[olsIdx] is equal to i. The value of vps_ptl_max_tid[i] should be in the range of 0 to vps_max_sublayers_minus1, inclusive. When vps_default_ptl_dpb_hrd_max_tid_flag is equal to 1, the value of vps_ptl_max_tid[i] is inferred to be equal to vps_max_sublayers_minus1.
It should be noted with respect to Table 13B that the input variable NumSubLayerMinus1, providing the number of temporal sub-layers minus 1, may instead be referred to as MaxNumSubLayersMinus1 or NumSubLayersInLayerInOLS, where MaxNumSubLayersMinus1 or NumSubLayersInLayerInOLS may be based on the definitions provided in JVET-T2001.
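The derivation of numInputPics and numOutputPics referenced above can be sketched as follows. This is a minimal, non-normative interpretation that assumes numInputPics equals nnpfc_num_input_pics_minus2 + 2 and that numOutputPics accumulates the input pictures plus the interpolated pictures signaled for the highest decoded temporal sub-layer Htid; the exact normative derivation may differ.

    def derive_output_pics(nnpfc_num_input_pics_minus2, nnpfc_interpolated_pics, htid):
        """Sketch of a numInputPics / numOutputPics derivation for the sub-layer-aware
        syntax of Tables 13A-13B. nnpfc_interpolated_pics is indexed as [i][j], where i is
        the gap between the i-th and (i+1)-th input picture and j is the highest temporal
        sub-layer (Htid)."""
        num_input_pics = nnpfc_num_input_pics_minus2 + 2
        num_output_pics = num_input_pics
        for i in range(num_input_pics - 1):
            num_output_pics += nnpfc_interpolated_pics[i][htid]
        return num_input_pics, num_output_pics

    # Example from the text: two input pictures, one interpolated picture between them when
    # both sub-layers are decoded (Htid = 1), three when only sub-layer 0 remains (Htid = 0).
    print(derive_output_pics(0, [[3, 1]], htid=1))  # (2, 3) -> 120 fps from the 60 fps stream
    print(derive_output_pics(0, [[3, 1]], htid=0))  # (2, 5) -> 120 fps from the 30 fps stream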
According to JVET-AB2006, when nnpfc_constant_patch_size_flag is equal to 0 (i.e., when the post-processing filter accepts any block size that is a positive integer multiple of the block size indicated by nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1), the overlapping horizontal and vertical sample counts of adjacent input tensors of the post-processing filter need to be considered relative to the variables INPPATCHWIDTH and INPPATCHHEIGHT, which represent the block size width and block size height, respectively. According to JVET-AB2006, for loops defined for yP values ranging from yP = -overlapSize to yP < INPPATCHHEIGHT + overlapSize and xP values ranging from xP = -overlapSize to xP < INPPATCHWIDTH + overlapSize, the actual input size to the post-processing filter as defined by the DeriveInputTensors() process is calculated. This may result in the actual input size to the post-processing filter not being a multiple of a particular value (e.g., 8) that may be required by a typical neural network post-filter.
For example, according to JVET-AB2006, the following case may occur: nnpfc_patch_width_minus1 + 1 = 8, nnpfc_patch_height_minus1 + 1 = 8, overlapSize = 3, and nnpfc_constant_patch_size_flag = 0. In this case, the filter accepts an input that is a multiple of (8, 8). Further, if INPPATCHWIDTH = (nnpfc_patch_width_minus1 + 1) * k (i.e., 8*k) and INPPATCHHEIGHT = (nnpfc_patch_height_minus1 + 1) * k (i.e., 8*k), the input of the post-processing filter is (8*k + 2*overlapSize, 8*k + 2*overlapSize) (i.e., (8*k + 6, 8*k + 6)). The value 8*k + 6 is not divisible by 8, which may lead to unexpected and/or erroneous filtering results.
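The divisibility issue can be verified with simple arithmetic. The following Python sketch, using the example values above (patch size 8, overlapSize 3), checks whether the actual input size INPPATCHWIDTH + 2 * overlapSize is a multiple of 8 for several choices of k; it is a non-normative illustration.

    patch = 8          # nnpfc_patch_width_minus1 + 1 = nnpfc_patch_height_minus1 + 1
    overlap_size = 3   # overlapSize

    for k in range(1, 5):
        inp_patch = patch * k                    # INPPATCHWIDTH chosen as a multiple of the patch size
        actual_input = inp_patch + 2 * overlap_size
        print(k, actual_input, actual_input % patch == 0)
    # 1 14 False
    # 2 22 False
    # 3 30 False
    # 4 38 False  -> 8*k + 6 is never a multiple of 8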
According to the techniques herein, an example neural network post-filter characteristics SEI message may include syntax elements having semantics that enable ensuring that the actual input size to the post-processing filter is a multiple of a particular value that may be required by a typical neural network post-filter. In one example, according to the techniques herein, the example neural network post-filter characteristics SEI message may be based on the syntax and semantics provided in the examples of table 6 and/or tables 13A-13B and:
nnpfc_constant_patch_size_flag equal to 1 indicates that the post-processing filter accepts exactly the block size indicated by nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1 as input. nnpfc_constant_patch_size_flag equal to 0 indicates that the post-processing filter accepts as input any block size that is a positive integer multiple of the block sizes indicated by nnpfc_patch_width_minus1, nnpfc_patch_height_minus1, and nnpfc_overlap.
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_height_minus1 + 1 specifies the vertical sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_height_minus1 should be in the range of 0 to Min(32766, CroppedHeight - 1), inclusive.
Let the variables INPPATCHWIDTH and INPPATCHHEIGHT be the block size width and block size height, respectively.
If nnpfc_constant_patch_size_flag is equal to 0, the following applies:
- The values of INPPATCHWIDTH and INPPATCHHEIGHT are provided by external means not specified in this document, or are set by the post-processor itself.
- The value of INPPATCHWIDTH + 2 * overlapSize should be a positive integer multiple of nnpfc_patch_width_minus1 + 1, and INPPATCHWIDTH should be less than or equal to CroppedWidth. The value of INPPATCHHEIGHT + 2 * overlapSize should be a positive integer multiple of nnpfc_patch_height_minus1 + 1, and INPPATCHHEIGHT should be less than or equal to CroppedHeight.
Otherwise (nnpfc_constant_patch_size_flag equal to 1), the value of INPPATCHWIDTH is set equal to nnpfc_patch_width_minus1 + 1, and the value of INPPATCHHEIGHT is set equal to nnpfc_patch_height_minus1 + 1.
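A small validation helper illustrates the constraint stated above for nnpfc_constant_patch_size_flag equal to 0: the externally chosen INPPATCHWIDTH and INPPATCHHEIGHT must be such that INPPATCHWIDTH + 2 * overlapSize and INPPATCHHEIGHT + 2 * overlapSize are positive integer multiples of the signaled patch dimensions while not exceeding the cropped picture dimensions. The Python helper below is a non-normative sketch; its name and signature are illustrative.

    def patch_size_valid(inp_patch_width, inp_patch_height,
                         patch_width, patch_height, overlap_size,
                         cropped_width, cropped_height):
        """Check the constraint for nnpfc_constant_patch_size_flag == 0:
        INPPATCHWIDTH + 2*overlapSize must be a positive integer multiple of
        nnpfc_patch_width_minus1 + 1 (and likewise for the height), and the patch
        must fit inside the cropped decoded output picture."""
        w_ok = ((inp_patch_width + 2 * overlap_size) % patch_width == 0
                and inp_patch_width <= cropped_width)
        h_ok = ((inp_patch_height + 2 * overlap_size) % patch_height == 0
                and inp_patch_height <= cropped_height)
        return w_ok and h_ok

    # With patch size 8 and overlapSize 3, INPPATCHWIDTH = 10 satisfies 10 + 6 = 16 = 2*8.
    print(patch_size_valid(10, 10, 8, 8, 3, 1920, 1080))  # True
    print(patch_size_valid(16, 16, 8, 8, 3, 1920, 1080))  # False (16 + 6 = 22 is not a multiple of 8)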
In one example, nnpfc_patch_height_minus1 may be based on the following:
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_height_minus1 + 1 specifies the vertical sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_height_minus1 should be in the range of 0 to Min(32766, CroppedHeight - 1), inclusive.
Let the variables INPPATCHWIDTH and INPPATCHHEIGHT be the block size width and block size height, respectively.
If nnpfc_constant_patch_size_flag is equal to 0, the following applies:
- The values of INPPATCHWIDTH and INPPATCHHEIGHT are provided by external means not specified in this document, or are set by the post-processor itself.
- The value of INPPATCHWIDTH + 2 * overlapSize should be a positive integer multiple of nnpfc_patch_width_minus1 + 1, and INPPATCHWIDTH should be less than or equal to CroppedWidth. The value of INPPATCHHEIGHT + 2 * overlapSize should be a positive integer multiple of nnpfc_patch_height_minus1 + 1, and INPPATCHHEIGHT should be less than or equal to CroppedHeight.
Otherwise (nnpfc_constant_patch_size_flag equal to 1), the value of INPPATCHWIDTH is set equal to nnpfc_patch_width_minus1 + 1, and the value of INPPATCHHEIGHT is set equal to nnpfc_patch_height_minus1 + 1.
In one example, nnpfc_patch_height_minus1 may be based on the following:
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_height_minus1 + 1 specifies the vertical sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_height_minus1 should be in the range of 0 to Min(32766, CroppedHeight - 1), inclusive.
Let the variables INPPATCHWIDTH and INPPATCHHEIGHT be the block size width and block size height, respectively.
If nnpfc_constant_patch_size_flag is equal to 0, the following applies:
- The values of INPPATCHWIDTH and INPPATCHHEIGHT are provided by external means not specified in this document, or are set by the post-processor itself.
- The value of INPPATCHWIDTH should be a positive integer multiple of nnpfc_patch_width_minus1 + 1 - 2 * overlapSize and should be less than or equal to CroppedWidth. The value of INPPATCHHEIGHT should be a positive integer multiple of nnpfc_patch_height_minus1 + 1 - 2 * overlapSize and should be less than or equal to CroppedHeight.
Otherwise (nnpfc_constant_patch_size_flag equal to 1), the value of INPPATCHWIDTH is set equal to nnpfc_patch_width_minus1 + 1, and the value of INPPATCHHEIGHT is set equal to nnpfc_patch_height_minus1 + 1.
In one example, according to the techniques herein, the example neural network post-filter characteristics SEI message may be based on the syntax and semantics provided in the examples of table 6 and/or tables 13A-13B and:
nnpfc_constant_patch_size_flag equal to 1 indicates that the post-processing filter accepts exactly the block size indicated by nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1 as input. nnpfc_constant_patch_size_flag equal to 0 indicates that the post-processing filter accepts as input any block size that is a positive integer power of the block sizes indicated by nnpfc_patch_width_minus1, nnpfc_patch_height_minus1, and nnpfc_overlap.
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_height_minus1 + 1 specifies the vertical sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_height_minus1 should be in the range of 0 to Min(32766, CroppedHeight - 1), inclusive.
Let the variables INPPATCHWIDTH and INPPATCHHEIGHT be the block size width and block size height, respectively.
If nnpfc_constant_patch_size_flag is equal to 0, the following applies:
- The values of INPPATCHWIDTH and INPPATCHHEIGHT are provided by external means not specified in this document, or are set by the post-processor itself.
- The value of INPPATCHWIDTH + 2 * overlapSize should be a positive integer power of nnpfc_patch_width_minus1 + 1, and INPPATCHWIDTH should be less than or equal to CroppedWidth. The value of INPPATCHHEIGHT + 2 * overlapSize should be a positive integer power of nnpfc_patch_height_minus1 + 1, and INPPATCHHEIGHT should be less than or equal to CroppedHeight.
Otherwise (nnpfc_constant_patch_size_flag equal to 1), the value of INPPATCHWIDTH is set equal to nnpfc_patch_width_minus1 + 1, and the value of INPPATCHHEIGHT is set equal to nnpfc_patch_height_minus1 + 1.
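For the "positive integer power" variant above, the check differs from the multiple-based variants: INPPATCHWIDTH + 2 * overlapSize must equal the signaled patch dimension raised to a positive integer exponent. The following Python fragment is a non-normative sketch of such a check; the helper name is illustrative.

    def is_positive_integer_power(value, base):
        """True if value == base**n for some positive integer n (base > 1 assumed)."""
        if base <= 1 or value < base:
            return False
        while value % base == 0:
            value //= base
        return value == 1

    # With a signaled patch dimension of 8 and overlapSize 3, an input of 58 gives
    # 58 + 6 = 64 = 8**2, which satisfies the power-based constraint.
    print(is_positive_integer_power(58 + 2 * 3, 8))  # True
    print(is_positive_integer_power(10 + 2 * 3, 8))  # False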
It should be noted that in JVET-AB2006, when nnpfc_constant_patch_size_flag is equal to 0, nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1 do not have any semantics. It should also be noted that in the above example, the syntax elements nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1 function slightly differently according to the value of nnpfc_constant_patch_size_flag. For example, only when nnpfc_constant_patch_size_flag is equal to 1 is the value of INPPATCHWIDTH set equal to nnpfc_patch_width_minus1 + 1 and the value of INPPATCHHEIGHT set equal to nnpfc_patch_height_minus1 + 1. Conversely, when nnpfc_constant_patch_size_flag is equal to 0, the syntax elements nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1, together with the overlap size (overlapSize), indicate horizontal and vertical values of which the post-filter takes a positive integer multiple as input. However, when nnpfc_constant_patch_size_flag is equal to 0, nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1 do not directly define the values of INPPATCHWIDTH and INPPATCHHEIGHT. Having the syntax elements nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1 behave slightly differently depending on the value of nnpfc_constant_patch_size_flag may not be ideal.
In one example, according to the techniques herein, the syntax elements nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1 may be signaled only when nnpfc_constant_patch_size_flag is equal to 1, and two different syntax elements (e.g., nnpfc_postfilter_factor_width_minus1 and nnpfc_postfilter_factor_height_minus1) may be signaled when nnpfc_constant_patch_size_flag is equal to 0. Table 14 shows the relevant syntax of an exemplary neural network post-filter characteristics SEI message according to the techniques herein, wherein the different syntax elements are signaled.
TABLE 14
For table 14, the semantics may be based on the semantics provided above and the following semantics:
nnpfc_constant_patch_size_flag equal to 1 indicates that the post-processing filter accepts exactly the block size indicated by nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1 as input. nnpfc_constant_patch_size_flag equal to 0 indicates that the post-processing filter accepts as input any size that is a positive integer multiple of the sizes indicated by nnpfc_postfilter_factor_width_minus1, nnpfc_postfilter_factor_height_minus1, and nnpfc_overlap.
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_width_minus1 + 1 specifies the horizontal sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_width_minus1 should be in the range of 0 to Min(32766, Max(0, CroppedWidth - 1 - 2 * overlapSize)), inclusive.
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_height_minus1 + 1 specifies the vertical sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_height_minus1 should be in the range of 0 to Min(32766, Max(0, CroppedHeight - 1 - 2 * overlapSize)), inclusive.
When nnpfc_constant_patch_size_flag is equal to 0, nnpfc_postfilter_factor_width_minus1 + 1 indicates a horizontal sample count requirement for the input of the post-processing filter, as described by the constraint below. The value of nnpfc_postfilter_factor_width_minus1 should be in the range of 0 to Min(32766, CroppedWidth - 1), inclusive.
When nnpfc_constant_patch_size_flag is equal to 0, nnpfc_postfilter_factor_height_minus1 + 1 indicates a vertical sample count requirement for the input of the post-processing filter, as described by the constraint below. The value of nnpfc_postfilter_factor_height_minus1 should be in the range of 0 to Min(32766, CroppedHeight - 1), inclusive.
nnpfc_overlap specifies the overlapping horizontal and vertical sample counts of adjacent input tensors of the post-processing filter. The value of nnpfc_overlap should be in the range of 0 to Min(16383, Min((CroppedWidth - 1) / 2, (CroppedHeight - 1) / 2)), inclusive.
Let variables INPPATCHWIDTH and INPPATCHHEIGHT be the block size width and block size height, respectively.
If nnpfc_constant_patch_size_flag is equal to 0, the following applies:
- The values of INPPATCHWIDTH and INPPATCHHEIGHT are provided by external means not specified in this document, or are set by the post-processor itself.
- The value of INPPATCHWIDTH + 2 * overlapSize should be a positive integer multiple of nnpfc_postfilter_factor_width_minus1 + 1, and INPPATCHWIDTH should be less than or equal to CroppedWidth. The value of INPPATCHHEIGHT + 2 * overlapSize should be a positive integer multiple of nnpfc_postfilter_factor_height_minus1 + 1, and INPPATCHHEIGHT should be less than or equal to CroppedHeight.
Otherwise (nnpfc_constant_patch_size_flag equal to 1), the value of INPPATCHWIDTH is set equal to nnpfc_patch_width_minus1 + 1, and the value of INPPATCHHEIGHT is set equal to nnpfc_patch_height_minus1 + 1.
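The signaling split of Table 14 can be mirrored in a small parsing sketch: depending on nnpfc_constant_patch_size_flag, either the exact patch dimensions or the factor-style requirements are read. The Python function below is a non-normative sketch; read_ue is an assumed callable that returns the next ue(v)-coded syntax element value, and the returned tuple format is illustrative.

    def read_patch_size_info(read_ue, constant_patch_size_flag):
        """Sketch of the Table 14 signaling split."""
        if constant_patch_size_flag:
            # Exact patch size required as input to the post-processing filter.
            patch_width = read_ue() + 1    # nnpfc_patch_width_minus1 + 1
            patch_height = read_ue() + 1   # nnpfc_patch_height_minus1 + 1
            return ("exact", patch_width, patch_height)
        # Factor-style requirement: INPPATCHWIDTH + 2*overlapSize must be a positive integer
        # multiple of nnpfc_postfilter_factor_width_minus1 + 1 (and likewise for the height).
        factor_width = read_ue() + 1       # nnpfc_postfilter_factor_width_minus1 + 1
        factor_height = read_ue() + 1      # nnpfc_postfilter_factor_height_minus1 + 1
        return ("factor", factor_width, factor_height)

    # Example usage with a pre-parsed list of ue(v) values:
    values = iter([7, 7])  # nnpfc_patch_width_minus1 = 7, nnpfc_patch_height_minus1 = 7
    print(read_patch_size_info(lambda: next(values), constant_patch_size_flag=1))
    # ('exact', 8, 8)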
In one example, nnpfc_postfilter_factor_width_minus1 may alternatively be signaled as nnpfc_postfilter_factor_width, and nnpfc_postfilter_factor_height_minus1 may alternatively be signaled as nnpfc_postfilter_factor_height, as follows:
When nnpfc_constant_patch_size_flag is equal to 0, nnpfc_postfilter_factor_width indicates a horizontal sample count requirement for the input of the post-processing filter, as described by the constraint below. The value of nnpfc_postfilter_factor_width should be in the range of 1 to Min(32766, CroppedWidth), inclusive. The value 0 is reserved. In one example, a value of 0 means that there is no limit on the width of the input of the post-processing filter.
When nnpfc_constant_patch_size_flag is equal to 0, nnpfc_postfilter_factor_height indicates a vertical sample count requirement for the input of the post-processing filter, as described by the constraint below. The value of nnpfc_postfilter_factor_height should be in the range of 1 to Min(32766, CroppedHeight), inclusive. The value 0 is reserved. In one example, a value of 0 means that there is no limit on the height of the input of the post-processing filter.
In one example, one or more of the following semantics may be used:
nnpfc_constant_patch_size_flag equal to 1 indicates that the post-processing filter accepts exactly the block size indicated by nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1 as input. nnpfc_constant_patch_size_flag equal to 0 indicates that the post-processing filter accepts as input any input size that is a positive integer multiple of the input sizes indicated by nnpfc_postfilter_factor_width_minus1, nnpfc_postfilter_factor_height_minus1, and nnpfc_overlap.
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_width_minus1 + 1 specifies the horizontal sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_width_minus1 should be in the range of 0 to Min(32766, CroppedWidth - 1 - 2 * overlapSize), inclusive.
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_width_minus1 + 1 specifies the horizontal sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_width_minus1 should be in the range of 0 to Min(32766, CroppedWidth - 1), inclusive.
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_height_minus1 + 1 specifies the vertical sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_height_minus1 should be in the range of 0 to Min(32766, CroppedHeight - 1 - 2 * overlapSize), inclusive.
When nnpfc_constant_patch_size_flag is equal to 1, nnpfc_patch_height_minus1 + 1 specifies the vertical sample count of the block size required as input to the post-processing filter. The value of nnpfc_patch_height_minus1 should be in the range of 0 to Min(32766, CroppedHeight - 1), inclusive.
When nnpfc_constant_patch_size_flag is equal to 0, nnpfc_postfilter_factor_width_minus1 + 1 indicates a horizontal sample count requirement for the input of the post-processing filter, as described by the constraint below. The value of nnpfc_postfilter_factor_width_minus1 + 1 should be in the range of 0 to Min(32766, CroppedWidth), inclusive.
When nnpfc_constant_patch_size_flag is equal to 0, nnpfc_postfilter_factor_height_minus1 + 1 indicates a vertical sample count requirement for the input of the post-processing filter, as described by the constraint below. The value of nnpfc_postfilter_factor_height_minus1 + 1 should be in the range of 0 to Min(32766, CroppedHeight), inclusive.
nnpfc_overlap specifies the overlapping horizontal and vertical sample counts of adjacent input tensors of the post-processing filter. The value of nnpfc_overlap should be in the range of 0 to Min(16383, Min((CroppedWidth - 1) / 2, (CroppedHeight - 1) / 2)), inclusive.
In another example, instead of the flag nnpfc_constant_patch_size_flag, some other flag may be used and/or signaled. This flag may control the presence of nnpfc_patch_width_minus1 and nnpfc_patch_height_minus1 or the presence of nnpfc_postfilter_factor_width_minus1 and nnpfc_postfilter_factor_height_minus1. In another example, the syntax element nnpfc_postfilter_factor_width_minus1 may alternatively be referred to as nnpfc_postfilter_input_width_info_minus1 or nnpfc_postfilter_input_width_minus1 or nnpfc_postfilter_input_factor_width_minus1 or some other name. In another example, the syntax element nnpfc_postfilter_factor_height_minus1 may alternatively be referred to as nnpfc_postfilter_input_height_info_minus1 or nnpfc_postfilter_input_height_minus1 or nnpfc_postfilter_input_factor_height_minus1 or some other name.
In this manner, video encoder 500 represents an example of a device configured to: transmitting a post-filter characteristic message signaling the neural network; a first syntax element is signaled in the neural network post-filter characteristic message, the first syntax element indicating whether the post-processing filter accepts as input a block size having a width equal to a horizontal sample count indicated by the second syntax element and a height equal to a vertical sample count indicated by the third syntax element, or a block size having a width equal to a multiple of a horizontal sample count indicated by the fourth syntax element and a height equal to a multiple of a vertical sample count indicated by the fifth syntax element.
Referring again to fig. 1, interface 108 may comprise any device configured to receive data generated by data packager 107 and to transmit and/or store the data to a communication medium. Interface 108 may include a network interface card such as an Ethernet card, and may include an optical transceiver, a radio frequency transceiver, or any other type of device that may transmit and/or receive information. Further, interface 108 may include a computer system interface that may enable files to be stored on a storage device. For example, interface 108 may include a chipset supporting Peripheral Component Interconnect (PCI) and peripheral component interconnect express (PCIe) bus protocols, a proprietary bus protocol, a Universal Serial Bus (USB) protocol, I2C, or any other logical and physical structure that may be used to interconnect peer devices.
Referring again to fig. 1, target device 120 includes an interface 122, a data decapsulator 123, a video decoder 124, and a display 126. Interface 122 may include any device configured to receive data from a communication medium. Interface 122 may include a network interface card such as an Ethernet card, and may include an optical transceiver, a radio frequency transceiver, or any other type of device that may receive and/or transmit information. In addition, interface 122 may include a computer system interface that allows for retrieving compatible video bitstreams from a storage device. For example, interface 122 may include a chipset supporting PCI and PCIe bus protocols, a proprietary bus protocol, a USB protocol, I2C, or any other logical and physical structure that may be used to interconnect peer devices. The data decapsulator 123 may be configured to receive and parse any of the example syntax structures described herein.
Video decoder 124 may include any device configured to receive a bitstream (e.g., sub-bitstream extraction) and/or acceptable variations thereof and to reproduce video data therefrom. Display 126 may include any device configured to display video data. The display 126 may include one of a variety of display devices such as a Liquid Crystal Display (LCD), a plasma display, an Organic Light Emitting Diode (OLED) display, or another type of display. The display 126 may include a high definition display or an ultra high definition display. It should be noted that although in the example shown in fig. 1, video decoder 124 is described as outputting data to display 126, video decoder 124 may be configured to output video data to various types of devices and/or subcomponents thereof. For example, video decoder 124 may be configured to output video data to any communication medium, as described herein.
Fig. 6 is a block diagram illustrating an example of a video decoder (e.g., a decoding process for reference picture list construction described above) that may be configured to decode video data in accordance with one or more techniques of the present disclosure. In one example, the video decoder 600 may be configured to decode the transform data and reconstruct residual data from the transform coefficients based on the decoded transform data. The video decoder 600 may be configured to perform intra prediction decoding and inter prediction decoding, and thus may be referred to as a hybrid decoder. The video decoder 600 may be configured to parse any combination of the syntax elements described above in tables 1-14. The video decoder 600 may decode video based on or in accordance with the above-described procedure and also based on the parsed values in tables 1 to 14.
In the example shown in fig. 6, the video decoder 600 includes an entropy decoding unit 602, an inverse quantization unit 604, an inverse transform coefficient processing unit 606, an intra prediction processing unit 608, an inter prediction processing unit 610, a summer 612, a post-filter unit 614, and a reference buffer 616. The video decoder 600 may be configured to decode video data in a manner consistent with a video encoding system. It should be noted that although the exemplary video decoder 600 is shown with different functional blocks, such illustration is intended for descriptive purposes and not to limit the video decoder 600 and/or its subcomponents to a particular hardware or software architecture. The functions of video decoder 600 may be implemented using any combination of hardware, firmware, and/or software implementations.
As shown in fig. 6, the entropy decoding unit 602 receives an entropy-encoded bitstream. The entropy decoding unit 602 may be configured to decode syntax elements and quantized coefficients from the bitstream according to a process that is reciprocal to the entropy encoding process. The entropy decoding unit 602 may be configured to perform entropy decoding according to any of the entropy encoding techniques described above. The entropy decoding unit 602 may determine values of syntax elements in the encoded bitstream in a manner consistent with the video encoding standard. As shown in fig. 6, the entropy decoding unit 602 may determine quantization parameters, quantization coefficient values, transform data, and prediction data from a bitstream. In the example shown in fig. 6, the inverse quantization unit 604 and the inverse transform coefficient processing unit 606 receive quantized coefficient values from the entropy decoding unit 602, and output reconstructed residual data.
Referring again to fig. 6, the reconstructed residual data may be provided to a summer 612. Summer 612 may add the reconstructed residual data to the prediction video block and generate reconstructed video data. The prediction video block may be determined according to a prediction video technique (i.e., intra-prediction and inter-prediction). The intra prediction processing unit 608 may be configured to receive the intra prediction syntax element and retrieve the predicted video block from the reference buffer 616. The reference buffer 616 may include a memory device configured to store one or more frames of video data. The intra prediction syntax element may identify an intra prediction mode, such as the intra prediction mode described above. The inter prediction processing unit 610 may receive the inter prediction syntax element and generate a motion vector to identify a prediction block in one or more reference frames stored in the reference buffer 616. The inter prediction processing unit 610 may generate a motion compensation block, possibly performing interpolation based on an interpolation filter. An identifier of an interpolation filter for motion estimation with sub-pixel precision may be included in the syntax element. The inter prediction processing unit 610 may calculate interpolation values of sub-integer pixels of the reference block using interpolation filters. Post-filter unit 614 may be configured to perform filtering on the reconstructed video data. For example, post-filter unit 614 may be configured to perform deblocking and/or Sample Adaptive Offset (SAO) filtering, e.g., based on parameters specified in the bitstream. Further, it should be noted that in some examples, post-filter unit 614 may be configured to perform dedicated arbitrary filtering (e.g., visual enhancement such as mosquito noise cancellation). As shown in fig. 6, the video decoder 600 may output the reconstructed video block. In this manner, video decoder 600 represents an example of a device configured to: receiving a neural network post-filter characteristic message; a first syntax element is parsed from the neural network post-filter characteristic message, the first syntax element indicating whether the post-processing filter accepts as input a block size having a width equal to a horizontal sample count indicated by the second syntax element and a height equal to a vertical sample count indicated by the third syntax element, or a block size having a width equal to a multiple of a horizontal sample count indicated by the fourth syntax element and a height equal to a multiple of a vertical sample count indicated by the fifth syntax element.
In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium, and executed by a hardware-based processing unit. The computer-readable medium may comprise a computer-readable storage medium corresponding to a tangible medium, such as a data storage medium, or a communication medium comprising any medium that facilitates the transfer of a computer program from one place to another, for example, according to a communication protocol. As such, the computer-readable medium may generally correspond to: (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementing the techniques described in this disclosure. A computer program product may include a computer-readable medium.
By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if the instructions are transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. However, it should be understood that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transitory media, but are instead directed to non-transitory tangible storage media. Disk and disc, as used herein, include Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The instructions may be executed by one or more processors, such as one or more Digital Signal Processors (DSPs), general purpose microprocessors, Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs), or other equivalent integrated or discrete logic circuitry. Thus, the term "processor" as used herein may refer to any of the foregoing structures or to any other structure suitable for implementation of the techniques described herein. Furthermore, in some aspects, the functionality described herein may be provided within dedicated hardware and/or software modules configured for encoding and decoding, or incorporated into a combined codec. Moreover, these techniques may be implemented entirely in one or more circuits or logic elements.
The techniques of this disclosure may be implemented in various devices or apparatuses, including a wireless handset, an Integrated Circuit (IC), or a set of ICs (e.g., a chipset). Various components, modules, or units are described in this disclosure to emphasize functional aspects of devices configured to perform the disclosed techniques, but they do not necessarily require realization by different hardware units. Rather, as described above, the various units may be combined in a codec hardware unit or provided by a collection of interoperating hardware units, including one or more processors as described above, in conjunction with suitable software and/or firmware.
Further, each functional block or various features of the base station apparatus and the terminal apparatus used in each of the above-described embodiments may be implemented or executed by circuitry, typically one integrated circuit or a plurality of integrated circuits. Circuitry designed to execute the functions described in this specification may include general purpose processors, Digital Signal Processors (DSPs), application specific or general purpose integrated circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other programmable logic devices, discrete gate or transistor logic, discrete hardware components, or combinations thereof. A general purpose processor may be a microprocessor or, in the alternative, a conventional processor, controller, microcontroller, or state machine. The general purpose processor or each of the circuits described above may be configured by digital circuitry or may be configured by analog circuitry. In addition, if a technology for manufacturing integrated circuits that replaces current integrated circuits emerges as a result of advances in semiconductor technology, integrated circuits produced by that technology may also be used.
Various examples have been described. These and other examples are within the scope of the following claims.

Claims (5)

1. A method of performing neural network filtering on video data, the method comprising:
Receiving a neural network post-filter characteristic message; and
Parsing a first syntax element from the neural network post-filter characteristic message, the first syntax element indicating whether a post-processing filter accepts as input (i) a block size having a width equal to a horizontal sample count indicated by a second syntax element and a height equal to a vertical sample count indicated by a third syntax element, or (ii) a block size having a width equal to a multiple of a horizontal sample count indicated by a fourth syntax element and a height equal to a multiple of a vertical sample count indicated by a fifth syntax element.
2. An apparatus comprising one or more processors configured to:
Receiving a neural network post-filter characteristic message; and
Parsing a first syntax element from the neural network post-filter characteristic message, the first syntax element indicating whether a post-processing filter accepts as input (i) a block size having a width equal to a horizontal sample count indicated by a second syntax element and a height equal to a vertical sample count indicated by a third syntax element, or (ii) a block size having a width equal to a multiple of a horizontal sample count indicated by a fourth syntax element and a height equal to a multiple of a vertical sample count indicated by a fifth syntax element.
3. The apparatus of claim 2, wherein the apparatus comprises a video decoder.
4. An apparatus comprising one or more processors configured to:
Signaling a neural network post-filter characteristic message; and
Signaling a first syntax element in the neural network post-filter characteristic message, the first syntax element indicating whether a post-processing filter accepts as input (i) a block size having a width equal to a horizontal sample count indicated by a second syntax element and a height equal to a vertical sample count indicated by a third syntax element, or (ii) a block size having a width equal to a multiple of a horizontal sample count indicated by a fourth syntax element and a height equal to a multiple of a vertical sample count indicated by a fifth syntax element.
5. The apparatus of claim 4, wherein the apparatus comprises a video encoder.
CN202410022693.8A 2023-01-12 2024-01-06 Method and apparatus for performing neural network filtering on video data Pending CN118338022A (en)

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US63/438779 2023-01-12
US18/113345 2023-02-23
US18/113,345 US12262060B2 (en) 2023-02-23 Systems and methods for signaling neural network post-filter patch size information in video coding

Publications (1)

Publication Number Publication Date
CN118338022A (en) 2024-07-12

Family

ID=91766839

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410022693.8A Pending CN118338022A (en) 2023-01-12 2024-01-06 Method and apparatus for performing neural network filtering on video data

Country Status (1)

Country Link
CN (1) CN118338022A (en)

Similar Documents

Publication Publication Date Title
CN114600462A (en) System and method for signaling picture information in video coding
US12088851B2 (en) Systems and methods for signaling neural network post-filter information updates in video coding
US12262060B2 (en) Systems and methods for signaling neural network post-filter patch size information in video coding
US20240221231A1 (en) Systems and methods for signaling neural network post-filter overlap and sublayer frame rate upsampling information in video coding
US20240244268A1 (en) Systems and methods for signaling neural network post-filter patch size information in video coding
US12137253B2 (en) Systems and methods for signaling neural network post-filter purpose information in video coding
US12167045B2 (en) Systems and methods for signaling neural network post-filter characteristics information in video coding
US20240364936A1 (en) Systems and methods for removing overlap for a neural network post-filter in video coding
US20240298036A1 (en) Systems and methods for signaling neural network post-filter resolution information in video coding
US20240372997A1 (en) Systems and methods for signaling neural network post-filter filter strength control information in video coding
US20240244196A1 (en) Systems and methods for signaling neural network post-filter tensor information in video coding
US20240340414A1 (en) Systems and methods for signaling neural network post-filter description information in video coding
US20240333922A1 (en) Systems and methods for signaling neural network post-filter filter strength control information in video coding
US20240348831A1 (en) Systems and methods for signaling neural network post-filter tensor purpose and order information in video coding
EP4492781A1 (en) Systems and methods for signaling neural network post-filter application specific purpose information in video coding
US20240422359A1 (en) Systems and methods for signaling neural network post-filter chroma location information in video coding
CN118338022A (en) Method and apparatus for performing neural network filtering on video data
CN118283280A (en) Method and apparatus for performing neural network filtering on video data
CN118338020A (en) Apparatus and method for performing neural network filtering on video data
CN119547443A (en) System and method for signaling neural network post-filter parameter information in video coding
CN117651132A (en) Method and apparatus for signaling post-loop filter information for neural networks
CN117880530A (en) Method and apparatus for performing neural network filtering on video data
CN119211580A (en) System and method for signaling neural network post-filter deinterlacing information in video encoding
CN119404510A (en) System and method for signaling neural network post-filter parameter information in video coding
CN119325713A (en) System and method for signaling reference picture list entry information in video coding

Legal Events

Date Code Title Description
PB01 Publication