GB2630154A - Intermediate video decoding
- Publication number
- GB2630154A (application GB2315482.6A / GB202315482A)
- Authority
- GB
- United Kingdom
- Prior art keywords
- enhancement
- residuals
- pictures
- preliminary
- planes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/44—Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/30—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability
- H04N19/33—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using hierarchical techniques, e.g. scalability in the spatial domain
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/50—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding
- H04N19/59—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using predictive coding involving spatial sub-sampling or interpolation, e.g. alteration of picture size or resolution
Abstract
There is provided a method of decoding an encoded bitstream into one or more enhancement residuals planes suitable to be added to a set of preliminary pictures obtained from a decoder reconstructed video. The method comprises retrieving enhancement data, 303, from the encoded bitstream, 430, and downscaling the enhancement data by a first scaling factor (using down sampler 410) to produce downscaled one or more enhancement residuals planes suitable to be added to a set of preliminary pictures obtained from a decoder reconstructed video. The enhancement data may consist of one or more enhancement residual planes with the downscaling process producing one or more downscaled enhancement residual planes. The bitstream may conform to the Low Complexity Enhancement Video Coding (LCEVC) standard.
Description
INTERMEDIATE VIDEO DECODING
Technical Field
The invention relates to video decoding. In particular, the invention relates to creating an intermediate picture at a decoder. In particular, but not exclusively, the invention relates to obtaining an intermediate picture at a decoder from a bitstream compliant with the MPEG-5 Part 2 LCEVC standard. The invention is implementable in hardware or software.
Background
For video broadcasters and streamers, the delivery of high-quality video experiences to end users is very important. However, broadcasters and streamers face significant challenges in managing limitations such as bandwidth constraints and varied decoding capabilities across different distribution channels and end user devices. This problem is becoming more pronounced with technical innovations in the field of video coding. Hierarchical coding schemes, including that introduced by the LCEVC standard, have mitigated the need to simulcast multiple streams with diverse characteristics to suit different bandwidths and decoding capabilities. Yet, there is a necessity to further improve these hierarchical coding schemes.
A very useful aspect that needs addressing is how to provide more flexibility to video distribution schemes using a single encoded bitstream, especially where a plurality of decoder deployments are in use or where a certain level of quality is mandated for a base layer. Therefore, this disclosure aims to provide improved flexibility of picture output from a single encoded bitstream.
Summary
According to a first aspect of the invention, there is provided a method of decoding an encoded bitstream into one or more enhancement residuals planes suitable to be added to a set of preliminary pictures obtained from a decoder reconstructed video. The method comprises retrieving enhancement data from the encoded bitstream and downscaling the enhancement data by a first scaling factor to produce downscaled one or more enhancement residuals planes suitable to be added to a set of preliminary pictures obtained from a decoder reconstructed video.
In this way, there is no need to upscale the decoder reconstructed video to an extent so that the enhancement data is usable therewith. For example, there is typically a resolution change of the first scaling factor (e.g. typically x2 in one or more of the vertical and/or horizontal directions) in certain scalable video coding technologies and the enhancement data is provided with that scaling assumption at the decoder side. By downscaling the enhancement data in certain decoder deployments, a reduced memory resource requirement (size and/or memory bandwidth) is achievable when producing enhanced pictures, although the pictures are not enhanced to the full enhancement data potential.
This technique allows improved picture output, over and above the decoder reconstructed video output, for deployments whose screen output is below the capability that the enhancement data provides (for example, a lower output resolution or usable bit depth).
By combining a set of preliminary pictures and enhancement data at a lower scale than expected using the first scaling factor (e.g. a 1.5 scale), there may follow a reduced memory resource requirement (size and memory bandwidth) when handling the set of preliminary pictures and the enhancement data.
There is no need to downscale from a higher quality set of output pictures to match deployment hardware (e.g. to produce enhanced pictures at a top level of resolution or bit depth or both, only to downscale the full output pictures to match the deployment capability). In turn, this allows best quality pictures, or at least improved quality pictures, to be realised with a reduced memory resource requirement (size and memory bandwidth). The claimed technique downscales the residual data and not a fully decoded picture. An advantage is that the residual information is sparser and easier to handle in memory, allowing certain deployments to increase the picture quality appropriately.
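To make the decoding method concrete, a minimal sketch in Python/numpy follows. It assumes a nearest-neighbour resampler in place of the normative filters, and that the residuals are carried at twice the base resolution (as in the 2x case mentioned above); the names resize_nearest and intermediate_decode, the parameter values, and the reading of the first scaling factor as the target scale relative to the base are illustrative assumptions, not the claimed implementation.

```python
import numpy as np

def resize_nearest(plane: np.ndarray, scale: float) -> np.ndarray:
    # Nearest-neighbour rescale of a 2-D plane; an illustrative stand-in
    # for the normative up/downsampling filters.
    h, w = plane.shape
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    ys = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    xs = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    return plane[ys][:, xs]

def intermediate_decode(base_rec, residual_plane, first_scale=1.5, native_scale=2.0):
    # Downscale the enhancement residuals so they sit at `first_scale` times
    # the base resolution instead of their native `native_scale` times it
    # (e.g. 1440p residuals brought down to 1080p).
    residuals_down = resize_nearest(residual_plane, first_scale / native_scale)
    # Upscale the base reconstruction to create the preliminary pictures
    # (e.g. 720p base up to 1080p); here the second scaling factor is taken
    # equal to the first, per the description.
    preliminary = resize_nearest(base_rec, first_scale)
    # Add the downscaled residual plane to the preliminary pictures.
    return preliminary + residuals_down
```

Because only the sparse residual plane and the 1.5x preliminary picture are buffered, the full 2x reconstruction never needs to exist in memory, which is the saving described above.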
The enhancement data may be one of: one or more enhancement residuals planes, or a set of transformed elements representing the one or more enhancement residuals planes.
In a typical deployment, the enhancement data is one or more enhancement residuals planes and the downscaling the enhancement data produces corresponding one or more downscaled enhancement residuals planes.
In this way, it is easier to implement the enhancement downscaling as a post-processing operation.
In another typical deployment, the enhancement data is a set of transformed elements representing the one or more enhancement residuals planes and the downscaling the enhancement data produces a corresponding downscaled set of transformed elements transformable into the one or more enhancement residuals planes.
In this way, there is a reduced memory and/or memory bandwidth required to operate on transformed elements.
An inverse transform operation is typically applied to the downscaled set of transformed elements to produce one or more downscaled enhancement residuals planes. The method optionally comprises dequantizing the enhancement data prior to downscaling the enhancement data.
In one example, the set of transformed elements indicate the extent of spatial correlation between corresponding residual elements in the one or more enhancement residuals planes such that the set of transformed elements indicate at least one of average, horizontal, vertical and diagonal relationship between neighbouring residual elements in the set of enhancement residuals planes.
The method optionally includes obtaining an upscaled decoder reconstructed video using a second scaling factor to create the set of preliminary pictures, wherein the second scaling factor is such so that the downscaled enhancement data is suitable to be added to the set of preliminary pictures.
The method optionally includes upscaling the decoder reconstructed video using a second scaling factor to create the set of preliminary pictures, wherein the second scaling factor is such so that the downscaled enhancement data is suitable to be added to the set of preliminary pictures.
Typically the downscaling downscales one or more of: picture resolution, bit depth, frame rate, and quantisation parameter. In some deployments both picture resolution and bit depth may be downscaled. In one preferred embodiment, the downscaling downscales picture resolution.
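As one hedged illustration of bit-depth downscaling on unsigned sample planes, a rounded right shift may be used; the function name and the 10-to-8-bit defaults are assumptions, and signed residual planes would need symmetric treatment.

```python
import numpy as np

def downscale_bit_depth(plane: np.ndarray, from_bits: int = 10, to_bits: int = 8) -> np.ndarray:
    # Reduce sample bit depth with rounding, e.g. 10-bit values (0..1023)
    # down to 8-bit values (0..255). Assumes from_bits > to_bits.
    shift = from_bits - to_bits
    rounded = (plane.astype(np.int32) + (1 << (shift - 1))) >> shift
    return np.clip(rounded, 0, (1 << to_bits) - 1)
```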
In one implementation, the one or more enhancement residuals planes is obtained from the encoded bitstream using a first decoding process that is different from a second decoding process used to obtain the decoder reconstructed video.
The encoded bitstream typically conforms with the MPEG-5 Part 2 LCEVC standard and/or ISO/IEC 23094-2.
In one implementation, the encoded bitstream indicates the first scaling factor, the second scaling factor, or both.
There is also provided a decoder configured to perform the method of any preceding claim.
There is also provided a computer readable storage medium comprising instructions which when executed by a processor causes the processor to perform the method of any preceding method claim.
There is also provided a video distribution system comprising a video server configured to serve an encoded bitstream comprising one or more enhancement residuals planes suitable to be added to a set of preliminary pictures obtained from a decoder reconstructed video. The system further comprises a first decoding deployment, for example a video decoding hardware with associated connected display, configured to receive the encoded bitstream, to obtain a decoder reconstructed video from the encoded bitstream, to create a first set of preliminary pictures from the decoder reconstructed video, to obtain the one or more enhancement residuals planes, and to add the one or more enhancement residuals planes to the first set of preliminary pictures to output video pictures at a first level of quality, for example at a first picture resolution compatible with the connected display. The system further comprises a second decoding deployment configured to receive the encoded bitstream, to obtain a decoder reconstructed video from the encoded bitstream, to create a second set of preliminary pictures from the decoder reconstructed video at a level of quality lower than the first set of preliminary pictures, to retrieve enhancement data from the encoded bitstream, to downscale the enhancement data by a first scaling factor to produce downscaled one or more enhancement residuals planes suitable to be added to the second set of preliminary pictures, to add the downscaled one or more enhancement residuals planes to the second set of preliminary pictures, and to output video pictures at a second level of quality lower than the first level of quality, for example at a second picture resolution compatible with the connected display.
In one typical example, at the second decoding deployment, the enhancement data is one or more enhancement residuals planes and the downscaling the enhancement data produces corresponding one or more downscaled enhancement residuals planes.
In another typical example, at the second decoding deployment, the enhancement data is a set of transformed elements representing the one or more enhancement residuals planes and the downscaling the enhancement data produces a corresponding downscaled set of transformed elements transformable into the one or more enhancement residuals planes.
The system optionally comprises, at the second decoding deployment, obtaining an upscaled decoder reconstructed video using a second scaling factor to create the second set of preliminary pictures, wherein the second scaling factor is such so that the downscaled enhancement data is suitable to be added to the set of preliminary pictures.
The system optionally comprises, at the second decoding deployment, upscaling the decoder reconstructed video using a second scaling factor to create the second set of preliminary pictures, wherein the second scaling factor is such so that the downscaled enhancement data is suitable to be added to the second set of preliminary pictures.
The system preferably uses an encoded bitstream which conforms with the MPEG-5 Part 2 LCEVC standard.
There is also provided a method for producing a video signal from a bitstream at a decoder, wherein the bitstream comprises enhancement data suitable to be added to a set of preliminary pictures obtained from a decoder reconstructed video, the method comprising: upscaling the preliminary pictures to produce upscaled preliminary pictures; downscaling the enhancement data to produce downscaled enhancement layer data combinable with the upscaled preliminary pictures; and combining the upscaled preliminary pictures with the downscaled enhancement layer data to produce the video signal.
There is also provided a computer program comprising instructions which when executed by a processor causes the processor to perform any of the methods described above.
There is also provided a data carrier signal carrying the computer program above.

Brief Description of the Drawings

The invention shall now be described, by way of example only, with reference to the accompanying drawings in which:

FIG. 1 shows a high-level schematic of a hierarchical encoding and decoding process;

FIG. 2 shows a high-level schematic of an encoding process of a hierarchical coding technology;

FIG. 3 shows a high-level schematic of a decoding process suitable for decoding the output of FIG. 2;

FIG. 4 shows a high-level schematic of a modified decoding process of the hierarchical coding technology shown in FIG. 3 according to a first variation;

FIG. 5 shows a high-level schematic of a modified decoding process of the hierarchical coding technology shown in FIG. 3 according to a second variation;

FIG. 6 shows a high-level schematic of a modified decoding process of the hierarchical coding technology shown in FIG. 3 according to a third variation;

FIG. 7 shows a high-level schematic of a modified decoding process of the hierarchical coding technology shown in FIG. 3 according to a fourth variation; and

FIG. 8 shows a high-level flow chart of a modified decoding process.
Detailed Description
FIG. 1 shows a high-level schematic of a hierarchical encoding and decoding process. Data 101 to be encoded is retrieved by a hierarchical encoder 102 which outputs encoded data 103. Subsequently, the encoded data 103 is received by a hierarchical decoder 104 which decodes the data and outputs decoded data 105.
Typically, the hierarchical coding schemes used in examples herein create a base or core level, which is a representation of the original data at a lower level of quality and one or more levels of residuals which can be used to recreate the original data at a higher level of quality using a decoded version of the base level data. In general, the term "residuals" as used herein refers to a difference between a value of a reference array or reference frame and an actual array or frame of data. The array may be a one or two-dimensional array that represents a coding unit. For example, a coding unit may be a 2x2 or 4x4 set of residual values that correspond to similar sized areas of an input video frame.
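As a toy illustration of this definition (the pixel values are arbitrary), a 2x2 coding unit of residuals is simply the element-wise difference between the actual frame area and the reference:

```python
import numpy as np

reference = np.array([[100, 102], [101, 103]], dtype=np.int16)  # reference frame area
actual    = np.array([[101, 102], [ 99, 104]], dtype=np.int16)  # actual frame area
residuals = actual - reference  # 2x2 coding unit: [[1, 0], [-2, 1]], small values near 0
```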
It should be noted that the generalised examples are agnostic as to the nature of the input signal. Reference to "residual data" as used herein refers to data derived from a set of residuals, e.g. a set of residuals themselves or an output of a set of data processing operations that are performed on the set of residuals. Throughout the present description, generally a set of residuals includes a plurality of residuals or residual elements, each residual or residual element corresponding to a signal element, that is, an element of the signal or original data.
In specific examples, the data may be an image or video. In these examples, the set of residuals corresponds to an image or frame of the video, with each residual being associated with a pixel of the signal, the pixel being the signal element.
The methods described herein may be applied to so-called planes of data that reflect different colour components of a video signal. For example, the methods may be applied to different planes of YUV or RGB data reflecting different colour channels. Different colour channels may be processed in parallel. The components of each stream may be collated in any logical order.
A further hierarchical coding technology with which the principles of the present invention may be utilised is illustrated in FIGS. 2 and 3. This technology is a flexible, adaptable, highly efficient and computationally inexpensive coding format which combines a different video coding format, a base codec (e.g., AVC, HEVC, or any other present or future codec), with at least two enhancement levels of coded data.
The general structure of the encoding scheme uses a down-sampled source signal encoded with a base codec, adds a first level of correction data to the decoded output of the base codec to generate a corrected picture, and then adds a further level of enhancement data to an up-sampled version of the corrected picture. Thus, the streams are considered to be a base stream and an enhancement stream, which may be further multiplexed or otherwise combined to generate an encoded data stream. References to encoded data as described herein may refer to the enhancement stream or a combination of the base stream and the enhancement stream. The base stream may be decoded by a hardware decoder while the enhancement stream may be suitable for software processing implementation with suitable power consumption. This general encoding structure creates a plurality of degrees of freedom that allow great flexibility and adaptability to many situations, thus making the coding format suitable for many use cases including OTT transmission, live streaming, live ultra-high-definition (UHD) broadcast, and so on. Although the decoded output of the base codec is not intended for viewing, it is a fully decoded video at a lower resolution, making the output compatible with existing decoders and, where considered suitable, also usable as a lower resolution output.
Returning to the initial process described above, where a base stream is provided along with two levels (or sub-levels) of enhancement within an enhancement stream, an example of a generalised encoding process is depicted in the block diagram of FIG. 2. An input video 200 at an initial resolution is processed to generate various encoded streams 201, 202, 203 to form bitstream 230. A first encoded stream 201 (encoded base stream) is produced by feeding a base codec (e.g., AVC, HEVC, or any other codec) with a down-sampled version of the input video 200. The encoded base stream may be referred to as the base layer or base level. A second encoded stream 202 (encoded level 1 stream) is produced by processing the residuals obtained by taking the difference between a reconstructed base codec video and the down-sampled version of the input video. A third encoded stream 203 (encoded level 2 stream) is produced by processing the residuals obtained by taking the difference between an up-sampled version of a corrected version of the reconstructed base codec video and the input video. In certain cases, the components of FIG. 2 may provide a general low complexity encoder. In certain cases, the enhancement streams may be generated by encoding processes that form part of the low complexity encoder and the low complexity encoder may be configured to control an independent base encoder and decoder (e.g., as packaged as a base codec). In other cases, the base encoder and decoder may be supplied as part of the low complexity encoder. In one case, the low complexity encoder of FIG. 2 may be seen as a form of wrapper for the base codec, where the functionality of the base codec may be hidden from an entity implementing the low complexity encoder.
A down-sampling operation illustrated by down-sampling component 205 may be applied to the input video to produce a down-sampled video to be encoded by a base encoder 213 of a base codec. The down-sampling can be done either in both vertical and horizontal directions, or alternatively only in the horizontal direction. The base encoder 213 and a base decoder 214 may be implemented by a base codec (e.g., as different functions of a common codec). The base codec, and/or one or more of the base encoder 213 and the base decoder 214 may comprise suitably configured electronic circuitry (e.g., a hardware encoder/decoder) and/or computer program code that is executed by a processor.
Each enhancement stream encoding process may not necessarily include an upsampling step. In FIG. 2 for example, the first enhancement stream is conceptually a correction stream while the second enhancement stream is upsampled to provide a level of enhancement.
Looking at the process of generating the enhancement streams in more detail, to generate the encoded level 1 stream, the encoded base stream is decoded by the base decoder 214 (i.e. a decoding operation is applied to the encoded base stream to generate a decoded base stream). Decoding may be performed by a decoding function or mode of a base codec. The difference between the decoded base stream and the down-sampled input video is then created at a level 1 comparator 210 (i.e. a subtraction operation is applied to the down-sampled input video and the decoded base stream to generate a first set of residuals). The output of the comparator 210 may be referred to as a first set of residuals, e.g. a surface or frame of residual data, where a residual value is determined for each picture element at the resolution of the base encoder 213, the base decoder 214 and the output of the down-sampling block 205.
The difference is then encoded by a first encoder 215 (i.e. a level 1 encoder) to generate the encoded level 1 stream 202 (i.e. an encoding operation is applied to the first set of residuals to generate a first enhancement stream).
As noted above, the enhancement stream may comprise a first level of enhancement 202 and a second level of enhancement 203. The first level of enhancement 202 may be considered to be a corrected stream, e.g. a stream that provides a level of correction to the base encoded/decoded video signal at a lower resolution than the input video 200. The second level of enhancement 203 may be considered to be a further level of enhancement that converts the corrected stream to the original input video 200, e.g. that applies a level of enhancement or correction to a signal that is reconstructed from the corrected stream.
In the example of FIG. 2, the second level of enhancement 203 is created by encoding a further set of residuals. The further set of residuals are generated by a level 2 comparator 219. The level 2 comparator 219 determines a difference between an upsampled version of a decoded level 1 stream, e.g. the output of an upsampling component 217, and the input video 200. The input to the up-sampling component 217 is generated by applying a first decoder (i.e. a level 1 decoder 218) to the output of the first encoder 215. This generates a decoded set of level 1 residuals. These are then combined with the output of the base decoder 214 at summation component 220. This effectively applies the level 1 residuals to the output of the base decoder 214. It allows for losses in the level 1 encoding and decoding process to be corrected by the level 2 residuals. The output of summation component 220 may be seen as a simulated signal that represents an output of applying level 1 processing to the encoded base stream 201 and the encoded level 1 stream 202 at a decoder.
As noted, an upsampled stream is compared to the input video which creates a further set of residuals (i.e. a difference operation is applied to the upsampled re-created stream to generate a further set of residuals). The further set of residuals are then encoded by a second encoder 221 (i.e. a level 2 encoder) as the encoded level 2 enhancement stream (i.e. an encoding operation is then applied to the further set of residuals to generate an encoded further enhancement stream).
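Gathering the steps above into one place, the following is a hedged sketch of the FIG. 2 pipeline. The codec objects with encode/decode methods are stand-ins for the base and enhancement encoders, and resize_nearest is the illustrative resampler from the earlier sketch, not the standard's filters.

```python
def low_complexity_encode(frame, base_codec, l1_codec, l2_codec, scale=2.0):
    down = resize_nearest(frame, 1.0 / scale)         # down-sampling component 205
    base_bits = base_codec.encode(down)               # base encoder 213 -> stream 201
    base_rec = base_codec.decode(base_bits)           # base decoder 214
    l1_residuals = down - base_rec                    # level 1 comparator 210
    l1_bits = l1_codec.encode(l1_residuals)           # first encoder 215 -> stream 202
    corrected = base_rec + l1_codec.decode(l1_bits)   # summation component 220
    up = resize_nearest(corrected, scale)             # up-sampling component 217
    l2_residuals = frame - up                         # level 2 comparator 219
    l2_bits = l2_codec.encode(l2_residuals)           # second encoder 221 -> stream 203
    return base_bits, l1_bits, l2_bits
```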
Thus, as illustrated in FIG. 2 and described above, the output of the encoding process is a base stream 201 and one or more enhancement streams 202, 203 which preferably comprise a first level of enhancement and a further level of enhancement. It should be noted that the components shown in FIG. 2 may operate on blocks or coding units of data, e.g. corresponding to 2x2 or 4x4 portions of a frame at a particular level of resolution. The components operate without any inter-block dependencies, hence they may be applied in parallel to multiple blocks or coding units within a frame. This differs from comparative video encoding schemes wherein there are dependencies between blocks (e.g., either spatial dependencies or temporal dependencies). The dependencies of comparative video encoding schemes limit the level of parallelism and require a much higher complexity.
A corresponding generalised decoding process is depicted in the block diagram of FIG. 3. FIG. 3 may be said to show a low complexity decoder that corresponds to the low complexity encoder of FIG. 2. The low complexity decoder receives the three streams generated by the low complexity encoder, together with headers 304 containing further decoding information, as part of a bitstream 330. The encoded base stream 301 is decoded by a base decoder 310 corresponding to the base codec used in the low complexity encoder. The encoded level 1 stream 302 is received by a first decoder 311 (i.e. a level 1 decoder), which decodes a first set of residuals as encoded by the first encoder 215 of FIG. 2. At a first summation component 312, the output of the base decoder 310 is combined with the decoded residuals obtained from the first decoder 311. The combined video, which may be said to be a level 1 reconstructed video signal, is upsampled by upsampling component 313. The encoded level 2 stream 303 is received by a second decoder 314 (i.e. a level 2 decoder). The second decoder 314 decodes a second set of residuals as encoded by the second encoder 221 of FIG. 2. Although the headers 304 are shown in FIG. 3 as being used by the second decoder 314, they may also be used by the first decoder 311 as well as the base decoder 310. The output of the second decoder 314 is a second set of decoded residuals. These may be at a higher resolution than the first set of residuals and the input to the upsampling component 313. At a second summation component 315, the second set of residuals from the second decoder 314 are combined with the output of the up-sampling component 313, i.e. an up-sampled reconstructed level 1 signal, to reconstruct decoded video 350.
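The decoding path can be sketched in the same hypothetical style, reusing the stand-in codec objects and the illustrative resize_nearest resampler from the earlier sketches:

```python
def low_complexity_decode(base_bits, l1_bits, l2_bits,
                          base_codec, l1_codec, l2_codec, scale=2.0):
    base_rec = base_codec.decode(base_bits)       # base decoder 310
    level1 = base_rec + l1_codec.decode(l1_bits)  # first decoder 311, summation 312
    up = resize_nearest(level1, scale)            # up-sampling component 313
    return up + l2_codec.decode(l2_bits)          # second decoder 314, summation 315
                                                  # -> decoded video 350
```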
As per the low complexity encoder, the low complexity decoder of FIG. 3 may operate in parallel on different blocks or coding units of a given frame of the video signal. Additionally, decoding by two or more of the base decoder 310, the first decoder 311 and the second decoder 314 may be performed in parallel. This is possible as there are no inter-block dependencies.
In the decoding process, the decoder may parse the headers 304 (which may contain global configuration information, picture or frame configuration information, and data block configuration information) and configure the low complexity decoder based on those headers. In order to re-create the input video, the low complexity decoder may decode each of the base stream, the first enhancement stream and the further or second enhancement stream. The frames of the stream may be synchronised and then combined to derive the decoded video 350. The decoded video 350 may be a lossy or lossless reconstruction of the original input video 200 depending on the configuration of the low complexity encoder and decoder. In many cases, the decoded video 350 may be a lossy reconstruction of the original input video 200 where the losses have a reduced or minimal effect on the perception of the decoded video 350.
In each of FIGS. 2 and 3, the level 2 and level 1 encoding operations may include the steps of transformation, quantization and entropy encoding (e.g., in that order). The encoding operations may also include residual ranking, weighting and filtering. Similarly, at the decoding stage, the residuals may be passed through an entropy decoder, a dequantizer and an inverse transform module (e.g., in that order). Any suitable encoding and corresponding decoding operation may be used. Preferably however, the level 2 and level 1 encoding steps may be performed in software (e.g., as executed by one or more central or graphical processing units in an encoding device).
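As an illustration of the quantization step, a uniform quantiser and its inverse might look as follows; the step size and the absence of a dead zone are simplifying assumptions for the sketch, not the standard's quantisation rules.

```python
import numpy as np

def quantize(coefficients: np.ndarray, step: int = 8) -> np.ndarray:
    # Map transformed coefficients onto discrete levels (lossy).
    return np.round(coefficients / step).astype(np.int32)

def dequantize(levels: np.ndarray, step: int = 8) -> np.ndarray:
    # Reconstruct approximate coefficient magnitudes at the decoder.
    return levels * step
```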
The transform as mentioned herein may use a directional decomposition transform such as a Hadamard-based transform. Both may comprise a small kernel or matrix that is applied to flattened coding units of residuals (i.e. 2x2 or 4x4 blocks of residuals). More details on the transform can be found for example in patent WO 2013/171173 A1 or WO 2018/046941 A1, which are incorporated herein by reference. The encoder may select between different transforms to be used, for example between a size of kernel to be applied.
The transform may transform the residual information to four surfaces. For example, the transform may produce the following components or transformed coefficients: average, vertical, horizontal and diagonal. A particular surface may comprise all the values for a particular component, e.g. a first surface may comprise all the average values, a second all the vertical values and so on. As alluded to earlier in this disclosure, these components that are output by the transform may be taken in such embodiments as the coefficients to be quantized in accordance with the described methods. A quantization scheme may be useful to map the residual signals into quanta, so that certain variables can assume only certain discrete magnitudes. Entropy encoding in this example may comprise run length encoding (RLE), followed by processing the run-length encoded output using a Huffman encoder. In certain cases, only one of these schemes may be used when entropy encoding is desirable.
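A sketch of the 2x2 directional decomposition with a 4x4 Hadamard matrix follows; the row-to-coefficient mapping shown is one common ordering and is not guaranteed to match the standard's exact definition.

```python
import numpy as np

# Rows of the Hadamard kernel map a flattened 2x2 residual block
# [r00, r01, r10, r11] onto average (A), horizontal (H), vertical (V)
# and diagonal (D) coefficients.
H4 = np.array([[1,  1,  1,  1],    # A
               [1, -1,  1, -1],    # H
               [1,  1, -1, -1],    # V
               [1, -1, -1,  1]])   # D

block = np.array([3, 1, 3, 1])     # residual coding unit with horizontal structure
coeffs = H4 @ block                # -> [8, 4, 0, 0]: only A and H are non-zero
restored = (H4 @ coeffs) // 4      # H4 is its own inverse up to a factor of 4
assert np.array_equal(restored, block)
```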
In summary, the methods and apparatuses herein are based on an overall approach which is built over an existing encoding and/or decoding algorithm (such as MPEG standards such as AVC/H.264, HEVC/H.265, etc. as well as non-standard algorithms such as VP9, AV1, and others) which works as a baseline for an enhancement layer which works according to a different encoding and/or decoding approach. The idea behind the overall approach of the examples is to hierarchically encode/decode the video frame as opposed to the block-based approaches used in the MPEG family of algorithms. Hierarchically encoding a frame includes generating residuals for the full frame, and then a decimated frame and so on.
As indicated above, the processes may be applied in parallel to coding units or blocks of a colour component of a frame as there are no inter-block dependencies. The encoding of each colour component within a set of colour components may also be performed in parallel (e.g., such that the operations are duplicated according to (number of frames) * (number of colour components) * (number of coding units per frame)). It should also be noted that different colour components may have a different number of coding units per frame, e.g. a luma (e.g., Y) component may be processed at a higher resolution than a set of chroma (e.g., U or V) components as human vision may detect lightness changes more than colour changes.
Thus, as illustrated and described above, the output of the decoding process is an (optional) base reconstruction, and an original signal reconstruction at a higher level. This example is particularly well-suited to creating encoded and decoded video at different frame resolutions. For example, the input video 200 may be an HD video signal comprising frames at 1920 x 1080 resolution. In certain cases, the base reconstruction and the level 2 reconstruction may both be used by a display device. For example, in cases of network traffic, the level 2 stream may be disrupted more than the level 1 and base streams (as it may contain up to 4x the amount of data, where down-sampling reduces the dimensionality in each direction by 2). In this case, when disruption occurs the display device may revert to displaying the base reconstruction while the level 2 stream is disrupted (e.g., while a level 2 reconstruction is unavailable), and then return to displaying the level 2 reconstruction when network conditions improve. A similar approach may be applied when a decoding device suffers from resource constraints, e.g. a set-top box performing a systems update may have an operational base decoder 310 to output the base reconstruction but may not have processing capacity to compute the level 2 reconstruction.
The encoding arrangement also enables video distributors to distribute video to a set of heterogeneous devices; those with just a base decoder 310 view the base reconstruction, whereas those with the enhancement level may view a higher-quality level 2 reconstruction. In comparative cases, two full video streams at separate resolutions were required to service both sets of devices. As the level 2 and level 1 enhancement streams encode residual data, the level 2 and level 1 enhancement streams may be more efficiently encoded, e.g. distributions of residual data typically have much of their mass around 0 (i.e. where there is no difference) and typically take on a small range of values about 0.
This may be particularly the case following quantization. In contrast, full video streams at different resolutions will have different distributions with a non-zero mean or median that require a higher bit rate for transmission to the decoder.
In the examples described herein residuals are encoded by an encoding pipeline. This may include transformation, quantization and entropy encoding operations. It may also include residual ranking, weighting and filtering. Residuals are then transmitted to a decoder, e.g. as L-1 and L-2 enhancement streams, which may be combined with a base stream as a hybrid stream (or transmitted separately). In one case, a bit rate is set for a hybrid data stream that comprises the base stream and both enhancements streams, and then different adaptive bit rates are applied to the individual streams based on the data being processed to meet the set bit rate (e.g., high-quality video that is perceived with low levels of artefacts may be constructed by adaptively assigning a bit rate to different individual streams, even at a frame by frame level, such that constrained data may be used by the most perceptually influential individual streams, which may change as the image data changes).
The sets of residuals as described herein may be seen as sparse data, e.g. in many cases there is no difference for a given pixel or area and the resultant residual value is zero. When looking at the distribution of residuals, much of the probability mass is allocated to small residual values located near zero, e.g. for certain videos values of -2, -1, 0, 1, 2, etc. occur the most frequently. In certain cases, the distribution of residual values is symmetric or near symmetric about 0. In certain test video cases, the distribution of residual values was found to take a shape similar to logarithmic or exponential distributions (e.g., symmetrically or near symmetrically) about 0. The exact distribution of residual values may depend on the content of the input video stream.
Residuals may be treated as a two-dimensional image in themselves, e.g. a delta image of differences. Seen in this manner, the sparsity of the data may be seen to relate to features like "dots", small "lines", "edges", "corners", etc. that are visible in the residual images. It has been found that these features are typically not fully correlated (e.g., in space and/or in time). They have characteristics that differ from the characteristics of the image data they are derived from (e.g., pixel characteristics of the original video signal).
As the characteristics of residuals differ from the characteristics of the image data they are derived from it is generally not possible to apply standard encoding approaches, e.g. such as those found in traditional Moving Picture Experts Group (MPEG) encoding and decoding standards. For example, many comparative schemes use large transforms (e.g., transforms of large areas of pixels in a normal video frame). Due to the characteristics of residuals, e.g. as described above, it would be very inefficient to use these comparative large transforms on residual images. For example, it would be very hard to encode a small dot in a residual image using a large block designed for an area of a normal image.
Certain examples described herein address these issues by instead using small and simple transform kernels (e.g., the 2x2 or 4x4 kernels of the Directional Decomposition and the Directional Decomposition Squared as presented herein). The transform described herein may be applied using a Hadamard matrix (e.g., a 4x4 matrix for a flattened 2x2 coding block or a 16x16 matrix for a flattened 4x4 coding block). This moves in a different direction from comparative video encoding approaches. Applying these new approaches to blocks of residuals generates compression efficiency. For example, certain transforms generate uncorrelated transformed coefficients (e.g., in space) that may be efficiently compressed. While correlations between transformed coefficients may be exploited, e.g. for lines in residual images, these can lead to encoding complexity, which is difficult to implement on legacy and low-resource devices, and often generates other complex artefacts that need to be corrected. Pre-processing residuals by setting certain residual values to 0 (i.e. not forwarding these for processing) may provide a controllable and flexible way to manage bitrates and stream bandwidths, as well as resource use.
In FIGS. 4 and 5, we present high-level schematics illustrating modifications to the decoding processes implemented by the exemplary hierarchical coding technology initially introduced in FIG. 3. The received bitstream and decoding processes are in accordance with the MPEG-5 Part 2 Low Complexity Enhancement Video Coding (LCEVC) standard delineated in ISO/IEC 23094-2:2021(en). Notwithstanding, the disclosed techniques retain applicability across various other hierarchical coding technologies. To facilitate comprehension, like features between FIGS. 3, 4, 5, 6 and 7 are denoted using consistent reference signs and descriptions are not repeated.
FIG. 4 shows a high-level schematic of a modified decoding process of the hierarchical coding technology shown in FIG. 3 according to a first variation.
In this first variation, the example modified decoding process introduces the following components and operations: at a downsampler component 410, downsampling the decoded enhancement data; at a second upsampling component 414, upsampling the output of the base decoder separately to the upsampling component 313 (alternatively, the upsampling component 313 may be reconfigured to provide the operation of upsampling the output of the base decoder); and at a third summation component 412, combining the downsampled decoded residuals obtained from the downsampler component 410 with the upsampled output of the upsampling component 414.
In this way, an intermediate decoded video 450, or set of output pictures, is generated, with an intermediate quality metric that is between the base decoder output quality and the quality of the decoded video 350.
FIG. 5 shows a high-level schematic of a modified decoding process of the hierarchical coding technology shown in FIG. 3 according to a second variation.
In this second variation, the example modified decoding process introduces the following components and operations: at a downsampler component 410, downsampling the decoded enhancement data; at an inverse transform component 511, transforming the downsampled enhancement data into one or more enhancement residuals planes; at an upsampling component 414, upsampling the output of the base decoder; and at a third summation component 412, combining the downsampled decoded residuals obtained from the downsampler component 410 with the upsampled output of the upsampling component 414.
In this way, an intermediate decoded video 450, or set of output pictures, is generated, with a quality metric between the base decoder output quality and the quality of the decoded video 350. By downsampling the enhancement data in the "transform space" rather than in the "residual space", memory resources may be better utilised. The inverse transform component 511 corresponds to the inverse transform module described with reference to Figs. 2 and 3.
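Under the same illustrative assumptions as the earlier sketches (resize_nearest as a stand-in resampler), the FIG. 5 path might be expressed as follows; inverse_dd_transform is a hypothetical helper for the inverse directional decomposition, not a named API.

```python
def variation2_decode(coeff_surfaces: dict, base_rec, target_scale=1.5, native_scale=2.0):
    # Downsample each transformed-coefficient surface (downsampler 410);
    # the surfaces are smaller than the residual planes, reducing memory traffic.
    ratio = target_scale / native_scale                   # e.g. 1.5 / 2.0 = 0.75
    down = {name: resize_nearest(s, ratio) for name, s in coeff_surfaces.items()}
    # Inverse transform into downscaled residual planes (component 511).
    residuals = inverse_dd_transform(down)                # hypothetical helper
    # Upscale the base output (component 414) and combine (summation 412).
    preliminary = resize_nearest(base_rec, target_scale)
    return preliminary + residuals                        # intermediate video 450
```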
In Figs. 4 and 5, the decoded video 350 at the level 2 reconstruction may or may not be generated. Additionally, the level 1 processing may or may not be used, either because no level 1 enhancements are sent in the encoded bitstream or because the decoder is configured to ignore the level 1 enhancements. In some decoder deployments, there is not the memory or other hardware resource to use the decoded video 350 at the level 2 reconstruction, however there is the memory or other hardware resource to output a decoded picture 450 at an intermediate level between the base reconstruction and the level 2 reconstruction. Using the level 2 enhancement data, and downsampling to an intermediate level, allows for a decoder extension to maximise the decoder deployment without needing to modify a suitably encoded bitstream.
FIG. 6 shows a high-level schematic of a modified decoding process of the hierarchical coding technology shown in FIG. 3 according to a third variation.
The third variation is a modification of the first variation shown in Fig. 4, and in the third variation the level 1 enhancement data is decoded and combined with the output of the base decoder 310 prior to the additional upsampling operation at upsampling component 414.
FIG. 7 shows a high-level schematic of a modified decoding process of the hierarchical coding technology shown in FIG. 3 according to a fourth variation.
The fourth variation is a modification of the second variation shown in Fig. 5, and in the fourth variation the level 1 enhancement data is decoded and combined with the output of the base decoder 310 prior to the additional upsampling operation at upsampling component 414.
In this way, the third and fourth variations use the level 1 enhancement data. The first and second variations described with reference to Figs. 4 and 5 have a comparative advantage in that a decoder may be more easily modified to post-process the video using the outputs of the base decoder 310 and the LCEVC decoder, whereas the third and fourth variations described with reference to Figs. 6 and 7 may provide a more accurate output or more sparse level 2 enhancement data making the downsampling operation more efficient.
Fig. 8 shows a high-level flow chart of a modified decoding process.
At step 810, enhancement data is retrieved from an encoded bitstream.
At step 820, the enhancement data is downscaled by a first scaling factor to produce downscaled one or more enhancement residuals planes.
The skilled person would realise from the disclosure of Figs. 4-7 that the enhancement data is one or more enhancement residuals planes and the downscaling the enhancement data produces corresponding one or more downscaled enhancement residuals planes.
Alternatively, the skilled person would realise from the disclosure of Figs. 4-7 that the enhancement data is a set of transformed elements representing the one or more enhancement residuals planes and the downscaling the enhancement data produces a corresponding downscaled set of transformed elements transformable into the one or more enhancement residuals planes. In this example, the transformed elements are inverse transformed into the one or more enhancement residuals planes prior to being added to a suitably upscaled version of an output of a base decoder.
The skilled person would realise that the first scaling factor may be any useful scaling factor. In one particularly useful embodiment, the first scaling factor is 1.5.
The skilled person would realise from the disclosure of Figs. 4-7 that an output of a base decoder, either modified by a first level enhancement data operation, or not, is upscaled with a second scaling factor. Typically, the second scaling factor corresponds to the first scaling factor.
To illustrate further, in one example a video signal at 1440p is provided to the encoder of Fig. 2 and is suitably downscaled so that the encoded base stream is at 720p. The level 2 enhancement data provides for a level 2 reconstruction at 1440p. In the decoder variations of Figs. 4 to 7, the level 2 enhancement data is downscaled from 1440p to 1080p in accordance with the first scaling factor of 1.5 (i.e. the residuals are brought to 1.5x the base resolution rather than their native 2x), and the decoded base stream is upscaled by the second scaling factor of 1.5 from 720p to 1080p. The combined output is a decoded video 450 or set of pictures at a resolution of 1080p. These resolutions and scaling factors are examples, and other resolutions and scaling factors are possible.
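The dimension bookkeeping for this example, reading the 1.5 factor as the scale of the preliminary pictures relative to the base (so the residual planes themselves shrink by a ratio of 1080/1440 = 0.75), can be checked with a few lines:

```python
base_h = 720                          # decoded base stream height
native_h = 2 * base_h                 # residuals carried at 2x the base: 1440
target_h = int(base_h * 1.5)          # preliminary pictures: 1080
residual_ratio = target_h / native_h  # 0.75, i.e. a downscale by 4/3
assert (base_h * 1.5) == (native_h * residual_ratio) == 1080
```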
A particular use case for the above techniques is where there are at least two decoder deployments using a single encoded bitstream for video. Such a use case may exist for example on an aircraft or other vessel where encoded bitstreams are stored on a central server, and where different parts of the aircraft have differing decoding hardware and associated capabilities. For example, one section of the aircraft may have decoding capability of 1440p, whereas another section of the aircraft may have decoding capability of 1080p. Rather than use a single encoded bitstream offering decoded video at 1440p with level 2 enhancements (i.e. decoded video 350) and at 720p without enhancements (i.e. directly from the base decoder 310), this disclosure allows for certain decoder deployments with sufficient memory resources and a sufficient screen resolution to obtain decoded video at an intermediate level of quality, such as an intermediate resolution. By using downscaled enhancement data, the picture quality will be improved when compared with simply upscaling the output from the base decoder. By using downscaled enhancement data, the memory resources will be more effectively used when compared to generating the decoded video at the level 2 reconstruction and then downsampling to fit to a display device with a lower resolution.
This disclosure allows for a 720p base output resolution and a 1080p intermediate enhanced output (e.g. if there is a constraint on the base decoder that the base resolution cannot be 540p). With native LCEVC, there would have to be a 1080p source video, 540p base video and 1080p level 2 enhancement reconstruction.
Although the above examples are focussed on LCEVC, the above disclosure can also be applied to other hierarchical coding schemes.
While the above examples describe changing resolution, the skilled person would realise from this disclosure that the same principles may be applied to changing bit depth, for example to change the colour gamut of a video signal. Moreover, quality is not restricted to resolution, thus quality may correspond to one or more of resolution, bit depth, frame rate, quantisation parameter and so forth and these quality metrics may be changed through upscaling and downscaling in the way described.
The skilled person will understand from this disclosure that the encoding of video data in the way disclosed is not graphics rendering, nor is the disclosure related to transcoding.
Instead, the video encoding disclosed relates to the creation of an encoded video stream or bitstream from an input video source, and the decoding disclosed relates to the creation of video data from a corresponding encoded video stream or bitstream.
We describe a low level of quality (e.g. 540p), a mid level of quality (e.g. 720p) and a high level of quality (e.g. 1080p). The low, mid and high levels of qualities are not restricted to 540p, 720p and 1080p respectively but can take any value (as long as the quality of the low level is lower than the mid level, and the quality of the mid level is lower than the high level). Moreover, quality is not restricted to resolution, thus quality may correspond to one or more of resolution, bit depth, frame rate, quantisation parameter and so forth.
As discussed, there is a need for providing a layer (e.g. a base layer) at the mid level of quality, and an enhanced layer (e.g. obtained by combining the base layer with one or more enhancement layers) at the high level of quality. The known hierarchical coding scheme LCEVC may not be able to fulfil this need in some circumstances, for example if the mid level of quality is mandated to be 720p and the high level of quality is mandated to be 1080p, because the LCEVC standard teaches upsampling of: 0 dimensions (no upscaling), 1 dimension (upscaling, with a scaling factor of 2, along a single axis of a base image), and 2 dimensions (upscaling, with a scaling factor of 2, along both axes of a base image). Clearly, applying the 0 dimension or 1 dimension upscaling to a 720p base layer would not result in the desired 1080p enhanced layer. Moreover, applying the 2 dimension upscaling to the 720p base layer would result in a 1440p enhanced layer. Thus, the known methods of LCEVC are not suitable for this particular use case.
One 'comparative' method of achieving this desired outcome is to modify the known LCEVC coding scheme to utilise an upsampler (and, when encoding, a downsampler) having a scaling factor other than 2. In embodiments, the upsampler (and, when encoding, the downsampler) comprises a scaling factor greater than 1 and less than 2. We thus call such an upsampler (or downsampler) a fractional upsampler (or downsampler). In embodiments, the fractional upsampler (or downsampler) comprises a scaling factor of 1.5. In such an example, the fractional upsampler (or downsampler) having a scaling factor of 1.5 is used in place of a known LCEVC upsampler (e.g. one comprising a scaling factor of 2). To encode according to such a scheme, we describe a comparative encoding method.
The comparative encoding method may comprise downsampling a high quality source image (e.g. at 1080p), using a fractional downsampler, to obtain a mid quality image (e.g. at 720p, if the fractional downsampler comprises a scaling factor of 1.5). The comparative encoding method may comprise encoding (e.g. using a base encoder) the obtained mid quality image. The comparative encoding method may comprise decoding (e.g. using a base decoder) the encoded mid quality image. The comparative encoding method may comprise upsampling, using a fractional upsampler, the decoded encoded mid quality image to obtain a rendition of the image at a high quality. The scaling factor of the fractional upsampler may be such that it reverses the downsampling of the fractional downsampler, i.e. the fractional downsampler and upsampler may comprise the same scaling factor. In this example the scaling factor is 1.5. The comparative encoding method may comprise comparing the rendition of the image at a high quality with the source image to obtain residuals. The comparative encoding method may comprise encoding the obtained residuals. The encoding may comprise one or more of transforming, quantising, and entropy encoding. The encoding may be in accordance with the LCEVC standard. A corresponding comparative decoding scheme may comprise obtaining a decoded base layer. The comparative decoding scheme may comprise upsampling, using the fractional upsampler (e.g. an upsampler having a scaling factor of 1.5), the decoded base to obtain an upsampled image. The comparative decoding scheme may comprise decoding an enhancement layer. The comparative decoding scheme may comprise combining the decoded enhancement layer with the upsampled image to produce an enhanced image.
As such, we describe a comparative coding scheme, wherein the standard LCEVC coding scheme is modified to utilise an upsampler (and downsampler) having a fractional scaling factor (e.g. scaling factor of 1.5).
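A fractional resampler of this kind might be sketched with separable linear interpolation; the filter choice here is an assumption for illustration, since standard LCEVC defines no normative filter for non-integer factors.

```python
import numpy as np

def fractional_resample(plane: np.ndarray, scale: float) -> np.ndarray:
    # Separable linear interpolation, usable for non-integer factors such
    # as 1.5 (upsampling) or 1/1.5 (downsampling).
    def along_axis0(a: np.ndarray, s: float) -> np.ndarray:
        n = a.shape[0]
        m = int(round(n * s))
        pos = np.arange(m) / s                 # source coordinate of each output row
        i0 = np.clip(np.floor(pos).astype(int), 0, n - 1)
        i1 = np.clip(i0 + 1, 0, n - 1)
        frac = (pos - i0)[:, None]
        return a[i0] * (1 - frac) + a[i1] * frac
    return along_axis0(along_axis0(plane, scale).T, scale).T
```

For example, a 720x1280 plane resampled with scale=1.5 yields 1080x1920, matching the 720p to 1080p step described below.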
Following on, an alternative to the solutions disclosed with reference to Figs. 4-8 is to use a '1.5x' downsampler and upsampler when encoding with the encoder of Fig. 2 and a corresponding '1.5x' upsampler when decoding in the decoder of Fig. 3. In other words, use the usual LCEVC method but use a '1.5x' scaling factor in place of the usual '2x' scaler.
Thus, as an example for 1080p and 720p resolutions, the encoding process would work as follows: i) downsample a source image at 1080p to get a 720p image; ii) base encode and base decode the 720p image; iii) upsample the decoded encoded 720p image to generate a 1080p resolution version; iv) compare the generated 1080p resolution version with the 1080p source to obtain the level 2 residuals; v) encode the residuals as per usual.
Then the decoding process would work as follows: i) obtain a decoded 720p base; ii) upsample the 720p decoded base to get a 1080p picture; iii) decode the 1080p residuals, i.e. the level 2 residuals; iv) combine the 1080p residuals with the 1080p picture; v) output the decoded video.
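The corresponding decoding steps can be sketched in the same illustrative style, reusing the hypothetical fractional_resample helper and SCALING_FACTOR from the encoding sketch above:

```python
def comparative_decode(base_bitstream, enhancement_bitstream,
                       base_decode, decode_residuals):
    # i) Obtain the decoded 720p base picture.
    base_picture = base_decode(base_bitstream)
    # ii) Upsample the decoded base by 1.5x to get a 1080p picture.
    preliminary = fractional_resample(base_picture, SCALING_FACTOR)
    # iii) Decode the level 2 residuals at 1080p.
    residuals = decode_residuals(enhancement_bitstream)
    # iv) Combine the residuals with the 1080p picture; v) output it.
    return preliminary.astype(np.int32) + residuals
```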
However, this is not possible with a currently compliant LCEVC bitstream, as the upsampler required (720 -> 1080) is not in the standard.
Definitions and Terms
In certain examples described herein the following terms are used: "base layer" -this is a layer pertaining to a coded base picture, where the "base" refers to a codec that receives processed input video data. It may pertain to a portion of a bitstream that relates to the base.
"bitstream" -this is sequence of bits, which may be supplied in the form of a NAL unit stream or a byte stream. It may form a representation of coded pictures and associated data forming one or more coded video sequences (CVSs).
"block" -an MxN (M-column by N-row) array of samples, or an MxN array of transform coefficients. The term "coding unit" or "coding block" is also used to refer to an MxN array of samples. These terms may be used to refer to sets of picture elements (e.g. values for pixels of a particular colour channel), sets of residual elements, sets of values that represent processed residual elements and/or sets of encoded values. The term "coding unit" is sometimes used to refer to a coding block of luma samples or a coding block of chroma samples of a picture that has three sample arrays, or a coding block of samples of a monochrome picture or a picture that is coded using three separate colour planes and syntax structures used to code the samples. A coding unit may comprise an M by N array R of elements with elements R[x][y]. For a 2x2 coding unit, there may be 4 elements. For a 4x4 coding unit, there may be 16 elements.
"chroma" -this is used as an adjective to specify that a sample array or single sample is representing a colour signal. This may be one of the two colour difference signals related to the primary colours, e.g. as represented by the symbols Cb and Cr. It may also be used to refer to channels within a set of colour channels that provide information on the colouring of a picture. The term chroma is used rather than the term chrominance in order to avoid the implication of the use of linear light transfer characteristics that is often associated with the term chrominance.
"coded picture" -this is used to refer to a set of coding units that represent a coded representation of a picture.
"coded base picture" -this may refer to a coded representation of a picture encoded using a base encoding process that is separate (and often differs from) an enhancement encoding process.
"coded representation" -a data element as represented in its coded form "decoded base picture" -this is used to refer to a decoded picture derived by decoding a coded base picture.
"decoded picture" -a decoded picture may be derived by decoding a coded picture. A decoded picture may be either a decoded frame, or a decoded field. A decoded field may be either a decoded top field or a decoded bottom field.
"decoder" -equipment or a device that embodies a decoding process.
"decoding process" -this is used to refer to a process that reads a bitstream and derives decoded pictures from it.
"encoder" -equipment or a device that embodies a encoding process.
"encoding process" -this is used to refer to a process that produces a bitstream (i.e. an encoded bitstream).
"enhancement layer" -this is a layer pertaining to a coded enhancement data, where the enhancement data is used to enhance the "base". It may pertain to a portion of a bitstream that comprises planes of residual data. The singular term is used to refer to encoding and/or decoding processes that are distinguished from the "base" encoding and/or decoding processes.
"video frame or frame" -in certain examples a video frame may comprise a frame composed of an array of luma samples in monochrome format or an array of luma samples and two corresponding arrays of chroma samples. The luma and chroma samples may be supplied in 4:2:0, 4:2:2, and 4:4:4 colour formats (amongst others). A frame may consist of two fields, a top field and a bottom field (e.g. these terms may be used in the context of interlaced video). References to a "frame" in these examples may also refer to a frame for a particular plane, e.g. where separate frames of residuals are generated for each of YUV planes. As such the terms "plane" and "frame" may be used interchangeably.
"layer" -this term is used in certain examples to refer to one of a set of syntactical structures in a non-branching hierarchical relationship, e.g. as used when referring to the "base" and "enhancement" layers, or the two (sub-) "layers" of the enhancement layer.
"luma" -this term is used as an adjective to specify a sample array or single sample that represents a lightness or monochrome signal, e.g. as related to the primary colours. Luma samples may be represented by the symbol or subscript Y or L. The term "luma" is used rather than the term luminance in order to avoid the implication of the use of linear light transfer characteristics that is often associated with the term luminance. The symbol L is sometimes used instead of the symbol Y to avoid confusion with the symbol y as used for vertical location.
"network abstraction layer (NAL) unit (NALU)" -this is a syntax structure containing an indication of the type of data to follow and bytes containing that data in the form of a raw byte sequence payload (RBSP). The RBSP is a syntax structure containing an integer number of bytes that is encapsulated in a NAL unit. An RBSP is either empty or has the form of a string of data bits containing syntax elements followed by an RBSP stop bit and followed by zero or more subsequent bits equal to 0. The RBSP may be interspersed as necessary with emulation prevention bytes.
"network abstraction layer (NAL) unit stream" -a sequence of NAL units.
"picture" -this is used as a collective term for a field or a frame. In certain cases, the terms frame and picture are used interchangeably.
"residual" -this term is defined in further examples below. It generally refers to a difference between a reconstructed version of a sample or data element and a reference of that same sample or data element.
"source" -this term is used in certain examples to describe the video material or some of its attributes before encoding.
"transform coefficient" (or just "coefficient") -this term is used to refer to a value that is produced when a transformation is applied to a residual or data derived from a residual (e.g. a processed residual). It may be a scalar quantity, that is considered to be in a transformed domain. In one case, an M by N coding unit may be flattened into an M*N one-dimensional array. In this case, a transformation may comprise a multiplication of the one-dimensional array with an M by N transformation matrix. In this case, an output may comprise another (flattened) M*N one-dimensional array. In this output, each element may relate to a different "coefficient", e.g. for a 2x2 coding unit there may be 4 different types of coefficient. As such, the term "coefficient" may also be associated with a particular index in an inverse transform part of the decoding process, e.g. a particular index in the aforementioned one-dimensional array that represented transformed residuals.
Claims (15)
- 1. A method of decoding an encoded bitstream into one or more enhancement residuals planes suitable to be added to a set of preliminary pictures obtained from a decoder reconstructed video, the method comprising: retrieving enhancement data from the encoded bitstream; and downscaling the enhancement data by a first scaling factor to produce downscaled one or more enhancement residuals planes suitable to be added to a set of preliminary pictures obtained from a decoder reconstructed video.
- 2. The method of claim 1, wherein the enhancement data is one or more enhancement residuals planes and the downscaling the enhancement data produces corresponding one or more downscaled enhancement residuals planes.
- 3. The method of claim 1, wherein the enhancement data is a set of transformed elements representing the one or more enhancement residuals planes and the downscaling the enhancement data produces a corresponding downscaled set of transformed elements transformable into the one or more enhancement residuals planes.
- 4. The method of claim 3, wherein the set of transformed elements indicates the extent of spatial correlation between corresponding residual elements in the one or more enhancement residuals planes such that the set of transformed elements indicates at least one of an average, horizontal, vertical and diagonal relationship between neighbouring residual elements in the set of enhancement residuals planes.
- 5. The method of any preceding claim, wherein the method comprises obtaining an upscaled decoder reconstructed video using a second scaling factor to create the set of preliminary pictures, wherein the second scaling factor is such that the downscaled enhancement data is suitable to be added to the set of preliminary pictures.
- 6. The method of any preceding claim, wherein the method comprises upscaling the decoder reconstructed video using a second scaling factor to create the set of preliminary pictures, wherein the second scaling factor is such that the downscaled enhancement data is suitable to be added to the set of preliminary pictures.
- 7. The method of any preceding claim, wherein the downscaling downscales one or more of: picture resolution and bit depth.
- 8. The method of any preceding claim, wherein the bitstream conforms with the MPEG-5 Part 2 LCEVC standard and/or ISO/IEC 23094-2.
- 9. A decoder configured to perform the method of any preceding claim.
- 10. A computer readable storage medium comprising instructions which, when executed by a processor, cause the processor to perform the method of any preceding method claim.
- 11. A video distribution system comprising: a video server configured to serve an encoded bitstream comprising one or more enhancement residuals planes suitable to be added to a set of preliminary pictures obtained from a decoder reconstructed video; a first decoding deployment configured to receive the encoded bitstream, to obtain a decoder reconstructed video from the encoded bitstream, to create a first set of preliminary pictures from the decoder reconstructed video, to obtain the one or more enhancement residuals planes, and to add the one or more enhancement residuals planes to the first set of preliminary pictures to output video pictures at a first level of quality; a second decoding deployment configured to receive the encoded bitstream, to obtain a decoder reconstructed video from the encoded bitstream, to create a second set of preliminary pictures from the decoder reconstructed video at a level of quality lower than the first set of preliminary pictures, to retrieve enhancement data from the encoded bitstream, to downscale the enhancement data by a first scaling factor to produce downscaled one or more enhancement residuals planes suitable to be added to the second set of preliminary pictures, to add the downscaled one or more enhancement residuals planes to the second set of preliminary pictures, and to output video pictures at a second level of quality lower than the first level of quality.
- 12. The system of claim 11, wherein at the second decoding deployment, the enhancement data is a set of transformed elements representing the one or more enhancement residuals planes and the downscaling the enhancement data produces a corresponding downscaled set of transformed elements transformable into the one or more enhancement residuals planes.
- 13. The system of any of claims 11 to 12, wherein the second decoding deployment is configured to obtain an upscaled decoder reconstructed video using a second scaling factor to create the second set of preliminary pictures, wherein the second scaling factor is such that the downscaled enhancement data is suitable to be added to the set of preliminary pictures.
- 14. The system of any of claims 9 to 12, wherein the second decoding deployment is configured to upscale the decoder reconstructed video using a second scaling factor to create the set of preliminary pictures, wherein the second scaling factor is such that the downscaled enhancement data is suitable to be added to the set of preliminary pictures.
- 15. The system of any of claims 9 to 14, wherein the encoded bitstream conforms with the MPEG-5 Part 2 LCEVC standard and/or ISO/IEC 23094-2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2315482.6A GB2630154A (en) | 2023-10-09 | 2023-10-09 | Intermediate video decoding |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
GB2315482.6A GB2630154A (en) | 2023-10-09 | 2023-10-09 | Intermediate video decoding |
Publications (1)
Publication Number | Publication Date |
---|---|
GB2630154A (en) | 2024-11-20 |
Family
ID=93154403
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
GB2315482.6A (GB2630154A, pending) | Intermediate video decoding | 2023-10-09 | 2023-10-09 |
Country Status (1)
Country | Link |
---|---|
GB (1) | GB2630154A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007103889A2 (en) * | 2006-03-03 | 2007-09-13 | Vidyo, Inc. | System and method for providing error resilience, random access and rate control in scalable video communications |
- 2023-10-09: Application GB2315482.6A filed in GB; published as GB2630154A (status: Pending)
Patent Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007103889A2 (en) * | 2006-03-03 | 2007-09-13 | Vidyo, Inc. | System and method for providing error resilience, random access and rate control in scalable video communications |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230080852A1 (en) | Use of tiered hierarchical coding for point cloud compression | |
US12160601B2 (en) | Quantization of residuals in video coding | |
US12177493B2 (en) | Use of embedded signalling for backward-compatible scaling improvements and super-resolution signalling | |
US20240305834A1 (en) | Video decoding using post-processing control | |
US20240040160A1 (en) | Video encoding using pre-processing | |
US20220329802A1 (en) | Quantization of residuals in video coding | |
WO2023187307A1 (en) | Signal processing with overlay regions | |
US20220182654A1 (en) | Exchanging information in hierarchical video coding | |
US20220272342A1 (en) | Quantization of residuals in video coding | |
GB2630154A (en) | Intermediate video decoding | |
US20250063173A1 (en) | Digital image processing | |
GB2617491A (en) | Signal processing with overlay regions | |
WO2023187372A1 (en) | Upsampling filter for applying a predicted average modification |