
HK1161016B - Method for resampling and picture resizing operations for multi-resolution video coding and decoding - Google Patents


Info

Publication number
HK1161016B
HK1161016B
Authority
HK
Hong Kong
Prior art keywords
resolution
base layer
vertical
image data
layer image
Prior art date
Application number
HK12101256.3A
Other languages
Chinese (zh)
Other versions
HK1161016A1 (en)
Inventor
G.J. Sullivan
Original Assignee
Microsoft Technology Licensing, LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Technology Licensing, LLC
Publication of HK1161016A1
Publication of HK1161016B


Description

Method of resampling and resizing operations for multi-resolution video encoding and decoding
The present application is a divisional application of the invention patent application entitled "Method of resampling and resizing operations for multi-resolution video encoding and decoding," filed by the applicant on January 8, 2007, under application number 200780001989.6 (international application number PCT/US2007/000195).
Technical Field
Techniques and tools for encoding/decoding digital video are described.
Background
With the increased popularity of DVDs, music delivery over the Internet, and digital cameras, digital media has become commonplace. Engineers use a variety of techniques to process digital audio, video, and images efficiently while maintaining quality. To understand these techniques, it helps to understand how audio, video, and image information is represented and processed in a computer.
I. Representation of media information in a computer
A computer processes media information as a series of numbers representing that information. For example, a single number may represent the intensity of luminance or the intensity of a color component, such as red, green, or blue, for each elemental small region of an image, so that the digital representation of the image consists of one or more arrays of such numbers. Each such number may be referred to as a sample. For a color image, it is customary to use more than one sample to represent the color of each elemental region, and typically three samples are used. The set of these samples for an elemental region may be referred to as a pixel, where the word "pixel" is an abbreviation for the concept "picture element". For example, one pixel may consist of three samples that represent the intensity of red, green, and blue light necessary to display the elemental region. Such a pixel type is referred to as an RGB pixel. Several factors affect the quality of media information, including sample depth, resolution, and frame rate (for video).
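For illustration only (this sketch is not part of the patented techniques), the sample and pixel representation described above can be shown in a few lines of Python, where each pixel of a small image holds three color samples:

```python
# Illustrative sketch: an image as an array of pixels, each pixel being
# three 8-bit samples (red, green, blue).
def make_image(width, height):
    # Every elemental region starts as a black pixel (0, 0, 0).
    return [[(0, 0, 0) for _ in range(width)] for _ in range(height)]

img = make_image(2, 2)
img[0][0] = (255, 0, 0)    # a pure red pixel at the top-left element region

red_sample = img[0][0][0]  # a single number: the intensity of red light
```

Each entry of `img` is one RGB pixel; each of its three numbers is one sample in the sense used above.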
Sample depth is a property, normally measured in bits, that indicates the range of numbers that can be used to represent a sample. The more values that are possible for a sample, the higher the quality can be, because the numbers can capture finer variations in intensity and/or a larger range of values. Resolution generally refers to the number of samples over some duration of time (for audio) or space (for images or individual video pictures). Images with higher spatial resolution tend to look crisper than other images and to contain more discernable useful detail. Frame rate is a common term for the temporal resolution of video. Video with a higher frame rate tends to mimic the smooth motion of natural objects better than other video, and can similarly be considered to contain more detail in the temporal dimension. The tradeoff for all these factors of high quality is the cost of storing and transmitting the information, in terms of the bit rate needed to represent the sample depth, resolution, and frame rate, as shown in Table 1 below.
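As a hedged illustration of the bit-rate relationship described above (the specific rows of Table 1 are not reproduced here), the raw bit rate of uncompressed video is simply the product of its quality factors. The 4:2:0 sampling density of 1.5 samples per pixel position is an assumption for this example:

```python
# Hedged arithmetic sketch: raw (uncompressed) bit rate as the product of
# spatial resolution, sampling density, sample depth, and frame rate.
def raw_bitrate_bps(width, height, fps, bits_per_sample=8, samples_per_pixel=1.5):
    # samples_per_pixel = 1.5 corresponds to 4:2:0 sampling: one luma
    # sample per pixel plus two chroma samples per 2x2 block of pixels.
    return width * height * samples_per_pixel * bits_per_sample * fps

# e.g., 1280x720 at 60 frames/second, 8-bit 4:2:0:
bps = raw_bitrate_bps(1280, 720, 60)   # about 664 Mbit/s uncompressed
```

Numbers at this scale are why the compression techniques discussed next matter.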
Table 1: bit rate of different quality levels of original video
Despite the high bit rates needed for storing and transmitting high quality video (such as HDTV), companies and consumers increasingly depend on computers to create, distribute, and play back high quality content. For this reason, engineers use compression (also called source coding or source encoding) to reduce the bit rate of digital media. Compression decreases the cost of storing and transmitting the information by converting the information into a lower bit rate form. Compression can be lossless, in which the quality of the video does not suffer, but decreases in bit rate are limited by the complexity of the video. Alternatively, compression can be lossy, in which the quality of the video suffers, but decreases in bit rate are more dramatic. Decompression (also called decoding) reconstructs a version of the original information from the compressed form. A "codec" is an encoder/decoder system.
In general, video compression techniques include "intra" compression and "inter" or predictive compression. For video images, intra-frame compression techniques compress individual images. Inter-frame compression techniques compress images with reference to previous and/or subsequent images.
II. Multi-resolution video and spatial scalability
Standard video encoders experience significant performance degradation when the target bit rate is below a certain threshold. Quantization and other lossy processing stages introduce distortion. At low bit rates, high frequency information may be severely distorted or lost altogether. The result is that noticeable artifacts occur and the quality of the reconstructed video is significantly reduced. Although the available bit rates are increasing with improvements in transmission and processing technology, maintaining high visual quality at limited bit rates remains a major goal of video codec design. Existing codecs use several methods to improve visual quality at limited bit rates.
Multi-resolution coding allows encoding of video at different spatial resolutions. Reduced-resolution video can be encoded at a substantially lower bit rate, at the expense of lost information. For example, a prior video encoder can downsample (with a downsampling filter) full-resolution video and encode it at a reduced resolution in the vertical and/or horizontal direction. Reducing the resolution by half in a direction reduces the dimensions of the encoded picture by half in that direction. The encoder signals the reduced-resolution encoding to the decoder. The decoder receives information indicating reduced-resolution encoding and ascertains from the received information how the reduced-resolution video should be upsampled (with an upsampling filter) to increase the picture size before display. However, the information that was lost when the encoder downsampled and encoded the video is still missing from the upsampled pictures.
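The downsampling described above can be sketched as follows. This uses an illustrative 2x2 averaging (box) filter only; the actual filters used by encoders are implementation-specific and typically longer:

```python
# Illustrative sketch: halving resolution in each direction with a simple
# 2x2 box (averaging) downsampling filter over a grayscale sample array.
def downsample_by_2(img):
    h, w = len(img), len(img[0])
    out = []
    for y in range(0, h - 1, 2):
        row = []
        for x in range(0, w - 1, 2):
            # Average each 2x2 neighborhood into one output sample.
            row.append((img[y][x] + img[y][x + 1] +
                        img[y + 1][x] + img[y + 1][x + 1]) // 4)
        out.append(row)
    return out
```

High-frequency detail averaged away here cannot be recovered by upsampling, which is exactly the information loss the passage above describes.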
Spatially scalable video uses a multi-layer approach, allowing an encoder to reduce spatial resolution (and thus bit rate) in the base layer, while maintaining higher resolution information from the source video in one or more enhancement layers. For example, a base layer intra picture may be encoded with a reduced resolution, while an accompanying enhancement layer intra picture may be encoded with a higher resolution. Similarly, a base layer predicted picture may be accompanied by an enhancement layer predicted picture. The decoder may choose (in view of bit rate constraints and/or other criteria) to decode only the lower resolution base layer to obtain the lower resolution reconstructed image, or to decode the images of the base and enhancement layers to obtain the higher resolution reconstructed image. When the base layer is encoded at a lower resolution than the displayed image (also referred to as downsampling), the encoded image size is actually smaller than the displayed image. The decoder performs calculations to resize the reconstructed image and generates interpolated sample values at appropriate locations within the reconstructed image using an upsampling filter. However, previous codecs that use spatially scalable video have suffered from inflexible upsampling filters and inaccurate or expensive (in terms of computation time or bit rate) picture resizing techniques.
Given the critical importance of video compression and decompression for digital video, it is not surprising that video compression and decompression are well developed areas. However, despite the benefits of earlier video compression and decompression techniques, they do not have the advantages of the following techniques and tools.
Disclosure of Invention
This summary is provided to introduce a selection of concepts in a simplified form that are further described below in the detailed description. This summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
In summary, the detailed description is directed to various techniques and tools for multi-resolution and layered spatially scalable video encoding and decoding.
For example, the detailed description relates to various techniques and tools for high-precision position calculation when resizing images in applications such as spatially scalable video coding and decoding. In one aspect, resampling of a video image is performed according to a resampling scale factor. The resampling comprises computing a sample value at position i, j in a resampled array. The computation includes computing a derived horizontal or vertical sub-sample position x or y, in part by multiplying a 2^n value by an approximation (or exact value) of the inverse of the upsampling scale factor (or by approximating the division of a 2^n value by the upsampling scale factor or an approximation of the upsampling scale factor). The exponent n may be the sum of two integers, including an integer F that represents the number of bits in the fractional component. The approximation may be rounding or some other kind of approximation, such as a ceiling or floor function that approximates to a neighboring integer. The sample value is interpolated using a filter.
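The fixed-point position derivation described above can be sketched as follows. The variable names (F, EXTRA, n) and the rounding choice are illustrative assumptions rather than the normative definitions; the point is that a per-sample division by the scale factor is replaced by one precomputed reciprocal, a multiplication, and a shift:

```python
# Hedged sketch of the fixed-point sub-sample position computation.
F = 4        # assumed number of fractional bits in the derived position
EXTRA = 12   # assumed extra precision bits; n = F + EXTRA
n = F + EXTRA

def position_increment(upsample_scale):
    # Approximate 2**n divided by the upsampling scale factor by rounding
    # to the nearest integer (floor or ceiling are other possible choices).
    return int(round((1 << n) / upsample_scale))

def subsample_position(i, increment):
    # Derived sub-sample position with F fractional bits: multiply by the
    # precomputed reciprocal, round, and drop the EXTRA precision bits.
    return (i * increment + (1 << (EXTRA - 1))) >> EXTRA

inc = position_increment(2.0)     # upsampling by a factor of 2
pos = subsample_position(3, inc)  # output sample i = 3 maps to input 1.5
```

With these assumed values, `pos` is 24, i.e. 1.5 in F = 4 fractional bits, so the interpolation filter would be applied at input position 1.5.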
Some alternatives of the described techniques provide an altered sample position computation that, in one implementation, provides one extra bit of precision in the computations without significantly altering the sample position computation process or its complexity. Some further alternatives of the described techniques relate to how the sample position computation is performed for 4:2:2 and 4:4:4 sampling structures. These alternatives for those sampling structures lock the luma and chroma sample position computations together whenever the resolution of the chroma and luma sampling grids is the same in a particular dimension.
Other features and advantages will become apparent from the following detailed description of various embodiments, which proceeds with reference to the accompanying drawings.
Drawings
FIG. 1 is a block diagram of a suitable computing environment in connection with which several of the described embodiments may be implemented.
FIG. 2 is a block diagram of a generalized video encoder system in conjunction with which several embodiments described may be implemented.
FIG. 3 is a block diagram of a generic video decoder system in conjunction with which several of the described embodiments may be implemented.
Fig. 4 is an illustration of a macroblock format used in the several embodiments described.
Fig. 5A is a diagram of a portion of an interlaced video frame, showing the alternating lines of a top field and a bottom field. Fig. 5B is a diagram of the interlaced video frame organized as a frame for encoding/decoding, and fig. 5C is a diagram of the interlaced video frame organized as fields for encoding/decoding.
Fig. 5D shows six exemplary spatial arrangements of 4:2:0 chroma sample positions relative to luma sample positions of each field of a video frame.
Fig. 6 is a flow diagram illustrating a generalized technique for multi-resolution video coding.
Fig. 7 is a flow diagram illustrating a generalized technique for multi-resolution video decoding.
Fig. 8 is a flow diagram illustrating a technique for multi-resolution encoding of multi-resolution intra pictures and inter predicted pictures.
Fig. 9 is a flow diagram illustrating a technique for multi-resolution decoding of multi-resolution intra pictures and inter predicted pictures.
Fig. 10 is a flow diagram illustrating a technique for encoding spatially scalable bitstream layers to allow decoding of video at different resolutions.
Fig. 11 is a flow diagram illustrating a technique for decoding spatial scalable bitstream layers to allow decoding of video at different resolutions.
Fig. 12 and 13 are code diagrams illustrating pseudo code for an exemplary multi-stage position calculation technique.
FIG. 14 is a code diagram illustrating pseudo code for an exemplary incremental position calculation technique.
Detailed Description
The described embodiments relate to techniques and tools for multi-resolution and layered spatially scalable video encoding and decoding.
The various techniques and tools described herein may be used independently. Certain techniques and tools may also be used in combination (e.g., at different respective phases of a combined encoding and/or decoding process).
Various techniques will be described below with reference to flowcharts of processing acts. The various processing acts illustrated in the flowcharts may be combined into fewer acts or divided into more acts. For the sake of brevity, the relationships between various acts illustrated in a particular flowchart and those described elsewhere are typically not illustrated. In many cases, the actions in the flow diagrams may be rearranged.
Most of the detailed description is directed to representing, encoding, and decoding video information. The techniques and tools described herein for representing, encoding, and decoding video information may be applied to audio information, still image information, or other media information.
I. Computing environment
FIG. 1 illustrates a general example of a suitable computing environment 100 in which the described embodiments may be implemented. The computing environment 100 is not intended to suggest any limitation as to scope of use or functionality, as the techniques and tools may be implemented in diverse general-purpose or special-purpose computing environments.
Referring to FIG. 1, computing environment 100 includes at least one processing unit 110 and memory 120. In fig. 1, this most basic configuration 130 is included within a dashed line. The processing unit 110 executes computer-executable instructions and may be a real or virtual processor. In a multi-processing system, multiple processing units execute computer-executable instructions to increase processing power. Memory 120 may be volatile memory (e.g., registers, cache, RAM), non-volatile memory (e.g., ROM, EEPROM, flash memory, etc.), or some combination of the two. The memory 120 stores software 180 that implements a video encoder or decoder using one or more of the techniques or tools described herein.
The computing environment may have additional features. For example, computing environment 100 includes storage 140, one or more input devices 150, one or more output devices 160, and one or more communication connections 170. An interconnection mechanism (not shown) such as a bus, controller, or network interconnects the components of the computing environment 100. Typically, operating system software (not shown) provides an operating environment for other software executing in the computing environment 100 and coordinates activities of the components of the computing environment 100.
Storage 140 may be removable or non-removable, and includes magnetic disks, magnetic tapes or cassettes, CD-ROMs, DVDs, flash memory, or any other medium which can be used to store information and which can be accessed within computing environment 100. The storage 140 stores instructions for the software 180 to implement a video encoder or decoder.
The input device 150 may be a touch input device such as a keyboard, mouse, pen, touch screen, or trackball, a voice input device, a scanning device, or another device that may provide input to the computing environment 100. For audio or video encoding, the input device 150 may be a sound card, video card, TV tuner card, or similar device that accepts audio or video input in analog or digital format, or a CD-ROM, CD-RW or DVD that reads audio or video samples into the computing environment 100. Output device 160 may be a display, a printer, a speaker, a CD or DVD recorder, or another device that provides output from computing environment 100.
Communication connection(s) 170 allow communication over a communication medium to another computing entity. The communication medium conveys information such as computer-executable instructions, audio or video input or output, or other data in a modulated data signal. A modulated data signal is a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. By way of example, and not limitation, communication media includes wired or wireless techniques implemented with an electrical, optical, RF, infrared, acoustic, or other carrier.
Various techniques and tools may be described in the general context of computer-readable media. Computer readable media can be any available media that can be accessed within a computing environment. By way of example, and not limitation, for computing environment 100, computer-readable media include memory 120, storage 140, communication media, and combinations of any of the above.
The various techniques and tools may be described in the general context of computer-executable instructions, such as those included in program modules, being executed on one or more target real or virtual processors in a computing environment. Generally, program modules include routines, programs, libraries, objects, classes, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The functionality of the program modules may be combined or separated between program modules as desired in various embodiments. Computer-executable instructions for program modules may be executed in local or distributed computing environments.
For the sake of presentation, the detailed description uses terms like "encode," "decode," and "select" to describe computer operations in a computing environment. These terms are high-level abstractions for operations performed by a computer, and should not be confused with acts performed by a human being. The actual computer operations corresponding to these terms vary depending on the implementation.
II. Exemplary video encoder and decoder
Fig. 2 is a block diagram of an exemplary video encoder 200 in conjunction with which certain embodiments described may be implemented. Fig. 3 is a block diagram of a generalized video decoder 300 in conjunction with which certain embodiments described may be implemented.
The relationship shown between the modules within the encoder 200 and decoder 300 indicates the general information flow in the encoder and decoder; other relationships are not shown for simplicity. In particular, fig. 2 and 3 generally do not show auxiliary information indicating encoder settings, modes, tables, etc. for video sequences, pictures, slices, macroblocks, blocks, etc. This side information is typically sent in the output bitstream after entropy coding of the side information. The format of the output bitstream may vary depending on the implementation.
The encoder 200 and decoder 300 process video pictures, which may be video frames, video fields, or a combination of frames and fields. The bitstream syntax and semantics at the picture and macroblock levels may depend on whether frames or fields are used. There may also be changes to macroblock organization and overall timing. The encoder 200 and decoder 300 are block-based and use a 4:2:0 macroblock format for frames, with each macroblock including four 8x8 luma blocks (at times treated as one 16x16 macroblock) and two 8x8 chroma blocks. For fields, the same or a different macroblock organization and format may be used. The 8x8 blocks may be further sub-divided at different stages, e.g., at the frequency transform and entropy encoding stages. Exemplary video frame organizations are described in more detail below. Alternatively, the encoder 200 and decoder 300 are object-based, use a different macroblock or block format, or perform operations on sets of samples of different size or configuration than 8x8 blocks and 16x16 macroblocks.
Depending on the desired implementation and type of compression, modules of the encoder or decoder may be added, omitted, divided into multiple modules, combined with other modules, and/or replaced with similar modules. In alternative embodiments, encoders or decoders having different module and/or other module configurations perform one or more of the described techniques.
A. Video frame organization
In some implementations, the encoder 200 and decoder 300 process video frames organized as follows. A frame contains lines of spatial information of a video signal. For progressive video, these lines contain samples representing a snapshot of scene content sampled at the same time instant and covering the entire scene from the top to the bottom of the frame. A progressive video frame is divided into macroblocks such as the macroblock 400 shown in fig. 4. The macroblock 400 includes four 8x8 luma blocks (Y1 through Y4) and two 8x8 chroma blocks that are co-located with the four luma blocks but half resolution horizontally and vertically, following the conventional 4:2:0 macroblock format. The 8x8 blocks may be further sub-divided at different stages, e.g., at the frequency transform (e.g., 8x4, 4x8, or 4x4 DCT) and entropy encoding stages. A progressive I-frame is an intra-coded progressive video frame, where the term "intra" refers to coding methods that do not involve prediction from other previously decoded picture content. A progressive P-frame is a progressive video frame coded using prediction from one or more pictures at times that temporally differ from that of the current picture (sometimes referred to as forward prediction in some contexts), and a progressive B-frame is a progressive video frame coded using inter-frame prediction that involves a (possibly weighted) averaging of multiple predictions in some regions (sometimes referred to as bi-predictive or bi-directional prediction). Progressive P- and B-frames may include intra-coded macroblocks as well as various types of inter-frame predicted macroblocks.
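For illustration only, the 4:2:0 macroblock geometry described above (four 8x8 luma blocks plus two half-resolution 8x8 chroma blocks) can be sketched as a simple container; this is not any codec's actual data structure:

```python
# Illustrative container for the 4:2:0 macroblock format described above.
def block8x8():
    return [[0] * 8 for _ in range(8)]

def macroblock_420():
    return {
        "Y": [block8x8() for _ in range(4)],  # Y1..Y4 cover 16x16 luma samples
        "Cb": block8x8(),  # 8x8 chroma, half resolution in each direction
        "Cr": block8x8(),
    }

mb = macroblock_420()
luma_samples = 4 * 8 * 8    # 256 luma samples per macroblock
chroma_samples = 2 * 8 * 8  # 128 chroma samples per macroblock
```

The 256:128 ratio reflects the half horizontal and half vertical chroma resolution of 4:2:0 sampling.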
An interlaced video frame consists of an alternating series of two types of scans of a scene: one comprising the even lines of the frame (lines numbered 0, 2, 4, etc.), referred to as the top field, and the other comprising the odd lines of the frame (lines numbered 1, 3, 5, etc.), referred to as the bottom field. The two fields may represent two different snapshot time instants. Fig. 5A shows a portion of an interlaced video frame 500, including the alternating lines of the top field and bottom field at the top left part of the interlaced video frame 500.
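The separation of an interlaced frame into its two fields can be sketched as follows (illustrative only): even-numbered lines form the top field and odd-numbered lines form the bottom field:

```python
# Illustrative sketch: split an interlaced frame (a list of sample lines)
# into its two fields by line parity.
def split_fields(frame):
    top = frame[0::2]     # even lines 0, 2, 4, ... : the top field
    bottom = frame[1::2]  # odd lines 1, 3, 5, ...  : the bottom field
    return top, bottom

top, bottom = split_fields([[10], [11], [12], [13]])
```

Interleaving the two fields back together line by line reproduces the original frame, which is the relationship figs. 5B and 5C organize in two different ways.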
Fig. 5B shows the interlaced video frame 500 of fig. 5A organized as a frame 530 for encoding/decoding. Interlaced video frame 500 has been divided into macroblocks such as macroblocks 531 and 532 or other such regions using a 4:2:0 format as shown in fig. 4. In the luminance plane, each macroblock 531, 532 comprises 8 lines from the top field alternating with 8 lines from the bottom field for a total of 16 lines, and each line is 16 samples long. (the actual organization of the image into macroblocks or other such regions, as well as the actual organization and arrangement of luma and chroma blocks within macroblocks 531, 532 is not shown, and may actually vary for different coding decisions and for different video coding designs.) within a given macroblock, the top field information and the bottom field information may be jointly coded or separately coded at either of the phases.
An interlaced I-frame is an intra-coded interlaced video frame comprising two fields, wherein each macroblock comprises information about one or two fields. An interlaced P frame is an interlaced video frame encoded using inter prediction that includes two fields, where each macroblock includes information about one or both fields, as with an interlaced B frame. Interlaced P and B frames may include intra-coded macroblocks as well as various types of inter-predicted macroblocks.
Fig. 5C shows the interlaced video frame 500 of fig. 5A organized into fields 560 for encoding/decoding. Each of the two fields of the interlaced video frame 500 is divided into macroblocks. The top field is divided into macroblocks such as macroblock 561, and the bottom field is divided into macroblocks such as macroblock 562. (again, these macroblocks also use the 4:2:0 format as shown in fig. 4, and the organization of the image into macroblocks or other such regions, and the arrangement of luma and chroma blocks within each macroblock is not shown and may vary.) in the luma plane, macroblock 561 includes 16 lines from the top field, while macroblock 562 includes 16 lines from the bottom field, and each line is 16 samples long.
An interlaced I-field is a single, separately represented field of an interlaced video frame. An interlaced P-field is a single, separately represented field of an interlaced video frame coded using inter-picture prediction, as is an interlaced B-field. Interlaced P- and B-fields may include intra-coded macroblocks as well as different types of inter-picture predicted macroblocks.
Interlaced video frames organized as fields for encoding/decoding may include various combinations of different field types. For example, such a frame may have the same field type (I-field, P-field, or B-field) in both the upper and lower fields, or a different field type in each field.
The term image generally refers to source, encoded, or reconstructed image data. For progressive video, an image is typically a progressive video frame. For interlaced video, an image may refer to an interlaced video frame, the top field of a frame, or the bottom field of a frame, depending on context.
Fig. 5D shows six exemplary spatial arrangements of 4:2:0 chroma sample positions relative to luma sample positions of each field of a video frame.
Alternatively, the encoder 200 and decoder 300 are object-based, use a different macroblock format (e.g., 4:2:2 or 4:4:4) or block format, or perform operations on sets of samples of different size or configuration than 8x8 blocks and 16x16 macroblocks.
B. Video encoder
Fig. 2 is a block diagram of an exemplary video encoder system 200. The encoder system 200 receives a sequence of video images including a current image 205 (e.g., a progressive video frame, an interlaced video frame, or a field of an interlaced video frame) and generates compressed video information 295 as output. Particular embodiments of the video encoder typically use a variant or complementary version of the example encoder 200.
The encoder system 200 uses encoding processes for intra-coded (intra) pictures (I-pictures) and inter-picture predicted (inter) pictures (P- or B-pictures). For the sake of presentation, fig. 2 shows a path for I-pictures through the encoder system 200 and a path for inter-picture predicted pictures. Many of the components of the encoder system 200 are used for compressing both I-pictures and inter-picture predicted pictures. The exact operations performed by those components may vary depending on the type of information being compressed.
An inter-picture predicted picture is represented in terms of a prediction (or difference) from one or more other pictures, which are typically referred to as reference pictures. A prediction residual is the difference between what was predicted and the original picture. In contrast, an I-picture is compressed without reference to other pictures. An I-picture may use spatial prediction or frequency-domain prediction (i.e., intra-picture prediction) to predict some portions of the I-picture using data from other portions of the I-picture itself. For the sake of brevity, however, such I-pictures are not referred to herein as "predicted" pictures, so that the phrase "predicted picture" can be understood to mean an inter-picture predicted picture (e.g., a P- or B-picture).
If the current picture 205 is a predictive picture, the motion estimator 210 estimates motion of a macroblock or other sample set of the current picture 205 relative to one or more reference pictures (e.g., a reconstructed previous picture 225 buffered in the picture store 220). The motion estimator 210 can estimate motion relative to one or more temporally previous reference pictures and one or more temporally future reference pictures (e.g., in the case of a bi-predictive picture). Thus, the encoder system 200 may use separate stores 220 and 222 for multiple reference pictures.
The motion estimator 210 may estimate motion in full samples, 1/2 samples, 1/4 samples, or other increments, and may switch the resolution of the motion estimation on an image-by-image basis or other basis. The motion estimator 210 (and compensator 230) may also switch between types of reference image sample interpolation (e.g., between cubic convolutional interpolation and bilinear interpolation) on a per-frame or other basis. The resolution of the motion estimation may be the same or different horizontally and vertically. The motion estimator 210 outputs motion information 215, such as differential motion vector information, as side information. The encoder 200 encodes the motion information 215 by, for example, calculating one or more predictors for the motion vector, calculating a difference between the motion vector and the predictor, and entropy encoding the difference. To reconstruct the motion vector, the motion compensator 230 combines the prediction value with the motion vector difference information.
Motion compensator 230 applies the reconstructed motion vectors to reconstructed picture 225 to form motion compensated prediction 235. However, the prediction is rarely perfect, and the difference between the motion compensated prediction 235 and the original current picture 205 is the prediction residual 245. During later image reconstruction, an approximation of the prediction residual 245 is added to the motion compensated prediction 235 to obtain a reconstructed image that is closer to the original current image 205 than the motion compensated prediction 235. However, in lossy compression, some information is still lost from the original current picture 205. Alternatively, the motion estimator and motion compensator apply another type of motion estimation/compensation.
Frequency transformer 260 converts the spatial domain video information into frequency domain (i.e., spectral) data. For block-based video coding, the frequency transformer 260 typically applies a Discrete Cosine Transform (DCT), a variant of a DCT, or some other block transform to a block of sample data or prediction residual data, producing a block of frequency-domain transform coefficients. Alternatively, the frequency transformer 260 applies another conventional frequency transform type such as a Fourier transform or uses wavelet or subband analysis. The frequency transformer 260 may apply a frequency transform of 8 × 8, 8 × 4, 4 × 8, 4 × 4, or other size.
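For illustration, a direct separable 2-D DCT-II on an 8x8 block is sketched below; practical encoders use fast integer approximations rather than this floating-point form:

```python
import math

# Illustrative separable 2-D DCT-II over an 8x8 block of samples or
# prediction residual data, producing frequency-domain coefficients.
N = 8

def _c(k):
    # Orthonormal scaling factor for coefficient index k.
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct_1d(v):
    return [_c(k) * sum(v[i] * math.cos(math.pi * (2 * i + 1) * k / (2 * N))
                        for i in range(N)) for k in range(N)]

def dct_2d(block):
    rows = [dct_1d(r) for r in block]  # transform each row
    cols = [dct_1d([rows[i][j] for i in range(N)]) for j in range(N)]
    return [[cols[j][i] for j in range(N)] for i in range(N)]

flat = [[10.0] * 8 for _ in range(8)]  # a constant (flat) block
coeffs = dct_2d(flat)                  # energy concentrates in the DC term
```

For a flat block, all energy lands in the single DC coefficient and every other coefficient is zero, which is why smooth image regions compress so well after the transform.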
Quantizer 270 then quantizes the block of frequency-domain transform coefficients. The quantizer applies scalar quantization to the transform coefficients according to a quantization step size that varies on an image-by-image basis, a macroblock basis, or some other basis, where the quantization step size is a control parameter that governs the evenly spaced intervals between the discretely representable reconstruction points in the decoder dequantizer process, which may also be repeated in the encoder dequantizer process 276. Alternatively, the quantizer applies another type of quantization to the frequency domain transform coefficients, such as a scalar quantizer with non-uniform reconstruction points, a vector quantizer, or a non-adaptive quantization, or quantizes the spatial domain data directly in an encoder system that does not use a frequency transform. In addition to adaptive quantization, encoder 200 may use frame dropping, adaptive filtering, or other techniques for rate control.
When the reconstructed current image is required for subsequent motion estimation/compensation, the inverse quantizer 276 performs inverse quantization on the quantized frequency-domain transform coefficients. The inverse frequency transformer 266 then performs the inverse operation of the frequency transformer 260, resulting in an approximation of the reconstructed prediction residual (for a predicted image) or an approximation of the reconstructed I-picture. If the current picture 205 is an I-picture, an approximation of the reconstructed I-picture is used as an approximation of the reconstructed current picture (not shown). If the current picture 205 is a predicted picture, an approximation of the reconstructed prediction residual is added to the motion compensated prediction 235 to form an approximation of the reconstructed current picture. One or more image stores 220, 222 buffer an approximation of the reconstructed current image for use as a reference image in motion compensated prediction of subsequent images. The encoder may apply a deblocking filter or other image refinement process to the reconstructed frame to adaptively smooth discontinuities from the image and remove other artifacts prior to storing the image approximation in one or more image stores 220, 222.
The entropy coder 280 compresses the output of the quantizer 270 as well as some side information (e.g., motion information 215, quantization step size). Typical entropy coding techniques include arithmetic coding, differential coding, Huffman coding, run-length coding, Lempel-Ziv coding, dictionary coding, and combinations thereof. The entropy encoder 280 typically uses different encoding techniques for different kinds of information (e.g., low frequency coefficients, high frequency coefficients, zero frequency coefficients, different kinds of side information) and may select from multiple code tables within a particular encoding technique.
The entropy encoder 280 provides the compressed video information 295 to a multiplexer [ "MUX" ] 290. The MUX 290 may include a buffer, and a buffer fullness level indicator may be fed back to a bit rate adaptation module for rate control. Before or after the MUX 290, the compressed video information 295 may be channel encoded for transmission over a network. Channel coding may apply error detection and correction data to the compressed video information 295.
C. Video decoder
Fig. 3 is a block diagram of an exemplary video decoder system 300. The decoder system 300 receives information 395 for a compressed sequence of video images and produces an output comprising a reconstructed image 305 (e.g., a progressive video frame, an interlaced video frame, or a field of an interlaced video frame). Particular embodiments of the video decoder typically use a variant or complementary version of the generalized decoder 300.
The decoder system 300 decompresses the predicted image and the I-picture. For the sake of illustration, fig. 3 shows the path through decoder system 300 for an I-picture and the path for a predicted picture. Many of the components of the decoder system 300 are used to decompress both I-pictures and predicted pictures. The exact operations performed by these components may vary depending on the type of information being decompressed.
The DEMUX 390 receives information 395 for the compressed video sequence and makes the received information available to the entropy decoder 380. The DEMUX 390 may include a jitter buffer as well as other buffers. The compressed video information may be channel decoded and processed for error detection and correction before or after the DEMUX 390.
The entropy decoder 380 entropy decodes the entropy-encoded quantized data as well as the entropy-encoded side information (e.g., motion information 315, quantization step size), typically applying the inverse of the entropy encoding performed in the encoder. Entropy decoding techniques include arithmetic decoding, differential decoding, Huffman decoding, run-length decoding, Lempel-Ziv decoding, dictionary decoding, and combinations thereof. The entropy decoder 380 typically uses different decoding techniques for different kinds of information (e.g., low frequency coefficients, high frequency coefficients, zero frequency coefficients, different kinds of side information) and may select from multiple code tables within a particular decoding technique.
The decoder 300 decodes the motion information 315 by, for example, calculating one or more predictors for the motion vector, entropy decoding the motion vector differences (at the entropy decoder 380), and combining the decoded motion vector differences with the predictors to reconstruct the motion vector.
Motion compensator 330 applies motion information 315 to one or more reference pictures 325 to form a prediction 335 of reconstructed picture 305. For example, motion compensator 330 uses one or more macroblock motion vectors to find blocks of samples or to interpolate fractional positions between samples in reference picture 325. One or more image stores (e.g., image stores 320, 322) store previously reconstructed images for use as reference images. Typically, a B picture has more than one reference picture (e.g., at least one temporally previous reference picture and at least one temporally future reference picture). Thus, the decoder system 300 may use separate picture stores 320 and 322 for multiple reference pictures. The motion compensator 330 may compensate motion in full samples, 1/2 samples, 1/4 samples, or other increments, and may switch the resolution of motion compensation on a picture-by-picture basis or other basis. Motion compensator 330 may also switch between types of reference picture sample interpolation (e.g., between cubic convolutional interpolation and bilinear interpolation) on a per-frame or other basis. The resolution of the motion compensation may be the same or different horizontally and vertically. Alternatively, the motion compensator applies another type of motion compensation. The prediction of the motion compensator is rarely perfect and therefore the decoder 300 also reconstructs the prediction residual.
The inverse quantizer 370 inversely quantizes the entropy-decoded data. In general, the inverse quantizer applies a uniform scalar inverse quantization to the entropy decoded data, where the reconstruction step size varies on a picture-by-picture basis, a macroblock basis, or some other basis. Alternatively, the inverse quantizer applies another type of inverse quantization to the data (e.g., a non-uniform, vector, or non-adaptive inverse quantization), or directly inverse quantizes spatial domain data in a decoder system that does not use inverse frequency transforms.
The inverse frequency transformer 360 transforms the inverse quantized frequency domain transform coefficients into spatial domain video information. For block-based video images, the inverse frequency transformer 360 applies inverse DCT [ "IDCT" ], a variant of IDCT, or other inverse block transform to the blocks of frequency transform coefficients, thereby generating sample data or inter-image prediction residual data for I-images or predicted images, respectively. Alternatively, the inverse frequency transformer 360 applies another type of inverse frequency transform, such as an inverse fourier transform or using wavelet or subband synthesis. The inverse frequency transformer 360 may apply an 8 × 8, 8 × 4, 4 × 8, 4 × 4, or other size inverse frequency transform.
For predicted pictures, the decoder 300 combines the reconstructed prediction residual 345 with the motion compensated prediction 335 to form the reconstructed picture 305. When the decoder needs the reconstructed picture 305 for subsequent motion compensation, one or more picture stores (e.g., picture store 320) buffer the reconstructed picture 305 for use in predicting the next picture. In certain embodiments, the decoder 300 may apply a deblocking filter or other image refinement process to the reconstructed image to adaptively smooth discontinuities from the image and remove other artifacts prior to storing the reconstructed image 305 to one or more image stores (e.g., image store 320) or prior to displaying the decoded image during decoded video play-out.
Overview of Multi-resolution encoding and decoding
Video may be encoded (and decoded) at different resolutions. For the purposes of this description, multi-resolution encoding and decoding may be described as frame-based encoding and decoding (e.g., reference picture resampling) or layered (also sometimes referred to as spatially scalable) encoding and decoding. Multi-resolution encoding and decoding may also involve interlaced video, field-based encoding and decoding, and switching between frame-based and field-based encoding and decoding on a specified resolution basis or on some other basis. However, frame coding of progressive video is discussed in this overview for the purpose of simplifying the conceptual description.
A. Frame-based multi-resolution encoding and decoding
In frame-based multi-resolution coding, an encoder encodes input pictures at different resolutions. The encoder selects a spatial resolution for each picture on a picture-by-picture basis or on some other basis. For example, in reference picture resampling, a reference picture can be resampled if it was encoded at a resolution different from that of the picture currently being encoded. The term resampling is used to describe an increase (upsampling) or decrease (downsampling) in the number of samples used to represent an image region or some other portion of a sampled signal. The number of samples per unit area or per signal portion is referred to as the resolution of the sampling.
The spatial resolution may be selected based on, for example, a decrease/increase in available bit rate, a decrease/increase in quantization step size, a decrease/increase in the amount of motion of the input video content, other properties of the video content (e.g., presentation of strong edges, text, or other content that may be significantly distorted at lower resolutions), or on some other basis. The spatial resolution may vary in the vertical, horizontal, or both vertical and horizontal dimensions. The horizontal resolution may be the same as or different from the vertical resolution. The decoder decodes the encoded frames using complementary techniques.
Once the encoder has selected a spatial resolution for the current image or a region within the current image, the encoder resamples the original image to the desired resolution before encoding it. The encoder can then signal this spatial resolution selection to the decoder.
Fig. 6 illustrates a technique (600) for frame-based multi-resolution image encoding. An encoder, such as the encoder shown in fig. 2, sets a resolution for an image (610). For example, the encoder considers the criteria listed above or other criteria. The encoder then encodes the image at that resolution (620). If encoding of all pictures to be encoded is complete (630), the encoder exits. If not, the encoder sets the resolution for the next picture (610) and continues encoding. Alternatively, the encoder may set the resolution at some level other than the image level, such as setting the resolution differently for different portions of an image or making a resolution selection for a group or series of images.
The encoder can encode predicted pictures as well as intra pictures. Fig. 8 illustrates a technique (800) for frame-based multi-resolution encoding of intra pictures and predicted pictures. First, the encoder checks whether the current picture to be encoded is an intra picture or a predicted picture (810). If the current picture is an intra picture, the encoder sets the resolution for the current picture (820). If the picture is a predicted picture, the encoder sets the resolution for the reference picture (830) before setting the resolution for the current picture. After setting the resolution for the current picture, the encoder encodes the current picture at that resolution (840). Setting a resolution for a picture (whether the current source picture or a stored reference picture) may involve resampling the picture to match the selected resolution and may involve encoding a signal to indicate the selected resolution to the decoder. If encoding of all pictures to be encoded is complete (850), the encoder exits. If not, the encoder continues to encode additional pictures. Alternatively, the encoder processes predicted pictures in a different manner.
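As a rough illustration, the control flow of technique (800) can be sketched as follows. This is a hedged sketch only: `pick_resolution`, `resample`, `code_intra`, and `code_inter` are hypothetical stand-ins for encoder policy and core coding operations that this description leaves unspecified.

```python
def encode_sequence(pictures, pick_resolution, resample, code_intra, code_inter):
    """Sketch of frame-based multi-resolution encoding of intra and
    predicted pictures (technique 800). Each coding callable returns a
    (coded_data, reconstructed_picture) pair."""
    reference = None          # reconstructed picture used to predict the next one
    bitstream = []
    for index, picture in enumerate(pictures):
        resolution = pick_resolution(picture)         # steps 820/830
        current = resample(picture, resolution)       # match the selected resolution
        if index == 0:                                # intra picture
            coded, reconstructed = code_intra(current)
        else:                                         # predicted picture
            ref = resample(reference, resolution)     # step 830: resample the reference
            coded, reconstructed = code_inter(current, ref)
        bitstream.append((resolution, coded))         # resolution signaled to the decoder
        reference = reconstructed
    return bitstream
```

In a real encoder the intra/predicted decision would not simply be "first picture is intra," and the resolution signal would be part of the coded syntax; the sketch only mirrors the ordering of steps in fig. 8.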
The decoder decodes the encoded image and, if necessary, resamples the image before display. Similar to the resolution of the encoded image, the resolution of the decoded image may be adjusted in a number of different ways. For example, the resolution of the decoded image may be adjusted to accommodate the resolution of the output display device or a region of the output display device (e.g., for a "picture-in-picture" or PC desktop window display).
Fig. 7 illustrates a technique (700) for frame-based multi-resolution image decoding. A decoder, such as the decoder shown in fig. 3, sets the resolution for an image (710). For example, the decoder obtains resolution information from the encoder. The decoder then decodes the image at that resolution (720). If decoding of all pictures to be decoded is complete (730), the decoder exits. If not, the decoder sets the resolution for the next picture (710) and continues decoding. Alternatively, the decoder sets the resolution at some level other than the picture level.
The decoder can decode the predicted image as well as the intra image. Fig. 9 illustrates a technique (900) for frame-based multi-resolution intra picture and predictive picture decoding.
First, the decoder checks whether the current picture to be decoded is an intra picture or a predicted picture (910). If the current picture is an intra picture, the decoder sets the resolution for the current picture (920). If the picture is a predicted picture, the decoder sets the resolution for the reference picture (930) before setting the resolution for the current picture (920). Setting the resolution of the reference picture may involve resampling the stored reference picture to match the selected resolution. After setting the resolution for the current picture (920), the decoder decodes the current picture at that resolution (940). If decoding of all pictures to be decoded is complete (950), the decoder exits. If not, the decoder continues decoding.
The decoder typically decodes the image at the same resolution as used by the encoder. Alternatively, the decoder may decode the image at a different resolution, such as in the case where the resolution available to the decoder cannot be exactly the same as the resolution used in the encoder.
B. Layered multi-resolution encoding and decoding
In layered multi-resolution coding, an encoder encodes video in layers, where each layer provides information for decoding the video at a different resolution. In this way, the encoder encodes at least some individual pictures within the video at more than one resolution. The decoder may then decode the video at one or more resolutions by processing different combinations of the layers. For example, a first layer (sometimes referred to as a base layer) contains information for decoding the video at a lower resolution, while one or more other layers (sometimes referred to as enhancement layers) contain information for decoding the video at higher resolutions.
The base layer itself may be designed to be an independently decodable bitstream. Thus, in such a design, a decoder that decodes only the base layer produces valid decoded video at the lower resolution of the base layer. Proper decoding of higher resolution images using enhancement layers may require decoding some or all of the encoded base layer data and possibly one or more enhancement layers. A decoder that decodes the base layer and one or more other higher resolution layers will be able to generate higher resolution content than a decoder that decodes only the base layer. Two, three, or more layers may be used to allow two, three, or more different resolutions. Alternatively, a higher resolution layer may itself be an independently decodable bitstream. (Such a design is often referred to as a simulcast multi-resolution coding method.)
Fig. 10 illustrates a technique (1000) for encoding bitstream layers to allow decoding at different resolutions. An encoder, such as encoder 200 shown in fig. 2, obtains full resolution video information as input (1010). The encoder downsamples the full resolution video information (1020) and encodes a base layer using the downsampled information (1030). The encoder encodes one or more higher resolution layers using the base layer and the higher resolution video information (1040). A higher resolution layer may be a layer that allows decoding at full resolution or a layer that allows decoding at some intermediate resolution. The encoder then outputs a layered bitstream comprising two or more encoded layers. Alternatively, the encoding (1040) of a higher resolution layer may not use base layer information, thereby enabling independent decoding of that higher resolution layer's data for a simulcast multi-resolution encoding method.
The encoder can implement the encoding of the multi-resolution layer in various ways according to the basic outline shown in fig. 10. For more information, see, for example, U.S. Pat. No.6,510,177, or the MPEG-2 standard or other video standards.
Fig. 11 shows a technique (1100) for decoding bitstream layers to allow decoding of video at different resolutions. A decoder, such as decoder 300 shown in fig. 3, takes a layered bitstream as input (1110). The layers include a lower resolution layer (base layer) and one or more layers containing higher resolution information. The higher resolution layers need not contain independently encodable pictures; in general, a higher resolution layer includes residual information describing the difference between higher and lower resolution versions of the respective pictures. The decoder decodes the base layer (1120), and if higher resolution decoding is desired, the decoder upsamples the decoded base layer pictures to the desired resolution (1130). The decoder decodes one or more higher resolution layers (1140) and combines the decoded higher resolution information with the upsampled decoded base layer pictures to form higher resolution images (1150). Depending on the desired resolution level, a higher resolution image may be a full resolution image or an intermediate resolution image. For more information, see, for example, U.S. Pat. No.6,510,177, or the MPEG-2 standard or other video standards.
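A minimal sketch of this decoding flow follows, with `decode_base`, `decode_residual`, and `upsample` as hypothetical placeholders for the codec components involved (pictures are represented as 2-D lists of sample values):

```python
def decode_layered(base_bits, enhancement_bits,
                   decode_base, decode_residual, upsample):
    """Sketch of technique (1100): base layer decoding, optional upsampling
    to the enhancement resolution, and addition of the decoded residual."""
    base_picture = decode_base(base_bits)         # step 1120
    if enhancement_bits is None:
        return base_picture                       # base-layer-only decoding
    prediction = upsample(base_picture)           # step 1130
    residual = decode_residual(enhancement_bits)  # step 1140
    # step 1150: higher resolution picture = upsampled base + residual
    return [[p + r for p, r in zip(p_row, r_row)]
            for p_row, r_row in zip(prediction, residual)]
```

Real codecs would also clip the sum to the valid sample range and may predict only parts of the picture from the base layer; the sketch shows only the layering arithmetic.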
The decoder typically decodes the image at one of the resolutions used by the encoder. Alternatively, the resolution available to the decoder may not be exactly the same as the resolution used in the encoder.
Resampling filters for scalable video coding and decoding
This section describes techniques and tools for scalable video encoding and decoding. Although some of the described techniques and tools are presented in a layered (or spatial scalability) context, some of them may also be used in a frame-based (or reference picture resampling) context, or in some other context that involves resampling filters. Further, although some of the described techniques and tools are presented in the context of resampling pictures, some of them may also be used to resample residual or difference signals that result from prediction from a higher resolution signal.
Scalable Video Coding (SVC) is a type of digital video coding that allows a subset of a larger bitstream to be decoded to produce decoded pictures whose quality is acceptable for some applications, although the picture quality is lower than that produced by decoding the entire higher bit rate bitstream. One well-known type of SVC is referred to as spatial scalability, or resolution scalability. In a spatial SVC design, the encoding process (or a pre-processing function performed prior to the encoding process, depending on the precise definition of the scope of the encoding process) typically includes downsampling the video to a lower resolution and encoding the lower resolution video to enable a lower resolution decoding process, as well as upsampling the lower resolution decoded pictures for use as a prediction of the values of samples in the higher resolution video pictures. The decoding process for the higher resolution video then includes decoding the lower resolution video (or portions thereof) and using the upsampled video as a prediction of the values of samples in the higher resolution video pictures. These designs require the use of resampling filters. More specifically, such codec designs involve the use of upsampling filters in both the decoder and the encoder, and downsampling filters in either the encoder or a pre-encoding processor. Particular attention is paid here to the upsampling filters used in such designs. In general, the upsampling process is designed to be the same in both the encoder and the decoder, in order to prevent a phenomenon known as drift, which is an accumulation of errors resulting from the use of differing predictions of the same signal during encoding and decoding.
One big drawback of some spatial SVC designs is the use of low-quality filters (e.g., two-tap bilinear filters) in the decoding process. Using higher-quality filters would be beneficial to video quality.
Spatial SVC may include resampling filters that enable a high degree of flexibility in the resampling ratio of the filter. However, this may require developing a large number of particular filter designs, and storing the "tap" values of these filters, for each different "phase" of filtering in encoder and decoder implementations.
Furthermore, it benefits video quality to allow the encoder to control the amount of blurring of the resampling filter used for spatial SVC. Thus, for each "phase" of resampling, whether for upsampling or downsampling, it may be advantageous to select from among several different filters, depending on the degree of blurring to be introduced in the process. The selected degree of blurring to be performed during upsampling may be sent from the encoder to the decoder as information conveyed for use by the decoding process. This additional flexibility further complicates the design, because it greatly increases the number of tap values that would need to be stored in the encoder or decoder.
A unified design may be used to specify a variety of resampling filters with various phases and various degrees of blurring. One possible solution is to use the Mitchell-Netravali filter design method. A straightforward application of the Mitchell-Netravali filter design method to these problems may require excessive computational resources, in the form of an excessively large dynamic range of possible values for the quantities to be computed in the encoder or decoder. For example, one such design might require the use of 45-bit arithmetic processing, rather than the 16-bit or 32-bit processing elements commonly used in general-purpose CPUs and DSPs. To address this problem, some design refinements are provided.
A typical SVC design requires a normative upsampling filter for spatial scalability. To support arbitrary resampling ratios (a feature known as extended spatial scalability), an upsampling filter design is described that provides a great deal of flexibility with respect to the resampling ratio. Another key aspect is the relative alignment of luminance and chrominance. While various alignment structures are found in single-layer approaches (see, e.g., the differing H.261/MPEG-1 versus MPEG-2 alignments for 4:2:0 chroma, and H.264/MPEG-4 AVC), the described techniques and tools support various types of alignment flexibly, in a way that lets an encoder easily indicate to a decoder how to properly apply the filtering.
The techniques and tools include an upsampling filter that enables high quality upsampling and good anti-aliasing. More specifically, the described techniques and tools have qualities that are superior to those provided by previous bilinear filter designs for spatial scalability. The described techniques and tools have high quality upsampling filters that are visually pleasing and can provide good signal processing frequency behavior. The described techniques and tools include filter designs that are simple to specify and do not require large memory storage tables to hold tap values, and the filtering operation itself is computationally simple. For example, the described techniques and tools have filters that are not excessively lengthy and do not require excessive mathematical precision or extremely complex mathematical functions.
This section describes designs having one or more of the following features:
- flexibility of luminance/chrominance phase alignment;
-flexibility of resampling ratios;
-flexibility of frequency characteristics;
-high visual quality;
not too few nor too many filter taps (e.g. between 4 and 6);
-specified simple;
- simple operation (e.g., arithmetic using practical word lengths).
A. Mitchell-Netravali upsampling filter
The described techniques and tools employ a separable filtering approach, so the discussion that follows focuses primarily on the processing of one-dimensional signals, since the two-dimensional case is a simple separable application of the one-dimensional case. First, a two-parameter set of filters is proposed, based on the conceptual continuous impulse response h(x) given by:

    h(x) = (1/6) · ((12 − 9b − 6c)·|x|³ + (−18 + 12b + 6c)·|x|² + (6 − 2b)),                    for |x| < 1
    h(x) = (1/6) · ((−b − 6c)·|x|³ + (6b + 30c)·|x|² + (−12b − 48c)·|x| + (8b + 24c)),          for 1 ≦ |x| < 2
    h(x) = 0,                                                                                   otherwise        (1)

where b and c are the two parameters. For a relative phase offset position 0 ≦ x < 1, this kernel produces a 4-tap Finite Impulse Response (FIR) filter with tap values given by the following matrix equation:

    [t₋₁(x)  t₀(x)  t₁(x)  t₂(x)] = (1/6) · [x³  x²  x  1] ·

        [ −b − 6c      12 − 9b − 6c      −12 + 9b + 6c     b + 6c ]
        [ 3b + 12c     −18 + 12b + 6c    18 − 15b − 12c    −6c    ]
        [ −3b − 6c     0                 3b + 6c           0      ]
        [ b            6 − 2b            b                 0      ]        (2)

where t₋₁(x), t₀(x), t₁(x), and t₂(x) are the tap values applied to the four samples surrounding the fractional position x.
in practice, it is sufficient to consider only x in the range from 0 to 1/2, since the FIR filter kernels for x are exactly the FIR filter kernels for 1-x in the reverse order.
This design has a number of interesting and useful attributes. Some of which are listed below:
no trigonometric, transcendental or irrational processing is required to calculate the filter tap values. In practice, the tap values of this filter can be calculated directly with few simple operations. It is not necessary to store these tap values for the various possible parameter values and phases to be used, since these values can be simply calculated if desired. (thus, to standardize the use of these filters, only a small number of formulas are required-rather than requiring large tables of multiple or standardized attempts to approximate functions like cosines or Bessel functions.
The resulting filter has 4 taps. This is a very practical number.
The filter has only a single side lobe on each side of its main lobe, so it does not produce excessive ringing effects at edges.
The filter has a smoothed impulse response. Its value and its first derivative are continuous.
It has a unity-gain DC response, meaning that there is no overall brightness amplification or attenuation of the information being upsampled.
The members of this filter family include relatively good approximations of well-known good filters, such as the "Lanczos-2" design and the "Catmull-Rom" design.
Further, the described techniques and tools include a specific relationship between the two parameters for selecting visually pleasing filters. This relationship can be expressed as follows:

    c = (1 − b) / 2        (3)
this reduces the degree of freedom to a single bandwidth control parameter b. This parameter controls the degree of extra blurring introduced by the filter. Note that the members of the series associated with the value b-0 are excellent and well known Catmull-Rom upsampling filters (also known as key "cubic convolution" interpolation filters).
In addition to the basic advantages shared by all members of the Mitchell-Netravali filter family, the Catmull-Rom upsampling filter itself has a number of good properties:
it is an "interpolation" filter-i.e. for phase values x 0 and x 1 the filter has a single non-zero tap equal to 1. In other words, the up-sampled signal will pass the value of the input sample exactly at the edge of each up-sampled curve segment.
If the input samples form a parabola (or a straight line, or a constant value), the output points will fall exactly on the parabolic curve (or the straight line, or the constant value).
Indeed, in some ways the Catmull-Rom upsampler may be considered the best upsampling filter of this length for these reasons, although introducing some additional blurring (increasing b) may sometimes be more visually pleasing. Also, introducing some extra blurring can help mask low bit rate compression artifacts, and the filter can then act somewhat more like a Wiener filter (a well-known filter for noise filtering) estimate of the true upsampled image.
Simply substituting equation (3) into equation (2) results in the following tap values:

    [t₋₁(x)  t₀(x)  t₁(x)  t₂(x)] = (1/6) · [x³  x²  x  1] ·

        [ 2b − 3     9 − 6b      6b − 9      3 − 2b ]
        [ 6 − 3b     9b − 15     12 − 9b     3b − 3 ]
        [ −3         0           3           0      ]
        [ b          6 − 2b      b           0      ]        (4)
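As a numeric check on the single-parameter form, the tap values can be computed directly. The polynomial coefficients below follow from substituting c = (1 − b)/2 into the two-parameter kernel; this is an illustrative sketch, not normative code:

```python
def blur_taps(x, b):
    """4-tap values for phase offset 0 <= x < 1 with bandwidth control b
    (b = 0 gives the Catmull-Rom filter)."""
    return [((2*b - 3) * x**3 + (6 - 3*b) * x**2 - 3*x + b) / 6,
            ((9 - 6*b) * x**3 + (9*b - 15) * x**2 + (6 - 2*b)) / 6,
            ((6*b - 9) * x**3 + (12 - 9*b) * x**2 + 3*x + b) / 6,
            ((3 - 2*b) * x**3 + (3*b - 3) * x**2) / 6]
```

At b = 0 the filter interpolates (taps [0, 1, 0, 0] at x = 0) and reproduces a parabola exactly, as claimed for the Catmull-Rom case above.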
it is reported that based on subjective tests with 9 expert observers and over 500 samples, one can obtain:
the usable range is reported as 0 ≦ b ≦ 5/3;
0 ≦ b ≦ 1/2 is classified as visually "satisfactory," reported as visually pleasing when b 1/3;
-b > 1/2 is classified as "fuzzy", reported as very fuzzy when b 3/2
B. Integer quantization of bandwidth control parameters
Dividing by 6 as in equation (4) may not be appropriate in an integer implementation. Instead, it is desirable to integerize the bandwidth control parameter and the filter tap values, since infinite precision is not feasible as part of a decoder design. Consider a new integer-valued variable a, defined as:

    a = (b/6) · 2^S        (5)
where S is an integer shift factor and a is an unsigned integer used as an integer bandwidth control parameter. The parameter a may be encoded by the encoder as a syntax element at the video sequence level in the bitstream. For example, the parameter a may be explicitly coded with a variable or fixed length code, jointly coded with other information, or implicitly signaled. Alternatively, the parameter a may be signaled at some other level in the bitstream.
Substituting equation (5) into equation (4) results in integerized tap values:

    [t₋₁(x)  t₀(x)  t₁(x)  t₂(x)] = 2^(−S) · [x³  x²  x  1] ·

        [ 2a − 2^(S−1)     3·2^(S−1) − 6a     6a − 3·2^(S−1)    2^(S−1) − 2a ]
        [ 2^S − 3a         9a − 5·2^(S−1)     2^(S+1) − 9a      3a − 2^(S−1) ]
        [ −2^(S−1)         0                  2^(S−1)           0            ]
        [ a                2^S − 2a           a                 0            ]        (6)
the result then needs to be scaled down by S positions in a binary arithmetic process.
If a ranges from 0 to M, then b ranges from 0 to 6M/2^S. Some potentially useful choices for M include the following:
- M = 2^(S−2) − 1, resulting in b ranging from 0 to 3/2 − 6/2^S;
- M = Ceil(2^S/6), i.e., the smallest integer greater than or equal to 2^S/6, resulting in b ranging from 0 to slightly greater than 1;
- M = 2^(S−3) − 1, resulting in b ranging from 0 to 3/4 − 6/2^S.
Each of these choices of M is large enough to cover the most useful cases, with the first choice (M = 2^(S−2) − 1) being the largest of the three options. A useful range for S is between 6 and 8. For example, consider S = 7 and M = 2^(S−2) − 1, i.e., M = 31. Alternatively, other values of M and S can be used.
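A small numeric check of these ranges (illustrative only; the relation b = 6a/2^S follows from equation (5)):

```python
def blur_upper_bound(M, S):
    """Largest b = 6*a / 2**S reachable when the integer parameter a
    ranges from 0 to M."""
    return 6 * M / 2 ** S

S = 7
M1 = 2 ** (S - 2) - 1        # first choice: 31
M2 = -(-2 ** S // 6)         # Ceil(2**S / 6) = 22
M3 = 2 ** (S - 3) - 1        # third choice: 15
```

With S = 7 the three choices give upper bounds of 1.453125, 1.03125, and 0.703125 for b, matching the formulas above.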
C. Integer quantization of fractional sample positioning
Next, consider the granularity of the x value. For practicality, x should also be approximated. For example, integer i may be defined as follows:
x = i ÷ 2^F (7)
where F represents the supported fractional sample position accuracy. For an example of a sufficiently accurate resampling operation, consider F ≧ 4 (1/16-sample or finer positioning accuracy). This yields the following integer filter tap values:
for example, consider F ═ 4. The result then needs to be scaled down by 3F + S positions.
Note that each element in the above matrix contains a factor of 2 (assuming S is greater than 1). The tap values can then instead be formulated as follows:
where each tap value has been divided by 2. The result then only requires scaling down 3F + S-1 positions.
For the scale-down, a function RoundingRightShift(p, R) is defined as the result of right-shifting the input value p by R bits with rounding, calculated as follows:
RoundingRightShift(p, R) = (p + 2^(R-1)) >> R (10),
where the symbol ">>" refers to a binary arithmetic right shift using 2's complement binary arithmetic. Alternatively, the rounding right shift may be performed differently.
Some exemplary applications of rounding right shifts are provided below.
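As a sketch, the rounding right shift described above (adding half of the divisor before an arithmetic right shift) can be written as:

```python
def rounding_right_shift(p, r):
    # add 2^(r-1) before shifting; Python's >> on negative integers is an
    # arithmetic (two's complement, flooring) right shift, matching the
    # behavior described in the text
    return (p + (1 << (r - 1))) >> r
```

For example, rounding_right_shift(5, 1) gives 3 and rounding_right_shift(-5, 1) gives -2, with halfway cases rounded upward.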
D. Dynamic range consideration
If the image is filtered with an N-bit sample bit depth, and this is done two-dimensionally before any rounding is performed, a dynamic range of 2*(3F+S-1) + N + 1 bits will be required in the accumulator before scaling the result down by 2*(3F+S-1) positions and limiting the output to the N-bit range. For example, if F = 4, S = 7, and N = 8, a 45-bit accumulator would be needed to calculate the result of the filtering.
Some methods of alleviating this problem will be discussed in the following subsections. These methods may be used separately from each other or in combination with each other. It should be appreciated that variations to the dynamic range mitigation methods are possible based on the disclosure herein.
1. First exemplary dynamic range mitigation method
Consider an example in which horizontal filtering is performed first, followed by vertical filtering, and consider a maximum word size of W bits for any point in the two-dimensional processing pipeline. In the first dynamic range mitigation method, to achieve this, an R_H-bit rounding right shift is applied at the output of the first (horizontal) stage of the process, and an R_V-bit rounding right shift is applied at the output of the second (vertical) stage.
The following calculation can then be made:
2*(3F+S-1) + N + 1 - R_H = W (11),
and thus
R_H = 2*(3F+S-1) + N + 1 - W (12).
The right shift of the second (vertical) stage is then calculated from:
R_H + R_V = 2*(3F+S-1) (13),
and thus
R_V = 2*(3F+S-1) - R_H (14).
For example, for F = 4, S = 7, N = 8, and W = 32, the result is R_H = 13 and R_V = 23. Thus, instead of a 45-bit dynamic range, the dynamic range is reduced to 32 bits by using the rounding right shifts. Different values of W result in different numbers of right-shift bits.
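Equations (12) and (14) can be exercised directly; the sketch below reproduces the worked example in the text:

```python
def stage_shifts(F, S, N, W):
    # R_H: rounding right shift after the first (horizontal) stage, eq. (12)
    # R_V: rounding right shift after the second (vertical) stage, eq. (14)
    total = 2 * (3 * F + S - 1)
    R_H = total + N + 1 - W
    R_V = total - R_H
    return R_H, R_V

print(stage_shifts(4, 7, 8, 32))  # (13, 23), the example values from the text
```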
2. Second exemplary dynamic range mitigation method
The second dynamic range mitigation method involves reducing the accuracy of the tap values, rather than reducing the accuracy of the phase positioning (i.e., reducing F), reducing the granularity of the filter bandwidth adjustment parameter (i.e., reducing S), or reducing the accuracy of the first-stage output (i.e., increasing R_H).
The four integer tap values generated by equation (9) are denoted [t_-1, t_0, t_1, t_2]. Note that the sum of the four filter tap values will be equal to 2^(3F+S-1), namely:
t_-1 + t_0 + t_1 + t_2 = 2^(3F+S-1) (15).
This is an important property for the exemplary dynamic range mitigation method, because it ensures that the output will have the same value as the input whenever the four input samples all have the same value.
Using the exemplary definition of the rounding right shift found in equation (10), and given an amount of right shift R_t for the tap values, the following are defined:
u_-1 = RoundingRightShift(t_-1, R_t);
u_1 = RoundingRightShift(t_1, R_t);
u_2 = RoundingRightShift(t_2, R_t);
u_0 = 2^(3F+S-1-R_t) - u_-1 - u_1 - u_2.
Filtering is then performed using the tap values [u_-1, u_0, u_1, u_2] rather than [t_-1, t_0, t_1, t_2]. Each increase of 1 in the value of R_t reduces the required dynamic range in the arithmetic accumulator by 1 bit, and the right shift performed in the subsequent processing stage is also reduced by 1 bit.
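A sketch of this tap requantization follows. The tap values below are made up solely so that they sum to 2^(3F+S-1); they are not taps from equation (9). The key point is that the center tap is derived rather than rounded, so the reduced-precision taps still sum to an exact power of 2 and DC gain is preserved:

```python
def rrs(p, r):
    # rounding right shift, per equation (10)
    return (p + (1 << (r - 1))) >> r

def reduce_tap_precision(taps, F, S, Rt):
    # round three of the taps; derive the center tap so that the
    # reduced-precision taps sum to 2^(3F+S-1-Rt)
    t_m1, t_0, t_1, t_2 = taps
    u_m1, u_1, u_2 = rrs(t_m1, Rt), rrs(t_1, Rt), rrs(t_2, Rt)
    u_0 = (1 << (3 * F + S - 1 - Rt)) - u_m1 - u_1 - u_2
    return [u_m1, u_0, u_1, u_2]

F, S, Rt = 4, 7, 2
taps = [-10000, 150000, 130000, -7856]      # made-up values summing to 2^18
assert sum(taps) == 1 << (3 * F + S - 1)
u = reduce_tap_precision(taps, F, S, Rt)
assert sum(u) == 1 << (3 * F + S - 1 - Rt)  # DC property preserved at reduced scale
```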
3. Third exemplary dynamic range mitigation method
This method uses an approach similar in concept to the first exemplary dynamic range mitigation method, except that the amount of right shift applied after the first stage of the process is a function of the value of the phase positioning variable i.
It will be appreciated that when the value of i is an integer multiple of 2^K, the filter tap values shown in equation (9) will contain K zero-valued LSBs. Thus, if the second stage of the filtering process uses only values of the phase positioning variable i that are integer multiples of 2^K, then the tap values of the second stage can be shifted right by K bits and the amount of right shift after the first stage can be reduced by K bits.
This becomes quite difficult to track when operating with typical resampling factors. However, when performing resampling by a simple factor such as 2:1, it is easy to confirm that all phases used in the second stage of the filtering process contain the same factor of 2^K, allowing the method to be applied in these specific situations.
Techniques and tools for position calculation
Techniques and tools for computing positioning information for spatial SVC are described.
Some techniques and tools relate to starting from a word length B and optimizing computational accuracy within that word-length constraint. Rather than simply selecting a precision and then requiring whatever word length it implies, this approach yields higher precision in real implementations and broadens the effective application range of the technique, because it uses all of the available word length to maximize accuracy within the constraint.
Some techniques and tools involve a) offsetting the origin of the coordinate system and b) using unsigned integers instead of signed integers to achieve a better tradeoff between accuracy and word length/dynamic range. A small amount of computation is added to apply an origin offset term to each calculated position.
Some techniques and tools involve dividing the computation into stages in which different portions of the sample string are generated, where the origin of the coordinate system changes at the beginning of each stage. Likewise, this provides a better trade-off between accuracy and word length/dynamic range at another small increase in computational requirements (since some extra computation is performed at the beginning of each stage). If this technique is taken to its logical extreme, the need for multiplication operations can be eliminated, further improving the tradeoff between precision and word length/dynamic range. However, some additional operations then need to be performed for each sample (since the extra computation required "per stage" becomes required per sample if each stage contains only one sample).
As a general matter, the described design is used in the position calculation part of the processing to achieve a desired tradeoff between the accuracy of the calculation results, the word length/dynamic range of the processing elements, and the number and type of mathematical operations involved in the processing (e.g., shift, addition, and multiplication operations).
For example, the described techniques and tools allow flexible precision computations using B-bit (e.g., 32-bit) arithmetic. This allows the spatial SVC encoder/decoder to flexibly accommodate different image sizes without the need to convert to different arithmetic (e.g., 16-bit or 64-bit arithmetic) for computation. Using flexible precision B-bit (e.g., 32-bit) arithmetic, the encoder/decoder is able to use a flexible number of bits for the fractional component. This allows for increased computational accuracy as the number of bits required to represent the integer component decreases (e.g., for smaller frame sizes). As the number of bits required to represent the integer component increases (e.g., for larger frame sizes), the encoder/decoder can use more bits for the integer component and less bits for the fractional component, thereby reducing precision but maintaining B-bit arithmetic. The variation between different precisions and different frame sizes can thus be greatly simplified.
This section includes specific details for an exemplary implementation. However, it should be noted that the specific details described herein may be varied in other implementations in accordance with the principles described herein.
A. Introduction and location calculation principles
Techniques are described for computing position and phase information to achieve much lower computational requirements without any significant loss of accuracy. For example, the described techniques can significantly reduce computational requirements, such as by dynamically reducing nominal dynamic range requirements (e.g., by tens of bits). In view of the variety of possible chroma positions that may be used in the base and enhancement layers, it is desirable to find a solution that provides for proper positioning of resampled chroma samples relative to luma samples. Thus, the described techniques allow adjustments to be made with different relationships between luma and chroma positions to calculate the position of the video format.
Previous upsampling methods designed for extended spatial scalability use a rather cumbersome method to compute the position and phase information when upsampling the low-resolution layer: they scale by an up-shifted approximate inverse of the denominator, which magnifies the rounding error of the inverse approximation as the numerator increases (i.e., as the upsampling process moves from left to right or top to bottom). By comparison, the techniques described herein have excellent accuracy and simplified computation. More specifically, the techniques reduce the dynamic range and the amount of right shifting in the position calculations by tens of bits.
For example, one technique is described for calculating position information to obtain an integer position and a phase positioning variable i, where i = 0 .. 2^F - 1, for use in SVC spatial upsampling.
The described techniques apply the resampling process to spatially scalable video coding applications, rather than to warping for reference picture resampling. In this spatially scalable coding application, some simplifications may be applied. Instead of a general warping process, only an image resizing operation is required, and this can be designed separately for each dimension.
B. Position calculation design
Consider the following problem statement: in each dimension (x or y), a string of samples is conceptually generated in the new (upsampled) array over a real-valued range of positions from L to R > L. This real-valued range corresponds to the range from L' to R' > L' in the reference low-resolution array.
For a location T in the new array, where L ≦ T ≦ R, the corresponding location in the reference array needs to be calculated. This is the position T' = L' + (T-L)*(R'-L')/(R-L).
Now, instead of considering resizing the range from L to R, define an integer M > 0 and consider resizing the range from L to L + 2^M by the same resizing ratio (R'-L')/(R-L). The corresponding range in the reference sample coordinates is then from L' to R'', where R'' = L' + 2^M*(R'-L')/(R-L). If M is sufficiently large, i.e., if M ≧ Ceil(Log2(R-L)), then R'' ≧ R'. (It is assumed for now that this constraint holds in order to explain the following concepts, although it is not actually necessary for the proper functioning of the equations.)
Linear interpolation between the positions L' and R'' can now be used for the position calculation. Location L maps to location L', and a location T ≧ L maps to location ((2^M - (T-L))*L' + (T-L)*R'') ÷ 2^M. This converts the denominator of the operation to a power of 2, reducing the computational complexity of the division by allowing it to be replaced with a binary right shift.
Appropriate modifications may be made to integerize this calculation. Round the values of L' and R'' to integer multiples of 1 ÷ 2^G, where G is an integer, so that L' is approximated by k ÷ 2^G and R'' is approximated by r ÷ 2^G, where k and r are integers. Using this adjustment, position T can be mapped to position ((2^M - (T-L))*k + (T-L)*r) ÷ 2^(M+G).
Now assume that T and L are expressed in units of 1 ÷ 2^J, where J is an integer, such that T - L = j ÷ 2^J. Using this adjustment, position T can be mapped to position ((2^(M+J) - j)*k + j*r) ÷ 2^(M+G+J).
Recall from section IV above that the fractional phase of the resampling filter is expressed as an integer in units of 1 ÷ 2^F. Therefore, in these units, the calculated position is Round(((2^(M+J) - j)*k + j*r) ÷ 2^(M+G+J-F)), or
t' = ((2^(M+J) - j)*k + j*r + 2^(M+G+J-F-1)) >> (M+G+J-F) (16),
Or, more simply,
t' = (j*C + D) >> S (17),
wherein
S = M + G + J - F (18),
C = r - k (19),
D = (k << (M+J)) + (1 << (S-1)) (20).
Assuming no error in the representation of L and R and of L' and R', the only error introduced by the method described here prior to the final rounding of the calculated position to the nearest multiple of 1 ÷ 2^F (a rounding error present in both designs) is the error from rounding the position R'' to the nearest multiple of 1 ÷ 2^G. This amount will be small when G + M is relatively large. In fact, this error source is tightly bounded to approximately (T-L) ÷ 2^(G+M+1). The word-size requirement for computing the result is moderate, and modulo arithmetic allows the integer part of the result to be dropped to minimize the word size, or allows the computation to be decomposed in other similar ways.
F may be 4 or greater, for example. (For some applications, F = 3 or F = 2 may be sufficient.) Examples of J values include J = 1 for luma position calculations and J = 2 for chroma sample positions. The rationale for these example J values is given below.
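As a numeric sketch of the mapping above (the 2:1 ratio, the endpoint values, and the parameter choices below are illustrative assumptions, not values mandated by the text):

```python
F, G, J, M = 4, 2, 1, 18          # illustrative parameter choices

# Hypothetical 2:1 luma upsampling: new range [L, R], reference range [L', R']
L, R = -0.5, 1999.5               # 2000-sample upsampled array
Lp, Rp = -0.5, 999.5              # 1000-sample reference array

k = round(Lp * (1 << G))                                     # L'  ~= k / 2^G
r = round((Lp + (1 << M) * (Rp - Lp) / (R - L)) * (1 << G))  # R'' ~= r / 2^G

S = M + G + J - F                 # equation (18)
C = r - k                         # equation (19)
D = (k << (M + J)) + (1 << (S - 1))   # equation (20)

def position(j):
    # equation (17): reference position in 1/2^F units for j = (T - L) * 2^J
    return (j * C + D) >> S

# cross-check against the exact real-valued map T' = L' + (T-L)(R'-L')/(R-L)
for j in range(40):
    T = L + j / (1 << J)
    exact = (Lp + (T - L) * (Rp - Lp) / (R - L)) * (1 << F)
    assert abs(position(j) - exact) <= 0.5
```

For this simple 2:1 ratio the integer formula matches the exact mapping with no error at all, as noted later in the text for simple resampling ratios.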
1. First exemplary simplified position calculation technique using signed B-bit arithmetic
If R' > 0 and L' > -R', then all positions t' computed in the image to be upsampled, treated as integers in units of 1 ÷ 2^F, lie between -2^Z and 2^Z - 1, where Z = Ceil(Log2(R')) + F. If the word size for calculating (j*C + D) is B bits and signed 2's complement arithmetic is assumed, then B-1 ≧ Z + S may be required. High precision is achieved if this constraint is tight, i.e., if B-1 = Z + M + G + J - F.
For fairly small picture sizes (e.g., up to level 4.2 in the current H.264/MPEG-4 AVC standard), B = 32 may be used as the word size. Other values of B may also be used. For very large images, a larger B may be used. The calculation can also easily be decomposed into smaller word-size sub-calculations for use on 16-bit or other processors.
The remaining two degrees of freedom are M and G. Their relationship is flexible as long as G is large enough to avoid any need for rounding error when representing L' as k ÷ 2^G. Then, based on the issues discussed in the next section for SVC, G = 2 can be chosen, resulting in:
M = B + F - (G + J + Z + 1),
that is,
M = 32 + 4 - (2 + 1 + Z + 1),
that is,
M = 32 - Z.
For example, if it is desired to upsample an image luminance array having a width of 1000 luminance samples with B = 32 and L' = 0, then F = 4, G = 2, J = 1, M = 18, S = 17, and Z = 14 may be used with this first exemplary position calculation technique.
When T is very close to (or equal to) R and R' is very close to (or equal to) an integer power of 2, in particular when the fractional part of the computed position in units of 1 ÷ 2^F is large (e.g., greater than 1/2), there is some chance that the assumed upper bound will be violated by 1. These cases are not considered further here, although the adjustments to handle them are straightforward.
2. Second exemplary position calculation technique Using unsigned B-bit arithmetic
If all the positions calculated in the low-resolution image are greater than or equal to 0, which can sometimes be ensured by adding a suitable offset to the origin of the coordinate system, then using unsigned integer arithmetic rather than signed 2's complement arithmetic to calculate t' = (j*C + D) >> S is a better option. This allows one more bit of dynamic range without overflow in the calculation (i.e., a dynamic range of B bits can be used instead of B-1 bits), thereby adding 1 to M (or G) and to S, and further increasing the accuracy of the calculation result. Thus, after an offset E is included to adjust the origin of the coordinate system, the calculation may take the form t' = ((j*C + D') >> S) + E instead of just t' = (j*C + D) >> S.
By identifying when the origin offset E will not be needed, further details regarding the more accurate method involving unsigned arithmetic are provided below.
- Select values for B, F, G, J, and Z as described above.
- Set M = B + F - (G + J + Z).
- Calculate S, C, and D as specified in equations (18), (19), and (20) above, respectively, where D is calculated as a signed number.
If D is greater than or equal to 0, then no origin offset is needed (i.e., E is not used), and the calculation can simply be performed using unsigned arithmetic as t' = (j*C + D) >> S, with a result that is more accurate than that of the first exemplary position calculation technique described above in section V.B.1.
In addition to increasing accuracy by enabling computations using unsigned integers, an offset origin may sometimes be used to provide improved accuracy by enabling a drop in the Z value. Without an origin offset, Z is a function of R'. But with the origin offset, Z can be made a function of R '-L', which would make the calculation more accurate if it resulted in a smaller value of Z.
Further details regarding this more accurate method involving unsigned arithmetic are provided below, showing a way to offset the origin by deriving D' and E.
- Select values for B, F, G, and J as described above.
- Set Z = Ceil(Log2(R'-L')) + F.
- Set M = B + F - (G + J + Z).
- Calculate S, C, and D as specified in equations (18), (19), and (20) above, respectively, where D is calculated as a signed number.
- Set E = D >> S.
- Set D' = D - (E << S).
The position calculation can then be performed as t' = ((j*C + D') >> S) + E.
If D' and E (and M, S, and Z) are calculated in this way, the arithmetic result of the equation t' = ((j*C + D') >> S) + E will always be theoretically the same as the result of the equation t' = (j*C + D) >> S, except that the value of (j*C + D) will sometimes fall outside the range from 0 to 2^B - 1, while the value of (j*C + D') does not.
For example, if it is desired to upsample an image luminance array having a width of 1000 luminance samples with B = 32 and L' = 0, then F = 4, G = 2, J = 1, M = 19, S = 18, and Z = 14 may be used with this second exemplary position calculation technique. An equivalently functioning alternative to offsetting the origin so that all values of j*C + D are non-negative (thereby allowing unsigned B-bit arithmetic over the range from 0 to 2^B - 1) is to shift the origin further to the right by another 2^(B-1), allowing signed B-bit arithmetic over the range from -2^(B-1) to 2^(B-1) - 1.
As in the first exemplary position calculation technique of the previous section, a "corner case" adjustment is required when T is very close to (or equal to) R and R'-L' is very close to (or equal to) an integer power of 2.
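A sketch of the D' and E derivation above (the values of S, C, and D below are arbitrary illustrative numbers, not values from the text):

```python
def split_origin(D, S):
    # move all but the S least significant bits of D into the origin offset E
    E = D >> S                # arithmetic (floor) shift handles negative D
    Dp = D - (E << S)         # remainder satisfies 0 <= D' < 2^S
    return Dp, E

S, C, D = 17, 524288, -983040     # arbitrary example values
Dp, E = split_origin(D, S)
assert 0 <= Dp < (1 << S)
# ((j*C + D') >> S) + E matches (j*C + D) >> S, with a nonnegative operand
for j in range(100):
    assert ((j * C + Dp) >> S) + E == (j * C + D) >> S
```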
3. Exemplary Multi-order techniques for position computation
Methods have been discussed in which the design performs its calculations using the same variable values C, D', S, and E in the same equation (e.g., t' = ((j*C + D') >> S) + E) for all values of j covering the range of samples to be generated (i.e., for all values of T between L and R). We now discuss how this assumption can be relaxed, allowing higher accuracy and/or reduced computational dynamic range requirements.
In general, the resampling process proceeds from left to right (or top to bottom) to generate a string of successive samples at equally spaced locations. The second exemplary position calculation technique described in section V.B.2 above shows how varying the origin using the offset parameter E can keep the (j*C + D') portion of the position calculation within the B-bit dynamic range of a register.
Recall that in the previous section, only the S least significant bits of D are retained in D'; the others are shifted into E. The main remaining issue in calculating (j*C + D') is then the size of j*C.
Recall that T and L are integer multiples of 1 ÷ 2^J. The upsampling process is typically performed to generate a string of samples at integer-valued position increments in the higher-resolution image, i.e., with a spacing of 2^J between successive samples in these units. It is then desirable, for some values of p and N, to calculate for i = 0 to N-1 the positions t'_i corresponding to the positions T_i = (p + i*2^J) ÷ 2^J.
This process may be summarized in pseudo code as shown in pseudo code 1200 of FIG. 12 for some values of p and N. As i increments toward N, the value of q increases, and the maximum value of q should be kept within the available B-bit dynamic range. The maximum value calculated for q is (p + (N-1)*2^J)*C + D'.
Now, instead of generating all samples in one loop in this way, consider dividing the process into multiple stages, e.g., two stages. For example, in a two-stage process, the first stage generates the first N_0 < N samples, and the second stage generates the remaining N - N_0 samples. Likewise, because p is a constant within the loop, its effect can be shifted into D' and E before the first stage. This results in the two-stage process shown in pseudo code 1300 of FIG. 13.
At the beginning of each stage in pseudo code 1300, the origin has been reset so that all but the S least significant bits of the first value of q for the stage have been shifted into E (i.e., E_0 for the first stage and E_1 for the second stage). Thus, during the operation of each of the two stages, q requires a smaller dynamic range. After the process is divided into stages in this manner, the maximum value of q will be the larger of N_0*C' + D_0 and (N - N_0 - 1)*C' + D_1. But because D_0 and D_1 each have an unsigned dynamic range of no more than S bits, this maximum will typically be less than in the aforementioned single-stage design. The number of samples generated in each stage (i.e., N_0 in the first stage and N - N_0 in the second stage) affects the dynamic range of the associated calculation. For example, using a smaller number of samples in each stage results in a smaller dynamic range for the associated calculation.
Each stage may be further divided into more stages, and the generation of a total of N samples may then be further broken down into any number of these smaller stages. For example, the process may be divided into stages of equal size, generating blocks of, for example, 8 or 16 consecutive samples at each stage. This technique can be used either to reduce the number of bits of dynamic range B required in computing q, or to increase the accuracy of the computation (increase S and G + M) while keeping the dynamic range the same, or a mixture of both.
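The staged decomposition described above can be sketched as follows (generic illustrative parameter values; this is a simplification rather than a reproduction of the pseudo code in FIG. 13):

```python
def positions_two_stage(N, N0, C, D, S):
    # generate N positions in two stages, resetting the origin (E) at the
    # start of each stage so the running value q keeps a small dynamic range
    out = []
    for start, count in ((0, N0), (N0, N - N0)):
        q0 = start * C + D
        E = q0 >> S               # per-stage origin offset
        q = q0 - (E << S)         # S-bit remainder, 0 <= q < 2^S
        for _ in range(count):
            out.append((q >> S) + E)
            q += C
    return out

S, C, D = 12, 1536, 2048          # arbitrary illustrative values
direct = [(j * C + D) >> S for j in range(32)]
assert positions_two_stage(32, 8, C, D, S) == direct
```

The per-stage reset is exact because subtracting a multiple of 2^S before the shift and adding it back after the shift cannot change the result.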
This technique of decomposing the position calculation process into stages can also be used to perform a continuous resampling process along a very long string of input samples (conceptually, the string can be infinitely long), such as performing sample rate conversion as samples arrive from an analog-to-digital converter for an audio signal. Clearly, without dividing the process into stages of finite size and incrementally resetting the origin from each stage to the next, the techniques described in the previous sections could not handle an infinite-length string of samples, as this would require an infinite dynamic range in word length. However, this difficulty in applying those techniques to effectively infinite string lengths is not a substantial limitation, since the application to effectively infinite lengths is useful only when no rounding error is introduced in representing the hypothetical reference positions L' and R'' in integer units of multiples of 1 ÷ 2^G.
In scenarios in which the multi-stage position calculation technique can be applied, it provides a way to perform calculations along an infinite-length sample string without accumulation of rounding-error "drift" in the position calculation operation anywhere throughout the rate conversion process.
4. Exemplary incremental operations for position calculation
An interesting special case of the multi-stage decomposition concept described above is when the number of samples to be generated per stage has been reduced to one sample per stage. Pseudo code 1400 in FIG. 14 represents the process of generating N positions t'_i for i = 0 to N-1.
Because the process is described as an upsampling process (although the same principles may also be applied to a downsampling process), it is known that for each increment of i there is an interval of 1 in the higher-resolution image, and therefore an increment of less than or equal to 1 in the lower-resolution image. An increment of 1 in spatial position in the lower-resolution image corresponds to a value of 2^(S+F) for C'. It is also known that D' < 2^S. Thus, q = C' + D' has a value from 0 to less than 2^(S+F) + 2^S, so that unsigned integer arithmetic with no more than B = S + F + 1 bits may be used to calculate q. In one implementation, this dynamic range requirement is invariant to image size (i.e., it does not depend on the value of R' or R'-L').
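The one-sample-per-stage limit of the decomposition can be sketched as follows (illustrative values of C', D, and S; only additions, shifts, and masks appear in the loop):

```python
def positions_incremental(N, C_step, D, S):
    # per-sample origin reset: E accumulates the integer part while q keeps
    # only the S least significant bits, so no multiply is needed in the loop
    mask = (1 << S) - 1
    E = D >> S
    q = D & mask
    out = []
    for _ in range(N):
        out.append((q >> S) + E)   # q < 2^S here, so this just emits E
        q += C_step
        E += q >> S                # shift all but the S LSBs into E
        q &= mask
    return out

S, F = 12, 4
C_step = (1 << (S + F)) // 2       # hypothetical 2:1 ratio: half-sample step
D = (3 << S) + (1 << (S - 1))      # arbitrary start position plus rounding offset
direct = [(i * C_step + D) >> S for i in range(20)]
assert positions_incremental(20, C_step, D, S) == direct
```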
For scalable video coding and many other such applications, it is not really necessary to support upsampling ratios very close to 1. In such applications, it may be assumed that C' actually requires no more than S + F bits.
For example, if it is desired to upsample an image luminance array having a width of 1000 luminance samples with B = 32 and L' = 0, then F = 4, G = 2, J = 1, M = 29, S = 28, and Z = 14 may be used with this method. The result is exceptionally accurate, so much so that smaller values of B appear to be a more reasonable choice.
Alternatively, if it is desired to upsample an image luminance array having a width of 1000 luminance samples with B = 16 and L' = 0, then F = 4, G = 2, J = 1, M = 13, S = 12, and Z = 14 may be used with this method.
Further optimization opportunities may be provided with respect to further understanding of the scenario in which the upsampling operation is performed. For example, if the upsampling ratio is significantly greater than 2, the dynamic range requirement will be reduced by one more bit and continue to be reduced for upsampling ratios greater than 4, 16, etc.
The changes described for the exemplary incremental position calculation technique in this section (relative to the exemplary multi-stage position calculation technique described above) do not affect the actual calculated position values t'_i, given the values of C, D, and S. Only the dynamic range needed to support the calculation changes.
The inner loop in pseudo code 1400 for this decomposed form does not require any multiplication operations. This fact is advantageous in providing reduced computation time on some computing processors.
5. Additional notes
For common resampling ratios such as 2:1 and 3:2, and in any case where no rounding is necessary when approximating the positions L' and R'' in integer units of 1 ÷ 2^G, there is no rounding error at all when using these methods (other than any rounding error introduced when the final result is rounded to an integer in units of 1 ÷ 2^F, an error that would exist regardless of the position calculation method).
C. Luminance and chrominance position and relationship
Assuming that the whole new (upsampled) image and the reference image array are precisely aligned with respect to the luma sample grid index coordinates, the locations L and R within the current image coordinates are L = -1/2 and R = W - 1/2, where W is the number of samples in the vertical or horizontal direction of the image, depending on the relevant resampling dimension. Equivalently, the origin of the image space coordinate system can be set half a sample to the left of (or above) the grid index 0 position, and 1/2 added when converting from image space coordinates to grid index values, thereby obviating the need to process negative numbers when performing calculations in the spatial coordinate system.
The locations L 'and R' in the reference (low resolution) image are referenced in the same way to the sampling grid coordinates, where W in this case is the number of samples in the reference image and not the new image.
For chroma sampling grids (whether in the new image or the reference image), the situation is somewhat less simple. To establish a specified alignment of chroma samples with respect to luma, consider the image rectangle represented by the chroma samples to be the same as the rectangle represented by the luma samples. This results in the following:
Horizontally, for 4:2:0 chroma sample types 0, 2, and 4 (see FIG. 5D), the current image coordinates are defined by L = -1/4 and R = W/2 - 1/4.
Horizontally, for 4:2:0 chroma sample types 3, 1, and 5 (see FIG. 5D), the current image coordinates are defined by L = -1/2 and R = W/2 - 1/2.
Vertically, for 4:2:0 chroma sample types 2 and 3 (see FIG. 5D), the current image coordinates are defined by L = -1/4 and R = W/2 - 1/4.
Vertically, for 4:2:0 chroma sample types 0 and 1 (see FIG. 5D), the current image coordinates are defined by L = -1/2 and R = W/2 - 1/2.
Vertically, for 4:2:0 chroma sample types 4 and 5 (see FIG. 5D), the current image coordinates are defined by L = -3/4 and R = W/2 - 3/4.
Horizontally, for 4:2:2 chroma sampling as commonly used in industrial practice, the current image coordinates are defined by L = -1/4 and R = W/2 - 1/4.
Vertically, for 4:2:2 chroma sampling as commonly used in industrial practice, the current image coordinates are defined by L = -1/2 and R = W - 1/2.
Both horizontally and vertically, for 4:4:4 chroma sampling, the current image coordinates are defined by L = -1/2 and R = W - 1/2.
Likewise, the origin of the coordinate system is moved to the left of position L using an offset sufficient to avoid handling negatives.
The integer coordinate and the fractional phase offset remainder are calculated by adjusting the integer coordinate position of each sample to be generated in the upsampled array to compensate for the fractional offset L, and then applying the transformation shown at the end of section V.B. Conceptually, right-shifting the result by F bits yields an integer coordinate pointer into the reference image, and subtracting the left-shifted integer coordinate (shifted by F bits) provides the phase offset remainder.
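Conceptually, the split into an integer coordinate and a phase remainder can be sketched as follows (F = 4 here is just the example accuracy used earlier):

```python
F = 4                              # 1/16-sample positioning accuracy

def split_position(t_prime):
    # t' is a position in units of 1/2^F; the floor shift gives the integer
    # sample index in the reference image, and the difference is the phase
    idx = t_prime >> F
    phase = t_prime - (idx << F)   # phase remainder in [0, 2^F - 1]
    return idx, phase

print(split_position(37))   # position 37/16 = sample 2, phase 5
```

Because the shift is a flooring operation, the phase remainder stays non-negative even for negative positions (e.g., position -8 in 1/16 units splits into sample -1, phase 8).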
D. Additional precision of upsampling location calculation
This section describes how the position calculation method of the above section v.c.4 is mapped to a specific upsampling process, such as the upsampling process used for the h.264 SVC extension. The position calculation is applied in a very flexible way to maximize the accuracy of both the luminance and chrominance channels in various chrominance formats, as well as progressive and interlaced frame formats. The techniques described in this section can vary depending on implementation and different upsampling processes.
In the above position calculation (sections V.A-C above), the scale change parameter (the variable C, hereafter labeled deltaX (or deltaY)) is scaled up by a factor of 2^J (where J = 1 for luma and J = 2 for chroma) to form an increment that is added to generate each sample position from left to right or top to bottom. The scaling is chosen so that the scaled-up increment will fit in 16 bits.
1. Maximum accuracy of scaled position calculation
A straightforward way to apply the position calculation method is to scale up the scale factor parameter by 2^J, where J is 1 for luma and 2 for chroma, to form an increment that is added to generate each sample position from left to right or top to bottom. The scaling is then selected to ensure that the scaled increment will fit within a specified word size, such as 16 bits. A more flexible design that maximizes positional accuracy is described in the following sections.
a. Luminance channel
The "direct" luma position calculation method can be summarized by the following exemplary equations (in the horizontal direction) when F = 4 and S = 12:
deltaX=Floor(((BasePicWidth<<15)+(ScaledBaseWidth>>1))÷ScaledBaseWidth)
xf=((2*(xP-ScaledBaseLeftOffset)+1)*deltaX-30720)>>12
Here, BasePicWidth is the horizontal resolution of the base layer or low-resolution image; ScaledBaseWidth is the horizontal resolution of the high-resolution image region or window; deltaX is an intermediate scale factor, in this case a rounded approximation of 32768 times the inverse of the upsampling ratio; xP denotes the sample position in the high-resolution image; ScaledBaseLeftOffset represents the relative position of the image window in the high-resolution image; and Floor() denotes the largest integer less than or equal to its argument. The constant value 30720 results from adding 2^(S-1) as a rounding offset before the right shift and subtracting 2^S * 2^F / 2 for the half-sample offset of the luma sampling grid reference position, as discussed at the beginning of section V.C above.
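A minimal runnable sketch of the "direct" calculation above (the function wrapper and the loop over xP are illustrative assumptions; the two formulas are the ones from the text, with F = 4 and S = 12):

```python
def luma_positions_direct(BasePicWidth, ScaledBaseWidth, ScaledBaseLeftOffset=0):
    """'Direct' horizontal luma position calculation: returns deltaX and, for
    each high-resolution sample position xP, the reference position xf in
    1/16-sample units (F = 4 fractional bits)."""
    # Rounded approximation of 32768 times the inverse of the upsampling ratio.
    deltaX = ((BasePicWidth << 15) + (ScaledBaseWidth >> 1)) // ScaledBaseWidth
    xf = [((2 * (xP - ScaledBaseLeftOffset) + 1) * deltaX - 30720) >> 12
          for xP in range(ScaledBaseWidth)]
    return deltaX, xf
```

For 2:1 upsampling from width 4 to width 8, deltaX is 16384 and the positions step by 8 (half a sample in 1/16-sample units), starting at -4 (that is, -0.25 samples, reflecting the half-sample grid reference offsets).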
It is worth noting that each increment of xP results in an increment of 2*deltaX inside the equation. As a result, the least significant bit of the quantity 2*deltaX is always zero, so one bit of computational accuracy is essentially wasted. Approximately one additional bit of accuracy can be obtained, without any significant increase in complexity, by changing these equations as follows:
deltaX=Floor(((BasePicWidth<<16)+(ScaledBaseWidth>>1))÷ScaledBaseWidth)
xf=((xP-ScaledBaseLeftOffset)*deltaX+(deltaX>>1)-30720)>>12
or, slightly more precisely, as follows:
deltaXa=Floor(((BasePicWidth<<16)+(ScaledBaseWidth>>1))÷ScaledBaseWidth)
deltaXb=Floor(((BasePicWidth<<15)+(ScaledBaseWidth>>1))÷ScaledBaseWidth)
xf=((xP-ScaledBaseLeftOffset)*deltaXa+deltaXb-30720)>>12
the latter two forms are recommended because of their higher accuracy and negligible complexity impact (although the difference in accuracy appears to be small).
Note that on architectures on which it is difficult to perform division calculations, having the result of one of these equations can simplify computation of the other. The value of deltaXa will always be within plus or minus 1 of 2*deltaXb. The following simplified rule can therefore be used to avoid performing a division operation for the deltaXa calculation:
deltaXa=(deltaXb<<1)
remainderDiff=(BasePicWidth<<16)+(ScaledBaseWidth>>1)-deltaXa*ScaledBaseWidth
if(remainderDiff<0)
deltaXa--
else if(remainderDiff≥ScaledBaseWidth)
deltaXa++
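A runnable sketch of this rule, assuming (as the comparison requires dimensionally) that the remainder test subtracts deltaXa multiplied by ScaledBaseWidth; the cross-check against the direct division is illustrative:

```python
def delta_xa_without_division(BasePicWidth, ScaledBaseWidth):
    """Derive the 16-bit-scaled deltaXa from the 15-bit-scaled deltaXb without
    a second division: double deltaXb, then correct by at most one step based
    on the remainder of the implied division."""
    deltaXb = ((BasePicWidth << 15) + (ScaledBaseWidth >> 1)) // ScaledBaseWidth
    deltaXa = deltaXb << 1
    remainderDiff = ((BasePicWidth << 16) + (ScaledBaseWidth >> 1)
                     - deltaXa * ScaledBaseWidth)
    if remainderDiff < 0:
        deltaXa -= 1            # doubled estimate was one too high
    elif remainderDiff >= ScaledBaseWidth:
        deltaXa += 1            # doubled estimate was one too low
    return deltaXa
```

A single correction step suffices because 2*deltaXb is always within one unit of the directly divided result.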
b. Chrominance channel
Instead of a multiplication factor of two, a multiplication factor of four can be used for the chroma channels in this part of the design, so that the quarter-sample chroma positions of 4:2:0 sampling can be represented (J = 2 for chroma instead of J = 1 for luma, as described above). The "direct" equations are thus:
deltaXC=Floor(((BasePicWidthC<<14)+(ScaledBaseWidthC>>1))÷
ScaledBaseWidthC)
xfC=((((4*(xC-ScaledBaseLeftOffsetC)+
(2+scaledBaseChromaPhaseX))*deltaXC)
+2048)>>12)-4*(2+baseChromaPhaseX)
Here, baseChromaPhaseX and scaledBaseChromaPhaseX represent the chroma sampling grid position offsets for the low and high resolutions, respectively. The values of these parameters may be conveyed explicitly as information sent from the encoder to the decoder, or may have specific values determined by the application. All other variables are analogous to those defined for the luma channel, with the "C" suffix indicating application to the chroma channel.
Each increment of xC results in an increment of 4*deltaXC inside the equation. Thus, approximately two additional bits of precision can be obtained, without any significant increase in complexity, by changing these equations as follows:
deltaXC=Floor(((BasePicWidthC<<16)+(ScaledBaseWidthC>>1))÷
ScaledBaseWidthC)
xfC=(((xC-ScaledBaseLeftOffsetC)*deltaXC
+(2+scaledBaseChromaPhaseX)*((deltaXC+K)>>2)
+2048)>>12)-4*(2+baseChromaPhaseX)
where K is 0, 1, or 2. Using K = 0 avoids an extra operation; using K = 1 or K = 2 can give somewhat higher accuracy.
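A sketch of this single-division chroma form (the wrapper and loop are illustrative assumptions; the formula is the one above, with K selectable):

```python
def chroma_positions_single(BasePicWidthC, ScaledBaseWidthC, count, K=1,
                            ScaledBaseLeftOffsetC=0,
                            baseChromaPhaseX=0, scaledBaseChromaPhaseX=0):
    """Horizontal chroma positions using one division: deltaXC carries 16-bit
    scaling, and (deltaXC + K) >> 2 approximates the quarter-scale phase term
    (K = 0 saves an addition; K = 1 or 2 rounds slightly better)."""
    deltaXC = ((BasePicWidthC << 16) + (ScaledBaseWidthC >> 1)) // ScaledBaseWidthC
    return [(((xC - ScaledBaseLeftOffsetC) * deltaXC
              + (2 + scaledBaseChromaPhaseX) * ((deltaXC + K) >> 2)
              + 2048) >> 12) - 4 * (2 + baseChromaPhaseX)
            for xC in range(count)]
```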
A correspondingly somewhat more accurate form may be as follows:
deltaXCa=Floor(((BasePicWidthC<<16)+(ScaledBaseWidthC>>1))÷
ScaledBaseWidthC)
deltaXCb=Floor(((BasePicWidthC<<14)+(ScaledBaseWidthC>>1))÷
ScaledBaseWidthC)
xfC=(((xC-ScaledBaseLeftOffsetC)*deltaXCa+
(2+scaledBaseChromaPhaseX)*deltaXCb
+2048)>>12)-4*(2+baseChromaPhaseX)
the latter variant is recommended, as is the case with luminance, because the difference in complexity appears negligible (although the difference in precision also appears small).
c. Interlaced field coordinates
References to the image coordinate system are typically in units of half-sample positions in luma frame coordinates, which results in the scale factor of 2 applied to the luma coordinate reference positions as described above. A displacement of half a sample in luma frame coordinates corresponds to a displacement of a quarter sample in 4:2:0 chroma frame coordinates, which is also why a factor of 4 rather than 2 is currently used in the scaling for the chroma coordinates above.
Horizontally, there is no essential difference between operating on coded images that represent frames and those that represent individual fields of interlaced video. However, when a coded image represents a single field, a displacement of half a sample in luma frame vertical coordinates corresponds to a displacement of a quarter sample in luma field vertical coordinates. A scale factor of 4, rather than 2, should therefore be applied in the calculation of vertical luma coordinate positions.
Similarly, when a coded image represents a single field, a displacement of half a sample in luma frame vertical coordinates corresponds to a displacement of one-eighth of a sample in chroma field vertical coordinates. A scale factor of 8, rather than 4, should therefore be applied in the calculation of vertical chroma coordinate positions.
These scale factors for computing vertical coordinate positions in coded field images can be incorporated into the deltaY vertical increment calculation in the same manner as described above for the increment calculation in coded frame images. In this case, because of the increased scale factors applied, the precision improvement is approximately 2 bits for luma positions and approximately 3 bits for chroma positions (vertically).
2. 4:2:2 and 4:4:4 chroma restrictions and refinements
The position calculation method of section V.D.1.b requires the use of multiplication factors for chroma that differ from those for luma. This makes sense for 4:2:0 video, and is also reasonable horizontally for 4:2:2 video, but it is unnecessary vertically for 4:2:2 video and both horizontally and vertically for 4:4:4 video, since in those cases the luma and chroma resolutions are the same and the luma and chroma samples are therefore presumably co-located.
As a result, the approach of section V.D.1.b may require separate calculations to determine the luma and chroma positions even in dimensions where the luma and chroma resolutions are the same and there is no intentional phase shift, simply because the rounding performed in the two cases differs slightly. This is undesirable, so this section suggests using different chroma handling for the 4:2:2 and 4:4:4 sampling structures.
a. 4:2:2 vertical and 4:4:4 horizontal and vertical positions
There is no apparent need for custom control of chroma phase in the vertical dimension of 4:2:2 video or in the vertical and horizontal dimensions of 4:4:4 video. Thus, whenever the chroma sampling format has the same resolution as luma in a particular dimension, the equations for calculating the chroma positions should be modified so that exactly the same positions are computed for the luma and chroma samples. One option is simply to set the chroma position variable equal to the luma position variable; another is to set up the chroma position equations so that they produce the same result.
b. 4:2:2 horizontal positions
While there is no functional problem with allowing chroma phase adjustment horizontally for 4:2:2 video, in practice only one type of horizontal subsampling structure is used for 4:2:2 (the one corresponding to the value -1 for scaledBaseChromaPhaseX and baseChromaPhaseX in the equations of section V.D.1.b), so it may be desirable to consider forcing the use of those values whenever the color sampling format is 4:2:2.
Extensions and variations
The techniques and tools herein may also be applied to multi-resolution video coding using reference picture resampling, such as that found in Annex P of ITU-T Recommendation H.263.
The techniques and tools herein may also be applied to upsampling not only of image sample arrays but also of residual data signals or other signals. For example, they may be applied to residual data signal upsampling for reduced-resolution update coding, such as that found in Annex Q of ITU-T Recommendation H.263. As another example, they may be applied to upsampling of residual data signals for predicting a high-resolution residual signal from a lower-resolution residual signal in a spatially scalable video coding design. As yet another example, they may be applied to upsampling of motion vector fields in a spatially scalable video coding design. As still another example, they may be applied to upsampling of graphics images, still photographic images, audio sample signals, and the like.
Having described and illustrated the principles of the present invention with reference to various described embodiments, it will be recognized that the various embodiments can be modified in arrangement and detail without departing from such principles. It should be understood that the programs, processes, or methods described herein are not related or limited to any particular type of computing environment unless otherwise specified. Various types of general purpose or special purpose computing environments may be used or operations may be performed in accordance with the teachings described herein. Elements shown in software in the described embodiments may be implemented in hardware and vice versa.
In view of the many possible embodiments to which the principles of our invention may be applied, we claim as our invention all such embodiments as fall within the scope and spirit of the following claims and equivalents thereto.

Claims (17)

1. A method for performing upsampling of base layer image data during video encoding or decoding, the method comprising: for one location in the upsampled array:
calculating a location in the base layer image data, wherein y indicates a vertical value for the location in the base layer image data, and the derivation of y comprises a calculation of a result mathematically equivalent to (j × C + D) >> S, and wherein:
j indicates the vertical value of the location in the upsampled array;
C is an approximation of the reciprocal of the vertical scale factor multiplied by 2^(S+F);
D is an offset;
S is a shift value; and
F is based on the number of bits in the fractional component of y.
2. The method of claim 1, wherein j, C, and D are based in part on whether the base layer image data is for a frame or a field, and wherein j and D are based in part on whether the base layer image data is for luminance or chrominance.
3. The method of claim 1, wherein
S sets a dynamic range and precision; and
D is based on the vertical resolution of the base layer image data, the vertical scale factor, and S.
4. The method of claim 1, wherein F is 4 and S is 12.
5. The method of claim 1, wherein x indicates a horizontal value for the location in the base layer image data, and the derivation of x comprises a calculation of a result mathematically equivalent to (i × C' + D') >> S', and wherein:
i indicates the horizontal value of the location in the upsampled array;
C' is an approximation of the reciprocal of the horizontal scale factor multiplied by 2^(S'+F');
D' is an offset that may be the same as or different from D;
S' is a shift value that may be the same as or different from S; and
F' is based on the number of bits in the fractional component of x.
6. The method of claim 5, wherein F' is 4 and S' is 12.
7. The method of claim 6, wherein C' is derived according to the following equation:
((BasePicWidth<<16)+(ScaledBaseWidth>>1))÷ScaledBaseWidth,
wherein BasePicWidth indicates the horizontal resolution of the base layer image data, and ScaledBaseWidth indicates the horizontal resolution after upsampling.
8. The method of claim 5, wherein x is further based on an offset E, the derivation of x comprising a calculation of a result mathematically equivalent to ((i × C' + D') >> S') + E.
9. The method of claim 5, further comprising:
selecting a vertical filter based on the F least significant bits of y and selecting a vertical integer position at which to filter based on the remaining bits of y, wherein vertical interpolation of the base layer image data uses the vertical filter at the vertical integer position; and
selecting a horizontal filter based on the F' least significant bits of x and selecting a horizontal integer position at which to filter based on the remaining bits of x, wherein horizontal interpolation of the results of the vertical interpolation uses the horizontal filter at the horizontal integer position.
10. The method of claim 1, further comprising:
interpolating a value at the location in the base layer image data; and
assigning the interpolated value to the position in the upsampled array.
11. A system for upsampling base layer image data during video encoding or decoding, the system comprising: for one location in the upsampled array:
means for calculating a location in the base layer image data, wherein y indicates a vertical value for the location in the base layer image data, and the derivation of y comprises a calculation of a result mathematically equivalent to (j × C + D) >> S, and wherein:
j indicates the vertical value of the location in the upsampled array;
C is an approximation of the reciprocal of the vertical scale factor multiplied by 2^(S+F);
D is an offset;
S is a shift value; and
F is based on the number of bits in the fractional component of y.
12. The system of claim 11, wherein:
j, C, and D are based in part on whether the base layer image data is for a frame or a field;
j and D are based in part on whether the base layer image data is for luminance or chrominance;
D is based on the vertical resolution of the base layer image data, the vertical scale factor, and S;
F is 4; and
S is 12.
13. The system of claim 11, wherein x indicates a horizontal value for the location in the base layer image data, and the derivation of x comprises a calculation of a result mathematically equivalent to (i × C' + D') >> S', and wherein:
i indicates the horizontal value of the location in the upsampled array;
C' is an approximation of the reciprocal of the horizontal scale factor multiplied by 2^(S'+F');
D' is an offset that may be the same as or different from D;
S' is a shift value that may be the same as or different from S; and
F' is based on the number of bits in the fractional component of x.
14. The system of claim 13, wherein:
F' is 4, and
S' is 12.
15. The system of claim 14, wherein
C' is derived according to:
((BasePicWidth<<16)+(ScaledBaseWidth>>1))÷ScaledBaseWidth,
wherein BasePicWidth indicates the horizontal resolution of the base layer image data, and ScaledBaseWidth indicates the horizontal resolution after upsampling.
16. The system of claim 13, further comprising:
means for selecting a vertical filter based on the F least significant bits of y and a vertical integer position at which to filter based on the remaining bits of y, wherein vertical interpolation of the base layer image data uses the vertical filter at the vertical integer position; and
means for selecting a horizontal filter based on the F' least significant bits of x and a horizontal integer position at which to filter based on the remaining bits of x, wherein horizontal interpolation of the results of the vertical interpolation uses the horizontal filter at the horizontal integer position.
17. The system of claim 11, further comprising:
means for interpolating a value at the location in the base layer image data; and
means for assigning the interpolated value to the position in the upsampled array.
HK12101256.3A 2006-01-06 2012-02-08 Method for resampling and picture resizing operations for multi-resolution video coding and decoding HK1161016B (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US75684606P 2006-01-06 2006-01-06
US60/756,846 2006-01-06
US78657306P 2006-03-27 2006-03-27
US60/786,573 2006-03-27
US82951506P 2006-10-13 2006-10-13
US60/829,515 2006-10-13

Publications (2)

Publication Number Publication Date
HK1161016A1 HK1161016A1 (en) 2012-08-17
HK1161016B true HK1161016B (en) 2013-09-06

