HK1133761B - System and method for implementing efficient decoded buffer management in multi-view video coding - Google Patents
Publication number: HK1133761B (application HK10101353.7A)
Authority: HK (Hong Kong)
Description
Technical Field
The present invention relates generally to video coding. In particular, the present invention relates to coded picture buffer management in multi-view video coding.
Background
This section is intended to provide a background or context to the invention that is recited in the claims. The description herein may include concepts that could be pursued, but are not necessarily ones that have been previously conceived or pursued. Thus, unless otherwise indicated herein, what is described in this section is not prior art to the description and claims in this application and is not admitted to be prior art by inclusion in this section.
In multi-view video coding, video sequences output from different cameras are encoded into one bitstream, each sequence representing a different view of a scene. After decoding, the decoded pictures belonging to a certain view are reconstructed and displayed in order to display that view. It is also possible to reconstruct and display more than one view.
Multi-view video coding has a wide variety of applications, including free-viewpoint video/television, three-dimensional (3D) TV, and surveillance. Currently, the Joint Video Team (JVT) of the International Organization for Standardization (ISO)/International Electrotechnical Commission (IEC) Moving Picture Experts Group (MPEG) and the International Telecommunication Union (ITU)-T Video Coding Experts Group is working on developing a Multiview Video Coding (MVC) standard, which is becoming an extension of the ITU-T H.264 standard (also known as ISO/IEC MPEG-4 Part 10). These draft standards are referred to herein as MVC and AVC, respectively. The latest draft of the MVC standard is described in JVT-T208, "Joint Multiview Video Model (JMVM) 1.0", 20th JVT meeting, Klagenfurt, Austria, July 2006, available at ftp3.itu.ch/av-arch/jvt-site/2006_07_Klagenfurt/JVT-T208.zip and incorporated herein by reference in its entirety.
In JMVM 1.0, for each group of pictures (GOP), the pictures of any view are contiguous in decoding order. This is depicted in Fig. 1, where the horizontal direction denotes time (each time instant is represented by Tm) and the vertical direction denotes views (each view is represented by Sn). The pictures of each view are grouped into GOPs; for example, pictures T1 to T8 of each view in Fig. 1 form a GOP. This decoding order arrangement is referred to as view-first coding. It should be noted that, for the pictures of one view within one GOP, although their decoding sequence is contiguous without any picture of another view inserted between any two of them, their decoding order may change internally.
It is also possible to have a decoding order different from that described for view-first coding. For example, pictures may be arranged so that the pictures of any temporal position are contiguous in decoding order. This arrangement is shown in Fig. 2. This decoding order arrangement is referred to as time-first coding. It should also be noted that the decoding order of the access units may differ from the temporal order.
A typical prediction structure for multi-view video coding (including both inter-picture prediction within each view and inter-view prediction) is shown in Fig. 3, where predictions are indicated by arrows: the pointed-to object uses the pointed-from object for prediction reference. Inter-picture prediction within one view is also referred to as temporal prediction, intra-view prediction, or simply inter prediction.
An instantaneous decoding refresh (IDR) picture is an intra-coded picture that causes the decoding process to mark all reference pictures as "unused for reference" immediately after the IDR picture is decoded. After an IDR picture is decoded, all following coded pictures in decoding order can be decoded without inter prediction from any picture decoded prior to the IDR picture.
In AVC and MVC, coding parameters that remain unchanged through a coded video sequence are included in a sequence parameter set. In addition to parameters that are essential to the decoding process, the sequence parameter set may optionally contain video usability information (VUI), which includes parameters that are important for buffering, picture output timing, rendering, and resource reservation. Two structures are specified to carry sequence parameter sets: the sequence parameter set NAL unit, containing all the data for AVC pictures in the sequence, and the sequence parameter set extension for MVC. A picture parameter set contains such parameters that are likely to remain unchanged across several coded pictures. Frequently changing picture-level data is repeated in each slice header, while picture parameter sets carry the remaining picture-level parameters. The H.264/AVC syntax allows many instances of sequence and picture parameter sets, and each instance is identified with a unique identifier. Each slice header includes the identifier of the picture parameter set that is active for the decoding of the picture containing the slice, and each picture parameter set contains the identifier of the active sequence parameter set. Consequently, the transmission of picture and sequence parameter sets does not have to be accurately synchronized with the transmission of slices. Instead, it is sufficient that the active sequence and picture parameter sets be received at any moment before they are referenced, which allows parameter sets to be transmitted using a more reliable transmission mechanism than the protocols used for the slice data. For example, parameter sets can be included as a MIME parameter in the session description for H.264/AVC Real-time Transport Protocol (RTP) sessions. It is recommended to use an out-of-band reliable transmission mechanism whenever it is possible in the application in use. If parameter sets are transmitted in-band, they can be repeated to improve error robustness.
As discussed herein, an anchor picture is a coded picture in which all slices reference only slices with the same temporal index, i.e., only slices in other views and not slices in earlier pictures of the current view. An anchor picture is signaled by setting anchor_pic_flag to 1. After an anchor picture is decoded, all following coded pictures in display order can be decoded without inter prediction from any picture decoded prior to the anchor picture. If a picture in one view is an anchor picture, all pictures with the same temporal index in other views are also anchor pictures. Consequently, the decoding of any view can be started from a temporal index that corresponds to anchor pictures.
Picture output timing, such as output timestamps, is not included in the integral part of AVC or MVC bitstreams. However, a value of picture order count (POC) is derived for each picture and is non-decreasing with increasing picture position in output order relative to the previous IDR picture or a picture containing a memory management control operation marking all pictures as "unused for reference". POC therefore indicates the output order of pictures. It is also used in the decoding process for implicit scaling of motion vectors in the direct modes of bi-predictive slices, for implicitly derived weights in weighted prediction, and for reference picture list initialization of B slices. Furthermore, POC is used in the verification of output order conformance.
Values of POC can be coded with one of three modes signaled in the active sequence parameter set. In the first mode, a selected number of least significant bits of the POC value is included in each slice header. In the second mode, the relative increments of POC, as a function of the picture position in decoding order within the coded video sequence, are coded in the sequence parameter set. In addition, deviations from the POC value derived from the sequence parameter set may be indicated in slice headers. In the third mode, the POC value is derived from the decoding order by assuming that the decoding and output order are identical. In addition, only one non-reference picture can occur consecutively when the third mode is used.
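The first of these modes (POC type 0 in H.264/AVC) can be illustrated with a short sketch: the decoder reconstructs the full POC value from the signaled least significant bits by detecting wrap-around against the LSBs of the previous reference picture. This is an illustrative simplification of the derivation in the standard, not a normative implementation.

```python
def decode_poc_mode0(prev_poc_msb, prev_poc_lsb, poc_lsb, max_poc_lsb):
    """Reconstruct a full POC value from its signaled LSBs (POC type 0 sketch).

    Wrap-around of the modulo counter is detected by comparing the received
    LSBs against those of the previous reference picture.
    """
    if poc_lsb < prev_poc_lsb and (prev_poc_lsb - poc_lsb) >= max_poc_lsb // 2:
        poc_msb = prev_poc_msb + max_poc_lsb  # LSB counter wrapped forward
    elif poc_lsb > prev_poc_lsb and (poc_lsb - prev_poc_lsb) > max_poc_lsb // 2:
        poc_msb = prev_poc_msb - max_poc_lsb  # LSB counter wrapped backward
    else:
        poc_msb = prev_poc_msb
    return poc_msb + poc_lsb
```

For instance, with 8-bit LSBs (max_poc_lsb = 256), receiving LSB value 8 after a reference picture with LSB value 248 is interpreted as a forward wrap, yielding POC 264 rather than 8.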
nal_ref_idc is a 2-bit syntax element in the NAL unit header. The value of nal_ref_idc indicates the relevance of the NAL unit for the reconstruction of sample values. Non-zero values of nal_ref_idc must be used for coded slice and slice data partition NAL units of reference pictures, as well as for parameter set NAL units. The value of nal_ref_idc must be equal to 0 for slices and slice data partitions of non-reference pictures and for NAL units that do not affect the reconstruction of sample values, such as supplemental enhancement information NAL units. In the H.264/AVC high-level design, external specifications (i.e., any system or specification using or referring to H.264/AVC) are allowed to specify an interpretation for the non-zero values of nal_ref_idc. For example, the RTP payload format for H.264/AVC, Request for Comments (RFC) 3984 (which can be found at www.ietf.org/rfc/rfc3984.txt and is incorporated herein by reference), specifies strong recommendations on the use of nal_ref_idc. In other words, some systems have established practices to set and interpret the non-zero nal_ref_idc values. For example, an RTP mixer might set nal_ref_idc according to the NAL unit type, e.g., setting nal_ref_idc to 3 for IDR NAL units. Since MVC is a backward-compatible extension of the H.264/AVC standard, it is desirable that existing system elements that are aware of H.264/AVC are also able to handle MVC streams. It is therefore undesirable for the MVC specification to specify the semantics of a particular non-zero value of nal_ref_idc differently from any other non-zero value of nal_ref_idc.
Decoded pictures used for predicting subsequent coded pictures and for future output are buffered in a decoded picture buffer (DPB). To efficiently utilize the buffer memory, the DPB management processes should be specified, including the storage process of decoded pictures into the DPB, the marking process of reference pictures, and the output and removal processes of decoded pictures from the DPB.
The process for reference picture marking in AVC is generally as follows. The maximum number of reference pictures used for inter prediction, referred to as M, is indicated in the active sequence parameter set. When a reference picture is decoded, it is marked as "used for reference". If the decoding of the reference picture causes more than M pictures to be marked as "used for reference", at least one picture must be marked as "unused for reference". The DPB removal process would then remove pictures marked as "unused for reference" from the DPB if they are not needed for output either.
There are two types of operations for reference picture marking: adaptive memory control and sliding window. The operation mode for reference picture marking is selected on a picture basis. Adaptive memory control requires the presence of Memory Management Control Operation (MMCO) commands in the bitstream. The memory management control operation enables signaling explicitly which pictures are marked as "unused for reference", assigning a long-term index to short-term reference pictures, storing the current picture as a long-term picture, changing the short-term picture to a long-term picture, and assigning a maximum allowed long-term index (MaxLongTermFrameIdx) to the long-term picture. If the sliding window operation mode is used and there are M pictures marked as "used for reference", the short-term reference picture, which is the first decoded picture among those short-term reference pictures marked as "used for reference", is marked as "unused for reference". In other words, the sliding window operation mode causes a first-in/first-out buffering operation among the short-term reference pictures.
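The sliding window mode described above amounts to first-in/first-out buffering among the short-term reference pictures. A minimal sketch, with pictures represented abstractly (illustrative only, not the normative marking process):

```python
from collections import deque


def sliding_window_mark(short_term, long_term, new_pic, max_refs):
    """Sliding-window reference picture marking (sketch).

    When storing a new reference picture would exceed the max_refs limit M,
    the earliest-decoded short-term reference picture is marked "unused for
    reference" (here: removed and returned); long-term pictures are untouched.
    """
    removed = None
    if len(short_term) + len(long_term) >= max_refs:
        removed = short_term.popleft()  # first-in/first-out among short-term pictures
    short_term.append(new_pic)
    return removed
```

For example, with M = 3 and three short-term pictures buffered, decoding a fourth reference picture evicts the earliest of the three.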
Each short-term picture is associated with a variable PicNum that is derived from the syntax element frame_num, and each long-term picture is associated with a variable LongTermPicNum that is derived from the long_term_frame_idx syntax element signaled by an MMCO command. PicNum is derived from the variable FrameNumWrap depending on whether a frame or a field is coded or decoded. For frames, PicNum equals FrameNumWrap. FrameNumWrap is derived from the variable FrameNum, and FrameNum is in turn derived directly from frame_num. For example, in AVC frame coding, FrameNum is assigned the same value as frame_num, and FrameNumWrap is defined as follows:
if (FrameNum > frame_num)
    FrameNumWrap = FrameNum - MaxFrameNum
else
    FrameNumWrap = FrameNum
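The derivation above can be transcribed directly. The following sketch covers the frame-coding case only, computing FrameNumWrap and PicNum for a short-term reference frame relative to the current picture:

```python
def frame_num_wrap(ref_frame_num, cur_frame_num, max_frame_num):
    """FrameNumWrap for a short-term reference frame (AVC frame coding).

    A reference frame whose FrameNum exceeds the current picture's frame_num
    must stem from a wrap-around of the modulo-MaxFrameNum counter, so it is
    shifted down by MaxFrameNum.
    """
    if ref_frame_num > cur_frame_num:
        return ref_frame_num - max_frame_num
    return ref_frame_num


def pic_num(ref_frame_num, cur_frame_num, max_frame_num):
    # For frames, PicNum simply equals FrameNumWrap.
    return frame_num_wrap(ref_frame_num, cur_frame_num, max_frame_num)
```

For example, with MaxFrameNum = 16 and a current frame_num of 2, a reference frame with FrameNum 14 (decoded before the counter wrapped) receives FrameNumWrap -2, so it correctly orders before more recent references.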
LongTermPicNum is derived from the long-term frame index (LongTermFrameIdx) assigned for the picture. For frames, LongTermPicNum equals LongTermFrameIdx. frame_num is a syntax element in each slice header. The value of frame_num for a frame or a complementary field pair is, in effect, incremented by one, in modulo arithmetic, relative to the frame_num of the previous reference frame or reference complementary field pair. In IDR pictures, the value of frame_num is zero. For pictures containing a memory management control operation marking all pictures as "unused for reference", the value of frame_num is considered to be zero after the decoding of the picture.
The MMCO commands use PicNum and LongTermPicNum for indicating the target picture of a command as follows. To mark a short-term picture as "unused for reference", the PicNum difference between the current picture p and the target picture r is signaled in the MMCO command. To mark a long-term picture as "unused for reference", the LongTermPicNum of the picture r to be removed is signaled in the MMCO command. To store the current picture p as a long-term picture, a long_term_frame_idx is signaled with the MMCO command, and this index is assigned to the newly stored long-term picture as the value of its LongTermPicNum. To change a picture r from a short-term picture to a long-term picture, the PicNum difference between the current picture p and the picture r is signaled in the MMCO command, the long_term_frame_idx is signaled in the MMCO command, and this index is assigned to the long-term picture.
When multiple reference pictures can be used, each reference picture must be identified. In AVC, the identification of a reference picture used for a coded block is as follows. First, all the reference pictures stored in the DPB for prediction reference of future pictures are marked as either "used for short-term reference" (short-term pictures) or "used for long-term reference" (long-term pictures). When decoding a coded slice, a reference picture list is constructed. If the coded slice is a bi-predicted slice, a second reference picture list is also constructed. A reference picture used for a coded block is then identified by its index in the reference picture list in use. The index is coded in the bitstream when more than one reference picture may be used.
The reference picture list construction process is as follows. For simplicity, assume that only one reference picture list is needed. First, an initial reference picture list including all the short-term and long-term pictures is constructed. Reference picture list reordering (RPLR) is then performed if the slice header contains RPLR commands. The RPLR process may reorder the reference pictures into an order different from the order in the initial list. Finally, the final list is constructed by keeping only a certain number of pictures at the beginning of the possibly reordered list; this number is indicated by a syntax element in the slice header or in the picture parameter set referred to by the slice.
During the initialization process, all short-term and long-term pictures are considered candidates for the reference picture lists of the current picture. Regardless of whether the current picture is a B or P picture, long-term pictures are placed after short-term pictures in RefPicList0 (and RefPicList1, available for B slices). For P pictures, the initial reference picture list RefPicList0 contains all the short-term reference pictures in descending order of PicNum. For B pictures, the reference pictures obtained from all the short-term pictures are ordered by a rule related to the current POC number and the POC numbers of the reference pictures: for RefPicList0, reference pictures with smaller POC (compared to the current POC) are considered first and inserted into RefPicList0 in descending order of POC; pictures with larger POC are then appended in ascending order of POC. For RefPicList1 (if available), reference pictures with larger POC (compared to the current POC) are considered first and inserted into RefPicList1 in ascending order of POC; pictures with smaller POC are then appended in descending order of POC. For both P and B pictures, after all the short-term reference pictures have been considered, the long-term pictures are appended in ascending order of LongTermPicNum.
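The B-picture ordering rules above can be sketched compactly, with short-term pictures identified by their POC values and long-term pictures by labeled LongTermPicNum entries (an illustrative simplification of the initialization process):

```python
def init_b_lists(short_term_pocs, cur_poc, long_term_pic_nums):
    """Initial RefPicList0/RefPicList1 for a B picture (sketch)."""
    # List 0: smaller POCs first (descending), then larger POCs (ascending).
    smaller = sorted((p for p in short_term_pocs if p < cur_poc), reverse=True)
    larger = sorted(p for p in short_term_pocs if p > cur_poc)
    # Long-term pictures follow in ascending LongTermPicNum in both lists.
    lt = [("LT", n) for n in sorted(long_term_pic_nums)]
    list0 = smaller + larger + lt
    # List 1: larger POCs first (ascending), then smaller POCs (descending).
    list1 = larger + smaller + lt
    return list0, list1
```

For a current POC of 4 with short-term references at POCs 0, 2, 6, 8, RefPicList0 starts 2, 0, 6, 8 while RefPicList1 starts 6, 8, 2, 0, matching the preference for past references in list 0 and future references in list 1.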
The reordering process is invoked by a loop of consecutive RPLR commands, which are of four types. The first type is a command specifying a short-term picture with a smaller PicNum (compared to a predicted PicNum) to be moved. The second type is a command specifying a short-term picture with a larger PicNum to be moved. The third type is a command specifying a long-term picture with a certain LongTermPicNum to be moved. The fourth type is a command specifying the end of the RPLR loop. If the current picture is bi-predicted, there are two loops: one for the forward reference list and the other for the backward reference list.
The predicted PicNum, called picNumLXPred, is initialized as the PicNum of the current coded picture, and is set to the PicNum of the just-moved picture after each reordering process for a short-term picture. The difference between the PicNum of the picture being reordered and picNumLXPred is signaled in the RPLR command. The picture indicated to be reordered is moved to the beginning of the reference picture list. After the reordering process is completed, the whole reference picture list is truncated based on the active reference picture list size, which is num_ref_idx_lX_active_minus1 + 1 (where X equal to 0 or 1 corresponds to RefPicList0 and RefPicList1, respectively).
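The net effect of a decoded RPLR command sequence can be sketched as successive move-to-front operations followed by truncation. This is an illustrative simplification: the normative process signals PicNum differences rather than resolved picture identities.

```python
def apply_rplr(initial_list, reordered_pics, active_size):
    """Apply resolved RPLR reorderings, then truncate to the active list size."""
    lst = list(initial_list)
    insert_pos = 0
    for pic in reordered_pics:
        lst.remove(pic)              # take the picture out of its old slot
        lst.insert(insert_pos, pic)  # place it at the next head position
        insert_pos += 1
    # Truncate to num_ref_idx_lX_active_minus1 + 1 entries.
    return lst[:active_size]
```

For example, reordering picture 3 to the front of the initial list [1, 2, 3, 4] with an active size of 3 yields [3, 1, 2].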
The hypothetical reference decoder (HRD), specified in Annex C of the H.264/AVC standard, is used to check bitstream and decoder conformance. The HRD contains a coded picture buffer (CPB), an instantaneous decoding process, a decoded picture buffer (DPB), and an output picture cropping block. The CPB and the instantaneous decoding process are specified similarly to any other video coding standard, and the output picture cropping block simply crops those samples from the decoded picture that are outside the signaled output picture extents. The DPB was introduced in H.264/AVC to control the memory resources required for decoding conformant bitstreams.
There are two reasons to buffer decoded pictures: for references in inter prediction and for reordering decoded pictures into output order. Since the H.264/AVC standard provides a great deal of flexibility for both reference picture marking and output reordering, separate buffers for reference picture buffering and output picture buffering could waste memory resources. Hence, the DPB includes a unified decoded picture buffering process for reference pictures and output reordering. A decoded picture is removed from the DPB when it is no longer used as a reference and is no longer needed for output. The maximum size of the DPB that bitstreams are allowed to use is specified in the level definitions (Annex A) of the H.264/AVC standard.
There are two types of conformance for decoders: output timing conformance and output order conformance. For output timing conformance, a decoder must output pictures at identical times compared to the HRD. For output order conformance, only the correct order of the output pictures is taken into account. The output order DPB is assumed to contain the maximum allowed number of frame buffers. A frame is removed from the DPB when it is no longer used as a reference and is no longer needed for output. When the DPB becomes full, the earliest frame in output order is output until at least one frame buffer becomes unoccupied.
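The full-DPB behavior described above is often called the "bumping" process. A minimal sketch, with frames represented by their POC values so that the smallest POC is the earliest frame in output order (illustrative only):

```python
def bump_until_free(dpb, max_frame_buffers):
    """Output frames in output order until a frame buffer is unoccupied.

    dpb is a mutable list of POC values of frames awaiting output; frames
    are removed (output) only while the DPB is full.
    """
    output = []
    while len(dpb) >= max_frame_buffers:
        earliest = min(dpb)   # smallest POC = earliest frame in output order
        dpb.remove(earliest)
        output.append(earliest)
    return output
```

With three frame buffers all occupied by frames at POCs 8, 2, and 4, the frame at POC 2 is output first, freeing one buffer.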
Temporal scalability is achieved with the hierarchical B-picture GOP structure using only AVC tools. A typical temporally scalable GOP usually consists of a key picture coded as an I or P frame and other pictures coded as B pictures. Those B pictures are coded hierarchically based on POC. The coding of a GOP requires, in addition to the pictures within the GOP, only the key picture of the previous GOP. In implementations, the relative POC number (POC minus the POC of the previous anchor picture) is referred to as POCIdInGOP. Each POCIdInGOP can be expressed in the form POCIdInGOP = 2^x * y, where y is an odd number. Pictures with the same value of x belong to the same temporal level, referred to as L - x, where L equals log2(GOP_length). Only pictures of the highest temporal level L are not stored as reference pictures. Typically, pictures of a temporal level can only use pictures of lower temporal levels as references, in order to support temporal scalability; that is, pictures of higher temporal levels can be discarded without affecting the decoding of pictures of lower temporal levels. Similarly, the same hierarchy could be applied in the view dimension for view scalability.
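The temporal-level rule above can be made concrete: factoring POCIdInGOP as 2^x * y with y odd gives the level L - x. A sketch under the stated assumptions (GOP length a power of two):

```python
import math


def temporal_level(poc_id_in_gop, gop_length):
    """Temporal level of a picture in a hierarchical-B GOP (sketch)."""
    L = int(math.log2(gop_length))
    x = 0
    while poc_id_in_gop % 2 == 0:   # strip factors of two: POCIdInGOP = 2**x * y
        poc_id_in_gop //= 2
        x += 1
    return L - x
```

For a GOP of length 8 (L = 3), the key picture (POCIdInGOP 8) is at level 0, POCIdInGOP 4 is at level 1, POCIdInGOP 2 and 6 are at level 2, and the odd positions are at the highest level 3, which is not stored for reference.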
In the current JMVM, frame_num is coded and signaled separately for each view, i.e., the value of frame_num is incremented relative to that of the previous reference frame or reference complementary field pair within the same view as the current picture. Furthermore, the pictures of all views share the same DPB buffer. To handle reference picture list construction and reference picture management globally, FrameNum and POC generation are redefined as follows:
FrameNum=frame_num*(1+num_views_minus_1)+view_id
PicOrderCnt()=PicOrderCnt()*(1+num_views_minus_1)+view_id;
JMVM basically follows the same reference picture marking as AVC. The only difference is that FrameNum is redefined in JMVM, and consequently FrameNumWrap is redefined as follows:
if (FrameNum > frame_num * (1 + num_views_minus_1) + view_id)
    FrameNumWrap = FrameNum - MaxFrameNum * (1 + num_views_minus_1) + view_id
else
    FrameNumWrap = FrameNum
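The JMVM redefinitions above can be transcribed directly, writing num_views for 1 + num_views_minus_1 (an illustrative transcription of the JMVM 1.0 formulas, not a normative implementation):

```python
def jmvm_frame_num(frame_num, view_id, num_views):
    """Global FrameNum: per-view frame_num values interleaved by view count."""
    return frame_num * num_views + view_id


def jmvm_frame_num_wrap(ref_frame_num, cur_frame_num, cur_view_id,
                        num_views, max_frame_num):
    """FrameNumWrap as redefined in JMVM 1.0 (direct transcription).

    ref_frame_num is the already-globalized FrameNum of the reference picture;
    cur_frame_num and cur_view_id describe the current picture.
    """
    if ref_frame_num > cur_frame_num * num_views + cur_view_id:
        return ref_frame_num - max_frame_num * num_views + cur_view_id
    return ref_frame_num
```

Note how the scaling by num_views spreads the global FrameNum values apart, which is the root of the inefficiency in MMCO and RPLR signaling discussed below.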
In the current JMVM standard, inter-view reference pictures are implicitly specified in the SPS (sequence parameter set) extension, where the number of active inter-view reference pictures and the view_id values of those pictures are specified. This information is shared by all the pictures referring to the same SPS. The reference picture list construction process first performs reference picture list initialization, reordering, and truncation in the same way as in AVC, but considering all the reference pictures stored in the DPB. The pictures with view_id values specified in the SPS and within the same temporal axis (i.e., having the same capture/output time) as the current picture are then appended to the reference list in the order they are listed in the SPS.
Unfortunately, the above JMVM design results in a number of problems. First, it is sometimes desirable for switching of the views to be decoded (by a decoder), transmitted (by a sender), or forwarded (by a media gateway or MANE) to occur at a temporal index other than those corresponding to anchor pictures. For example, the base view may be compressed for maximum coding efficiency (with temporal prediction used heavily), such that anchor pictures are coded infrequently. Because anchor pictures are synchronized across all views, anchor pictures for the other views then occur infrequently as well. The current JMVM syntax lacks a means of signaling a picture from which the decoding of a certain view can be started (unless all the views at that temporal index contain anchor pictures).
Second, the allowed reference views for inter-view prediction are specified for each view (and separately for anchor and non-anchor pictures). However, depending on the similarity between the picture being coded and the potential reference pictures in the same temporal axis and in potential reference views, inter-view prediction may or may not be performed in the encoder. The current JMVM standard uses nal_ref_idc to indicate whether a picture is used for intra-view or inter-view prediction, but it cannot indicate separately whether a picture is used for intra-view prediction and/or inter-view prediction. Furthermore, according to JMVM 1.0, for AVC-compatible views, nal_ref_idc must be set not equal to 0 even when a picture is not used for temporal prediction but only for inter-view prediction reference. Consequently, if only such a view is decoded and output, additional DPB size is needed to store such pictures, although they could be output as soon as they are decoded.
Third, note that the reference picture marking process specified in JMVM 1.0 is essentially identical to the AVC process, except that FrameNum, FrameNumWrap, and consequently PicNum are redefined. This raises particular problems. For example, this process cannot efficiently handle the management of decoded pictures that need to be buffered for inter-view prediction, particularly when those pictures are not used for temporal prediction reference. The reason is that the DPB management processes specified in the AVC standard were intended for single-view coding. In single-view coding, such as in the AVC standard, decoded pictures buffered for temporal prediction reference or future output can be removed from the buffer when they are no longer needed for temporal prediction reference and future output. To enable removal of a reference picture as soon as it becomes no longer needed for temporal prediction reference and future output, the reference picture marking process is specified so that this can be known immediately after the reference picture becomes no longer needed for temporal prediction reference. However, for pictures used for inter-view prediction reference, there is no way to know immediately when such a picture becomes no longer needed for inter-view prediction reference. Consequently, pictures used for inter-view prediction reference may be buffered in the DPB unnecessarily, which reduces the efficiency of buffer memory usage.
In another example, given the way PicNum is recalculated, if the sliding window operation mode is in use and the number of short-term and long-term pictures equals the maximum, the short-term reference picture with the smallest FrameNumWrap is marked as "unused for reference". However, this picture is not necessarily the earliest coded picture, because the FrameNum order in the current JMVM does not follow the decoding order; sliding window reference picture marking therefore does not operate optimally in the current JMVM. Moreover, because PicNum is derived from the redefined and scaled FrameNumWrap, the difference between the PicNum values of two coded pictures is scaled accordingly. For example, assume that there are two pictures in the same view with frame_num equal to 3 and 5, respectively. When there is only one view, i.e., the bitstream is an AVC stream, the difference between the two PicNum values would be 2. When coding the picture with frame_num equal to 5, if an MMCO command is needed to mark the picture with PicNum equal to 3 as "unused for reference", the difference between the two values minus 1, equal to 1, is signaled in the MMCO command. Signaling this value requires 3 bits. If there are instead 256 views, the difference between the two PicNum values minus 1 becomes 511, and signaling the value requires 19 bits. The coding of MMCO commands is consequently far less efficient. In general, the increased number of bits for an MMCO command of the current JMVM, compared to single-view coding in H.264/AVC, equals 2 * log2(number of views).
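The bit counts cited above follow from the unsigned Exp-Golomb code, ue(v), that AVC uses for signaling the PicNum difference minus 1 in MMCO commands:

```python
def ue_bits(v):
    """Length in bits of the unsigned Exp-Golomb codeword for v.

    A ue(v) codeword has 2 * floor(log2(v + 1)) + 1 bits, which is
    2 * bit_length(v + 1) - 1.
    """
    return 2 * (v + 1).bit_length() - 1
```

With a single view, the example difference minus 1 is 1 and costs 3 bits; with 256 views it becomes 511 and costs 19 bits, an increase of 2 * log2(256) = 16 bits.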
A fourth set of problems surrounds the reference picture list construction process specified in JMVM 1.0. The reference picture list initialization process considers reference pictures from all views before the reordering process. However, because pictures from other views used for inter-view prediction are appended to the list after list truncation, reference pictures from other views never appear in the reference picture list after reordering and truncation anyway. Therefore, those pictures need not be considered in the initialization process. Moreover, illegal reference pictures (pictures having a different view_id than the current picture and not being temporally aligned with the current picture) and repeated inter-view reference pictures may appear in the finally constructed reference picture list.
The reference picture list initialization process operates as in the following steps: (1) All reference pictures are included in the initial list, regardless of their view_id and of whether they are temporally aligned with the current picture. In other words, the initial reference picture list may contain illegal reference pictures (pictures having a different view_id than the current picture and not being temporally aligned with the current picture). However, in view-first coding, the beginning of the initial list contains the reference pictures from the same view as the current picture. (2) Both intra-view and inter-view reference pictures may be reordered. After reordering, the beginning of the list may still contain illegal reference pictures. (3) The list is truncated, but the truncated list may still contain illegal reference pictures. (4) The inter-view reference pictures are appended to the list in the order they appear in the MVC extension of the SPS.
Furthermore, the reference picture list reordering process specified in JMVM 1.0 does not allow reordering of inter-view frames, which are always placed at the end of the list in the order they appear in the MVC extension of the SPS. This causes reduced flexibility in reference picture list construction, which leads to reduced compression efficiency when the default order of the inter-view reference frames is not optimal or when certain inter-view reference frames are more likely to be used for prediction than certain intra-view reference frames. Moreover, similarly to MMCO commands, because PicNum is derived from the redefined and scaled FrameNumWrap, an RPLR command involving the signaling of a PicNum difference minus 1 requires a longer VLC codeword for its coding, compared to single-view coding in the H.264/AVC standard.
Disclosure of Invention
The present invention provides an improved system and method for implementing efficient decoded picture buffer management in multi-view video coding. In one embodiment, a new flag is used to indicate whether the decoding of a view can be started from a certain picture. In a more particular embodiment, this flag is signaled in the NAL unit header. In another embodiment, a new flag is used to indicate whether a picture is used for inter-view prediction reference, while the syntax element nal_ref_idc only indicates whether a picture is used for temporal prediction reference. This flag can also be signaled in the NAL unit header. In a third embodiment, a set of new reference picture marking methods is used to efficiently manage the decoded pictures; these methods can include both sliding window and adaptive memory control mechanisms. In a fourth embodiment, a set of new reference picture list construction methods is used; these methods include reference picture list initialization and reordering.
These and other advantages and features of the invention, together with the organization and manner of operation thereof, will become apparent from the following detailed description when taken in conjunction with the accompanying drawings, wherein like elements have like numerals throughout the several drawings described below.
Drawings
Fig. 1 is a picture arrangement in a priority view coding arrangement;
fig. 2 is a picture arrangement in a time-first coding arrangement;
FIG. 3 is a depiction of an example MVC temporal and inter-view prediction structure;
FIG. 4 is an overview of a system in which the present invention may be implemented;
FIG. 5 is a perspective view of a mobile device that may be used in the practice of the present invention; and
Fig. 6 is a schematic representation of circuitry of the mobile device of fig. 5.
Detailed Description
Figure 4 shows a generic multimedia communication system for use with the present invention. As shown in fig. 4, a data source 100 provides a source signal in an analog, uncompressed digital, or compressed digital format, or any combination of these formats. An encoder 110 encodes the source signal into a coded media bitstream. The encoder 110 may be capable of encoding multiple media types, such as speech, audio and video, or multiple encoders 110 may be required to encode source signals of different media types. The encoder 110 may also obtain synthetically produced input, such as graphics and text, or it may be capable of producing coded bitstreams of synthetic media. In the following, only the processing of one coded media bitstream of one media type is considered to simplify the description. It should be noted, however, that a real-time broadcast service typically comprises several streams (typically at least one audio, one video and one text subtitle stream). It should also be noted that the system may include multiple encoders, but only one encoder 110 is considered below to simplify the description without loss of generality.
The coded media bitstream is transferred to a storage 120. The storage 120 may comprise any type of mass memory to store the coded media bitstream. The format of the coded media bitstream in the storage 120 may be an elementary self-contained bitstream format, or one or more coded media bitstreams may be encapsulated into a container file. Some systems operate "live", i.e. omit storage and transfer the coded media bitstream from the encoder 110 directly to a sender 130. The coded media bitstream is then transferred to the sender 130, also referred to as the server, on a need basis. The format used in the transmission may be an elementary self-contained bitstream format, a packetized stream format, or one or more coded media bitstreams may be encapsulated into a container file. The encoder 110, the storage 120, and the sender 130 may reside in the same physical device, or they may be included in separate devices. The encoder 110 and the sender 130 may operate with live real-time content, in which case the coded media bitstream is typically not stored permanently but rather buffered for short periods of time in the content encoder 110 and/or in the sender 130 to smooth out variations in processing delay, transfer delay, and coded media bit rate.
The sender 130 sends the coded media bitstream using a communication protocol stack. The stack may include, but is not limited to, real-time transport protocol (RTP), User Datagram Protocol (UDP), and Internet Protocol (IP). When the communication protocol stack is packet-oriented, the sender 130 encapsulates the coded media bitstream into packets. For example, when RTP is used, the sender 130 encapsulates the coded media bitstream into RTP packets according to an RTP payload format. Typically, each media type has a dedicated RTP payload format. It should be noted again that: the system may contain multiple senders 130, but the following description considers only one sender 130 for simplicity.
Sender 130 may or may not be connected to gateway 140 through a communication network. The gateway 140 may perform different types of functions, such as translating a packet stream according to one communication protocol stack to another communication protocol stack, merging and forking data streams, and manipulating data streams according to downlink and/or receiver capabilities, such as controlling the bit rate of the forwarded stream according to prevailing downlink network conditions. Examples of gateways 140 include multipoint conference control units (MCUs), gateways between circuit-switched and packet-switched video telephony, push-to-talk over cellular (PoC) servers, IP encapsulators in digital video broadcast handheld (DVB-H) systems, or set top boxes that forward broadcast transmissions locally to a home wireless network. When RTP is used, gateway 140 is referred to as an RTP mixer and acts as an endpoint for the RTP connection.
The system includes one or more receivers 150 that are generally capable of receiving the transmitted signal, demodulating and decapsulating the signal into a coded media bitstream. The coded media bitstream is typically further processed by a decoder 160, the output of which is one or more uncompressed media streams. Finally, a renderer (renderer)170 may reproduce the uncompressed media bit stream with a speaker or a display, for example. The receiver 150, decoder 160 and renderer 170 may reside in the same physical device, or they may be contained in separate devices.
Scalability in terms of bit rate, decoding complexity and picture size is a desirable property for heterogeneous and error-prone environments. This property is needed in order to cope with limitations such as constraints on bit rate, display resolution, network throughput and computational power in the receiving device.
It should be understood that although the words and examples contained herein may specifically describe an encoding process, those skilled in the art will readily understand that the same concepts and principles also apply to a corresponding decoding process and vice versa. It should be noted that the bitstream to be decoded can be received by a remote device located within virtually any type of network. Further, the bitstream may be received from local hardware or software.
The communications devices of the present invention may communicate using various transmission techniques including, but not limited to, Code Division Multiple Access (CDMA), global system for mobile communications (GSM), Universal Mobile Telecommunications System (UMTS), Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), transmission control protocol/internet protocol (TCP/IP), Short Messaging Service (SMS), Multimedia Messaging Service (MMS), e-mail, Instant Messaging Service (IMS), bluetooth, IEEE 802.11, and the like. A communication device may communicate using various media including, but not limited to, radio, infrared, laser, cable connection, and the like.
Figures 5 and 6 show one representative mobile device 12 within which the present invention may be implemented. It should be understood, however, that the present invention is not intended to be limited to one particular type of mobile device 12 or other electronic device. Some or all of the features shown in fig. 5 and 6 may be incorporated into any or all of the devices that may be shown in fig. 4.
The mobile device 12 of figures 5 and 6 includes a housing 30, a display 32 in the form of a liquid crystal display, a keypad 34, a microphone 36, an ear-piece 38, a battery 40, an infrared port 42, an antenna 44, a smart card 46 in the form of a UICC according to one embodiment of the invention, a card reader 48, radio interface circuitry 52, codec circuitry 54, a controller 56 and a memory 58. Individual circuits and elements are all of a type well known in the art, for example in the Nokia range of mobile devices.
The present invention provides an improved system and method for implementing efficient decoded picture buffer management in multi-view video coding. To solve the problem that the current JMVM syntax provides no way to signal the picture from which the decoding of a certain view can be started (unless all views at that temporal index contain anchor pictures), a new flag is signaled to indicate whether a view can be accessed from a certain picture, i.e. whether decoding of the view can be started from that picture. In one embodiment of the present invention, this flag is signaled in the NAL unit header. The following is an example of the syntax and semantics of the flag according to one specific embodiment. However, it is also possible to change the semantics of the syntax element anchor_pic_flag similarly, instead of adding a new syntax element.
| nal_unit_header_svc_mvc_extension(){ | C | Descriptor(s) |
| svc_mvc_flag | All | u(1) |
| if(!svc_mvc_flag){ | ||
| priority_id | All | u(6) |
| discardable_flag | All | u(1) |
| temporal_level | All | u(3) |
| dependency_id | All | u(3) |
| quality_level | All | u(2) |
| layer_base_flag | All | u(1) |
| use_base_prediction_flag | All | u(1) |
| fragmented_flag | All | u(1) |
| last_fragment_flag | All | u(1) |
| fragment_order | All | u(2) |
| reserved_zero_two_bits | All | u(2) |
| }else{ | ||
| view_refresh_flag | All | u(1) |
| view_subset_id | All | u(2) |
| view_level | All | u(3) |
| anchor_pic_flag | All | u(1) |
| view_id | All | u(10) |
| reserved_zero_five_bits | All | u(6) |
| } | ||
| nalUnitHeaderBytes += 3 | | |
| } |
For a certain picture in a view, all pictures at the same temporal position from other views used in inter-view prediction are referred to as "direct dependent view pictures", and all pictures at the same temporal position from other views required for current picture decoding are referred to as "dependent view pictures".
The semantics of view_refresh_flag may be specified in four ways in one embodiment. A first way to specify the semantics of view_refresh_flag involves having view_refresh_flag indicate that the current picture and all subsequent pictures in output order in the same view can be correctly decoded when all direct dependent view pictures of the current and subsequent pictures in the same view are also (possibly only partially) decoded, without decoding any preceding pictures in the same view or other views. This implies that (1) no dependent view picture depends on any preceding picture, in decoding order, in any view, or (2) if any dependent view picture depends on a preceding picture, in decoding order, in any view, then only the constrained intra-coded regions of the direct dependent view pictures of the current and subsequent pictures in the same view are used for inter-view prediction. A constrained intra-coded region uses no data from inter-coded neighboring regions for intra prediction.
A second way to specify the semantics of view_refresh_flag involves having view_refresh_flag indicate that the current picture and all subsequent pictures in decoding order in the same view can be correctly decoded when all direct dependent view pictures of the current and subsequent pictures in the same view are decoded fully or, in one embodiment, partially, without decoding any preceding pictures.
A third way to specify the semantics of view_refresh_flag involves having view_refresh_flag indicate that the current picture and all subsequent pictures in output order in the same view can be correctly decoded when all dependent view pictures of the current and subsequent pictures in the same view are decoded fully or, in one embodiment, partially. This definition is analogous to an intra picture starting an open GOP in single-view coding. In specification text, this option can be written as follows: view_refresh_flag equal to 1 indicates that the current picture and any picture that follows the current picture in both decoding order and output order in the same view as the current picture do not refer, in the inter prediction process, to any picture preceding the current picture in decoding order. view_refresh_flag equal to 0 indicates that the current picture or a picture that follows the current picture in both decoding order and output order in the same view as the current picture may refer, in the inter prediction process, to a picture preceding the current picture in decoding order.
A fourth way to specify the semantics of view_refresh_flag involves having view_refresh_flag indicate that the current picture and all subsequent pictures in decoding order in the same view can be correctly decoded when all dependent view pictures of the current and subsequent pictures in the same view are decoded fully or, in one embodiment, partially. This definition is analogous to an intra picture starting a closed GOP in single-view coding.
view_refresh_flag may be used in a system such as the one shown in fig. 4. In this scenario, the receiver 150 has received, or the decoder 160 has decoded, only a certain subset M of all the available N views, the subset excluding view A. Due to a user action, for example, the receiver 150 or the decoder 160 would like to receive or decode, respectively, view A from now on. The decoder may start the decoding of view A from the first picture, within view A, having view_refresh_flag equal to 1. If view A has not been received, the receiver 150 may indicate to the gateway 140 or the sender 130 to include coded pictures of view A in the transmitted bitstream. The gateway 140 or the sender 130 may wait, before sending any pictures of view A, until the next picture within view A having view_refresh_flag equal to 1, in order to avoid sending pictures from view A that the decoder 160 could not decode successfully.
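The gateway/sender behavior just described can be sketched as a simple stream filter; the function and field names below are illustrative assumptions, not MVC syntax.

```python
# Sketch: defer forwarding of a newly requested view until its first
# picture with view_refresh_flag equal to 1, so the decoder never
# receives pictures of that view it cannot decode.

def forwardable(pictures, new_view_id):
    started = False
    out = []
    for pic in pictures:
        if pic["view_id"] == new_view_id:
            if not started and pic["view_refresh_flag"] == 1:
                started = True  # random access point for this view
            if not started:
                continue        # decoder could not use this picture anyway
        out.append(pic)
    return out

stream = [
    {"view_id": 0, "view_refresh_flag": 1},
    {"view_id": 1, "view_refresh_flag": 0},  # dropped: before refresh
    {"view_id": 0, "view_refresh_flag": 0},
    {"view_id": 1, "view_refresh_flag": 1},  # forwarding of view 1 starts
    {"view_id": 1, "view_refresh_flag": 0},
]
filtered = forwardable(stream, new_view_id=1)
```

Here the view-1 picture preceding the refresh point is withheld, while all pictures of already-transmitted views pass through untouched.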
To solve the second problem discussed previously, a new flag is signaled to indicate whether a picture is used for inter-view prediction reference, while the syntax element nal_ref_idc indicates only whether a picture is used for temporal prediction reference. In one embodiment, this flag is signaled in the NAL unit header. The following is an example of the syntax and semantics of the flag.
| nal_unit_header_svc_mvc_extension(){ | C | Descriptor(s) |
| svc_mvc_flag | All | u(1) |
| if(!svc_mvc_flag){ | ||
| priority_id | All | u(6) |
| discardable_flag | All | u(1) |
| temporal_level | All | u(3) |
| dependency_id | All | u(3) |
| quality_level | All | u(2) |
| layer_base_flag | All | u(1) |
| use_base_prediction_flag | All | u(1) |
| fragmented_flag | All | u(1) |
| last_fragment_flag | All | u(1) |
| fragment_order | All | u(2) |
| reserved_zero_two_bits | All | u(2) |
| }else{ | ||
| inter_view_reference_flag | All | u(1) |
| view_subset_id | All | u(2) |
| view_level | All | u(3) |
| anchor_pic_flag | All | u(1) |
| view_id | All | u(10) |
| reserved_zero_five_bits | All | u(5) |
| } | | |
| nalUnitHeaderBytes += 3 | | |
| } | | |
inter_view_reference_flag equal to 0 indicates that the current picture is not used as an inter-view reference picture. inter_view_reference_flag equal to 1 indicates that the current picture is used as an inter-view reference picture. The value of inter_view_reference_flag is inferred to be equal to 1 when profile_idc indicates an MVC profile and view_id is equal to 0. When a picture is decoded, all pictures that have inter_view_reference_flag equal to 1 and the same temporal axis as the current picture are referred to as the inter-view pictures of the current picture.
The inter_view_reference_flag may be used in the gateway 140, also referred to as a media-aware network element (MANE). When a picture is used neither as an inter-view reference nor as an intra-view reference (inter_view_reference_flag equal to 0 and nal_ref_idc equal to 0), a MANE may choose not to forward it, without consequences to the decoding of the remaining bitstream. When a picture is not used as an inter-view reference but is used as an intra-view reference, a MANE should drop the picture only if it also drops the transmission of the view in which the picture resides, i.e. only when decoding of that view is not required or desired.
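This forwarding rule can be sketched as a small predicate; the function and field names are illustrative, not part of the MVC syntax. "May drop" here means dropping does not break decoding of the rest of the bitstream (the picture might still be wanted for display).

```python
# Sketch of the MANE decision described above, based on the two flags.

def mane_may_drop(pic, view_is_needed):
    inter_view_ref = pic["inter_view_reference_flag"] == 1
    intra_view_ref = pic["nal_ref_idc"] > 0
    if not inter_view_ref and not intra_view_ref:
        return True                       # no other picture references it
    if not inter_view_ref and intra_view_ref:
        # Only its own view depends on it: droppable exactly when
        # decoding of that view is not required or desired.
        return not view_is_needed(pic["view_id"])
    return False                          # inter-view reference: keep

non_ref = {"inter_view_reference_flag": 0, "nal_ref_idc": 0, "view_id": 2}
intra_ref = {"inter_view_reference_flag": 0, "nal_ref_idc": 1, "view_id": 2}
inter_ref = {"inter_view_reference_flag": 1, "nal_ref_idc": 0, "view_id": 2}
needed = lambda v: v == 2
```

A non-reference picture is always safely droppable, an intra-view-only reference is droppable only together with its whole view, and an inter-view reference must be kept.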
The flag inter_view_reference_flag is also reused to address the problem that the reference picture marking process specified in JMVM 1.0 cannot efficiently handle the management of decoded pictures that must be buffered for inter-view prediction. Pictures having inter_view_reference_flag equal to 1 may be marked using any of three methods.
A first method for marking pictures with inter_view_reference_flag equal to 1 involves storing the inter-view reference pictures temporarily as long-term pictures. During the encoding process, each picture indicated in the bitstream to be used for inter-view prediction is marked as "used for long-term reference". One way of providing this indication is the inter_view_reference_flag. The decoder responds to the indication by marking the picture as "used for long-term reference" and "temporary multi-view long-term reference". Any memory management control operation targeting a picture marked as "used for long-term reference" and "temporary multi-view long-term reference" is buffered temporarily. When all pictures of the temporal axis have been encoded or decoded, all pictures marked as "used for long-term reference" and "temporary multi-view long-term reference" lose these markings, and reference picture marking is redone for them, in their decoding order, using the sliding window process or the buffered memory management control operations (whichever applies to the particular picture). For example, if a picture is used for inter prediction (i.e. its nal_ref_idc is greater than 0), it is marked back as "used for short-term reference"; if a picture is not used for inter prediction (i.e. its nal_ref_idc equals 0), it is marked as "unused for reference". Usually, only two cases occur for a certain temporal axis: either all the pictures are reference pictures for inter prediction, or no picture is a reference picture for inter prediction. This latter operation may be performed after the last VCL NAL unit of the temporal axis is decoded, or before the next access unit or the next picture of the subsequent temporal axis is decoded. During decoding, the operation in this stage can be triggered implicitly by a change of temporal axis, or it can be signaled explicitly, e.g. as an MMCO command.
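The end-of-temporal-axis re-marking in this first method can be sketched as follows; the mark names are abbreviated and the data model is an assumption made for illustration.

```python
# Sketch: at the end of a temporal axis, the temporary multi-view
# long-term markings are removed and marking is redone in decoding
# order based on nal_ref_idc (sliding window / buffered MMCOs omitted).

def finish_time_axis(pictures_in_decoding_order):
    for pic in pictures_in_decoding_order:
        if "temp_multiview_long_term" in pic["marks"]:
            pic["marks"].discard("used_for_long_term_reference")
            pic["marks"].discard("temp_multiview_long_term")
            if pic["nal_ref_idc"] > 0:
                pic["marks"].add("used_for_short_term_reference")
            else:
                pic["marks"].add("unused_for_reference")

pics = [
    {"nal_ref_idc": 1,
     "marks": {"used_for_long_term_reference", "temp_multiview_long_term"}},
    {"nal_ref_idc": 0,
     "marks": {"used_for_long_term_reference", "temp_multiview_long_term"}},
]
finish_time_axis(pics)
```

After the call, the picture used for inter prediction is marked back as a short-term reference, and the inter-view-only picture becomes unused for reference.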
With this approach, inter-view reference pictures have the same impact as long-term reference pictures for weighted prediction and in temporal direct mode.
A second method for marking pictures with inter_view_reference_flag equal to 1 involves marking the inter-view reference pictures as "used for inter-view reference". With this approach, the reference picture marking for inter prediction (marking as "used for short-term reference" and "used for long-term reference") is unchanged relative to the AVC standard. For the processes related to the temporal direct mode and weighted prediction, pictures marked as "used for inter-view reference", i.e. those inter-view reference pictures that share the same temporal axis as the current picture, are treated identically to long-term reference pictures. When all pictures of the temporal axis have been encoded or decoded, all pictures marked as "used for inter-view reference" lose that marking.
Note that removing the "used for inter-view reference" marking after all pictures of a temporal axis have been processed is only one embodiment of the invention. The "used for inter-view reference" marking could also be removed at other instants of the decoding process. For example, the "used for inter-view reference" marking of a particular picture may be removed as soon as neither the current picture nor any subsequent picture depends on it, directly or indirectly, according to the view dependency signaling included in the MVC extension of the SPS.
The operation of removing the "used for inter-view reference" marking of the appropriate pictures can be performed after the last VCL NAL unit of the temporal axis is decoded, or before the next access unit or the next picture of the subsequent temporal axis is decoded. During decoding, it can be triggered implicitly by a change of temporal axis, or it can be signaled explicitly, e.g. as an MMCO command.
With this particular approach, inter-view reference pictures have the same impact as long-term reference pictures for weighted prediction and in temporal direct mode. In other words, this method has the same effect as the first method discussed above for weighted prediction and in temporal direct mode.
In this approach, an improved sliding window mechanism may be applied to remove the "used for inter-view reference" marking of pictures that are used only for inter-view prediction (i.e. pictures having nal_ref_idc equal to 0 and marked as "used for inter-view reference"). This improved sliding window mechanism uses a variable, named e.g. num_inter_view_ref_frames and preferably signaled in the SPS extension for MVC, such that when the number of pictures marked as "used for inter-view reference" and having nal_ref_idc equal to 0 is equal to num_inter_view_ref_frames, the earliest decoded one of them loses the "used for inter-view reference" marking. Then, if the picture is not needed for output either (it has already been output, or it is not to be output), the decoder may invoke a process to remove the picture from the DPB so that a newly decoded picture can be stored in the DPB.
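This improved sliding window can be sketched as follows; the data model is assumed, and in practice num_inter_view_ref_frames would come from the SPS extension for MVC.

```python
# Sketch: a FIFO window over pictures that are used ONLY for inter-view
# prediction (nal_ref_idc == 0 but marked "used for inter-view
# reference"); the earliest decoded one loses its marking when the
# window is full.

def store_inter_view_only(dpb, new_pic, num_inter_view_ref_frames):
    new_pic["marks"] = {"used_for_inter_view_reference"}
    dpb.append(new_pic)
    window = [p for p in dpb
              if "used_for_inter_view_reference" in p["marks"]
              and p["nal_ref_idc"] == 0]
    if len(window) > num_inter_view_ref_frames:
        # Earliest decoded picture in the window loses the marking; it
        # may then be removed from the DPB if not needed for output.
        window[0]["marks"].discard("used_for_inter_view_reference")

dpb = []
for n in range(3):
    store_inter_view_only(dpb, {"id": n, "nal_ref_idc": 0},
                          num_inter_view_ref_frames=2)
```

After the third picture is stored, the first one has been unmarked while the two most recent retain their inter-view marking.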
A third method for marking pictures with inter_view_reference_flag equal to 1 involves marking pictures after the decoding of all pictures of the same temporal axis, i.e. the same temporal index, rather than immediately after each picture is decoded. Sliding window or adaptive reference picture marking, as indicated in each coded picture, is performed in the decoding order of the pictures. For the processes related to the temporal direct mode and weighted prediction, pictures of the same temporal axis as the current picture are treated identically to long-term reference pictures. The inter-view reference pictures of the same temporal axis as the current picture are included in the initial reference picture list construction, and they may be reordered based on their view_id, or may first be assigned long-term reference indices and then be remapped based on the long-term reference indices.
As discussed previously, given the way PicNum is recalculated, when the sliding window mode of operation is in use and the numbers of short-term and long-term pictures sum to the maximum, the short-term reference picture having the smallest FrameNumWrap is marked as "unused for reference". However, because the FrameNum order in the current JMVM does not follow the decoding order, this picture is not necessarily the earliest coded picture, and the sliding window reference picture marking therefore does not operate optimally in the current JMVM. To solve this problem, and in contrast to the JMVM standard, the variables FrameNum and FrameNumWrap are not redefined or scaled, i.e. their definitions are kept unchanged relative to the AVC standard. Short-term pictures are managed automatically by the first-in, first-out sliding window mechanism. Only a slight modification of the sliding window mechanism is needed relative to JMVM 1.0. The modification is as follows (italics indicate new text):
G.8.2.5.3 Sliding window decoded reference picture marking process
This process is invoked when adaptive_ref_pic_marking_mode_flag is equal to 0. Only the reference pictures having the same view_id as the current slice are considered in the process, including the calculation of numShortTerm and numLongTerm and the applicable value of num_ref_frames.
In the above method, the total number of reference frames for the entire MVC bitstream, which indicates the buffer size needed for storing pictures used for intra-view or inter-view reference in decoding the entire MVC bitstream, should be equal to the sum of the num_ref_frames values applicable to each view contained in the MVC bitstream, plus the maximum number of inter-view reference frames used in decoding the MVC bitstream. Alternatively, the sliding window may be performed globally over the pictures of all views.
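As a worked example of this buffer-size rule (the numbers below are invented for illustration):

```python
# Total reference frames for the whole MVC bitstream = sum of the
# per-view num_ref_frames values + the maximum number of inter-view
# reference frames used in decoding the bitstream.

num_ref_frames_per_view = {0: 4, 1: 2, 2: 2}  # three hypothetical views
max_inter_view_ref_frames = 2
total_ref_frames = (sum(num_ref_frames_per_view.values())
                    + max_inter_view_ref_frames)
```

With these values the DPB must accommodate 10 reference frames in total.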
For time-first coding, the sliding window process is defined as follows (italics represent new text for JMVM 1.0):
G.8.2.5.3 Sliding window decoded reference picture marking process
…
…
When numShortTerm + numLongTerm is equal to Max(num_ref_frames, 1), the condition that numShortTerm be greater than 0 shall be fulfilled, and the short-term reference frame, complementary reference field pair or non-paired reference field selected by the following rule is marked as "unused for reference". When it is a frame or a complementary field pair, both of its fields are also marked as "unused for reference".
*The selection rule: from all the pictures having the smallest FrameNumWrap value, the first one in decoding order is selected. The decoding order of these pictures is indicated by the view_id values, or by the view dependency information signaled in the SPS extension for MVC.
For view-first coding, the sliding window process is defined as follows (italics represent new text relative to JMVM 1.0):
G.8.2.5.3 Sliding window decoded reference picture marking process
…
…
When numShortTerm + numLongTerm is equal to Max(num_ref_frames, 1), the condition that numShortTerm be greater than 0 shall be fulfilled, and the short-term reference frame, complementary reference field pair or non-paired reference field selected by the following rule is marked as "unused for reference". When it is a frame or a complementary field pair, both of its fields are also marked as "unused for reference".
*The selection rule: from all the pictures of the earliest decoded view, the picture with the smallest FrameNumWrap is selected. The view decoding order is indicated by the view_id values, or by the view dependency information signaled in the SPS extension for MVC.
As discussed previously, because PicNum is derived from the redefined and scaled FrameNumWrap, the difference between the PicNum values of two coded pictures is scaled, on average. Assume, for example, two pictures in the same view having frame_num equal to 3 and 5, respectively. When there is only one view, i.e. the bitstream is an AVC stream, the difference between the two PicNum values would be 2. When the picture with frame_num equal to 5 is being encoded and an MMCO command is needed to mark the picture with PicNum equal to 3 as "unused for reference", the difference between the two values minus 1, which equals 1, is signaled in the MMCO command. Signaling this value requires 3 bits. If, however, there are 256 views, the difference between the two PicNum values minus 1 becomes 511, and signaling the value requires 19 bits. The encoding of MMCO commands is therefore much less efficient. Generally, compared to the single-view coding of H.264/AVC, the number of bits for the current MMCO commands of JMVM is increased by 2*log2(number of views).
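The bit counts quoted above follow from the Exp-Golomb ue(v) code that H.264/AVC uses for these difference values, whose codeword length is 2*floor(log2(v + 1)) + 1 bits:

```python
# Length in bits of the unsigned Exp-Golomb ue(v) codeword for value v.
def ue_bits(v):
    n = (v + 1).bit_length() - 1   # floor(log2(v + 1))
    return 2 * n + 1

single_view_cost = ue_bits(1)     # difference minus 1 = 1 in AVC
mvc_256_view_cost = ue_bits(511)  # scaled difference minus 1 = 511
extra_bits = mvc_256_view_cost - single_view_cost
```

The extra cost is 16 bits, matching 2*log2(256) from the text.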
To address this problem, and in contrast to the JMVM standard, the variables FrameNum and FrameNumWrap are not redefined or scaled, in line with the AVC standard. From the DPB size point of view, a picture mostly does not need to contain MMCO commands that remove from the DPB pictures belonging neither to the same view nor to the same temporal axis as the current picture. Nevertheless, some such pictures may become unneeded for reference and may therefore be marked as "unused for reference"; in this case, the marking can be performed by the sliding window process, or it can be postponed until the next coded picture having the same view_id. Accordingly, the MMCO commands are constrained to mark as "unused for reference" only pictures belonging to the same view or to the same temporal axis as the current picture, even though the DPB may contain pictures of different views and different temporal axes.
The modification of JMVM1.0 for intra-view reference picture marking is as follows (changes are shown in italics):
G.8.2.5.4.1 Process for marking a short-term reference picture as "unused for reference"
This process is invoked when adaptive_ref_pic_marking_mode_flag is equal to 1. Only the reference pictures having the same view_id as the current slice are considered in this process.
The syntax and semantics for inter-view reference picture marking may be as follows:
| slice_header(){ | C | descriptor(s) |
| … | ||
| if(nal_ref_idc!=0) | ||
| dec_ref_pic_marking() | 2 | |
| if(inter_view_reference_flag) | ||
| dec_view_ref_pic_marking_mvc() | 2 | |
| } |
| dec_view_ref_pic_marking_mvc(){ | C | Descriptor(s) |
| adaptive_view_ref_pic_marking_mode_flag | 2 | u(1) |
| if(adaptive_view_ref_pic_marking_mode_flag) | ||
| do{ | ||
| view_memory_management_control_operation | 2 | ue(v) |
| if(view_memory_management_control_operation==1||view_memory_management_control_operation==2) | ||
| abs_difference_of_view_id_minus1 | 2 | ue(v) |
| }while(view_memory_management_control_operation != 0) | | |
| } | ||
| } |
The memory management control operation (view_memory_management_control_operation) values are as follows:
| view_memory_management_control_operation | memory management control operations |
| 0 | Ending the view memory _ management _ control _ operation loop |
| 1 | Remove the marking "used for inter-view reference", or mark a picture as "unused for reference"; abs_difference_of_view_id_minus1 is present and corresponds to the difference to be subtracted from the current view_id |
| 2 | Remove the marking "used for inter-view reference", or mark a picture as "unused for reference"; abs_difference_of_view_id_minus1 is present and corresponds to the difference to be added to the current view_id |
adaptive_view_ref_pic_marking_mode_flag specifies whether the sliding window mechanism (when equal to 0) or the adaptive reference picture marking process (when equal to 1) is in use.
The modified decoding process for inter-view reference picture marking is as follows:
8.2.5.5.2 Inter-view picture marking process
This process is invoked when view_memory_management_control_operation is equal to 1 or 2.
Let viewIDX be specified as follows:
if( view_memory_management_control_operation == 1 )
viewIDX = CurrViewId - ( abs_difference_of_view_id_minus1 + 1 )
else if( view_memory_management_control_operation == 2 )
viewIDX = CurrViewId + ( abs_difference_of_view_id_minus1 + 1 )
To allow view scalability, i.e. the possibility of selecting which views are transmitted, forwarded or decoded, the memory management control operations may be constrained as follows. If currTemporalLevel is equal to the temporal_level of the current picture and dependentViews is the set of views that depend on the current view, an MMCO command may only target pictures having a temporal_level equal to or greater than currTemporalLevel and lying within dependentViews. To allow this, an indication of the view_id is appended to the MMCO commands, or new MMCO commands with an indication of the view_id are specified.
To solve the problems related to the reference picture list construction process described previously, the variables FrameNum and FrameNumWrap are not redefined or scaled. This is the same behavior as in the AVC standard, and in contrast to the JMVM standard, in which the variables are redefined and rescaled. The modification of JMVM 1.0 is as follows (changes are shown in italics): in subclause 8.2.4.3.1, the reordering process of reference picture lists for short-term reference pictures, equation 8-38 should be changed to:
for(cIdx=num_ref_idx_1X_active_minus1+1;cIdx>refIdxLX;cIdx--)
RefPicListX[cIdx]=RefPicListX[cIdx-1]
RefPicListX[refIdxLX++]=short-term reference picture with PicNum equal to
picNumLX and view_id equal to CurrViewID
nIdx=refIdxLX
for(cIdx=refIdxLX;cIdx<=num_ref_idx_1X_active_minus1+1;cIdx++)(8-38)
//if(PicNumF(RefPicListX[cIdx])!=picNumLX)
if(PicNumF(RefPicListX[cIdx])!=picNumLX||ViewID(RefPicListX[cIdx])
!=CurrViewID)
RefPicListX[nIdx++]=RefPicListX[cIdx]
where CurrViewID is the view_id of the currently decoded picture.
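As a minimal sketch of the modified equation 8-38, the following models reference pictures as dictionaries with pic_num and view_id keys (hypothetical field names, not spec variables); it moves the unique picture matching both PicNum and the current view_id to the target index, which is the effect of the added ViewID check.

```python
def reorder_short_term(ref_pic_list, ref_idx_lx, pic_num_lx, curr_view_id):
    """Place the short-term picture whose pic_num equals pic_num_lx AND whose
    view_id equals curr_view_id at index ref_idx_lx (the view_id test is the
    JMVM modification shown in the text); all other pictures keep their
    relative order, and the list is truncated back to its original length."""
    def match(p):
        return p["pic_num"] == pic_num_lx and p["view_id"] == curr_view_id
    target = next(p for p in ref_pic_list if match(p))
    rest = [p for p in ref_pic_list if not match(p)]
    return (rest[:ref_idx_lx] + [target] + rest[ref_idx_lx:])[:len(ref_pic_list)]
```

This is a functional approximation of the shift-and-filter loops; the normative process operates in place on a list one entry longer than num_ref_idx_lX_active_minus1 + 1.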
Regarding the problems associated with the reference picture list initialization process discussed previously, they can be addressed by specifying that only frames, fields, or field pairs belonging to the same view as the current slice may be considered in the initialization process. Relative to JMVM 1.0, this wording may be added to the beginning of each of subsections 8.2.4.2.1 "Initialization process for the reference picture list for P and SP slices in frames" through 8.2.4.2.5 "Initialization process for reference picture lists in fields".
Regarding the other issues related to the reference picture list construction process, various methods can be used to reorder both inter-view pictures and pictures used for intra-view prediction efficiently. A first such method involves placing inter-view reference pictures in the list before the intra-view reference pictures and specifying separate RPLR processes for the inter-view pictures and the pictures used for intra-view prediction. The pictures used for intra-view prediction are also referred to as intra-view pictures. In this method, the reference picture list initialization process for intra-view pictures as specified above is performed, followed by the RPLR reordering process and the list truncation process for intra-view pictures. Then, the inter-view pictures are appended to the list after the intra-view pictures. Finally, the following syntax, semantics, and decoding process, modified from JMVM 1.0, may be used to select each inter-view picture and place it in a specified entry of the reference picture list. The method is applicable to RefPicList0 and RefPicList1 (if present).
| ref_pic_list_reordering(){ | C | Descriptor(s) |
| if(slice_type!=I&&slice_type!=SI){ | ||
| … | ||
| if(svc_mvc_flag) | ||
| { | ||
| view_ref_pic_list_reordering_flag_l0 | 2 | u(1) |
| if(view_ref_pic_list_reordering_flag_l0) | ||
| do{ | ||
| view_reordering_idc | 2 | ue(v) |
| if(view_reordering_idc==0||view_reordering_idc==1) | ||
| abs_diff_view_idx_minus1 | 2 | ue(v) |
| ref_idx | 2 | ue(v) |
| }while(view_reordering_idc!=2) | ||
| view_ref_pic_list_reordering_flag_l1 | 2 | u(1) |
| if(view_ref_pic_list_reordering_flag_l1) | ||
| do{ | ||
| view_reordering_idc | 2 | ue(v) |
| if(view_reordering_idc==0||view_reordering_idc==1) | ||
| abs_diff_view_idx_minus1 | 2 | ue(v) |
| ref_idx | 2 | ue(v) |
| }while(view_reordering_idc!=2) | ||
| } |
Regarding the syntax, view_ref_pic_list_reordering_flag_1X (X is 0 or 1) equal to 1 specifies that the syntax element view_reordering_idc is present for RefPicListX. view_ref_pic_list_reordering_flag_1X equal to 0 specifies that the syntax element view_reordering_idc is not present for RefPicListX. ref_idx indicates the entry in the reference picture list in which the inter-view picture is to be placed.
abs_diff_view_idx_minus1 plus 1 specifies the absolute difference between the view index of the picture to be placed in the reference picture list entry indicated by ref_idx and the view index prediction value. abs_diff_view_idx_minus1 ranges from 0 to num_multiview_refs_for_listX[view_id] - 1. num_multiview_refs_for_listX[] refers to anchor_reference_view_for_list_X[curr_view_id][] for anchor pictures and non_anchor_reference_view_for_list_X[curr_view_id][] for non-anchor pictures, where curr_view_id is equal to the view_id of the view containing the current slice. The view index of an inter-view picture indicates the order of the view_id of that inter-view picture as it appears in the MVC SPS extension. For a picture with view index equal to view_index, the view_id is equal to num_multiview_refs_for_listX[view_index].
abs_diff_view_idx_minus1 plus 1 specifies the absolute difference between the view index of the picture moved to the current index in the list and the view index prediction value. abs_diff_view_idx_minus1 ranges from 0 to num_multiview_refs_for_listX[view_id] - 1. num_multiview_refs_for_listX[] refers to anchor_reference_view_for_list_X[curr_view_id][] for anchor pictures and non_anchor_reference_view_for_list_X[curr_view_id][] for non-anchor pictures, where curr_view_id is equal to the view_id of the view containing the current slice. The view index of an inter-view picture indicates the order of the view_id of that inter-view picture as it appears in the MVC SPS extension. For a picture with view index equal to view_index, the view_id is equal to num_multiview_refs_for_listX[view_index].
The decoding process is as follows:
The derivation of NumRefIdxLXActive is performed after the truncation for intra-view pictures:
NumRefIdxLXActive=num_ref_idx_1X_active_minus1+1+
num_multiview_refs_for_listX[view_id]
G.8.2.4.3.3 Reordering process for reference picture lists of inter-view pictures
The input to this process is a reference picture list RefPicListX (X equals 0 or 1). The output of this process is a possibly modified reference picture list RefPicListX (X equals 0 or 1).
The variable picViewIdxLX is derived as follows.
if view_reordering_idc is equal to 0
picViewIdxLX=picViewIdxLXPred-(abs_diff_view_idx_minus1+1)
Otherwise(view_reordering_idc is equal to 1),
picViewIdxLX=picViewIdxLXPred+(abs_diff_view_idx_minus1+1)
picViewIdxLXPred is a prediction value for the variable picViewIdxLX. When the process specified in this subsection is invoked for the first time for a slice (i.e., for the first occurrence of view_reordering_idc equal to 0 or 1 in the ref_pic_list_reordering() syntax), picViewIdxL0Pred and picViewIdxL1Pred are initially set equal to 0. After each assignment of picViewIdxLX, the value of picViewIdxLX is assigned to picViewIdxLXPred.
The following procedure is implemented to place the inter-view picture with a view index equal to picViewIdxLX at the index position ref _ Idx, shifting the position of any other remaining pictures to a later position in the list as follows.
for(cIdx=NumRefIdxLXActive;cIdx>ref_Idx;cIdx--)
RefPicListX[cIdx]=RefPicListX[cIdx-1]
RefPicListX[ref_Idx]=inter-view reference picture with view_id equal to
reference_view_for_list_X[picViewIdxLX]
nIdx=ref_Idx+1;
for(cIdx=ref_Idx+1;cIdx<=NumRefIdxLXActive;cIdx++)
if(ViewID(RefPicListX[cIdx])!=TargetViewID||Time(RefPicListX[cIdx
])!=TargetTime)
RefPicListX[nIdx++]=RefPicListX[cIdx]
picView_id=PicViewIDLX
TargetViewID and TargetTime indicate the view_id and the time axis value of the target reference picture to be reordered, and Time(pic) returns the time axis value of the picture pic.
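The placement step above can be sketched as follows, modeling pictures as dictionaries with view_id and time keys (an illustrative simplification); the entry at or after the insertion point that matches both TargetViewID and TargetTime is the duplicate superseded by the insertion.

```python
def place_inter_view_pic(ref_pic_list, ref_idx, inter_view_pic,
                         target_view_id, target_time):
    """Insert inter_view_pic at position ref_idx; among the entries at or
    after ref_idx, drop any picture matching both target_view_id and
    target_time. This is a functional approximation of the shift-and-filter
    loops in the text, not the normative in-place process."""
    head = ref_pic_list[:ref_idx]
    tail = [p for p in ref_pic_list[ref_idx:]
            if p["view_id"] != target_view_id or p["time"] != target_time]
    return head + [inter_view_pic] + tail
```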
According to a second method for efficiently reordering both inter-view pictures and pictures used for intra-view prediction, the reference picture list initialization process for intra-view pictures as specified above is performed, and then the inter-view pictures are appended to the end of the list in the order in which they appear in the MVC SPS extension. The RPLR reordering process for intra-view and inter-view pictures is then applied, followed by the list truncation process. Sample syntax, semantics, and a decoding process modified from JMVM 1.0 are as follows.
Reference picture list reordering syntax
| ref_pic_list_reordering(){ | C | Descriptor(s) |
| if(slice_type!=I&&slice_type!=EI){ | ||
| ref_pic_list_reordering_flag_l0 | 2 | u(1) |
| if(ref_pic_list_reordering_flag_l0) | ||
| do{ | ||
| reordering_of_pic_nums_idc | 2 | ue(v) |
| if(reordering_of_pic_nums_idc==0||reordering_of_pic_nums_idc==1) | ||
| abs_diff_pic_num_minus1 | 2 | ue(v) |
| else if(reordering_of_pic_nums_idc==2) | ||
| long_term_pic_num | 2 | ue(v) |
| if(reordering_of_pic_nums_idc==4||reordering_of_pic_nums_idc==5) | ||
| abs_diff_view_idx_minus1 | 2 | ue(v) |
| }while(reordering_of_pic_nums_idc!=3) | ||
| } | ||
| if(slice_type==B||slice_type==EB){ | ||
| ref_pic_list_reordering_flag_l1 | 2 | u(1) |
| if(ref_pic_list_reordering_flag_l1) | ||
| do{ | ||
| reordering_of_pic_nums_idc | 2 | ue(v) |
| if(reordering_of_pic_nums_idc==0||reordering_of_pic_nums_idc==1) | ||
| abs_diff_pic_num_minus1 | 2 | ue(v) |
| else if(reordering_of_pic_nums_idc==2) | ||
| long_term_pic_num | 2 | ue(v) |
| if(reordering_of_pic_nums_idc==4||reordering_of_pic_nums_idc==5) | ||
| abs_diff_view_idx_minus1 | 2 | ue(v) |
| }while(reordering_of_pic_nums_idc!=3) | ||
| } | ||
| } |
G.7.4.3.1 reference picture list reordering semantics
reordering_of_pic_nums_idc operations for the reordering of reference picture lists
| reordering_of_pic_nums_idc | Specified reordering |
| 0 | abs_diff_pic_num_minus1 is present and corresponds to a difference value to subtract from the picture number prediction value |
| 1 | abs_diff_pic_num_minus1 is present and corresponds to a difference value to add to the picture number prediction value |
| 2 | long_term_pic_num is present and specifies the long-term picture number of a reference picture |
| 3 | End the loop for reordering of the initial reference picture list |
| 4 | abs_diff_view_idx_minus1 is present and corresponds to a difference value to subtract from the view index prediction value |
| 5 | abs_diff_view_idx_minus1 is present and corresponds to a difference value to add to the view index prediction value |
reordering_of_pic_nums_idc, together with abs_diff_pic_num_minus1 or long_term_pic_num, specifies which of the reference pictures are remapped. The values of reordering_of_pic_nums_idc are specified in the table above. The value of the first reordering_of_pic_nums_idc that follows immediately after ref_pic_list_reordering_flag_l0 or ref_pic_list_reordering_flag_l1 is not equal to 3.
abs_diff_view_idx_minus1 plus 1 specifies the absolute difference between the view index of the picture to be placed at the current index in the reference picture list and the view index prediction value. abs_diff_view_idx_minus1 ranges from 0 to num_multiview_refs_for_listX[view_id] - 1. num_multiview_refs_for_listX[] refers to anchor_reference_view_for_list_X[curr_view_id][] for anchor pictures and non_anchor_reference_view_for_list_X[curr_view_id][] for non-anchor pictures, where curr_view_id is equal to the view_id of the view containing the current slice. The view index of an inter-view picture indicates the order of the view_id of that inter-view picture as it appears in the MVC SPS extension. For a picture with view index equal to view_index, the view_id is equal to num_multiview_refs_for_listX[view_index].
The reordering process may be described as follows.
G.8.2.4.3.3 Reordering process for reference picture lists of inter-view reference pictures
The input to this process is the index refIdxLX (X equals 0 or 1).
The output of this process is the increment index refIdxLX.
The variable picViewIdxLX is derived as follows.
if reordering_of_pic_nums_idc is equal to 4
picViewIdxLX=picViewIdxLXPred-(abs_diff_view_idx_minus1+1)
Otherwise(reordering_of_pic_nums_idc is equal to 5),
picViewIdxLX=picViewIdxLXPred+(abs_diff_view_idx_minus1+1)
picViewIdxLXPred is a prediction value for the variable picViewIdxLX. When the process specified in this subsection is invoked for the first time for a slice (i.e., for the first occurrence of reordering_of_pic_nums_idc equal to 4 or 5 in the ref_pic_list_reordering() syntax), picViewIdxL0Pred and picViewIdxL1Pred are initially set equal to 0. After each assignment of picViewIdxLX, the value of picViewIdxLX is assigned to picViewIdxLXPred.
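The predictor chain above can be sketched as a small stateful helper (the class name is an illustrative assumption): idc 4 subtracts the signaled difference from the predictor, idc 5 adds it, and each derived value becomes the predictor for the next command.

```python
class PicViewIdxPredictor:
    """Tracks picViewIdxLXPred across successive RPLR commands in one slice."""

    def __init__(self):
        # The predictor is initialized to 0 at the first idc equal to 4 or 5.
        self.pred = 0

    def next_idx(self, idc, abs_diff_view_idx_minus1):
        """Derive picViewIdxLX for reordering_of_pic_nums_idc 4 (subtract)
        or 5 (add), then promote it to be the next predictor."""
        delta = abs_diff_view_idx_minus1 + 1
        idx = self.pred - delta if idc == 4 else self.pred + delta
        self.pred = idx
        return idx
```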
The following process is implemented to place the inter-view picture with a view index equal to picViewIdxLX at the index position refIdxLX, shift the position of any other remaining pictures to a later position in the list, and increment the value of refIdxLX.
for(cIdx=num_ref_idx_1X_active_minus1+1;cIdx>refIdxLX;cIdx--)
RefPicListX[cIdx]=RefPicListX[cIdx-1]
RefPicListX[refIdxLX++]=inter-view reference picture with view id equal to
reference_view_for_list_X[picViewIdxLX]
nIdx=refIdxLX
for(cIdx=refIdxLX;cIdx<=num_ref_idx_1X_active_minus1+1;cIdx++)
if(ViewID(RefPicListX[cIdx])!=TargetViewID||Time(RefPicListX[cIdx])!=
TargetTime)
RefPicListX[nIdx++]=RefPicListX[cIdx]
where TargetViewID and TargetTime indicate the view_id and the time axis value of the target reference picture to be reordered, and Time(pic) returns the time axis value of the picture pic.
According to a third method for efficiently reordering both inter-view pictures and pictures used for intra-view prediction, the initial reference picture list contains the pictures marked as "used for short-term reference" or "used for long-term reference" and having the same view_id as the current picture. In addition, the initial reference picture list contains the pictures that can be used for inter-view prediction. The pictures usable for inter-view prediction are concluded from the sequence parameter set extension for MVC and may also be concluded from the inter_view_reference_flag. The pictures used for inter-view prediction are assigned certain long-term reference indices for the decoding process of the current picture. The long-term reference indices assigned to inter-view reference pictures may be, for example, the first N reference indices, and the indices for intra-view long-term pictures may be modified to be equal to their previous value plus N for the decoding process of the current picture, where N represents the number of inter-view reference pictures. Alternatively, the assigned long-term reference indices may range from MaxLongTermFrameIdx + 1 to MaxLongTermFrameIdx + N, inclusive. As a further alternative, the sequence parameter set extension for MVC may contain a syntax element, referred to herein as start_lt_index_for_rplr, and the assigned long-term indices range from start_lt_index_for_rplr (inclusive) to start_lt_index_for_rplr + N (exclusive). The available long-term indices for inter-view reference pictures may be allocated in the order of view_id, i.e. camera order, or in the order in which the view dependencies are listed in the sequence parameter set extension for MVC. The RPLR commands (syntax and semantics) remain unchanged relative to the H.264/AVC standard.
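The first index-assignment alternative described above, giving inter-view reference pictures the first N long-term indices and shifting intra-view long-term indices up by N, can be sketched as follows; the function and argument names are assumptions for illustration, not spec identifiers.

```python
def assign_long_term_indices(inter_view_pics, intra_view_long_term_indices):
    """Assign long-term indices 0..N-1 to the N inter-view reference pictures
    (in their given order) and shift each pre-existing intra-view long-term
    index up by N, as in the first alternative of the third method."""
    n = len(inter_view_pics)
    inter_view_indices = list(range(n))
    shifted_intra_view = [idx + n for idx in intra_view_long_term_indices]
    return inter_view_indices, shifted_intra_view
```

With two inter-view reference pictures, an intra-view long-term picture that previously held index 0 would be addressed as index 2 for the decoding of the current picture.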
For processing related to temporal direct mode, e.g. for motion vector scaling, the AVC decoding process is followed if both reference pictures are inter-prediction (intra-view prediction) pictures (i.e. neither reference picture is marked as "used for inter-view reference"). If one of the two reference pictures is an inter-prediction picture and the other is an inter-view prediction picture, the inter-view prediction picture is treated as a long-term reference picture. Otherwise (i.e. if both reference pictures are inter-view pictures), the view_id or camera order indicator values are used instead of POC values for motion vector scaling.
For the derivation of prediction weights in implicit weighted prediction, the following process is performed. If both reference pictures are inter-prediction (intra-view prediction) pictures (i.e. neither is marked as "used for inter-view reference"), the AVC decoding process is followed. If one of the two reference pictures is an inter-prediction picture and the other is an inter-view prediction picture, the inter-view prediction picture is treated as a long-term reference picture. Otherwise (i.e. if both pictures are inter-view prediction pictures), the view_id or camera order indicator values are used instead of POC values to derive the weighted prediction parameters.
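The reference-type decision shared by the temporal direct and implicit weighted prediction paragraphs can be sketched as a small classifier; the function name and return labels are illustrative, not standard terminology.

```python
def scaling_basis(ref_a_is_inter_view, ref_b_is_inter_view):
    """Decide what drives motion vector scaling / implicit weight derivation:
    - neither reference is an inter-view picture: use POC (plain AVC process);
    - exactly one is: treat the inter-view picture as a long-term reference;
    - both are: use view_id / camera order indicator values instead of POC."""
    if not ref_a_is_inter_view and not ref_b_is_inter_view:
        return "poc"
    if ref_a_is_inter_view != ref_b_is_inter_view:
        return "long_term"
    return "view_id"
```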
The present invention is described in the general context of method steps, which may be implemented in one embodiment by a program product including computer-executable instructions, such as program code, embodied in a computer-readable medium and executed by computers in networked environments. Examples of computer readable media may include various types of storage media including, but not limited to, electronic device memory units, Random Access Memory (RAM), read-only memory (ROM), Compact Discs (CDs), Digital Versatile Discs (DVDs), and other internal or external storage devices. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Computer-executable instructions, associated data structures, and program modules represent examples of program code for executing steps of the methods disclosed herein. The particular sequence of such executable instructions or associated data structures represents examples of corresponding acts for implementing the functions described in such steps.
Software and web implementations of the present invention could be accomplished with standard programming techniques, with rule-based logic and other logic to accomplish the various database searching steps, correlation steps, comparison steps, and decision steps. It should also be noted that the words "component" and "module," as used herein and in the claims, are intended to encompass implementations using one or more lines of software code, and/or hardware implementations, and/or equipment for receiving manual inputs.
The foregoing description of the embodiments of the invention has been presented for the purposes of illustration and description. It is not intended to be exhaustive or to limit the invention to the precise form disclosed, and modifications and variations are possible in light of the above teachings or may be acquired from practice of the invention. The embodiments were chosen and described in order to explain the principles of the invention and its practical application to enable one skilled in the art to utilize the invention in various embodiments and with various modifications as are suited to the particular use contemplated.
Claims (12)
1. A method of encoding multiple views of a scene, the method comprising:
providing a signaling element corresponding to a picture of a view, the signaling element indicating whether to use the picture of the view as a reference for any other picture belonging to a different view,
wherein the signaling element is a flag having a binary value indicating whether to use the picture of the view as a reference for any other picture belonging to a different view; and
signaling the signaling element in a Network Abstraction Layer (NAL) unit header corresponding to the picture.
2. The method of claim 1, further comprising:
constructing an initial reference picture list based on the intra-view reference picture and the inter-view reference picture, and
providing a second signaling element for reordering inter-view reference pictures with respect to the initial reference picture list, the second signaling element being derived based on a view identifier value.
3. A method of decoding an encoded video bitstream, the encoded video bitstream being an encoded representation of a plurality of views of a scene, the method comprising:
receiving the encoded video bitstream; and
retrieving a signaling element corresponding to a picture of a view from the encoded video bitstream, the signaling element indicating whether to use the picture corresponding to the view as a reference for any other picture belonging to a different view, wherein the signaling element is a flag having a binary value indicating whether to use the picture of the view as a reference for any other picture belonging to a different view, and the signaling element is retrieved from a Network Abstraction Layer (NAL) unit header corresponding to the picture.
4. The method of claim 3, further comprising:
omitting the transmission of a portion of the coded bitstream corresponding to the picture if the signaling element indicates that the picture of the view is not used as a reference for any other picture belonging to a different view.
5. The method of claim 3, further comprising:
omitting decoding of a portion of the encoded bitstream corresponding to the picture if the signaling element indicates that the picture of the view is not used as a reference for any other picture belonging to a different view.
6. The method of claim 3, further comprising:
constructing an initial reference picture list based on the intra-view reference picture and the inter-view reference picture; and
reordering inter-view reference pictures with respect to the initial reference picture list based on a second signaling element and a view identifier value retrieved from the coded bitstream.
7. An apparatus for encoding multiple views of a scene, comprising:
means for providing a signaling element corresponding to a picture of a view, the signaling element indicating whether to use the picture of the view as a reference for any other picture belonging to a different view, wherein the signaling element is a flag having a binary value indicating whether to use the picture of the view as a reference for any other picture belonging to a different view; and
means for signaling the signaling element in a Network Abstraction Layer (NAL) unit header corresponding to the picture.
8. The apparatus of claim 7, further comprising:
means for constructing an initial reference picture list based on intra-view reference pictures and inter-view reference pictures, and
means for providing a second signaling element for reordering inter-view reference pictures relative to the initial reference picture list, the second signaling element derived based on a view identifier value.
9. An apparatus for decoding an encoded video bitstream, comprising:
means for receiving the encoded video bitstream; and
means for retrieving, from the encoded video bitstream, a signaling element corresponding to a picture of a view, the signaling element indicating whether to use the picture corresponding to the view as a reference for any other picture belonging to a different view, wherein the signaling element is a flag having a binary value indicating whether to use the picture of the view as a reference for any other picture belonging to a different view, and the signaling element is retrieved from a Network Abstraction Layer (NAL) unit header corresponding to the picture.
10. The device of claim 9, configured to:
omitting transmission of a portion of the encoded bitstream corresponding to the picture if the signaling element indicates that the picture of the view is not used as a reference for any other picture belonging to a different view.
11. The device of claim 9, configured to:
omitting decoding of a portion of the encoded bitstream corresponding to the picture if the signaling element indicates that the picture of the view is not used as a reference for any other picture belonging to a different view.
12. The apparatus of claim 9, further comprising:
means for constructing an initial reference picture list based on the intra-view reference picture and the inter-view reference picture; and
means for reordering inter-view reference pictures with respect to the initial reference picture list based on a second signaling element and a view identifier value retrieved from the coded bitstream.
Applications Claiming Priority (3)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US85222306P | 2006-10-16 | 2006-10-16 | |
| US60/852,223 | 2006-10-16 | ||
| PCT/IB2007/054200 WO2008047303A2 (en) | 2006-10-16 | 2007-10-15 | System and method for implementing efficient decoded buffer management in multi-view video coding |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| HK1133761A1 HK1133761A1 (en) | 2010-04-01 |
| HK1133761B true HK1133761B (en) | 2014-12-05 |
Family
ID=
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN101548550B (en) | System and method for implementing efficient decoded buffer management in multi-view video coding | |
| US8855199B2 (en) | Method and device for video coding and decoding | |
| US10158881B2 (en) | Method and apparatus for multiview video coding and decoding | |
| EP2080382A2 (en) | System and method for implementing low-complexity multi-view video coding | |
| WO2008084443A1 (en) | System and method for implementing improved decoded picture buffer management for scalable video coding and multiview video coding | |
| AU2016201810B2 (en) | System and method for implementing efficient decoded buffer management in multi-view video coding | |
| HK1133761B (en) | System and method for implementing efficient decoded buffer management in multi-view video coding | |
| AU2012216719B2 (en) | System and method for implementing efficient decoded buffer management in multi-view video coding | |
| HK1189108A (en) | System and method for implementing efficient decoded buffer management in multi-view video coding | |
| HK1189108B (en) | System and method for implementing efficient decoded buffer management in multi-view video coding | |
| HK1134385B (en) | System and method for implementing low-complexity multi-view video coding |