CN112422986B - Hardware decoder pipeline optimization method and application
- Publication number
- CN112422986B (application CN202011154937.6A)
- Authority
- CN
- China
- Prior art keywords
- inverse quantization
- inverse
- decoding
- processing
- entropy decoding
- Prior art date
- Legal status
- Active
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
- H04N19/436—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation using parallelised computational arrangements
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/102—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or selection affected or controlled by the adaptive coding
- H04N19/124—Quantisation
- H04N19/44—Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
- H04N19/60—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding
- H04N19/625—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using transform coding using discrete cosine transform [DCT]
- H04N19/90—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using coding techniques not provided for in groups H04N19/10-H04N19/85, e.g. fractals
- H04N19/91—Entropy coding, e.g. variable length coding [VLC] or arithmetic coding
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computing Systems (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Discrete Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention discloses a hardware decoder pipeline optimization method and its application, relating to the technical field of video decoding. The hardware decoder pipeline optimization method divides decoding of an AVC video sequence into a multi-stage pipeline structure and specifies how the entropy decoding, inverse quantization, inverse DCT transform, intra prediction, inter prediction, image reconstruction and deblocking filtering processes work together. The inverse quantization process is located in the same pipeline stage as the entropy decoding process, and the inverse DCT transform process is located in the pipeline stage following the one in which inverse quantization is performed. An entropy decoding and inverse quantization unit performs entropy decoding and inverse quantization on the macroblock, and the inverse-quantized data of a plurality of pixels is input synchronously to an inverse transform unit for multi-point parallel inverse DCT transform processing. The invention saves the extra multiplier resources that would otherwise be required when the IQT operation and the IDCT operation share a pipeline stage.
Description
Technical Field
The invention relates to the technical field of video decoding, and in particular to a hardware decoder pipeline optimization method and its application.
Background
For high-definition video, full software video decoding brings heavy CPU, power and other overheads, so the industry usually adopts a dedicated hardware accelerator as the Video Decoder (VDEC). Taking a common single-core hardware decoder as an example, it adopts a pipeline design with the Macro Block (MB) as the pipeline unit. Taking AVC (Advanced Video Coding, one of the current mainstream video compression standards) video as an example, the main pipeline stage division can be as shown in FIG. 1, which contains 4 stages whose functions are as follows: the first stage, Entropy Dec, performs entropy decoding (CABAC/CAVLC); the second stage performs IQT (inverse quantization) and IDCT (inverse DCT transform); the third stage, IPred/Rec, performs intra prediction, inter prediction and image reconstruction; and the fourth stage, Dblock, performs deblocking filtering. The Inverse Discrete Cosine Transform (IDCT) is the most basic and common transform in video decoding and one of its core operations; its main process is to perform two matrix multiplications on the residual data, which requires a large amount of multiplier/adder resources. Before the residual is sent to the IDCT unit, Inverse Quantization (IQT) is required, a process that performs the corresponding multiplications according to the quantization parameter (QP). In the prior art, because of the complex control involved in entropy decoding the different syntax elements, a DSP together with a dedicated entropy-operation accelerator is usually adopted; and since the IQT and IDCT operations are closely coupled, they are typically placed in the same pipeline stage after entropy decoding.
With the advent of the 4K/8K high-definition era, the IDCT unit may need to compute multiple pixels in parallel in each clock cycle, which means that the IQT unit in the same pipeline stage also has to inverse-quantize multiple residuals at the same time. For example, referring to FIG. 2, when the IQT units and the IDCT unit are placed in the same pipeline stage, the pixel residuals p0, p1, p2, p3, p4, p5, …, pn are each processed by their own IQT unit and then sent synchronously to the IDCT unit: in FIG. 2, residual p0 is inverse-quantized by the IQT0 unit, residual p1 by the IQT1 unit, and so on, with residual pn inverse-quantized by the IQTn unit. Because each IQT unit internally needs at least one 16-bit multiplier, plus additional resources such as adders and shifters, this multi-point parallel operation causes the multiplier resource overhead to grow linearly.
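To make the cost concrete, the sketch below (a simplified C illustration, not the patent's hardware description) shows the per-coefficient multiply that every parallel IQT unit of FIG. 2 has to instantiate; the (scale, shift) form is an assumed stand-in for the AVC LevelScale/QP arithmetic, whose table values and rounding are omitted.

```c
/* Illustrative only: the per-coefficient operation each parallel IQT unit in FIG. 2
 * must implement. In real AVC decoding the scale comes from a LevelScale table
 * indexed by QP%6 and the coefficient position, and the shift depends on QP/6;
 * both are folded into the (scale, shift) arguments here. */
static inline int iqt_one_coeff(int level, int scale, int shift)
{
    return (level * scale) << shift;   /* one multiplier per parallel IQT unit */
}

/* Prior-art arrangement of FIG. 2: n+1 residuals inverse-quantized in the same
 * clock cycle, which in hardware means n+1 multiplier instances. */
void iqt_parallel(const int *p, int *out, int n, int scale, int shift)
{
    for (int i = 0; i <= n; i++)       /* unrolled into parallel units in RTL */
        out[i] = iqt_one_coeff(p[i], scale, shift);
}
```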
In summary, a hardware decoder pipeline optimization method that can save the additional multiplier resource overhead caused by parallel IDCT operation is a technical problem that currently needs to be solved.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a hardware decoder pipeline optimization method and its application. The invention moves the IQT (inverse quantization) operation forward so that it is in the same pipeline stage as entropy decoding, thereby saving the extra multiplier resources that would be required if the IQT and IDCT operations were in the same pipeline stage.
In order to achieve the above object, the present invention provides the following technical solutions:
a hardware decoder pipeline optimization method divides decoding of an AVC video sequence into a multi-stage pipeline structure, and the method prescribes a cooperative working mode of entropy decoding, inverse quantization, inverse DCT transformation, intra-frame prediction, inter-frame prediction, image reconstruction and deblocking filtering processes;
the inverse quantization process and the entropy decoding process are positioned at the same pipeline stage, and the inverse DCT conversion process is positioned at the next pipeline stage of the pipeline stage where the inverse quantization process is positioned;
the entropy decoding and inverse quantization unit is used for performing entropy decoding and inverse quantization processing on the macro block, and the data of a plurality of pixel points after the inverse quantization processing are synchronously input to the inverse transformation unit for performing multipoint parallel inverse DCT transformation processing.
Further, decoding of the AVC video sequence is divided into a 4-stage pipeline structure, with stage 1 corresponding to entropy decoding and inverse quantization processing, stage 2 to inverse DCT transform processing, stage 3 to intra prediction, inter prediction and image reconstruction processing, and stage 4 to deblocking filtering processing.
Further, the entropy decoding and inverse quantization unit comprises a residual parsing module and an inverse quantization calculation module;
the residual parsing module parses the pixel residuals one by one in zigzag order, each parsed pixel residual is sent to the inverse quantization calculation module for inverse quantization, and the processed data is stored in the entropy decoding output macroblock buffer.
Further, when the inverse quantization calculation module performs inverse quantization on a residual, the residual coefficient transform process comprises AC coefficient inverse quantization and DC coefficient inverse quantization, and before the DC coefficient inverse quantization, the residual parsing module performs a DC inverse Hadamard transform.
Further, the inverse quantization process of the residual comprises the following steps:
acquiring the residual parsing state;
when it is determined that residual parsing is in the idle state, the luminance DC parsing sub-module Y_DC_DEC parses the DC coefficients R_ij_DC_Y of the 4x4 block of the luminance component Y of the macroblock, and the luminance DC inverse quantization sub-module Y_DC_IQT then performs the inverse Hadamard transform and inverse quantization on the DC coefficients R_ij_DC_Y of the 4x4 block to obtain the inverse-quantized residual values Q_ij_DC_Y; at the same time, the luminance AC parsing sub-module Y_AC_DEC parses the AC coefficients R_ij_AC_Y of the luminance component Y and inverse-quantizes them to obtain the inverse-quantized residual values Q_ij_AC_Y;
then, the chrominance DC parsing sub-module UV_DC_DEC parses the DC coefficients R_ij_DC_UV of the 2x2 blocks of the chrominance component UV of the macroblock, and the chrominance DC inverse quantization sub-module UV_DC_IQT then performs the inverse Hadamard transform and inverse quantization on the DC coefficients R_ij_DC_UV of the 2x2 blocks to obtain the inverse-quantized residual values Q_ij_DC_UV; at the same time, the chrominance AC parsing sub-module UV_AC_DEC parses the AC coefficients R_ij_AC_UV of the chrominance component UV and inverse-quantizes them to obtain the inverse-quantized residual values Q_ij_AC_UV.
The invention also provides a data processing device for video decoding, which divides decoding of the AVC video sequence into a multi-stage pipeline structure;
when the data processing device is used for setting a cooperative working mode of entropy decoding, inverse quantization, inverse DCT conversion, intra-frame prediction, inter-frame prediction, image reconstruction and deblocking filtering processes, the inverse quantization process and the entropy decoding process are positioned at the same pipeline stage, and the inverse DCT conversion process is positioned at the next pipeline stage of the pipeline stage where the inverse quantization process is positioned;
the data processing device comprises an entropy decoding inverse quantization unit and an inverse transformation unit;
the entropy decoding and inverse quantization unit can perform entropy decoding and inverse quantization processing on the macro block, and synchronously input data of a plurality of pixel points after the inverse quantization processing into the inverse transformation unit;
the inverse transform unit is capable of performing inverse DCT transform processing on the data of a plurality of pixel points input synchronously in a multipoint parallel manner.
Further, decoding of the AVC video sequence is divided into a 4-stage pipeline structure, with stage 1 corresponding to entropy decoding and inverse quantization processing, stage 2 to inverse DCT transform processing, stage 3 to intra prediction, inter prediction and image reconstruction processing, and stage 4 to deblocking filtering processing.
Further, the entropy decoding and inverse quantization unit comprises a residual parsing module and an inverse quantization calculation module;
the residual parsing module parses the pixel residuals one by one in zigzag order, each parsed pixel residual is sent to the inverse quantization calculation module for inverse quantization, and the processed data is stored in the entropy decoding output macroblock buffer.
The invention also provides a video decoder system, which adopts macroblock-level pipelining; the video decoder system comprises decoding firmware and a multi-core hardware decoding accelerator which are communicatively connected, the decoding firmware being used for parsing the non-entropy-coded data of the upper layer of the video code stream and the multi-core hardware decoding accelerator being used for processing the decoding tasks of the macroblock layer in the video code stream;
the multi-core hardware decoding accelerator comprises a data processing apparatus as claimed in any preceding claim.
The invention also provides a video decoding method, which comprises the following steps:
receiving video code stream data;
parsing the non-entropy-coded data of the upper layer of the video code stream by the decoding firmware, and processing the decoding tasks of the macroblock layer in the video code stream by the multi-core hardware decoding accelerator;
wherein,
the decoding task of the macro block layer comprises entropy decoding, inverse quantization, inverse DCT transformation, intra-frame prediction, inter-frame prediction, image reconstruction and deblocking filtering processes, when the cooperative working mode of each process is set, the inverse quantization process and the entropy decoding process are positioned at the same pipeline stage, and the inverse DCT transformation process is positioned at the next pipeline stage of the pipeline stage where the inverse quantization process is positioned; the entropy decoding and inverse quantization unit is used for performing entropy decoding and inverse quantization processing on the macro block, and the data of a plurality of pixel points after the inverse quantization processing are synchronously input to the inverse transformation unit for performing multipoint parallel inverse DCT transformation processing.
Compared with the prior art, the above technical solution of the invention has the following advantages and positive effects: the invention moves the IQT (inverse quantization) operation forward so that it is in the same pipeline stage as entropy decoding, thereby saving the extra multiplier resources that would be required if the IQT and IDCT operations were in the same pipeline stage.
Drawings
FIG. 1 is a schematic diagram of a pipeline design of a prior art single core hardware decoder.
FIG. 2 is a diagram of an example of parallel operations where IQT and IDCT are placed in the same pipeline stage.
FIG. 3 is a schematic diagram of a portion of an optimized pipeline stage division according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of information transmission between the entropy decoding inverse quantization unit and the inverse transformation unit according to an embodiment of the present invention.
Fig. 5 is a schematic diagram of a residual coefficient transformation process during inverse quantization according to an embodiment of the present invention.
Fig. 6 is a control schematic diagram of a master state machine according to an embodiment of the present invention.
Fig. 7 is a schematic block diagram of a video decoder system according to an embodiment of the present invention.
Detailed Description
The hardware decoder pipeline optimization method and application disclosed in the invention are further described in detail below with reference to the accompanying drawings and specific embodiments. It should be noted that the technical features or combinations of technical features described in the following embodiments should not be regarded as being isolated, and they may be combined with each other to achieve a better technical effect. In the drawings of the embodiments described below, like reference numerals appearing in the various drawings represent like features or components and are applicable to the various embodiments. Thus, once an item is defined in one drawing, no further discussion thereof is required in subsequent drawings.
It should be noted that the structures, proportions and sizes shown in the drawings are only used in conjunction with the disclosure of the specification for the understanding of those skilled in the art, and are not intended to limit the applicable scope of the invention. The scope of the preferred embodiments of the invention includes additional implementations in which functions may be performed out of the order described or discussed, including in a substantially simultaneous manner or in the reverse order, depending on the functions involved, as would be understood by those skilled in the art to which the embodiments of the invention pertain.
Techniques, methods, and apparatus known to one of ordinary skill in the relevant art may not be discussed in detail, but should be considered part of the specification where appropriate. In all examples shown and discussed herein, any specific values should be construed as merely illustrative, and not a limitation. Thus, other examples of the exemplary embodiments may have different values.
Examples
A hardware decoder pipeline optimization method divides decoding of an AVC video sequence into a multi-stage pipeline structure and specifies how the entropy decoding, inverse quantization, inverse DCT transform, intra prediction, inter prediction, image reconstruction and deblocking filtering processes work together.
When the pipeline stage division is set, the inverse quantization process is located in the same pipeline stage as the entropy decoding process, and the inverse DCT transform process is located in the pipeline stage following the one in which inverse quantization is performed. Referring to FIG. 3, the macroblock is entropy decoded and inverse quantized by the entropy decoding and inverse quantization unit, and the inverse-quantized data of a plurality of pixels is input synchronously to the inverse transform unit to perform multi-point parallel inverse DCT transform processing.
In this embodiment, decoding of the AVC video sequence is preferably divided into a 4-stage pipeline structure, as follows: stage 1 corresponds to entropy decoding and inverse quantization processing (the entropy decoding and inverse quantization pipeline stage), stage 2 to inverse DCT transform processing (the inverse DCT transform pipeline stage), stage 3 to intra prediction, inter prediction and image reconstruction processing (the intra prediction, inter prediction and image reconstruction pipeline stage), and stage 4 to deblocking filtering processing (the deblocking filtering pipeline stage). The difference between this pipeline stage division and the prior art of FIG. 1 is that the Inverse Quantization (IQT) operation is moved forward into the pipeline stage where entropy decoding is performed.
With continued reference to FIG. 3, the entropy decoding and inverse quantization unit (Entropy Dec and IQT) may include a residual parsing module and an inverse quantization calculation module, which themselves adopt a pipeline structure. Since the residuals are parsed one by one in zigzag order during entropy decoding (Entropy Dec), when the inverse quantization operation is moved forward into the pipeline stage where entropy decoding is located, each time the residual parsing (Res Dec) module decodes a residual pn it sends pn to the Inverse Quantization (IQT) module for inverse quantization, and the result is then stored in the entropy decoding output macroblock buffer (MB buffer).
Specifically, for the pixel residuals p0, p1, p2, …, pn of the same Macroblock (MB), the residual parsing module is configured to: parse the pixel residuals pi one by one in zigzag order (i = 0, 1, 2, …, n) and send each parsed pixel residual pi to the inverse quantization calculation module.
The inverse quantization calculation module is configured to: receive the pixel residuals sent by the residual parsing module, perform inverse quantization on them, and store the inverse-quantized data in the entropy decoding output macroblock buffer.
For example, in FIG. 3, after the residual parsing (Res Dec) module has parsed the pixel residual p0, p0 is sent to the Inverse Quantization (IQT) calculation module for inverse quantization (the inverse-quantized data is stored in the entropy decoding output macroblock buffer); the residual parsing module then continues to parse the next pixel residual p1, and after parsing, p1 is sent to the inverse quantization calculation module for inverse quantization (the inverse-quantized data is again stored in the entropy decoding output macroblock buffer); and so on, until the residual parsing module has parsed the pixel residual pn and pn has been sent to the inverse quantization calculation module for inverse quantization.
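A minimal C sketch of this time-shared arrangement is given below. It only illustrates the data flow described above; the parse callback, the coefficient count per macroblock and the simplified (scale, shift) inverse quantization are assumptions for illustration, not the actual hardware interfaces.

```c
#include <stdint.h>

#define MB_COEFFS 384   /* 16x16 luma + two 8x8 chroma blocks per macroblock (4:2:0) */

typedef struct { int16_t coeff[MB_COEFFS]; } MbBuffer;   /* entropy decoding output MB buffer */

/* the single, time-multiplexed IQT calculation module (scaling simplified) */
static int16_t iqt(int16_t level, int scale, int shift)
{
    return (int16_t)((level * scale) << shift);
}

/* Stage-1 loop: Res Dec parses one residual at a time in zigzag order and feeds
 * the same IQT module; the result goes straight into the MB buffer. */
void entropy_dec_and_iqt(int16_t (*parse_next_residual)(void *ctx), void *ctx,
                         int scale, int shift, MbBuffer *out)
{
    for (int i = 0; i < MB_COEFFS; i++) {
        int16_t p = parse_next_residual(ctx);   /* p0, p1, ..., pn from Res Dec */
        out->coeff[i] = iqt(p, scale, shift);   /* one shared multiplier, reused */
    }
}
```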
With this technical solution, the inverse quantization calculation module is time-division multiplexed: as shown in FIG. 4, the entire 1st pipeline stage (the entropy decoding and inverse quantization pipeline stage) needs only one IQT calculation module, and compared with placing IQT and IDCT in the same stage as in the prior art, the multiplier resources required for multi-pixel parallel calculation are significantly reduced.
Since the residual coefficient transform process involves processing of both AC coefficients and DC coefficients, the complete inverse quantization process of the residual at the entropy decoding and inverse quantization pipeline stage is described in detail below in connection with FIG. 5 and FIG. 6.
According to the AVC protocol, as shown in FIG. 5, when the residual pn is inverse-quantized, the residual coefficient transform process includes AC coefficient inverse quantization and DC coefficient inverse quantization; compared with AC coefficient inverse quantization, DC inverse quantization is additionally preceded by an inverse Hadamard transform. Therefore, before DC coefficient inverse quantization is performed, a DC inverse Hadamard transform also needs to be performed by the residual parsing module.
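For reference, a simplified C sketch of the 4x4 luma DC inverse Hadamard transform followed by inverse quantization is shown below; the exact LevelScale table and the QP-dependent rounding of the AVC specification are replaced by a single assumed (scale, shift) pair.

```c
#include <stdint.h>

/* 4x4 inverse Hadamard on the luma DC coefficients of an Intra_16x16 macroblock,
 * followed by simplified inverse quantization (spec rounding omitted). */
void luma_dc_inv_hadamard_iqt(int32_t c[4][4], int scale, int shift)
{
    int32_t t[4][4];
    for (int i = 0; i < 4; i++) {                 /* row transform */
        int32_t e0 = c[i][0] + c[i][2], e1 = c[i][0] - c[i][2];
        int32_t e2 = c[i][1] - c[i][3], e3 = c[i][1] + c[i][3];
        t[i][0] = e0 + e3;  t[i][1] = e1 + e2;
        t[i][2] = e1 - e2;  t[i][3] = e0 - e3;
    }
    for (int j = 0; j < 4; j++) {                 /* column transform + scaling */
        int32_t e0 = t[0][j] + t[2][j], e1 = t[0][j] - t[2][j];
        int32_t e2 = t[1][j] - t[3][j], e3 = t[1][j] + t[3][j];
        c[0][j] = ((e0 + e3) * scale) << shift;
        c[1][j] = ((e1 + e2) * scale) << shift;
        c[2][j] = ((e1 - e2) * scale) << shift;
        c[3][j] = ((e0 - e3) * scale) << shift;
    }
}
```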
Let R_ij denote the original residual values obtained by entropy decoding the syntax elements, and Q_ij denote the final inverse-quantized residual values. In a preferred embodiment, the inverse quantization process of the residual comprises the following steps:
Step 1: acquire the residual parsing state.
Step 2: when it is determined that residual parsing is in the IDLE state, the luminance DC parsing sub-module Y_DC_DEC parses the DC coefficients R_ij_DC_Y of the 4x4 block of the luminance component Y of the macroblock, and the luminance DC inverse quantization sub-module Y_DC_IQT then performs the inverse Hadamard transform and inverse quantization on the DC coefficients R_ij_DC_Y of the 4x4 block to obtain the inverse-quantized residual values Q_ij_DC_Y; at the same time, the luminance AC parsing sub-module Y_AC_DEC parses the AC coefficients R_ij_AC_Y of the luminance component Y and inverse-quantizes them to obtain the inverse-quantized residual values Q_ij_AC_Y.
Step 3: the chrominance DC parsing sub-module UV_DC_DEC then parses the DC coefficients R_ij_DC_UV of the 2x2 blocks of the chrominance component UV of the macroblock, and the chrominance DC inverse quantization sub-module UV_DC_IQT performs the inverse Hadamard transform and inverse quantization on the DC coefficients R_ij_DC_UV of the 2x2 blocks to obtain the inverse-quantized residual values Q_ij_DC_UV; at the same time, the chrominance AC parsing sub-module UV_AC_DEC parses the AC coefficients R_ij_AC_UV of the chrominance component UV and inverse-quantizes them to obtain the inverse-quantized residual values Q_ij_AC_UV.
In this way, the inverse-quantized residual values Q_ij_DC_Y and Q_ij_AC_Y of the luminance component Y and the inverse-quantized residual values Q_ij_DC_UV and Q_ij_AC_UV of the chrominance component UV of the macroblock are obtained. These data are input to the inverse transform unit to perform the inverse DCT transform (IDCT) processing.
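At the next pipeline stage, the inverse transform unit applies the AVC 4x4 inverse core transform to these values. A single-block, sequential C sketch is shown below for orientation; in the hardware design several points are processed in parallel per clock cycle, and the function name is illustrative.

```c
#include <stdint.h>

/* One 4x4 block of the AVC inverse core transform ("IDCT"): horizontal pass,
 * vertical pass, then the final (+32) >> 6 rounding of the residual. */
void idct4x4(const int32_t in[4][4], int32_t out[4][4])
{
    int32_t m[4][4];
    for (int i = 0; i < 4; i++) {                  /* horizontal pass */
        int32_t e0 = in[i][0] + in[i][2];
        int32_t e1 = in[i][0] - in[i][2];
        int32_t e2 = (in[i][1] >> 1) - in[i][3];
        int32_t e3 = in[i][1] + (in[i][3] >> 1);
        m[i][0] = e0 + e3;  m[i][1] = e1 + e2;
        m[i][2] = e1 - e2;  m[i][3] = e0 - e3;
    }
    for (int j = 0; j < 4; j++) {                  /* vertical pass + rounding */
        int32_t e0 = m[0][j] + m[2][j];
        int32_t e1 = m[0][j] - m[2][j];
        int32_t e2 = (m[1][j] >> 1) - m[3][j];
        int32_t e3 = m[1][j] + (m[3][j] >> 1);
        out[0][j] = (e0 + e3 + 32) >> 6;
        out[1][j] = (e1 + e2 + 32) >> 6;
        out[2][j] = (e1 - e2 + 32) >> 6;
        out[3][j] = (e0 - e3 + 32) >> 6;
    }
}
```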
As a typical implementation, when the DC inverse Hadamard operation is inserted into the residual parsing module, the control scheme of the main state machine is as shown in FIG. 6.
Here Y_DC_DEC denotes the luminance DC parsing sub-module, which parses the DC coefficients R_ij_DC_Y of the 4x4 block of the luminance component Y of the macroblock; Y_DC_IQT denotes the luminance DC inverse quantization sub-module, which performs the inverse Hadamard transform and inverse quantization on the 4x4 block's DC coefficients R_ij_DC_Y to obtain the inverse-quantized residual values Q_ij_DC_Y; and Y_AC_DEC denotes the luminance AC parsing sub-module, which parses the AC coefficients R_ij_AC_Y of the luminance component Y and inverse-quantizes them to obtain the inverse-quantized residual values Q_ij_AC_Y.
UV_DC_DEC denotes the chrominance DC parsing sub-module, which parses the DC coefficients R_ij_DC_UV of the 2x2 blocks of the chrominance component UV of the macroblock; UV_DC_IQT denotes the chrominance DC inverse quantization sub-module, which performs the inverse Hadamard transform and inverse quantization on the 2x2 blocks' DC coefficients R_ij_DC_UV to obtain the inverse-quantized residual values Q_ij_DC_UV; and UV_AC_DEC denotes the chrominance AC parsing sub-module, which parses the AC coefficients R_ij_AC_UV of the chrominance component UV and inverse-quantizes them to obtain the inverse-quantized residual values Q_ij_AC_UV.
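The following C sketch flattens the ordering implied by the steps above into a simple next-state function; the actual transition conditions of FIG. 6, and the overlap between the DC inverse quantization sub-modules and the AC parsing sub-modules (which run "at the same time"), are not modelled here.

```c
/* Hypothetical, simplified view of the main state machine controlling the
 * residual parsing / inverse quantization sub-modules. */
typedef enum {
    ST_IDLE,      /* waiting for a macroblock */
    ST_Y_DC,      /* Y_DC_DEC + Y_DC_IQT: luma DC parse, inverse Hadamard, IQT */
    ST_Y_AC,      /* Y_AC_DEC: luma AC parse + IQT */
    ST_UV_DC,     /* UV_DC_DEC + UV_DC_IQT: chroma DC parse, inverse Hadamard, IQT */
    ST_UV_AC,     /* UV_AC_DEC: chroma AC parse + IQT */
    ST_DONE       /* macroblock residuals ready in the MB buffer */
} ResState;

ResState res_next_state(ResState s)
{
    switch (s) {
    case ST_IDLE:  return ST_Y_DC;
    case ST_Y_DC:  return ST_Y_AC;
    case ST_Y_AC:  return ST_UV_DC;
    case ST_UV_DC: return ST_UV_AC;
    case ST_UV_AC: return ST_DONE;
    default:       return ST_IDLE;
    }
}
```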
Another embodiment of the present invention also provides a data processing apparatus for video decoding. The data processing apparatus is for partitioning decoding of an AVC video sequence into a multi-stage pipeline structure.
In this embodiment, when the data processing apparatus sets the cooperative working mode of the entropy decoding, inverse quantization, inverse DCT transform, intra prediction, inter prediction, image reconstruction and deblocking filtering processes, the inverse quantization process is located in the same pipeline stage as the entropy decoding process, and the inverse DCT transform process is located in the pipeline stage following the one in which inverse quantization is performed.
Specifically, the data processing apparatus includes an entropy decoding and inverse quantization unit and an inverse transform unit.
The entropy decoding and inverse quantization unit can perform entropy decoding and inverse quantization on the macroblock and synchronously input the inverse-quantized data of a plurality of pixels to the inverse transform unit.
The inverse transform unit can perform multi-point parallel inverse DCT transform processing on the synchronously input data of the plurality of pixels.
In this embodiment, the data processing apparatus divides decoding of the AVC video sequence into a 4-stage pipeline structure: stage 1 corresponds to entropy decoding and inverse quantization processing, stage 2 to inverse DCT transform processing, stage 3 to intra prediction, inter prediction and image reconstruction processing, and stage 4 to deblocking filtering processing.
Preferably, the entropy decoding and inverse quantization unit may include a residual parsing module and an inverse quantization calculation module, which adopt a pipeline structure. Since the residuals are parsed one by one in zigzag order during entropy decoding (Entropy Dec), when the inverse quantization operation is moved forward into the pipeline stage where entropy decoding is located, each time the residual parsing (Res Dec) module decodes a residual pn it sends pn to the Inverse Quantization (IQT) module for inverse quantization, and the result is then stored in the entropy decoding output macroblock buffer (MB buffer).
Specifically, the residual parsing module parses the pixel residuals one by one in zigzag order; each time a pixel residual has been parsed, it is sent to the inverse quantization calculation module for inverse quantization, and the processed data is stored in the entropy decoding output macroblock buffer.
With this technical solution, the inverse quantization calculation module is time-division multiplexed, and the entire 1st pipeline stage (the entropy decoding and inverse quantization pipeline stage) needs only one IQT calculation module; compared with placing IQT and IDCT in the same stage as in the prior art, the multiplier resources required for multi-pixel parallel calculation are significantly reduced.
Other technical features are referred to the previous embodiments and will not be described here again.
Another embodiment of the present invention also provides a video decoder system that employs macroblock-level pipelining.
The video decoder system comprises decoding firmware and a multi-core hardware decoding accelerator that are communicatively connected: the decoding firmware is used for parsing the non-entropy-coded data of the upper layer of the video code stream, and the multi-core hardware decoding accelerator is used for processing the decoding tasks of the macroblock layer in the video code stream.
The code stream of AVC video adopts a layered structure: most of the syntax shared by the GOP layer and the Slice layer is separated out to form the video parameter set VPS (Video Parameter Set), the sequence parameter set SPS (Sequence Parameter Set), the picture parameter set PPS (Picture Parameter Set), and so on; since these account for a small proportion of the data and are simple to parse, they are very suitable for software parsing. Based on these characteristics of the code stream data, the decoder system provided in this embodiment divides the video decoder VDEC into two parts, the decoding firmware VDEC_FW and the multi-core hardware decoding accelerator VDEC_MCORE: the decoding firmware, as the software part, parses the non-entropy-coded data of the upper layer of the video code stream (such as the video parameter set VPS, the sequence parameter set SPS, the picture parameter set PPS, slice header information, etc.), and the multi-core hardware decoding accelerator, as the hardware part, handles all decoding operations of the macroblock layer in the video code stream in a unified manner.
In this embodiment, the multi-core hardware decoding accelerator includes the data processing apparatus in the foregoing embodiment.
When the data processing device sets the cooperative working mode of the entropy decoding, inverse quantization, inverse DCT transform, intra prediction, inter prediction, image reconstruction and deblocking filtering processes, the inverse quantization process is located in the same pipeline stage as the entropy decoding process, and the inverse DCT transform process is located in the pipeline stage following the one in which inverse quantization is performed.
The data processing apparatus may include an entropy decoding and inverse quantization unit and an inverse transform unit.
The entropy decoding and inverse quantization unit can perform entropy decoding and inverse quantization on the macroblock and synchronously input the inverse-quantized data of a plurality of pixels to the inverse transform unit. The inverse transform unit can perform multi-point parallel inverse DCT transform processing on the synchronously input data of the plurality of pixels. Further, the data processing apparatus may also include a prediction reconstruction unit for performing intra prediction, inter prediction and image reconstruction on the macroblock data, and a deblocking filter unit for performing deblocking filtering on the macroblock data.
In one embodiment, the entropy decoding and inverse quantization unit may include a residual parsing module and an inverse quantization calculation module, which adopt a pipeline structure. Since the residuals are parsed one by one in zigzag order during entropy decoding (Entropy Dec), when the inverse quantization operation is moved forward into the pipeline stage where entropy decoding is located, each time the residual parsing (Res Dec) module decodes a residual pn it sends pn to the Inverse Quantization (IQT) module for inverse quantization, and the result is then stored in the entropy decoding output macroblock buffer (MB buffer). Specifically, the residual parsing module parses the pixel residuals one by one in zigzag order; each time a pixel residual has been parsed, it is sent to the inverse quantization calculation module for inverse quantization, and the processed data is stored in the entropy decoding output macroblock buffer.
With continued reference to FIG. 7, the software and hardware use the Slice level in the video code stream as the interaction unit and exchange data through a Slice Queue inside the video decoder. The interaction flow between the decoding firmware VDEC_FW and the multi-core hardware decoding accelerator VDEC_MCORE may be as follows:
1) After the decoding firmware VDEC_FW completes the upper-layer parsing task of the code stream, it packs the Slice upper-layer parameter information and pushes it into the Slice Queue, i.e. the information is placed (pushed) into the Slice Queue to wait its turn. The downward arrow in FIG. 7 indicates the push into the Slice Queue.
At this point, the decoding firmware is configured to: after the upper-layer parsing of the video code stream is completed, pack the Slice upper-layer parameter information and push it into the Slice Queue.
2) The multi-core hardware decoding accelerator VDEC_MCORE queries the ready information (ready status) of the Slice Queue data; after reading the queue entry and completing its configuration, the hardware parses the macroblocks in the current Slice until it finishes, and on completion sends an interrupt signal to release the Slice Queue, i.e. the corresponding entry in the Slice Queue is released (popped). The upward arrow in FIG. 7 indicates the release of the Slice Queue.
At this point, the multi-core hardware decoding accelerator is configured to: query the ready information of the Slice Queue data; after reading the queue entry and completing configuration, parse the macroblocks in the current Slice until their parsing is complete, and after completion send an interrupt signal to release the Slice Queue.
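The sketch below models this handshake in C as a simple single-producer/single-consumer ring buffer; the queue depth, the contents of a queue entry and the function names are assumptions for illustration, and the hardware side is shown as an ordinary function rather than an interrupt handler.

```c
#include <stdbool.h>
#include <stdint.h>

#define SLICE_Q_DEPTH 8

typedef struct { uint32_t slice_params[16]; } SliceEntry;   /* packed upper-layer parameters */

typedef struct {
    SliceEntry entry[SLICE_Q_DEPTH];
    volatile unsigned head, tail;    /* firmware advances tail, hardware advances head */
} SliceQueue;

/* VDEC_FW side: push one parsed slice's upper-layer parameters (downward arrow in FIG. 7) */
bool slice_queue_push(SliceQueue *q, const SliceEntry *e)
{
    if ((q->tail + 1) % SLICE_Q_DEPTH == q->head)
        return false;                              /* queue full: firmware waits */
    q->entry[q->tail] = *e;
    q->tail = (q->tail + 1) % SLICE_Q_DEPTH;       /* slice is now "ready" for VDEC_MCORE */
    return true;
}

/* VDEC_MCORE side: release the entry once all macroblocks of the slice are decoded,
 * which in hardware is accompanied by the end-of-slice interrupt (upward arrow in FIG. 7) */
bool slice_queue_pop(SliceQueue *q, SliceEntry *e)
{
    if (q->head == q->tail)
        return false;                              /* no ready slice */
    *e = q->entry[q->head];
    q->head = (q->head + 1) % SLICE_Q_DEPTH;
    return true;
}
```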
In this way, the software/hardware partition combined with the software queue allows the software and the hardware to process in parallel, which significantly saves software processing time and improves overall parallel processing efficiency.
Specifically, as a typical example, the multi-core hardware decoding accelerator may be configured to include a pre-processor module and a plurality of homogeneous, full-function hardware decoders. A full-function hardware decoder is able to handle at least the inverse DCT transform, intra/inter prediction and pixel reconstruction steps necessary for decoding a macroblock row.
The pre-processor module comprises the entropy decoding and inverse quantization unit of the data processing apparatus.
Each full-function hardware decoder comprises the inverse transform unit, the prediction reconstruction unit and the deblocking filter unit of the data processing apparatus.
Each full-function hardware decoder is single-core and is responsible for decoding one macroblock row, including the inverse DCT transform, intra prediction, inter prediction, image reconstruction, deblocking filtering and other process steps; the macroblocks being decoded in two vertically adjacent rows are kept at least two macroblocks apart, so that multi-core synchronous decoding can be achieved.
Two or more homogeneous full-function hardware decoders may be provided: with two the accelerator is called a dual-core hardware decoding accelerator, with three a tri-core hardware decoding accelerator, with four a quad-core hardware decoding accelerator, and so on. Each full-function hardware decoder is responsible for decoding one macroblock row, so the dual-core hardware decoding accelerator can decode two macroblock rows in parallel, the tri-core accelerator three macroblock rows, and so on.
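A hedged C sketch of the row-scheduling rule described above is given below: a core decoding macroblock row r may only start a macroblock once the row above has advanced at least two macroblocks past it. The structure and function names are illustrative, not part of the patent.

```c
#include <stdbool.h>

#define MAX_MB_ROWS 64

typedef struct {
    int mb_cols;                 /* macroblocks per row */
    int progress[MAX_MB_ROWS];   /* next macroblock column to decode in each row */
} PictureProgress;

/* True when the core decoding (row, col) may proceed while keeping the >= 2 MB
 * offset between vertically adjacent rows required for multi-core synchronous decoding. */
bool can_decode_mb(const PictureProgress *p, int row, int col)
{
    if (row == 0)
        return true;                          /* top row has no upper dependency */
    return p->progress[row - 1] >= col + 2;   /* row above is at least 2 MBs ahead */
}
```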
Other technical features of the data processing apparatus are referred to the previous embodiments and will not be described in detail here.
Another embodiment of the present invention also provides a video decoding method using the aforementioned video decoder system. The video decoding method comprises the following steps:
step 100, video code stream data is received.
Step 200, the non-entropy-coded data of the upper layer of the video code stream is parsed by the decoding firmware, and the decoding tasks of the macroblock layer in the video code stream are processed by the multi-core hardware decoding accelerator.
In step 200, the decoding tasks of the macroblock layer include the entropy decoding, inverse quantization, inverse DCT transform, intra prediction, inter prediction, image reconstruction and deblocking filtering processes. When the cooperative working mode of these processes is set, the inverse quantization process is located in the same pipeline stage as the entropy decoding process, and the inverse DCT transform process is located in the pipeline stage following the one in which inverse quantization is performed. The entropy decoding and inverse quantization unit performs entropy decoding and inverse quantization on the macroblock, and the inverse-quantized data of a plurality of pixels is input synchronously to the inverse transform unit for multi-point parallel inverse DCT transform processing.
Other technical features are referred to the previous embodiments and will not be described here again.
In the above description, the disclosure of the present invention is not intended to limit itself to these aspects. Rather, the components may be selectively and operatively combined in any number within the scope of the present disclosure. In addition, terms like "comprising," "including," and "having" should be construed by default as inclusive or open-ended, rather than exclusive or closed-ended, unless expressly defined to the contrary. All technical, scientific, or other terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Common terms found in dictionaries should not be too idealized or too unrealistically interpreted in the context of the relevant technical document unless the present disclosure explicitly defines them as such. Any alterations and modifications of the present invention, which are made by those of ordinary skill in the art based on the above disclosure, are intended to be within the scope of the appended claims.
Claims (8)
1. A hardware decoder pipeline optimization method, which adopts macroblock-level pipeline operation to divide decoding of an AVC video sequence into a multi-stage pipeline structure and specifies the cooperative working mode of the entropy decoding, inverse quantization, inverse DCT transform, intra prediction, inter prediction, image reconstruction and deblocking filtering processes, characterized in that:
the inverse quantization process is located in the same pipeline stage as the entropy decoding process, and the inverse DCT transform process is located in the pipeline stage following the one in which inverse quantization is performed; inverse quantization of the residual is performed in the entropy decoding and inverse quantization pipeline stage;
entropy decoding and inverse quantization are performed on the macroblock by an entropy decoding and inverse quantization unit, and the inverse-quantized data of a plurality of pixels is input synchronously to an inverse transform unit for multi-point parallel inverse DCT transform processing;
the entropy decoding and inverse quantization unit comprises a residual parsing module and an inverse quantization calculation module that adopt a pipeline structure, and the inverse quantization calculation module is time-division multiplexed during the entropy decoding and inverse quantization processing; for the pixel residuals p0, p1, p2, …, pn located in the same macroblock, the residual parsing module parses the pixel residuals one by one in zigzag order, each parsed pixel residual is sent to the same inverse quantization calculation module for inverse quantization, and the inverse quantization calculation module stores the inverse-quantized data in the entropy decoding output macroblock buffer.
2. The method according to claim 1, characterized in that: decoding of the AVC video sequence is divided into a 4-stage pipeline structure, with stage 1 corresponding to entropy decoding and inverse quantization processing, stage 2 to inverse DCT transform processing, stage 3 to intra prediction, inter prediction and image reconstruction processing, and stage 4 to deblocking filtering processing.
3. The method according to claim 1, characterized in that: when the residual is inverse-quantized by the inverse quantization calculation module, the residual coefficient transform process comprises AC coefficient inverse quantization and DC coefficient inverse quantization, and before the DC coefficient inverse quantization, a DC inverse Hadamard transform is performed by the residual parsing module.
4. The method according to claim 3, characterized in that: with R_ij denoting the original residual values obtained by entropy decoding the syntax elements and Q_ij denoting the final inverse-quantized residual values, the inverse quantization process of the residual comprises the following steps:
acquiring the residual parsing state;
when it is determined that residual parsing is in the idle state, the luminance DC parsing sub-module Y_DC_DEC parses the DC coefficients R_ij_DC_Y of the 4x4 block of the luminance component Y of the macroblock, and the luminance DC inverse quantization sub-module Y_DC_IQT then performs the inverse Hadamard transform and inverse quantization on the DC coefficients R_ij_DC_Y of the 4x4 block to obtain the inverse-quantized residual values Q_ij_DC_Y; at the same time, the luminance AC parsing sub-module Y_AC_DEC parses the AC coefficients R_ij_AC_Y of the luminance component Y and inverse-quantizes them to obtain the inverse-quantized residual values Q_ij_AC_Y;
then, the chrominance DC parsing sub-module UV_DC_DEC parses the DC coefficients R_ij_DC_UV of the 2x2 blocks of the chrominance component UV of the macroblock, and the chrominance DC inverse quantization sub-module UV_DC_IQT then performs the inverse Hadamard transform and inverse quantization on the DC coefficients R_ij_DC_UV of the 2x2 blocks to obtain the inverse-quantized residual values Q_ij_DC_UV; at the same time, the chrominance AC parsing sub-module UV_AC_DEC parses the AC coefficients R_ij_AC_UV of the chrominance component UV and inverse-quantizes them to obtain the inverse-quantized residual values Q_ij_AC_UV.
5. A data processing apparatus for video decoding employing macroblock-level pipelining to divide decoding of AVC video sequences into a multi-level pipeline architecture, characterized by:
when the data processing device is used for setting a cooperative working mode of entropy decoding, inverse quantization, inverse DCT conversion, intra-frame prediction, inter-frame prediction, image reconstruction and deblocking filtering processes, the inverse quantization process and the entropy decoding process are positioned at the same pipeline stage, and the inverse DCT conversion process is positioned at the next pipeline stage of the pipeline stage where the inverse quantization process is positioned; performing inverse quantization processing on the residual error on an entropy decoding inverse quantization pipeline stage;
the data processing device comprises an entropy decoding inverse quantization unit and an inverse transformation unit;
the entropy decoding and inverse quantization unit can perform entropy decoding and inverse quantization processing on the macro block, and synchronously input data of a plurality of pixel points after the inverse quantization processing into the inverse transformation unit;
the inverse transformation unit can perform multipoint parallel inverse DCT transformation processing on the data of a plurality of pixel points which are synchronously input;
the entropy decoding and dequantizing unit comprises a residual analysis module and an dequantizing calculation module, wherein the residual analysis module and the dequantizing calculation module adopt a pipeline structure, and the dequantizing calculation module is used for carrying out time-sharing multiplexing when carrying out entropy decoding and dequantizing treatment; at this time, for a plurality of pixel residuals p0, p1, p2, … … and pn located in the same macroblock, the residual analysis module analyzes the pixel residuals one by one in turn according to the zigzag sequence, and each time a pixel residual is analyzed, the pixel residual is sent to the same inverse quantization calculation module to perform inverse quantization processing, and the inverse quantization calculation module stores the data after the inverse quantization processing into the entropy decoding output macroblock buffer.
6. The data processing apparatus of claim 5, characterized in that: decoding of the AVC video sequence is divided into a 4-stage pipeline structure, with stage 1 corresponding to entropy decoding and inverse quantization processing, stage 2 to inverse DCT transform processing, stage 3 to intra prediction, inter prediction and image reconstruction processing, and stage 4 to deblocking filtering processing.
7. A video decoder system employing macroblock-level pipelining, characterized in that: the video decoder system comprises decoding firmware and a multi-core hardware decoding accelerator which are communicatively connected, the decoding firmware being used for parsing the non-entropy-coded data of the upper layer of the video code stream and the multi-core hardware decoding accelerator being used for processing the decoding tasks of the macroblock layer in the video code stream; the Slice level in the video code stream is used as the interaction unit between the decoding firmware and the multi-core hardware decoding accelerator; the decoding firmware is configured to pack the Slice upper-layer parameter information and push it into the Slice Queue after completing the upper-layer parsing of the video code stream; the multi-core hardware decoding accelerator is configured to query the ready information of the Slice Queue data, and, after reading the queue entry and completing configuration, parse the macroblocks in the current Slice until their parsing is complete and send an interrupt signal after completion to release the Slice Queue;
the multi-core hardware decoding accelerator comprises the data processing apparatus of any one of claims 5-6.
8. A video decoding method employing macroblock-level pipelining, comprising the steps of: receiving video code stream data;
analyzing non-entropy coding data of an upper layer of the video code stream by a decoding firmware, and processing decoding tasks of a macro block layer in the video code stream by a multi-core hardware decoding accelerator; the Slice level in the video code stream is used as an interaction unit between the decoding firmware and the multi-core hardware decoding accelerator; after the decoding firmware is configured to complete the upper layer analysis of the video code stream, packing the Slice upper layer parameter information into Slice Queue; the multi-core hardware decoding accelerator is configured to query ready information of Slice Queue data, analyze a current Slice inner macro block until analysis of the Slice inner macro block is completed after a Queue is read and configuration is completed, and send an interrupt signal after analysis is completed to release Slice Queue;
wherein,
the decoding tasks of the macroblock layer comprise the entropy decoding, inverse quantization, inverse DCT transform, intra prediction, inter prediction, image reconstruction and deblocking filtering processes; when the cooperative working mode of these processes is set, the inverse quantization process is located in the same pipeline stage as the entropy decoding process, the inverse DCT transform process is located in the pipeline stage following the one in which inverse quantization is performed, and inverse quantization of the residual is performed in the entropy decoding and inverse quantization pipeline stage; entropy decoding and inverse quantization are performed on the macroblock by an entropy decoding and inverse quantization unit, and the inverse-quantized data of a plurality of pixels is input synchronously to an inverse transform unit for multi-point parallel inverse DCT transform processing;
the entropy decoding and inverse quantization unit comprises a residual parsing module and an inverse quantization calculation module that adopt a pipeline structure, and the inverse quantization calculation module is time-division multiplexed during the entropy decoding and inverse quantization processing; for the pixel residuals p0, p1, p2, …, pn located in the same macroblock, the residual parsing module parses the pixel residuals one by one in zigzag order, each parsed pixel residual is sent to the same inverse quantization calculation module for inverse quantization, and the inverse quantization calculation module stores the inverse-quantized data in the entropy decoding output macroblock buffer.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011154937.6A CN112422986B (en) | 2020-10-26 | 2020-10-26 | Hardware decoder pipeline optimization method and application |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011154937.6A CN112422986B (en) | 2020-10-26 | 2020-10-26 | Hardware decoder pipeline optimization method and application |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112422986A CN112422986A (en) | 2021-02-26 |
CN112422986B true CN112422986B (en) | 2023-12-22 |
Family
ID=74841827
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202011154937.6A Active CN112422986B (en) | 2020-10-26 | 2020-10-26 | Hardware decoder pipeline optimization method and application |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112422986B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023019001A1 (en) * | 2021-08-13 | 2023-02-16 | Meta Platforms, Inc. | Hardware pipelines for rate–distortion optimization (rdo) that support multiple codecs |
US11683498B2 (en) | 2021-08-13 | 2023-06-20 | Meta Platforms, Inc. | Hardware pipelines for rate-distortion optimization (RDO) that support multiple codecs |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5889560A (en) * | 1995-10-12 | 1999-03-30 | Samsung Electronics, Co., Ltd. | MPEG video decoder |
CN1589025A (en) * | 2004-07-30 | 2005-03-02 | 联合信源数字音视频技术(北京)有限公司 | Vido decoder based on software and hardware cooperative control |
JP2005184525A (en) * | 2003-12-19 | 2005-07-07 | Akuseru:Kk | Image processing device |
CN101466039A (en) * | 2008-12-31 | 2009-06-24 | 中国科学院计算技术研究所 | A video decoding device and method |
CN104469488A (en) * | 2014-12-29 | 2015-03-25 | 北京奇艺世纪科技有限公司 | Video decoding method and system |
CN107277505A (en) * | 2017-05-19 | 2017-10-20 | 北京大学 | The video decoder structures of AVS 2 based on HW/SW Partitioning |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100666880B1 (en) * | 2005-01-14 | 2007-01-10 | 삼성전자주식회사 | Dual video decoding system and method |
CN105578190B (en) * | 2016-02-03 | 2018-05-04 | 珠海全志科技股份有限公司 | Lossless compression method and system applied to video hard decoder |
CN113055691A (en) * | 2017-12-29 | 2021-06-29 | 深圳市大疆创新科技有限公司 | Video decoder, data processing circuit, system, and method |
- 2020-10-26: CN application CN202011154937.6A (patent CN112422986B), status Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5889560A (en) * | 1995-10-12 | 1999-03-30 | Samsung Electronics, Co., Ltd. | MPEG video decoder |
JP2005184525A (en) * | 2003-12-19 | 2005-07-07 | Akuseru:Kk | Image processing device |
CN1589025A (en) * | 2004-07-30 | 2005-03-02 | 联合信源数字音视频技术(北京)有限公司 | Vido decoder based on software and hardware cooperative control |
CN101466039A (en) * | 2008-12-31 | 2009-06-24 | 中国科学院计算技术研究所 | A video decoding device and method |
CN104469488A (en) * | 2014-12-29 | 2015-03-25 | 北京奇艺世纪科技有限公司 | Video decoding method and system |
CN107277505A (en) * | 2017-05-19 | 2017-10-20 | 北京大学 | The video decoder structures of AVS 2 based on HW/SW Partitioning |
Non-Patent Citations (2)
Title |
---|
Optimized design of AVS run-length decoding, inverse scan, inverse quantization and inverse transform; Zhao Ce; Liu Peilin; Information Technology (Issue 02); full text *
J. Lee. High performance array processor for video decoding. IEEE Computer Society Annual Symposium on VLSI: New Frontiers in VLSI Design (ISVLSI'05). 2005, full text. *
Also Published As
Publication number | Publication date |
---|---|
CN112422986A (en) | 2021-02-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
KR101184244B1 (en) | Parallel batch decoding of video blocks | |
US10038908B2 (en) | Palette mode in high efficiency video coding (HEVC) screen content coding (SCC) | |
US8000388B2 (en) | Parallel processing apparatus for video compression | |
US9357223B2 (en) | System and method for decoding using parallel processing | |
JP6004407B2 (en) | Video decoding method and video decoder | |
JP2023519022A (en) | Video decoding method and its video encoding method, apparatus, computer device and computer program | |
CN111741299B (en) | Method, device and equipment for selecting intra-frame prediction mode and storage medium | |
US9854252B2 (en) | Method and apparatus of HEVC de-blocking filter | |
WO2023040600A1 (en) | Image encoding method and apparatus, image decoding method and apparatus, electronic device, and medium | |
CN1819658A (en) | Method and device for coding a video picture in inter or intra mode | |
CN112422986B (en) | Hardware decoder pipeline optimization method and application | |
KR20210064362A (en) | Method and apparatus for video coding | |
CN102340659A (en) | Parallel mode decision device and method based on AVS (Audio Video Standard) | |
CN111416975A (en) | Prediction mode determination method and device | |
JP2010141513A (en) | Arithmetic unit and video image encoder | |
CN104602026B (en) | A kind of reconstruction loop structure being multiplexed entirely encoder under HEVC standard | |
CN114071148A (en) | Video coding method, apparatus, equipment and product | |
CN102625109A (en) | A transcoding method from MPEG-2 to H.264 based on multi-core processor | |
CN102090064A (en) | High performance deblocking filter | |
Baaklini et al. | H. 264 color components video decoding parallelization on multi-core processors | |
WO2022188239A1 (en) | Coefficient coding/decoding method, encoder, decoder, and computer storage medium | |
Zhu et al. | H. 264 video parallel decoder on a 24-core processor | |
CN105744269A (en) | Down sampling and subpixel motion estimation based transcoding method | |
CN116980609A (en) | Video data processing method, device, storage medium and equipment | |
Zhang et al. | Parallel Acceleration Scheme of HEVC Decoder Based on Multicore Platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||