CN103634604B - Multi-core DSP (digital signal processor) motion estimation-oriented data prefetching method

Info

Publication number: CN103634604B
Application number: CN201310632104.XA
Authority: CN
Inventors: 姜宏旭; 孙士明; 翟东林; 李波
Original assignee: Beihang University
Current assignee: Beihang University
Priority date: 2013-12-01
Filing date: 2013-12-01
Publication date: 2017-01-11
Anticipated expiration: 2033-12-01
Also published as: CN103634604A

Abstract

The invention discloses a data prefetching method for motion estimation on multi-core DSPs (digital signal processors). Using the spatial correlation of data in motion estimation together with the predicted motion vector, the method prefetches the data of the next coding block and its reference blocks into local memory, so that motion estimation of the current coding block runs in parallel with loading the data of the next coding block and its reference blocks. This reduces the impact of memory access on multi-core processor speed during motion estimation, and tests show that the method noticeably increases motion estimation speed in multi-core DSP parallel video coding.

Description

A data prefetching method for multi-core DSP motion estimation

Technical field

The invention belongs to the field of multimedia encoding and decoding. It relates in particular to a data prefetching method for motion estimation in parallel video coding on embedded multi-core DSPs, that is, a method that accelerates video motion estimation through data prefetching.

Background

Motion estimation is one of the main components of video coding based on the hybrid coding framework. It operates on data blocks, performing prediction, motion search, motion compensation, DCT and quantization block by block. In H.264/AVC the data blocks used in motion estimation include macroblocks (MB), sub-macroblocks and blocks; in HEVC they include coding units (CU), prediction units (PU) and transform units (TU). Motion estimation of a P frame requires the data of the current coded frame and one reference frame, while motion estimation of a B frame requires the current coded frame, one forward reference frame and one backward reference frame, so the data throughput is very high. In a 1080i video sequence each frame has a resolution of 1920×1080 and 60 frames are output per second; representing color in YUV (4:2:0) format produces 0.746 Gbps of data to be encoded, and the data handled in motion estimation exceeds 1.5 Gbps. As video quality improves, the volume of video data continues to grow sharply.

Video coding in embedded systems is increasingly implemented on multi-core DSPs. An embedded multi-core DSP has a multi-level memory hierarchy: each core has its own local memory, and all cores share the MSM memory and a large off-chip memory. Local memory is small and fastest; MSM memory is larger and slower; external memory has the largest capacity and is the slowest. Because the local memory of a multi-core DSP is too small to hold entire coded and reference frames, the frames must be divided into small data blocks: the current coding block and its reference blocks are held in internal memory while the coded frame and reference frames remain in external memory. As shown in Figure 1, in an embedded multi-core video encoder the captured video data is first stored temporarily in the large external memory and is read into internal memory for computation during encoding. However, processor performance grows by about 60% per year while memory access performance improves by less than 10% per year, so the gap between processor and memory keeps widening and memory becomes the system bottleneck. The problem is even more severe in multi-core processors, and because motion estimation needs large amounts of data, the memory bottleneck becomes a major factor limiting its processing speed.
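The raw data rate quoted for 1080i video can be checked with a short calculation. The sketch below assumes 8 bits per sample (the text does not state the bit depth) and uses a hypothetical helper name, `raw_video_bitrate`. Under that assumption, the 0.746 Gbps figure matches 30 full frames per second, which is what 1080i delivers as 60 interlaced fields per second; 60 full frames per second would give about twice that rate.

```python
def raw_video_bitrate(width, height, fps, bits_per_sample=8, chroma_factor=1.5):
    """Raw bit rate in bits per second for planar YUV video.

    chroma_factor=1.5 models 4:2:0 sampling: one full-resolution luma plane
    plus two quarter-resolution chroma planes. The 8-bit sample depth is an
    assumption, not a value given in the text.
    """
    return width * height * chroma_factor * bits_per_sample * fps


rate_30 = raw_video_bitrate(1920, 1080, 30)   # 60 interlaced fields per second
rate_60 = raw_video_bitrate(1920, 1080, 60)   # 60 full frames per second
print(f"{rate_30 / 1e9:.3f} Gbps at 30 frames/s")
print(f"{rate_60 / 1e9:.3f} Gbps at 60 frames/s")
```

Motion estimation of a P frame touches both the coded frame and a reference frame, which is consistent with the statement that its data rate exceeds 1.5 Gbps.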

To reduce the impact of the memory bottleneck, multi-core processors with a multi-level memory hierarchy rely on the cache hit rate for memory performance. A cache miss, however, causes a long-latency external memory access that can take hundreds of processor clock cycles and slows down execution. For example, the worst-case latency of a cache read miss on the TMS320C6678 is 287 clock cycles, that is, 287 ns with the core running at 1 GHz. Because motion estimation processes large amounts of data, the effect of cache misses is even more pronounced. Data prefetching reads data before it is used and reduces processor stall time by overlapping computation with memory access. The existing patent with application No. 200410101465.2, "Method for reading macroblock data in the process of video encoding and decoding", addresses buffer misses by building a "macroblock address mapping table". That method, however, only provides a way of indexing macroblock data within a video frame; it can only lessen the impact of cache misses and does not prefetch reference frame data during encoding, even though motion estimation often searches several reference blocks, which involves more data and affects motion estimation more strongly. The patent with application No. 200710046929.8, "Data prefetching system in video processing", prefetches data blocks by adding a "data prefetching module" between the processor and the memory. Prefetching by adding a hardware unit is not applicable to commercial embedded DSPs, and because it lacks a synchronization mechanism the method is also unsuitable for parallel implementation on multi-core DSPs.

The present invention prefetches the data of the coded frame and reference frames according to the spatial correlation of the data processed in motion estimation during parallel coding and according to the predicted motion vector. It runs data reading and motion estimation in parallel, which effectively reduces the impact of the memory bottleneck on the processing speed of multi-core DSPs. Experiments show that the method effectively increases the execution speed of motion estimation on embedded multi-core DSPs.

Summary of the invention

To overcome the latency of memory accesses when a multi-core DSP performs motion estimation, the present invention discloses a technique that prefetches the data of the coded frame and reference frames according to the spatial correlation of data in motion estimation and the predicted motion vector. While one data block is being encoded, DMA prefetches the data of the next coding block and its reference blocks, so that reading and processing of motion estimation data proceed in parallel. Experiments show that the method effectively increases the processing speed of motion estimation on multi-core DSPs.

To achieve the above object, the present invention adopts the following technical solution:

A data prefetching method for multi-core DSP motion estimation (as shown in Figure 4) comprises the following steps:

Step 1. Set the prefetch data block size and the sizes of the coding block and reference block; store the data of the coding block and reference blocks in the core's local memory, and organize the storage area as a Ping-Pang structure;

Step 2. If the coding block currently undergoing motion estimation belongs to a P frame, perform the prediction and motion search operations and prefetch the data of the next coding block and its reference blocks;

Step 3. If the coding block currently undergoing motion estimation belongs to a B frame, perform the prediction and motion search operations and prefetch the data of the next coding block, its forward reference blocks and its backward reference blocks.

Step 1 specifically includes the following operations:

(1) The coded frame and reference frames are divided into coding blocks and reference blocks according to the local memory capacity of the multi-core DSP and the size of the data blocks processed in motion estimation. The data of the coded frame and reference frames is stored in the large external memory, and the data of the current coding block and its reference blocks is stored in local memory;

(2) The coded frame is divided into coding blocks and coding lines, as shown in Figure 2. The cores of the multi-core DSP perform motion estimation in units of coding lines: as soon as a core finishes motion estimation for one coding line, it takes the first subsequent coding line that has not yet undergone motion estimation and continues;

(3) A system control table is set up. It contains the current coded frame counter, an indicator of the first coding line in the current coded frame that has not yet undergone motion estimation, and the coding status table of the current coded frame. The system control table is shared by all cores in order to synchronize them;

(4) Ping-Pang storage areas are set up in local memory for the current coding block and the reference data blocks: the block currently being encoded is in Pang, and data is prefetched into Ping.
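The Ping-Pang arrangement in operation (4) is a double buffer: one half holds the block being processed while the other half receives the prefetched data, and the two roles swap between blocks. The following is a minimal sketch of that idea only; the class name `PingPangBuffer` and its methods are hypothetical, and the byte fill stands in for a completed DMA transfer.

```python
class PingPangBuffer:
    """Two equally sized halves: one is processed while the other is filled."""

    def __init__(self, block_bytes):
        self.halves = [bytearray(block_bytes), bytearray(block_bytes)]
        self.active = 0  # index of the half used by motion estimation ("Pang")

    @property
    def compute_half(self):
        """Half holding the block currently being encoded."""
        return self.halves[self.active]

    @property
    def prefetch_half(self):
        """Half that receives the prefetched data ("Ping")."""
        return self.halves[1 - self.active]

    def swap(self):
        """Exchange roles once the prefetch is complete and the block is done."""
        self.active = 1 - self.active


buf = PingPangBuffer(block_bytes=16 * 16)
buf.prefetch_half[:] = bytes([7]) * 256   # stands in for a completed DMA transfer
buf.swap()                                # prefetched block becomes the current block
print(buf.compute_half[0])
```

Because the two halves never alias, the transfer into one half cannot overwrite data that motion estimation is still reading from the other.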

Step 2 specifically includes the following steps:

(1) If the current coding block belongs to a P frame, perform the prediction and motion search operations of motion estimation for the coding block;

(2) Start DMA to prefetch the data of the coding block following the current coding block from external memory into the core's internal memory; the source address, destination address and data block size are calculated by formula (1);

$\begin{cases} DMA\_src = fenc\_base + corei\_line \times line\_block \times block\_size + (corei\_block + 1) \times block\_width \\ DMA\_dst = corei\_fenc\_base \\ data\_size = block\_size \end{cases} \qquad (1)$

where block_width is the coding block width, block_hight is the coding block height, block_size is the coding block size with block_size = block_width × block_hight, DMA_src is the source address of the DMA transfer of core i, DMA_dst is its destination address, data_size is the amount of data transferred, corei_fenc_base is the start address in internal memory of the coding block data of core i, fenc_base is the start address of the current coded frame, corei_line is the coding line currently processed by core i, line_block is the coding line size, and corei_block is the current coding block of core i.

(3) Start DMA to prefetch the reference block data for the coding block following the current coding block from external memory into the core's internal memory; that is, the data of several reference blocks centered on the predicted motion vector in the reference frame is prefetched into the core's local memory. The number of reference data blocks prefetched is determined by the motion search method; the source address, destination address and data block size are calculated by formula (2);

$\begin{cases} DMA\_src = fref\_base + (mvp\_x \times line\_pix - block\_width) + (mvp\_y - block\_hight) \\ DMA\_dst = corei\_ref\_base \\ data\_size = block\_size \times bref\_size \end{cases} \qquad (2)$

where DMA_src is the source address of the DMA transfer of core i, DMA_dst is its destination address, fref_base is the start address of the reference frame, line_pix is the number of pixels in one line of the reference frame, and mvp_x and mvp_y are the motion vector prediction, which can be obtained by the motion vector prediction method as soon as motion search of the current coding block has finished. corei_ref_base is the start address at which the reference block data is stored, bref_size is the number of reference data blocks, and the other parameters are as in formula (1);

(4) Perform motion compensation (MC), DCT and quantization;

(5) Update the control table by setting the coded flag of the current coding block;

(6) Repeat steps (1) to (5) until motion estimation has finished for every coding block in the current coding line.
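Formulas (1) and (2) from steps (2) and (3) translate directly into two small address helpers. The sketch below transcribes the formulas as printed, without frame-border handling. The function names are hypothetical, `line_block` is assumed to be the number of coding blocks per coding line, and the local addresses 0x00800000 and 0x00810000 are placeholders that do not come from the source.

```python
def coding_block_transfer(fenc_base, corei_fenc_base, corei_line, corei_block,
                          line_block, block_width, block_hight):
    """Formula (1): (source, destination, size) for the next coding block."""
    block_size = block_width * block_hight
    dma_src = (fenc_base
               + corei_line * line_block * block_size
               + (corei_block + 1) * block_width)
    return dma_src, corei_fenc_base, block_size


def reference_transfer(fref_base, corei_ref_base, mvp_x, mvp_y, line_pix,
                       block_width, block_hight, bref_size):
    """Formula (2): (source, destination, size) for the reference blocks."""
    block_size = block_width * block_hight
    dma_src = (fref_base
               + (mvp_x * line_pix - block_width)
               + (mvp_y - block_hight))
    return dma_src, corei_ref_base, block_size * bref_size


# Example values: 16x16 blocks, 704-pixel lines, 44 blocks per coding line.
src, dst, size = coding_block_transfer(
    fenc_base=0x82000000, corei_fenc_base=0x00800000,  # placeholder local address
    corei_line=0, corei_block=0, line_block=44, block_width=16, block_hight=16)
print(hex(src), hex(dst), size)

src, dst, size = reference_transfer(
    fref_base=0x80000000, corei_ref_base=0x00810000,   # placeholder local address
    mvp_x=2, mvp_y=20, line_pix=704, block_width=16, block_hight=16, bref_size=9)
print(hex(src), hex(dst), size)
```

Keeping the address arithmetic in pure functions makes it easy to check against the formulas before the values are written into a DMA descriptor.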

Step 3 specifically includes the following steps:

(1) If the current coding block belongs to a B frame, perform the prediction and motion search operations of motion estimation for the coding block;

(2) Start DMA to prefetch the data of the coding block following the current coding block from external memory into the core's local memory; the source address, destination address and data block size are calculated by formula (1);

(3) Start DMA to prefetch the forward reference block data for the data block following the current coding block into the core's internal memory; that is, the data of several reference blocks centered on the predicted motion vector in the reference frame is prefetched into the core's local memory. The number of reference blocks prefetched is determined by the motion search method; the source address, destination address and data block size are calculated by formula (2);

(4) Start DMA to prefetch the backward reference block data for the data block following the current coding block into the core's internal memory; that is, the data of several reference blocks centered on the predicted motion vector in the reference frame is prefetched into the core's local memory. The number of reference blocks prefetched is determined by the motion search method; the source address, destination address and data block size are calculated by formula (2);

(5) Perform motion compensation (MC), DCT and quantization;

(6) Update the control table by setting the coded flag of the current coding block;

(7) Repeat steps (1) to (6) until motion estimation has finished for every coding block in the current coding line.
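The P-frame and B-frame procedures differ only in which transfers are started after motion search. The sketch below logs the order of operations for one coding line without performing any real DMA or coding. The function names are hypothetical, and skipping the prefetch after the last block of a line is an assumption of this sketch, not something the text specifies.

```python
def plan_prefetch(frame_type):
    """Transfers started while the current block is still being processed."""
    if frame_type == "P":
        return ["next coding block", "reference blocks"]
    if frame_type == "B":
        return ["next coding block",
                "forward reference blocks",
                "backward reference blocks"]
    raise ValueError(f"no inter-prediction prefetch for frame type {frame_type!r}")


def process_coding_line(frame_type, blocks_per_line):
    """Log the order of operations for one coding line."""
    log = []
    for block in range(blocks_per_line):
        log.append((block, "prediction + motion search"))
        if block + 1 < blocks_per_line:   # assumption: no prefetch past the line end
            for transfer in plan_prefetch(frame_type):
                log.append((block, f"start DMA: {transfer} for block {block + 1}"))
        log.append((block, "MC + DCT + quantization"))   # overlaps with the DMA
        log.append((block, "set coded flag in control table"))
    return log


for entry in process_coding_line("B", blocks_per_line=2):
    print(entry)
```

The transfers are started before motion compensation, DCT and quantization so that the data movement can overlap with that remaining computation for the current block.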

Compared with the prior art, the present invention has the following advantages:

1. The invention runs data reading and data processing in parallel during motion estimation, which speeds up encoding on multi-core DSPs;

2. The data prefetching method reads motion estimation data ahead of time, making full use of memory bandwidth and alleviating the memory bottleneck;

3. The invention makes full use of the high-speed DMA data transfer unit of the embedded multi-core DSP, which speeds up data transfer.

Description of drawings

Figure 1 is a schematic diagram of the memory structure of a multi-core encoder in video surveillance;

Figure 2 is a schematic diagram of the coding line division in the present invention;

Figure 3 is a schematic diagram of the data storage structure in the specific embodiment of the present invention;

Figure 4 is a flowchart of the data prefetching method for multi-core DSP parallel video coding in the present invention.

Detailed description

A specific embodiment of the present invention on TI's TMS320C6678LE evaluation board is given below.

TI's TMS320C6678LE evaluation board carries one TMS320C6678 chip, with 512 MB of DDR3 as external memory. The TMS320C6678 contains 8 cores, core0 to core7, each running at 1.0 GHz. Each core has a 32 KB level-1 data memory (L1D), a 32 KB level-1 program memory (L1P) and 512 KB of level-2 memory, and the cores share 4 MB of MSM memory. L1D is configured as cache; the coding block and reference block data used in motion estimation is stored in level-2 memory, and the data of the video to be encoded and of the reference frames is stored in the external DDR3 memory.

Motion estimation uses the diamond search method with a 48×48 search window, so the data of the 9 reference blocks centered on the motion vector prediction is prefetched from the reference frame. The video format is YUV 4:2:0 at a resolution of 704×576 and the coding block size is 16×16, so one coded frame has 36 coding lines of 44 coding blocks each. The video start address is 0x82000000, the forward reference frame starts at 0x80000000 and the backward reference frame at 0x80800000. For each core, the start addresses of the current coding block data and the reference block data are corei_fenc_base, corei_ref_base0 and corei_ref_base1. Each storage area has a Ping-Pang structure: the current motion estimation uses the Pang data and prefetched data is stored into Ping.
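The block counts in this paragraph follow from the frame geometry. The short calculation below derives them, assuming 8 bits per sample for the frame size; that assumption is consistent with the 608256-byte frame stride used in the address formulas of the embodiment.

```python
width, height = 704, 576        # luma resolution from the embodiment
block = 16                      # coding block edge in pixels
search_window = 48              # search window edge in pixels

coding_lines = height // block              # rows of coding blocks per frame
blocks_per_line = width // block            # coding blocks per coding line
ref_blocks = (search_window // block) ** 2  # 16x16 blocks covering the window

frame_bytes = width * height * 3 // 2       # YUV 4:2:0, 8 bits per sample (assumed)
print(coding_lines, blocks_per_line, ref_blocks, frame_bytes)
```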

A parallel coding control table is set up, containing enc_tab, cur_enc_frame and first_unenc_line. The coding control table and the motion vectors are data shared by all cores and are stored in the on-chip MSM memory, as shown in Figure 3. The cores access them under mutual exclusion using the semaphore semphone0; the other coding parameters are stored in each core's L2. Each of core0 to core7 has a current coding line indicator, core0_line to core7_line, and a current coding block indicator, core0_block to core7_block: corei_line gives the number of the coding line being processed by the core, and corei_block gives the number of the core's current coding block.
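The way cores claim coding lines from the shared control table can be simulated on a host machine. In the sketch below, threads stand in for DSP cores and `threading.Lock` stands in for the semaphore semphone0; the class `ControlTable`, its method `claim_line` and the function `worker` are hypothetical names, and coding lines are numbered from 0.

```python
import threading


class ControlTable:
    """Shared state; on the DSP it would live in MSM behind a semaphore."""

    def __init__(self, lines, blocks_per_line):
        self.lock = threading.Lock()      # stands in for the shared semaphore
        self.first_unenc_line = 0
        self.lines = lines
        self.enc_tab = [[0] * blocks_per_line for _ in range(lines)]

    def claim_line(self):
        """Atomically hand out the next unprocessed coding line, or None."""
        with self.lock:
            if self.first_unenc_line >= self.lines:
                return None
            line = self.first_unenc_line
            self.first_unenc_line += 1
            return line


def worker(table, claimed):
    """One simulated core: keep claiming lines until none are left."""
    while (line := table.claim_line()) is not None:
        claimed.append(line)              # list.append is atomic in CPython


table = ControlTable(lines=36, blocks_per_line=44)
claimed = []
cores = [threading.Thread(target=worker, args=(table, claimed)) for _ in range(8)]
for t in cores:
    t.start()
for t in cores:
    t.join()
print(sorted(claimed) == list(range(36)))
```

Reading and incrementing first_unenc_line inside one critical section is what guarantees that no coding line is handed to two cores.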

Data prefetching between DDR3 and L2 is implemented with QDMA. core0 to core7 use QDMA0 to QDMA7 of TCC0 in EDMA3 as their respective data prefetch channels, configured in AB-sync and LINK mode. The PaRAM sets of QDMA0 to QDMA7 are 00 to 02, 10 to 12, 20 to 22, 30 to 32, 40 to 42, 50 to 52, 60 to 62 and 70 to 72.

The specific implementation process is as follows:

(1) core0 reads in the video sequence and records the total number of frames (total_frames);

(2) core0 initializes the global parameters: cur_enc_frame in the control table is initialized to -1 and first_unenc_line to -1;

(3) Determine whether any frames remain to be encoded, that is, whether cur_enc_frame is greater than total_frames; if it is, encoding ends, otherwise go to (4);

(4) Initialize the coding status table to "0" and first_unenc_line to 0;

(5) Determine whether any coding lines remain to be encoded, that is, whether first_unenc_line is greater than 36; if it is, the current coded frame is finished and control returns to (3), otherwise go to (6);

(6) Obtain a coding line: under mutual exclusion, each core reads first_unenc_line into corei_line, sets corei_block = 0 and increments first_unenc_line by 1;

(7) Calculate the start address of the coding block from cur_enc_frame, corei_line and corei_block. The addresses of the Y, U and V components are:

$\begin{cases} Y = \mathrm{0x82000000} + cur\_enc\_frame \times 608256 + corei\_line \times 11264 + corei\_block \times 16 \\ U = \mathrm{0x82000000} + cur\_enc\_frame \times 608256 + \mathrm{0x63000} + corei\_line \times 5632 + corei\_block \times 8 \\ V = \mathrm{0x82000000} + cur\_enc\_frame \times 608256 + \mathrm{0x7BC00} + corei\_line \times 5632 + corei\_block \times 8 \end{cases}$

The corresponding data_size values are 256, 64 and 64;

(8) Check whether motion search has completed for the left, upper-left, upper and upper-right neighbors of the current coding block; if so, continue, otherwise wait until motion search of the neighboring blocks has finished;

(9) Check whether the data of the current coding block has been transferred to L2; if so, continue, otherwise wait for the transfer to finish;

(10) Perform the prediction and motion search operations for the coding block;

(11) Prefetch the next coding block and its reference macroblock data as shown in Figure 4. The source address, destination address and data block size of the prefetched data for P-frame and B-frame coding blocks are calculated by formulas (1) and (2), with block_width = 16, block_hight = 16, line_pix = 704, fenc_base = 0x82000000 + cur_enc_frame × 608256 and bref_size = 9;

(12) Perform motion compensation (MC), DCT and quantization;

(13) Update enc_tab in the control table, that is, enc_tab[corei_line, corei_block] = 1;

(14) Determine whether the current coding line is finished, that is, whether corei_block is less than 44; if it is, go to (8) and continue processing the current coding line, otherwise go to (5) to process the next coding line;

(15) Repeat (5) to (14) until all frames of the video have been processed.
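Steps (7) and (8) of the procedure can be sketched as two helpers. The first computes the Y, U and V start addresses using the line and block strides given in step (7), with the plane offsets derived from the 704×576 4:2:0 geometry: the U plane starts after the Y plane and the V plane after the U plane. The second checks the four neighbors named in step (8). The function names are hypothetical. Using enc_tab as the completion record and treating neighbors outside the frame as already finished are assumptions of this sketch.

```python
VIDEO_BASE = 0x82000000
WIDTH, HEIGHT = 704, 576
Y_PLANE = WIDTH * HEIGHT                    # bytes in the luma plane
C_PLANE = (WIDTH // 2) * (HEIGHT // 2)      # bytes in one chroma plane
FRAME_BYTES = Y_PLANE + 2 * C_PLANE         # one YUV 4:2:0 frame


def yuv_block_addresses(cur_enc_frame, corei_line, corei_block):
    """Step (7): start addresses of the Y, U and V parts of a coding block."""
    frame = VIDEO_BASE + cur_enc_frame * FRAME_BYTES
    y = frame + corei_line * 11264 + corei_block * 16
    u = frame + Y_PLANE + corei_line * 5632 + corei_block * 8
    v = frame + Y_PLANE + C_PLANE + corei_line * 5632 + corei_block * 8
    return y, u, v


def neighbours_done(enc_tab, line, block):
    """Step (8): left, upper-left, upper and upper-right blocks are finished."""
    blocks_per_line = len(enc_tab[0])
    wanted = [(line, block - 1), (line - 1, block - 1),
              (line - 1, block), (line - 1, block + 1)]
    # Assumption: positions outside the frame never block progress.
    return all(enc_tab[r][c]
               for r, c in wanted
               if r >= 0 and 0 <= c < blocks_per_line)


enc_tab = [[0] * 44 for _ in range(36)]
print([hex(a) for a in yuv_block_addresses(0, 0, 0)])
print(neighbours_done(enc_tab, 0, 0), neighbours_done(enc_tab, 0, 1))
```

The upper-right dependency is what staggers neighboring coding lines: a core working on line n has to stay behind the core working on line n-1.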

The test results are shown in the table below (QP = 27):

When the group of pictures has the structure IBBP BBP BBP BBP BB, average performance improves by about 15%.

Claims

1. A data prefetching method for multi-core DSP motion estimation, characterized in that: the steps are as follows:

Step 1. Set the size of the prefetched data block, divide the size of the coding block and the reference block, and store the data of the coding block and the reference block in the local memory of the core, and set the storage area as a Ping-Pang structure;

Step 2. If the coding block currently performing motion estimation belongs to a P frame, perform prediction and motion search operations, and prefetch the data of the next coding block and reference block;

Step 3. If the coding block currently performing motion estimation belongs to a B frame, perform prediction and motion search operations, and prefetch the data of the next coding block, forward reference block, and backward reference block;

The step 1 specifically includes the following operations:

(1) Coding frames and reference frames are divided into coding blocks and reference blocks according to the local storage capacity of the multi-core DSP and the data block size processed in motion estimation. The data of the coding frames and reference frames is stored in the external large-capacity memory, and the data of the current coding block and the reference block is stored in the local memory;

(2) The coded frame is divided into coded blocks and coded lines, and each core of the multi-core DSP performs the motion estimation operation in units of coded lines, that is, immediately after finishing the motion estimation operation of one coded line, the current core obtains the first subsequent coded line that has not undergone motion estimation and continues;

(3) Set the system control table. The system control table includes the current coded frame counter, an indicator of the first coded line in the current coded frame that has not undergone the motion estimation operation, and the coded state table of the current coded frame. The system control table is shared by multiple cores to realize synchronization of the cores;

(4) Ping-Pang structure storage areas are set for the current coding block and the reference data block in the local memory, the current coding is Pang, and the data is prefetched to Ping.

2. The data prefetching method for multi-core DSP motion estimation according to claim 1, characterized in that step 2 specifically comprises the following steps:

(2.1) If the current coding block belongs to a P frame, perform prediction and motion search operations on the coding block;

(2.2) DMA is started to prefetch the data of the next coded block of the current coded block to the internal memory of the core from the external memory, and the source address, destination address, and data block size are calculated by formula (1);

$\begin{cases} DMA\_src = fenc\_base + corei\_line \times line\_block \times block\_size + (corei\_block + 1) \times block\_width \\ DMA\_dst = corei\_fenc\_base \\ data\_size = block\_size \end{cases} \qquad (1)$

Where block_width is the width of the coding block, block_hight is the height of the coding block, block_size is the size of the coding block, block_size=block_width×block_hight, DMA_src is the source address of the i-th core DMA transfer data, DMA_dst is the destination address of the i-th core DMA transfer data , data_size is the size of the transported data, corei_fenc_base is the starting address of the encoding block data of the i-th core stored internally, fenc_base is the starting address of the current encoding frame, corei_line is the current encoding line of the i-th core, and line_block is the encoding line Size, corei_block is the current coding block of the i-th core;

(2.3) Start the DMA to prefetch the data of the reference block of the next coding block of the current coding block from the external memory to the internal memory of the core, that is, prefetch the data of multiple reference blocks centered on the predicted motion vector in the reference frame to For the local memory of the core, the number of prefetched reference data blocks is determined by the motion search method, and the source address, destination address, and data block size are calculated according to formula (2);

$\begin{cases} DMA\_src = fref\_base + (mvp\_x \times line\_pix - block\_width) + (mvp\_y - block\_hight) \\ DMA\_dst = corei\_ref\_base \\ data\_size = block\_size \times bref\_size \end{cases} \qquad (2)$

Among them, DMA_src is the source address of the i-th core DMA handling data, DMA_dst is the destination address of the i-th core DMA handling data, fref_base is the first address of the reference frame, line_pix is the number of pixels in a line in the reference frame, mvp_x, mvp_y are motion vectors Prediction value, after the motion search of the current coded block is completed, the motion vector prediction value can be obtained according to the motion vector prediction method, corei_ref_base is the first address of the data storage of the reference block, bref_size is the number of reference data blocks, and other parameters are the same as formula (1);

(2.4) Perform motion compensation (MC), DCT transformation, and quantization operations;

(2.5) Update the system control table by setting the coded flag of the current coding block;

(2.6) Steps (2.1), (2.2), (2.3), (2.4), (2.5) are repeated until the motion estimation operation of each coding block in the current coding line ends.

3. The data prefetching method for multi-core DSP motion estimation according to claim 1, characterized in that step 3 specifically comprises the following steps:

(3.1) If the current coding block belongs to a B frame, perform prediction and motion search operations on the coding block;

(3.2) DMA is started to prefetch the data of the next coded block of the current coded block to the local memory of the core from the external memory, and the source address, destination address, and data block size are calculated by formula (1);

(3.3) Start DMA to prefetch the data of the forward reference block of the next data block of the current coding block to the internal memory of the core, that is, prefetch the data of multiple reference blocks centered on the predicted motion vector in the reference frame to the local memory of the core; the number of prefetched reference blocks is determined by the motion search method, and the source address, destination address, and data block size are calculated according to formula (2);

(3.4) Start DMA to prefetch the data of the backward reference block of the next data block of the current coding block to the internal memory of the core, that is, prefetch the data of multiple reference blocks centered on the predicted motion vector in the reference frame to the local storage area of the core; the number of prefetched reference blocks is determined by the motion search method, and the source address, destination address, and data block size are calculated according to formula (2);

(3.5) Perform motion compensation (MC), DCT transformation, and quantization operations;

(3.6) Update the system control table by setting the coded flag of the current coding block;

(3.7) Steps (3.1), (3.2), (3.3), (3.4), (3.5), (3.6) are repeated until the motion estimation operation of each coding block in the current coding line ends.