CN102790884B

CN102790884B - Search method based on hierarchical motion estimation and system implementing it

Info

Publication number: CN102790884B
Application number: CN201210264933.2A
Authority: CN
Inventors: 高志勇; 邓刚; 张小云; 陈立
Original assignee: Shanghai Jiao Tong University
Current assignee: Shanghai Jiao Tong University
Priority date: 2012-07-27
Filing date: 2012-07-27
Publication date: 2016-05-04
Anticipated expiration: 2032-07-27
Also published as: CN102790884A

Abstract

The invention discloses a search method based on hierarchical motion estimation and a system that implements it. The method downsamples the current frame and the reference frame once at a fixed ratio; when motion estimation is performed on the downsampled layer, the step between candidate motion vectors is set to 2 integer-pixel units of the downsampled image. The system comprises a luma reference pixel cache module, a hierarchically multiplexed PE array module, and a control module for PE array register data updates. Because only a single 4:1 downsampling is applied, a 16x16 block of the original image layer becomes an 8x8 block, which retains more distinctive features for block matching; setting the search step of the downsampled layer to 2 reduces the amount of computation, so that the real-time requirement of the video encoder can be met.

Description

A Search Method Based on Hierarchical Motion Estimation and Its Implementation System

Technical Field

The invention belongs to the field of hardware implementation of multimedia video encoders. Specifically, it relates to a search method based on hierarchical motion estimation, suitable for FPGA and ASIC implementation, and to a system implementing the method.

Background

In video encoders, motion-compensated inter-frame predictive coding is the principal compression technique, and the search window size, search strategy, and matching criterion directly affect encoder performance. Integer-pixel motion estimation is one of the most complex modules in a video encoder; to meet the real-time requirement of high-definition video encoding, it can be accelerated in hardware. A hardware implementation of a video encoder has to strike a good balance between complexity and performance.

Block-matching search methods mainly comprise full search, fast search, and hierarchical search. Full search examines the search area point by point: for every candidate motion vector it evaluates a criterion function such as the SAD (Sum of Absolute Differences), i.e. the sum of absolute errors between the original and predicted values, and selects the block with the smallest SAD as the best match. When the image resolution is high and the motion is complex, the search must cover a large window and the amount of computation is considerable, so parallel processing is required for real-time operation.

To reduce the number of search positions, a variety of fast search methods have been proposed, such as two-dimensional logarithmic search, three-step search, conjugate direction search, orthogonal search, and diamond search. What these methods have in common is that they treat the direction in which the criterion function (SAD) decreases as the direction of minimum distortion and assume that the criterion function increases monotonically away from that direction; in other words, they regard it as a unimodal function of the motion vector with a unique minimum over the whole search area. Fast search methods eliminate most of the computation and markedly increase search speed, but rate-distortion performance degrades to some extent. They are also ill-suited to hardware implementation, mainly because their data flow and control are irregular and cannot make effective use of data stored on chip.
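As an illustration of the full search described above, the following Python sketch evaluates the SAD for every candidate motion vector in a square window and keeps the smallest. The function names, the list-of-lists frame layout, the square window, and the rule of skipping candidates that fall outside the reference frame are assumptions made for this example, not details taken from the patent.

```python
def sad(cur, ref, bx, by, rx, ry, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (bx, by) and the n x n block of `ref` at (rx, ry)."""
    return sum(
        abs(cur[by + j][bx + i] - ref[ry + j][rx + i])
        for j in range(n)
        for i in range(n)
    )


def full_search(cur, ref, bx, by, n, radius, step=1):
    """Evaluate every candidate (dx, dy) in [-radius, radius) with the given
    step and return (best_sad, best_mv). Candidates whose reference block
    would fall outside `ref` are skipped (an assumption of this sketch)."""
    h, w = len(ref), len(ref[0])
    best = None
    for dy in range(-radius, radius, step):
        for dx in range(-radius, radius, step):
            rx, ry = bx + dx, by + dy
            if rx < 0 or ry < 0 or rx + n > w or ry + n > h:
                continue
            cost = sad(cur, ref, bx, by, rx, ry, n)
            if best is None or cost < best[0]:
                best = (cost, (dx, dy))
    return best
```

Passing `step=2` to the same routine visits only every second position in each direction, which is the kind of candidate thinning the method applies on its downsampled layer.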

The hierarchical search method was proposed to address this: while reducing computation, it achieves accuracy and performance close to those of full search and yields motion vectors close to the true displacement. In hierarchical search, a low-resolution representation of the image sequence is first obtained by low-pass filtering and subsampling the original images, and a full search is performed on the low-resolution images. Because of the lower resolution, the number of search positions drops several-fold, and so do the computing resources needed for a single SAD calculation; this full search yields a best motion vector. That motion vector is used as the starting point of a fine search in the original image. The search window for this step can be comparatively small, so the number of search positions is correspondingly reduced, and the fine search finally produces the estimate of the motion vector. The advantage of hierarchical search is that its number of search positions lies between those of full search and fast search, and the search at each layer can be treated as a full search, which gives regular data flow and control flow and favors hardware implementation.

Existing hardware based on hierarchical motion estimation searches after two downsampling stages, which raises the search speed. However, after two successive 4:1 downsamplings a 16x16 block of the original image layer becomes a 4x4 block; such a block has too few distinctive features, so block matching easily locks onto a false best match, which degrades search performance.

Summary of the Invention

Existing hardware video encoders, whether they use full search or hierarchical search, trade hardware area for speed to some degree; if the search range is enlarged, hardware consumption becomes very large, which is unfavorable for FPGA or ASIC circuits. The object of the invention is to provide a hierarchical search method with performance close to that of full search, together with a corresponding hardware implementation system, that offers high search speed and good coding performance and meets the real-time encoding requirement of high-definition encoders.

To achieve the above object, the invention adopts the following technical solution:

In the search method based on hierarchical motion estimation according to the invention, the current frame and the reference frame are downsampled once at a fixed ratio, and when motion estimation is performed on the downsampled layer, the step between candidate motion vectors is set to 2 integer-pixel units of the downsampled image.

Further, the downsampling yields 4 downsampled images; multiple PE arrays perform a 4-way parallel search within each downsampled image, while the other 3 downsampled images are searched in parallel at the same time.

Further, in the method the on-chip luma pixel cache uses classified pixel storage to support the search on the downsampled layer: the 4 pixel types produced by downsampling are stored in interleaved fashion and are further separated by odd/even row and by odd/even macroblock column, giving 16 pixel types stored in interleaved fashion.

Further, the PE array reuses the data it reads from the luma reference cache: each read fetches 16-byte-wide data of two pixel types, which is used to update the data of the 4-way PE arrays.

The idea of the hierarchical search is as follows. The current frame and the reference frame are first downsampled at a fixed ratio; layer L0 is the original image layer and layer L1 is the image obtained by 4:1 downsampling of L0. The search is carried out first on layer L1 with a candidate MV step of 2, and the SAD is computed at the 4:1 downsampled precision, i.e. 64 pixels take part in each SAD calculation. Because only a single 4:1 downsampling is applied, a 16x16 block of the original image layer becomes an 8x8 block, which retains more distinctive features for block matching; setting the search step of the downsampled layer to 2 reduces computation, so the real-time requirement of the video encoder is met.

In the invention, downsampling yields 4 downsampled images, and the 4 regions of the search range are searched in parallel by using a different pixel type for each region. In addition, the search of each region on layer L1 is itself performed 4 ways in parallel to raise the search speed. Variable-block-size motion estimation based on rate-distortion optimization is used: the block-matching cost function is the weighted sum of the SAD between the current block and the reference block and the number of bits consumed in coding the motion vector. On layer L1 the block size is 8x8, and the best motion vector is obtained by minimizing the matching cost. For the search on layer L0, the best L1 motion vector and the motion vector produced by the encoder's motion prediction module serve as starting points, and a full search within a small window yields the matching cost of each variable-size block. The matching costs of the various modes are then used by the encoder for sub-pixel motion estimation and prediction mode selection.
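The matching cost described above, a weighted sum of SAD and motion vector bits, might be sketched in Python as follows. The patent does not say how the bit cost is estimated; coding the difference from a predicted motion vector with signed Exp-Golomb codes (as in H.264) and the weight name `lam` are assumptions made purely for illustration.

```python
def exp_golomb_bits(v):
    """Length in bits of the signed Exp-Golomb code for the integer v
    (assumed bit-cost model, not specified by the patent)."""
    code_num = 2 * v - 1 if v > 0 else -2 * v
    return 2 * (code_num + 1).bit_length() - 1


def matching_cost(sad_value, mv, pred_mv, lam, mv_bits=exp_golomb_bits):
    """Weighted sum of the SAD and the estimated bits for the MV difference.

    sad_value: SAD between current block and candidate reference block
    mv, pred_mv: (x, y) candidate and predicted motion vectors
    lam: weight applied to the bit cost (hypothetical parameter name)
    """
    dx = mv[0] - pred_mv[0]
    dy = mv[1] - pred_mv[1]
    return sad_value + lam * (mv_bits(dx) + mv_bits(dy))
```

With such a cost, two candidates with equal SAD are separated by how cheaply their motion vectors can be coded, which is what makes the criterion rate-distortion aware rather than purely distortion based.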

The system implementing the search method based on hierarchical motion estimation according to the invention comprises:

a luma reference pixel cache module, which uses Zigzag scan-order encoding and interleaved on-chip storage: the 4 pixel types produced by downsampling are stored in interleaved fashion and are further separated by odd/even row and odd/even macroblock column, giving 16 pixel types stored in interleaved fashion;

a hierarchically multiplexed PE array module: downsampling yields 4 downsampled images, multiple PE arrays perform a 4-way parallel search within each downsampled image, and the other 3 downsampled images are searched in parallel at the same time; the PE arrays are able to obtain valid data from the luma reference pixel cache module on every clock and compute SAD values;

a control module for PE array register data updates, which controls the reading of reference pixels from the luma reference pixel cache module into the PE array registers; each read fetches 16-byte-wide data of two pixel types for updating the 4-way PE arrays, and the current luma pixel registers of the PE array are updated quickly by shifting the individual current-pixel registers.

The luma reference pixel cache module is the core of the hardware implementation. The luma cache stores pixels in interleaved fashion so as to serve the PE array's data reads and suit a hardware implementation. It must first of all make it easy for the PE array to read and update data during computation; it must also make it easy to store off-chip data in the cache's storage format. The invention uses Zigzag scan-order encoding together with interleaved on-chip storage. When the video encoder reads the original video sequence from the off-chip DDR SDRAM, macroblocks are encoded not in the conventional raster order but in Zigzag macroblock order, which can reduce the external memory access bandwidth by about 2/3. The reference pixels are divided into 4 pixel classes (square, triangle, circle, pentagon) and stored separately by odd/even row and odd/even macroblock column, so that 16 on-chip RAMs hold the luma reference pixels; each RAM is 64 bits wide, and each RAM address holds 8 pixel values.

The hierarchically multiplexed PE array module is the computation module for motion estimation. The array layout is designed to match the motion estimation search method so that every PE unit is fully used. Assuming that one PE unit performs only the absolute-difference operation on two 8-bit pixel values, the invention uses 2*1024 PE units in total, 1024 each for forward and backward prediction. A 16x16 matching block needs 256 PEs, so 4 SADs of 16x16 blocks, or 16 SADs of 8x8 blocks, can be computed in parallel. The PE array reads reference pixels from the luma reference pixel cache module into registers and then performs the computation; as far as possible, the PE array should receive valid data from the luma cache and compute SAD values on every clock, so that the clock-cycle budget of the pipeline is met.

Within the hierarchically multiplexed PE array module the PE arrays are organized in three levels: PEA_32pels, PEA_64pels, and PEA_256pels array units. The PEA_32pels unit is the basic array unit; it consists of 32 PE units and computes the SAD of a 4x8 block. The PEA_64pels unit contains two PEA_32pels units, i.e. 64 PE units, and computes the SAD of an 8x8 block. The PEA_256pels unit contains four PEA_64pels units, i.e. 256 PE units, and computes the SADs of a 16x16 block and of its variable-size sub-blocks.
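The three-level organization lends itself to building larger SADs by adding smaller ones. The Python sketch below shows the PE counts of the three levels and one way the four 8x8 SADs of a PEA_256pels unit could be combined into SADs for larger partitions. The partition names (16x8, 8x16) follow the usual width-by-height convention of H.264 and are an assumption; the text only speaks of "variable-size" blocks.

```python
# PE counts per array level, as stated in the description.
PE_PER_PEA32 = 32
PE_PER_PEA64 = 2 * PE_PER_PEA32    # two PEA_32pels units
PE_PER_PEA256 = 4 * PE_PER_PEA64   # four PEA_64pels units


def variable_block_sads(sad8):
    """Combine the SADs of the four 8x8 quadrants of a 16x16 block.

    sad8 = (tl, tr, bl, br): top-left, top-right, bottom-left, bottom-right.
    Returns SADs for the assumed partition shapes (width x height).
    """
    tl, tr, bl, br = sad8
    return {
        "8x8": [tl, tr, bl, br],
        "16x8": [tl + tr, bl + br],   # top half, bottom half
        "8x16": [tl + bl, tr + br],   # left half, right half
        "16x16": tl + tr + bl + br,
    }
```

With 1024 PE units per prediction direction, `1024 // PE_PER_PEA256` gives the 4 parallel 16x16 SADs and `1024 // PE_PER_PEA64` the 16 parallel 8x8 SADs mentioned in the description.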

The hardware system of the invention addresses the data organization of the on-chip luma pixel cache, using classified pixel storage to support the search on the downsampled layer; it provides data flow and control flow of moderate complexity, which keeps the hardware simple; and it adds PE arrays to achieve high-throughput computation without requiring additional storage.

By adopting the above technical solution, the invention has the following advantages:

1. Few downsampling layers and high search accuracy;

2. The search step for candidate motion vectors on the downsampled layer is set to 2, which raises the search speed;

3. All pixel data readable from the on-chip luma cache at a given moment is fully used, with multiple PE arrays processing in parallel;

4. The on-chip RAMs and registers are designed for data reuse, which lowers the bandwidth requirement;

5. Data exchange between any one PE array and different RAMs is reduced, which lowers hardware complexity.

Brief Description of the Drawings

Fig. 1 is a schematic diagram of the overall hardware architecture of the hierarchical search;

Fig. 2 is a schematic diagram of the pixel relationship between the two layers of the hierarchical search;

Fig. 3 is a schematic diagram of the principle of two-layer motion estimation;

Fig. 4 is a schematic diagram of the Zigzag macroblock encoding order;

Fig. 5 shows the storage structure of the on-chip luma cache;

Fig. 6 shows the structure of the PEA unit;

Fig. 7 shows the composition of the PEA_64pels and PEA_256pels units;

Fig. 8 shows how current pixels are stored in the PE array registers;

Fig. 9 is a schematic diagram of the search of one of the downsampled images on the downsampled layer.

Detailed Description

An embodiment of the invention is described in detail below. The embodiment is based on the technical solution of the invention and gives a detailed implementation and a specific operating procedure, but the scope of protection of the invention is not limited to the embodiment described.

This embodiment provides a search method based on hierarchical motion estimation that is suitable for FPGA and ASIC implementation. The current frame and the reference frame are downsampled at a fixed ratio; layer L0 is the original image layer and layer L1 is the image obtained by 4:1 downsampling of L0. The example assumes an L0 search range of [-160, 160) x [-104, 104), so the L1 search range becomes [-80, 80) x [-52, 52). The search is carried out first on layer L1 with a candidate MV step of 2; a full search of [-80, 80) x [-52, 52) then covers 160x104/4 = 4160 candidate MVs. The SAD is computed at the 4:1 downsampled precision, i.e. 64 pixels take part in each SAD calculation. The pixel relationship between the two layers is shown in Fig. 2.
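The numbers in this example can be reproduced with a short calculation. The Python sketch below derives the L1 range and the candidate count from the L0 range; the function name and the dictionary keys are introduced here only for illustration.

```python
def l1_search_stats(l0_x=(-160, 160), l0_y=(-104, 104), step=2, block=8):
    """Derive the L1 search range and candidate count from the L0 range.

    The 4:1 downsampling halves each dimension, so the half-open ranges
    are halved; candidates are then visited with the given step in x and y.
    """
    l1_x = (l0_x[0] // 2, l0_x[1] // 2)
    l1_y = (l0_y[0] // 2, l0_y[1] // 2)
    nx = (l1_x[1] - l1_x[0]) // step   # candidate positions along x
    ny = (l1_y[1] - l1_y[0]) // step   # candidate positions along y
    return {
        "l1_x": l1_x,
        "l1_y": l1_y,
        "candidates": nx * ny,
        "pixels_per_sad": block * block,
    }
```

For the default arguments this gives an L1 range of [-80, 80) x [-52, 52), 80 x 52 = 4160 candidates, and 64 pixels per SAD, matching the figures quoted in the embodiment.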

In the method of this embodiment, downsampling yields 4 downsampled images; multiple PE arrays perform a 4-way parallel search within each downsampled image, while the other 3 downsampled images are searched in parallel at the same time.

In Fig. 2, (b) shows the four downsampled images (square, triangle, circle, pentagon) obtained from (a) by 2:1 downsampling in both the horizontal and the vertical direction. The 4 regions of the search range are searched in parallel by using a different pixel type for each region. In addition, the search of each region on layer L1 is itself performed 4 ways in parallel to raise the search speed. As shown in Fig. 3, variable-block-size motion estimation based on rate-distortion optimization is used: the block-matching cost function is the weighted sum of the SAD between the current block and the reference block and the number of bits consumed in coding the motion vector. On layer L1 the block size is 8x8, and the best motion vector is obtained by minimizing the matching cost. For the search on layer L0, the best L1 motion vector and the motion vector produced by the encoder's motion prediction module serve as starting points, and a full search within a small window yields the matching cost of each variable-size block. The matching costs of the various modes are then used by the encoder for sub-pixel motion estimation and prediction mode selection.
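The 2:1 horizontal and vertical downsampling that produces four images can be pictured as splitting the frame into its four sampling phases. The Python sketch below does this by plain decimation with no filtering. Which phase corresponds to which symbol (square, triangle, circle, pentagon) is defined by Fig. 2 and is not reproduced here, so the ordering of the returned list is an assumption.

```python
def split_phases(frame):
    """Split a frame (list of rows) into four half-resolution images.

    Image k collects the pixels at (x % 2, y % 2) == (px, py), in the
    assumed order (0,0), (1,0), (0,1), (1,1). No low-pass filter is applied.
    """
    return [
        [row[px::2] for row in frame[py::2]]
        for py in (0, 1)
        for px in (0, 1)
    ]
```

A 16x16 macroblock split this way yields four 8x8 blocks, one per downsampled image, which is the block size used for matching on layer L1.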

As shown in Fig. 1, the hardware system that implements the above search method is made up of several functional modules, chiefly the luma reference pixel cache module, the hierarchically multiplexed PE array module, and the control module for PE array register data updates.

The on-chip cache for luma reference pixels is shown in Fig. 5, where each circle represents one reference pixel and the character inside it denotes the pixel type. The reference image is divided into 16 classes of reference pixels, 0 to f: classes 0 to 7 are reference pixels of even macroblock columns and classes 8 to f are those of odd macroblock columns. Each class has its own RAM, so the luma reference pixel cache consists of 16 RAMs, each 64 bits wide. This embodiment uses the Zigzag macroblock scan order shown in Fig. 4; each data update reads only one macroblock column, so only the 8 RAMs of the even (or odd) macroblock columns need to be updated at a time.
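To make the 16-way classification concrete, the Python sketch below assigns a class number to each pixel coordinate. The real layout is the one drawn in Fig. 5; the bit assignment used here is hypothetical and was chosen only so that it agrees with what the text states: classes 0 to 7 lie in even macroblock columns, classes 8 to f in odd ones, and classes i, i+4, i+8, i+12 belong to the same downsampled image i. In particular, taking the row parity from the downsampled row index is an assumption.

```python
def pixel_class(x, y, mb_size=16):
    """Return a class number 0..15 for the luma pixel at (x, y).

    Hypothetical mapping (the authoritative one is in Fig. 5):
      bits 0-1: sampling phase, i.e. which downsampled image (assumed order)
      bit 2:    parity of the row within the downsampled image (assumed)
      bit 3:    parity of the macroblock column
    """
    phase = (x % 2) + 2 * (y % 2)
    row_parity = (y // 2) % 2
    mb_col_parity = (x // mb_size) % 2
    return phase + 4 * row_parity + 8 * mb_col_parity
```

Under this mapping every pixel of a given class goes to its own RAM, so pixels of one downsampled image in one macroblock column parity and one row parity can be fetched 8 at a time from a single 64-bit word.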

The structure of PEA_32pels is shown in Fig. 6. This basic array unit computes the SAD of a 4x8 block and at the same time produces the SADs of four 2x4 blocks (the upper-left, upper-right, lower-left, and lower-right parts of the 4x8 block), which are used for the SADs of variable-size blocks. A PEA_32pels unit consists of 5 reference-pixel shift registers, each 8 bytes wide; 4 shift registers, each 8 bytes wide, for the pixels to be encoded; 32 basic PE units; and 5 accumulators. As shown in Fig. 7, 2 PEA_32pels units form a PEA_64pels unit, and 4 PEA_64pels units form a PEA_256pels unit. Fig. 8 shows how the pixels of the current macroblock are stored in the registers: during block-matching motion estimation on layer L1 or L0, the current-macroblock pixel registers of a PEA_32pels unit always hold pixels of a single one of the current pixel classes 0 to 7.
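A behavioral model of one PEA_32pels unit can be written in a few lines of Python. It treats the 4x8 block as 4 rows of 8 pixels (matching the four 8-byte current-pixel registers), lets each of the 32 PEs produce one absolute difference, and accumulates four quadrant SADs plus the total, which accounts for the 5 accumulators. The row-by-column orientation and the quadrant ordering are assumptions of this sketch.

```python
def pea_32pels(cur, ref):
    """Model of one PEA_32pels unit.

    cur, ref: 4 rows x 8 columns of 8-bit pixel values.
    Returns (quadrants, total), where quadrants holds the SADs of the four
    2x4 sub-blocks in the order upper-left, upper-right, lower-left,
    lower-right, and total is the SAD of the whole 4x8 block.
    """
    quadrants = [0, 0, 0, 0]
    for r in range(4):
        for c in range(8):
            q = 2 * (r // 2) + (c // 4)       # which 2x4 quadrant
            quadrants[q] += abs(cur[r][c] - ref[r][c])   # one PE per pixel
    return quadrants, sum(quadrants)
```

Two such units side by side cover the 64 pixels of an 8x8 block, so the SAD of a PEA_64pels unit is simply the sum of the two totals.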

The motion estimation steps on layers L1 and L0 are described in detail below. The description covers only 1024 basic PE units and the processing of one frame. For a P frame, the other 1024 PE units process another reference frame in parallel; for a B frame, the other 1024 basic PE units process the reference frame in the opposite direction in parallel.

On layer L1, motion estimation is performed in parallel on the 4 downsampled images obtained by 4:1 downsampling of the original image, and within each downsampled image a 4-way parallel motion estimation of 8x8 blocks is carried out. After downsampling, the four pixel classes i, (i+4), (i+8), (i+12) together form one downsampled image, for i = 0, 1, 2, 3. In this example a two-dimensional coordinate system is used whose origin is the top-left pixel of the current macroblock to be encoded, with the positive x axis pointing right and the positive y axis pointing down. Let the candidate motion vector be (x, y). When x and y are both negative, motion estimation uses the downsampled image formed by i=0; when x is non-negative and y is negative, the image formed by i=1; when x is negative and y is non-negative, the image formed by i=2; and when x and y are both non-negative, the image formed by i=3. Only one downsampled image, i=0, is analyzed below; PEA1_64pels, PEA2_64pels, PEA3_64pels, and PEA4_64pels are the units used for its 4-way parallel motion estimation. As shown in Fig. 9, the four units process horizontally adjacent candidates in parallel, and the candidate motion vectors are visited in a serpentine scan.
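The sign rule that assigns each candidate motion vector to one of the four downsampled images is compact enough to state as code. The Python sketch below follows the rule exactly as given in the text; the function names are introduced here for illustration.

```python
def select_image(x, y):
    """Index i of the downsampled image used for candidate MV (x, y).

    x < 0,  y < 0  -> 0
    x >= 0, y < 0  -> 1
    x < 0,  y >= 0 -> 2
    x >= 0, y >= 0 -> 3
    """
    return (0 if x < 0 else 1) + (0 if y < 0 else 2)


def image_pixel_classes(i):
    """Pixel classes that together form downsampled image i."""
    return [i, i + 4, i + 8, i + 12]
```

For example, the candidate (-160, -80) used in the walk-through that follows falls in image 0, whose pixels come from classes 0, 4, 8, and 12.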

At T = cycle 1, PEA1_64pels computes the SAD for candidate motion vector (-160, -80); PEA2_64pels is offset from PEA1_64pels by 4 integer-pixel units in the positive x direction and computes the SAD for (-156, -80); likewise PEA3_64pels and PEA4_64pels compute the SADs for (-152, -80) and (-148, -80). In the next cycle, all four PEA_64pels units move their candidate motion vectors by 4 integer-pixel units in the positive y direction. PEA1_64pels consists of two PEA units, PEA1_64pels_PEA0 and PEA1_64pels_PEA1. In the next cycle PEA1_64pels computes (-160, -76); compared with the reference pixel data used for (-160, -80), only 8 bytes need to be updated. The reference pixel registers of a PEA hold 5 x 8 bytes: four 8-byte registers take part in the computation while the fifth is loaded with the 8 bytes to be updated, so the PE array can compute immediately at T = cycle 2. The current pixel registers need no update. PEA2_64pels, PEA3_64pels, and PEA4_64pels operate in the same way.
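The role of the fifth reference register can be illustrated with a small sliding-window model in Python. Four rows are active in the SAD computation while a spare row is preloaded; advancing by one step discards the oldest active row and promotes the spare. The class and method names are hypothetical, and the model covers only the vertical step of a single PEA unit, not the actual register-transfer design.

```python
from collections import deque


class RefWindow:
    """Four active 8-byte reference rows plus one spare row being preloaded."""

    def __init__(self, rows):
        if len(rows) != 4:
            raise ValueError("expected 4 active rows")
        self.active = deque(rows, maxlen=4)   # rows used by the 32 PEs
        self.spare = None                     # fifth register

    def preload(self, row):
        """Load the next 8-byte row while the current SAD is being computed."""
        self.spare = row

    def advance(self):
        """Move one step in the scan direction: the oldest row drops out and
        the preloaded row becomes active, with no idle cycle."""
        if self.spare is None:
            raise RuntimeError("no row preloaded; the array would have to stall")
        self.active.append(self.spare)
        self.spare = None
```

Because only one 8-byte row changes per vertical step, a single spare register is enough to keep the array busy on every clock along a column of candidates.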

At T = cycle 2, PEA1_64pels computes candidate motion vector (-160, -76), PEA2_64pels computes (-156, -76), PEA3_64pels computes (-152, -76), and PEA4_64pels computes (-148, -76). 8 bytes of the reference pixel registers are updated; the current pixel registers need no update.

Proceeding in the same way and continuing the scan in the positive y direction, at T = cycle 19 PEA1_64pels, PEA2_64pels, PEA3_64pels, and PEA4_64pels compute the SADs for candidate motion vectors (-160, -4), (-156, -4), (-152, -4), and (-148, -4) respectively.

At T = cycle 20, the motion vectors reached by continuing in the positive y direction, (-160, 0), (-156, 0), (-152, 0), and (-148, 0), fall within the candidate range of another downsampled image. The search therefore moves 16 integer-pixel units in the positive x direction; the data cannot be reused, and the PE array must idle until the required data has been read into the reference pixel registers. Updating four 8-byte-wide registers takes 4 cycles; subtracting the update of the fifth 8-byte register already performed in the previous cycle, the wait is 3 cycles in total. The current pixel registers need no update.

At T = cycle 23, PEA1_64pels, PEA2_64pels, PEA3_64pels, and PEA4_64pels compute the SADs for candidate motion vectors (-144, -4), (-140, -4), (-136, -4), and (-132, -4) respectively. In the next cycle each PEA_64pels array moves its candidate motion vector by 4 integer-pixel units in the negative y direction relative to T = cycle 23, so 8 bytes of reference pixel data are updated during this cycle; the current pixel registers need no update.

At T = cycle 24, PEA1_64pels, PEA2_64pels, PEA3_64pels, and PEA4_64pels compute the SAD values of candidate motion vectors (-144, -8), (-140, -8), (-136, -8), and (-132, -8), respectively. 8 bytes of the reference pixel registers are updated at the same time; the current pixel registers do not need updating.

The same operation continues, scanning in the negative y direction. At T = cycle 41, PEA1_64pels, PEA2_64pels, PEA3_64pels, and PEA4_64pels compute the SAD values of candidate motion vectors (-144, -80), (-140, -80), (-136, -80), and (-132, -80), respectively. The next step must compute the SAD values of candidate motion vectors shifted 16 integer-pixel unit vectors in the positive x direction. The register data cannot be reused effectively, so the PE array idles until the required data has been read into the reference pixel registers. The current pixel registers do not need updating. As before, the array waits 3 cycles.

At T = cycle 45, PEA1_64pels, PEA2_64pels, PEA3_64pels, and PEA4_64pels compute the SAD values of candidate motion vectors (-128, -8), (-124, -8), (-120, -8), and (-116, -8), respectively. 8 bytes of data are updated at the same time; the current pixel registers do not need updating.

The same operation continues until T = cycle 217, when PEA1_64pels, PEA2_64pels, PEA3_64pels, and PEA4_64pels compute the SAD values of candidate motion vectors (-16, 4), (-12, 4), (-8, 4), and (-4, 4), respectively. This completes the block-matching motion estimation for this downsampled image.

The same operation is performed on the other 3 downsampled images. Selecting the minimum block matching cost in each PE array yields 16 best candidate motion vectors. The block matching costs of these 16 candidates are then compared, subject to one constraint: the L0 search region that uses the vector as its starting point must not overlap the L0 search region that uses the predicted motion vector as its starting point. Among the vectors that satisfy this constraint, the one with the smallest matching cost is selected as the final candidate motion vector of the L1 layer.
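
The selection rule above can be sketched as follows. This is only an illustrative model under stated assumptions, not the patented circuit: the names `windows_overlap` and `select_l1_candidate` are hypothetical, motion vectors are assumed to be expressed in original-layer integer-pixel units, the L0 window is assumed to be [-8, 8) x [-6, 6) around each starting point, and returning `None` when every candidate overlaps is an assumption, since the text does not describe that case.

```python
from typing import Iterable, Optional, Tuple

MV = Tuple[int, int]

# Assumed half-extents of the L0 search window [-8, 8) x [-6, 6).
L0_HALF_W = 8
L0_HALF_H = 6


def windows_overlap(a: MV, b: MV) -> bool:
    """Return True if the L0 windows centred on a and b share any position."""
    return (abs(a[0] - b[0]) < 2 * L0_HALF_W
            and abs(a[1] - b[1]) < 2 * L0_HALF_H)


def select_l1_candidate(candidates: Iterable[Tuple[MV, int]],
                        predicted_mv: MV) -> Optional[MV]:
    """Pick the lowest-cost candidate whose L0 window does not overlap
    the L0 window of the predicted motion vector.

    candidates: (motion_vector, block_matching_cost) pairs, e.g. the 16
    per-array minima. Returns None if no candidate qualifies (assumption).
    """
    best: Optional[Tuple[MV, int]] = None
    for mv, cost in candidates:
        if windows_overlap(mv, predicted_mv):
            continue  # constraint violated, skip this candidate
        if best is None or cost < best[1]:
            best = (mv, cost)
    return None if best is None else best[0]


# Hypothetical costs, for illustration only.
example = [((-160, -76), 100), ((4, 2), 50), ((32, -20), 70)]
chosen = select_l1_candidate(example, predicted_mv=(0, 0))
```

In this example the cheapest candidate (4, 2) is rejected because its window overlaps the window around the predicted vector, so the next cheapest non-overlapping candidate is chosen.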

This completes the L1-layer motion estimation. The L0-layer motion estimation is described below.

The L0 layer uses the best motion vector found by the L1 search and the predicted motion vector as two separate search starting points, each denoted (cenx, ceny), and performs motion estimation within a [-8, 8) x [-6, 6) search window to find the best motion vector of the integer-pixel motion estimation. At the L0 layer, computing each candidate motion vector requires 256 PE units, i.e. one PEA_256pels unit made up of 4 PEA_64pels units. There are 4 PEA_256pels units for each of the forward and backward directions, denoted PEA1_256pels, PEA2_256pels, PEA3_256pels, and PEA4_256pels. L0 motion estimation searches four paths in parallel, using the same serpentine scan as the L1 layer.

First, because of how the data is read, the search range must be extended to [-8 - (cenx mod 4), 12 + (cenx mod 4)) or [-12 - (cenx mod 4), 8 + (cenx mod 4)). There is no substantive difference between the two, and either can be chosen according to the actual situation. To simplify the following description, assume cenx = 16 and ceny = 16, so that the extended search window is [-8, 12).
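
As a small worked check of the two formulas above, the extended range can be computed as below. The function name `extended_window` and the `variant` parameter are hypothetical, and treating `mod` as Python's `%` (which returns a non-negative remainder even for negative `cenx`) is an assumption, since the text does not say how negative starting points are handled.

```python
from typing import Tuple


def extended_window(cenx: int, variant: int = 0) -> Tuple[int, int]:
    """Return the half-open extended search range [lo, hi).

    variant 0: [-8 - (cenx mod 4), 12 + (cenx mod 4))
    variant 1: [-12 - (cenx mod 4), 8 + (cenx mod 4))
    """
    r = cenx % 4  # assumption: non-negative remainder
    if variant == 0:
        return (-8 - r, 12 + r)
    return (-12 - r, 8 + r)


# The case used in the description: cenx = 16 gives [-8, 12).
window_for_16 = extended_window(16)
```

With cenx = 16 the remainder is 0, which reproduces the [-8, 12) window assumed in the walkthrough that follows.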

At T = cycle 1, PEA1_256pels, PEA2_256pels, PEA3_256pels, and PEA4_256pels compute the SAD values of candidate motion vectors (-8, -6), (-7, -6), (-6, -6), and (-5, -6), respectively. PEA1_256pels consists of four PEA_64pels units, denoted aPEA1_64pels, aPEA2_64pels, aPEA3_64pels, and aPEA4_64pels, which are assumed to store current macroblock pixel types (0, 4), (1, 5), (2, 6), and (3, 7), respectively. The upper-left pixel of the reference macroblock for candidate motion vector (-8, -6) is at (8, 10), and the pixel type at that position is 3. aPEA1_64pels, aPEA2_64pels, aPEA3_64pels, and aPEA4_64pels each compute the SAD of the 8x8 downsampled block whose upper-left corner is (8, 10), (9, 10), (8, 11), and (9, 11), respectively; accumulating these gives the SADs of the L0-layer variable-size blocks. In the same cycle, each PE array must update the pixels in its registers. Suppose aPEA1_64pels consists of aPEA1_32pels and aPEA2_32pels, which store current macroblock pixel types (0, 4), with corresponding reference pixel register types (3 and b, 7 and f). These two PEA_32pels arrays update the current macroblock pixel types to (2, 6), with corresponding reference pixel types (3 and b, 7 and f), which requires updating 8-byte-wide data. The other 3 PEA_64pels arrays of PEA1_256pels and the PEA_64pels arrays of the other 3 PEA_256pels units operate in the same way.
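
The accumulation described above relies on the fact that a 16x16 SAD equals the sum of the SADs of its four 8x8 downsampled phases. The sketch below illustrates only that arithmetic in software; it does not model the registers or pixel types, and the helper names `sad`, `phase`, and `sad_16x16_from_phases` are hypothetical. The pixel data is random and purely illustrative.

```python
import random


def sad(cur, ref):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(cur, ref)
               for c, r in zip(cur_row, ref_row))


def phase(block, dx, dy):
    """8x8 sub-block holding every second pixel, starting at offset (dx, dy)."""
    return [row[dx::2] for row in block[dy::2]]


def sad_16x16_from_phases(cur, ref):
    """Compute a 16x16 SAD as the sum of four 8x8 phase SADs.

    The phase order (0,0), (1,0), (0,1), (1,1) mirrors the upper-left
    corners (8,10), (9,10), (8,11), (9,11) given in the description.
    """
    partial = [sad(phase(cur, dx, dy), phase(ref, dx, dy))
               for dy in (0, 1) for dx in (0, 1)]
    return sum(partial), partial


random.seed(0)  # illustrative data only
cur = [[random.randrange(256) for _ in range(16)] for _ in range(16)]
ref = [[random.randrange(256) for _ in range(16)] for _ in range(16)]
total, partial = sad_16x16_from_phases(cur, ref)
```

Each entry of `partial` plays the role of one PEA_64pels result, and `total` plays the role of the PEA_256pels result for one candidate motion vector.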

At T = cycle 2, PEA1_256pels, PEA2_256pels, PEA3_256pels, and PEA4_256pels compute the SAD values of candidate motion vectors (-8, -5), (-7, -5), (-6, -5), and (-5, -5), respectively. In the same cycle, each PE array must update the pixel values in its registers.

The same operation continues, scanning in the positive y direction. At T = cycle 12, PEA1_256pels, PEA2_256pels, PEA3_256pels, and PEA4_256pels compute the SAD values at candidate motion vectors (-8, 5), (-7, 5), (-6, 5), and (-5, 5), respectively. The scan must then move 4 integer-pixel unit vectors in the positive x direction. The register data cannot be reused effectively, so the PE array idles until the required data has been read into the reference pixel registers. The array waits 3 cycles in total.

At T = cycle 16, PEA1_256pels, PEA2_256pels, PEA3_256pels, and PEA4_256pels compute the SAD values of candidate motion vectors (-4, 5), (-3, 5), (-2, 5), and (-1, 5), respectively. In the same cycle, each PE array must update the pixel values in its registers. In the next clock cycle, motion estimation moves one integer-pixel unit vector in the negative y direction.

At T = cycle 17, PEA1_256pels, PEA2_256pels, PEA3_256pels, and PEA4_256pels compute the SAD values of candidate motion vectors (-4, 4), (-3, 4), (-2, 4), and (-1, 4), respectively. In the same cycle, each PE array must update the pixel values in its registers. In the next cycle, motion estimation again moves one integer-pixel unit vector in the negative y direction.

The same operation continues until T = cycle 72, when PEA1_256pels, PEA2_256pels, PEA3_256pels, and PEA4_256pels compute the SAD values of candidate motion vectors (8, 5), (9, 5), (10, 5), and (11, 5), respectively.
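
The cycle numbers in this L0 walkthrough can be reproduced with a short simulation of the serpentine scan. This is a sketch under the assumptions stated in the description (cenx = 16, extended window [-8, 12) in x, [-6, 6) in y, four parallel arrays one pixel apart, a 3-cycle stall at each column change); the function name `snake_schedule` and its parameters are hypothetical, and the model tracks only which candidate motion vectors are evaluated in which cycle, not the register updates.

```python
def snake_schedule(x_starts, y_values, stall=3, lanes=4, lane_step=1):
    """Map cycle number -> candidate motion vectors, one per parallel PE array.

    x_starts:  x offset handled by the first array in each column group.
    y_values:  y offsets in ascending order; odd column groups scan them
               in reverse, which gives the serpentine pattern.
    stall:     idle cycles inserted whenever the scan moves to a new column.
    """
    schedule = {}
    cycle = 1
    for col, x0 in enumerate(x_starts):
        if col > 0:
            cycle += stall  # PE arrays wait for new reference data
        ys = list(y_values) if col % 2 == 0 else list(reversed(y_values))
        for y in ys:
            schedule[cycle] = [(x0 + k * lane_step, y) for k in range(lanes)]
            cycle += 1
    return schedule


# L0 example from the description: x in [-8, 12) in steps of 4, y in [-6, 6).
l0 = snake_schedule(x_starts=range(-8, 12, 4), y_values=range(-6, 6))
```

Under these assumptions the schedule has five column groups of 12 cycles each, separated by four 3-cycle stalls, which accounts for the 72 cycles quoted above.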

This completes the L0-layer local fine-search motion estimation for this search starting point. Selecting the minimum block matching cost in each PE array yields 4 best candidate motion vectors; comparing the block matching costs of these 4 candidates then yields the single best motion vector for this search starting point.

In the same way, performing the same operation for the other search starting point yields the best motion vector for that starting point. Finally, the costs of the best motion vectors of the two search starting points are compared to obtain the best motion vector of the integer-pixel motion estimation. This completes the integer-pixel motion estimation.

Although the present invention has been described in detail through the preferred embodiments above, this description should not be regarded as limiting the invention. Various modifications and substitutions will be apparent to those skilled in the art after reading the foregoing. The scope of protection of the present invention should therefore be defined by the appended claims.

Claims

1. A search method based on layered motion estimation, characterized in that: the current frame and the reference frame are downsampled proportionally; a single 4:1 downsampling turns the 16x16 blocks of the original image layer into 8x8 blocks; and when motion estimation is performed at the downsampled layer, the step size of the candidate motion vectors is set to 2 integer-pixel unit vectors after downsampling;

The downsampling produces 4 downsampled images; the multiple PE arrays search within one downsampled image while the other 3 downsampled images are searched in parallel;

In the method, the on-chip luminance pixel cache uses classified pixel storage to realize the downsampled search: the 4 pixel types produced by downsampling are stored interleaved, and odd/even rows and odd/even macroblock columns are stored separately, giving 16 interleaved pixel types in total; the PE array reuses the data it reads from the luminance pixel cache, reading 16-byte-wide data of two pixel types at a time to update the data of the 4 PE array paths; the current luminance pixel registers of the PE array are updated quickly by shifting between different current luminance pixel registers;

The on-chip luminance pixel cache divides the reference image into 16 reference pixel types, 0 to f, where types 0 to 7 are reference pixels of even macroblock columns and types 8 to f are reference pixels of odd macroblock columns. Each reference pixel type has its own RAM. The luminance reference pixel cache consists of 16 RAMs, each 64 bits wide. Encoding follows a zigzag macroblock scan order, so each data update reads only one macroblock column, and only the 8 RAMs of the even or the odd macroblock columns need to be updated at a time;

The PE array is organized in three levels: the PEA_32pels array unit, the PEA_64pels array unit, and the PEA_256pels array unit, wherein: the PEA_32pels array unit is the most basic array unit, consists of 32 PE units, and is used to compute the SAD of a 4x8 block; the PEA_64pels array unit comprises two PEA_32pels array units, consists of 64 PE units, and is used to compute the SAD of an 8x8 block; the PEA_256pels array unit comprises four PEA_64pels array units, consists of 256 PE units, and is used to compute the SADs of 16x16 blocks and their variable-size blocks;

The PEA_32pels array unit computes the SAD of a 4x8 block and at the same time produces the SADs of four 2x4 blocks, namely the upper-left, upper-right, lower-left, and lower-right parts of the 4x8 block, which are used to compute the SADs of variable-size blocks; the PEA_32pels array unit consists of 5 8-byte-wide reference pixel shift registers, 4 8-byte-wide shift registers for the pixels to be encoded, 32 basic PE units, and 5 accumulators; during block-matching motion estimation at the L1 or L0 layer, the current macroblock pixel registers of a PEA_32pels array unit always store one of the current pixel types 0 to 7.

2. The search method based on layered motion estimation according to claim 1, characterized in that: the method uses motion estimation based on rate-distortion optimization to support variable-size blocks, and the block matching cost function is the weighted sum of the SAD between the current block and the reference block and the number of bits consumed in coding the motion vector.

3. A system implementing the search method based on layered motion estimation according to claim 1, characterized in that it comprises:

Luminance reference pixel cache module: uses zigzag scan order encoding and on-chip interleaved data storage; the 4 pixel types produced by downsampling are stored interleaved, and odd/even rows and odd/even macroblock columns are stored separately, giving 16 interleaved pixel types in total;

Hierarchically multiplexed PE array module: downsampling yields 4 downsampled images, and the multiple PE arrays perform a 4-path parallel search within each downsampled image; in every clock cycle the PE array can obtain valid data from the luminance reference pixel cache module and compute SAD values;

PE array register data update operation control module: controls the reading of reference pixels from the luminance reference pixel cache module into the registers of the PE array, reading 16-byte-wide data of two pixel types at a time to update the data of the 4 PE array paths; the current luminance pixel registers of the PE array are updated quickly by shifting between different current luminance pixel registers;

The hierarchically multiplexed PE array module has three levels of PE array organization: the PEA_32pels array unit, the PEA_64pels array unit, and the PEA_256pels array unit, wherein: the PEA_32pels array unit is the most basic array unit, consists of 32 PE units, and is used to compute the SAD of a 4x8 block; the PEA_64pels array unit comprises two PEA_32pels array units, consists of 64 PE units, and is used to compute the SAD of an 8x8 block; the PEA_256pels array unit comprises four PEA_64pels array units, consists of 256 PE units, and is used to compute the SADs of 16x16 blocks and their variable-size blocks;

The luminance reference pixel cache module divides the reference pixels into 4 pixel types and stores odd/even rows and odd/even macroblock columns separately; 16 on-chip RAMs in total store the luminance reference pixels, each RAM is 64 bits wide, and each address holds 8 pixel values.