CN100349122C - Method for realizing data packet sequencing for multi engine paralled processor - Google Patents
- Publication number: CN100349122C
- Authority: CN (China)
- Legal status: Expired - Fee Related
- Classification: Data Exchanges In Wide-Area Networks
Abstract
The invention discloses a method for ordering data packets in a multi-engine parallel processor. The key idea is to generate a tag code for each packet during distribution of the data stream and store the tags as a sequence; during collection of the data stream, the tags stored in the sequence are decoded one by one, the engine channel or load channel corresponding to each tag is selected, and the corresponding complete data packet is output in turn, with only one packet allowed through at a time. This solves the ordering of the output packets. The invention has the following advantages: no separate queuing machine is needed to order the packets, which reduces system resource overhead; because the ordering function is integrated with the multi-engine parallel processor, the impact of ordering on parallel-processing efficiency is minimized and the likelihood of system congestion is reduced; and the multi-engine parallel processor can guarantee the logical order of the packets without occupying a large amount of the system's shared memory.
Description
Technical Field
The invention relates to the technical field of multi-engine parallel processors, and in particular to a method for ordering data packets in a multi-engine parallel processor.
Background
Multi-engine parallel processors offer a way to break through the processing-capacity limits of a single engine. In theory, ignoring factors such as interface speed and the amount of hardware resources available, the processing capacity of a multi-engine parallel processor is unbounded. In practice, a multi-engine parallel processor is usually structured into multiple load layers, each load layer is divided into multiple load paths, and each load path at the bottom layer has one packet engine (PE); these engines work in parallel. Assume the total number of load layers is N, and let n denote any layer, 1 ≤ n ≤ N. If layer n has mn load paths, so that the numbers of load paths from layer 1 to layer n are m1, m2, ..., mn, then the total number of engines P is P = m1 × m2 × ... × mn.
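The relation between the per-layer path counts and the engine total can be checked with a short sketch (the function name is illustrative, not from the patent):

```python
from math import prod

def total_engines(paths_per_layer):
    """P = m1 * m2 * ... * mn: one packet engine per bottom-layer load path."""
    return prod(paths_per_layer)

# The example used throughout the text: N=2 layers, m1=2, m2=4 -> P=8 engines.
print(total_engines([2, 4]))  # -> 8
```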
For convenience, the following abbreviations are defined first:
Layer n Load Balance Unit, abbreviated LnBU;
Layer n Input Cache Unit, abbreviated InCU;
Layer n Load Pooling Unit, abbreviated LnPU;
Layer n Output Cache Unit, abbreviated OnCU.
The following takes n=2, m1=2, m2=4, P=8 as an example.
Referring to Fig. 1, a schematic diagram of the structure of an 8-engine parallel processor, the data flow of a packet through this multi-engine parallel processor is as follows:
1. The Layer 1 Load Balance Unit (L1BU) distributes incoming packets to the two Layer 1 Input Cache Units (I1CU) according to a load-balancing or round-robin arbitration policy.
2. The two I1CUs buffer the input data of their respective load paths.
3. The two Layer 2 Load Balance Units (L2BU) each distribute the packets of their own load path to four Layer 2 Input Cache Units (I2CU) according to a load-balancing or round-robin arbitration policy; each I2CU corresponds to one Packet Engine (PE). In other words, each L2BU distributes the packets of its load path among four PEs.
4. Each PE fetches a packet to be processed from its I2CU and, after processing, stores the packet in a Layer 2 Output Cache Unit (O2CU).
5. Each Layer 2 Load Pooling Unit (L2PU) collects the output packets of the four packet engines from the four O2CUs on its load path and outputs them in a defined order to a Layer 1 Output Cache Unit (O1CU).
6. The O1CUs buffer the output data of their respective load paths.
7. The L1PU collects the output packets of the two O1CUs and outputs them in a defined order.
As the above flow shows, L1BU and L2BUi (where i is 0 or 1) do not consider packet order when distributing packets to the engines, and since the load of each packet engine cannot be exactly the same, the order in which packets distributed to the engines finish processing cannot be guaranteed either. The packets output after processing by the multi-engine parallel processor therefore cannot be guaranteed to be in logical order.
The existing method of guaranteeing the logical order of packets in a multi-engine parallel processor is as follows:
Before a packet enters the multi-engine parallel processor, a queuing machine (generally implemented in software) inserts a sequence tag into the packet, and every processing module inside the multi-engine parallel processor must pass the tag through transparently. After the packets have been processed and output by the multi-engine parallel processor, the queuing machine reads the tags in the packets and sorts the packets according to the tags.
Because of the queuing machine, the existing method inevitably has the following drawbacks:
1. Using a separate queuing machine to sort the packets consumes additional system resources.
2. If the queuing machine cannot sort the packets output by the multi-engine parallel processor in time, the system becomes congested, degrading the computing efficiency of the multi-engine parallel processor.
3. When the queuing machine sorts the packets, all packets that finish processing early but come later in the sequence must be buffered, occupying a large amount of the system's shared memory.
Summary of the Invention
In view of this, the object of the present invention is to provide a method for ordering data packets in a multi-engine parallel processor, so as to solve the packet-ordering problem in multi-engine processing.
To achieve the above object, the technical solution of the present invention is realized as follows:
A method for ordering data packets in a multi-engine parallel processor, where the multi-engine parallel processor contains one or more layers of load balance units (LBU), each LBU corresponds to one load pooling unit (LPU), and the LPU is at the same layer as its corresponding LBU.
In the multi-engine parallel processor, one or more sequences for recording tag information are preconfigured, with a one-to-one correspondence between sequences and LBUs.
Each LBU performs the following steps:
A. On receiving a packet to be processed, the LBU distributes it to a processing module of the next layer according to a preset distribution policy, tags the distributed packet to indicate the load path the packet is on, and records the tag in the sequence corresponding to this LBU.
B. The LBU checks whether the number of records in its sequence has reached the preset flow-control threshold on the number of packets. If so, it stops distributing and repeats step B; otherwise it repeats step A.
The load pooling unit (LPU) corresponding to each LBU, at the same layer as that LBU, performs the following steps:
a. The LPU checks whether the sequence corresponding to its LBU is non-empty. If so, it executes step b; otherwise it repeats step a.
b. The LPU reads one tag from the sequence in order, determines from the tag the location of the corresponding packet, opens the output channel for that location, waits for one complete packet to pass, allowing only that one packet through, and then repeats step a until all packets have been output.
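A minimal software sketch of steps A/B (tagged distribution with flow control) and steps a/b (tag-driven collection) follows. The data structures and names are illustrative; in the patent these are hardware FIFOs and channel-select logic, not Python functions:

```python
from collections import deque

def distribute(packets, num_paths, seq, threshold):
    """Steps A/B: round-robin distribution; each packet's path index is
    recorded as a tag in the LBU's sequence. Distribution stops once the
    sequence holds `threshold` tags (the flow-control condition)."""
    paths = [deque() for _ in range(num_paths)]
    for i, pkt in enumerate(packets):
        if len(seq) >= threshold:          # step B: threshold reached -> stop
            break                          # (a real LBU would stall, not drop)
        path = i % num_paths               # step A: pick a path (round robin)
        paths[path].append(pkt)
        seq.append(path)                   # record the tag in the sequence
    return paths

def collect(paths, seq):
    """Steps a/b: while the sequence is non-empty, decode the next tag,
    open the matching channel, and let exactly one complete packet pass."""
    out = []
    while seq:                             # step a: sequence non-empty?
        path = seq.popleft()               # step b: read one tag in order
        out.append(paths[path].popleft())  # exactly one packet passes
    return out
```

Because each path preserves its own internal order and the tag sequence replays the arrival order, `collect` emits packets in the original input order.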
Preferably, if the LBU is not at the bottom layer, the next-layer processing module in step A is an LBU of the next layer; if the LBU is at the bottom layer, the next-layer processing module in step A is a packet engine (PE).
Preferably, if the LBU is the top-layer LBU, the flow-control threshold in step B is the packet-count flow-control threshold of the data input port; if the LBU is not the top-layer LBU, the flow-control threshold in step B is the packet-count flow-control threshold of its load path.
Preferably, the width Bn of the sequence is
Bn = [log2(mn)], where [ ] denotes rounding up to the next integer and mn is the number of load paths in the layer below the LBU;
the depth of the sequence corresponding to the top-layer LBU is the packet-count flow-control threshold of the data input port;
the depth of the sequence corresponding to a non-top-layer LBU is the packet-count flow-control threshold of the load path one layer above that LBU.
Preferably, the packet-count flow-control threshold of the data input port is determined as follows:
First compute the size of the cache space of the multi-engine parallel processor; then divide that value by the length of the smallest packet to obtain the maximum number of packets that can be held; then determine the packet-count flow-control threshold of the data input port from this maximum number and a preset policy.
Preferably, the cache space of the multi-engine parallel processor is computed as the sum of the cache spaces of all input cache units (ICU) and output cache units (OCU) in the multi-engine parallel processor.
Preferably, the packet-count flow-control threshold of a load path is determined as follows:
First compute the size of all cache space in the load path; then divide that value by the length of the smallest packet to obtain the maximum number of packets that can be held; then determine the packet-count flow-control threshold of the load path from this maximum number and a preset policy.
Preferably, the cache space of a load path is computed as the sum of the cache spaces of all input cache units (ICU) and output cache units (OCU) in that load path.
Preferably, the preset policy determines the flow-control threshold according to resource overhead and/or the impact on caching efficiency.
Preferably, each sequence is carried by a first-in-first-out (FIFO) buffer.
Preferably, the preset distribution policy is a load-balancing policy or a round-robin arbitration policy.
The key of the present invention is to generate a tag code for each packet during distribution of the data stream and store the tags as a sequence; during collection of the data stream, the tags stored in the sequence are decoded one by one, the engine channel or load channel corresponding to each tag is selected, and the corresponding complete packet is output in turn, with only one packet allowed through at a time. This solves the ordering of the output packets.
The invention has the following advantages: no separate queuing machine is needed to order the packets, which reduces system resource overhead; because the ordering function is integrated with the multi-engine parallel processor, the impact of ordering on parallel-processing efficiency is minimized and the likelihood of system congestion is reduced; and the multi-engine parallel processor can guarantee the logical order of the packets without occupying a large amount of the system's shared memory.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the structure of an 8-engine parallel processor;
Fig. 2 is a schematic diagram of the tag coding for the 8-engine parallel processor structure to which the present invention is applied.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The idea of the present invention is: generate a tag code for each packet during distribution of the data stream and store the tags as a sequence; during collection of the data stream, decode the tags stored in the sequence one by one, select the engine channel or load channel corresponding to each tag, and output the corresponding complete packet in turn, with only one packet allowed through at a time. This accomplishes the tagging of input packets and the ordering of output packets.
To guarantee that the packets processed by the engines are output in the intended order, each input packet must be uniquely tagged before an LnBU distributes it to a load path or PE; the LnPU then controls the order of the output packets according to the previously written tags.
Here "unique tag" means that, in the limiting case, all packets that the multi-engine parallel processor (i.e., the chip) can hold simultaneously must carry mutually distinct tags; the number of "unique tags" is thus the maximum number of packets the multi-engine parallel processor can hold at once. This "maximum number" is the limit of what the multi-engine parallel processor can accommodate. Usually, to keep the buffers short, the number of packets must be flow-controlled, and flow control requires setting a reasonable flow-control threshold so as to limit its impact on caching efficiency.
The present invention defines two kinds of flow-control thresholds. One is the packet-count flow-control threshold of the data input port, which flow-controls the data stream entering the multi-engine parallel processor; there is exactly one such threshold. The other is the packet-count flow-control threshold of a load path, which flow-controls the data stream entering each load path; there are several of these, one for each load path actually present in the multi-engine parallel processor. The setting of the two kinds of thresholds is described below.
The packet-count flow-control threshold of the data input port is set as follows: first compute the size of the cache space of the multi-engine parallel processor; then divide that value by the length of the smallest packet to obtain the maximum number of packets that can be held; then determine the packet-count flow-control threshold of the data input port from this maximum number and a preset policy. The cache space of the multi-engine parallel processor is computed as the sum of the cache spaces of all input cache units (ICU) and output cache units (OCU) in the multi-engine parallel processor. The preset policy determines the flow-control threshold according to resource overhead and/or the impact on caching efficiency.
Still taking n=2, m1=2, m2=4, P=8 as an example (see Fig. 1): if each I1CU has 8 Kbyte of cache and each O1CU 9 Kbyte, and each I2CU and O2CU has 2 Kbyte, then the cache space of the multi-engine parallel processor, i.e., of the whole chip, is (cache per I1CU × count) + (cache per O1CU × count) + (cache per I2CU × count) + (cache per O2CU × count) = 8×2 + 9×2 + 2×8 + 2×8 = 66 Kbyte. Assuming that in the limiting case all packets are 64-byte minimum-size packets, the number of packets the multi-engine parallel processor can hold simultaneously is 66 Kbyte / 64 byte, about 1000. Here, according to the predetermined policy, the packet-count flow-control threshold of the data input port is set to 256. That is, when the number of packets held simultaneously in the multi-engine parallel processor reaches 256, the input packets must be flow-controlled.
The packet-count flow-control threshold of a load path is set as follows: first compute the size of all cache space in the load path; then divide that value by the length of the smallest packet to obtain the maximum number of packets that can be held; then determine the packet-count flow-control threshold of the load path from this maximum number and a preset policy. The cache space of a load path is computed as the sum of the cache spaces of all ICUs and OCUs in that load path. The preset policy determines the flow-control threshold according to resource overhead and/or the impact on caching efficiency.
Still taking n=2, m1=2, m2=4, P=8 as an example (see Fig. 1): if each I2CU and O2CU has 2 Kbyte of cache, the total cache space in each load path is (cache per I2CU × count) + (cache per O2CU × count) = 2×4 + 2×4 = 16 Kbyte. Assuming all packets are 64-byte minimum-size packets, the load path can hold 16 Kbyte / 64 byte = 256 packets simultaneously. Here, according to the predetermined policy, the packet-count flow-control threshold of the load path is set to 100. That is, when the number of packets held simultaneously in the load path reaches 100, the packets entering that load path must be flow-controlled.
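The two worked examples above can be reproduced numerically. Note that the thresholds 256 and 100 are policy choices stated in the text, not values computed by the formula:

```python
def max_packets(cache_bytes, min_packet_bytes=64):
    """Upper bound on packets held at once: total cache / smallest packet."""
    return cache_bytes // min_packet_bytes

# Whole chip: 2 x 8KB I1CU + 2 x 9KB O1CU + 8 x 2KB I2CU + 8 x 2KB O2CU
chip_kb = 8*2 + 9*2 + 2*8 + 2*8          # = 66 Kbyte
print(max_packets(chip_kb * 1024))       # 1056, the "about 1000" in the text

# One load path: 4 x 2KB I2CU + 4 x 2KB O2CU
path_kb = 2*4 + 2*4                      # = 16 Kbyte
print(max_packets(path_kb * 1024))       # 256
```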
As is well known, a multi-engine parallel processor usually contains more than one layer of load balance units (LBU), and each LBU corresponds to a load pooling unit (LPU) at the same layer. For convenience, in this application the LBU that receives the data stream from outside the multi-engine parallel processor is called the top-layer LBU, and the LBUs that distribute the data stream to the PEs are called the bottom-layer LBUs.
The top-layer LBU flow-controls the data stream entering the multi-engine parallel processor, so its packet-count threshold is the packet-count flow-control threshold of the data input port; a non-top-layer LBU flow-controls the data stream entering its load path, so its packet-count threshold is the packet-count flow-control threshold of that load path.
Each packet has P possible destinations in the multi-engine parallel processor, i.e., one of the P engines. If the mn load paths of load layer n are numbered 0 to mn−1, the number of coding bits Bn of load layer n is
Bn = [log2(mn)]
where [ ] denotes rounding up to the next integer, i.e., any fractional part is rounded up. Concatenating the per-layer codes Bi in order yields the tag code.
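The width formula and the composition of the tag can be sketched as follows (the helper names are illustrative):

```python
from math import ceil, log2

def layer_bits(m):
    """B_n = [log2(m_n)], rounded up: bits needed to name one of m paths."""
    return ceil(log2(m))

def tag_width(paths_per_layer):
    """Concatenating the per-layer fields in order gives the full tag width."""
    return sum(layer_bits(m) for m in paths_per_layer)

print(layer_bits(2))       # 1 bit for the m1=2 layer
print(layer_bits(4))       # 2 bits for the m2=4 layer
print(tag_width([2, 4]))   # 3 bits: enough to name one of P=8 engines
```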
The coding is explained below, again taking n=2, m1=2, m2=4, P=8 as an example. Referring to Fig. 2, a schematic diagram of the tag coding for the 8-engine parallel processor structure to which the present invention is applied: in this example each packet has only 8 possible destinations in the multi-engine parallel processor, i.e., one of the 8 PEs.
When the L1BU distributes an input packet to one of the two load channels it has two choices, so the L1BU records a tag of B1 = [log2(m1)] = [log2 2] = 1 coded bit, producing the first bit of the tag; that is, 0 and 1 distinguish the two load channels. Referring to Fig. 2, a packet distributed to load channel 0 is tagged 1'b0 and a packet distributed to load channel 1 is tagged 1'b1. The tag bit of each packet is saved, in order, in a sequence defined as S_load11. The width of sequence S_load11 is B1, i.e., one bit in this example, and its maximum depth is the packet-count flow-control threshold of the data input port.
When L2BU0 distributes a packet of load path 0 to one of the four packet engines it has four choices, so L2BU0 records a tag of B2 = [log2(m2)] = [log2 4] = 2 coded bits, producing the second and third bits of the tag. Referring to Fig. 2, the tags for the four PEs are 2'b11, 2'b10, 2'b01 and 2'b00. The tag bits of each packet are saved, in order, in a sequence defined as S_load21. The width of sequence S_load21 is B2, i.e., two bits in this example, and its maximum depth is the maximum number of packets that load path 0 can hold simultaneously, i.e., the packet-count flow-control threshold of load path 0.
L2BU1 is at the same layer as L2BU0 and in the same environment, so its processing is identical to that of L2BU0: when L2BU1 distributes a packet of load path 1 to one of the four packet engines it has four choices, coded as 2 bits, producing the second and third bits of the tag. The tag bits of each packet are saved, in order, in a sequence defined as S_load22. The width of sequence S_load22 is two bits and its maximum depth is the maximum number of packets that load path 1 can hold simultaneously, i.e., the packet-count flow-control threshold of load path 1.
It can be seen from this that the depth of the sequence corresponding to the top-layer LBU is the packet-count flow-control threshold of the data input port, and the depth of the sequence corresponding to a non-top-layer LBU is the packet-count flow-control threshold of the load path one layer above that LBU.
As the above shows, to achieve ordering, one or more sequences for recording tag information must be preconfigured in the multi-engine parallel processor, with a one-to-one correspondence between sequences and LBUs; this establishes the one-to-one correspondence among LBU, LPU and sequence. The subscript nk of a sequence S_load is read as follows: n is the load layer; if K is the total number of LBUs in that layer, then k identifies one of them, with 1 ≤ k ≤ K.
With the above preparation in place, the packet-ordering process is described in detail below.
The LBUs in the multi-engine parallel processor perform the following steps:
A. On receiving a packet to be processed, the LBU distributes it to a processing module of the next layer according to a preset distribution policy, such as a load-balancing policy or a round-robin arbitration policy, tags the distributed packet to indicate the load path the packet is on, and records the tag in the sequence corresponding to this LBU.
B. The LBU checks whether the number of records in its sequence has reached the preset flow-control threshold on the number of packets. If so, it stops distributing and repeats step B; otherwise it repeats step A.
If the LBU is not at the bottom layer, the next-layer processing module in step A is an LBU of the next layer; if the LBU is at the bottom layer, the next-layer processing module in step A is a PE.
The LPU corresponding to each LBU in the multi-engine parallel processor, at the same layer as that LBU, performs the following steps:
a. After the LPU determines that the sequence corresponding to its LBU is non-empty, it executes step b.
b. The LPU reads one tag from the sequence in order, determines from the tag the location of the corresponding packet, opens the output channel for that location, waits for one complete packet to pass, allowing only that one packet through, and then repeats step a until all packets have been output.
The following again takes n=2, m1=2, m2=4, P=8 as an example; see Figs. 1 and 2.
When the L1BU distributes an input packet to one of the two load channels, it tags the packet for the first time and writes the tag, in order, into sequence S_load11. When the number of records in S_load11 reaches the preset packet-count flow-control threshold of the data input port, i.e., when S_load11 is full, the L1BU stops distributing packets, thereby flow-controlling the chip's data input port.
When L2BU0 distributes a packet of load path 0 to one of the four packet engines, it tags the packet for the second time and writes the tag, in order, into sequence S_load21. When S_load21 is full, i.e., the packet-count flow-control threshold of load path 0 has been reached, it stops distributing packets, flow-controlling the data input of load path 0.
When L2BU1 distributes a packet of load path 1 to one of the four packet engines, it tags the packet for the second time and writes the tag, in order, into sequence S_load22. When S_load22 is full, i.e., the packet-count flow-control threshold of load path 1 has been reached, it stops distributing packets, flow-controlling the data input of load path 1.
当L2PU0判断出序列S_load21中记录的信息为非空时,L2PU0按顺序从序列S_load21中读取标记信息,通过对标记的解码获取对应该标记的包文的位置信息,比如标记为2’b01,表示最近需要通过的包位于通道1,L2PU0预先打开从引擎1到负载通道0的通道,等待一个完整的数据包通过,且只允许一个数据包通过后,依次打开下一个包文通过的通道,直至负载通道0中的所有包文通过L2PU0输出。When L2PU0 determines that the information recorded in sequence S_load 21 is not empty, L2PU0 reads the tag information from sequence S_load 21 in order, and obtains the position information of the packet corresponding to the tag by decoding the tag, for example, the tag is 2' b01, indicating that the most recent packet that needs to pass is located in
When L2PU1 determines that the sequence S_load22 is non-empty, it reads the tag information from S_load22 in order and decodes each tag to obtain the location of the packet corresponding to that tag. For example, the tag 2'b01 indicates that the next packet due to pass is on engine channel 1, so L2PU1 opens the path from engine 1 to load channel 1 in advance, waits for one complete packet to pass through, and allows only that one packet through before opening the path for the next packet in turn, until all packets on load channel 1 have been output through L2PU1.
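The two collection paragraphs above can be condensed into one sketch (a hypothetical Python model, not the hardware; the name `l2pu_collect` and the string encoding of tags are assumptions): a 2-bit tag such as 2'b01 is decoded to an engine index, and exactly one packet is forwarded per tag.

```python
from collections import deque

def l2pu_collect(s_load2, engine_channels):
    """Sketch of an L2PU: decode each 2-bit tag in FIFO order, open the
    path from that engine to the load channel, and let exactly one
    complete packet through per tag."""
    out = []
    while s_load2:
        tag = s_load2.popleft()                 # e.g. "01" stands for 2'b01
        engine = int(tag, 2)                    # decode: 2'b01 -> engine 1
        out.append(engine_channels[engine].popleft())  # one packet only
    return out

# Tags recorded by the L2BU in dispatch order; engine queues hold results.
engines = {0: deque(["a"]), 1: deque(["b", "d"]), 2: deque(["c"]), 3: deque()}
tags = deque(["00", "01", "10", "01"])
print(l2pu_collect(tags, engines))              # ['a', 'b', 'c', 'd']
```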
When the L1PU determines that the sequence S_load11 is non-empty, it reads the tag information from S_load11 in order and decodes each tag to obtain the location of the packet corresponding to that tag. For example, the tag 2'b1 indicates that the next packet due to pass is on load channel 1, so the L1PU opens the path from load channel 1 to the chip output port in advance, waits for one complete packet to pass through, and allows only that one packet through before opening the path for the next packet in turn, until all packets in the multi-engine parallel processor have been output through the L1PU.
Each of the sequences S_load11, S_load21, and S_load22 is carried by its own first-in-first-out (FIFO) buffer.
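Putting the pieces together, a small end-to-end model (a sketch under the assumption that each sequence is an in-memory FIFO, here `collections.deque`; the function name and random dispatch policy are hypothetical) shows that replaying the two levels of tags restores the original packet order for n=2, m1=2, m2=4. Unlike the real pipeline, this model collects each load channel fully before the L1PU stage, which does not affect the ordering result.

```python
from collections import deque
import random

def two_level_order_demo(n_packets=8, seed=0):
    """Model the n=2, m1=2, m2=4 pipeline: each sequence is a FIFO.
    Distribution records tags; collection replays them, restoring order."""
    rng = random.Random(seed)
    s_load11 = deque()                       # first-level tag FIFO
    s_load2 = [deque(), deque()]             # per-load-channel tag FIFOs
    engines = [[deque() for _ in range(4)] for _ in range(2)]

    # Distribution: the L1BU picks a load channel, the L2BU picks an
    # engine; each level appends its tag to the corresponding FIFO.
    for pkt in range(n_packets):
        lc = rng.randrange(2)
        s_load11.append(lc)
        eng = rng.randrange(4)
        s_load2[lc].append(eng)
        engines[lc][eng].append(pkt)

    # Collection: the L2PUs and the L1PU replay the tags in FIFO order.
    load_out = [deque(), deque()]
    for lc in range(2):
        while s_load2[lc]:
            load_out[lc].append(engines[lc][s_load2[lc].popleft()].popleft())
    return [load_out[s_load11.popleft()].popleft() for _ in range(n_packets)]

print(two_level_order_demo())                # [0, 1, 2, 3, 4, 5, 6, 7]
```

Whatever channels and engines the random dispatch chooses, the output comes back as 0..n_packets-1 in order, because each stage's tag FIFO records exactly the arrival order at that stage.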
The embodiments above all take n=2, m1=2, m2=4, P=8 as an example. In practical applications, the number of load levels may be n=2, 3, 4, 5, ... and is not limited to these values; likewise, the number of load channels per level may be m=2, 3, 4, 5, ... and is not limited to these values.
Furthermore, the method of the present invention for packet ordering in a multi-engine parallel processor is not only applicable to a multi-engine parallel processor with the structure shown in FIG. 1; it applies equally to multi-engine processors of other structures, as long as they have a multi-level, multi-channel structure.
The preferred embodiments described above are not intended to limit the present invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (11)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2005100932204A CN100349122C (en) | 2005-08-19 | 2005-08-19 | Method for realizing data packet sequencing for multi engine paralled processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1851649A CN1851649A (en) | 2006-10-25 |
CN100349122C true CN100349122C (en) | 2007-11-14 |
Family
ID=37133128
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101770362B (en) * | 2009-01-06 | 2013-04-03 | 中国科学院计算技术研究所 | Distributed dynamic process generating unit meeting System C processor |
CN112732241B (en) * | 2021-01-08 | 2022-04-01 | 烽火通信科技股份有限公司 | Programmable analyzer under multistage parallel high-speed processing and analysis method thereof |
CN117579565B (en) * | 2023-11-03 | 2025-06-27 | 中科驭数(北京)科技有限公司 | Data packet associated data processing method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1173255A (en) * | 1995-01-13 | 1998-02-11 | 摩托罗拉公司 | Method and apparatus for encoding and decoding information in a digital communication system |
CN1284673A (en) * | 1999-05-31 | 2001-02-21 | 德国汤姆森-布兰特有限公司 | Data pack preprocessing method and bus interface and data processing unit thereof |
US6457121B1 (en) * | 1999-03-17 | 2002-09-24 | Intel Corporation | Method and apparatus for reordering data in X86 ordering |
US6594722B1 (en) * | 2000-06-29 | 2003-07-15 | Intel Corporation | Mechanism for managing multiple out-of-order packet streams in a PCI host bridge |
WO2004013752A1 (en) * | 2002-07-26 | 2004-02-12 | Koninklijke Philips Electronics N.V. | Method and apparatus for accessing multiple vector elements in parallel |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1201532C (en) | Quick-circulating port dispatcher for high-volume asynchronous transmission mode exchange | |
US8099521B2 (en) | Network interface card for use in parallel computing systems | |
US8458267B2 (en) | Distributed parallel messaging for multiprocessor systems | |
US9537772B2 (en) | Flexible routing tables for a high-radix router | |
KR20200135780A (en) | Mediating parts of a transaction through a virtual channel associated with the interconnect | |
US20070268903A1 (en) | System and Method for Assigning Packets to Output Queues | |
CN1643499A (en) | Thread signaling in multi-threaded network processor | |
US10146468B2 (en) | Addressless merge command with data item identifier | |
CN103946803A (en) | Processor with efficient work queuing | |
US20200136986A1 (en) | Multi-path packet descriptor delivery scheme | |
CN1498374A (en) | Apparatus and method for efficiently sharing memory bandwidth in network processor | |
CN101072176A (en) | Report processing method and system | |
CN102158408B (en) | Method for processing data stream and device thereof | |
TWI536772B (en) | Direct provision of information to the technology of the agreement layer | |
CN100349122C (en) | Method for realizing data packet sequencing for multi engine paralled processor | |
US9846662B2 (en) | Chained CPP command | |
US20040078459A1 (en) | Switch operation scheduling mechanism with concurrent connection and queue scheduling | |
US9148270B2 (en) | Method and apparatus for handling data flow in a multi-chip environment using an interchip interface | |
US7460544B2 (en) | Flexible mesh structure for hierarchical scheduling | |
US9804959B2 (en) | In-flight packet processing | |
US9665519B2 (en) | Using a credits available value in determining whether to issue a PPI allocation request to a packet engine | |
CN117692408A (en) | CAN frame sending method, device and system, computing equipment and storage medium | |
CN117440053A (en) | Multistage cross die access method and system | |
US7272151B2 (en) | Centralized switching fabric scheduler supporting simultaneous updates | |
US20250165425A1 (en) | Cxl fabric extensions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
ASS | Succession or assignment of patent right | Owner name: SHENZHEN HAISI SEMICONDUCTOR CO., LTD.; Former owner: HUAWEI TECHNOLOGY CO., LTD.; Effective date: 20081010 |
C41 | Transfer of patent application or patent right or utility model | |
TR01 | Transfer of patent right | Effective date of registration: 20081010; Address after: HUAWEI electric production center, Bantian HUAWEI base, Longgang District, Shenzhen, Guangdong; Patentee after: HISILICON TECHNOLOGIES Co.,Ltd.; Address before: Bantian HUAWEI headquarters office building, Longgang District, Shenzhen, Guangdong; Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd. |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20071114 |