CN100349122C - Method for realizing data packet sequencing for multi engine paralled processor - Google Patents
- Publication number: CN100349122C
- Authority: CN (China)
- Legal status: Expired - Fee Related
- Classification: Data Exchanges In Wide-Area Networks
Abstract
The invention discloses a method for ordering data packets in a multi-engine parallel processor. The key idea is to generate a tag code for each packet during distribution of the data stream and store the tags as a sequence; during collection of the data stream, the tags stored in the sequence are decoded one by one, the engine channel or load channel corresponding to each tag is selected, and the corresponding complete data packet is output in turn, with only one packet allowed through at a time. This solves the ordering of the output packets. The invention has the following advantages: no separate queuing machine is needed to order the packets, which reduces system resource overhead; because the ordering function is integrated with the multi-engine parallel processor, the impact of ordering on parallel-processing efficiency is minimized and the likelihood of system congestion is reduced; and the multi-engine parallel processor can guarantee the logical order of the packets without occupying a large amount of the system's shared memory.
Description
Technical Field
The invention relates to the technical field of multi-engine parallel processors, and in particular to a method for ordering data packets in a multi-engine parallel processor.
Background
Multi-engine parallel processors offer a way to break through the processing-capacity limits of a single engine. In theory, ignoring factors such as interface speed and the amount of hardware resources available, the processing capacity of a multi-engine parallel processor is unbounded. In practice, a multi-engine parallel processor is usually structured into multiple load layers, each load layer is divided into multiple load paths, and each load path at the bottom layer has one packet engine (PE); these engines work in parallel. Assume the total number of load layers is N, and let n denote any layer, 1 ≤ n ≤ N. If layer n has mn load paths, so that the numbers of load paths from layer 1 to layer n are m1, m2, ..., mn, then the total number of engines P is P = m1 × m2 × ... × mn.
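The relation between the per-layer path counts and the engine total can be checked with a short sketch (the function name is illustrative, not from the patent):

```python
from math import prod

def total_engines(paths_per_layer):
    """P = m1 * m2 * ... * mn: one packet engine per bottom-layer load path."""
    return prod(paths_per_layer)

# The example used throughout the text: N=2 layers, m1=2, m2=4 -> P=8 engines.
print(total_engines([2, 4]))  # -> 8
```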
For convenience, the following abbreviations are defined first:
Layer n Load Balance Unit, abbreviated LnBU;
Layer n Input Cache Unit, abbreviated InCU;
Layer n Load Pooling Unit, abbreviated LnPU;
Layer n Output Cache Unit, abbreviated OnCU.
The following takes n=2, m1=2, m2=4, P=8 as an example.
Referring to Fig. 1, a schematic diagram of the structure of an 8-engine parallel processor, the data flow of a packet through this multi-engine parallel processor is as follows:
1. The Layer 1 Load Balance Unit (L1BU) distributes incoming packets to the two Layer 1 Input Cache Units (I1CU) according to a load-balancing or round-robin arbitration policy.
2. The two I1CUs buffer the input data of their respective load paths.
3. The two Layer 2 Load Balance Units (L2BU) each distribute the packets of their own load path to four Layer 2 Input Cache Units (I2CU) according to a load-balancing or round-robin arbitration policy; each I2CU corresponds to one Packet Engine (PE). In other words, each L2BU distributes the packets of its load path among four PEs.
4. Each PE fetches a packet to be processed from its I2CU and, after processing, stores the packet in a Layer 2 Output Cache Unit (O2CU).
5. Each Layer 2 Load Pooling Unit (L2PU) collects the output packets of the four packet engines from the four O2CUs on its load path and outputs them in a defined order to a Layer 1 Output Cache Unit (O1CU).
6. The O1CUs buffer the output data of their respective load paths.
7. The L1PU collects the output packets of the two O1CUs and outputs them in a defined order.
As the above flow shows, L1BU and L2BUi (where i is 0 or 1) do not consider packet order when distributing packets to the engines, and since the load of each packet engine cannot be exactly the same, the order in which packets distributed to the engines finish processing cannot be guaranteed either. The packets output after processing by the multi-engine parallel processor therefore cannot be guaranteed to be in logical order.
The existing method of guaranteeing the logical order of packets in a multi-engine parallel processor is as follows:
Before a packet enters the multi-engine parallel processor, a queuing machine (generally implemented in software) inserts a sequence tag into the packet, and every processing module inside the multi-engine parallel processor must pass the tag through transparently. After the packets have been processed and output by the multi-engine parallel processor, the queuing machine reads the tags in the packets and sorts the packets according to the tags.
Because of the queuing machine, the existing method inevitably has the following drawbacks:
1. Using a separate queuing machine to sort the packets consumes additional system resources.
2. If the queuing machine cannot sort the packets output by the multi-engine parallel processor in time, the system becomes congested, degrading the computing efficiency of the multi-engine parallel processor.
3. When the queuing machine sorts the packets, all packets that finish processing early but come later in the sequence must be buffered, occupying a large amount of the system's shared memory.
Summary of the Invention
In view of this, the object of the present invention is to provide a method for ordering data packets in a multi-engine parallel processor, so as to solve the packet-ordering problem in multi-engine processing.
To achieve the above object, the technical solution of the present invention is realized as follows:
A method for ordering data packets in a multi-engine parallel processor, where the multi-engine parallel processor contains one or more layers of load balance units (LBU), each LBU corresponds to one load pooling unit (LPU), and the LPU is at the same layer as its corresponding LBU.
In the multi-engine parallel processor, one or more sequences for recording tag information are preconfigured, with a one-to-one correspondence between sequences and LBUs.
Each LBU performs the following steps:
A. On receiving a packet to be processed, the LBU distributes it to a processing module of the next layer according to a preset distribution policy, tags the distributed packet to indicate the load path the packet is on, and records the tag in the sequence corresponding to this LBU.
B. The LBU checks whether the number of records in its sequence has reached the preset flow-control threshold on the number of packets. If so, it stops distributing and repeats step B; otherwise it repeats step A.
The load pooling unit (LPU) corresponding to each LBU, at the same layer as that LBU, performs the following steps:
a. The LPU checks whether the sequence corresponding to its LBU is non-empty. If so, it executes step b; otherwise it repeats step a.
b. The LPU reads one tag from the sequence in order, determines from the tag the location of the corresponding packet, opens the output channel for that location, waits for one complete packet to pass, allowing only that one packet through, and then repeats step a until all packets have been output.
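A minimal software sketch of steps A/B (tagged distribution with flow control) and steps a/b (tag-driven collection) follows. The data structures and names are illustrative; in the patent these are hardware FIFOs and channel-select logic, not Python functions:

```python
from collections import deque

def distribute(packets, num_paths, seq, threshold):
    """Steps A/B: round-robin distribution; each packet's path index is
    recorded as a tag in the LBU's sequence. Distribution stops once the
    sequence holds `threshold` tags (the flow-control condition)."""
    paths = [deque() for _ in range(num_paths)]
    for i, pkt in enumerate(packets):
        if len(seq) >= threshold:          # step B: threshold reached -> stop
            break                          # (a real LBU would stall, not drop)
        path = i % num_paths               # step A: pick a path (round robin)
        paths[path].append(pkt)
        seq.append(path)                   # record the tag in the sequence
    return paths

def collect(paths, seq):
    """Steps a/b: while the sequence is non-empty, decode the next tag,
    open the matching channel, and let exactly one complete packet pass."""
    out = []
    while seq:                             # step a: sequence non-empty?
        path = seq.popleft()               # step b: read one tag in order
        out.append(paths[path].popleft())  # exactly one packet passes
    return out
```

Because each path preserves its own internal order and the tag sequence replays the arrival order, `collect` emits packets in the original input order.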
Preferably, if the LBU is not at the bottom layer, the next-layer processing module in step A is an LBU of the next layer; if the LBU is at the bottom layer, the next-layer processing module in step A is a packet engine (PE).
Preferably, if the LBU is the top-layer LBU, the flow-control threshold in step B is the packet-count flow-control threshold of the data input port; if the LBU is not the top-layer LBU, the flow-control threshold in step B is the packet-count flow-control threshold of its load path.
Preferably, the width Bn of the sequence is
Bn = [log2(mn)], where [ ] denotes rounding up to the next integer and mn is the number of load paths in the layer below the LBU;
the depth of the sequence corresponding to the top-layer LBU is the packet-count flow-control threshold of the data input port;
the depth of the sequence corresponding to a non-top-layer LBU is the packet-count flow-control threshold of the load path one layer above that LBU.
Preferably, the packet-count flow-control threshold of the data input port is determined as follows:
First compute the size of the cache space of the multi-engine parallel processor; then divide that value by the length of the smallest packet to obtain the maximum number of packets that can be held; then determine the packet-count flow-control threshold of the data input port from this maximum number and a preset policy.
Preferably, the cache space of the multi-engine parallel processor is computed as the sum of the cache spaces of all input cache units (ICU) and output cache units (OCU) in the multi-engine parallel processor.
Preferably, the packet-count flow-control threshold of a load path is determined as follows:
First compute the size of all cache space in the load path; then divide that value by the length of the smallest packet to obtain the maximum number of packets that can be held; then determine the packet-count flow-control threshold of the load path from this maximum number and a preset policy.
Preferably, the cache space of a load path is computed as the sum of the cache spaces of all input cache units (ICU) and output cache units (OCU) in that load path.
Preferably, the preset policy determines the flow-control threshold according to resource overhead and/or the impact on caching efficiency.
Preferably, each sequence is carried by a first-in-first-out (FIFO) buffer.
Preferably, the preset distribution policy is a load-balancing policy or a round-robin arbitration policy.
The key of the present invention is to generate a tag code for each packet during distribution of the data stream and store the tags as a sequence; during collection of the data stream, the tags stored in the sequence are decoded one by one, the engine channel or load channel corresponding to each tag is selected, and the corresponding complete packet is output in turn, with only one packet allowed through at a time. This solves the ordering of the output packets.
The invention has the following advantages: no separate queuing machine is needed to order the packets, which reduces system resource overhead; because the ordering function is integrated with the multi-engine parallel processor, the impact of ordering on parallel-processing efficiency is minimized and the likelihood of system congestion is reduced; and the multi-engine parallel processor can guarantee the logical order of the packets without occupying a large amount of the system's shared memory.
Brief Description of the Drawings
Fig. 1 is a schematic diagram of the structure of an 8-engine parallel processor;
Fig. 2 is a schematic diagram of the tag coding for the 8-engine parallel processor structure to which the present invention is applied.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
The idea of the present invention is: generate a tag code for each packet during distribution of the data stream and store the tags as a sequence; during collection of the data stream, decode the tags stored in the sequence one by one, select the engine channel or load channel corresponding to each tag, and output the corresponding complete packet in turn, with only one packet allowed through at a time. This accomplishes the tagging of input packets and the ordering of output packets.
To guarantee that the packets processed by the engines are output in the intended order, each input packet must be uniquely tagged before an LnBU distributes it to a load path or PE; the LnPU then controls the order of the output packets according to the previously written tags.
Here "unique tag" means that, in the limiting case, all packets that the multi-engine parallel processor (i.e., the chip) can hold simultaneously must carry mutually distinct tags; the number of "unique tags" is thus the maximum number of packets the multi-engine parallel processor can hold at once. This "maximum number" is the limit of what the multi-engine parallel processor can accommodate. Usually, to keep the buffers short, the number of packets must be flow-controlled, and flow control requires setting a reasonable flow-control threshold so as to limit its impact on caching efficiency.
The present invention defines two kinds of flow-control thresholds. One is the packet-count flow-control threshold of the data input port, which flow-controls the data stream entering the multi-engine parallel processor; there is exactly one such threshold. The other is the packet-count flow-control threshold of a load path, which flow-controls the data stream entering each load path; there are several of these, one for each load path actually present in the multi-engine parallel processor. The setting of the two kinds of thresholds is described below.
The packet-count flow-control threshold of the data input port is set as follows: first compute the size of the cache space of the multi-engine parallel processor; then divide that value by the length of the smallest packet to obtain the maximum number of packets that can be held; then determine the packet-count flow-control threshold of the data input port from this maximum number and a preset policy. The cache space of the multi-engine parallel processor is computed as the sum of the cache spaces of all input cache units (ICU) and output cache units (OCU) in the multi-engine parallel processor. The preset policy determines the flow-control threshold according to resource overhead and/or the impact on caching efficiency.
Still taking n=2, m1=2, m2=4, P=8 as an example (see Fig. 1): if each I1CU has 8 Kbyte of cache and each O1CU 9 Kbyte, and each I2CU and O2CU has 2 Kbyte, then the cache space of the multi-engine parallel processor, i.e., of the whole chip, is (cache per I1CU × count) + (cache per O1CU × count) + (cache per I2CU × count) + (cache per O2CU × count) = 8×2 + 9×2 + 2×8 + 2×8 = 66 Kbyte. Assuming that in the limiting case all packets are 64-byte minimum-size packets, the number of packets the multi-engine parallel processor can hold simultaneously is 66 Kbyte / 64 byte, about 1000. Here, according to the predetermined policy, the packet-count flow-control threshold of the data input port is set to 256. That is, when the number of packets held simultaneously in the multi-engine parallel processor reaches 256, the input packets must be flow-controlled.
The packet-count flow-control threshold of a load path is set as follows: first compute the size of all cache space in the load path; then divide that value by the length of the smallest packet to obtain the maximum number of packets that can be held; then determine the packet-count flow-control threshold of the load path from this maximum number and a preset policy. The cache space of a load path is computed as the sum of the cache spaces of all ICUs and OCUs in that load path. The preset policy determines the flow-control threshold according to resource overhead and/or the impact on caching efficiency.
Still taking n=2, m1=2, m2=4, P=8 as an example (see Fig. 1): if each I2CU and O2CU has 2 Kbyte of cache, the total cache space in each load path is (cache per I2CU × count) + (cache per O2CU × count) = 2×4 + 2×4 = 16 Kbyte. Assuming all packets are 64-byte minimum-size packets, the load path can hold 16 Kbyte / 64 byte = 256 packets simultaneously. Here, according to the predetermined policy, the packet-count flow-control threshold of the load path is set to 100. That is, when the number of packets held simultaneously in the load path reaches 100, the packets entering that load path must be flow-controlled.
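The two worked examples above can be reproduced numerically. Note that the thresholds 256 and 100 are policy choices stated in the text, not values computed by the formula:

```python
def max_packets(cache_bytes, min_packet_bytes=64):
    """Upper bound on packets held at once: total cache / smallest packet."""
    return cache_bytes // min_packet_bytes

# Whole chip: 2 x 8KB I1CU + 2 x 9KB O1CU + 8 x 2KB I2CU + 8 x 2KB O2CU
chip_kb = 8*2 + 9*2 + 2*8 + 2*8          # = 66 Kbyte
print(max_packets(chip_kb * 1024))       # 1056, the "about 1000" in the text

# One load path: 4 x 2KB I2CU + 4 x 2KB O2CU
path_kb = 2*4 + 2*4                      # = 16 Kbyte
print(max_packets(path_kb * 1024))       # 256
```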
As is well known, a multi-engine parallel processor usually contains more than one layer of load balance units (LBU), and each LBU corresponds to a load pooling unit (LPU) at the same layer. For convenience, in this application the LBU that receives the data stream from outside the multi-engine parallel processor is called the top-layer LBU, and the LBUs that distribute the data stream to the PEs are called the bottom-layer LBUs.
The top-layer LBU flow-controls the data stream entering the multi-engine parallel processor, so its packet-count threshold is the packet-count flow-control threshold of the data input port; a non-top-layer LBU flow-controls the data stream entering its load path, so its packet-count threshold is the packet-count flow-control threshold of that load path.
Each packet has P possible destinations in the multi-engine parallel processor, i.e., one of the P engines. If the mn load paths of load layer n are numbered 0 to mn−1, the number of coding bits Bn of load layer n is
Bn = [log2(mn)]
where [ ] denotes rounding up to the next integer, i.e., any fractional part is rounded up. Concatenating the per-layer codes Bi in order yields the tag code.
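The width formula and the composition of the tag can be sketched as follows (the helper names are illustrative):

```python
from math import ceil, log2

def layer_bits(m):
    """B_n = [log2(m_n)], rounded up: bits needed to name one of m paths."""
    return ceil(log2(m))

def tag_width(paths_per_layer):
    """Concatenating the per-layer fields in order gives the full tag width."""
    return sum(layer_bits(m) for m in paths_per_layer)

print(layer_bits(2))       # 1 bit for the m1=2 layer
print(layer_bits(4))       # 2 bits for the m2=4 layer
print(tag_width([2, 4]))   # 3 bits: enough to name one of P=8 engines
```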
The coding is explained below, again taking n=2, m1=2, m2=4, P=8 as an example. Referring to Fig. 2, a schematic diagram of the tag coding for the 8-engine parallel processor structure to which the present invention is applied: in this example each packet has only 8 possible destinations in the multi-engine parallel processor, i.e., one of the 8 PEs.
When the L1BU distributes an input packet to one of the two load channels it has two choices, so the L1BU records a tag of B1 = [log2(m1)] = [log2 2] = 1 coded bit, producing the first bit of the tag; that is, 0 and 1 distinguish the two load channels. Referring to Fig. 2, a packet distributed to load channel 0 is tagged 1'b0 and a packet distributed to load channel 1 is tagged 1'b1. The tag bit of each packet is saved, in order, in a sequence defined as S_load11. The width of sequence S_load11 is B1, i.e., one bit in this example, and its maximum depth is the packet-count flow-control threshold of the data input port.
When L2BU0 distributes a packet of load path 0 to one of the four packet engines it has four choices, so L2BU0 records a tag of B2 = [log2(m2)] = [log2 4] = 2 coded bits, producing the second and third bits of the tag. Referring to Fig. 2, the tags for the four PEs are 2'b11, 2'b10, 2'b01 and 2'b00. The tag bits of each packet are saved, in order, in a sequence defined as S_load21. The width of sequence S_load21 is B2, i.e., two bits in this example, and its maximum depth is the maximum number of packets that load path 0 can hold simultaneously, i.e., the packet-count flow-control threshold of load path 0.
L2BU1 is at the same layer as L2BU0 and in the same environment, so its processing is identical to that of L2BU0: when L2BU1 distributes a packet of load path 1 to one of the four packet engines it has four choices, coded as 2 bits, producing the second and third bits of the tag. The tag bits of each packet are saved, in order, in a sequence defined as S_load22. The width of sequence S_load22 is two bits and its maximum depth is the maximum number of packets that load path 1 can hold simultaneously, i.e., the packet-count flow-control threshold of load path 1.
It can be seen from this that the depth of the sequence corresponding to the top-layer LBU is the packet-count flow-control threshold of the data input port, and the depth of the sequence corresponding to a non-top-layer LBU is the packet-count flow-control threshold of the load path one layer above that LBU.
As the above shows, to achieve ordering, one or more sequences for recording tag information must be preconfigured in the multi-engine parallel processor, with a one-to-one correspondence between sequences and LBUs; this establishes the one-to-one correspondence among LBU, LPU and sequence. The subscript nk of a sequence S_load is read as follows: n is the load layer; if K is the total number of LBUs in that layer, then k identifies one of them, with 1 ≤ k ≤ K.
With the above preparation in place, the packet-ordering process is described in detail below.
The LBUs in the multi-engine parallel processor perform the following steps:
A. On receiving a packet to be processed, the LBU distributes it to a processing module of the next layer according to a preset distribution policy, such as a load-balancing policy or a round-robin arbitration policy, tags the distributed packet to indicate the load path the packet is on, and records the tag in the sequence corresponding to this LBU.
B. The LBU checks whether the number of records in its sequence has reached the preset flow-control threshold on the number of packets. If so, it stops distributing and repeats step B; otherwise it repeats step A.
If the LBU is not at the bottom layer, the next-layer processing module in step A is an LBU of the next layer; if the LBU is at the bottom layer, the next-layer processing module in step A is a PE.
The LPU corresponding to each LBU in the multi-engine parallel processor, at the same layer as that LBU, performs the following steps:
a. After the LPU determines that the sequence corresponding to its LBU is non-empty, it executes step b.
b. The LPU reads one tag from the sequence in order, determines from the tag the location of the corresponding packet, opens the output channel for that location, waits for one complete packet to pass, allowing only that one packet through, and then repeats step a until all packets have been output.
The following again takes n=2, m1=2, m2=4, P=8 as an example; see Figs. 1 and 2.
When the L1BU distributes an input packet to one of the two load channels, it tags the packet for the first time and writes the tag, in order, into sequence S_load11. When the number of records in S_load11 reaches the preset packet-count flow-control threshold of the data input port, i.e., when S_load11 is full, the L1BU stops distributing packets, thereby flow-controlling the chip's data input port.
When L2BU0 distributes a packet of load path 0 to one of the four packet engines, it tags the packet for the second time and writes the tag, in order, into sequence S_load21. When S_load21 is full, i.e., the packet-count flow-control threshold of load path 0 has been reached, it stops distributing packets, flow-controlling the data input of load path 0.
When L2BU1 distributes a packet of load path 1 to one of the four packet engines, it tags the packet for the second time and writes the tag, in order, into sequence S_load22. When S_load22 is full, i.e., the packet-count flow-control threshold of load path 1 has been reached, it stops distributing packets, flow-controlling the data input of load path 1.
当L2PU0判断出序列S_load21中记录的信息为非空时,L2PU0按顺序从序列S_load21中读取标记信息,通过对标记的解码获取对应该标记的包文的位置信息,比如标记为2’b01,表示最近需要通过的包位于通道1,L2PU0预先打开从引擎1到负载通道0的通道,等待一个完整的数据包通过,且只允许一个数据包通过后,依次打开下一个包文通过的通道,直至负载通道0中的所有包文通过L2PU0输出。When L2PU0 determines that the information recorded in sequence S_load 21 is not empty, L2PU0 reads the tag information from sequence S_load 21 in order, and obtains the position information of the packet corresponding to the tag by decoding the tag, for example, the tag is 2' b01, indicating that the most recent packet that needs to pass is located in
When L2PU1 determines that the sequence S_load22 is non-empty, it reads the tag information from S_load22 in order and decodes each tag to obtain the location of the packet corresponding to that tag. For example, the tag 2'b01 indicates that the next packet due to pass is on engine channel 1, so L2PU1 opens the path from engine 1 to load channel 1 in advance, waits for one complete packet to pass through, and allows only that one packet through before opening the path for the next packet in turn, until all packets on load channel 1 have been output through L2PU1.
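The two collection paragraphs above can be condensed into one sketch (a hypothetical Python model, not the hardware; the name `l2pu_collect` and the string encoding of tags are assumptions): a 2-bit tag such as 2'b01 is decoded to an engine index, and exactly one packet is forwarded per tag.

```python
from collections import deque

def l2pu_collect(s_load2, engine_channels):
    """Sketch of an L2PU: decode each 2-bit tag in FIFO order, open the
    path from that engine to the load channel, and let exactly one
    complete packet through per tag."""
    out = []
    while s_load2:
        tag = s_load2.popleft()                 # e.g. "01" stands for 2'b01
        engine = int(tag, 2)                    # decode: 2'b01 -> engine 1
        out.append(engine_channels[engine].popleft())  # one packet only
    return out

# Tags recorded by the L2BU in dispatch order; engine queues hold results.
engines = {0: deque(["a"]), 1: deque(["b", "d"]), 2: deque(["c"]), 3: deque()}
tags = deque(["00", "01", "10", "01"])
print(l2pu_collect(tags, engines))              # ['a', 'b', 'c', 'd']
```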
When the L1PU determines that the sequence S_load11 is non-empty, it reads the tag information from S_load11 in order and decodes each tag to obtain the location of the packet corresponding to that tag. For example, the tag 2'b1 indicates that the next packet due to pass is on load channel 1, so the L1PU opens the path from load channel 1 to the chip output port in advance, waits for one complete packet to pass through, and allows only that one packet through before opening the path for the next packet in turn, until all packets in the multi-engine parallel processor have been output through the L1PU.
Each of the sequences S_load11, S_load21, and S_load22 is carried by its own first-in-first-out (FIFO) buffer.
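Putting the pieces together, a small end-to-end model (a sketch under the assumption that each sequence is an in-memory FIFO, here `collections.deque`; the function name and random dispatch policy are hypothetical) shows that replaying the two levels of tags restores the original packet order for n=2, m1=2, m2=4. Unlike the real pipeline, this model collects each load channel fully before the L1PU stage, which does not affect the ordering result.

```python
from collections import deque
import random

def two_level_order_demo(n_packets=8, seed=0):
    """Model the n=2, m1=2, m2=4 pipeline: each sequence is a FIFO.
    Distribution records tags; collection replays them, restoring order."""
    rng = random.Random(seed)
    s_load11 = deque()                       # first-level tag FIFO
    s_load2 = [deque(), deque()]             # per-load-channel tag FIFOs
    engines = [[deque() for _ in range(4)] for _ in range(2)]

    # Distribution: the L1BU picks a load channel, the L2BU picks an
    # engine; each level appends its tag to the corresponding FIFO.
    for pkt in range(n_packets):
        lc = rng.randrange(2)
        s_load11.append(lc)
        eng = rng.randrange(4)
        s_load2[lc].append(eng)
        engines[lc][eng].append(pkt)

    # Collection: the L2PUs and the L1PU replay the tags in FIFO order.
    load_out = [deque(), deque()]
    for lc in range(2):
        while s_load2[lc]:
            load_out[lc].append(engines[lc][s_load2[lc].popleft()].popleft())
    return [load_out[s_load11.popleft()].popleft() for _ in range(n_packets)]

print(two_level_order_demo())                # [0, 1, 2, 3, 4, 5, 6, 7]
```

Whatever channels and engines the random dispatch chooses, the output comes back as 0..n_packets-1 in order, because each stage's tag FIFO records exactly the arrival order at that stage.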
The embodiments above all take n=2, m1=2, m2=4, P=8 as an example. In practical applications, the number of load levels may be n=2, 3, 4, 5, ... and is not limited to these values; likewise, the number of load channels per level may be m=2, 3, 4, 5, ... and is not limited to these values.
Furthermore, the method of the present invention for packet ordering in a multi-engine parallel processor is not only applicable to a multi-engine parallel processor with the structure shown in FIG. 1; it applies equally to multi-engine processors of other structures, as long as they have a multi-level, multi-channel structure.
The preferred embodiments described above are not intended to limit the present invention; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (11)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNB2005100932204A CN100349122C (en) | 2005-08-19 | 2005-08-19 | Method for realizing data packet sequencing for multi engine paralled processor |
Publications (2)
Publication Number | Publication Date |
---|---|
CN1851649A CN1851649A (en) | 2006-10-25 |
CN100349122C true CN100349122C (en) | 2007-11-14 |
Family
ID=37133128
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101770362B (en) * | 2009-01-06 | 2013-04-03 | 中国科学院计算技术研究所 | Distributed dynamic process generating unit meeting System C processor |
CN112732241B (en) * | 2021-01-08 | 2022-04-01 | 烽火通信科技股份有限公司 | Programmable analyzer under multistage parallel high-speed processing and analysis method thereof |
CN117579565B (en) * | 2023-11-03 | 2025-06-27 | 中科驭数(北京)科技有限公司 | Data packet associated data processing method and system |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1173255A (en) * | 1995-01-13 | 1998-02-11 | 摩托罗拉公司 | Method and apparatus for encoding and decoding information in a digital communication system |
CN1284673A (en) * | 1999-05-31 | 2001-02-21 | 德国汤姆森-布兰特有限公司 | Data pack preprocessing method and bus interface and data processing unit thereof |
US6457121B1 (en) * | 1999-03-17 | 2002-09-24 | Intel Corporation | Method and apparatus for reordering data in X86 ordering |
US6594722B1 (en) * | 2000-06-29 | 2003-07-15 | Intel Corporation | Mechanism for managing multiple out-of-order packet streams in a PCI host bridge |
WO2004013752A1 (en) * | 2002-07-26 | 2004-02-12 | Koninklijke Philips Electronics N.V. | Method and apparatus for accessing multiple vector elements in parallel |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN1201532C (en) | Quick-circulating port dispatcher for high-volume asynchronous transmission mode exchange | |
US8099521B2 (en) | Network interface card for use in parallel computing systems | |
US8458267B2 (en) | Distributed parallel messaging for multiprocessor systems | |
US9537772B2 (en) | Flexible routing tables for a high-radix router | |
KR20200135780A (en) | Mediating parts of a transaction through a virtual channel associated with the interconnect | |
US20070268903A1 (en) | System and Method for Assigning Packets to Output Queues | |
CN1643499A (en) | Thread signaling in multi-threaded network processor | |
US10146468B2 (en) | Addressless merge command with data item identifier | |
CN103946803A (en) | Processor with efficient work queuing | |
US20200136986A1 (en) | Multi-path packet descriptor delivery scheme | |
CN1498374A (en) | Apparatus and method for efficiently sharing memory bandwidth in network processor | |
CN101072176A (en) | Report processing method and system | |
CN102158408B (en) | Method for processing data stream and device thereof | |
TWI536772B (en) | Direct provision of information to the technology of the agreement layer | |
CN100349122C (en) | Method for realizing data packet sequencing for multi engine paralled processor | |
US9846662B2 (en) | Chained CPP command | |
US20040078459A1 (en) | Switch operation scheduling mechanism with concurrent connection and queue scheduling | |
US9148270B2 (en) | Method and apparatus for handling data flow in a multi-chip environment using an interchip interface | |
US7460544B2 (en) | Flexible mesh structure for hierarchical scheduling | |
US9804959B2 (en) | In-flight packet processing | |
US9665519B2 (en) | Using a credits available value in determining whether to issue a PPI allocation request to a packet engine | |
CN117692408A (en) | CAN frame sending method, device and system, computing equipment and storage medium | |
CN117440053A (en) | Multistage cross die access method and system | |
US7272151B2 (en) | Centralized switching fabric scheduler supporting simultaneous updates | |
US20250165425A1 (en) | Cxl fabric extensions |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
ASS | Succession or assignment of patent right | Owner name: SHENZHEN HAISI SEMICONDUCTOR CO., LTD.; Former owner: HUAWEI TECHNOLOGY CO., LTD.; Effective date: 20081010 |
C41 | Transfer of patent application or patent right or utility model | |
TR01 | Transfer of patent right | Effective date of registration: 20081010; Address after: HUAWEI electric production center, Bantian HUAWEI base, Longgang District, Shenzhen, Guangdong; Patentee after: HISILICON TECHNOLOGIES Co.,Ltd.; Address before: Bantian HUAWEI headquarters office building, Longgang District, Shenzhen, Guangdong; Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd. |
CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20071114 |