CN103152281B

CN103152281B - Two-level switch-based load balanced scheduling method

Info

Publication number: CN103152281B
Application number: CN201310069391.8A
Authority: CN
Inventors: 戴艺; 肖立权; 伍楠; 曹继军; 高蕾; 张鹤颖; 童元满; 董德尊; 王绍刚; 沈胜宇; 刘路; 肖灿文; 张磊; 王永庆; 齐星云; 陆平静
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2013-03-05
Filing date: 2013-03-05
Publication date: 2014-09-17
Anticipated expiration: 2033-03-05
Also published as: CN103152281A

Abstract

A load balancing scheduling method based on two-level switching. The first-level input port buffers arriving cells in the VOQ queue according to the destination port. The scheduler switches the message to the second-level input port through the first-level switching network, and the VOQ queue The k cells from the same flow in the network are called a unit frame; each input port of the first stage executes the minimum length allocation according to the traffic distribution matrix, and in k consecutive external time slots, the same The unit frame of a flow is sent to the fixed mapping area of the flow; N second-level input ports are divided into N/k groups in turn, and each group contains k consecutive second-level input ports to form an area, each area according to the purpose The port buffers the cell in the OQ queue, and switches it to the destination output port through the second-level switching network. The invention has the advantages of simple dispatching process, no calculation or communication, easy hardware implementation, 100% throughput and guaranteed message order.

Description

Load Balance Scheduling Method Based on Two-Level Switching

技术领域technical field

本发明主要涉及到并行交换结构中的报文调度领域，特指一种在并行交换结构中实现负载均衡且报文保序的报文调度方法。The invention mainly relates to the field of message scheduling in a parallel switching structure, in particular to a message scheduling method for realizing load balancing and message order preservation in a parallel switching structure.

背景技术Background technique

研究者采用主动测量和被动测量方式对报文乱序行为做了大量研究。J.Bennet在MAE-East ISP的交换中心测量了报文乱序情况，发现在重负载、网络设备并行度高的测量环境下，报文乱序情况非常严重，90％以上的连接发生乱序。J.Bennett分析这种高乱序率主要来自网络内部的局部并行处理机制，包括并行交换设备和并行传输链路，并指出报文乱序不是网络的病态行为。多数高性能并行交换结构存在报文乱序现象，例如负载均衡交换结构(load-balanced switch)、并行报文交换结构(parallel packet switch)、多级交换结构(multi-levelswitch)等均受到报文乱序问题的困扰。深度并行设计给交换结构带来了负载均衡和报文乱序两大问题。负载均衡是实现延迟和吞吐率保证的关键，但负载均衡可能导致报文乱序，乱序报文会损害Internet网况，因为Internet广泛使用的TCP传输协议会错误的将乱序报文看作是报文丢失拥塞发生的标志，从而引发不必要的重传及TCP超时。这些重传和超时将降低TCP吞吐率提高报文延迟，因此实现输入流量负载均衡的同时保证报文的顺序是极其必要的。Researchers have done a lot of research on packet out-of-sequence behavior by using active measurement and passive measurement methods. J.Bennet measured the out-of-sequence of packets at the switching center of MAE-East ISP, and found that in the measurement environment of heavy load and high parallelism of network equipment, the out-of-sequence of packets is very serious, and more than 90% of the connections have out-of-sequence . J.Bennett analyzed that the high out-of-sequence rate mainly comes from the local parallel processing mechanism inside the network, including parallel switching devices and parallel transmission links, and pointed out that packet out-of-order is not a pathological behavior of the network. Most high-performance parallel switching structures have out-of-order packets, such as load-balanced switch, parallel packet switch, multi-level switch, etc. Trouble with out-of-sequence problems. The deep parallel design brings load balancing and packet out-of-sequence problems to the switching fabric. Load balancing is the key to guaranteeing latency and throughput, but load balancing may cause out-of-order packets, which will damage the Internet network, because the TCP transmission protocol widely used on the Internet will mistakenly treat out-of-order packets as It is a sign of packet loss and congestion, which causes unnecessary retransmission and TCP timeout. These retransmissions and timeouts will reduce the TCP throughput and increase packet delay. Therefore, it is extremely necessary to ensure the sequence of packets while implementing load balancing of input traffic.

防止报文乱序的方法可以分为两类：1)限定乱序报文的数量，在输出端设置容量有限的重定序缓冲区，用于重定序乱序报文；2)保证报文按照到达顺序离开输出端口，从而避免了报文乱序。由于缓冲区的容量有限，第一种方法只能处理一定范围内的乱序报文，如果将重定序缓冲区尺寸增加至O(N²)，虽然能够完全解决报文乱序的问题但会相应地以二次方的时间尺度增加报文延迟，其中N为端口数目。因此，限定乱序报文的数量并不能有效解决报文乱序问题，并且难以适应路由器高端口密度的需求。斯坦福大学提出的满帧优先(Full OrderedFrame First，简称FOFF)算法是第一种方法的代表算法，它允许路由器中存在一定数量的乱序报文，在输出端设置了N×N的缓冲队列用于重定序乱序报文。可以证明，为了保证信元按序离开，FOFF算法重定序缓冲区容量至多为N²个信元且能够提供负载均衡，获得100％吞吐率。近年来，更多的研究者趋向于采用第二种方法来保证报文的顺序，消除了输出端的重定序操作及缓冲区开销，有利于提高延迟性能。这类调度方法的共同特征是通过某种消息传递机制获取所有到达报文的状态信息，基于全局报文状态信息执行集中式调度。例如，邮箱交换(mailbox switch)方法采用对称型连接模式为通告报文离开时间创造反馈通路，调度器根据报文的离开时间调度报文。该策略能够保证每条流的报文按照其到达顺序离开交换系统但不能提供负载均衡实现100％吞吐率。交替匹配调度方法采用集中式的调度模式，假设流量特征是预知的且固定不变，采用矩阵分解的方法离线解决集中式调度问题，分布式实现在线调度并提供服务保证。然而，当流量变得不可预知并动态改变时，难以在大的交换尺寸下满足集中式的调度需求。并行匹配调度方法通过在第一级输入线卡和第二级输入线卡之间传递请求-许可令牌来实现报文保序，但在具体硬件实现中令牌在线卡间的频繁传递将成倍增加调度周期。The methods to prevent out-of-sequence messages can be divided into two categories: 1) limit the number of out-of-order messages, and set a reordering buffer with limited capacity at the output end for reordering out-of-order messages; 2) ensure that messages follow the Arrival order leaves the output port, thereby avoiding out-of-order packets. Due to the limited capacity of the buffer, the first method can only handle out-of-order packets within a certain range. If the size of the reordering buffer is increased to O(N ² ), although the problem of out-of-order packets can be completely solved, it will Correspondingly, the packet delay is increased on a quadratic time scale, where N is the number of ports. Therefore, limiting the number of out-of-sequence packets cannot effectively solve the problem of out-of-sequence packets, and it is difficult to meet the requirements of high port density of routers. The Full Ordered Frame First (FOFF) algorithm proposed by Stanford University is a representative algorithm of the first method, which allows a certain number of out-of-order packets in the router, and sets an N×N buffer queue at the output end for for reordering out-of-sequence packets. It can be proved that in order to ensure that cells leave in sequence, the reordering buffer capacity of FOFF algorithm is at most N ² cells and can provide load balancing to obtain 100% throughput. In recent years, more researchers tend to adopt the second method to ensure the order of packets, which eliminates the reordering operation and buffer overhead at the output end, which is beneficial to improve the delay performance. The common feature of these scheduling methods is to obtain the state information of all arriving packets through a message passing mechanism, and perform centralized scheduling based on the global packet state information. For example, the mailbox switch (mailbox switch) method uses a symmetrical connection mode to create a feedback path for notifying the departure time of a message, and the scheduler schedules the message according to the message's departure time. This strategy can ensure that the packets of each flow leave the switching system in the order they arrive, but cannot provide load balancing to achieve 100% throughput. The alternate matching scheduling method adopts a centralized scheduling mode, assuming that the traffic characteristics are predictable and fixed, and adopts the method of matrix decomposition to solve the centralized scheduling problem offline, and realizes online scheduling and provides service guarantee in a distributed manner. However, when the traffic becomes unpredictable and changes dynamically, it is difficult to meet the centralized scheduling requirements under large switch sizes. The parallel matching scheduling method realizes message order preservation by transferring request-permission tokens between the first-level input line card and the second-level input line card, but in the specific hardware implementation, the frequent transfer of tokens between line cards will become a problem. Double the scheduling cycle.

发明内容Contents of the invention

本发明要解决的技术问题就在于：针对现有技术存在的技术问题，本发明提供一种调度过程简单、无需任何计算或者通信、易于硬件实现、实现了100％的吞吐率并能保证报文的顺序的基于两级交换的负载均衡调度方法。The technical problem to be solved by the present invention is: aiming at the technical problems existing in the prior art, the present invention provides a scheduling process that is simple, does not require any calculation or communication, is easy to implement in hardware, achieves 100% throughput and can guarantee message A sequential load balancing scheduling method based on two-level switching.

为解决上述技术问题，本发明采用以下技术方案：In order to solve the problems of the technologies described above, the present invention adopts the following technical solutions:

一种基于两级交换的负载均衡调度方法，第一级输入端口将到达信元按照目的端口缓冲在VOQ队列，调度器通过第一级交换网络将报文交换至第二级输入端口，VOQ队列中来自同一条流的k个信元称之为一个单位帧，单位帧是最小调度单元；第一级每个输入端口根据流量分布矩阵执行最小长度分派，在k个连续的外部时间槽，通过第一级交换网络将同一条流的单位帧发送至该流固定的映射区域；N个第二级输入端口被依次划分为N/k组，每组含k个连续的第二级输入端口构成一个区域，各区域按照目的端口将信元缓冲在OQ队列，由于流到区域的映射关系固定，来自同一单位帧的k个信元将依次到达OQ队列头位置，通过第二级交换网络交换至目的输出端口。A load balancing scheduling method based on two-level switching. The first-level input port buffers arriving cells in the VOQ queue according to the destination port. The scheduler switches the message to the second-level input port through the first-level switching network, and the VOQ queue The k cells from the same flow are called a unit frame, and the unit frame is the smallest scheduling unit; each input port of the first level performs the minimum length assignment according to the traffic distribution matrix, and in k consecutive external time slots, through The first-level switching network sends unit frames of the same flow to the fixed mapping area of the flow; N second-level input ports are sequentially divided into N/k groups, and each group consists of k consecutive second-level input ports In one area, each area buffers cells in the OQ queue according to the destination port. Since the mapping relationship between flow and area is fixed, k cells from the same unit frame will arrive at the head of the OQ queue in sequence, and are switched to the OQ queue through the second-level switching network. Destination output port.

作为本发明的进一步改进，采用双循环方式构建流到区域的映射：As a further improvement of the present invention, a double-loop method is used to construct the mapping from flow to region:

(1.1)以循环方式构建第一级输入端口N/k个流分支到第二级输入端口N/k个区域的映射，以保证每个区域恰能涵盖所有的流；(1.1) Construct the mapping from the N/k flow branches of the first-level input port to the N/k regions of the second-level input port in a cyclic manner, so as to ensure that each region can just cover all the flows;

(1.2)进一步以循环方式调整不同输入端口与N/k种映射方式的关联，以保证每个区域按照输入端口的均衡分布涵盖所有的流。(1.2) Further adjust the association between different input ports and N/k mapping methods in a cyclic manner to ensure that each area covers all flows according to the balanced distribution of input ports.

作为本发明的进一步改进，在输入端口调度时根据流到区域的映射关系分派信元以轮询方式服务于N/k个流分支，所述第一级输入端口在每个外部时间槽的操作流程如下：As a further improvement of the present invention, during input port scheduling, cells are assigned according to the mapping relationship between streams and regions to serve N/k stream branches in a round-robin manner, and the operation of the first-level input port at each external time slot The process is as follows:

(2.1)若流分支VOQ队列存在满帧，则优先调度满帧，赋予该满帧N/k个单位帧最高调度优先级，查找流量分布矩阵L，截取满帧第一个单位帧，发送到目前L_g，j最小的区域R_g，即执行步骤(2.1.1)～(2.1.k)，若不存在满帧则转步骤(2.2)；(2.1) If the flow branch If there is a full frame in the VOQ queue, the full frame is scheduled first, and the N/k unit frames of the full frame are given the highest scheduling priority, the traffic distribution matrix L is searched, the first unit frame of the full frame is intercepted, and sent to the current L _{g, j is} the smallest region R _g , that is, execute steps (2.1.1)~(2.1.k), if there is no full frame, go to step (2.2);

(2.1.1)若输入端口i到第二级输入端口S_g，1的内部链路空闲，则将VOQ_i，j队列头信元发送到区域R_g第二级输入端口S_g，1，否则f＝(f+1)modN/k，转步骤(2.1)；(2.1.1) If the internal link from the input port i to the second-stage input port S _{g, 1} is idle, then the VOQ _{i, j} queue head cell is sent to the second-stage input port S _{g, 1} of the region R _g , Otherwise f=(f+1)modN/k, go to step (2.1);

(2.1.2)发送VOQ_i，j队列头信元到区域R_g第二级输入端口S_g，2；(2.1.2) send VOQ _{i, j} queue head cell to area R _g second stage input port S _{g, 2} ;

(2.1.3)发送VOQ_i，j队列头信元到区域R_g第二级输入端口S_g，3；(2.1.3) send VOQ _{i, j} queue head cell to area R _g second stage input port S _{g, 3} ;

依此类推至(2.1.k)发送VOQ_i，j队列头信元到区域R_g第二级输入端口S_g，k，g＝(g+1)modN/k，转步骤(2.1)；And so on to (2.1.k) to send VOQ _{i, j} queue head cell to the second stage input port S _g of area R _{g, k} , g=(g+1)modN/k, go to step (2.1);

(2.2)若流分支VOQ队列、VOQ_i，j(kf≤j＜kf+k)存在最高调度优先级单位帧，则查找流量分布矩阵L，将该单位帧发送到目前L_g，j最小的区域R_g，即执行步骤(2.1.1)～(2.1.k)，否则转步骤(2.3)；(2.2) If the flow branch VOQ queue, VOQ _{i, j} (kf≤j<kf+k) has the highest scheduling priority unit frame, then search the traffic distribution matrix L, and send the unit frame to the current region R _g with the smallest L _{g, j} , that is, execute Steps (2.1.1) ~ (2.1.k), otherwise go to step (2.3);

(2.3)若流分支VOQ队列存在一个或多个单位帧，则查找流量分布矩阵L，根据查找结果选择最小均衡系数VOQ队列的单位帧，将其发送到流分支的固定映射区域R_g，即执行步骤(2.1.1)～(2.1.k)，若仅含一个单位帧则可跳过流量分布矩阵查找操作，直接将流分支唯一的单位帧发送到其固定映射区域R_g，否则流分支不含任何单位帧，g＝(g+1)modN/k，转步骤(2.1)。(2.3) If the flow branch If there are one or more unit frames in the VOQ queue, search the traffic distribution matrix L, select the unit frame of the VOQ queue with the smallest equalization coefficient according to the search result, and send it to the flow branch The fixed mapping region R _g , that is, execute steps (2.1.1)~(2.1.k), if there is only one unit frame, the search operation of the flow distribution matrix can be skipped, and the unique unit frame of the flow branch can be directly sent to its fixed map region R _g , otherwise flow branch Does not contain any unit frame, g=(g+1)modN/k, go to step (2.1).

与现有技术相比，本发明的优点在于：Compared with the prior art, the present invention has the advantages of:

1、本发明调度方法可分布于各输入端口独立执行，根据本地VOQ队列信息分派信元，不需要任何通信开销，以O(1)时间复杂度实现了100％的吞吐率并能保证报文的顺序。1. The dispatching method of the present invention can be distributed to each input port for independent execution, distributes cells according to the local VOQ queue information, does not require any communication overhead, and realizes 100% throughput with O(1) time complexity and can guarantee message Order.

2、本发明在各输入端口调度器间没有任何通信开销的情况下，实现了报文保序和负载均衡。通过构建流到区域的固定映射，避免了报文乱序，消除了报文重定序开销；为避免流量区域集中现象，采用双循环(dual-rotation)方式构建不同输入端口的流到区域的映射关系，每个输入端口维护全局统一视图的流量分布矩阵，根据流量分布矩阵调度单位帧。可以证明，对任意输出端口j，同一区域OQ_j队列长度相同且不同区域OQ_j队列长度至多差1，从而实现了100％负载均衡度。2. The present invention realizes packet order preservation and load balancing without any communication overhead among the input port schedulers. By constructing a fixed mapping from flow to region, packet disorder is avoided and packet reordering overhead is eliminated; in order to avoid traffic concentration in regions, a dual-rotation method is used to construct flow-to-region mapping of different input ports Relationship, each input port maintains a traffic distribution matrix with a global unified view, and schedules unit frames according to the traffic distribution matrix. It can be proved that for any output port j, the queue lengths of OQ _j in the same area are the same and the queue lengths of OQ _j in different areas differ by at most 1, thus achieving 100% load balancing.

3、本发明只需适当选取聚合粒度k，理论上可获得最低延迟。通过模拟验证本发明调度方法在不同聚合粒度k下的延迟性能，并与目前主流的负载均衡调度算法进行比较。模拟结果显示，当聚合粒度k＝2时，本发明在目前所有能够保证报文顺序的调度算法中具有最优延迟性能，并在突发流量模型下，表现出与不具备报文保序特性的算法相当的性能。3. The present invention only needs to properly select the aggregation particle size k, and theoretically the lowest delay can be obtained. The delay performance of the scheduling method of the present invention under different aggregation granularity k is verified by simulation, and compared with the current mainstream load balancing scheduling algorithm. The simulation results show that when the aggregation granularity k=2, the present invention has the optimal delay performance among all current scheduling algorithms that can guarantee the order of messages, and under the burst traffic model, it exhibits and does not have the property of message order preservation algorithm with comparable performance.

4、本发明根据流到区域的固定映射关系调度报文，调度过程简单，无需任何计算或者通信，易于硬件实现。4. The present invention schedules messages according to the fixed mapping relationship between streams and regions, the scheduling process is simple, no calculation or communication is required, and hardware implementation is easy.

附图说明Description of drawings

图1是本发明调度方法所适用的二级交换体系结构的一个示例。FIG. 1 is an example of a secondary switching architecture to which the scheduling method of the present invention is applied.

图2是本发明在具体应用实例中构建流到区域的映射方法在端口数目N＝32，聚合粒度k＝8时，采用双循环映射方式得到的流到区域的映射结果示意图。Fig. 2 is a schematic diagram of the flow-to-area mapping result obtained by using a double-loop mapping method when the number of ports N=32 and the aggregation granularity k=8 are constructed by the present invention in a specific application example.

图3是本发明在具体应用实例中执行负载均衡调度方法流程示意图。Fig. 3 is a schematic flow chart of the method for executing load balancing scheduling in a specific application example of the present invention.

图4是本发明在具体应用实例中于流量突发情况下采用本发明最小长度分派后，信元在第二级输入端口OQ缓冲区的分布情况示意图。Fig. 4 is a schematic diagram of the distribution of cells in the OQ buffer of the second-level input port after the minimum length allocation of the present invention is adopted in the case of a traffic burst in a specific application example of the present invention.

具体实施方式Detailed ways

以下将结合说明书附图和具体实施例对本发明做进一步详细说明。The present invention will be further described in detail below in conjunction with the accompanying drawings and specific embodiments.

本发明基于两级交换的负载均衡调度方法，首先第一级输入端口将到达信元按照目的端口缓冲在VOQ队列，调度器通过第一级交换网络(如图所示的Mesh网络)将报文交换至第二级输入端口，VOQ队列中来自同一条流的k个信元，称之为一个单位帧，单位帧是本发明最小调度单元。第一级每个输入端口独立执行本发明的调度方法，根据流量分布矩阵执行最小长度分派，在k个连续的外部时间槽，通过第一级交换网络(Mesh网络)将同一条流的单位帧发送至该流固定的映射区域。N个第二级输入端口被依次划分为N/k组，每组含k个连续的第二级输入端口构成一个区域；各区域按照目的端口将信元缓冲在OQ队列，由于流到区域的映射关系固定，来自同一单位帧的k个信元将依次到达OQ队列头位置，通过第二级交换网络(如图所示的Mesh网络)交换至目的输出端口。在上述过程中，每个输入端口独立执行基于流映射的信元分派算法：将同一条流的单位帧通过第一级Mesh网络发送该流固定的映射区域；各区域按照目的端口将信元缓冲在OQ队列，由于流到区域的映射关系固定，来自同一单位帧的k个信元将依次到达OQ队列头位置，等待第二级Mesh网络空闲时依次到达输出端口。若将第一级缓冲区实现于第一级线卡，第二级缓冲区实现于第二级线卡，那么上述二级交换结构则成为典型的负载均衡交换结构，本发明尤其适用于负载均衡路由器报文调度方法。为便于表述，本发明使用Mesh网络来阐述发明内容，Mesh网络可理解为实现报文交换至第二级输入端口及目的输出端口的技术途径，可以是Mesh网络也可以是其他交换技术。The present invention is based on the load balancing scheduling method of two-level switching. First, the first-level input port buffers the arriving cells in the VOQ queue according to the destination port, and the scheduler passes the first-level switching network (Mesh network as shown) to the message Switched to the second-level input port, k cells from the same flow in the VOQ queue are called a unit frame, and the unit frame is the minimum scheduling unit of the present invention. Each input port of the first stage independently executes the scheduling method of the present invention, executes the minimum length assignment according to the flow distribution matrix, and transfers the unit frames of the same flow through the first-stage switching network (Mesh network) in k consecutive external time slots Send to the fixed mapping area for this stream. N second-level input ports are divided into N/k groups in turn, and each group contains k consecutive second-level input ports to form an area; each area buffers cells in the OQ queue according to the destination port, because the flow to the area The mapping relationship is fixed, and k cells from the same unit frame will arrive at the head of the OQ queue in sequence, and will be switched to the destination output port through the second-level switching network (Mesh network as shown in the figure). In the above process, each input port independently executes the cell allocation algorithm based on flow mapping: the unit frame of the same flow is sent to the fixed mapping area of the flow through the first-level Mesh network; each area buffers the cells according to the destination port In the OQ queue, due to the fixed mapping relationship between flows and areas, k cells from the same unit frame will arrive at the head of the OQ queue in sequence, and will arrive at the output port in turn when the second-level Mesh network is idle. If the first-level buffer is implemented on the first-level line card, and the second-level buffer is implemented on the second-level line card, then the above-mentioned two-level switching structure becomes a typical load balancing switching structure, and the present invention is especially suitable for load balancing Router packet scheduling method. For the convenience of expression, the present invention uses a Mesh network to illustrate the content of the invention. The Mesh network can be understood as a technical approach to realize the exchange of messages to the second-level input port and the destination output port. It can be a Mesh network or other switching technologies.

由上可知，本发明的核心就在于将k个连续的输入端口划分为一个区域，输入端采用基于流映射的负载分配算法，以细粒度的方式将同一条流的k个信元分派到固定的映射区域。通过理论证明，该调度策略可获得100％吞吐率并能够保证报文的顺序。其中k为聚合粒度，其决定了每次调度同一条流的信元个数。为避免流量区域集中现象，本发明进一步采用双循环(dual-rotation)方式构建不同输入端口的流到区域的映射关系。为实现负载在第二级输入端口的均衡分布，本发明进一步在每个输入端口维护全局统一视图的流量分布矩阵，根据流量分布矩阵调度单位帧，可实现100％负载均衡度。It can be seen from the above that the core of the present invention is to divide k continuous input ports into one area, and the input end adopts a load distribution algorithm based on flow mapping to distribute k cells of the same flow to fixed the mapped area. It is proved by theory that this scheduling strategy can obtain 100% throughput rate and can guarantee the sequence of packets. Where k is the aggregation granularity, which determines the number of cells for scheduling the same flow each time. In order to avoid the concentration of flow areas, the present invention further adopts a dual-rotation method to construct the mapping relationship between flows and areas of different input ports. In order to realize the balanced distribution of loads at the second-level input ports, the present invention further maintains a traffic distribution matrix of a global unified view at each input port, and dispatches unit frames according to the traffic distribution matrix to achieve 100% load balance.

图1是本发明调度方法所适用的二级交换体系结构的一个示例。图中，VOQ_i，j表示第一级输入端口i的虚拟输出队列j；表示第二级输入端口l的输出队列j；f(i，j)表示从第一级输入端口i到输出端口j的流；k为聚合粒度，表示连续调度同一条流信元的数目(k为端口数N的因数)；VOQ_i，j队列的每k个信元构成一个单位帧(unit frame)，第一级输入端口i处N/k个不同VOQ队列的单位帧(共计N个信元)构成一个聚合帧(aggregate frame)，VOQ_i，j队列的N个信元，构成一个满帧(full frame)；第二级输入端口1，2，....，N被依次分成N/k组，每一组含k个连续的第二级输入端口，第g组的k个第二级输入端口构成一个区域，记作R_g，S_r，z表示区域r的第z个第二级输入端口；每个输入端口的N条流被划分为N/k组与N/k个区域相对应，每组含k条流，输入端口i第f组的k条流构成一个流分支，记作 FIG. 1 is an example of a secondary switching architecture to which the scheduling method of the present invention is applied. In the figure, VOQ _{i, j} represent the virtual output queue j of the first-stage input port i; Indicates the output queue j of the second-level input port l; f(i, j) indicates the flow from the first-level input port i to the output port j; k is the aggregation granularity, indicating the number of consecutively scheduled cells of the same flow (k is a factor of the number of ports N); VOQ _{i, every k cells of the j} queue form a unit frame (unit frame), and the unit frames of N/k different VOQ queues at the first-level input port i (a total of N cells unit) to form an aggregate frame (aggregate frame), VOQ _{i, j} queue N cells form a full frame (full frame); the second-level input ports 1, 2, ..., N are divided into N in turn /k groups, each group contains k consecutive second-level input ports, and the k second-level input ports of the g-th group form a region, denoted as R _g , S _{r, z} represents the z-th of the region r Two-level input port; N streams of each input port are divided into N/k groups corresponding to N/k areas, each group contains k streams, and the k streams of the fth group of input port i form a stream branch ,Referred to as

为降低报文缓冲存储器带宽需求，Mesh网络通常运行在速率R/N(内部链路加速比为1)，由此得到以下定义：In order to reduce the bandwidth requirements of the message buffer memory, the Mesh network usually operates at the rate R/N (internal link acceleration ratio is 1), thus the following definition is obtained:

定义1.在速率为R的链路发送或接收一个信元所耗费的时间为外部时间槽(external timeslot)。Definition 1. The time it takes to send or receive a cell on a link with rate R is an external timeslot.

定义2.在速率为R/N的链路发送或接收一个单位帧所耗费的时间为时间槽(time slot)，时间槽为外部时间槽的N倍。Definition 2. The time it takes to send or receive a unit frame on a link with a rate of R/N is a time slot (time slot), and the time slot is N times the external time slot.

通常情况下，在每个时间槽即每N个外部时间槽，UFFS-k算法可从输入端N/k个流分支中聚合N/k个单位帧构成一个聚合帧分派到第二级输入端口。VOQ队列均衡系数反映了流量在第二级输入端口OQ队列分布的均衡性，本发明根据各区域OQ队列长度调度单位帧，实现了区域间的负载均衡。接下来，将详细阐述本发明均衡系数工作原理。通过采用对时间槽的归纳法证明，可以从理论上证明上述基于流映射的调度方法可以保证对任意区域R_g，0≤j＜N，k个第二级输入端口对应的k个队列长度相同(忽略单位帧的传输延迟)，其中l∈R_g。由此，可得到以下定义：Normally, in each time slot, that is, every N external time slots, the UFFS-k algorithm can aggregate N/k unit frames from N/k flow branches at the input end to form an aggregated frame and dispatch it to the second-level input port . The VOQ queue balance coefficient reflects the balance of flow distribution in the OQ queue of the second-level input port. The present invention schedules unit frames according to the length of the OQ queue in each region, and realizes the load balance between regions. Next, the working principle of the equalization coefficient of the present invention will be described in detail. By using the inductive proof of time slots, it can be proved theoretically that the above scheduling method based on flow mapping can guarantee that for any region R _g , 0≤j<N, k second-level input ports correspond to k The queue lengths are the same (neglecting the transmission delay of a unit frame), where l ∈ R _g . From this, the following definitions can be obtained:

定义3.既然对任意区域R_g，队列长度相同(l∈R_g)，那么队列VOQ_i，j均衡系数等于其映射区域R_g输出队列j的长度，记作L_g，j。Definition 3. Since for any region R _g , The queue lengths are the same (l∈R _g ), then the equalization coefficient of queue VOQ _{i, j} is equal to the length of the output queue j of its mapping area R _g , denoted as L _{g, j} .

定义4.若队列VOQ_i，j存在单位帧且均衡系数满足则在连续的k个外部时间槽将VOQ_i，j队列单位帧发送到区域R_g，这就是最小长度分派。Definition 4. If the queue VOQ _{i, j} has a unit frame and the equalization coefficient satisfies Then the VOQ _{i, j} queue unit frames are sent to the region R _g in consecutive k external time slots, which is the minimum length assignment.

可以从理论上证明采用最小长度分派策略可以保证在时间槽T结束后，对任意两个区域R_g1，R_g2，其OQ队列长度Lg_1，j与Lg_2，j最多差1，从而能够实现100％吞吐率及100％负载均衡度。为实现最小长度分派，第一级输入端口调度器需要维护全局统一视图的流量分布矩阵L＝[L_g，j]，为了保证流量分布矩阵在每个输入端口视图的一致性，必须实现各端口对流量分布矩阵写操作的互斥性。本发明采用锁机制实现对互斥量L_g，j的互斥写：若g，j满足且L_g，j处于解锁(unlock)状态，那么第一级输入端口i对L_g，j加锁后将VOQ_i，j队列单位帧发送至其映射区域R_g，L_g，j加1后被解锁。流分支中均衡系数L_g，j处于加锁状态的VOQ_i，j队列直接被跳过。不难推断，只有那些流到区域的映射关系相同的输入端口同时调度目的端口相同VOQ_i，j队列时才可能引发对同一L_g，j的写冲突，对均衡系数L_g，j的互斥写可以避免这些输入端口同时将多个单位帧发送到同一区域的输出队列j，造成流量分布不均衡。It can be proved theoretically that adopting the minimum length allocation strategy can ensure that after the end of the time slot T, for any two regions R _g1 , R _g2 , the difference between the OQ queue length Lg _{1, j} and Lg _{2, j} is at most 1, so that it can realize 100% throughput and 100% load balancing. In order to achieve the minimum length assignment, the first-level input port scheduler needs to maintain the traffic distribution matrix L=[L _{g, j} ] of the global unified view. In order to ensure the consistency of the traffic distribution matrix in each input port view, each port must realize Mutual exclusion of write operations to the flow distribution matrix. The present invention adopts the lock mechanism to realize the mutually exclusive writing to the mutex L _{g, j} : if g, j satisfy And L _{g, j} are in the unlocked state, then the first-level input port i locks L _{g, j} and sends the VOQ _{i, j} queue unit frame to its mapping area R _g , L _{g, j} plus 1 is unlocked. The VOQ _i, _{j queues whose equalization coefficients L g,} j are locked in the stream branch are directly skipped. It is not difficult to infer that only those input ports with the same mapping relationship between flows and areas schedule the same VOQ _{i, j} queue at the same destination port, which may cause write conflicts to the same L _{g, j} , and mutual exclusion of the equalization coefficient L _{g, j} Writing can prevent these input ports from sending multiple unit frames to the output queue j of the same area at the same time, resulting in unbalanced traffic distribution.

基于流映射的报文调度算法根据本地VOQ队列信息调度信元，可分布于第一级每个输入端口独立执行，步骤二描述的是第一级每个输入端口执行本发明负载均衡调度方法的过程。The message scheduling algorithm based on flow mapping dispatches cells according to the local VOQ queue information, and can be distributed to each input port of the first level and executed independently. Step 2 describes that each input port of the first level executes the load balancing scheduling method of the present invention. process.

在本发明中，第一步采用双循环(dual-rotation)方式构建流到区域的映射。流到区域的映射方法关系到第二级存储资源及两级Mesh网络的利用率，为避免吞吐量损失流到区域的映射算法应实现输入负载在各区域的均衡分布。本发明提出一种兼顾负载均衡和报文保序的双循环(dual-rotation)映射算法，其设计思想源于对报文乱序原因的本质分析：当同一条流的信元所在第二级缓冲区OQ队列长度不同时就会引发信元乱序，并且乱序信元数目随着OQ队列长度差异的增大而增大。若以细粒度的方式将同一条流的k个信元分派到预先设定的映射区域(k个连续的第二级输入端口)，由于流到区域的映射关系固定，对任意流，其信元所在映射区域的OQ队列长度相同，从而实现了信元的按序发送。In the present invention, the first step is to construct a flow-to-area mapping in a dual-rotation manner. The flow-to-area mapping method is related to the utilization of the second-level storage resources and the two-level Mesh network. In order to avoid throughput loss, the flow-to-area mapping algorithm should achieve a balanced distribution of input load in each area. The present invention proposes a dual-rotation mapping algorithm that takes into account both load balancing and message order preservation. When the buffer OQ queue lengths are different, cell disorder will occur, and the number of disordered cells increases with the increase of the difference in OQ queue length. If the k cells of the same flow are assigned to a preset mapping area (k consecutive second-level input ports) in a fine-grained manner, since the mapping relationship between the flow and the area is fixed, for any flow, its information The lengths of the OQ queues in the mapping areas where the cells are located are the same, so that the sequential sending of cells is realized.

1.1既然对于任意给定的区域，只能接收同一输入端口固定的k条流，那么为实现负载在各区域的均衡分布，以循环方式构建第一级输入端口N/k个流分支到第二级输入端口N/k个区域的映射，步骤1.1对应双循环映射算法伪码描述的第一层for循环，步骤1.1保证了每个区域恰能涵盖所有的流；1.1 Since for any given area, only k fixed streams of the same input port can be received, in order to achieve a balanced load distribution in each area, N/k stream branches of the first-level input port are constructed in a circular manner to the second Mapping of N/k areas of the input port of the stage, step 1.1 corresponds to the first layer of for loop described by the pseudo-code of the double-loop mapping algorithm, and step 1.1 ensures that each area can just cover all streams;

1.2进一步以循环方式调整不同输入端口与N/k种映射方式的关联，步骤1.2对应双循环映射算法伪码描述的第二层for循环，步骤1.2保证了每个区域按照输入端口的均衡分布涵盖所有的流。1.2 Further adjust the relationship between different input ports and N/k mapping methods in a cyclic manner. Step 1.2 corresponds to the second-layer for loop described in the pseudocode of the double-loop mapping algorithm. Step 1.2 ensures that each area is covered by a balanced distribution of input ports all streams.

双循环映射算法通过简单的取模运算建立流到区域的映射关系易于硬件实现，其伪码描述如下：The double-loop mapping algorithm establishes the mapping relationship between streams and regions through simple modulo operations, which is easy to implement in hardware. Its pseudocode is described as follows:

在本发明中，第二步是输入端口调度部件根据流到区域的映射关系分派信元，以轮询(round-robin)方式服务于N/k个流分支，以单位帧为最小调度单元，在连续的k个外部时间槽，发送流分支中固定VOQ队列的单位帧。第一级输入端口i调度部件执行本发明调度方法在每个外部时间槽的操作流程如下：In the present invention, the second step is that the input port scheduling component distributes cells according to the mapping relationship between flows and regions, and serves N/k flow branches in a round-robin manner, with the unit frame as the minimum scheduling unit, In consecutive k external time slots, the unit frame of the fixed VOQ queue in the stream branch is sent. The operation flow of the first-level input port i scheduling component executing the scheduling method of the present invention in each external time slot is as follows:

2.1若流分支(0≤f＜N/k，f初始化为0)、VOQ队列、VOQ_i，j(kf≤j＜kf+k)存在满帧，则优先调度满帧，赋予该满帧N/k个单位帧最高调度优先级，查找流量分布矩阵L，截取满帧第一个单位帧，发送到目前L_g，j最小的区域R_g，即执行步骤2.1.1～2.1.k，若不存在满帧则转步骤2.2，2.1 If the flow branch (0≤f<N/k, f is initialized to 0), VOQ queue, VOQ _{i, j} (kf≤j<kf+k) has a full frame, then the full frame is scheduled first, and the full frame is given N/k units The highest scheduling priority of the frame, search the traffic distribution matrix L, intercept the first unit frame of the full frame, and send it to the area R _g with the smallest L _{g, j} at present, that is, perform steps 2.1.1~2.1.k, if there is no full frame Then turn to step 2.2,

2.1.1若输入端口i到第二级输入端口S_g，1的内部链路空闲，则将VOQ_i，j队列头信元发送到区域R_g第二级输入端口S_g，1，否则f＝(f+1)modN/k，转步骤2.1；2.1.1 If the internal link from the input port i to the second-level input port S _{g, 1} is idle, then send the VOQ _{i, j} queue head cell to the second-level input port S _{g, 1} of the area R _g , otherwise f =(f+1)modN/k, go to step 2.1;

2.1.2发送VOQ_i，j队列头信元到区域R_g第二级输入端口S_g，2；2.1.2 Send the VOQ _{i, j} queue head cell to the second stage input port S _{g, 2} of the region R _g ;

2.1.3发送VOQ_i，j队列头信元到区域R_g第二级输入端口S_g，3；2.1.3 Send the VOQ _{i, j} queue head cell to the second stage input port S _{g, 3} of the region R _g ;

...... …

依次类推至2.1.k发送VOQ_i，j队列头信元到区域R_g第二级输入端口S_g，k，g＝(g+1)modN/k，转步骤2.1。By analogy to 2.1.k, send VOQ _i,j queue head cells to the second-level input port S g _,k of region R _g , g=(g+1)modN/k, go to step 2.1.

2.2若流分支VOQ队列、VOQ_i，j(kf≤j＜kf+k)存在最高调度优先级单位帧，则查找流量分布矩阵L，将该单位帧发送到目前L_g，j最小的区域R_g，即执行步骤2.1.1～2.1.k，否则转步骤2.3。2.2 If the flow branch VOQ queue, VOQ _{i, j} (kf≤j<kf+k) has the highest scheduling priority unit frame, then search the traffic distribution matrix L, and send the unit frame to the current region R _g with the smallest L _{g, j} , that is, execute Step 2.1.1～2.1.k, otherwise go to step 2.3.

2.3若流分支VOQ队列、VOQ_i，j(kf≤j＜kf+k)存在一个或多个单位帧，则查找流量分布矩阵L，根据查找结果选择最小均衡系数VOQ队列、VOQ_i，j(kf≤j＜kf+k)的单位帧，将其发送到流分支的固定映射区域R_g，即执行步骤2.1.1～2.1.k，若仅含一个单位帧则可跳过流量分布矩阵查找操作，直接将流分支唯一的单位帧发送到其固定映射区域R_g，否则流分支不含任何单位帧，g＝(g+1)modN/k，转步骤2.1。2.3 If the flow branch If there are one or more unit frames in VOQ queue, VOQ _{i, j} (kf≤j<kf+k), search the traffic distribution matrix L, and select the minimum equalization coefficient VOQ queue, VOQ _{i, j} (kf≤j< kf+k), sending it to the stream branch The fixed mapping area R _g of the flow branch, that is, perform steps 2.1.1~2.1.k, if there is only one unit frame, the search operation of the flow distribution matrix can be skipped, and the unique unit frame of the flow branch is directly sent to its fixed mapping area R _g , otherwise the stream branches Without any unit frame, g=(g+1)modN/k, go to step 2.1.

本发明采用负载均衡度来衡量负载在第二级OQ缓冲区分布的均衡程度，负载均衡度可定义如下：The present invention adopts the degree of load balance to measure the degree of balance that the load is distributed in the second-level OQ buffer zone, and the degree of load balance can be defined as follows:

定义5.负载均衡度：假设在时间段[t_r，t_v]内的第l个交换模块转发了S_l[t_r，t_v]个信元。负载均衡度为在该时间段内，不同交换模块转发的最小信元个数与最大信元个数的比值，即：Definition 5. Load balancing degree: Assume that the l-th switching module forwards S _l [t _r , t _v ] cells within the time period [t _r , t _v ]. The load balance degree is the ratio of the minimum number of cells forwarded by different switching modules to the maximum number of cells within the time period, that is:

$E [t_{r}, t_{v}] = \frac{\min_{l = 0, . . . K - 1} S_{l} [t_{r}, t_{v}]}{\max_{l = 0, . . . K - 1} S_{l} [t_{r}, t_{v}]}$ (K为第二级交换模块的数目) $E. [t_{r}, t_{v}] = \frac{\min_{l = 0, . . . K - 1} S_{l} [t_{r}, t_{v}]}{\max_{l = 0, . . . K - 1} S_{l} [t_{r}, t_{v}]}$ (K is the number of second-level switching modules)

显然负载均衡度E[t_r，t_v]≤1，E[t_r，t_v]趋近于1，表示各交换模块处理的信元数基本相同，负载在交换模块的分布比较均衡。E[t_r，t_v]越小，交换模块负载均衡性越差，由此可以证明本发明调度方法可获得100％负载均衡度。Obviously, the load balance degree E[t _r , t _v ]≤1, and E[t _r , t _v ] tends to 1, which means that the number of cells processed by each switching module is basically the same, and the load distribution among the switching modules is relatively balanced. The smaller the E[t _r , t _v ], the worse the load balance of the switch module, which proves that the scheduling method of the present invention can obtain 100% load balance.

如图2所示，在具体应用实例中，本发明第一步设计的流到区域的映射方法，当端口数目N＝32，聚合粒度k＝8时，采用双循环映射方法得到的流到区域的映射结果(VOQ_ij代表从输入端口i到输出端口j的流，→表示映射关系)，为方便查看流到区域的映射结果，选取了数值较大的聚合粒度流k＝8。流到区域的映射方法用于构建流到区域固定的映射关系，它关系到第二级输入端存储资源及两级交换网络(如Mesh网络)的利用率，为避免吞吐量损失(loss of throughput)流到区域的映射算法应实现输入负载在各区域的均衡分布。As shown in Figure 2, in a specific application example, the flow-to-area mapping method designed in the first step of the present invention, when the number of ports N=32, and the aggregation granularity k=8, the flow-to-area obtained by using the double-loop mapping method (VOQ _ij represents the flow from input port i to output port j, → represents the mapping relationship). In order to facilitate the viewing of the mapping result from flow to area, the aggregation granularity flow k=8 with a larger value is selected. The flow-to-region mapping method is used to construct a fixed flow-to-region mapping relationship, which is related to the utilization of the second-level input storage resources and the two-level switching network (such as a Mesh network). In order to avoid the loss of throughput ) flow-to-area mapping algorithm should realize the balanced distribution of input load in each area.

本发明采用双循环(dual-rotation)映射方式调整流分支到区域的映射。如图2所示，输入端口i，(0≤i≤31)含N/k＝4个流分支第二级输入端口被划分为N/k＝4个区域{R₀，R₁，R₂，R₃}。为保证第一级交换资源的利用率，同一个输入端口的流分支应映射到不同的区域，从而得到4种映射方式。为了保证每个区域恰能涵盖所有的流，第二层循环以轮询方式构建不同输入端口的流分支到区域的映射，从理论上保证了负载均衡的可行性。按照第二层循环映射操作，输入端口i＝{0，4，...，28}采用第一种映射方式，输入端口i＝{1，5，...，29}采用第二种映射方式，输入端口i＝{2，6，...，30}采用第三种映射方式，输入端口i＝{3，7，...，31}采用第四种映射方式。上述双循环映射方法可兼顾负载均衡和报文保序，本发明根据第二级输入端口流量分布矩阵调度单位帧，进一步保证了每条流在区域之间的均衡性。The present invention adopts a dual-rotation mapping method to adjust the mapping of stream branches to regions. As shown in Figure 2, the input port i, (0≤i≤31) contains N/k=4 flow branches The input ports of the second stage are divided into N/k=4 regions {R ₀ , R ₁ , R ₂ , R ₃ }. In order to ensure the utilization of the first-level switching resources, the flow branches of the same input port should be mapped to different areas, thus four mapping methods are obtained. In order to ensure that each area can cover all the flows, the second-layer loop constructs the mapping from the flow branches of different input ports to the area in a round-robin manner, which theoretically ensures the feasibility of load balancing. According to the second-layer cyclic mapping operation, the input port i={0, 4, ..., 28} adopts the first mapping method, and the input port i={1, 5, ..., 29} adopts the second mapping method The input port i={2, 6, ..., 30} adopts the third mapping mode, and the input port i={3, 7, ..., 31} adopts the fourth mapping mode. The above-mentioned double-loop mapping method can take into account both load balancing and message order preservation. The present invention schedules unit frames according to the second-level input port flow distribution matrix, which further ensures the balance of each flow among regions.

如图3所示，为本发明在具体应用实例中执行负载均衡调度方法流程示意图，对应本发明的上述第二步骤。As shown in FIG. 3 , it is a schematic flow chart of the method for implementing load balancing scheduling in a specific application example of the present invention, corresponding to the above-mentioned second step of the present invention.

如图4所示，为本发明在流量突发情况下，采用本发明最小长度分派后，信元在第二级输入端口OQ缓冲区的分布情况，本发明在保证报文顺序的同时实现了负载的均衡分布。由于流分支到区域的映射关系固定，存在某些区域因负载过重而发生缓冲区溢出而其他区域相对空闲的情况。例如，输入端某个流分支的VOQ队列含N个信元，并且队列长度仍在不断增长，而其他流分支没有单位帧可以调度。这样重负载流分支到其映射区域的内部链路将成为性能瓶颈，另一方面重负载流分支因无法使用输入线卡其他空闲链路而造成内部带宽浪费。为了解决以上问题，在流量突发的情况下本发明允许重负载流分支抢占链路资源，通过赋予满帧最高优先级将突发报文流均匀分布于每个区域。为了解决调度满帧引发的信元乱序问题且遵循负载均衡原则，本发明将从VOQ_i，j队列读出的N/k个单位帧即一个满帧依次分派到目前L_g，j最小的区域R_g。采用该策略得到的分派顺序是：从VOQ_i，j队列读出的第一个及第二个单位帧依次分派在区域2和区域3，第三个单位帧及第四个单位帧依次分派在区域1或区域4，这四个单位帧将按读出顺序依次到达输出端口j。由于不同区域输出队列j的长度至多差1，因此将满帧中N/k个单位帧依次发送到目前L_g，j最小的区域不会引发信元乱序。本质上，调度单位帧与调度满帧都采用了最小长度分派策略，两者的区别在于单位帧只能固定分派到其映射区域，而满帧将被拆分为N/k个单位帧均衡分布于N/k个区域。As shown in Figure 4, for the present invention in the case of traffic burst, after adopting the minimum length distribution of the present invention, the distribution of cells in the OQ buffer zone of the second-level input port, the present invention realizes the message sequence while ensuring Balanced distribution of load. Since the mapping relationship between stream branches and regions is fixed, there are situations where buffer overflow occurs in some regions due to heavy load, while other regions are relatively idle. For example, the VOQ queue of a flow branch at the input end contains N cells, and the queue length is still increasing, while other flow branches have no unit frame to schedule. In this way, the internal link from the heavy-load flow branch to its mapping area will become a performance bottleneck. On the other hand, the heavy-load flow branch cannot use other idle links of the input line card, resulting in waste of internal bandwidth. In order to solve the above problems, the present invention allows heavy-load flow branches to seize link resources in the case of traffic bursts, and evenly distributes burst message flows in each area by giving the highest priority to full frames. In order to solve the cell out-of-sequence problem caused by scheduling full frames and follow the load balancing principle, the present invention assigns the N/k unit frames read from the VOQ _{i, j} queue, that is, a full frame _{, to the current L g with the smallest j.} Region R _g . The dispatch order obtained by adopting this strategy is: the first and second unit frames read from the VOQ _{i, j} queue are assigned in the area 2 and area 3 in turn, and the third unit frame and the fourth unit frame are assigned in the order of Area 1 or Area 4, these four unit frames will arrive at the output port j sequentially in the readout order. Since the length of the output queue j in different areas differs by at most 1, sending N/k unit frames in the full frame to the current L _{g in turn, the area with the smallest j} will not cause cell disorder. Essentially, both the scheduling unit frame and the scheduling full frame adopt the minimum length allocation strategy. The difference between the two is that the unit frame can only be fixedly allocated to its mapping area, while the full frame will be split into N/k unit frames for balanced distribution. in N/k regions.

以上仅是本发明的优选实施方式，本发明的保护范围并不仅局限于上述实施例，凡属于本发明思路下的技术方案均属于本发明的保护范围。应当指出，对于本技术领域的普通技术人员来说，在不脱离本发明原理前提下的若干改进和润饰，应视为本发明的保护范围。The above are only preferred implementations of the present invention, and the protection scope of the present invention is not limited to the above-mentioned embodiments, and all technical solutions under the idea of the present invention belong to the protection scope of the present invention. It should be pointed out that for those skilled in the art, some improvements and modifications without departing from the principle of the present invention should be regarded as the protection scope of the present invention.

Claims

1. the load equilibration scheduling method exchanging based on two-stage, is characterized in that:

First order input port is buffered in VOQ queue by arrival cell according to destination interface, scheduler by first order switching network by message switching to second level input port, k the cell from same stream in VOQ queue is referred to as a unit frame, and unit frame is minimum scheduling unit; Each input port of the first order is carried out minimum length assignment according to flow distribution matrix, at k continuous external time groove, by first order switching network, the unit frame of same stream is sent to this and flows fixing mapping area;

N second level input port is divided into N/k group successively, every group forms a region containing k continuous second level input port, each region according to destination interface by cell-buffering in OQ queue, owing to flowing to the mapping relations in region, fix, k the cell from same unit frame will arrive OQ queue heads position successively, by second level switching network, exchange to object output port;

In first order input port when scheduling, assigns cell according to the mapping relations that flow to region and serves N/k flow branching with polling mode, and described first order input port as follows in the operating process of each groove external time:

(2.1) if flow branching vOQ queue, VOQ _i,jthere is full frame, priority scheduling full frame, in 0≤f<N/k, f is initialized as 0, VOQ _i,jmiddle kf≤j<kf+k; Give this full frame N/k the highest dispatching priority of unit frame, search flow distribution matrix L, first unit frame of intercepting full frame, sends to current L _g,jminimum region R _g, perform step (2.1.1)～(2.1.k), if do not exist full frame to go to step (2.2); Queue VOQ _i,jequalizing coefficient equals its mapping area R _gthe length of output queue j, is denoted as L _g,j;

If (2.1.1) first order input port i is to second level input port S _{g, 1}inner link idle, by VOQ _i,jqueue heads cell sends to region R _gsecond level input port S _{g, 1}, otherwise f=(f+1) modN/k goes to step (2.1);

(2.1.2) send VOQ _i,jqueue heads cell is to region R _gsecond level input port S _{g, 2};

(2.1.3) send VOQ _i,jqueue heads cell is to region R _gsecond level input port S _{g, 3};

The rest may be inferred sends VOQ to (2.1.k) _i,jqueue heads cell is to region R _gsecond level input port S _g,k, g=(g+1) modN/k, goes to step (2.1);

(2.2) if flow branching vOQ queue, VOQ _i,j, there is the highest dispatching priority unit frame in kf≤j<kf+k wherein, searches flow distribution matrix L, and this unit frame is sent to current L _g,jminimum region R _g, perform step (2.1.1)～(2.1.k), otherwise go to step (2.3);

(2.3) if flow branching there are one or more unit frame in VOQ queue, searches flow distribution matrix L, selects the unit frame of minimum equalizing coefficient VOQ queue according to lookup result, sends it to flow branching fixedly mapping area R _g, perform step (2.1.1)～(2.1.k), if only can skip the search operation of flow distribution matrix containing a unit frame, directly the unique unit frame of flow branching is sent to its fixedly mapping area R _g, otherwise flow branching , containing any unit frame, g=(g+1) modN/k, does not go to step (2.1);

Wherein, VOQ _i,jthe VOQ j that represents first order input port i; the output queue j that represents second level input port l; F (i, j) represents the stream from first order input port i to output port j; K is polymerization granularity, represents the number of scheduling same stream cell continuously, and k is the factor of port number N; VOQ _i,jevery k cell of queue forms a unit frame, and the unit frame of the first order input port i N/k of place different VOQ queues forms an aggregate frame, amounts to N cell; VOQ _i,jthe N of a queue cell, forms a full frame; Second level input port 1,2 ...., N is divided into N/k group successively, and each group is containing k continuous second level input port, and k second level input port of g group forms a region, is denoted as R _g, S _r,zz the second level input port that represents region r; It is corresponding with N/k region that the N bar stream of each second level input port is divided into N/k group, and every group is flowed containing k bar, and the k bar stream of second level input port i f group forms a flow branching, is denoted as

Described external time groove in order to be R in speed, link sends or receives the time that cell is spent.

2. the load equilibration scheduling method based on two-stage exchange according to claim 1, is characterized in that, adopts two endless form to build the mapping that flows to region:

(1.1) build in a looping fashion N/k flow branching of first order input port to the mapping in an input port N/k region, the second level, to guarantee that each region be able to contain all stream;

(1.2) further adjust in a looping fashion the associated of different first order input ports and N/k kind mapping mode, to guarantee that each region contains all stream according to the equiblibrium mass distribution of second level input port.