CN115336236B

CN115336236B - Method implemented by first computing node, first computing node and readable medium

Info

Publication number: CN115336236B
Application number: CN202080098258.3A
Authority: CN
Inventors: 叶剑西; 王绍创; 冉仟元; 冯飞; 董建波
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2020-03-31
Filing date: 2020-03-31
Publication date: 2024-04-02
Anticipated expiration: 2040-03-31
Also published as: CN115336236A; WO2021195990A1

Abstract

In distributed training, in order to avoid network congestion, the first computing node may be based at least in part on whether the first network interface controller associated with the first process and the second network interface controller associated with the second process are located in the same A computing node is in or linked to the same switch to determine a routing identifier for routing data from a first process to a second process, the first process and the second process belonging to a specific inter-node ring connecting a plurality of different nodes under the network topology. The first computing node may then route the data from the first process to the second process based on the routing identifier.

Description

Method implemented by a first computing node, first computing node, and readable medium

技术领域Technical field

本申请涉及计算机技术领域，尤其涉及由第一计算节点实施的方法、第一计算节点及可读介质。The present application relates to the field of computer technology, and in particular to a method implemented by a first computing node, the first computing node, and a readable medium.

背景技术Background technique

随着诸如深度神经网络(Deep Neural Network，DNN)的神经网络迅速发展，各种应用领域(例如，计算机视觉、自然语言处理、语音识别等)都得到了发展，并且会从神经网络固有的多功能性和灵活性中受益。然而，由于神经网络应用日益增加的复杂性和越来越严格的准确性要求，神经网络模型的大小和训练模型所需的训练数据的大小也显著增加，这将不可避免地导致训练时间越来越长，从而对训练模型能够满足不断变化的应用环境的有效性和及时性产生不利影响。With the rapid development of neural networks such as Deep Neural Network (DNN), various application fields (such as computer vision, natural language processing, speech recognition, etc.) have been developed and will benefit from the inherent diversity of neural networks. Benefit from functionality and flexibility. However, due to the increasing complexity of neural network applications and increasingly stringent accuracy requirements, the size of neural network models and the size of training data required to train the models have also increased significantly, which will inevitably lead to increasingly longer training times. The longer it is, thus adversely affecting the effectiveness and timeliness of the training model to meet changing application environments.

为了减少训练神经网络模型的时间，可以使用一种采用并行训练的分布式训练系统。一般而言，分布式训练系统可以包括分布在网络上的大量计算节点或服务器，并且将计算任务的子集分配给计算节点或服务器，用于采用并行训练来执行计算。然而，分布式训练系统中的计算节点或服务器之间的数据通信造成了分布式训练系统中可能发生的训练时间的减少量的下限或瓶颈。当分布式训练系统包括计算节点或服务器内部的和之间的各种类型的异构连接或互连时，尤其如此，这些异构连接或互连在延迟、带宽、拓扑等方面表现出不同的特性。这种连接或互连的异构性增加了为分布式训练系统中的计算节点或服务器设计数据通信网络的难度和复杂性。To reduce the time it takes to train a neural network model, a distributed training system using parallel training can be used. Generally speaking, a distributed training system may include a large number of computing nodes or servers distributed over a network, and a subset of computing tasks may be assigned to the computing nodes or servers for performing computations using parallel training. However, data communication between computing nodes or servers in a distributed training system creates a lower limit or bottleneck on the amount of reduction in training time that can occur in a distributed training system. This is especially true when a distributed training system includes various types of heterogeneous connections or interconnections within and between computing nodes or servers that exhibit different performance in terms of latency, bandwidth, topology, etc. characteristic. This heterogeneity of connections or interconnections increases the difficulty and complexity of designing data communication networks for computing nodes or servers in a distributed training system.

此外，由于过量的数据流通过分布式训练系统中的计算节点或服务器之间的特定网络交换机或连接，可能引发网络拥塞，这样可能会由于处理训练结果的延迟而导致训练时间延长。之所以有过量的数据流通过特定网络交换机或连接，可能是因为计算节点或服务器之间发送的路由数据的路径选择失去控制。Additionally, network congestion may occur due to excessive data flow through specific network switches or connections between compute nodes or servers in a distributed training system, which may result in extended training times due to delays in processing training results. Excessive data flow through a particular network switch or connection may be due to an out-of-control routing of data being sent between compute nodes or servers.

发明内容Contents of the invention

本申请提供了由第一计算节点实施的方法、第一计算节点及可读介质，以解决网络拥塞的问题。The present application provides a method implemented by a first computing node, the first computing node and a readable medium to solve the problem of network congestion.

第一方面，提供了一种由第一计算节点实施的方法，包括：至少部分地基于与第一进程相关联的第一网络接口控制器和与第二进程相关联的第二网络接口控制器是否位于同一计算节点中或链接到同一片交换机来确定将数据从所述第一进程路由到所述第二进程的路由标识符，所述第一进程和所述第二进程属于网络拓扑下连接多个不同节点的特定节点间环；以及根据所述路由标识符将所述数据从所述第一进程路由到所述第二进程。In a first aspect, a method implemented by a first computing node is provided, comprising: based at least in part on a first network interface controller associated with a first process and a second network interface controller associated with a second process. Determine the routing identifier for routing data from the first process to the second process whether it is located in the same computing node or linked to the same switch, and the first process and the second process belong to the connection under the network topology A specific inter-node ring of a plurality of different nodes; and routing the data from the first process to the second process based on the routing identifier.

第二方面，提供了一个或多个机器可读介质，其存储有机器可读指令，当所述机器可读指令由第一计算节点执行时，使所述第一计算节点执行动作，所述动作包括：至少部分地基于与第一进程相关联的第一网络接口控制器和与第二进程相关联的第二网络接口控制器是否位于同一计算节点中或链接到同一片交换机来确定将数据从所述第一进程路由到所述第二进程的路由标识符，所述第一进程和所述第二进程属于网络拓扑下连接多个不同节点的特定节点间环；以及根据所述路由标识符将所述数据从所述第一进程路由到所述第二进程。In a second aspect, one or more machine-readable media are provided having machine-readable instructions stored thereon that, when executed by a first computing node, cause the first computing node to perform an action, the Actions include determining to transfer the data based at least in part on whether a first network interface controller associated with the first process and a second network interface controller associated with the second process are located in the same compute node or linked to the same switch. A routing identifier for routing from the first process to the second process, the first process and the second process belonging to a specific inter-node ring connecting multiple different nodes under a network topology; and according to the routing identifier The operator routes the data from the first process to the second process.

第三方面，提供了一种第一计算节点，包括：一个或多个处理单元；和存储器，其存储机器可执行指令，当所述机器可执行指令由一个或多个处理单元执行时，使所述一个或多个处理单元执行动作，所述动作包括：至少部分地基于与第一进程相关联的第一网络接口控制器和与第二进程相关联的第二网络接口控制器是否位于同一计算节点中或链接到同一片交换机来确定将数据从所述第一进程路由到所述第二进程的路由标识符，所述第一进程和所述第二进程属于网络拓扑下连接多个不同节点的特定节点间环；以及根据所述路由标识符将所述数据从所述第一进程路由到所述第二进程。In a third aspect, a first computing node is provided, including: one or more processing units; and a memory that stores machine-executable instructions, which when executed by the one or more processing units, causes The one or more processing units perform actions that include based, at least in part, on whether a first network interface controller associated with the first process and a second network interface controller associated with the second process are co-located. The computing node or connected to the same switch determines the routing identifier for routing data from the first process to the second process. The first process and the second process belong to a network topology that connects multiple different processes. a specific inter-node ring of nodes; and routing the data from the first process to the second process based on the routing identifier.

第四方面，提供了一种第一计算节点，包括：确定模块，用于至少部分地基于与第一进程相关联的第一网络接口控制器和与第二进程相关联的第二网络接口控制器是否位于同一计算节点中或链接到同一片交换机来确定将数据从所述第一进程路由到所述第二进程的路由标识符，所述第一进程和所述第二进程属于网络拓扑下连接多个不同节点的特定节点间环；以及路由模块，用于根据所述路由标识符将所述数据从所述第一进程路由到所述第二进程。In a fourth aspect, a first computing node is provided, comprising: a determining module for determining based, at least in part, on a first network interface controller associated with the first process and a second network interface controller associated with the second process. Whether the processor is located in the same computing node or linked to the same switch to determine the routing identifier for routing data from the first process to the second process, the first process and the second process belong to the network topology a specific inter-node ring connecting a plurality of different nodes; and a routing module for routing the data from the first process to the second process based on the routing identifier.

第五方面，提供了一种计算机程序产品，包括计算机程序，所述计算机程序在计算机中运行时，用于实现第一方面或第一方面中任一可能的实现方式中所述的方法。In a fifth aspect, a computer program product is provided, including a computer program, wherein when the computer program is run in a computer, the computer program is used to implement the method described in the first aspect or any possible implementation manner of the first aspect.

附图说明Description of drawings

参考附图进行详细描述。附图中，附图标记的最左侧数字表示该附图标记第一次出现的附图。不同附图中使用相同的附图标记表示相似或相同的指代。Detailed description is given with reference to the accompanying drawings. In the drawings, the leftmost digit of a reference number indicates the figure in which the reference number first appears. The use of the same reference numbers in different drawings indicates similar or identical references.

图1示出了分布式训练系统可应用的示例环境。Figure 1 shows an example environment in which a distributed training system may be applied.

图2示出了更详细的示例性计算节点。Figure 2 shows an example compute node in more detail.

图3A示出了将预设数量的节点互相连接的环形配置。Figure 3A shows a ring configuration interconnecting a preset number of nodes.

图3B示出了将预设数量的节点互相连接的减半加倍配置。FIG. 3B shows a halving and doubling configuration in which a preset number of nodes are connected to each other.

图4示出了示例性集群通信库的示意图。Figure 4 shows a schematic diagram of an exemplary cluster communications library.

图5示出了示例性拓扑感知多阶段算法。Figure 5 illustrates an exemplary topology-aware multi-stage algorithm.

图6示出了用于计算节点的节点内归约散布阶段的示例性基于环的算法。Figure 6 illustrates an exemplary ring-based algorithm for computing the intra-node reduction spread phase of a node.

图7示出了用于计算节点的节点内归约散布阶段的示例性减半加倍算法。Figure 7 illustrates an exemplary halving-doubling algorithm for computing the intra-node reduction spread phase of a node.

图8示出了节点间全局归约阶段的示例性减半加倍算法。Figure 8 shows an exemplary halve-double algorithm for the inter-node global reduction phase.

图9示出了更详细的节点间全局归约阶段的示例性减半加倍算法。Figure 9 shows an exemplary halving-doubling algorithm for the inter-node global reduction phase in more detail.

图10示出了示例性基于环的集群通信算法。Figure 10 illustrates an exemplary ring-based cluster communication algorithm.

图11示出了用并行或重叠的方式执行节点内归约散布阶段、节点间全局归约阶段、节点内全局聚集阶段的示例场景。Figure 11 shows an example scenario of executing the intra-node reduction spreading phase, the inter-node global reduction phase, and the intra-node global aggregation phase in a parallel or overlapping manner.

图12示出了示例性胖树(fat-tree)网络拓扑。FIG. 12 illustrates an exemplary fat-tree network topology.

图13示出了使用第一拥塞避免方法的示例场景。Figure 13 shows an example scenario using the first congestion avoidance method.

图14示出了使用第二拥塞避免方法的示例场景。FIG. 14 shows an example scenario using the second congestion avoidance method.

图15示出了示例性拓扑感知多阶段方法。Figure 15 illustrates an exemplary topology-aware multi-stage approach.

图16示出了第一示例性网络拥塞避免方法。Figure 16 illustrates a first exemplary network congestion avoidance method.

图17示出了第二示例性网络拥塞避免方法。Figure 17 illustrates a second exemplary network congestion avoidance method.

图18示出了分布式训练中基于混合架构的示例性并行方法。Figure 18 shows an exemplary parallel approach based on hybrid architecture in distributed training.

具体实施方式Detailed ways

概述Overview

如上所述，现有的分布式训练系统由于分布式训练系统中计算节点间的数据通信而对良好可扩展性造成了性能瓶颈。此外，由于网络结构的多样性(包括例如以太网、无限带宽(InfiniBand)、PCIe、NVLink、NVSwitch、QPI/UPI等)和网络特征(例如延迟、带宽和拓扑等)的高度差异，分布式训练系统通常不能很好地利用这种异构类型的连接或互连来执行计算节点中和计算节点之间的集群数据运算以及计算节点之间的数据传输。此外，由于对计算节点间发送的数据进行路由的路径选择可能失去控制，而发生网络拥塞，从而导致过量的数据流通过分布式训练系统中计算节点之间的特定网络交换机或连接，且导致由处理训练结果的延迟引发的训练时间延长。此外，现有的分布式训练系统未能区分用于集群运算的不同类型的底层结构的算法，因此导致较差的性能。As mentioned above, existing distributed training systems create performance bottlenecks for good scalability due to data communication between computing nodes in the distributed training system. In addition, due to the diversity of network structures (including, for example, Ethernet, InfiniBand, PCIe, NVLink, NVSwitch, QPI/UPI, etc.) and the high differences in network characteristics (such as latency, bandwidth, topology, etc.), distributed training Systems often do not make good use of this heterogeneous type of connection or interconnection to perform cluster data operations in and between compute nodes and to transfer data between compute nodes. Additionally, network congestion occurs because the path selection for routing data sent between compute nodes can get out of control, causing excessive data flow through specific network switches or connections between compute nodes in a distributed training system and resulting in Delay in processing training results results in extended training time. Furthermore, existing distributed training systems fail to differentiate between algorithms for different types of underlying structures for cluster operations, thus resulting in poor performance.

本公开描述了示例性分布式训练系统。在实施中，示例性分布式训练系统可以采用结构感知的集群通信库，该库使得分布式训练系统能够线性扩展。在实施中，集群通信库可以至少部分地基于对底层结构和支持网络架构的分析来定制通信算法，以获得期望的或最大的效率。在实施中，分布式训练系统可以将基本运算分成多个子运算，每个子运算使用一种类型的结构。This disclosure describes an exemplary distributed training system. In implementation, an exemplary distributed training system may employ a structure-aware cluster communication library that enables the distributed training system to scale linearly. In an implementation, the cluster communication library may tailor communication algorithms to achieve desired or maximum efficiency based at least in part on analysis of the underlying structure and supporting network architecture. In implementation, a distributed training system can split a basic operation into multiple sub-operations, each using one type of structure.

在实施中，示例性分布式训练系统可以实现混合算法，该混合算法允许多个算法在单个集群运算中共存，并且选择性地采用用于特定结构的算法来提高或最大化整个通信路径的效率。在实施中，分布式训练系统可以采用双进程并行算法，该算法启动两个并发进程，并流水线化(pipeline)节点内和节点间连接的使用，从而通过重叠节点内通信和节点间通信来提高通信效率。In implementation, an exemplary distributed training system may implement a hybrid algorithm that allows multiple algorithms to coexist in a single cluster operation and selectively employs algorithms for specific structures to improve or maximize the efficiency of the entire communication path. . In implementation, a distributed training system can employ a dual-process parallel algorithm that launches two concurrent processes and pipelines the use of intra-node and inter-node connections, thereby improving performance by overlapping intra-node and inter-node communication. Communication efficiency.

在实施中，示例性分布式训练系统可以采用基于探测的路由控制机制，该机制生成从连接到路径的映射，从而通过对集群运算中的参与者或过程重新排序并将分布式训练系统上的数据流映射到特定物理链路，来将连接分布或散布到通信网络中的不同汇聚或中间交换机，从而避免网络拥塞。In an implementation, an exemplary distributed training system may employ a probe-based routing control mechanism that generates a mapping from connections to paths, thereby reordering participants or processes in a cluster operation and routing them across the distributed training system. Data flows are mapped to specific physical links to distribute or spread connections to different aggregation or intermediate switches in the communications network to avoid network congestion.

本申请描述了多个不同的实施例和实施方式。以下部分描述了适用于实践各种实施方式的示例性框架。接下来，本申请描述了用于实现分布式训练系统的示例性系统、设备和过程。This application describes a number of different embodiments and implementations. The following section describes an exemplary framework suitable for practicing various implementations. Next, this application describes an exemplary system, device, and process for implementing a distributed training system.

示例性环境Example Environment

图1示出了可用于实施分布式训练系统的示例性环境100。该环境100可以包括分布式训练系统102。在该示例中，分布式训练系统102可以包括多个计算节点或服务器104-1、104-2、...、104-K(下文统称为计算节点104)，其中K是大于1的正整数。在实施中，多个计算节点104可以通过通信网络106彼此通信数据。1 shows an exemplary environment 100 that can be used to implement a distributed training system. The environment 100 may include a distributed training system 102. In this example, the distributed training system 102 may include a plurality of computing nodes or servers 104-1, 104-2, ..., 104-K (hereinafter collectively referred to as computing nodes 104), where K is a positive integer greater than 1. In implementation, the plurality of computing nodes 104 may communicate data with each other via a communication network 106.

计算节点104可以实施为具有计算/处理和通信能力的任意多种计算设备，包括但不限于服务器、台式电脑、笔记本电脑或便携式电脑、手持设备、上网本、互联网设备、平板电脑、移动设备(如移动电话、个人数字助理、智能电话等)等，或其组合。Computing node 104 may be implemented as any variety of computing devices with computing/processing and communication capabilities, including but not limited to servers, desktop computers, notebook or portable computers, handheld devices, netbooks, Internet devices, tablets, mobile devices (e.g. Mobile phones, personal digital assistants, smart phones, etc.), etc., or combinations thereof.

通信网络106可以是无线或有线网络，或其组合。网络106可以是彼此互连并用作单个大型网络(例如，互联网或内联网)的独立网络的集合。这种独立网络的示例包括但不限于电话网络、电缆网络、局域网(Local Area Network，LAN)、广域网(Wide AreaNetwork，WAN)和城域网(Metropolitan Area Network，MAN)。此外，独立网络可以是无线或有线网络，或其组合。有线网络可以包括电载波连接(例如通信电缆等)和/或光学载体或连接(例如光纤连接等)。无线网络可以包括例如WiFi网络、其他射频网络(例如蓝牙紫峰(Zigbee)等)等。在实施中，通信网络106可以包括用于提供计算节点104之间的连接的多个节点间互连或交换机108-1、108-2、…、108-L(下文统称为节点间交换机108)，其中L是大于1的正整数。Communication network 106 may be a wireless or wired network, or a combination thereof. Network 106 may be a collection of independent networks that are interconnected with each other and serve as a single large network (eg, the Internet or an intranet). Examples of such independent networks include, but are not limited to, telephone networks, cable networks, Local Area Networks (LANs), Wide Area Networks (WANs), and Metropolitan Area Networks (MANs). Additionally, a stand-alone network may be a wireless or wired network, or a combination thereof. A wired network may include electrical carrier connections (eg, communications cables, etc.) and/or optical carriers or connections (eg, fiber optic connections, etc.). Wireless networks may include, for example, WiFi networks, other radio frequency networks (e.g., Bluetooth Zifeng (Zigbee, etc.) etc. In an implementation, communication network 106 may include a plurality of inter-node interconnects or switches 108-1, 108-2, ..., 108-L (hereinafter collectively referred to as inter-node switches 108) for providing connectivity between computing nodes 104. , where L is a positive integer greater than 1.

在实施中，环境100可以进一步包括客户端设备110。用户可以指示分布式训练系统102基于从客户端设备110发送给分布式训练系统102的数据，对特定的学习模型(例如深度神经网络模型)执行训练。In an implementation, environment 100 may further include client device 110 . A user may instruct distributed training system 102 to perform training on a specific learning model (eg, a deep neural network model) based on data sent to distributed training system 102 from client device 110 .

示例性计算节点Example compute node

图2示出了更详细的计算节点104。在实施中，计算节点104可以包括但不限于一个或多个处理单元202、输入/输出(input/output，I/O)接口204和/或一个或多个网络接口206以及存储器208。在实施中，计算节点104可以进一步包括一个或多个节点内互连或交换机210。Figure 2 shows the compute node 104 in greater detail. In an implementation, the computing node 104 may include, but is not limited to, one or more processing units 202 , input/output (I/O) interfaces 204 and/or one or more network interfaces 206 , and memory 208 . In an implementation, the compute node 104 may further include one or more intra-node interconnects or switches 210 .

在实施中，处理单元202可以被配置成执行存储在存储器208中的和/或从输入/输出接口204和/或网络接口206接收的指令。在实施中，处理单元202可被实施为一个或多个硬件处理器，包括例如微处理器、专用指令集处理器、物理处理单元(Physics ProcessingUnit，PPU)、中央处理单元(Central Processing Unit，CPU)、图形处理单元、数字信号处理器、张量处理单元等。附加地或替代地，这里描述的功能可以至少部分地由一个或多个硬件逻辑组件来执行。例如，但不限于，可以使用的硬件逻辑组件的示例类型包括现场可编程门阵列(Field Programmable Gate Array，FPGA)、专用集成电路(Application-SpecificIntegrated Circuit，ASIC)、专用标准产品(Application-Specific Standard Product，ASSP)、片上系统(System-on-a-Chip System，SOC)、复杂可编程逻辑器件(ComplexProgrammable Logic Device，CPLD)等。In implementations, processing unit 202 may be configured to execute instructions stored in memory 208 and/or received from input/output interface 204 and/or network interface 206 . In an implementation, the processing unit 202 may be implemented as one or more hardware processors, including, for example, a microprocessor, a dedicated instruction set processor, a physical processing unit (Physics Processing Unit, PPU), a central processing unit (Central Processing Unit, CPU). ), graphics processing unit, digital signal processor, tensor processing unit, etc. Additionally or alternatively, the functions described herein may be performed, at least in part, by one or more hardware logic components. For example, but not limited to, example types of hardware logic components that can be used include field programmable gate arrays (FPGAs), application-specific integrated circuits (Application-Specific Integrated Circuits, ASICs), application-specific standard products (Application-Specific Standard Product, ASSP), System-on-a-Chip System (SOC), Complex Programmable Logic Device (CPLD), etc.

存储器208可以包括易失性存储器形式的机器可读介质，例如随机存取存储器(Random Access Memory，RAM)和/或非易失性存储器，例如只读存储器(Read OnlyMemory，ROM)或闪存RAM。存储器208是机器可读介质的一种示例。Memory 208 may include machine-readable media in the form of volatile memory, such as random access memory (Random Access Memory, RAM) and/or non-volatile memory, such as read only memory (Read Only Memory, ROM) or flash RAM. Memory 208 is an example of a machine-readable medium.

机器可读介质可以包括易失性或非易失性类型、可移动或不可移动介质，其可以使用任何方法或技术来实现信息的存储。该信息可以包括机器可读指令、数据结构、程序模块或其他数据。机器可读介质的示例包括但不限于相变存储器(Phase-Change Memory，PRAM)、静态随机存取存储器(Static Random Access Memory，SRAM)、动态随机存取存储器(Dynamic Random Access Memory，DRAM)、其他类型的随机存取存储器(Random-AccessMemory，RAM)、只读存储器(Read-Only Memory，ROM)、电可擦除可编程只读存储器(Electronically Erasable Programmable Read-Only Memory，EEPROM)、快速闪存或其他内部存储技术、光盘只读存储器(Compact Disk Read-Only Memory，CD-ROM)、数字通用光盘(Digital Versatile Disc，DVD)或其他光存储器、盒式磁带、磁盘存储或其他磁存储设备、或任何其他可用于存储可被计算节点访问的信息的非传输介质。如这里所定义的，机器可读介质不包括任何瞬态介质，例如调制数据信号和载波。Machine-readable media may include volatile or nonvolatile types, removable or non-removable media that may utilize any method or technology for storage of information. This information may include machine-readable instructions, data structures, program modules or other data. Examples of machine-readable media include, but are not limited to, phase-change memory (Phase-Change Memory, PRAM), static random access memory (Static Random Access Memory, SRAM), dynamic random access memory (Dynamic Random Access Memory, DRAM), Other types of random access memory (Random-AccessMemory, RAM), read-only memory (Read-Only Memory, ROM), electrically erasable programmable read-only memory (Electronically Erasable Programmable Read-Only Memory, EEPROM), fast flash memory or other internal storage technology, Compact Disk Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, tape cassette, magnetic disk storage or other magnetic storage device, or any other non-transmission medium that can be used to store information that can be accessed by a computing node. As defined herein, machine-readable media does not include any transient media, such as modulated data signals and carrier waves.

在实施中，网络接口206可以被配置成通过通信网络106将计算节点104连接到其他计算节点。在实施中，网络接口206可以通过网络接口控制器(Network InterfaceController，NIC)来建立，该网络接口控制器可以使用硬件和软件来将计算节点104连接到通信网络106。在实施中，每种类型的NIC可以使用不同类型的结构或连接器来连接到与通信网络106相关联的物理介质。在IEEE 802规范中可以找到各种类型的结构或连接器的示例，并且可以包括例如以太网(在802.3中定义)、令牌环网(在802.5中定义)、无线网络(在802.11中定义)、无限带宽(InfiniBand)等。In an implementation, network interface 206 may be configured to connect computing node 104 to other computing nodes through communication network 106 . In an implementation, the network interface 206 may be established through a Network Interface Controller (NIC), which may use hardware and software to connect the computing node 104 to the communication network 106 . In implementation, each type of NIC may use a different type of fabric or connector to connect to the physical medium associated with the communications network 106 . Examples of various types of structures or connectors can be found in the IEEE 802 specification and can include, for example, Ethernet (defined in 802.3), Token Ring (defined in 802.5), Wireless Network (defined in 802.11) , unlimited bandwidth (InfiniBand), etc.

在实施中，节点内交换机210可以包括各种类型的互连或交换机，其可以包括但不限于高速串行计算机扩展总线(例如PCIe等)、串行多通道近距离通信链路(例如Nolan，其是基于有线的通信协议串行多通道近距离通信链路)、具有多个端口的交换机芯片(例如NVSwitch等)、点对点处理器互连(例如英特尔QPI/UPI等)等。In implementations, intra-node switch 210 may include various types of interconnects or switches, which may include, but are not limited to, high-speed serial computer expansion buses (e.g., PCIe, etc.), serial multi-channel short-range communications links (e.g., Nolan, It is a wired-based communication protocol (serial multi-channel short-range communication link), a switch chip with multiple ports (such as NVSwitch, etc.), point-to-point processor interconnect (such as Intel QPI/UPI, etc.), etc.

尽管在该示例中，在计算节点104中仅描述了硬件组件，但是在其他实例中，计算节点104可以进一步包括其他硬件组件和/或其他软件组件，例如执行存储在存储器208中的指令以执行各种操作的程序模块212，以及用于存储接收的用于训练的数据、训练期间计算的中间和最终结果等的程序数据214。Although in this example, only hardware components are described in compute node 104 , in other examples, compute node 104 may further include other hardware components and/or other software components, such as to execute instructions stored in memory 208 to perform Program modules 212 for various operations, and program data 214 for storing data received for training, intermediate and final results calculated during training, etc.

示例性集群通信算法Example cluster communication algorithm

图3A和3B示出了可以在分布式训练系统102中使用的示例性集群通信算法。在实施中，集群通信算法可以包括，但不限于，基于环的通信算法、减半加倍通信算法等。3A and 3B illustrate exemplary cluster communication algorithms that may be used in distributed training system 102. In implementation, cluster communication algorithms may include, but are not limited to, ring-based communication algorithms, halve-double communication algorithms, etc.

图3A示出了环形配置，该配置将预定数量的节点(例如，N个节点，其中N是大于1的正整数)与多个连接(即，N个连接)互连，并将数据(例如，数据包或消息)划分成多个数据块(即，N个数据块)用于传输，且需要多个步骤(在该示例中为N-1个步骤)的通信来完成集群运算。在每个步骤中，节点可以从其相邻节点之一接收数据，对所接收的数据进行特定操作以获得本地结果，并且将所接收的数据转发给相邻节点中的另一个。在N-1个步骤之后，环中的每个节点具有来自环中其他节点的数据，并且最终结果被散布到所有节点，这需要另外的N-1个步骤来广播各自的本地结果。对于每个节点，转发的总数据大小是2S，其中S表示数据大小或消息大小。3A illustrates a ring configuration that interconnects a predetermined number of nodes (eg, N nodes, where N is a positive integer greater than 1) with a plurality of connections (ie, N connections) and connects data (eg, , data packet or message) is divided into multiple data blocks (i.e., N data blocks) for transmission, and communication of multiple steps (N-1 steps in this example) is required to complete the cluster operation. At each step, a node may receive data from one of its neighboring nodes, perform specific operations on the received data to obtain local results, and forward the received data to another of its neighboring nodes. After N-1 steps, each node in the ring has data from other nodes in the ring, and the final result is spread to all nodes, which requires an additional N-1 steps to broadcast the respective local results. For each node, the total data size forwarded is 2S, where S represents the data size or message size.

图3B示出了互连预定数量的节点(例如，N个节点，其中N是大于1的正整数)的减半加倍配置。在这种减半的配置中，节点以成对的方式相互通信，通信的每一步只需要N/2个连接。在第一步中，相邻节点被配对在一起，向各自的配对节点发送一半的消息或数据，并接收另一半的消息或数据进行处理。因此，中间结果可以散布至配对节点。在随后的步骤中，以增加或加倍的距离形成新的配对，并且用于处理的数据大小减半。在log₂N步的通信之后，结果被散布在减半加倍配置中的所有节点中。然后将节点中的本地结果通过附加的log₂N步通信广播到其他节点。Figure 3B illustrates a halved-double configuration interconnecting a predetermined number of nodes (eg, N nodes, where N is a positive integer greater than 1). In this halved configuration, nodes communicate with each other in pairs, and each step of communication requires only N/2 connections. In the first step, adjacent nodes are paired together, sending half of the messages or data to their respective paired nodes and receiving the other half for processing. Therefore, intermediate results can be spread to paired nodes. In subsequent steps, new pairs are formed with increased or doubled distances, and the data size for processing is halved. After log ₂ N steps of communication, the results are spread across all nodes in the halved-double configuration. The local results in the node are then broadcast to other nodes via additional log ₂ N steps of communication.

示例性集群通信库Example cluster communication library

图4示出了描绘可以由分布式训练系统102使用的示例集群通信库400的示意图。在实施中，集群通信库是一种设计成提供高性能、高可扩展性和强可用性的通信库，并且可以被配置成不仅对诸如全局归约(Allreduce)和全局聚集(Allgather)运算之类的标准集群运算提供支持，还对定制应用的其他自定义运算提供支持。在实施中，集群通信库400可以采用具有不同特性(例如，延迟、带宽、拓扑)的不同类型的互连或交换机，并提供一种机制来收集网络和计算节点中底层硬件的信息，从而可以基于一条或多条所收集的信息来开发拓扑感知算法的设计。FIG. 4 shows a schematic diagram depicting an example cluster communication library 400 that may be used by the distributed training system 102. In implementation, the cluster communication library is a communication library designed to provide high performance, high scalability and strong availability, and can be configured not only for operations such as global reduction (Allreduce) and global aggregation (Allgather). It provides support for standard cluster operations as well as other custom operations for custom applications. In an implementation, the cluster communication library 400 may employ different types of interconnects or switches with different characteristics (e.g., latency, bandwidth, topology) and provide a mechanism to collect information about the underlying hardware in the network and compute nodes so that The design of a topology-aware algorithm is developed based on one or more pieces of collected information.

在实施中，集群通信库400可以提供灵活性，以允许在单个运算中执行多个算法，并通过利用节点内通信和节点间通信之间的并行性来提高性能(例如，通信和训练的性能等)。此外，集群通信库400可以在具有传统或新的映射算法的计算节点中利用多个NIC，并且通过连接的拓扑感知布置来消除网络拥塞。In implementation, the cluster communication library 400 may provide flexibility to allow multiple algorithms to be executed in a single operation and improve performance (e.g., communication and training performance) by exploiting parallelism between intra-node and inter-node communication. wait). Additionally, the cluster communication library 400 can utilize multiple NICs in compute nodes with legacy or new mapping algorithms and eliminate network congestion through topology-aware placement of connections.

在实施中，集群通信库400可以包括软件栈402。在实施中，软件栈402可以包括多个组件，其中可以包括但不限于传输组件404、运算组件406、通信器组件408和库上下文组件410。在实施中，软件栈402可以用模块化方式来设计，以允许通用性和可扩展性。In implementation, cluster communication library 400 may include a software stack 402. In implementation, software stack 402 may include multiple components, which may include but are not limited to transport component 404, operation component 406, communicator component 408, and library context component 410. In implementation, software stack 402 may be designed in a modular manner to allow generality and extensibility.

在实施中，传输组件404可以负责节点内和节点间通信中的对等(Peer-to-Peer，P2P)数据的转移或传输。作为示例而非限制，集群通信库400可以支持用于节点间通信的TCP(Transmission Control Protocol，传输控制协议)和RDMA(Remote Direct MemoryAccess，远程直接存储器访问)，以及用于节点内通信的P2P结构，例如PCIe(PeripheralComponent Interconnect Express，快速外围组件互连)、NVLink/NVSwitch和QPI/UPI(Quick Path Interconnect/Ultra Path Interconnect，快速路径互连/超路径互连)等。对于RDMA通信，传输组件404还可被配置成管理处理单元(如图形处理单元(GraphicsProcessing unit，GPU)设备)和主机存储器中的存储器区域(Memory Region，MR)和对应的存储器缓冲区。In implementation, the transmission component 404 may be responsible for the transfer or transmission of peer-to-peer (P2P) data in intra-node and inter-node communications. As an example and not a limitation, the cluster communication library 400 may support TCP (Transmission Control Protocol) and RDMA (Remote Direct Memory Access) for inter-node communication, as well as P2P structures for intra-node communication, such as PCIe (Peripheral Component Interconnect Express), NVLink/NVSwitch, and QPI/UPI (Quick Path Interconnect/Ultra Path Interconnect), etc. For RDMA communication, the transmission component 404 may also be configured to manage memory regions (MR) and corresponding memory buffers in a processing unit (such as a graphics processing unit (GPU) device) and a host memory.

在实施中，运算组件406可以提供一组基本运算和各种网络算法。例如，基本运算可以被配置有集群通信库400所支持的算法。此外，运算组件406可允许用户基于这些基本运算定义新运算，以实现异构性感知运算，该运算可对每种类型的结构采用最优或更好的算法。In an implementation, the operations component 406 may provide a set of basic operations and various network algorithms. For example, the basic operations may be configured with algorithms supported by the cluster communication library 400. In addition, the operation component 406 can allow users to define new operations based on these basic operations to implement heterogeneity-aware operations that can employ optimal or better algorithms for each type of structure.

在实施中，通信器组件408可以与软件进程相关联，并且可以被配置成在处理单元(诸如GPU设备)上执行操纵和处理。通信器组件408可以保存或记录关于其他对等组件的信息(例如，排序ID、IP地址等)，并保持与对等组件的连接。在实施中，通信器组件408可以进一步收集节点内和节点间拓扑信息，并使用该信息来指导算法设计。在实施中，节点内信息可以包括但不限于互连的类型、处理单元的位置之间的距离、处理单元和网络接口控制器之间的距离等。在实施中，节点间信息可以包括，但不限于，例如，可用网络接口控制器的数量、集群或计算节点的拓扑、集群中计算节点的位置。In an implementation, communicator component 408 may be associated with a software process and may be configured to perform manipulation and processing on a processing unit, such as a GPU device. The communicator component 408 may save or log information about other peer components (eg, sort IDs, IP addresses, etc.) and maintain connections with peer components. In an implementation, the communicator component 408 may further collect intra-node and inter-node topology information and use this information to guide algorithm design. In an implementation, intra-node information may include, but is not limited to, types of interconnects, distances between locations of processing units, distances between processing units and network interface controllers, and the like. In an implementation, the inter-node information may include, but is not limited to, for example, the number of available network interface controllers, the topology of the cluster or compute nodes, and the location of the compute nodes in the cluster.

在实施中，库上下文组件410可被配置为开放(expose)用于设置系统配置(例如环境变量)、管理通信器组件408的一个或多个应用接口，并提供诸如日志记录等其他功能。In an implementation, library context component 410 may be configured to expose one or more application interfaces for setting system configuration (eg, environment variables), managing communicator component 408, and providing other functionality such as logging.

此外，在一些情况下，集群通信库400可以进一步包括或提供用于拓扑感知设计、测试和评估以及可用性改进的多个工具和实用程序412。作为示例而非限制，工具和实用程序412可包括用于传输组件404的性能测试工具以辅助算法设计和评估、用于确保系统可用性的基于探测的路由机制以及其他功能，例如可扩展到支持除GPU之外的设备的设备管理功能。Additionally, in some cases, the cluster communications library 400 may further include or provide a number of tools and utilities 412 for topology-aware design, testing and evaluation, and usability improvements. By way of example, and not limitation, tools and utilities 412 may include performance testing tools for transport components 404 to assist in algorithm design and evaluation, probe-based routing mechanisms to ensure system availability, and other functionality, such as scalable to support in addition to Device management capabilities for devices other than GPUs.

用于集群通信的示例性拓扑感知多阶段算法Exemplary topology-aware multi-stage algorithm for cluster communication

在实施中，集群通信可以被定义为涉及一组处理单元或进程的通信，并且集群通信的运算可以由该组中包括的所有处理单元或进程一起执行。集群通信运算的实例可以包括但不限于全局归约运算、全局聚集运算、归约散布(Reduce-Scatter)运算等。在实施中，全局归约运算是分布式训练中集群通信的许多重要基础之一，并且涉及组中跨进程对数据执行归约。归约的示例可以包括但不限于求和运算、求平均值运算、求最大值运算、求最小值运算等。In an implementation, cluster communication may be defined as communication involving a group of processing units or processes, and operations of the cluster communication may be performed together by all processing units or processes included in the group. Examples of cluster communication operations may include, but are not limited to, global reduction operations, global aggregation operations, reduce-scatter operations, etc. In implementation, global reduction operations are one of many important foundations for cluster communication in distributed training and involve performing reductions on data across processes in the group. Examples of reductions may include, but are not limited to, summation operations, averaging operations, maximization operations, minimization operations, and the like.

作为示例而非限制，这里以全局归约运算为示例来说明如何将集群运算划分成多个微运算或子运算。在实施中，分布式训练系统102可以采用拓扑感知的多阶段算法，该算法将全局归约运算分成多个微运算或子运算，并根据需要选择性地挑选一个或多个微运算或子运算，从而通过消除可能不需要的微运算或子运算来减少传输的数据量。在实施中，分布式训练系统102可以将集群通信算法从微运算或子运算中分离，并允许基于底层结构信息在算法和微运算或子运算之间进行独立或单独的匹配，从而通过减少传输数据量，最大化或优化带宽利用。As an example and not a limitation, the global reduction operation is used here to illustrate how to divide the cluster operation into multiple micro-operations or sub-operations. In implementation, the distributed training system 102 may adopt a topology-aware multi-stage algorithm that divides the global reduction operation into multiple micro-operations or sub-operations, and selectively selects one or more micro-operations or sub-operations as needed. , thereby reducing the amount of data transferred by eliminating micro-operations or sub-operations that may not be needed. In implementations, the distributed training system 102 may decouple cluster communication algorithms from micro-operations or sub-operations and allow independent or individual matching between algorithms and micro-operations or sub-operations based on underlying structural information, thereby reducing transmission Data volume, maximize or optimize bandwidth utilization.

图5示出了可以用于分布式训练系统102的示例性拓扑感知多阶段算法500。在实施中，拓扑感知多阶段算法500可以包括多个阶段，例如，节点内归约散布阶段502、节点间全局归约阶段504和节点内全局聚集阶段506。FIG. 5 illustrates an exemplary topology-aware multi-stage algorithm 500 that may be used with distributed training system 102. In an implementation, the topology-aware multi-stage algorithm 500 may include multiple stages, such as an intra-node reduction spreading stage 502, an inter-node global reduction stage 504, and an intra-node global aggregation stage 506.

在实施中，分布式训练系统102可以首先将用于训练的待处理数据的各部分分配给多个计算节点104，使得多个计算节点104的每个计算节点104可以接收数据的相应部分。在实施中，每个计算节点104可以将数据的相应部分划分成多个数据片(例如，N个数据片，其中N是正整数)，并将这多个数据片分配给包括在相应计算节点104中的多个本地处理单元或进程(例如，N个本地处理单元或进程)。In an implementation, the distributed training system 102 may first allocate portions of the data to be processed for training to multiple computing nodes 104 such that each computing node 104 of the multiple computing nodes 104 can receive the corresponding portion of the data. In an implementation, each computing node 104 may divide the corresponding portion of the data into multiple data slices (eg, N data slices, where N is a positive integer), and allocate the multiple data slices to the data included in the corresponding computing node 104 Multiple local processing units or processes in (for example, N local processing units or processes).

在实施中，在节点内归约散布阶段502中，包括在每个计算节点104中的每个本地处理单元或进程可以将分配给它的数据片划分成多个数据块(例如，M个块)。然后包括在每个计算节点104中的本地处理单元或进程可以协作地执行节点内归约散布子运算，以根据特定的集群通信算法在多个步骤或迭代中获得相应计算节点104中的多个数据块的所有归约结果。在节点内归约散布阶段502结束时，计算节点104中包括的本地处理单元或进程可能具有不同数据块中的该计算节点104中包括的所有处理单元或进程的归约结果(或称为归约散布结果)。In an implementation, during the intra-node reduction spreading phase 502, each local processing unit or process included in each compute node 104 may divide its assigned data slice into a plurality of data chunks (e.g., M chunks ). Local processing units or processes included in each compute node 104 may then cooperatively perform intra-node reduction spread sub-operations to obtain multiple values in the corresponding compute node 104 in multiple steps or iterations according to a particular cluster communication algorithm. All reduction results of the data block. At the end of the intra-node reduction spread phase 502 , a local processing unit or process included in a compute node 104 may have the reduction results (or referred to as a reduction) of all processing units or processes included in the compute node 104 in different data blocks. about dispersion results).

作为示例而非限制，以基于环的算法和减半加倍算法这两个示例性集群通信算法作为示例来描述特定的集群通信算法，以示出节点内归约散布阶段502中的特定机制或运算。然而，在该节点内归约散布阶段502中可以使用其他集群通信算法。例如，分布式训练系统102可以基于由集群通信库400收集的多个因素的信息，选择在节点内归约散布阶段502中使用的特定集群通信算法。在实施中，该多个因素可以包括但不限于计算节点中的处理单元(或进程)之间的互连类型、计算节点中的互连数量等。By way of example and not limitation, specific cluster communication algorithms are described using two exemplary cluster communication algorithms, a ring-based algorithm and a halving-doubling algorithm, to illustrate specific mechanisms or operations in the intra-node reduction spread phase 502 . However, other cluster communication algorithms may be used in the intra-node reduction spreading phase 502. For example, the distributed training system 102 may select a particular cluster communication algorithm for use in the intra-node reduction dispersion phase 502 based on information collected by the cluster communication library 400 on multiple factors. In an implementation, the plurality of factors may include, but are not limited to, the type of interconnections between processing units (or processes) in the computing node, the number of interconnections in the computing node, and the like.

例如，在节点内归约散布阶段502中，分布式训练系统102可以为第一计算节点采用第一集群通信算法，为第二计算节点采用第二集群通信算法，其中第二计算节点具有与第一计算节点相同或不同的处理和连接能力，且第一集群通信算法可以与第二集群通信算法相同或不同。作为示例而非限制，对于使用NVSwitch或PCIe进行互连、并包括数量为2的幂的多个用于训练的处理单元或进程的计算节点，分布式训练系统102可采用减半加倍算法，而对使用NVLink或其他进行互连、并使用数量为非2的幂的多个用于训练的处理单元或进程的另一计算节点，分布式训练系统102采用基于环的算法，等等。For example, in the intra-node reduction dispersion phase 502, the distributed training system 102 may employ a first cluster communication algorithm for a first computing node and a second cluster communication algorithm for a second computing node, where the second computing node has the same A computing node may have the same or different processing and connection capabilities, and the first cluster communication algorithm may be the same as or different from the second cluster communication algorithm. By way of example, and not limitation, for compute nodes interconnected using NVSwitch or PCIe and including a power of two number of processing units or processes for training, the distributed training system 102 may employ a halve-double algorithm, and Distributed training system 102 employs a ring-based algorithm for another compute node interconnected using NVLink or other and using a non-power-of-two number of processing units or processes for training, and so on.

图6示出了在节点内归约散布阶段502中用于计算节点的示例性基于环的算法600。出于简洁和描述的目的，示例性基于环的算法仅包括一个环的配置。然而，任何包括多于一个环的配置的基于环的算法都可被采用，例如，每个环处理数据块的一部分。Figure 6 illustrates an exemplary ring-based algorithm 600 for computing nodes in the intra-node reduction spread phase 502. For brevity and description purposes, the exemplary ring-based algorithm only includes a one-ring configuration. However, any ring-based algorithm may be employed that includes a configuration of more than one ring, eg, each ring processing a portion of a data block.

在该示例中，描述的计算节点包括M个处理单元或进程(具有排序标识符或编号1、2、…、M)，且分配给每个处理单元或进程的数据被分成M个数据块。在第一步，处理单元或进程(如P1)可以将其M个数据块中的一个发送给环中的下一个处理单元或进程(如P2)，从环中的前一个处理单元或进程(如PM)接收另一个数据块，并用对应的本地数据块来归约所接收的数据块以获得部分归约的结果。在每个后续步骤中(如第k个步骤)，处理单元或进程(如P1)可以向环中的下一个处理单元或进程(如P2)发送部分归约结果(在本示例中是由P1在第k-1个步骤获得的部分归约结果)，从前一个处理单元或进程(如PM)接收部分归约结果(在本例中是由PM在第k-1个步骤获得的部分归约结果)，并且用先前没有发送或用其他数据归约的另一个本地数据块来归约所接收的部分归约结果。In this example, the computing node described includes M processing units or processes (with ordering identifiers or numbers 1, 2, ..., M), and the data assigned to each processing unit or process is divided into M data blocks. In the first step, the processing unit or process (such as P1) can send one of its M data blocks to the next processing unit or process (such as P2) in the ring, receive another data block from the previous processing unit or process (such as PM) in the ring, and reduce the received data block with the corresponding local data block to obtain a partial reduction result. In each subsequent step (such as the kth step), the processing unit or process (such as P1) can send a partial reduction result (in this example, the partial reduction result obtained by P1 in the k-1th step) to the next processing unit or process (such as P2) in the ring, receive a partial reduction result from the previous processing unit or process (such as PM) (in this example, the partial reduction result obtained by PM in the k-1th step), and reduce the received partial reduction result with another local data block that was not previously sent or reduced with other data.

如图6所示，每一步中，不同的数据块可以被计算节点中不同的处理单元或进程接收和归约或发送。此外，每个处理单元或进程可以在不同的步骤发送或接收并归约不同的数据块(或部分结果)。在节点内归约散布阶段502结束时(即，在M-1个步骤之后)，每个处理单元或进程可以包括结果数据块，该结果数据块存储该计算节点中M个处理单元或进程的M个相应数据块的归约结果。例如，在M-1个步骤之后，“顶部位置”的P1的数据块存储对应于该“顶部位置”的M个处理单元或进程的所有数据块的归约结果，如图6所示。As shown in Figure 6, in each step, different data blocks can be received and reduced or sent by different processing units or processes in the computing node. Additionally, each processing unit or process can send or receive and reduce different chunks of data (or partial results) at different steps. At the end of the intra-node reduction spread phase 502 (i.e., after M-1 steps), each processing unit or process may include a result data block that stores the results of the M processing units or processes in the compute node. The reduction results of M corresponding data blocks. For example, after M-1 steps, the data block of P1 at the "top position" stores the reduction results of all data blocks corresponding to the M processing units or processes of the "top position", as shown in Figure 6.

图7示出了在节点内归约散布阶段502中用于计算节点的示例性减半加倍算法700。在本示例中，描述的计算节点包括M个处理单元或进程(本示例中以M设定为8进行描述)。在第一步中，处理单元或进程(如P1)可以将分配给它的一半数据发送给附近的另一个处理单元或进程(如P2)，并接收分配给该另一个处理单元或进程(如P2)的一半数据，并用分配给本处理单元或进程(如P1)的另一半数据来归约所接收的数据，以获得部分归约结果。在每个后续步骤中，处理单元或进程(如P1)可以将在先前步骤中本地获得的部分归约结果的一半发送给位于离处理单元或进程(即P1)越来越远的位置的不同处理单元或进程，并用在先前步骤中本地获得的部分归约结果的另一半来归约所接收的部分归约结果，以获得处理单元或进程(即P1)的新的部分归约结果。在节点内归约散布阶段502结束时(即，在log₂M个步骤之后，即，在如图7所示的该示例中的3个步骤之后)，每个处理单元或进程可以包括结果数据块，该结果数据块存储该计算节点中M个(在该示例中，如图7所示的8个)处理单元或进程的M个相应数据块的归约结果。例如，在log₂M个步骤之后，“底部位置”的P1的数据块存储对应于如图7所示的该“底部位置”的M个(在该示例中，如图7所示为8个)处理单元或进程的所有数据块的归约结果。Figure 7 illustrates an exemplary halving-doubling algorithm 700 for computing nodes in the intra-node reduction spread phase 502. In this example, the described computing node includes M processing units or processes (in this example, M is set to 8 for description). In the first step, a processing unit or process (e.g. P1) can send half of the data allocated to it to another nearby processing unit or process (e.g. P2) and receive the half of the data allocated to that other processing unit or process (e.g. P2). P2) half of the data, and use the other half of the data allocated to this processing unit or process (such as P1) to reduce the received data to obtain a partial reduction result. At each subsequent step, the processing unit or process (i.e., P1) can send half of the partial reduction result obtained locally in the previous step to a different location located further and further away from the processing unit or process (i.e., P1). processing unit or process, and reduces the received partial reduction result with the other half of the partial reduction result obtained locally in the previous step to obtain a new partial reduction result for the processing unit or process (i.e., P1). At the end of the intra-node reduction spread phase 502 (i.e., after log ₂ M steps, ie, after 3 steps in this example as shown in Figure 7), each processing unit or process may include result data block, the result data block stores the reduction results of M corresponding data blocks of M (in this example, 8 as shown in Figure 7) processing units or processes in the computing node. For example, after log ₂ M steps, the data block storage of P1 at the "bottom position" corresponds to M (in this example, 8 as shown in Figure 7 ) Processes the reduction results of all data blocks of a unit or process.

在实施中，在节点间全局归约阶段504中，节点间全局归约子运算是基于节点的(即，在不同计算节点之间)，并且可以在不同计算节点中包括的处理单元(或进程)之间执行。在实施中，持有相同的归约结果(或归约散布结果)的数据块的不同计算节点的处理单元(或进程)形成同一组，并且在该同一组中相互传送各自的结果以执行节点间全局归约子运算。在节点间全局归约阶段504结束时，特定组中的每个计算节点的每个处理单元或进程可以拥有该同一组中所有处理单元或进程的归约结果的特定数据块，不同组的处理单元或进程拥有不同组中相应处理单元或进程的归约结果的不同数据块。In an implementation, in the inter-node global reduction stage 504, the inter-node global reduction sub-operations are node-based (i.e., between different computing nodes), and may be performed on processing units (or processes) included in different computing nodes. ). In an implementation, the processing units (or processes) of different computing nodes holding data blocks with the same reduction result (or reduction spread result) form the same group, and transfer their respective results to each other in the same group to execute the nodes between global reduction suboperations. At the end of the inter-node global reduction phase 504, each processing unit or process of each compute node in a specific group may have a specific data block of the reduction results of all processing units or processes in the same group, processing of different groups The units or processes have different data blocks of the reduction results of the corresponding processing units or processes in different groups.

在实施中，分布式训练系统102可以基于一个或多个选择标准来选择特定的集群通信算法，并且可以基于所选择的集群通信算法来实现节点间全局归约子运算。特定集群通信算法的示例可以包括但不限于基于环的算法(例如分级环算法、多环算法等)、减半加倍算法等。在实施中，一个或多个选择标准可以包括但不限于连接计算节点的通信网络(例如，通信网络206)的拓扑、通信网络中使用的交换机的数量、通信网络中使用的交换机的类型、通信网络的网络类型等。In implementation, the distributed training system 102 may select a specific cluster communication algorithm based on one or more selection criteria, and may implement the inter-node global reduction sub-operation based on the selected cluster communication algorithm. Examples of specific cluster communication algorithms may include, but are not limited to, ring-based algorithms (e.g., hierarchical ring algorithms, multi-ring algorithms, etc.), halving and doubling algorithms, etc. In implementation, one or more selection criteria may include, but are not limited to, the topology of the communication network (e.g., communication network 206) connecting the computing nodes, the number of switches used in the communication network, the type of switches used in the communication network, the network type of the communication network, etc.

作为示例而非限制，以基于环的算法和减半加倍算法这两个示例性集群通信算法作为示例来描述特定的集群通信算法，以示出节点间全局归约阶段504中的特定机制或运算。然而，基于上述一个或多个选择标准，可以将其他集群通信算法用在该节点间全局归约阶段504中。By way of example and not limitation, specific cluster communication algorithms are described using two exemplary cluster communication algorithms, a ring-based algorithm and a halving-doubling algorithm, to illustrate specific mechanisms or operations in the inter-node global reduction stage 504 . However, other cluster communication algorithms may be used in the inter-node global reduction stage 504 based on one or more of the selection criteria described above.

图8和9出了节点间全局归约阶段504的示例性减半加倍算法。在该示例中，如图8所示，出于简洁和描述的目的，描述的分布式训练系统102包括多个计算节点(即，节点0、节点1、节点2、…节点N-1，为了进行说明，N在图8中被示为4)，其中每个计算节点包括八个处理单元或进程，具有对应的排序编号(即，序位0、序位1、序位2、…序位M-1，为了进行说明，M在图8中被示为8)。如图8所示，在对应的计算节点中具有相同排序编号的处理单元或进程包括相同的归约结果(或归约散布结果)的数据块，并且其形成同一组。例如，在对应的计算节点中具有排序编号0的处理单元或进程包括各本地数据块中第一位置处的归约结果的数据块，并且其形成同一组(例如，组0)。在实施中，不同组中的处理单元或进程可以不相互通信。Figures 8 and 9 illustrate an exemplary halve-double algorithm for the inter-node global reduction stage 504. In this example, as shown in FIG. 8 , for simplicity and description purposes, the distributed training system 102 is described as including a plurality of computing nodes (i.e., Node 0, Node 1, Node 2, ... Node N-1, in order to To illustrate, N is shown as 4 in Figure 8), where each compute node includes eight processing units or processes, with corresponding sequence numbers (i.e., sequence 0, sequence 1, sequence 2, ... sequence M-1, for the purpose of illustration, M is shown as 8) in Figure 8. As shown in FIG. 8 , processing units or processes with the same sorting number in corresponding computing nodes include data blocks of the same reduction result (or reduction spread result), and form the same group. For example, a processing unit or process with order number 0 in the corresponding computing node includes the data block of the reduction result at the first position among the local data blocks, and they form the same group (eg, group 0). In an implementation, processing units or processes in different groups may not communicate with each other.

在实施中，节点间全局归约子运算可以在每个组中的处理单元(或进程)之间单独执行，使得组中的每个处理单元(或进程)可以获得同一组中所有处理单元(或进程)的同一数据块的所有归约结果。类似于上面针对节点内归约散布阶段描述的减半加倍算法的机制，每个组中的处理单元或进程可以与相应组中的其他处理单元或进程迭代地发送对应数据块的本地归约结果，从加倍或增加的距离中的其他处理单元或进程接收对应数据块的各本地归约结果，并且用本地归约结果对所接收的归约结果执行归约运算。In an implementation, the inter-node global reduction sub-operation can be performed individually among the processing units (or processes) in each group, so that each processing unit (or process) in the group can obtain all processing units in the same group ( or process) all reduction results of the same data block. Similar to the mechanism of the halve-double algorithm described above for the intra-node reduction spread phase, processing units or processes in each group can iteratively send the local reduction results of the corresponding data block with other processing units or processes in the corresponding group. , receiving each local reduction result of the corresponding data block from other processing units or processes in doubled or increased distances, and performing a reduction operation on the received reduction result using the local reduction result.

图9示出了对八个计算节点应用减半加倍算法的示例场景。在本示例中，如图9所示，利用减半加倍算法在节点间全局归约阶段504中执行步骤的数量为log₂N＝log₂8＝3，其中N是计算节点的数量。在第一步中，第一计算节点(例如，节点0)中的某个组的第一处理单元或进程(例如，排序编号为0的处理单元或进程)可以向第二计算节点(例如，节点1)中的同一组的第二处理单元或进程发送其本地归约结果，从第二计算节点中的同一组的第二处理单元或进程接收本地归约结果，并对其本地归约结果和接收到的本地归约结果执行归约运算，以获得新的本地归约结果。FIG9 shows an example scenario of applying the halving and doubling algorithm to eight computing nodes. In this example, as shown in FIG9 , the number of steps executed in the inter-node global reduction phase 504 using the halving and doubling algorithm is log ₂ N=log ₂ 8=3, where N is the number of computing nodes. In the first step, a first processing unit or process (e.g., a processing unit or process with a sorting number of 0) of a group in a first computing node (e.g., node 0) can send its local reduction result to a second processing unit or process of the same group in a second computing node (e.g., node 1), receive the local reduction result from the second processing unit or process of the same group in the second computing node, and perform a reduction operation on its local reduction result and the received local reduction result to obtain a new local reduction result.

在第二步中，第一计算节点(例如，节点0)中的第一处理单元或进程(例如，排序编号为0的处理单元或进程)可以将其新的本地归约结果发送给第三计算节点(即，本例中的节点2)中的同一组的第三处理单元或进程(例如，排序编号为0)，从第一计算节点中的同一组的第三处理单元或进程接收本地归约结果，并对其新的本地归约结果和所接收的本地归约结果执行归约运算，以获得另一个新的本地归约结果。In the second step, the first processing unit or process (e.g., the processing unit or process with order number 0) in the first computing node (e.g., node 0) may send its new local reduction result to the third A third processing unit or process of the same set (e.g., sequence number 0) in the compute node (i.e., Node 2 in this example) receives the local reduction result, and performs a reduction operation on its new local reduction result and the received local reduction result to obtain another new local reduction result.

在第三步(或最后一步)，对第一处理单元或进程执行相同的运算，但是此时是对第四计算节点(即，本例中的节点4)中的同一组的第四处理单元或进程执行相同的运算。In the third (or final) step, the same operation is performed on the first processing unit or process, but this time on the same set of fourth processing units in the fourth computing node (i.e., node 4 in this example) or process performs the same operation.

在节点间全局归约阶段504结束时，特定组中的每个计算节点的每个处理单元或进程可以拥有该同一组中所有处理单元或进程的归约结果的特定数据块，不同组的处理单元或进程拥有不同组中相应处理单元或进程的归约结果的不同数据块。At the end of the inter-node global reduction phase 504, each processing unit or process of each compute node in a specific group may have a specific data block of the reduction results of all processing units or processes in the same group, processing of different groups The units or processes have different data blocks of the reduction results of the corresponding processing units or processes in different groups.

与减半加倍算法类似，节点间全局归约子运算可以利用基于环的算法在多个计算节点(如N个计算节点)的每个组中的处理单元(或进程)之间单独执行，使得组中的每个处理单元(或进程)可以获得同一组中所有处理单元(或进程)的同一数据块的所有归约结果。类似于以上针对节点内归约散布阶段描述的基于环的算法的机制，计算节点中的每个组的处理单元或进程可以迭代地将对应数据块的本地归约结果发送给下一个计算节点中的相应组的处理单元或进程，从前一个计算节点中的相应组的处理单元或进程接收对应数据块的本地归约结果，并利用其本地归约结果对所接收的归约结果执行归约运算。在节点间全局归约阶段504结束时(即在N-1步之后)，特定组中的每个计算节点的每个处理单元或进程可以拥有该同一组中所有处理单元或进程的归约结果的特定数据块，不同组的处理单元或进程拥有不同组中相应处理单元或进程的归约结果的不同数据块。Similar to the halve-double algorithm, the inter-node global reduction sub-operation can be performed individually among the processing units (or processes) in each group of multiple computing nodes (such as N computing nodes) using a ring-based algorithm, such that Each processing unit (or process) in a group can obtain all reduction results of the same data block for all processing units (or processes) in the same group. Similar to the mechanism of the ring-based algorithm described above for the intra-node reduction spread phase, each group of processing units or processes in a compute node can iteratively send the local reduction results of the corresponding data block to the next compute node. The processing unit or process of the corresponding group receives the local reduction result of the corresponding data block from the processing unit or process of the corresponding group in the previous computing node, and uses its local reduction result to perform the reduction operation on the received reduction result. . At the end of the inter-node global reduction phase 504 (i.e. after N-1 steps), each processing unit or process of each compute node in a particular group may have the reduction results for all processing units or processes in that same group A specific data block, different groups of processing units or processes have different data blocks of the reduction results of the corresponding processing units or processes in different groups.

在实施中，类似于节点内归约散布阶段502，在节点内全局聚集阶段506中，全局聚集子运算可以跨分布式训练系统102的多个计算节点的每个计算节点中的本地处理单元或进程来执行，以在同一计算节点中向彼此本地广播在节点间全局归约阶段504中获得的相应归约结果。在节点内全局聚集阶段506结束时，分布式训练系统102的每个计算节点中的每个处理单元或过程可以具有分布在多个计算节点中的整个数据的归约结果。In an implementation, similar to the intra-node reduction spreading stage 502 , in the intra-node global aggregation stage 506 , the global aggregation sub-operation may span local processing units in each of the plurality of computing nodes of the distributed training system 102 or Processes are executed to locally broadcast the corresponding reduction results obtained in the inter-node global reduction stage 504 to each other in the same computing node. At the end of the intra-node global aggregation phase 506, each processing unit or process in each computing node of the distributed training system 102 may have the reduction results of the entire data distributed among the multiple computing nodes.

作为示例而非限制，本文使用基于环的算法来说明如何广播由分布式训练系统102的计算节点中的本地处理单元或过程(在节点间全局归约阶段504中)获得的归约结果。然而，分布式训练系统102可以对不同的计算节点采用不同的或相同的集群通信算法(例如减半加倍算法等)。例如，分布式训练系统102可以基于与每个单独的计算节点相关联的多个因素，为不同的计算节点采用不同或相同的集群通信算法。在实施中，该多个因素可以包括但不限于计算节点中的处理单元(或进程)之间的互连类型、计算节点中的互连数量等。By way of example and not limitation, a ring-based algorithm is used herein to illustrate how reduction results obtained by local processing units or processes in the computing nodes of the distributed training system 102 (in the inter-node global reduction stage 504) are broadcast. However, the distributed training system 102 may adopt different or the same cluster communication algorithm (such as the halving and doubling algorithm, etc.) for different computing nodes. For example, distributed training system 102 may employ different or the same cluster communication algorithms for different computing nodes based on multiple factors associated with each individual computing node. In an implementation, the plurality of factors may include, but are not limited to, the type of interconnections between processing units (or processes) in the computing node, the number of interconnections in the computing node, and the like.

图10示出了示例性的基于环的集群通信算法1000，用于在分布式训练系统102的计算节点内相互广播处理单元或过程的个体归约结果。如图10所示，在第一步中，计算节点中的M个处理单元或进程中的每个处理单元或进程(例如，P1)可以根据环形配置将其在节点间全局归约阶段504中获得的归约结果发送给两个相邻处理单元或进程中的一个(例如，本例中的P2)，并从两个相邻处理单元或进程中的另一个(例如，本例中的PM)接收归约结果。在每个后续步骤中，每个处理单元或进程(例如，P1)可以根据环形配置向两个相邻处理单元或进程中的一个(例如，本例中的P2)发送新接收的归约结果，并从两个相邻处理单元或进程中的另一个(例如，本例中的PM)接收另一个归约结果。在节点内全局聚集阶段506结束时(即，在M-1个步骤之后)，计算节点中的每个处理单元或进程可具有计算节点中所有处理单元或进程的归约结果的归约结果。Figure 10 illustrates an exemplary ring-based cluster communication algorithm 1000 for broadcasting individual reduction results of processing units or processes to each other within computing nodes of a distributed training system 102. As shown in Figure 10, in the first step, each of the M processing units or processes (eg, P1) in the computing node can be used in the inter-node global reduction stage 504 according to the ring configuration. The obtained reduction results are sent to one of the two adjacent processing units or processes (e.g., P2 in this case) and are retrieved from the other of the two adjacent processing units or processes (e.g., PM in this case). ) receives the reduction result. At each subsequent step, each processing unit or process (e.g., P1) can send the newly received reduction result to one of the two adjacent processing units or processes (e.g., P2 in this example) according to the ring configuration , and receives another reduction result from another one of the two adjacent processing units or processes (e.g., PM in this case). At the end of the intra-node global aggregation phase 506 (ie, after M-1 steps), each processing unit or process in the compute node may have the reduction result of the reduction results for all processing units or processes in the compute node.

示例性并行算法Example parallel algorithm

在实施中，分布式训练系统102可以执行包括在拓扑感知多阶段算法中的多个阶段，即，依次为节点内归约散布阶段502、节点间全局归约阶段504和节点内全局聚集阶段506等。在实施中，可替代地，分布式训练系统102可以对节点内归约散布阶段502、节点间全局归约阶段504和节点内全局聚集阶段506中的一些进行部分或实质重叠，并且并行地执行这些阶段的一些部分。In an implementation, the distributed training system 102 may execute multiple stages included in the topology-aware multi-stage algorithm, namely, an intra-node reduction dispersion stage 502 , an inter-node global reduction stage 504 and an intra-node global aggregation stage 506 in sequence. wait. In an implementation, the distributed training system 102 may alternatively partially or substantially overlap some of the intra-node reduction spreading phase 502 , the inter-node global reduction phase 504 , and the intra-node global aggregation phase 506 and execute them in parallel. Some of these stages.

例如，由于节点内归约散布阶段502和节点内全局聚集阶段506涉及节点内数据通信或传输(即，计算节点内的数据通信或传输)，并且节点间全局归约阶段504涉及节点间数据通信或传输(即，计算节点之间的数据通信或传输)，在实施中，分布式训练系统102可以允许节点内归约散布阶段502和节点间全局归约阶段504的至少一部分并行执行，以及节点间全局归约阶段504和节点内全局聚集阶段506的一部分，从而提高节点内和节点间链路(或连接)的利用率，并避免节点内链路在节点间链路被使用时空闲，反之亦然。For example, since the intra-node reduction dispersion phase 502 and the intra-node global aggregation phase 506 involve intra-node data communication or transmission (i.e., data communication or transmission within the computing node), and the inter-node global reduction phase 504 involves inter-node data communication or transmission (i.e., data communication or transmission between computing nodes). In an implementation, the distributed training system 102 may allow at least a portion of the intra-node reduction spreading phase 502 and the inter-node global reduction phase 504 to be executed in parallel, and the nodes part of the inter-node global reduction phase 504 and the intra-node global aggregation phase 506, thereby improving the utilization of intra-node and inter-node links (or connections) and preventing intra-node links from being idle when inter-node links are in use, and vice versa. Likewise.

图11示出了用并行或重叠的方式执行节点内归约散布阶段、节点间全局归约阶段、节点内全局聚集阶段的示例场景。如图11所示，计算节点的处理单元或进程可以将数据块划分为多个块(在本示例中是如图11所示的4个块)，并将这些块分发至至少两个并发线程(如第一线程1102和第二线程1104)。以这种方式，处理单元或进程可以流水线化节点内和节点间子运算，以便由至少两个并发线程(在该示例中，第一线程1102和第二线程1104)执行。Figure 11 shows an example scenario of executing the intra-node reduction spreading phase, the inter-node global reduction phase, and the intra-node global aggregation phase in a parallel or overlapping manner. As shown in Figure 11, the processing unit or process of the compute node can divide the data block into multiple chunks (in this example, 4 chunks as shown in Figure 11) and distribute these chunks to at least two concurrent threads (eg first thread 1102 and second thread 1104). In this manner, a processing unit or process can pipeline intra-node and inter-node sub-operations for execution by at least two concurrent threads (in this example, first thread 1102 and second thread 1104).

作为示例而非限制，第一线程1102可以对第一数据块(例如，数据块1106)执行节点间全局归约子运算(即，节点间全局归约阶段504中的运算)，而第二线程1104对第二数据块(例如，数据块1108)执行节点内归约散布子运算(即，节点内归约散布阶段502中的运算)。例如，第一线程1102可以对第三数据块(例如，数据块1110)执行节点内全局聚集子运算(即，节点内全局聚集阶段506中的运算)，而第二线程1104对第四数据块(例如，数据块1112)执行节点间全局归约子运算。By way of example and not limitation, the first thread 1102 may perform an inter-node global reduce sub-operation (i.e., the operation in the inter-node global reduce stage 504) on a first data block (e.g., data block 1106), while the second thread 1104 performs an intra-node reduce-scatter sub-operation (i.e., the operation in the intra-node reduce-scatter stage 502) on a second data block (e.g., data block 1108). For example, the first thread 1102 may perform an intra-node global gather sub-operation (i.e., the operation in the intra-node global gather stage 506) on a third data block (e.g., data block 1110), while the second thread 1104 performs an inter-node global reduce sub-operation on a fourth data block (e.g., data block 1112).

作为示例而非限制，分布式神经网络训练中涉及的另一个运算可以进一步用作示例。在实施中，分布式训练系统102可以将分布式神经网络训练中涉及的全局聚集运算分成多个子运算，即节点间全局聚集子运算、节点内全局聚集子运算和数据复制子运算。在实施中，节点间全局聚集子运算可以类似于如上所述的节点间全局归约子运算，但进行的是广播数据(例如，归约的结果)而不是归约运算(例如，用本地结果归约接收的结果)，而节点间全局聚集子运算可以类似于或等同于如上所述的节点间全局聚集子运算。在实施中，数据复制子运算可以包括复制结果数据(例如，最终的归约结果)作为输出参数的运算。By way of example and not limitation, another operation involved in distributed neural network training can be further used as an example. In implementation, the distributed training system 102 may divide the global aggregation operation involved in distributed neural network training into multiple sub-operations, namely, the inter-node global aggregation sub-operation, the intra-node global aggregation sub-operation and the data copy sub-operation. In implementation, the inter-node global aggregation sub-operation may be similar to the inter-node global reduction sub-operation as described above, but broadcast data (e.g., the result of the reduction) rather than the reduction operation (e.g., with the local result The result received by the reduction), and the inter-node global aggregation sub-operation may be similar to or equivalent to the inter-node global aggregation sub-operation as described above. In an implementation, the data replicator operation may include an operation that replicates the result data (eg, the final reduction result) as an output parameter.

在实施中，计算节点的处理单元或进程可以将数据块划分成多个块(例如，四个块)，并将这些块分发到至少两个并发线程(例如，第一线程和第二线程)，并且流水线化节点内和节点间子运算以供至少两个并发线程执行。In an implementation, the processing unit or process of the computing node may divide the block of data into multiple chunks (e.g., four chunks) and distribute the chunks to at least two concurrent threads (e.g., a first thread and a second thread) , and pipeline intra-node and inter-node sub-operations for execution by at least two concurrent threads.

例如，第一线程可以对第一数据块执行节点间全局聚集子运算，而第二线程对第二数据块执行节点内全局聚集子运算。此外，第一线程可以对第三数据块执行数据复制子运算，而第二线程对第四数据块执行节点间全局聚集子运算。For example, the first thread may perform an inter-node global aggregation sub-operation on the first data block, while the second thread may perform an intra-node global aggregation sub-operation on the second data block. In addition, the first thread may perform a data replication sub-operation on the third data block, while the second thread may perform an inter-node global aggregation sub-operation on the fourth data block.

示例性拥塞避免方法Example congestion avoidance methods

在实施中，由于分布式训练系统102中的多个计算节点之间的数据传输，在通信网络206中的一些交换机或链路处可能发生数据或流量拥塞。为了避免拥塞，分布式训练系统102可以采用预定的拥塞避免策略来在通信网络206中的各种交换机或链路之间分发或转移数据流量，从而避免过量的数据在训练期间(例如，节点间全局归约子运算或阶段，或者节点间全局聚集子运算或阶段)通过通信网络206中的某个交换机或链路。In an implementation, data or traffic congestion may occur at some switches or links in the communication network 206 due to data transmission between multiple computing nodes in the distributed training system 102 . To avoid congestion, the distributed training system 102 may employ predetermined congestion avoidance strategies to distribute or transfer data traffic among various switches or links in the communication network 206 to avoid excessive data flow during training (e.g., between nodes). The global reduction sub-operation or stage, or the global aggregation sub-operation or stage between nodes) passes through a certain switch or link in the communication network 206.

在实施中，分布式训练系统102可以采用第一拥塞避免方法，该方法包括环生成策略，以及随后的网络流的路由管理。附加地或替代地，分布式训练系统102可以采用第二拥塞避免方法，该方法包括对节点标识重新排序的策略，以及随后的网络流的路由管理。取决于通信网络206的网络拓扑的类型以及多个计算节点104的处理和通信能力等，分布式训练系统102可以选择第一拥塞避免方法或第二拥塞避免方法中的一个或多个，用于在分布式训练系统102中的多个计算节点的全部或部分之间路由数据流。此外，分布式训练系统102可以选择性地组合第一拥塞避免方法和第二拥塞避免方法的部分，以实现新的拥塞避免方法。在实施中，第一拥塞避免方法和第二拥塞避免方法都可以旨在以节点间数据流彼此没有或有很少冲突的方式为节点间数据流的每个方向指定专用网络路径。In implementation, the distributed training system 102 may employ a first congestion avoidance method that includes a ring generation strategy, and subsequent routing management of network flows. Additionally or alternatively, the distributed training system 102 may employ a second congestion avoidance method that includes a strategy for reordering node identities and subsequent routing management of network flows. Depending on the type of network topology of the communication network 206 and the processing and communication capabilities of the plurality of computing nodes 104, etc., the distributed training system 102 may select one or more of the first congestion avoidance method or the second congestion avoidance method for Data flows are routed between all or portions of the plurality of computing nodes in the distributed training system 102. In addition, the distributed training system 102 can selectively combine parts of the first congestion avoidance method and the second congestion avoidance method to implement a new congestion avoidance method. In an implementation, both the first congestion avoidance method and the second congestion avoidance method may be designed to designate dedicated network paths for each direction of inter-node data flow in such a manner that the inter-node data flows have little or no conflict with each other.

在实施中，分布式训练系统102可以预先获得或建立通信连接和路由路径(例如，物理链路)之间的映射关系。在实施中，可以创建表格、链表等形式的连接路径数据结构并用于存储映射关系的信息。在实施中，分布式训练系统102可以基于连接路径数据结构选择性地或策略性地使用特定路径来建立任何两个计算节点之间的连接。In implementation, the distributed training system 102 may obtain or establish a mapping relationship between a communication connection and a routing path (e.g., a physical link) in advance. In implementation, a connection path data structure in the form of a table, a linked list, etc. may be created and used to store information about the mapping relationship. In implementation, the distributed training system 102 may selectively or strategically use a specific path to establish a connection between any two computing nodes based on the connection path data structure.

在实施中，分布式训练系统102可以通过使分布式训练系统102的每个计算节点能够通过变化探测数据包的源/目的地端口向其他计算节点发送探测数据包，来确定通信连接和路由路径之间的映射关系，以穷尽分布式训练系统102的计算节点之间可能的通信连接。显然，分布式训练系统102可以采用其他方法来探索通信连接和路由路径之间的映射关系，在此不做限定。In an implementation, the distributed training system 102 may determine communication connections and routing paths by enabling each computing node of the distributed training system 102 to send probe packets to other compute nodes through the source/destination ports of the change probe packets. The mapping relationship between them is to exhaust the possible communication connections between the computing nodes of the distributed training system 102. Obviously, the distributed training system 102 can use other methods to explore the mapping relationship between communication connections and routing paths, which are not limited here.

作为示例而非限制，第一计算节点可以向第二计算节点发送多个探测数据包，每个探测数据包具有源端口和目的端口的不同组合，而源地址和目的地址分别是第一计算节点的地址和第二计算节点的地址。每个探测数据包可以记录相应探测数据包所经过的交换机，因此当相应探测数据包返回到第一计算节点时，第一计算节点可以知道用于映射的相应探测数据包的整个路由路径。据此，可以在第一计算节点和第二计算节点之间建立连接路径数据结构(例如，连接路径数据结构)。类似地，可以相应地建立分布式训练系统102中的其他计算节点对的通信连接和路由路径之间的映射关系(以及连接路径数据结构)。By way of example, and not limitation, the first computing node may send a plurality of probe packets to the second computing node, each probe packet having a different combination of source port and destination port, and the source address and destination address are respectively the first computing node and the address of the second computing node. Each probe packet may record the switches through which the corresponding probe packet passes, so when the corresponding probe packet returns to the first computing node, the first computing node may know the entire routing path of the corresponding probe packet for mapping. Accordingly, a connection path data structure (eg, a connection path data structure) may be established between the first computing node and the second computing node. Similarly, mapping relationships (and connection path data structures) between communication connections and routing paths of other pairs of computing nodes in the distributed training system 102 can be established accordingly.

出于简洁和说明的目的，示例性网络拓扑，即胖树网络(或者特别是全网状拓扑中的双层Clos网络架构)在此被用作与分布式训练系统102相关联的通信网络206的示例网络拓扑。然而，这里描述的示例性拥塞避免策略也可以适用于其他网络拓扑。For purposes of simplicity and illustration, an exemplary network topology, namely a fat tree network (or specifically a two-layer Clos network architecture in a full mesh topology) is used herein as the communication network 206 associated with the distributed training system 102 Example network topology. However, the exemplary congestion avoidance strategies described here may be applicable to other network topologies as well.

图12示出了示例性胖树网络拓扑1200。在该示例中，示例性胖树网络拓扑是全网状拓扑中的双层Clos网络架构。一层对应于直接连接到计算节点1204的一层片交换机1202，每个片交换机1202连接到一个或多个计算节点1204。在实施中，计算节点1204可以包括连接到片交换机1202的一个或多个端口(例如，四个端口)的一个或多个网络接口控制器(例如，四个网络接口控制器)。在实施中，每个计算节点1204的网络接口控制器的数量可以相同也可以不同。另一层对应于连接到一个或多个片交换机1202的一层汇聚交换机1206(或称为主干交换机1206)。Figure 12 illustrates an exemplary fat tree network topology 1200. In this example, the exemplary fat-tree network topology is a two-layer Clos network architecture in a full mesh topology. One layer corresponds to a layer of slice switches 1202 directly connected to compute nodes 1204, with each slice switch 1202 connected to one or more compute nodes 1204. In an implementation, compute node 1204 may include one or more network interface controllers (eg, four network interface controllers) connected to one or more ports (eg, four ports) of slice switch 1202 . In an implementation, the number of network interface controllers for each compute node 1204 may be the same or different. Another layer corresponds to a layer of aggregation switches 1206 (or spine switches 1206) connected to one or more shard switches 1202.

在实施中，如果包括在不同计算节点中的两个处理单元或进程连接在同一片交换机下，则在这两个处理单元或进程之间传输的数据包将通过该同一片交换机，而不通过任何汇聚交换机。可替换地，如果包括在不同计算节点中的两个处理单元或进程在不同的片交换机下连接，则在这两个处理单元或进程之间传输的数据包将通过汇聚交换机之一。使用如上所述的连接路径数据结构，通过在数据包中设置源端口和目的端口的适当组合，可以使在两个处理单元或进程之间传输的数据包流经指定的汇聚交换机。在实施中，第一拥塞避免方法和/或第二拥塞避免方法的路由管理可以旨在使得从同一片交换机到不同目的地片交换机的数据流能够通过不同的汇聚交换机，和/或使得从不同源片交换机到同一目的地片交换机的数据流能够通过不同的汇聚交换机，从而避免数据流之间的冲突，并且使得汇聚交换机处没有网络拥塞。In implementation, if two processing units or processes included in different computing nodes are connected under the same switch, the data packets transmitted between the two processing units or processes will pass through the same switch instead of Any aggregation switch. Alternatively, if two processing units or processes included in different computing nodes are connected under different slice switches, data packets transmitted between the two processing units or processes will pass through one of the aggregation switches. Using the connection path data structure as described above, packets traveling between two processing units or processes can be caused to flow through a designated aggregation switch by setting the appropriate combination of source and destination ports in the packet. In an implementation, the route management of the first congestion avoidance method and/or the second congestion avoidance method may be designed to enable data flows from the same slice switch to different destination slice switches to pass through different aggregation switches, and/or enable data flows from different destination slice switches to pass through different aggregation switches. Data flows from the source switch to the same destination switch can pass through different aggregation switches, thereby avoiding conflicts between data flows and eliminating network congestion at the aggregation switches.

在实施中，如前面的描述中所述，第一拥塞避免方法可以包括环生成策略，以及随后的网络流的路由管理。第一拥塞避免方法可以支持各种基于环的算法，包括但不限于环算法、环分块算法、多环算法、分层环算法、涉及多个分层环的算法和节点感知环算法等。In an implementation, as described in the previous description, the first congestion avoidance method may include a ring generation strategy, and subsequent routing management of network flows. The first congestion avoidance method can support various ring-based algorithms, including but not limited to ring algorithms, ring blocking algorithms, multi-ring algorithms, hierarchical ring algorithms, algorithms involving multiple hierarchical rings, node-aware ring algorithms, etc.

在实施中，环生成的策略可以包括环生成的拓扑感知策略。作为示例和限制，环生成的拓扑感知策略可以包括多个规则来建立处理单元或进程的环或环配置。在实施中，计算节点中的处理单元或进程可以通过网络接口控制器向/从另一计算节点中的处理单元或进程发送/接收数据。在实施中，计算节点中的处理单元或进程可以与单个网络接口控制器或多个网络接口控制器相关联，以向其他计算节点中的处理单元或进程传输数据。附加地或可替换地，多个处理单元或进程可以与单个网络接口控制器相关联，并且使用该网络接口控制器来向其他计算节点中的处理单元或进程传输数据。In an implementation, the ring-generated policy may include a ring-generated topology-aware policy. As an example and limitation, the ring-generated topology-aware policy may include multiple rules to establish a ring or ring configuration of a processing unit or process. In an implementation, a processing unit or process in a computing node may send/receive data to/from a processing unit or process in another computing node via a network interface controller. In an implementation, a processing unit or process in a computing node may be associated with a single network interface controller or multiple network interface controllers to transmit data to processing units or processes in other computing nodes. Additionally or alternatively, multiple processing units or processes may be associated with a single network interface controller and use the network interface controller to transmit data to processing units or processes in other computing nodes.

在实施中，多个规则可以包括但不限于第一计算节点中的处理单元或进程选择相邻处理单元或进程的优先级、第一计算节点中的网络接口控制器发送或接收数据的条件、第一计算节点中的网络接口控制器向/从第二计算节点中的网络接口控制器路由数据的条件等。In an implementation, the plurality of rules may include, but are not limited to, a priority for a processing unit or process in the first computing node to select an adjacent processing unit or process, a condition for the network interface controller in the first computing node to send or receive data, Conditions for the network interface controller in the first computing node to route data to/from the network interface controller in the second computing node, and the like.

在实施中，第一计算节点中的处理单元或进程选择相邻处理单元或进程的优先级可以包括，按照优先级的降序，选择第一计算节点中的处理单元或进程并使用进程间通信(如果适用的话)，选择连接到与第一计算节点所连的片交换机相同的片交换机的第二计算节点中的处理单元或进程，在连接到与第一计算节点所连的片交换机不同的片交换机的第三计算节点中选择处理单元或进程，其中第一计算节点不同于第二计算节点和第三计算节点。In an implementation, selecting the priority of an adjacent processing unit or process by a processing unit or process in a first computing node may include, in descending order of priority, selecting a processing unit or process in the first computing node and using inter-process communication ( If applicable), select a processing unit or process in the second compute node that is connected to the same slice switch as the first compute node to which the first compute node is connected. A processing unit or process is selected in a third computing node of the switch, where the first computing node is different from the second computing node and the third computing node.

在实施中，第一计算节点中的网络接口控制器发送或接收数据的条件可以包括，例如，网络接口控制器能够仅向第二计算节点中的网络接口控制器发送数据，和/或网络接口控制器能够仅从第三计算节点中的网络接口控制器接收数据，其中第一计算节点不同于第二计算节点和第三计算节点，并且第二计算节点可以与第三计算节点相同或不同。In an implementation, the conditions under which a network interface controller in a first computing node sends or receives data may include, for example, the network interface controller is able to send data only to a network interface controller in a second computing node, and/or the network interface controller is able to receive data only from a network interface controller in a third computing node, wherein the first computing node is different from the second computing node and the third computing node, and the second computing node may be the same as or different from the third computing node.

在实施中，第一计算节点中的网络接口控制器向/从第二计算节点中的网络接口控制器路由数据的条件可以包括，例如，如果数据是通过第一计算节点中的网络接口控制器发送的，则将由属于多个环的处理单元或进程发送的数据路由到第二计算节点中的网络接口控制器。在实施中，第一计算节点中的网络接口控制器向/从第二计算节点中的网络接口控制器路由数据的条件可以进一步包括，如果数据是由属于多个环的处理单元或进程通过第二计算节点中的网络接口控制器发送的，则通过第一计算节点中的网络接口控制器接收数据。In an implementation, the conditions under which the network interface controller in the first computing node routes data to/from the network interface controller in the second computing node may include, for example, if the data is routed through the network interface controller in the first computing node data sent by processing units or processes belonging to multiple rings is routed to the network interface controller in the second computing node. In an implementation, the conditions for the network interface controller in the first computing node to route data to/from the network interface controller in the second computing node may further include if the data is passed by a processing unit or process belonging to multiple rings through the first computing node. The data sent by the network interface controller in the second computing node is received by the network interface controller in the first computing node.

在实施中，第一拥塞避免方法的路由管理可以将网络接口控制器(NetworkInterface Controller，NIC)标识符分配给连接或链接到同一片交换机的每个网络接口控制器。第一拥塞避免方法的路由管理还可以向通信网络206中的每个汇聚交换机分配汇聚标识符。对于某个环中的处理单元或进程，路由管理可以确定用于路由来自该处理单元或进程的数据包的路由标识符。In implementation, the routing management of the first congestion avoidance method may assign a network interface controller (NIC) identifier to each network interface controller connected or linked to the same switch. The routing management of the first congestion avoidance method may also assign an aggregation identifier to each aggregation switch in the communication network 206. For a processing unit or process in a certain ring, the routing management may determine a routing identifier for routing a data packet from the processing unit or process.

例如，如果处理单元或进程的网络接口控制器和环中的下一个处理单元或进程的网络接口控制器位于相同的计算节点中，或者直接连接或链接到相同的片交换机，则路由标识符可以被确定为默认值或默认标识符。该默认路由标识符指示数据或在计算节点内路由，或通过片交换机路由，而不通过通信网络中的任何汇聚交换机。否则，路由标识符可被确定为等于该处理单元或进程的NIC标识符或其他预定义值。基于路由标识符和汇聚标识符之间的映射关系，可以基于所确定的路由标识符来确定汇聚标识符。在实施中，例如，路由标识符和汇聚标识符之间的映射关系可以使用基于探测的路由机制(例如，如前面的描述中所描述的在计算节点之间发送探测数据包)来预先确定。For example, if the network interface controller of a processing unit or process and the network interface controller of the next processing unit or process in the ring are in the same compute node, or are directly connected or linked to the same slice switch, the route identifier can Is determined to be the default value or default identifier. This default route identifier indicates that the data is routed either within the compute node or through the slice switch, and not through any aggregation switch in the communications network. Otherwise, the route identifier may be determined to be equal to the NIC identifier of the processing unit or process or other predefined value. Based on the mapping relationship between the routing identifier and the aggregation identifier, the aggregation identifier may be determined based on the determined routing identifier. In an implementation, for example, the mapping relationship between routing identifiers and rendezvous identifiers may be predetermined using a probe-based routing mechanism (eg, sending probe packets between computing nodes as described in the preceding description).

换句话说，包括在同一计算节点中的处理单元(或进程)或同一片交换机的网络接口控制器之间的数据流将不经过通信网络中的任何汇聚交换机。另一方面，不同计算节点中包括的处理单元(或进程)和不同片交换机的网络接口控制器之间的数据流将基于预定的映射关系通过指定的汇聚交换机，从而实现数据流的路由控制和管理，并将数据流分发到不同的汇聚交换机，以避免网络拥塞。In other words, data flows between processing units (or processes) included in the same computing node or network interface controllers of the same switch will not pass through any aggregation switch in the communication network. On the other hand, the data flow between the processing units (or processes) included in different computing nodes and the network interface controllers of different chip switches will pass through the designated aggregation switch based on the predetermined mapping relationship, thereby achieving routing control and control of the data flow. Manage and distribute data flows to different aggregation switches to avoid network congestion.

图13示出了使用第一拥塞避免方法的示例场景。在该示例中，生成了包含八个计算节点(节点0、节点1、…、节点7)的四个节点间环(或环配置，R0、R1、R2和R3)，并且每个环使用不同的汇聚交换机来发送和接收数据(例如，在节点间全局归约阶段504期间)。因此，这四个环不存在冲突。此外，任何环的每个片交换机只有一个数据流进入，一个数据流离开，从而避免了网络拥塞的发生。Figure 13 shows an example scenario using the first congestion avoidance method. In this example, four inter-node rings (or ring configurations, R0, R1, R2, and R3) containing eight compute nodes (Node 0, Node 1, ..., Node 7) are generated, and each ring uses a different aggregation switch to send and receive data (e.g., during the inter-node global reduction phase 504). Therefore, there is no conflict between these four rings. In addition, only one data flow enters and one data flow leaves each chip switch of any ring, thus avoiding the occurrence of network congestion.

在实施中，如上所述，第二拥塞避免方法可以包括关于节点标识的重新排序的策略，以及随后的网络流的路由管理。在实施中，为了最小化通信成本，第二拥塞避免方法可以根据基于多个规则连接计算节点和处理单元(或进程)的网络拓扑来重新排序计算节点和处理单元(或进程)的标识符。In implementation, as mentioned above, the second congestion avoidance method may include a strategy regarding the reordering of node identities, and subsequent routing management of network flows. In an implementation, in order to minimize communication costs, the second congestion avoidance method may reorder identifiers of computing nodes and processing units (or processes) according to a network topology connecting computing nodes and processing units (or processes) based on multiple rules.

在实施中，多个规则可以包括例如通过各片交换机对计算节点进行分组。例如，连接到同一片交换机的计算节点(例如，具有链接到同一片交换机的网络接口控制器的计算节点)被组成一个组，并且每个计算节点被分配有节点标识符。由于计算节点连接到同一片交换机，这些计算节点彼此(物理上)相邻。In an implementation, the plurality of rules may include, for example, grouping compute nodes by each slice switch. For example, compute nodes connected to the same switch (eg, compute nodes having network interface controllers linked to the same switch) are grouped into a group, and each compute node is assigned a node identifier. Since the compute nodes are connected to the same switch, these compute nodes are (physically) adjacent to each other.

在实施中，多个规则还可以包括使用相同的顺序序列将排序标识符(或排序编号)分配给计算节点中的每个处理单元或进程。例如，第一计算节点中的k个处理单元(或进程)可以被分配有排序标识符0、1、…、k-1，而第二计算节点中的k个处理单元(或进程)可以被分配有排序标识符k、k+1、…、2k-1等，对于其他计算节点也是如此。可以根据处理单元(或进程)使用的相应网络接口控制器来对计算节点中的处理单元(或进程)进行排序，并且使用相同网络接口控制器的处理单元(或进程)彼此(物理上)相邻。In an implementation, the plurality of rules may also include assigning a sort identifier (or sort number) to each processing unit or process in the compute node using the same sequential sequence. For example, k processing units (or processes) in a first computing node may be assigned ranking identifiers 0, 1, ..., k-1, while k processing units (or processes) in a second computing node may be assigned Sorting identifiers k, k+1, ..., 2k-1, etc. are assigned, and the same is true for other computing nodes. Processing units (or processes) in a compute node can be ordered according to the corresponding network interface controller used by the processing unit (or process), and processing units (or processes) using the same network interface controller are (physically) connected to each other. adjacent.

在这种情况下，在前log₂L个步骤中，计算节点中的处理单元(或进程)之间的数据流可以被限制为通过相应的片交换机，其具有比汇聚交换机更优的延迟，因此不会产生网络拥塞。在实施中，L是如在上文描述中的节点感知的减半加倍算法的每片交换机的计算节点数量。在实施中，对于传统的减半加倍算法，L是每片交换机的计算节点数和每计算节点的处理单元(或进程)数的乘积。In this case, in the first log ₂ L steps, the data flow between the processing units (or processes) in the computing nodes can be restricted to pass through the corresponding slice switches, which have better latency than the aggregation switches, and thus will not cause network congestion. In implementation, L is the number of computing nodes per slice switch for the node-aware halving and doubling algorithm as described above. In implementation, for the traditional halving and doubling algorithm, L is the product of the number of computing nodes per slice switch and the number of processing units (or processes) per computing node.

在实施中，第二拥塞避免方法的路由管理可以包括确定从具有第一节点标识符的第一计算节点中具有第一排序标识符的第一处理单元(或进程)发送给具有第二节点标识符的第二计算节点中具有第二排序标识符的第二处理单元(或进程)的数据流或数据包的汇聚标识符，其中第一计算节点可以与第二计算节点相同或不同。In an implementation, routing management of the second congestion avoidance method may include determining whether to send data from a first processing unit (or process) with a first ordering identifier in a first computing node with a first node identifier to a node with a second node identifier. An aggregate identifier of a data stream or packet of a second processing unit (or process) having a second ordering identifier in a second computing node of the symbol, where the first computing node may be the same as or different from the second computing node.

在实施中，汇聚标识符可以至少部分地基于排序标识符、节点标识符、每计算节点的网络接口控制器的数量以及每片交换机处的计算节点的最大数量中的至少一些来确定。作为示例而非限制，汇聚标识符可被确定为发送数据流或数据包的第一处理单元(或进程)的第一排序标识符+(具有第一处理单元(或进程)的第一计算节点的第一节点标识符)％每个片交换机处的计算节点的最大数量)×每计算节点的网络接口控制器的数量，其中％表示模数运算符。显然，只要能获得一致的结果，汇聚标识符的其他计算方法也是适用的。例如，汇聚标符可以基于汇聚标识符与排序标识符和节点标识符的组合之间的预设映射关系来确定等。In an implementation, the aggregation identifier may be determined based at least in part on at least some of the ordering identifier, the node identifier, the number of network interface controllers per compute node, and the maximum number of compute nodes at each switch slice. By way of example, and not limitation, the aggregation identifier may be determined as a first ordering identifier of a first processing unit (or process) sending a data stream or packet + a first computing node having the first processing unit (or process) The first node identifier) % the maximum number of compute nodes at each slice switch) × the number of network interface controllers per compute node, where % represents the modulo operator. Obviously, other calculation methods for aggregate identifiers are also suitable as long as consistent results are obtained. For example, the aggregation identifier may be determined based on a preset mapping relationship between the aggregation identifier and a combination of the sorting identifier and the node identifier, or the like.

在实施中，第二拥塞避免方法的路由管理可以包括预先将汇聚标识符分配给与分布式训练系统102相关联的通信网络206中的每个汇聚交换机。如果第一处理单元(或进程)和第二处理单元(或进程)链接到同一片交换机或在同一片交换机下(例如，通过各自的网络控制器)，则数据流或数据包将通过该片交换机，而不需要通过通信网络206中的任何汇聚交换机。如果第一处理单元(或进程)和第二处理单元(或进程)没有链接到同一片交换机或不在同一片交换机下，则由第一处理单元(或进程)发送给第二处理单元(或进程)的数据流或数据包将通过具有所确定的汇聚标识符的汇聚交换机。在本示例中，所描述的包括在每个计算节点中的网络接口控制器的数量是4个。In an implementation, route management of the second congestion avoidance method may include pre-assigning aggregation identifiers to each aggregation switch in the communication network 206 associated with the distributed training system 102 . If the first processing unit (or process) and the second processing unit (or process) are linked to or under the same switch (e.g., through respective network controllers), then the data flow or packet will pass through the switch switch without going through any aggregation switch in the communications network 206. If the first processing unit (or process) and the second processing unit (or process) are not linked to the same switch or are not under the same switch, the first processing unit (or process) sends the message to the second processing unit (or process). ) will pass through the aggregation switch with the determined aggregation identifier. In this example, the described number of network interface controllers included in each computing node is four.

图14示出了使用第二拥塞避免方法的示例场景。在本示例中，所有计算节点包括相同数量的处理单元(或进程)和相同数量的网络接口控制器，每个网络接口控制器具有相同数量的待关联的处理单元(或进程)。此外，链接到片交换机的网络接口控制器的数量少于网络中汇聚交换机的数量。在本示例中，每计算节点的网络接口控制器的数量为4个，每个片交换机处的计算节点的数量最多为2个。在实施中，对于节点感知的减半加倍算法，同一片交换机下的计算节点的数量可以是2的幂，并且同一片交换机下的计算节点中包括的网络接口控制器的数量可以是2的幂，对于传统的减半加倍算法，使用同一网络接口控制器的处理单元(或进程)的数量可以是2的幂。在本示例中，所描述的包括在每个计算节点中的网络接口控制器的数量为4个。Figure 14 shows an example scenario using the second congestion avoidance method. In this example, all computing nodes include the same number of processing units (or processes) and the same number of network interface controllers, each network interface controller having the same number of processing units (or processes) to be associated. Additionally, the number of network interface controllers linked to slice switches is less than the number of aggregation switches in the network. In this example, the number of network interface controllers per compute node is 4, and the maximum number of compute nodes at each slice switch is 2. In implementation, for the node-aware halves and doubling algorithm, the number of computing nodes under the same switch can be a power of 2, and the number of network interface controllers included in the computing nodes under the same switch can be a power of 2. , for the traditional halve-double algorithm, the number of processing units (or processes) using the same network interface controller can be a power of 2. In this example, the described number of network interface controllers included in each computing node is four.

在本示例中，在节点感知的减半加倍算法中的节点间全局归约阶段，计算节点(节点0、节点2、节点4和节点6)的处理单元(或进程)将使用具有汇聚标识符的汇聚交换机(例如，A1、A2、A3和A4)，并且计算节点(节点1、节点3、节点5和节点7)的处理单元(或进程)将使用具有汇聚标识符的汇聚交换机(例如，A5、A6、A7和A8)。据此，计算节点之间的数据流中不存在冲突，从而避免了网络中任何汇聚交换机处的网络拥塞。In this example, in the node-to-node global reduction phase in the node-aware halving and doubling algorithm, the processing units (or processes) of the computing nodes (node 0, node 2, node 4, and node 6) will use aggregation switches with aggregation identifiers (e.g., A1, A2, A3, and A4), and the processing units (or processes) of the computing nodes (node 1, node 3, node 5, and node 7) will use aggregation switches with aggregation identifiers (e.g., A5, A6, A7, and A8). Accordingly, there is no conflict in the data flow between the computing nodes, thereby avoiding network congestion at any aggregation switch in the network.

在实施中，在节点间全局归约阶段的每个步骤，处理单元(或进程)可以与新的处理单元(或进程)进行数据通信。在实施中，可以进行同步以确保由使用网络接口控制器的处理单元(或进程)在当前步骤进行的数据流不与由使用相同网络接口控制器的相邻处理单元(或进程)在先前步骤进行的数据流重叠，以避免发生微突发流(incast)，从而避免发生微突发流(incast)拥塞。In implementation, at each step of the inter-node global reduction phase, a processing unit (or process) may communicate data with a new processing unit (or process). In implementation, synchronization may be performed to ensure that the data flow performed by a processing unit (or process) using a network interface controller at the current step does not overlap with the data flow performed by an adjacent processing unit (or process) using the same network interface controller at the previous step, so as to avoid microbursts (incast) and thus avoid microburst congestion.

示例性方法Example methods

图15示出了示例性拓扑感知多阶段方法的示意图。图16示出了第一示例性网络拥塞避免方法的示意图。图17示出了第二示例性网络拥塞避免方法的示意图。图18示出了分布式训练中基于混合架构的示例性并行方法示意图。图15-18的方法可以(但不一定)在图1所示的环境中利用图2所示的计算节点在图3-14所示的方法和场景的辅助下实施。为了便于描述，参考图1-14来描述方法1500-1800。然而，方法1500-1800可以替代性地在其他环境和/或利用其他系统来实施。Figure 15 shows a schematic diagram of an exemplary topology-aware multi-stage method. Figure 16 shows a schematic diagram of a first exemplary network congestion avoidance method. Figure 17 shows a schematic diagram of a second exemplary network congestion avoidance method. Figure 18 shows a schematic diagram of an exemplary parallel method based on hybrid architecture in distributed training. The methods of Figures 15-18 may (but are not necessarily) implemented in the environment shown in Figure 1 using the computing nodes shown in Figure 2 with the aid of the methods and scenarios shown in Figures 3-14. For ease of description, methods 1500-1800 are described with reference to Figures 1-14. However, methods 1500-1800 may alternatively be implemented in other environments and/or utilizing other systems.

方法1500-1800在机器可执行指令的一般背景下描述。通常，机器可执行指令可以包括执行特定功能或实现特定抽象数据类型的例程、程序、对象、组件、数据结构、过程、模块、函数等。此外，将每个示例方法表示为逻辑流程图中的框的集合，该逻辑流程图表示可以在硬件、软件、固件或其组合中实现的一系列操作。描述该方法的顺序不意图被理解为进行限制，并且任何数量的所描述的方法框可以以任何顺序组合以实现该方法或替代方法。此外，在不脱离本文所述主题的精神和范围的情况下，可以从该方法中省略单独的框。在软件的背景下，框代表当由一个或多个处理器执行时执行所述操作的计算机指令。在硬件的背景下，一些或所有的框可以代表执行所述操作的专用集成电路(ASIC)或其他物理组件。Methods 1500-1800 are described in the general context of machine-executable instructions. Generally, machine-executable instructions may include routines, programs, objects, components, data structures, procedures, modules, functions, etc., that perform specific functions or implement specific abstract data types. Additionally, each example method is represented as a collection of boxes in a logic flow diagram that represents a sequence of operations that may be implemented in hardware, software, firmware, or a combination thereof. The order in which the method is described is not intended to be construed as limiting, and any number of the described method blocks may be combined in any order to implement the method or alternative methods. Additionally, individual blocks may be omitted from the method without departing from the spirit and scope of the subject matter described herein. In the context of software, blocks represent computer instructions that perform the operations when executed by one or more processors. In the context of hardware, some or all of the blocks may represent application specific integrated circuits (ASICs) or other physical components that perform the operations described.

参考图15，在框1502中，第一计算节点(例如，计算节点104)可以根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行归约散布子运算。Referring to Figure 15, in block 1502, a first computing node (eg, computing node 104) may perform a reduction spread sub-operation among a first set of processing units in the first computing node according to a first cluster communication algorithm.

在实施中，在执行归约散布子运算之前，第一计算节点可以至少部分地基于第一计算节点中的第一处理单元集合之间的节点内连接的类型或带宽来选择第一集群通信算法。在实施中，第一集群通信算法可以包括但不限于基于环的算法或减半加倍算法。In an implementation, prior to performing the reduce spread sub-operation, the first computing node may select the first cluster communication algorithm based at least in part on the type or bandwidth of intra-node connections between the first set of processing units in the first computing node . In an implementation, the first cluster communication algorithm may include, but is not limited to, a ring-based algorithm or a halve-double algorithm.

在实施中，根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行归约散布子运算可以包括将数据划分成多个数据块；将多个数据块分配给第一处理单元集合；根据第一集群通信算法，在第一处理单元集合的第一处理单元处从第一处理单元集合的第二处理单元接收数据块；以及在第一处理单元处用本地数据块归约所接收的数据块。In an implementation, performing the reduction spread sub-operation among the first set of processing units in the first computing node according to the first cluster communication algorithm may include dividing the data into a plurality of data blocks; assigning the plurality of data blocks to the first a set of processing units; receiving a data block at a first processing unit of the first set of processing units from a second processing unit of the first set of processing units according to a first cluster communication algorithm; and homing at the first processing unit with the local data block About the data block received.

在框1504中，第一计算节点可以根据第二集群通信算法在第一计算节点中的第一处理单元集合和第二计算节点中的第二处理单元集合之间执行全局归约子运算。In block 1504, the first computing node may perform a global reduction sub-operation between the first set of processing units in the first computing node and the second set of processing units in the second computing node according to the second cluster communication algorithm.

在实施中，在执行全局归约子运算之前，第一计算节点可以至少部分地基于第一计算节点和其他计算节点之间的节点间连接的类型或带宽，和/或第一计算节点和其他计算节点的连接拓扑来选择第二集群通信算法。在实施中，第一集群通信算法可以包括但不限于基于环的算法，或者减半加倍算法(比如节点感知减半加倍算法)等。In an implementation, before performing the global reduce sub-operation, the first compute node may be based at least in part on the type or bandwidth of the inter-node connection between the first compute node and other compute nodes, and/or the first compute node and other compute nodes. The connection topology of the nodes is calculated to select the second cluster communication algorithm. In implementation, the first cluster communication algorithm may include but is not limited to a ring-based algorithm, or a halving and doubling algorithm (such as a node-aware halving and doubling algorithm), etc.

在实施中，根据第二集群算法在第一计算节点中的第一处理单元集合和第二计算节点中的第二处理单元集合之间执行全局归约子运算可以包括：第一处理单元集合接收第二计算节点中的第二处理单元集合根据第二集群算法所获得的归约散布结果的各部分，第一处理单元集合的每个处理单元与第二处理单元集合的相应处理单元形成组，并从相应处理单元接收归约散布结果的相应部分；第一处理单元集合通过在第一处理单元集合之间执行归约散布子运算后获得的归约散布结果的对应本地部分对归约散布结果的各部分执行归约。In an implementation, performing the global reduction sub-operation between the first set of processing units in the first computing node and the second set of processing units in the second computing node according to the second cluster algorithm may include: the first set of processing units receiving a second set of processing units in the second computing node dispersing portions of the reduction results obtained by the second clustering algorithm, each processing unit of the first set of processing units forming a group with a corresponding processing unit of the second set of processing units, and receiving corresponding portions of the reduction spread results from corresponding processing units; the first set of processing units pairs the reduction spread results by corresponding local portions of the reduction spread results obtained after performing the reduction spread sub-operations among the first set of processing units Each part performs the reduction.

在框1506中，第一计算节点可以根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行全局聚集子运算。In block 1506 , the first computing node may perform a global aggregation sub-operation among a first set of processing units in the first computing node according to a first cluster communication algorithm.

在实施中，根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行全局聚集子运算可以包括：根据第一集群通信算法，在第一处理单元集合的第一处理单元处从第一处理单元集合的第二处理单元接收数据块；以及在第一处理单元处用本地数据块归约所接收的数据块。In an implementation, performing the global aggregation sub-operation between the first set of processing units in the first computing node according to the first cluster communication algorithm may include: according to the first cluster communication algorithm, between the first processing unit of the first set of processing units receiving a data block from a second processing unit of the first set of processing units; and reducing the received data block with a local data block at the first processing unit.

参考图16，在框1602中，第一计算节点(例如，计算节点104)或第一进程可以至少部分地基于与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器是否位于同一计算节点中或链接到同一片交换机来确定将数据从第一进程路由到第二进程的路由标识符。Referring to FIG. 16 , in block 1602 , a first computing node (eg, computing node 104 ) or a first process may be based, at least in part, on a network interface controller associated with the first process and a network interface associated with the second process. Whether the controller is located in the same compute node or linked to the same switch determines the routing identifier for routing data from the first process to the second process.

在实施中，第一进程和第二进程可以属于特定网络拓扑下连接多个不同节点的特定节点间环。作为示例而非限制，特定网络拓扑可以包括胖树拓扑。In an implementation, the first process and the second process may belong to a specific inter-node ring connecting multiple different nodes under a specific network topology. By way of example, and not limitation, certain network topologies may include fat tree topologies.

在实施中，与第一进程相关联的网络接口控制器被配置为仅向环形拓扑中的第二计算节点发送数据或从第二计算节点接收数据，第二计算节点不同于第一计算节点。In an implementation, the network interface controller associated with the first process is configured to only send data to or receive data from a second computing node in the ring topology, the second computing node being different than the first computing node.

在实施中，与第一进程相关联的网络接口控制器还与一个或多个进程相关联，其中从第一进程和一个或多个进程发送的所有数据都通过网络接口控制器发送。In an implementation, the network interface controller associated with the first process is also associated with one or more processes, wherein all data sent from the first process and the one or more processes is sent through the network interface controller.

在实施中，响应于确定与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器位于同一计算节点中或链接到同一片交换机，可以将路由标识符设置或确定为默认标识符。In implementations, in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or linked to the same blade switch, the routing identifier may be set or determined as a default identifier.

在实施中，响应于确定与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器位于不同的计算节点中或链接到不同的片交换机，可以将路由标识符设置或确定为等于与第一进程相关联的网络接口控制器的标识符。In an implementation, in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in different compute nodes or linked to different slice switches, the routing identifier may be set or determined to be equal to the identifier of the network interface controller associated with the first process.

在框1604中，第一计算节点或第一进程可以根据路由标识符将数据从第一进程路由到第二进程。In block 1604, the first computing node or the first process may route data from the first process to the second process according to the routing identifier.

在实施中，根据路由标识符将数据从第一进程路由到第二进程可以包括至少通过与网络接口控制器连接的片交换机和具有与网络接口控制器的标识符具有对应关系的标识符的汇聚交换机将数据从第一进程路由到第二进程，该网络接口控制器与第一进程相关联。In an implementation, routing data from the first process to the second process based on the routing identifier may include at least passing through a slice switch connected to the network interface controller and an aggregation having an identifier that has a corresponding relationship with the identifier of the network interface controller. The switch routes data from the first process to the second process, and the network interface controller is associated with the first process.

参考图17，在框1702中，第一计算节点(例如，计算节点104)或第一进程可以根据节点感知减半加倍算法确定用于从第一进程向第二进程发送数据包的汇聚标识符，第一进程和第二进程属于在特定网络拓扑下连接至不同片交换机的不同节点。17 , in box 1702, a first computing node (e.g., computing node 104) or a first process may determine an aggregation identifier for sending a data packet from the first process to a second process based on a node-aware halving and doubling algorithm, wherein the first process and the second process belong to different nodes connected to different slice switches under a specific network topology.

在实施中，第一计算节点可以为定向到连接至不同片交换机的计算节点的数据包分配不同的汇聚标识符，以使得能够通过不同的汇聚交换机将数据包路由到连接至不同片交换机的节点。In an implementation, the first computing node may assign different aggregation identifiers to data packets directed to computing nodes connected to different slice switches, so that the data packets can be routed to nodes connected to different slice switches through different aggregation switches.

在实施中，第一计算节点可以至少部分地基于预定对应关系，分配与汇聚标识符相关联的汇聚交换机对应的源端口和目的端口。在实施中，对应关系可以记录多个汇聚交换机的汇聚标识符与对应的源端口和目的端口对之间的关系。在实施中，特定的网络拓扑可以包括胖树拓扑。In an implementation, the first computing node may allocate the source port and the destination port corresponding to the aggregation switch associated with the aggregation identifier based at least in part on the predetermined correspondence. In an implementation, the correspondence relationship may record the relationship between aggregation identifiers of multiple aggregation switches and corresponding source port and destination port pairs. In implementations, the specific network topology may include a fat tree topology.

在框1704中，第一计算节点可以通过与汇聚标识符对应的汇聚交换机从第一进程向第二进程发送数据包。In block 1704, the first computing node may send the data packet from the first process to the second process through the aggregation switch corresponding to the aggregation identifier.

在实施中，第一计算节点还可以通过分配给各数据包的多个不同汇聚标识符对应的多个不同的汇聚交换机，从第一计算节点包括的第一进程集合向第二计算节点包括的第二进程集合发送各数据包。In an implementation, the first computing node may also transfer data from the first process set included in the first computing node to the process set included in the second computing node through multiple different aggregation switches corresponding to multiple different aggregation identifiers assigned to each data packet. The second set of processes sends each data packet.

在实施中，第一计算节点还可以通过分配给各数据包的多个不同汇聚标识符对应的多个不同的汇聚交换机，由第一计算节点包括的第一进程集合从第二计算节点包括的第二进程集合接收数据包。In an implementation, the first computing node may also use multiple different aggregation switches corresponding to multiple different aggregation identifiers assigned to each data packet. A second set of processes receives the packet.

参考图18，在框1802中，第一计算节点(例如，计算节点104)或处理单元可以将分配给该处理单元的数据块划分成多个数据段，该多个数据段至少包括第一数据段和第二数据段。Referring to Figure 18, in block 1802, a first computing node (eg, computing node 104) or processing unit may divide a data block allocated to the processing unit into a plurality of data segments, the plurality of data segments including at least first data segment and the second data segment.

在框1804中，第一计算节点或者处理单元可以将多个数据段分配给多个线程，该多个线程至少包括第一线程和第二线程。In block 1804, the first computing node or processing unit may allocate the plurality of data segments to the plurality of threads, including at least a first thread and a second thread.

在框1806中，第一计算节点或处理单元可以使用第一线程对第一数据段的一部分执行节点内子运算，且并行地使用第二线程对第二数据段的一部分执行节点间子运算。In block 1806, the first compute node or processing unit may perform an intra-node sub-operation on a portion of the first data segment using a first thread, and in parallel use a second thread to perform an inter-node sub-operation on a portion of the second data segment.

在实施中，使用第一线程对第一数据段的一部分执行节点内子运算可以包括通过节点内连接在第一计算节点中包括的处理单元和另一处理单元之间传输上述第一数据段的一部分。In an implementation, performing an intra-node sub-operation on a portion of the first data segment using the first thread may include transmitting the portion of the first data segment between a processing unit included in the first computing node and another processing unit via an intra-node connection. .

在实施中，使用第二线程对第二数据段的一部分执行节点间子运算可以包括通过节点间连接在处理单元和不同于第一计算节点的第二计算节点中包括的另一处理单元之间传输上述第二数据段的一部分。In an implementation, performing an inter-node sub-operation on a portion of the second data segment using the second thread may include an inter-node connection between the processing unit and another processing unit included in a second computing node that is different from the first computing node. Transmit a portion of the second data segment described above.

在实施中，节点内子运算可以包括在第一计算节点内执行的归约散布子运算或全局聚集子运算，并且节点间子运算可以包括在第一计算节点和不同于第一计算节点的第二计算节点间执行的全局归约子运算。In an implementation, the intra-node sub-operation may include a reduce spread sub-operation or a global aggregation sub-operation performed within a first compute node, and the inter-node sub-operation may include a first compute node and a second compute node different from the first compute node. Compute global reduction sub-operations performed across nodes.

在实施中，节点内子运算可以包括在第一计算节点内执行的全局聚集子运算或复制子运算，节点间子运算可以包括在第一计算节点和不同于第一计算节点的第二计算节点间执行的全局聚集子运算。In an implementation, the intra-node sub-operation may include a global aggregation sub-operation or a replication sub-operation executed in a first computing node, and the inter-node sub-operation may include a global aggregation sub-operation executed between the first computing node and a second computing node different from the first computing node.

在实施中，第一计算节点或处理单元可以使用第一线程对上述第一数据段的一部分执行另一个节点间子运算，并且并行地使用第二线程对上述第二数据段的一部分执行另一个节点内子运算。In an implementation, the first computing node or processing unit may use a first thread to perform another inter-node sub-operation on a portion of the first data segment, and in parallel use a second thread to perform another inter-node sub-operation on a portion of the second data segment. Suboperations within a node.

在实施中，使用第一线程对第一数据段的该部分执行节点内子运算，以及并行地使用第二线程对第二数据段的该部分执行节点间子运算，使得使用节点内连接将第一数据段的一部分传输至第一计算节点包括的另一处理单元，以及并发地使用节点间连接将第二数据段的一部分传输至与第一计算节点不同的第二计算节点包括的另一处理单元。In an implementation, a first thread is used to perform an intra-node sub-operation on the portion of the first data segment, and a second thread is used in parallel to perform an inter-node sub-operation on the portion of the second data segment, such that the first is combined using an intra-node join. transmitting a portion of the data segment to another processing unit included in the first computing node, and concurrently using an inter-node connection to transmit a portion of the second data segment to another processing unit included in a second computing node different from the first computing node .

虽然上述描述的方法框是按一定顺序执行的，但在一些实施中，一些或全部方法框可以以其他顺序或并行地执行。Although the method blocks described above are executed in a certain order, in some implementations, some or all of the method blocks may be executed in another order or in parallel.

总结Summarize

虽然是用专用于结构特征和/或方法动作的语言描述了实施方式，但是应该理解，权利要求不一定限于所描述的特定特征或动作。相反地，这些具体特征和动作是作为实施所要求保护的主题的示例形式公开的。附加地或替代地，一些或所有操作可以由一个或多个ASICS、FPGA或其他硬件来实现。Although embodiments have been described in language specific to structural features and/or methodological acts, it is to be understood that the claims are not necessarily limited to the specific features or acts described. Rather, the specific features and acts are disclosed as example forms of implementing the claimed subject matter. Additionally or alternatively, some or all operations may be implemented by one or more ASICS, FPGAs, or other hardware.

本公开可以进一步利用下述条款来理解：This disclosure may further be understood using the following terms:

条款1：一种由第一计算节点实施的方法，所述方法包括：根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行归约散布子运算；根据第二集群通信算法在第一计算节点中的第一处理单元集合和第二计算节点中的第二处理单元集合之间执行全局归约子运算；并根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行全局聚集子运算。Clause 1: A method implemented by a first computing node, the method comprising: performing a reduction spread sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm; according to a second cluster The communication algorithm performs a global reduction sub-operation between the first set of processing units in the first computing node and the second set of processing units in the second computing node; and performs a global reduction sub-operation on the first set of processing units in the first computing node according to the first cluster communication algorithm. Global aggregation sub-operations are performed between a collection of processing units.

条款2：根据条款1所述的方法，还包括：至少部分基于第一计算节点中的第一处理单元集合间的节点内连接的类型或带宽选择第一集群通信算法。Clause 2: The method of Clause 1, further comprising selecting the first cluster communication algorithm based at least in part on a type or bandwidth of an intra-node connection between the first set of processing units in the first computing node.

条款3：根据条款1所述的方法，还包括：至少部分基于第一计算节点和其他计算节点之间的节点间连接的类型或带宽，和/或，第一计算节点和其他计算节点的连接拓扑来选择第二集群通信算法。Clause 3: The method of Clause 1, further comprising: based at least in part on the type or bandwidth of the inter-node connection between the first computing node and the other computing nodes, and/or the connection of the first computing node and the other computing nodes. topology to select the second cluster communication algorithm.

条款4：根据条款1所述的方法，其中第一集群通信算法包括基于环的算法，或减半加倍算法。Clause 4: The method of Clause 1, wherein the first cluster communication algorithm includes a ring-based algorithm, or a halve-double algorithm.

条款5：根据条款1所述的方法，其中根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行归约散布子运算包括：将数据划分成多个数据块；将多个数据块分配给第一处理单元集合；根据第一集群通信算法，在第一处理单元集合的第一处理单元处从第一处理单元集合的第二处理单元接收数据块；以及在第一处理单元处用本地数据块归约所接收的数据块。Clause 5: The method of clause 1, wherein performing the reduction spread sub-operation among the first set of processing units in the first computing node according to the first cluster communication algorithm includes: dividing the data into a plurality of data blocks; allocating a plurality of data blocks to the first set of processing units; receiving the data blocks at a first processing unit of the first set of processing units from a second processing unit of the first set of processing units according to a first cluster communication algorithm; and at the first The received data blocks are reduced using local data blocks at the processing unit.

条款6：根据条款1所述的方法，其中根据第二集群算法在第一计算节点中的第一处理单元集合和第二计算节点中的第二处理单元集合之间执行全局归约子运算包括：第一处理单元集合接收第二计算节点中的第二处理单元集合根据第二集群算法所获得的归约散布结果的各部分，其中第一处理单元集合的每个处理单元与第二处理单元集合的相应处理单元形成组，并从相应处理单元接收归约散布结果的相应部分；第一处理单元集合通过在第一处理单元集合之间执行归约散布子运算后获得的归约散布结果的对应本地部分对归约散布结果的各部分执行归约。Clause 6: The method of Clause 1, wherein performing the global reduction sub-operation between the first set of processing units in the first computing node and the second set of processing units in the second computing node according to the second clustering algorithm includes : The first set of processing units receives portions of the reduction dispersion results obtained by the second set of processing units in the second computing node according to the second cluster algorithm, wherein each processing unit of the first set of processing units is identical to the second set of processing units. The corresponding processing units of the set form a group and receive corresponding portions of the reduction spread results from the corresponding processing units; the first set of processing units obtains the reduction spread results obtained by performing the reduction spread sub-operations among the first set of processing units. The reduction is performed on each part of the reduction spread result corresponding to the local part.

条款7：根据条款1所述的方法，其中根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行全局聚集子运算包括：根据第一集群通信算法，在第一处理单元集合的第一处理单元处从第一处理单元集合的第二处理单元接收数据块；以及在第一处理单元处用本地数据块归约所接收的数据块。Clause 7: The method of Clause 1, wherein performing the global aggregation sub-operation between the first set of processing units in the first computing node according to the first cluster communication algorithm comprises: according to the first cluster communication algorithm, on the first processing unit A data block is received at a first processing unit of the set of units from a second processing unit of the first set of processing units; and the received data block is reduced with a local data block at the first processing unit.

条款8：一个或多个机器可读介质，其存储有机器可读指令，当机器可读指令由第一计算节点执行时，使第一计算节点执行动作，所述动作包括：根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行归约散布子运算；根据第二集群通信算法在第一计算节点中的第一处理单元集合和第二计算节点中的第二处理单元集合之间执行全局归约子运算；并根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行全局聚集子运算。Item 8: One or more machine-readable media storing machine-readable instructions that, when executed by a first computing node, cause the first computing node to perform actions, the actions comprising: performing a reduce-scatter sub-operation between a first set of processing units in the first computing node according to a first cluster communication algorithm; performing a global reduce sub-operation between a first set of processing units in the first computing node and a second set of processing units in a second computing node according to a second cluster communication algorithm; and performing a global gather sub-operation between a first set of processing units in the first computing node according to the first cluster communication algorithm.

条款9：根据条款8所述的一个或多个机器可读介质，所述动作还包括：至少部分基于第一计算节点中的第一处理单元集合间的节点内连接的类型或带宽选择第一集群通信算法。Clause 9: The one or more machine-readable media of Clause 8, the actions further comprising: selecting the first computing node based at least in part on a type or bandwidth of an intra-node connection between the first set of processing units in the first computing node. Cluster communication algorithm.

条款10：根据条款8所述的一个或多个机器可读介质，所述动作还包括：至少部分基于第一计算节点和其他计算节点之间的节点间连接的类型或带宽，和/或，第一计算节点和其他计算节点的连接拓扑来选择第二集群通信算法。Clause 10: According to one or more machine-readable media of clause 8, the action further comprises: selecting a second cluster communication algorithm based at least in part on a type or bandwidth of an inter-node connection between the first computing node and the other computing nodes, and/or a connection topology between the first computing node and the other computing nodes.

条款11：根据条款8所述的一个或多个机器可读介质，其中第一集群通信算法包括基于环的算法，或减半加倍算法。Clause 11: The one or more machine-readable media of Clause 8, wherein the first cluster communication algorithm includes a ring-based algorithm, or a halve-double algorithm.

条款12：根据条款8所述的一个或多个机器可读介质，其中根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行归约散布子运算包括：将数据划分成多个数据块；将多个数据块分配给第一处理单元集合；根据第一集群通信算法，在第一处理单元集合的第一处理单元处从第一处理单元集合的第二处理单元接收数据块；以及在第一处理单元处用本地数据块归约所接收的数据块。Clause 12: The one or more machine-readable media of Clause 8, wherein performing the reduction spread sub-operation among the first set of processing units in the first computing node according to the first cluster communication algorithm includes: partitioning the data into a plurality of data blocks; allocating the plurality of data blocks to a first set of processing units; receiving at a first processing unit of the first set of processing units from a second processing unit of the first set of processing units according to a first cluster communication algorithm data blocks; and reducing the received data blocks with local data blocks at the first processing unit.

条款13：根据条款8所述的一个或多个机器可读介质，其中根据第二集群算法在第一计算节点中的第一处理单元集合和第二计算节点中的第二处理单元集合之间执行全局归约子运算包括：第一处理单元集合接收第二计算节点中的第二处理单元集合根据第二集群算法所获得的归约散布结果的各部分，第一处理单元集合的每个处理单元与第二处理单元集合的相应处理单元形成组，并从相应处理单元接收归约散布结果的相应部分；第一处理单元集合通过在第一处理单元集合之间执行归约散布子运算后获得的归约散布结果的对应本地部分对归约散布结果的各部分执行归约。Clause 13: One or more machine-readable media as described in clause 8, wherein between a first set of processing units in a first computing node and a second set of processing units in a second computing node according to a second clustering algorithm Executing the global reduction sub-operation includes: the first set of processing units receiving each part of the reduction spread result obtained by the second set of processing units in the second computing node according to the second cluster algorithm, and each processing of the first set of processing units The units form groups with corresponding processing units of the second set of processing units and receive corresponding portions of the reduction spread results from the corresponding processing units; the first set of processing units is obtained by performing the reduction spread sub-operations among the first set of processing units The corresponding local part of the reduce spread result performs the reduction on each part of the reduce spread result.

条款14：根据条款8所述的一个或多个及其可读介质，其中根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行全局聚集子运算包括：根据第一集群通信算法，在第一处理单元集合的第一处理单元处从第一处理单元集合的第二处理单元接收数据块；以及在第一处理单元处用本地数据块归约所接收的数据块。Clause 14: The one or more of clause 8 and the readable medium thereof, wherein performing the global aggregation sub-operation between the first set of processing units in the first computing node according to the first cluster communication algorithm includes: according to the first A cluster communication algorithm for receiving a data block at a first processing unit of the first set of processing units from a second processing unit of the first set of processing units; and reducing the received data block with a local data block at the first processing unit.

条款15：一种第一计算节点，包括：第一处理单元集合；以及存储器，其存储有机器可执行指令，当机器可执行指令由第一处理单元集合执行时，使第一处理单元集合执行动作，所述动作包括：根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行归约散布子运算；根据第二集群通信算法在第一计算节点中的第一处理单元集合和第二计算节点中的第二处理单元集合之间执行全局归约子运算；并根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行全局聚集子运算。Clause 15: A first computing node, comprising: a first set of processing units; and a memory storing machine-executable instructions that, when executed by the first set of processing units, cause the first set of processing units to execute Actions, the actions comprising: performing a reduction spread sub-operation between a first set of processing units in a first computing node according to a first cluster communication algorithm; performing first processing in the first computing node according to a second cluster communication algorithm A global reduction sub-operation is performed between the unit set and the second processing unit set in the second computing node; and a global aggregation sub-operation is performed between the first processing unit set in the first computing node according to the first cluster communication algorithm.

条款16：根据条款15所述的第一计算节点，所述动作还包括：至少部分基于第一计算节点的第一处理单元集合间的节点内连接的类型或带宽选择第一集群通信；至少部分基于第一计算节点和其他计算节点之间的节点间连接的类型或带宽，和/或，第一计算节点和其他计算节点的连接拓扑来选择第二集群通信算法。Clause 16: The first computing node of clause 15, the actions further comprising: selecting the first cluster communication based at least in part on a type or bandwidth of an intra-node connection between the first set of processing units of the first computing node; The second cluster communication algorithm is selected based on the type or bandwidth of the inter-node connection between the first computing node and the other computing nodes, and/or, the topology of the connection between the first computing node and the other computing nodes.

条款17：根据条款15所述的第一计算节点，其中第一集群通信算法包括基于环的算法，或减半加倍算法。Clause 17: The first computing node of clause 15, wherein the first cluster communication algorithm comprises a ring-based algorithm, or a halving-doubling algorithm.

条款18：根据条款15所述的第一计算节点，其中根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行归约散布子运算包括：将数据划分成多个数据块；将多个数据块分配给第一处理单元集合；根据第一集群通信算法，在第一处理单元集合的第一处理单元处从第一处理单元集合的第二处理单元接收数据块；以及在第一处理单元处用本地数据块归约所接收的数据块。Clause 18: The first computing node of Clause 15, wherein performing the reduction spread sub-operation among the first set of processing units in the first computing node according to the first cluster communication algorithm includes: partitioning the data into a plurality of data blocks; allocating the plurality of data blocks to the first set of processing units; receiving the data blocks at a first processing unit of the first set of processing units from a second processing unit of the first set of processing units according to a first cluster communication algorithm; and The received data blocks are reduced with local data blocks at the first processing unit.

条款19：根据条款15所述的第一计算节点，其中根据第二集群算法在第一计算节点中的第一处理单元集合和第二计算节点中的第二处理单元集合之间执行全局归约子运算包括：第一处理单元集合接收第二计算节点中的第二处理单元集合根据第二集群算法所获得的归约散布结果的各部分，第一处理单元集合的每个处理单元与第二处理单元集合的相应处理单元形成组，并从相应处理单元接收归约散布结果的相应部分；第二处理单元集合通过在第一处理单元集合之间执行归约散布子运算后获得的归约散布结果的相应本地部分对归约散布结果的各部分执行归约。Clause 19: A first computing node according to clause 15, wherein a global reduction is performed between a first set of processing units in the first computing node and a second set of processing units in the second computing node according to a second clustering algorithm The sub-operations include: the first set of processing units receiving each part of the reduction spread result obtained by the second set of processing units in the second computing node according to the second cluster algorithm, each processing unit of the first set of processing units communicating with the second set of processing units. Corresponding processing units of the set of processing units form a group and receive corresponding parts of the reduction spread results from the corresponding processing units; the second set of processing units passes the reduction spread obtained after performing the reduction spread sub-operations among the first set of processing units The corresponding local part of the result performs the reduction on each part of the reduction spread result.

条款20：根据条款15所述的第一计算节点，其中根据第一集群通信算法在第一计算节点中的第一处理单元集合之间执行全局聚集子运算包括：根据第一集群通信算法，在第一处理单元集合的第一处理单元处从第一处理单元集合的第二处理单元接收数据块；以及在第一处理单元处用本地数据块归约所接收的数据块。Clause 20: The first computing node of clause 15, wherein performing the global aggregation sub-operation between the first set of processing units in the first computing node according to the first cluster communication algorithm includes: according to the first cluster communication algorithm, in A data block is received at a first processing unit of the first set of processing units from a second processing unit of the first set of processing units; and the received data block is reduced with a local data block at the first processing unit.

条款21：一种由第一计算节点实施的方法，所述方法包括：至少部分地基于与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器是否位于同一计算节点中或链接到同一片交换机来确定将数据从第一进程路由到第二进程的路由标识符，第一进程和第二进程属于特定网络拓扑下连接多个不同节点的特定节点间环；以及根据路由标识符将数据从第一进程路由到第二进程。Clause 21: A method implemented by a first computing node, the method comprising: based, at least in part, on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located on the same compute node. in a node or linked to the same switch to determine a routing identifier for routing data from a first process to a second process, the first process and the second process belonging to a specific inter-node ring connecting multiple different nodes under a specific network topology; and Route data from the first process to the second process based on the routing identifier.

条款22：根据条款21所述的方法，其中与第一进程相关联的网络接口控制器被配置为仅向环形拓扑中的第二计算节点发送数据或从第二计算节点接收数据，第二计算节点不同于第一计算节点。Clause 22: The method of clause 21, wherein the network interface controller associated with the first process is configured to send data to or receive data from only a second computing node in the ring topology, the second computing node being different from the first computing node.

条款23：根据条款21所述的方法，其中与第一进程相关联的网络接口控制器还与一个或多个进程相关联，其中从第一进程和一个或多个进程发送的所有数据都通过网络接口控制器发送。Clause 23: The method of clause 21, wherein the network interface controller associated with the first process is also associated with one or more processes, and wherein all data sent from the first process and the one or more processes passes through Sent by the network interface controller.

条款24：根据条款21所述的方法，其中特定网络拓扑包括胖树拓扑。Clause 24: The method of clause 21, wherein the specific network topology includes a fat tree topology.

条款25：根据条款21所述的方法，还包括：响应于确定与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器位于同一计算节点中或链接到同一片交换机，将路由标识符设置为默认标识符。Clause 25: The method of Clause 21, further comprising: in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same computing node or linked to the same slice. switch, sets the routing identifier to the default identifier.

条款26：根据条款21所述的方法，还包括：响应于确定与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器位于不同的计算节点或链接到不同的片交换机，将路由标识符设置为等于与第一进程相关联的网络接口控制器的标识符。Clause 26: The method according to Clause 21 further includes: in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located on different computing nodes or linked to different slice switches, setting the routing identifier to be equal to the identifier of the network interface controller associated with the first process.

条款27：根据条款26所述的方法，其中根据路由标识符将数据从第一进程路由到第二进程包括：至少通过与网络接口控制器连接的片交换机和具有与网络接口控制器的标识符具有对应关系的标识符的汇聚交换机将数据从第一进程路由到第二进程，该网络接口控制器与第一进程相关联。Clause 27: The method of Clause 26, wherein routing data from the first process to the second process based on the routing identifier includes: at least through a slice switch connected to the network interface controller and having an identifier with the network interface controller An aggregation switch having a corresponding identifier routes data from a first process to a second process, the network interface controller being associated with the first process.

条款28：一个或多个机器可读介质，其存储有机器可读指令，当机器可读指令由第一计算节点执行时，使第一计算节点执行动作，所述动作包括：至少部分地基于与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器是否位于同一计算节点中或链接到同一片交换机来确定将数据从第一进程路由到第二进程的路由标识符，第一进程和第二进程属于特定网络拓扑下连接多个不同节点的特定节点间环；以及根据路由标识符将数据从第一进程路由到第二进程。Clause 28: One or more machine-readable media having machine-readable instructions stored thereon that, when executed by a first computing node, cause the first computing node to perform an action, the action comprising: based at least in part on: Whether the network interface controller associated with the first process and the network interface controller associated with the second process are located in the same compute node or linked to the same switch to determine the route for routing data from the first process to the second process The identifier, the first process and the second process belong to a specific inter-node ring connecting multiple different nodes under a specific network topology; and routing data from the first process to the second process according to the routing identifier.

条款29：根据条款28所述的一个或多个机器可读介质，其中与第一进程相关联的网络接口控制器被配置为仅向环形拓扑中的第二计算节点发送数据或从第二计算节点接收数据，第二计算节点不同于第一计算节点。Clause 29: One or more machine-readable media as described in clause 28, wherein the network interface controller associated with the first process is configured to send data to or receive data from only a second computing node in the ring topology, the second computing node being different from the first computing node.

条款30：根据条款28所述的一个或多个机器可读介质，其中与第一进程相关联的网络接口控制器还与一个或多个进程相关联，其中从第一进程和一个或多个进程发送的所有数据都通过网络接口控制器发送。Clause 30: The one or more machine-readable media of Clause 28, wherein the network interface controller associated with the first process is also associated with one or more processes, wherein from the first process and the one or more All data sent by the process is sent through the network interface controller.

条款31：根据条款28所述的一个或多个机器可读介质，其中特定网络拓扑包括胖树拓扑。Clause 31: The one or more machine-readable media of Clause 28, wherein the particular network topology includes a fat tree topology.

条款32：根据条款28所述的一个或多个机器可读介质，其中所述动作还包括：响应于确定与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器位于同一计算节点中或链接到同一片交换机，将路由标识符设置为默认标识符。Clause 32: The one or more machine-readable media of Clause 28, wherein the actions further comprise: in response to determining a network interface controller associated with the first process and a network interface control associated with the second process If the router is located in the same compute node or linked to the same switch, set the routing identifier to the default identifier.

条款33：根据条款28所述的一个或多个机器可读介质，其中所述动作还包括：响应于确定与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器位于不同的计算节点或链接到不同的片交换机，将路由标识符设置为等于与第一进程相关联的网络接口控制器的标识符。Clause 33: One or more machine-readable media according to clause 28, wherein the action further comprises: in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located on different computing nodes or linked to different slice switches, setting the routing identifier to be equal to the identifier of the network interface controller associated with the first process.

条款34：根据条款33所述的一个或多个机器可读介质，其中根据路由标识符将数据从第一进程路由到第二进程包括：至少通过与网络接口控制器连接的片交换机和具有与网络接口控制器的标识符具有对应关系的标识符的汇聚交换机将数据从第一进程路由到第二进程，该网络接口控制器与第一进程相关联。Clause 34: The one or more machine-readable media of clause 33, wherein routing data from the first process to the second process based on the routing identifier includes: at least through a slice switch connected to the network interface controller and having An aggregation switch having an identifier of a corresponding relationship routes data from a first process to a second process, the network interface controller being associated with the first process.

条款35：一种第一计算节点，包括：一个或多个处理单元；以及存储器，其存储机器可执行指令，当机器可执行指令由一个或多个处理单元执行时，使一个或多个处理单元执行动作，所述动作包括：至少部分地基于与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器是否位于同一计算节点中或链接到同一片交换机来确定将数据从第一进程路由到第二进程的路由标识符，第一进程和第二进程属于特定网络拓扑下连接多个不同节点的特定节点间环；以及根据路由标识符将数据从第一进程路由到第二进程。Clause 35: A first computing node comprising: one or more processing units; and a memory storing machine-executable instructions that, when executed by the one or more processing units, cause one or more processes The unit performs actions that include determining based at least in part on whether a network interface controller associated with the first process and a network interface controller associated with the second process are located in the same compute node or linked to the same switch. Routing data from a first process to a routing identifier of a second process, the first process and the second process belonging to a specific inter-node ring connecting multiple different nodes under a specific network topology; and routing data from the first process according to the routing identifier Route to the second process.

条款36：根据条款35所述的第一计算节点，其中与第一进程相关联的网络接口控制器被配置为仅向环形拓扑中的第二计算节点发送数据或从第二计算节点接收数据，第二计算节点不同于第一计算节点。Clause 36: The first computing node of Clause 35, wherein the network interface controller associated with the first process is configured to send data only to or receive data from the second computing node in the ring topology, The second computing node is different from the first computing node.

条款37：根据条款35所述的第一计算节点，其中与第一进程相关联的网络接口控制器还与一个或多个进程相关联，其中从第一进程和一个或多个进程发送的所有数据都通过网络接口控制器发送。Clause 37: The first computing node of Clause 35, wherein the network interface controller associated with the first process is also associated with one or more processes, and wherein all Data is sent through the network interface controller.

条款38：根据条款35所述的第一计算节点，其中所述动作还包括：响应于确定与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器位于同一计算节点中或链接到同一片交换机，将路由标识符设置为默认标识符。Clause 38: The first computing node of Clause 35, wherein the actions further comprise: in response to determining that the network interface controller associated with the first process and the network interface controller associated with the second process are located on the same compute node. In the node or connected to the same switch, set the routing identifier as the default identifier.

条款39：根据条款35所述的第一计算节点，其中所述动作还包括：响应于确定与第一进程相关联的网络接口控制器和与第二进程相关联的网络接口控制器位于不同的计算节点或链接到不同的片交换机，将路由标识符设置为等于与第一进程相关联的网络接口控制器的标识符。Clause 39: The first computing node of Clause 35, wherein the actions further comprise: in response to determining that a network interface controller associated with the first process and a network interface controller associated with the second process are located on different The compute node or link to a different slice switch sets the route identifier equal to the identifier of the network interface controller associated with the first process.

条款40：根据条款39所述的第一计算节点，其中根据路由标识符将数据从第一进程路由到第二进程包括：至少通过与网络接口控制器连接的片交换机和具有与网络接口控制器的标识符具有对应关系的标识符的汇聚交换机将数据从第一进程路由到第二进程，该网络接口控制器与第一进程相关联。Clause 40: The first compute node of Clause 39, wherein routing data from the first process to the second process based on the routing identifier includes: at least through a slice switch connected to the network interface controller and having a connection to the network interface controller An aggregation switch having an identifier corresponding to an identifier routes data from a first process to a second process, the network interface controller being associated with the first process.

条款41：一种由第一计算节点实施的方法，所述方法包括：Clause 41: A method implemented by a first computing node, the method comprising:

根据节点感知减半加倍算法确定用于从第一进程向第二进程发送数据包的汇聚标识符，第一进程和第二进程属于在特定网络拓扑下连接至不同片交换机的不同节点；以及通过与汇聚标识符对应的汇聚交换机从第一进程向第二进程发送数据包。Determining a rendezvous identifier for sending packets from a first process to a second process, the first process and the second process belonging to different nodes connected to different slice switches under a specific network topology according to a node-aware halving and doubling algorithm; and by The aggregation switch corresponding to the aggregation identifier sends the data packet from the first process to the second process.

条款42：根据条款41所述的方法，还包括：为定向到连接至不同片交换机的节点的数据包分配不同的汇聚标识符，以使得能够通过不同的汇聚交换机将数据包路由到连接至不同片交换机的节点。Clause 42: The method of Clause 41, further comprising: assigning different aggregation identifiers to data packets directed to nodes connected to different slice switches to enable routing of data packets to nodes connected to different shard switches through the different aggregation switches. The node of the chip switch.

条款43：根据条款41所述的方法，还包括：至少部分地基于预定对应关系，分配与汇聚标识符相关联的汇聚交换机对应的源端口和目的端口。Clause 43: The method of Clause 41, further comprising allocating a source port and a destination port corresponding to the aggregation switch associated with the aggregation identifier based at least in part on the predetermined correspondence.

条款44：根据条款43所述的方法，其中对应关系记录多个汇聚交换机的汇聚标识符与对应的源端口和目的端口对之间的关系。Clause 44: The method according to Clause 43, wherein the correspondence relationship records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source port and destination port pairs.

条款45：根据条款41所述的方法，其中特定网络拓扑包括胖树拓扑。Clause 45: The method of clause 41, wherein the particular network topology comprises a fat tree topology.

条款46：根据条款41所述的方法，还包括：通过分配给各数据包的多个不同汇聚标识符对应的多个不同的汇聚交换机，从第一计算节点包括的第一进程集合向第二计算节点包括的第二进程集合发送各数据包。Clause 46: The method according to Clause 41 further includes: sending each data packet from a first set of processes included in a first computing node to a second set of processes included in a second computing node through multiple different aggregation switches corresponding to multiple different aggregation identifiers assigned to each data packet.

条款47：根据条款41所述的方法，还包括：通过分配给各数据包的多个不同汇聚标识符对应的多个不同的汇聚交换机，由第一计算节点包括的第一进程集合从第二计算节点包括的第二进程集合接收数据包。Clause 47: The method according to Clause 41 further includes: receiving data packets from a second set of processes included in a second computing node by a first set of processes included in a first computing node through multiple different aggregation switches corresponding to multiple different aggregation identifiers assigned to each data packet.

条款48：一个或多个机器可读介质，其存储有机器可读指令，当机器可读指令由第一计算节点执行时，使第一计算节点执行动作，所述动作包括：根据节点感知减半加倍算法确定用于从第一进程向第二进程发送数据包的汇聚标识符，第一进程和第二进程属于在特定网络拓扑下连接至不同片交换机的不同节点；以及通过与汇聚标识符对应的汇聚交换机从第一进程向第二进程发送数据包。Item 48: One or more machine-readable media storing machine-readable instructions that, when executed by a first computing node, cause the first computing node to perform actions, the actions comprising: determining an aggregation identifier for sending a data packet from a first process to a second process based on a node-aware halving and doubling algorithm, the first process and the second process belonging to different nodes connected to different slice switches under a specific network topology; and sending a data packet from the first process to the second process through an aggregation switch corresponding to the aggregation identifier.

条款49：根据条款48所述的一个或多个机器可读介质，其中所述动作还包括：为定向到连接至不同片交换机的节点的数据包分配不同的汇聚标识符，以使得能够通过不同的汇聚交换机将数据包路由到连接至不同片交换机的节点。Clause 49: The one or more machine-readable media of Clause 48, wherein the actions further comprise assigning different rendezvous identifiers to packets directed to nodes connected to different slice switches to enable different rendezvous identifiers to be The aggregation switches route packets to nodes connected to different slice switches.

条款50：根据条款48所述的一个或多个机器可读介质，其中所述动作还包括：至少部分地基于预定对应关系，分配与汇聚标识符相关联的汇聚交换机对应的源端口和目的端口。Clause 50: The one or more machine-readable media of clause 48, wherein the actions further comprise: allocating a source port and a destination port corresponding to an aggregation switch associated with the aggregation identifier based at least in part on a predetermined correspondence.

条款51：根据条款50所述的一个或多个机器可读介质，其中对应关系记录多个汇聚交换机的汇聚标识符与对应的源端口和目的端口对之间的关系。Clause 51: One or more machine-readable media according to clause 50, wherein the correspondence records the relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source port and destination port pairs.

条款52：根据条款48所述的一个或多个机器可读介质，其中特定网络拓扑包括胖树拓扑。Clause 52: The one or more machine-readable media of Clause 48, wherein the particular network topology includes a fat-tree topology.

条款53：根据条款48所述的一个或多个机器可读介质，其中所述动作还包括：通过分配给各数据包的多个不同汇聚标识符对应的多个不同的汇聚交换机，从第一计算节点包括的第一进程集合向第二计算节点包括的第二进程集合发送各数据包。Clause 53: The one or more machine-readable media of Clause 48, wherein the actions further comprise: transmitting data from a first packet to a packet via a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to each data packet. A first set of processes included in the computing node sends each data packet to a second set of processes included in the second computing node.

条款54：根据条款48所述的一个或多个机器可读介质，其中所述动作还包括：通过分配给各数据包的多个不同汇聚标识符对应的多个不同的汇聚交换机，由第一计算节点包括的第一进程集合从第二计算节点包括的第二进程集合接收数据包。Clause 54: The one or more machine-readable media of clause 48, wherein the actions further comprise: through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to each data packet, by the first A first set of processes included in a computing node receives data packets from a second set of processes included in a second computing node.

条款55：一种第一计算节点，包括：一个或多个处理单元；以及存储器，其存储有机器可执行指令，当机器可执行指令由一个或多个处理单元执行时，使一个或多个处理单元执行动作，所述动作包括：根据节点感知减半加倍算法确定用于从第一进程向第二进程发送数据包的汇聚标识符，第一进程和第二进程属于在特定网络拓扑下连接至不同片交换机的不同节点；以及通过与汇聚标识符对应的汇聚交换机从第一进程向第二进程发送数据包。Clause 55: A first computing node comprising: one or more processing units; and a memory storing machine-executable instructions that, when executed by the one or more processing units, cause one or more The processing unit performs actions, the actions including: determining a convergence identifier for sending a data packet from a first process to a second process according to a node-aware halving and doubling algorithm, the first process and the second process belonging to a connection under a specific network topology to different nodes on different slice switches; and sending the data packet from the first process to the second process through the aggregation switch corresponding to the aggregation identifier.

条款56：根据条款55所述的第一计算节点，其中所述动作还包括：为定向到连接至不同片交换机的节点的数据包分配不同的汇聚标识符，以使得能够通过不同的汇聚交换机将数据包路由到连接至不同片交换机的节点。Clause 56: The first compute node of Clause 55, wherein the actions further comprise assigning different aggregation identifiers to packets directed to nodes connected to different shard switches to enable routing through the different aggregation switches. Packets are routed to nodes connected to different slice switches.

条款57：根据条款55所述的第一计算节点，其中所述动作还包括：至少部分地基于预定对应关系，分配与汇聚标识符相关联的汇聚交换机对应的源端口和目的端口。Clause 57: The first computing node of clause 55, wherein the actions further comprise allocating source and destination ports corresponding to the aggregation switch associated with the aggregation identifier based at least in part on the predetermined correspondence.

条款58：根据条款57所述的第一计算节点，其中对应关系记录多个汇聚交换机的汇聚标识符与对应的源端口和目的端口对之间的关系。Clause 58: The first computing node of Clause 57, wherein the correspondence records a relationship between aggregation identifiers of a plurality of aggregation switches and corresponding source port and destination port pairs.

条款59：根据条款55所述的第一计算节点，其中所述动作还包括：通过分配给各数据包的多个不同汇聚标识符对应的多个不同的汇聚交换机，从第一计算节点包括的第一进程集合向第二计算节点包括的第二进程集合发送各数据包。Clause 59: The first computing node of Clause 55, wherein the actions further comprise: transferring data from the first computing node to a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to each data packet. The first set of processes sends each data packet to a second set of processes included in the second computing node.

条款60：根据条款55所述的第一计算节点，其中所述动作还包括：通过分配给各数据包的多个不同汇聚标识符对应的多个不同的汇聚交换机，由第一计算节点包括的第一进程集合从第二计算节点包括的第二进程集合接收数据包。Clause 60: The first computing node of Clause 55, wherein the actions further comprise: through a plurality of different aggregation switches corresponding to a plurality of different aggregation identifiers assigned to each data packet. The first set of processes receives data packets from a second set of processes included in the second computing node.

条款61：一种由第一计算节点实施的方法，该方法包括：将分配给处理单元的数据包划分成多个数据段，该多个数据段至少包括第一数据段和第二数据段；将多个数据段分配给多个线程，该多个线程至少包括第一线程和第二线程；使用第一线程对第一数据段的一部分执行节点内子运算，且并行地使用第二线程对第二数据段的一部分执行节点间子运算。Clause 61: A method implemented by a first computing node, the method comprising: dividing a data packet allocated to a processing unit into a plurality of data segments, the plurality of data segments including at least a first data segment and a second data segment; Allocating multiple data segments to multiple threads, the plurality of threads including at least a first thread and a second thread; using the first thread to perform an intra-node sub-operation on a part of the first data segment, and using the second thread in parallel to perform an intra-node sub-operation on a part of the first data segment. Part of the second data segment performs inter-node sub-operations.

条款62：根据条款61所述的方法，其中使用第一线程对第一数据段的一部分执行节点内子运算包括：通过节点内连接在第一计算节点中包括的处理单元和另一处理单元之间传输第一数据段的该部分。Clause 62: The method of Clause 61, wherein performing an intra-node sub-operation on a portion of the first data segment using the first thread comprises: via an intra-node connection between a processing unit included in the first compute node and another processing unit Transmit this portion of the first data segment.

条款63：根据条款61所述的方法，其中使用第二线程对第二数据段的一部分执行节点间子运算包括：通过节点间连接在处理单元和不同于第一计算节点的第二计算节点中包括的另一处理单元之间传输第二数据段的该部分。Clause 63: The method of clause 61, wherein using the second thread to perform an inter-node sub-operation on a portion of the second data segment comprises: in a processing unit and a second computing node different from the first computing node via an inter-node connection The portion of the second data segment is transmitted between another processing unit included.

条款64：根据条款61所述的方法，其中节点内子运算包括：在第一计算节点内执行的归约散布子运算或全局聚集子运算，并且节点间子运算包括在第一计算节点和不同于第一计算节点的第二计算节点间执行的全局归约子运算。Clause 64: The method of Clause 61, wherein the intra-node sub-operation includes a reduction spread sub-operation or a global aggregation sub-operation performed within a first compute node, and the inter-node sub-operation includes a first compute node and a different A global reduction sub-operation performed between the first computing node and the second computing node.

条款65：根据条款61所述的方法，其中节点内子运算包括在第一计算节点内执行的全局聚集子运算或复制子运算，节点间子运算包括在第一计算节点和不同于第一计算节点的第二计算节点间执行的全局聚集子运算。Clause 65: A method according to clause 61, wherein the intra-node sub-operation includes a global aggregation sub-operation or a replication sub-operation performed within a first computing node, and the inter-node sub-operation includes a global aggregation sub-operation performed between the first computing node and a second computing node different from the first computing node.

条款66：根据条款66所述的方法，还包括：使用第一线程对第一数据段的一部分执行另一个节点间子运算，以及并行地使用第二线程对第二数据段的一部分执行另一个节点内子运算。Clause 66: The method of Clause 66, further comprising: using the first thread to perform another inter-node sub-operation on a portion of the first data segment, and in parallel using the second thread to perform another on a portion of the second data segment. Suboperations within a node.

条款67：根据条款61所述的方法，其中使用第一线程对第一数据段的该部分执行节点内子运算，以及并行地使用第二线程对第二数据段的该部分执行节点间子运算，使得使用节点内连接来将第一数据段的一部分传输至第一计算节点包括的另一处理单元，以及并发地使用节点间连接将第二数据段的一部分传输至与第一计算节点不同的第二计算节点包括的另一处理单元。Clause 67: A method according to Clause 61, wherein an intra-node sub-operation is performed on the portion of the first data segment using a first thread, and an inter-node sub-operation is performed on the portion of the second data segment using a second thread in parallel, so that an intra-node connection is used to transfer a portion of the first data segment to another processing unit included in the first computing node, and an inter-node connection is used to concurrently transfer a portion of the second data segment to another processing unit included in a second computing node different from the first computing node.

条款68：一个或多个机器可读介质，其存储有机器可读指令，当机器可读指令由第一计算节点执行时，使第一计算节点执行动作，所述动作包括：将分配给处理单元的数据包划分成多个数据段，该多个数据段至少包括第一数据段和第二数据段；将多个数据段分配给多个线程，该多个线程至少包括第一线程和第二线程；使用第一线程对第一数据段的一部分执行节点内子运算，且并行地使用第二线程对第二数据段的一部分执行节点间子运算。Clause 68: One or more machine-readable media having machine-readable instructions stored thereon that, when executed by the first computing node, cause the first computing node to perform actions, the actions including: assigning to processing The data packet of the unit is divided into multiple data segments, the multiple data segments at least include a first data segment and a second data segment; the multiple data segments are allocated to multiple threads, the multiple threads include at least the first thread and the second data segment. Two threads: using the first thread to perform intra-node sub-operations on a portion of the first data segment, and using the second thread in parallel to perform inter-node sub-operations on a portion of the second data segment.

条款69：根据条款68所述的一个或多个机器可读介质，其中使用第一线程对第一数据段的部分执行节点内子运算包括通过节点内连接在第一计算节点中包括的另一处理单元和处理单元之间传输第一数据段的该部分。Clause 69: The one or more machine-readable media of Clause 68, wherein performing an intra-node sub-operation on a portion of the first data segment using the first thread includes another process included in the first compute node via an intra-node connection The portion of the first data segment is transferred between the unit and the processing unit.

条款70：根据条款68所述的一个或多个机器可读介质，其中使用第二线程对第二数据段的部分执行节点间子运算包括：通过节点间连接在处理单元和不同于第一计算节点的第二计算节点中包括的另一处理单元之间传输第二数据段的该部分。Clause 70: One or more machine-readable media as described in clause 68, wherein performing an inter-node sub-operation on a portion of a second data segment using a second thread includes transmitting the portion of the second data segment between a processing unit and another processing unit included in a second computing node different from the first computing node via an inter-node connection.

条款71：根据条款68所述的一个或多个机器可读介质，其中节点内子运算包括在第一计算节点内执行的归约散布子运算或全局聚集子运算，并且节点间子运算包括在第一计算节点和不同于第一计算节点的第二计算节点间执行的全局归约子运算。Clause 71: One or more machine-readable media of Clause 68, wherein the intra-node sub-operation includes a reduce spread sub-operation or a global aggregation sub-operation performed within the first computing node, and the inter-node sub-operation includes a A global reduction sub-operation performed between a computing node and a second computing node different from the first computing node.

条款72：根据条款68所述的一个或多个机器可读介质，其中节点内子运算包括在第一计算节点内执行的全局聚集子运算或复制子运算，并且节点间子运算包括在第一计算节点和不同于第一计算节点的第二计算节点间执行的全局聚集子运算。Clause 72: One or more machine-readable media according to clause 68, wherein the intra-node sub-operation includes a global aggregation sub-operation or a replication sub-operation performed within a first computing node, and the inter-node sub-operation includes a global aggregation sub-operation performed between the first computing node and a second computing node different from the first computing node.

条款73：根据条款68所述的一个或多个机器可读介质，所述动作还包括：使用第一线程对第一数据段的部分执行另一个节点间子运算，以及并行地使用第二线程对第二数据段的部分执行另一个节点内子运算。Clause 73: The one or more machine-readable media of Clause 68, the actions further comprising: using the first thread to perform another inter-node sub-operation on the portion of the first data segment, and using the second thread in parallel Perform another in-node sub-operation on part of the second data segment.

条款74：根据条款68的一个或多个机器可读介质，其中使用第一线程对第一数据段的一部分执行节点内子运算，以及并行地使用第二线程对第二数据段的一部分执行节点间子运算，使得使用节点内连接来将第一数据段的一部分传输至第一计算节点包括的另一处理单元，以及并发地使用节点间连接将第二数据段的一部分传输至与第一计算节点不同的第二计算节点包括的另一处理单元。Clause 74: One or more machine-readable media according to Clause 68, wherein a first thread is used to perform an intra-node sub-operation on a portion of the first data segment, and a second thread is used in parallel to perform an inter-node sub-operation on a portion of the second data segment. Sub-operations such that a portion of the first data segment is transmitted to another processing unit included in the first computing node using an intra-node connection, and concurrently transmitting a portion of the second data segment to another processing unit included in the first computing node using an inter-node connection. A different second computing node includes another processing unit.

条款75：一种第一计算节点，包括：一个或多个处理单元；以及存储器，其存储有机器可执行指令，当机器可执行指令由一个或多个处理单元执行时，使一个或多个处理单元执行动作，所述动作包括：将分配给处理单元的数据包划分成多个数据段，该多个数据段至少包括第一数据段和第二数据段；将多个数据段分配给多个线程，该多个线程至少包括第一线程和第二线程；使用第一线程对第一数据段的一部分执行节点内子运算，且并行地使用第二线程对第二数据段的一部分执行节点间子运算。Clause 75: A first computing node comprising: one or more processing units; and a memory storing machine-executable instructions that, when executed by the one or more processing units, cause one or more The processing unit performs actions, the actions include: dividing the data packet allocated to the processing unit into multiple data segments, the multiple data segments at least include a first data segment and a second data segment; allocating the multiple data segments to multiple data segments. Threads, the plurality of threads at least include a first thread and a second thread; using the first thread to perform intra-node sub-operations on a part of the first data segment, and using the second thread in parallel to perform inter-node sub-operations on a part of the second data segment suboperation.

条款76：根据条款75所述的第一计算节点，其中使用第一线程对第一数据段的一部分执行节点内子运算包括：通过节点内连接在第一计算节点中包括的另一处理单元和处理单元之间传输第一数据段的该部分。Clause 76: The first compute node of clause 75, wherein performing the intra-node sub-operation on the portion of the first data segment using the first thread includes: another processing unit included in the first compute node and processing via an intra-node connection The portion of the first data segment is transmitted between units.

条款77：根据条款75所述的第一计算节点，其中使用第二线程对第二数据段的部分执行节点间子运算包括：通过节点间连接在处理单元和不同于第一计算节点的第二计算节点中包括的另一处理单元之间传输第二数据段的该部分。Clause 77: The first compute node of Clause 75, wherein performing the inter-node sub-operation on the portion of the second data segment using the second thread includes: The portion of the second data segment is transmitted between another processing unit included in the computing node.

条款78：根据条款75所述的第一计算节点，其中节点内子运算包括在第一计算节点内执行的归约散布子运算或全局聚集子运算，并且节点间子运算包括在第一计算节点和不同于第一计算节点的第二计算节点间执行的全局归约子运算。Clause 78: The first compute node of Clause 75, wherein the intra-node sub-operations comprise a reduction spread sub-operation or a global aggregation sub-operation performed within the first compute node, and the inter-node sub-operations comprise the first compute node and A global reduction sub-operation performed between second computing nodes that are different from the first computing node.

条款79：根据条款75所述的第一计算节点，其中节点内子运算包括在第一计算节点内执行的全局聚集子运算或复制子运算，并且节点间子运算包括在第一计算节点和不同于第一计算节点的第二计算节点间执行的全局聚集子运算。Clause 79: The first compute node of clause 75, wherein the intra-node sub-operations comprise global aggregate sub-operations or replica sub-operations executed within the first compute node, and the inter-node sub-operations comprise the first compute node and are different from A global aggregation sub-operation performed between the first computing node and the second computing node.

条款80：根据条款75所述的第一计算节点，其中使用第一线程对第一数据段的该部分执行节点内子运算，以及并行地使用第二线程对第二数据段的一部分执行节点间子运算，使得使用节点内连接来将第一数据段的一部分传输至第一计算节点包括的另一处理单元，以及并发地使用节点间连接将第二数据段的一部分传输至与第一计算节点不同的第二计算节点包括的另一处理单元。Clause 80: The first compute node of Clause 75, wherein a first thread is used to perform intra-node sub-operations on the portion of the first data segment, and a second thread is used in parallel to perform inter-node sub-operations on the portion of the second data segment. Operate such that a portion of the first data segment is transmitted using an intra-node connection to another processing unit included in the first computing node, and concurrently using an inter-node connection to transmit a portion of the second data segment to a computing node different from the first computing node. The second computing node includes another processing unit.

Claims

1. A method implemented by a first computing node, the method comprising:

Determining to transfer the data from the first network interface controller associated with the first process to the second network interface controller associated with the second process is based at least in part on whether the first network interface controller associated with the second process is located in the same computing node or linked to the same switch. A route identifier for a first process to route to the second process, wherein in response to determining the first network interface controller associated with the first process and the third network interface controller associated with the second process Two network interface controllers are located in the same computing node or linked to the same switch, and set the routing identifier as a default identifier; in response to determining that the first network associated with the first process The interface controller and the second network interface controller associated with the second process are located on different computing nodes or linked to different slice switches, and the routing identifier is set equal to that associated with the first process The identifier of the first network interface controller connected; the first process and the second process belong to a specific inter-node ring connecting multiple different nodes under the network topology; and

The data is routed from the first process to the second process based on the routing identifier.

2. The method of claim 1, wherein the first network interface controller associated with the first process is configured to send data to or from a second computing node in a ring topology. Two computing nodes receive data, the second computing node being different from the first computing node.

3. The method of claim 1, wherein the first network interface controller associated with the first process is also associated with one or more processes, wherein from the first process and the one Or the data sent by multiple processes is sent through the first network interface controller.

4. The method of claim 1, wherein the network topology includes a fat tree topology.

5. The method of claim 1, wherein routing the data from the first process to the second process based on the routing identifier includes at least one interface connected to the first network interface controller. A slice switch and an aggregation switch having an identifier corresponding to the identifier of the first network interface controller route the data from the first process to the second process, the first network An interface controller is associated with the first process.

6. One or more machine-readable media storing machine-readable instructions that, when executed by a first computing node, cause the first computing node to perform an action, the action including:

Determining to transfer the data from the first network interface controller associated with the first process to the second network interface controller associated with the second process is based at least in part on whether the first network interface controller associated with the second process is located in the same computing node or linked to the same switch. A route identifier for a first process to route to the second process, wherein in response to determining the first network interface controller associated with the first process and the third network interface controller associated with the second process two network interface controllers located in the same computing node or linked to the same switch, setting the routing identifier as a default identifier; in response to determining that the first network associated with the first process The interface controller and the second network interface controller associated with the second process are located on different computing nodes or linked to different slice switches, and the routing identifier is set equal to that associated with the first process The identifier of the first network interface controller connected; the first process and the second process belong to a specific inter-node ring connecting multiple different nodes under the network topology; and

The data is routed from the first process to the second process according to the routing identifier.

7. The one or more machine-readable media of claim 6, wherein the first network interface controller associated with the first process is configured to send data to a second computing node in a ring topology Or receiving data from a second computing node in a ring topology, the second computing node being different than the first computing node.

8. The one or more machine-readable media of claim 6, wherein the first network interface controller associated with the first process is further associated with one or more processes, wherein from the Data sent by the first process and the one or more processes are sent through the first network interface controller.

9. One or more machine-readable media according to claim 6, wherein the network topology includes a fat tree topology.

10. The one or more machine-readable media of claim 8, wherein routing the data from the first process to the second process based on the routing identifier includes at least communicating with the first process. A network interface controller connected slice switch and an aggregation switch having an identifier corresponding to the identifier of the first network interface controller route the data from the first process to the second process. A process, the first network interface controller is associated with the first process.

11. A first computing node, comprising:

one or more processing units; and

A memory that stores machine-executable instructions that, when executed by the one or more processing units, cause the one or more processing units to perform actions, the actions including:

12. The first computing node of claim 11, wherein the first network interface controller associated with the first process is configured to send data to or from a second computing node in a ring topology. A second computing node in , which is different from the first computing node, receives the data.

13. The first computing node of claim 11, wherein the first network interface controller associated with the first process is further associated with one or more processes, wherein from the first process and Data sent by the one or more processes is sent through the first network interface controller.

14. The first computing node of claim 11, wherein routing the data from the first process to the second process according to the routing identifier comprises: routing the data from the first process to the second process via at least a slice switch connected to the first network interface controller and an aggregation switch having an identifier that corresponds to the identifier of the first network interface controller, the first network interface controller being associated with the first process.

15. A first computing node, comprising:

Determining means for determining based at least in part on whether a first network interface controller associated with the first process and a second network interface controller associated with the second process are located in the same computing node or linked to the same switch. A routing identifier for routing data from the first process to the second process, the first process and the second process belonging to a specific inter-node ring connecting a plurality of different nodes under a network topology; and

a routing module for routing the data from the first process to the second process based on the routing identifier;

The determination template is specifically used in response to determining that the first network interface controller associated with the first process and the second network interface controller associated with the second process are located in the same In a computing node or linked to the same switch, the routing identifier is set as a default identifier; in response to determining that the first network interface controller associated with the first process and the second network interface controller are The second network interface controller associated with the process is located on a different computing node or linked to a different chip switch, and the routing identifier is set equal to the first network interface controller associated with the first process. The identifier of the device.