CN114968895A - Heterogeneous interconnection system and cluster - Google Patents
Heterogeneous interconnection system and cluster
- Publication number
- CN114968895A (application CN202210599488.9A)
- Authority
- CN
- China
- Prior art keywords
- server
- switch
- computing device
- connection
- cabinet
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/161—Computing infrastructure, e.g. computer clusters, blade chassis or hardware partitioning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F13/00—Interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F13/38—Information transfer, e.g. on bus
- G06F13/42—Bus transfer protocol, e.g. handshake; Synchronisation
- G06F13/4282—Bus transfer protocol, e.g. handshake; Synchronisation on a serial bus, e.g. I2C bus, SPI bus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F15/00—Digital computers in general; Data processing equipment in general
- G06F15/16—Combinations of two or more digital computers each having at least an arithmetic unit, a program unit and a register, e.g. for a simultaneous processing of several programs
- G06F15/163—Interprocessor communication
- G06F15/173—Interprocessor communication using an interconnection network, e.g. matrix, shuffle, pyramid, star, snowflake
- G06F15/17306—Intercommunication techniques
- G06F15/17331—Distributed shared memory [DSM], e.g. remote direct memory access [RDMA]
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L49/00—Packet switching elements
- H04L49/15—Interconnection of switching modules
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2213/00—Indexing scheme relating to interconnection of, or transfer of information or other signals between, memories, input/output devices or central processing units
- G06F2213/0026—PCI express
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Data Exchanges In Wide-Area Networks (AREA)
Abstract
The present application discloses a heterogeneous interconnection system and cluster in the field of computer technology. In this application, each cabinet includes two switches and at least one server; each CPU in each server is connected, via the CXL protocol, to the CXL interface of at least one computing device, and the CPUs within a server are interconnected via QPI/UPI. Within a cabinet, one switch connects the network cards of the servers in that cabinet to the public network, giving the cabinet public-network access; the other switch connects the RDMA interfaces of the computing devices in that cabinet, allowing those devices to fetch data directly via RDMA. The computing devices attached to the same server are also connected, through their own Cable interfaces, to the data exchange device corresponding to that server, so the computing devices within a server can interact through that device, which safeguards data transmission efficiency and improves the fault tolerance of the system. The heterogeneous interconnection cluster provided by the present application has the same technical effects.
Description
Technical Field
The present application relates to the field of computer technology, and in particular to a heterogeneous interconnection system and cluster.
Background Art
At present, hardware devices are often used to increase processing speed, for example using a GPU (Graphics Processing Unit) or an FPGA (Field-Programmable Gate Array) for machine learning. The hardware devices currently available are diverse, and there is no unified connection standard; when multiple hardware devices are managed in one system, the connections among these devices, and between any device and the host, become complicated, and such complicated connections are likely to affect the overall computing efficiency of the system.
Therefore, how to interconnect multiple hardware devices in one system is a problem that those skilled in the art need to solve.
Summary of the Invention
In view of this, the purpose of the present application is to provide a heterogeneous interconnection system and cluster, so as to interconnect multiple hardware devices in one system. The specific solution is as follows:
In a first aspect, the present application provides a heterogeneous interconnection system, including: at least one cabinet and a network access layer connected to a public network;
wherein each cabinet includes: a first switch, a second switch, and at least one server; each server includes at least one CPU; each CPU is connected via the CXL protocol to the CXL interface of at least one computing device, and each computing device is further provided with an RDMA interface and a Cable interface; different CPUs within the same server are interconnected via QPI/UPI;
the first switch connects the network access layer and the network cards of the servers in the current cabinet;
the second switch connects the RDMA interfaces of the computing devices in the current cabinet;
the computing devices connected to the same server are connected, through their own Cable interfaces, to the data exchange device corresponding to that server.
Optionally, the computing devices connected to the same server are used to process different computing tasks.
Optionally, the computing device is a GPU, an FPGA, a DPU or an AI chip.
Optionally, the CXL interface is implemented based on the PCIe standard.
Optionally, the network access layer includes at least one connection layer, each connection layer includes at least one connection device, and connection devices belonging to different connection layers are interconnected.
Optionally, the connection device is a switch or a router.
Optionally, the network access layer is a Spine-Leaf two-layer interconnection architecture.
Optionally, the network access layer of the Spine-Leaf two-layer interconnection architecture includes: a first connection layer connected to the public network, and a second connection layer connecting the first connection layer and the first switch in each cabinet;
wherein there are fewer connection devices in the first connection layer than in the second connection layer, and connection devices belonging to different connection layers are interconnected.
Optionally, each server has at least one network card.
In a second aspect, the present application provides a heterogeneous interconnection cluster, including a plurality of heterogeneous interconnection systems as described in any of the above.
It can be seen from the above solutions that the present application provides a heterogeneous interconnection system, including: at least one cabinet and a network access layer connected to a public network. Each cabinet includes a first switch, a second switch and at least one server; each server includes at least one CPU; each CPU is connected via the CXL protocol to the CXL interface of at least one computing device, and each computing device is further provided with an RDMA interface and a Cable interface; different CPUs within the same server are interconnected via QPI/UPI. The first switch connects the network access layer and the network cards of the servers in the current cabinet; the second switch connects the RDMA interfaces of the computing devices in the current cabinet; the computing devices connected to the same server are connected, through their own Cable interfaces, to the data exchange device corresponding to that server.
It can be seen that, in the present application, different CPUs within a server are interconnected via QPI/UPI, which offers high transmission efficiency. Each CPU is connected via the CXL protocol to the CXL interface of at least one computing device, and each computing device is further provided with an RDMA interface and a Cable interface. Because the second switch connects the RDMA interfaces of the computing devices in the current cabinet, the computing devices within a cabinet can fetch data directly via RDMA, giving higher data transmission efficiency. Because the computing devices connected to the same server are connected, through their own Cable interfaces, to the data exchange device corresponding to that server, the computing devices within a server can also interact through that data exchange device, which safeguards data transmission efficiency and improves the fault tolerance of the system. In each cabinet, the first switch connects the network cards of the servers in the current cabinet and also connects to the network access layer facing the public network, so the servers and computing devices in the current cabinet can access the public network through the network access layer. The present application can thus accommodate more types of computing devices in one system, and the number of cabinets and servers can be expanded easily, so the solution is highly scalable.
Correspondingly, the heterogeneous interconnection cluster provided by the present application has the same technical effects.
Brief Description of the Drawings
In order to explain the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from the provided drawings without creative effort.
Fig. 1 is a schematic diagram of a heterogeneous interconnection system disclosed in the present application;
Fig. 2 is a schematic diagram of the topology inside a server node disclosed in the present application;
Fig. 3 is a schematic diagram of the interconnection topology inside a cabinet disclosed in the present application;
Fig. 4 is a schematic diagram of the interconnection topology between cabinets disclosed in the present application.
Detailed Description of Embodiments
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, rather than all, of the embodiments of the present application. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the present application.
At present, the available hardware devices are diverse, and there is no unified connection standard; when multiple hardware devices are managed in one system, the connections among these devices, and between any device and the host, become complicated, and such complicated connections are likely to affect the overall computing efficiency of the system. To this end, the present application provides a heterogeneous interconnection system and cluster that can interconnect multiple hardware devices in one system while making it easy to expand the number of cabinets and servers.
Referring to Fig. 1, an embodiment of the present application discloses a heterogeneous interconnection system, including: at least one cabinet and a network access layer connected to a public network. Each cabinet includes a first switch, a second switch and at least one server; each server includes at least one CPU; each CPU is connected via the CXL protocol to the CXL interface of at least one computing device, and each computing device is further provided with an RDMA interface and a Cable interface; different CPUs within the same server are interconnected via QPI/UPI. The first switch connects the network access layer and the network cards of the servers in the current cabinet; the second switch connects the RDMA interfaces of the computing devices in the current cabinet; the computing devices connected to the same server are connected, through their own Cable interfaces, to the data exchange device corresponding to that server. The cabinets together form the data center shown in Fig. 1.
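For readers tracing the structure, it can be pictured as nested records: a cabinet owns two switches and its servers, a server owns CPUs and a network card, a CPU mounts computing devices, and each computing device exposes the three interfaces named above. The following Python sketch is purely illustrative; the class and field names are assumptions for this sketch, not terminology from the application.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class XPU:                      # heterogeneous computing device (GPU/FPGA/DPU/AI chip)
    cxl_port: str = "CXL"       # connects the device to a server CPU
    rdma_port: str = "RDMA"     # connects the device to the cabinet's second (iRDMA) switch
    cable_port: str = "Cable"   # connects the device to the server's data exchange device

@dataclass
class CPU:
    xpus: List[XPU] = field(default_factory=list)   # devices mounted over CXL

@dataclass
class Server:
    cpus: List[CPU]             # CPUs interconnected by QPI/UPI
    nic_count: int = 1          # at least one NIC, wired to the first (TOR) switch

@dataclass
class Cabinet:
    servers: List[Server]
    first_switch: str = "TOR"     # NICs plus the network access layer
    second_switch: str = "iRDMA"  # RDMA ports of every computing device in the cabinet

# a cabinet with one two-CPU server and four devices per CPU
cabinet = Cabinet(servers=[Server(cpus=[CPU([XPU() for _ in range(4)]) for _ in range(2)])])
print(sum(len(c.xpus) for s in cabinet.servers for c in s.cpus))  # -> 8 devices in the node
```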
It can be seen that the computing devices within a cabinet can fetch data directly via RDMA, and the computing devices within a server can interact at high speed through their Cable interfaces. The computing devices connected to the same server are used to process different computing tasks; for example, some computing devices perform image recognition tasks while others perform text classification tasks.
As shown in Fig. 1, the data center includes N cabinets, and each cabinet includes N server nodes and two switches. Each server node includes one server, N computing devices and one data exchange device, where N is a natural number.
In a specific implementation, different CPUs within the same server are interconnected via QPI/UPI. QPI (QuickPath Interconnect) is a fast-path interconnect that enables direct interconnection between chips; UPI (Ultra Path Interconnect) is likewise a fast-path interconnect, with a higher communication rate and lower power consumption.
In a specific implementation, the computing device may be a GPU, an FPGA, a DPU (Data Processing Unit) or an AI (Artificial Intelligence) chip.
In a specific implementation, each CPU of each server is connected via the CXL (Compute Express Link) protocol to the CXL interface of at least one computing device, and each server has at least one network card. That is, each CPU of a server is connected to the CXL interface of a computing device through the CXL protocol. Since the two ends of a CXL connection can share I/O, cache and memory, a CPU and a computing device connected via CXL can share I/O, cache and memory. The computing device is thus provided with a CXL interface, which may be implemented based on the PCIe standard. It follows that a computing device has a CXL interface, an RDMA interface and a Cable interface at the same time.
In a specific implementation, the network access layer includes at least one connection layer, each connection layer includes at least one connection device, and connection devices belonging to different connection layers are interconnected. The connection device is a switch or a router.
In a specific implementation, the network access layer may be a Spine-Leaf two-layer interconnection architecture. In this case the network access layer includes: a first connection layer connected to the public network (the Spine layer in Fig. 4), and a second connection layer (the Leaf layer in Fig. 4) connecting the first connection layer and the first switch in each cabinet; there are fewer connection devices in the first connection layer than in the second connection layer, and connection devices belonging to different connection layers are interconnected.
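Because every device in the second (Leaf) layer connects to every device in the first (Spine) layer, the layer sizes directly determine the inter-layer link count and the aggregate uplink bandwidth of each Leaf device. The short calculation below only illustrates that scaling behavior under assumed numbers; the device counts and link speed are invented for the example.

```python
def spine_leaf_links(spines: int, leaves: int) -> int:
    """Full mesh between layers: every leaf device has one uplink to every spine device."""
    return spines * leaves

def leaf_uplink_bandwidth(spines: int, link_gbps: float) -> float:
    """Aggregate uplink bandwidth of one leaf device grows linearly with the spine count."""
    return spines * link_gbps

print(spine_leaf_links(4, 16))          # 64 inter-layer links
print(leaf_uplink_bandwidth(4, 100.0))  # 400.0 Gbps of uplink per leaf device
print(leaf_uplink_bandwidth(5, 100.0))  # 500.0 Gbps after adding one spine device
```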
In this embodiment, the data center includes at least one cabinet, and each cabinet includes a first switch, a second switch and at least one server. Different CPUs within each server are interconnected via QPI/UPI, which offers high transmission efficiency. Each CPU is connected via the CXL protocol to the CXL interface of at least one computing device, and each computing device is further provided with an RDMA interface and a Cable interface. The second switch connects the RDMA interfaces of the computing devices in the current cabinet, so the computing devices within a cabinet can fetch data directly via RDMA, giving higher data transmission efficiency. The computing devices connected to the same server are connected, through their own Cable interfaces, to the data exchange device corresponding to that server, so the computing devices within a server can also interact through that data exchange device, which safeguards data transmission efficiency and improves the fault tolerance of the system. In each cabinet, the first switch connects the network cards of the servers in the current cabinet and also connects to the network access layer facing the public network, so the servers and computing devices in the current cabinet can access the public network through the network access layer.
It can be seen that this embodiment can accommodate more types of computing devices in one system, and the number of cabinets and servers can be expanded easily, so the solution is highly scalable.
Based on the core idea of the present application, the following embodiment provides an interconnection system oriented to artificial intelligence computing. It has a multi-level coupled hybrid topology covering the interconnection of server CPUs, of heterogeneous computing devices, of server nodes, and of cabinets.
The topology inside a server node is shown in Fig. 2. In Fig. 2, the various heterogeneous computing acceleration devices currently in use (GPU, FPGA, DPU, AI chip, etc.) are collectively referred to as XPUs, for example XPU0 to XPU7 in the figure. An XPU contains three kinds of interfaces: a standard CXL interface, a high-bandwidth Cable interface, and a custom RDMA interface. The XPU implements the CXL standard interface on top of CXL-capable PCIe (Peripheral Component Interconnect Express, a high-speed serial computer expansion bus standard) and uses this interface to interconnect with the server CPU, which enables memory coherence between the XPU memory and the server host memory. In general, within a single server node, each server CPU can mount four XPUs according to the definition of the CXL standard. Within a server node, the different CPUs of a server are interconnected at high speed via the QPI/UPI interface.
In Fig. 2, CPU0 connects XPU0 to XPU3, and CPU1 connects XPU4 to XPU7. At the same time, XPU0 to XPU7 are all connected to the XPU switching device, and XPU0 to XPU7 are also all connected to the iRDMA Switch.
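Given the limit of four XPUs per server CPU stated above, the CPU-to-XPU assignment in Fig. 2 amounts to filling each CPU's CXL mount points in order. The helper below is a hypothetical sketch of that assignment rule, not code from the application.

```python
from typing import Dict, List

MAX_XPUS_PER_CPU = 4  # per the CXL-based limit described for a single server node

def assign_xpus(cpu_count: int, xpu_count: int) -> Dict[int, List[int]]:
    """Fill each CPU's mount points in order, refusing impossible configurations."""
    if xpu_count > cpu_count * MAX_XPUS_PER_CPU:
        raise ValueError("not enough CXL mount points for the requested XPUs")
    mapping: Dict[int, List[int]] = {c: [] for c in range(cpu_count)}
    for xpu in range(xpu_count):
        mapping[xpu // MAX_XPUS_PER_CPU].append(xpu)
    return mapping

print(assign_xpus(2, 8))  # {0: [0, 1, 2, 3], 1: [4, 5, 6, 7]} -- matches Fig. 2
```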
It should be noted that, although CXL can provide a standardized and efficient interconnect for XPUs, the connection performance is limited by the physical PCIe bandwidth that carries the CXL standard and cannot meet the high-bandwidth data transfer requirements between XPU devices in artificial intelligence computing scenarios. Therefore, a high-bandwidth Cable interconnection interface standard is also designed inside the server node, and an XPU switching device is used so that the different XPU devices within a server node can exchange data at high bandwidth.
Meanwhile, the iRDMA Switch inside the cabinet (i.e., the second switch) aggregates the RDMA interfaces of the XPU devices in each node, enabling efficient data sharing among the XPU devices of different nodes within the cabinet and allowing the servers within a cabinet to exchange RDMA data. To improve the expansion efficiency between a server node and the other devices in the network, each server is also equipped with a general-purpose NIC (network interface card) for data interaction between the server and the other devices in the network; these NICs are connected to the Top of Rack Switch (i.e., the first switch) inside the cabinet. Hereinafter, the Top of Rack Switch is abbreviated as TOR Switch.
Specifically, the interconnection topology inside a cabinet is shown in Fig. 3. Two layers of switch devices are configured inside the cabinet: the iRDMA Switch (the second switch in Fig. 3) and the TOR Switch (the first switch in Fig. 3). The iRDMA Switch is mainly responsible for RDMA data exchange between different XPU devices in the cabinet and can provide low-latency, high-bandwidth data exchange for the XPUs of the nodes in the cabinet. The TOR Switch is mainly responsible for data exchange between the NICs of different nodes in the cabinet and for exchanging data between the inside of the cabinet and the network.
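Combining the in-node Cable fabric with the two in-cabinet switch layers yields three communication tiers. The function below is a simplified, assumed illustration of how a transfer between two XPUs might select a tier; the tier descriptions are labels chosen for this sketch rather than terms from the application.

```python
def pick_path(src, dst):
    """src/dst are (cabinet_id, server_id, xpu_id) tuples."""
    if src[0] != dst[0]:
        return "NIC -> TOR Switch -> network access layer (inter-cabinet)"
    if src[1] != dst[1]:
        return "RDMA interface -> iRDMA Switch (intra-cabinet, inter-node)"
    return "Cable interface -> XPU switching device (intra-node)"

print(pick_path((0, 0, 1), (0, 0, 5)))  # same server: Cable fabric
print(pick_path((0, 0, 1), (0, 3, 2)))  # same cabinet: iRDMA Switch
print(pick_path((0, 0, 1), (2, 0, 0)))  # different cabinets: TOR + access layer
```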
Further, the interconnection topology between cabinets is shown in Fig. 4. Each cabinet is connected to the network access layer of a Spine-Leaf two-layer interconnection architecture to ensure the scalability of a very large data center. The TOR Switch inside each cabinet is interconnected with the Leaf switches of the Spine-Leaf architecture (the leaf switches in Fig. 4) to meet the large-scale expansion requirements between cabinets. Every Leaf switch is connected to every Spine switch (the spine switches in Fig. 4), forming a full-mesh topology. Specifically, the Leaf switches are access switches that form the Leaf layer and connect the cabinets; the Spine layer is the backbone of the network and is responsible for connecting all the Leaf switches. Because every Leaf switch is connected to every Spine switch, if one Spine switch fails, the throughput of the data center degrades only slightly. If a link becomes saturated, capacity expansion is also straightforward: adding one Spine switch extends the uplinks of every Leaf switch and increases the bandwidth between the Leaf and Spine layers. If the number of ports at the access layer becomes the bottleneck, a new Leaf switch is simply added, connected to every Spine switch and configured accordingly. This ease of expansion optimizes the network scaling process. When neither the access ports nor the uplinks of the Leaf layer are a bottleneck, the architecture is non-blocking. In a Spine-Leaf architecture, the connection from any server to any other server passes through the same number of devices (unless the two servers are under the same Leaf switch), which makes the latency predictable, because a packet only needs to pass through one Spine switch and one more Leaf switch to reach the destination. Data exchange between the data center network and the core network is also completed through the Spine-Leaf fabric.
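The latency argument above can be reduced to a hop count at the Leaf/Spine level: servers under the same Leaf switch cross one fabric switch, and every other pair crosses Leaf, Spine, Leaf. A minimal sketch, assuming each server (through its TOR Switch) maps to exactly one Leaf switch:

```python
def fabric_switch_hops(src_leaf: int, dst_leaf: int) -> int:
    """Number of Leaf/Spine switches crossed between two servers."""
    if src_leaf == dst_leaf:
        return 1  # same Leaf switch
    return 3      # source Leaf -> any Spine -> destination Leaf

assert fabric_switch_hops(0, 0) == 1
assert fabric_switch_hops(0, 7) == 3  # identical for every pair under different Leaf switches
```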
It can be seen that, inside the server, this embodiment uses QPI/UPI interconnection between CPUs, the CXL interconnection standard between CPUs and XPU devices, and high-bandwidth Cable interconnection between XPUs, with an XPU switching device for data exchange. Inside a rack, the XPUs communicate with one another over a customized RDMA protocol and exchange data through the iRDMA Switch, while different server nodes are interconnected through their NICs and the TOR Switch, which is responsible for data interaction with the data center network. Further, different cabinets are interconnected using the Spine-Leaf architecture: the TOR Switch of each cabinet is interconnected with all the Leaf switches, forming a full-mesh interconnection network, and the Leaf switches are connected to the core network through the Spine switch layer, enabling data interaction between different data centers.
Starting from the hybrid heterogeneous interconnection topology of a hyper-heterogeneous system, this embodiment proposes a multi-level hybrid heterogeneous interconnection topology that enables efficient cooperation among different heterogeneous acceleration devices (XPUs) within a node, between nodes and between cabinets, improves the overall energy efficiency of the system, supports the connection of different heterogeneous computing acceleration devices, and meets the performance requirements for interconnection bandwidth and data exchange latency in different artificial intelligence computing scenarios.
The following introduces a heterogeneous interconnection cluster provided by an embodiment of the present application. The heterogeneous interconnection cluster described below and the heterogeneous interconnection system described above may be cross-referenced.
An embodiment of the present application discloses a heterogeneous interconnection cluster, including a plurality of the heterogeneous interconnection systems described in any of the above embodiments.
Specifically, a heterogeneous interconnection system includes: at least one cabinet and a network access layer connected to a public network. Each cabinet includes a first switch, a second switch and at least one server; each server includes at least one CPU; each CPU is connected via the CXL protocol to the CXL interface of at least one computing device, and each computing device is further provided with an RDMA interface and a Cable interface; different CPUs within the same server are interconnected via QPI/UPI. The first switch connects the network access layer and the network cards of the servers in the current cabinet; the second switch connects the RDMA interfaces of the computing devices in the current cabinet; the computing devices connected to the same server are connected, through their own Cable interfaces, to the data exchange device corresponding to that server. The cabinets together form the data center shown in Fig. 1.
In a specific implementation, the computing devices connected to the same server are used to process different computing tasks.
In a specific implementation, the computing device is a GPU, an FPGA, a DPU or an AI chip.
In a specific implementation, the CXL interface is implemented based on the PCIe standard.
In a specific implementation, the network access layer includes at least one connection layer, each connection layer includes at least one connection device, and connection devices belonging to different connection layers are interconnected.
In a specific implementation, the connection device is a switch or a router.
In a specific implementation, the network access layer is a Spine-Leaf two-layer interconnection architecture.
In a specific implementation, the network access layer of the Spine-Leaf two-layer interconnection architecture includes: a first connection layer connected to the public network, and a second connection layer connecting the first connection layer and the first switch in each cabinet; there are fewer connection devices in the first connection layer than in the second connection layer, and connection devices belonging to different connection layers are interconnected.
It can be seen that, according to this embodiment, different data centers can form a heterogeneous interconnection cluster and exchange data with one another.
The terms "first", "second", "third", "fourth", etc. (if any) in the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. In addition, the terms "including" and "having", and any variations thereof, are intended to cover non-exclusive inclusion; for example, a process, method or device that includes a series of steps or units is not necessarily limited to those steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method or device.
It should be noted that descriptions involving "first", "second", etc. in the present application are for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the indicated technical features; thus, a feature qualified by "first" or "second" may explicitly or implicitly include at least one such feature. In addition, the technical solutions of the various embodiments may be combined with one another, provided that the combination can be realized by a person of ordinary skill in the art; when a combination of technical solutions is contradictory or cannot be realized, such a combination should be regarded as non-existent and as falling outside the protection scope claimed in the present application.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be referred to one another.
The steps of the methods or algorithms described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may reside in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of readable storage medium known in the art.
Specific examples are used herein to explain the principles and implementations of the present application; the description of the above embodiments is only intended to help understand the method of the present application and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific implementation and the scope of application in accordance with the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210599488.9A CN114968895A (en) | 2022-05-30 | 2022-05-30 | Heterogeneous interconnection system and cluster |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210599488.9A CN114968895A (en) | 2022-05-30 | 2022-05-30 | Heterogeneous interconnection system and cluster |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114968895A true CN114968895A (en) | 2022-08-30 |
Family
ID=82957973
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210599488.9A Pending CN114968895A (en) | 2022-05-30 | 2022-05-30 | Heterogeneous interconnection system and cluster |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114968895A (en) |
-
2022
- 2022-05-30 CN CN202210599488.9A patent/CN114968895A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102546813A (en) * | 2012-03-15 | 2012-07-04 | 北京神州数码思特奇信息技术股份有限公司 | High-performance cluster computing system based on x86PC framework |
CN103916909A (en) * | 2013-01-04 | 2014-07-09 | 中国移动通信集团公司 | Base band pool system |
US20170063631A1 (en) * | 2015-08-28 | 2017-03-02 | Tigera, Inc. | Data center networks |
CN105550238A (en) * | 2015-11-27 | 2016-05-04 | 浪潮(北京)电子信息产业有限公司 | Architecture system of database appliance |
CN106250349A (en) * | 2016-08-08 | 2016-12-21 | 浪潮(北京)电子信息产业有限公司 | A kind of high energy efficiency heterogeneous computing system |
CN114066707A (en) * | 2020-08-07 | 2022-02-18 | 阿里巴巴集团控股有限公司 | General-purpose graphics processing systems, computing devices, and distributed systems |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115454633A (en) * | 2022-09-13 | 2022-12-09 | 苏州浪潮智能科技有限公司 | Intelligent server system with multiple heterogeneous nodes |
CN115454633B (en) * | 2022-09-13 | 2025-07-08 | 苏州浪潮智能科技有限公司 | Intelligent server system with multiple heterogeneous nodes |
EP4343560A1 (en) * | 2022-09-21 | 2024-03-27 | Samsung Electronics Co., Ltd. | Interface for remote memory |
CN118939597A (en) * | 2022-12-01 | 2024-11-12 | 华为技术有限公司 | A data processing system, method and connection device |
WO2024179298A1 (en) * | 2023-02-27 | 2024-09-06 | 浪潮电子信息产业股份有限公司 | Cross-cabinet server memory pooling method, apparatus and device, server, and medium |
WO2025066572A1 (en) * | 2023-09-25 | 2025-04-03 | 苏州元脑智能科技有限公司 | Board and server |
WO2025087005A1 (en) * | 2023-10-27 | 2025-05-01 | 浪潮(北京)电子信息产业有限公司 | Interconnect system, device and network |
RU227818U1 (en) * | 2024-01-12 | 2024-08-07 | Акционерное общество Научно-производственный центр "Электронные вычислительно-информационные системы" (АО НПЦ "ЭЛВИС") | MULTICLUSTER COMPUTING UNIT BASED ON HETOROGENEOUS SNC |
RU227818U9 (en) * | 2024-01-12 | 2024-10-08 | Акционерное общество Научно-производственный центр "Электронные вычислительно-информационные системы" (АО НПЦ "ЭЛВИС") | MULTICLUSTER COMPUTING UNIT BASED ON HETEROGENEOUS SNC |
CN118606254A (en) * | 2024-02-05 | 2024-09-06 | 腾讯科技(深圳)有限公司 | Data transmission method, device, equipment and storage medium |
CN118606254B (en) * | 2024-02-05 | 2025-06-06 | 腾讯科技(深圳)有限公司 | Data transmission method, device, equipment and storage medium |
CN119917442A (en) * | 2025-03-31 | 2025-05-02 | 苏州元脑智能科技有限公司 | Server cabinet and server system |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114968895A (en) | Heterogeneous interconnection system and cluster | |
US12326813B2 (en) | Heterogeneous architecture, delivered by cxl based cached switch SOC and extensible via cxloverethernet (COE) protocols | |
US20240264964A1 (en) | Multi-plane, multi-protocol memory switch fabric with configurable transport | |
US9323708B2 (en) | Protocol translation method and bridge device for switched telecommunication and computing platforms | |
CN116185641B (en) | Fusion architecture system, nonvolatile storage system and storage resource acquisition method | |
KR100372492B1 (en) | Server cluster interconnection using network processor | |
US20250097165A1 (en) | Computing system and communication method | |
WO2018063577A1 (en) | Technologies for scalable hierarchical interconnect topologies | |
CN115994107B (en) | Access Acceleration System for Storage Devices | |
CN107851078B (en) | A method and system for aggregation-friendly address allocation for PCIe devices | |
US10318473B2 (en) | Inter-device data-transport via memory channels | |
US12212498B2 (en) | Message split-aggregation for multi-stage electrical interconnection network | |
WO2024221928A1 (en) | Packet transmission method and device | |
CN114066707B (en) | General purpose graphics processing system, computing device and distributed system | |
CN117851283A (en) | A distributed memory orthogonal architecture based on CXL | |
Agarwal | CXL overview and evolution | |
CN116795742A (en) | Storage device, information storage method and system | |
CN107239432A (en) | A kind of server with novel topological structure | |
Chou et al. | Sharma et al. | |
CN222319458U (en) | Interconnection system and computer server cluster | |
CN213399570U (en) | High-density cluster server system based on ARM SOC | |
CN102932213B (en) | A kind of coexist and the server system design method of on-demand interchange based on Infiniband and ten thousand mbit ethernets | |
CN117793062A (en) | Highly reliable chiplet interconnection network architecture for scalable systems | |
CN116346521A (en) | Network system and data transmission method | |
KR20230120559A (en) | Electronic device for performing message split-aggregation in multi-stage electrical interconnection network and method for operating method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |