CN117707693A

CN117707693A - Heterogeneous intelligent computing platform virtualization management system and method

Info

Publication number: CN117707693A
Application number: CN202311690463.0A
Authority: CN
Inventors: 王志; 李超; 薛宏亮; 柴威荣
Original assignee: Zhejiang Lab
Current assignee: Zhejiang Lab
Priority date: 2023-12-11
Filing date: 2023-12-11
Publication date: 2024-03-15

Abstract

The invention provides a virtualization management system and method for a heterogeneous intelligent computing platform. The system comprises: a chip virtualization module for virtualizing the CPU, the GPU and the FPGA; a computing virtualization module for providing a dynamic computing resource pool based on the super-heterogeneous platform; a network virtualization module for creating a virtual network on top of the physical network and providing communication between virtual machines and with external networks; and a dynamic scheduling management module for determining the computing, memory, storage and network resources required by the application scenario and its requirements. Oriented toward intelligent image processing and computing requirements, the invention constructs a heterogeneous intelligent computing resource pool that combines virtual and physical resources, and builds a virtualization system for generating, debugging and optimizing intelligent computing systems. In addition, the resource allocation of each virtual machine is automatically expanded or contracted by periodically monitoring changes in resource utilization in the virtualized environment. The method maximizes the utilization of computing resources, reduces deployment cost and improves service performance.

Description

A heterogeneous intelligent computing platform virtualization management system and method

Technical field

The present invention relates to the field of computer technology, and in particular to a virtualization management system and method for a heterogeneous intelligent computing platform.

Background

As the complexity of computing platforms and the demands on their dynamic capabilities continue to grow, the computing, network and storage resources of end-side platforms must be managed efficiently. Traditional virtualization technology may encounter the following problems in practice. First, heterogeneous platform virtualization must support a variety of operating systems and hardware architectures, each with its own technical requirements and challenges; the virtualization technology therefore has to be highly adaptable and flexible in order to cope with different workloads and environments. Second, virtualization introduces additional overhead, including running the virtual machine manager, resource allocation and task scheduling; this overhead may affect system performance and efficiency, especially when resources are limited. Third, different virtual machines may require different configuration and management strategies, which demands a high level of technical knowledge and experience from administrators; deploying, managing and maintaining virtual machines also takes considerable time and resources. At present, traditional virtualization technology runs inefficiently on domestically developed heterogeneous platforms and cannot meet the needs of real-time applications.

Summary of the invention

In view of the shortcomings of the prior art, the purpose of the present invention is to provide a virtualization management system and method for a heterogeneous intelligent computing platform that solves the problem of resource allocation in heterogeneous platform virtualization systems.

The purpose of the present invention is achieved through the following technical solution: a heterogeneous intelligent computing platform virtualization management system, comprising:

a chip virtualization module, used for the virtualization of the CPU, GPU and FPGA;

a computing virtualization module, used to monitor and manage the status information and availability information of computing resources, and to provide this information to the dynamic scheduling management module;

a network virtualization module, used to apply virtual software-defined networking technology so that different users share the network resources of the same physical network, and to partition the physical network resources on demand into logically independent virtual flow-distribution networks (vSDNs) for users;

a dynamic scheduling management module, used to determine the load state of each host and then select the virtual machines that meet the migration conditions based on a minimum-migration-count policy, implementing either static placement, in which the mapping between a virtual machine and its host does not change, or dynamic placement, in which that mapping may change; after the virtual machines have been selected, the host whose energy consumption increases the least is used as the target host for virtual machine migration.

Further, CPU virtualization allows a single physical CPU to be virtualized into multiple vCPUs (virtual CPUs). The guest operating system of each virtual machine uses one or more vCPUs in parallel, and the vCPUs run independently of one another. Time-division multiplexing of the CPU is also supported, that is, tasks share the CPU under a real-time scheduling policy.

Further, GPU virtualization uses SR-IOV-based hardware-assisted virtualization to virtualize the PCIe device. Enabling the SR-IOV function on the GPU partitions it into multiple virtual GPUs, each with its own identifier and resources. Virtual machines or containers access the virtual GPUs directly, which achieves both isolation and sharing of GPU resources.

Further, FPGA virtualization divides a single FPGA into multiple fine-grained, partially reconfigurable vFPGAs (virtual FPGAs). Each vFPGA is managed by its own controller and is connected to the outside through the AXI bus, which is used on the one hand to access memory and on the other hand to exchange data with the dynamic scheduling management module.

Further, the computing virtualization module periodically updates the load, performance indicators and availability information of the resources and provides this information to the dynamic scheduling management module, which schedules tasks and allocates resources through a task scheduler and a decision engine according to the requirements of the task or application.

Further, the virtual network controller in the network virtualization module accepts requests from multiple users simultaneously and partitions the physical network resources on demand into logically independent vSDNs for users. Establishing a vSDN involves two main parts: the network hypervisor (NVH) and virtual network embedding (VNE). The NVH sits between the users' SDN controllers and the physical network and is implemented with FlowVisor, a network virtualization platform for the OpenFlow protocol. VNE allocates network resources to each virtual network and consists of node mapping and link mapping: node mapping ensures that no physical node exceeds its capacity limit, and link mapping maps each virtual link to one physical path. Each time a user's state changes, vSDN reconfiguration maps the virtual nodes and links to new physical nodes and physical links.

Further, host status monitoring is used to determine whether a host is overloaded or underloaded, using an adaptive dynamic threshold method. Machine learning is used to learn dynamic, adaptive resource-utilization thresholds, and during learning the results are reinforced through interaction with the dynamic environment and trial and error so as to adapt to a changing environment. Two thresholds, one for overload and one for underload, are used, and virtual machine scheduling is triggered when a host is overloaded or underloaded.

Further, the specific process of selecting the virtual machines that meet the migration conditions based on the minimum-migration-count policy is as follows: first, the virtual machines are sorted in descending order of resource demand; then the virtual machines that satisfy the condition are identified, the condition being that after the migration the resource demand of the virtual machines remaining on the host is less than the host's maximum capacity; among these, the virtual machine with the smallest resource demand is selected.

In another aspect, the present invention provides a heterogeneous intelligent computing platform virtualization management method, comprising:

obtaining the corresponding CPU, GPU and FPGA computing resources based on the user requirements received at the human-computer interaction terminal;

monitoring and managing the status information and availability information of the computing resources, and generating virtual machine resource allocation instructions according to the required computing resources;

sending the resource allocation instructions to each chip module to virtualize the computing resources, and then mounting the virtualized resources to the target virtual machine;

applying virtual software-defined networking technology so that different users share the network resources of the same physical network, and partitioning the physical network resources on demand into logically independent virtual flow-distribution networks (vSDNs) for users;

selecting the virtual machines that meet the migration conditions based on the minimum-migration-count policy, implementing either static placement, in which the mapping between a virtual machine and its host does not change, or dynamic placement, in which that mapping may change; after the virtual machines have been selected, using the host whose energy consumption increases the least as the target host for virtual machine migration.
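The target-host rule in the last step can be illustrated with a minimal sketch. It assumes a linear power model (idle power plus a term proportional to utilization), which the text does not specify; the `Host` fields, the wattage figures and the function name `pick_target_host` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Host:
    name: str
    capacity: float    # total compute capacity (arbitrary units)
    used: float        # capacity currently consumed by resident VMs
    idle_watts: float  # assumed power draw at 0% utilization
    peak_watts: float  # assumed power draw at 100% utilization

    def power(self, used: float) -> float:
        # Assumed linear model: idle + (peak - idle) * utilization.
        span = self.peak_watts - self.idle_watts
        return self.idle_watts + span * used / self.capacity


def pick_target_host(hosts: list[Host], vm_demand: float) -> Optional[Host]:
    """Return the host whose power draw rises least after accepting the VM."""
    best, best_delta = None, float("inf")
    for host in hosts:
        new_used = host.used + vm_demand
        if new_used > host.capacity:
            continue  # the VM does not fit on this host
        delta = host.power(new_used) - host.power(host.used)
        if delta < best_delta:
            best, best_delta = host, delta
    return best


hosts = [
    Host("node-a", capacity=100, used=40, idle_watts=100, peak_watts=300),
    Host("node-b", capacity=100, used=70, idle_watts=80, peak_watts=200),
    Host("node-c", capacity=100, used=95, idle_watts=60, peak_watts=150),
]
target = pick_target_host(hosts, vm_demand=20)
print(target.name if target else "no host can accept the VM")
```

In this example node-c is skipped because the virtual machine would not fit, and node-b is chosen over node-a because its power draw grows more slowly per unit of load.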

Further, the load, performance indicators and availability information of the computing resources are collected and analyzed. This information is used to evaluate resource utilization and performance bottlenecks, and is used by the decision engine for scheduling decisions and optimization. Continuous monitoring, together with memory, network and storage optimization, improves resource utilization and reduces task execution time and cost.

Beneficial effects of the present invention:

(1) It breaks through the bottleneck in managing and scheduling the computing power of heterogeneous hardware clusters, achieves integrated scheduling of heterogeneous computing power, and improves the utilization efficiency of hardware resources.

(2) For deep learning model training, heterogeneous computing power virtualization helps to implement distributed training and greatly improves training efficiency.

(3) For multi-task concurrency scenarios, heterogeneous computing power virtualization allows multiple tasks to run in parallel without interfering with one another, ensuring that tasks run safely and reliably.

Description of the drawings

In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Figure 1 is a schematic structural diagram of a heterogeneous intelligent computing platform virtualization management system provided by the present invention;

Figure 2 is a schematic diagram of CPU virtualization provided by the present invention;

Figure 3 is a schematic diagram of GPU virtualization provided by the present invention;

Figure 4 is a flow chart of the status monitoring system provided by the present invention;

Figure 5 is a flow chart of the heterogeneous intelligent computing platform virtualization management method provided by the present invention;

Figure 6 is a schematic diagram of one possible hardware arrangement provided by the present invention.

Detailed description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

It should be noted that, provided there is no conflict, the features of the following embodiments and implementations may be combined with one another.

Figure 1 is a schematic structural diagram of a heterogeneous intelligent computing platform virtualization management system provided by the present invention. As shown in Figure 1, the system includes a chip virtualization module, a computing virtualization module, a storage virtualization module, a network virtualization module, a dynamic scheduling management module and an underlying KVM module.

The chip virtualization module mainly includes a CPU virtualization sub-module, a GPU virtualization sub-module and an FPGA virtualization sub-module, and is used for the virtualization of the CPU, NPU, GPU and FPGA. The computing virtualization module provides a dynamic computing resource pool on the domestically developed heterogeneous platform according to user needs. The storage virtualization module builds a virtual storage management layer that manages and schedules virtual storage resources in a unified way, and builds a scalable storage framework that can be expanded seamlessly as the business grows. The network virtualization module creates a virtual network on top of the physical network and provides communication among virtual machines and between virtual machines and external networks. The dynamic scheduling management module determines the required computing, memory, storage and network resources according to the application scenario and its requirements.

It should be noted that, before the system of the present invention is used, an underlying KVM system can be deployed on a physical server so that virtualization services are provided through that KVM system.

It can be understood that, in the CPU virtualization described above, the CPU may be of the ARM architecture or the x86 architecture; this embodiment places no restriction on this.

It should be understood that the CPU virtualization sub-module allows a single physical CPU to be virtualized into multiple vCPUs (virtual CPUs); see Figure 2. The guest operating system of each virtual machine can use one or more vCPUs. The vCPUs run independently without interfering with one another, achieving multi-core parallelism. Time-division multiplexing of the CPU is also supported, that is, tasks share the CPU under a real-time scheduling policy.

The CPU virtualization includes a real-time scheduling unit, which mainly uses priority ordering. Tasks are first assigned priorities according to their type and urgency. During execution, higher-priority tasks are executed first. If a task with a higher priority arrives while another task is running, the running task is interrupted so that the higher-priority task can run.
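The priority-ordering behaviour described above can be sketched as a small simulation of a single vCPU. This is an illustrative sketch only: the tick-based time model, the convention that a lower number means higher urgency, and the names `Task` and `run_preemptive` are assumptions and are not taken from the patent.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Task:
    priority: int                          # lower value = more urgent (assumption)
    arrival: int                           # tick at which the task becomes ready
    name: str = field(compare=False)
    remaining: int = field(compare=False)  # ticks of CPU time still needed


def run_preemptive(tasks: list[Task]) -> list[str]:
    """Simulate one vCPU; return the name of the task that ran at each tick."""
    pending = sorted(tasks, key=lambda t: t.arrival)
    ready: list[Task] = []
    timeline: list[str] = []
    tick = 0
    while pending or ready:
        # Admit every task that has arrived by this tick.
        while pending and pending[0].arrival <= tick:
            heapq.heappush(ready, pending.pop(0))
        if not ready:
            timeline.append("idle")
            tick += 1
            continue
        # Always run the most urgent ready task, so a newly arrived, more
        # urgent task preempts the one that ran during the previous tick.
        current = heapq.heappop(ready)
        timeline.append(current.name)
        current.remaining -= 1
        if current.remaining > 0:
            heapq.heappush(ready, current)
        tick += 1
    return timeline


trace = run_preemptive([
    Task(priority=2, arrival=0, name="batch", remaining=3),
    Task(priority=0, arrival=1, name="urgent", remaining=2),
])
print(trace)  # ['batch', 'urgent', 'urgent', 'batch', 'batch']
```

The `batch` task starts first, is interrupted when `urgent` arrives at tick 1, and resumes only after `urgent` has finished.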

It should be understood that, in the GPU virtualization described above, the corresponding virtual GPU is obtained according to the resource allocation instruction and mounted to a preset virtual machine; see Figure 3. After use, the virtual GPU mounted in the preset virtual machine is unmounted and the virtual GPU resources are released. NPU virtualization can follow the same approach as GPU virtualization; the technical solutions are essentially the same.

In a specific implementation, the GPU virtualization sub-module uses hardware-assisted virtualization based on SR-IOV (Single Root I/O Virtualization), which makes it possible to virtualize PCIe devices. Enabling the SR-IOV function on the GPU partitions it into multiple virtual GPUs, each with its own identifier and resources. First, the GPU resource information corresponding to the user is obtained from the login instruction sent by the user; a resource allocation instruction is generated from the GPU resource information; the corresponding virtual GPU resources are obtained according to the resource allocation instruction and mounted to the preset virtual machine; and GPU computing power is provided to the user through the preset virtual machine. Virtual machines or containers can access the virtual GPUs directly, which achieves both isolation and sharing of GPU resources.
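The allocate, mount and release cycle described above can be sketched as simple bookkeeping. The sketch below does not touch any real SR-IOV or hypervisor interface; the class `VGpuPool`, its methods and the identifier strings are hypothetical and stand in for whatever mechanism actually attaches a virtual function to a virtual machine.

```python
class VGpuPool:
    """Tracks which virtual GPUs are free and which VM each one is attached to."""

    def __init__(self, vgpu_ids: list[str]):
        self.free = list(vgpu_ids)          # unassigned virtual GPU identifiers
        self.attached: dict[str, str] = {}  # virtual GPU id -> VM name

    def attach(self, vm: str) -> str:
        """Reserve a free virtual GPU for the VM and return its identifier."""
        if not self.free:
            raise RuntimeError("no free virtual GPU available")
        vgpu = self.free.pop(0)
        self.attached[vgpu] = vm
        return vgpu

    def release(self, vgpu: str) -> None:
        """Detach the virtual GPU from its VM and return it to the pool."""
        del self.attached[vgpu]
        self.free.append(vgpu)


pool = VGpuPool(["vgpu-0", "vgpu-1"])
first = pool.attach("vm-alice")   # mounted for the first user's VM
second = pool.attach("vm-bob")    # a second, isolated virtual GPU
pool.release(first)               # released after use, available again
print(pool.free, pool.attached)
```

Because each virtual machine holds a distinct identifier, two users never share the same virtual GPU at the same time, which mirrors the isolation property stated in the text.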

FPGA virtualization divides a single FPGA into multiple fine-grained, partially reconfigurable virtual FPGAs (vFPGAs), which can be configured and used individually by users. Each vFPGA is managed by its own controller and is connected to the outside through the AXI bus, which is used on the one hand to access memory and on the other hand to exchange data with other modules. In a specific implementation, the FPGA virtualization sub-module divides the FPGA logic resources into a static region and a dynamic region: the static region is not configurable, and the dynamic region consists of multiple dynamically reconfigurable vFPGA regions. On the host CPU side, the software is designed as follows: the user application API handles application deployment, FPGA resource management and communication with the vFPGA regions; the runtime manager schedules applications in space and in time; and the driver instantiates the dynamically reconfigurable vFPGA regions, sets up the data structures required by each vFPGA region, and creates a virtual memory mapping for the application so that it can communicate with the FPGA over the PCIe bus. A single vFPGA region contains two parts: user logic and a dynamic wrapper. The user logic is a bitstream that has been synthesized and system-verified; users can develop applications in languages including HLS, Verilog, VHDL and OpenCL. The dynamic wrapper provides a standard interface for the user logic so that applications can run across vFPGAs.
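How a runtime manager might schedule applications across vFPGA regions in space and in time can be sketched as follows. This is a bookkeeping model only, with no real bitstream loading; the names `VFpgaRegion`, `RuntimeManager`, `submit` and `finish`, and the first-come-first-served queue, are assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class VFpgaRegion:
    index: int
    bitstream: Optional[str] = None  # name of the user logic currently loaded


class RuntimeManager:
    def __init__(self, num_regions: int):
        self.regions = [VFpgaRegion(i) for i in range(num_regions)]
        self.waiting = deque()  # bitstreams waiting for a free region

    def submit(self, bitstream: str) -> Optional[int]:
        """Load user logic into a free region (space sharing) or queue it."""
        for region in self.regions:
            if region.bitstream is None:
                region.bitstream = bitstream
                return region.index
        self.waiting.append(bitstream)
        return None

    def finish(self, index: int) -> None:
        """Free a region and give it to the next waiting job (time sharing)."""
        region = self.regions[index]
        region.bitstream = self.waiting.popleft() if self.waiting else None


manager = RuntimeManager(num_regions=2)
manager.submit("conv2d")   # placed in region 0
manager.submit("fft")      # placed in region 1
manager.submit("sha256")   # queued: both regions are busy
manager.finish(0)          # region 0 is reconfigured with "sha256"
print([r.bitstream for r in manager.regions])
```

Because the dynamic wrapper gives every region the same interface, the sketch treats regions as interchangeable and places a job in whichever region becomes free first.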

As an embodiment, the computing virtualization module is responsible for monitoring and managing the status and availability of computing resources. It periodically updates the load, performance indicators and availability information of the resources and provides this information to the task scheduler and decision engine of the dynamic scheduling management module. It must discover and describe the heterogeneous resources present in the system, which includes identifying the resource types (such as CPU, GPU and FPGA) and obtaining their characteristics and performance information for later management and scheduling. It also schedules and assigns tasks to suitable resources according to the requirements of the task or application. The computing virtualization module includes a resource discovery and description sub-module, a resource monitoring and adjustment sub-module, and a fault handling and fault tolerance sub-module.

The resource discovery and description sub-module discovers and describes the heterogeneous resources present in the system. This includes identifying the resource types in the system (such as CPU, GPU and FPGA) and obtaining their characteristics and performance information for subsequent management and scheduling by the dynamic scheduling management module.

It can be understood that the resource monitoring and adjustment sub-module monitors the usage, load and performance indicators of the heterogeneous resources and adjusts resource allocation and configuration in a timely manner. Through real-time monitoring and feedback, the resource allocation strategy can be adjusted dynamically to adapt to changes in the system and to optimize resource utilization.

It should be understood that the fault handling and fault tolerance sub-module provides the strategies for handling resource faults and abnormal conditions. When a resource fails, appropriate fault tolerance mechanisms and fallback options must be applied to ensure the reliability and continuity of the system.

As an embodiment, the network virtualization module uses virtual software-defined networking (vSDN) technology to allow different users to share the network resources of the same physical network. A virtual network is a virtual topology composed of a set of virtual nodes and virtual links; it is a subset of the physical topology and an entity within the network virtualization environment. In the mapping, virtual nodes are mapped onto physical nodes, where multiple virtual nodes can coexist, and virtual links are mapped onto physical links. Virtual network embedding establishes a virtual network according to a user request, which specifies elements such as topology, resource requirements and location constraints. The virtual nodes are logically deployed on physical nodes of the substrate network whose location and resources meet the requirements; the virtual links are realized by paths between those physical nodes that satisfy the resource constraints and other conditions; and the requested resources are allocated to the nodes and links. Virtual network embedding is thus the instantiation, or initialization, of a virtual network request.

Because the network environment is constantly changing, with user requests continually arriving and leaving, the resource requirements of network slices change as well, and traditional virtual network embedding (VNE) that computes an optimal solution once cannot resolve the resulting unreasonable resource allocation. Virtual software-defined networking, by contrast, allows the network to be managed and controlled in real time, strengthens its programmability, and gives users more flexibility. Beyond resource allocation, virtual software-defined networking is also well suited to fault recovery and mobility management.

In the present invention, the virtual network controller accepts requests from multiple users simultaneously and can partition the physical network resources on demand into logically independent vSDNs for users. Establishing a vSDN involves two main parts: the network hypervisor (NVH) and virtual network embedding (VNE). The NVH sits between the users' SDN controllers and the physical network and can be implemented with FlowVisor, which is based on OpenFlow. VNE allocates network resources to each virtual network and consists of node mapping and link mapping: node mapping ensures that no physical node exceeds its capacity limit, and link mapping maps each virtual link to one physical path. To handle the dynamics of users continually arriving and leaving, vSDN reconfiguration is used: each time a user's state changes, the virtual nodes and links are mapped to new physical nodes and physical links.
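The two VNE steps, node mapping and link mapping, can be illustrated with a minimal greedy sketch. The patent does not specify an embedding algorithm, so the heuristic here (largest demand first onto the physical node with the most free CPU, then a breadth-first search for a path with enough bandwidth) is an assumption, as are the function names and the dictionary-based data layout.

```python
from collections import deque


def map_virtual_network(phys_cpu, phys_links, virt_cpu, virt_links):
    """Greedy embedding sketch.

    phys_cpu:   {physical node: free CPU}
    phys_links: {(u, v): free bandwidth}, undirected, stored with u < v
    virt_cpu:   {virtual node: CPU demand}
    virt_links: {(a, b): bandwidth demand}
    Returns (node_map, link_map), or None if the request cannot be embedded.
    """
    cpu = dict(phys_cpu)
    bw = dict(phys_links)

    # Node mapping: never place a virtual node on a physical node that lacks
    # the capacity for it.
    node_map = {}
    for vnode, demand in sorted(virt_cpu.items(), key=lambda kv: -kv[1]):
        host = max(cpu, key=cpu.get)
        if cpu[host] < demand:
            return None
        cpu[host] -= demand
        node_map[vnode] = host

    # Link mapping: each virtual link becomes exactly one physical path.
    link_map = {}
    for (a, b), demand in virt_links.items():
        path = _bfs_path(bw, node_map[a], node_map[b], demand)
        if path is None:
            return None
        for u, v in zip(path, path[1:]):
            bw[(min(u, v), max(u, v))] -= demand  # reserve bandwidth
        link_map[(a, b)] = path
    return node_map, link_map


def _bfs_path(bw, src, dst, demand):
    """Shortest path (in hops) using only links with enough free bandwidth."""
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for (u, v), free in bw.items():
            if free < demand or node not in (u, v):
                continue
            nxt = v if node == u else u
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None


result = map_virtual_network(
    phys_cpu={"p1": 10, "p2": 2, "p3": 8},
    phys_links={("p1", "p2"): 5, ("p2", "p3"): 5},
    virt_cpu={"a": 4, "b": 3},
    virt_links={("a", "b"): 2},
)
print(result)
```

Under this sketch, vSDN reconfiguration after a user arrives or leaves would amount to calling the function again with the updated free capacities, which is the simplest possible reading of the text rather than a tuned strategy.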

As an embodiment, the dynamic scheduling management module mainly includes a host status monitoring sub-module, a virtual machine selection sub-module and a virtual machine placement sub-module.

It can be understood that, in order to use heterogeneous resources efficiently and allocate tasks optimally, the system needs to monitor the status and performance indicators of resources in real time. The host status monitoring sub-module is responsible for collecting and analyzing the load, performance indicators and availability information of resources. This information can be used to evaluate resource utilization and performance bottlenecks, and is used by the dynamic scheduling management module for scheduling decisions and optimization. Through continuous monitoring and optimization, the system can improve resource utilization and reduce task execution time and cost. The optimization mainly covers memory, network and storage. Network optimization refers to pooling the physical network cards to reduce the number of data copies; storage optimization refers to preventing a virtual machine from overusing shared resources and thereby degrading the performance of other virtual machines. Specific strategies include: keeping performance logs, minimizing the impact of performance monitoring itself on the server, analyzing the monitoring results, and establishing performance baselines from which alerts are created.

The host status monitoring sub-module is mainly used to determine whether a host is overloaded or underloaded, using an adaptive dynamic threshold method. Machine learning is used to learn dynamic, adaptive resource-utilization thresholds, and during learning the results are reinforced through interaction with the dynamic environment and trial and error so as to adapt to a changing environment. Two thresholds are used, and virtual machine scheduling is triggered when a host is overloaded or underloaded.
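The dual-threshold idea can be illustrated with a simple statistical stand-in. The text calls for thresholds learned by machine learning with reinforcement through trial and error; the sketch below deliberately substitutes a much simpler median-absolute-deviation (MAD) heuristic, so it shows only how adaptive overload and underload thresholds are applied, not how the patent learns them. The parameters `safety` and `lower_ratio` and both function names are hypothetical.

```python
from statistics import median


def adaptive_thresholds(history, safety=2.5, lower_ratio=0.4):
    """Derive (underload, overload) utilization thresholds from recent samples.

    The overload threshold drops when utilization is volatile (large MAD),
    leaving more headroom; the underload threshold is a fixed fraction of it.
    """
    med = median(history)
    mad = median(abs(x - med) for x in history)
    upper = 1.0 - safety * mad
    return lower_ratio * upper, upper


def classify(utilization, history):
    """Label a host as overloaded, underloaded or normal."""
    lower, upper = adaptive_thresholds(history)
    if utilization > upper:
        return "overloaded"
    if utilization < lower:
        return "underloaded"
    return "normal"


samples = [0.50, 0.55, 0.60, 0.65, 0.70]  # recent CPU utilization of one host
print(adaptive_thresholds(samples))
print(classify(0.90, samples), classify(0.20, samples), classify(0.60, samples))
```

With these samples the overload threshold is about 0.875 and the underload threshold about 0.35, so a host at 90% utilization would trigger migration away from it and a host at 20% would be a candidate for consolidation.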

In a specific implementation, as shown in Figure 4, the host status monitoring sub-module monitors performance indicators in the virtualized environment in real time, such as CPU usage, memory usage and network latency, and sets up an alerting mechanism so that abnormal conditions are handled promptly. The present invention uses Prometheus for system monitoring and alerting. Prometheus is an open-source monitoring and alerting system built on a time-series database and is well suited to monitoring Kubernetes clusters. Its basic principle is to periodically scrape the status of monitored components over HTTP; any component can be monitored as long as it exposes a suitable HTTP endpoint, with no SDK or other integration work required. This makes it a good fit for monitoring virtualized environments such as virtual machines, Docker and Kubernetes. An Agent or Exporter installed on each virtual machine makes the virtual machine's performance data available to Prometheus, and Grafana is used to visualize and analyze it.
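To make the "any component with an HTTP endpoint can be monitored" point concrete, the following minimal sketch serves metrics in the Prometheus text exposition format using only the Python standard library. It is not an official exporter: the metric names, the `vm` label, the placeholder values and the port number are all assumptions for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def render_metrics(samples: dict[str, float], vm: str) -> str:
    """Render gauge samples in the Prometheus text exposition format."""
    lines = []
    for name, value in samples.items():
        lines.append(f"# TYPE {name} gauge")
        lines.append(f'{name}{{vm="{vm}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        # Placeholder values; a real exporter would read them from the guest.
        samples = {"vm_cpu_utilization": 0.42, "vm_memory_used_bytes": 1.5e9}
        body = render_metrics(samples, vm="vm-01").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8000) -> None:
    """Block forever, answering scrape requests on the given port."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()


print(render_metrics({"vm_cpu_utilization": 0.42}, vm="vm-01"), end="")
```

Calling `serve()` inside a virtual machine and adding that address as a scrape target in the Prometheus configuration would be enough for the server to start collecting these values on its regular scrape interval.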

It should be understood that if the host status monitoring sub-module predicts that a host is about to be overloaded, the system can stop allocating virtual machines to that host, or migrate some of its virtual machines away in advance to avoid possible SLA violations. If it can accurately predict that a host will be lightly loaded at the next moment, a series of energy-saving operations can be performed: stop allocating virtual machines to the host, migrate all virtual machines off it, and finally shut the host down or switch it to sleep mode.
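The dual-threshold decision and the follow-up actions described above can be sketched as follows. This is a minimal sketch, assuming the lower and upper thresholds have already been produced by the learning procedure, which is not reproduced here; the names `HostState`, `classify_host` and `plan_actions`, and the action labels, are hypothetical.

```python
from enum import Enum


class HostState(Enum):
    UNDERLOADED = "underloaded"
    NORMAL = "normal"
    OVERLOADED = "overloaded"


def classify_host(utilization, lower, upper):
    """Classify a host from its (predicted) utilization in [0, 1] and two thresholds."""
    if not 0.0 <= lower < upper <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= lower < upper <= 1")
    if utilization > upper:
        return HostState.OVERLOADED
    if utilization < lower:
        return HostState.UNDERLOADED
    return HostState.NORMAL


def plan_actions(state):
    """Return the candidate actions the description lists for each state."""
    if state is HostState.OVERLOADED:
        # Either action may be taken to avoid SLA violations.
        return ["stop_allocation", "migrate_some_vms"]
    if state is HostState.UNDERLOADED:
        # Performed in order to save energy.
        return ["stop_allocation", "migrate_all_vms", "power_down_or_sleep"]
    return []
```

For example, `plan_actions(classify_host(0.95, lower=0.2, upper=0.8))` yields the overload actions, while a utilization between the two thresholds triggers no scheduling.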

As an embodiment, the virtual machine selection sub-module adopts a minimum-migration-count policy to reduce the number of virtual machines that are live-migrated. Virtual machines are first sorted in descending order of resource demand, and a virtual machine that meets the following conditions is then selected. First, migration takes place only when the computing demand of all virtual machines running on the host exceeds the host's maximum capacity. Second, after the migration, the computing demand (CPU, GPU, NPU, FPGA, etc.) of all virtual machines remaining on the host should be less than the host's maximum capacity. Finally, among all virtual machines that satisfy these two conditions, the one with the smallest computing resources is migrated to another host; if no virtual machine satisfies the conditions, the virtual machine with the largest computing capacity is migrated.

In a specific implementation, the virtual machine selection sub-module first obtains the list of all virtual machines on the host. If the list is empty, it exits, because the host has no virtual machine to migrate. Otherwise it continues with selection. It calculates the amount of computing resources the host needs to release, that is, the total computing resources requested by the virtual machines minus the host's total computing resources multiplied by the reservation coefficient. If this amount is less than 0, it exits, because the host has sufficient resources and no migration is needed. If the amount is greater than 0, the host cannot satisfy the virtual machines' demand and migration is required, so selection continues by traversing the host's virtual machine list to choose a suitable virtual machine.
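The selection procedure above can be sketched as follows. This is one possible reading, not a definitive implementation: it assumes a single scalar compute demand per virtual machine, reads the amount to release as `sum(demand) - capacity * reserve_coeff`, treats an amount of exactly 0 as "no migration needed", and repeats the pick until the excess is cleared. The names `VM`, `resources_to_release` and `select_vms_to_migrate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VM:
    name: str
    demand: float  # requested compute, in the same unit as the host capacity


def resources_to_release(vms, host_capacity, reserve_coeff):
    """Requested total minus the host capacity scaled by the reservation coefficient."""
    return sum(vm.demand for vm in vms) - host_capacity * reserve_coeff


def select_vms_to_migrate(vms, host_capacity, reserve_coeff=1.0):
    """Pick as few VMs as possible so that the remaining demand fits on the host."""
    if not vms:
        return []  # nothing to migrate
    excess = resources_to_release(vms, host_capacity, reserve_coeff)
    if excess <= 0:
        return []  # the host has enough resources
    remaining = sorted(vms, key=lambda vm: vm.demand, reverse=True)
    selected = []
    while remaining and excess > 0:
        # VMs whose removal alone clears the remaining excess.
        candidates = [vm for vm in remaining if vm.demand >= excess]
        if candidates:
            pick = min(candidates, key=lambda vm: vm.demand)  # smallest sufficient VM
        else:
            pick = remaining[0]  # otherwise the largest VM
        selected.append(pick)
        remaining.remove(pick)
        excess -= pick.demand
    return selected
```

With three virtual machines demanding 40, 30 and 20 units on a host of capacity 100 and a reservation coefficient of 0.8, about 10 units must be released, and the sketch selects only the 20-unit virtual machine.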

As an embodiment, placement by the virtual machine placement sub-module can be static or dynamic. With static placement, the mapping between a virtual machine and its host stays fixed for the virtual machine's entire life cycle; with dynamic placement, the mapping can change. An energy-aware best-fit policy is proposed: after virtual machine selection is complete, the host whose energy consumption increases the least is chosen as the target host.

The virtual machine placement sub-module is implemented as follows. The virtual machines to be placed are first sorted in descending order of computing resource utilization, and a target host is then selected for each virtual machine in that order. For each virtual machine, the host list is traversed to find a suitable target host.

In a specific implementation, each candidate target host is first checked by the host status detection module. If the result indicates that the host will be overloaded or lightly loaded at the next moment, the host is not used as a placement target, because an overloaded or lightly loaded host is not an ideal target for a virtual machine. If the host state is normal, the sub-module further checks whether the host has enough resources to accommodate the virtual machine; if not, the host is likewise excluded. If the host has enough resources, the increase in energy consumption that would result from placing the virtual machine on it is calculated, and among all hosts that satisfy the conditions the one with the smallest increase is selected as the virtual machine's target host.
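The placement loop above can be sketched as follows. The description does not specify a power model, so this sketch assumes a linear one (power grows linearly from `idle_power` to `max_power` with utilization); the `Host` class, the `is_normal` callback standing in for the host status detection module, and the use of demand as the sort key are likewise assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class Host:
    name: str
    capacity: float
    used: float
    idle_power: float  # watts at 0% utilization (assumed linear power model)
    max_power: float   # watts at 100% utilization

    def power(self, used: float) -> float:
        return self.idle_power + (self.max_power - self.idle_power) * used / self.capacity


def place_vms(
    vm_demands: Dict[str, float],
    hosts: List[Host],
    is_normal: Callable[[Host], bool],
) -> Dict[str, Optional[str]]:
    """Assign each VM to the eligible host with the smallest power increase.

    Updates host.used in place; a VM with no eligible host maps to None.
    """
    placement: Dict[str, Optional[str]] = {}
    # Largest VMs are placed first.
    for vm, demand in sorted(vm_demands.items(), key=lambda kv: kv[1], reverse=True):
        best, best_delta = None, float("inf")
        for host in hosts:
            if not is_normal(host):
                continue  # overloaded or lightly loaded hosts are skipped
            if host.used + demand > host.capacity:
                continue  # not enough free resources
            delta = host.power(host.used + demand) - host.power(host.used)
            if delta < best_delta:
                best, best_delta = host, delta
        if best is not None:
            best.used += demand
        placement[vm] = best.name if best is not None else None
    return placement
```

Because `host.used` is updated after every assignment, a host that becomes heavily loaded during the run stops being eligible for later virtual machines, which mirrors the state check performed before each placement.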

Please refer to Figure 5, which is a flow chart of a heterogeneous intelligent computing platform virtualization management method provided by an embodiment of the present invention. As shown in Figure 5, the method includes:

Based on the user requirements received from the human-computer interaction end, obtaining the corresponding computing (CPU, GPU, NPU, FPGA, etc.), memory, storage and network resources;

It should be noted that the method of this embodiment may be executed by a computer terminal device with data processing, network communication and program execution functions, or by a server device with the same or similar functions; this embodiment places no limitation on this. For ease of understanding, this embodiment and the following embodiments are described using a server device as an example.

According to the required computing resources, defining a resource allocation policy, including resource quotas and priorities, and generating virtual machine resource allocation instructions;

Sending the resource allocation instructions to each chip module to virtualize the computing resources. According to the requirements, a KVM virtual machine instance is created using the hypervisor and allocated appropriate computing, memory and storage resources. After creation, the virtual machine must be initialized; the purpose of initialization is to provide the software and hardware environment needed for the virtual machine to be created and run. The virtual machine is configured and managed through the virtual machine manager, including network settings and operating system installation. The virtual computing resources are then mounted to the computer.

Periodically monitoring resource utilization in the virtualized environment, such as the usage of computing resources, memory and storage.

According to changes in resource utilization, automatically expanding or shrinking the resource allocation of virtual machines to meet actual demand. In particular, the system can automatically adjust memory size according to operating conditions to optimize resource utilization.

It should be noted that the memory adjustment policy is set according to actual requirements and performance optimization goals. It mainly covers the following aspects: memory threshold, a memory utilization threshold that triggers an adjustment when a virtual machine's memory utilization rises above or falls below it; adjustment step, the amount by which memory is increased or decreased, for example the size of the memory block added or removed each time; and adjustment frequency, how often memory usage is checked and adjusted.

According to the configured memory adjustment policy, the corresponding operation is performed to adjust the virtual machine's memory size. When the virtual machine's memory utilization exceeds the threshold, its memory size is increased dynamically; this can be done through the API provided by the virtual machine manager (libvirt), which allocates additional memory to the virtual machine. When the virtual machine's memory utilization is low, its memory size can be reduced, either through the API provided by the virtual machine manager or by using tools inside the virtual machine to release part of its memory.
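The three policy elements (threshold, adjustment step, adjustment frequency) and the resulting decision can be sketched as follows. This is a minimal sketch under stated assumptions: all default values are illustrative, the `min_mib` and `max_mib` bounds and the guard against shrinking into an overload are additions not present in the description, and applying the returned size through the virtual machine manager is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryPolicy:
    upper: float = 0.85    # grow when utilization exceeds this threshold
    lower: float = 0.30    # shrink when utilization falls below this threshold
    step_mib: int = 512    # adjustment step: memory added or removed each time
    interval_s: int = 60   # adjustment frequency: how often usage is checked
    min_mib: int = 1024    # assumed floor (not in the description)
    max_mib: int = 16384   # assumed ceiling (not in the description)


def next_memory_mib(current_mib: int, used_mib: int, policy: MemoryPolicy) -> int:
    """Return the memory size the VM should have after one policy check."""
    utilization = used_mib / current_mib
    if utilization > policy.upper:
        return min(current_mib + policy.step_mib, policy.max_mib)
    if utilization < policy.lower:
        shrunk = max(current_mib - policy.step_mib, policy.min_mib)
        # Skip the shrink if it would push utilization past the upper threshold.
        return shrunk if used_mib <= policy.upper * shrunk else current_mib
    return current_mib
```

A monitoring loop would call `next_memory_mib` once every `interval_s` seconds and, when the result differs from the current size, ask the virtual machine manager to apply it.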

Using virtual software-defined networking technology so that different users share the network resources of the same physical network, and slicing the physical network resources on demand into logically independent virtual flow distribution networks (vSDNs) for users;

After the step of providing the required computing power to the user based on the preset virtual machine, the method further includes:

Based on a shutdown instruction from the user, obtaining the virtual machine corresponding to the shutdown instruction;

Unmounting the virtual computing resources (CPU, GPU, NPU, FPGA, etc.) in the virtual machine, and shutting down the virtual machine after unmounting is complete;

After the step of shutting down the virtual machine after unmounting is complete, the method further includes:

Detecting the mount status of the virtual computing resources of virtual machines that are in the shutdown state in the current virtual environment, and, if a resource is still mounted, unmounting the virtual GPU of that virtual machine.

Please refer to Figure 6, which is a possible hardware schematic diagram provided by the present invention. As shown in Figure 6, an embodiment of the present invention provides an electronic device including a memory, a CPU processor, a GPU processor, an FPGA, an NPU, and a computer program stored in the memory and executable on the CPU processor; when the processor executes the computer program, the heterogeneous intelligent computing platform virtualization management method provided by the present invention is implemented.

The above embodiments are intended to explain the present invention rather than to limit it. Any modification or change made to the present invention within its spirit and the scope of protection of the claims falls within the scope of protection of the present invention.

Claims

1. A heterogeneous intelligent computing platform virtualization management system, characterized by comprising:

a chip virtualization module, used to virtualize the CPU, GPU and FPGA;

a computing virtualization module, used to monitor and manage the status information and availability information of computing resources and to provide this information to the dynamic scheduling management module;

a network virtualization module, used to apply virtual software-defined networking technology so that different users share the network resources of the same physical network, and to slice the physical network resources on demand into logically independent virtual flow distribution networks (vSDNs) for users;

a dynamic scheduling management module, used to determine the host load status, then select virtual machines that meet the migration conditions based on the minimum migration number policy, and implement static placement, in which the mapping between virtual machine and host remains unchanged, or dynamic placement, in which the mapping between virtual machine and host is variable; after virtual machine selection is complete, the host with the least increase in energy consumption is used as the target host for virtual machine migration.

2. A heterogeneous intelligent computing platform virtualization management system according to claim 1, characterized in that CPU virtualization allows a single physical CPU to be virtualized into multiple vCPUs, that is, virtual CPUs; the user operating system of each virtual machine uses one or more parallel vCPUs, the vCPUs run independently of one another, and time-sharing multiplexing of the CPU is supported, that is, tasks share the CPU through a real-time scheduling policy.

3. A heterogeneous intelligent computing platform virtualization management system according to claim 1, characterized in that GPU virtualization adopts SR-IOV-based hardware-assisted virtualization technology to virtualize PCIe devices: the SR-IOV function is enabled on the GPU, which is divided into multiple virtual GPUs, each with its own identity and resources, and virtual machines or containers access the virtual GPUs directly, thereby isolating and sharing GPU resources.

4. A heterogeneous intelligent computing platform virtualization management system according to claim 1, characterized in that FPGA virtualization divides a single FPGA into multiple fine-grained, partially reconfigurable vFPGAs, that is, virtual FPGAs; each vFPGA is managed by a separate controller, which is connected to the outside through the AXI bus and is used both to access memory and to exchange data with the dynamic scheduling management module.

5. A heterogeneous intelligent computing platform virtualization management system according to claim 1, characterized in that the computing virtualization module regularly updates the load status, performance indicators and availability information of resources and provides this information to the dynamic scheduling management module, which schedules tasks and allocates resources through the task scheduler and decision engine according to the needs of tasks or applications.

6. A heterogeneous intelligent computing platform virtualization management system according to claim 1, characterized in that the virtual network controller in the network virtualization module accepts user requests concurrently and slices physical network resources into logically independent vSDNs provided to users. Establishing a vSDN mainly involves two aspects: the network hypervisor (NVH) and virtual network embedding (VNE). The NVH sits between the user SDN controller and the physical network and is implemented by FlowVisor, a network virtualization platform based on the OpenFlow network communication protocol. VNE allocates network resources to each virtual network and is divided into node mapping and link mapping: node mapping ensures that physical resource nodes do not exceed their capacity limits, while link mapping ensures that each virtual link maps to one physical path. Each time the user state changes, virtual nodes and links are mapped to new physical nodes and physical links based on vSDN reconfiguration.

7. A heterogeneous intelligent computing platform virtualization management system according to claim 1, characterized in that host status monitoring is used to determine whether a host is overloaded or underloaded, using an adaptive dynamic threshold method; machine learning is used to learn dynamic, adaptive resource utilization thresholds, and during learning the results are reinforced through interaction with the dynamic environment and trial and error to adapt to the changing environment; a dual-threshold method with an overload threshold and an underload threshold triggers virtual machine scheduling when a host is overloaded or underloaded.

8. A heterogeneous intelligent computing platform virtualization management system according to claim 1, characterized in that the specific process of selecting virtual machines that meet the migration conditions based on the minimum migration number policy is: first, sort the virtual machines in descending order of resource requirements; then select the virtual machines that meet the condition that, after the migration is complete, the resource requirements of the virtual machines remaining on the host are less than the host's maximum capacity; among these, the virtual machine with the smallest resource requirements is selected.

9. A virtualization management method for heterogeneous intelligent computing platforms, characterized by comprising:

Based on the user requirements received from the human-computer interaction end, obtaining the corresponding CPU, GPU and FPGA computing resources;

Monitor and manage the status information and availability information of computing resources, and generate virtual machine resource allocation instructions based on the required computing resources;

Send resource allocation instructions to each chip module to virtualize computing resources, and then mount the virtualized resources to the target virtual machine;

Using virtual software-defined networking technology so that different users share the network resources of the same physical network, and slicing the physical network resources on demand into logically independent virtual flow distribution networks (vSDNs) for users;

Based on the minimum migration number policy, selecting virtual machines that meet the migration conditions, and implementing static placement, in which the mapping between virtual machine and host remains unchanged, or dynamic placement, in which the mapping between virtual machine and host is variable; after virtual machine selection is complete, the host with the least increase in energy consumption is used as the target host for virtual machine migration.

10. A heterogeneous intelligent computing platform virtualization management method according to claim 9, characterized by collecting and analyzing the load conditions, performance indicators and availability information of computing resources; this information is used to evaluate resource utilization and performance bottlenecks and is used by the decision engine for scheduling decisions and optimization; through continuous monitoring and optimization of memory, network and storage, resource utilization is improved and task execution time and cost are reduced.