CN114925591A - An automatic parallel strategy search method and related equipment based on polyhedral model modeling
- Publication number: CN114925591A
- Application number: CN202111646797.9A
- Authority: CN (China)
- Legal status: Granted
Classifications
- G06F30/27 — Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F2111/08 — Probabilistic or stochastic CAD
Abstract
Description
Technical Field
The present invention relates to the field of artificial intelligence, and in particular to an automatic parallel strategy search method based on polyhedral model modeling and related equipment.
Background
Over the past decade, deep learning has continually set new records across tasks in vision, natural language, speech, search, and recommendation. The common thread is scale: large-scale data gives models enough knowledge to memorize; models with large parameter counts can absorb more data; and large-scale, high-performance compute (typified by GPUs) speeds up model training by a hundredfold or even a thousandfold. The joint growth of data, models, and compute has given rise to the field of large-scale deep learning, whose key research questions include how to partition multi-machine tasks, how to configure cluster training resources, how to balance training speed against convergence speed, how to train models that cannot fit on a single machine, and how to provide elastic training and fault tolerance. Distributed training is the most effective means of addressing these problems and improving training efficiency; its core purpose is to accelerate model training.
Mainstream deep learning frameworks such as TensorFlow (a symbolic mathematics system based on dataflow programming, widely used to implement machine learning algorithms), PyTorch (an open-source Python machine learning library based on Torch, used for applications such as natural language processing), MindSpore (an open-source deep learning training/inference framework designed for device-edge-cloud scenarios), and PaddlePaddle (an open-source deep learning platform that integrates a core framework, tool components, and service platforms) all support multi-machine distributed training. The main parallel modes include data parallelism (splitting the training data samples across multiple computing devices during distributed training), operator parallelism, and pipeline parallelism (a quasi-parallel technique in which multiple instructions overlap during program execution). However, these parallel modes require algorithm developers to invoke the parallel partitioning APIs provided by the AI framework according to the characteristics of their model, which raises the technical difficulty of distributed training of AI algorithms. Moreover, because algorithm developers often have an insufficient grasp of the AI framework and the characteristics of the computing devices, parallel training is inefficient, and the resulting distributed tuning work increases development difficulty while reducing research efficiency.
To address this problem, the MindSpore framework provides automatic parallel training of models, the FlexFlow framework proposes a search strategy based on a four-dimensional parallel strategy space, and the RaNNC framework provides middleware that automatically searches pipeline parallel strategies for a PyTorch front end. However, because the parallel strategy search space is large (it grows with the size of the computation graph and of the resource space), these approaches are hard to make practical in terms of search efficiency. For example, when RaNNC searches a pipeline parallel strategy for a 4.9B-parameter BERT-enlarge model on 4 nodes with 32 GPUs, the search takes more than 4 hours, which lengthens debugging and training time during model development and lowers efficiency. The existing technology therefore still needs improvement.
Summary of the Invention
The main purpose of the present invention is to provide an automatic parallel strategy search method based on polyhedral model modeling and related equipment, aiming to solve the problem in the prior art that training large-scale deep learning models requires algorithm developers to configure parallel strategies themselves, which leads to low training efficiency and high development difficulty.
To achieve the above object, the present invention adopts the following technical solutions:
An automatic parallel strategy search method based on polyhedral model modeling, comprising:
obtaining a model computation graph of a deep learning algorithm according to a model object input by a user;
converting the model computation graph to obtain a converted model computation graph;
balancing the converted model computation graph to obtain a balanced computation graph;
creating a polyhedral model instance according to the balanced computation graph, and outputting a parallel strategy according to the polyhedral model instance;
invoking the underlying framework to execute the parallel strategy.
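As a non-authoritative illustration, the five steps above can be sketched as a pipeline. Every function below is a hypothetical stub (the patent does not prescribe an API), and the polyhedral strategy search is reduced to a trivial round-robin placement:

```python
# Hypothetical sketch of the claimed five-step flow; all names are placeholders
# and each stage is a stub, not the patent's actual implementation.

def build_computation_graph(model_object):
    # Step 1: obtain a computation graph from the user's model object.
    return {"ops": list(model_object)}

def convert_to_ir(graph):
    # Step 2: re-express the graph in the search-time intermediate representation.
    return [("IRNode", op) for op in graph["ops"]]

def balance_graph(ir_graph):
    # Step 3: node fusion/splitting would happen here (identity in this stub).
    return ir_graph

def search_strategy(ir_graph, num_devices):
    # Step 4: a real implementation builds a polyhedral model instance and
    # searches it; this stub merely round-robins nodes over devices.
    return {i: i % num_devices for i in range(len(ir_graph))}

def auto_parallel_search(model_object, num_devices):
    graph = build_computation_graph(model_object)
    ir_graph = convert_to_ir(graph)
    balanced = balance_graph(ir_graph)
    strategy = search_strategy(balanced, num_devices)
    # Step 5 (executing the strategy via the underlying framework) is omitted.
    return strategy

print(auto_parallel_search(["matmul", "relu", "matmul"], 2))  # {0: 0, 1: 1, 2: 0}
```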
In the automatic parallel strategy search method based on polyhedral model modeling, the step of obtaining the model computation graph of the deep learning algorithm according to the model object input by the user specifically comprises:
obtaining an algorithm model according to the model object input by the user;
randomly inputting a value into the algorithm model and recording the computation process of the algorithm model to obtain the model computation graph;
or parsing the model object with a Python interpreter to generate a syntax tree, and then analyzing the syntax tree to obtain the model computation graph.
In the automatic parallel strategy search method based on polyhedral model modeling, the step of converting the model computation graph to obtain the converted model computation graph specifically comprises:
re-expressing the model computation graph with a predefined intermediate representation to obtain the converted model computation graph.
In the automatic parallel strategy search method based on polyhedral model modeling, the step of balancing the converted model computation graph to obtain the balanced computation graph specifically comprises:
setting an average computation threshold per node, and comparing the computation of each node in the converted model computation graph with the average computation threshold;
fusing adjacent nodes whose computation is below the average computation threshold, and splitting nodes whose computation is above the average computation threshold, to obtain the balanced computation graph.
In the automatic parallel strategy search method based on polyhedral model modeling, the step of creating a polyhedral model instance according to the balanced computation graph and outputting a parallel strategy according to the polyhedral model instance specifically comprises:
mapping the balanced computation graph onto the polyhedral model to obtain a polyhedral optimization model;
inputting the balanced computation graph into the polyhedral optimization model to obtain a polyhedral model instance;
outputting a parallel strategy according to the polyhedral model instance and the number of computing resources input by the user.
In the automatic parallel strategy search method based on polyhedral model modeling, the step of invoking the underlying framework to execute the parallel strategy specifically comprises:
calling the execution API of the underlying framework to execute the parallel strategy.
In the automatic parallel strategy search method based on polyhedral model modeling, the model object refers to user-predefined single-machine training code of a deep learning algorithm.
In the automatic parallel strategy search method based on polyhedral model modeling, the predefined intermediate representation comprises: IRType, IRValue, IRNode, and IRGraph.
In the automatic parallel strategy search method based on polyhedral model modeling, the parallel strategy includes a data-parallel partitioning dimension and a pipeline-parallel partitioning dimension.
In the automatic parallel strategy search method based on polyhedral model modeling, the execution API is the run manager in the underlying AI framework.
An automatic parallel strategy search system, comprising:
a computation graph generation module for obtaining a model computation graph of a deep learning algorithm according to a model object input by a user;
a computation graph conversion module for converting the model computation graph to obtain a converted model computation graph;
a computation graph balancing module for balancing the converted model computation graph to obtain a balanced computation graph;
a parallel strategy search module for creating a polyhedral model instance according to the balanced computation graph and outputting a parallel strategy according to the polyhedral model instance;
a parallel strategy execution module for invoking the underlying framework to execute the parallel strategy.
A controller, comprising: a memory, a processor, and an automatic parallel strategy search program based on polyhedral model modeling that is stored in the memory and executable on the processor, wherein the program, when executed by the processor, implements the steps of the automatic parallel strategy search method based on polyhedral model modeling described above.
A computer-readable storage medium storing an automatic parallel strategy search program based on polyhedral model modeling, wherein the program, when executed by a processor, implements the steps of the automatic parallel strategy search method based on polyhedral model modeling described above.
Compared with the prior art, the present invention provides an automatic parallel strategy search method based on polyhedral model modeling and related equipment. The method comprises: obtaining a model computation graph of a deep learning algorithm according to a model object input by a user; converting the model computation graph to obtain a converted model computation graph; balancing the converted model computation graph to obtain a balanced computation graph; creating a polyhedral model instance according to the balanced computation graph and outputting a parallel strategy according to the polyhedral model instance; and invoking the underlying framework to execute the parallel strategy. By converting and balancing the generated model computation graph and creating a polyhedral model instance within the polyhedral model framework, the invention automatically outputs a parallel strategy from the instance. Algorithm logic written for different frameworks is thus modeled under the polyhedral model, and an efficiently executable parallel strategy is output automatically, which effectively improves the efficiency of parallel strategy search while reducing the difficulty of distributed training development and efficiency tuning for deep learning algorithms.
Brief Description of the Drawings
FIG. 1 is a flowchart of a preferred embodiment of the automatic parallel strategy search method based on polyhedral model modeling provided by the present invention;
FIG. 2 is a flowchart of step S100 in the preferred embodiment of the method;
FIG. 3 is a flowchart of step S300 in the preferred embodiment of the method;
FIG. 4 is a schematic diagram of node splitting and node aggregation provided by the present invention;
FIG. 5 is a flowchart of step S400 in the preferred embodiment of the method;
FIG. 6 is a schematic diagram of computation graph partitioning in the data-parallel mode provided by the present invention;
FIG. 7 is a schematic diagram of computation graph partitioning in the pipeline-parallel mode provided by the present invention;
FIG. 8 is a functional block diagram of the automatic parallel strategy search system provided by the present invention;
FIG. 9 is an architecture diagram of the relationship between the PyTorch framework and the automatic parallel strategy search system provided by the present invention;
FIG. 10 is a schematic diagram of the operating environment of a preferred embodiment of the controller provided by the present invention.
Detailed Description
To make the objectives, technical solutions, and effects of the present invention clearer, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are intended only to explain the present invention, not to limit it.
Those skilled in the art will understand that, unless expressly stated otherwise, the singular forms "a", "an", "the", and "said" used herein may also include the plural forms. It should be further understood that the word "comprising" used in the description of the present invention indicates the presence of the stated features, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. When an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Furthermore, "connected" or "coupled" as used herein may include wireless connection or wireless coupling. The term "and/or" as used herein includes any and all combinations of one or more of the associated listed items.
Those skilled in the art will understand that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It should also be understood that terms such as those defined in general dictionaries should be interpreted as having meanings consistent with their meanings in the context of the prior art and, unless specifically defined as herein, should not be interpreted in an idealized or overly formal sense.
First, distributed training of deep learning models on multiple machines and multiple GPUs has become the most important technical approach for accelerating model training. The main distributed training modes include data parallelism, model parallelism, optimizer parallelism, pipeline parallelism, and hybrid parallelism; these are also the parallel capabilities primarily supported by current mainstream deep learning frameworks (such as TensorFlow, PyTorch, MindSpore, and PaddlePaddle), which differ mainly in ease of use and efficiency. However, almost all of these parallel modes require algorithm developers to implement them by calling framework APIs and to tune training efficiency by hand, which is very difficult for developers who are unfamiliar with the underlying AI framework's implementation mechanisms and the cluster's communication characteristics; the complex implementation and debugging greatly reduce efficiency.
On the other hand, because the cost of automatic parallel strategy search grows exponentially with the size of the AI model's computation graph and the scale of the cluster resources, the search efficiency of existing automatic parallel search work cannot meet the requirements of large-parameter models on large-scale clusters. It is therefore necessary to design an efficient multi-machine, multi-GPU parallel strategy search method for distributed training that solves the automatic parallel training strategy search problem for large models, so that algorithm developers can focus solely on algorithm logic and quickly achieve distributed training on an AI cluster.
To solve the above problems in the prior art, the present invention provides an automatic parallel strategy search method based on polyhedral model modeling and related equipment. The generated model computation graph is converted and balanced, and, within the framework of the polyhedral model, a polyhedral model instance is created from the balanced computation graph so that a parallel strategy is output automatically from the instance. In this way, algorithm logic from different frameworks is modeled under the polyhedral model and an efficiently executable parallel strategy is output automatically, which effectively improves the efficiency of parallel strategy search while reducing the difficulty of distributed training development and efficiency tuning for deep learning algorithms.
The design of the automatic parallel strategy search method based on polyhedral model modeling is described below through specific exemplary embodiments. Note that the following embodiments serve only to explain the technical solution of the invention and are not intended as specific limitations:
Referring to FIG. 1, the automatic parallel strategy search method based on polyhedral model modeling provided by the present invention comprises:
S100: obtaining a model computation graph of a deep learning algorithm according to a model object input by a user.
Here, the model object refers to user-predefined single-machine training code of a deep learning algorithm, such as a BERT model or a GPT-3 model (GPT-3, built by the independent AI research and deployment company OpenAI, is a large-scale natural language model that currently runs on Microsoft Azure).
Specifically, different deep learning frameworks (such as TensorFlow, PyTorch, MindSpore, and PaddlePaddle) use different model objects (user-predefined algorithm logic), so the model computation graph must be obtained from the user's framework-specific single-machine training code using the method appropriate to that framework. A model computation graph is the representation of a deep learning algorithm inside an AI framework: the computation process is generally represented as a directed acyclic graph (DAG) in which each node represents a computation operation and each edge represents a tensor data dependency between operations. The purpose of generating the model computation graph is to represent the computation process of an algorithm.
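For illustration only (this is not code from the patent), such a DAG can be represented with nodes as operations and directed edges as tensor dependencies, and traversed in topological order:

```python
# Minimal DAG representation of a computation graph: nodes are operations,
# directed edges are tensor data dependencies (illustrative sketch only).
from collections import defaultdict

class ComputationGraph:
    def __init__(self):
        self.nodes = {}                 # node name -> operation type
        self.edges = defaultdict(list)  # producer -> list of consumers

    def add_op(self, name, op_type, inputs=()):
        self.nodes[name] = op_type
        for src in inputs:
            self.edges[src].append(name)

    def topological_order(self):
        # Kahn's algorithm: valid because the graph is acyclic (a DAG).
        indeg = {n: 0 for n in self.nodes}
        for dsts in self.edges.values():
            for d in dsts:
                indeg[d] += 1
        ready = [n for n, d in indeg.items() if d == 0]
        order = []
        while ready:
            n = ready.pop()
            order.append(n)
            for d in self.edges[n]:
                indeg[d] -= 1
                if indeg[d] == 0:
                    ready.append(d)
        return order

g = ComputationGraph()
g.add_op("x", "input")
g.add_op("w", "parameter")
g.add_op("matmul", "matmul", inputs=("x", "w"))
g.add_op("relu", "relu", inputs=("matmul",))
print(g.topological_order())  # "x" and "w" precede "matmul", which precedes "relu"
```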
Further, referring to FIG. 2, step S100 specifically comprises:
S110: obtaining an algorithm model according to the model object input by the user;
S120: randomly inputting a value into the algorithm model and recording the computation process of the algorithm model to obtain the model computation graph;
S130: or parsing the model object with a Python interpreter to generate a syntax tree, and then analyzing the syntax tree to obtain the model computation graph.
Specifically, taking the PyTorch framework as an example, the model computation graph defined by the algorithm developer can be obtained by either jit.trace (tracing) or jit.script (source-code conversion). The jit.trace method is based on tracing tensor computation: after obtaining an algorithm model from the user's model object, a random value matching the algorithm's input type is fed in, and every computation step of the algorithm is recorded; these recorded steps are assembled into a computation graph. For example, if the algorithm model takes a 32x32 image as input and outputs that image's classification label, the random input is a random 32x32 tensor. The jit.script method is based on source-code conversion: the Python interpreter first parses the model object (the user-defined algorithm logic) into a syntax tree, and the model computation graph is then obtained by analyzing that tree.
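`torch.jit.trace` and `torch.jit.script` are the real PyTorch entry points mentioned here; the following framework-free toy (an assumption-laden sketch, not PyTorch internals) illustrates the tracing principle: run the model once on a dummy symbolic input and record each operation as it executes.

```python
# Toy illustration of trace-based graph capture (not PyTorch code):
# a proxy value records every operation applied to it while the model runs once.

class TracedValue:
    def __init__(self, trace, name):
        self.trace, self.name = trace, name

    def _record(self, op, other):
        # Allocate a fresh name for the result and log (op, lhs, rhs, result).
        out = TracedValue(self.trace, f"t{len(self.trace)}")
        self.trace.append((op, self.name, getattr(other, "name", other), out.name))
        return out

    def __mul__(self, other):
        return self._record("mul", other)

    def __add__(self, other):
        return self._record("add", other)

def model(x):
    # User-defined single-machine "training code": y = x * 2 + 1.
    return x * 2 + 1

trace = []
model(TracedValue(trace, "x"))  # run once on a dummy symbolic input
print(trace)  # [('mul', 'x', 2, 't0'), ('add', 't0', 1, 't1')]
```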
Continuing with FIG. 1, S200: converting the model computation graph to obtain a converted model computation graph.
Specifically, computation graphs are defined differently in different frameworks; for example, the PyTorch framework uses the Torch Graph. The computation graphs of different frameworks therefore need to be converted into a common intermediate representation. Normally, a computation-graph intermediate representation is defined manually by experts, a process of manually establishing rule mappings, so conversion between different intermediate representations follows framework-specific rules. The present invention defines a new computation-graph representation for parallel strategy search, so the upper-layer framework's computation graph must be converted into the computation graph used during strategy search.
Further, step S200 specifically comprises:
S210: re-expressing the model computation graph with a predefined intermediate representation to obtain the converted model computation graph.
Specifically, the computation graphs of different frameworks are re-expressed with a user-predefined intermediate representation, converting them into the computation graph used during parallel strategy search in the present invention. The intermediate representation is a general method; in the C++ implementation it consists of four class definitions: IRType (type), IRValue (value), IRNode (node), and IRGraph (graph).
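The patent names the four intermediate-representation classes but does not reproduce their C++ definitions; the following Python sketch shows one plausible shape, with every field a hypothetical assumption:

```python
# Hypothetical Python sketch of the four IR classes named by the patent
# (IRType, IRValue, IRNode, IRGraph); the real ones are C++ classes whose
# fields are not disclosed, so every field here is an assumption.
from dataclasses import dataclass, field
from typing import List

@dataclass
class IRType:
    dtype: str        # e.g. "float32"
    shape: List[int]  # tensor shape

@dataclass
class IRValue:
    name: str
    type: IRType      # every value carries a type

@dataclass
class IRNode:
    op: str                 # operation kind, e.g. "matmul"
    inputs: List[IRValue]
    outputs: List[IRValue]
    cost: int = 0           # per-node computation, used later by graph balancing

@dataclass
class IRGraph:
    nodes: List[IRNode] = field(default_factory=list)

    def add(self, node: IRNode) -> None:
        self.nodes.append(node)

# Re-expressing one framework operation in the IR:
x = IRValue("x", IRType("float32", [32, 1024]))
w = IRValue("w", IRType("float32", [1024, 1024]))
y = IRValue("y", IRType("float32", [32, 1024]))
g = IRGraph()
g.add(IRNode("matmul", [x, w], [y], cost=2 * 32 * 1024 * 1024))
```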
Continuing with FIG. 1, S300: balancing the converted model computation graph to obtain a balanced computation graph.
Specifically, because different node operations in the computation graph involve different amounts of computation, the graph is unbalanced. To obtain evenly sized sub-nodes when nodes are partitioned in later steps, and to improve the efficiency of the automatic parallel strategy search, the nodes of the computation graph must be balanced through node splitting and node aggregation. The purpose of graph balancing is to even out the computation of the graph in both the horizontal and vertical dimensions: by re-aggregating or splitting nodes, the converted computation graph becomes computationally balanced in both dimensions, avoiding uneven computation assignment during parallel strategy search.
Further, referring to FIG. 3, step S300 specifically includes:
S310. Setting an average computation threshold for nodes, and comparing the computation amount of each node in the converted model computation graph with the average computation threshold;
S320. Fusing adjacent nodes whose computation amount is below the average computation threshold, and splitting nodes whose computation amount exceeds it, to obtain a balanced computation graph.
Specifically, the equalization processing mainly comprises aggregating or splitting the nodes of the model computation graph. Before node aggregation or node splitting, an average computation threshold per node is first set; the nodes of the converted model computation graph are traversed in order, and the computation amount of each node is compared with the average computation threshold. Node aggregation then fuses adjacent nodes whose computation amount is below the threshold. Node splitting divides a node whose computation amount exceeds the threshold into multiple nodes according to the threshold, typically by splitting a matrix-multiplication operator that exceeds it, so that the computation amount of each node after fusion and splitting is comparable to the average computation threshold. Once node aggregation and node splitting are complete, the balanced computation graph is obtained.
In the present invention, the large nodes of the computation graph are split, or multiple adjacent small nodes are aggregated, to obtain balanced nodes, i.e., the balanced computation graph, so that the computation amount is balanced across nodes. This effectively avoids uneven computation allocation during parallel strategy search and at the same time improves the efficiency of node partitioning.
Here, node aggregation converts the computing process of several nodes into a single node at the representation level: for example, when three computing nodes are aggregated, the three original IRNode objects are merged into one new IRNode object. Whether a node with a large computation amount can be split depends on the operator type of that node; not every node is splittable. Matrix multiplication, a basic operator commonly used in AI models, is an example of a splittable node. The operations of node splitting and node aggregation are illustrated in FIG. 4, which shows a large-computation node being split and adjacent small-computation nodes being aggregated.
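The S310/S320 balancing pass described above can be sketched as a single traversal that fuses small neighbours and splits oversized splittable nodes. The cost model below (a list of operator/cost pairs, ceil-division splitting) is an illustrative assumption; only the threshold rule and the fact that merely some operators such as matrix multiplication are splittable come from the text.

```python
# Sketch of the balancing pass: fuse adjacent below-threshold nodes,
# split above-threshold splittable nodes into roughly equal pieces.
from math import ceil

SPLITTABLE = {"matmul"}  # per the text, only certain operators can be split

def balance(nodes, threshold):
    """nodes: list of (op, cost) pairs -> balanced list of (op, cost) pairs."""
    out = []
    for op, cost in nodes:
        if cost > threshold and op in SPLITTABLE:
            parts = ceil(cost / threshold)              # split a large node
            out.extend((op, cost / parts) for _ in range(parts))
        elif out and cost < threshold and out[-1][1] + cost <= threshold:
            prev_op, prev_cost = out.pop()              # fuse with small neighbour
            out.append((f"{prev_op}+{op}", prev_cost + cost))
        else:
            out.append((op, cost))
    return out

# Two small adjacent nodes fuse; one large matmul splits into two pieces:
balanced = balance([("add", 1.0), ("relu", 1.0), ("matmul", 8.0)], threshold=4.0)
```

After the pass, every node's cost is at most the threshold, matching the goal that fused and split nodes be comparable to the average computation threshold.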
Referring again to FIG. 1, S400: creating a polyhedral model instance according to the balanced computation graph, and outputting a parallel strategy according to the polyhedral model instance.
Specifically, this process comprises training modeling under the polyhedral-optimization-based model, followed by the parallel strategy search. Training modeling means initializing, according to the balanced computation graph, within the established polyhedral optimization model framework to create a polyhedral optimization model instance, i.e., modeling it as a concrete object. Parallel strategy search then means searching, on the basis of the polyhedral optimization model instance, according to a specified parallel mode (the parallel mode the developer requires, such as data parallelism, model parallelism, optimizer parallelism, pipeline parallelism or hybrid parallelism). Taking pipeline-parallel strategy search as an example: according to the pipeline mode specified by the developer, the strategy values for the number of pipeline stages and the number of mini-batches are searched, and the final parallel strategy is generated. A parallel mode is a broad framework; data parallelism, for instance, refers to distributing the training data samples across multiple computing devices for distributed computation when an AI model is trained in a distributed manner, whereas a parallel strategy is the distributed-training partitioning scheme of one specific algorithm model at execution time.
Polyhedral modeling is a common compiler technique for optimizing for-loops. By representing loops in the polyhedral model space, parallel execution of the loops can be derived directly from the polyhedral model's mapping table, improving training efficiency. In the present invention, the training computation of a deep learning model is expressed as a multi-level loop nest, a representation of this training process in the polyhedral model is then defined, and finally the common data-parallel, model-parallel and pipeline-parallel modes are unified within this polyhedral modeling framework.
Suppose a deep learning model is represented by a directed acyclic graph (DAG) as a computation graph D(N, E), where N is the set of graph nodes and E is the set of graph edges. The training computation of the deep learning model (which is itself also a modeling process) can then be expressed as follows (in the Python programming language), realizing the mapping of the for-loop nest onto the polyhedral optimization model:
for e in range(Epoch_num):  # Epoch_num: number of training epochs, i.e. the range (domain) of the outer loop
    for b in range(Batch_num):  # Batch_num: number of sample batches per epoch
        for node in nodes:  # node: one node of the model; nodes: node set of the model computation graph
            out = Forward(node, b)  # Forward(): forward computation of the model
        for node in reverse(nodes):  # reverse(): traverse the model's nodes in reverse order
            grad = Backward(node, b)  # grad: gradient of the model parameters; Backward(): backward gradient computation
            Update(node)  # Update(): parameter update process of the model
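In the polyhedral view, the loop nest above is a set of integer iteration points. As a hedged illustration of how an affine schedule can extract pipeline parallelism from it, the sketch below enumerates the (batch, layer) points of one epoch and applies a skewing transform (b, l) -> (b + l, l) of the kind that turns a staggered pipeline schedule into parallel wavefronts; the concrete transform and the tiny domain sizes are illustrative assumptions, not the patent's actual search procedure.

```python
# Iteration domain of one epoch as integer points, plus an affine skew
# whose first output coordinate acts as a pipeline "time step".
Batch_num, Layer_num = 4, 3

domain = [(b, l) for b in range(Batch_num) for l in range(Layer_num)]

def skew(point):
    b, l = point
    return (b + l, l)   # affine map: points sharing b + l form one wavefront

schedule = {}
for p in domain:
    t, _ = skew(p)
    schedule.setdefault(t, []).append(p)   # same t => can run in parallel

# At time step 2, batches 2, 1, 0 occupy layers 0, 1, 2 simultaneously:
wavefront = schedule[2]
```

The number of time steps is Batch_num + Layer_num - 1, the familiar fill-and-drain length of a pipeline over this grid.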
Further, referring to FIG. 5, step S400 specifically includes:
S410. Creating a polyhedral optimization model;
S420. Initializing the polyhedral optimization model according to the balanced computation graph to obtain a polyhedral model instance;
S430. Outputting a parallel strategy according to the polyhedral model instance and the number of computing resources input by the user, where the parallel strategy includes the data-parallel partitioning dimension, the pipeline-parallel partitioning dimension, the pipeline arrangement, and the like.
Specifically, the balanced computation graph is first mapped onto the polyhedral model to obtain a polyhedral optimization model, where "mapping" can be understood as an affine transformation of the coordinate space, such as translation or rotation. The balanced computation graph is then input to the polyhedral optimization model to obtain a polyhedral model instance (a concrete object); that is, the for-loop nest is mapped onto the polyhedral optimization model to obtain a polyhedral model instance (a computation model on which distributed optimization can be sought). A geometric linear transformation then finds the data-parallel partitioning dimension or the pipeline-parallel partitioning dimension for distributed training, which is combined with the number of computing resources input by the user: for example, with 8 GPU cards as the computing resources, data parallelism splits the data along its partitioning dimension into 8 mini-batches computed on the 8 cards respectively. Finally, the parallel strategy (data-parallel partitioning dimension, pipeline-parallel partitioning dimension, pipeline arrangement, etc.) is synthesized and output.
The partitioning processes along the data-parallel dimension and the pipeline-parallel dimension are shown in FIG. 6 and FIG. 7, respectively, where the abscissa is the number of sample batches in model training and the ordinate is the number of parameter layers of the model (for example, a BERT model's computation may have 24 layers). A point in the first quadrant represents the computation of one sample batch on one model layer; the arrows represent the data or temporal dependencies of the training computation; the dashed boxes show one example partition for distributed training. FIG. 6 shows computation-graph partitioning in data-parallel mode, cutting directly along the batch direction; FIG. 7 shows computation-graph partitioning in pipeline-parallel mode, cutting along the horizontal direction in a staggered manner.
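The two partitioning schemes of FIG. 6 and FIG. 7 can be sketched as device-assignment rules over the (batch, layer) grid: data parallelism cuts along the batch axis, pipeline parallelism cuts along the layer axis. The modular assignment rules below are an illustrative reading of the figures, not the patent's exact output format.

```python
# Device-assignment sketches for the two partitioning modes.
def data_parallel(batch, layer, num_devices):
    # FIG. 6: split in the batch direction; every layer of a batch
    # stays on that batch's device.
    return batch % num_devices

def pipeline_parallel(batch, layer, num_devices, layers_per_stage):
    # FIG. 7: split by layer into pipeline stages; every batch passes
    # through all stages in order.
    return (layer // layers_per_stage) % num_devices

# 8 mini-batches on 8 GPU cards, as in the example above:
dp = [data_parallel(b, 0, 8) for b in range(8)]

# A 24-layer model on 4 pipeline stages of 6 layers each:
pp = [pipeline_parallel(0, l, 4, 6) for l in range(24)]
```

Under data parallelism each card gets one mini-batch; under pipeline parallelism layers 0-5, 6-11, 12-17 and 18-23 land on stages 0 through 3.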
In the present invention, the training computation of deep learning models is described in a unified way and expressed in the polyhedral model. Through the polyhedral model, commonly used distributed parallel training modes such as data parallelism and pipeline parallelism are unified under the resulting polyhedral optimization model, so that feasible parallel strategies can be searched for via simple mapping transformations of the polyhedral model.
Referring again to FIG. 1, S500: invoking the underlying framework to execute the parallel strategy. The execution API is the runtime manager (runtime interface) of the underlying AI framework.
Specifically, after the data-parallel or pipeline-parallel partitioning dimension for distributed training has been found under the polyhedral model and combined with the number of computing resources input by the user, the parallel strategy is output automatically; finally, the execution API of the underlying framework must be called to execute the parallel strategy. The API here is an interaction interface.
Further, step S500 specifically includes:
S510. Calling the execution API of the underlying framework to execute the parallel strategy, where the execution API is the runtime manager of the underlying AI framework.
Specifically, after the polyhedral model instance has been created under the polyhedral model, the data-parallel or pipeline-parallel partitioning dimension for distributed training is found through the instance and, combined with the number of computing resources input by the user, the distributed-training execution strategy of the algorithm model is output automatically; the execution API of the underlying framework is then called to carry out the distributed training strategy of the algorithm model.
Further, referring to FIG. 8, based on the above automatic parallel strategy search method based on polyhedral model modeling, the present invention correspondingly provides an automatic parallel strategy search system, which comprises: a computation graph generation module 100, a computation graph conversion module 200, a computation graph equalization module 300, a parallel strategy search module 400 and a parallel strategy execution module 500.
Specifically, the computation graph generation module 100 is configured to obtain the model computation graph of the deep learning algorithm according to the model object input by the user; the computation graph conversion module 200 is configured to convert the model computation graph to obtain a converted model computation graph; the computation graph equalization module 300 is configured to perform equalization processing on the converted model computation graph to obtain a balanced computation graph; the parallel strategy search module 400 is configured to create a polyhedral model instance according to the balanced computation graph and output a parallel strategy according to the polyhedral model instance; and the parallel strategy execution module 500 is configured to invoke the underlying framework to execute the parallel strategy.
The automatic parallel strategy search method based on polyhedral model modeling proposed by the present invention can be implemented as middleware that equips an AI framework with an automatic parallelization capability; FIG. 9 shows the relationship between its functions and the PyTorch framework. First, the user defines the algorithm with the PyTorch front-end API; the user need not consider parallel partitioning of the algorithm, only its single-machine implementation logic, and therefore only needs to know how to use the APIs that PyTorch provides for implementing the single-machine algorithm. As shown in FIG. 9, the left column is the basic architecture of the PyTorch framework and the right part is the automatic parallel strategy search system; the main interfaces between the two are the computation-graph representation of the algorithm and the runtime interface of the underlying framework, so the user is almost unaware of the automatic parallel strategy search system.
The user-defined algorithm then has its computation graph exported through PyTorch's JIT (just-in-time compilation) module. After being input to the automatic parallel strategy search system, the computation graph passes successively through the computation graph generation module, the computation graph conversion module, the computation graph equalization module and the parallel strategy search module; the system finally outputs computation subgraphs partitioned according to the parallel strategy and executes them by calling PyTorch's underlying Runtime API (an internal functional API of the PyTorch framework).
Furthermore, the present invention also provides a controller comprising a processor 10, a memory 20 and a display 30. FIG. 10 shows only some components of the controller; it should be understood that not all illustrated components are required, and more or fewer components may be implemented instead.
In some embodiments, the memory 20 may be an internal storage unit of the controller, such as a hard disk or internal memory of the controller. In other embodiments, the memory 20 may be an external storage device of the controller, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a flash card provided on the controller. Further, the memory 20 may include both an internal storage unit and an external storage device of the controller. The memory 20 is used to store the application software installed on the controller and various data, and may also be used to temporarily store data that has been output or is to be output. In one embodiment, an automatic parallel strategy search program 40 is stored in the memory 20; it can be executed by the processor 10, thereby implementing the automatic parallel strategy search method based on polyhedral model modeling of the present invention.
In some embodiments, the processor 10 may be a central processing unit (CPU), a microprocessor or another data-processing chip, used to run the program code stored in the memory 20 or to process data, for example to execute the automatic parallel strategy search method based on polyhedral model modeling.
In some embodiments, the display 30 may be an LED display, a liquid crystal display, a touch liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display 30 is used to display information on the device and to present a visual user interface. The components 10-30 of the device communicate with one another via a system bus.
In one embodiment, when the processor 10 executes the automatic parallel strategy search program 40 in the memory 20, the following steps are implemented:
obtaining a model computation graph of the deep learning algorithm according to a model object input by the user;
converting the model computation graph to obtain a converted model computation graph;
performing equalization processing on the converted model computation graph to obtain a balanced computation graph;
creating a polyhedral model instance according to the balanced computation graph, and outputting a parallel strategy according to the polyhedral model instance;
invoking the underlying framework to execute the parallel strategy.
The step of obtaining the model computation graph of the deep learning algorithm according to the model object input by the user specifically includes:
randomly inputting a value to the algorithm model and recording the computation process of the algorithm model to obtain the model computation graph, where the model object is the single-machine training code of the deep learning algorithm predefined by the user;
or parsing the model object with a Python interpreter to generate a syntax tree, and then analyzing the syntax tree to obtain the model computation graph.
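The second capture path above parses the single-machine training code into a syntax tree and analyzes it. As a minimal sketch with Python's standard ast module, the following collects the call expressions of a toy training function, which is the kind of raw material from which a computation graph could be built; the toy source and the choice of extracting only simple-name calls are illustrative assumptions.

```python
# Sketch of the syntax-tree capture path: parse single-machine training
# code and read the operator calls off the AST.
import ast

source = """
def train_step(x, w):
    h = matmul(x, w)
    y = relu(h)
    return y
"""

tree = ast.parse(source)
calls = [node.func.id for node in ast.walk(tree)
         if isinstance(node, ast.Call) and isinstance(node.func, ast.Name)]
```

A real converter would also track the dataflow between these calls (here, matmul's output feeding relu) to recover the graph's edges, not just its nodes.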
The step of converting the model computation graph to obtain the converted model computation graph specifically includes:
re-representing the model computation graph with a predefined intermediate representation to obtain the converted model computation graph.
The step of performing equalization processing on the converted model computation graph to obtain a balanced computation graph specifically includes:
splitting the large nodes in the converted model computation graph and aggregating the adjacent small nodes in the converted model computation graph to obtain the balanced computation graph.
The step of creating a polyhedral model instance according to the balanced computation graph and outputting a parallel strategy according to the polyhedral model instance specifically includes:
creating a polyhedral optimization model; initializing the polyhedral optimization model according to the balanced computation graph to obtain a polyhedral model instance;
outputting a parallel strategy according to the polyhedral model instance and the number of computing resources input by the user.
The step of invoking the underlying framework to execute the parallel strategy specifically includes:
calling the execution API of the underlying framework to execute the parallel strategy.
Further, the present invention also provides a computer-readable storage medium storing an automatic parallel strategy search program 40 which, when executed by a processor, implements the steps of the automatic parallel strategy search method based on polyhedral model modeling described above; as those steps have already been described in detail, they are not repeated here.
In summary, the present invention provides an automatic parallel strategy search method based on polyhedral model modeling and related devices. The method includes: obtaining a model computation graph of a deep learning algorithm according to a model object input by the user; converting the model computation graph to obtain a converted model computation graph; performing equalization processing on the converted model computation graph to obtain a balanced computation graph; creating a polyhedral model instance according to the balanced computation graph and outputting a parallel strategy according to the polyhedral model instance; and invoking the underlying framework to execute the parallel strategy. By converting and balancing the model computation graph, creating a polyhedral model instance within the polyhedral-model framework, and then automatically outputting the parallel strategy, the present invention models different algorithm logics under the polyhedral model and automates the output of parallel strategies, improving the efficiency of parallel strategy search and reducing the difficulty of distributed-training development and efficiency tuning for deep learning algorithms.
It can be understood that those of ordinary skill in the art may make equivalent replacements or changes according to the technical solutions of the present invention and the inventive concept thereof, and all such changes or replacements shall fall within the protection scope of the appended claims of the present invention.
Claims (13)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111646797.9A CN114925591B (en) | 2021-12-29 | 2021-12-29 | Automatic parallel strategy search method based on polyhedron modeling and related equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN114925591A true CN114925591A (en) | 2022-08-19 |
| CN114925591B CN114925591B (en) | 2025-08-29 |