CN114816669A - Distributed training method and data processing method of model - Google Patents
Distributed training method and data processing method of model
- Publication number: CN114816669A
- Application number: CN202210476218.9A
- Authority: CN (China)
- Prior art keywords: training, data, containers, task, request
- Prior art date: 2022-04-29
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting (under G06F18/21, Pattern recognition: design or setup of recognition systems or techniques)
- G06F9/45558—Hypervisor-specific management and integration aspects (under G06F9/455, Emulation; Interpretation; Software simulation, e.g. virtualisation; G06F9/45533, Hypervisors; Virtual machine monitors)
- G06F2009/4557—Distribution of virtual machine instances; Migration and load balancing
Abstract

The present disclosure provides a distributed training method for a model and a data processing method, and relates to the field of artificial intelligence, in particular to image processing and computer vision. The method includes: acquiring a plurality of containers on a plurality of nodes, each container being deployed with a training module capable of executing each of a plurality of training tasks in response to a corresponding training request; in response to receiving a first request corresponding to a first training task, determining a plurality of training containers corresponding to the first training task from the plurality of containers; and sending a first training request corresponding to the first training task to each of the training containers, so that the training module in each training container executes the first training task, thereby training a model corresponding to the first training task.
Description
TECHNICAL FIELD

The present disclosure relates to the field of artificial intelligence, in particular to image processing, computer vision, and related technical fields, and specifically to a distributed training method for a model, a data processing method, apparatuses, an electronic device, a computer-readable storage medium, and a computer program product.

BACKGROUND

Artificial intelligence is the discipline of making computers simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning); it spans both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing. Artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, and knowledge graph technologies.

Artificial intelligence widely relies on trained algorithm models as the means of implementing the related software technologies. Because training an algorithm model involves large amounts of training data and large-scale computation, distributed training methods are often used to train algorithm models.

The approaches described in this section are not necessarily approaches that have been previously conceived or employed. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the issues raised in this section should not be assumed to have been recognized in any prior art.
SUMMARY

The present disclosure provides a distributed training method for a model, a data processing method, apparatuses, an electronic device, a computer-readable storage medium, and a computer program product.

According to an aspect of the present disclosure, a distributed training method for a model is provided, including: acquiring a plurality of containers on a plurality of nodes, each of the plurality of containers being deployed with a training module capable of executing each of a plurality of training tasks in response to a training request corresponding to that training task; in response to receiving a first request corresponding to a first training task, determining, from the plurality of containers, a plurality of training containers corresponding to the first training task; and sending a first training request corresponding to the first training task to each of the plurality of training containers, so that the training module in each of the plurality of training containers executes the first training task, thereby training a model corresponding to the first training task.

According to another aspect of the present disclosure, a data processing method is provided, including: acquiring data to be processed; and inputting the data to be processed into a processing model, the processing model being obtained by training with the distributed training method according to the present disclosure.

According to another aspect of the present disclosure, a distributed training apparatus for a model is provided, including: a container acquisition unit configured to acquire a plurality of containers on a plurality of nodes, each of the plurality of containers being deployed with a training module capable of executing each of a plurality of training tasks in response to a training request corresponding to that training task; a determining unit configured to determine, in response to receiving a first request corresponding to a first training task, a plurality of training containers corresponding to the first training task from the plurality of containers; and a training request unit configured to send a first training request corresponding to the first training task to each of the plurality of training containers, so that the training module in each of the plurality of training containers executes the first training task, thereby training a model corresponding to the first training task.

According to another aspect of the present disclosure, a data processing apparatus is provided, including: a data acquisition unit configured to acquire data to be processed; and a data input unit configured to input the data to be processed into a processing model, the processing model being obtained by training with the distributed training method according to the present disclosure.

According to another aspect of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor, the memory storing instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to implement the method described above.

According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, the computer instructions being used to cause a computer to implement the method described above.

According to another aspect of the present disclosure, a computer program product is provided, including a computer program that, when executed by a processor, implements the method described above.

According to one or more embodiments of the present disclosure, a training module capable of executing a plurality of training tasks in response to training requests is deployed in containers on a plurality of nodes, and a training request is sent to each training container so that the training module in each training container executes the corresponding training task in response, thereby realizing distributed training of the model. Because the training module is deployed in containers on the nodes, the distributed training platform for the model can be deployed efficiently and extended easily, reducing usage and maintenance costs.

It should be understood that the content described in this section is not intended to identify key or critical features of the embodiments of the present disclosure, nor is it intended to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings illustrate embodiments by way of example, constitute a part of the specification, and together with its written description serve to explain exemplary implementations of the embodiments. The embodiments shown are for illustrative purposes only and do not limit the scope of the claims. Throughout the drawings, the same reference numbers refer to similar but not necessarily identical elements.

FIG. 1 shows a flowchart of a distributed training method for a model according to an embodiment of the present disclosure;

FIG. 2 shows a flowchart of the process of acquiring a plurality of containers on a plurality of nodes in a distributed training method for a model according to an embodiment of the present disclosure;

FIG. 3 shows a schematic diagram of a training platform in a distributed training method for a model according to an embodiment of the present disclosure;

FIG. 4 shows a flowchart of a distributed training method for a model according to an embodiment of the present disclosure;

FIG. 5 shows a structural block diagram of a distributed training apparatus for a model according to an embodiment of the present disclosure;

FIG. 6 shows a structural block diagram of a data processing apparatus according to an embodiment of the present disclosure; and

FIG. 7 shows a structural block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.
DETAILED DESCRIPTION

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, including various details of the embodiments to facilitate understanding; these should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope of the present disclosure. Likewise, descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.

In the present disclosure, unless otherwise specified, the use of the terms "first", "second", and the like to describe various elements is not intended to limit the positional, temporal, or importance relationship of these elements; such terms are only used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, while in some cases, based on the context, they may refer to different instances.

The terminology used in the description of the various examples in the present disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly dictates otherwise, if the number of an element is not expressly limited, the element may be one or more. Furthermore, as used in the present disclosure, the term "and/or" covers any and all possible combinations of the listed items.

In the related art, distributed model training is implemented with cluster-based distributed architectures, for example, architectures based on Spark, Hadoop, or MPI. These distributed architectures typically involve cluster-based application deployment and management, so the problem of cluster control must be solved to avoid loss of efficiency. However, deploying and managing a cluster involves complex technologies and applications (for example, K8S), which makes such distributed model training architectures costly to use and maintain and inconvenient to deploy, and therefore unsuitable for the needs of small-scale users (for example, users whose training platform comprises only a small number of computing devices).

To this end, a distributed model training method and apparatus are provided to solve the above problems.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Referring to FIG. 1, a distributed training method 200 for a model according to some embodiments of the present disclosure includes:

Step S110: acquiring a plurality of containers on a plurality of nodes, each of the plurality of containers being deployed with a training module capable of executing each of a plurality of training tasks in response to a training request corresponding to that training task;

Step S120: in response to receiving a first request corresponding to a first training task, determining a plurality of training containers corresponding to the first training task from the plurality of containers; and

Step S130: sending a first training request corresponding to the first training task to each of the plurality of training containers, so that the training module in each of the plurality of training containers executes the first training task, thereby training a model corresponding to the first training task.

By deploying, in containers on a plurality of nodes, a training module capable of executing a plurality of training tasks in response to training requests, and by sending a training request to each training container so that its training module executes the training task in response, distributed training of the model is realized. Because the training module is deployed in containers on the nodes, the distributed training platform can be deployed efficiently and extended easily, reducing usage and maintenance costs.

In embodiments according to the present disclosure, distributed training of a model is achieved simply by deploying, on nodes formed by computing devices, a training module capable of executing a plurality of training tasks in response to training requests. No cluster-based distributed architecture is required, so cluster application deployment and management are not involved, which reduces usage and maintenance costs; moreover, container-based deployment is convenient and can meet the needs of small-scale users (for example, users whose training platform comprises only a small number of computing devices).
In some embodiments, the nodes are a plurality of computing devices, for example, computers.

In some embodiments, the containers on the plurality of nodes may be created based on the deployment of the training module, and each container is started after it is created.

In some embodiments, the training module may be a set of services deployed in the container, including algorithm services corresponding to the plurality of training tasks and a response service. The response service responds to a training request and starts the algorithm service corresponding to the training request, so that the corresponding training task is performed. A minimal sketch of such a response service is shown below.
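As an illustration only: the disclosure does not specify a transport or framework, so the following sketch assumes an HTTP interface built with Flask, a `/train` endpoint, a JSON payload with `task` and `hyperparameters` fields, and a registry of algorithm services; all of these names are hypothetical.

```python
# A minimal sketch of the response service inside a training container.
# The endpoint, payload fields, and task registry are assumptions made
# for illustration; they are not prescribed by the disclosure.
from flask import Flask, jsonify, request

app = Flask(__name__)

def train_face_detection(hyperparameters):
    ...  # hypothetical algorithm service encapsulating a face detection algorithm

def train_image_classification(hyperparameters):
    ...  # hypothetical algorithm service encapsulating an image classification algorithm

ALGORITHM_SERVICES = {
    "face_detection": train_face_detection,
    "image_classification": train_image_classification,
}

@app.route("/train", methods=["POST"])
def handle_training_request():
    body = request.get_json()
    algorithm_service = ALGORITHM_SERVICES.get(body.get("task"))
    if algorithm_service is None:
        return jsonify({"error": "unknown training task"}), 400
    algorithm_service(body.get("hyperparameters", {}))  # execute the training task
    return jsonify({"status": "started"})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```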
In some embodiments, the plurality of training tasks may be face detection, image classification, and the like, which are not limited here.

In some embodiments, the model corresponding to a training task may be a face detection model, an image classification model, or the like.

In some embodiments, an algorithm service may be a service that encapsulates the algorithm corresponding to a training task. For example, the algorithm may be a face detection algorithm, an image classification algorithm, or the like, which is not limited here.

In some embodiments, the training module may also include other services, and the response service starts such a service to perform a corresponding task based on its response to a request. Another service may be, for example, a data acquisition service for acquiring the corresponding training data.
In some embodiments, as shown in FIG. 2, acquiring the plurality of containers on the plurality of nodes includes:

Step S210: acquiring an image of the training module;

Step S220: mounting the image onto a first node of the plurality of nodes; and

Step S230: in response to the first node running the image, acquiring the container on the first node.
By obtaining an image of the training module and mounting that image onto a node, the node obtains the training module; when the image is run, the container is started, making deployment of the training module efficient and easy to extend.

For example, when a user adds a node, the training module can be deployed simply by mounting its image onto the added node and running the image there, after which the node can participate in distributed training of the model. This greatly improves deployment efficiency and makes the platform easy to extend. Under the assumption of a Docker-based deployment, this workflow might look like the sketch below.
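The sketch uses the Docker Python SDK; the image name and registry are placeholders, and interpreting "mounting the image" as pulling it into the node's local image store is an assumption rather than something the disclosure mandates.

```python
# A sketch of deploying the training module on a newly added node,
# assuming Docker and the Docker Python SDK (docker-py). The image
# name/tag is a placeholder.
import docker

def deploy_training_module(image="registry.example.com/training-module:latest"):
    client = docker.from_env()            # connect to the node's Docker daemon
    client.images.pull(image)             # "mount" the training-module image onto the node
    container = client.containers.run(    # running the image starts the container
        image,
        detach=True,
        ports={"5000/tcp": 5000},         # expose the (assumed) response service port
    )
    return container

if __name__ == "__main__":
    print(deploy_training_module().id)
```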
Referring to FIG. 3, a schematic diagram of a distributed training platform for a model according to some embodiments of the present disclosure is shown.

The distributed training platform 300 for a model includes a plurality of nodes (nodes 310 to 3n0), each of which has a training module (training modules 311 to 3n1) deployed on it through a container, where n is a positive integer.

The process of the distributed training method according to the present disclosure is carried out through node 310. In response to obtaining a first request corresponding to a first training task, node 310 determines a plurality of training nodes among the plurality of nodes (nodes 310 to 3n0), thereby obtaining a plurality of training containers, and generates, based on the first request, corresponding training requests to send to the plurality of training containers.

In some embodiments, the first request may be a request received from a user through an input/output device. The first request includes information indicating the training nodes corresponding to the first training task, data information, and the like, which are not limited here.

In one example, the information of a training node may be the IP address of the training node, and the data information may be information indicating the training data set required by the first training task and the annotation label of each piece of training data in the training data set.
In some embodiments, the first request includes training parameters related to the training task, and the method further includes:

generating the training request, the training request indicating the training parameters.

By sending the training parameters determined by the user to the training containers in the form of training requests, the training module in each training container executes the training task based on the corresponding training request, realizing distributed training of the model, and the training tasks supported by distributed training are easy to extend.

In some embodiments, the training parameters are hyperparameters of the model, which are not adjusted based on the loss during training. The hyperparameters may be, for example, the regularization coefficient λ or the depth of trees in a decision tree model. By setting hyperparameters, different algorithm combinations can be configured to implement different training tasks. A sketch of fanning such a request out to the training containers follows.
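The sketch below shows how the coordinating node might broadcast a first training request carrying user-supplied hyperparameters. The container addresses, port, endpoint, and JSON schema are illustrative assumptions, not part of the disclosure.

```python
# A sketch of the coordinating node sending a first training request,
# including hyperparameters, to each selected training container. The
# IPs, port, endpoint, and payload schema are hypothetical.
import requests

training_containers = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]  # hypothetical training node IPs

first_training_request = {
    "task": "image_classification",
    "hyperparameters": {
        "regularization_lambda": 0.01,  # e.g. the regularization coefficient λ
        "tree_depth": 8,                # e.g. tree depth for a decision tree model
    },
}

for host in training_containers:
    response = requests.post(f"http://{host}:5000/train", json=first_training_request)
    response.raise_for_status()  # each training module starts the task on success
```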
In some embodiments, as shown in FIG. 4, the method according to the present disclosure further includes:

Step S410: obtaining training data information corresponding to the first training task, the training data information indicating the training data set of the first training task and the annotation label of each piece of training data in the training data set; and

Step S420: sending the training data information to each of the plurality of training containers, so that the training modules in the plurality of training containers respectively acquire the training data.

By obtaining the training data information and sending it to the training containers, the training data is deployed separately from the training module, which further simplifies deployment of the distributed training platform.

In some embodiments, the training data information is obtained based on the data information indicated in the first request.

For example, the first request contains the address of the training data, and the training data information is obtained based on that address.

In some embodiments, the training data information is obtained from training data uploaded by the user.

In some embodiments, the training data information includes each piece of training data in the training data set together with its annotation label.

By sending the training data itself to every training container, each training container trains based on the received training data and its annotation labels, further simplifying deployment of the distributed training platform.

In some embodiments, the training data information includes an acquisition address for each piece of training data in the training data set, and the training data can be obtained by accessing that address.

When the training data set is large, it is mounted on a storage module, and the address of each piece of training data in the set is sent so that the training module in each training container accesses that address to obtain the training data. This reduces the storage space needed for the training data set and avoids congesting the communication links by transmitting the training data set to multiple training modules at the same time.

For example, the training data set is stored on network-attached storage (NAS), and the acquisition address of each piece of training data on the NAS is sent to the training containers, so that each training container accesses the NAS based on those addresses to obtain the training data set.
In some embodiments, the method according to the present disclosure further includes:

obtaining a data order corresponding to each of the plurality of training containers, the data order indicating the order in which the training module of that training container inputs the plurality of pieces of training data in the training data set when executing the first training task; and in this case, sending the training data information to the plurality of training containers includes:

sending the training data information to each training container based on that training container's corresponding data order.

By obtaining a data order for each training container and sending the training data information to that container accordingly, the training module of each container inputs the pieces of training data in its own order when executing the first training task. That is, the training modules of the various training containers use the same training data set for the first training task but input its items in different orders, so the trained model reflects training results that take the input order of the training data into account and is therefore more robust.

In some embodiments, the training data information includes each piece of training data in the training data set and its annotation label. When the training data information is sent based on a training container's data order, the pieces of training data are arranged in that order and then sent one by one.

In some embodiments, the training data information includes the annotation label and acquisition address of each piece of training data in the training data set. When the training data information is sent based on a training container's data order, the acquisition addresses are arranged in that order and then sent one by one.

In some embodiments, a service for obtaining the data order may also be deployed in the training module, so that the training module of each training container inputs the pieces of training data in the training data set based on its corresponding data order when executing the first training task.

For example, a function for determining the input order of the training data, such as a random function, is encapsulated in the training module to sort the pieces of training data in the obtained training data set, thereby producing the corresponding data order. The sketch below illustrates one such per-container ordering.
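As a sketch only: the seeding scheme below (one reproducible seed per container index) is an illustrative choice; the disclosure only requires that different containers input the same data set in different orders. The NAS-style paths are hypothetical.

```python
# A sketch of giving every training container its own input order over
# the same training data set, via a per-container random seed.
import random

def ordered_addresses(data_addresses, container_index):
    rng = random.Random(container_index)  # distinct, reproducible order per container
    shuffled = list(data_addresses)
    rng.shuffle(shuffled)                 # same data set, container-specific order
    return shuffled

addresses = [f"nas://dataset/sample_{i}.jpg" for i in range(8)]  # hypothetical NAS addresses
for container_index in range(3):          # e.g. three training containers
    print(container_index, ordered_addresses(addresses, container_index))
```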
In some embodiments, during the training process the containers communicate with one another to synchronize gradient and parameter information, for example through NCCL (the NVIDIA Collective Communications Library).
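One common way to obtain this NCCL-based synchronization, sketched here under the assumption (not made by the disclosure) that the training modules use PyTorch, is `torch.distributed` with the `nccl` backend:

```python
# A sketch of NCCL-backed gradient/parameter synchronization between the
# training containers, assuming the training modules use PyTorch. The
# disclosure does not prescribe a framework; this is illustrative only.
# MASTER_ADDR/MASTER_PORT are assumed to be set in each container's environment.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def wrap_for_distributed_training(model, rank, world_size):
    # Each training container joins the same process group; NCCL then
    # carries the gradient all-reduce traffic between the containers' GPUs.
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    device = rank % torch.cuda.device_count()
    model = model.cuda(device)
    return DDP(model, device_ids=[device])  # gradients are synchronized on each backward pass
```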
In some embodiments, the method according to the present disclosure further includes:

sending an acquisition request to the plurality of training containers, so as to obtain the loss produced by the training module of each of the plurality of training containers after executing the first training task.

By sending acquisition requests to the training containers, the losses produced during training are obtained and displayed to the user, enabling real-time monitoring of the distributed training process of the model. A polling sketch follows.
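A minimal sketch of such monitoring, assuming each training container exposes a `/loss` endpoint returning JSON (an assumption; the disclosure does not specify the interface):

```python
# A sketch of polling each training container for its current loss so it
# can be displayed to the user. Endpoint and response shape are hypothetical.
import time
import requests

training_containers = ["10.0.0.2", "10.0.0.3", "10.0.0.4"]  # hypothetical training node IPs

def poll_losses():
    return {
        host: requests.get(f"http://{host}:5000/loss").json()["loss"]
        for host in training_containers
    }

for _ in range(10):          # e.g. refresh the display ten times
    print(poll_losses())     # e.g. {'10.0.0.2': 0.41, '10.0.0.3': 0.39, ...}
    time.sleep(30)           # poll every 30 seconds
```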
According to another aspect of the present disclosure, a data processing method is also provided, including:

acquiring data to be processed; and

inputting the data to be processed into a processing model, the processing model being obtained by training with the distributed training method according to the present disclosure.

In some embodiments, the processing model may be an object detection model, a character recognition model, or the like, which is not limited here.

In some embodiments, the data to be processed may be image data, audio data, or the like, which is not limited here.
According to another aspect of the present disclosure, a distributed training apparatus for a model is also provided. As shown in FIG. 5, the apparatus 500 includes: a container acquisition unit 510 configured to acquire a plurality of containers on a plurality of nodes, each of the plurality of containers being deployed with a training module capable of executing each of a plurality of training tasks in response to a training request corresponding to that training task; a determining unit 520 configured to determine, in response to receiving a first request corresponding to a first training task, a plurality of training containers corresponding to the first training task from the plurality of containers; and a training request unit 530 configured to send a first training request corresponding to the first training task to each of the plurality of training containers, so that the training module in each of the plurality of training containers executes the first training task, thereby training a model corresponding to the first training task.

In some embodiments, the container acquisition unit 510 includes: an image acquisition unit configured to acquire an image of the training module; a mounting unit configured to mount the image onto a first node of the plurality of nodes; and an acquisition subunit configured to acquire the container on the first node in response to the first node running the image.

In some embodiments, the first request includes training parameters related to the training task, and the apparatus further includes: a training request generating unit configured to generate the training request, the training request indicating the training parameters.

In some embodiments, the apparatus further includes: a training data acquisition unit configured to obtain training data information corresponding to the first training task, the training data information indicating the training data set of the first training task and the annotation label of each piece of training data in the training data set; and a sending unit configured to send the training data information to each of the plurality of training containers, so that the training modules in the plurality of training containers respectively acquire the training data.

In some embodiments, the training data information includes each piece of training data in the training data set and its annotation label.

In some embodiments, the training data information includes the annotation label and acquisition address of each piece of training data in the training data set, and the training data can be obtained by accessing that address.

In some embodiments, the apparatus further includes: an order acquisition unit configured to obtain a data order corresponding to each of the plurality of training containers; and the sending unit further includes: a sending subunit configured to send the training data information to each training container based on that training container's corresponding data order, so that the training module of the training container inputs the training data set in that order when executing the first training task.

In some embodiments, the apparatus further includes: a loss request unit configured to send an acquisition request to the plurality of training containers, so as to obtain the loss produced by the training module of each of the plurality of training containers after executing the first training task.
According to another aspect of the present disclosure, a data processing apparatus is also provided. As shown in FIG. 6, the apparatus 600 includes: a data acquisition unit 610 configured to acquire data to be processed; and a data input unit 620 configured to input the data to be processed into a processing model, the processing model being obtained by training with the distributed training method according to the present disclosure.

According to another aspect of the present disclosure, an electronic device is also provided, including: at least one processor; and a memory communicatively connected to the at least one processor, the memory storing instructions executable by the at least one processor, the instructions being executed by the at least one processor to enable the at least one processor to perform the method according to the present disclosure.

According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is also provided, the computer instructions being used to cause a computer to perform the method according to the present disclosure.

According to another aspect of the present disclosure, a computer program product is also provided, including a computer program that, when executed by a processor, implements the method according to the present disclosure.
Referring to FIG. 7, a structural block diagram of an electronic device 700 that can serve as a server or a client of the present disclosure will now be described; it is an example of a hardware device that can be applied to aspects of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular phones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not intended to limit the implementations of the present disclosure described and/or claimed herein.

As shown in FIG. 7, the electronic device 700 includes a computing unit 701 that can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 can also store various programs and data required for the operation of the electronic device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to one another through a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.

A plurality of components in the electronic device 700 are connected to the I/O interface 705, including an input unit 706, an output unit 707, the storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700; it can receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touchscreen, a trackpad, a trackball, a joystick, a microphone, and/or a remote control. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, a speaker, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 708 may include, but is not limited to, a magnetic disk or an optical disc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication transceiver, and/or a chipset, such as a Bluetooth(TM) device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.

The computing unit 701 may be a variety of general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various dedicated artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any appropriate processor, controller, microcontroller, or the like. The computing unit 701 performs the various methods and processes described above, such as the method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the method 200 described above can be performed. Alternatively, in other embodiments, the computing unit 701 may be configured to perform the method 200 by any other suitable means (for example, by means of firmware).
Various implementations of the systems and techniques described herein above may be implemented in digital electronic circuit systems, integrated circuit systems, field programmable gate arrays (FPGA), application-specific integrated circuits (ASIC), application-specific standard products (ASSP), systems on chip (SOC), complex programmable logic devices (CPLD), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include: being implemented in one or more computer programs, the one or more computer programs being executable and/or interpretable on a programmable system including at least one programmable processor, where the programmable processor, which may be a special-purpose or general-purpose programmable processor, can receive data and instructions from a storage system, at least one input apparatus, and at least one output apparatus, and transmit data and instructions to the storage system, the at least one input apparatus, and the at least one output apparatus.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that when the program code is executed by the processor or controller, the functions/operations specified in the flowcharts and/or block diagrams are implemented. The program code may execute entirely on a machine, partly on a machine, partly on a machine and partly on a remote machine as a stand-alone software package, or entirely on a remote machine or server.

In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer that has: a display apparatus for displaying information to the user (for example, a CRT (cathode ray tube) or LCD (liquid crystal display) monitor); and a keyboard and pointing apparatus (for example, a mouse or trackball) through which the user can provide input to the computer. Other kinds of apparatuses may also be used to provide interaction with the user; for example, the feedback provided to the user may be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form (including acoustic input, voice input, or tactile input).

The systems and techniques described herein may be implemented in a computing system that includes back-end components (for example, as a data server), or a computing system that includes middleware components (for example, an application server), or a computing system that includes front-end components (for example, a user computer with a graphical user interface or web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication (for example, a communication network) in any form or medium. Examples of communication networks include a local area network (LAN), a wide area network (WAN), and the Internet.

A computer system may include a client and a server. The client and the server are generally remote from each other and usually interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, as long as the results desired by the technical solutions disclosed in the present disclosure can be achieved, which is not limited here.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it should be understood that the methods, systems, and devices described above are merely exemplary embodiments or examples, and the scope of the present invention is not limited by these embodiments or examples but only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced by their equivalents. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, the various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements that appear after the present disclosure.
Claims (21)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210476218.9A CN114816669A (en) | 2022-04-29 | 2022-04-29 | Distributed training method and data processing method of model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210476218.9A CN114816669A (en) | 2022-04-29 | 2022-04-29 | Distributed training method and data processing method of model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114816669A true CN114816669A (en) | 2022-07-29 |
Family ID: 82511239
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210476218.9A Pending CN114816669A (en) | 2022-04-29 | 2022-04-29 | Distributed training method and data processing method of model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114816669A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN115909852A (en) * | 2022-12-19 | 2023-04-04 | 北京百度网讯科技有限公司 | Automatic driving training method, device, system, equipment and storage medium |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180018590A1 (en) * | 2016-07-18 | 2018-01-18 | NantOmics, Inc. | Distributed Machine Learning Systems, Apparatus, and Methods |
CN107733977A (en) * | 2017-08-31 | 2018-02-23 | 北京百度网讯科技有限公司 | A kind of cluster management method and device based on Docker |
CN111753997A (en) * | 2020-06-28 | 2020-10-09 | 北京百度网讯科技有限公司 | Distributed training method, system, device and storage medium |
CN112000473A (en) * | 2020-08-12 | 2020-11-27 | 中国银联股份有限公司 | Distributed training method and device for deep learning model |
CN112000450A (en) * | 2020-08-18 | 2020-11-27 | 中国银联股份有限公司 | Neural network architecture search method and device |
CN112364897A (en) * | 2020-10-27 | 2021-02-12 | 曙光信息产业(北京)有限公司 | Distributed training method and device, storage medium and electronic equipment |
CN112561078A (en) * | 2020-12-18 | 2021-03-26 | 北京百度网讯科技有限公司 | Distributed model training method, related device and computer program product |
CN113569987A (en) * | 2021-08-19 | 2021-10-29 | 北京沃东天骏信息技术有限公司 | Model training method and device |
Non-Patent Citations (2)
Title |
---|
XIAN, L.T. et al.: "H-PS: A Heterogeneous-Aware Parameter Server With Distributed Neural Network Training", IEEE Access, 31 December 2021 (2021-12-31), pages 44049-44058 *
李俊江 (LI, Junjiang): "基于Kubernetes的机器学习云平台设计与实现" [Design and Implementation of a Machine Learning Cloud Platform Based on Kubernetes], 中国优秀硕士学位论文全文数据库(电子期刊) [China Masters' Theses Full-text Database (Electronic Journal)], vol. 2022, no. 03, 15 March 2022 (2022-03-15) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20230005284A1 (en) | Method for training image-text matching model, computing device, and storage medium | |
CN112559007B (en) | Parameter update method, device and electronic device for multi-task model | |
KR20210090122A (en) | Distributed model training apparatus, methods and computer program | |
CN112966742A (en) | Model training method, target detection method and device and electronic equipment | |
KR20220049604A (en) | Object recommendation method and apparatus, computer device and medium | |
CN114091672B (en) | Distributed model reasoning method and device, electronic equipment and medium | |
CN111814959A (en) | Model training data processing method, device, system and storage medium | |
CN114841315A (en) | Hybrid expert model realization method, system, electronic device and storage medium | |
JP2022017588A (en) | Training method of deep-running framework, device, and storage medium | |
CN115511779B (en) | Image detection method, device, electronic equipment and storage medium | |
CN114817612A (en) | Method and related device for calculating multi-modal data matching degree and training calculation model | |
CN114723949A (en) | 3D scene segmentation method and method for training segmentation model | |
CN114816669A (en) | Distributed training method and data processing method of model | |
CN114997329A (en) | Method, apparatus, apparatus, medium and product for generating models | |
US11797353B2 (en) | Method and system for performing workloads in a data cluster | |
CN114330576A (en) | Model processing method, device, image recognition method and device | |
WO2025020800A1 (en) | Data processing method, contribution value acquisition method and related device | |
CN115953771A (en) | Text image processing method, device, equipment and medium | |
CN114494797A (en) | Method and apparatus for training an image detection model | |
CN114429678A (en) | Model training method and device, electronic device and medium | |
CN114743586A (en) | Mirror storage implementation method, device and storage medium of memory model | |
CN114386577A (en) | Method, apparatus and storage medium for executing deep learning models | |
CN114842474B (en) | Character recognition method, device, electronic equipment and medium | |
CN115525554B (en) | Automatic test method, system and storage medium for model | |
CN111797871A (en) | Information processing method, device, storage medium and electronic device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |