CN118014098B - Machine learning training data scheduling method and equipment
- Publication number: CN118014098B (application CN202410155756.7A)
- Authority: CN (China)
- Legal status: Active
Classifications
- G06N 20/00 — Machine learning
- G06F 16/116 — Details of conversion of file system types or formats
- G06F 16/16 — File or folder operations, e.g. details of user interfaces specifically adapted to file systems
- Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management
Description
Technical Field
The present application relates to the field of machine learning, and in particular to a machine learning training data scheduling method and device.
Background Art
With the development of the big data era, mining potentially useful knowledge from massive data has become a common demand across major industries. Machine learning is a subfield of artificial intelligence that has flourished against the background of the big data era. The core problem it solves is discovering knowledge from data and saving the learned knowledge into predictive or explanatory models for subsequent application; predictive models and explanatory models are collectively referred to as models. The machine learning workflow is generally divided into two stages: the first is the learning stage, also known as the training stage, which takes data as input and outputs a model; the second is the application stage (for predictive models, also known as the prediction stage), which applies the model to new data to obtain the model's output. The application stage is relatively efficient: data can generally be processed item by item to obtain the prediction or explanation result for each item. The training stage, by contrast, is the performance bottleneck of the overall machine learning workflow: it generally has to consider all the training data and make multiple non-sequential passes over it to obtain a reliable and effective model.
To guarantee the effectiveness of the training stage, most machine learning algorithms require all training data to be placed in memory for processing. When the training data as a whole does not fit in memory, a machine learning algorithm can manage the data exceeding memory capacity by using the operating system's disk as virtual memory. This approach involves disk access: the original disk files storing all the training data must be converted into the data required by the machine learning algorithm, and the conversion process demands efficient disk data processing from the computer device. However, current conversion processes for disk data are rather slow, and the operating system may even occasionally crash during processing, so the training data conversion efficiency of machine learning is low.
Summary of the Invention
The present application provides a machine learning training data scheduling method and device, to solve the technical problem of low data conversion efficiency when a computer device performs training data conversion for a machine learning algorithm.
In a first aspect, a machine learning training data scheduling method is provided, applied to a computer device, the method comprising: when a target machine learning algorithm starts training, obtaining a grafting module corresponding to the target machine learning algorithm, and triggering the startup of a general module layer through the grafting module; determining a target disk file according to a directed acyclic graph in the general module layer; constructing a machine learning training module subgraph from the adaptation modules on all directed paths from the target disk file to the grafting module; and, based on the machine learning training module subgraph, converting the original training data in the target disk file into the data required by the target machine learning algorithm.
In a second aspect, a computer device is provided, comprising a memory and a processor, the memory being connected to the processor, and the processor being configured to execute one or more computer programs stored in the memory to implement the method of the first aspect.
The present application can achieve the following technical effects. The computer device is preconfigured with a general module layer. When it detects that the target machine learning algorithm starts training, it obtains the grafting module corresponding to the target machine learning algorithm and triggers the startup of the general module layer through that grafting module; it then determines, according to the directed acyclic graph in the general module layer, the target disk file that the target machine learning algorithm needs to use, constructs a machine learning training module subgraph from the adaptation modules on all directed paths from the target disk file to the grafting module, and, based on that subgraph, converts the original training data in the target disk file into the data required by the target machine learning algorithm. By carrying out the training data conversion through the preset general module layer, this method ensures that an ordinary computer can efficiently process massive training data stored in disk files, allows the training stage of a machine learning algorithm to be carried out on massive training data, and improves the training data conversion efficiency of machine learning.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of an existing process for converting machine learning algorithm training data;
FIG. 2 is a schematic flow chart of a machine learning training data scheduling method provided by an embodiment of the present application;
FIG. 3 is an example diagram of a general module layer provided by an embodiment of the present application;
FIG. 4 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below with reference to the drawings of those embodiments.
To make the purpose, technical solutions, and advantages of the present application clearer, the present application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present application and are not intended to limit it. Based on the embodiments in the present application, all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the scope of protection of the present application.
It should be noted that, provided they do not conflict, the features of the embodiments of the present application may be combined with one another, and all such combinations fall within the scope of protection of the present application. In addition, although functional modules are divided in the device schematic diagrams and a logical order is shown in the flow charts, in some cases the steps shown or described may be performed with a module division different from that in the device, or in an order different from that in the flow chart. Furthermore, the words "first", "second", "third", and so on used in this application do not limit the data or the execution order; they merely distinguish items that are identical or similar in function and effect.
To make the present application easier to understand, machine learning algorithms are first introduced. Machine learning is a subfield of artificial intelligence that has flourished against the background of the big data era; its core is "using algorithms to parse data, learn from it, and then explain or predict new data". In other words, machine learning discovers knowledge from data and saves the learned knowledge in a model for subsequent application. The machine learning workflow is generally divided into two stages: the first is the learning stage, also known as the training stage, which takes data as input and outputs a model; the second is the application stage (for predictive models, also known as the prediction stage), which applies the model to new data to obtain the model's output. The application stage is relatively efficient: data can generally be processed item by item to obtain the prediction or explanation result for each item. The training stage, by contrast, is the performance bottleneck of the overall workflow: it generally has to consider all the training data and make multiple non-sequential passes over it to obtain a reliable and effective model. To guarantee the effectiveness of the training stage, most machine learning algorithms require all training data to be placed in memory for processing; machine learning algorithms therefore place heavy demands on memory.
There are two common solutions to the problem of the excessive memory requirements of the above machine learning algorithms. One is to upgrade the hardware, adding memory modules and configuring larger available memory to ensure efficient execution of the machine learning algorithm's training stage. The other is to deploy a distributed system, spreading the training data across multiple machine nodes whose limited memory capacity cannot individually hold all of it, and using an efficient data communication mechanism to ensure the machine learning algorithm trains as usual across the nodes. However, both solutions involve strengthening the hardware, whether by adding memory or by adding computing and network communication equipment; they cannot solve the problem while keeping the original machine hardware, and their cost is high.
In their in-depth study of the prior art, the inventors of the present application found two further solutions in the related art that can convert machine learning training data without changing the hardware of the computer device. Specifically, one solution upgrades the machine learning algorithm itself, transforming a memory-based algorithm into a data-streaming, incremental-learning, or iterative-learning algorithm that supports disk input; all of these algorithm flows require sequential access to the training data. The other solution is approximate learning based on random sampling: before training the model, a training subset that fits the current memory capacity is randomly sampled from all the training data, learning proceeds on that subset, and the result is an approximate model close to the standard model that would be learned from all the training data. Both are hardware-independent solutions: they require no changes to the machine hardware and suit a wider range of application scenarios. However, both involve disk access, and the performance bottleneck lies in the conversion process from the original disk files storing all the training data to the sequentially accessed data stream required by the machine learning algorithm, or to a sample set no larger than the current memory capacity, as shown in FIG. 1. This data conversion process involves view transformations of the disk file data, such as adding, deleting, or modifying certain columns or filtering certain rows, and it requires these view transformations to be performed efficiently on disk; the full data cannot be loaded into memory for processing. Both solutions therefore need efficient disk data processing techniques to implement the data conversion process shown in FIG. 1 and thereby guarantee their overall effectiveness and practicality.
In view of this, the present application proposes a machine learning training data scheduling method and device. A data conversion framework for machine learning algorithms can be built from a preset general module layer, efficiently converting the original training data files into the data needed during the training stage of the machine learning algorithm and improving the training data conversion efficiency of machine learning.
The basic idea of the present application is to design the data conversion process of FIG. 1 as a general module layer whose nodes are adaptation modules. The nodes pass data objects through configured input and output ports, and the flow routes of these data objects form a directed acyclic graph. When a machine learning algorithm that needs to be invoked starts training, the grafting module connected to that algorithm initiates the execution of the general module layer, forming a machine learning training module subgraph. The adaptation module nodes of the subgraph are processed layer by layer from top to bottom according to the original hierarchy; within each layer, all adaptation modules whose output ports carry disk files or memory data are executed concurrently. When a current-layer adaptation module executes, it drives the upper-layer adaptation modules connected to it by file pointers to execute as well, which in turn drives those upper-layer modules to access their input disk files sequentially. This scheme effectively solves the problem of the excessive memory requirements of machine learning algorithms: without adding or modifying computer hardware, it ensures that an ordinary computer can effectively process the massive training data stored in disk files, allows the training stage of a machine learning algorithm to be carried out on massive training data, and improves the training data conversion efficiency of the computer device.
The machine learning training data scheduling method of the present application is described in detail below. FIG. 2 is a schematic flow chart of a machine learning training data scheduling method according to the present application; the method can be applied to a computer device, and as shown in FIG. 2 it comprises the following steps.
S201. When the target machine learning algorithm starts training, obtain the grafting module corresponding to the target machine learning algorithm, and trigger the startup of the general module layer through the grafting module.
It should be noted that the computer device may store the training data required by multiple different machine learning algorithms; the target machine learning algorithm is the machine learning algorithm for which the computer device has detected that training data conversion needs to start.
It should also be noted that the general module layer is a preset virtual module layer for converting machine learning algorithm training data. In one embodiment, the general module layer includes at least two adaptation modules, one of which is the grafting module corresponding to the target machine learning algorithm. The adaptation modules are configured hierarchically as the nodes of the general module layer and are used to pass data objects; the flow paths of the data objects form the directed acyclic graph in the general module layer, where a data object is any one or more of a disk file, a file pointer, and memory data.
In one embodiment, each adaptation module is provided with at least one input port and at least one output port; the adaptation module receives the data objects output by the previous-layer nodes through its input ports and passes the converted data objects to the next-layer nodes through its output ports.
In one embodiment, the adaptation modules include any one or more of the following types: data view generation module, data view column transformation module, data view row transformation module, and data view batch processing module.
The data view generation module takes a disk file as its input data object and outputs a file pointer for sequential access to that disk file. The data view column transformation module includes functional modules for adding columns, deleting columns, and changing column values in a data view; its input data object is a file pointer and its output data object is a file pointer. The data view row transformation module includes functional modules for randomly sampling rows in a data view, filtering rows by a specified condition, filtering duplicate rows, and merging the data rows of multiple data views in order; its input data objects are any one or more of disk files, file pointers, and memory data, and its output data objects are likewise any one or more of disk files, file pointers, and memory data.
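The generation, column transformation, and row transformation modules above can be illustrated with a minimal sketch. This is an assumption for illustration only (the function names and the CSV format are not from the patent): a "file pointer" is modeled as a row iterator, so the transforms compose lazily and the file is read sequentially without ever being fully loaded into memory.

```python
import csv
import io

def generate_view(f):
    """Data view generation: disk file -> file pointer (sequential row iterator)."""
    return csv.DictReader(f)

def delete_column(rows, name):
    """Column transformation: delete one column from the data view."""
    for row in rows:
        row.pop(name, None)
        yield row

def filter_rows(rows, predicate):
    """Row transformation: keep only rows satisfying the specified condition."""
    return (row for row in rows if predicate(row))

# Hypothetical three-row training file, held in a string for illustration.
raw = io.StringIO("id,label,noise\n1,cat,7\n2,dog,9\n3,dog,5\n")
view = filter_rows(delete_column(generate_view(raw), "noise"),
                   lambda r: r["label"] == "dog")
result = list(view)
# result -> [{'id': '2', 'label': 'dog'}, {'id': '3', 'label': 'dog'}]
```

Because each stage is a generator, no stage materializes the whole view; this mirrors the requirement that view transformations be performed on streaming disk data.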
The data view batch processing module includes batch processing by a fixed number of rows and batch processing by a fixed number of sequences, where a sequence is a set of data rows sharing the same ID. Its input data object is a file pointer, and its output data object is batched memory data; the batched memory data is consumed by an internal process initiated by the target machine learning algorithm.
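A sketch of batch processing by a fixed number of rows, under the same illustrative assumptions as before (the function name is not from the patent): the module consumes a file pointer, modeled as any row iterator, and yields in-memory batches on demand, so a training loop pulls one batch at a time.

```python
from itertools import islice

def batch_by_rows(rows, batch_size):
    """Fixed-row-count batching: file pointer (row iterator) -> memory-data batches."""
    it = iter(rows)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return  # input exhausted; the last batch may be short
        yield batch

batches = list(batch_by_rows(range(7), 3))
# batches -> [[0, 1, 2], [3, 4, 5], [6]]
```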
In one embodiment, the top-level nodes of the directed acyclic graph are data view generation modules, the bottom-level nodes are modules of any type other than the data view generation module, and the intermediate nodes are modules of any type other than the data view batch processing module.
As a feasible implementation, the computer device can be preconfigured with a general module layer for the training data conversion of machine learning algorithms, composed of adaptation modules as nodes. Each adaptation module node may have multiple input ports, each receiving a data object such as a disk file, a file pointer, or memory data; a node may also have multiple output ports, each outputting a data object such as a disk file, a file pointer, or memory data. A node has at most one disk file output port and at most one memory data output port, and a disk file can be represented by its file path. The adaptation module nodes pass data objects through their input and output ports, and the flow routes of these data objects form a directed acyclic graph (DAG).
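The node-and-port structure above might be modeled as follows. This is an illustrative sketch, not the patent's implementation: the class name and fields are assumptions, and each node is simplified to a single output kind.

```python
from dataclasses import dataclass, field

# The three kinds of data object an adaptation module's ports may carry.
DISK_FILE, FILE_POINTER, MEMORY_DATA = "disk_file", "file_pointer", "memory_data"

@dataclass
class AdapterModule:
    module_id: int
    layer: int                                   # hierarchy level; top layer is 0
    output_kind: str                             # DISK_FILE, FILE_POINTER, or MEMORY_DATA
    inputs: list = field(default_factory=list)   # upstream AdapterModule nodes

    def connect(self, upstream: "AdapterModule") -> None:
        """Wire an upstream node's output port to one of this node's input ports."""
        self.inputs.append(upstream)

# A fragment of the FIG. 3 example: data view generation module 2 emits a file
# pointer consumed by row transformation module 9, the grafting module that
# emits memory data for machine learning algorithm 16.
m2 = AdapterModule(2, layer=0, output_kind=FILE_POINTER)
m9 = AdapterModule(9, layer=1, output_kind=MEMORY_DATA)
m9.connect(m2)
```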
For example, FIG. 3 shows an example of a general module layer provided by the present application. The general module layer in FIG. 3 includes three machine learning algorithms awaiting scheduling (in other embodiments any other number may be included) and seven adaptation modules. The seven adaptation modules are configured in layers: adaptation module 2 is at the top layer, adaptation modules 7 to 10 are in the middle layer, and adaptation modules 15 and 17 are at the bottom layer. Adaptation module 2 may be a data view generation module; modules 8 and 10 may be data view column transformation modules; modules 7, 9, and 15 may be data view row transformation modules; and module 17 may be a data view batch processing module. The grafting module of machine learning algorithm 16 is adaptation module 9, that of machine learning algorithm 20 is adaptation module 15, and that of machine learning algorithm 21 is adaptation module 17.
In one embodiment, the grafting module of the target machine learning algorithm is the data view batch processing module, or an adaptation module whose output data object is memory data.
For example, the batch processing module takes a file pointer as input and outputs batched memory data for use by an internal process, where the internal process is initiated by the machine learning algorithm to be invoked; in this case the grafting module of the machine learning algorithm can be defined as the corresponding data view batch processing module. Alternatively, a memory data output port can be connected to the machine learning algorithm to be invoked; in this case the grafting module can be defined as the adaptation module that outputs the memory data.
As another example, referring again to FIG. 3, suppose machine learning algorithm 16 needs to start training. The computer device can take algorithm 16 as the target machine learning algorithm, obtain the grafting module connected to it, namely adaptation module 9, and trigger the startup and execution of the entire general module layer through that grafting module.
S202. Determine the target disk file according to the directed acyclic graph in the general module layer.
It should be noted that the target disk file is a disk file in the directed acyclic graph from which a directed path reaches the grafting module corresponding to the target machine learning algorithm.
As a feasible implementation, when the target machine learning algorithm starts training, its corresponding grafting module can initiate the execution of the general module layer. The flow routes of the various data objects in the general module layer form the directed acyclic graph, and the computer device can determine as target disk files those disk files in the graph from which a directed path reaches the grafting module of the target machine learning algorithm. There may be one target disk file or several; the present application places no restriction on this.
In one embodiment, the target disk file is represented as a data view in the form of a two-dimensional table.
S203. Construct a machine learning training module subgraph from the adaptation modules on all directed paths from the target disk file to the grafting module.
For example, when machine learning algorithm 16 is the target machine learning algorithm, the only disk file involved is file 1, so its machine learning training module subgraph consists of the node set {1, 2, 6, 9, 13, 16}. Similarly, when machine learning algorithm 20 is the target machine learning algorithm, the disk files involved are files 1, 3, 4, and 11, so its machine learning training module subgraph consists of the node set {1, 2, 3, 4, 5, 7, 8, 11, 12, 15, 18}.
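One way to compute such a subgraph (an illustrative approach, not pseudocode from the patent) is to observe that a node lies on some directed path from a target disk file to the grafting module exactly when it is forward-reachable from a disk file and backward-reachable from the grafting module. The edge list below reconstructs only the FIG. 3 chain for algorithm 16 and is an assumption for illustration.

```python
def reachable(starts, edges):
    """All nodes reachable from `starts` following the adjacency dict `edges`."""
    seen, stack = set(), list(starts)
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        stack.extend(edges.get(node, ()))
    return seen

def training_subgraph(edges, disk_files, graft):
    """Nodes on some directed path from a target disk file to the grafting module."""
    reverse = {}
    for src, dsts in edges.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    return reachable(disk_files, edges) & reachable([graft], reverse)

# Chain for algorithm 16 in FIG. 3: disk file 1 -> module 2 -> file pointer 6
# -> module 9 -> memory data 13 -> algorithm 16 (numbering as in the text).
edges = {1: [2], 2: [6], 6: [9], 9: [13], 13: [16]}
sub = training_subgraph(edges, disk_files=[1], graft=16)
# sub -> {1, 2, 6, 9, 13, 16}, matching the subgraph stated above
```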
S204. Based on the machine learning training module subgraph, convert the original training data in the target disk file into the data required by the target machine learning algorithm.
As a feasible implementation, the entire conversion process, from the disk files storing all the training data to the input data required by the machine learning algorithm, can be performed within the machine learning training module subgraph of the target machine learning algorithm.
In one embodiment, step S204 includes: according to the hierarchy of the adaptation modules in the machine learning training module subgraph, taking the adaptation module corresponding to the target disk file as the top-level node and triggering the adaptation modules layer by layer from top to bottom, so as to convert the original training data of the target disk file into the data required by the target machine learning algorithm.
In one embodiment, the method further includes: when starting an adaptation module in the current layer, if the input data object of that module is a file pointer, also starting the upper-layer adaptation module connected by the file pointer; and executing the linkage adaptation modules in the current layer concurrently, where a linkage adaptation module is an adaptation module whose output port's data object is a disk file or in-memory data.
For example, in the general module layer shown in FIG. 3, the top-down layering of the adaptation modules in the training module subgraph of algorithm No. 16 is {2} and {9}; similarly, the top-down layering for algorithm No. 20 is {2}, {7, 8}, and {15}. When algorithm No. 16 executes its subgraph, layer {2} is processed first. Since the output of adaptation module No. 2 within the subgraph is neither a disk file nor in-memory data, {2} is skipped and {9} is processed directly. The output of module No. 9 is in-memory data, so it is triggered; it drives the upper-layer module No. 2 connected by a file pointer to execute with it, sequentially reads disk file No. 1, and produces in-memory data No. 13. Similarly, when algorithm No. 20 executes its subgraph, layer {2} is processed first. Since the output of module No. 2 within the subgraph is a disk file, it is triggered, sequentially reads disk file No. 1, and produces disk file No. 4. Layer {7, 8} is then processed in parallel: the output of module No. 7 within the subgraph is a disk file, while the output of module No. 8 is neither a disk file nor in-memory data, so module No. 8 is skipped and only module No. 7 is triggered. Because module No. 7 has no upper-layer module connected by a file pointer, it drives no upper-layer modules; it simply reads disk files No. 3 and No. 4 sequentially and produces disk file No. 11. Finally, {15} is processed. Since the output of module No. 15 within the subgraph is in-memory data, it is triggered and successively drives the file-pointer-connected upper-layer modules No. 8 and No. 2 to execute with it, sequentially reading disk files No. 1 and No. 11 and producing in-memory data No. 18.
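The trigger rules of the walkthrough above — skip a module whose in-subgraph output is neither a disk file nor in-memory data, and let a triggered module first drive its file-pointer-connected upper-layer modules — can be simulated as follows. The module numbers, `output_kind` map, and `upper_by_pointer` map are a hypothetical encoding of the No. 20 example, and this sketch runs sequentially; in the patent the linkage modules within a layer execute concurrently:

```python
def run_subgraph(layers, output_kind, upper_by_pointer):
    """Simulate the trigger rules: per layer (top to bottom), execute only the
    'linkage' modules (those outputting a disk file or memory data); each
    triggered module first drives its file-pointer-connected upper modules."""
    trace = []
    def fire(m):
        for upper in upper_by_pointer.get(m, []):
            fire(upper)            # drive the upper-layer module first
        trace.append(m)            # then the module itself executes
    for layer in layers:
        for m in layer:            # same-layer linkage modules could run concurrently
            if output_kind.get(m) in ("disk", "memory"):
                fire(m)
    return trace

# Hypothetical encoding of the FIG. 3 walkthrough for algorithm No. 20:
layers = [[2], [7, 8], [15]]
output_kind = {2: "disk", 7: "disk", 8: None, 15: "memory"}  # 8 outputs a file pointer
upper_by_pointer = {15: [8, 2]}   # module 15 drives 8 and 2 via file pointers
print(run_subgraph(layers, output_kind, upper_by_pointer))  # -> [2, 7, 8, 2, 15]
```

The resulting trace matches the walkthrough: module 2 fires in its own layer, module 8 is skipped in layer {7, 8}, and module 15 pulls modules 8 and 2 back into execution before producing its in-memory output.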
It should be noted that, in the worst case, the adaptation modules of every layer of the general module layer must be triggered, and during execution all upper-layer disk files are read sequentially at most once. The general module layer therefore minimizes the number of sequential passes over the original training data files and, through training module subgraph extraction and the concurrent execution of the adaptation modules in each layer, efficiently converts the original training data files into either the sequential-access data stream required as input by the machine learning algorithm's training phase or a training sample set that does not exceed the current memory capacity limit.
In summary, the machine learning training data scheduling method of the present application structures the conversion from the original disk files storing all training data to the input data required by a machine learning algorithm as a general module layer, so as to maximize conversion efficiency without changing the hardware of the computer device and while minimizing memory usage. When the target machine learning algorithm starts training, its grafted adaptation module initiates execution of the general module layer. The disk files involved are those that can reach the grafting module along directed paths in the directed acyclic graph, and the nodes on all directed paths from these disk files to the grafting module constitute the machine learning training module subgraph. During data conversion, the adaptation module nodes of the subgraph follow a same-layer parallel computation scheme: the modules are processed layer by layer from top to bottom according to the original layering; within each layer, all adaptation modules with a disk-file or in-memory-data output port are executed concurrently, while modules that output neither are not processed; and an executing current-layer module drives the upper-layer modules connected by file pointers to execute with it, causing those upper-layer modules to read their input disk files sequentially. At most one sequential pass over all upper-layer disk files suffices to produce the output disk files or in-memory data, which effectively improves the efficiency of training data conversion.
Having described the method of the present application, the execution device of the present application is described below.
Referring to FIG. 4, FIG. 4 is a schematic structural diagram of a computer device provided in an embodiment of the present application. The computer device 40 includes a processor 401 and a memory 402, where the memory 402 is connected to the processor 401, for example via a bus.
The processor 401 is configured to support the computer device 40 in performing the corresponding functions of the methods in the foregoing method embodiments. The processor 401 may be a central processing unit (CPU), a graphics processing unit (GPU), or a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The memory 402 is used to store program code and the like. The memory 402 may include volatile memory (VM), such as random access memory (RAM); it may also include non-volatile memory (NVM), such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory 402 may also include a combination of the above types of memory.
The processor 401 may call the program code to perform the following operations:
when the target machine learning algorithm starts training, obtaining the grafting module corresponding to the target machine learning algorithm, and triggering startup of the general module layer through the grafting module;
determining the target disk file according to the directed acyclic graph in the general module layer;
constructing a machine learning training module subgraph from the adaptation modules on all directed paths from the target disk file to the grafting module;
converting the original training data in the target disk file into the data required by the target machine learning algorithm based on the machine learning training module subgraph.
An embodiment of the present application further provides a computer device including a memory and a processor, where the memory is connected to the processor and the processor is configured to execute one or more computer programs stored in the memory to implement the method described in the foregoing embodiments.
A person of ordinary skill in the art will understand that all or part of the processes of the above method embodiments can be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, may include the processes of the above method embodiments. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.
The above disclosure is merely a preferred embodiment of the present application and certainly cannot limit the scope of the claims; equivalent changes made according to the claims of the present application therefore remain within the scope covered by the present application.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410155756.7A CN118014098B (en) | 2024-02-04 | 2024-02-04 | Machine learning training data scheduling method and equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN118014098A CN118014098A (en) | 2024-05-10 |
CN118014098B true CN118014098B (en) | 2024-09-13 |
Family
ID=90942383
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410155756.7A Active CN118014098B (en) | 2024-02-04 | 2024-02-04 | Machine learning training data scheduling method and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118014098B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110807520A (en) * | 2020-01-08 | 2020-02-18 | 成都四方伟业软件股份有限公司 | Method and device for analyzing influence factors of neural nodes of convolutional neural network |
CA3077006A1 (en) * | 2020-03-25 | 2021-09-25 | The Toronto-Dominion Bank | System and method for automatically managing storage resources of a big data platform |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112306623B (en) * | 2019-07-31 | 2024-08-02 | 株式会社理光 | Deep learning task processing method and device and computer readable storage medium |
CN112966272B (en) * | 2021-03-31 | 2022-09-09 | 国网河南省电力公司电力科学研究院 | Internet of things Android malicious software detection method based on countermeasure network |
CN114169531A (en) * | 2021-11-12 | 2022-03-11 | 国电南瑞科技股份有限公司 | Prediction method and system for configuration machine learning modeling task description |
CN114372579A (en) * | 2021-12-30 | 2022-04-19 | 胜斗士(上海)科技技术发展有限公司 | Method of training machine learning model, prediction method, computing device, and medium |
CN116541155A (en) * | 2022-01-21 | 2023-08-04 | 中国石油化工股份有限公司 | Exploration and development cloud resource intelligent scheduling method based on machine learning |
GB202207373D0 (en) * | 2022-05-19 | 2022-07-06 | Samsung Electronics Co Ltd | Method and apparatus for on-device user personalisation |
CN117251725A (en) * | 2023-06-27 | 2023-12-19 | 深圳市绿联科技股份有限公司 | Method and device for identifying data based on machine learning |
CN117314680A (en) * | 2023-09-27 | 2023-12-29 | 北京邮电大学 | Power dispatching monitoring data anomaly detection method based on double-path self-encoder |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||