CN108985367A - Computing engine selection method and multi-computing-engine platform based on this method - Google Patents
Computing engine selection method and multi-computing-engine platform based on this method
- Publication number
- CN108985367A CN108985367A CN201810734031.8A CN201810734031A CN108985367A CN 108985367 A CN108985367 A CN 108985367A CN 201810734031 A CN201810734031 A CN 201810734031A CN 108985367 A CN108985367 A CN 108985367A
- Authority
- CN
- China
- Prior art keywords
- task
- computing engines
- execution time
- computing
- task execution
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Links
- 238000000034 method Methods 0.000 title claims abstract description 42
- 238000010187 selection method Methods 0.000 title claims abstract description 14
- 238000012549 training Methods 0.000 claims abstract description 48
- 238000004422 calculation algorithm Methods 0.000 claims description 60
- 238000007726 management method Methods 0.000 claims description 40
- 238000013500 data storage Methods 0.000 claims description 15
- 238000012417 linear regression Methods 0.000 claims description 14
- 230000008569 process Effects 0.000 claims description 12
- 230000008859 change Effects 0.000 claims description 5
- 238000004590 computer program Methods 0.000 claims description 4
- 230000001419 dependent effect Effects 0.000 claims description 4
- 238000005457 optimization Methods 0.000 claims 2
- 230000002452 interceptive effect Effects 0.000 claims 1
- 238000004364 calculation method Methods 0.000 abstract description 34
- 238000012545 processing Methods 0.000 description 23
- 238000012360 testing method Methods 0.000 description 18
- 230000006870 function Effects 0.000 description 12
- 230000003993 interaction Effects 0.000 description 6
- 238000010801 machine learning Methods 0.000 description 5
- 238000011161 development Methods 0.000 description 4
- 238000005516 engineering process Methods 0.000 description 4
- 230000007246 mechanism Effects 0.000 description 3
- 238000004140 cleaning Methods 0.000 description 2
- 238000002955 isolation Methods 0.000 description 2
- 238000012544 monitoring process Methods 0.000 description 2
- 238000003672 processing method Methods 0.000 description 2
- 230000002159 abnormal effect Effects 0.000 description 1
- 230000003213 activating effect Effects 0.000 description 1
- 230000004075 alteration Effects 0.000 description 1
- 238000004458 analytical method Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000004891 communication Methods 0.000 description 1
- 238000013480 data collection Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 238000012217 deletion Methods 0.000 description 1
- 230000037430 deletion Effects 0.000 description 1
- 238000013461 design Methods 0.000 description 1
- 238000010586 diagram Methods 0.000 description 1
- 238000005538 encapsulation Methods 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000006872 improvement Effects 0.000 description 1
- 238000011423 initialization method Methods 0.000 description 1
- 239000011159 matrix material Substances 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000003287 optical effect Effects 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 230000003068 static effect Effects 0.000 description 1
- 230000002123 temporal effect Effects 0.000 description 1
- 230000000007 visual effect Effects 0.000 description 1
- 238000012800 visualization Methods 0.000 description 1
- 239000002699 waste material Substances 0.000 description 1
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Human Resources & Organizations (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Operations Research (AREA)
- Strategic Management (AREA)
- Pure & Applied Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Economics (AREA)
- Evolutionary Biology (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Computational Biology (AREA)
- Development Economics (AREA)
- Marketing (AREA)
- Databases & Information Systems (AREA)
- Game Theory and Decision Science (AREA)
- Algebra (AREA)
- Probability & Statistics with Applications (AREA)
- Entrepreneurship & Innovation (AREA)
- Software Systems (AREA)
- Quality & Reliability (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Artificial Intelligence (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Computation (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention provides a computing engine selection method and a multi-computing-engine platform based on the method. The method includes: inputting the task feature data corresponding to a task to be computed into the task execution time prediction model of each of a plurality of computing engines, and obtaining a task execution time prediction result for the task on each computing engine, where the task execution time prediction model is obtained by training on a training sample set that includes multiple pieces of task feature data and the corresponding task execution times; and selecting, from the plurality of computing engines and according to the task execution time prediction results, the computing engine that will execute the task. The method of the invention can automatically select an efficient computing engine and reduces task execution time.
Description
Technical Field
The present invention relates to the field of information technology, and in particular to a computing engine selection method and a multi-computing-engine platform based on the method.
Background Art
With the development of large numbers of new types of equipment for sea, air, space, and deep-sea applications, equipment testing has become increasingly important. For example, during the development of the J-10 fighter, tens of thousands of wind tunnel tests were carried out and millions of aerodynamic data records were obtained; the processing and analysis of these data became an important basis for the successful development of the J-10. Equipment testing comprises two processes, "test" and "evaluation": testing is a way of acquiring data, after which the various data are analyzed, processed, and compared to support decision making. The current approach to test data processing, which still relies mainly on expert experience and computer-assisted processing, can no longer meet present needs. Moreover, because test data processing involves data volumes of different scales, a mix of structured and unstructured processing, and a combination of real-time and offline processing, a single engine can no longer satisfy all kinds of test processing requirements. There are currently three approaches to this problem. The first is to manage multiple engines manually: the computing engines are deployed separately, and the engines and computing tasks are managed by hand. This requires substantial manpower and is inefficient, and unless the system is kept at full load, considerable resources are wasted. The second is to use a "super" engine that supports all computing requirements, that is, to deploy a single engine supporting every processing mode so that it alone can satisfy all test data processing needs; however, this approach is not yet mature and is still some way from large-scale use. The third approach is a compromise between the first two: a computing platform that supports multiple computing engines. On the one hand, such a platform can apply the various mature computing engine technologies available today; on the other hand, it manages computing engines and computing tasks in an automated way, which improves resource utilization and task execution efficiency. In short, of these three approaches, manual management of multiple engines is inefficient and a "super" engine cannot meet urgent needs in the near term, so a multi-computing-engine platform is currently the solution that balances efficiency and feasibility.
However, applying a multi-computing-engine platform requires solving the problems of engine compatibility, unified management of computing tasks, and future engine expansion; in particular, the platform must be able to select the task execution engine automatically in order to improve efficiency. None of the existing platforms that support multiple computing engines solves these problems. For example, Twitter SummingBird uses the Lambda architecture to integrate a distributed batch engine (Hadoop) with a distributed stream computing engine (Storm) and can combine batch and streaming results when executing a request, but it has no convenient engine management mechanism and does not isolate engine runtime environments. Apache Ambari is a Web-based tool that supports provisioning, managing, and monitoring the Apache Hadoop ecosystem and provides custom interfaces for adding various single-node or distributed engines, but it does not provide unified computing task management, only guarantees compatibility with specific engines, and requires the user to manually select the engine that executes a computing task. Google Kubernetes is built on Docker and can run computing engines as containers, running single-node or distributed engines as required and providing container deployment, scheduling, and scaling across node clusters, but it has no task management mechanism and likewise requires manual selection of the computing engine.
Therefore, the prior art needs to be improved to provide a multi-computing-engine platform and a method for automatically selecting a computing engine for such a platform.
Summary of the Invention
The purpose of the present invention is to overcome the above defects of the prior art and to provide a computing engine selection method and a multi-computing-engine platform based on the method.
According to a first aspect of the present invention, a computing engine selection method is provided. The method includes the following steps:
Step 1: Input the task feature data corresponding to the task to be computed into the task execution time prediction model of each of a plurality of computing engines, and obtain a task execution time prediction result for the task on each computing engine, where the task execution time prediction model is obtained by training on a training sample set that includes multiple pieces of task feature data and the corresponding task execution times.
Step 2: According to the task execution time prediction results, select from the plurality of computing engines the computing engine that will execute the task to be computed.
In one embodiment, the task feature data includes at least one of algorithm type, algorithm parameters, data type, data volume, and data storage location.
In one embodiment, the training sample set for a computing engine is constructed through the following steps:
Step 31: Collect multiple pieces of task description data describing task information.
Step 32: Use the computing engine to execute the task corresponding to each piece of task description data, and record the task execution time for each piece of task description data.
Step 33: From each piece of task description data, extract the features that affect task execution time to form task feature data, and combine them with the recorded task execution times to construct the training sample set for the computing engine.
In one embodiment, the task execution time prediction model of a computing engine is obtained by performing the following steps:
Step 41: Based on the training sample set of the computing engine, with the task feature data as the independent variables and the task execution time as the dependent variable, build a linear regression model expressed as:
y_i = β_0 + β_1·x_i1 + … + β_p·x_ip,  i = 1, 2, …, n
where x_i1 to x_ip are the task features contained in the training sample set of the computing engine, i indexes the sample data records in the training sample set, n is the number of sample data records in the training sample set, β_0 is the bias value to be optimized, and β_1 to β_p are the weight values to be optimized;
Step 42: Use the least squares method to solve for the optimized weight values and bias value of the linear regression model;
Step 43: Express the linear regression model with the obtained optimized weight values and bias value to obtain the task execution time prediction model of the computing engine.
In one embodiment, Step 2 includes the following sub-steps:
Step 51: Select the computing engine with the shortest predicted execution time; or
Step 52: When the remaining resources of the computing engine with the shortest predicted execution time cannot support the task to be computed, select the next computing engine in order of increasing predicted execution time according to the task execution time prediction results.
According to a second aspect of the present invention, a multi-computing-engine platform is provided. The platform includes:
a computing task management module, configured to manage the processing flow of computing tasks and generate computing task information;
an engine management module, configured to select a computing engine according to the computing task information from the computing task management module, using the computing engine selection method of the present invention;
a task execution module, configured to execute computing tasks and output task execution times.
In one embodiment, the multi-computing-engine platform of the present invention further includes:
a container management module, configured to invoke the task execution module to execute computing tasks;
a user interaction module, configured to receive user operation instructions and information;
a debugging task management module, configured to execute user debugging tasks and output debugging information.
In one embodiment, when the computing engines included in the platform change, the engine management module activates the task execution time prediction model of the new computing engine and sets the task execution time prediction model of the replaced computing engine to an inactive state.
Compared with the prior art, the advantages of the present invention are as follows. The provided computing engine selection method can use machine learning to build task execution time prediction models for multiple computing engines and, based on the models' predictions combined with resource availability, automatically select the most efficient computing engine, which significantly reduces task execution time and improves test data processing efficiency. The provided multi-computing-engine platform based on this selection method offers a task management mechanism and supports computing engine changes, which improves flexibility and provides good support for future engine expansion.
Brief Description of the Drawings
The following drawings illustrate and explain the present invention only schematically and do not limit its scope, wherein:
Fig. 1 shows a flowchart of a computing engine selection method according to an embodiment of the present invention;
Fig. 2 shows a schematic framework diagram of a multi-computing-engine platform according to an embodiment of the present invention.
Detailed Description of the Embodiments
To make the purpose, technical solution, design method, and advantages of the present invention clearer, the present invention is further described in detail below through specific embodiments in conjunction with the accompanying drawings. It should be understood that the specific embodiments described here are used only to explain the present invention and do not limit it.
According to an embodiment of the present invention, a computing engine selection method for a multi-computing-engine platform is provided. In brief, the method includes: collecting task execution data from multiple computing engines and constructing training sample sets; training task execution time prediction models on the constructed training sample sets using machine learning; and using the trained models to predict the task execution time on each computing engine and then selecting a suitable computing engine. Specifically, referring to Fig. 1, the computing engine selection method of the present invention includes the following steps:
Step S110: Collect task execution data from multiple computing engines to construct a training sample set.
In this step, running time data are collected for multiple computing engines executing computing tasks under different conditions, covering as many kinds of tasks as possible in order to construct a comprehensive training sample set.
According to an embodiment of the present invention, the process of constructing the training sample set includes the following sub-steps:
Step S111: Prepare the data to be tested.
Collect the algorithms to be tested and prepare appropriate test data for each algorithm. For example, the amount of test data can be determined from the performance of the training platform and the amount of data each algorithm typically needs to process, and an upper limit on the total amount of test data can be set according to the actual situation.
Step S112: Prepare the task description data.
The task description data describes the executed task, for example the algorithm the task runs and the algorithm's parameters. A computing engine can execute a specific task according to its task description data.
In one embodiment, the task description data is defined as a six-tuple Task_Info of <Task_ID, Algorithm_ID, Algorithm_Args, Data_Type, Data_Size, Data_Path>, where Task_ID is the task sequence number, an integer; Algorithm_ID is the algorithm sequence number, an integer that maps to a specific algorithm (for example the FP-Growth algorithm, the K-Means algorithm, the PageRank algorithm, or the Pearson correlation coefficient algorithm); Algorithm_Args contains the algorithm parameters as a JSON-encoded string — taking K-Means as an example, the parameters may include the number of clusters k, the initialization mode initMode, and the maximum number of iterations maxItr; Data_Type is the data type, a discrete value with a numeric label; Data_Size is the data volume, an integer in bytes; and Data_Path is the data storage location, a discrete value with a numeric label whose possible values are local file system or distributed file system.
It should be noted that the task description data may include any other field that affects task execution time; for example, in addition to the algorithm sequence number and algorithm parameters mentioned above, it may also include task execution priority. In addition, different algorithms have different contents in their algorithm parameters.
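As an illustrative sketch (not part of the original disclosure), the six-tuple Task_Info could be represented in Python roughly as follows; the field order follows the tuple defined above, while the concrete values and the numeric label assignments are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TaskInfo:
    task_id: int          # Task_ID: task sequence number
    algorithm_id: int     # Algorithm_ID: integer mapping to an algorithm (assumed: 1=FP-Growth, 2=K-Means, 3=PageRank, 4=Pearson)
    algorithm_args: str   # Algorithm_Args: JSON-encoded parameter string
    data_type: int        # Data_Type: discrete numeric label
    data_size: int        # Data_Size: data volume in bytes
    data_path: int        # Data_Path: assumed 0=local file system, 1=distributed file system

# Hypothetical example: a K-Means task over 500 MB of data stored in a distributed file system
example_task = TaskInfo(
    task_id=1,
    algorithm_id=2,
    algorithm_args='{"k": 8, "initMode": "k-means||", "maxItr": 100}',
    data_type=1,
    data_size=500 * 1024 * 1024,
    data_path=1,
)
```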
Step S113: Execute computing tasks according to the task description data and obtain task execution time data for the multiple computing engines.
For each computing engine, execute the computing tasks using the prepared task description data Task_Info, collect the execution time Run_Time of each computing task, and combine the task description data and execution time into a two-tuple <Task_Info, Run_Time> to obtain the task execution data.
Step S114: Clean the task execution data.
The purpose of data cleaning is to remove data that may be erroneous or incomplete. For example, all task execution data can be analyzed statistically to obtain the standard deviation of task execution time; if a task's execution time deviates from the mean task execution time by more than three standard deviations, it is marked as abnormal data and removed. In addition, records with missing attribute columns — that is, a missing task execution time or missing task description information — are also removed.
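A minimal sketch of this three-sigma cleaning step, assuming the collected task execution data has been loaded into a pandas DataFrame with a run_time column (the column name is illustrative, not fixed by the patent):

```python
import pandas as pd

def clean_task_execution_data(df: pd.DataFrame) -> pd.DataFrame:
    """Remove incomplete records and three-sigma outliers on execution time."""
    # Drop records with any missing attribute column (execution time or task description fields)
    df = df.dropna()

    # Mark as abnormal any record whose execution time deviates from the mean
    # by more than three standard deviations, and remove it
    mean, std = df["run_time"].mean(), df["run_time"].std()
    mask = (df["run_time"] - mean).abs() <= 3 * std
    return df[mask]
```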
Step S115: Feature engineering.
The purpose of feature engineering is to select, from the task description data, the features that have a significant effect on task execution time, in order to construct the training sample set. The training sample set of the present invention includes task feature data and the corresponding task execution times, and the task feature data contains multiple task features that affect task execution time.
According to an embodiment of the present invention, the Task_ID in the task description data Task_Info has no effect on the execution time of a computing task, so this feature can be removed, whereas the algorithm sequence number (i.e., algorithm type), algorithm parameters, data storage location, data type, and data volume in the task description data all affect execution time and are therefore retained as the task feature data in the training sample set.
When constructing the final training sample set, if sequence-number encoding is used for discrete task features, the ordering of the numbers imposes an order on what are unordered discrete quantities and introduces spurious information. Therefore, according to an embodiment of the present invention, one-hot encoding can be used to encode the discrete data. For example, for a piece of task description data whose discrete features include the data storage location, with possible values "local file system" and "distributed file system", one-hot encoding turns this discrete feature into two independent features, "data storage location - local" and "data storage location - distributed". When the original data storage location feature is "local file system", the "data storage location - local" feature takes the value 1 and the "data storage location - distributed" feature takes the value 0; when the original value is "distributed file system", "data storage location - local" takes the value 0 and "data storage location - distributed" takes the value 1. Similarly, one-hot encoding can also be applied to the algorithm sequence number Algorithm_ID: when four algorithms are included, this discrete feature becomes four independent features.
For clarity, Table 1 below shows an example of the constructed training sample set.
Table 1: Example of a training sample set
Table 1 shows the training sample set for computing engine 1. It should be noted that, depending on how strongly each field affects task execution time in actual test processing, the task feature data may include at least one of algorithm type, algorithm parameters, data storage location, data type, and data volume, and other task features may also be added. In addition, when one-hot encoding is used to encode a discrete feature of the task feature data, that discrete feature becomes multiple independent features; however, the present invention is not limited to one-hot encoding of discrete features.
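As a sketch of the feature engineering step described above (using the illustrative column names introduced earlier, which are assumptions rather than names fixed by the patent), the discrete features could be one-hot encoded as follows:

```python
import pandas as pd

def build_feature_matrix(df: pd.DataFrame) -> pd.DataFrame:
    """Drop Task_ID and one-hot encode the discrete task features."""
    features = df.drop(columns=["task_id", "run_time"])
    # One-hot encode the algorithm id and data storage location; e.g. with four
    # algorithms, algorithm_id expands into four independent 0/1 features, and
    # data_path expands into "local" and "distributed" features.
    return pd.get_dummies(features, columns=["algorithm_id", "data_path"])

# X = build_feature_matrix(cleaned_df); y = cleaned_df["run_time"]
```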
Step S120: Train the task execution time prediction models using the constructed training sample sets.
In this step, machine learning is applied to the training sample sets to obtain a task execution time prediction model for each computing engine.
For example, machine learning models such as linear regression, gradient boosted regression trees (GBRT), or XGBoost can be used for training.
In one embodiment, a linear regression model is trained, with the task feature data in the training sample set as the independent variables and the task execution time as the dependent variable. For example, the linear regression model can be expressed as:
Y = e + a_1·X_1 + a_2·X_2 + a_3·X_3 + a_4·X_4 + a_5·X_5 + a_6·X_6 + a_7·X_7   (1)
where X_1 is the data volume, X_2 and X_3 are the one-hot encoded features of the data storage location, X_4 to X_7 are the one-hot encoded features of the algorithm sequence number Algorithm_ID, Y is the task execution time, a_1 to a_7 are the weight values to be optimized, and e is the bias value to be optimized.
More generally, the p-variable linear regression model can be expressed as:
y_i = β_0 + β_1·x_i1 + … + β_p·x_ip,  i = 1, 2, …, n   (2)
where p is the number of features contained in the task feature data of the training sample set, i indexes the data records in the training sample set, n is the number of data records in the training sample set, β_0 is the bias to be optimized, β_1 to β_p are the weight values to be optimized, and x_i1 to x_ip are the task features in the training sample set.
During training, the optimized weight values and bias value can be obtained with the least squares method, whose goal is to minimize the sum of squared errors:

Q(β_0, β_1, …, β_p) = Σ_{i=1…n} (y_i − β_0 − β_1·x_i1 − … − β_p·x_ip)²   (3)

Taking the partial derivative of Q with respect to each parameter and setting it to zero,

∂Q/∂β_j = 0,  j = 0, 1, …, p   (4)

yields the normal equations

Σ_{i=1…n} x_ij·(y_i − β_0 − β_1·x_i1 − … − β_p·x_ip) = 0,  j = 0, 1, …, p (with x_i0 = 1),   (5)

which can be written in matrix form as

X′Xβ = X′Y   (6)

so that the solution for the parameters (both the weight values and the bias value) is

β = (X′X)⁻¹X′Y   (7)
In this step, the optimized weight values and bias value are obtained through training, and the model expressed with these optimized weights and bias is the task execution time prediction model. In this way, a task execution time prediction model can be obtained for each computing engine.
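A minimal sketch of fitting one such per-engine model, here using the ordinary least squares regressor in scikit-learn purely as an illustration (the patent requires only a least-squares linear regression, not this particular library):

```python
from sklearn.linear_model import LinearRegression

def train_engine_model(X, y):
    """Fit y = beta_0 + beta_1*x_1 + ... + beta_p*x_p by ordinary least squares."""
    model = LinearRegression()
    model.fit(X, y)
    return model  # model.intercept_ is beta_0, model.coef_ are beta_1..beta_p

# One model per computing engine, each trained on that engine's own sample set, e.g.:
# engine_models = {eid: train_engine_model(X_e, y_e) for eid, (X_e, y_e) in samples.items()}
```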
Step S130: Use the trained task execution time prediction models to predict the task execution time on each computing engine, and then select a suitable computing engine.
When a new computing task needs to be executed, computing task feature data is first generated from the attributes of the task and then input in turn into each computing engine's task execution time prediction model to obtain the task execution time prediction result for each engine; finally, taking into account system resources and the requirements of the computing task, the most suitable computing engine is selected to execute the task. According to an embodiment of the present invention, this includes the following sub-steps:
Step S131: Generate the computing task feature data.
The process of generating the feature data of the task to be computed is similar to the process of generating task feature data when constructing the training sample set; for example, one-hot encoding can likewise be used to encode the discrete features, and finally it is verified that the converted computing task feature data conforms to the input format of the task execution time prediction models.
Step S132: Obtain the task execution time prediction results.
The feature data of the computing task to be predicted is input into the task execution time prediction model Model_i of a computing engine to obtain the predicted execution time P_Time_i of the task on the i-th engine; in this way, the predicted execution times on all engines can be obtained.
Step S133: Select the computing engine that will execute the computing task according to the prediction results.
A suitable engine is selected to execute the computing task according to each engine's predicted execution time combined with resource usage.
For example, the engine with the shortest predicted execution time among all the prediction results is considered first, and it is checked whether its remaining resources can support running the computing task; if not, the engine with the next shortest predicted execution time is checked, and so on, until an engine that satisfies the requirements of the computing task is found and selected to run it.
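A sketch of this selection loop, assuming hypothetical helpers predict_time() and has_enough_resources() that wrap the per-engine prediction models and resource monitoring (neither name comes from the patent):

```python
def select_engine(task_features, engines):
    """Pick the engine with the shortest predicted execution time whose
    remaining resources can still support the task."""
    # Predict the execution time of the task on every computing engine
    predictions = {engine: engine.predict_time(task_features) for engine in engines}

    # Try engines in order of increasing predicted execution time
    for engine in sorted(predictions, key=predictions.get):
        if engine.has_enough_resources(task_features):
            return engine

    raise RuntimeError("No computing engine can currently support this task")
```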
According to an embodiment of the present invention, a multi-computing-engine platform is provided that incorporates the computing engine selection method of the present invention and can be applied to test data processing. Referring to Fig. 2, the multi-computing-engine platform of this embodiment includes a user interaction module 210, a debugging task management module 220, a computing task management module 230, an engine management module 240, a container management module 250, a task execution module 260, an algorithm management module 270 with an algorithm library 271, and a task information management module 280 with a task information library 281.
The user interaction module 210 handles information exchange with the user. Specifically, Flask (a Web micro-framework written in Python) can be used as the back end of the user interaction module 210, with Bootstrap and jQuery used to build the Web pages, while the back end also exposes RESTful interfaces for secondary development. In actual use, the process is: collect user operation instructions and input through the Web interface; convert the user operations into network exchange messages in JSON (JavaScript Object Notation) format; call the routing interfaces of the back end of the user interaction module 210 via Ajax (asynchronous JavaScript and XML); the back end responds to the routing request, completes the specific function, and returns the processing result; and the Web interface receives the processing result and responds to the user operation.
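Purely as an illustration of the kind of routing interface described above (the route path and payload fields are hypothetical, not specified by the patent), a minimal Flask back-end route might look like this:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/api/tasks", methods=["POST"])
def submit_task():
    # The Web page posts the user's operation as a JSON message via Ajax;
    # the back end completes the function and returns the processing result.
    task_info = request.get_json()
    result = {"task_id": task_info.get("task_id"), "status": "accepted"}
    return jsonify(result)
```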
The debugging task management module 220 pushes back-end algorithm output and debugging information to the Web interface in real time. For example, WebSocket can be used as the front-end/back-end communication protocol. The process is: the user writes the algorithm and its debugging data in the Web interface; the user submits a debugging task; the back end executes the debugging task and returns the execution result and debugging information; the user checks how the algorithm behaves and returns to the first step if further debugging is needed; and once the user confirms that the algorithm works correctly, the algorithm is added to the algorithm library 271.
The computing task management module 230 manages the processing flow of computing tasks. For example, a directed acyclic graph can be used to orchestrate the test data processing flow, and the graph is visualized in the Web interface so that the user can view and confirm the flow information. Specifically, the steps of the computing task management module 230 are: the user adds computing tasks according to the test data flow; the user confirms, based on the visualized orchestration, whether the added tasks need to be modified; once confirmed, the module requests the engine management module 240 to obtain the task execution engine information; and it then calls the container management module 250 to execute the computing tasks.
The engine management module 240 selects computing engines and manages their state. The computing engines can be packaged as containers with the open-source application container engine Docker for resource management. The functions of the engine management module 240 include: calling the computing engine selection algorithm of the present invention for each piece of task information and selecting the optimal computing engine; checking whether the selected engine is active and activating it if it is not; and returning the information of the finally selected task execution engine to the computing task management module 230.
The container management module 250 manages the container functions; for example, containers can be managed with docker-python, each container consisting of a specific set of applications and the necessary dependency libraries. The functions of the container management module 250 include adding, starting, stopping, and querying containers, as well as calling the task execution module 260 to complete the execution of computing tasks.
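A sketch of container start/stop using the Docker SDK for Python (the image and container names here are hypothetical; the patent states only that docker-python is used to manage the containers):

```python
import docker

client = docker.from_env()

def start_engine_container(image: str, name: str):
    """Start a computing-engine container in the background and return its handle."""
    return client.containers.run(image, name=name, detach=True)

def stop_engine_container(name: str):
    """Stop a previously started computing-engine container by name."""
    client.containers.get(name).stop()

# e.g. start_engine_container("spark-engine:latest", "engine-spark-1")  # illustrative names
```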
The task execution module 260 uses container technology to package computing tasks in a uniform way and provides algorithm invocation, run monitoring, and data collection. Its functions include: checking the integrity of the task data; determining whether the algorithm requires third-party libraries and installing them if needed; checking whether the form of the algorithm's input data matches the algorithm's requirements and converting it according to those requirements if not; executing the algorithm and recording its execution time; and collecting the algorithm's execution results and updating the computing task information.
The algorithm management module 270 and the algorithm library 271 manage the algorithms implemented on the multi-computing-engine platform, for example using the distributed-file-storage database MongoDB together with HDFS to provide distributed algorithm library support, with functions for adding, deleting, and querying algorithms.
The task information management module 280 and the task information library 281 manage task information, for example using the relational database management system MySQL to store and manage task information, with functions for adding, deleting, and querying task information.
It should be noted that, because the methods in the data processing flow change quickly when actual test data is being processed, the computing engines used by the platform should change with the test data processing tasks in order to save platform resources and improve computing efficiency. When a computing engine changes, the state of the task execution time prediction models obtained according to the present invention needs to be adjusted dynamically according to the change, so that the platform adapts to engine changes. For example, when the changed engine is a newly added computing engine, a training sample set is used to train a task execution time prediction model for that engine; if the replacement is an engine that has already been trained, the task execution time prediction model of the new computing engine is activated and the model of the replaced engine is set to an inactive state.
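A sketch of a model registry that activates and deactivates per-engine prediction models when the platform's engines change (the class and method names are illustrative assumptions, not part of the patent):

```python
class PredictionModelRegistry:
    """Keeps one task execution time prediction model per computing engine."""

    def __init__(self):
        self.models = {}  # engine_id -> (model, active flag)

    def register(self, engine_id, model, active=True):
        self.models[engine_id] = (model, active)

    def swap_engine(self, old_engine_id, new_engine_id, new_model=None):
        # Deactivate the replaced engine's prediction model
        if old_engine_id in self.models:
            model, _ = self.models[old_engine_id]
            self.models[old_engine_id] = (model, False)
        # Activate the new engine's model, registering a freshly trained one
        # if the engine has not been seen before
        if new_engine_id in self.models:
            model, _ = self.models[new_engine_id]
            self.models[new_engine_id] = (model, True)
        elif new_model is not None:
            self.register(new_engine_id, new_model, active=True)
```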
In summary, the multi-computing-engine platform provided by the present invention is convenient and easy to use, is oriented to the requirements of test data processing, supports online orchestration of multiple computing tasks, can automatically select the most efficient computing engine based on the computing engine selection method of the present invention, supports online algorithm debugging, and supports automatic packaging and switching of computing engines. Specifically, the beneficial effects of the multi-computing-engine platform of the present invention are mainly as follows: container technology is used to package computing engines and computing tasks, giving fast start/stop and low resource consumption and achieving environment isolation and resource limits between engines and tasks; a common computing task invocation flow unifies the differences between different tasks and facilitates unified management of computing tasks; online editing and testing of algorithms is supported and debugging results can be inspected directly, which greatly improves algorithm editing efficiency; visual orchestration of computing tasks, oriented to the characteristics of test data processing, reduces user orchestration errors and increases the flexibility of test data processing; the multi-computing-engine selection algorithm based on task execution time prediction can use machine learning to automatically discover the operating characteristics of the computing engines and, for a specific computing task, select the most efficient engine in light of resource availability, which significantly reduces task execution time and improves test data processing efficiency; and support for computing engine changes improves system flexibility while providing good support for future engine expansion.
It should be noted that although the steps above are described in a particular order, this does not mean they must be performed in that order; in fact, some of these steps can be performed concurrently or even in a different order, as long as the required functions can be achieved.
The present invention may be a system, a method, and/or a computer program product. The computer program product may include a computer-readable storage medium carrying computer-readable program instructions for causing a processor to implement various aspects of the present invention.
The computer-readable storage medium may be a tangible device that holds and stores instructions for use by an instruction execution device. The computer-readable storage medium may include, for example and without limitation, an electrical storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of computer-readable storage media include: a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), static random access memory (SRAM), compact disc read-only memory (CD-ROM), digital versatile disc (DVD), memory stick, floppy disk, a mechanically encoded device such as a punch card or raised structures in a groove with instructions recorded thereon, and any suitable combination of the foregoing.
The embodiments of the present invention have been described above. The foregoing description is exemplary, not exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810734031.8A CN108985367A (en) | 2018-07-06 | 2018-07-06 | Computing engines selection method and more computing engines platforms based on this method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810734031.8A CN108985367A (en) | 2018-07-06 | 2018-07-06 | Computing engines selection method and more computing engines platforms based on this method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108985367A true CN108985367A (en) | 2018-12-11 |
Family
ID=64536300
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810734031.8A Pending CN108985367A (en) | 2018-07-06 | 2018-07-06 | Computing engines selection method and more computing engines platforms based on this method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108985367A (en) |
- 2018-07-06 CN CN201810734031.8A patent/CN108985367A/en active Pending
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6728869B1 (en) * | 2000-04-21 | 2004-04-27 | Ati International Srl | Method and apparatus for memory latency avoidance in a processing system |
CN102736896A (en) * | 2011-03-29 | 2012-10-17 | 国际商业机器公司 | Run-ahead approximated computations |
CN105900127A (en) * | 2013-09-11 | 2016-08-24 | 芝加哥期权交易所 | System and method for determining a tradable value |
CN107077385A (en) * | 2014-09-10 | 2017-08-18 | 亚马逊技术公司 | Calculated examples start the time |
CN104834561A (en) * | 2015-04-29 | 2015-08-12 | 华为技术有限公司 | Data processing method and device |
US9823968B1 (en) * | 2015-08-21 | 2017-11-21 | Datadirect Networks, Inc. | Data storage system employing a variable redundancy distributed RAID controller with embedded RAID logic and method for data migration between high-performance computing architectures and data storage devices using the same |
CN105404611A (en) * | 2015-11-09 | 2016-03-16 | 南京大学 | Matrix model based multi-calculation-engine automatic selection method |
CN106649503A (en) * | 2016-10-11 | 2017-05-10 | 北京集奥聚合科技有限公司 | Query method and system based on sql |
CN106649119A (en) * | 2016-12-28 | 2017-05-10 | 深圳市华傲数据技术有限公司 | Stream computing engine testing method and device |
Non-Patent Citations (1)
Title |
---|
VIKTOR FARCIC: "《微服务运维实战》", 30 June 2018 * |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2020125182A1 (en) * | 2018-12-19 | 2020-06-25 | Oppo广东移动通信有限公司 | Algorithm processing method and apparatus, and storage medium and terminal device |
KR20210064373A (en) * | 2019-02-21 | 2021-06-02 | 텐센트 테크놀로지(센젠) 컴퍼니 리미티드 | Object control method and apparatus, storage medium and electronic device |
US11938400B2 (en) | 2019-02-21 | 2024-03-26 | Tencent Technology (Shenzhen) Company Limited | Object control method and apparatus, storage medium, and electronic apparatus |
KR102549758B1 (en) * | 2019-02-21 | 2023-06-29 | 텐센트 테크놀로지(센젠) 컴퍼니 리미티드 | Object control method and device, storage medium and electronic device |
WO2020168877A1 (en) * | 2019-02-21 | 2020-08-27 | 腾讯科技(深圳)有限公司 | Object control method and apparatus, storage medium and electronic apparatus |
CN109806590A (en) * | 2019-02-21 | 2019-05-28 | 腾讯科技(深圳)有限公司 | Object control method and apparatus, storage medium and electronic device |
CN110362611A (en) * | 2019-07-12 | 2019-10-22 | 拉卡拉支付股份有限公司 | A kind of data base query method, device, electronic equipment and storage medium |
CN110727697B (en) * | 2019-08-29 | 2022-07-12 | 北京奇艺世纪科技有限公司 | Data processing method and device, storage medium and electronic device |
CN110727697A (en) * | 2019-08-29 | 2020-01-24 | 北京奇艺世纪科技有限公司 | Data processing method and device, storage medium and electronic device |
CN111401560A (en) * | 2020-03-24 | 2020-07-10 | 北京觉非科技有限公司 | Processing method, device and storage medium for reasoning task |
CN111723112A (en) * | 2020-06-11 | 2020-09-29 | 咪咕文化科技有限公司 | Data task execution method, device, electronic device and storage medium |
CN113919490A (en) * | 2020-07-10 | 2022-01-11 | 阿里巴巴集团控股有限公司 | Inference engine adaptation method, device and electronic device |
CN112558938A (en) * | 2020-12-16 | 2021-03-26 | 中国科学院空天信息创新研究院 | Machine learning workflow scheduling method and system based on directed acyclic graph |
CN112558938B (en) * | 2020-12-16 | 2021-11-09 | 中国科学院空天信息创新研究院 | Machine learning workflow scheduling method and system based on directed acyclic graph |
CN113139205A (en) * | 2021-04-06 | 2021-07-20 | 华控清交信息科技(北京)有限公司 | Secure computing method, general computing engine, device for secure computing and secure computing system |
WO2023272853A1 (en) * | 2021-06-29 | 2023-01-05 | 未鲲(上海)科技服务有限公司 | Ai-based sql engine calling method and apparatus, and device and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108985367A (en) | Computing engines selection method and more computing engines platforms based on this method | |
CN107844424B (en) | Model-based testing system and method | |
US10162612B2 (en) | Method and apparatus for inventory analysis | |
CN106022245B (en) | A system and method for parallel processing of multi-source remote sensing satellite data based on algorithm classification | |
CN112199086B (en) | Automatic programming control system, method, device, electronic equipment and storage medium | |
CN104541247B (en) | System and method for adjusting cloud computing system | |
CN109815283B (en) | Heterogeneous data source visual query method | |
CN109656963B (en) | Metadata acquisition method, device, device and computer-readable storage medium | |
CN108733532B (en) | Health control method, device, medium and electronic device for big data platform | |
US20100162230A1 (en) | Distributed computing system for large-scale data handling | |
CN104954453A (en) | Data mining REST service platform based on cloud computing | |
CN109448100A (en) | Threedimensional model format conversion method, system, computer equipment and storage medium | |
CN108037919A (en) | A kind of visualization big data workflow configuration method and system based on WEB | |
US10949218B2 (en) | Generating an execution script for configuration of a system | |
CN108540351B (en) | Automated testing method for distributed big data service | |
CN111130842A (en) | Dynamic network map database construction method reflecting network multidimensional resources | |
Han et al. | RT-DAP: A real-time data analytics platform for large-scale industrial process monitoring and control | |
Lei et al. | Performance and scalability testing strategy based on kubemark | |
CN107995032A (en) | A kind of method and device that network experimental platform is built based on cloud data center | |
CN113806429B (en) | Canvas type log analysis method based on big data stream processing frame | |
CN117056048A (en) | Container cloud resource scheduling method and scheduling system based on digital twin technology | |
CN107480189A (en) | A kind of various dimensions real-time analyzer and method | |
CN117235527A (en) | End-to-end containerized big data model construction method, device, equipment and media | |
CN104239630B (en) | A kind of emulation dispatch system of supportive test design | |
CN111459984A (en) | Log data processing system and method based on streaming processing |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20181211 |