CN104516784A

CN104516784A - Method and system for forecasting task resource waiting time

Info

Publication number: CN104516784A
Application number: CN201410796248.3A
Authority: CN
Inventors: 尤海航; 邢飞
Original assignee: Institute of Computing Technology of CAS
Current assignee: Zhongkehai Micro Beijing Technology Co ltd
Priority date: 2014-07-11
Filing date: 2014-12-18
Publication date: 2015-04-15
Anticipated expiration: 2034-12-18
Also published as: CN104516784B

Abstract

The invention discloses a method and system for predicting task resource waiting time, and relates to resource management, optimization, and allocation in large-scale computing systems. The method comprises: obtaining historical task records, deleting the task records that have dependencies, and generating a new historical task record; using an autocorrelation function to obtain, from the new historical task record, the task records in time periods that are correlated with the time period to be predicted, and generating a task record set; and setting a task resource waiting time threshold, obtaining the number of task records in the task record set whose task resource waiting time exceeds the threshold, and, based on the total number of task records in the task record set, predicting the task resource waiting time within the time period to be predicted by a Bayesian method. The invention can predict the availability of computing system resources and optimize task scheduling.

Description

Method and system for predicting task resource waiting time

Technical Field

The invention relates to resource management, optimization, and allocation in large-scale computing systems (including supercomputing and cloud computing), and in particular to a method and system for predicting task resource waiting time.

Background

Many existing schemes for characterizing, predicting, optimizing, and allocating resource usage in large-scale computing systems are model-based. Researchers first select one or more dimensions related to system resource usage and track them (for example, the remaining running time of a job, or the time a job spends queued in the system), then apply a probability model to characterize the distribution of the data in that dimension. They then use the properties of the fitted distribution to predict the future behavior of the system, and so optimize and allocate resources. For example, Downey applied the log-uniform distribution to measure the remaining running time of jobs; Brevik and collaborators proposed the Binomial Method Batch Predictor (BMBP) to characterize system queue waiting time; and Li and collaborators applied a Gaussian mixture model to describe the running-time distribution of all jobs in a system over a given period.

Characterizing resource distribution with a specific probability model raises practical problems in large-scale system management and big-data processing. First, historical workload records at massive scale rarely follow any specific probability distribution; tests on different real data sets show that such records not only fail to follow a single distribution but are hard to describe even with a mixture model. Second, some commonly used probability models (for example, the binomial model and its derivatives) assume that jobs submitted by a user within a short interval are independent of one another, which often does not hold in practice. A study of the historical workload of a real supercomputer found that most users submit jobs with similar content and comparable parameters many times within a short period. Evaluating and predicting resource consumption with such models is therefore often inaccurate.

Summary of the Invention

To address the deficiencies of the prior art, the invention proposes a method and system for predicting task resource waiting time.

The invention proposes a method for predicting task resource waiting time, comprising:

Step 1, obtaining historical task records, deleting the task records that have dependencies in the historical task records, and generating a new historical task record;

Step 2, using an autocorrelation function to obtain, from the new historical task record, the task records in time periods that are correlated with the time period to be predicted, and generating a task record set;

Step 3, setting a task resource waiting time threshold, obtaining the number of task records in the task record set whose task resource waiting time exceeds the threshold, and, based on the total number of task records in the task record set, predicting the task resource waiting time within the time period to be predicted by a Bayesian method.

In the method for predicting task resource waiting time, step 1 specifically comprises:

Step 11, judging whether dependencies exist among the task records in the historical task record;

Step 12, if dependencies exist, deleting the task records that have dependencies.

In the method for predicting task resource waiting time, step 11 comprises:

Step 21, selecting a time critical point t* and a space critical point x*;

Step 22, if the submission times of two task records in the historical task record differ by no more than the time critical point t*, their parameter choices are within the space interval x* of each other, and the density of such pairs is higher than the density of the other pairs in the historical task record, then the two task records have a dependency at the (t*, x*) scale.
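The pair test in steps 21 and 22 can be sketched as a Knox-style count. This is a minimal illustration, not the patented implementation: the record layout `(submit_time, params)`, the function names, and the use of the largest per-parameter difference as the "proximity" measure are all assumptions made here. The expected count under independence, (pairs close in time × pairs close in space) / total pairs, is the standard Knox reference value.

```python
from itertools import combinations


def knox_statistic(jobs, t_star, x_star):
    """Count job pairs that are close in time, in parameter space, and in both.

    jobs: list of (submit_time, params), params being a tuple of numbers.
    The distance in parameter space is an assumed choice (largest
    per-parameter difference); the patent does not fix a metric.
    Returns (close_both, close_time, close_space, total_pairs).
    """
    close_both = close_time = close_space = total = 0
    for (t1, p1), (t2, p2) in combinations(jobs, 2):
        total += 1
        near_t = abs(t1 - t2) <= t_star
        near_x = max(abs(a - b) for a, b in zip(p1, p2)) <= x_star
        close_time += near_t
        close_space += near_x
        close_both += near_t and near_x
    return close_both, close_time, close_space, total


def knox_expected(close_time, close_space, total):
    """Expected number of close-in-both pairs if time and space were unrelated."""
    return close_time * close_space / total if total else 0.0
```

If the observed `close_both` is far above `knox_expected(...)`, the history shows space-time clustering at the (t*, x*) scale. How far is "far" (for example, via a permutation test) is left open here.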

In the method for predicting task resource waiting time, step 12 comprises:

Step 31, sorting the task records in the historical task record by submission time in ascending order;

Step 32, starting from the task record with the earliest submission time, deleting all task records that were submitted within the following time t* and whose parameter choices differ from those of that earliest task record by less than x*;

Step 33, updating the historical task record and repeating step 22 until the whole historical task record has been traversed, then updating the historical task record.

In the method for predicting task resource waiting time, the formula of the Bayesian method in step 3 is:

$P_{LTW,k} = \frac{N_{LTW,k-1,k-2}}{N_{k-1,k-2}}$

where $N_{LTW,k-1,k-2}$ is the number of task records whose task resource waiting time exceeds the task resource waiting time threshold, $N_{k-1,k-2}$ is the total number of task records in the task record set, and $P_{LTW,k}$ is the probability that a task's resource waiting time exceeds the threshold within the time period to be predicted.

The invention also proposes a system for predicting task resource waiting time, comprising:

a new-historical-task-record generation module, configured to obtain historical task records, delete the task records that have dependencies in the historical task records, and generate a new historical task record;

a task-record-set generation module, configured to use an autocorrelation function to obtain, from the new historical task record, the task records in time periods that are correlated with the time period to be predicted, and to generate a task record set;

a prediction module, configured to set a task resource waiting time threshold, obtain the number of task records in the task record set whose task resource waiting time exceeds the threshold, and, based on the total number of task records in the task record set, predict the task resource waiting time within the time period to be predicted by a Bayesian method.

In the system for predicting task resource waiting time, the new-historical-task-record generation module further comprises:

a judging module, configured to judge whether dependencies exist among the task records in the historical task record;

a dependency deletion module, configured to delete the task records that have dependencies, if any exist.

In the system for predicting task resource waiting time, the judging module specifically: selects a time critical point t* and a space critical point x*; if the submission times of two task records in the historical task record differ by no more than the time critical point t*, their parameter choices are within the space interval x* of each other, and the density of such pairs is higher than the density of the other pairs in the historical task record, then the two task records have a dependency at the (t*, x*) scale.

In the system for predicting task resource waiting time, the dependency deletion module specifically: sorts the task records in the historical task record by submission time in ascending order; starting from the task record with the earliest submission time, deletes all task records that were submitted within the following time t* and whose parameter choices differ from those of that earliest task record by less than x*; and updates the historical task record and repeats step 22 until the whole historical task record has been traversed, then updates the historical task record.

In the system for predicting task resource waiting time, the formula of the Bayesian method in the prediction module is:

$P_{LTW,k} = \frac{N_{LTW,k-1,k-2}}{N_{k-1,k-2}}$

As the above scheme shows, the invention has the following advantages:

The invention can be used to predict the availability of computing system resources, to optimize job scheduling and the configuration of task resource requests, to improve resource utilization, and to predict how long a job will wait in the job execution queue. Experiments show that the invention achieves a reliable prediction rate of 89% or more.

Description of the Drawings

Fig. 1 is the overall flowchart of the invention;

Fig. 2 is the job partition count chart;

Fig. 3 is the long-time waiting probability chart after denoising and decorrelation;

Fig. 4 is the monthly LTW trend graph on the supercomputer Kraken;

Fig. 5 is the chart of autocorrelation values of the monthly long-time waiting probability;

Fig. 6 is the flowchart of noise reduction for historical task records.

The reference signs are as follows:

Step 100 is the overall procedure of the invention, comprising:

Steps 101, 102, and 103;

Step 200 is the noise reduction procedure of the invention, comprising:

Steps 201, 202, and 203.

Detailed Description

While studying the historical workload of the supercomputer Kraken, the inventors found that about 52.4% of user-submitted jobs (tasks) waited in the job execution queue longer than they actually ran. For each job, and for parallel jobs in particular, two basic parameters express its demand for computing resources: the running time and the number of compute nodes (CPU, memory). Taking the user's point of view, the invention aims to adjust these two parameters, while still guaranteeing correct results, so that a job starts sooner and spends less time waiting in the job execution queue. In statistical analysis of the job history, a single user submitting many similar jobs within a short time (for example, the same executable with different input data) strongly distorts the statistics, because the system generally limits the number of jobs that one user can run concurrently.

The invention uses a statistical method, the space-time clustering detection method (also known as the Knox method), to analyze task dependencies in the system workload history. Unlike ordinary clustering methods, this method emphasizes the relationship between parameter space and time, which makes data reduction more effective. Task records with dependencies are removed as noise, and the denoised workload history can then be used to generate an accurate Long-Time Waiting Chart. The chart shows the overall trend between the choice of parameter values and the long-time waiting probability, and so gives users general guidance on parameter selection. The invention can also generate a monthly trend graph of the long-time waiting probability.

Because resource utilization is dynamic, more history does not always help in computing the waiting probability; surplus data often masks the real trend. The invention therefore uses the autocorrelation function (ACF) to determine the time interval over which the data are correlated. The ACF is characterized differently in different fields; here the autocovariance function is used to describe the autocorrelation coefficient. The autocovariance function is the second-order mixed central moment between the values of a time series X(t) at any two different times t1 and t2; it describes how strongly the fluctuations of X(t) about its mean at the two times are correlated, and is also called the centered autocorrelation function. On the test system, for example, three months of data were found to be mutually correlated. Finally, the invention uses a Bayesian framework to predict how long a future task, given its requirements for system resources, will wait to obtain resources and run on the computing system.

Specific embodiments of the invention are described below with reference to the drawings.

The overall procedure of the invention is shown in Fig. 1. The specific steps are as follows:

Step 101, using the space-time clustering detection method (also known as the Knox method) to analyze the task dependencies in the system workload history (the historical task records), and removing the task records that have dependencies, in order to remove noise and improve prediction accuracy.

Step 102, generating the long-time waiting probability chart (Long-Time Waiting Chart) from the denoised workload history (the new historical task record), and at the same time generating the monthly trend graph of the long-time waiting probability. In addition, using the autocorrelation function (ACF) to determine the time interval over which the data are correlated.

Step 103, using a Bayesian framework to predict how long a future task, given its requirements for system resources, will wait to obtain resources and run on the computing system. A task resource waiting time threshold is set, the number of task records in the task record set whose task resource waiting time exceeds the threshold is obtained, and, based on the total number of task records in the task record set, the task resource waiting time within the given time period is predicted by the Bayesian method.

The invention is further described below with reference to an embodiment:

Discovering the real long-time waiting pattern: the job partition count table. First, statistics on job resource usage are collected in the form of a two-dimensional chart. For each submitted job, and for parallel jobs in particular, two basic parameters express its demand for computing resources: the reserved running time (Wall Clock Requested, WCR) and the number of reserved compute nodes (CPU, memory) (Number of compute Nodes Requested, NNR). This embodiment uses only these two parameters, so the statistics are shown as a two-dimensional chart; the multi-dimensional case follows by analogy. Each parameter is counted in segments. In the experiment, the job data obtained from the supercomputer Kraken (NNR from 1 to 4128 nodes, WCR from 0 to 24 hours) is divided into several parameter partitions. The partition boundaries are chosen on two grounds: each region should contain parameter combinations that users habitually choose in practice (for example, 1 hour of running time with 10 compute nodes, or 12 hours of running time with 1 compute node); and each partition should cover a substantial number of job records, so that the statistics are reliable and valid. Fig. 2 is the job partition count table generated from 30 months of historical workload data on Kraken.

Noise reduction and decorrelation of historical data. The space-time clustering detection method, also known as the Knox method, is used to test for task dependencies in the system workload history, as follows. First, a time critical point t* and a space critical point x* are chosen from prior experience. If the submission times of two jobs differ by no more than the time critical point t*, and their parameter choices are within the space interval x* of each other, the two jobs are considered possibly correlated. If the density of such job pairs is much higher than the density of the other job pairs, the historical data is considered strongly correlated at the (t*, x*) scale, and noise reduction and decorrelation are required. To test for this, the invention applies the Knox statistical test used in epidemiology. If the Knox method shows that the historical data is strongly correlated at the (t*, x*) scale, the following algorithm is applied to denoise the historical workload data (the historical task records), as shown in Fig. 6:

Step 201, sorting the historical workload data by submission time in ascending order;

Step 202, starting from the first workload record, deleting all related workload records that were submitted between that record's submission time and the following time t*, and whose parameter choices differ from those of that record by less than x*;

Step 203, updating the workload data and repeating step 202 until all historical workload data has been traversed.

The invention uses the above algorithm to denoise and decorrelate the workload history of the system, which reduces the amount of data and yields relatively independent job records.
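The sweep in steps 201 to 203 can be sketched as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the record layout `(submit_time, params)`, the names `chebyshev` and `remove_dependent_jobs`, and the distance measure are hypothetical, and the sketch assumes that the earliest record of each cluster is kept as its representative while the later near-duplicates are dropped.

```python
def chebyshev(p, q):
    """Largest per-parameter difference (an assumed proximity measure)."""
    return max(abs(a - b) for a, b in zip(p, q))


def remove_dependent_jobs(jobs, t_star, x_star):
    """Thin a job history so that near-duplicate submissions are dropped.

    jobs: list of (submit_time, params), params being a tuple of numbers.
    Returns the kept records in ascending order of submission time.
    """
    remaining = sorted(jobs, key=lambda job: job[0])  # step 201
    kept = []
    while remaining:  # step 203: repeat until the history is exhausted
        head_time, head_params = remaining[0]
        kept.append(remaining[0])
        # Step 202: drop records submitted within t* after the head whose
        # parameters are closer than x* to the head's parameters.
        remaining = [
            (t, p)
            for (t, p) in remaining[1:]
            if not (t - head_time <= t_star and chebyshev(p, head_params) < x_star)
        ]
    return kept
```

For example, with t* = 5 and x* = 1, three identical requests submitted at times 0, 1, and 2 collapse to the single record at time 0, while an identical request at time 50 survives because it falls outside the time window.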

Long-time waiting probability chart. The invention defines a long-time waiting (LTW) threshold (the task resource waiting time threshold), for example 1 hour. Using the new data obtained after noise reduction and decorrelation, a table like the one in Fig. 2 is built, and for each partition the fraction of jobs in that partition whose waiting time exceeds the LTW threshold is computed, which yields the heatmap shown in Fig. 3.
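The per-partition computation can be sketched as below. The partition edges, the record layout `(nodes_requested, hours_requested, wait_hours)`, and the function name `ltw_table` are assumptions for illustration; the patent does not list the actual boundaries used for Kraken.

```python
from bisect import bisect_right
from collections import defaultdict


def ltw_table(jobs, node_edges, hour_edges, ltw_threshold_hours=1.0):
    """Per-partition job count and long-time waiting fraction.

    jobs: iterable of (nodes_requested, hours_requested, wait_hours).
    node_edges, hour_edges: sorted partition boundaries (hypothetical values).
    Returns {(node_bin, hour_bin): (total_jobs, ltw_fraction)}.
    """
    totals = defaultdict(int)
    long_waits = defaultdict(int)
    for nodes, hours, wait in jobs:
        cell = (bisect_right(node_edges, nodes), bisect_right(hour_edges, hours))
        totals[cell] += 1
        if wait > ltw_threshold_hours:
            long_waits[cell] += 1
    return {cell: (n, long_waits[cell] / n) for cell, n in totals.items()}
```

The `total_jobs` values correspond to a count table in the style of Fig. 2 and the `ltw_fraction` values to a heatmap in the style of Fig. 3. Partitions with very few jobs give unstable fractions, which is one reason the text asks that every partition cover a substantial number of records.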

Predicting task resource waiting time. The long-time waiting probability chart generated after noise reduction and decorrelation, as in Fig. 3, provides a high-level guideline for parameter selection. On the Kraken supercomputer, for example, for the same amount of computation (in service units), requesting a short running time on many compute nodes produces fewer long waits than requesting a long running time on few compute nodes, and therefore makes more efficient use of resources for supercomputer users. To provide finer-grained and more timely guidance, however, the invention must also take time into account. The specific steps are as follows:

Using the autocorrelation function to determine the correlated time interval. In the monthly time series of LTW probability shown in Fig. 4, the value at a given time is often strongly correlated with the adjacent historical data. The invention uses the autocorrelation function (ACF) to determine the time interval over which the data are correlated. Specifically, the invention uses the autocovariance function to describe the autocorrelation coefficient. The autocovariance function is the second-order mixed central moment between the values of a time series X(t) at any two different times t1 and t2; it describes how strongly the fluctuations of X(t) about its mean at the two times are correlated, and is also called the centered autocorrelation function. It is defined as

$R(k) = \frac{E[(X_i - \mu_i)(X_{i+k} - \mu_{i+k})]}{\sigma^2},$

where $E$ denotes the expected value, $X_i$ is the value of the random variable at time t(i), $\mu_i$ is the expected value at t(i), $X_{i+k}$ is the value of the random variable at time t(i+k), $\mu_{i+k}$ is the expected value at t(i+k), and $\sigma^2$ is the variance.
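A sample estimate of R(k) can be sketched as follows. This is a minimal illustration that assumes a stationary series, so a single sample mean stands in for both $\mu_i$ and $\mu_{i+k}$; the function names and the significance rule are not from the patent. The band of 1.96/√n is a common rule of thumb for an approximate 95% level under white noise, used here only as an example of how "correlated" lags might be picked.

```python
import math


def sample_acf(series, max_lag):
    """Sample autocorrelation R(0..max_lag) of a list of numbers."""
    n = len(series)
    mean = sum(series) / n
    var = sum((x - mean) ** 2 for x in series) / n
    if var == 0:
        raise ValueError("a constant series has no defined autocorrelation")
    acf = []
    for k in range(max_lag + 1):
        # Autocovariance at lag k, normalised by n (the usual biased estimator).
        cov = sum(
            (series[i] - mean) * (series[i + k] - mean) for i in range(n - k)
        ) / n
        acf.append(cov / var)
    return acf


def significant_lags(series, max_lag):
    """Lags whose |R(k)| exceeds the rule-of-thumb band 1.96 / sqrt(n)."""
    band = 1.96 / math.sqrt(len(series))
    acf = sample_acf(series, max_lag)
    return [k for k in range(1, max_lag + 1) if abs(acf[k]) > band]
```

Applied to a monthly series of LTW probabilities, `significant_lags` would return the lags (in months) treated as correlated with the current month; the window used for prediction can then be chosen from those lags.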

Fig. 5 shows the autocorrelation values of the monthly long-time waiting probability based on the job history of the supercomputer Kraken. As Fig. 5 shows, the long-time waiting probability (LTW probability) of a given month has a statistically significant correlation with the LTW probabilities of the two preceding months. The data from the correlated historical period and the Bayesian method are then used to predict the long-time waiting probability of the next period. Specifically, the denoised and decorrelated data that falls within the correlated time interval found in the previous step is selected, and the mean of the Beta-binomial hierarchical model is applied to predict the long-time waiting probability of the next period. For Kraken, because the autocorrelation function of the historical data shows that the long-time waiting probability of the next month is correlated with the data of the two preceding months, the mean of the Beta-binomial hierarchical model is:

$P_{LTW,k} = \frac{N_{LTW,k-1,k-2}}{N_{k-1,k-2}}$

where $N_{LTW,k-1,k-2}$ is the number of long-time waiting jobs in the two preceding months, $N_{k-1,k-2}$ is the total number of jobs in the two preceding months, and $P_{LTW,k}$ is the long-time waiting probability of the current month. The invention can thus predict the long-time waiting probability of every partition for the next month.
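The prediction for one partition can be sketched as below. The names and the `(ltw_jobs, total_jobs)` layout are hypothetical. The `alpha` and `beta` arguments are an assumption added for illustration: the posterior mean of a Beta(alpha, beta) prior with binomial data is (alpha + N_LTW) / (alpha + beta + N), and with alpha = beta = 0 it reduces to the plain ratio N_LTW / N described above. The patent does not state which prior parameters were used.

```python
def predict_ltw_probability(monthly_counts, window=2, alpha=0.0, beta=0.0):
    """Predict next month's long-time waiting probability for one partition.

    monthly_counts: list of (ltw_jobs, total_jobs) per month, oldest first.
    window: number of most recent months found to be correlated (2 for Kraken
            according to the text).
    alpha, beta: assumed Beta prior parameters; 0 and 0 give the plain ratio.
    """
    recent = monthly_counts[-window:]
    n_ltw = sum(ltw for ltw, _ in recent)
    n_total = sum(total for _, total in recent)
    denominator = alpha + beta + n_total
    if denominator == 0:
        raise ValueError("no jobs in the selected window and no prior given")
    return (alpha + n_ltw) / denominator
```

With monthly counts of (10, 100), (30, 100), and (50, 100) and a window of two months, the prediction is (30 + 50) / 200 = 0.4. A non-zero prior mainly matters for sparsely populated partitions, where the raw ratio would swing widely.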

本发明还提出了一种预测任务资源等待时间的系统，包括如下模块：The present invention also proposes a system for predicting the waiting time of task resources, including the following modules:

生成任务记录集合，用于通过自相关函数，获取该新历史任务记录中与该某一时间段具有相关性的时间段内的任务记录，生成任务记录集合；generating a task record set, which is used to obtain the task records in the time period related to the certain time period in the new historical task record through an autocorrelation function, and generate a task record set;

预测模块，用于设置任务资源等待时间阈值，获取该任务记录集合中该任务资源等待时间超过该任务资源等待时间阈值的任务记录的数量，并根据该任务记录集合中任务记录的总数量，通过贝叶斯方法预测该某时间段内的任务资源等待时间。The prediction module is used to set the task resource waiting time threshold, obtain the number of task records whose waiting time of the task resource in the task record set exceeds the task resource waiting time threshold, and according to the total number of task records in the task record set, pass The Bayesian method predicts the waiting time of task resources within a certain period of time.

判断模块，判断该历史任务记录中的任务记录是否存在依赖关系，其中选择时间临界点t*和空间临界点x*；如果该历史任务记录中的两个任务记录的提交时间间隔在时间临界点t*之内并且参数选取的临近程度在空间间隔x*之内，且配对的密度高于该历史任务记录中除该两个任务记录的配对的密度，则该两个任务记录在(t*，x*)尺度之内具有依赖关系。Judging module, judging whether there is a dependency relationship among the task records in the historical task record, wherein the time critical point t* and the spatial critical point x* are selected; if the submission time interval of the two task records in the historical task record is within the time critical point Within t* and the proximity of parameter selection is within the spatial interval x*, and the pairing density is higher than the pairing density of the historical task records except the two task records, then the two task records are in (t* , there is a dependency relationship within the scale of x*).

删除依赖关系模块，若存在则删除存在依赖关系的任务记录，其中将该历史任务记录中的任务记录按提交时间由小到大进行排列；从提交时间最小的任务记录开始，删除该提交时间最小的该任务记录至接下来t*时间内所有和该提交时间最小的该任务记录的参数选择临近程度小于x*的任务记录；更新该历史任务记录并重复该步骤22，直至遍历该历史任务记录，更新该历史任务记录。Delete the dependency module, if it exists, delete the task records with dependencies, where the task records in the historical task records are arranged in descending order of submission time; starting from the task record with the smallest submission time, delete the task record with the smallest submission time From the task record to the next t* time, the parameters of the task record with the minimum submission time are selected as task records whose proximity is less than x*; update the historical task record and repeat step 22 until the historical task record is traversed , to update the historical task record.

The formula of the Bayesian method in the prediction module is:

P_{LTW,k} = \frac{N_{LTW,k-1,k-2}}{N_{k-1,k-2}}

where N_{LTW,k-1,k-2} is the number of task records whose task resource waiting time exceeds the task resource waiting time threshold, N_{k-1,k-2} is the total number of task records in the task record set, and P_{LTW,k} is the probability that a task's resource waiting time exceeds the task resource waiting time threshold within the given time period.
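The probability defined above is a ratio of two counts, which might be computed as in the following Python sketch. The function name long_wait_probability and the sample waiting times are hypothetical; the sketch assumes the task record set for the correlated time periods has already been selected and is passed in as a list of waiting times.

```python
def long_wait_probability(wait_times, threshold):
    """Estimate P_LTW,k as N_LTW / N over the selected task record set.

    wait_times: task resource waiting times of the records in the set
    threshold:  task resource waiting time threshold, in the same unit
    """
    total = len(wait_times)  # N_{k-1,k-2}
    if total == 0:
        raise ValueError("the task record set is empty")
    long_waits = sum(1 for w in wait_times if w > threshold)  # N_{LTW,k-1,k-2}
    return long_waits / total


# Illustrative values only: two of the four waits exceed the threshold.
p_ltw = long_wait_probability([10, 200, 30, 400], threshold=100)
```

The empty-set check is a design choice of this sketch; the original text does not say how an empty task record set should be handled.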

Claims

1. A method for predicting task resource waiting time, characterized in that it comprises:

Step 1, obtain historical task records, delete task records that have dependencies in the historical task records, and generate new historical task records;

Step 2, using an autocorrelation function, obtain from the new historical task records the task records in time periods correlated with the time period to be predicted, and generate a task record set;

Step 3, set a task resource waiting time threshold, obtain the number of task records in the task record set whose task resource waiting time exceeds the task resource waiting time threshold, and, based on the total number of task records in the task record set, predict the task resource waiting time within the time period to be predicted by the Bayesian method.

2. The method for predicting task resource waiting time as claimed in claim 1, characterized in that step 1 specifically comprises:

Step 11, judging whether there is a dependency relationship among the task records in the historical task record;

Step 12, delete task records with dependencies if they exist.

3. The method for predicting task resource waiting time as claimed in claim 2, characterized in that step 11 comprises:

Step 21, select time critical point t* and spatial critical point x*;

Step 22, if the submission time interval of two task records in the historical task records is within the time critical point t*, the proximity of their parameter selections is within the spatial interval x*, and the pairing density is higher than the pairing density of the historical task records excluding these two task records, then the two task records have a dependency relationship within the (t*, x*) scale.

4. The method for predicting task resource waiting time as claimed in claim 2, characterized in that step 12 comprises:

Step 31, sort the task records in the historical task records by submission time in ascending order;

Step 32, starting from the task record with the earliest submission time, delete all task records submitted within the following t* time whose parameter selection proximity to that earliest task record is less than x*;

Step 33, update the historical task records and repeat step 22 until the historical task records have been traversed, and update the historical task records.

5. The method for predicting task resource waiting time as claimed in claim 1, characterized in that the formula of the Bayesian method in step 3 is:

P_{LTW,k} = \frac{N_{LTW,k-1,k-2}}{N_{k-1,k-2}}

where N_{LTW,k-1,k-2} is the number of task records whose task resource waiting time exceeds the task resource waiting time threshold, N_{k-1,k-2} is the total number of task records in the task record set, and P_{LTW,k} is the probability that a task's resource waiting time exceeds the task resource waiting time threshold within the time period to be predicted.

6. A system for predicting task resource waiting time, characterized in that it comprises:

A new historical task record generation module, configured to obtain historical task records, delete the task records that have dependencies from the historical task records, and generate new historical task records;

A task record set generation module, configured to use an autocorrelation function to obtain, from the new historical task records, the task records in time periods correlated with the time period to be predicted, and generate a task record set;

A prediction module, configured to set a task resource waiting time threshold, obtain the number of task records in the task record set whose task resource waiting time exceeds the task resource waiting time threshold, and, based on the total number of task records in the task record set, predict the task resource waiting time within the time period to be predicted by the Bayesian method.

7. The system for predicting task resource waiting time as claimed in claim 6, characterized in that the new historical task record generation module further comprises:

A judging module, configured to judge whether a dependency relationship exists among the task records in the historical task records;

A dependency deletion module, configured to delete the task records that have dependencies, if any exist.

8. The system for predicting task resource waiting time as claimed in claim 7, wherein the specific functions of the judging module include: selecting a time critical point t* and a spatial critical point x*; if the submission time interval of two task records in the historical task records is within the time critical point t*, the proximity of their parameter selections is within the spatial interval x*, and the pairing density is higher than the pairing density of the historical task records excluding these two task records, then the two task records have a dependency relationship within the (t*, x*) scale.

9. The system for predicting task resource waiting time as claimed in claim 7, wherein the specific functions of the dependency deletion module include: sorting the task records in the historical task records by submission time in ascending order; starting from the task record with the earliest submission time, deleting all task records submitted within the following t* time whose parameter selection proximity to that earliest task record is less than x*; and updating the historical task records and repeating step 22 until the historical task records have been traversed, and updating the historical task records.

10. The system for predicting task resource waiting time as claimed in claim 6, characterized in that the formula of the Bayesian method in the prediction module is:

P_{LTW,k} = \frac{N_{LTW,k-1,k-2}}{N_{k-1,k-2}}