CN108416054A

CN108416054A - Dynamic HDFS copy number calculating methods based on file access temperature

Info

Publication number: CN108416054A
Application number: CN201810228575.7A
Authority: CN
Inventors: 代钰; 杨雷; 化红翠; 王际烽; 张斌
Original assignee: Northeastern University China
Current assignee: Northeastern University China
Priority date: 2018-03-20
Filing date: 2018-03-20
Publication date: 2018-08-17
Anticipated expiration: 2038-03-20
Also published as: CN108416054B

Abstract

The present invention provides a kind of dynamic HDFS copy number calculating methods based on file access temperature, is related to data analysis technique field.Dynamic HDFS copy number calculating methods based on file access temperature, the rule that the access temperature of hot spot file changes over time is obtained by improved Markovian model type analysis first, and according to the calculation formula of file access temperature, the access temperature of file is predicted.Then queueing theory algorithm is used, the calculation formula of copy number is provided, dynamic adjusts the copy number of hot spot file.Dynamic HDFS copy number calculating methods provided by the invention based on file access temperature, solve the problems, such as the access bottleneck to hot spot file, improve the efficiency of service of cluster.

Description

Calculation method of dynamic HDFS replica number based on file access heat

技术领域technical field

本发明涉及数据分析技术领域，尤其涉及一种基于文件访问热度的动态HDFS副本个数计算方法。The invention relates to the technical field of data analysis, in particular to a method for calculating the number of dynamic HDFS copies based on file access heat.

背景技术Background technique

随着现代互联网技术的发展与科学技术的进步，数据以大容量、多样性、高速度及真实性的特点渗透到社会发展的各个行业和领域中。海量数据的增长趋势，对数据和资源进行合理的管理，并保证数据的可靠性，已经成为云计算领域面临的一个关键性问题。With the development of modern Internet technology and the advancement of science and technology, data has penetrated into various industries and fields of social development with the characteristics of large capacity, diversity, high speed and authenticity. With the increasing trend of massive data, reasonable management of data and resources and ensuring the reliability of data have become a key issue in the field of cloud computing.

由Apache基金会所开发的分布式系统基础架构Hadoop实现了一个分布式文件系统(Hadoop Distributed File System)，简称HDFS。HDFS有高容错性的特点，并且设计用来部署在低廉的(low-cost)硬件上；而且它提供高吞吐量(high throughput)来访问应用程序的数据，适合那些有着超大数据集(large data set)的应用程序。HDFS放宽了(relax)POSIX的要求，可以以流的形式访问(streaming access)文件系统中的数据。在HDFS的副本管理机制中，集群默认采取为文件的每个数据块保存3个副本数的副本管理机制，但不能满足不同用户对不同文件的访问需求，当用户对于某个文件的访问量增大时，默认的数据块副本个数无法响应大量的访问请求，造成热点文件访问瓶颈问题。目前相关的副本管理方法已经逐步由静态副本创建策略转向动态副本创建策略，当外界环境变化时，使集群的整个性能能够不变或能够保持高效率地为客户端提供服务。但是在动态副本创建策略中依旧存在一些没有考虑到但对集群的工作效率具有重要影响的因素。The distributed system infrastructure Hadoop developed by the Apache Foundation implements a distributed file system (Hadoop Distributed File System), referred to as HDFS. HDFS has the characteristics of high fault tolerance and is designed to be deployed on low-cost (low-cost) hardware; and it provides high throughput (high throughput) to access application data, suitable for those with large data sets (large data set) applications. HDFS relaxes the requirements of (relax) POSIX, and can access (streaming access) data in the file system in the form of streams. In the copy management mechanism of HDFS, the cluster adopts the copy management mechanism that saves 3 copies for each data block of the file by default, but it cannot meet the access requirements of different users for different files. When the value is large, the default number of copies of data blocks cannot respond to a large number of access requests, resulting in bottlenecks in hot file access. At present, the relevant copy management methods have gradually shifted from static copy creation strategy to dynamic copy creation strategy, so that when the external environment changes, the overall performance of the cluster can be kept unchanged or can provide services to clients efficiently. However, there are still some factors that are not considered in the dynamic replica creation strategy but have an important impact on the working efficiency of the cluster.

现有技术中，文献《云环境下的高效多副本管理研究》针对大规模云存储系统的成本效益保证问题提出了一种动态副本创建方法，它综合考虑副本个数与可用性之间的关系，即在考虑满足云存储系统可用性的前提下调整副本个数，但是没有考虑文件访问热度与副本个数之间的关系。文献《An Elastic Replication Management System for HDFS》提出active/standby存储模型实现对HDFS副本的灵活管理，它利用复杂的事务理引擎对实时访问的数据量进行识别，动态地调整副本个数，并引入纠删码对副本个数进行管理。此系统虽然有效地提高了HDFS的性能，但是其实现过程复杂，在识别实时访问数据时复杂度高。In the prior art, the document "Research on Efficient Multi-Replica Management in Cloud Environment" proposes a dynamic replica creation method for the cost-effectiveness guarantee of large-scale cloud storage systems, which comprehensively considers the relationship between the number of replicas and availability, That is, the number of copies is adjusted under the premise of meeting the availability of the cloud storage system, but the relationship between the file access heat and the number of copies is not considered. The document "An Elastic Replication Management System for HDFS" proposes an active/standby storage model to achieve flexible management of HDFS copies. It uses a complex transaction engine to identify the amount of data accessed in real time, dynamically adjusts the number of copies, and introduces correction Erasure codes manage the number of copies. Although this system effectively improves the performance of HDFS, its implementation process is complicated, and the complexity is high when identifying real-time access data.

发明内容Contents of the invention

针对现有技术的缺陷，本发明提供一种基于文件访问热度的动态HDFS副本个数计算方法，实现对动态副本个数进行计算。Aiming at the defects of the prior art, the present invention provides a method for calculating the number of dynamic HDFS copies based on file access heat, so as to realize the calculation of the number of dynamic copies.

一种基于文件访问热度的动态HDFS副本个数计算方法，包括以下步骤：A method for calculating the number of dynamic HDFS copies based on file access heat, comprising the following steps:

步骤1、根据分布式文件系统HDFS上文件访问日志表并按照文件访问热度的计算公式计算得到统计周期内各文件的访问热度，并按统计时间内文件的访问热度之和对文件进行降序排序，选择排序列表中的前20％的文件作为热点文件，构建热点文件-访问热度序列作为待预测序列，进行访问热度的预测；Step 1. According to the file access log table on the distributed file system HDFS and according to the calculation formula of file access heat, the access heat of each file in the statistical period is calculated, and the files are sorted in descending order according to the sum of the file access heat within the statistical period. Select the top 20% files in the sorted list as hot files, construct the hot file-access popularity sequence as the sequence to be predicted, and predict the access popularity;

所述文件访问热度的计算公式如下式所示：The formula for calculating the file access popularity is as follows:

其中，Hot(f)表示文件f的访问热度，AF(f)＝N/T表示文件f的访问频率，N表示文件f在统计周期T内的访问次数，表示文件f的数据块大小，f_size表示文件f的大小，表示不大于的最大整数，即得到文件f的数据块个数；Among them, Hot(f) represents the access heat of file f, AF(f)=N/T represents the access frequency of file f, and N represents the number of visits of file f in the statistical period T, Indicates the data block size of file f, f _size indicates the size of file f, means not greater than the largest integer of That is, the number of data blocks of the file f is obtained;

步骤2、采用层次聚类算法对热点文件-访问热度序列进行状态空间划分，具体方法为：Step 2. Use the hierarchical clustering algorithm to divide the state space of the hot file-access heat sequence. The specific method is:

将热点文件-访问热度序列构成一个长度为N的数据集，数据集中的对象代表热点文件在不同时刻的访问热度，对热点文件-访问热度数据集进行层次聚类的过程为：The hotspot files-access heat sequence constitutes a data set with a length of N. The objects in the data set represent the access heat of hot files at different times. The process of hierarchical clustering of the hot file-access heat data set is as follows:

(1)将数据集中的每个对象看作一类，共得到N类，类与类之间的距离为两类中各个数据点两两距离的平方的中间值；(1) Treat each object in the data set as a class, and obtain N classes in total, and the distance between classes is the median value of the square of the distance between each data point in the two classes;

(2)将距离最近的两个类合并成一个类，使得类的总数减少一个；(2) Merge the two closest classes into one class, reducing the total number of classes by one;

(3)重新计算新的类与其他类之间的距离；(3) Recalculate the distance between the new class and other classes;

(4)重复(2)-(3)步，直到最后将数据集中所有数据对象合并为一个类为止；(4) Repeat steps (2)-(3) until all data objects in the data set are finally merged into one class;

基于以上步骤，得到热点文件-访问热度序列的聚类树，根据聚类树结构，定义马尔可夫划分的状态空间；Based on the above steps, the clustering tree of the hotspot file-access heat sequence is obtained, and the state space divided by Markov is defined according to the clustering tree structure;

步骤3、对划分了状态空间的热点文件-访问热度序列进行马氏性检验，如果满足马氏性，将该序列作为改进马尔可夫模型的输入序列，否则该序列不能用改进的马尔可夫模型来处理；Step 3. Carry out the Markov property test on the hotspot file-access heat sequence that has divided the state space. If the Markov property is satisfied, use the sequence as the input sequence of the improved Markov model, otherwise the sequence cannot be used by the improved Markov model. model to handle;

步骤4、将满足马氏性的热点文件-访问热度序列作为改进马尔可夫模型的输入序列，预测下一时刻热点文件的访问热度，并将预测得到的访问热度写入到热点文件-访问热度数据库表中，具体方法为：Step 4. Use the hotspot file-access heat sequence that satisfies the Markov property as the input sequence of the improved Markov model, predict the access heat of the hot file at the next moment, and write the predicted access heat into the hot file-access heat In the database table, the specific method is:

步骤4.1：基于所划分的状态空间，根据文件-访问热度序列，计算得到一步状态转移概率矩阵P；Step 4.1: Based on the divided state space, calculate the one-step state transition probability matrix P according to the file-access heat sequence;

步骤4.2：将当前时刻文件访问热度值对应的状态设置为初始状态分布，记为p(0)，根据一步状态转移概率矩阵P计算得到下一时刻的状态概率分布p(1)＝p(0)P；Step 4.2: Set the state corresponding to the current file access popularity value as the initial state distribution, denoted as p(0), and calculate the state probability distribution at the next moment p(1)=p(0) according to the one-step state transition probability matrix P )P;

步骤4.3：取下一时刻的状态概率分布p(1)中分布概率最大值的状态作为下一时刻的状态，取热点文件-访问热度序列的标准差与目标状态空间的平均值之和作为下一时刻的预测访问热度值；Step 4.3: Take the state with the maximum distribution probability in the state probability distribution p(1) at the next moment as the state at the next moment, and take the sum of the standard deviation of the hotspot file-access heat sequence and the average value of the target state space as the next Predicted access popularity value at a moment;

步骤4.4：去除输入序列的第一个值，并将新预测的访问热度值作为下一次预测序列的最后一个值加入到待预测序列中，重复以上步骤，预测热点文件下一时刻的访问热度；Step 4.4: Remove the first value of the input sequence, and add the newly predicted access popularity value as the last value of the next prediction sequence to the sequence to be predicted, repeat the above steps, and predict the access popularity of hot files at the next moment;

步骤5、基于M/M/r单队列多服务台的队列模型对副本访问请求进行建模，并在此基础上设置节点上副本的吞吐量来确定副本的个数，具体方法为：Step 5. Model the copy access request based on the M/M/r single-queue multi-server queue model, and on this basis, set the throughput of the copy on the node to determine the number of copies. The specific method is:

步骤5.1：通过查询热点文件-访问热度数据库表统计得到指定热点文件在下一统计周期内的副本的访问平均请求率λ；Step 5.1: Obtain the average access request rate λ of the copy of the specified hot file in the next statistical period by querying the hot file-access heat database table statistics;

步骤5.2：设置副本所在服务器的CPU利用率阈值U，根据CPU效用法则，CPU利用率等于请求到达率除以服务率，利用下面的公式，计算得到单服务器的请求服务率μ为：Step 5.2: Set the CPU utilization threshold U of the server where the replica is located. According to the CPU utility law, the CPU utilization is equal to the request arrival rate divided by the service rate. Using the following formula, the request service rate μ of a single server is calculated as:

步骤5.3：设置集群总吞吐量约束为Q，基于排队论理论中的Little公式，服务停留时间等于服务率乘以的倒数，吞吐量等于服务停留时间的倒数；在同构集群环境中，多副本所在服务器服务率是相同的，从而，通过以下两个公式计算得到副本个数r：Step 5.3: Set the total throughput constraint of the cluster to Q, based on the Little formula in queuing theory, the service residence time is equal to the service rate multiplied The reciprocal of the throughput is equal to the reciprocal of the service residence time; in a homogeneous cluster environment, the service rate of the server where multiple copies are located is the same, so the number of copies r is calculated by the following two formulas:

由上述技术方案可知，本发明的有益效果在于：本发明提供的基于文件访问热度的动态HDFS副本个数计算方法，通过基于改进马尔可夫模型对文件的访问热度进行预测，提高了预测的准确性。同时，基于排队论的副本个数计算方法，考虑热点文件访问热度随时间变化的规律，动态调整其副本个数，以应对热点文件的高并发访问情况的发生。采用排队论的方法，将节点上存储副本当作服务资源，通过分析热点文件副本的请求率及响应率，以保证集群吞吐量与可靠性为目标，通过副本计算公式可以得到副本个数，为后续动态调整副本个数作铺垫。It can be seen from the above technical solution that the beneficial effect of the present invention is that the method for calculating the number of dynamic HDFS copies based on file access heat provided by the present invention can improve the accuracy of prediction by predicting the file access heat based on the improved Markov model. sex. At the same time, based on the calculation method of the number of copies based on queuing theory, considering the law of hot file access over time, the number of copies is dynamically adjusted to cope with the occurrence of high concurrent access to hot files. Using the method of queuing theory, the storage copy on the node is regarded as a service resource. By analyzing the request rate and response rate of hot file copies, the goal is to ensure the throughput and reliability of the cluster. The number of copies can be obtained through the copy calculation formula, as Subsequent dynamic adjustment of the number of copies as a foreshadowing.

附图说明Description of drawings

图1为本发明实施例提供的基于文件访问热度的动态HDFS副本个数计算方法的流程图；Fig. 1 is the flow chart of the method for calculating the number of dynamic HDFS copies based on file access heat provided by the embodiment of the present invention;

图2为本发明实施例提供的使用改进马尔可夫模型预测下一时刻热点文件的访问热度的预测过程示意图；Fig. 2 is a schematic diagram of the prediction process of using the improved Markov model to predict the access popularity of hot files at the next moment provided by the embodiment of the present invention;

图3为本发明实施例提供的马尔可夫模型的预测值、改进的马尔可夫模型的预测值与真实值之间的对比图；Fig. 3 is a comparison diagram between the predicted value of the Markov model provided by the embodiment of the present invention, the predicted value of the improved Markov model, and the real value;

图4为本发明实施例提供的基于排队论计算得到副本的个数与实际的副本吞吐量计算得到的副本个数的对比图。FIG. 4 is a comparison diagram between the number of replicas calculated based on queuing theory and the number of replicas calculated based on actual replica throughput provided by an embodiment of the present invention.

具体实施方式Detailed ways

下面结合附图和实施例，对本发明的具体实施方式作进一步详细描述。以下实施例用于说明本发明，但不用来限制本发明的范围。The specific implementation manners of the present invention will be further described in detail below in conjunction with the accompanying drawings and embodiments. The following examples are used to illustrate the present invention, but are not intended to limit the scope of the present invention.

本实施例中，搭建3个机架，每个机架上配置4台虚拟机，并将搭建另外三台实体机，分别为处于Active状态的namenode节点，和处于standby状态的namenode节点，用来防止namenode单点故障。将第三台实体机作为计算节点，用来获取文件访问日志、预测文件的访问热度及计算副本的个数。集群的配置为Hadoop版本Hadoop-2.2.0，内存32G，CPU Intel(R)Core(TM)i3-2120 CPU@3.30GHz，操作系统CentOS-6.7，硬盘2T，开发语言JAVA，R，Matlab。In this embodiment, 3 racks are built, and 4 virtual machines are configured on each rack, and another three physical machines will be built, which are the namenode node in the active state and the namenode node in the standby state for Prevent namenode single point of failure. The third physical machine is used as a computing node to obtain file access logs, predict file access popularity, and calculate the number of copies. The configuration of the cluster is Hadoop version Hadoop-2.2.0, memory 32G, CPU Intel(R) Core(TM) i3-2120 CPU@3.30GHz, operating system CentOS-6.7, hard disk 2T, development language JAVA, R, Matlab.

基于文件访问热度的动态HDFS副本个数计算方法，如图1所示，包括以下步骤：The method for calculating the number of dynamic HDFS copies based on file access heat, as shown in Figure 1, includes the following steps:

步骤1、根据分布式文件系统HDFS上文件访问日志表并按照访问热度的计算公式计算得到统计周期内各文件的访问热度，并按统计时间内文件的访问热度之和对文件进行降序排序，选择排序列表中的前20％的文件作为热点文件，构建热点文件-访问热度序列作为待预测序列，进行访问热度的预测；Step 1. According to the file access log table on the distributed file system HDFS and according to the calculation formula of access heat, the access heat of each file in the statistical period is calculated, and the files are sorted in descending order according to the sum of the file access heat within the statistical period, and select The top 20% files in the sorting list are used as hot files, and the hot file-access popularity sequence is used as the sequence to be predicted to predict the access popularity;

其中，Hot(f)表示文件f的访问热度，AF(f)＝N/T表示文件f的访问频率，N表示文件f在统计周期T内的访问次数，表示文件f的数据块大小，f_size表示文件f的大小，表示不大于的最大整数，即得到文件f的数据块个数。Among them, Hot(f) represents the access heat of file f, AF(f)=N/T represents the access frequency of file f, and N represents the number of visits of file f in the statistical period T, Indicates the data block size of file f, f _size indicates the size of file f, means not greater than the largest integer of That is, the number of data blocks of the file f is obtained.

本实施例中，以5天为一个统计周期，共统计5个周期内的flu.txt文件的访问频率。根据文件访问日志表按照访问热度的计算公式计算得到统计周期内flu.txt的访问热度信息如表1所示。In this embodiment, with 5 days as a statistical period, the access frequency of the flu.txt file within 5 periods is counted. According to the file access log table, the access heat information of flu.txt in the statistical period is calculated according to the calculation formula of access heat, as shown in Table 1.

表1 flu.txt文件的访问热度信息表Table 1 Access popularity information table of flu.txt file

访问时间interview time 2017-08-012017-08-01 2017-08-022017-08-02 2017-08-032017-08-03 2017-08-042017-08-04 2017-08-052017-08-05 访问热度Access popularity 262262 486486 632632 300300 570570 访问时间interview time 2017-08-062017-08-06 …… …… 2017-10-022017-10-02 …… 访问热度Access popularity 401401 …… …… 382382 ……

基于以上步骤，得到热点文件-访问热度序列的聚类树，根据聚类树结构，定义马尔可夫划分的状态空间。Based on the above steps, the clustering tree of the hotspot file-access heat sequence is obtained, and the state space divided by Markov is defined according to the clustering tree structure.

本实施例中，采用层次聚类方法对历史访问热度划分空间状态，将历史数据划分为5个空间状态，并用A、B、C、D和E对数据集进行标记。In this embodiment, the hierarchical clustering method is used to divide the spatial state of the historical access heat, the historical data is divided into five spatial states, and the data sets are marked with A, B, C, D and E.

马氏性检验的具体方法为：The specific method of Markov test is as follows:

对包含n个可能的状态指标序列值X_n＝{x₁，x₂，...，x_n}，将转移频数矩阵的第j列之和除以各行各列的总和所得到的值，称为边际概率，如下式所示：For n possible state index sequence values X _n ={x ₁ , x ₂ ,...,x _n }, the value obtained by dividing the sum of the jth column of the transition frequency matrix by the sum of each row and each column, is called the marginal probability, as shown in the following formula:

其中，f_ij表示指标序列X_n＝{x₁，x₂，...，x_n}中从状态i经一步转移到达状态j的频率，i，j∈E；Among them, f _ij represents the frequency in the index sequence X _n = {x ₁ , x ₂ , ..., x _n } from state i to state j after one step transfer, i, j∈E;

则统计量以自由度为(n-1)²的χ²分布为极限分布，其中，then statistic Take the ^χ2 distribution with (n-1) ² degrees of freedom as the limiting distribution, where,

给定显著性水平α，若存在则此序列X_n是符合马氏性的，否则该序列不能用马尔可夫模型来处理。Given a significance level α, if there is Then this sequence X _n conforms to the Markov property, otherwise the sequence cannot be processed by the Markov model.

本实施例中，用R语言处理可以得到如下式所示的一步频数转移矩阵f_ij和概率转移矩阵p_ij，以及如表2所示边际概率矩阵p_.j。In this embodiment, the one-step frequency transfer matrix f _ij and the probability transfer matrix p _ij shown in the following formula can be obtained by R language processing, as well as the marginal probability matrix p _.j shown in Table 2.

表2边际概率表Table 2 Marginal probability table

状态state 11 22 33 44 55 p_.j _p.j 0.170212770.17021277 0.425531910.42553191 0.170212770.17021277 0.085106380.08510638 0.148936170.14893617

根据以上值计算得到统计量得到如表3所示的χ²的统计量计算表。Statistics are calculated from the above values Get the χ ² statistics calculation table shown in Table 3.

表3 χ²的统计量计算表Table 3 Calculation table of χ ² statistics

本实施例中，显著性水平α＝0.1，根据χ²的统计量计算表得到分位点其中，n＝5。因此该文件的历史访问热度是满足马氏性的，可以用马尔可夫模型对文件的访问热度进行预测。In the present embodiment, the significance level α=0.1, obtains the quantile according to the statistics calculation table of χ ² Among them, n=5. Therefore, the historical access heat of the file satisfies the Markov property, and the Markov model can be used to predict the file access heat.

步骤4、将满足马氏性的热点文件-访问热度序列作为改进马尔可夫模型的输入序列，预测下一时刻热点文件的访问热度，并将预测得到的访问热度写入到热点文件-访问热度数据库表中，如图2所示，具体方法为：Step 4. Use the hotspot file-access heat sequence that satisfies the Markov property as the input sequence of the improved Markov model, predict the access heat of the hot file at the next moment, and write the predicted access heat into the hot file-access heat In the database table, as shown in Figure 2, the specific method is:

步骤4.4：去除输入序列的第一个值，并将新预测的访问热度值作为下一次预测序列的最后一个值加入到待预测序列中，重复以上步骤，预测热点文件下一时刻的访问热度。Step 4.4: Remove the first value of the input sequence, and add the newly predicted access popularity value as the last value of the next prediction sequence to the sequence to be predicted, repeat the above steps, and predict the access popularity of hotspot files at the next moment.

本实施例中，为了对此方法预测的准确性进行验证，分别采用改进与未改进的马尔可夫模型对5个周期的flu.txt访问热度进行对比。马尔可夫模型预测值、改进的马尔可夫模型的预测值与真实值之间的对比图如图3所示。由图可知，预测第一个周期下一时刻的访问热度值时，由于访问热度的序列是一样的，所以用改进和未改进的马尔可夫模型得到的访问热度与实际访问热度值偏差是相同的，并且两种方法预测的访问热度值与实际值没有太大的差异。但在预测以后时刻的访问热度时，采用改进的马尔可夫模型由于采用了不断更新待预测的序列，使预测的访问热度与实际偏差不大，而未改进的马尔可夫模型，由于马尔可夫模型本身的遍历性和平衡分布特征，随着预测频数的加增，预测的结果与实际偏差较大。结果显示，改进的马尔可夫模型预测的访问热度值与实际值是比较接近的，并且对热点文件的访问趋势也有比较好的预测效果。In this embodiment, in order to verify the prediction accuracy of this method, the improved and unimproved Markov models were used to compare the flu.txt access popularity in five cycles. The comparison chart between the predicted value of the Markov model, the predicted value of the improved Markov model and the real value is shown in Figure 3. It can be seen from the figure that when predicting the visit popularity value at the next moment of the first period, since the visit popularity sequence is the same, the deviation between the visit popularity obtained by the improved and unimproved Markov models and the actual visit popularity value is the same , and there is not much difference between the visit popularity value predicted by the two methods and the actual value. However, when predicting the popularity of visits in the future, the improved Markov model adopts the continuously updated sequence to be predicted, so that the predicted visit popularity has little deviation from the actual one, while the unimproved Markov model, due to the Markov Due to the ergodicity and balanced distribution characteristics of the husband model itself, with the increase of the forecast frequency, the predicted result deviates greatly from the actual one. The results show that the access heat value predicted by the improved Markov model is relatively close to the actual value, and it also has a better prediction effect on the access trend of hot files.

步骤5.2：设置副本所在服务器的CPU利用率阈值U，根据CPU效用法则，CPU利用率等于请求到达率除以服务率。从而，利用下面的公式，计算单服务器的请求服务率u：Step 5.2: Set the CPU utilization threshold U of the server where the replica is located. According to the CPU utility law, the CPU utilization is equal to the request arrival rate divided by the service rate. Thus, use the following formula to calculate the request service rate u of a single server:

本实施例中，通过改进后的马尔可夫模型预测得到热点文件的访问热度后，根据热点文件的访问热度，设置节点CPU利用率阈值为0.5，设置节点副本的吞吐量为100/s，则日平均吞吐量为访问11个小时的吞吐量为100*11h*3600＝400万，基于排队论计算得到副本的个数，并与实际的副本吞吐量计算得到的副本个数做对比，对比图如图4所示。由图可知，考虑副本的请求访问率与响应率，计算副本个数的方法能根据访问热度的趋势调整副本个数，在第一个周期内，热点文件的访问热度呈下降的趋势，此时，基于排队论得到的副本个数比实际的副本个数要少，在后续周期时间内，考虑热点文件访问热度的趋势动态调整副本个数，并且与实际吞吐量计算得到的副本数比较接近，验证了此方法的有效性。In this embodiment, after the access heat of the hot file is predicted by the improved Markov model, according to the access heat of the hot file, the CPU utilization threshold of the node is set to 0.5, and the throughput of the node copy is set to 100/s, then The average daily throughput is 100*11h*3600=4 million for 11 hours of access. The number of copies is calculated based on queuing theory, and compared with the number of copies calculated by the actual copy throughput. The comparison diagram As shown in Figure 4. It can be seen from the figure that considering the request access rate and response rate of the copy, the method of calculating the number of copies can adjust the number of copies according to the trend of access popularity. In the first cycle, the access popularity of hot files shows a downward trend. At this time , the number of copies obtained based on queuing theory is less than the actual number of copies. In the subsequent cycle time, the number of copies is dynamically adjusted considering the trend of hot file access, and is closer to the number of copies obtained by the actual throughput calculation. The effectiveness of this method was verified.

最后应说明的是：以上实施例仅用以说明本发明的技术方案，而非对其限制；尽管参照前述实施例对本发明进行了详细的说明，本领域的普通技术人员应当理解：其依然可以对前述实施例所记载的技术方案进行修改，或者对其中部分或者全部技术特征进行等同替换；而这些修改或者替换，并不使相应技术方案的本质脱离本发明权利要求所限定的范围。Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present invention, rather than to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that: it can still be Modifications are made to the technical solutions described in the foregoing embodiments, or equivalent replacements are made to some or all of the technical features; these modifications or replacements do not make the essence of the corresponding technical solutions depart from the scope defined by the claims of the present invention.

Claims

1. A method for calculating the number of dynamic HDFS copies based on file access heat, characterized in that: comprise the following steps:

Step 1. According to the file access log table on the distributed file system HDFS and according to the calculation formula of file access heat, the access heat of each file in the statistical period is calculated, and the files are sorted in descending order according to the sum of the file access heat within the statistical period. Select the top 20% files in the sorted list as hot files, construct the hot file-access popularity sequence as the sequence to be predicted, and predict the access popularity;

Step 2, using a hierarchical clustering algorithm to divide the state space of the hot file-access heat sequence;

Step 3. Carry out the Markov property test on the hotspot file-access heat sequence that has divided the state space. If the Markov property is satisfied, use the sequence as the input sequence of the improved Markov model, otherwise the sequence cannot be used by the improved Markov model. model to handle;

Step 4. Use the hotspot file-access heat sequence that satisfies the Markov property as the input sequence of the improved Markov model, predict the access heat of the hot file at the next moment, and write the predicted access heat into the hot file-access heat in the database table;

Step 5: Model the copy access request based on the M/M/r single-queue multi-server queue model, and on this basis set the throughput of the copy on the node to determine the number of copies.

2. the method for calculating the number of dynamic HDFS copies based on file access heat according to claim 1, characterized in that: the calculation formula of file access heat described in step 1 is as follows:

Among them, Hot(f) represents the access heat of file f, AF(f)=N/T represents the access frequency of file f, and N represents the number of visits of file f in the statistical period T, Indicates the data block size of file f, f _size indicates the size of file f, means not greater than the largest integer of That is, the number of data blocks of the file f is obtained.

3. the method for calculating the number of dynamic HDFS copies based on file access heat according to claim 1, is characterized in that: the concrete method of described step 2 is:

The hotspot files-access heat sequence constitutes a data set with a length of N. The objects in the data set represent the access heat of hot files at different times. The process of hierarchical clustering of the hot file-access heat data set is as follows:

(1) Treat each object in the data set as a class, and obtain N classes in total, and the distance between classes is the median value of the square of the distance between each data point in the two classes;

(2) Merge the two closest classes into one class, reducing the total number of classes by one;

(3) Recalculate the distance between the new class and other classes;

(4) Repeat steps (2)-(3) until finally all data objects in the data set are merged into one class;

Based on the above steps, the clustering tree of the hotspot file-access heat sequence is obtained, and the state space divided by Markov is defined according to the clustering tree structure.

4. the method for calculating the number of dynamic HDFS copies based on file access heat according to claim 1, is characterized in that: the concrete method of described step 4 is:

Step 4.1: Based on the divided state space, calculate the one-step state transition probability matrix P according to the file-access heat sequence;

Step 4.2: Set the state corresponding to the current file access popularity value as the initial state distribution, denoted as p(0), and calculate the state probability distribution at the next moment p(1)=p(0) according to the one-step state transition probability matrix P )P;

Step 4.3: Take the state with the maximum distribution probability in the state probability distribution p(1) at the next moment as the state at the next moment, and take the sum of the standard deviation of the hotspot file-access heat sequence and the average value of the target state space as the next Predicted access popularity value at a moment;

Step 4.4: Remove the first value of the input sequence, and add the newly predicted access popularity value as the last value of the next prediction sequence to the sequence to be predicted, repeat the above steps, and predict the access popularity of hotspot files at the next moment.

5. the method for calculating the number of dynamic HDFS copies based on file access heat according to claim 1, is characterized in that: the concrete method of described step 5 is:

Step 5.1: obtain the access average request rate λ of the copy of specified hot file in the next statistical period by querying hot file-access heat database table statistics;

Step 5.2: Set the CPU utilization threshold U of the server where the replica is located. According to the CPU utility law, the CPU utilization is equal to the request arrival rate divided by the service rate. Using the following formula, the request service rate μ of a single server is calculated as:

Step 5.3: Set the total throughput constraint of the cluster to Q, based on the Little formula in queuing theory, the service residence time is equal to the service rate multiplied The reciprocal of the throughput is equal to the reciprocal of the service residence time; in a homogeneous cluster environment, the service rate of the server where multiple copies are located is the same, so the number of copies r is calculated by the following two formulas:

.