CN105574153A - Replica placement method based on file heat analysis and K-means

Info

Publication number: CN105574153A
Application number: CN201510943677.3A
Authority: CN
Inventors: 马廷淮; 李坚; 田伟; 金子龙
Original assignee: Nanjing University of Information Science and Technology
Current assignee: Nanjing University of Information Science and Technology
Priority date: 2015-12-16
Filing date: 2015-12-16
Publication date: 2016-05-11

Abstract

The present invention provides a replica placement method based on file heat analysis and K-means. First, the access heat of each file is calculated by analyzing its access frequency within a given time period. The access heat is then combined with the K-means algorithm to predict which files are likely to be high-heat files in the next period. Taking into account factors such as the statistical period, file size, and working environment, the number and placement of file replicas are adjusted dynamically as needed. The invention can effectively reduce the average response time of file access and improve data service performance.

Description

A Replica Placement Method Based on File Heat Analysis and K-means

Technical Field

The invention belongs to the field of cloud computing and specifically relates to a method that uses statistical heat analysis and the K-means algorithm to dynamically adjust the placement of replicas of high-heat files in a cloud environment.

Background Art

With the development of society and improvements in computer storage and data processing capacity, explosive data growth has become a defining feature of the present era. According to data-growth estimates by International Data Corporation (IDC), 40 ZB of data (1 ZB = 1.1805916207174113×10²¹ B) will be generated by 2020, equivalent to 5247 GB for every person on Earth (http://datacenter.watchstor.com/infra-143421.htm). As the volume of data keeps growing, the storage and management of massive data sets has attracted increasing attention.

To improve system reliability and access efficiency, replication is commonly used to make multiple copies of a data item and store them on different nodes of a distributed file system. To meet the differing data access requirements of successive historical stages, several replica management strategies have been proposed, chiefly master-slave, hierarchical, peer-to-peer (P2P), and graph-based strategies.

A replica management strategy generally has to decide two things, the number of replicas and where to store them, and strategies are classed as static or dynamic according to when those decisions are made. In 2001, Ian Foster and Kavitha Ranganathan proposed six replica creation strategies for hierarchical network topologies: no replication, best client, cascading, plain caching, caching plus cascading, and fast spread (Li Lin, "Research and Implementation of an Economic-Model-Based Replica Optimization Strategy in a Data Grid Environment"). These strategies reduce access latency in most cases. However, the cascading, caching-plus-cascading, and fast spread strategies apply only to data grids in which the data is stored at the top-level node, while the best client and plain caching strategies ignore characteristics such as topology, data distribution, network bandwidth, and node storage capacity (Sun Haiyan, Wang Xiaodong, Zhou Bin, et al., "SADDERS: A Two-Layer Dynamic Replica Creation Strategy Based on Storage Alliances") and do not consider the effect of file size and network bandwidth on access latency.

The invention analyzes the access frequency of each file within a preset time period and derives the file's access heat from a heat calculation formula. The access heat is combined with the K-means algorithm to predict which files are likely to be high-heat files in the next period (Rao Lei, Yang Fande, Li Xinming, Liu Dong, "A Dynamic Replica Creation Algorithm Based on Heat Analysis"). Taking into account factors such as the statistical period, file size, and working environment, the number and placement of file replicas are adjusted dynamically.

Summary of the Invention

The technical problem addressed by the invention is replica placement in a distributed system or cloud computing platform. A replica placement method based on file heat analysis and K-means is proposed: the maximum task execution time is selected as the time period, and the access heat of each file within that period is calculated. The access heat is combined with the K-means algorithm to predict which files are likely to be high-heat files in the next period. Taking into account factors such as the statistical period, file size, and working environment, the number and placement of file replicas are adjusted dynamically as needed. The invention can effectively reduce the average response time of file access and improve data service performance.

Technical solution:

A replica placement method based on file heat analysis and K-means, comprising the following steps:

Step 1): based on the task execution times, select the minimum value as the time period for heat analysis, and analyze the access frequency of each file within that period;

Step 2): from the file access frequency obtained in step 1), calculate the access heat value of each file;

Step 3): from the access heat values obtained in step 2), obtain the information of the files with high heat values and use the K-means algorithm to calculate and predict the high-heat files of the next running period;

Step 4): based on the high-heat file information obtained in step 3), dynamically adjust the number and placement of file replicas, taking into account factors such as file size, number of files, file location, and working environment.

Further, in step 1) of the replica placement method based on file heat analysis and K-means, the maximum task execution time is selected as the time period for heat analysis, and file access frequency is analyzed within that period. The invention uses a file access counter and a statistical-period timer. At initialization, the default access count of a file is 1. Within each statistical period, the counter is incremented by 1 each time the file is accessed and decremented by 1 when the file is not accessed; if the count is already 1, no decrement is performed. If a file access times out before completing, the counter is incremented by 1. The access frequency of a file in the k-th statistical period is $f_k = n/t$, where n is the number of times the file was accessed in that period and t is the total duration of the accesses in that period.
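The counter rules and the frequency formula above can be sketched in Python as follows. This is only an illustration under stated assumptions: the class and function names are hypothetical, a timed-out access is read as simply adding 1 to the counter, and the "not accessed" decrement is assumed to be triggered once per idle check, since the text does not say at what granularity it happens.

```python
from dataclasses import dataclass


@dataclass
class AccessCounter:
    """Per-file access counter (hypothetical name, illustrative only)."""
    count: int = 1  # default access count at initialization

    def on_access(self) -> None:
        # Each access within the statistical period adds 1.
        self.count += 1

    def on_timeout(self) -> None:
        # An access that timed out before completing also adds 1.
        self.count += 1

    def on_idle(self) -> None:
        # No access observed: subtract 1, but never drop below 1.
        if self.count > 1:
            self.count -= 1


def access_frequency(n: int, t: float) -> float:
    """f_k = n / t: accesses in the period divided by total access duration."""
    if t <= 0:
        raise ValueError("total access duration t must be positive")
    return n / t
```

For instance, a file accessed 6 times with a total access duration of 3 time units has a frequency of 2 accesses per unit.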

Further, in step 2) of the replica placement method based on file heat analysis and K-means, the file access frequency obtained in step 1) is used with the formula $h_{ij} = \alpha \cdot F_j / (S_i + 1)$ to calculate the access heat value of file i at time j. In the formula, α is a constant used to normalize the data, F_j represents the influence of access frequency on file access heat, and S_i represents the influence of file size on file access heat, where: [formulas for F_j and S_i not preserved in this text]
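A minimal sketch of the heat formula follows. Only `heat` is taken directly from the text. The forms of F_j and S_i are assumptions, since their defining formulas are not preserved here: F_j is assumed to be a weighted sum of the access frequencies of recent statistical periods, and S_i is assumed to be the number of data blocks the file occupies. All function names are hypothetical.

```python
import math
from typing import Sequence


def frequency_factor(freqs: Sequence[float], weights: Sequence[float]) -> float:
    """Assumed form of F_j: weighted sum of recent per-period frequencies."""
    if len(freqs) != len(weights):
        raise ValueError("freqs and weights must have the same length")
    return sum(w * f for w, f in zip(weights, freqs))


def size_factor(file_size: int, block_size: int) -> int:
    """Assumed form of S_i: number of blocks needed to store the file."""
    return math.ceil(file_size / block_size)


def heat(alpha: float, f_j: float, s_i: float) -> float:
    """h_ij = alpha * F_j / (S_i + 1), as stated in the text."""
    return alpha * f_j / (s_i + 1)
```

Because S_i appears in the denominator, two files with the same access frequency receive different heat values, with the larger file scoring lower.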

Further, in step 3) of the replica placement method based on file heat analysis and K-means, the information of the files with high heat values is obtained from the access heat values computed in step 2). k files are selected as the initial centers, the distance from each file to each center file is calculated, and each file is assigned to the nearest cluster. This process is repeated on the resulting clusters until a termination condition is met. The termination conditions include:

(1) no files (or only a minimal number) are reassigned to a different cluster;

(2) no cluster centers (or only a minimal number) change;

(3) the sum of squared errors (SSE) reaches a local minimum, $\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2$, where x denotes a file, m_j denotes the center of cluster C_j, and dist(x, m_j) denotes the distance between file x and cluster center m_j;
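The clustering loop and its termination conditions can be sketched as below. The text does not specify the feature space or the distance function, so this sketch assumes each file is represented by its scalar heat value and that dist is the absolute difference. The function names and the `max_iter` safeguard are hypothetical additions.

```python
from typing import List, Tuple


def sse(clusters: List[List[float]], centers: List[float]) -> float:
    """Sum over all clusters of squared distances to the cluster center."""
    return sum((x - m) ** 2 for c, m in zip(clusters, centers) for x in c)


def kmeans_1d(values: List[float], centers: List[float],
              max_iter: int = 100) -> Tuple[List[float], List[int]]:
    """Cluster scalar heat values; returns (centers, cluster label per value)."""
    centers = list(centers)
    labels = [-1] * len(values)
    for _ in range(max_iter):
        # Assignment step: each file goes to its nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda j: abs(v - centers[j]))
            for v in values
        ]
        # Termination condition (1): no file changed cluster.
        if new_labels == labels:
            break
        labels = new_labels
        # Update step: each center becomes the mean of its members.
        for j in range(len(centers)):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = sum(members) / len(members)
    return centers, labels
```

Stopping when no label changes also implies that no center changes on the next update, so in this sketch conditions (1) and (2) coincide; `sse` can be called on the final clusters to monitor condition (3).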

Further, in step 4) of the replica placement method based on file heat analysis and K-means, the clustering information obtained in step 3) and the access heat of each cluster center are used to dynamically adjust the number and placement of file replicas, taking into account factors such as file size, number of files, file location, and working environment. The number of replicas is increased appropriately for high-heat clusters and may be reduced appropriately for low-heat clusters.
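One possible way to turn cluster heat into replica counts is sketched below. The text only says that replica numbers are raised or lowered "appropriately", so the thresholds `hot` and `cold`, the step of one replica per period, and the `min_replicas` and `max_replicas` bounds are all assumptions made for illustration. Placement decisions, which also depend on file location and working environment, are not modelled.

```python
from typing import Dict


def adjust_replicas(cluster_heat: Dict[int, float],
                    current: Dict[int, int],
                    hot: float, cold: float,
                    min_replicas: int = 1,
                    max_replicas: int = 5) -> Dict[int, int]:
    """Return a new replica count per cluster, based on its center's heat."""
    result = {}
    for cid, h in cluster_heat.items():
        n = current[cid]
        if h >= hot:
            n += 1   # high-heat cluster: add one replica
        elif h <= cold:
            n -= 1   # low-heat cluster: drop one replica
        # Clamp to the assumed bounds so a file never loses its last replica.
        result[cid] = max(min_replicas, min(max_replicas, n))
    return result
```

Changing the count by at most one replica per period keeps the adjustment gradual. A real system would likely also weigh storage and transfer costs.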

Beneficial Effects

The invention addresses replica placement in a distributed system or cloud computing platform by combining file access heat with the K-means algorithm, which helps place replicas sensibly in systems with high access volumes. It remedies a shortcoming of earlier replica placement methods based on simple file heat analysis, which place replicas solely according to file heat in the current statistical period. To improve access response time in subsequent statistical periods, it also applies the K-means clustering algorithm to predict which files are likely to be high-heat files in the next period and adjusts their replicas in advance. Together, the two measures make replica placement more reasonable, lower response time, and reduce I/O congestion.

Brief Description of the Drawings

Figure 1 is a flowchart of a replica placement method based on file heat analysis and K-means.

Detailed Description

The implementation of the technical solution is described in further detail below with reference to the drawing:

The replica placement method based on file heat analysis and K-means according to the invention is described in further detail with reference to the flowchart and an embodiment.

This embodiment uses file heat analysis and the K-means algorithm to adjust the placement of replicas in a distributed system or cloud environment. As shown in Figure 1, the method comprises the following steps:

Step 101): in a distributed system or cloud environment, different tasks have different execution times. During file heat analysis, a replica adjustment can be made whenever a task completes, so that the information produced by the previous task execution is promptly applied to subsequent ones. Task execution times can be obtained by simulation or from empirical values;

Step 102): using the formula $f_k = n/t$, calculate the access frequency of each file within the preset time period.

Step 2): from the file access frequency obtained in the previous step, calculate the access heat value of each file;

Step 201): once the file access frequency is known, its influence on the file's heat can be calculated; this influence is determined by the frequencies with which the file was accessed in the most recent l statistical periods and by their weights;

Step 202): calculate the influence of file size on file access heat, which is determined by the file size s_i and the data block size of the distributed system;

Step 203): using the formula $h_{ij} = \alpha \cdot F_j / (S_i + 1)$ with the corresponding values obtained in the previous two steps, and applying normalization, calculate the access heat value of file i at time j.
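As a worked illustration of the formula in step 203, suppose α = 1, F_j = 3, and S_i = 3. These numbers are purely hypothetical and chosen only to keep the arithmetic simple:

```latex
h_{ij} = \frac{\alpha \cdot F_j}{S_i + 1} = \frac{1 \cdot 3}{3 + 1} = 0.75
```

With the same α and F_j, a file with S_i = 7 gives h_ij = 3/8 = 0.375, so at equal access frequency the larger file receives the lower heat value.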

Step 3): from the access heat values obtained in the previous step, obtain the information of the files with high heat values and use the K-means algorithm to calculate and predict the high-heat files of the next running period;

Step 301): from the result of the previous step, identify the files with high heat values and retrieve their information from the system;

Step 302): select K of the high-heat files as center files, calculate the distance from every file to each center file, and assign each file to the nearest cluster center according to the result;

Step 303): repeat the previous step until a termination condition is met;

Step 4): using the clustering information obtained in the previous step and the access heat of each cluster center, and taking into account factors such as file size, number of files, and working environment, adjust the number of replicas and the placement of each file. The number of replicas is increased appropriately for clusters whose centers have high access heat and reduced correspondingly for clusters with low access heat.

Claims

1. A replica placement method based on file heat analysis and K-means, characterized by comprising the following steps:

Step 1): based on the task execution times, select the minimum value as the time period for heat analysis, and analyze the access frequency of each file within that period;

Step 2): from the file access frequency obtained in step 1), calculate the access heat value of each file;

Step 3): from the access heat values obtained in step 2), obtain the information of the files with high heat values and use the K-means algorithm to calculate and predict the high-heat files of the next running period;

Step 4): based on the high-heat file information obtained in step 3), dynamically adjust the number and placement of file replicas, taking into account factors such as file size, number of files, file location, and working environment.

2. The method according to claim 1, wherein, in step 1), a file access counter and a statistical-period timer are used; at initialization, the default access count of a file is 1; within each statistical period, the counter is incremented by 1 each time the file is accessed and decremented by 1 when the file is not accessed; if the count is already 1, no decrement is performed; if a file access times out before completing, the counter is incremented by 1; the access frequency of a file in the k-th statistical period is $f_k = n/t$, where n is the number of times the file was accessed in that period and t is the total duration of the accesses in that period.

3. The method according to claim 1, characterized in that, in step 2), the file access frequency obtained in step 1) is used with the formula $h_{ij} = \alpha \cdot F_j / (S_i + 1)$ to calculate the access heat value of file i at time j; in the formula, α is a constant used to normalize the data, F_j represents the influence of access frequency on file access heat, and S_i represents the influence of file size on file access heat; where: [formulas for F_j and S_i not preserved in this text]

4. The method according to claim 1, characterized in that, in step 3), the information of the files with high heat values is obtained from the access heat values computed in step 2); k files are selected as the initial centers, the distance from each file to each center file is calculated, and each file is assigned to the nearest cluster; this process is repeated on the resulting clusters until a termination condition is met; the termination conditions include:

(1) no files (or only a minimal number) are reassigned to a different cluster;

(2) no cluster centers (or only a minimal number) change;

(3) the sum of squared errors (SSE) reaches a local minimum, $\mathrm{SSE} = \sum_{j=1}^{k} \sum_{x \in C_j} \mathrm{dist}(x, m_j)^2$, where x denotes a file, m_j denotes the center of cluster C_j, and dist(x, m_j) denotes the distance between file x and cluster center m_j.

5. The method according to claim 1, characterized in that, in step 4), the clustering information obtained in step 3) and the access heat of each cluster center are used to dynamically adjust the number and placement of file replicas, taking into account factors such as file size, number of files, file location, and working environment; the number of replicas is increased appropriately for high-heat clusters and reduced appropriately for low-heat clusters.