CN115130032B

CN115130032B - Deletion rate curve construction method and system in heterogeneous granularity storage system

Info

Publication number: CN115130032B
Application number: CN202210789994.4A
Authority: CN
Inventors: 郭潇俊; 王桦; 周可; 韩耀东; 谭頔凡
Original assignee: Huazhong University of Science and Technology
Current assignee: Huazhong University of Science and Technology
Priority date: 2022-07-05
Filing date: 2022-07-05
Publication date: 2024-07-09
Anticipated expiration: 2042-07-05
Also published as: CN115130032A

Abstract

The invention discloses a method for constructing a deletion rate curve in a heterogeneous granularity storage system, which comprises the following steps: receiving an access request sequence, setting a counter i=1, judging whether i is equal to the total number of access requests in the access request sequence, loading an ith access request in the access request sequence into a cache filter if the i is not equal to the total number of the access requests in the access request sequence, judging whether an access object corresponding to the ith access request hits in the cache filter, acquiring the sampling rate of the access object according to the size of the access object corresponding to the ith access request if the i is not equal to the total number of the access requests in the access request sequence, sampling the access object according to the acquired sampling rate of the access object corresponding to the ith access request, calculating the reuse distance of the access object, and setting the counter i=i+1. The invention can solve the technical problem of inaccurate construction of the byte deletion rate curve and the object deletion rate curve caused by the content popularity difference and the object size difference in the existing heterogeneous granularity storage system.

Description

A method and system for constructing a missing rate curve in a heterogeneous granularity storage system

技术领域Technical Field

本发明属于计算机数据存储领域，更具体地，涉及一种异构粒度存储系统中的缺失率曲线构建方法和系统。The present invention belongs to the field of computer data storage, and more specifically, relates to a method and system for constructing a missing rate curve in a heterogeneous granularity storage system.

背景技术Background technique

对象级缓存是网络中非常常见的缓存技术，其主要评价指标有字节缺失率(Bytemissing rate，简称BMR)和对象缺失率(Object missing rate，简称OMR)两种。在网络缓存中，字节丢失率可以直接反映带宽开销和缓存提供商的流量开销，因此其应用非常重要。更好的缓存策略可以在给定的缓存大小上获得更低的丢失率，并且，更好的缓存资源管理可以提高共享缓存的所有应用程序的总体性能。缺失率曲线(Miss rate curve，简称MRC)是一种定量的缓存性能建模技术，其构建的方法在缓存资源管理中起着至关重要的作用。Object-level caching is a very common caching technology in the network. Its main evaluation indicators are byte missing rate (BMR) and object missing rate (OMR). In network caching, the byte missing rate can directly reflect the bandwidth overhead and traffic overhead of the cache provider, so its application is very important. Better caching strategies can achieve lower miss rates for a given cache size, and better cache resource management can improve the overall performance of all applications sharing the cache. Miss rate curve (MRC) is a quantitative cache performance modeling technology, and its construction method plays a vital role in cache resource management.

针对MRC的构建方法在过去得到了广泛的研究，其构建方法主要有：准确的MRC构建方法、近似的MRC构建方法、基于空间哈希采样的MRC构建方法等。准确的MRC构建方法通过逐对象计算重用距离，可以得到较为精确的构建结果；近似的MRC构建通过近似和等效的处理方法，使得计算开销较小，且误差在可容忍范围之内；基于空间哈希采样的MRC构建方法通过哈希采样技术，能以较低的计算开销在异构粒度存储系统中实现MRC的构建。The construction method of MRC has been widely studied in the past, and its construction methods mainly include: accurate MRC construction method, approximate MRC construction method, MRC construction method based on spatial hash sampling, etc. The accurate MRC construction method can obtain more accurate construction results by calculating the reuse distance of each object; the approximate MRC construction uses approximate and equivalent processing methods to reduce the computational overhead and the error is within the tolerable range; the MRC construction method based on spatial hash sampling can realize MRC construction in heterogeneous granularity storage systems with lower computational overhead through hash sampling technology.

然而，现有的MRC构建方法在异构粒度存储系统的运用中存在者一些问题：第一、若要实现MRC的准确构建，必然伴随着极高的时间和空间复杂度，导致计算开销很大；第二、在异构粒度存储系统缓存(如对象存储)中，对象尺寸的差异会带来MRC的构建误差，以及字节缺失率曲线(Byte missing rate curve，简称BMRC)与对象缺失率曲线(Object missingrate curve，简称OMRC)之间存在较大的构建误差；第三、不同对象的不同内容流行程度(即请求概率)会导致前端MRC构建不准确，从而影响MRC的整体准确性。However, there are some problems with the existing MRC construction methods in the application of heterogeneous granularity storage systems: First, if the accurate construction of MRC is to be achieved, it is inevitably accompanied by extremely high time and space complexity, resulting in a large computational overhead; Second, in the cache of heterogeneous granularity storage systems (such as object storage), the difference in object size will lead to MRC construction errors, and there will be a large construction error between the Byte missing rate curve (BMRC) and the Object missing rate curve (OMRC); Third, the different content popularity of different objects (i.e., request probability) will lead to inaccurate front-end MRC construction, thereby affecting the overall accuracy of MRC.

发明内容Summary of the invention

针对现有技术的以上缺陷或改进需求，本发明提供了一种异构粒度存储系统中的缺失率曲线构建方法，其目的在于，解决现有MRC构建方法在实现MRC的准确构建期间，必然伴随着极高的时间和空间复杂度，从而导致计算开销很大的技术问题，以及在异构粒度存储系统缓存(如对象存储)中，对象尺寸的差异会带来MRC的构建误差，以及BMRC与OMRC之间存在较大的构建误差的技术问题，以及不同对象的不同内容流行程度会导致前端MRC构建不准确，从而影响MRC整体准确性的技术问题。In response to the above defects or improvement needs of the prior art, the present invention provides a method for constructing a miss rate curve in a heterogeneous granularity storage system, with the aim of solving the technical problems that the existing MRC construction method is inevitably accompanied by extremely high time and space complexity during the accurate construction of MRC, resulting in a large computational overhead, as well as the technical problems that in the cache of a heterogeneous granularity storage system (such as object storage), the difference in object size will cause MRC construction errors, and there is a large construction error between BMRC and OMRC, and the different content popularity of different objects will lead to inaccurate front-end MRC construction, thereby affecting the overall accuracy of MRC.

为实现上述目的，按照本发明的一个方面，提供了一种用于缺失率曲线构建的采样方法，包括以下步骤：To achieve the above object, according to one aspect of the present invention, a sampling method for constructing a missing rate curve is provided, comprising the following steps:

(1)接收访问请求序列，并设置计数器i＝1；(1) Receive access request sequence and set counter i=1;

(2)判断i是否等于访问请求序列中的访问请求总数，如果是则进入步骤(8)，否则转入步骤(3)；(2) Determine whether i is equal to the total number of access requests in the access request sequence. If so, proceed to step (8); otherwise, proceed to step (3);

(3)将访问请求序列中的第i条访问请求载入缓存过滤器中，并判断该第i条访问请求对应的访问对象是否在缓存过滤器中命中，如果是则转入步骤(4)，否则转入步骤(5)；(3) Load the i-th access request in the access request sequence into the cache filter, and determine whether the access object corresponding to the i-th access request hits in the cache filter. If so, proceed to step (4); otherwise, proceed to step (5);

(4)获取该第i条访问请求对应的访问对象的重用距离，然后转入步骤(7)；(4) Obtain the reuse distance of the access object corresponding to the i-th access request, and then proceed to step (7);

(5)根据第i条访问请求对应的访问对象的大小获取该访问对象的采样率；(5) obtaining a sampling rate of the access object according to the size of the access object corresponding to the i-th access request;

(6)根据步骤(5)得到的第i条访问请求对应的访问对象的采样率对该访问对象进行采样，并计算该访问对象的重用距离，然后转入步骤(7)；(6) sampling the access object according to the sampling rate of the access object corresponding to the i-th access request obtained in step (5), calculating the reuse distance of the access object, and then proceeding to step (7);

(7)设置计数器i＝i+1，并返回步骤(2)；(7) Set counter i=i+1 and return to step (2);

(8)根据访问请求队列中所有在缓存过滤器中命中的访问对象的重用距离构建第一直方图，然后转入步骤(9)。(8) Construct the first histogram based on the reuse distances of all access objects in the access request queue that hit the cache filter, and then proceed to step (9).

(9)根据访问请求队列中所有未在缓存过滤器中命中的访问对象的重用距离构建第二直方图，然后转入步骤(10)；(9) constructing a second histogram according to the reuse distances of all access objects in the access request queue that are not hit in the cache filter, and then proceeding to step (10);

(10)合并第一直方图与第二直方图，构建完整的重用距离直方图，并对重用距离直方图进行积分处理，以获得最终的缺失率曲线。(10) The first histogram and the second histogram are combined to construct a complete reuse distance histogram, and the reuse distance histogram is integrated to obtain the final missing rate curve.

优选地，步骤(1)是从缓存系统中获取的实际运行的访问请求序列。Preferably, step (1) is to obtain the actual running access request sequence from the cache system.

优选地，每条访问请求仅对应一个访问对象，一个访问对象对应至少一条访问请求。Preferably, each access request corresponds to only one access object, and one access object corresponds to at least one access request.

优选地，缓存过滤器中仅保留最近的大小不大于L的访问对象，其中L的取值范围是0到5×1/r×log₂(1/r)，优选为1/r×log₂(1/r)，其中r为预设的基础采样率大小。Preferably, the cache filter only retains the most recently accessed objects whose size is not greater than L, where L ranges from 0 to 5×1/r×log ₂ (1/r), preferably 1/r×log ₂ (1/r), where r is a preset basic sampling rate.

优选地，访问对象i的重用距离为该访问对象i在距上一次被访问的间隔之内除了该访问对象之外的其余所有访问对象的大小之和；Preferably, the reuse distance of the access object i is the sum of the sizes of all other access objects except the access object i within the interval from the last access of the access object i;

访问对象D的采样率r_D是按照以下公式计算得到:The sampling rate r _D of the access object D is calculated according to the following formula:

r_D＝r×s_D/s_avg r _D ＝r×s _D /s _avg

其中s_D为访问对象的大小，s_avg为访问请求序列中所有访问请求对应的访问对象的平均值；Where s _D is the size of the access object, s _avg is the average value of the access objects corresponding to all access requests in the access request sequence;

优选地，步骤(6)包括以下子步骤：Preferably, step (6) comprises the following sub-steps:

(6-1)根据步骤(5)中计算得到的第i条访问请求对应的访问对象的采样率r_D对第i条访问请求对应的访问对象进行采样，以生成一个0到1的随机数，并判断该随机数是否小于等于r_D，如果是则转入步骤(6-2)，否则过程结束；(6-1) Sampling the access object corresponding to the i-th access request according to the sampling rate r _D of the access object corresponding to the i-th access request calculated in step (5) to generate a random number between 0 and 1, and determining whether the random number is less than or equal to r _D . If so, proceed to step (6-2); otherwise, the process ends.

(6-2)构建二叉树T_ws，使得二叉树T_ws的中序遍历等价于步骤(5)得到的第i条访问请求对应的访问对象的先后顺序；(6-2) Construct a binary tree T _ws , so that the in-order traversal of the binary tree T _ws is equivalent to the order of access objects corresponding to the i-th access request obtained in step (5);

(6-3)根据步骤(6-2)得到的二叉树T_ws计算第i条访问请求对应的访问对象的重用距离sBRD；(6-3) Calculate the reuse distance sBRD of the access object corresponding to the i-th access request based on the binary tree T _ws obtained in step (6-2);

(6-4)使用步骤(6-3)得到的第i条访问请求对应的访问对象的重用距离并根据公式eBRD＝sBRD/r+size(T_CF)获取第i条访问请求对应的访问对象的最终重用距离eBRD，其中size(T_CF)是缓存过滤器中存储的所有访问对象的总大小。(6-4) Use the reuse distance of the access object corresponding to the i-th access request obtained in step (6-3) and obtain the final reuse distance eBRD of the access object corresponding to the i-th access request according to the formula eBRD=sBRD/r+size(T _CF ), where size(T _CF ) is the total size of all access objects stored in the cache filter.

优选地，在第一直方图中，横坐标1表示的是重用距离，其对应的纵坐标，是重用距离为1的访问对象的总数。Preferably, in the first histogram, the abscissa 1 represents the reuse distance, and the corresponding ordinate is the total number of access objects with a reuse distance of 1.

在第二直方图中，横坐标1表示的是重用距离，其对应的纵坐标，是重用距离为1的访问对象的总数。In the second histogram, the horizontal axis 1 represents the reuse distance, and the corresponding vertical axis is the total number of access objects with a reuse distance of 1.

按照本发明的另一方面，提供了一种异构粒度存储系统中的缺失率曲线构建系统，包括：According to another aspect of the present invention, a system for constructing a miss rate curve in a heterogeneous granularity storage system is provided, comprising:

第一模块，用于接收访问请求序列，并设置计数器i＝1；The first module is used to receive an access request sequence and set a counter i=1;

第二模块，用于判断i是否等于访问请求序列中的访问请求总数，如果是则进入第八模块，否则转入第三模块；The second module is used to determine whether i is equal to the total number of access requests in the access request sequence, if yes, then enter the eighth module, otherwise, then transfer to the third module;

第三模块，用于将访问请求序列中的第i条访问请求载入缓存过滤器中，并判断该第i条访问请求对应的访问对象是否在缓存过滤器中命中，如果是则转入第四模块，否则转入第五模块；The third module is used to load the i-th access request in the access request sequence into the cache filter, and determine whether the access object corresponding to the i-th access request hits in the cache filter, if yes, transfer to the fourth module, otherwise transfer to the fifth module;

第四模块，用于获取该第i条访问请求对应的访问对象的重用距离，然后转入第七模块；The fourth module is used to obtain the reuse distance of the access object corresponding to the i-th access request, and then transfer to the seventh module;

第五模块，用于根据第i条访问请求对应的访问对象的大小获取该访问对象的采样率；A fifth module is used to obtain a sampling rate of the access object according to the size of the access object corresponding to the i-th access request;

第六模块，用于根据第五模块得到的第i条访问请求对应的访问对象的采样率对该访问对象进行采样，并计算该访问对象的重用距离，然后转入第七模块；The sixth module is used to sample the access object corresponding to the i-th access request according to the sampling rate of the access object obtained in the fifth module, calculate the reuse distance of the access object, and then transfer to the seventh module;

第七模块，用于设置计数器i＝i+1，并返回第二模块；The seventh module is used to set the counter i=i+1 and return to the second module;

第八模块，用于根据访问请求队列中所有在缓存过滤器中命中的访问对象的重用距离构建第一直方图，然后转入第九模块。The eighth module is used to construct a first histogram according to the reuse distances of all access objects in the access request queue that hit the cache filter, and then transfer to the ninth module.

第九模块，用于根据访问请求队列中所有未在缓存过滤器中命中的访问对象的重用距离构建第二直方图，然后转入第十模块；The ninth module is used to construct a second histogram according to the reuse distances of all access objects in the access request queue that are not hit in the cache filter, and then transfer to the tenth module;

第十模块，用于合并第一直方图与第二直方图，以构建完整的重用距离直方图，并对重用距离直方图进行积分处理，以获得最终的缺失率曲线。The tenth module is used to merge the first histogram and the second histogram to construct a complete reuse distance histogram, and integrate the reuse distance histogram to obtain a final missing rate curve.

总体而言，通过本发明所构思的以上技术方案与现有技术相比，能够取得下列有益效果：In general, the above technical solutions conceived by the present invention can achieve the following beneficial effects compared with the prior art:

1、由于本发明采用了步骤(5)到步骤(6)，应用了采样的计算方式，能够降低时间与空间复杂度，从而降低MRC曲线构建的计算开销；此外，由于采用了步骤(3)到步骤(6)，分开考虑缺失率曲线的头部和尾部的构建问题，在降低开销的同时保证了准确性满足要求。1. Since the present invention adopts step (5) to step (6) and applies a sampling calculation method, it can reduce the time and space complexity, thereby reducing the calculation overhead of MRC curve construction; in addition, since step (3) to step (6) are adopted, the construction problems of the head and tail of the missing rate curve are considered separately, which reduces the overhead while ensuring that the accuracy meets the requirements.

2、由于本发明采用了步骤(5)到步骤(6)，对不同尺寸的对象采取不同的采样率进行变权重采样的处理，能够避免对象大小差异造成的缺失率曲线构建不准确问题，降低BMRC与OMRC之间存在的构建误差，大大提高了缓存缺失率曲线构建的准确率。2. Since the present invention adopts steps (5) to (6) and adopts different sampling rates for objects of different sizes to perform variable weight sampling processing, it can avoid the problem of inaccurate construction of the miss rate curve caused by the difference in object size, reduce the construction error between BMRC and OMRC, and greatly improve the accuracy of cache miss rate curve construction.

3、由于本发明采用了步骤(3)到步骤(4)，将内容流行度高的对象单独计算其重用距离，分开考虑缺失率曲线的头部和尾部的构建问题，提高了高内容流行度对象的重用距离统计精度，大大提高了MRC曲线的头部构建精度。3. Since the present invention adopts step (3) to step (4), the reuse distance of objects with high content popularity is calculated separately, and the construction problems of the head and tail of the missing rate curve are considered separately, which improves the statistical accuracy of the reuse distance of objects with high content popularity and greatly improves the head construction accuracy of the MRC curve.

4、本发明作为通用性的工具，很容易在内容发布网络、网页缓存等其他应用场景灵活迁移适用，可以帮助缓存系统实现更灵活、更高效的多目标缓存优化。4. As a universal tool, the present invention can be easily and flexibly migrated and applied in other application scenarios such as content distribution networks and web page caches, and can help cache systems achieve more flexible and efficient multi-objective cache optimization.

附图说明BRIEF DESCRIPTION OF THE DRAWINGS

图1是本发明异构粒度存储系统中的缺失率曲线构建方法的流程图在所有附图中，相同的附图标记用来表示相同的元件或结构。FIG. 1 is a flow chart of a method for constructing a miss rate curve in a heterogeneous granularity storage system according to the present invention. In all drawings, the same reference numerals are used to represent the same elements or structures.

具体实施方式Detailed ways

为了使本发明的目的、技术方案及优点更加清楚明白，以下结合附图及实施例，对本发明进行进一步详细说明。应当理解，此处所描述的具体实施例仅仅用以解释本发明，并不用于限定本发明。此外，下面所描述的本发明各个实施方式中所涉及到的技术特征只要彼此之间未构成冲突就可以相互组合。In order to make the purpose, technical solutions and advantages of the present invention more clearly understood, the present invention is further described in detail below in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only used to explain the present invention and are not intended to limit the present invention. In addition, the technical features involved in the various embodiments of the present invention described below can be combined with each other as long as they do not conflict with each other.

如图1所示，本发明提供了一种异构粒度存储系统中的缺失率曲线构建方法，其主要包含两个部分：缓存过滤器以及变权重采样阶段，两部分分别用于构建缺失率曲线的头部和尾部，以提高缺失率曲线的构建精度。As shown in Figure 1, the present invention provides a method for constructing a miss rate curve in a heterogeneous granularity storage system, which mainly includes two parts: a cache filter and a variable weight sampling stage. The two parts are respectively used to construct the head and tail of the miss rate curve to improve the construction accuracy of the miss rate curve.

具体而言，本发明异构粒度存储系统中的缺失率曲线构建方法包括以下步骤：Specifically, the method for constructing a missing rate curve in a heterogeneous granularity storage system of the present invention comprises the following steps:

具体而言，本步骤是从缓存系统中获取的实际运行的访问请求序列记录。Specifically, this step is to obtain the actual running access request sequence record from the cache system.

具体而言，每条访问请求仅仅对应一个访问对象，一个访问对象可以对应多条访问请求。Specifically, each access request corresponds to only one access object, and one access object can correspond to multiple access requests.

缓存过滤器使用二叉树T_CF来构建访问请求序列，使得二叉树的中序遍历等价于访问请求序列访问对象的先后顺序，以快速计算访问请求序列对应的访问对象的重用距离。并且，缓存过滤器中仅保留最近的大小不大于L的访问对象，其中L的取值范围是0到5×1/r×log₂(1/r)，优选为1/r×log₂(1/r)，其中r为预设的基础采样率大小，可以在0到1之间任意选择，与步骤(5)中的基础采样率大小相同。The cache filter uses a binary tree T _CF to construct an access request sequence, so that the in-order traversal of the binary tree is equivalent to the order of accessing objects in the access request sequence, so as to quickly calculate the reuse distance of the access objects corresponding to the access request sequence. In addition, only the most recent access objects whose size is not greater than L are retained in the cache filter, where the value range of L is 0 to 5×1/r×log ₂ (1/r), preferably 1/r×log ₂ (1/r), where r is a preset basic sampling rate, which can be arbitrarily selected between 0 and 1, and is the same as the basic sampling rate in step (5).

具体而言，访问对象i的重用距离为，访问对象i在距上一次被访问的间隔之内除了该访问对象之外的其余所有访问对象的大小之和。Specifically, the reuse distance of access object i is the sum of the sizes of all other access objects except the access object i within the interval from the last access to the access object i.

上述步骤(3)到(4)的优点在于，对内容流行度高的对象单独计算重用距离，后续对其构建准确的重用距离直方图，可以提高缺失率曲线在头部的准确率，从而整体提高缺失率曲线的准确程度。The advantage of the above steps (3) to (4) is that the reuse distance is calculated separately for objects with high content popularity, and an accurate reuse distance histogram is subsequently constructed for them, which can improve the accuracy of the missing rate curve at the head, thereby improving the accuracy of the missing rate curve as a whole.

具体而言，访问对象D的采样率r_D是按照以下公式计算得到:Specifically, the sampling rate r _D of the access object D is calculated according to the following formula:

r_D＝r×s_D/s_avg r _D ＝r×s _D /s _avg

具体而言，重用距离的计算策略是：Specifically, the calculation strategy of reuse distance is:

具体而言，本步骤中的计算方法与步骤(4)中完全相同，在此不再赘述。Specifically, the calculation method in this step is exactly the same as that in step (4), and will not be repeated here.

上述步骤(6-1)到(6-4)的优点在于，对不同大小的对象采用不同的抽样率，减少了随机采样的方差，进一步降低了缺失率曲线在尾部的误差，降低了对象尺寸差异对MRC构建的影响，从而提高整体缺失率曲线的准确程度。The advantage of the above steps (6-1) to (6-4) is that different sampling rates are used for objects of different sizes, which reduces the variance of random sampling, further reduces the error at the tail of the missing rate curve, and reduces the impact of object size differences on MRC construction, thereby improving the accuracy of the overall missing rate curve.

(8)根据访问请求队列中所有在缓存过滤器中命中的访问对象的重用距离构建第一直方图，然后转入步骤(9)；(8) constructing a first histogram according to the reuse distances of all access objects in the access request queue that hit the cache filter, and then proceeding to step (9);

具体而言，在第一直方图中，横坐标1表示的是重用距离，其对应的纵坐标，是重用距离为1的访问对象的总数。Specifically, in the first histogram, the horizontal axis 1 represents the reuse distance, and the corresponding vertical axis is the total number of access objects with a reuse distance of 1.

具体而言，在第二直方图中，横坐标1表示的是重用距离，其对应的纵坐标，是重用距离为1的访问对象的总数。Specifically, in the second histogram, the horizontal axis 1 represents the reuse distance, and the corresponding vertical axis is the total number of access objects with a reuse distance of 1.

(10)合并第一直方图与第二直方图，以构建完整的重用距离直方图，并对重用距离直方图进行积分处理，以获得最终的缺失率曲线。(10) The first histogram and the second histogram are combined to construct a complete reuse distance histogram, and the reuse distance histogram is integrated to obtain the final missing rate curve.

从缺失率曲线的工作流程可以看出，相较于传统的构建流程，本发明増加了缓存过滤器，使得内容流行度高的对象单独计算，从而减少了传统采样计算造成的缺失率曲线头部误差大的问题。此外，本发明在传统采样的基础上，根据对象的大小对不同对象采用不同的采样率，减少了对象大小差异对缺失率曲线构建准确度的影响。It can be seen from the workflow of the missing rate curve that, compared with the traditional construction process, the present invention adds a cache filter so that objects with high content popularity are calculated separately, thereby reducing the problem of large errors in the head of the missing rate curve caused by traditional sampling calculations. In addition, based on traditional sampling, the present invention uses different sampling rates for different objects according to the size of the object, thereby reducing the impact of object size differences on the accuracy of missing rate curve construction.

本领域的技术人员容易理解，以上所述仅为本发明的较佳实施例而已，并不用以限制本发明，凡在本发明的精神和原则之内所作的任何修改、等同替换和改进等，均应包含在本发明的保护范围之内。It will be easily understood by those skilled in the art that the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for constructing a missing rate curve in a heterogeneous granularity storage system, characterized by comprising the following steps:

(1) Receive access request sequence and set counter i=1;

(2) Determine whether i is equal to the total number of access requests in the access request sequence. If so, proceed to step (8); otherwise, proceed to step (3);

(3) Load the i-th access request in the access request sequence into the cache filter, and determine whether the access object corresponding to the i-th access request hits in the cache filter. If so, proceed to step (4); otherwise, proceed to step (5);

(4) Obtain the reuse distance of the access object corresponding to the i-th access request, and then proceed to step (7);

(5) obtaining a sampling rate of the access object according to the size of the access object corresponding to the i-th access request;

(6) sampling the access object according to the sampling rate of the access object corresponding to the i-th access request obtained in step (5), calculating the reuse distance of the access object, and then proceeding to step (7);

(7) Set counter i=i+1 and return to step (2);

(8) constructing a first histogram according to the reuse distances of all access objects in the access request queue that hit the cache filter, and then proceeding to step (9);

(9) constructing a second histogram according to the reuse distances of all access objects in the access request queue that are not hit in the cache filter, and then proceeding to step (10);

(10) The first histogram and the second histogram are combined to construct a complete reuse distance histogram, and the reuse distance histogram is integrated to obtain the final missing rate curve.

2. The miss rate curve construction method in a heterogeneous granularity storage system according to claim 1 is characterized in that step (1) is an actual running access request sequence obtained from the cache system.

3. The method for constructing a miss rate curve in a heterogeneous granularity storage system according to claim 1 or 2, characterized in that each access request corresponds to only one access object, and one access object corresponds to at least one access request.

4. The miss rate curve construction method in a heterogeneous granularity storage system according to claim 3 is characterized in that only the most recent access objects whose size is not greater than L are retained in the cache filter, where the value range of L is 0 to 5×1/r×log ₂ (1/r), where r is a preset basic sampling rate size.

5. The method for constructing a missing rate curve in a heterogeneous granularity storage system according to claim 4, characterized in that:

The reuse distance of access object i is the sum of the sizes of all other access objects except the access object i within the interval from the last access to the access object i;

The sampling rate r _D of the access object D is calculated according to the following formula:

r _D ＝r×s _D /s _avg

Where s _D is the size of the access object, and s _avg is the average value of the access objects corresponding to all access requests in the access request sequence.

6. The method for constructing a miss rate curve in a heterogeneous granularity storage system according to claim 5, wherein step (6) comprises the following sub-steps:

(6-1) Sampling the access object corresponding to the i-th access request according to the sampling rate r _D of the access object corresponding to the i-th access request calculated in step (5) to generate a random number between 0 and 1, and determining whether the random number is less than or equal to r _D . If so, proceed to step (6-2); otherwise, the process ends.

(6-2) Construct a binary tree T _ws , so that the in-order traversal of the binary tree T _ws is equivalent to the order of access objects corresponding to the i-th access request obtained in step (5);

(6-3) Calculate the reuse distance sBRD of the access object corresponding to the i-th access request based on the binary tree T _ws obtained in step (6-2);

(6-4) Use the reuse distance of the access object corresponding to the i-th access request obtained in step (6-3) and obtain the final reuse distance eBRD of the access object corresponding to the i-th access request according to the formula eBRD=sBRD/r+size(T _CF ), where size(T _CF ) is the total size of all access objects stored in the cache filter.

7. The method for constructing a missing rate curve in a heterogeneous granularity storage system according to claim 6, characterized in that:

In the first histogram, the horizontal axis 1 represents the reuse distance, and its corresponding vertical axis is the total number of access objects with a reuse distance of 1;

In the second histogram, the horizontal axis 1 represents the reuse distance, and the corresponding vertical axis is the total number of access objects with a reuse distance of 1.

8. A system for constructing a missing rate curve in a heterogeneous granularity storage system, comprising:

The first module is used to receive an access request sequence and set a counter i=1;

The second module is used to determine whether i is equal to the total number of access requests in the access request sequence, if yes, then enter the eighth module, otherwise, then transfer to the third module;

The third module is used to load the i-th access request in the access request sequence into the cache filter, and determine whether the access object corresponding to the i-th access request hits in the cache filter, if yes, transfer to the fourth module, otherwise transfer to the fifth module;

The fourth module is used to obtain the reuse distance of the access object corresponding to the i-th access request, and then transfer to the seventh module;

A fifth module is used to obtain a sampling rate of the access object according to the size of the access object corresponding to the i-th access request;

The sixth module is used to sample the access object corresponding to the i-th access request according to the sampling rate of the access object obtained in the fifth module, calculate the reuse distance of the access object, and then transfer to the seventh module;

The seventh module is used to set the counter i=i+1 and return to the second module;

The eighth module is used to construct a first histogram according to the reuse distances of all access objects in the access request queue that hit the cache filter, and then transfer to the ninth module;

The ninth module is used to construct a second histogram according to the reuse distances of all access objects in the access request queue that are not hit in the cache filter, and then transfer to the tenth module;

The tenth module is used to merge the first histogram and the second histogram to construct a complete reuse distance histogram, and integrate the reuse distance histogram to obtain a final missing rate curve.