2014 Second International Symposium on Computing and Networking
978-1-4799-4152-0/14 $31.00 © 2014 IEEE
DOI 10.1109/CANDAR.2014.22

A Cluster-Based Data Deduplication Technology

Chuan-Mu Tseng
Department of Applied Digital Media
Jen-Teh Junior College of Medicine, Nursing and Management
Miaoli, Taiwan, R. O. C.
cmtjason@jente.edu.tw

Jheng-Rong Ciou, Tzong-Jye Liu
Department of Information Engineering and Computer Science
Feng Chia University
Taichung, Taiwan, R. O. C.
cioujimmy@gmail.com, tjliu@fcu.edu.tw

Abstract—Data deduplication systems usually identify redundant data quickly and correctly by using bloom filter technology. A bloom filter can determine whether data is redundant, but it can produce false positives. To avoid false positives, a new chunk must be compared with the chunks that have already been stored. To reduce the time needed to exclude the bloom filter's false positives, current research uses many small index tables to store chunk IDs. However, the target chunk ID is stored in only one index table, so searching for it in the other index tables wastes a great deal of time. In this paper, we cluster the stored chunks to reduce the time spent excluding the false positives induced by the bloom filter.

Keywords: Bloom filter; cluster; data deduplication

I. INTRODUCTION

With the development of cloud storage, increasingly more users store data in the cloud, and more than 57% of virtual data is produced annually [1]. From the cloud provider's perspective, efficient use of the limited storage space is of great interest. Data deduplication technology was therefore proposed: it compares uploaded chunks with those already saved. Avoiding repeated storage of the same chunk reduces redundant data, so the limited space can be used to provide more services for users.

In current deduplication technology, the main cost of the system is querying for redundant chunks, and this overhead increases with the amount of data. Bloom filter technology [2] has been used to reduce the cost of queries [3][4]. Although the bloom filter can quickly find redundant data, it suffers from false positives: the system may judge unique data to be redundant. When a user then downloads the file, some chunks may be missing, which makes the file irrecoverable. Such a result discourages users from using the cloud storage service.

To avoid the bloom filter's false positives, [4] proposed that, after the bloom filter's judgment, the new chunk should be compared with all saved chunks. To accelerate this comparison, the authors of [4] used an index table to store chunk identifications and searched the index table to find redundant data. However, the index table size is proportional to the amount of data already stored [5][6]; the ratio is about 1:100. In other words, when the system stores 10 PB of data, the index table will grow to 100 TB. Memory cannot cache the index table directly if the index table is larger than the memory. This is called the "disk bottleneck problem" [3][7].

Previous research [3] resolved the disk bottleneck problem by using many index tables of small size. However, there is a drawback: when the system excludes the bloom filter's false positives, it must check the index tables one by one until the target chunk ID is found. Since the target chunk ID is stored in only one index table, searching for it in the other index tables wastes a great deal of time.

To reduce the time needed to exclude the bloom filter's false positives, we propose a new module for excluding them. Our method chooses a single index table in which to exclude the bloom filter's false positives; after checking this one index table, we know whether the bloom filter's answer was a false positive.

The rest of this paper is organized as follows. Section II introduces the background on data deduplication and the bloom filter. Section III describes our proposed method. Section IV presents the results of our experiments. Lastly, conclusions are presented in Section V.

II. RELATED WORKS

The main purpose of deduplication technology is to effectively reduce duplicate data and increase the available storage space. A typical data deduplication system can be divided into two parts: data segmentation and duplicate data comparison. The system segments files into chunks and compares those chunks to find duplicate data. To enhance comparison efficiency, the system uses a hash function (e.g., SHA-1 or MD5) to calculate each chunk's ID and decides whether a chunk needs to be uploaded according to the comparison.

To speed up the comparison, [8][9][10][11][12][13] used an index table to store chunk IDs and found redundant data by searching the index table. As the amount of data grows, the index table becomes larger; if the index table is larger than the memory, this causes a "disk bottleneck problem" [3][7]. The concept of the hierarchical index table was used to prevent this [7][8][10][11][12][13]. A hierarchical index table uses a few chunk IDs to represent files and stores them in memory.
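The segment-and-hash step described above can be sketched as follows. This is an illustrative sketch, not code from the paper: the 8 KB fixed chunk size and SHA-1 are taken from the experimental setup in Section IV, while the function name `chunk_ids` is our own.

```python
import hashlib

CHUNK_SIZE = 8 * 1024  # 8 KB fixed-size chunks, as in Section IV

def chunk_ids(data: bytes):
    """Split data into fixed-size chunks and return their SHA-1 chunk IDs."""
    ids = []
    for offset in range(0, len(data), CHUNK_SIZE):
        chunk = data[offset:offset + CHUNK_SIZE]
        ids.append(hashlib.sha1(chunk).hexdigest())  # 160-bit chunk ID
    return ids

# Identical chunks always produce identical IDs, which is what the
# duplicate-comparison step relies on.
ids = chunk_ids(b"A" * CHUNK_SIZE * 2 + b"B" * CHUNK_SIZE)
assert ids[0] == ids[1] and ids[0] != ids[2]
```

Only the 160-bit IDs need to be compared to detect a duplicate; the chunk contents themselves never have to be re-read.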
The other chunk IDs were stored on the disk. The system compared the representative chunk IDs in memory and then, according to the result, followed the link to the next layer to compare the complete chunk IDs. This method reduced disk bottleneck problems, but the index table size still grew with the amount of data. In other words, it reduced the probability of the disk bottleneck problem but could not eradicate disk bottlenecks.

The bloom filter [2] combined with an index table [3][4][5][9][14] has also been used in deduplication systems. Treating the already stored data as a collection and the uploaded chunks as elements, the system can quickly check through the bloom filter whether an element exists in the collection.

A typical bloom filter consists of a bit array and a set of hash functions. In the initialization step, all values of the bit array are 0. When a user uploads a file, the system calculates its identification, the file ID, and sends it to the bloom filter. The file ID is mapped by a number of different hash functions to positions in the bit array, and those positions are set to 1. When the system wants to check whether a file has been saved, the file ID is again mapped to the bit array by the hash functions, and the system inspects the resulting positions. If any of the values is 0, the file is new; if all of the values are 1, the file is considered saved. In this way, the system can quickly find saved files.

However, the bloom filter produces false positives because of hash collisions. When the bloom filter has stored n files using k hash functions and a bit array of size m, the false positive rate equals (1 - (1 - 1/m)^(kn))^k ≈ (1 - e^(-kn/m))^k [3].

The more files are saved, the more positions hold the value 1. When a new file is mapped to the bit array by the hash functions, the probability that every mapped position already holds 1 becomes larger. Therefore, the system may identify unique data as redundant. When a user downloads the file, part of it may be missing, which makes the file irrecoverable.

Research on the bloom filter's false positives takes two directions. (1) Limit false positives: keep the false positives within an acceptable range. The advantage of this approach is that no time is spent excluding false positives. However, it is not suitable for a cloud storage system: if users upload important files, the bloom filter's false positives will make them irrecoverable and cause serious problems for the users. (2) Exclude false positives: compare against all of the already stored chunks. This method takes a great deal of time to search the stored chunks. However, cloud services have to completely rule out false positives so that users can feel at ease about uploading files.

(1) Limit False Positives:
There are two methods to limit false positives. The first is to enlarge the capacity of the bloom filter, so that more elements can be accommodated without increasing the false positives. The second is to increase the number of bloom filters; as a result, all of the bloom filters are reconstructed with lower false positive rates.

(2) Exclude False Positives:
To solve false positives, previous research [4] proposed that if the bloom filter determines a chunk is stored, the chunk is compared with all of the already stored chunks. However, excluding false positives this way can still cause a disk bottleneck problem. A method that solves both false positives and disk bottleneck problems was proposed in [3]. That system used a number of fixed-size index tables to store chunk IDs. Each index table, called a "tablet", is fixed at 1.5 MB, which effectively prevents disk bottleneck problems. A tablet is a B+ tree based structure. The system maintained the list of tablets in least recently used (LRU) order: when a chunk ID was found in a certain tablet, the system moved that tablet to the head of the linked list. If the target chunk ID was not found in the current tablet, the next tablet was checked, until the target chunk ID was found or every tablet had been searched.

Although the solution in [3] solves disk bottleneck problems, while excluding false positives the system still has to check the already stored chunks until the target is found. When a false positive occurs, no tablet stores the chunk ID, so the system checks all of the tablets. As the amount of data grows, the number of tablets becomes larger, and a great deal of time is wasted searching for the chunk ID.

III. PROPOSED METHOD

A. Typical data deduplication

Fig. 1 shows the basic framework of a data deduplication system that uses the bloom filter. When a user uploads a file, the system first segments the file into chunks and then calculates the chunk IDs using the hash function. To determine whether to save a chunk, the system detects redundancy through the bloom filter module. If the chunk is not saved, the system asks the user to upload the chunk. On the contrary, if the chunk appears to be saved, the system compares it with the stored chunks: it excludes the bloom filter's false positives using the excluding false positives module and then decides whether the chunk should be saved.

The solutions of previous research [3][4] compare the uploaded chunk with all of the stored chunks to exclude the bloom filter's false positives. The way the index table is managed affects the running time to exclude false positives.
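The bloom filter module of this framework, as reviewed in Section II, can be sketched as follows. This is an illustrative sketch: the bit-array length (131072) and the seven hash functions match the setup in Section IV, but deriving the k positions by salting a single SHA-1 hash is our stand-in for k independent hash functions.

```python
import hashlib

class BloomFilter:
    def __init__(self, m=131072, k=7):   # sizes used in Section IV
        self.m, self.k = m, k
        self.bits = bytearray(m)         # initialization: all bits 0

    def _positions(self, key: bytes):
        # Derive k bit-array positions by salting one SHA-1 hash
        # (an illustrative stand-in for k independent hash functions).
        for i in range(self.k):
            h = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(h, "big") % self.m

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p] = 1             # set the mapped locations to 1

    def might_contain(self, key: bytes) -> bool:
        # Any 0 -> definitely new; all 1s -> "saved", but possibly a
        # false positive that still needs verification.
        return all(self.bits[p] for p in self._positions(key))
```

A "yes" from `might_contain` still has to be verified against the index tables; that verification cost is exactly what this paper's clustering module is designed to cut.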
As the amount of data grows, the running time to exclude false positives becomes longer, and much time is wasted searching for chunk IDs.

Fig.1. The data deduplication system framework

B. The proposed method

To reduce the running time, the proposed solution replaces the excluding false positives module of Fig. 1 with a clustering module. Files are segmented, and the system calculates the chunk IDs through the chunking module. The bloom filter module then detects whether each chunk is saved. If the result shows that the chunk is already stored, the system uses the clustering module to exclude false positives. Our clustering module only needs to check one index table to determine whether the chunk is stored. Therefore, we can reduce the running time to exclude false positives.

C. The clustering module

The clustering module uses the chunk ID to perform clustering according to formula (1).

Cluster ID = Chunk ID mod 2^x    (1)

Every chunk finds its cluster ID by using formula (1). As shown in Fig. 2, we use a mapping table to record cluster IDs and an index table to record chunk IDs. Chunks with the same cluster ID are classified into the same cluster. Because all of the chunk IDs in the same cluster share an identical cluster ID, the system only needs to save (160 - x) bits of each chunk ID. If two chunks are stored in different clusters, it is impossible for them to be duplicates. Hence, the system only needs to check one index table to determine whether a chunk is saved.

Fig.2. Mapping table

When the excluding false positives module excludes the bloom filter's false positives, the system uses formula (1) to find the cluster ID. The uploaded chunk ID is then compared with the chunk IDs stored in the target cluster to exclude false positives, as shown in Fig. 3.

The x value of formula (1) affects the number of chunk IDs stored in a cluster; the maximum number of chunk IDs stored in a cluster is 2^(160-x). The experimental results are provided in Section IV.B.

Fig.3. Clustering module

IV. EXPERIMENT

We implemented a data deduplication system and examined the performance of the techniques developed in Section III. Before backing up a file, the chunking module segments the file into chunks and calculates the chunk IDs. After this, the system sends the chunk IDs to the bloom filter module, which determines whether each chunk is saved. According to the result of the bloom filter module, the clustering module is then used to exclude the false positives. Notice that we only save the chunk IDs, not complete chunks.

The following two experiments analyze the performance of our system:
1. Effect of the x value
2. The time to exclude the bloom filter's false positives

We use 8 KB fixed-size chunks and seven different hash functions, and the length of the bit array is 131072 [15]. The test data set consists of actual multimedia files and Linux kernel files, as shown in Table 1. Our experiments are performed on an Intel Core i7-3770 (3.40 GHz) with a 2 TB hard disk drive and 8 GB of RAM, running the 64-bit Windows 7 operating system.
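The clustering module of Section III.C can be sketched as follows, assuming 160-bit SHA-1 chunk IDs handled as Python integers; the class and method names are illustrative, not from the paper.

```python
X = 16  # the x of formula (1); Section IV.A finds x = 16 performs best

class ClusteringModule:
    def __init__(self, x=X):
        self.x = x
        self.clusters = {}  # mapping table: cluster ID -> index table of chunk IDs

    def cluster_id(self, chunk_id: int) -> int:
        return chunk_id % (2 ** self.x)          # formula (1)

    def add(self, chunk_id: int):
        # Only the top 160 - x bits need storing: the low x bits are
        # implied by the cluster the ID is filed under.
        cid = self.cluster_id(chunk_id)
        self.clusters.setdefault(cid, set()).add(chunk_id >> self.x)

    def is_stored(self, chunk_id: int) -> bool:
        # One lookup in one index table decides membership, so a bloom
        # filter false positive is excluded after a single probe.
        cid = self.cluster_id(chunk_id)
        return (chunk_id >> self.x) in self.clusters.get(cid, set())
```

A membership check touches exactly one cluster; storing only the top 160 - x bits mirrors the (160 - x)-bit chunk IDs described above.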
Table 1 Testing data set
File type      File numbers   File size   Chunk numbers
Multimedia     7              5.25 GB     5638963
Linux kernel   8              434 MB      574013

A. Effect of the x value

The purpose of this experiment is to find a suitable x value for formula (1) that improves system performance. We ran the clustering module with five different x values and backed up the test data sets of Table 1.

As shown in Figs. 4 and 5, the x value affects system performance. The number of clusters increases as x increases, and the cluster size decreases as x increases. For the clustering module, both the cluster size and the number of clusters affect performance. At low x values, more chunk IDs are stored in each cluster, which requires more time to exclude false positives. As x increases, many clusters are generated; if many clusters are saved in the mapping table, performance decreases because more time is needed to locate the target cluster.

Fig.4. Performance of clustering module (Linux kernel)

Fig.5. Performance of clustering module (Multimedia)

As shown in Tables 2 and 3, either a large cluster size or a large number of clusters decreases backup performance. When the cluster size and the number of clusters are balanced, system performance is optimized. When x is equal to 16, the system achieves the best backup speed, greatly reducing the time to exclude false positives.

Table 2 Multimedia file
x    Number of clusters   Average number of chunk IDs in each cluster   Backup speed (MB/sec)
8    256                  22027.2                                       269.2
12   4096                 1376.7                                        271.3
16   65536                86                                            279.3
20   1043731              5.4                                           269.6
24   4788784              1.1                                           243.5

Table 3 Linux kernel file
x    Number of clusters   Average number of chunk IDs in each cluster   Backup speed (MB/sec)
8    256                  2242.2                                        217.3
12   4096                 140.1                                         217.3
16   65526                8.8                                           219.1
20   442423               1.3                                           204.5
24   564461               1                                             198.7

B. The time to exclude the bloom filter's false positives

The purpose of this experiment is to compare the execution time of excluding false positives between our clustering module and the LRU-based index partition [3]. In the experiment, we back up the test data sets and observe the running time. According to the result of the previous experiment, the value of x is set to 16. The tablet size of the LRU-based index partition is set to 1.5 MB.

As shown in Fig. 6, the clustering module significantly reduces the execution time; the larger the test files are, the more obvious the effect is. As shown in Tables 4 and 5, when the backup files become larger, many index tables are produced by the LRU-based index partition module, and the number of index tables that must be checked to exclude false positives increases. During the backup of the Linux kernel files, the bloom filter module considers 207717 chunks to have been saved. While excluding the false positives, the system generates 2107052 queries in non-target index tables; an average of 11.1 queries is needed to exclude the false positives.

During the backup of the multimedia files, the bloom filter module considers 5250296 chunks to have been saved. While excluding the false positives, the system generates 392528094 queries in non-target index tables, which decreases performance; an average of 75.8 queries is needed to exclude the false positives.
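The two averages quoted above can be reproduced from the raw counts. Adding one extra query per flagged chunk (the lookup that finally reaches the target tablet) is our reading of how the averages were computed; the derivation is not stated explicitly here.

```python
# Linux kernel: 207717 chunks flagged by the bloom filter,
# 2107052 queries issued in non-target index tables.
linux_avg = 2107052 / 207717 + 1      # +1: the query in the target tablet

# Multimedia: 5250296 flagged chunks, 392528094 non-target queries.
media_avg = 392528094 / 5250296 + 1

assert round(linux_avg, 1) == 11.1    # matches Table 5
assert round(media_avg, 1) == 75.8    # matches Table 5
```

By contrast, the clustering module issues exactly one query per flagged chunk, which is the "1" column of Table 5.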
Our approach greatly reduces the time to exclude false positives because the clustering module omits the queries in non-target index tables and only needs to search one index table.

Fig.6. Performance of excluding false positives

Table 4 Backup speed of excluding false positives
Backup speed (MB/sec)   Clustering   LRU-based index partition
Linux kernel            219.1        97.2
Multimedia              279.3        8.2

Table 5 Comparison of average query time
Average query time      Clustering   LRU-based index partition
Linux kernel            1            11.1
Multimedia              1            75.8

V. CONCLUSION

We apply a clustering module to manage chunk IDs. The chunk IDs are stored separately in different clusters according to formula (1). While excluding false positives, the system learns whether a chunk is saved by querying a single index table.

The proposed method excludes the bloom filter's false positives without checking all of the index tables. It only needs to query one index table, effectively reducing the time to exclude false positives.

ACKNOWLEDGMENT

Part of this study was supported by the Ministry of Science and Technology of the Republic of China under Contract MOST 103-2221-E-035-056.

REFERENCES
[1] W. Can, Q. Guang Zhi, C. Feng Sheng, and P. Jing, "Deduplication oriented backup data encryption method," Journal of Computer Applications, vol. 30, no. 7, Jul. 2010, pp. 1764-1766, doi: 10.3724/SP.J.1087.2010.01763.
[2] B. H. Bloom, "Space/time trade-offs in hash coding with allowable errors," Communications of the ACM, vol. 13, no. 7, Jul. 1970, pp. 422-426, doi: 10.1145/362686.362692.
[3] M. Jaehong, Y. Daeyoung, and W. Youjip, "Efficient deduplication techniques for modern backup operation," IEEE Transactions on Computers, vol. 60, no. 8, Jun. 2011, pp. 824-840, doi: 10.1109/TC.2010.263.
[4] D. Meister and A. Brinkman, "dedupv1: improving deduplication throughput using solid state drives (SSD)," IEEE 26th Symposium on Mass Storage Systems and Technologies, May 2010, pp. 1-6, doi: 10.1109/MSST.2010.5496992.
[5] W. Jiansheng, J. Hong, Z. Ke, and F. Dan, "Efficiently representing membership for variable large data sets," IEEE Transactions on Parallel and Distributed Systems, vol. 25, no. 4, Apr. 2014, pp. 960-970, doi: 10.1109/TPDS.2013.66.
[6] Z. Zhike, J. Zejun, L. Zhiqiang, and P. Chengzhang, "LHS: a novel method of information retrieval avoiding an index using linear hashing with key groups in deduplication," International Conference on Machine Learning and Cybernetics, Jul. 2012, pp. 1312-1318, doi: 10.1109/ICMLC.2012.6359555.
[7] M. Lillibridge et al., "Sparse indexing: large scale, inline deduplication using sampling and locality," in Proceedings of the Eighth USENIX Conference on File and Storage Technologies, Dec. 2009, pp. 111-123.
[8] Y. Tianming et al., "3DNBS: a data de-duplication disk-based network backup system," IEEE International Conference on Networking, Architecture, and Storage, Jul. 2009, pp. 9-11, doi: 10.1109/NAS.2009.56.
[9] H. Qinlu, L. Zhanhuai, and Z. Xiao, "Data deduplication techniques," International Conference on Future Information Technology and Management Engineering, vol. 1, Oct. 2010, pp. 430-433, doi: 10.1109/FITME.2010.5656539.
[10] W. Can, Q. Guang Zhi, Y. Lei, and W. Juan, "A fast duplicate chunk identifying method based on hierarchical indexing structure," International Conference on Industrial Control and Electronics Engineering, Aug. 2012, pp. 624-627, doi: 10.1109/ICICEE.2012.169.
[11] Z. Zhike, J. Zejun, L. Zhiqiang, and P. Chengzhang, "Extreme Binning: scalable, parallel deduplication for chunk-based file backup," IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems, Sep. 2009, pp. 1-9, doi: 10.1109/MASCOT.2009.5366623.
[12] Z. Zhike, J. Zejun, L. Zhiqiang, and P. Chengzhang, "LHS: a novel method of information retrieval avoiding an index using linear hashing with key groups in deduplication," International Conference on Machine Learning and Cybernetics, vol. 4, Jul. 2012, pp. 1312-1318, doi: 10.1109/ICMLC.2012.6359555.
[13] F. Yinjin, J. Hong, X. Nong, T. Lei, and L. Fang, "AA-Dedupe: an application-aware source deduplication approach for cloud backup services in the personal computing environment," IEEE International Conference on Cluster Computing, Sep. 2011, pp. 112-120, doi: 10.1109/CLUSTER.2011.20.
[14] L. Guanlin, N. Jin Young, and D. David Hung-Chang, "BloomStore: Bloom-Filter based memory-efficient key-value store for indexing of data deduplication on flash," IEEE 28th Symposium on Mass Storage Systems and Technologies, Apr. 2012, pp. 1-11, doi: 10.1109/MSST.2012.6232390.
[15] Timdoug, "bloom_filter.c," timdoug.com [Online]. Available: http://www.timdoug.com/etc/bloom_filter.c. [Accessed: Aug. 2014].