Abstract
In this paper, we study the problem of storing an archive of versioned data in a reliable and efficient manner. The proposed technique is relevant in cloud settings, where, because of the huge volume of data to be stored, distributed (scale-out) storage systems deploying erasure codes for fault tolerance is typical. However existing erasure coding techniques do not leverage redundancy of information across multiple versions of a file. We propose a new technique called differential erasure coding (DEC) where the differences (deltas) between subsequent versions are stored rather than the whole objects, akin to a typical delta encoding technique. However, unlike delta encoding techniques, DEC opportunistically exploits the sparsity (i.e., when the differences between two successive versions have few non-zero entries) in the updates to store the deltas using sparse sampling techniques applied with erasure coding. We first show that DEC provides significant savings in the storage size for versioned data whenever the update patterns are characterized by in-place alterations. Subsequently, we propose a practical DEC framework so as to reap storage size benefits against not just in-place alterations but also real-world update patterns such as insertions and deletions that alter the overall data sizes. We conduct experiments with several synthetic and practical workloads to demonstrate that the practical variant of DEC provides significant reductions in storage-overhead.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
Notes
- 1.
The proof for the proposition follows from the property that any \(2\gamma \) columns of a parity check matrix of a linear code with minimum distance \(2\gamma +1\) are linearly independent.
References
Huang, C., Simitci, H., Xu, Y., Ogus, A., Calder, B., Gopalan, P., Li, J., Yekhanin, S.: Erasure coding in windows azure storage. In: The Proceedings of the USENIX Annual Technical Conference (ATC) (2012)
Thusoo, A., Shao, Z., Anthony, S., Borthakur, D., Jain, N., Sen Sarma, J., Murthy, R., Liu, H.: Data warehousing and analytics infrastructure at Facebook. In: The Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (2010)
Ford, D., Labelle, F., Popovici, F.I., Stokely, M., Truong, V.A., Barroso, L., Grimes, C., Quinlan, S.: Availability in globally distributed storage systems. In: The 9th USENIX Conference on Operating Systems Designand Implementation (OSDI) (2010)
Dimakis, A.G., Ramchandran, K., Wu, Y., Suh, C.: A survey on network codes for distributed storage. Proc. IEEE 99, 476–489 (2011)
Oggier, F., Datta, A.: Coding techniques for repairability in networked distributed storage systems. In: Foundations and Trends in Communications and Information Theory, vol. 9, no. 4, pp. 383–466. Now Publishers, June 2013
http://www.ibm.com/developerworks/tivoli/library/t-snaptsm1/index.html
Borthakur, D.: HDFS and Erasure Codes (HDFS-RAID), August 2009. http://hadoopblog.blogspot.com/2009/08/hdfs-and-erasure-codes-hdfs-raid.html
The Coding for Distributed Storage wiki. http://storagewiki.ece.utexas.edu/
Wang, Z., Cadambe, V.: Multi-version Coding for Distributed Storage. In: Proceedings of IEEE ISIT 2014, Honalulu, USA (2014)
Rouayheb, S., Goparaju, S., Kiah, H., Milenkovic, O.: Synchronising edits in distributed storage networks. In: The Proceedings of the IEEE International Symposium on Information Theory, Hong Kong (2015)
Rawat, A., Vishwanath, S., Bhowmick, A., Soljanin, E.: Update efficient codes for distributed storage. In: IEEE International Symposium on Information Theory (2011)
Han, Y., Pai, H.-T., Zheng, R., Varshney, P.K.: Update-efficient regenerating codes with minimum per-node storage. In: Proceedings of IEEE International Symposium on Information Theory (ISIT 2013), Istanbul (2013)
Mazumdar, A., Wornell, G.W., Chandar, V.: Update efficient codes for error correction. In: The Proceedings of IEEE IEEE International Symposium on Information Theory, Cambridge, MA, pp. 1558–1562, July 2012
Esmaili, K.S., Chiniah, A., Datta, A.: Efficient updates in cross-object erasure-coded storage systems. In: IEEE International Conference on Big Data, Silicon Valley, CA, October 2013
Tarasov, V., Mudrankit, A., Buik, W., Shilane, P., Kuenning, G., Zadok, E.: Generating realistic datasets for deduplication analysis. In: The Proceedings of the 2012 USENIX Conference on Annual Technical Conference (2012)
Harshan, J., Oggier, F., Datta, A.: Sparsity exploiting erasure coding for resilient storage and efficient I, O access in delta based versioning systems. In: The Proceedings of IEEE ICDCS, Columbus, Ohio, USA (2015)
Harshan, J., Oggier, F., Datta, A.: Sparsity exploiting erasure coding for distributed storage of versioned data. Computing 98, 1305–1329 (2016). Springer
Donoho, D.L.: Compressed sensing. IEEE Trans. Inf. Theor. 52(4), 1289–1306 (2006)
Zhang, F., Pfister, H.D.: Compressed sensing and linear codes over real numbers. In: Information Theory and Applications Workshop (ITA) (2008)
McWilliams, F.J., Sloane, N.J.A.: The Theory of Error Correcting Codes. North Holland, Amsterdam (1977)
Lacan, J., Fimes, J.: A construction of matrices with no singular square submatrices. In: The Proceedings of International Conference on Finite Fields and Applications, pp. 145–147 (2003)
Harshan, J., Datta, A., Oggier, F.: DiVers: an erasure code based storage architecture for versioning exploiting sparsity. Future Gener. Comput. Syst. 59, 47–62 (2016). Elsevier
Acknowledgements
This work was supported by the MoE Tier-2 grant “eCode: erasure codes for data center environments" (MOE2013-T2-1-068).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2016 Springer-Verlag GmbH Germany
About this chapter
Cite this chapter
Harshan, J., Datta, A., Oggier, F. (2016). Differential Erasure Codes for Efficient Archival of Versioned Data in Cloud Storage Systems. In: Hameurlain, A., Küng, J., Wagner, R., Schewe, KD., Bosa, K. (eds) Transactions on Large-Scale Data- and Knowledge-Centered Systems XXX. Lecture Notes in Computer Science(), vol 10130. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-662-54054-1_2
Download citation
DOI: https://doi.org/10.1007/978-3-662-54054-1_2
Published:
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-662-54053-4
Online ISBN: 978-3-662-54054-1
eBook Packages: Computer ScienceComputer Science (R0)