

In search of I/O-optimal recovery from disk failures

2011, Workshop on Hot Topics in Storage Systems


In Search of I/O-Optimal Recovery from Disk Failures

Osama Khan, Randal Burns
Department of Computer Science, Johns Hopkins University
{okhan, randal}@cs.jhu.edu

James Plank
Department of Electrical Eng. and Comp. Science, University of Tennessee
plank@cs.utk.edu

Cheng Huang
Microsoft Research
Cheng.Huang@microsoft.com

Abstract

We address the problem of minimizing the I/O needed to recover from disk failures in erasure-coded storage systems. The principal result is an algorithm that finds the optimal I/O recovery from an arbitrary number of disk failures for any XOR-based erasure code. We also describe a family of codes with high fault tolerance and low recovery I/O, e.g. one instance tolerates up to 11 failures and recovers a lost block in 4 I/Os. While we have determined I/O-optimal recovery for any given code, it remains an open problem to identify codes with the best recovery properties. We describe our ongoing efforts toward characterizing space overhead versus recovery I/O tradeoffs and generating codes that realize these bounds.

1 Introduction

Recovery from failures has become a critical component of disk storage systems because they have reached such a massive scale that failures must be expected and dealt with as a matter of regular operation [8]. Large-scale deployments now typically tolerate multiple failures both to keep service available and to avoid data loss, e.g. three replicas has become the de facto standard in Hadoop, and systems utilizing RAID-6 are widely deployed.

We answer a fundamental question in recovery performance: what are the fewest number of I/Os needed to recover from an arbitrary number of disk failures? As the amount of redundancy grows, storage system codes offer many different schedules to recover a lost disk that vary widely in their I/O requirements. For example, in row-diagonal parity [23] and Even-Odd codes [22] that tolerate two disk failures, I/O can be reduced by 25% by recovering a combination of rows and diagonals that share blocks, rather than the standard practice of recovering each row independently. We provide an algorithm that minimizes the I/O recovery cost for any XOR-based code.

Two application contexts, cloud storage systems and deep archival storage, motivate the need for I/O-efficient coding (Section 2). Cloud storage systems perform erasure decoding when recovering from disk failures and when performing system upgrades. Upgrades occur frequently, often continuously [7], and minimizing I/O limits performance degradation. Deep archival stores include data that are almost never read, but need to be stored for regulatory or archival purposes. For these data, the only workload is introspection and recovery. Therefore, highly fault-tolerant, recovery-I/O-efficient codes allow us to increase scale and save power.

Many advances have been made in improving recovery performance in disk redundancy coding. These include hardware to minimize data copying [6], load-balancing recovery among disks [12], recovering popular data first to decrease read degradation [21], and only recovering blocks that contain live data [19]. Recently, the issue of minimizing I/O recovery schedules has emerged as a research topic. The results for Even-Odd codes [22] and row-diagonal parity [23] represent solutions for two specific codes. We present an algorithm that defines the I/O lower bound for any matrix code and allows multiple codes to be compared for I/O recovery cost.

Optimizing the recovery I/O of existing erasure codes shows benefits, but does not transform recovery. In contrast, codes designed specifically for recovery radically reduce I/O. Our algorithm applied to Liberation codes and Cauchy Reed-Solomon codes reduces I/O by 20-30%: on the same order as previous results [22, 23]. We present a family of codes (Section 4) that leverage the constrained data dependencies of Weaver codes [10] and the 2-dimensional properties of Grid codes [15]. One instance of these codes tolerates up to 11 failures and can recover a lost encoded block in 4 I/Os! A Reed-Solomon code with similar properties uses 12 I/Os.

The design of recovery-I/O-optimal codes remains an open problem. We conjecture that there are tradeoffs between recovery I/O and storage efficiency, i.e. that an increase in storage can reduce I/O at a given fault tolerance. We are pursuing the fundamental bounds for this problem. At the same time, we are exploring the structure of recovery I/O by searching for the best feasible codes using our optimization algorithm.

Note: Regenerating codes provide minimal recovery bandwidth and storage overhead [5]. They were designed for distributed systems in which wide-area bandwidth limits recovery performance. They achieve minimum bandwidth by transferring a smaller amount of data from as many shares of the data as are available. For storage systems, minimizing I/O is more valuable than minimizing bandwidth, and regenerating codes that access all existing shares of data increase I/O.

2 Applications of I/O-Optimal Recovery

Cloud File Systems: Cloud storage systems, such as Amazon S3 and Windows Azure Storage, assemble massive amounts of unreliable hardware and rely on software to deliver highly reliable and available storage services. Typically, they store three replicas [8] to guard against failures. Erasure coding provides an alternative that improves fault tolerance at reduced storage costs [1].

Cloud storage uses erasure decoding when recovering from failures and, more frequently, when storage nodes are unavailable. Scheduled events, such as patches and software updates, and unscheduled events, such as reboots, make nodes unavailable. For example, an update of the storage software stack rolls out in phases. Small batches of the storage nodes are suspended and the update applied. Then, the entire system is left running until performance metrics stabilize. The entire update process can last hours [7]. During updates, read requests to the unavailable nodes invoke erasure decoding, and recovery dictates overall I/O performance.

Deep Archival Stores: Regulatory requirements and preservation dictate that data needs to be archived for future availability. However, a large fraction of this data will never be read. The workload for these systems consists of introspection, checking that data are intact, and error recovery. Pergamum [20] defined archival systems of this type based on massive arrays of idle disks (MAID). They demonstrate that 95% of disks may be powered off at all times. We extend Pergamum's vision of infrequent error detection and look to employing untrusted cloud storage. To increase power savings, we take a much more passive approach to introspection and recovery. Encoding data with large amounts of redundancy allows for the lazy detection of failed devices/sites and recovery from multiple drive and latent sector errors. Combining I/O-efficient erasure coding with secure auditing for outsourced data [4] enables cost reduction in cloud archives.

3 Finding I/O-Optimal Recovery Schedules

Any erasure code based on exclusive-or operations may be represented by a bit-matrix-vector product as in Figure 1. A vector of k data bits is multiplied by an (n × k) Generator matrix to yield an n-element vector called the codeword. In our simplified example, each bit (or row) of the input data vector, and consequently the codeword, can represent one or more disk sectors. The code represented in Figure 1 is a RAID-6 code for a four-disk system, where each disk stores two bits (or rows) of the codeword. All XOR-based codes can be represented by a Generator matrix. The difference between the various codes lies in different Generator matrices, and different ways to store the bits on different disks. For example, Greenan et al. define a "flat" code as one where each bit is stored on a different disk [9]. Thus, Figure 1 could represent a flat code for an 8-disk system.

Figure 1: Erasure-coding as a matrix-vector product.

Each bit in the codeword is represented by a row of the Generator matrix. When data is lost, the standard methodology for reconstruction is to create an invertible (k × k) matrix from k rows of the Generator matrix that correspond to surviving bits in the codeword. This matrix is inverted, and multiplying the inverse by the surviving bits yields the original data [16, 11].

While this technique is general-purpose, it produces one of the many possible ways to reconstruct the lost data. We solve the problem of determining how to recalculate the lost data while minimizing the total number of surviving bits that are read. With each bit representing one or more sectors on disk, minimizing the bits read will minimize the number of disk I/Os required for recovery. We present an algorithm for this task that is computationally expensive, but feasible for systems of sizes typically used today. In practice, one calculates the recovery strategies for all potential failure scenarios a priori and stores them for later use.

We use the code in Figure 1 as an illustrative example. Consider a collection of bits in the codeword whose corresponding rows in the Generator matrix sum to zero. One example is D0, D2 and C0. We call such a collection of bits a decoding equation, because the fact that their sum is zero allows us to decode any one of its bits as long as the remaining bits are not lost. For example, if D2 is lost, and both D0 and C0 are not, then this equation may be used to decode D2.

Suppose that we enumerate all decoding equations for a given Generator matrix, and suppose some subset F of the codeword bits are lost. Then for each bit fi ∈ F, we determine the set Ei of decoding equations for fi. Formally, an equation ei ∈ Ei if ei ∩ F = {fi}. Our goal is to select one equation ei from each Ei such that the number of elements in the union of all ei is minimized.

For example, suppose bits D0 and D1 in Figure 1 are lost. A standard way to decode the failed bits is to use coding bits C0 and C1. In equation form, F = {D0, D1}, eD0 = {D0, D2, C0}, and eD1 = {D1, D3, C1}. Since eD0 and eD1 have distinct elements, their union is composed of six elements, which means that four are required for recovery. However, if we use {D1, D2, C3} for eD1, then |eD0 ∪ eD1| is five elements, meaning that three are required for recovery. This saves one I/O operation.

Thus, our problem is as follows: Given |F| sets of decoding equations E0, E1, ..., E|F|−1, we wish to select one equation from each set such that the size of the union of these equations is minimized. Unfortunately, this problem is NP-Hard in |F| and |Ei|.¹ However, we can solve the problem for practical values of |F| and |Ei| by converting the equations into a directed, weighted graph and finding the shortest path through the graph.

Given an instance of the problem, we convert it to a graph as follows. First, we represent each decoding equation in set form as an n-element bit string. For example, {D0, D2, C0} is represented by 10101000. Each graph node is also represented by an n-element bit string. There is a starting node Z whose string is all zeroes. The remaining nodes are partitioned into |F| sets labeled S0, S1, ..., S|F|−1. Each node in Si is at the same depth (number of edges) relative to Z as any other node in Si. For each equation e0 ∈ E0, there is a node s0 ∈ S0 whose bit string equals e0's bit string. There is an edge from Z to each s0 whose weight is equal to the number of ones in s0's bit string.

Traversing a single level (or edge) in the graph signifies the recovery of a single bit in F. For each node si ∈ Si, there is an edge that corresponds to each ei+1 ∈ Ei+1. This edge is to a node si+1 ∈ Si+1 whose bit string is equal to the bitwise OR of si's and ei+1's bit strings. The OR calculates the union of the equations leading up to si and ei+1, with si+1 denoting the cumulative number of elements required for recovery up to that point. The weight of the edge is equal to the difference between the number of ones in si's and si+1's bit strings. The shortest path from Z to any node in S|F|−1 denotes the minimum number of elements required for recovery. If we annotate each edge with the decoding equation that creates it, then the shortest path contains the equations that are used for recovery.

To illustrate, suppose again that F = {D0, D1}, meaning f0 = D0 and f1 = D1. The decoding equations for E0 and E1 are enumerated below:

    E0                  E1
    e0,0 = 10101000     e1,0 = 01010100
    e0,1 = 10010010     e1,1 = 01101110
    e0,2 = 10011101     e1,2 = 01100001
    e0,3 = 10100111     e1,3 = 01011011

These equations may be converted to the graph depicted in Figure 2, which has two shortest paths of length five: {e0,0, e1,2} and {e0,1, e1,0}. Both require three bits for recovery: {D2, C0, C3} and {D3, C1, C2}.

While the graph clearly contains an exponential number of nodes, one may program Dijkstra's algorithm to determine the shortest path and only create the graph on demand. For example, in Figure 2, the dotted edges and grayed nodes will not be constructed, because the shortest path is discovered before nodes 10011101 and 10100111 are evaluated by the algorithm.

¹ Adam Buchsbaum, personal communication, reduction from Vertex Cover.

Figure 2: Recovery graph when D0 and D1 are lost.

Greenan et al.
[9] use a similar approach to enumerate the recovery equations for flat-XOR codes. Their algorithm employs pruning heuristics on the search space, rather than converting the problem into a graph.

Figure 3 presents the results of running the algorithm on eleven different RAID-6 erasure codes for 8-disk systems (six data, two parity). The first two codes are RDP and Even-Odd, for which I/O minimization results exist already [23, 13]. The next three are the "Minimal Density" codes that best fit 8-disk systems (Blaum-Roth [2], Liberation [17] and Liber8tion [18]), and the last six are Cauchy Reed-Solomon codes where the variable w, which specifies the number of bits stored per disk, varies from three to eight [3]. For each code, we calculated the average number of bits required for recovery when one data disk fails, plotted as a percentage of the number of bits that are required when matrix inversion is used to decode. The results show that the Minimal Density codes require fewer bits than RDP and Even-Odd, with the quirky Liber8tion code requiring the fewest bits of all codes.

Figure 3: Minimum bits needed for recovering from single failures in RAID-6 codes. (Y-axis: % bits needed; codes plotted: RDP, Even-Odd, Blaum-Roth, Liberation, Liber8tion, CRS w=3 through w=8.)

4 An I/O-Efficient Recovery Code

Our search for codes which exhibit both low recovery I/O and high fault tolerance led us to consider GRID codes [15] as a suitable candidate. In a GRID code (Figure 4), the disks form a logical grid with each dimension being encoded using (potentially) different schemes. The GRID code allows us to use recovery-I/O-efficient Weaver codes in conjunction with fault-tolerant STAR codes [14], thereby enabling us to capture the desirable properties of both. Weaver codes are parameterized by W(k, t), in which k is the in-degree to a parity symbol and t (fault tolerance) is the in-degree to a data symbol. The fixed in-degree limits the recovery I/O regardless of the stripe size (number of disks). This differs from all systematic erasure codes. Failures within Weaver's fault tolerance in any column are recovered entirely by the Weaver code and benefit from Weaver's efficient recovery I/O.

Figure 4: The GRID/Weaver code. The dashed boxes indicate the two separate code dimensions. (Figure labels: a STAR dimension across nh disks and a Weaver dimension down nv symbols; the legend distinguishes disks with parity from disks with data and parity.)

With a Weaver code, recovery from a single disk failure can be done in two ways. The naive way of recovering from a failure entails accessing the t connected parity disks. But when k < t, one can also recover a failed disk by accessing any one of the t parity symbols and then using its connected k data symbols to recover the failed data symbol. Therefore, the cost to recover the failed data symbol is k + 1 I/Os. Parity symbols are recovered using their own k connected disks. Thus, recovery of an entire encoded block takes (k + 1)r + qk = r(t + k + 1) I/Os (since rt = qk), where r and q are the number of data and parity symbols per disk respectively. In some cases, Weaver codes can also recover data and parity from the same disk. This does not reduce the number of I/Os, i.e. block transfers, but does benefit MAID systems in that fewer disks must be spun up.

Examining the GRID/Weaver construction reveals that the codes use very few I/Os in recovery relative to their fault tolerance and access even fewer disks (Table 1). We experiment with the most efficient (minimum distance separable, MDS) instances of Weaver codes: W(2,2), W(3,3), and W(2,4). We combine this with a STAR code with 5 data disks and 3 parity disks. The GRID(STAR, W(2,2)) tolerates 11 failures and recovers a lost encoded block (data and parity) using 4 I/Os from three disks, in one case accessing data and parity from the same disk. The number of disks that needs to be accessed remains small as fault tolerance increases. Storage overheads for these codes are substantial, but are reasonable given the high fault tolerance and their intended use in archival applications.

                          I/Os for    # disks     Storage      Fault
                          recovery    accessed    efficiency   tolerance
    GRID(STAR, W(2,2))       4           3          31.25%        11
    GRID(STAR, W(3,3))       6           3          31.25%        15
    GRID(STAR, W(2,4))       7           4          20.8%         19

Table 1: Performance of GRID/Weaver codes.

5 Discussion and Open Problems

We have corroborated our conjecture that for all XOR-based erasure codes, there is a fundamental tradeoff between recovery I/O and storage overhead at a given fault tolerance. We know the extrema in this tradeoff. Replication has maximum storage overhead and recovers a data block in a single I/O. Minimum distance separable codes provide maximum storage efficiency, and the algorithm we present for minimizing recovery I/O gives optimal recovery schedules. We evaluated optimal recovery for the most prevalent XOR-based erasure codes. In between these extrema lie codes that increase storage overhead and reduce recovery I/O. We demonstrate meaningful intermediate points in the GRID/Weaver code.

It remains an open problem to formalize the tradeoff between storage efficiency and recovery I/O and construct codes that are recovery optimal. We are pursuing this problem both analytically and through automatic erasure code generation. At present, we are conducting a programmatic search of feasible generator matrices and their optimal recovery I/O schedules to find the codes with minimum I/O requirements. The exponential growth of possible codes as a function of matrix size means that we need to develop methods to prune the search both in matrix generation and in the finding of optimal recovery schedules. However, exploring the space for reasonably sized systems, up to one hundred disks, seems within reach.

References

[1] E. Anderson, X. Li, A. Merchant, M. A. Shah, K. Smathers, J. Tucek, M. Uysal, and J. J. Wylie. Efficient eventual consistency in Pahoehoe, an erasure-coded key-blob archive. In Dependable Systems and Networks, 2010.
[2] M. Blaum and R. M. Roth. On lowest-density MDS codes. IEEE Trans. on Information Theory, 45:46–59, 1999.
[3] J. Blomer, M. Kalfane, M. Karpinski, R. Karp, M. Luby, and D. Zuckerman. An XOR-based erasure-resilient coding scheme. Technical Report TR-95-048, International Computer Science Institute, August 1995.
[4] B. Chen, R. Curtmola, G. Ateniese, and R. Burns. Remote data checking for network coding-based distributed storage systems. In Cloud Computing Security Workshop, 2010.
[5] A. G. Dimakis, P. B. Godfrey, Y. Wu, M. J. Wainwright, and K. Ramchandran. Network coding for distributed storage systems. IEEE Trans. Inf. Theor., 56(9):4539–4551, September 2010.
[6] A. L. Drapeau et al. RAID-II: a high-bandwidth network file server. In International Symposium on Computer Architecture, 1994.
[7] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in globally distributed storage systems. In USENIX OSDI, pages 1–7, 2010.
[8] S. Ghemawat, H. Gobioff, and S. Leung. The Google file system. In ACM SOSP, 2003.
[9] K. M. Greenan, X. Li, and J. J. Wylie. Flat XOR-based erasure codes in storage systems: Constructions, efficient recovery, and tradeoffs. In Mass Storage Systems and Technologies, 2010.
[10] J. L. Hafner. WEAVER codes: Highly fault tolerant erasure codes for storage systems. In USENIX FAST, 2005.
[11] J. L. Hafner, V. Deenadhayalan, K. K. Rao, and J. A. Tomlin. Matrix methods for lost data reconstruction in erasure codes. In USENIX FAST, 2005.
[12] R. Y. Hou, J. Menon, and Y. N. Patt. Balancing I/O response time and disk rebuild time in a RAID5 disk array. In Hawaii International Conference on System Sciences, 1993.
[13] C. Huang, M. Chen, and J. Li. Pyramid codes: Flexible schemes to trade space for access efficiency in reliable data storage systems. In Network Computing and Applications, 2007.
[14] C. Huang and L. Xu. STAR: An efficient coding scheme for correcting triple storage node failures. IEEE Transactions on Computers, 57:889–901, 2008.
[15] M. Li, J. Shu, and W. Zheng. GRID codes: Strip-based erasure codes with high fault tolerance for storage systems. ACM Transactions on Storage, 4(4):15:1–15:22, 2009.
[16] J. S. Plank. A tutorial on Reed-Solomon coding for fault-tolerance in RAID-like systems. Software—Practice & Experience, 27(9):995–1012, 1997.
[17] J. S. Plank. The RAID-6 Liberation codes. In USENIX FAST, 2008.
[18] J. S. Plank. The RAID-6 Liber8Tion code. Int. J. High Perform. Comput. Appl., 23:242–251, August 2009.
[19] M. Sivathanu, V. Prabhakaran, A. C. Arpaci-Dusseau, and R. H. Arpaci-Dusseau. Improving storage system availability with D-GRAID. In USENIX FAST, 2004.
[20] M. W. Storer, K. Greenan, E. L. Miller, and K. Voruganti. Pergamum: Replacing tape with energy efficient, reliable, disk-based archival storage. In USENIX FAST, 2008.
[21] L. Tian, D. Feng, H. Jiang, K. Zhou, L. Zeng, J. Chen, Z. Wang, and Z. Song. PRO: a popularity-based multi-threaded reconstruction optimization for RAID-structured storage systems. In USENIX FAST, 2007.
[22] Z. Wang, A. G. Dimakis, and J. Bruck. Rebuilding for array codes in distributed storage systems. CoRR, abs/1009.3291, 2010.
[23] L. Xiang, Y. Xu, J. C. S. Lui, and Q. Chang. Optimal recovery of single disk failure in RDP code storage systems. In ACM SIGMETRICS, 2010.
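As a concrete companion to the shortest-path formulation of Section 3, the sketch below (an illustrative reconstruction, not the authors' implementation; the function name `min_recovery_io` is our own) runs Dijkstra's algorithm over bitmask nodes created on demand, as the paper suggests. Each decoding equation is an n-bit integer mask; a graph node is the cumulative OR of the equations chosen so far, and its cost is the popcount of that mask. The example equation sets E0 and E1 are the ones the paper enumerates for Figure 1 when D0 and D1 are lost.

```python
from heapq import heappush, heappop

def min_recovery_io(equation_sets):
    """Select one decoding equation per lost bit so that the union of the
    chosen equations covers as few codeword bits as possible.

    equation_sets[i] lists the decoding equations E_i for lost bit f_i,
    each encoded as an n-bit integer mask. Returns (union_size, mask).
    Nodes of the recovery graph are (level, mask) pairs; they are created
    lazily, mirroring the paper's on-demand Dijkstra construction.
    """
    heap = [(0, 0, 0)]              # (popcount of mask, level, mask); Z = all zeroes
    best = {(0, 0): 0}
    while heap:
        cost, level, mask = heappop(heap)
        if level == len(equation_sets):
            return cost, mask       # shortest path reached S_{|F|-1}
        if cost > best.get((level, mask), float("inf")):
            continue                # stale heap entry
        for eq in equation_sets[level]:
            nmask = mask | eq       # union of the equations chosen so far
            ncost = bin(nmask).count("1")
            if ncost < best.get((level + 1, nmask), float("inf")):
                best[(level + 1, nmask)] = ncost
                heappush(heap, (ncost, level + 1, nmask))
    return None                     # no selection recovers all lost bits

# Equation sets for F = {D0, D1} from the paper's example
# (bit order: D0 D1 D2 D3 C0 C1 C2 C3):
E0 = [0b10101000, 0b10010010, 0b10011101, 0b10100111]
E1 = [0b01010100, 0b01101110, 0b01100001, 0b01011011]

union_size, mask = min_recovery_io([E0, E1])
print(union_size)                   # 5, matching the paper's shortest path
```

Note that the returned union size counts the lost bits themselves, since each equation contains the bit it decodes; the number of surviving bits actually read is `union_size - |F|`, i.e. 5 − 2 = 3 for this example, matching the paper's three-bit recovery.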