Minimizing Metadata Access Latency in Wide Area
Networked File Systems
Jian Liang†
Aniruddha Bohra
Hui Zhang
Samrat Ganguly
Rauf Izmailov
NEC Laboratories, Princeton, NJ, 08540
†jliang@cis.poly.edu, {bohra,huizhang,samrat,rauf}@nec-labs.com
Abstract
Traditional network file systems, like NFS, do not extend to the wide area due to low bandwidth, high network
latency, and the dynamism of the WAN environment. Metadata access latency is a significant performance
problem for Wide Area File Systems, since metadata requests constitute a large portion of all file system requests,
are synchronous, and cannot be cached at clients.
We present WireFS, a Wide Area File System, which enables delegation of metadata management to nodes at
client sites (homes). The home of a file stores the most recent copy of the file, serializes all updates, and streams
updates to the central file server. WireFS uses access history to migrate the home of a file to the client site which
accesses the file most frequently.
We formulate the home migration problem as an integer programming problem, and present two algorithms:
a dynamic programming approach to find the optimal solution, and a greedy algorithm which is non-optimal but
is faster than the optimal algorithm. We show through extensive simulations that even in the WAN setting, access
latency over WireFS is comparable to NFS’s performance in the LAN setting; the migration overhead is also marginal
after the initial delegation.
Keywords: Network file systems, WAN, data management, algorithms, dynamic programming
I. INTRODUCTION
With economic globalization, more and more enterprises have multiple satellite offices around the world. These
locations span multiple timezones and range from small offices of fewer than twenty users to large facilities of
several thousand users. For these enterprises, ease of user and data management including backups and fault tolerance,
legal requirements to record and report stored data over a number of years, and economic benefits of reducing
infrastructure and support staff costs at satellite locations have led to a move towards resource consolidation and
centralized data management. In such scenarios, network file systems provide a familiar interface for data access
and are used extensively.
Fig. 1. WireFS Architecture. [Figure: two client sites, each with its own LAN, clients, and a Redirector, connected over the wide area network to the server site hosting the Manager and the file server.]
Traditionally, network file systems have been designed for local area networks, where bandwidth is ample
and latencies are low. Common networked file systems like NFS [3] and CIFS [6] transfer large amounts of data
frequently. All writes are transmitted to the server and require synchronous updates to the files there. Apart from
wasting bandwidth, typical networked file systems require multiple round trips to complete a single file operation.
The metadata requests are synchronous and the client cannot proceed without receiving server response. The high
latency of the round-trips over the WAN and the chatty nature of the protocols make file access slow and unreliable.
Finally, relying on a central server over the wide area network makes the file system susceptible to significant
slowdowns due to unpredictable network delays and outages.
To improve network bandwidth utilization and to hide wide area latencies, Wide Area File Systems (WAFS) have
been developed [17], [1], [25], [20]. These file systems reduce bandwidth utilization by (i) aggregating file system
operations to eliminate redundant operations and to reduce bandwidth requirements, and (ii) using content based
persistent caching to eliminate duplicate block transfers and to enable caching across files.
Unfortunately, current Wide Area File Systems ignore the file system access patterns, and are oblivious to the
characteristics of the underlying network. These systems either take the client-centric view, where each file system
client maintains the content cache and there is no sharing across clients even at the same location, or the site-centric view, where a group of clients is organized as an island and sharing is enabled across clients at the same site.
Site-centric WAFS deploy an appliance at each client site (redirector), which acts as a file server for the clients at
the site and as the WAFS client. A server side appliance (manager) acts as the WAFS server and as a client of the
file server.
An enterprise which deploys existing WAFS has no way to take advantage of temporal locality across sites, e.g.,
different timezones and access patterns, or network diversity which arises due to distinct network paths between sites
and data centers. Recently developed file systems [1] allow sharing data across client sites. However, the sharing
is limited to data and the goal is to eliminate the bottleneck at the central server. These systems are designed for
a read dominated workload, and clients must exhibit significant sharing for these systems to take advantage of
network diversity.
In this paper, we present WireFS, a wide area file system which takes an organization-centric view, enables data
and meta-data sharing across multiple client sites, and minimizes metadata access latency in this system. WireFS
takes advantage of temporal locality in file access, and allows data and metadata sharing across client sites. Figure 1
shows the WireFS architecture. Similar to site-centric WAFS, WireFS uses Redirectors (WFSRs), which act as file
servers for all clients that belong to a site (island). These redirectors act as WAFS clients and communicate with a
server side Manager (WFSM), which acts as a WAFS server. WFSM appears as the only client to the central file
server which is the final authority on file system contents. In WireFS, Redirectors communicate not only with the
Manager, but also with other Redirectors to allow data sharing, and cooperative metadata management.
WireFS presents a number of design challenges. First, the system must maintain the file system interface while
distributing the metadata management across WFSRs. Second, delegating metadata management to individual
WFSRs must not lead to inconsistent file system state. Third, the system must be fault tolerant and a failure
must not lead to loss of service or inconsistency. Finally, WireFS must minimize the file system operation latency
while distributing the metadata management.
WireFS overcomes these challenges by using a home based approach. Each file is assigned a home server, WFSR
or WFSM, which controls access and serializes updates to the file. The most recent copy of a file is cached at
its home server. The home maintains a single serialization point for all updates and therefore provides semantics and
consistency guarantees similar to those of a centralized file server. Fault tolerance is achieved by maintaining a primary and
a secondary home which maintain identical state. The home is not statically assigned and can be migrated closer
to the clients accessing the file most frequently.
In this paper, we address the problem of home migration based on file system access history. Intuitively, a file
that is accessed frequently by a client is moved closer to it. This is achieved by assigning the home of the file to the
WFSR at the client site. Since the number of files in a modern file system is large, assigning homes to individual
files is both inefficient and infeasible due to the overwhelming maintenance and lookup overhead. Instead, we
decompose the file system namespace into a number of sub-trees and assign homes to these sub-trees.
We formulate the problem of tree decomposition and home assignment to redirectors as an integer programming
problem. We propose a dynamic programming algorithm to find the optimal solution in polynomial time. We also
present a greedy algorithm as a heuristic which works much faster than the dynamic programming algorithm.
We evaluate WireFS using a trace driven simulator. We use publicly available NFSv3 traces [8] and network
measurement data [24] as inputs to the simulator. We partition the traces into clusters of hosts identified from the
traces. We place these clusters across timezones and transform the requests in the trace to take the corresponding
timezone into account. By introducing additional network latencies for messages exchanged between hosts, we
simulate the effect of network characteristics on WireFS.
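The timezone transformation described above can be sketched as follows; the tuple layout of trace requests and the per-site offset map are illustrative assumptions, not the simulator's actual format.

```python
def shift_trace(requests, site_tz_offset_hours):
    """Shift per-site request timestamps by the site's timezone offset to
    emulate geographic distribution, then re-serialize the merged trace.
    requests: iterable of (timestamp_seconds, site_id, operation) tuples
    (an assumed layout); site_tz_offset_hours: site_id -> offset in hours."""
    return sorted(
        (ts + site_tz_offset_hours[site] * 3600, site, op)
        for ts, site, op in requests)
```

Shifting each site's requests by its offset interleaves its activity peaks with those of sites placed in other timezones, which is what exposes the temporal locality across sites that WireFS exploits.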
In the simulated 75-host 10-site wide-area system with two-week data traces, WireFS showed superior access
latency performance: up to 92% of the meta-data lookups took less than 10 ms and the average access latency per
lookup was around 23 ms, compared to the average node-pair round-trip time of 157 ms introduced by the WAN
setting. Moreover, the maximum 2.5% and average 0.1% home reassignment ratio (the percentage of the meta-data
files whose home nodes change per round of assignment computation every 30 minutes) demonstrated the marginal
overhead home migration incurred.
The rest of the paper is organized as follows. Section II presents the WireFS architecture design briefly and then
gives the details on its meta-data layer. The problem definition and algorithms for home migration are described in
Section III. Section IV discusses the implementation and Section V presents the experimental setup and evaluation
results. Section VI describes the related work and Section VII concludes the paper.
II. WIREFS
WireFS is a wide area file system which enables delegation of metadata management, and uses content caching
and duplicate elimination to reduce redundant data block transfers. WireFS has two primary components: WireFS
redirectors (WFSRs), which are deployed at each client site, act as WFS clients, and export the file system interface
to the collocated clients; and the WireFS manager (WFSM), which maintains the global namespace view, coordinates
WFSRs, and communicates with the file server as a file system client.
The WireFS architecture has two logical components which capture the typical behavior of network file systems:
(i) the Data Access Layer (DAL) and (ii) the Metadata Layer (MDL).
The MDL is composed of a set of WireFS redirectors that serve all meta-data requests, including file and directory
lookup, creation and deletion of files, and updates to file or directory metadata, e.g., access time updates. In
addition to the traditional file system functionality, the MDL also maintains and communicates the location of the
data blocks in the system. Note that the data chunks can be located in the content cache of one or more WFSRs.
The primary goal of MDL is to reduce the latency of the above operations in WireFS.
The DAL enables fast transfer of data across the wide area. The transfer may include original file/directory
contents, update logs, and the updated data blocks being transferred to the file server. The primary goal of the
DAL is to reduce the volume of data block transfers across the wide area network. Several techniques, for example
aggressive prefetching, large persistent caches, duplicate elimination using a summary of the file data, and sharing
Fig. 2. WireFS Redirector (WFSR) architecture. [Figure: the WFSR stacks a file system interface over the WireFS client, which uses a content cache, update logs (e.g., LOG(A) for /a, a chunk list for /a/b), and a migration table.]
data chunks across files, etc., previously proposed for WAFS, are used by WireFS to implement the DAL. In addition
to the above, DAL performs coordinated data dissemination to create warm caches and uses cooperative caching
among WFSRs which reduces server load and takes advantage of network diversity.
In this paper, we focus on the design of algorithms for MDL to minimize the latency of metadata operations. In
the following, we describe the architectural components of WireFS and the design of the meta data layer in detail.
A. WireFS Redirector
A WireFS redirector is deployed at each client site and has three main functions, (i) to export a file system
interface to the clients at the site, (ii) to maintain a content addressable cache and communicate with other WFSRs
or the WFSM to perform data transfers, and (iii) to maintain operation logs, perform serialization of updates, and
handle metadata queries for files for which it is the designated home. Figure 2 shows the architecture of a WireFS redirector.
The file system interface (FSI) exported by the WFSR enables clients to communicate using an unmodified file
system protocol. This interface translates client requests to WFS requests. On receiving the corresponding WFS
reply, the FSI constructs the response to the original client request and sends it to the client. A pending request
map is maintained by the FSI to match the responses to the corresponding requests. WireFS can support multiple
file system protocols by defining the appropriate FSI.
Each WFSR maintains a large persistent content-cache that stores files as chunks indexed by content hashes,
which can be used across files. Chunks are non-overlapping segments of file data whose extents are determined by
content boundaries (breakpoints) using a fingerprinting technique, and are indexed by the SHA-1 collision resistant
hash of the contents. WireFS associates a sequence of chunk indices in the file metadata which augments the default
file information, e.g. access times, permissions, access control lists, etc.
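As a concrete illustration of the chunking scheme, the sketch below splits a byte stream at content-defined breakpoints and indexes each chunk by its SHA-1 hash. The rolling-style hash, window size, and breakpoint mask are simplified stand-ins for the fingerprinting technique (LBFS uses Rabin fingerprints); they are illustrative, not WireFS's actual parameters.

```python
import hashlib

WINDOW = 48          # minimum bytes before a breakpoint may fire (illustrative)
BREAK_MASK = 0x1FFF  # expected chunk size around 8 KB (illustrative)

def chunk(data: bytes):
    """Split data at content-defined breakpoints; index chunks by SHA-1."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Toy hash over the bytes since the last breakpoint; a real
        # implementation would use a true sliding-window Rabin fingerprint.
        h = (h * 31 + b) & 0xFFFFFFFF
        if i - start + 1 >= WINDOW and (h & BREAK_MASK) == BREAK_MASK:
            piece = data[start:i + 1]
            chunks.append((hashlib.sha1(piece).hexdigest(), piece))
            start, h = i + 1, 0
    if start < len(data):
        piece = data[start:]
        chunks.append((hashlib.sha1(piece).hexdigest(), piece))
    return chunks
```

Because breakpoints depend only on content, an edit early in a file perturbs the hashes of only the chunks it touches, so unchanged chunks are still found in the content cache, even across different files.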
B. WireFS Manager
The WireFS manager is deployed at the server site and has a specialized role in the WireFS protocol. It communicates
directly with the server and maintains a global view of the file system namespace. It also assigns and maintains the
WireFS specific attributes of files like the home node, ownership information, generation numbers etc. The WFSM
is the home node for all files until it delegates this responsibility to a WFSR. The WFSM is also responsible for the
coordinated dissemination of commonly accessed files to multiple WFSRs to warm up the WFSR caches. Finally,
the WireFS manager periodically reorganizes the file system namespace by reassigning homes of files according to
the access history statistics.
C. WireFS Home
Each file in the file system namespace is assigned a home. The home is responsible for maintaining file consistency,
provides a serialization point for all updates, and performs aggregation on file system operations to reduce network
round trips. Homes maintain not only the update logs and serialization, but also maintain the latest version of the
file metadata including access times, permissions, size, chunk indices, etc.
Each WFSR and the WFSM maintain a migration table, which contains a view of the file system namespace,
statistics and access history, and per-file WireFS metadata. An entry in the migration table is indexed by the file
identifier, and contains either the home node identifier, or the WireFS metadata for the file. WireFS metadata
contains attributes defined in the file system, and a list of chunk indices, update logs, etc. The migration table is
updated locally, on each operation to maintain statistics and access history, and remotely, by the WFSM.
On receiving a client request for metadata, for example file lookup, or data, for example read or write, the WFSR
identifies the home of the file using the migration table and forwards the request to the home. The home provides
the information and maintains a timestamped record of all updates as update logs. The home node aggregates
updates, eliminates duplicate or redundant updates, and streams the update logs to the file server.
D. Update Logs
The home node serializes updates to a file. Updates performed by the client are forwarded to the home by the
WFSRs. A WireFS home maintains a timestamped record of all updates it receives. The logs are used to reconcile
conflicting updates to reconstruct the most recent version of the file.
WireFS maintains an ordered sequence of logs using version vectors [10], [18]. A version vector includes the
timestamp at the home node and a global, monotonically increasing sequence number. The sequence number is
incremented when an update entry is appended to the log. Logs are ordered using these version vectors.
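A minimal sketch of this ordering, with assumed field names for the version vector:

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class LogVersion:
    """Version vector of an update-log entry: the timestamp at the home node
    plus a global, monotonically increasing sequence number (assumed names)."""
    timestamp: float
    seqno: int

def order_logs(entries):
    """Order (LogVersion, update) pairs; sorting by version vector
    reconstructs the update stream the home node serialized."""
    return sorted(entries, key=lambda e: e[0])
```

Since `order=True` compares fields lexicographically, entries sort by home-node timestamp first and sequence number second, matching the serialization order at the home.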
Update logs allow each WFSR to acquire leases on files, apply all updates locally to avoid unnecessary network
communication, and forward the local update log to the home node on callbacks. A version number is provided to
each WFSR when the lease is acquired.
The use of update logs allows WFSRs to function and provide the file system functionality to clients in the event
of disconnection or failure. Unfortunately, this comes at the cost of weakened consistency semantics. WireFS
maintains a close-to-open consistency semantics similar to Andrew [16] when the WFSRs are well connected. On
disconnection, multiple versions of the file due to independent update logs are merged using an algorithm similar
to rsync. However, in some cases, WireFS requires the administrator or users to manually decide on the correct
version of the file which is then updated atomically at the server.
In all cases, the file server maintains the correct version of the file; updates applied at the server are considered
a barrier, and all logs are updated to reflect them.
E. Leases and Callbacks
A WFSR acquires a lease on the file when it performs updates. The home node performs two actions on receiving
a lease request. First, it constructs the most recent version of the file using update logs. Second, it registers a callback
which recalls the file, including any updates performed by the WFSR while it held the lease. The callbacks
are required when the home is reassigned, when another WFSR requests a lease, or when more than a threshold
amount of time has elapsed.
A lease request may not always succeed. In this case, the home node requires all WFSRs to send read as well
as update requests to it and maintains a single update log. This avoids multiple conflicting updates and provides
stronger consistency guarantees at the cost of network communication on each read or write request.
F. Example
Lookup: Clients lookup the path of the file (or directory) starting from the root of the mounted file system and
descending through the path hierarchy. The client starts the lookup from the first component of the path for which
the file attributes are invalid. In the worst case, this lookup starts at the root of the mounted volume.
The WFSR performs two operations on receiving the lookup request. First, it translates the file handle to the
server path name (on which the WireFS lookup hierarchy is constructed). Figure 3 shows the lookup operation. If
the file handle is cached and is valid (there is no pending callback), the WFSR returns it. If the cached handle is
invalid and the home of the parent directory is known, the cached entry is purged and an OPEN request is forwarded
to the home of the parent. If the parent’s home is unknown, the WFSR sends a HOME_LOOKUP request to the
WFSM and sends the OPEN request to the returned home node. The parent is guaranteed to have either the file
handle information or the location of the delegated node that has the file handle information.
The OPEN request registers a callback with the home node to invalidate the WFSR cache on an update. It also
retrieves the attributes of all children of the lookup target. Note that by using an invalidation based scheme over
Fig. 3. Timeline for the lookup operation. There are three cases: first, when the file handle is cached at the WFSR (shown at the top); second, when the home of the file is known; and third, when the home information is retrieved from the WFSM. The solid lines show local area communication while the dotted lines show the messages over the wide area. [Figure: message timelines between the Client, the WFSR, the WFSR home, and the WFS Manager: a LOOKUP answered from the cache; a LOOKUP followed by an OPEN to the home, which registers a callback and creates a dentry; and a LOOKUP followed by a HOME_LOOKUP to the WFSM and an OPEN to the returned home.]
the WAN, we significantly reduce the number of round-trips as well as guarantee consistency of the file across the
wide area. Moreover, since the number of WFSRs is limited (100s), the state maintenance overhead at the home
node is not very high. At the same time, characteristics of the file system over the LAN are preserved without
modifying the existing implementation of the protocol.
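The three lookup cases can be condensed into the following sketch; the object model (Home, Manager, Redirector) and the handle format are hypothetical stand-ins for the WFS protocol, kept only detailed enough to show when a wide-area round trip is incurred.

```python
from dataclasses import dataclass
import posixpath

@dataclass
class CacheEntry:
    handle: str
    pending_callback: bool = False

class Home:
    """Stand-in for a file's home node (a WFSR or the WFSM)."""
    def __init__(self, name):
        self.name = name
        self.callbacks = {}                      # path -> redirectors to invalidate
    def open(self, path, callback):
        self.callbacks.setdefault(path, []).append(callback)
        return f"fh:{self.name}:{path}"          # hypothetical handle format

class Manager:
    """Stand-in WFSM: authoritative, one-hop home resolution."""
    def __init__(self, homes):
        self.homes = homes                       # subtree root -> Home
    def home_lookup(self, path):
        return self.homes[path]

class Redirector:
    def __init__(self, wfsm):
        self.cache, self.migration_table, self.wfsm = {}, {}, wfsm
    def lookup(self, path):
        entry = self.cache.get(path)
        if entry and not entry.pending_callback:
            return entry.handle                  # case 1: cache hit, no WAN trip
        if entry:
            del self.cache[path]                 # purge the invalidated entry
        parent = posixpath.dirname(path)
        home = self.migration_table.get(parent)
        if home is None:                         # case 3: one extra trip to the WFSM
            home = self.wfsm.home_lookup(parent)
            self.migration_table[parent] = home
        handle = home.open(path, callback=self)  # case 2/3: OPEN registers a callback
        self.cache[path] = CacheEntry(handle)
        return handle
```

A second lookup of the same path hits case 1 and completes without any wide-area communication, which is exactly the behavior the invalidation-based scheme is meant to preserve.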
III. WIREFS META-DATA LAYER
The design of traditional network file systems like NFS assumes the clients and the server are connected over
a low latency network. This allows each file system operation to perform multiple remote procedure calls (RPCs).
While this is acceptable over a LAN, each additional round trip over the wide area network results in deteriorated
performance.
For data transfer, the additional latency can be masked by aggressive prefetching and writing back bulk data.
However, for typical meta-data operations like file lookup, open, and delete, the short RPC messages lead to
significant increase in the response time. Such large overheads subsequently affect the performance observed by
the clients as any data transfer is preceded by one or more meta-data operations. For example, before reading a file,
the client must perform a recursive directory lookup, followed by authentication and attribute checks. Therefore,
for any wide area file system, improving the performance of the meta-data operations is of utmost importance.
Recently proposed wide-area file systems rely on a central server for all meta-data operations. For a large client
population, such operations contribute towards heavy load on the server. To reduce the load on the central server,
file systems over Distributed Hash Tables (DHTs) have been proposed which do not have a central server, and the
participating nodes cooperatively provide its functionality. Unfortunately, in this model, the hierarchical structure
of the file system namespace is lost, and lookups of files and directories can take up to O(log n) round
trips (where n is the number of participating nodes), which is unacceptable over the wide area.
A. Home Migration
We use a virtual namespace tree rooted at the directory “/” to model the file organization in NFS. An NFS [3]
file lookup consists of a series of sub-lookups that traverse the directory path from the root node to the file node
on the directory tree. For example, in Figure 4, to look up the directory entry for the file “/a/x/1.txt”, the lookups
for “/”, “/a”, “/a/x”, and “/a/x/1.txt” are executed in order.
Fig. 4. A lookup for the file “/a/x/1.txt” in the directory tree. [Figure: the root “/” has children /a, /b, /c, and /d; /a has children /a/x, /a/y, and /a/z; /a/x contains /a/x/1.txt and /a/x/2.txt. Lookups 1 through 4 descend the path “/”, “/a”, “/a/x”, “/a/x/1.txt”.]
In a LAN setting, the multiple lookup round trips are invisible to end users due to the fast local transmission
speed. However, the network latency in a WAN is large enough that a file lookup can take seconds to finish, which
makes the response time intolerable during normal file operations. To alleviate this performance problem, our solution is
based on the following observation: if most of the accesses into a subtree in the directory tree come from one site
(through a WFSR), we will assign the administration privilege of this subtree to that site (WFSR). We call this
task delegation a home migration, and that WFSR the home node of this subtree. Notice that home migrations can
occur recursively in that a subtree migrated to one WFSR may have its own subtree migrated to another WFSR
node. Therefore, the directory tree is decomposed into multiple sub-trees based on access statistics, and we want
to design the assignment scheme for home migrations so that the total access latency is minimized. In addition,
to allow fast (one-hop) resolution of home nodes, we will maintain a migration table at WFSM, the central server
side, which keeps one pointer (the address of the home node) for each distinct migrated sub-tree. Figure 5 shows
one example for home migration.
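One-hop resolution against the migration table amounts to a longest-prefix match on the path: walk up the directory path until a migrated subtree root is found, falling back to the WFSM otherwise. The sketch below assumes absolute POSIX-style paths; the example mapping in the test is illustrative, in the spirit of Figure 5, not its exact contents.

```python
import posixpath

def resolve_home(migration_table, path, default_home="R_0"):
    """Return the home node of path's closest migrated ancestor; paths under
    no migrated subtree fall back to the WFSM (here named R_0)."""
    node = path
    while True:
        if node in migration_table:
            return migration_table[node]
        parent = posixpath.dirname(node)
        if parent == node:          # reached "/" without a match
            return default_home
        node = parent
```

For example, with /a homed at R_2 and /a/x at R_3, a lookup of /a/x/1.txt resolves to R_3 while /a/y resolves to R_2, so recursive migrations are handled by whichever subtree root is closest.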
Fig. 5. The home migration of a directory tree and the corresponding migration table. [Figure: subtrees /a, /b, /c, and /a/x of the directory tree are migrated to redirectors R_1, R_2, and R_3; the migration table at the WFSM keeps one pointer per migrated subtree, mapping each subtree root to its home node.]
Formally, we label the WFSM as R_0, the n WFSRs as R_1, R_2, \ldots, R_n, and the network latency (RTT) between
R_i and R_j as L_{R_i R_j}. When a file lookup from R_i traverses a directory node D_x (1 \le x \le m, where m is the
number of directory nodes), we call it one access of R_i on D_x. For each node D_x in the directory tree, a stack of
registers \{C_{D_x R_i}, i \in [0, n]\} records the expected accesses of each WFSR on D_x during the next time period T.¹
Now we formulate access latency optimization as an integer programming problem:

  \min \sum_{x=1}^{m} \sum_{i=0}^{n} I_{D_x R_i} \Big( \sum_{j=0}^{n} C_{D_x R_j} L_{R_j R_i} + M_{D_x R_i} \Big)    (1)

subject to

  I_{D_x R_i} \in \{0, 1\}, \qquad \sum_{i=0}^{n} I_{D_x R_i} = 1,

where I_{D_x R_i} = 1 if the subtree rooted at D_x will be migrated to R_i, and 0 otherwise. The term
I_{D_x R_i} (\sum_{j=0}^{n} C_{D_x R_j} L_{R_j R_i}) is the total access cost on the directory node D_x if we migrate the
subtree rooted at it to the home node R_i, and M_{D_x R_i} is the transfer cost of migrating D_x from its current home
node to R_i.
When there is no migration table size constraint, the optimal solution can be found by deciding the best home
node for each directory node individually. Next, we present the algorithm to compute the optimal solution when
the migration table size is constrained.
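In the unconstrained case, each node's term in (1) can be minimized independently. A small sketch, with array-based stand-ins for the C, L, and M quantities; treating the migration cost as zero when the node stays at its current home is an assumption the formulation does not spell out.

```python
def best_home(access_counts, latency, migration_cost, current_home):
    """Choose the home R_i minimizing sum_j C_j * L[j][i] + M_i for a single
    directory node. access_counts[j] ~ C_{Dx Rj}; latency[j][i] ~ L_{Rj Ri};
    migration_cost[i] ~ M_{Dx Ri} (taken as zero when staying put)."""
    n = len(latency)
    def cost(i):
        move = 0 if i == current_home else migration_cost[i]
        return sum(access_counts[j] * latency[j][i] for j in range(n)) + move
    return min(range(n), key=cost)
```

Intuitively, a node accessed far more often from one redirector migrates there as soon as the saved access latency outweighs the transfer cost.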
B. Optimal Solution under Constrained Migration Table
Let P_max (smaller than the directory size) be the maximal number of pointers that the migration table can contain.
Deciding the P_max distinct subtrees is similar to many cache or filter placement problems in the literature [13], [23].
To find the optimal solution in a bounded-degree directory tree, we can solve the following problem using dynamic
programming.

¹In the experiments we use an exponentially weighted moving average (EWMA) counter to approximate the access register based on past
historical information.
(i.) Let access(D_x, k, H_p(D_x)) be the optimal access cost for the directory (sub)tree rooted at D_x, given that
there are k pointers left for this subtree and that the home node of the parent of D_x is H_p(D_x). We start
with access(“/”, P_max, R_0) on the root node and enumerate the rest of the nodes following breadth-first
search.
(ii.) At each directory node D_x, the optimal assignment is decided as follows:
• If k = 0, all nodes in the subtree are assigned to H_p(D_x) and
  access(D_x, k, H_p(D_x)) = \sum_{z \in subtree(D_x)} \sum_{j=0}^{n} ( C_{D_z R_j} L_{R_j R_{H_p(D_x)}} + W_{D_z R_{H_p(D_x)}} ).
• Otherwise, access(D_x, k, H_p(D_x)) is the minimum of the following two quantities:
  - over every y \ne H_p(D_x) and all possible allocation schemes (z, A_z) of k-1 pointers on the children of D_x:
    \sum_{j=0}^{n} ( C_{D_x R_j} L_{R_j R_y} + W_{D_x R_y} ) + \sum_{z:\ child\ of\ D_x} access(z, A_z, y);
  - over all possible allocation schemes (z, A_z) of k pointers on the children of D_x:
    \sum_{j=0}^{n} ( C_{D_x R_j} L_{R_j R_{H_p(D_x)}} + W_{D_x R_{H_p(D_x)}} ) + \sum_{z:\ child\ of\ D_x} access(z, A_z, H_p(D_x)).
Next we present the analysis result on the dynamic programming algorithm.
Theorem 1: The dynamic programming algorithm finds the optimal solution in O(P_max^D m^2 n) time, where D is
the maximal degree in the directory tree.
Proof: The analysis is similar to that for the k-median problem on trees [26] and is omitted.
C. A Greedy Algorithm under Constrained Migration Table
While we can find the optimal solution in polynomial time, the likely enormous directory tree size m and the large
degree bound D make it desirable to find a good solution as quickly as possible. We observe that, in the file directory
tree, the nodes close to the root receive more lookup requests (and thus likely incur higher access cost) than the
nodes close to the leaves do. Therefore, when deciding home migration we can proceed top-down, starting from
the nodes at the top of the directory tree. From a set of candidate nodes, we first pick the node whose subtree
has the most access requests (from all users) for the home migration process. The following describes the greedy
algorithm based on these ideas:
(i.) Initially, for each directory node D_x, we count C^{subtree}_{D_x R_i}, the total number of lookup requests falling into
the subtree rooted at D_x from each WFSR R_i; we label the home node for D_x as H(D_x), where
H(D_x) = \arg\min_{i=0,\ldots,n} \sum_{j=0}^{n} C^{subtree}_{D_x R_j} L_{R_j R_i}; lastly, we assign D_x the weight
W(D_x) = \sum_{i=0}^{n} C^{subtree}_{D_x R_i}.
(ii.) The migration table is initialized with one entry, which records that the home node for the directory root
node is the WFSM (R_0).
(iii.) The children of the directory root node are put into an ordered linked list in descending order of the
weight W(D_x). For two nodes with the same weight, the tie is broken by giving the node with the smaller
subtree the higher position.
(iv.) We repeat the following operation until either all k migration table entries are filled up or the ordered list is
empty:
• Remove the head node D_x from the ordered list and insert its children into the ordered list. D_x is
put into the migration table and assigned the home H(D_x) if its closest ancestor node in the migration
table is not assigned the same home as H(D_x); otherwise it is not put into the migration table.
(v.) Lastly, for any node Dy not in the migration table, its assigned home node assigned(Dy ) is the same as the
home node assigned to its closest ancestor node in the migration table.
The greedy algorithm gives priority to the nodes whose subtrees incur more access requests (and thus likely more
access cost) than the other nodes’. The operations in step (i.) omit the file transfer cost for simplicity; the operations
in step (iv.) remove an unnecessary (redundant) migration table entry for a child node in the tree when its parent
node will be migrated to the same home.
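Steps (i.) through (v.) can be condensed into the sketch below; the tree encoding, the precomputed subtree counters taken as input, and the omission of the secondary tie-break on subtree size are simplifying assumptions.

```python
import heapq

def greedy_homes(tree, counts, latency, k):
    """Greedy home assignment (steps i-v), with illustrative structures.
    tree: {node: [children]} rooted at "/"; counts[node][i] = lookups on the
    subtree rooted at node from redirector R_i (the C^subtree counters,
    assumed precomputed); latency[j][i] = RTT between R_j and R_i;
    k = migration table size P_max. Index 0 is the WFSM (R_0)."""
    n = len(latency)
    def best_home(node):            # step (i): H(Dx) = argmin_i sum_j C_j L_ji
        return min(range(n),
                   key=lambda i: sum(counts[node][j] * latency[j][i]
                                     for j in range(n)))
    def weight(node):               # step (i): W(Dx) = sum_i C^subtree
        return sum(counts[node])
    table = {"/": 0}                # step (ii): the root is homed at the WFSM
    # step (iii): max-heap on weight (negated for heapq's min-heap)
    heap = [(-weight(c), c) for c in tree.get("/", [])]
    heapq.heapify(heap)
    def ancestor_home(node):        # closest ancestor present in the table
        while node not in table:
            node = node.rsplit("/", 1)[0] or "/"
        return table[node]
    while heap and len(table) < k:  # step (iv)
        _, node = heapq.heappop(heap)
        for c in tree.get(node, []):
            heapq.heappush(heap, (-weight(c), c))
        h = best_home(node)
        if ancestor_home(node) != h:    # skip redundant entries
            table[node] = h
    return table
```

Nodes absent from the returned table inherit the home of their closest ancestor in it, as in step (v.), so the table stays within its P_max pointer budget.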
Next we present the analysis result on the greedy algorithm.
Theorem 2: The greedy algorithm finds an assignment scheme in O(m log m + P_max m) time.
Proof: Step (i.) of the algorithm can be finished in one tree traversal using depth-first search, which takes O(m)
time. Operations related to the ordered list take O(m log m) time. For each new node to be put into the migration
table, checking its ancestor nodes takes O(P_max) time, and at most m nodes will be tried as new nodes for the
migration table.
Later in Section V we show this greedy algorithm works well in practice.
IV. IMPLEMENTATION
WireFS is implemented by extending the Low Bandwidth File System (LBFS) [17]. LBFS provides content
hashing, file system indexing, and chunk storage and retrieval. WireFS extends the LBFS implementation by
including the WFSR update logs. Unlike the default LBFS, WireFS uses a modified NFS implementation which
sends the file system requests to the LBFS client at each WFSR. At the WFSM, the LBFS server sits in front of
the NFS server and is unmodified.
Fig. 6. WireFS implementation. [Figure: a WFSR running an LBFS client with an update log and content cache communicates over RPC/TCP with the WFSM, which runs an LBFS server with its own content cache and the home reassignment module.]
Fig. 7. Directory and update log entries in WFSR. [Figure: the dentry and update-log entry structures; fields include the name, FS attributes, parent home, object home, object owner, callback list, update log, chunk list, update time, generation number, and dentry/fentry pointers.]
Figure 6 shows the WireFS implementation. In addition to the default LBFS, WireFS includes additional functionality for home migration and maintaining update logs. These are implemented as extensions to LBFS and use the
SFS toolkit [15] to provide the asynchronous programming interface. Finally, the interaction between the WFSRs is
independent of the LBFS protocol. WireFS receives all NFS requests from the clients, and uses the WFS protocol
to identify the home node. The requests are passed on to LBFS client at the home node which in-turn uses the
content cache and the LBFS server to service the requests.
WireFS associates additional metadata with each file system object. It is important to note that this information is not visible to either the server or the clients; it is generated and maintained transparently by the WireFS redirectors. The additional attributes enable WireFS-specific optimizations over the wide area network. As shown in Figure 7, for each file, WireFS maintains a directory entry (dentry) which contains four additional attributes: a chunk list, a callback list, home information for both the parent and the file itself, and owner information. In addition to the extended attributes, update logs are maintained for any updates queued for the server. Finally, each WFSR maintains a translation table which maps the file handles provided by the server at mount time to the path name of the file on the server.
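The per-object metadata and the handle-to-path translation table can be sketched as follows. The field names mirror Figure 7, but the concrete Python types and the `HandleTable` helper are illustrative assumptions of ours, not the actual WireFS structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Dentry:
    """Per-object metadata kept by a WFSR (fields mirror Fig. 7)."""
    name: str                   # object name
    fs_attributes: dict         # standard NFS attributes
    parent_home: str            # home node of the parent directory
    object_home: str            # home node of this object
    object_owner: str           # owning site
    callback_list: List[str] = field(default_factory=list)  # sites to notify on update
    update_log: List[dict] = field(default_factory=list)    # updates queued for the server
    chunk_list: List[str] = field(default_factory=list)     # content-hash chunk ids

class HandleTable:
    """Maps server-issued file handles to server path names (kept per WFSR)."""
    def __init__(self):
        self._by_handle: Dict[bytes, str] = {}

    def bind(self, handle: bytes, path: str) -> None:
        self._by_handle[handle] = path

    def resolve(self, handle: bytes) -> str:
        return self._by_handle[handle]

table = HandleTable()
table.bind(b"\x01\x02", "/export/home/alice/paper.tex")
assert table.resolve(b"\x01\x02") == "/export/home/alice/paper.tex"
```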
V. EVALUATION
In this section, we present an evaluation of WireFS home migration using trace driven simulation. We first
describe our simulation methodology and demonstrate that metadata operations constitute a significant portion of
all file system accesses. We also show the temporal locality exhibited by accesses, especially across client sites.
We use publicly available long-term NFSv3 traces to identify file system accesses, and network latency traces to
emulate geographical distribution of client sites. We then show the behavior of the home-based WireFS metadata access protocol and compare it against existing network and wide area file systems. Finally, we show the benefits of home migration in WireFS while comparing our two algorithms for reassignment.
RPC method   | # of accesses (14 days) | % of RPC methods
read         | 194,404,998             | 50.68
write        | 64,635,756              | 16.85
getattr      | 89,051,316              | 23.22
lookup       | 18,442,927              | 4.81
access       | 6,464,648               | 1.69
remove       | 3,234,284               | 0.84
setattr      | 3,090,839               | 0.81
readdirplus  | 1,448,400               | 0.38
statfs       | 910,257                 | 0.24
readdir      | 692,763                 | 0.18
link         | 474,079                 | 0.12
rmdir        | 265,733                 | 0.07
fsinfo       | 131,104                 | 0.03
readlink     | 125,584                 | 0.03
rename       | 86,367                  | 0.02
mkdir        | 55,659                  | 0.01
pathconf     | 30,478                  | 0.01
symlink      | 11,872                  | 0.003
nothing      | 4,361                   | 0.001
mknod        | 7                       | 1.825E-06
commit       | 0                       | 0
total        | 383,561,432             | 100

TABLE I
THE BREAKDOWN OF NFS RPC REQUESTS AND RESPONSES IN THE 14-DAY HARVARD TRACE

NFS client/server | PlanetLab node                  | Time zone
Site0             | planetlab1.cs.uit.no            | GMT -1h
Site1             | planetlab-1.stanford.edu        | GMT +8h
Site2             | planetlab1.informatik.uni-kl.de | GMT -1h
Site3             | planetlab2.pop-rs.rnp.br        | GMT +3h
Site4             | planetlab1.eecs.umich.edu       | GMT +5h
Site5             | planetlab1.cs.wayne.edu         | GMT +5h
Site6             | planetlab2.cs.unibo.it          | GMT -1h
Site7             | planetlab1.cs.ucla.edu          | GMT +8h
Site8             | planetlab2.cs.duke.edu          | GMT +5h
Site9             | planetlab2.kaist.ac.kr          | GMT -9h
NFS server        | planetlab-1.cs.colostate.edu    | GMT +7h

TABLE II
THE PLANETLAB SITE EMULATION OF THE WIREFS CLIENT/SERVER CONFIGURATION
A. Simulation Methodology
We use the publicly available NFSv3 traces from the Harvard SOS project [8]. The Harvard traces capture up to three months of real campus NFSv3 traffic under different deployment scenarios. We choose the most diverse workload, which is a mix of research, email, and web traffic. In our simulation, two weeks of traffic traces are extracted to evaluate WireFS performance under different configurations.
The traces feature workload and operation diversity: 993 thousand distinct files, including 64 thousand directories, are monitored. During the studied two-week period, 384 million NFS RPC call/response pairs are recorded. The RPC call breakdown is presented in Table I. From Table I, we observe that 32% of these operations are LOOKUP, GETATTR, SETATTR, and other metadata operations. Therefore, WireFS focuses on minimizing the access latency of a significant portion of file system accesses.
The evolution of these access patterns over time is shown in Figure 8. We observe approximately one million file operations per hour, with the number of distinct files accessed per hour varying between one thousand and one hundred thousand. 75 distinct host IP addresses are identified from the traces and used for creating user groups.
To emulate an enterprise environment with branch offices, we partition the 75 hosts into 10 groups (sites), with the grouping following a uniform or Zipf distribution. The geographic distribution of the sites is emulated based on the Ping project traces [24]: we randomly picked 10 PlanetLab nodes scattered around the world, and emulated the wide-area network latency between them by extracting the round-trip time (RTT) information
Fig. 8. Evolution of file operations and accessed files per hour, over 2 weeks.

Fig. 9. The file operations of three sites (site0 GMT -1h, site1 GMT +8h, site3 GMT +3h) in Zipf grouping in different time zones.
from the Ping project traces.
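The host-to-site partitioning can be sketched as follows. This is an illustrative sketch under our own assumptions (Zipf exponent s = 1 and probabilistic assignment); the paper does not specify the exact grouping procedure.

```python
import random

def zipf_weights(n_sites: int, s: float = 1.0):
    """Unnormalized Zipf weights: site k gets weight 1 / k^s."""
    return [1.0 / (k ** s) for k in range(1, n_sites + 1)]

def group_hosts(hosts, n_sites: int, s: float = 1.0, seed: int = 0):
    """Assign each host to a site with probability proportional to the
    site's Zipf weight; uniform grouping would use equal weights instead."""
    rng = random.Random(seed)
    weights = zipf_weights(n_sites, s)
    groups = {k: [] for k in range(n_sites)}
    for h in hosts:
        site = rng.choices(range(n_sites), weights=weights, k=1)[0]
        groups[site].append(h)
    return groups

# Partition the 75 trace hosts into 10 sites, as in our setup.
groups = group_hosts([f"host{i}" for i in range(75)], n_sites=10)
assert sum(len(v) for v in groups.values()) == 75
```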
Table II describes our site location configuration with domains and time zones. Three sites are located on the east coast and two on the west coast of the United States, three are in Europe, one is in South America, and one is in Asia. The central file server (data center) is placed in the middle of North America.
The RTT between two sites varies from 2.4ms to 358ms, with an average value of 157ms. The time zone of each site is incorporated into our experiments by adding a time offset to the trace data originating from that site. For example, Figure 9 shows the time evolution of the file operations from three sites with Zipf-based grouping and time zone offsets.
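The time-zone adjustment can be sketched as below; the offset table is a subset of Table II, and the helper name is our own.

```python
from datetime import datetime, timedelta

# GMT offsets per site (subset of Table II); names are illustrative.
SITE_OFFSET_HOURS = {"site0": -1, "site1": +8, "site3": +3}

def localize(timestamp: datetime, site: str) -> datetime:
    """Shift a trace record's timestamp by its site's offset so that
    diurnal activity patterns differ across sites, as in Figure 9."""
    return timestamp + timedelta(hours=SITE_OFFSET_HOURS[site])

# A noon request attributed to site1 shifts to 20:00 local time.
assert localize(datetime(2003, 1, 1, 12, 0), "site1") == datetime(2003, 1, 1, 20, 0)
```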
We compare four network file systems in our simulated WAN setting. The first file system is a wide-area deployment of NFSv3, called WAN-NFS in the rest of the paper. In WAN-NFS, all client groups access files from the remote central NFS server via NFS RPC procedures. The second file system, called the DHT file system, uses a DHT-based data management scheme (like SHARK [1]) that randomly distributes file objects among the participating sites. For simplicity, in the simulations we assume a file lookup takes only a one-hop search for remote file object accesses. The third file system is called WireFS-node, where home assignment is done on individual files based on their access statistics. The fourth system is called WireFS-tree, where home assignment is done based on the greedy algorithm described in Section III-C.
In both WireFS-node and WireFS-tree, the home migration decision is recomputed every T minutes, and the number of accesses to a file f from a site x at the end of the i-th period is calculated with an EWMA counter: C_f^x(i) = α × C_f^x(i − 1) + (1 − α) × n_f^x(i), where n_f^x(i) is the total number of accesses of x on f during the i-th period and C_f^x(0) = 0. Unless explicitly stated, T = 60, α = 0.5, and a migration table of size k = 50000 are used in the following.
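The per-(site, file) EWMA counter above can be sketched as follows; the class and method names are of our own choosing.

```python
class EwmaCounter:
    """Access counter for one (site, file) pair, updated once per
    migration period T:  C(i) = alpha * C(i-1) + (1 - alpha) * n(i),
    with C(0) = 0."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha
        self.value = 0.0  # C(0) = 0

    def end_of_period(self, n_accesses: int) -> float:
        """Fold in n(i), the access count of the period that just ended."""
        self.value = self.alpha * self.value + (1 - self.alpha) * n_accesses
        return self.value

c = EwmaCounter(alpha=0.5)
c.end_of_period(10)  # C(1) = 0.5*0 + 0.5*10 = 5.0
c.end_of_period(0)   # C(2) = 0.5*5 + 0.5*0  = 2.5
assert c.value == 2.5
```

Old periods decay geometrically, so a site that stops accessing a file gradually loses its claim on the file's home.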
Fig. 10. The CDFs of average lookup latency for different systems (WAN NFS, DHT with one-hop search, WireFS-node, WireFS-tree, and LAN NFS) with Zipf-based grouping and time zone offset.

Fig. 11. The CDFs of average lookup latency for different systems with uniform-based grouping and time zone offset.
B. Results
Figure 10 shows the average meta-data lookup latency distribution for the four file systems, with host grouping based on the Zipf distribution and the time zone effect considered. The NFS lookup latency in a local area network (LAN NFS) is also included as a baseline for all the schemes.
We observe that WireFS-tree performs close to LAN NFS and outperforms the other three schemes. The latency of more than 96% of the lookups in WireFS-tree is comparable to that in LAN NFS; 92% of the lookups in WireFS-tree take less than 10ms, compared with 75% for WireFS-node, less than 15% for the DHT system, and none for WAN NFS, as all other sites are more than 10ms away from the central file server; only 2% of the operations in WireFS-tree underperform the other schemes, due to the worst-case scenario of two-hop lookups. We repeat the above simulations with host grouping based on the uniform distribution, and the result (shown in Figure 11) is similar to that of the Zipf distribution.
Fig. 12. CDF of local hit ratio for the two WireFS systems (WFS-node and WFS-tree).

Fig. 13. WireFS-tree: local access hit ratio vs. remote first-time-access file ratio, over 2 weeks.
Fig. 14. WireFS-tree: average access latency vs. first-time-access file ratio, over 2 weeks.

Fig. 15. WireFS-tree: home reassignment ratio vs. moving average remote first-time-access file ratio.
Figure 12 compares the performance of WireFS-tree and WireFS-node in terms of the distribution of local hit ratios (computed every T minutes) throughout the two weeks. We observe that WireFS-tree has a hit ratio over 95% most of the time, while WireFS-node experiences hit ratio oscillations during the experiment, with an average value of less than 90%.
The performance difference between WireFS-tree and WireFS-node is caused by the prefetching nature of subtree-based migration versus the caching nature of node-based migration. If file accesses from a site exhibit a locality pattern within the directory tree hierarchy, prefetching avoids "cold" misses due to first-time accesses; our experimental results clearly validate this assumption.
Figure 13 shows the time evolution of local hit ratios in WireFS-tree. The aperiodic deterioration of hit ratios is explained by the spikes of remote first-time-access file ratios², which are also shown in Figure 13.
Figure 14 presents the time evolution of the average lookup latency in WireFS-tree over the two-week period, together with the first-time-access file ratio. We observe that the latency spikes are consistent with the spikes of the first-time-access file ratio.
The effect of home migration is demonstrated by the immediate decline after each latency spike in Figure 14. The drop in access latency shows that home migration reduces wide-area accesses adaptively and quickly. Over the first 50 hours, most of the files are accessed for the first time by remote sites, which makes the average lookup latency oscillate dramatically. After this time, the latency stabilizes until another first-time-access spike changes the pattern.
Figure 15 presents the time evolution of the home reassignment ratio in the WireFS-tree system. The home reassignment ratio is defined as the percentage of meta-data files whose home nodes change. This ratio is used as a metric for
²Remote first-time-access file ratio is defined as the percentage of the files accessed by a remote group for the first time out of all files accessed during a time period.
Lookup Latency for 5 uniform groups from different time zones
1
0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
CDF
CDF
Lookup Latency for 5 Zipf groups from different time zones
1
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
site0
site1
site3
site5
site9
All_sites
0.1
0
0.1
1
10
100
site0
site1
site3
site5
site9
All_sites
0.1
0
1000
0.1
Lookup Latency (ms)
1
10
100
1000
Lookup Latency (ms)
Fig. 16. Average lookup latency CDF of 5 sites with Zipf-based
grouping and time zone offset.
Fig. 17. Average lookup latency CDF of 5 sites with uniformbased grouping and time zone offset.
home migration overhead, as each reassignment requires computation. The maximum of 2.5% and average of 0.1% home reassignment ratio demonstrate the marginal overhead home migration incurs after the initial few delegations.
Moving average (3 rounds) remote first-time-access file ratios³ are also shown in Figure 15 to illustrate the main cause of home reassignments.
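The home reassignment ratio can be computed per migration round as in the sketch below; representing home assignments as a file-to-site dictionary is an assumption of ours.

```python
def reassignment_ratio(prev_home: dict, new_home: dict) -> float:
    """Fraction of tracked files whose home node changed between two
    consecutive migration rounds (0 if nothing is tracked)."""
    if not new_home:
        return 0.0
    changed = sum(1 for f, h in new_home.items() if prev_home.get(f) != h)
    return changed / len(new_home)

# One of four tracked files moved homes -> ratio 0.25.
prev = {"/a": "site0", "/b": "site1", "/c": "site2", "/d": "site3"}
new  = {"/a": "site0", "/b": "site5", "/c": "site2", "/d": "site3"}
assert reassignment_ratio(prev, new) == 0.25
```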
Figures 16 and 17 present the average lookup latency distributions of five individual sites in WireFS-tree with Zipf- and uniform-based grouping. For comparison, the average lookup latency distribution of the whole system is also shown. We observe that in both scenarios the large sites (sites 0, 1, and 5 in Figure 16; sites 0, 3, and 5 in Figure 17) account for the majority of the lookup operations and achieve good performance, which also results in good average performance for the whole system. However, there is at least one site (site 9 in both figures) whose performance is much worse than the rest of the system. It turns out that this site has a small number of hosts and therefore performs a small number of file operations; that is, this small branch office suffers in the competition with large branches for resources. While the cause is intuitive, addressing this problem is our immediate future work.
We ran the simulations for a set of grouping results and location mappings, with different values of α (0.25, 0.5, 0.75), varying migration frequencies T (15, 30, 60), and different migration table sizes (5K, 15K, 50K). All simulation results showed qualitatively similar performance for the WireFS system.
VI. RELATED WORK
Network file systems have been studied in the local area with stateless [3] and stateful servers [11], [2], [19],
[9]. Satyanarayanan presents an overview of several traditional distributed file systems [22]. Recently, there has
³Moving average (m rounds) remote first-time-access file ratio is the average value of the remote first-time-access file ratio over m consecutive rounds. As we use an EWMA counter to record historical access information, remote accesses in the current round might not immediately affect the home assignment decision. Therefore, we pick m = 3 in Figure 15 to better reflect the reason behind the home reassignment evolution.
been significant research activity in providing data access (object or file system based) over the WAN. Multiple
peer-to-peer architectures for decentralized data management have been proposed [7], [21], [14], [18]. However,
the goal of such systems is to store large quantities of data, dispersed and replicated across multiple clients to
improve fault resilience and reduce management overheads. In contrast, WireFS tries to improve the performance of existing network file systems for interactive workloads. While WireFS supports large data storage, replication, and disconnected operation, these characteristics are not its primary concern.
Independently, improving the performance of large file downloads in overlay networks has been studied [12],
[4], [5]. These systems target client downloads of whole data objects, such as movies and software distributions, from one or more publishers. They do not maintain object hierarchies like directories, and do not consider modifications
to objects. In WireFS, we target an entirely different workload. Table I shows the distribution of the different NFS
RPC calls in a trace collected at Harvard University [8]. From the distribution of the RPC calls, it is clear that
a significant portion of the network communication is due to the lookups and other metadata traffic. In a WAN
environment, such communication imposes a significant overhead on the performance of the file system. Previous
efforts to provide wide area file system access optimize mainly for the bandwidth. Reducing the latency of these
metadata transfers is a primary design goal of WireFS in addition to providing high-bandwidth parallel downloads.
VII. CONCLUSIONS
In this paper, we presented home migration, a technique to minimize meta-data access latency in wide-area file systems. We first described the design of WireFS, a wide-area networked file system. Next, a set of algorithms for home assignment and migration was proposed in the context of WireFS to improve the performance of metadata accesses. Through trace-driven simulations, we demonstrated that our technique improves the latency of metadata operations with low management and network overheads.
VIII. ACKNOWLEDGEMENTS
We thank the Harvard University SOS project team, especially Daniel Ellard, Jonathan Ledlie, and Christopher
Stein for providing the NFS traces.
REFERENCES
[1] S. Annapureddy, M. J. Freedman, and D. Mazières. Shark: Scaling File Servers via Cooperative Caching. In Proc. of the 2nd USENIX Symposium on Networked Systems Design and Implementation (NSDI '05), Boston, MA, May 2005.
[2] A. D. Birrell, A. Hisgen, C. Jerian, T. Mann, and G. Swart. The Echo Distributed File System. Technical Report 111, Digital Equipment
Corporation, Systems Research Center, Palo Alto, CA, USA, October 1993.
[3] B. Callaghan, B. Pawlowski, and P. Staubach. NFS Version 3 Protocol Specification, RFC 1813. IETF, Network Working Group, June
1995.
[4] M. Castro et al. SplitStream: High-Bandwidth Multicast in Cooperative Environments. In Proceedings of the 19th ACM Symposium
on Operating Systems Principles, pages 298–313, October 2003.
[5] B. Cohen. Incentives Build Robustness in BitTorrent. http://bittorrent.com/bittorrentecon.pdf, May 2003.
[6] Microsoft Corporation. CIFS: Common Internet File System. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cifs/protocol/portalcifs.asp.
[7] F. Dabek et al. Wide-area cooperative storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles
(SOSP ’01), Chateau Lake Louise, Banff, Canada, October 2001.
[8] D. Ellard and M. Seltzer. New NFS Tracing Tools and Techniques for System Analysis. In LISA ’03: Proceedings of the 17th USENIX
conference on System administration, pages 73–86, Berkeley, CA, USA, 2003. USENIX Association.
[9] J. H. Hartman and J. K. Ousterhout. The Zebra Striped Network File System. ACM Transactions on Computer Systems., 13(3):274–310,
1995.
[10] D. S. Parker, Jr., G. J. Popek, G. Rudisin, A. Stoughton, B. J. Walker, E. Walton, J. M. Chow, D. A. Edwards, S. Kiser, and C. S. Kline. Detection of mutual inconsistency in distributed systems. IEEE Trans. Software Eng., 9(3):240–247, 1983.
[11] J. Kistler and M. Satyanarayanan. Disconnected Operation in the Coda File System. ACM Transactions on Computer Systems,
10(1):3–25, Feb 1992.
[12] D. Kostic et al. Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In Proceedings of the 19th ACM Symposium on
Operating Systems Principles, pages 282–297, October 2003.
[13] P. Krishnan, D. Raz, and Y. Shavitt. The cache location problem. IEEE/ACM Transactions on Networking, 8(5):568–582, 2000.
[14] J. Kubiatowicz et al. OceanStore: an Architecture for Global-Scale Persistent Storage. In Proceedings of the 9th International Conference
on Architectural Support for Programming Languages and Operating Systems, pages 190–201, 2000.
[15] D. Mazieres. A toolkit for user-level file systems. In USENIX Technical Conference, Boston, MA, June 2001.
[16] J. Morris, M. Satyanarayanan, M. Conner, J. Howard, D. Rosenthal, and F. Smith. Andrew: A distributed personal computing environment. Commun. ACM, 29(3):184–201, Mar. 1986.
[17] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In SOSP ’01: Proceedings of the eighteenth
ACM symposium on Operating systems principles, pages 174–187, 2001.
[18] A. Muthitacharoen et al. Ivy: A read/write peer-to-peer file system. In Proceedings of the 5th USENIX Symposium on Operating
Systems Design and Implementation (OSDI ’02), Boston, Massachusetts, December 2002.
[19] M. N. Nelson, B. B. Welch, and J. K. Ousterhout. Caching in the sprite network file system. ACM Trans. Comput. Syst., 6(1):134–154,
1988.
[20] Riverbed Technology, Inc. RiOS: Riverbed Optimization System. http://www.riverbed.com/docs/TechOverview-Riverbed-RiOS.pdf, 2006.
[21] A. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In
Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages 188–201, 2001.
[22] M. Satyanarayanan. A survey of distributed file systems. Technical Report CMU-CS-89-116, Carnegie Mellon University, Pittsburgh,
Pennsylvania, 1989.
[23] R. Shah, Z. Ramzan, R. Jain, R. Dendukuri, and F. Anjum. Efficient dissemination of personalized information using content-based
multicast. IEEE Transactions on Mobile Computing, 3(4):394–408, 2004.
[24] J. Stribling. All-Pairs-Pings for PlanetLab. http://pdos.csail.mit.edu/~strib/pl_app/.
[25] Tacit Networks, Inc. Tacit Networks iShared datasheet. http://www.tacitnetworks.com/docs/Datasheet-I-shared-Products.pdf, 2006.
[26] A. Tamir. An O(pn^2) algorithm for the p-median and related problems on tree graphs. Operations Research Letters, 19:59–64, 1996.