Minimizing Metadata Access Latency in Wide Area
Networked File Systems
Jian Liang†
Aniruddha Bohra
Hui Zhang
Samrat Ganguly
Rauf Izmailov
NEC Laboratories, Princeton, NJ, 08540
†jliang@cis.poly.edu, {bohra,huizhang,samrat,rauf}@nec-labs.com
Abstract
Traditional network file systems, like NFS, do not extend to the wide area due to low bandwidth, high network
latency, and the dynamism of the WAN environment. Metadata access latency is a significant performance
problem for Wide Area File Systems, since metadata requests constitute a large portion of all file system requests,
are synchronous, and cannot be cached at clients.
We present WireFS, a Wide Area File System, which enables delegation of metadata management to nodes at
client sites (homes). The home of a file stores the most recent copy of the file, serializes all updates, and streams
updates to the central file server. WireFS uses access history to migrate the home of a file to the client site which
accesses the file most frequently.
We formulate the home migration problem as an integer programming problem, and present two algorithms:
a dynamic programming approach to find the optimal solution, and a greedy algorithm which is non-optimal but
is faster than the optimal algorithm. We show through extensive simulations that even in the WAN setting, access
latency over WireFS is comparable to NFS’s performance in the LAN setting; the migration overhead is also marginal
after the initial delegation.
Keywords: Network file systems, WAN, data management, algorithms, dynamic programming
I. INTRODUCTION
With economic globalization, more and more enterprises have multiple satellite offices around the world. These
locations span multiple timezones and range from small offices of fewer than twenty users to large facilities of
several thousand users. For these enterprises, ease of user and data management including backups and fault tolerance,
legal requirements to record and report stored data over a number of years, and economic benefits of reducing
infrastructure and support staff costs at satellite locations have led to a move towards resource consolidation and
centralized data management. In such scenarios, network file systems provide a familiar interface for data access
and are used extensively.
Fig. 1. WireFS Architecture. [Figure: two client sites, each with its own LAN, clients, and a Redirector, connected over the wide area network to the server site hosting the Manager and the file server.]
Traditionally, network file systems have been designed for local area networks, where bandwidth is ample
and latencies are low. Common networked file systems like NFS [3] and CIFS [6] transfer large amounts of data
frequently. All writes are transmitted to the server and require synchronous updates to the files there. Apart from
wasting bandwidth, typical networked file systems require multiple round trips to complete a single file operation.
The metadata requests are synchronous and the client cannot proceed without receiving server response. The high
latency of the round-trips over the WAN and the chatty nature of the protocols make file access slow and unreliable.
Finally, relying on a central server over the wide area network makes the file system susceptible to significant
slowdowns due to unpredictable network delays and outages.
To improve network bandwidth utilization and to hide wide area latencies, Wide Area File Systems (WAFS) have
been developed [17], [1], [25], [20]. These file systems reduce bandwidth utilization by (i) aggregating file system
operations to eliminate redundant operations and to reduce bandwidth requirements, and (ii) using content based
persistent caching to eliminate duplicate block transfers and to enable caching across files.
Unfortunately, current Wide Area File Systems ignore the file system access patterns, and are oblivious to the
characteristics of the underlying network. These systems either take the client-centric view, where each file system
client maintains the content cache and there is no sharing across clients even at the same location, or the site-centric view, where a group of clients is organized as an island and sharing is enabled across clients at the same site.
Site-centric WAFS deploy an appliance at each client site (redirector), which acts as a file server for the clients at
the site and as the WAFS client. A server side appliance (manager) acts as the WAFS server and as a client of the
file server.
An enterprise which deploys existing WAFS has no way to take advantage of temporal locality across sites, e.g.,
different timezones and access patterns, or network diversity which arises due to distinct network paths between sites
and data centers. Recently developed file systems [1] allow sharing data across client sites. However, the sharing
is limited to data and the goal is to eliminate the bottleneck at the central server. These systems are designed for
a read dominated workload, and clients must exhibit significant sharing for these systems to take advantage of
network diversity.
In this paper, we present WireFS, a wide area file system which takes an organization-centric view, enables data
and meta-data sharing across multiple client sites, and minimizes metadata access latency in this system. WireFS
takes advantage of temporal locality in file access, and allows data and metadata sharing across client sites. Figure 1
shows the WireFS architecture. Similar to site-centric WAFS, WireFS uses Redirectors (WFSRs), which act as file
servers for all clients that belong to a site (island). These redirectors act as WAFS clients and communicate with a
server side Manager (WFSM), which acts as a WAFS server. WFSM appears as the only client to the central file
server which is the final authority on file system contents. In WireFS, Redirectors communicate not only with the
Manager, but also with other Redirectors to allow data sharing, and cooperative metadata management.
WireFS presents a number of design challenges. First, the system must maintain the file system interface while
distributing the metadata management across WFSRs. Second, delegating metadata management to individual
WFSRs must not lead to inconsistent file system state. Third, the system must be fault tolerant and a failure
must not lead to loss of service or inconsistency. Finally, WireFS must minimize the file system operation latency
while distributing the metadata management.
WireFS overcomes these challenges by using a home based approach. Each file is assigned a home server, WFSR
or WFSM, which controls access and serializes updates to the file. The most recent copy of a file is cached at
its home server. The home maintains a single serialization point for all updates and therefore provides semantics and
consistency guarantees similar to those of a centralized file server. Fault tolerance is achieved by maintaining a primary and
a secondary home which maintain identical state. The home is not statically assigned and can be migrated closer
to the clients accessing the file most frequently.
In this paper, we address the problem of home migration based on file system access history. Intuitively, a file
that is accessed frequently by a client is moved closer to it. This is achieved by assigning the home of the file to the
WFSR at the client site. Since the number of files in a modern file system is large, assigning homes to individual
files is both inefficient and infeasible due to the overwhelming maintenance and lookup overhead. Instead, we
decompose the file system namespace into a number of sub-trees and assign homes to these sub-trees.
We formulate the problem of tree decomposition and home assignment to redirectors as an integer programming
problem. We propose a dynamic programming algorithm to find the optimal solution in polynomial time. We also
present a greedy algorithm as a heuristic which works much faster than the dynamic programming algorithm.
We evaluate WireFS using a trace driven simulator. We use publicly available NFSv3 traces [8] and network
measurement data [24] as inputs to the simulator. We partition the traces into clusters of hosts identified from the
traces. We place these clusters across timezones and transform the requests in the trace to take the corresponding
timezone into account. By introducing additional network latencies for messages exchanged between hosts, we
simulate the effect of network characteristics on WireFS.
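The timezone transformation described above can be sketched as follows; the tuple layout of trace requests and the per-site offset map are illustrative assumptions, not the simulator's actual format.

```python
def shift_trace(requests, site_tz_offset_hours):
    """Shift per-site request timestamps by the site's timezone offset to
    emulate geographic distribution, then re-serialize the merged trace.
    requests: iterable of (timestamp_seconds, site_id, operation) tuples
    (an assumed layout); site_tz_offset_hours: site_id -> offset in hours."""
    return sorted(
        (ts + site_tz_offset_hours[site] * 3600, site, op)
        for ts, site, op in requests)
```

Shifting each site's requests by its offset interleaves its activity peaks with those of sites placed in other timezones, which is what exposes the temporal locality across sites that WireFS exploits.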
In the simulated 75-host 10-site wide-area system with two-week data traces, WireFS showed superior access
latency performance: up to 92% of the meta-data lookups took less than 10 ms and the average access latency per
lookup was around 23 ms, compared to the average node-pair round-trip time of 157 ms introduced by the WAN
setting. Moreover, the maximum 2.5% and average 0.1% home reassignment ratio (the percentage of the meta-data
files whose home nodes change per round of assignment computation every 30 minutes) demonstrated the marginal
overhead home migration incurred.
The rest of the paper is organized as follows. Section II presents the WireFS architecture design briefly and then
gives the details on its meta-data layer. The problem definition and algorithms for home migration are described in
Section III. Section IV discusses the implementation and Section V presents the experimental setup and evaluation
results. Section VI describes the related work and Section VII concludes the paper.
II. WIREFS
WireFS is a wide area file system which enables delegation of metadata management, and uses content caching
and duplicate elimination to reduce redundant data block transfers. WireFS has two primary components: WireFS
redirectors (WFSRs), which are deployed at each client site, act as WFS clients, and export the file system interface
to the collocated clients; and the WireFS manager (WFSM), which maintains the global namespace view, coordinates
WFSRs, and communicates with the file server as a file system client.
The WireFS architecture has two logical components which capture the typical behavior of network file systems:
(i) the Data Access Layer (DAL) and (ii) the Metadata Layer (MDL).
The MDL is composed of a set of WireFS redirectors that serve all meta-data requests, including file and directory
lookup, creation and deletion of files, and updates to file or directory metadata, e.g., access time updates. In
addition to the traditional file system functionality, the MDL also maintains and communicates the location of the
data blocks in the system. Note that the data chunks can be located in the content cache of one or more WFSRs.
The primary goal of MDL is to reduce the latency of the above operations in WireFS.
The DAL enables fast transfer of data across the wide area. The transfer may include original file/directory
contents, update logs, and the updated data blocks being transferred to the file server. The primary goal of the
DAL is to reduce the volume of data block transfers across the wide area network. Several techniques, for example
aggressive prefetching, large persistent caches, duplicate elimination using a summary of the file data, and sharing
Fig. 2. WireFS Redirector (WFSR) architecture. [Figure: the WFSR stacks a file system interface over the WireFS client, which uses a content cache, update logs (e.g., LOG(A) for /a, a chunk list for /a/b), and a migration table.]
data chunks across files, etc., previously proposed for WAFS, are used by WireFS to implement the DAL. In addition
to the above, DAL performs coordinated data dissemination to create warm caches and uses cooperative caching
among WFSRs which reduces server load and takes advantage of network diversity.
In this paper, we focus on the design of algorithms for MDL to minimize the latency of metadata operations. In
the following, we describe the architectural components of WireFS and the design of the meta data layer in detail.
A. WireFS Redirector
A WireFS redirector is deployed at each client site and has three main functions, (i) to export a file system
interface to the clients at the site, (ii) to maintain a content addressable cache and communicate with other WFSRs
or the WFSM to perform data transfers, and (iii) to maintain operation logs, perform serialization of updates, and
handle metadata queries for files for which it is the designated home. Figure 2 shows the architecture of a WireFS redirector.
The file system interface (FSI) exported by the WFSR enables clients to communicate using an unmodified file
system protocol. This interface translates client requests to WFS requests. On receiving the corresponding WFS
reply, the FSI constructs the response to the original client request and sends it to the client. A pending request
map is maintained by the FSI to match the responses to the corresponding requests. WireFS can support multiple
file system protocols by defining the appropriate FSI.
Each WFSR maintains a large persistent content-cache that stores files as chunks indexed by content hashes,
which can be used across files. Chunks are non-overlapping segments of file data whose extents are determined by
content boundaries (breakpoints) using a fingerprinting technique, and are indexed by the SHA-1 collision resistant
hash of the contents. WireFS associates a sequence of chunk indices in the file metadata which augments the default
file information, e.g. access times, permissions, access control lists, etc.
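As a concrete illustration of the chunking scheme, the sketch below splits a byte stream at content-defined breakpoints and indexes each chunk by its SHA-1 hash. The rolling-style hash, window size, and breakpoint mask are simplified stand-ins for the fingerprinting technique (LBFS uses Rabin fingerprints); they are illustrative, not WireFS's actual parameters.

```python
import hashlib

WINDOW = 48          # minimum bytes before a breakpoint may fire (illustrative)
BREAK_MASK = 0x1FFF  # expected chunk size around 8 KB (illustrative)

def chunk(data: bytes):
    """Split data at content-defined breakpoints; index chunks by SHA-1."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        # Toy hash over the bytes since the last breakpoint; a real
        # implementation would use a true sliding-window Rabin fingerprint.
        h = (h * 31 + b) & 0xFFFFFFFF
        if i - start + 1 >= WINDOW and (h & BREAK_MASK) == BREAK_MASK:
            piece = data[start:i + 1]
            chunks.append((hashlib.sha1(piece).hexdigest(), piece))
            start, h = i + 1, 0
    if start < len(data):
        piece = data[start:]
        chunks.append((hashlib.sha1(piece).hexdigest(), piece))
    return chunks
```

Because breakpoints depend only on content, an edit early in a file perturbs the hashes of only the chunks it touches, so unchanged chunks are still found in the content cache, even across different files.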
B. WireFS Manager
The WireFS manager is deployed at the server site and has a specialized role in the WireFS protocol. It communicates
directly with the server and maintains a global view of the file system namespace. It also assigns and maintains the
WireFS specific attributes of files like the home node, ownership information, generation numbers etc. The WFSM
is the home node for all files until it delegates this responsibility to a WFSR. The WFSM is also responsible for the
coordinated dissemination of commonly accessed files to multiple WFSRs to warm up the WFSR caches. Finally,
the WireFS manager periodically reorganizes the file system namespace by reassigning homes of files according to
the access history statistics.
C. WireFS Home
Each file in the file system namespace is assigned a home. The home is responsible for maintaining file consistency,
provides a serialization point for all updates, and performs aggregation on file system operations to reduce network
round trips. Homes maintain not only the update logs and serialization, but also maintain the latest version of the
file metadata including access times, permissions, size, chunk indices, etc.
Each WFSR and the WFSM maintain a migration table, which contains a view of the file system namespace,
statistics and access history, and per-file WireFS metadata. An entry in the migration table is indexed by the file
identifier, and contains either the home node identifier, or the WireFS metadata for the file. WireFS metadata
contains attributes defined in the file system, and a list of chunk indices, update logs, etc. The migration table is
updated locally, on each operation to maintain statistics and access history, and remotely, by the WFSM.
On receiving a client request for metadata, for example file lookup, or data, for example read or write, the WFSR
identifies the home of the file using the migration table and forwards the request to the home. The home provides
the information and maintains a timestamped record of all updates as update logs. The home node aggregates
updates, eliminates duplicate or redundant updates, and streams the update logs to the file server.
D. Update Logs
The home node serializes updates to a file. Updates performed by the client are forwarded to the home by the
WFSRs. A WireFS home maintains a timestamped record of all updates it receives. The logs are used to reconcile
conflicting updates to reconstruct the most recent version of the file.
WireFS maintains an ordered sequence of logs using version vectors [10], [18]. A version vector includes the
timestamp at the home node and a global, monotonically increasing sequence number. The sequence number is
incremented when an update entry is appended to the log. Logs are ordered using these version vectors.
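A minimal sketch of this ordering, with assumed field names for the version vector:

```python
from dataclasses import dataclass

@dataclass(frozen=True, order=True)
class LogVersion:
    """Version vector of an update-log entry: the timestamp at the home node
    plus a global, monotonically increasing sequence number (assumed names)."""
    timestamp: float
    seqno: int

def order_logs(entries):
    """Order (LogVersion, update) pairs; sorting by version vector
    reconstructs the update stream the home node serialized."""
    return sorted(entries, key=lambda e: e[0])
```

Since `order=True` compares fields lexicographically, entries sort by home-node timestamp first and sequence number second, matching the serialization order at the home.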
Update logs allow each WFSR to acquire leases on files, apply all updates locally to avoid unnecessary network
communication, and forward the local update log to the home node on callbacks. A version number is provided to
each WFSR when the lease is acquired.
The use of update logs allows WFSRs to function and provide the file system functionality to clients in the event
of disconnection or failure. Unfortunately, this comes at the cost of weakened consistency semantics. WireFS
maintains a close-to-open consistency semantics similar to Andrew [16] when the WFSRs are well connected. On
disconnection, multiple versions of the file due to independent update logs are merged using an algorithm similar
to rsync. However, in some cases, WireFS requires the administrator or users to manually decide on the correct
version of the file which is then updated atomically at the server.
In all cases, the file server maintains the correct version of the file; updates applied at the server are considered
a barrier, and all logs are updated to reflect them.
E. Leases and Callbacks
A WFSR acquires a lease on the file when it performs updates. The home node performs two actions on receiving
a lease request. First, it constructs the most recent version of the file using update logs. Second, it registers a callback
which recalls the file, including any updates performed by the WFSR while it held the lease. The callbacks
are required when the home is reassigned, when another WFSR requests a lease, or when more than a threshold
amount of time has elapsed.
A lease request may not always succeed. In this case, the home node requires all WFSRs to send read as well
as update requests to it and maintains a single update log. This avoids multiple conflicting updates and provides
stronger consistency guarantees at the cost of network communication on each read or write request.
F. Example
Lookup: Clients lookup the path of the file (or directory) starting from the root of the mounted file system and
descending through the path hierarchy. The client starts the lookup from the first component of the path for which
the file attributes are invalid. In the worst case, this lookup starts at the root of the mounted volume.
The WFSR performs two operations on receiving the lookup request. First, it translates the file handle to the
server path name (on which the WireFS lookup hierarchy is constructed). Figure 3 shows the lookup operation. If
the file handle is cached and is valid (there is no pending callback), the WFSR returns it. If the cached handle is
invalid and the home of the parent directory is known, the cached entry is purged and an OPEN request is forwarded
to the home of the parent. If the parent’s home is unknown, the WFSR sends a HOME_LOOKUP request to the
WFSM and sends the OPEN request to the returned home node. The parent is guaranteed to have either the file
handle information or the location of the delegated node that has the file handle information.
The OPEN request registers a callback with the home node to invalidate the WFSR cache on an update. It also
retrieves the attributes of all children of the lookup target. Note that by using an invalidation based scheme over
Fig. 3. Timeline for the lookup operation. There are three cases: first, when the file handle is cached at the WFSR (shown at the top); second, when the home of the file is known; and third, when the home information is retrieved from the WFSM. The solid lines show local area communication while the dotted lines show the messages over the wide area. [Figure: message timelines between the Client, the WFSR, the WFSR home, and the WFS Manager: a LOOKUP answered from the cache; a LOOKUP followed by an OPEN to the home, which registers a callback and creates a dentry; and a LOOKUP followed by a HOME_LOOKUP to the WFSM and an OPEN to the returned home.]
the WAN, we significantly reduce the number of round-trips as well as guarantee consistency of the file across the
wide area. Moreover, since the number of WFSRs is limited (100s), the state maintenance overhead at the home
node is not very high. At the same time, characteristics of the file system over the LAN are preserved without
modifying the existing implementation of the protocol.
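The three lookup cases can be condensed into the following sketch; the object model (Home, Manager, Redirector) and the handle format are hypothetical stand-ins for the WFS protocol, kept only detailed enough to show when a wide-area round trip is incurred.

```python
from dataclasses import dataclass
import posixpath

@dataclass
class CacheEntry:
    handle: str
    pending_callback: bool = False

class Home:
    """Stand-in for a file's home node (a WFSR or the WFSM)."""
    def __init__(self, name):
        self.name = name
        self.callbacks = {}                      # path -> redirectors to invalidate
    def open(self, path, callback):
        self.callbacks.setdefault(path, []).append(callback)
        return f"fh:{self.name}:{path}"          # hypothetical handle format

class Manager:
    """Stand-in WFSM: authoritative, one-hop home resolution."""
    def __init__(self, homes):
        self.homes = homes                       # subtree root -> Home
    def home_lookup(self, path):
        return self.homes[path]

class Redirector:
    def __init__(self, wfsm):
        self.cache, self.migration_table, self.wfsm = {}, {}, wfsm
    def lookup(self, path):
        entry = self.cache.get(path)
        if entry and not entry.pending_callback:
            return entry.handle                  # case 1: cache hit, no WAN trip
        if entry:
            del self.cache[path]                 # purge the invalidated entry
        parent = posixpath.dirname(path)
        home = self.migration_table.get(parent)
        if home is None:                         # case 3: one extra trip to the WFSM
            home = self.wfsm.home_lookup(parent)
            self.migration_table[parent] = home
        handle = home.open(path, callback=self)  # case 2/3: OPEN registers a callback
        self.cache[path] = CacheEntry(handle)
        return handle
```

A second lookup of the same path hits case 1 and completes without any wide-area communication, which is exactly the behavior the invalidation-based scheme is meant to preserve.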
III. WIREFS META-DATA LAYER
The design of traditional network file systems like NFS assumes the clients and the server are connected over
a low latency network. This allows each file system operation to perform multiple remote procedure calls (RPCs).
While this is acceptable over a LAN, each additional round trip over the wide area network results in deteriorated
performance.
For data transfer, the additional latency can be masked by aggressive prefetching and writing back bulk data.
However, for typical meta-data operations like file lookup, open, and delete, the short RPC messages lead to
significant increase in the response time. Such large overheads subsequently affect the performance observed by
the clients as any data transfer is preceded by one or more meta-data operations. For example, before reading a file,
the client must perform a recursive directory lookup, followed by authentication and attribute checks. Therefore,
for any wide area file system, improving the performance of the meta-data operations is of utmost importance.
Recently proposed wide-area file systems rely on a central server for all meta-data operations. For a large client
population, such operations contribute towards heavy load on the server. To reduce the load on the central server,
file systems over Distributed Hash Tables (DHTs) have been proposed which do not have a central server, and the
participating nodes cooperatively provide its functionality. Unfortunately, in this model, the hierarchical structure
of the file system namespace is lost, and lookups of files and directories can take up to O(log n) round
trips (where n is the number of participating nodes), which is unacceptable over the wide area.
A. Home Migration
We use a virtual namespace tree rooted at the directory “/” to model the file organization in NFS. An NFS [3]
file lookup consists of a series of sub-lookups that traverse the directory path from the root node to the file node
on the directory tree. For example, in Figure 4, to look up the directory entry for the file “/a/x/1.txt”, the lookups
for “/”, “/a”, “/a/x”, and “/a/x/1.txt” are executed in order.
Fig. 4. A lookup for the file “/a/x/1.txt” in the directory tree. [Figure: the root “/” has children /a, /b, /c, and /d; /a has children /a/x, /a/y, and /a/z; /a/x contains /a/x/1.txt and /a/x/2.txt. Lookups 1 through 4 descend the path “/”, “/a”, “/a/x”, “/a/x/1.txt”.]
In a LAN setting, the multiple lookup round trips are invisible to end users due to the fast local transmission
speed. However, the network latency in a WAN is large enough that a file lookup can take seconds to finish, which
makes the response time intolerable during normal file operations. To alleviate this performance problem, our solution is
based on the following observation: if most of the accesses into a subtree in the directory tree come from one site
(through a WFSR), we will assign the administration privilege of this subtree to that site (WFSR). We call this
task delegation a home migration, and that WFSR the home node of this subtree. Notice that home migrations can
occur recursively in that a subtree migrated to one WFSR may have its own subtree migrated to another WFSR
node. Therefore, the directory tree is decomposed into multiple sub-trees based on access statistics, and we want
to design the assignment scheme for home migrations so that the total access latency is minimized. In addition,
to allow fast (one-hop) resolution of home nodes, we will maintain a migration table at WFSM, the central server
side, which keeps one pointer (the address of the home node) for each distinct migrated sub-tree. Figure 5 shows
one example for home migration.
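One-hop resolution against the migration table amounts to a longest-prefix match on the path: walk up the directory path until a migrated subtree root is found, falling back to the WFSM otherwise. The sketch below assumes absolute POSIX-style paths; the example mapping in the test is illustrative, in the spirit of Figure 5, not its exact contents.

```python
import posixpath

def resolve_home(migration_table, path, default_home="R_0"):
    """Return the home node of path's closest migrated ancestor; paths under
    no migrated subtree fall back to the WFSM (here named R_0)."""
    node = path
    while True:
        if node in migration_table:
            return migration_table[node]
        parent = posixpath.dirname(node)
        if parent == node:          # reached "/" without a match
            return default_home
        node = parent
```

For example, with /a homed at R_2 and /a/x at R_3, a lookup of /a/x/1.txt resolves to R_3 while /a/y resolves to R_2, so recursive migrations are handled by whichever subtree root is closest.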
Fig. 5. The home migration of a directory tree and the corresponding migration table. [Figure: subtrees /a, /b, /c, and /a/x of the directory tree are migrated to redirectors R_1, R_2, and R_3; the migration table at the WFSM keeps one pointer per migrated subtree, mapping each subtree root to its home node.]
Formally, we label the WFSM as R_0, the n WFSRs as R_1, R_2, \ldots, R_n, and the network latency (RTT) between
R_i and R_j as L_{R_i R_j}. When a file lookup from R_i traverses a directory node D_x (1 \le x \le m, where m is the
number of directory nodes), we call it one access of R_i on D_x. For each node D_x in the directory tree, a stack of
registers \{C_{D_x R_i}, i \in [0, n]\} records the expected accesses of each WFSR on D_x during the next time period T.¹
Now we formulate access latency optimization as an integer programming problem:

  \min \sum_{x=1}^{m} \sum_{i=0}^{n} I_{D_x R_i} \Big( \sum_{j=0}^{n} C_{D_x R_j} L_{R_j R_i} + M_{D_x R_i} \Big)    (1)

subject to

  I_{D_x R_i} \in \{0, 1\}, \qquad \sum_{i=0}^{n} I_{D_x R_i} = 1,

where I_{D_x R_i} = 1 if the subtree rooted at D_x will be migrated to R_i, and 0 otherwise. The term
I_{D_x R_i} (\sum_{j=0}^{n} C_{D_x R_j} L_{R_j R_i}) is the total access cost on the directory node D_x if we migrate the
subtree rooted at it to the home node R_i, and M_{D_x R_i} is the transfer cost of migrating D_x from its current home
node to R_i.
When there is no migration table size constraint, the optimal solution can be found by deciding the best home
node for each directory node individually. Next, we present the algorithm to compute the optimal solution when
the migration table size is constrained.
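In the unconstrained case, each node's term in (1) can be minimized independently. A small sketch, with array-based stand-ins for the C, L, and M quantities; treating the migration cost as zero when the node stays at its current home is an assumption the formulation does not spell out.

```python
def best_home(access_counts, latency, migration_cost, current_home):
    """Choose the home R_i minimizing sum_j C_j * L[j][i] + M_i for a single
    directory node. access_counts[j] ~ C_{Dx Rj}; latency[j][i] ~ L_{Rj Ri};
    migration_cost[i] ~ M_{Dx Ri} (taken as zero when staying put)."""
    n = len(latency)
    def cost(i):
        move = 0 if i == current_home else migration_cost[i]
        return sum(access_counts[j] * latency[j][i] for j in range(n)) + move
    return min(range(n), key=cost)
```

Intuitively, a node accessed far more often from one redirector migrates there as soon as the saved access latency outweighs the transfer cost.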
B. Optimal Solution under Constrained Migration Table
Let P_max (smaller than the directory size) be the maximal number of pointers that the migration table can contain.
Deciding the P_max distinct subtrees is similar to many cache or filter placement problems in the literature [13], [23].
To find the optimal solution in a bounded-degree directory tree, we can solve the following problem using dynamic
programming.

¹In the experiments we use an exponentially weighted moving average (EWMA) counter to approximate the access register based on past
historical information.
(i.) Let access(D_x, k, H_p(D_x)) be the optimal access cost for the directory (sub)tree rooted at D_x, given that
there are k pointers left for this subtree and that the home node of the parent of D_x is H_p(D_x). We start
with access(“/”, P_max, R_0) on the root node and enumerate the rest of the nodes following breadth-first
search.
(ii.) At each directory node D_x, the optimal assignment is decided as follows:
• If k = 0, all nodes in the subtree are assigned to H_p(D_x) and
  access(D_x, k, H_p(D_x)) = \sum_{z \in subtree(D_x)} \sum_{j=0}^{n} ( C_{D_z R_j} L_{R_j R_{H_p(D_x)}} + W_{D_z R_{H_p(D_x)}} ).
• Otherwise, access(D_x, k, H_p(D_x)) is the minimum of the following two quantities:
  - over every y \ne H_p(D_x) and all possible allocation schemes (z, A_z) of k-1 pointers on the children of D_x:
    \sum_{j=0}^{n} ( C_{D_x R_j} L_{R_j R_y} + W_{D_x R_y} ) + \sum_{z:\ child\ of\ D_x} access(z, A_z, y);
  - over all possible allocation schemes (z, A_z) of k pointers on the children of D_x:
    \sum_{j=0}^{n} ( C_{D_x R_j} L_{R_j R_{H_p(D_x)}} + W_{D_x R_{H_p(D_x)}} ) + \sum_{z:\ child\ of\ D_x} access(z, A_z, H_p(D_x)).
Next we present the analysis result on the dynamic programming algorithm.
Theorem 1: The dynamic programming algorithm finds the optimal solution in O(P_max^D m^2 n) time, where D is
the maximal degree in the directory tree.
Proof: The analysis is similar to that for the k-median problem on trees [26] and is omitted.
C. A Greedy Algorithm under Constrained Migration Table
While we can find the optimal solution in polynomial time, the likely enormous directory tree size m and the large
degree bound D make it desirable to find a good solution as quickly as possible. We observe that, in the file directory
tree, the nodes close to the root receive more lookup requests (and thus likely incur higher access cost) than the
nodes close to the leaves do. Therefore, when deciding home migration we can proceed top-down, starting from
the nodes at the top of the directory tree. From a set of candidate nodes, we first pick the node whose subtree
has the most access requests (from all users) for the home migration process. The following describes the greedy
algorithm based on these ideas:
(i.) Initially, for each directory node D_x, we count C^{subtree}_{D_x R_i}, the total number of lookup requests falling into
the subtree rooted at D_x from each WFSR R_i; we label the home node for D_x as H(D_x), where
H(D_x) = \arg\min_{i=0,\ldots,n} \sum_{j=0}^{n} C^{subtree}_{D_x R_j} L_{R_j R_i}; lastly, we assign D_x the weight
W(D_x) = \sum_{i=0}^{n} C^{subtree}_{D_x R_i}.
(ii.) The migration table is initialized with one entry, which records that the home node for the directory root
node is the WFSM (R_0).
(iii.) The children of the directory root node are put into an ordered linked list in descending order of the
weight W(D_x). For two nodes with the same weight, the tie is broken by giving the node with the smaller
subtree the higher position.
(iv.) We repeat the following operation until either all k migration table entries are filled up or the ordered list is
empty:
• Remove the head node D_x from the ordered list and insert its children into the ordered list. D_x is
put into the migration table and assigned the home H(D_x) if its closest ancestor node in the migration
table is not assigned the same home as H(D_x); otherwise it is not put into the migration table.
(v.) Lastly, for any node Dy not in the migration table, its assigned home node assigned(Dy ) is the same as the
home node assigned to its closest ancestor node in the migration table.
The greedy algorithm gives priority to the nodes whose subtrees incur more access requests (and thus likely more
access cost) than the other nodes’. The operations in step (i.) omit the file transfer cost for simplicity; the operations
in step (iv.) remove an unnecessary (redundant) migration table entry for a child node in the tree when its parent
node will be migrated to the same home.
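Steps (i.) through (v.) can be condensed into the sketch below; the tree encoding, the precomputed subtree counters taken as input, and the omission of the secondary tie-break on subtree size are simplifying assumptions.

```python
import heapq

def greedy_homes(tree, counts, latency, k):
    """Greedy home assignment (steps i-v), with illustrative structures.
    tree: {node: [children]} rooted at "/"; counts[node][i] = lookups on the
    subtree rooted at node from redirector R_i (the C^subtree counters,
    assumed precomputed); latency[j][i] = RTT between R_j and R_i;
    k = migration table size P_max. Index 0 is the WFSM (R_0)."""
    n = len(latency)
    def best_home(node):            # step (i): H(Dx) = argmin_i sum_j C_j L_ji
        return min(range(n),
                   key=lambda i: sum(counts[node][j] * latency[j][i]
                                     for j in range(n)))
    def weight(node):               # step (i): W(Dx) = sum_i C^subtree
        return sum(counts[node])
    table = {"/": 0}                # step (ii): the root is homed at the WFSM
    # step (iii): max-heap on weight (negated for heapq's min-heap)
    heap = [(-weight(c), c) for c in tree.get("/", [])]
    heapq.heapify(heap)
    def ancestor_home(node):        # closest ancestor present in the table
        while node not in table:
            node = node.rsplit("/", 1)[0] or "/"
        return table[node]
    while heap and len(table) < k:  # step (iv)
        _, node = heapq.heappop(heap)
        for c in tree.get(node, []):
            heapq.heappush(heap, (-weight(c), c))
        h = best_home(node)
        if ancestor_home(node) != h:    # skip redundant entries
            table[node] = h
    return table
```

Nodes absent from the returned table inherit the home of their closest ancestor in it, as in step (v.), so the table stays within its P_max pointer budget.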
Next we present the analysis result on the greedy algorithm.
Theorem 2: The greedy algorithm finds an assignment scheme in O(m log m + P_max m) time.
Proof: Step (i.) of the algorithm can be finished in one tree traversal using depth-first search, which takes O(m)
time. Operations related to the ordered list take O(m log m) time. For each new node to be put into the migration
table, checking its ancestor nodes takes O(P_max) time, and at most m nodes will be tried as new nodes for the
migration table.
Later in Section V we show this greedy algorithm works well in practice.
IV. IMPLEMENTATION
WireFS is implemented by extending the Low Bandwidth File System (LBFS) [17]. LBFS provides content
hashing, file system indexing, and chunk storage and retrieval. WireFS extends the LBFS implementation by
including the WFSR update logs. Unlike the default LBFS, WireFS uses a modified NFS implementation which
sends the file system requests to the LBFS client at each WFSR. At the WFSM, the LBFS server sits in front of
the NFS server and is unmodified.
Fig. 6. WireFS implementation. [Figure: a WFSR running an LBFS client with an update log and content cache communicates over RPC/TCP with the WFSM, which runs an LBFS server with its own content cache and the home reassignment module.]
Fig. 7. Directory and update log entries in WFSR. [Figure: the dentry and update-log entry structures; fields include the name, FS attributes, parent home, object home, object owner, callback list, update log, chunk list, update time, generation number, and dentry/fentry pointers.]
Figure 6 shows the WireFS implementation. In addition to the default LBFS, WireFS includes additional functionality for home migration and maintaining update logs. These are implemented as extensions to LBFS and use the
SFS toolkit [15] to provide the asynchronous programming interface. Finally, the interaction between the WFSRs is
independent of the LBFS protocol. WireFS receives all NFS requests from the clients, and uses the WFS protocol
to identify the home node. The requests are passed on to LBFS client at the home node which in-turn uses the
content cache and the LBFS server to service the requests.
WireFS associates additional metadata with each file system object. It is important to note that this information is not visible to either the server or the clients; it is generated and maintained transparently by the WireFS redirectors. The additional attributes enable WireFS-specific optimizations over the wide area network. As shown in Figure 7, for each file, WireFS maintains a directory entry (dentry) which contains four additional attributes: a chunk list, a callback list, home information for both the parent and the file itself, and owner information. In addition to the extended attributes, update logs are maintained for any updates queued for the server. Finally, each WFSR maintains a translation table which maps the file handles provided by the server at mount time to the path name of the file on the server.
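The per-object metadata and the handle-to-path translation table can be sketched as follows. The field names mirror Figure 7, but the concrete Python types and the `HandleTable` helper are illustrative assumptions of ours, not the actual WireFS structures.

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Dentry:
    """Per-object metadata kept by a WFSR (fields mirror Fig. 7)."""
    name: str                   # object name
    fs_attributes: dict         # standard NFS attributes
    parent_home: str            # home node of the parent directory
    object_home: str            # home node of this object
    object_owner: str           # owning site
    callback_list: List[str] = field(default_factory=list)  # sites to notify on update
    update_log: List[dict] = field(default_factory=list)    # updates queued for the server
    chunk_list: List[str] = field(default_factory=list)     # content-hash chunk ids

class HandleTable:
    """Maps server-issued file handles to server path names (kept per WFSR)."""
    def __init__(self):
        self._by_handle: Dict[bytes, str] = {}

    def bind(self, handle: bytes, path: str) -> None:
        self._by_handle[handle] = path

    def resolve(self, handle: bytes) -> str:
        return self._by_handle[handle]

table = HandleTable()
table.bind(b"\x01\x02", "/export/home/alice/paper.tex")
assert table.resolve(b"\x01\x02") == "/export/home/alice/paper.tex"
```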
V. EVALUATION
In this section, we present an evaluation of WireFS home migration using trace driven simulation. We first
describe our simulation methodology and demonstrate that metadata operations constitute a significant portion of
all file system accesses. We also show the temporal locality exhibited by accesses, especially across client sites.
We use publicly available long-term NFSv3 traces to identify file system accesses, and network latency traces to
emulate geographical distribution of client sites. We then show the behavior of the home-based WireFS metadata access protocol and compare it against existing network and wide area file systems. Finally, we show the benefits of home migration in WireFS while comparing our two algorithms for reassignment.
RPC method   | # of accesses (14 days) | % of RPC methods
read         | 194,404,998             | 50.68
write        | 64,635,756              | 16.85
getattr      | 89,051,316              | 23.22
lookup       | 18,442,927              | 4.81
access       | 6,464,648               | 1.69
remove       | 3,234,284               | 0.84
setattr      | 3,090,839               | 0.81
readdirplus  | 1,448,400               | 0.38
statfs       | 910,257                 | 0.24
readdir      | 692,763                 | 0.18
link         | 474,079                 | 0.12
rmdir        | 265,733                 | 0.07
fsinfo       | 131,104                 | 0.03
readlink     | 125,584                 | 0.03
rename       | 86,367                  | 0.02
mkdir        | 55,659                  | 0.01
pathconf     | 30,478                  | 0.01
symlink      | 11,872                  | 0.003
nothing      | 4,361                   | 0.001
mknod        | 7                       | 1.825E-06
commit       | 0                       | 0
total        | 383,561,432             | 100

TABLE I
THE BREAKDOWN OF NFS RPC REQUESTS AND RESPONSES IN THE 14-DAY HARVARD TRACE

NFS client/server | PlanetLab node                  | Time zone
Site0             | planetlab1.cs.uit.no            | GMT -1h
Site1             | planetlab-1.stanford.edu        | GMT +8h
Site2             | planetlab1.informatik.uni-kl.de | GMT -1h
Site3             | planetlab2.pop-rs.rnp.br        | GMT +3h
Site4             | planetlab1.eecs.umich.edu       | GMT +5h
Site5             | planetlab1.cs.wayne.edu         | GMT +5h
Site6             | planetlab2.cs.unibo.it          | GMT -1h
Site7             | planetlab1.cs.ucla.edu          | GMT +8h
Site8             | planetlab2.cs.duke.edu          | GMT +5h
Site9             | planetlab2.kaist.ac.kr          | GMT -9h
NFS server        | planetlab-1.cs.colostate.edu    | GMT +7h

TABLE II
THE PLANETLAB SITE EMULATION OF THE WIREFS CLIENT/SERVER CONFIGURATION
A. Simulation Methodology
We use the publicly available NFSv3 traces from the Harvard SOS project [8]. The Harvard traces capture up to three months of real campus NFSv3 traffic under different deployment scenarios. We choose the most diverse workload, which is a mix of research, email, and web traffic. In our simulation, two weeks of traffic traces are extracted to evaluate WireFS performance under different configurations.
The traces feature workload and operation diversity: 993 thousand distinct files, including 64 thousand directories, are monitored. During the studied two-week period, 384 million NFS RPC call/response pairs are recorded. The RPC call breakdown is presented in Table I. From Table I, we observe that 32% of these operations are LOOKUP, GETATTR, SETATTR, and other metadata operations. Therefore, WireFS focuses on minimizing the access latency of a significant portion of file system accesses.
The evolution of these access patterns over time is shown in Figure 8. We observe approximately one million file operations per hour, with the number of distinct files accessed per hour varying between one thousand and one hundred thousand. 75 distinct host IP addresses are identified from the traces and used for creating user groups.
To emulate an enterprise environment with branch offices, we partition the 75 hosts into 10 groups (sites), with the grouping following a uniform or Zipf distribution. The geographic distribution of the sites is emulated based on the Ping project traces [24]: we randomly picked 10 PlanetLab nodes scattered around the world, and emulated the wide-area network latency between them by extracting the round-trip time (RTT) information
Fig. 8. Evolution of file operations and accessed files per hour, over 2 weeks.

Fig. 9. The file operations of three sites (site0 GMT -1h, site1 GMT +8h, site3 GMT +3h) in Zipf grouping in different time zones.
from the Ping project traces.
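The host-to-site partitioning can be sketched as follows. This is an illustrative sketch under our own assumptions (Zipf exponent s = 1 and probabilistic assignment); the paper does not specify the exact grouping procedure.

```python
import random

def zipf_weights(n_sites: int, s: float = 1.0):
    """Unnormalized Zipf weights: site k gets weight 1 / k^s."""
    return [1.0 / (k ** s) for k in range(1, n_sites + 1)]

def group_hosts(hosts, n_sites: int, s: float = 1.0, seed: int = 0):
    """Assign each host to a site with probability proportional to the
    site's Zipf weight; uniform grouping would use equal weights instead."""
    rng = random.Random(seed)
    weights = zipf_weights(n_sites, s)
    groups = {k: [] for k in range(n_sites)}
    for h in hosts:
        site = rng.choices(range(n_sites), weights=weights, k=1)[0]
        groups[site].append(h)
    return groups

# Partition the 75 trace hosts into 10 sites, as in our setup.
groups = group_hosts([f"host{i}" for i in range(75)], n_sites=10)
assert sum(len(v) for v in groups.values()) == 75
```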
Table II describes our site location configuration with domains and time zones. Three sites are located on the east coast and two on the west coast of the United States, three are in Europe, one is in South America, and one is in Asia. The central file server (data center) is placed in the middle of North America.
The RTT between two sites varies from 2.4ms to 358ms, with an average value of 157ms. The time zone of each site is incorporated into our experiments by adding a time offset to the trace data originating from that site. For example, Figure 9 shows the time evolution of the file operations from three sites with Zipf-based grouping and time zone offsets.
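The time-zone adjustment can be sketched as below; the offset table is a subset of Table II, and the helper name is our own.

```python
from datetime import datetime, timedelta

# GMT offsets per site (subset of Table II); names are illustrative.
SITE_OFFSET_HOURS = {"site0": -1, "site1": +8, "site3": +3}

def localize(timestamp: datetime, site: str) -> datetime:
    """Shift a trace record's timestamp by its site's offset so that
    diurnal activity patterns differ across sites, as in Figure 9."""
    return timestamp + timedelta(hours=SITE_OFFSET_HOURS[site])

# A noon request attributed to site1 shifts to 20:00 local time.
assert localize(datetime(2003, 1, 1, 12, 0), "site1") == datetime(2003, 1, 1, 20, 0)
```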
We compare four network file systems in our simulated WAN setting. The first file system is a wide-area deployment of NFSv3, called WAN-NFS in the rest of the paper. In WAN-NFS, all client groups access files from the remote central NFS server via NFS RPC procedures. The second file system, called the DHT file system, uses a DHT-based data management scheme (like SHARK [1]) that randomly distributes file objects among the participating sites. For simplicity, in the simulations we assume a file lookup takes only a one-hop search for remote file object accesses. The third file system is called WireFS-node, where home assignment is done on individual files based on their access statistics. The fourth system is called WireFS-tree, where home assignment is done based on the greedy algorithm described in Section III-C.
In both WireFS-node and WireFS-tree, the home migration decision is recomputed every T minutes, and the number of accesses to a file f from a site x at the end of the i-th period is calculated with an EWMA counter: C_f^x(i) = α × C_f^x(i − 1) + (1 − α) × n_f^x(i), where n_f^x(i) is the total number of accesses of x on f during the i-th period and C_f^x(0) = 0. Unless explicitly stated, T = 60, α = 0.5, and a migration table of size k = 50000 are used in the following.
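The per-(site, file) EWMA counter above can be sketched as follows; the class and method names are of our own choosing.

```python
class EwmaCounter:
    """Access counter for one (site, file) pair, updated once per
    migration period T:  C(i) = alpha * C(i-1) + (1 - alpha) * n(i),
    with C(0) = 0."""

    def __init__(self, alpha: float = 0.5):
        self.alpha = alpha
        self.value = 0.0  # C(0) = 0

    def end_of_period(self, n_accesses: int) -> float:
        """Fold in n(i), the access count of the period that just ended."""
        self.value = self.alpha * self.value + (1 - self.alpha) * n_accesses
        return self.value

c = EwmaCounter(alpha=0.5)
c.end_of_period(10)  # C(1) = 0.5*0 + 0.5*10 = 5.0
c.end_of_period(0)   # C(2) = 0.5*5 + 0.5*0  = 2.5
assert c.value == 2.5
```

Old periods decay geometrically, so a site that stops accessing a file gradually loses its claim on the file's home.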
Fig. 10. The CDFs of average lookup latency for different systems (WAN NFS, DHT with one-hop search, WireFS-node, WireFS-tree, and LAN NFS) with Zipf-based grouping and time zone offset.

Fig. 11. The CDFs of average lookup latency for different systems with uniform-based grouping and time zone offset.
B. Results
Figure 10 shows the average meta-data lookup latency distribution for the four file systems, with host grouping based on the Zipf distribution and the time zone effect considered. The NFS lookup latency in a local area network (LAN NFS) is also included as a baseline for all the schemes.
We observe that WireFS-tree performs close to LAN NFS and outperforms the other three schemes. The latency of more than 96% of the lookups in WireFS-tree is comparable to that in LAN NFS; 92% of the lookups in WireFS-tree take less than 10ms, compared with 75% for WireFS-node, less than 15% for the DHT system, and none for WAN NFS, as all other sites are more than 10ms away from the central file server; only 2% of the operations in WireFS-tree underperform the other schemes, due to the worst-case scenario of two-hop lookups. We repeat the above simulations with host grouping based on the uniform distribution, and the result (shown in Figure 11) is similar to that of the Zipf distribution.
Fig. 12. CDF of local hit ratio for the two WireFS systems (WFS-node and WFS-tree).

Fig. 13. WireFS-tree: local access hit ratio vs. remote first-time-access file ratio, over 2 weeks.
Fig. 14. WireFS-tree: average access latency vs. first-time-access file ratio, over 2 weeks.

Fig. 15. WireFS-tree: home reassignment ratio vs. moving average remote first-time-access file ratio.
Figure 12 compares the performance of WireFS-tree and WireFS-node in terms of the distribution of local hit ratios (computed every T minutes) throughout the two weeks. We observe that WireFS-tree has a hit ratio over 95% most of the time, while WireFS-node experiences hit ratio oscillations during the experiment, with an average value of less than 90%.
The performance difference between WireFS-tree and WireFS-node is caused by the prefetching nature of subtree-based migration versus the caching nature of node-based migration. If file accesses from a site exhibit a locality pattern within the directory tree hierarchy, prefetching avoids "cold" misses due to first-time accesses; our experimental results clearly validate this assumption.
Figure 13 shows the time evolution of local hit ratios in WireFS-tree. The aperiodic deterioration of hit ratios is explained by the spikes of remote first-time-access file ratios², which are also shown in Figure 13.
Figure 14 presents the time evolution of the average lookup latency in WireFS-tree over the two-week period, together with the first-time-access file ratio. We observe that the latency spikes are consistent with the spikes of the first-time-access file ratio.
The effect of home migration is demonstrated by the immediate decline after each latency spike in Figure 14. The drop in access latency shows that home migration reduces wide-area accesses adaptively and quickly. Over the first 50 hours, most of the files are accessed for the first time by remote sites, which makes the average lookup latency oscillate dramatically. After this time, the latency stabilizes until another first-time-access spike changes the pattern.
Figure 15 presents the time evolution of the home reassignment ratio in the WireFS-tree system. The home reassignment ratio is defined as the percentage of meta-data files whose home nodes change. This ratio is used as a metric for
²Remote first-time-access file ratio is defined as the percentage of the files accessed by a remote group for the first time out of all files accessed during a time period.
Lookup Latency for 5 uniform groups from different time zones
1
0.9
0.9
0.8
0.8
0.7
0.7
0.6
0.6
CDF
CDF
Lookup Latency for 5 Zipf groups from different time zones
1
0.5
0.5
0.4
0.4
0.3
0.3
0.2
0.2
site0
site1
site3
site5
site9
All_sites
0.1
0
0.1
1
10
100
site0
site1
site3
site5
site9
All_sites
0.1
0
1000
0.1
Lookup Latency (ms)
1
10
100
1000
Lookup Latency (ms)
Fig. 16. Average lookup latency CDF of 5 sites with Zipf-based
grouping and time zone offset.
Fig. 17. Average lookup latency CDF of 5 sites with uniformbased grouping and time zone offset.
home migration overhead, as each reassignment requires computation. The maximum of 2.5% and average of 0.1% home reassignment ratio demonstrate the marginal overhead home migration incurs after the initial few delegations.
Moving average (3 rounds) remote first-time-access file ratios³ are also shown in Figure 15 to illustrate the main cause of home reassignments.
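The home reassignment ratio can be computed per migration round as in the sketch below; representing home assignments as a file-to-site dictionary is an assumption of ours.

```python
def reassignment_ratio(prev_home: dict, new_home: dict) -> float:
    """Fraction of tracked files whose home node changed between two
    consecutive migration rounds (0 if nothing is tracked)."""
    if not new_home:
        return 0.0
    changed = sum(1 for f, h in new_home.items() if prev_home.get(f) != h)
    return changed / len(new_home)

# One of four tracked files moved homes -> ratio 0.25.
prev = {"/a": "site0", "/b": "site1", "/c": "site2", "/d": "site3"}
new  = {"/a": "site0", "/b": "site5", "/c": "site2", "/d": "site3"}
assert reassignment_ratio(prev, new) == 0.25
```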
Figures 16 and 17 present the average lookup latency distributions of five individual sites in WireFS-tree with Zipf- and uniform-based grouping. For comparison, the average lookup latency distribution of the whole system is also shown. We observe that in both scenarios the large sites (sites 0, 1, and 5 in Figure 16; sites 0, 3, and 5 in Figure 17) account for the majority of the lookup operations and achieve good performance, which also results in good average performance for the whole system. However, there is at least one site (site 9 in both figures) whose performance is much worse than the rest of the system. It turns out that this site has a small number of hosts and therefore performs a small number of file operations; that is, this small branch office suffers in the competition with large branches for resources. While the cause is intuitive, addressing this problem is our immediate future work.
We ran the simulations for a set of grouping results and location mappings, with different values of α (0.25, 0.5, 0.75), varying migration frequencies T (15, 30, 60), and different migration table sizes (5K, 15K, 50K). All simulation results showed qualitatively similar performance for the WireFS system.
VI. RELATED WORK
Network file systems have been studied in the local area with stateless [3] and stateful servers [11], [2], [19],
[9]. Satyanarayanan presents an overview of several traditional distributed file systems [22]. Recently, there has
³Moving average (m rounds) remote first-time-access file ratio is the average value of the remote first-time-access file ratio over m consecutive rounds. As we use an EWMA counter to record historical access information, remote accesses in the current round might not immediately affect the home assignment decision. Therefore, we pick m = 3 in Figure 15 to better reflect the reason behind the home reassignment evolution.
been significant research activity in providing data access (object or file system based) over the WAN. Multiple
peer-to-peer architectures for decentralized data management have been proposed [7], [21], [14], [18]. However,
the goal of such systems is to store large quantities of data, dispersed and replicated across multiple clients to
improve fault resilience and reduce management overheads. In contrast, WireFS tries to improve the performance of existing network file systems for interactive workloads. While WireFS supports large data storage, replication, and disconnected operation, these characteristics are not its primary concern.
Independently, improving the performance of large file downloads in overlay networks has been studied [12],
[4], [5]. These systems target client downloads of whole data objects, such as movies and software distributions, from one or more publishers. They do not maintain object hierarchies like directories, and do not consider modifications
to objects. In WireFS, we target an entirely different workload. Table I shows the distribution of the different NFS
RPC calls in a trace collected at Harvard University [8]. From the distribution of the RPC calls, it is clear that
a significant portion of the network communication is due to the lookups and other metadata traffic. In a WAN
environment, such communication imposes a significant overhead on the performance of the file system. Previous
efforts to provide wide area file system access optimize mainly for the bandwidth. Reducing the latency of these
metadata transfers is a primary design goal of WireFS in addition to providing high-bandwidth parallel downloads.
VII. CONCLUSIONS
In this paper, we presented home migration, a technique to minimize meta-data access latency in wide-area file systems. We first described the design of WireFS, a wide-area networked file system. Next, a set of algorithms for home assignment and migration was proposed in the context of WireFS to improve the performance of metadata accesses. Through trace-driven simulations, we demonstrated that our technique improves the latency of metadata operations with low management and network overheads.
VIII. ACKNOWLEDGEMENTS
We thank the Harvard University SOS project team, especially Daniel Ellard, Jonathan Ledlie, and Christopher
Stein for providing the NFS traces.
REFERENCES
[1] S. Annapureddy, M. J. Freedman, and D. Mazières. Shark: Scaling File Servers via Cooperative Caching. In Proc. of the 2nd USENIX Symposium on Networked Systems Design and Implementation (NSDI '05), Boston, MA, May 2005.
[2] A. D. Birrell, A. Hisgen, C. Jerian, T. Mann, and G. Swart. The Echo Distributed File System. Technical Report 111, Digital Equipment
Corporation, Systems Research Center, Palo Alto, CA, USA, October 1993.
[3] B. Callaghan, B. Pawlowski, and P. Staubach. NFS Version 3 Protocol Specification, RFC 1813. IETF, Network Working Group, June
1995.
[4] M. Castro et al. SplitStream: High-Bandwidth Multicast in Cooperative Environments. In Proceedings of the 19th ACM Symposium
on Operating Systems Principles, pages 298–313, October 2003.
[5] B. Cohen. Incentives Build Robustness in BitTorrent. http://bittorrent.com/bittorrentecon.pdf, May 2003.
[6] Microsoft Corporation. CIFS: Common Internet File System. http://msdn.microsoft.com/library/default.asp?url=/library/en-us/cifs/protocol/portalcifs.asp.
[7] F. Dabek et al. Wide-area cooperative storage with CFS. In Proceedings of the 18th ACM Symposium on Operating Systems Principles
(SOSP ’01), Chateau Lake Louise, Banff, Canada, October 2001.
[8] D. Ellard and M. Seltzer. New NFS Tracing Tools and Techniques for System Analysis. In LISA ’03: Proceedings of the 17th USENIX
conference on System administration, pages 73–86, Berkeley, CA, USA, 2003. USENIX Association.
[9] J. H. Hartman and J. K. Ousterhout. The Zebra Striped Network File System. ACM Transactions on Computer Systems., 13(3):274–310,
1995.
[10] D. S. Parker, Jr., G. J. Popek, G. Rudisin, A. Stoughton, B. J. Walker, E. Walton, J. M. Chow, D. A. Edwards, S. Kiser, and C. S. Kline. Detection of mutual inconsistency in distributed systems. IEEE Trans. Software Eng., 9(3):240–247, 1983.
[11] J. Kistler and M. Satyanarayanan. Disconnected Operation in the Coda File System. ACM Transactions on Computer Systems,
10(1):3–25, Feb 1992.
[12] D. Kostic et al. Bullet: High Bandwidth Data Dissemination Using an Overlay Mesh. In Proceedings of the 19th ACM Symposium on
Operating Systems Principles, pages 282–297, October 2003.
[13] P. Krishnan, D. Raz, and Y. Shavitt. The cache location problem. IEEE/ACM Transactions on Networking, 8(5):568–582, 2000.
[14] J. Kubiatowicz et al. OceanStore: an Architecture for Global-Scale Persistent Storage. In Proceedings of the 9th International Conference
on Architectural Support for Programming Languages and Operating Systems, pages 190–201, 2000.
[15] D. Mazieres. A toolkit for user-level file systems. In USENIX Technical Conference, Boston, MA, June 2001.
[16] J. Morris, M. Satyanarayanan, M. Conner, J. Howard, D. Rosenthal, and F. Smith. Andrew: A distributed personal computing environment. Commun. ACM, 29(3):184–201, Mar. 1986.
[17] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In SOSP ’01: Proceedings of the eighteenth
ACM symposium on Operating systems principles, pages 174–187, 2001.
[18] A. Muthitacharoen et al. Ivy: A read/write peer-to-peer file system. In Proceedings of the 5th USENIX Symposium on Operating
Systems Design and Implementation (OSDI ’02), Boston, Massachusetts, December 2002.
[19] M. N. Nelson, B. B. Welch, and J. K. Ousterhout. Caching in the sprite network file system. ACM Trans. Comput. Syst., 6(1):134–154,
1988.
[20] Riverbed Technology, Inc. RiOS: Riverbed Optimization System. http://www.riverbed.com/docs/TechOverview-Riverbed-RiOS.pdf, 2006.
[21] A. Rowstron and P. Druschel. Storage management and caching in PAST, a large-scale, persistent peer-to-peer storage utility. In
Proceedings of the 18th ACM Symposium on Operating Systems Principles, pages 188–201, 2001.
[22] M. Satyanarayanan. A survey of distributed file systems. Technical Report CMU-CS-89-116, Carnegie Mellon University, Pittsburgh,
Pennsylvania, 1989.
[23] R. Shah, Z. Ramzan, R. Jain, R. Dendukuri, and F. Anjum. Efficient dissemination of personalized information using content-based
multicast. IEEE Transactions on Mobile Computing, 3(4):394–408, 2004.
[24] J. Stribling. All-Pairs-Pings for PlanetLab. http://pdos.csail.mit.edu/~strib/pl_app/.
[25] Tacit Networks, Inc. Tacit Networks iShared datasheet. http://www.tacitnetworks.com/docs/Datasheet-I-shared-Products.pdf, 2006.
[26] A. Tamir. An O(pn^2) algorithm for the p-median and related problems on tree graphs. Operations Research Letters, 19:59–64, 1996.