H-0141 October 10, 2002
Computer Science
IBM Research Report
DSF - Data Sharing Facility
Zvi Dubitzky, Israel Gold, Ealan Henis, Julian Satran, Dafna Sheinwald
IBM Research Division
Haifa Research Laboratory
Haifa 31905, Israel
Research Division
Almaden - Austin - Beijing - Haifa - India - T. J. Watson - Tokyo - Zurich
LIMITED DISTRIBUTION NOTICE: This report has been submitted for publication outside of IBM and will probably be copyrighted if accepted for publication. It has been issued as a Research Report
for early dissemination of its contents. In view of the transfer of copyright to the outside publisher, its distribution outside of IBM prior to publication should be limited to peer communications and specific requests.
After outside publication, requests should be filled only by reprints or legally obtained copies of the article (e.g., payment of royalties). Copies may be requested from IBM T. J. Watson Research Center, P. O. Box 218,
Yorktown Heights, NY 10598 USA (email: reports@us.ibm.com). Some reports are available on the internet at http://domino.watson.ibm.com/library/CyberDig.nsf/home.
DSF - DATA SHARING FACILITY
Zvi Dubitzky
Israel Gold*
Ealan Henis
Julian Satran
Dafna Sheinwald
IBM Research Laboratory in Haifa
(* now with SANGate Israel)
Abstract
This paper presents DSF - a new serverless distributed file system, aimed to
improve scalability. Scalability is obtained by moving traditional file system
functionality to lower (disk) levels and by using a dynamic file management
assignment policy to improve load balancing.
1. Introduction
Disks are slower than processors, and modern network performance is
improving at a higher rate than that of disks. Moreover, fast processors and
networks impose increasing average and peak demands on storage and file
systems.
A typical centralized network file system is based on a dedicated file server that
satisfies storage access requests, manages file system metadata and maintains a
cache. These tasks make the file server a performance and reliability bottleneck,
and the centralized solution does not scale well.
The Data Sharing Facility (DSF) presented here is a scalable non-centralized
("serverless") distributed storage access system, where storage, cache and control
are distributed over cooperating workstations.
Existing systems provide limited answers to the growing storage access demands.
NFS [1,2] is a remote file access protocol that provides a weak notion of cache
consistency. Its stateless design requires clients to access servers frequently to
maintain consistency. NFS4 [13] introduced client caching and a state-based
protocol. AFS [3] provides local disk caching and consistency guarantees, but it
does not implement a native file system. It has a global namespace, but a single
centralized server manages each mountable volume. The VMS Cluster file system
[4,5] offloads file system processing to a group of individual machines that are
members of a cluster. Every cluster member runs its own instance of the file
system code on top of a shared physical disk, with synchronization provided by a
distributed lock service. The shared physical disk is accessed either through a
special purpose cluster interconnect to which a disk controller can be directly
connected, or through an ordinary local area network such as Ethernet and a
machine acting as a disk server. The Frangipani clustered file system [6] improves
upon this design by replacing the shared physical disk with a shared scalable
virtual disk provided by Petal [7]. Petal consists of a collection of
network-connected servers that cooperatively manage a pool of physical disks. To a
Frangipani cluster member, this collection appears as a highly available block level
storage system that provides large abstract containers, which are globally
accessible by all Frangipani cluster members. IBM's GPFS has similarities with
Frangipani in its log-based recovery. However, GPFS does not scale well due to
the use of a centralized lock server. xFS [8] attempts to distribute all aspects of
file service over multiple machines across the network to provide high availability,
performance and scalability. However, the xFS management distribution policy is
static, and its recovery mechanism is complicated (log based).
The DSF presented here is closest to xFS among existing systems. Functionally,
DSF differs from xFS by providing dynamic distribution of file management.
Its components also provide different functionality. Through this new
functionality it achieves a simplified structure, better scaling and simplified
metadata recovery, all of which potentially improve performance and reliability.
2. DSF design principles
The DSF design is depicted in Figure 1.
[Figure: DSF System Architecture. Labeled components: File System Clients (three Clients), Metadata Management (two File Managers), Local Area Network, Logical Volume, Storage Management (three Storage Managers).]
Figure 1 DSF - System Architecture
The system components and their functions are:
• DSF Client - runs on some workstation and provides access to DSF
files and directories. The DSF Client maintains a memory cache of
data blocks accessed by applications on the Client workstation. The
Client accepts file system requests from user programs, sends data to
Storage Managers on writes, forwards reads to File Managers on
cache misses, and receives replies from Storage Managers or other
Clients. It also answers forwarding requests from File Managers by
sending data to other Clients.
• DSF File Manager - manages the metadata and cache consistency for
a subset of DSF files. To provide scalable service DSF splits the
management of its files among several File Managers. The Manager
of a file controls two sets of information about it, cache consistency
state and file structure metadata blocks. Together these structures
allow the File Manager to track all copies of the file's data blocks. The
File Manager can thus forward Client read requests to other Clients
thereby implementing cooperative client caching.
• DSF Storage Manager - stores and maintains data and metadata
blocks on its local disks. The Storage Manager reacts to requests from
File Managers by supplying data to Clients which have initiated I/O
operations. DSF Storage Managers contain the intelligence to support
DSF Logical Volumes.
• DSF Logical Volume - consists of a collection of logical disks that
span multiple Storage Manager machines and provides an abstract
interface to disk storage with a rich set of recovery properties. The
logical volume hides the physical distribution of its logical disks so
that new disks and Storage Managers can be incorporated into the
system dynamically and without interrupting system operation,
thereby increasing storage capacity and throughput. The storage
system may be further reconfigured by moving disks between Storage
Managers, to match different working environments and workloads.
Expanding the storage space (volume scalability) can also be done
without interrupting system operation.
DSF performance and scalability are achieved by the following design elements:
• Separation of storage from file management.
• Distribution of storage management over multiple machines.
• Dynamic distribution of file and metadata management across
multiple machines. Caching and metadata management can be done on a
machine that is different from the one storing the data.
• Cooperative caching. Client machine memories are treated as one global
cooperative cache. Clients are able to access blocks cached by other clients,
thereby reducing the load on Storage Managers and the cost of a local
cache miss.
• Lack of dedicated machines. This eliminates a source of bottlenecks. Any
machine in the system, including one that runs user applications, can be
made responsible for storing, caching and managing any piece of data or
metadata. Furthermore, any machine in the system can assume the
responsibilities of a failed component.
• Extensibility - machines can be added to the system.
• Freedom to configure the file system. DSF can be configured to match
different system environments depending on machine memories and CPU
speeds. DSF can have multiple configurations ranging from a "small office
system", where the file system is shared between two machines and only
one is responsible for storing the data, to a large "clustered system" of
hundreds of machines, where each machine is made responsible for
storing, caching and managing parts of the file system.
• Use of logical volumes. A DSF logical volume can be used to dynamically
reconfigure the storage subsystem without interrupting file system
operation. The support for transactional operations over multiple disks
improves the performance of file operations that require atomic
multi-block writes, like file sync and directory operations. The new
allocateAndWrite technique removes the need to allocate a new block
prior to writing it to disk.
3. DSF mechanisms
3.1. DSF Logical Disk - Introduction
The Storage Manager maintains each of the disks physically attached
to it as a logical disk.
The idea of a Logical Disk that binds logical block addresses to physical block
addresses via a translation table is not new. At MIT [13], HP [14, 15], DEC [6,7],
Princeton U, and elsewhere, prototypes of the logical disk architecture have been
built since the early 1990’s.
With the logical disk approach, the disk storage is partitioned into fixed size block
spaces, each made of several consecutive sectors. The user application (file system,
database, etc.) treats its data as partitioned into (logical) blocks, each
slightly (some 30 bytes) smaller than a block space, and it
associates logical addresses with its blocks. The Storage Manager maintains a translation
table that converts each logical block address to a physical disk address, which is the
sector address of the first sector in the block space that accommodates the most
recently stored contents of the logical block. The Storage Manager also maintains
an allocation bitmap where it records the availability of the block spaces on the disk.
Dynamic change of these tables provides for the stable storage feature: only
after an available block-space is allocated and the new contents of a logical
block are safely stored into it, is the block-space that accommodated the old
contents released for use by further block stores, and the translation table is
updated.
With such a scheme, a write that fails due to a faulty disk-sector is retried at
another block space (transparently to the caller of the write operation, which
only refers to blocks by their logical addresses), and if a failure occurs in the
midst of block writing, the old contents of the block can be fully recovered.
Thus, a successful write of a block ends with the logical address of that block
bound to the fresh contents of the block, stored soundly on the disk, and an
unsuccessful write ends with it bound to the old contents. A write operation never
ends with indefinite contents of a block; hence the attribute stable.
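The stable write described above can be illustrated with a small in-memory sketch. This is not the DSF implementation: the class name `LogicalDisk`, the `faulty` set used to simulate bad sectors, the first-fit scan for a free block-space, and the choice to retire a faulty block-space by leaving it marked allocated are all assumptions made for illustration.

```python
class LogicalDisk:
    """Toy in-memory model of a logical disk with stable (shadow) writes."""

    def __init__(self, num_block_spaces):
        self.spaces = [None] * num_block_spaces       # stand-in for physical block-spaces
        self.allocated = [False] * num_block_spaces   # allocation bitmap
        self.table = {}                               # translation table: logical -> block-space
        self.faulty = set()                           # block-spaces that simulate bad sectors

    def _store(self, index, contents):
        """Simulated physical write that fails on a faulty block-space."""
        if index in self.faulty:
            raise IOError(f"faulty block-space {index}")
        self.spaces[index] = contents

    def write(self, logical, contents):
        """Store contents in a fresh block-space, then rebind and release the old one."""
        for index in range(len(self.spaces)):
            if self.allocated[index]:
                continue
            try:
                self._store(index, contents)
            except IOError:
                # Assumption: a faulty space is retired by leaving it marked allocated,
                # and the write is retried at the next free block-space.
                self.allocated[index] = True
                continue
            self.allocated[index] = True
            old = self.table.get(logical)
            self.table[logical] = index        # bind the logical address to the new copy
            if old is not None:
                self.allocated[old] = False    # only now release the old copy
            return index
        raise RuntimeError("no free block-space; old contents remain bound")

    def read(self, logical):
        return self.spaces[self.table[logical]]
```

Because the old block-space is released only after the new copy is stored and the table is updated, a write that cannot complete leaves the logical address bound to its previous contents.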
Without the translation table, no block can be found on the disk, and without the
allocation bitmap available block-spaces are hard to find. To withstand a power
failure, it does not suffice to keep these data structures in volatile memory;
periodically, these data structures must be flushed to disk. Furthermore, every
block write must include sufficient information that can be used, during
recovery, to redo the updates made to the translation table and the allocation
bitmap, which we show in the sequel.
The logical disk scheme we used for DSF simplifies the data structures
manipulation, improves performance of ordinary and recovery operations, and
further extends the services provided by the logical disk, to include operations
traditionally done by the file systems. It also allows more than a single logical
disk user to use it without having to coordinate operations. The DSF logical disk
also provides transactional store of multiple blocks, over multiple disks.
Our mechanisms provide the following benefits, most of which we have not
seen in any of the existing implementations of logical disks:
1. Allow I/O in blocks made from several consecutive sectors (thereby
allowing atomic multi-sector write). That is, the disk can be managed in
blocks, larger than its sectors, whose size is determined when the disk is
formatted as DSF logical disk.
2. The physical disk, on which the DSF logical disk is implemented, contains
all the information needed for its management, and thus the disk is easily
movable from one host to another, without calling for a total reconfiguration.
3. On each block write, at no extra I/O cost, the update of the translation table
is stored to disk as well.
4. On each block write, at no extra I/O cost, the update of the allocation
bitmap, which records the availability of block spaces, is stored to disk as
well. Preallocation of several blocks, and the storage of the updated
allocation bitmap, is also not needed, as opposed to the usual practice, and
hence no leak of space occurs when the storage server fails and the
preallocated blocks are lost.
5. It provides allocation and deletion of blocks. This allows multiple users of
our DSF logical disk to allocate blocks without the need to synchronize
their requests and protect against collisions. Moreover, our allocation and
deletion schemes withstand cache failures.
6. It provides soft-write, commit, and abort operations which enable the
two-phase commit needed for atomic multi-block stores (on single or
multiple disks).
7. Consecutive stores of blocks (not necessarily in one chunk) make the disk
arm move mostly forward; once in a while, the arm is reset all the way
backward, and then again it moves forward for many stores. Although the
stores are not necessarily adjacent on disk, the one directional, rather than
random, move of the arm gives better performance (as in log structured file
systems).
8. Checkpoint of our scheme’s data structures (store to disk of the translation
table and allocation bitmap and a few integer variables) takes place when the
arm moves backward, or earlier, at the DSF logical disk’s convenience; i.e.,
timing for checkpoint is rather flexible. Checkpoints can be done
succinctly by identifying the components that changed since the last
checkpoint, and can even be made piecemeal, in small parts, one at a time.
9. Recovering from cache (power) failure, the DSF logical disk reconstructs its
in-memory data structures, bringing them to the very same state they were
in immediately prior to its failure, faster than any previous work, known
to us, in this area. The time consumed is linear in the number of write
operations that took place since the last checkpoint. Besides avoiding
scanning the whole disk, the read operations needed for the recovery are
ordered such that the arm only moves forward.
10. The stable store mechanisms can co-exist with the conventional store in
place mechanism on one disk: part of the disk is managed through
translation table and allocation bitmap, and the other part is managed as a
simple disk.
11. The implementation of our scheme is rather simple and suits modern
disk controllers.
3.2. Formatting the disk
Some of the space of the physical disk is reserved for describing general
physical and logical parameters, like size of disk sector, number of sectors,
number and size of block spaces, range of logical addresses supported by the
disk, etc. Space is also reserved for checkpoints of the data structures used to
manage the DSF logical disk. The rest of the space is partitioned to block-spaces.
An allocation bitmap is constructed for the disk, associating one bit with each
block-space thus defined. Initially, all the block-spaces are free and accordingly
all the bits of the allocation bitmap are set to 0.
3.3. Allocation of Block-spaces
When a block-space is needed for storage, one is allocated from the free
block-spaces on the disk. Our scheme records the physical sector address of the last
block-space allocated, and looks for a free block-space from that address
forward. As explained in the sequel, our scheme employs block chaining: the
blocks are stored with a forward pointer, yielding a forward linked list made
from all the blocks stored. This allocation, store and link forward process
continues until no free block-space is found whose address is higher than the
last block-space allocated. When this happens, a checkpoint is called, whereby the
DSF logical disk stores its data structures to disk, and the allocation process
resumes from the free block-space of lowest address, creating a new forward
linked list of stored blocks. All along the process of store and link forward,
between successive block stores, the disk arm moves in one direction: forward.
The one directional move of the arm gives good performance, as in
log-structured file systems.
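The forward-only allocation pass can be sketched as follows. This is a minimal model under stated assumptions: `ForwardAllocator` and its method names are hypothetical, block-spaces are identified by list index rather than sector address, and the checkpoint is reduced to incrementing a pass counter instead of storing tables to disk.

```python
class ForwardAllocator:
    """Allocate block-spaces in increasing address order, wrapping at a checkpoint."""

    def __init__(self, num_block_spaces):
        self.free = [True] * num_block_spaces
        self.last = -1          # index of the last block-space allocated
        self.pass_number = 0    # number of completed forward passes (checkpoints)

    def checkpoint(self):
        # Placeholder: a real checkpoint would also store the tables to disk.
        self.pass_number += 1

    def _take(self, index):
        self.free[index] = False
        self.last = index
        return index

    def allocate(self):
        # Look for a free block-space beyond the last one allocated.
        for index in range(self.last + 1, len(self.free)):
            if self.free[index]:
                return self._take(index)
        # Nothing free ahead: checkpoint, then resume from the lowest free address.
        self.checkpoint()
        for index in range(len(self.free)):
            if self.free[index]:
                return self._take(index)
        raise RuntimeError("no free block-space at checkpoint")

    def release(self, index):
        self.free[index] = True
```

Between two checkpoints the returned indices are strictly increasing, which is what keeps the simulated arm moving in one direction.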
3.4. Storage Management Data Structures
In addition to the Translation Table and the Allocation Bitmap, our DSF logical disk
also maintains three variables. The Pass Number is a counter of the number of times
that the disk arm completed move-forward-and-store passes; it equals the number of
checkpoints done thus far. The First Available Block-space is a pointer to the first, in
address order, available block-space when a checkpoint takes place. The
Next Available Block-space is a pointer to the available block-space into which the
next storage operation will store.
Figure 2 Translation Table and block chain operations
3.5. DSF Logical Disk Operations
The Read (address) operation is straightforward. If address is in the range
of the metadata addresses, then it is a logical address: the translation table is
consulted, and the contents of the block whose logical address is address are
returned. If address is in the range of the conventional store, then the contents
of the data block whose physical address is address are returned.
The Write (address, contents) operation is also straightforward. On the
conventional part of the disk (where a write overrides old data) this is the
ordinary store of contents into the space whose physical address is address. In the
meta-data part, this is a stable store, as described above: first a new block-space
is allocated, into which contents are stored, and associated with logical address
address. Then, the translation table is updated, and finally, the block-space that
used to hold the previous contents of the block whose logical address is address,
is released.
The non-conventional disk operations are:
Allocate and Write (contents): For regular blocks it means getting a free
block, allocating it and returning the address to the caller. For stable-storage
(metadata) blocks it also means finding a logical address, by
looking in the translation table for an entry that is mapped to NULL, and a
free block-space, then continuing as with Write, and returning also the
logical address allocated.
Write (logical address, contents): For stable-storage blocks it means getting
the next free block in the chain, writing the contents and updating the
in-memory translation table.
Delete Blocks (i1,i2,...): For stable-storage blocks it involves deleting the
binding of the blocks of logical addresses i1,i2,... with any stored contents, and
making the spaces used to hold their contents available for further
block stores. For stable-storage blocks, the DSF logical disk stores a special
block with deletion information. This block occupies a block-space only
until the next checkpoint operation, at which time the deletion
information is stored to disk in the form of the stored updated tables, and
the block-space that accommodates the deletion information becomes
available. For regular blocks it involves only making the blocks available
for reallocation.
Softwrite (Transaction-id, logical-address, contents): Allocate a block-space
into which contents of the logical block associated with logical-address
are stored, while the old contents still remain on the disk. An extension of
the translation table makes a note of this ambiguity. Once Abort (Tid) is
issued, the new contents that pertain to transaction Tid are removed (i.e.,
the block-spaces that accommodate them become available), along with
the removal of the ambiguity notification. Once Commit (Tid) is issued,
the analogous removal of the old contents takes place.
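The Softwrite, Commit and Abort operations can be sketched as below. The class `TransactionalDisk` and the `pending` dictionary, which plays the role of the translation table extension that notes the ambiguity, are hypothetical names; the sketch models a single disk in memory and leaves out the coordination across multiple disks.

```python
class TransactionalDisk:
    """Toy model of soft-write / commit / abort on one logical disk."""

    def __init__(self, num_block_spaces):
        self.spaces = [None] * num_block_spaces
        self.free = [True] * num_block_spaces
        self.table = {}      # logical address -> committed block-space
        self.pending = {}    # transaction id -> {logical address: tentative block-space}

    def _allocate(self):
        index = self.free.index(True)   # raises ValueError when the disk is full
        self.free[index] = False
        return index

    def softwrite(self, tid, logical, contents):
        """Store new contents in a fresh block-space; the old contents stay bound."""
        index = self._allocate()
        self.spaces[index] = contents
        writes = self.pending.setdefault(tid, {})
        if logical in writes:
            self.free[writes[logical]] = True   # drop an earlier tentative copy
        writes[logical] = index

    def commit(self, tid):
        """Bind every tentative copy of the transaction and release the old copies."""
        for logical, index in self.pending.pop(tid, {}).items():
            old = self.table.get(logical)
            self.table[logical] = index
            if old is not None:
                self.free[old] = True

    def abort(self, tid):
        """Release every tentative copy of the transaction; old contents are untouched."""
        for index in self.pending.pop(tid, {}).values():
            self.free[index] = True

    def read(self, logical):
        return self.spaces[self.table[logical]]
```

A reader sees only committed contents until Commit is issued, which is the property a two-phase commit over several blocks relies on.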
3.6. CheckPoint
CheckPoint can take place at any time. It is mandatory, though, when no
available block-space is found beyond Next Available Block-space. In CheckPoint,
the following items are flushed from volatile memory to a preallocated space on
disk that is dedicated for the checkpoint:
1. Translation Table and Allocation Bitmap.
2. First Available Block-space, which is the first (associated
with lowest sector address) block-space marked free by
the Allocation Bitmap at the time of CheckPoint.
3. Pass Number after increment.
The store of Translation Table and Allocation Bitmap dominates the amount of
time consumed by CheckPoint. This store operation can be done efficiently by
partitioning the data structures to segments of sector size, and each time one of
these data structures is updated, the relevant segment is marked. Then, on
CheckPoint, only the updated segments are stored to disk - each to one disk
sector. This way, if checkpoints are frequent, due to a very small number of free
block-spaces, the updates between successive checkpoints are very few, and the
checkpoint process is very short. When checkpoints occur infrequently
(abundance of free space) the overhead is negligible compared to the ordinary
activity.
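The segment marking described above might look like the following sketch. `SegmentedTable`, the constant `SEGMENT_ENTRIES` and the `store_segment` callback are hypothetical; the number of table entries that fit in one sector depends on the sector and entry sizes, which the text does not fix.

```python
SEGMENT_ENTRIES = 128  # hypothetical number of table entries per sector-sized segment


class SegmentedTable:
    """Translation table that tracks which sector-sized segments changed."""

    def __init__(self, num_entries):
        self.entries = [None] * num_entries
        self.dirty = set()   # indices of segments updated since the last checkpoint

    def update(self, logical, block_space):
        self.entries[logical] = block_space
        self.dirty.add(logical // SEGMENT_ENTRIES)   # mark the relevant segment

    def checkpoint(self, store_segment):
        """Store only the marked segments; return the list of segments written."""
        written = sorted(self.dirty)
        for segment in written:
            start = segment * SEGMENT_ENTRIES
            store_segment(segment, self.entries[start:start + SEGMENT_ENTRIES])
        self.dirty.clear()
        return written
```

With this bookkeeping the cost of a checkpoint grows with the number of segments touched since the previous one, not with the size of the table.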
On CheckPoint, the value of Next Available Block-space is set to be First Available
Block-space; thereby a new pass of move-forward-and-store takes off.
Immediately following disk formatting, a first CheckPoint, of all the initial
values of the data structures, takes place. (This generalizes the recovery
process).
If no free block-space is found when CheckPoint takes place, an error message
is issued.
CheckPoint in time-bounded segments. When CheckPoint takes place,
DSF logical disk ceases to provide service, because all its data structures are
locked until the store to disk is complete. As this may have a negative effect
on response time we suggest here a simple scheme for making CheckPoint in
small, time-bounded segments. Once the store-and-link-forward reaches a
point when CheckPoint should start, copies of the Translation Table, Allocation
Bitmap, and Next Available Block-space are made in main memory, Pass Number is
incremented, and then the ordinary operation of DSF logical disk continues.
Then, in between operations, when it is not busy, “at its leisure” (in other
words, by a thread with a low priority, for example) the DSF logical disk stores
the copies, segment by segment, to a special dedicated place on the disk
(alternating between two checkpoint locations). Because the store is from the copies,
it does not block the ordinary work with the original data structures. When the
copies of the tables are all stored to disk, the DSF logical disk also stores Pass
Number and the kept value of Next Available Block-space, as the checkpointed
value First Available Block-space. From that moment on, the newly stored data
structures, plus all the block-spaces stored to since that CheckPoint started
(from the block-space pointed at by the stored value of First Available
Block-space), suffice to recover all data structures, in case they are lost on a
power failure. The scheme of store-and-link-forward may continue, and it even may
wrap around the lowest disk addresses and then continue forward, but it should
not pass over the block-space pointed at by the stored value of First Available
Block-space. If a failure occurs while this incremental CheckPoint takes place, the
information stored to disk on last CheckPoint, plus all the block-spaces stored
to since that CheckPoint (which are uniquely identified by their being linked
forward, starting with the block-space pointed at by First Available Block-space
stored at last CheckPoint, and by their Pass_Number field containing the Pass
Number stored in last CheckPoint, or a value greater than it by 1) suffice for a
full recovery of DSF logical disk’s in-memory data structures.
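Recovery from the last checkpoint plus the forward chain can be sketched as follows. The on-disk block layout is not specified in the text, so the `StoredBlock` header fields (logical address, pass number, forward pointer) are an assumption, as is the function name `recover`; deletion blocks and soft-writes are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StoredBlock:
    logical: int               # logical address the contents belong to
    pass_number: int           # pass during which the block was stored
    next_space: Optional[int]  # forward pointer to the next block-space in the chain
    contents: bytes


def recover(spaces, table, allocated, first_available, pass_number):
    """Rebuild the translation table and bitmap from a checkpoint and the block chain."""
    table = dict(table)           # work on copies of the checkpointed structures
    allocated = list(allocated)
    index = first_available
    visited = set()
    while index is not None and index not in visited:
        block = spaces[index]
        # The chain ends at an empty space or at a block left over from an older pass.
        if block is None or block.pass_number not in (pass_number, pass_number + 1):
            break
        visited.add(index)
        old = table.get(block.logical)
        table[block.logical] = index        # redo the translation table update
        allocated[index] = True             # redo the allocation
        if old is not None and old != index:
            allocated[old] = False          # redo the release of the old copy
        index = block.next_space
    return table, allocated
```

The loop touches only block-spaces written since the checkpoint, which matches the claim that recovery time is linear in the number of writes since the last checkpoint.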
3.7. Migration of Disks
On a storage system where each disk is only attached to a single host, failure of
that host makes the disks attached to it inaccessible. When host failures last too
long, system availability increases if disks can be detached from a failed host
and attached to a functioning one. In our scheme, the physical disk always
contains all the information needed to manage it as a DSF logical disk, and thus it
can easily be removed from one host and attached to another one.
3.8. Reads and Caching
Figure 3 illustrates how DSF reads a data block given a file name and an offset
within that file.
To open a file, the Client first reads the file's parent directory block (labeled 1 in
the diagram) to determine its inode address. Note that the parent directory is,
itself, a data file that must be read using the procedure described here. DSF
breaks this recursion at the root; the Client learns the inode address of the root
when it mounts the file system.
Figure 3 Read a Data Block
Once the Client determines the file inode address, it follows a Manager selection
procedure to locate or assign an appropriate Manager for the file.
As the top left of the path in the figure indicates, the Client first checks in its local
cache for the block (2a); if the block is present, the request is satisfied from the
local cache. Otherwise, it follows the lower path to fetch the data block over the
network. The Client first uses its manager map to locate the correct manager
from the inode address (2b) then sends a Read request to the Manager. If the
Manager is not co-located with the Client, this message travels over the network.
The Manager then tries to satisfy the request by fetching the data from the
cooperative cache, i.e. from some other Client's cache. The Manager checks the
cache consistency state (3a), and, if possible, forwards the request to another
Client caching the requested data block. The "source" Client reads the block from
its local cache (3b) and forwards the data directly to the "destination" Client (the
one that originated the request). The Manager is notified of the block's arrival at
the "destination" Client and adds that Client to the list of Clients caching the
block that the Manager maintains.
If no Client can supply the data from its cache, the Manager routes the Read
request to disk storage by examining the inode block. The Manager may find the
inode block in its local cache (4a) or it may have to read the inode block from
disk. If the Manager has to read the inode from disk, it uses the inode address
and the SSR map (Storage Server Map) (4b) to implicitly determine which Storage
Server to contact. The Manager then requests the inode block from the Storage
Server, which then reads the metadata block from its disk and sends it back to the
Manager (5). Once the Manager receives the inode block it uses the inode (6) to
identify the address of the requested data block (if the file is large, the Manager
may have to read several levels of indirect blocks to find the data block's address;
to do this the Manager follows the same procedure in reading indirect blocks as
in reading the inode block; this is not shown here).
The Manager uses the data block's address and the SSR map (7) to send the Read
request to the appropriate Storage Server keeping the block. The Storage server
reads the data block from its disk (8) and forwards the block directly to the Client
that originated the Read request.
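The read path above can be summarized in a short simulation. All class and attribute names are hypothetical, the manager map is reduced to a modulo over the inode address (DSF assigns Managers dynamically), inode lookup is a plain dictionary, and data is returned through function calls where the real system forwards blocks directly between machines.

```python
class StorageServer:
    def __init__(self, blocks):
        self.blocks = blocks              # block address -> contents

    def read(self, address):
        return self.blocks[address]


class Manager:
    def __init__(self, storage_map, inodes):
        self.storage_map = storage_map    # block address -> StorageServer (stand-in for the SSR map)
        self.inodes = inodes              # inode address -> list of data block addresses
        self.cachers = {}                 # (inode, block number) -> Clients caching the block

    def read(self, requester, inode, block_no):
        key = (inode, block_no)
        # First try the cooperative cache: another Client that holds the block.
        for peer in self.cachers.get(key, []):
            if peer is not requester and key in peer.cache:
                data, source = peer.cache[key], "cooperative cache"
                break
        else:
            # No Client can supply the block: resolve its address and go to storage.
            address = self.inodes[inode][block_no]
            data, source = self.storage_map[address].read(address), "storage"
        holders = self.cachers.setdefault(key, [])
        if requester not in holders:
            holders.append(requester)     # remember that the requester now caches it
        return data, source


class Client:
    def __init__(self, managers):
        self.managers = managers          # stand-in for the manager map
        self.cache = {}

    def read(self, inode, block_no):
        key = (inode, block_no)
        if key in self.cache:
            return self.cache[key], "local cache"
        manager = self.managers[inode % len(self.managers)]
        data, source = manager.read(self, inode, block_no)
        self.cache[key] = data
        return data, source
```

In this model the first Client to read a block is served from storage, a repeated read is served locally, and a second Client is served from the first Client's cache.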
4. Experimental Results
All DSF components were initially implemented on NT 4 and then ported to
Linux. The NT version was up and running in 1999 and a heterogeneous system
was running in the lab during the summer of 2000.
A test bed for experimental measurements was implemented on a cluster of three
NT/x86 machines... as depicted in Figure 4.
The Clients ran on two machines, the File Managers ran on two machines and the
Storage Manager ran on two machines.
Figure 4 Measurement/Demo Systems
Test environment:
• 3 NT machines (one PC Pentium III, 660 MHz (client + manager), 2 PC
Pentium II, 266 MHz - one running managers and the second running
client), 128 MB RAM on each PC. Each PC running NT 4 SP 5
• Fast Ethernet (100 Mb/sec) which is part of the site infrastructure (not a
dedicated switch)
• NT NTFS native cache: 21 MB
• DSF client - DSF managed cache: 6 MB/client (4k data blocks)
• DSF File Manager - DSF managed cache: 2 MB/file manager (1k metadata
blocks)
We ran two sets of tests:
• A BMP file read with on screen presentation using Internet Explorer
(graphics presentation)
• A benchmark named Postmark from Network Appliance Inc.
(Postmark)
4.1. Graphics presentation
Test file: a BMP file of size 4.8 MB (1172 blocks * 4kbytes /block)
Acronyms and special terms:
DSS - DSF storage Manager
FMGR - DSF file manager
DSF Client - A 2 layer NT driver (File system driver and Logical Volume &
com driver)
CC - DSF cooperative cache mechanism
Test results per the test file that is read from storage and presented on the screen:
Table 1 File Access Experiment
Access path                                      Time
DSF Client, Remote DSS Access through FMGR       1.692 sec
DSF Client, Local Cache Access                   8.79 msec
DSF Client, CC Access                            1.172 sec
NT Native, File Access                           1 sec
NT Native, Local File cache access               0.45 sec
Note:
DSF measurements were made with DSF internal diagnostic measurement
tools
NT file access time measurements were done using a stop watch (the test
program was a browser and NTFS is not instrumented).
As can be seen, in this experiment, CC file access time is about 69 % of the DSS
access time (1.172 sec versus 1.692 sec). Having a faster machine for the other
DSF client machine would make the CC faster because CC operation is CPU bound.
4.2. Postmark
Postmark v1.13 Experiment:
553 file creations and deletions and 100 file transactions (47 reads + 53 appends);
read/write combinations determined by a coin toss. Total read 261.84 kB and
total write 2.87 MB
Table 2 Postmark
                    Total test   File         File         Read speed   Write speed
                    time (sec)   Create/sec   Delete/sec   KB/sec       KB/sec
NTFS                4            138          138          65.44        717.97
DSF (DSS local)     31           17           17           8.45         92.64
DSF (DSS remote)    33           16           16           7.93         87.03
The experiment results are displayed in Table 2. Please note that those results
were obtained with "first cut code", not optimized and including modules where
basic function was the only goal pursued.
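As a quick cross-check of Table 2, the per-second file rates and the slowdown relative to NTFS can be recomputed from the total test times and the 553 file creations. The sketch assumes the total test time is in seconds and that the reported rates are truncated to whole numbers; the function name `summarize` is hypothetical.

```python
FILES = 553  # file creations (and deletions) in the Postmark run

# Total test time per configuration, taken from Table 2 (assumed to be seconds).
total_time = {"NTFS": 4, "DSF (DSS local)": 31, "DSF (DSS remote)": 33}


def summarize(times, files=FILES, baseline="NTFS"):
    """Return {name: (whole files per second, time relative to the baseline)}."""
    base = times[baseline]
    return {name: (files // t, round(t / base, 2)) for name, t in times.items()}


for name, (rate, slowdown) in summarize(total_time).items():
    print(f"{name}: {rate} files/sec, {slowdown}x NTFS time")
```

Under these assumptions the computed rates (138, 17 and 16 files per second) agree with the Create/sec and Delete/sec columns, and the unoptimized DSF code takes roughly eight times as long as NTFS on this workload.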
5. Discussion
The DSF attempts to build a serverless storage access system that distributes all
aspects of storage management over cooperating machines interconnected by a
fast, switched network. The system should scale from two to several hundred
machines, using commodity components (similar to the xFS goals).
DSF attempts to outperform xFS by using a dynamic (rather than xFS's static)
management distribution policy.
DSF attempts to provide better reliability than xFS by employing a simplified
recovery mechanism, based on metadata shadowing (rather than xFS's log based
mechanism), and by carrying out directory operations as atomic transactions at
the storage level.
DSF is similar to Frangipani in using logical volumes to hide the distributed
nature of the storage system from its clients.
It outperforms Frangipani by employing dynamic file management, block level
cache synchronization, and cooperative caching. DSF is designed for a higher level
of scalability than the cluster based file systems: up to several hundreds of
commodity workstations.
Cooperative caching: Like xFS, DSF clients are assumed to contribute main
memory and CPU cycles to support the cooperative caching operations
triggered by neighboring DSF clients.
Taken together, DSF has advantages over existing systems. It provides extensible
distributed disk management, it moves functionality (e.g. allocation) down to
lower levels (disk), and it provides stable storage and dynamic assignment of file
managers.
6. Conclusions
We have presented the Data Sharing Facility system that is based on a novel
design:
Use of logical volumes whereby logical addresses span several disks, self
management of space, dynamic distribution of file management across multiple
nodes (machines) for dynamic load balancing, cooperative caching, and stable and
transactional storage in low layers (close to storage).
Since there are no dedicated machines, any machine may assume the
responsibilities of a failed component, resulting in improved fault-tolerance.
Based on these design principles the DSF has the potential of improved features
and properties in terms of scalability and recoverability over existing file systems.
7. References
[ 1] "NFS Version 3 design and Implementation" Brian
Pawlowski, Chet Juszczak, Peter Staubach, Carl Smith, Diane
Lebel and David Hintz, Proceedings of summer USENIX
Conference, pp. 137-152, June 1994.
[ 2] "Design and Implementation of the Sun Network File
System" Russel Sandberg, David Goldberg, Steve Kleiman, Dan
Walsh, and Bob Lyon, Proceedings of summer USENIX Conference,
pp. 119-130, June 1985.
[ 3] "Scale and Performance in a Distributed File System" J.H.
Howard, M.L. Kazar, S.G. Menees, D.A. Nichols, M. Satyanarayanan,
R.N. Sidebotham, and M.J. West, ACM Trans. on Computer Systems,
Vol. 6, No. 1, Feb. 1988, pp. 51-81.
[ 4] "VAXclusters: A Closely-Coupled Distributed System" N.
Kronenberg, H. Levy, and W. Strecker. ACM Transactions on
Computer Systems, May 1986.
[ 5] "The Design and Implementation of a Distributed File
System" Andrew C. Goldstein Digital Technical Journal, 1(5)
pp. 45-55, September 1987.
[ 6] "Frangipani: A Scalable Distributed File System"
Chandramohan A. Thekkath, Timothy Mann, and Edward K. Lee,
Digital Systems Research Center, 16th SOSP Conference.
[ 7] "Petal: Distributed Virtual Disks" E.K. Lee and C.A.
Thekkath, ASPLOS, October 1996.
[ 8] "Serverless Network File Systems". T. Anderson, M.
Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang. ACM
Transactions on Computer Systems, February 1996.
[ 9] "Self-Stabilization" S. Dolev, the MIT Press, 208 pages,
March 2000.
[10] "Communication Adaptive Self-Stabilizing Group
Communication" Dolev S., and Schiller E., Technical Report
#2000-02, Department of Mathematics and Computer Science,
Ben-Gurion University, Beer-Sheva, Israel, July 2000.
[11] "Efficient Cooperative Caching Using Hints" P. Sarkar and
J. Hartman, USENIX Conference on Operating Systems Design and
Implementation, October 1996.
[12] "Cooperative Caching: Using Remote Client Memory to
Improve File System Performance" M.D. Dahlin, R.Y. Wang, T.E.
Anderson, and D.A. Patterson. OSDI, November 1994.
[13] "NFS Version 4 Design Considerations" Sun Microsystems,
Inc. June 1999 URL: http://www.landfield.com/rfcs/rfc2624.html
[14] Wiebren de Jonge, M. Frans Kaashoek and Wilson C. Hsieh.
“The logical disk: a new approach to improve file systems”.
Proc. 14th Symp. on Operating Systems Principles, pages 15-28,
Dec. 1993.