NASA/CP—2004–212750

NASA/IEEE MSST 2004
Twelfth NASA Goddard Conference on Mass Storage Systems and Technologies
in cooperation with the Twenty-First IEEE Conference on Mass Storage Systems and Technologies

Edited by
Ben Kobler, Goddard Space Flight Center, Greenbelt, Maryland
P C Hariharan, Systems Engineering and Security, Inc., Greenbelt, Maryland

Proceedings of a conference held at The Inn and Conference Center, University of Maryland, University College, Adelphi, Maryland, USA, April 13-16, 2004

National Aeronautics and Space Administration
Goddard Space Flight Center
Greenbelt, Maryland 20771

April 2004

The NASA STI Program Office … in Profile

Since its founding, NASA has been dedicated to the advancement of aeronautics and space science. The NASA Scientific and Technical Information (STI) Program Office plays a key part in helping NASA maintain this important role.

The NASA STI Program Office is operated by Langley Research Center, the lead center for NASA’s scientific and technical information. The NASA STI Program Office provides access to the NASA STI Database, the largest collection of aeronautical and space science STI in the world. The Program Office is also NASA’s institutional mechanism for disseminating the results of its research and development activities. These results are published by NASA in the NASA STI Report Series, which includes the following report types:

• TECHNICAL PUBLICATION. Reports of completed research or a major significant phase of research that present the results of NASA programs and include extensive data or theoretical analysis. Includes compilations of significant scientific and technical data and information deemed to be of continuing reference value. NASA’s counterpart of peer-reviewed formal professional papers, but with less stringent limitations on manuscript length and extent of graphic presentations.

• TECHNICAL MEMORANDUM. Scientific and technical findings that are preliminary or of specialized interest, e.g., quick release reports, working papers, and bibliographies that contain minimal annotation. Does not contain extensive analysis.

• CONTRACTOR REPORT. Scientific and technical findings by NASA-sponsored contractors and grantees.

• CONFERENCE PUBLICATION. Collected papers from scientific and technical conferences, symposia, seminars, or other meetings sponsored or cosponsored by NASA.

• SPECIAL PUBLICATION. Scientific, technical, or historical information from NASA programs, projects, and missions, often concerned with subjects having substantial public interest.

• TECHNICAL TRANSLATION. English-language translations of foreign scientific and technical material pertinent to NASA’s mission.

Specialized services that complement the STI Program Office’s diverse offerings include creating custom thesauri, building customized databases, organizing and publishing research results . . . even providing videos.
For more information about the NASA STI Program Office, see the following:

• Access the NASA STI Program Home Page at http://www.sti.nasa.gov/STI-homepage.html
• E-mail your question via the Internet to help@sti.nasa.gov
• Fax your question to the NASA Access Help Desk at (301) 621-0134
• Telephone the NASA Access Help Desk at (301) 621-0390
• Write to: NASA Access Help Desk, NASA Center for AeroSpace Information, 7121 Standard Drive, Hanover, MD 21076-1320

Available from:
NASA Center for AeroSpace Information, 7121 Standard Drive, Hanover, MD 21076-1320 (Price Code: A17)
National Technical Information Service, 5285 Port Royal Road, Springfield, VA 22161 (Price Code: A10)

Preface

MSST2004, the Twelfth NASA Goddard / Twenty-first IEEE Conference on Mass Storage Systems and Technologies, has as its focus long-term stewardship of globally-distributed storage. The increasing prevalence of e-anything, brought about by widespread use of applications based, among others, on the World Wide Web, has contributed to rapid growth of online data holdings. A study¹ released by the School of Information Management and Systems at the University of California, Berkeley, estimates that over 5 exabytes of data was created in 2002. Almost 99 percent of this information originally appeared on magnetic media. The theme for MSST2004 is therefore both timely and appropriate. There have been many discussions about rapid technological obsolescence, incompatible formats and inadequate attention to the permanent preservation of knowledge committed to digital storage. Tutorial sessions at MSST2004 detail some of these concerns, and steps being taken to alleviate them. Over 30 papers deal with topics as diverse as performance, file systems, and stewardship and preservation. A number of short papers, extemporaneous presentations, and works in progress will detail current and relevant research on the MSST2004 theme. Our thanks go to the researchers, authors, and the Program Committee for their zeal and energy in putting together an interesting agenda.

P C Hariharan and Nabil Adam, Publications Chairs
Ben Kobler, Conference Chair

¹ http://www.sims.berkeley.edu/research/projects/how-much-info-2003/execsum.htm

Organizing Committee

Conference and Program Committee Chair
  Ben Kobler, NASA Goddard Space Flight Center

Program Committee
  Ahmed Amer, University of Pittsburgh
  Curtis Anderson, Mendocino Software
  Jean-Jacques Bedet, SSAI
  John Berbert, NASA Goddard Space Flight Center
  Randal Burns, Johns Hopkins University
  Robert Chadduck, NARA
  Jack Cole, US Army Research Laboratory
  Bob Coyne, IBM
  Jim Finlayson, Department of Defense
  Dirk Grunwald, University of Colorado
  Bruce K. Haddon, Sun Microsystems
  Gene Harano, National Center for Atmospheric Research
  P C Hariharan, SES
  Jim Hughes, StorageTek
  Merritt Jones, MITRE
  Ben Kobler, NASA Goddard Space Flight Center
  Steve Louis, Lawrence Livermore National Laboratory
  Ethan Miller, University of California, Santa Cruz
  Alan Montgomery, Department of Defense
  Reagan Moore, San Diego Supercomputer Center
  Bruce Rosen, NIST
  Paul Rutherford, ADIC
  Tom Ruwart, I/O Performance
  Julian Satran, IBM Haifa Research Laboratory, Israel
  Donald Sawyer, NASA Goddard Space Flight Center
  Rodney Van Meter, Keio University, Japan

Keynote and Invited Papers
  P C Hariharan, SES

Tutorial Chair
  Jim Hughes, StorageTek

Vendor Expo Chair
  Gary Sobol, StorageTek

Publications Chairs
  P C Hariharan, SES
  Nabil Adam, Rutgers University

Work In Progress Chairs
  Ethan Miller, University of California, Santa Cruz
  Randal Burns, Johns Hopkins University

Publicity Chairs
  Jack Cole, U.S. Army Research Laboratory
  Sam Coleman, Lawrence Livermore National Laboratory, Retired

IEEE Computer Society Liaison
  Merritt Jones, MITRE

Table of Contents

Parallel-Track Tutorials
Data Grid Management Systems . . . 1
  Reagan Moore, Arun Jagatheesan, Arcot Rajasekar, Michael Wan, and Wayne Schroeder, San Diego Supercomputer Center
Long-Term Stewardship of Globally-Distributed Representation Information . . . 17
  David Holdsworth and Paul Wheatley, Leeds University
Fibre Channel and IP SAN Integration . . . 31
  Henry Yang, McData Corporation
Challenges in Long-Term Data Stewardship . . . 47
  Ruth Duerr, Mark A. Parsons, Melinda Marquis, Rudy Dichtl, and Teresa Mullins, National Snow and Ice Data Center (NSIDC)

Long Term Preservation Stewardship - Chair, Jack Cole
NARA’s Electronic Records Archive (ERA) - The Electronic Records Challenge . . . 69
  Mark Huber, American Systems Corporation, Alla Lake, Lake Information Systems, LLC, and Robert Chadduck, National Archives and Records Administration
Preservation Environments . . . 79
  Reagan Moore, San Diego Supercomputer Center
Data Management as a Cluster Middleware Centerpiece . . . 93
  Jose Zero, David McNab, William Sawyer, and Samson Cheung, Halcyon Systems, Daniel Duffy, Computer Sciences Corporation, and Richard Rood, Phil Webster, Nancy Palm, Ellen Salmon, and Tom Schardt, NASA NCCS, Goddard Space Flight Center

Performance - Chair, Randal Burns
Regulating I/O Performance of Shared Storage with a Control Theoretical Approach . . . 105
  Han Deok Lee, Young Jin Nam, Kyong Jo Jung, Seok Gan Jung, and Chanik Park, Pohang University of Science and Technology, Republic of Korea
SAN and Data Transport Technology Evaluation at the NASA Goddard Space Flight Center . . . 119
  Hoot Thompson, Patuxent Technology Partners, LLC
File System Workload Analysis for Large Scientific Computing Applications . . . 139
  Feng Wang, Qin Xin, Bo Hong, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long, University of California, Santa Cruz, and Tyce T. McLarty, Lawrence Livermore National Laboratory

Short Papers - Chair, Robert Chadduck/Fynnette Eaton
V:Drive - Costs and Benefits of an Out-of-Band Storage Virtualization System . . . 153
  André Brinkmann, Michael Heidebuer, Friedhelm Meyer auf der Heide, Ulrich Rückert, Kay Salzwedel, and Mario Vodisek, Paderborn University
Identifying Stable File Access Patterns . . . 159
  Purvi Shah and Jehan-François Pâris, University of Houston, Ahmed Amer, University of Pittsburgh, and Darrell D. E. Long, University of California, Santa Cruz
An On-Line Back-Up Function for a Clustered NAS System (X-NAS) . . . 165
  Yoshiko Yasuda, Shinichi Kawamoto, Atsushi Ebata, Jun Okitsu, and Tatsuo Higuchi, Hitachi, Ltd., Central Research Laboratory, Japan
dCache, the Commodity Cache . . . 171
  Patrick Fuhrmann, DESY, Germany
Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitree to SAM-QFS . . . 177
  Ellen Salmon, Adina Tarshish, Nancy Palm, NASA Goddard Space Flight Center, Sanjay Patel, Marty Saletta, Ed Vanderlan, Mike Rouch, Lisa Burns, and Dr. Daniel Duffy, Computer Sciences Corporation, Robert Caine and Randall Golay, Sun Microsystems, Inc., and Jeff Paffel and Nathan Schumann, Instrumental, Inc.
Parity Redundancy Strategies in a Large Scale Distributed Storage System . . . 185
  John A. Chandy, University of Connecticut
Reducing Storage Management Costs via Informed User-Based Policies . . . 193
  Erez Zadok, Jeffrey Osborn, Ariye Shater, Charles Wright, and Kiran-Kumar Muniswamy-Reddy, Stony Brook University, and Jason Nieh, Columbia University
A Design of Metadata Server Cluster in Large Distributed Object-Based Storage . . . 199
  Jie Yan, Yao-Long Zhu, Hui Xiong, Renuga Kanagavelu, Feng Zhou, and So Lih Weon, Data Storage Institute, Singapore
An iSCSI Design and Implementation . . . 207
  Hui Xiong, Renuga Kanagavelu, Yaolong Zhu, and Khai Leong Yong, Data Storage Institute, Singapore
Quanta Data Storage: A New Storage Paradigm . . . 215
  Prabhanjan C. Gurumohan, Sai S. B. Narasimhamurthy, and Joseph Y. Hui, Arizona State University
Rebuild Strategies for Redundant Disk Arrays . . . 223
  Gang Fu, Alexander Thomasian, Chunqi Han, and Spencer Ng, New Jersey Institute of Technology
Evaluation of Efficient Archival Storage Techniques . . . 227
  Lawrence L. You, University of California, Santa Cruz, and Christos Karamanolis, Hewlett-Packard Labs
An Efficient Data Sharing Scheme for iSCSI-Based File Systems . . . 233
  Dingshan He and David H. C. Du, University of Minnesota
Using DataSpace to Support Long-Term Stewardship of Remote and Distributed Data . . . 239
  Robert Grossman, Dave Hanley, Xinwei Hong, and Parthasarathy Krishnaswamy, University of Illinois at Chicago

Infrastructure Resources - Chair, Steve Louis
Promote-IT: An Efficient Real-Time Tertiary-Storage Scheduler . . . 245
  Maria Eva Lijding, Sape Mullender, and Pierre Jansen, University of Twente, The Netherlands
The Data Services Archive . . . 261
  Rena A. Haynes and Wilbur R. Johnson, Sandia National Laboratories
Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems . . . 273
  Andy D. Hospodor and Ethan L. Miller, University of California, Santa Cruz

File Systems - Chair, Curtis Anderson
OBFS: A File System for Object-Based Storage System Devices . . . 283
  Feng Wang, Scott A. Brandt, Ethan L. Miller, and Darrell D. E. Long, University of California, Santa Cruz
Duplicate Data Elimination in a SAN File System . . . 301
  Bo Hong and Darrell D. E. Long, University of California, Santa Cruz, Demyn Plantenberg and Miriam Sivan-Zimit, IBM Almaden Research Center
Clotho: Transparent Data Versioning at the Block I/O Level . . . 315
  Michail D. Flouris, University of Toronto, and Angelos Bilas, Institute of Computer Science, Foundation for Research and Technology, Greece

Site Reports - Chair, Gene Harano
U.S. National Oceanographic Data Center Archival Management Practices and the Open Archival Information System Reference Model . . . 329
  Donald W. Collins, NOAA, National Oceanographic Data Center
Storage Resource Sharing with CASTOR . . . 345
  Olof Bärring, Ben Couturier, Jean-Damien Durand, Emil Knezo, and Sebastien Ponce, CERN, Switzerland, and Vitaly Motyakov, Institute for High Energy Physics, Russia
GUPFS: The Global Unified Parallel File System Project at NERSC . . . 361
  Greg Butler, Rei Lee, and Mike Welcome, Lawrence Berkeley National Laboratory

iSCSI and SAN - Chair, Jean-Jacques Bedet
SANSIM: A Platform for Simulation and Design of a Storage Area Network . . . 373
  Yao-Long Zhu, Chao-Yang Wang, Wei-Ya Xi, and Feng Zhou, Data Storage Institute, Singapore
Cost-Effective Remote Mirroring Using the iSCSI Protocol . . . 385
  Ming Zhang, Yinan Liu, and Qing (Ken) Yang, University of Rhode Island
Simulation Study of iSCSI-Based Storage System . . . 399
  Yingping Lu, Farrukh Noman, and David H. C. Du, University of Minnesota

Vendor Solutions - Chair, Bruce Rosen
Comparative Performance Evaluation of iSCSI Protocol over Metropolitan, Local, and Wide Area Networks . . . 409
  Ismail Dalgic, Kadir Ozdemir, Rajkumar Velpuri, and Jason Weber, Intransa, Inc., Helen Chen, Sandia National Laboratories, and Umesh Kukreja, Atrica, Inc.
H-RAIN: An Architecture for Future-Proofing Digital Archives . . . 415
  Andres Rodriguez and Dr. Jack Orenstein, Archivas, Inc.
A New Approach to Disk-Based Mass Storage Systems . . . 421
  Aloke Guha, COPAN Systems, Inc.
Multi-Tiered Storage – Consolidating the Differing Storage Requirements of the Enterprise Into a Single Storage System . . . 427
  Louis Gray, BlueArc Corporation
Managing Scalability in Object Storage Systems for HPC Linux Clusters . . . 433
  Brent Welch and Garth Gibson, Panasas, Inc.
The Evolution of a Distributed Storage System . . . 447
  Norman Margolus, Permabit, Inc.
DATA GRID MANAGEMENT SYSTEMS

Reagan W. Moore, Arun Jagatheesan, Arcot Rajasekar, Michael Wan, Wayne Schroeder
San Diego Supercomputer Center
9500 Gilman Drive, MC 0505, La Jolla, CA 92093
Tel: +1-858-534-5000, Fax: +1-858-534-5152
e-mail: {moore,arun,sekar,mwan,schroede}@sdsc.edu

Abstract: The “Grid” is an emerging infrastructure for coordinating access across autonomous organizations to distributed, heterogeneous computation and data resources. Data grids are being built around the world as the next generation data handling systems for sharing, publishing, and preserving data residing on storage systems located in multiple administrative domains. A data grid provides logical namespaces for users, digital entities and storage resources to create persistent identifiers for controlling access, enabling discovery, and managing wide area latencies. This paper introduces data grids and describes data grid use cases. The relevance of data grids to digital libraries and persistent archives is demonstrated, and research issues in data grids and grid dataflow management systems are discussed.

1 Introduction

A major challenge in the design of a generic data management system is the set of multiple requirements imposed by user communities. The amount of data is growing exponentially, both in the number of digital entities and in the size of files. The sources for data are distributed across multiple sites, with data generated in multiple administration domains, and on sites only accessible over wide-area networks. The need for discovery is becoming more important, with data assembled into collections that can be browsed. The need for preservation is becoming more important, both to meet legal data retention requirements, and to preserve the intellectual capital of organizations. In practice, six types of data handling systems are found:

1. Data ingestion systems
2. Data collection creation environments
3. Data sharing environments based on data grids
4. Digital libraries for publication of data
5. Persistent archives for data preservation
6. Data processing pipelines

The goal of a generic data management system is to build software infrastructure that can meet the requirements of each of these communities. We will demonstrate that data grids provide the generic data management abstractions needed to manage distributed data, and that all of the systems can be built upon common software.

Data management requires four basic naming conventions (or information categories) for managing data on distributed resources:

1. Resource naming for access to storage systems across administrative domains. This is used to implement data storage virtualization.
2. Distinguished user names for identifying persons across administrative domains. This is used to implement single sign-on security environments.
3. Distinguished file names for identifying files across administrative domains. This is used to implement data virtualization.
4. Context attributes for managing state information generated by remote processes. This is used to implement digital libraries and federate data grids.

A data grid [1,2] provides virtualization mechanisms for resources, users, files, and metadata.
Each virtualization mechanism implements a location- and infrastructure-independent name space that provides persistent identifiers. The persistent identifiers for data are organized as a collection hierarchy and called a “logical name space”. In practice, the logical name spaces are implemented in a separate metadata catalog for each data grid. Within a data grid, access to data, management of data, and manipulation of data are done via commands applied to the logical name space. To access data located within another data grid (another logical name space), federation mechanisms are required. To manage operations on the massive collections that are assembled, data flow environments are needed. We illustrate the issues related to data grid creation, data management, data processing, and data grid federation by examining how these capabilities are used by each of the six types of data management systems. We then present the underlying abstraction mechanisms provided by data grids, and close with a discussion of current research and development activities in data flow and federation infrastructure.

2 Data Management Systems

Data management systems provide unifying mechanisms for naming, organizing, accessing, and manipulating context (administrative, descriptive, and preservation metadata) about content (digital entities such as files, URLs, SQL command strings, directories). Each of the types of data management system approaches focuses on a different aspect, and provides specific mechanisms for data and metadata manipulation.

2.1 Data ingestion systems

The Real-time Observatories, Applications, and Data management Network (ROADNet) [3] project manages ingestion of data in real-time from sensors. The data is assembled both synchronously and asynchronously from multiple networks into an object ring buffer (ORB), where it is then registered into a data grid. Multiple object ring buffers are federated into a Virtual Object Ring Buffer (VORB) to support discovery and attribute-based queries. Typical operations are the retrieval of the last ten observations, the tracking of observations about a particular seismic event, and the migration of data from the ORB into a remote storage system.

A second type of data ingestion system is a grid portal, which manages interactions with jobs executing in a distributed environment. The portal provides access to collections for input data, and stores output results back into a collection. A grid portal uses the Grid Security Infrastructure to manage inter-realm authentication between compute and storage sites.

2.2 Data collection creation environments

Scientific disciplines are assembling data collections that represent the significant digital holdings within their domain. Each community is organizing the material into a coherent collection that supports uniform discovery semantics, uniform data models and data formats, and an assured quality. The National Virtual Observatory (NVO) [4] is hosting multiple sky survey image collections. Each collection is registered into a logical name space to provide a uniform naming convention, and standard metadata attributes are used to describe the sky coverage of each image, the filter that was used during observation, the date the image was taken, etc. Collection formation is facilitated by the ability to register the descriptive metadata attributes onto a logical name space. The logical name space serves as the key for correlating the context with each image.
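To make the idea concrete, the following is a minimal sketch of how a logical name space can map collection-oriented logical paths to physical replicas while carrying descriptive attributes for discovery. It is purely illustrative; the class, paths and attribute names are hypothetical and are not the SRB or MCAT interfaces.

```python
# Minimal sketch of a logical name space: logical paths map to physical
# replicas and carry descriptive attributes used for discovery.
# Hypothetical illustration only, not the SRB/MCAT API.

class LogicalNameSpace:
    def __init__(self):
        # logical path -> {"replicas": [...], "attributes": {...}}
        self.entries = {}

    def register(self, logical_path, physical_url, attributes=None):
        """Register a physical replica under a logical, location-independent name."""
        entry = self.entries.setdefault(logical_path,
                                        {"replicas": [], "attributes": {}})
        entry["replicas"].append(physical_url)
        entry["attributes"].update(attributes or {})

    def query(self, **criteria):
        """Attribute-based discovery: return logical names whose context matches."""
        return [path for path, e in self.entries.items()
                if all(e["attributes"].get(k) == v for k, v in criteria.items())]


ns = LogicalNameSpace()
ns.register("/nvo/2mass/image_0001.fits",
            "hpss://archive.site-a.example/2mass/0001.fits",
            {"survey": "2MASS", "filter": "J", "date": "1999-06-12"})
ns.register("/nvo/2mass/image_0001.fits",        # second replica, same logical name
            "file://disk.site-b.example/cache/0001.fits")
print(ns.query(survey="2MASS", filter="J"))      # discovery by context, not location
```

The point of the sketch is that users name and discover data through the logical path and its attributes; the physical replicas behind a name can move or multiply without changing how the data is addressed.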
Other examples include GAMESS [5], computational chemistry data collections of simulation output, and CEED, the Caveat Emptor Ecological Data Repository. Both projects are assembling collections of data for their respective communities.

2.3 Data sharing environments based on data grids

Scientific disciplines promote the sharing of data. While collections are used to organize the content, data grids are used to manage content that is distributed across multiple sites. The data grid technology provides the logical name space for registering files, the inter-realm authentication mechanisms, the latency management mechanisms, and support for high-speed parallel data transfers. An example is the Joint Center for Structural Genomics data grid, which generates crystallographic data at the Stanford Linear Accelerator and pushes the data to SDSC for storage in an archive and analysis of protein structures. The Particle Physics Data Grid [6] federates collections housed at Stanford and Lyon, France, for the BaBar high energy physics experiment [7]. The Biomedical Informatics Research Network (BIRN) [8] is using a data grid to share data from multiple Magnetic Resonance Imaging laboratories. Each project implements access controls that are applied on the distributed data by the data grid, independently of the underlying storage resource.

2.4 Digital libraries for publication of data

An emerging initiative within digital libraries is support for standard digital reference sets that can be used by an entire community. The standard digital reference sets are created from observational data collections, or from simulations, and are housed within the digital library. Curation methods are applied to assure data quality. Discovery mechanisms are supported for attribute-based query. An example is the HyperAtlas catalog that is being created for the 2MASS and DPOSS astronomy sky surveys [4,19,20]. The catalog projects each image to a standard reference frame, organizes the projections into an atlas of the sky, and supports discovery through existing sky catalogs of stars and galaxies.

The organizations participating in a collaboration may share their digital entities using a collective logical view. Multiple logical views could be created for the same set of distributed digital entities. These logical views may be based on different taxonomies or business rules that help in the categorization of the data. The organizations, apart from sharing the physical data, could also share the logical views as publication mechanisms.

The library community applies six data management processes to digital entities:

1. Collection building to organize digital entities for access
2. Content management to store each digital image
3. Context management to define descriptive metadata
4. Curation processes to validate the quality of the collection
5. Closure analyses to assert completeness of the collection and the ability to manipulate digital entities within the collection
6. Consistency processes to assure that the context is updated correctly when operations are performed on the content

2.5 Persistent archives for data preservation

The preservation community manages archival collections for time periods that are much longer than the lifetimes of the underlying infrastructure [9,10,11,25,26,27]. The principal concern is the preservation of the authenticity of the data, expressed as an archival context associated with each digital entity, and the management of technology evolution.
As new, more cost-effective storage repositories become available, and as new encoding formats appear for data and metadata, the archival collection is migrated to the new standard. This requires the ability to make replicas of data on new platforms, provide an access abstraction to support new access mechanisms that appear over time, and migrate digital entities to new encoding formats or emulate old presentation applications. Example projects include a persistent archive for the National Science Digital Library, and a persistent archive prototype for the National Archives and Records Administration [12,13]. Both projects manage data that is stored at multiple sites, replicate data onto different types of storage repositories, and support both archival and digital library interfaces.

The preservation community applies its own standard processes to data:

1. Appraisal – the decision on whether a digital entity is worth preserving
2. Accession – the controlled process under which digital entities are brought into the preservation environment
3. Arrangement – the association of digital entities with a record group or record series
4. Description – the creation of an archival context specifying provenance information
5. Preservation – the creation of an archival form for each digital entity and its storage
6. Access – the support for discovery services

2.6 Data Processing Pipelines

The organization of collections, registration of data into a data grid, curation of data for ingestion into a digital library, and preservation of data through application of archival processes all need the ability to apply data processing pipelines. The application of processes is a fundamental operation needed to automate data management tasks. Scientific disciplines also apply data processing pipelines to convert sensor data to a standard representation, apply calibrations, and create derived data products. Examples are the Alliance for Cell Signaling digital library, which applies standard data analysis techniques to interpret each cell array, and the NASA Earth Observing Satellite system, which generates derived data products from satellite observations. Data processing pipelines are also used to support knowledge generation. The steps are:

1. Apply semantic labels to features detected within a digital entity
2. Organize detected features for related digital entities within a collection
3. Identify common relationships (structural, temporal, logical, functional) between detected features
4. Correlate identified relationships to physical laws

3 Generic requirements

Multiple papers that summarize the requirements of end-to-end applications have been generated in the Global Grid Forum [14]. They range from descriptions of the remote operations that are needed to manage large collections, to the abstraction mechanisms that are needed for preservation. The capabilities can be separated into four main categories: context management, data management, access mechanisms, and federation mechanisms. These capabilities are provided by data grids:

Context management mechanisms
• Global persistent identifiers for naming files
• Organization of context as a collection hierarchy
• Support for administrative metadata to describe the location and ownership of files
• Support for descriptive metadata to support discovery through query mechanisms
• Support for browsing and queries on metadata
• Information repository abstraction for managing collections in databases

Data management mechanisms
• Storage repository abstraction for interacting with multiple types of storage systems
• Support for the registration of files into the logical name space
• Inter-realm authentication system for secure access to remote storage systems
• Support for replication of files between sites
• Support for caching onto a local storage system and for accessing files in an archive
• Support for aggregating files into containers

Access mechanisms
• Standard access mechanisms: Web browsers, Unix shell commands, Windows browsers, Python scripts, Java, C library calls, Linux I/O redirection, WSDL, etc.
• Access controls and audit trails to control and track data usage
• Support for the execution of remote operations for data sub-setting, metadata extraction, indexing, third-party data movement, etc.
• Support for bulk data transfer of files, bulk metadata transfer, and parallel I/O

Federation mechanisms
• Cross-registration of users between data grids
• Cross-registration of storage resources between data grids
• Cross-registration of files between data grids
• Cross-registration of context between data grids

Data grids provide transparency and abstraction mechanisms that enable applications to access and manage data as if they were local to their home system. Data grids are implemented as federated client-server middleware that use collections to organize distributed data.

4 Data Grid Implementation

An example of a data grid is the Storage Resource Broker (SRB) from the San Diego Supercomputer Center [15,16,17]. The SRB manages context (administrative, descriptive, and preservation metadata) about content (digital entities such as files, URLs, SQL command strings, directories). The content may be distributed across multiple types of storage systems across independent administration domains. By separating the context management from the content management, the SRB easily provides a means for managing, querying, accessing, and preserving data in a distributed data grid framework. Logical name spaces describe storage systems, digital file objects, users, and collections. Context is mapped to the logical name spaces to manage replicas of data, authenticate users, control access to documents and collections, and audit accesses. The SRB manages the context in a Metadata Catalog (MCAT) [18], organized as a collection hierarchy. The SRB provides facilities to associate user-defined metadata, both free-form attribute-based metadata and schema-based metadata, at the collection and object level, and to query them for access and semantic discovery. The SRB supports queries on descriptive attributes [21]. The SRB provides specific features needed to implement digital libraries, persistent archive systems [10,11,12,13] and data management systems [22,23,24].

The Storage Resource Broker in earlier versions used a centralized MCAT [18] for storing system-level and application-level metadata. Though essential for consistent operations, the centralized MCAT poses a problem: it can be considered a single point of failure as well as a potential bottleneck for performance.
Moreover, when users are widely distributed, users remote from the MCAT may see latency unacceptable for interactive data access. In order to mitigate these problems, the SRB architecture has been extended to a federated environment, called zoneSRB. The zoneSRB architecture provides a means for multiple context catalogs to interact with each other on a peer-to-peer basis and synchronize their data and metadata. Each zoneSRB system can be autonomous, geographically distant, and administer a set of users, resources and data that may or may not be shared by another zoneSRB. Each zoneSRB has its own MCAT providing the same level of features and facilities as the older SRB system. The main advantage of the zoneSRB system is that there is now no single point of failure, as multiple MCATs can be federated into a multi-zoneSRB system. Users can be distributed across the zones to improve quality of performance and minimize access latencies to geographically distant metadata catalogs. The multiple zoneSRBs can share metadata and data based on policies established by the collaborating administrators. The level of collaboration can be varied to specify how much of the information is shared, partitioned or overlapped, and whether the interactions are controlled by the users or the zone administrators. More information about the SRB can be found in [15,16,17,18].

In a nutshell, the SRB provides all of the capabilities listed as generic requirements. The SRB provides interoperability mechanisms that map users, datasets, collections, resources and methods to global namespaces. It also provides abstractions for data management functionality such as file creation, access, authorization, user authentication, replication and versioning, and provides a means to associate metadata and annotation with data objects and collections of data objects. Descriptive metadata is used for searching at the semantic level and discovery of relevant data objects using the attribute-based discovery paradigm. Figure 1 provides details about the modules that form the core of the SRB services.

[Figure 1. Storage Resource Broker. The figure shows client interfaces (C, C++ and Java libraries, Linux I/O, Unix shell, DLL/Python, Perl, NT browsers, OAI, WSDL, OGSA, HTTP) layered over federation management, consistency and metadata management, authorization/authentication/audit, the logical name space, latency management, and data and metadata transport, with a catalog abstraction over databases (DB2, Oracle, Sybase, SQLServer, Postgres, mySQL, Informix) and a storage abstraction over archives (tape, HPSS, ADSM, UniTree, DMF, CASTOR, ADS), object ring buffers, and file systems (Unix, NT, Mac OS X).]

The SRB server is built using three layers. The top layer is a set of server access APIs written in C and Java that are used to provide multiple client interfaces. The middle layer provides the intelligence for collection-based management, federation of data grids, data movement, authentication, and authorization. The bottom layer is a set of storage drivers and MCAT database drivers that are used to connect to diverse resources. These drivers have a well-defined and published API such that a new storage resource or data server system can be integrated easily into the SRB system. For example, custom interfaces to special drivers such as the Atlas Data Store System and CERN’s CASTOR storage systems were written in just a few days.
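As an illustration of that layering, the following is a minimal sketch of what a pluggable storage-driver abstraction of this kind can look like; the class and method names are hypothetical and are not the published SRB driver API.

```python
# Hypothetical sketch of a pluggable storage-driver abstraction.
# Each back end (file system, archive, ...) implements the same small
# interface, so the middle layer never deals with storage specifics.
import os
from abc import ABC, abstractmethod

class StorageDriver(ABC):
    @abstractmethod
    def create(self, path: str) -> None: ...
    @abstractmethod
    def read(self, path: str, offset: int, length: int) -> bytes: ...
    @abstractmethod
    def write(self, path: str, offset: int, data: bytes) -> None: ...
    @abstractmethod
    def unlink(self, path: str) -> None: ...

class UnixFileDriver(StorageDriver):
    """Driver for an ordinary POSIX file system."""
    def create(self, path):
        open(path, "wb").close()
    def read(self, path, offset, length):
        with open(path, "rb") as f:
            f.seek(offset)
            return f.read(length)
    def write(self, path, offset, data):
        with open(path, "r+b") as f:
            f.seek(offset)
            f.write(data)
    def unlink(self, path):
        os.remove(path)

# An archive or tape driver would implement the same four methods; the server
# picks the driver from the registered type of the target storage resource.
DRIVERS = {"unix file system": UnixFileDriver}
```

The design point is that once a back end supplies these few operations, everything above it (logical names, replication, access control, auditing) works unchanged.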
The SRB drivers for storage include HPSS, ADSM, Unix/Linux/Mac OS X and NT file systems, and the database drivers include DB2, Oracle, Postgres, Informix, and Sybase. The middle layer of the SRB system supports the federation consistency and control mechanisms needed to integrate multiple data grids, and encapsulates much of the intelligence needed to manage a data grid. This layer interacts with the MCAT to access the context needed to control each type of data service. The data grid technology based on the SRB that is in production use at SDSC currently manages over 90 Terabytes of data, comprising over 16 million files.

5 Data Grid Federation

The specification of appropriate federation mechanisms is still a research project. The federation of multiple data grids basically imposes constraints on the cross-registration of users, resources, files, and context. The constraints may be invoked by either a user or automatically implemented by the data management system. The constraints may be set for either no cross-registration, partial cross-registration, or complete cross-registration. It is possible to identify over 1500 possible federation approaches by varying the type of cross-registration constraints that are imposed. In practice, ten of the approaches are either in active use or proposed for use within scientific projects. Each data grid is called a zone, with its own metadata catalog managing the logical name spaces for its users, resources, files, and context. The approaches include:

5.1 Occasional Interchange

This is the simplest model, in which two or more zones operate autonomously with very little exchange of data or metadata. The two zones exchange only user-ids for those users who need access across zones. Most of the users stay in their own zone, accessing resources and data that are managed by their zone MCAT. Inter-zone users will occasionally cross zones, browsing collections, querying metadata and accessing files for which they have read permission. These users can store data in remote zones if needed, but these objects are only accessible to users in the other zones. This model provides the greatest degree of autonomy and control. The cross-zone user registration is done not for every user from a zone but only for selected users. The local SRB administrator controls who is given access to the local SRB system and can restrict these users from creating files in the local SRB resources. (NPACI-driven federation model [28])

5.2 Replicated Catalog

In this model, even though there are multiple MCATs managing independent zones, the overall system behaves as though it were a single zone with replicated MCATs. Metadata about the tokens being used, users, resources, collections, containers and data objects are all synchronized between all MCATs. Hence, the view from every zone is the same. An object created in a zone is registered as an object in all other sister zones, and any associated metadata is also replicated. This model provides a completely replicated system that has a high degree of fault-tolerance for MCAT failures. The user can still access data even if their local MCAT becomes non-functional. The degree of synchronization, though very high in principle, is limited in practice. The MCATs may be out of synchronization on newly created data and metadata. The periodicity of synchronization is decided by the cooperating administrators and can be as long as days if the systems change slowly.
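A minimal sketch of the kind of delayed catalog synchronization this model implies is shown below; the catalog structures are hypothetical, not the MCAT schema. Each zone pushes entries created since the last exchange to its peer, and entries whose logical name was created independently in both zones are set aside for the administrators to resolve.

```python
# Hypothetical sketch of delayed catalog synchronization between two zones.
# Each catalog maps a logical name to a record carrying the zone of origin
# and a creation timestamp.

def synchronize(source_catalog, target_catalog, last_sync_time):
    """Push entries created in `source` since the last sync into `target`.

    Returns the logical names that clash (already registered in the target
    from a different zone); these need resolution by the cooperating
    zone administrators.
    """
    clashes = []
    for logical_name, record in source_catalog.items():
        if record["created"] <= last_sync_time:
            continue                      # already exchanged in an earlier sync
        if logical_name in target_catalog and \
                target_catalog[logical_name]["zone"] != record["zone"]:
            clashes.append(logical_name)  # same name created independently
        else:
            target_catalog[logical_name] = dict(record)
    return clashes

zone_a = {"/home/projA/run1.dat": {"zone": "A", "created": 100}}
zone_b = {"/home/projA/run1.dat": {"zone": "B", "created": 101}}
print(synchronize(zone_a, zone_b, last_sync_time=50))   # -> ['/home/projA/run1.dat']
```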
An important point to note is that, because of these delayed synchronizations, one might have occasional logical name clashes. For example, a data object with the same name and in the same collection might be created in two zones at almost the same time. Because of delayed synchronization, both will be allowed in their respective zones. But when the synchronization is attempted, the system will see a clash when registering across zones. The resolution of this has to be done by mutual policies set by the cooperating administrators. In order to avoid such clashes, policies can be instituted with clear lines of partitioning about where one can create a new file in a collection. (NARA federation model [12])

5.3 Resource Interaction

In this model, resources are shared by more than one zone and hence can be used for replicating data. This model is useful if the zones are electronically distant but want to make it easier for users in the sister zone to access data that might be of mutual interest. A user in a zone replicates data into the shared resources (using either synchronous or asynchronous replication, as done in a single zone). Then the metadata of these replicated objects is synchronized across the zones. User names need not be completely synchronized. (BIRN federation model [8])

5.4 Replicated Data Zones

In this model, two or more zones work independently but maintain the same data across zones, i.e., they replicate data and related metadata across zones. In this case, the zones are truly autonomous and do not allow users to cross zones. In fact, user lists and resources are not shared across zones. But data stored in one zone is copied into another zone, along with related metadata, by a user who has accounts in the sister zones. This method is very useful when two zones operating across a wide-area network have to share data and the network delay in accessing data across the zones has to be reduced. (BaBar federation model [7])

5.5 Master-Slave Zones

This is a variation of the 'Replicated Data Zones' model in which new data is created at a master site and the slave sites synchronize with the master site. The user list and resource list are distinct across zones. The data created at the master are copied over to the slave zone. The slave zone can create additional derived objects and metadata, but these may
The user during his times of non-connectedness can create new data and metadata. The user on connecting to the parent zone will then synchronize and exchange new data and metadata across the user-zone and the parent zone. This model is useful for users who have their own zones on laptops. It is also useful for zones that are created for ships and nomadic scientists in the field who periodically synchronize with a parent zone. (SIOExplorer federation model) 5.9 Free-floating Zones – myZone This is a variation of the 'Nomadic Zone' model having multiple stand-alone zones but no parent zone. These zones can be considered peers and possibly have very few users and resources. These zones can be seen as isolated systems running by themselves (like a PC) without any interaction with other zones, but with a slight difference. These zones occasionally "talk" to each other and exchange data and collections. This is similar to what happens when we exchange files using zip drives or CDs or as occasional network neighbors. This system has a good level of autonomy and isolation with controlled data sharing. (Peer-to-peer or Napster federation model) 5.10 Archival Zone, BackUp Zone In this model, there can be multiple zones with an additional zone called the archive. The main purpose of this is to create an archive of the holdings of the first set of zones. Each zone in the first set can designate the collections that need to be archived. This provides for backup copies for zones which by themselves might be fully running on spinning disk. (SDSC backup federation model, NASA backup federation model [29]) 10 6 Grid Dataflow A second research area is support for data flow environments, in which state information is kept about the processing steps that have been applied to each digital entity in a work set. 6.1 Need for peer-to-peer Data Grid Dataflows A dataflow executes multiple tasks. Each task might require: different resources; access to different data collections for input; storage of output products onto physically distributed resources within a data grid; and disparate services that might be in the form of web/grid services or simply executables of an application. The dataflow is described in a data grid language. The dataflow is executed through a dataflow engine. Each dataflow engine needs to be able to communicate with other dataflow engines in a peerto-peer federation for coordination. This allows dynamic distributed execution, without having to specify a pre-planned schedule. Placement scheduling is still required to find the right location for execution of each task. In the data grid, the tasks in the dataflow could be executed in any of distributed resources within the participating administrative domains. In general the following factors must be considered for dataflow scheduling: • Appropriateness of a given resource for a particular task: Is there enough disk space to hold result sets, and is the compute resource powerful enough to execute the task within a desired time? Are the tasks sufficiently small that they could be processed by less powerful systems? • Management of data movement: How can the amount of data moved for both input and output files and for the executable be minimized? • Co-location of dependent tasks: How can tasks be co-located on the same administrative domain or resource to minimize coordination messages that have to be sent across the network? 6.2 Grid Dataflow System Requirements Collections of data sets can be manipulated in a Data Grid dataflow. 
Instead of creating a separate dataflow for each file, state information can be maintained about the aggregated set of files to which processes have been applied. Related issues are:

• Management of processing state: e.g., what information needs to be maintained about each process?
• Control procedures: e.g., what types of control mechanisms are needed to support loops over collections?
• Dynamic status queries: e.g., can a process detect the state of completion of other processes through knowledge of the placement schedule?

6.3 Data Grid Language

The SDSC Matrix project [33], funded by the NSF Grid Physics Network (GriPhyN) [30], the NIH Biomedical Informatics Research Network (BIRN) [8] and the NSF Southern California Earthquake Center (SCEC) [31], has developed a data grid language to describe grid dataflow. Just as SQL (Structured Query Language) is used to interact with databases, the Data Grid Language (DGL) is used to interact with data grids and dataflow environments. DGL is XML-based and uses a standard schema that describes:

• Control-based dataflow structures. These include sequential, parallel, and aggregated process execution.
• Context-based dataflow structures. These include barriers (synchronization points or milestones), "For loops" (iteration over task sets) and "For Each loops" (iteration over a collection of files).
• Event Condition Alternate Action (ECA) rules. Any event in the workflow engine, such as completion or start of a task, could be used to trigger a condition to be evaluated dynamically and execute any of the alternate dataflow actions. The conditions could be described using XQuery or any other language that would be understood by the dataflow engine. This allows other useful or simple workflow query languages to be used along with DGL.
• Variables. Both global variables and local variables can be managed for the dataflow. The variables are related to the dataflow, rather than to an individual file that is manipulated by the dataflow. Hierarchical scoping is used to restrict the use of the dataflow variables to aggregates of processes.
• Discovery. Queries from external grid processes are supported for determining the completion status of a process and the state of variables.

A simple example of the use of dataflow systems is the management of the ingestion of a collection into a data grid. The SCEC project implemented the collection ingestion as a dataflow using the data grid language and executed the dataflow using an SDSC Matrix Grid workflow engine.

6.4 SDSC Matrix Architecture

The architecture of the SDSC Matrix dataflow engine is shown in Figure 2. The components are layered on top of agents that can execute either SRB or other processes (SDSC Data Management Cyberinfrastructure, Java classes, WSDL services and other executables). The Matrix dataflow engine tracks the dataflow execution. The dataflow execution state can be queried by other applications or other dataflows that are executed by the Matrix engine. Persistence of the dataflow execution state is held in memory and exported to a relational database. Clients send DGL dataflow requests as SOAP [32] messages to the Java XML (JAXM) messaging interface (Fig. 2). The Matrix web service receives these SOAP messages and forwards the DGL requests to the Data Grid Request Processor. The Request Processor parses the DGL requests, which could be either a data grid transaction (new dataflow) or a status query on another dataflow.
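To give a feel for what such a request might contain, the following sketch assembles a DGL-style dataflow document in Python. The element and attribute names are illustrative assumptions, not the published DGL schema; only the structure (steps, iteration over a collection, a dataflow-scoped variable, a barrier) reflects the features described above.

```python
# Illustrative sketch of assembling a DGL-style dataflow request.
# Element and attribute names are hypothetical, not the actual DGL schema.
import xml.etree.ElementTree as ET

flow = ET.Element("dataflow", name="collection-ingest")

# A variable scoped to the whole dataflow, not to any single file.
var = ET.SubElement(flow, "variable", name="targetCollection", scope="global")
var.text = "/scec/simulations/run42"

# "For Each" iteration over a collection of staged files.
foreach = ET.SubElement(flow, "forEach", collection="/staging/run42/*")
ET.SubElement(foreach, "step", operation="register",
              args="into=${targetCollection}")
ET.SubElement(foreach, "step", operation="extractMetadata")

# A barrier (synchronization point) before a follow-on step.
ET.SubElement(flow, "barrier", name="all-files-registered")
ET.SubElement(flow, "step", operation="replicate", args="resource=archive")

request = ET.tostring(flow, encoding="unicode")
print(request)   # such a document would be wrapped in a SOAP message to the engine
```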
A data grid transaction request is a long-running dataflow involving the execution of multiple data management and compute-intensive processes. The Transaction Handler registers a new transaction and starts the bookkeeping and execution of the processes. The Status Query Handler is used to query the state of execution of the transaction and the variables associated with a dataflow. In Figure 2, the Matrix Engine components shown in white boxes have been implemented for a stand-alone Matrix workflow engine (version 3.1). Those in solid (black) boxes are a work in progress to provide peer-to-peer grid workflow and involve protocols to distribute the workflow. The P2P broker will be based on Sangam protocols [34] to facilitate peer-to-peer brokering between workflow engines. The protocols will be loosely coupled with resource scheduling algorithms.

[Figure 2. SDSC Matrix Grid Workflow Engine Architecture for dataflows. The figure shows JAXM/SOAP/JMS messaging interfaces and event publish-subscribe/notification feeding the Matrix Data Grid Request Processor, which drives the Transaction Handler, Status Query Handler, Flow Handler and Execution Manager, XQuery Processor, ECA rules Handler, Workflow Query Processor and Dataflow Metadata Manager, over a Matrix agent abstraction (SRB agents, SDSC data services, and agents for Java, WSDL and other executables), the Sangam P2P grid workflow broker and protocols, and a persistence (store) abstraction backed by JDBC and an in-memory store.]

7 Conclusion

Data grids provide a common infrastructure base upon which multiple types of data management environments may be implemented. Data grids provide the mechanisms needed to manage distributed data, the tools that simplify automation of data management processes, and the logical name spaces needed to assemble collections. The Storage Resource Broker data grid is an example of a system that has been successfully applied to a wide variety of scientific disciplines for management of massive collections. Current research issues include identification of the appropriate approaches for federating data grids, and the development of capable data flow processing systems for the management of data manipulation.

8 Acknowledgement

The results presented here were supported by the NSF NPACI ACI-9619020 (NARA supplement), the NSF NSDL/UCAR Subaward S02-36645, the NSF Digital Library Initiative Phase II Interlib project, the DOE SciDAC/SDM DE-FC02-01ER25486 and DOE Particle Physics Data Grid, the NSF National Virtual Observatory, the NSF Grid Physics Network, and the NASA Information Power Grid. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, the National Archives and Records Administration, or the U.S. government. The authors would also like to acknowledge other members of the SDSC SRB team who have contributed to this work, including George Kremenek, Bing Zhu, Sheau-Yen Chen, Charles Cowart, Roman Olschanowsky, Vicky Rowley and Lucas Gilbert. The members of the SDSC Matrix Project include Reena Mathew, Jon Weinberg, Allen Ding and Erik Vandekieft.

9 References

[1] Foster, I., and Kesselman, C., (1999) “The Grid: Blueprint for a New Computing Infrastructure,” Morgan Kaufmann.
[2] Rajasekar, A., M. Wan, R. Moore, T. Guptill, “Data Grids, Collections and Grid Bricks,” Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems & Technologies, April 7-10, 2003, San Diego, USA.
[3] Real-time Observatories, Applications, and Data management Network (ROADNet), http://roadnet.ucsd.edu/
[4] NVO, (2001) “National Virtual Observatory”, http://www.srl.caltech.edu/nvo/
[5] GAMESS, “General Atomic Molecular Electronic Structure Systems” Web Portal, https://gridport.npaci.edu/GAMESS/
[6] PPDG, (1999) “The Particle Physics Data Grid”, http://www.ppdg.net/, http://www.cacr.caltech.edu/ppdg/
[7] BABAR Collaboration (B. Aubert et al.), “The First Year of the BaBar Experiment at PEP-II,” 30th International Conference on High-Energy Physics (ICHEP 2000), Japan.
[8] BIRN, “The Biomedical Informatics Research Network”, http://www.nbirn.net
[9] Rajasekar, A., R. Marciano, R. Moore, (1999), “Collection Based Persistent Archives,” Proceedings of the 16th IEEE Symposium on Mass Storage Systems, 1999.
[10] Moore, R., C. Baru, A. Rajasekar, B. Ludascher, R. Marciano, M. Wan, W. Schroeder, and A. Gupta, (2000), “Collection-Based Persistent Digital Archives – Parts 1 & 2”, D-Lib Magazine, April/March 2000, http://www.dlib.org/
[11] Moore, R., A. Rajasekar, “Common Consistency Requirements for Data Grids, Digital Libraries, and Persistent Archives”, Grid Protocol Architecture Research Group draft, Global Grid Forum, April 2003.
[12] US National Archives and Records Administration, http://www.archives.gov/; see also http://www.sdsc.edu/NARA/
[13] Moore, R., C. Baru, A. Gupta, B. Ludaescher, R. Marciano, A. Rajasekar, (1999), “Collection-Based Long-Term Preservation,” GA-A23183, report to the National Archives and Records Administration, June 1999.
[14] GGF, “The Global Grid Forum”, http://www.ggf.org/
[15] SRB, “Storage Resource Broker Website”, SDSC, http://www.npaci.edu/dice/srb
[16] Rajasekar, A., Wan, M., Moore, R.W., Schroeder, W., Kremenek, G., Jagatheesan, A., Cowart, C., Zhu, B., Chen, S.Y. and Olschanowsky, R., “Storage Resource Broker – Managing Distributed Data in a Grid,” Computer Society of India Journal, special issue on SAN, 2003.
[17] Rajasekar, A., Wan, M., Moore, R.W., Jagatheesan, A. and Kremenek, G., “Real Experiences with Data Grids – Case-studies in using the SRB,” Proceedings of the 6th International Conference/Exhibition on High Performance Computing in Asia Pacific Region (HPC-Asia), December 2002, Bangalore, India.
[18] MCAT, “The Metadata Catalog”, http://www.npaci.edu/DICE/SRB/mcat.html
[19] 2-Micron All Sky Survey (2MASS), http://www.ipac.caltech.edu/2mass/
[20] Digital Palomar Observatory Sky Survey, http://www.astro.caltech.edu/~george/dposs/
[25] Thibodeau, K., “Building the Archives of the Future: Advances in Preserving Electronic Records at the National Archives and Records Administration”, U.S. National Archives and Records Administration, http://www.dlib.org/dlib/february01/thibodeau/02thibodeau.html [26] Underwood, W. E., “As-Is IDEF0 Activity Model of the Archival Processing of Presidential Textual Records,” TR CSITD 98-1, Information Technology and Telecommunications Laboratory, Georgia Tech Research Institute, December 1, 1988. [27] Underwood, W. E., “The InterPARES Preservation Model: A Framework for the Long-Term Preservation of Authentic Electronic Records”. Choices and Strategies for Preservation of the Collective Memory, Toblach/Dobbiaco Italy 25-29 June 2002, Archivi per la Storia. [28] NPACI Data Intensive Computing Environment thrust, http://www.npaci.edu/DICE/ [29] NASA Information Power Grid (IPG), (http://www.ipg.nasa.gov/) [30] GriPhyN, “The Grid Physics Network”, (http://www.griphyn.org/). [31] SCEC Web Site, Southern California Earthquake Center, (http://www.scec.org/) [32] Box, D., Ehnebuske, D., Kakivaya, G., Layman, A., Mendelsohn, N., Nielsen, H.F., Thatte, S., and Winer, D., “Simple Object Access Protocol (SOAP)” W3C Note. http://www.w3.org/TR/SOAP/ [33] SDSC Matrix Project Web site http://www.npaci.edu/DICE/SRB/matrix/ [34] Jagatheesan, A., “Architecture and Protocols for Sangam Communities and Sangam E-Services Broker,” Technical Report, (Master's Thesis) CISE Department, University of Florida, 2001 http://etd.fcla.edu/UF/anp1601/ArunFinal.pdf [35] Helal, A., Su, S.Y.W., Meng, Jei, Krithivasan, R. and Jagatheesan, R., “The Internet Enterprise,” Proceedings of Second IEEE/IPSJ Symposium on Applications and the Internet (SAINT 02), February 2002, Japan. http://www.harris.cise.ufl.edu/projects/publications/Internet_Enterprise.pdf 15 16 LONG-TERM STEWARDSHIP OF GLOBALLY-DISTRIBUTED REPRESENTATION INFORMATION David Holdsworth Information Systems Services Leeds University LS2 9JT UK +44 113 343 5401 e-mail: ecldh@leeds.ac.uk Paul Wheatley Edward Boyle Library Leeds University LS2 9JT UK +44 113 343 5830 e-mail: P.R.Wheatley@leeds.ac.uk Background Leeds was a major participant in three projects looking at digital preservation, viz Cedars [1] (jointly with the Universities of Oxford and Cambridge), CAMiLEON [2] (jointly with the University of Michigan), and the Representation and Rendering Project [3]. With this background, work is beginning on setting up a digital curation centre [4]. for UK academia. As a result of this work, we strongly favour a policy of retaining the original byte-stream (or possibly bit-stream, see below) as the master copy, and evolving representation information (including software tools) over time to guarantee continued access to the intellectual content of the preserved material. This paper attempts to justify that approach, and to argue for its technical feasibility and economic good sense. Thus we need long-term stewardship of the byte-streams, and long-term stewardship of the representation information. We use the term representation information in the sense of the OAIS model [5]. The purpose of the representation information is to give future access to the intellectual content of preserved byte-streams. Without stewardship of the representation information we would not be exercising stewardship of the preserved data. Inevitability of Change in the context of long-term Since computers were invented in the 1940s and 50s, there have many changes in the representation of data. 
The binary digit has survived as an abstraction, and in today's world the byte is a world-wide standard, although we sometimes have to call it an octet. All we can be certain of for the long-term future is that there will be further change. However, even though the technology used for representing such bits and bytes has 17 changed over time, the abstract concept lives on. Nonetheless, the uses to which those bits and bytes can be put have grown massively over the years. Our work has always taken the view that "long-term" means many decades. As digital information technology is barely 60 years old, and we have already lost all of the software from the earliest machines, we need to mend our ways. We should plan that our digital information will still be safe and accessible in 100 years. It is then likely that developments over that time will render the material safe for millennia. In short, we are talking of a time span over which all of our existing hardware technology is likely to be obsolete, and also much of the software. It is the representation information that makes the bridge between IT practices at the time of preservation, and IT practices at the time of access to the information. Abstraction is Vital We can be confident that the concept of information will survive the passage of time, and even the concept of digital information. We need to bridge the longevity of the information concept to the certain mortality of the media on which the data lives. Our approach is to ensure that everything is represented as a sequence of bytes. We have confidence that the ability to store a sequence of bytes will survive for many decades, and probably several centuries. Current technology usually does this by calling this sequence a file, and storing it in a file system. There are many files in today's computer systems that had their origins in previous systems. The challenge that remains is to maintain the ability to extract the information content of such byte-streams. The knowledge of the formats of such preserved data is itself information, and is amenable to being represented digitally, and is thus amenable to preservation by the same means as we use for the data itself. By taking this focus on the storage of a stream of bytes, we divide the problem into two. 1. Providing media for storage, and copying byte-streams from older technology to newer technology. 2. Maintaining knowledge of the data formats, and retaining the ability to process these data formats in a cost-effective manner. The OAIS representation net is the means by which the knowledge is retained. By treating all data as an abstract byte-stream at the lowest level, we have a common frame of reference in which we can record representation information, independent of any particular data storage technology, and any particular data storing institution. We have a framework in which representation information will be globally relevant. 18 Keep the Original Data We have no faith in long-lived media [6]. Our approach is always to keep the original data as an abstract byte-stream and to regard it as the master. Why? Because it is the only way to be sure that nothing is lost. Format conversion can lose data through moving to a representation incapable of handling all the properties of the original. It can also lose data through simple software error in the conversion process that goes undetected until it is too late to read the previous data. One of us (DH) has personal experience of both situations. 
One in which the data was damaged, and one in which potential damage was avoided by keeping the original and producing a format conversion tool. How? We certainly cannot preserve the medium upon which the data is stored. In Cedars we developed the concept of an underlying abstract form which enabled us to convert any digital object into a byte-stream from which we could regenerate the significant properties of the original. Our approach is to preserve this byte-stream indefinitely, copying it unchanged as storage technology evolves. The question then remains as to how we continue to have access to the intellectual content (another Cedars phrase) of the data, and not merely a stream of bytes. Our answer to this is that we evolve the representation information over time so that it provides us with the means to transform our original into a form that can be processed with the tools current at the time of access. We believe that our work in the CAMiLEON project has shown this to be feasible in the case of a very difficult original digital object of great historical importance. Using emulation we successfully preserved the accessibility of the BBC's "Domesday" project, see below and [16]. The very essence involves identifying appropriate abstractions, and then using them as the focus of the rendering software. We achieve longevity by arranging that the rendering software is implemented so as remain operational over the decades. The application of our approach to emulation is covered in Emulation, Preservation and Abstraction [7]. We have also investigated the same technique of retention of the original binary data coupled with evolving software tools in the context of format migration [8]. Format Conversion — when? It is obvious that when data is to be accessed some time after its initial collection, the technology involved in this access will differ markedly from that in use when data collection took place. There is also the real possibility that other technologies have been and gone in the interim. Thus, format conversion is inevitable. For data held in currently common formats, the amount of representation information needed is trivial. Meaningful access to the data normally happens at the click of a mouse. 19 A current computer platform will render a PDF file merely by being told that the format is PDF. Conversely, faced with an EBCDIC file of IBM SCRIPT mark-up, the same current platform might well render something with little resemblance to the original, whereas back in 1975, the file could be rendered as formatted text with minimal formality. However, if we have representation information for IBM SCRIPT files that points us at appropriate software for rendering the file contents on current platforms, the historic data becomes accessible to today's users. Alternatively, we could have converted all the world's IBM SCRIPT files into Word-for-Windows, or LAT EX, or .... We could argue about the choice until all the current formats become obsolete, and we could well have chosen a format that itself quickly became obsolete. We could have been tempted to convert from EBCDIC to ASCII, but that could have lost information because EBCDIC has a few more characters than ASCII. We recommend that the format of preserved data be converted only when access is required to the data, i.e. on creation of the Dissemination Information Package (DIP). For a popular item, it would obviously make sense to cache the DIP, but not to allow the reformatted DIP to replace the original as master. 
This means that the tracking of developments in storage technology involves only the copying of byte-streams. Moreover, when the format conversion has to be done, there will be improved computational technology with which to do it [9]. Indirection is Vital There isn't a problem in computer science that cannot be solved by an extra level of indirection. Anon The essence of our approach involves keeping the preserved data unchanged, and ensuring that we always have representation information that tells us how to access it, rather than repeatedly converting to a format in current use. We take the view that it is very difficult (impossible?) to provide representation information that will be adequate for ever. We propose that representation information evolves over time to reflect changes in IT practice. This clearly implies a structure in which each stored object contains a pointer to its representation information. This is easily said, but begs the question as to the nature of the pointer. We need a pointer that will remain valid over the long-term (i.e. 100 years). We need to be wary of depending on institutions whose continued existence cannot be guaranteed. Alongside this need for a pointer, we also have a need for a reference ID for each preserved object. This needs to be distinct from the location of the object, but there needs to be a service that translates a reference ID into a current location. This is the essence of the Cedars architecture [10]. 20 Reference IDs could be managed locally within an archive store. Such IDs could then be made global, by naming each archive store, and prefixing each local name with that of the archive store. There are various global naming schemes, ISBN, DNS, Java packages, URL, URI, URN, DOI, etc. It may even be necessary to introduce another one, just because there is no clear long-term survivor. What is certain is that there have to be authorities that give out reference IDs and take responsibility for translating these IDs into facilities for access to the referenced stored objects. If we grasp the nettle of a global name space for reference IDs of stored objects and keep the representation information in the same name space, we have the prospect of sharing the evolving representation information on a world-wide basis. This will imply some discipline if dangling pointers are to be avoided. Enhance Representation Nets over time In the Cedars Project we produced a prototype schema for a representation net following the OAIS model, and populated it with some examples. After this experience, we had some new ideas on the schema of the representation net. We believe that it is inevitable that this area is allowed to develop further, and that operational archives are built so that evolution in this area is encouraged to take place. We must accept that there is likely to be revision in the OAIS model itself over the 100 year time-frame. Also, we could see that to require a fully specified representation net before allowing ingest could act as a disincentive to preservation of digital objects whose value is not in doubt. In many cases, representation information existed as textual documentation. An operational archive needs to be capable of holding representation information in this purely textual form, although with an ambition to refine it later. Such information would not actually violate the OAIS model, but there is a danger of being over-prescriptive in implementing the model. 
For instance the NISO technical metadata standard for still images [11] has over 100 elements, at least half of which are compulsory. For some formats the most useful representation information is in the form of viewing software. We need our representation nets to enable the discovery of such software (see below). Many current objects need only to be introduced to a typical desktop computer in order for them to be rendered. On the other hand, we experimented with obsolete digital objects (from 1970s and 1980s) in order to see some of the issues likely to arise when our grandchildren wish to gain access to today's material. We even tried to imagine how we would have gone about preserving for the long-term future using the technology of the 1970s. It was abundantly clear that ideas are very different now than they were 30 or 40 years ago. We must expect that today's ideas could well be superseded over the longterm. In order to accommodate this, we must allow the content of objects in the representation net to be changed over time, in sharp contrast to the original preserved objects where we 21 are recommending retention of original byte-streams. It is vital that the reference ID that is originally used for representation information is re-used for newer representation information which gets produced as a result of development of new tools and ideas. That way, old data gets to benefit from new techniques available for processing it. The representation information that is being replaced should of course be retained, but with a new ID, which should then be referenced by the replacement. Representation Nets should link to software Our representation nets in Cedars very deliberately contained software, or in some cases references to it. We have no regrets on this issue. Ideally we want software in source form in a programming language for which implementations are widely available, but it seems churlish to refuse to reference the Acrobat viewer as a way of rendering PDF files, just because we do not have the source, but see example 1 below. A format conversion program that is known to work correctly on many different data objects is clearly a valuable resource for access to the stored data, and should be available via the representation network. As regards the issue of longevity of such software, we argued earlier for the longevity of abstract concepts such as bits, bytes and byte-streams. Programming languages are also abstract concepts, and they too can live for a very long time. Current implementations of C or FORTRAN will run programs from long ago. Other languages which have been less widely used also have current implementations that function correctly. The source text of a format conversion program which is written in a language for which no implementation is available is still a valuable specification of the format, and has the benefit of previously proven accuracy. We address the issue of evolving emulator programs in C-ing Ahead for Digital Longevity [12], which proposes using a subset of C as the programming language for writing portable emulators. Examples We illustrate the way in which we see representation information evolving over time, by reference to three examples drawn from rather different computational environments. Example 1: Acrobat files In today's IT world it is very common to use Adobe Acrobat® portable document format (PDF) for holding and transmitting electronic forms of what are thought of as printed documents. 
The only representation information needed by today's computer user is the URL for downloading the Acrobat® Reader™. The representation net for PDF files is basically this single node, detailing how to gain access to the software for rendering the data. In reality, it should be an array of nodes with elements for different platforms. All preserved PDF files would reference this one piece of representation information. The recent appearance of the GNU open-source Xpdf [13] would be reflected by adding it to this array.

Example 2: IBM SCRIPT files

Once upon a time, the representation information for a preserved IBM SCRIPT file would point to the IBM SCRIPT program for the IBM/360 platform. Unfortunately we did not have the OAIS model in the 1970s, but if we had had an OAIS archive for storage of our VM/CMS data, this is the only representation information that would have been needed. (Actually the CMS file-type of SCRIPT performed the rôle of representation information, much as file extensions do today on a PC.) As the 30+ years elapsed, our putative OAIS archive would have expanded the representation information for SCRIPT with information suitable for more current platforms, including the human-readable documentation for a live-ware platform. There would probably also be a reference to the Hercules project [14], which allows emulation of IBM/360/370 systems of yesteryear. This need to keep up-to-date was highlighted in the InterPARES project [15].

Example 3: The BBC Domesday Project

In 1986, to commemorate the 900th anniversary of the Domesday Book, the BBC ran a project to collect a picture of Britain in 1986, to do so using modern technology, and to preserve the information so as to withstand the ravages of time. This was done using a microcomputer coupled to a Philips LaserVision player, with the data stored on two 12" video disks. Software was included with the package, some on ROM and some held on the disks, which then gave an interactive interface to this data. The disks themselves are robust enough to last a long time, but the device to read them is much more fragile, and has long since been superseded as a commercial product. Here we have a clear example where the preservation decisions placed (mis-placed) faith in the media technology of the day, and more crucially in the survival of the information technology practices of the time. The CAMiLEON project used this example as a test case to show the effectiveness of emulation as a preservation technique. A detailed treatment is to be found on the CAMiLEON web site [16]. We can look at this example with particular reference to its long-term viability, both with regard to the original efforts in 1986, and to the emulation work of 2002. We shall use it to illustrate our ideas about the appropriateness of emulation software as part of the representation information. Firstly, a bit of background to the work. We have taken our own advice and preserved the data from the original disks as abstract byte-streams. We can represent this step as the process marked A in the diagram (taken from reference [7]). The technique was to show that we could use emulation to bridge from the Original platform to a different host platform, labelled Host platform 1 in the diagram. The ingest step (marked A in the diagram) involves identifying the significant properties of the original. The data consisted of the four disk surfaces, each with 3330 tracks, and some software in ROM held inside the BBC microcomputer.
Some tracks were video images and some held digital data which was often textual. We preserved the ROM contents straightforwardly as binary files, and made each track of the disk into a binary file of pixels for the video images, and a straightforward binary file for each of the digital data tracks. This we claim preserves the significant properties of the software and data necessary for it to run on the BBC computer with its attached video disk player. An example representation network describing the capture process was constructed as part of the Representation and Rendering Project [17] To demonstrate the validity of this claim, we produced the emulator shown as Emulator 1 on the diagram. The original software relied on an order code and an API (applications program interface) labelled 1 in the diagram. In order to achieve successful preservation of this digital object, we need to reproduce this API with software that operates with a more modern API, labelled 2 in the diagram. The emulation of the BBC micro-computer was obtained from an open-source emulation written by an enthusiast (Richard Gellman) and available on the net [18]. Although the achievements of enthusiasts are not always ideally structured for use in digital 24 preservation work, they can often provide a useful starting point for further development. At the very least the source code can act as a handy reference point for new work. The emulation of the video disk player was done by our own project staff. This emulation software then becomes the major component of the representation information for this data. Its longevity depends crucially on the longevity of the interface labelled 2. Here we have used code that is written in C, and makes use of only a few Win32-specific API calls. In other words our interface labelled 2, is not the whole API of Host platform 1, but only the facilities that we have chosen to use. The move to another platform is made easier by choosing to use as few as possible of the proprietary features of Host platform 1. We may need to recode a few bits of the screen driving routines, but by and large we can expect to find on Host platform 2 an API (shown as 3) that has most of the features needed on the new platform. We expect that a slightly revised emulator called Emulator 1.01 will readily be generated (step B) to run on Host platform 2. Meanwhile, the preserved digital object will be completely unchanged, as indicated by the large equals sign. Example 3: The BBC Domesday Project — Evolution of Representation Information At the outset, the storage media consisted of two 12" video disks. The representation information (a booklet supplied with the disks) basically said buy the appropriate hardware including the two E-PROM chips holding software that is used in accessing the video disk player. In addition, the BBC microcomputer had a well documented API for applications programs. This API (or preferably the subset of this that happened to be used) provides the interface labelled 1 in the diagram. Our preservation of the data from its original preservation medium created byte-streams that closely mirrored the actual physical data addressing. This maximised the validity of the existing representation information, viz. the documentation of the API mentioned above. The emulator then implements this API, opening up the question of the API upon which it itself runs. Thus we add to the representation information the emulator, and the information concerning the API needed to run it. 
This is not yet stored in a real OAIS archive, but we do have the materials necessary to achieve this, and data from the disks is stored in our LEEDS archive[19]. Our care in producing an emulation system that is not tied too closely to the platform upon which it runs illustrates our desire to produce representation information that will indeed stand the test of time by being easily revised to accommodate newly emerging technologies. This revised emulator becomes an addition to the representation information, extending the easy availability of the original data to a new platform. InterPARES [15] identified clearly the desire of users to access the material on the technology of their own time. 25 So why emulate in this case? The interactive nature of the digital object is really a part of it. There is no readily available current product that reproduces that interaction, so we treat the interaction software as part of the data to be preserved. On the better examples of current desk-top hardware, it runs faster than the original. Share and Cross-Reference Representation Nets We have argued earlier for the impossibility of producing an adequate standard for representation information which will retain its relevance over the decades. To attempt to do so would stifle research and development. We must therefore expect that different data storage organisations may develop different forms of Representation Information. Initiatives such as the PRONOM [20] file format database and the proposed Global File Format Registry will also produce valuable resources that should be linked from representation information. It would seem that collaboration should be the watchword here. The emerging solutions for IBM SCRIPT files in example 2 are likely to be applicable to any institution holding such data. With our proposed global namespace, they can all reference the same representation net, and benefit from advancing knowledge on the rendering of such files. Global Considerations The implementation of preservation on a global basis means that there will be no overall command. Co-operation will have to be by agreement rather than by diktat. This situation has some aspects that resemble the problems of achieving true long-term preservation. We cannot predict the future accurately, nor can we control it to any great extent, so the ambition to operate on a global scale despite being unable to control activities everywhere in the world sits well with the need for future-proofing. The future is another country whose customs and practices we cannot know. Referential Integrity We are proposing that no object that has a name in the digital store is ever deleted. It may be modified, but never deleted. Thus, anyone may use a reference to an object in the OAIS digital storage world confident that it will never become a dangling pointer. However, the representation information in any OAIS archive will need to refer to information outside its control. (This is actually an inevitable consequence of Gödel's incompleteness theorem — reflected in Cedars by describing nodes holding such references as Gödel ends.) Many of these external references will relate to the current practice of the time. A vital part of the management of such an archive will involve keeping an inventory of all such external references, and maintaining a process of review of the inventory in the 26 search for things that are no longer generally understood or refer to information that is no longer available. 
The remedy in such cases is to update the referring nodes to reflect the new realities. Clearly it is in the interests of good management to try to keep such nodes to a minimum. For example, a store would have a single node that describes the current version of Microsoft Word to which the representation information for any ingested Word file would refer. When this version becomes obsolete, this one node is updated with information on how to access data in the old format, or to convert to a newer format. The two level naming proposed earlier helps greatly in implementation of such a policy. Digital Curation in the UK The education funding authorities in Britain are currently in the process of setting up a digital curation centre [4]. This is seen as a centre for oversight and co-ordination of digital storage, and for R&D. The decision was announced shortly before Christmas. The centre will be based in Edinburgh, the home of the existing e-Science Centre [21], and EDINA [22]. The centre will not be a repository for the data itself. It will provide consultancy and advice services, and a directory of standard file formats. There will be a significant research activity, and a particular focus on digital integration, the enabling of research combining data from different sources. Academia is addressing its own problems, but what about the rest of the world of digital information, e.g. engineering data? How confident are we that the CAD data for nuclear power stations has an appropriate lifetime, or even half-life? Summary We argue strongly for retention of the original in the form of a byte-stream derived as simply as possible from the original data, and for the use of representation information to enable continued access to the intellectual content. We take the view that for much material it is impossible to have perfect representation information at the time of ingest, but that we must preserve the data and develop its representation information over time. Ideas on the nature of representation information will evolve over time. We must have systems capable of taking on board changing schemas of representation information. A two-level naming system, separating reference ID from location (and translating between them) should be the practice for implementing pointers in an OAIS archive, as a 27 prerequisite for our proposed policy of evolving representation information over time, and sharing it on a global scale. A Footnote on Bits versus Bytes The OAIS model uses the bit as the lowest level. However, the byte is the ubiquitous unit of data storage. In today's systems one cannot see how the bits are packed into bytes. When a file is copied from one medium to another we know that whether we read the original or the copy, we shall see the same sequence of bytes, but we know nothing of the ordering of bits within the byte, and these may be different on the two media types. On some media (e.g. 9-track tape) the bits are stored side-by-side. Pragmatically, we regard the byte as the indivisible unit of storage. If the OAIS model requires us to use bits, then we shall have a single definition of the assembly of bits into a byte. This would enable us unambiguously to refer to the millionth bit in a file, but not constrain us to hold it immediately before the million-and-oneth bit. 
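As an illustration of what such a single definition might look like, the fragment below indexes bit n of a preserved byte-stream under an assumed most-significant-bit-first packing. The particular convention is our own illustrative choice; the point is only that fixing one convention makes any bit address unambiguous, regardless of how the media on which copies live happen to lay bits down.

# Illustrative only: one fixed convention (most-significant-bit-first within each
# byte) for addressing bits in a preserved byte-stream. The specific packing is an
# assumption for the example; any single agreed definition would serve equally well.

def get_bit(byte_stream: bytes, n: int) -> int:
    """Return bit n (0-based) of byte_stream, counting MSB-first within each byte."""
    byte_index, bit_offset = divmod(n, 8)
    return (byte_stream[byte_index] >> (7 - bit_offset)) & 1

# The millionth bit of a preserved object, independent of the physical medium:
#   millionth = get_bit(preserved_bytes, 999999)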
References: [1] Cedars project http://www.leeds.ac.uk/cedars/ [2] CAMiLEON project http://www.si.umich.edu/CAMILEON/ [3] Representation and Rendering Project http://www.leeds.ac.uk/reprend/ [4] UK National Digital Curation Centre http://www.jisc.ac.uk/index.cfm?name=funding_digcentre [5] Reference Model for an Open Archival Information System (OAIS) ISO 14721:2002: http://www.ccsds.org/documents/pdf/CCSDS-650.0-B-1.pdf [6] The Medium is NOT the message. Fifth NASA Goddard Conference on Mass Storage Systems and Technologies NASA publication 3340, September 1996. http://esdisit.gsfc.nasa.gov/MSST/conf1996/A6_07Holdsworth.html [7] Emulation, Preservation and Abstraction, RLG DigiNews vol5 no4, 2001. http://www.rlg.org/preserv/diginews/diginews5-4.html#feature2 [8] Research and Advances Technology for Digital Technology : 6th European Conference, ECDL 2002 http://www.springerlink.com/openurl.asp?genre=article&issn=03029743&volume=2458&spage=516 [9] Migration - A CAMiLEON discussion paper —Paul Wheatley, Ariadne, Issue 29(September 2001) http://www.ariadne.ac.uk/issue29/camileon/ [10] Cedars architecture. http://www.leeds.ac.uk/cedars/archive/architecture.html [11] NISO technical metadata standard for still images. http://www.niso.org/committees/committee_au.html [12] C-ing Ahead for Digital Longevity http://www.leeds.ac.uk/CAMiLEON/dh/cingahd.html [13] Xpdf Acrobat® renderer http://www.foolabs.com/xpdf/about.html [14] Hercules IBM Emulator http://www.schaefernet.de/hercules/index.html [15] InterPARES project http://www.interpares.org/book/index.cfm, see also [23 [16] Domesday. http://www.si.umich.edu/CAMILEON/domesday/domesday.html 28 [17] Representation and Rendering Project case study http://www.leeds.ac.uk/reprend/repnet/casestudy.html [18] Richard Gellman and David Gilbert, BBC Emulator http://www.mikebuk.dsl.pipex.com/beebem/ [19] LEEDS archive http://www.leeds.ac.uk/iss/systems/archive/ [20] PRONOM http://www.records.pro.gov.uk/pronom/ [21] EDINA http://www.edina.ac.uk/ [22] UK National e-Science Centre http://www.nesc.ac.uk/ [23] InterPARES2 http://www.interpares.org/ip2.htm 29 30 FIBRE CHANNEL and IP SAN INTEGRATION Henry Yang McDATA Corporation 4 McDATA Parkway, Broomfield, CO 80021, Tel: +1-720-558-4418 e-mail: Henry.yang@McDATA.com Abstract The maturity and mission-critical deployment of Fibre Channel (FC) in storage area networks (SANs) creates a unique class of multi-terabit networks with demanding throughput, latency, scalability, robustness, and availability requirements. This paper reviews the state of and critical system-level requirements for SANs. It describes how Internet SCSI (iSCSI), FC over IP (FCIP), and Internet FC Protocol (iFCP) integrate with FC SANs and discusses associated benefits and challenges. Finally, the paper examines case studies in performance and protocol tuning in high-speed, long-delay networks, which are increasingly critical for FC-to-IP integration opportunities and challenges. 1.0 Introduction Information technology (IT) is a key driver and challenge for businesses, government, and research/development centers. Data centers are a critical asset and provide the infrastructure that houses information processing, storage, and communication resources. Corporations are under tremendous pressure to manage return on investment, massive growth in information processing and storage needs at a global scale, management, performance, availability, and scalability requirements, and the IT infrastructure. 
To add to the challenges, there are many new technology and deployment decisions that have significant implications in terms of value and impact to the data center. SANs are a critical part of the data center, and are based on high-speed, high-bandwidth, low-latency, and low-error-rate interconnects for scaling application, database, file, and storage services. FC is the key technology and standard that drives the rapid growth of SAN deployment. The development of global and distributed file systems, content-addressable storage, object-oriented storage, cluster and blade servers, and utility computing is driving more integrated IP and FC network usage. The evolution of the data center and new information and computing trends drive the data center toward a more dynamic resource and performance provisioning and management model, which demands more efficient and scalable computing, information storage, and networking. In addition, business and operational requirements in the data center drive the scaling and evolution of larger SANs encompassing metropolitan and wide-area distances, high security and availability, and multi-protocol networks. In the face of these trends, ease of use, configuration, and management of the SAN is even more important. This paper reviews important requirements and deployment examples. It describes emerging IP SAN technologies and how these technologies interface and integrate with FC. It also examines several protocol and design considerations, system-level behaviors, and areas that need further research and enhancement. This paper leverages the efforts of many engineers, architects, and researchers from the industry. The paper uses their findings and recommendations, and tries to relate them to SAN applications.

2.0 The FC SAN Today

2.1 FC SAN Overview

FC technology [1] and product deployment have evolved from 1 gigabit per second (Gbps) to 2 Gbps links, and there is development to introduce 4 Gbps and 10 Gbps links. An FC network or fabric is a multi-terabit, low-latency switching network, mainly used to interconnect servers to storage. Although an FC fabric is designed to support any-to-any connectivity, the actual use tends to be some-to-some. Each server talks to a few storage devices or each storage device talks to a few servers, with occasional traffic for backup or other purposes involving devices shared by many sets of storage and servers. Deployment of mid-range to high-end FC fabrics is based on FC directors [2], which are high-availability switches with high aggregate switching bandwidth and high port density. For the edge part of a large or small fabric, smaller and lower-cost FC switches are typically used. Directors and switches use one or more interswitch links (ISLs) to connect and form a larger fabric. It is common to deploy one or more isolated FC fabrics, called SAN islands. SANs are also extended to campus, metropolitan, and wide-area distances using T1/T3, ATM, IP, SONET, dark fiber, and DWDM technologies.

Figure 1 Example of a Large FC Fabric (servers/blade servers attached to edge switches, directors and storage arrays at the core, connected by inter-switch links)

Figure 1 shows an example of a large (approximately 1000 node) fabric, with directors and switches configured to provide high availability and high-aggregate bandwidth.
Servers are typically aggregated at the edge of the fabric, and storage arrays are typically configured near the core of the fabric. It is typical to over-subscribe server link bandwidth in comparison to storage array link bandwidth (more servers with respect to a given storage array). A network of directors forms the core (or backbone) of the fabric. For a fabric to be operational, there is a fabric initialization involving all switches, directors, and devices. Initialization steps include parameter exchanges, Principal Switch selection, address assignment, path computation, and zone merge operations. As part of the path computation, directors and switches in a fabric run the Fabric Shortest Path First (FSPF) routing protocol to build the forwarding database. FSPF is a link-state protocol that computes shortest-path routes for frame forwarding. Within an FC fabric, there are name services and state change notification protocol services for resource discovery, configuration, and change management. FC zoning is an overlay network mechanism to limit the visibility and connectivity of servers to storage devices. A device can be in one or more zones, thereby enabling the sharing of servers (or clusters of servers) and storage resources. When an ISL changes state, all these protocols normally run, and when an end device comes up or goes down, name services and state change notification services run. These services consume more and more resources as the fabric size grows.

2.2 Traffic Patterns in Fibre Channel

Most FC traffic uses the SCSI-FCP protocol [4] on top of FC Class 3 (unacknowledged datagram) service. SCSI-FCP is a request-response protocol that provides frame sequencing within transactions provided by lower-layer FC protocols. On frame loss or error, the protocol performs a transaction-level time-out and retransmission. No retransmission of individual frames is supported. Time-out values are typically preconfigured and not based on actual round-trip delay. The performance of SCSI-FCP is therefore sensitive to frame loss or frame-level errors. Table 1 shows example read and write transactions and protocol frames.

Transaction   Protocol Direction    Frame Type                Typical Frame Length
Read          Server to Storage     FCP_CMD (Read)            68 Bytes
              Storage to Server     FCP_XFER_RDY              48 Bytes
              Storage to Server     FCP_DATA (one or more)    Up to 2084 Bytes
              Storage to Server     FCP_RSP                   64 Bytes
Write         Server to Storage     FCP_CMD (Write)           68 Bytes
              Storage to Server     FCP_XFER_RDY              48 Bytes
              Server to Storage     FCP_DATA (one or more)    Up to 2084 Bytes
              Storage to Server     FCP_RSP                   64 Bytes

Table 1 Example SCSI-FCP Read and Write Protocol Frames

As the bandwidth-delay product increases, it is critical to understand performance tuning and error recovery mechanisms. For configurations with long delay, it is important to consider the way data is moved (write or read). As shown in Table 1, the write transaction has one more round-trip delay than the read transaction. Therefore, the read operation is faster when network delay is long.

2.3 Critical Factors in SAN Deployment

SAN deployments today range from small fabrics with less than 100 devices to large fabrics with several thousand devices. The following are factors critical to SAN design and deployment:

• High availability: The impact of downtime and loss of information to business is severe. High availability requirements are quantified as several 9's of availability (e.g., 99.999%) up to no downtime at all.
Most highly available fabrics are based on dual-rail redundancy and highly available directors, switches, and gateways. Servers and storage devices may have redundant paths through one fabric or through separate redundant fabrics with no shared single point of failure. Directors and some switches are designed with high-availability features, including fully redundant and hot-swappable field-replaceable units (FRUs) and hot software download and activation, meaning that operation may continue through a software upgrade.

• Robustness and stability: Some FC servers, associated host bus adapters (HBAs), and storage devices are extremely sensitive to frame loss and out-of-order frame delivery. Error recovery in the SCSI-FCP protocol is based on command- and transaction-level time-out and retry. Therefore, SCSI-FCP expects a very low frame loss rate, since frame loss has a significant performance impact. The design of SANs has to account for the following factors:

- It is important to limit and reduce FC fabric size in terms of the number of switching nodes. The goal is to limit the frequency of fabric initialization, FSPF route computation, and traffic for state notification and name services.

- It is critical to ensure there is adequate aggregate bandwidth (fabric-wide and for individual links) to avoid severe and prolonged congestion. FC fabrics use link-level, credit-based flow control, which is useful for handling short-term, bursty congestion. In FC, it is not common to use active queue management techniques (e.g., based on random early detection) to minimize queue build-up. It is typical for an FC switch to discard frames that have been queued for a pre-determined time (e.g., 0.5 to 1.0 second) as part of the stale frame discard policy. As the deployment of multi-speed links (1 Gbps, 2 Gbps, 4 Gbps, and 10 Gbps) ramps up, the design of the network and switching architecture becomes more challenging. As the size of the network grows, comprehensive congestion management mechanisms become more critical, and current link-level flow control may no longer be adequate.

• Performance: Most FC switches and directors specify best-case frame latency to be less than a few microseconds. But latency grows with loading and can reduce effective bandwidth to significantly less than nominal bandwidth. Measured frame latency at 70% link utilization [3] was 5.2 to 6.5 microseconds for one vendor's product and 2.6 to 2222.6 microseconds for another vendor's product. The lesson is that not all switches are designed equally. Switching architecture issues like head-of-line blocking and internal resource bandwidth (throughput or frame rate) limitations impact throughput, latency, and congestion loss, especially at higher offered load.

• Distance extension: Requirements for disaster recovery and business continuance (file/data mirroring, replication, and backup) are driving the deployment of SAN extension to deliver better performance and availability. In addition to robustness, stability, and performance considerations, it is important to understand the configurations, products, protocols, and system tuning parameters with respect to distance extension technology. We examine this topic later.

• Scaling the SAN: A large number of FC fabrics deployed today are small islands of fabrics that are not inter-networked into a large and connected SAN.
Reasons for deploying isolated islands include early adopters learning new technology, difficulty and lack of confidence in the management and operational stability of a large fabric, and insufficient business and operational drivers for connecting islands of FC fabrics. However, there are many benefits to internetworking FC islands. Resource sharing (such as a tape library for backup) and the ability to dynamically provision and allocate resources are some of the benefits. When scaling an FC SAN, it is important to maintain performance and availability properties. Since an FC fabric is similar to an IP layer 2 switching network, it is important to constrain the number of switches in a fabric so the resulting fabric is stable and robust. When interconnecting FC fabrics, it is critical to consider isolating FC fabric-local initialization and services, while allowing servers and storage devices to be interconnected regardless of locality. This is an area of further research and standardization work, and currently ANSI T11 has a fabric extension study group addressing these topics.

3.0 FC & IP Integration & Challenges

3.1 IP SAN Developments

The emergence of the iSCSI, FCIP, and iFCP standards [5, 6, 7, 8, 9] enables IP technology to enhance the deployment and benefits of SANs. The FCIP and iFCP protocols use a common framing and encapsulation design. We examine the applicability, design, and limitations of these technologies in the following sections. These protocols leverage the mature IPSec standard and technology to enable security (including authentication, integrity, and privacy). As part of the protocol suite, the Internet Storage Name Service (iSNS) [10] provides a method to manage and configure names, registry, discovery, and zones for multi-protocol SANs. The use of the Service Location Protocol (SLP) [11] to discover services and resources is another critical part of the standard.

3.2 iSCSI

iSCSI is a SCSI-over-TCP transport protocol used between a SCSI initiator and a SCSI target for block-level transport of SCSI commands and payloads. The iSCSI protocol uses TCP/IP and IPSec as its network transport and security protocols. It has many features designed to adapt standard TCP/IP protocols to block storage needs. These features include the use of multiple TCP connections (for a given session), cyclic redundancy check (CRC) digests, out-of-order data placement, and TCP connection failure recovery options. iSCSI design and analysis have been presented in several papers [12, 13, 14, 15, 16].

Figure 2 FC-iSCSI Gateway (servers and storage on an FC SAN and an IP SAN interconnected by an FC-iSCSI gateway)

In Figure 2, the FC-iSCSI gateway provides the internetworking of iSCSI devices with FC devices, while communicating with each of the networks appropriately. While the FC SAN and the IP SAN operate independently, the gateway maps selected iSCSI devices into the FC SAN and selected FC devices into the IP SAN. When an FC server creates a SCSI-FCP session to a storage device, the gateway intercepts the request and acts as a proxy for the storage device. On the IP side, the gateway acts as a proxy initiator (for the server), and creates an iSCSI session to the storage device. The gateway maintains and manages the state of the gateway portion of supported sessions. For an IP-based server creating an iSCSI session to an FC storage device, the gateway performs the corresponding roles of proxy target on the iSCSI session and proxy initiator for the SCSI-FCP session; a minimal sketch of this session mapping is given below.
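The sketch below is purely illustrative of the session bookkeeping just described; the class and method names are hypothetical, and a real gateway must also handle login, authentication, iSNS registration, and error propagation.

# Hypothetical sketch of FC-iSCSI gateway session mapping; names are illustrative
# and not taken from any product or standard. It shows only the bookkeeping idea:
# each FC-side SCSI-FCP session is paired with an iSCSI session on the IP side.

from dataclasses import dataclass

@dataclass
class ProxySession:
    fc_initiator: str       # WWPN of the FC server acting as the real initiator
    fc_proxy_target: str    # FC address the gateway presents for the remote device
    iscsi_target_iqn: str   # iSCSI qualified name of the real target on the IP SAN
    tcp_endpoint: tuple     # (IP address, TCP port) of the iSCSI session

class FcIscsiGateway:
    def __init__(self):
        # Keyed by the FC-side session so incoming FCP frames find their iSCSI leg.
        self.sessions = {}

    def open_fc_to_iscsi(self, fc_initiator, fc_proxy_target, iqn, ip, port):
        """FC server logs in to a proxied target; the gateway opens the iSCSI leg."""
        session = ProxySession(fc_initiator, fc_proxy_target, iqn, (ip, port))
        self.sessions[(fc_initiator, fc_proxy_target)] = session
        return session

    def forward_fcp_command(self, fc_initiator, fc_proxy_target, scsi_cdb: bytes):
        """Look up the paired iSCSI session and hand the SCSI CDB to that leg."""
        session = self.sessions[(fc_initiator, fc_proxy_target)]
        # A real gateway would build an iSCSI command PDU here, preserving the CDB
        # and payload, and send it over the TCP connection for this session.
        return session.tcp_endpoint

The same mapping table, looked up from the iSCSI side, supports the reverse case of an IP-based initiator reaching an FC target.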
An iSCSI gateway performs several important functions, including FCP and iSCSI session-level protocol translation, command and payload forwarding, error checking, and command/session-level error propagation. A gateway has to manage device discovery and registry (on the IP side with an iSNS server, and on the FC side with FC name services), authentication of FC and IP devices, mapping of device names to local addresses, and so on. It is important that a gateway is as transparent as possible to the servers and storage devices using it, while maintaining high data integrity. It should have very low latency and sufficient bandwidth to forward commands and payloads, and support a sufficiently large number of sessions to enable storage consolidation (a high-end storage array on the FC side shared by a large number of IP-based servers). Management of the multi-protocol SAN is a critical part of deployment success.

3.3 FCIP

FCIP is a tunneling protocol that transports all FC ISL traffic. Like iSCSI, FCIP uses TCP/IP as the transport protocol and IPSec for security. An FCIP link tunnels all ISL traffic between a pair of FC switches, and may have one or more TCP connections between a pair of IP nodes serving as the tunnel end points. From the FC fabric view, an FCIP link is an ISL transporting all FC control and data frames between switches, with the IP network and protocols invisible. One can configure one or more FCIP-based ISLs between a pair of FC switches. Figure 3 shows an example of FCIP links being used as ISLs between FC switches A and B.

Figure 3 FC-FCIP Tunnel (FC switches A and B connected by ISLs tunneled over an IP network)

A key advantage of the FCIP tunnel approach is transparency to the fabric, as existing fabric tools and services are used. Once an FCIP link is configured, existing fabric operations and management continue. Similarly, fabric initialization, the FSPF routing protocol, and name/state change services run transparently over FCIP links. However, since FC fabric-level control protocols run over the FCIP tunnel, IP and TCP connection failures can disrupt the FC fabrics on both sides. Given the speed and bandwidth differences between FC and a typical IP network used to interconnect remote SANs, the design and management of congestion and overload conditions is important to understand. For the FCIP tunnel, a simple FIFO (first in, first out) frame forwarding queue design can result in head-of-line blocking of fabric initialization protocol frames when the tunnel is congested or the TCP connection is in slow-start recovery mode. Another case to consider is a SCSI-FCP transaction time-out, in which the entire transaction (such as a 1 MB block) might be retransmitted over an FCIP link that is already experiencing congestion. In addition, there might be multiple application streams using the same FCIP link, and there is no mechanism to help reduce or avoid network congestion. These are possible scenarios of overload and congestion that can result in performance and stability issues that impact the entire fabric. For a medium to large fabric, these are critical issues for concern. Most FCIP deployments are based on small fabrics, where there are a small number of devices and switches at each end of the FCIP link, and these issues are less critical.

3.4 iFCP

iFCP technology is a gateway-to-gateway protocol for providing FC device-to-FC device communication over TCP/IP.
For each pair of FC devices, an iFCP session is created between the pair of gateways supporting the devices. An iFCP session uses a TCP connection for transport and IPSec for security, and manages FC frame transport, data integrity, address translation, and session management for a pair of FC devices. Since an iFCP gateway handles the communications between a pair of FC devices, it only transports device-to-device frames over the session, and hence the FC fabrics across the session are fully isolated and independent. This is a major difference between iFCP and FCIP, in that FCIP builds an extended fabric tunneled over IP. In contrast to an FC-iSCSI gateway, an iFCP gateway transports FC device-to-device frames over TCP, and in most cases the original FC frames, including the original CRC and frame delimiters, are transported. An FC-iSCSI gateway terminates and translates the SCSI-FCP protocol from the FC side and similarly terminates and translates the iSCSI protocol from the IP side.

Figure 4 FC-iFCP Gateway (FC servers and storage on separate fabrics communicating through iFCP sessions between a pair of FC-iFCP gateways over an IP network)

The iFCP draft standard specifies an address-translation mode as well as an address-transparent mode, depending on whether the FC addresses of devices are translated or not. An FC device exchanges login and protocol parameters with another FC device using FC link service protocol frames as part of the session creation and parameter exchange protocol. In address-translation mode, a gateway intercepts these device-to-device link service protocol frames and translates the device addresses embedded in the frames. It regenerates frame CRCs when the original frame content is changed, which imposes extra overhead on the iFCP gateway. Address translation is a particularly useful feature when interconnecting FC fabrics. It enables installation of a gateway between existing fabrics without requiring fabric address changes. A gateway manages the session state, address translation and mapping, and provides proxy functions for remote devices. In addition, a gateway performs security functions (like authentication of devices), and works with an iSNS server for registry and discovery functions. The configuration and management of an iFCP gateway is more involved than for an FCIP gateway, as each device-to-device session has to be set up. Also, an iFCP gateway has more device proxy-related state to manage. As the number of device-to-device sessions increases, an iFCP gateway design becomes more complex and may run into performance and stability issues. However, one can use admission control techniques to limit the number of iFCP sessions allowed for a gateway. Since an iFCP gateway is managing device-to-device communications, it can enforce some degree of flow control by pacing command forwarding at times of congestion. The iFCP specification allows an optional unbounded connection feature, which sets up and uses a pool of backup TCP connections for fast session fail-over support. This assists a gateway in providing faster connection fail-over.

3.5 TCP/IP & Transport Protocol Discussions

Some classes of applications have different requirements for transport services and protocols. For example, applications that prefer timeliness of delivery over reliable data delivery (such as RealAudio or Voice over IP) prefer a different transport service and protocol design [17] than that of TCP.
Also, for applications that prefer a different type of fault tolerance, reliability, and a non-byte-stream-oriented transport service, a different type of transport protocol might be needed (such as the Stream Control Transmission Protocol (SCTP) [18]). These are examples of new transport protocol research and standard development activities. The TCP protocol is undergoing many enhancements to improve performance under different operating conditions; these enhancements include the High Performance Extensions (TCP Window Scaling Option, Round-Trip Time Measurements, Protect Against Wrapped Sequence Numbers) [19], the Selective Acknowledgment Option [20, 21], Explicit Congestion Notification [22, 23], the Eifel Detection Algorithm [24], and High Speed TCP (HSTCP) [25]. As part of the design considerations for an IP SAN, the design and tuning of TCP for the SAN is critical. For iSCSI servers and storage devices, the design and tuning of protocol off-load, zero-copy, interrupt coalescing, and buffer-MTU-MSS tuning are critical (MTU is the maximum transfer unit, MSS is the maximum segment size). For iSCSI, FCIP, and iFCP gateway design, buffer-MTU-MSS tuning is very critical, and several of the aforementioned TCP enhancements are important considerations for scaling the IP SAN to 1 to 10 Gbps speeds. For a long fast network (LFN), the HSTCP enhancement is an important design consideration. Multiple TCP connections for iSCSI, FCIP links, and unbounded iFCP connections are critical considerations for load balancing and high availability. In addition to IP-based transport, there are developments for operating Gigabit Ethernet and the FC protocol directly over SONET-based transports using the Generic Framing Procedure, ITU-T G.7041 [26].

4.0 Some Case Studies

4.1 Experiment of 10 Gbps Transcontinental Data Access

As part of the Supercomputing Conference 2002 demonstration of SAN extension over a multi-gigabit transcontinental network [27], test results of an FC SAN interconnected with iFCP gateways over a 10 Gbps link from San Diego to Baltimore were presented. Figure 5 shows the configuration used for the experiment.

Figure 5 Schematic of Data Access for the SC'02 demonstration (FC switches attached to IP SAN gateways, interconnected across IP switches/routers)

The Supercomputing '02 experiment proved the operation of a network running FC over an IP network, using iFCP gateways, between the San Diego Supercomputer Center (SDSC) and the SDSC booth in Baltimore. The experiment demonstrated that FC traffic, using iFCP gateways, runs over a 10 Gbps link in excess of 2,600 miles, with a round-trip latency of 70 to 90 milliseconds. Aggregate throughput was relatively constant at 717 MB/s, and read performance was slightly better than write performance. In addition to the IP/iFCP-based demo [27], there was another experiment of FC traffic over FCIP using a 10 Gbps SONET link, configured between the Pittsburgh Supercomputing Center (PSC) booth and the SDSC booth at the SC'02 show.

4.2 Remote Mirroring

Figure 6 Remote Mirroring – Throughput vs Delay (source and target connected across a MAN/WAN)

When performing remote mirroring of logical units (LUNs), remote copy operations must synchronize data copied to each of the LUNs to ensure data coherency within the mirror group. The effective throughput of the remote mirroring of 12 LUNs was shown to drop from 25 MBps to about 5 MBps as the round-trip delay increases from 0 to 10 ms, as shown in Figure 6 [28]. It is important to configure and tune file and block size, MTU, MSS, and synchronization rate; the rough model sketched below shows why round-trip delay dominates the achievable rate.
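This is a rough model with assumed parameters (a 64 KB window outstanding per round trip over a 100 MB/s link), not the measured configuration behind Figure 6; it only reproduces the qualitative trend that synchronous mirroring throughput collapses as round-trip delay grows.

# Back-of-envelope sketch with assumed numbers; not the measured data from [28].
# If a synchronous remote-copy implementation keeps one fixed window outstanding
# per round trip, effective throughput is roughly window / (RTT + serialization).

def sync_mirror_throughput_mb_per_s(outstanding_kb: float, rtt_ms: float,
                                    link_mb_per_s: float = 100.0) -> float:
    """Effective MB/s when one window of outstanding_kb is acknowledged per RTT."""
    window_mb = outstanding_kb / 1024.0
    serialization_ms = window_mb / link_mb_per_s * 1000.0
    return window_mb / ((rtt_ms + serialization_ms) / 1000.0)

for rtt in (0.1, 1.0, 5.0, 10.0):
    rate = sync_mirror_throughput_mb_per_s(64, rtt)
    print(f"RTT {rtt:5.1f} ms -> {rate:5.1f} MB/s")
# Falls from roughly 86 MB/s near zero delay to about 6 MB/s at 10 ms, the same
# qualitative behaviour as the 25 MBps to 5 MBps drop reported in [28].

Keeping more data outstanding per round trip, through larger blocks or multiple parallel streams, raises the effective rate, which is one reason block size and synchronization rate tuning matter.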
4.3 Delay and Cache Effect on I/O Workload

Figure 7. Delay and cache effect in the PostMark experiment.

PostMark was used to test the delay and cache sensitivity of the I/O workload of a large email server [29]. Figure 7 shows the PostMark transaction rate of I/O from a FreeBSD host to a storage element (SE) with varying delays (to the SE) and varying FreeBSD VM cache sizes. The transaction rate declines as the delay is increased, and with larger cache sizes the transaction rate increases. Application performance sensitivity with respect to delay and error recovery is an area that needs further research and understanding.

4.4 Long Fast Network Experiment

In another case [30], the University of Tokyo conducted an experiment using 'iperf' running TCP between Maryland and Tokyo, traversing the Abilene and APAN networks. The result was surprising in that Fast Ethernet is sometimes faster than Gigabit Ethernet on an LFN. The main cause of the throughput degradation in the Gigabit Ethernet LFN tests was buffer overflow at an intermediate router under congestion, which triggered TCP timeouts, slow start, and congestion control mechanisms. Transmission rate control is important to mitigate overflow of the bottleneck's buffer, in addition to window size control. Therefore, transmit rate or bandwidth limiting is another important mechanism that avoids or mitigates the impact of congestion overflow in an intermediate network.

4.5 Fast Write

We examine a method to improve write performance over a long-delay network. As shown in Table 1, a SCSI write transaction incurs two round-trip delays for a data block. The maximum block size is determined by the target (storage) device's buffer capacity and is specified by the target in the XFR_RDY message. For example, writing one MB of data using 64 KB blocks takes 16 transactions, which is 32 round trips plus data transfer time. Fast Write [31] is a way to minimize round-trip delay overhead and accelerate SCSI write performance by leveraging a gateway's buffer capacity. The XFR_RDY is spoofed by the gateway on the initiator (server) side of the network, and the data is buffered by the gateway on the target side of the network until the target sends its own XFR_RDY. In addition, the use of TCP with selective retransmission (on error) provides better frame loss recovery than retransmitting the entire block on timeout (as in the SCSI-FCP case). With Fast Write, the number of round trips involved in a 1 MB transfer is reduced to two.

Figure 8. Fast Write example.

Figure 8 shows an example of Fast Write, where a 1 MB transfer is negotiated between the source and the left-hand gateway as well as between the two gateways. For the write operations between the final gateway and the destination, the maximum block size is specified by the destination. Most of the round trips required are over the SAN between the right-hand gateway and the destination. Therefore, the SAN should have higher bandwidth, lower latency, and a lower error rate than the WAN connection between gateways. Fast Write is an innovative method of using standard protocols to leverage the capability of a gateway and the benefits of TCP over the WAN.
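The round-trip arithmetic in the Fast Write example above can be made explicit. The sketch below reproduces the 1 MB / 64 KB case from the text and compares the WAN round-trip cost with and without a gateway spoofing XFR_RDY. The timing model is deliberately simplified (it ignores data transmission time and the comparatively cheap round trips on the local SAN), and the RTT values are assumptions for illustration.

```python
def write_latency_ms(total_bytes: int, block_bytes: int, wan_rtt_ms: float,
                     fast_write: bool) -> float:
    """Simplified latency model for a SCSI write across a WAN.

    Without Fast Write, each block costs two WAN round trips (command/XFR_RDY,
    then data/status).  With Fast Write, the initiator-side gateway spoofs
    XFR_RDY locally and the whole transfer costs roughly two WAN round trips.
    """
    blocks = -(-total_bytes // block_bytes)        # ceiling division
    wan_round_trips = 2 if fast_write else 2 * blocks
    return wan_round_trips * wan_rtt_ms

MB = 2**20
for rtt in (10, 80):                               # e.g. a metro link vs. an SC'02-scale RTT
    plain = write_latency_ms(1 * MB, 64 * 1024, rtt, fast_write=False)
    fast = write_latency_ms(1 * MB, 64 * 1024, rtt, fast_write=True)
    print(f"RTT {rtt:>2} ms: plain SCSI ~{plain:.0f} ms, Fast Write ~{fast:.0f} ms")
# RTT 10 ms: ~320 ms vs ~20 ms; RTT 80 ms: ~2560 ms vs ~160 ms of round-trip overhead.
```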
5.0 Summary

We present several of the critical requirements and best practices for FC SAN deployments. We examine IP SAN technologies and protocols, and show that FC and IP integration works well; integrated SANs are a critical part of today's data center. We explore how several new high-speed protocol extensions work, and identify areas that need further research and development. The deployment of high-speed, long-distance networks for data centers, while providing good performance and reliability, is becoming very important and brings potential value as well as challenges.

Acknowledgements

I would like to thank the many engineers, architects, and researchers who are mentioned in the references, and many others who are not explicitly mentioned here. I would like to thank Stuart Soloway for his detailed review and help with this paper. I would also like to thank Tom Clark and Neal Fausset for their review and inputs.

References

1. Latest FC standards and drafts can be found at www.t11.org. 2. H. Yang, “Performance Considerations for Large-Scale SANs”, McDATA Corporation, 4 McDATA Parkway, Broomfield, CO 80021, December 2000. www.mcdata.com. 3. Randy Birdsall, “Competitive Performance Validation Test”, Miercom Labs, Princeton Junction, NJ, June 2002. 4. Stephen Trevitt, “Traffic Patterns in Fibre Channel”, McDATA Corporation, May 2002. Available at www.mcdata.com. 5. RFC 3643 – Fibre Channel (FC) Frame Encapsulation. www.ietf.org. 6. Draft-ietf-ips-iscsi-20.txt, January 19, 2003. www.ietf.org. 7. Draft-ietf-ips-fcovertcpip-12.txt, August 2002. www.ietf.org. 8. Draft-ietf-ips-ifcp-14.txt, December 2002. www.ietf.org. 9. T. Clark, “IP SANs: A Guide to iSCSI, iFCP, and FCIP Protocols for Storage Area Networks”, Addison-Wesley, ISBN 0-201-75277-8, October 2002. 10. Draft-ietf-ips-isns-21.txt, October 2003. www.ietf.org. 11. Draft-ietf-ips-iscsi-slp-06.txt, December 2003. www.ietf.org. 12. K. Meth, J. Satran, “Design of the iSCSI Protocol”, IEEE/NASA MSST2003 Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems & Technologies, April 7-10, 2003, storageconference.org/2003. 13. H. Thompson, C. Tilmes, R. Cavey, B. Fink, P. Lang, B. Kobler, “Considerations and Performance Evaluations of Shared Storage Area Networks at NASA Goddard Space Flight Center”, IEEE/NASA MSST2003 Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems & Technologies, April 7-10, 2003, storageconference.org/2003. 14. K. Voruganti, P. Sarkar, “An Analysis of Three Gigabit Networking Protocols for Storage Area Networks”, 20th IEEE International Performance, Computing, and Communications Conference, April 2001. 15. S. Aiken, D. Grunwald, A. Pleszkun, J. Willeke, “A Performance Analysis of the iSCSI Protocol”, IEEE/NASA MSST2003 Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems & Technologies, April 7-10, 2003, storageconference.org/2003. 16. P. Sarkar, K. Voruganti, “IP Storage: The Challenge Ahead”, 10th Goddard Conference on Mass Storage Systems and Technologies / 19th IEEE Symposium on Mass Storage Systems, April 15-18, 2002, storageconference.org/2002. 17. Draft-ietf-dccp-problem-00.txt, October 23, 2003. www.ietf.org. 18. RFC 2960 – Stream Control Transmission Protocol, October 2000. www.ietf.org. 19. RFC 1323 – TCP Extensions for High Performance. www.ietf.org. 20. RFC 2883 – An Extension to the Selective Acknowledgement (SACK) Option for TCP, July 2000. www.ietf.org. 21. RFC 3517 – A Conservative Selective Acknowledgment (SACK)-based Loss Recovery Algorithm for TCP, April 2003. www.ietf.org. 22. RFC 3168 – The Addition of Explicit Congestion Notification (ECN) to IP, September 2001. www.ietf.org. 23. S.
Floyd, “TCP and Explicit Congestion Notification”, Lawrence Berkeley Laboratory, One Cyclotron Road, Berkeley, CA 94704, ACM Computer Communication Review, V. 24 N.5, October 1994, p. 10-23. www.icir.org/floyd/papers.html. 24. RFC 3522 – The Eifel Detection Algorithm for TCP, April 2003. www.ietf.org. 25. RFC 3649 – HighSpeed TCP for Large Congestion Windows, December 2003. www.ietf.org. 26. ITU-T G.7041 Standards, reference www.itu.int/ITU-T/. 27. P. Andrews, T. Sherwin, B. Banister, “A Centralized Access Model for Grid Computing”, IEEE/NASA MSST2003 Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems & Technologies, April 7-10 2003, storageconference.org/2003. 28. EMC Corporation, “MirrorView Using Fibre Channel over IP”, December 30, 2002. 29. E. Gabber, J. Fellin, M. Flaster, F. Gu, B. Hillyer, W.T. Ng, B. Ozden, E. Shriver, “StarFish: highly-available block storage”, Proceedings of the FREENIX track of the 2003 USENIX Annual Technical Conference, San Antionio, Tx, June 9-14. 30. M. Nakamura, M. Inaba, K. Hiraki, “Fast Ethernet Is Sometimes Faster Than Gigabit Ethernet on LFN – Observations Of Congestion Control Of TCP Streams”, University of Tokyo, 7-3-1, Hongo, Bunkyo-ku, Tokyo, Japan, www.data-reservoir.adm.s.utokyo.ac.jp/paper/pdcs2003.pdf. 44 31. McDATA Corporation, “Maximizing Utilization of WAN Links with Nishan Fast Write”, McDATA San Jose Development Center, 3850 North First Street, San Jose, Ca 95134. 2002. 45 46 CHALLENGES IN LONG-TERM DATA STEWARDSHIP Ruth Duerr, Mark A. Parsons, Melinda Marquis, Rudy Dichtl, Teresa Mullins National Snow and Ice Data Center (NSIDC) 449 UCB, University of Colorado Boulder, CO 80309-0449 (rduerr, parsonsm, marquism, dichtl, tmullins)@nsidc.org Telephone: +1 303-(735-0136, 492-2359, 492-2850, 492-5532, 492-4004) Fax: +1 303-492-2468 1 Introduction The longevity of many data formats is uncertain at best, and more often is disturbingly brief. Maintenance of backwards compatibility of proprietary formats is frustratingly limited. The physical media that store digital data are ephemeral. Even if the data are properly preserved, the information that allows the data to be searched and which maintains the context of the data is often lost, threatening data utility. These are only a few of the formidable problems that threaten the long-term preservation and long-term use of digital data. Over the past decade, much has been written about the problems of long-term digital preservation (see for example [14], [15], [32], [38], and [39]). Many approaches or strategies to address these problems have been proposed (see for example [7], [10], and [32]), and a number of prototypes and test beds have been implemented (see for example [44]). No one has developed a comprehensive solution to these problems. In fact, there may not be a single solution. Most of the literature applies directly to the needs of libraries, museums, and records management organizations. Only rarely are issues related to preservation of science data discussed directly. Stewards of scientific data often face much different issues than the typical library, museum, or records archive. Some issues are simpler others more complex. In this paper, we provide a brief history of data stewardship, particularly science data stewardship, define long-term stewardship; and discuss some of the problems faced by data managers. 
We describe a broad array of data stewardship issues, but we will focus on those that are particularly amenable to technological solutions or that are exacerbated when archives are geographically distributed. 2 A Brief History of Scientific Data Stewardship A cursory review of scientific data stewardship as a discipline distinct from document preservation or records management suggests that it is a fairly recent concept. For most of human history, what little scientific data existed was recorded in notebooks, logs or 47 maps. With luck, a library or archive would collect and preserve these logs and maps. The archives may have been maintained by the church, a professional society, or perhaps were established through government regulation, but it was generally an ad hoc affair. Unless a potential data user was already aware of the existence and location of certain “data set,” it was extremely difficult to find and access the data. The establishment and growth of academic and public libraries in more recent centuries greatly improved data preservation and access. Libraries were at the forefront of new data cataloging, indexing, and access schemes; librarians were critical data stewards. Yet the “data” were still primarily in the form of monographs and logbooks, and, logically, libraries focused more on books, journals, and other publications more concerned with data analysis. (Maps may have been as readily archived as books and journals). It wasn’t until the establishment of the World Data Centers (WDCs) in 1957-1958 that the concept of a publicly funded facility specifically charged with providing data access and preservation became prominent [1]. The World Data Center system originally archived and distributed the data collected during the International Geophysical Year [1]. The data in question were generally small in volume and certainly not digital, but the concept that an institution would focus on the preservation and distribution of raw data as opposed to the interpretation of those data was revolutionary. Furthermore, the WDCs were organized by disciplines such as glaciology and meteorology. This helped reinforce an association between disciplinespecific science and data stewardship. Since then, the number of discipline-specific data centers has grown. In the US a total of nine national data centers were established, primarily sponsored by NOAA, DOE, USGS and NASA [2] to archive and distribute data in disciplines such as space science, seismology, and socioeconomics. The development of these world and national centers made finding relevant data a little simpler. Now there was likely to be an organization that could be queried if only by mail or telephone. If they couldn’t provide the data directly, they were usually able to provide references to other places to look. Local and state governments, universities, and even commercial entities have continued the trend and established a variety of data centers, typically organized around disciplines or subject areas as diverse as “advertising” [3] or “cancer in Texas” [4]. The Federal government again made a significant contribution in the early 1990s when NASA established eight discipline-specific Distributed Active Archive Centers (DAACs) to collaboratively archive and distribute data from NASA’s Earth Science Enterprise (ESE). In some ways the DAAC system followed the model of the distributed and disciplinespecific World and National Data Centers, and NASA typically collocated the DAACs with already established data centers [2]. 
However there are some key differences in the approach. On one hand, DAACs are intended to only archive and distribute data during the most active part of the data life cycle. The DAACs are to transfer their data to a permanent archive several years after each spacecraft mission in the ESE program ends, but the details of this transfer are yet to be finalized. On the other hand, an early and 48 important goal of the ESE was to make finding and obtaining Earth science data simpler than it had been. The DAACs are part of a larger system of remote sensing instruments and data systems called the Earth Observing System (EOS). They are linked together through the EOS Data and Information System (EOSDIS) Core System (ECS), which provides tools and hardware to handle ingest, archival, and distribution of the large volumes of data generated by EOS sensors and heritage data sources. An important component of ECS is an electronic interface that allows users to search and access the holdings of all of the DAACs simultaneously. This interface was initially developed as an independent client that users would install on their own machine, but shortly after ECS development started, the first web browsers became available. This led to the development of the EOS Data Gateway (EDG), a web-based search and order tool. Currently the EDG allows search and access to DAAC data as well as data located at several data centers scattered around the world. What is important to note about ECS and the DAACs is that it was arguably the functional beginning of new model of data management where data archival was geographically distributed, but search and order were centralized. It is also notable that this was a newly comprehensive effort to acquire, archive, and provide access to a very large volume of data but there is still no concrete plan for the long-term disposition of the data. Both these trends—centralized access to decentralized data and inadequate planning for long term archival—continue today. Indeed NASA is moving further away from a data center approach with its new Strategic Evolution of Earth Science Enterprise Data Systems (SEEDS) [12]. Of course, the World Wide Web has been a major driver in the increased decentralization of data storage. Furthermore, improved search engines theoretically make it easier than ever to find data. We have even heard it suggested that Google may be the only search engine needed. General search engines, however, provide little information to help a user determine the actual applicability or utility of the data found. Little of the information currently available on the web has been subject to the levels of peer-review, copyediting, or quality control traditionally done by data managers or library collection specialists [18]. Finally, no mechanism ensures the preservation of much of the information available via the Web. Often web sites cited in a paper are no longer active mere months after the publication of the paper [43]. There are many efforts underway to address some of the issues inherent in distributed Earth-science data systems including the overall Web. 
Some examples of centralized search tools for distributed scientific data include: • NASA’s Global Change Master Directory (GCMD) (http://gcmd.nasa.gov) • The Distributed Oceanographic (http://www.unidata.ucar.edu/packages/dods/index.html) 49 Data System • The Alexandria Digital Library Project (http://alexandria.ucsb.edu/) • The National Spatial (http://www.fgdc.gov/nsdi/nsdi.html) Data Infrastructure (NSDI) Some of these efforts predate the World Wide Web, and some like the GCMD are strictly search tools, while others such as the NSDI attempt (with mixed success) to provide actual data access. As data managers at the National Snow and Ice Data Center (NSIDC), we are primarily concerned with Earth science data, but we should note that many of the issues we will discuss apply to a variety of disciplines. Based on some of our experience at a session on “Virtual Observatories” at the Fall 2003 meeting of the American Geophysical Union, it seems that non-Earth Science related disciplines sometimes lag behind the Earth sciences in the management of their data. Mechanisms for simultaneously searching and accessing data stored at multiple distributed data centers may not exist. For example, no equivalent to the GCMD or EDG currently exists for the solar or space physics community. This situation is rapidly changing. Numerous groups are working on virtual observatory concepts, which in some ways are reminiscent of the EOS DAAC system described earlier. We should also be aware of the growth of private records management companies. It is certainly possible for commercial entities to address some of the issues of modern data stewardship, but very little research has been done to accurately quantify the necessary costs of a distributed data management infrastructure. Nor have there been any significant efforts to do a cost-benefit analysis of the various components of such a structure [17]. This is especially true in the international context, where not only is distributed data management more challenging, but cost models become more difficult. For example, different countries have data access and pricing policies that are rooted less in economics than in political or philosophical issues such as the right for citizens to access government documents (See [17] and [16] for examples.). In the following sections, the challenges of providing distributed data discovery and access, while adequately addressing long-term stewardship will be discussed. NSIDC's more than 25-year history as: • A World Data Center • Part of a NOAA cooperative institute • A NASA Distributed Active Archive Center (DAAC), • NSF's Arctic System Science (ARCSS) Data Coordination Center ADCC) and Antarctic Glaciological Data Center (AGDC) 50 • A central node for the International Permafrost Association's (IPA) Global Geocryological Data System (GGD) will serve as one source of examples. 3 Long-Term Stewardship Defined Within the data management field, “long-term” is typically defined as: A period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing user community, on the information being held in a repository[5]. Given the current rate of technological change, any data-generating project or program with a duration of five or more years should be considered as long-term and will need to take changes in technology into account. Stewardship, especially data or scientific stewardship is more difficult to define. 
Of the 107 results recently found with a Google search of the phrase “scientific stewardship,” very few (primarily NOAA sites, a few religious sites, and one lumber company) actually defined what the phrase meant in their context. These concepts are relatively new and do not show up in standard information science dictionaries or encyclopedias. The term data stewardship was used in the early 1990s by the Department of Defense in DOD Directive 8320.1-M.1, which defined data administration as “the person or group that manages the development, approval, and use of data within a specified functional area, ensuring that it can be used to satisfy data requirements throughout the organization” [40]. Two other relevant definitions can be found in the literature. The first comes from the vision statement from a workshop sponsored by NASA and NOAA which states that long-term archiving needs to be a “continuing program for preservation and responsive supply of reliable and comprehensive data, products, and information … for use in building new knowledge to guide public policy and business decisions” [11]. The second definition was presented by John J. Jensen of NOAA/NESDIS at the 2003 IEEE/NASA Mass Storage Conference, as “maintaining the science integrity and long term utility of scientific records” [45]. Both definitions associate scientific stewardship with data preservation as well as access or use in the future. These dual needs are also recognized in the library and records communities (see for example [44] and [46]). Beyond simple access to the original science data, good science stewardship has been shown to allow future development of new or improved products and for use of data in ways that were not originally anticipated [11]. To support these uses however, extensive documentation is needed including complete documentation about the characteristics of the instrument/sensor, its calibration 51 and how that was validated, the algorithms and any ancillary data used to produce the product, etc. [11] and [12]. This level of associated documentation goes well beyond the typical metadata needs of library or records materials. 4 Data and Metadata Related Challenges 4.1 Open and Proprietary Data and Metadata Formats The challenges of preserving information for the long term when it is stored in a proprietary format (e.g., MS Word) have been described elsewhere [6]. Commercial pressures do not allow companies to maintain backwards compatibility with each new release for very long. This leaves a very narrow window of opportunity for the information to be migrated to a newer version of the format or a different format, with the attendant risk of loss of functionality or information with each migration. In the science stewardship realm this may not seem like a large concern since data are still often stored as ASCII tables, flat binary files or one of an increasing number of community standard formats (e.g., shapefiles, HDF-EOS 4). However, much of the associated information about the data - the information that will be needed decades later to allow reanalysis or reprocessing or to allow the development of new products - may very well be stored in a wide variety of proprietary formats (e.g., CAD files, MS Word document). Even when the data are stored in a non-proprietary format (e.g., CDF, net-CDF or HDFEOS), the data cannot be maintained forever in their original format. Even so-called standard formats evolve with changes in technology. 
For example, much of the data stored in the typically petabyte-scale archives of the NASA DAACs, are stored in either HDF-EOS 2.x or HDF-EOS 5.x formats (there are no 3.x or 4.x versions). HDF-EOS 5.x was developed as technological changes mandated entirely new data system architectures incompatible with HDF-EOS 2.x. While tools are available to help users migrate data from the 2.x version to the 5.x version, the new version is not entirely backwards compatible. NASA is currently committed to funding maintenance of both versions [8], but it is not clear whether maintenance will continue once the data are transferred to another agency for long-term archival. Format evolution can cause particular problems in the Earth sciences where it is necessary to study long data time series in order to detect subtle changes. For example, NSIDC holds brightness temperature and derived sea ice data from a series of passive microwave remote sensing sensors. This is one of the longest continuous satellite remote sensing time series available, dating back prior to 1978. NASA is continuing this time series with a new higher resolution sensor, the Advanced Scanning Microwave Radiometer (AMSR), aboard the Aqua spacecraft. This is an exciting new addition, but scientists and data managers must work to tie the AMSR data into the existing time series. Not only will there be the normal, expected issues of intercallibrating different but related sensors, but someone will likely need to do some data format conversion. The 52 currently intercallibrated historical data is available in flat binary arrays with ASCII headers while the AMSR data is available in HDF-EOS. Issues such as these have resulted in a call by some for the establishment of a digital format archive [9], while others have called for conversion to a “Universal Data Format” or other technology-independent-representation upon archival (see for example [10], [13], and [32]). Both of these options require additional research according to a recent NSF-DELOS report [14]. They also increase the need for good metadata describing data format transformations and how these transformations may affect the utility of the data. 4.2 Which Standards and What Metadata? One of the lessons learned from the ESE experience is that “community-based standards, or profiles of standards, are more closely followed than standards imposed by outside forces” [12]. Developers of the ECS system recognized that having all of the data from the entire suite of satellites and sensors in the same format would simplify user access. After consideration of several potential formats, NASA settled on the HDF-EOS, a derivative of the HDF format standard [8]. A variety of user and producer communities rebelled. As a result, while much of the data stored in the ECS system is stored in HDFEOS format, there are a number of products, notably the GLAS data stored at NSIDC, that are not in HDF-EOS format. In addition to the recognition that user community involvement is necessary for successful standards development and adoption, the other important concept from the quote above is the notion of a standards profile, “a specific convention of use of a standard for a specific user community” [12]. It is typically not enough to say that a particular format standard is being used (e.g., HDF or netCDF); it may be necessary to define specific usage conventions possibly even content standards acceptable to a given user community in order to ensure interoperability. 
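As a concrete illustration of why such usage conventions matter, consider the flat binary arrays with ASCII headers mentioned above: nothing in the file itself records its layout, so the convention has to travel as documented metadata. The NumPy sketch below reads such a grid given an assumed header length, byte order, data type, and grid size; all of those values are hypothetical and are not the actual layout of any NSIDC product.

```python
import numpy as np

def read_flat_grid(path: str, header_bytes: int, rows: int, cols: int,
                   dtype: str = ">u2", scale: float = 0.1):
    """Read a grid stored as a flat binary array preceded by a fixed-length
    ASCII header.

    Every parameter here (header length, dimensions, byte order, scaling) is an
    assumed convention: none of it can be recovered from the file itself, which
    is why undocumented flat binary data ages so much worse than data held in
    self-describing formats such as HDF-EOS or netCDF.
    """
    with open(path, "rb") as f:
        header = f.read(header_bytes).decode("ascii", errors="replace")
        grid = np.frombuffer(f.read(), dtype=dtype).reshape(rows, cols)
    return header, grid.astype(np.float32) * scale   # e.g. tenths of kelvin to kelvin

# Hypothetical usage for a polar grid of assumed size 448 x 304 with a 300-byte header:
# header, tb = read_flat_grid("tb_example.bin", header_bytes=300, rows=448, cols=304)
```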
These specific conventions or profiles may vary from community to community. Probably one of the most overworked expressions in the IT industry is “Which standards? There are so many to choose from.” It is ironic that not only are there so many standards of a type to choose from; but also that there are so many types of standards about which one must make choices. In the data stewardship realm it is not enough to think about data preservation and data access format standards; one must also think about standards for metadata format and content, documentation format and content, interoperability, etc. For metadata, the question is compounded further by the need to distinguish the type of metadata under discussion, e.g., metadata for data discovery, data preservation, data access, etc. The Open Archival Information System (OAIS) Reference Model [5] provides an information model (see Figure 1) that describes the different kinds of information needed in order to ingest, preserve and provide access to digital or analog objects. The model appears to be gaining some acceptance in the library and archive communities. It is based 53 on the concept of an Information Package that can be found by examining its associated Descriptive Information. The components of the Information package itself are: Figure 1: Information Package Components and Relationships [19] • Content Information-containing both the data object to be preserved as well as enough Representation Information for a targeted user community to understand the data object’s structure and content. For science data this would include identification of the data format and any associated profile. Of the two components, structure and content, content is more difficult to obtain. The science community has become so specialized that community-specific jargon and underlying assumptions are pervasive. Capturing and documenting these, so that others outside that very small peer group can understand and use the data, is challenging. In a very distributed environment, such as the virtual observatories of the future, there will be many thousands or millions of Data Objects preserved in many different places, all of which have the same data format or even the same standard profile. It would be impractical to store the same format information with each object. This may bolster the argument for establishing centralized data format repositories [9], but would require considerable coordination to be successful. • Preservation Description Information (PDI)-containing the information needed to preserve the object for the long-term. The PDI is comprised of 4 components: o Provenance, or the history of the Data Object. In the science arena, this involves tracking the processing history, what input products were used, what version of which algorithms were used, what ancillary information was used for calibration and validation, as well as a host of other instrument-related information. It also includes information about when and where the data were created, processed, and acquired; as well as who was responsible for their creation and what changes have taken place since. 54 o Reference Information-including information needed to identify this object from the universe of other objects. Much has been written about the need for persistent and unambiguous identifiers (see for example [32], [34], and [15]) and various communities have proposed standards for these (see for example [33] and [13]). 
A key finding is that in a distributed environment a name issuing authority is needed to prevent naming collisions [32]. In the science community, a hierarchy of identifiers is typically needed. For example, in the ECS system, Earth Science Data Types (ESDT’s) are used to identify a data set, or class of objects, while granule ids are used to identify specific objects within the set. o Fixity Information-documenting the methods used to ensure that the object hasn’t been changed in an undocumented manner. This typically includes Cyclic Redundancy Check (CRC) values, digital signature keys, etc. This topic is addressed separately later in this paper. o Context Information-documenting why this object was created and how it relates to other objects. While the OAIS reference model discusses the types of metadata or information that must be gathered in order to preserve an object, it leaves the actual definition of that metadata to the individual archive or archive group. Several groups within the library community have independently developed their own preservation metadata specifications (see [21], [22], [23], and [24]) and have recently come together under the joint auspices of the Research Libraries Group (RLG) and the Online Computer Library Center (OCLC) to develop an OAIS-based metadata framework that “could be readily applied to a broad range of digital preservation activities” [25]. While providing a useful starting point for the science community, the OCLC/RLG framework is not adequate for preserving science data. The science community is typically more interested in preserving information about how the data were created than in preserving any particular presentation mechanism. This is to be expected given the different kinds of uses to which patrons of libraries and science users put the materials accessed. Typically a library patron expects to experience the material using his or her senses; to read, listen to, touch, or watch; but not to transform the materials accessed. Scientists typically access data so that it can be transformed, analyzed, used as an input to a model or new product, compared with other data, etc. Changes in presentation format over time as technology, programming languages, and visualization tools change, are not that important – the important things are the original bits and their meaning. In the earth science realm probably the most relevant metadata standard is the “Content Standard for Digital Geospatial Metadata” established by the Federal Geographic Data Committee (FGDC) [19]. President Clinton mandated that federally funded geospatial data (i.e., most Earth science data including ESE data) adhere to the FGDC standard in an 11 April 1994 executive order [20]. The FGDC standard “was developed from the perspective of defining the information required by a prospective user to determine the 55 availability of a set of geospatial data; to determine the fitness and the set of geospatial data for an intended use; to determine the means of accessing the set of geospatial data; and to successfully transfer the set of geospatial data” [19]. As such, there is some but not complete overlap with the kinds of metadata called for by the OAIS reference model. Much of the preservation metadata called for by the OAIS model is not part of the FGDC standard. In the international standards realm, the equivalent to the FGDC standard is the ISO 19115 standard [26]. 
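The OAIS information model summarized above can be made concrete with a small sketch. The Python fragment below assembles a minimal information package: content plus representation information, and the four preservation description components (provenance, reference, fixity, context), with fixity recorded as a SHA-256 digest that can later be re-verified. The field names and structure loosely paraphrase the reference model for illustration only; they are not an implementation of the OAIS standard or of any particular archive's metadata schema.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class PreservationDescription:
    provenance: dict      # processing history, algorithm versions, responsible parties
    reference: dict       # persistent identifiers, e.g. data set id plus granule id
    fixity: dict          # digest algorithm and value used to detect undocumented change
    context: dict         # why the object exists and how it relates to other objects

@dataclass
class InformationPackage:
    content: bytes                # the data object being preserved
    representation_info: dict     # format and profile a future user needs to interpret it
    pdi: PreservationDescription
    descriptive_info: dict = field(default_factory=dict)   # what a search system would index

def make_package(data: bytes, fmt: str, dataset_id: str, granule_id: str,
                 provenance: dict, context: dict) -> InformationPackage:
    pdi = PreservationDescription(
        provenance=provenance,
        reference={"dataset_id": dataset_id, "granule_id": granule_id},
        fixity={"algorithm": "SHA-256", "value": hashlib.sha256(data).hexdigest()},
        context=context,
    )
    return InformationPackage(data, {"format": fmt}, pdi, {"dataset_id": dataset_id})

def verify_fixity(pkg: InformationPackage) -> bool:
    """Recompute the digest to confirm the content has not silently changed."""
    return hashlib.sha256(pkg.content).hexdigest() == pkg.pdi.fixity["value"]

pkg = make_package(b"\x00\x01\x02", fmt="flat binary, documented profile",
                   dataset_id="EXAMPLE-TB-GRID", granule_id="1999-001",
                   provenance={"algorithm": "hypothetical retrieval v1.2"},
                   context={"purpose": "illustration"})
print(verify_fixity(pkg))   # True until the bytes are altered
```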
Like the FGDC standard, the ISO standard is meant to facilitate, discovery, assessment for use, access and use and, like the FGDC standard, does not address much of the preservation metadata of the OAIS reference model. The FGDC has developed a draft “cross-walk” between the FGDC and ISO 19115 standards which will help FGDC-compliant users also become ISO 19115 compliant users. Both the FGDC and ISO 19115 standards are content standards, not implementation standards, yet organizations must choose implementation options. Consensus seems to be building that Extensible Markup Language (XML) should be the implementation standard for metadata. ISO Technical Committee 211 is in the process of developing an UML implementation standard for the ISO 19115 metadata standard that will include an associated XML schema. XML is also the recommendation of the National Research Council [2]. 4.3 Preservation vs. Access Users want data in formats that are easy to use. The desired format may be a communitybased standard, or it may be a format that is compatible with other related data sets. Furthermore, our experience at NSIDC shows that users usually need spatial or temporal subsets of large collections and may need to combine several products. They may also need the data in a different grid or projection than the original data. In other words, the utility or essence of science data is not strongly associated with any particular access format. Indeed, many different formats, grids, and projections may need to be supported at any given time. This is significantly different from other disciplines concerned with digital preservation where it is often essential to preserve the essence of the original experience for multimedia digital objects such as movies, applications, or interactive web sites. In the Earth science community it makes more sense to consider preservation and access formats independently. Access formats are likely to change quickly over time, while preservation formats should be more stable. There are similar issues with preservation and access metadata. There are advantages to preserving the metadata with the actual data (see section 4.4), but much of the metadata is only relevant to the archivist. The preservation-specific metadata should probably be separated from the data upon delivery to the user to minimize user confusion. Some have called for completely separate storage of preservation and access data [13]. However, storage of multiple copies of the data is unaffordable with large data sets or when there are multiple access formats. In many cases, the issue may be how to afford preservation of even a single copy of the data! There is some agreement that the best 56 CD contained over 50 data sets and nearly 100 references to other data sets held by different “nodes” of the GGD system. Unfortunately, funding for the GGD did not continue past 1998. It wasn’t until 2002, when a new initiative started to create an updated version of the CD (now a three CD set [42]), that any maintenance of the data from the 1998 version resumed. Regrettably, dozens of the distributed or “brokered” products were no longer readily available. NSIDC has plans to try and track down or “rescue” some of these data sets, but the four- to five-year time lag and the globally distributed nature of the data sets will make it very challenging. 
This illustrates the need for nearly constant tracking of distributed data to ensure its continued availability, or a clear and usable means (with incentives) for providers to provide updates to any central access point.

4.5 Data Security and Integrity

Ultimately, keeping track of data and metadata becomes an issue of data integrity. Scientists need to trust the validity of the data they use. They need to know that the data came from a scientifically reputable source and that the data have not been corrupted in any way. Scientific integrity is an ill-defined concept, but it is rooted in the scientific method. Experiments must be repeatable. Results from experiments should be published in peer-reviewed literature. The data and information used in the experiment must be specifically acknowledged and generally accessible when possible. Traditionally this is handled in the literature through a system of formal citations. But while methods for citing information sources are well established and traceable, methods for citing data sources are more variable. Historically, with small non-digital data sets, the data may have been published directly in a journal or monograph that could specifically be cited. This was not an entirely consistent process, though, and as data sets have grown, authors have adopted different methods for acknowledging their data sources. Some authors may provide a simple acknowledgement of the data source in the body of an article or the acknowledgements section. Other authors may cite an article published by the data provider that describes the data set and its collection. As publishers of data, we at NSIDC have found these historical approaches lacking. General data acknowledgements are difficult to trace, are often imprecise, and sometimes do not acknowledge the true data source. For example, an acknowledgement of “NSIDC’s SSM/I sea ice data” could actually refer to one of several different data sets, and it makes no reference to the actual scientists who developed the sea ice algorithm. Citing a paper about the data is better, but in many cases such papers may not exist, they may only describe a portion of the data set, or their description may not be relevant to the new application of the data. In any case, it is not clear how to actually acquire the data—a necessary step if an experiment is to be repeated. We recommend that users cite the actual data set itself, much as they would a book or journal article. The “author” is typically the data provider or the person who invested intellectual effort into the creation of the data set (e.g., by creating an algorithm), while NSIDC or another archive that distributed the data might be considered the publisher. It is also crucial to include publication dates to distinguish between different versions of related data sets. In any case, we try to provide a specific recommended citation for every data set we distribute. Although we have met some sporadic resistance from occasional providers who wish only for their papers to be cited, this approach has become broadly accepted. It is the approach specifically recommended by the International Permafrost Association ([41], [42]), and has generally been accepted by the other NASA DAACs.
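The citation recommendation above lends itself to a small helper that makes the required elements explicit: the people who did the intellectual work as authors, a publication year to separate versions, the archive as publisher, a locator, and an access date. The sketch below is ours for illustration; the punctuation and the example values are hypothetical, not an NSIDC or International Permafrost Association citation rule.

```python
from dataclasses import dataclass

@dataclass
class DataSetCitation:
    authors: list          # the people who invested intellectual effort (e.g. algorithm developers)
    year: int              # publication year -- distinguishes versions of related data sets
    title: str
    version: str
    publisher: str         # the archive that distributes the data
    locator: str           # persistent identifier or URL
    accessed: str          # date the data were obtained, for repeatability

    def format(self) -> str:
        names = ", ".join(self.authors)
        return (f"{names}. {self.year}. {self.title}, {self.version}. "
                f"{self.publisher}. {self.locator}. Accessed {self.accessed}.")

# Hypothetical example -- the data set name and identifier are illustrative only.
print(DataSetCitation(
    authors=["A. Researcher", "B. Developer"],
    year=2003,
    title="Example Passive Microwave Sea Ice Concentrations",
    version="Version 2",
    publisher="National Snow and Ice Data Center, Boulder, CO",
    locator="https://example.org/data/sea-ice-v2",
    accessed="January 2004",
).format())
```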
This formal citation approach works well when there is a clear and reputable data publisher, even in a distributed environment. But the distributed environment may provide additional challenges, especially if data sources are somewhat ephemeral or hard to identify. For example, in a peer-to-peer system, the access mechanism needs to specifically identify the different peers and possibly some assessment of their stability. This is somewhat different from peer-to-peer systems in other areas such as music, where users generally don’t care where the music came from as long as it is the piece they wanted. With the rise of electronic journals we have also heard informal discussion of including the actual data used in the publication itself. Although this approach obviously includes many of the same data preservation challenges already discussed, it is an intriguing concept worthy of further exploration. Once the scientific integrity of a data set has been assured, assurance is needed that the data received is what was expected. Several authors discuss the use of public/private key cryptography and digital signatures as methods for ensuring the authenticity of the data (see for example [35] and [36]). Lynch points out that we know very little about how these technologies behave over very long times and that, as a consequence, information about the evolution of these technologies will likely be important to preserve [37]. For a scientist to be able to trust that the data have not been changed, the scientist must be able to trust that the preservation practices of the source of the data are adequate: that archive media are routinely verified and refreshed, that the facilities are secure, that processes to verify and ensure the fixity of the data are operational, that geographically distributed copies of the data are maintained as a protection against catastrophe, and that disaster recovery plans and procedures are in place. To verify these practices, the RLG/OCLC Working Group on Digital Archive Attributes suggests that a process for certification of digital repositories be put in place [34], while Ashley suggests that administrative access to data and metadata be “subject to strong proofs of identity” [36]. Once again, a distributed data environment may make implementing these suggestions more difficult.

4.6 Long-Term Preservation and Technology Refresh

A continual theme in this paper is how the speed of technological change presents a major challenge for preserving data over the long term. As a recent report by the National Science Foundation and the Library of Congress puts it, “digital objects require constant and perpetual maintenance, and they depend on elaborate systems of hardware,

data user community. A good data manager, equipped with the right tools, should be working closely with the data provider to uncover any known limitations in the data. For example, it may be self-evident to developers of sea ice detection algorithms for passive microwave remote sensing that their methodology is well-suited to detection of trends in ice concentration over time but ill-suited for representing precise local ice conditions on a given day. This may not be apparent to a biologist who uses a near-real-time passive microwave derived product to associate sea ice conditions in the Beaufort Sea with polar bear migration. While this is an extreme example, it highlights the need for scientists and data managers to work closely together to carefully document and track new and potentially unexpected uses of the data. It is also important to realize that the risks of inappropriate data applications could increase over time. Of course data can also be improved. New algorithms, new calibration methods, and new instruments may be developed. In Earth science in particular, it is important to detect variability over long time periods. This means that different instruments must be intercalibrated to ensure a consistent time series, i.e.
we need to be able to ensure that any changes we detect in a data stream result from actual geophysical processes, not just changes in instruments or algorithms. This again requires collaboration between the data manager and the scientist. This is certainly possible in distributed environments, but mechanisms should be established to ensure that information about data harmonization and improvements is readily available to users. Traditionally, this was the role of the data steward or data manager (see, for example, [29]). It is less clear how this would work in a distributed environment, but knowledge bases and data mining systems are likely to contribute. On a related note, to ensure maximum scientific understanding of an issue, data and support services need to be readily available to as many users as possible [11]. This is necessary to ensure all possible scientific ideas are explored and that scientific experiments can be duplicated. The necessary broad access may be better realized in a distributed data model, but only if the challenges in section four are addressed. Again this will require close interaction with the users. Historically, NSIDC has addressed these scientific issues by working closely with its data providers and by having scientific data users and developers on staff. This becomes a less practical approach in a distributed data environment where data may be held and distributed by individuals and institutions with varying levels of scientific and data management expertise. It will become increasingly important to formally address the relationship of data managers and scientists as new distributed data management models are developed.

5.2 Decisions, Decisions, Decisions - Deciding What Data to Acquire and Retain

One of the most difficult decisions in data archival is which data to acquire and keep and which data to throw away. Although there is still no effective business model that demonstrates the costs and benefits of long-term data archival [15], it is clearly impractical to keep all data for all time. That said, we need to recognize that many data sets often have unexpected future applications (see [11] for examples). A simple approach would be to archive a very low level of the data along with the necessary algorithms to process the higher level products. However, this must be viewed only as a minimum since it does not allow for the necessary simple and broad access described above. It is probably not possible to describe any one infallible data acquisition and deposition scheme. However, any data stewardship model must explicitly include a method for development of such a scheme for different types of data and user communities. These schemes must explicitly include knowledgeable and experienced users of the data who are directly involved in generating new products and data quality control [11].

5.3 Upfront Planning

Our experience at NSIDC has shown that by working with the scientists and data providers early in an experiment or mission, ideally before any data are actually collected, we can significantly improve the quality and availability of the data. Most scientists can probably think of a field campaign where the data are no longer available. NSIDC worked to avoid this problem by working closely with the investigators conducting the Cold Land Processes field experiment in the Colorado Rocky Mountains during the winter and spring of 2002 and 2003 (see [30]). Not only was NSIDC involved in the planning of the data collection, but it also provided data technicians who worked closely with field surveyors during the experiment.
These data technicians learned the data collection protocol with the surveyors, helped collect some of the data, and entered the data into computers the night after they were collected. By learning the protocol and immediately entering the data, technicians were able to identify missing values and anomalies in the data and run some automated quality control checks. They were then able to follow up with the surveyors soon after they collected the data to correct specific problems and to improve later data collection. Technicians were also able to provide the data to the lead scientists for immediate assessment. Overall, this led to a 10 to 20 percent improvement in data quality [31]. NSIDC has had similar experience with recent satellite remote-sensing missions. NSIDC is the archive for all the data from NASA’s Advanced Microwave Scanning Radiometer (AMSR) and Global Laser Altimetry System (GLAS). Although NSIDC was not directly involved in the acquisition of the data, it did work closely with the mission science and instrument teams well before the instruments were even launched. This allowed the data managers to have a much greater understanding of the engineering aspects of the data and the algorithms used to produce the higher-level products. The result is much better documentation and much earlier data availability. Data from both of these missions were available to the public only months after launch, in contrast to years with some historical systems where data managers were not involved until well after their launch (e.g, sea ice data from SSM/I). 62 There is nothing inherent about distributed data systems that should preclude early involvement of data managers, but again this is something to consider in the design of those systems. Furthermore, data manager involvement could be more difficult if traditional data management organizations are not directly involved in the distributed data system. 6 Conclusions The scientific method requires that experimental results be reproducible. That means the data used in the original experiment must be available and understandable. Furthermore, reexamination of an early data set often can yield important new results. Maintaining access to and understanding of scientific data sets has been a challenge throughout history. The trend to a more geographically distributed data management model may improve data access in the short run but raises additional challenges. We should be able to address many of these challenges by developing new tools and data management systems, but we must not forget the human component. Experience and a review of the known data management issues show that we achieve the greatest success in long term data stewardship only when there is a close collaboration between data providers, data users, and professional data stewards. As we move forward, we need to ensure that new technologies and new data archive models enhance this collaboration. [1] NOAA's National Geophysical Data Center. “About the World Data Center System.” December 29, 2003. http://www.ngdc.noaa.gov/wdc/about.shtml. January 2004. [2] National Research Council. 2003. “Government Data Centers: Meeting Increasing Demands.” The National Academies Press. [3] A d Age Group. “Data Center.” January http://www.adage.com/datacenter.cms. Data Center. January 6, 2004. 6, 2004. [4] Texas Cancer Council. “Texas Cancer Data Center.” December 23, 2003. http://www.txcancer.org/. Texas Cancer Data Center. January 6, 2004. [5] CCSDS. 2002. 
“Reference Model for an Open Archival Information System (OAIS).” CCDSD 650.0-B-1. Blue Book. Issue 1. January 2002. [Equivalent to ISO 14721:2002]. [6] Barnum, George D. and Steven Kerchoff. “The Federal Depository Library Program Electronic Collection: Preserving a Tradition of Access to United States Government Information.” December 2000. http://www.rlg.org/events/pres-2000/barnum.html. January 7, 2003. [7] Moore, Reagan W. October 7, 1999. “Persistent Archives for Data Collections.” SDSC Technical Report sdsc-tr-1999-2. 63 [8] Ullman, Richard. “HDF-EOS Tools and Information Center.” 2003. http://hdfeos.gsfc.nasa.gov/hdfeos/index.cfm. January 7, 2004. [9] Abrams, Stephen L. and David Seaman. "Towards a global digital format registry." World Library and Information Congress: 69th IFLA General Conference and Council, Berlin, August 1-9, 2003 http://www.ifla.org/IV/ifla69/papers/128eAbrams_Seaman.pdf [10] Shepard, T. and D. MacCarn. 1999. The Universal Preservation Format: A Recommended Practice for Archiving Media and Electronic Records. WGBH Educational Foundation. Boston. 1999. [11] Hunolt, Greg. Global Change Science Requirements for Long-Term Archiving. Report of the Workshop, Oct 28-30, 1998. USGCRP Program Office. March 1999. [12] SEEDS Formulation Team. Strategic Evolution of Earth Science Enterprise Data Systems (SEEDS) Formulation Team final recommendations report. 2003. http://lennier.gsfc.nasa.gov/seeds/FinRec.htm. Jan. 2004. [13] The Cedars Project. “Cedars Guide to The Distributed Digital Archiving Prototype.” March 2002. http://www.leeds.ac.uk/cedars/guideto/cdap/. December 2003. [14] “Invest to Save: Report and Recommendations of the NSF-DELOS Working Group on Digital Archiving and Preservation.” 2003. Prepared for the National Science Foundation’s (NSF) Digital Library Initiative and the European Union under the Fifth Framework Programme by the Network of Excellence for Digital Libraries (DELOS). [15] “It’s About Time: Research Challenges in Digital Archiving and Long-Term Preservation, Final Report, Workshop on Research Challenges in Digital Archiving and Long-Term Preservation.” August 2003. Sponsored by the National Science Foundation, Digital Government Program and Digital Libraries Program, Directorate for Computing and Information Sciences and Engineering, and the Library of Congress, National Digital Information Infrastructure and Preservation Program, August 2003. [16] Lachman, B.E., A Wong, D. Knopman, and K. Gavin. 2002. “Lessons for the Global Spatial Data Infrastructure: International case study analysis.” Santa Monica, CA: RAND. [17] Rhind, D. 2000. Funding an NGDI. In Groot, R. and J. McLaughlin, eds. 2000. Geospatial data and infrastructure: Concepts, cases, and good practice. Oxford University Press. 64 [18] PDG (Panel on Distributed Geolibraries, Mapping Science Committee, National Research Council). 1999. Distributed geolibraries: Spatial information resources. Washington, DC: National Academy Press. [19] Federal Geographic Data Committee. Revised June 1998. "Content Standard for Digital Geospatial Metadata." Washington, D.C. [20] Clinton, W. 1994. Coordinating geographic data acquisition and access to the National Spatial Data Infrastructure, Executive Order 12906. Washington, DC. Federal Register 59 17671-4. 2pp. [21] The CEDARS Project. 2001. “Reference Model for an Open Archival Information System (OAIS).” http://www.ccds.org/documents/pdf/CCSDS-650.0-R-2.pdf. December 2003. [22] National Libarary of Australia. 1999. 
“Preservation Metadata for Digital Collections.” http://www.nla.gov.au/preserve/pmeta.html. December 2003. [23] NEDLIB. 2000. “Metadata for Long Term Preservation.” http://www.kb.nl/coop/nedlib/results/preservationmetadata.pdf. December 2003. [24] OCLC. 2001. “Preservation Metadata Element Set – Definitions and Examples.” http://www.oclc.org/digitalpreservation/archiving/metadataset.pdf. December 2003. [25] The OCLC/RLG Working Group on Preservation Metadata. 2002. “Preservation and the OAIS Information Model – A Metadata Framework to Support the Preservation of Digital Objects.” OCLC Online Computer Library Center, Inc. [26] ISO Technical Committee ISO/TC 211, Geographic Information/Geomatics. May 2003. “Geographic information – Metadata. ISO 19115:2003(E).” International Standards Organization. [27] ISO Technical Committee ISO/TC 211. November 2002. “Scope.” http://www.isotc211.org/scope.htm#19139. January 2003. [28] Holdsworth, D. 1996. “The Medium is NOT the Message or Indefinitely Long-Term Storage at Leeds University.” http://esdis-it.gsfc.nasa.gov/MSST/conf1996/A6_07Holdsworth.html. January 2004. [29] Stroeve, J., X. Li, and J. Maslanik. 1997. “An Intercomparison of DMSP F11- and F13-derived Sea Ice Products.” NSIDC Special Report 5. http://nsidc.org/pubs/special/5/index.html. January 2004. [30] Cline, D. et al. 2003. Overview of the NASA cold land processes field experiment (CLPX-2002). Microwave Remote Sensing of the Atmosphere and Environment III. Proceedings of SPIE. Vol. 4894. [31] Parsons, M. A., M. J. Brodzik, T. Haran, and N. Rutter. 2003. Data management for the Cold Land Processes Experiment. Oral presentation, 11 December 2003, at the meeting of the American Geophysical Union. [32] The CEDARS Project. “CEDARS Guide to: Digital Preservation Strategies.” April 2, 2002. http://www.leeds.ac.uk/cedars/guideto/dpstrategies/dpstrategies.html. January 2004. [33] National Library of Australia. “Persistent Identifiers – Persistent Identifier Scheme Adopted by the National Library of Australia.” September 2001. http://www.nla.gov.au/initiatives/nlapi.html. January 2004. [34] RLG/OCLC Working Group on Digital Archive Attributes. May 2002. “Trusted Digital Repositories: Attributes and Responsibilities.” RLG Inc. [35] Brodie, N. December 2000. “Authenticity, Preservation and Access in Digital Collections.” Preservation 2000: An International Conference on the Preservation and Long Term Accessibility of Digital Materials. RLG Inc. [36] Ashley, K. December 2000. “I’m me and you’re you but is that that?” Preservation 2000: An International Conference on the Preservation and Long Term Accessibility of Digital Materials. RLG Inc. [37] Lynch, C. 2000. “Authenticity and Integrity in the Digital Environment: An Exploratory Analysis of the Central Role of Trust.” Council on Library and Information Resources, Washington, D.C. [38] Thibodeau, K. 2002. “Overview of Technological Approaches to Digital Preservation and Challenges in Coming Years.” The State of Digital Preservation: An International Perspective. Conference Papers and Documentary Abstracts. http://www.clir.org/pubs/reports/pub107/thibodeau.html. December 2003. [39] Lee, K., O. Slattery, R. Lu, X. Tang, and V. McCrary. 2002. The State of the Art and Practice in Digital Preservation. J. Res. Natl. Inst. Stand. Technol. 107, 93-106. [40] DoD 8320.1-M. “Data Administration Procedures.” March 29, 1994. Authorized by DoD Directive 8320.1. September 26, 1991.
Reference: DoD 8320.1-M-1, "Data Element Standardization Procedures," January 15, 1993. [41] International Permafrost Association, Data and Infromation Working Group, compilors. 1998. Circumpolar active-layer permafrost system, version 1.0. Boulder, CO: National Snow and Ice Data Center/World Data Center for Glaciology. CDROM. [42] International Permafrost Association Standing Committee on Data Information and Communication, compilors. 2003. Circumpolar active-layer permafrost system, 66 version 2.0. Edited by M. Parsons and T. Zhang. Boulder, CO: National Snow and Ice Data Center/World Data Center for Glaciology. CD-ROM. [43] Diomidis Spinellis. The decay and failures of web references. 2003. Communications of the ACM, 46(1):71-77. [44] ICTU. “Digital Preservation Testbed.” Digitale Duurzaamheid. 2004. http://www.digitaleduurzaamheid.nl/home.cfm Jan. 2004. [45] Diamond, H., Bates, J., Clark D., Mairs R. “Archive Management – The Missing C o m p o n e n t . ” A p r i l 2 0 0 3 , http://storageconference.org/2003/presentations/B06_Jensen..pdf, NOAA/NESDIS. January 2004. [46] Hedstrom, M. 2001. “Digital Preservation: A Time Bomb for Libraries.” http://www.uky.edu/~kiernan/DL/hedstrom.html. Jan. 2004. 67 68 NARA’s ELECTRONIC RECORDS ARCHIVES (ERA) – THE ELECTRONIC RECORDS CHALLENGE Mark Huber American Systems Corp. National Archives and Records Administration 8601 Adelphi Rd., Rm. 1540, College Park, MD 20740 Tel: +1-301-837-0420 mark.huber@nara.gov Alla Lake Lake Information Systems, LLC National Archives and Records Administration 8601 Adelphi Rd., Rm. B550, College Park, MD 20740 Tel: +1-301-837-0399 alla.lake@nara.gov Robert Chadduck National Archives and Records Administration 8601 Adelphi Rd., Rm. 1540, College Park, MD 20740 Tel: +1-301-837-0394 Robert.chadduck@nara.gov Abstract The National Archives and Records Administration (NARA) is the nation’s recordkeeper. NARA is a public trust that safeguards the records of the American people, ensuring the accountability and credibility of their national institutions, while documenting their national experience. Today NARA holds an estimated 4 billion records nationwide. The Archives consists of the permanently valuable records generated in all three branches of the Federal Government. These record collections span this country’s entire experience, across our history, the breadth of our nation, and our people. While paper documents presently predominate, NARA holds enormous numbers of other media, such as reels of motion picture film, maps, charts, and architectural drawings, sound and video recordings, aerial photographs, still pictures and posters, and computer data sets. It is that last medium, the electronic records, that is the fastest growing record keeping medium in the United States and elsewhere in the world. Since 1998, NARA has established key partnerships with Federal Agencies, state and local governments, universities, other national archives, the scientific community, and private industry to perform research enabling better understanding of the problems and the possibilities associated with the electronic records challenge. The challenge of electronic records encompasses the proof and assurance of records authenticity and assurance of record persistence and ready access to records over time. 1. Background/General Project Description 69 “Electronic records pose the biggest challenge ever to record keeping in the Federal Government and elsewhere. 
There is no option to finding answers…the alternative is irretrievable information, unverifiable documentation, diminished government accountability, and lost history.” John Carlin, The Archivist of the United States The National Archives and Records Administration (NARA) is the nation’s recordkeeper. NARA is a public trust that safeguards the records of the American people, ensuring the accountability and credibility of their national institutions, while documenting their national experience. Pursuant to legislation codified under Title 44 of the United States Code the Archivist of the United States has authority to provide guidance direction and assistance to Federal officials on the management of records, to determine the retention and disposition of records, to store records in centers from which agencies can retrieve them, and to take into the archival facilities of the National Archives and Presidential libraries, for public use, records that he determines “to have sufficient historical or other value to warrant their continued preservation by the United States Government." (44 U.S.C. 2107) Similarly, under the Presidential Records Act, when a President leaves office, the Archivist of the United States assumes responsibility “for the custody, control, and preservation of, and access to, the Presidential records of that President”. Both the Government and the public rely on NARA to provide this and subsequent generations of the American public with access to extraordinarily high accretion rate, increasingly diverse, and arbitrarily complex collections of historically valuable federal, presidential and congressional electronic records collections. The technology challenge confronting NARA is repeatedly confirmed as among the President’s research priorities. In the supplement to the President’s budget for fiscal year 2004, The National Science and Technology Council expressly acknowledges that “R&D in advanced technologies that enable preservation and utility of electronic information archives…,” and “…digital archives of core knowledge for research and learning” is “far from finished.” Especially prominent is the Council’s explicit identification of “….substantial technical issues – such as interoperability among file formats, indexing protocols, and interfaces; data management, storage and validation; … and long term preservation – that impede development of digital libraries…” Similarly noted is research enabling agencies to move “…toward two ambitious goals: quick, easy, and secure on-line access for citizens to government services and information, and radical reduction in internally duplicative record-keeping, ... through coordinated development of IT standards and procedures...” [1] Experts predicted in FY2003 that electronic records volumes will swell by orders of magnitude over this decade, presenting enormous challenges for society along with unprecedented opportunities for U.S. advanced research and technological innovation.”, …fused with requirements for… “technologies for rapid mining, filtering, correlating and assessing of vast quantities of heterogeneous and unstructured data”, and… “tools for collecting, archiving and synthesis.” [2] Similarly, among the president’s FY2002 research priorities: 70 “Strategies to assure long-term preservation of digital records constitute another particularly pressing issue for research. 
As storage technologies evolve with increasing speed to cope with the growing demand for storage space, the obsolescence of older storage hardware and software threatens to cut us off from the electronically stored past.” [3] The Archivist is authorized by law to “conduct research with respect to the improvement of records management practices and programs.” [44 U.S.C. Section 2904(c)(2)]. Since 1998, NARA has established key partnerships with Federal Agencies, state and local governments, universities, other national archives, the scientific community, and private industry to perform research enabling better understanding of the problems and the possibilities associated with the electronic records challenge. NARA’s Key Research Partners:
- National Science Foundation (NSF)
- San Diego Supercomputer Center (SDSC)
- University of Maryland (UMd)
- Georgia Tech Research Institute (GTRI)
- U.S. Army Research Laboratory (ARL)
- National Computational Science Alliance (NCSA)
- National Institute of Standards and Technology (NIST)
- National Nuclear Security Administration (NNSA)
- National Aeronautics and Space Administration (NASA)
- U.S. Department of Defense (DOD)
- Library of Congress (LC)
- International Research on Permanent Authentic Records in Electronic Systems (InterPARES)
- Digital Library Federation (DLF)
- Global Grid Forum (GGF)
NARA’s ERA Program includes ongoing sponsorship, support, and collaboration in technology research activities relevant to developing and sustaining the systematic capability for transfer, preservation, and sustained access to electronic records. ERA must be dynamic in response to continuing technology evolution, ensuring that electronic records delivered to future generations of Americans are as authentic decades in the future as they were when first created. Among the findings presented in the report of the Committee on Digital Archiving and the National Archives and Records Administration of the Computer Science and Telecommunications Board (CSTB) for the National Research Council of the National Academies is the finding that, while no turnkey system, application, or product exists in the marketplace which meets NARA’s requirements, the system can and should be built. [4]
2. Program Status
In response to the digital records challenge, Congress, in November 2001, acting through the Treasury and General Government Appropriations Act {P.L. 107-67}, approved the fiscal 2002 budget, which included $22.3 million for the Electronic Records Archives (ERA) Program. Similarly, in January 2003, Congress, acting through the Consolidated Appropriations Resolution, 2003 {P.L. 108-7}, approved the fiscal 2003 budget, which included $11.8 million for the Electronic Records Archives (ERA) Program. At the time of this writing, and while the final appropriations have not passed, both the House of Representatives and the Senate have agreed to fund the ERA Program at the $35.7 million level in the President’s FY2004 request. The official Request for Proposal (RFP) for the ERA system was released to the public on December 5, 2003. At the time of the RFP release, proposals from industry were required to be submitted to NARA by January 28, 2004. The ERA program schedule calls for up to two contract awards to be made by mid-2004. NARA has structured the ERA procurement to fundamentally be a challenge to industry to propose innovative ways to address the challenges represented by the large number and variety of electronic records generated and used by the Federal government.
The ERA procurement strives to define the electronic records challenge without prescribing implementations or techniques with which to address the issues. Again, NARA wants to engage industry in crafting long term responses to the various technical and operational issues that ERA represents. This paper goes on to explore some of the archival, technical, and operational issues that the ERA program sees as important to the success of ERA. 3. Goals, Issues, and Challenges for Electronic Records - Persistence, obsolescence, access over time Today NARA holds in the National Archives of the United States and the Presidential Libraries an estimated 4 billion records nationwide. The archives consist of the permanently valuable records generated in all three branches of the Federal Government, supplemented with donated documentary materials. [5] These records span this country’s entire experience, across our history, the breadth of our nation, and our people. Not surprisingly, with the passage of time, the medium of the records of the United States has become diverse in format. While paper documents presently predominate, NARA holds enormous numbers of x reels of motion picture film, x maps, charts, and architectural drawings, x sound and video recordings, x aerial photographs, x still pictures and posters, and x computer data sets. [6] It is that last medium – computer data sets - the electronic records, that is the fastest growing record keeping medium in the United States and elsewhere in the world. According to the How Much Information? 2003 study from the University of California 72 Berkley School of Information Management and Systems, released in October 2003, the worldwide production of original information stored digitally on magnetic media has grown by 80% in the time elapsed between the 1999 and 2002 samples. The upper boundary study volume estimate in that category of information for 1999 was 2.8 Peta Bytes and for 2002 – 4.99 Peta Bytes. [7] The digital (electronic) storage of information –has been growing in proportion to the rise in creation and use of information in general. There is no consensus optimal method for the long term preservation of electronic records. A number of approaches are being used in the industry singly and in combination. Each of the approaches brings with it its own cost, as well as operational and reliability concerns. The larger the size of the electronic records holdings, the more important it is to carefully select and design the preservation approach. Preserving electronic records serves the same fundamental purpose as preserving any other type of record: to enable the records to continue to provide evidence and information about the decisions, acts, and facts described in the records with the same degree of reliability as when the record was created. However, the process of preserving electronic records is substantially different than the preservation of traditional, nonelectronic records. Traditional records are aptly termed “hard copy” in that the information that the record contains is inscribed in a hard, indissoluble manner on a physical medium, and the physical inscription conveys the information the record is intended to provide. Therefore, preservation traditionally focused on the physical object. However, an electronic record is inscribed on a physical medium as a sequence of binary values which must always be translated into a different form – the form of a record – in order to communicate the information the record was meant to convey. 
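As a rough illustration of this translation step (a hypothetical sketch in Python, not drawn from the ERA design, with invented format labels and registry), the same stored bits become a usable record only when format-specific interpreting software is still available:

    # Hypothetical sketch: stored bits become a usable record only through
    # format-specific interpretation software.
    interpreters = {
        "text/ascii": lambda raw: raw.decode("ascii"),
        "text/utf-8": lambda raw: raw.decode("utf-8"),
    }

    def render_record(raw_bits: bytes, encoding_format: str) -> str:
        """Translate stored bits into the form of a record."""
        try:
            interpret = interpreters[encoding_format]
        except KeyError:
            # The bits survive, but the record cannot be reproduced because
            # the interpreting software is missing or obsolete.
            raise RuntimeError("no interpreter available for " + encoding_format)
        return interpret(raw_bits)

    print(render_record(b"Memorandum: budget request, FY2004", "text/ascii"))
    # render_record(b"...", "application/legacy-gis")  # would raise RuntimeError

The failure branch is the preservation problem in miniature: the stored values persist, but the translation they depend on does not.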
Therefore, preserving an electronic record requires maintaining the ability to reproduce that record from stored data. While the preservation of a paper record can be deemed successful if that record remains physically intact in storage, the success of a process of preserving an electronic record can only be verified by translating the stored bits into the form of the record. It is the result of this reproduction, not the stored bits, that literally is the electronic record. If the wrong process is applied, or if the process is not executed correctly, the result will not be an authentic copy of the record. Over time, reproducing an electronic record is challenging because the conventions for representing information in digital form change along with hardware and software. Newer systems may not be able to process older formats, or may do so incorrectly. [8] Archiving of electronic records brings with it an increased challenge of authenticity of the record and a more difficult burden of proof of that authenticity. For the ERA program, electronic record authenticity is defined as the property of a record that it is what it purports to be and has not been corrupted. Given the legal, historical, and cultural significance of national or institutional record holdings, authenticity of the records is essential. Establishing authenticity of a paper, photographic negative, or other physical medium-based record has historically been accomplished by establishing that the record itself is, or is based on, an original via the proof that the medium of the record (or the medium of the basis record) is the original and there is a clear chain of custody 73 associated with the record. Stringent requirements assigned to electronic record collections to support a continuing burden of proof relates to the attainment of criteria for authenticity over time. Electronic records present special challenges with respect to the proof of record authenticity as the record is preserved over time due to the both increased risk of corruption of the record when it exists in digital form. Records are being created in progressively larger volumes through the use of electronic hardware and the associated software applications. Some of the records are traditional textual or graphic documents that could have been originated with the use of pen and paper. At least in theory, their content can be preserved in hard copy. The bulk of them, however, are in most respects indelibly tied to the technology that produced them, such as the contents of data base systems, interactive Web pages, geographic information systems, and virtual reality models. [9] These later types of electronic records need to be preserved in electronic form in order to preserve the essential properties of the record other than pure content - the context, structure, and behavior. Whenever a mix of technologies are involved in the creation, maintenance, and presentation of the record, preservation is far more involved than the preservation of the precise sequence of bits constituting instrument reading, an ASCII text, or a bitmap graphic document. All of the electronic records in lesser or greater degrees rely for access on the use of technologies that arise and evolve rapidly and just as rapidly become obsolete. 
Computing platforms on which the records are created, preserved, or examined, communication infrastructures interconnecting these platforms, data recording media, and, perhaps most importantly, data recording formats are all subject to rapid obsolescence while the records themselves must persist. Preservation approaches for electronic records are multifold and can be broadly categorized into the following areas of concern:
- Media
- Hardware technology
- Software technology, including record formats
- Archival: provenance, authenticity, context, structure, and appearance
A significant complicating factor in preservation of an electronic record is the necessity to preserve some of the associated linkages to other records. The loss of such linkages may, at best, lead to the loss of context or, at worst, render the record itself unreadable. Preservation of electronic records is the end-to-end process which enables reproduction of an authentic copy of the record. To assure that reproduction, preservation of electronic records extends beyond protection of the record’s physical medium to protection of record accessibility and assurance of record authenticity over time. Assuring persistence of records means ensuring that the records are not only readable but also intelligible after the passage of time. Assuring record authenticity means ensuring that the records are not inadvertently or deliberately altered or corrupted over time and that the authenticity can be adequately proven. [10] Finally, a fundamentally important aspect of ensuring that the records are accessible over time is appropriate processing of the records as they enter the electronic archive, with respect to establishing appropriate searchable archival structures and relationships and the extraction and storage of associated metadata. Preservation methods which maintain dependencies of records on obsolete technologies tend to increasingly constrain access over time. Continuing general public access to old and obsolete technologies may not be possible except in highly limited environments or circumstances. For example, general public access to an emulator appropriate to enable reliable future rendition of electronic records created in the technology of an early 1990s proprietary geographic information system in the hypothetical context of 2020-vintage computing is not presently assured.
4. System Characteristics and Drivers
The ERA system, because of its size, scope, ingest and access loads, and commitment to long-term preservation and servicing of government records, will require deployment and design approaches that support its unique nature and mission goals. When designing and deploying ERA, NARA must take a long-term view of the system’s operation and its required scalability, reliability, and cost-effective operations. This long-term vision will accommodate the outsourcing of potential processing and hosting services while at the same time ensuring NARA’s stewardship of the records entrusted to it.
4.1 Design and Deployment Goals
There are certain assumptions and drivers that sculpt the deployment approach for ERA. These assumptions and design drivers are collectively considered the design and deployment goals for the ERA program. These goals include:
- NARA must own and control at least one set of all holdings of electronic records entrusted to it. This is required for protection of the records and fulfillment of NARA’s mission to ensure long-term preservation and access to the government’s records.
- The ERA system is one of NARA’s contributions to the Federal Enterprise Architecture (FEA) and fulfills a critical role in the development and deployment of NARA’s own Enterprise Architecture (EA).
- The design and deployment vision for ERA must allow for the contracting out of record processing and access support, if NARA chooses to exercise that option in the future. The contracting out of record services must be done within the context of NARA’s mission and ultimate responsibility for the integrity of the records.
- Minimize government ownership of equipment and facilities. This desire must be balanced against NARA’s stewardship of the records and commitment to FEA support.
- Allow industry and academia to provide value-added services on record holdings.
- Produce a highly reliable system design. Characteristics of such a design include:
  - Avoidance of a single point or site of failure.
  - Graceful performance degradation of the system when failures occur.
  - Maintenance of system operations in the face of remedial maintenance (RM), preventative maintenance (PM), and planned upgrades and changes.
4.2 System Design Drivers
In addition to the deployment goals, the design and deployment of ERA must take into account certain architectural demands and aspects of the ERA record preservation domain itself. These drivers must be accommodated in any deployment and design strategy for the ERA system. These drivers include:
- The size of the ERA record holdings. ERA permanent records holdings are projected to be in excess of 100 PB of data 12-15 years after deployment, with continued growth in holdings in subsequent years. The sheer volume of data that must be accommodated, as well as its associated access loads, is a huge driver that must be accounted for in the ERA architecture. Architectural concepts including distributed deployments, load-balancing techniques, and multiple sources for access to high-demand records are applicable to the holdings-size aspect of ERA.
- Ensuring the integrity of the record holdings. The records must be protected from loss, alteration, or the lack of access capability over time. Appropriate security and accommodation of timely backup of holdings, with subsequent restoration of access, are techniques that are required in this area.
- The evolutionary nature of the ERA system. This aspect is most pronounced in two areas:
  - Changes to the persistent preservation approaches used for records. Over time, electronic records will need to be stored, represented, and accessed in different ways given the forward march of computer technology and the rapid obsolescence of formats and techniques.
  - Independent of the record preservation techniques, the general infrastructure and support technologies used in ERA will need to be updated and upgraded over time. Technology insertion into the ERA design will be imperative.
- The heterogeneity of assets in ERA will complicate storing and providing access to the assets, as well as preserving them. The scope of this issue can be appreciated by considering that ERA records can be classified via three different attributes.
  - Record Types (RTs) – Any record will be classified according to its intellectual format. Examples of record types include letters, ledgers, maps, reports, etc.
  - Data Types (DTs) – A data type is a set of lexical representations for a corresponding set of values. The values might be alphabetic characters, numbers, colors, shades of grey, sounds, et al.
The lexical representation of such values in digital form assigns each value to a corresponding binary number, or string of bits. A data type may be simple, such as the ASCII representation of alphabetic characters, or composite; that is, consisting of a combination of other data types. An electronic record consists of one or more digital components; that is, strings of bits each of which has a specific data type. o Varied classes/collections of holdings – Records of the same RT and DT may still belong to different record series or collections, which further define the nature of the record. Examples of high level collections or 76 series could include particular Presidential collections, Federal record series, and potentially record series in Federal Record Centers (FRCs). 5. Conclusion This paper has provided an overview of the NARA ERA program and the challenges that face NARA is the areas of electronic records preservation, system deployment, and archival management of the Nation’s permanent records. The NARA ERA program represents a bold initiative in electronic records management and preservation and is a call for industry to propose new and innovative approaches to the unique issues NARA faces as the steward of the government’s electronic records. Through the fusion of different technologies such as distributed computing, large scale object storage and access methods, secure infrastructure, and forward thinking record preservation strategies, ERA will open and exciting new era for electronics records management and access. References: [1] The Networking and Information Technology, R&D (NITRD), SUPPLEMENT TO THE PRESIDENT'S BUDGET FOR FY 2004, A Report by the Interagency Working Group on Information Technology Research and Development National Science and Technology Council, September 2003. [2] The Networking and Information Technology, R&D (NITRD), SUPPLEMENT TO THE PRESIDENT'S BUDGET FOR FY 2003, A Report by the Interagency Working Group on Information Technology Research and Development National Science and Technology Council, July 2002 [3] The Networking and Information Technology, R&D (NITRD), SUPPLEMENT TO THE PRESIDENT'S BUDGET FOR FY 2002, A Report by the Interagency Working Group on Information Technology Research and Development National Science and Technology Council, July 2001 [4] Building an Electronic Records Archives and Records Administration: Recommendation for Initial Development, http://www.nap.edu/openbook/0309089476/html/R1.html [5] http://www.archives.gov/about_us/reports/2002_annual_report_measuring_success.p df [6] NARA’s Strategic Directions for Federal Records Management, July 31, 2003, http://www.archives.gov/records_management/pdf/strategic_directions.pdf, referenced 10/2003. [7] How Much Information? 2003, http://www.sims.berkeley.edu/research/projects/how-much-info2003/execsum.htm#summary [8] Overview of Technological Approaches to Digital Preservation and Challenges in Coming Years, Kenneth Thibodeau, July 10, 2002 [9] Preserving the Long-Term Authenticity of Electronic Records: The InterPARES Project, Heather MacNeil, University of British Columbia AABC Newsletter, Volume 10 No. 2 Spring 2000 77 http://aabc.bc.ca/aabc/newsletter/10_2/preserving_the_long.htm [10] The Long-Term Preservation of Authentic Electronic Records: Findings of the InterPares Project. 2003. http://www.interpares.org/book/index.cfm & http://www.archives.gov/electronic_records_archives/pdf/preservation_and_access_le vels.pdf 78 Preservation Environments Reagan W. 
Moore San Diego Supercomputer Center University of California, San Diego 9500 Gilman Drive, MC-0505 La Jolla, CA 92093-0505 moore@sdsc.edu tel: +1-858-534-5073 fax: +1-858-534 5152 Abstract: The long-term preservation of digital entities requires mechanisms to manage the authenticity of massive data collections that are written to archival storage systems. Preservation environments impose authenticity constraints and manage the evolution of the storage system technology by building infrastructure independent solutions. This seeming paradox, the need for large archives, while avoiding dependence upon vendor specific solutions, is resolved through use of data grid technology. Data grids provide the storage repository abstractions that make it possible to migrate collections between vendor specific products, while ensuring the authenticity of the archived data. Data grids provide the software infrastructure that interfaces vendor-specific storage archives to preservation environments. 1. Introduction A preservation environment manages both archival content (the digital entities that are being archived), and archival context (the metadata that are used to assert authenticity) [8]. Preservation environments integrate data storage repositories with information repositories, and provide mechanisms to maintain consistency between the context and content. Preservation systems rely upon software systems to manage and interpret the data bits. Traditionally, a digital entity is retrieved from an archival storage system, structures within the digital entity are interpreted by an application that issues operating system I/O calls to read the bits, and semantic labels that assign meaning to the structures are organized in a database. This process requires multiple levels of software, from the archival storage system software, to the operating system on which the archive software is executed, to the application that interprets and displays the digital entity, to the database that manages the descriptive context. A preservation environment assumes that each level of the software hierarchy used to manage data and metadata will change over time, and provides mechanisms to manage the technology evolution. A digital entity by itself requires interpretation. An archival context is needed to describe the provenance (origin), format, data model, and authenticity [9]. The context is created by archival processes, and managed through the creation of attributes that describe the knowledge needed to understand and display the digital entities. The archival context is organized as a collection that must also be preserved. Since archival storage systems manage files, software infrastructure is needed to map from the archival repository to the 79 preservation collection. Data Grids provide the mechanisms to manage collections that are preserved on vendor-supplied storage repositories [7]. Preservation environments manage collections for time periods that are much longer than the lifetime of any storage repository technology. In effect, the collection is held invariant while the underlying technology evolves. When dealing with Petabyte-sized collections, this is a non-trivial problem. The preservation environment must provide mechanisms to migrate collections onto new technology as it becomes available. The driving need behind the migrations is to take advantage of lower-cost storage repositories that provide higher capacity media, faster data transfer rates, smaller foot-print, and reduced operational maintenance. 
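The pairing of archival content and archival context can be made concrete with a minimal, hypothetical Python sketch (it does not correspond to any particular preservation system): the content is the stored bits, the context is the metadata used to assert authenticity, and the two must be kept consistent.

    # Hypothetical sketch: a preservation environment pairs archival content
    # (the bits) with archival context (authenticity metadata) and keeps the
    # two consistent.
    import hashlib

    class PreservationEnvironment:
        def __init__(self):
            self.content = {}   # logical name -> stored bits
            self.context = {}   # logical name -> metadata attributes

        def ingest(self, logical_name, data, provenance):
            self.content[logical_name] = data
            self.context[logical_name] = {
                "provenance": provenance,
                "size": len(data),
                "checksum": hashlib.sha256(data).hexdigest(),
            }

        def verify(self, logical_name):
            """Re-derive the checksum and compare it with the recorded context."""
            data = self.content[logical_name]
            return hashlib.sha256(data).hexdigest() == self.context[logical_name]["checksum"]

    env = PreservationEnvironment()
    env.ingest("/archive/series-42/doc-001", b"...record bits...", "Agency X, 2003")
    assert env.verify("/archive/series-42/doc-001")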
New technology can be more cost effective. 2. Persistent Archives and Data Grids A persistent archive is an instance of a preservation environment [9]. Persistent archives provide the mechanisms to ensure that the hardware and software components can be upgraded over time, while maintaining the authenticity of the collection. When a digital entity in migrated to a new storage repository, the persistent archive guarantees the referential integrity between the archival context, and the new location of the digital entity. Authenticity also implies the ability to manage audit trails that record all operations performed upon the digital entity, access controls for asserting that only archivists performed the operations, and checksums to assert the digital entity has not been modified between applications of archival processes. Data grids provide these data management functions in addition to abstraction mechanisms for providing infrastructure independence [7]. The abstractions are used to define the fundamental operations that are needed on storage repositories to support access and manipulation of data files. The data grid maps from the storage repository abstraction to the protocols required by a particular vendor product. By adding drivers for each new storage protocol as they are created, it is possible for a data grid to manage digital entities indefinitely into the future. Each time a storage repository becomes obsolete, the digital entities can be migrated onto a new storage repository. The migration is feasible as long as the data grid uses a logical name space to create global, persistent identifiers for the digital entities. The logical name space is managed as a collection, independently of the storage repositories. The data grid maps from the logical name space identifier to the file name within the vendor storage system. Data grids support preservation by applying mappings to the logical name space to define the preservation context. The preservation context includes administrative attributes (location, ownership, size), descriptive attributes (provenance, discovery attributes), structural attributes (components within a compound record), and behavioral attributes (operations that can be performed on the digital entity). The context is managed as metadata in a database. An information repository abstraction is used to define the operations required to manipulate a collection within a database, providing the equivalent infrastructure independence mechanisms for the collection. 80 Archivists apply archival processes to convert digital entities into archival forms. Similar ideas of infrastructure independence can be used to characterize and manage archival processes. The application of each archival process generates part of the archival context. By creating an infrastructure independent characterization of the archival processes, it becomes possible to apply the archival processes in the future. An archival form can then consist of the original digital entity and the characterization of the archival process. Virtual data grids support the characterization of processes and on demand application of the process characterizations. A reference to the product generated by a process can result in direct access to the derived data product, or can result in the application of the process to create the derived data product. Virtual data grids can characterize and apply archival processes. 
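A simplified sketch of these two abstractions, written in Python with hypothetical driver and catalog classes rather than the interface of any actual data grid, might look as follows:

    # Hypothetical sketch of a storage repository abstraction plus a logical
    # name space; real data grids expose far richer operations.
    class StorageDriver:
        """Fundamental operations that every repository driver must supply."""
        def put(self, physical_name, data):
            raise NotImplementedError
        def get(self, physical_name):
            raise NotImplementedError

    class InMemoryDriver(StorageDriver):
        # Stand-in for a driver that speaks one vendor's storage protocol.
        def __init__(self):
            self.blobs = {}
        def put(self, physical_name, data):
            self.blobs[physical_name] = data
        def get(self, physical_name):
            return self.blobs[physical_name]

    class DataGrid:
        def __init__(self):
            self.drivers = {}   # repository name -> protocol driver
            self.catalog = {}   # logical name -> (repository, physical name, context)

        def register_repository(self, name, driver):
            self.drivers[name] = driver

        def store(self, logical_name, repository, physical_name, data, context):
            self.drivers[repository].put(physical_name, data)
            # The persistent identifier is the logical name; the physical
            # file name is just an attribute that can change later.
            self.catalog[logical_name] = (repository, physical_name, context)

        def retrieve(self, logical_name):
            repository, physical_name, _ = self.catalog[logical_name]
            return self.drivers[repository].get(physical_name)

    grid = DataGrid()
    grid.register_repository("vendor-a-archive", InMemoryDriver())
    grid.store("/era/fonds-17/report-1999", "vendor-a-archive", "/a/0001.dat",
               b"...record bits...", {"provenance": "Agency X", "format": "text/ascii"})
    print(grid.retrieve("/era/fonds-17/report-1999"))

Supporting a new vendor protocol then amounts to registering another driver; the logical names, and the context mapped onto them, do not change.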
Data grids provide the software mechanisms needed to manage the evolution of software infrastructure [7] and automate the application of archival processes. The standard capabilities provided by data grids were assessed by the Persistent Archive Research Group of the Global Grid Forum [8]. Five major categories were identified that are provided by current data grids: 1. Logical name space; a persistent and infrastructure independent naming convention 2. Storage repository abstraction; the operations that are used to access and manage data 3. Information repository abstraction; the operations that are used to organize and manage a collection within a database 4. Distributed resilient architecture; the federated client-server architecture and latency management functions needed for bulk operations on distributed data 5. Virtual data grid; the ability to characterize the processing of digital entities, and apply the processing on demand. The assessment compared the Storage Resource Broker (SRB) data grid from the San Diego Supercomputer Center [18], the European DataGrid replication environment (based upon GDMP, a project in common between the European DataGrid [2] and the Particle Physics Data Grid [15], and augmented with an additional product of the European DataGrid for storing and retrieving meta-data in relational databases called Spitfire and other components), the Scientific Data Management (SDM) data grid from Pacific Northwest Laboratory [20], the Globus toolkit [3], the Sequential Access using Metadata (SAM) data grid from Fermi National Accelerator Laboratory [19], the Magda data management system from Brookhaven National Laboratory [6], and the JASMine data grid from Jefferson National Laboratory [4]. These systems have evolved as the result of input by user communities for the management of data across heterogeneous, distributed storage resources. EGP, SAM, Magda, and JASMine data grids support high energy physics data. The SDM system provides a digital library interface to archived data for PNL and manages data from multiple scientific disciplines. The Globus toolkit provides services that can be composed to create a data grid. The SRB data handling system is used in projects for 81 multiple US federal agencies, including the NASA Information Power Grid (digital library front end to archival storage) [11], the DOE Particle Physics Data Grid (collection-based data management) [15], the National Library of Medicine Visible Embryo project (distributed data collection) [21], the National Archives Records Administration (persistent archive research prototype) [10], the NSF National Partnership for Advanced Computational Infrastructure (distributed data collections for astronomy, earth systems science, and neuroscience) [13], the Joint Center for Structural Genomics (data grid) [5], and the National Institute of Health Biomedical Informatics Research Network (data grid) [1]. The systems therefore include not only data grids, but also distributed data collections, digital libraries and persistent archives. Since the core component of each system is a data grid, common capabilities do exist across the multiple implementations. The resulting core capabilities and functionality are listed in Table 1. These capabilities should encompass the mechanisms needed to implement a persistent archive. This can be demonstrated by mapping the functionality required by archival processes onto the functionality provided by data grids. 3. 
Persistent Archive Processes
The preservation community has identified standard processes that are applied in support of paper collections, listed in Table 2. These standard processes have a counterpart in the creation of archival forms for digital entities. The archival form consists of the original bits of the digital entity plus the archival context that describes the origin (provenance) of the data, the authenticity attributes, and the administrative attributes. A preservation environment applies the archival processes to each digital entity through use of a dataflow system, records the state information that results from each process, organizes the state information into a preservation collection, transforms the digital entity into a sustainable format, archives the original digital entity and its transforms, and provides the ability to discover and retrieve a specified digital entity.

Storage repository abstraction
- Storage interface to at least one repository
- Standard data access mechanism
- Standard data movement protocol support
- Containers for data
Logical name space
- Registration of files in logical name space
- Retrieval by logical name
- Logical name space structural independence from physical file
- Persistent handle
Information repository abstraction
- Collection owned data
- Collection hierarchy for organizing logical name space
- Standard metadata attributes (controlled vocabulary)
- Attribute creation and deletion
- Scalable metadata insertion
- Access control lists for logical name space
- Attributes for mapping from logical file name to physical file
- Encoding format specification attributes
- Data referenced by catalog query
- Containers for metadata
Distributed resilient scalable architecture
- Specification of system availability
- Standard error messages
- Status checking
- Authentication mechanism
- Specification of reliability against permanent data loss
- Specification of mechanism to validate integrity of data
- Specification of mechanism to assure integrity of data
Virtual data grid
- Knowledge repositories for managing collection properties
- Application of transformative migration for encoding format
- Application of archival processes
Table 1. Core Capabilities of Data Grids

Archival Process: Functionality
- Appraisal: Assessment of digital entities
- Accession: Import of digital entities
- Description: Assignment of provenance metadata
- Arrangement: Logical organization of digital entities
- Preservation: Storage in an archive
- Access: Discovery and retrieval
Table 2. Archival process functionality for paper records

To understand whether data grids can meet the archival processing requirements for digital entities, scenarios are given below for the equivalent operations on digital entities. The term record is used to denote a digital entity that is the result of a formal process, and thus a candidate for preservation. The term fonds is used to denote a record series. Appraisal is the process of determining the disposition of records and in particular which records need long-term preservation. Appraisal evaluates the various terms and conditions applying to the preservation of records beyond the time of their active life in relation to the affairs that created them. An archivist bases an appraisal decision on the uniqueness of the record collection being evaluated, its relationship to other institutional records, and its relationship to the activities, organization, functions, policies, and procedures of the institution.
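The dataflow described at the start of this section can be pictured as a pipeline that applies each archival process of Table 2 to a digital entity and records the resulting state information; the following is a hypothetical Python sketch, not an implementation used by any persistent archive.

    # Hypothetical sketch: apply the standard archival processes to each
    # digital entity and accumulate the resulting state information.
    ARCHIVAL_PROCESSES = [
        "appraisal", "accession", "description",
        "arrangement", "preservation", "access",
    ]

    def run_process(name, entity):
        # Stand-in for the real work of each process; here we only record
        # that the step ran and over what entity.
        return {"process": name, "entity": entity["logical_name"], "status": "completed"}

    def preserve(entity, preservation_collection):
        """Apply each process in order and organize the state into a collection."""
        for name in ARCHIVAL_PROCESSES:
            state = run_process(name, entity)
            preservation_collection.setdefault(entity["logical_name"], []).append(state)
        return preservation_collection

    collection_state = {}
    preserve({"logical_name": "/fonds-17/1999/record-0042"}, collection_state)
    for step in collection_state["/fonds-17/1999/record-0042"]:
        print(step["process"], step["status"])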
Data grids provide the ability to register digital entities into a logical name space organized as a collection hierarchy for comparison with other records of the institution that have already been accessioned into the archives. The logical name space is decoupled from the underlying storage systems, making it possible to reference digital entities without moving them. The metadata associated with those other collections assist the archivist in assessing the relationship of the records being appraised to the prior records. Queries are made on the descriptive and provenance metadata to identify relevant records. The data grid supports controlled vocabularies for describing provenance and formats. This metadata also provides information that helps the archivist understand the relevance/importance/value of the records being appraised for documenting the activities, functions, etc. of the institution that created them. The activities of the institution can be managed as relationships maintained in a concept space, or as process characterizations maintained in a procedural ontology. By authorizing archivist access to the collection, and providing mechanisms to ensure authenticity of the previously archived records, the preservation environment maintains an authentic environment. Accessioning is the formal acceptance into custody and recording of an acquisition. Data Grids control import by registering the digital entities into a logical name space organized 83 as a collection/sub-collection hierarchy. The records that are being accessioned can be managed as a collection independently of the final archival form. By having the data grid own the records (stored under a data grid Unix ID), all accesses to the records can be tracked through audit trails. By associating access controls with the logical name space, all references to the records can be authorized no matter where the records are finally stored. Data grids put digital entities under management control, such that automated processing can be done across an entire collection. Bulk operations are used to move the digital entities using a standard protocol and to store the digital entities in a storage repository. Digital entities may be aggregated into containers (the equivalent of a cardboard box for paper) to control the data distribution within the storage repository. Containers are used to minimize the impact on the storage repository name space. The metadata catalog manages the mapping from the digital entities to the container in which they are written. The storage repository only sees the container names. Standard clients are used for controlling the bulk operations. The information repository supports attribute creation and deletion to preserve record or fonds specific information. In particular, information on the properties of the records and fonds are needed for validation of the encoding formats and to check whether the entire record series has been received. The accession schedule may specify knowledge relationships that can be used to determine whether associated attribute values are consistent with implied knowledge about the collection, or represent anomalies and artifacts. An example of a knowledge relationship is the range of permissible values for a given attribute, or the expected number of records in a fonds. If the range of values do not match the assertions provided by the submitter, the archivist needs to note the discrepancy as a property of the collection. 
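A small, hypothetical Python sketch of such an accession-time check, with invented attribute names and schedule values, could look like this:

    # Hypothetical sketch: validate an accessioned record series against the
    # knowledge relationships asserted in its accession schedule.
    schedule = {
        "expected_record_count": 1200,
        "attribute_ranges": {"report_year": (1990, 2003)},   # invented attribute
    }

    def validate_accession(records, schedule):
        """Return discrepancies to be noted as properties of the collection."""
        discrepancies = []
        if len(records) != schedule["expected_record_count"]:
            discrepancies.append(
                "record count %d differs from expected %d"
                % (len(records), schedule["expected_record_count"]))
        for attribute, (low, high) in schedule["attribute_ranges"].items():
            for record in records:
                value = record.get(attribute)
                if value is None or not (low <= value <= high):
                    discrepancies.append(
                        "%s: %s outside permissible range for %s"
                        % (record["logical_name"], value, attribute))
        return discrepancies

    sample = [{"logical_name": "/fonds-17/r-0001", "report_year": 1987}]
    for note in validate_accession(sample, schedule):
        print(note)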
Bulk operations are needed on metadata insertion when dealing with collections that contain millions of digital entities. A resilient architecture is needed to specify the storage system availability, check system status, authenticate access by the submitting institution, and specify reliability against data loss. At the time of accession, mechanisms such as checksums, need to be applied to be able to assert in the future that the data has not been changed. The Open Archival Information System (OAIS) specifies submission information packages that associate provenance information with each digital entity [14]. While OAIS is presented in terms of packaging of information with each digital entity, the architecture allows bulk operations to be implemented. An example is bulk loading of multiple digital entities, in which the provenance information is aggregated into an XML file, while the digital entities are aggregated into a container. The XML file and container are moved over the network from the submitting site to the preservation environment, where they are unpacked into the storage and information repositories. The integrity of the data (the consistency between the archival context and archival content) needs to be assured, typically by imposing constraints on metadata update. 84 When creating replicas and aggregating digital entities into containers, state information is required to describe the status of the changes. When digital entities are appended to a container, write locks are required to avoid over-writes. When a container is replicated, a synchronization flag is required to identify which container holds the new digital entities, and synchronization mechanisms are needed to update the replicas. The accession process may also impose transformative migrations on encoding formats to assure the ability to read and display a digital entity in the future. The transformative migrations can be applied at the time of accession, or the transformation may be characterized such that it can be applied in the future when the digital entity is requested. In order to verify properties of the entire collection, it may be necessary to read each digital entity, verify its content against an accession schedule, and summarize the properties of all of the digital entities within the record series. The summarization is equivalent to a bill of lading for moving the record series into the future. When the record series is examined at a future date, the archivist needs to be able to assert that the collection is complete as received, and that missing elements were never submitted to the archive. Summarization is an example of a collection property that is asserted about the entire record series. Other collection properties include completeness (references to records within the collection point to other records within the collection), and closure (operations on the records result in data products that can be displayed and manipulated with mechanisms provided by the archive). The closure property asserts that the archive can manipulate all encoding formats that are deposited into the archive. Arrangement is the process and result of identification of documents for whether they belong to accumulations within a fonds or record series. Arrangement requires organization of both metadata (context) and digital entities (content). The logical name space is used as the coordination mechanism for associating the archival context with the submitted digital entities. 
All archival context is mapped as metadata attributes onto the logical name for each digital entity. The logical name space is also used as the underlying naming convention on which a collection hierarchy is imposed. Each level of the collection hierarchy may have a different archival context expressed as a different set of metadata. The metadata specifies relationships of the submitted records to other components of the record series. For a record series that has yearly extensions, a suitable collection hierarchy might be to organize each year’s submission as a separate sub-collection, annotated with the accession policy for that year. The digital entities are sorted into containers for physical aggregation of similar entities. The expectation is that access to one digital entity will likely require access to a related digital entity. The sorting requires a specification of the properties of the record series that can be used for a similarity analysis. The container name in which a digital entity is placed is mapped as an administrative attribute onto the logical name. Thus, by knowing the logical name of a digital entity within the preservation environment, all pertinent information can be retrieved or queried. The process of arrangement points to the need for a digital archivist workbench. The storage area that is used for applying archival processes does not have to be the final storage location. Data grids provide multiple mechanisms for arranging data, including soft-links between collections to associate a single physical copy with multiple sub-collections, copies that are separately listed in different sub-collections, and versions within a single sub-collection. Data grids provide multiple mechanisms for managing data movement, including copying data between storage repositories, moving data between storage repositories, and replicating data between storage repositories. Description is the recording in a standardized form of information about the structure, function, and content of records: Description requires a persistent naming convention and a characterization of the encoding format, as well as information used to assert authenticity. The description process generates the archival context that is associated with each digital entity. The archival context includes not only the administrative metadata generated by the accession and arrangement processes, but also descriptive metadata that are used for subsequent discovery and access.

Preservation Function: Type of Information
- Administrative: Location, physical file name, size, creation time, update time, owner, location in a container, container name, container size, replication locations, replication times
- Descriptive: Provenance, submitting institution, record series attributes, discovery attributes
- Authenticity: Global Unique Identifier, checksum, access controls, audit trail, list of transformative migrations applied
- Structural: Encoding format, components within digital entity
- Behavioral: Viewing mechanisms, manipulation mechanisms
Table 3. Archival context managed for each digital entity

The description process can require access to the storage repository to apply templates for the extraction of descriptive metadata, as well as access to the information catalog to manage the preservation of the metadata. The description process should generate a persistent handle for the digital entity in addition to the logical name. The persistent handle is used to assert equivalence across preservation environments.
An example of a persistent handle is the concatenation of the name of the preservation environment and the logical name of the entity, and is guaranteed unique as long as the preservation environments are uniquely named. The ability to associate a unique handle with a digital entity that is already stored requires the ability to apply a validation mechanism such as a digital signature or checksum to assert equivalence. If a transformative migration has occurred, the validation mechanism may require access to the original form of the digital entity. Preservation is the process of protecting records of continuing usefulness: Preservation requires a mechanism to interact with multiple types of storage repositories, mechanisms for disaster recovery, and mechanisms for asserting authenticity. 86 The only assured mechanism for guaranteeing against content or context loss is the replication of both the digital entities and the archival metadata. The replication can implement bit-level equivalence for asserting that the copy is authentic. The replication must be done onto geographically remote storage and information repositories to protect against local disasters (fire, earthquake, flood). While data grids provide tools to replicate digital entities between sites, some form of federation mechanism is needed to replicate the archival context and logical name space. One would like to assert that a completely independent preservation environment can be accessed that replicates even the logical names of the digital entities. The independent systems are required to support recovery from operation errors, in which recovery is sought from the mis-application of the archival procedures themselves. The coordination of logical name spaces between data grids is accomplished through peer-to-peer federation. Consistency controls on the synchronization of digital entities and metadata between the data grids are required for the user name space (who can access digital entities), the resources (whether the same repository stores data from multiple grids), the logical file names (whether replication is managed by the systems or archival processes), and the archival context (whether insertion of new entities is managed by the system or archival processes). Multiple versions of control policies can be implemented, ranging from automated replication into a union archive from multiple data grids, to simple cross-registration of selected sub-collections. Data grids use a storage repository abstraction to manage interactions with heterogeneous storage systems. To avoid problems specific to vendor products, the archival replica should be made onto a different vendor’s product from the primary storage system. The heterogeneous storage repositories can also represent different versions of storage systems and databases as they evolve over time. When a new infrastructure component is added to a persistent archive, both the old version and new version will be accessed simultaneously while the data and information content are migrated onto the new technology. Through use of replication, the migration can be done transparently to the users. For persistent archives, this includes the ability to migrate a collection from old database technology onto new database technology. Persistence is provided by data grids through support for a consistent environment, which guarantees that the administrative attributes used to identify derived data products always remain consistent with migrations performed on the data entities. 
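A minimal sketch of the persistent-handle construction and the checksum-based equivalence test described above; the environment and logical names are illustrative.

import hashlib

def persistent_handle(environment_name: str, logical_name: str) -> str:
    """Handle = preservation-environment name concatenated with the logical name;
    unique as long as the preservation environments are uniquely named."""
    return f"{environment_name}/{logical_name}"

def assert_equivalent(data_a: bytes, data_b: bytes) -> bool:
    """Checksum-based test that two replicas of an entity are bit-level equivalent."""
    return hashlib.sha256(data_a).hexdigest() == hashlib.sha256(data_b).hexdigest()

handle = persistent_handle("sdsc-prototype", "/archive/series1/1999/doc42")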
The consistent state is extended into a persistent state through management of the information encoding standards used to create platform independent representations of the context. The ability to migrate from an old representation of an information encoding standard to a new representation leads to persistent management of derived data products. It is worth noting that a transformative migration can be characterized as the set of operations performed on the encoding syntax. The operations can be applied on the original digital entity at the time of accession or at any point in the future. If a new encoding syntax standard emerges, the set of operations needed to map from the original encoding syntax to the new encoding syntax can be defined, without requiring any of the intermediate 87 encoding representations. The operations needed to perform a transformative migration are characterized as a digital ontology [8]. Authenticity is supported by data grids through the ability to track operations done on each digital entity. This capability can be used to track the provenance of digital entities, including the operations performed by archivists. Audit trails record the dates of all transactions and the names of the persons who performed the operations. Digital signatures and checksums are used to verify that between transformation events the digital entity has remained unchanged. The mechanisms used to accession records can be re-applied to validate the integrity of the digital entities between transformative migrations. Data grids also support versioning of digital entities, making it possible to store explicitly the multiple versions of a record that may be received. The version attribute can be mapped onto the logical name space as both a time-based snapshot of a changing record, and as an explicitly named version. Access is the process of using descriptive metadata to search for archival objects of interest and retrieve them from their storage location. Access requires the ability to discover relevant documents, transport them from storage to the user, and interact with storage systems for document retrieval. The essential component of access is the ability to discover relevant files. In practice, data grids use four naming conventions to identify preserved content. A global unique identifier (GUID) identifies digital entities across preservation environments, the logical name space provides a persistent naming convention within the preservation environment, descriptive attributes support discovery based on attribute values, and the physical file name identifies the digital entity within a storage repository. In most cases, the user of the system will not know either the GUID, logical name or physical file name, and discovery is done on the descriptive attributes. Access then depends upon the ability to instantiate a collection that can be queried to discover a relevant digital entity. A knowledge space is needed to define the semantic meaning of the descriptive attributes, and a mechanism is needed to create the database instance that holds the descriptive metadata. For a persistent archive, this is the ability to instantiate an archival collection from its infrastructure independent representation onto a current information repository. The information repository abstraction supports the operations needed to instantiate a metadata catalog. The other half of access is transport of the discovered records. 
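A minimal sketch of the discovery half of access, assuming a toy catalog of descriptive attributes (all names and attribute values are illustrative):

# Each entry carries the four naming conventions plus descriptive attributes.
entries = [
    {"guid": "guid-001", "logical_name": "/archive/series1/1999/doc42",
     "physical_name": "/vault3/c017.dat", "attributes": {"record_series": "series1", "year": 1999}},
]

def discover(**criteria):
    """Return entries whose descriptive attributes match all given criteria; the user
    never needs to know the GUID, logical name, or physical file name."""
    return [e for e in entries
            if all(e["attributes"].get(k) == v for k, v in criteria.items())]

hits = discover(record_series="series1", year=1999)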
Transport includes support for moving data and metadata in bulk, while authenticating the user across administration domains. Since access mechanisms also evolve in time, mechanisms are needed to map from the storage and information repository abstractions to the access mechanism preferred by the user.

4. Preservation Infrastructure

The operations required to support archival processes can be organized by identifying which capability is used by each process. The resulting preservation infrastructure is shown in Table 4. The list includes the essential capabilities that simplify the management of collections of digital entities while the underlying technology evolves. The use of each capability by one of the six archival processes is indicated by an x in the appropriate row. The columns are labeled by App (Appraisal), Acc (Accessioning), Arr (Arrangement), Des (Description), Pres (Preservation), and Ac (Access). Many of the data grid capabilities are required by all of the archival processes. This points out the difficulty in choosing an appropriate characterization for applying archival processes to digital entities. Even though we have shown that the original paper-oriented archival processes have a counterpart in preservation of digital entities, there may be a better choice for characterizing electronic archival processes.

Core Capabilities and Functionality

Storage repository abstraction: storage interface to at least one repository; standard data access mechanism; standard data movement protocol support; containers for data.

Logical name space: registration of files in the logical name space; retrieval by logical name; logical name space structural independence from the physical file; persistent handle.

Information repository abstraction: collection-owned data; collection hierarchy for organizing the logical name space; standard metadata attributes (controlled vocabulary); attribute creation and deletion; scalable metadata insertion; access control lists for the logical name space; attributes for mapping from the logical file name to the physical file; encoding format specification attributes; data referenced by catalog query; containers for metadata.

Distributed resilient scalable architecture: specification of system availability; standard error messages; status checking; authentication mechanism; specification of reliability against permanent data loss; specification of mechanism to validate integrity of data; specification of mechanism to assure integrity of data.

Virtual data grid: knowledge repositories for managing collection properties; application of transformative migration for encoding format; application of archival processes.

Table 4. Data Grid capabilities used in preservation environments (the per-process x markings under the App, Acc, Arr, Des, Pres, and Ac columns are not reproduced here)

5. Persistent Archive Prototype

The preservation of digital entities is being implemented at the San Diego Supercomputer Center (SDSC) through multiple projects that apply data grid technology. In collaboration with the United States National Archives and Records Administration (NARA), SDSC is developing a research prototype persistent archive.
The preservation environment is based on the Storage Resource Broker (SRB) data grid [17], and links three archives at NARA, the University of Maryland, and SDSC. For the National Science Foundation, SDSC has implemented a persistent archive for the National Science Digital Library [12]. Snapshots of digital entities that are registered into the NSDL repository as URLs are harvested from the web and stored into an archive using the SRB data grid. As the digital entities change over time, versions are tracked to ensure that an educator can find the desired version of a curriculum module. Both of these projects rely upon the ability to create archival objects from digital entities through the application of archival processes. We differentiate between the generation of archival objects through the application of archival processes, the management of archival objects using data grid technology, and the characterization of the archival processes themselves, so that archived material can be re-processed (or re-purposed) in the future using virtual data grids.

The San Diego Supercomputer Center Storage Resource Broker (SRB) is used to implement the persistent archives. The SRB provides mechanisms for all of the capabilities and functions listed in Table 2 except for knowledge repositories. The SRB also provides mechanisms for the extended features listed in Section 3, such as soft-links, peer-to-peer federation of data grids, and mapping to user-preferred APIs. The SRB storage repository abstraction is based upon standard Unix file system operations, and supports drivers for accessing digital entities stored in Unix file systems (Solaris, SunOS, AIX, Irix, Unicos, Mac OS X, Linux), in Windows file systems (98, 2000, NT, XP, ME), in archival storage systems (HPSS, UniTree, DMF, ADSM, Castor, dCache, Atlas Data Store), as binary large objects in databases (Oracle, DB2, Sybase, SQL Server, PostgreSQL), in object ring buffers, in storage resource managers, in FTP sites, in GridFTP sites, on tape drives managed by tape robots, etc. The SRB has been designed to facilitate the addition of new drivers for new types of storage systems. Traditional tape-based archives still remain the most cost-effective mechanism for storing massive amounts of data, although the cost of commodity-based disk is approaching that of tape [17]. The SRB supports direct access to tapes in tape robots.

The SRB information repository abstraction supports the manipulation of collections stored in databases. The manipulations include the ability to add user-defined metadata, import and export metadata as XML files, support bulk registration of digital entities, apply template-based parsing to extract metadata attribute values, and support queries across arbitrary metadata attributes. The SRB automatically generates the SQL that is required to respond to a query, allowing the user to specify queries by operations on attribute values. Version 3.0.1 of the Storage Resource Broker data grid provides the basic mechanisms for federation of data grids [16]. The underlying data grid technology is in production use at SDSC and manages over 90 Terabytes of data comprising over 16 million files. The ultimate goal of the NARA research prototype persistent archive is to identify the key technologies that facilitate the creation of a preservation environment.

5. Summary

Persistent archives manage archival objects by providing infrastructure independent abstractions for interacting with both archival objects and software infrastructure.
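One way to picture such an infrastructure-independent abstraction is as a minimal storage-driver interface of the kind the SRB uses to span file systems, archives, and databases; the sketch below is illustrative only, and the method set is hypothetical rather than the actual SRB driver API.

from abc import ABC, abstractmethod

class StorageDriver(ABC):
    """Illustrative storage-repository interface; concrete drivers wrap a file system,
    an archival store, or a database BLOB store behind the same operations."""
    @abstractmethod
    def get(self, physical_name: str) -> bytes: ...
    @abstractmethod
    def put(self, physical_name: str, data: bytes) -> None: ...

class PosixDriver(StorageDriver):
    """Driver backed by ordinary Unix file system operations."""
    def get(self, physical_name: str) -> bytes:
        with open(physical_name, "rb") as f:
            return f.read()
    def put(self, physical_name: str, data: bytes) -> None:
        with open(physical_name, "wb") as f:
            f.write(data)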
Data grids provide the abstraction mechanisms for managing evolution of storage and information repositories. Persistent archives use the abstractions to preserve the ability to manage, access and display archival objects while the underlying technologies evolve. The challenge for the persistent archive community is the demonstration that data grid technology provides the correct set of abstractions for the management of software infrastructure. The Persistent Archive Research Group of the Global Grid Forum is exploring this issue, and is attempting to define the minimal set of capabilities that need to be provided by data grids to implement persistent archives [8]. A second challenge is the development of digital ontologies that characterize the structures present within digital entities. The Data Format Description Language research group of the Global Grid Forum is developing an XML-based description of the structures present within digital entities, as well as a description of the semantic labels that are applied to the structures. A third challenge is the specification of a standard set of operations that can be applied to the relationships within an archival object. A preservation environment will need to support operations at the remote storage repository, through the application of a digital ontology. 6. Acknowledgements The concepts presented here were developed by members of the Data and Knowledge Systems group at the San Diego Supercomputer Center. The Storage Resource Broker was developed principally by Michael Wan and Arcot Rajasekar. This research was supported by the NSF NPACI ACI-9619020 (NARA supplement), the NSF NSDL/UCAR Subaward S02-36645, the DOE SciDAC/SDM DE-FC02-01ER25486 and DOE Particle Physics Data Grid, the NSF National Virtual Observatory, the NSF Grid Physics Network, and the NASA Information Power Grid. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the National Science Foundation, the National Archives and Records Administration, or the U.S. government. This document is based upon an informational document submitted to the Global Grid Forum. The data grid and Globus toolkit characterizations were only possible through the support of the following persons: Igor Terekhov (Fermi National Accelerator Laboratory), Torre Wenaus (Brookhaven National Laboratory), Scott Studham (Pacific Northwest Laboratory), Chip Watson (Jefferson Laboratory), Heinz Stockinger and Peter Kunszt (CERN), Ann Chervenak (Information Sciences Institute, University of Southern California), Arcot Rajasekar (San Diego Supercomputer Center). Mark Conrad (NARA) provided the archival process characterization. Copyright (C) Global Grid Forum (date). All Rights Reserved. This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are 91 included on all such copies and derivative works. 
However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the GGF or other organizations, except as needed for the purpose of developing Grid Recommendations in which case the procedures for copyrights defined in the GGF Document process must be followed, or as required to translate it into languages other than English. 7. References 1. Biomedical Informatics Research Network, http://nbirn.net/ 2. EDG – European Data Grid, http://eu-datagrid.web.cern.ch/eu-datagrid/ 3. Globus – The Globus Toolkit, http://www.globus.org/toolkit/ 4. Jasmine – Jefferson Laboratory Asynchronous Storage Manager, http://cc.jlab.org/scicomp/JASMine/ 5. Joint Center for Structural Genomics, http://www.jcsg.org/ 6. Magda – Manager for distributed Grid-based Data, http://atlassw1.phy.bnl.gov/magda/info 7. Moore, R., C. Baru, “Virtualization Services for Data Grids”, Book chapter in "Grid Computing: Making the Global Infrastructure a Reality", John Wiley & Sons Ltd, 2003. 8. Moore, R., A. Merzky, “Persistent Archive Concepts”, Global Grid Forum Persistent Archive Research Group, Global Grid Forum 8, June 26, 2003. 9. Moore, R., “The San Diego Project: Persistent Objects,” Archivi & Computer, Automazione E Beni Culturali, l’Archivio Storico Comunale di San Miniato, Pisa, Italy, February, 2003. 10. NARA Persistent Archive Prototype, http://www.sdsc.edu/NARA/Publications.html 11. NASA Information Power Grid (IPG) is a high-performance computing and data grid, http://www.ipg.nasa.gov/ 12. National Science Digital Library, http://nsdl.org 13. NPACI National Partnership for Advanced Computational Infrastructure, http://www.npaci.edu/ 14. OAIS - Reference Model for an Open Archival Information System (OAIS). submitted as ISO draft, http://www.ccsds.org/documents/pdf/CCSDS-650.0-R-1.pdf, 1999. 15. Particle Physics Data Grid, http://www.ppdg.net/ 16. Peer-to-peer federation of data grids, http://www.npaci.edu/dice/srb/FedMcat.html 17. Rajasekar, A., M. Wan, R. Moore, G. Kremenek, T. Guptil, “Data Grids, Collections, and Grid Bricks”, Proceedings of the 20th IEEE Symposium on Mass Storage Systems and Eleventh Goddard Conference on Mass Storage Systems and Technologies, San Diego, April 2003. 18. Rajasekar, A., M. Wan, R. Moore, “mySRB and SRB, Components of a Data Grid”, 11th High Performance Distributed Computing conference, Edinburgh, Scotland, July 2002. 19. SAM – Sequential data Access using Metadata, http://d0db.fnal.gov/sam/. 20. SDM – Scientific Data Management in the Environmental Molecular Sciences Laboratory, http://www.computer.org/conferences/mss95/berard/berard.htm. 21. Visible Embryo Project, http://netlab.gmu.edu/visembryo.htm 92 Data Management as a Cluster Middleware Centerpiece Jose Zero, David McNab, William Sawyer, Samson Cheung Halcyon Systems, Inc. 1219 Folsom St San Francisco CA 94103 Tel +1-415-255-8673, Fax +1-415-255-8673 e-mail: zero@halcyonsystems.com Daniel Duffy Computer Sciences Corporation NASA NCCS, Goddard Space Flight Center Greenbelt MD 20771 Tel +1-301-286-8830 e-mail: Daniel.Q.Duffy@gsfc.nasa.gov Richard Rood, Phil Webster, Nancy Palm, Ellen Salmon, Tom Schardt NASA NCCS, Goddard Space Flight Center Greenbelt MD 20771 Tel: +1-301-614-6155, Fax: +1-301-286-1777 e-mail: Richard.B.Rood.1@gsfc.nasa.gov Abstract Through earth and space modeling and the ongoing launches of satellites to gather data, NASA has become one of the largest producers of data in the world. 
These large data sets necessitated the creation of a Data Management System (DMS) to assist both the users and the administrators of the data. Halcyon Systems Inc. was contracted by the NASA Center for Computational Sciences (NCCS) to produce a Data Management System. The prototype of the DMS was produced by Halcyon Systems Inc. (Halcyon) for the Global Modeling and Assimilation Office (GMAO). The system, which was implemented and deployed within a relatively short period of time, has proven to be highly reliable and deployable. Following the prototype deployment, Halcyon was contacted by the NCCS to produce a production DMS version for their user community. The system is composed of several existing open source or government-sponsored components such as the San Diego Supercomputer Center’s (SDSC) Storage Resource Broker (SRB), the Distributed Oceanographic Data System (DODS), and other components. Since Data Management is one of the foremost problems in cluster computing, the final package not only extends its capabilities as a Data Management System, but also to a cluster management system. This Cluster/Data Management System (CDMS) can be envisioned as the integration of existing packages. 1. Introduction In the last twelve years, Commercial Off-the-Shelf (COTS)-based cluster computing has become the main source of supercomputing providers. From the revolution of the first viable microprocessors that lead the way to replacing vector supercomputers, to passing through new network technologies and arriving at the efficient porting of scientific code, the road to cluster 93 computing was paved with problems seemingly impossible to resolve. In a sense, the battle was won; but the war is still being fought. Many aspects of computing have changed so radically that situations from the past seem unbelievably irrelevant today. Up until 1999, computing centers spent an immense amount of time in lengthy negotiations with vendors in an effort to obtain “build-able operating system codes”. Today, they can directly download them from the web. Still, in the midst of a new era with the power of COTS microprocessors, there are many challenges. Despite networks with low latency and high bandwidth and build-able operating systems and the availability of a myriad of open source packages, cluster computing is, at best, a difficult task that fails to replace the panacea days of Cray Research Inc.’s delivery of a C90 supercomputer. The Data Management System (DMS) attempts to fill the void of middleware that both supercomputing centers and their users need in order to easily manage and use the diverse technology of cluster computers. The DMS is composed of several existing open source or government-sponsored components, such as the San Diego Supercomputing Center’s Storage Resource Broker (SRB), the Distributed Oceanographic Data System (DODS), and others. Since data management is one of the major concerns in High Performance Computing (HPC), the final DMS package not only serves as a data management system for very high end computing, but it can easily be extended to a complete cluster management system. Many areas of science that base their results on computing resources have different ratios of Mega-Flops per byte of data ingested and/or produced. Meteorology is a science that ingests and produces voluminous amounts of data. It is not a coincidence that the same branch of science that produced the word “computer” is now leading the core issues of cluster computing. 
One of the legacy items from the previous computing models of the 60’s, 70’s, 80’s, and 90’s is the separation of mass storage engines and computing clusters. At this point, it is more efficient to follow the management structure of the computing centers rather than the computing architecture of the systems. COTS mass storage units, with multiple terabytes of attached disks, are just as reliable and economical as the COTS computing nodes. COTS CPU power has grown side-by-side with high bandwidth internal interconnects and new devices like Serial ATA and others that can provide support for multi-terabyte storage on each single unit. At the same time, OS improvements (Linux, etc.) make it possible to support those large file systems. In a generic scientific computing center, the problem that must be solved is how to manage the vast amount of data that is being produced by multiple users in a variety of formats. And, the added challenge is to do so in a manner that is consistent and that does not consume all of the users’ time manipulating such data or all of the computer center’s personnel in endless migrations from one system to another and from one accounting report to the next. This holds true across a broad range of actions from software engineering practices, to the production of code, to upgrading OS versions and patches, and includes changes in the systems, in accounting, in system engineering practices, and in the management of the actual scientific data. 94 Despite the best efforts of computing centers, “dead data” continues to mount up in mass storage vaults. The increasing cost of maintaining the storage, migrating, and in general curating can reach up to 40% of the total budget of a typical computing center. These curation activities (such as changing ownership, deleting, browsing, etc.) add to the burden of data management. Likewise, the proliferation of mass storage vaults is increasingly higher: two copies in situ, a third copy for catastrophic recovery, a copy in the computing engine (scratch) and additional copies wherever users need them (desktops, websites, etc.). This not only drives up costs, but it also undermines the collaboration among different scientists wherein data sharing becomes a limiting factor. The cost and expertise necessary to deploy a Grid-useable computing node is too high for small computing groups. Groups of ten to twenty computer users typically have one or two system administrators and no system software developers, which makes the start-up cost beyond their reach (both in terms of dollars and expertise). As computing power increases, fewer groups need a true supercomputer platform. A successful Grid should easily deploy smaller nodes and maintain production level. Finally, the lack of connection between the datasets and the software engineering practices (code version, patches, etc.) and the computing environment (CPU type, number of CPUs, etc.) limits the life of a dataset, its utility, and the scientific verification value. In this paper we describe an integration effort composed of several existing packages that solves, to a large extent (but not totally), the data management problem for data coming out of a cluster computing environment. As a posteriori result we describe how the data management, essential to the utility of a cluster, becomes a centerpiece for its management. 
We also propose an ensemble set that can be used as a turn-key engine for a further integration of Cluster/Data Management into a full Grid/Data Management System ("Incoherent"). In this area, Halcyon proposes that Incoherent be an Open Source Project.

2. Basic Requirements for a Data Management System

The following list contains the basic requirements for the DMS.

• Ensure a single point of information wherein data is retrieved/searched. Though there might be many different interfaces, the initial point of contact for each interface should be the same.
• Provide system tools to cap storage costs and select datasets to be expunged.
• Provide methods for minimizing the number of data copies (and conceivably provide a live backup of the data). The copy that is more efficient to fetch should be the one that is accessed.
• Establish a linkage between data, scientific metadata, computing metadata, and configuration management data.
• Provide support for data migration, whether it is from computing nodes to local storage (where users are) or from one storage system to another.
• Support plug and play of different visualization tools.
• Avoid multiple, full, or subset copies of datasets in the system by providing a Virtual Local Data capacity (data always feels local), along with the automatic use of local caches and sub-setting on-the-fly.
• Provide robust, easily deployed, grid-compatible security tools.
• Deploy with ease. Most department-type scientific groups do not have the resources to integrate a fully deployed Cluster Software Management and Mass Storage System.

3. Data Management System, Present Components

Halcyon Systems has integrated several packages to work together as a DMS:

• Storage Resource Broker (front-end mass storage, metadata catalog)
• Distributed Oceanographic Data System (transport layer, connection to manipulation and visualization tools)
• Configuration Management Software for all systems involved
• Distributed Oceanographic Data System and GrADS visualization tool

A minimal number of changes were implemented in the SRB software. A build tool and benchmarks were produced for ease of administration. Exit codes were changed to comply with standard UNIX command return codes. The underlying database is Oracle 9i running on Linux.

The DODS dispatch script is CGI-Perl. It was modified to make calls to SRB S-utilities to retrieve and cache files from SRB. Once a file has been transferred to local disk, it remains there until either the SRB version is modified or the cache fills and it is the oldest file. The DODS server authenticates as SRB identity "dods", and users who wish to export their files via DODS add read access for that user to the files' access control lists.

The DODS environment does not maintain a separate metadata catalog for managing semantic-based access to the data. There is presently no connection between DODS metadata, which is synthesized from the DODS-served file depending on the data format, and SRB metadata, which is stored in the MCAT associated with the file. MCAT data cannot yet be retrieved through DODS, nor is DODS-style synthesized metadata stored in MCAT.

Configuration Management Software is a set of commands enabling the user/administrator to enter changes in the specifically devoted tables created separately from the SRB tables in the Oracle database.

GrADS is already integrated with DODS; however, future work will have a separate server (GrADS-DODS server or GDS) fully integrated with SRB.
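A minimal sketch of the dispatch-script caching behavior just described; the fetch and version callables stand in for the SRB S-utility calls, and this is not the actual CGI-Perl code.

import collections

class LocalCache:
    """Read-only cache of SRB objects: a cached copy is reused until its SRB version
    changes; when the cache is full, the oldest entry is evicted."""
    def __init__(self, capacity, fetch_from_srb, srb_version):
        self.capacity = capacity
        self.fetch = fetch_from_srb                 # callable: name -> bytes
        self.version = srb_version                  # callable: name -> version string
        self.entries = collections.OrderedDict()    # name -> (version, data), oldest first

    def get(self, name):
        current = self.version(name)
        cached = self.entries.get(name)
        if cached is not None and cached[0] == current:
            return cached[1]                        # still valid: serve the local copy
        if cached is not None:
            del self.entries[name]                  # stale: discard before refetching
        if len(self.entries) >= self.capacity:
            self.entries.popitem(last=False)        # full: evict the oldest file
        data = self.fetch(name)
        self.entries[name] = (current, data)
        return data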
With the GDS fully integrated with SRB, a wider set of data manipulation and computation will be directly accessible to DMS users.

Note on GCMD integration: DMS uses SRB's "user defined metadata" facility to store GCMD-compliant metadata. We have defined site-standard user metadata attributes corresponding to the attributes defined in GCMD, then restricted their values based on GCMD convention. An application-level tool replaces the general-purpose SRB metadata manipulation client and enforces the conventions.

4. Existing Architecture

[Figure 1 diagram: DODS/OpenDAP servers and compute-engine clients (OSF/1 and IRIX, running PBS and user jobs) sit in front of the SRB middleware; DMS/DODS manages a local disk cache of read-only SRB objects, and user or batch jobs use local disk as a manually managed cache. A logical device transparently distributes traffic to live and failover MCAT metadata servers (Oracle/9i RDBMS, with database replication maintaining synchrony). SRB uses the HSM disk cache manager transparently, and cross-mounting makes both vaults (an IRIX DMF storage server and two SAM-QFS storage servers behind a logical SAM-QFS server) available to either server.]

Figure 1: Depicts the existing components of the DMS deployed at the NCCS and their functionality.

5. Requirement Fulfillment

Based on the requirements and the architecture described above, the DMS currently meets the following requirements.

• There should be a single point of information for retrieving/searching the data. Even though there might be many different interfaces, the initial point of contact for each interface should be the same. SRB provides a single point of access for the data.
• The system should provide tools to cap storage costs and select datasets to be expunged. The Data Management System Toolkit provides tools to manage expiration dates for datasets and mechanisms allowing users to preserve selected datasets beyond a given lapse of time (separate description).
• The system should provide ways to minimize the number of data copies and could provide a live backup of the data. The copy that is more efficient to fetch should be the one fetched. DODS/OpenDAP can manage a local cache, network-wise close to the users. Computations and on-the-fly sub-setting can be provided by tools like the GrADS-DODS server (active implementation already on-going).
• Scratch space on the computing platforms can be managed by a short expiration date of an SRB replica of a given dataset.
• Linkage between data, scientific metadata, computing metadata, and configuration management data. SRB flexible metadata schemas provide a linkage between datasets and their scientific content. The metadata schema has been modified to accommodate the format provided by the Global Change Master Directory Software (GCMD), although a fully compatible version of GCMD has not been implemented as yet. Halcyon also has integrated a Configuration Management Software into the DMS that links the system "state" (patches, compilers, etc.) of computing engines with the dataset metadata.
• Provide support for data migration from computing nodes to local storage (where users are) or from one storage system to another: SRB provides bulk transfer from legacy mass storage systems to newer ones, and DODS/OpenDAP can manage local caches as datasets are requested by users.
The Halcyon DMS Toolkit provides the following features:

• file ownership management (user, group, project)
• file expiration dates management tools
• dms acct uses the MCAT interface for accounting reports
• dms admin provides administrative commands
• dms meta provides metadata management and search
• dms ingest stores files with metadata automatically
• the concept of file certification, a process through which users can extend the life of a file beyond its expiration date

• Provide robust, easily deployed, grid-compatible security tools. SRB's underlying security infrastructure is compatible with the Grid Security Infrastructure (GSI). At the moment, the current DMS deployment is using password encryption, which is more robust than FTP and does not pass clear-text passwords. GSI can support tickets (PKI) and Kerberos infrastructure.
• Ease of deployment. Most department-type scientific groups do not have the resources to integrate a fully deployed Cluster Software Management and Mass Storage System. Halcyon is planning to deploy a turn-key server, named Infohedron, to deploy the DMS software in a single box (see next section).

6. Performance

As with all high performance production systems, the risk of not utilizing all available network bandwidth can be a significant issue. In tests performed between two single points at NCCS, the following results have proven that the DMS and, particularly, SRB are able to sustain performance levels equivalent to scp transfers without the overhead of CPU consumption due to encrypting and decrypting the data.

The NCCS implementation is built around a pair of redundant Linux-based SRB MCAT servers running Oracle/9i to provide database services. These DMS servers are identically configured two-CPU Xeon systems with 4 GBytes of RAM and SCSI RAID disk arrays. One machine, the primary, is the active server. The second is a hot backup that can be brought into production within two hours should a catastrophic failure disable the first, losing at most thirty minutes' worth of MCAT transactions, although in the vast majority of situations the RAID arrays prevent this type of serious failure and no transactions will be lost.

DMS/SRB I/O bandwidth was measured between two hosts: "halem", a Compaq Tru64 compute cluster acting as SRB client, and "dirac", a Solaris 9-based SAM-QFS storage server. The tests reported here used a single node of halem and a single node of dirac interconnected by Gigabit Ethernet. Thirty-two transfer threads ran simultaneously, although test results indicated that the performance changed little from eight to sixty-four threads.

These bandwidth tests were designed to demonstrate that DMS/SRB is capable of supporting the near-term projected storage load for NCCS, which was estimated at 2 TBytes per day with a ratio of three writes to one read, i.e., 1.5 TB write traffic and 0.5 TB read traffic per day. The average file at NCCS is 40 MBytes in size, and it was calculated that in order to meet the daily write requirement it would be necessary to complete the transfer of 1600 files in an hour. Although only one third this number of files had to be transferred within an hour to meet the read test requirements, for convenience the tests ran with the same group of 1600. A significant part of the file transfer time is due to MCAT overhead independent of the file size, so the aggregate throughput increases significantly as the file size increases.
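As a rough sanity check on these sizing figures (a sketch; binary megabytes and terabytes are assumed):

MB, TB = 2**20, 2**40
files_per_hour = (1.5 * TB) / (40 * MB) / 24      # 1.5 TB/day of writes in 40 MB files
mb_per_s_needed = (2 * TB) / 86400 / MB           # total 2 TB/day expressed as MB/s
print(round(files_per_hour), round(mb_per_s_needed, 1))   # ~1638 files/hour, ~24.3 MB/s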
For these tests, no NCCS-specific network optimization (for instance, adjustment of network buffer sizes) took place.

TEST    ELAPSED (min)   MB/s      TB/day
write   30.5 - 33.3     32 - 35   2.6 - 2.9
read    17.6 - 32.2     33 - 60   2.7 - 5.0
1600 40 MB files, 32 threads, halem to dirac; requirement: 1 hr. or less, 2 TB/day (3:1 W:R).
NOTE: single client system to single server system; no optimization to NCCS network.

As the table demonstrates, DMS/SRB was easily able to meet the requirements even without optimization. The daily performance numbers were extrapolated from the 1600-file test performance.

The second group of tests measured MCAT transaction performance and were intended to demonstrate that DMS can support the expected number of file metadata operations per day. For the tests, it was estimated that each file would have 15 associated metadata attribute-value pairs, and similarly to the bandwidth tests a group of 1600 canonical 40 MByte files was used. Metadata insertions and deletions were tested, as well as simple queries (display of the metadata attributes associated with a particular file). 50,000 insertions and deletions were required each day, as well as 10,000 searches.

DMS Performance: Metadata
TEST    ELAPSED (min)   TRANS/s     TRANS/day
insert  43.5 - 48.6     8.2 - 9.2   711K - 795K
query   2.9 - 3.1       129 - 140   11.2M - 12.1M
delete  42.4 - 45.3     8.8 - 9.4   770K - 815K
1600 40 MB files, 32 threads, halem to dirac; requirement: 50K inserts/day, 10K searches/day.

Even more so than with the bandwidth tests, the DMS/SRB easily exceeded the requirements.

7. Infohedron System Architecture

Presently, the DMS system is built on a Linux and Oracle 9i platform with limited redundancy (manual switchover), which covers the minimal needs of a production system. The cost of upgrading to a replicated database is largely due to the cost of an Oracle replicating database. Halcyon is testing the deployment of a Postgres-based underlying database. In this area, Halcyon has been using an SRB 2.1 server while advancing to Postgres version 7.4. This decision has been based on the large customer base of Postgres, which allows it to mature faster, and the smaller customer base of SRB, which implies a slower maturation process of the software to arrive at the production level required by the NCCS environment. With an upgrade to SRB 3.0, the process would close the compatibility of Infohedron platforms by distributing the metadata catalog and, thereby, form a federated DMS.

In planning for the full deployment of Infohedron, Halcyon has included the GrADS-DODS server to fulfill the needs of NCCS major customers, such as the Global Modeling and Assimilation Office (GMAO), as well as the following packages (to make it useful to a wider audience of customers).

[Figure 2 diagram: local Infohedron services (local caches, local MCAT, NFS services (SAMBA), document library, CM software, Grid Security Infrastructure, visualization, Data Management Toolkit, GDS, Plot3D plus project management, revision control (CVS), web services (DODS), and possibly Globus) layered over SRB with a federated MCAT, Big Brother monitoring, specialized SRB clients, Solaris and Linux platforms, SAM-QFS and possibly Lustre storage, and Oracle or Postgres databases.]

Figure 2: Depicts the turn-key option with typical services needed by a scientific group to adhere to a Grid-like infrastructure. The seemingly chaotic disposition of the packages is intended to depict large variations in needs from group to group.
The question marks indicate uncertainties in the configuration of groups or the possibility of replacing them with other packages. 8. DMS as Cluster Management By managing the accounting in the cluster and providing Virtual Locality for Data, DMS can provide full utilization of the cluster and the local caches co-located with the users and the scratch space of the computing cluster itself. By containing the software engineering information and the computing configuration management, DMS is able to provide data integrity and reproducibility. Homogeneous, easily deployable security infrastructure couples with federated metadata catalogs enabling the Grid. Migration of data and underlying data movements can be controlled in a small environment automatically and in a larger environment with the aid of user indirect manipulation (SRB replication process). Finally user control of data sharing and user quotas (SRB 3.0) can enable cluster sharing, producing a CDMS. 101 9. Future Directions Though many of the components described in this paper already exist, and their integration is relatively simple, the production level will be arduous to achieve. Halcyon provides a rigorous system engineering background to test, document and deploy all components. While the effort is sizeable, it has the potential to move progressively toward deployment of a large grid by doing the hardest work first – incorporating legacy data into a Data Management System and then enlarging the DMS into a wider set of services service like CDMS. Parallel transfers of datasets over separate rails support is provided by SRB. However, it has not been tested under production on DMS. GSI infrastructure has not been deployed at NCCS. The level of Software Systems support has not yet been determined. Grid wise accounting has not yet been defined under CDMS. The Earth System Modeling Framework (http://www.esmf.ucar.edu/) is in the process of formulating an I/O interface. The DMS project will provide a library to interact directly with DMS. If proper network support is provided, an application running in a computer cluster could directly deposit files into mass storage systems. In this way, a consolidation of high performance file-systems would provide savings, as well as avoid the usual double I/O process of depositing files in a local parallel file-system and then transporting them to mass storage. Integration of the DMS with Lustre: Luster is a distributed file-system designed to provide high performance and excellent scalability for cluster computers. The resulting system would combine the simplicity, portability, and rich interfaces of DMS with the high performance and scalability of Lustre, effectively extending DMS to efficiently support data-intensive clusterbased supercomputing. Lustre is designed to serve clusters with tens of thousands of nodes, manage petabytes of storage, and achieve bandwidths of hundreds of GBs/sec with state of the art security and management infrastructure. It is currently being developed with strong funding from the Department of Energy and corporate sponsors. Experimentation with more integration between SRB and the underlying Hierarchical Storage Systems could lead to a more efficient sub-setting by extracting only necessary parts of the files to be sub-set directly from tape (no full file recalling). This is similar to the ECMWF MARS Archive. 
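A minimal sketch of the kind of partial-file access such sub-setting implies (reading only the needed byte ranges instead of recalling the whole file), assuming a hypothetical storage handle whose open() yields a seekable, file-like object:

def read_subset(storage, name, byte_ranges):
    """Return only the requested (offset, length) ranges of an archived file,
    avoiding a full recall of the file from tape."""
    pieces = []
    with storage.open(name, "rb") as f:
        for offset, length in byte_ranges:
            f.seek(offset)
            pieces.append(f.read(length))
    return b"".join(pieces)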
In conclusion we propose a two-tier approach: Firstly, convert the typical mass storage/computing cluster architecture most computing centers have to a service rich Cluster/Data Management System Architecture as, for example, the one described in this paper. Secondly, produce a brick-like engine that can take care of most requirements of the diverse, medium- to small-size groups. These bricks would provide local data caches and direct connection to software trees, as well as many other services targeted to the individual groups. 102 In this manner local idiosyncrasies can be accommodated while maintaining a homogeneous systems engineering throughout a Computing Grid. The further development of this project would be a breakthrough in data-intensive supercomputing, alleviating a persistent performance bottleneck by enabling efficient analysis and visualization of massive, distributed datasets. By exploiting dataset layout metadata to provide direct access to the relevant portions of the data, it is possible to avoid the performance limiting serialization traditionally imposed by requiring transfer of the entire dataset through a non-parallel mass storage system. References [1] Rajasekar, A., M. Wan, R. Moore, "mySRB and SRB, Components of a Data Grid", 11th High Performance Distributed Computing conference, Edinburgh, Scotland, July 2002. [2] Arcot Rajasekar, Michael Wan, Reagan Moore, George Kremenek, Tom Guptil, "Data Grids, Collections, and Grid Bricks", Proceedings of the 20th IEEE Symposium on Mass Storage Systems and Eleventh Goddard Conference on Mass Storage Systems and Technologies, San Diego, April 2003. [3] http://www.unidata.ucar.edu/packages/dods [4] http://www.esmf.ucar.edu [5] http://gcmd.gsfc.nasa.gov [6] http://www.globus.org [7] http://www.escience-grid.org.uk [8] http://www.nas.nasa.gov/About/IPG/ipg.html [9] http://www.globalgridforum.org 103 104 Regulating I/O Performance of Shared Storage with a Control Theoretical Approach Han Deok Lee, Young Jin Nam, Kyong Jo Jung, Seok Gan Jung, Chanik Park Department of Computer Science and Engineering / PIRL Pohang University of Science and Technology Kyungbuk, Republic of Korea cyber93,yjnam,braiden,javamaze,cipark@postech.ac.kr tel +82-54-279-5668 fax +82-54-279-5699 Abstract Shared storage has become commonplace with recent trends in storage technologies, such as storage consolidation and virtualization, etc. Meanwhile, storage QoS, which guarantees different storage service requirements from various applications toward shared storage, is gaining in importance. This paper proposes a new scheme which combines a feedback-controlled leaky bucket with a fair queuing algorithm in order to deliver guaranteed storage service for applications competing for shared storage. It not only assures an agreed-upon response time for each application, but also maximizes the aggregate I/O throughput by proportionating unused bandwidth to other active applications. Simulation results under various types of competing I/O workloads validate the features of the proposed scheme. 1 Introduction The explosive growth of on-line data in many applications, such as multimedia, e-business, ERP, etc., poses scalability and manageability problems with storage. The advent of storage consolidation through SAN and storage cluster has overcome the limitation of scalability in traditional directed-attached storage environments. 
Moreover, the introduction of a new abstraction layer between physical disks and storage management applications, called storage virtualization, reduces complexity in storage manageability dramatically. With these trends in storage technologies, a shared storage model is now accepted in many areas, such as storage service providers, departmental storage environments in an enterprise, etc.

In a shared storage environment, it is commonplace for different users or applications to share a physical disk resource. Moreover, each application assumes that the storage is owned by itself, implying that it demands a guaranteed storage service, called storage QoS, at all times no matter how many applications share the storage. The storage QoS can be specified in many aspects, which include I/O performance, reliability/availability, capacity, cost, etc. The issue of delivering guaranteed I/O performance has been given a higher priority than the others [6, 7, 12]. In addition, Shenoy and Vin in [7] described how partitioning storage bandwidth can satisfy the different I/O performance requirements from mixed types of applications.

Few disk scheduling algorithms exist with QoS in mind [6, 7]. YFQ [6] is an approximated version of Generalized Processor Sharing (GPS) [2] that allows each application to reserve a fixed proportion of disk bandwidth. However, when the I/O workload from an application becomes heavier, it cannot bound the maximum response time. The Cello framework [7] schedules I/O requests from heterogeneous types of clients, including real-time and best-effort applications. The drawback of the Cello framework is that it assumes the existence of an accurate device model.

This paper proposes a new scheme which combines a feedback-controlled leaky bucket with a fair queuing algorithm in order to deliver guaranteed storage service for applications competing for shared storage. It not only assures an agreed-upon response time for each application, but also maximizes the aggregate I/O throughput by proportionating the unused bandwidth to other active applications. The feedback-controlled leaky bucket at the front-end dynamically regulates I/O rates from each application. The fair queuing algorithm at the back-end partitions the disk bandwidth among multiple I/O workloads from different applications in a proportional manner. As a result, the proposed algorithm is expected to assure a demanded response time as well as to maximize storage utilization.

The remainder of this paper is organized as follows. Section 2 describes basic assumptions and definitions for the proposed scheme. Section 3 gives a detailed description of the proposed scheme. Performance evaluations via simulation are given in Section 4. Finally, this paper concludes with Section 5.

2 Preliminaries

Assumptions and Definitions: We begin by providing a set of assumptions and definitions to be used throughout this paper for clear descriptions. First, we assume that the characteristics of an I/O workload, featured by an average IOPS and an average request size, are known. Second, I/O requests access the underlying storage randomly. We denote the underlying shared storage with S. Next, it is shared by a set of I/O workloads denoted with W = {W_1, W_2, ..., W_n}. An I/O workload W_i demands an I/O performance level of {size_i, iops_i, rt_i} for the shared storage S, where size_i is an average request size, iops_i is an I/O arrival rate per second (briefly, IOPS), and rt_i is the response time demanded by W_i.
Given I/O requests of size size_i, the response time of any I/O request is required not to exceed rt_i, unless the current arrival I/O rate from W_i is faster than iops_i. Given the maximum IOPS of the storage, we assume that it can be provided in a sustained manner, and denote with IOPS_T the sustained maximum IOPS.

Simple Admission Control: Next, we describe how to systematically map the I/O performance demanded by the I/O workloads onto the underlying storage, a step called admission control. (Hereafter, we interchangeably use an application and an I/O workload.) Given W = {W_1, W_2, ..., W_n}, where W_i requires performance of {size_i, iops_i, rt_i}, the following procedure decides whether or not the underlying storage can guarantee the different types of performance required by W. Figure 1 depicts this procedure graphically. In Section 4, we will show how to seek the RT_E and RT_T values for two sets of I/O workloads based on this procedure.

[Figure 1 diagram: mixed I/O requests from workloads W_1, W_2, ..., W_n are issued to the shared storage S; on the response-time-versus-IOPS curve, RT_E is read off as the 95th percentile at the target IOPS_T, and RT_T is obtained by adding a 20% margin.]

Figure 1: Measuring deliverable target response time (RT_T) for a given target IOPS (IOPS_T) with a set of I/O workloads

• Generate mixed I/O requests whose sizes are size_i, 1 ≤ i ≤ n, where a request of size size_i is issued with a probability proportional to iops_i.
• Find a response time RT_E, which is the 95th percentile of all response times whose corresponding IOPS falls into the target IOPS_T.
• Compute a target response time with a 20% margin as follows: RT_T = 1.2 × RT_E.
• If RT_T ≤ rt_i for all W_i of W, then it can be said that the underlying storage can guarantee the performance requirements demanded from W.

3 The Proposed Algorithm

3.1 Feedback-Controlled Leaky Bucket (FCLB)

The proposed algorithm consists of a feedback-controlled leaky bucket and the YFQ disk scheduling algorithm, as shown in Figure 2. The YFQ disk scheduling algorithm proportionately partitions the disk bandwidth among multiple I/O workloads according to the assigned weights (φ_i), and then the feedback-controlled leaky bucket dynamically regulates the requests within each partition by controlling the token replenish rate ρ_i. The feedback-control module is composed of a monitor and a controller. It controls the token replenish rate ρ_i of the leaky bucket adaptively according to the current response time RT_i. The controller increases ρ_i when RT_i goes below its demanded response time rt_i. Conversely, the controller decreases ρ_i when RT_i exceeds rt_i, down to the demanded IOPS iops_i at minimum. In addition, when one I/O workload is inactive, the other can utilize the surplus left by the currently inactive I/O workload.

Monitor: The monitor component is responsible for collecting the current response time RT_i(k) of each workload W_i at the end of each monitoring period, and feeding these results to the controller.

[Figure 2 diagram: each I/O workload W_i with requirement {iops_i, size_i, rt_i} passes through a leaky bucket with replenish rate ρ_i and then a YFQ queue with weight φ_i in front of the shared storage S; a per-workload monitor measures RT_i(k), and the controller compares it with the reference rt_i, computes ρ_i(k), and adjusts ρ_i(k-1) to ρ_i(k).]

Figure 2: Architecture of the proposed algorithm

Controller: The controller compares the current response time RT_i(k) for the k-th monitoring period with the demanded response time rt_i, and computes the replenish rate ρ_i(k) to be used during the next monitoring period.

1. For each workload W_i (1 ≤ i ≤ n) in the system, compute its error

   e_i(k) = rt_i − RT_i(k)   (1)

   where rt_i is called the reference in control theory. More negative values of e_i(k) represent larger response time violations.
2. Compute the replenish rate according to the integral control function (K is a configurable parameter of the controller):

   ρ_i(k) = ρ_i(k−1) + K · e_i(k)   (2)

3. Adjust the replenish rate from ρ_i(k−1), used in the previous control period, to ρ_i(k).

3.2 Feedback control loop in FCLB

The parameter K must be tuned to prevent the replenish rate and the measured response time from oscillating excessively and to ensure fast convergence of the output to the reference. This can be done systematically with standard control theory techniques.

System Modeling: In general, all systems are non-linear. However, there are equilibrium points where systems behave in a linear fashion. Accordingly, non-linear systems can be linearized at the points previously described in Section 2. We approximate the controlled system with the linear model shown in Equation 3. The controlled system includes the shared storage, leaky bucket, monitor and controller. The output is RT_i(k) and the input to the controlled system is the replenish rate ρ_i(k−1) in the previous monitoring period:

   RT_i(k) = G · ρ_i(k−1)   (3)

The process gain, G, is the derivative of the output RT_i(k) with respect to the input ρ_i(k−1). G represents the sensitivity of the response time with regard to the change in the replenish rate.

z-Transform: Next, we transform the controlled system model to the z-domain, which is amenable to control analysis. The controlled system model in Equations 2 and 3 is equivalent to Equations 4 and 5. Figure 3 describes the flow of signals in the control loop.

   ρ_i(z) = C(z) · E_i(z),  where C(z) = K·z / (z − 1)   (4)

   RT_i(z) = H(z) · ρ_i(z),  where H(z) = G / z   (5)

[Figure 3 diagram: the reference rt_i (a step input, z·rt_i/(z−1)) is compared with the fed-back RT_i(z) to form E_i(z), which drives the controller C(z) = Kz/(z−1); its output ρ_i(z) drives the process H(z) = G/z, producing RT_i(z).]

Figure 3: z-Transform of the control loop

Transfer Function: The whole feedback control system is modeled in the following transfer function:

   T(z) = C(z)·H(z) / (1 + C(z)·H(z)) = K·G / (z − 1 + K·G)   (6)

Given the dynamic model of the closed loop system, we tune the control parameter K analytically using linear control theory, which states that the performance of a system depends on the poles of its closed loop transfer function. The closed loop transfer function has a single pole:

   p = 1 − K·G   (7)

and the sufficient and necessary condition for system stability is:

   |1 − K·G| < 1, i.e., 0 < K·G < 2   (8)

4 Performance Evaluations

This section provides the behavior and performance of the proposed algorithm, obtained from simulations. First, we describe the simulation environments, the I/O characteristics of two competing I/O workloads, and the performance requirements each of the I/O workloads imposes. Second, we measure the I/O characteristics of the shared storage used in the experiments and investigate the range where the shared storage can provide service in a stable manner. Third, given a set of competing I/O workloads and their performance requirements, we perform an admission control to decide whether or not the underlying shared storage can assure the requirements. Fourth, we determine experimentally the two parameters G and K for the feedback control in order to stabilize the system. Finally, under a variety of conditions of the two I/O workloads, we analyze the behavior and performance of the proposed algorithm.

4.1 Simulation Environments

We implemented the proposed algorithm within the DiskSim simulator [14]. Table 1 shows the generic throttling parameters which are used for the experiments. In this table, ρ_i represents the rate of replenishing tokens. It is the same as the demanded maximum IOPS for W_i. σ_i means the amount of tokens that can be accumulated during an idle period. It corresponds to the size of a bucket in a leaky bucket model. Tokens are replenished at the rate ρ_i (IOPS). In our experiments, we employ a time interval of 1 msec to replenish tokens, to eventually allow I/O requests to pass through the throttling module, and 1000 msec to control the replenish rate. We also set the YFQ weights to φ_1 : φ_2 = 2 : 1. Two competing I/O workloads based on a closed model are synthetically generated, as shown in Table 2. The sizes of the I/O requests are distributed normally with a mean of 8 blocks. The performance of read and write is the same in a random I/O pattern, so we perform experiments with only read requests. We use a single IBM DNES-309170W SCSI disk which serves arriving I/O requests in a FIFO manner.

Table 1: Throttling parameters for each experiment
Parameter                     W_1       W_2
Replenish rate ρ_i (IOPS_i)   iops_1    iops_2
Bucket size σ_i               8         4

4.2 I/O characteristics of the shared storage

In this subsection we investigate the I/O characteristics of the shared storage to decide a set of replenish rates ρ_i for which the shared storage can provide service stably. The stable area extends up to about 75 IOPS, as shown in Figure 4. The growth of response times is gradual according to the increase of IOPS in this area. We assign 40 and 20 IOPS to the two I/O workloads, respectively.
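For concreteness, the per-period controller update of Section 3.1 can be sketched as follows; the lower clamp at the demanded IOPS reflects the reading of Section 3.1 above, and the numbers in the usage line are illustrative rather than the paper's exact settings.

def fclb_update(rho_prev, rt_reference, rt_measured, K, iops_demanded):
    """One monitoring-period update of the feedback-controlled leaky bucket:
    integral control rho_i(k) = rho_i(k-1) + K * e_i(k), with e_i(k) = rt_i - RT_i(k)."""
    error = rt_reference - rt_measured      # more negative means a larger violation
    rho_next = rho_prev + K * error
    return max(rho_next, iops_demanded)     # never throttle below the reservation

# Illustrative use for W_1 (40 IOPS demanded, 34 msec reference, K chosen separately).
rho = 40.0
rho = fclb_update(rho, rt_reference=34.0, rt_measured=29.0, K=0.3, iops_demanded=40.0)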
In practice, rho_i tokens (IOPS) are replenished every T seconds. In our experiments, we employ a time interval of 1 msec for replenishing tokens, which eventually allows I/O requests to pass through the throttling module, and an interval of 1000 msec for controlling the replenish rate. We also set the YFQ weights to phi_1:phi_2 = 2:1. Two competing I/O workloads based on a closed model are synthetically generated, as shown in Table 2. The sizes of the I/O requests are distributed normally with a mean of 8 blocks. The performance of reads and writes is the same under a random I/O pattern, so we perform the experiments with read requests only. We use a single IBM DNES-309170W SCSI disk, which serves arriving I/O requests in a FIFO manner.

Table 1: Throttling parameters for each experiment
  Parameter: replenish rate rho_i (iops_i); bucket size sigma_i: 8, 4

4.2 I/O characteristics of the shared storage

In this subsection we investigate the I/O characteristics of the shared storage to decide the range of replenish rates rho_i over which the shared storage can provide service stably. The stable region extends up to about 75 IOPS, as shown in Figure 4; within this region the growth of response time is gradual as the IOPS increases. We assign 40 and 20 IOPS to the two I/O workloads, respectively.

Table 2: Two competing I/O workloads
  Parameter        W_1       W_2
  size_i           4 KB      4 KB
  iops_i           40        20
  rt_i             35 msec   38 msec
  access pattern   Random    Random

Figure 4: I/O characteristics of the shared storage with Random Read 4KB. (Response time in msec versus IOPS for the shared storage; the growth of response time is gradual up to about 75 IOPS.)

4.3 Admission Control

After acquiring all the information about the performance requirements of the I/O workloads and the underlying storage, we try to map each I/O workload described in Table 2 onto the underlying storage. Recall the mapping methodology proposed in Section 2. I/O requests of 4 KB are issued at an increasing rate to the corresponding reservation queue for each W_i, and we obtain the IOPS-versus-RT chart shown in Figure 5. By analyzing Figure 5 with the steps given in Section 2, we obtain the parameters in Table 3. Based on the RT_T and RT_E of each I/O workload in Table 3, it can be said that the level of performance demanded by the I/O workloads in Table 2 can be delivered by the underlying storage. RT_E is used internally as the reference value for the feedback control.

Figure 5: Measuring the deliverable target response time RT_T for a given target IOPS with two I/O workloads. (Response time in msec versus IOPS at 20 and 40 IOPS; the 95th-percentile response times RT1_E and RT2_E, each increased by a 20% margin, give RT1_T and RT2_T.)

Table 3: Given I/O workloads and deliverable response time by the underlying storage
  Parameter   W_1          W_2
  iops_i      40           20
  RT_E        29.08 msec   31.38 msec
  RT_T        34 msec      37 msec

Figure 6: Performance Experiment 1 - one I/O workload is inactive and the other is active. (a) Throughput and response time for the active 40 IOPS workload; (b) throughput and response time for the active 20 IOPS workload. (Each panel plots IOPS and response time in msec, together with the target RT, over time in seconds.)

4.4 Control Parameters

Here we determine the parameters G and K. First, we approximate G by running the set of shared-storage profiling experiments described in Section 4.2; we estimate G = 0.3 for the synthetic workload. Since the closed-loop transfer function in Equation 7 has the single pole p = 1 - K*G, we can set p to the desired value by choosing the right value of K, that is, K = (1 - p)/G. In our experiments, we set the pole to the desired value by choosing K accordingly.
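As a purely illustrative calculation (the specific numbers here are ours, not reported values): with the estimated gain G = 0.3, the stability condition of Equation 8 requires 0 < K < 2/0.3, i.e., roughly 0 < K < 6.7, and a pole placed at, say, p = 0.4 would correspond to K = (1 - 0.4)/0.3 = 2.0, trading some convergence speed for robustness against errors in the estimate of G.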
In the case of shared storage with a non-linear property, where the I/O request service time is not proportional to the request size, we choose the location of the pole accordingly in order to keep the system stable.

4.5 Performance Results

Under a variety of conditions of the two I/O workloads, we analyze the behavior of the proposed algorithm and its resulting I/O performance.

Case 1 - One Inactive I/O Workload: In this experiment, one I/O workload is inactive and the other is active. Figures 6(a) and 6(b) show time-plots of throughput and response time for the two I/O workloads when the 20 IOPS and the 40 IOPS workload, respectively, is inactive. As the graphs show, the active I/O workload fully utilizes the shared storage, averaging 72 IOPS. The response time plots show that the active I/O workload receives its demanded response time with a 5% violation. The degree of response time violation seen in Figure 6(b) is higher than in Figure 6(a); this is because the reference of the 20 IOPS workload, which is used to compute the error, is larger than that of the 40 IOPS workload.

Figure 7: Performance Experiment 2 - one I/O workload begins after 30 seconds and the other issues I/O requests continuously. (a) Throughput, 20 IOPS step; (b) response time, 20 IOPS step; (c) throughput, 40 IOPS step; (d) response time, 40 IOPS step. (Each panel plots IOPS or response time in msec, together with the target RT, over time in seconds.)

Case 2 - One Step I/O Workload: In this experiment, one I/O workload begins after 30 seconds and the other issues I/O requests continuously. Figures 7(a)-(d) show the measured throughput and response time when the 20 IOPS and the 40 IOPS workload, respectively, begins after 30 seconds. In Figure 7, we observe that the two competing I/O workloads meet their demanded response times in most cases, except around the 30-second mark at which the second I/O workload begins issuing I/O requests, and achieve their demanded IOPS in all cases. Before 30 seconds, YFQ allocates the full disk bandwidth to the continuously issuing I/O workload. When the other I/O workload comes on after 30 seconds, YFQ proportionately partitions the disk bandwidth among the I/O workloads according to the assigned weights. As a result, the I/O workload that had been allotted the full disk bandwidth temporarily sees a high response time, because it takes time for its reservation queue to drain sufficiently for the corresponding response time target to be met. In this case, the response time violation is below 3 percent.

Case 3 - One Pulse I/O Workload: In this experiment, one I/O workload repeats on for 5 seconds and off for 5 seconds while the other issues I/O requests continuously. As the graphs show, the two competing I/O workloads achieve their demanded IOPS in all cases.
However, there is a spike in the response time whenever a burst of requests begins. The spike subsides quickly, within two or three time windows. This tendency is due to the YFQ behavior explained in the previous experiment. The degree of response time violation seen in Figure 8(b) is higher than in Figure 8(a), again for the same reason described in the first experiment. The response time violations are 12/6% and 3/19%, as shown in Figures 8(a)-(b).

Figure 8: Performance Experiment 3 - one I/O workload repeats on for 5 seconds and off for 5 seconds while the other issues I/O requests continuously. (a) Throughput, 20 IOPS pulse; (b) response time, 20 IOPS pulse; (c) throughput, 40 IOPS pulse; (d) response time, 40 IOPS pulse. (Each panel plots IOPS or response time in msec, together with the target RT, over time in seconds.)

Case 4 - Two Active I/O Workloads: In Figure 9, the two active I/O workloads meet their demanded IOPS and response times in most cases. In this case, the response time violation is below 3%, and the two I/O workloads share the storage at approximately the same ratio as their YFQ weights, that is, 2:1.

Figure 9: Performance Experiment 4 - two active I/O workloads. (a) Throughput; (b) response time.

Comparison with the Cello Framework [7]: The Cello framework depends heavily on the accuracy of the underlying storage device model, whereas the proposed scheme operates on the measured performance of the underlying storage device; the proposed scheme can therefore be more portable and more widely applicable. In addition, the Cello framework apportions unused storage performance by selecting pending I/O requests from the active applications in an ad-hoc order, whereas the proposed scheme distributes the unused storage performance by adaptively configuring the token replenish rate of each application's leaky bucket, based on the concrete theory given in Equations 1-8.

5 Conclusion and Future Work

We proposed a new scheme that combines a feedback-controlled leaky bucket with a fair queuing algorithm in order to provide guaranteed storage service to different applications competing for shared storage. The proposed scheme not only assures an agreed-upon response time for each application, but also maximizes the aggregate I/O throughput by distributing the unused bandwidth proportionally to the other active applications. We evaluated the performance of the proposed scheme under various types of competing I/O workloads. First, when an I/O workload becomes idle, we observed that the other workload could fully utilize the unused surplus bandwidth, and only 5% of all completed I/O requests missed the agreed-upon response time.
Second, when an I/O workload is backlogged again while the other I/O workload is using the entire bandwidth, we observed that competing I/O workloads had their demanded response time in most cases except for 30 seconds where an I/O workload issues I/O requests and achieved its demanded bandwidth in all cases. In this case, the response time violation is below 3 percent. Third, when an I/O workload is backlogged for a short period like a pulse while the other I/O workload is using the entire bandwidth, a spike occurs in the response time whenever a burst of requests begins. In this case, the proposed scheme revealed a lower performance than others. Finally, when both I/O workloads are active, both I/O workloads can approximately share the bandwidth at a rate of 2:1 and below 3% of all completed I/O requests missed the agreed-upon response time. In summary, the simulation results with various types of competing I/O workloads showed that the proposed algorithm provided a satisfactory level of response times; that is, 6% violation on average for the demanded response times. In future work, we plan to support workloads with multiple performance requirements that change over time. 115 6 Acknowledgement The authors would like to thank the Ministry of Education of Korea for its support toward the Electrical and Computer Engineering Division at POSTECH through its BK21 program. This research was also supported in part by grant No. R01-2003-000-10739-0 from the Basic Research Program of the Korea Science and Engineering Foundation. References [1] J.S. Turner. New directions in communications, or Which way to the information age?. IEEE Communication Magazine, 1986. [2] A. Parekh and R. Gallager. A generalized processor sharing approach to flow control in integrated services networks: The single-node case. IEEE/ACM Trans. on Networking, vol. 1, 1993. [3] Pawan Goyal and Harrick M. Vin and Haichen Cheng. Start-time Fair Queueing: A scheduling algorithms for integrated services packet switching networks. Proceedings of SIGCOMM, 1996. [4] E. Borowsky and R. Golding and A. Merchant and L. Schrier and E. Shriver and M. Spasojevic and J. Wilkes. Using attribute-managed storage to achieve QoS. Proceeding of 5th Intl. Workshop on Quality of Service, 1997. [5] E. Borowsky and R. Golding and P. Jacobson and A. Merchant and L. Schreier and M. Spasojevic and J. Wilkes. Capacity planning with phased workload. Proceeding of First Intl. Workshop on Software and Performance, 1998. [6] John L. Bruno and Jose Carlos Brustoloni and Eran Gabber and Banu Ozden and Abraham Silberschatz. Disk Scheduling with quality of service guarantees. Proceedings of the IEEE International Conference on Multimedia Computing and Systems, 1999. [7] Shenoy, P., Vin, H.: Cello: A disk scheduling framework for next-generation operating systems. In: Proceedings of ACM SIGMETRICS. (1998) [8] Guillermo A. Alvarez and Elizabeth Borowsky and Susie Go and Theodore H. Romer and Ralph Becker-Szendy and Richard Golding and Arif Merchant and Mirjana Spasojevic and Alistair Veitch and John Wilkes. Minerva: An automated resource provisioning tool for large-scale storage system. ACM Trans. on Computer Systems, 2001. [9] J. Wilkes. Traveling to Rome: QoS specifications for automated storage system management. Intl. Workshop on Quality of Service (2001) 75-91 [10] Chenyang Lu and John A. Stankovic and Gang Tao and Sang H. Son. Feedback Control Real-Time Scheduling: Framework, Modeling, and Algorithms. Journal of Real-Time Systems, 2002. 116 [11] C. Lu, G. 
A. Alvarez, and J. Wilkes. Aqueduct: online data migration with performance guarantees. Conference on File and Storage Technologies, 2002. [12] Christopher Lumb and Arif Merchant and Guillermo Alvarez. Facade: Virtual Storage Devices with Performance Guarantees. Conference on File and Storage Technology, 2003. [13] G.F. Franklin, J. D. Powell, and M. L. Workman. Digital Control of Dynamic Systems (3rd Ed.). Addison-Wesley, 1998. [14] Gregory R. Ganger, Bruce L. Worthington, Yale N. Patt. The DiskSim Simulation Environment Version 2.0 Reference Manual. CMU, 1999. 117 118 SAN and Data Transport Technology Evaluation at the NASA Goddard Space Flight Center (GSFC) Hoot Thompson Patuxent Technology Partners, LLC 11030 Clara Barton Drive Fairfax Station, VA 22039-1410 Tel: +1-703-250-3754, Fax: +1-703-250-3742 e-mail: hoot@ptpnow.com Abstract Growing data stockpiles and storage consolidation continue to be the trend. So does the need to provide secure yet unconstrained, high bandwidth access to such repositories by geographically distributed users. Conventional data management approaches, both at the local and wide area level, are viewed as potentially inadequate to meet these challenges. This paper explores methods deploying a new breed of Fibre Channel (FC) technology that leverages Internet Protocol (IP) infrastructures as the data transport mechanism, a step towards creating a “storage area network (SAN) grid”. These technologies include products using the FC Over IP (FCIP) and the Internet FC Protocol (iFCP) protocols. The effort draws upon earlier work that concentrated on standard FC and internet SCSI (iSCSI) technologies. In summary, the vendor offerings tested performed as expected and provided encouraging performance results. However, their operational readiness still needs to be understood and demonstrated. Installing and configuring the products was reminiscent of the early days of FC with driver and version compatibly issues surfacing once again. Maturity will take some time. 1. Introduction GSFC, as part of a continuing technology evaluation effort, continues its interest in SAN products and related technologies by evaluating and demonstrating the operational viability of new vendor offerings. Under the auspices of the SAN Pilot, earlier testing has shown the advantages of high-speed transport mechanisms such as FC as well as the flexibility that iSCSI provides in deploying a SAN [1]. Subsequent testing is building upon this work, emphasizing higher speed campus backbones with a focus on manageability as well connectivity to geographically distributed sites. Standardized benchmarks provide measurement of inherent link throughput. In addition, the push is on to attract users with real applications that could benefit from these kinds of technologies The vision is direct access to data regardless of geogr aphical location, using IP based wide area networks (WAN) as the transport mechanism. Such distributed storage, whether for disaster preparedness or for logical proximity to a compute server, pushes the operational requirements normally associated with direct-attached storage onto the WAN. The storage will be expected to be both reliable and high performance, and to behave like direct attached and physically local. The vision promotes leaving data static and performing the necessary processing directly on a data store as opposed to moving large quantities of data between user facilities. 
Connections would be temporal in nature with a corresponding service, such as the Storage Resource Broker (SRB) [2], to assist users in locating relevant data. The end result would be a SAN grid, analogous in many ways to more traditional grids currently gaining wide exposure. This paper explores a variety of topics seen as contributing to the vision.

2. SAN Pilot Infrastructure Description

The core of the SAN Pilot (figure 1) is the connectivity between multiple, on-campus buildings at GSFC. Traditional FC dominates the local GSFC infrastructure with a mix of 2 Gigabit/sec and 1 Gigabit/sec switches - Brocade 3800s and 2400s - providing ports for a variety of server and storage technologies. Linux, Solaris and Apple hosts are represented. RAID storage systems include a DataDirect Networks S2A6000, an Apple Xserve, an Adaptec/Eurologic SANbloc and a Nexsan ATABoy2. A pair of Nishan IPS 3000 Series Multiprotocol IP Storage Switches as well as a LightSand i-8100 augment the other switches by bridging the FC fabric to the IP network. A pair of legacy Cisco SN5420s used for iSCSI work completes the topology. The equipment is mostly GSFC owned. However, notable exceptions include the Nishan and LightSand IP switches. Cisco, Brocade and ADIC have also provided loaner equipment during the testing.

Figure 1 - SAN Pilot Infrastructure. (The figure shows the FC and IP connectivity among four GSFC buildings: Brocade FC switches, Nishan switches, a LightSand gateway and Cisco routers tie Linux, Sun and Apple hosts to DDN, Eurologic, Nexsan, Apple and other RAID storage and an ADIC Scalar 100 library, with SEN/MAX/Abilene links out to UMIACS, SDSC and NCSA and the SVS and ADG users attached.)

The Nishan and LightSand equipment provide IP connections to similar boxes at the University of Maryland Institute for Advanced Computer Studies (UMIACS), the San Diego Supercomputer Center (SDSC) and the National Center for Supercomputing Applications (NCSA). The underlying networks have been key to the IP related testing. Local to GSFC, the primary backbone is the Science and Engineering Network (SEN) [3]. Connection to UMIACS is attained by the Mid-Atlantic Crossroads (MAX) [4]. MAX is also the jump off point to the Abilene Network [5] that completes the circuit to both NCSA and SDSC. The result is full Gigabit Ethernet (GE) to all of the remote sites.

2.1. SEN Network

The SEN is a local, non-mission dedicated computer network with high-speed links to the Internet2's Abilene and other Next Generation Internet (NGI) networks. It serves GSFC projects/users who have computer network performance requirements greater than those allocated to the general-use, campus-wide Center Network Environment. The majority of the SEN's inter-building backbone links are 4 gigabits per second (Gbps), created using IEEE 802.3ad link aggregation standards with four separate GE connections between respective pairs of switches. For desktop workstations and servers, as well as for its other inter-building and intra-building links, the SEN minimally provides GE LAN connections. Only jumbo frame-capable GE switches are used in the SEN's infrastructure. The 9000-byte sized Ethernet jumbo frames (maximum transmission unit or MTU) generally provide individual users with approximately six times better throughput performance as compared to networks only supporting standard 1500 MTUs.
The SEN presently supports a 2 Gbps jumbo frame-capable link with the MAX point-ofpresence at the University of Maryland College Park. 2.2. MAX Network The MAX is a multi-state metaPoP consortium founded by Georgetown University, George Washington University, the University of Maryland, and Virginia Polytechnic Institute and State University. The proximity of the MAX to Washington, D.C. places it in an advantageous location to partner with federal agencies as well as the business community and post-secondary institutions of DC, Maryland and Virginia. MAX represents a pioneering effort in advanced networking, with the potential to rapidly incorporate a broad cross-section of the not-for-profit community. The MAX, the regional gigapop for access to the Abilene network and the NGI-East Exchange, provides the SEN with excellent WAN connectivity. 2.3. Abilene Network The Abilene Network is an Internet2 high-performance backbone network that enables the development of advanced Internet applications and the deployment of leading-edge network services to Internet2 universities and research labs across the country. The network supports the development of applications such as virtual laboratories, digital libraries, distance education and tele- immersion, as well as the advanced networking capabilities that are the focus of Internet2. Abilene complements and peers with other high-performance research networks in the U.S. and internationally. The current network is primarily an OC-192c (10 Gbps) backbone employing optical transport technology and advanced high-performance routers. 3. FCIP and iFCP Technology Prior testing focused on standard FC and iSCSI technologies as it applied to on-campus connectio ns and/or short distances. Interest shifted to assessing the feasibility of constructing a geographically distributed SAN system. This led to experimenting with 121 more suitable technologies, namely FCIP and iFCP. Several products are available that exploit these protocols. The two tested extensively were the IPS 3000 Series IP Storage Switch by Nishan Systems, now a part of the McData Corporation, and the i-8100 unit by LightSand Communications, Inc. The following paragraphs give a brief overview of each of the products and summarize the current evaluation status. 3.1. Nishan IPS 3000 Series IP Storage Switch The IPS 3000 and 4000 Series IP Storage Switches use standards-based IP and GE for storage fabric connectivity. Nishan's Multiprotocol Switch supports iSCSI, iFCP, and E_Port for trunking to both IP backbones and legacy FC fabrics. The IPS 3000 Series connects to a wide variety of end systems, including FC, NAS, and iSCSI initiators and targets. The switch has a non-blocking architecture that supports Ethernet Layer 2 switching, IP Layer 3 switching and FC switching over extended distances at full Gigabit wire speed. The Series also supports standard IP routing protocols such as open shortest path first (OSPF) and distance-vector multicast routing protocol (DVMRP ) and can be fully integrated into existing IP networks. Three parameters assist in tuning the performance of the Nishan to a specific environment – Fast Write™ [6], compression [7] and MTU size. When servers and storage are interconnected via a WAN using a pair of Nishans, the normal SCSI exchange (figure 2) required for a 1MB file write will break the data into multiple transfers thereby compounding the “round trip time (rtt)” effect. 
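A rough back-of-the-envelope calculation shows how quickly this compounding hurts. Assuming, as in Figure 2, that the target advances the transfer 64 KB per XFER_RDY and that only one command is outstanding - our simplifying assumptions, not measured behavior - a 1 MB write costs roughly 1 MB / 64 KB = 16 WAN round trips plus one more for the command and status, whereas the Fast Write behavior described next reduces this to about one round trip. The small sketch below just evaluates that bound, ignoring link bandwidth and device service time:

    def per_mb_write_time(rtt_s, chunk_kb=64, fast_write=False):
        """Latency-only lower bound on the time to push 1 MB through the WAN:
        one round trip per XFER_RDY chunk plus one for command/status, or a
        single round trip when the near-end switch answers the transfer-ready."""
        round_trips = 1 if fast_write else (1024 // chunk_kb) + 1
        return round_trips * rtt_s

    for rtt_ms in (10, 35, 70):
        plain = per_mb_write_time(rtt_ms / 1000.0)
        fw = per_mb_write_time(rtt_ms / 1000.0, fast_write=True)
        print(f"rtt={rtt_ms:3d} ms: ~{1.0/plain:5.1f} MB/s plain, ~{1.0/fw:5.1f} MB/s with Fast Write")

At rtt = 35 msec this bounds a single outstanding command at under 2 MB/sec for the plain exchange versus a few tens of MB/sec with Fast Write; running multiple threads raises the aggregate, which is broadly consistent with the measurements reported in Section 3.3.1.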
In contrast, with Fast Write enabled, when the server sends the SCSI write command (figure 3) to set up the transfer, the local Nishan responds with a transfer ready specifying that the entire 1MB of data can be sent at once. At the same time, the sending Nishan forwards the SCSI write command across the WAN so that the target can be prepared to receive data. Having received the 1MB of data from the server, the sending Nishan streams the 1MB block across the WAN to the receiving Nishan. The receiving Nishan, in turn, mimics the normal command/response sequence for the transfers until all of the data is given to the target. The Nishans do not spoof write completion. Instead, the actual status generated by the storage target is passed back through the network to the server. This guarantees that all data was actually written to disk.

Figure 2 - Normal SCSI Exchange for a 1MB Write. (Sequence diagram: the initiator's 1MB SCSI write is advanced by the target in 64KB XFER_RDY/Data exchanges, each of which crosses both Nishans and the WAN before the final command complete is returned.)

Figure 3 - Fast Write Modified SCSI Exchange. (Sequence diagram: the near-end Nishan immediately returns XFER_RDY for the full 1MB and streams the data across the WAN; the far-end Nishan replays the 64KB XFER_RDY/Data exchanges locally with the target, and the real command complete still travels end to end.)

The Nishan switch also features software-based lossless compression. The following options are available:

• Off - Data going out of the port is not compressed.
• On - Data going out of the port is always compressed using the appropriate algorithm to achieve maximum compression.
• Auto - Depending on the available bandwidth, the switch dynamically decides whether or not to compress the data, the level of compression to apply and the compression algorithm to use. With the Auto setting, the port keeps the data rate as close as possible to the Port Speed of the port.

The last key parameter is MTU. The Nishan switches can support packet sizes up to 4096 bytes, an increase of almost 3X over the nominal 1500. The larger data payload results in less header processing overhead and better link utilization. Packet sizes greater than 1500 bytes maximize direct matching with FC-originated frames; the full FC data payload of 2112 bytes can be delivered in a single jumbo, 4096-byte Ethernet frame. The "auto" option for the MTU setting allows Nishan switches to negotiate the best possible rate.

Configuring the Nishan switch involves the interaction of two applications, the switch-resident HTTP GUI Element Manager and the host-based (Linux or Solaris) SANvergence Manager application. Between the two, devices to be shared are placed in commonly seen, exported zones. The level of SAN merging is a cooperative effort between two or more switches. As a default, a CLI is also available.

3.2. LightSand i-8100

The LightSand i-8100A is an intelligent gateway that provides connectivity between FC fabrics across an IP WAN infrastructure. The i-8100A is an eight-port, multi-protocol switch that provides isolation between FC SANs using Autonomous Region (AR) technology. Conventional FCIP bridging devices link two sites by merging the FC fabrics together. By maintaining Autonomous Regions, the i-8100A is able to share storage devices without merging fabrics. In the diagram (figure 4), two autonomous regions are joined.
Each AR consists of four FC switches, the three original switches plus the gateway. If these two SANs had been bridged by a simple FCIP gateway (nonswitching), the fabric would appear as six FC switches—all part of the same fabric. The storage arrays labeled Disk 1 and Disk 2 are shared. Once they have been imported into SAN 2, every initiator in SAN 2 can see the shared disks as if they were present in SAN 2. In reality, the I-8100A is performing Domain Address Translation (DAT) and the actual disks remain inside SAN 1. Because of this technology, each fabric is isolated from any disturbances that might occur in the other fabric. The LightSand i-8100A employs the user datagram protocol (UDP) with an additional sequencing number to enable protection against packet-loss and mis-ordering. This protocol is referred to as UDP/SR (UDP with Selective Retransmission). Using UDP/SR, the i-8100A can be set for a desired WAN bandwidth. It will instantly jump to that bandwidth and execute appropriate backpressure against the FC fabric, if the WAN bandwidth is less than the native FC bandwidth. In the event that there is packet- loss on the WAN, the i-8100A will retransmit the lost data without throttling the bandwidth. 124 Hosts in SAN 2 believe that Disk 1 and Disk 2 are locally attached to gateway. 1 1 Host 1 IP SAN 1 i-8100A i-8100A SAN 2 2 2 Figure 4 - LightSand Interconnect Configuring the LightSand switch requires running the SANman GUI on each of switches or using the available CLI. 3.3. EvaluationProcess andResults As evidenced by the work done at SDSC for last year’s Mass Storage conference [8], outstanding performance moving data over IP is achievable using a well-behaved, highly tuned network. The tact taken at GSFC has been more the “every day”, out-of-the-box approach where nothing aggressive is done to enhance the performance of site-to-site WANS. In more real world networks, the effects of rtt, congestion and packet loss can render an application useless that requires high bandwidth. In the spirit of the SAN grid vision, laying a distributed file system, such as ADIC’s StorNext File System (SNFS) or SGI’s CXFS™, on the topology would further attenuate any irregularities. FCIP and iFCP testing has been a multi- step process: • Evaluate the technology on a local, campus basis under ideal network conditions. • Artificially introduce non-zero rtts, packet loss and congestion into the circuit, and observe the impact on performance. • Connect to a geographically distant center(s) and compare performance to predictions based on simulated distance testing. Testing was performance centered using standard benchmarks such lmdd [9] and IOzone [10] as the primary tools. lmdd is good for quick, single threaded operations. IOzone permits a variety of IO operations including writes, reads, mixed writes and reads, multithreaded operations, etc. all with options for setting attributes such as record and file size. The majority of the tests consisted of multiple IOzone operations described by the following script: ./iozone_mod -i 0 - i 1 [-+d] -r 1m -s 16g -b one_thread 125 ./iozone_mod -t 2 -i 0 - i 1 [-+d] -r 1m -s 8g -b two_threads ./iozone_mod -t 4 -i 0 - i 1 [-+d] -r 1m -s 4g -b four_threads ./iozone_mod -t 8 -i 0 - i 1 [-+d] -r 1m -s 2g -b eight_threads The scripts steps through 1, 2, 4 and 8 threaded write/read operations and in aggregate moves 16 Gbytes. IOzone was modified such that the [-+d] option would generate random data without doing the diagnostic byte-for-byte check of the data. 
This was done to evaluate the efficiency of the Nishan compression algorithm while not impacting performance with verification process. Tests were performed using mostly native file systems (ext2) with some minimal SNFS evaluation. Network utilization was also monitored. Data traffic cannot be at the expense and disruption of existing communication traffic. At a minimum, the impact must be understood and anticipated. Nishan and LightSand use two different approaches to how the data is transported so the resulting network perturbation varies. 3.3.1. On-Campus Testing Testing began at GSFC with a pair of Nishan switches. A Linux machine was FC connected to one of the Nishans co- located in the same building (figure 5). The other Nishan, in a different building provided tie- in to the SAN Pilot and its associated RAID. Initial results, with zero rtt, compared favorably with the same tests using directly connected RAID. Linux Linux FC Nishan IP FC SEN FC FC FC FC Nishan NIST Delay IP FC RAID Bldg A Figure 5 - Local GSFC Testing The next step was to introduce set delays into the circuit using a NIST Net [11] network emulator to simulate the potential effects of geographically separating the two Nishan switches. The NIST Net network emulator is a general-purpose tool for emulating performance dynamics in IP networks. The tool is designed to allow controlled, reproducible experiments. By operating at the IP level, NIST Net can emulate the critical end-to-end performance characteristics imposed by various WAN situations (e.g., 126 congestion loss) or by various underlying subnetwork technologies (e.g., asymmetric bandwidth situations of xDSL and cable modems). Impressions Installation and configuration of the Nishan units was relatively straightforward with the assistance of the product support engineers. Besides providing FC-IP translation, the Nishans are also full FC switches, an attribute that has different ramifications depending upon how the device is introduced into an existing SAN. As a standalone switch with directly connected devices, as was the case for one end of the GSFC circuit, operation was clear with only the usual zoning decisions to be made. The second switch was Eport connected, a more complicated configuration which requires choosing how the Nishan was to interoperate with the existing SAN Pilot Brocade infrastructure. Multiple options are available, so the ripple effect of zone changes, for example, need to be understood to avoid any unforeseen interruption of an operational SAN. Setting up the zones and mapping devices was easily accomplished using SANvergence and the Element Manager. Large transfers (files) were required to overcome the buffering effects of the servers, the switches and the link. With IOzone modified accordingly, a variety of tests were executed varying rtt and MTU size while going through the permutations of the Fast Write and compression settings. Three observations were made: • Fast Write seems to have an overall positive effect on write performance with this likely being the default setting. Nishan recommends setting to “on” for distances over 200km noting potential degradation if “on” for shorter distances. • Compression can have a positive or negative effect depending upon rtt. Compression processing significantly reduces throughput when rtt is small. Conversely, for large rtt compression enhances performance. Nishan recommends the “auto” mode letting the switch dynamically determine the appropriate level of compression. 
• The effect of increasing the MTU size from 1500 to 4096 was somewhat inconclusive, but an odd jump was noted when both Fast Write and compression were turned "off". Intuitively the larger frames should improve performance, but the suspicion is that the effects of a large rtt on the SCSI exchange may mitigate the gain. This warrants further testing.

In summary, settings are situation dependent, which warrants exercising all the combinations before finalizing an installation. To illustrate the point, the following graphs (figures 6 and 7) depict bandwidth as a function of threads for rtt=35msec for different MTU, Fast Write and compression settings. For MTU=1500, the best write performance came with Fast Write and no compression, while reads were best with Fast Write and compression enabled. Bumping the MTU to 4096 resulted in both the write and read numbers being best with Fast Write and compression disabled. Incidentally, these parameters are changed using the Element Manager, with each switch configured independently; the implication is that unpredictable results may occur if the switches are not configured the same. Overall, the write performance topped out at just slightly over 25 MB/sec while reads approached 20 MB/sec, and for the most part running multiple threads boosted aggregate throughput. These numbers are in contrast to 86 MB/sec writes and 78 MB/sec reads obtained running eight threads with rtt=0, MTU=1500 and both Fast Write and compression turned off.

Figure 6 - Delay=35msec, MTU=1500. (Write and read bandwidth in MB/sec versus one, two, four and eight threads for each combination of Fast Write and compression settings.)

Figure 7 - Delay=35msec, MTU=4096. (Same axes and settings as Figure 6.)

Future Testing

Additional tests to be conducted include:
• Run tests with a broader range of rtt values while changing the configuration of the Nishan units. This would give the full curve of bandwidth as a function of rtt.
• Test the compression "auto" setting in contrast to the "on/off" results.
• Induce deterministic packet loss and congestion, and measure the impact on write and read performance.

3.3.2. Multi-site Testing

The next series of tests involved different combinations of IP hardware and network connections to UMIACS, SDSC and NCSA. Experiments focused mainly on building and exercising native file systems (ext2) with the server/host and the storage at opposite ends of the WAN link. Some preliminary SNFS testing was also accomplished. In all cases, the assessment centered on:
• Gauging the impact of rtt, or latency, on performance in a real-world setting where the network is potentially hostile.
• Comparing the measured maximum network bandwidth, as determined using nuttcp, with file-system-oriented traffic.

3.3.2.1. UMIACS

Last year, UMIACS participated with GSFC in distance testing using iSCSI technology. That effort involved a Linux box at UMIACS routed through a Cisco SN5420 at GSFC to the associated storage assets. This time, for comparison, one of the two loaner Nishan units was moved to UMIACS (figure 8). Nishan-to-Nishan communication was established over the MAX network. IOzone benchmarks were performed building a native ext2 file system on GSFC storage from a UMIACS-resident Linux host.
Figure 8 - GSFC-UMIACS Configuration. (A Linux host at UMIACS attaches through a Nishan switch and the MAX network to the GSFC Nishan switches, the SEN and the SAN Pilot RAID in Buildings A and B.)

Impressions

Moving the Nishan to UMIACS and establishing the connection was relatively simple; network logistics provided the only significant obstacles. Getting the Nishan configuration tools functioning in a new environment posed a minor nuisance. Only certain browser/host combinations will run the Element Manager GUI. Secondly, UMD, except in specific instances, blocks SNMP, which led to establishing a virtual private network (VPN) for remote access to both Element Manager and SANvergence.

Performing IOzone testing with random data yielded the following results (Table 1) for one-, two-, four- and eight-threaded operations. These results are for an MTU size of 1500 and a negligible rtt as registered by the Nishans.

Table 1 - Results (MB/sec)
             FW, Comp            No FW, No Comp
  Threads    Write    Read       Write    Read
  one        12.8     9.5        38.6     14.1
  two        12.9     11.7       47.3     19.8
  four       12.8     11.6       28.9     20.6
  eight      12.8     11.6       59.8     25.8

Given the near-zero rtt, the boxes ran best with both Fast Write and compression disabled. As noticed in other testing involving the Nishans, compression processing effectively halves the bandwidth in applications involving small rtts. The eight-threaded write, 59.8 MB/sec, saturated the network, given that the available bandwidth as measured by nuttcp [12] was 56.2 MB/sec. Reads topped out at 25.8 MB/sec. Single-threaded IOzone tests saw 38.6 MB/sec writes and 14.1 MB/sec reads. As it turns out, the WAN connection at the UMIACS end is not full GE but rather a fractional allocation of a full GE. By comparison to historical data, single-threaded iSCSI operations using lmdd yielded 18 MB/sec writes and 12 MB/sec reads.

Future Testing

Additional tests to be conducted include:
• Increase the network bandwidth between GSFC and UMIACS to a full GE and re-evaluate Nishan performance. Given the almost negligible rtt, a significant performance jump is anticipated.
• Connect storage to the UMIACS Nishan, then test reads and writes originating at GSFC.
• Exercise the UMIACS-to-SDSC connection and compare to the GSFC-to-SDSC results.

3.3.2.2. SDSC

Testing with SDSC (figure 9) leveraged the in-place SDSC Series 4000 switch. The WAN connection used the Abilene backbone, with MAX as the local hopping-off point for GSFC. IOzone benchmarks were performed building a native ext2 file system on SDSC Sun storage from a GSFC-resident Linux host.

Figure 9 - SDSC Configuration. (FC-attached Linux hosts and RAID at GSFC connect through Nishan switches and the SEN to MAX and Abilene, terminating at a Nishan switch and RAID at SDSC.)

Impressions

Set-up was straightforward, with only the expected configuration items to be dealt with: network routing, allocating the appropriate zones, resolving SAN IDs, etc. However, the switches could not be made to operate in jumbo frame (MTU=4096) mode, although the network was theoretically configured for such operation. It was learned through trial and error that manually forcing the MTU setting to 4096 can result in very erratic behavior of the link, including complete lock-up. The next two graphs (figures 10 and 11) illustrate performance as a function of the various Nishan settings for random versus static data.
Figure 10 - Random Data, MTU=1500. (Write and read bandwidth in MB/sec versus one, two, four and eight threads for each combination of Fast Write and compression settings.)

Figure 11 - Static Data, MTU=1500. (Same axes and settings as Figure 10.)

The following data (Table 2) compare the actual results for the GSFC-to-SDSC connection with test data using the NIST simulator set to an equivalent rtt of 70msec. In both cases, Fast Write and compression are turned on. Note the fair agreement in the data despite the difference in MTU sizes; the suspicion is that the rtt impact on the SCSI command interchange dilutes the performance gains of jumbo frames.

Table 2 - Results (MB/sec)
             GSFC => GSFC            GSFC => SDSC
             rtt delay => 70msec     rtt actual => 70msec
             MTU => 4096             MTU => 1500
  Threads    Write    Read           Write    Read
  one        13.1     5.6            11.6     6.0
  two        13.1     11.5           13.1     8.2
  four       13.1     12.5           12.7     8.0

Future Testing

Additional tests to be conducted include:
• Get jumbo frames (MTU=4096) working between GSFC and SDSC, then re-evaluate performance and compare to the delay numbers. Determine if the jump in performance was an anomaly related to the NIST emulator.
• Exercise the link in the opposite direction - server/host at SDSC and storage at GSFC.
• Exercise the SDSC-to-UMIACS connection and compare to the SDSC-to-GSFC results.

3.3.2.3. NCSA

The IP connection with NCSA (figure 12) was accomplished using a pair of LightSand i-8100s. As with SDSC, the WAN connection used the Abilene backbone with MAX as the local hopping-off point for GSFC. IOzone benchmarks were performed building a native ext2 file system on NCSA DataDirect storage from a GSFC-resident Linux host.

Figure 12 - NCSA Configuration. (FC-attached Linux hosts and RAID at GSFC connect through a LightSand gateway and the SEN to MAX and Abilene, terminating at a LightSand gateway and DataDirect RAID at NCSA.)

Impressions

Initial set-up was time consuming because of the learning curve of dealing with the LightSand equipment and establishing the network connection between GSFC and NCSA. The LightSands required that the Brocade 3800 switches be at the 3.1 firmware level. In addition, the command "portcfgislmode <port>,1" also had to be issued to the Brocades so that the switch ports connected to the 8100s would get R_RDY set. An inordinate amount of time was spent trying to determine why the SANman GUI would not execute properly from a remote workstation (off campus with respect to GSFC). As it turns out, NASA blocks external pings from open networks, and the first thing the LightSand GUI requires is a successful ping to make sure the connection is in place. Once properly configured, the DataDirect Networks storage at NCSA was easily configured and accessed. Using the same IOzone script as before, the following results (Table 3) were obtained for native ext2 file transfers.

Table 3 - Results (rtt => 30msec, 1MB block, Linux Host 1; MB/sec)
  Threads    Write    Read
  one        37.0     12.1
  two        37.5     28.9
  four       37.3     35.6
  eight      37.3     36.2

These numbers are consistent with the theoretical maximums predicted by the TimeCalc utility provided with SANman. An interesting, although not perfect, comparison is with the 35msec rtt numbers obtained using the NIST Net network emulator and the Nishan switches; there the best results, with Fast Write and compression turned off, were 26 MB/sec writes and 20 MB/sec reads.
It seems fair to presume, that running the Nishans in the “auto” compression mode may have improved those results. Future Testing Additional tests to be conducted include: • Exercise link in opposite direction – server/host at NCSA and storage at GSFC. • Get raw bandwidth numbers for the GSFC to NCSA link using nuttcp. 4. Operational Users As to what might seem like a sidebar to the major thrust of the evaluation, the search for a relevant application of this technology, a geographically distributed file system, continues. Two GSFC groups, the Scientific Visualization Studio (SVS) and the Advanced Data Grid (ADG) Project, are currently being pursued to provide on-campus operational proof of the various connectivity schemes. The plan is to also involve UMIACS, SDSC and NCSA in relevant application demonstrations. 134 4.1. Scientific Visualization Studio The GSFC SVS has a need for approximately 1 TB of storage to use as an animation "scratch" area. The content/data to be stored will be scientific visua lization animation frames in both HDTV and NTSC resolutions, and MPEG-1 and MPEG-2 movies in various resolutions from web to HDTV. Relatively fast (high bandwidth) access to such volumes is required, including constantly writing frames, various types of processing (read/write) of frames, and streaming frames from this volume to the local SVS workstations for animation preview. A Linux server in the SVS has an FC connection to the SAN Pilot. 4.2. Advanced Data GRID Prototype In conjunction with NASA Ames, the ADG prototype is a new initiative that intends to leverage the availability of Landsat data The mechanism for making the data available is the SAN Pilot connected to a Sun 3800 located on the GSFC campus. 5. Supporting Technologies Other technologies are being evaluated to ease the administrative burden of SANs as well improve the performance of the chosen data transport mechanism. The list includes SAN management software and a new generation of network interface (NIC) cards. Also, the evolution of network attached storage (NAS) is also being monitored. 5.1. SAN Management Software With the emphasis on connecting operational users, part of the testing has focused on SAN management software and tools. The goal is to acquire a tool or suite of tools that enables efficient monitoring of the SAN health and utilization as well as providing for asset allocation and administration. A mechanism is needed that readily discovers SAN components and provides a topology view of the infrastructure. Four such tools have been installed and evaluated: • BrightStor™ SAN Manager by Computer Associates International, Inc. • SANavigator® by SANavigator, Inc. a subsidiary of McData Corporation • SANScreen by Onaro, Inc. • Fabric Manager and WEB TOOLS by Brocade Communications Systems, Inc. The shortcoming of all such products seems to be coverage of all the needed versions of operating systems, and storage and interface devices, something not usually supported. Recognizing the new breed of FC and FC related products, such as Nishan and LightSand boxes, is sporadic as well. No one product seems to do it all. Not tested but briefed was a StorageAuthority™ Suite from AppIQ, Inc. It possesses some very rich capabilities worthy of consideration. In the meantime, SANScreen was purchased and installed. It will be important to observe how the product deals with a heterogeneous, near operational environment with ever evolving security constraints. 135 5.2. 
NIC Evaluation This testing is most relevant to iSCSI connected hosts. The plan is for parametric evaluation of generic NICs versus TCP Off-Load Engine (TOE) NICs and TOE iSCSI NICs. It will be key to measure end-to-end throughput performance and CPU utilization on hosts with different processor speeds. The intent is to include cards from multiple manufacturers such as Intel, Adaptec, and Alacritech. Testing is underway but not yet completed. So far, getting the basic set- up configured and operational is proving to be a challenge. 6. Summary In retrospect, the testing permutations became formidable when the multiple locations, potential rtt, equipment configurations and settings are factored in. As a result, only a subset of possible hardware and software combinations were actually exercised. However, the size of the data sampling does not adversely impact the overall evaluation of the products. Evaluating IP devices has been an educational process punctuated by learning new jargon and redefining the concept of a SAN while dealing with the unavoidable reality of the hardware and software incompatibilities, typical of emerging technology. This class of product is mainly deployed in disaster recovery applications as opposed to file system applications. As a result, empirical data for comparison was not readily available, leaving conversations and paper exercises as the basis for determining the validity of the collected data. A better understanding of theoretical maximums as they relate to SCSI transfers as a function of rtt versus the selected FC-IP protocol (FCIP or iFCP) is needed. The vendor products behaved admirably with one significant, non-performance concern. Security features were found to be lacking from a device management perspective – no secure login, clear text passwords, etc. To circumvent such shortfalls during the testing, network routing was altered and access lists were incorporated to minimize the perceived vulnerabilities. Also, a desirable feature available at the data level for iSCSI is host authentication by the IP interface. The following table (Table 4) presents a qualitative review of the Nishan and LightSand equipment: 136 Table 4 – Findings Summary IP Device General Pros • Perform as advertised. • Operationally fairly intuitive. • Both GUI and CLI management options. • Administrator defined level of SAN merging/isolation. Nishan 3000 • Built in performance graphs. • Good statistical info. • Companion applications that provide data analysis. LightSand i-8100 Cons • Minimal security. • No ssh. • No CLI standard • Redundant, conflicting naming conventions. • Proprietary, same vendor product required at both ends of the WAN connection. • High skill level to configure, etc., multiple talents involved. • Incompatibilities, version issues, etc. reminiscent of the early days of FC. • Passwords in clear text. • IP routes cleared by reboots. • Difficult to save and compare configurations. A sidebar to the qualitative aspects of the testing is that the majority of configuration, benchmarking, etc. was done remotely from third party locations, not at any of the centers. Besides the obvious advantage of permitting geographic flexibility for the testers and vendors, it had the interesting side effect of revealing obstacles to deploying such a methodology for an operational IP based SAN. In place site security procedures and firewalls had to be acknowledged and understood. Blocked ports and disabled functionality had to be navigated. 
Such activity led to a greater understanding of the equipment and what changes would be welcomed in the products. Certainly at one level the objective of the testing was met – to gain experience with data over IP devices. Understanding the requirements being levied against a proposed SAN has always been critical, but the extra layer of configuration encountered installing FC-IP devices makes such planning even more necessary. There is the usual FC zoning at the local SAN level but in addition, bridging disparate SANs requires designating which components – servers, storage, etc. – will be mutually shared by the co-joined SANs. This two-step mechanism, while adding to the rigor, ensures isolation and privacy of the local SAN while allowing the sharing of mutually agreed to assets. Plans fell short in terms of evaluating a geographically distributed file system (SNFS and/or CXFS) encompassing GSFC, UMIACS, SDSC and NCSA, an outcome planned to be rectified in the near future. These file systems have centralized agents that control their overall operation. It will be interesting to track data movement performance (throughput ) as a function of where in the topology the agent is located and the latencies incurred in accessing it. 137 Acknowledgements The author wishes to acknowledge the following individuals for their contributions: Bill Fink, Paul Lang, Wei-Li Liu and Aruna Muppalla at NASA GSFC; Bryan Bannister and Nathaniel Mendoza at SDSC; Chad Kerner at NCSA; and Fritz McCall at UMIACS. Gratitude is also extended to the vendor community for their rich support. References [1] Hoot Thompson, Curt Tilmes, Robert Cavey, Bill Fink, Paul Lang, Ben Kobler; Architectural Considerations and Performance Evaluations Of Shared Storage Area Networks at NASA Goddard Space Flight Center; Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems & Technologies; April 710, 2003. [2] http://www.npaci.edu/DICE/SRB/ [3] J. P. Gary; Research and Development of High End Computer Networks at GSFC, Earth Science Technology Conference, College Park, MD; June 24-26, 2003. [4] http://www.maxgigapop.net/ [5] http://abilene.internet2.edu/ [6] Maximizing Utilization of WAN Links with Nishan Fast Write; Nishan Systems. [7] FAQ on Nishan Systems’ Compression Technology; Nishan Systems. [8] Phil Andrews, Tom Sherwin, Bryan Bannister; A Centralized Data Access Model for Grid Computing; Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems & Technologies; April 7-10, 2003. [9] http://www.bitmover.com/lmbench/ [10] http://www.iozone.org [11] http://snad.ncsl.nist.gov/itg/nistnet/ [12] ftp://ftp.lcp.nrl.navy.mil/pub/nuttcp/beta/nuttcp- v5.1.1.c 138 File System Workload Analysis For Large Scale Scientific Computing Applications Feng Wang, Qin Xin, Bo Hong, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long Storage Systems Research Center University of California, Santa Cruz Santa Cruz, CA 95064 cyclonew, qxin, hongbo, sbrandt, elm, darrell @cs.ucsc.edu Tel +1 831-459-4458 Fax +1 831-459-4829 ✁ Tyce T. McLarty Development Environment Group/Integrated Computing and Communications Lawrence Livermore National Laboratory Livermore, CA 94551 tmclarty@llnl.gov Tel +1 925-424-6975 Fax +1 925-423-8719 ✁ Abstract Parallel scientific applications require high-performance I/O support from underlying file systems. A comprehensive understanding of the expected workload is therefore essential for the design of high-performance parallel file systems. 
We re-examine the workload characteristics in parallel computing environments in the light of recent technology advances and new applications. We analyze application traces from a cluster with hundreds of nodes. On average, each application has only one or two typical request sizes. Large requests from several hundred kilobytes to several megabytes are very common. Although in some applications small requests account for more than 90% of all requests, almost all of the I/O data are transferred by large requests. All of these applications show bursty access patterns. More than 65% of write requests have inter-arrival times within one millisecond in most applications. By running the same benchmark on different file models, we also find that the write throughput of using an individual output file for each node exceeds that of using a shared file for all nodes by a factor of 5. This indicates that current file systems are not well optimized for file sharing. 1. Introduction Parallel scientific applications impose great challenges on not only the computational speeds but also the data-transfer bandwidths and capacities of I/O subsystems. The U.S. Department of Energy Ac- 139 celerated Strategic Computing Initiative (ASCI) projects computers with 100 TeraFLOPS, I/O rates of 50–200 gigabytes/second, and storage system capacities of 0.5–20 PB in 2005. The projected computing and storage requirements are estimated to 400 TeraFLOPS, 80–500 gigabytes/second, and 3–20 PB in 2008 [2]. The observed widening disparity in the performance of I/O devices, processors, and communication links results in a growing imbalance between computational performance and the I/O subsystem performance. To reduce or even eliminate this growing I/O performance bottleneck, the design of high-performance parallel file systems needs to be improved to meet the I/O requirements of parallel scientific applications. The success of file system designs comes from a comprehensive understanding of I/O workloads generated by targeted applications. In the early and middle 1990s, significant research efforts were focused on characterizing parallel I/O workload patterns and providing insights on parallel system designs [1, 4, 7, 14]. The following decade has witnessed significant improvements in computer hardware, including processors, memory, communication links, and I/O devices. At the same time, systems are scaling up to match the increasing demands of computing capability and storage capacity. This advance in technologies also enables new scientific applications. Together these changes motivate us to re-examine the characteristics of parallel I/O workloads a decade later. In our research, we traced the system I/O activities under three typical parallel scientific applications: the benchmark ior2 [6], a physics simulation, f1, running on 343 nodes, and another physics simulation, m1, running on 1620 nodes. We study both static file system and dynamic I/O workload characteristics. We use the results to address the following questions: ✂ ✂ ✂ ✂ ✂ ✂ ✂ What were the file sizes? How old were they? How many files were opened, read, and written? What were their sizes? How frequent were typical file system operations? How often did nodes send I/O requests? What were the request sizes? What forms of locality were there? How might caching be useful? Did nodes share data often? What were the file sharing patterns? How well did nodes utilize the I/O bandwidth? 
The remainder of this paper is organized as follows: a brief overview of the related work is given in Section 2. We then describe the tracing methodology in Section 3 and present our results in Section 4. Finally, we conclude our paper in Section 5. 2. Related Work The I/O subsystem has been a system performance bottleneck for a long time. In parallel scientific computing environments, the high I/O demands make the I/O bottleneck problem even more severe. Kotz and Jain [3] surveyed impacts of I/O bottlenecks in major areas of parallel and distributed systems and pointed out that I/O subsystem performance should be considered at all levels of system design. Previous research showed that the I/O behavior of scientific applications is regular and predictable [7, 9]. Users have also made attempts to adjust access patterns to improve performance of parallel file systems [13]. 140 There are several studies on file system workload characterizations in scientific environments [1, 4, 7, 8, 11]. They have shown that file access patterns share common properties such as large file sizes, sequential accesses, bursty program accesses, and strong file sharing among processes within a job. A more recent study [14] showed that applications use a combination of both sequential and interleaved access patterns and all I/O requests are channeled through a single node when applications require concurrent accesses; we observe similar phenomena in one of the applications under our examinations. Pasquale and Polyzos [9] found that the data transfer rates ranges from 4.66 to 131 megabytes/sec in fifty long-running large-scale scientific applications. They also demonstrated that the the I/O request burstiness is periodic and regular [10]. Baylor and Wu [1] showed that the I/O request rate is on the order of hundreds of requests per second; this is similar to our results. They also found that a large majority of requests are on the order of kilobytes and a few requests are on the order of megabytes; our results differ in this regard. Previous research has mainly investigated scientific workloads in the 1990’s, although technology has evolved very quickly since then. We observed changes in large-scale scientific workloads, in our study, and provided guidelines for future file system designs based on a thorough understanding of current requirements of large-scale scientific computing. 3. Tracing Methodology All the trace data in this study was collected from a large Linux cluster with more than 800 dual processor nodes at the Lawrence Livermore National Laboratory (LLNL). A development version of Lustre Lite [12] is employed as the parallel file system and the Linux kernel in use is a variant of 2.4.18. 3.1. Data Collection Tracing I/O activities in large scale distributed file systems is challenging. One of the most critical issues is minimizing the disturbance of tracing on the system behaviors. A commonly-used method is to develop a trace module that intercepts specific I/O system calls—a dedicated node in the cluster collects all trace data and stores them to local disks. However, due to time limits, we chose a simpler approach: we employed the strace utility with parameters tuned for tracing file-related system calls. The trace data are written to local files. We rely on the local host file systems to buffer trace data. This approach has two shortcomings: first, strace intercepts all I/O-related activities, including parallel file system, local file system, and standard input/output activities. 
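As an illustration of this kind of post-processing, the sketch below filters an strace output file down to file-related calls. It assumes strace was run with wall-clock timestamps (the -tt option); the regular expression and call whitelist are our own simplification rather than the tooling actually used in the study.

```python
import re

# One strace line with -tt timestamps looks roughly like:
#   12:03:07.123456 write(17, "...", 65536) = 65536
LINE = re.compile(r"^(\d\d:\d\d:\d\d\.\d+)\s+(\w+)\((\d+)?.*\)\s*=\s*(-?\d+)")

FILE_CALLS = {"open", "close", "read", "write", "lseek", "stat", "fstat", "unlink"}

def parse_strace(path):
    """Yield (timestamp, syscall, fd, return_value) for file-related calls only."""
    with open(path) as f:
        for line in f:
            m = LINE.match(line)
            if not m:
                continue                      # signals, unfinished calls, etc.
            ts, call, fd, ret = m.groups()
            if call in FILE_CALLS:
                yield ts, call, int(fd) if fd else None, int(ret)
```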
Capturing every call in this way results in a relatively large data footprint. Second, the strace utility relies on the local file system to buffer traced data. This buffering scheme works poorly when the host file system is under heavy I/O workloads. In such a scenario, the host system performance might be affected by the frequent I/Os of the traced data. However, the strace utility greatly simplifies the tedious data collection process to a simple shell script. More importantly, the shortcomings mentioned above were not significant in our trace collection because of the large I/O requests and the relatively short tracing periods. As we discuss in Section 4, I/O requests in such a large system are usually around several hundred kilobytes to several megabytes. Even in the most bursty I/O period, the total number of I/Os per node is still around tens of requests per second. Up to one hundred trace records will be generated on each node per second on average. Buffering and storing these data has only a slight impact on the system performance. Moreover, instead of tracing the whole cluster, we only study several typical scientific applications. Those applications are usually composed of two stages: the computation phase and the I/O phase. The typical I/O stage ranges from several minutes to several hours. During this period, each node usually generates several hundred kilobytes of trace data, which can be easily buffered in memory.

3.2. Applications and Traces
All of the trace data were collected from the ASCI Linux Cluster at Lawrence Livermore National Laboratory. This machine is currently in limited-access mode for science runs and file system testing. It has 960 dual-processor nodes connected through a Quadrics Switch. Two of the nodes are dedicated metadata servers and another 32 nodes are used as the gateways for accessing a global parallel file system. The detailed configuration of this machine is provided in Table 1 [5].

Table 1. The ASCI Linux Cluster Parameters
Total Nodes (IBM x355): 960
Compute Nodes: 924
Login Nodes: 2
Gateway Nodes: 32
Metadata Server Nodes: 2
Processors per Node (Pentium 4 Prestonia): 2
Total Number of Processors: 1,920
Processor Speed: 2.4 GHz
Theoretical Peak System Performance: 9.2 TFlops
Memory per Node: 4 GB
Total Memory: 3.8 TB
Total Local Disk Space: 115 TB
Node Interconnection: Quadrics Switch

We traced three typical parallel scientific applications during July 2003. The total size of the traces is more than 800 megabytes. The first application is a parallel file system benchmark, ior2 [6], developed by LLNL. It is used for benchmarking parallel file systems using POSIX, MPI-IO, or HDF5 interfaces. Basically, it writes a large amount of data to one or more files and then reads them back to verify the correctness of the data. The data set is large enough to minimize the operating system caching effect. Based on different file usages, we collected three different benchmark traces, named ior2-fileproc, ior2-shared, and ior2-stride, respectively. All of them ran on a 512-node cluster. ior2-fileproc assigns an individual output file for each node, while ior2-shared and ior2-stride use a shared file for all the nodes. The difference between the last two traces is that ior2-shared allocates a contiguous region in the shared file for each node, while ior2-stride strides the blocks from different nodes into the shared file. The second application is a physics simulation run on 343 processes.
In this application, a single node gathers a large amount of data in small pieces from the other nodes. A small set of nodes then writes these data to a shared file. Reads are executed from a single file independently by each node. This application has two I/O-intensive phases: the restart phase, in which read is dominant, and the result-dump phase, in which write is dominant. The corresponding traces are named f1-restart and f1-write, respectively. The last application is another physics simulation which runs on 1620 nodes. This application uses an individual output file for each node. Like the previous application, it also has a restart phase and a result-dump phase. The corresponding traces are referred to as m1-restart and m1-write, respectively.

3.3. Analysis
The raw trace files required some processing before they could be easily analyzed. Some unrelated system calls and signals were filtered out. Since each node maintained its own trace records, the raw trace for each application is composed of hundreds of individual files. We merged those individual files in chronological order. Thanks to the Quadrics switch, which has a common clock, the traced time in those individual trace files was globally synchronized. Our analysis tasks, such as computing request inter-arrival times, were greatly simplified by sorting all requests into a single, chronologically ordered trace file. A good understanding of file metadata operation characteristics is important; however, our traces are not large enough to capture general metadata access patterns. Therefore, we focus more on file data I/O characterization in the following section.

4. Workload Characteristics
We present the characteristics of the workloads, including file distributions and I/O request properties. We study the distributions of file sizes and lifetimes and show the uniqueness of large-scale scientific workloads. We focus on the three typical applications described in Section 3.2 and examine the characteristics of I/O requests, such as the size and number of read and write requests, the burstiness of I/O, and the distribution of I/O requests across nodes.

4.1. File Distributions
We collected file distributions from the thirty-two file servers that were in use for the ASCI Linux cluster during the science-runs phase. Each file server has a storage capacity of 1.4 terabytes. The file servers were dedicated to a small number of large-scale scientific applications, which provides a good model of data storage patterns. On average, the number of files on each file server was 350,250, and each server stored 1.04 terabytes of data, more than 70% of its capacity. On most of the file servers, the number and capacity of files are similar, except for five file servers. Table 2 displays summary statistics of the number and capacity of files on these servers, including mean, standard deviation (std. dev.), median, minimum (min) and maximum (max).

Table 2. File Numbers and Capacity of the 32 File Servers
                     Number      Capacity
mean                 305,200     1044.33 GB
standard deviation    75,760      139.66 GB
median               305,680     1072.88 GB
minimum               67,276      557.39 GB
maximum              605,230     1207.37 GB

Figure 1. Distribution of Files: (a) by file sizes (percentage of all files and of total capacity per range of file sizes); (b) by file ages (per range of file ages).

Figure 1(a) presents file size distributions by number and file capacity. The ranges of file sizes are sampled from 0–1 byte to 1–2 gigabytes. Some of the partitions were merged due to space limitations. We observed that over 80% of the files are between 512 kilobytes and 16 megabytes in size, and these files accounted for over 80% of the total capacity. Among the various file size ranges, the most noticeable one is from 2 megabytes to 8 megabytes: about 61.7% of all files and 60.5% of all bytes are in this range. We divided file lifetimes into 9 categories: from 0–1 day to 52 weeks and older. As illustrated in Figure 1(b), 60% of the files and 50% of the bytes lived from 2 weeks to 8 weeks, while 6.6% of the files and 7.3% of the bytes lived less than one day. The lifetime of the traced system is about 1 year, so no files lived longer than 52 weeks.
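Distributions such as those in Table 2 and Figure 1 can be collected with a straightforward scan of each file server. The sketch below gathers the size view; the age view follows the same pattern using each file's modification time. It is our own illustration, and the bucket boundaries only approximate the bins of Figure 1.

```python
import os
import bisect

# Bucket upper bounds in bytes (illustrative, not the exact bins of Figure 1)
SIZE_BOUNDS = [1, 1 << 10, 512 << 10, 1 << 20, 2 << 20, 8 << 20,
               16 << 20, 32 << 20, 256 << 20, 512 << 20, 2 << 30]

def file_distributions(root):
    """Return (fraction of files, fraction of bytes) per size bucket under root."""
    counts = [0] * (len(SIZE_BOUNDS) + 1)
    volume = [0] * (len(SIZE_BOUNDS) + 1)
    for dirpath, _, names in os.walk(root):
        for name in names:
            try:
                size = os.path.getsize(os.path.join(dirpath, name))
            except OSError:
                continue                      # file vanished or unreadable
            b = bisect.bisect_left(SIZE_BOUNDS, size)
            counts[b] += 1
            volume[b] += size
    total_n = sum(counts) or 1
    total_v = sum(volume) or 1
    return ([c / total_n for c in counts], [v / total_v for v in volume])
```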
4.2. I/O Request Sizes
Figure 2 shows the cumulative distribution functions of request sizes and request numbers. Since all three ior2 benchmarks have identical request size distributions, we only show one of them. As shown in Figure 2(a), ior2 has only a single request size of around 64 kilobytes. Figure 2(b) shows the write request size distribution of the result-dump stage in the physics simulation, f1. Almost all the write requests are smaller than 16 bytes, while almost all the I/O data are transferred in requests with sizes larger than one megabyte. This turns out to be a common I/O pattern of scientific applications: a master node collects small pieces of data from all computing nodes and writes them to data files, which results in a huge number of small writes. Other nodes then read and write these data files in very large chunks.

Figure 2. Cumulative Distribution Functions (CDF) of the Size and the Number of I/O Requests (X axis log-scale); panels: (a) ior2-fileperproc, (b) f1-write, (c) f1-restart, (d) m1-write, (e) m1-restart. The read_num and write_num curves indicate the fraction of all requests that are smaller than the size given on the X axis. The read_size and write_size curves indicate the fraction of all transferred data that lives in requests with sizes smaller than the value given on the X axis.

There are so few read requests in the result-dump stage and write requests in the restart stage that we ignore the read request curves in Figure 2(b) and the write request curves in Figure 2(c). Figure 2(d) and Figure 2(e) show the same write request distribution in the restart and result-dump stages of the physics simulation, m1. The two spikes in the write_num curves indicate two major write sizes: 64 kilobytes and 1.75 megabytes, respectively.
Each of them accounts for 50% of all write requests. More than 95% of the data are transfered by large requests, which is also shown in Figures 2(d) and 2(e). Reads in m1 are dominated by small requests less than 1 kilobytes. However, a small faction (less than 3%) of 8 kilobyte requests still accounts for 30% of all read data transfer. This is similar to the read distribution in Figure 2(e): only 5% of the read requests contribute to 90% of all data read. 145 4.3. I/O Accesses Characteristics Figure 3–5 show I/O accesses characteristics over time. The resolution for these figures is 1 second except figure 4(a), which uses a resolution of 50 seconds. Figure 3 shows that the request number distribution and the request size distribution are almost identical in ior2 due to the fixed size requests used in those benchmarks. The ior2-fileproc benchmark, using the one-file-per-node model, presents the best write performance. Up to 150,000 write requests per second, totaling 9 gigabytes per second, are generated by the 512 nodes. However, the ior2-shared and ior2-stride benchmarks can only achieve 25,000 write requests per second, totaling 2 gigabytes per second. These two benchmarks use the shared-region and the shared-stride file model, respectively. We believe that the performance degradation is caused by the underlying file consistency protocol. This result is somewhat counterintuitive. The shared-region file model appears to be similar to the one-file-pernode model because the contiguous regions in the former can be analogous to the separate files in the latter. Therefore, their performance should be comparable as well. The severe performance degradation implies that the shared-file model is not optimized for this scenario. After a write, each node reads back another node’s data as soon as it is available. The gaps between the write and read curves in each sub-figure reflect the actual I/O times. Obviously, the ior2-fileproc benchmark demonstrates much better performance: only 10 seconds are used in this model, while more than 20 seconds are needed to dump the same amount of data when using the shared file model. Since reads must be synchronous, we can easily figure out the file system read bandwidth from the read size curve. The ior2-fileproc and ior2-shared benchmarks have comparable read performance. However, the ior2-stride has the worst read performance, which is only 100 megabytes per second for 512 nodes. This result is not surprising: the stride data layout in shared files limits the chances of large sequential reads. Figure 4 shows the I/O access pattern of the application f1. As we mentioned before, f1-write has very few reads and f1-restart has very few writes. Therefore, we can ignore those requests in the corresponding figures. In Figure 4(a), we chose a resolution of 50 seconds because it becomes unreadable if we use finer time resolutions. The spike of the write-num curve is caused by the activities of the master node to collect small pieces of data from other computing nodes. At its peak time, nearly 1 million file system requests are issued per second. However, due to the very small request size (8 to 16 bytes), this intensive write phase contributes negligable amounts of data to the overall data size. In the rest of the application, large write requests from 48 nodes dominate the I/O activities. Requests are issued in a very bursty manner. Figure 4(b) zooms in a small region of Figure 4(a) by 1 second resolution. 
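Time series like those in Figures 3–5 are essentially per-interval request counts and byte totals. The sketch below bins the hypothetical normalized records introduced earlier in this way; it is our own illustration, and the bin width is a parameter (one second in most of the figures, 50 seconds in Figure 4(a)).

```python
from collections import defaultdict

def io_over_time(records, bin_seconds=1.0):
    """Bin requests into fixed-width time buckets, as in Figures 3-5.

    Returns {bucket_index: (num_reads, num_writes, read_bytes, write_bytes)}.
    Assumes a non-empty list of records.
    """
    bins = defaultdict(lambda: [0, 0, 0, 0])
    t0 = min(r.time for r in records)
    for r in records:
        if r.op not in ("read", "write"):
            continue
        b = int((r.time - t0) // bin_seconds)
        if r.op == "read":
            bins[b][0] += 1
            bins[b][2] += r.size
        else:
            bins[b][1] += 1
            bins[b][3] += r.size
    return {b: tuple(v) for b, v in sorted(bins.items())}
```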
Figure 4(b) shows that sharp activity spikes are separated by long idle periods. At the peak time, up to 120 megabytes per second of data are generated by 48 nodes. In the restart phase of f1, read requests become dominant. However, both the number and the data size of read requests are small compared to those in the write phase. Figure 5 presents the I/O access pattern of the physics application m1. It demonstrates very good read performance: nearly 28 gigabytes per second of bandwidth can be achieved by 1620 nodes, thanks to the large read size (1.6 megabytes – 16 megabytes). Like f1, its write activities are also bursty. We observed that the write curves have similar shapes in Figure 5. They all begin with a sharp spike followed by several less intensive spikes. One possible explanation is that the file system buffer cache absorbs the incoming write requests at the beginning of the writes. However, as soon as the buffer is filled up, the I/O rate drops to what can be served by the persistent storage.

Figure 3. I/O Requests over Time for ior2 Benchmarks; panels: (a) ior2-fileperproc number, (b) ior2-fileperproc size, (c) ior2-shared number, (d) ior2-shared size, (e) ior2-stride number, (f) ior2-stride size (number and data size of I/O operations versus snapshot time in seconds).
Figure 4. I/O Requests over Time for f1 Application; panels: (a) time-f1-write, (b) time-f1-write-short, (c) time-f1-restart.
Figure 5. I/O Requests over Time for m1 Application; panels: (a) m1-restart-num, (b) m1-restart-size, (c) m1-write-num, (d) m1-write-size.

4.4. I/O Burstiness
To study I/O burstiness, we measure I/O request inter-arrival times. Figure 6 shows the cumulative distribution functions (CDF) of I/O request inter-arrival times. Note that the x-axis is in the logarithmic scale. Write activities are very bursty in the ior2 benchmarks and the f1 application: 65–100% of write requests have inter-arrival times within 1 millisecond. In ior2 and f1, most of the write activities are due to memory dumps, and I/O nodes can issue write requests quickly. However, write activities on m1 are less intensive than those on ior2 and f1. On the other hand, read requests are generally less intensive than write requests because reads are synchronous.
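The CDFs of Figure 6 follow directly from the chronologically merged trace. A minimal version of the computation, again over the hypothetical normalized records, is sketched below, with thresholds matching the tick marks of Figure 6.

```python
def interarrival_cdf(records, op="write", points_ms=(1, 4, 16, 64, 512)):
    """Fraction of op requests whose inter-arrival time is within each threshold.

    Mirrors the CDFs of Figure 6; thresholds are in milliseconds.
    """
    times = sorted(r.time for r in records if r.op == op)
    gaps_ms = [(b - a) * 1000.0 for a, b in zip(times, times[1:])]
    if not gaps_ms:
        return {t: 0.0 for t in points_ms}
    return {t: sum(g <= t for g in gaps_ms) / len(gaps_ms) for t in points_ms}
```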
Figure 6(c), in particular, indicates that ior2 with shared-strided files suffers low read performance, as described in Section 4.3. In this scenario, data are interleaved in the shared file and read accesses are not sequential.

Figure 6. Cumulative Distribution Functions (CDF) of the Inter-arrival Time of I/O Requests (X axis log-scale); panels: (a) inter-ior2-fileperproc, (b) inter-ior2-shared, (c) inter-ior2-strided, (d) inter-f1-write, (e) inter-f1-restart, (f) inter-m1-write, (g) inter-m1-restart (fraction of read and write requests versus time in milliseconds).

4.5. I/O Nodes
In this section, we study the distributions of I/O request sizes and numbers over nodes, as shown in Figure 7. For the ior2 benchmarks, reads and writes are distributed evenly among nodes, as shown in Figures 7(a) and 7(b), because each node executes the same sequence of operations in these benchmarks. In the physics application f1, a small set of nodes writes the gathered simulation data to a shared file. Therefore, only a few nodes have significant I/O activity in the write phase, and most of the transferred data come from large write requests (14% of the write requests), as shown in Figures 7(c) and 7(d). There is little read activity in the write phase. However, read requests are evenly distributed among nodes in the restart phase and their sizes are around 1 megabyte, as shown in Figures 7(e) and 7(f). There is little write activity in the restart phase. In the restart and write phases of the physics application m1, I/O activity is well balanced among nodes, as shown in Figures 7(g)–7(j). We also observe significant write activity in the restart phase.

Figure 7. Cumulative Distribution Functions (CDF) of the Number and Size of I/O Requests over Nodes; panels: (a) node-ior2 (number), (b) node-ior2 (size), (c) node-f1-write-num, (d) node-f1-write-size, (e) node-f1-restart-num, (f) node-f1-restart-size, (g) node-m1-write-num, (h) node-m1-write-size, (i) node-m1-restart-num, (j) node-m1-restart-size.

4.6. File Opens
In this section, we study the file open patterns of the applications. We use the term data files to refer to those files that actually store results dumped from the applications. In all applications, files tend to be opened as read/write or read-only. We only observe significant numbers of write-only files in the physics application m1, as shown in Table 3. However, the data files are opened either read-only or write-only, except for the benchmark ior2. The open operations on the data files account for only a small portion of the overall files opened.

Table 3. File Open Statistics
                 Overall Number of File Opens       Number of Data File Opens
Application      Read/Write     Read      Write     Read/Write     Read      Write
ior2                 6,656      5,121         0          1,024         0          0
f1-write             3,871      6,870       718             98        10         34
f1-restart           3,773      6,179         0              0       343          0
m1-restart          17,824     22,681    12,940              0     1,620     12,960
m1-write            17,824     21,061    12,960              0         0     12,960

Given the fact that the data file operations dominate the overall I/O, the small number of data file opens implies a longer open time and more I/O operations during each open. As listed in Table 4, the open duration of data files ranges from several seconds to several hundred seconds, which is typically 2 to 20 times longer than the overall file open durations. The average number of operations and the amount of data transferred per open are also much larger for data files than for the overall set of files. For example, up to 400 MB of data are transferred during each data file open in the physics application f1-write.

Table 4. Operations During File Open
                   Avg. open time          Avg. IOs per open        Avg. IO size per open
Application        Overall    Data File    Overall    Data File     Overall    Data File
ior2-fileproc      0.4 sec    4.5 sec       44.4       512.0         2.8 MB     32.8 MB
ior2-shared        0.7 sec    5.2 sec       44.4       512.0         2.8 MB     32.8 MB
ior2-stride        7.6 sec    26.57 sec     44.4       512.0         2.8 MB     32.8 MB
f1-write           20.2 sec   504.9 sec     14.8       142,161       2.4 MB     3993.5 MB
f1-restart         0.02 sec   0.1 sec        0.5       1             1 MB       1 MB
m1-restart         1.2 sec    3.9 sec        4.2       15.3          3.7 MB     8.5 MB
m1-write           1.2 sec    2.4 sec        4.3       17            3.1 MB     6.5 MB

5. Conclusion
In this study, we analyze application traces from a cluster with hundreds of processing nodes. On average, each application has only one or two typical request sizes. Large requests from several hundred kilobytes to several megabytes are very common.
Although in some applications, small requests account for more than 90% of all requests, almost all of the I/O data are transferred by large requests. All of these applications show bursty access patterns. More than 65% of write requests have inter-arrival times within one millisecond in most applications. By running the same benchmark on different file models, we also find that the write throughput of using an individual output file for each node exceeds that of using a shared file for all nodes by a factor of 5. This indicates that current file systems are not well optimized for file sharing. In all those applications, almost all I/Os are performed on a small set of files containing the intermediate or final computation results. Such files tend to be opened for a relatively long time, from several seconds to several hundred seconds. And a large amount of data are transferred during each open. 151 Acknowledgments Feng Wang, Qin Xin, Scott Brandt, Ethan Miller, Darrell Long were supported in part by Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Sandia National Laboratory under contract B520714. Bo Hong was supported in part by the National Science Foundation under grant number CCR-073509. Tyce McLarty’s effort was under the auspices of the U.S. Department of Energy by the University of California, Lawrence Livermore National Laboratory under Contract No. W-7405-Eng-48. This document was reviewed and released as unclassified with unlimited distribution as LLNL-UCRL-CONF-201895. We are also grateful to our sponsors: National Science Foundation, USENIX Association, Hewlett Packard Laboratories, IBM Research, Intel Corporation, Microsoft Research, ONStor, Overland Storage, and Veritas. References [1] S. J. Baylor and C. E. Wu. Parallel I/O workload characteristics using Vesta. In Proceedings of the IPPS ’95 Workshop on Input/Output in Parallel and Distributed Systems (IOPADS ’95), pages 16–29, Apr. 1995. [2] DOE National Nuclear Security Administration and the DOE National Security Agency. Proposed statement of work: SGS file system, Apr. 2001. [3] D. Kotz and R. Jain. I/O in parallel and distributed systems. In A. Kent and J. G. Williams, editors, Encyclopedia of Computer Science and Technology, volume 40, pages 141–154. Marcel Dekker, Inc., 1999. Supplement 25. [4] D. F. Kotz and N. Nieuwejaar. File-system workload on a scientific multiprocessor. IEEE Parallel and Distributed Technology, 3(1):51–60, 1995. [5] Lawrence Livermore National Laboratory. ASCI linux cluster. http://www.llnl.gov/linux/alc/, 2003. [6] Lawrence Livermore National Laboratory. IOR software. http://www.llnl.gov/icc/lc/siop/downloads/download.html, 2003. [7] E. L. Miller and R. H. Katz. Input/output behavior of supercomputing applications. In Proceedings of Supercomputing ’91, pages 567–576, Nov. 1991. [8] A. L. Narasimha Reddy and P. Banerjee. A study of I/O behavior of perfect benchmarks on a multiprocessor. In Proceedings of the 17th International Symposium on Computer Architecture, pages 312–321. IEEE, 1990. [9] B. K. Pasquale and G. C. Polyzos. A static analysis of I/O characteristics of scientific applications in a production workload. In Proceedings of Supercomputing ’93, pages 388–397, Portland, OR, 1993. IEEE. [10] B. K. Pasquale and G. C. Polyzos. Dynamic I/O characterization of I/O-intensive scientific applications. In Proceedings of Supercomputing ’94, pages 660–669. IEEE, 1994. [11] A. Purakayastha, C. S. Ellis, D. Kotz, N. Nieuwejaar, and M. Best. 
Characterizing parallel file-access patterns on a large-scale multiprocessor. In Proceedings of the 9th International Parallel Processing Symposium (IPPS ’95), pages 165–172. IEEE Computer Society Press, 1995. [12] P. Schwan. Lustre: Building a file system for 1000-node clusters. In Proceedings of the 2003 Linux Symposium, July 2003. [13] E. Smirni, R. A. Aydt, A. A. Chien, and D. A. Reed. I/O requirements of scientific applications: An evolutionary view. In Proceedings of the 5th IEEE International Symposium on High Performance Distributed Computing (HPDC), pages 49–59. IEEE, 1996. [14] E. Smirni and D. Reed. Lessons from characterizing the input/output behavior of parallel scientific applications. Performance Evaluation: An International Journal, 33(1):27–44, June 1998. 152 V:Drive - Costs and Benefits of an Out-of-Band Storage Virtualization System  André Brinkmann, Michael Heidebuer, Friedhelm Meyer auf der Heide, Ulrich Rückert, Kay Salzwedel, and Mario Vodisek Paderborn University Abstract SAN, the formerly fixed connections between storage and servers are broken-up and both are attached to the highspeed dedicated storage network. The introduction of a storage area network can significantly improve the reliability, availability, manageability, and performance of servers and storage systems. Nevertheless, it has been shown that the potential of a SAN can only be fully exploited with the assistance of storage management software, and here particulary with the help of a virtualization system. Storage virtualization is often seen as the key technology in the area of storage management. But what actually is storage virtualization? A good definition has been given by the Storage Networking Industry Association SNIA [8]: ”[Storage virtualization is] an abstraction of storage that separates the host view [from the] storage system implementation.” This abstraction includes the physical location of a data block as well as the path from the host to the storage subsystem through the SAN. Therefore, it is not necessary that the administrator of a SAN is aware of the distribution of data elements among the connected storage systems. Generally, the administrator only creates a virtual volume and assigns it to a pool of physical volumes, where each physical volume can be of different size. Then, a file system or a database can work upon this virtual volume and the virtualization software provides a consistent allocation of data elements on the storage systems. It is even possible that a large number of virtual volumes share a common storage pool. The use of a virtualization environment has many advantages compared to the traditional approach of assigning an address space to a fixed partition. The most obvious one is that a virtual disk can become much larger than the size of a single disk or even than a single RAID-system [7]. When using virtualization software, the size of a virtual disk is only limited by the restrictions inherent to the operating system and the total amount of available disk capacity. Another important feature of virtualization software is a much better utilization of disk capacity. It has been shown The advances in network technology and the growth of the Internet together with upcoming new applications like peer-to-peer (P2P) networks have led to an exponential growth of the stored data volume. 
The key to manage this data explosion seems to be the consolidation of storage systems inside storage area networks (SANs) and the use of a storage virtualization solution that is able to abstract from the underlying physical storage system. In this paper we present the first measurements on an out-of-band storage virtualization system and investigate its performance and scalability compared to a plain SAN. We show in general that a carefully designed out-of-band solution has only a very minor impact on the CPU usage in the connected servers and that the metadata management can be efficiently handled. Furthermore we show that the use of an adaptive data placement scheme in our virtualization solution V:Drive can significantly enhance the throughput of the storage systems, especially in environments with random access schemes. 1. Introduction The advances in networking technology and the growth of the Internet have enabled and accelerated the emergence of new storage consuming applications like peer-to-peer (P2P) networking, video-on-demand, and data warehousing. The resulting exponential growth of the stored data volume requires a new storage architecture, while the management of the traditional, distributed direct attached storage (DAS) architecture has shown to be intractable from a business perspective. The first step towards this new storage architecture is the consolidation of the servers and storage devices inside a storage area network (SAN). In a  Partially supported by the DFG Transferbereich 40 and the Future and Emerging Technologies programme of the EU under contract number IST-1999-14186 (ALCOM-FT). 153 that in the traditional storage model only 50% of the available disk space is used. The disk utilization can be increased up to 80% through the central and more flexible administration of virtualization software. Thus, the required storage capacity and, with it, the hardware costs of a storage area network can be reduced significantly. Furthermore, virtualization software offers new degrees of flexibility. Storage systems can be added to or removed from storage pools without downtime, thus enabling a fast adaptation to new requirements. These storage systems do not have to be from a single vendor, so that the traditional vendor-locking of customers can be avoided. Virtualization software can be implemented as out-ofband virtualization or in-band virtualization, inside the storage subsystems, or as logical volume manager (LVM) inside the hosts. In an out-of-band virtualization system, the virtualization is done inside the kernel of the hosts and all participating hosts are coordinated by one or more additional SAN appliance. In this paper we will focus on the analysis of our out-of-band solution V:Drive. Chapter 2 of this paper introduces the design of V:Drive. In chapter 3 we present the first measurements on an out-ofband storage virtualization system and investigate its performance and scalability compared to a plain SAN. We show that a carefully designed out-of-band solution has only a very minor impact on the CPU usage in the connected hosts and that the metadata management can be efficiently implemented. Furthermore we give evidence that the use of an adaptive data placement scheme can significantly enhance the throughput of storage systems, especially in environments with random access schemes. storage pool and can change over time. 
In general, smaller extents can guarantee better load balancing, while bigger extents result in a smaller management overhead and fewer disk head movements in the case of sequential accesses. The extents are distributed among the storage devices according to the Share strategy, which is able to guarantee an almost optimal distribution of the data blocks across all participating disks in a storage pool (see Section 2.1).

2.1. The Share-Strategy
Any virtualization strategy depends on the underlying data distribution strategy. Such a distribution is challenging if the system is allowed to contain heterogeneous storage components. The main task of a distribution strategy is an even distribution of data blocks and an even distribution of requests among the storage devices. Therefore, it has a strong impact on the scalability and the performance of the SAN. It can be shown that a static data placement scheme is generally not able to fulfill the given requirements. We have developed a new adaptive distribution scheme that has been implemented in V:Drive, called the Share strategy [2]. In this paper we will present Share without data replication. Of course, it is possible to support replication inside Share, e.g. by a scheme proposed in [4]. For other static and dynamic placement schemes, see [6, 3, 4, 5]. Share works in two phases. In the first phase, the algorithm reduces the problem of mapping extents to heterogeneous disks to a number of homogeneous ones. The result is a number of volumes which are equally likely to store the requested extent. In the second phase, we use any distribution strategy that is able to map extents to equal-sized disks (see e.g. [6]). The reduction phase is based on two hash functions h : {1, …, M} → [0, 1) and g : {1, …, N} → [0, 1), where M is the maximal number of extents in the system and N is the maximal number of disks that are allowed to participate, respectively.

2. V:Drive Design
In this chapter we will describe the design of our out-of-band virtualization solution V:Drive. From the architectural perspective, V:Drive consists of a number of cooperating components: one or more SAN appliances which are responsible for the metadata management and the coordination of the hosts (see section 2.2), the virtualization engine inside the kernel of the hosts (see section 2.3), and a graphical user interface (GUI). From a logical point of view, V:Drive offers the ability to cluster the connected storage devices into storage pools that can be combined according to their age, speed, or protection against failures. Each storage pool has its own storage management policy describing individual aspects like logical and physical block size or redundancy constraints. A large number of virtual volumes can share the capacity of a single storage pool. The capacity of each disk in a storage pool is partitioned into minimum-sized units of contiguous data blocks, so-called extents. The extent size need not be constant inside a

Figure 1. Hashing scheme in Share: the sub-intervals I1, …, I4 start at the hashed points g(1), …, g(4) of the [0, 1) interval, and an extent b is hashed to the point h(b).

The reduction phase works as follows: initially, or after every change in the system configuration, we map the starting points of sub-intervals of a certain length into a [0,1) interval using the hash function g. The length of these sub-intervals corresponds to the capacity c_i of disk i. To ensure that the whole interval is covered by at least one sub-interval, we need to stretch each of the sub-intervals by a factor s. In other words, the sub-interval I_i starts at g(i) and ends at (g(i) + s · c_i) mod 1.
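To make the two-phase mapping concrete, the following is a small, self-contained sketch of a Share-style lookup. It is our own illustration, not the V:Drive code: phase one collects every disk whose stretched sub-interval covers the extent's hash value (the extent hash h is described in the next sentences), and phase two simply picks one candidate with a further uniform hash; replication and extent relocation are ignored.

```python
import hashlib

def _h(key, salt):
    """Hash an arbitrary key to a float in [0, 1)."""
    digest = hashlib.sha1(f"{salt}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def share_lookup(extent_id, disks, stretch=3.0):
    """Map an extent to one disk following the two-phase Share idea.

    disks: dict disk_id -> relative capacity (capacities summing to 1).
    With a sufficiently large stretch factor, every point of [0, 1) is
    covered, so the candidate list is never empty.
    """
    x = _h(extent_id, "extent")                  # h(b) in [0, 1)
    candidates = []
    for disk_id, capacity in disks.items():
        start = _h(disk_id, "disk")              # g(i)
        length = stretch * capacity              # stretched sub-interval length
        # Does x fall into [start, start + length) on the unit circle?
        if (x - start) % 1.0 < length:
            candidates.append(disk_id)
    # Phase 2: any uniform strategy over the candidates; a hash keeps it stable.
    return min(candidates, key=lambda d: _h((extent_id, d), "choice"))
```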
The extents are hashed into the same interval using h where the quality of h ensures an even distribution of all extents over the whole interval. Now, an extent can be accessed by calculating its hash value and then deriving all sub-intervals that value falls into. Any efficient uniform strategy can be applied to get the correct disk out of the number of possible candidates. It can be shown that the fraction of extents stored on a disk and the number of requests to a disk are proportional to its capacity and that the number of extent replacements in case of any change in the number or kind of disks is nearly minimal (see [2] for more detail). Manager is responsible for this redistribution tasks. After each change of a storage pool it checks each allocated extent if it has to be relocated. In such a case, the extent is moved online to its new location in a way that ensures the consistency of the data before, during, and after the replacement process. The administrator can access the metadata via the graphical user interface. The administration interface contains all the necessary functionality to manage enterprise wide storage networks: administration of storage systems, storage pools, and virtual devices, authentication and authorization, security, and statistics. 2.3. Kernel Integration The host software basically consists of a kernel module which is linked to the operating system of the participating servers and some additional applications running in the user space. Currently, modules for the Linux kernel  and  are available. If a data block needs to be read from a virtual disk, the file system generates a block I/O request and passes it to the kernel module where it is processed and transmitted to the appropriate physical disk. To perform the transformation from a virtual address to a physical address, the kernel keeps all necessary information, like existing storage pools, assignments of virtual and physical disks to the pools, storage policies etc. These information are given to the kernel initially or on-demand by the metadata server. 2.2. SAN Appliance and Metadata Management To ensure a consistent view of the SAN and a proper configuration of the hosts and storage devices, one or more SAN appliances are connected to the SAN. The appliances keep track of all necessary metadata structures. These metadata include among others the partitioning of the storage devices into storage pools and virtual volumes, access rights of the hosts, and the allocation of extents on the storage devices. The metadata appliance consists of a number of separate modules which are arranged around the V:Drive database, including the Disk-Agent, the Host-Interface, the DiskManager, and the Administration Interface. The interface to the database is standard SQL, implemented in many commercial and free databases. All components of the appliance can be executed on a single machine or can run in a distributed fashion. Information about the state of the SAN are collected by the Disk-Agent that is responsible for detecting disk partitions and for finding changes in the system configuration. Each newly detected suitable partition is labelled with a unique ID and is made available by updating the database. The Host-Interface is connected to the servers via Ethernet/IP. 
A data transfer between a server and the host interface is issued if the configuration of the SAN has been changed, if the server has started and has to load its configuration, or if the server accesses a virtual address for the first time and has to allocate an extent on the corresponding disk. If the configuration of the storage system changes, a small number of extents has to be redistributed in order to guarantee close to optimal performance. The Disk- 3. Results In this section we will present the experimental results of our virtualization approach. The test system consists of two Pentium servers connected to an FC-AL array with 8 fibre channel disks. Both servers have 2 Pentium II processors with 450 MHz and 512 kilobyte cache. Furthermore, they have local access to a mirrored disk drive containing the operating system and all relevant management information. Both servers run a Linux 2.4.18 kernel (Red Hat) and use gcc version 3.2.2 as the C compiler. The access to the disks is enabled by a QLogic qla2300 host bus adapter. The FC-AL array consists of four 17 Gigabyte and four 35 Gigabyte fibre channel disks. They are connected with the server via an 1 Gigabit switch. Each disk is partitioned into one partition covering the whole disk. For stressing the underlying I/O subsystem we used the Bonnie file system benchmark [1]. We changed the original source code such that we could derive more information concerning the overhead of our solution. The simple design and easy handling of Bonnie makes it a suitable tool for testing I/O performance. It performs, among others, the following operations on a number of file of desired size: it 155 reads and writes the random content of each character of the file separately, it reads and writes each block of the file, and it concurrently accesses arbitrary blocks in the file. The first two tests access a number of data files sequentially. Such a scenario is rather unlikely in practice but it is able to give an idea of the maximal performance of the I/O subsystem. More suited to model real world scenarios is the last test, because we have to access arbitrary blocks in some files. We set the overall file size to 4 GB (4 times the size of the main memory) to reduce caching effects. The size of the extents was fixed to 1 MB. To derive the overhead of our approach we compare our approach to the performance of a plain disk (labeled with the device name, e.g. SDA). More specific, we investigate the influence of each component of our solution to the overall throughput. For that we distinguish the following cases: I/O Throughput putc with cache 13000 12900 kByte per second 12800 12700 12600 12500 12400 12300 12200 12100 12000 SDA Clean Transfer Access Method Driver Figure 2. Comparison of the sequential output per character. 1. Clean System (C): Nothing is known in advance. 2. Transfer (T): All extents exist in the database and have only to be transferred to the driver. access the data in extents instead of data blocks. When allocating an extent the metadata server returns the first free position on the disk that is big enough to host the extent. Therefore, a sequential access of the file system results in higher movement of the disk head and only the sustained throughput of a disk could be achieved. 3. Driver (D): The driver has all information locally and does only perform the mapping of addresses. 
The number in parentheses behind the letters C,T,D in the charts axes is equivalent to the number of physical volumes belonging to the corresponding storage pool. If not mentioned otherwise, the storage pool consists of a single physical volume. Throughout the experiments the CPU usage for our approach was indistinguishable from the CPU usage when accessing the plain disk. Due to space limitations the corresponding figures are omitted. I/O Throughput write with cache 120 kByte per second 100 3.1. Impact of Extent Requests Figure 2 shows the throughput for the different settings when each character is written separately. Note, that we only get an overhead when the extent is accessed for the first time. Therefore, the induced costs are credited to many data accesses and their effect becomes marginal. The differences are mostly due to cache effects. The situation is very different when it comes to blockwise accesses. Here, the fraction of block accesses which induce overhead is much higher. Figure 3 shows the performance not only for the different settings but also for varying sizes of corresponding storage pools. Surprisingly, we lose roughly 40% when using only one disk. The reason for that lies mostly in the special sequential access pattern. The achieved high throughput could only be gained because the layout of the data blocks on disk enables a sweep of the disk head, minimizing the head movements. Modern file systems take that into account and adapt their data layout accordingly. But we destroy the careful layout because we 80 60 40 20 0 SDA T (1)T (2)T (3) C (1)C (2)C (3) Access Method D (1)D (2)D (3) Figure 3. Comparison of the sequential output per block. Surprisingly, this effect is compensated by parallel accesses when feeding the virtual device from more than one disk. Due to the fact that the operating system issues the write requests to the main memory and returns immediately, we achieve parallelism and get roughly the sustained performance of two disks. We could top the performance significantly, even if the access pattern does not allow for much parallelism. This indicates that the overhead induced by the driver alone is not a limiting factor. Only a clean sys- 156 Random Access tem with many extent allocations is not able to use many disks to increase the performance compared to a single disk. But in a real-world application a data block is accessed many times and the overhead occurs only once. 140 seeks per second 120 3.2. Block Read Performance 100 80 60 40 I/O Throughput Read 20 70000 0 kByte per second 60000 SDA 1 Disk 2 Disks 3 Disks 4 Disks 5 Disks Access Method 50000 Figure 5. Comparison of the performed number of seeks per second. 40000 30000 4. Conclusion 20000 10000 0 SDA In this paper we presented a virtualization environment that is based on the randomized Share-strategy. The shown results give evidence that such an approach is not only feasible but also efficient. Especially the performance of random seeks to files via Bonnie hints that V:Drive scales nicely with a growing storage network. 1 Disk 2 Disks 3 Disks 4 Disks 5 Disks Access Method Figure 4. Comparison of performance of block accesses. The read access is different from a write because it has to wait until the data is delivered from the disk. This gives the operating system enough time to rearrange a larger number of data accesses and, hence, accesses the disk in a sweeping manner. Figure 4 gives evidence for that. We lose only about 9% compared to the performance of a plain disk. 
As noted above the access pattern allows little parallelism. Hence, the increasing of the number of disks has only a small impact on the overall performance. References [1] T. Bray. Bonnie source code. http://www.textuality.com. [2] A. Brinkmann, K. Salzwedel, and C. Scheideler. Compact, Adaptive Placement Schemes for Non-Uniform Distribution Requirements. In Proceedings of the 14th ACM SPAA Conference, 2002. [3] T. Cortes and J. Labarta. Extending Heterogeneity to RAID level 5. In Proceedings of the USENIX Annual Technical Conference, 2001. [4] R. J. Honicky and E. L. Miller. A Fast Algorithm for Online Placement and Reorganization of Replicated Data. In Proceedings of the 17th IPDPS Conference, 2003. [5] R. J. Honicky and E. L. Miller. Replication Under Scalable Hashing: A Family of Algorithms for Scalable Decentralized Data Distribution. In Proceedings of the 18th IPDPS Conference, 2004. [6] D. Karger, E. Lehman, F. Leighton, M. Levine, D. Lewin, and R. Panigrahy. Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web. In In Proceeding of the 29th ACM STOC Conference, pages 654–663, 1997. [7] D. A. Patterson, G. Gibson, and R. H. Katz. A Case for Redundant Arrays of Inexpensive Disks (RAID). In Proceedings of the 1988 ACM Conference on Management of Data (SIGMOD), 1988. [8] The Storage Networking Industry Association (SNIA). Storage Virtualization I: What, Why, Where and How. 3.3. Random Seeks To get the number of random seeks per second Bonnie creates 3 threads performing the data requests. It is our opinion that this test is closest to practice because on a storage server there are different application generating rather unpredictable block accesses. Figure 5 compares the number of seeks per second for all approaches. Again, the overhead induced by the V:Drive solution is too small to measure once the extents are allocated. Note, that the impact of more disks decreases the more disk participate in the storage pool. This is due to the fact that the number of scheduled requests stays constant. That means, that the likelihood of parallel accesses to all disks decreases with the number of disks. If we would access the storage pool with more virtual devices the scaling would be much better. 157 158 Identifying Stable File Access Patterns Purvi Shah University of Houston Jehan-François Pâris1 University of Houston Ahmed Amer2 University of Pittsburgh Darrell D. E. Long 3 U. C. Santa Cruz purvi@cs.uh.edu paris@cs.uh.edu amer@cs.pitt.edu darrell@cs.ucsc.edu access to file X, it predicts that file Y will always be the successor of file X and never alters this prediction. The remainder of this paper is organized as follows. Section 2 reviews previous work on file access prediction. Section 3 introduces our First Stable Successor predictor and Section 4 discusses its performance. Finally, Section 5 states our conclusions 1. Introduction Disk access times have not kept pace with the evolution of disk capacities, CPU speeds and main memory sizes. They have only improved by a factor of 3 to 4 in the last 25 years whereas other system components have almost doubled their performance every other year. As a result, disk latency has an increasingly negative impact on the overall performance of many computer applications.123 Two main techniques can be used to mitigate this problem, namely caching and prefetching. 
Caching keeps in memory the data that are the most likely to be used again while prefetching attempts to bring data in memory before they are needed. Both techniques are widely implemented at the data block level. More recent work has focused on caching and prefetching entire files. There are two ways to implement file prefetching. Predictive prefetching attempts to predict which files are likely to be accessed next in order to read them before they are needed. While being conceptually simple, the approach has two important shortcomings. First, the prefetching workload will get in the way of the regular disk workload. Second, it is difficult to predict file accesses sufficiently ahead of time to ensure that the predicted files can be brought into main memory before they are needed. A more promising alternative is to group together on the disk drive files that are often accessed at the same time [3]. This technique is known as implicit prefetching and suffers none of the shortcomings of predictive prefetching because each cluster of files can now be brought into main memory in a single I/O operation. The sole drawback of this new approach is the need to identify stable file access patterns in order to build long-lived clusters of related files. We present here a new file predictor that identifies stable access patterns and can predict between 50 and 70 percent of next file accesses over a period of one year. Our First Stable Successor keeps track of the successor of each individual file. Once it has detected m successive accesses to file Y, each immediately following an 2. Previous Work Palmer et al. [8] used an associative memory to recognize access patterns within a context over time. Their predictive cache, named Fido, learns file access patterns within isolated access contexts. Griffioen and Appleton presented in 1994 a file prefetching scheme relying on graph-based relationships [4]. Shriver et al. [10] proposed an analytical performance model to study the effects of prefetching for file system reads. Tait and Duchamp [11] investigated a client-side cache management technique used for detecting file access patterns and for exploiting them to prefetch files from servers. Lei and Duchamp [6] later extended this approach and introduced the Last Successor predictor. More recent work by Kroeger and Long introduced more effective schemes based on context modeling and data compression [5]. Two much simpler predictors, Stable Successor (or Noah) [1] and Recent Popularity [2], have been recently proposed. The Stable Successor predictor is a refinement of the Last Successor predictor that attempts to filter out noise in the observed file reference stream. Stable Successor keeps track of the last observed successor of every file, but it does not update its past prediction of the successor of file X before having observed m successive instances of file Y immediately following instances of file X. Hence, given the sequence: S: ABABABACABACABADADADA Stable Successor with m = 3 will first predict that B is the successor of A and will not update its prediction until it encounters three consecutive instances of file D immediately following instances of file A. The Recent Popularity or k-out-of-n predictor maintains the n most recently observed successors of each file. When attempting to make a prediction for a given file, Recent Popularity searches for the most 1 Supported in part by the National Science Foundation under grant CCR-9988390. 
popular successor from the list. If the most popular successor occurs at least k times then it is submitted as a prediction. When more than one file satisfies the criterion, recency is used as the tiebreaker.

3. The First Stable Successor Predictor

All these predictors are dynamic in the sense that they reflect changes in file access patterns and modify their predictions accordingly. The sole existing static predictor is First Successor [1], which always predicts the first encountered successor of file X as its successor. It is a rather crude predictor and was found to perform much worse than Last Successor, Stable Successor, and Recent Popularity. There are two explanations for this poor performance. First, First Successor cannot reflect changes in file access patterns. Second, it bases all its predictions on a single observation.

As shown in Figure 1, the First Stable Successor (FSS) predictor remedies this second limitation by requiring m successive instances of file Y immediately following instances of file X before predicting that file Y is the successor of file X. Otherwise it makes no prediction. When m = 1, the FSS predictor becomes identical to the First Successor protocol and predicts that file Y is the successor of file X once it has encountered a single access to file Y immediately following an access to file X. A large value of m will result in fewer predictions than a smaller value of m but will also increase the likelihood that these predictions will be correct. This provides us with a relatively easy way to tune the protocol by either increasing m whenever we want to reduce the number of false predictions or decreasing it whenever we want to increase the total number of predictions.

Assumptions:
    G is the file being currently accessed
    F is its direct predecessor
    FirstStableSuccessor(F) is the last prediction made for the successor of F
    LastSuccessor(F) is the last observed successor of F
    Counter(F) is a counter
    m is the minimum number of consecutive identical successors to declare a First Stable Successor

Algorithm:
    if FirstStableSuccessor(F) is undefined then
        if LastSuccessor(F) = G then
            Counter(F) ← Counter(F) + 1
        else
            Counter(F) ← 1
        end if
        if Counter(F) = m then
            FirstStableSuccessor(F) ← G
        end if
    end if

Figure 1. The First Stable Successor Predictor

4. Performance Evaluation

When comparing the effectiveness of file predictors, one is often confronted with two primary metrics, success-per-reference and success-per-prediction. Given the dependent nature of these metrics, it is impossible to use either of them alone when assessing the performance of any given predictor. For example, a predictor that has a 99% success-per-prediction rate would be considered impractical if it could only be used on 5% of the references. Conversely, predictors that have a high success-per-reference rate may also give rise to a high number of incorrect predictions that may tax the file system to the extent that it outweighs any improvements due to predictive prefetching. We will use a third metric integrating both aspects of the predictor performance.

Consider first the two possible outcomes of an incorrect prediction. If we assume no preemption, the next file access will have to wait while the predicted file is loaded into the cache. The cost of the incorrect prediction is thus one additional cache miss. Allowing preemption would reduce this delay and decrease the penalty. Note that the incorrect prediction will have no other adverse effect on the cache performance as long as the cache replacement policy expels first the files that were never accessed.

We define the effective success rate per reference of a predictor as the ratio

    (Ncorr − α Nincorr) / Nref

where Ncorr is the number of correct predictions, Nincorr the number of incorrect predictions, Nref the number of references, and the α factor represents the impact of file fetch preemption on the performance of the predictor. A zero value for α corresponds to the situation where incorrect predictions incur no cost because all predicted file fetches can be preempted when found to be incorrect without any further delay. A unit value assumes that there is no fetch preemption, and all ongoing fetches must be completed, whether correctly predicted or not. An intermediate α value corresponds to situations where preemption is possible, but at some cost less than the cost of a file fetch. Computing the effective success rate per reference for α values of, say, 0.0, 0.5 and 1.0 will permit us to compare predictors for a realistic range of file-system implementations.

We evaluated the performance of our FSS predictor by simulating its operation on two sets of file traces. The first set consisted of four file traces collected using Carnegie Mellon University's DFSTrace system [7]. The traces include mozart, a personal workstation, ives, a system with the largest number of users, dvorak, a system with the largest proportion of write activity, and barber, a server with the highest number of system calls per second. They include between four and five million file accesses collected over a time span of approximately one year. Our second set of traces was collected in 1997 by Roselli [9] at the University of California, Berkeley over a period of approximately three months. To eliminate any interleaving issues, these traces were processed to extract the workloads of an instructional machine (instruct), a research machine (research) and a web server (web).

Figures 2 to 4 represent the effective success rates per reference achieved by our First Stable Successor when the number m of consecutive successors triggering the predictor varies between 1 and 20. Negative success rates correspond to situations where α > 0 and the sum of the penalties assessed for incorrect predictions exceeds the number of correct predictions.

Figure 2. Effective success rate per reference of the FSS predictor for α = 0 and m varying between 1 and 20 (one curve per trace: barber, dvorak, ives, mozart, instruct, research, web).
Figure 3. Effective success rate per reference of the FSS predictor for α = 0.5 and m varying between 1 and 20 (same traces).
Figure 4. Effective success rate per reference of the FSS predictor for α = 1 and m varying between 1 and 20 (same traces).
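For readers who want to experiment with the predictor, the update rule of Figure 1 and the effective success rate per reference can be expressed in a few lines of Python. This is a minimal illustrative sketch, not the simulator used by the authors; the trace is assumed to be a simple list of file identifiers in access order, and all function and variable names are invented here.

def simulate_fss(trace, m, alpha):
    # First Stable Successor: once a prediction is declared it is never changed
    first_stable = {}        # FirstStableSuccessor(F)
    last_succ = {}           # LastSuccessor(F)
    counter = {}             # Counter(F)
    n_corr = n_incorr = 0
    prev = None
    for g in trace:                          # g is the file being accessed
        if prev is not None:
            f = prev                         # f is its direct predecessor
            if f in first_stable:            # a prediction exists: score it
                if first_stable[f] == g:
                    n_corr += 1
                else:
                    n_incorr += 1
            else:                            # update rule of Figure 1
                if last_succ.get(f) == g:
                    counter[f] = counter.get(f, 0) + 1
                else:
                    counter[f] = 1
                if counter[f] == m:
                    first_stable[f] = g
            last_succ[f] = g
        prev = g
    return (n_corr - alpha * n_incorr) / len(trace)   # effective success rate

For example, simulate_fss(trace, m=8, alpha=0.5) yields the kind of value plotted in Figure 3 at m = 8; with m = 1 the sketch degenerates to First Successor, as discussed above.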
As we can see, our First Stable Successor performs much better with the four CMU traces than with the three Berkeley traces, even though the Berkeley traces were collected over a much shorter period. In particular, our predictor performs very poorly with the instruct trace, which appears to have the least stable reference patterns of all seven traces.

The four CMU traces can be further subdivided into two groups. The first group comprises barber and mozart, which exhibit rather stable behaviors. As a result, our predictor can successfully predict between 66 and 69 percent of future references. Conversely, dvorak and ives exhibit less stable behaviors, and our predictor can successfully predict between 53 and 57 percent of future references. This should not surprise us because ives had the largest number of users and dvorak the largest proportion of write activity. Even when we do not penalize incorrect predictions, First Stable Successor requires fewer consecutive successors to reach its optimum performance on barber and mozart than on dvorak and ives.

We can also observe that the number of consecutive successors required to achieve optimum performance increases on all seven traces when α increases from zero to one. It may therefore be advisable to increase the value of the m parameter for workloads that exhibit less stable file access patterns in order to reduce the number of misses.

Figures 5 to 7 compare the effective success rates per reference achieved by our First Stable Successor with m = 8 with those achieved by First Successor, Last Successor, Stable Successor with m = 2, and k-out-of-m. As we can see, our First Stable Successor predictor performs much better than First Successor but not as well as Last Successor, Stable Successor and k-out-of-m. This gap is especially evident for the instruct trace, as these last three predictors perform almost as well as they do on the mozart trace while First Successor and First Stable Successor perform very poorly.

We can draw two major conclusions from our measurements. First, there are enough stable access patterns in six of the seven traces we analyzed to make implicit file prefetching a worthwhile proposition. This is especially true because of the low overhead of the approach, which means that wrong predictions would only incur a minimal penalty (α << 1). Second, many, if not most, of these stable access patterns are long lived and appear to persist over at least a full year. A file system implementing implicit file prefetching would probably reevaluate its file groups once a week. We can already predict that these weekly group reevaluations will not result in a complete reconfiguration of the whole file system.

5. Conclusions

Identifying and exploiting stable file access patterns is essential to the success of implicit file prefetching, as this technique builds long-lived clusters of related files that can be brought into memory in a single I/O operation. We have presented a new file access predictor that was specifically tailored to identify such stable file access patterns. Trace-driven simulation results indicate that our First Stable Successor can predict up to 70 percent of next file accesses over a period of one year.

References

[1] A. Amer and D. D. E. Long, Noah: Low-cost file access prediction through pairs, in Proc. 20th Int'l Performance, Computing, and Communications Conf., pp. 27–33, Apr. 2001.
[2] A. Amer, D. D. E. Long, J.-F. Pâris, and R. C.
Burns, File access prediction with adjustable accuracy, in Proc. 21st Int'l Performance of Computers and Communication Conf., pp. 131–140, Apr. 2002.
[3] A. Amer, D. Long, and R. Burns, Group-based management of distributed file caches, in Proc. 17th Int'l Conf. on Distributed Computing Systems, pp. 525–534, July 2002.
[4] J. Griffioen and R. Appleton, Reducing file system latency using a predictive approach, in Proc. 1994 Summer USENIX Conf., pp. 197–207, June 1994.
[5] T. M. Kroeger and D. D. E. Long, Design and implementation of a predictive file prefetching algorithm, in Proc. 2001 USENIX Annual Technical Conf., pp. 105–118, June 2001.
[6] H. Lei and D. Duchamp, An analytical approach to file prefetching, in Proc. 1997 USENIX Annual Technical Conf., pp. 305–318, Jan. 1997.
[7] L. Mummert and M. Satyanarayanan, Long term distributed file reference tracing: implementation and experience, Technical Report, School of Computer Science, Carnegie Mellon University, 1994.
[8] M. L. Palmer and S. B. Zdonik, FIDO: a cache that learns to fetch, in Proc. 17th Int'l Conf. on Very Large Data Bases, pp. 255–264, Sept. 1991.
[9] D. Roselli, Characteristics of file system workloads, Technical Report CSD-98-1029, University of California, Berkeley, 1998.
[10] E. Shriver, C. Small, and K. A. Smith, Why does file system prefetching work? in Proc. 1999 USENIX Technical Conf., pp. 71–83, June 1999.
[11] C. Tait and D. Duchamp, Detection and exploitation of file working sets, in Proc. 11th Int'l Conf. on Distributed Computing Systems, pp. 2–9, May 1991.

Figure 5. Compared success rates per reference of the five policies (First Successor, Last Successor, Stable Successor, 2-Out-Of-4, and First Stable Successor) for α = 0, per file system trace.
Figure 6. Compared success rates per reference of the five policies (First Successor, Last Successor, Stable Successor, 3-Out-Of-4, and First Stable Successor) for α = 0.5, per file system trace.
Figure 7. Compared success rates per reference of the five policies (First Successor, Last Successor, Stable Successor, 4-Out-Of-4, and First Stable Successor) for α = 1, per file system trace.

An On-line Backup Function for a Clustered NAS System (X-NAS)

Yoshiko Yasuda, Shinichi Kawamoto, Atsushi Ebata, Jun Okitsu, and Tatsuo Higuchi
Hitachi, Ltd., Central Research Laboratory
1-280 Higashi-koigakubo, Kokubunji-shi, Tokyo 185-8601, Japan
Tel: +81-423-23-1111, Fax: +81-423-27-7743
e-mail: {yoshikoy, skawamo, ebata, j-okitsu, higuchi}@crl.hitachi.co.jp

Abstract

An on-line backup function for X-NAS, a clustered NAS system designed for entry-level NAS, has been developed. The on-line backup function can replicate file objects on X-NAS to a remote NAS in real-time. It makes use of the virtualized global file system of X-NAS, and sends NFS write operations to both X-NAS and the remote backup NAS at the same time. The performance of the on-line backup function was evaluated, and the evaluation results show that the on-line backup function of X-NAS improves the system reliability while maintaining 80% of the throughput of the X-NAS without this function.

1. Introduction

An entry-level NAS system is convenient in terms of the cost and the ease of management for offices with no IT experts. However, it is not scalable.
To solve this problem, X-NAS, which is a simple, scalable clustered NAS architecture designed for entry-level NAS, has been proposed [6]. Like conventional NAS systems, it can be used by various clients, such as those using UNIX and Windows1. X-NAS aims at the following four goals:

• Cost reduction by using entry-level NAS as an element
• Ease of use by providing a single-file-system view for various kinds of clients
• Ease of management by providing a centralized management function
• Ease of scaling-up by providing several system-reconfiguration functions

To achieve these goals, X-NAS virtualizes multiple entry-level NAS systems as a unified system without changing clients' environments. In addition, X-NAS maintains the manageability and the performance of the entry-level NAS. It can also easily be reconfigured without stopping file services or changing setting information. However, when one of the X-NAS elements suffers a fault, file objects on the faulty NAS system may be lost if there are no backups. To improve the X-NAS reliability, a file-replication function must therefore be developed.

The goal of the present work is to introduce an on-line backup function of X-NAS that replicates original file objects on X-NAS to a remote NAS for each file access request in real-time without changing the clients' environments. The performance of the on-line backup function was evaluated, and the evaluation results indicate that X-NAS with the on-line backup function improves the system reliability while maintaining 80% of the throughput of standard X-NAS.

1 Windows and DFS are trademarks of Microsoft Corporation. Double Take is a trademark of Network Specialists, Inc. All other products are trademarks of their respective corporations.

2. On-line backup function for X-NAS

To improve the reliability of X-NAS, an on-line backup function for X-NAS has been developed. (Since the details of the X-NAS structure are discussed in another paper [6], they are not described here.) The on-line backup function consists of many sub-functions. Among these sub-functions, we focus on on-line replication, the heart of the on-line backup function, in this paper. On-line replication replicates files of X-NAS to a remote NAS, which is called a backup NAS, in real-time for each file access request.

2.1. Requirements

The on-line backup function of X-NAS must meet the following requirements:

• Generate replicas of file objects in real-time in order to eliminate the time lag between the original data and the replicas.
• Use a standard file-access protocol such as NFS to communicate between X-NAS and the backup NAS in order to support as many kinds of NAS as clients need.
• Do not change clients' environments, in order to curb their management cost.

2.2. On-line replication

There are several methods for replicating file objects to remote systems via an IP network. One method is to use block I/O [5]. Since block I/O is a fine-grain process, all file objects are completely consistent with the copied objects. However, the system structure is limited because the logical disk blocks of the objects must be allocated to the same address between the original data and its replica. Another method is to change the client's system. DFS [1] is a simple method for replicating file objects to many NASs. It replicates file objects at constant intervals but not in real-time.

Xnfsd and the management partition in X-NAS enable the centralized management of many NAS elements and provide a unified file system view for clients (Fig. 1).
Xnfsd is a wrapper daemon that receives an NFS operation in place of the NFS server and sends the operation to others. On-line replication of X-NAS makes use of Xnfsd in order to copy file objects to the backup NAS. By extending this function, Xnfsd sends the NFS operation not only to the NFS servers on the X-NAS but also to the NFS servers on the backup NAS. All file objects can thus be replicated in real-time for each NFS operation.

2.2.1. Operations

NFS operations handled in X-NAS can be divided into four categories. Category 1 is reading files; category 2 is writing files; category 3 is reading directories; and category 4 is writing directories. Xnfsd sends NFS operations belonging to categories 2 and 4 to both X-NAS and the backup NAS at the same time. On the other hand, NFS operations belonging to categories 1 and 3 are not sent to the backup NAS.

When a UNIX client sends a WRITE operation for file f to X-NAS, Xnfsd on P-NAS (parent NAS) receives the operation in place of the NFS daemon. Figure 1 shows the flow of this operation, and Figure 2 shows the timing chart with or without the on-line backup function. Firstly, Xnfsd specifies the data partition that stores the file entity by using the inode number of the dummy file f on the management partition (#1). Secondly, Xnfsd invokes a sub thread and then sends the WRITE operation to the backup NAS by using the thread (#2). Thirdly, Xnfsd sends the WRITE operation to the NFS daemon on the specified C-NAS (child NAS), and then the C-NAS processes the operation (#3). Finally, Xnfsd waits for the responses of the operations from the NFS server on the C-NAS and from the backup NAS (#4), and then it makes one response from all the responses and sends it back to the client. We call this procedure a synchronized backup.

Figure 1: Flow of the WRITE operation with the on-line backup function (UNIX/NFS and Windows/CIFS clients, Xnfsd and the management partition with dummy files on P-NAS, data partitions on the C-NAS elements, and the backup NAS).

Figure 2. Timing charts of the WRITE operation with or without the on-line backup function: (a) X-NAS without on-line backup, (b) X-NAS with synchronized backup, (c) X-NAS with partial asynchronized backup. Steps: (#1) access to the management partition, (#2) invoke sub-thread and send to the backup NAS, (#3) access to the data partition, (#4) wait for sub-thread completion.

2.2.2. Key features

An on-line backup function must guarantee the consistency of data between X-NAS and the backup NAS. To achieve this, Xnfsd waits for the responses from both the NFS server on the X-NAS element and the backup NAS for each NFS operation. However, waiting for the responses degrades total performance. To solve this problem, the performance of the on-line replication function must be improved through three key features, as follows.

(1) Multi-threaded wrapper daemon

Xnfsd waits for all responses from both NAS systems. This incurs an overhead because of frequent accesses to the network and the disk drives.
To reduce this cost, the main thread of Xnfsd invokes a sub thread to send the file I/Os to the backup NAS. This feature enables X-NAS to process the disk accesses of both X-NAS and the backup NAS in parallel.

(2) File-handle cache

The cost of specifying the full path name and the file handle on the backup NAS is high because of frequent accesses to the network and the disk drives. To reduce this cost, X-NAS makes use of a file-handle cache, which records the correspondence between the file handle of the dummy file, i.e., the global file handle, and the file handle of the backup NAS.

(3) Partial asynchronized backup

Although the synchronized backup is a simple method, its execution cost is high because this method waits for all the responses from the NFS servers on the X-NAS and the backup NAS. A method that does not wait for the response from the backup NAS achieves the same performance as X-NAS without the on-line backup function. However, when X-NAS or the backup NAS becomes faulty, it is difficult to guarantee the consistency of data between X-NAS and the backup NAS. Using a log is one solution to guarantee the consistency. However, since the log size is limited, it is not a perfect solution for entry-level NAS, which usually has a small-sized memory. Furthermore, according to the X-NAS concept, the architecture must be simplified as much as possible. Xnfsd thus supports a partial asynchronized backup method in addition to the synchronized backup. Figure 2(c) shows the timing chart of the WRITE operation with partial asynchronized backup. In this method, after processing disk accesses to the data partition on the X-NAS element, Xnfsd sends back a response to the client without waiting for the response from the backup NAS. As a result, the client can send the next operation. The main thread of Xnfsd can perform the disk accesses to the management partition for the next operation during the waiting time for the response from the backup NAS.

3. Performance evaluation

To evaluate the on-line backup function of X-NAS, an X-NAS prototype based on the NFSv3 implementation was developed. We ran NetBench [3] and SPECsfs97 [4] on the X-NAS prototype with and without the on-line backup function. In this evaluation, taking into account the permissible range for entry-level NAS users, we set the performance objective for X-NAS with the on-line backup function at 80% of the performance of X-NAS without the function. Throughput and average response time are used as the performance metrics. In this evaluation, we implemented the partial asynchronized backup function for the WRITE operation. This is because the ratio of WRITE operations to all operations is higher than that of other operations in the workload mix of the benchmarks. Furthermore, since the file sizes used by the benchmark programs are from 100 to 300 KB, many WRITE operations are issued continuously, so the processing of successive WRITE operations can be overlapped.

3.1. Experimental environment

In the experimental environment, the maximum number of X-NAS elements is fixed at four. Each X-NAS element and the backup NAS were configured with one 1-GHz Pentium III processor, 1 GB of RAM and a 35-GB Ultra 160 SCSI disk drive running Red Hat Linux 7.2. For the NetBench test, one to eight clients running Windows 2000 Professional were used. The clients, P-NAS, C-NASs, and the backup NAS were connected by 100-Megabit Ethernet because most offices still use this type of LAN.
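Before turning to the results, the difference between the synchronized and the partial asynchronized backup of Section 2.2 can be summarized in a short sketch. This is illustrative pseudocode only, not Hitachi's Xnfsd implementation; the management, c_nas and backup objects and their methods are assumptions made for the example.

import threading

def handle_write(op, management, c_nas, backup, synchronized=True):
    # (#1) use the dummy file's inode on the management partition to find
    #      the data partition (C-NAS) that holds the file entity
    target = management.lookup_data_partition(op.file_handle)
    # (#2) invoke a sub-thread that forwards the WRITE to the backup NAS
    backup_thread = threading.Thread(target=backup.write, args=(op,))
    backup_thread.start()
    # (#3) forward the WRITE to the NFS daemon on the selected C-NAS
    reply = c_nas[target].write(op)
    if synchronized:
        # (#4) synchronized backup: also wait for the backup NAS to answer
        backup_thread.join()
    # partial asynchronized backup: reply as soon as the C-NAS has finished;
    # the backup write completes in the background while the main thread can
    # already start the management-partition work for the next operation
    return reply

The only difference between the two modes is whether the reply to the client is delayed until the backup thread completes (#4); that waiting time is exactly what the partial asynchronized method removes from the critical path.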
3.2. Results

Figures 3 and 4 show the results of our performance evaluation in terms of throughput and average response time. The throughputs of X-NAS with the synchronized backup function are about 80% of those without the function. Under an experimental environment with NetBench, the average response time for X-NAS with the function is about 1.2 times higher than that for X-NAS without the function. On the other hand, under an experimental environment with SPECsfs, the average response time for X-NAS with the function is about 1.4 times higher than the time for X-NAS without it. Although the partial asynchronized backup can improve both throughput and average response time by several percentage points, the performance objective for the response time in the case of SPECsfs cannot be achieved yet.

Figure 3. Throughput and average response time of X-NAS with or without the on-line backup function in the case of NetBench: (a) throughput (Mbit/sec) and (b) average response time (msec) versus number of clients, for X-NAS without on-line backup, with partial-asynchronized backup, and with synchronized backup.

Figure 4. Throughput and average response time of X-NAS with or without the on-line backup function in the case of SPECsfs: (a) total throughput (delivered load, NFSOPS), (b) total average response time, and (c) average response times for WRITE operations, versus offered load (NFSOPS), for the same three configurations.

3.3. Discussion

To specify the reason for the longer response time in the case of SPECsfs, the average response time of each NFS operation in the case of the synchronized backup was analyzed. The average response times for some write requests such as WRITE, SETATTR and CREATE are longer than those for X-NAS without that function. In particular, the average response time for WRITE operations is 2.5 times higher than that for the other operations. Profiling results of the WRITE operations show that the waiting time for sub-thread completion is about 24% of the total processing time and access to the data partition via an IP network is about 48% of that time. By applying the partial asynchronized backup to X-NAS, this waiting time can be reduced to almost zero. Figure 4(c) shows the effects of the partial asynchronized backup in the case of WRITE operations. The average response time for WRITE operations with the partial asynchronized backup can be reduced from 2.5 times to 1.8 times the time for X-NAS without the function. As a result, the total average response time for SPECsfs with the function can be reduced to 1.3 times that without it. However, since the ratio of data transmission time to the total processing time is still high in the case of 100-Megabit Ethernet, using a Gigabit network is effective because it can reduce the data transmission time to at least one-fifth of that for 100-Megabit Ethernet. Furthermore, by optimizing other operations such as CREATE and COMMIT, the performance objective of 1.2 times can be achieved.

4. Related work

There are several methods for replicating file objects between several NAS systems via the network. DFS [1] is a simple and easy file-replication function on Windows systems.
DRBD [5] is a kernel module for building a two-node HA cluster under Linux. Double Take [2] is a third-vendor software to replicate file objects on the master NAS to the slave NAS. 5. Conclusions An on-line backup function for X-NAS, a clustered NAS system, has been developed. On-line replication, the core of the on-line backup function, replicates file objects on XNAS to a remote backup NAS in real-time for each NFS operation. A multi-threaded wrapper daemon with a low overhead, the developed file-handle cache and the partial asynchronized backup method can reduce the overhead for accessing the backup NAS. An X-NAS prototype with the on-line backup function, based on NFSv3 running the NetBench and SPECsfs97 programs attains 80% of the performance of X-NAS without the function. This function improves the dependability of entry-level NAS while maintaining its manageability. References [1] Deploying Windows Powered NAS Using Dfs with or without Active Directory. http://www.microsoft.com, 2001. [2] Double-Take Theory of Operations. http://www.nsisoftware.com, 2001. [3] NetBench 7.0.3. http://www.etestomglabs.com/benchmarks/netbench, 2002. [4] SFS3.0 Documentation Version 1.0. http://www.spec.org, 2002. [5] P. Reisner. DRBD. In Proceedings of the 7th International Linux Kongress, 2000. [6] Y. Yasuda et al. Concept and Evaluation of X-NAS: a highly scalable NAS system. In Proceedings of the 20th IEEE/11th NASA MSST2003. 170 dCache, the commodity cache Patrick Fuhrmann Deutsches Elektronen Synchrotron 22607 Hamburg, Germany Notkestrasse 85 Tel: +49-40-8998-4474, Fax: +49-40-8994-4474 e-mail: patrick.fuhrmann@desy.de 1. Abstract The software package presented within this paper has proven to be capable of managing the storage and exchange of several hundreds of terabytes of data, transparently distributed among dozens of disk storage nodes. One of the key design features of the dCache is that although the location and multiplicity of the data is autonomously determined by the system, based on configuration, cpu load and disk space, the name space is uniquely represented within a single file system tree. The system has shown to significantly improve the efficiency of connected tape storage systems, through caching, ’gather & flush’ and scheduled staging techniques. Furthermore, it optimizes the throughput to and from data clients as well as smoothing the load of the connected disk storage nodes by dynamically replicating datasets on the detection of load hot spots. The system is tolerant against failures of its data servers which enables administrators to go for commodity disk storage components. Access to the data is provided by various ftp dialects, including gridftp, as well as by a native protocol, offering regular file system operations like open/read/write/seek/stat/close. Furthermore the software is coming with an implementation of the Storage Resource Manager protocol, SRM, which is evolving to an open standard for grid middleware to communicate with site specific storage fabrics. 2. Contributors The software is being developed by the Deutsches Elektronen Synchrotron (DESY) in Hamburg, Germany[1] and the Fermi National Laboratory, Batavia Chicago,IL, USA [2]. 3. Technical Specification 3.1 File name space and dataset location dCache strictly separates the filename space of its data repository from the actual physical location of the datasets. The filename space is internally managed by a database and interfaced to the user resp. 
to the application process by the nfs2 [9] protocol and through the various ftp filename operations. The location of a particular file may be on one or more dCache data servers as well as within the repository of an external Tertiary Storage Manager. dCache transparently handles all necessary data transfers between nodes and optionally between the external Storage Manager and the cache itself. Inter dCache transfers may be caused by configuration or load balancing constrains. As long as 171 a file is transient, all dCache client operations to the dataset are suspended and resumed as soon as the file is fully available. 3.2Maintenance and fault tolerance As a result of the name space —data separation, dCache data server nodes subsequently denoted as pools, can be added at any time without interfering with system operation. Having a Tertiary Storage System attached, or having the system configured to hold multiple copies of each dataset, data nodes can even be shut down at any time. Under those conditions, the cache system is extremely tolerant against failures of its data server nodes. 3.3Data access methods In order to access dataset contents, dCache provides a native protocol (dCap), supporting regular file access functionality. The software package includes a c-language client implementation of this protocol offering the posix open/read/write/seek/stat/close calls. This library may be linked against the client application or may be preloaded to overwrite the file system I/O operations. The library supports pluggable security mechanisms where the GssApi (Kerberos) and ssl security protocols are already implemented. Additionally, it performs all necessary actions to survive a network or pool node failure. It is available for Solaris, Linux, Irix64 and windows. Furthermore, it allows to open files using an http like syntax without having the dCache nfs file system mounted. In addition to this native access, various FTP dialects are supported, e.g. GssFtp (kerberos) [8] and GsiFtp (GridFtp) [7]. An interface definition is provided, allowing other protocols to be implemented as well. 3.4Tertiary Storage Manager connection Although dCache may be operated stand alone, it can also be connected to one or more Tertiary Storage Systems. In order to interact with such a system, a dCache external procedure must be provided to store data into and retrieve data from the corresponding store. A single dCache instance may talk to as many storage systems as required. The cache provides standard methods to optimize access to those systems. Whenever a dataset is requested and cannot be found on one of the dCache pools, the cache sends a request to the connected Tape Storage Systems and retrieves the file from there. If done so, the file is made available to the requesting client. To select a pool for staging a file, the cache considers configuration information as well as pool load, available space and a Least Recently Used algorithms to free space for the incoming data. Data, written into the cache by clients, is collected and, depending on configuration, flushed into the connected tape system based on a timer or on the maximum number of bytes stored, or both. The incoming data is sorted, so that only data is flushed which will go to the same tape or tape set. Mechanisms are provided that allow giving hints to the cache system about which file will be needed in the near future. The cache will do its best to stage the particular file before it’s requested for transfer. 
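The "gather & flush" behaviour described above can be paraphrased with a small sketch. It is not dCache code; the class, thresholds and grouping key are invented for the illustration, which only shows the decision of when a batch of newly written files is handed to the connected tape system.

import time
from collections import defaultdict

class FlushQueue:
    # Collects freshly written files and decides when a batch should be
    # flushed to tertiary storage.  Files are grouped so that a single flush
    # only contains data destined for the same tape or tape set.
    def __init__(self, max_bytes, max_age_seconds):
        self.max_bytes = max_bytes
        self.max_age = max_age_seconds
        self.pending = defaultdict(list)      # tape set -> [(file, size)]
        self.bytes = defaultdict(int)
        self.oldest = {}                      # tape set -> first arrival time

    def add(self, tape_set, name, size):
        self.pending[tape_set].append((name, size))
        self.bytes[tape_set] += size
        self.oldest.setdefault(tape_set, time.time())

    def ready_to_flush(self, tape_set):
        # flush on a timer, on the number of bytes collected, or both
        too_big = self.bytes[tape_set] >= self.max_bytes
        too_old = (tape_set in self.oldest and
                   time.time() - self.oldest[tape_set] >= self.max_age)
        return too_big or too_old

    def take(self, tape_set):
        # hand the collected batch to the external store procedure
        batch = self.pending.pop(tape_set, [])
        self.bytes.pop(tape_set, None)
        self.oldest.pop(tape_set, None)
        return batch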
Space management is internally handled by the dCache itself. Files which have their origin on a connected tape storage system will be removed from cache, based on a Least 172 Recently Used algorithm, if space is running short. Space is created only when needed. No high/low watermarks are used. 3.5 Pool Attraction Model Though dCache distributes datasets autonomously among its data nodes, preferences may be configured. As input, those rules can take the data flow direction, the subdirectory location within the dCache file system, storage information of the connected Storage Systems as well as the IP number of the requesting client. The cache defines data flow direction as getting the file from a client, delivering a file to a client and fetching a file from the Tertiary Storage System. The simplest setup would direct incoming data to data pools with highly reliable disk systems, collect it and flush it to the Tape Storage System when needed. Those pools could e.g. not be allowed to retrieve data from the Tertiary Storage System as well as deliver data to the clients. The commodity pools on the other hand would only handle data fetched from the Storage System and delivered to the clients because they would never hold the original copy and therefore a disk /node failure wouldn’t do any harm to the cache. Extended setups may include the network topology to select an appropriate pool node. Those rules result in a matrix of pools from which the load balancing module, described below, may choose the most appropriate candidate. Each row of the matrix contains pools with similar attraction. Attraction decreases from top to bottom. Should none of the pools in the top row be available, the next row is chosen, a.s.o.. Optionally, stepping from top to bottom can be done as long as the candidate of row ’n’is still above a certain load. The final decision, which pool to select of this set, is based on free space, age of file and node load considerations. 3.6 Load Balancing and pool to pool transfers The load balancing module is, as described above, the second step in the pool selection process. This module keeps itself updated on the number of active data transfers and the age of the least recently used file for each pool. Based on this set of information, the most appropriate pool is chosen. This mechanism is efficient even if requests are arriving in bunches. In other words, as a new request comes in, the scheduler already knows about the overall state change of the whole system triggered by the previous request though this state change might not even have fully evolved. System administrators may decide to make pools with unused files more attractive than pools with only a small number of movers, or some combination. Starting at a certain load, pools can be configured to transfer datasets to other, less loaded pools, to smooth the overall load pattern. At a certain point, pools may even fetch a file from the Tertiary Storage System again, if all pools, holding the requested dataset are too busy. Regulations are in place to suppress chaotic pool to pool transfer orgies in case the global load is steadily increasing. Furthermore, the maximum numbers of replica of the same file can be defined to avoid having the same set of files on each node. 3.7 File Replica Manager A first version of the so called Replica Manager is currently under evaluation. This module enforces that at least N copies of each file, distributed over different pool nodes, must exist within the system, but never more than M copies. 
This approach allows to shut 173 down servers without affecting system availability or to overcome node or disk failures. The administration interface allows to announce a scheduled node shut down to the Replica Manager so that it can adjust the N — M interval. 4. Data Grid functionality In the context of the LHC Computing Grid Project [4], a Storage Element describes a module providing mass data to local Computing Elements. To let a local Storage System look like a Storage Element, two conditions must be met. Storage Elements must be able to communicate to each other in order to exchange mass data between sites running different Storage System and Storage Elements have to provide local data through standard methods to allow GRID jobs to access data files in a site independent manner. The first requirement is covered by a protocol called the Storage Resource Manager, SRM [3], defining a set of commands to be implemented by the local Storage System to enable remote access. It mainly covers queries about the availability of datasets as well as commands to prepare data for remote transfer and to negotiate appropriate transfer mechanisms. dCache is providing an SRM interface and has proven to be able to talk to other implementations of the SRM. A dCache system at FERMI is successfully exchanging data with the CASTOR Storage Manager at CERN using the SRM protocol for high level communication and GridFtp for the actual data transfer. The second requirement, to make local files available to Grid Applications, is approached by the GFile initiative, a quasi standard as well. It offers well defined, posix like function calls that allow site independent access to files held by the local Storage Element. Optionally G-File can talk to other grid modules to register imported files or files being exportable. G-File developers at CERN have successfully linked the g-file library against the dCache dCap library. 5. Dissemination In the meantime, dCache is in production at various locations in Europe and the US. The largest installation is, to our knowledge, the CDF system at FERMI [2]. 150 Tbytes are stored on commodity disk systems and in the order of 25 Tbytes have been delivered to about 1000 clients daily for more than a year. FERMI dCache installations are typically connected to ENSTORE [11], the FERMI tape storage system. CDF is operating more than 10 tape-less dCache installations outside of FERMI, evaluating the dCache Replica Manager. The US devision of the LHC CMS [13] experiment is using the dCache as Grid Storage Element and large file store in the US and Europe. At DESY, dCache is connected to the Open Storage Manager (OSM) and serving data out of 70 Tbytes of disk space. The German LHC Grid Tier 1 center in Karlruhe (GridKa,[12]) is in the process of building a dCache installation as Grid Storage Element, connected to their Tivoli Storage Manager [16] installation. 6. 
References

[1] DESY: http://www.desy.de
[2] FERMI: http://www.fnal.gov
[3] SRM: http://sdm.lbl.gov/srm-wg
[4] LCG: http://lcg.web.cern.ch/LCG/
[5] CASTOR Storage Manager: http://castor.web.cern.ch/castor/
[6] dCache Documentation: http://www.dcache.org
[7] GsiFtp: http://www.globus.org/datagrid/deliverables/gsiftp-tools.html
[8] Secure Ftp: http://www.ietf.org/rfc/rfc2228.txt
[9] NFS2: http://www.ietf.org/rfc/rfc1094.txt
[10] Fermi CDF Experiment: http://www-cdf.fnal.gov
[11] Fermi Enstore: http://www.fnal.gov/docs/products/enstore/
[12] GridKA: http://www.gridka.de/
[13] Cern CMS Experiment: http://cmsinfo.cern.ch
[14] Cern LHC Project: http://lhc.web.cern.ch/LHC
[15] Grid g-file: http://lcg.web.cern.ch/LCG/peb/GTA/GTA-ES/Grid-File-AccessDesign-v1.0.doc
[16] Tivoli Storage Manager: http://www-306.ibm.com/software/tivoli/products/storage-mgr/

HIERARCHICAL STORAGE MANAGEMENT AT THE NASA CENTER FOR COMPUTATIONAL SCIENCES: FROM UNITREE TO SAM-QFS

Ellen Salmon, Adina Tarshish, Nancy Palm
NASA Center for Computational Sciences (NCCS)
NASA Goddard Space Flight Center (GSFC), Code 931
Greenbelt, Maryland 20071
Tel: +1-301-286-7705
e-mail: Ellen.M.Salmon@nasa.gov

Sanjay Patel, Marty Saletta, Ed Vanderlan, Mike Rouch, Lisa Burns, Dr. Daniel Duffy
Computer Sciences Corporation, NCCS GSFC
Greenbelt, Maryland 20071
Tel: +1-301-286-3131
e-mail: sjpatel@calvin.gsfc.nasa.gov

Robert Caine, Randall Golay
Sun Microsystems, Inc.
7900 Westpark Drive
McLean, VA, 22102
Tel: +1-703-280-3952
e-mail: Robert.Caine@sun.com

Jeff Paffel, Nathan Schumann
Instrumental, Inc.
2748 East 82nd Street
Bloomington, MN 55425
Tel: +1-715-832-1499
e-mail: jpaffel@instrumental.com

Abstract

This paper presents the data management issues associated with a large center like the NCCS and how these issues are addressed. More specifically, the focus of this paper is on the recent transition from a legacy UniTree (Legato) system to a SAM-QFS (Sun) system. Therefore, this paper will describe the motivations, from both a hardware and software perspective, for migrating from one system to another. Coupled with the migration from UniTree to SAM-QFS, the complete mass storage environment was upgraded to provide high availability, redundancy, and enhanced performance. This paper will describe the resulting solution and lessons learned throughout the migration process.

1. Introduction

The Science Computing Branch of the Earth and Space Data Computing Division at the Goddard Space Flight Center (GSFC) manages and operates the NASA Center for Computational Sciences (NCCS).[1] The NCCS is a shared center providing supercomputing services and petabyte-capacity data storage to a variety of user groups. Its mission is to enable Earth and space sciences research through computational modeling by providing its user community access to state-of-the-art facilities in High Performance Computing (HPC), mass storage technologies, high-speed networking, and HPC computational science expertise. The largest workloads currently being performed at the NCCS consist of Earth system and climate modeling, prediction, and data assimilation. Input data for these applications come from many sources, including ground and satellite stations. Both computer and sensor technology have grown dramatically within the last decade, causing a boom in the amount of data generated by these types of sources.[2] The major groups that comprise the NCCS user community include the following:
• Global Modeling and Assimilation Office (GMAO): consists of both the Seasonal-to-Interannual Prediction Project (NSIPP) and the Data Assimilation Office (DAO); produces ensembles of simulations of near-term climate and creates research-quality assimilated global data sets from multiple satellites for climate analysis and observation planning.
• Goddard Institute for Space Studies (GISS): produces climate studies focusing on timescales ranging from a decade to a century.
• ESTO/Computational Technologies Project: develops the Earth System Modeling Framework (ESMF).
• Atmospheric Chemistry: research teams investigating the evolution of the composition of the Earth's atmosphere and its impact on weather and climate.
• Research and Analysis Group: a large collection of smaller research efforts.

2. Data Management at the NCCS

With over 3 Teraflops of computational capacity, the research performed throughout the heterogeneous environment of the NCCS uses large amounts of existing data for new computational studies while generating large amounts of new data from the output of these studies. In general, the total data stored at the NCCS is growing at approximately 125 TB of data per year, which includes both primary and secondary copies of user data. As an example of this net growth of data, during FY03, a total of 207 TB of new data was stored while approximately 143 TB of data was deleted. This resulted in a net growth of 64 TB of single-copy data, or 128 TB when duplicated. In addition, the number of files managed by the Mass Data Storage and Delivery System (MDSDS) has grown from 3.5 million in 1999 to more than 10 million in 2003. Figure 1 shows the linear data growth as measured at the end of the fiscal year (month of September) for the past five years of only the legacy UniTree data. This trend is expected to increase dramatically in the next few years as the diverse mass storage facilities at the NCCS are consolidated and with increased utilization of the computational resources.

Figure 1: Growth of Mass Data Storage and Delivery System (MDSDS) data and files at the NCCS (total data with duplicates, in TB, and number of files, in millions, from September 1999 through September 2003).

Throughout any given day, data is pulled from and stored to the MDSDS as jobs are run on the computational platforms. The NCCS measures the inbound and outbound data of the mass storage system and has seen traffic, files being transferred into and out of the mass storage system, of up to a total of 2.9 TB for a single day. Therefore, the resulting storage system must not only keep pace with the increase in overall storage, but also maintain the capability to serve larger amounts of data on demand.

3. Hierarchical Storage Management (HSM)

An HSM consists of different layers of storage capability for users to store and retrieve their data. Typically, a high-speed disk cache is used as the first layer of storage, and software is run to migrate files from the high-speed disk cache (first-layer storage) to slower tape media (second-layer storage). There are two existing HSM systems at the NCCS. The NCCS provides the MDSDS, previously running UniTree, for high-performance long-term storage for most NCCS user data.[3,4] A second system, which uses the SGI Data Migration Facility (DMF), supports the GMAO DAO users.
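As a toy illustration of the first-tier policy such an HSM typically applies (this is not UniTree or SAM-QFS code; the thresholds and metadata fields are assumptions made for the example), disk-cache space is reclaimed by releasing already-archived files once usage crosses a configured threshold:

def release_cache_space(cache_files, cache_capacity, high_water=0.90, low_water=0.75):
    # cache_files: list of dicts with "name", "size", "last_access" and
    # "archived_to_tape" fields (field names invented for this sketch)
    used = sum(f["size"] for f in cache_files)
    if used <= high_water * cache_capacity:
        return []                      # below the threshold: nothing to do
    released = []
    # release least recently used files first, and only files that already
    # have a tape copy, so no data ever exists solely on the disk cache
    for f in sorted(cache_files, key=lambda f: f["last_access"]):
        if not f["archived_to_tape"]:
            continue
        released.append(f["name"])
        used -= f["size"]
        if used <= low_water * cache_capacity:
            break
    return released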
This paper only discusses the replacement of the MDSDS's OTG DiskXtender Storage Manager (DXSM) software, formerly known as UniTree Central File Manager (UCFM). The UniTree management software runs on an aging Sun E10K with eight TB of Data Direct Network high performance disk storage. The MDSDS manages eight StorageTek (STK) Powderhorn 9310 robotic silos (five primary silos in the NCCS's primary building and three secondary risk-mitigation silos in a building a mile away). For all user data, the MDSDS is configured to make a primary copy of files on tapes in the NCCS's primary building and a secondary copy on the tapes in the risk-mitigation location. Users access the MDSDS through the File Transfer Protocol (FTP) from any of the computational platforms or even their desktops. A home directory for each user is defined within a single MDSDS file system. Files that are put into UniTree are first copied into the MDSDS file system, then archived to tape, and later released from disk cache according to NCCS policies. File retrievals are transparent whether the file resides on disk or must first be staged from tape; however, any retrieval of a file from tape incurs a latency to load the tape into a tape drive, position the tape to the beginning of the file, and then copy the file to the disk.

While UniTree was a very reliable mass storage software system, by the middle of 2001 it became apparent that the recently modified capacity license cost model for UniTree was not compatible with the NCCS budget in light of the NCCS users' projected growth over the ensuing years. The NCCS began exploring alternatives and undertook a detailed feature comparison of four major storage management systems used for several years in high performance computing environments. The candidates were SGI's Data Migration Facility (DMF), IBM's High Performance Storage System (HPSS), Sun's SAM-QFS (also known as Sun StorEdge Performance and Utilization Suite), and UniTree. The various solutions were evaluated based on the following attributes:

• Performance: meet the needs for user requests for storage and retrieval of data.
• Integrity/High Availability: a stable and safe environment more readily available than the existing HSM.
• Flexible/Modular/Scalable: allows for the maximum possible options for hardware and software and can scale with the users' requirements.
• Balance: avoid bottlenecks throughout the flow of data to the storage media.
• Manageable: tools provide a rich environment for administration and reporting.

4. Sun/SAM-QFS Solution

An internal panel evaluated the vendor responses and awarded the highest rating to the Sun SAM-QFS proposal. Notably, the Sun proposal scored high marks for its ability to be configured for high availability by sharing file systems in a clustered environment, its ability to "stream" the writing of tiny files to tape by combining them into "containers," and by having the largest customer base. Complementary to the Sun proposal, the NCCS also purchased more disk space and tape drive upgrades. The resulting system continued to leverage the existing investment in the STK hardware while providing a viable system to meet the future needs of the NCCS. A Sun Fire 15K system was purchased and configured into two distinct domains. These two domains, along with multiple interfaces to each, provide a highly available system for the user community.
Fully redundant, SAM-QFS provides the necessary storage management software to provide multiple file systems with storage, archive management, and retrieval capabilities for a variety of storage media. The major components that make up the Sun SAM-QFS software are as follows:[6]

• Archiver: automatically copies online disk cache files to archive media. The archive media can consist of either online disks or removable media cartridges.
• Releaser: automatically maintains the file system's online disk cache at site-specified percentage usage thresholds by freeing disk blocks occupied by eligible archived files.
• Stager: restores file data to the disk cache. When a user or a process requests file data that has been released from disk cache, the stager automatically copies the file data back to the online disk cache.
• Recycler: clears archive volumes of expired archive copies and makes volumes available for reuse.

One of the key requirements from the outset of the transition to a new HSM was for the legacy UniTree data to remain transparently accessible to the user community through the new system. To facilitate the access of UniTree data on SAM-QFS, the entire file name space and directory structure of the UniTree system was recreated as directories and inodes in SAM-QFS. These inodes were basically placeholders, or links, to the original files in UniTree and contained an NCCS-defined volume serial number (VSN) and a "stranger" tape media type. Using the SAM migration toolkit, a set of libraries was created by Instrumental, Inc. to satisfy a stage request in SAM-QFS for a legacy UniTree file, which was identified to SAM by the "stranger" media type and the NCCS-defined VSN. Therefore, if a user requests a file that resides in UniTree, these libraries transparently retrieve the specified file from the UniTree system over a private network. Once the file has been retrieved from UniTree, it exists within the SAM-QFS file system with two archive copies written to SAM tape and no longer needs to be retrieved from the legacy HSM.

Complementary to user-driven access to the legacy data in UniTree, the NCCS has written Perl scripts to actively migrate the data from UniTree into SAM-QFS. These Perl scripts migrate files on a tape-by-tape basis and run "behind the scenes" to minimize the impact on the production environment. A single migration stream will secure files on a UniTree VSN from a well-defined list of UniTree tapes. This stream will get the current status of each file on that tape, i.e., whether or not the user has already migrated the file by retrieving it from tape or has even deleted the file. Next, the migration stream will begin to transfer the files over the private network using FTP. When the legacy files are retrieved to the SAM-QFS disk cache, SAM writes two tape archive copies. After the migration stream has been completed, a separate analysis Perl script is run on each UniTree tape to verify that the files are in SAM-QFS. For quality control purposes, a checksum is run on every 100th file. The current rate of migration is approximately 2 TB of data per day.

5. Conclusions

The integration effort of installing a new system into an existing High Performance Computing environment is difficult and requires much planning and effort. The installation of the new Sun SAM-QFS system was no exception, and many valuable lessons were learned.
• Migration of Legacy Data: The goal of migrating hundreds of terabytes of data while still providing users with the ability to store and retrieve new files and transparently access legacy data is nontrivial and takes significant resources. The amount of time and resources, i.e., tapes, tape drives, and network bandwidth, needs to be accurately estimated from the beginning and built into the integration plan such that users are not overly disrupted during the transition period as data is being migrated.
• Test System: It is important to have a test environment in which configuration modifications, such as operating system or storage software upgrades, may be tested without affecting the production environment.
• User Account Management: With two highly available domains on the new system, NCCS-specific scripts were developed to synchronize user accounts between the two domains.
• Pilot User Phase: Before turning the system over to production computing, the internal NCCS staff and a set of pilot users were permitted access to the system. This phase allowed for a thorough testing of the environment before the full user community was allowed access.
• Staff and User Training: While the SAM-QFS system was designed to be as consistent as possible with UniTree, several training sessions were held with the staff and with the users to attempt to answer common questions. This allowed the user community to immediately begin using the new system and the staff to better support users from the beginning.
• Software Upgrades: Maintaining concurrency with the vendor's most recent release levels of operating systems and software is extremely important. Most vendors do not have the means to retroactively fix bugs for earlier release levels.
• Security: Define the necessary security requirements at the beginning of the process, and let those requirements drive the solution. It is more costly and disruptive to secure a system after it has been installed and patterns of use have been developed by the user community.

The NCCS successfully transitioned the Sun SAM-QFS system into the production environment in September of 2003. The active migration of the more than 300 TB of data is slated to be completed in May of 2004. The new system has proven to be very reliable and capable of handling heavier loads than its predecessor. To date, the NCCS has seen tape activity, both user demand and migrations, exceed 9.8 TB for a single day. As the NCCS continues to add computational capacity and as the user community continues to push the limits of modeling and assimilation to new heights, the HSM must evolve and adapt to the continued increase of requirements. The NCCS will incorporate the disk cache from UniTree into the production SAM-QFS system once the migration of the legacy data is complete. Also, the NCCS is analyzing the use of serial ATA, commodity-based disk storage, as a second-tier storage to sit between the high-speed disk and slower tape. Finally, the NCCS is currently developing a data management system, based on the Storage Resource Broker (SRB),[7] to provide users with a single interface to storage and more control over their own data administration.

References

[1] http://nccs.nasa.gov.
[2] Performance Management at an Earth Science Supercomputer Center, Jim McGalliard and Dick Glassbrook.
[3] Storage and Network Bandwidth Requirements Through the Year 2000 for the NASA Center for Computational Sciences, Ellen Salmon, Proceedings of the Fifth Goddard Conference on Mass Storage Systems and Technologies, (1996) pp. 273-286.
[4] Mass Storage System Upgrades at the NASA Center for Computational Sciences, A. Tarshish, E. Salmon, M. Macie, and M. Saletta, Proceedings of the Eighth NASA Goddard Conference on Mass Storage Systems and Technologies, Seventh IEEE Symposium on Mass Storage Systems, (2000) pp. 325-334.
[4] UniTree to SAM-QFS Project Plan, Jeff Paffel, Instrumental, Inc., NCCS internal report.
[5] UniTree to SAM-QFS Migration Procedure, Daniel Duffy, Computer Sciences Corporation, NCCS internal report.
[6] Sun SAM-FS and Sun SAM-QFS Storage and Archive Management Guide, August 2002; Sun QFS, Sun SAM-FS, and Sun SAM-QFS File System Administrator's Guide.
[7] http://www.npaci.edu/DICE/SRB/.

Parity Redundancy Strategies in a Large Scale Distributed Storage System
John A. Chandy, Dept. of Electrical and Computer Engineering, University of Connecticut, Storrs, CT 06269-1157, jchandy@uconn.edu, tel +1-860-486-5047

Abstract

With the deployment of larger and larger distributed storage systems, data reliability becomes more and more of a concern. In particular, redundancy techniques that may have been appropriate in small-scale storage systems and disk arrays may not be sufficient when applied to larger scale systems. We propose a new mechanism called delayed parity generation with active data replication (DPGADR) to maintain high reliability in a large scale distributed storage system without sacrificing fault-free performance.

1 Introduction

Data creation and consumption has increased significantly in recent years, and studies have suggested that the amount of information stored digitally will continue to double every year for the foreseeable future. This increasing need for information storage leads to a corresponding need for high-performance and reliable storage systems. Single storage nodes cannot provide the required storage capacity or scalability. Thus, in an effort to satisfy this need, there has been significant work in the area of distributed storage systems where storage nodes are aggregated together into a larger cohesive storage system. These include distributing data amongst shared disks [1, 6, 12], dedicated storage nodes [8], clustered servers [4], or the clients themselves [2, 7]. It is not unreasonable to expect systems with petabytes of data distributed across thousands of nodes in these distributed storage systems. However, as we increase the number of nodes, the reliability of the entire system decreases correspondingly unless steps are taken to introduce some form of redundancy into the system. In this paper, we discuss redundancy mechanisms to provide high reliability without sacrificing fault-free performance. For the purposes of this paper, we refer to individual storage devices in the distributed storage system as nodes - whether they be disks in a shared disk SAN, servers in a clustered server, or OBSDs in a network attached disk system. The techniques and strategies apply with slight variations to all implementations. Data redundancy in most disk array subsystems is typically provided by using RAID. These same techniques used at the disk level can also be used at the node level. Mirroring, or RAID1, entails replication of the data on multiple nodes. Parity striping, or RAID5, involves spreading data along with parity across multiple nodes.
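As a concrete illustration of parity striping (a generic sketch in Python, not code from the paper), the parity block for a stripe is simply the bitwise XOR of the stripe's data blocks, and any single lost block can be rebuilt from the survivors:

# Generic RAID5-style parity illustration (illustrative only).
def xor_blocks(blocks):
    """Return the bitwise XOR of equal-length byte blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

def make_parity(data_blocks):
    # The parity block protecting one stripe of data blocks.
    return xor_blocks(data_blocks)

def rebuild_missing(surviving_blocks, parity):
    # XOR of the parity with all surviving blocks recovers the lost one.
    return xor_blocks(surviving_blocks + [parity])

stripe = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = make_parity(stripe)
lost = stripe.pop(2)                      # simulate losing one node's block
assert rebuild_missing(stripe, parity) == lost

Updating a single data block in place requires reading the old data and the old parity first, which is the small-write penalty discussed below.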
Choosing which RAID level to use is typically determined by cost and application requirements. At the disk array level, the redundancy choice is usually RAID5 as it provides excellent availability, little storage overhead, and adequate performance. However, with a large scale distributed system, the choice is not so clear. RAID5 no longer provides sufficient reliability since a thousand node system could exhibit MTTFs of a few years. A mirrored distributed storage system, however, can have an MTTF of several decades. In addition, RAID5 suffers from the well-known write penalty whereby parity updates require two extra reads to generate the parity. Techniques to overcome this problem in a disk subsystem such as the use of non-volatile caches can not be used in a distributed system. Moreover, the cost of the write penalty is more significant because of the high latency costs inherent in network communications. Because of the limitations of parity striping, in many distributed storage systems, replication or mirroring is the preferred strategy for redundancy [3, 15]. In addition, replication allows widely distributed clients and nodes to take advantage of locality and retrieve data from the closest storage node. However, the cost of mirroring is the 100% storage overhead. In this paper, we present methods to achieve the low storage overhead of parity striping but retaining the performance and reliability characteristics of mirroring. In particular, we discuss the use of delayed parity generation to improve parity striping performance. 2 Delayed Parity Generation with Active Data Replication The concept of delayed parity generation with active data replication (DPGADR) is based on reducing the number of accesses required to generate parity in a RAID5 system. In a standard RAID5 disk array, the array controller must read old values from both the data and parity disks and then write the new data back to the data disk and the XOR’ed parity result back to the parity disk. This results in a total of 4 disk accesses (potentially 2 if the parity and data reads had been cached). In a DPGADR system, we delay the generation of the parity, and thus do not require the reading of old data and parity or the writing the parity result. However, without parity generation, the system is potentially compromised in the event of failure. To address this, we replicate the new data to a replication node that is not part of the RAIDed redundancy group. We have reduced the number of accesses to just 2 writes - both of which can be proceed in parallel. Figure 1 shows the data distribution in a DPGADR system. In order to distribute the load, the replication node can be rotated across the redundancy group. Each replication node keeps a map relating the actual block location to the active data locations kept on the node. Thus, the client is not responsible for identifying the block location for the replicated data on the replication node. The client can use a simple hash algorithm to map from block ID to replication node, and it is then the replication node’s responsibility to allocate storage space locally. On the surface, this DPGADR scheme appears to be simply mirroring of data. However, we do not maintain the mirrored data on the replication node in perpetuity. 
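The DPGADR write path just described can be summarized in a few lines. This is an illustrative Python sketch of the scheme as described above, not the author's implementation; the Node class, the in-memory stores, and the MD5-based choice of replication node are assumptions made here for the example.

import hashlib
from concurrent.futures import ThreadPoolExecutor

class Node:
    """Hypothetical storage node handle; a real system would issue network I/O."""
    def __init__(self, name):
        self.name = name
        self.store = {}                    # block_id -> data
    def write(self, block_id, data):
        self.store[block_id] = data

REPLICATION_NODES = [Node("R0"), Node("R1"), Node("R2"), Node("R3")]

def pick_replication_node(block_id):
    # Simple hash from block ID to replication node, as described in the text;
    # the replication node itself decides where the block lands locally.
    h = int(hashlib.md5(str(block_id).encode()).hexdigest(), 16)
    return REPLICATION_NODES[h % len(REPLICATION_NODES)]

def dpgadr_write(data_node, block_id, new_data):
    """DPGADR small write: two writes issued in parallel, no parity update yet."""
    repl_node = pick_replication_node(block_id)
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(data_node.write, block_id, new_data),
                   pool.submit(repl_node.write, block_id, new_data)]
        for f in futures:
            f.result()
    # Parity for the affected stripe is generated later by the replication node.

How long the replicated copy is retained is governed by the periodic parity generation and flushing described next.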
In order to avoid replicating all data writes, the replication node will periodically generate the parity for any blocks that it contains and then flush these blocks from its data store. Because of this periodic data flushing, the replication node is in effect a cache of actively used data blocks.

[Figure 1: DPGADR data distribution. (a) Initial data distribution; (b) data distribution after writes to D11 and D32.]

Parity generation from the replication node is not a trivial problem as the replication node may not have the required data to generate the parity. In such a case, the node must retrieve the stripe data from the relevant nodes before generating the parity. However, in general, it is likely that the active data set on the replication node will contain all or most of the data blocks in a particular stripe because of locality and small working set sizes [11]. We would like to keep the stripe length small so as to make sure that the entire stripe is in the active data set on the replication node. This is also desirable for reliability reasons since it reduces the size of the redundancy group. A reasonable stripe length is 5 nodes, thus requiring 200 stripes to span a 1000 node system.

3 Reliability

As mentioned above, as we increase the number of nodes in a large scale storage system, RAID5 parity striping no longer provides sufficient redundancy to give adequate system reliability. The use of DPGADR can improve system reliability significantly. Since recently used data is copied to a replication node, the system exhibits mean time to data loss (MTTDL) characteristics near to that of a mirrored system. Figure 2 illustrates how the system can tolerate more than one failure in a redundancy set and still recover the most recent data.

[Figure 2: DPGADR failure scenario.]

Even though two nodes have failed, the active data blocks, D11' and D32', are still available from the replication node. In fact if all the nodes except for the replication node fail, the DPGADR method allows for the recovery of all active data. While the replication node can prevent loss of active data in the presence of multiple failures, inactive data can still be lost if there is more than one fault. To prevent data loss in such a scenario, we require that the system have judicious backup procedures so that all inactive data is present on backup media. The replication node must be large enough to accommodate all active data between backups. This need not be that large as the working set size of a storage system over a 24 hour period is typically around 5% of the entire storage space [11]. Thus, it is sufficient to have one replication node for every 20 data nodes and using daily backups to prevent data loss of inactive data in the presence of dual faults. We also suspend access to the DPGADR group once a second fault has been recorded. This prevents invalid data being read from inactive data blocks. Thus, the DPGADR method prevents data loss in most dual fault cases but system availability is the same as a RAID system because of the access blocking due to a second fault. We can develop a model for the MTTDL and availability of a DPGADR system using a similar analysis methodology to that outlined in [10].
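Before developing that model, the periodic parity-generation and flush step described above can be shown in code form. This is a hedged Python sketch under assumed interfaces; the stripe lookup and the node read/write callables are inventions for the example, not the paper's API.

def flush_replication_node(repl_store, stripe_of, read_block, write_parity):
    """Generate parity for every stripe touched by the cached blocks, then evict.

    repl_store   : dict mapping block_id -> latest data held on the replication node
    stripe_of    : callable block_id -> (stripe_id, [member block_ids], parity_node)
    read_block   : callable fetching a block from its data node (only needed for
                   stripe members missing from the replication node's active set)
    write_parity : callable writing the recomputed parity block to the parity node
    """
    done_stripes = set()
    for block_id in list(repl_store):
        stripe_id, members, parity_node = stripe_of(block_id)
        if stripe_id in done_stripes:
            continue
        # Prefer cached copies; fetch only the stripe members we do not hold.
        blocks = [repl_store[b] if b in repl_store else read_block(b) for b in members]
        parity = bytearray(len(blocks[0]))
        for blk in blocks:
            for i, byte in enumerate(blk):
                parity[i] ^= byte
        write_parity(parity_node, stripe_id, bytes(parity))
        done_stripes.add(stripe_id)
    # Once the parity is durable, the cached copies can be dropped.
    repl_store.clear()

With parity regenerated and the cache emptied, the group is back in an ordinary RAID5-like state, which is the situation the reliability model below analyzes.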
The nodes are assumed to have independent and exponential failure rates. We assume d nodes per parity group, one redundancy node per n_G parity groups, and a mean time to failure of each node of MTTF_{Node}. The MTTDL for a DPGADR group is:

    MTTDL_{DPGADR} = \frac{MTTF_{Node}}{(n_G(d+1)+1)\,\Pr[\text{data loss failure during repair time}]}    (1)

Data loss during the repair time of the failed node can happen in three cases: 1) the first failed node was the replication node and any other node fails, 2) the first failed node was not the replication node and the replication node fails, and 3) the failed node was not the replication node and two non-replication nodes in the same parity group fail. Thus, the probability of a data loss causing failure during the repair time is as follows:

    \Pr[\text{data loss failure during repair time}] = \Pr[\text{first failed node was a replication node}]\,p_f + (1 - \Pr[\text{first failed node was a replication node}])\,(p_{rf} + p_{2f})    (2)

where p_f is the probability that any one of the remaining n_G(d+1) nodes fails during the repair time, p_{rf} is the probability that the replication node fails during the repair time, and p_{2f} is the probability that 2 non-replication nodes from the same parity group fail during the repair time. The derivation of p_f is straightforward. If we define the mean time to repair the node as MTTR_{Node}, then assuming exponential failure rates,

    p_f \approx \frac{n_G(d+1)\,MTTR_{Node}}{MTTF_{Node}}    (3)

when MTTF_{Node} \gg MTTR_{Node}. Similarly, p_{rf} is equal to MTTR_{Node}/MTTF_{Node}. p_{2f} can be expressed as follows:

    p_{2f} = \binom{d}{2}\left(\frac{MTTR_{Node}}{MTTF_{Node}}\right)^{2}\left(1 - \frac{MTTR_{Node}}{MTTF_{Node}}\right)^{d-1} \approx \frac{d(d-1)}{2}\left(\frac{MTTR_{Node}}{MTTF_{Node}}\right)^{2}    (4)

Substituting into Eqs. 1 and 2, we arrive at:

    MTTDL_{DPGADR} \approx \frac{MTTF_{Node}^{2}}{n_G(d+1)\,MTTR_{Node}}    (5)

In a large system with n_S DPGADR groups, the MTTDL is MTTF_{Node}^2 / (n_S\,n_G(d+1)\,MTTR_{Node}). If we define D as the total number of data nodes in the system, i.e. n_S n_G d, we can rewrite the MTTDL as MTTF_{Node}^2 / (D(1 + 1/d)\,MTTR_{Node}). The redundancy overhead to support parity and replication is D/d + D/(n_G d) nodes. By comparison, a mirrored system with D data nodes has a MTTDL of MTTF_{Node}^2 / (2D\,MTTR_{Node}) with an overhead of D nodes, and a RAID5 system has an MTTDL of MTTF_{Node}^2 / (D(d+1)\,MTTR_{Node}) and an overhead of D/d nodes. The DPGADR system actually has better MTTDL times than a mirrored system with significantly less redundancy overhead. Table 1 shows MTTDL and overhead numbers for a 1000 data node system.

    Configuration           MTTDL (years)    Redundancy overhead
    RAID5 (d=5)             7.9              200 nodes
    Mirror                  23.8             1000 nodes
    DPGADR (d=5, n_G=4)     39.6             250 nodes

    Table 1: MTTDL and overhead for a 1000 data node system (MTTF_{Node} = 100000 hours, MTTR_{Node} = 24 hours).

Note that while the MTTDL of a DPGADR system is better than a mirrored system, the availability, i.e. the probability that the system is available for use, is more like a RAID5 system. This is because after a second failure, the system is suspended until the repair is complete.

4 Related Work

Xin et al. [15] have proposed three different large storage system redundancy architectures with two fast recovery mechanisms: fast mirroring copy (FMC) and lazy parity backup (LPB). The LPB method is similar to the DPGADR scheme in that parity calculation is delayed.
However, it relies on a RAID5 redundancy set to be completely mirrored at a 189 greater than 100% storage overhead. The DPGADR scheme requires significantly less overhead since only actively used data is replicated. Other related work is in the area of RAID5 disk arrays and of particular interest are parity logging [13], data logging [5], and hot mirroring [9, 14]. The parity logging technique eliminates the need for parity disk accesses by caching the partial parity formed from the old and new data in non-volatile memory at the controller. The partial parity can then be periodically flushed to a log disk, which can then be cleaned out at a later time to generate the actual parity disk data. This process reduces the number of disk accesses from 4 to 2 and clearly, this reduction in accesses will greatly speed up the performance of writes in a RAID system. Parity logging, however, is not practical in a distributed system because of the need to cache data in non-volatile memory. It is not reasonable to expect distributed system clients to have non-volatile memory available. Moreover, the management of the cache across multiple clients can be problematic. The DPGADR system does not require non-volatility at the client since all data is pushed out to the replication node immediately. Data logging is similar to DPGADR except that it performs an old data read and stores that to the data log as well. This requires an extra disk access and the maintenance of log maps requires non-volatile memory at the clients as well. Hot mirroring [9] and AutoRAID [14] are similar techniques that attempt to move actively used data to mirrored regions of the array and less frequently used data to parity logged regions (hot mirroring) or parity striped regions (AutoRAID). These system require a background process that evaluates the “hotness” of data and then moves them to or from mirrored regions as required. If a data block is in the parity striped region, it will remain there until a background process has tagged it as hot even if it is experiencing high activity. DPGADR systems, however, dynamically adjust to the activity of the data since the latest data is always pushed to the mirrored region, i.e. the replication node. 5 Conclusions In this paper, we have described a delayed parity construction mechanism called DPGADR that allows parity striping to be used on large scale distributed storage systems without suffering from the small write performance penalty. Compared to mirroring it can reduce the storage overhead from 100% to less than 20% and compared to parity striping it can reduce small write accesses to just two parallel accesses. Because of redundancy in the active data replication node, the overall system reliability is better than mirroring for significantly less overhead. References [1] D. Anderson and J. Chase. Failure-atomic file access in the Slice interposed network storage system. Cluster Computing, 5(4):411–419, Oct. 2002. [2] T. Anderson, M. Dahlin, J. Neefe, D. Patterson, D. Roselli, and R. Wang. Serverless network file systems. In Proceedings of the Symposium on Operating System Principles, pages 109–126, Dec. 1995. 190 [3] W. Bolosky, J. Douceur, D. Ely, and M. Theimer. Feasibility of a serverless distributed file system deployed on an existing set of desktop PCs. In Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 34–43, June 2000. [4] J. D. Bright and J. A. Chandy. A scalable architecture for clustered network attached storage. 
In Proceedings of the IEEE/NASA Goddard Symposium on Mass Storage Systems and Technologies, pages 196–206, Apr. 2003. [5] E. Gabber and H. F. Korth. Data logging: A method for efficient data updates in constantly active RAIDs. In Proceedings of the International Conference on Data Engineering, pages 144–153, 1998. [6] G. A. Gibson and R. Van Meter. Network attached storage architecture. Commun. ACM, 43(11):37–45, Nov. 2000. [7] J. H. Hartman and J. K. Ousterhout. Zebra: A striped network file system. In Proceedings of the USENIX 1992 Workshop on File Systems, May 1992. [8] E. K. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 84–92, Oct. 1996. [9] K. Mogi and M. Kitsuregawa. Hot mirroring: A method of hiding parity update penalty and degradation during rebuilds for RAID5. In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 183–194, June 1996. [10] D. A. Patterson, G. A. Gibson, and R. H. Katz. A case for redundant arrays of inexpensive disks (RAID). In Proceedings of the ACM SIGMOD International Conference on Management of Data, pages 109–116, June 1988. [11] C. Ruemmler and J. Wilkes. A trace-driven analysis of working set sizes. Technical Report HPL-OSR-93-23, Hewlett-Packard, Palo Alto, CA, Apr. 1993. [12] F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of USENIX Conference on File and Storage Technologies, pages 231–244, Jan. 2002. [13] D. Stodolsky, G. Gibson, and M. Holland. Parity logging: Overcoming the small write problem in redundant disk arrays. In Proceedings of the International Symposium on Computer Architecture, 1993. [14] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID hierarchical storage system. ACM Transactions on Computer Systems, 14(1):108–136, Feb. 1996. [15] Q. Xin, E. L. Miller, T. Schwarz, D. D. E. Long, S. A. Brandt, and W. Litwin. Reliability mechanisms for very large storage systems. In Proceedings of the IEEE/NASA Goddard Symposium on Mass Storage Systems and Technologies, pages 146–156, Apr. 2003. 191 192 Reducing Storage Management Costs via Informed User-Based Policies Erez Zadok, Jeffrey Osborn, Ariye Shater, Charles Wright, and Kiran-Kumar Muniswamy-Reddy Stony Brook University {ezk, jrosborn, ashater, cwright, kiran}@fsl.cs.sunysb.edu Jason Nieh Columbia University nieh@cs.columbia.edu Abstract Elastic quotas enter users into a contract with the system: users can exceed their quota while space is available, under the condition that the system does not provide as rigid assurances about the file’s safety. Users or applications may designate some files as elastic. Non-elastic (or persistent) files maintain existing semantics. Elastic quotas create a hierarchy of data’s importance: the most important data will be backed up frequently; some data may be compressed and other data can be compressed in a lossy manner; and some files may not be backed up at all. Finally, if the system is running short on space, the elastic files may even be removed. Users and administrators can configure flexible policies to designate which files belong to which part of the hierarchy. Elastic quotas introduce little overhead for normal operations and demonstrate that through this new disk usage model, significant space savings are possible. Storage management costs continue to increase despite the decrease in hardware costs. 
We propose a system to reduce storage maintenance costs by reducing the amount of data backed up and reclaiming disk space using various methods (e.g., transparently compress old files). Our system also provides a rich set of policies. This allows administrators and users to select the appropriate methods for reclaiming space. Our performance evaluation shows that the overheads under normal use are negligible. We report space savings on modern systems ranging from 25% to 76%, which result in extending storage lifetimes by 72%. 1. Introduction 2. Motivational study Despite seemingly endless increases in the amount of storage and decreasing hardware costs, managing storage is still expensive. Furthermore, backing up more data takes more time and uses more storage bandwidth—thus adversely affecting performance. Users continue to fill increasingly larger disks. In 1991, Baker reported that the size of large files had increased by ten times since the 1985 BSD study [1, 8]. In 2000, Roselli reported that large files were getting ten times larger than Baker reported [9]. Our recent studies show that just merely by 2003, large files are ten times larger than Roselli reported. Today, management costs are five to ten times the cost of underlying hardware and are actually increasing as a proportion of cost because each administrator can only manage a limited amount of storage [4, 7]. We believe that reducing the rate of consumption of storage is the best solution to this problem. Independent studies [10] as well as ours indicate that significant savings are possible. To improve storage management via efficient use of storage, we designed the Elastic Quota System (Equota). Storage needs are increasing—often as quickly as larger storage technologies are produced. Moreover, each upgrade is costly and carries with it high fixed costs [4]. We conducted a study to quantify this growth, with an eye toward reducing this rate of growth. We identified four classes of files, three of which can reduce the growth rate and also the amount of data to be backed up. Similar classifications have been used previously [6] to reduce the amount of data to be backed up. First, there are files that cannot be considered for reducing growth. These files are important to users and should be backed up frequently, say daily. Second, studies indicate that 82–85% of storage is consumed by files that have not been accessed in more than a month [2]. Our studies confirm this trend: 89.1% of files or 90.4% of storage has not been accessed in the past month. These files can be compressed to recover space. They need not be backed up with the same frequency as the first class of files as least- 193 182.7GB 457.7GB 100% 80.3GB 14.9GB 20.8GB 87.5GB 52.4GB 80% 41.0GB 7.7GB 70% savings on group C: it has many .c files that compress well. Group B contains a large number of active users, so the percentage of files that were used in the past 90 days is less than that in the other sites. The next bar down (top hatched) is the savings from lossy compression of still images, videos, and sound files. The results varied from a savings of 2.5% for group A to a savings of 35% for group D. Groups B and D contain a large number of personal .mp3 and .avi files. As media files grow in popularity and size, so will the savings from a lossy compression policy. The next bar down represents space consumed by regenerable files, such as .o files (with corresponding .c’s) and ˜ files, respectively. This varied between 1.7% for group B to 40.5% for group A. 
This represents the amount of data that need not be backed up, or can be removed. Group A had large temporary backup tar files that were no longer needed. The amount of storage that cannot be reduced through these policies is the dark bar at the bottom. Overall, using the three space reclamation methods, we can save between 25% to 76.5% of the total disk space. To verify if applying the aforementioned space reclamation methods would reduce the rate of disk space consumption, we correlated the average savings we obtained in the above environments with the SEER [5] and Roselli [9] traces. We require filename and path information, since our space reclamation methods depend on file types, which are highly correlated with names [3]. We evaluated several other traces, but only the combination of SEER and Roselli’s traces provides us with the information we required. The SEER traces have pathname information but do not have file size information. Roselli’s traces do not contain any file name information, but have the file size information. We used the size information obtained by Roselli to extrapolate the SEER growth rates. The Roselli traces were taken around the same time of the SEER traces, and therefore give us a good estimate of the average file size on a system at the time. At the rate of growth exhibited in the traces, the hard drives in the machines would need to be upgraded after 11.14 months. We observed that our policies extended the disks’ lifetime to 19.2 months. The disk space growth rates were reduced by 52%. Based on these results, we have concluded that our policies offer promising storage management cost-reduction techniques. 1.7GB 90% 4.2GB 4.5GB 60% 74.1GB 50% 0.6GB 6.9GB 341.7GB 40% 7.4GB 13.5GB 30% compression lossy-compr non-backup Remainder 20% 51.7GB 10% 18.9GB 0% A B C D Survey Group Figure 1. Space consumed by different classes. Actual amounts appear to the right of the bars, with the total size on top. recently-used files are unlikely to change in the near future. Third, multimedia files such as JPEG or MP3 can be re-encoded with lower quality. This method carries some risk because not all of the original data is preserved, but the data is still available and useful. These files can be backed up less frequently than other files. Fourth, previous studies show that over 20% of all files—representing over half of the storage—are regenerable [10]. These files need not be backed up. Moreover, these files can be removed when space runs short. To determine what savings are possible given the current usage of disk space, we conducted a study of four sites, to which we had complete access. These sites include a total of 3,898 users, over 9 million files, and 735.8GB of data dating back 15 years: (A) a small software development company with 100 programmers, management, sales, marketing, and administrative users with data from 1992–2003; (B) an academic department with 3,581 users, mostly students, using data from shared file servers, collected over 15 years; (C) a research group with 177 users and data from 2000–2003; and (D) a group of 40 cooperative users with personal Web sites and data from 2000–2003. Each of these sites has experienced real costs associated with storage: A underwent several major storage upgrades in that period; B continuously upgrades several file servers every six months; the statistics for C were obtained from a file server that was recently upgraded; and D has recently installed quotas to rein in disk usage. 
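To make the four file classes used in this study concrete, the following Python sketch classifies a file by name and access age. The 90-day threshold matches the compression policy applied below; the extension lists are illustrative assumptions, not the exact sets used in the study.

import os, time

LOSSY_EXTS      = {".mp3", ".avi", ".jpg", ".jpeg", ".mpg", ".wav"}   # re-encodable media
COMPRESSED_EXTS = {".gz", ".bz2", ".zip"} | LOSSY_EXTS                # skip for gzip policy

def classify(path, now=None):
    """Return one of: 'regenerable', 'lossy', 'compressible', 'precious'."""
    now = now or time.time()
    ext = os.path.splitext(path.lower())[1]
    if path.endswith("~") or (ext == ".o" and
                              os.path.exists(os.path.splitext(path)[0] + ".c")):
        return "regenerable"          # need not be backed up; removable under pressure
    if ext in LOSSY_EXTS:
        return "lossy"                # can be re-encoded at lower quality
    age_days = (now - os.stat(path).st_atime) / 86400.0
    if age_days > 90 and ext not in COMPRESSED_EXTS:
        return "compressible"         # not accessed recently; transparent gzip candidate
    return "precious"                 # back up frequently; never reclaimed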
Figure 1 summarizes our study, starting with the top bar. We considered a transparent compression policy on all uncompressed files that have not been accessed in 90 days. We do not include already compressed data (e.g., .gz), compressed media (e.g., MP3 or JPEG), or files that are only one block long. In this situation, we save between 4.6% from group B to 51 % from group C. We yield large 3. Design Our two primary design goals were to allow for versatile and efficient elastic quota policy management. To achieve versatility we designed a flexible policy configuration language for use by administrators and users. To achieve efficiency we designed the system to run as a kernel file system with a database, which associates user IDs, file names, and inode numbers. Our present implementation marks a file as 194 elastic usage. The latter, called the shadow UID, is simply the ones-complement of the former. When an Edquot operation is called, Edquot determines if it was for an elastic or a persistent file, and informs dquot to account for the changed resource (inode or disk block) for either the UID or shadow UID. This allows us to use the existing quota infrastructure and utilities to account for elastic usage. EQFS communicates information about creation, deletion, renames, hard links, and ownership changes of elastic files to Rubberd’s database management thread over a netlink socket. Rubberd records this information in the BDB databases. Rubberd also records historical abuse factors for each user periodically, denoting the user’s elastic space utilization over a period of time. BDB Databases Elastic Quota Utilities ioctl’s quota system call RUBBERD Database Management Thread Disk Quota Mgmt. Policy Thread User Process system calls USER netlink socket ioctl’s KERNEL quota ops Elastic Disk Quota Mgmt. (Edquot) Stackable File System Elastic Quota File System (EQFS) quota ops VFS calls EXT2, EXT3, etc. Figure 2. Elastic Quota Architecture Elasticity Modes EQFS can determine a file’s elasticity in five ways. (1) Users can explicitly toggle the file’s elasticity, allowing them to control elasticity on a per file basis. (2) Users can toggle the elastic bit on a directory inode. Newly created files or sub-directories inherit the elastic bit. (3) Users can tell EQFS to create all new files elastically (or not). (4) Users can tell EQFS which newly-created files should be elastic by their extension. (5) Developers can mark files as elastic using two new flags we added to the open and creat system calls. These flags tell EQFS to create the new file as elastic or persistent. elastic using a single inode bit. A more complex hierarchy could be created using extended attributes. Architecture Figure 2 shows the overall architecture of our system. There are four components in our system: (1) EQFS is a stackable file system that is mounted on top of another file system such as Ext3 [13]. EQFS includes a component (Edquot) that indirectly manages the kernel’s native quota accounting. EQFS also sends messages to a user space component, Rubberd. (2) Berkeley DB (BDB) databases record information about elastic files [11]. We have two types of databases. First, for each user we maintain a database that maps inode numbers of elastic files to their names, allowing us to easily locate and enumerate each user’s elastic files. The second type of database records an abuse factor for each user denoting how “good” or “bad” a given user has been with respect to historical utilization of disk space. 
(3) Rubberd is a user-level daemon that contains two threads. The database management thread is responsible for updating the BDB databases. The policy thread periodically executes cleaning policies. (4) Elastic Quota Utilities are enhanced quota utilities that maintain the BDB databases and control both persistent and elastic quotas. 4. Elastic quota policies The core of the elastic quota system is its handling of space reclamation policies. File system management involves two parties: the running system and the people (administrators and users). To the system, file system reclamation must be efficient so as not to disturb normal operations. To the people involved, file system reclamation policies must consider three factors: fairness, convenience, and gaming. These three factors are important especially in light of efficiency as some policies can be executed more efficiently than others. We describe these three factors next. Fairness Fairness is hard to quantify precisely. It is often perceived by the individual users as how they personally feel that the system and the administrators treat them. Nevertheless, it is important to provide a number of policies that could be tailored to the site’s own needs. For example, some users might consider a largest-file-first compression or removal policy unfair because recently-created files may not remain on the system long enough to be used. For these reasons, we also provide policies that are based on individual users’ disk space usage: users that consume more disk space over longer periods of time are considered the worst offenders. Once the worst offenders are determined and the System Operation EQFS intercepts file system operations, performs related elastic quota operations, and then passes the operation to the lower file system (e.g., Ext2). EQFS also intercepts the quota management system call and inserts its own set of quota management operations, edquot. Quota operations are intercepted in reverse (e.g., from Ext2 to the VFS), because only the native disk-based file system knows when an operation has resulted in a change in the consumption of inodes or disk blocks. Each user on our system has two UIDs: one that accounts for persistent usage and another that accounts for 195 amount of disk space to clean from the users is calculated, the system must decide which specific files should be reclaimed from that user. Basic policies allow for time-based or size-based policies for each user. For the utmost in flexibility, users are allowed to define their own ordered list of files to be processed first. ter is an optional list of file name filters to apply the policy to. If not specified, the policy applies to all files. If users define their own policy files and Rubberd cannot reclaim enough space, then Rubberd continues to reclaim space as defined in the system-wide policy file. HSM systems operate similarly, however, at a system-wide level [6]. Convenience For a system to be successful, it should be easy to use and simple to understand. Users should be able to find out how much disk space they are consuming in persistent and elastic files and which of their elastic files will be removed first. Administrators should be able to configure new policies easily. The algorithms used to define a worst offender should be simple and easy to understand. For example considering the current total elastic usage is simple and easy to understand. 
A more complex and fair algorithm could count the elastic space usage over time as a weighted average, although it might be more difficult for users to understand. 4.2. Abuse factors When Rubberd reclaims disk space, it must provide a fair mechanism to distribute the amount of reclaimed space among users. To decide how much disk space to reclaim from each user, Rubberd computes an abuse factor (AF) for all users. Rubberd then distributes the amount of space to reclaim from each user proportionally to their AF. We define two types of AF calculations: current usage and historical usage. Current usage can be calculated in three ways. First, Equota can consider the total elastic usage (in disk blocks) the user consumes. Second, it can consider the total elastic usage minus the user’s available persistent space. Third, Equota can consider the total amount of space consumed by the user (elastic and persistent). These three modes give a system administrator enough flexibility to calculate the abuse fairly given any group of users (we also have modes based on a percentage of quota). Historical usage can be calculated either as a linear or as an exponential average of a user’s disk consumption over a period of time (using the same metrics as current usage). The linear method calculates a user’s abuse factor as the linear average over time, whereas the exponential method calculates the user’s abuse with an exponentially decaying average. Gaming Gaming is defined as the ability of individual users to circumvent the system and prevent their files from being processed first. Good policies should be resistant to gaming. For example, a global LRU policy that compresses older files could be circumvented simply by reading those files. Policies that are difficult to circumvent include a per-user worst-offender policy. Regardless of the file’s attributes, a user still owns the same total amount of data. Such policies work well on systems where it is expected that users will try to exploit the system. 4.1. Rubberd configuration files 4.3. Cleaning operation When Rubberd has to reclaim space, it first determines how much space it should reclaim—the goal. The configuration file defines multiple policies, one per line. Rubberd then applies each policy in order until the goal is reached or no more policies can be applied. Each policy in this file has four parameters. (1) type defines what kind of policy to use and can have one of three values: global for a global policy, user for a per-user policy, and user profile for a per-user policy that first considers the user’s own personal policy file. (2) method defines how space should be reclaimed. Our prototype currently defines two policies: gzip compresses files and rm removes them. This allows administrators to define a system policy that first compresses files and then removes them if necessary. A policy using mv and tar could be used together as an HSM system, archiving and migrating files to slower media at cleaning time. (3) sort defines the order of files being reclaimed. We define several keys: size (in disk blocks) for sorting by largest file first, mtime for sorting by oldest modification time first, and similarly for ctime and atime. (4) fil- To reclaim elastic space, Rubberd periodically wakes up and performs a statfs to determine if the high watermark has been reached. If so, Rubberd spawns a new thread to perform the reclamation. 
The thread reads the global policy file and applies each policy sequentially, until the low watermark is met or all policy entries are applied. The application of each policy proceeds in three phases: abuse calculation, candidate selection, and application. For user policies, Rubberd retrieves the abuse factor of each user and then determines the number of blocks to clean from each user proportionally to the abuse factor. For global policies this step is skipped since all files are considered without regard to the owner’s abuse factor. Rubberd performs the candidate selection and application phases only once for global policies. For user policies these two phases are performed once for each user. Rubberd then gets the attributes (size and times) for each file (EQFS allows Rubberd to get these attributes more efficiently by inode number rather than by name as required by stat). Rub- 196 berd then sorts the candidates based on the policy (e.g., largest or oldest files first). In the application phase, we reclaim disk space (e.g., compress the file) from the sorted candidates. Cleaning terminates once enough space has been reclaimed. backup software) and as much as 89.9% with optional database operations. A full version of this paper, including a more detailed design and a performance evaluation, is available at www.fsl.cs.sunysb.edu/ docs/equota-policy/policy.pdf. 5. Related work References Elastic quotas are complementary to HSM systems. HSM systems provide disk backup as well as ways to reclaim disk space by moving less-frequently accessed files to a slower disk or tape. These systems then provide a way to access files stored on the slower media, ranging from file search software to replacing the migrated file with a link to its new location. Several HSM systems are in use today including UniTree, SGI DMF (Data Migration Facility), the SmartStor Infinet system, IBM Storage Management, Veritas NetBackup Storage Migrator, and parts of IBM OS/400. HP AutoRaid migrates data blocks using policies based on access frequency [12]. Wilkes et. al. implemented this at the block level, and suggested that per-file policies in the file system might allow for more powerful policies; however, they claim that it is difficult to provide an HSM at the file system level because there are too many different file system implementations deployed. We believe that using stackable file systems can mitigate this concern, as they are relatively portable [13]. In addition, HSMs typically do not take disk space usage per user over time into consideration, and users are not given enough flexibility in choosing storage control policies. We believe that integrating user- and application-specific knowledge into an HSM system would reduce overall storage management costs significantly. [1] M. G. Baker, J. H. Hartman, M. D. Kupfer, K. W. Shirriff, and J. K. Ousterhout. Measurements of a Distributed File System. In Proceedings of 13th ACM Symposium on Operating Systems Principles, pages 198–212. Association for Computing Machinery SIGOPS, 1991. [2] J. M. Bennett, M. A. Bauer, and D. Kinchlea. Characteristics of files in NFS environments. ACM SIGSMALL/PC Notes, 18(3-4):18–25, 1992. [3] D. Ellard, J. Ledlie, and M. Seltzer. The Utility of File Names. Technical Report TR-05-03, Computer Science Group, Harvard University, March 2003. [4] Gartner, Inc. Server Storage and RAID Worldwide. Technical report, Gartner Group/Dataquest, 1999. www. gartner.com. [5] G. H. Kuenning. 
Seer: Predictive File Hoarding for Disconnected Mobile Operation. PhD thesis, University of California, Los Angeles, May 1997. [6] Julie Lugar. Hierarchical storage management (HSM) solutions today. http://www.serverworldmagazine. com/webpapers/2000/10_camino.shtml, October 2000. [7] J. Moad. The Real Cost of Storage. eWeek, October 2001. www.eweek.com/article2/0, 4149, 1249622,00.asp. [8] J. Ousterhout, H. Costa, D. Harrison, J. Kunze, M. Kupfer, and J. Thompson. A Trace-Driven Analysis of the UNIX 4.2 BSD File System. In Proceedings of the 10th ACM Symposium on Operating System Principles, pages 15–24, Orcas Island, WA, December 1985. ACM. [9] D. Roselli, J. R. Lorch, and T. E. Anderson. A Comparison of File System Workloads. In Proc. of the Annual USENIX Technical Conference, pages 41–54, June 2000. [10] D. S. Santry, M. J. Feeley, N. C. Hutchinson, A. C. Veitch, R.W. Carton, and J. Ofir. Deciding When to Forget in the Elephant File System. In Proceedings of the 17th ACM Symposium on Operating Systems Principles, pages 110– 123, December 1999. [11] M. Seltzer and O. Yigit. A new hashing package for UNIX. In Proceedings of the Winter USENIX Technical Conference, pages 173–84, January 1991. www.sleepycat. com. [12] J. Wilkes, R. Golding, C. Staelin, and T. Sullivan. The HP AutoRAID Hierarchical Storage System. In ACM Transactions on Computer Systems, volume 14, pages 108–136, February 1996. [13] E. Zadok and J. Nieh. FiST: A Language for Stackable File Systems. In Proceedings of the Annual USENIX Technical Conference, pages 55–70, June 2000. 6. Conclusions The main contribution of this paper is in the exploration and evaluation of various elastic quota policies. These policies allow administrators to reduce the overall amount of storage consumed and to control what files are backed up when, thereby reducing overall backup and storage costs. Our system includes many features that allow both site administrators and users to tailor their elastic quota policies to their needs. Through the concept of an abuse factor we have introduced historical use into quota systems. Finally, our work provides an extensible framework for new or custom policies to be added. We evaluated our Linux prototype extensively. Performance overheads are small and acceptable for dayto-day use. We observed an overhead of 1.5% when compiling gcc. For a worst-case benchmark, creation and deletion of empty files, our overhead is 5.3% without database operations (a mode that is useful when recursive scans may already be performed by 197 198 A DESIGN OF METADATA SERVER CLUSTER IN LARGE DISTRIBUTED OBJECT-BASED STORAGE Jie Yan, Yao-Long Zhu, Hui Xiong, Renuga Kanagavelu, Feng Zhou, So LihWeon Data Storage Institute, DSI building, 5 Engineering Drive 1, Singapore 117608 {Yan_jie, Zhu_Yaolong}@dsi.a-star.edu.sg tel +65-68748085 Abstract In large distributed Object-based Storage Systems, the performance, availability and scalability of the Metadata Server (MDS) cluster are critical. Traditional MDS cluster suffers from frequent metadata access and metadata movement within the cluster. In this paper, we present a new method called Hashing Partition (HAP) for MDS cluster design to avoid these overheads. We also demonstrate a design using HAP to achieve good performance of MDS cluster load balancing, failover and scalability. 1. 
Introduction

Unlike traditional file storage systems, in which metadata and data are managed by the same machine and stored on the same device [1], an object-based storage system separates data and metadata management. An Object-based Storage Device (OSD) [2] cluster manages low-level storage tasks such as object-to-block mapping and request scheduling, and presents an object access interface instead of a block-level interface [3]. A separate cluster of MDSs manages metadata and file-to-object mapping, as shown in Figure 1. The goal of such a storage system with specialized metadata management is to efficiently manage metadata and improve the overall system performance. In this paper, we mainly address performance, availability and scalability issues for the design of the MDS cluster in Object-based Storage Systems. Two key concerns about the MDS cluster are the request load of metadata and load balancing within the cluster. In our preliminary OSD prototype, which adopts the traditional directory sub-tree to manage metadata, we find that more than 70 percent of all file system access requests are for metadata when using Postmark [4] to access 0.5k files, as shown in Figure 2.

[Figure 1. Object-based Storage System]

[Figure 2 shows the data request percent (Dreq%) and the metadata request percent (Mreq%) of the total requests. This test is based on our OSD prototype (one client, one MDS and one OSD) connected by Fibre Channel, using Postmark (1000 files, 10 subdirectories, random access, 500 transactions).]

Although the size of the metadata is generally small compared to the overall storage capacity, the traffic volume of such metadata access degrades the system performance. The large number of metadata requests can be attributed to the use of directory sub-tree metadata management. Apart from metadata requests, an uneven load distribution within an MDS cluster would also create a severe bottleneck. Based on traditional cluster architecture, the performance of load balancing, failover and scalability in the MDS cluster is limited, because most of these operations lead to inevitable massive metadata movement within the cluster. The Lazy Hybrid metadata management method [5] presented hashing metadata management with hierarchical directory support, which dramatically reduced the total number of metadata requests, but Lazy Hybrid did not deal with reducing metadata movement between MDSs for load balancing, failover and scalability. This paper presents a new method called Hashing Partition (HAP) for MDS cluster design. HAP also adopts the hashing method, but focuses on reducing cross-MDS metadata movement in a clustered design, in order to achieve high performance of load balancing, failover and scalability. The rest of the paper is organized as follows. The next section details the design of HAP, and section 3 demonstrates our solutions for MDS cluster load balancing, failover and scalability. Section 4 discusses MDS Cluster Rebuild. Finally, the conclusion of the paper is drawn in section 5.

2. Hashing Partition

Hashing Partition (HAP) provides a total solution for file hashing, metadata partitioning, and metadata storage.

[Figure 3. Hashing Partition]

[Figure 4. Metadata Access Pattern: (1) filename hashing; (2) selecting the MDS through the Mapping Manager; (3) accessing metadata by the pathname hashing result; (4) returning metadata to the application server.]

There are three logical modules in the HAP: file
There are three logical modules in the HAP: file Application Servers Pathname: /Dir1/Dir2/filename Hashing Partition Application File Hashing Manager Mapping Manager 1 4 Pathname Hashing Result (i) Mapping Manager Metadata Server Cluster Pathname Metadata & etc 2 Pathname Hashing Logical Partition Manager Result (i+1) Pathname Metadata Server Cluster 3 Metadata Server Backend 18 Logical Partitions Common Storage Space Figure 3. Hashing Partition Metadata & etc Figure 4. Metadata Access Pattern ķ.Filename hashing, ĸ.Selecting MDS through Mapping Manager, Ĺ .Accessing metadata by pathname hashing result, ĺ.Returning metadata to application server. 200 hashing manager, mapping manager, and logical partition manager, as shown in Figure 3. In addition, HAP employs an independent common storage space for all MDSs to store metadata, and this space is divided into multiple logical partitions. Each logical partition contains part of global metadata table. Each MDS can mount and then exclusively access logical partitions allocated to it. Thus as a whole, MDS cluster can access a unique global metadata table. The procedure of metadata access is described as follows. Firstly, file hashing manager hashes a filename to an integer, which can be mapped to the partition that stores the metadata of the file. Secondly, mapping manager can figure out the identity number of MDS that currently mounts that partition. Then client can send metadata request with the hash value of pathname to the MDS. Finally, logical partition manager located in MDS side accesses metadata on the logical partition in the common storage space. Figure 4 describes this efficient metadata access procedure. Normally, only a single message to a single metadata server is required to access a file’s metadata. 2.1. File Hashing Manager File hashing manager performs two kinds of hashing: filename hashing for partitioning metadata in MDS cluster, and pathname hashing for the metadata allocation in MDS. To access metadata of a file in MDS cluster, client needs to know two facts: which MDS manages the metadata and where the metadata is located in the logical partition. Filename hashing answers the first question and pathname hashing solves the second one. For example, if the client needs to access the file, “/a/b/filec”, client uses the hashing result of “filec” to select MDS that manages the metadata. Then instead of accessing directory “a” and “b” to know where is the metadata of “filec”, a hash result of “/a/b/filec”, directly indicates where to retrieve the metadata. But the filename hashing may introduce a potential bottleneck when a large parallel access to different files with the same name in different directories. Fortunately, the different hash values of various popular filenames, such as readme and makefile, make all these “hot points” distributed among MDS cluster and reduce the possibility of the potential bottleneck. In addition, even if certain MDS is over-loaded, our dynamic load balancing policy (section 3.1) can effectively handle this scenario and shift the “hot points” from overloaded MDS to the less-loaded MDSs. 2.2. Logical Partition Manager Logical partition manager manages all logical partitions in the common storage space. It performs many logical partition management tasks, e.g. mount/un-mount, backup and Journal recovery. For instance, logical partition manager can periodically backup logical partitions to a remote backup server. 2.3. 
Mapping Manager

Mapping manager performs two kinds of mapping tasks: hashing result to logical partition mapping, and logical partition to MDS mapping. Equation 1 describes these two mapping functions:

    P_i = f(H(\text{filename})), \quad MDS_i = ML(P_i, PW_i, MW_i)
    P_i \in \{0, \dots, P_n\}; \; H(\text{filename}) \in \{0, \dots, H_n\}; \; MDS_i \in \{0, \dots, M_n\}; \quad (H_n \ge P_n \ge M_n > 0)    (1)

where H represents a filename hashing function; f stands for the mapping function that transfers a hashing result to a partition number (P_i); ML represents the function that figures out the MDS number (MDS_i) from the partition number and related parameters (PW and MW will be explained in section 3.1); P_n is the total number of partitions; H_n is the maximum hashing value; and M_n is the total number of MDSs. When PW and MW are set, mapping manager simplifies the mapping function ML to a mapping table MLT, which describes the current mapping between MDSs and logical partitions. It is noted that one MDS can mount multiple partitions, but one partition can only be mounted to one MDS. To access metadata, mapping manager can indicate the logical partition that stores the metadata of a file based on the hash result of the filename. Then, through the MLT, mapping manager knows which MDS mounts that partition and manages the metadata of the file. Finally the client contacts the selected metadata server to obtain the file's metadata, file-to-object mapping and security information. Table 1 gives an example of an MLT. Based on this table, in order to access metadata on logical partition 18, the client needs to send the request to MDS 1.

    Table 1. Example of MLT
    Logical partition number    MDS ID    MDS Weight
    0~15                        0         300
    16~31                       1         300
    32~47                       2         300
    48~63                       3         300

3. Load Balancing, Failover and Scalability

3.1. MDS Cluster Load Balancing Design

We propose a simple Dynamic Weight algorithm to dynamically balance the load of MDSs. HAP assigns an MDS Weight (MW) to each MDS according to its CPU power, memory size and bandwidth, and uses a Partition Weight (PW) to reflect the access frequency of each partition. MW is a stable value if the hardware configuration of the MDS cluster does not change, and PW can be dynamically adjusted according to the access rate and pattern of partitions. In order to balance the load between MDSs, mapping manager allocates partitions to each MDS based on Equation 2:

    \frac{\sum PW_i}{MW_i} = \frac{\sum_{a=0}^{P_n} PW_a}{\sum_{a=0}^{M_n} MW_a}    (2)

where \sum PW_i represents the sum of the PW of all partitions mounted by MDS_i; P_n stands for the total number of partitions; and M_n represents the total number of MDSs. In addition, each MDS needs to maintain load information about itself and all partitions mounted on it, and periodically uses Equation 3 to calculate new values:

    MDSLOAD(i+1) = MDSLOAD(i) \times \alpha\% + MDSCURLOAD \times (1 - \alpha\%)
    PLOAD(i+1) = PLOAD(i) \times \beta\% + PCURLOAD \times (1 - \beta\%)    (3)

where MDSCURLOAD is the current load of the MDS; PCURLOAD is the current load of the logical partition; MDSLOAD(i) represents the load status of an MDS at time i; PLOAD(i) stands for the load status of a logical partition at time i; and \alpha and \beta are constants used to balance the effects of the old value and the new value. However, MDSs do not report their load information to the master node (e.g. one particular MDS) until an MDS raises an alarm for an overloaded situation, such as its MDSLOAD exceeding the preset maximum load of that MDS. After receiving load information from all MDSs, the master node sets the PW using the new PLOAD values.
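The two mapping steps of Equation 1 and the weight-proportional allocation of Equation 2 can be sketched as follows. This is an illustrative Python sketch; the MD5 hash, the greedy assignment, and the data structures are choices made here for the example rather than details taken from the paper.

import hashlib

PN = 64                                          # total number of logical partitions (Pn)
MDS_WEIGHTS = {0: 300, 1: 300, 2: 300, 3: 300}   # MW per MDS, as in the Table 1 example

def H(name):
    # Filename/pathname hashing; any uniform hash with Hn >= Pn works.
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

def f(hash_value):
    # Equation 1, first step: hashing result -> logical partition number Pi.
    return hash_value % PN

def build_mlt(partition_weights, mds_weights):
    """Equation 2: assign partitions so each MDS's total PW is proportional to its MW."""
    total_pw = sum(partition_weights.values())
    total_mw = sum(mds_weights.values())
    targets = {m: total_pw * w / total_mw for m, w in mds_weights.items()}
    mlt, assigned = {}, {m: 0.0 for m in mds_weights}
    for part, pw in sorted(partition_weights.items(), key=lambda x: -x[1]):
        # Greedily give each partition to the MDS currently furthest below its target share.
        mds = min(assigned, key=lambda m: assigned[m] - targets[m])
        mlt[part] = mds
        assigned[mds] += pw
    return mlt

def locate_mds(filename, mlt):
    # Client-side lookup: filename -> partition -> MDS, using the distributed MLT.
    return mlt[f(H(filename))]

PW = {p: 1.0 for p in range(PN)}                 # start with uniform partition weights
MLT = build_mlt(PW, MDS_WEIGHTS)
print(locate_mds("/Dir1/Dir2/filename", MLT))

Rebalancing then amounts to recomputing this allocation with updated PW values and remounting the affected partitions; the metadata itself never moves.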
Then, according to the new PW and Equation 2, HAP shifts the control of certain partitions from the over-loaded MDS to some less-loaded MDSs and modifies the MLT accordingly. This adjustment does not involve any physical metadata movement between MDSs.

3.2. MDS Cluster Failover Design

Typically, a conventional failover design adopts a standby server to take over all services of the failed server. In our design, the failover strategy relies on the clustered approach. In the case of an MDS failure, mapping manager assigns other MDSs to take over the work of the failed MDS based on Equation 2. Then the logical partition manager allocates the logical partitions managed by the failed MDS to its successors, as shown in Figure 5. So application servers can still access metadata on the same logical partition in the common storage space through the successors.

[Figure 5. MDS cluster failover procedure: (1) detecting the MDS failure; (2) recalculating MW and adjusting the MLT; (3) other MDSs take over the logical partitions of the failed MDS; (4) journal recovery.]

3.3. MDS Cluster Scalability Design

HAP significantly simplifies the procedure to scale the metadata servers. If the current MDS cluster cannot handle metadata requests effectively due to heavy load, new MDSs can be dynamically set up to relieve the load on the others. The HAP method allows the addition of an MDS by adjusting MWs and thus generating a new MLT based on ML. This process doesn't touch the mapping relationship between filename and logical partition, because the number of logical partitions is unchanged. Following the new MLT, the logical partition manager un-mounts certain partitions from existing MDSs and mounts them to the new MDS. This procedure also doesn't introduce any physical metadata movement within the MDS cluster.

4. MDS Cluster Rebuild

Although the HAP method can dramatically simplify the operation of MDS addition and removal, HAP actually has a scalability limitation, called Scalability Capability. The
4. MDS Cluster Rebuild

Although the HAP method dramatically simplifies the addition and removal of MDSs, it has a scalability limitation, which we call Scalability Capability. The preset number of logical partitions bounds the Scalability Capability, since one partition can be mounted and accessed by only one MDS at a time. For instance, 64 logical partitions can support at most 64 MDSs without a rebuild. To raise the Scalability Capability, we can add storage hardware to create new logical partitions and redistribute metadata across the entire cluster. This metadata redistribution introduces multi-MDS communication, because changing the number of logical partitions requires a new mapping function f in Equation 1 and affects the metadata location of existing files. For example, after the Scalability Capability is raised from 64 to 256, the metadata of a file may need to move from logical partition 18 to logical partition 74. The procedure that redistributes all metadata according to the new mapping policy and raises the Scalability Capability is called MDS Cluster Rebuild.

To reduce the response time of an MDS cluster rebuild, HAP adopts a Deferred Update algorithm, which defers metadata movement and spreads out its overhead. After receiving the cluster rebuild request, HAP saves a copy of the mapping function f, creates a new f based on the new number of logical partitions, and generates a new MLT. The logical partition manager then mounts all logical partitions, both old and new, according to the new MLT. After that, HAP responds to the rebuild request immediately and switches the MDS cluster into rebuild mode, so the initial step of the entire process is very fast.

Figure 6. MDS cluster rebuild: (1) sending the request to an MDS based on the new mapping result; (2) searching for the metadata and making a judgment — operation A in the figure covers computing the old partition number with the old f, finding the MDS that mounts the old partition through the new MLT, and issuing a request to fetch the metadata from that MDS (the example pathname is /a/b/filec, whose metadata moves from old partition 18 to new partition 74); (3) returning the metadata and deleting the local copy; (4) reporting an error; (5) returning the metadata; (6) wrong filename.

During the rebuild, the system behaves as if all the metadata had already been moved to the correct logical partitions. In reality, HAP updates or moves metadata upon first access. If an MDS receives a metadata request and the metadata has not yet been moved to the logical partition it mounts, the MDS uses the old mapping function f to calculate the original logical partition number from the filename. Through the new MLT, it then finds the MDS that currently mounts the original partition and sends a metadata request to it. In this way the MDS retrieves the metadata and completes the metadata movement. Figure 6 describes this procedure in detail. In addition, to accelerate the metadata movement, HAP can run an independent thread that walks the metadata database and moves the affected metadata during the system's idle periods.

5. Conclusion

We present Hashing Partition, a new method of managing a metadata server cluster in a large distributed object-based storage system. We use hashing to avoid numerous metadata accesses, and a filename-hashing policy to remove the overhead of communication among multiple MDSs. Furthermore, based on the concept of logical partitions in the common storage space, the HAP method significantly simplifies the implementation of the MDS cluster and provides efficient solutions for load balancing, failover, and scalability. The design described in this paper is part of our BrainStor project, which aims to provide a complete object-based storage solution. We are currently implementing the Hashing Partition management for the MDS cluster in the BrainStor prototype. We also plan to explore the application of BrainStor technologies in Grid storage.

References
[1] J. H. Morris, M. Satyanarayanan, M. H. Conner, J. H. Howard, D. S. H. Rosenthal, and F. D. Smith. “Andrew: A distributed personal computing environment”, Communications of the ACM, 29(3):184–201, Mar. 1986.
[2] R. O. Weber. “Information technology—SCSI object-based storage device commands (OSD)”, Technical Council Proposal Document T10/1355-D, Technical Committee T10, Aug. 2003.
[3] Thomas M. Ruwart. “OSD: A Tutorial on Object Storage Devices”, 19th IEEE Symposium on Mass Storage Systems and Technologies, University of Maryland, Maryland, USA, April 2002.
[4] J. Katcher. “PostMark: A new file system benchmark”, Tech. Rep. TR3022, Network Appliance, Oct. 1997.
[5] Scott A. Brandt, Lan Xue, Ethan L. Miller, and Darrell D. E. Long.
“Efficient metadata management in large distributed file systems”, Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 290–298, April 2003. 205 206 AN ISCSI DESIGN AND IMPLEMENTATION Hui Xiong, Renuga Kanagavelu, Yaolong Zhu, and Khai Leong Yong Data Storage Institute, DSI Building, 5 Engineering Drive 1, Singapore 117608 Tel:+65-68748100, Fax: +65-68732745 {Xiong_Hui, Renuga_KANAGAVELU, ZHU_Yaolong, YONG_Khai_Leong} @dsi.a_star.edu.sg Abstract iSCSI is a network storage technology designed to provide an economical solution over a TCP/IP Network. This paper presents a new iSCSI design with multiple TCP/IP connections. The prototype is developed and experiments are conducted for performance evaluation on Gigabit Ethernet (GE). Test results show that the new iSCSI design improves performance 20%~60% compared with normal iSCSI architecture. Throughput can reach 107MB/s for big I/O and I/O rate can reach 15000 IOPS for small I/O. 1. Introduction iSCSI is a network storage technology which transports SCSI commands over TCP/IP network. It has attracted a lot of attention due to the following advantages: a. Good scalability. iSCSI is based on SCSI protocol and TCP/IP network, which can provide good scalability. b. Low-cost. iSCSI can share and be compatible with existing TCP/IP networks. The user does not need to add any new hardware. c. Remote data transferring capability. TCP/IP network can extend to metro area, which makes iSCSI suitable for remote backup and disaster recovery applications. Most current iSCSI implementations use software solutions in which iSCSI device drivers are added on top of the TCP/IP layer for off-the-shelf network interface cards (NICs). It may cause performance issue. Many iSCSI researches and projects have been carried out to analyze the problem. A prior research project of iSCSI – Netstation project of USC showed that it was possible for iSCSI to achieve the 80% performance of direct-attached SCSI device [1]. IBM Haifa Research Lab carried out research on the design and the performance analysis of iSCSI [2, 3]. Bell Laboratories also did some test and performance study of iSCSI over metro network [4]. Some of solutions have also been brought forward to handle the performance issue. A research group proposed a solution to use memory of iSCSI initiator to cache iSCSI data [5]. Other solutions included using a TCP/IP offload Engine (TOE) [6] and even iSCSI adapter [7] to reduce 207 the burden of host CPU by offloading the processing of TCP/IP and iSCSI protocol into the hardware on the network adapter. But these hardware solutions will add the extra cost compared to a software solution. The improvement of semiconductor technology has led to the rapid increase of CPU speed and memory access speed. During the past two years, commercial CPU speed has almost tripled from 1GHz to 3GHz, and memory bandwidth has also doubled from around 200~300MB/s to around 400~500MB/s. A powerful hardware platform makes it possible to achieve good performance for iSCSI software solutions. The paper presents a new software iSCSI design and implementation, which employs multiple TCP/IP connections. Experiments are conducted on the GE network both in the lab and metro network environment. Testing results are analyzed and discussed. 2. Software iSCSI Design 2.1 iSCSI Storage Architecture Figure 1 shows an iSCSI storage architecture including an initiator and a target, which communicate to each other via the iSCSI protocol. 
In the initiator, the application, which needs to store and access data to/from the storage device, issues file requests. The file system converts file requests to block requests from application to block device layer and SCSI layer. The initiator iSCSI driver encapsulates SCSI commands in iSCSI Protocol Data Units (PDUs) and sends them to the Ethernet network via the TCP/IP layer. The target iSCSI driver receives iSCSI PDUs from TCP/IP layer and de-capsulates it. Then SCSI commands are mapped to RAM, called RAM I/O, or mapped to an actual magnetic storage disk, called DISK I/O. The target driver then sends response data and status back to the TCP/IP layer. The low-level flow control of iSCSI fully follows the a p p lic a tio n D a ta copy D is k F ile s y s te m R am IO B lo c k d e v /S C S I la y e r D a ta copy D is k IO iS C S I d riv e r iS C S I d riv e r T C P /IP la y e r T C P /IP la y e r N IC d riv e r N IC d riv e r D a ta copy D a ta copy DM A DM A N IC N e tw o rk N e tw o r k N IC Initiator Target Fig.1 General iSCSI architecture 208 TCP/IP and Ethernet communication mechanism, which connects initiator and target by IP network. It is noted that two data copies and one DMA are processed in the initiator side during one IO access. For RAM I/O, one data copy and one DMA are processed in target. 2.2 A new iSCSI Design and Implementation Normal iSCSI implementation is based on the single TCP/IP connection, which may not sufficiently utilize the network bandwidth. Furthermore, it may face serious performance issue caused by packets loss and long latency in metro network environment. We propose a new iSCSI architecture with multiple TCP/IP connections. This new architecture actually employs multiple virtual connections over one physical Ethernet connection (by one NIC), which is different from general idea of multiple Ethernet physical connections (by multiple NICs) according to iSCSI Request for Commands (iSCSI RFC) documents. The new design supposes not only to improve the iSCSI performance by increasing the utilization of network bandwidth, but also to provide a better mechanism to handle the long latency issue in metro network environment. The working principle of our new design is shown in Figure 2. Multiple virtual TCP/IP connections are built on one physical Ethernet connection. One connection is used for sending SCSI request from initiator to target; another one is used for sending response from target to initiator. One pair of transmitting thread (Tx_thread) and receiving thread (Rx_thread), which locate in initiator or target respectively, are responsible for data communication within one connection. We use the read operation as an example to explain the detailed communication procedure. The SCSI middle layer is a standard interface for SCSI layer to communicate with iSCSI device driver. The two main functions of SCSI middle layer are queuecommand( ) which issues SCSI command to the iSCSI driver and done( ) which informs SCSI middle layer that the command is finished by iSCSI driver. The iSCSI Initiator SCSI layer Initiator iSCSI driver queuecommand( ) Tx_thread Command queue 3K\VLFDO FRQQHFWLRQRQ 1HWZRUN 1HWZRUN Rx_thread done( ) Target iSCSI driver Virtual Connection 1 Rx_thread Command queue Tx_thread DONE queue Response queue Virtual Connection 2 Fig 2. Multiple parallel connections iSCSI architecture 209 Handling command initiator driver gets read commands from SCSI middle layer. 
These commands are encapsulated in iSCSI request PDUs and then queued in the initiator command-queue. As shown in Figure 2, The Tx_thread of connection 1 sends these PDUs from command-queue to the target. The target driver receives these PDUs by the Rx_thread of the same connection. After de-encapsulating iSCSI request PDUs and mapping SCSI commands to the storage device, the target driver gets data from storage device and forms iSCSI response PDUs (including data and status). Then these iSCSI response PDUs are queued in the response-queue. Tx_thread in the target sends these PDUs by connection 2. The initiator driver receives response through the Rx_thread of the connection 2. Then the initiator driver put it to done-queue to call the done( ) function to finish the SCSI exchange. As multiple connections are used, synchronization becomes important. According to iSCSI draft, the “initiator task tag” is used to record the sequence number of the iSCSI command in every iSCSI PDU. Related data and Status PDUs of the iSCSI command attach the same “initiator task tag”. Even if multiple connections are used and command queue is enabled in both initiator and target, the device driver can easily find respective data/status PDUs from the pending queue. 3. Experiments Figure 3 shows the system configuration used in our iSCSI experiments in the lab environment. One Nera-summit-5i Gigabit Ethernet (GE) switch (supporting 9K jumbo frames) is used to connect iSCSI initiator and target. One Finisar GTX THG iSCSI analyzer is used to monitor and capture all Ethernet packets for detailed analysis. Further experiments are conducted in the metro network environment. iSCSI initiator and target are connected through fiber with the aid of DWDM switch over 25KM physical distance, as shown in Figure 4. L6&6,DQDO\]HU L6&6,LQLWLDWRU *(VZLWFK L6&6,WDUJHW Fig. 3 Experiment platform in Lab Fig.4 Metro network experiment 210 The hardware configuration of the iSCSI target includes a 64-bit/133MHz PCI-X motherboard with 2.4GHz P4 processor and 512M RAM. The iSCSI initiator is implemented on a 64-bit/66MHz PCI motherboard with 2.4GHz P4 processor and 256MB RAM. Intel PRO1000F GE network interface cards (NICs) are used at both iSCSI initiator and target. All experiments are based on Redhat Linux 8.0 (kernel version 2.4.20). TCP network performance of the system is tested by netperf. Result shows that throughput can reach 939.8Mbps with maximum 70% CPU utilization. RAM I/O mode or DISK I/O mode is used in iSCSI target. RAM I/O directly maps requests to memory. In DISK I/O test, to diminish the impact from low speed storage device, we use a 2G fibre channel HBA to connect a self-developed high speed fibre channel disk array, which can reach around 110 MB/s for sequential write and 80 MB/s for sequential read. We use dd command as the application benchmark tool to copy data to/from iSCSI disk. For example, “dd write” is “dd if=/dev/zero of=/dev/sda bs=4k count=500000”. It will generate write request to raw device. The size of request is set much bigger than initiator’s memory to avoid impact of local cache. In big I/O test, the general dd commands will generate SCSI commands, which request 128KB data to SCSI layer. In small I/O test, the kernel source code is modified to allow dd command to generate 1KB to 32KB requests to SCSI layer. Different queue lengths are tested for the small I/O test. 4. 
Experimental Results and Analysis The iSCSI prototype with multiple connections has been tested and the results have been compared with the normal iSCSI performance. Figure 5 shows the throughput of iSCSI with RAM I/O mode .The I/O request size is 128KB. Ethernet frame size is set as 1.5k (small frame) or 9k (jumbo frame) respectively. The multiple connections iSCSI prototype can achieve 70~80MB/s for small frame test and 97~107MB/s for jumbo frame test, which is around 20%~60% improvement compared with normal iSCSI prototype. Further analysis of CPU utilization shows that utilization of initiator’s CPU can reach almost 100% with 107MB/s write throughput. That means CPU’s power become system’s bottleneck in this test condition. The impact of the frame size on the iSCSI performance is shown in Figure 6. The bigger frame will improve performance mainly because it decrease interrupt times. The impact is very small when the frame size is bigger than 3k. CPU utilization in the initiator is always higher than target. This is because the initiator needs to handle one more data copy in memory than target, when RAM I/O is employed in target. 211   6& 0& UHDG    0%V  0%V ZULWH         UHDG N ZULWH N UHDG N N ZULWH N Fig. 5 single /multiple connection of RAM I/O N N N N N Fig. 6. Effects of Ethernet frame on the throughput The performance of multiple connection iSCSI for small I/O request is tested. Since the queue depth is a critical parameter that affects the iSCSI performance in small I/O, the effects of the queue length on iSCSI performance is summarized in Figure 7. The testing condition is set as request size with 1KB and frame size with 1.5kB. When queue depth equals to 2, I/O rate is only about 4000 IOPS. When the queue length is bigger than 8, I/O rate can reach around 15000 IOPS. The results show that the I/O rate increases with the queue length until the queue length equals to 8. Further analyzing the captured data by iSCSI analyzer, we find that the maximum effective command queue length is 8 in our prototype and experiments. Figure 8 shows test result of DISK I/O in the lab and metro network environment. In the lab environment, read performance of iSCSI working in DISK I/O mode is much lower than that in RAM I/O mode. This is because of one more data copy to hard disk in the iSCSI target. For write operation, write cache of raid array card in the iSCSI target makes the iSCSI write performance in DISK I/O mode almost same as that in RAM I/O mode. 18000 read  write ODE PHWUR  14000  12000  10000 0%V ,2VVHF 16000 8000   6000  4000  2000   0 2 4 8 16 queue length UHDG N Fig.7 I/O rate for small I/O on different queue length Fig.8 212 ZULWH N  UHDG N ZULWH N result of lab and metro network of DISK I/O It is commonly supposed that network latency has a big impact on the iSCSI performance in a metro network. Network latency is estimated by the ping command. The roundtrip time is around 0.6ms in metro network which is around 20 times of that in lab network. But the throughput of our new iSCSI prototype can still reach 67MB/s with small frame in the experiment, which is only 2%~5% less than that of lab. This is due to the iSCSI queue architecture and multiple connections design. It seems that the metro network can’t support jumbo frame well because it makes performance much worse than small frames. 5. Conclusion The paper presents a new iSCSI design with multiple TCP/IP connections. The prototype has been developed and tested in both lab and metro network environment. 
Results show that the new iSCSI prototype can achieve 20%~60% performance improvement compared with normal iSCSI architecture with single TCP/IP connection. The new iSCSI architecture has been proved in the metro network environment. Future work will focus on the iSCSI testing and application in real network environment. Reference [1] Rodney Van Meter, Gregory G, Finn, Steve Hotz, “VISA: Netstation’s Virtual Internet SCSI Adapter”, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, 1998 [2] Prasenjit Sarkar , Kaladhar Voruganti , “IP Storage: The Challenge Ahead” Proc of 10th conference on Mass Storage Systems and Technologies,April-2002. [3] Kalman Z.Meth, “iSCSI Initiator Design and Implementation Experience”, Proc of 10th conference on Mass Storage Systems and Technologies, April-2002. [4] Wee Tech Ng, Bruce Hillyer, Elizabeth Shriver, Eran Gabber, Banu Özden, “Obtaining High Performance for Storage Outsourcing” Proceedings of the Conference on File and Storage Technologies, 2002 [5] Xubin He, Qin Yang, Ming Zhang, Ā A Caching Strategy to improve iSCSI Performance”, Proc of the 27th Annual IEEE conference on Local Computer Networks (LCN’02), 2002. [6] Alacritech, White paper, “Delivering High Performance Storage Networking”, http://www.alacritech.com/html/storagewhitepaper.html. [7] Adaptec, White paper, “Building SANs with iSCSI, Ethernet and Adaptec”, http://www.graphics.adaptec.com/pdfs/buildingsanwithiscsi-21.pdf 213 214 Quanta Data Storage: A New Storage Paradigm Prabhanjan C. Gurumohan Arizona State University gpj@asu.edu Sai S. B. Narasimhamurthy Arizona State University saib@asu.edu Abstract TCP layer and poor iSCSI implementations have been identified as the main bottlenecks in realizing high iSCSI performance. With the addition of security mechanisms the throughput achieved by the storage system using iSCSI is further reduced. Along with the above mentioned problems, we argue that the excessive processing redundancy introduced by several protocol layers and use of protocols designed for non-storage specific requirements result in poor storage network architectures. In order to overcome these issues, we introduce a new storage paradigm in which data is manipulated, encrypted and stored in fixed block sizes called quanta. Each quanta is manipulated by a single effective cross layer (ECL) that includes security features, iSCSI functionalities, direct data placement techniques and data transport mechanisms. Further, the new architecture emphasizes majority of burden of computation for achieving security on the clients. Qualitative description of the idea is presented. Performance improvements observed during tests of the idea are presented. Through emulation and analysis we also show that the size of the quanta must be equal to the minimum path MTU for maximum throughput. 1. Introduction IP has been accepted as a de-facto standard for Internet applications. IP based networks also provide relatively inexpensive and convenient solution [5][13] for transportation of bulk data. Considering these facts, transportation of storage data on the IP network provides a cheap and easy alternative to the SCSI and Fiber Channel based networks. iSCSI [5] is the proposed protocol for storage (block) data transport over IP networks. Most effort has been concentrated on designing the protocol over the existing TCP/IP protocol. Initial versions of the implementation of this protocol are available. 
Although the idea to develop these new protocols over existing protocols was done to ease 215 Joseph Y. Hui Arizona State University jhui@asu.edu the integration of iSCSI, it has resulted in over layering and crowding of protocols stacks. Stephen et. al [11] have presented the findings of implementing an iSCSI based target on a specialized hardware. Their work concentrated on comparing the overall performance of various network configurations using iSCSI protocol. The performance results from their work indicate that the iSCSI protocol can be severely limited by implementation. They suspect this problem due to inefficient handling of underlying network properties and poor iSCSI implementation. Y.Lu and D.Du [8] have examined different storage protocols (viz, iSCSI, NFS,SMB) and have analyzed their performance. The work shows that iSCSI storage with Gigabit connection could have performance very close to directly attached fiber channel-arbitrated loop storage. From their study they also show that iSCSI based file access performance outperforms the NAS schemes on both Windows and Linux platforms. However, they point out that the advantage of iSCSI reduces as the file size increases. The reasoning for this reduction in performance has not been presented in this work. Shuang-Yi Tang et. al [10] have also shown that with addition of security features such as IPsec below iSCSI and TCP/IP protocol stack leads to high degradation in throughput. Further the idea of storage security is not well severed by usage of IPsec because it does not encrypt the data that will be stored on the storage device. It only protects from attacks between two communicating points. Hence, using iSCSI over TCP/IP/IPsec leads to high overhead and complex layering structure. Further, use of IPsec leads to encryption and decryption at both sever and client ends. End system copies and TCP packet reassembly form throughout bottlenecks for many applications including iSCSI [1][2][3]. End system copies can be eliminated by Direct Data Placement (DDP) where the application entities are directly placed into application buffers without system copies [1]. DDP packets have to be modified to suite the existing TCP functions. This is enabled by a Framing protocol for TCP [3]. Additional extensions to the DDP protocol are provided by the RDMA protocol, which operates over the DDP protocol. The RDMA, DDP and the Framing protocol form the iWARP suite. This is the proposed solution for end system copy issues [12]. The iWARP suite is also applicable for iSCSI [2]. The iWARP suite is proposed to operate between iSCSI and TCP [12]. The introduction of iWARP leads to increased overheads and redundancy. Two simple examples of redundant functions at different layers are as follows. First, the CRCs/checksums are computed at the iSCSI layer, MPA or framing layer, and the TCP layer. Second, sequencing information is repeated in both iSCSI and transport layer. Motivated to overcome these problems, we propose the use of fixed block size data units called quanta. These data units are chosen to be of size 512, 1024, 2048 or 4096 bytes. A quanta is encrypted and formatted according to an Effective Cross Layer (ECL) by the client wishing to write data to a server and is stored as is on the storage device. The effective cross layer combines the functionalities of encryption, buffer management for direct data placement, iSCSI formatting and transport functions. The key used for encryption of such a quanta is stored on a separate key server. 
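A rough sketch of this client-side write and read path is shown below. It is our own illustration under several assumptions not fixed by the paper: a key per quantum rather than per file, Fernet (AES-based) from Python's cryptography package standing in for the AES scheme, and in-memory dictionaries standing in for the storage target and the key server.

```python
from cryptography.fernet import Fernet  # AES-based; a stand-in for the paper's AES keys

QUANTUM = 4096  # one of the fixed quanta sizes considered (512/1024/2048/4096 bytes)

key_server = {}    # hypothetical key store: (file_id, offset) -> key
object_store = {}  # hypothetical target:  (file_id, offset) -> encrypted quantum

def client_write(file_id: str, data: bytes) -> None:
    """Split a file into fixed-size quanta, encrypt each one on the client,
    store the encrypted unit on the target, and store the key on the key server."""
    for offset in range(0, len(data), QUANTUM):
        piece = data[offset:offset + QUANTUM]
        key = Fernet.generate_key()
        key_server[(file_id, offset)] = key
        object_store[(file_id, offset)] = Fernet(key).encrypt(piece)

def client_read(file_id: str) -> bytes:
    """Fetch encrypted quanta from the target and keys from the key server;
    decryption happens only on the client, never on the storage server."""
    out = bytearray()
    for fid, offset in sorted(k for k in object_store if k[0] == file_id):
        out += Fernet(key_server[(fid, offset)]).decrypt(object_store[(fid, offset)])
    return bytes(out)
```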
Any valid client can access the encrypted and preformatted data from the server and decrypt it using keys obtained from the key server. The client does the encryption and decryption, checksum calculation, and any other formatting. This lets the target to be implemented with fewer functionalities of ECL and used for only providing an access point to the storage device. A facility called fast buffers (fbufs) was introduced by Peter Druschel and Larry L. Peterson [14]. It was an operating system facility implemented for I/O buffer management and data transfer across protection domain boundaries on shared memory machines. Although, the main goal of this idea was to provide high bandwidth to user level processes, this was achieved by implementing a new facility in the operating system. This is unlike the ECL and quanta approach that achieves the similar goal by reducing the overheads in the protocol stacks. Further, ECL and quanta approach are specifically designed to improve the performance of the storage networks instead for the general networks. David D. Clark and David L. Tennenhouse [15] provide guidelines for designing of new generation of protocols. Our protocol design efforts are in line with two important guidelines presented in [15]. They are, the data manipulation costs are more compared to transfer control operations and the application data units are the natural pipelining units. 216 In section 2, we provide the architectural goals of the cross layer design. Section 3 discusses the encryption mechanism details. We present the ECL details in section 4. Section 5 provides the test results for the emulation of ECL using Hyperscsi. Finally we present the conclusion in section 6. 2. Cross layer architectural goals CLA (Cross Layer Architecture) enables the combining of the features of the SCSI/iSCSI/RDMA /DDP/Framing/TCP/IPsec into a single layer called the ECL. This obviates layering overheads and enables the ease of operation of Storage Area Networks. CLA has been designed with the following architectural goals: 1. Minimize data handling and processing by dividing data units into fixed size blocks or quanta. 2. Provide security on the wire and on the storage unit. 3. Provide the features of the iSCSI protocol. 4. Provide application level buffering by incorporating the features of the DDP protocol. (Direct Data Placement) 5. Provide the transport layer mechanisms. 3. Encryption Mechanism In order to include a security mechanism as a part of a single layer, ECL, a new security mechanism is proposed. The new scheme for storage security mainly deals with reducing the high overheads of authentication and encryption. This is done by avoiding encryption and decryption of data on both client and server sides. The new scheme does both encryption and decryption on the client side. The new scheme includes several techniques, which improves the efficiency of encryption and decryption: 1. The data is stored in the encrypted form on the server: 2. Encryption and Decryption is never performed on the server and hence there is no key management in the server. 3. In addition to encryption, the computation of parities, packetization, and errorcorrective information is done at the time the file is stored and never recomputed again. features in different layers. The necessary and common functionalities are then retained and incorporated into the ECL and the rest are excluded. These functionalities are designed such that they are scalable over a WAN. 4. 
The majority of the computation load for providing security and error correction is performed at the client end. 4.1 . ISCSI functionalities of the ECL 5. Authentication, key management decoupled from data security. is These requirements are met by implementing a new security mechanism. The data to be stored is encrypted by the client and preformatted. This data unit is called the Encrypted Data Unit (EDU). The EDU is stored on the servers as is. The EDU is transported to be stored on servers using ECL headers. The combination of EDUs and the ECL headers are called quanta. The keys used to encrypt the data are generated using AES encryption algorithm. The keys belonging to a particular file is stored in a file and encrypted. This file is stored in a centralized key server. A valid client can access the file on servers stored in the quanta form. The clients fetch the keys from the key server and perform decryption. Figure 1 shows the network architecture that would be used to implement the security mechanism. Several parameters that are used for read and write operation, encryption, and decryption are included in the ECL header. This is discussed in the next section. iSCSI is an adaptation of SCSI for accessing data over the Internet. Hence most of the SCSI functionalities that are part of iSCSI are retained in ECL [5]. Headers and data digests in iSCSI were not inherited from SCSI. These were added for error correction in the iSCSI protocol. Hence header and the data digests are excluded from the ECL. Instead, a single checksum for the entire quanta traversing the channel are utilized. 4.2. Copy avoidance functionalities of the ECL Most of the iWARP functionalities are retained in the ECL. Messages framed using MPA protocol is obviated by defining a constant MTU in the SAN. Further, the quanta size is always chosen less than or equal to the minimum MTU. Fig 2. iWARP suite for iSCSI Figure 1. System Setup for data storage centric network 4. Effective Cross Layer A single layer satisfying the specified objectives is defined as the ECL. Figure 2 shows the iWARP stack for iSCSI along with the security layer, IPsec. Figure 3 shows the corresponding ECL. In the following paragraphs we identify several common 217 Fig 3. Effective Cross Layer This obviates the necessity of fragmentation and thus there is no need for markers. The minimum MTU over a path between a source and the destination can be discovered, before any transmission occurs. The DDP protocol requires the tags or the addresses of buffers at the destination. These tags when available at the source enable direct data placement. As a consequence the quanta do not subject themselves to reassembly buffering and kernel copies. Hence they are directly placed at the appropriate SCSI buffers from the NIC. the current form, Hyperscsi is equivalent to SCSI packets placed directly over Ethernet that does not involve copies, reassembly buffering, and functionality redundancies of layers. 4.3. Transport functionalities of the ECL Sequencing [6] information is retained from the iSCSI headers. Thus the sequencing information at the TCP layer can be excluded. Congestion and flow control [6] algorithms are implemented in TCP and they are based on sliding windows [6]. The windows are based on sequence numbers. ECL has sequencing information similar to TCP. Hence congestion and flow control mechanisms similar to that of TCP can be used. Source and destination ports [6] identification is due to a requirement of socket level demultiplexing. 
Instead direct data placement provides buffer addressing information that can be used to place the data directly into application buffers. Thus direct data placement alleviates the need for source and destination port information. The data checksum is computed during the encryption process. Therefore there is a need only for a single quanta header checksum. Message length information that is a part of UDP can be excluded because it is also a part of the iSCSI functionality. 4.4. Security considerations in ECL The encryption should be done only when a write or an update operation is done. During a read operation, only decryption is required. These operations are indicated in the iSCSI functionality. Hence it is not necessary to add it into the ECL separately. Authentication must be performed before every transaction. Authentication is done by the login mechanism. iSCSI login mechanism that are retained can be employed for this purpose. The final ECL header structure that combines the features mentioned above is shown in figure 4. 5. Emulation of ECL by HYPERSCSI In order to evaluate the performance improvements with the use of ECL, tests were conducted using Hyperscsi [7]. Hyperscsi was used because it emulates a simple version of the ECL. In 218 Fig 4. ECL header for WRITE operations The test setup for ECL throughput characterization consists of two Dell Power EdgeLinux boxes (initiator and the target) with Pentium 866 MHz processors connected end to end through a Gigabit Ethernet link (82546 Dual port GBE). Linux Kernel 2.4.20-18.7 was used in the end systems. Throughput was measured using the bonnie++1.03 tool [16]. Tcpdump and Ethereal [17] was used to see the packet dumps on the end systems. iSCSI Reference 0.18 v10 [18] was used for comparison with iSCSI on TCP. 40 GB QUANTUM hard disks with ULTRA-160 SCSI bus were used at the target. Figure 5 shows the throughput snapshot tests indicating throughput improvements with the ECL. Several read operations were performed in three different scenarios. First, read operations were performed using only the SCSI protocol. Second was performed using Hyperscsi and the third repeated with UNH iSCSI v10. The results represent an arithmetic mean of 10 trials for a fixed MTU size of 16000. The read throughput achieved using Hyperscsi was close to direct access, whereas, the latest version of the UNH iSCSI code achieves only 219.6 Mb/s as opposed to 358 Mb/s and 363 Mb/s achieved by Hyperscsi and local disk access respectively. The improvement of 63% denotes the extent of betterment in throughput that can be obtained by getting rid of system copies, reassembly buffering and multiplicity of functionalities in the iWARP stack through the ECL emulated by Hyperscsi. Mb/s 363 358 219.6 Fig 5. Read performance through the ECL MTU sizes in the Gigabit Ethernet platform. The write throughput is less than the read throughput for both the iSCSI and the ECL cases, since writes are more expensive than reads. For large MTU sizes, the ECL throughput approaches the iSCSI throughput since we speculate that non-TCP/IP overheads for large MTU values in an ECL environment approach the TCP/IP overheads for large MTU values in the TCP/IP environment. Figure 8 shows the read throughput variations for 4K quantum ECL. The throughput peaks at MTU values equal to the quantum-sized blocks. We allow for fragmentation in our tests for MTU sizes less than the quantum sizes. 
For MTU sizes greater than the quantum sizes, the throughput gradually decays since the per packet non-TCP/IP overheads begin to dominate. This is due to the Ethernet per packet processing overheads. For MTU sizes less than the quantum sizes, the fragmentation overheads exceed the Ethernet per packet overheads. The optimal values are thus reached for MTU sizes almost equal to the quantum sizes. The trend is also seen for writes as shown in figure 9. An increase in quanta size increases the throughput. The write throughput for 5K path MTU and various quanta sizes are depicted in figure 10. The tests conducted conclude the throughput characteristics of the ECL and justifies the Quantum size selections equal to MTU sizes. 6. Conclusions Fig 6. Bulk read throughput results We next investigate the throughput characteristics of the ECL for reads and writes and justify the quantum data lengths equal to the path MTU. The ECL needs to accommodate very large windows when it is operating over a network with large round trip times. The required changes are made to the Hyperscsi code to accommodate infinitely large windows. (Hyperscsi at present can accommodate only 32 segments in a window interval). 4K quanta are used for ECL. Fragmentation of quantum is allowed for MTU values less than 4K. For MTU values greater than 4K, each quantum is contained within a single Ethernet packet along with the headers. Figure 6 compares the iSCSI throughput vs. the ECL throughput for reads for various possible MTU sizes in the Gigabit Ethernet platform. The read throughput shows a constant improvement of about 63% in the favor of ECL. Figure 7 compares the iSCSI throughput vs. the ECL throughput for writes for various possible 219 Existing iSCSI storage systems exhibit low performance. Additional protocols have been proposed for improving the performance. We presented the argument that this would lead to further decrease in performance. This is due to excessive processing redundancy and several protocol layers. Further, use of protocols designed for non-storage specific requirements result in poor storage network architectures. In order to solve these problems we proposed data handling in the form of fixed data units called quanta. A new Effective Cross Layer was proposed that combines the necessary features of security, iSCSI, direct data placement, and TCP. A new security mechanism is proposed that emphasizes burden of computation on clients. Throughput improvements over the existing iSCSI are indicated by using the Hyperscsi protocol for emulation. The characteristics of the ECL throughput are noted. Fig 7. Bulk write throughput results Fig 9. ECL Bulk write throughput results Mb/s 162 80 39 Fig 8. ECL bulk read throughput results Fig 10. Write Throughput for 1K, 2K and 4Kquantum sizes- in Mb/s for 5K path MTU References [1] Hemal Shah, James Pinkerton, Renato Recio, and Paul Culley, Direct Data Placement over Reliable Transports, IETF internet draft, draft-shah-iwarpddp-00.txt (Work in progress), October 2002 [2] R.Recio, P.Culley, D.Garcia, and J. Hilland, RDMA Protocol Specification, IETF internet draft, draft-recio-iwarprdmap-00.txt, October 2002 [3] P.Culley, U.Elzur, R.Recio, and S.Bailey et. 
al, Marker PDU Aligned Framing for TCP specification, IETF internet draft, draft-culley-iwarp-mpa-03.txt, June 2003 [4] S Kent, IP Encapsulating Security Payload (ESP), IETF internet draft, draftietf-ipsec-esp-v3-06.txt, July 2003 [5] Julian Satran, Costa Sapuntzakis, Mallikarjun Chandalapaka and Efri Zeidner, iSCSI, IETF internet draft, draftietf-ips-iSCSI-20.txt [6] R Stevens, TCP/IP illustrated, Volume 2, Addison-Wesley, November 2001 [7] http://www.nst.dsi.a-star.edu.sg/mcsa/ 220 [8] Yingping Lu and David H.C.Du, Performance Study of iSCSI-Based Storage Subsystems, IEEE communications magazine, August 2003 [9] Yongdae Kim, Fabio Maino, Maithili Narasimhma, and Gene Tsudik, Secure Group Key Management for Storage Area Networks, IEEE communications magazine, August 2003 [10] Shuang-Yi Tang, Ying-Ping Lu, and David H.C. Du, Performance Study of Software-Based iSCSI Security, Proceedings of the First International IEEE Security in Storage Workshop, SISW’02, 2003 [11] StephenAiken, Dirk Grunwald, Andrew R.Pleszkun, and Jesse Willeke, Performance Analysis of the iSCSI protocol, IEEE MSST, 2003 [12] http://www.rdmaconsortium.org [13] Rodney Van Meter, Gregory G. Finn, and Steve Hotz, VISA: Netstation’ s Virtual Internet SCSI Adapter, Proceedings of the eighth international conference on Architectural Support for Programming Languages and Operating Systems, Pages 71 - 80 , October 1998 [14] Peter Druschel and Larry L. Peterson, A High-Bandwidth Cross-Domain Transfer Facility, Proceedings of the fourteenth ACM symposium on Operating Systems Principles, volume 27 issue 5, December 1993 [15] David D. Clark and David L. Tennenhouse, Architectural Considerations for a New Generation of Protocols, ACM SIGCOMM Computer Communication Review, Proceedings of the ACM symposium on Communications architectures & protocols, Volume 20 Issue 4 , August 1990 [16]http://aixpdslib.seas.ucla.edu/packages/b onnie++.html [17] http://www.ethereal.com/ [18]http://sourceforge.net/projects/unh-iscsi/ 221 222 Rebuild Strategies for Redundant Disk Arrays Gang Fu, Alexander Thomasian∗, Chunqi Han, and Spencer Ng† Computer Science Department New Jersey Institute of Technology -NJIT Newark, NJ 07102, USA Abstract RAID5 performance is critical while rebuild is in progress, since in addition to the increased load to recreate lost data on demand, there is interference caused by rebuild requests. We report on simulation results, which show that processing user requests at a higher, rather than the same priority as rebuild requests, results in a lower response time for user requests, as well as reduced rebuild time. Several other parameters related to rebuild processing are also explored. 1 Introduction RAID5 with rotated parity is a popular design, which tolerates single disk failure and balances disk loads via striping. Striping allocates successive segments of files called stripe units on the N − 1 disks in an array of N disks, with one stripe unit dedicated to parity, so that a capacity equal to the capacity of one disk is dedicated to parity. In this case the parity group size G is equal to N . The parity blocks are kept up-to-date as data is updated and this is especially costly when small randomly placed data blocks are updated, hence the small write penalty. If a single disk fails, a block of data on that disk is recreated by exclusive-ORing (XORing) the corresponding blocks on the N − 1 surviving disks. 
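As a quick illustration of the reconstruction just described (a self-contained sketch with made-up block contents, not taken from the paper), XORing the corresponding surviving data and parity blocks of a stripe regenerates the lost block:

```python
def recover_block(surviving_blocks):
    """Recreate the block of the failed disk by XORing the corresponding
    blocks (data plus parity) read from the N-1 surviving disks."""
    out = bytearray(len(surviving_blocks[0]))
    for block in surviving_blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)

# Example with a 4+1 stripe: parity is the XOR of the four data blocks, so
# XORing the parity with any three data blocks yields the missing fourth.
data = [bytes([d] * 8) for d in (1, 2, 3, 4)]
parity = recover_block(data)
assert recover_block([data[0], data[1], data[3], parity]) == data[2]
```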
Each surviving disks needs to process the load due to forkjoin requests to recreate lost data blocks besides its own load, so that the load on surviving disks is increased, e.g., at worst doubled when all requests are reads. Clustered RAID solves this problem by selecting a parity group size G < N , so that the load increase is proportional to the declustering ratio: (G − 1)/(N − 1) [3]. Balanced Incomplete Block Designs – BIBD [1, 4] and random permutation lay∗ The first three authors are partially supported by NSF through Grant 0105485 in Computer Systems Architecture. † Hitachi Global Storage Technologies, San Jose Research Center, San Jose, CA. out [2] are two approaches to balance disk loads from the viewpoint of parity updates. The rebuild process is a systematic reconstruction of the contents of the failed disk, which is started immediately after a disk fails, provided a hot spare is available. Of interest is the time to complete the rebuild Trebuild (u) and the response time of user requests versus time: R(t), 0 < t < Trebuild (u). The utilization at all disks, which is equal to u before disk failure occurs, is specified explicitly, since it has a first order effect on rebuild time. A distinction is made between stripe-oriented or rebuild-unit(RU)-oriented and disk-oriented rebuild in [1]. In the former case the reconstruction proceeds one RU at a time, so that the reconstruction of the next RU is started after the previous one is reconstructed and even written to disk. Diskoriented rebuild reads RUs from all surviving disks asynchronously, so that the number of RUs read from surviving disks and held in a buffer in the disk array controller can vary. It is shown in [1] that diskoriented rebuild outperforms stripe-oriented rebuild, therefore the stripe-oriented rebuild policy will not be considered further in this study. Rebuild requests can be processed at the same priority as user requests, which is the case with the permanent customer model – PCM [2], while [1, 6] process rebuild requests when the disk is idle according to the well-known vacationing server model – VSM in queueing theory: an idle server (resp. disk) takes successive vacations (resp. reads successive RUs), but returns from vacation (resp. stops reading RUs) when a user request arrives at the disk. In effect rebuild requests are processed at a lower priority than user requests. In PCM a new RU is introduced at the tail of the request queue, as soon as the processing of the previous RU is completed, and hence the name of the model. The few studies dealing with rebuild processing [1, 2, 6] leave several questions unanswered, such 223 as the relative performance of VSM versus PCM, the effect of disk zoning, etc., but not all issues are addressed here due to space limitations. We extended our RAID5 simulator to simulate rebuild processing. This simulator utilizes a detailed simulator of single disks, which can handle different disk drives whose characteristics are available at: http://www.pdl.cmu.edu/Dixtrac/index.html. The reason for adopting simulation rather than an analytic solution method is because of the approximations required for analysis, which would have required validation by simulation anyway. Simulation results are given in the next section, which is followed by Figure 1: Performance Comparison of VSM and PCM conclusions. already rebuilt, versus time (t) for both VSM and PCM are shown in Figure 1. A disk failed and rebuild 2 Experimental Results was started at time t = 135 sec. 
The graphs shown The parameters of the simulation are as follows. We are averages over ten runs, but even more runs are utilize IBM 18ES 9 GByte, 7200 RPM disk drives. required to obtain a tighter confidence interval, say We assume an OLTP workload generating requests at 95% confidence level. The following observations to small (4KB) randomly placed blocks over the data can be made: blocks of the N = 19 disks in the array. Track alignment ensures that all 4 KB accesses are carried out efficiently, since they will not span track boundaries. Accesses to randomly placed disk blocks introduce a high overhead, i.e., 11.54 ms per request with FCFS scheduling, less than 1% of which is data transfer time. We assume that the ratio of reads to writes is R:W=1:0, since reads introduce a heavier performance degradation upon disk failure. While we experimented with different disk utilizations, only results for u = 0.45, which results in a 90% disk utilization are reported here. We assume zero-latency read and write capability, which has a significant impact on rebuild time. The parameter space to be investigated includes: (i) VSM versus PCM. (ii) The impact of buffer size per disk B specified as number of tracks. (iii) the size of the RU (rebuild unit) is a multiple of tracks and RU = 1 is the default value. (iv) the effect of preempting rebuild requests. (v) The effect of piggybacking, also considered in [1]. (vi) The effect of read-redirection and controlling the fraction of reads redirected. In fact due to its beneficial effect readredirection is postulated in all other cases reported here [1, 5]. (vii) The effect of the number of disks on rebuild time. (viii) A first order approximation for rebuild time. Due to space limitations item (iv) is not investigated. 2.1 (i) Disk utilizations are effectivy 100% utilized in both cases, but since RU reads are processed at a lower (nonpreemptive) priority in VSM user requests are only affected by the mean residual service time for RU reads, which is close to half a disk rotation at lower disk utilzations. PCM yields a higher R(t) than VSM since it reads RUs at the same priority as user requests. For each n̄ user requests processed (on the average) by a disk, the disk processes m̄ consecutive rebuild requests, so that the arrival rate is increased by a factor of 1 + m̄/n̄. Furthermore, rebuild requests, in spite of zero latency reads, have a longer service time than user requests. (ii) The rebuild time in PCM is higher because during the rebuild period, the disk utilizations due to user requests is approximately the same, but the disk “idleness” is utilized more efficiently by VSM than PCM. The reading of consecutive RUs is started in VSM only when the disk is idle, so that if the reading of an RU is completed before a user request arrives, the reading of the next RU can be carried out without incurring seek time. Uninterrupted processing of rebuild requests is less likely with PCM, since by the time the request to read an RU is being served, it is not likely that the disk queue is empty. This intuition can be ascertained by comparing the mean number of consecutive RU reads in the two cases. 2.2 VSM versus PCM The impact of buffer size Figures 2 and 3 show R(t) and Trebuild versus buffer The response time of user requests R(t) and the com- size for VSM and PCM, respectively. 
We can repletion percentage c(t), which is the fraction of tracks duce buffer space requirements by XORing the available blocks right away, but this would introduce con- 224 Figure 2: The impact of buffer size in VSM Figure 4: The impact of rebuild unit size in VSM Figure 3: The impact of buffer size in PCM Figure 5: The impact of rebuild unit size in PCM straints on hardware requirements. Two observations 2.4 can be made: (i) With disk-oriented rebuild a larger buffer size leads to shorter rebuild times in both VSM and PCM. This is because due to temporary load imbalance, rebuild processing with disk-oriented rebuild is suspended when the buffer is filled. A shared or semi-shared buffer requires further investigation. (ii) The impact of buffer size on R(t) for VSM is very small, but it is significant in PCM. In VSM, the rebuild requests are processed at a lower priority, so the R(t) is only affected by the time to read an RU. In PCM a small buffer size limits the rate of RU reads, i.e., no new RU reads are inserted into the disk queue when the buffer is full, but as B is increased, RU reads are introduced at a rate determined only by u. 2.3 The impact of rebuild unit size Figures 4 and 5 shows R(t) and Trebuild versus the RU size = 1, . . . 16 tracks. The following observations can be made: (i) The larger RU size leads to higher response times in VSM and PCM. In VSM the mean residual time to read an RU is added to the mean waiting time caused by the contention among user requests [6].(ii) Larger RU sizes lead to shorter rebuild times in VSM and PCM, because the rebuild time per RU is reduced, since the cost of one seek is prorated over the reading of multiple RUs. The effect of piggybacking Figure 6: The effect of piggybacking Figure 6 shows R(t) and c(t) versus t in VSM with and without piggybacking. Piggybacking is done at the RU level, rather than at the level of user accesses, e.g., 4 KB blocks, which has been shown to be ineffectual [1]. In other words, we extend the reconstruction of a single block into a full track. Piggybacking shortens rebuild time, but initially the disk utilization will exceed 2u in our case, since track reads have a higher mean service time than the reading of 4 KB blocks, which are carried out on behalf of fork-join requests (this is the reason why u = 0.4 in this case). We can control the initial increase in R(t) by controlling the fraction of piggybacked user requests. 225 low priority applications. We obtain a first order approximation for rebuild time with given disk characteristics by postulating that the rebuild time Trebuild (u) is simply a function of the disk utilization u, so that Trebuild (u) = Trebuild (0)/(1 − αu), αu < 1, where Trebuild (0) is the time to read an idle disk, and α is a measure of the average increase in disk utilization during rebuild. Figure 7: The effect of disk utilization on rebuild time 2.5 The impact of read redirection Read-redirection shortens the rebuild time (by a factor of three with 19 disks, u=0.45, and B=64) and also reduces R(t) as gradually more tracks from the failed disk are rebuilt. Since we assumed there are no write requests, we get the maximum improvement in performance with read redirection, but there will be less improvement in response time when all requests are updates. In “unclustered” RAID5 the utilization of the spare disk remains below the other disks and there is no need to control the fraction of read requests being redirected, as in [3]. 
With read redirection and R:W=1:0 the utilization of surviving disks initially doubles, but then it drops gradually as disk utilization reaches u, so that α ≈ 1.5 is an initial guess. On the other hand the rebuild time is slow when rebuild starts, so that more time is spent in the initial stages of rebuild. Curve fitting shows that α ≈ 1.75 yields an acceptable estimate of rebuild time, but there is a sensitivity to R:W ratio, and other factors, which are being explored. 3 Conclusions It is interesting to note that VSM outperforms PCM on both counts, user response times and rebuild time. The latter is counterintuitive since PCM rebuilds more aggressively. We are also investigating the effect of preempting rebuild requests, utilizing multiple rebuild regions, and allowing multiple user requests before rebuild processing in VSM is stopped. Finally, 2.6 The effect of the number of the disks we are interested in rebuild processing in RAID6 and Table 1 gives the rebuild time using VSM versus the EVENODD, especially when operating with two disk number of disks (N ) with various number of buffers. failures. The disk utilization is u = 0.45. It is observed that even at high disk utilization, rebuild time with VSM References is not affected very much by N . An additional observation is that due to the prevailing symmetry, that [1] M. C. Holland, G. A. Gibson, and D. P. Siewiorek. “Architectures and algorithms for on-line failure rethe load at all disks is balanced, rebuild time is detercovery in redundant disk arrays”, Distributed and mined by the reading time of any of the disks and that Parallel Databases 11(3): 295-335 (July 1994). the writing of the spare disk follows shortly there[2] A. Merchant and P. S. Yu. “Analytic modeling of clusafter. When the buffer size dedicated to the rebuild tered RAID with mapping based on nearly random process is small, rebuild time is more sensitive to N . Number of disks 9 19 29 permutation”, IEEE Trans. Computers 45(3): 367373 (1996). 39 [3] R. R. Muntz and J. C. S. Lui. “Performance analysis of disk arrays under failure”, Proc. 16th Int’l Conf. VLDB, 1990, pp. 162–173. Rebuild B=16384 1604.2 1622.8 1631.9 1632.0 B=64 2586.3 2618.9 2701.4 2704.5 Time (sec) B=16 3033.8 3246.0 3434.7 3526.5 Table 1: The effect of number of disks on rebuild time 2.7 The effect of disk utilization on rebuild time [4] S. W. Ng and R. L. Mattson. “Uniform parity distribution in disk arrays with multiple failures”, IEEE Trans. Computers 43(4): 501-506 (1994). [5] A. Thomasian and J. Menon. “Performance analysis of RAID5 disk arrays with a vacationing server model for rebuild mode operation”, Proc. 10th Int’l Conf. Data Eng. - ICDE, 1994, pp. 111-119. It follows from Figure 7 that rebuild time increases sharply with disk utilization. This is especially so [6] A. Thomasian and J. Menon. “RAID5 performance with an infinite source model, where the request arwith distributed sparing”, IEEE Trans. Parallel and rival rate is not affected by increased disk response Distributed Systems 8(6): 640-657 (1997). time. Rebuild time can be reduced by not processing 226 Evaluation of Efficient Archival Storage Techniques Lawrence L. 
You University of California, Santa Cruz Jack Baskin School of Engineering 1156 High Street Santa Cruz, California 95064 Tel: +1-831-459-4458 you@cs.ucsc.edu Christos Karamanolis Storage Systems Group Hewlett-Packard Labs 1501 Page Mill Road, MS 1134 Palo Alto, CA 94304 Tel: +1-650-857-6956 Fax: +1-650-857-5548 christos@hpl.hp.com Abstract ways of improving the space efficiency of disk-based The ever-increasing volume of archival data that need archival storage systems. Researchers have observed to be retained for long periods of time has motivated that they can take advantage of content overlapping, the design of low-cost, high-efficiency storage sys- which is common in archival data, to improve stortems. Inter-file compression has been proposed as a age efficiency [9, 3]. There are two main techniques technique to improve storage efficiency by exploit- proposed for this purpose. The first technique divides ing the high degree of similarity among archival data. each data object into a number of non-overlapping We evaluate the two main inter-file compression tech- chunks and stores only unique chunks in the archival niques, data chunking and delta encoding, and com- storage system. Chunks may be of fixed or variable pare them with traditional intra-file compression. We size. The second technique is based on resemblance report on experimental results from a range of repre- detection between data objects and uses delta encoding to store only deltas instead of entire data objects. sentative archival data sets. These two approaches have been developed and used in very different contexts, with different goals and Over the last several years, we have witnessed an un- data sets. For example, variable-size chunking was precedented growth of the volume of stored digital proposed for improving the bandwidth consumption data. A recent study estimated the amount of origi- of network file systems [7]. Delta encoding [1] has nal digital data generated in 2002 alone to be close to been used for data compression in HTTP [6] as well 5 exabytes, approximately double the volume of data as in version-control systems [11]. However, there created in 1999 [4]. An increasing fraction of this has been no attempt to compare the two approaches corpus is archival data: immutable data retained for side-by-side, evaluating the storage efficiency they long periods of time for legal or archival purposes. achieve and their applicability to different archival Examples of archival data include rich media such as data sets. audio, images and video, documents, email, and inThe goals of this work can be summarized as folstant messages. lows: (i) evaluate the applicability of the approaches The high rate of archival data generation has mo- on different data types that exhibit different degrees tivated a number of research projects to look into 1 Introduction 227 of inter-file similarities; (ii) identify the key parameters for each technique and provide rules of thumb for their settings for different data types; (iii) compare inter-file compression techniques with traditional lossless intra-file techniques and explore the potential benefits of hybrid approaches. Further, we provide a performance analysis of the different approaches, and discuss system design and engineering considerations. 2 Overview Archival data, by its nature, often exhibits strong inter-file resemblance. This paper examines techniques that take advantage of such inter-file resemblance to avoid storing redundant data and thus improve storage efficiency. 
Such techniques may be combined, if necessary, with lossless intra-file compression, such as sliding-window compression techniques (e.g. zip variants). Several systems that exploit data redundancy at different levels of granularity have been developed in order to improve storage efficiency. One class of systems detects redundant chunks of data at granularities that range from entire files (EMC's Centera) down to individual fixed-size disk blocks (Venti [9]) and variable-size data chunks (LBFS [7]). We focus on the use of variable-sized chunks, which have been reported to exhibit better efficiency over the special case of fixed-size blocks [8]. Typically, such techniques are used in content-addressable storage (CAS) infrastructures. The second class of systems detects and stores only differences (deltas) between similar files, at the granularity of bytes [3].

2.1 Chunking

Data chunking involves two problems. First, a data stream, such as a file, needs to be divided into chunks in a deterministic way. We consider the general case of variable-sized chunks, which works for any type of data, including binary formats. Chunk boundaries are defined by calculating some feature (a digital signature) over a sliding window of fixed size. In our prototype, we use Rabin fingerprints [10], for their computational efficiency in the above scenario. Boundaries are set where the value of the feature meets certain criteria, such as when the value, modulo some specified integer divisor, is zero; the divisor affects the average chunk size. Such deterministic algorithms do not require any knowledge of other files in the system. Moreover, chunking can be performed in a decentralized fashion, even on the clients of the system.

The second problem is to uniquely identify chunks. An algorithm is required that computes a digest over a variable-length block of data. Currently in the prototype, we reuse the Rabin fingerprinting code for deriving chunk identifiers. In practice and for very large data sets, one would need an algorithm that guarantees low probability for collisions, such as MD5 and SHA variants. Exactly because chunks are content-addressable, chunking is suitable for CAS systems. Only unique chunks are stored for any file. The original files can be reconstructed from their constituent chunks. To do that, the system needs to maintain metadata that maps file identifiers to a list of chunk identifiers. Any evaluation of storage efficiency must take into account the overhead due to the metadata. Figure 1 illustrates the main parameters of a chunking technique.

[Figure 1: Chunking parameters - sliding window size, chunk size, chunk boundaries (end/start), Rabin fingerprint size, and chunk ID size.]

We developed a prototype program, named chc, to evaluate the efficiency and performance characteristics of chunking without having to build a complete storage system. The input to chc is an archive (tar file) of a number of files that form the target data set; chc produces an output archive that includes unique chunks derived from the original files. Optionally, we can compress the individual chunks in the archive using the zlib compression library. chc captures a list of chunk identifiers for each file, as well as the identifier and size for each chunk. This metadata is also stored in the output archive and provides an estimate of the storage overhead due to chunking. In addition, chc can reconstruct the original data set from the final chunk archive.
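The following is a minimal sketch of the kind of content-defined chunking described above. It is not the chc implementation: a simple multiplicative rolling hash stands in for true Rabin fingerprints, SHA-1 stands in for the chunk-identifier digest, and the window, divisor, and size limits are merely illustrative values (the minimum and maximum chunk sizes echo those used in the experiments below).

```python
import hashlib

# Illustrative parameters; not the paper's exact configuration.
WINDOW = 48                      # sliding-window size in bytes
DIVISOR = 512                    # expected average chunk size ~ DIVISOR bytes
MIN_CHUNK, MAX_CHUNK = 32, 65536
BASE, MOD = 257, (1 << 61) - 1   # simple polynomial rolling hash, not Rabin

def chunk_boundaries(data: bytes):
    """Yield (start, end) offsets of content-defined chunks."""
    h, start = 0, 0
    power = pow(BASE, WINDOW - 1, MOD)      # coefficient of the byte leaving the window
    for i, b in enumerate(data):
        if i >= WINDOW:
            h = (h - data[i - WINDOW] * power) % MOD
        h = (h * BASE + b) % MOD
        length = i - start + 1
        if (h % DIVISOR == 0 and length >= MIN_CHUNK) or length >= MAX_CHUNK:
            yield start, i + 1
            start = i + 1
    if start < len(data):
        yield start, len(data)

def store(data: bytes, chunk_store: dict) -> list:
    """Store only unique chunks; return the per-file recipe (list of chunk IDs)."""
    recipe = []
    for s, e in chunk_boundaries(data):
        cid = hashlib.sha1(data[s:e]).hexdigest()
        chunk_store.setdefault(cid, data[s:e])
        recipe.append(cid)
    return recipe

def reconstruct(recipe: list, chunk_store: dict) -> bytes:
    """Rebuild the original file from its chunk-ID recipe."""
    return b"".join(chunk_store[cid] for cid in recipe)
```

Because the boundary decision depends only on the bytes inside the window, two files that share a long run of content produce the same chunks for that run regardless of where it appears, which is what makes deduplication across files possible.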
2.2 Delta encoding

Delta compression is used to compute a delta encoding between a new file and a reference file already stored in the system. When resemblance is above some threshold, a delta is calculated and only that is stored in the system. There are three key problems that need to be addressed in delta encoding.

First, resemblance has to be detected in a content-independent and efficient way. We use the shingling technique proposed by the DERD project [3]. It calculates Rabin fingerprints using a sliding window along an entire file (the window size is a configurable parameter). The number of fingerprints produced is proportional to the file size. A deterministic feature selection algorithm selects a subset of those fingerprints, called a sketch, which is retained and later used to compute an estimate of the resemblance between two files by comparing two sketches using approximate min-wise independent permutations [2]. This estimate computes the similarity between two files by counting the number of matching pairs of features between two sketches. It has been shown that even small sketches, e.g. sets of 20 features, capture sufficient degrees of resemblance.

Second, when new data needs to be stored, one has to find an appropriate reference file in the system: a file exhibiting a high degree of resemblance with the new data. In general, this is a computationally intensive task (especially given the expected size of archival data repositories). In the prototype, we use an exhaustive search over all stored files. We are currently investigating the use of hierarchical clustering of sketches to reduce the search.

The third problem is to calculate the delta encoding once a reference file has been found. Delta compression is a well-explored area, and in the prototype we used the xdelta tool [5], which computes the output using the zlib (gzip) library. Pointers to reference files are stored with every delta. These identifiers (e.g. SHA digests), along with sketch data, contribute to the accounted storage overhead. Our prototype consists of three programs, one for each of the three problems above: feature extraction, resemblance detection, and delta generation.

3 Evaluation

We explain the experimental methodology we used to measure the efficiency, describe the data sets, and analyze the storage efficiency and differences in performance of the two archival storage techniques.

As noted above, storage efficiency is the determining factor for the applicability of an inter-file compression approach. We report on the storage space required as a percentage of the original, uncompressed data set size. For example, stored data that is 20% the size of the original represents an efficiency ratio of 5:1. The functionality and performance of each approach depends on the settings of a number of parameters. As expected, experimental results indicate that no single parameter setting provides optimal results for all data sets. Thus, we first report on parameter tuning for each approach and different data sets. Then, using optimal parameters for each data set, we compare the overall storage efficiency achieved by each approach. The required storage includes the overhead due to the metadata that needs to be stored. Last, we discuss the performance cost and the design issues of applying the two techniques to an archival storage system.

3.1 Data Sets

We chose a range of data sets that we believe to be representative of archival data. Email messages often contain headers (and sometimes attachments) that show great resemblance. Source code and web content are typically versioned. Non-textual content such as presentations and imagery are often similar and require lots of storage space. Finally, computer-generated data such as logs are generated in high volumes and can contain repeated content such as field descriptors. The following is the list of data sets we use.

• HP Support Unix logs (two sets of different total volume)
• Linux kernel 2.2 source code (four versions)
• Email (single user)
• Mailing list archive (BLU)
• HP ITRC Support web site
• Microsoft PowerPoint presentations
• Digital raster graphics (California DRG 37122 7.5 minute untrimmed TIFF)

3.2 Parameter Tuning

In the case of chunking, the expected chunk size is a key configuration parameter. It is implicitly set by setting the fingerprint divisor as well as the minimum and maximum allowed chunk size. In general, the smaller it is, the higher the probability of detecting common chunks among files. For data with very high inter-file similarity (such as log files), small chunk sizes result in greater storage efficiency. However, for most data this is not the case, because smaller chunks also mean higher metadata overhead. Often, because of this overhead the storage space required may be greater than the size of the original corpus. As Figure 2 shows, the optimal expected chunk size depends on the type of data; using 128-bit identifiers, the best efficiencies range from 256 to 512 bytes.

[Figure 2: Chunking efficiency by divisor size (128-bit chunk IDs, min=32 bytes, max=65536 bytes). Efficiency (%) versus divisor size, 32 to 4096 bytes, for the List-KW, Paper Archive (PDF), Linux 2.2.x, List-Magick, DRG 32114, logs-500, List-BLU, and PowerPoint data sets.]

The main configurable parameter in the case of delta encoding is the size of the sketches, i.e., the number of features used for resemblance detection. Our experimental results are consistent with what was reported by Douglis et al.: a sketch size of 20 to 30 features is sufficient to capture resemblance among files. Another parameter is the resemblance threshold, the number of features that must correspond between two files to consider that sufficient resemblance exists to justify calculating the delta instead of storing the entire new file. For the evaluation of delta encoding, we traverse the target data set one file at a time in a random order. For a file, a delta is created against the file with the highest resemblance that is already in the output archive, as long as the resemblance is above a threshold of one corresponding feature. Otherwise, the entire file is stored as a new reference file.
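To make the sketch and threshold parameters concrete, the following is a simplified sketch of the resemblance-detection step. It is not the prototype's code: the per-window features use an ordinary digest instead of Rabin fingerprints, and the XOR-based "permutations" are only a stand-in for approximate min-wise independent permutations. The window size, sketch size, and threshold are illustrative.

```python
import hashlib

SKETCH_SIZE = 20      # 20-30 features suffice, per the experiments above
WINDOW = 24           # shingle window in bytes (illustrative)
THRESHOLD = 1         # matching features needed to delta-encode

def features(data: bytes):
    """One fingerprint per sliding window position (stand-in for Rabin)."""
    for i in range(max(len(data) - WINDOW + 1, 1)):
        yield int.from_bytes(hashlib.md5(data[i:i + WINDOW]).digest()[:8], "big")

def sketch(data: bytes):
    """Deterministic feature selection: for each pseudo-random permutation,
    keep the feature whose permuted value is minimal."""
    feats = list(features(data))
    out = []
    for s in range(SKETCH_SIZE):
        mask = (0x9E3779B97F4A7C15 * (s + 1)) & (2**64 - 1)
        out.append(min(feats, key=lambda f: f ^ mask))
    return out

def resemblance(sketch_a, sketch_b) -> int:
    """Estimated resemblance = number of matching feature pairs."""
    return sum(1 for a, b in zip(sketch_a, sketch_b) if a == b)

def pick_reference(new_sketch, stored: dict):
    """Exhaustive search over stored sketches, as in the prototype."""
    best_name, best_score = None, 0
    for name, sk in stored.items():
        score = resemblance(new_sketch, sk)
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= THRESHOLD else None
```

The sketch is a fixed, small amount of per-file metadata (on the order of SKETCH_SIZE entries), which is what keeps delta encoding's overhead low compared with per-chunk metadata that grows with file size.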
3.3 Storage Efficiency

Table 1 shows the storage efficiency achieved by the two approaches. We do that with and without additional zlib compression of chunks and deltas respectively. To establish a baseline for each data set, we create a single tar file from the data set, and then compress it with an intra-file compression program, gzip. As expected (and as shown by the first two rows of the table), inter-file compression improves with larger corpus sizes. This is not the case with gzip.

The HP Unix Logs (8,000 files) show very high similarity. Chunk-based compression reduced this data to 11% of the original data size, and when each chunk is compressed using zlib (similar to gzip) compression, it is just 7.7% of the original size. Even more impressive are the reductions in size when using delta compression. When delta compression is used alone, the data set is reduced to 4% of the original size, but when combined with zlib compression, the compressed data is less than 1% of the original size.

Textual content, such as web pages, can be highly similar. However, in the case of the HP ITRC content, gzip compression is more efficient than chunking or delta. More surprisingly, gzip is better even when we do additional compression of chunks and deltas. The reason is that gzip's dictionary is more efficient across entire files than within the smaller individual chunks, and chunk IDs appear as random (essentially non-compressible) data. But in the context of an archival storage system, gzip's advantage is not likely to be as effective in practice; this is discussed below.

Non-textual data, such as the PowerPoint files, achieve better efficiency with chunking and delta (especially with gzip) than with gzip alone. However, the achieved compression rates are less impressive than those for the log data. For raster graphics, delta encoding with gzip achieves modest improvement over gzip alone.

A single user's email directory and a mailing list archive show little improvement when using delta. Chunking is less effective than gzip, although we would expect it to reduce redundancy found across multiple users' data.

In most cases, inter-file compression outperforms intra-file compression, especially when individual chunks and deltas are internally compressed. Chunking achieves impressive efficiency for large volumes of very similar data. On the other hand, delta encoding seems better for less similar data. We believe that this is due to the lower storage overhead required for delta metadata. Typical sketch sizes of 80 to 120 bytes (20 to 30 features × 4 bytes) for a file of any size are significantly smaller than the overhead of chunk-based storage, which is linear with the size of the file.

Although compressing a set of files into a single gzip file to establish a baseline measurement helps illustrate how much redundancy might exist within a data set, it is not likely that an archival storage system would reach those levels of efficiency for several reasons. Most important is that files would be added to an archival system over time and files would be retrieved individually. If a new file were added to the archival store, it would not be stored as efficiently unless the file could be incorporated into an existing compressed file collection, i.e. the new file would need to be added to an existing tar/gzip file. Likewise, retrieving a file would require first extracting it from a compressed collection, and this would require additional time and resources over a chunk- or delta-based file retrieval method.

Our experiments measured the size of an entire corpus, in the form of a tar file after it has been compressed with gzip. Had we compressed each file with gzip first and then computed the aggregate size of all compressed files, the sizes for gzip-compressed files would have been much larger. For example, in the case of the HP ITRC web pages, gzip efficiency would have been 30% of the original size, much larger than the 16% shown in Table 1, and larger than the 26% that can be achieved by using delta compression with zlib. When delta compression (or to a lesser extent, chunking) is applied across files first and an intra-file compression method second, it is more effective than compressing large collections of data because additional redundancy can be eliminated.

Data Set                     Size       # Files   tar + gzip   Chunk   Chunk + zlib   Delta   Delta + zlib
HP Unix Logs                 824 MB     500       15%          13%     5.0%           3.0%    1.0%
HP Unix Logs                 13,664 MB  8,000     14%          11%     7.7%           4.0%    0.94%
Linux 2.2 source (4 vers.)   255 MB     20,400    23%          57%     22%            44%     24%
Email (single user)          549 MB     544       52%          98%     62%            84%     50%
Mailing List (BLU)           45 MB      46        22%          98%     53%            67%     21%
HP ITRC Web Pages            71 MB      4,751     16%          86%     33%            50%     26%
PowerPoint                   14 MB      19        67%          55%     46%            38%     31%
Digital raster graphics      430 MB     83        42%          102%    55%            99%     42%

Table 1: Storage efficiency comparison (64-bit chunk IDs)

3.4 Performance

In practice, space efficiency is not the only factor used to choose a compression technique; we briefly discuss some important systems issues such as computation and I/O performance. The chunking approach requires less computation than delta encoding. It requires two hashing operations per byte in the input file: one fingerprint calculation and one digest calculation. In contrast, delta encoding requires s + 1 fingerprint calculations per byte, where s is the sketch size. It also requires calculating the deltas, even though this can be performed efficiently, in linear time with respect to the size of the inputs. Additional issues with delta encoding include efficient file reconstruction and resemblance detection in large repositories.

The two techniques exhibit different I/O patterns. Chunks can be stored on the basis of their identifiers using a (potentially distributed) hash table. There is no need for maintaining placement metadata, and hashing may work well in distributed environments. However, reconstructing files may involve random I/O. In contrast, delta-encoded objects are whole reference files or smaller delta files, which can be stored and accessed efficiently in a sequential manner. But placement in a distributed infrastructure is more involved.

4 Conclusions

Inter-file compression is emerging as a technique to improve space efficiency in archival storage systems. This paper provides the first direct comparison of the two main techniques proposed in the literature, namely chunking and delta encoding, and compares them against traditional intra-file compression. In general, both chunking and delta encoding outperform gzip, especially when they are combined with compression of individual chunks and deltas. Chunking is computationally cheap and can be easily used in distributed systems. It works well for data with very high similarity. Thus, it is applicable to applications where there are multiple versions of the same data, such as version control systems, and log files. On the other hand, delta encoding is more computationally expensive, but more efficient with less similar data and thus, it is potentially applicable to a wider range of data sets.

Acknowledgments

Lawrence You was supported by a grant from Hewlett-Packard Laboratories (via CITRIS), Microsoft Research, and supported in part by National Science Foundation Grant CCR-0310888.
We thank Kave Eshghi and George Forman of Hewlett-Packard Laboratories for their help and insight into the behavior of file chunking. We are also grateful to members of the Storage Systems Research Center at the University of California, Santa Cruz for their help preparing this paper.

References

[1] M. Ajtai, R. Burns, R. Fagin, D. D. E. Long, and L. Stockmeyer. Compactly encoding unstructured inputs with differential compression. Journal of the ACM, 49(3):318–367, May 2002.
[2] A. Z. Broder, M. Charikar, A. M. Frieze, and M. Mitzenmacher. Min-wise independent permutations. Journal of Computer and Systems Sciences, 60(3):630–659, 2000.
[3] F. Douglis and A. Iyengar. Application-specific delta-encoding via resemblance detection. In Proceedings of the 2003 USENIX Annual Technical Conference, San Antonio, Texas, June 2003.
[4] P. Lyman, H. R. Varian, K. Swearingen, P. Charles, N. Good, L. L. Jordan, and J. Pal. How much information? 2003. http://www.sims.berkeley.edu/research/projects/how-much-info-2003/, Oct. 2003.
[5] J. P. MacDonald. File system support for delta compression. Master's thesis, University of California at Berkeley, 2000.
[6] J. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy. Potential benefits of delta-encoding and data compression for HTTP. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '97), Sept. 1997.
[7] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), pages 174–187, Lake Louise, Alberta, Canada, Oct. 2001.
[8] C. Policroniades and I. Pratt. Feasibility of data compression by eliminating repeated data in practical file systems. First Year Report.
[9] S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In D. D. E. Long, editor, Proceedings of the 2002 Conference on File and Storage Technologies (FAST), pages 89–101, Monterey, California, USA, 2002. USENIX.
[10] M. O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
[11] W. F. Tichy. RCS—a system for version control. Software—Practice and Experience, 15(7):637–654, July 1985.

An Efficient Data Sharing Scheme for iSCSI-Based File Systems

Dingshan He and David H.C. Du
Department of Computer Science and Engineering
DTC Intelligent Storage Consortium
University of Minnesota
{he, du}@cs.umn.edu

Abstract

iSCSI is an emerging transport protocol for transmitting SCSI storage I/O commands and data blocks over TCP/IP networks. It allows storage devices to be shared by multiple network hosts. A fundamental problem is how to enable consistent and efficient data sharing in iSCSI-based environments. In this paper, we propose a suite of data sharing schemes for iSCSI-based file systems and use ext2 as an example for implementation. Finally, we use simulations to verify the correctness of our designs and to study their performance.

[Figure 1. iSCSI network architecture scenario: database servers and file servers connected over an IP network to an iSCSI tape subsystem and an iSCSI disk array.]

1. Introduction

iSCSI [2] combines two popular and mature protocols in the data storage area and the network communication area: SCSI and TCP/IP. The Small Computer System Interface (SCSI) enables host computer systems to perform I/O operations of data blocks with peripheral devices.
iSCSI extends the connection of SCSI from traditional parallel cables, which could only stretch several meters, to ubiquitous TCP/IP networks. iSCSI encapsulates and reliably delivers SCSI Protocol Data Units (PDUs) over TCP/IP networks. iSCSI is expected to expand the coverage of Storage Area Networks (SAN). One major feature of a SAN is to provide a shared pool of storage resources for multiple accessing hosts. As storage devices are not directly attached to any hosts, they can be easily shared. However, data sharing goes beyond sharing storage devices. With data sharing, a single piece of data can actually be shared by multiple clients. The integrity of the data content has to be enforced. Therefore, a concurrency control mechanism is necessary when multiple hosts may read or write a single piece of data concurrently.

When using iSCSI-based storage subsystems, application servers (as initiators) and their shared storage subsystems (as targets) can be connected by TCP/IP networks over long distances, as depicted in Figure 1. Applications like file systems on the application servers are not aware that iSCSI-based storage subsystems are being accessed and that these storage subsystems are potentially shared with other application servers. Remote iSCSI devices are mounted at mount points of the application servers' local file systems, and corresponding file system modules are loaded to manage data blocks. The file systems for mounted iSCSI devices are not adapted for iSCSI at all. Although these mounted file system modules in IP hosts are able to use their existing file system locks locally, multiple IP hosts still cannot guarantee consistent data sharing. Therefore, our design coordinates the accesses of multiple iSCSI initiators. (This project is partially supported by members of the DTC Intelligent Storage Consortium (DISC) at UMN, and gifts from Intel and Cisco.)

Since iSCSI-based architectures could be deployed over WANs, latency introduced by large physical distances will be a severe restriction on performance. To improve performance, caching at application servers is necessary. In addition to hiding physical latency, caching at application servers can also reduce load at iSCSI targets. However, it incurs the problem of cache consistency, which is the consistency between cached data on initiators (application servers) and data on targets (iSCSI-based storage subsystems). In our design, we enforce strong consistency on cached data by using a callback cache similar to the Coda file system [6]. An iSCSI target keeps track of cached physical blocks on all connected iSCSI initiators. The iSCSI target forces the iSCSI initiators to discard their stale copies when it is going to modify a physical block.

Metadata are also concurrently accessed, and cached, by multiple initiators. However, a general solution for both metadata and normal file data would be inefficient, since metadata and normal file data differ in their access patterns. We design different mechanisms for them separately, as discussed in Section 3.1 and Section 3.2. Finally, most existing SAN file systems, such as Lustre, only provide UNIX file sharing semantics. However, UNIX file sharing semantics are not suitable for transactional execution of a sequence of operations. In transaction file sharing semantics, there should be a consistent view of any involved data throughout the execution of a transaction. Deadlocks are possible, so detection and resolution of deadlocks are considered.
In addition, rollback capability is required in case a transaction is unable to complete due to deadlocks. The rest of this paper is organized as follows. In Section 2, we summarize related work. In Section 3, we give an overview of our design and implementation. In Section 4, we present simulation results obtained with the ns-2 simulator to study the performance of our design.

2. Related Work

The performance of iSCSI-based storage subsystems is studied by Lu and Du in [3]. It is shown that iSCSI storage with a Gigabit connection can have performance very close to directly attached FC-AL storage, and that iSCSI storage in a campus network can achieve reasonable performance restricted by the available bandwidth. Tang et al [5] have studied the performance of software-based iSCSI security using IPSec and SSL. On the application side, researchers have started to consider using iSCSI to implement hierarchical web proxy servers and remote mirroring and backup. Several distributed or clustered file systems have designed different data sharing schemes. GFS uses the Device Lock mechanism, which has been included in the SCSI-3 specification as the Dlock command. The IBM Distributed Lock Manager (DLM) is an implementation of the classic VAX Cluster locking semantics. Another simpler implementation is the DLM in Lustre, with the introduction of the object concept. Our work is mainly different from these designs in network environment and locking granularity. We explicitly take possible long network latency into design consideration. Our locking granularity is at the physical block level.

3. Design Overview

Our proposed scheme for data sharing in iSCSI-based file systems consists of two major parts. The first part is a concurrency control mechanism to coordinate multiple concurrent accesses to shared data. The second part is a cache consistency control mechanism. We assume that no single operation will involve data from more than one iSCSI target/LUN.

Roselli et al [1] found that the percentage of metadata reads is much larger than metadata writes. In order to take advantage of this fact, our design allows iSCSI initiators to cache shared locks (referred to as semi-preemptible shared locks [4]) on metadata objects. On the other hand, normal data are organized into files. The access patterns for files are application dependent. Therefore, in addition to the integrity of shared data, we try to design a general locking scheme that achieves high concurrency and reduces the maintenance costs of locks. Unfortunately, these two goals conflict with each other. Fine-granularity locks are preferred to maximize concurrency, while coarse granularity reduces the number of locking requests and the memory space used to maintain locks. Our design tries to balance these two conflicting goals. The details of our hierarchical locking scheme are discussed later in Section 3.2.

3.1. Locking scheme for metadata

The locks for metadata are applied on metadata objects. We define the following 5 kinds of metadata objects: 1) directory files, 2) normal file inodes, 3) super block, 4) inode bitmap blocks, and 5) data-block bitmap blocks. For each metadata object, there are two possible kinds of locks: M S and M X. An M S lock gives shared access to the requested metadata object. An M X lock gives exclusive access to the requested object. The compatibility of the locks is shown in Table 1, where '+' indicates incompatibility.

         M S    M X
M S              +
M X       +      +

Table 1. Compatibility of metadata locks ('+' indicates incompatibility)
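As a rough illustration of how a target-side lock manager might evaluate Table 1, consider the sketch below. The class and method names are invented for illustration; the paper does not prescribe an API, and the callback handling shown is a simplification of the semi-preemptible behavior described next.

```python
# '+' entries in Table 1 (incompatibility) correspond to False below.
COMPATIBLE = {
    ("MS", "MS"): True,   # shared readers coexist
    ("MS", "MX"): False,
    ("MX", "MS"): False,
    ("MX", "MX"): False,
}

class MetadataLockManager:
    """Hypothetical target-side manager for M S / M X metadata locks."""
    def __init__(self):
        self.held = {}   # metadata object id -> {initiator: mode}

    def request(self, obj, initiator, mode, send_callback):
        holders = self.held.setdefault(obj, {})
        conflicting = [i for i, m in holders.items()
                       if i != initiator and not COMPATIBLE[(m, mode)]]
        # Cached M S locks are semi-preemptible: ask the holders to give
        # them up when a conflicting M X request arrives, then grant it.
        for i in conflicting:
            if holders[i] == "MS":
                send_callback(i, obj)   # holder releases once it has no conflicting use
                del holders[i]
            else:
                return False            # an existing M X holder blocks the request
        holders[initiator] = mode       # also covers an M S -> M X upgrade by the same initiator
        return True

    def release(self, obj, initiator):
        self.held.get(obj, {}).pop(initiator, None)
```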
The M S lock is semi-preemptible: an initiator is allowed to hold an M S lock until the lock manager asks it to release that M S lock. A callback mechanism is used to force the holder to release the lock when some initiator requests M X on the same metadata object. The holding initiator only complies with the request when it has no conflicting usage of the object. An M X lock for a metadata object is requested by an initiator when it is going to modify the metadata object. M X locks are not cached at the initiators, so initiators have to contact targets every time. An M X lock is always released immediately after the involved operations have finished. Another possible operation on an M S lock is to upgrade it to an M X lock. This happens when a metadata object is first read and cached locally, and later a write request for the same metadata object arrives at the same initiator.

3.2. Locking scheme for normal data

Normal data are organized into files. In order to balance high concurrency against high resource consumption, we design a two-level hierarchical locking scheme. The upper level is an entire file and the lower level contains fixed-size block groups. There are 4 possible locks applicable to nodes of such a hierarchy:

• D IS: intention shared access; allows explicit locking of descendant nodes in D S or D IS mode; no implicit locking of the sub-tree.
• D IX: intention exclusive access; allows explicit locking of descendant nodes in D X, D S, D IX, or D IS mode; no implicit locking of the sub-tree.
• D S: shared access; implicit D S locking of all descendants.
• D X: exclusive access; implicit D X locking of all descendants.

Intention modes are used to indicate that compatible locks are going to be requested at a finer level, and thereby prevent incompatible non-intention locks (D S and D X) at the upper level. Table 2 gives the compatibility of the lock modes, where '+' means conflict. Locks are always requested from root to leaves. Conversely, locks should be released in the reverse order, from leaves to root. Intention modes are not applicable to leaf nodes.

          D S    D X    D IS   D IX
D S               +              +
D X        +      +      +       +
D IS              +
D IX       +      +

Table 2. Compatibility of hierarchical locks ('+' means conflict)

3.3. Cache consistency control

Physical blocks fetched over the network are cached in iSCSI initiators' buffer caches. Buffer caches are checked first when a physical block is requested. In order to avoid revalidating the consistency of cached data blocks every time, we employ a mechanism based on callbacks. A callback record is set up on the iSCSI target side when a physical block is read out. When an iSCSI initiator is going to write a physical block, it first sends a SCSI CDB with the write request. The iSCSI initiator then waits for an R2T response before starting to transmit data. When an iSCSI target receives a SCSI CDB with a write request, it checks the callback records for the requested physical blocks. If there are outstanding callback records, callback requests are sent to those iSCSI initiators to ask them to purge the requested physical blocks from their buffer caches. An iSCSI target will not send the R2T response until it receives confirmations for all callback requests that it sent out.

3.4. Transaction file sharing semantics

In our design, file-accessing operations are grouped into transactions. Every transaction is assigned a unique transaction id within the session between an iSCSI initiator and an iSCSI target. Deadlocks can happen since we support transactions. Due to the random-access nature of file data, it is difficult to prevent deadlocks from happening. Therefore, we use a deadlock detection mechanism to detect deadlocks after they have happened. When a deadlock is detected, a victim transaction is selected and rolled back. The mechanism to detect deadlocks is to find a loop among transactions and locks.
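The loop-finding check of Section 3.4 can be sketched as a cycle search in a wait-for graph over transactions and locks. The data structures and the victim-selection policy below are illustrative assumptions, not taken from the paper.

```python
from collections import defaultdict

class WaitForGraph:
    """Nodes are ('T', txn_id) and ('L', lock_id); an edge T -> L means the
    transaction waits for the lock, and L -> T means the transaction holds it."""
    def __init__(self):
        self.edges = defaultdict(set)

    def add_hold(self, lock, txn):
        self.edges[("L", lock)].add(("T", txn))

    def add_wait(self, txn, lock):
        self.edges[("T", txn)].add(("L", lock))

    def find_cycle(self, start):
        """Depth-first search for a loop reachable from `start`; a loop is a deadlock."""
        path, on_path = [], set()
        def dfs(node):
            path.append(node)
            on_path.add(node)
            for nxt in self.edges[node]:
                if nxt in on_path:
                    return path[path.index(nxt):]    # the deadlock loop
                found = dfs(nxt)
                if found:
                    return found
            path.pop()
            on_path.discard(node)
            return None
        return dfs(start)

def pick_victim(cycle):
    """Roll back one transaction in the loop; choosing the highest id is
    only an example policy."""
    return max(n[1] for n in cycle if n[0] == "T")
```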
3.5. Implementation components

Figure 2 shows an overview of the architecture of our implementation. We insert new modules into both iSCSI initiators and iSCSI targets. Our implementation is based on the ext2 file system. The metadata and file data stored on the storage devices are intact. In iSCSI initiators, vfs is used between the upper-level system call layer and the lower-level iSCSI layer. We have inserted the following two modules into the kernel of the iSCSI initiators:

• iSCSI client module: a modified ext2 file system module. It manages transactions and the various metadata and normal data locks.
• Initiator cache manager module: manages a dedicated buffer cache for the iSCSI client module. It supports the callback mechanism to assure cache consistency.

[Figure 2. Overview of the architecture: the iSCSI initiator kernel (system call layer, virtual file system layer, local file systems, buffer cache, iSCSI client with metadata lock manager and open file manager, initiator cache manager, drivers, network portal) connected over an IP network to the iSCSI target kernel (iSCSI stack, target cache manager, buffer cache, drivers, network portal, local disks and logical units).]

An iSCSI target is responsible for maintaining active transactions, maintaining opened files, supporting callbacks for cached physical blocks, and so on. In iSCSI targets, we have inserted the following three modules into the targets' kernel:

• Metadata lock manager module: manages lock requests for the metadata objects mentioned in Section 3.1. It handles deadlock detection and resolution caused by metadata locking requests.
• File lock manager module: manages transaction requests, file open/close requests, and block group lock requests. It is also responsible for deadlock detection and resolution.
• Target cache manager module: maintains callback records for physical blocks cached at iSCSI initiators. When a SCSI CDB with a write command triggers callbacks, this module is also responsible for suspending the write command until all confirmations for the callback requests are received.

4. Simulation Results

We use the network simulator ns-2 to simulate the network environments. Our schemes are implemented as modules running on host nodes in ns-2. The implementation is based on the ext2 file system as described in Section 3.5. We implement application-level iSCSI initiators and iSCSI targets, which contain the components presented in Section 3.5. Table 3 shows the parameters we used for the SCSI disk modules in all of our simulations. These parameters follow the specification of the Seagate Cheetah 15K.3 family of disk drives. However, no disk-module cache is assumed in our simulations. Once a Read/Write command leaves the waiting queue on an iSCSI target for execution, the delay for a Read/Write access of one block of data is computed as

Delay_{Read/Write} = AverageLatency + AverageSeekTime_{Read/Write} + BlockSize / InternalTransferRate.

In our simulations, we let the iSCSI drivers on the iSCSI initiators send a SCSI CDB for every physical block. iSCSI targets process requests, including locking requests and read/write requests, sequentially.

Average Latency           2.0 msec
Average Read Seek Time    3.6 msec
Average Write Seek Time   3.9 msec
Internal Transfer Rate    62 Mbytes/sec (average of 49 to 75 Mbytes/sec)

Table 3. Parameters of the SCSI disk modules

4.1. Scheme Overhead

In order to investigate the overhead of our concurrency control and cache consistency scheme, we run simulations of single sequential writing of a single file. The operations are run as a single transaction as defined in Section 3.4. The file size used in these simulations is 100MB. For the writing of each physical block, we assume that the physical block must be read from its iSCSI target first and then written back. We have run this simulation for several different network configurations. Due to space limits, we only show the result obtained under one configuration. Figure 3 shows the composition of the transaction time under this configuration. We use different sizes for physical blocks and different sizes for the block groups defined in Section 3.2. For a given block group size, the total transaction time decreases as the physical block size increases. This is because a larger physical block size requires fewer transmissions and hence saves propagation delay. On the other hand, when using the same physical block size and varying the block group size, the time spent on normal data locks decreases as we increase the block group size. A larger group size requires fewer locking requests.

[Figure 3. Composition of total transaction time for sequential access with bandwidth=100Mbps and latency=1msec: time (sec) broken down into block read/write, metadata locks, data locks, transaction overhead, and others, versus physical block size (1KB to 8KB), for block group sizes 1, 2, 4, and 8.]

4.2. Effectiveness of Caching

In our design, physical blocks are cached in the iSCSI initiators' buffer caches to improve performance. Our next set of simulations tries to understand the effectiveness of such caching as the iSCSI environment extends from LAN to WAN. In addition, we also try to investigate the major factors affecting the effectiveness of caching, and how they do so.

In order to reflect the file access patterns of the real world, we use trace data generated from a modified TPC-C benchmark of the Transaction Processing Performance Council (TPC). The TPC-C benchmark is an OnLine Transaction Processing (OLTP) benchmark for database systems. We adapt this benchmark to generate file access traces. The TPC-C benchmark involves a warehouse management database with 9 relation tables. We view each relation table as a file storing fixed-size records consecutively. The TPC-C benchmark has 5 different transactions in SQL. For each transaction, we only trace the locations of real reading or writing of records in the table files. We completely ignore additional database metadata, such as indexes for keys in real database systems. For a transaction involving one or more table files, all table files are opened with the proper mode before any reading or writing of data blocks.

Network bandwidth and latency are two potential factors that could affect the effectiveness of iSCSI initiator caching. We set up a TPC-C configuration with 4 warehouses. Each warehouse has 10 distinct districts. There are a number of customers registered to a district. Initially, we generate information to load the 9 table files. After loading the initial data, the sizes of these 9 files range from over 380B to 120MB. We collect trace data from one client terminal, which is bound to one of the 4 warehouses. There are 200 transactions generated from this client terminal. The distribution of these transactions is 45% new order, 43% payment, 4% order status, 4% delivery, and 4% store level. In this set of simulations, we use 4K as the physical block size and 1 as the block group size. In Table 4, we show the comparison of with and without iSCSI initiator caching. The network bandwidth is 100Mbps and the network latency is 1msec. There are 724888 blocks accessed from cache. We also ran simulations for other network configurations, which show that the performance will be intolerable without caching as iSCSI-based systems extend to the WAN.

                  Time (sec)
                  reading blocks   total
w/ cache          29.5             50.2
w/o cache         5785.5           5806.3

Table 4. Comparison of w/ and w/o iSCSI initiator caching (bandwidth=100Mbps, latency=1msec)

Our cache consistency scheme employs a callback mechanism. A SCSI write command will be blocked at an iSCSI target until all responses for callback requests are received. Therefore, the access patterns for blocks will affect the effectiveness of caching and the performance of the cache consistency control scheme. Still using the aforementioned TPC-C configuration, we run simulations three times with 2, 3, and 4 client terminals, respectively. Each simulation is run for 3600 seconds. In each simulation, only one client repeatedly writes 10 physical blocks. The other clients repeatedly read the same 10 physical blocks. Each reading and each writing is a single transaction, so no deadlock can happen. The results show that as the number of concurrent readers increases, the single writer spends more time getting blocks from the iSCSI target. This is caused by two reasons. First, with higher concurrency, the write command conflicts more with read commands. Secondly, with more readers, there is a higher chance that when a write command is sent, a copy of the requested block is cached on some other clients.

References

[1] D. Roselli, J. Lorch, and T. Anderson. A comparison of file system workloads. In Proceedings of the USENIX Technical Conference, pages 41–54, San Diego, California, June 2000.
[2] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner. iSCSI. IP Storage Working Group, January 2003.
[3] Y. Lu and D. Du. Performance study of iSCSI-based storage subsystems. IEEE Communications Magazine, 41(8):76–82, 2003.
[4] R. Burns, R. Rees, and D. Long. Semi-preemptible locks for a distributed file system. In Proceedings of the 2000 International Performance Computing and Communication Conference (IPCCC), Phoenix, AZ, February 2000.
[5] S. Tang, Y. Lu, and D. Du. Performance study of software-based iSCSI security. In Proceedings of the First International IEEE Security in Storage Workshop, pages 70–79, 2002.
[6] M. Satyanarayanan, J. Kistler, P. Kumar, M. Okasaki, E. Siegel, and D. Steere. Coda: A highly available file system for a distributed workstation environment. IEEE Transactions on Computers, 39(4):447–459, 1990.

USING DATASPACE TO SUPPORT LONG-TERM STEWARDSHIP OF REMOTE AND DISTRIBUTED DATA
Robert L. Grossman, Dave Hanley, Xinwei Hong and Parthasarathy Krishnaswamy
Laboratory for Advanced Computing
University of Illinois at Chicago
851 South Morgan Street, Chicago, Illinois 60607
tel: +1-312-413-2176
email: grossman@uic.edu, dave@lac.uic.edu, xwhong@lac.uic.edu, babu@lac.uic.edu

1 Introduction

In this note, we introduce DataSpace Archives. DataSpace Archives are built on top of DataSpace's DSTP servers [2] and are designed not only to provide long-term archiving of data, but also to enable the archived data to be discovered, explored, integrated and mined.

DataSpace Archives are based upon web services. Web services' UDDI and WSDL mechanisms provide a simple means for any web service client to discover relevant archived data [7]. In addition, data in DataSpace Archives can carry a variety of XML metadata, and the DSTP servers which underlie the DataSpace Archives provide direct access to this metadata. Unfortunately, web services today do not provide the scalability required to work with large remote data sets. For this reason, DataSpace Archives employ a scalable web service we have developed called SOAP+.

As the amount of data grows, the ability to explore and browse remote and distributed archived data will become more and more important. For this reason, a requirement of DataSpace Archives is that they support direct browsing of the data they contain, without the necessity of first retrieving the data and then opening a local application. DataSpace Archives also support a type of distributed database key, described below, which enables data sets in different DataSpace Archives to be easily integrated. Finally, DataSpace Archives use emerging internet storage platforms, such as IBP [1] and OceanStore [6], as a basis for providing long-term storage, long past the demise of any individual disk or server.

2 The DataSpace Transfer Protocol (DSTP)

In previous work we have developed a protocol called the DataSpace Transfer Protocol or DSTP [2]. Data in DataSpace Archives can be accessed directly using DSTP or indirectly using web service based DSTP operations. Here is a quick summary of DSTP.

Data model. Data accessible via DSTP servers form a distributed collection of attribute-based data (in contrast to the file-based data that is usually available in data archives) that we refer to as DataSpace. At the simplest, data in DataSpace consists of records which may be distributed across nodes either vertically (by attributes) or horizontally (by records). Data records and data attributes are joined using universal keys, which are described next.

Universal Keys. One of the novel aspects of DataSpace is that certain attributes can be identified as universal (correlation) keys or UCKs by associating globally unique IDs or GUIDs with them. The assumption in DataSpace is that two distributed attributes having the same universal keys, as identified by their GUIDs, can be joined. This simple mechanism enables DataSpace to support vertically partitioned data, that is, data records whose attributes are geographically distributed across DSTP servers. In addition, universal keys may be attached to data sets and, in this way, data sets may be horizontally partitioned. In other words, data records may be geographically distributed across DSTP servers.
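The following sketch illustrates the vertical join that universal keys enable. The in-memory record layout and the function name are invented for illustration; DSTP itself defines the wire protocol and server behavior, not this representation.

```python
# Join vertically partitioned DataSpace data on a universal correlation key
# (UCK). Each partition contributes attribute values keyed by UCK value and
# is tagged with the UCK's GUID, so attributes from different DSTP servers
# can only be combined when their GUIDs match.
def join_on_uck(*partitions):
    """Each partition: {"uck_guid": str, "rows": {uck_value: {attr: value}}}."""
    guids = {p["uck_guid"] for p in partitions}
    if len(guids) != 1:
        raise ValueError("attributes can only be joined on the same UCK GUID")
    keys = set.intersection(*(set(p["rows"]) for p in partitions))
    return {k: {a: v for p in partitions for a, v in p["rows"][k].items()}
            for k in sorted(keys)}

# Hypothetical example: readings on one DSTP server, station descriptions on
# another, both keyed by the same station-id UCK.
server_a = {"uck_guid": "uck:station-id",
            "rows": {101: {"temp_c": 21.4}, 102: {"temp_c": 19.8}}}
server_b = {"uck_guid": "uck:station-id",
            "rows": {101: {"lat": 41.87}, 103: {"lat": 35.00}}}
print(join_on_uck(server_a, server_b))   # {101: {'temp_c': 21.4, 'lat': 41.87}}
```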
Metadata. The DataSpace infrastructure provides direct support for metadata. Each data set and each attribute can have metadata associated with it. In general, we assume that the metadata is in XML. Some DataSpace applications, with large amounts of metadata, use alternate formats for metadata for greater efficiency.

Data Access. DSTP-based access to data is via SOAP/XML or what we call SOAP+. SOAP+ is a variant of SOAP we have developed which employs a separate SOAP/XML-based control channel together with a data channel that employs a streaming protocol for moving large amounts of data or metadata efficiently. Depending upon the request, DSTP servers may return one or more attributes, one or more records, or entire data sets. DSTP servers can also return metadata about data sets or a list of universal keys associated with a data set. Each DSTP server has a special file called the catalog file containing XML metadata about the data sets on the server.

AAA Model. The default access and security model in DataSpace is web-based instead of grid-based. The difference is in how authentication, authorization, and access (AAA) are handled. For long-term stewardship, mechanisms to manage AAA are quite challenging. The default assumption in DataSpace is that data is open to anyone with a browser. Clearly, many data sets require some type of AAA infrastructure. In these cases, a AAA infrastructure can be implemented using one of the standard approaches, such as Globus GSI, IETF SSL/TLS, etc.

DSTP clients and servers support the following services:

• Discovery queries. Discovery in DataSpace is via web services' WSDL and UDDI mechanisms. This provides a standards-based mechanism for the discovery of data sets, data attributes, metadata, etc.
• Metadata queries. DSTP servers automatically create XML metadata about the data they serve and provide a simple mechanism for user-supplied metadata. DSTP clients can request attribute-based metadata, data set based metadata, or metadata summarizing all the data sets managed by the DSTP server. For example, the metadata associated with an attribute typically contains the number of data records on the DSTP server associated with that attribute, a description of the attribute, the min and max values, and perhaps the provenance of the attribute (a sketch of such catalog metadata follows this list).
• UCK queries. Several DSTP operations are based upon universal keys or UCKs. For example, a DSTP client can request all UCKs from two distributed DSTP servers, set a UCK, and then request all attributes associated with that UCK to join vertically partitioned data. Similarly, a DSTP client can request the UCKs associated with data sets on two distributed DSTP servers, set a data set UCK, and retrieve all data records associated with that UCK to merge two horizontally partitioned data sets.
• Range based queries. DSTP clients and servers support range based queries. Ranges may be determined using a single UCK or using several UCKs.
• Server side sampling. It is easy for DSTP servers to overwhelm DSTP client applications with data. DSTP servers support server side sampling so that appropriate amounts of data can be returned.
• Support for missing values. DSTP servers and clients support missing values as a primitive data type. This is important for exploratory data analysis and many data mining applications.
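As a sketch of the kind of per-attribute catalog metadata described in the Metadata queries item above, the snippet below generates record counts, missing-value counts, and min/max values as XML. The element names are invented for illustration; the actual DSTP catalog schema is not specified in this note.

```python
import xml.etree.ElementTree as ET

def attribute_metadata(name, values, description="", uck_guid=None):
    """Build illustrative catalog metadata for one attribute."""
    attr = ET.Element("attribute", {"name": name})
    if uck_guid:
        attr.set("uck", uck_guid)                 # mark the attribute as a universal key
    ET.SubElement(attr, "description").text = description
    present = [v for v in values if v is not None]  # missing values are a primitive type
    ET.SubElement(attr, "records").text = str(len(values))
    ET.SubElement(attr, "missing").text = str(len(values) - len(present))
    if present:
        ET.SubElement(attr, "min").text = str(min(present))
        ET.SubElement(attr, "max").text = str(max(present))
    return ET.tostring(attr, encoding="unicode")

print(attribute_metadata("temp_c", [21.4, 19.8, None], "surface temperature"))
```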
3 DataSpace Archives

We define a DataSpace Archive to consist of one or more DataSpace DSTP servers with the following additional structure:

• Internet-based, replicated storage. Since we cannot guarantee over long periods of time that DSTP servers will be backed up and that the hardware will be maintained, we have decided instead to rely, in part, on distributed, replicated internet-based storage, such as that provided by the OceanStore Project [6] or an Internet Backplane Protocol enabled storage system [1]. We are not suggesting that either of these two projects is ready yet to provide the type of long-term storage required for archiving data, but rather that, in principle, this type of approach may play a role in helping to ensure the long-term survival of data.

• XML based metadata. DSTP servers have a mechanism for attaching auxiliary information, such as XML metadata, to data sets. For example, PMML data may be attached to data sets in DSTP servers. The Predictive Model Markup Language (PMML) is an XML-based markup language for working with statistical and data mining data sets and models [3]. It can be used to describe data attributes, data mining attributes, and some common transformations used for preparing data for analysis and mining. Maintaining data over long periods of time is made more complex by the many transformations, normalizations, and aggregations that are generally part of data preparation. The assumption in DataSpace Archives is that these are done using the transformations specified by an open XML-based standard, such as PMML.

• Physical replicas. In practice, one of the most durable mechanisms for preserving data is paper, as well as other durable materials such as metal or stone. Although neither practical nor desirable in many cases, it is still useful to be able to associate physical replicas with DataSpace data sets. If an ID (bar code, radio frequency ID, etc.) is attached to the physical replica, then this ID may be associated with the data set in a DataSpace Archive. More generally, GUIDs of various types may be associated with attributes and data sets in DataSpace. These can be used so that reports, laboratory books, disks, tapes, etc. can be associated with attributes, collections of attributes, and data sets in DataSpace.

4 Implementation and Experimental Studies

To understand some of the stewardship issues associated with DataSpace Archives, we have taken the data sets in the University of California at Irvine Knowledge Discovery in Databases Archive (UCI KDD Archive) [5] and created a DataSpace Archive using them. This is available online at data.dataspaceweb.org. Before we created this DataSpace Archive, the only way to retrieve data from the UCI KDD Archive was via ftp. Moreover, as far as we are aware, the UCI KDD Archive was available on only a single server. Using a DataSpace Archive, UCI KDD data can now not only be directly viewed, but in addition, it can be explored with simple exploratory data analysis operations. In addition, XML-based metadata is readily available.

We have also integrated the DSTP server with the Internet Backplane Protocol or IBP [1], which provides a baseline replication of the physical storage underlying the DSTP server. We are currently exploring additional internet-based storage approaches in order to improve the long-term survival of the data, such as OceanStore [6]. We have not implemented any physical replicas yet. Finally, we note that DataSpace's support for UCKs allows UCI KDD Archive data to be integrated and correlated with other DataSpace data for the first time.
Number Records          10,000   50,000   150,000   375,000   1,000,000
DSTP: SOAP/XML (sec)    0.65     177      673       3078      21121
DSTP: SOAP/XML+ (sec)   0.21     0.72     125       301       823
Speed-up                3.10     245.83   5.38      10.23     25.66

Table 1: DSTP servers have two modes for data access. One mode uses SOAP/XML, which works for small data sets and (small) metadata. The other mode uses a new protocol called SOAP+ with separate control and data channels, which scales much better for large and complex data sets.

Figure 1: We have put most of the data from the UCI KDD data sets into DataSpace Archives. This figure illustrates the results of browsing one of the UCI KDD data sets and graphing the results.

5 Summary and Conclusion

In this note, we have introduced DataSpace Archives, which are archives for data sets built on top of DataSpace's DSTP servers. DataSpace Archives are designed to support the long-term archiving of data in a format in which it is easy to browse and explore the archived data. In addition, DataSpace Archives enable geographically distributed data to be integrated, even if the data comes from separate data sets. In general, most current archives provide ftp access to data, but do not support the browsing, exploration, and integration of large quantities of remote and distributed data.

We have implemented DataSpace Archives on top of Version 3.0 of DataSpace and created an archive containing the University of California at Irvine Knowledge Discovery in Databases Archive. We are currently creating a DataSpace Archive containing approximately a hundred data sets and about a Terabyte of data in order to test these ideas further. We are also currently improving the internet-based replicated storage for DataSpace Archives, as well as exploring additional mechanisms to ensure that long-term survival of the data is possible, such as the use of physical replicas for important data sets. We are also working with the Data Mining Group [4] to standardize web services for data discovery, data integration, and data mining so that DataSpace Archives remain open and non-proprietary.

References

[1] M. Beck, T. Moore, and J. Plank. An end-to-end approach to globally scalable network storage. In ACM SIGCOMM 2002 Conference, 2002.
[2] Robert Grossman and Marco Mazzucco. DataSpace - a web infrastructure for the exploratory analysis and mining of data. IEEE Computing in Science and Engineering, pages 44–51, July/August 2002.
[3] Data Mining Group. Predictive Model Markup Language (PMML). Retrieved from http://www.dmg.org, September 30, 2003.
[4] Data Mining Group. Web services for data mining. Retrieved from http://www.dmg.org, September 30, 2003.
[5] University of California at Irvine (UCI). UCI knowledge discovery in databases archive. http://kdd.ics.uci.edu/, retrieved on September 4, 2003.
[6] Sean Rhea, Chris Wells, Patrick Eaton, Dennis Geels, Ben Zhao, Hakim Weatherspoon, and John Kubiatowicz. Maintenance-free global data storage, 2001.
[7] W3C semantic web. Retrieved from http://www.w3.org/2002/ws/, September 4, 2003.

Promote-IT: An efficient Real-Time Tertiary-Storage Scheduler

Maria Eva Lijding, Sape Mullender, Pierre Jansen
Distributed and Embedded Systems Group (DIES), University of Twente
P.O. Box 217, 7500AE Enschede, The Netherlands
{lijding, sape, jansen}@cs.utwente.nl
tel +31-53-4893770, fax +31-53-4984590

Abstract

Promote-IT is an efficient heuristic scheduler that provides QoS guarantees for accessing data from tertiary storage. It can deal with a wide variety of requests and jukebox hardware.
It provides short response and confirmation times, and makes good use of the jukebox resources. It separates the scheduling and dispatching functionality and effectively uses this separation to dispatch tasks earlier than scheduled, provided that the resource constraints are respected and no task misses its deadline. To prove the efficiency of Promote-IT we implemented alternative schedulers based on different scheduling models and scheduling paradigms. The evaluation shows that Promote-IT performs better than the other heuristic schedulers. Additionally, Promote-IT provides response times near the optimum in cases where the optimal scheduler can be computed.

1 Introduction

Today multimedia data is generally stored in secondary storage (hard disks) and is delivered to the users from there. However, the amount of storage capacity needed for a multimedia archive is large and constantly growing with the expectations of the users. Tertiary-storage jukeboxes can provide the required storage capacity in an attractive way if the data can be accessed with real-time guarantees. A jukebox (we use the term jukebox to refer to any type of Robotic Storage Library, RSL) is a large tertiary storage device that can access data from a large number of removable storage media (RSM, for example DVDs or tapes) using a small number of drives and one or more robots to move RSM between their shelves and the drives. A central problem with this setup is that the RSM switching times are high, in the order of tens of seconds. Thus, multiplexing between files may be many orders of magnitude slower than on a hard drive, where it takes only a few milliseconds. The second important problem is the potential for resource contention that results from the shared resources in the jukebox.

Our hierarchical multimedia archive (HMA) is a service that provides flexible real-time access to data stored in tertiary storage. The HMA can serve complex requests for the real-time delivery of any combination of media files it stores. A request consists of a deadline and a set of request units for individual files (or parts of files). Such requests can for instance result from a database query to compile a historical background for news on the fly, or from a personalized entertainment program consisting of music video clips. The HMA can also be used to provide real-time guarantees in the access of scientific data, e.g., earth measurements or weather forecasts. In the latter cases it is especially important to be able to tell the users in advance when the data will be available in secondary storage.

Tertiary storage plays an important role in supercomputing environments and scientific computing. Essential to these environments is the capacity to deal with petabytes of data that must be easily accessible to geographically distributed scientists. The storage hierarchy that stores the data must be transparent to the users, except for the delays of accessing data in tertiary storage. The IEEE Mass Storage System Reference Model [7] describes the characteristics such systems should possess. Multiple hierarchical storage management (HSM) systems have been developed, both conforming to the reference model and prior to it. Some examples are the High Performance Storage System (HPSS) of the National Storage Laboratory [23], and the Storage and Archive Manager File System (SAM-FS) of Fujitsu [9]. The openness of the reference model permits the inclusion of specific real-time services as future interfaces [23].
However, no HSM so far supports real-time services. Our HMA can be incorporated in the reference model as a Storage Server component. The HMA uses secondary storage as a buffer and cache for the data in its tertiary-storage jukeboxes. The jukebox scheduler is the key component of the HMA that guarantees the in-time promotion of data from tertiary storage to secondary storage. Apart from providing real-time guarantees, the scheduler also tries to minimize the number of rejected requests, minimize the response time for ASAP requests, minimize the confirmation time, and optimize hardware utilization. We use a new design of jukebox schedulers, where the scheduling and dispatching functionality are clearly separated. This separation allows us to improve the performance of the system, because the optimality criteria of both functions are different. The goal of the schedule builder is to find feasible schedules for the requested data. Thus, the scheduler tries to build schedules as flexible as possible and is not concerned about the optimal use of the resources. The dispatcher, instead, is concerned about utilizing the jukebox resources in an efficient manner. We introduce the concept of early dispatching, by which a dispatcher can dispatch the tasks earlier than scheduled as long as the resource constraints are respected and no task misses its deadline. The first step to build an efficient scheduler is to understand the scheduling problem thoroughly. On the one hand, we model the hardware and identify the parameters that define the hardware behavior. Our model is flexible and can represent any present and expected future jukebox hardware. On the other hand, we formalize the scheduling problem using scheduling theory so that its characteristics and complexity can be analyzed, and the problem can be classified and compared with other scheduling problems. Given the complexity 246 of the scheduling problem we are dealing with, there are many different ways in which it can be modeled. The most important of these models is the minimum switching model, which models the problem as a flexible flow shop with three stages—load, read, unload. The model uses shared resources to guarantee mutual exclusion in the use of the jukebox resources. This model puts only a small restriction on the utilization of the resources, which additionally results in better use of the resources and system performance. The model requires that once an RSM is loaded in a drive, all the requested data of the RSM is read before the RSM is unloaded. Thus, the schedules that can be built with this model have a minimum number of switches. Promote-IT is based on the minimum switching model. For every incoming request it builds a new schedule that includes all the previously scheduled request units plus the request units of the new request. It uses an efficient heuristic algorithm to find a solution to an instance of the minimum switching model on-line. Promote-IT can deal with any type of request and jukebox hardware. Additionally, it provides short response times and confirmation times, and makes good use of the jukebox resources. We defined different scheduling strategies for Promote-IT, which vary in the way in which the jobs are added to the schedule. These strategies can be classified as Front-to-Back (earliest deadline first (EDF) and earliest starting time first (ESTF)) and Back-to-Front (latest deadline last (LDL) and latest starting time last (LSTL)). 
When using Front-to-Back, each job is scheduled as early as possible, while with Back-to-Front, each job is scheduled as late as possible. When using Back-to-Front, Promote-IT profits strongly from the separation of scheduling and dispatching. The scheduler creates schedules with idle times that are used by the dispatcher to dispatch tasks early. This combination proves useful in many cases, especially when the use of a shared robot is the bottleneck in the system. The rest of the paper is organized as follows. Section 2 discusses related work. Section 3 presents some more details about Promote-IT. Section 4 evaluates Promote-IT, comparing its capabilities and performance with those of other schedulers. Finally, Section 5 concludes the paper.

2 Related Work

We first discuss two schedulers that can be used in an HMA. In Section 4 we compare the performance of these schedulers with that of Promote-IT. Later in this section we briefly discuss schedulers for simpler requests, schedulers with unsolved contention problems, and schedulers for discrete media. Lau et al. [15] present an aperiodic scheduler for Video-on-Demand systems that can use two scheduling strategies: aggressive and conservative. When using the aggressive strategy each job is scheduled and dispatched as early as possible, while with the conservative strategy each job is scheduled and dispatched as late as possible. These two strategies are similar to the EDF and LDL strategies that we use in Promote-IT. An important difference between the strategies of Lau et al. and Promote-IT is that their strategies dispatch the tasks in the same sequence and at the same time as assigned in the schedule. Thus, the conservative strategy performs poorly, because it leaves the resources idle, even when there are tasks that need executing. Another important difference is that their algorithm treats each job to be included in the schedule as consisting of a read task and a switch task. The switch task is scheduled as a unit, although it involves unloading the RSM loaded in the drive and loading the new RSM. Lau et al. assume that all the drives are identical and that the switching time is constant, independent of the drive and shelf involved. The former assumption is reasonable in many jukeboxes, but makes the algorithm difficult to generalize to the case with non-identical drives. The latter assumption is not reasonable in most of the large jukeboxes and forces the use of worst-case switching times when building the schedules. Using more accurate switching times provides better schedules. Federighi et al. [8] use requests similar to those of the HMA. In their system the videos may be stored in multiple objects, with different sound tracks and subtitles corresponding to each video. The requests in their system have soft deadlines, e.g., the data should be available at around eight o'clock. Federighi et al. are mainly concerned with balancing the load on distributed video file servers, which are placed near the users [2]. An important difference from our approach is that, even if the requests consist of multiple objects, the playback only begins once all the objects are available at the video file servers. We refer to this type of approach as Fully-Staged-Before-Starting (FSBS). There are multiple proposals for scheduling continuous data stored in one RSM [4, 6, 5, 21, 25, 11].
The main difference among these proposals is whether the data should be fully staged before starting, streamed directly to the user, or pipelined (i.e., the data of a request can be consumed while other data of the request is being staged). We show in our work that pipelining the data is the best approach. Various authors try to solve the problem of providing access to data contained in multiple RSM; however, their schedulers suffer from unsolved contention problems [1, 16, 12, 3]. Therefore, these schedulers cannot guarantee that the real-time deadlines are always met. We analyze the faults in these schedulers in [17]. There are numerous proposals for scheduling requests for discrete data [24, 10, 22, 19, 20]. The goal of these schedulers is to minimize the average response time. In all cases, the conclusion is that as much data as possible should be read from an RSM while the RSM is loaded in the drive. These results support the minimum switching model that we use in Promote-IT. More et al. [20] are concerned with performing queries on data that is stored on multiple tapes. Thus, a query may have multiple request units without real-time constraints. Their goal is to minimize the response time of each query. They model the scheduling problem as a two-machine flow shop with additional constraints. In their model, the unload and load of a tape are coupled. They propose the longest transfer-time first (LtF) algorithm, which for each query starts reading first the data of the sub-queries that require the longest transfer time. If there are multiple sub-queries for the same tape, they use the SORT algorithm proposed by Hillyer et al. [13] to decide the order in which the sub-queries should be read. The rationale behind the LtF algorithm is that while the data of the longest sub-query is being read, there is time to switch the tapes on the other drives and read the data corresponding to the shorter sub-queries. Through analysis and simulations they show that LtF provides short response times. In our HMA we can represent the type of requests More et al. are concerned with as a request with multiple request units with the same delta deadline (see Section 3 for details about the requests of the HMA). The strategies of Promote-IT that use the latest starting time as the parameter to sort the jobs build schedules similar to those of LtF for this type of request, even if the length of the transfer is not the scheduling parameter used by Promote-IT. Given a set of RSM with the same deadline and different transfer times, the ones with longer transfer times will have earlier latest starting times. Thus, these strategies of Promote-IT will also schedule those RSM to begin earlier.

3 Promote-IT

A request r_i, which a user issues to the Hierarchical Multimedia Archive, consists of a deadline and a set of l_i request units u_ij for individual files (or parts of files). The request can represent any kind of static temporal relation between the request units. Formally we express the user request structure in the following way:

r_i = (\tilde{d}_i, asap_i, maxConf_i, \{u_{i1}, u_{i2}, \ldots, u_{i l_i}\})
u_{ij} = (\Delta\tilde{d}_{ij}, m_{ij}, o_{ij}, s_{ij}, b_{ij})

The deadline d̃_i of the request is the time by which the user must have guaranteed access to the data. The flag asap_i indicates if the request should be scheduled as soon as possible. The user may specify no deadline (d̃_i = ∞) if the only restriction is that the request should be scheduled ASAP.
The maximum confirmation time maxConf_i is the time the user is willing to wait in order to get a confirmation from the system, which indicates whether the request was accepted or rejected. The system must provide a confirmation before making the data available, so maxConf_i ≤ d̃_i. The relative deadline Δd̃_ij of the request unit is the time at which the data of the request unit should be available, relative to the starting time of the request. The other parameters of the request unit, m_ij, o_ij, s_ij and b_ij, represent the RSM where the data is stored, the offset in the RSM, the size of the data, and the bandwidth with which the user wants to access the data, respectively.

The starting time of the request must not be later than its deadline, so st_k ≤ d̃_k. If the request is ASAP, the scheduler assigns the request the earliest possible starting time st_k that will allow it to be incorporated into the system. Thus, the scheduler must find the minimum starting time st_k that makes U′ schedulable, where U′ is the set of request units that need scheduling. The scheduler tries different candidate starting times st^x_k and selects the earliest feasible st^x_k. If the request is not ASAP, the scheduler assigns it the starting time corresponding to its deadline. If the deadline of the request cannot be met, then the scheduler puts the request in the list of unscheduled requests until it can schedule it or maxConf_i is reached and the request is rejected.

The structure of the scheduling algorithm of Promote-IT is the following:

1. Generate a candidate starting time st^x_k and update the deadline of each request unit so that d̃_kj = st^x_k + Δd̃_kj. The algorithm uses a variation of the bisection method for finding roots of mathematical functions.
2. Model U′ as an instance of the minimum switching model. We represent the instance of the problem by the set J of jobs to schedule.
3. Compute the medium schedules. For each RSM, compute m medium schedules—one MS for each drive. Set the parameters of the duration and deadline of the read tasks T_2j to the corresponding values of the computed MS.
4. Compute the resource assignment. The algorithm must incorporate each job J_j ∈ J into the schedule. If the algorithm succeeds in finding a valid resource assignment, the output of this step is a feasible schedule S^x; otherwise S^x = ∅. The pair (S^x, st^x_k) is incorporated into the list of analyzed solutions.
5. Repeat from step 1 until the bisection stop criterion is fulfilled for the list of candidates, i.e., the time difference between the last unsuccessful and first successful candidate is smaller than a threshold.
6. Select the best solution. The best solution is the earliest candidate starting time for which step 4 could compute a feasible schedule (min{st^x_k | S^x ≠ ∅}). If there is no such st^x_k, the request r_k is placed in the list of unscheduled requests to be scheduled at a later time. Otherwise, the scheduler confirms the starting time st_k to the user and replaces the active schedule with the new feasible schedule.

In the minimum switching model we limit the number of jobs to one per RSM. This model requires that all the requested data from an RSM must be read before the RSM is unloaded from a drive. The processing environment of our model is a flexible flow shop with three stages (FF3). The first stage is to load an RSM into a drive, the second stage is to read the data from the RSM, and the third is to unload the RSM. The jobs to be processed are of the form J_j = {T_1j, T_2j, T_3j}, with one task for each stage.
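As an illustration of steps 1-6, the following Python sketch shows one way the candidate-starting-time search could be organized. The names find_earliest_start, build_schedule, t_max, and the stop threshold are assumptions made for this sketch, not details taken from the authors' implementation; build_schedule stands in for steps 2-4.

# Illustrative sketch of the bisection search over candidate starting times.
# build_schedule and the threshold are assumptions, not Promote-IT code.

def find_earliest_start(request_units, build_schedule, t_now, t_max, threshold=1.0):
    """Return (starting_time, schedule) for an ASAP request, or (None, None)
    if no candidate up to t_max yields a feasible schedule.

    build_schedule(units, start) is assumed to shift the unit deadlines to the
    candidate start, model them as minimum-switching jobs, and return a feasible
    schedule or None (steps 2-4).
    """
    schedule = build_schedule(request_units, t_max)
    if schedule is None:
        return None, None                    # request goes to the unscheduled list
    lo, hi = t_now, t_max                    # lo: infeasible so far, hi: feasible
    best = (hi, schedule)
    while hi - lo > threshold:               # bisection stop criterion (step 5)
        mid = (lo + hi) / 2.0                # candidate starting time (step 1)
        schedule = build_schedule(request_units, mid)
        if schedule is None:
            lo = mid                         # infeasible: try later starting times
        else:
            hi, best = mid, (mid, schedule)  # feasible: remember and search earlier
    return best                              # earliest feasible candidate (step 6)

The feasibility test hides most of the work: the medium schedules and the branch-and-bound resource assignment described in the following paragraphs.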
Both the drives and robots may all have different characteristics. Therefore, the processors at each stage are modeled as unrelated. In the first stage there are l processors representing the l loader robots. In the second stage there are m processors representing the drives. In the third stage there are u processors representing the unloader robots. The robots in the first and third stage may have some elements in common and in the worst case all the elements will be the same: when all robots are able to load and unload. Because the robots may be limited to serve only a subset of drives and shelves, there are jobs that can be executed only in a subset of resources. In the model we indicate this by using machine eligibility restrictions. 250 The processing time of a reading task T2j is determined by computing a separate schedule for all request units that are grouped into Jj . We call this schedule for an RSM a Medium Schedule (MS). An MS determines in which order the data must be read once the RSM is in the drive. As the drives may be non-identical, we compute a separate MS for each drive. The optimization criterion for an MS is to maximize the time at which the RSM has to be loaded in a drive to start reading data from it, in such a way that the deadlines of the request units are met. In other words we want to determine the latest possible starting time of the read. If the RSM is already loaded in a drive, the goal is to read the requested data before the RSM must be unloaded. In step 4 we use a branch-and-bound algorithm to prune the tree of possible assignments of jukebox resources to the jobs in J . The branch-and-bound algorithm uses the best-drive heuristic to choose which drive will be tried first to schedule a job and prune from the tree the branches corresponding to drives which offer a worse solution. When pruning the tree, the algorithm may be throwing away a feasible solution that an optimal scheduler would find. But searching the whole tree of solutions is computationally unacceptable. For comparison, we have also implemented an optimal scheduler, but it can take up to several days to compute a feasible schedule for one new request, in contrast to the few milliseconds needed by Promote-IT. The jobs are incorporated to the schedule using any of the four strategies presented in Section 1. None of the strategies is absolutely better than the others, because each strategy can find schedules that cannot be found by the others. However, ESTF performs best in most cases, and when the system load is very high it is convenient to use LDL. The difference in performance between the different strategies is small when compared with other schedulers. Therefore, in the next section we use only ESTF and LDL as representatives of Promote-IT. 4 Evaluation To prove the efficiency of Promote-IT, we implemented alternative schedulers based on different scheduling models and scheduling paradigms. On the one hand, we designed two new schedulers: the jukebox early quantum scheduler (JEQS) and the optimal scheduler. On the other hand, we extended some heuristic schedulers proposed in the literature (see Section 2): the extended aggressive strategy, the extended conservative strategy and FullyStaged-Before-Starting (FSBS). Our extensions are able to deal with the requests used in the HMA, and with jukeboxes with different drive models and multiple robots. Furthermore, they do not assume constant switching and reading times. 
The extended schedulers have better properties than the original ones, while still keeping the features of the original schedulers that we consider most important to evaluate. The jukebox early quantum scheduler (JEQS) is a periodic scheduler. The basic heuristic used by a periodic scheduler is to represent the requests as periodic tasks. A restriction of periodic schedulers is that they can be used only for some special use cases of HMA, as Video-on-Demand, because they are unable to deal with complex requests. Addition- 251 ally, periodic schedulers have serious problems in avoiding resource-contention problems. JEQS solves these problems by using the robots and drives in a cyclic way. The robot exchanges the contents of each drive at regular, fixed intervals. This results in a cyclic use of the drives, which are dedicated to reading data of an RSM while the other drives are being served by the robot. To our best knowledge, JEQS is the only correct periodic jukebox scheduler. The other periodic jukebox schedulers presented in the literature do not deal correctly with the resource-contention problem. JEQS uses the scheduling theory on early quantum tasks (EQT) presented in [14]. An early quantum task is a task whose first instance is executed in the next quantum after its arrival and the rest of the instances are scheduled in a normal periodic way with the release time immediately after the first execution. Although, JEQS is generally able to start incoming requests in the next cycle of a drive, its performance is much worse than that of any of the aperiodic schedulers. The optimal scheduler is a scheduler that computes the minimum response time for each incoming request. The objective of this scheduler is to be used as a baseline for evaluating the quality of the heuristic schedulers. The optimal scheduler cannot be used in a real environment due to its computing-time requirements. The computing time increases exponentially with the complexity of the requests and the system load. Therefore, we can only use it for evaluation of small test sets and relatively low system load. The comparisons that we performed show that the performance of Promote-IT is near the optimum, at least under these special testing conditions. The simulations shown in this section were performed with JukeTools [18], using in each case representative jukebox architectures and request sets. Except when evaluating the optimal scheduler, the size of the cache is 10% of the jukebox capacity. The number of requests that can be handled by the schedulers in each of the examples shown depend on the request set and the hardware used. In each of the individual comparisons all the schedulers use the same request set and hardware simulation. As we show, some schedulers can handle load better than others, and in general the load shown in the graphics is determined by the load that can be handled by the most restrictive scheduler being compared. Figure 1(a) compares the response time of aperiodic and periodic schedulers. Aperiodic scheduling is represented by Promote-IT and FSBS, periodic scheduling by JEQS. In this comparison the performance of Promote-IT is representative for the performance of the extended aggressive strategy and extended conservative strategy, because the difference in performance among these schedulers is negligible when compared with the difference among Promote-IT, FSBS and JEQS. 
For JEQS we consider two variations: scheduling normal quantum tasks (shown in the plots as 'JEQS') and scheduling only EQTs (shown in the plots as 'JEQS only EQTs'). We use FSBS in this comparison because, even though FSBS is very simple, in many cases it performs better than JEQS. FSBS behaves similarly to a First-Come-First-Serve scheduler, which effectively means that no serious scheduling is done. It first serves a request completely and only then provides access to the data of the request.

Figure 1: Aperiodic vs. periodic scheduling. (a) Mean response time. (b) Mean confirmation time. (Time in seconds vs. system load in requests/hour for JEQS, JEQS only EQTs, Promote-IT, and FSBS.)

Figure 2: Pipelining vs. full staging. Mean response time. (Time in seconds vs. system load in requests/hour for FSBS, Promote-IT (LDL), and Promote-IT (ESTF).)

The request set consists of 1000 ASAP requests that follow a Zipf distribution. Each request corresponds to one long-video file, because of the restrictions imposed by JEQS. To be able to use JEQS the request must be only for data stored in one RSM in a contiguous fashion. Additionally, JEQS needs the data to be continuous media. When using Promote-IT the request is split into request units of 100 MB in size. The requests cannot be rejected, i.e., deadline and maximum confirmation time are infinite. The data in the jukebox is stored on double-layered DVDs and each video is stored completely on one disk. However, one disk may store multiple videos. The jukebox has four identical DVD drives and one robot. The load time is between 21.8 and 24.8 seconds, while the unload time is between 14.3 and 17.4 seconds. The drives use CAV technology and have a transfer speed that ranges between 7.96 and 20.53 MBps and a maximum access time of 0.17 seconds. The response time of Promote-IT is much shorter than the response time of JEQS. As the system load increases, the performance of FSBS is also better than that of JEQS. JEQS uses the resources poorly, because it performs multiple switches for reading data from an RSM. In contrast, Promote-IT and FSBS use the resources efficiently by performing the minimum number of switches required to read the data. The confirmation time of the aperiodic schedulers is shorter than that of JEQS (see Figure 1(b)). The main difference can be seen with 'JEQS only EQTs', because this scheduler waits to accept a request until it can schedule it as an EQT. As the system load increases, the chances of accepting a request as an EQT diminish drastically. Periodic schedulers have a clear advantage over aperiodic schedulers in the computing time, because they just need to evaluate a couple of formulae to decide if a request is schedulable. However, this advantage is not visible to the end user, who notices only the response time and the confirmation time. When evaluating the performance of the optimal scheduler, we will show that the computing time becomes an important parameter when it influences the confirmation time. We conclude that periodic scheduling is a bad technique for scheduling a jukebox, because even the FSBS scheduler—which is extremely simple—in many cases performs better than JEQS. The bad performance of JEQS is not a characteristic of this particular scheduler, but is intrinsic to any periodic jukebox scheduler.
A periodic scheduler either needs to use the robot in a cyclic way, or to take into account the worst-case time for robot contention in the execution time of the tasks. Therefore, when using a periodic scheduler, the best-case starting time for a request that does not produce a cache-hit is the maximum time needed to switch all the drives, even if the system load is very low and all drives are idle. In the same scenario, the starting time for Promote-IT is in most cases just the time to load the RSM in the drive and read the data of the first request unit. For FSBS it is the time needed to stage all the data of the request. The response time of Promote-IT is also shorter than that of FSBS, as can be seen in Figure 1(a). FSBS stages the whole file before giving access to the user. Therefore, the response time of FSBS is bounded below by the time to buffer the whole file, while the lower bound for Promote-IT is the time to buffer the first request unit.

Figure 3: Early vs. conservative dispatching. (a) Mean response time. (b) Mean confirmation time. (Time in seconds vs. system load in requests/hour for Conservative, Promote-IT (LDL), and Promote-IT (ESTF).)

Moreover, Figure 2 shows that the difference in performance between Promote-IT and FSBS is even larger when the data of a request is stored in multiple RSM. In this case, FSBS needs to perform multiple switches before giving access to the data, while in most cases Promote-IT only needs to perform one switch to read the data corresponding to the first request unit; the rest of the switches are performed at a later time, when the scheduler finds time for them. In this case the data in the jukebox consists of 30% long videos, 30% short videos, 30% music and 10% discrete data. The requests follow that pattern as well. The data of a request may be stored in multiple RSM. Figure 3(a) compares the response time of Promote-IT and the extended conservative strategy (denoted as 'Conservative'). The main difference between LDL and the extended conservative strategy is the early dispatching of the tasks. As the system load increases, the difference in performance between Promote-IT and Conservative grows very fast. At the highest load level plotted, Conservative is unable to handle the load, because the waiting queue is too long. The test set is the same one described for comparing Promote-IT and FSBS with the requested data stored in multiple RSM. The response time and confirmation time of LDL and ESTF are very similar when compared against the corresponding times of Conservative. Furthermore, when the system load is high, LDL performs slightly better than ESTF. This reinforces the idea that Back-to-Front is an interesting scheduling mechanism when it is combined with early dispatching. The confirmation time of Promote-IT is also lower than that of Conservative (see Figure 3(b)). Conservative often fails to schedule incoming requests, because the starting time they should be assigned is too far into the future. Thus, the requests stay in the queue of unscheduled requests until the scheduler can incorporate them into the schedule. The robot and drive utilization of Conservative is much less than that of LDL. When not using early dispatching, the resources are left idle, even if there are tasks in the schedule.
Thus, when new requests arrive, their chances of being scheduled immediately are lower, even when the system load is low, because the scheduler has tasks scheduled for the future.

Figure 4: Uncoupled vs. coupled load and unload. Mean response time. (a) Fast Jukebox. (b) Slow Jukebox. (Time in seconds vs. system load in requests/hour for Aggressive, Promote-IT (LDL), and Promote-IT (ESTF).)

Figure 4 compares the response time of Promote-IT and the extended aggressive strategy (denoted as 'Aggressive'). The main difference between Aggressive and the ESTF strategy of Promote-IT is that Aggressive couples the load and unload into a single switch operation. This means that the RSM stay loaded in the drives until the drives are needed again. Therefore, Aggressive needs to perform an unload first before using a drive, even if the drive and the robot are idle before the request arrival. When the system load is low or medium, Promote-IT provides shorter response times than Aggressive. However, when the system load is high and the robot is a clear bottleneck in the system, as in the case plotted in Figure 4(a), Aggressive has a better mean response time than Promote-IT. In this situation, the response time of Aggressive is similar to that of LDL, although Aggressive builds schedules Front-to-Back and LDL builds them Back-to-Front. However, Aggressive delays the last unload of a drive as much as possible, until the drive is needed again, which is the original goal of a Back-to-Front strategy. When the system load is low or medium, it is highly probable that at the time when a new request arrives there are idle resources. Therefore, delaying the unloads as much as Aggressive does affects the performance negatively. When the load is high it does not really matter, because there is no opportunity to unload the drives early anyhow. When the robot is not a strong bottleneck, as in the case plotted in Figure 4(b), Promote-IT provides shorter response times than Aggressive, even under high system loads. In this case unloading late is not beneficial: ESTF also performs better than LDL. The request set used to compare Promote-IT and Aggressive is the same as the one used for FSBS and the extended conservative strategy. The 'Fast Jukebox' has the same configuration as previously described; in this configuration the robot is a clear bottleneck. The 'Slow Jukebox' has four DVD drives based on CLV technology with a transfer speed of 7.96 MBps and a maximum access time of 1.5 seconds.

Figure 5: Heuristic vs. Optimal. (a) Mean response time. (b) Mean computing time. (Time in seconds vs. system load in requests/hour for Optimal, Aggressive, Promote-IT (LDL), and Promote-IT (ESTF).)

Figure 5(a) shows that the response time provided by Promote-IT is near the optimal response time. Moreover, the difference in response time between Promote-IT and the optimal scheduler is smaller than the difference between Aggressive and Promote-IT. The plots indicate that the difference in response time between Promote-IT and the optimal grows larger as the system load increases. Unfortunately, we cannot run the optimal scheduler with higher loads.
The computing time of the optimal scheduler increases exponentially when the load of the system increases, while the computing time of the heuristic schedulers is nearly constant (see Figure 5(b)). The computing times of the optimal scheduler are so high that the scheduler cannot be used in an on-line system. Additionally, in none of the runs we have performed does the optimal scheduler unload an RSM before all the requested data has been read. This is an important result in favor of the minimum switching model, on which Promote-IT is based, because even though the optimal scheduler has the possibility of performing intermediate switches, it does not do so. The request set consists of 200 ASAP requests for long videos. The optimal scheduler does not deal with the cache administration. Therefore, each request corresponds to a different video and the cache is empty at the beginning of the runs. Thus, there are no cache-hits. The jukebox only contains long videos, which were generated in the same way as those described in the comparison against JEQS. However, the data in the jukebox is stored on single-layered DVDs and the drives use CLV technology with a transfer speed of 6.45 MBps and a constant access time of 0.1 seconds. When building the request sets we have to make a trade-off between keeping the number of request units per request low and having more than one request unit per RSM. The computational complexity of the optimal scheduler increases exponentially with the number of request units to schedule, so we should only split each file into a small number of request units. On the other hand, we want to give the optimal scheduler the possibility of switching an RSM without reading all the requested data from the RSM. Therefore, it is desirable to have more than one request unit per RSM. In the tests that we show here, we chose to chop the files into request units of 2.5 GB. Thus, the number of request units per request is between 1 and 4 and the number of RSM involved is 1 or 2.

Throughout this section we have shown that Promote-IT performs better than the other schedulers. However, the magnitude of the performance difference varies in each case. To put the differences in context, we compare all the schedulers with each other. We evaluate the capacity of the schedulers to deal with flexible requests and hardware. We also evaluate the schedulers regarding the response time, confirmation time, computing time, and the capacity to deal with high load. Table 1 summarizes the evaluation.

                                 Flexibility:  Flexibility:  Response  Confirmation  Computing  Deal with
                                 requests      hardware      time      time          time       high load
FSBS                             ++            ++            −−        +             ++         +
Extended Aggressive Strategy     ++            +             +         ++            ++         ++
Extended Conservative Strategy   ++            ++            −         +             ++         −−
Promote-IT                       ++            ++            ++        ++            ++         ++
JEQS                             −−            −             −−        −             +++        −−
Optimal                          +             −             +++       −−−           −−−        −−−

Table 1: Summary of the performance comparison. The notation used is: excellent (+++), very good (++), good (+), bad (−), very bad (−−), and unusable (−−−).

The classification we assigned to the schedulers in the last four categories is the result of observing their performance in multiple test setups. Although the classification is quite subjective and difficult to quantify, we believe that it correctly reflects the average performance of the schedulers. Note that the original conservative and aggressive strategies and FSBS can only handle very limited types of jukeboxes and requests. The extensions we performed are discussed in [17].
5 Conclusions Through Promote-IT we show that tertiary storage can be used effectively in systems with real-time requirements, for instance in a hierarchical multimedia archive. However, careful scheduling is needed in order to provide those guarantees, to use the resources efficiently, and to provide short response times to the users. A performance comparison of different schedulers, shows that Promote-IT performs better than the other heuristic schedulers, and additionally provides response-times near the optimum in cases where the optimal scheduler can be evaluated. 258 References [1] J. Boulos and K. Ono. Continuous data management on tape-based tertiary storage systems. In Proc. of the 5th International Workshop on Interactive Distributed Multimedia Systems and Telecommunication Services, pages 290–301, Sept. 1998. [2] D. W. Brubeck and L. A. Rowe. Hierarchical storage management in a distributed VoD system. IEEE Multimedia, 3(3):37–47, 1996. [3] H. Cha, J. Lee, J. Oh, and R. Ha. Video server with tertiary storage. In Proc. of the Eighteenth IEEE Symposium on Mass Storage Systems, April 2001. [4] S.-H. G. Chan and F. A. Tobagi. Designing hierarchical storage systems for interactive ondemand video services. In Proc. of IEEE Multimedia Applications, Services and Technologies, June 1999. [5] A. L. Chervenak. Tertiary Storage: An Evaluation of New Applications. PhD thesis, Dept. of Comp. Science, University of California, Berkeley, December 1994. [6] A. L. Chervenak. Challenges for tertiary storage in multimedia servers. Parallel Computing, 24(1):157–176, Jan. 1998. [7] J. L. Cole and M. E. Jones. The IEEE storage system standards working group overview and status. In Proc. of the 14th IEEE Symposium on Mass Storage Systems. IEEE, September 1995. [8] C. Federighi and L. A. Rowe. Distributed hierarchical storage manager for a video-on-demand system. In Storage and Retrieval for Image and Video Databases (SPIE), pages 185–197, February 1994. [9] Fujitsu. Storage management overview. White Paper, January 1998. [10] C. Georgiadis, P. Triantafillou, and C. Faloutsos. Fundamentals of scheduling and performance of video tape libraries. Multimedia Tools and Applications, 18(2):137–158, 2001. [11] S. Ghandeharizadeh and C. Shahabi. On multimedia repositories, personal computers, and hierarchical storage systems. In Proc. of th ACM Multimedia Conference, 1994. [12] L. Golubchik and R. K. Rajendran. A study on the use of tertiary storage in multimedia systems. In Proc. of Joint NASA/IEEE Mass Storage Systems Symposium, March 1998. [13] B. K. Hillyer and A. Silberschatz. Random I/O scheduling in online tertiary storage systems. In Proc. of the 1996 ACM SIGMOD International Conference on Management of Data, pages 195–204, June 1996. [14] P. G. Jansen, F. T. Y. Hanssen, and M. E. Lijding. Scheduling of early quantum tasks. In 15th Euromicro Conf. on Real-Time Systems, pages 203–210. IEEE Computer Society Press, Jul 2003. [15] S.-W. Lau and J. C. S. Lui. Scheduling and replacement policies for a hierarchical multimedia storage server. In Proc. of Multimedia Japan 96, International Symposium on Multimedia Systems, March 1996. [16] S.-W. Lau, J. C. S. Lui, and P. Wong. A cost-effective near-line storage server for multimedia system. In Proc. of the 11th International Conference on Data Engineering, pages 449–456, March 1995. [17] M. E. Lijding. Real-Time Scheduling of Tertiary Storage. PhD thesis, CTIT Ph.D.-thesis Series 03-48, Univ. of Twente, May 2003. 259 [18] M. E. Lijding, P. G. Jansen, and S. J. 
Mullender. Implementing and evaluating jukebox schedulers using JukeTools. In 20th IEEE Symp. on Mass Storage Systems, pages 92–96, San Diego, California, Apr 2003. IEEE Computer Society Press, Los Alamitos, California. [19] C. Moon and H. Kang. Heuristic algorithms for I/O scheduling for efficient retrieval of large objects from tertiary storage. In Proc. of the Australiasian Database Conference, pages 145– 152. IEEE, February 2001. [20] S. More and A. Choudhary. Scheduling queries on tape-resident data. In Proceeding of the European Conference on Parallel Computing, 2000. [21] H. Pang. Tertiary storage in multimedia systems: Staging or direct access? ACM Multimedia Systems Journal, 5(6):386–399, November 1997. [22] S. Prabhakar, D. Agrawal, A. E. Abbadi, and A. Singh. Scheduling tertiary I/O in database applications. In Proc. of the 8th International Workshop on Database and Expert Systems Applications, pages 722–727, September 1997. [23] D. Teaff, D. Watson, and B. Coyne. The architecture of the High Performance Storage System (HPSS). In Proc. of the Fourth NASA GSFS Conference on Mass Storage Systems and Technologies, 1995. [24] P. Triantafillou and I. Georgiadis. Hierarchical scheduling algorithms for near-line tape libraries. In Proc. of the 10th International Conference and Workshop on Database and Expert Systems Applications, pages 50–54, 1999. [25] P. Triantafillou and T. Papadakis. On-demand data elevation in hierarchical multimedia storage servers. In Proc. of 23rd International Conference on Very Large Data Bases (VLDB’97), pages 226–235, 1997. 260 THE DATA SERVICES ARCHIVE Rena A. Haynes Sandia National Laboratories, MS 0822 Albuquerque, NM 87185-5800 Tel: +1-505-844-9149 e-mail: rahayne@sandia.gov Wilbur R. Johnson Sandia National Laboratories, MS 1137 Albuquerque, NM 87185-5800 Tel: +1-505-845-0279 e-mail: wrjohns@sandia.gov Abstract As access to multi-teraflop platforms has become more available in the Department of Energy Advanced Simulation Computing (ASCI) environment, large-scale simulations are generating terabytes of data that may be located remotely to the site where the data will be archived. This paper describes the Data Service Archive (DSA), a service oriented capability for simplifying and optimizing the distributed archive activity. The DSA is a distributed application that uses Grid components to allocate, coordinate, and monitor operations required for archiving large datasets. Additional DSA components provide optimization and resource management of striped tape storage. 1. Introduction In the ASCI environment, deployment of massively parallel computational platforms and distributed resource management infrastructure for remote access allows computations to execute on the platform that can best perform the calculation. Data generated by very large calculations can be hundreds of gigabytes to terabytes in size. Datasets are typically distributed across hundreds to thousands of files that range in size from megabytes to gigabytes. Extracting information from the data and transporting the reduced data to a local platform that can provide the interactivity required for the analysis is used to accomplish analysis and visualization of large datasets. Large datasets typically remain in the site that generated the data until they are archived to a locally managed parallel tape storage system (HPSS [1]). 
Although high-speed wide area networks and parallel file transfer components allow high performance movement of data, the amount of time required to transmit large datasets allows opportunities for errors that abort the transfer. Recovery from aborted transfers requires retransmission of files. A disk file system is used to buffer the dataset files before they are transferred across the local area network to HPSS parallel tape storage. Both the wide-area and the local-area transfers cause sustained high-speed bursts of data. Until the file data has been placed onto tape storage, files remain on file system storage. Without resource control and coordination, multiple occurrences of this activity can overwhelm both the intermediate 261 file system and the archival system, especially as more capability platforms, like ASCI Q, Purple, and Red Storm, are deployed. Grid middleware [2] can address some of the issues involved in archiving large distributed datasets. Core services in Grid middleware provide uniform methods for scheduling, allocating resources, and monitoring data transfer processes on distributed systems. More recently, middleware developed for Data Grids address issues involving data location, storage resources, and cache management. The SDSC Storage Resource Broker (SRB [3]) supports location transparency and data replication. Storage Resource Managers [4], [5] implement storage access and cache management policies. Current Data Grids [6], [7], [8], [9] are built to support efficient and cost effective access to large data collections for a geographically distributed community of scientists. Because of this, the focus has been on enabling distributed access to a store of metadata and data, which may also be replicated in the data grid. Our focus is on usability, robustness, and resource issues involved in transferring large datasets to a parallel tape archival storage. We describe the Data Service Archive (DSA), a service oriented capability for simplifying and optimizing the distributed archive activity. The DSA is a distributed application that uses Grid components to allocate, coordinate, and monitor operations required for archiving large datasets. Additional DSA components provide optimization and resource management capabilities for striped tape storage. The following sections present requirements, a functional overview, and performance of the DSA application. Planned extensions to the DSA are also discussed. 2. High Level Requirements The requirements for the DSA include the usual requirements of a simple, easily managed user interface to make archival and retrieval requests. Ease of use includes desktop access to clear interfaces for control and status as well as to a well-defined mechanism for making archive requests. The DSA should mitigate complexities of the distributed environment by providing a common interface regardless of data location and recovery from system failures where possible; however, file and resource brokering are not required. Any desktop software should have minimal prerequisites with automated installation and update procedures. Command line invocation is also required to support requests from running processes. Resource management requirements for the DSA include managing concurrent archival requests, scheduling requests and data transfers, optimizing data transfers, and balancing the load on the file system cache and the tape storage system. 
Optimizing data transfers and balancing the load on the parallel tape storage system requires obtaining knowledge of network topology, tape drive resources, and striping policies of the tape system. Software management requirements call for a clean integration with HPSS, preserving the storage system namespace and storage system policies. Accessing the storage system must be through supported mechanisms. 262 Additional functional requirements for the DSA include support for data integrity features, multiple storage patterns, and archive persistence management. Data integrity features should be available to verify correct wide area network transfers as well as the final image on tape storage. Storage patterns initially to be supported include requests to create an archive of tarred files as well as a directory hierarchy of files. Archive persistence introduces the notion of useful data lifetime for an archive. 3. DSA Architecture and Functional Overview The DSA is a distributed application with components (shown in Figure 1) located on the user’s desktop, a web server, the remote platform where the dataset resides, and the local data analysis cluster that is integrated with HPSS. The DSA GUI is started from the user’s desktop by accessing the Data Services DSA URL, which automatically checks for updates and optionally downloads the latest software release. The GUI allows users to define an archive request, which requires filling in a minimal amount of information required for data transfer. Features for archive formats and data integrity options may also be selected. An optional comment field is available that can be viewed in context of the archive action request. When the request is submitted for execution, DSA prompts for an identification that will be used to generate an archive identification that uniquely identifies this archive action from all others. The user may check status from the GUI or close it and check status at a later time through the information management service interface in the GUI. Desktop Web server Grid Services HPSS HPSS client Scheduler/ Shpss node Tape Tape Tape Silo Silos Silos HPSS mover nodes Remote Platform I/O nodes Comm node Data Analysis Cluster Figure 1. Desktop software interacts with a servlet on the DSA web server to transmit the archive action request. The servlet maintains state throughout the archive process to enable recovery, restart, or cancellation. The servlet creates metadata that identifies the archive 263 request, then starts a two-step process to archive the data to HPSS. The servlet constructs and submits a transfer request to the scheduler on the data analysis cluster. Once scheduled, the request is sent to the grid services application server, which starts an instance of the dataset transfer program (Zephyr) on the remote platform in behalf of the user. Zephyr, which runs with the user’s credentials, determines what files are to be transferred and how the transfer is to take place. If verification is requested, a checksum is generated for each file and the checksum information is placed in the archive. If a format option is selected, the data is processed on the remote platform. After completing the information gathering and formatting steps, Zephyr transfers the data to a Parallel Virtual File System (PVFS [10]) disk storage on the data analysis cluster using a parallel file transfer protocol. This completes the first step of the archive process. The current DSA implementation does not preallocate space on the PVFS disk cache. 
If a storage ceiling threshold is reached on the file system, the DSA service becomes unavailable until space is reclaimed. Dataset transfer operations detect the error and may be recovered through normal DSA recovery mechanisms. The transfer to HPSS is scheduled immediately after completion of the transfer from the remote platform by a component called shpss. When data files are transferred as a directory hierarchy, shpss places files in groups called partitions. Files in a partition have the same approximate file characteristics (currently file size) so they will be placed in the same storage class on HPSS and, consequently, use the same tape resources. The number of files in a partition is limited to facilitate recovery if errors occur and to improve overall transfer throughput by preventing large transfers from starving small transfers in the scheduling process. Shpss acts as tape resource manager by placing partition transfers in queues based on the partition file characteristics. The shpss queues use a node allocation mechanism to model the scheduling activity so that it tracks HPSS allocation of tape units. Use of this node allocation mechanism allows management of queues and queuing policies through standard system queue manager commands. If enough tape units are available, the transfer starts immediately. The partition transfer uses the HPSS parallel file transfer protocol interface with local file system commands to move data from PVFS on the data analysis cluster to tape over the cluster interconnect. Once files in a partition have been transferred to HPSS, they are removed from the PVFS. DSA retrieval requests are initiated through the DSA user interface, as are archival requests. Users can request retrieval of individual files or directories of files based on the HPSS namespace to a target system and directory. Retrieval requests, like the archival requests, are processed by web service middleware that maintains state and starts a twostep process to retrieve the data from the tape storage system and place the data onto the target directory. Shpss is launched on the data analysis system to stage the data from tape storage to the PVFS disk cache. As in archival requests, shpss groups file retrievals into partitions and places each partition into a scheduling queue based on the tape resource requirements. If tape resources are available, transfers start immediately. When all data 264 for a retrieval request has been staged to the disk cache, the web middleware starts a Zephyr process to move the data to the target directory. For shpss to set up partition groups for a retrieval operation, it must query HPSS to obtain information about each file in the retrieval request. Currently, only tape striping information is obtained. File media identification information is not queried. Delays can occur if files are not co-located on media or they are retrieved in a different order than stored. The DSA gives users complete control over the portion of the HPSS namespace where their data is written. Therefore it is possible for more than one archive request to reside in a common part of the namespace (e.g. subdirectory). As HPSS performance can be affected by read requests not matching the tape file patterns created during the write process, it is possible for users to present read requests through the DSA that are less than efficient. In the ASCI environment, retrieval requests from the tape archive are made much less frequently than storage requests. 
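As an illustration of the grouping described above, the following Python sketch partitions a file list by a coarse size class and caps the number of files per partition. The size-class boundaries, the cap, and the function names are assumptions made for this sketch; they are not the DSA's actual storage-class mapping or limits.

# Illustrative sketch of size-based partitioning. The class boundaries and the
# per-partition file cap are assumptions for illustration, not DSA policy.

def size_class(size_bytes):
    """Map a file size to a coarse class, standing in for the HPSS storage-class mapping."""
    if size_bytes < 256 * 2**20:        # < 256 MB (assumed boundary)
        return "small"
    if size_bytes < 8 * 2**30:          # < 8 GB (assumed boundary)
        return "medium"
    return "large"

def make_partitions(files, max_files_per_partition=100):
    """files: iterable of (path, size_bytes). Returns partitions that share a
    common size class and hold a bounded number of files."""
    by_class = {}
    for path, size in files:
        by_class.setdefault(size_class(size), []).append((path, size))
    partitions = []
    for cls, members in by_class.items():
        for i in range(0, len(members), max_files_per_partition):
            chunk = members[i:i + max_files_per_partition]
            partitions.append({
                "storage_class": cls,
                "files": chunk,
                "bytes": sum(s for _, s in chunk),
            })
    return partitions

Each partition could then be placed in a scheduling queue keyed on its class, mirroring the node-allocation model of tape units described above.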
Considerable thought was given to tracking tape storage patterns within the HPSS namespace. It was determined that at the present time the cost/benefit of ordering read requests to tape file images was not economical in light of the amount of work required to add the capability and the complexity introduced into the overall system. If a read request points to the ‘top’ of a single archive in the HPSS namespace, the file read ordering will be the same as the original write ordering 4. DSA and other Technologies Sandia National Laboratories has had a grid in place since 2001 [11] based upon a workflow processor layered on top of the globus toolkit. Being a distributed application, the DSA exploits grid technologies by using the grid workflow processor to sequence the Zephyr and shpss components. The Globus toolkit is used for submitting partitions when scheduling transfers within shpss. Some modifications were made to the globus toolkit to enhance the operation of the DSA. Run time stdout/stderr capability was added to get immediate feedback from components running in PBS, and the ability to request DSA specific resources within the cluster where added. As mentioned above the DSA is manipulated using servlets on a web server. This presents a well-defined API for building additional features into the system. The DSA, through the web server, can be managed either through traditional POST and GET operations through HTTPS, or by passing java objects to requisite servlets themselves. The SimTracker [12] application, being developed at the three primary weapons laboratories, is in the process of integrating DSA into its functionality through this interface. The DSA does not use data grid technologies in the classic sense, although it is possible that gridftp [13] could be used for remote transfers. This is primarily due to a differing set of requirements between the programs. The file-tracking requirement employed in the data grid is not necessary since there is only one HPSS system where data is stored to tape and the user controls where files go in the HPSS namespace. An evaluation of 265 current requirements is underway that may lead to rudimentary replica management to enhance the robustness of the archive. The ability to integrate DSA with other archiving technologies or intermediate data stores is also being examined. The infrastructure where DSA resides includes a file migration capability to assist analysts with moving data between computational resources. Each resource has its own file system(s), which are dedicated to work performed on the resource. The DSA has access to these resources and can read or write data to and from any system supporting parallel ftp. 5. Performance Moving large data sets within the ASCI environment creates contention for infrastructure resources. The impact and severity of this contention is dependent upon the location of the data and the performance capability of the resources affected. Significant effort has been put forth to implement performance enhanced resources. High-speed networks, parallel file systems and HPSS are examples of resources where significant investment has produced high performance point solutions within the distributed architecture. However, relying on point solutions as general mechanisms for moving large amounts of data is often inefficient, impractical or requires a large investment in user effort and education. 
The DSA examines three key areas that affect performance when users manually store data in the HPSS system—resource contention, storage patterns, and scheduling algorithms. HPSS integration in the computational environment looks at issues involved in resource contention in normal and failure conditions. Storage patterns affect how efficiently an archive can be placed into HPSS as well as how efficiently the tape storage is used. Scheduling algorithms impact system overhead and tape resource. HPSS schedules tape mounts within the system where it has knowledge only of the current file request and not the entire archive request. Integrating advanced scheduling with HPSS can eliminate system overhead by holding transfer sessions that cannot immediately be serviced and increase tape utilization by the algorithms used to select transfers to run. 5.1 HPSS Integration in the Computational Environment Upon completion of a computation, data is not 'local' to the HPSS system. By local it is meant that there are systems and networks extraneous to HPSS that are involved in the movement of the computational data. These systems and networks are not dedicated to the archival process, thus are shared with other processes causing potential bottlenecks and points of failure. Contention for resource can affect the performance of data movement causing poor performance for the activity at hand as well as for other work that is contending for that resource. While poor performance can be unacceptable in the computational environment, in the worse case, resource failure can occur. Intermittent resource failure is considered performance degradation, and must be managed by the DSA. In a manual process this is often handled by restarting the entire archival operation, as the user does not have the capability or time to manage intermediate restarts. Whether the simulation data resides on the same network as HPSS 266 or across the wide area, it may not be practical or possible for the user to manage the involved systems and networks. Thus the most practical method for eliminating the probability of resource failure is to reduce the number of failure points in any given process. In order to reduce resource contention, the DSA moves data to a file system that is directly accessible by HPSS tape movers. At first glance this seems counterproductive since the data is actually moved twice. However experience has shown that increased performance in network throughput and the localizing process of data to HPSS outweighs the duplicated movement. This is especially true when the data resides at a remote location over the wide area where the network is unable to reach even modest levels of performance. 5.2 Storage Patterns The HPSS system achieves much of its performance by striping files across multiple tape drives. This allows data to be written in parallel increasing writes speeds by a factor equivalent to the number of tape drives involved in the write process. The number of tape drives used during a write is determined by two factors: 1. The size of the file will determine a stripe width through an internal mapping defined in HPSS. 2. The ftp client may force a stripe width through a class of service request. Both of the above features have tradeoffs in either HPSS overhead or in the ability to efficiently read data back from HPSS. The user does not want to be concerned with the details of how to effectively write data to HPSS. 
Furthermore, the ftp client does not lend itself to storing multiple directories of varying file sizes, some very large and some relatively small. This often leads to bulk transfers on a directory basis regardless of file size and of the number of striping changes that can occur.

Using the internal mapping mechanism can cause tapes to be mounted and dismounted many times during a transfer. Long delays may be experienced while waiting for one or more tape units to become free. In an oversubscribed HPSS system the results can be drastic as data transfers contend for tape units: client connections time out, small data transfers are starved by large transfers, and the user must be vigilant in monitoring the entire activity to ensure that all of their data is successfully stored. Using the client-specified mechanism, all data is stored with the same striping scheme. If the client chooses a stripe width that is too narrow, large files are written inefficiently, which also reduces overall HPSS throughput. If the client chooses a stripe width that is too wide, small files are split across many tapes and the transfer must wait for drives to become available. Again, in an oversubscribed HPSS system these effects are magnified.

With either of these methods, large data movements starve smaller ones. This has two effects on practical data storage. First, by design, the ftp client-server model can time out if asked to wait too long for an activity. Second, in the ASCI environment, security credentials can time out, causing any automation of transfers to fail. The DSA manages these issues by partitioning an entire transfer, including files in multiple directories, into data movements that share a common stripe width. Each partition is transferred one after the other at a time determined by the DSA scheduler (see below). Partitions also have finite length, thus allowing smaller transfers to be scheduled between larger ones.

5.3 Scheduling Techniques

Integration with a sophisticated system such as HPSS can be performed in two ways. The DSA can either couple tightly with internal information and state through some well-defined interface, or it can couple loosely by modeling HPSS information and state through an external mechanism. The DSA chooses the latter in order to simplify its architecture and reduce development time. The DSA transfers data using the common HPSS ftp client and models tape drive allocation using a batch queuing system scheduler.

There are two primary reasons to schedule data transfers into HPSS. The first is to hold any ftp sessions that cannot be immediately serviced by HPSS, thus eliminating unnecessary system overhead. This situation is made worse when an active ftp connection that is transferring data alters its required HPSS resource (number of tape drives) and cannot immediately be serviced. Such situations can result in protocol timeouts and a general level of confusion on the part of the user. The second reason for scheduling data transfers is to increase HPSS utilization. HPSS schedules tape resources in a FIFO manner when a file is opened for reading or writing. Such FIFO queuing is known to be less than optimal when there are free resources (tape drives) and a pending request that could use those resources but is not at the head of the queue. The DSA models HPSS storage resources in a fashion similar to nodes in a cluster: the number of tape drives is managed by the queuing system, and the DSA requests a specific number of tape drives when submitting a transfer.
This requires the DSA to examine the size of the data to be transferred and to map that size to a particular stripe width. It also requires that data be grouped into partitions that do not violate HPSS stripe width policy for any given transfer request.

In order to optimize transfer requests, the batch queuing system scheduler is configured to use a technique called backfill. When scheduling jobs (transfers) using backfill, a standard FIFO queue is employed to determine when a job should start. However, when the job at the head of the queue cannot be started because the amount of free resources is not sufficient, the scheduler looks 'back' into the queue to see if any other pending job can be satisfied by the amount of free resources. If such a job exists and the amount of time required by that job is not longer than the wait time for the request at the head of the queue, the backfill job is started. Such a technique requires that the DSA calculate the amount of time required for a transfer to take place.

Determining the exact amount of time required to transfer a group of files into HPSS is not necessarily possible. While transfer rates onto tape are fairly well behaved, it is not possible to determine with high certainty the number of tapes that will be required to service a particular transfer, which affects the calculation of total transfer time. Hardware compression, the amount of existing data on a tape, and the number of tape loads that must be serviced prior to any given load operation all play a role in the level of uncertainty. The DSA uses the following equation to estimate the amount of time required to perform a given transfer:

T_job = T_login + Σ_{i=1}^{N} [ T_startup + ⌈X_i / R_rate⌉ + S_width × T_load ]

The total time required to perform a transfer (T_job) equals the amount of time required to perform an ftp client login (T_login) plus the sum, over the N files, of the time for each individual file to transfer. The time to transfer an individual file is the sum of a startup time for the transfer (T_startup), the file size (X_i) divided by the transfer rate (R_rate), and the time required to perform a new load of all required tapes (S_width × T_load). It is known that this estimate is less than optimal, since files can span tape volumes, which would require additional tape loads. It may be possible to obtain a fair estimate of how many tape loads a particular transfer will require, but at this point we leave such optimization to future work.

Lastly, knowing that external influences can affect a transfer and cause the time needed to move the data onto tape to exceed the requested job time, the DSA has the ability to track which files have been successfully moved. This permits 'retrying' files that did not move by creating a new partition containing the files that did not transfer and repeating the process.
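A minimal sketch of this estimate is shown below. The login and startup overheads, transfer rate, and tape load time are illustrative assumptions, not measured HPSS figures.

```python
import math

def estimate_transfer_time(file_sizes, stripe_width,
                           t_login=5.0, t_startup=1.0,
                           rate_mb_s=30.0, t_load=60.0):
    """Estimate T_job = T_login + sum_i [T_startup + ceil(X_i / R_rate) + S_width * T_load].

    file_sizes are in megabytes; all timing parameters (seconds) and the
    transfer rate are placeholder assumptions, not measured HPSS values.
    """
    total = t_login
    for size_mb in file_sizes:
        total += t_startup + math.ceil(size_mb / rate_mb_s) + stripe_width * t_load
    return total

if __name__ == "__main__":
    # A hypothetical partition: twenty 1.5 GB files written with a 2-wide stripe.
    partition = [1500.0] * 20
    print("estimated seconds:", estimate_transfer_time(partition, stripe_width=2))
```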
5.4 Transfer Comparisons

A complete treatment of DSA performance is a paper unto itself. Here we give a brief overview of how the unique features of the DSA compare to conventional manual processes for moving data. There are three questions we want to answer:

1. Does partitioning data files into common-storage-class transfers increase performance over the practice of 'mput *'?
2. Can we schedule data for transfer in a way that makes more efficient use of HPSS resources (tape drives) and increases throughput?
3. What performance is gained when network resource utilization is minimized by staging data to a file system that is directly accessed by the HPSS movers?

To help answer these questions we ran simple experiments that transferred data into HPSS in a controlled environment. No other activity was permitted while the experiments were run. Care was taken to keep comparisons as equal as possible, taking into account that seek times would increase and that deferred mounts could skew results when compared to actual mounts.

For the first question we ran ten transfers—five for an shpss run and five that performed the local file mput for pftp. All transfers used the same data set, which was located on a file system mounted on the machines where the HPSS mover processes ran. Sufficient time was given between transfers to allow HPSS to settle into a common state. The data set was approximately 34 gigabytes of simulation data made up of 23 files purposefully named such that HPSS would have to request a varying number of tape drives during the pftp process. The standard local file pftp processes averaged 15 minutes to perform the transfer, while the shpss processes averaged 13 minutes. We expect this difference to grow, in favor of shpss, as the number of files and changes in class of service increase.

To examine scheduled transfers, the same considerations as above were used. The HPSS system was configured with eight tape drives. Two sessions were run, each transferring nine data sets concurrently: one session started nine shpss processes in a scheduler configured to perform backfill, and the other simply ran nine different pftp sessions using mlfput. The average time taken to complete the nine pftp/mlfput transfers was 33 minutes. The average time for the nine shpss sessions to complete was 27 minutes.

To evaluate the effect of reducing network resource utilization, the first suite of tests was rerun with the standard pftp process using the normal mput command, which transfers data to the HPSS movers over the local area network. Results from this test showed an order of magnitude improvement in performance when the locally accessible file system cache was used.

In all, this is only a brief glimpse at performance. We believe that other variables in HPSS not present during these tests could alter results. Further investigation is planned to help understand the DSA environment better and improve its performance.

6. Conclusions and Future Work

The focus of the DSA work to date has been on simplifying and optimizing the process of archiving large datasets. Initial testing indicates that reducing resource utilization, managing storage patterns, and scheduling transfers have produced performance improvements. In the near future we will be integrating the DSA service with a simulation tracking service, which maintains metadata and snapshots of results from simulation runs. We also plan to support video archiving for a video editing system. We expect these applications will require additional cache management support for retrieval operations.

Acknowledgments

The work was performed at Sandia National Laboratories. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy under Contract DE-AC04-94AL85000.

References

[1] HPSS, High Performance Storage System, http://www4.clearlake.ibm.com/hpss/index.jsp.
[2] Foster, I., Kesselman, C., Tuecke, S. "The Anatomy of the Grid: Enabling Scalable Virtual Organizations", The International Journal of High Performance Computing Applications, Vol. 15, 2001, pp. 200-222.
[3] Rajasekar, A., Wan, M., Moore, R. "MySRB & SRB – Components of a Data Grid", 11th International Symposium on High Performance Distributed Computing (HPDC-11), Edinburgh, Scotland, July 24-26, 2002.
[4] Shoshani, A., Sim, A., Gu, J. "Storage Resource Managers: Middleware Components for Grid Storage", 19th IEEE Symposium on Mass Storage Systems, 2002.
[5] Shoshani, A., Bernardo, L., Nordberg, H., Rotem, D., Sim, A. "Storage Management for High Energy Physics Applications", Computing in High Energy Physics, 1998.
[6] Chervenak, A., Foster, I., Kesselman, C., Salisbury, C., Tuecke, S. "The Data Grid: Towards an Architecture for the Distributed Management and Analysis of Large Scientific Datasets", Journal of Network and Computer Applications, 23, pp. 187-200, 2001.
[7] Hoschek, W., Jaen-Martinez, J., Samar, A., Stockinger, H., Stockinger, K. "Data Management in an International Data Grid Project", IEEE/ACM International Workshop on Grid Computing (Grid 2000), December 2000, Bangalore, India.
[8] PPDG: Particle Physics Data Grid, http://www.cacr.caltech.edu/ppdg.
[9] Tierney, B., Johnston, W., Lee, J. "A Cache-Based Data Intensive Distributed Computing Architecture for 'Grid' Applications", CERN School of Computing, September 2000.
[10] Ligon, W. B., III, and Ross, R. B. "PVFS: Parallel Virtual File System", Beowulf Cluster Computing with Linux, Thomas Sterling, editor, pages 391-430, MIT Press, November 2001.
[11] Beiriger, J., Bivens, H., Humphreys, S., Johnson, W., and Rhea, R. "Constructing the ASCI Computational Grid", Ninth IEEE International Symposium on High Performance Distributed Computing, August 2000.
[12] SimTracker. http://www.llnl.gov/icc/sdd/img/images/SimTrack.pdf.
[13] Laszewski, G., Alunkal, B., Gawor, J., Madhuri, R., Plaszczak, P., Sun, X. "A File Transfer Component for Grids", 2003 International Conference on Parallel and Distributed Processing Techniques and Applications, June 2003, Las Vegas, Nevada.

Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems

Andy D. Hospodor, Senior Member, IEEE, andy.hospodor@ieee.org
Ethan L. Miller, Storage Systems Research Center, University of California, Santa Cruz, elm@cs.ucsc.edu

Abstract

As demand for storage bandwidth and capacity grows, designers have proposed the construction of petabyte-scale storage systems. Rather than relying upon a few very large storage arrays, these petabyte-scale systems have thousands of individual disks working together to provide aggregate storage system bandwidth exceeding 100 GB/s. However, providing this bandwidth to storage system clients becomes difficult due to limits in network technology. This paper discusses different interconnection topologies for large disk-based systems, drawing on previous experience from the parallel computing community.
By choosing the right network, storage system designers can eliminate the need for expensive high-bandwidth communication links and provide a highly-redundant network resilient against single node failures. We analyze several different topology choices and explore the tradeoffs between cost and performance. Using simulations, we uncover potential pitfalls, such as the placement of connections between the storage system network and its clients, that may arise when designing such a large system.

1. Introduction

Modern high-performance computing systems require storage systems capable of storing petabytes of data (a petabyte, PB, is 2^50 bytes), and delivering that data to thousands of computing elements at aggregate speeds exceeding 100 GB/s. Just as high-performance computers have shifted from a few very powerful computation elements to networks of thousands of commodity-type computing elements, storage systems must make the transition from relatively few high-performance storage engines to thousands of networked commodity-type storage devices.

The first part of this shift is already occurring at the storage device level. Today, many storage subsystems utilize low-cost commodity storage in the form of 3.5" hard disk drives. The heads, media and electronics of these devices are often identical to storage used on desktop computers. The only remaining differentiator between desktop and server storage is the interface. At present, Fibre-Channel and SCSI remain the choice of large, high-end storage systems while the AT attachment (ATA) remains the choice of desktop storage. However, the introduction of the Serial ATA interface provides nearly equivalent performance and a greatly reduced cost to attach storage.

Most current designs for such petabyte-scale systems rely upon relatively large individual storage systems that must be connected by very high-speed networks in order to provide the required transfer bandwidths to each storage element. We have developed alternatives to this design technique using 1 Gb/s network speeds and small (4–12 port) switching elements to connect individual object-based storage devices, usually single disks. By including a small-scale switch on each drive, we develop a design that is more scalable and less expensive than using larger storage elements because we can use cheaper networks and switches. Moreover, our design is more resistant to failures—if a single switch or node fails, data can simply flow around it. Since failure of a single switch typically makes data from at least one storage element unavailable, maintaining less data per storage element makes the overall storage system more resilient.

In this paper, we present alternative interconnection network designs for a petabyte-scale storage system built from individual nodes consisting of a disk drive and a 4–12 port gigabit network switch. We explore alternative network topologies, focusing on overall performance and resistance to individual switch failures. We are less concerned with disk failures—disks will fail regardless of interconnection network topology, and there is other research on redundancy schemes for massive-scale storage systems [16].

2. Background

There are many existing techniques that provide high-bandwidth file service, including RAID, storage area networks, and network-attached storage. However, these techniques cannot provide 100 GB/s on their own, and each has limitations that manifest in a petabyte-scale storage system. Rather, network topologies originally developed for massively parallel computers are better suited to construct massive storage systems.

2.1. Existing Storage Architectures

RAID (Redundant Array of Independent Disks) [2] protects a disk array against failure of an individual drive. However, the RAID system is limited to the aggregate performance of the underlying array. RAID arrays are typically limited to about 16 disks; larger arrays begin to suffer from reliability problems and issues of internal bandwidth. Systems such as Swift [9] have proposed the use of RAID on very large disk arrays by computing parity across subsections of disks, thus allowing the construction of larger arrays. However, such systems still suffer from a basic problem: connections between the disks in the array and to the outside world are limited by the speed of the interconnection network.

Storage area networks (SANs) aggregate many devices together at the block level. Storage systems are connected together via network, typically FibreChannel, through high-performance switches. This arrangement allows servers to share a pool of storage, and can enable the use of hundreds of disks in a single system. However, the primary use of SANs is to decouple servers from storage devices, not to provide high bandwidth. While SANs are appropriate for high I/O rate systems, they cannot provide high bandwidth without an appropriate interconnection network topology.

Network-attached storage (NAS) [6] is similar to SAN-based storage in that both designs have pools of storage connected to servers via networks. In NAS, however, individual devices present storage at the file level rather than the block level. This means that individual devices are responsible for managing their own data layout; in SANs, data layout is managed by the servers. While most existing network-attached storage is implemented in the form of CIFS- or NFS-style file systems, object-based storage [15] is fast becoming a good choice for large-scale storage systems [16]. Current object-based file systems such as Lustre [13] use relatively large storage nodes, each implemented as a standard network file server with dozens of disks. As a result, they must use relatively high-speed interconnections to provide the necessary aggregate bandwidth. In contrast, tightly coupling switches and individual disks can provide the same high bandwidth with much less expensive, lower-speed networks and switches.
2.2. Parallel Processing

Interconnection networks for computing elements have long been the subject of parallel computing research. No clear winner has emerged; rather, there are many different interconnection topologies, each with its own advantages and disadvantages, as discussed in Section 3. Traditional multiprocessors such as the Intel Touchstone Delta [14] and the CM-5 [10] segregated I/O nodes from computing nodes, typically placing I/O nodes at the edge of the parallel computer. File systems developed for these configurations were capable of high performance [3, 12] using a relatively small number of RAID-based arrays, eliminating the need for more complex interconnection networks in the storage system.

There have been a few systems that suggested embedding storage in a multiprocessor network. In RAMA [11], every multiprocessor compute node had its own disk and switch. RAMA, however, did not consider storage systems on the scale that are necessary for today's systems, and did not consider the wide range of topologies discussed in this paper.

Fortunately, storage systems place different, and somewhat less stringent, demands on the interconnection network than parallel processors. Computing nodes typically communicate using small, low latency messages, but storage access involves large transfers and relatively high latency. As a result, parallel computers require custom network hardware, while storage interconnection networks, because of their tolerance for higher latency, can exploit commodity technologies such as gigabit Ethernet.
2.3. Issues with Large Storage Scaling

A petabyte-scale storage system must meet many demands: it must provide high bandwidth at reasonable latency, it must be both continuously available and reliable, it must not lose data, and its performance must scale as its capacity increases. Existing large-scale storage systems have some of these characteristics, but not all of them. For example, most existing storage systems are scaled by replacing the entire storage system in a "forklift upgrade." This approach is unacceptable in a system containing petabytes of data because the system is simply too large. While there are techniques for dynamically adding storage capacity to an existing system [7], the inability of such systems to scale in performance remains an issue.

One petabyte of storage capacity requires about 4096 (2^12) storage devices of 250 GB each; disks with this capacity are appearing in the consumer desktop market in early 2004. A typical high-performance disk, the Seagate Cheetah 10K, has a 200 MB/s FibreChannel interface and spins at 10000 RPM with a 5.3 ms average seek time and a sustained transfer rate of 44 MB/s. In a typical transaction processing environment, the Cheetah would service a 4 KB request in about 8.3 ms for a maximum of 120 I/Os per second, or 0.5 MB/s from each drive. The aggregate from all 4096 drives would be 4096 × 0.5 MB/s, or only 2 GB/s—far below the required bandwidth of 100 GB/s. By increasing the request size to 512 KB, disk throughput is increased to 25 MB/s per drive, for an aggregate bandwidth of 100 GB/s. Alternatively, Figure 1 shows that low-cost serial ATA drives, such as the 7200 RPM Seagate Barracuda 7200, could also meet the 100 GB/s requirement with a slightly larger request size of 1 MB.

Figure 1. Disk throughput as a function of transfer size (throughput in MB/s versus request size in KB, for a 10000 RPM FibreChannel disk and a 7200 RPM ATA disk).
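The arithmetic behind these figures can be reproduced with a simple seek-plus-rotation-plus-transfer service-time model. The sketch below uses the Cheetah 10K parameters quoted above; it is only a back-of-the-envelope approximation, not the authors' model or the data behind Figure 1.

```python
# Back-of-the-envelope model behind the numbers above: per-request service time
# = average seek + half rotation + transfer, using the Cheetah 10K figures
# quoted in the text (5.3 ms seek, 10000 RPM, 44 MB/s sustained transfer).

def throughput_mb_s(request_kb, seek_ms=5.3, rpm=10000, xfer_mb_s=44.0):
    half_rotation_ms = 0.5 * 60000.0 / rpm              # ~3 ms at 10000 RPM
    transfer_ms = (request_kb / 1024.0) / xfer_mb_s * 1000.0
    service_ms = seek_ms + half_rotation_ms + transfer_ms
    requests_per_s = 1000.0 / service_ms
    return requests_per_s * (request_kb / 1024.0)        # MB/s per drive

for req_kb in (4, 512):
    per_drive = throughput_mb_s(req_kb)
    aggregate_gb_s = per_drive * 4096 / 1024.0
    print(f"{req_kb:4d} KB requests: {per_drive:5.1f} MB/s per drive, "
          f"{aggregate_gb_s:5.1f} GB/s from 4096 drives")
# ~0.5 MB/s and ~2 GB/s for 4 KB requests; ~25 MB/s and ~100 GB/s for 512 KB.
```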
3. Interconnection Strategies

The interconnection network in a petabyte-scale storage system must be capable of handling an aggregate bandwidth of 100 GB/s as well as providing a connection to each n-disk storage node capable of transferring 25n MB/s of data. Thus, a 1 Gb/s network link can support nodes with at most 2–3 disks, and a 10 Gb/s network link can support up to 25 disks per node. As with any technology, however, faster is more expensive—1 Gb/s networks cost less than $20 per port, while 10 Gb/s can cost $5000 per port. In time, 10 Gb/s networks will drop in price, but it is likely that storage bandwidth demands will increase, making a tradeoff still necessary. This non-linear tradeoff in cost-performance compels us to consider complex architectures that can leverage 1 Gb/s interconnects.

The challenge facing storage system designers is an architecture that connects the storage to servers. Figure 2(a) shows a simple strategy that connects a server to 32 storage devices through a switch. Simple replication of this strategy 128 times yields a system capable of meeting the requirement. This strategy is remarkably similar to RAID level 0, known as Just a Bunch of Disks (JBOD), and suffers from similar issues with reliability, described later in the paper. Here, the port cost would be 4096 switch ports of 1 Gb/s and 128 ports of 10 Gb/s. However, the placement of data becomes crucial in order to keep all storage devices active. Since individual servers can only communicate with a small fraction of the total disk, clients must send their requests to the correct server. Unless there is a switching network comparable to that in the designs we discuss below interposed between the clients and the servers, this design is not viable. If there is a switching fabric between clients and servers, the designs below provide guidelines for how the network should be designed.

Figure 2. Simple interconnection strategies: (a) disks connected to a single server, a configuration susceptible to loss of availability if a single switch or server fails; (b) disks connected to redundant servers, where switch failure is still an issue but routers allow for more redundancy.

3.1. Fat Trees

Figure 2(b) shows a hierarchical strategy, similar to a fat-tree, that provides redundant connections between components. This strategy expands to have each server connect to two of eight routers that interconnect with the 128 switches that finally attach the 4096 storage devices. Each router has 32 ports attached to the servers, 128 ports attached to the switches, and seven additional ports to the other routers. The port cost would be 4096 ports of 1 Gb/s, 2048 ports of 10 Gb/s that connect the 128 switches to the 8 routers (one port at either end), 112 ports of 10 Gb/s that interconnect the 8 routers, and 256 ports of 10 Gb/s that connect each server to two routers. This configuration has the added drawback that the routers must be very large; it is typically not possible to build monolithic network devices with over 100 ports, so the routers would have to be constructed as multi-stage devices. While this design would allow any client to access any disk, the routers in this configuration would be very expensive. Furthermore, the 2418 ports of 10 Gb/s add nearly $10M to the overall cost, making this configuration a poor choice.

3.2. Butterfly Networks

Butterfly networks provide a structure similar to the hierarchical strategy at a more reasonable cost. Figure 3 shows a butterfly network interconnection strategy that connects disks to servers. The butterfly network can have relatively few links, but the links may need to be faster because each layer of the network must carry the entire traffic load on its links. In order to keep the individual links below 1 Gb/s, the butterfly network would need 1024 links per level for an aggregate throughput of 100 GB/s. Building a full butterfly network for 4096 disks using 1024 links and 128 switches per level would require three levels of 16-port switches and an additional level of 36-port "concentrators" to route data to and from 32 disks. Alternatively, the switching network could be built entirely from five levels of 8-port switches, using an additional level of 10-port switches to aggregate individual disks together. We use this second configuration in the remainder of the paper because, while 16-port switches are possible, we believe that the 8-port switches necessary for the second design are more reasonable.

While butterfly networks appear attractive in many ways, they do not have the fault-tolerance provided by cube-style networks such as meshes and torii. In fact, only a single path exists between any pair of server and storage devices connected by the butterfly network. In traditional parallel computers, network failures could be handled either by shutting down the affected nodes or by shutting down the entire system. Storage fabrics, on the other hand, must continue to run even in the face of network failures, making butterfly networks less attractive unless there is a method to route traffic around failed links. Cube-style networks have many routes between any two nodes in the fabric, making them more tolerant of link failures.

Figure 3. Disks connected in a butterfly network topology.
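The butterfly sizing above can be checked with a little arithmetic: 100 GB/s is 800 Gb/s of aggregate traffic that every level must carry, rounded up to 1024 one-gigabit links, and an 8-port switch acts as a radix-4 element (4 links up, 4 links down). The sketch below encodes that reading of the configuration; the switches-per-level figure it prints is an inference from those assumptions rather than a number stated in the text.

```python
import math

def butterfly_sizing(aggregate_gb_s=100, link_gb_s=1.0, switch_ports=8):
    # Each level must carry the full aggregate load, so size the link count
    # to the total traffic and round up to a power of two.
    needed_links = aggregate_gb_s * 8 / link_gb_s            # GB/s -> Gb/s
    links_per_level = 2 ** math.ceil(math.log2(needed_links))
    radix = switch_ports // 2                                # ports split up/down
    levels = 0
    reach = 1
    while reach < links_per_level:                           # levels needed to span the links
        reach *= radix
        levels += 1
    switches_per_level = links_per_level // radix
    return links_per_level, levels, switches_per_level

print(butterfly_sizing())   # -> (1024, 5, 256): 1024 links, five levels of 8-port switches
```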
3.3. Meshes and Torii

Figure 4(a) shows a mesh strategy that combines the storage device with a small switch that contains four 1 Gb/s ports. The 4096 storage devices would be arranged as a 64 × 64 mesh with routers connecting the edge of the mesh to the servers. This configuration would require eight routers to connect to 128 servers to provide the necessary 100 GB/s bandwidth. It would require 16384 1 Gb/s ports for the storage, 256 10 Gb/s ports that connect the 8 routers to the servers on two edges of the mesh, and the 256 10 Gb/s ports that connect the servers to the routers. Optionally, another 256 ports would connect all four sides of the mesh to the routers, although this much redundancy is not likely to be necessary. Router interconnects are no longer necessary because the mesh provides alternate paths in case of failure.

Torus topologies, shown in Figure 4(b), are similar to meshes, but with the addition of "wrap-around" connections between opposing edges. The inclusion of these additional connections does not greatly increase cost, but it cuts the average path length—the distance between servers and storage for a given request—by a factor of two, reducing required bandwidth and contention for network links. However, this design choice requires external connectivity into the storage fabric through routers placed at dedicated locations within the torus. Mesh and torus topologies are likely a good fit for large scale storage systems built from "bricks," as proposed by IBM (IceCube [8]) and Hewlett Packard (Federated Array of Bricks [5]). Such topologies are naturally limited to three dimensions (six connections) per element, though they may resemble hypercubes if multiple highly connected disks are packed into a single "brick."

Figure 4. Mesh and torus interconnection topologies: (a) disks connected in a mesh topology; (b) disks connected in a torus topology.

3.4. Hypercubes

Figure 5 shows a hypercube strategy [1] of a two-dimensional hypercube of degree four that intersperses the routers and storage devices throughout the hypercube. In a 4096 node storage system, 3968 storage devices and 128 routers could be arranged in a hypercube of degree 12. Each node in this configuration has twelve 1 Gb/s ports and each router has two additional 10 Gb/s ports that connect to two servers. Bandwidth in the hypercube topology may scale better than in the mesh and torus topologies, but the cost is higher because the number of connections at each node increases as the system gets larger. Note also that hypercubes are a special case of torii; for example, a degree 12 hypercube can also be considered a 6-D symmetrical torus. Furthermore, the hypercube can be extended as a torus by adding nodes in one dimension; there is no need to add nodes in powers of two.

Hypercubes and high-dimensional torii need not be built from individual disks and routers. To make packaging less expensive, a small group of disks may be assembled into a unit, and the units connected together using a lower-dimensional torus. For units with eight disks in a degree 12 hypercube, this approach requires each unit to have 48 external connections—eight connections per cube face. This is not an unreasonable requirement if the system uses gigabit Ethernet or optical fiber network connections.

Figure 5. Disks connected in a hypercube topology.
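For the cube-style fabrics, port counts and average path lengths follow directly from the dimensions. The sketch below uses the approximation of k/4 hops per dimension of a k-ary torus, the same approximation the analysis in the next section applies to the 16 × 16 × 16 example; it is a rough estimate, not the simulator used later in the paper.

```python
# Rough port and path-length arithmetic for cube-style (torus) fabrics.
# Average hop count per dimension of a k-ary torus is approximated as k/4.

def torus_stats(dims):
    nodes = 1
    for k in dims:
        nodes *= k
    ports = nodes * 2 * len(dims)          # each node has 2 links per dimension
    links = ports // 2                     # every link joins two ports
    avg_hops = sum(k / 4.0 for k in dims)
    return nodes, ports, links, avg_hops

for dims in [(64, 64), (16, 16, 16), (8, 8, 8, 8), (4, 8, 4, 8, 4), (4, 4, 4, 4, 4, 4)]:
    nodes, ports, links, hops = torus_stats(dims)
    print(f"{'x'.join(map(str, dims)):>15}: {nodes} nodes, {ports} ports, "
          f"{links} links, ~{hops:.1f} avg hops")
```

The port counts this produces (16,384 for the 2-D cases up to 49,152 for the 6-D hypercube) agree with Table 1 below.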
4. Analytic Results

All of the topologies listed in Section 3 appear capable of providing 100 GB/s bandwidth from a cluster of 4096 disks. Further inspection shows that, because of limitations in link capacities, this is not the case. Moreover, the topologies differ in several critical ways, including overall system cost, aggregate bandwidth, latency and resistance to component failures. We analyzed the basic characteristics of seven specific topologies, listed in Table 1.

Table 1. Switching fabric topologies to accommodate about 4000 disks.

  Network        Dimensions                    Ports
  Fat Tree       32-8-32-1024                  6,512
  2D mesh        64 × 64                       16,384
  2D torus       64 × 64                       16,384
  3D torus       16 × 16 × 16                  24,576
  4D torus       8 × 8 × 8 × 8                 32,768
  5D torus       4 × 8 × 4 × 8 × 4             40,960
  6D hypercube   4 × 4 × 4 × 4 × 4 × 4         49,152
  Butterfly      256 4 × 4 switches/layer      14,336

4.1. System Cost

One benefit of the designs in which switches are embedded in the "storage fabric" is that they require far fewer high speed ports at the cost of additional low speed ports. The 2004 cost of a gigabit Ethernet port is less than $20, whereas the cost of a 10 gigabit Ethernet port is on the order of $5000. This non-linear tradeoff makes the fabric-type structures more appealing than the other structures because they simply cost less. The overall cost of a 4096 node system with different configurations is shown in Figure 7. The "independent" system is shown as a baseline; in such a system, each disk is connected to exactly one server, and servers are not connected to one another. While this is by far the least expensive option, it requires that the file system use the network of host servers to manage client access of data from the entire storage system, and it limits the storage system's ability to perform internal communication for reliability and other functions. Among the other options, lower-dimensionality "cubes" are the least expensive, higher-dimension cubes and torii are the most expensive, and butterfly networks fall in between.

Figure 7. Total system cost (network cost plus disk cost, in millions of dollars) for different interconnection network topologies. The "independent" topology relies upon the host computer for communication between storage nodes.
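The non-linear port-price tradeoff can be illustrated with a small calculation. The per-port prices come from the text; the disk price and the particular port mixes are assumptions for illustration, and no attempt is made to reproduce the exact totals plotted in Figure 7.

```python
# Illustration of the port-price tradeoff discussed above. The per-port prices
# come from the text ($20 per 1 Gb/s port, $5000 per 10 Gb/s port); the disk
# price and the port mixes are assumptions chosen for illustration only.

PRICE_1G = 20              # dollars per 1 Gb/s port
PRICE_10G = 5000           # dollars per 10 Gb/s port
ASSUMED_DISK_PRICE = 300   # dollars per 250 GB drive (assumption)

def system_cost(ports_1g, ports_10g, disks=4096):
    network = ports_1g * PRICE_1G + ports_10g * PRICE_10G
    return network + disks * ASSUMED_DISK_PRICE

# A 4-D torus fabric of 1 Gb/s links with 256 10 Gb/s external/server ports,
# versus the same fabric if every link had to be 10 Gb/s.
print("4-D torus, 1 Gb/s fabric:  $%.1fM" % (system_cost(32768, 256) / 1e6))
print("4-D torus, 10 Gb/s fabric: $%.1fM" % (system_cost(0, 32768 + 256) / 1e6))
```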
4.2. System Bandwidth

Another consideration for massive storage systems is the average number of network hops between a disk and a server, as shown in Figure 6. The number of hops is crucial because the maximum simultaneous bandwidth to all disks is (link speed × number of links) / (average hops). Systems with a small number of links but few average hops may perform better than systems with more links but longer average path distances between the disk and the server. For example, a 16 × 16 × 16 torus might seem like a good topology because it is relatively low cost and is easy to assemble. However, this topology would have 4096 × 6/2 = 12288 links, and the average distance from an edge to a desired node would be 16/4 + 16/4 + 16/4 = 12 hops. This would limit the theoretical bandwidth to 1 × 12288/12 = 1024 Gb/s, or 128 GB/s at most, which might be insufficient to meet the 100 GB/s demand.

Figure 6 shows the expected number of hops for each network topology as well as the expected load on each network link. Our models have assumed that links are half-duplex. If full-duplex links are used, individual links can double their theoretical maximum bandwidth. However, this doubling is only realized if the load on the link is the same in both directions. For mesh and torus topologies the load in both directions will depend on the location of the router nodes. However, for butterfly and fat-tree topologies, the ability to do full-duplex transmission is largely wasted because, for large reads, data flows in only one direction through the network: from disks to routers. Large writes work similarly—data flows in one direction only, from routers to disks.

Figure 6. Average number of network hops and expected per-link bandwidth (Gb/s) for each interconnection network topology. The "independent" topology is omitted because it relies upon the host computer for communication between storage nodes.

A prime consideration is the ability of individual connections into the storage fabric to supply the necessary bandwidth. For a 4096 node system supplying 100 GB/s, we have assumed that 128 external connections would be sufficient; certainly, 128 links at 10 Gb/s could indeed provide 100 GB/s of aggregate bandwidth. However, designs in which individual nodes have relatively few links cannot support such routers because the intra-fabric bandwidth available to a router is too low to support an external link of 10 Gb/s. For example, a two-dimensional fabric has four ports per node; at 1 Gb/s per port, the total bandwidth available to an external link is about 4 Gb/s, far less than the 10 Gb/s capacity of the external link and insufficient to allow 128 such nodes to provide 100 GB/s to the outside world.

Figure 6 shows that the link bandwidth required by the 4-D, 5-D and 6-D configurations could be served by 1 Gb/s interconnects, such as Gigabit Ethernet. Unfortunately, the bandwidth needs of the 2-D and 3-D configurations require the use of a faster interconnect, such as 10 Gigabit Ethernet, to meet the 100 GB/s requirement of the overall system. The cost of the necessary 10 Gb/s ports adds $80M to the 2-D configuration, and a whopping $100M to the 3-D configuration. Figures 6 and 7 make it clear that, although low-dimensionality torii are attractive because of the low number of links they require, they cannot meet the 100 GB/s requirement without resorting to more costly interconnects. On the other hand, high-dimensionality hypercubes require less bandwidth per link, yet have many links and require switches with many ports.
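Under the bandwidth ceiling just described—(link speed × number of links) / (average hops)—the cube-style configurations can be compared with a few lines of code. Link counts follow from node degree, and average hops are again approximated as k/4 per dimension, so these are rough estimates rather than simulation results.

```python
# Estimated bandwidth ceiling: (link speed x number of links) / (average hops),
# with 4096 nodes, d links per node in a d-dimensional torus, and k/4 hops per
# dimension as in the 16 x 16 x 16 example above.

def max_bandwidth_gb_s(dims, link_gb_s=1.0, nodes=4096):
    links = nodes * len(dims)                  # degree-2d nodes -> d links per node
    avg_hops = sum(k / 4.0 for k in dims)
    return (link_gb_s * links / avg_hops) / 8  # Gb/s -> GB/s

for dims in [(64, 64), (16, 16, 16), (8, 8, 8, 8), (4, 8, 4, 8, 4), (4, 4, 4, 4, 4, 4)]:
    print(f"{'x'.join(map(str, dims)):>15}: ~{max_bandwidth_gb_s(dims):4.0f} GB/s ceiling")
# Reproduces the 128 GB/s figure for the 16 x 16 x 16 torus and shows the 2-D
# case falling well short of 100 GB/s with 1 Gb/s links.
```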
The 4-D and 5-D torii appear to have the best combination of relatively low cost, acceptable bandwidth on individual links and reasonable path lengths. Compared to butterfly networks, the resiliency of the 4-D and 5-D torii offsets the 30% to 40% added cost. The 6-D hypercube has the highest cost and the highest aggregate throughput. Dividing the cost by aggregate throughput, as shown in Figure 8, shows that the cost per GB/s of bandwidth is nearly identical for the butterfly and hypercube topologies. The 6-D hypercube becomes a cost-effective choice for a large, reliable system with quality of service constraints.

Figure 8. Cost per gigabyte per second (in thousands of dollars) for different interconnection network topologies.

5. Simulation Results

While analytic results show theoretical performance, there are many real-world considerations that affect performance. We simulated usage of the networks whose performance was analyzed in Section 4, using a simple simulator that modeled network traffic loads along the network links. Each router between outside clients and disks made 0.5 MB requests of randomly-selected data that induced load sufficient to drive the overall system bandwidth to 100 GB/s. Requests were routed through mesh, torus, and hypercube interconnection networks using dimensional routing [4]; routing in butterfly networks is fixed because there exists only one route between requester and disk. We generated a range of interconnection networks with 4096 nodes, including 128 routers through which clients would connect to the storage fabric; the remainder of the nodes were storage nodes. The butterfly networks, on the other hand, had 4096 nodes connected through a switching fabric to 128 routers. The specific fabrics we tested are listed in Table 1.

In our first cube-style networks (meshes, torii, and hypercubes), we connected external clients through routers placed at regular locations within the network. This resulted in very poor performance due to congestion near the router nodes, as shown in Figure 9. A histogram of load distribution on individual links in a 4 × 4 × 4 × 4 × 4 × 4 hypercube is shown in Figure 9(a). When routers were placed in a regular arrangement, some links had bandwidths of 1.25–2.25 Gb/s. Individual disks require about 25% of a single 1 Gb/s link's bandwidth and are not affected greatly by requests that "pass through" the switch next to the disk. Routers, on the other hand, require nearly the full bandwidth of all the incoming connections. When routers are adjacent, the bandwidth is greatest nearest the routers and falls off further from the routers, resulting in overloaded links in part of the fabric and underloaded links elsewhere. Figure 9(a) shows that, in addition to overloading some links, regular placement underloads most of the remaining links—the histogram is shifted to the left relative to that for random node placement. The cumulative distribution of link load is shown in Figure 9(b); under regular placement, about 5% of all links experience overload.

We addressed the problem of crowding by placing routers randomly throughout the storage fabric. While this did not decrease average path length, it dramatically reduced the congestion we noticed in our original network designs, as Figure 9 shows. As a result, bandwidth was spread more evenly throughout the storage fabric, reducing the maximum load on any given link. We believe that it might be possible to further balance load by devising an optimal placement; however, this placement is beyond the scope of this paper.
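A toy version of this kind of measurement is sketched below: it tallies per-link load under dimension-order routing on a small 8 × 8 torus for a given router placement, so that regular and random placements can be compared. The fabric is far smaller than the 4096-node networks simulated above and the uniform request model is an assumption, so the magnitude of any difference will not match the paper's results.

```python
import random
from collections import Counter

SIZE = 8          # 8 x 8 torus, much smaller than the fabrics simulated above
N_ROUTERS = 4
N_REQUESTS = 20000

def shortest_step(src, dst):
    """One hop along a torus dimension toward dst, wrap-around aware."""
    forward = (dst - src) % SIZE
    return (src + 1) % SIZE if forward <= SIZE // 2 else (src - 1) % SIZE

def route(src, dst):
    """Dimension-order route: fix x first, then y; yield traversed links."""
    x, y = src
    while x != dst[0]:
        nx = shortest_step(x, dst[0])
        yield frozenset({(x, y), (nx, y)})
        x = nx
    while y != dst[1]:
        ny = shortest_step(y, dst[1])
        yield frozenset({(x, y), (x, ny)})
        y = ny

def max_link_load(routers):
    load = Counter()
    nodes = [(x, y) for x in range(SIZE) for y in range(SIZE)]
    storage = [n for n in nodes if n not in routers]
    for _ in range(N_REQUESTS):
        r, d = random.choice(routers), random.choice(storage)
        for link in route(r, d):
            load[link] += 1
    return max(load.values())

random.seed(0)
regular = [(2, 2), (2, 6), (6, 2), (6, 6)]
rand = random.sample([(x, y) for x in range(SIZE) for y in range(SIZE)], N_ROUTERS)
print("max link load, regular placement:", max_link_load(regular))
print("max link load, random placement: ", max_link_load(rand))
```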
Of course, standard issues such as protocol choice, storage system network congestion, and reliability concerns are relevant to systems in which storage is embedded in a network fabric. However, other questions such as the placement of connections into the network (edge or core), the use of a few “shortcut” links to reduce dimensionality, and other problems specific to dense interconnection networks will be relevant to designs such as those presented in this paper. Perhaps the most important question, though, is whether this design is applicable to commercial environments, in which bandwidth is less crucial, as well as scientific computing environments. If this design is a good fit to commercial systems, it is likely that the “bricks” used to build a storage fabric will become cheaper, allowing the construction of higher performance scientific computing storage systems as well as faster, more reliable commercial storage systems. 40 Percent of links 30 Regular router placement 20 Excess load on links near routers 10 0 30 Random router placement 20 10 0 0 0.5 1 1.5 Average link load (Gb/s) 2 2.5 Cumulative % of links (a) Histogram of link load. 100 90 80 70 60 50 40 30 20 10 0 Maximum link bandwidth Regular placement Random placement 0 0.5 1 1.5 2 Average link bandwidth (Gb/s) 2.5 (b) Cumulative distribution of link load. 7. Conclusions Figure 9. Distribution of load on links in a 4 × 4 × 4 × 4 × 4 × 4 hypercube. Randomly-placed router nodes improve the evenness of load in the fabric. In this paper, we have introduced the concept of building a multiprocessor-style interconnection network solely for storage systems. While this idea has been alluded to in the past, our research shows the tradeoffs between different configurations and demonstrates that “storage fabrics” based on commodity components configured as torii and hypercubes improve reliability as well as performance. More specifically, the 4-D and 5-D torii appear to be reasonable design choices for a 4096 node storage system capable of delivering 100 GB/s from 1 PB. Furthermore, these designs become faster as the system grows, removing the need to replace the entire storage system as capacity and bandwidth demands increase. It is for these reasons that we believe that storage network topologies as described in this paper will become crucial to the construction of petabytescale storage systems. did not decrease average path length, it dramatically reduced the congestion we noticed in our original network designs, as Figure 9 shows. As a result, bandwidth was spread more evenly throughout the storage fabric, reducing the maximum load on any given link. We believe that it might be possible to further balance load by devising an optimal placement; however, this placement is beyond the scope of this paper. 6. Future Work This paper merely scratches the surface of issues in network design for petabyte-scale storage systems. Some of the unanswered questions about this design can be answered best by building an inexpensive proof of concept system using commodity drives, gigabit networking, and small-scale switches. This setup would allow us to verify our models against a small system, providing some confidence that the large systems we are modeling will perform as expected. As with many other computer systems, changes in the Acknowledgments Ethan Miller was supported in part by Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Sandia National Laboratory under contract B520714. 
The Storage Systems Research Center is supported in part by gifts from Hewlett Packard, IBM, Intel, LSI Logic, Microsoft, Overland Storage, and Veritas.

References

[1] W. C. Athas and C. L. Seitz. Multicomputers: message-passing concurrent computers. IEEE Computer, 21:9–24, Aug. 1988.
[2] P. M. Chen, E. K. Lee, G. A. Gibson, R. H. Katz, and D. A. Patterson. RAID: High-performance, reliable secondary storage. ACM Computing Surveys, 26(2), June 1994.
[3] P. F. Corbett and D. G. Feitelson. The Vesta parallel file system. ACM Transactions on Computer Systems, 14(3):225–264, 1996.
[4] D. Culler, J. P. Singh, and A. Gupta. Parallel Computer Architecture: A Hardware/Software Approach. Morgan Kaufmann, 1999.
[5] S. Frølund, A. Merchant, Y. Saito, S. Spence, and A. Veitch. FAB: Enterprise storage systems on a shoestring. In Proceedings of the 9th Workshop on Hot Topics in Operating Systems (HotOS-IX), Kauai, HI, May 2003.
[6] G. A. Gibson and R. Van Meter. Network attached storage architecture. Communications of the ACM, 43(11):37–45, 2000.
[7] R. J. Honicky and E. L. Miller. Replication under scalable hashing: A family of algorithms for scalable decentralized data distribution. In Proceedings of the 18th International Parallel & Distributed Processing Symposium (IPDPS 2004), Santa Fe, NM, Apr. 2004. IEEE.
[8] IBM Company. IceCube – a system architecture for storage and Internet servers. http://www.almaden.ibm.com/StorageSystems/autonomic storage/CIB Hardware/.
[9] D. D. E. Long, B. R. Montague, and L.-F. Cabrera. Swift/RAID: A distributed RAID system. Computing Systems, 7(3):333–359, 1994.
[10] S. J. LoVerso, M. Isman, A. Nanopoulos, W. Nesheim, E. D. Milne, and R. Wheeler. sfs: A parallel file system for the CM-5. In Proceedings of the Summer 1993 USENIX Technical Conference, pages 291–305, 1993.
[11] E. L. Miller and R. H. Katz. RAMA: An easy-to-use, high-performance parallel file system. Parallel Computing, 23(4):419–446, 1997.
[12] N. Nieuwejaar and D. Kotz. The Galley parallel file system. In Proceedings of the 10th ACM International Conference on Supercomputing, pages 374–381, Philadelphia, PA, 1996. ACM Press.
[13] P. Schwan. Lustre: Building a file system for 1000-node clusters. In Proceedings of the 2003 Linux Symposium, July 2003.
[14] R. Stevens. Computational science experiences on the Intel Touchstone DELTA supercomputer. In Proceedings of Compcon '92, pages 295–299. IEEE, Feb. 1992.
[15] R. O. Weber. Information technology—SCSI object-based storage device commands (OSD). Technical Council Proposal Document T10/1355-D, Technical Committee T10, Aug. 2002.
[16] Q. Xin, E. L. Miller, D. D. E. Long, S. A. Brandt, T. Schwarz, and W. Litwin. Reliability mechanisms for very large storage systems. In Proceedings of the 20th IEEE / 11th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 146–156, Apr. 2003.

OBFS: A File System for Object-based Storage Devices

Feng Wang, Scott A. Brandt, Ethan L. Miller, Darrell D. E. Long
Storage System Research Center, University of California, Santa Cruz
Santa Cruz, CA 95064
{cyclonew, sbrandt, elm, darrell}@cs.ucsc.edu
tel +1-831-459-4458, fax +1-831-459-4829

Abstract

The object-based storage model, in which files are made up of one or more data objects stored on self-contained Object-Based Storage Devices (OSDs), is emerging as an architecture for distributed storage systems.
The workload presented to the OSDs will be quite different from that of general-purpose file systems, yet many distributed file systems employ general-purpose file systems as their underlying file system. We present OBFS, a small and highly efficient file system designed for use in OSDs. Our experiments show that our user-level implementation of OBFS outperforms Linux Ext2 and Ext3 by a factor of two or three, and while OBFS is 1/25 the size of XFS, it provides only slightly lower read performance and 10%–40% higher write performance.

1. Introduction

Object-based storage systems represent files as sets of objects stored on self-contained Object-Based Storage Devices (OSDs). By distributing the objects across many devices, these systems have the potential to provide high capacity, throughput, reliability, availability and scalability. We are developing an object-based storage system with a target capacity of 2 petabytes and a throughput of 100 gigabytes per second. In this system, as we expect in many others, files will be striped across OSDs. The stripe unit size of the system will determine the maximum object size and will be the most common object size in the system. Because files will generally consist of many objects and objects will be distributed across many OSDs, there will be little locality of reference within each OSD. The workload presented to the OSDs in this system will therefore be quite different from that of general-purpose file systems. Even in object-based systems that do not employ this architecture, we can still expect that files will be distributed across multiple objects, objects will be distributed across multiple OSDs, and there will be little locality of reference. Even so, many distributed file systems employ general-purpose file systems as their underlying file system.

We present OBFS, a very small, highly efficient object-based file system developed for use in OSDs in large-scale distributed storage systems. The basic idea of OBFS is to optimize the disk layout
OBFS is significantly smaller than Linux XFS, using only about 2,000 lines of code compared with over 50,000 lines of code in XFS. This factor of 25 size difference and the corresponding simplicity of OBFS make OBFS easy to verify, maintain, modify, and port to other platforms. OBFS also provides strong reliability guarantees in addition to high throughput and small code size; the disk layout of OBFS allows it to update metadata with very low overhead, so OBFS updates metadata synchronously. 2. Background A new generation of high-performance distributed file systems are being developed, motivated by the need for ever greater capacity and bandwidth. These file systems are built to support highperformance computing environments which have strong scalability and reliability requirements. To satisfy these requirements, the functionality of traditional file systems has been divided into two separate logical components: a file manager and a storage manager. The file manager is in charge of hierarchy management, naming and access control, while the storage manager handles the actual storage and retrieval of data. In large-scale distributed storage systems, the storage manager runs on many independent storage servers. Distributed object-based storage systems, first used in Swift [6] and currently used in systems such as Lustre [4] and Slice [1], are built on this model. However, in object-based systems the storage manager is an object-based storage device (OSD or OSD) [30], which provides an object-level interface to the file data. OSDs abstract away file storage details such as allocation and scheduling, semi-independently managing all of the data storage issues and leaving all of the file metadata management to the file manager. In a typical instance of this architecture, a metadata server cluster services all metadata requests, managing namespace, authentication, and protection, and providing clients with the file to object mapping. Clients contact the OSDs directly to retrieve the objects corresponding to the files they wish to access. One motivation behind this new architecture is to provide highly-scalable aggregate bandwidth by directly transferring data between storage devices and clients. It eliminates the file server as a bottleneck by offloading storage management to the OSDs [8] and enables load balancing and high performance by striping data from a single file across multiple OSDs. It also enables 284 high levels of security by using cryptographically secured capabilities and local data security mechanisms. Much research has gone into hierarchy management, scalability, and availability of distributed file systems in projects such as AFS [18], Coda [11], GPFS [22], GFS[26] and Lustre [4], but relatively little research has been aimed toward improving the performance of the storage manager. Because modern distributed file systems may employ thousands of storage devices, even a small inefficiency in the storage manager can result in a significant loss of performance in the overall storage system. In practice, general-purpose file systems are often used as the storage manager. For example, Lustre uses the Linux Ext3 file system as its storage manager [4]. Since the workload offered to OSDs may be quite different from that of general-purpose file systems, we can build a better storage manager by matching its characteristics to the workload. 
File systems such as Ext2 and Ext3 are optimized for general-purpose Unix environments in which small files dominate and file sizes vary significantly. They have several disadvantages that limit their effectiveness in large object-based storage systems. Ext2 caches metadata updates in memory for better performance. Although it flushes the metadata back to disk periodically, it cannot provide the high reliability we require. Both Ext3 and XFS employ write-ahead logs to record metadata changes, but the lazy log-write policy used by both of them can still lose important metadata (and therefore data) in some situations. These general-purpose file systems trade off reliability for better performance. If we force them to update object data and metadata synchronously for better reliability, their performance degrades significantly; our experimental results show that in synchronous mode their write throughput is only several MB/second. Many general-purpose file systems such as Ext2 and Ext3 use flat directories in a tree-like hierarchy, which results in relatively poor search performance for directories of more than a thousand objects. XFS uses B+-trees to address this problem. OBFS uses hash tables to obtain very high performance directory operations on the flat object namespace.

In our object-based storage system, as we expect in many others, RAID-style striping with parity and/or replication is used to achieve high performance, reliability, availability, and scalability. Unlike RAID, the devices are semi-autonomous, internally managing all allocation and scheduling details for the storage they contain. The devices themselves may use RAID internally to achieve high performance. In this architecture, each stripe unit is stored in a single object. Thus, the maximum size of the objects is the stripe unit size of the distributed file system, and most of the objects will be this size. At the OSD level, objects typically have no logical relationship, presenting a flat name space. As a result, general-purpose file systems, which are usually optimized for workloads exhibiting relatively small variable-sized files, relatively small hierarchical directories, and some degree of locality, do not perform particularly well under this workload.

3. Assumptions and Design Principles

Our OBFS is designed to be the storage manager on each OSD as part of a large-scale distributed object-based storage system [13], which is currently being developed at the University of California, Santa Cruz, Storage System Research Center (SSRC). Our object-based storage system has three major components: the Metadata Server Cluster (MDSC), the Client Interface (CI), and the Storage Managers (SMs). File system functionality is partitioned among these components. The MDSC is in charge of file and directory management, authentication and protection, distributing workload among OSDs, and providing redundancy and failure recovery. The CI, running on the client
The CIs preprocess the requests and query the MDSC to open the files and get information used to determine which objects comprise the files. The CIs then contact the appropriate SMs to access the objects that contain the requested data, and provide that data to the applications. In our system, objects are limited by the stripe unit size of the system. Thus, in contrast to a file, whose size may vary from bytes to terabytes, the size variance of an object is much smaller. Moreover, the delayed writes in the file cache at the client side will absorb most small writes and result in relatively large object reads and writes. We provide a more detailed analysis of the object workload characteristics in Section 4. To enable parallel I/O, files are striped into fixed-size objects and spread across different OSDs. The specific OSDs are selected based on the overall workload distribution, intended to avoid "hot spots" and increase potential parallelism [13]. From the viewpoint of a single OSD, incoming object accesses will be relatively random. Thus inter-object locality will be insignificant. Most file systems cache writes for fast response, to coalesce many small writes into fewer larger ones, and to allow the file system to exploit locality of reference within the request stream. In object-based storage systems, most asynchronous writes will therefore be cached by the client. As a result, almost all of the writes to the OSDs will be synchronous. Thus, the SMs should probably not cache incoming writes in the OSDs. Furthermore, because logically contiguous data is distributed across many objects in many different OSDs, there is no locality of reference to be leveraged by caching writes of different objects. Another caching-related concern arises due to the black-box nature of the OSDs. Because the OSDs provide a very high-level interface to the data, caching can cause the storage system as a whole to believe that the data has been saved, while data has actually been lost due to power failure or other hardware failures. While this may be addressable, we have not addressed it in this version of OBFS. On each OSD there is a complete lack of information about relationships between objects. Thus a flat name space is used to manage the objects on each OSD. Because hundreds of thousands of objects might coexist on a single OSD, efficient searching in this flat name space is a primary requirement for the SMs. As mentioned above, most of the incoming write requests will be synchronous. A client expects the data to be on the permanent storage when it commits its writes. This requires the OSDs to flush the objects to permanent storage before committing them. This also means that the metadata of those objects must be kept safely. In effect, OSDs in object-based storage systems are like disks in traditional storage systems, and file systems expect disks to actually store committed write requests rather than caching them. Figure 1. Data distribution in a large high-performance distributed storage system (data courtesy of LLNL): (a) number of files of different sizes; (b) total bytes in files of different sizes. 4. Workload Characteristics Very large-scale distributed file systems may exhibit very different performance characteristics than general-purpose file systems.
The total volume of a large-scale distributed file system may range from several terabytes to several petabytes, orders of magnitude larger than typical general-purpose file systems. The average file size in such a distributed file system may also be much larger than that of current general-purpose file systems. Although our intent is to develop a flexible and general file system applicable in many different situations, one of our performance goals is to handle workloads encountered in high-performance computing environments with tens or hundreds of thousands of processors simultaneously accessing many files in many directories, many files in a single directory, or even a single file. These environments place extremely high demands on the storage system. Figure 1 shows the data distribution across files in a high-performance distributed file system from Lawrence Livermore National Laboratory (LLNL). Figure 1(a) shows the file size distribution for the more than 1.5 million files in this system. Most of the files are larger than 4 KB and the majority of all files are distributed between 32 KB and 8 MB. Those files that are smaller than 4 KB (a typical block size for a general-purpose file system) only account for a very small portion of the total files. However, almost all of the disk space is occupied by files larger than 4 MB and the majority of all bytes are in files between 4 MB and 1 GB, as shown in Figure 1(b). The total number of bytes in files smaller than 256 KB is insignificant. Though the files larger than 1 GB are only a small percentage of the files, the total number of bytes in such files accounts for more than 15% of the bytes in the system. The file access pattern of such systems is also different from that of a typical general-purpose file system. In the LLNL workload, most data transferred between the processors and the file system is in chunks of several megabytes. Most files are accessed simultaneously by hundreds of processors, and instead of flushing dirty data directly back to the storage device, each processor caches the data in its local memory and only writes the data once the buffer is full. Object-based storage may be used for smaller file systems as well. Systems like those traced by Roselli et al. [20] have many small files; in the systems they studied, 60–70% of the bytes transferred were from files smaller than 512 KB. Clearly, an OSD file system must also be able to efficiently handle workloads composed primarily of small objects. For the OSDs to achieve the high throughput required of the system and to fully take advantage of the object-based storage model, our system stripes file data across the OSDs. This is a very compelling choice, analogous to that of earlier systems such as Swift [6] and Zebra [9], and we believe that this will be an architecture of choice in large-scale object-based storage systems. In such systems, each object stored on an OSD will be a stripe unit (or partial stripe unit) of data from a file. The system stripe unit size depends on the design requirements of the individual system. Stripe units that are too small will decrease the throughput of each OSD, while stripe units that are too large will decrease the potential parallelism of each file. Assuming a stripe unit size of 512 KB, large files will be divided into several 512 KB objects and files smaller than 512 KB will be stored in a single object. Consequently, no object in the system will ever exceed the system stripe unit size.
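To make the mapping concrete, the short sketch below (our own illustration, not OBFS or LLNL code) walks a file of arbitrary size and emits the sequence of stripe-unit-sized objects it would be split into, assuming the 512 KB stripe unit discussed above; only the final object can be smaller than the stripe unit.

```c
#include <stdio.h>

#define STRIPE_UNIT (512ULL * 1024)   /* assumed system stripe unit: 512 KB */

/* Enumerate the objects a file of the given size would be striped into. */
static void enumerate_objects(unsigned long long file_size)
{
    unsigned long long offset = 0;
    unsigned int index = 0;

    while (offset < file_size) {
        unsigned long long len = file_size - offset;
        if (len > STRIPE_UNIT)
            len = STRIPE_UNIT;
        printf("object %u: %llu bytes (%s)\n", index++, len,
               len == STRIPE_UNIT ? "large" : "small");
        offset += len;
    }
}

int main(void)
{
    /* A 1.3 MB file becomes two full 512 KB objects plus one smaller object. */
    enumerate_objects(1300 * 1024ULL);
    return 0;
}
```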
In the LLNL workload we estimate that about 85% of all objects will be 512 KB and 15% of all objects will be smaller than 512 KB. We will refer to objects that are the same size as the system stripe unit size as large objects and the rest as small objects. Workstation workloads [20] will likely have more small objects and fewer large objects. Because the object-based storage system is expected to spread the objects evenly across all of the OSD devices, the object size distribution in the workload of a single OSD device will be the same as that of the larger storage system. Thus, a single OSD device under the LLNL workload should expect that 85% of incoming objects are large objects and the rest are small objects. Since files are distributed across many OSDs and directory hierarchies are managed above the OSD level, there is no inter-object locality that can be exploited in the OSDs. The workload of OSDs in this type of system will be dominated by large fixed-size objects exhibiting no inter-object locality. Under workstation workloads, in contrast, the object size distribution will be closer to 25% large objects and 75% small objects. An OSD file system should be able to handle either type of workload. 5. Design and Implementation As described in Section 4, the expected workload of our OSDs is composed of many objects whose sizes range from a few bytes up to the file system stripe unit size. Therefore, OBFS needs to optimize large object performance to provide substantially higher overall throughput, but without overcommitting resources to small objects. Simply increasing the file system block size can provide the throughput needed for large objects, but at the cost of wasted storage space due to internal fragmentation for small objects. For the LLNL workload, more than 10% of the available storage space would be wasted if 512 KB blocks are used, while less than 1% of the space would be lost if 4 KB blocks are used. In a 2 PB storage system, this 9% difference represents about 180 TB. The situation is even worse for a workstation file system, where 512 KB blocks would waste more than 50% of the space. To use large blocks without wasting space, small objects must be stored in a more efficient way. OBFS therefore employs multiple block sizes and uses regions, analogous to cylinder groups in FFS [15], to keep blocks of the same size together. Thus, the read/write performance of large objects can be greatly improved by using very large blocks, while small objects can be efficiently stored using small blocks. Another important feature of OBFS is the use of a flat name space. As the low-level storage manager in an object-based distributed file system, OBFS has no information about the logical relationship among objects. No directory information is available and no useful locality information is likely to be available. Figure 2. OBFS structure (boot sector, region heads, and large block and small block regions). Note that in a small system where an OSD may hold several objects of a file, some locality information may be available, but this does not extend to multiple files in the same directory or other tidbits that are useful to general-purpose file systems.
Many general-purpose file systems such as Linux Ext2 are extremely inefficient in managing very large directories because they do linear search, resulting in O(n) performance on simple directory operations. To avoid this, OBFS uses hash tables (like Ext3 [28]) to organize the objects and achieve much higher performance on directory operations. 5.1. Regions and Variable-Size Blocks The user-level implementation of OBFS separates the raw disk into regions. As shown in Figure 2, regions are located in fixed positions on the disk and have uniform sizes. All of the blocks in a region have the same size, but the block sizes in different regions may be different. The block size in a free region is undefined until that region is initialized. Regions are initialized when there are insufficient free blocks in any initialized region to satisfy a write request. In this case, OBFS allocates a free region and initializes all of its blocks to the desired block size. When all of the blocks in a used region are freed, OBFS returns the region to the free region list. Although our region policy supports as many different block sizes as there are regions, too many different block sizes will make space allocation and data management excessively complicated. In our current implementation, OBFS uses two block sizes: small and large. Small blocks are 4 KB, the logical block size in Linux, and large blocks are 512 KB, the system stripe unit size and twice the block size of GPFS (256 KB). Those regions that contain large blocks are called large block regions and those regions that contain small blocks are called small block regions. With this strategy, large objects can be laid out contiguously on disk in a single large block. The throughput of large objects is greatly improved by the reduction in seek time and reduced metadata operations that are inherent in such a design. Only one disk seek is incurred during the transfer of a large object. OBFS eliminates additional operations on metadata by removing the need for indirect blocks for large objects. Dividing the file system into regions also reduces the size of other file system data structures such as free block lists or maps and thus makes the operations on those data structures more efficient. This scheme reduces file system fragmentation, avoids unnecessary wasted space, and more effectively uses the available disk bandwidth. By separating the large blocks in different regions from the small blocks, OBFS can reserve contiguous space for large objects and prevent small objects from using too much space. Figure 3. Region structure and data layout: (a) large block region; (b) small block region. Region fragmentation will only become a problem in the rare case that the ratio of large to small objects changes significantly during the lifetime of the system, as described in Section 5.6.
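The following sketch, under our own simplifying assumptions (field names and the flat region structure are ours, not the OBFS on-disk format), illustrates the region scheme just described: every block in a region has one size, a free region takes on a block size only when it is first initialized, and a fully freed region returns to the free pool.

```c
#include <stdint.h>
#include <stdio.h>

#define SMALL_BLOCK  (4u * 1024)           /* 4 KB, the Linux logical block size  */
#define LARGE_BLOCK  (512u * 1024)         /* 512 KB, the system stripe unit size */
#define REGION_BYTES (256u * 1024 * 1024)  /* assumed 256 MB regions              */

enum region_type { REGION_FREE, REGION_SMALL, REGION_LARGE };

struct region {
    enum region_type type;     /* block size is undefined while the region is free */
    uint32_t block_size;
    uint32_t total_blocks;
    uint32_t free_blocks;
};

/* Initialize a free region with a single, uniform block size. */
static void region_init(struct region *r, enum region_type type)
{
    r->type = type;
    r->block_size = (type == REGION_LARGE) ? LARGE_BLOCK : SMALL_BLOCK;
    r->total_blocks = REGION_BYTES / r->block_size;
    r->free_blocks = r->total_blocks;
}

/* Once every block has been freed, the region goes back on the free list. */
static void region_maybe_release(struct region *r)
{
    if (r->free_blocks == r->total_blocks)
        r->type = REGION_FREE;
}

int main(void)
{
    struct region r = { .type = REGION_FREE };
    region_init(&r, REGION_LARGE);
    printf("large block region: %u blocks of %u KB\n",
           r.total_blocks, r.block_size / 1024);
    region_maybe_release(&r);
    printf("region is %s\n", r.type == REGION_FREE ? "free again" : "in use");
    return 0;
}
```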
Higher throughput in OBFS does not come at the cost of wasted disk space. Internal fragmentation in OBFS is no worse than in a general-purpose Unix file system because the small block size in OBFS is the same as the block size in most Unix file systems. Large blocks do not waste much space because they are only used for objects that will fill or nearly fill the blocks. The only wasted space will be due to objects stored in large blocks that are nearly, but not quite, as large as a stripe unit. This can be limited with a suitable size threshold for selecting the block size to use for an object. One minor complication can occur if an object starts small and then grows past this threshold. Our current implementation recopies the object into a large block when this occurs. Although this sounds expensive, it will happen rarely enough (due to aggressive write coalescing in the client caches) that it does not have a significant impact on system performance, and the intra-region locality of the small blocks makes this a very efficient operation. 5.2. Object Metadata Object metadata, referred to as an onode, is used to track the status of each object. Onodes are preallocated in fixed positions at the head of small block regions, similar to the way inodes are placed in cylinder groups in FFS [15]. In large block regions, shown in Figure 3, onodes are packed together with the data block on the disk, similar to embedded inodes [7]. This allows for very low overhead metadata updates as the metadata can be written with the corresponding data block. Figure 3 shows that each onode has a unique 32-bit identifier consisting of two parts: a 16-bit region identifier and a 16-bit in-region object identifier. If a region occupies 256 MB on disk, this scheme will support OSDs of up to 16 TB, and larger OSDs are possible with larger regions. To locate a desired object, OBFS first finds the region using the region identifier and then uses the in-region object identifier to index the onode. This is particularly effective for large objects because the object index points directly to the onode and the object data, which are stored contiguously. In the current implementation, onodes for both large and small objects are 512 bytes, allowing OBFS to avoid using indirect blocks entirely. The maximum size of a small object will always be less than the stripe unit size, which is 512 KB in our design. Because the OBFS layout policy assigns objects to a single region, we can use the relative address to track the blocks. Assuming the region size is 256 MB and the small block size is 4 KB, there will be fewer than 2^16 small blocks in a region, allowing two-byte addresses to index all of the blocks in the region. In the worst case, a small object will be a little smaller than 512 KB, requiring 128 data blocks. Thus, the maximum amount of space that may be needed to index the small blocks in an object is 256 bytes, which can easily fit into a 512-byte onode. 5.3. Object Lookup Given an object identifier, we need to retrieve the object from the disk. In a hierarchical name space, data lookup is implemented by following the path associated with the object to the destination directory and searching (often linearly) for the object in that directory. In a flat name space, linear search is prohibitively expensive, so OBFS uses a hash table, the Object Lookup Table (OLT), to manage the mapping between the object identifier and the onode identifier. Each valid object has an entry in the OLT that records the object identifier and the corresponding onode identifier. The size of the OLT is proportional to the number of objects in the OSD: with 20,000 objects residing in an OSD, the OLT requires 233 KB. For efficiency, the OLT is loaded into main memory and updated asynchronously.
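As a concrete reading of the identifier layout just described, the sketch below packs and unpacks the 32-bit onode identifier (16-bit region number, 16-bit in-region index) and checks the 16 TB limit implied by 256 MB regions; the macro names are ours, and the real OLT is a persistent hash table rather than this toy arithmetic.

```c
#include <stdint.h>
#include <stdio.h>

/* 32-bit onode identifier: high 16 bits select the region, low 16 bits
 * index the onode within that region (macro names are our own). */
#define ONODE_ID(region, index)  ((uint32_t)(region) << 16 | ((index) & 0xFFFFu))
#define ONODE_REGION(id)         ((uint16_t)((id) >> 16))
#define ONODE_INDEX(id)          ((uint16_t)((id) & 0xFFFFu))

int main(void)
{
    uint32_t id = ONODE_ID(7, 1234);
    printf("onode 0x%08x -> region %u, index %u\n",
           (unsigned)id, ONODE_REGION(id), ONODE_INDEX(id));

    /* With 2^16 addressable regions of 256 MB each, the scheme covers
     * 2^16 * 256 MB = 16 TB per OSD, matching the limit quoted above. */
    unsigned long long max_osd = (1ULL << 16) * 256ULL * 1024 * 1024;
    printf("maximum OSD size: %llu TB\n", max_osd >> 40);
    return 0;
}
```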
Each region has a region head which stores information about the region, including pointers to the free block bitmap and the free onode bitmap. All of the region heads are linked into the Region Head List (RHL). On an 80 GB disk, the RHL occupies 8 MB of disk space. Like the OLT, the RHL is loaded into memory and updated asynchronously. After obtaining an onode identifier, OBFS searches the RHL using the upper 16 bits of the onode identifier to obtain the corresponding region type. If the onode belongs to a large block region, the object data address can be directly calculated. Otherwise, OBFS searches the in-memory onode cache to find that onode. A disk copy of the onode will be loaded into the onode cache if the search fails. 5.4. Disk Layout Policy The disk layout policy of OBFS is quite simple. For each incoming request, OBFS first decides what type of block(s) the object should use. If the object size is above the utilization threshold of the large blocks, a large block is assigned to the object; otherwise, it uses small blocks. For those objects that use large blocks, OBFS only needs to find the nearest large-block region that contains a free block, mark it as used, and write the object to that block. For objects that use small blocks, an FFS-like allocation policy is employed. OBFS searches the active region list to find the nearest region that has enough free small blocks for the incoming object. After identifying a region with sufficient space, OBFS tries to find a contiguous chunk of free blocks that is large enough for the incoming object. If such a chunk of blocks is not available, the largest contiguous chunk of blocks in that region is assigned to the object. The amount of space allocated in this step is subtracted from the object size, and the process is repeated until the entire object is stored within the region. If no region has the desired number and type of free blocks, the nearest free region will be initialized and put into the active region list. The incoming object will then be allocated to this new region. The OBFS data allocation policy guarantees that each of the large objects is allocated contiguously and each of the small objects is allocated in a single region. No extra seeks are needed during a large object transfer and only short seeks are needed to read or write the small objects, no matter how long the file system has been running. Compared with a general-purpose file system, OBFS is much less fragmented after running for a long time, minimizing performance degradation as the system ages. 5.5. Reliability and Integrity As mentioned in Section 5.3, OBFS asynchronously updates important data structures such as the OLT and the RHL to achieve better performance. In order to guarantee system reliability, OBFS updates some important information in the onodes synchronously. If the system crashes, OBFS will scan all of the onodes on the disk to rebuild the OLT and the RHL. For each object, the object identifier and the region identifier are used to assemble a new entry in the OLT. The block addresses for each object are then used to rebuild each region's free block bitmap. Because the onodes are synchronously updated, we can eventually rebuild the OLT and RHL and restore the system. As mentioned above, OBFS updates metadata either without an extra disk seek or with one short disk seek. In so doing, it keeps the file system reliable and maintains system integrity with very little overhead.
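A minimal sketch of the layout decision described in Section 5.4 above, with an assumed utilization threshold (the exact value is not given): objects near the stripe unit size take one large block in the nearest large block region, and everything else is packed into 4 KB blocks of a small block region.

```c
#include <stdbool.h>
#include <stdio.h>

#define SMALL_BLOCK      (4u * 1024)
#define LARGE_BLOCK      (512u * 1024)
#define LARGE_THRESHOLD  ((unsigned)(LARGE_BLOCK * 0.9))   /* assumed threshold */

/* Decide which kind of block(s) an incoming object should use. */
static bool wants_large_block(unsigned object_size)
{
    return object_size >= LARGE_THRESHOLD;
}

int main(void)
{
    const unsigned sizes[] = { 8 * 1024, 300 * 1024, 480 * 1024, 512 * 1024 };
    for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++) {
        unsigned sz = sizes[i];
        if (wants_large_block(sz))
            printf("%4u KB object -> one %u KB block in a large block region\n",
                   sz / 1024, LARGE_BLOCK / 1024);
        else
            printf("%4u KB object -> %u x %u KB blocks in a small block region\n",
                   sz / 1024, (sz + SMALL_BLOCK - 1) / SMALL_BLOCK,
                   SMALL_BLOCK / 1024);
    }
    return 0;
}
```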
5.6. Region Cleaning Since OBFS uses regions to organize different types of blocks, one potential problem is that there will be no free regions and no free space in regions of the desired type. Unlike LFS [21], which must clean segments on a regular basis, OBFS will never need cleaning unless the ratio between large and small objects changes significantly over time on an OSD that is nearly full. This can only happen when the object size characteristic of the workload changes significantly when the file system is near its capacity. We do not expect this to happen very often in practice. However, if it happens, it can result in many full regions of one type, many underutilized regions of the other type, and no free regions. In this situation, the cleaner can coalesce the data in the underutilized regions and create free regions which can be used for regions of the desired type. If all of the regions are highly utilized, cleaning will not help much: the disk is simply full. Low utilization regions can only be produced when many objects are written to disk and then deleted, leaving "holes" in regions. However, unlike in LFS, these holes are reused for new objects without the need for cleaning. The only time cleaning is needed is when all of the holes are in the wrong kind of region, e.g., the holes are in small block regions and OBFS is trying to write a large block. This situation only occurs when the ratio between large objects and small objects changes. In our experiments, we only observed the need for the cleaner when we artificially changed the workload ratios on a nearly full disk. Because cleaning is rarely, if ever, necessary, it will have a negligible impact on OBFS performance. However, cleaning can be used to improve file system performance by defragmenting small-block regions to keep blocks of individual objects together. This process would copy all used blocks of a region to a free region on the disk, sorting the blocks as it goes. Because this would occur on a region-by-region basis and because a new region will always have enough free space for all of the blocks in an old region, it would be trivial to implement. The system need never do this, however. 6. OBFS Performance We compared the performance of OBFS to that of Linux Ext2, Ext3, and XFS. Ext2 is a widely used general-purpose file system. Ext3 is used by Lustre for object storage and has the same disk layout as Ext2 but adds a journal for reliability. XFS is a modern high-performance general-purpose file system that uses B-trees and extent-based allocation. Table 1. Specifications of the Maxtor D740X-6L disk used in the experiments: capacity 80 GB; controller Ultra ATA/133; track-to-track seek 0.8 ms; average seek 8.5 ms; rotation speed 7200 RPM; sustained transfer rate 24.2–44.4 MB/s. While Ext2, Ext3, and XFS run as in-kernel file systems, the version of OBFS used in these experiments is a user-level file system. An in-kernel implementation of OBFS would take advantage of the very effective caching provided by the Linux kernel, but our user-level implementation cannot. Thus, in order to allow for a fair comparison, we executed the following experiments with the system buffer cache bypassed: all of the file systems were mounted using the "-o sync" parameter, which forced the system buffer cache to use a write-through policy. The results thus evaluate the disk layout policies of the different file systems. With caching enabled, all three file systems will achieve higher performance.
We expect the performance change of OBFS to be comparable to those of XFS, Ext2, and Ext3. 6.1. Experimental Setup All of the experiments were executed on a PC with a 1 GHz Pentium III CPU and 512 MB of RAM, running Red Hat Linux, kernel version 2.4.0. To examine the performance of the file systems with minimal impact from other operating system activities, we dedicated an 80 GB Maxtor D740X-6L disk (see Table 1) to the experiments. This disk was divided into multiple 8 GB partitions. The first partition was used to install file systems and run experiments. The rest were used to back up aged file system images. We used aged file systems to more accurately measure the long-term performance of the file systems. For each experiment, we copied an aged file system to the first partition of the disk, unmounted the disk and rebooted Linux to clean the buffer cache, then mounted the aged partition to run the benchmarks. We repeated these steps three times and took the average of the performance numbers obtained. Smith et al. [25] used file system snapshots and traces to approximate the possible activities in file systems. No object-based storage system snapshots are currently available, so we used the simplest approach: generate 200,000 to 300,000 randomly distributed create and delete requests and feed these requests to a new file system. The create/delete ratio was dynamically adjusted based on the disk usage, which guaranteed that it neither filled nor emptied the available disk space. Because our user-level implementation of OBFS bypasses the buffer cache, all three file systems were forced to use synchronous file I/O to allow for a fair comparison of the performance. Ext2 uses asynchronous metadata I/O to achieve high throughput even if synchronous file I/O is used, so we mounted the partitions in synchronous mode to force them to always flush the data in the buffer cache back to disk. The benchmarks we used consisted of semi-random sequences of object requests whose characteristics were derived from the LLNL workload described in Section 4. On average, 80% of all objects were large objects (512 KB). The rest were small objects whose size was uniformly distributed between 1 KB and 512 KB. To examine the performance of the various file systems, we generated two kinds of benchmarks: microbenchmarks and macrobenchmarks. Our microbenchmarks each consisted of 10,000 requests of a single request type—read, write, or rewrite—and allowed us to examine the performance of the file systems on that request type. Our macrobenchmarks consisted of synthetic workloads composed of create, read, rewrite, and delete operations in ratios determined by the workload mentioned above. These allowed us to examine the performance of the file systems on the expected workload.
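The object-size mix just described (roughly 80% full 512 KB objects, the remainder uniformly distributed below 512 KB) can be reproduced with a generator along the following lines; this is our own approximation of such a benchmark driver, not the tool actually used in these experiments.

```c
#include <stdio.h>
#include <stdlib.h>

#define KB          1024u
#define STRIPE_UNIT (512u * KB)

/* Draw one object size from the assumed benchmark distribution. */
static unsigned int next_object_size(void)
{
    if (rand() % 100 < 80)
        return STRIPE_UNIT;              /* large object: exactly one stripe unit      */
    return (1 + rand() % 511) * KB;      /* small object: 1 KB .. 511 KB, kept < 512 KB */
}

int main(void)
{
    srand(42);
    unsigned long long total = 0;
    for (int i = 0; i < 10; i++) {
        unsigned int sz = next_object_size();
        total += sz;
        printf("request %2d: %4u KB\n", i, sz / KB);
    }
    printf("total: %llu KB\n", total / KB);
    return 0;
}
```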
We used two different macrobenchmarks, Benchmark I and Benchmark II, whose parameters are listed in Table 2. Table 2. Benchmark parameters. Benchmark I: 16854 reads (7.4 GB), 4577 writes (2.0 GB), 4214 rewrites (1.8 GB), 4356 deletes (1.9 GB), 30001 operations (13.1 GB) in total. Benchmark II: 4049 reads (1.8 GB), 8969 writes (4.0 GB), 8531 rewrites (3.8 GB), 8147 deletes (3.9 GB), 29696 operations (12.5 GB) in total. Benchmark I is a read-intensive workload in which reads account for 56% of all requests and the total size of the read requests is around 7.4 GB. The writes, rewrites, and deletes account for 15.3%, 14.0%, and 14.5% of the requests. In Benchmark II, reads account for 13.6% of the requests and writes, rewrites, and deletes account for 29.8%, 28.4%, and 27.1%. 6.2. Results Figure 4 shows the performance of Ext2, Ext3, XFS, and OBFS on a mixed workload consisting of 80% large objects and 20% small objects. Figure 4. Performance on a workload of mixed-size objects: (a) reads, (b) writes, (c) rewrites (throughput in MB/s versus disk utilization). As seen in Figure 4(b), OBFS exhibits very good write performance, almost twice that of Ext2 and Ext3 and 10% to 20% better than XFS. The large block scheme of OBFS contributes a lot to the strong write performance. With large blocks, contiguous space has been reserved for the large objects, allowing large objects to be written with only one disk seek. Because OBFS uses regions to organize large and small blocks, limiting the amount of external fragmentation, the performance of OBFS remains good as disk usage increases. At the same time, the performance of Ext2 and Ext3 drops significantly due to the insufficient availability of large contiguous regions, as seen in Figures 4(b), 5(b), and 6(b). (Note that in all of the microbenchmark graphs write performance is displayed starting at 0% disk utilization, but because reads and rewrites cannot be done on an empty disk, we chose to start those experiments at 25% utilization.) Figure 5. Performance on a workload of large objects: (a) reads, (b) writes, (c) rewrites. Figure 6. Performance on a workload of small objects: (a) reads, (b) writes, (c) rewrites. On reads, OBFS outperforms Ext2 and Ext3 by nearly 3 times, but is about 10% slower than XFS, as Figure 4(a) shows. We suspect that a more optimized implementation of XFS contributes to its slightly better read performance. As seen in Figure 4(c), the rewrite performance of OBFS beats that of Ext2 and Ext3 by about 3 times, and beats XFS by about 20–30%. The poor performance of Ext2 and Ext3 in both read and rewrite can be explained by their allocation policies and small blocks. XFS uses extents rather than blocks to organize files, so most files can get contiguous space. This results in excellent performance in both read and write. However, OBFS still shows slightly better performance on object rewrite. Figure 5 shows the performance of the four file systems on large objects and Figure 6 shows the performance on small objects. Figure 5 is almost the same as Figure 4 because large objects dominate the mixed workload of Figure 4. In Figure 6, we see that OBFS meets the performance of XFS, almost triples the performance of Ext2 and Ext3 when doing reads and rewrites, and exceeds the performance of all three when doing creates. The benchmark results are shown in Figures 7 and 8.
As described above, Benchmark I is a read-intensive workload and Benchmark II is a write-intensive workload. Notice that in our benchmarks, XFS beats both Ext2 and Ext3 by a large margin in all cases; this differs from other benchmark studies [5] that found that Ext2 and XFS have comparable performance. There are three factors in our experiments that favor XFS over Ext2 and Ext3. First, our benchmarks include many large objects, which benefit from XFS extent-based allocation, especially when disk utilization is high. Second, while other benchmarks used fresh disks, our benchmarks use disks subjected to long-term aging [25] to reflect more realistic scenarios. After aging, the performance of Ext2 and Ext3 drops dramatically due to disk fragmentation, while XFS maintains good performance because of its extent-based allocation policy. Figure 7. Performance under Benchmark I: (a) reads, (b) writes, (c) overall. Figure 8. Performance under Benchmark II: (a) reads, (b) writes, (c) overall. Third, in our object-based benchmarks, we assume a flat name space in which all objects are allocated in the root directory in all file systems. The linear search of directories used by Ext2 and Ext3 performs poorly when the number of objects scales to tens of thousands. XFS uses B+-trees to store its directories, ensuring fast name lookup even in very large directories. Ext2 and Ext3 outperform OBFS when the disk is nearly empty, as shown in Figures 7(a) and 8(a). This is due in part to the significant likelihood that an object will be in the buffer cache because of the low number of objects that exist in a nearly empty disk. For example, an 8 GB partition at 10% utilization has only 800 MB of data. With 512 MB of main memory, most objects will be in memory and Linux Ext2, Ext3, and XFS reads will proceed at memory speeds, while our user-level implementation of OBFS gains no advantage from the buffer cache and must therefore always access the disk. However, as disk usage increases, this effect is minimized and Ext2 and Ext3 read performance decreases rapidly while OBFS performance remains essentially constant. The net result is that OBFS read performance is two or three times that of Ext2 and Ext3. OBFS is still about 10% slower than XFS on reads, similar to the results from earlier read microbenchmarks. OBFS outperforms all three other file systems on writes, however, as Figures 7(b) and 8(b) show. For writes, OBFS is 30–40% faster than XFS and 2–3 times faster than Ext3. Overall, OBFS and XFS are within 10% of each other on the two macrobenchmarks, with one file system winning each benchmark. OBFS clearly beats both Ext2 and Ext3, however, running three times faster on both benchmarks. Although our macrobenchmarks focused on large-object performance, Figure 6 shows that OBFS meets or exceeds the performance of the other file systems on a workload consisting entirely of small objects, those less than 512 KB. OBFS doubles or triples the performance of Ext2 and Ext3 and matches that of XFS on reads and rewrites and exceeds it by about 25% on writes. As OBFS also does well on large objects, we conclude that it is as well suited to general-purpose object-based storage system workloads as it is to terascale high-performance object-based storage system workloads.
7. Related Work Many other file systems have been proposed for storing data on disk; however, nearly all of them have been optimized for storing files rather than objects. The Berkeley Fast File System (FFS) [15] and related file systems such as Ext2 and Ext3 [28] are widely used today. They all try to store file data contiguously in cylinder groups—regions with thousands of contiguous disk blocks. This strategy can lead to fragmentation, so techniques such as extents and clustering [16, 24] are used to group blocks together to decrease seek time. Analysis [23, 24] has shown that clustering can improve performance by a factor of two or three, but it is difficult to find contiguous blocks for clustered allocation in aged file systems. Log-structured file systems [21] group data by optimizing the file system for writes rather than reads, writing data and metadata to segments of the disk as it arrives. This works well if files are written in their entirety, but can suffer on an active file system because files can be interleaved, scattering a file's data among many segments. In addition, log-structured file systems require cleaning, which can reduce overall performance [2]. XFS [19, 27] is a highly optimized file system that uses extents and B-trees to provide high performance. This performance comes at a cost: the file system has grown from 50,000 to nearly 200,000 lines of code, making it potentially less reliable and less attractive for commodity storage devices because such devices cannot afford data corruption due to file system errors. In addition, porting such a file system is a major effort [19]. Gibson et al. have proposed network-attached storage devices [8], but spent little time describing the internal data layout of such devices. WAFL [10], a file system for network-attached storage servers that can write data and metadata to any free location, is optimized for huge numbers of small files distributed over many centrally-controlled disks. Many scalable storage systems such as GPFS [22], GFS [26], Petal [12], Swift [6], RAMA [17], Slice [1] and Zebra [9] stripe files across individual storage servers. These designs are most similar to the file systems that will use OSDs for data storage; Slice explicitly discusses the use of OSDs to store data [1]. In systems such as GFS, clients manage low-level allocation, making the system less scalable. Systems such as Zebra, Slice, Petal, and RAMA leave allocation to the individual storage servers, reducing the bottlenecks; such file systems can take advantage of our file system running on an OSD. In GPFS, allocation is done in large blocks, allowing the file system to guarantee few disk seeks, but resulting in very low storage utilization for small files. Existing file systems must do more than allocate data. They must also manage large amounts of metadata and directory information. Most systems do not store data contiguously with metadata, decreasing performance because of the need for multiple writes.
Log-structured file systems and embedded inodes [7] store metadata and data contiguously, avoiding this problem, though they still suffer from the need to update a directory tree correctly. Techniques such as logging [29] and soft updates [14] can reduce the penalty associated with metadata writes, but cannot eliminate it. 8. Conclusions Object-based storage systems are a promising architecture for large-scale high-performance distributed storage systems. By simplifying and distributing the storage management problem, they provide both performance and scalability. Through standard striping, replication, and parity techniques they can also provide high availability and reliability. However, the workload characteristics observed by OSDs will be quite different from those of general-purpose file systems in terms of size distributions, locality of reference, and other characteristics. To address the needs of such systems, we have developed OBFS, a very small and highly efficient file system targeted specifically for the workloads that will be seen by these object-based storage devices. OBFS currently uses two block sizes: small blocks, roughly equivalent to the blocks in general-purpose file systems, and large blocks, equal to the maximum object size. Blocks are laid out in regions that contain both the object data and the onodes for the objects. Free blocks of the appropriate size are allocated sequentially, with no effort made to enforce locality beyond single-region object allocation and the collocation of objects and their onodes. At present, we have tested OBFS as a user-level file system. Our experiments show that the throughput of OBFS is two to three times that of Linux Ext2 and Ext3, regardless of the object size. OBFS provides slightly lower read performance than Linux XFS, but 10%–40% higher write performance. At a fraction of the size of XFS—2,000 lines of code versus over 50,000 for XFS—OBFS is both smaller and more efficient, making it more suitable for compact embedded implementations. Ultimately, because of its small size and simplicity, we expect that it will also prove to be both more robust and more maintainable than XFS, Ext2, or Ext3. Finally, we successfully implemented a kernel-level version of the OBFS file system in about three person-weeks. The short implementation time was possible because of OBFS's simplicity and very compact code. At present the performance of our in-kernel implementation does not match that of our user-level implementation because our carefully managed large blocks get broken into small blocks by the Linux buffer management layer, as encountered by the XFS developers. We intend to rewrite the buffer management code, as they did, to avoid this problem. With this change, we expect the in-kernel OBFS performance to exceed that of the user-level implementation, further solidifying OBFS's advantage over general-purpose file systems for use in object-based storage devices. Acknowledgments This research was supported by Lawrence Livermore National Laboratory, Los Alamos National Laboratory, and Sandia National Laboratory under contract B520714. The Storage Systems Research Center is supported in part by gifts from Hewlett Packard, IBM, Intel, LSI Logic, Microsoft, Overland Storage, and Veritas. References [1] D. C. Anderson, J. S. Chase, and A. M. Vahdat. Interposed request routing for scalable network storage. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation (OSDI), Oct. 2000. [2] T. Blackwell, J.
Harris, , and M. Seltzer. Heuristic cleaning algorithms in log-structured file systems. In Proceedings of the Winter 1995 USENIX Technical Conference, pages 277–288. USENIX, Jan. 1995. [3] D. P. Bovet and M. Cesati. Understanding the Linux Kernel. O’Reilly and Associates, Oct. 2000. [4] P. J. Braam. The Lustre storage architecture, 2002. [5] R. Bryant, R. Forester, and J. Hawkes. Filesystem performance and scalability in Linux 2.4.17. In Proceedings of the Freenix Track: 2002 USENIX Annual Technical Conference, Monterey, CA, June 2002. USENIX. [6] L.-F. Cabrera and D. D. E. Long. Swift: Using distributed disk striping to provide high I/O data rates. Computing Systems, 4(4):405–436, 1991. [7] G. R. Ganger and M. F. Kaashoek. Embedded inodes and explicit groupings: Exploiting disk bandwidth for small files. In Proceedings of the 1997 USENIX Annual Technical Conference, pages 1–17. USENIX Association, Jan. 1997. [8] G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka. A cost-effective, high-bandwidth storage architecture. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 92–103, San Jose, CA, Oct. 1998. [9] J. H. Hartman and J. K. Ousterhout. The Zebra striped network file system. ACM Transactions on Computer Systems, 13(3):274–310, 1995. [10] D. Hitz, J. Lau, and M. Malcom. File system design for an NFS file server appliance. In Proceedings of the Winter 1994 USENIX Technical Conference, pages 235–246, San Francisco, CA, Jan. 1994. [11] J. J. Kistler and M. Satyanarayanan. Disconnected operation in the Coda file system. ACM Transactions on Computer Systems, 10(1):3–25, 1992. [12] E. K. Lee and C. A. Thekkath. Petal: Distributed virtual disks. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), pages 84–92, Cambridge, MA, 1996. [13] D. Long, S. Brandt, E. Miller, F. Wang, Y. Lin, L. Xue, and Q. Xin. Design and implementation of large scale object-based storage system. Technical Report ucsc-crl-02-35, University of California, Santa Cruz, Nov. 2002. [14] M. K. McKusick and G. R. Ganger. Soft updates: A technique for eliminating most synchronous writes in the Fast File System. In Proceedings of the Freenix Track: 1999 USENIX Annual Technical Conference, pages 1–18, June 1999. [15] M. K. McKusick, W. N. Joy, S. J. Leffler, and R. S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181–197, Aug. 1984. [16] L. W. McVoy and S. R. Kleiman. Extent-like performance from a UNIX file system. In Proceedings of the Winter 1991 USENIX Technical Conference, pages 33–44. USENIX, Jan. 1991. [17] E. L. Miller and R. H. Katz. RAMA: An easy-to-use, high-performance parallel file system. Parallel Computing, 23(4):419–446, 1997. [18] J. H. Morris, M. Satyanarayanan, M. H. Conner, J. H. Howard, D. S. H. Rosenthal, and F. D. Smith. Andrew: A distributed personal computing environment. Communications of the ACM, 29(3):184–201, Mar. 1986. [19] J. Mostek, B. Earl, S. Levine, S. Lord, R. Cattelan, K. McDonell, T. Kline, B. Gaffey, and R. Ananthanarayanan. Porting the SGI XFS file system to Linux. In Proceedings of the Freenix Track: 2000 USENIX Annual Technical Conference, pages 65–76, San Diego, CA, June 2000. USENIX. [20] D. Roselli, J. Lorch, and T. Anderson. A comparison of file system workloads. 
In Proceedings of the 2000 USENIX Annual Technical Conference, pages 41–54, June 2000. [21] M. Rosenblum and J. K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, Feb. 1992. [22] F. Schmuck and R. Haskin. GPFS: A shared-disk file system for large computing clusters. In Proceedings of the 2002 Conference on File and Storage Technologies (FAST), pages 231–244. USENIX, Jan. 2002. [23] M. Seltzer, K. A. Smith, H. Balakrishnan, J. Chang, S. McMains, and V. Padmanabhan. File system logging versus clustering: A performance comparison. In Proceedings of the Winter 1995 USENIX Technical Conference, pages 249–264, 1995. [24] K. A. Smith and M. I. Seltzer. A comparison of FFS disk allocation policies. In Proceedings of the 1996 USENIX Annual Technical Conference, pages 15–26, 1996. [25] K. A. Smith and M. I. Seltzer. File system aging—increasing the relevance of file system benchmarks. In Proceedings of the 1997 SIGMETRICS Conference on Measurement and Modeling of Computer Systems, pages 203–213, 1997. [26] S. R. Soltis, T. M. Ruwart, and M. T. O'Keefe. The Global File System. In Proceedings of the 5th NASA Goddard Conference on Mass Storage Systems and Technologies, pages 319–342, College Park, MD, 1996. [27] A. Sweeney, D. Doucette, W. Hu, C. Anderson, M. Nishimoto, and G. Peck. Scalability in the XFS file system. In Proceedings of the 1996 USENIX Annual Technical Conference, pages 1–14, Jan. 1996. [28] T. Y. Ts'o and S. Tweedie. Planned extensions to the Linux EXT2/EXT3 filesystem. In Proceedings of the Freenix Track: 2002 USENIX Annual Technical Conference, pages 235–244, Monterey, CA, June 2002. USENIX. [29] U. Vahalia, C. G. Gray, and D. Ting. Metadata logging in an NFS server. In Proceedings of the Winter 1995 USENIX Technical Conference, New Orleans, LA, Jan. 1995. USENIX. [30] R. O. Weber. Information technology—SCSI object-based storage device commands (OSD). Technical Council Proposal Document T10/1355-D, Technical Committee T10, Aug. 2002.

Duplicate Data Elimination in a SAN File System
Bo Hong, Univ. of California, Santa Cruz, hongbo@cs.ucsc.edu; Demyn Plantenberg, IBM Almaden Research Center, demyn@almaden.ibm.com; Darrell D.E. Long, Univ. of California, Santa Cruz, darrell@cs.ucsc.edu; Miriam Sivan-Zimet, IBM Almaden Research Center, mzimet@us.ibm.com

Abstract Duplicate Data Elimination (DDE) is our method for identifying and coalescing identical data blocks in Storage Tank, a SAN file system. On-line file systems pose a unique set of performance and implementation challenges for this feature. Existing techniques, which are used to improve both storage and network utilization, do not satisfy these constraints. Our design employs a combination of content hashing, copy-on-write, and lazy updates to achieve its functional and performance goals. DDE executes primarily as a background process. The design also builds on Storage Tank's FlashCopy function to ease implementation. We include an analysis of selected real-world data sets that is aimed at demonstrating the space-saving potential of coalescing duplicate data. Our results show that DDE can reduce storage consumption by up to 80% in some application environments. The analysis explores several additional features, such as the impact of varying file block size and the contribution of whole file duplication to the net savings.

1. Introduction Duplicate data can occupy a substantial portion of a storage system. Often the duplication is intentional: files are copied for safe keeping or for historical records. Just as often, the duplicate data appear through independent channels: individuals who save the same email attachments or who download the same files from the web. It seems intuitive that addressing all of this unrecognized redundancy could result in storage resources being used more efficiently. Our research goal is to reduce the amount of duplicated data in on-line file systems without significantly impacting system performance. This performance requirement is what differentiates our approach, which we call Duplicate Data Elimination (DDE), from those used in backup and archival storage systems [1, 6, 23]. To minimize its performance impact, DDE executes primarily as a background process that operates in a lazy, best-effort fashion whenever possible. Data is written to the file system as usual, and then some time later, background threads find duplicates and coalesce them to save storage. DDE is transparent to users. It is also flexible enough to be enabled and disabled on an existing file system without disrupting its operation, and flexible enough to be used on parts of the file system, such as select directories or particular file types. Duplicate Data Elimination (DDE) is designed for IBM Storage Tank [17], a heterogeneous, scalable SAN file system. In Storage Tank, file system clients coordinate their actions through meta-data servers, but access the storage devices directly without involving servers in the data path. DDE uses three key techniques to address its design goals: content-based hashing, copy-on-write (COW), and lazy update. DDE detects duplicate data on the logical block level by comparing hashes of the block contents; it guarantees consistency between block contents and their hashes by using copy-on-write. Data is coalesced by changing corresponding file block allocation maps. COW and lazy update allow us to update the file block allocation maps without revoking the file's data locks. Together these techniques minimize DDE's performance impact. Figure 1 shows an example of coalescing duplicate data blocks in an on-line file system. Before coalescing, files F1, F2, and F3 consume 11 blocks. However, they each contain a common piece of data that is three blocks in size. Clients are unaware of this duplication when they write these files. The server detects the common data later and coalesces the identical blocks. After coalescing, F1–F3 consume only five blocks in total by sharing the same three blocks. Six blocks are saved, resulting in a 55% storage reduction. Figure 1. An example of coalescing duplicate data blocks.

2. Background IBM Storage Tank is a multi-platform, scalable file system that works with storage area networks (SANs) [17]. In Storage Tank, data is stored on devices that can be directly accessed through a SAN, while meta-data is managed separately by one or more specialized Storage Tank meta-data servers. Storage Tank clients are designed to direct all metadata operations to Storage Tank servers and to direct all data operations to storage devices. Storage Tank servers are not involved in the data path. The current version of Storage Tank works with ordinary block-addressable storage devices such as disk drives and RAID systems. The basic I/O operation unit in Storage Tank is a block. The storage devices are required to have no more intelligence than the ability to read and write blocks from the volumes (LUNs) they present. Storage Tank file data is also managed in block units. The size of a file block is typically a multiple of the device block size. Storage Tank exposes three new abstractions called file sets, storage pools, and arenas. These are in addition to the traditional abstractions found in file systems such as files, directories, and volumes. A file set is a subtree of the global namespace. It groups a set of Storage Tank files and directories for the purpose of load balancing and management. A storage pool is a collection of one or more volumes. It provides a logical grouping of the volumes for the allocation of space to file sets. A file set can cross multiple storage pools. An arena provides the mapping between a file set and a storage pool. As such, there is one arena for each file set that has files in a particular storage pool. The arena abstraction is strictly internal to the Storage Tank server, but is an important element in duplicate data elimination. Using an arena, Storage Tank can track the used and free space owned by a file set in a storage pool, and specify the logical-to-physical mapping of space in the file set to the volumes in the storage pool. The Storage Tank Protocol provides a rich locking scheme that enables file sharing among Storage Tank clients or, when necessary, allows clients to have exclusive access to files [5, 7, 8]. A Storage Tank server grants locks to clients, and the lock granularity is per file. There are three file lock modes in Storage Tank: 1) exclusive (X), which allows a single client to cache both data and metadata, which it can read and modify; 2) Shared Read (SR), which allows clients to cache data and metadata for read operations; and 3) Shared Write (SW), in which mode clients cannot cache data but can cache metadata in read-only mode. The Storage Tank Protocol also provides copy-on-write capability to support file system snapshots. The server can mark blocks as read-only to enforce copy-on-write. Storage Tank technology is available today in the IBM TotalStorage SAN File System (SANFS). However, this paper and research is based on underlying Storage Tank technology and may not become part of the IBM TotalStorage SAN File System product.

3. Related Work Data duplication is ubiquitous. Different techniques have been proposed to identify commonality in data, and to exploit this knowledge for reducing storage and network resource consumption due to storing and transferring duplicate data. Our work was directly inspired by Venti [23]. Venti is a network storage system intended for archival data. In Venti, the unique SHA-1 [12] hash of a block acts as the block identifier, which is used in place of the block address for read and write operations. Venti also implements a write-once policy that prohibits data from being deleted once it is stored. This write-once policy becomes practical, in part, because Venti stores only one copy of each unique data block. In on-line file systems, performance is essential and data is dynamic and ready to change.
This is radically different from the requirements in archival and backup systems, where data is immutable and performance is less of a concern. In our design, duplicate data is also detected on the block level, but in the background. Data is still addressed as usual by the block where it is stored, so data accesses have no extra hash-to-block index searching overheads as in Venti. In turn, it determines that data duplication detection and coalescing is an after-effect effort, i.e. it is done after clients have written data blocks to storage devices. The server also maintains a mapping function between block hashes and blocks. A weaker variant of the write-once policy, copy-on-write (COW), is used to guarantee its consistency. Unreferenced blocks due to deletion and COW can be reclaimed. Single instance store (SIS) [3] also detects duplicate data in an after-effect fashion but on the file level. The technique is optimized for Microsoft Windows remote install servers [18] that store different installation images. In this scenario, the knowledge of file duplication is a priori and files are less likely to be modified. In general on-line file 302 systems, the granularity of file-level data duplication detection may be too coarse because any modification to a file can cause the loss of the benefit of storage reduction. LBFS [20] and Pastiche [9] detect data duplication at the granularity of variable-sized chunks, whose boundary regions, called anchors, are identified by using the techniques of shingling [16] and Rabin fingerprints [24]. This technique is suitable for backup systems and low-bandwidth network environments, where the reduction of storage and network transmission is more important than performance. Delta compression is another technique that can effectively reduce duplicate data, thus the requirements of storage and network bandwidth [1, 6, 19, 25]. When a base version of a data object exists subsequent versions can be represented by changes (deltas) to save both storage and network transmission. Delta compression, in general, requires some prior knowledge of data object versioning. It cannot explore common data across multiple files and a change of a (base) file may cause recalculating deltas for other files. In DERD (Delta-Encoding via Resemblance Detection) [11], similar files can be identified as pairs by using data similarity detection techniques [4, 16] without having any specific prior knowledge. Some file systems provide on-line compression capability [15, 21]. Although it can effectively improve storage efficiency, this technique has significant run-time compression and decompression overheads. On-line compression explores intra-file compressibility and cannot take advantage of common data across files. The techniques of naming and indexing data objects based on their content hashes are also found in several other systems. In the SFS read-only file system [13], blocks are identified by their SHA-1 hashes and the block hashes are hashed recursively to build up more complex structures. The Stanford digital library repository [10] uses the cyclic redundancy check (CRC) values of data objects as their unique handles. Content-derived names [2, 14] take a similar approach to address the issue of naming and managing reusable software components. called the fingerprint of a block. The server compares block fingerprints and coalesces blocks with the same fingerprint (and the same content) by changing corresponding file block allocation maps. 
The server guarantees consistency between block contents and their fingerprints by directing clients to perform copy-on-write. The server also maintains a reference count for each block and postpones the reclamation of unreferenced blocks. These techniques allow the server to update file block allocation maps without revoking any outstanding data locks on them. Figure 2 shows the basic idea of duplicate data block elimination in a live file system. The client holds an exclusive (X) lock on file A and a shared-read (SR) lock on file B. These lock modes are described in Section 2. Files A and B have the same data, stored in blocks 100 and 150, respectively. Before duplicate block coalescing, the server and the client share the same view of the files and their block allocation maps. The server finds the duplicate data and changes the block allocation map of file B to reference block 100 without updating the client. Even though the client has a stale view of file B, it can still read the correct data because block 150 is not reclaimed immediately. When the client modifies block 100 of file A, it writes the new content to another block and keeps the content of block 100 intact. Therefore, the content and the fingerprint of block 100 remain consistent and file B still references the right block.

4.1. Duplicate Data Detection

The most straightforward and trusted way to detect duplicate data is bytewise comparison. Unfortunately, it is expensive. An alternative is to digest data contents by hashing them to much shorter fingerprints and to detect duplicate data by comparing the fingerprints. As long as the probability of hash collisions is vanishingly small, we can be confident that two pieces of data are identical if their fingerprints are the same. In our design, duplicate data is detected at the block level, although maintaining a fingerprint for each block imposes a large amount of bookkeeping information on the system. Storage Tank is based on block-level storage devices, and blocks are the basic units of operation. Block-level detection avoids unnecessary I/Os required by other approaches based on files [3] or variable-sized chunks [9, 20], in which the client may have to read other disk blocks before it can recalculate fingerprints because of data boundary misalignments. Duplicate detection based on blocks has finer granularity and, therefore, a higher likelihood of storage reduction than techniques based on whole files [3]. Approaches based on variable-sized chunks [9, 20] require at least as much bookkeeping information as block-based approaches because they tend to limit the average chunk size to obtain a reasonable chance of detecting duplicate data. Chunk-based approaches also need a file block allocation map format totally different from the existing one.

4. Design of Duplicate Data Elimination

Our design goal is to transparently reduce duplicate data in Storage Tank as much as possible without penalizing system performance significantly. Instead of detecting duplication at the moment data is first written, we delay detection and identify and eliminate duplicate data when server loads are low. In this way, we minimize the performance impact of duplicate data elimination (DDE). In our design, we use three techniques: content-based hashing, copy-on-write (COW), and lazy update. Duplicate data blocks are detected by the Storage Tank server.
A client uses a collision-resistant hash function to digest the block contents it writes to storage devices and returns the hashes to the server. Such a unique hash is called the fingerprint of the block.

Figure 2. An example of coalescing duplicate data blocks in a live file system.

Integrating such a format into the existing Storage Tank implementation would make these approaches difficult to employ. DDE uses the SHA-1 hash function [12] to fingerprint block data contents. SHA-1 is a one-way secure hash function with a 160-bit output. Even in a large system that contains an exabyte of data (10^18 bytes) stored as 4-kilobyte blocks (roughly 3 × 10^14 blocks), the probability of a hash collision using the SHA-1 hash function is less than 10^-19, which is at least 5–6 orders of magnitude lower than the probability of an undetectable disk error. To date, there are no known collisions for this hash function. Therefore, we can be confident that two blocks are identical if their SHA-1 hashes are the same. In addition, the system could perform bytewise comparisons before coalescing blocks as a cross-check. In Storage Tank, data and meta-data management are separated, and Storage Tank servers are not involved in the data path during normal operations. Disks have little intelligence and cannot detect duplicate data by themselves. Even with smarter disks, without extensive inter-disk communication, each disk would know only its local data fingerprints, which would reduce the chances of detecting duplication. In our design, a client calculates fingerprints of the blocks it writes to storage devices and returns them to the server. Software implementations of SHA-1 are quite efficient, and hashing is not a performance bottleneck. Storage Tank servers have a global view of the whole system and are an appropriate place for duplicate detection.

Figure 3. Maintaining consistency between fingerprint and block content under (a) update-in-place and (b) copy-on-write.

Because blocks can be directly accessed and modified by clients, the consistency of the fingerprint and the data content of a block becomes a problem. In Storage Tank, a client can modify a block in two ways: update-in-place and copy-on-write.

4.2.1. Update-in-Place

In Storage Tank, a client can directly modify a block if the block is writable, i.e., it writes new data to the same block. This results in inconsistency between the server-side block fingerprint and the block content until the client returns the new fingerprint to the server, as shown in Figure 3(a). The fingerprint of block 100 that the server keeps is inconsistent with the block content until the client returns the latest fingerprint.
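The collision figure quoted above for SHA-1 can be sanity-checked with the standard birthday-paradox bound, p ≈ n(n−1)/2^161 for n blocks hashed to 160 bits. The short calculation below is only such a check, not part of the system described here.

    # Birthday-bound estimate of SHA-1 collision probability for an exabyte of 4 KB blocks.
    n = 10**18 // 4096            # roughly 2.4e14 blocks (the text rounds to ~3e14)
    bits = 160                    # SHA-1 output size
    p = n * (n - 1) / 2**(bits + 1)
    print(f"{n:.3e} blocks -> collision probability ~ {p:.1e}")
    # prints a value around 2e-20, consistent with the "less than 10^-19" bound above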
During the period of inconsistency, any duplicate detection involving this block gives false results. We can detect this potential inconsistency by checking the data locks on the file to which the block belongs. Because the granularity of file data locks in Storage Tank is per file, the server cannot trust any of the block fingerprints of a file that has an exclusive data lock on it. To avoid erroneous duplicate detection and coalescing, we could simply delay DDE on files with exclusive or shared-write locks. This is feasible in some workloads where only a small fraction of files are active concurrently. However, this approach cannot save any storage in some important environments, such as databases, where applications always hold locks on their files. Another approach is to revoke data locks on files to force clients to return the latest block fingerprints. This causes two technical problems: lock checking and lock revocation. To check file locks, every block has to maintain a reverse pointer to the file to which it belongs, which makes the bookkeeping information of a block even larger. To guarantee consistency between fingerprints and block contents, every block-coalescing operation has to revoke file locks, if necessary, which can severely penalize system performance. Therefore, eliminating duplicate data blocks under the update-in-place scenario is inefficient at best, or impossible. To reduce the extra allocation overhead of COW, the server could preallocate new blocks to a client that acquires an exclusive or shared-write lock. The most promising approach, however, is for clients to maintain a private storage pool on behalf of the server from which they can allocate locally. With such a pool, there is almost no extra cost for COW.

4.3. Lazy Lock Revocation and Free Space Reclamation

The server coalesces duplicate data blocks and reduces storage consumption by updating file block allocation maps to point to one copy of the data and reclaiming the rest. During file block allocation map updates, the server does not check whether any client holds data locks on the files. Therefore, the block allocation maps held by the server and clients can be inconsistent, as illustrated in Figure 2. However, we postpone the reclamation of the dereferenced blocks, so clients holding stale file block allocation maps can still read the correct data from those blocks. At some particular time, e.g., at midnight or when the file system is running low on free space, the server revokes all data locks held by clients and frees the dereferenced blocks.

4.2. Consistency of Data Content and Fingerprints

After we hash the data content of a block, the fingerprint becomes an attribute of the block. Because the fingerprint is eventually stored on the server while the block can be directly modified by clients, the two can become inconsistent.

4.2.2. Copy-on-Write

The basic idea of our work is to eliminate duplicate data blocks by comparing their fingerprints. By using a collision-resistant hash function with a sufficiently large output, such as SHA-1, the fingerprints can be considered distinct for different data. Therefore, the fingerprint can serve as a unique virtual address for the data content of a block. The mapping function from this virtual address to the physical block address is implicitly provided by the block address itself. Our aim is to make the mapping function nearly one-to-one, i.e.,
each virtual address is mapped to only one physical address. However, update-in-place violates this basic concept of content-addressed storage by making the mapping function inconsistent. Conceptually, if the content of a block is changed, the new content should be mapped to a new block instead of the original one. Consequently, a client should write modified data to new blocks, which implies a write-once policy, as in Venti [23]. However, write-once keeps all histories of data, which is unnecessary and expensive in on-line file systems. Therefore, we use a weaker variant of write-once, copy-on-write. This technique guarantees the consistency of the mapping function as long as the original blocks are not reclaimed, as shown in Figure 3(b). The fingerprint of block 100 that the server keeps remains consistent with the block content until block 100 is reclaimed. Copy-on-write clearly imposes an overhead on normal write operations: every block modification requires a free block allocation, because the modified content needs to be written to a new block. However, this cost is less significant than one might think. Some applications, such as Emacs and Microsoft Word, write the whole modified file to a new place, in which case there is no extra cost for COW.

5. Process of Duplicate Data Elimination

Duplicate data elimination is accomplished through coordination between clients and servers. Simply speaking, clients perform copy-on-write operations and calculate and return block SHA-1 hashes to servers. Servers log client activities and identify and coalesce duplicate data blocks in the background. Users are unaware of these operations.

5.1. Impact on Client Behavior

In addition to its normal behavior, a client calculates SHA-1 fingerprints for the data blocks it writes and returns the fingerprints to the server. Because copy-on-write is used (Section 4.2.2), the client does not write modified data blocks back to their original disk blocks; instead, modified data is written to newly allocated blocks. As long as the client holds a file data lock, further modifications to the same logical block are written to the same newly allocated disk block. On an update to the server, the client sends the latest block fingerprints along with the block logical offsets within the file and the original physical locations of the modified data blocks.

5.2. Data Structures on the Server

We now discuss the data structures on the server that facilitate duplicate data block detection and elimination. Essentially, a reference count table, a fingerprint table and its secondary index maintain the attributes associated with blocks, i.e., reference counts and fingerprints, as shown in Figure 4, while a dereference log and a new fingerprint log record recent client activities, as shown in Figure 5.

Figure 4. Data structures for storing and retrieving block attributes.
Figure 6. A block is in one of the following states: unallocated, allocated, referenced, and unreferenced.

Because an arena provides the logical-to-physical mapping of space in a file set to the LUNs in the storage pool, referencing a block by its physical location is equivalent to referencing it by its logical offset within an arena.
For convenience, we will reference an allocated physical block by its logical offset within the arena. To keep the per-arena data structures to a reasonable size, we use 32-bit integers to represent logical block offsets within an arena. Therefore, an arena can contain no more than 2^32 physical blocks. For 4 KB blocks, an arena can manage 16 TB of storage, which is large enough for most applications and environments. There is, however, no such limitation on the capacity of a file set, because a file set can cross multiple storage pools and can consist of multiple arenas.

Figure 5. Data structures for logging recent clients' activities.

The scope of duplicate data elimination is an important design decision. The larger the scope, the higher the degree of data duplication can be, and thus the greater the benefit. However, for various reasons, people may want to share data only within their working group or their department. Therefore, we limit duplicate data detection and elimination to a file set, which essentially is a logical subset of the global namespace. Even within a file set, files can be stored in different storage pools that may belong to different storage classes, which have different characteristics in access latency, reliability, and availability. Sharing data across storage classes can noticeably affect the quality of storage service. Therefore, we further narrow the scope of DDE to an arena, which provides the mapping between a file set and a storage pool. The data structures discussed below are per arena. Data are not equally important. Detecting and coalescing temporary, derived, or cached data is less beneficial. Because Storage Tank provides policy-based storage management, a system can easily be configured to store such data in less reliable and therefore cheaper storage pools, while storing important data in more reliable storage pools. DDE within an arena can take advantage of this flexibility.

5.2.1. Reference Count Table

With block coalescing and sharing, a physical block can be referenced multiple times by different files or even by one file. Therefore, a reference count is necessary for each block in the arena. From the viewpoint of DDE, a block can be in one of the following four states: 1) free – the block is unallocated; 2) allocated – the block is allocated but unused, or it contains valid data that has not been hashed; 3) referenced – the block contains valid data, which has been hashed, and is referenced at least once; and 4) unreferenced – the block is allocated and hashed, but has no file referencing it and can be freed or reused. Figure 6 illustrates the four states and the transitions among them. Without accessing the arena block allocation map, we cannot know whether a block is allocated or not. Fortunately, we are only interested in the validity of block fingerprints. Therefore, blocks in the first two states (free and allocated) can be merged into one state, invalid, because they contain no valid fingerprints. The reference count of a block indicates its state: 1) invalid, where the reference count is 0; 2) referenced, where the reference count is no less than 1; or 3) unreferenced, where the reference count is -1.
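The reference-count encoding of these states can be made concrete with a small sketch. The class below is a hypothetical illustration of the bookkeeping implied by Figure 6, not actual server code.

    # -1 = unreferenced, 0 = invalid (no valid fingerprint), >= 1 = referenced.
    UNREFERENCED, INVALID = -1, 0

    class RefCountTable:
        def __init__(self, num_blocks):
            self.count = [INVALID] * num_blocks      # linear array indexed by arena offset

        def add_reference(self, block):
            # Covers both "add reference" and "reuse" transitions in Figure 6.
            self.count[block] = 1 if self.count[block] < 1 else self.count[block] + 1

        def remove_reference(self, block):
            if self.count[block] > 1:
                self.count[block] -= 1               # still referenced by other files
            elif self.count[block] == 1:
                self.count[block] = UNREFERENCED     # eligible for lazy reclamation

        def reclaim(self, block):
            if self.count[block] == UNREFERENCED:
                self.count[block] = INVALID          # block and its fingerprint become invalid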
The initial state of a block is invalid because it contains no valid fingerprint until a client calculates one. A reference count table keeps the reference count for each block in the arena. The table is organized as a linear array indexed by the 32-bit logical block offset within the arena. Each entry in the table is also a 32-bit integer, indicating the state of the corresponding block. The size of the table is up to 2^32 × 4 bytes = 16 GB. Because block reference counts are crucial to data integrity, any update to them should be transactional. A block may contain valid data but have no fingerprint associated with it. Note that the fingerprint of a block is calculated when the block is written. If a block was written before the server turned on this feature, it has no fingerprint on the server. A utility running on the server could ask clients to read those data blocks and calculate their fingerprints on behalf of the server.

5.2.4. Dereference Log

The dereference log records the in-arena logical offsets of blocks that were recently deleted, dereferenced due to COW by clients, or dereferenced due to block coalescing by the server. We discuss the third case in greater detail in Section 5.3.3. This log is also called the semi-free list because the blocks in it can be freed once there is no longer any reference to them. Each entry in this log is a 32-bit integer. To avoid storage leakage, any update to this log should be transactional.

5.2.2. Fingerprint Table

The fingerprint table keeps the unique fingerprints of the blocks in an arena. Each fingerprint in this table is associated with a unique physical block. In other words, the table maintains a one-to-one mapping between fingerprints and physical blocks. We detect and coalesce duplicate data blocks when we merge the new fingerprint log into this table. The fingerprint table is also organized as a linear array indexed by the 32-bit logical block offset. Each entry contains a 160-bit SHA-1 fingerprint. A fingerprint is valid only if its block reference count is no less than 1. The size of the table is up to 2^32 × 20 bytes = 80 GB, and it cannot fit in memory. Fortunately, disk block accesses have sequential patterns due to sequential block allocations and file accesses. Therefore, we organize the fingerprint table linearly to facilitate comparisons under sequential block accesses. If two disk blocks contain the same content, it is likely that their consecutive blocks also have identical contents. The linear structure makes consecutive fingerprint comparisons efficient because all related entries are in memory. Conceptually, both the reference count table and the fingerprint table describe block attributes and could be merged into one table.
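A toy version of the fingerprint table together with the secondary index described below (a static hash on the first 24 bits of the SHA-1 value, with bucket entries holding the in-arena offset and the next 32 hash bits) might look as follows. This is an assumed in-memory layout for illustration only, not the on-disk format.

    class FingerprintIndex:
        def __init__(self, num_blocks):
            self.fingerprints = [None] * num_blocks   # arena offset -> 20-byte SHA-1 value
            self.buckets = {}                         # 24-bit prefix -> [(offset, next 32 bits)]

        def insert(self, offset, fp: bytes):
            self.fingerprints[offset] = fp
            prefix = int.from_bytes(fp[:3], "big")    # first 24 bits select the bucket
            partial = int.from_bytes(fp[3:7], "big")  # next 32 bits stored in the bucket entry
            self.buckets.setdefault(prefix, []).append((offset, partial))

        def lookup(self, fp: bytes):
            """Return the arena offset of a block with this fingerprint, or None."""
            prefix = int.from_bytes(fp[:3], "big")
            partial = int.from_bytes(fp[3:7], "big")
            for offset, stored_partial in self.buckets.get(prefix, []):
                if stored_partial == partial and self.fingerprints[offset] == fp:
                    return offset                     # confirmed against the full linear table
            return None

A real lookup would additionally consult the reference count table, since a fingerprint is only valid while its block is referenced.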
Because their sizes are potentially large, and the block reference count is accessed more frequently than the fingerprint, we store them separately to optimize memory usage.

5.2.3. Secondary Index to the Fingerprint Table

Although the linear fingerprint table favors sequential searching, it is difficult to look up a particular fingerprint in it. Therefore, we also index the table by partial bits of the SHA-1 fingerprint to facilitate random searching. A static hash index is used for this purpose. The hash buckets are indexed by the first 24 bits of the SHA-1 fingerprint. Each bucket contains a 32-bit block pointer. Therefore, the size of the first-level index is 2^24 × 4 bytes = 64 MB, which fits comfortably in memory. Each entry in a bucket block contains a 32-bit in-arena logical block offset, indicating the block that the fingerprint is associated with, and the next 32 bits of the SHA-1 fingerprint. Because an arena contains no more than 2^32 blocks, the average number of hash entries in a bucket is 2^32 / 2^24 = 2^8 = 256. When the bucket block size is 4 KB, the average block utilization is 256 × (4 + 4) / 4096 = 50%. For an arena with capacity much less than 16 TB, multiple buckets can share one bucket block for better storage and memory utilization.

5.2.5. New Fingerprint Log

The new fingerprint log records clients' recent write activities. Each entry in this log includes a 64-bit file ID, a 64-bit logical block offset within the file, a 32-bit logical offset within the arena, and a 160-bit SHA-1 fingerprint. Appending new entries to this log can be non-transactional for performance, because losing the most recent entries only loses some opportunities for storage reduction.

5.3. The Responsibilities of the Server

The server enforces a client's copy-on-write behavior by marking the copy of the file block allocation map in the message buffer as read-only when it responds to a file data lock request from the client. The server immediately logs clients' recent activities, such as delete and write operations. In our design, DDE runs in a best-effort fashion. Therefore, the server lazily detects and coalesces duplicate data blocks and reclaims unused blocks. It also maintains block reference counts and fingerprints accordingly.

5.3.1. Logging Recent Activities

When the server receives a dirty file block allocation map from a client, it compares it with the one on the server. If a block is marked as unused due to copy-on-write, the server first checks whether it is still referenced by the server-side file block allocation maps. If it is referenced, the in-arena logical offset of the unused block is appended to the dereference log; if not (which is possible due to duplicate block coalescing without lock revocation), the block currently referenced by the server-side block allocation map is logged instead, because it has been dereferenced by the recent modification of the corresponding file logical block. The unused block returned by the client is not logged in that case, because it was already dereferenced by block coalescing. Other unused blocks are logged. For each recently written block, the server also appends an entry to the new fingerprint log, including the identifier of the file to which it belongs, its logical block offset within the file, its logical block offset within the arena, and its fingerprint.

Figure 7. Log epochs.

5.3.3. Merging into the Fingerprint Table

We detect and coalesce duplicate data blocks when we merge the compacted new fingerprint log into the fingerprint table. Figure 8 shows the process of duplication detection and coalescing. For each entry in the log, we first check whether it has a matching fingerprint in the fingerprint table. If not, we insert the new fingerprint into the table and update the secondary index.
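The remaining cases of this merge are walked through next and in Figure 8. As a compact, hypothetical summary of the decision logic, using plain dictionaries in place of the on-disk tables:

    def merge_entry(files, refcount, fp_table, entry):
        """Condensed sketch of one merge step (cf. Figure 8). Assumed in-memory stand-ins:
        files:    {file_id: {file_offset: arena_offset}}
        refcount: {arena_offset: int}  (-1 unreferenced, 0 invalid, >= 1 referenced)
        fp_table: {fingerprint: arena_offset}  (one primary copy per unique fingerprint)
        entry:    (file_id, file_offset, arena_offset, fingerprint) from the compacted log
        """
        file_id, file_off, arena_off, fp = entry
        primary = fp_table.get(fp)
        if primary is None or refcount.get(primary, 0) < 1:
            # No valid primary for this content yet: this block becomes the primary.
            fp_table[fp] = arena_off
            refcount[arena_off] = 1
            return
        if files[file_id].get(file_off) != arena_off:
            return                                   # block was modified again; stale entry
        # Coalesce without revoking locks: repoint the file and adjust the counts.
        files[file_id][file_off] = primary
        refcount[primary] = refcount.get(primary, 0) + 1
        refcount[arena_off] = -1                     # dereferenced, reclaimed lazily later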
If there is an identical fingerprint in the fingerprint table, we check the validity of that fingerprint. If the block reference count of the fingerprint is less than 1, the primary block in the fingerprint table contains no valid data. Therefore, we insert the new fingerprint into the table and update the secondary index. We also need to delete the false entry in the secondary index, because the previous matching index entry leads to a block containing invalid data. If the block reference count is no less than 1, we fetch the block allocation map of the file to which the recently written block belongs and check whether this block is still referenced by this file. If not, the corresponding logical block in this file was modified after the current fingerprint was returned, and we simply discard this coalescing operation. If the block is still referenced, we update the file block allocation map to reference the primary block in the fingerprint table, without checking or revoking data locks on this file. We also increase the reference count of the primary block and set the reference count of the coalesced block to -1. When a block is inserted into the fingerprint table, either by adding a new entry or by coalescing with another block, it is marked read-only in the block allocation map of the file to which it belongs.

5.3.2. Log Epoch and Preprocessing

We periodically checkpoint the dereference log and the new fingerprint log for two reasons. First, the duplicate detection and elimination processes run in a best-effort, background fashion and are unlikely to keep up with the most recent client activities. Logging and checkpointing these activities allow the server to detect and coalesce duplicate data blocks during its idle time. By checkpointing the log into epochs, we also limit the number of activities the server processes at a time. Second, recently written blocks are likely to be modified again, so trying to coalesce these blocks is less beneficial. Therefore, we try to coalesce only blocks in a stable epoch, as shown in Figure 7, whose lifetimes have been long enough. Assume that we want to merge the new fingerprint log in epoch (t0, t1) into the fingerprint table. Because random accesses to the fingerprint table are expensive, we try to reduce the number of fingerprint comparisons by deleting unnecessary entries from the new fingerprint log in epoch (t0, t1). First, we find overwritten file logical blocks by sorting the log by file ID and logical block offset within a file. We delete the older entries from the log and set their block reference counts to -1 (unreferenced). We set the block reference counts of the other entries in the log to 1. Second, we scan the dereference log in epoch (t0, t1). For each entry in that log, we decrease its block reference count by 1; if the count becomes less than 1, we set it to -1. Third, we compact the new fingerprint log (t0, t1) by deleting those entries that also appear in the dereference logs (t0, t1) and (t1, t2). The matched entries in the dereference log (t1, t2) are removed and their block reference counts are set to -1. Note that a fingerprint in the fingerprint table becomes invalid when its block reference count reaches -1. Conceptually, we should then also remove its index entry in the secondary index. However, we do not update the secondary index during log preprocessing, for performance reasons; we postpone the removal of such false index entries until the duplicate block coalescing process notices them. False index entries are also removed when a secondary index bucket block becomes full.

5.3.4. Free Space Reclamation

A free space reclamation process scans the reference count table in the background. It logs the addresses of the blocks with reference count -1 and sets their reference counts to 0. At some particular time, e.g.
midnight, or when the file system is running low on free space, it revokes all data locks and frees these blocks.

Figure 8. Merge the new fingerprint log into the fingerprint table.

6. Case Studies

We examined six data sets and studied their degrees of data duplication and their compressibility under common compression techniques. The data sets are summarized in Table 1.

Table 1. Data sets
Name                     Description                                                     Size (GB)   Number of files
SNJGSA                   file server used by a development team                          57          661,729
BVRGSA BUILD             file server used by the development team for code builds        344         2,393,795
BVRGSA TEST              file server used by the development team for testing            215         115,141
GENOME                   human genome data                                               348         889,884
LTC MIRROR               local mirror of installation CDs for different Linux versions   261         241,724
PERSONAL WORKSTATIONS    aggregation of ten personal workstations                        123         879,657

The first three data sets—SNJGSA, BVRGSA BUILD, and BVRGSA TEST—are from file servers used by the Storage Tank development team. The servers are used to distribute and exchange data files, but are not used to hold or archive the primary copy of important files. The first server, SNJGSA, is a remote replica that holds a subset of the daily builds that are stored on BVRGSA BUILD, and a subset of the test data that is on BVRGSA TEST. In general, the oldest files are deleted when the servers run low on space. The files are almost never overwritten; they tend to be created and then not modified until they are purged a few months later. GENOME contains the human genome sequence and is used at UCSC by various bioinformatics applications. The genomic data is encoded as letters; some single letters and some letter combinations can repeat thousands to millions of times, but at fine granularities. LTC MIRROR is a local ftp mirror of the IBM Linux Technology Center (LTC) that includes Open Source and IBM software for internal download and use. Among other things, the ftp site holds the CD images (ISO) of different Red Hat Linux installations, from RH7.1 up to RH9. The last data set is an aggregation of ten personal workstations at the IBM Almaden Research Center that are running Windows. All systems are used for development as well as for general purposes such as email and working documents. We also present the results for these systems when they are analyzed separately. For each data set, we collected the size and the number of files in the system. We calculated the amount of storage that is required after eliminating duplicate data at the granularity of 1 KB blocks. To compare DDE with common compression techniques, we collected the compressed file sizes, using LZO on 64 KB data blocks.
LZO (Lempel-Ziv-Oberhumer) is a data compression library that favors speed over compression ratio [22]. We empirically found that the compression capability of LZO is similar to that of other techniques when the block size is large. In addition, we calculated the storage reduction achieved by combining DDE and LZO. We also calculated what percentage of the storage is still required after eliminating only whole-file duplications.

Table 2. DDE and compression results (% of storage required).
Name                     After DDE        After eliminating         After LZO on    After combining DDE and
                         (1 KB blocks)    whole-file duplications   64 KB blocks    LZO on 64 KB blocks
SNJGSA                   32%              55%                       56%             30%
BVRGSA BUILD             21%              62%                       67%             35%
BVRGSA TEST              69%              85%                       53%             47%
GENOME                   96%              98%                       46%             44%
LTC MIRROR               80%              94%                       98%             89%
PERSONAL WORKSTATIONS    54%              69%                       63%             43%

Table 3. DDE and compression results for personal workstations (% of storage required).
System    After DDE        After eliminating         After LZO on    After combining DDE and
          (1 KB blocks)    whole-file duplications   64 KB blocks    LZO on 64 KB blocks
1         66%              71%                       61%             43%
2         61%              78%                       57%             43%
3         63%              77%                       55%             41%
4         77%              91%                       63%             55%
5         67%              92%                       62%             53%
6         69%              77%                       58%             48%
7         70%              80%                       61%             49%
8         71%              78%                       71%             55%
9         87%              91%                       80%             74%
10        73%              84%                       57%             48%

Table 2 shows the percentage of storage required for the different data sets after using DDE at the granularities of 1 KB blocks and whole files, LZO on 64 KB blocks, and the combination of DDE and LZO. DDE at 1 KB blocks requires only one-fifth to one-third of the original storage to hold the BVRGSA BUILD and SNJGSA data sets, and it achieves one to two times higher storage efficiency (the ratio of the amount of logical data to the amount of required physical storage) than LZO at 64 KB blocks. This is because both data sets contain daily builds of the Storage Tank code that share a great deal of code among versions; the data sets also consist of very small files on average, which makes LZO inefficient. BVRGSA TEST, on the other hand, has larger files on average (1.9 MB), and the best reduction is achieved by combining LZO compression and DDE on 64 KB blocks. We study the SNJGSA data set in greater detail in Section 6.1. The genomic data set is encoded as letters. Some single letters and letter combinations can repeat thousands to millions of times, but at quite fine granularities. With this type of data set, duplication is unlikely to be found at kilobyte granularities, and DDE cannot improve storage efficiency significantly. On the other hand, common compression techniques, such as LZO, suit this data set well. The 4% reduction by DDE is partially due to common file headers and a common "special letter pattern" that fills gaps in the genomic sequence, and partially due to duplicate files created by analysis applications. Using LZO on the LTC MIRROR data set barely reduces storage consumption, because the files in the Linux installation CD images have generally already been packaged and compressed. DDE can take advantage of cross-file duplication, mainly across different installation versions, and reduce storage consumption by 20%. A more interesting data set would be actual installations of different Linux versions, instead of the installation images. We plan to look into that in the near future.
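The block-level measurement used in these case studies is easy to approximate on one's own data: hash every fixed-size block in a file tree and count how many distinct blocks remain. The script below is such an approximation (it counts a trailing partial block as a full block), not the authors' tooling.

    import hashlib, os, sys

    def dedup_ratio(root, block_size=1024):
        """Percentage of storage still required after coalescing duplicate
        fixed-size blocks under `root` (an approximation of the methodology above)."""
        total = unique = 0
        seen = set()
        for dirpath, _, names in os.walk(root):
            for name in names:
                path = os.path.join(dirpath, name)
                try:
                    with open(path, "rb") as f:
                        while True:
                            block = f.read(block_size)
                            if not block:
                                break
                            total += 1
                            digest = hashlib.sha1(block).digest()
                            if digest not in seen:
                                seen.add(digest)
                                unique += 1
                except OSError:
                    continue                      # skip unreadable files
        return 100.0 * unique / total if total else 100.0

    if __name__ == "__main__":
        print(f"{dedup_ratio(sys.argv[1]):.1f}% of storage required after DDE")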
We also studied the storage requirements of ten personal workstations after applying the different techniques to the individual systems, as shown in Table 3. On average, 70% of storage is required for individual systems after applying DDE, and more than half of that saving is due to eliminating whole-file duplications. LZO compression can provide better storage efficiency on these systems. By combining both LZO and DDE on 64 KB blocks, the percentage of storage required for individual systems is further reduced to 51% on average. When aggregating files from all machines together, which is potentially what happens with enterprise-level backup applications, the aggregated storage requirement of the ten personal workstations under DDE drops significantly, from 70% to 54%, shown as PERSONAL WORKSTATIONS in Table 2. This is because the more individual systems are aggregated, the higher the degree of data duplication is likely to be. Consequently, DDE can result in tremendous storage savings for backup systems that potentially back up hundreds to thousands of personal workstations on a daily basis; this is a very good example of an application that could benefit from duplicate data elimination in the file system. Combining DDE and LZO on 64 KB blocks can further reduce the storage requirement to 43%. The granularity of duplicate data detection affects the effectiveness of DDE. Table 2 shows that file-level detection can lose up to 50–70% of the chance of finding duplicate data at 1 KB blocks. We also noticed that using both DDE and LZO on 64 KB blocks does not always produce better results than using DDE alone. In general, smaller block sizes favor DDE because of the finer granularity of duplication elimination; larger block sizes favor LZO because of more efficient encoding. In fact, DDE and LZO-like compression techniques are orthogonal when they operate at the same data unit granularity, because compression exploits intra-unit compressibility while DDE exploits inter-unit duplication.

Figure 9. SNJGSA, percentage of storage required after removing duplicate data.

DDE performs most efficiently on highly time-correlated duplication, because duplicate blocks then tend to be coalesced within a log epoch and require fewer updates to the fingerprint table. At the same time, uncorrelated duplication allows DDE to produce a continually improving compression ratio as the data set grows. The SNJGSA data set consists of 15 million 4 KB blocks, among which 3.3 million are referenced only once. Among the remaining 11.7 million blocks, only 2.5 million are unique. Figure 11 shows the cumulative distribution of the storage savings made by frequently occurring data blocks. It reveals that only 1% of the unique blocks contributes 29% of the total storage savings. This suggests that even a small hash cache on the client side could exploit this frequency of duplication and save actual I/O at write time. Although we have not had the opportunity to study the recency of data duplication, we believe that with careful design a client-side hash cache could also take advantage of recency to improve system performance. Furthermore, the spatial and temporal locality of duplicate data generation is a dominant factor in DDE's performance on the server side. SNJGSA and BVRGSA BUILD demonstrate applications that can substantially benefit from DDE. The engineers using these file servers exploit storage space to make their jobs easier.
One way they do this is by creating hundreds of "views" of their data in isolated directory trees, each with a dedicated purpose. However, these pragmatic, technology-savvy users do not use a snapshot facility to create these views, or a configuration management package, or even symbolic links. The engineers employ the most robust and familiar mechanisms available to them: copying the data, applying small alterations, and leaving the results to linger indefinitely. In this environment, the file system is the data management application, and DDE extends its reach.

6.1. Detailed Study on SNJGSA

Figure 9 shows the results of applying DDE to the SNJGSA data set using a range of file system block sizes. The results are given as the percentage of the data blocks that are unique. The block size varies from 512 bytes to 64 KB. As expected, the smallest block size works best because it enables the detection of shorter segments of duplicate data. Interestingly, DDE's effectiveness starts to improve again when the block size reaches 32 KB. The reason is that the file system is "wasting" space with these larger blocks, and DDE is proportionally coalescing more of this wasted space. This effect begins to outpace the reduced number of duplicate blocks caused by the coarser block granularity. Note that Storage Tank does not support block sizes smaller than 4 KB; these results are included to show the savings we are missing out on. The additional space saving potential of smaller blocks is modest: 5%, for instance, between 1 KB and 4 KB blocks in the SNJGSA data set. SNJGSA's usage pattern and its "FIFO"-style space management allow us to use this data to simulate a growing file system. Figure 10 shows the amount of space saved by DDE as the file system, which starts empty, grows to eventually contain all the files in the SNJGSA data set. The files are added in order of their modification times (mtime), and 4 KB blocks are used. The rightmost value on the chart is 38%, which matches the space savings shown in Figure 9. If Figure 10 had shown a gradual improvement starting at 100% (no compression) and monotonically decreasing to 38%, we could assert that duplication in this data set is independent of time. This is the type of curve that would be generated if the files were inserted in random order rather than sorted by mtime. If the chart had shown a flat curve, this would indicate that data duplication is highly correlated in time, i.e., has strong temporal locality. The duplicate blocks found in a new file would likely match blocks that were added, for instance, within the last 24 hours, but would be very unlikely to match blocks that were added a month ago.

7. Discussions

Duplicate data copies are often made for the purposes of reliability, preservation, and performance. Typically, such duplicate copies are, and should be, stored in different storage pools. The key techniques used in DDE are content-based hashing, copy-on-write, and lazy updates. With the necessary and appropriate support, these techniques, and thus DDE, could also be applied to file systems other than Storage Tank. In particular, lazy updates minimize the performance impact of identifying duplicate data and maintaining block metadata, which will also benefit other file systems.
Figure 10. SNJGSA, percentage of unique blocks in a simulated growing file system.

8. Future Work

We are working on implementing duplicate data elimination in Storage Tank. Beyond the implementation, there are several research directions we can explore in the future. The technique of copy-on-write plays a key role in our design to guarantee consistency between fingerprints and block contents. However, it forces a client to request new block allocations for each file modification and has a noticeable overhead on normal write operations. To alleviate the extra allocation cost of COW, the server could preallocate new blocks to a client that acquires an exclusive or shared-write lock; the preallocation policy for COW is therefore an interesting research topic. Another promising approach to alleviating this overhead is to allow a client to maintain a small private storage pool on behalf of the server, in which case there is almost no extra cost for COW. In our current design, we employ quite a naive coalescing policy: when we find a recently written block containing the same data as a block in the fingerprint table, we simply dereference the new block and change the corresponding file block pointer to the primary block in the fingerprint table. This policy is suboptimal in terms of efficiency. Although we have considered file and block access patterns in our design, we do not, for simplicity, explicitly elaborate policies that favor sequential fingerprint probing and matching under certain block access patterns. Further research on such policies is needed. The naive coalescing policy we describe in this paper may also result in file fragmentation. Coalescing a few blocks within a large file is less desirable, so a study of policies for minimizing file fragmentation would be interesting. Furthermore, a good coalescing policy could reduce storage fragmentation by reusing the unused blocks that linger because of lazy free space reclamation. The fingerprint of a block is a short version of its data. A client can easily keep a history of its recent write activity by maintaining a fingerprint cache, and it can then do part of the duplicate data elimination work in conjunction with the server. More beneficially, actual write operations to storage can be avoided when there are cache hits. As far as we know, there are no extensive and intensive studies of duplicate data distributions at the block level or other levels. A better understanding of data duplication in file systems would be enormously beneficial for devising good duplicate data coalescing policies.

Figure 11. SNJGSA, cumulative contribution to the total storage savings made by coalescing frequently occurring data blocks. Blocks are sorted by their reference frequencies.

Since the technique of duplicate data elimination is scoped to an arena, data in different storage pools will not be coalesced, and their reliability, preservation, and access performance will not be affected. If duplicate copies are made to guard against inadvertent deletion or corruption of file data, DDE allows those copies to be made without consuming additional storage space while still protecting against deletion or corruption. However, DDE does create an exposure to potential multiple-file corruption due to a single block error. DDE coalesces duplicate data blocks in files.
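One possible shape of the client-side fingerprint cache suggested above is a small LRU map from recent block fingerprints to block addresses. The sketch below is entirely hypothetical and only illustrates how a cache hit could allow a duplicate write to be elided.

    from collections import OrderedDict
    import hashlib

    class FingerprintCache:
        """Small LRU cache of recently written block fingerprints (a sketch of the
        client-side cache suggested above, not part of the described system)."""
        def __init__(self, capacity=4096):
            self.capacity = capacity
            self.entries = OrderedDict()          # fingerprint -> block address

        def lookup_or_add(self, data, new_block_addr):
            """Return the address of an identical, recently written block, or None
            after remembering the new block."""
            fp = hashlib.sha1(data).digest()
            if fp in self.entries:
                self.entries.move_to_end(fp)
                return self.entries[fp]           # cache hit: the write could be elided
            self.entries[fp] = new_block_addr
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False)  # evict the least recently used entry
            return None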
Coalescing can result in file fragmentation and thus degrade system performance due to non-contiguous reads. This can be alleviated by only coalescing duplicate data with sizes of at least N contiguous blocks. In Section 6, we also found that the majority of duplicate blocks come from whole files, in which case reading those files incurs no additional seek overhead. It is probable that DDE can reduce system write activity if clients can detect duplicate data before writing it to storage devices, which is discussed further in Section 8. DDE can also potentially improve storage subsystem cache utilization, because only unique data blocks will be cached. The degree of data duplication can vary dramatically across systems and application environments. DDE is a technique suitable for environments with expected high degrees of data duplication, such as backup of multiple personal workstations, and it is not necessarily intended for general use. In our design, we add two attributes to physical disk blocks: the reference count and the SHA-1 fingerprint of the block content. We also provide appropriate data structures to store and retrieve these attributes. Our work makes it feasible to check data integrity in Storage Tank: a client can ensure that the data it reads is the data it wrote by comparing the fingerprint on the server with one calculated from the data it has just read.

9. Conclusions

Although disk prices continue to drop dramatically, storage is still a precious resource in computer systems. For some data sets, reducing the storage consumed by duplicate data can significantly improve storage usage efficiency. By using the techniques of content-based hashing, copy-on-write, lazy lock revocation, and lazy free space reclamation, we can detect and coalesce duplicate data blocks in on-line file systems without a significant impact on system performance. Our case studies show that 20–79% of storage can be saved by duplicate data elimination at 1 KB blocks in some application environments. File-level duplication detection is sensitive to changes in file contents and can lose up to 50–70% of the chance of finding duplicate data at finer granularities.

Acknowledgments

We are grateful to Robert Rees, Wayne Hineman, and David Pease for their discussions and insights on Storage Tank. We thank Scott Brandt, Ethan Miller, Feng Wang, and Lan Xue of the University of California at Santa Cruz for their helpful comments on our work. We thank Vijay Sundaram of the University of Massachusetts at Amherst, in particular, for his invaluable discussions. We also thank Terrence Furey and Patrick Gavin of the bioinformatics department of UCSC for providing access to the GENOME data set and helping us to understand the results. Our shepherd, Curtis Anderson, provided useful feedback and comments on our first draft.

References

[1] M. Ajtai, R. Burns, R. Fagin, D. D. E. Long, and L. Stockmeyer. Compactly encoding unstructured inputs with differential compression. Journal of the Association for Computing Machinery, 49(3):318–367, May 2002.
[2] K. Akala, E. Miller, and J. Hollingsworth. Using content-derived names for package management in Tcl. In Proceedings of the 6th Annual Tcl/Tk Conference, pages 171–179, San Diego, CA, Sept. 1998. USENIX.
[3] W. J. Bolosky, S. Corbin, D. Goebel, and J. R. Douceur. Single instance storage in Windows 2000. In Proceedings of the 4th USENIX Windows Systems Symposium, pages 13–24. USENIX, Aug. 2000.
[4] A. Z. Broder.
On the resemblance and containment of documents. In Proceedings of Compression and Complexity of Sequences (SEQUENCES '97), pages 21–29. IEEE Computer Society, 1998.
[5] R. Burns. Data Management in a Distributed File System for Storage Area Networks. Ph.D. dissertation, Department of Computer Science, University of California, Santa Cruz, Mar. 2000.
[6] R. Burns and D. D. E. Long. Efficient distributed back-up with delta compression. In Proceedings of I/O in Parallel and Distributed Systems (IOPADS '97), pages 27–36, San Jose, Nov. 1997. ACM.
[7] R. Burns, R. Rees, and D. D. E. Long. Safe caching in a distributed file system for network attached storage. In Proceedings of the 14th International Parallel & Distributed Processing Symposium (IPDPS 2000). IEEE, May 2000.
[8] R. Burns, R. Rees, L. J. Stockmeyer, and D. D. E. Long. Scalable session locking for a distributed file system. Cluster Computing Journal, 4(4), 2001.
[9] L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche: Making backup cheap and easy. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI), pages 285–298, Boston, MA, Dec. 2002.
[10] A. Crespo and H. Garcia-Molina. Archival storage for digital libraries. In Proceedings of the Third ACM International Conference on Digital Libraries (DL '98), pages 69–78, Pittsburgh, Pennsylvania, June 1998. ACM.
[11] F. Douglis and A. Iyengar. Application-specific delta-encoding via resemblance detection. In Proceedings of the 2003 USENIX Annual Technical Conference, pages 113–126. USENIX, June 2003.
[12] Secure hash standard. FIPS 180-1, National Institute of Standards and Technology, Apr. 1995.
[13] K. Fu, M. F. Kaashoek, and D. Mazières. Fast and secure distributed read-only file system. In Proceedings of the 4th Symposium on Operating Systems Design and Implementation (OSDI), pages 181–196, San Diego, CA, Oct. 2000.
[14] J. Hollingsworth and E. Miller. Using content-derived names for configuration management. In Proceedings of the 1997 Symposium on Software Reusability (SSR '97), pages 104–109, Boston, MA, May 1997. IEEE.
[15] S. Kan. ShaoLin CogoFS – high-performance and reliable stackable compression file system. http://www.shaolinmicro.com/product/cogofs/, Nov. 2002.
[16] U. Manber. Finding similar files in a large file system. Technical Report TR93-33, Department of Computer Science, The University of Arizona, Tucson, Arizona, Oct. 1993.
[17] J. Menon, D. A. Pease, R. Rees, L. Duyanovich, and B. Hillsberg. IBM Storage Tank—a heterogeneous scalable SAN file system. IBM Systems Journal, 42(2):250–267, 2003.
[18] Microsoft Windows 2000 Server online help file. Microsoft Corporation, Feb. 2000.
[19] J. C. Mogul, F. Douglis, A. Feldmann, and B. Krishnamurthy. Potential benefits of delta encoding and data compression for HTTP. In Proceedings of SIGCOMM '97, pages 181–194, 1997.
[20] A. Muthitacharoen, B. Chen, and D. Mazières. A low-bandwidth network file system. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP '01), pages 174–187, Oct. 2001.
[21] R. Nagar. Windows NT File System Internals: A Developer's Guide. O'Reilly and Associates, 1997.
[22] M. F. Oberhumer. oberhumer.com: LZO data compression library. http://www.oberhumer.com/opensource/lzo/, July 2002.
[23] S. Quinlan and S. Dorward. Venti: A new approach to archival storage. In D. D. E. Long, editor, Proceedings of the 2002 Conference on File and Storage Technologies (FAST), pages 89–101, Monterey, California, USA, 2002. USENIX.
[24] M. O. Rabin. Fingerprinting by random polynomials.
Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
[25] N. T. Spring and D. Wetherall. A protocol-independent technique for eliminating redundant network traffic. In Proceedings of the Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication (SIGCOMM '00), pages 87–95, Stockholm, Sweden, Aug. 2000. ACM Press.

Clotho: Transparent Data Versioning at the Block I/O Level

Michail D. Flouris
Department of Computer Science, University of Toronto, 10 King's College Road, Toronto, Ontario M5S 3G4, Canada
Tel: +1-416-978-6610, Fax: +1-416-978-1931, e-mail: flouris@cs.toronto.edu

Angelos Bilas1
Institute of Computer Science, Foundation for Research and Technology - Hellas, Vassilika Vouton, P.O. Box 1385, GR 711 10 Heraklion, Greece
Tel: +30-281-039-1600, Fax: +30-281-039-1601, e-mail: bilas@ics.forth.gr
1 Also with the Department of Computer Science, University of Crete, P.O. Box 2208, Heraklion, GR 714 09, Greece

Abstract

Recently, storage management has emerged as one of the main problems in building cost-effective storage infrastructures. One of the issues that contribute to the management complexity of storage systems is maintaining previous versions of data. Until now, such functionality has been implemented by high-level applications or at the filesystem level. However, many modern systems aim at higher scalability and do not employ such management entities as filesystems. In this paper we propose pushing the versioning functionality closer to the disk by taking advantage of modern, block-level storage devices. We present Clotho, a storage block abstraction layer that allows transparent and automatic data versioning at the block level. Clotho provides a set of mechanisms that can be used to build flexible higher-level version management policies, ranging from keeping all data modifications to version capturing triggered by timers or other system events. Overall, we find that our approach is promising in offloading significant management overhead and complexity from higher system layers to the disk itself and is a concrete step towards building self-managed storage devices. Our specific contributions are: (i) We implement Clotho as a new layer in the block I/O hierarchy in Linux and demonstrate that versioning can be performed at the block level in a transparent manner. (ii) We investigate the impact on I/O path performance overhead using both microbenchmarks and SPEC SFS V3.0 over two real filesystems, Ext2FS and ReiserFS. (iii) We examine techniques that reduce the memory and disk space required for metadata information.

1. Introduction

Storage is currently emerging as one of the major problems in building tomorrow's computing infrastructure. Future systems will provide tremendous storage, CPU processing, and network transfer capacity in a cost-efficient manner, and they will be able to process and store ever increasing amounts of data. The cost of managing these large amounts of stored data becomes the dominant complexity and cost factor for building, using, and operating modern storage systems. Recent studies [10] show that storage expenditures represent more than 50% of the typical server purchase price for applications such as OLTP (OnLine Transaction Processing) or ERP (Enterprise Resource Planning), and these percentages will keep growing. Furthermore, the cost of storage administration is estimated at several times the purchase price of the storage hardware [2, 5, 7, 12, 33, 34, 36].
Thus, building self-managed storage devices that reduce management-related overheads and complexity is of paramount importance. One of the most cumbersome management tasks that requires human intervention is creating, maintaining, and recovering previous versions of data for archival, durability, and other reasons. The problem is exacerbated as the capacity and scale of storage systems increase. Today, backup is the main mechanism used to serve these needs. However, traditional backup systems are limited in the functionality they provide. Moreover, they usually incur high access and restore overheads on magnetic tapes, they impose a very coarse granularity on the allowable archival periods, usually at least one day, and they result in significant management overheads [5, 27]. Automatic versioning, in conjunction with increasing disk capacities, has been proposed [5, 27] as a method to address these issues. In particular, magnetic disks are becoming cheaper and larger, and it is projected that disk storage will soon be as competitive as tape storage [5, 9]. With the advent of inexpensive high-capacity disks, we can perform continuous, real-time versioning and we can maintain online repositories of archived data. Moreover, online storage versioning offers a new range of possibilities compared to simply recovering users' files, possibilities that are available today only in expensive, high-end storage systems:

• Recovery from user mistakes. The users themselves can recover accidentally deleted or modified data by rolling back to a saved version.
• Recovery from system corruption. In the event of a malicious incident on a system, administrators can quickly identify corrupted data as well as recover to a previous, consistent system state [28, 30].
• Historical analysis of data modifications. When it is necessary to understand how a piece of data reached a certain state, versioning proves a valuable tool.

Our goal in this paper is to provide online storage versioning capabilities in commodity storage systems, in a transparent and cost-effective manner. Storage versioning has previously been proposed and examined purely at the filesystem level [24, 26] or at the block level [14, 31], but in a filesystem-aware manner. These approaches to versioning were intended for large, centralized storage servers or appliances. We argue that to build self-managed storage systems, versioning functionality should be pushed to lower system layers, closer to the disk, to offload higher system layers [30]. This is made possible by the underlying technologies that drive storage systems. Disk storage, network bandwidth, processor speed, and main memory are reaching speeds and capacities that make it possible to build cost-effective storage systems with significant processing capabilities [9, 11, 13, 22] that will be able both to store vast amounts of information [13, 17] and to provide advanced functionality. Our approach to providing online storage versioning is to provide all related functionality at the block level. This approach has a number of advantages compared to other approaches that try to provide the same features either at the application or the filesystem level. First, it provides a higher level of transparency and, in particular, is completely filesystem agnostic. For instance, we have used our versioned volumes with multiple, third-party filesystems without the need for any modifications. Data snapshots can be taken on demand, and previous versions can be accessed online simultaneously with the current version.
Second, it reduces complexity in higher layers of storage systems, namely the filesystem and storage management applications [15]. Third, it takes advantage of the increased processing capabilities and memory sizes of active storage nodes and offloads expensive host-processing overheads to the disk subsystem, thus increasing the scalability of a storage archival system [15]. However, block-level versioning poses certain challenges as well: (i) Memory and disk space overhead: Because we only have access to blocks of information, depending on application data access patterns, there is an increased danger of higher space overhead in storing previous versions of data and the related metadata. (ii) I/O path performance overhead: It is not clear at what cost versioning functionality can be provided at the block level. (iii) Consistency of the versioned data when the versioned volume is used in conjunction with a filesystem. (iv) Versioning granularity: Since versioning occurs at a lower system layer, information about the content of data is not available, as is, for instance, the case when versioning is implemented at the filesystem or the application level. Thus, we only have access to full volumes as opposed to individual files. We design Clotho (named after the Fate of ancient Greek mythology who spins the thread of life for every mortal), a system that provides versioning at the block level and addresses all of the above issues, demonstrating that this can be done at minimal space and performance overheads. First, Clotho has low memory space overhead and uses a novel method to avoid copy-on-write costs when the versioning extent size is larger than the block size. Furthermore, Clotho employs off-line differential compression (or diffing) to reduce disk space overhead for archived versions. Second, using advanced disk management algorithms, Clotho's operation is reduced in all cases to simply manipulating pointers in in-memory data structures. Thus, Clotho's common-path overhead follows the rapidly increasing processor-memory performance curve and does not depend on the much lower disk speeds. Third, Clotho deals with version consistency by providing mechanisms that can be used by higher system layers to guarantee either that all data is consistent or to mark which data (files) is not. Finally, we believe that volumes are an appropriate granularity for versioning policies. Given the amounts of information that will need to be managed in the future, specifying volume-wide policies and placing files on volumes with the appropriate properties will result in more efficient data management. We implement Clotho as an additional layer (driver) in the I/O hierarchy of Linux. Our implementation approach allows Clotho the flexibility to be inserted at many different points in the block layer hierarchy in a single machine, a clustered I/O system, or a SAN. Clotho works over simple block devices such as a standard disk driver or more advanced device drivers such as volume managers or hardware/software RAIDs. Furthermore, our implementation provides to higher layers the abstraction of a standard block device and thus can be used by other disk drivers, volume/storage managers, object stores, or filesystems. We evaluate our implementation with both microbenchmarks as well as real filesystems and the SPEC SFS 3.0 suite over NFS. The main memory overhead of Clotho for metadata is about 500 KBytes per GByte of disk space and can be further reduced by using larger extents.
Moreover, the performance overhead of Clotho for I/O operations is minimal; however, it may change the behavior of higher layers (including the filesystem), especially if they make implicit assumptions about the underlying block device, e.g. the location of disk blocks. In such cases, co-design of the two layers or system tuning may be necessary to avoid degrading system performance. Overall, we find that our approach is promising in offloading significant management overhead and complexity from higher system layers to the disk itself and is a concrete step towards building self-managed storage systems. The rest of this paper is organized as follows. Section 2 presents our design and discusses the related challenges in building block-level versioning systems. Section 3 presents our implementation. Section 4 presents our experimental evaluation and results. Section 5 discusses related work, while Section 6 presents limitations and future work. Finally, Section 7 draws our conclusions.

2. System Design

The design of Clotho is driven by the following high-level goals and challenges:
• Flexibility and transparency.
• Low metadata footprint and low disk space overhead.
• Low-overhead common I/O path operation.
• Consistent online snapshots.
Next we discuss how we address each of these challenges separately.

2.1. Flexibility and Transparency

Clotho provides versioned volumes to higher system layers. These volumes look similar to ordinary physical disks that can, however, be customized, based on user-defined policies, to keep previous versions of the data they store. Essentially, Clotho provides a set of mechanisms that allow the user to add time as a dimension in managing data by creating and manipulating volume versions. Every piece of data passing through Clotho is indexed based not only on its location on the block device, but also on the time the block was written. When a new version is created, a subsequent write to a block will create a new block, preserving the previous version. Multiple writes to the same data block between versions result in overwriting the same block. Using Clotho, device versions can be captured either on demand or automatically at prespecified periods. The user can view and access all previous versions of the data online, as independent block devices, along with the current version. The user can compact and/or delete previous volume versions. In this work we focus on the mechanisms Clotho provides, and we only present simple policies we have implemented and tested ourselves. We expect that system administrators will further define their own policies in the context of higher-level storage management tools.

Clotho provides a set of primitives (mechanisms) that higher-level policies can use for automatic version management:
• CreateVersion() provides a mechanism for capturing the lower-level block device's state into a version.
• DeleteVersion() explicitly removes a previously archived version and reclaims the corresponding volume space.
• ListVersions() shows all saved versions of a specific block device.
• ViewVersion() enables creating a virtual device that corresponds to a specific version of the volume and is accessible in read-only mode.
• CompactVersion() and UncompactVersion() provide the ability to compact and uncompact existing versions to reduce disk space overhead.
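As a rough illustration of how a higher-level policy might drive these primitives, the sketch below shows a minimal user-level agent built on the ioctl interface described in Section 3. The ioctl command names, their numeric encodings, and the device path are our own assumptions for this sketch only; the paper does not publish Clotho's actual ioctl numbers.

```c
/*
 * Hypothetical user-level policy agent driving the primitives above through
 * Clotho's ioctl interface (see Section 3). The ioctl encodings and the
 * device path are assumptions made for this sketch only.
 */
#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>
#include <unistd.h>

#define CLOTHO_CREATE_VERSION  _IO('c', 1)                 /* assumed */
#define CLOTHO_DELETE_VERSION  _IOW('c', 2, unsigned int)  /* assumed */

int main(void)
{
    int fd = open("/dev/clt1", O_RDONLY);   /* assumed Clotho device node */
    if (fd < 0) {
        perror("open");
        return 1;
    }

    /* Simple time-based policy: capture one version per hour for a week,
     * always keeping the last 24 versions and deleting older ones. */
    unsigned int newest = 0;
    for (int i = 0; i < 24 * 7; i++) {
        if (ioctl(fd, CLOTHO_CREATE_VERSION) == 0)
            newest++;
        if (newest > 24) {
            unsigned int victim = newest - 24;
            ioctl(fd, CLOTHO_DELETE_VERSION, &victim);
        }
        sleep(3600);
    }

    close(fd);
    return 0;
}
```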
Versions of a volume have the following properties. Each version is identified by a unique version number, which is an integer counter starting from 0 and increasing with each new version. Version numbers are associated with timestamps for presentation purposes. All blocks of the device that are accessible to higher layers during a period of time will be part of the version of the volume taken at that moment (if any) and will be identified by the same version number. Each of the archived versions exists solely in read-only state and is presented to the higher levels of the block I/O hierarchy as a distinct, virtual, read-only block device. The latest version of a device is both readable and writable, exists through the entire lifetime of Clotho's operation, and cannot be deleted.

Clotho can be inserted arbitrarily in a system's layered block I/O hierarchy. This stackable driver concept has been employed to design other block-level I/O abstractions, such as software RAID systems or volume managers, in a clean and flexible manner [31]. The input (higher) layer can be any filesystem or other block-level abstraction or application, such as a RAID, volume manager, or another storage system. Clotho accepts block I/O requests (read, write, ioctl) from this layer. Similarly, the output (lower) layer can be any other block device or block-level abstraction. This design provides great flexibility in configuring a system's block device hierarchy. Figure 1 shows some possible configurations for Clotho, with a filesystem, database, or NFS server as the input layer over the Clotho versioning layer. On the left part of Figure 1, Clotho operates on top of a physical disk device. In the middle, Clotho acts as a wrapper of a single virtual volume constructed by a volume manager, which abstracts multiple physical disks. In this configuration Clotho captures versions of the whole virtual volume. On the right side of Figure 1, Clotho is layered on top of a RAID controller which adds reliability to the system. The result is a storage volume that is both versioned and can tolerate disk failures.

[Figure 1: Clotho in the block device hierarchy.]

Most higher-level abstractions that are built on top of existing block devices assume a device of fixed size, with few rare exceptions such as resizable filesystems. However, the space taken by previous versions of data in Clotho depends on their number and the amount of data modified between them. Clotho can provide both a fixed-size block device abstraction to higher layers, as well as dynamically resizable devices, if the higher layers support them. At device initialization time, Clotho reserves a configurable percentage of the available device space for keeping previous versions of the data. This essentially partitions (logically, not physically) the capacity of the wrapped device into two logical segments, as illustrated in Figure 2: the Primary Data Segment (PDS), which contains the data of the current (latest) version, and the Backup Data Segment (BDS), which contains all the data of the archived versions. When the BDS becomes full, Clotho simply returns an appropriate error code and the user has to reclaim parts of the BDS by deleting or compacting previous versions, or by moving them to some other device. These operations can also be performed automatically by a module that implements high-level data management policies. The latest version of the block device continues to be available and usable at all times. Clotho enforces this capacity segmentation by reporting to the input layer only the size of the PDS as its total size. The space reserved for storing versions is hidden from the input layer and is accessed and managed only through the API provided by Clotho.

[Figure 2: Logical space segments in Clotho.]
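A minimal sketch of the logical capacity split just described, under our own naming assumptions (the paper does not show this code): the wrapped device is divided into PDS and BDS extents, and only the PDS size is advertised to the input layer.

```c
#include <stdint.h>

/*
 * Illustrative sketch of the logical PDS/BDS split. Names and layout are
 * assumptions; the paper only states that a configurable percentage of the
 * wrapped device is reserved for archived versions and that only the PDS
 * size is reported to the input layer.
 */
struct clotho_dev {
    uint64_t total_extents;    /* capacity of the wrapped (output) device */
    uint64_t pds_extents;      /* Primary Data Segment: current version   */
    uint64_t bds_extents;      /* Backup Data Segment: archived versions  */
};

static void clotho_partition(struct clotho_dev *d,
                             uint64_t total_extents,
                             unsigned backup_percent)
{
    d->total_extents = total_extents;
    d->bds_extents   = (total_extents * backup_percent) / 100;
    d->pds_extents   = total_extents - d->bds_extents;
}

/* The size advertised to the input layer is only the PDS; the BDS is
 * hidden and reachable only through Clotho's version-management API. */
static uint64_t clotho_reported_extents(const struct clotho_dev *d)
{
    return d->pds_extents;
}
```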
Finally, Clotho's metadata needs to be saved on the output device along with the actual data. Losing the metadata used for indexing extents would render the data stored throughout the block I/O hierarchy unusable. This is similar to most block-level abstractions, such as volume managers and software RAID devices. Clotho stores metadata to the output device periodically. The size of the metadata depends on the size of the encapsulated device and the extent size. In general, Clotho's metadata are much smaller than the metadata of a typical filesystem and thus saving them to stable storage is not an issue.

2.2. Reducing Metadata Footprint

The three main types of metadata in Clotho are the Logical Extent Table (LXT), the Device Version List (DVL), and the Device Superblock (DSB). The Logical Extent Table (LXT) is a structure used for logical-to-physical block translation. Clotho presents to the input layer logical block numbers as opposed to the physical block numbers provided by the wrapped device. Note that these block numbers need not directly correspond to actual physical locations if another block I/O abstraction, such as a volume manager (e.g. LVM [32]), is used as the output layer. Clotho uses the LXT to translate logical block numbers to physical block numbers. The Device Version List (DVL) is a list of all versions of the output device that are available to higher layers as separate block devices. For every existing version, it stores its version number, the virtual device it may be linked to, the version creation timestamp, and a number of flags. The Device Superblock (DSB) is a small table containing important attributes of the output versioned device. It stores information about the capacity of the input and output device, the space partitioning, the size of the extents, the sector and block size, the current version counter, the number of existing versions, and other usage counters. The LXT is the most demanding type of metadata and is conceptually an array indexed by block numbers. The basic block size for most block devices varies between 512 Bytes (the size of a disk sector) and 8 KBytes. This results in large memory requirements. For instance, for 1 TByte of disk storage with 4-KByte blocks, the LXT has 256M entries. In the current version of Clotho, every LXT entry is 128 bits (16 bytes). These include 32 bits for block addressing and 32 bits for versions, which allow for a practically unlimited number of versions. Thus, the LXT requires about 4 GBytes per TByte of disk storage. Note that a 32-bit address space, with 4-KByte blocks, can address 16 TBytes of storage.
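To make the arithmetic behind these figures explicit, here is a small self-contained C program that recomputes the LXT footprint from the 16-byte entry size quoted above; the constant names are ours and the program is purely illustrative.

```c
#include <stdio.h>
#include <stdint.h>

/* Illustrative arithmetic only: entry size and extent sizes taken from the text. */
#define LXT_ENTRY_BYTES   16ULL                      /* 128-bit LXT entry          */
#define ONE_TB            (1ULL << 40)               /* 1 TByte of managed storage */

static void lxt_footprint(uint64_t extent_bytes)
{
    uint64_t entries = ONE_TB / extent_bytes;        /* one LXT entry per extent   */
    uint64_t bytes   = entries * LXT_ENTRY_BYTES;    /* total in-memory LXT size   */
    printf("extent %6llu KB -> %llu M entries, %llu MB of LXT per TB\n",
           (unsigned long long)(extent_bytes >> 10),
           (unsigned long long)(entries >> 20),
           (unsigned long long)(bytes >> 20));
}

int main(void)
{
    lxt_footprint(4ULL << 10);    /* 4-KByte extents:  256M entries, ~4 GB per TB  */
    lxt_footprint(32ULL << 10);   /* 32-KByte extents: 32M entries, ~512 MB per TB */
    return 0;
}
```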
To reduce the footprint of the LXT and at the same time increase its addressing range, we use extents as opposed to device blocks as our basic data unit. An extent is a set of consecutive (logical and physical) blocks. Extents can be thought of as Clotho's internal block size, which one can configure to arbitrary sizes, up to several hundred KBytes or a few MBytes. Similarly to physical and logical blocks, we denote extents as logical (input) extents or physical (output) extents. We have implemented and tested extent sizes ranging from 1 KByte to 64 KBytes. With 32-KByte extents and subextent addressing, we need only 500 MBytes of memory per TByte of storage. Moreover, with a 32-KByte extent size we can address 128 TBytes of storage. However, large extent sizes may result in significant performance overhead. When the extent size and the operating system block size for Clotho block devices are the same (e.g. 4 KBytes), Clotho receives from the operating system the full extent for which it has to create a new version. When using extents larger than this maximum size, Clotho sees only a subset of the extent for which it needs to create a new version. Thus, it needs to copy the rest of the extent into the new version, even though only a small portion of it was written by the higher system layers. This copy can significantly decrease performance in the common I/O path, especially for large extent sizes. However, large extents are desirable for reducing the metadata footprint. Given that operating systems support I/O blocks only up to a maximum size (e.g. 4 KBytes in Linux), this may result in severe performance overheads. To address this problem we use subextent addressing. Using a small (24-bit) bitmap in each LXT entry, we need not copy the whole extent on a partial update. Instead, we just translate the block write to a subextent of the same size and mark it in the subextent bitmap as valid, using just 1 bit. In a subsequent read operation we search for the valid subextents in the LXT before translating the read operation. For a 32-KByte extent size, we need only 8 bits in the bitmap for 4-KByte subextents. Another possible approach to reduce the memory footprint is to store only part of the metadata in main memory and swap active metadata in from stable storage. However, this solution is not adequate for storage systems where large amounts of data need to be addressed. Moreover, it is orthogonal to subextent addressing and can be combined with it.

[Figure 3: Subextent addressing in large extents.]
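The sketch below illustrates the subextent bookkeeping described above. The structure layout and field names are our assumptions; the paper only specifies a 128-bit entry containing an address, a version number, and a small validity bitmap.

```c
#include <stdint.h>
#include <stdbool.h>

/*
 * Illustrative sketch of an LXT entry with subextent addressing. Field
 * names and exact layout are assumptions, not Clotho's actual code.
 */
struct lxt_entry {
    uint32_t phys_extent;      /* physical (output) extent number            */
    uint32_t version;          /* version number of this extent              */
    uint32_t subextent_valid;  /* 1 bit per subextent (e.g. 8 bits for       */
                               /* 4-KByte subextents in a 32-KByte extent)   */
    uint32_t next_backup;      /* link into the backup LXT segment (example) */
};

/* Mark the subextent covering a partial write as valid instead of copying
 * the whole extent. */
static void mark_subextent(struct lxt_entry *e, unsigned subextent_idx)
{
    e->subextent_valid |= 1u << subextent_idx;
}

/* On a read, check whether the requested subextent is valid in this entry;
 * if not, the lookup falls back to an older version of the extent. */
static bool subextent_is_valid(const struct lxt_entry *e, unsigned subextent_idx)
{
    return (e->subextent_valid >> subextent_idx) & 1u;
}
```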
2.3. Version Management Overhead

All version management operations can be performed at a negligible cost by manipulating in-memory data structures. Creating a new version in Clotho involves simply incrementing the current version counter and does not involve copying any data. When CreateVersion() is called, Clotho stalls all incoming I/O requests for the time required to flush all its outstanding writes to the output layer. When everything is synchronized on stable storage, Clotho increases the current version counter, appends a new entry to the device version list, and creates a new virtual block device that can be used to access the captured version of the output device, as explained later. Since each version is linked to exactly one virtual device, the (OS-specific) device number that sends the I/O request can be used to retrieve the I/O request's version. The fact that device versioning is a low-overhead operation makes it possible to create flexible versioning policies. Versions can be created by external processes periodically or based on system events. For instance, a user process can specify that it requires a new version every hour, whenever all files on the device are closed, or on every single write to the device. Some of the mechanisms to detect such events, e.g. whether there are any open files on a device, may be (and currently are) implemented in Clotho but could also be provided by other system components. In order to free backup disk space, Clotho provides a mechanism to delete volume versions. On a DeleteVersion() operation, Clotho traverses the primary LXT segment and, for every entry that has a version number equal to the delete candidate, changes the version number to the next existing version number. It then traverses the backup LXT segment and frees the related physical extents. As with version creation, all operations for version deletion are performed in memory and can overlap with regular I/O. DeleteVersion() is provided to higher layers in order to implement version cleaning policies. Since storage space is finite, such policies are necessary in order to continue versioning without running out of backup storage. Finally, even if the backup data segment (BDS) is full, I/O to the primary data segment and the latest version of the data can continue without interruption.
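The following minimal sketch restates the two operations just described in code form, to emphasize that they only manipulate in-memory structures; all names are our own illustrative assumptions, and locking, I/O stalling, and error handling are omitted.

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative only: assumed global state for a single Clotho device. */
extern uint32_t current_version;          /* device version counter        */
extern void flush_outstanding_writes(void);
extern void append_version_entry(uint32_t version);
extern void create_virtual_device(uint32_t version);

/* CreateVersion(): no data is copied; after flushing outstanding writes,
 * the version counter is bumped and a read-only virtual device is linked. */
static uint32_t create_version(void)
{
    flush_outstanding_writes();           /* stall/flush, then...          */
    current_version++;                    /* ...just bump the counter      */
    append_version_entry(current_version);
    create_virtual_device(current_version);
    return current_version;
}

struct lxt_entry { uint32_t version; uint32_t phys_extent; };
extern struct lxt_entry primary_lxt[], backup_lxt[];
extern size_t primary_len, backup_len;
extern uint32_t next_existing_version(uint32_t v);
extern void free_extent(uint32_t px);

/* DeleteVersion(): re-label primary entries of the deleted version and
 * free the physical extents referenced by its backup entries. */
static void delete_version(uint32_t victim)
{
    for (size_t i = 0; i < primary_len; i++)
        if (primary_lxt[i].version == victim)
            primary_lxt[i].version = next_existing_version(victim);

    for (size_t i = 0; i < backup_len; i++)
        if (backup_lxt[i].version == victim)
            free_extent(backup_lxt[i].phys_extent);
}
```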
2.4. Common I/O Path Overhead

We consider as the common path for Clotho the I/O path that reads and writes the latest (current) version of the output block device, while versioning occurs frequently. Accesses to older versions are of less importance, since they are not expected to occur as frequently as accesses to the current version. Accordingly, we divide read and write accesses to volume versions into two categories: accesses to the current version and accesses to any previous version. The main technique to reduce common-path overhead is to divide the LXT into two logical segments, corresponding to the primary and backup data segments of the output device, as illustrated in Figure 2. The primary segment of the LXT (denoted PLX in the figures) has an equal number of logical extents as the input layer, to allow a direct, 1-1 mapping between the logical extents and the physical extents of the current version on the output device. By using a direct, 1-1 mapping, Clotho can locate a physical extent of the current version of a data block with a single lookup in the primary LXT segment when translating I/O requests to the current version of the versioned device. If the input layer needs to access previous versions of a versioned output device, then multiple accesses to the LXT may be required to locate the appropriate version of the requested extent. To find the physical extent that holds the specific version of the requested block, Clotho first references the primary LXT segment entry to locate the current version of the requested extent (a single table access). Then it uses the linked list that represents the version history of the extent to locate the appropriate version of the requested block. Depending on the type of each I/O request and the state of the requested block, I/O requests can be categorized as follows.

Write requests can only be performed on the current version of a device, since older versions are read-only. Thus, Clotho can locate the LXT entry of a current-version extent with a single LXT access. Write requests can be one of three kinds, as shown in Figure 4:
a. Writes to new, unallocated blocks. In this case, Clotho calls its extent allocator module, which returns an available physical extent of the output device, updates the corresponding entry in the LXT, and forwards the write operation to the output device. The extent allocation policy in our current implementation is a scan-type policy, starting from the beginning of the PDS and moving to its end. Free extents are ignored until we reach the end of the device, when we rewind the allocation pointer and start allocating the free extents.
b. Writes to existing blocks that have been modified after the last snapshot was captured (i.e. their version number is equal to the current version number). In this case Clotho locates the corresponding entry in the primary LXT segment with a single lookup and translates the request's block address to the existing physical block number of the output device. Note that in this case the blocks are updated in place.
c. Writes to existing blocks that have not been modified since the last snapshot was captured (i.e. their version number is lower than the current version number). The data in the existing physical extent must not be overwritten; instead, the new data should be written to a different location and a new version of the extent must be created. Clotho allocates a new LXT entry in the backup segment and swaps the old and new LXT entries so that the old one is moved to the backup LXT segment. The block address is then translated to the new physical extent address, and the request is forwarded to the output layer. This "swapping" of LXT entries maintains the 1-1 mapping of current-version logical extents in the LXT, which optimizes common-path references to a single LXT lookup.

This write translation algorithm allows for independent, fine-grain versioning at the extent level. Every extent in the LXT is versioned according to its updates from the input level. Extents that are updated more often have more versions than extents written less frequently.

Read request translation is illustrated in Figure 5. First, Clotho determines the desired version of the device by the virtual device name and number in the request (e.g. /dev/clt1-01 corresponds to version 1 and /dev/clt1-02 to version 2). Then, Clotho traverses the version list on the LXT for the specific extent or subextent and locates the appropriate physical block. Finally, previous versions of a Clotho device appear as different virtual block devices. Higher layers, e.g. filesystems, can use these devices to access old versions of the data. If the device id of a read request is different from the normal input layer's device id, the read request refers to an extent belonging to a previous version. Clotho determines from the device id the version of the extent requested. Then, it traverses the version list associated with this extent to locate the backup LXT entry that holds the appropriate version of the logical extent. This translation process is illustrated in Figure 5.

[Figure 4: Translation path for write requests.]
[Figure 5: Translation path for read requests.]
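To make the three write cases concrete, here is a compact C sketch of the translation logic. It is our own illustration under assumed structure and function names (pds_alloc_extent, backup_lxt_alloc, and so on), not Clotho's actual code.

```c
/*
 * Sketch of the write translation described above. Structure and field
 * names are illustrative assumptions only.
 */
#include <stdint.h>

#define EXTENT_FREE  UINT32_MAX

struct lxt_entry {
    uint32_t phys_extent;   /* physical extent on the output device        */
    uint32_t version;       /* version number of this extent               */
    uint32_t next_backup;   /* index of older version in backup LXT, or 0  */
};

extern struct lxt_entry primary_lxt[];   /* 1-1 with logical extents       */
extern uint32_t current_version;
extern uint32_t pds_alloc_extent(void);                  /* scan allocator */
extern uint32_t backup_lxt_alloc(struct lxt_entry old);  /* move to backup */

/* Returns the physical extent a write to logical extent 'lx' must go to. */
static uint32_t translate_write(uint32_t lx)
{
    struct lxt_entry *e = &primary_lxt[lx];

    if (e->phys_extent == EXTENT_FREE) {
        /* (a) first write to this logical extent: allocate a new one */
        e->phys_extent = pds_alloc_extent();
        e->version     = current_version;
    } else if (e->version == current_version) {
        /* (b) already modified since the last snapshot: update in place */
    } else {
        /* (c) not modified since the last snapshot: preserve the old
         * extent by moving its entry to the backup LXT segment, then
         * direct the new data to a freshly allocated extent. */
        uint32_t backup_idx = backup_lxt_alloc(*e);
        e->next_backup  = backup_idx;
        e->phys_extent  = pds_alloc_extent();
        e->version      = current_version;
    }
    return e->phys_extent;
}
```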
2.5. Reducing Disk Space Requirements

Since Clotho operates at the block level, there is an induced overhead in the amount of space it needs to store data updates. If an application using a file, for instance, modifies a few consecutive bytes in the file, Clotho will create a new version of the full block that contains the modified data. To reduce the space overhead in Clotho we provide a differential, content-based compaction mechanism, which we describe next. Clotho provides the user with the ability to compact device versions and still be able to transparently access them online. The policy decision on when to compact a version is left to higher layers in the system, similarly to all policy decisions in Clotho. We use a form of binary differential compression [1] to store only the data that has been modified since the last version capture. When CompactVersion() is called, Clotho constructs a differential encoding (or delta) between the blocks that belong to a given version and the corresponding blocks in its previous version [1]. Although many differential policies can be applied in this case, such as comparing the content of a specific version with its next version, or with both the previous and the next version, at this stage we only explore diffing with the previous version. Furthermore, although versions can also be compressed independently of differential compression using algorithms such as Lempel-Ziv encoding [37] or Burrows-Wheeler encoding [3], this is beyond the scope of our work. We envision that such functionality can be provided by other layers in the I/O device stack.

The differential encoding algorithm works as follows. When a compaction operation is triggered, the algorithm runs through the backup data segment of the LXT and locates the extents that belong to the version under consideration. If an extent does not have a previous version, it is not compacted. For each of the extents to be compacted, the algorithm locates its previous version, diffs the two extents, and writes the diffs to a physical extent on the output device. If the diff size is greater than a threshold and diffing is not very effective, then Clotho discards this pair and proceeds with the next extent of the version to be compacted. In other words, Clotho's differential compression algorithm works selectively on the physical extents, compacting only the extents that can be reduced in size. The rest are left in their normal format to avoid the performance penalties necessary for their reconstruction.
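A sketch of this selective compaction pass follows, under assumed names and interfaces (extent_diff, store_delta, and DIFF_THRESHOLD are ours); it only illustrates the control flow described above.

```c
/*
 * Illustrative sketch of the selective diff-based compaction pass.
 * All names and interfaces here are assumptions; only the control flow
 * follows the description in the text.
 */
#include <stddef.h>
#include <stdint.h>

#define EXTENT_SIZE     (32u * 1024u)        /* example extent size          */
#define DIFF_THRESHOLD  (EXTENT_SIZE / 2)    /* "effective" diff cutoff      */

struct backup_extent {
    uint32_t version;                        /* version this extent belongs to */
    uint32_t phys_extent;                    /* full extent on the output dev  */
    struct backup_extent *prev_version;      /* ancestor extent, if any        */
};

/* Assumed helpers: diff two physical extents, store a delta, free an extent. */
extern size_t extent_diff(uint32_t new_px, uint32_t old_px, void *delta_buf);
extern void   store_delta(size_t backup_index, const void *delta, size_t len);
extern void   free_extent(uint32_t px);

static void compact_version(struct backup_extent *backup_lxt, size_t n,
                            uint32_t target_version, void *delta_buf)
{
    for (size_t i = 0; i < n; i++) {
        struct backup_extent *e = &backup_lxt[i];

        if (e->version != target_version || e->prev_version == NULL)
            continue;                        /* no ancestor: leave uncompacted */

        size_t len = extent_diff(e->phys_extent,
                                 e->prev_version->phys_extent, delta_buf);
        if (len > DIFF_THRESHOLD)
            continue;                        /* diffing not effective: skip    */

        store_delta(i, delta_buf, len);      /* deltas can share a physical    */
                                             /* extent with other deltas       */
        free_extent(e->phys_extent);         /* reclaim the full-size extent   */
    }
}
```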
Since the compacted form of an extent requires less space than a whole physical extent, the algorithm stores multiple deltas in the same physical extent, effectively imposing a different structure on the output block device. Furthermore, for compacted versions, multiple entries in the LXT may point to the same physical extent. The related entries in the LXT and the ancestor extent are kept in Clotho's metadata. Physical extents that are freed after compaction are reused for storage. Figure 6 shows sample LXT mappings for a compacted version of the output layer.

[Figure 6: Read translation for compact versions.]

Data on a compacted version can be accessed transparently online, just like data on uncompacted volumes (Figure 6). Clotho follows the same path to locate the appropriate version of the logical extent in the LXT. To recreate the original, full extent data we need the differential data of the previous version of the logical extent. With this information Clotho can reconstruct the requested block and return it to the input driver. We evaluate the related overheads in Section 4. Clotho supports recursive compaction of devices: the next version of a compacted version can still be compacted. Also, compacted versions can be uncompacted to their original state with the reverse process. A side-effect of the differential encoding concept is that it creates dependencies between two consecutive versions of a logical extent, which affects the way versions are accessed, as explained next. When deleting versions, Clotho checks for dependencies of compacted versions on a previous version and does not delete extents that are required for un-diffing, even if their versions are deleted. These logical extents are marked as "shadow" and are attached to the compacted version. It is left to higher-level policies to decide whether keeping such blocks increases the space overhead to the point where it would be better to uncompact the related version and delete any shadow logical extents.

2.6. Consistency

One of the main issues in block device versioning at arbitrary times is consistency of the stored data. There are three levels of consistency for online versioning:

System state consistency: This refers to the consistency of system buffers and data structures that are used in the I/O path. To deal with this, Clotho flushes all device buffers in the kernel as well as filesystem metadata before version creation. This guarantees that the data and metadata on the block device correspond to a valid snapshot of the filesystem at a point in time. That is, there are no consistency issues in internal system data structures.

Open file consistency: When a filesystem is used on top of a versioned device, certain files may be open at the time of a snapshot. Although Clotho does not deal with this issue, it provides a mechanism to assist users. When a new version is created, Clotho's user-level module queries the system for open files on the particular device. If such files are open, Clotho creates a special directory with links to all open files and includes the directory in the archived version. Thus, when accessing older versions the user can find out which files were open at versioning time.

Application consistency: Applications using the versioned volume may have a specialized notion of consistency. For instance, an application may be using two files that are both updated atomically.
If a version is created after the first file is updated and closed but before the second one is opened and updated, then, although no files were open during version creation, the application data may still be inconsistent. This type of consistency is not possible to deal with transparently without application knowledge or support, and thus is not addressed by Clotho.

3. System Implementation

We have implemented Clotho as a block device driver module in the Linux 2.4 kernel and a user-level control utility, in about 6,000 lines of C code. The kernel module can be loaded at runtime and configured for any output layer device by means of an ioctl() command triggered by the user-level agent. After configuring the output layer device, the user can manipulate the Clotho block device depending on the higher layer that they want to use. For instance, the user can build a filesystem on top of a Clotho device with mkfs or newfs and then mount it as a regular filesystem. Our module adheres to the framework of block I/O devices in the Linux kernel and provides two interfaces to user programs: an ioctl command interface and a /proc interface for device information and statistics. All operations described in the design section to create, delete, and manage versions have been implemented through the ioctl interface and are initiated by the user-level agent. The /proc interface provides information about each device version through readable ASCII files. Clotho also uses this interface to report a number of statistics, including the times of creation, a version's time span, the size of data modified since the previous version, and some information specific to compacted versions, such as the compaction level and the number of shadow extents. The Clotho module uses the zero-copy make_request_fn() mechanism, in the same fashion as LVM [32]. With this mechanism Clotho is able to translate the device driver id (kdev_t) and the sector address of a block request (struct buffer_head) and redirect it to other devices with minimal overhead. To achieve persistence of metadata, Clotho uses a kernel thread created at module load time, which flushes the metadata to the output layer at configurable (currently 30 s) intervals. Virtual device creation uses the partitionable block device concepts in the Linux kernel. A limit of the Linux kernel minor numbering is that there can be at most 255 minor numbers for a specific device, and thus only 255 versions can be seen simultaneously as partitions of Clotho. However, the number of versions supported by Clotho can be much larger. To overcome this limitation we have created a mechanism, through an ioctl call, that allows the user to link and unlink on demand any of the available versions to any of the 255 minor-number partitions of a Clotho device. As mentioned, each of these partitions is read-only and can be used as a normal block device, e.g. it can be mounted at a mount point.

4. Experimental Results

Our experimental environment consists of two identical PCs running Linux.
Each system has two Pentium III 866 MHz CPUs, 768 MBytes of RAM, an IBM-DTLA-307045 ATA hard disk drive with a capacity of 46116 MBytes (2-MByte cache), and a 100 Mbit/s Ethernet NIC. The operating system is Red Hat Linux 7.1, with the 2.4.18 SMP kernel. All experiments are performed on a 21-GByte partition of the IBM disk. With a 32-KByte extent, we need only 10.5 MBytes of memory for our 21-GByte partition. Although a number of system parameters are worth investigating, we evaluate Clotho with respect to two parameters: memory and performance overhead. We use two extent sizes, 4 and 32 KBytes. Smaller extent sizes have higher memory requirements. For our 21-GByte partition, Clotho with a 4-KByte extent size uses 82 MBytes of in-memory metadata, the dirty parts of which are flushed to disk every 30 seconds. We evaluate Clotho using both microbenchmarks (Bonnie++ version 1.02 [4] and an in-house developed microbenchmark) and real-life setups with production-level filesystems. The Bonnie++ benchmark is a publicly available filesystem I/O benchmark [4]. For the real-life setup we run the SPEC SFS V3.0 suite on top of two well-known Linux filesystems, Ext2FS and the high-performance journaled ReiserFS [20]. In our results we use the label Disk to denote experiments with the regular disk, without the Clotho driver on top.

4.1. Bonnie++

We use the Bonnie++ microbenchmark to quantify the basic overhead of Clotho. The filesystem we use in all Bonnie++ experiments is Ext2FS with a 4-KByte extent size. The size of the file to be tested is 2 GBytes, with block sizes ranging from 1 KByte to 64 KBytes. We measure accesses to the latest version of a volume with the following operations:
• Block Write: A large file is created using the write() system call.
• Block Rewrite: Each block of the file is read with read(), dirtied, and rewritten with write(), requiring an lseek().
• Block Read: The file is read using a read() for every block.
• Random Seek: Processes running in parallel perform lseek() to random locations in the file and read() the corresponding file blocks.

[Figure 7: Bonnie++ throughput for write, rewrite, and read operations.]
[Figure 8: Bonnie "seek and read" performance.]
Figure 7 shows that the overhead in write throughput is minimal and the two curves are practically the same. In the read throughput case, Clotho performs slightly better than the regular disk. We believe this is due to the logging (sequential) disk allocation policy that Clotho uses. In the rewrite case, the overhead of Clotho becomes more significant. This is due to the random "seek and read" operation overhead, as shown in Figure 8. Since the seeks in this experiment are random, Clotho's logging allocation has no effect and the overhead of translating I/O requests and flushing filesystem metadata to disk dominates. Even in this case, however, the overhead observed is at most 7.5% relative to the regular disk.

4.2. SPEC SFS

We use the SPEC SFS 3.0 benchmark suite to measure NFS file server throughput and response time over Clotho. We use one NFS client and one NFS server. The two systems that serve as client and server are connected with a switched 100 MBit/s Ethernet network. We use the following settings: NFS version 3 protocol over UDP/IP, one NFS exported directory, biod_max_read=2, biod_max_write=2, and requested loads ranging from 300 to 1000 NFS V3 operations/s with an increment step of 100. Both warm-up and run time are 300 seconds for each run, and the time for all the SPEC SFS runs in sequence is approximately 3 hours. As mentioned above, we report results for the Ext2FS and ReiserFS (with the -notail option) filesystems [20]. A new filesystem is created before every experiment. We conduct four experiments with SPEC SFS for each of the two filesystems: using the plain disk, using Clotho over the disk without versioning, using Clotho with versioning every 5 minutes, and using Clotho with versioning every 10 minutes. Versioning is performed throughout the entire 3-hour run of SPEC SFS. Figures 9 and 10 show our throughput and latency results for 4-KByte extents, while Figures 11 and 12 show the results using 32-KByte extents with subextent addressing.

[Figure 9: SPEC SFS response time using 4-KByte extents.]
[Figure 10: SPEC SFS throughput using 4-KByte extents.]
[Figure 11: SPEC SFS response time using 32-KByte extents with subextents (RFS denotes ReiserFS).]
[Figure 12: SPEC SFS throughput using 32-KByte extents with subextents (RFS denotes ReiserFS).]

Our results show that Clotho outperforms the regular disk in all cases except ReiserFS without versioning. The higher performance is due to the logging (sequential) block allocation policy that Clotho uses. This explanation is reinforced by the performance in the cases where versions are created periodically. In this case, frequent versioning prevents disk seeks caused by overwriting of old data, which are now written to new locations on the disk in a sequential fashion. Furthermore, we observe that the more frequent the versioning, the higher the performance.
The 32-KByte extent size experiments (Figures 11 and 12) show that even with much lower memory requirements, subextent mapping offers almost the same performance as the 4-KByte case. We attribute this small difference to the disk rotational latency incurred when skipping unused space to write subextents, while with the 4-KByte extent size the extents are written "back-to-back" in a sequential manner. Finally, comparing the two filesystems, Ext2 and ReiserFS, we find that the latter performs worse on top of Clotho. We attribute this behavior to the journaled metadata management of ReiserFS. While Ext2 updates metadata in place, ReiserFS appends metadata updates to a journal. This technique, in combination with Clotho's logging disk allocation, appears to have a negative effect on performance in the SPEC SFS workload, compared to Ext2.

4.3. Compact Version Performance

Finally, we measure the read throughput of compacted versions to evaluate the space-time tradeoff of diff-based compaction. Since old versions are only accessible in read-only mode, we developed a two-phase microbenchmark that performs only read operations. In the first stage, our microbenchmark writes a number of large files and captures multiple versions of the data through Clotho. In writing the data, the benchmark is also able to control the amount of similarity between two versions, and thus the percentage of space required by compacted versions. In the second stage, our benchmark mounts a compacted version and performs 2000 random read operations on the files of the compacted version. Before every run, the benchmark flushes the system's buffer cache. Figures 13 and 14 present latency and throughput results for different percentages of compaction. For 100% compaction, the compacted version takes up minimal space on the disk, whereas in the 0% case compaction is not space-effective at all. The difference in performance is mainly due to the higher number of disk accesses per read operation required for compacted versions. Each such read operation requires two disk reads to reconstruct the requested block: one to fetch the block of the previous version and one to fetch the diffs. In particular, with 100% compaction, each and every read results in two disk accesses and thus performance is about half of the 0% compaction case.

[Figure 13: Random "compact-read" throughput.]
[Figure 14: Random "compact-read" latency.]

5. Related Work

A number of projects have highlighted the importance of and the issues in storage management [13, 16, 22, 36]. Our goal in this work is to define innovative functionality that can be used in future storage protocols and APIs to reduce management overheads. Block-level versioning was recently discussed and used in WAFL [14], a file system designed for Network Appliance's NFS appliance. WAFL works at the block level of the filesystem and can create up to 20 snapshots of a volume and keep them available online through NFS. However, since WAFL is a filesystem and works in an NFS appliance, this approach depends on the filesystem. In our work we demonstrate that Clotho is filesystem agnostic by presenting experimental data with two production-level filesystems. Moreover, WAFL can manage a limited number of versions (up to 20), whereas Clotho can manage a practically unlimited number. The authors in [14] mention that WAFL's performance cannot be compared to other general-purpose file systems, since it runs on a specialized NFS appliance and much of its performance comes from its NFS-specific tuning.
The authors in [15] use WAFL to compare the performance of filesystem-based and block-level-based snapshots (within WAFL). They advocate the use of block-level backup, due to cost and performance reasons. However, they do not provide any evidence on the performance overhead of block-level versioned disks compared to regular, non-versioned block devices. In our work we thoroughly evaluate this with both microbenchmarks as well as standard workloads. SnapMirror [23] is an extension of WAFL, which introduces management of remote replicas in WAFL's snapshots to optimize data transfer and ensure consistency. Venti [27] is a block-level network storage service, intended as a repository for backup data. Venti follows a write-once storage model and uses content-based addressing by means of hash functions to identify blocks with identical content. Instead, Clotho uses differential compression concepts. Furthermore, Venti does not support versioning features. Clotho and Venti are designed to perform complementary tasks, the former to version data and the latter as a repository to safely store the archived data blocks over the network. Distributed block-level versioning support was included in Petal [7]. Although the concepts are similar to Clotho, Petal also targets networks of workstations as opposed to active storage devices. Since backup and archival of data is an important problem, there are many products available that try to address the related issues. However, specific information about these systems and their performance with commodity hardware, filesystems, or well-known benchmarks is scarce. LiveBackup [29] captures changes at the file level on client machines and sends modifications to a back-end server that archives previous file versions. EMC's SnapView [8] runs on the CLARiiON storage servers at the block level and uses a "copy-on-first-write" algorithm. However, it can capture only up to 8 snapshots, and its copy algorithm does not use logging block allocation to speed up writes. Instead, it copies the old block data to hidden storage space on every first write, overwriting another block. Veritas's FlashSnap [35] software works inside the Veritas File System and thus, unlike Clotho, is not filesystem agnostic. Furthermore, it supports only up to 32 snapshots of volumes. Sun's Instant Image [31] also works at the block level in the Sun StorEdge storage servers. Its operation appears similar to Clotho. However, it is used through drivers and programs in Sun's StorEdge architecture, which runs only on the Solaris platform and is also filesystem aware. Each of the above systems, especially the commercial ones, uses proprietary customized hardware and system software, which makes comparisons with commodity hardware and general-purpose operating systems difficult. Moreover, these systems are intended as standalone services within centralized storage appliances, whereas Clotho is designed as a transparent, autonomous block-level layer for active storage devices, appropriate for pushing functionality closer to the physical disk. In this direction, Clotho categorizes the challenges of implementing block-level versioning and addresses the related problems. The authors in [6] examine the possibility of introducing an additional layer in the I/O device stack to provide certain functionality at lower system layers, which also affects the functionality that is provided by the filesystem.
Other efforts in this direction mostly include work in logical volume management and storage virtualization that tries to create a higher-level abstraction on top of simple block devices. The authors in [32] present a survey of such systems for Linux. Such systems usually provide the abstraction of a block-level volume that can be partitioned, aggregated, expanded, or shrunk on demand. Other such efforts [18] add RAID capabilities to arbitrary block devices. Our work is complementary to these efforts and proposes adding versioning capabilities at the block-device level. Other previous work in versioning data has mostly been performed either at the filesystem layer or at higher layers. The authors in [26] propose versioning of data at the file level, discussing how the filesystem can transparently maintain file versions as well as how these can be cleaned up. The authors in [19] try to achieve similar functionality by providing mount points to previous versions of directories and files. They propose a solution that does not require kernel-level modification but relies on a set of user processes to capture user requests to files and to communicate with a back-end storage server that archives previous file versions. Other, similar efforts [21, 24, 25, 28, 30] approach the problem at the filesystem level as well and either provide the ability to checkpoint data or explicitly manage time as an additional file attribute. Self-securing storage [30] and its filesystem, CVFS [28], target secure storage systems and operate at the filesystem level. Some of the versioning concepts in self-securing storage and CVFS are similar to Clotho, but there are numerous differences as well. The most significant one is that self-securing storage policies are not intended for data archiving and thus retain versions of data for a short period of time called the detection window. No versions are guaranteed to exist outside this window of time and no version management control is provided for specifying higher-level policies. CVFS introduces certain interesting concepts for reducing metadata space, which, however, are also geared towards security and are not intended for archival purposes. Since certain concepts in [28, 30] are similar to Clotho, we believe that a block-level self-securing storage system could be based on Clotho, separating the orthogonal versioning and security functionalities into different subsystems.

6. Limitations and Future Work

The main limitation of Clotho is that it cannot be layered below abstractions that aggregate multiple block devices into a single volume and cannot be used with shared block devices transparently. If Clotho is layered below a volume abstraction that performs aggregation and on top of the block devices that are being aggregated into a single volume, policies for creating versions need to perform synchronized versioning across devices to ensure data consistency. However, this may not be possible in a manner transparent to higher system layers. The main issue here is that it is not clear what the semantics are of versioning parts of a "coherent", larger volume. Furthermore, when multiple clients have access to a shared block device, as is usually the case with distributed block devices [7, 33], Clotho cannot be layered on top of the shared volume in each client, since internal metadata will become inconsistent across Clotho instances. Solutions to these problems are interesting topics for future work.
Another limitation of Clotho is that it imposes a change in the block layout from the input to the output layer. Clotho acts as a filter between two block devices, transferring blocks of data from one layer to the next. Although this does not introduce any new issues with wasted space due to fragmentation (e.g. for files, if a filesystem is used with Clotho), it alters the data layout significantly. Thus, it may affect I/O performance if free blocks are scattered over the disk or if higher layers rely on a specific block mapping, e.g. block 0 being the first block on the disk, block 1 the second, etc. However, this is an issue not only with Clotho, but with any layer in the I/O hierarchy that performs block remapping, such as RAIDs and some volume managers. Moreover, as I/O subsystems become more complex and provide more functionality, general solutions to this problem may become necessary. Since this is beyond the scope of this work, we do not discuss it any further here.

7. Conclusions

Storage management is an important problem in building future storage systems. Online storage versioning can help reduce these management costs directly, by addressing data archival and retrieval costs, and indirectly, by providing novel storage functionality. In this work we propose pushing versioning functionality closer to the disk and implementing it at the block-device level. This approach takes advantage of technology trends in building active self-managed storage systems to address issues related to backup and version management. We present a detailed design of our system, Clotho, which provides versioning at the block level. Clotho imposes small memory and disk space overhead for version data and metadata management by using large extents, subextent addressing, and diff-based compaction. It imposes minimal performance overhead in the I/O path by eliminating the need for copy-on-write even when the extent size is larger than the disk block size and by employing logging (sequential) disk allocation. It provides mechanisms for dealing with data consistency and allows for flexible policies for both manual and automatic version management. We implement our system in the Linux operating system and evaluate its impact on I/O path performance with both microbenchmarks as well as the SPEC SFS standard benchmark on top of two production-level file systems, Ext2FS and ReiserFS. We find that the common-path overhead is minimal for read and write I/O operations when versions are not compacted. For compact versions, the user has to pay the penalty of double disk accesses for each I/O operation that accesses a compact block. Overall, we believe that our approach is promising in offloading significant management overhead and complexity from higher system layers to the disk itself and is a concrete step towards building self-managed storage devices.

8. Acknowledgments

We thankfully acknowledge the support of the Natural Sciences and Engineering Research Council of Canada, the Canada Foundation for Innovation, the Ontario Innovation Trust, the Nortel Institute of Technology, Communications and Information Technology Ontario, Nortel Networks, and the General Secretariat for Research and Technology, Greece.

References

[1] M. Ajtai, R. Burns, R. Fagin, D. Long, and L. Stockmeyer. Compactly Encoding Unstructured Inputs with Differential Compression. Journal of the ACM, 39(3), 2002.
[2] E. Anderson, M. Hobbs, K. Keeton, S. Spence, M. Uysal, and A. Veitch. Hippodrome: Running Circles Around Storage Administration. In Proceedings of the FAST '02 Conference on File and Storage Technologies (FAST-02), pages 175–188, Berkeley, CA, Jan. 28–30 2002. USENIX Association.
[3] M. Burrows and D. J. Wheeler. A Block-Sorting Lossless Data Compression Algorithm. Technical Report 124, 1994.
[4] R. Coker. Bonnie++. http://www.coker.com.au/bonnie++.
[5] L. P. Cox, C. D. Murray, and B. D. Noble. Pastiche: Making Backup Cheap and Easy. In Proceedings of the 5th Symposium on Operating Systems Design and Implementation (OSDI-02), Berkeley, CA, Dec. 9–11 2002. The USENIX Association.
[6] W. de Jonge, M. F. Kaashoek, and W. C. Hsieh. The Logical Disk: A New Approach to Improving File Systems. In Proc. of 14th SOSP, pages 15–28, 1993.
[7] E. K. Lee and C. A. Thekkath. Petal: Distributed Virtual Disks. In Proceedings of ASPLOS VII, Oct. 1996.
[8] EMC. SnapView data sheet. http://www.emc.com/pdf/products/clariion/SnapView2 DS.pdf.
[9] S. C. Esener, M. H. Kryder, W. D. Doyle, M. Keshner, M. Mansuripur, and D. A. Thompson. WTEC Panel Report on The Future of Data Storage Technologies. International Technology Research Institute, World Technology (WTEC) Division, June 1999.
[10] GartnerGroup. Total Cost of Storage Ownership – A User-oriented Approach, Sept. 2000.
[11] G. A. Gibson, D. F. Nagle, K. Amiri, J. Butler, F. W. Chang, H. Gobioff, C. Hardin, E. Riedel, D. Rochberg, and J. Zelenka. A Cost-Effective, High-Bandwidth Storage Architecture. In Proc. of the 8th ASPLOS, Oct. 1998.
[12] G. A. Gibson and J. Wilkes. Self-managing Network-attached Storage. ACM Computing Surveys, 28(4es):209–209, Dec. 1996.
[13] J. Gray. What Next? A Few Remaining Problems in Information Technology (Turing Lecture). In ACM Federated Computer Research Conferences (FCRC), May 1999.
[14] D. Hitz, J. Lau, and M. Malcolm. File System Design for an NFS File Server Appliance. In Proceedings of the Winter 1994 USENIX Conference, pages 235–246, 1994.
[15] N. C. Hutchinson, S. Manley, M. Federwisch, G. Harris, D. Hitz, S. Kleiman, and S. O'Malley. Logical vs. Physical File System Backup. In Proc. of the 3rd USENIX Symposium on Operating Systems Design and Impl. (OSDI-99), Feb. 1999.
[16] J. Kubiatowicz, D. Bindel, Y. Chen, P. Eaton, D. Geels, R. Gummadi, S. Rhea, H. Weatherspoon, W. Weimer, C. Wells, and B. Zhao. OceanStore: An Architecture for Global-scale Persistent Storage. In Proceedings of ACM ASPLOS, November 2000.
[17] M. Lesk. How Much Information Is There In the World? http://www.lesk.com/mlesk/ksg97/ksg.html, 1997.
[18] M. Icaza, I. Molnar, and G. Oxman. The Linux RAID-1,-4,-5 Code. In LinuxExpo, Apr. 1997.
[19] J. Moran, B. Lyon, and L. S. Incorporated. The Restore-o-Mounter: The File Motel Revisited. In Proc. of USENIX '93 Summer Technical Conference, June 1993.
[20] Namesys. ReiserFS. http://www.namesys.com.
[21] M. A. Olson. The Design and Implementation of the Inversion File System. In Proc. of USENIX '93 Winter Technical Conference, Jan. 1993.
[22] D. Patterson. The UC Berkeley ISTORE Project: bringing availability, maintainability, and evolutionary growth to storage-based clusters. http://roc.cs.berkeley.edu, January 2000.
[23] R. H. Patterson, S. Manley, M. Federwisch, D. Hitz, S. Kleiman, and S. Owara. SnapMirror: File-System-Based Asynchronous Mirroring for Disaster Recovery. In Proceedings of FAST '02. USENIX, Jan. 28–30 2002.
[24] R. Pike, D. Presotto, K. Thompson, and H. Trickey. Plan 9 From Bell Labs. In Proc. of the Summer UKUUG Conference, 1990.
US NATIONAL OCEANOGRAPHIC DATA CENTER ARCHIVAL MANAGEMENT PRACTICES AND THE OPEN ARCHIVAL INFORMATION SYSTEM REFERENCE MODEL

Donald W. Collins
US Department of Commerce/NOAA/NESDIS
National Oceanographic Data Center
1315 East West Highway, SSMC3 Fourth Floor
Silver Spring, MD 20910-3282
Donald.Collins@noaa.gov
+1-301-713-3272 +1-301-713-3302

Abstract

This paper describes relationships between the Open Archival Information System Reference Model (OAIS RM) and the archival practices of the NOAA National Oceanographic Data Center (NODC). The OAIS RM defines a thorough approach to defining the processes, entities, and framework for maintaining digital information in an electronic archival environment without defining how to implement the framework. The NODC Archival Management System (AMS) is an example of an implementation of a persistent digital archive. Major OAIS RM components, such as the Submission Information Package, Archival Information Package, Dissemination Information Package, and Archival Storage, are clearly comparable between the OAIS RM and the NODC AMS.
The main participants (Producer, Consumer, Management, and OAIS) are represented in the NODC AMS, as are many primary functions (Ingest Process, Archive Process, Dissemination Process). Some important OAIS RM components, such as a consistent Submission Agreement and a deeper level of Preservation Description Information may be missing for some of the information archived in the NODC AMS. It is instructive to document the commonalities between the NODC system and the OAIS RM as the NOAA National Data Centers expand archival services for a broad and growing range of digital environmental data. 1 Introduction Imagine that you are on a large ship on a calm sea with no land in sight and 3 miles of water beneath your feet. You are a marine biologist studying the habitat of the giant squid, so you want to know the range of water temperatures and salinity values, the available nutrients eaten by local microscopic organisms, what those organisms are and what visible-size sea life might be eating them. These pieces of information can be collected by a host of instruments you deploy over the side of the ship and lower through the water. The ship stops periodically to collect these measurements, which are electronically recorded into a series of data files that you will use later in the lab to characterize the conditions in which the giant squid lives. If you received funding for your research from the US Federal government, you are likely to be contractually obligated to send a copy of the data you collected to the National Oceanographic Data Center (NODC), along with enough descriptive metadata to make the data meaningful to others. The NODC is one of several data centers operated by the 329 U.S. Department of Commerce National Oceanic and Atmospheric Administration (NOAA). The NODC is believed to archive the largest collection of in situ measurements of oceanographic parameters in the world, with approximately 300 gigabytes of data, metadata, and model output stored in its digital archives (D. Knoll, NODC, pers. comm., 2003). The NODC has recently undertaken a substantial effort to update and improve the archival management of these electronic environmental records and to improve access to the original data records in its collection. This paper examines how these new record management strategies and procedures relate to the Open Archival Information System Reference Model (OAIS RM) [1], ISO 14721:2002, which defines the major elements and functions of an electronic records archive. To explore the relationship between the OAIS RM and the new NODC Archives Management System (AMS), the first part of this paper provides an overview of the main features of the OAIS RM. The second part of the paper describes in some detail the NODC AMS in terms of the conceptual elements of the OAIS Reference Model. The last part of the paper discusses OAIS RM elements that are not yet implemented by the NODC AMS. 2 OAIS Reference Model The OAIS Reference Model was initially developed by the Consultative Committee for Space Data Systems to identify the essential elements needed by an electronic records archive to manage records over long time scales [2]. The OAIS RM acknowledges technological changes that inevitably create substantial challenges for electronic records archives [3]. As described by Lavoie [4], an OAIS means "any organization or system charged with the task of preserving information over the Long Term and making it accessible to a specified class of users (known as the Designated Community)." 
In keeping with the practice established in the OAIS RM documentation, entities and procedures named in the model and discussed below will have initial capital letters (e.g., Producer). By definition in the OAIS RM (p. 1-11), data are archived for the Long Term, which is defined as "[a] period of time long enough for there to be concern about the impacts of changing technologies, including support for new media and data formats, and of a changing user community, on the information being held in a repository. This period extends into the indefinite future." The main participants in any OAIS are defined as Producer, Consumer, Management, and the OAIS archive functions. In the OAIS RM, Producer represents the people or systems that create and/or provide information to be preserved in the archive. Consumer represents the people or systems that use the OAIS to find and retrieve preserved information. A Designated Community may be a special subset of the domain of all Consumers. Management is the person or group that sets the policy for the management of the OAIS, among other activities: it is not responsible for the day-to-day operations of the OAIS. The OAIS (archive) is the system that provides for the long term storage, migration, and dissemination of the information that is archived. Using this model and the example in the Introduction to this paper, the following roles could be identified: • Producer: Ocean scientists (e.g., squid specialist) making measurements of physical, chemical, and biological conditions; 330 • • • Consumer (Designated community): other ocean scientists, resource managers, general public; Management: US National Oceanographic Data Center Director and Deputy Director; OAIS (archive): US National Oceanographic Data Center Archive Management System. One of the basic concepts of the Reference Model is that information is a combination of data and its Representation Information. Regardless of whether the data part of the information is physical (e.g., a giant squid) or digital (e.g., digital photographs of the giant squid), Representation Information allows the Consumer to fully interpret the meaning of the information. For digital objects, Representation Information typically includes some mapping of bits into recognizable data types (e.g., characters, integers). It also associates these mappings into higher-level groupings of data types, which are called Structure Information. To fully understand how to interpret the Structure Information, it is important to include Semantic Information, which defines the language of the Structure Information [5]. Another main component of the OAIS RM is the Information Object, which is comprised of the following components: • Content Information object, which is equivalent to the contents of a letter, email, observed data values, etc.; • Packaging Information object, which is information that wraps the different information objects into a cohesive bundle; • Descriptive Information object, which provides descriptive content and contextual information about the Content Information object; • Preservation Description Information object, which includes significant information about the electronic form and structure of the bits that can be translated into the Content Information object. It also includes information about provenance and authenticity validation characteristics of the Content Information object. 
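To make the package structure above concrete, the sketch below models the four component objects as a plain Python record. Only the component names come from the Reference Model; the class name, field layout and example values are illustrative and not part of the standard.

    from dataclasses import dataclass

    @dataclass
    class InformationPackage:
        """Illustrative container for the four OAIS component objects."""
        content_information: dict        # the data object plus its Representation Information
        packaging_information: dict      # how the components are bound into one bundle
        descriptive_information: dict    # discovery-oriented description of the content
        preservation_description: dict   # provenance, fixity, context and reference information

    # A toy SIP for the squid-habitat example from the Introduction.
    sip = InformationPackage(
        content_information={"files": ["cast001.csv"], "representation": "comma-separated ASCII"},
        packaging_information={"bundle": "single tar file"},
        descriptive_information={"title": "Water column measurements, giant squid habitat study"},
        preservation_description={"provenance": "research cruise, 2003",
                                  "fixity": {"cast001.csv": "md5 checksum goes here"}},
    )
    print(sip.descriptive_information["title"])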
The OAIS RM distinguishes between Information Objects based on the role that the Information Object performs in the information management process. Three specific types of Information Object are designated: the Submission Information Package (SIP), the Archival Information Package (AIP), and the Dissemination Information Package (DIP). Each Information Package contains the four information objects defined above. Figure 1 depicts the relationship between the components of an Information Package and between SIP, AIP, DIP, and Management (after Sawyer [6]). A Producer creates electronic records: the contents are the property of the Producer and may be in any format that is deemed useful by the Producer. Once the Producer decides to transfer the data to the OAIS, a Submission Agreement is negotiated between Producer and OAIS to define the terms of the information transfer. The OAIS may help the Producer in developing the SIP by providing information, tools, or other assistance in preparing the contents of the SIP, especially the Preservation Description Information 331 component. The SIP contains Content Information, but must also contain sufficient metadata to ensure that the Content Information can by maintained properly by the OAIS and be used by future Consumers [7]. The SIP is then transferred to the OAIS in one or more Data Submission Sessions. Fig. 1. OAIS Reference Model Information Package components (After Sawyer, 2002.) Upon receipt of the SIP, the OAIS creates an Archival Information Package (AIP) using the archives' ingest procedures. The OAIS defines the appropriate ingest procedures for creating the AIP according to the archives policies and guidelines. The OAIS may modify the form and content of the SIP: "An OAIS is not always required to retain the information submitted to it in precisely the same format as in the SIP. Indeed, preserving the original information exactly as submitted may not be desirable [8]". The intention here is to ensure the preservation of digital information, not to modify or tamper with the digital content. Once the SIP is transformed into the corresponding AIP and Package Descriptions (i.e., the information needed to make an AIP accessible from appropriate Access Aids) during the ingest process, the AIP is stored in an Archival Storage entity. Ingest processes, Package Descriptions, and Archival Storage hardware and software may vary significantly from one OAIS to another. The OAIS determines how to make the AIPs in its collection available to its Designated Community. To perform this function, OAIS defines a Dissemination Information 332 Package (DIP). The DIP is described in the Package Description, found using Access Aids provided by the OAIS. Access Aids are the tools provided to discover and obtain AIPs from the OAIS. Finding Aids, Collection Descriptions, Ordering Aids and other data discovery tools are types of Access Aids. When a user discovers the existence of archived materials through available Access Aids, the selected AIPs are assembled into a Dissemination Information Package (DIP) and transferred to the Consumer via a Data Dissemination Session. The structure and mechanism for delivering a DIP to Consumers depends on the way the archival organization creates its DIPs [9]. 
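Read as a workflow, the three package types trace a simple path: a SIP arrives in a Data Submission Session, ingest turns it into an AIP plus a Package Description held in Archival Storage, and a Consumer order assembles a DIP. The few lines below sketch that flow under obviously simplified assumptions (an in-memory store and invented function names); they are a reading aid, not an implementation of the model.

    def ingest(sip, archival_storage, package_descriptions):
        """Create an AIP from a SIP and register its Package Description (simplified)."""
        aip_id = f"AIP-{len(archival_storage) + 1:07d}"
        # The OAIS may normalize the submitted form; here the content is simply carried over.
        archival_storage[aip_id] = {"content": sip["content"], "pdi": sip.get("pdi", {})}
        package_descriptions[aip_id] = sip.get("description", {})
        return aip_id

    def disseminate(aip_ids, archival_storage):
        """Assemble a DIP from selected AIPs for a Data Dissemination Session (simplified)."""
        return {"packages": {aip_id: archival_storage[aip_id] for aip_id in aip_ids}}

    storage, descriptions = {}, {}
    new_aip = ingest({"content": ["cast001.csv"], "description": {"title": "CTD casts"}},
                     storage, descriptions)
    dip = disseminate([new_aip], storage)
    print(new_aip, sorted(dip["packages"]))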
This introduction to the main components of an OAIS only scratches the surface of an extremely detailed and well-defined set of terms, objects, concepts, and procedures that an electronic archives needs to address to ensure the preservation of data and information for an indeterminate "Long Term". The next section of this paper describes the archival practices established by the US NODC and relates many of the functions, processes and information objects of the new NODC Archives Management System to the components of the OAIS Reference Model.

3 NODC Archives Management System: A Case Study

The US National Oceanographic Data Center (NODC) was created as an interdepartmental support organization administered by the US Navy in 1960 and transferred to the Environmental Science Services Administration (ESSA) in the mid-1960s. When the National Oceanic and Atmospheric Administration (NOAA) was created in 1970, the NODC became one of the three environmental data centers administered by NOAA. The NODC receives oceanographic data from a diverse community of international oceanographic organizations, government organizations (federal, state, and local), and public and private universities and research institutions from around the world. The primary commonality among these organizations and the data they provide to the NODC archives is that the information is somehow related to the world's oceans, seas, and coastal areas. Data formats and structures, languages used, and the types of data collected can be extremely variable.

Most oceanographic data collected in the past decade or two are provided to NODC in digital data files. As technology changes, new equipment for obtaining, storing and organizing measurements is invented. Data formats and structures change to accommodate new measurable values and techniques. The only constant throughout these technological changes is the need for accurate metadata about the instruments used (e.g., calibration and methodologies), data format structure and other documentation.

By 1967, the NODC recognized the need to improve tracking of data that were sent to the NODC. A data set identification system was developed in which groups of data were assigned an NODC Accession Number to identify the information as a unit. An accession is loosely defined as 'a logical grouping of related data,' which is usually interpreted as a group of data that are received together. Typical examples of the types of data in an accession include: in situ water column measurements (e.g., water temperature and salinity, nutrient concentrations such as dissolved nitrate and silicate, current speed and direction), biological observations (e.g., abundance and taxonomic identification of plankton and fish species), or satellite observations of sea surface characteristics. As of December 31, 2003, there were 20,419 individual accessions in the NODC archives. New accessions are presently acquired at a rate of about 30 per month. During a recent month, more than 62 Gigabytes of data were downloaded by 1088 individual host computers connected to online services that access NODC information, holdings and products created from those holdings (NODC Information Systems and Management Division, unpublished report).

The NODC recently developed the NODC Archives Management System (AMS) to bring the electronic and analog data, metadata, and administrative information files for oceanographic data collections into a more robust and flexible environment.
The Accession Tracking Data Base (ATDB), the Archive File Management System (AFMS), the Ocean Archive System and the NODC Metadata Repository (NMR) are the primary components of the Archives Management System. The NODC Metadata Repository is a commercially-available database (using a proprietary structure and database developed by Blue Angel Technologies, Inc. and Oracle Corp.) that is optimized for managing data set descriptions in the Federal Geographic Data Committee (FGDC) Content Standard for Digital Geospatial Metadata (CSDGM) structure. While this component will soon provide descriptive metadata to assist with search and retrieval processes, the underlying software and database of this component of the AMS are beyond the scope of this paper. The emphasis in the next few sections is to describe the AMS components designed, developed and deployed within NODC on generic computer hardware using open source software for the operating system, database design and management, and preliminary data entry, search, and retrieval requirements. The NODC AMS identifies a series of information elements, procedures, and standardized practices that facilitate ingesting, describing, accessioning, storing, migrating, and accessing archived digital records and information. These functions are discussed below and are related to relevant OAIS RM concepts or constructs. Figure 2 depicts the generic flow of an Information Package through the Archives Management System. 3.1 Ingest Procedures The NODC frequently works with data Producers to prepare their information for archiving, i.e., to create a more meaningful Submission Information Package. At the NODC, the SIP is equivalent to a single accession as it is received from the Producer (internally, NODC currently refers to "originators data" rather than "a SIP"). During this interaction, the NODC encourages the data Producer to provide the most generic representation of their data possible and to provide as much descriptive documentation as possible. The reason for this is to obviate the need for NODC to undertake after-the-fact translation tasks to represent data that are in software and/or platform dependent data structures into less dependent structures. In many instances, these translations can be undertaken with little or no loss of meaning for the data. The NODC prefers to have the 334 data Producer make these translations whenever possible, rather than making such translations during the ingest process. Fig. 2. NODC AMS flow diagram for an Information Package. As shown in Figure 2, a Producer collects data and prepares it in some fashion for shipment to the NODC. Of course, the method of "shipment" and the medium used in this Data Submission Session has changed substantially over time from mailing analog records, computer punch cards, or magnetic tapes to FTP transfers, web site downloads, or sending CD-ROMs or DVDs. Ingest procedures now in place for each Data Submission Session are: • obtain and review the data/information files and/or other information objects in the SIP from the Producer; • create a new record in the NODC ATDB (described below), which establishes the 'canonical structure' for the accessioned SIP in a preparation area; • copy/move digital data files to the appropriate directory of the Archives File Management System (described below); • request that the ATDB record be closed; • review of 'closed' accession record and SIP by 'Data Officer'; • transfer approved SIP to the archival storage area (creation of the AIP). 
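Taken together, the steps above form a short, strictly ordered pipeline, and much of the AMS bookkeeping is simply recording which step an accession has reached. A minimal sketch of that sequencing follows; the step list paraphrases the procedure above, and the function and its logging are illustrative only.

    INGEST_STEPS = (
        "obtain and review the submitted files and information objects (the SIP)",
        "create the ATDB record and the canonical directory under /ingest",
        "copy or move the digital files into the file management system",
        "request that the ATDB record be closed",
        "Data Officer review of the closed record and the SIP",
        "transfer the approved SIP to archival storage, creating the AIP",
    )

    def run_ingest(accession_number, log=print):
        """Walk one accession through the Data Submission Session steps in order (illustrative)."""
        for step_number, step in enumerate(INGEST_STEPS, start=1):
            log(f"accession {accession_number:07d}: step {step_number}/{len(INGEST_STEPS)} - {step}")

    run_ingest(1)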
335 Once a SIP is ingested and accessioned and an AIP is created, the data in it may be processed into an NODC product (e.g., the NODC World Ocean Database, which contains more than 7 million temperature, salinity, and other parameter profiles) or delivered as a direct copy (DIP) using the NODC Ocean Archive System, an online discovery and retrieval service (OAIS Access Aid) described in greater detail below. 3.2 Accession Tracking Data Base The Accession Tracking Data Base (ATDB) is a relational database that supports initial SIP ingest procedures and administrative metadata management. The ATDB uses postgreSQL database management software on a generic server platform running the Linux operating system. Generically-designed browser-based user interfaces were developed using perl/CGI scripts to facilitate data entry and limited search/retrieval capabilities. Linux, postgreSQL, and perl are all freely available open source software designed to operate on a number of hardware platforms. One of the guiding principles for developing the ATDB and AMS was to determine if open source software tools were robust enough to support a mission critical information management system, rather than relying on commercial packages that require continuous license upkeep, maintenance and other expenses. The initial results suggest that the open source tools are more than adequate for the purposes of creating a workable file management system and a database to manage thousands of digital files. The ATDB represents several components of the Administration function in the OAIS RM [10]. A 'Brief Access Record' (BAR) is created for each new SIP using the main table in the ATDB. The BAR allows a new entry to be made in the ATDB based on a relatively few administrative and descriptive metadata elements. Administrative metadata elements keep track of an accession in the NODC File Management System and other pieces of internal information of importance to NODC but that may not necessarily be useful for Consumers. Administrative metadata, combined with descriptive metadata elements (discussed later) are roughly equivalent to the Package Description in the OAIS RM. Administrative metadata elements in the ATDB are: • accession number (unique identifier for an Archival Information Package, automatically generated by the database), • date received (date the SIP was received at the NODC), • keyer, editor (name of the NODC employee creating the ATDB entry and making the most recent change to the ATDB entry, selected from a controlled vocabulary) • keydate, editdate (system-assigned date-time stamp of ATDB entry creation and latest ATDB entry change), • status (information indicator that denotes the 'state' of the Archival Information Package, i.e., 'new', 'archived', 'revision'), • version (information indicator that denotes the most recent version of the AIP), • availability date (date after which this AIP can be made accessible to the public; default value is the same as the 'date received'), • requested action (information indicator to request action from the Data Officer function, i.e., 'close' ('assess this SIP and metadata for inclusion in the archives') or 'open' ('check the AIP out of the archive for revision, creating a new version of the AIP'), 336 • NODC contact (name of the NODC employee who is most familiar with the contents of this SIP or AIP. May be different than the 'keyer' or 'editor'). The ATDB also requires the inclusion of a minimal amount of descriptive metadata about the SIP. 
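Because the ATDB is a small relational database, the Brief Access Record amounts to one table whose columns are the administrative elements just listed. The sketch below creates such a table; the table and column names are hypothetical (derived from the element names above), and sqlite3 is used only to keep the example self-contained where NODC actually runs PostgreSQL.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE brief_access_record (
            accession_number  INTEGER PRIMARY KEY,  -- unique AIP identifier, assigned by the database
            date_received     TEXT,
            keyer             TEXT,
            editor            TEXT,
            keydate           TEXT,                 -- creation timestamp
            editdate          TEXT,                 -- latest-change timestamp
            status            TEXT,                 -- 'new', 'archived' or 'revision'
            version           INTEGER,
            availability_date TEXT,                 -- defaults to date_received in the real system
            requested_action  TEXT,                 -- 'close' or 'open'
            nodc_contact      TEXT
        )
    """)
    conn.execute(
        "INSERT INTO brief_access_record (date_received, keyer, status, version, requested_action)"
        " VALUES (?, ?, ?, ?, ?)",
        ("2003-12-31", "keyer_name", "new", 1, "close"),
    )
    print(conn.execute("SELECT accession_number, status FROM brief_access_record").fetchall())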
Available descriptive information is passed from the ATDB via a program to the NODC Metadata Repository, creating a partially-complete FGDC CSDGM-structured description of this AIP. Descriptive metadata elements in the ATDB include a title, start and end dates, latitude and longitude bounding coordinates, and controlled-vocabulary descriptors for institution names, sea areas, parameter and observation types, instrument types, project names, and platform (ship) names.

About 13 staff members are designated as keyers, which means that they can create the initial BAR in the ATDB, although the majority of new BARs are created (at present) by 3 or 4 people. Four senior, non-management staff members are designated Data Officers. The Data Officers are responsible for reviewing the work of the keyers, approving new accessions for inclusion in the archives, and developing related archival management policy recommendations for approval by NODC Management.

3.3 NODC Archives File Management System

Creating a new record in the ATDB causes a perl script to create a directory structure on a storage disk (represented in Figure 2 as "Temporary Archive" and referred to internally as the "ingest area" or "/ingest") in the canonical directory structure (Table 1). While still in the ingest area, it is possible to modify or add to the contents of a SIP. Once the contents of the SIP are finalized, a program is run that copies files from the ingest area to the "archive area" or "/archive". This Archival Storage is presently provided by a large capacity RAID device with limited write privileges, i.e., only the archivist function (in most cases, via a program) can write files to this system. The canonical structure of an NODC AIP is represented in Table 1.

Table 1. NODC AMS File Management System canonical structure. Elements listed in italics are only part of the final archival version of the file structure. A "/" indicates the listed element is a directory.

7-digit unique number/     NODC Accession number (e.g., 0000001)
0000001.01-version.md5     File containing checksum values for all files in this AIP.
1-version/                 Directory identifying the most recent version of this AIP, beginning with 1-version.
NODC_ReadMe.txt            Text file that describes this directory structure.
about/                     Directory for storing information files created by NODC about this AIP.
journal.txt                NODC-created file describing actions taken by NODC staff regarding this AIP.
other files…               Optional other files created and/or maintained that are not part of the data in this AIP.
data/                      Directory for storing the original data and translations of the original data in the AIP.
0-data/                    Directory for storing an exact copy of the original files in this AIP.
directories or files       Copy of the original files obtained from the Producer.
1-data/                    Directory for storing translated versions of the original files in this AIP.
directories or files       Translated versions of the original files in this AIP in a directory structure that may resemble the original directory structure.

The main sections of the canonical AIP structure are /about and /data. Any additional information about the SIP, such as emails between NODC and Producer or other NODC-created information about the SIP, is placed in the /about subdirectory. The journal.txt file, which is initially created along with the directory structure, is used to document any steps taken by NODC personnel while performing ingest processes on the SIP.
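A few lines of Python make Table 1 concrete: create the canonical tree for a new accession in the ingest area, start its journal.txt, and write the accession-level .md5 manifest over the files in the version directory. This is a sketch that follows the layout in Table 1, not NODC's actual perl ingest script, and the nesting of elements is inferred from the descriptions in the table.

    import hashlib
    from pathlib import Path

    def create_accession(ingest_root, accession_number, version=1):
        """Create the canonical directory tree for one accession (after Table 1)."""
        accession_dir = Path(ingest_root) / f"{accession_number:07d}"
        version_dir = accession_dir / f"{version}-version"
        for sub in ("about", "data/0-data", "data/1-data"):
            (version_dir / sub).mkdir(parents=True, exist_ok=True)
        (version_dir / "NODC_ReadMe.txt").write_text("Describes this directory structure.\n")
        (version_dir / "about" / "journal.txt").write_text("Journal of NODC actions on this accession.\n")
        return accession_dir, version_dir

    def write_md5_manifest(accession_dir, version_dir, version=1):
        """Write e.g. 0000001.01-version.md5 with one checksum line per file under the version tree."""
        lines = []
        for path in sorted(p for p in version_dir.rglob("*") if p.is_file()):
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            lines.append(f"{digest}  {path.relative_to(accession_dir)}")
        manifest = accession_dir / f"{accession_dir.name}.{version:02d}-version.md5"
        manifest.write_text("\n".join(lines) + "\n")
        return manifest

    accession_dir, version_dir = create_accession("ingest", 1)
    print(write_md5_manifest(accession_dir, version_dir))

The manifest written this way can later be recomputed and compared to detect undocumented changes, which is essentially the Fixity role discussed later in the paper.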
At present, NODC guidelines require that an exact copy of the original files in a SIP are copied to the /0-data directory, regardless of their original file structure. The /1-data directory is where NODC-created translations of the original SIP contents may be placed. The intention of this directory is to provide a place for non-proprietary representations of proprietary original data from the SIP. A typical example of the contents of /1-data might be the comma-separated ASCII representation of the text and values from a Microsoft Excel spreadsheet file. In an effort to minimize the probable lack of access to this file due to program changes or loss of vendor support for the program in the future, NODC attempts to translate files from proprietary structures to generic structures. 3.4 Archiving Procedures The NODC continues to refine its archiving procedures for digital accessions. Archiving processes are undertaken by the employees designated as Data Officer. As noted above, Data Officer tasks (see Figure 2) include reviewing the work of the keyers and determining if an accession is ready to be moved from the /ingest area to the /archive area. There are currently only four criteria that the Data Officer uses to make this determination: • there is a fully-populated ATDB entry, • the journal.txt file has been updated, • the files in the /ingest directory are in the canonical form (Table 1), • a "reasonable attempt" has been made to translate data from proprietary formats to generic formats and translations are placed in /1-data. When a keyer determines that the ingest-AIP needs no further action, the ATDB element ‘requested action’ is set to ‘Close’. This indicates to the Data Officer that the SIP is ready for review. If the four criteria listed above are met, the Data Officer approves the transfer of the SIP from the ingest area to the Working Archive Directories (Archival Storage), creating the 'working archive' copy of the AIP (Figure 2). The program that transfers files from /ingest to /archive calculates a checksum value for each file, which can be used for future validation of the contents of the AIP, and also runs additional virus-detection software to minimize the chance of archiving a virus with the AIP. 338 As noted in Figure 2, there are also "Deep Archive processes" that include the creation and validation of off-site copies of the AIP (one is maintained in Asheville, NC at the National Climatic Data Center and one will soon be maintained at the NODC National Coastal Data Development Center in Stennis Space Center, MS). These 'deep archive' copies are intended for use in disaster recovery situations or when the local 'working archive' copy is rendered temporarily unavailable due to equipment malfunction or other reasons. This backup process represents the Replication function described in the OAIS RM [11]. Figure 2 also shows a 'New Version processes' step. These processes, which require approval from the Data Officer, are used when a new version of an existing AIP is required. New versions are occasionally requested by a Producer, usually when an error is found in a previously submitted data set. In cases where a new version is requested, an exact copy of the existing AIP is 'checked out' of the archive area and placed back in the ingest area. Modifications are then made to the data files and/or metadata files by the editor who requested the new version. 
When all modifications have been made, the same approval process is followed by the Data Officer, with special attention paid to the documentation of what modifications were made, why they were made, and who made them. The entire AIP is then 'checked in' to the Working Archive Directories as "/2-version" (or the next available version number). In this fashion, NODC maintains a copy of each iteration of a specific data set. Circumstances for determining whether an update SIP should be a new version of an existing AIP or should be a new AIP are decided on a case-by-case basis by the Data Officer and Management. The current philosophy adopted by NODC management is that it is better to keep everything, including obsolete versions, than to be unable to provide back to the originator an exact copy of everything that was given to the Center.

3.5 Metadata Management

Descriptive metadata for new AIPs, in the form of FGDC CSDGM-compliant database records, are presently created and/or maintained in two places: the ATDB Brief Access Record and the NODC Metadata Repository (NMR). The BAR is discussed in detail above. The NMR is beyond the scope of this paper, but will be used to manage additional descriptive metadata for each AIP in the NODC collection. There is some overlap between the information in each of these databases. NODC maintains several authority tables (e.g., people, ship names, institutions, place names) in the ATDB to facilitate the creation of consistent descriptions of an AIP. Controlled vocabulary entries will only be updated through the ATDB. Details of how to propagate any updates efficiently into the NMR are still in development. Ultimately, NODC plans to maintain a comprehensive accession tracking and descriptive metadata database from which any number of descriptive information formats (e.g., FGDC, ISO 19115, Dublin Core) could be exported.

As mentioned above, descriptive metadata is part of the OAIS concept of Package Description. The OAIS RM requires a substantial level of semantic and syntactic metadata to be maintained in the Preservation Description Information (PDI) object of the AIP. At present, this level of metadata and the existence of a PDI object in an AIP depend on the inclusion of such information in the SIP from the Producer. NODC is investigating how to create or refine the contents of this critical information element.

3.6 Data Dissemination

The NODC distributes two general types of Dissemination Information Package (DIP). One type of DIP is an exact copy of an AIP, with no additional processing beyond the possible translations described above. The other type of DIP distributed by the NODC is a Derived AIP, typically data from multiple AIPs processed by a product developer (usually within NODC, but occasionally by an external organization) to create a value-added data set, referred to as an NODC Standard Product. In most cases, all of the data in an NODC Standard Product have been reformatted to a single data format and possibly had some type of quality control checks performed, such as marking values that erroneously appear in a land area or a measured value that is outside the range of possible values for a parameter. The obvious advantage of these products is that similar types of data that were originally in a variety of data structures may now be easily inter-compared or otherwise statistically manipulated.

The OAIS RM categorizes Access Aids as a Finding Aid or an Ordering Aid.
A Finding Aid allows a consumer to search for and discover DIPs that are of interest to the consumer. Ordering Aids are applications that help the Consumer to obtain DIPs of interest and include information costs and other handling circumstances that may be needed to transfer a copy of the DIP to the consumer. Standard products from the NODC (primarily Derived AIPs that are copied to CD-ROM or DVD and mass-replicated) can be discovered through a number of Finding Aid-like applications (e.g., NOAAServer, an FGDC Clearinghouse application), ordered from the NOAA National Data Centers Online Store [12], or discovered and downloaded using the NODC Ocean Archive System (OAS) [13]. The OAS is the current interface available directly from the NODC for searching, discovering, and retrieving copies of original data from the AMS. The OAS interface presents several of the ATDB descriptive metadata elements in a tabular form and allows a Consumer to select one or more descriptive elements to create a database query. Once the query has been processed, a list of accession numbers and descriptors for AIPs that match the query are presented, in addition to links to the full ATDB record and to the File Management System working archive directory. The Consumer can assess the relevance of the contents of the AIP and save all or part of the AIP to a local directory. A small reference team is available to assist a Consumer in deciding whether there is a Standard Product that is more suitable for their needs or to assist with finding appropriate AIPs. In general, the majority of requests for data and information are satisfied by a Standard Product. However, as AIPs become readily available online DIPs through the OAS or other Access Aids, the demand for additional assistance for Consumers in deciding about appropriate data is expected to increase. In general, the Consumers for data from the NODC are: ocean scientists of all types from all over the world and all levels of government (the Designated Community), non-scientist business persons (often 340 lawyers and insurance representatives) and the general public (usually for K-12 educational purposes or recreational purposes). 4 What is missing? The NODC Archives Management System described above is still in development, although the ATDB and File Management System parts of the AMS have been used operationally since about April 2002. The Ocean Archives System Finding Aid was released for public use in December 2003. Additional descriptive metadata elements were added to the ATDB in mid-2003 to accommodate metadata from a legacy database. But as noted above, the links between the ATDB and the NODC Metadata Repository (NMR) are not yet fully developed, although the same information used to load the ATDB with legacy information was used to populate the NMR for each AIP. Many issues related to maintaining the referential integrity of the information that resides in both the ATDB and the NMR are still being examined. The discussion above outlines how the main elements of the NODC Archives Management System map to many of the major components of the OAIS Reference Model. However, some important OAIS RM components are missing from the NODC AMS. In particular, two areas that need additional work are the development of a Submission Agreement and identifying Detailed Preservation Information for each AIP in the AMS. 
In most cases, no officially-communicated Submission Agreement spells out the terms, conditions, and other responsibilities of the NODC to act as the Long Term custodian of an Information Package. The NODC negotiates and maintains data and information exchange agreements with a number of US and foreign organizations as a routine part of its archives efforts, but many data sets are submitted to the NODC with little more than an email from the Producer or designated intermediary and varying levels of descriptive metadata. It is unclear if these sometimes informal cover letters are sufficient to serve as an OAIS Submission Agreement. The NODC participates in the development of the NOAA Comprehensive Large Array-data Stewardship System (CLASS), which is developing a Submission Agreement on behalf of the NOAA National Data Centers [14]. The draft version of the CLASS Submission Agreement is not yet available for public review, but currently is modeled using the FGDC Content Standard for Digital Geospatial Metadata (CSDGM) as a framework for defining a variety of custodial descriptive metadata. It is not yet clear if a separate formal Submission Agreement will be developed to authorize transferring data and information from the Producer to the NODC in addition to the current draft Submission Agreement. Perhaps the most difficult missing element in the NODC AMS is the availability of detailed Preservation Description Information. The OAIS RM defines this as "[t]he information which is necessary for adequate preservation of the Content Information and which can be categorized as Provenance, Reference, Fixity, and Context Information [15]". Some of this missing information will be provided for future accessions if something like the draft CLASS Data Submission form accompanies each Submission Information Package, but what about the historic data that already reside in the NODC 341 collections? Approximately 2700 of the 20,000+ AIPs archived at NODC have some type of information about their provenance and context in a Data Documentation Form (DDF), which was a standard form used to document data submitted to the NODC from the early 1970s until the mid-1990s. The quantity and quality of the information in the DDFs varies significantly and is dependent on the effort made by the Producer to document the data; nearly all existing DDFs are analog documents stored either on-site at NODC or in an off-site storage facility. In OAIS terms, most historic AIPs at NODC have little or no level of Fixity Information (which "…authenticates that the Content Information has not been altered in an undocumented manner [16]"). Likewise, there are very few instances where Representation Information (i.e., "information that maps a Data Object into more meaningful concepts... [such as] the ASCII definition that describes how a sequence of bits is mapped into a symbol [17]") is present at all. The NODC is beginning to consider how to address these difficult information deficiencies as part of the planning to maintain its collections of data for the Long Term. 5 Conclusions The Open Archival Information Systems Reference Model describes a very thorough approach to defining the processes, entities, and framework for maintaining digital information in a electronic archival environment without defining how to implement the framework. The NODC Archival Management System provides an example of how to implement a persistent digital archive for oceanographic data. 
Many of the processes and components correlate well with elements of the OAIS Reference Model. Major components, such as the Submission Information Package, Archival Information Package, Dissemination Information Package, and Archival Storage are clearly comparable between the OAIS RM and the NODC AMS. The main participants (Producer, Consumer, Management, and OAIS) are also all present in the NODC AMS, as are many of the primary functions (Ingest Process, Archive Process, Dissemination Process). On the other hand, the NODC AMS is frequently lacking some important OAIS RM components, such as a consistent Submission Agreement and a deeper level of Preservation Description Information. This paper establishes a relationship between the OAIS Reference Model and the archival management practices of the NOAA National Data Centers. It is important to document the commonalities in the NODC system and the OAIS RM as NOAA and the NOAA National Data Centers continue to develop and upgrade archival services for a broad and growing range of digital environmental data. While there are many commonalities between the NODC AMS and the OAIS RM, the NODC (and by extension, the NOAA National Data Centers) needs to be aware of the types of information that are not presently available or actively acquired for data that are sent to be archived. It is imperative that these environmental data records not become the "write once, read never" records bemoaned by Barkstrom [18] because they will provide the baseline scientific data for future environmental investigations and future generations. 6 References Cited [1] Consultative Committee for Space Data Systems, 2002, Reference Model for an Open Archival Information System OAIS. Available online at 342 http://wwwclassic.ccsds.org/documents/pdf/CCSDS-650.0-B-1.pdf (last accessed April 2003). [2] Lavoie, Brian, 2000, Meeting the challenges of digital preservation: The OAIS reference model. OCLC Newsletter, No. 243 (January/February 2000), p. 26-30. Available online at http://www.oclc.org/research/publications/newsletter/repubs/lavoie243/ (last accessed April 2003). [3] Consultative Committee for Space Data Systems, 2002, p. 2-1. [4] Lavoie, Brian, 2000, p. 27. [5] Consultative Committee for Space Data Systems, 2002, p. 4-21. [6] Sawyer, Donald, 2002, ISO "Reference Model for an Open Archival Information System (OAIS)": Tutorial Presentation. Presentation to University of Maryland College of Information Studies, October 2002, 31p. [7] Consultative Committee for Space Data Systems, 2002, p. 4-49. [8] Consultative Committee for Space Data Systems, 2002, p. 4-49. [9] Consultative Committee for Space Data Systems, 2002, p. 4-52. [10] CCSDS, 2002, p. 1-7 and p. 4-10. [11] CCSDS, 2002, p. 5-5. [12] NOAAServer is available online at http://www.esdim.noaa.gov/noaaserver-bin/NOAAServer (last accessed January 2004). The NOAA National Data Centers Online Store is available online at http://www.nndc.noaa.gov/dev/prototype/nndcserver/nndchome.html (last accessed January 2004). [13] NODC Ocean Archive System can be accessed at http://www.nodc.noaa.gov/search/prod/ (last accessed January 2004). [14] Habermann, Ted, (in prep.), Comprehensive Large Array-data Stewardship System (CLASS) Data Product Submission Agreements, 20p. [15] CCSDS, 2002, p. 1-12. [16] CCSDS, 2002, p. 1-10. [17] CCSDS, 2002, p. 1-13. [18] Barkstrom, Bruce R., 1998, Digital archive issues from the perspective of an Earth science data producer. 
NASA Technical Report, NASA Langley Research Center Atmospheric Sciences Division, Available online from http://techreports.larc.nasa.gov/ltrs/papers/NASA-98-dadw-brb/ (last accessed April 2003). Acknowledgement The author would like to thank Lauren Brown and the students in the “Seminar in Archives, Records, and Information Management” course at the University of Maryland College of Information Studies for their constructive comments. Also, many thanks to Kurt Schnebele, Tony Picciolo, Steve Rutz, Mary Lou Cumberpatch, Anna Fiolek, Bob Gelfeld and Donna Collins for their support, constructive observations and suggestions. 343 344 Storage Resource Sharing with CASTOR Olof Bärring, Ben Couturier, Jean-Damien Durand, Emil Knezo, Sebastien Ponce CERN CH-1211 Geneva 23, Switzerland Olof.Barring@cern.ch, Ben.Couturier@cern.ch, Jean-Damien.Durand@cern.ch, Emil.Knezo@cern.ch, Sebastien.Ponce@cern.ch tel: +41-22-767-3967 fax: +41-22-767-7155 Vitaly Motyakov Institute for High Energy Physics RU-142281, Protvino, Moscow region, Russia motyakov@mx.ihep.su tel: +7-0967-747413 fax: +7-0967-744937 Abstract: The Cern Advanced STORage (CASTOR) system is a hierarchical storage management system developed at CERN to meet the requirements for high energy physics applications. The existing disk cache management subsystem in CASTOR, the stager, was developed more than a decade ago and was intended for relatively moderate (large at the time) sized disk caches and request load. Due to internal limitations a single CASTOR stager instance will not be able to efficiently manage distributed disk cashes of several PetaBytes foreseen for the experiments at the Large Hadron Collider (LHC) which will be commissioned in 2007. The Mass Storage challenge comes not only from the sheer data volume and rates but also from the expected request load in terms of number of file opens per second. This paper presents the architecture design for a new CASTOR stager now being developed to address the LHC requirements and overcome the limitations with the current CASTOR stager. Efficient management of PetaByte disk caches made up of clusters of 100s of commodity file servers (e.g. linux PCs) resembles in many aspects the CPU cluster management, for which sophisticated batch scheduling systems have been available since more than a decade. Rather than reinventing scheduling and resource sharing algorithms and apply them to disk storage resources, the new CASTOR stager design aims to leverage some of the resource management concepts from existing CPU batch scheduling systems. This has led to a pluggable framework design, where the scheduling task itself has been externalized allowing the reuse of commercial or open source schedulers. The development of the new CASTOR stager also incorporates new strategies for data migration and recall between disk and tape media where the resource allocation takes place just-in-time for the data transfer. This allows for choosing the best disk and network resources based on current load. 345 1. Introduction The Cern Advanced STORage (CASTOR) system[1] is a scalable high throughput hierarchical storage system (HSM) developed at CERN. The system was first deployed for full production use in 2001 and is now managing about 13 million files for a total data volume of more than 1.5 PetaByte. The aggregate traffic between disk cache and tape archive usually exceeds 100 MB/s (~75% tape read) and there are of the order of 50,000 – 100,000 tape mounts per week. 
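For a rough sense of scale, the figures just quoted imply an average file size of a little over 100 MB and, at 100 MB/s of aggregate disk–tape traffic, roughly 8–9 TB moved per day. The numbers below are the ones quoted in the text; the arithmetic is only a back-of-the-envelope check.

    total_bytes = 1.5e15        # ~1.5 PetaByte of data under CASTOR management
    total_files = 13e6          # ~13 million files
    aggregate_rate = 100e6      # ~100 MB/s aggregate disk <-> tape traffic
    tape_read_share = 0.75      # ~75% of that traffic is tape reads

    print(f"average file size  ~{total_bytes / total_files / 1e6:.0f} MB")
    print(f"daily tape traffic ~{aggregate_rate * 86_400 / 1e12:.1f} TB/day "
          f"(about {tape_read_share:.0%} of it reads)")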
CASTOR is a modular and fault-tolerant system designed for scalable distributed deployments on potentially unreliable commodity hardware. Like other HSM systems (see for instance [2], [3], [4], [5]) the client front-end for file access consists of a distributed disk cache with file servers managed by the CASTOR stager component, and an associated global logical name-space contained in an Oracle database (MySQL is also supported). The backend tape archive consists of a volume library (Oracle or MySQL database), tape drive queuing, tape mover and physical volume repository. The next section lists some of the salient features of today's CASTOR system and the installation at CERN. Thereafter are listed some requirements that the CASTOR system has to meet well before the experiments at the Large Hadron Collider (LHC) start their data taking in 2007. The remaining sections (4 - 6) describe the new CASTOR stager (disk cache management component), which is being developed [6] to cope with the data handling requirements for LHC. It is in particular the expected request load (file opens) that requires some special handling. The design target is to be able to sustain peak rates of the order of 500 - 1000 file open requests per second for a PetaByte disk pool.

The new developments have been inspired by the problems arising with management of massive installations of commodity storage hardware. It is today possible to build very large distributed disk cache systems using low cost Linux fileservers with EIDE disks. CERN has built up farms with several 100s of disk servers with of the order of 1TB disk space each. The farming of disk servers raises new problems for the disk cache management: request scheduling, resource sharing and partitioning, quality of service management, automated configuration and monitoring, and fault tolerance to unreliable hardware. Some of those problems have been addressed and solved by traditional batch systems for scheduling of CPU resources. With the new CASTOR stager developments described in this paper, the CASTOR team leverages the work already done for CPU scheduling and applies it to disk cache resource management. This is achieved through a pluggable framework design where the request scheduling is delegated to an external component. Initially LSF from Platform Inc [7] and Maui from Supercluster [8] will be supported. The new system is still under development and only exists as an advanced prototype. The final system is planned for 2Q04.

Figure 1: CASTOR components and their interactions (schematic showing the application and RFIO API, the CASTOR name server, the stager, the rfiod disk mover, the Remote Tape Copy tape mover, tpdaemon (PVR), VMGR, VDQM, the disk cache, and their Oracle or MySQL databases)

2. The CASTOR architecture and capabilities

The CASTOR software architecture is schematically depicted in Figure 1. The client application should normally use the Remote File IO (RFIO) library to interface to CASTOR. RFIO provides a POSIX compliant file IO and metadata interface, e.g. rfio_open(), rfio_read(), rfio_write(), rfio_lseek(), rfio_stat(), etc. For bulk operations on groups of files, a low-level stager API is provided, which allows for passing arrays of files to be processed. The Name server provides the CASTOR namespace, which appears as a normal UNIX filesystem directory hierarchy. The name space is rooted with "/castor" followed by a directory that specifies the domain, e.g.
“/castor/cern.ch” or “/castor/cnaf.infn.it”. The next level in the name space hierarchy identifies the hostname (or alias) for a node with a running instance of the CASTOR name server. The convention is nodename = “cns” + “directory name”, e.g. “/castor/cern.ch/user” points to the CASTOR name server instance running on a node cnsuser.cern.ch. The naming of all directories below the third level hierarchy is entirely up to the user/administrator managing the sub-trees. The file attributes and metadata stored in the CASTOR name space include all normal POSIX “stat” attributes. The CASTOR name server assigns a unique 64 bit file identifier to all name space elements (files and directories). File class metadata associates tape migration and recall policies to all files. These policies include: • Number of tape drives to be used for the migration 347 • Number of copies required on different media. CASTOR supports multiple copies for precious data In order to avoid waste of tape media, CASTOR supports segmentation of large files over several tapes. The tape metadata stored in the CASTOR name space are currently: • The tape VID • Tape position: file sequence number and blockid (if position by blockid is supported by the tape device) • The size of the file segment • The data compression on tape There is also a ‘side’ attribute for (future) DVD or CD-RW support. The tape metadata will soon be extended to include the segment checksum. CASTOR does not manage the content of the files but a special “user metadata” field can be attached to the files in the CASTOR name space. The user metadata is a character string that the client can use for associating application specific metadata to the files. The RFIO client library interfaces to three CASTOR components: the name server providing the CASTOR namespace described in previous paragraph; the CASTOR stager, which manages the disk cache for space allocations, garbage collection and recall/migration of tape files; the RFIO server (rfiod), which is the CASTOR disk mover and implements a POSIX IO compliant interface for the user application file access. All tape access is managed by the CASTOR stager. The client does not normally know about the tape location of the files being accessed. The stager therefore interfaces with: • The CASTOR Volume Manager (VMGR) to know the status of tapes and select a tape for migration if the client created a new file or updated an existing one • The CASTOR Volume and Drive Queue Manager (VDQM), which provides a FIFO queue for accessing the tape drives. Requests for already mounted volumes are given priority in order to reduce the number of physical tape mounts • The CASTOR tape mover, Remote Tape COPY (RTCOPY), which is a multithreaded application with large memory buffers, for performing the copy between tape and disk • The RFIO server (rfiod) for managing the disk pools The CASTOR software is an evolution of an older system, SHIFT, which was CERN’s disk and tape management system for almost a decade until it was replaced by CASTOR in 2001. SHIFT was awarded ‘21st Century Achievement Award by Computerworld in 2001’ [9]. The CASTOR software has been compiled and tested on a large variety of hardware: Linux, Solaris, AIX, HP-UX, Digital UNIX, IRIX and Windows (NT and W2K). The CASTOR tape software supports DLT/SDLT, LTO, IBM 3590, STK 9840, STK9940A/B tape drives and ADIC Scalar, IBM 3494, IBM 3584, Odetics, Sony DMS24, STK Powderhorn tape libraries as well as all generic SCSI driver compatible robotics. 
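The name-space convention described above is mechanical enough to capture in a couple of lines: take the domain from the second path component and prepend "cns" to the third to obtain the name server host. A small illustrative sketch follows (the longer example path is invented, and aliases and error handling are ignored).

    def castor_nameserver_host(path):
        """Map a /castor path to its name server node via the 'cns' + directory-name convention."""
        parts = path.strip("/").split("/")
        if len(parts) < 3 or parts[0] != "castor":
            raise ValueError("expected a path of the form /castor/<domain>/<directory>/...")
        domain, directory = parts[1], parts[2]
        return f"cns{directory}.{domain}"

    print(castor_nameserver_host("/castor/cern.ch/user/j/jdoe/analysis.dat"))  # -> cnsuser.cern.ch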
Figure 2: The number of requests per day normalized to TB of disk pool space (y-axis: requests/day per TB; x-axis: days from 1 May to 1 June 2003). The plot shows one month of activity for 5 different stagers. ALICE, ATLAS, CMS and LHCb are the four LHC experiments whereas COMPASS is a running heavy ion experiment.

The CASTOR RFIO client API provides remote file access through both a POSIX I/O compliant interface and a data streaming interface. The former is usually used by random access applications whereas the latter is used by the tape mover and disk-to-disk copy. At the server side the data streaming mode is implemented with multiple threads overlaying disk and network I/O. When using the RFIO streaming mode the data transfer performance is normally only limited by hardware (see [10] for performance measurements). Parallel stream transfers are not supported since this is not required by High Energy Physics applications. An exception is the CASTOR GridFTP interface, which does support parallel stream transfers in accordance with the GridFTP v1.0 protocol specification.

3. Requirements for the LHC

CASTOR is designed to cope with the data volume and rates expected from the LHC experiments. This is frequently tested in so called 'data challenges' that the CERN IT department runs together with the experiments. In late 2002, the ALICE experiment ran a data challenge with a sustained rate of 280 MB/s to tape during a week [11]. The CERN IT department has also shown that the CASTOR tape archive can sustain 1 GB/s during a day [12]. In addition to the high data volume and rates generated by the LHC experiments, it is expected that the physics analysis of the data collected by LHC experiments will generate a very high file access request load. The four LHC experiments are large scientific collaborations of thousands of physicists. The detailed requirements for the LHC physics analysis phase are unknown but from experience with previous experiments it can be expected that:
• The data access is random
• Only a portion of the data within a file is accessed
• The set of files required by an analysis application is not necessarily known
• A subset of the files are hit by many clients concurrently (hotspots)
The files must therefore be disk resident and since a substantial part of the physics analysis will take place at CERN it will result in a high request load on the CASTOR stager. Since the detailed requirements are unknown it is difficult to give an exact estimate for the expected request load on CASTOR. In order to obtain a design target the activity of five running instances of the current CASTOR stager was observed during a month, see Figure 2. The rates are normalized to the size of the disk pools in TB. The figure shows a peak rate of about 25,000 requests/day (0.3 requests/second) per TB disk. The average was about 10 times lower: 3,400 requests/day per TB disk. The associated data rate to/from the disk cache depends on the file sizes and the amount of data accessed for each file.
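Converted to per-second figures and scaled up, these observed rates are what drive the design target discussed next; the quick calculation below simply restates that scaling, using the observed figures from the text and the same linear-scaling assumption made in the paper.

    peak_per_day_per_tb = 25_000   # observed peak, requests/day per TB of disk pool
    avg_per_day_per_tb = 3_400     # observed average, requests/day per TB
    pool_tb = 1_000                # a PetaByte-class disk pool

    peak_per_sec_per_tb = peak_per_day_per_tb / 86_400
    print(f"peak per TB      : {peak_per_sec_per_tb:.2f} requests/s")            # ~0.3
    print(f"peak for 1 PB    : {peak_per_sec_per_tb * pool_tb:.0f} requests/s")  # of the order of 300
    print(f"average for 1 PB : {avg_per_day_per_tb / 86_400 * pool_tb:.0f} requests/s")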
The data volume statistics from the period shown in Figure 2 are no longer available, but looking at a more recent period it was found that for the ATLAS stager the average data transfer per request is of the order of 20 MB (400 GB for a day with 20,000 requests), whereas for LHCb it was only 3 MB (50 GB for a day with 17,000 requests).

Assuming that the number of file access requests scales linearly with the size of the disk cache, the peak request rate for a petabyte disk cache would be of the order of 300 requests/second. The objective for the new CASTOR stager is therefore to be able to handle peak rates of 500 – 1000 requests per second. Under such load it is essential that the new system is capable of request throttling, which is not the case for the current CASTOR stager. It is also known that the current CASTOR stager catalogue will not scale to manage more than ~100,000 files in the disk cache. These shortcomings have already led to the deployment of many CASTOR stager instances, each with its own dedicated disk cache of moderate size (5-10 TB). The disadvantages are manifold, but the most important are:
• Each stager instance has its dedicated disk cache, and this easily leads to suboptimal use of the resources, where one instance runs at 100% and lacks resources while another instance may be idle.
• The configuration management becomes complex since each stager instance has its own configuration and the clients must know which stager to use.

4. New CASTOR stager architecture

The new CASTOR stager, which is currently being developed at CERN, is specifically designed to cope with the requirements listed in the previous section. This section briefly describes some of the salient features of the new architecture.

Figure 3: Overview of the new CASTOR stager architecture. The main components are the client application (RFIO/stage API), the request handler, the request repository and file catalogue (Oracle or MySQL), the master stager with external schedulers (LSF, Maui), the migration and recall components, the rfiod disk movers in front of the disk cache, and the CASTOR tape archive components (VDQM, VMGR, RTCOPY). Control and data traffic use TCP; notification uses UDP.

4.1. Overview

Figure 3 shows a schematic view of the new architecture. The client application uses the RFIO API or the lower level stager API (for multi-file requests) to submit the file access request. Although it is not shown in Figure 3, the CASTOR name server is called for all operations on the CASTOR name space, as was shown in Figure 1. The request handler, master stager and migration/recall components communicate through a database containing all file access requests and the file residency on disk (the catalogue). To avoid too frequent polling, unreliable notification is used between the modules when the database is being updated. The role of the request handler is to insulate the rest of the system from high request bursts. The master stager is the central component that processes the file access requests and delegates the request scheduling decision to an external module. The first version of the new CASTOR stager will have resource management interfaces to LSF [7] and Maui [8]. The migration/recall components retrieve the tape-related information and contact the CASTOR tape archive (see Figure 1) to submit the tape mount requests.

Along with the new stager developments, it was decided to improve the security in CASTOR. However, for CERN physics applications, confidentiality is not a main requirement, and encrypting all the data during transfer would add significant load on the CASTOR servers.
Therefore, strong authentication of users in RFIO, the stager and the CASTOR name server will be implemented using standard security protocols: Kerberos 5, Grid Security Infrastructure, and Kerberos 4 for compatibility with the current CERN infrastructure. Furthermore, the architecture of the new stager and its interface with RFIO will allow only authorized clients (i.e. clients whose file access request has been scheduled to run) to access the files on the disk servers managed by the stager.

Below follows a more detailed description of the components and concepts used in the design of the new CASTOR stager architecture.

4.2. Database centric architecture

In the new CASTOR stager, a relational database is used to store the requests and their history, to ensure their persistence. The stager code is interfaced with Oracle and MySQL, but the structure allows plugging in other relational database servers if necessary. Furthermore, the database is also used for the communication between the different components in the stager (request handler, stager, migration/recall, etc.).

This database centric architecture has many advantages over the architecture of the current stager. From a developer’s point of view:
• The data locking and concurrency when accessing the data are handled by the RDBMS: this makes the development and debugging of the code much easier.
• The database API is used to enter or read requests. The CASTOR application does not have to know whether the database is local or remote; the database API takes care of that (e.g. the Oracle API uses shared memory if the database is local or SQLNet otherwise).

From the CASTOR system administrator’s point of view:
• The database being the central component in the architecture, the coupling between the different components is very low.
• This, as well as the persistence of the requests and the proper use of status codes to identify their state, allows stopping and restarting the system, or specific components, at any time. The administration is made easier.

For the CASTOR user, the overall scalability of the system is greatly improved as:
• An RDBMS can cope with large amounts of data. If the database tables are properly designed, the system should not incur a huge performance penalty when dealing with a large number of requests.
• The stager software components are stateless, which allows running several instances on different machines without problems. The real limit to the scalability is the database server itself.
• The database software is prepared for high loads, and there are cluster solutions that can help cope with such loads.

Figure 4: Request handler architecture. Requests flow through the network interface, request decoding, request management and archiving components.

There are however some drawbacks to this architecture:
• The latency for each call is higher than with the current architecture, as the stager components have to access the database.
• The database server is a central point of failure and has to cope with a very high load. However, database technology is widely deployed and it is possible to get database clusters to mitigate that risk.

Furthermore, just using the database for communication between the components would force them to frequently poll the database, which creates unnecessary load and possibly increases the latency of the calls to the stager. Therefore, in the current stager prototype, a notification mechanism using UDP messages has been introduced. This mechanism is very lightweight and does not create tight coupling between the different components, as UDP is connectionless.
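A connectionless notification of the kind described above can be as simple as a single UDP datagram. The sketch below shows a sender side; the host, port and payload format are illustrative assumptions, not the actual CASTOR protocol.

```c
/*
 * Illustrative sketch of a lightweight, connectionless notification:
 * fire a single UDP datagram at a component so that it wakes up and
 * reads the request database.  Host, port and payload are assumptions.
 */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>

int notify_component(const char *host, unsigned short port, const char *msg)
{
    struct sockaddr_in addr;
    int sock = socket(AF_INET, SOCK_DGRAM, 0);
    if (sock < 0)
        return -1;

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);
    if (inet_pton(AF_INET, host, &addr.sin_addr) != 1) {
        close(sock);
        return -1;
    }

    /* Best-effort: if the datagram is lost, the receiver's infrequent
     * periodic database poll still picks up the new requests. */
    sendto(sock, msg, strlen(msg), 0, (struct sockaddr *)&addr, sizeof(addr));
    close(sock);
    return 0;
}

/* Example (hypothetical address and message):
 *   notify_component("192.0.2.10", 5015, "NEW_REQUESTS");
 */
```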
4.3. Request handler

The request handler handles all requests in CASTOR, from reception to archiving. The main purpose of the request handler is to shield the request scheduling from possible request bursts. The request handler is also responsible for decoding the requests and storing them in the database. At all stages of request processing, the requests are stored and updated in the database. The request handler architecture is sketched in Figure 4.

Figure 5: Request content. Each Request is associated with its Flags, its client information (IClient, IAuthentication, IAuthorization) and the Files it concerns; each File in turn has Copies, Segments and the corresponding TapeInfo (multiplicities 1..n and 1..m).

The Network Interface is responsible for handling new requests and queries coming from the clients. This is the component realizing the isolation against request bursts. It is designed to sustain high rates (up to 1000 requests per second). Therefore the only thing it does is to store the raw request as a binary object in a database for further processing by the request decoder. There are only two transactions on this table per request: one insert and one retrieve + delete. To guarantee the best insertion rate, this database instance may be deployed separately from the rest of the request repository.

The Request Decoding takes place asynchronously. The decoded requests are stored in a database where they can be easily used and queried by the rest of the system, especially by the master stager scheduling component. When a request is selected by the scheduling component for processing, it is moved to the Request Management components, which associate data with each request describing its status (flags), its clients (authenticated and authorized) and the tapes possibly required (see Figure 5). Finally, when the processing of a request is over, it is archived by the Archiving component.

The request handler manages two interfaces: the network interface that is used by external clients, and the interface to the Request Manager for internal communication with the other components of CASTOR:
• The network interface component allows clients to send requests to CASTOR. The API used is very much inspired by the SRM standard [13], so it will not be described in any detail here. Command line scripts are also provided.
• The internal interface has two parts:
  o The first part allows the stager component to handle requests and to update their metadata
  o The second part is used by the migration/recall component to get the list of tapes and segments of files to be migrated and to update the tape status.

4.4. Master stager and externalized scheduling plug-in interface

The master stager is central in the new architecture. It has two main functions:
• Manage all resident files and space allocations in the disk cache
• Schedule and control all client access to the disk cache

The master stager maintains a catalogue with up-to-date information on all resident files and space allocations in the disk cache. The catalogue maintains a mapping between CASTOR files (paths in the CASTOR name space) and their physical location in the disk cache. The mapping may be one-to-many since, for load-balancing purposes, there may be several physical replicas of the same CASTOR file. An important difference from the current CASTOR stager is that the catalogue is implemented with a relational database. Oracle and MySQL are supported, but other RDBMS systems can easily be interfaced. This ensures that the system will scale to large disk caches with millions of files.
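To make the one-to-many mapping concrete, the sketch below shows one possible in-memory shape of a catalogue record. All field, type and status names are illustrative; they are not the actual CASTOR catalogue schema.

```c
/*
 * Illustrative data-structure sketch (not the actual CASTOR catalogue
 * schema): one catalogue entry maps a CASTOR name-space path to the
 * physical replicas currently resident in the disk cache.
 */
#include <stdint.h>

enum replica_status { REPLICA_STAGED, REPLICA_STAGING_IN, REPLICA_BEING_MIGRATED };

struct disk_replica {
    char     disk_server[64];     /* e.g. "diskserver01.cern.ch" (hypothetical) */
    char     file_system[64];     /* mount point on that server                 */
    char     physical_path[256];  /* path of the copy in the disk cache         */
    enum replica_status status;
    struct disk_replica *next;    /* one-to-many: further replicas              */
};

struct catalogue_entry {
    uint64_t file_id;             /* unique id assigned by the CASTOR name server */
    char     castor_path[256];    /* e.g. "/castor/cern.ch/user/..."              */
    uint64_t size;
    struct disk_replica *replicas;  /* NULL if not currently disk resident        */
};
```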
When the master stager gets notifications from the request handler that there are new requests to process, it uses the request handler's internal interfaces to retrieve the requests from the database. The basic processing flows in the master stager are depicted in Figure 6:
• If the request is for reading or updating an existing CASTOR file, the master stager checks whether the requested file is already cached somewhere in the disk cache. If the file is already cached, the request is submitted to the external scheduler and queued for access to the file in the disk cache.
• If the request is for creating a new file, the entry is created in the CASTOR name space and the request is submitted to the external scheduler as in the previous case.
• If the request is for reading or updating an existing CASTOR file which is not already cached, the master stager marks the file for recall from tape and notifies the recaller. Once the file has been recalled to disk, the request is submitted to the external scheduler as before.

The scheduling of the access depends on the load (CPU, memory, I/O) on the file server and disk that holds the file. If the load is too high but another file server is ready to serve the request, an internal disk-to-disk copy may take place, replicating the file to the selected disk server.

Figure 6: Master stager basic flows. A new file access request either creates the file in the CASTOR name space, is scheduled directly against the disk cache if the file is already cached, or triggers a recall (notify the recaller and wait for the catalogue update) before being scheduled.

Request throttling and load-balancing are already established concepts in modern disk cache management [2] and distributed file systems [14], [15]. The novelty of the new CASTOR stager architecture is the delegation of the request scheduling to an external module, which allows for leveraging existing and well-tested scheduling systems. In the simplest case, the external scheduling module could be a FIFO queue. However, an important advantage of using sophisticated scheduling systems like [7], [8], [16] is that they also provide established procedures for applying policies for quality of service and resource sharing. Interfacing the disk resource management with CPU batch scheduling systems prepares the ground for eventually marrying the CPU and storage resource management.

Because of the close resemblance between farms of CPU servers and farms of commodity file servers, it is possible to re-use the CPU batch scheduling systems almost out-of-the-box (see Section 5). The only adaptation required was to accommodate the fact that a disk cache access request not only targets a file server (node) but also the file system (disk) holding the file or space allocation. While the local pool (or worker directory) file system is part of normal CPU scheduling attributes, it is normally assumed that there is only one file system per node. A file server may have several file systems mounted. The CASTOR development team has established excellent collaborations with the LSF [7] and Maui [8] developers. Both groups were interested in the problem of disk storage access scheduling and in extending their systems to support file-system selection in the scheduling and resource allocation phases.

It should be noted that the similarity between CPU and file server scheduling is partly due to the particular NAS-based disk cache architecture with many (hundreds of) file servers holding only a few TB of disk space each. For SAN-based architectures some more adaptation of the CPU schedulers is probably needed.
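The decision logic of Figure 6 can be summarized in a few lines of code. The sketch below is only an illustration of the three flows described above; the function names are hypothetical placeholders, not the real CASTOR interfaces.

```c
/*
 * Sketch of the master stager's basic decision flow (Figure 6).
 * Everything except the control flow is a hypothetical placeholder
 * for the real CASTOR interfaces.
 */
struct request;            /* decoded file access request */

extern int  is_new_file(const struct request *r);
extern int  is_cached(const struct request *r);
extern void create_in_name_space(const struct request *r);
extern void notify_recaller_and_wait(const struct request *r);
extern void schedule_disk_cache_access(const struct request *r);  /* LSF/Maui */

void master_stager_process(const struct request *r)
{
    if (is_new_file(r)) {
        /* New file: create the name-space entry, then schedule. */
        create_in_name_space(r);
    } else if (!is_cached(r)) {
        /* Existing file not on disk: recall from tape first. */
        notify_recaller_and_wait(r);
    }
    /* In every flow the request ends up queued with the external
     * scheduler, which picks the file server and file system by load. */
    schedule_disk_cache_access(r);
}
```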
4.5. Recaller and migrator components

The recaller and migrator components control the file movements between the CASTOR disk cache and the tape archive. Since most mass storage systems contain similar components, including the current CASTOR stager, only the most important features of the new CASTOR recaller and migrator components will be listed here. Both the recaller and migrator read the request repository to find out which files need to be recalled or migrated.

While the process of copying files to and from tape is in principle trivial, the task is complicated by the significant delays introduced by tape drive contention and tape mounting. Tape drive contention comes from the fact that there are usually many more tape volumes than drives, and access must be queued. Tape mounting latency comes from fetching the volume and loading it onto the drive. Because of the inherent delays introduced by tape drive contention and tape mounting, the CASTOR recaller and migrator do not select the target disk resources that will receive or deliver the data until the tape is ready and positioned. This “deferred allocation” of the disk resources has already been used with good experience in the recaller component of the current CASTOR stager: the target file server and disk that will receive the file are selected at the moment the tape mover is ready to deliver it. In this way space reservations and traffic shaping, which are cumbersome to implement and often lead to suboptimal resource utilization, can be avoided. The selection is currently based on available disk space and I/O load, but other metrics/policies can easily be added.

5. First prototype

An evaluation prototype of the new CASTOR stager has been built in order to test some of the central concepts in the new architecture. The prototype includes:
• A request handler storing all requests in a request repository (Oracle and MySQL supported).
• A master stager that reads requests from the request repository and delegates the request scheduling and submission to an LSF instance.
• An LSF scheduler plug-in for selecting the target file system and passing that information to the job when it starts on the file server.
• A load monitoring component running on the managed file servers, collecting disk load data.
• An LSF job-starter that starts an rfiod (disk mover) instance for each request scheduled to a file server. The RFIO API client is notified of the rfiod address (host and port) and a connection is established for the data transfer.

The prototype was based on a standard LSF 5.1 installation without any modifications. This was possible using the plug-in interface provided by the LSF scheduler. A CASTOR plug-in is called for the different phases in the LSF scheduling to perform the matching, sorting and allocation of the file systems based on load monitoring information. While this was sufficient for the prototype, it was recognized that some extra features in the LSF interfaces would make the processing more efficient. The LSF development team was interested in providing those features and a fruitful collaboration has been established. The prototype is also prepared for interfacing the Maui scheduler, and the Maui developers have been very helpful in providing the necessary interfaces between Maui and the CASTOR file system selection component.
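The kind of file-system selection performed by the scheduler plug-in, and by the deferred allocation described in Section 4.5, can be illustrated with a simple scoring loop. The structure, weighting and threshold below are purely illustrative assumptions, not CASTOR's actual selection policy.

```c
/*
 * Illustrative sketch of selecting a target file system for a recalled
 * file or a scheduled request, based on free space and I/O load.
 * The struct and the scoring rule are assumptions for illustration only.
 */
#include <stddef.h>

struct fs_candidate {
    const char *disk_server;
    const char *mount_point;
    double free_gb;        /* free space reported by load monitoring */
    double io_load;        /* 0.0 (idle) .. 1.0 (saturated)          */
};

const struct fs_candidate *select_target_fs(const struct fs_candidate *fs,
                                            size_t n, double required_gb)
{
    const struct fs_candidate *best = NULL;
    double best_score = -1.0;

    for (size_t i = 0; i < n; i++) {
        if (fs[i].free_gb < required_gb)
            continue;                      /* not enough space */
        /* Prefer lightly loaded file systems; break ties on free space. */
        double score = (1.0 - fs[i].io_load) * 1000.0 + fs[i].free_gb;
        if (score > best_score) {
            best_score = score;
            best = &fs[i];
        }
    }
    return best;   /* NULL if no file system can hold the file */
}
```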
While the prototype was aimed at proving the functionality rather than the performance, it should be mentioned that, already for this simple deployment using old hardware and without any particular database tuning, the request handler was capable of storing and simultaneously retrieving:
• 40 requests/second using Oracle (a dual CPU 1 GHz i386 Linux PC with 1 GB memory)
• 125 requests/second using MySQL (a single CPU 2.4 GHz i386 Linux PC with 128 MB memory)

6. Development status

Development of the new CASTOR stager started in mid-2003 and a production ready version is planned for 2Q04. Before that, a second prototype including all new components and the full database schema is planned for March 2004.

7. Conclusions and outlook

The CASTOR hierarchical storage management system has been in production use at CERN since 2001. While the system is expected to scale well to manage the data volumes and rates required for the experiments at the Large Hadron Collider (LHC) in 2007, it has become clear that the disk cache management will not cope with the expected request load (file opens per second). The development of a new CASTOR stager (disk cache management system) was therefore launched last year. The important features of the new system design are:
• The request processing is shielded from high peak rates by a front-end request handler, which is designed to cope with 500 – 1000 requests per second.
• The scheduling of the access to the disk cache resources is delegated to an external scheduling system. This allows leveraging work and experience from CPU batch scheduling. The first version of the new CASTOR stager will support the LSF and Maui schedulers.
• The software architecture is database centric.

A first prototype of the new system was successfully built in order to prove the new design concepts. The prototype interfaced the LSF 5.1 scheduler. A production ready version of the new CASTOR stager is planned for 2Q04.

8. References
[1] CASTOR, http://cern.ch/castor
[2] dCache, http://www.dcache.org/
[3] ENSTORE, http://hppc.fnal.gov/enstore/
[4] HPSS, http://www.sdsc.edu/hpss/
[5] TSM, http://www-306.ibm.com/software/tivoli/products/storage-mgr/
[6] New stager proposal, http://cern.ch/castor/DOCUMENTATION/ARCHITECTURE/NEW/CASTORstager-design.htm
[7] Platform Computing Inc., http://www.platform.com
[8] Maui scheduler, http://supercluster.org/maui
[9] http://cern.ch/info/Press/PressReleases/Releases2001/PR05.01EUSaward.html
[10] Andrei Maslennikov, New results from CASPUR Storage Lab. Presentation at the HEPiX conference, NIKHEF Amsterdam, 19 – 23 May 2003,
http://www.nikhef.nl/hepix/pres/maslennikov2.ppt
[11] T. Anticic et al., Challenging the challenge: handling data in the Gigabit/s range. Presentation at the CHEP03 conference, La Jolla, CA, 24 – 28 March 2003, http://www.slac.stanford.edu/econf/C0303241/proc/papers/MOGT007.PDF
[12] http://cern.ch/info/Press/PressReleases/Releases2003/PR06.03EStoragetek.html
[13] SRM Interface Specification v.2.1, http://sdm.lbl.gov/srm-wg/documents.html
[14] IBM StorageTank, http://www.almaden.ibm.com/storagesystems/file_systems/storage_tank/index.shtml
[15] Lustre, http://www.lustre.org/
[16] SUN Grid Engine, http://wwws.sun.com/software/gridware/

GUPFS: The Global Unified Parallel File System Project at NERSC*

Greg Butler, Rei Lee, and Mike Welcome
National Energy Research Scientific Computing Center
Lawrence Berkeley National Laboratory
Berkeley, California 94720
{GFButler, RCLee, MLWelcome}@lbl.gov
Tel: +1-510-486-4000

* This work was supported by the Director, Office of Science, Office of Advanced Scientific Computer Research, Mathematical, Information, and Computational Sciences Division, of the U.S. Department of Energy under Contract No. DE-AC03-76SF00098.

Abstract

The Global Unified Parallel File System (GUPFS) project is a five-year project to provide a scalable, high-performance, high-bandwidth, shared file system for the National Energy Research Scientific Computing Center (NERSC). This paper presents the GUPFS testbed configuration, our benchmarking methodology, and some preliminary results.

1 Introduction

The Global Unified Parallel File System (GUPFS) project is a multiple-phase, five-year project to provide a scalable, high-performance, high-bandwidth, shared file system for the National Energy Research Scientific Computing Center (NERSC) [1]. The primary purpose of the GUPFS project is to make it easier to conduct advanced scientific research using the NERSC systems. This is to be accomplished through the use of a shared file system providing a unified file namespace, operating on consolidated shared storage that is directly accessed by all the NERSC production computational and support systems.

In order to successfully deploy a scalable high-performance shared file system with consolidated disk storage, three major emerging technologies must be brought together: shared/cluster file systems; cost-effective, high performance Storage Area Network (SAN) fabrics; and high performance storage devices. Although they are evolving rapidly, these emerging technologies are not targeted towards the needs of high performance scientific computing. The GUPFS project is intended to evaluate these emerging technologies to determine the best solutions for a center-wide shared file system, and to encourage the development of these technologies in directions needed for HPC at NERSC.

The GUPFS project is expected to span five years. During the first three years of the project, NERSC intends to test, evaluate, and steer the development of the technologies necessary for the successful deployment of a center-wide shared file system. Provided that an assessment of the technologies is favorable at the end of the first three years, the last two years of the GUPFS project will focus on a staged deployment, leading to full production at the beginning of FY2006. To this end, during the past year the GUPFS project focused on identifying, testing, and evaluating existing and emerging shared and cluster file system, SAN fabric, and storage technologies.
During this time, the GUPFS project was also active in identifying NERSC user I/O requirements, methods, and mechanisms, and developing appropriate benchmarking methodologies and benchmark codes for a parallel environment. This paper presents the GUPFS testbed configuration, our benchmarking methodology, and some preliminary results.

2 GUPFS Testbed Configuration

The GUPFS testbed was constructed from commodity components as a small-scale system mimicking a scientific parallel computational cluster. Its purpose was to assess the suitability of shared-disk cluster file systems with SAN attached storage for the scientific computational cluster environment. This testbed system presented a microcosm of a parallel scientific cluster: dedicated computational nodes, special-function service nodes, and a high-speed interconnect for message passing. It consisted of five computational nodes and one management/interactive node, and utilized an internal jumbo frame Gigabit Ethernet as the high-speed message passing interconnect. An internal 10/100 Mb/s Fast Ethernet LAN was used for system management and NFS distribution of the user home file systems. The configuration of the testbed is illustrated in the following diagram.

In designing a testbed for the GUPFS project, a number of factors were considered. The testbed was designed to support the evaluation of three technology areas:
• Shared/cluster file systems
• SAN fabrics
• Storage devices

These three technology areas are key to the successful deployment of a center-wide shared file system utilizing consolidated storage resources. The testbed is configured as a Linux parallel scientific cluster, with a management node, a core set of 32 dedicated compute nodes and a set of six special-purpose nodes. Each compute node is a dual Pentium IV system with six PCI-X slots. The PCI-X slots allow us to test newer high performance interfaces such as 4x InfiniBand HCAs. All compute nodes are equipped with a 2 Gb/s Fibre Channel HBA and a 1 Gb/s Ethernet interface. Various sets of compute nodes are equipped with different groups of interfaces being evaluated, such as InfiniBand and Myrinet interfaces.

3 GUPFS Benchmarking Approach

File systems, and parallel file systems in particular, are extremely complicated and should be evaluated within the context of their intended use. In the commercial world, a file system may be evaluated by how it performs on a single application or a small number of critical applications. For example, a web serving content provider may not be interested in parallel write performance, since the primary role of the file system is to provide read-only access to data across a large number of servers without having to replicate the data to multiple farms. By contrast, the NERSC HPC environment must support a large number of user-written applications with varying I/O and metadata performance requirements. Further, applications running today may not resemble the applications that will be running two years from now. Given this diversity, the GUPFS project is taking a more general, multi-pathed approach to the evaluation of parallel file systems. Initially, the GUPFS project has performed parallel I/O scalability and metadata performance studies. Later testing includes reliability and stress-test studies, and finally the project will evaluate the performance with respect to specific I/O applications that emulate real NERSC user codes.
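To give a flavor of the MPI-coordinated, thread-per-process parallel I/O benchmarking discussed in Section 3.1 below, here is a minimal sketch of that pattern. It is not MPTIO itself; the file path, sizes and thread counts are illustrative assumptions.

```c
/*
 * Minimal sketch of an MPI-coordinated, thread-per-process write benchmark
 * (NOT MPTIO; names and parameters are illustrative).
 * Typical build:  mpicc -o miniio miniio.c -lpthread
 */
#define _XOPEN_SOURCE 500
#include <fcntl.h>
#include <mpi.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NTHREADS 4
#define BLOCK    (1 << 20)          /* 1 MB per write        */
#define NBLOCKS  64                 /* 64 MB per thread      */

static int fd;                      /* one file per process  */

static void *writer(void *arg)
{
    long tid = (long)arg;
    char *buf = malloc(BLOCK);
    memset(buf, 'x', BLOCK);
    off_t base = (off_t)tid * NBLOCKS * BLOCK;     /* non-overlapping region */
    for (int i = 0; i < NBLOCKS; i++)
        pwrite(fd, buf, BLOCK, base + (off_t)i * BLOCK);
    free(buf);
    return NULL;
}

int main(int argc, char **argv)
{
    int rank, nprocs;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    char path[256];                                 /* hypothetical test mount */
    snprintf(path, sizeof(path), "/testfs/miniio.%d", rank);
    fd = open(path, O_WRONLY | O_CREAT, 0644);

    MPI_Barrier(MPI_COMM_WORLD);                    /* synchronize the start   */
    double t0 = MPI_Wtime();

    pthread_t th[NTHREADS];
    for (long t = 0; t < NTHREADS; t++)
        pthread_create(&th[t], NULL, writer, (void *)t);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(th[t], NULL);

    double mb = (double)NTHREADS * NBLOCKS;         /* MB written by this rank */
    double el = MPI_Wtime() - t0, total;            /* rank 0's elapsed time   */
    MPI_Reduce(&mb, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("%d ranks: %.1f MB in %.2f s (%.1f MB/s)\n", nprocs, total, el, total / el);

    close(fd);
    MPI_Finalize();
    return 0;
}
```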
3.1 Parallel I/O Performance Studies

Evaluating the performance of a file system, and a parallel file system in particular, can only be done within the context of the underlying system and storage environment. If the underlying storage network or device can only perform at a rate of 30 MB/sec then we cannot expect sustained I/O performance through a file system to exceed this. In addition, for the case of a parallel file system, we need to understand the scalability of the underlying storage network before passing judgment on the file system's ability to scale to a large number of clients. To aid in this portion of the study, we have developed a parallel I/O benchmark named 'MPTIO' which can perform a variety of I/O operations on files or devices in a flexible manner. MPTIO is an MPI program in which each process spawns multiple threads to perform I/O on the underlying file or device. MPI is used to synchronize the I/O activity and to collect the per-process results for reporting. It does not use the MPI-I/O library.

When acting through a file system, MPTIO can have all threads perform I/O to a single global file, or all the threads of a process can perform I/O to a single per-process file, or all the threads can perform I/O to distinct per-thread files. In addition, I/O can be performed directly to a block device or Linux RAW device to help characterize the underlying storage devices and storage area network. Further, each thread can perform I/O on distinct, non-overlapping regions of the file or device, or multiple threads can perform overlapping I/O. The code can run five different I/O tests, including sequential read and write, random I/O and read-modify-write tests. Aggregate and per-process results are reported. For a complete description of this code, see the Benchmark Code Descriptions section in [2].

We can baseline the performance of the storage network and a device using raw device I/O as follows:
• First we measure the I/O rates from a single node to or from the device, varying the number of I/O threads to the point that the I/O rate saturates.
• Second, we scale up the number of processes (each on a separate node) and also vary the number of I/O threads per process.
• Third, if multiple paths exist through the network to the controller, we can use MPTIO to perform I/O through all paths.

If the raw device I/O rates do not improve past a single node, then the performance bottleneck is in some portion of the network or on the storage device. If they do improve, the first test gives us a good estimation of the peak sustainable raw device I/O rate onto a single node. In this case, as the number of nodes increases, eventually the aggregate raw device I/O rate will again saturate. This bottleneck is either in the network or on the storage controller.

Once a profile of the storage device and network using raw device I/O is complete, a file system can be built over the device. We can then perform I/O and scalability studies over the file system. Since most file systems cache data in host memory, large I/O requests have to be used to minimize the effects of caching. For example, if we observe better performance through the file system than to the RAW device, we can conclude it is the effect of data caching on the node. In addition, one can compare the performance difference between multiple processes writing to the same file versus each writing to different files.
If the performance difference is substantial, it will likely be due to write lock contention between the processes. This might be alleviated if the file system supports byte-range locking, so that each process can write its own section without obtaining the single (global) file lock. If the file system supports DIRECT I/O, then one can compare I/O performance through the file system with what was measured to the RAW device. The difference will indicate the amount of software and organizational overhead associated with the file system. For example, file block allocation will probably be distributed across the device, resulting in multiple non-contiguous I/O requests to the device. Such I/O performance can degrade as the file system fills and block allocation becomes more fragmented. File system build options and mount options may also play a substantial role in the performance. Additional tuning options may need to be examined to see how they affect the I/O performance.

3.2 Metadata Performance Studies

In a typical (local) file system, the metadata is often heavily cached in the system memory and updated in an order such that, in the event of a system crash, the file system would be re-constructible from the on-disk image. Modern file systems maintain transaction logs, or journals, to keep track of which updates have been committed to stable storage and which have not. Each of the data structures generally contains one or more locks, such that operating system actors wishing to modify the structure do so in an atomic manner, by first acquiring the lock, modifying the structure and then releasing the lock. Some file system operations require modifying many of the structures and thus the actors must acquire multiple locks to complete the operation. In a parallel file system, multiple clients running on different machines under different instances of an operating system will want to modify these data structures (to perform operations) in an unpredictable order. In these cases, where the metadata is maintained and who controls it can have a dramatic effect on the performance of a file system operation.

There are currently two main approaches to managing metadata in parallel file systems: symmetric (distributed) and asymmetric. In a purely symmetric file system, all the file system clients hold and share metadata. Clients wanting to perform an operation must request locks from the clients holding them via some form of distributed lock manager. In an asymmetric file system, a central (dedicated) metadata server maintains all the information. The clients request updates or read and write access to files. Clearly, this latter case is easier to manage but establishes a single point of failure and may create a performance bottleneck as the number of clients increases. Note that in both cases, once a client is granted read or write access to a file, it accesses the file data directly through the SAN.

One aspect of evaluating a file system is how well it performs the various metadata operations required for concurrent file operations by a large number of clients in a parallel environment. Clearly, where and how the metadata is maintained will have a substantial impact on the performance of various file system operations. Clients have to send messages to other nodes in the cluster requesting locks, or asking for operations to be performed. The latency of the interconnection network and the software overhead of processing the communication stack will be a major portion of these costs.
Apart from the issues of parallel file systems, a portion of the metadata performance will have to do with how the data structures are organized internally. For example, some file systems will maintain a directory structure as a linear linked list, whereas others will use more sophisticated schemes, such as hash tables and balanced trees. Linear lists are easy to implement, but access and update performance degrades rather seriously as the number of directory entries increases. The other schemes provide near-constant or log-time access and update rates and perform well as the directory grows. This, of course, is at the expense of a more complicated implementation. Another example is how the file system maintains information about which underlying blocks are free or in use. Some systems use a bitmap whereby the value of the bit indicates the availability of the block. Other systems use extent-based schemes whereby a small record can represent the availability of a large region of contiguous blocks.

In our study, we measured how the various parallel file systems perform with respect to certain metadata intensive operations. To this end, we have developed a file system metadata benchmark code called 'METABENCH'. This is a parallel application that uses MPI to coordinate processes across multiple file system clients. The processes perform a series of metadata operations, such as file creation, file stat, and file utime and append operations. Details on the current state of METABENCH can be found in the Benchmarking Code Descriptions section in [2].

3.3 User Applications Emulation

Although micro benchmarks such as MPTIO and METABENCH can provide a wealth of information about how a file system behaves under controlled conditions, the true test of a file system is how it performs in a real user environment. The NERSC user community is large, with a diverse collection of codes that evolve over time. In addition, the codes are complex and may not easily be ported to our test system, or even scale down to that size. Further, the I/O and file operation portion of the code may only consume a small portion of the run-time, so attempting to run the actual application on the testbed would be inefficient. In order to address these issues, we plan to develop a small collection of I/O applications that emulate the I/O and file management behavior of real NERSC user applications. This will be a time-consuming task and will only be successful with the aid of the user community. We have begun an informal survey of some of the larger NERSC projects to understand their I/O requirements. As a part of this, we will select a few applications and solicit the users to help us create an I/O benchmark that emulates their code.

4 Preliminary Performance Results

During the past year, we have evaluated a number of products and technologies that we believe are key technologies to the GUPFS Project.
We will present testing results in this section for some of the following products and technologies evaluated:
• File Systems: Sistina 5.1 & 5.2 Beta; ADIC StorNext (CVFS) File System 2.0; Lustre 0.6 (1.0 Beta 1) and 1.0; and GPFS 1.3 for Linux
• Fabric Technologies
  o Fibre Channel switches: Brocade SilkWorm, Qlogic SANbox2-16 and SANbox2-64
  o iSCSI [3]: Cisco SN 5428, Intel iSCSI HBA, iSCSI over IB
  o InfiniBand: InfiniCon and Topspin (IB to FC and GigE)
  o Interconnect: Myrinet, GigE
• Fibre Channel Storage Devices
  o 1 Gb/s FC: Dot Hill, Silicon Gear, Chaparral
  o 2 Gb/s FC: Yotta Yotta NetStorager [8], EMC CX 600, 3PARdata

4.1 Storage Performance and Scalability

Storage can be a performance bottleneck of any file system. A storage device may be able to sustain very good single-port performance. However, good single-port performance is not sufficient for a shared disk file system like GUPFS. For GUPFS, the underlying storage devices must demonstrate very good scalability when the number of clients increases (to thousands or tens of thousands). A shared file system will not scale if the underlying storage does not scale.

Figure 1. Storage scalability: single-thread disk write throughput (MB/s) as a function of the number of clients (1-8) for Yotta Yotta, Silicon Gear and Dot Hill.

Figure 1 shows how storage devices scale when the number of clients increases. The figure shows the results of three storage devices: Yotta Yotta GSX 2400 (YY), Silicon Gear Mercury II (SG), and DotHill SANnet (DH). Both Silicon Gear and Dot Hill have only two 1 Gb/s front-end ports, while Yotta Yotta has eight 2 Gb/s ports and 3PARdata has sixteen 2 Gb/s ports, of which only eight were used during the test. On each storage device, we created a single LUN to be shared by multiple clients for shared access. The figure shows that both the Silicon Gear and Dot Hill devices did not scale when the number of clients increased. Silicon Gear performance actually dropped when the number of clients increased. On the other hand, the figure shows that the Yotta Yotta storage did scale very well when the number of clients increased.

Figure 2. Storage aggregate performance with 8 clients (MB/s) for cache write, cache read, cache rotate read, disk write, disk read and disk rotate read, on Yotta Yotta, Silicon Gear and Dot Hill.

Figure 2 shows the aggregate performance of the three storage devices: DotHill, Silicon Gear, and Yotta Yotta, using the MPTIO benchmark with different test conditions. The results indicate that the Yotta Yotta storage will be able to sustain higher performance than Silicon Gear or Dot Hill in a shared file system.

4.2 Parallel File I/O Performance

Figure 3. Shared file system performance: 8-client MPTIO throughput (MB/s) for cache write, cache read, cache rotate read, disk write, disk read and disk rotate read on GFS, ADIC StorNext and Lustre (Dot Hill storage).

During the last year, we have tested several file systems, including Sistina's GFS [4], ADIC's StorNext File System [5], and Lustre [7]. Figure 3 shows the 8-client MPTIO results on these file systems under different test scenarios. These results indicate that there is not much difference in parallel I/O performance between GFS and ADIC's StorNext File System, except for 'Cache Read'. ADIC's StorNext File System was probably doing direct I/O even when operating on files that can fit in the OS cache.
With the award of the ASCI PathForward SGSFS [6] file system development contract to HP and Cluster File Systems, Inc., there has been rapid progress on the Lustre file system [7]. The earlier Lustre file system version (0.6) we tested failed to complete all but the 'Cache Write' and 'Cache Read' tests. Lustre 1.0.0 was recently released, and Figure 4 shows the latest result of Lustre scalability with six clients and two Object Storage Servers (OSSs).

Figure 4. Lustre scalability with two OSSs: read and write throughput (MB/s) versus the number of clients (1-6).

The figure seems to indicate that reads and writes were limited by the GigE interface, as the test was running with two OSSs and each OSS was equipped with one GigE interface. Additional tests with more OSSs and further tuning should improve performance.

4.3 Fabric Performance

Storage area networks, by providing a high performance network fabric oriented toward storage device transfer protocols, allow direct physical data transfers between hosts and storage devices. Currently, most SANs are implemented using Fibre Channel (FC) protocol-based fabric. Emerging alternative SAN protocols, such as iSCSI (Internet Small Computer System Interface), FCIP (Fibre Channel over IP), and SRP (SCSI RDMA [Remote Direct Memory Access] Protocol) [9], are enabling the use of alternative fabric technologies, such as Gigabit Ethernet and the emerging InfiniBand, as SAN fabrics. Here we present some performance results for several fabric technologies: Fibre Channel (FC), iSCSI over GigE, iSCSI over IP over InfiniBand (IPoIB), and SRP.

Figure 5. Storage fabric performance: single-thread read throughput (MB/s) versus block size (1 KB - 16 MB) for 2 Gb/s FC (fc_yy), SRP over InfiniBand (srp_ib, 2 Gb/s), iSCSI over GigE (iscsi_ge) and iSCSI over IPoIB (iscsi_ib).

Figure 5 shows the results of single-thread reads of different I/O sizes using different fabric technologies. The best performance was achieved by the 2 Gb/s FC interface, followed by the SRP protocol over InfiniBand. Since the iSCSI traffic was passing through a single GigE interface, the iSCSI performance was less than 100 MB/s. With the additional stack overhead of IPoIB, iSCSI over IPoIB delivered the lowest performance for single-thread reads. Figure 6 shows the CPU overhead of the different protocols for single-thread reads. FC, while delivering the best performance, incurred the least CPU overhead.

The iSCSI protocol allows standard SCSI packets to be enveloped in IP packets and transported over standard Ethernet infrastructure, which allows SANs to be deployed on IP networks. This option is very attractive as it allows lower-cost SAN connectivity than can be achieved with Fibre Channel, although with lower performance. It will allow large numbers of inexpensive systems to be connected to the SAN and use the shared file system through commodity-priced components. While attractive from a hardware cost perspective, this option does incur a performance impact on each host due to increased traffic through the host's IP stack, as shown in Figure 6.

Figure 6. CPU overhead (%sys) of the storage fabrics versus block size (1 KB - 16 MB) for fc_yy (2 Gb/s), srp_ib (2 Gb/s), iscsi_ge and iscsi_ib.

5 Conclusions

The GUPFS project started in the last half of FY 2001 as a limited investigation of the suitability of shared-disk file systems in a SAN environment for scientific clusters, with an eye towards possible future center-wide deployment.
As such, it was targeted towards initial testing of the Sistina Global File System (GFS), and included a small testbed system to be used in the investigation. With the advent of the NERSC Strategic Proposal for FY 2002-2006, this modest investigation evolved into the GUPFS project, which is one of the major programmatic thrusts at NERSC. During the first three years of the GUPFS project, NERSC intends to test, evaluate, and influence the development of the technologies necessary for the successful deployment of a center-wide shared file system. Provided that an assessment of the technologies is favorable at the end of the first three years, the last two years of the GUPFS project will focus on a staged deployment of a high performance shared file system center-wide at NERSC, in conjunction with the consolidation of user disk storage, leading to production in FY 2006.

References
[1] NERSC Strategic Proposal FY2002-FY2006, http://www.nersc.gov/aboutnersc/pubs/Strategic_Proposal_final.pdf
[2] The Global Unified Parallel File System (GUPFS) Project: FY 2002 Activities and Results, http://www.nersc.gov/aboutnersc/pubs/GUPFS_02.pdf
[3] Julian Satran, “iSCSI” (Internet Small Computer System Interface) IETF Standard, January 24, 2003, http://www.ietf.org/internet-drafts/draft-ietf-ips-iscsi-20.pdf
[4] Global File System (GFS), Sistina Software, Inc., http://www.sistina.com/downloads/datasheets/GFS_datasheet.pdf
[5] StorNext File System, Advanced Digital Information Corporation (ADIC), http://www.adic.com/ibeCCtpSctDspRte.jsp?minisite=10000&respid=22372&section=10121
[6] ASCI Path Forward, SGSFS, 2001, http://www.lustre.org/docs/SGSRFP.pdf
[7] “Lustre: A Scalable, High-Performance File System,” Cluster File Systems, Inc., November 2002, http://www.lustre.org/docs/whitepaper.pdf
[8] Yotta Yotta NetStorager GSX 2400, Yotta Yotta, Inc., http://www.yottayotta.com/pages/products/overview.htm
[9] SRP: SCSI RDMA Protocol, ftp://ftp.t10.org/t10/drafts/srp/srp-r16a.pdf

SANSIM: A PLATFORM FOR SIMULATION AND DESIGN OF A STORAGE AREA NETWORK

Yao-Long Zhu, Chao-Yang Wang, Wei-Ya Xi, and Feng Zhou
Data Storage Institute
DSI building, 5 Engineering Drive 1 (off Kent Ridge Crescent, NUS)
Singapore 117608
Tel: +65-6874-6436
e-mail: zhu_yaolong@dsi.a-star.edu.sg

Abstract

Modeling and simulation are flexible and effective tools to design and evaluate the performance of a Storage Area Network (SAN). Fibre Channel (FC) is presently the dominant protocol used in SANs. In this paper, we present a new simulator, SANSim, developed for modeling and analyzing FC storage networks. SANSim includes four main modules: an I/O workload module, a host module, a storage network module, and a storage system module. SANSim has been validated by comparing the simulation results with the actual I/O performance of an FC RAM disk connected to an FC network. The simulated results match the experimental readings within 3%. As an example of applicability, SANSim has been used to study the impact of link failures on the performance of an FC network with a core/edge topology.

1. Introduction

SAN architecture has been proven to provide significant performance advantages, greater scalability, and higher availability over traditional Direct Attached Storage architecture. It is therefore not surprising that the performance modeling and simulation of SANs has become an interesting field of research [1][2]. Xavier [3][4] used the CSIM language to simulate a SAN and model its activities.
Petra et al. [5] presented SIMLab, a simulation environment based on a network of active routers. Wilkes [6] used the Pantheon storage-system simulator to model the performance of parallel disk arrays and parallel computers. DiskSim [7] is another disk storage system simulator that supports research in storage subsystem algorithms and architecture. However, the simulation studies and tools mentioned above have been very limited in modeling and simulation at the FC protocol level. Nevertheless, it is necessary to simulate FC at the frame level in order to monitor and analyze the details of FC SAN activities. In this paper, we present a new FC SAN simulator, SANSim, which supports the FC frame level and fully simulates the Fibre Channel protocols in accordance with the relevant standards [10-13] to guarantee the compatibility and interoperability of different modules. In the following sections, we first introduce our simulation tool SANSim in section 2. Then, in section 3, we present the experimental results and the simulation validation for an FC network. Finally, we simulate and analyze some FC network design issues in section 4.

2. SANSim

SANSim is an event-driven simulation tool for SANs that includes four main modules: an I/O workload module, a host module, a storage network module, and a storage system module, as shown in Figure 1. The I/O workload module generates I/O request streams according to the workload distribution characteristics and sends them to the host modules. The host module encapsulates the I/O workload into SCSI commands and sends them to the Host Bus Adaptor (HBA) sub-modules. The storage network module simulates the network connectivity, topology and communication mechanism. The FC network module includes three sub-modules: an FC_controller module, an FC_switch module and an FC communication module. The storage module maps I/O data to the storage devices.

SANSim is developed in pure standard C. It has been compiled successfully on both Microsoft Windows XP and Linux platforms. The simulator reads in configuration parameters from a user-specified input file, plots measurement data, and writes data to an output file after the simulation is completed. The simulation duration and the warm-up period can also be specified in the input file to control and eliminate the transient bias in the simulation results. The configuration parameters for each of the four modules are arranged in order in the input file.

Figure 1 SANSim internal structure, including the I/O workload module, the host module, the storage network module and the storage module.

2.1 I/O Workload module

The key function of the I/O Workload module is to generate I/O request streams according to the workload distribution characteristics and send them to the host modules. The current module supports both system-traced I/O workloads and synthetic I/O workloads. We use a synthetic I/O workload in this paper. Generally speaking, a disk request is defined by five dimensions: the request pattern, the size distribution, the repeatability, the location distribution and the I/O operations. The workload module is able to generate several different basic arrival patterns, such as Poisson arrivals, equal time intervals, Normal distribution arrivals and so on.
It can also simulate a combination of request patterns consisting of different distributions and different rates. Another capability of the Workload module is to generate repeatable requests. This scenario is used to define a workload in which some files are more popular than others and consequently are accessed more frequently.

2.2 Host module

The Host module includes a device driver, a SCSI layer, a system bus, as well as DMA modules. The main function of the Host module is to encapsulate the I/O workload into SCSI commands and send them to the Host Bus Adaptor (HBA) sub-modules. The host module schedules the I/O requests generated by the workload module based on various scheduling policies, which are configurable. The I/O requests are traced in a waiting queue and an outstanding queue. An I/O request in the outstanding queue refers to a request that has been submitted to the storage device but has not yet completed. The maximum number of outstanding requests depends on the buffer size and system configuration. The I/O requests in the waiting queue may be merged or re-scheduled following certain policies. The host module supports a multiple-host configuration. Each host has a separate I/O generation module. There is a specific mechanism to identify the I/O requests coming from different hosts.

2.3 FC network module

The key function of the FC network module is to simulate the FC connectivity, topology and communication protocol. The FC network module includes three sub-modules: the FC Controller module, the FC Switch module and the FC Port & Communication module, as shown in Figure 2. The FC Controller module simulates the communication behavior of FC command and data frames. The FC Switch module models all the FC ports and the switch architecture, as well as the routing and flow control. The FC Port & Communication module transfers FC frames between the FC ports.

Figure 2 Modeling of the Fibre Channel network in SANSim: (a) an example FC SAN, (b) abstracted view, (c) the FC network module, comprising the FC Switch module (switch core with routing and internal crossbar, FC ports), the FC Controller module (Bus_interface, FCP Engine, FC Port) and the FC Port & Communication module (frame buffer management, FC connection).

2.3.1 FC Controller module

The FC Controller module models both the initiator and target modes of the FC HBA. As shown in Figure 3, the FC Controller module includes three sub-modules: a Bus_interface, an FCP (SCSI over Fibre Channel Protocol [13]) Engine and an FC Port. The Bus_interface sub-module handles the communication between the device driver and the controller, such as DMA and interrupts. The FCP Engine has the responsibility of constructing the different FC frames corresponding to each sequence of an FCP exchange. The FC_Port is responsible for delivering FC frames to the destination port.

Figure 3 FCP operation: the flow of FCP_CMND, FCP_XFER_RDY, FCP_DATA and FCP_RSP frames between the initiator-mode and target-mode FC Controller modules through the FC Port & Communication module.

When a SCSI request arrives in the initiator, the device driver sends SCSI commands to the FC_Controller through the Bus_interface.
The FC_Controller then executes the commands and fetches the SCSI I/O's information from memory. An FCP_CMND frame is constructed for each SCSI I/O in the FCP Engine, and then sent out through the FC_Port module. After the target FC_Controller receives the FCP_CMND, the target firmware decapsulates the FCP_CMND and processes the SCSI command. In the case of an FCP Read operation (SCSI Read 10), the target host decapsulates the SCSI request and prepares the requested data. Once the data is ready, the target FC_Controller takes the responsibility of transferring the data back to the requestor (the initiator) and sends a completion message, FCP_RSP, after all data is sent. With an FCP Write operation (SCSI Write 10), the target driver allocates a sufficient memory area for the incoming data, and sends an FCP_XFER_RDY to the initiator requesting the data. After the initiator receives the FCP_XFER_RDY, it starts to send FCP_DATA. Finally, when the target receives all data successfully, it sends an FCP_RSP indicating the completion of the FCP operation.

2.3.2 FC Switch module

The SANSim FC Switch module has two sub-modules: the FC port and the FC Switch core. The FC port supports F_Ports/FL_Ports and E_Ports. F_Ports/FL_Ports are for host-switch and device-switch connections, and E_Ports are for switch-switch inter-connections. The FC_Port's address_ID is unique and conforms to the FC-SW-2 standard [11]. The FC Switch core is the switch's control center for frame routing and forwarding. It contains the routing logic and an internal crossbar. If the destination port of a requested FC frame is busy, the incoming frames are held in the incoming buffer until they are successfully routed. SANSim uses Dijkstra's algorithm [14] to compute the shortest routing path. The routing table remains constant unless the network connectivity is changed during the simulation. When the network configuration is changed, the switch module re-computes the shortest paths.

2.3.3 FC Port & Communication module

The FC Port & Communication module includes a Frame Buffer Management sub-module and an FC Connection sub-module. The Frame Buffer Management sub-module handles all management of incoming and outgoing frame buffers. The FC Connection sub-module establishes an FC connection between two FC_Ports, used to transfer frames. The Fibre Channel Arbitrated Loop (FC-AL) is a typical FC connection sub-module used in the simulation. The FC-AL connection module consists of the following sub-modules: the Loop Port State Machine (LPSM), the Alternative BB Credit Flow Control (Alt BB Mgr), and the Loop Port Process Control (LPPC). The LPSM models the L_Port's state transitions during the communication. The L_Port transmits an FC frame only when it is in certain states. The number of frames permitted to be transmitted depends on the available buffer size in the destination port. The Alternative BB Credit Flow Control sub-module models the flow control method defined in the standard [10] to avoid overflow. The Loop Port Process Control sub-module models the remaining behavior of the L_Port, such as responding to a certain loop port request, re-transmitting an ARBx signal and other port activities.

2.4 Storage module

The main function of the storage module is to map I/O data to the storage devices. Storage modules, which include an interface module (HBA), a storage controller module and a storage device module, can simulate a RAID system with various cache management algorithms, disk drives, and RAM disks. In the event of a disk failure in a RAID system, the degraded mode and rebuild behavior can also be modeled. Various RAID algorithms can be integrated into the storage modules.
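The buffer-to-buffer credit flow control modeled by the Alt BB Mgr sub-module can be illustrated with a small credit counter: a port may transmit a frame only while it holds credit, and each R_RDY from the receiver returns one credit. The sketch below is a generic illustration of the mechanism, not SANSim code.

```c
/*
 * Generic illustration of buffer-to-buffer (BB) credit flow control on an
 * FC link.  This is a sketch of the mechanism, not actual SANSim code.
 */
struct bb_credit {
    int available;      /* credits currently held by the transmitter       */
    int login_value;    /* BB_Credit granted at login (receiver's buffers) */
};

static void bb_init(struct bb_credit *c, int login_value)
{
    c->login_value = login_value;
    c->available = login_value;
}

/* Returns 1 if a frame may be sent now (and consumes one credit). */
static int bb_try_send(struct bb_credit *c)
{
    if (c->available == 0)
        return 0;           /* receiver has no free frame buffer: wait */
    c->available--;
    return 1;
}

/* Called when an R_RDY primitive is received from the other port. */
static void bb_on_r_rdy(struct bb_credit *c)
{
    if (c->available < c->login_value)
        c->available++;
}
```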
3. Simulation Validation
3.1 Experimental environment
The experiments were conducted on an in-house developed FC RAM disk, which maps all storage I/Os to memory rather than using an actual magnetic disk. Since we are focusing on the FC simulation and its validation, using a RAM disk as the target helps to isolate problems caused by the modeling of a hard disk drive. The initiator and target use an FC-AL connection. Table 1 lists the detailed hardware and software configurations used in the experiments. IOMeter, a widely accepted industry-standard benchmark tool, is used here to collect experimental data. The monitored parameters include IOPS (I/Os per second), throughput (MB/s), queue depth, I/O request size, and read/write operations. The IOPS is used primarily with small I/O requests, while the throughput is used with large I/O requests. The queue depth refers to the number of outstanding I/O requests injected into the FC network and storage system. IOMeter issues a number of requests (equal to the queue depth) initially, and generates new I/O requests only after the completion of previous requests. Fixed I/O sizes were used in all requests.

Table 1 System configuration for SANSim FC-AL module validation
Initiator hardware - CPU: AMD Athlon MP 1600+; FC HBA: Qlogic 2300; RAM: 2 x 256MB DDR SDRAM; Main board: 64-bit PCI Tyan Tiger MP2466N
Initiator software - OS: Windows XP Professional SP1; Driver: Qlogic Driver Version 8.1.5.12; Tool: Intel IOMeter Version 2003.02.15
Target hardware - CPU: Intel PIII 1GHz; FC HBA: Qlogic 2300; RAM: 4 x 1GB Kingston ECC Reg. PC133; Main board: 64-bit PCI, Supermicro 370
Target software - OS: RedHat 8.0, Kernel 2.4.18; Driver: In-house 2300 target driver Version 1.0, In-house Linux RAM Disk Version 2.0

3.2 Comparisons of the experimental and simulated data
Figures 4 and 5 show how the I/O transaction performance varies with the queue depth for read and write operations. The I/O sizes are set to 2KB, 8KB, 16KB, and 32KB respectively. The IOPS increases with the queue depth and then reaches a saturation limit. In other words, there is a critical queue depth: when the queue depth is larger than this critical value, the system transaction performance shows no improvement and the response time for I/O requests becomes worse. For example, when the I/O size is 8KB and IOMeter sends 100% read operations, the maximum transaction performance is 17.5k IOPS. This translates into a mean network and storage overhead of around 0.057ms (1 second/17.5k). Simulation results show that the average response time is about 0.471ms when the queue depth equals eight. The largest contribution to the response time is from the queue waiting time, not the serving time. Figures 6 and 7 show how the throughput (MB/s) varies with the I/O request size for read and write operations. The queue depths are set to 1, 2, and 8 respectively. Generally, the throughput increases with the I/O request size. When the request size is large enough (more than 128KB with a queue depth of 8), the data transfer time dominates the overall overhead, and the throughput is then limited by the FC network bandwidth.
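As a rough sanity check on these numbers (a back-of-the-envelope estimate added here for illustration, not part of SANSim), the per-request service time implied by the saturation rate, and the response time expected once queue waiting dominates, can be computed directly:

# 8KB, 100% read: saturation at about 17.5k IOPS
max_iops = 17.5e3
service_time_ms = 1000.0 / max_iops          # ~0.057 ms of network + storage overhead per I/O
queue_depth = 8
# At saturation a new request waits behind roughly a full queue of requests,
# so the response time is approximately queue_depth * service_time.
estimated_response_ms = queue_depth * service_time_ms
print(f"service time  ~ {service_time_ms:.3f} ms")
print(f"response time ~ {estimated_response_ms:.3f} ms (simulated: ~0.471 ms)")

The estimate of about 0.46ms agrees well with the simulated 0.471ms, consistent with the observation that queue waiting, not serving, dominates the response time at this queue depth.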
[Figure 4 Read performance vs. queue depth and Figure 5 Write performance vs. queue depth: maximum IOPS (k) versus queue depth (1-12) for 2KB, 8KB, 16KB and 32KB I/O sizes, tested and simulated curves]
[Figure 6 Read throughput vs. I/O size and Figure 7 Write throughput vs. I/O size: throughput (MB/s) versus request size (1KB-512KB) for queue depths 1, 2 and 8, tested and simulated curves]
The simulation results and experimental data match very well in all cases, as illustrated in Figures 4-7. The error range is generally less than 10%; with read operations, it is less than 3%.
4. FC Network simulation and analysis
4.1 Simulation Environment
To illustrate an application of SANSim, the performance and availability of an FC network are simulated and analyzed in this section. We conducted a simulation based on a core/edge network with five FC switches, as shown in Figure 8. FC switches 1-4, as edge switches, are each connected to the core switch 5 through two 2G FC ISLs. Each FC controller has 32 frame buffers. Each FC port on a switch has four frame buffers. The distance between any two points in the network is set to 50 meters. The storage controller processing capacity is 22.5K transactions per second.
[Figure 8 Simulation configuration: eight servers and eight storage devices connected through edge FC switches 1-4 and core switch 5 via ISL1-ISL8]
A synthetic I/O workload is applied to each server. The I/O inter-arrival time follows an exponential distribution. The maximum number of outstanding I/O requests is 32 for each server. Since the simulation is based on an open system, the I/O queue in the server is allowed to grow without limitation; however, the servers issue a new request to the network and storage devices only after a previous request is completed. The I/O requests are randomly and evenly allocated among all devices.
4.2 Simulation Results and Analysis
In order to study the impact of link failure on the network throughput, we conducted a series of simulations under four different scenarios: no link failure, ISL1 failure, ISL8 failure, and simultaneous failures of ISL1 and ISL8. The throughputs of all servers are collected using 2KB and 32KB request sizes. Servers 1 through 4 have the same characteristics and achieve similar throughput results, so we use an average throughput, S1-4, as shown in Figure 9, to represent the performance of servers 1 through 4; the same is done for servers 5 through 8. Figure 9(a) shows the throughputs as a function of I/O workload for case I (no link failure). The throughputs grow linearly and then reach asymptotic values. All servers achieve the same throughputs due to the symmetrical characteristics of the network. When the I/O size is 2KB, the maximum throughput is 45MB/s. Since the processing capacity of the storage device is 22.5k transactions per second, it limits the maximum throughput to 45MB/s for the 2KB I/O size. When the I/O size is 32KB, the maximum throughput for each server is 80MB/s. The total throughput for a single network link is 160MB/s, which is 20% less than the nominal value of 200MB/s. The simulated results show that the maximum throughput supported by a single storage device is 175MB/s for the 32KB I/O size. That means the performance is not limited by the storage devices, but by the network.
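These saturation values follow from a simple bottleneck argument. The sketch below is illustrative only; it assumes, as the figures above imply, that each ISL effectively carries the traffic of two servers and delivers about 80% of its nominal 200MB/s, and the function and parameter names are invented for this example:

def per_server_ceiling(device_limit_mb_s, n_devices=8, n_servers=8,
                       isl_mb_s=200.0, servers_per_isl=2, isl_efficiency=0.8):
    # Per-server throughput ceiling (MB/s): the smaller of the storage-side
    # and network-side limits in the symmetric core/edge configuration.
    storage_side = device_limit_mb_s * n_devices / n_servers
    network_side = isl_mb_s * isl_efficiency / servers_per_isl
    return min(storage_side, network_side)

# 2KB requests: device limit = 22.5k transactions/s * 2KB = 45 MB/s -> storage-bound
print(per_server_ceiling(device_limit_mb_s=22.5e3 * 2 / 1000))   # 45.0
# 32KB requests: simulated device limit ~175 MB/s -> network-bound at ~80 MB/s
print(per_server_ceiling(device_limit_mb_s=175.0))               # 80.0

The two printed ceilings, 45MB/s and 80MB/s, reproduce the asymptotic per-server throughputs quoted above for the 2KB and 32KB cases respectively.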
[Figure 9 Maximum throughputs under symmetrical I/O load for different cases: throughput (MB/s) versus I/O load (MB/s) for S1-4 and S5-8 with 2KB and 32KB I/O sizes; (a) Case I: no link failure, (b) Case II: ISL1 failure, (c) Case III: ISL8 failure, (d) Case IV: both ISL1 and ISL8 failed]
Figure 9(b) shows the throughput for case II (ISL1 failure). When the I/O size is 32KB, the asymptotic throughput of servers 1~4 drops to 46MB/s, compared to 80MB/s in case I. Apparently, this is due to the limited bandwidth of the single remaining link, ISL2. It is also noted that a large performance drop happens for servers 5~8 (58MB/s vs. 80MB/s) compared to case I, even though ISL1 has no direct physical connection to those servers. This decrease in performance is probably caused by head-of-line blocking [15]: the data traffic from the storage devices to servers 5~8 competes at the core switch 5 with the data traffic from the storage devices to servers 1~4. When the I/O size is 2KB, the throughput of servers 1~4 approaches 44MB/s, which is slightly less than the 45MB/s in case I. This is caused by the competition of data traffic on the highly utilized ISL2 link; the bandwidth utilization of ISL2 reaches 176MB/s out of 200MB/s. For case III (ISL8 failure), the maximum throughputs for servers 1~4 and servers 5~8 drop from 80MB/s in case I to an average of around 50MB/s, as shown in Figure 9(c), when the I/O size is 32KB. The reason why servers 1~4 and servers 5~8 achieve the same maximum throughput is that the data traffic from storage devices 1~4 to servers 1~8 is equally affected by the ISL8 failure. When the I/O size is 2KB, the maximum throughput is almost unaffected by the ISL8 failure compared to case I. When both ISL1 and ISL8 fail (case IV), the measured throughputs of the servers are shown in Figure 9(d). When the I/O request size is 32KB, the asymptotic throughput of servers 1~4 reaches 40MB/s while the throughput of servers 5~8 is about 42MB/s. However, when the I/O size is 2KB, the throughput of servers 1~4 is only 35MB/s, while that of servers 5~8 is 46MB/s. In order to analyze the detailed frame activity on the network, the data traffic across ISL1~4 and ISL5~8 is monitored and the average I/O response time is measured. The results show that the response time of the I/Os issued by servers 1~4 to storage devices 1~8 becomes significantly long (>2ms) when the I/O workload is larger than 36.5MB/s. This allows the storage devices to serve more I/O requests issued by servers 5~8, so servers 5~8 can achieve better performance than servers 1~4 when the I/O size is 2KB. However, when the I/O size is 32KB, the response time of I/O requests issued by servers 5~8 increases notably due to the effects of head-of-line blocking, which limits the performance of servers 5~8 to 42MB/s for the 32KB I/O size.
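The asymptotic value for servers 1~4 in case II is consistent with a simple link-sharing estimate. The short calculation below is illustrative only and uses just the figures quoted above (the nominal bandwidth of a 2G ISL is taken as 200MB/s):

# Case II: with ISL1 failed, all data for servers 1-4 must cross ISL2 alone.
isl_mb_s = 200.0                        # nominal bandwidth of one 2G FC ISL
servers_sharing = 4
raw_share = isl_mb_s / servers_sharing  # 50 MB/s per server with perfect sharing
simulated = 46.0                        # asymptotic value reported for 32KB I/O
print(f"raw share {raw_share:.0f} MB/s, simulated {simulated:.0f} MB/s "
      f"({simulated / raw_share:.0%} of the raw share)")

Each of the four servers obtains roughly 92% of its raw share of the surviving link, with the remainder lost to framing overhead and the head-of-line blocking discussed above.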
5. Summary and future work
In this paper, we have presented SANSim, a platform for the simulation and design of FC SANs. SANSim works at the FC frame level and can simulate all primitive signals (IDLEs, RDYs, etc.), command frames and data frames. The design of SANSim is modular and scalable. Such a tool is useful for the rapid development of high-end SANs given the ever-increasing complexity of SAN architectures. We have conducted several experiments to compare experimental and simulated results. The results show that the SANSim model is accurate to within 3% for read operations and within 10% for write operations. As an example, the performance and availability of a core/edge FC network have been analyzed. The simulation results show that the core/edge topology suffers a certain level of bandwidth loss due to head-of-line blocking caused by traffic crossing multi-stage switches. Generally, the maximum throughput achieved at all servers decreases when a link failure happens, and servers at different locations have different I/O performance sensitivity to the link failure. Future development work on SANSim includes an IP storage module, an object-based storage module, and file system simulation.

References
[1] Yao-Long Zhu, Shun-Yu Zhu and Hui Xiong, "Performance Analysis and Testing of the Storage Area Network", 19th IEEE Symposium on Mass Storage Systems and Technologies, April 2002.
[2] T. Ruwart, "Disk Subsystem Performance Evaluation: From Disk Drives To Storage Area Networks", 18th IEEE Symposium on Mass Storage Systems and Technologies, April 2001.
[3] Xavier Molero, Federico Silla, Vicente Santonja and José Duato, "Modeling and Simulation of Storage Area Networks", Modeling, Analysis and Simulation of Computer and Telecommunication Systems, IEEE 2000.
[4] Xavier Molero, Federico Silla, Vicente Santonja and José Duato, "A Tool For The Design And Evaluation Of Fibre Channel Storage Area Networks", Proceedings of the 34th Simulation Symposium, 2001.
[5] Petra Berenbrink, André Brinkmann and Christian Scheideler, "SIMLAB - A Simulation Environment for Storage Area Networks", 9th Euromicro Workshop on Parallel and Distributed Processing (PDP), 2000.
[6] John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan, "The HP AutoRAID Hierarchical Storage System", ACM Transactions on Computer Systems, 1996.
[7] Gregory R. Ganger and Yale N. Patt, "Using System-Level Models to Evaluate I/O Subsystem Designs", IEEE Transactions on Computers, 1998.
[8] Thomas M. Ruwart, "Performance Characterization of Large and Long Fibre Channel Arbitrated Loops", IEEE 1999.
[9] John R. Heath and Peter J. Yakutis, "High-Speed Storage Area Networks Using Fibre Channel Arbitrated Loop Interconnect", IEEE Network 2000.
[10] FC-AL, "FC Arbitrated Loop", ANSI X3.272:1996.
[11] FC-PH, "Fibre Channel Physical and Signaling Interface (FC-PH)", ANSI X3.230:1994.
[12] FC-SW, "FC Switch Fabric and Switch Control Requirements", ANSI X3.950:1998.
[13] Technical Committee T11, FC Projects, http://www.t11.org/Index.html.
[14] E. W. Dijkstra, "A Note on Two Problems in Connexion with Graphs", Numerische Mathematik 1 (1959), 269-271.
[15] M. Jurczyk, "Performance and Implementation Aspects of Higher Order Head-of-Line Blocking Switch Boxes", Proceedings of the 1997 International Conference on Parallel Processing, IEEE 1997.

Cost-Effective Remote Mirroring Using the iSCSI Protocol
Ming Zhang, Yinan Liu, and Qing (Ken) Yang
Department of Electrical and Computer Engineering, University of Rhode Island, Kingston, RI 02874
mingz, yinan, qyang@ele.uri.edu; tel: +1-401-874-5880; fax: +1-401-782-6422

Abstract
This paper presents a performance study of the iSCSI protocol in the context of remote mirroring. We first integrate our caching technology, called DCD (disk caching disk), into a standard iSCSI target device to form a high-performance storage system for mirroring purposes. Performance measurements are then carried out using this storage system as well as standard iSCSI targets as mirroring devices. We consider remote mirroring on a LAN (local area network) and on a commercial WAN (wide area network). The workloads used in our measurements include popular benchmarks such as PostMark and IoMeter, and real-world I/O traces. Our measurement results show that iSCSI is a viable approach to cost-effective remote mirroring for organizations that have a moderate amount of data changes. In particular, our DCD-enhanced iSCSI target can greatly improve the performance of remote mirroring.
1. Introduction
Remote data mirroring has become increasingly important as organizations and businesses depend more and more on digital information [1]. It has been widely deployed in the financial industry and other businesses for tolerating failures and for disaster recovery. Traditionally, such remote mirroring is done through a dedicated SAN (storage area network) with FC (Fibre Channel) connections that are usually very costly in terms of installation and maintenance. A newly emerging protocol for storage networking, iSCSI [2], was recently ratified by the Internet Engineering Task Force [3]. The iSCSI protocol is perceived as a low cost alternative to the FC protocol for remote storage [4][5][6][7]. It allows block level storage data to be transported over the popular TCP/IP network, which can cover a wide area across cities and states. Therefore, iSCSI lends itself naturally as a cost-effective candidate for remote mirroring, making use of the available Internet infrastructure.
The viability of the iSCSI protocol for remote mirroring depends, to a large extent, on whether acceptable performance can be obtained to replicate data to a remote site. While there are studies reported very recently on iSCSI performance on LAN networks, campus networks, and emulated WANs [4][5][6][7], the open literature lacks technical data on the performance of the iSCSI protocol for remote mirroring over a realistic commercial WAN. The objective of this paper is twofold. First, we incorporate our new storage architecture into an iSCSI target to enhance write performance specifically for remote mirroring purposes. Secondly, we carry out measurement experiments to study the performance of the iSCSI protocol for remote mirroring on both a LAN and a commercial WAN network.
Our new storage architecture is referred to as DCD (disk caching disk) [8][9]. The idea is to use a log disk, called the cache-disk, as an extension of a small NVRAM to cache file changes and to destage the cached data to the data disk afterward, when the system is idle. Small and random writes from the iSCSI network are first buffered in the small NVRAM buffer. Whenever the cache-disk is idle or the RAM is full, all data in the RAM buffer are written sequentially in one data transfer into the cache-disk. The RAM buffer is then made available quickly to absorb additional requests, so that the two-level cache appears to the iSCSI network as a huge RAM with the size of a disk. When the data disk is idle, a destage operation is performed, in which the data is transferred from the cache-disk to the normal data disk. Since the cache is a disk, it is cost-effective and highly reliable.
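The write path of the DCD idea can be sketched in a few lines. The following Python fragment is a minimal, illustrative model only, not the authors' implementation; the class, method and buffer-size names are invented for this example:

class DiskCachingDisk:
    # Toy model of the DCD write path: NVRAM buffer -> log (cache) disk -> data disk.

    def __init__(self, nvram_capacity_kb=32 * 1024):
        self.nvram_capacity_kb = nvram_capacity_kb
        self.nvram = []           # small writes pending in NVRAM
        self.cache_disk_log = []  # logs written sequentially to the cache disk
        self.data_disk = {}       # final home location of each block

    def write(self, block, data):
        # Small random writes are acknowledged as soon as they sit in NVRAM.
        self.nvram.append((block, data))
        if self._nvram_used_kb() >= self.nvram_capacity_kb:
            self.flush_nvram_to_cache_disk()
        return "ack"

    def _nvram_used_kb(self):
        return sum(len(d) for _, d in self.nvram) / 1024.0

    def flush_nvram_to_cache_disk(self):
        # One large sequential transfer turns many small writes into a single log.
        if self.nvram:
            self.cache_disk_log.append(list(self.nvram))
            self.nvram.clear()

    def destage(self):
        # When the data disk is idle, cached logs are moved to their home locations.
        for log in self.cache_disk_log:
            for block, data in log:
                self.data_disk[block] = data
        self.cache_disk_log.clear()

dcd = DiskCachingDisk()
dcd.write(10, b"x" * 4096)
dcd.flush_nvram_to_cache_disk()
dcd.destage()

A complete model would also serve reads from the NVRAM or the log before falling back to the data disk; only the write path relevant to mirroring is sketched here.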
In addition, the log disk is only a cache that is transparent to the file system and upper layer applications.
We incorporated our DCD into the iSCSI target program [10] that is used as a storage device for remote mirroring purposes. We carried out measurement experiments in two different settings: one inside our laboratory and the other over a commercial WAN through Cox Business Internet services. Our experiments with a real-world commercial WAN give us insights that may not be obtainable using an emulated WAN in a laboratory [4][5][6][7]. We measured a variety of benchmarks and real-world I/O traces widely used in the file system and storage system communities. The measured results show that the iSCSI protocol is a viable and cost-effective approach for remote mirroring. It is particularly useful for small to medium size organizations to deploy economical remote mirroring for tolerating site failures and for disaster recovery when the cost of losing data matters.
2. Experimental Methodology
[Figure 1. Experimental settings for performance measurements of iSCSI remote mirroring]
Figure 1 shows the two experimental setups for our experiments. Our first experiment was carried out inside our laboratory, as shown in Figure 1(a). Several server hosts are connected to a mirroring storage system through an Intel NetStructure 470T Gigabit Ethernet switch. The server hosts act as iSCSI initiators while the mirroring storage system acts as an iSCSI target. Our second experimental setting is over a realistic commercial WAN through Cox Business Services. The iSCSI initiators in the LAN inside our laboratory are connected through our campus network and leased lines to the educational Internet. They are then connected to a website of a business office, Elake Data Systems, Inc., on the Cox Communications Inc. cable network. The business office is used as a remote mirroring site that is located in a different town several miles away from our university campus. The downstream speed to the mirroring site is theoretically 3 Mbps and the upstream speed is theoretically 256 Kbps. Because of the sharing of cables, the actual speed varies depending on network traffic and the time of day; we found during our experiments that the actual speeds vary from 40.7 KBps to 294 KBps. The cost of such a business connection is less than $100/month in the New England area. We believe that such a connection and its cost represent a typical network connection for small to medium size businesses that mirror a moderate amount of their business data at a remote site for failure tolerance and disaster recovery. As indicated in [1], leasing a WAN connection with a speed of 155Mbps could cost about $460,000/year. Our objective here is to analyze the backup performance of the iSCSI protocol over an inexpensive WAN network, where the iSCSI protocol is likely to be used for cost effectiveness.
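For a rough feel of what such a link can sustain, the back-of-the-envelope estimate below uses only the figures above (it is an illustration added for clarity, not a measurement; the function name is invented):

def hours_to_mirror(data_mb, link_kb_s):
    # Time to push a given amount of changed data across the WAN link.
    return data_mb * 1024.0 / link_kb_s / 3600.0

# Nominal 3 Mbps toward the mirror site is about 375 KB/s; the observed
# effective rates varied between roughly 40.7 KB/s and 294 KB/s.
for rate in (40.7, 294.0, 375.0):
    print(f"{rate:6.1f} KB/s -> {hours_to_mirror(1024, rate):4.1f} h per GB of changes")

At the rates observed, a gigabyte of changed data takes on the order of one to seven hours to replicate, which is why the approach is aimed at organizations with a moderate amount of data changes.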
All machines used in the experiments are Dell servers equipped with a single Pentium III 866MHz CPU, 512MB SDRAM, and an Intel Pro1000T Gigabit NIC (network interface card). We run Redhat 9 as the operating system with a recompiled Linux kernel 2.4.20. Our iSCSI implementation is based on the reference implementation ref20 19b from the University of New Hampshire [10]. The SCSI controllers we used are Adaptec AIC7899 Ultra 160 controllers. All SCSI disks used in our experiments are 18GB Seagate ST318452LW Ultra 160 disks. When two disks need to be connected to the same SCSI controller, we connect them to different channels.
At the iSCSI target, our DCD system is integrated with the iSCSI target software. Random and small write requests coming from the network are first buffered in the 32MB NVRAM buffer, and the target acknowledges write completion to the server host immediately. These small write requests are collected to form a log that is moved sequentially to the cache disk as soon as the cache disk is idle or the data in the NVRAM exceeds a predetermined watermark. As a result, there is always room in the NVRAM buffer to accept new requests, and the two-level hierarchy consisting of the RAM buffer and the cache disk appears to the network as a large RAM absorbing write requests quickly. In our current implementation, we use part of the host memory to emulate the NVRAM. Our previous experiments have shown that the maximum time before the data in the NVRAM buffer are moved to a disk is usually less than 100 milliseconds, which guarantees the safety of mirrored data even when a DRAM buffer is used instead of the NVRAM, provided that a UPS is used. Destaging operations between the cache disk and the data disk are done when the target storage system is idle.
Our remote mirroring software is based on the RAID1 code in the Linux kernel. There are two devices in a typical mirroring configuration: one is a primary device and the other is a secondary device. In a production implementation, there might be one or more spare devices available. Since we are interested in the performance of remote mirroring, we only consider the two-device configuration here. The mirroring software exports a block device to the operating system and applications that is very similar to a normal RAID1 device such as /dev/md0 or /dev/mdx. All read requests are sent to the primary device only, while all write requests are sent to both devices. A write request sent from an upper layer is acknowledged as finished only after the mirroring software receives acknowledgments from both devices. Therefore, our mirroring software falls into the category of "lockstep" or "synchronous" mode as defined by Ji et al. [1].
The following is a list of hardware configurations that our experiments are based on:
- A primary SCSI disk with another SCSI disk as the mirroring device (S-S). We use a local SCSI disk to mirror another SCSI disk in the same system. This is a baseline configuration used as a reference for comparison purposes.
- A primary SCSI disk with an iSCSI target device on a LAN as the mirror device (S-iL). In this configuration, we use an iSCSI target device on a LAN to mirror a local SCSI disk.
- A primary SCSI disk with an iSCSI target device on a WAN as the mirror device (S-iW). In this configuration, we use an iSCSI target device on a WAN to mirror a local SCSI disk.
- A primary SCSI disk with a DCD-enhanced iSCSI device as the mirroring device (S-iD). This configuration incorporates our DCD technology into the iSCSI target device for mirroring purposes. Such DCD-enhanced iSCSI storage can be either located on the LAN, designated as S-iDL, or located on the WAN, referred to as S-iDW.
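The synchronous (lock-step) behaviour of the mirroring software described above can be summarized in a short sketch. This is an illustrative model, not the authors' kernel code; the device interface assumed here (objects with read/write methods) is invented for the example:

from concurrent.futures import ThreadPoolExecutor

class SyncMirror:
    # RAID1-style mirror: reads go to the primary; a write completes only
    # after both the primary and the secondary (remote iSCSI) device ack.

    def __init__(self, primary, secondary):
        self.primary, self.secondary = primary, secondary
        self._pool = ThreadPoolExecutor(max_workers=2)

    def read(self, block):
        return self.primary.read(block)

    def write(self, block, data):
        futures = [self._pool.submit(dev.write, block, data)
                   for dev in (self.primary, self.secondary)]
        for f in futures:
            f.result()            # block until both devices acknowledge
        return "ack"              # only now is the write reported as finished

In this lock-step mode the write latency is bounded below by the slower of the two devices, which is why the WAN round-trip time becomes the dominant cost in the measurements that follow.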
The workloads used in our experiments consist of popular storage benchmarks such as PostMark [11], IoMeter [12], and real-world traces. PostMark [11] is a widely used [13][14] file system benchmark tool from Network Appliance, Inc. It measures performance in terms of transaction rates in an ephemeral small-file environment by creating a large pool of continually changing files. Once the pool has been created, a specified number of transactions occur. Each transaction consists of a pair of smaller transactions, i.e. Create file or Delete file, and Read file or Append file. Each transaction's type and the files it affects are chosen randomly. The read and write block size can be tuned. On completion of each run, a report is generated showing metrics such as elapsed time, transaction rate, total number of files created, and so on. IoMeter is another highly flexible and configurable synthetic benchmark tool that is also widely used in various research works. IoMeter can be used to measure the performance of a mounted file system or a block device. For a mounted file system, it generates a large file as the workspace and performs various configurable operations. For a block device, for example a SCSI disk /dev/sda1 or a RAID or mirroring device /dev/md0, IoMeter treats it as a normal file and directly reads or writes on it after opening it.
Besides the above benchmarks, real-world traces are also used in our experiments. The first trace is TPC-C20, a block level trace downloaded from the Performance Evaluation Laboratory at Brigham Young University, which ran the TPC-C benchmark with 20 data warehouses using a Postgres database on Redhat Linux 7.1 and collected the trace using its kernel level disk trace tool, DTB [15]. The other two traces are I/O traces from OLTP applications running at two large financial institutions, Financial-1 and Financial-2. They represent typical workloads of the financial industry and are made available by the Storage Performance Council in partnership with the University of Massachusetts, who together are hosting a repository of I/O traces for use in the public domain [16]. Table 1 shows the characteristics of the traces that we use.

Table 1. Characteristics of the traces used in our experiments
TPC-C20: 10,000,000 requests; 2,965,750 write requests; request size range 4096B-126,976B; mean request size 47,063B; write size range 4096B-126,976B; mean write size 4,710B; 192 requests per second; 57 writes per second
Financial-1: 2,452,167 requests; 1,502,641 write requests; request size range 1024B-17,116,160B; mean request size 3,855B; write size range 1024B-17,116,160B; mean write size 4,838B; 57 requests per second; 35 writes per second
Financial-2: 2,733,121 requests; 480,529 write requests; request size range 1024B-262,656B; mean request size 2,508B; write size range 1024B-262,656B; mean write size 3,107B; 67 requests per second; 12 writes per second

3. Results and Discussions
3.1. PostMark Results
Our first experiment measures the mirroring performance for the PostMark benchmark. Figure 2 shows our measured times for finishing 100,000 transactions on 10,000 files of the PostMark benchmark. The total amount of data generated by PostMark is about 700MB. We varied the proportion of write requests among all transactions in the benchmark between 50% and 30% and plotted them separately, as shown in Figure 2. Let us first consider mirroring on the LAN network. As shown in Figure 2(a), it takes a longer time to finish all the transactions when mirroring data using the iSCSI target than when mirroring data using a local
S-iW3: SCSI disk and iSCSI disk mirroring over a WAN with 3 parallel iSCSI connections. S-iDW3: SCSI disk and DCD-enhanced iSCSI disk mirroring over a WAN with 3 parallel iSCSI connections. 389 SCSI disk. However, the time difference is not as significant as we initially expected, about 16% longer than that of local mirroring. It is interesting to observe that remote mirroring using the DCD-enhanced iSCSI disk takes shorter time than local disk mirroring. This is because DCD hides many seek times and rotation latencies of small writes by combining them into large logs to the cache disk, while with local disk mirroring, every write operation has to wait for two disk writes both requiring seek times and rotation latencies. Similar performance trends were observed for smaller write proportion as shown in Figure 2(b). While the total transaction times are shorter with 30% of writes because only write operations go to remote storage for mirroring, the relative difference between local mirroring and iSCSI mirroring keeps almost the same, about 15%. iSCSI standard suggests that parallel iSCSI connections in a session may help improving its performance. To observe the effect of parallel connections on iSCSI performance, we measured transaction times with three concurrent connections as shown in the bars graphs marked with suffix 3. From our experiments, it seems that parallel connections do not show significant advantages over single connection. We believe that there could be two possible reasons that lead to the similar performances between parallel connections and single connection. One is the specific iSCSI implementation of UNH and the other is the low traffic intensity of PostMark. It remains open whether and how iSCSI can benefit from parallel and concurrent connections. PostMark results for iSCSI mirroring over the WAN are shown in Figure 2(c) and Figure 2(d) for 50% writes and 30% writes, respectively. Clearly, synchronously mirroring data over the WAN increases the total transaction time dramatically. Compared to local mirroring, the total time to finish the same 100,000 transactions is more than tripled, from 189s to 535s. To gain some insight to why it takes such a long time to mirror over the WAN, we measured the RTT (round trip time) between our initiators and the target. While RTT fluctuates from time to time, we found the average RTT value to be around 14 ms. This round trip delay is on the same order as a disk operation including seek time, rotation latency and transfer time. For each write operation issued by the bench- mark, it has to wait for the mirroring write that has to experience the RTT and a disk operation. As a result, the total transaction time is increased dramatically. The good news is that our DCD technology can help greatly. As shown in the figure, using DCDenhanced iSCSI target for mirroring over the WAN, the total transaction time is reduced by half from that of pure iSCSI target. This improvement can be attributed to the effective caching of the DCD technology. Compared to the local disk mirroring, the DCD-enhanced iSCSI mirroring over the WAN shows about 40% increase in terms of total transaction time for pure synchronous/lock-step mode. Compared to iSCSI mirroring over a LAN, the difference is about 24%. We believe this is quite acceptable since this mirroring mode will essentially never lose data. Of course, one can allow some degree of asynchrony to obtain better performance. 
It would be interesting to compare our results here with the traditional FC-SAN mirroring, which we were not able to do because of lack of such FC-SAN facility. For smaller percentage of write operations as shown in Figure 2(d), similar relative performances were observed though the absolute transaction times are shorter because of smaller number of writes. 3.2. IoMeter Results Our second experiment is to measure the mirroring performances of various configurations using IoMeter benchmark. We configured the IoMeter to generate two types of synthetic workloads, one is 100% writes and the other is 50% writes and 50% reads. Both workloads use random addresses with fixed block size of 4 KB. To minimize the truncation effect that will be explained shortly, we set duration of measurement for each point to 1 hour and reported the average write response time and the maximum response time for each configuration as shown in Figure 3. Comparing the performance of the local mirroring with that of iSCSI mirroring on a LAN shown in Figure 3(a), IoMeter showed a much larger difference than PostMark did. We believe such a large difference in terms of average response time is the result of higher traffic intensity of the IoMeter benchmark. With 100% writes, the IoMeter continuously generates write requests one after another to both pri- 390 Figure 3. Testing results of IoMeter. ms and the response time of using the DCD-enhanced iSCSI target is about 11.9 ms as shown in Figure 3(a). The improvement of the DCD-enhanced iSCSI target over the iSCSI target is about 63%. Putting these results in a different perspective, a computer user would experience 19.4 milliseconds or 11.9 milliseconds delay on average if every write operation were synchronously backed up in a remote site using the iSCSI protocol. Note that these delays correspond to the workload that the user performs write operations continuously one after another. If 50% of continuous disk I/O operations were writes, the average response times would be lower, 8.3 ms and 5.8 ms for the iSCSI target and the DCD-enhanced iSCSI target, respectively as shown in Figure 3(b). Although the average response times are not outrageous, the maximum response times are noticeably large as shown in Figure 3(c) and (d). The maximum response time goes as high as 26 seconds for 100% writes and 6.3 seconds for 50% writes. These high maximum response times suggest that some kind of asynchronous mirroring should be desirable if write traffic is very high. In real world applications, the amount of write opera- mary disk and the mirroring disk. As a result, it creates a lot of traffic over the network for mirroring data. One may argue that IoMeter generates requests back to back implying that it does not generate a new request until the previous request is acknowledged. However, because IoMeter uses asynchronous writes, it receives an acknowledgment as soon as the write is done in the file system cache. A queue of write requests may be formed giving rise to multiple write requests outstanding on the network. Such a queuing effect increases response time very rapidly. It is also this queuing effect that makes the performance improvement of the DCD-enhanced iSCSI mirroring more pronounced, almost 4 times improvement as shown in Figure 3(a). The queuing effect is substantially reduced if we decrease the write ratio from 100% to 50% as shown in Figure 3(b). 
In this case, we observed a smaller relative difference between local mirroring and iSCSI mirroring, and between iSCSI mirroring and DCD-enhanced iSCSI mirroring due to reduced write traffic. For mirroring over the WAN network, the average write response time of using the iSCSI target is 19.4 391 tions is limited and many organizations write less than 3 GB of data per year [17]. We noticed that while the DCD-enhanced iSCSI mirroring shows better average response time, its maximum response time is higher than other configurations. This is because that the target is very busy at that time point and it can hardly find idle time for destage operations. This result is consistent with our previous studies and observations that the DCD is beneficial only when the system can find idle time to carry out destage operations. In our experiments with the IoMeter benchmark, we found several abnormal phenomena that are hard to explain. One of them is that the DCD-enhanced iSCSI mirroring shows significant lower average response times than the local SCSI disk mirroring. Another phenomenon is the truncation effect as mentioned in the beginning of this subsection. To understand these phenomena, we wrote a micro-benchmark program that continuously performs random writes of 4 KB blocks to a file with the system RAM being set to 256 MB. We measured the total times to finish 50,000 writes and 200,000 writes for three different cases: (1) synchronous writes, (2) asynchronous writes, and (3) asynchronous writes with forced flushing at the end. All the mirroring configurations are in a LAN environment. The asynchronous writes mimic the behavior of IoMeter because IoMeter records time stamp for each request individually and reports performance statistics of all finished transactions at the end of each test run. At the end of each test run, there may be writes done in the cache but no yet written into disks. The asynchronous writes with forced flushing at the end will ensure that all write requests generated in a test run are actually written into disks. As a result, the timing difference between the asynchronous writes and asynchronous writes with forced flushing is the truncation effect. Tables 2 and 3 show the measured results. SYNC ASYNC ASYNC+Flush S-S 273 161 171 S-iDL 273 153 155 SYNC ASYNC ASYNC+Flush S-S 1075 666 677 S-iDL 1075 623 671 S-iL 2702 1666 2000 Table 3. Total time in seconds to finish 200,000 transactions by the microbenchmark. For synchronous writes, we can see that both local mirroring and DCD-enhanced iSCSI mirroring have the same performance which is bounded by the slowest device that we believe is the primary disk. The iSCSI mirroring, however, takes longer time because the iSCSI target may be slower than the primary disk some times during the experiment. For asynchronous writes, the DCD-enhanced iSCSI mirroring shows advantages and even performs better than the local disk mirroring because the performance is no longer bounded by the primary disk because of cache effects. The performance difference increases as the number of transactions increases from 50,000 to 200,000. Similarly, the performance improvement of the DCDenhanced iSCSI mirroring over iSCSI mirroring also increases as the number of transactions increases from 50,000 to 200,000. The truncation effects are also shown in the tables as the performance differences between asynchronous and asynchronous with forced flushing. This difference can be as large as 148% as in the case of S-iL column for 50,000 requests. 
Such truncation effects did show when measuring the IoMeter performance with short measuring time of a few minutes. Therefore, we purposely enlarged the duration of each test run to 1 hour for each point of performance data. As shown in the tables, when we increase the length of test from 50,000 to 200,000 requests, the truncation effect reduced from 148% to 20% for the case of iSCSI mirroring. Similarly, the difference becomes negligible for the local mirroring case for longer test run. However, for some reason that we are not able to explain, the truncation effect increases a little bit (about 7%) for the DCD-enhanced mirroring after increasing the testing time. We believe that it may be the result of the destage process of the DCD system similar to the S-iL 649 208 515 Table 2. Total time in seconds to finish 50,000 transactions by the microbenchmark. 392 phenomenon shown in Figure 3(c,d). request in the trace is supposed to be issued according to the trace timing. As a result, some write requests in the traces may get skipped until a thread is released. For the experiments reported in Figure 5, we observed that the initiators skipped about 6.55% of writes in the Financial-1 trace and 1.57% in the Financial-2 trace. Note that such skipping would not have happened in real applications but just slowed down the write process. It happened in our experiments because of our applying the traces collected in a fast environment to a slow environment. To avoid skipping write requests in the traces, we carried out write coalescing at the iSCSI initiator side. Figure 6 shows the response time plots with write coalescing. The write coalescing size is 8 consecutive write requests. That is, each thread collects 8 consecutive writes and issues one write as a batch resulting from coalescing the 8 write operations. With this write coalescing, the 16 threads are able to issue 100% of write requests in both Financial-1 and Financial-2 traces. The response times plotted in Figure 6 correspond to the batch write operations. As shown in Figure 6, the response times for remote mirroring fluctuate but the maximum response times are lower than those in Figure 5. Majority of write mirroring can be done within a half of a second. Our measurement show that about 74.4% of mirroring can be done within one half of a second for financial-1 and about 90% of mirroring can be done within one half of a second for Financial-2. In order word, using the iSCSI protocol with our DCD architecture, one can mirror their business data on transaction-basis to a different town through a very inexpensive Internet connection (less than $100 per month). Mirrored data are safe within 6 seconds in the worst case and over 90% of data can be mirrored safely within half of a second if the transaction rate is as intensive as the Financial2 trace. For TPC-C traces, we are still experiencing skipped requests of about 1.5% because of high traffic intensity. The response times for TPC-C shown in Figure 6(c) have about 98.5% of write operations of the entire trace. During our WAN experiments, we also noticed that measured data varies from time to time because of different levels of network contentions. For example, results of daytime measurements may differ from night time and similarly weekdays from weekends. 3.3. Traces Results Figure 4(a) shows the average response times with the three types of the mirroring schemes for the three traces: Financial-1, Financial-2, and TPC-C. Similar performance results to that of the Postmark were observed. 
If the iSCSI protocol is to be used for remote mirroring, it is important to know the maximum delay caused by iSCSI protocol to determine which mirroring approach to take. We therefore plotted the minimum and maximum response times for three different mirroring schemes as shown in Figure 4(b) and 4(c). It is important to note that the real world workloads are quite different from the benchmarks in that one can always find idle time to take full advantage of the DCD technology. As shown in the figures, the maximum response time of the DCD-enhanced iSCSI is constantly the lowest among the three, implying its smooth and steady performance. The pure iSCSI, on the other hand, takes as much as 2 seconds maximally to mirror a write as shown in the figure. Therefore, it is advisable to use some kind of asynchronous mirroring approaches if an application cannot wait for such a long time. The performance of remote mirroring over a realistic WAN network is shown in Figure 5. We drew here the response time plots of two traces, Financial-1 and Financial-2, mirroring over the WAN using the DCDenhanced iSCSI only. As shown in this figure, the response times are noticeably much larger than those in the LAN. The maximum response time is as high as 13 seconds for Financial-1 and 5 seconds for Financial-2. The average response times are 733ms and 405ms for Financial-1 and Financial-2, respectively. However, majority of remote write operations can finish within one second for both cases as shown in this figure. When converting the traces to SCSI requests, we used 16 parallel threads and each thread generates an SCSI request to the iSCSI initiator according to the time, address and size of one entry of the trace. After a request is issued, the thread waits for the response before generating another SCSI request. Because packet response times fluctuate greatly on a realistic WAN, it may happen that all our 16 threads are busy (blocked) waiting for responses while a new I/O 393 Figure 4. Response time for different mirroring schemes in LAN. 394 Figure 5. Response time plot of Financial-1 and Financial-2 traces over WAN. Figure 6. Response time with write coalescing (batch size=8 consecutive writes) over WAN. 395 4. Related Work protocol, a new storage architecture called DCD is incorporated in the iSCSI target software. Measured results show that the DCD-enhanced iSCSI target storage provides smoother and better performance than local disk mirroring in a LAN environment. Experiments on a real commercial WAN show that response times fluctuate and can be very large. Still, data can be mirrored safely at a remote town within a second on average for typical online transaction processing such as Financial-1 and Financial-2 over an inexpensive Internet connection. Our experiments suggest that write coalescing on the initiator side can help in reducing network traffic. Our experience also indicates that measuring performance over a realistic WAN is quite different from an emulated WAN in a laboratory that is more controllable. Remote mirroring is not new for data protection and disaster recovery [18]. Companies such as IBM, EMC, Veritas, Computer Associates, and Network Appliance Inc, etc., all provide their own proprietary solutions [19][20][21][22][23]. A good summary of various remote mirroring approaches can be found in [1] including a new asynchronous remote mirroring protocol called Seneca. 
Myriad [24] uses cross-site checksums instead of direct replication to achieve the same level of disaster tolerance as a typical single mirrored solution requiring less resources. Venti [25] uses a unique hash of a block’s contents to act as the block identifier for read and write operations. Thus it enforces a write once policy and can act as a building block for constructing a variety of storage applications with backup and snapshot characteristics. iSCSI [2] is an emerging IETF standard [3] to provide a mapping for the block level SCSI commands and data over existing TCP/IP networks. Such a technology is supposed to provide a cost-effective alternative to build low cost SAN systems. Meth and Satran [26] discussed some strategies they adopted when designing the iSCSI protocol. Many research works [5][6][7] concentrated on iSCSI performance evaluation in various hardware environments. Most of the WAN performance evaluations are carried out in emulated WAN environments instead of a realistic WAN. Nishan systems (now McData) and other vendors carried out a “Promontory Project” [27] to demonstrate the feasibility of iSCSI in long distance transmission with high speed WAN FC links. Tomonori and Masanori proposed optimization techniques in software iSCSI implementations [28]. A novel cache strategy was proposed to improve the iSCSI performance [4]. It was also proposed to use iSCSI for distributed RAID systems [29]. Many iSCSI software and hardware implementations and products are already available [10][30]. Acknowledgment This research is supported in part by National Science Foundation under grants CCR-0073377 and CCR-0312613. Any opinion, findings and conclusions are those of authors and do not necessarily reflect the views of NSF. We would like to thank our shepherd, Ben Kobler, and the anonymous referees for their valuable comments. The authors would like to thank Elake Data Systems, Inc. (http://www.elakedata.com) for providing the remote site for our experiments. References [1] M. Ji, A. Veitch, and J. Wilkes, “Seneca: remote mirroring done write,” in Proceedings of the 2003 USENIX Annual Technical Conference, San Antonio, TX, June 2003, pp. 253–268. [2] J. Satran, K. Meth, C. Sapuntzakis, M. Chadalapaka, and E. Zeidner, “iSCSI draft standard,” http://www.ietf.org/internet-drafts/draftietf-ips-iscsi-20.txt. 5. Summary and Conclusions [3] C. Boulton, “iSCSI becomes official storage standard,” http://www.internetnews.com/. We have carried out measurement experiments to study the viability of using the iSCSI protocol for remote data mirroring for failure tolerance and disaster recovery. To enhance the performance of the iSCSI [4] X. He, Q. Yang, and M. Zhang, “Introducing SCSI-To-IP Cache for Storage Area Networks,” in Proceedings of the 2002 International 396 Conference on Parallel Processing, Vancouver, Canada, Aug. 2002, pp. 203–210. [15] Performance Evaluation Laboratory, Brigham Young University, “DTB: Linux Disk Trace Buffer,” http://traces.byu.edu/new/Tools/. [5] W. T. Ng, B. Hillyer, E. Shriver, E. Gabber, and B. Ozden, “Obtaining high performance for storage outsourcing,” in Proceedings of the Conference on File and Storage Technologies (FAST), Monterey, CA, Jan. 2002, pp. 145–158. [16] SPC, “Storage Performance Council I/O traces,” http://www.storageperformance.org/. [17] P. Desmond, “Going the distance for business continuity,” http://www.nwfusion.com/supp/2003/ business/1020distance.html, Oct. 2003. [6] S. Aiken, D. Grunwald, A. R. Pleszkun, and J. 
Willeke, “A performance analysis of the iSCSI protocol,” in IEEE Symposium on Mass Storage Systems, San Diego, CA, Apr. 2003, pp. 123– 134. [18] C. Chao, R. English, D. Jacobson, A. Stepanov, and J. Wilkes, “Mime: a high performance parallel storage device with strong recovery guarantees,” Hewlett-Packard Laboratories, Tech. Rep. HPL-CSP-92-9 rev1, Nov. 1992. [7] Y. Lu and D. H. C. Du, “Performance study of iSCSI-based storage subsystems,” IEEE Communication Magazine, vol. 41, no. 8, Aug. 2003. [19] IBM, “DFSMS SDM Copy Services,” http://www.storage.ibm.com/software/sms/sdm/. [8] Y. Hu and Q. Yang, “DCD—disk caching disk: A new approach for boosting I/O performance,” in Proceedings of the 23rd International Symposium on Computer Architecture, Philadelphia, Pennsylvania, May 1996, pp. 169–178. [20] EMC, “Symmetrix remote data (SRDF),” http://www.emc.com/. [21] Veritas, “VERITAS http://www.veritas.com. [9] Q. Yang and Y. Hu, “System for destaging data during idle time,” U.S. Patent 5 754 888, Sept. 24, 1997. storage facility replicator,” [22] Computer Associates, “BrightStor ARCserve backup,” http://www.ca.com. [10] UNH, “iSCSI reference implementation,” http://www.iol.unh.edu/consortiums/iscsi/. [23] Network Appliance Inc., “SnapMirror software: Global data availability and disaster recovery,” http://www.netapp.com/. [11] J. Katcher, “PostMark: A new file system benchmark,” Network Appliance, Tech. Rep. 3022, 1997. [24] F. Chang, M. Ji, S.-T. A. Leung, J. MacCormick, S. E. Perl, and L. Zhang, “Myriad: Costeffective disaster tolerance,” in Proceedings of the Conference on File and Storage Technologies (FAST), Monterey, CA, Jan. 2002. [12] Intel, “IoMeter, performance analysis tool,” http://www.iometer.org/. [13] K. Magoutis, S. Addetia, A. Fedorova, M. Seltzer, J. Chase, A. Gallatin, R. Kisley, R. Wickremesinghe, and E. Gabber, “Structure and performance of the direct access file system(DAFS),” in Proceedings of USENIX 2002 Annual Technical Conference, Monterey, CA, June 2002, pp. 1–14. [25] S. Quinlan and S. Dorward, “Venti: a new approach to archival storage,” in Proceedings of the Conference on File and Storage Technologies (FAST), Monterey, CA, Jan. 2002. [26] K. Z. Meth and J. Satran, “Design of the iSCSI protocol,” in IEEE Symposium on Mass Storage Systems, San Diego, CA, Apr. 2003. [14] J. L. Griffin, J. Schindler, S. W. Schlosser, J. S. Bucy, and G. R. Ganger, “Timing-accurate storage emulation,” in Proceedings of the Conference on File and Storage Technologies (FAST), Monterey, CA, Jan. 2002, pp. 75–88. [27] McData, “The Promontory project: Transcontinental IP storage demonstration,” http://www.mcdata.com/splash/nishan/. 397 [28] F. Tomonori and O. Masanori, “Performance optimized software implementation of iSCSI,” in International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), New Orleans, LA, Sept. 2003. [29] X. He, P. Beedanagari, and D. Zhou, “Performance evaluation of distributed iSCSI RAID,” in International Workshop on Storage Network Architecture and Parallel I/Os (SNAPI), New Orleans, LA, Sept. 2003. [30] Microsoft, “Microsoft delivers iSCSI support for Windows,” http://www.microsoft.com. 398 Simulation Study of iSCSI-based Storage System* Yingping Lu, Farrukh Noman, David H.C. 
Du Department of Computer Science & Engineering, University of Minnesota Minneapolis, MN 55455 Tel: +1-612-625-4002, Fax: +1-612-625-0572 Email: {lu, noman, du }@cs.umn.edu Abstract iSCSI is becoming an important protocol to enable remote storage access through the ubiquitous TCP/IP network. Due to the significant shift in its transport mechanism, iSCSI-based storage system may possess different performance characteristics from the traditional storage system. Simulation offers a flexible way to study the iSCSI-based storage system. In this paper, we present a simulation work of iSCSI-based storage system based on ns2. The storage system consists of an iSCSI gateway with multiple targets, a number of storage devices that are connected to the storage router through FC-AL and clients, which access the target through iSCSI protocol. We present the system model, the implementation of the components, the validation of the model and performance evaluation based on the model. Coupled with the rich TCP/IP support from ns2, the simulation components can be effortlessly used for the performance and alternatives study of iSCSI-based storage systems, applications in broad configurations. 1. Introduction The iSCSI protocol [1][2] has emerged as a transport for carrying SCSI block-level access protocol over the ubiquitous TCP protocol. It enables a client’s block-level access to remote storage data over an existing IP infrastructure. This can potentially reduce the cost of storage system greatly, and facilitate the remote backup, mirroring applications, etc. Due to the ubiquity and maturity of TCP/IP networks, iSCSI has gained a lot of momentum since its inception. On the other hand, the iSCSI-based storage is quite different from a traditional one. A traditional storage system is often physically restricted to a limited environment, e.g. in a data center. It also adopts a transport protocol specially tailored to this environment, e.g. parallel SCSI bus, Fibre Channel, etc. These characteristics make the storage system tend to be more robust, and achieve more predictable performance. It is much easier to estimate the performance and potential bottleneck by observing the workload. While in an iSCSI storage, the transport is no longer restricted to a small area. The initiator and the target can be far apart. The networking technology in between can be diverse and heterogeneous, e.g. ATM, Optical DWDM, Ethernet, Wireless, satellite, etc. The network condition can be congested and dynamically changing. Packets may experience long delay or even loss and retransmission, etc. Thus, the situation facing the iSCSI storage is quite different from the traditional one. To take advantage of iSCSI protocol to build iSCSI storage systems, we need to better understand the iSCSI characteristics, e.g. the performance characteristics in various * This work was supported by DISC from DTC of UoM, and gifts from Intel and Cisco. 399 networking situations, the impact of network transmission error or network component fault to the storage access performance and robustness, the relationship between iSCSI and underlying TCP/IP protocol, etc. The common way to study the performance characteristics about the iSCSI storage system is through the real performance measurement. Real measurement can be pretty accurate. Papers [4][5][6] represent this endeavor. However, the measurement approach is often restricted to the physical equipments and settings. In a lot of times, the software or hardware of the equipment is not open. 
Thus a tester often cannot adjust parameters or try an alternative algorithm in a performance study. In this regard, a simulation approach offers much more flexibility. Once the simulation components have been implemented, it is very easy to construct test configurations, configure parameters of interest, or add an alternative algorithm to study iSCSI-related issues. To the best of our knowledge, no simulation model has previously been built for the iSCSI protocol. Our goal in this work is to establish a simulation model for iSCSI-based storage systems for the study of their characteristics. In addition, we also study the interactions between the iSCSI and TCP layers to better support iSCSI access. We use the network simulator NS2 [9] to implement the iSCSI simulation model. NS2 is an event-driven simulator widely used in networking research. It provides substantial support for the simulation of TCP/UDP, routing, and multicast protocols over wired and wireless networks. To validate the simulated model, we also conducted real performance measurements and compared the simulation results with the real performance data. In addition, we examine different TCP parameters and investigate how they affect iSCSI performance based on the simulation model. This paper is organized as follows: Section 2 presents the simulation model for the iSCSI-based storage system. Section 3 describes the iSCSI implementation in NS2. Section 4 presents an empirical validation of the model. In Section 5, we analyze the performance results of the iSCSI model. Section 6 reviews related work. Finally, we conclude this paper in Section 7.
2. Simulation Model
2.1. A Typical iSCSI Storage System Model
Figure 1 shows a typical iSCSI-based storage system model used in the simulation. In this model, an initiator generates SCSI requests, which are encapsulated into iSCSI messages (protocol data units, or PDUs). These PDUs are then transmitted over the TCP/IP network and routed to an iSCSI storage gateway.
[Figure 1 The storage system model: an initiator (iSCSI over TCP/IP) connected through a TCP/IP network to a storage gateway (iSCSI/TCP/IP on one side, FCP/FC on the other), which attaches to a SAN over FC-AL]
The storage gateway has both a TCP/IP networking interface and an FC-AL interface. The TCP/IP interface provides the iSCSI connection for an initiator to access through the IP network, while the FC-AL interface is used to connect to a SAN storage subsystem. In this SAN environment, the gateway serves as an initiator to the FC-enabled disks. It uses the SCSI over FC encapsulation protocol (FCP) to access the disk devices through Fibre Channel.
2.2. The iSCSI Data Transfer Model
Figure 2 shows the iSCSI architecture model. iSCSI builds on top of the TCP transport layer. For an iSCSI initiator to communicate with a target, they need to establish a session between them. Within a session, one or multiple TCP connections are established. The data and command exchange occurs within the context of the session. Figure 3 shows iSCSI command execution by illustrating a typical Write command. The execution consists of three phases: Command, Data, and Status response. In the Command phase, the SCSI command (in the form of a Command Descriptor Block (CDB)) is incorporated in an iSCSI command PDU. The CDB describes the operation and associated parameters, e.g. the logical block address (LBA) and the length of the requested data. The length of the data is bounded by a negotiable parameter, "MaxBurstLength". During the Data phase, the data PDUs are transmitted from the initiator to the target.
Normally, the initiator needs to wait for “Ready to Receive (R2T)” message before it can send out data (solicited data). However, both initiator and target can negotiate a parameter “FirstBurstLength” to speed up the data transmission without waiting. FirstBurstLength is used to govern how much data (unsolicited data) can be sent to the target without receiving “Ready to Receive (R2T)”. A R2T PDU specifies the offset and length of the expected data. To further speed up the data transfer, one data PDU can be embedded in the command PDU if “ImmediateData” parameter is enabled during the parameter negotiation. This should be very beneficial for small write operation. The Status PDU is returned once the command is finished. Application layer Application SCSI layer SCSI Initiator iSCSI protocol layer TCP/IP Ethernet iSCSI protocol entity TCP/IP Data link Disk I/O requests SCSI iSCSI Session iSCSI transport TCP/IP Ethernet Logical Unit Block data SCSI Target SCSI CDB iSCSI protocol entity Target R2T iSCSI Data-out PDU iSCSI PDU R2T R2T TCP/IP TCP segment IP packet Data link Ethernet frame Physical media Figure 2. The iSCSI model Initiator iSCSI Write CMD (CDB) iSCSI Data-out PDU iSCSI Data-out PDU iSCSI Response Figure 3. The command execution sequence Finally, these messages are encapsulated into TCP/IP packets, where the packet size is bounded by MSS (maximum segment size) in TCP. MSS is determined by the smallest frame size along the path to the destination. Within an Ethernet LAN, the maximum frame size is 1500 bytes (Gigabit Ethernet supports Jumbo frame). The MSS is 1460 bytes (40 bytes for IP and TCP header). When the iSCSI PDU size is greater than the segment size, the PDU is further fragmented into smaller packets. The size of iSCSI parameters like: MaxBurstLength, FirstBurstLength and PDU size all have certain impact to the iSCSI performance. However, the iSCSI performance is also significantly affected by the underlying TCP flow control, congestion control mechanism. We will also examine the effect of these parameters. 401 Disk Model We use Seagate ST39102FC Cheetah 9LP disk as the storage device. The disk access time Td = Tds+Tdr+Tdt, where Tds is the disk seek time, which is determined by the difference of current cylinder and the target cylinder; Tdr is the disk rotation latency, which is determined by the difference of the current sector when the disk head moves to the target cylinder and the first sector of the intended access. Disk transfer time Tdt is determined by the number of data blocks transferred. When the data size is large, the requested data may span more than 1 track (cross disk surface) or even 1 cylinder. In that case, we also add the head switch or cylinder switch time into the access time. We consider the disk has enough buffers to hold the requested data. The disk not only handles the block data access, it also handles the data transmission. The disk has built-in FC-AL logic and interface. As a normal FC node, it has a physical address and needs to participate the arbitration phase to win the channel before transferring data between the gateway and the disk. 2.3. FC-AL Interconnect Fibre Channel is a popular networking protocol to construct storage area network. It supports switching fabric, loop and point-point construct. FC-AL aims at loop topology where up to 126 FC-AL nodes (hosts and disk devices) are connected to a shared loop. A node obtains the access to the loop through arbitration. 
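The disk service time model introduced above, Td = Tds + Tdr + Tdt, can be sketched in a few lines. The geometry and seek constants below are illustrative placeholders rather than the measured ST39102FC parameters, and head and cylinder switch times are ignored.

```python
import random

# Illustrative constants -- NOT the measured ST39102FC values.
RPM = 10_000                  # spindle speed
SECTORS_PER_TRACK = 170
BYTES_PER_SECTOR = 512
FULL_SEEK_MS = 10.0           # assumed full-stroke seek time
CYLINDERS = 6_000

def seek_time_ms(cur_cyl, target_cyl):
    """Very rough seek curve: grows with the square root of the seek distance."""
    dist = abs(target_cyl - cur_cyl)
    return 0.0 if dist == 0 else FULL_SEEK_MS * (dist / CYLINDERS) ** 0.5

def access_time_ms(cur_cyl, target_cyl, nbytes):
    t_seek = seek_time_ms(cur_cyl, target_cyl)
    rev_ms = 60_000.0 / RPM                        # time for one full revolution
    t_rot = random.uniform(0, rev_ms)              # rotational latency to the first sector
    sectors = -(-nbytes // BYTES_PER_SECTOR)       # ceiling division
    t_xfer = rev_ms * sectors / SECTORS_PER_TRACK  # media transfer time
    return t_seek + t_rot + t_xfer

print(f"64 KB read: {access_time_ms(100, 2500, 64 * 1024):.2f} ms")
```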
The arbitration is determined by the physical addresses of participating nodes. When a node wins arbitration, it opens a connection to its destination node. A node closes a connection and releases the control over a loop when its transfer has finished. In our environment, the disk devices and storage gateways are FC-AL nodes. On top of the Fibre Channel transport, similar to iSCSI, FCP protocol maps SCSI protocol onto the Fibre Channel protocol. Each SCSI protocol is also performed through three phases, FCP-CMD frame for command transfer, FC-Data frame for data transfer, and FCP-response frame for status transfer. Since the maximum frame size is 2048bytes, a SCSI command may require multiple FCP-Data frames to transfer all requested data. For the SCSI Write operation, FCP-XFER-Ready frame is also used for the flow control between the initiator and disk device. The speed of FC-AL can be 1Gb/s, up to 2 Gb/s. Since it adopts 8B/10B encoding, the actual bandwidth is 100MB/s (Mega bytes per second) and 200MB/s respectively. In our simulation, we assume the speed to be 1Gb/s. We use a central module FC to handle the channel arbitration. All participating nodes (disk devices, the gateway) are required to register to this module. When a node requires channel, it submits a request to FC. FC module determines who wins the arbitration. 2.4. Storage Gateway Model The storage gateway works as a bridge between two protocols: iSCSI and FCP. It hosts the targets of the iSCSI and the initiators of the FC storage. In the meantime, it manages the targets, their access control and their related sessions. It also administers the mapping between a target and its constituent disk devices. In the simulation, the disk devices should be “attached to” (add an pointer in) a target. Each target maintains two interfaces: iSCSI on top of TCP agent and the FCP on top of Fibre Channel interface. An outstanding command queue for the each session glues these 2.5. 402 two interfaces together. Each command item in the queue contains the CDB and other status information. When a new command arrives from iSCSI interface, it is placed into the command queue. The command is further sent to the disk device through the FCP interface when the number of outstanding commands falls below the threshold in the target disk. When a command completes, it receives an FCP-RSP frame from the disk. Upon receiving this frame, the target sends out an iSCSI Response PDU to the actual initiator, in the mean time, the command is removed from the queue. However, the commands within a session are completed in order. 3. Simulating iSCSI in NS2 iSCSI Nodes In our simulation there are three types of nodes: Initiator, Target and gateway node. Figure 4 shows these nodes and their related components. A target node is the peer of an initiator node. The gateway node hosts one or multiple target nodes. iSCSIInitiator and iSCSITarget are the node applications running on the initiator and target respectively. Within these nodes, iSCSIInitiatorSession and iSCSITargetSession, derived from iSCSISession, perform the session tasks. Similarly, iSCSIInitiatorConnection and iSCSITargetConnection, derived from iSCSIConnection, perform connection tasks. A iSCSISession can open multiple iSCSIConnections. 3.1. 
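A minimal sketch of the central FC arbitration module described above might look as follows. The priority rule (lower physical address wins) and the class and method names are assumptions for illustration, not the actual ns2 implementation.

```python
class FCArbiter:
    """Toy model of the central FC module: nodes register, request the loop,
    and the highest-priority pending requester wins when the loop is free."""

    def __init__(self):
        self.nodes = {}       # physical address -> node name
        self.pending = set()  # addresses waiting for the loop
        self.owner = None     # address currently holding the loop

    def register(self, address, name):
        self.nodes[address] = name

    def request(self, address):
        self.pending.add(address)
        return self._arbitrate()

    def release(self, address):
        if self.owner == address:
            self.owner = None
        return self._arbitrate()

    def _arbitrate(self):
        if self.owner is None and self.pending:
            # Assumption: the lower physical address has higher priority.
            winner = min(self.pending)
            self.pending.discard(winner)
            self.owner = winner
        return self.owner

arb = FCArbiter()
arb.register(0x01, "gateway")
arb.register(0x10, "disk-0")
arb.register(0x11, "disk-1")
arb.request(0x10); arb.request(0x01)
print(arb.nodes[arb.owner], "holds the loop")   # disk-0 requested first and won
arb.release(0x10)
print(arb.nodes[arb.owner], "holds the loop")   # gateway wins the next arbitration
```

Keeping arbitration in one module this way makes it easy to swap in a different priority or fairness rule without touching the disk or gateway models.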
[Figure 4. Hierarchy of iSCSI node classes: iSCSINode, iSCSISession, iSCSIConnection and iSCSITcpApp, with their Initiator, Target and Gateway specializations]
[Figure 5. Queuing models: (a) Initiator (input queue, session queues, link queue); (b) Target (session, disk, FC and link queues)]
Each iSCSIConnection has its own iSCSITcpApp object through which data is sent and received. We use the FullTcp agent for the iSCSIConnection application. 3.2. Queuing Models Figure 5 shows the queuing models used in the simulation. Fig. 5(a) is the model for an initiator. It has an input queue to receive the workload (SCSI requests). Under the input queue are several session queues: each session possesses a queue for its outstanding SCSI commands, and the maximum number of outstanding commands is configurable. The commands in each session are processed by their corresponding connections and then passed to their TCP agents. The link queue at the bottom is the network node's link queue. Fig. 5(b) shows the target's queuing model. Four types of queues exist: session queue, link queue, disk queue and FC queue. As in the initiator node, the link queue serves the Ethernet link, and a per-session queue holds outstanding commands. Each disk has a disk queue, which holds the outstanding commands for that disk in the target; the disk queue keeps the interaction between the target and the disk simple. The FC queue holds requests to access the FC link. Moreover, each disk itself also has a command queue, and its number of outstanding commands is configurable. 3.3. Implementation Figure 6 shows a typical setting of an iSCSI system implemented in NS2. The system comprises three parts: the initiator, which generates the workload; the TCP/IP network, which can be easily configured from existing NS2 components; and the gateway together with its target disk devices.
[Figure 6. iSCSI system based on NS2: an initiator with sessions and connections, the TCP/IP network, and an iSCSI router (gateway) hosting targets and disk controllers]
All these components are implemented in C++ for efficiency. The OTcl script in NS2 is used to set up the simulation environment: it creates the network nodes and topology, creates the initiator and gateway nodes, creates targets and disks, and attaches the disks to targets. The C++ components also expose a number of configurable parameters, such as iSCSI and disk parameters, to OTcl, which makes changing a test setting very convenient. After the test setting is constructed, the script invokes the login method in the initiator to connect to the gateway. The initiator enters the Login phase and eventually acquires the targets, and the LUNs within each target, from the gateway. The initiator finally creates a session with each target; the number of sessions and the number of connections per session are also configurable parameters. The SCSI Read or Write requests (the workload) are then passed to the initiator to carry out. The workload for each test is generated by a separate OTcl script. We have developed a workload generator program that generates requests following either a uniform or a Poisson distribution.
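The workload generator just mentioned can be sketched as follows; the request tuple format and the parameter values are illustrative, not the exact trace format consumed by the OTcl scripts.

```python
import random

def generate_workload(n_requests, mean_interarrival_ms, n_disks, sizes_kb,
                      read_fraction=0.5, poisson=True, seed=1):
    """Yield (time_ms, op, disk_id, size_bytes) request tuples.

    poisson=True  -> exponential inter-arrival times (Poisson arrivals)
    poisson=False -> evenly spaced arrivals
    Disk ids and request sizes are drawn uniformly, as in the paper.
    """
    rng = random.Random(seed)
    t = 0.0
    for _ in range(n_requests):
        gap = rng.expovariate(1.0 / mean_interarrival_ms) if poisson else mean_interarrival_ms
        t += gap
        op = "read" if rng.random() < read_fraction else "write"
        disk = rng.randrange(n_disks)
        size = rng.choice(sizes_kb) * 1024
        yield round(t, 3), op, disk, size

for req in generate_workload(5, mean_interarrival_ms=10.0, n_disks=4,
                             sizes_kb=[1, 4, 16, 64]):
    print(req)
```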
The disk id is also evenly distributed among the specified range of disks. In addition, we can apply a trace file to the initiator to see how actual application data affect the performance. 4. Empirical Validation To verify the simulation model, we compare it with iSCSI performance measurement data obtained from a real iSCSI setting. In this setup, the initiator communicates with a Cisco SN5420 iSCSI gateway through the campus network. The network connection is 100 Mb/s Ethernet, and four Seagate ST39102FC disks are accessed. The round trip time over this network distance is approximately 2 ms. The tests read and write data bursts of 1K, 4K, 16K and 64K bytes under light and heavy load with a PDU size of 8 KB. 1) A comparison of the real iSCSI access latency and the NS-modeled iSCSI access latency is shown in Figure 7. Both light load (only one thread sending data requests at a time) and heavy load (4 threads sending requests) are tested. The delay patterns in both cases are approximately similar and follow a similar rate of increase. For small data burst sizes the difference is within 2%; for a data burst size of 64K the difference is within 6%. 2) In another test, the burst sizes vary from 4K to 64K under heavy load. Figure 8 shows a close approximation in throughput; the NS model yields slightly higher throughput because of its ideal environment conditions.
[Fig. 7 Response time comparison: read latency for 1 and 4 threads, real vs. NS, burst sizes 1K-64K]
[Fig. 8 Throughput comparison for heavy load: real vs. NS iSCSI at 4 KB and 64 KB burst sizes]
The results from these tests provide reasonable assurance that the NS model can closely approximate a real iSCSI installation. To further support the validity of the model, more thorough analysis can be done with different test scenarios. 5. Performance Analysis In this section, we present the effect of iSCSI parameters in the iSCSI layer and of TCP parameters in the TCP layer on iSCSI data access performance. 5.1. The Effect of the iSCSI Parameters We first examine the effect of different iSCSI PDU sizes. Figure 9 shows the read throughput with varying PDU sizes. The parameters involved in this simulation include data PDU sizes from 0.5 KB to 8 KB and maximum burst sizes from 1 KB to 4 MB. We find that at larger burst data sizes, the PDU size makes a difference. For a large burst size, e.g. 1 MB, a PDU size of 8K yields 128 PDUs, while a PDU size of 1K yields 1024 PDUs. More PDUs cause more R2T messages, and potentially more waiting for the R2T signals. From the figure, we observe better performance for larger PDU sizes as the data size increases.
[Fig. 9 The effect of iSCSI PDU size: read throughput vs. data burst size (1K-4M) for PDU sizes 0.5K-8K]
The Effect of Network Parameters In this subsection, we investigate how network parameters such as the TCP window size, the MSS (Maximum Segment Size) and link delays affect iSCSI performance. We first examine the effect of the MSS size. In this test setting, the TCP window assumes the default value of 20. The link is Gigabit Ethernet with a delay from 10 us to 50 ms. The MSS sizes are 296 B, 576 B and 1500 bytes, respectively.
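Before looking at the measured curves, the window-limited bound that governs these runs can be computed directly. This is only the textbook estimate of one window per round trip, evaluated for the same MSS values and one-way link delays used in the experiments.

```python
def window_bound_mb_per_s(mss_bytes, window_segments, one_way_delay_ms):
    """Rough upper bound: one window of segments per round trip time."""
    rtt_s = 2 * one_way_delay_ms / 1000.0
    return mss_bytes * window_segments / rtt_s / 1e6   # MB/s

# Window of 20 segments, MSS of 296, 576 or 1500 bytes, one-way delays 0.4-50 ms.
for mss in (296, 576, 1500):
    bounds = [round(window_bound_mb_per_s(mss, 20, d), 2) for d in (0.4, 1, 4, 10, 50)]
    print(f"MSS={mss:4d}B:", bounds, "MB/s")
```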
Figure 10 shows the achieved throughput with varying MSS sizes. Normally, for a given delay and MSS size, the maximum throughput that can be achieved is approximately one window per round trip time, i.e. (MSS * window)/2*delay, which implies that throughput is inversely proportional to the link delay for the given MSS. This figure shows that the throughput decreases quickly for smaller MSS sizes, whereas higher MSSs show a gradual decrease even for higher link delays. For link delays less than 1ms, the MSS size does not have much effect on the throughput this is because at short network link delay, bandwidth-delay product is small. The acknowledgement comes back very fast. As the link latency continues to increase, the throughput drops gradually, thus the link utilization is also getting lower. We need more parallelism to take advantage of the link bandwidth. The use of multiple connections may help. Adding more disks will increase the disk I/O bandwidth. The RAID system may also increase the disk access performance. 5.2. Throughput vs. TCP Window size 10 Throughput (MB/s) Throughput (MB/s) Throughput vs. TCP MSS size 8 mss=296 6 mss=576 4 mss=1500 2 0 0.01 0.1 0.4 1 4 10 20 50 10 8 win=20 6 win=40 4 win=80 2 0 0.01 0.1 Link delay (ms) 0.4 1 4 10 20 50 Link delay (ms) Fig. 10 Throughput vs. MSS size Fig. 11 Throughput vs. Window size We then examine how the TCP window variation affects the throughput. Three different TCP window sizes (20, 40 and 80) are used. The MSS size is fixed at 296B as shown in the Figure 11. The result shows that the throughput increases with the increase of TCP window. At the short network link latency (e.g. LAN environment) of 0.4ms, the throughput is about 4MB/s for window size 20, but when the window size increases to 406 80, the throughput reaches to 6.8MB/s. However, with the increase of network latency (long fat network), the window size is too small, and the throughput reduces significantly. 6. Related Work There have been several simulation work related to disk and storage subsystem. Paper [7] gave an introduction about the disk drive modeling and simulation. It described the principle of a disk drive and present a formula to compute disk seek time. Project DiskSim [8] provides an open source code, which can extract the disk parameters of different disk drives. We benefit a lot from their work in the disk simulation. Paper [10] modeled a disk controller and studied some more advanced features like caching. In paper [11], a Storage Area Network (SAN) is simulated. In this SAN, Myrinet is used to connect the storage devices and servers (initiators). On the other hand, iSCSI protocol represents a different SAN paradigm, i.e. it uses the ubiquitous TCP protocol as the SCSI command and data transport. There are several studies in iSCSI performance and characteristics. Papers [4][5][6] presented the iSCSI performance measurement and evaluation under different scenarios. However, due to the diversity of network configuration and the impact of the underlying TCP network, it is crucial to build up the simulation model for the iSCSI environment for the iSCSI-related research. Our work incorporates the rich feature of ns2 in supporting the TCP/IP networking, and implements all related components including disk model, iSCSI protocol, FC-AL protocol and iSCSI target. This can be used to easily construct a flexible iSCSI-based storage system to facilitate the iSCSI related research. 7. Conclusion We have presented a simulation model for the iSCSI-based storage system. 
The model includes all components for constructing an iSCSI-based storage system. In order to meet the requirements of extensibility and flexibility, we make the components modular and generic. The whole system is composed into several components. These components can be easily replaced or extended. In order to validate the implementation, we also conducted real measurement and compared the simulation results with the real measurement results under the same settings. It turns out that the performance results in our model are close to that of real measurement. With the simulation model, we further conducted the performance characteristics study to examine how the iSCSI parameters and the underlying TCP parameters affect the iSCSI performance. Our results show that with larger burst data size, the larger PDU size will help the throughput. Increasing TCP window size and MSS also affects the end-to-end performance. But this effect is more pronounced for higher link delays. In the future, we’ll further study the iSCSI storage system in a diverse network such as fat network (long latency, high bandwidth), wireless network, we’ll also examine the impact of the underlying TCP protocol on the upper-level SCSI access in terms of performance and resilience based on the simulation model. 407 Acknowledgement The authors thank to Avinashreddy Bathula who helped the initial work of this project. We are also grateful for the help offered by our shepherd Randal Burns. References [1] Julian Satran, et al. iSCSI Specification, Internet Draft, http://www.ietf.org/internetdrafts/draft-ietf-ips-iscsi-20.txt, Jan. 2003 [2] Kalman Z. Meth, Julian Satran, Design of the iSCSI Protocol, IEEE/NASA MSST 2003, Apr. 2003. [3] K. Voruganti; P. Sarkar, An Analysis of Three Gigabit Networking Protocols for Storage Area Networks, 20th IEEE International Performance, Computing, and Communications Conference", April 2001 [4] S. Aiken, D. Grunwald, A. Pleszkun, Performance Analysis of iSCSI protocol, IEEE/NASA MSST 2003, Apr. 2003. [5] Y. Lu, D. Du, Performance Evaluation of iSCSI-based Storage Subsystem, IEEE Communication Magazine, Aug. 2003 [6] S. Tang, Y. Lu, D. H.C. Du: Performance Study of Software-Based iSCSI Security. IEEE Security in Storage Workshop 2002: 70-79 [7] C. Ruemmler and J. Wilkes. An Introduction to Disk Drive Modeling. IEEE Computer, Vol. 27, No. 3, 1993, pp 17-28 [8] G. Ganger, B. Worthington, Y. Patt, The DiskSim Simulation Environment, http://www.pdl.cmu.edu/diskSim/index.html [9] The network simulator ns2, http://www.isi.edu/nsnam/ns/ [10] M. Uysal, G. A. Alvarez and A. Mechant, A Modular, Analytical Throughput Model for Modern Disk Arrays, MASCOTS-2001, Aug. 2001 [11] X. Molero, F. Silla, V. Santonja, Jose Duato, Modeling and Simulation of Storage Area Networks, MASCOTS-2000, Sep. 2000, pp. 307 [12] D. Anderson, J. Dykes and E. Riedel, More Than An Interface – SCSI vs. ATA, Proc. of the 2nd Annual Conference on file and Storage (FAST), Mar. 2003 408 Comparative Performance Evaluation of iSCSI Protocol over Metropolitan, Local, and Wide Area Networks Ismail Dalgic Kadir Ozdemir Rajkumar Velpuri Jason Weber Intransa, Inc 2870 Zanker Rd. San Jose, CA 95134-2114 Tel: +1-408-678-8600 {ismail.dalgic, kadir.ozdemir, rajkumar.velpuri, jason.weber}@intransa.com Helen Chen Sandia National Laboratories, California PO Box 969 Livermore, CA 94551 Tel: +1- 925-294-3000 hycsw@ca.sandia.gov Umesh Kukreja Atrica, Inc. 
3255-3 Scott Blvd Santa Clara, CA 95054 Tel: +1-408-562-9400 umesh_kukreja@atrica.com strong contender for wide area connectivity among multiple enterprise locations. While some studies exist on the performance characteristics of the iSCSI protocol [2] [3], the performance characteristics for metropolitan area and wide area networks are yet to be understood. The iSCSI protocol and the underlying TCP/IP and Ethernet protocols have some configurable parameters which impact performance. In this paper, we investigate the effect of some of these parameters on iSCSI throughput. Abstract We identify the tunable parameters of iSCSI and TCP that affect the performance characteristics for local, metropolitan, and wide area networks. Through measurements, we determine the effect of these parameters on the throughput. We conclude that with the appropriate tuning of those parameters, iSCSI and TCP protocols maintain a good level of throughput for all types of networks. 1. Introduction iSCSI [1] is a promising new technology, which overcomes the distance limitations of other storage networking technologies such as Fibre Channel and Infiniband, and thereby enables globally distributed mass storage systems. Wide Area Ethernet services, at the same time are emerging as a 2. Parameters At the iSCSI level, the parameters of interest from a performance point of view are: (i) command request size, (ii) iSCSI command window credit amount, (iii) the number of simultaneous iSCSI connections in a session, and (iv) the 409 traffic. Furthermore, a packet loss in a TCP connection triggers the TCP slowstart and congestion avoidance algorithms, resulting in a drop in the throughput which takes some time to reach back to the maximum possible level [5]. By using multiple connections in a session, the overall impact of this temporary drop in throughput is reduced. On the other hand, the iSCSI protocol has to obey the SCSI command ordering rules that may reduce the parallelism among multiple connections. option of sending solicited vs. unsolicited data. The command request size, the command window credit amount, and the number of simultaneous connections may impact both read and write performance. The choice of solicited vs. unsolicited data may impact write performance, but it has no impact on reads. At the TCP/IP level, the most important parameters are the send and receive window sizes especially in networks with a large bandwidth-delay product such as high speed WANs. As far as solicited vs. unsolicited data transfer is concerned, three independent parameters determine the transfer type: FirstBurstLength, MaxBurstLength, and MaxRecvDataSegmentLength [1]. FirstBurstLength determines the maximum amount of unsolicited data that the initiator can send per command. MaxBurstLength determines the maximum amount of solicited data that the initiator can send per command. MaxRecvDataSegmentLength is the maximum data segment size that can be sent in each protocol data unit (PDU). There are many ways that these 3 parameters can be set. In this study, we consider two cases which produce results in the two extremes: most-unsolicited and most-solicited data allowed by the iSCSI protocol. Most-unsolicited data implies that the FirstBurstLength is greater than or equal to the maximum write command request size that the initiator generates. Most-solicited data implies that the unsolicited data mode is disabled during the login negotiation, effectively equivalent to setting the FirstBurstLength to zero. 
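The split between unsolicited and solicited data for a single write follows directly from these three parameters. The sketch below is an accounting illustration only; the 256 KB request and burst values are arbitrary.

```python
import math

def write_transfer_plan(request_bytes, first_burst_len, max_burst_len):
    """Split one iSCSI write into unsolicited and solicited data.

    first_burst_len -- FirstBurstLength: cap on unsolicited data per command
    max_burst_len   -- MaxBurstLength: cap on solicited data per R2T sequence
    Returns (unsolicited_bytes, solicited_bytes, r2t_count).
    """
    unsolicited = min(request_bytes, first_burst_len)
    solicited = request_bytes - unsolicited
    r2ts = math.ceil(solicited / max_burst_len) if solicited else 0
    return unsolicited, solicited, r2ts

REQ = 256 * 1024  # hypothetical 256 KB write
# Most-unsolicited: FirstBurstLength >= request size -> no R2T exchanges.
print(write_transfer_plan(REQ, first_burst_len=256 * 1024, max_burst_len=256 * 1024))
# Most-solicited: unsolicited data disabled (FirstBurstLength effectively zero).
print(write_transfer_plan(REQ, first_burst_len=0, max_burst_len=256 * 1024))
```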
In this mode, the target will notify the initiator when it is ready to receive data for a given command. This will give the target more control in the receive buffer allocation, but it will introduce extra round trip The iSCSI command request size is the amount of data that is sent or received as part of a SCSI command encapsulated in iSCSI. The iSCSI command window credit amount, dynamically set by the target, determines the maximum number of iSCSI commands that can be outstanding at a given time. The product of these two parameters will determine the maximum amount of data that can be pipelined in the network to deal with the network latency. Increasing this amount will generally improve throughput. The primary reason for iSCSI to support multiple connections per session is to take advantage of trunking in Gigabit LAN switches [4]; each TCP connection may utilize a different link, thus improving the overall throughput of the session. However, even on a WAN or MAN link where only a single path is available between an initiator and a target, the number of simultaneous connections in an iSCSI session may impact the performance due to the behavior of the TCP protocol where each TCP connection adjusts its transfer rate so as to share fairly a congested path. By allowing multiple connections per iSCSI session, the iSCSI traffic is effectively given priority over other TCP 410 delays as compared unsolicited mode. to the and 10 and 50 ms for WAN. In addition, we performed some LAN measurements by using a Gigabit Ethernet connection between the initiator and target, without the LANforge. fully 3. Experimental Setup In this paper, we study the effect of the aforementioned parameters on iSCSI performance for different network types between the initiators and the targets. Note however that we did not study scenarios with multiple TCP connections per iSCSI session because targets and initiators that support this feature are not yet widely available. In addition, the test configurations we study do not have multiple paths. 4. Results 4.1 Effect of the TCP Window Size We first studied the effects of TCP window size setting. By using the default 64KB settings of the Linux kernel 2.4.19, we obtained the results shown in Table 1. The first row corresponds to data being sent from the iSCSI initiator machine to the target, and the second row corresponds to the data sent in the other direction. As can be seen, even at the small values of round trip latency, the default TCP window size settings are inadequate to maintain a good level of throughput. We then increased the maximum send and receive window sizes to 10 MBytes for both the initiator and the target, and achieved the wire speed for all the latency values under consideration as shown in Table 2. In the remainder of this paper, we kept the maximum window sizes at 10 MBytes. In order to isolate the effect of the network latency and not to be affected by the idiosyncrasies of different commercial products, we used a WAN/MAN emulator. This allowed us to vary the network latency while keeping all other parameters unchanged. More specifically, our experimental setup consisted of an open source software initiator by Cisco running on a 933 MHz two processor Intel Pentium III SMP machine, and an Intransa IP5000 iSCSI target, interconnected by a LANforge ICE WAN emulator by Candela Technologies. As traffic generators, we used two open source tools, xdd for block IO and ettcp for tcp traffic. 
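The window sizes needed to keep an OC-3 path full follow from the bandwidth-delay product. The short calculation below uses the link rate and round-trip latencies of this setup and anticipates the contrast between Tables 1 and 2; it is an estimate, not a measurement.

```python
LINK_RATE_BPS = 155_000_000                 # OC-3 rate used in the experiments
LINK_MB_PER_S = LINK_RATE_BPS / 8 / 1e6

def bdp_bytes(rtt_ms, rate_bps=LINK_RATE_BPS):
    """Bandwidth-delay product: bytes that must be in flight to fill the pipe."""
    return rate_bps / 8 * (rtt_ms / 1000.0)

def window_cap_mb_per_s(window_bytes, rtt_ms):
    """Throughput cap of roughly one window per round trip, never above wire speed."""
    return min(window_bytes / (rtt_ms / 1000.0) / 1e6, LINK_MB_PER_S)

for rtt in (2, 10, 50):
    print(f"RTT {rtt:2d} ms: pipe holds ~{bdp_bytes(rtt) / 1024:5.0f} KB; "
          f"a 64 KB window caps TCP at ~{window_cap_mb_per_s(64 * 1024, rtt):4.1f} MB/s")
```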
Table 1: TCP throughput results in MBytes/s with default TCP send and receive window sizes (64KB) Transfer Direction 0 ms 19.0 IĺT 19.0 TĺI In our experiments, we set the network bandwidth to the OC3 rate, 155Mb/s. This rate is the maximum supported by the WAN emulator. Since this paper’s focus is network performance, we configured the IP5000 in write-back mode and we set the traffic patterns such that all reads are served from the cache. This allowed us to eliminate any possible disk IO bottleneck. Round Trip Latency 2 ms 10 ms 50 ms 18.7 4.4 0.9 13.5 3.2 0.7 Table 2: TCP Throughput results in MBytes/s with maximum TCP send and receive window sizes set to 10 MBytes Transfer Direction 0 ms 19.3 IĺT 19.3 TĺI We studied four values of round trip latency: 0 ms as baseline, 2 ms for MAN 411 Round Trip Latency 2 ms 10 ms 50 ms 19.3 19.3 19.3 19.3 19.3 19.3 Table 5: iSCSI throughput results in MBytes/s for reads 4.2 Effect of the iSCSI Parameters After eliminating the TCP bottleneck, we studied the effect of the iSCSI parameters. Table 3 presents the iSCSI throughput results for most solicited writes using different iSCSI command window and request sizes. It is clear that the throughput is adversely affected when the product of window and request size is small. Request Window Round Trip Latency Size Size 0ms 2ms 10ms 50ms 1KB 1 2.3 0.4 0.1 0.02 32 17.6 8.0 2.3 0.5 8KB 1 10.2 3.1 0.8 0.2 32 19.2 19.2 19.1 4.7 64KB 1 19.2 17.2 5.4 1.2 32 19.2 19.3 19.3 19.2 256KB 1 19.2 19.2 18.3 4.7 32 19.3 19.3 19.3 19.3 Table 4 shows similar results to Table 3, but for most unsolicited writes. Clearly, the elimination of the extra round trip delays help to improve the throughput. Table 5 shows similar results for read requests. Considering that both the read requests and the unsolicited write requests involve one round trip latency per request, the results in Table 5 match the results in Table 4 in many cases. However, for some other cases, read throughput seems to significantly exceed the most unsolicited write throughput. Table 3: iSCSI throughput results in MBytes/s for most solicited writes Request Window Round Trip Latency Size Size 0ms 2ms 10ms 50ms 1KB 1 1.3 0.2 0.05 0.01 32 11.3 3.3 0.8 0.2 8KB 1 7.7 1.7 0.4 0.08 32 19.0 18.9 5.8 1.2 64KB 1 19.2 10.9 2.9 0.6 32 19.2 19.3 19.3 4.5 256KB 1 19.2 19.2 7.8 1.7 32 19.3 19.3 19.3 4.7 Finally, our LAN measurement results are shown in Table 6 for writes and reads, using various iSCSI command window and request sizes. It is interesting to note that even in a low latency LAN environment, the product of the iSCSI window size and request size impacts the performance significantly. In addition, the unsolicited writes provide a significant increase in performance. Table 4: iSCSI throughput results in MBytes/s for most unsolicited writes Round Trip Latency Request Window Size Size 0ms 2ms 10ms 50ms 1KB 1 2.2 0.4 0.09 0.02 32 17.4 6.6 1.5 0.3 8KB 1 10.3 3.2 0.7 0.2 32 19.2 19.2 11.6 2.5 64KB 1 19.3 17.6 5.4 1.2 32 19.3 19.3 19.3 4.8 256KB 1 19.3 19.3 11.6 2.5 32 19.3 19.3 19.3 4.9 Table 6: iSCSI throughput results in MBytes/s in Gb/s LAN environment Request Window Most Size Size Solicited Writes 1KB 1 3.4 32 17.1 8KB 1 17.4 32 68.6 64KB 1 56.1 32 96.5 256KB 1 71.3 32 97.3 412 Most Unsol. 
Writes 5.5 23.4 25.0 84.3 65.9 99.8 79.9 100.4 Reads 5.4 19.9 22.5 72.3 61.3 91.9 74.7 98.4 Information Exchange Between Systems - LAN/MAN - Specific Requirements - Part 3: Carrier Sense Multiple Access with Collision Detection (CSMA/CD) Access Method and Physical Layer Specifications,” 2002, ISBN 0-73813089-3 [5] IETF RFC 2001, “TCP Slow Start, Congestion Avoidance, Fast Retransmit, and Fast Recovery Algorithms,” W Stevens, Jan 1997 5. Conclusions We have observed that the default TCP parameter values are inadequate for the high speed MAN and WAN environments, and therefore require tuning. We have seen that the product of the iSCSI command window and request sizes has a very significant effect on the performance as well. Furthermore, using the most solicited writes has a major performance penalty, and should be avoided whenever possible. With appropriate performance tuning, the iSCSI and TCP protocols are capable of achieving good throughput in all types of networks. 6. Acknowledgements We would like to thank Mr. Ben Greear for kindly allowing us to use the LANforge LAN/MAN/WAN emulator for this study. We would also like to thank Mr. Robert Gilligan and Mr. Kenny Speer for their valuable feedback. References [1] IETF Internet Draft, “iSCSI”, draftietf-ips-iscsi-20.txt, J Satran et al, Jan 2003 [2] “A Performance Analysis of the iSCSI Protocol,” S Aiken et al, proceedings of 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, San Diego, CA, USA, Apr 2003, pp123-134 [3] “IP SAN – From iSCSI to IPAddressable Ethernet Disks,” P Wang et al, proceedings of 20th IEEE/11th NASA Goddard Conference on Mass Storage Systems and Technologies, San Diego, CA, USA, Apr 2003, pp189-193 [4] IEEE 802.3-2002 “Information Technology - Telecommunication & 413 414 H-RAIN: AN ARCHITECTURE FOR FUTURE-PROOFING DIGITAL ARCHIVES Andres Rodriguez Dr. Jack Orenstein Archivas, Inc. Waltham, MA 02451 Tel: +1-781-890-8353 e-mail: arodriguez@archivas.com, jorenstein@archivas.com 1 Introduction Traditionally, systems for large-scale data storage have been based on removable media such as tape and, more recently, optical disk (CD, DVD). While the need for increased storage capacity has never been greater, the inadequacies of traditional approaches have never been more apparent. This is especially true for fixed-content data: new government regulations and increasingly competitive market pressures have converged to underscore the importance of finding long-term storage solutions for fixed-content data that offer ready and secure access, easily scale, and are relatively inexpensive. 1.1 Shortcomings of removable media archives Archives that rely exclusively on removable media share the following shortcomings: x An archive system that commits physical data to removable media is also captive to the specific hardware system that enables read/write access.. As technology changes, these systems inevitably tend towards obsolescence. It is questionable whether the devices that are used today to read tape or disk will still be available and viable years hence— never mind the availability and viability of the vendor itself! x As the archive grows, access becomes increasingly cumbersome and time-consuming. Data is not always readily available when it is wanted. Moreover, the administrative overhead that is required to provide timely access is unacceptably high, and—for many organizations—prohibitively expensive. 
x Government regulations reflect a rising demand to maintain large amounts of data over long periods of time, and to guarantee their authenticity. Removable data is especially vulnerable to physical mishandling and corruption, both through physical deterioration and outside intervention, whether inadvertent or deliberate. 1.2 An alternative model In general, when digital data is bound to tape or disk, it ceases to be a digital asset; instead, it becomes simply a physical widget that contains bits, with all the drawbacks previously cited. Long-term mass storage of fixed-content data requires a new type of storage model, where the data's physical location is completely separate from its logical representation. In order to achieve this objective, digital data must be stored in a digital archive that is scalable, reliable, and highly available. 415 Today's best hope of realizing this model rests in a network—or cluster—of inexpensive servers such as IA-compatible machines that can run a full Linux distribution. This model offers the following advantages: x Various protection schemes (RAID-5, RAID N+K) safeguard files from multiple, simultaneous points of failure in the network, and guarantee that their data remains continuously available. x Within the network, the archive system autonomously enforces policies that are associated with the stored files. These policies include retention period, file protection, and content authentication. x Gateways for standard protocols (HTTP, NFS, SAMBA, CIFS) provide over-the-wire access to the archive. x The archive is easily extended: as new nodes enter the cluster, the archive automatically invokes its own load-balancing and protection policies, and redistributes existing storage into the new space accordingly. x A network-based archive can facilitate updates to files so they stay current with the latest applications—for example, format changes that are required by new end-user applications. Data migration of this type, on the scale required for large archives, is virtually impossible to achieve in tape-based systems. x The data's actual location on the network is transparent to the user. During its lifetime in an archive, a stored file might be relocated across many network machines—or nodes— as the result of hardware upgrades, replacements, or load balancing. The reference to the file, however, remains constant, enabling users ready access to its contents without requiring knowledge of its physical location within the cluster. 1.3 Two architectures for online archives: RAIN and H-RAIN In the last several years, various vendors have come forward with archive systems that implement the network approach just described. These all embody various implementations of RAIN (redundant array of independent nodes) architecture. RAIN archives are based on one or more clusters of networked server nodes. As nodes enter or leave the cluster, the cluster automatically adjusts by redistributing and, when necessary, replicating the stored files. Currently, RAIN archives are typically delivered as proprietary hardware appliances, closed systems that are built from identical components. Evolution of these systems is carefully controlled by the vendor. The architecture of H-RAIN—heterogeneous redundant arrays of independent nodes— differs from the RAIN architecture from which it evolved by making minimal assumptions about the archive's underlying hardware and software. In practice, this means that H-RAIN architecture can be implemented with commodity hardware. 
This relatively open architecture has two advantages over its RAIN progenitor: 416 x It adapts more readily to technological advances and site-specific contingencies. Administrators are free to replace components with superior hardware as it becomes available, thus improving storage capacity, performance, and reliability. Furthermore, they can choose among hardware options that best suit their requirements, such as CPU, memory, and disk capacity. For example, a cluster might be extended by adding new nodes with higher-performance CPUs, which can be used for CPU-intensive filtering operations. Incremental hardware additions and improvements might thereby measurably improve overall archive performance. x Archive administrators can start small and scale up capacity incrementally simply by adding nodes as they are needed. Moreover, they are free to seek the best prices for storage cluster components. Given that component costs tend to decrease over time, cost-conscious administrators can reduce their average cost per gigabyte by spreading out purchases. In general, an H-RAIN architecture enables users to upgrade their technical infrastructure while transparently migrating archive content to more up-to-date nodes. Improvements can be made incrementally, leaving the initial installation intact. As hardware prices fall, archive performance can be enhanced with better-performing nodes, and at lower cost. 2 Implementing H-RAIN architecture Archivas' archive management system, Reference Information System (RIS), is based on the H-RAIN model. With RIS, organizations can create large-scale permanent storage for fixed content information such as satellite images, diagnostic images, check images, voice recording, video, and documents, with minimal administrative overhead. In RIS' H-RAIN implementation, two features are salient: x Distributed processing x Autonomous management 2.1 Distributed processing All nodes in a cluster are peers, each capable of running any or all of services that an archive requires. A cluster can be configured so archive services are distributed in a way that best serves the enterprise's storage requirements. For example, a cluster can configured symmetrically—that is to say, each node runs the same processes and daemons, including a portal server, metadata manager, policy manager, request manager, and storage manager. Each node bears equal responsibilities for processing requests, storing data, and sustaining the archive's overall health. No single node becomes a bottleneck: all nodes are equally capable of handling requests such as put and get operations. Furthermore, in the event of node failure, any other node can take over responsibility for the data that was managed by the failed node, so that user access to this data remains unaffected. Alternatively, a cluster might be configured so that various services are distributed asymmetrically across different nodes. For example, if read requests are especially heavy 417 for a given archive, several nodes might be dedicated solely to request management and run multiple request managers and metadata managers, in order to maximize throughput to other nodes that store the physical data. 2.2 Autonomous management Through policies that are associated with archived files individually, and the cluster collectively, the archive can manage itself without human intervention. Policies are set for the archive on initial configuration, and can (optionally) be set for individual files as they are archived. 
Taken together, these policies determine the archive's day-to-day operation. Through a policy manager that executes on each node, the archive monitors its own compliance with current policies, and when lapses occur, takes the appropriate corrective action. For example, in the event of a failed disk or node, the system determines what data is missing and how best to restore it from data on the remaining healthy nodes, so that the protection policy for these files is fully enforced. Similarly, the system prohibits removal of an archived file before its retention period has elapsed. Human intervention is rarely warranted, and usually only in response to system warnings that require outside action—for example, notification the cluster load has crossed a specified threshold, requiring the addition of new nodes. Four attributes characterize archive self-management: x Self-configuring: Setting up large archive systems is error prone. An archive comprises networks, operating systems, storage management systems, and, in the case of RIS, databases and web servers; getting all these components to run together requires teams of experts with a myriad of skills. An autonomous system simplifies installation and integration by setting system configuration through high-level policies. x Self-protecting: Policies that enforce document retention, content authentication, and file protection combine to protect an archive from loss of valuable digital assets. x Self-healing: Serious problems with large-scale archives can sometimes take weeks to diagnose and fix manually. When the faulty device is finally identified, administrators must be able to remove and replace it without interrupting ongoing service. Autonomous systems can automatically detect software and hardware malfunctions in a node, and safely detach it from the archive. Further, because data is replicated across many nodes in the cluster, the failure of one or more nodes has no impact on data availability. Archivas' distributed metadata manager can find an alternative source for any data that resides on a failed node. x Self-optimizing: Storage systems, databases, web servers and operating systems all have a wide range of tunable parameters that enable administrator to optimize performance. An autonomous system can automatically perform functions such as load balancing as it monitors its own operation. 418 3 Extending the H-RAIN model With its H-RAIN architecture, RIS is capable of integrating with storage systems that use removable media such as tape or optical disk. In this scenario, the tape system is seen by RIS as simply another set of storage nodes; the physical location of data is managed by an RIS storage manager implementation that is specifically targeted to tape-based storage. This capability is critical for a multi-tier storage and migration strategy, where data is stored in whatever medium best serves external access requirements. For example, a file that is frequently accessed should be archived in primary storage on a high-performance disk, while data that is rarely used can be stored on relatively low-performance media such as tape. Further, it is likely that access requirements for a given file will not remain constant, especially if the file is retained for a long period of time. 
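One plausible shape for the per-node policy checks described above is sketched below. The policy fields, thresholds, and action names are illustrative assumptions, not the RIS schema or implementation.

```python
import time

# Illustrative policy record for one archived file (not the RIS schema).
FILE_POLICY = {
    "retention_until": time.time() + 7 * 365 * 86400,  # retain for roughly 7 years
    "min_replicas": 2,
    "demote_after_idle_days": 180,                      # move cold data to a cheaper tier
}

def enforce(file_meta, policy, now=None):
    """Return corrective actions; the archive would execute them autonomously."""
    now = now or time.time()
    actions = []
    if file_meta["delete_requested"] and now < policy["retention_until"]:
        actions.append("refuse-delete")                 # retention period has not elapsed
    if file_meta["healthy_replicas"] < policy["min_replicas"]:
        actions.append("re-replicate")                  # self-healing after a disk or node loss
    idle_days = (now - file_meta["last_access"]) / 86400
    if file_meta["tier"] == "disk" and idle_days > policy["demote_after_idle_days"]:
        actions.append("migrate-to-tape")               # aging data moves to removable media
    return actions

meta = {"delete_requested": False, "healthy_replicas": 1,
        "last_access": time.time() - 400 * 86400, "tier": "disk"}
print(enforce(meta, FILE_POLICY))   # ['re-replicate', 'migrate-to-tape']
```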
In general, references to most types of data significantly decline as the data itself ages, as shown in the following figure:1 With the advent of government regulations such as the Sarbanes-Oxley Act, enterprises are required to archive increasing amounts of reference data, and retain them for ever longer periods of time. In order to keep storage costs down, it is increasingly important that archive systems respond to changing access requirements by moving data easily from more expensive disk-based media to less-expensive removable media. By encompassing both disk-based and tape-based storage and providing a unified interface to both, RIS can provide a smooth migration path for aging data. Furthermore, RIS policy mangers can automatically manage the migration, as determined by archive-wide or file-specific policies. 4 Conclusion A digital archive system that is based on H-RAIN architecture offers the most economical, scalable, and effective solution for large-scale storage of reference data. Policy-based management minimizes administrative overhead while it provides the most reliable way to 419 achieve an archive's most important requirements: availability, reliability, and content authentication. By extending the H-RAIN model to encompass any network node, including those that interface to tape-based systems, the potential exists to implement a multi-tier system where data is stored on the medium that best suits access requirements, and can easily be migrated to another medium as those requirements change. 1 Fred Moore, "Information Lifecycle Management," Horison Information Strategies (http://www.horison.com) 420 A NEW APPROACH TO DISK-BASED MASS STORAGE SYSTEMS Aloke Guha COPAN Systems, Inc. 2605 Trade Center Drive, Suite D Longmont, CO 80503-4605 Tel: +1-303-827-2520, Fax: +1-303-827-2504 e-mail: aloke.guha@copansys.com Abstract We present a new approach to create large-scale cost-effective mass storage systems using high-capacity disks. The architecture is optimized across multiple dimensions to support streaming and large-block data access. Because I/O requests in these applications are to a small fraction of data, we power-cycle drives as needed to create a very high-density storage system. Data protection is provided through a new variant of RAID, termed power-managed RAID™, which generates parity without requiring all drives in the RAID set to be powered on. To increase bandwidth, the system employs several concurrency mechanisms including load balancing I/O streams across many parallel RAID controllers. A number of optimizations are made to the interconnection architecture and caching schemes that lowers the cost of storage to that of typical automated tape library systems while exploiting the performance and reliability advantages of disk systems. Recent results at COPAN Systems have proven the viability of this new storage category and product. 1. Background Growing reference information, regulatory data and rich media archives [1] are providing new impetus for mass storage systems. Unlike transaction applications, these systems access a small fraction of the data infrequently or on a scheduled basis. One such application is backup and restore that is characterized by large-block streaming writes and infrequent reads. With their need for large-scale storage, these applications are very sensitive to cost. Historically, they have been addressed by automated tape libraries. 
While concerns with performance, lack of data protection, and the reliability of tapes [2] have always favored disk systems, their limited capacity and high cost have prevented them from replacing tape. The advent of highcapacity SATA drives and their cost approaching that of tape motivates us to revisit this design issue. Existing approaches to high-capacity storage have been incremental, typically substituting Fibre Channel (FC) disks with lower cost ATA disks in standard RAID arrays. This does not result in increased storage density or the cost per unit storage of tape libraries which usually have a 3X advantage over disk systems. Therefore, a fundamentally different approach is required. We believe the best approach is to use an application and workload-driven architecture to provide the performance and reliability of disks at the scale and cost of tape. 2. A New Approach: Application Specific Storage System The specific needs of archival storage applications help define the architectural constraints. These include: x I/O requests to large block with sequential or predictable access 421 x x Performance metrics based on data rates (Mbytes/sec) and not on I/Os per sec (IOPs) Time to first byte should be in ms to seconds, desired for mission-critical systems Given the above, we can eschew many common but complex features of traditional disk storage architectures: 1. No need for large primary RAM cache. Since the storage is not used for hightransaction I/O, there is little need for large shared caches, except for reads. 2. No need for high-speed switched interconnection from host to disks. With more tolerance to latency than transactional systems, access from these applications do not benefit from non-blocking switched connectivity. 3. No need to access all the data all the time. With access to a small fraction of the data (e.g., <5% in tape systems), a majority of disks could be taken off-line if the mean latency is bounded. In addition, most writes can be done sequentially. 4. No need to limit capacity based on interconnection bandwidth. Since the capacity to data rate ratio is high, the ratio of total drive bandwidth to interconnect bandwidth can be higher than traditional disk systems. 5. Ensure data availability. Given the scale of data they maintain, archival storage must have high reliability, data protection and availability. COPAN Systems’ architecture is based on power management of a large number of SATA drives. The basic concept was first used in the MAID (massive array of idle disks) project [3] that examined tradeoffs in disk power consumption and performance. The results showed that a MAID with cache can effectively support most reads from a large database archive. To meet enterprise class storage needs using a power-managed disk design, we had to satisfy multiple criteria. These included data protection, scalable capacity, high bandwidth, storage and data manageability, small footprint, and a cost equal to or better than tape storage costs. This introduces a number of design guidelines: 1. Provide parity based redundancy: Tradeoff redundancy with effective cost of storage. RAID 1 provides 100% redundancy, but it also doubles the cost per unit storage. 2. Ensure data protection and performance with drives power-cycled. Optimize tradeoffs between power cycling, storage density, performance, redundancy and cost. Keeping 30%-50% of drives online implies 10X more data online than tape systems. 3. 
Ensure that I/O rate and parity protection are maintained, even when disks are in transition during the power-cycling of individual drives. 4. Provide ways to scale total bandwidth of the storage system An implicit design objective was the optimal packing and configuration of the drives. The architecture that works best for a given volume turns out to be a three-level interconnect. If the total number of drives in the system is N, then N is decomposed into a 3-tuple, such that N = s.t.d where s is the number of storage enclosures or shelves, t is the number of “stick”, or column of disks, in the each shelf unit, and d is the number of disk drives in each stick in a shelf (Figure 1). All I/O requests to SCSI disk volumes arrive over FC links at the system controller which maps the logical volume to physical volumes on shelves connected by FC. 422 If physical constraints of the rack can be satisfied, N can be chosen to support very large storage capacities in a single system. The packaging of drives must also provide adequate heat dissipation so that the disks operate at or below the specified ambient temperature. We provide brief descriptions of the data protection, performance and reliability of the system. Figure 1. Three-tiered storage: decomposition into shelf, stick and disk 2.1. Efficient Data Protection: Power-Managed RAID™ Data protection using parity can be provided by power-cycling either full RAID sets or individual drives within the RAID set. COPAN Systems supports both approaches. However, power-cycling individual drives within the RAID set, termed power-managed RAID™ (PMRAID™), has the advantage of keeping more RAID volumes on-line (assuming nominally 1 volume per RAID set) and reducing power swings. In PM-RAID™, all data is written sequentially from drive to drive and the parity drive is fixed. At a minimum, only one data drive and the parity drive are powered on. When writing initially to drive D0, the data on the parity drive P will also be D0 if the parity drive was initialized to zeros. When the writes exceed the capacity of D0, the second drive D1 is powered up and data is written to it, while drive D0 is powered down. At this point, the parity drive P will contain the XOR of the earlier data written to D0 and the new data written to D1. Similarly, after the writes to the third drive, the parity drive will contain the XOR of the data from drives D0, D1 and D2. Thus, after writes to all data drives in the RAID set, the parity drive will indeed be the parity for all drives in the RAID set. On the failure of a data drive, all drives in the RAID set will be powered on to reconstruct the failed data drive as in a normal RAID reconstruct. Figure 2 shows data written sequentially using a stripe size s of 1. If more bandwidth is desired, a larger stripe size s, 1 ” s ” n in an n+1 RAID, should be used. The most powerefficient and least bandwidth case is s = 1 with only 2 drives powered on, while the maximum bandwidth case is s = n when all n+1 drives are powered on. Note that the s = n case corresponds to traditional RAID 4 [4]. PM-RAID™ is therefore the most generalized version of RAID 4 under power constraints. Various configurations of the number of on-line volumes and associated bandwidth are possible with PM-RAID™. For example, if there are 1000 data drives in the system that are set up as 4+1 RAID sets with a budget to power only 25% of all drives, the maximum number of volumes that can be powered is 125, or 50% of all volumes or 2X the drive fraction 423 powered on. 
Each volume can be accessed sequentially at a data rate possible from one disk drive. PM-RAID™ utilizes a few always-on mirrored drives that maintain information on volume metadata, drive health, and read and write caches. Unlike previous approaches on disk on disk caching [5] used for performance, write caching is used to ensure that parity generation and I/O data rates can be sustained during power transitions of the drives. In general, including the spare and the metadata drives, less than 30% of all drives can be kept online while maintaining high bandwidth given sufficiently large number of disks in the system. To ease storage management, the drive power management is transparent to the end user or application. The stripe width of the PM-RAID™ as well as any striping of the user’s volumes across RAID sets in the shelves is managed by the system and not the user. Figure 2. Writes in a power-managed RAID™ 2.2. Increasing Performance: Concurrent I/O Power management limits the number of drives powered on at any time. This limits the total I/O per storage shelf. To increase aggregate bandwidth, we use a number of concurrency techniques. First, because the system controller has independent FC connections to storage shelves (Figure 1) and each storage shelf contains its own RAID controller, write and read bandwidth is increased s-fold with s shelves. Each shelf RAID controller can stripe data across up to t sticks. Second, since each volume on the shelf is minimally comprised of 1 active data drive, the bandwidth can also be increased by striping (RAID-0) logical volumes from the system controller across volumes in the shelves. Third, more I/O concurrency between the shelf controller and the SATA drives is provided in the stick using a custom SATA data and command router. This router provides concurrent access and command queuing of I/O to all d drives in a stick. This enables high-bandwidth I/O operations across a large number of drives in the shelf. Finally, we use disk caches to cost-effectively ameliorate latency effects of disk spin-up, if there is no a priori information on access patterns. When write requests can be scheduled, as in a backup, a currently powered-down disk can be scheduled to power up with sufficient lead 424 time before the end of file is reached on the current drive. When the write requests cannot be scheduled, the data can be redirected to a spinning drive cache and later staged to the true disk target. Read disk caches also store data from all data drives. A read request to a powered-off drive is directed to the read cache, while the target drive is powered up. In the case of a read cache miss, our current measurements indicate that the read penalty is well below 10 seconds. We also note that for streaming applications that access large blocks of data, the read latency of a few seconds on a cache miss is usually dwarfed by data transfer time. 2.3. Increasing Data Reliability: Effect of Power Management Using high-capacity ATA drives and high-density drive configuration to create PB-sized storage systems raises the importance of data protection and data availability. We address the issue in many ways. First, power-cycling the drives has a direct impact of increasing disk life and therefore system reliability. With 1000 drives and a drive MTBF of 400,000 hrs, the expected first failure of a drive is 18 days. Such low MTTF implies frequent drive swap-outs that is not be acceptable for most data centers. 
With PM-RAID™, if drives are powered with an average duty cycle of 25%, the net effect is to extend the effective drive MTTF to 4X the nominal value, or 1.6M hours, greater than typical SCSI or Fibre Channel. Second, we closely monitor the total power cycles, also known as contact start-stops (CSSs) since all drives have a specified maximum CSS. We therefore ensure during operations that all drives are all well below the CSS threshold specified by the drive vendor. Third, to accommodate drive failures that can put a large amount of data at risk, we use a unique proactive replacement mechanism. By continuously monitoring drive attributes, embedded controller intelligence determines whether a drive susceptible to failure should be replaced. Proactive drive replacement enables recovering data from the drive before failures, allowing the system to protect data even as failing drives are replaced. 3. Results At the time of writing, we have demonstrated end-to-end I/O from the user to the drives. We have proved two important concepts. First, PM-RAID™ and the interconnect architecture perform as expected, with linear scaling of bandwidth with increasing I/O load. With the current design, 1000 MB/s or more bandwidth is possible in a single system. Second, the implementation shows that close-packing of drives in the system kept environmental attributes within specifications so data reliability is assured to be better than traditional disk systems. More detailed results will be presented at the conference. REFERENCES [1] Fred Moore, Are You Ready for MAID Technology? July 8, 2003, Computer Technology News, http://www.wwpi.com/lead_stories/070803_3.asp [2] Bob Cramer, 'It's the restore, stupid!” April 23, ComputerWorld, http://www.computerworld.com/hardwaretopics/storage/story/0,10801,78483,00.html [3] Dennis Colarelli, Dirk Grunwald et al, The Case for Massive Arrays of Idle Disks (MAID), Usenix Conference on File and Storage Technologies, Jan. 2002, Monterey CA. [4] David A. Patterson, G. Gibson, and Randy H. Katz, “A Case for Redundant Arrays of Inexpensive Disks (RAID),” SIGMOD 88, p. 109-116, June 1988. [5] Y. Hu and Q. Yang, DCD -- Disk Caching Disk: A New Approach for Boosting I/O Performance. Proc. of the 23rd ISCA95, Philadelphia, PA, 1995. 425 426 Multi-Tiered Storage - Consolidating the differing storage requirements of the enterprise into a single storage system Louis Gray BlueArc Corporation Corporate Headquarters 225 Baypointe Parkway San Jose, CA 95134 USA info@bluearc.com Voice 1-408-576-6600 FAX 1-408-576-6601 http://www.bluearc.com Abstract Abstract Performance has always come at a steep cost in the world of enterprise storage, with flexibility often taking a back seat when it comes to IT purchasing decisions. In fact, storage has typically been bought and deployed as a “one size fits all” solution, regardless of the kinds of applications the network is running. Such an approach has required multiple servers to address multiple needs, resulting in significantly higher costs and a much greater degree of management complexity. BlueArc’s Multi-Tiered Storage (MTS) solution changes all this. MTS offers storage performance and consolidation while supporting different types of storage within the same network-attached storage (NAS)-based system, according to the specific requirements of the applications. 
For the first time, a combination of high-performance online, moderate performance nearline and infrequently accessed archival data can be configured in a single, seamless NAS platform—a BlueArc SiliconServer. This paper looks at the issues surrounding today’s growing storage needs in enterprise and project applications, examines the need for a multi-tier storage solution, explains BlueArc’s Multi-Tiered storage platform and demonstrates how the system is applied to storage applications. Today’s Issues Online data is doubling each year, but disk prices are falling rapidly. Now storage access, infrastructure and software costs are the dominant factors in the rising cost of storage. Today’s business need for scalable storage broadly fits into two categories - enterprise and project storage. In both of these applications, storage needs are growing exponentially. 427 Enterprise Storage Enterprise user home directories are increasing more than two-fold annually, as presentations and documents become more graphics rich, email traffic increases and as employees need to keep more data at their fingertips to help their company prosper in a highly competitive and demanding marketplace. For example, it was not too long ago that a 2 Megabyte PowerPoint presentation was considered large, while now we see large presentations approaching 10 Megabytes apiece, and the average size nearing 2 Megabytes. Factors such as this, combined with employees’ needs to retain additional storage in their home directories and mail archives is driving the average enterprise IT manager to prepare for a four-fold increase in storage space simply to keep pace with the organization’s needs. Project Storage Work on digitized feature films, particle physics projects, bioinformatics/genetic research, seismic research, digital design, broadcast, medical imaging, CAD and other projects already require very significant levels of storage. Business demands and new enabling technology (e.g. low-cost Linux cluster super-computing) are driving for faster project completion and the analysis of even more detail, driving the need for performance network access to multiple terabytes of storage. The recent adoption of low-cost ATA disk arrays in many data centers offers affordable online storage or caching of archive data. In many applications this brings strong productivity benefits. Immediate online availability of archived data enables it to be quickly searched and re-used, rather than abandoned on low-performance archive tapes. In a typical enterprise, this combination promotes growth beyond tens of terabytes of online storage. Infrastructure costs are becoming dominant in scaled storage applications Attempts to deal with this level of growth are stretching current storage architectures to the limit. While Fibre Channel storage arrays are still a significant cost factor in expanding storage needs, the simple cost of storage arrays themselves is now not the prime contributor to growth costs. The infrastructure costs surrounding storage expansion (backup strategies, server proliferation and administration costs) are poised to become highly dominant factors for growth. 24 by 7 access to data 24 by 7 access to storage is being threatened by extending backup windows. Elimination of backup windows is the much-needed solution. A difficult issue that enterprise and project IT managers face is the business mandate for 24 by 7 access to stored information. 
In the face of expanding storage needs, IT managers would like to be in a position to extend backup windows to accommodate for increased growth rather than contract them, enabling administrators to scale each server further, rather than proliferating servers. But backup activity cannot spill into peak business hours 428 if acceptable service is to be maintained. Traditional server architectures cannot run backup copy activity and continue to serve files without a severe drop in performance, which only becomes more difficult as service levels remain high around the clock. Additionally, fast restore times and disaster recovery are key – the cost of downtime to businesses is high and IT managers need to strike the right balance in this area of the IT strategy, devising what amount of history to keep online and where it should be stored. The introduction of large low-cost ATA storage arrays has changed the landscape for data restore and disaster recovery. With large, low-cost storage as the target for backup data and prime source for immediate backup history, both backup and restore time are reduced when compared to traditional tape backup. While online disk storage does not replace the high integrity holding data copies off site, disk storage offered as “virtual tape” or replicated intermediate storage is highly beneficial for reducing backup times and restore times. Savings and benefits from low-cost storage must not be competing with a higher cost, more complex infrastructure to access it. Changing the rules for Storage Today’s issues demand a different architectural approach to storage. x The infrastructure costs associated with scaling storage must be reduced. Enterprise and project storage needs will continue to expand. Raw storage costs are reducing significantly, but infrastructure costs associated with increasing storage capacity remain high. Proliferation of servers, backup and restore times, backup costs, storage networking complexity, and the administration overhead associated with storage infrastructures need to be tackled and contained. x Match storage to applications without imposing large infrastructure costs. The arrival of resilient arrays of low-cost and high capacity ATA disks heralds the introduction of staged backup, which reduces both backup and restore times and lets servers scale much further. Low cost of storage and very high capacity are key elements for this application, since even the lowest performing disks are significantly faster than tape access. ATA storage fits this need. However, the performance characteristics of this type of storage are not high enough for database and transaction intensive applications. In these cases, Fibre Channel disks deliver appropriate performance at a higher cost. x 24 Hour availability. Set-aside backup windows are no longer acceptable in today’s around the clock businesses. They must be removed or significantly reduced, requiring a storage platform that can be backed up without affecting user access performance, even in 429 peak activity periods. Replication options must also operate without affecting user activity. Restore capabilities need to be very fast and disaster recovery needs to be quick and simple. Access to storage needs to be highly available, but redundant path architectures deployed to achieve this must be uniform across all types of storage without increasing infrastructure costs. x Scaling by server proliferation is too expensive. 
Performance limitations in traditional servers restrict the amount of storage supported by a single server. Physical scaling limits for direct or SAN-attached storage cannot ultimately be avoided, even by offloading backup from the server. NAS appliances and NAS gateways extend these limits and are becoming more important as installed levels of storage continue to expand. Traditional NAS is more cost effective but lacks full integration The role of NAS is changing. Switched Gigabit Ethernet as a network backbone expands the role of NAS in today’s networks, while resilient network switch architectures support vast bandwidths with high availability. A Gigabit-switched backbone can support user traffic, application server disk access and archive traffic with ease. Furthermore, network traffic can be shaped and prioritized with standard tools in the Gigabit backbone switch, should the need arise. With all this functionality included, the cost of commissioning a second and independent storage network needs careful thought. By removing file and archive storage from the SAN, server and SAN networking costs can be saved. For this reason, NAS has emerged as the optimal architecture for primary file storage and for online archive applications, which the availability of low-cost ATA storage has enabled. NAS also challenges the role of SAN as prime storage for database and other application servers. Traditional NAS solutions have dedicated functionality. Performance NAS filers support primary file storage, and cost-competitive systems support general storage needs with the option of dual server heads offering high availability and fast fail-over in the event of a network or NAS component failure. Slower, cost-effective single head NAS systems are used for archive applications, built to support low-cost ATA storage generally in blocks of 2- 12 terabytes. For a customer to require a combination of archive storage and primary storage is quite common, but to further reduce costs, an integrated approach, with all storage types supported in a single NAS system, has been sought but has not been found, as regrettably such an approach would demand unsustainable scalability and performance from the software-based architecture on which traditional NAS filers are based. The approach would also introduce performance issues related to backup, since is it hard to conceive that tape archive copying would be completed during today’s shrinking backup windows. Decreased user access performance when running archive traffic during business hours adversely affects business productivity. 430 BlueArc - Scalable, high performance Multi-Tiered Storage with low Infrastructure costs BlueArc Network Attached Storage is based on the SiliconServer, a NAS head with hardware data movement, similar to network backbone switches. BlueArc’s architecture has the performance and scalability to support an integrated storage solution, with hardware data movement to maximize throughput. BlueArc also solves the backup traffic problem, because of the high internal bandwidth in the server and because movement of data between network and storage is prioritized over disk to disk and disk to tape activity, should contention for resources arise. BlueArc’s Multi-Tiered Storage, with SiliconServer scalability and performance, consolidates storage to a single system. 
The system changes network storage rules by: x x x x Reducing the infrastructure costs of scaling Offering multiple tiers of varied cost and performance storage, without imposing large infrastructure costs Reducing the number of servers through consolidation and replacement Enterprise class continuous availability storage The enhanced productivity derived from MTS is seen by customers in fields such as Broadcasting, Post Production, Manufacturing Genetic Research, Government Research, Mapping, University projects and many more. The use of hardware to move data within the SiliconServer is extended to backup and disk-to-disk copies over Fibre Channel. The SiliconServer uses hardware data movement in the implementation of NMDP, the standards based backup protocol for NAS. Coupled with the ability to schedule a snapshot of the file system prior to making a backup, this capability has three distinct advantages: 431 1. The traditional problem associated with slowed server performance when running backup activity is removed and there is no need for a backup window. x x x 2. x 3. Network data movement is prioritized over NDMP traffic NDMP is supported by all backup software applications Backups run faster because of speed of hardware data movement Tape drive efficiency Backups of a frozen on-disk snapshot can be spooled to tape as a background activity, if necessary running 24 hours per day to get maximum scalability out of tape drives. This delivers two to three times the scalability from each tape drive compared with running the drive only during a backup window. This mode of operation is fully supported by standard data backup software applications. Backup copies can be stored online for immediate access x Fast restore and disaster recovery by referencing an online snapshot of the backup, rather than referencing the archive tape. SiliconServer file system snapshots store a complete image of the file system at the time the snapshot was taken. The snapshot retains a block level image of the file system. As files subsequently change, the snapshot stores only the blocks that have been modified since the snapshot was taken. Snapshots are highly efficient in terms of the amount of disk space they occupy, therefore it costs little to retain a week or more of daily backup copies online for fast restore should it be necessary. Backup copies on disk are not only highly desirable for business continuance in the event of a disaster or accidental deletion, but they also reduce the number of required tape library tape slots - if backup history is held online, there is no need to leave recent backup tapes online in the tape library. Shared File Storage BlueArc’s Multi-Tiered storage is mounted as a network share/export. This means that multiple application servers can share the same data using standard Windows and UNIX locks. The SiliconServer also supports secure sharing between UNIX and Windows environments, with permissions mapping and lock integration between the two environments. Data sharing is vital for imaging, Web applications, design, engineering, and research applications where centralized data needs to be shared between a number of application servers and clients. BlueArc’s Multi-Tiered Network Attached Storage is the optimal, strategic choice for these applications. 
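As a rough illustration of the snapshot behavior described above, the following generic copy-on-write sketch (not BlueArc's SiliconServer on-disk format; the dict-backed "disk" is an assumption for the example) shows why a snapshot consumes space only for blocks modified after it was taken.

```python
# Minimal block-level copy-on-write snapshot sketch: a snapshot starts empty
# and preserves a block's old contents only the first time that block is
# overwritten after the snapshot was taken.
class SnapshottingVolume:
    def __init__(self):
        self.live = {}          # block number -> current contents
        self.snapshots = []     # each snapshot: {block number -> preserved old contents}

    def take_snapshot(self):
        self.snapshots.append({})        # zero extra space at creation time

    def write(self, blk, data):
        # Preserve the pre-write image in the newest snapshot, once per block.
        if self.snapshots and blk not in self.snapshots[-1]:
            self.snapshots[-1][blk] = self.live.get(blk)
        self.live[blk] = data

    def read_snapshot(self, snap_idx, blk):
        # A block unchanged since the snapshot is read from the live volume.
        for s in self.snapshots[snap_idx:]:
            if blk in s:
                return s[blk]
        return self.live.get(blk)

vol = SnapshottingVolume()
vol.write(7, b"v1")
vol.take_snapshot()
vol.write(7, b"v2")                       # only block 7 is copied aside
assert vol.read_snapshot(0, 7) == b"v1" and vol.live[7] == b"v2"
```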
432 Managing Scalability in Object Storage Systems for HPC Linux Clusters Brent Welch Panasas, Inc 6520 Kaiser Drive Fremont, CA 94555 Tel: 1-510-608-7770 e-mail: welch@panasas.com Garth Gibson Panasas, Inc 1501 Reedsdale Street, Suite 400 Pittsburgh, PA 15233 Tel: 1-412-323-6409 e-mail: garth@panasas.com Abstract This paper describes the performance and manageability of scalable storage systems based on Object Storage Devices (OSD). Object-based storage was invented to provide scalable performance as the storage cluster scales in size. For example, in our large file tests a 10-OSD system provided 325 MB/sec read bandwidth to 5 clients (from disk), and a 299-OSD system provided 10,334 MB/sec read bandwidth to 151 clients. This shows linear scaling of 30x speedup with 30x more client demand and 30x more storage resources. However, the system must not become more difficult to manage as it grows. Otherwise, the performance benefits can be quickly overshadowed by the administrative burden of managing the system. Instead, the storage cluster must feel like a single system image from the management perspective, even though it may be internally composed of 10’s, 100’s or thousands of object storage devices. For the HPC market, which is characterized as having unusually large clusters with usually small IT budgets, it is important that the storage system “just work” with relatively little administrative overhead. 1. Scale Out, not Scale Up The high-performance computing (HPC) sector has often driven the development of new computing architectures, and has given impetus to the development of the Object Storage Architecture. The new architecture driving change today is the Linux cluster system, which is revolutionizing scientific, technical, and business computing. The invention of Beowulf clustering and the development of the Message Passing Interface (MPI) middleware allowed racks of commodity Intel PC-based systems running the Linux operating system to emulate most of the functionality of monolithic Symmetric MultiProcessing (SMP) systems. Since this can be done at less than 10% the cost of the highly-specialized, shared memory systems, the cost of scientific research dropped dramatically. Linux clusters are now the dominant computing architecture for scientific computing, and are quickly gaining traction in technical computing environments as well. 433 Unfortunately, storage architecture scalability in terms of performance, capacity, and manageability have not kept pace, causing systems administrators to perform tedious data movement and staging tasks on multiple standalone storage systems to get data into and out of the Linux clusters where scalable resources are available. There are two main problems that the storage systems for clusters must solve. First, they must provide shared access to ever larger amounts of data so that the applications are easier to write and storage is easier to balance with the scaling compute requirements. Second, the storage system must provide high levels of performance, in both I/O rates and data throughput, to meet the aggregated requirements of 100’s, 1000’s and in some cases up to 10,000’s of nodes in the Linux cluster. Linux cluster administrators have attempted several approaches to meet the need for shared files and high performance, supporting multiple NFS servers or copying data to the local disks in the cluster. 
But to date there has not been an effective solution to match the power of the Linux compute cluster for the large data sets typically found in scientific and technical computing. 1.1 Methods of Data Sharing in Clusters Sharing files across the Linux cluster substantially decreases the burden on both the scientist writing the programs and the system administrator trying to optimize the performance of the system and control the complexity of cluster management. There are several approaches to providing shared data to a computing cluster: • Network file servers. The NFS protocol and file server hardware impose a bottleneck between the clients that share the data and the disk resources that store it. As storage needs grow, additional file servers with their own disk resources must be added, creating storage “islands” and adding complexity for users and administrators alike. • Block storage via SCSI on Fiber Channel (FC) provides good performance access for a small number of disks, but block storage is private to individual hosts making it generally unscalable. Systems like GFS [1] and GPFS [2], sometimes called SAN filesystems, can provide shared FC storage for compute nodes. However, the costs of FC adapters per node, FC switching with enough ports for all cluster nodes, as well as FC administrators, can significantly increase the cost of each node in the compute cluster. Moreover, SAN filesystems are also fundamentally block-based, and the overhead of managing blocklevel metadata in terms of lock messaging and distribution of block-level metadata limits scalability. • Peer-to-peer systems share hard drives attached to individual compute clusters. However, the complexity and cost of providing storage redundancy prohibits permanent storage of valuable data in per node disks. Moreover, storing data in per node disks introduces complexity in load balancing because some nodes have much worse I/O performance depending on where the file was written. • A new alternative is Object-based Storage Architectures. The Panasas object-based storage system is built from commodity parts, including SATA drives, standard processors and memory, and Gigabit Ethernet to provide a storage system with excellent price/performance characteristics. Its high performance file system can be shared among multiple compute clusters equally and with little overhead. It can also share the same 434 files with legacy engineering workstations using standard NFS and CIFS protocols. The entire storage system is available through one global shared namespace in a manner reminiscent of the Andrew File System (AFS) [3]. 2. Object-based Storage Architecture The Object-based Storage Architecture utilizes data Objects, which encapsulate variablelength user data and attributes of that data [4,5,6]. The device that stores, retrieves and interprets these objects is an Object Storage Device (OSD) [7]. The combination of data and attributes allows an OSD to make decisions regarding data layout or quality of service on a per-object basis, improving flexibility and manageability. The object interface is mid-way between the read-write interface of a block device, and the highlevel interface of an NFS server. The core object operations are create, delete, read, write, get attributes, and set attributes. 
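The following toy, in-memory model is a sketch of the level of abstraction those core operations expose: byte-addressable objects with attributes, with space management left to the device. It is an illustration only, not the T10 OSD command set or any vendor's implementation, and the attribute names used in the example are made up.

```python
# Toy in-memory model of the object interface: create, delete, read, write,
# get attributes, set attributes. Space allocation happens inside the device.
import itertools

class ObjectStorageDevice:
    def __init__(self):
        self._objects = {}                      # object id -> bytearray
        self._attrs = {}                        # object id -> attribute dict
        self._ids = itertools.count(1)

    def create(self, **attrs):
        oid = next(self._ids)
        self._objects[oid] = bytearray()
        self._attrs[oid] = dict(attrs)
        return oid

    def delete(self, oid):
        del self._objects[oid], self._attrs[oid]

    def write(self, oid, offset, data):
        buf = self._objects[oid]
        if len(buf) < offset + len(data):       # the OSD does its own allocation
            buf.extend(b"\0" * (offset + len(data) - len(buf)))
        buf[offset:offset + len(data)] = data

    def read(self, oid, offset, length):
        return bytes(self._objects[oid][offset:offset + length])

    def set_attributes(self, oid, **attrs):
        self._attrs[oid].update(attrs)

    def get_attributes(self, oid):
        return dict(self._attrs[oid])

osd = ObjectStorageDevice()
oid = osd.create(owner="fs0", stripe_unit=65536)    # attribute names are illustrative
osd.write(oid, 0, b"hello object storage")
assert osd.read(oid, 6, 6) == b"object"
```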
By moving low-level storage functions into the storage device itself and accessing the device through a higher-level object interface, the Object Storage Device enables: • Dedicated resources to block-level management, • Intelligent space management in the storage layer allowing the OSD to make late binding decisions on the allocation of data to the storage media, • Data-aware pre-fetching, and caching, • A natural paradigm for scaling the capacity and performance characteristics of the storage system. OSD-based storage systems can be created with the following characteristics: • Robust, shared access by many clients, • Scalable performance via an offloaded data path, • Strong fine-grained end-to-end security. These capabilities are highly desirable across a wide range of typical IT storage applications. They are particularly valuable for scientific and technical applications that are increasingly based on Linux cluster computing which generates high levels of concurrent I/O demand for secure, shared files. 2.1 Implementing Files using Objects An Object-based storage system includes metadata managers that coordinate data access from multiple clients and implement POSIX filesystem semantics over the object storage. They also provide location information and implement security policies. For scalable performance, the metadata servers are “out-of-band” of the data path between clients and storage nodes. The metadata servers retain strict control of what data clients can access because the OSDs provide secure access rights enforcement with every operation [8]. Studies indicate that even in demanding small file with random access workloads, these metadata servers handle less than 10% of the work associated with file access, with the rest going to the object storage devices [9], and that load balancing metadata service can be managed by partitioning the objects [10]. 435 In the Panasas implementation, metadata managers are executed on server blades called DirectorBlades, and the OSDs are executed on server blades called StorageBlades. All permanent filesystem data storage is on the StorageBlades. The storage cluster can have a variable number of Director and StorageBlades depending on the workload requirements; workloads with legacy NFS and CIFS access need more DirectorBlades, while highbandwidth, large file workloads need very few DirectorBlades. The object interface has a core security protocol that allows Figure 1. Object Storage Architecture safe concurrent access by multiple clients. Clients read directory objects to process filesystem pathnames. Directories map from names to object identifiers. Clients contact metadata servers to obtain location information (maps) and security capabilities (caps) that enable direct I/O access. These “maps and caps” are cached by clients so repeated access to files do not require interaction with the metadata manager. The OSDs verify security capabilities on every client access, so they enforce the policies implemented on the managers without having to know the details of POSIX or Windows ACL semantics. Our metadata managers also implement a lease-based cache consistency protocol so clients can efficiently cache file attributes and data, which provides further optimizations to file system access. Individual files are striped across multiple StorageBlades (OSD). Striping files across objects allows high bandwidth access as well as protection from failures by using standard RAID techniques. 
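As a sketch of what such striping looks like from the client's point of view, the code below maps a file byte range onto per-OSD component objects. The round-robin RAID-0 layout and the 64 KB stripe unit are assumptions for illustration; the real system also maintains parity components and per-file RAID parameters.

```python
# Map a file extent onto component objects striped across n StorageBlades.
STRIPE_UNIT = 64 * 1024          # bytes written to one component before moving on

def map_extent(file_offset, length, n_components):
    """Yield (component_index, component_offset, chunk_length) for an extent."""
    remaining, pos = length, file_offset
    while remaining > 0:
        stripe_no, within = divmod(pos, STRIPE_UNIT)
        comp = stripe_no % n_components               # which OSD/component object
        comp_off = (stripe_no // n_components) * STRIPE_UNIT + within
        chunk = min(STRIPE_UNIT - within, remaining)
        yield comp, comp_off, chunk
        pos += chunk
        remaining -= chunk

# A 256 KB read starting at 96 KB on a file striped over 4 components touches
# several OSDs, which is what lets one client drive many spindles at once.
for comp, off, length in map_extent(96 * 1024, 256 * 1024, 4):
    print(f"component {comp}: offset {off}, {length} bytes")
```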
The combination of file distribution across storage devices and multiple client access leads to very high aggregate performance to the shared storage system. 2.3 NAS Compatibility It is also important that a new architecture such as Object-based storage be able to support NFS and CIFS compatibility for data sharing with desktops and other non-cluster computing platforms. The Panasas storage cluster does this in its DirectorBlades. DirectorBlades export standard NFS and CIFS interfaces, hiding the Object Storage Cluster from these legacy systems. Of course, a single NAS interface is a performance 436 bottleneck, but by deploying multiple NAS “filer head” interfaces in the storage cluster, even legacy applications see scalable performance, albeit less efficiently than through the Panasas file system protocol available on the Linux cluster nodes. Every DirectorBlade is capable of serving any file in any StorageBlade to any client. Panasas provides a global filesystem namespace, so clients only need a single mount point, which they can mount from any available DirectorBlade. By providing a shared filesystem between the Linux computing cluster and the rest of the computing platforms, data management is dramatically simplified. For example, noncluster hosts with tape drives can import data via multiple NFS access points in parallel, then the Linux computing cluster can access the data in place, and finally, non-cluster desktop applications can visualize and analyze the results from the computing cluster. In contrast, other approaches require explicit distribution of data to each node in the computing cluster, or management of multiple separate NAS systems. 3. Performance of Object-based Filesystems There are four main components of the Panasas storage system: the client, the DirectorBlade, the StorageBlade, and the Shelf [11]. The client is a loadable kernel module that runs in the Linux compute nodes and implements a POSIX filesystem. It plugs into the VFS interface inside Linux. The StorageBlades have 2 SATA drives, a 1.2 GHz Pentium III processor, 512 MB memory, and 1 GE network port. The DirectorBlades have a 2.4 GHz Pentium 4 CPU, 4 GB memory, and 1 GE network port. Each shelf holds up to 11 Storage or DirectorBlades, and it has an integrated GE switch that provides up to 4 trunked GE ports out of the shelf. (A pass-through card provides 11 independent ports, but the numbers shown here use the integrated switch.) Each shelf also includes dual power supplies and a battery module, which together provide an integrated UPS function for the DirectorBlades and StorageBlades. The bandwidth tests described below were run with 9 or 10 StorageBlades (OSD) per shelf, and in the bandwidth tests the DirectorBlades were mostly idle because most of their resources are reserved for NFS and CIFS. The goal of presenting these numbers is to give a general flavor of the scalability of the system performance. Results vary depending on the speed of the clients, the size of their I/O requests, and network topology. Typical clients in our tests have a single 2.4 GHz Pentium CPU and 1 GE network interface. Typical I/O requests in our bandwidth tests are 64 KB, and the network in our lab connects the systems and cluster under test through a high-end nonblocking Extreme Network BlackDiamond GE switch. In the bandwidth tests, files are always large enough that data must be streamed on and off the disk platters as opposed to being satisfied by cached data. 
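For context when reading the results below, the hardware just described implies some rough network ceilings; the protocol-efficiency factor in this back-of-the-envelope sketch is an assumption, not a measured value.

```python
# Rough ceilings implied by one GE port per client and four trunked GE ports
# per shelf. The ~80% framing/protocol efficiency is an assumed figure.
GBIT = 1_000_000_000 / 8 / 1e6        # 1 Gb/s expressed in MB/s (125 MB/s)
EFFICIENCY = 0.80

client_ceiling = 1 * GBIT * EFFICIENCY          # one GE port per client
shelf_ceiling = 4 * GBIT * EFFICIENCY           # four trunked GE ports per shelf
print(f"per-client ceiling ~{client_ceiling:.0f} MB/s, "
      f"per-shelf ceiling ~{shelf_ceiling:.0f} MB/s")
# Roughly 100 MB/s per client and 400 MB/s per shelf, so the single-client and
# single-shelf measurements that follow are running close to the network limits.
```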
Figure 2. Bandwidth of 10 OSDs vs. the number of Clients (read and write of one file per client, 64 KB requests) Figure 2 shows that aggregate bandwidth of one file per client in the same directory on a single shelf of 10 OSDs scales quite well as the number of clients increases, until the shelf is providing about 380 MB/s. A single client can read from a file that is striped across 10 OSDs at about 90 MB/sec using a single GE port. As the number of clients scales up but the storage resources remain the same, the aggregate bandwidth increases but the per-client bandwidth drops off. The less-than-linear scaling after about 8 clients per shelf is due to increasing load and contention at the storage devices, which have to manage multiple I/O streams in parallel. There is also contention and a bottleneck in the shelf network as the data passes through the four trunked GE links providing a maximum of 4 Gb/s to each 10-OSD shelf. At saturation, each SATA drive is delivering a sustained bandwidth of just over 19 MB/sec, split among the tens of clients pounding on it. Write bandwidth follows a similar curve, with a single client writing at about 77 MB/sec and 10 clients able to write at 335 MB/sec. The write bandwidth peaks at less than the read bandwidth because the RAID engine runs at the client, and so parity data flows over the network between the client and the storage nodes. For example, there is about 12% more data being written in an 8+1 RAID configuration than is reported in the bandwidth number. The advantage of moving RAID to the client is that it allows shared access to a scalable number of drives without the bottleneck of a traditional RAID controller [12]. In addition, the XOR computations RAID requires can be done efficiently using specialized MMX instructions on the Pentium CPU, and their speed increases as client CPUs and memory systems get faster. Figure 2 might lead one to believe each shelf was a standalone 300-400 MB/s storage system. Not so. When multiple shelves are bound together by Panasas object storage software, total performance scales at 300-400 MB/s per shelf. Scaling when adding more shelves works quite well even if files are still only striped over 10 StorageBlades, because the set of OSDs used to store each file is drawn from all OSDs on all shelves. The contention at any single storage device remains about the same. Figure 3 shows several multi-shelf high-bandwidth test results. For example, the test with 299 OSDs achieved 10,334 MB/sec read bandwidth. There were 32 shelves and 151 clients, or about 5 clients per shelf. The per-shelf bandwidth of 322 MB/sec in this configuration is consistent with tests done with 5 clients against one shelf. (The large test used a larger application read block size than in the chart shown in Figure 2, so the numbers are not directly comparable.) As files are added, their data is spread across different subsets of OSDs automatically, on a per-file basis. This allows large numbers of clients to share many OSDs efficiently. Figure 3 shows that bandwidth scales quite well with the number of OSDs. In fact, because our tests are generally done with only about 5 clients per 10-OSD shelf, Figure 2 indicates that the total bandwidth achievable from these systems can be higher. For example, the difference between the points at 116 and 118 OSDs is that the first test used 61 clients to achieve 3.1 GB/sec, while the second used 79 clients to achieve 3.9 GB/sec.
That is about 50 MB/sec per client in both configurations. Finally, it is important to note that these data points were taken over time in our labs with varying configurations of clients, network topology, and software tuning, so each point is not strictly comparable. However, the overall result is that the system scales well, at about 15-20 MB/s per disk, when the number of clients is at least 25% as many as there are disks. Figure 3. Scaling Aggregate Bandwidth with Number of OSDs (aggregate GB/sec versus number of OSDs) When files are striped very widely, contention increases because each OSD has connections to more clients. In a concurrent write test with 151 clients writing to a single, shared file striped across 198 OSDs, the write bandwidth was 2775 MB/sec. The OSDs were organized into 22 shelves with 9 OSDs each, for a per-shelf bandwidth of about 126 MB/sec. For legacy, non-cluster computers, access is through NFS or CIFS. Because Panasas DirectorBlades offer multiple NFS servers as a single storage pool, the underlying scalability of Object Storage is made available as a scalable NAS system. Figure 4 shows two results from the industry standard SPEC SFS benchmark [13]. First we ran the SFS test against a 5-shelf system with 10 metadata managers and 45 OSDs, providing a single-rack 90-disk system similar to the high-end of dedicated monolithic NAS filers. This delivered an excellent throughput of 50,907 ops/sec at an Overall Response Time (ORT) of 1.67 msec. The average response time as a function of load is shown in Figure 4 in the lightly shaded curve peaking at 50,000 ops/sec. ORT is a weighted average of the average response time at each of 10 load points. To show how NFS scales, we also show the results from a 30-shelf system with 60 metadata managers and 270 OSDs, which achieved 305,805 ops/sec at an ORT of 1.76 msec. While 300,000 ops/sec is much larger than any other reported SFS benchmark run to date, our real point is that a Panasas storage cluster with 6x the resources delivers 6x the workload at about the same response time profile. (NFS ops/sec as measured by the SPEC benchmark. SPEC and the benchmark name SPECsfs97_R1 are registered trademarks of the Standard Performance Evaluation Corporation.) The SFS test has a uniform access requirement so each client is contacting every file server during the benchmark run. In this case each DirectorBlade is running two orthogonal functions: one is the metadata management for the object storage filesystem, the second is a client of that filesystem that is being exported via an NFS server module. The NFS server on each DirectorBlade accesses the metadata function on other directors as well as storage on all the OSDs. Figure 4. Results for 10- and 60-DirectorBlade Systems (average response time in msec versus SFS97_R1 ops/sec for the 10-Director and 60-Director configurations) 4. Managing Large Scale Systems The ideal scalable storage system is a large, seamless storage pool that grows incrementally without performance degradation and is shared uniformly by all clients of the system under a common access control scheme. As the system scales in size, however, issues arise in two general areas: traditional storage management and internal resource management. Both of these areas are affected by the distributed system implementation of the storage system itself.
To external clients, the storage system should feel like one large, high-performance system with essentially no physical boundaries imposed by the implementation. Internally, the system must manage a large and growing collection of computing and storage resources and shield the administrator from chore of administering individual resources. 4.1 Traditional Storage Management Traditional storage management issues include system configuration, monitoring system performance and capacity utilization, responding to failures, and adding and configuring hardware resources as the system grows. A scalable storage system should minimize the 441 burden these traditional storage management issues place on an administrator so the system can grow to very large capacities with undue operating costs. The Panasas object-storage architecture simplifies capacity and device management by hiding the details normally associated with block devices such as LUN definition and Fiber Channel zone definitions. Instead, the filesystem is built from a collection of object storage devices (OSD) and metadata managers (Directors). The filesystem automatically stripes files across storage devices using RAID techniques to tolerate the failure of storage devices or individual objects (e.g., due to media errors). As more storage devices are added, files are striped more widely. As stripes become wider, additional parity objects are introduced to limit the size of failure domains. A unique aspect of object-based storage is that RAID configurations are configurable on a per-file basis. For example, mirroring is more efficient for small write accesses, but has high capacity overhead compared with RAID 4 or RAID 5. Ordinarily RAID parameters are managed automatically by the system. For example, the Panasas filesystem uses mirroring for directories and small files, while larger files use RAID 5. However, in simplifying management we do not want to go so far that we hurt our users’ ability to optimize performance. For this reason an optional a programming interface exposes stripe width and RAID parameters so MPI IO middleware layers can create files with specific bandwidth and reliability attributes to best match application requirements. New capacity is added simply by adding one or more new StorageBlades to the system. However, this can create a capacity imbalance among the StorageBlades, new blades less full than older blades, so the system actively rebalances capacity across StorageBlades. This is done efficiently by transparently moving component objects. For example, if files are striped across N StorageBlades, then each file is composed of N component objects, one component on each StorageBlade. When a new StorageBlade is added to the system, the system selects component objects from each of the existing StorageBlades, and moves those component objects onto the newly available StorageBlade. This reduces the capacity utilization across existing StorageBlades and fills up the new StorageBlade. The active balancer runs in the background at a low priority. The filesystem blocks the balancer from using files that are being used, and if an application happens to access a file that is currently being rebalanced, it is temporarily blocked from using the file. The application’s access proceeds automatically once the balancer has finished moving the component object. Backup and restore is obviously important for any storage system. 
One nice benefit of a high performance, shared storage system is that multiple backups can proceed in parallel. This helps the administrator reduce the backup window to manageable levels, even with very large systems. Monitoring and configuration is done via a central management console that has a Web interface as well as a command line interface. Any feature accessible via the GUI is also accessible via the CLI. It usually turns out that the GUI is best for new users, offering a simple display of performance monitoring information, and a quick overview of the system. However, for large systems and more experienced administrators, it can become 442 tedious to configure the system one click at a time. Instead, a CLI that can be scripted (we have a TCL-based shell) can be a real time saver and an enabler for site-specific monitoring and data collection. 4.2 Internal Resource Management Because the storage system is itself a cluster of computers working together to provide service, there are internal management issues such as resource discovery, software upgrade, failure detection, power and thermal management. These internal issues should be handled automatically by the system, yet provide the administrator with monitoring, failure reporting, and robust failure handling. The Panasas storage system is IP-based, and its system configuration includes a block of IP addresses that is managed by an internal DHCP service. This runs on an alternate port so it will not conflict with the customer’s existing DHCP infrastructure. The Panasas DHCP protocol is extended with additional information about device serial numbers, types (Storage or Director), software revision level, and physical location (shelf and slot). As blades boot up the system discovers their type, location, and software version and automatically builds its configuration database. StorageBlades are automatically added to the storage pool as they come on-line, so provisioning a running system just requires physical addition of StorageBlades. The external view of the system is through a single DNS name. Clients mount the file system from this single name, and their filesystem accesses are automatically directed to the appropriate DirectorBlade as they access different directories. Of course, I/O access goes directly between clients and StorageBlades using the maps they get from DirectBlades during access control checks. In high availability configurations, the system DNS name is mapped to a set of IP addresses, and clients can contact any of these addresses to mount the filesystem. For NFS and CIFS load balancing, Panasas provides a delegated DNS name server to distribute legacy clients across DirectorBlades. Software versions are maintained uniformly across the storage cluster to avoid awkward compatibility issues. Each blade boots from its own drive to avoid massive congestion on a netboot server during system startup. Software upgrade is achieved with a twophase installation operation where all blades install a new version and reach a commit point in the first phase. Only if all blades are ready does the system commit to the new version and restart. Filesystem clients pause for the duration of the commit and restart, which is about 5 minutes regardless of the size of the storage cluster. When new blades are added to a system they are checked for hardware compatibility, and they are automatically upgraded to the same version as the rest of the system. Power and thermal management is important for any high-density cluster installation. 
The Panasas blades are housed in a shelf that has dual power supplies and a battery module that together provide an integrated UPS. The UPS protects the blades against power surges and brown outs. If AC power is lost completely, the blades are signaled and use battery power to write out cached data and safely shutdown the system. By providing 443 an integrated UPS and power management in every shelf, the system can be aggressive about caching data in main memory without burdening the administrator with building a foolproof data center-scale UPS and scaling it up as the system grows. The StorageBlades accumulate data and metadata updates and periodically flush these in a log-like fashion. This lets the system optimize disk arm seeks and provide very high write throughput even during shared workloads. Thermal monitoring is also integrated, and the system will proactively take itself offline if external temperatures rise and cause blades to overheat. The system is able to differentiate between power or thermal failures and disk or blade failures so it can respond appropriately. 5. Conclusion The ability of storage systems built on the Object Storage Architecture to scale capacity and performance addresses a key requirement for HPC Linux clusters. Panasas’ Objectbased storage cluster demonstrates scalability with 32-shelf systems providing 30x the bandwidth of a single shelf, and 30 shelf NAS benchmarks providing 6x the throughput of 5-shelf runs of the same benchmark. While we want performance and capacity to grow linearly as resources are added to a storage cluster, we do not want administrator effort to grow anywhere near linearly. Object Storage Architectures are designed to abstract physical limitations, making virtualization easier to provide, so that larger systems can be managed with little more effort than small systems. Panasas object-based storage clusters use distributed intelligence, a single namespace interface, file-level striping and RAID, and transparent rebalancing to realize the manageability advantages of Object-based Storage. References [1] Soltis, Steven R., Ruwart, Thomas M., et al. The Global File System, proc. of the Fifth NASA Goddard Conference on Mass Storage Systems, IEEE, 1996. [2] Schmuck, Frank, and Haskin, Roger. GPFS: A Shared-Disk File System for Large Computing Clusters. Proc First USENIX conf. on File and Storage Technologies (FAST02), Montery, CA Jan 2002. [3] Howard, J.H., Kazar, M.L., Menees, S.G., Nichols, D.A., Satyanarayanan, M., Sidebotham, R.N., West, M.J, Scale and Performance in a Distributed File System, ACM Transactions on Computer Systems, Feb. 1988, Vol. 6, No. 1, pp. 51-81. [4] Gibson, G. A., et. al., A Cost-Effective, High-Bandwidth Storage Architecture, 8th ASPLOS, 1998. [5] Azagury, A., Dreizin, V., Factor, M., Henis, E., Naor, D., Rinetzky, N., Satran, J., Tavory, A., Yerushalmi, L., Towards an Object Store, IBM Storage Systems Technology Workshop, November 2002. 444 [6] Lustre: A Scalable, High Performance File System, Cluster File System, Inc. 2003. http://www.lustre.org/docs.html [7] Draft OSD Standard, T10 Committee, Storage Networking Industry Association (SNIA), ftp://ftp.t10.org/t10/drafts/osd/osd-r05.pdf [8] Gobioff, Howard. Security for a High Performance Commodity Storage Subsystem. Carnegie Mellon PhD. Dissertation, CMU-CS-99-160, July 1999. [9] Gibson et. al. 
File Server Scaling with Network-Attached Secure Disks, ACM SIGMETRICS, June 1997, pp 272-284 [10] Brandt, S., Xue, L., Miller, E., Long D., Efficient Metadata Management in Large Distributed File Systems, Twentieth IEEE/Eleventh NASA Goddard Conference on Mass Storage Systems and Technologies, April 2003. [11] http://www.panasas.com [12] Amiri, K., G.A. Gibson, R. Golding, Highly Concurrent Shared Storage, Int. Conf. On Distributed Computing Systems (ICDCS2000), April 2000. [13] http://www.spec.org/sfs97r1/ 445 446 The Evolution of a Distributed Storage System Norman Margolus Permabit, Inc. One Kendall Square, Bldg. 200 Cambridge, MA 02139 nhm@permabit.com tel +1-617-995-9331 fax +1-617-252-9977 Abstract Permabit is a software company that makes a storage clustering product called Permeon. Permeon grew out of the need to reconcile a vision of the future of globally distributed, secure, private and robust storage with near-term marketplace realities. This paper discusses the evolution of the ideas embodied in Permeon. 1 Introduction When Permabit was founded in June of 2000, it was directed towards making the “disk in the sky” Storage Service Provider (SSP) idea practical. Storage clients would send permanent data off into the network, with the guarantee that this data would be kept safe and private. Most of our initial concerns were interface issues between the storage client and the storage system, and were largely independent of the structure and implementation of the actual storage. It was only later that we focused on distributed storage clustering software. 2 Content Addressed Storage (CAS) To make remote network storage immediately practical for ordinary users, we had to first address the issue of the availability and cost of bandwidth. We observed that if a cryptographic hash of a block of data is used as the name for the block, then users of a widely shared Internet storage system should be able to backup much of the data on their PC’s to a “disk in the sky” without sending much data. Not only could they avoid transmitting data that the storage system has seen before, but we could also avoid storing it separately for each user. This could make it practical for a large number of users to store large amounts of data remotely, enabling data sharing and remote access to their data. 447 Unix Filesystem Permabit CAS NFS block addr block name CIFS iSCSI … SSNAP = Secure, Self-Naming Archive Protocol: server-specified inode# blk addr blk addr … blk addr client-specified “handle” blk name blk name … blk name Deposit block Deposit object version Extend version retention List an object’s versions Read from object version Delete object version (explicit or by hash) (client-named) (server enforced) (specified time range) (access is controlled) (if allowed) Figure 1: Content addressed storage (CAS) can save valuable bandwidth on the WAN, reduce storage requirements, and guard data integrity. In Permabit CAS, a cryptographic hash fingerprint of each data block is used as the address of that block, enabling storage sharing. An unshared metadata-level, analogous to the inode level in Unix, controls access and security. A secure network protocol (SSNAP) maintains privacy and thwarts IP-piracy. 2.1 Self-Encryption CAS bandwidth and storage savings depend on sharing storage between unrelated users. This seems to require that the storage providers have complete access to all of the data. 
This is something that is unlikely to be palatable to the end user for privacy reasons, and is also a significant potential source of liability to the storage provider. This issue can be dealt with using self-encryption: each block of data stored can first be encrypted using a key derived deterministically from the unencrypted block (again using a cryptographic hash). Unrelated users will produce the same encrypted block, but no one without the source block can determine the key. As long as keys are never stored “in the clear” in the storage system, no one who has not had access to an unencrypted copy of the block can determine what it contains. The Farsite project[1] uses a similar idea, but only to save storage, not network bandwidth of the depositor. 2.2 Access Control If knowing the hash of a block is a sufficient credential for being granted access to the block, the storage system has no real access control. For example, in a widely shared “disk in the sky” storage system, the hash corresponding to the contents of a newly released DVD might become widely broadcast. The storage provider needs a way to obey a court-order to remove illegitimate access to this content without removing legitimate access. It is natural, for this reason, to introduce a metadata level which is analogous to the inode level in a Unix file system, as is illustrated in Figure 1. A small amount of unshared per-client and per-object metadata allows conventional access control and provides a place to keep privately-encrypted key information. Only the block-level is globally shared and self-encrypted. Blocks can only be read by reference to unshared per-client metadata. 448 2.3 Network Protocol The SSNAP network protocol for communicating with the “disk in the sky” is summarized in Figure 1. Hash-named blocks of data can be deposited by name to save bandwidth, but they can only be read as part of an object which belongs to a particular client (to permit read-access control). Depositing blocks by hash-name may involve a per-client challenge, to ensure that the client actually has the block and not just a hash that someone has broadcast to them. Objects are analogous to inodes in a Unix file system, except that the object handle (inode number) is supplied by the client. Both block names and handles are 32 bytes long. Objects can have multiple historical versions, allowing retention policies for protecting history to be enforced by the storage system. This is used to protect against accidental or malicious corruption. Server enforced record retention is also useful for automating record retention requirements mandated by government regulations and business best practices. 2.4 Security and Privacy No attempt is made to anonymize user access. Privacy maintenance is based on not storing records that allow a stored object to be linked to a user when the object is not actually being accessed. Depending upon accounting requirements, whatever information is desired can be logged or not logged at access time. By not recording who owns what data, storage service providers can avoid having these (non-existent) records subpoenaed. The globally unique name of an object is obtained by combining a namespace identifier with the handle constructed by the client—which the client ensures is unique within the namespace and unguessable. The handle and namespace are combined by the storage system using a one-way hash function to locate the object. 
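A minimal sketch of self-encrypting, content-addressed deposit is shown below; it is illustrative only, not the actual SSNAP protocol or ciphersuite, and the SHA-256 counter-mode keystream stands in for a real block cipher. It shows how identical blocks from unrelated users coalesce in the shared store without the store ever holding keys in the clear.

```python
# Self-encrypting content-addressed deposit (illustrative sketch). The block
# key is derived from the plaintext, so identical blocks encrypt identically
# and deduplicate; the block name is a hash of the ciphertext.
import hashlib

def _keystream(key, length):
    out, counter = bytearray(), 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(out[:length])

def self_encrypt(block):
    key = hashlib.sha256(block).digest()               # derived from plaintext
    cipher = bytes(a ^ b for a, b in zip(block, _keystream(key, len(block))))
    name = hashlib.sha256(cipher).hexdigest()          # content address of the block
    return name, cipher, key                           # key lives in per-user metadata

store = {}                                             # shared, globally deduplicated
def deposit(block):
    name, cipher, key = self_encrypt(block)
    store.setdefault(name, cipher)                     # a second depositor adds nothing
    return name, key

n1, k1 = deposit(b"same attachment from user A")
n2, k2 = deposit(b"same attachment from user A")
assert n1 == n2 and k1 == k2 and len(store) == 1       # coalesced, keys never stored
```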
To prevent objects that are not being accessed from being linked to users, handles are not retained by the storage system. 2.5 Embedding File Systems SSNAP provides an object level interface at the equivalent of the Unix inode level. Client libraries build upon this, to support file sharing protocols with directories. A generalized Portable Hierarchical File System (PHFS) data format is defined, with an associated API, which allows file system metadata for different file systems to be embedded. Multiple historical versions of files are supported, to allow convenient file system snapshotting for backup purposes. File system snapshots can be copied into Permabit storage and presented through familiar file sharing protocols. Permabit storage can also be used directly for live file storage, with copy-on-write snapshotting providing backup. By breaking files up into hash-named blocks at natural boundaries (e.g., email attachment boundaries), the likelihood of storage sharing (block coalescence) is enhanced. 2.6 Enterprise Storage Although SSNAP and PHFS were originally conceived with end-users in mind, most of the considerations that went into their design are at least as important for enterprises. Divisions within an enterprise would like to be able to share storage without compromising their 449 disk STORAGE POOL disk bulk disk storage clique storage clique wide area network tape storage clique tape Figure 2: Distributed storage systems can displace tape as safe and economical longterm/large-scale storage. Geographically distributed storage cliques can protect redundant records from loss or corruption, and independently enforce record retention policies. privacy or security. Storage-system enforced data retention is important in this context not only for preserving backup snapshots of a file system, but also for meeting government regulatory mandates for the long-term retention of email and other sensitive records. The elimination of common storage is particularly interesting for enterprises. The Venti project at Lucent Bell Labs[2] showed that, with the combination of avoiding duplication of common blocks of data and compressing the blocks, all of the daily backup data for each of two large shared file systems accumulated over a decade could be stored on disk using only slighly more storage than the current active data on the file systems. 3 Distributed Storage The considerations discussed above apply regardless of the nature of the “disk in the sky.” Since hash-named data is a natural match for scalable clustered storage, and stimulated by the low cost of ATA disks and commodity PC’s, we decided to develop our own storage clustering software. 3.1 Bulk Storage Storage systems are naturally hierarchical. For example, RAM memory systems use slower but more economical memory for bulk storage, while caching the most active data to hide most of the latency. With the raw media cost per gigabyte of ATA disks below that of tape, it seems natural to expect that the bulk of storage will soon be low-cost disk storage, used within a storage hierarchy (Figure 2). But low cost does not mean just cheap disks. Most of the cost of disk storage today is in its management: adding and removing storage, keeping it working, backing it up and preparing for disaster recovery. Bulk disk storage needs to be self-managing, self-healing, self-backing and disaster tolerant in order to qualify as low cost storage. 
To be disaster tolerant, the bulk disk storage must be distributed: information destroyed at one location must be redundantly represented elsewhere. In considering costs, however, it is also important to keep bandwidth in mind. The cost of bandwidth on the WAN is several orders of magnitude greater than the cost of bandwidth on a local network. This makes a two-level system attractive, in which local clustering of economical standard hardware constitutes the first level. In the Permeon system the local cluster provides fast recovery after failure of a server, fast access for nearby clients, high aggregate disk and network bandwidth, and cooperative caching behavior.

3.2 Scalability

Overlay network routing schemes developed for academic distributed storage systems depend on statistical allocation of storage capacity. This does not work well for small clusters. Permeon cliques are designed to scale up indefinitely starting from a very small number of servers, allowing a low-cost entry point. For this reason, a table-based routing scheme is used within cliques, in which address ranges for hash-named data (and replicas) are assigned and reassigned as servers are added or removed. Data replicas are created and moved automatically as address assignments change. Because of high WAN bandwidth costs, if a clique is destroyed there is a large benefit in being able to recreate it using data resident at one or a small number of locations. This leads to a multi-clique scheme in which the data from each clique is redundantly represented at some small set of other cliques.

3.3 Portals

To provide convenient access to the distributed storage system, Permeon provides portal servers which present the clique storage using familiar file sharing protocols. These portals can live behind firewalls, providing secure access to shared storage cliques. A web-based storage management console runs on the portals.

4 Conclusions

The distributed “disk in the sky” storage systems of the future will be descendants of the decentralized bulk-disk storage of the post-magnetic-tape era, which makes storage inexpensive by solving the costly problems of storage management, repair, data corruption and disaster tolerance. In particular, these storage systems will have to enforce data-retention policies, and do so independently in different locations, in order to provide useful backup and dependable archiving. Systems which save bandwidth and storage space by avoiding transmitting or storing duplicate data have a distinct advantage in the near term. Privacy and security provisions are already important today, and will be vital to making these systems suitable for widely shared access.

References

[1] J. R. Douceur, A. Adya, W. J. Bolosky, D. Simon, and M. Theimer, “Reclaiming space from duplicate files in a serverless distributed file system,” International Conference on Distributed Computing Systems (ICDCS), July 2002.

[2] S. Quinlan and S. Dorward, “Venti: a new approach to archival storage,” First USENIX Conference on File and Storage Technologies (FAST), 2002.

List of Authors

Amer, Ahmed, Identifying Stable File Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .159
Bärring, Olof, Storage Resource Sharing with CASTOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .345
Bilas, Angelos, Clotho: Transparent Data Versioning at the Block I/O Level . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
.315 Brandt, Scott A., File System Workload Analysis for Large Scale Scientific Computing Applications . . . . . . . . . . .139 Brandt, Scott A., OBFS: A File System for Object-Based Storage Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .283 Brinkmann, André, V:Drive—Costs and Benefits of an Out-of-Band Storage Virtualization System . . . . . . . . . . . .153 Burns, Lisa, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitree to SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Butler, Greg, GUPFS: The Global Unified Parallel File System Project at NERSC . . . . . . . . . . . . . . . . . . . . . . . .361 Caine, Robert, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitree to SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Chadduck, Robert, NARA’s Electronic Records Archives (ERA)—The Electronic Records Challenge . . . . . . . . . . . .69 Chandy, John, Parity Redundancy Strategies in a Large Scale Distributed Storage System . . . . . . . . . . . . . . . . . . .185 Chen, Helen, Comparative Performance Evaluation of iSCSi Protocol over Metropolitan, Local, and Wide Area Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .409 Cheung, Samson, Data Management as a Cluster Middleware Centerpiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Collins, Donald, US National Oceanographic Data Center Archival Management Practices and the Open Archival Information System Reference Open Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .329 Couturier, Ben, Storage Resource Sharing with CASTOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .345 Dalgic, Ismail, Comparative Performance Evaluation of iSCSi Protocol over Metropolitan, Local, and Wide Area Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .409 Dichtl, Rudy, Challenges in Long-Term Data Stewardship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47 Du, David, H.C., An Efficient Data Sharing Scheme for iSCSI-Based File Systems . . . . . . . . . . . . . . . . . . . . . . . . .233 Du, David, H.C., Simulation Study of iSCSI-Based Storage System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .299 Duerr, Ruth, Challenges in Long-Term Data Stewardship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47 Duffy, Dr. Daniel, Data Management as a Cluster Middleware Centerpiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Duffy, Dr. Daniel, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitree to SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Durand, Jean-Damien, Storage Resource Sharing with CASTOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .345 Ebata, Atsushi, An On-Line Backup Function for a Clustered NAS System (X-NAS) . . . . . . . . . . . . . . . . . . . . . . . 
.165 Flouris, Michail, Clotho: Transparent Data Versioning at the Block I/O Level . . . . . . . . . . . . . . . . . . . . . . . . . . . .315 Fu, Gang, Rebuilt Strategies for Redundant Disc Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .223 Fuhrmann, Patrick, dCache, The Commodity Cache . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .171 Gibson, Garth, Managing Scalability in Object Storage Systems for HPC Linux Clusters . . . . . . . . . . . . . . . . . . . .433 Golay, Randall, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitree to SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Gray, Louis, Multi-Tiered Storage — Consolidating the Differing Storage Requirements of the Enterprise into a Single Storage System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .427 453 Grossman, Robert L., Using Dataspace to Support Long-Term Stewardship of Remote and Distributed Data . . . . .239 Guha, Aloke, A New Approach to Disc-Based Mass Storage Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .421 Gurumohan, Prabhanjan C., Quanta Data Storage: A New Storage Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . .215 Han, Chunqi, Rebuilt Strategies for Redundant Disk Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .223 Hanley, Dave, Using Dataspace to Support Long-Term Stewardship of Remote and Distributed Data . . . . . . . . . . .239 Haynes, Rena, The Data Services Archive . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .261 He, Dingshan, An Efficient Data Sharing Scheme for iSCSI-Based File Systems . . . . . . . . . . . . . . . . . . . . . . . . . . .233 Heidebuer, Michael, V:Drive—Costs and Benefits of an Out-of-Band Storage Virtualization System . . . . . . . . . . .153 Higuchi, Tatsuo, An On-Line Backup Function for a Clustered NAS System (X-NAS) . . . . . . . . . . . . . . . . . . . . . . .165 Hong, Bo, File System Workload Analysis for Large Scale Scientific Computing Applications . . . . . . . . . . . . . . . .139 Hong, Bo, Duplicate Data Elimination in a SAN File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .301 Hong, Xinwei, Using Dataspace to Support Long-Term Stewardship of Remote and Distributed Data . . . . . . . . . . .239 Holdsworth, David, Long-Term Stewardship of Globally-Distributed Representation Information . . . . . . . . . . . . . . . . .17 Hospodor, Andy, Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems . . . . . . . .273 Huber, Mark, NARA’s Electronic Records Archives (ERA)—The Electronic Records Challenge . . . . . . . . . . . . . . . .69 Hui, Joseph Y., Quanta Data Storage: A New Storage Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .215 Jagatheesan, Arun, Data Grid Management Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Jansen, Pierre, Promote-IT: An Efficient Real-Time Tertiary-Storage Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . . . .245 Johnson, Wilbur R., The Data Services Archive . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .261 Jung, Kyong Jo, Regulating I/O Performance of Shared Storage with a Control Theoretical Approach . . . . . . . . .105 Jung, Seok Gan, Regulating I/O Performance of Shared Storage with a Control Theoretical Approach . . . . . . . . .105 Kanagavelu, Renuga, A Design of Metadata Server Cluster in Large Distributed Object-Based Storage . . . . . . . .199 Kanagavelu, Renuga, An ISCSI Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207 Karamanolis, Christos, Evaluation of Efficient Archival Storage Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .227 Kawamoto, Shinichi, An On-Line Backup Function for a Clustered NAS System (X-NAS) . . . . . . . . . . . . . . . . . . .165 Knezo, Emil, Storage Resource Sharing with CASTOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .345 Krishnaswamy, Parthasarathy, Using Dataspace to Support Long-Term Stewardship of Remote and Distributed Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .239 Kukreja, Umesh, Comparative Performance Evaluation of iSCSi Protocol over Metropolitan, Local, and Wide Area Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .409 Lake, Alla, NARA’s Electronic Records Archives (ERA)—The Electronic Records Challenge . . . . . . . . . . . . . . . . . .69 Lee, Han Deok, Regulating I/O Performance of Shared Storage with a Control Theoretical Approach . . . . . . . . .105 Lee, Rei, GUPFS: The Global Unified Parallel File System Project at NERSC . . . . . . . . . . . . . . . . . . . . . . . . . . . .361 Lijding, Maria Eva, Promote-IT: An Efficient Real-Time Tertiary-Storage Scheduler . . . . . . . . . . . . . . . . . . . . . . .245 Liu, Yinan, Cost-Effective Remote Mirroring Using the iSCSI Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .385 Long, Darrell, D. E., File System Workload Analysis for Large Scale Scientific Computing Applications . . . . . . . .139 Long, Darrell, D. E., Identifying Stable File Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .159 Long, Darrell, D. E., OBFS: A File System for Object-Based Storage Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . .283 454 Long, Darrell, D. E., Duplicate Data Elimination in a SAN File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .301 Lu, Yingping, Simulation Study of iSCSI-Based Storage System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .399 Margolus, Norman, The Evolution of a Distributed Storage System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .447 Marquis, Melinda, Challenges in Long-Term Data Stewardship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47 Meyer auf der Heide, Friedhelm,V:Drive—Costs and Benefits of an Out-of-Band Storage Virtualization System . . . . . .153 McLarty, Tyce T., File System Workload Analysis for Large Scientific Computing Applications . . . . . . . . . . . . . .139 McNab, David, Data Management as a Cluster Middleware Centerpiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
.93 Miller, Ethan L., File System Workload Analysis for Large Scale Scientific Computing Applications . . . . . . . . . . .139 Miller, Ethan L., Interconnection Architectures for Petabyte-Scale High-Performance Storage Systems . . . . . . . .273 Miller, Ethan L., OBFS: A File System for Object-Based Storage Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .283 Moore, Reagan W., Data Grid Management Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Moore, Reagan W., Preservation Environments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .79 Motyakov, Vitaly, Storage Resource Sharing with CASTO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .345 Mullender, Sape, Promote-IT: An Efficient Real-Time Tertiary-Storage Scheduler . . . . . . . . . . . . . . . . . . . . . . . . . .245 Mullins, Teresa, Challenges in Long-Term Data Stewardship . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .47 Muniswamy-Reddy, Kiran Kumar, Reducing Storage Management Costs via Informed User-Based Policies . . . .193 Nam, Young Jin, Regulating I/O Performance of Shared Storage with a Control Theoretical Approach . . . . . . . .105 Narasimhamurthy, Sai S. B., Quanta Data Storage: A New Storage Paradigm . . . . . . . . . . . . . . . . . . . . . . . . . . . .215 Ng, Spencer, Rebuilt Strategies for Redundant Disk Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .223 Nieh, Jason, Reducing Storage Management Costs via Informed User-Based Policies . . . . . . . . . . . . . . . . . . . . . .193 Noman, Farrukh, Simulation Study of iSCSI-Based Storage System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .399 Okitsu, Jun, An On-Line Backup Function for a Clustered NAS System (X-NAS) . . . . . . . . . . . . . . . . . . . . . . . . . . .165 Orenstein, Jack, H-RAIN: An Architechture for Future-Proofing Digital Archives . . . . . . . . . . . . . . . . . . . . . . . . . .415 Osborn, Jeffrey, Reducing Storage Management Costs via Informed User-Based Policies . . . . . . . . . . . . . . . . . . .193 Ozdemir, Kadir, Comparative Performance Evaluation of iSCSi Protocol over Metropolitan, Local, and Wide Area Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .409 Paffel, Jeff, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitree to SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Palm, Nancy, Data Management as a Cluster Middleware Centerpiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Palm, Nancy, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitree to SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Pâris, Jehan-François, Identifying Stable File Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .159 Park, Chanik, Regulating I/O Performance of Shared Storage with a Control Theoretical Approach . . . . . . . . . . .105 Parsons, Mark A., Challenges in Long-Term Data Stewardship . . . . . . . . . . . . . . . . . . . 
. . . . . . . . . . . . . . . . . . . . . .47 Patel, Sanjay, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitreeto SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Plantenberg, Demyn, Duplicate Data Elimination in a SAN File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .301 Ponce, Sebastien, Storage Resource Sharing with CASTOR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .345 455 Rajasekar, Arcot, Data Grid Management Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Rodriguez, Andres, H-RAIN: An Architechture for Future-Proofing Digital Archives . . . . . . . . . . . . . . . . . . . . . . .415 Rood, Richard, Data Management as a Cluster Middleware Centerpiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Rouch, Mike, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitree to SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Ruckert, Ulrich, V:Drive—Costs and Benefits of an Out-of-Band Storage Virtualization System . . . . . . . . . . . . . .153 Saletta, Marty, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitreeto SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Salmon, Ellen, Data Management as a Cluster Middleware Centerpiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Salmon, Ellen, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitreeto SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Salzwedel, Kay, V:Drive—Costs and Benefits of an Out-of-Band Storage Virtualization System . . . . . . . . . . . . . .153 Sawyer, William, Data Management as a Cluster Middleware Centerpiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Schardt, Tom, Data Management as a Cluster Middleware Centerpiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Schroeder, Wayne, Data Grid Management Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Schumann, Nathan, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitree to SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Shah, Purvi, Identifying Stable File Access Patterns . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .159 Shater, Ariye, Reducing Storage Management Costs via Informed User-Based Policies . . . . . . . . . . . . . . . . . . . . .193 Sivan-Zimet, Miriam, Duplicate Data Elimination in a SAN File System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .301 Tarshish, Adina, Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitreeto SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 
.177 Thomasian, Alexander, Rebuilt Strategies for Redundant Disk Arrays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .223 Thompson, Hoot, SAN and Data Transport Technology Evaluation at the NASA Goddard Space Center (GSFC) . . . . .119 Vanderlan, Ed., Hierarchical Storage Management at the NASA Center for Computational Sciences: From Unitree to SAM-QFS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .177 Velpuri, Rajkumar, Comparative Performance Evaluation of iSCSi Protocol over Metropolitan, Local, and Wide Area Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .409 Vodisek, Mario, V:Drive—Costs and Benefits of an Out-of-Band Storage Virtualization System . . . . . . . . . . . . . . .153 Wan, Michael, Data Grid Management Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .1 Wang, Feng, File System Workload Analysis for Large Scale Scientific Computing Applications . . . . . . . . . . . . . .139 Wang, Feng, OBFS: A File System for Object-Based Storage Devices . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .283 Wang, Chao-Yang, SANSIM: A Platform for Simulation and Design of a Storage Area Network . . . . . . . . . . . . . .373 Weber, Jason, Comparative Performance Evaluation of iSCSi Protocol over Metropolitan, Local, and Wide Area Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .409 Webster, Phil, Data Management as a Cluster Middleware Centerpiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Welcome, Mike, GUPFS: The Global Unified Parallel File System Project at NERSC . . . . . . . . . . . . . . . . . . . . . .361 Welch, Brent, Managing Scalability in Object Storage Systems for HPC Linux Clusters . . . . . . . . . . . . . . . . . . . .433 Weon, So Lih, A Design of Metadata Server Cluster in Large Distributed Object-Based Storage . . . . . . . . . . . . .199 456 Wheatley, Paul, Long-Term Stewardship of Globally-Distributed Representation Information . . . . . . . . . . . . . . . . . . .17 Wright, Charles, Reducing Storage Management Costs via Informed User-Based Policies . . . . . . . . . . . . . . . . . . .193 Xi, Wei-Ya, SANSIM: A Platform for Simulation and Design of a Storage Area Network . . . . . . . . . . . . . . . .373 Xin, Qin, File System Workload Analysis for Large Scale Scientific Computing Applications . . . . . . . . . . . . . . . . .139 Xiong, Hui, A Design of Metadata Server Cluster in Large Distributed Object-Based Storage . . . . . . . . . . . . . . . .199 Xiong, Hui, An ISCSI Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207 Yan, Jie, A Design of Metadata Server Cluster in Large Distributed Object-Based Storage . . . . . . . . . . . . . . . . . .199 Yang, Henry, Fibre Channel and IP SAN Integration . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .31 Yang, Qing (Ken), Cost-Effective Remote Mirroring Using the iSCSI Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . .385 Yasuda, Yoshiko, An On-Line Backup Function for a Clustered NAS System (X-NAS) . . . . . . . . . . . . . . . . . . . . . 
.165 Yong, Khai Leong, An ISCSI Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207 You, Lawrence, Evaluation of Efficient Archival Storage Techniques . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .227 Zadok, Erez, Reducing Storage Management Costs via Informed User-Based Policie . . . . . . . . . . . . . . . . . . . . . .193 Zero, Jose, Data Management as a Cluster Middleware Centerpiece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .93 Zhang, Ming, Cost-Effective Remote Mirroring Using the iSCSI Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .385 Zhou, Feng, A Design of Metadata Server Cluster in Large Distributed Object-Based Storage . . . . . . . . . . . . . . .199 Zhou, Feng, SANSIM: A Platform for Simulation and Design of a Storage Area Network . . . . . . . . . . . . . . . . . . . .373 Zhu, Yao-Long, A Design of Metadata Server Cluster in Large Distributed Object-Based Storage . . . . . . . . . . . .199 Zhu, Yao-Long, An ISCSI Design and Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .207 Zhu, Yao-Long, SANSIM: A Platform for Simulation and Design of a Storage Area Network . . . . . . . . . . . . . . . . .373 457 458