Abstract
Interconnects can limit the performance achieved by distributed and parallel file systems due to message-processing overheads, latencies, low bandwidths and possible congestion. This is especially true for metadata operations, because of the large number of small messages that they usually involve. These problems can be addressed from a hardware approach, with better interconnects, or from a software approach, by means of new designs and implementations. In this paper, we take the software approach and propose to increase the rate of metadata operations by sending several operations to a server in a single request. These metadata requests, which we call batch operations (or batchops for short), are particularly useful for applications that need to create, get the status of, and delete thousands or millions of files. With batchops, performance is increased by saving network delays and round-trips, and by reducing the number of messages, which, in turn, can mitigate possible network congestion. We have implemented batchops in our Fusion Parallel File System (FPFS). Results show that batchops can increase the metadata performance of FPFS by between 23 and 100%, depending on the metadata operation and the backend file system used. In absolute terms, batchops allow FPFS to create, stat and delete around 200,000, 300,000 and 200,000 files per second, respectively, with just 8 servers and a regular Gigabit network.
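As a rough illustration of the batching idea only (a minimal sketch; the names batch_req, batch_add and batch_send are hypothetical and are not the FPFS interface), the following C fragment packs many create operations into a single request buffer that is shipped in one round-trip, instead of issuing one RPC per file:

/*
 * Minimal sketch of the batching idea described in the abstract.
 * All names (batch_req, batch_add, batch_send) are hypothetical;
 * this only illustrates packing many metadata operations into one
 * request instead of sending one message per file.
 */
#include <stdio.h>

#define MAX_OPS   1024
#define NAME_LEN  64

enum op_type { OP_CREATE, OP_STAT, OP_UNLINK };

struct batch_req {
    int count;                       /* operations packed so far       */
    enum op_type type[MAX_OPS];      /* per-operation opcode           */
    char name[MAX_OPS][NAME_LEN];    /* per-operation file name        */
};

/* Append one metadata operation to the batch (no network traffic yet). */
static int batch_add(struct batch_req *b, enum op_type t, const char *name)
{
    if (b->count >= MAX_OPS)
        return -1;                   /* batch full: caller must flush  */
    b->type[b->count] = t;
    snprintf(b->name[b->count], NAME_LEN, "%s", name);
    b->count++;
    return 0;
}

/* Stand-in for the single round-trip that ships the whole batch. */
static void batch_send(const struct batch_req *b)
{
    printf("1 request, %d operations, ~%zu bytes of payload\n",
           b->count, (size_t)b->count * (sizeof(enum op_type) + NAME_LEN));
}

int main(void)
{
    struct batch_req b = { .count = 0 };
    char name[NAME_LEN];

    /* 1000 creates become one request instead of 1000 round-trips. */
    for (int i = 0; i < 1000; i++) {
        snprintf(name, sizeof(name), "file.%04d", i);
        batch_add(&b, OP_CREATE, name);
    }
    batch_send(&b);
    return 0;
}

With a typical 4-byte enum, the sketch reports about 68 KB of payload for 1,000 creates, i.e., one large request in place of 1,000 small round-trips.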
Notes
Recently, Seagate announced Kinetic [43], a drive that is a key/value server with Ethernet connectivity. It has a limited object-oriented interface that supports a few operations on objects identified by keys. Kinetic could be seen as an early implementation of something similar to Gibson’s proposal [20] but, due to its limited design, it still needs a higher-level layer like Swift [48] to carry out basic operations, such as mapping large objects, coordinating race conditions on write operations, etc.
By default, the Ethernet protocol limits the maximum payload of a frame to 1500 bytes, the so-called Maximum Transmission Unit (MTU). Consequently, the transport layer limits the Maximum Segment Size (MSS) to 1460 bytes, so a message larger than 1460 bytes is split into several segments to fit this limit.
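As a quick check of this arithmetic (the message sizes below are illustrative only, not actual FPFS request sizes), the following C snippet computes how many TCP segments a message of a given size needs under a 1460-byte MSS:

/* A message of S bytes needs ceil(S / 1460) TCP segments under the
 * default Ethernet MTU of 1500 bytes. Sizes below are illustrative. */
#include <stdio.h>

#define MSS 1460  /* bytes of TCP payload per Ethernet frame */

static unsigned segments(unsigned msg_bytes)
{
    return (msg_bytes + MSS - 1) / MSS;   /* ceiling division */
}

int main(void)
{
    unsigned sizes[] = { 200, 1460, 4096, 64 * 1024 };

    for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        printf("%6u-byte message -> %u segment(s)\n",
               sizes[i], segments(sizes[i]));
    return 0;
}

For instance, a 64 KB batched request fits in 45 segments, whereas 1,000 individual 200-byte requests would each travel in their own frame.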
References
Ali N, Devulapalli A, Dalessandro D, Wyckoff P, Sadayappan, P (2008) An OSD-based approach to managing directory operations in parallel file systems. In: Proceedings of the conference on high performance computing networking, storage and analysis (SC’08), pp 175–184
Artiaga E, Cortes T (2010) Using filesystem virtualization to avoid metadata bottlenecks. In: Proceedings of the Design, Automation & Test in Europe conference & exhibition (DATE), pp 562–567
Avilés-González A, Piernas J, González-Férez P (2011) A metadata cluster based on OSD+ devices. In: Proceedings of the 23rd international symposium on computer architecture and high performance computing (SBAC-PAD), pp 64–71
Avilés-González A, Piernas J, González-Férez P (2013) Scalable huge directories through OSD+ devices. In: Proceedings of the 21st Euromicro international conference on parallel, distributed, and network-based processing (PDP 2013), Belfast, UK, pp 1–8
Belay A, Prekas G, Klimovic A, Grossman S, Kozyrakis C, Bugnion E (2014) IX: a protected dataplane operating system for high throughput and low latency. In: Proceedings of 11th USENIX symposium on operating systems design and implementation (OSDI 14), pp 49–65
Belshe M, Peon R, Thomson M (2015) Hypertext transfer protocol version 2. http://datatracker.ietf.org/doc/draft-ietf-httpbis-http2
Bent J, Gibson G, Grider G, McClelland B, Nowoczynski P, Nunez J, Polte M, Wingate M (2009) PLFS: a checkpoint filesystem for parallel applications. In: Proceedings of the conference on high performance computing networking, storage and analysis (SC’09), pp 1–12
Braam PJ (2008) High-performance storage architecture and scalable cluster file system. http://wiki.lustre.org/index.php/Lustre_Publications
Brandt SA, Miller EL, Long DDE, Xue L (2003) Efficient metadata management in large distributed storage systems. In: Proceedings of the 20th IEEE conference on mass storage systems and technologies (MSST’03), pp 290–298
Chervenak AL, Palavalli N, Bharathi S, Kesselman C, Schwartzkopf R (2004) Performance and scalability of a replica location service. In: Proceedings of the 13th IEEE international symposium on high performance distributed computing (HPDC’04), pp 182–191
Cray Inc.: HPCS-IO (2012). http://sourceforge.net/projects/hpcs-io
Dilger A (2012) Lustre future development. In: Symposium at the 28th IEEE conference on massive data storage (MSST’12). http://storageconference.us/2012/Presentations/M04.Dilger
Dilger A (2012) Lustre metadata scaling. http://storageconference.us/2012/Presentations/T01.Dilger. Tutorial at the 28th IEEE Conference on Massive Data Storage (MSST’12)
Dunn MP (2009) A new I/O scheduler for solid state devices. Master’s thesis, Texas A&M University
Facebook Inc.: Batch requests. https://developers.facebook.com/docs/reference/ads-api/batch-requests
Facebook Inc.: Making multiple API requests. https://developers.facebook.com/docs/graph-api/making-multiple-requests/
Fikes A (2010) Storage architecture and challenges. In: Google Faculty Summit 2010. http://research.google.com/university/relations/facultysummit2010/storage_architecture_and_challenges
Freitas R, Slember J, Sawdon W, Chiu L (2011) GPFS scans 10 billion files in 43 minutes. Technical report RJ10484, IBM Almaden Research Center. http://www.almaden.ibm.com/storagesystems/resources/GPFS-Violin-white-paper
Ganger GR, Kaashoek MF (1997) Embedded inodes and explicit groupings: exploiting disk bandwidth for small files. In: Proceedings of USENIX Annual technical conference (ATC), pp 1–17
Gibson GA, Nagle D, Amiri K, Butler J, Chang FW, Gobioff H, Hardin C, Riedel E, Rochberg D, Zelenka J (1998) A cost-effective, high-bandwidth storage architecture. In: Proceedings of the international conference on architectural support for programming languages and operating systems (ASPLOS’98), pp 92–103
González-Férez P, Bilas A (2015) Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet. In: Proceedings of the IEEE 31st conference on mass storage systems and technologies (MSST)
González-Férez P, Piernas J, Cortés T (2008) Evaluating the effectiveness of REDCAP to recover the locality missed by today’s Linux systems. In: Proceedings of the IEEE/ACM international symposium on modeling, analysis, and simulation computer and telecommunication systems (MASCOTS’08), pp 1–4
Google Inc.: Google spreadsheet (2013). https://developers.google.com/chart/interactive/docs/spreadsheets
Google Inc.: Google base (2014). http://www.google.com/merchants/default
Google Inc.: Google calendar (2014). https://www.google.com/calendar
Google Inc.: Google cloud storage: Sending batch requests (2014). https://developers.google.com/storage/docs/json_api/v1/how-tos/batch
Google Inc.: Using batch operations (2014). http://code.google.com/p/gdata-python-client/wiki/UsingBatchOperations
Kim J, Oh Y, Kim E, Choi J, Lee D, Noh SH (2009) Disk schedulers for solid state drives. In: Proceedings of the 7th ACM international conference on embedded software, pp 295–304
Lin W, Wei Q, Veeravalli B (2007) WPAR: A weight-based metadata management strategy for petabyte-scale object storage systems. In: Proceedings of the 4th international workshop on storage network architecture and parallel I/Os (SNAPI’07), pp 99–106
MacDonald A (2012) NFSv4. ;login: 37(1):28–35
Mesnier M, Ganger GR, Riedel E (2003) Object-based storage. IEEE Commun Mag 41(8):84–90
Microsoft Inc.: Server Message Block (SMB) Version 2.0 Protocol Specification (2007). https://msdn.microsoft.com/en-us/library/cc212614.aspx
Miranda A, Effert S, Kang Y, Miller EL, Brinkmann A, Cortes T (2011) Reliable and randomized data distribution strategies for large scale storage systems. In: Proceedings of the 18th IEEE international conference on high performance computing (HiPC’11), pp 1–10
Morrone C, Loewe B, McLarty T (2014) mdtest HPC Benchmark. http://sourceforge.net/projects/mdtest
Newman H (2008) HPCS mission partner file I/O scenarios, revision 3. http://wiki.old.lustre.org/images/5/5a/Newman_May_Lustre_Workshop
OpenSFS, EOFS: The Lustre file system (2015). http://www.lustre.org
OpenStack Foundation: Archive auto extraction (2014). http://docs.openstack.org/developer/swift/middleware.html#module-swift.common.middleware.bulk
OpenStack Foundation: Bulk delete (2014). http://docs.openstack.org/api/openstack-object-storage/1.0/content/bulk-delete.html
Patil S, Gibson G (2011) Scale and concurrency of GIGA+: file system directories with millions of files. In: Proceedings of the 9th USENIX conference on file and storage technologies (FAST’11), pp 15–30
Patil S, Ren K, Gibson G (2012) A case for scaling HPC metadata performance through de-specialization. In: Proceedings of 7th petascale data storage workshop supercomputing (PDSW’12), pp 1–6
Polyakov E (2009) The Elliptics network. http://reverbrain.com/elliptics
Ren K, Patil S, Gibson G (2012) A case for scaling HPC metadata performance through de-specialization. In: Proceedings of the 7th petascale data storage workshop supercomputing (PDSW), pp 30–35
Seagate Inc.: Kinetic open storage (2013). https://developers.seagate.com/display/KV/Kinetic+Open+Storage+Documentation+Wiki
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 26th IEEE conference on massive storage systems and technologies (MSST’10), pp 1–10
Sinnamohideen S, Sambasivan RR, Hendricks J, Liu L, Ganger GR (2010) A transparently-scalable metadata service for the Ursa Minor storage system. In: Proceedings of USENIX annual technical conference (ATC’10), pp 1–14
Skeen D, Stonebraker M (1983) A formal model of crash recovery in a distributed system. IEEE Trans Software Eng 9(3):219–228
Sun-Oracle: Lustre tuning (2010). http://wiki.lustre.org/manual/LustreManual18_HTML/LustreTuning.html
SwiftStack Inc.: Kinetic motion with Seagate and OpenStack Swift (2013). https://swiftstack.com/blog/2013/10/22/kinetic-for-openstack-swift-with-seagate/
The PVFS Community: The Orange file system (2015). http://orangefs.org
Torvalds L et al (2014) Linux 3.14 features. http://kernelnewbies.org/Linux_3.14
Wang F, Xin Q, Hong B, Brandt SA, Miller EL, Long DDE, McLarty TT (2004) File system workload analysis for large scale scientific computing applications. In: Proceedings of the 21st IEEE conference on massive storage systems and technologies (MSST’04), pp 139–152
Weijia L, Wei X, Shu J, Zheng W (2006) Dynamic hashing: Adaptive metadata management for petabyte-scale file systems. In: Proceedings of the 23rd IEEE conference on massive storage systems and technologies (MSST’06), pp 159–164
Weil SA, Brandt SA, Miller EL, Long DDE, Maltzahn C (2006) Ceph: A scalable, high-performance distributed file system. In: Proceedings of the 7th USENIX symposium on operating systems design and implementation (OSDI’06), pp 307–320
Wheeler R (2010) One billion files: scalability limits in Linux file systems. In: LinuxCon’10. http://events.linuxfoundation.org/slides/2010/linuxcon2010_wheeler
Zhu Y, Jiang H, Wang J (2004) Hierarchical bloom filter arrays (HBA): a novel, scalable metadata management system for large cluster-based storage. In: Proceedings of IEEE international conference on cluster computing (Cluster’04), pp 165–174
Acknowledgments
Work supported by the Spanish MICINN and by European Commission FEDER funds, under grants TIN2009-14475-C04 and TIN2012-38341-C04-03.
Cite this article
Avilés-González, A., Piernas, J. & González-Férez, P. Batching operations to improve the performance of a distributed metadata service. J Supercomput 72, 654–687 (2016). https://doi.org/10.1007/s11227-015-1602-x