Abstract
Interconnects can limit the performance achieved by distributed and parallel file systems due to message-processing overheads, latencies, low bandwidths and possible congestion. This is especially true for metadata operations, because of the large number of small messages that they usually involve. These problems can be addressed from a hardware approach, with better interconnects, or from a software approach, by means of new designs and implementations. In this paper, we take the software approach and propose to increase the rate of metadata operations by sending several operations to a server in a single request. These metadata requests, which we call batch operations (or batchops for short), are particularly useful for applications that need to create, get the status of, and delete thousands or millions of files. With batchops, performance is increased by saving network delays and round-trips, and by reducing the number of messages, which, in turn, can mitigate possible network congestion. We have implemented batchops in our Fusion Parallel File System (FPFS). Results show that batchops can increase the metadata performance of FPFS by between 23 and 100%, depending on the metadata operation and the backend file system used. In absolute terms, batchops allow FPFS to create, stat and delete around 200,000, 300,000 and 200,000 files per second, respectively, with just 8 servers and a regular Gigabit network.
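As a rough illustration of the batching idea only (a minimal sketch; the names batch_req, batch_add and batch_send are hypothetical and are not the FPFS interface), the following C fragment packs many create operations into a single request buffer that is shipped in one round-trip, instead of issuing one RPC per file:

/*
 * Minimal sketch of the batching idea described in the abstract.
 * All names (batch_req, batch_add, batch_send) are hypothetical;
 * this only illustrates packing many metadata operations into one
 * request instead of sending one message per file.
 */
#include <stdio.h>

#define MAX_OPS   1024
#define NAME_LEN  64

enum op_type { OP_CREATE, OP_STAT, OP_UNLINK };

struct batch_req {
    int count;                       /* operations packed so far       */
    enum op_type type[MAX_OPS];      /* per-operation opcode           */
    char name[MAX_OPS][NAME_LEN];    /* per-operation file name        */
};

/* Append one metadata operation to the batch (no network traffic yet). */
static int batch_add(struct batch_req *b, enum op_type t, const char *name)
{
    if (b->count >= MAX_OPS)
        return -1;                   /* batch full: caller must flush  */
    b->type[b->count] = t;
    snprintf(b->name[b->count], NAME_LEN, "%s", name);
    b->count++;
    return 0;
}

/* Stand-in for the single round-trip that ships the whole batch. */
static void batch_send(const struct batch_req *b)
{
    printf("1 request, %d operations, ~%zu bytes of payload\n",
           b->count, (size_t)b->count * (sizeof(enum op_type) + NAME_LEN));
}

int main(void)
{
    struct batch_req b = { .count = 0 };
    char name[NAME_LEN];

    /* 1000 creates become one request instead of 1000 round-trips. */
    for (int i = 0; i < 1000; i++) {
        snprintf(name, sizeof(name), "file.%04d", i);
        batch_add(&b, OP_CREATE, name);
    }
    batch_send(&b);
    return 0;
}

With a typical 4-byte enum, the sketch reports about 68 KB of payload for 1,000 creates, i.e., one large request in place of 1,000 small round-trips.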
Notes
Recently, Seagate announced Kinetic [43], a drive that is a key/value server with Ethernet connectivity. It has a limited object-oriented interface that supports a few operations on objects identified by keys. Kinetic could be seen as an early implementation of something similar to Gibson’s proposal [20] but, due to its limited design, it still needs a higher-level layer like Swift [48] to carry out basic operations, such as mapping large objects, coordinating race conditions on write operations, etc.
By default, the Ethernet protocol limits the maximum payload of a frame to 1500 bytes, the so-called Maximum Transmission Unit (MTU). Consequently, the transport layer limits the Maximum Segment Size (MSS) to 1460 bytes, so a message larger than 1460 bytes is split into several segments to fit this limit.
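As a quick check of this arithmetic (the message sizes below are illustrative only, not actual FPFS request sizes), the following C snippet computes how many TCP segments a message of a given size needs under a 1460-byte MSS:

/* A message of S bytes needs ceil(S / 1460) TCP segments under the
 * default Ethernet MTU of 1500 bytes. Sizes below are illustrative. */
#include <stdio.h>

#define MSS 1460  /* bytes of TCP payload per Ethernet frame */

static unsigned segments(unsigned msg_bytes)
{
    return (msg_bytes + MSS - 1) / MSS;   /* ceiling division */
}

int main(void)
{
    unsigned sizes[] = { 200, 1460, 4096, 64 * 1024 };

    for (unsigned i = 0; i < sizeof(sizes) / sizeof(sizes[0]); i++)
        printf("%6u-byte message -> %u segment(s)\n",
               sizes[i], segments(sizes[i]));
    return 0;
}

For instance, a 64 KB batched request fits in 45 segments, whereas 1,000 individual 200-byte requests would each travel in their own frame.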
References
Ali N, Devulapalli A, Dalessandro D, Wyckoff P, Sadayappan, P (2008) An OSD-based approach to managing directory operations in parallel file systems. In: Proceedings of the conference on high performance computing networking, storage and analysis (SC’08), pp 175–184
Artiaga E, Cortes T (2010) Using filesystem virtualization to avoid metadata bottlenecks. In: Proceedings of the Design, Automation & Test in Europe conference & exhibition (DATE), pp 562–567
Avilés-González A, Piernas J, González-Férez P (2011) A metadata cluster based on OSD+ devices. In: Proceedings of the 23rd international symposium on computer architecture and high performance computing (SBAC-PAD), pp 64–71
Avilés-González A, Piernas J, González-Férez P (2013) Scalable huge directories through OSD+ devices. In: Proceedings of the 21st Euromicro international conference on parallel, distributed, and network-based processing (PDP 2013), Belfast, UK, pp 1–8
Belay A, Prekas G, Klimovic A, Grossman S, Kozyrakis C, Bugnion E (2014) IX: a protected dataplane operating system for high throughput and low latency. In: Proceedings of 11th USENIX symposium on operating systems design and implementation (OSDI 14), pp 49–65
Belshe M, Peon R, Thomson M (2015) Hypertext transfer protocol version 2. http://datatracker.ietf.org/doc/draft-ietf-httpbis-http2
Bent J, Gibson G, Grider G, McClelland B, Nowoczynski P, Nunez J, Polte M, Wingate M (2009) PLFS: a checkpoint filesystem for parallel applications. In: Proceedings of the conference on high performance computing networking, storage and analysis (SC’09), pp 1–12
Braam PJ (2008) High-performance storage architecture and scalable cluster file system. http://wiki.lustre.org/index.php/Lustre_Publications
Brandt SA, Miller EL, Long DDE, Xue L (2003) Efficient metadata management in large distributed storage systems. In: Proceedings of the 20th IEEE conference on mass storage systems and technologies (MSST’03), pp 290–298
Chervenak AL, Palavalli N, Bharathi S, Kesselman C, Schwartzkopf R (2004) Performance and scalability of a replica location service. In: Proceedings of the 13th IEEE international symposium on high performance distributed computing (HPDC’04), pp 182–191
Cray Inc.: HPCS-IO (2012). http://sourceforge.net/projects/hpcs-io
Dilger A (2012) Lustre future development. In: Symposium at the 28th IEEE conference on massive data storage (MSST’12). http://storageconference.us/2012/Presentations/M04.Dilger
Dilger A (2012) Lustre metadata scaling. http://storageconference.us/2012/Presentations/T01.Dilger. Tutorial at the 28th IEEE Conference on Massive Data Storage (MSST’12)
Dunn MP (2009) A new I/O scheduler for solid state devices. Master’s thesis, Texas A&M University
Facebook Inc.: Batch requests. https://developers.facebook.com/docs/reference/ads-api/batch-requests
Facebook Inc.: Making multiple API requests. https://developers.facebook.com/docs/graph-api/making-multiple-requests/
Fikes A (2010) Storage architecture and challenges. In: Google Faculty Summit 2010. http://research.google.com/university/relations/facultysummit2010/storage_architecture_and_challenges
Freitas R, Slember J, Sawdon W, Chiu L (2011) GPFS scans 10 billion files in 43 minutes. Technical report RJ10484, IBM Almaden Research Center. http://www.almaden.ibm.com/storagesystems/resources/GPFS-Violin-white-paper
Ganger GR, Kaashoek MF (1997) Embedded inodes and explicit groupings: exploiting disk bandwidth for small files. In: Proceedings of USENIX Annual technical conference (ATC), pp 1–17
Gibson GA, Nagle D, Amiri K, Butler J, Chang FW, Gobioff H, Hardin C, Riedel E, Rochberg D, Zelenka J (1998) A cost-effective, high-bandwidth storage architecture. In: Proceedings of the international conference on architectural support for programming languages and operating systems (ASPLOS’98), pp 92–103
González-Férez P, Bilas A (2015) Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet. In: Proceedings of the IEEE 31st conference on mass storage systems and technologies (MSST)
González-Férez P, Piernas J, Cortés T (2008) Evaluating the effectiveness of REDCAP to recover the locality missed by today’s Linux systems. In: Proceedings of the IEEE/ACM international symposium on modeling, analysis, and simulation computer and telecommunication systems (MASCOTS’08), pp 1–4
Google Inc.: Google spreadsheet (2013). https://developers.google.com/chart/interactive/docs/spreadsheets
Google Inc.: Google base (2014). http://www.google.com/merchants/default
Google Inc.: Google calendar (2014). https://www.google.com/calendar
Google Inc.: Google cloud storage: Sending batch requests (2014). https://developers.google.com/storage/docs/json_api/v1/how-tos/batch
Google Inc.: Using batch operations (2014). http://code.google.com/p/gdata-python-client/wiki/UsingBatchOperations
Kim J, Oh Y, Kim E, Choi J, Lee D, Noh SH (2009) Disk schedulers for solid state drives. In: Proceedings of the 7th ACM international conference on embedded software, pp 295–304
Lin W, Wei Q, Veeravalli B (2007) WPAR: A weight-based metadata management strategy for petabyte-scale object storage systems. In: Proceedings of the 4th international workshop on storage network architecture and parallel I/Os (SNAPI’07), pp 99–106
MacDonald A (2012) NFSv4. ;login: 37(1):28–35
Mesnier M, Ganger GR, Riedel E (2003) Object-based storage. IEEE Commun Mag 41(8):84–90
Microsoft Inc.: Server Message Block (SMB) Version 2.0 Protocol Specification (2007). https://msdn.microsoft.com/en-us/library/cc212614.aspx
Miranda A, Effert S, Kang Y, Miller EL, Brinkmann A, Cortes T (2011) Reliable and randomized data distribution strategies for large scale storage systems. In: Proceedings of the 18th IEEE international conference on high performance computing (HiPC’11), pp 1–10
Morrone C, Loewe B, McLarty T (2014) mdtest HPC Benchmark. http://sourceforge.net/projects/mdtest
Newman H (2008) HPCS mission partner file I/O scenarios, revision 3. http://wiki.old.lustre.org/images/5/5a/Newman_May_Lustre_Workshop
OpenSFS, EOFS: The Lustre file system (2015). http://www.lustre.org
OpenStack Foundation: Archive auto extraction (2014). http://docs.openstack.org/developer/swift/middleware.html#module-swift.common.middleware.bulk
OpenStack Foundation: Bulk delete (2014). http://docs.openstack.org/api/openstack-object-storage/1.0/content/bulk-delete.html
Patil S, Gibson G (2011) Scale and concurrency of GIGA+: file system directories with millions of files. In: Proceedings of the 9th USENIX conference on file and storage technologies (FAST’11), pp 15–30
Patil S, Ren K, Gibson G (2012) A case for scaling HPC metadata performance through de-specialization. In: Proceedings of 7th petascale data storage workshop supercomputing (PDSW’12), pp 1–6
Polyakov E (2009) The Elliptics network. http://reverbrain.com/elliptics
Ren K, Patil S, Gibson G (2012) A case for scaling HPC metadata performance through de-specialization. In: Proceedings of the 7th petascale data storage workshop supercomputing (PDSW), pp 30–35
Seagate Inc.: Kinetic open storage (2013). https://developers.seagate.com/display/KV/Kinetic+Open+Storage+Documentation+Wiki
Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 26th IEEE conference on massive storage systems and technologies (MSST’10), pp 1–10
Sinnamohideen S, Sambasivan RR, Hendricks J, Liu L, Ganger GR (2010) A transparently-scalable metadata service for the Ursa Minor storage system. In: Proceedings of USENIX annual technical conference (ATC’10), pp 1–14
Skeen D, Stonebraker M (1983) A formal model of crash recovery in a distributed system. IEEE Trans Software Eng 9(3):219–228
Sun-Oracle: Lustre tuning (2010). http://wiki.lustre.org/manual/LustreManual18_HTML/LustreTuning.html
SwiftStack Inc.: Kinetic motion with Seagate and OpenStack Swift (2013). https://swiftstack.com/blog/2013/10/22/kinetic-for-openstack-swift-with-seagate/
The PVFS Community: The Orange file system (2015). http://orangefs.org
Torvalds L et al (2014) Linux 3.14 features. http://kernelnewbies.org/Linux_3.14
Wang F, Xin Q, Hong B, Brandt SA, Miller EL, Long DDE, McLarty TT (2004) File system workload analysis for large scale scientific computing applications. In: Proceedings of the 21st IEEE conference on massive storage systems and technologies (MSST’04), pp 139–152
Weijia L, Wei X, Shu J, Zheng W (2006) Dynamic hashing: Adaptive metadata management for petabyte-scale file systems. In: Proceedings of the 23rd IEEE conference on massive storage systems and technologies (MSST’06), pp 159–164
Weil SA, Brandt SA, Miller EL, Long DDE, Maltzahn C (2006) Ceph: A scalable, high-performance distributed file system. In: Proceedings of the 7th USENIX symposium on operating systems design and implementation (OSDI’06), pp 307–320
Wheeler R (2010) One billion files: scalability limits in Linux file systems. In: LinuxCon’10. http://events.linuxfoundation.org/slides/2010/linuxcon2010_wheeler
Zhu Y, Jiang H, Wang J (2004) Hierarchical bloom filter arrays (HBA): a novel, scalable metadata management system for large cluster-based storage. In: Proceedings of IEEE international conference on cluster computing (Cluster’04), pp 165–174
Acknowledgments
Work supported by the Spanish MICINN and by European Commission FEDER funds, under grants TIN2009-14475-C04 and TIN2012-38341-C04-03.
Cite this article
Avilés-González, A., Piernas, J. & González-Férez, P. Batching operations to improve the performance of a distributed metadata service. J Supercomput 72, 654–687 (2016). https://doi.org/10.1007/s11227-015-1602-x