Batching operations to improve the performance of a distributed metadata service

The Journal of Supercomputing 72, 654–687 (2016)

Abstract

Interconnects can limit the performance achieved by distributed and parallel file systems due to message processing overheads, latencies, low bandwidth and possible congestion. This is especially true for metadata operations, because of the large number of small messages that they usually involve. These problems can be addressed from a hardware approach, with better interconnects, or from a software approach, by means of new designs and implementations. In this paper, we take the software approach and propose to increase the rate of metadata operations by sending several operations to a server in a single request. These metadata requests, which we call batch operations (or batchops for short), are particularly useful for applications that need to create, get the status information of, and delete thousands or millions of files. With batchops, performance is increased by saving network delays and round trips, and by reducing the number of messages, which, in turn, can mitigate possible network congestion. We have implemented batchops in our Fusion Parallel File System (FPFS). Results show that batchops can increase the metadata performance of FPFS by between 23 and 100%, depending on the metadata operation and the backend file system used. In absolute terms, batchops allow FPFS to create, stat and delete around 200,000, 300,000 and 200,000 files per second, respectively, with just 8 servers and a regular Gigabit network.
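To make the idea concrete, the following C sketch shows how a client could pack many create operations into a single request so that one round trip replaces many. It is only an illustration under assumed names: batch_req, batch_add, batch_send, the 64-operation limit and the operation layout are hypothetical and do not correspond to the actual FPFS interface.

/*
 * Illustrative sketch only: these types and functions (batch_req, batch_add,
 * batch_send, the 64-operation limit) are hypothetical and are not the FPFS
 * API. The sketch shows the idea of packing many metadata operations into a
 * single request so that one round trip replaces many.
 */
#include <stdio.h>

#define MAX_OPS  64
#define PATH_LEN 48

enum op_type { OP_CREATE, OP_STAT, OP_UNLINK };

struct op {                  /* one metadata operation inside a batch        */
    enum op_type type;
    char path[PATH_LEN];
};

struct batch_req {           /* a single request carrying several operations */
    int n_ops;
    struct op ops[MAX_OPS];
};

/* Append an operation to the batch; returns 0 on success, -1 if it is full. */
static int batch_add(struct batch_req *b, enum op_type type, const char *path)
{
    if (b->n_ops >= MAX_OPS)
        return -1;
    b->ops[b->n_ops].type = type;
    snprintf(b->ops[b->n_ops].path, PATH_LEN, "%s", path);
    b->n_ops++;
    return 0;
}

/* Stand-in for shipping the whole batch to a metadata server in one message. */
static void batch_send(const struct batch_req *b)
{
    size_t payload = sizeof(b->n_ops) + (size_t)b->n_ops * sizeof(struct op);
    printf("1 request, %d operations, ~%zu payload bytes\n", b->n_ops, payload);
}

int main(void)
{
    struct batch_req b = { .n_ops = 0 };
    char path[PATH_LEN];

    /* One batched request replaces MAX_OPS individual create requests. */
    for (int i = 0; i < MAX_OPS; i++) {
        snprintf(path, sizeof(path), "/scratch/job42/file-%04d", i);
        batch_add(&b, OP_CREATE, path);
    }
    batch_send(&b);
    return 0;
}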

Notes

  1. Recently, Seagate announced Kinetic [43], a drive that is a key/value server with Ethernet connectivity. It has a limited object-oriented interface that supports a few operations on objects identified by keys. Kinetic could be seen as an early implementation of something similar to Gibson’s proposal [20], but, due to its limited design, it still needs a higher-level layer like Swift [48] to carry out basic operations, such as mapping large objects, coordinating concurrent write operations, etc.

  2. The Ethernet protocol limits the maximum payload of a frame to 1500 bytes by default (the so-called Maximum Transmission Unit, MTU). Consequently, the transport layer limits the Maximum Segment Size (MSS) to 1460 bytes, so a message larger than 1460 bytes will be split into several segments to fit this limit.
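As a rough illustration of these limits, the snippet below (in the same hypothetical vein as the sketch above) computes how many operations fit in a single 1460-byte segment and how many segments a batched request would need; the 64-byte per-operation size and the 1000-operation workload are assumed figures, not measurements from the paper.

/*
 * Back-of-the-envelope illustration of the MTU/MSS limits described above.
 * The 1460-byte MSS is the standard value for a 1500-byte Ethernet MTU; the
 * 64-byte per-operation size and the 1000-operation workload are made-up
 * figures used only to show the effect of batching on segment counts.
 */
#include <stdio.h>

int main(void)
{
    const int mss     = 1460;   /* maximum TCP payload per segment (bytes) */
    const int op_size = 64;     /* hypothetical size of one metadata op    */
    const int n_ops   = 1000;   /* operations to send                      */

    int ops_per_segment  = mss / op_size;                           /* 22 */
    int batched_segments = (n_ops + ops_per_segment - 1) / ops_per_segment;

    printf("one message per op: %d messages\n", n_ops);
    printf("one batched request: %d segments (%d ops per segment)\n",
           batched_segments, ops_per_segment);
    return 0;
}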

References

  1. Ali N, Devulapalli A, Dalessandro D, Wyckoff P, Sadayappan P (2008) An OSD-based approach to managing directory operations in parallel file systems. In: Proceedings of the conference on high performance computing networking, storage and analysis (SC’08), pp 175–184

  2. Artiaga E, Cortes T (2010) Using filesystem virtualization to avoid metadata bottlenecks. In: Proceedings of Design, Automation & Test in Europe Conference & Exhibition (DATE), pp 562–567

  3. Avilés-González A, Piernas J, González-Férez P (2011) A metadata cluster based on OSD+ devices. In: Proceedings of the 23rd International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD), pp 64–71

  4. Avilés-González A, Piernas J, González-Férez P (2013) Scalable huge directories through OSD+ devices. Proceedings of the 21st Euromicro international conference on parallel, distributed, and network-based processing, PDP 2013. Belfast, UK, pp 1–8

  5. Belay A, Prekas G, Klimovic A, Grossman S, Kozyrakis C, Bugnion E (2014) IX: a protected dataplane operating system for high throughput and low latency. In: Proceedings of 11th USENIX symposium on operating systems design and implementation (OSDI 14), pp 49–65

  6. Belshe M, Peon R, Thomson M (2015) Hypertext transfer protocol version 2. http://datatracker.ietf.org/doc/draft-ietf-httpbis-http2

  7. Bent J, Gibson G, Grider G, McClelland B, Nowoczynski P, Nunez J, Polte M, Wingate M (2009) PLFS: a checkpoint filesystem for parallel applications. In: Proceedings of the conference on high performance computing networking, storage and analysis (SC’09), pp 1–12

  8. Braams PJ (2008) High-performance storage architecture and scalable cluster file system. http://wiki.lustre.org/index.php/Lustre_Publications

  9. Brandt SA, Miller EL, Long DDE, Xue L (2003) Efficient metadata management in large distributed storage systems. In: Proceedings of the 20th IEEE conference on mass storage systems and technologies (MSST’03), pp 290–298

  10. Chervenak AL, Palavalli N, Bharathi S, Kesselman C, Schwartzkopf R (2004) Performance and scalability of a replica location service. In: Proceedings of the 13th IEEE international symposium on high performance distributed computing (HPDC’04), pp 182–191

  11. Cray Inc.: HPCS-IO (2012). http://sourceforge.net/projects/hpcs-io

  12. Dilger A (2012) Lustre future development. In: Symposium at the 28th IEEE conference on massive data storage (MSST’12). http://storageconference.us/2012/Presentations/M04.Dilger

  13. Dilger A (2012) Lustre metadata scaling. http://storageconference.us/2012/Presentations/T01.Dilger. Tutorial at the 28th IEEE Conference on Massive Data Storage (MSST’12)

  14. Dunn MP (2009) A new I/O scheduler for solid state devices. Master’s thesis, Texas A&M University

  15. Facebook Inc.: Batch requests. https://developers.facebook.com/docs/reference/ads-api/batch-requests

  16. Facebook Inc.: Making multiple API requests. https://developers.facebook.com/docs/graph-api/making-multiple-requests/

  17. Fikes A (2010) Storage architecture and challenges. In: Google Faculty Summit 2010. http://research.google.com/university/relations/facultysummit2010/storage_architecture_and_challenges

  18. Freitas R, Slember J, Sawdon W, Chiu L (2011) GPFS scans 10 billion files in 43 minutes. Technical report RJ10484, IBM Almaden Research Center. http://www.almaden.ibm.com/storagesystems/resources/GPFS-Violin-white-paper

  19. Ganger GR, Kaashoek MF (1997) Embedded inodes and explicit groupings: exploiting disk bandwidth for small files. In: Proceedings of USENIX Annual technical conference (ATC), pp 1–17

  20. Gibson GA, Nagle D, Amiri K, Butler J, Chang FW, Gobioff H, Hardin C, Riedel E, Rochberg D, Zelenka J (1998) A cost-effective, high-bandwidth storage architecture. In: Proceedings of the international conference on architectural support for programming languages and operating systems (ASPLOS’98), pp 92–103

  21. González-Férez P, Bilas A (2015) Reducing CPU and network overhead for small I/O requests in network storage protocols over raw Ethernet. In: Proceedings of the IEEE 31st conference on mass storage systems and technologies (MSST)

  22. González-Férez P, Piernas J, Cortés T (2008) Evaluating the effectiveness of REDCAP to recover the locality missed by today’s Linux systems. In: Proceedings of the IEEE/ACM international symposium on modeling, analysis, and simulation computer and telecommunication systems (MASCOTS’08), pp 1–4

  23. Google Inc.: Google spreadsheet (2013). https://developers.google.com/chart/interactive/docs/spreadsheets

  24. Google Inc.: Google base (2014). http://www.google.com/merchants/default

  25. Google Inc.: Google calendar (2014). https://www.google.com/calendar

  26. Google Inc.: Google cloud storage: Sending batch requests (2014). https://developers.google.com/storage/docs/json_api/v1/how-tos/batch

  27. Google Inc.: Using batch operations (2014). http://code.google.com/p/gdata-python-client/wiki/UsingBatchOperations

  28. Kim J, Oh Y, Kim E, Choi J, Lee D, Noh SH (2009) Disk schedulers for solid state drives. In: Proceedings of the 7th ACM international conference on embedded software, pp 295–304

  29. Lin W, Wei Q, Veeravalli B (2007) WPAR: A weight-based metadata management strategy for petabyte-scale object storage systems. In: Proceedings of the 4th international workshop on storage network architecture and parallel I/Os (SNAPI’07), pp 99–106

  30. MacDonald A (2012) NFSv4. ;login: 37(1):28–35

  31. Mesnier M, Ganger GR, Riedel E (2003) Object-based storage. IEEE Commun Mag 41(8):84–90

  32. Microsoft Inc.: Server Message Block (SMB) Version 2.0 Protocol Specification (2007). https://msdn.microsoft.com/en-us/library/cc212614.aspx

  33. Miranda A, Effert S, Kang Y, Miller EL, Brinkmann A, Cortes T (2011) Reliable and randomized data distribution strategies for large scale storage systems. In: Proceedings of 18th IEEE international conference on high performance computing (HiPC’11), pp 1–10

  34. Morrone C, Loewe B, McLarty T (2014) mdtest HPC Benchmark. http://sourceforge.net/projects/mdtest

  35. Newman H (2008) HPCS mission partner file I/O scenarios, revision 3. http://wiki.old.lustre.org/images/5/5a/Newman_May_Lustre_Workshop

  36. OpenSFS, EOFS: The Lustre file system (2015). http://www.lustre.org

  37. OpenStack Foundation: Archive auto extraction (2014). http://docs.openstack.org/developer/swift/middleware.html#module-swift.common.middleware.bulk

  38. OpenStack Foundation: Bulk delete (2014). http://docs.openstack.org/api/openstack-object-storage/1.0/content/bulk-delete.html

  39. Patil S, Gibson G (2011) Scale and concurrency of GIGA+: file system directories with millions of files. In: Proceeding of the 9th USENIX conference on file and storage technologies (FAST’11), pp 15–30

  40. Patil S, Ren K, Gibson G (2012) A case for scaling HPC metadata performance through de-specialization. In: Proceedings of 7th petascale data storage workshop supercomputing (PDSW’12), pp 1–6

  41. Polyakov E (2009) The Elliptics network. http://reverbrain.com/elliptics

  42. Ren K, Patil S, Gibson G (2012) A case for scaling HPC metadata performance through de-specialization. In: Proceedings of the 7th petascale data storage workshop supercomputing (PDSW), pp 30–35

  43. Seagate Inc.: Kinetic open storage (2013). https://developers.seagate.com/display/KV/Kinetic+Open+Storage+Documentation+Wiki

  44. Shvachko K, Kuang H, Radia S, Chansler R (2010) The Hadoop distributed file system. In: Proceedings of the 26th IEEE conference on massive storage systems and technologies (MSST’10), pp 1–10

  45. Sinnamohideen S, Sambasivan RR, Hendricks J, Liu L, Ganger GR (2010) A transparently-scalable metadata service for the Ursa Minor storage system. In: Proceedings of USENIX annual technical conference (ATC’10), pp 1–14

  46. Skeen D, Stonebraker M (1983) A formal model of crash recovery in a distributed system. IEEE Trans Software Eng 9(3):219–228

  47. Sun-Oracle: Lustre tuning (2010). http://wiki.lustre.org/manual/LustreManual18_HTML/LustreTuning.html

  48. SwiftStack Inc.: Kinetic motion with Seagate and OpenStack Swift (2013). https://swiftstack.com/blog/2013/10/22/kinetic-for-openstack-swift-with-seagate/

  49. The PVFS Community: The Orange file system (2015). http://orangefs.org

  50. Torvalds L et al (2014) Linux 3.14 features. http://kernelnewbies.org/Linux_3.14

  51. Wang F, Xin Q, Hong B, Brandt SA, Miller EL, Long DDE, McLarty TT (2004) File system workload analysis for large scale scientific computing applications. In: Proceedings of the 21st IEEE conference on massive storage systems and technologies (MSST’04), pp 139–152

  52. Weijia L, Wei X, Shu J, Zheng W (2006) Dynamic hashing: Adaptive metadata management for petabyte-scale file systems. In: Proceedings of the 23rd IEEE conference on massive storage systems and technologies (MSST’06), pp 159–164

  53. Weil SA, Brandt SA, Miller EL, Long DDE, Maltzahn C (2006) Ceph: A scalable, high-performance distributed file system. In: Proceedings of the 7th USENIX symposium on operating systems design and implementation (OSDI’06), pp 307–320

  54. Wheeler R (2010) One billion files: scalability limits in Linux file systems. In: LinuxCon’10. http://events.linuxfoundation.org/slides/2010/linuxcon2010_wheeler

  55. Zhu Y, Jiang H, Wang J (2004) Hierarchical bloom filter arrays (HBA): a novel, scalable metadata management system for large cluster-based storage. In: Proceedings of IEEE international conference on cluster computing (Cluster’04), pp 165–174

Acknowledgments

Work supported by the Spanish MICINN and European Commission FEDER funds, under Grants TIN2009-14475-C04 and TIN2012-38341-C04-03.

Author information

Corresponding author

Correspondence to Juan Piernas.

About this article

Cite this article

Avilés-González, A., Piernas, J. & González-Férez, P. Batching operations to improve the performance of a distributed metadata service. J Supercomput 72, 654–687 (2016). https://doi.org/10.1007/s11227-015-1602-x
