Abstract
The increasing use of multiple Workflow Management Systems (WMS) employing various workflow languages and shared workflow repositories enhances the open-source bioinformatics ecosystem. Efficient resource utilization in these systems is crucial for keeping costs low and improving processing times, especially for large-scale bioinformatics workflows running in cloud environments. Recognizing this, our study introduces a novel reference architecture, Cloud Monitoring Kit (CMK), for a multi-platform monitoring system. Our solution is designed to generate uniform, aggregated metrics from containerized workflow tasks scheduled by different WMS. Central to the proposed solution is the use of task labeling methods, which enable convenient grouping and aggregating of metrics independent of the WMS employed. This approach builds upon existing technology, providing additional benefits of modularity and capacity to seamlessly integrate with other data processing or collection systems. We have developed and released an open-source implementation of our system, which we evaluated on Amazon Web Services (AWS) using a transcriptomics data analysis workflow executed on two scientific WMS. The findings of this study indicate that CMK provides valuable insights into resource utilization. In doing so, it paves the way for more efficient management of resources in containerized scientific workflows running in public cloud environments, and it provides a foundation for optimizing task configurations, reducing costs, and enhancing scheduling decisions. Overall, our solution addresses the immediate needs of bioinformatics workflows and offers a scalable and adaptable framework for future advancements in cloud-based scientific computing.
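The label-based grouping described above can be illustrated with a minimal sketch. It is not the CMK implementation: the `Sample` type, the label keys (`wms`, `workflow`, `task`), the WMS names, and the metric fields are hypothetical, chosen only to show how metrics tagged with task labels might be aggregated in the same way regardless of which WMS scheduled the container.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Sample:
    """One resource reading from a container, tagged with task labels.

    The label keys used below are assumptions made for this example.
    """
    labels: dict
    cpu_percent: float
    memory_mb: float


def aggregate(samples, group_by=("workflow", "task")):
    """Group samples by the chosen label keys and summarize each group."""
    groups = defaultdict(list)
    for sample in samples:
        # A missing label falls back to "unknown" so no sample is dropped.
        key = tuple(sample.labels.get(name, "unknown") for name in group_by)
        groups[key].append(sample)
    return {
        key: {
            "samples": len(items),
            "mean_cpu_percent": mean([item.cpu_percent for item in items]),
            "peak_memory_mb": max(item.memory_mb for item in items),
        }
        for key, items in groups.items()
    }


# Hypothetical readings for the same task scheduled by two different WMS.
samples = [
    Sample({"wms": "wms-a", "workflow": "transcriptomics", "task": "align"}, 80.0, 4096.0),
    Sample({"wms": "wms-a", "workflow": "transcriptomics", "task": "align"}, 60.0, 5120.0),
    Sample({"wms": "wms-b", "workflow": "transcriptomics", "task": "align"}, 70.0, 3072.0),
]

per_task = aggregate(samples)                    # independent of the WMS
per_wms = aggregate(samples, group_by=("wms",))  # same data, split by WMS
```

Because the grouping key is built only from labels, the same function serves a per-task view across all WMS and a per-WMS comparison; changing the view only requires a different `group_by` tuple.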
Data Availability
The CMK platform and the tested configurations are available to the community as open source on GitHub at https://github.com/biobam/cmk.
Acknowledgements
We thank BioBam for supporting this collaboration, in particular E. Presa Díez and M. Benegas Coll for suggesting and providing links to suitable datasets for the workflow evaluation. Additionally, we would like to thank S. Hewitt for providing writing and editing assistance.
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This work has received funding from the Valencian Innovation Agency (AVI) file INNTA3/2021/5 (INNODOCTO). GM and RN would like to thank Grant PID2020-113126RB-I00 funded by MICIU/AEI/10.13039/501100011033. This work was supported by the project “An interdisciplinary Digital Twin Engine for science” (interTwin) that has received funding from the European Union’s Horizon Europe Programme under Grant 101058386.
Author information
Contributions
R.N. conceptualized the project and methodology, developed the platform, and wrote the manuscript. S.G. and G.M. provided project oversight and guidance and acquired the funding required to perform the work. All authors reviewed the manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Nica, R., Götz, S. & Moltó, G. CMK: Enhancing Resource Usage Monitoring across Diverse Bioinformatics Workflow Management Systems. J Grid Computing 22, 62 (2024). https://doi.org/10.1007/s10723-024-09777-z
DOI: https://doi.org/10.1007/s10723-024-09777-z