Abstract
Large software systems often undergo performance tests to ensure their capability to handle expected loads. These performance tests often consume large amounts of computing resources and time, since heavy loads need to be generated. Making matters worse, the ever-evolving field requires frequent updates to the performance testing environment. In practice, virtual machines (VMs) are widely exploited to provide flexible and less costly environments for performance tests. However, the use of VMs may introduce confounding overhead (e.g., a higher than expected memory utilization with unstable I/O traffic) to the testing environment and lead to unrealistic performance testing results. Yet, little research has studied the impact of using VMs on the results of performance testing activities. To evaluate the discrepancy between the performance testing results from virtual and physical environments, we perform a case study on two open source systems, namely Dell DVD Store (DS2) and CloudStore. We conduct the same performance tests in both virtual and physical environments and compare the performance testing results based on the three aspects that are typically examined for such results: 1) individual performance metrics (e.g., CPU time from the virtual environment vs. CPU time from the physical environment), 2) relationships among performance metrics (e.g., the correlation between CPU and I/O), and 3) performance models that are built to predict system performance. Our results show that 1) individual metrics from virtual and physical environments do not follow the same distribution, hence practitioners cannot simply use a scaling factor to compare performance across environments, 2) correlations among performance metrics in virtual environments differ from those in physical environments, and 3) statistical models built on performance metrics from virtual environments differ from models built from physical environments, suggesting that practitioners cannot directly carry performance testing results across virtual and physical environments. To assist practitioners in leveraging performance testing results from both environments, we investigate ways to reduce the discrepancy. We find that the discrepancy can be reduced by normalizing performance metrics based on deviance. Overall, we suggest that practitioners should not use performance testing results from virtual environments under the simple assumption of a straightforward performance overhead. Instead, they should consider leveraging normalization techniques to reduce the discrepancy before examining performance testing results from virtual and physical environments.
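The three comparisons described above can be pictured with a small analysis script. The following Python sketch is not the authors' actual tooling; it assumes performance counters from each environment have been exported to hypothetical CSV files (virtual_env_counters.csv, physical_env_counters.csv) with illustrative column names (cpu, memory, io, response_time), and it uses one plausible reading of deviance-based normalization (centering each metric and scaling by its standard deviation).

```python
# Minimal sketch of the three comparisons between virtual and physical
# performance testing results. File names, column names, and the chosen
# statistical tests are illustrative assumptions, not the study's scripts.
import pandas as pd
from scipy import stats
from sklearn.linear_model import LinearRegression

virtual = pd.read_csv("virtual_env_counters.csv")    # hypothetical file
physical = pd.read_csv("physical_env_counters.csv")  # hypothetical file

# 1) Single metric: test whether the same metric follows the same distribution
#    in both environments (here via a two-sample Kolmogorov-Smirnov test).
ks_stat, p_value = stats.ks_2samp(virtual["cpu"], physical["cpu"])
print(f"CPU distributions differ (p={p_value:.4f})" if p_value < 0.05
      else "No significant distribution difference for CPU")

# 2) Relationship among metrics: compare pairwise correlations (e.g., CPU vs. I/O)
#    across the two environments.
corr_virtual = virtual["cpu"].corr(virtual["io"], method="spearman")
corr_physical = physical["cpu"].corr(physical["io"], method="spearman")
print(f"CPU-I/O correlation: virtual={corr_virtual:.2f}, physical={corr_physical:.2f}")

# 3) Performance models: fit a simple regression model per environment and check
#    how well the model trained on virtual data explains the physical data.
features = ["cpu", "memory", "io"]
model_v = LinearRegression().fit(virtual[features], virtual["response_time"])
model_p = LinearRegression().fit(physical[features], physical["response_time"])
cross_r2 = model_v.score(physical[features], physical["response_time"])
print(f"Virtual-trained model explains R^2={cross_r2:.2f} of the physical data")

# One plausible reading of deviance-based normalization: center each metric and
# scale by its standard deviation before repeating the comparisons above.
def normalize(df, cols):
    return (df[cols] - df[cols].mean()) / df[cols].std()

virtual_norm = normalize(virtual, features)
physical_norm = normalize(physical, features)
```

Re-running the same comparisons on virtual_norm and physical_norm would show whether the normalization narrows the gap between the two environments, which is the direction the study suggests.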





Notes
The complete results, data and scripts are shared online at http://das.encs.concordia.ca/members/moiz-arif/
Additional information
Communicated by: Mark Grechanik
Cite this article
Arif, M.M., Shang, W. & Shihab, E. Empirical study on the discrepancy between performance testing results from virtual and physical environments. Empir Software Eng 23, 1490–1518 (2018). https://doi.org/10.1007/s10664-017-9553-x