Abstract
Dynamic programming languages such as Python are widely used for their flexibility and support for rapid development. However, the absence of explicit parameter type declarations poses significant challenges for automated test case generation: parameter types are often assigned at random, which enlarges the search space and reduces testing efficiency. Current evolutionary algorithms, which rely heavily on random mutations, struggle to handle specific data types and frequently fall into local optima, making it difficult to generate high-quality test cases. Moreover, the resulting test suites often contain errors, preventing their immediate use in real-world applications. To address these challenges, this paper proposes using large language models to enhance test case generation for dynamic programming languages. Our method involves three key steps: analyzing parameter types to narrow the search space, introducing meaningful data during mutations to increase test case relevance, and using large language models to automatically repair errors in the generated test suites. Experimental results show a 16% improvement in test coverage, faster evolutionary cycles, and an increase in the number of directly executable test suites. These findings highlight the potential of large language models to improve both the efficiency and reliability of test case generation for dynamic programming languages.
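To make the first step concrete, the sketch below shows how an LLM's type guesses could seed input generation instead of sampling parameter types at random. This is a minimal illustration under stated assumptions, not the paper's implementation: the `llm` callable, the prompt wording, and the `GENERATORS` table are hypothetical stand-ins, and the `__main__` block uses a stubbed LLM response so the example runs offline.

```python
import json
import random
from typing import Callable, Dict

# Hypothetical LLM interface: any callable mapping a prompt string to a
# completion string. A real system might wrap GPT-4 or DeepSeek-Coder here.
LLM = Callable[[str], str]

PROMPT = (
    "Infer the most likely type of each parameter of the following Python "
    "function. Answer with a JSON object mapping parameter names to one of: "
    "int, float, str, bool, list, dict.\n\n{source}"
)


def infer_parameter_types(source: str, llm: LLM) -> Dict[str, str]:
    """Ask the LLM for parameter types; fall back to an empty mapping on unusable output."""
    try:
        return json.loads(llm(PROMPT.format(source=source)))
    except (json.JSONDecodeError, TypeError):
        return {}


# Simple type-guided value generators; a real generator would be far richer.
GENERATORS = {
    "int": lambda: random.randint(-100, 100),
    "float": lambda: random.uniform(-100.0, 100.0),
    "str": lambda: random.choice(["", "abc", "hello world"]),
    "bool": lambda: random.choice([True, False]),
    "list": lambda: [random.randint(0, 9) for _ in range(random.randint(0, 5))],
    "dict": lambda: {"key": random.randint(0, 9)},
}


def sample_arguments(param_types: Dict[str, str], params: list) -> Dict[str, object]:
    """Draw one candidate argument per parameter, using the inferred type when
    available and falling back to a random type otherwise (the untyped baseline)."""
    args = {}
    for name in params:
        kind = param_types.get(name)
        if kind not in GENERATORS:
            kind = random.choice(list(GENERATORS))
        args[name] = GENERATORS[kind]()
    return args


if __name__ == "__main__":
    # Stubbed LLM so the sketch runs without network access.
    stub_llm = lambda prompt: '{"price": "float", "quantity": "int"}'
    source = "def total(price, quantity):\n    return price * quantity\n"
    types = infer_parameter_types(source, stub_llm)
    print(sample_arguments(types, ["price", "quantity"]))
```

Constraining the sampled value to the inferred type is what shrinks the search space; the random-type fallback preserves the untyped baseline behaviour whenever the model's answer cannot be parsed.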
Data availability
No datasets were generated or analysed during the current study.
Funding
This work was supported by the Pioneer and Leading Goose R&D Program of Zhejiang, China, under Grant 2022C03132.
Author information
Contributions
Ruofan Yang: development and design of the methodology; writing of the manuscript. Xianghua Xu: oversight and leadership responsibility for research activity planning and execution. Ran Wang: formulation of overarching research goals and aims; critical review.
Ethics declarations
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, R., Xu, X. & Wang, R. LLM-enhanced evolutionary test generation for untyped languages. Autom Softw Eng 32, 20 (2025). https://doi.org/10.1007/s10515-025-00496-7