Abstract
Dynamic programming languages such as Python are widely used for their flexibility and support for rapid development. However, the absence of explicit parameter type declarations poses significant challenges for automated test case generation: parameter types are often assigned at random, which enlarges the search space and reduces testing efficiency. Current evolutionary algorithms, which rely heavily on random mutations, struggle to handle specific data types and frequently fall into local optima, making it difficult to generate high-quality test cases. Moreover, the resulting test suites often contain errors, preventing their immediate use in real-world applications. To address these challenges, this paper proposes using large language models to enhance test case generation for dynamic programming languages. Our method involves three key steps: analyzing parameter types to narrow the search space, introducing meaningful data during mutations to increase test case relevance, and using large language models to automatically repair errors in the generated test suites. Experimental results show a 16% improvement in test coverage, faster evolutionary cycles, and an increase in the number of directly executable test suites. These findings highlight the potential of large language models to improve both the efficiency and reliability of test case generation for dynamic programming languages.
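To make the first step concrete, the sketch below shows how an LLM's type guesses could seed input generation instead of sampling parameter types at random. This is a minimal illustration under stated assumptions, not the paper's implementation: the `llm` callable, the prompt wording, and the `GENERATORS` table are hypothetical stand-ins, and the `__main__` block uses a stubbed LLM response so the example runs offline.

```python
import json
import random
from typing import Callable, Dict

# Hypothetical LLM interface: any callable mapping a prompt string to a
# completion string. A real system might wrap GPT-4 or DeepSeek-Coder here.
LLM = Callable[[str], str]

PROMPT = (
    "Infer the most likely type of each parameter of the following Python "
    "function. Answer with a JSON object mapping parameter names to one of: "
    "int, float, str, bool, list, dict.\n\n{source}"
)


def infer_parameter_types(source: str, llm: LLM) -> Dict[str, str]:
    """Ask the LLM for parameter types; fall back to an empty mapping on unusable output."""
    try:
        return json.loads(llm(PROMPT.format(source=source)))
    except (json.JSONDecodeError, TypeError):
        return {}


# Simple type-guided value generators; a real generator would be far richer.
GENERATORS = {
    "int": lambda: random.randint(-100, 100),
    "float": lambda: random.uniform(-100.0, 100.0),
    "str": lambda: random.choice(["", "abc", "hello world"]),
    "bool": lambda: random.choice([True, False]),
    "list": lambda: [random.randint(0, 9) for _ in range(random.randint(0, 5))],
    "dict": lambda: {"key": random.randint(0, 9)},
}


def sample_arguments(param_types: Dict[str, str], params: list) -> Dict[str, object]:
    """Draw one candidate argument per parameter, using the inferred type when
    available and falling back to a random type otherwise (the untyped baseline)."""
    args = {}
    for name in params:
        kind = param_types.get(name)
        if kind not in GENERATORS:
            kind = random.choice(list(GENERATORS))
        args[name] = GENERATORS[kind]()
    return args


if __name__ == "__main__":
    # Stubbed LLM so the sketch runs without network access.
    stub_llm = lambda prompt: '{"price": "float", "quantity": "int"}'
    source = "def total(price, quantity):\n    return price * quantity\n"
    types = infer_parameter_types(source, stub_llm)
    print(sample_arguments(types, ["price", "quantity"]))
```

Constraining the sampled value to the inferred type is what shrinks the search space; the random-type fallback preserves the untyped baseline behaviour whenever the model's answer cannot be parsed.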
Data availability
No datasets were generated or analysed during the current study.
Funding
This work was supported by the Pioneer and Leading Goose R&D Program of Zhejiang, China, under Grant 2022C03132.
Author information
Contributions
Ruofan Yang: development and design of the methodology; writing of the manuscript. Xianghua Xu: oversight and leadership responsibility for research activity planning and execution. Ran Wang: formulation of overarching research goals and aims; critical review.
Ethics declarations
Competing interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Yang, R., Xu, X. & Wang, R. LLM-enhanced evolutionary test generation for untyped languages. Autom Softw Eng 32, 20 (2025). https://doi.org/10.1007/s10515-025-00496-7