LLM-enhanced evolutionary test generation for untyped languages

Automated Software Engineering

Abstract

Dynamic programming languages such as Python are widely used for their flexibility and support for rapid development. However, the absence of explicit parameter type declarations poses significant challenges for automated test case generation: generators must assign parameter types at random, which enlarges the search space and reduces testing efficiency. Existing evolutionary algorithms, which rely heavily on random mutation, struggle to produce values of specific data types and frequently fall into local optima, making it difficult to generate high-quality test cases. Moreover, the resulting test suites often contain errors that prevent immediate use in real-world applications. To address these challenges, this paper proposes using large language models to enhance test case generation for dynamic programming languages. Our method involves three key steps: analyzing parameter types to narrow the search space, introducing meaningful data during mutation to increase test case relevance, and using large language models to automatically repair errors in the generated test suites. Experimental results demonstrate a 16% improvement in test coverage, faster evolutionary cycles, and an increase in the number of executable test suites. These findings highlight the potential of large language models to improve both the efficiency and reliability of test case generation for dynamic programming languages.
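As a rough illustration of the first two steps, the sketch below shows how LLM-inferred parameter types could replace random type assignment and constrain mutation to type-appropriate value pools. This is a minimal sketch, not the authors' implementation: the `query_llm` hook, the prompt wording, and the `GENERATORS` pools are assumptions introduced for illustration.

```python
import ast
import inspect
import random

def build_type_prompt(source: str) -> str:
    """Prompt asking the model to map each parameter to a Python type."""
    return (
        "Infer the most likely type of each parameter of this Python "
        "function. Answer with one 'name: type' pair per line.\n\n" + source
    )

def parse_type_reply(reply: str) -> dict:
    """Parse 'name: type' lines from the model's reply."""
    types = {}
    for line in reply.splitlines():
        name, sep, tname = line.partition(":")
        if sep:
            types[name.strip()] = tname.strip()
    return types

# Type-aware value pools: mutation samples from the inferred type's
# generator instead of the full space of Python values.
GENERATORS = {
    "int": lambda: random.randint(-100, 100),
    "float": lambda: random.uniform(-100.0, 100.0),
    "str": lambda: random.choice(["", "abc", "hello world"]),
    "bool": lambda: random.choice([True, False]),
    "list": lambda: [random.randint(0, 9) for _ in range(random.randint(0, 5))],
}

def typed_inputs(func, query_llm):
    """Generate one argument dict for `func`, guided by LLM-inferred types."""
    source = inspect.getsource(func)
    inferred = parse_type_reply(query_llm(build_type_prompt(source)))
    params = [a.arg for a in ast.parse(source).body[0].args.args]
    # Fall back to int when the model gives no usable type for a parameter.
    return {p: GENERATORS.get(inferred.get(p), GENERATORS["int"])() for p in params}
```

With `query_llm` wired to a real model client, an evolutionary loop would call `typed_inputs` wherever it previously sampled parameter types at random; the third step, repairing broken generated tests, would analogously feed a failing test and its traceback back to the model and re-execute the suggested fix.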



Data availability

No datasets were generated or analysed during the current study.


Funding

This work was supported by the Pioneer and Leading Goose R&D Program of Zhejiang, China, under Grant 2022C03132.

Author information


Contributions

Ruofan Yang: development and design of the methodology; wrote the manuscript. Xianghua Xu: oversight and leadership of research activity planning and execution. Ran Wang: formulation of the overarching research goals and aims; critical review.

Corresponding author

Correspondence to Ran Wang.

Ethics declarations

Competing interests

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Yang, R., Xu, X. & Wang, R. LLM-enhanced evolutionary test generation for untyped languages. Autom Softw Eng 32, 20 (2025). https://doi.org/10.1007/s10515-025-00496-7

