CODEJUDGE: Evaluating Code Generation With Large Language Models

[Figure: Overview of the two CODEJUDGE evaluation methods. "Analyze then Summarize" first asks the LLM to analyze the semantic correctness of the code snippet and provide reasons (step a1), then to summarize that analysis into a binary verdict (step a2), e.g., "Final Decision: 0 (incorrect code)". "Taxonomy-Guided Fault Localization" asks the LLM to identify inconsistencies in the code snippet and classify their severities according to a catalog of code inconsistencies (step b1), outputting a JSON list such as [{"inconsistency": "Logic error because the sorted function returns a list, not a string", "severity": "major"}], which is aggregated into a score, e.g., "Final Decision: 0.5 / 1.0 (1 Major inconsistency identified)".]
snippets generated by different LLMs in different programming languages and also referred to the literature on code error analysis (Hristova et al., 2003; Weimer and Necula, 2004; Chabbi and Mellor-Crummey, 2012; Chen et al., 2018). We summarized eight distinct types of frequent code inconsistencies and categorized them into four severity levels based on their impact on the semantic correctness of the generated code, as shown in Table 2.

• Negligible. Code inconsistencies in this category have little impact on semantic correctness. Specifically, we do not consider missing import statements or missing exception handling semantically wrong, since the code generated in such cases still achieves the functionality in the task description while not being perfect.

• Small. We classify input handling issues as small due to their limited impact on the core functionality of the code snippet and the straightforward nature of their correction.

• Major. Logical errors directly affect the semantic correctness of the code, leading to incorrect outputs. These errors are considered to have a major impact on semantic correctness.

• Fatal. Code generation models sometimes hallucinate function calls or variable names that are never defined in the code. Furthermore, in many cases, they generate code with incomplete expressions and statements. These issues often lead to runtime exceptions or compilation errors that crash the program execution. Thus, we consider them fatal errors.

Given the potential inconsistencies identified by the LLM, we aggregate them via a weighted sum based on their severity levels to compute the final score. To better compare with other methods, we normalize the score to the range of [0, 1]. More details can be found in Appendix B.
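As a concrete illustration of the aggregation input (a minimal sketch of our own, not the released implementation), the fault-localization prompt asks the LLM to emit a JSON list of inconsistencies (see the prompt in Table 19), which can be parsed and tallied by severity before applying the weighted sum defined in Appendix B:

import json
from collections import Counter

# Hypothetical LLM output following the JSON format requested by the prompt in Table 19.
llm_output = '[{"inconsistency": "Logic error: sorted() returns a list, not a string", "severity": "major"}]'

def count_severities(raw: str) -> Counter:
    """Parse the JSON inconsistency list and count findings per severity level."""
    inconsistencies = json.loads(raw)
    # Normalize capitalization so "major" and "Major" are tallied together.
    return Counter(item["severity"].capitalize() for item in inconsistencies)

print(count_severities(llm_output))  # Counter({'Major': 1})

The severity counts are then converted into a penalty-based score in [0, 1], as described in Appendix B.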
Severity     Type             Description
Negligible   Alternative      Using different methods or algorithms to solve the problem.
             Dependency       Missing import statements.
             Error Handling   No exception handling for unexpected events, e.g., invalid inputs.
             Efficiency       Including inefficient or unnecessary statements.
Small        Input Handling   Failing to handle edge cases.
Major        Logic Error      Containing logical errors.
Fatal        Declaration      Using undefined functions or variables.
             Incompletion     Incomplete code.

Table 2: The catalog of code inconsistencies.

4 Experiments

4.1 Datasets

As described in Section 2.1, CODEJUDGE makes two kinds of code assessment. Following Zhuo (2024), we use HumanEval-X (Zheng et al., 2023b) for the binary assessment task and CoNaLa (Yin et al., 2018) for the code deviation assessment task. The rationale is that HumanEval-X includes test cases for each task, so we can easily obtain binary correctness labels based on test results. By contrast, CoNaLa does not have test cases. Instead, it provides human-annotated code usefulness scores in the range of 0 to 4, which were obtained via crowdsourcing.

Since HumanEval-X only includes introductory coding tasks, we also include two more challenging datasets, APPS (Hendrycks et al., 2021) and BigCodeBench (Zhuo et al., 2024). Compared with HumanEval-X, APPS includes competition-level coding problems and BigCodeBench includes more complex instructions and more API calls. For instance, Codex achieves a pass@1 rate of 28.81% on HumanEval but only 0.92% on APPS (Le et al., 2022; Chen et al., 2021). Similarly, GPT-4o achieves a pass@1 rate of 90.2% on HumanEval but only 56.1% on BigCodeBench (Anthropic, 2024; Zhuo et al., 2024). Since both APPS and BigCodeBench provide only test cases, we use them for the binary assessment task.

We apply our Analyze then Summarize method to the binary assessment task datasets (HumanEval-X, APPS, and BigCodeBench) and our Taxonomy-Guided Fault Localization method to the code deviation assessment task dataset (CoNaLa). We briefly describe each dataset below.

HumanEval-X (Zheng et al., 2023b) is a multi-language version of HumanEval, a popular code generation benchmark originally from the Codex paper (Chen et al., 2021). It contains 164 introductory coding tasks, each of which includes a natural language task description, some test cases, and a human-created reference. We evaluate CODEJUDGE on five programming languages in the initial release of HumanEval-X: Python, C++, Java, JavaScript, and Go.²

² We tried other languages such as Rust in the latest version of HumanEval-X but encountered issues when running their test cases. Thus, we chose not to evaluate those languages.

CoNaLa (Yin et al., 2018) is a Python code generation benchmark with 472 tasks collected from StackOverflow. We use the human annotations collected by Evtikhiev et al. (2023) as ground truth for the code deviation assessment. For each task, Evtikhiev et al. (2023) asked experienced software developers to grade the usefulness of the code snippets generated by five different models on a scale of 0 to 4.

APPS (Hendrycks et al., 2021) is a Python code generation benchmark. It includes introductory-level, interview-level, and competition-level coding tasks collected from code competition websites. We randomly sampled 100 competition-level tasks to form a challenging dataset.

BigCodeBench (Zhuo et al., 2024) is a recently released code generation dataset in Python with 1,140 practical and challenging programming tasks. This dataset challenges the ability of LLMs to invoke multiple function calls from various libraries.

4.2 Evaluation Metrics

Statistical Correlations. Recent studies have used statistical correlation metrics, such as Kendall's τ coefficient (τ) and Spearman's rank correlation coefficient (rs), as a robust way to measure the correlation between code evaluation results and the ground truth (Zhou et al., 2023; Zhuo, 2024). Thus, we adopt these correlation metrics to evaluate CODEJUDGE on both kinds of assessment tasks.

Accuracy. For the binary classification task, we also measure the correctness prediction accuracy of CODEJUDGE as a more intuitive metric.
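As a minimal, self-contained illustration of these metrics (our own sketch; the score lists are hypothetical), both correlation coefficients can be computed with the SciPy implementations noted in Appendix C:

from scipy.stats import kendalltau, spearmanr

# Hypothetical evaluator scores and ground-truth correctness labels.
predicted_scores = [1.0, 0.5, 0.0, 1.0, 0.95, 0.0]
ground_truth = [1, 1, 0, 1, 1, 0]

tau, tau_p = kendalltau(predicted_scores, ground_truth)   # Kendall's tau
rs, rs_p = spearmanr(predicted_scores, ground_truth)      # Spearman's rank correlation
print(f"tau = {tau:.3f} (p = {tau_p:.3f}), r_s = {rs:.3f} (p = {rs_p:.3f})")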
HumanEval-X CoNaLa APPS BigCodeBench
Method τ rs τ rs τ rs τ rs
EXISTING METHODS
BLEU 0.306 0.373 0.437 0.485 0.035 0.042 0.072 0.089
ROUGE-L 0.318 0.388 0.450 0.501 0.035 0.043 0.117 0.143
METEOR 0.357 0.436 0.412 0.463 0.085 0.104 0.247 0.302
chrF 0.328 0.400 0.457 0.514 0.036 0.044 0.167 0.205
CodeBLEU 0.362 0.442 0.292 0.332 0.135 0.164 0.173 0.212
RUBY 0.309 0.376 0.332 0.373 0.092 0.113 0.119 0.146
CodeBERTScoreF1 0.339 0.414 0.499 0.558 -0.003 -0.003 0.048 0.059
CodeBERTScoreF3 0.372 0.454 0.485 0.542 0.008 0.010 0.133 0.163
VANILLA 0.570 0.570 0.357 0.386 0.103 0.103 0.251 0.251
VANILLA w/o REF 0.390 0.390 0.465 0.499 -0.058 -0.058 0.131 0.131
ICE-Score 0.475 0.492 0.253 0.271 0.224 0.224 0.321 0.330
ICE-Score w/o REF 0.349 0.363 0.462 0.491 0.140 0.143 0.117 0.118
CODEJUDGE 0.612 0.612 0.457 0.478 0.354 0.354 0.392 0.392
CODEJUDGE w/o REF 0.502 0.502 0.538 0.562 0.153 0.153 0.317 0.317
Table 3: The results on four datasets when using GPT-3.5-Turbo-1106 as the evaluator. The best results are in bold.
Due to space limitations, tables with standard deviation and results of each language are shown in Appendix E.
Table 5: The Kendall-Tau (τ) and Spearman (rs) correlations between CODEJUDGE using GPT-3.5-Turbo and semantic correctness in HumanEval-X.
Method                    CoNaLa τ   CoNaLa rs   HE-X (τ=rs)   APPS (τ=rs)   BCB (τ=rs)
CodeLlama-Instruct-34B
CODEJUDGE                 0.559      0.581       0.492         0.210         0.334
w/o REF                   0.582      0.607       0.412         0.062         0.097
Llama-3-8B-Instruct
CODEJUDGE                 0.523      0.547       0.480         0.161         0.383
w/o REF                   0.576      0.602       0.388         0.072         0.258
Llama-3-70B-Instruct
CODEJUDGE                 0.572      0.598       0.681         0.391         0.440
w/o REF                   0.628      0.654       0.619         0.153         0.298

Table 6: The results of CODEJUDGE using three open-source models (more results in Appendix E).

ods in the CoNaLa dataset. One plausible explanation is that for the code deviation task, the LLM evaluator focuses too much on the differences between the generated code and the reference code rather than on high-level semantic similarities. This implies future opportunities to calibrate LLMs for code assessment.

Results on More Challenging Benchmarks. Table 3 shows the correlations on APPS and BigCodeBench. While CODEJUDGE still achieves the best performance, we observe that all evaluation methods suffer a significant drop in performance on APPS and BigCodeBench. The vanilla LLM-based method, which performs comparably to ICE-Score on the other benchmarks, experienced the biggest degradation. Such a performance drop is not surprising, since these competition-level tasks are challenging even for human developers, let alone LLMs. Without running and debugging the code, many developers may also struggle to assess the code. Table 3 shows that LLM-based methods consistently perform better when reference code is provided to aid code evaluation. We also observe that for BigCodeBench, LLM-based methods with reference show a significantly smaller performance degradation compared to methods without reference. This implies that providing reference code is more helpful for challenging tasks than for relatively simple tasks.

Accuracy of Binary Evaluation. Table 4 shows the accuracy of different methods in the binary assessment task. Since ICE-Score produces a rating in the range of 0 to 4, we treat a rating of 4 as fully correct and all other ratings as not correct in the binary assessment task. CODEJUDGE outperforms both ICE-Score and VANILLA regardless of whether the reference code is provided or not.

Evaluating without References. We want to highlight that even when reference code is not provided to CODEJUDGE but is provided to all other methods, CODEJUDGE still outperforms all existing methods in most settings. This implies the power of performing “slow thinking” in code evaluation.

Impact of Programming Languages. Table 5 shows the statistical correlation results of CODEJUDGE on different programming languages. When reference code is provided, CODEJUDGE consistently achieves a coefficient above 0.5, which indicates a strong correlation with the ground truth. CODEJUDGE performs much better on Python and Java compared with the other three languages.

Generalizability to Open-Source LLMs. Table 6 shows the correlation results of CODEJUDGE when substituting GPT-3.5 with three open-source models. Compared with GPT-3.5, CODEJUDGE achieves better correlations when using Llama-3-70B. Besides, even when using a relatively small model (Llama-3-8B-Instruct), CODEJUDGE still achieves better or comparable performance to all existing methods, including ICE-Score, which uses GPT-3.5 as the evaluator. This demonstrates that CODEJUDGE can be easily applied to other LLMs and obtain evaluations with a reasonable correlation to semantic correctness.

Prompt Design. We further test CODEJUDGE with few-shot learning, Chain-of-Thought (CoT) prompting, and the combination of the two. However, CODEJUDGE with these prompting methods does not outperform the original one. Our analysis of the drawbacks of employing CoT and few-shot learning can be found in Appendix A.

Failure Case Analysis. To understand the limitations of CODEJUDGE, we manually inspected
600 failure cases, especially those from APPS. We identified three failure patterns:

• Wrong Analysis of Code Logic (52.83%). The most common pattern is that the LLM evaluator fails to infer the code logic correctly. For example, the LLM may claim that the code implements a certain piece of logic when it does not.

• Wrong Identification of Task Requirements (26.42%). For some complex tasks, the LLM evaluator struggles to correctly identify all requirements from the task description.

• Requirements of Error Handling (20.75%). We find that the LLM evaluator tends to report many error-handling errors (e.g., not handling invalid inputs) in generated code, even though such handling is unnecessary in many cases. This makes CODEJUDGE over-conservative when evaluating some partially correct code.

5 Conclusion

We propose CODEJUDGE, a framework that leverages LLMs to evaluate code generation without the need for test cases. We demonstrate that by guiding LLMs to perform slow thinking, CODEJUDGE outperforms all existing code evaluation methods. This suggests a promising future direction of replacing human evaluators with LLM evaluators. This is also beneficial for alignment methods that rely on human evaluation as feedback. Finally, we release our code and dataset at https://github.com/VichyTong/CodeJudge.

References

AI Anthropic. 2024. Claude 3.5 sonnet model card addendum. Claude-3.5 Model Card.

Shraddha Barke, Michael B James, and Nadia Polikarpova. 2023. Grounded copilot: How programmers interact with code-generating models. Proceedings of the ACM on Programming Languages, 7(OOPSLA1):85–111.

Milind Chabbi and John Mellor-Crummey. 2012. Deadspy: a tool to pinpoint program inefficiencies. In Proceedings of the Tenth International Symposium on Code Generation and Optimization, CGO '12, page 124–134, New York, NY, USA. Association for Computing Machinery.

Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. 2024. Chateval: Towards better LLM-based evaluators through multi-agent debate. In The Twelfth International Conference on Learning Representations.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin,
Ting Liu, Daxin Jiang, and Ming Zhou. 2020. CodeBERT: A pre-trained model for programming and natural languages. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1536–1547, Online. Association for Computational Linguistics.

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2023. GPTScore: Evaluate as you desire. arXiv preprint arXiv:2302.04166.

Jinlan Fu, See-Kiong Ng, Zhengbao Jiang, and Pengfei Liu. 2024. GPTScore: Evaluate as you desire. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6556–6576, Mexico City, Mexico. Association for Computational Linguistics.

Dan Hendrycks, Steven Basart, Saurav Kadavath, Mantas Mazeika, Akul Arora, Ethan Guo, Collin Burns, Samir Puranik, Horace He, Dawn Song, and Jacob Steinhardt. 2021. Measuring coding challenge competence with APPS. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).

Maria Hristova, Ananya Misra, Megan Rutter, and Rebecca Mercuri. 2003. Identifying and correcting java programming errors for introductory computer science students. In Proceedings of the 34th SIGCSE Technical Symposium on Computer Science Education, SIGCSE '03, page 153–156, New York, NY, USA. Association for Computing Machinery.

Daniel Kahneman. 2011. Thinking, fast and slow. Macmillan.

Sumith Kulal, Panupong Pasupat, Kartik Chandra, Mina Lee, Oded Padon, Alex Aiken, and Percy S Liang. 2019. Spoc: Search-based pseudocode to code. Advances in Neural Information Processing Systems, 32.

Hung Le, Yue Wang, Akhilesh Deepak Gotmare, Silvio Savarese, and Steven Chu Hong Hoi. 2022. Coderl: Mastering code generation through pretrained models and deep reinforcement learning. Advances in Neural Information Processing Systems, 35:21314–21328.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.

Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. 2023a. Is your code generated by chatGPT really correct? rigorous evaluation of large language models for code generation. In Thirty-seventh Conference on Neural Information Processing Systems.

Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. 2023b. G-eval: NLG evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 2511–2522, Singapore. Association for Computational Linguistics.

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, Philadelphia, Pennsylvania, USA. Association for Computational Linguistics.

Maja Popović. 2015. chrF: character n-gram F-score for automatic MT evaluation. In Proceedings of the Tenth Workshop on Statistical Machine Translation, pages 392–395, Lisbon, Portugal. Association for Computational Linguistics.

Shuo Ren, Daya Guo, Shuai Lu, Long Zhou, Shujie Liu, Duyu Tang, Neel Sundaresan, Ming Zhou, Ambrosio Blanco, and Shuai Ma. 2020. Codebleu: a method for automatic evaluation of code synthesis. arXiv preprint arXiv:2009.10297.

Baptiste Rozière, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, Artyom Kozhevnikov, Ivan Evtimov, Joanna Bitton, Manish Bhatt, Cristian Canton Ferrer, Aaron Grattafiori, Wenhan Xiong, Alexandre Défossez, Jade Copet, Faisal Azhar, Hugo Touvron, Louis Martin, Nicolas Usunier, Thomas Scialom, and Gabriel Synnaeve. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950.

Bo Shen, Jiaxin Zhang, Taihong Chen, Daoguang Zan, Bing Geng, An Fu, Muhan Zeng, Ailun Yu, Jichuan Ji, Jingyang Zhao, Yuenan Guo, and Qianxiang Wang. 2023. Pangu-coder2: Boosting large language models for code with ranking feedback. arXiv preprint arXiv:2307.14936.

N. Tran, H. Tran, S. Nguyen, H. Nguyen, and T. Nguyen. 2019. Does bleu score work for code migration? In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC), pages 165–176, Los Alamitos, CA, USA. IEEE Computer Society.

Priyan Vaithilingam, Tianyi Zhang, and Elena L Glassman. 2022. Expectation vs. experience: Evaluating the usability of code generation tools powered by large language models. In CHI Conference on Human Factors in Computing Systems Extended Abstracts, pages 1–7.

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35:24824–24837.

Westley Weimer and George C. Necula. 2004. Finding and preventing run-time error handling mistakes. In Proceedings of the 19th Annual ACM SIGPLAN
Conference on Object-Oriented Programming, Systems, Languages, and Applications, OOPSLA '04, page 419–431, New York, NY, USA. Association for Computing Machinery.

Pengcheng Yin, Bowen Deng, Edgar Chen, Bogdan Vasilescu, and Graham Neubig. 2018. Learning to mine aligned code and natural language pairs from stack overflow. In Proceedings of the 15th International Conference on Mining Software Repositories, MSR '18, page 476–486, New York, NY, USA. Association for Computing Machinery.

Weizhe Yuan, Graham Neubig, and Pengfei Liu. 2021. Bartscore: Evaluating generated text as text generation. In Advances in Neural Information Processing Systems, volume 34, pages 27263–27277. Curran Associates, Inc.

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. 2020. Bertscore: Evaluating text generation with bert. In International Conference on Learning Representations.

Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. 2023a. Judging LLM-as-a-judge with MT-bench and chatbot arena. In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track.

Qinkai Zheng, Xiao Xia, Xu Zou, Yuxiao Dong, Shan Wang, Yufei Xue, Lei Shen, Zihan Wang, Andi Wang, Yang Li, Teng Su, Zhilin Yang, and Jie Tang. 2023b. Codegeex: A pre-trained model for code generation with multilingual benchmarking on humaneval-x. In Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD '23, page 5673–5684, New York, NY, USA. Association for Computing Machinery.

Ming Zhong, Yang Liu, Da Yin, Yuning Mao, Yizhu Jiao, Pengfei Liu, Chenguang Zhu, Heng Ji, and Jiawei Han. 2022. Towards a unified multi-dimensional evaluator for text generation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 2023–2038.

Shuyan Zhou, Uri Alon, Sumit Agarwal, and Graham Neubig. 2023. CodeBERTScore: Evaluating code generation with pretrained models of code. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13921–13937, Singapore. Association for Computational Linguistics.

Terry Yue Zhuo. 2024. ICE-score: Instructing large language models to evaluate code. In Findings of the Association for Computational Linguistics: EACL 2024, pages 2232–2242, St. Julian's, Malta. Association for Computational Linguistics.

Terry Yue Zhuo, Minh Chien Vu, Jenny Chim, Han Hu, Wenhao Yu, Ratnadira Widyasari, Imam Nur Bani Yusuf, Haolan Zhan, Junda He, Indraneil Paul, et al. 2024. Bigcodebench: Benchmarking code generation with diverse function calls and complex instructions. arXiv preprint arXiv:2406.15877.

A Prompt Design

Method                                   Acc.
CODEJUDGE                                81.63
CODEJUDGE w/o REF                        74.43
CoT                                      77.65
CoT w/o REF                              68.56
CoT + Few-shot                           78.22
CoT + Few-shot w/o REF                   67.61
CODEJUDGE + CoT                          78.60
CODEJUDGE + CoT w/o REF                  72.16
CODEJUDGE + Few-shot                     77.84
CODEJUDGE + Few-shot w/o REF             69.89
CODEJUDGE + CoT + Few-shot               77.27
CODEJUDGE + CoT + Few-shot w/o REF       69.51

Table 7: Average accuracy (%) across five programming languages in HumanEval-X using different prompts.

We use Chain-of-Thought (CoT) (Wei et al., 2022) and few-shot learning methods (three examples) to construct different prompts and test them using GPT-3.5-Turbo in HumanEval-X. Table 7 shows the results, helping us understand the effects of CoT and few-shot learning. These two methods slightly reduce the correlation of the results. We observe the following drawbacks of the CoT method and few-shot prompting:

• Incorrect judgments: The CoT method can cause the model to make incorrect logical judgments when thinking step by step. If the initial determination is incorrect, the model will generate incorrect evidence to support that determination. For instance, Table 8 shows that when the LLM determines that the code snippet does not cover all the required functionalities in the first sentence, it then makes the incorrect judgment that the model does not account for frequency.

• Misleading fixed code: When using the CoT method, the model may ignore the system prompt and create its own process: first, find the errors, then fix the code. This can be misleading during summarization because the model might evaluate the correctness of the fixed code instead of the original, incorrect version. For example, Table 9 shows that
Task Description
You are given a non-empty list of positive integers. Return the greatest integer that is greater
than zero, and has a frequency greater than or equal to the value of the integer itself. The
frequency of an integer is the number of times it appears in the list. If no such a value exist,
return -1.
Code Snippet
def search(lst):
    # your code here
    d = {}
    for i in lst:
        if i in d:
            d[i] += 1
        else:
            d[i] = 1
    for i in sorted(d, reverse=True):
        if d[i] >= i:
            return i
    return -1
Evaluation
The code snippet provided does not cover all the required functionalities. It does not correctly
implement the condition for finding the greatest integer that is greater than zero and has a
frequency greater than or equal to the value of the integer itself.
The code snippet iterates through the dictionary keys in reverse order and returns the first key
that satisfies the condition, but it does not consider the frequency of the integer.
The reference solution correctly calculates the frequency of each integer and then iterates through
the frequencies to find the greatest integer that satisfies the condition.
Table 8: As the initial determination is that the code “does not cover all the required functionalities”, the model
provides incorrect evidence—the model does not consider the frequency—in support of the determination.
the LLM generates a conclusion based on the fixed code, and this conclusion further misleads the summarization step.

• Limitation with fewer few-shot examples: Similar to the findings for GPTScore (Fu et al., 2024), we observe a performance degradation when the number of few-shot examples is less than four. One possible reason is that LLMs become constrained by the given examples, leading to a preference for particular correctness and reduced generalization ability.

B Postprocessing Steps

For the binary evaluation task, the LLM generates a free-form response to the summarization task. We use a regex parser that assigns a score of 1 to answers that mention "Yes" and 0 to answers that mention "No". While this postprocessing method may sound simple, it turns out to work very well. In our experiments, we did not observe any cases where the LLMs generated ambiguous answers that could not be handled by this method.

For the more complex code deviation estimation task, we set the initial correctness score to 100 and deduct a penalty score for each inconsistency identified by CODEJUDGE. We experimented with different penalty score settings on a small validation set, which includes 32 tasks from HumanEval (20%). We found that setting the penalty scores of Small, Major, and Fatal inconsistencies to 5, 50, and 100 points achieves the best correlation. We calculate the final score with the following equations:

    S = Num_Small × 5
    M = Num_Major × 50
    F = Num_Fatal × 100                                  (1)
    Penalty = max(−100, −(S + M + F))
    Score = 1 + Penalty / 100
C Experiment Details

We use the official versions of the HumanEval-X, CoNaLa, APPS, and BigCodeBench datasets. To generate code snippets for HumanEval-X and APPS, we adopt the code from MultiPL-E.³ For BigCodeBench, we use the official pre-generated code samples from LLMs.⁴

³ https://github.com/nuprl/MultiPL-E
⁴ https://github.com/bigcode-project/bigcodebench

For token-based
Task Description
You are given a word. Your task is to find the closest vowel that stands between two consonants from
the right side of the word (case sensitive). Vowels in the beginning and ending doesn’t count.
Return empty string if you didn’t find any vowel met the above condition. You may assume that the
given string contains English letter only.
Code Snippet
def get_closest_vowel(word):
    pass
Evaluation
The given problem requires finding the closest vowel that stands between two consonants from the
right side of the word. The reference solution iterates through the word from the right side and
checks if the current character is a vowel and the characters on its right and left are consonants.
If this condition is met, the function returns the current vowel. If no such vowel is found, it
returns an empty string.
The provided code snippet is incomplete and does not contain any implementation. To address the
problem statement, the code should iterate through the word from the right side and implement the
logic to find the closest vowel between two consonants.
def get_closest_vowel(word):
    if len(word) < 3:
        return ""
    vowels = {"a", "e", "i", "o", "u", "A", "E", 'O', 'U', 'I'}
    for i in range(len(word)-2, 0, -1):
        if word[i] in vowels:
            if (word[i+1] not in vowels) and (word[i-1] not in vowels):
                return word[i]
    return ""
This code snippet covers the required functionalities as per the problem statement and is
consistent with the reference solution.
Table 9: When using the CoT method, the model may design an evaluation plan by itself and generate the fixed
code. This fixed code may mislead the summarization step.
Table 10: Average single execution times (in seconds) over 100 runs.
methods, we adopt implementations from JetBrains.⁵ For CodeBERTScore and ICE-Score, we use their implementations available on GitHub.⁶,⁷ To evaluate CODEJUDGE, we use the implementations of correlation metrics from https://scipy.org/.

⁵ https://github.com/JetBrains-Research/codegen-metrics
⁶ https://github.com/neulab/code-bert-score
⁷ https://github.com/terryyz/ice-score

D Latency Discussion

Table 10 shows the average execution times of CODEJUDGE using four different models over 100 runs. The results for GPT-3.5-Turbo-1106 were obtained via the official API. For CodeLlama-Instruct-34B and Llama-3-Instruct-8B, a single A100-80GB GPU was utilized. The execution times of Llama-3-Instruct-70B were measured using two A100-80GB GPUs to load the model. The generation time of CODEJUDGE is less than 20 seconds, which is reasonable for code evaluation compared to manual human annotation.

E Full Results

We report the numbers with standard deviations on the HumanEval-X dataset in Table 11. We also report the accuracy of the binary classification task on the HumanEval-X dataset in Table 12. The full results on CoNaLa, APPS, and BigCodeBench are in Table 13, Table 14, and Table 15, respectively.
F Prompts
We present the full prompts of VANILLA in Tables 16 and 17. Full prompts of CODEJUDGE are shown in Tables 18 and 19.
Java C++ Python JavaScript Go
Metric τ rs τ rs τ rs τ rs τ rs
E XISTING M ETHODS
BLEU 0.230±0.00 0.280±0.00 0.306±0.00 0.373±0.00 0.446±0.00 0.541±0.00 0.288±0.00 0.352±0.00 0.261±0.00 0.318±0.00
ROUGE-L 0.249±0.00 0.304±0.00 0.305±0.00 0.372±0.00 0.450±0.00 0.546±0.00 0.329±0.00 0.401±0.00 0.260±0.00 0.317±0.00
METEOR 0.299±0.00 0.365±0.00 0.338±0.00 0.412±0.00 0.487±0.00 0.594±0.00 0.379±0.00 0.462±0.00 0.284±0.00 0.346±0.00
chrF 0.267±0.00 0.326±0.00 0.314±0.00 0.383±0.00 0.448±0.00 0.545±0.00 0.368±0.00 0.449±0.00 0.242±0.00 0.295±0.00
CodeBLEU 0.318±0.00 0.388±0.00 0.341±0.00 0.417±0.00 0.501±0.00 0.611±0.00 0.384±0.00 0.468±0.00 0.268±0.00 0.326±0.00
RUBY 0.260±0.00 0.318±0.00 0.284±0.00 0.346±0.00 0.425±0.00 0.516±0.00 0.329±0.00 0.401±0.00 0.245±0.00 0.299±0.00
CodeBERTScoreF1 0.282±0.00 0.344±0.00 0.334±0.00 0.408±0.00 0.453±0.00 0.553±0.00 0.318±0.00 0.388±0.00 0.308±0.00 0.376±0.00
CodeBERTScoreF3 0.303±0.00 0.370±0.00 0.375±0.00 0.458±0.00 0.495±0.00 0.604±0.00 0.363±0.00 0.443±0.00 0.324±0.00 0.396±0.00
CodeLlama-Instruct-34B
VANILLA 0.300±0.01 0.300±0.01 0.345±0.01 0.345±0.01 0.489±0.03 0.489±0.03 0.316±0.03 0.316±0.03 0.314±0.01 0.314±0.01
VANILLA w/o REF 0.297±0.01 0.297±0.01 0.373±0.02 0.373±0.02 0.541±0.03 0.541±0.03 0.277±0.03 0.277±0.03 0.348±0.05 0.348±0.05
ICE-Score 0.418±0.06 0.449±0.06 0.309±0.04 0.331±0.05 0.440±0.04 0.477±0.04 0.308±0.06 0.332±0.07 0.297±0.06 0.320±0.07
ICE-Score w/o REF 0.263±0.04 0.279±0.04 0.282±0.04 0.303±0.04 0.471±0.05 0.503±0.05 0.382±0.04 0.404±0.04 0.338±0.05 0.362±0.05
CODEJUDGE A.S. 0.515±0.04 0.515±0.04 0.464±0.03 0.464±0.03 0.625±0.00 0.625±0.00 0.503±0.03 0.503±0.03 0.354±0.02 0.354±0.02
CODEJUDGE A.S. w/o REF 0.355±0.06 0.355±0.06 0.408±0.02 0.408±0.02 0.561±0.02 0.561±0.02 0.338±0.04 0.338±0.04 0.396±0.02 0.396±0.02
Meta-Llama-3-8B-Instruct
VANILLA 0.342±0.01 0.342±0.01 0.216±0.01 0.216±0.01 0.409±0.02 0.409±0.02 0.265±0.03 0.265±0.03 0.192±0.01 0.192±0.01
VANILLA w/o REF 0.282±0.01 0.282±0.01 0.159±0.04 0.159±0.04 0.446±0.02 0.446±0.02 0.356±0.01 0.356±0.01 0.331±0.01 0.331±0.01
ICE-Score 0.389±0.01 0.400±0.01 0.242±0.01 0.248±0.01 0.440±0.00 0.455±0.00 0.296±0.01 0.303±0.01 0.269±0.00 0.281±0.00
ICE-Score w/o REF 0.290±0.02 0.296±0.02 0.306±0.04 0.316±0.04 0.481±0.03 0.499±0.03 0.275±0.00 0.283±0.00 0.287±0.02 0.299±0.02
CODEJUDGE 0.523±0.01 0.523±0.01 0.387±0.02 0.387±0.02 0.637±0.04 0.637±0.04 0.446±0.03 0.446±0.03 0.407±0.03 0.407±0.03
CODEJUDGE w/o REF 0.411±0.06 0.411±0.06 0.309±0.04 0.309±0.04 0.586±0.03 0.586±0.03 0.339±0.06 0.339±0.06 0.295±0.01 0.295±0.01
Meta-Llama-3-70B-Instruct
VANILLA 0.607±0.01 0.607±0.01 0.624±0.01 0.624±0.01 0.685±0.00 0.685±0.00 0.554±0.00 0.554±0.00 0.529±0.00 0.529±0.00
VANILLA w/o REF 0.554±0.01 0.554±0.01 0.541±0.01 0.541±0.01 0.651±0.01 0.651±0.01 0.553±0.01 0.553±0.01 0.571±0.01 0.571±0.01
ICE-Score 0.552±0.00 0.576±0.00 0.516±0.01 0.543±0.01 0.626±0.01 0.654±0.01 0.471±0.00 0.490±0.00 0.389±0.01 0.411±0.01
ICE-Score w/o REF 0.509±0.01 0.531±0.00 0.507±0.00 0.533±0.00 0.591±0.00 0.620±0.00 0.425±0.00 0.444±0.00 0.478±0.00 0.508±0.00
CODEJUDGE 0.640±0.02 0.640±0.02 0.700±0.03 0.700±0.03 0.803±0.02 0.803±0.02 0.675±0.01 0.675±0.01 0.589±0.02 0.589±0.02
CODEJUDGE w/o REF 0.583±0.02 0.583±0.02 0.611±0.01 0.611±0.01 0.698±0.02 0.698±0.02 0.617±0.04 0.617±0.04 0.587±0.05 0.587±0.05
GPT-3.5-Turbo-1106
VANILLA 0.615 0.615 0.482 0.482 0.675 0.675 0.550 0.550 0.528 0.528
VANILLA w/o REF 0.343 0.343 0.328 0.328 0.537 0.537 0.345 0.345 0.398 0.398
ICE-Score 0.499 0.510 0.436 0.455 0.514 0.537 0.524 0.542 0.402 0.415
ICE-Score w/o REF 0.275 0.278 0.410 0.429 0.485 0.513 0.253 0.258 0.324 0.337
CODEJUDGE 0.638 0.638 0.580 0.580 0.707 0.707 0.591 0.591 0.543 0.543
CODEJUDGE w/o REF 0.508 0.508 0.474 0.474 0.629 0.629 0.453 0.453 0.446 0.446
Table 11: The Kendall-Tau (τ ) and Spearman (rs ) correlations of each method with semantic correctness on
HumanEval-X in multiple languages. “w/ REF” indicates that this method contains the reference code in the prompt.
The correlation coefficients are reported across three runs using open-source models, along with the standard
deviation.
Method Java C++ Python JavaScript Go
CodeLlama-Instruct-34B
VANILLA 57.07±0.01 61.11±0.01 72.22±0.01 58.33±0.01 62.37±0.00
VANILLA w/o REF 59.09±0.00 65.91±0.01 73.48±0.02 58.84±0.02 57.32±0.02
CODEJUDGE 75.00±0.02 75.25±0.01 80.56±0.00 73.74±0.01 75.51±0.01
CODEJUDGE w/o REF 67.93±0.03 73.48±0.01 78.03±0.01 66.16±0.02 71.97±0.01
Meta-Llama-3-8B-Instruct
VANILLA 57.83±0.00 47.47±0.01 67.42±0.01 55.05±0.01 47.73±0.01
VANILLA w/o REF 58.84±0.01 47.47±0.02 70.20±0.01 62.12±0.01 60.10±0.00
CODEJUDGE 74.49±0.01 65.91±0.01 81.57±0.02 69.44±0.02 69.70±0.02
CODEJUDGE w/o REF 70.20±0.03 66.16±0.02 78.79±0.01 65.15±0.02 66.16±0.01
Meta-Llama-3-70B-Instruct
VANILLA 78.28±0.00 79.29±0.00 83.33±0.00 74.24±0.00 73.48±0.00
VANILLA w/o REF 75.51±0.00 75.51±0.00 82.07±0.01 75.76±0.01 78.03±0.01
CODEJUDGE 81.31±0.01 84.60±0.02 90.15±0.01 81.82±0.01 80.30±0.01
CODEJUDGE w/o REF 79.55±0.01 81.82±0.01 84.60±0.01 80.56±0.02 81.82±0.02
GPT-3.5-Turbo-1106
VANILLA 77.27 71.21 82.07 72.98 76.26
VANILLA w/o REF 60.86 67.17 74.24 61.36 62.12
CODEJUDGE 81.57 78.28 85.35 78.28 79.29
CODEJUDGE w/o REF 73.48 72.22 80.81 68.43 70.71
Table 12: Accuracies (%) across five programming languages in the binary classification task of HumanEval-X
dataset. The accuracies are reported across three runs using open-source models, along with the standard deviation.
Table 13 (CoNaLa)
Method τ rs
EXISTING METHODS
BLEU 0.437 0.485
ROUGE-L 0.450 0.501
METEOR 0.412 0.463
chrF 0.457 0.514
CodeBLEU 0.292 0.332
RUBY 0.332 0.373
CodeBERTScoreF1 0.499 0.558
CodeBERTScoreF3 0.485 0.542
CodeLlama-Instruct-34B
VANILLA 0.317 0.344
VANILLA w/o REF 0.448 0.486
ICE-Score 0.397 0.425
ICE-Score w/o REF 0.534 0.572

Table 14 (APPS)
Method τ rs
EXISTING METHODS
BLEU 0.035±0.00 0.042±0.00
ROUGE-L 0.035±0.00 0.043±0.00
METEOR 0.085±0.00 0.104±0.00
chrF 0.036±0.00 0.044±0.00
CodeBLEU 0.135±0.00 0.164±0.00
RUBY 0.092±0.00 0.113±0.00
CodeBERTScoreF1 -0.003±0.00 -0.003±0.00
CodeBERTScoreF3 0.008±0.00 0.010±0.00
CodeLlama-Instruct-34B
VANILLA 0.005±0.05 0.005±0.05
VANILLA w/o REF 0.080±0.00 0.080±0.00
ICE-Score 0.174±0.06 0.185±0.06
ICE-Score w/o REF -0.032±0.02 -0.034±0.02

Table 15 (BigCodeBench)
Method τ rs
E XISTING M ETHODS
BLEU 0.072±0.00 0.089±0.00
ROUGE-L 0.117±0.00 0.143±0.00
METEOR 0.247±0.00 0.302±0.00
chrF 0.167±0.00 0.205±0.00
CodeBLEU 0.173±0.00 0.212±0.00
RUBY 0.119±0.00 0.146±0.00
CodeBERTScoreF1 0.048±0.00 0.059±0.00
CodeBERTScoreF3 0.133±0.00 0.163±0.00
CodeLlama-Instruct-34B
VANILLA 0.104±0.02 0.104±0.02
VANILLA w/o REF 0.047±0.02 0.047±0.02
ICE-Score -0.023±0.01 -0.023±0.01
ICE-Score w/o REF 0.025±0.02 0.025±0.02
CODEJUDGE A.S. 0.334±0.03 0.334±0.03
CODEJUDGE A.S. w/o REF 0.097±0.02 0.097±0.02
Meta-Llama-3-8B-Instruct
VANILLA 0.070±0.01 0.070±0.01
VANILLA w/o REF 0.064±0.00 0.064±0.00
ICE-Score 0.107±0.02 0.108±0.02
ICE-Score w/o REF 0.007±0.02 0.007±0.02
CODEJUDGE 0.383±0.01 0.383±0.01
CODEJUDGE w/o REF 0.258±0.02 0.258±0.02
Meta-Llama-3-70B-Instruct
VANILLA 0.316±0.00 0.316±0.00
VANILLA w/o REF 0.225±0.00 0.225±0.00
ICE-Score 0.297±0.00 0.307±0.00
ICE-Score w/o REF 0.164±0.00 0.166±0.00
CODEJUDGE 0.440±0.01 0.440±0.01
CODEJUDGE w/o REF 0.298±0.01 0.298±0.01
GPT-3.5-Turbo-1106
VANILLA 0.251 0.251
VANILLA w/o REF 0.131 0.131
ICE-Score 0.321 0.330
ICE-Score w/o REF 0.117 0.118
CODEJUDGE 0.392 0.392
CODEJUDGE w/o REF 0.317 0.317
Determine the correctness of the code snippet. Output Yes or No.
Table 16: Full prompt of VANILLA baseline for binary assessment task. Blue text is an example of model output.
Brown text is the problem, reference, and code we provide to LLMs.
Determine the helpfulness of the code snippet. Output a score from 0 to 4 where 0 means the code
snippet is not helpful at all and 4 means the code snippet is very helpful.
Helpfulness (0-4): 4
Table 17: Full prompt of VANILLA baseline for code deviation assessment task. Blue text is an example of model
output. Brown text is the problem, reference, and code we provide to LLMs.
Analysis Subtask
You will be provided with a problem statement and a code snippet that supposedly addresses the
problem in {LANGUAGE}.
Your task is to check if the code snippet covers the required functionalities. Do not provide a
corrected version.
Evaluation Steps:
1. Read the problem statement carefully and identify the required functionalities of the
implementation. You can refer to the example to understand the problem better.
2. Read the code snippet and analyze its logic. Check if the code snippet covers all the required
functionalities of the problem.
3. Finally, conclude your evaluation.
Table 18: Full prompt of ANALYZE THEN SUMMARIZE method. Blue text is an example of model output. Brown text is the problem and code we provide to LLMs.
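To show how the two subtasks are chained, here is a schematic driver (our own sketch: call_llm is a placeholder for whichever chat-completion API is used, the analysis prompt is abridged from Table 18, and the summarization prompt is paraphrased from the overview figure rather than copied from the released prompts):

# Schematic two-step Analyze-then-Summarize driver (illustrative only).
ANALYSIS_PROMPT = """You will be provided with a problem statement and a code snippet that supposedly addresses the problem in {language}.
Your task is to check if the code snippet covers the required functionalities. Do not provide a corrected version.
Problem: {problem}
Code Snippet:
{code}"""

# Paraphrase of the summarization prompt shown in the paper's overview figure.
SUMMARY_PROMPT = """You will be provided with an analysis result of a code snippet.
If the analysis believes that the code snippet is correct, output: "Yes". Otherwise, output: "No".
Analysis: {analysis}"""

def call_llm(prompt: str) -> str:
    """Placeholder for an actual chat-completion call (e.g., GPT-3.5-Turbo or Llama-3)."""
    raise NotImplementedError

def analyze_then_summarize(problem: str, code: str, language: str = "Python") -> int:
    analysis = call_llm(ANALYSIS_PROMPT.format(language=language, problem=problem, code=code))
    verdict = call_llm(SUMMARY_PROMPT.format(analysis=analysis))
    return 1 if "yes" in verdict.lower() else 0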
You will be provided with a problem statement, a code snippet that supposedly addresses the problem,
and a catalog of code inconsistencies.
Evaluation Steps:
1. Read the problem statement carefully to identify the functionalities required for the
implementation.
2. Read the code snippet and compare it to the problem statement. Check if the code snippet covers
the required functionalities.
3. Output your answer in a JSON format list.
a) If the code snippet is correct, output: [{"inconsistency": "None", "severity": "Negligible"}].
b) If the code snippet is incorrect, output the identified inconsistencies and their severity
according to the catalog of code inconsistencies. For example: [{"inconsistency": "<inconsistency1>",
"severity": "<severity1>"}, {"inconsistency": "<inconsistency2>", "severity": "<severity2>"}, ...]
Problem: {PROBLEM}
Table 19: Full prompt of FAULT LOCALIZATION method. Blue text is an example of model output. Brown text is the problem and code we provide to LLMs.