CodeEval-Pro
HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation

[🌐 Website][🏆 Leaderboard][📜 Paper][🤗 HF Datasets][🐦 Twitter]

Repo for "HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation Task"


Figure 1: Statistics of model performance.

🔥 News

  • [2024/12/31] Paper, Code, Benchmarks all released.

💡 Introduction

We present HumanEval Pro and MBPP Pro, two expanded versions of the traditional HumanEval and MBPP benchmarks, to evaluate LLMs on the self-invoking code generation task. Self-invoking code generation is a new task designed to evaluate the progressive reasoning and problem-solving capabilities of LLMs. In this task, models are presented with a base problem and a related, more complex problem: they must solve the base problem and then utilize its solution to address the more complex one.


Figure 2: Evaluation pipeline of HumanEval Pro and MBPP Pro.


Figure 3: An example of HumanEval Pro and MBPP Pro.
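
For a concrete sense of the task, here is a minimal hypothetical pair in the spirit of Figure 3 (both problems and all function names below are illustrative and not taken from the benchmarks): the base problem asks for a simple utility function, and the self-invoking problem requires reusing that solution.

# Base problem (illustrative): return the sum of all even numbers in a list.
def sum_of_evens(numbers):
    return sum(n for n in numbers if n % 2 == 0)

# Self-invoking problem (illustrative): given several lists, return the one
# whose sum of even numbers is largest, reusing the base solution above.
def max_even_sum_sublist(lists_of_numbers):
    return max(lists_of_numbers, key=sum_of_evens)

assert sum_of_evens([1, 2, 3, 4]) == 6
assert max_even_sum_sublist([[1, 2], [3, 4, 6], [5]]) == [3, 4, 6]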

🚀 Quick Start

⚙️ Setup

We recommend using Conda to manage your environment. Run the following commands to set up your environment:

conda create -n evalpro python=3.10
conda activate evalpro
pip install -e .

⚖️ Evaluation

To evaluate your own models on HumanEval Pro and MBPP Pro, we recommend using vLLM to generate solutions with the following command:

set -ex
OUTPUT_DIR=result
MODEL=QwQ-32B-preview
MODEL_PATH=Qwen/QwQ-32B-Preview
TASK_TYPE=humaneval_pro # or mbpp_pro
mkdir -p ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/outputs/

python -m eval.inference \
  --model_name_or_path $MODEL_PATH \
  --save_path ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/outputs/results.jsonl \
  --dataset $TASK_TYPE \
  --is_use_vllm true \
  --do_sample false \
  --temperature 0.0 \
  --top_p 1.0 \
  --max_new_tokens 4096 \
  --n_problems_per_batch 28 \
  --n_samples_per_problem 1 \
  --n_batches 1 

The choices of TASK_TYPE include:

["humaneval", "mbpp", "humaneval_pro", "mbpp_pro", "humaneval_pro_cot", "mbpp_pro_cot", "humaneval_pro_1shot", "mbpp_pro_1shot"]

To run API models, use

set -ex
WORK_DIR=evalpro/result
MODEL=GPT-4o 

TASK_TYPE=humaneval_pro      
mkdir -p ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/

python -m run_api \
  --model_name gpt-4o-2024-08-06 \
  --dataset $TASK_TYPE \
  --save_path ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/results.jsonl \
  --api_key  apikey \
  --base_url url 

Then you will get a results.jsonl file at the --save_path.
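
As a quick sanity check before evaluation, you can inspect the generated file. The sketch below only assumes that each line of the JSONL file is one JSON record; the exact field names depend on the inference script, and the path shown is illustrative.

import json

# Peek at the generated results.jsonl (adjust the path to your --save_path).
path = "result/GPT-4o/humaneval_pro/outputs/results.jsonl"
with open(path) as f:
    records = [json.loads(line) for line in f if line.strip()]

print(f"{len(records)} generated samples")
print("fields of the first record:", sorted(records[0].keys()))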

To obtain your pass@k score, you can run eval/harness.py with the following command:

set -ex
OUTPUT_DIR=result
MODEL=Qwen2.5Coder-32B-base
DATASET=humaneval_pro
TASK_TYPE=humaneval_pro

python -m eval.santize \
    --model_name $MODEL \
    --source_path ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/outputs/

python -m eval.harness \
    --model_name $MODEL \
    --dataset_path dataset/refined_${DATASET}.json \
    --source_path ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/outputs/ \
    --save_path ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE} \
    --run_code

You will get a result_of_pass_k.json file in your --save_path. First, check that the pass@k of the ground truth equals 1.0. You will then obtain two results: pass_k_of_output and pass_k_of_output_santized. pass_k_of_output_santized is the score computed after sanitizing the original model output. We use the higher score as the final result.
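
If you want to collect the final number programmatically, a minimal sketch is shown below. It assumes result_of_pass_k.json is a flat JSON object containing numeric pass_k_of_output and pass_k_of_output_santized entries; the actual file layout may differ, so adapt the key access accordingly.

import json

# Load the pass@k summary and keep the higher of the two scores
# (raw output vs. sanitized output), as described above.
with open("result/Qwen2.5Coder-32B-base/humaneval_pro/result_of_pass_k.json") as f:
    scores = json.load(f)

final = max(scores["pass_k_of_output"], scores["pass_k_of_output_santized"])
print("final pass@k:", final)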

If you use --run_code, you will get the execution error statistics in ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/log.

The choices of DATASET include:

["humaneval_pro", "mbpp_pro"]

To evaluate your model on BigCodeBench-Lite Pro, run the following command:

export CUDA_VISIBLE_DEVICES=0
set -ex
WORK_DIR=result
MODEL=Qwen/QwQ-32B-Preview
MODEL_PATH=Qwen/QwQ-32B-Preview
TASK_TYPE=bigcodebench_lite_pro
mkdir -p ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/

python -m eval.inference \
  --model_name_or_path $MODEL_PATH \
  --save_path ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/results.jsonl \
  --dataset $TASK_TYPE \
  --is_use_vllm true \
  --do_sample false \
  --temperature 0.0 \
  --top_p 1.0 \
  --max_new_tokens 4096 \
  --n_problems_per_batch 28 \
  --n_samples_per_problem 1 \
  --n_batches 1 

rm -rf ${WORK_DIR}/${MODEL}/${TASK_TYPE}/log
python -m eval.santize \
    --model_name $MODEL \
    --source_path ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/

python -m eval.harness \
    --model_name $MODEL \
    --task $TASK_TYPE \
    --dataset_path evalpro/dataset/refined_${TASK_TYPE}.json \
    --source_path ${WORK_DIR}/${MODEL}/${TASK_TYPE}/outputs/ \
    --save_path ${WORK_DIR}/${MODEL}/${TASK_TYPE} 
    # --run_code

To obtain results on the original HumanEval and MBPP, we recommend using the evalplus library with the following command:

OUTPUT_DIR=result
MODEL=QwQ-32B-preview
TASK_TYPE=humaneval
evalplus.evaluate --dataset $TASK_TYPE --samples ${OUTPUT_DIR}/${MODEL}/${TASK_TYPE}/outputs/results.jsonl

📖 License

This code repository is licensed under the MIT License.

☕️ Citation

If you find this repository helpful, please consider citing our paper:

@article{yu2024humaneval,
  title={HumanEval Pro and MBPP Pro: Evaluating Large Language Models on Self-invoking Code Generation},
  author={Yu, Zhaojian and Zhao, Yilun and Cohan, Arman and Zhang, Xiao-Ping},
  journal={arXiv preprint arXiv:2412.21199},
  year={2024}
}

Acknowledgement

Our evaluation code is inspired by Magicoder and WaveCoder. We thank EvalPlus for providing the evaluation of the original HumanEval and MBPP.
