feat: optimized sanitization and evaluation · bigcode-project/selfcodealign@14f4294 · GitHub

Commit 14f4294

feat: optimized sanitization and evaluation
1 parent c9315a0 commit 14f4294

File tree

9 files changed: +1283 −36 lines changed

README.md

Lines changed: 5 additions & 7 deletions

@@ -239,13 +239,11 @@ Also, the container connection may be lost during execution. In this case, you c
 <summary>Data sanitization and selection</summary>

 ```shell
-python src/star_align/sanitize_data.py \
-    --data_files /path/to/filtered.jsonl* \
-    --output_file /path/to/final_dataset.jsonl \
-    --parse_raw_response True \
-    --passing_only True \
-    --exact_match_dedup True \
-    --data_augmentation False
+# Uncomment to do decontamination
+# export MBPP_PATH="/path/to/mbpp.jsonl"
+# export DS1000_PATH="/path/to/ds1000_data"
+# export DECONTAMINATION=1
+./sanitize.sh /path/to/exec-filtered.jsonl /path/to/sanitized.jsonl
 ```

 </details>
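For orientation, the decontamination gated by these environment variables amounts to dropping generated samples that textually overlap with the evaluation benchmarks pointed to by `MBPP_PATH` and `DS1000_PATH`. The snippet below is only a minimal sketch of that idea under assumed field names (`response` for the sample output, `code` for the MBPP reference solution); it is not the logic of `sanitize.sh` or `sanitize_data.py`.

```python
# Minimal sketch of exact-match decontamination; assumed logic, not the repo's implementation.
import json
import os


def load_jsonl(path: str) -> list[dict]:
    with open(path) as f:
        return [json.loads(line) for line in f]


def decontaminate(samples: list[dict], benchmark_texts: set[str]) -> list[dict]:
    # Keep only samples whose response is not an exact copy of a benchmark solution.
    return [s for s in samples if s["response"] not in benchmark_texts]


if os.getenv("DECONTAMINATION") == "1":
    benchmark_texts: set[str] = set()
    if mbpp_path := os.getenv("MBPP_PATH"):
        # Field name assumed: "code" holds the reference solution in the MBPP jsonl release.
        benchmark_texts |= {ex["code"] for ex in load_jsonl(mbpp_path)}
    samples = load_jsonl("exec-filtered.jsonl")  # hypothetical input file
    kept = decontaminate(samples, benchmark_texts)
    print(f"kept {len(kept)}/{len(samples)} samples after decontamination")
```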

evaluation/README.md

Lines changed: 25 additions & 5 deletions

@@ -1,9 +1,29 @@
-# Reproduce the experiments
+# Evaluation

 > [!IMPORTANT]
 > **General requirements**
 >
 > Before you start, make sure you have cloned the repository and you are in the **root directory of the project**. Make sure you installed the required packages with `pip install -e .`. Different package versions may impact the reproducibility of the results.
+
+## Running EvalPlus with vLLM
+
+We implemented batched inference in [evaluation/text2code_vllm.py](text2code_vllm.py) using [vLLM](https://docs.vllm.ai/en/latest/). This speeds up the evaluation significantly: **a greedy decoding run can be finished within 20 seconds**. Here is the command:
+
+```bash
+MODEL=/path/to/your/model
+DATASET=humaneval # or mbpp
+SAVE_PATH=evalplus-$(basename $MODEL)-$DATASET.jsonl
+CUDA_VISIBLE_DEVICES=0 python -m evaluation.text2code_vllm \
+  --model_key $MODEL \
+  --dataset $DATASET \
+  --save_path $SAVE_PATH
+
+python -m evalplus.evaluate --dataset $DATASET --samples $SAVE_PATH
+```
+
+## Reproduce StarCoder2-Instruct
+
+> [!NOTE]
 >
 > We obtained the results with the following hardware and environment:
 >
@@ -12,13 +32,13 @@
 >
 > In case you face issues, we provide the raw outputs we generated in the [evalplus_results](evalplus_results) directory.

-## Reproduce HumanEval(+) and MBPP(+)
+### Reproduce HumanEval(+) and MBPP(+)

 We pack multiple problems into one batch to speed up the inference. A different batch size may lead to slightly better or worse results due to floating-point round-off resulting from the underlying [cuBLAS](https://docs.nvidia.com/cuda/cublas/index.html) optimizations.

 Make sure you set `CUDA_VISIBLE_DEVICES` to the GPU you want to use and have `cd`ed to the root directory of the repo. We assume you use device 0 in the following commands.

-### HumanEval(+)
+#### HumanEval(+)

 ```bash
 MODEL_KEY=bigcode/starcoder2-15b-instruct-v0.1
@@ -46,7 +66,7 @@ python -m evalplus.evaluate --dataset $DATASET --samples $SAVE_PATH
 # pass@1: 0.634
 ```

-### MBPP(+)
+#### MBPP(+)

 ```bash
 MODEL_KEY=bigcode/starcoder2-15b-instruct-v0.1
@@ -71,4 +91,4 @@ python -m evalplus.evaluate --dataset $DATASET --samples $SAVE_PATH
 # pass@1: 0.642
 # mbpp+ (base + extra tests)
 # pass@1: 0.526
-```
+```
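The 20-second figure above comes from vLLM batching all benchmark problems into a single generation call. As a rough standalone illustration (not the evaluation script itself; the model path and prompt are placeholders), the pattern adopted in `evaluation/text2code_vllm.py` boils down to:

```python
# Minimal sketch of the vLLM batched-generation pattern used by text2code_vllm.py.
# The prompt and model path are placeholders; sampling values mirror the new defaults.
from vllm import LLM, SamplingParams

engine = LLM("/path/to/your/model")
sampling_params = SamplingParams(
    n=1,                 # one sample per problem
    temperature=0.0,     # greedy decoding
    top_p=1.0,
    top_k=-1,
    max_tokens=1024,
    stop="\n```\n",      # cut generation at the end of the fenced code block
)
prompts = ["### Instruction\nWrite a function that adds two numbers.\n### Response\n```python\n"]
outputs = engine.generate(prompts, sampling_params)
print(outputs[0].outputs[0].text)
```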

evaluation/text2code_vllm.py

Lines changed: 10 additions & 21 deletions

@@ -1,16 +1,12 @@
-import itertools
 import os
-from dataclasses import dataclass
+from dataclasses import dataclass, field
 from pathlib import Path
 from typing import Literal, TypedDict, cast
-from functools import partial
 from evalplus.data import get_human_eval_plus, get_mbpp_plus, write_jsonl

 # from evoeval.data import get_evo_eval
-from tqdm.auto import tqdm
 from transformers import HfArgumentParser

-from star_align.llm_wrapper import GenerationConfig, get_model_context
 from star_align.prompt_template import SC2_INSTRUCT_PROMPT
 from star_align.utils import infer_prompt_template

@@ -60,7 +56,7 @@ def map_mbpp_problem(p: dict) -> Text2CodeProblem:
 ```python
 {assertion}
 ```"""
-    prefix = "" if PROMPT_TEMPLATE.endswith("\n") else "\n"
+    prefix = ""
     response_prefix = f"""{prefix}```python"""
     return Text2CodeProblem(
         id=str(id), instruction=instruction, response_prefix=response_prefix
@@ -85,7 +81,6 @@ def map_humaneval_problem(p: dict) -> Text2CodeProblem:
 ```python
 {prompt}
 ```"""
-    prefix = "" if PROMPT_TEMPLATE.endswith("\n") else "\n"
     prefix = ""
     prefix_template = os.getenv("PREFIX_TEMPLATE", "```python")
     response_prefix = prefix + (
@@ -115,21 +110,15 @@ class Args:
         "EvoEval_concise",
     ]
     save_path: str
-
-    n_batches: int
-    n_problems_per_batch: int
-    n_samples_per_problem: int
-    # prompted: bool
-
+    n_samples_per_problem: int = field(default=1)
+    max_new_tokens: int = field(default=1024)
+    top_p: float = field(default=1.0)
+    temperature: float = field(default=0.0)
     model_name_or_path: str | None = None


 def main():
-    parser = HfArgumentParser((Args, GenerationConfig))
-    args, generation_config = cast(
-        tuple[Args, GenerationConfig],
-        parser.parse_args_into_dataclasses(),
-    )
+    args = cast(Args, HfArgumentParser(Args).parse_args_into_dataclasses()[0])
     raw_problem_fn, map_problem_fn = (
         (get_humaneval_raw_problems, map_humaneval_problem)
         if args.dataset == "humaneval"
@@ -141,10 +130,10 @@ def main():
     engine = LLM(args.model_name_or_path or args.model_key)
     sampling_params = SamplingParams(
         n=args.n_samples_per_problem,
-        temperature=generation_config.temperature,
-        max_tokens=generation_config.max_new_tokens,
+        temperature=args.temperature,
+        max_tokens=args.max_new_tokens,
         top_k=-1,
-        top_p=generation_config.top_p,
+        top_p=args.top_p,
         stop="\n```\n",
     )
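For readers unfamiliar with `HfArgumentParser`, the new one-line parse replaces the old two-dataclass setup (`Args` plus a shared `GenerationConfig`): generation options now live directly on `Args`. The sketch below shows the pattern in isolation; the fields are trimmed down from the real `Args`, and an explicit argument list is passed only so the example runs without a CLI.

```python
# Sketch of parsing a single dataclass with transformers.HfArgumentParser,
# mirroring the pattern adopted in text2code_vllm.py.
from dataclasses import dataclass, field
from typing import cast

from transformers import HfArgumentParser


@dataclass
class Args:
    model_key: str
    save_path: str
    max_new_tokens: int = field(default=1024)
    temperature: float = field(default=0.0)


# The real script reads sys.argv; here we pass argv explicitly for a self-contained run.
args = cast(
    Args,
    HfArgumentParser(Args).parse_args_into_dataclasses(
        args=["--model_key", "bigcode/starcoder2-15b-instruct-v0.1", "--save_path", "out.jsonl"]
    )[0],
)
print(args.model_key, args.temperature)
```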
