fix: doc, requirements, and data cleaning script · bigcode-project/selfcodealign@2ca7037

Commit 2ca7037

fix: doc, requirements, and data cleaning script
1 parent 9427b74 commit 2ca7037

File tree: README.md · requirements.txt · src/star_align/clean_data.py

3 files changed: +155 -27 lines changed

README.md

Lines changed: 61 additions & 12 deletions

@@ -1,4 +1,4 @@
-# StarCoder2-Instruct
+# StarCoder2-Instruct: Self-Aligned, Transparent, and Fully Permissive
 
 > [!WARNING]
 > This documentation is still WIP.
@@ -7,28 +7,77 @@
 
 We used VLLM's [OpenAI compatible server](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) for data generation. So, before running the following commands, make sure the VLLM server is running, and the associated `openai` environment variables are set.
 
-**Snippet to concept generation:**
+For example, you can start a VLLM server with `docker`:
 
 ```shell
-python src/star_align/self_ossinstruct.py --instruct_mode "S->C" --seed_data_files /path/to/seeds.jsonl --max_new_data 50000 --tag $TAG --temperature 0.7 --seed_code_start_index 0 --model bigcode/starcoder2-15b --num_fewshots 8 --num_batched_requests 128 --num_sample_per_request 1
+docker run --gpus '"device=0"' \
+    -v $HF_HOME:/root/.cache/huggingface \
+    -p 10000:8000 \
+    --ipc=host \
+    vllm/vllm-openai:v0.3.3 \
+    --model bigcode/starcoder2-15b \
+    --tensor-parallel-size 1 --dtype bfloat16
 ```
 
-**Concept to instruction generation:**
+And then set the environment variables as follows:
 
 ```shell
-python src/star_align/self_ossinstruct.py --instruct_mode "C->I" --seed_data_files /path/to/seeds.jsonl --max_new_data 50000 --tag $TAG --temperature 0.7 --seed_code_start_index 0 --model bigcode/starcoder2-15b -–num_fewshots 8 --num_sample_per_request 1 --num_batched_request 128
+export OPENAI_API_KEY="EMPTY"
+export OPENAI_BASE_URL="http://localhost:10000/v1/"
 ```
 
-**Instruction to response + self-validation code generation:**
+### Snippet to concept
 
 ```shell
-python src/star_align/self_ossinstruct.py --instruct_mode "I->R" --seed_data_files path/to/instructions.jsonl --max_new_data 50000 --tag $TAG --seed_code_start_index 0 --model bigcode/starcoder2-15b --num_fewshots 1 --num_batched_request 16 --num_sample_per_request 10 --temperature 0.7
+python src/star_align/self_ossinstruct.py \
+    --instruct_mode "S->C" \
+    --seed_data_files /path/to/seeds.jsonl \
+    --max_new_data 50000 \
+    --tag concept_gen \
+    --temperature 0.7 \
+    --seed_code_start_index 0 \
+    --model bigcode/starcoder2-15b \
+    --num_fewshots 8 \
+    --num_batched_requests 32 \
+    --num_sample_per_request 1
 ```
 
-**Execution filtering:**
+### Concept to instruction
+
+```shell
+python src/star_align/self_ossinstruct.py \
+    --instruct_mode "C->I" \
+    --seed_data_files /path/to/concepts.jsonl \
+    --max_new_data 50000 \
+    --tag instruction_gen \
+    --temperature 0.7 \
+    --seed_code_start_index 0 \
+    --model bigcode/starcoder2-15b \
+    --num_fewshots 8 \
+    --num_sample_per_request 1 \
+    --num_batched_request 32
+```
+
+### Instruction to response w/ self-validation code
+
+```shell
+python src/star_align/self_ossinstruct.py \
+    --instruct_mode "I->R" \
+    --seed_data_files path/to/instructions.jsonl \
+    --max_new_data 50000 \
+    --tag response_gen \
+    --seed_code_start_index 0 \
+    --model bigcode/starcoder2-15b \
+    --num_fewshots 1 \
+    --num_batched_request 8 \
+    --num_sample_per_request 10 \
+    --temperature 0.7
+```
+
+### Execution filter
 
 > [!WARNING]
-> Though we implemented reliability guards, it is highly recommended to run execution in a sandbox environment.
+> Though we implemented reliability guards, it is highly recommended to run execution in a sandbox environment. The command below doesn't provide sandboxing by default.
 
 ```shell
 python src/star_align/execution_filter.py --response_path /path/to/response.jsonl --result_path /path/to/filtered.jsonl
@@ -37,10 +86,10 @@ python src/star_align/execution_filter.py --response_path /path/to/response.jsonl
 # Note that filtered.jsonl may contain multiple passing samples for the same instruction which needs further selection.
 ```
 
-**Data sanitization and selection:**
+### Data sanitization and selection
 
 ```shell
-RAW=1 python src/star_align/tools/sanitize_data.py /path/to/filtered.jsonl /path/to/sanitized.jsonl
+RAW=1 python src/star_align/sanitize_data.py /path/to/filtered.jsonl /path/to/sanitized.jsonl
 python src/star_align/clean_data.py --data_files /path/to/sanitized.jsonl --output_file /path/to/sanitized.jsonl --diversify_func_names
-SMART=1 python src/star_align/tools/sanitize_data.py /path/to/sanitized.jsonl /path/to/sanitized.jsonl
+SMART=1 python src/star_align/sanitize_data.py /path/to/sanitized.jsonl /path/to/sanitized.jsonl
 ```
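
Before launching a long generation run against this setup, a quick connectivity probe can save time. The sketch below is not part of the commit; it only assumes the `openai>=1.3.7` client already pinned in requirements.txt and the `OPENAI_API_KEY`/`OPENAI_BASE_URL` values exported above:

```python
# Minimal sanity check for the local VLLM OpenAI-compatible endpoint.
# Assumes OPENAI_API_KEY and OPENAI_BASE_URL are exported as in the README.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY / OPENAI_BASE_URL from the environment

completion = client.completions.create(
    model="bigcode/starcoder2-15b",  # must match the model the server is running
    prompt="def fibonacci(n):",
    max_tokens=64,
    temperature=0.7,
)
print(completion.choices[0].text)
```

If this returns a completion, the same endpoint is what `self_ossinstruct.py` will use for the S->C, C->I, and I->R stages.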

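On the sandboxing warning for the execution filter: the repository's own reliability guards are not shown in this commit, but the kind of per-process limit the warning alludes to can be sketched as follows (an illustrative assumption, Unix-only, not the project's actual implementation):

```python
# Sketch of a crude per-sample guard: CPU-time and memory caps plus a timeout.
# Illustrative only; a real sandbox (container, jail, VM) is still recommended.
import resource
import subprocess


def run_with_limits(code_path: str, timeout_s: int = 10) -> bool:
    def set_limits() -> None:
        # Cap CPU seconds and address space (1 GiB) for the child process.
        resource.setrlimit(resource.RLIMIT_CPU, (timeout_s, timeout_s))
        resource.setrlimit(resource.RLIMIT_AS, (2**30, 2**30))

    try:
        proc = subprocess.run(
            ["python", code_path],
            preexec_fn=set_limits,
            timeout=timeout_s,
            capture_output=True,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```
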
requirements.txt

Lines changed: 1 addition & 0 deletions

@@ -4,4 +4,5 @@ openai>=1.3.7
 tenacity~=8.2.3
 tiktoken~=0.5.1
 accelerate==0.27.2
+datasets~=2.17.1
 git+https://github.com/evalplus/evalplus.git@25e195e024b614f2671ad9ac5b8fdcd9b95a2b24#egg=evalplus

src/star_align/clean_data.py

Lines changed: 93 additions & 15 deletions

@@ -1,19 +1,97 @@
-from star_align.utils import read_jsonl, write_jsonl
-import sys
-import re
+import ast
+import random
+from dataclasses import dataclass, field
+from pathlib import Path
+from typing import cast
 
-dataset = read_jsonl(sys.argv[1])
+from tqdm.auto import tqdm
+from transformers import HfArgumentParser
 
-def contains_chinese(s):
-    return bool(re.search(r'[\u4e00-\u9fff]', s))
+from star_align.utils import find_code_blocks, read_jsonl, write_jsonl
 
-chosen = []
-rejected = []
-for example in dataset:
-    if "code snippet" in example["instruction"] or contains_chinese(example["instruction"] + example["response"]):
-        rejected.append(example)
-    else:
-        chosen.append(example)
 
-print(f"Removed {len(dataset) - len(chosen)} examples")
-write_jsonl(sys.argv[2], chosen)
+@dataclass(frozen=True)
+class Args:
+    data_files: list[str]
+    output_file: str
+    diversify_func_names: bool = field(default=False)
+
+
+def extract_and_concat_function_names(python_content):
+    """
+    Extracts all function names from a given Python content string and concatenates them into a single string.
+
+    Parameters:
+    - python_content: A string containing the Python code to analyze.
+
+    Returns:
+    - A string containing all function names defined in the content, concatenated.
+    """
+    tree = ast.parse(python_content)
+    function_names = []
+
+    # Define a node visitor that adds the name of each function definition it visits
+    class FunctionDefVisitor(ast.NodeVisitor):
+        def visit_FunctionDef(self, node):
+            function_names.append(node.name)
+            # Process the subtree for this node
+            self.generic_visit(node)
+
+        def visit_AsyncFunctionDef(self, node):
+            function_names.append(node.name)
+            self.generic_visit(node)
+
+    # Create a node visitor and walk through the AST
+    visitor = FunctionDefVisitor()
+    visitor.visit(tree)
+
+    # Concatenate all function names into a single string
+    return " ".join(function_names)
+
+
+def main():
+    args = cast(Args, HfArgumentParser(Args).parse_args_into_dataclasses()[0])
+    raw_data: list[dict] = []
+    for data_file in args.data_files:
+        data = read_jsonl(Path(data_file))
+        # language = data_file.split("-")[1]
+        # assert language in ALL_LANGS, f"Unknown language {language}"
+        # raw_data.extend(dict(lang=language, **d) for d in data)
+        raw_data.extend(data)
+    # common keys for all d in data
+    common_keys = set.intersection(*(set(d.keys()) for d in raw_data))
+    raw_data = [{k: d[k] for k in common_keys} for d in raw_data]
+    print(f"Common keys: {common_keys}")
+    # counter = defaultdict[str, int](int)
+
+    def mk_key(instruction: str) -> str:
+        return "".join(instruction.split())
+
+    random.seed(0)
+    random.shuffle(raw_data)
+
+    seen_keys = set[str]()
+    new_data = list[dict]()
+    for d in tqdm(raw_data):
+        key_i, key_r = mk_key(d["instruction"]), mk_key(d["response"])
+        if key_i in seen_keys or key_r in seen_keys:
+            continue
+        if args.diversify_func_names:
+            code_block = find_code_blocks(d["response"])[0]
+            try:
+                fn_names = extract_and_concat_function_names(code_block)
+            except SyntaxError:
+                continue
+            if fn_names in seen_keys:
+                continue
+            seen_keys.add(fn_names)
+        new_data.append(d)
+        seen_keys.add(key_i)
+        seen_keys.add(key_r)
+
+    print(f"Chose {len(new_data)} out of {len(raw_data)}")
+    write_jsonl(Path(args.output_file), new_data)
+
+
+if __name__ == "__main__":
+    main()
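
To make the `--diversify_func_names` behavior concrete, here is a small illustrative check (not part of the commit; it assumes the package is installed so that `star_align.clean_data` is importable). Two responses whose code defines the same function names collapse onto one dedup key, so only the first survives:

```python
# Illustrative check of the dedup key used by --diversify_func_names.
from star_align.clean_data import extract_and_concat_function_names

snippet_a = """
def parse(line):
    return line.split(",")

def main():
    print(parse("a,b"))
"""

snippet_b = """
def parse(path):
    ...

def main():
    ...
"""

# Same function names in the same order -> same key -> second record dropped.
assert extract_and_concat_function_names(snippet_a) == "parse main"
assert extract_and_concat_function_names(snippet_a) == extract_and_concat_function_names(snippet_b)
```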
