added missing folders · github/CodeSearchNet@f444bf3 · GitHub

This repository was archived by the owner on Apr 11, 2023. It is now read-only.

Commit f444bf3

added missing folders
1 parent e792e1c commit f444bf3

5 files changed: +2369275 -0 lines changed

resources/README.md

Lines changed: 77 additions & 0 deletions
@@ -0,0 +1,77 @@
# This is an empty directory where you will download the training data, using the [/script/setup](/script/setup) script.

After downloading the data, the directory structure will look like this:

```
├── data
│   │
│   ├── {javascript, java, python, ruby, php, go}_licenses.pkl
│   ├── {javascript, java, python, ruby, php, go}_dedupe_definitions_v2.pkl
│   │
│   ├── javascript
│   │   └── final
│   │       └── jsonl
│   │           ├── test
│   │           ├── train
│   │           └── valid
│   ├── java
│   │   └── final
│   │       └── jsonl
│   │           ├── test
│   │           ├── train
│   │           └── valid
│   ├── python
│   │   └── final
│   │       └── jsonl
│   │           ├── test
│   │           ├── train
│   │           └── valid
│   ├── ruby
│   │   └── final
│   │       └── jsonl
│   │           ├── test
│   │           ├── train
│   │           └── valid
│   ├── php
│   │   └── final
│   │       └── jsonl
│   │           ├── test
│   │           ├── train
│   │           └── valid
│   └── go
│       └── final
│           └── jsonl
│               ├── test
│               ├── train
│               └── valid
│
└── saved_models
```
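
As a quick sanity check after running the setup script, the layout above can be verified programmatically. The following is a minimal sketch, not part of the repository: the helper name `missing_split_dirs` and the `resources/data` location are assumptions made for illustration, while the language and split names come from the tree above.

```python
from pathlib import Path

# Language and split names are taken from the directory tree above.
LANGUAGES = ["javascript", "java", "python", "ruby", "php", "go"]
SPLITS = ["test", "train", "valid"]


def missing_split_dirs(data_root):
    """Return every expected <language>/final/jsonl/<split> directory absent under data_root."""
    root = Path(data_root)
    missing = []
    for language in LANGUAGES:
        for split in SPLITS:
            split_dir = root / language / "final" / "jsonl" / split
            if not split_dir.is_dir():
                missing.append(split_dir)
    return missing


if __name__ == "__main__":
    # "resources/data" is an assumed location; point this at wherever the data was downloaded.
    for path in missing_split_dirs("resources/data"):
        print(f"missing: {path}")
```

An empty result means all eighteen language and split directories are present; it says nothing about whether the files inside them are complete.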

## Directory structure

- `{javascript, java, python, ruby, php, go}/final/jsonl/{test,train,valid}`: these directories will contain multi-part [jsonl](http://jsonlines.org/) files with the data partitioned into train, valid, and test sets. The baseline training code uses TensorFlow, which expects data to be stored in this format, and will concatenate and shuffle these files appropriately.
- `{javascript, java, python, ruby, php, go}_dedupe_definitions_v2.pkl`: these files are Python dictionaries that contain a superset of all functions, including those that do not have comments. They are used for model evaluation.
- `{javascript, java, python, ruby, php, go}_licenses.pkl`: these files are Python dictionaries that contain the licenses found in the source code used as the dataset for CodeSearchNet. The key is the repository's `owner/name` and the value is a tuple of `(path, license content)`. For example:
```
In [6]: data['pandas-dev/pandas']
Out[6]:
('pandas-dev/pandas/LICENSE',
'BSD 3-Clause License\n\nCopyright (c) 2008-2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development
Team\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are
permitted provided that the following conditions are met:\n\n* Redistributions of source code must retain the above
copyright notice, this\n list of conditions and the following disclaimer.\n\n* Redistributions in binary form must
reproduce the above copyright notice,\n this list of conditions and the following disclaimer in the documentation\n
and/or other materials provided with the distribution....')
```
- `saved_models`: the default destination where your models will be saved if you do not supply one.
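
To make the file descriptions above concrete, here is a hedged sketch of reading one split's multi-part jsonl files and one licenses pickle with the Python standard library. The function names `iter_jsonl_records` and `load_licenses` are hypothetical, and the assumption that part files end in `.jsonl` or `.jsonl.gz` is not stated in this README. This is not the baseline's TensorFlow input pipeline, just a plain reader for inspecting the data.

```python
import gzip
import json
import pickle
from pathlib import Path


def iter_jsonl_records(split_dir):
    """Yield one dict per non-empty line from every part file in split_dir, in file-name order."""
    for part in sorted(Path(split_dir).iterdir()):
        # Assumption: part files end in .jsonl or .jsonl.gz; anything else is skipped.
        if part.name.endswith(".jsonl.gz"):
            opener = gzip.open
        elif part.name.endswith(".jsonl"):
            opener = open
        else:
            continue
        with opener(part, "rt", encoding="utf-8") as handle:
            for line in handle:
                if line.strip():
                    yield json.loads(line)


def load_licenses(pkl_path):
    """Load a {language}_licenses.pkl file: a dict of 'owner/name' -> (path, license content)."""
    with open(pkl_path, "rb") as handle:
        return pickle.load(handle)
```

For example, `load_licenses(...)["pandas-dev/pandas"]` would return a tuple shaped like the one shown in the list above. Because `pickle.load` can execute arbitrary code, only load `.pkl` files obtained from a source you trust.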

## Data Format

See [docs/DATA_FORMAT.md](docs/DATA_FORMAT.md) for documentation and an example of how the data is stored.
