| 1 | +# This is an empty directory where you will download the training data, using the [/script/setup](/script/setup) script. |
| 2 | + |
| 3 | +After downloading the data, the directory structure will look like this: |
| 4 | + |
| 5 | +``` |
| 6 | +├──data |
| 7 | +| │ |
| 8 | +| ├──`{javascript, java, python, ruby, php, go}_licenses.pkl` |
| 9 | +| ├──`{javascript, java, python, ruby, php, go}_dedupe_definitions_v2.pkl` |
| 10 | +| │ |
| 11 | +| ├── javascript |
| 12 | +| │ └── final |
| 13 | +| │ └── jsonl |
| 14 | +| │ ├── test |
| 15 | +| │ ├── train |
| 16 | +| │ └── valid |
| 17 | +| ├── java |
| 18 | +| │ └── final |
| 19 | +| │ └── jsonl |
| 20 | +| │ ├── test |
| 21 | +| │ ├── train |
| 22 | +| │ └── valid |
| 23 | +| ├── python |
| 24 | +| │ └── final |
| 25 | +| │ └── jsonl |
| 26 | +| │ ├── test |
| 27 | +| │ ├── train |
| 28 | +| │ └── valid |
| 29 | +| ├── ruby |
| 30 | +| │ └── final |
| 31 | +| │ └── jsonl |
| 32 | +| │ ├── test |
| 33 | +| │ ├── train |
| 34 | +| │ └── valid |
| 41 | +| ├── php |
| 42 | +| │ └── final |
| 43 | +| │ └── jsonl |
| 44 | +| │ ├── test |
| 45 | +| │ ├── train |
| 46 | +| │ └── valid |
| 47 | +| └── go |
| 48 | +| └── final |
| 49 | +| └── jsonl |
| 50 | +| ├── test |
| 51 | +| ├── train |
| 52 | +| └── valid |
| 53 | +| |
| 54 | +└── saved_models |
| 55 | +``` |
| 56 | + |
| 57 | +## Directory structure |
| 58 | + |
| 59 | +- `{javascript, java, python, ruby, php, go}/final/jsonl/{test,train,valid}`: these directories will contain multi-part [jsonl](http://jsonlines.org/) files with the data partitioned into train, valid, and test sets. The baseline training code, which uses TensorFlow, expects data to be stored in this format and will concatenate and shuffle these files appropriately. |
| 60 | +- `{javascript, java, python, ruby, php, go}_dedupe_definitions_v2.pkl`: these files are pickled Python dictionaries that contain a superset of all functions, even those that do not have comments. They are used for model evaluation. |
| 61 | +- `{javascript, java, python, ruby, php, go}_licenses.pkl`: these files are pickled Python dictionaries that contain the licenses found in the source code used as the dataset for CodeSearchNet. The key is the repository's owner/name and the value is a tuple of (path, license content). For example: |
| 62 | +``` |
| 63 | +In [6]: data['pandas-dev/pandas'] |
| 64 | +Out[6]: |
| 65 | +('pandas-dev/pandas/LICENSE', |
| 66 | + 'BSD 3-Clause License\n\nCopyright (c) 2008-2012, AQR Capital Management, LLC, Lambda Foundry, Inc. and PyData Development |
| 67 | + Team\nAll rights reserved.\n\nRedistribution and use in source and binary forms, with or without\nmodification, are |
| 68 | + permitted provided that the following conditions are met:\n\n* Redistributions of source code must retain the above |
| 69 | + copyright notice, this\n list of conditions and the following disclaimer.\n\n* Redistributions in binary form must |
| 70 | + reproduce the above copyright notice,\n this list of conditions and the following disclaimer in the documentation\n |
| 71 | + and/or other materials provided with the distribution....') |
| 72 | +``` |
| 73 | +- `saved_models`: the default directory where your models will be saved if you do not supply a destination. |
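
As a rough illustration of the layout described above, the following sketch reads every part file in one split directory and loads one of the `.pkl` dictionaries. The helper names `iter_jsonl_records` and `load_pickle` are hypothetical and are not part of this repository. The sketch also assumes the part files end in `.jsonl` or `.jsonl.gz`; adjust the suffix check if the downloaded files are named differently.

```python
import gzip
import json
import pickle
from pathlib import Path
from typing import Any, Dict, Iterator


def iter_jsonl_records(split_dir: Path) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty line from each part file in split_dir.

    Hypothetical helper: assumes part files end in .jsonl or .jsonl.gz.
    """
    for path in sorted(split_dir.iterdir()):
        if path.name.endswith(".jsonl.gz"):
            opener = gzip.open  # gzip-compressed part file
        elif path.name.endswith(".jsonl"):
            opener = open  # plain-text part file
        else:
            continue  # skip anything that is not a jsonl part
        with opener(path, "rt", encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield json.loads(line)


def load_pickle(path: Path) -> Any:
    """Load a *_licenses.pkl or *_dedupe_definitions_v2.pkl file.

    Hypothetical helper: only unpickle files from a source you trust.
    """
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

For example, once the data has been downloaded, `iter_jsonl_records(Path("data/python/final/jsonl/train"))` would iterate over the Python training records, and `load_pickle(Path("data/python_licenses.pkl"))["pandas-dev/pandas"]` would return the (path, license content) tuple shown above.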
| 74 | + |
| 75 | +## Data Format |
| 76 | + |
| 77 | +See [docs/DATA_FORMAT.md](docs/DATA_FORMAT.md) for documentation and an example of how the data is stored. |