This repository was archived by the owner on Apr 11, 2023. It is now read-only.
Want to jump right into training our baseline model? Head [here](#quickstart).
**If this is your first time reading this, we recommend skipping this section and reading the following sections.** The commands below assume you have [Docker](https://docs.docker.com/get-started/) and [Nvidia-Docker](https://github.com/NVIDIA/nvidia-docker) installed, as well as a GPU that supports [CUDA 9.0](https://developer.nvidia.com/cuda-90-download-archive) or greater. Note: you should only have to run `script/setup` once to download the data.

```
# download data (~3.5GB) from S3; build and run Docker container
# (this will land you inside the Docker container, starting in the /src directory--you can detach from/attach to this container to pause/continue your work)
cd CodeSearchNet/
script/setup # you should only have to run this script once.
# this will drop you into the shell inside a docker container.
script/console
# optional: log in to W&B to see your training metrics, track your experiments, and submit your models to the community benchmark
wandb login
# verify your setup by training a tiny model
python train.py --testrun
# see other command line options, try a full training run, and explore other model variants by extending this baseline training script example
python train.py --help
python train.py

# generate predictions for model evaluation
python predict.py [-r | --wandb_run_id] github/codesearchnet/0123456 # this is the org/project_name/run_id
```
Finally, running `predict.py` produces a CSV file of model prediction scores, and you can submit your run to the [community benchmark](https://app.wandb.ai/github/codesearchnet/benchmark) by following these [instructions](src/docs/BENCHMARK.md).
# Introduction
We hope that CodeSearchNet is a step towards engaging with the broader machine learning and NLP community regarding the relationship between source code and natural language. We describe a specific task here, but we expect and welcome other uses of our dataset.
More context regarding the motivation for this problem is in this [technical report][paper].
## Data
Warning: the scripts provided to reproduce our baseline model take more than 24 hours on an [AWS P3-V100](https://aws.amazon.com/ec2/instance-types/p3/) instance.
Our baseline models ingest a parallel corpus of (`comments`, `code`) pairs and learn to retrieve a code snippet given a natural language query. Specifically, `comments` are top-level function and method comments (e.g. docstrings in Python), and `code` is an entire function or method. Throughout this repo, we use the terms docstring and query interchangeably.
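To make the retrieval setup concrete, here is a minimal, self-contained sketch of the task: rank code snippets by the similarity between a natural language query and each snippet's docstring. This is illustrative only; it substitutes bag-of-words vectors and cosine similarity for the learned neural encoders the baseline models actually use, and the tiny corpus, function names, and snippets are invented for the example.

```python
import math
from collections import Counter

def embed(text):
    # Toy "encoder": a bag-of-words count vector.
    # (The real baselines learn neural encoders for code and queries.)
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(count * b.get(token, 0) for token, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# A tiny parallel corpus of (docstring/comment, code) pairs.
corpus = [
    ("sort a list in ascending order", "def sort_list(xs): return sorted(xs)"),
    ("open a file and read its contents", "def read_file(p): return open(p).read()"),
]

def search(query, corpus):
    # Return the (docstring, code) pair whose docstring best matches the query.
    q = embed(query)
    return max(corpus, key=lambda pair: cosine(q, embed(pair[0])))

best = search("read contents of a file", corpus)
print(best[1])  # → def read_file(p): return open(p).read()
```

A learned model replaces `embed` with trained encoders that map queries and code into a shared vector space, so retrieval works even without token overlap.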