promote quickstart · github/CodeSearchNet@13cc585 · GitHub

This repository was archived by the owner on Apr 11, 2023. It is now read-only.

Commit 13cc585

promote quickstart

1 parent 8d2049d commit 13cc585

File tree

1 file changed (+27, -34 lines)

README.md: 27 additions & 34 deletions
````diff
@@ -10,6 +10,7 @@
 
 <!-- TOC depthFrom:1 depthTo:6 withLinks:1 updateOnSave:1 orderedList:0 -->
 
+- [Quickstart](#quickstart)
 - [Introduction](#introduction)
 - [Project Overview](#project-overview)
 - [Data](#data)
````
````diff
@@ -21,7 +22,6 @@
 - [Schema & Format](#schema-format)
 - [Downloading Data from S3](#downloading-data-from-s3)
 - [Running our Baseline Model](#running-our-baseline-model)
-- [Quickstart](#quickstart)
 - [Model Architecture](#model-architecture)
 - [Training](#training)
 - [References](#references)
````
````diff
@@ -33,9 +33,32 @@
 
 <!-- /TOC -->
 
-# QuickStart: Training Baseline Models
+# Quickstart
 
-Want to jump right into training our baseline model? Head [here](#quickstart).
+**If this is your first time reading this, we recommend skipping this section and reading the following sections first.** The commands below assume you have [Docker](https://docs.docker.com/get-started/) and [Nvidia-Docker](https://github.com/NVIDIA/nvidia-docker) installed, as well as a GPU that supports [CUDA 9.0](https://developer.nvidia.com/cuda-90-download-archive) or greater. Note: you should only have to run `script/setup` once to download the data.
+
+```bash
+# clone this repository
+git clone https://github.com/ml-msr-github/CodeSearchNet.git
+# download data (~3.5GB) from S3; build and run the Docker container
+# (this will land you inside the Docker container, starting in the /src directory; you can detach from/attach to this container to pause/continue your work)
+cd CodeSearchNet/
+script/setup  # you should only have to run this script once
+# this will drop you into a shell inside the Docker container
+script/console
+# optional: log in to W&B to see your training metrics, track your experiments, and submit your models to the community benchmark
+wandb login
+# verify your setup by training a tiny model
+python train.py --testrun
+# see other command line options, try a full training run, and explore other model variants by extending this baseline training script example
+python train.py --help
+python train.py
+
+# generate predictions for model evaluation
+python predict.py [-r | --wandb_run_id] github/codesearchnet/0123456  # the argument is org/project_name/run_id
+```
+
+Finally, you can submit your run to the [community benchmark](https://app.wandb.ai/github/codesearchnet/benchmark) by following these [instructions](src/docs/BENCHMARK.md).
 
 # Introduction
 
````
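A note on the container workflow in the added quickstart: `script/console` attaches an interactive shell inside the container, so with Docker's default key bindings you can detach without stopping it by pressing `Ctrl-p` followed by `Ctrl-q`, list it again with `docker ps`, and reattach with `docker attach <container-id>`. This is standard Docker behavior rather than anything specific to this repository's scripts.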
````diff
@@ -50,7 +73,7 @@ Want to jump right into training our baseline model? Head [here](#quickstart).
 
 We hope that CodeSearchNet is a step towards engaging with the broader machine learning and NLP community regarding the relationship between source code and natural language. We describe a specific task here, but we expect and welcome other uses of our dataset.
 
-More context regarding the motivation for this problem is in [this paper][paper].
+More context regarding the motivation for this problem is in this [technical report][paper].
 
 ## Data
 
````
````diff
@@ -218,36 +241,6 @@ The size of the dataset is approximately 20 GB. The various files and the direc
 
 Warning: the scripts provided to reproduce our baseline model take more than 24 hours on an [AWS P3-V100](https://aws.amazon.com/ec2/instance-types/p3/) instance.
 
-## Quickstart
-
-Make sure you have [Docker](https://docs.docker.com/get-started/) and [Nvidia-Docker](https://github.com/NVIDIA/nvidia-docker) (for GPU-compute related dependencies) installed. You should only have to perform the setup steps once to prepare the environment and download the data.
-
-```bash
-# clone this repository
-git clone https://github.com/ml-msr-github/CodeSearchNet.git
-# download data (~3.5GB) from S3; build and run Docker container
-# (this will land you inside the Docker container, starting in the /src directory--you can detach from/attach to this container to pause/continue your work)
-cd CodeSearchNet/
-script/setup
-# this will drop you into the shell inside a docker container.
-script/console
-# optional: log in to W&B to see your training metrics, track your experiments, and submit your models to the community benchmark
-wandb login
-# verify your setup by training a tiny model
-python train.py --testrun
-# see other command line options, try a full training run, and explore other model variants by extending this baseline training script example
-python train.py --help
-python train.py
-```
-
-Once you're satisfied with a new model, test it against the CodeSearchNet Challenge. This will generate a CSV file of model prediction scores which you can then submit to the Weights & Biases [community benchmark](https://app.wandb.ai/github/codesearchnet/benchmark) by [following these instructions](src/docs/BENCHMARK.md).
-
-```bash
-python predict.py [-r | --wandb_run_id] github/codesearchnet/0123456
-# or
-python predict.py [-m | --model_file] ../resources/saved_models/*.pkl.gz
-```
-
 ## Model Architecture
 
 Our baseline models ingest a parallel corpus of (`comments`, `code`) and learn to retrieve a code snippet given a natural language query. Specifically, `comments` are top-level function and method comments (e.g. docstrings in Python), and `code` is an entire function or method. Throughout this repo, we refer to the terms docstring and query interchangeably.
````
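The paragraph kept as context above describes a joint-embedding (dual-encoder) retrieval setup: queries and code snippets are mapped into a shared vector space and ranked by similarity. The sketch below illustrates that pattern only and is not this repository's code; the hashed bag-of-words `embed` is a stand-in for the learned encoders, and all names are illustrative.

```python
# Minimal dual-encoder retrieval sketch (illustrative; not the repo's API).
from typing import List, Tuple

import numpy as np

DIM = 128  # embedding dimensionality, arbitrary for this sketch


def embed(text: str) -> np.ndarray:
    """Hashed bag-of-words placeholder for a learned query/code encoder."""
    vec = np.zeros(DIM)
    for token in text.lower().split():
        vec[hash(token) % DIM] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec


def search(query: str, snippets: List[str], top_k: int = 3) -> List[Tuple[float, str]]:
    """Rank candidate code snippets by cosine similarity to the query."""
    q = embed(query)
    code_vecs = np.stack([embed(s) for s in snippets])
    scores = code_vecs @ q  # unit vectors, so the dot product is cosine similarity
    order = np.argsort(-scores)[:top_k]
    return [(float(scores[i]), snippets[i]) for i in order]


if __name__ == "__main__":
    corpus = [
        "def add(a, b): return a + b",
        "def read_file(path): return open(path).read()",
    ]
    print(search("read a file from disk", corpus))
```

The real baselines replace `embed` with neural encoders trained so that each docstring lands near its paired function in the shared space.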
