This project template lets you train a part-of-speech tagger, morphologizer, lemmatizer and dependency parser from a Universal Dependencies corpus. It takes care of downloading the treebank, converting it to spaCy's format, and training and evaluating the model. The template uses the `UD_English-EWT` treebank by default, but you can swap it out for any other available treebank. Just make sure to adjust the `lang` and `treebank` settings in the variables below. Use `xx` for multi-language if no language-specific tokenizer is available in spaCy. Note that multi-word tokens will be merged together when the corpus is converted, since spaCy does not support multi-word token expansion.
The `project.yml` defines the data assets required by the project, as well as the available commands and workflows. For details, see the Weasel documentation.
The following commands are defined by the project. They can be executed using `weasel run [name]`. Commands are only re-run if their inputs have changed.
Command | Description |
---|---|
`preprocess` | Convert the data to spaCy's format |
`train` | Train UD_English-EWT |
`evaluate` | Evaluate on the test data and save the metrics |
`package` | Package the trained model so it can be installed |
`clean` | Remove intermediate files |
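
The re-run rule can be pictured as a checksum comparison. The sketch below is a simplified illustration of that idea and not Weasel's actual implementation; the helper names (`needs_rerun`, `record_run`), the JSON lock file layout, and the file names are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path
from typing import Dict, List


def checksums(deps: List[Path]) -> Dict[str, str]:
    """Map each input file to a hash of its current contents."""
    return {str(p): hashlib.sha256(p.read_bytes()).hexdigest() for p in deps}


def load_lock(lock_path: Path) -> Dict[str, Dict[str, str]]:
    """Read the recorded checksums, or return an empty record if none exist."""
    return json.loads(lock_path.read_text()) if lock_path.exists() else {}


def needs_rerun(name: str, deps: List[Path], lock_path: Path) -> bool:
    """A command needs to run if its inputs differ from the recorded ones."""
    return load_lock(lock_path).get(name) != checksums(deps)


def record_run(name: str, deps: List[Path], lock_path: Path) -> None:
    """Store the input checksums after a command has completed."""
    lock = load_lock(lock_path)
    lock[name] = checksums(deps)
    lock_path.write_text(json.dumps(lock, indent=2))


# Demo with hypothetical file names in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    dep = Path(tmp) / "train.conllu"
    lock = Path(tmp) / "demo.lock"

    dep.write_text("first version")
    before_first_run = needs_rerun("preprocess", [dep], lock)

    record_run("preprocess", [dep], lock)
    after_first_run = needs_rerun("preprocess", [dep], lock)

    dep.write_text("second version")
    after_input_change = needs_rerun("preprocess", [dep], lock)

print(before_first_run, after_first_run, after_input_change)
```

The command is due before its first run, is skipped while its input is unchanged, and becomes due again once the input file's contents change.
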
The following workflows are defined by the project. They can be executed using `weasel run [name]` and will run the specified commands in order. Commands are only re-run if their inputs have changed.
Workflow | Steps |
---|---|
`all` | `preprocess` → `train` → `evaluate` → `package` |
The following assets are defined by the project. They can be fetched by running `weasel assets` in the project directory.
File | Source | Description |
---|---|---|
`assets/UD_English-EWT` | Git | |