LASSL is a LAnguage framework for Self-Supervised Learning. LASSL aims to provide an easy-to-use framework for pretraining language models using only Hugging Face's Transformers and Datasets libraries.
First, install a version of PyTorch that matches your computing environment. Then you can install LASSL and its required packages as follows.
pip3 install .
- Language model pretraining can be divided into three steps: 1. Train Tokenizer, 2. Serialize Corpus, 3. Pretrain Language Model.
- After preparing a corpus in one of the supported corpus types, you can pretrain your own language model.
python3 train_tokenizer.py \
--corpora_dirpath $CORPORA_DIR \
--corpus_type $CORPUS_TYPE \
--sampling_ratio $SAMPLING_RATIO \
--model_type $MODEL_TYPE \
--vocab_size $VOCAB_SIZE \
--min_frequency $MIN_FREQUENCY
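The tokenizer-training step above can be sketched conceptually. The toy, stdlib-only snippet below illustrates frequency-based vocabulary selection, where `vocab_size` and `min_frequency` play the same roles as the command-line arguments above; it is not LASSL's actual implementation, which trains a subword tokenizer via Hugging Face Tokenizers.

```python
from collections import Counter

def build_vocab(corpus, vocab_size, min_frequency):
    """Toy vocabulary builder: whitespace tokens ranked by frequency.
    A real subword trainer (BPE/WordPiece) learns merges instead."""
    counts = Counter(tok for line in corpus for tok in line.split())
    # Keep tokens seen at least min_frequency times, most frequent first.
    frequent = [t for t, c in counts.most_common() if c >= min_frequency]
    specials = ["<pad>", "<unk>", "<cls>", "<sep>", "<mask>"]
    vocab = specials + frequent[: vocab_size - len(specials)]
    return {tok: idx for idx, tok in enumerate(vocab)}

corpus = [
    "language model pretraining",
    "language model training",
    "self supervised learning",
]
vocab = build_vocab(corpus, vocab_size=10, min_frequency=2)
```

Here only "language" and "model" occur at least twice, so they are the only non-special tokens kept.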
python3 serialize_corpora.py \
--model_type $MODEL_TYPE \
--tokenizer_dir $TOKENIZER_DIR \
--corpora_dir $CORPORA_DIR \
--corpus_type $CORPUS_TYPE \
--max_length $MAX_LENGTH \
--num_proc $NUM_PROC
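Conceptually, serialization tokenizes the corpus once and packs the resulting token ids into fixed-length examples, so the pretraining step can stream ready-made blocks instead of re-tokenizing. A minimal sketch of the packing step, with `max_length` playing the same role as the argument above (the function name is illustrative, not LASSL's actual API):

```python
def serialize(token_ids, max_length):
    """Pack a flat stream of token ids into contiguous blocks of max_length.
    The trailing remainder that cannot fill a whole block is dropped."""
    n_blocks = len(token_ids) // max_length
    return [token_ids[i * max_length : (i + 1) * max_length] for i in range(n_blocks)]

# A fake tokenized corpus of 10 ids packed into blocks of 4.
blocks = serialize(list(range(10)), max_length=4)
print(blocks)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```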
python3 pretrain_language_model.py --config_path $CONFIG_PATH
# When using a TPU, use the command below. (The Poetry environment does not provide PyTorch XLA by default.)
python3 xla_spawn.py --num_cores $NUM_CORES pretrain_language_model.py --config_path $CONFIG_PATH
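For BERT-style masked-language-model pretraining, each serialized block is dynamically corrupted before being fed to the model. Below is a stdlib-only sketch of the standard 15% masking rule (80% mask token, 10% random token, 10% unchanged); the `MASK_ID` and `VOCAB_SIZE` constants are assumptions for illustration, and the real training script delegates this to Hugging Face's data collators.

```python
import random

MASK_ID = 4          # assumed id of the mask token
VOCAB_SIZE = 100     # assumed vocabulary size

def mask_tokens(input_ids, mlm_probability=0.15, rng=random):
    """Return (corrupted_ids, labels); labels are -100 where no loss is taken."""
    corrupted, labels = [], []
    for tok in input_ids:
        if rng.random() < mlm_probability:
            labels.append(tok)  # the model must predict the original token here
            r = rng.random()
            if r < 0.8:
                corrupted.append(MASK_ID)                    # 80%: mask token
            elif r < 0.9:
                corrupted.append(rng.randrange(VOCAB_SIZE))  # 10%: random token
            else:
                corrupted.append(tok)                        # 10%: unchanged
        else:
            corrupted.append(tok)
            labels.append(-100)  # ignored by the loss
    return corrupted, labels

rng = random.Random(0)
ids = list(range(10, 30))
corrupted, labels = mask_tokens(ids, rng=rng)
```

Masking is applied per batch at training time ("dynamic masking"), so the same serialized block yields different corruptions across epochs.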
| Boseop Kim | Minho Ryu | Inje Ryu | Jangwon Park | Hyoungseok Kim |
|---|---|---|---|---|
| Github | Github | Github | Github | Github |
LASSL is built with Cloud TPU support from the TensorFlow Research Cloud (TFRC) program.