This repository implements data augmentation methods for Japanese NLP.
README_ja.md is written in Japanese.
The usage and performance of this library are also described in the following article.
pip install daaja
Augmenters
`daaja.augmentors` provides various types of data augmentation methods.
Sentence Augmenter
`daaja.augmentors.sentence` provides data augmentation methods for sentences.
Augmenter | ref |
---|---|
RandamDeleteAugmentor | [1] |
RandamInsertAugmentor | [1] |
RandamSwapAugmentor | [1] |
SynonymReplacementAugmentor | [1] |
BackTranslationAugmentor | [3] |
ContextualAugmentor | [4] |
- [1] EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks
- [3] Improving Neural Machine Translation Models with Monolingual Data
- [4] Data Augmentation using Pre-trained Transformer Models
from daaja.augmentors.sentence import SynonymReplaceAugmentor
augmentor = SynonymReplaceAugmentor()
augmentor.augment("日本語でデータ拡張を行う") #=> 日本語でデータ伸暢を行う
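To make the EDA-style operations in the table above more concrete, here is a minimal sketch of random swap and random deletion on a token list. It is an illustration only and not daaja's implementation: the function names `random_swap` and `random_delete` are hypothetical, and the sentence is tokenized by hand rather than with a morphological analyzer.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Return a copy of tokens with n_swaps random position swaps applied."""
    out = list(tokens)
    if len(out) < 2:
        return out
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(out)), 2)  # two distinct positions
        out[i], out[j] = out[j], out[i]
    return out


def random_delete(tokens, p, rng):
    """Drop each token with probability p, always keeping at least one token."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # seeded so repeated runs give the same result
tokens = ["日本語", "で", "データ", "拡張", "を", "行う"]  # hand-tokenized (assumption)
swapped = random_swap(tokens, n_swaps=1, rng=rng)
deleted = random_delete(tokens, p=0.2, rng=rng)
print("".join(swapped))
print("".join(deleted))
```

Both functions return new lists, so the original token list can be reused for further augmentations.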
Sequence Labeling Augmenter
`daaja.augmentors.sequence_labeling` provides data augmentation methods for sequence labeling tasks.
Augmenter | ref |
---|---|
LabelwiseTokenReplacementAugmentor | [2] |
MentionReplacementAugmentor | [2] |
ShuffleWithinSegmentsAugmentor | [2] |
SynonymReplacementAugmentor | [2] |
- [2] An Analysis of Simple Data Augmentation for Named Entity Recognition
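from daaja.augmentors.sequence_labeling import SynonymReplacementAugmentor
augmentor = SynonymReplacementAugmentor()  # assumed: the default constructor arguments are sufficient
augmentor.augment(["君", "は", "隆弘", "君", "かい"], ["O", "O", "B-PER", "O", "O"])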
# => (['は', '君', '隆弘', '君', 'かい'], ['O', 'O', 'B-PER', 'O', 'O'])
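The sample output above reorders tokens inside a run of `O` labels while leaving the labels untouched, which is how shuffle-within-segments behaves in [2]. The following is a minimal sketch of that idea under stated assumptions: labels follow the BIO scheme, and `split_segments` and `shuffle_within_segments` are hypothetical names, not daaja's API.

```python
import random


def split_segments(labels):
    """Group token indices into segments: one per mention (B-X, I-X, ...) or run of "O"."""
    segments = []
    for i, label in enumerate(labels):
        starts_new = (
            not segments
            or label.startswith("B-")
            or (label == "O") != (labels[i - 1] == "O")
        )
        if starts_new:
            segments.append([i])
        else:
            segments[-1].append(i)
    return segments


def shuffle_within_segments(tokens, labels, p, rng):
    """Shuffle the tokens of each multi-token segment with probability p; labels stay in place."""
    out = list(tokens)
    for seg in split_segments(labels):
        if len(seg) > 1 and rng.random() < p:
            shuffled = [out[i] for i in seg]
            rng.shuffle(shuffled)
            for i, tok in zip(seg, shuffled):
                out[i] = tok
    return out, list(labels)


tokens = ["君", "は", "隆弘", "君", "かい"]
labels = ["O", "O", "B-PER", "O", "O"]
new_tokens, new_labels = shuffle_within_segments(tokens, labels, p=1.0, rng=random.Random(0))
print(split_segments(labels))
print(new_tokens, new_labels)
```

Because tokens never cross a segment boundary, a mention such as 隆弘 keeps its position and its label.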
The methods described in the following papers can be run from `daaja.methods`.
python -m daaja.methods.eda.run --input input.tsv --output data_augmentor.tsv
The format of input.tsv is as follows:
1 この映画はとてもおもしろい
0 つまらない映画だった
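For readers preparing their own input, the two sample lines above can be read as a label column followed by a text column. The sketch below assumes the columns are separated by a single tab (suggested by the `.tsv` extension, not confirmed by the source); `read_labeled_tsv` is a hypothetical helper, not part of daaja.

```python
import csv
import io


def read_labeled_tsv(fileobj):
    """Read (label, text) pairs from a tab-separated file object, skipping blank lines."""
    rows = []
    for row in csv.reader(fileobj, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row:
            continue
        rows.append((row[0], row[1]))
    return rows


# In-memory stand-in for input.tsv, assuming a tab between label and text.
sample = "1\tこの映画はとてもおもしろい\n0\tつまらない映画だった\n"
rows = read_labeled_tsv(io.StringIO(sample))
print(rows)
```

`csv.QUOTE_NONE` keeps any quotation marks inside the sentences as literal characters.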
from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor
augmentor = EasyDataAugmentor(alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4)
text = "日本語でデータ拡張を行う"
aug_texts = augmentor.augments(text)
print(aug_texts)
# ['日本語でを拡張データ行う', '日本語でデータ押広げるを行う', '日本語でデータ拡張を行う', '日本語で智見拡張を行う', '日本語でデータ拡張を行う']
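The parameter names `alpha_sr`, `alpha_ri`, and `alpha_rs` mirror those in the EDA paper [1], where each alpha is the fraction of words in a sentence to change, so the number of operations is roughly alpha times the sentence length. The sketch below shows that arithmetic with a floor of one operation, as in the paper's reference implementation. Whether daaja rounds the same way is an assumption, and `num_operations` is a hypothetical name.

```python
def num_operations(alpha, num_tokens):
    """EDA-style edit count: alpha * sentence length, truncated, but at least 1."""
    return max(1, int(alpha * num_tokens))


for num_tokens in (6, 20, 50):
    print(num_tokens, num_operations(0.1, num_tokens))
```

With `alpha=0.1`, a six-token sentence such as the example above still receives one edit per operation. `p_rd` works differently: it is a per-token deletion probability rather than a fraction of the length.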
python -m daaja.methods.ner_sda.run --input input.tsv --output data_augmentor.tsv
The format of input.tsv is as follows:
私 O
は O
田中 B-PER
と O
いい O
ます O
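The token-per-line format above maps directly onto the `tokens_list` and `labels_list` arguments used by the Python API. The following sketch parses such a file under two assumptions that the source does not state: each line is a token and a label separated by whitespace, and sentences are separated by a blank line. `read_token_label_lines` is a hypothetical helper.

```python
def read_token_label_lines(lines):
    """Parse token/label lines into parallel lists of sentences; blank lines end a sentence."""
    tokens_list, labels_list = [], []
    tokens, labels = [], []
    for line in lines:
        line = line.strip()
        if not line:
            if tokens:
                tokens_list.append(tokens)
                labels_list.append(labels)
                tokens, labels = [], []
            continue
        token, label = line.rsplit(None, 1)  # the label is the last whitespace-separated field
        tokens.append(token)
        labels.append(label)
    if tokens:  # flush the final sentence when the input lacks a trailing blank line
        tokens_list.append(tokens)
        labels_list.append(labels)
    return tokens_list, labels_list


sample = "私\tO\nは\tO\n田中\tB-PER\nと\tO\nいい\tO\nます\tO\n\n茨城\tB-ORG\n大学\tI-ORG\n"
tokens_list, labels_list = read_token_label_lines(sample.splitlines())
print(tokens_list)
print(labels_list)
```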
from daaja.methods.ner_sda.simple_data_augmentation_for_ner import \
    SimpleDataAugmentationforNER
tokens_list = [
    ["私", "は", "田中", "と", "いい", "ます"],
    ["筑波", "大学", "に", "所属", "して", "ます"],
    ["今日", "から", "筑波", "大学", "に", "通う"],
    ["茨城", "大学"],
]
labels_list = [
    ["O", "O", "B-PER", "O", "O", "O"],
    ["B-ORG", "I-ORG", "O", "O", "O", "O"],
    ["B-DATE", "O", "B-ORG", "I-ORG", "O", "O"],
    ["B-ORG", "I-ORG"],
]
augmentor = SimpleDataAugmentationforNER(tokens_list=tokens_list, labels_list=labels_list,
                                         p_power=1, p_lwtr=1, p_mr=1, p_sis=1, p_sr=1, num_aug=4)
tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
augmented_tokens_list, augmented_labels_list = augmentor.augments(tokens, labels)
print(augmented_tokens_list)
# [['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '志す', 'だ'],
# ['吉田', 'さん', 'は', '株式', '大学', '大学', 'に', '出張', '予定', 'だ'],
# ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ'],
# ['吉田', 'さん', 'は', '筑波', '大学', 'に', '出張', '予定', 'だ'],
# ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ']]
print(augmented_labels_list)
# [['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
# ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
# ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
# ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
# ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O']]
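One of the augmented sentences above replaces the three-token mention 株式 会社 A with 筑波 大学 and shortens the label sequence accordingly, which is what mention replacement in [2] does. Below is a minimal sketch of that technique, assuming BIO labels. The helper names are hypothetical and this is not daaja's implementation.

```python
import random
from collections import defaultdict


def extract_mentions(labels):
    """Return (start, end, entity_type) spans for BIO labels; end is exclusive."""
    spans, start = [], None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a trailing mention
        if start is not None and not label.startswith("I-"):
            spans.append((start, i, labels[start][2:]))
            start = None
        if label.startswith("B-"):
            start = i
    return spans


def build_mention_dict(tokens_list, labels_list):
    """Collect every training mention, keyed by entity type."""
    mentions = defaultdict(list)
    for tokens, labels in zip(tokens_list, labels_list):
        for start, end, etype in extract_mentions(labels):
            mentions[etype].append(tokens[start:end])
    return mentions


def replace_mentions(tokens, labels, mentions, p, rng):
    """Swap each mention, with probability p, for a training mention of the same type."""
    new_tokens, new_labels, pos = [], [], 0
    for start, end, etype in extract_mentions(labels):
        new_tokens += tokens[pos:start]
        new_labels += labels[pos:start]
        if mentions.get(etype) and rng.random() < p:
            repl = rng.choice(mentions[etype])
            new_tokens += repl
            new_labels += ["B-" + etype] + ["I-" + etype] * (len(repl) - 1)
        else:
            new_tokens += tokens[start:end]
            new_labels += labels[start:end]
        pos = end
    return new_tokens + tokens[pos:], new_labels + labels[pos:]


train_tokens = [["私", "は", "田中", "と", "いい", "ます"], ["茨城", "大学"],
                ["筑波", "大学", "に", "所属", "して", "ます"]]
train_labels = [["O", "O", "B-PER", "O", "O", "O"], ["B-ORG", "I-ORG"],
                ["B-ORG", "I-ORG", "O", "O", "O", "O"]]
mentions = build_mention_dict(train_tokens, train_labels)

tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
new_tokens, new_labels = replace_mentions(tokens, labels, mentions, p=1.0, rng=random.Random(0))
print(new_tokens)
print(new_labels)
```

Regenerating the `B-`/`I-` labels from the replacement's length keeps tokens and labels aligned even when the new mention is shorter or longer than the one it replaces.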
Reference