daaja

This repository provides implementations of data augmentation methods for Japanese NLP.

For Japanese

README_ja.md is written in Japanese.

The usage and performance of this library are also described in the following article.

Install

pip install daaja

Example

Quick Example

Augmenters

Augmenters provide various data augmentation methods.

Sentence Augmenter

Sentence Augmenter is a data augmentation method for sentences.

Augmenter	ref
RandamDeleteAugmentor	[1]
RandamInsertAugmentor	[1]
RandamSwapAugmentor	[1]
SynonymReplacementAugmentor	[1]
BackTranslationAugmentor	[3]
ContextualAugmentor	[4]

How to use

from daaja.augmentors.sentence import SynonymReplaceAugmentor
augmentor = SynonymReplaceAugmentor()
augmentor.augment("日本語でデータ拡張を行う") #=> 日本語でデータ伸暢を行う
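
To make the random operations listed in the table concrete, here is a minimal pure-Python sketch of random swap and random deletion as described in the EDA paper. It is not daaja's implementation: the helper names `random_swap` and `random_delete` are hypothetical, and the token list is split by hand, whereas the library accepts raw strings.

```python
import random


def random_swap(tokens, n_swaps, rng):
    """Swap two randomly chosen positions, n_swaps times (hypothetical helper)."""
    tokens = list(tokens)  # copy so the caller's list is left untouched
    if len(tokens) < 2:
        return tokens
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(tokens)), 2)  # two distinct positions
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_delete(tokens, p, rng):
    """Drop each token independently with probability p, keeping at least one."""
    if len(tokens) <= 1:
        return list(tokens)
    kept = [t for t in tokens if rng.random() >= p]
    # If everything was dropped, fall back to a single random token.
    return kept if kept else [rng.choice(tokens)]


rng = random.Random(0)  # seeded only to make runs repeatable
tokens = ["日本語", "で", "データ", "拡張", "を", "行う"]  # hand-split for illustration
swapped = random_swap(tokens, n_swaps=1, rng=rng)
deleted = random_delete(tokens, p=0.2, rng=rng)
print(swapped)
print(deleted)
```

Both helpers return new lists, so the original tokens can be reused to generate several variants.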

Sequence Labeling Augmenter

Sequence Labeling Augmenter is a data augmentation method for sequence labeling tasks.

Augmenter	ref
LabelwiseTokenReplacementAugmentor	[2]
MentionReplacementAugmentor	[2]
ShuffleWithinSegmentsAugmentor	[2]
SynonymReplacementAugmentor	[2]

[2] An Analysis of Simple Data Augmentation for Named Entity Recognition

How to use

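from daaja.augmentors.sequence_labeling import SynonymReplacementAugmentor

# Assumption: the constructor is called with its default arguments, as in the sentence example.
augmentor = SynonymReplacementAugmentor()
augmentor.augment(["君", "は", "隆弘", "君", "かい"], ["O", "O", "B-PER", "O", "O"])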
# => (['は', '君', '隆弘', '君', 'かい'], ['O', 'O', 'B-PER', 'O', 'O'])
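
For sequence labeling, tokens have to change without breaking their alignment with the labels. The sketch below illustrates one of the listed techniques, shuffle within segments, following the description in [2]. It is an illustration under stated assumptions, not the library's code: `entity_type` and `shuffle_within_segments` are hypothetical names, and adjacent entities of the same type are treated as one segment for simplicity.

```python
import random
from itertools import groupby


def entity_type(label):
    """Map 'B-PER' or 'I-PER' to 'PER'; leave 'O' unchanged."""
    return label.split("-", 1)[1] if "-" in label else label


def shuffle_within_segments(tokens, labels, p, rng):
    """Shuffle tokens inside each same-type segment with probability p.

    The label sequence is returned unchanged, so tokens and labels stay aligned.
    """
    out = []
    pairs = zip(tokens, labels)
    # Consecutive positions sharing an entity type form one segment.
    for _, group in groupby(pairs, key=lambda pair: entity_type(pair[1])):
        segment = [token for token, _ in group]
        if rng.random() < p:
            rng.shuffle(segment)
        out.extend(segment)
    return out, list(labels)


rng = random.Random(0)
tokens = ["君", "は", "隆弘", "君", "かい"]
labels = ["O", "O", "B-PER", "O", "O"]
new_tokens, new_labels = shuffle_within_segments(tokens, labels, p=1.0, rng=rng)
print(new_tokens, new_labels)
```

Because shuffling never crosses a segment boundary, the entity token `隆弘` keeps its position and its `B-PER` label.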

Methods

The methods module lets you try the methods proposed in the following papers.

EDA: Easy Data Augmentation Techniques for Boosting Performance on Text Classification Tasks

Command

python -m daaja.methods.eda.run --input input.tsv --output data_augmentor.tsv

input.tsv has one example per line, with the label and the text separated by a tab:

1	この映画はとてもおもしろい
0	つまらない映画だった
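
The two-column layout above can be read with the standard `csv` module. This is a minimal sketch for preparing or inspecting such a file; `read_labeled_tsv` is a hypothetical helper and is not part of daaja. An in-memory string stands in for the file so the snippet runs on its own.

```python
import csv
import io

# Stand-in for the contents of input.tsv (label, tab, text).
sample = "1\tこの映画はとてもおもしろい\n0\tつまらない映画だった\n"


def read_labeled_tsv(handle):
    """Return (label, text) pairs from a two-column, tab-separated file."""
    rows = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row:  # skip blank lines
            continue
        label, text = row
        rows.append((label, text))
    return rows


pairs = read_labeled_tsv(io.StringIO(sample))
print(pairs)
```

With a real file, pass `open("input.tsv", encoding="utf-8", newline="")` instead of the `io.StringIO` object.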

In Python

from daaja.methods.eda.easy_data_augmentor import EasyDataAugmentor
augmentor = EasyDataAugmentor(alpha_sr=0.1, alpha_ri=0.1, alpha_rs=0.1, p_rd=0.1, num_aug=4)
text = "日本語でデータ拡張を行う"
aug_texts = augmentor.augments(text)
print(aug_texts)
# ['日本語でを拡張データ行う', '日本語でデータ押広げるを行う', '日本語でデータ拡張を行う', '日本語で智見拡張を行う', '日本語でデータ拡張を行う']
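
The `alpha_*` arguments control how many tokens each operation touches, while `p_rd` is a per-token deletion probability. In the EDA paper, the number of changed words is n = αl for a sentence of l words. The sketch below assumes the floor-with-a-minimum-of-one rule used in the paper's reference code, which may differ from what daaja does internally; `num_operations` is a hypothetical helper.

```python
def num_operations(alpha: float, num_tokens: int) -> int:
    """Number of edits for one operation: alpha * length, floored, but at least 1."""
    return max(1, int(alpha * num_tokens))


# With alpha = 0.1, short sentences still receive one edit per operation.
for length in (6, 25, 40):
    print(length, num_operations(0.1, length))
```

Under this rule, small `alpha` values mostly matter for long sentences, since short ones are clamped to a single edit.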

An Analysis of Simple Data Augmentation for Named Entity Recognition

Command

python -m daaja.methods.ner_sda.run --input input.tsv --output data_augmentor.tsv

input.tsv has one token per line, with the token and its label separated by a tab:

私	O
は	O
田中	B-PER
と	O
いい	O
ます	O
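
A file in this layout can be turned into the `tokens_list` and `labels_list` structures used in the Python example. The sketch below is a hedged illustration: `read_token_label_tsv` is a hypothetical helper, and it assumes that sentences are separated by a blank line (the usual CoNLL convention), which this README does not state.

```python
# Stand-in for the contents of input.tsv; the blank line between sentences is an assumption.
sample = (
    "私\tO\nは\tO\n田中\tB-PER\nと\tO\nいい\tO\nます\tO\n"
    "\n"
    "茨城\tB-ORG\n大学\tI-ORG\n"
)


def read_token_label_tsv(text):
    """Parse token<TAB>label lines into parallel lists of sentences."""
    tokens_list, labels_list = [], []
    tokens, labels = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if there is one.
            if tokens:
                tokens_list.append(tokens)
                labels_list.append(labels)
                tokens, labels = [], []
            continue
        token, label = line.split("\t")
        tokens.append(token)
        labels.append(label)
    if tokens:  # flush the last sentence when the text does not end with a blank line
        tokens_list.append(tokens)
        labels_list.append(labels)
    return tokens_list, labels_list


tokens_list, labels_list = read_token_label_tsv(sample)
print(tokens_list)
print(labels_list)
```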

In Python

from daaja.methods.ner_sda.simple_data_augmentation_for_ner import \
    SimpleDataAugmentationforNER
tokens_list = [
    ["私", "は", "田中", "と", "いい", "ます"],
    ["筑波", "大学", "に", "所属", "して", "ます"],
    ["今日", "から", "筑波", "大学", "に", "通う"],
    ["茨城", "大学"],
]
labels_list = [
    ["O", "O", "B-PER", "O", "O", "O"],
    ["B-ORG", "I-ORG", "O", "O", "O", "O"],
    ["B-DATE", "O", "B-ORG", "I-ORG", "O", "O"],
    ["B-ORG", "I-ORG"],
]
augmentor = SimpleDataAugmentationforNER(tokens_list=tokens_list, labels_list=labels_list,
                                            p_power=1, p_lwtr=1, p_mr=1, p_sis=1, p_sr=1, num_aug=4)
tokens = ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"]
labels = ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"]
augmented_tokens_list, augmented_labels_list = augmentor.augments(tokens, labels)
print(augmented_tokens_list)
# [['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '志す', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '大学', '大学', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '筑波', '大学', 'に', '出張', '予定', 'だ'],
#  ['吉田', 'さん', 'は', '株式', '会社', 'A', 'に', '出張', '予定', 'だ']]
print(augmented_labels_list)
# [['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'O', 'O', 'O', 'O'],
#  ['B-PER', 'O', 'O', 'B-ORG', 'I-ORG', 'I-ORG', 'O', 'O', 'O', 'O']]
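
Since some operations change the sentence length (the fourth sample above has nine tokens instead of ten), it can be worth checking that every augmented sample is still aligned and uses a valid BIO sequence before training on it. The following is a minimal sketch of such a check; `is_valid_bio` and `check_augmented` are hypothetical helpers, not part of daaja.

```python
def is_valid_bio(labels):
    """Return False if an I- tag does not continue an entity of the same type."""
    prev_type = None  # entity type of the previous label, None after "O"
    for label in labels:
        if label == "O":
            prev_type = None
            continue
        prefix, _, ent_type = label.partition("-")
        if prefix == "I" and prev_type != ent_type:
            return False
        prev_type = ent_type
    return True


def check_augmented(tokens_list, labels_list):
    """Return (index, reason) pairs for samples that look broken."""
    problems = []
    for i, (tokens, labels) in enumerate(zip(tokens_list, labels_list)):
        if len(tokens) != len(labels):
            problems.append((i, "length mismatch"))
        elif not is_valid_bio(labels):
            problems.append((i, "invalid BIO sequence"))
    return problems


# Two of the samples printed above: the unchanged one and the shortened one.
sample_tokens = [
    ["吉田", "さん", "は", "株式", "会社", "A", "に", "出張", "予定", "だ"],
    ["吉田", "さん", "は", "筑波", "大学", "に", "出張", "予定", "だ"],
]
sample_labels = [
    ["B-PER", "O", "O", "B-ORG", "I-ORG", "I-ORG", "O", "O", "O", "O"],
    ["B-PER", "O", "O", "B-ORG", "I-ORG", "O", "O", "O", "O"],
]
print(check_augmented(sample_tokens, sample_labels))
```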

Reference
