An implementation of Lexical Unit Analysis (LUA) for sequence segmentation tasks (e.g., Chinese POS Tagging). Note that this is not an officially supported Tencent product.
Setup takes two steps. First, reformulate the chunking data sets and move them into a new folder named "dataset", which contains {train, dev, test}.json. Each JSON file is a list of dicts, as in the following NER case:
[
{
"sentence": "['Somerset', '83', 'and', '174', '(', 'P.', 'Simmons']",
"labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'O'), (5, 6, 'PER')]",
},
{
"sentence": "['Leicestershire', '22', 'points', ',', 'Somerset', '4', '.']",
"labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'ORG'), (5, 5, 'O'), (0, 0, 'O')]",
}
]
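The records above can be decoded with a short stdlib-only sketch. Note that both "sentence" and "labeled entities" are stored as strings containing Python literals (as in the example), so after json.load each field still needs ast.literal_eval; the helper name decode below is hypothetical.

```python
import ast

# One record in the dataset format shown above (normally obtained via json.load).
record = {
    "sentence": "['Somerset', '83', 'and', '174', '(', 'P.', 'Simmons']",
    "labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'O'), (5, 6, 'PER')]",
}

def decode(record):
    # Both fields are string-encoded Python literals, not nested JSON.
    tokens = ast.literal_eval(record["sentence"])
    spans = ast.literal_eval(record["labeled entities"])
    # Each span (i, j, label) covers tokens[i..j] inclusive; together the
    # spans segment the whole sentence into lexical units.
    assert spans[0][0] == 0 and spans[-1][1] == len(tokens) - 1
    return tokens, spans

tokens, spans = decode(record)
print(tokens[5:7], spans[-1])  # the 'PER' unit covers the two-token name
```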
Second, prepare a pretrained LM (e.g., BERT) and the evaluation script. Create another directory, "resource", with the following arrangement:
- resource
  - pretrained_lm
    - model.pt
    - vocab.txt
  - conlleval.pl
For Chinese tasks, "pretrained_lm" is constructed from bert-base-chinese.
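The preparation step can be sketched as below (directory and file names taken from the layout above; where the checkpoint files come from is left to the reader's LM of choice):

```shell
# A sketch of preparing the "resource" directory.
mkdir -p resource/pretrained_lm
# pretrained_lm/model.pt and pretrained_lm/vocab.txt are exported from the
# pretrained BERT checkpoint (bert-base-chinese for Chinese tasks);
# conlleval.pl is the standard CoNLL shared-task evaluation script.
ls resource
```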
Then, run training and evaluation with:

CUDA_VISIBLE_DEVICES=0 python main.py -dd dataset -sd dump -rd resource
If you find this work useful, please cite:

@inproceedings{li-etal-2021-segmenting-natural,
title = "Segmenting Natural Language Sentences via Lexical Unit Analysis",
author = "Li, Yangming and Liu, Lemao and Shi, Shuming",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
month = nov,
year = "2021",
address = "Punta Cana, Dominican Republic",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2021.findings-emnlp.18",
doi = "10.18653/v1/2021.findings-emnlp.18",
pages = "181--187",
}