LUA

Open-source code for our paper at Findings of EMNLP-2021: "Segmenting Natural Language Sentences via Lexical Unit Analysis".
An implementation of Lexical Unit Analysis (LUA) for sequence segmentation tasks (e.g., Chinese POS Tagging). Note that this is not an officially supported Tencent product.

Preparation

Preparation involves two steps. First, reformulate the segmentation data sets and move them into a new folder named "dataset", which contains {train, dev, test}.json. Each JSON file is a list of dicts. See the following NER case (a conversion sketch follows the example):

[
 {
  "sentence": "['Somerset', '83', 'and', '174', '(', 'P.', 'Simmons']",
  "labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'O'), (5, 6, 'PER')]"
 },
 {
  "sentence": "['Leicestershire', '22', 'points', ',', 'Somerset', '4', '.']",
  "labeled entities": "[(0, 0, 'ORG'), (1, 1, 'O'), (2, 2, 'O'), (3, 3, 'O'), (4, 4, 'ORG'), (5, 5, 'O'), (6, 6, 'O')]"
 }
]
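
If your raw corpus is in CoNLL-style BIO format, something along the lines of the sketch below can produce the expected JSON files. This script is not part of the repository: the input paths, the BIO reading logic, and the choice to store the token list and span list as stringified Python lists (as in the example above) are our assumptions.

    import json

    def bio_to_spans(tags):
        # Collapse a BIO tag sequence into (start, end, label) spans; tokens outside
        # any entity become single-token 'O' spans, matching the example above.
        spans, i = [], 0
        while i < len(tags):
            if tags[i].startswith("B-"):
                label, j = tags[i][2:], i
                while j + 1 < len(tags) and tags[j + 1] == "I-" + label:
                    j += 1
                spans.append((i, j, label))
                i = j + 1
            else:
                spans.append((i, i, "O"))
                i += 1
        return spans

    def convert(conll_path, json_path):
        cases, words, tags = [], [], []
        with open(conll_path, encoding="utf-8") as f:
            for line in f:
                fields = line.strip().split()
                if not fields:  # blank line marks a sentence boundary
                    if words:
                        cases.append({"sentence": str(words),
                                      "labeled entities": str(bio_to_spans(tags))})
                        words, tags = [], []
                    continue
                words.append(fields[0])
                tags.append(fields[-1])
        if words:  # last sentence when the file has no trailing blank line
            cases.append({"sentence": str(words),
                          "labeled entities": str(bio_to_spans(tags))})
        with open(json_path, "w", encoding="utf-8") as f:
            json.dump(cases, f, ensure_ascii=False, indent=1)

    # Example usage (paths are hypothetical):
    # convert("raw/train.bio", "dataset/train.json")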

Second, prepare the pretrained LM (i.e., BERT) and the evaluation script. Create another directory, "resource", arranged as follows:

  • resource
    • pretrained_lm
      • model.pt
      • vocab.txt
    • conlleval.pl

For Chinese tasks, "pretrained_lm" can be constructed from bert-base-chinese.
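
One possible way to populate "resource/pretrained_lm" is to export the HuggingFace bert-base-chinese checkpoint, as sketched below. Whether main.py expects a pickled module or a raw state dict in model.pt is an assumption; adjust to match the loading code.

    # A minimal sketch (not from the repository) that exports bert-base-chinese
    # into the "resource/pretrained_lm" layout described above.
    import os
    import torch
    from transformers import BertModel, BertTokenizer

    out_dir = "resource/pretrained_lm"
    os.makedirs(out_dir, exist_ok=True)

    tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
    model = BertModel.from_pretrained("bert-base-chinese")

    # Assumption: main.py loads model.pt with torch.load(); here we save the full module.
    torch.save(model, os.path.join(out_dir, "model.pt"))

    # Writes vocab.txt into out_dir.
    tokenizer.save_vocabulary(out_dir)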

Training and Test

CUDA_VISIBLE_DEVICES=0 python main.py -dd dataset -sd dump -rd resource
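
The three flags presumably point to the dataset directory (-dd), a dump directory where checkpoints are saved (-sd), and the resource directory (-rd); see main.py for the exact argument definitions.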

Citation

@inproceedings{li-etal-2021-segmenting-natural,
    title = "Segmenting Natural Language Sentences via Lexical Unit Analysis",
    author = "Li, Yangming  and  Liu, Lemao  and  Shi, Shuming",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2021",
    month = nov,
    year = "2021",
    address = "Punta Cana, Dominican Republic",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.findings-emnlp.18",
    doi = "10.18653/v1/2021.findings-emnlp.18",
    pages = "181--187",
}
