UKB: Graph Based Word Sense Disambiguation and Similarity
UKB is a collection of programs for performing graph-based Word Sense Disambiguation (WSD) and lexical similarity/relatedness using a pre-existing knowledge base. UKB applies random walks, e.g. Personalized PageRank, on the Knowledge Base (KB) graph to rank the vertices according to the given context. It includes tools to produce graphs from KBs like WordNet.
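To make the idea of ranking vertices "according to the given context" concrete, here is a minimal sketch of Personalized PageRank by power iteration. It is not UKB's implementation: the function name, the damping value, the iteration count, and the toy graph with its sense labels are all assumptions made for illustration.

```python
def personalized_pagerank(graph, context, damping=0.85, iterations=50):
    """Rank vertices of `graph` (dict: vertex -> list of neighbours)
    with the teleport mass concentrated on the `context` vertices."""
    vertices = list(graph)
    context = [v for v in context if v in graph]
    if not context:
        raise ValueError("no context vertex is present in the graph")

    # Teleport distribution: uniform over the context vertices only.
    teleport = {v: 0.0 for v in vertices}
    for v in context:
        teleport[v] += 1.0 / len(context)

    rank = dict(teleport)
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) * teleport[v] for v in vertices}
        for v in vertices:
            neighbours = graph[v]
            if not neighbours:
                # Dangling vertex: send its mass back to the context.
                for u in vertices:
                    new_rank[u] += damping * rank[v] * teleport[u]
                continue
            share = damping * rank[v] / len(neighbours)
            for u in neighbours:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical toy KB (these are not real WordNet identifiers).
toy_graph = {
    "bank#finance": ["money#1", "loan#1"],
    "bank#river": ["water#1", "river#1"],
    "money#1": ["bank#finance", "loan#1", "water#1"],
    "loan#1": ["bank#finance", "money#1"],
    "water#1": ["bank#river", "river#1", "money#1"],
    "river#1": ["bank#river", "water#1"],
}

# Context: the word "bank" appears next to "money".
ranks = personalized_pagerank(toy_graph, ["money#1"])
best_sense = max(["bank#finance", "bank#river"], key=ranks.get)
print(best_sense)
```

In this sketch, disambiguating "bank" amounts to comparing the ranks of its candidate senses after the walk has been biased towards the context word.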
UKB has been evaluated in several tasks:
- WSD using WordNet (English), out-of-the-box state-of-the-art results [17,8,1].
- WSD on specific domains [2]
- WSD on several languages (Basque, Bulgarian, Portuguese, Spanish) [15]
- WSD on the medical domain, using the UMLS meta-thesaurus [5,7,9]
- Word similarity using WordNet or Wikipedia graphs, state-of-the-art results [3,4,10]
- Named-Entity Disambiguation using the Wikipedia graph [10,13]
- Improvements on Information Retrieval [6,9,12] using WordNet or UMLS
- Word embeddings produced with random walks on WordNet and dimensionality reduction techniques, state-of-the-art in monolingual and cross-lingual word similarity [11,14,16] (see Downloads below)
UKB has been developed by the IXA group at the University of the Basque Country.
The current version of UKB is 3.2. You can download it here (older versions are here).
Mailing List
Please post any questions or problems you may have to the UKB mailing list.
Source code repository
The git source code repository is on GitHub.
Using git, you can get the whole repository by running:
- git clone https://github.com/asoroa/ukb.git
Downloads
Selected graphs:
- Click here to get graph relations of some versions of the English WordNet.
- Click here to get graph relations of some versions of the Spanish WordNet.
- Click here to get graph relations for English Wikipedia (04 April 2013 dump).
More graphs:
- English WordNet 3.0 plus gloss relations: here
- English WordNet 1.7 plus eXtended WordNet relations: here
- WordNet 3.0 ILI version with dictionaries in English, Spanish and Basque: here
- SyntagNet 1.0: here
- English Wikipedia: here
- Basque Wikipedia: here
Word vectors:
- Precompiled Personalized PageRank vectors for all WordNet lemmas (around 1.2G), useful to speed up similarity calculations. Click here to download.
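As a rough illustration of why precomputed vectors speed things up: once each lemma has a Personalized PageRank vector, similarity reduces to a vector comparison with no further graph walk. The sketch below uses cosine similarity, which is one common choice and an assumption here. The file format of the downloadable vectors is not described on this page, so the vectors are hypothetical sparse dictionaries with made-up vertex labels and weights.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: vertex -> weight)."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the shorter vector
    dot = sum(weight * v.get(vertex, 0.0) for vertex, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical precomputed vectors (illustrative values only).
car = {"vehicle#1": 0.5, "wheel#1": 0.3, "road#1": 0.2}
automobile = {"vehicle#1": 0.6, "wheel#1": 0.2, "engine#1": 0.2}
banana = {"fruit#1": 0.7, "plant#1": 0.3}

print(cosine(car, automobile) > cosine(car, banana))
```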
Monolingual word Embeddings:
- Embeddings for English WordNet 3.0 (plus gloss relations): text, binary [11]
- Concatenated embedding for Text Corpora and English WordNet 3.0 (plus gloss relations): text [14]
Bilingual word Embeddings:
- Bilingual Embeddings with Random Walks over Multilingual WordNets. Relevant material to reproduce the experiments in [16]: here
External Resources:
- Visit this page for additional relations extracted by the BulTreeBank Group within the QTLeap project.
References
[1] Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL-2009). Athens, Greece. (PDF)
[2] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2009. Knowledge-based WSD on specific domains: performing better than generic supervised WSD. Proceedings of IJCAI. Pasadena, USA. (PDF)
[3] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Proceedings of NAACL-HLT 09. Boulder, USA. (PDF)
[4] Eneko Agirre, Montse Cuadros, German Rigau and Aitor Soroa.
2010. Exploring Knowledge Bases for Similarity. Proceedings of
LREC 2010. Valletta, Malta. (PDF)
[5] Eneko Agirre, Aitor Soroa and Mark Stevenson. 2010. Graph-based Word Sense Disambiguation of Biomedical Documents. Bioinformatics, Oxford University Press. Vol. 26(22), pages 2889-2896. (DOI http://dx.doi.org/10.1093/bioinformatics/btq555)
[6] Arantxa Otegi, Xabier Arregi and Eneko Agirre. 2011. Query Expansion for IR using Knowledge-Based Relatedness. Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1467-1471. Thailand. ISBN 978-974-466-564-5. (PDF)
[7] Mark Stevenson, Eneko Agirre and Aitor Soroa. 2012. Exploiting Domain Information for Word Sense Disambiguation of Medical Documents. Journal of the American Medical Informatics Association. Vol. 19, Issue 2, pages 235-240. (DOI http://dx.doi.org/10.1136/amiajnl-2011-000415)
[8] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2013. Random Walks for Knowledge-Based Word Sense Disambiguation. Computational Linguistics, 40:1. ISSN 0891-2017. doi:10.1162/COLI_a_00164
[9] David Martinez, Arantxa Otegi, Aitor Soroa and Eneko Agirre. 2014.
Improving search over Electronic Health Records using UMLS-based query expansion through random walks.
Journal of Biomedical Informatics, vol. 51, pages 100-106, Elsevier. (PDF)
[10] Eneko Agirre, Ander Barrena and Aitor Soroa. 2015. Studying the
Wikipedia Hyperlink Graph for Relatedness and Disambiguation.
http://arxiv.org/abs/1503.01655 (See README for instructions to replicate results)
[11] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2015. Random Walks and Neural Network Language Models
on Knowledge Bases. Proceedings of NAACL-HLT. Denver, USA. (PDF, WordNet embeddings in text format, WordNet embeddings in binary format)
[12] Otegi A., Arregi X., Ansa O., Agirre E. 2015. Using Knowledge-Based Relatedness for Information Retrieval. Journal of Knowledge and Information Systems, Springer London, vol. 44, issue 3, pages 689-718. (final, preprint)
[13] Ander Barrena, Aitor Soroa and Eneko Agirre. 2015. Combining Mention Context and Hyperlinks from Wikipedia for Named Entity Disambiguation. Proceedings of STARSEM 2015. (PDF)
[14] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2016. Single or Multiple? Combining Word Representations
Independently Learned from Text and WordNet. Proceedings of AAAI. Phoenix, USA. (PDF, Concatenated embeddings)
[15] Eneko Agirre, Steve Neale, Michal Novak, Arantxa Otegi, Joao Silva, Kiril Simov and Roman Sudarikov. 2015. Report on pilot version of LRTs enhanced to support advanced crosslingual ambiguity resolution. Deliverable D5.6, Version 1.4, QTLeap Project. (PDF)
[16] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2018. Bilingual embeddings with random walks over multilingual wordnets. Knowledge-Based Systems (KNOSYS). (preprint, final, reproducibility material)
[17] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2018. The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD. NLP-OSS workshop at ACL (arXiv:1805.04277)
Acknowledgments
This work has been partially funded by the European Community in the
framework of ERA-NET CHIST-ERA Commission (project READERS) and
the European Commission (QTLEAP FP7-ICT-2013.4.1-610516).