UKB: Graph Based Word Sense Disambiguation and Similarity

UKB is a collection of programs for performing graph-based Word Sense Disambiguation (WSD) and lexical similarity/relatedness using a pre-existing knowledge base. UKB applies random walks, e.g. Personalized PageRank, on the Knowledge Base (KB) graph to rank the vertices according to the given context. It includes tools to produce graphs from KBs like WordNet.
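As a rough illustration of the idea (not UKB's actual implementation), the sketch below runs Personalized PageRank by power iteration on a tiny hand-made graph. The graph, the vertex names such as `bank#finance`, the function name `personalized_pagerank`, and the damping value are all assumptions made for this example. The teleport mass is placed on the concept of a context word, and the two candidate senses of "bank" are then compared by their final rank.

```python
def personalized_pagerank(graph, seeds, damping=0.85, iterations=50):
    """Power iteration for Personalized PageRank.

    graph: dict mapping each vertex to the list of its neighbours.
    seeds: dict mapping context vertices to (unnormalised) teleport weights.
    Assumes every vertex has at least one neighbour; a vertex without
    neighbours would simply lose its mass in this simplified version.
    """
    total = sum(seeds.values())
    teleport = {v: seeds.get(v, 0.0) / total for v in graph}
    rank = dict(teleport)
    for _ in range(iterations):
        # Teleport step: return to the context vertices.
        new_rank = {v: (1.0 - damping) * teleport[v] for v in graph}
        # Walk step: spread each vertex's mass evenly over its neighbours.
        for v, neighbours in graph.items():
            if not neighbours:
                continue
            share = damping * rank[v] / len(neighbours)
            for u in neighbours:
                new_rank[u] += share
        rank = new_rank
    return rank


# Hypothetical toy knowledge base (undirected relations between concepts).
toy_graph = {
    "bank#finance": ["money#1", "loan#1"],
    "bank#river": ["water#1", "shore#1"],
    "money#1": ["bank#finance", "loan#1", "entity#1"],
    "loan#1": ["bank#finance", "money#1"],
    "water#1": ["bank#river", "shore#1", "entity#1"],
    "shore#1": ["bank#river", "water#1"],
    "entity#1": ["money#1", "water#1"],
}

# Context: the word "money" appears next to the ambiguous word "bank".
ranks = personalized_pagerank(toy_graph, seeds={"money#1": 1.0})
best_sense = max(["bank#finance", "bank#river"], key=ranks.get)
print(best_sense)
```

With the teleport mass on `money#1`, the financial sense of "bank" receives more rank than the river sense, which is the behaviour the ranking step relies on.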

UKB has been evaluated in several tasks:

WSD using WordNet (English), out-of-the-box state-of-the-art results [17,8,1].
WSD on specific domains [2]
WSD on several languages (Basque, Bulgarian, Portuguese, Spanish) [15]
WSD on the medical domain, using the UMLS meta-thesaurus [5,7,9]
Word similarity using WordNet or Wikipedia graphs, state-of-the-art results [3,4,10]
Named-Entity Disambiguation using the Wikipedia graph [10,13]
Improvements on Information Retrieval [6,9,12] using WordNet or UMLS
Word Embeddings produced with random walks on WordNet and dimensionality reduction techniques, state-of-the-art in monolingual and cross-lingual word similarity [11,14,16]. (See downloads below)

UKB has been developed by the IXA group at the University of the Basque Country.

The current version of UKB is 3.2. You can download it here (older versions are here).

News:

05.04.2020 SyntagNet relations for UKB here.

Mailing List

Please post any questions or problems to the following mailing list: UKB mailing list

Source code repository

The source code is hosted on GitHub. Using git, you can get the whole repository by running:

git clone https://github.com/asoroa/ukb.git

Downloads

Selected graphs:

Click here to get graph relations of some versions of the English WordNet.
Click here to get graph relations of some versions of the Spanish WordNet.
Click here to get graph relations for English Wikipedia (04 April 2013 dump).

More graphs:

English WordNet 3.0 plus gloss relations: here
English WordNet 1.7 plus eXtended WordNet relations: here
WordNet 3.0 ILI version with dictionaries in English, Spanish and Basque: here
SyntagNet 1.0: here

English Wikipedia: here
Basque Wikipedia: here

Word vectors:

Precompiled Personalized PageRank vectors for all WordNet lemmas (around 1.2G), useful to speed up similarity calculations. Click here to download.
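One common way to turn such vectors into a similarity score is to take the cosine between the two words' Personalized PageRank distributions. The sketch below assumes each vector has already been loaded into a sparse dict from concept identifier to probability; the on-disk format of the precompiled vectors is not described here, so loading is left out, and the vectors, identifiers, and the function name `cosine` are made up for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical PPR vectors (concept id -> probability mass), not real data.
ppr_car = {"vehicle#1": 0.5, "wheel#1": 0.3, "road#1": 0.2}
ppr_automobile = {"vehicle#1": 0.6, "wheel#1": 0.2, "engine#1": 0.2}
ppr_banana = {"fruit#1": 0.7, "plant#1": 0.3}

sim_close = cosine(ppr_car, ppr_automobile)
sim_far = cosine(ppr_car, ppr_banana)
print(sim_close > sim_far)
```

Keeping the vectors sparse matters in practice, since a distribution over all WordNet concepts has most of its mass on a small neighbourhood of the word's senses.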

Monolingual word embeddings:

Embeddings for English WordNet 3.0 (plus gloss relations): text, binary [11]
Concatenated embedding for Text Corpora and English WordNet 3.0 (plus gloss relations): text [14]

Bilingual word embeddings:

Bilingual Embeddings with Random Walks over Multilingual WordNets. Relevant material to reproduce the experiments in [16]: here

External Resources:

Visit this page for additional relations extracted by the BulTreeBank Group within the QTLeap project.

References

[1] Eneko Agirre and Aitor Soroa. 2009. Personalizing PageRank for Word Sense Disambiguation. Proceedings of the 12th conference of the European chapter of the Association for Computational Linguistics (EACL-2009). Athens, Greece. (PDF)

[2] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2009. Knowledge-based WSD and specific domains: performing over supervised WSD. Proceedings of IJCAI. Pasadena, USA. (PDF)

[3] Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius Pasca and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Proceedings of NAACL-HLT 09. Boulder, USA. (PDF)

[4] Eneko Agirre, Montse Cuadros, German Rigau and Aitor Soroa. 2010. Exploring Knowledge Bases for Similarity. Proceedings of LREC 2010. Valletta, Malta. (PDF)

[5] Eneko Agirre, Aitor Soroa and Mark Stevenson. 2010. Graph-based Word Sense Disambiguation of Biomedical Documents. Bioinformatics, Oxford University Press. Vol. 26(22), pages 2889-2896. (DOI http://dx.doi.org/10.1093/bioinformatics/btq555)

[6] Arantxa Otegi, Xabier Arregi and Eneko Agirre. 2011. Query Expansion for IR using Knowledge-Based Relatedness. Proceedings of the 5th International Joint Conference on Natural Language Processing, pages 1467-1471. Thailand. ISBN 978-974-466-564-5. (PDF)

[7] Mark Stevenson, Eneko Agirre and Aitor Soroa. 2012. Exploiting Domain Information for Word Sense Disambiguation of Medical Documents. Journal of the American Medical Informatics Association. Vol. 19, Issue 2, pages 235-240. (DOI http://dx.doi.org/10.1136/amiajnl-2011-000415)

[8] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2013. Random Walks for Knowledge-Based Word Sense Disambiguation. Computational Linguistics. 40:1. ISSN 0891-2017. doi:10.1162/COLI_a_00164

[9] David Martinez, Arantxa Otegi, Aitor Soroa and Eneko Agirre. 2014. Improving search over Electronic Health Records using UMLS-based query expansion through random walks. Journal of Biomedical Informatics, vol. 51, pages 100-106, Elsevier. (PDF)

[10] Eneko Agirre, Ander Barrena and Aitor Soroa. 2015. Studying the Wikipedia Hyperlink Graph for Relatedness and Disambiguation. http://arxiv.org/abs/1503.01655 (See README for instructions to replicate results)

[11] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2015. Random Walks and Neural Network Language Models on Knowledge Bases. Proceedings of NAACL-HLT. Denver, USA. (PDF, WordNet embeddings in text format, WordNet embeddings in binary format)

[12] Otegi A., Arregi X., Ansa O., Agirre E. 2015. Using Knowledge-Based Relatedness for Information Retrieval. Journal of Knowledge and Information Systems, Springer London, vol. 44, issue 3, pages 689-718. (final, preprint)

[13] Ander Barrena, Aitor Soroa and Eneko Agirre. 2015. Combining Mention Context and Hyperlinks from Wikipedia for Named Entity Disambiguation. Proceedings of STARSEM 2015. (PDF)

[14] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2016. Single or Multiple? Combining Word Representations Independently Learned from Text and WordNet. Proceedings of AAAI. Phoenix, USA. (PDF, Concatenated embeddings)

[15] Eneko Agirre, Steve Neale, Michal Novak, Arantxa Otegi, Joao Silva, Kiril Simov and Roman Sudarikov. 2015. Report on pilot version of LRTs enhanced to support advanced crosslingual ambiguity resolution. Deliverable D5.6, Version 1.4, QTLeap Project. (PDF)

[16] Josu Goikoetxea, Eneko Agirre and Aitor Soroa. 2018. Bilingual embeddings with random walks over multilingual wordnets. Knowledge-Based Systems (KNOSYS). (preprint, final, reproducibility material)

[17] Eneko Agirre, Oier Lopez de Lacalle and Aitor Soroa. 2018. The risk of sub-optimal use of Open Source NLP Software: UKB is inadvertently state-of-the-art in knowledge-based WSD. NLP-OSS workshop at ACL (arXiv:1805.04277)

Acknowledgments

This work has been partially funded by the European Community in the framework of ERA-NET CHIST-ERA Commission (project READERS) and the European Commission (QTLEAP FP7-ICT-2013.4.1-610516).