An Approach for Named Entity Recognition in Poorly Structured Data

Nuno Freire, José Borbinha & Pável Calado

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7295)

Included in the following conference series:

Extended Semantic Web Conference

Abstract

This paper describes an approach to named entity recognition in structured data whose elements contain free text as values. We studied the recognition of the entity types person, location and organization in bibliographic data sets from a concrete wide digital library initiative. Our approach is based on conditional random field models, using features designed to perform named entity recognition in the absence of strong lexical evidence and exploiting the semantic context given by the data structure. The evaluation results indicate that, with the specialized features, named entity recognition can be done in free text within structured data with acceptable accuracy. Our approach achieved a maximum precision of 0.91 at 0.55 recall and a maximum recall of 0.82 at 0.77 precision. These results were always higher than those obtained with the Stanford Named Entity Recognizer, which was developed for grammatically well-formed text. We believe this level of quality allows the approach to support a wide range of information extraction applications in structured data.
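
To make the idea of features that do not rely on lexical evidence more concrete, the following is a minimal illustrative sketch in Python. It is not the authors' feature set, which the abstract does not list: the function names (`word_shape`, `record_features`), the feature keys, and the field name `creator` are all assumptions chosen for illustration. It shows how each token of a field value could be described by its coarse shape, its neighbours' shapes, and the name of the enclosing data element, producing per-token feature dictionaries of the kind a conditional random field toolkit typically consumes.

```python
from typing import Dict, List


def word_shape(token: str) -> str:
    """Collapse a token to a coarse shape: 'Lisboa' -> 'Xx', '1998' -> 'd'."""
    shape: List[str] = []
    for ch in token:
        if ch.isupper():
            symbol = "X"
        elif ch.islower():
            symbol = "x"
        elif ch.isdigit():
            symbol = "d"
        else:
            symbol = ch  # keep punctuation as-is
        # Squeeze runs of the same symbol so token length does not matter.
        if not shape or shape[-1] != symbol:
            shape.append(symbol)
    return "".join(shape)


def record_features(field: str, tokens: List[str]) -> List[Dict[str, str]]:
    """Build one feature dict per token of a single field value.

    `field` is the name of the enclosing data element (hypothetical example:
    'creator'); it stands in for the semantic context given by the structure.
    """
    features = []
    for i, token in enumerate(tokens):
        features.append({
            "field": field,                      # structural context
            "shape": word_shape(token),          # capitalization/digit pattern
            "prev_shape": word_shape(tokens[i - 1]) if i > 0 else "<s>",
            "next_shape": word_shape(tokens[i + 1]) if i < len(tokens) - 1 else "</s>",
            "position": "first" if i == 0 else "other",
        })
    return features


if __name__ == "__main__":
    for feats in record_features("creator", ["Freire", ",", "Nuno"]):
        print(feats)
```

The sketch deliberately omits the token text itself as a feature, to reflect the setting where surrounding words give little evidence; a real system would likely combine such features with others.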

Author information

Authors and Affiliations

INESC-ID/Instituto Superior Técnico, Technical University of Lisbon, Av. Rovisco Pais, 1049-001, Lisboa, Portugal
Nuno Freire, José Borbinha & Pável Calado
The European Library, National Library of the Netherlands, Willem-Alexanderhof 5, 2509 LK, The Hague, The Netherlands
Nuno Freire

Editor information

Editors and Affiliations

Institute AIFB, Karlsruhe Institute of Technology, Englerstrasse 11, 76131, Karlsruhe, Germany
Elena Simperl
CITEC, University of Bielefeld, Morgenbreede 39, 33615, Bielefeld, Germany
Philipp Cimiano
Siemens AG Österreich, Siemensstrasse 90, 1210, Vienna, Austria
Axel Polleres
Technical University of Madrid, C/ Severo Ochoa, 13, 28660, Boadilla del Monte, Madrid, Spain
Oscar Corcho
STLab, ISTC-CNR, Via Nomentana 56, 00161, Rome, Italy
Valentina Presutti

Cite this paper

Freire, N., Borbinha, J., Calado, P. (2012). An Approach for Named Entity Recognition in Poorly Structured Data. In: Simperl, E., Cimiano, P., Polleres, A., Corcho, O., Presutti, V. (eds) The Semantic Web: Research and Applications. ESWC 2012. Lecture Notes in Computer Science, vol 7295. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30284-8_55

DOI: https://doi.org/10.1007/978-3-642-30284-8_55
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30283-1
Online ISBN: 978-3-642-30284-8
eBook Packages: Computer Science, Computer Science (R0)
