TweetNorm: a benchmark for lexical normalization of Spanish tweets

Iñaki Alegria¹,
Nora Aranberri¹,
Pere R. Comas²,
Víctor Fresno³,
Pablo Gamallo⁴,
Lluis Padró²,
Iñaki San Vicente⁵,
Jordi Turmo² &
Arkaitz Zubiaga⁶


Abstract

The language used in social media is often characterized by an abundance of informal and non-standard writing. Normalizing this non-standard language can be crucial for facilitating subsequent textual processing, and consequently for boosting the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the systems submitted to the challenge, and examine the characteristics of those systems to identify the features that proved useful. The organization of this challenge has produced a benchmark for lexical normalization of social media that includes an evaluation framework and an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation have brought to light the types of words that the submitted systems handled best, and point to the main shortcomings to be addressed in future work.


Notes

Details about the workshop can be found at http://komunitatea.elhuyar.org/tweet-norm/.
http://nil.fdi.ucm.es/sepln2013/.
The term “ill-formed” has also been used in the literature to refer to these non-standard word forms. We opted for the term “non-standard word form” because some of the words that fall into this category, such as abbreviations or acronyms, are not necessarily misspellings.
http://creativecommons.org/licenses/by/3.0/legalcode.
http://dev.twitter.com/docs/api.
http://nlp.cs.upc.edu/freeling.
RAE, or Real Academia Española, is the institution responsible for regulating the Spanish language.
http://dev.twitter.com/terms/api-terms.
http://komunitatea.elhuyar.org/tweet-norm/files/2013/06/download_tweets.py.
http://komunitatea.elhuyar.org/tweet-norm/resources/#Downloads.
http://www.efe.com/.
Out of 20 initially registered participants, 13 groups sent results.
http://es.wikipedia.org.
http://aspell.net.
http://hunspell.sourceforge.net.
http://jazzy.sourceforge.net.
http://code.google.com/p/foma/.
http://code.google.com/p/phonetisaurus/.
http://www.opengrm.org.
http://www.speech.sri.com/projects/srilm/.
http://komunitatea.elhuyar.org/tweet-norm/.

Acknowledgments

We would like to thank all the members of the organizing committee. This work has been supported by the following projects: Spanish MICINN projects Tacardi (Grant No. TIN2012-38523-C02-01), Skater (Grant No. TIN2012-38584-C06-01), TextMESS2 (TIN2009-13391-C04-01), OntoPedia (Grant No. FFI2010-14986) and Holopedia (TIN2010-21128-C02-01); Xlike FP7 project (Grant No. FP7-ICT-2011.4.2-288342); UNED project (2012V/PUNED/0004); ENEUS-Marie Curie Actions (FP7/2012-2014 under REA Grant Agreement No. 302038); Celtic CDTI FEDER-INNTER-CONECTA project (Grant No. ITC-20113031); Research Network MA2VICMR (S-2009/TIC-1542); and HPCPLN (Grant No. EM13/041, Xunta de Galicia).

Author information

Authors and Affiliations

IXA. UPV/EHU, San Sebastian, Spain
Iñaki Alegria & Nora Aranberri
UPC, Barcelona, Spain
Pere R. Comas, Lluis Padró & Jordi Turmo
UNED, Madrid, Spain
Víctor Fresno
USC, Santiago de Compostela, Spain
Pablo Gamallo
Elhuyar, Usurbil, Spain
Iñaki San Vicente
University of Warwick, Coventry, UK
Arkaitz Zubiaga

Corresponding author

Correspondence to Iñaki San Vicente.

Appendix: List of unresolved OOV words

Table 6 lists the words from the corpus for which none of the systems found the correct variation. The left column gives each word as originally spelled in the corpus, and the right column gives the manually annotated correct variation.

Table 6 List of OOV words for which none of the participants found the correct variation

About this article

Cite this article

Alegria, I., Aranberri, N., Comas, P.R. et al. TweetNorm: a benchmark for lexical normalization of Spanish tweets. Lang Resources & Evaluation 49, 883–905 (2015). https://doi.org/10.1007/s10579-015-9315-6

Published: 15 August 2015
Issue Date: December 2015
DOI: https://doi.org/10.1007/s10579-015-9315-6
