Abstract
The language used in social media is often characterized by an abundance of informal and non-standard writing. Normalizing this non-standard language can be crucial to facilitate subsequent textual processing and, consequently, to help boost the performance of natural language processing tools applied to social media text. In this paper we present a benchmark for lexical normalization of social media posts, specifically for tweets in Spanish. We describe the tweet normalization challenge we organized recently, analyze the performance achieved by the different systems submitted to the challenge, and delve into the characteristics of the systems to identify the features that were useful. The organization of this challenge has led to the production of a benchmark for lexical normalization of social media, including an evaluation framework as well as an annotated corpus of Spanish tweets, TweetNorm_es, which we make publicly available. The creation of this benchmark and the evaluation have brought to light the types of words that the submitted systems handled best, and point to the main shortcomings to be addressed in future work.
Notes
Details about the workshop can be found at http://komunitatea.elhuyar.org/tweet-norm/.
The term “ill-formed” has also been used in the literature to refer to these non-standard word forms. We opted for the term “non-standard word form” because some of the words that fall into this category, such as abbreviations or acronyms, are not necessarily misspellings.
RAE, or Real Academia Española, is the institution responsible for regulating the Spanish language.
Out of 20 initially registered participants, 13 groups sent results.
References
Ageno, A., Comas, P. R., Padró, L., & Turmo, J. (2013). The talp-upc approach to tweet-norm 2013. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Alegria, I., Etxeberria, I., & Labaka, G. (2013). Una cascada de transductores simples para normalizar tweets. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Beaufort, R., Roekhaut, S., Cougnon, L. A., & Fairon, C. (2010). A hybrid rule/model-based finite-state framework for normalizing SMS messages. In Proceedings of the 48th annual meeting of the Association for Computational Linguistics (ACL) (pp. 770–779), Uppsala, Sweden.
Chakrabarti, D., & Punera, K. (2011). Event summarization using tweets. In Proceedings of the fifth International Conference on Weblogs and Social Media (ICWSM).
Costa-Jussà, M. R., & Banchs, R. E. (2013). Automatic normalization of short texts by combining statistical and rule-based techniques. Language Resources and Evaluation, 47(1), 179–193.
Cotelo-Moya, J. M., Cruz, F. L., & Troyano, J. A. (2013). Resource-based lexical approach to tweet-norm task. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Eisenstein, J. (2013). What to do about bad language on the internet. In Proceedings of the 2013 conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT) (pp. 359–369).
Gamallo, P., Garcia, M., & Pichel, J. R. (2013). A method to lexical normalisation of tweets. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Go, A., Bhayani, R., & Huang, L. (2009). Twitter sentiment classification using distant supervision (pp. 1–12). CS224N Project Report, Stanford.
Han, B., & Baldwin, T. (2011). Lexical normalisation of short text messages: Makn sens a #twitter. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics (ACL) (pp. 368–378).
Han, B., Cook, P., & Baldwin, T. (2013). Lexical normalisation for social media text. ACM Transactions on Intelligent Systems and Technology, 43(1), 15–27.
Han, B., Cook, P., & Baldwin, T. (2013). unimelb: Spanish text normalisation. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Hong, L., & Davison, B. D. (2010). Empirical study of topic modeling in twitter. In Proceedings of the first workshop on social media analytics (pp. 80–88), ACM.
Hulden, M., & Francom, J. (2013). Weighted and unweighted transducers for tweet normalization. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Inouye, D., & Kalita, J. K. (2011). Comparing twitter summarization algorithms for multiple post summaries. In Proceedings of the IEEE third international conference on social computing (SocialCom) (pp. 298–306), IEEE.
Jiang, L., Yu, M., Zhou, M., Liu, X., & Zhao, T. (2011). Target-dependent twitter sentiment classification. In Proceedings of the 49th annual meeting of the Association for Computational Linguistics (ACL) (pp. 151–160).
Kaufmann, J., & Kalita, J. (2010). Syntactic normalization of twitter messages. In Proceedings of the international conference on natural language processing, Kharagpur, India.
Kullback, S., & Leibler, R. A. (1951). On information and sufficiency. The Annals of Mathematical Statistics, 22(1), 79–86.
Lin, J., Snow, R., & Morgan, W. (2011). Smoothing techniques for adaptive online language models: topic tracking in tweet streams. In Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining (KDD) (pp. 422–429), ACM.
Ling, W., Dyer, C., Black, A. W., & Trancoso, I. (2013). Paraphrasing 4 microblog normalization. In Proceedings of the 2013 conference on empirical methods in natural language processing (EMNLP) (pp. 73–84).
Liu, F., Weng, F., & Jiang, X. (2012). A broad-coverage normalization system for social media language. In Proceedings of the 50th annual meeting of the association for computational linguistics: Long papers (vol. 1, pp. 1035–1044), Association for Computational Linguistics.
Liu, X., Wei, F., Zhang, S., & Zhou, M. (2013). Named entity recognition for tweets. ACM Transactions on Intelligent Systems and Technology, 4(1), 3.
Montejo-Ráez, A., Díaz-Galiano, M., Martínez-Cámara, E., Martín-Valdivia, T., García-Cumbreras, M. A., & Ureña-López, A. (2013). Sinai at twitter-normalization 2013. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Mosquera-López, A., & Moreda, P. (2013). Dlsi en tweet-norm 2013: Normalización de tweets en español. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Muñoz-García, O., Suárez, S. V., & Bel, N. (2013). Exploiting web-based collective knowledge for micropost normalisation. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Oliva, J., Serrano, J. I., del Castillo, M. D., & Iglesias, Á. (2013). A SMS normalization system integrating multiple grammatical resources. Natural Language Engineering, 19(1), 121–141.
Padró, L., & Stanilovsky, E. (2012). Freeling 3.0: Towards wider multilinguality. In Proceedings of the 8th international conference on language resources and evaluation (LREC).
Porta, J., & Sancho, J. L. (2013). Word normalization in twitter using finite-state transducers. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Ruiz, P., Cuadros, M., & Etchegoyhen, T. (2013). Lexical normalization of Spanish tweets with preprocessing rules, domain-specific edit distances, and language models. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Saralegi, X., & San-Vicente, I. (2013). Elhuyar at tweet-norm 2013. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Vilares, J., Alonso, M. A., & Vilares, D. (2013). Prototipado rápido de un sistema de normalización de tuits: Una aproximación léxica. In Proceedings of the tweet normalization workshop at the conference of the Spanish Society for Natural Language Processing (SEPLN).
Villena Román, J., Lana Serrano, S., Martínez Cámara, E., & González Cristóbal, J. C. (2013). TASS-workshop on sentiment analysis at SEPLN. In Proceedings of the Spanish Society for Natural Language Processing (SEPLN).
Wang, A., Kan, M. Y., Andrade, D., Onishi, T., & Ishikawa, K. (2013). Chinese informal word normalization: An experimental study. In Proceedings of the Sixth International Joint Conference on Natural Language Processing, 13, 127–135.
Wei, Z., Zhou, L., Li, B., Wong, K. F., Gao, W., & Wong, K. F. (2011). Exploring tweets normalization and query time sensitivity for twitter search. In Proceedings of the text REtrieval conference (TREC).
Acknowledgments
We would like to thank all the members of the organizing committee. This work has been supported by the following projects: Spanish MICINN projects Tacardi (Grant No. TIN2012-38523-C02-01), Skater (Grant No. TIN2012-38584-C06-01), TextMESS2 (TIN2009-13391-C04-01), OntoPedia (Grant No. FFI2010-14986) and Holopedia (TIN2010-21128-C02-01); Xlike FP7 project (Grant No. FP7-ICT-2011.4.2-288342); UNED project (2012V/PUNED/0004); ENEUS-Marie Curie Actions (FP7/2012-2014 under REA Grant Agreement No. 302038); Celtic CDTI FEDER-INNTER-CONECTA project (Grant No. ITC-20113031); Research Network MA2VICMR (S-2009/TIC-1542); and HPCPLN (Grant No. EM13/041, Xunta de Galicia).
Appendix: List of unresolved OOV words
Table 6 lists the words from the corpus for which none of the systems found the correct variation. Each entry shows the word as originally spelled in the corpus in the left column, and the manually annotated correct variation in the right column.
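As a rough illustration of how such a list could be derived, the sketch below collects the gold pairs that no system output matches. It is not the organizers' evaluation code: the function name `unresolved_oov`, the dictionary-based data layout, the toy word pairs, and the use of exact string matching are all assumptions made for this example. A real evaluation would key each OOV word by tweet and position, since the same surface form can have different corrections in different contexts.

```python
from typing import Dict, List


def unresolved_oov(
    gold: Dict[str, str],
    system_outputs: List[Dict[str, str]],
) -> Dict[str, str]:
    """Return the gold (original -> correct) pairs that no system got right.

    Assumption: each system output maps an original word form to the single
    normalization it proposed, and correctness is an exact string match.
    """
    unresolved = {}
    for original, correct in gold.items():
        # A word is resolved if at least one system proposed the gold form.
        resolved = any(out.get(original) == correct for out in system_outputs)
        if not resolved:
            unresolved[original] = correct
    return unresolved


# Hypothetical toy data, not taken from the TweetNorm_es corpus.
gold = {"q": "que", "tb": "también", "xfa": "por favor"}
systems = [
    {"q": "que", "tb": "tb", "xfa": "xfa"},
    {"q": "q", "tb": "también", "xfa": "por fa"},
]

# Left column: original spelling; right column: annotated correct variation.
for original, correct in unresolved_oov(gold, systems).items():
    print(f"{original}\t{correct}")
```

In this toy setup only "xfa" remains unresolved, because each of the other two words is normalized correctly by at least one of the two hypothetical systems.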
Cite this article
Alegria, I., Aranberri, N., Comas, P.R. et al. TweetNorm: a benchmark for lexical normalization of Spanish tweets. Lang Resources & Evaluation 49, 883–905 (2015). https://doi.org/10.1007/s10579-015-9315-6
DOI: https://doi.org/10.1007/s10579-015-9315-6