Abstract
Users of major search engines and other web information retrieval services who look for information in the Basque language have a far from satisfactory experience: the services return only pages containing exact matches of the query terms and not their inflections (which is necessary for an agglutinative language like Basque), many of the results are in other languages (no search engine gives the option to restrict its results to Basque), etc. This paper proposes combining morphological query expansion and language-filtering words with the APIs of search engines as a very cost-effective way to build appropriate web search services for Basque. The implementation details of the methodology (which language-filtering words are most appropriate and how many of them to use, which inflections are most frequent and should be included in the morphological query expansion, etc.) have been determined through corpus-based studies. The resulting improvements have been measured in terms of precision and recall, both over corpora and in real web searches. Morphological query expansion can improve recall by up to 47 %, and language-filtering words can raise precision from 15 % to around 90 %, although with a loss in recall of about 30–35 %. The proposed methodology has already been used successfully in the Basque search service Elebila (http://www.elebila.eu) and the web-as-corpus tool CorpEus (http://www.corpeus.org), and the approach could also be applied to other morphologically rich or under-resourced languages.
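The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the inflected forms and filter words are illustrative examples rather than the lists selected by the paper's corpus studies, and the query syntax (quoted terms, an OR operator, implicit AND between terms) is an assumption about what a given search engine API accepts. The second function shows how precision and recall, the measures used in the evaluation, are computed from a set of retrieved results and a set of relevant ones.

```python
from typing import Hashable, Sequence, Set, Tuple


def build_query(surface_forms: Sequence[str], filter_words: Sequence[str]) -> str:
    """Build a query string from inflected forms and language-filtering words.

    The inflected forms are OR-ed together (morphological query expansion);
    the filter words are appended as additional required terms, so that
    pages not containing them (likely not Basque) are excluded.
    The operator syntax is an assumption, not a specific engine's API.
    """
    if not surface_forms:
        raise ValueError("at least one surface form is required")
    expansion = " OR ".join(f'"{form}"' for form in surface_forms)
    parts = [f"({expansion})"]
    parts.extend(f'"{word}"' for word in filter_words)
    return " ".join(parts)


def precision_recall(
    retrieved: Set[Hashable], relevant: Set[Hashable]
) -> Tuple[float, float]:
    """Return (precision, recall) for a set of retrieved and relevant items."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


if __name__ == "__main__":
    # Illustrative inflections of "etxe" (house) and common Basque words;
    # these lists are assumptions made for the example only.
    forms = ["etxe", "etxea", "etxeak", "etxean", "etxeko"]
    filters = ["eta", "da"]
    print(build_query(forms, filters))

    # Toy evaluation: 4 results retrieved, 3 relevant documents exist,
    # 2 of the retrieved results are relevant.
    print(precision_recall({1, 2, 3, 4}, {2, 3, 5}))
```

The sketch also makes the trade-off reported in the abstract visible: each added filter word is one more required term, which removes non-Basque pages (higher precision) but also removes Basque pages that happen not to contain it (lower recall).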
Notes
http://en.wikipedia.org/wiki/Inflection, date of consultation 11/26/2012.
Cite this article
Leturia, I., Gurrutxaga, A., Areta, N. et al. Morphological query expansion and language-filtering words for improving Basque web retrieval. Lang Resources & Evaluation 47, 425–448 (2013). https://doi.org/10.1007/s10579-012-9208-x