On the problem of Wiki texts indexing

  • Artificial Intelligence
Journal of Computer and Systems Sciences International

Abstract

Wiki pages, a new type of document, are spreading rapidly across the Internet. This shows not only in the growing number of such pages but also in the popularity of wiki projects, Wikipedia in particular, so the problem of parsing wiki texts is becoming increasingly topical. A new method for indexing Wikipedia texts in three languages (Russian, English, and German) is proposed and implemented. The architecture of the indexing system, which includes the software components GATE and Lemmatizer, is considered. The rules for converting wiki texts into natural-language texts are described. Index databases are constructed for the Russian Wikipedia and the Simple English Wikipedia, and the validity of Zipf's laws is tested on both.
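
As an illustration of what testing Zipf's law on a corpus can involve, the following is a minimal sketch, not the authors' procedure. It assumes the common rank-frequency form of the law (word frequency roughly proportional to 1/rank) and estimates the slope of log(frequency) against log(rank) by ordinary least squares; a slope near -1 is consistent with the law. The function name `zipf_slope` and the regex tokenizer are assumptions made for this example. The indexing system described above works with lemmas produced by Lemmatizer, whereas this sketch counts raw lowercase tokens.

```python
import math
import re
from collections import Counter


def zipf_slope(text):
    """Return the least-squares slope of log(frequency) vs. log(rank).

    Hypothetical helper for illustration: tokens are raw lowercase
    words matched by a regex, not lemmas.
    """
    words = re.findall(r"\w+", text.lower())
    # Frequencies sorted from most to least common; rank 1 is the top word.
    freqs = sorted(Counter(words).values(), reverse=True)
    if len(freqs) < 2:
        raise ValueError("need at least two distinct words")

    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]

    # Ordinary least squares for the line y = a + slope * x.
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx
```

On a real index the input would be the concatenated plain text obtained after wiki markup has been removed. A least-squares fit over all ranks is only a rough check, since the low-frequency tail tends to dominate the estimate.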

Additional information

Original Russian Text © A.A. Krizhanovsky, A.V. Smirnov, 2009, published in Izvestiya Akademii Nauk. Teoriya i Sistemy Upravleniya, 2009, No. 4, pp. 121–129.

About this article

Cite this article

Krizhanovsky, A.A., Smirnov, A.V. On the problem of Wiki texts indexing. J. Comput. Syst. Sci. Int. 48, 616–624 (2009). https://doi.org/10.1134/S1064230709040157
