There are many opportunities to improve the interactivity of information retrieval systems beyond the ubiquitous search box. One idea is to use knowledge bases—e.g. controlled vocabularies, classification schemes, thesauri and ontologies—to organize, describe and navigate the information space. These resources are popular in libraries and specialist collections, but have proven too expensive and narrow to be applied to everyday web-scale search. Wikipedia has the potential to bring structured knowledge into more widespread use. This online, collaboratively generated encyclopaedia is one of the largest and most consulted reference works in existence. It is broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. Rendering this resource machine-readable is a challenging task that has captured the interest of many researchers. Many see it as a key step required to break the knowledge acquisition bottleneck that crippled previous efforts. This ...
This paper describes ongoing research into obtaining and using knowledge bases to assist information retrieval. These structures are prohibitively expensive to obtain manually, yet automatic approaches have been researched for decades with limited success. This research investigates a potential shortcut: a way to provide knowledge bases automatically, without expecting computers to replace expert human indexers. Instead we aim to replace the professionals with thousands or even millions of amateurs: with the growing community of contributors who form the core of Web 2.0. Specifically we focus on Wikipedia, which represents a rich tapestry of topics and semantics and a huge investment of human effort and judgment. We show how this can be directly exploited to provide manually defined yet inexpensive knowledge bases that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We are also concerned with how best to make these struct...
This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide a vast amount of structured world knowledge about the terms of interest. Our system, the Wikipedia Link Vector Model or WLVM, is unique in that it does so using only the hyperlink structure of Wikipedia rather than its full textual content. To evaluate the algorithm we use a large, widely used test set of manually defined measures of semantic relatedness as our benchmark. This allows direct comparison of our system with other similar techniques.
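The abstract above does not give the WLVM formula, but the general idea of a link-vector relatedness measure can be sketched as follows: represent each article by its outgoing links, weight each link by an inverse-frequency term, and compare articles by cosine similarity. The link graph and weighting below are illustrative assumptions, not the paper's exact data or formula.

```python
from collections import Counter
from math import log, sqrt

# Toy link graph (article -> outgoing link targets); the names here are
# illustrative placeholders, not real Wikipedia link data.
LINKS = {
    "Cat": ["Mammal", "Pet", "Predator", "Fur"],
    "Dog": ["Mammal", "Pet", "Predator", "Wolf"],
    "Car": ["Engine", "Wheel", "Road"],
}

# How many articles link to each target, for an inverse-frequency weight.
INLINKS = Counter(t for targets in LINKS.values() for t in targets)
N = len(LINKS)

def link_vector(article):
    # Weight each outgoing link by log(N / inlink count), so links to
    # widely cited targets contribute less (a tf-idf-style assumption).
    return {t: log(N / INLINKS[t]) for t in LINKS[article]}

def relatedness(a, b):
    # Cosine similarity between the two articles' weighted link vectors.
    u, v = link_vector(a), link_vector(b)
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

On this toy graph, "Cat" and "Dog" share several link targets and score well above zero, while "Cat" and "Car" share none and score zero, which is the qualitative behavior a link-only relatedness measure is after.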
Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This
This paper introduces HMpara, a new search engine that aims to make Wikipedia easier to explore. It works on top of the encyclopedia's existing link structure, abstracting away from document content and allowing users to navigate the resource at a higher level. It utilizes semantic relatedness measures to emphasize articles and connections that are most likely to be of interest, visualization to expose the structure of how the available information is organized, and lightweight information extraction to explain itself.
What would it take to provide a congenial and comfortable environment for finding and reading books in a digital library? To locate information we need algorithms that extract semantic metadata in forms such as keyphrases, with accuracy and consistency comparable to human indexers. To support this we need comprehensive, detailed thesauri, automatically created, that embody contemporary language and usage. To emulate and enjoy the serendipitous adventures found in real libraries and bookstores we need browsing environments that provide readers with multiple clues in parallel: keyphrases, text excerpts, and supplementary knowledge structures—as well as the documents themselves. For readers to cherish and enjoy individual works we need to transcend the bland reading environment provided by the web by recreating the subjective impact and pleasurable experience of interacting with real books. This paper describes research that aims to achieve these goals.
We propose a new method for extending a domain-specific thesaurus with valuable information from Wikipedia. The main obstacle is to map thesaurus concepts to the correct Wikipedia articles. Given the concept name, we first identify candidate mappings by analyzing article titles, their redirects and disambiguation pages. Then, for each candidate, we compute a link-based similarity score to all mappings of context terms related to this concept. The article with the highest score is then used to augment the thesaurus concept. It is the source for an extended gloss explaining the concept's meaning, synonymous expressions that can be used as additional non-descriptors in the thesaurus, translations of the concept into other languages, and new domain-relevant concepts.
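The selection step described above (score each candidate article against the articles for related context terms, keep the highest scorer) can be sketched in miniature. The link data and the Jaccard-style overlap score below are assumptions for illustration; the paper's actual link-based similarity formula is not given in the abstract.

```python
# Toy disambiguation: map a thesaurus concept to the candidate Wikipedia
# article whose links overlap most with the articles of its context terms.
# Article names and link sets are hypothetical examples.
LINKS = {
    "Java (programming language)": {"Class (computing)", "Bytecode", "Compiler"},
    "Java (island)": {"Indonesia", "Volcano", "Jakarta"},
    "Compiler": {"Bytecode", "Class (computing)", "Parser"},
    "Bytecode": {"Compiler", "Virtual machine"},
}

def overlap(a, b):
    # Jaccard overlap of the two articles' link sets; a stand-in for the
    # paper's link-based similarity score.
    la, lb = LINKS[a], LINKS[b]
    return len(la & lb) / len(la | lb)

def disambiguate(candidates, context_articles):
    # Pick the candidate with the highest total similarity to the
    # articles already mapped for the concept's context terms.
    return max(candidates,
               key=lambda c: sum(overlap(c, ctx) for ctx in context_articles))

best = disambiguate(["Java (programming language)", "Java (island)"],
                    ["Compiler", "Bytecode"])
```

With programming-related context articles, the programming-language sense wins, which mirrors the intended effect: context terms pull the mapping toward the domain-appropriate article.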
Most information workers query digital libraries many times a day. Yet people have little opportunity to hone their skills in a controlled environment, or compare their performance with others in an objective way. Conversely, although search engine logs record how users evolve queries, they lack crucial information about the user's intent. This paper describes an environment for exploratory query expansion that pits users against each other and lets them compete, and practice, in their own time and on their own workstation. The system captures query evolution behavior on predetermined information-seeking tasks. It is publicly available, and the code is open source so that others can set up their own competitive environments.
Thesauri are useful knowledge structures for assisting information retrieval. Yet their production is labor-intensive, and few domains have comprehensive thesauri that cover domain-specific concepts and contemporary usage. One approach, which has been attempted without much success for decades, is to seek statistical natural language processing algorithms that work on free text. Instead, we propose to replace costly professional indexers with
Proceedings of the ACM International Conference on Digital Libraries, 2008
… AAAI Workshop on Wikipedia and Artificial Intelligence …, 2008