There are many opportunities to improve the interactivity of information retrieval systems beyond the ubiquitous search box. One idea is to use knowledge bases—e.g. controlled vocabularies, classification schemes, thesauri and ontologies—to organize, describe and navigate the information space. These resources are popular in libraries and specialist collections, but have proven too expensive and narrow to be applied to everyday web-scale search. Wikipedia has the potential to bring structured knowledge into more widespread use. This online, collaboratively generated encyclopaedia is one of the largest and most consulted reference works in existence. It is broader, deeper and more agile than the knowledge bases put forward to assist retrieval in the past. Rendering this resource machine-readable is a challenging task that has captured the interest of many researchers. Many see it as a key step required to break the knowledge acquisition bottleneck that crippled previous efforts. This ...
This paper describes ongoing research into obtaining and using knowledge bases to assist information retrieval. These structures are prohibitively expensive to obtain manually, yet automatic approaches have been researched for decades with limited success. This research investigates a potential shortcut: a way to provide knowledge bases automatically, without expecting computers to replace expert human indexers. Instead we aim to replace the professionals with thousands or even millions of amateurs: with the growing community of contributors who form the core of Web 2.0. Specifically we focus on Wikipedia, which represents a rich tapestry of topics and semantics and a huge investment of human effort and judgment. We show how this can be directly exploited to provide manually defined yet inexpensive knowledge bases that are specifically tailored to expose the topics, terminology and semantics of individual document collections. We are also concerned with how best to make these struct...
This paper describes a new technique for obtaining measures of semantic relatedness. Like other recent approaches, it uses Wikipedia to provide a vast amount of structured world knowledge about the terms of interest. Our system, the Wikipedia Link Vector Model or WLVM, is unique in that it does so using only the hyperlink structure of Wikipedia rather than its full textual content. To evaluate the algorithm we use a large, widely used test set of manually defined measures of semantic relatedness as our benchmark. This allows direct comparison of our system with other similar techniques.
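The abstract above does not give the WLVM formula, but the general idea of a link-vector relatedness measure can be sketched as follows: represent each article by its outgoing links, weight each link by an inverse-frequency term, and compare articles by cosine similarity. The link graph and weighting below are illustrative assumptions, not the paper's exact data or formula.

```python
from collections import Counter
from math import log, sqrt

# Toy link graph (article -> outgoing link targets); the names here are
# illustrative placeholders, not real Wikipedia link data.
LINKS = {
    "Cat": ["Mammal", "Pet", "Predator", "Fur"],
    "Dog": ["Mammal", "Pet", "Predator", "Wolf"],
    "Car": ["Engine", "Wheel", "Road"],
}

# How many articles link to each target, for an inverse-frequency weight.
INLINKS = Counter(t for targets in LINKS.values() for t in targets)
N = len(LINKS)

def link_vector(article):
    # Weight each outgoing link by log(N / inlink count), so links to
    # widely cited targets contribute less (a tf-idf-style assumption).
    return {t: log(N / INLINKS[t]) for t in LINKS[article]}

def relatedness(a, b):
    # Cosine similarity between the two articles' weighted link vectors.
    u, v = link_vector(a), link_vector(b)
    dot = sum(w * v[t] for t, w in u.items() if t in v)
    norm = sqrt(sum(w * w for w in u.values())) * sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0
```

On this toy graph, "Cat" and "Dog" share several link targets and score well above zero, while "Cat" and "Car" share none and score zero, which is the qualitative behavior a link-only relatedness measure is after.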
Wikipedia is a goldmine of information; not just for its many readers, but also for the growing community of researchers who recognize it as a resource of exceptional scale and utility. It represents a vast investment of manual effort and judgment: a huge, constantly evolving tapestry of concepts and relations that is being applied to a host of tasks. This
This paper introduces HMpara, a new search engine that aims to make Wikipedia easier to explore. It works on top of the encyclopedia's existing link structure, abstracting away from document content and allowing users to navigate the resource at a higher level. It utilizes semantic relatedness measures to emphasize articles and connections that are most likely to be of interest, visualization to expose the structure of how the available information is organized, and lightweight information extraction to explain itself.
What would it take to provide a congenial and comfortable environment for finding and reading books in a digital library? To locate information we need algorithms that extract semantic metadata in forms such as keyphrases, with accuracy and consistency comparable to human indexers. To support this we need comprehensive, detailed thesauri, automatically created, that embody contemporary language and usage. To emulate and enjoy the serendipitous adventures found in real libraries and bookstores we need browsing environments that provide readers with multiple clues in parallel: keyphrases, text excerpts, and supplementary knowledge structures—as well as the documents themselves. For readers to cherish and enjoy individual works we need to transcend the bland reading environment provided by the web by recreating the subjective impact and pleasurable experience of interacting with real books. This paper describes research that aims to achieve these goals.
We propose a new method for extending a domain-specific thesaurus with valuable information from Wikipedia. The main obstacle is to map thesaurus concepts to the correct Wikipedia articles. Given the concept name, we first identify candidate mappings by analyzing article titles, their redirects and disambiguation pages. Then, for each candidate, we compute a link-based similarity score to all mappings of context terms related to this concept. The article with the highest score is then used to augment the thesaurus concept. It is the source for an extended gloss explaining the concept's meaning, synonymous expressions that can be used as additional non-descriptors in the thesaurus, translations of the concept into other languages, and new domain-relevant concepts.
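The selection step described above (score each candidate article against the articles for related context terms, keep the highest scorer) can be sketched in miniature. The link data and the Jaccard-style overlap score below are assumptions for illustration; the paper's actual link-based similarity formula is not given in the abstract.

```python
# Toy disambiguation: map a thesaurus concept to the candidate Wikipedia
# article whose links overlap most with the articles of its context terms.
# Article names and link sets are hypothetical examples.
LINKS = {
    "Java (programming language)": {"Class (computing)", "Bytecode", "Compiler"},
    "Java (island)": {"Indonesia", "Volcano", "Jakarta"},
    "Compiler": {"Bytecode", "Class (computing)", "Parser"},
    "Bytecode": {"Compiler", "Virtual machine"},
}

def overlap(a, b):
    # Jaccard overlap of the two articles' link sets; a stand-in for the
    # paper's link-based similarity score.
    la, lb = LINKS[a], LINKS[b]
    return len(la & lb) / len(la | lb)

def disambiguate(candidates, context_articles):
    # Pick the candidate with the highest total similarity to the
    # articles already mapped for the concept's context terms.
    return max(candidates,
               key=lambda c: sum(overlap(c, ctx) for ctx in context_articles))

best = disambiguate(["Java (programming language)", "Java (island)"],
                    ["Compiler", "Bytecode"])
```

With programming-related context articles, the programming-language sense wins, which mirrors the intended effect: context terms pull the mapping toward the domain-appropriate article.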
Most information workers query digital libraries many times a day. Yet people have little opportunity to hone their skills in a controlled environment, or compare their performance with others in an objective way. Conversely, although search engine logs record how users evolve queries, they lack crucial information about the user's intent. This paper describes an environment for exploratory query expansion that pits users against each other and lets them compete, and practice, in their own time and on their own workstation. The system captures query evolution behavior on predetermined information-seeking tasks. It is publicly available, and the code is open source so that others can set up their own competitive environments.
Thesauri are useful knowledge structures for assisting information retrieval. Yet their production is labor-intensive, and few domains have comprehensive thesauri that cover domain-specific concepts and contemporary usage. One approach, which has been attempted without much success for decades, is to seek statistical natural language processing algorithms that work on free text. Instead, we propose to replace costly professional indexers with
Proceedings of the ACM International Conference on Digital Libraries, 2008
… AAAI Workshop on Wikipedia and Artificial Intelligence …, 2008