Topic structure modeling

2002, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '02

In this paper, we present a method based on document probes to quantify and diagnose topic structure, distinguishing topics as monolithic, structured, or diffuse. The method also yields a structure analysis that can be used directly to optimize filter (classifier) creation. Preliminary results illustrate the predictive value of the approach on TREC/Reuters-96 topics.

INTRODUCTION

User information needs, typically expressed in terms of a short description or a set of example documents, tend to vary from very focused, such as "find me information about product XYZ," to very broad, such as "keep me posted on world politics." Traditionally, both types of information needs have been modeled using single or monolithic filters [1] [2]. These approaches provide good levels of performance on focused topics, but generally yield poor performance for broad ones [3]. If one could diagnose topic structure, one could apply different query or filtering strategies to optimize performance, such as the use of committees (e.g., cascades) of filters [3].

One general approach to topic analysis, suggested by Cronen-Townsend and Croft, exploits the difference between the language model of the query (or topic) and the language model of the corpus to provide a measure of topic ambiguity [4]. The "clarity" score they develop may help predict performance, but it does not directly distinguish query types or lead to specific strategies for improving performance.

In fact, we have found it useful to distinguish three types of topics: monolithic, structured, and diffuse. A monolithic topic (MT) is a topic that is well focused. Any subset of positive examples of such a topic will generally show a high degree of similarity to any other subset. An example might be a topic on "approaches to adaptive filtering." A structured topic (ST), in contrast, may contain several threads or sub-topics of information, each of which is relatively monolithic. In such cases, one subset of positive examples might not show similarity to another subset, but there may be one or more specific subsets in which the documents they contain show a high degree of similarity to one another. An example might be a topic such as "recent developments in biomedicine," where we might find documents on many special (sub-)topics (e.g., cancer, AIDS, cloning, etc.). A diffuse topic (DT), finally, is either inherently vague or overly general and shows little underlying structure. Any one document from a sample of positive examples may bear no similarity to any other. An example of such a topic might be "management practices," under which one might expect to find a great variety of "management" described, ranging from controlling assembly lines to organizing marketing campaigns, and many kinds of "practices," from instituting frequent inventory controls to behavioral modification.

QUANTIFYING TOPIC STRUCTURE

Document clustering can be used to some degree to diagnose topic structure. One can cluster a sample of positive documents into groups, for example, by pruning the cluster branches when they reach join similarities of 0.01 or less. The number of resulting groups and the stability of those groups when the cluster tree is re-factored by increasing or decreasing the join similarity score provide important clues as to how a topic is structured. One large stable group (say, consisting of 60-70% of the documents) suggests an MT. Several stable groups (say, 3, each containing 20-30% of the documents), suggest an ST. And the absence of large or stable groups suggests a DT. Such a clustering approach to topic structure discovery, however, is computationally expensive when dealing with large numbers of documents (say, over 1,000), especially as we have to repeat the process many times under different clustering conditions to assess stability.
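
The diagnosis described above can be sketched as follows; this is illustrative code rather than the authors' implementation, and it assumes scikit-learn/SciPy, a list positive_docs of raw document strings, and TF-IDF cosine similarity as the document similarity measure. Only the 0.01 join-similarity cutoff comes from the text.

```python
# Hierarchical clustering of positive examples, pruned at a join-similarity cutoff.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def group_fractions(positive_docs, join_similarity=0.01):
    """Return the relative sizes of the groups formed at the given join-similarity cutoff."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(positive_docs)
    sims = cosine_similarity(tfidf)
    dists = np.clip(1.0 - (sims + sims.T) / 2.0, 0.0, None)  # symmetric distance matrix
    np.fill_diagonal(dists, 0.0)
    tree = linkage(squareform(dists), method="average")
    # Prune branches that join at similarity <= join_similarity,
    # i.e. at distance >= 1 - join_similarity.
    labels = fcluster(tree, t=1.0 - join_similarity, criterion="distance")
    sizes = np.bincount(labels)[1:]  # fcluster labels start at 1
    return sorted(sizes / sizes.sum(), reverse=True)

# One stable group holding ~60-70% of the documents suggests an MT; a few groups of
# 20-30% each suggest an ST; no large or stable group suggests a DT. Re-running with
# varied cutoffs provides the stability check described in the text.
```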

An alternative method for calculating the specificity of topic structure is based on random document probes. Probes, or queries, are constructed from a small sample of similar documents (typically two or three) taken from the positive examples for a topic. Each probe is then used as a query over the entire set of positive examples, producing a ranked list of documents. The probing process continues until a significant sample of documents has been used as probe seeds. An Average Probe Score (APS) is calculated as

APS = \frac{1}{|Probes|} \sum_{i=1}^{|Probes|} \frac{1}{|Topic|} \sum_{j=1}^{|Topic|} NS(Doc_j, Probe_i)

where Doc_j denotes a positive document for a topic, |Topic| denotes the number of positive examples for the topic, Probe_i denotes a query probe constructed from a small number (two or three) of randomly selected positive example documents for the topic, |Probes| denotes the number of query probes, and NS(Doc_j, Probe_i) denotes the similarity between document Doc_j and query probe Probe_i, normalized by the score of the first-ranked document returned by that probe. Documents with a normalized similarity score of less than 0.1 are not considered. Intuitively, the APS measure is proportional to the average number of high-scoring documents that random probes return: the more monolithic a topic is, the more high-similarity documents a probe returns; the more diffuse, the fewer documents it returns, and their scores will generally be lower.
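
As a concrete illustration, the sketch below computes APS directly from these definitions. It is not the authors' implementation: the similarity function score(doc, probe) (e.g., cosine similarity over TF-IDF vectors) and the construction of a probe by concatenating the sampled documents are assumptions made for the example.

```python
# Illustrative APS computation. `score(doc, probe)` is an assumed similarity
# function (e.g., cosine over TF-IDF vectors), not specified by the paper.
import random

def average_probe_score(positive_docs, score, n_probes=30, seed_size=2, threshold=0.1):
    rng = random.Random(0)
    total = 0.0
    for _ in range(n_probes):
        # Build a probe from a small random sample of positive examples.
        probe = " ".join(rng.sample(positive_docs, seed_size))
        raw = [score(doc, probe) for doc in positive_docs]
        top = max(raw)
        if top <= 0.0:
            continue  # probe matched nothing; it contributes zero
        # Normalize by the first-ranked document's score; drop NS < threshold.
        kept = [s / top for s in raw if s / top >= threshold]
        total += sum(kept) / len(positive_docs)
    return total / n_probes
```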

MODELING TOPIC STRUCTURE

Once the probing process is completed, each document can be characterized by a new set of features, viz., the probes under which it was retrieved. The value of each probe feature will be the normalized similarity score between that probe and the document. Using such an abstract document representation, we can cluster the documents based on probe-features. The documents in such clusters can be ranked by their cumulative probe scores, providing an interesting measure of "importance" relative to their cohort in the cluster. More relevantly, the resulting clusters can be used directly to construct component filters that model the topic.
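
A sketch of this modeling step is shown below, assuming a precomputed array ns of shape (number of documents, number of probes) holding the normalized similarity scores (zero where a document was not retrieved by a probe); the clustering algorithm and the name probe_feature_clusters are illustrative choices, not the paper's.

```python
# Cluster documents on their probe features and rank each cluster's members
# by cumulative probe score. `ns` is an assumed (n_docs, n_probes) array.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def probe_feature_clusters(ns, n_clusters=3):
    labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(ns)
    importance = ns.sum(axis=1)  # cumulative probe score per document
    clusters = {}
    for doc_id, label in enumerate(labels):
        clusters.setdefault(label, []).append(doc_id)
    for members in clusters.values():
        members.sort(key=lambda d: importance[d], reverse=True)
    return clusters  # each cluster can seed one component filter for the topic
```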

RESULTS AND DISCUSSION

A valuable measure that the process provides is the degree (and rate) of convergence on topic coverage (cf. Figure 1). The observation in MT cases (such as Reuters-96 [3] Topic 10) is that, where there are many similar documents, each probe will touch many other documents above the threshold score (e.g., NS of 0.1); after a few probes, virtually all the documents in the sample set will have been touched. In ST cases (such as Reuters-96 Topics 5 and 8), the first probes will touch fewer documents, but after several probes the total number of documents touched (at least once) will grow and eventually approach the total in the set. Finally, in DT cases (such as Reuters-96 Topics 1, 11, and 19), we expect no probe to touch many documents; even after many probes the total touched will not approach the total number of documents in the set. As a preliminary test of the ability of APS to predict filter (classifier) performance, we compared the APS for all Reuters-96 topics having more than 20 example documents to the best-performing monolithic filters we constructed for those topics, as shown in Figure 2. The regression line shows strong predictive value.
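
The convergence behaviour can be tracked with a simple coverage curve, as in the sketch below; the helper touched(probe), which returns the indices of positive examples scoring at least 0.1 normalized similarity against a probe, is assumed here and not defined in the paper.

```python
def coverage_curve(n_positive, probes, touched):
    """Fraction of positive examples touched at least once after each successive probe."""
    seen, curve = set(), []
    for probe in probes:
        seen |= touched(probe)
        curve.append(len(seen) / n_positive)
    return curve

# Rapid saturation of the curve suggests an MT, slower but eventual convergence an ST,
# and a low plateau even after many probes a DT.
```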

Figure 1. Convergence on Coverage by Repeated Probes

Figure 2. Prediction of Filter Performance vs. APS