2002, Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval - SIGIR '02
In this paper, we present a method based on document probes to quantify and diagnose topic structure, distinguishing topics as monolithic, structured, or diffuse. The method also yields a structure analysis that can be used directly to optimize filter (classifier) creation. Preliminary results illustrate the predictive value of the approach on TREC/Reuters-96 topics.
The paper demonstrates how information on text structure can be used to improve the identification of topical words in texts, based on a probabilistic model of text categorization. We use texts that are not explicitly structured. A text structure is identified by measuring the similarity between the segments comprising the text and its title. It is shown that a text structure thus identified gives a good clue for finding the parts of the text most relevant to its content. The significance of exploiting structural information for topic identification is demonstrated by a set of experiments conducted on 19 MB of Japanese newspaper articles. The paper also brings concepts from rhetorical structure theory (RST) to the statistical analysis of text structure. Finally, it is shown that information on text structure is more effective for large documents than for small ones.
ICST Transactions on Scalable Information Systems
Topic modelling is a major development in text mining: a statistical technique for revealing the underlying semantic structure in large collections of documents. Based on an analysis of approximately 300 research articles on topic modelling, this paper presents a comprehensive survey of the field. It covers a classification hierarchy, topic modelling methods, posterior inference techniques, different evolution models of latent Dirichlet allocation (LDA), and applications in areas including scientific literature, bioinformatics, software engineering, and social network analysis. Quantitative evaluation of topic modelling techniques is also presented in detail for a better understanding of the concept. The paper concludes with a detailed discussion of the challenges of topic modelling, which should give researchers useful insight for further research.
International Journal of Advanced Computer Science and Applications, 2015
Topic models provide a convenient way to analyze large amounts of unclassified text. A topic contains a cluster of words that frequently occur together. Topic modelling can connect words with similar meanings and distinguish between uses of words with multiple meanings. This paper covers two categories within the field of topic modelling. The first discusses topic modelling methods, of which four are considered: latent semantic analysis (LSA), probabilistic latent semantic analysis (PLSA), latent Dirichlet allocation (LDA), and the correlated topic model (CTM). The second category, topic evolution models, models topics by taking time into account as an important factor. In this category, different models are discussed, such as topics over time (TOT), dynamic topic models (DTM), multiscale topic tomography, dynamic topic correlation detection, detecting topic evolution in scientific literature, etc.
International Journal on Document Analysis and Recognition, 2007
2005
Hierarchical topic detection is a new task in the TDT 2004 evaluation program, which aims to organize a collection of unstructured news data in a directed acyclic graph (DAG) structure, reflecting the topics discussed in the collection, ranging from rather coarse, category-like nodes to fine-grained singular events.
Procedia Computer Science, 2020
In the digital era, it is important to detect and analyze the topics related to discussions occurring in social media, and to label visited web pages or documents. This information can be very helpful for personalization as well as user satisfaction. Various methods study and process huge amounts of data to provide insights into user behavior. In this paper, we propose a filtering process that enhances topic detection and labelling. The latter aims to compact the results delivered by inferential algorithms such as latent Dirichlet allocation and the Dirichlet mixture model. Our filtering process relies on the dependency of words in each contextual use to deliver highly correlated labels. We use Word2vec as well as n-grams to eliminate non-significant words in each topic, and the Hellinger distance to aggregate redundant words to the appropriate topic. We also eliminate unreliable topics according to a given metric. We associate this proposal with different topic-modelling algorithms. Experiments demonstrate the effectiveness of the association between the inferential model and our filtering process compared to the state of the art. We also use different textual data to validate our proposal.
IEEE Access
Topic modelling is important for tackling several data mining tasks in information retrieval. While seminal topic modelling techniques such as latent Dirichlet allocation (LDA) have been proposed, the ubiquity of social media and the brevity of its texts pose unique challenges for such traditional techniques. Several extensions, including auxiliary aggregation, self-aggregation, and direct learning, have been proposed to mitigate these challenges; however, some remain. These include a lack of consistency in the topics generated and a decline in model performance in applications involving disparate document lengths. There is a recent paradigm shift towards neural topic models, which are not suited for resource-constrained environments. This paper revisits LDA-style techniques, taking a theoretical approach to analyse the relationship between word co-occurrence and topic models. Our analysis shows that by altering the word co-occurrences within the corpus, topic discovery can be enhanced. We therefore propose a novel data transformation approach, dubbed DATM, to improve topic discovery within a corpus. A rigorous empirical evaluation shows that DATM is not only powerful on its own, but can also be used in conjunction with existing benchmark techniques to significantly improve their effectiveness and consistency by up to two-fold. INDEX TERMS: Document transformation, greedy algorithm, information retrieval, latent Dirichlet allocation, multi-set multi-cover problem, probabilistic generative topic modelling.
This paper presents a novel approach to automating the extraction of the topic and main title from a single short text document. The proposed approach uses online text mining and natural language processing techniques. The title of a text provides an efficient way to concisely grasp an overview of its contents from the main heading alone, which is quicker than reading a summary. In this paper, three different mechanisms are proposed, implemented, and compared to find the best approach for automatically extracting the topic that is most relevant to the overall event described in the text. The proposed system is evaluated against fifteen news articles from The New York Times. The significance of the paper is twofold. First, these automatic topic extraction techniques can be used for document classification, document relevancy and similarity, summarization, comprehensive grasp of an event, and finding novelty in large and scattered text data by scanning titles. Second, it can serve as a roadmap for new researchers through its detailed analysis of various data mining techniques. The experimental results show that nouns are the most related, reliable, and suitable words for identifying the topic of a text.
Lecture Notes in Computer Science, 2011
Generative models for text data are based on the idea that a document can be modeled as a mixture of topics, each of which is represented as a probability distribution over terms. Such models have traditionally assumed that a document is an indivisible unit for the generative process, which may not be appropriate for documents with an explicit multi-topic structure. This paper presents a generative model that exploits a given decomposition of documents into smaller, topically cohesive text blocks (segments). A new variable is introduced to model the within-document segments: using this variable at document level, word generation is related not only to the topics but also to the segments, while the topic latent variable is directly associated with the segments rather than with the document as a whole. Experimental results show that, compared to existing generative models, our proposed model provides better language modeling perplexity and better support for effective clustering of documents.
International journal of engineering research and technology, 2021
Every day, large quantities of data are collected. As more information becomes available, accessing what we are seeking becomes challenging. We therefore require processes and techniques for organizing, searching, and understanding massive amounts of information. The task of topic modeling is to analyze whole documents to learn the meaningful patterns that exist in them. It is an unsupervised strategy used to identify and monitor words in clusters of texts (known as "topics"). Through the use of topic analysis models, companies can offload tasks to machines rather than burden employees with too much data. In this paper, we use word embeddings for topic modeling to learn the meaningful patterns of words, and k-means clustering to group the words that belong together. We create nine clusters of words from the headline dataset. One application of topic modeling, namely sentiment analysis using the VADER algorithm, is also demonstrated in this paper.
INTRODUCTION
User information needs, typically expressed in terms of a short description or a set of example documents, tend to vary from very focused, such as "find me information about product XYZ," to very broad, such as "keep me posted on world politics." Traditionally, both types of information needs have been modeled using single or monolithic filters [1] [2]. These approaches provide good levels of performance on focused topics, but generally yield poor performance for broad ones [3]. If one could diagnose topic structure, one could apply different query or filtering strategies to optimize performance, such as the use of committees (e.g., cascades) of filters [3].
One general approach to topic analysis, suggested by Cronen-Townsend and Croft, exploits the difference between the language model of the query (or topic) and the language model of the corpus to provide a measure of topic ambiguity [4]. The "clarity" score they develop may help predict performance, but it does not directly distinguish query types or lead to specific strategies for improving performance.
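For concreteness, the sketch below shows how a clarity-style score can be computed as the KL divergence between a topic language model and the corpus language model. As a simplifying assumption, the topic model is estimated from positive example documents rather than from the query-model estimation used in [4]; the function names and the smoothing constant are illustrative, not taken from the cited work.

```python
from collections import Counter
import math

def unigram_lm(texts):
    """Maximum-likelihood unigram language model over a list of tokenized texts."""
    counts = Counter(tok for text in texts for tok in text)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def clarity_score(topic_docs, corpus_docs, mu=0.5):
    """KL divergence (bits) between a topic language model and the corpus model.

    Higher values suggest a focused, unambiguous topic. `mu` mixes the topic
    model with the corpus model so the divergence stays finite.
    """
    p_topic = unigram_lm(topic_docs)
    p_corpus = unigram_lm(corpus_docs)
    score = 0.0
    for w, p_c in p_corpus.items():
        p_t = (1 - mu) * p_topic.get(w, 0.0) + mu * p_c
        score += p_t * math.log(p_t / p_c, 2)
    return score
```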
In fact, we have found it useful to distinguish three types of topics: monolithic, structured, and diffuse. A monolithic topic (MT) is a topic that is well focused. Any subset of positive examples of such a topic will generally show a high degree of similarity to any other subset. An example might be a topic on "approaches to adaptive filtering." A structured topic (ST), in contrast, may contain several threads or sub-topics of information, each of which is relatively monolithic. In such cases, one subset of positive examples might not show similarity to another subset, but there may be one or more specific subsets in which the documents they contain show a high degree of similarity to one another. An example might be a topic such as "recent developments in biomedicine," where we might find documents on many special (sub-)topics (e.g., cancer, AIDS, cloning, etc.). A diffuse topic (DT), finally, is either inherently vague or overly general and shows little underlying structure. Any one document from a sample of positive examples may bear no similarity to any other. An example of such a topic might be "management practices," under which one might expect to find a great variety of "management" described, ranging from controlling assembly lines to organizing marketing campaigns, and many kinds of "practices," from instituting frequent inventory controls to behavioral modification.
QUANTIFYING TOPIC STRUCTURE
Document clustering can be used to some degree to diagnose topic structure. One can cluster a sample of positive documents into groups, for example, by pruning the cluster branches when they reach join similarities of 0.01 or less. The number of resulting groups and the stability of those groups when the cluster tree is re-factored by increasing or decreasing the join similarity score provide important clues as to how a topic is structured. One large stable group (say, consisting of 60-70% of the documents) suggests an MT. Several stable groups (say, three, each containing 20-30% of the documents) suggest an ST. And the absence of large or stable groups suggests a DT. Such a clustering approach to topic structure discovery, however, is computationally expensive when dealing with large numbers of documents (say, over 1,000), especially as the process has to be repeated many times under different clustering conditions to assess stability.
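As a rough illustration of this clustering-based diagnosis, the sketch below prunes an average-link cluster tree at a given join similarity and labels the topic from the resulting group sizes. It assumes cosine similarity over some document vector representation (e.g., TF-IDF), uses SciPy, and omits the stability check obtained by re-factoring the tree at other thresholds; the 60% and 20% cut-offs mirror the rough figures above and are illustrative only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def diagnose_by_clustering(doc_vectors, join_similarity=0.01):
    """Cluster positive example vectors and guess MT / ST / DT from group sizes.

    Branches are cut where the join (cosine) similarity drops to
    `join_similarity` or less, i.e. at cosine distance 1 - join_similarity.
    """
    dist = pdist(doc_vectors, metric="cosine")
    tree = linkage(dist, method="average")
    labels = fcluster(tree, t=1.0 - join_similarity, criterion="distance")
    sizes = np.bincount(labels)[1:]                      # cluster sizes; labels start at 1
    fractions = sorted(sizes / len(doc_vectors), reverse=True)
    if fractions[0] >= 0.6:                              # one large group
        return "monolithic", fractions
    if sum(f >= 0.2 for f in fractions) >= 2:            # several sizeable groups
        return "structured", fractions
    return "diffuse", fractions                          # no large or stable groups
```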
An alternative method for calculating the specificity of topic structure is based on using random document probes. Probes, or queries, are constructed using a small sample of similar documents (typically two or three) taken from the positive examples for a topic. Each probe is then used as a query over the entire set of positive examples, resulting in a ranked list of documents. The probing process continues until a significant sample of documents has been used as probe seeds. An Average Probe Score (APS) is calculated as follows:

APS = (1 / (|Probes| × |Topic|)) Σ_i Σ_j NS(Doc_j, Probe_i)

where Doc_j denotes a positive document for a topic, |Topic| denotes the number of positive examples for a topic, Probe_i denotes a query probe constructed from a small number (two or three) of randomly selected positive example documents for a topic, |Probes| denotes the number of query probes, and NS(Doc_j, Probe_i) denotes the degree of similarity between the document Doc_j and the query probe Probe_i, normalized by the score of the first-ranked document returned by that probe. Note that documents with a normalized similarity score of less than 0.1 are not considered. Intuitively, the APS measure is proportional to the average number of high-scoring documents that random probes return. The more monolithic a topic is, the more documents of high similarity a probe returns; the more diffuse the topic, the fewer documents a probe returns, and the lower their scores will generally be.
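A minimal sketch of the APS computation is shown below. It assumes unit-normalized document vectors with cosine similarity as NS, and it builds each probe as the centroid of the randomly chosen seed documents; the probe construction, the number of probes, and the function names are illustrative assumptions, not details from the paper.

```python
import random
import numpy as np

def average_probe_score(doc_vectors, n_probes=30, seed_size=2, threshold=0.1, rng=None):
    """Average Probe Score over a set of positive example vectors (rows).

    Each probe is the mean of a few randomly chosen positive examples; its
    similarity to every document is normalized by the top-ranked score, and
    normalized scores below `threshold` are dropped.
    """
    rng = rng or random.Random(0)
    docs = np.asarray(doc_vectors, dtype=float)
    docs = docs / np.linalg.norm(docs, axis=1, keepdims=True)   # unit-length rows
    total, n_docs = 0.0, len(docs)
    for _ in range(n_probes):
        seeds = rng.sample(range(n_docs), seed_size)
        probe = docs[seeds].mean(axis=0)
        sims = docs @ probe                                      # cosine similarities
        ns = sims / sims.max()                                   # normalize by rank-1 score
        ns[ns < threshold] = 0.0                                 # ignore weak matches
        total += ns.sum() / n_docs
    return total / n_probes
```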
MODELING TOPIC STRUCTURE
Once the probing process is completed, each document can be characterized by a new set of features, viz., the probes under which it was retrieved. The value of each probe feature will be the normalized similarity score between that probe and the document. Using such an abstract document representation, we can cluster the documents based on probe-features. The documents in such clusters can be ranked by their cumulative probe scores, providing an interesting measure of "importance" relative to their cohort in the cluster. More relevantly, the resulting clusters can be used directly to construct component filters that model the topic.
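A sketch of this probe-feature clustering is given below. It assumes the NS scores from the probing step have been collected into a documents-by-probes matrix and reuses the same average-link clustering and join-similarity cut as before; documents never retrieved by any probe are simply skipped here, and the helper name is illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

def probe_feature_clusters(probe_scores, join_similarity=0.01):
    """Cluster documents represented by their normalized probe scores.

    probe_scores: (n_docs x n_probes) matrix of NS(Doc_j, Probe_i) values.
    Returns clusters whose members are ranked by cumulative probe score.
    """
    probe_scores = np.asarray(probe_scores, dtype=float)
    touched = np.flatnonzero(probe_scores.sum(axis=1) > 0)   # drop never-retrieved docs
    feats = probe_scores[touched]
    dist = pdist(feats, metric="cosine")
    labels = fcluster(linkage(dist, method="average"),
                      t=1.0 - join_similarity, criterion="distance")
    cumulative = feats.sum(axis=1)                           # "importance" within a cohort
    clusters = {}
    for row, label in enumerate(labels):
        clusters.setdefault(label, []).append(row)
    return {label: [int(touched[r]) for r in sorted(rows, key=lambda r: -cumulative[r])]
            for label, rows in clusters.items()}
```

Each resulting cluster can then serve as the training set for one component filter of a committee.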
RESULTS AND DISCUSSION
A valuable measure that the process provides is the degree (and rate) of convergence on topic coverage (cf. Figure 1). The observation in MT cases (such as Reuters-96 [3] Topic 10) is that, where there are many similar documents, each probe will touch many other documents above the threshold score (e.g., 0.1 NS); after a few probes, virtually all the documents in the sample set will have been touched. In ST cases (such as Reuters-96 Topics 5 and 8), the first probes will touch fewer documents, but after several probes, the total number of documents touched (at least once) will grow and eventually approach the total in the set. Finally, in DT cases (such as Reuters-96 Topics 1, 11, and 19), we expect no probe to touch many documents; even after many probes, the total touched will not approach the total number of documents in the set. As a preliminary test of APS as a predictor of filter (classifier) performance, we compared the APS for all Reuters-96 topics having more than 20 example documents to the best-performing monolithic filters we constructed for those topics, as shown in Figure 2. The regression line shows strong predictive value.
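The convergence-on-coverage behaviour described above can be tracked with a short helper like the one below, which reuses the probe-score matrix and the 0.1 NS threshold; whether the curve plateaus quickly near 1.0, climbs slowly toward it, or levels off well below it is what distinguishes the MT, ST, and DT cases of Figure 1. The helper name is illustrative.

```python
import numpy as np

def coverage_curve(probe_scores, threshold=0.1):
    """Fraction of positive documents touched (NS >= threshold) after each probe."""
    probe_scores = np.asarray(probe_scores, dtype=float)
    touched = np.zeros(probe_scores.shape[0], dtype=bool)
    curve = []
    for i in range(probe_scores.shape[1]):        # probes in the order they were issued
        touched |= probe_scores[:, i] >= threshold
        curve.append(float(touched.mean()))
    return curve
```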
Figure 1. Convergence on Coverage by Repeated Probes
Figure 2. Prediction of Filter Performance vs. APS