[go: up one dir, main page]

0% found this document useful (0 votes)
91 views5 pages

Review On Topic Detection Methods For Twitter Streams

This document discusses various methods for detecting topics in Twitter streams. It outlines exemplar-based topic detection, which uses the most descriptive tweets to represent topics. It also discusses clustering-based methods like sequential k-means clustering that group similar tweets. Additionally, it covers frequent pattern mining techniques that search for recurring word patterns and two-level message clustering that categorizes tweets in different phases. The document provides details on each approach and evaluates their usefulness for timely topic detection in Twitter data.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
91 views5 pages

Review On Topic Detection Methods For Twitter Streams

This document discusses various methods for detecting topics in Twitter streams. It outlines exemplar-based topic detection, which uses the most descriptive tweets to represent topics. It also discusses clustering-based methods like sequential k-means clustering that group similar tweets. Additionally, it covers frequent pattern mining techniques that search for recurring word patterns and two-level message clustering that categorizes tweets in different phases. The document provides details on each approach and evaluates their usefulness for timely topic detection in Twitter data.
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 5

Volume 7, Issue 10, October – 2022 International Journal of Innovative Science and Research Technology

ISSN No:-2456-2165

Review On Topic Detection Methods


for Twitter Streams
Vivek Ranjan, Pragati Agrawal Vaibhav Poddar
Computer Science & Engineering Maulana Azad National Economics Nowrosjee Wadia
Institute of Technology Bhopal, India College Pune, India

Abstract:- Among the various social media platforms as possible and aiding political parties and corporations in
that dominate the internet today, Twitter has established assessing public opinion, detecting fake news as soon as
itself as a major player and has become a preferred possible, and many more. In this way, topic detection can
choice for expressing one’s opinion on almost help to reduce the misuse of social media. This paper
everything. Ordinary citizens aside, it is used by the outlines some of these topic detection techniques such as
governments world over to connect directly with their exemplar-based topic detection which uses most descriptive
citizens, by the media houses, companies, research tweets to find the topic, frequent pattern mining in which
organizations, and so on. Given the omnipresent role it patterns are searched using word count, clustering-based
plays, it has thus become imperative to aggressively methods in which clusters are used to detect topics. Other
pursue topic detection from Twitter so that its benefits approaches include two-level message clustering and a
can be implemented in a range of applications such as blend of singular value decomposition and k-means
natural disaster warnings, fake news detection, and user clustering.
opinion assessment among other uses. This paper
outlines different types of topic detection techniques that The remainder of the paper is organized as follows.
are employed frequently such as exemplar based topic Section II mentions topic detection techniques, Section III
detection, clustering based topic detection, frequent discusses the complete approach of exemplar-based topic
pattern mining, two level message clustering, detection. Section IV discusses frequent pattern mining and
combination of k-means clustering methods and singular variations that can be used to detect topics from Twitter.
value decomposition. While exemplar based topic Various clustering methods such as sequential k-means,
detection technique uses the most significant tweets to spherical k-means, DBSCAN, and bngram to detect the
detect topics, the clustering based technique uses topics from Twitter are discussed in Section V. Section VI
different clustering methods such as sequential k-means, discusses all the phases used in two-level message
spherical k-means, DBSCAN, and bngram to detect the clustering. Section VII discusses the combination of
topics. In frequent pattern mining, the FP-growth singular value decomposition and k-means clustering
algorithm along with its variation can be used to detect approaches in a detailed way. Finally, Section VIII
the topics. In two level message clustering, topics are concludes the various approaches used in the paper.
clustered using different phases. A blend of both
algorithms is employed in the combination of k-means II. TOPIC DETECTION TECHNIQUES
clustering methods and singular value decomposition. A There are three different approaches used by topic
detailed discussion of these topic detection techniques detection methods to represent the topics. Feature pivot
has been done in this paper. method [2] is the first one. In this, terms are clustered
Keywords:- Text Mining; Topic Detection; Clustering; together according to their patterns of co-occurrence. There
Twitter. are many approaches used in this method. One interesting
approach is to utilize social features for selecting terms to be
I. INTRODUCTION clustered [5]. The second one is the document pivot method
[2] in which documents are clustered using some measure of
Due to the rapid growth of social media on the internet document similarity and group the similar documents based
in recent years, a multitude of valuable information has on this [21].
become available. Twitter is one such social platform where
a large amount of information is available. The majority of The last one is probabilistic topic models in which
information available on Twitter is in the form of textual topics are discovered by calculating the probability
documents. Twitter allows its users to send the information distribution of words, two most common probabilistic topic
in the form of textual data termed as tweets almost in real- models are Latent Dirichlet Allocation (LDA) [3] and
time. Each user can express his or her feelings, thoughts, Probabilistic Latent Semantic Analysis (PLSA) [13].
knowledge in a maximum of 280 characters which makes
Twitter a very important source for detecting important Every topic detection technique has its own way of
trends throughout the world. Millions of tweets are tweeted representing the topics. The most common approach to
on Twitter every day. Because of the enormous number of describe the topic is by using a list of keywords. The
tweets, identifying relevant topics in real-time is a keyword’s word count reflects the keyword’s relevance in
challenging job. Detecting topics can be beneficial in the given topic. Furthermore, explicit tweets can represent
several areas, namely, detecting natural disasters as quickly topics, in which the tweet keywords act as the topic’s
representation, as in the exemplar-based technique [14]. We

IJISRT22OCT305 www.ijisrt.com 94
Volume 7, Issue 10, October – 2022 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165
identify multiple topic detection strategies in this paper such IV. TOPIC DETECTION USING FREQUENT
as clustering-based topic detection, frequent pattern mining, PATTERN MINING
exemplar-based topic detection, two-level message
clustering, and the combination of singular value Frequent pattern mining is used to find frequent
decomposition and k-means clustering approaches. patterns or associations from item sets found in various
kinds of databases [11]. For a given set of transactions, this
III. EXEMPLAR BASED TOPIC DETECTION mining method is used to find association rules to predict
the co-occurrence of items. Frequent pattern mining was
Exemplar based topic detection uses the most first used in mining transactions to find items that co-occur
descriptive tweet to represent a topic [9]. In this, the in the transactions. There are several variations of frequent
arrangement of tweets is maintained due to which users can pattern mining. Some of them are DHP [18], DIC [4] and
understand the topic in an easier way. This approach Apriori [1]. Frequent pattern mining uses FP-growth
discards the keywords which are not correlated and that algorithm [12]. Steps for the FP-growth algorithm are given
makes the topic domain smaller. This strategy is focused on below.
the idea that a tweet that is identical to a group of tweets but  In the first step, the database is scanned to find
differentiated from the majority of the tweets is a good occurrences of the item sets in that database. Frequency of
prototype for the topics it addresses. Based on that theory, words below a particular threshold is discarded.
this approach computes the similarity matrix between each  The FP-tree is constructed in the next step, the root is
pair of tweets and then analyses the action of such formed and is denoted by null.
distributions by categorizing them into three groups. These  In the next step, the database is scanned again and
three groups are: transactions are examined to find out item sets in it. The
 The distribution similarity of tweet j will have a low item set is sorted in descending order due to which item
sample variance if j is similar to many tweets. with maximum frequency is taken at the top and so on. In
 The distribution similarity of tweet j will have a high other words, item sets generate the tree branch in
sample variance if j is similar to a particular group of descending order of frequency.
tweets and less similar to other groups.  In this step, the examination of the next transaction of the
 The distribution similarity of tweet j will have a low database takes place. The item sets are sorted in
sample variance if j is dissimilar to most of the tweets. descending order of frequency. The transaction branch
has the same prefix to the root if the item sets of this
The tweets which belong to group two are chosen to transaction are available in another branch. In other
be the best representative of topics. Since each tweet is words, the repeated item set is connected to the new node
equivalent to a series of tweets, the tweets in group two are of another item set.
selected to represent topics better. The tweets in category  In this step, increase the frequency of the item sets by one
two are also different from the majority of the tweets, so it and the common node and new node.
divides its topic from the rest of the topics.
 The generated FP-tree is mined in this phase. First, the
The sample variance for the similarity distribution of lowest node and the lowest node links are examined. The
each tweet is calculated as: pattern scale of the lowest node is one. The route is
traversed in the FP-tree from here. These are referred to
as conditional pattern bases.
(1)  In this step, a conditional FP-tree is formed due to the
where, xj is the mean of the similarity of tweet j and is given frequency of item sets in the path. In the conditional FP-
by: tree, those item sets are considered which meets the
threshold support.
 In this step, the conditional FP-tree is used to generate
(2) frequent patterns

In this method, tweets are sorted according to their The same method can be used to detect topics on
variances, and the tweet having the highest variance is Twitter. As the seen topic, the top k frequent patterns were
identified as the representation of the first topic. Following discovered using this method. Soft frequent pattern mining
the first topic selection, all tweets that belong to the first (SFM) was also implemented as a variant of frequent pattern
topic are eliminated, and the methodology selects another mining [19]. By measuring the similarities between the set
tweet with high variance as the representative for the second and each term, SFM greedily extends the set, including just
topic, and so on until k topics are discovered. one term. This method is replicated until the next term’s
resemblance and the set is smaller than a certain threshold.

IJISRT22OCT305 www.ijisrt.com 95
Volume 7, Issue 10, October – 2022 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165
V. TOPIC DETECTION WITH CLUSTERING within the eps then points a and point b are called density
TECHNIQUES connected points.
 The process is repeated for the remaining points of the
There are many clustering methods used for topic dataset which are unvisited.
detection. Some of them are sequential k-means, spherical
k-means, DBSCAN, bngram [14]. In clustering methods, D. Bngram
clusters are used to define a particular topic. Centroids in Bngram was proposed to detect topics from twitter
each cluster are used as the representative for that cluster streams [2]. Instead of using single words for topic
and top m-words of the cluster are used as a keyword of that detection, it uses ngrams. This method employs a new time-
topic. The total no of topics defined is the no of clusters based scoring measure for each ngram. It is defined as:
formed. The TF-IDF scheme is used to represent the tweets.
Some of the clustering methods used for topic detection are:
(5)
A. Sequential K-means
Sequential k-means works on two steps [16]. In step 1, where df is the number of tweets that appeared in this
each data point is allocated to that cluster that has the closest ngram in time slot j. dfprev is the number of tweets that were
center with respect to the data point in terms of Euclidean used in this ngram during the previous time slots. The
distance. In step 2, the centroid of each cluster is frequencies which are increased during the present time
recomputed by using all the members of every cluster. The slots in comparison with the previous time slots are chosen
Euclidean distance is calculated as: by this score measure. Following that, ngrams are ordered
according to this score scale, and the top ngrams are chosen.
√(m1 − m2)2 + (n1 − n2)2 (3) A group average hierarchical clustering of ngrams is done in
this process, with the similarities between two ngrams
where the first data point is (m1,n1) and the second defined as tweet fragments in which both ngrams occur.
data point is (m2,n2). This process is terminated when the correlation between the
nearest unmerged clusters falls below a specific threshold
B. Spherical K-means value. At last, the ranking of clusters is done and topics are
In spherical k-means, cosine similarity is used in place returned as the top clusters.
of euclidean distance [16]. The cosine similarity is
calculated as: VI. TWO-LEVEL MESSAGE CLUSTERING FOR
TOPIC DETECTION

This topic detection approach also involves various


(4)
factors of topic enrichment and presentation like ranking of
where t and e are two feature vectors. topics, title extraction, keyword extraction, representative
tweets selection, and relevant image extraction [20]. All the
C. DBSCAN phases used by this approach are given below:
DBSCAN (Density-based spatial clustering of
applications with noise) is a density-based algorithm, which A. Pre-processing Phase
predicts fixed density clusters. DBSCAN works on two Duplicate items are assembled and language-based
parameters [10]: filtering is carried out in this phase. Duplicate items are
 Eps: The neighborhood around a data point is termed as retweets and copies of previous tweets. It is done by hashing
eps. In other words, if the length between two points is the text of each tweet and in one bucket, the only text of a
lesser than or equal to eps, the points are termed as tweet is allowed. Thus, only a single item of duplicate
neighbors. The major portion of data will be considered as copies is processed while also keeping all other duplicate
outliers if the eps value is too small. The clusters combine items. After this, language-based filtering is done in which
and the majority of data points will lie in the same only english tweets are kept and non-english words are
clusters if the eps value is too high. discarded.
 Minpts: It is the least no of data points within the eps B. Topic Detection
radius. The Minpts must be large for larger datasets. After pre-processing, the modified document pivot
Minpts must be greater than or equal to 3. approach is used for detecting topics in which tweets
DBSCAN follows the following step: containing the same URL and a tweet long with its reply are
clustered because they refer to the same topic.
 All the data points within the eps are searched and core
points are identified. In first-level clustering, the grouping of items is done.
 If any core point is not assigned to a cluster, new clusters Union-Find structure [6] is used in grouping of items in the
are created. first level. A Graph containing one node for each tweet is
 All the density connected points of the core point are created. Pair of tweets containing the same URL are
searched recursively and they are assigned to the same connected and the set of the connected component is
cluster as the core point. obtained. Now, those components which have more than
 If there exists a point c that has enough points in the one tweet are those first-level groups that will be also used
neighbor of point a and points b and both a and b are in the second level. Those tweets will be also considered

IJISRT22OCT305 www.ijisrt.com 96
Volume 7, Issue 10, October – 2022 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165
which are only members of a component containing a single A. Singular Value Decomposition
element. In second level clustering, the algorithm used is Let M be a r × s matrix. A factorization M=USV T is said
incremental threshold-based clustering and locality-sensitive singular value decomposition for M, where U is a r × r
hashing(LSH), but with some modifications. In this, all the orthogonal matrix, S is a r×s pseudo-diagonal matrix whose
first level clusters become members of second-level elements non-negative, and V T is a s×s orthogonal matrix.
clusters. Moreover, the second-level cluster also has those The diagonal elements of the matrix S are called singular
tweets which were not the members of the first-level. value of M [15].

C. Ranking In this phase, tweets dimensions are reduced using


In this phase, the ranking of topics produced in the SVD. Suppose X be s × r tweet word matrix. This matrix is
second level is done in order to choose the most important. decomposed into X=USV T, where U is a s × x orthogonal
matrix. S is a x × x pseudo-diagonal matrix and V T is an x ×
D. Title Extraction r orthogonal matrix. In this way, the dimension of the word
In this phase, the text of each clustered tweet is split into tweet matrix is reduced and the reduced matrix SV T will go
sentences for obtaining a particular set of titles. For this, to the next phase.
Levehnstein’s distance for each pair of candidate titles is
computed for reducing the number of actual candidates. B. K-means Clustering
after that, candidate titles are ranked using their textual The following steps are used to conduct k-means
features and frequency. The score of the title is its frequency clustering on the reduced matrix:
multiplied by the average likelihood of the appearance of  Tweets in the reduced form are clustered.
those terms contained in an independent corpus.  The identified centroids are translated into the tweet’s
original dimensions.
E. Keyword Extraction  The particular topic is described by the top-weighted
In this phase, keyword extraction is done. It is similar to words lies in each centroid.
the title extraction process but instead of complete
sentences, either verb phrases or noun phrases are VIII. CONCLUSION
examined. After that, keywords are ranked similar to that of
the title extraction but keyword extraction is not limited to In this study, we have examined five strategies for
selecting a particular candidate. To do this, the score detecting topics in Twitter streams. To sum it up, we have
difference between two succeeding candidates is computed discussed exemplar-based topic detection, which uses the
and then after ranking them, the position in the ranked list most descriptive tweet to catch the topic. We have discussed
with the largest difference is discovered and all terms until frequent pattern mining, in which recurring patterns are
that position are selected. produced using the FP-growth algorithm. We have also
discussed soft frequent pattern mining (SFM), a variant of
F. Representative Tweets Selection frequent pattern mining. We have also addressed several
In this phase, representative tweets are selected. clustering techniques such as sequential k-means,
Moreover, all replies from the topic’s cluster are added to DBSCAN, spherical k-means, and bngram, in which
the group of representative tweets. clusters are first discovered using the appropriate algorithms
and then used to define a topic. We have discussed two-
G. Relevant Image Extraction level message clustering which is explained using different
In this phase, relevant images are retrieved using a phases. We have also explored the use of singular value
simple procedure. The most frequent image is found and decomposition (SVD) with k-means. Singular value
returned if the tweets related with a topic have the URL of a decomposition (SVD) reduces the tweet’s dimension, which
few images. reduces processing time, and k-means is then used to detect
topics.
VII. COMBINATION OF SINGULAR VALUE
DECOMPOSITION AND K-MEANS CLUSTERING REFERENCES
METHODS FOR TOPIC DETECTION
[1.] R. Agrawal and R. Srikant. Fast algorithms for mining
In this approach, the blend of singular value association rules in large databases. In VLDB, 1994.
decomposition(SVD) and k-means clustering is used to [2.] L. M. Aiello, G. Petkos, C. J. Mart´ın-Dancausa, D.
detect Twitter topics [17]. Singular value Corney, S. Papadopoulos, R. Skraba, A. Goker, Y.
decomposition(SVD) is used for reducing the dimensions. Kompatsiaris, and A. Jaimes.¨ Sensing trending topics
In the case of textual data, Singular value decomposition in twitter. IEEE Transactions on Multimedia,
(SVD) is also known as latent semantic analysis(LSA) [8] 15:1268–1282, 2013.
[7]. At first, SVD reduced the dimensions of the tweets due [3.] Blei, A. Ng, and M. I. Jordan. Latent dirichlet
to which its computational time becomes faster and after allocation. J. Mach. Learn. Res., 3:993–1022, 2003.
that, the k-means clustering method is used to form clusters [4.] S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic
of tweets from which topics are detected. There are two itemset counting and implication rules for market
phases used in this approach. basket data. In SIGMOD ’97, 1997.

IJISRT22OCT305 www.ijisrt.com 97
Volume 7, Issue 10, October – 2022 International Journal of Innovative Science and Research Technology
ISSN No:-2456-2165
[5.] M. Cataldi, L. D. Caro, and C. Schifanella. Emerging
topic detection on twitter based on temporal and social
terms evaluation. In MDMKDD ’10, 2010.
[6.] T. H. Cormen, C. Leiserson, R. Rivest, and C. Stein.
Introduction to algorithms, second edition. 2001.
[7.] S. Deerwester, S. Dumais, G. Furnas, T. Landauer,
and R. Harshman. Indexing by latent semantic
analysis. Journal of the Association for Information
Science and Technology, 41:391–407, 1990.
[8.] S. Dumais. Latent semantic analysis. Annual Review
of Information Science and Technology, 38:188–230,
2005.
[9.] Elbagoury, R. Ibrahim, A. K. Farahat, M. Kamel, and
F. Karray. Exemplar-based topic detection in twitter
streams. In ICWSM, 2015.
[10.] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-
based algorithm for discovering clusters in large
spatial databases with noise. In KDD, 1996.
[11.] Goethals. Frequent Set Mining, pages 321–338. 07
2010.
[12.] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent
patterns without candidate generation: A frequent-
pattern tree approach. Data Mining and Knowledge
Discovery, 8:53–87, 2006.
[13.] T. Hofmann. Probabilistic latent semantic indexing. In
SIGIR ’99, 1999.
[14.] R. Ibrahim, A. Elbagoury, M. Kamel, and F. Karray.
Tools and approaches for topic detection from twitter
streams: survey. Knowledge and Information Systems,
54:511–539, 2017.
[15.] B. Jacob. Linear algebra. 1990.
[16.] K. Jain, M. Murty, and P. Flynn. Data clustering: a
review. ACM Comput. Surv., 31:264–323, 1999.
[17.] [17] K. Nur’Aini, I. Najahaty, L. Hidayati, H. Murfi,
and S. Nurrohmah. Combination of singular value
decomposition and k-means clustering methods for
topic detection on twitter. 2015 International
Conference on Advanced Computer Science and
Information Systems (ICACSIS), pages 123–128,
2015.
[18.] J. S. Park, M. Chen, and P. S. Yu. An effective hash-
based algorithm for mining association rules. In
SIGMOD ’95, 1995.
[19.] Petkos, S. Papadopoulos, L. M. Aiello, R. Skraba, and
Y. Kompatsiaris. A soft frequent pattern mining
approach for textual topic detection. In WIMS ’14,
2014.
[20.] Petkos, S. Papadopoulos, and Y. Kompatsiaris. Two-
level message clustering for topic detection in twitter.
In SNOW-DC@WWW, 2014.
[21.] S. Phuvipadawat and T. Murata. Breaking news
detection and tracking in twitter. 2010
IEEE/WIC/ACM International Conference on Web
Intelligence and Intelligent Agent Technology, 3:120–
123, 2010.

IJISRT22OCT305 www.ijisrt.com 98

You might also like