Review On Topic Detection Methods For Twitter Streams
ISSN No:-2456-2165
Abstract:- Among the various social media platforms that dominate the internet today, Twitter has established itself as a major player and has become a preferred choice for expressing one's opinion on almost everything. Ordinary citizens aside, it is used by governments the world over to connect directly with their citizens, by media houses, companies, research organizations, and so on. Given the omnipresent role it plays, it has become imperative to aggressively pursue topic detection from Twitter so that its benefits can be applied in a range of applications such as natural disaster warnings, fake news detection, and user opinion assessment, among other uses. This paper outlines different types of topic detection techniques that are employed frequently, such as exemplar-based topic detection, clustering-based topic detection, frequent pattern mining, two-level message clustering, and the combination of k-means clustering methods and singular value decomposition. While the exemplar-based topic detection technique uses the most significant tweets to detect topics, the clustering-based technique uses different clustering methods such as sequential k-means, spherical k-means, DBSCAN, and bngram to detect the topics. In frequent pattern mining, the FP-growth algorithm along with its variations can be used to detect the topics. In two-level message clustering, topics are clustered using different phases. A blend of both algorithms is employed in the combination of k-means clustering methods and singular value decomposition. A detailed discussion of these topic detection techniques is given in this paper.

Keywords:- Text Mining; Topic Detection; Clustering; Twitter.

I. INTRODUCTION

Due to the rapid growth of social media on the internet in recent years, a multitude of valuable information has become available. Twitter is one such social platform where a large amount of information is available. The majority of the information available on Twitter is in the form of textual documents. Twitter allows its users to send information in the form of textual data, termed tweets, almost in real time. Each user can express his or her feelings, thoughts, and knowledge in a maximum of 280 characters, which makes Twitter a very important source for detecting important trends throughout the world. Millions of tweets are posted on Twitter every day. Because of this enormous number of tweets, identifying relevant topics in real time is a challenging job. Detecting topics can be beneficial in several areas, namely, detecting natural disasters as quickly as possible, aiding political parties and corporations in assessing public opinion, detecting fake news as soon as possible, and many more. In this way, topic detection can help to reduce the misuse of social media. This paper outlines some of these topic detection techniques, such as exemplar-based topic detection, which uses the most descriptive tweets to find the topic; frequent pattern mining, in which patterns are searched using word counts; and clustering-based methods, in which clusters are used to detect topics. Other approaches include two-level message clustering and a blend of singular value decomposition and k-means clustering.

The remainder of the paper is organized as follows. Section II mentions topic detection techniques. Section III discusses the complete approach of exemplar-based topic detection. Section IV discusses frequent pattern mining and variations that can be used to detect topics from Twitter. Various clustering methods such as sequential k-means, spherical k-means, DBSCAN, and bngram to detect topics from Twitter are discussed in Section V. Section VI discusses all the phases used in two-level message clustering. Section VII discusses the combination of singular value decomposition and k-means clustering approaches in a detailed way. Finally, Section VIII concludes the various approaches used in the paper.

II. TOPIC DETECTION TECHNIQUES

There are three different approaches used by topic detection methods to represent topics. The feature pivot method [2] is the first one. In this, terms are clustered together according to their patterns of co-occurrence. There are many approaches used in this method; one interesting approach is to utilize social features for selecting the terms to be clustered [5]. The second one is the document pivot method [2], in which documents are clustered using some measure of document similarity, and the similar documents are grouped on this basis [21].

The last one is probabilistic topic models, in which topics are discovered by calculating the probability distribution of words. The two most common probabilistic topic models are Latent Dirichlet Allocation (LDA) [3] and Probabilistic Latent Semantic Analysis (PLSA) [13].

Every topic detection technique has its own way of representing topics. The most common approach to describe a topic is by using a list of keywords. A keyword's word count reflects the keyword's relevance in the given topic. Furthermore, explicit tweets can represent topics, in which the tweet keywords act as the topic's representation, as in the exemplar-based technique [14]. We
IJISRT22OCT305 www.ijisrt.com 94
Volume 7, Issue 10, October – 2022 International Journal of Innovative Science and Research Technology
identify multiple topic detection strategies in this paper, such as clustering-based topic detection, frequent pattern mining, exemplar-based topic detection, two-level message clustering, and the combination of singular value decomposition and k-means clustering approaches.

III. EXEMPLAR BASED TOPIC DETECTION

Exemplar-based topic detection uses the most descriptive tweet to represent a topic [9]. In this, the arrangement of tweets is maintained, due to which users can understand the topic more easily. This approach discards the keywords which are not correlated, which makes the topic domain smaller. The strategy is based on the idea that a tweet that is similar to a group of tweets but differentiated from the majority of the tweets is a good prototype for the topics it addresses. Based on that idea, this approach computes the similarity matrix between each pair of tweets and then analyses the behaviour of the similarity distributions by categorizing the tweets into three groups:
- The similarity distribution of tweet j will have a low sample variance if j is similar to many tweets.
- The similarity distribution of tweet j will have a high sample variance if j is similar to a particular group of tweets and less similar to other groups.
- The similarity distribution of tweet j will have a low sample variance if j is dissimilar to most of the tweets.

The tweets which belong to group two are chosen as the best representatives of topics. Since each such tweet is similar to a specific group of tweets, the tweets in group two represent their topics well. A tweet in group two is also different from the majority of the tweets, so it separates its topic from the rest of the topics.

The sample variance for the similarity distribution of each tweet is calculated as:

s_j² = (1 / (n − 1)) Σᵢ (x_ij − x̄_j)²   (1)

where n is the number of tweets, x_ij is the similarity between tweets i and j, and x̄_j is the mean of the similarity of tweet j, given by:

x̄_j = (1 / n) Σᵢ x_ij   (2)

In this method, tweets are sorted according to their variances, and the tweet having the highest variance is identified as the representative of the first topic. Following the first topic selection, all tweets that belong to the first topic are eliminated, and the methodology selects another tweet with high variance as the representative of the second topic, and so on until k topics are discovered.

IV. TOPIC DETECTION USING FREQUENT PATTERN MINING

Frequent pattern mining is used to find frequent patterns or associations among item sets found in various kinds of databases [11]. For a given set of transactions, this mining method finds association rules that predict the co-occurrence of items. Frequent pattern mining was first used on transaction data to find items that co-occur in the transactions. There are several variations of frequent pattern mining, such as DHP [18], DIC [4], and Apriori [1]. Frequent pattern mining commonly uses the FP-growth algorithm [12]. The steps of the FP-growth algorithm are given below.
- In the first step, the database is scanned to count the occurrences of the items in the database. Items whose frequency is below a particular threshold are discarded.
- The FP-tree is constructed in the next step; the root is formed and is denoted by null.
- In the next step, the database is scanned again and the transactions are examined to find the item sets in them. Each item set is sorted in descending order of frequency, so that the item with the maximum frequency comes first. In other words, item sets generate the tree branches in descending order of frequency.
- In this step, the next transaction of the database is examined and its item set is again sorted in descending order of frequency. The transaction's branch shares a common prefix from the root if its item sets are already present in another branch. In other words, the repeated item sets are reused and the remaining items are connected as new nodes.
- In this step, the frequency counts of the item sets are incremented by one along both the common nodes and the new nodes.
- The generated FP-tree is mined in this phase. First, the lowest node and the links of the lowest node are examined, since the pattern length of the lowest node is one. The paths reaching it in the FP-tree are traversed from here. These are referred to as conditional pattern bases.
- In this step, a conditional FP-tree is formed from the frequencies of the item sets in these paths. In the conditional FP-tree, only those item sets which meet the threshold support are considered.
- In this step, the conditional FP-tree is used to generate the frequent patterns.

The same method can be used to detect topics on Twitter: the top k frequent patterns discovered by this method are taken as the detected topics. Soft frequent pattern mining (SFPM) was also introduced as a variant of frequent pattern mining [19]. By measuring the similarity between a growing set of terms and each candidate term, SFPM greedily expands the set, one term at a time. This process is repeated until the similarity between the next term and the set falls below a certain threshold.
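As a minimal sketch of the exemplar-based selection of Section III, the greedy variance-based procedure can be written as follows; the toy similarity matrix and the percentile rule used to drop a chosen topic's tweets are illustrative assumptions, not the exact parameters of [9]:

```python
import numpy as np

def exemplar_topics(similarity, k):
    """Greedy exemplar selection: repeatedly pick the tweet whose
    similarity distribution has the highest sample variance (the
    group-two behaviour), then drop the tweets most similar to it.
    The percentile cut for topic membership is an assumption."""
    active = list(range(similarity.shape[0]))
    exemplars = []
    for _ in range(k):
        if not active:
            break
        sub = similarity[np.ix_(active, active)]
        variances = sub.var(axis=0, ddof=1)       # sample variance, as in eq. (1)
        best = active[int(np.argmax(variances))]
        exemplars.append(best)
        row = similarity[best]
        cutoff = np.percentile(row[active], 75)   # assumed topic-membership cut
        active = [i for i in active if i != best and row[i] < cutoff]
    return exemplars

# Toy symmetric similarity matrix over six tweets (hypothetical values):
# tweets 0-2 discuss one topic, tweets 3-4 another, tweet 5 is noise.
S = np.array([
    [1.0, 0.9, 0.8, 0.1, 0.1, 0.2],
    [0.9, 1.0, 0.7, 0.2, 0.1, 0.1],
    [0.8, 0.7, 1.0, 0.1, 0.2, 0.1],
    [0.1, 0.2, 0.1, 1.0, 0.9, 0.3],
    [0.1, 0.1, 0.2, 0.9, 1.0, 0.2],
    [0.2, 0.1, 0.1, 0.3, 0.2, 1.0],
])
print(exemplar_topics(S, 2))   # → [0, 3]
```

Here the first exemplar comes from the tweet 0-2 group and the second from the 3-4 group; in [9] the similarity measure and the removal of a selected topic's tweets are defined by the method itself, so the percentile cut above is only a placeholder.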
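The frequent-pattern idea of Section IV can likewise be sketched with a simplified miner that counts support by direct enumeration rather than building an FP-tree; the tweets, the support threshold, and the maximum pattern length below are illustrative:

```python
from itertools import combinations
from collections import Counter

def frequent_patterns(tweets, min_support, max_len=3):
    """Count, for every term set of up to max_len terms, how many tweets
    contain it, and keep the patterns that meet the support threshold.
    Candidates are enumerated directly instead of via an FP-tree."""
    counts = Counter()
    for tweet in tweets:
        terms = sorted(set(tweet.lower().split()))
        for size in range(1, max_len + 1):
            for pattern in combinations(terms, size):
                counts[pattern] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

# Illustrative tweets: "earthquake" and "city" co-occur often enough to
# surface as a frequent pattern, i.e. a candidate topic.
tweets = [
    "earthquake hits city",
    "major earthquake city damage",
    "city earthquake rescue underway",
    "election results announced",
]
patterns = frequent_patterns(tweets, min_support=3)
print(sorted(patterns))   # → [('city',), ('city', 'earthquake'), ('earthquake',)]
```

In the full method the top k frequent patterns would be reported as topics; FP-growth [12] arrives at the same patterns without enumerating every candidate set.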
V. TOPIC DETECTION WITH CLUSTERING TECHNIQUES

There are many clustering methods used for topic detection. Some of them are sequential k-means, spherical k-means, DBSCAN, and bngram [14]. In clustering methods, clusters are used to define a particular topic. The centroid of each cluster is used as the representative of that cluster, and the top m words of the cluster are used as the keywords of that topic. The total number of topics detected is the number of clusters formed. The TF-IDF scheme is used to represent the tweets. Some of the clustering methods used for topic detection are:

A. Sequential K-means
Sequential k-means works in two steps [16]. In step 1, each data point is allocated to the cluster whose center is closest to the data point in terms of Euclidean distance. In step 2, the centroid of each cluster is recomputed using all the members of that cluster. The Euclidean distance is calculated as:

√((m1 − m2)² + (n1 − n2)²)   (3)

where the first data point is (m1, n1) and the second data point is (m2, n2).

B. Spherical K-means
In spherical k-means, cosine similarity is used in place of Euclidean distance [16]. For two vectors a and b, the cosine similarity is calculated as:

cos(a, b) = (a · b) / (‖a‖ ‖b‖)   (4)

C. DBSCAN
DBSCAN [10] is a density-based clustering method that uses two parameters: eps, the neighborhood radius, and minPts, the minimum number of points required to form a dense region. A point is a core point if at least minPts points lie within its eps neighborhood. If both point a and point b are density-reachable from a common core point within the eps, then points a and b are called density-connected points, and density-connected points are placed in the same cluster. The process is repeated for the remaining points of the dataset which are unvisited.

D. Bngram
Bngram was proposed to detect topics from Twitter streams [2]. Instead of using single words for topic detection, it uses ngrams. This method employs a new time-based scoring measure for each ngram, defined as:

score = (df + 1) / ((dfprev / s) + 1)   (5)

where df is the number of tweets in which this ngram appeared in the current time slot, and dfprev is the number of tweets in which this ngram appeared during the s previous time slots. This scoring measure favors ngrams whose frequencies increase during the present time slot in comparison with the previous time slots. Following that, ngrams are ordered according to this score, and the top ngrams are chosen. A group-average hierarchical clustering of the ngrams is then performed, with the similarity between two ngrams defined in terms of the tweets in which both ngrams occur. This process is terminated when the similarity between the nearest unmerged clusters falls below a specific threshold value. At last, the clusters are ranked and the topics are returned as the top clusters.

VI. TWO-LEVEL MESSAGE CLUSTERING FOR TOPIC DETECTION
In the first level of clustering, near-duplicate tweets are grouped together, except those which are only members of a component containing a single element. In second-level clustering, the algorithms used are incremental threshold-based clustering and locality-sensitive hashing (LSH), but with some modifications. In this, all the first-level clusters become members of second-level clusters. Moreover, the second-level clusters also contain those tweets which were not members of any first-level cluster.

VII. COMBINATION OF SINGULAR VALUE DECOMPOSITION AND K-MEANS CLUSTERING

A. Singular Value Decomposition
Let M be an r × s matrix. A factorization M = USVᵀ is said to be a singular value decomposition of M, where U is an r × r orthogonal matrix, S is an r × s pseudo-diagonal matrix whose elements are non-negative, and Vᵀ is the transpose of an s × s orthogonal matrix V. The diagonal elements of the matrix S are called the singular values of M [15].
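A rough sketch of the combination approach discussed in this section, under illustrative assumptions (a tiny hand-made term weight matrix standing in for TF-IDF, rank 2, two clusters, and a naive k-means initialized with the first k points), is:

```python
import numpy as np

def kmeans(X, k, iters=50):
    """Plain k-means: assign each point to its nearest centroid,
    then recompute centroids; naive init with the first k points."""
    centroids = X[:k].copy()
    for _ in range(iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):          # never empty a centroid
                centroids[j] = X[labels == j].mean(axis=0)
    return labels

# Toy term-weight matrix: rows are tweets, columns are term weights;
# tweets 0-1 share terms 0-1, tweets 2-3 share terms 2-3.
M = np.array([
    [2.0, 1.0, 0.0, 0.0],
    [1.5, 0.8, 0.0, 0.1],
    [0.0, 0.1, 1.2, 2.0],
    [0.1, 0.0, 1.0, 1.8],
])

# SVD: M = U S V^T; keep the top-r singular directions as a "topic" space.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
r = 2
reduced = U[:, :r] * s[:r]   # tweet coordinates in the reduced topic space

labels = kmeans(reduced, k=2)
print(labels)   # tweets 0 and 1 share a cluster, as do tweets 2 and 3
```

Truncating to the top singular directions removes noise before clustering; the matrix, rank, and cluster count here are placeholders, not the settings of [17].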
REFERENCES

[1.] R. Agrawal and R. Srikant. Fast algorithms for mining association rules. In VLDB '94, 1994.
[2.] L. M. Aiello, G. Petkos, C. Martin, D. Corney, S. Papadopoulos, R. Skraba, A. Goker, I. Kompatsiaris, and A. Jaimes. Sensing trending topics in Twitter. IEEE Transactions on Multimedia, 15:1268–1282, 2013.
[3.] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
[4.] S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD '97, 1997.
[5.] M. Cataldi, L. D. Caro, and C. Schifanella. Emerging
topic detection on twitter based on temporal and social
terms evaluation. In MDMKDD ’10, 2010.
[6.] T. H. Cormen, C. Leiserson, R. Rivest, and C. Stein.
Introduction to algorithms, second edition. 2001.
[7.] S. Deerwester, S. Dumais, G. Furnas, T. Landauer,
and R. Harshman. Indexing by latent semantic
analysis. Journal of the Association for Information
Science and Technology, 41:391–407, 1990.
[8.] S. Dumais. Latent semantic analysis. Annual Review
of Information Science and Technology, 38:188–230,
2005.
[9.] A. Elbagoury, R. Ibrahim, A. K. Farahat, M. Kamel, and
F. Karray. Exemplar-based topic detection in twitter
streams. In ICWSM, 2015.
[10.] M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-
based algorithm for discovering clusters in large
spatial databases with noise. In KDD, 1996.
[11.] B. Goethals. Frequent Set Mining, pages 321–338. 2010.
[12.] J. Han, J. Pei, Y. Yin, and R. Mao. Mining frequent
patterns without candidate generation: A frequent-
pattern tree approach. Data Mining and Knowledge
Discovery, 8:53–87, 2006.
[13.] T. Hofmann. Probabilistic latent semantic indexing. In
SIGIR ’99, 1999.
[14.] R. Ibrahim, A. Elbagoury, M. Kamel, and F. Karray.
Tools and approaches for topic detection from twitter
streams: survey. Knowledge and Information Systems,
54:511–539, 2017.
[15.] B. Jacob. Linear algebra. 1990.
[16.] K. Jain, M. Murty, and P. Flynn. Data clustering: a
review. ACM Comput. Surv., 31:264–323, 1999.
[17.] K. Nur'Aini, I. Najahaty, L. Hidayati, H. Murfi,
and S. Nurrohmah. Combination of singular value
decomposition and k-means clustering methods for
topic detection on twitter. 2015 International
Conference on Advanced Computer Science and
Information Systems (ICACSIS), pages 123–128,
2015.
[18.] J. S. Park, M. Chen, and P. S. Yu. An effective hash-
based algorithm for mining association rules. In
SIGMOD ’95, 1995.
[19.] G. Petkos, S. Papadopoulos, L. M. Aiello, R. Skraba, and
Y. Kompatsiaris. A soft frequent pattern mining
approach for textual topic detection. In WIMS ’14,
2014.
[20.] G. Petkos, S. Papadopoulos, and Y. Kompatsiaris. Two-
level message clustering for topic detection in twitter.
In SNOW-DC@WWW, 2014.
[21.] S. Phuvipadawat and T. Murata. Breaking news
detection and tracking in twitter. 2010
IEEE/WIC/ACM International Conference on Web
Intelligence and Intelligent Agent Technology, 3:120–
123, 2010.