
Journal of Information & Knowledge Management
Vol. 20, No. 3 (2021) 2150038 (18 pages)
© World Scientific Publishing Co.
DOI: 10.1142/S0219649221500386

An Evolutionary Clustering Analysis of Social Media Content and Global Infection Rates During the COVID-19 Pandemic

Ibrahim Arpaci
Department of Computer Education and Instructional Technology, Tokat Gaziosmanpasa University, 60250 Tokat, Turkey
ibrahim.arpaci@gop.edu.tr

Shadi Alshehabi
Department of Computer Engineering, Turkish Aeronautical Association University, Etimesgut, 06790 Ankara, Turkey
shadialshehabi@gmail.com

Ibrahim Mahariq* and Ahmet E. Topcu†
College of Engineering and Technology, American University of the Middle East, Egaila, Kuwait
*ibmahariq@gmail.com
†ahmet.topcu@aum.edu.kw

Published 19 June 2021

Abstract. This study investigates the impact of global infection rates on social media posts during the COVID-19 pandemic. The study analysed over 179 million tweets posted between March 22 and April 13, 2020 and the global COVID-19 infection rates using evolutionary clustering analysis. Results showed six clusters constructed for each term type, including three-level n-grams (unigrams, bigrams and trigrams). The frequent occurrences of unigrams ("COVID-19", "virus", "government", "people", etc.), bigrams ("COVID 19", "COVID-19 cases", "times share", etc.) and trigrams ("COVID 19 crisis", "things help stop" and "trying times share") were identified. The results demonstrated that the unigram trends on Twitter were up to about two times and 54 times more common than the bigram terms and trigram terms, respectively. Unigrams like "home" or "need" also became important as these terms reflected the main concerns of people during this period. Taken together, the present findings confirm that many tweets were used to broadcast people's prevalent topics of interest during the COVID-19 pandemic.
Furthermore, the results indicate that the number of COVID-19 infections had a significant effect on all clusters, being strong on 86% of clusters and moderate on 16% of clusters. The downward slope in global infection rates reflected the start of the trending of "social distancing" and "stay at home". These findings suggest that infection rates have had a significant impact on social media posting during the COVID-19 pandemic.

Keywords: COVID-19; evolutionary clustering; social media; Twitter.

1. Introduction

Social media platforms such as Instagram, Twitter and YouTube can increase fear and panic among people (Burnap et al., 2014). Social media allows the dissemination of rumours and disinformation, which can be used to manipulate human behaviours (Li et al., 2020). The term "infodemic" has been used by the World Health Organization (WHO) to emphasise the problem of misinformation during the COVID-19 pandemic (Zarocostas, 2020). The infodemic about the lockdown in the United States, for example, broke the supply chain as people flooded grocery stores for panic-buying in a period of supply disruption, and this negatively affected food security and healthy nutrition (Tasnim et al., 2020).

However, social media can also be used to calm people and encourage the wide-scale adoption of behaviours such as staying at home and "social distancing" (Kayes et al., 2020). Social media platforms may be employed for the dissemination of immediate and useful information (Nayar et al., 2020), and they may also be used to predict the number of infected individuals, which could help governments and health institutions detect high-risk or potentially high-risk locations (Qin et al., 2020). Furthermore, a useful analysis of social media posts may help governments better understand public psychology and effectively communicate with citizens to manage and overcome panic and fear.
Finally, governments may employ social media platforms to promote the wide-scale adoption of "social distancing" and "stay at home" (Shah et al., 2020).

COVID-19 has affected individuals' behaviours and psychology negatively, and social media increased this negative effect by disseminating the COVID-19 infodemic (Arpaci et al., 2021). Humanity is threatened when disinformation and rumours travel faster than the COVID-19 pandemic itself (Wilson and Chen, 2020). Accordingly, this study aimed to investigate the impact of global infection rates on social media posts during the COVID-19 pandemic.

2. Related Works

Valdez et al. (2020) analysed 86,581,237 tweets obtained from a public repository from January to April 2020 using computational social media analytics. They found that the majority of the tweets were related to "China" in February 2020. In contrast, the majority of the tweets were related to "social distancing" and "lockdown" from March to April 2020. Their findings indicated that tweets related to COVID-19 dynamically changed over time. In another study, Tao et al. (2020) analysed 15,900 tweets about oral health data on a Chinese social media platform (Weibo) during the COVID-19 pandemic (from December 31, 2019 to March 16, 2020). Their results indicated that "dental services", "needs of dental treatment" and "home oral care" were the most frequently shared tweet topics.

Islam et al. (2020) performed social media analysis to contrast and compare data collected from online platforms, including Twitter, Facebook and online newspapers. They identified 2,311 stigmas, rumours and conspiracy theories related to COVID-19 in 87 countries and 25 languages. They also identified claims related to transmission, illness, control measures, treatment, cure, origin, violence, miscellaneous points and mortality. Of the total 2,276 claims, 1,856 claims were false (82%).
They concluded that an infodemic fuelled by stigmas, rumours and conspiracy theories might have severe implications for public health. Therefore, governments should understand the patterns of stigmas, rumours and conspiracy theories to better address risk communication.

In another study, Ahmed et al. (2020) used social network analysis to examine the Twitter data from seven days (from March 27, 2020 to April 4, 2020) in which the #5GCoronavirus hashtag was trending in the UK. Their results indicated that, of 233 sample tweets, only 34.8% of the tweets supported views that 5G and COVID-19 were linked, while 65.2% of tweets did not support that conspiracy theory, suggesting that only a limited number of users believed it.

Sharma et al. (2020) analysed the Twitter data of NASDAQ-100 firms using text analytics tools during the COVID-19 pandemic. They found that firms were experiencing difficulties in constructing sustainable supply chains and they provided recommendations to build resilient and sustainable supply chains. In another study, Guo et al. (2020) analysed Twitter data related to COVID-19 posted in English between March 30, 2020 and April 19, 2020 to extract symptom terms. They extracted 36 physical symptoms, including all symptoms suggested by the Centers for Disease Control and Prevention, from 30,732 unique tweets.

3. Method

3.1. Evolutionary clustering

The immense number of tweets posted daily under the influence of current crises requires analysis to extract hidden and important information about the trends of people during such problems or big events. Machine learning provides several techniques for such research. This kind of analysis needs to take into consideration both the daily posted tweets and the evolution of the tweet terms over different periods.
Therefore, in this study, we propose an evolutionary clustering approach, which applies an evolutionary clustering algorithm to determine the relationships between the influence of events and the evolution of tweets over time.

The evolutionary clustering algorithm is based on K-means, which is a hard clustering method where each object belongs to only one cluster. K-means is chosen in this study because the daily tweet terms are described in one-dimensional space, and using more complex clustering methods such as self-organising map (SOM) (Kohonen, 1990) and support vector machine (SVM) for clustering (Winters-Hilt and Merat, 2007; Ben-Hur et al., 2001) to analyse such a dataset provides poor performance compared to K-means due to the parameter sets and complexities (Mingoti and Lima, 2006). Both SOM and SVM clustering are most effectively used for dealing with complex and high-dimensional data. Therefore, due to the simplicity of K-means and its lower complexity, it is recommended to be used with the evolutionary clustering algorithm.

Accordingly, the following algorithm was developed to deal with data streams of terms. In the algorithm, the posted tweet terms of the first time span are modelled as a list of clusters generated by K-means, benefitting from its ability to learn the distribution of tweet terms and determining their best modelling by means of the elbow method. This model is used as an initial model for clustering the posted tweet terms of the next time span. The same approach is applied for all remaining sets of terms in the next time spans. Thus, our evolutionary clustering approach is based on several steps for analysing data streams of terms in a given time window, as shown in Fig. 1.

First step: Data terms are collected for a given time window.
Second step: The evolutionary clustering algorithm is applied.
Third step: The terms are distributed into the clusters in different time spans and a term evolution matrix is generated. This matrix describes the evolution of terms in a time window. Each row represents one term, each column represents one time span and each value contains the cluster ID, which varies between 1 and K (the best number of clusters determined by the elbow algorithm). For example, Table 1 shows a term evolution matrix that contains three terms (r1, r2 and r3) in four time spans (t1, t2, t3 and t4). Term r1 moves from Cluster 1 (at t1) to Cluster 2 (at t2), stays in Cluster 2 (at t3) and then moves to Cluster 1 (at t4).
Fourth step: The user can select any term from this matrix and show its evolution, or the similarities between terms can be computed from the above matrix using the following formula (Zaki et al., 2014):

\[
\mathrm{similarity}(r_i, r_j) = \frac{r_i \cdot r_j}{\lVert r_i \rVert \, \lVert r_j \rVert}
\]

Here, r_i is a row array of cluster IDs and ||r_i|| is the length (Euclidean norm) of that array. For example, from Table 1, the similarity between r1 and r2 is similarity(r1, r2) = 1. The similarity between r1 and r3 is similarity(r1, r3) = 0.9655.
Fifth step: Pairs of terms that have high similarities (i.e. strong relationships) will be extracted. For example, the pair (r1, r2) is selected due to their high similarity.
Sixth step: The evolution of the selected pairs will be shown.

Fig. 1. Evolutionary clustering-based approach.

The tweet terms in a cluster during a given time span will be related to one another according to their frequencies. Consequently, each tweet term can move from one cluster during a particular time span to another cluster at the next time span, showing its evolution in a given time window.
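The similarity values quoted in the fourth step can be checked with a short sketch. It assumes the term evolution matrix of Table 1 is stored as one list of cluster IDs per term; the function names and the 0.99 threshold used for the fifth step are illustrative choices, not part of the original method.

```python
import math

# Term evolution matrix from Table 1: one row of cluster IDs per term,
# one entry per time span (t1..t4).
term_evolution = {
    "r1": [1, 2, 2, 1],
    "r2": [1, 2, 2, 1],
    "r3": [3, 3, 3, 2],
}


def similarity(a, b):
    """Cosine similarity between two rows of cluster IDs."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def similar_pairs(matrix, threshold):
    """Return the term pairs whose similarity is at least `threshold`.

    The threshold is a hypothetical parameter; the paper only speaks of
    "high" similarity without giving a cut-off.
    """
    names = sorted(matrix)
    return [
        (p, q)
        for i, p in enumerate(names)
        for q in names[i + 1:]
        if similarity(matrix[p], matrix[q]) >= threshold
    ]


sim_r1_r2 = similarity(term_evolution["r1"], term_evolution["r2"])
sim_r1_r3 = similarity(term_evolution["r1"], term_evolution["r3"])
pairs = similar_pairs(term_evolution, 0.99)
```

With the Table 1 values, the sketch reproduces similarity(r1, r2) = 1 and similarity(r1, r3) = 0.9655 (rounded), and only the pair (r1, r2) passes the illustrative threshold.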
The developed unsupervised evolutionary clustering algorithm is suitable for dealing with crises and events without any a priori information where unlabelled data are present. Moreover, our approach shows the relationship between the terms considering their evolution in the clusters over time. While the traditional techniques analyse each term independently by showing its frequency over time, our approach shows, thanks to clusters, that even if the term is more frequent in the next time span, it may be less important than the other terms.

Table 1. Term evolution matrix.

            Time window
Terms    t1    t2    t3    t4
r1       1     2     2     1
r2       1     2     2     1
r3       3     3     3     2

On the other hand, when labelled data are present, several supervised machine learning techniques have been employed in the literature for automatically detecting crisis-related tweets, such as Naive Bayes (Li et al., 2018), SVM (Sakaki et al., 2010) and random forests (Kaufhold et al., 2020). Moreover, these supervised techniques are employed for predicting the credibility of news on Twitter (Hassan et al., 2020).

3.2. Dataset characteristics

The Twitter data used in this study were posted in early 2020 and published (Banda and Ramya, 2020). The dataset consists of 43M+ (43,845,712) tweets. The tweets with no retweets constitute 7,479,940 unique tweets. The dataset consists of the top 1,000 bigrams, the top 1,000 frequent terms and the top 1,000 trigrams. Each tweet has an identifier with date and time added. This study used a three-level n-gram system that indicates the sequence of many items such as words, letters and numbers: (i) a unigram is one word ("COVID-19", "virus", "pandemic", etc.); (ii) a bigram is a sequence of two words ("COVID-19 cases", "COVID-19 outbreak", etc.); and (iii) a trigram is a sequence of three words ("Italian friends colleagues", "stop spread COVID-19", "five things help", etc.). The n-grams are used to figure out how words co-occur in the tweets.
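As a rough illustration of the three n-gram levels, the sketch below counts n-grams in a handful of tweets. It assumes plain whitespace tokenisation and lower-casing, which is a simplification; the preprocessing actually applied to the published dataset is not described here, and the function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a space-joined string."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_terms(tweets, n, k):
    """Count the n-grams over all tweets and return the k most frequent.

    Tokenisation by whitespace is an assumption made for this sketch.
    """
    counts = Counter()
    for tweet in tweets:
        counts.update(ngrams(tweet.lower().split(), n))
    return counts.most_common(k)


tokens = ["stop", "spread", "covid-19"]
bigrams = ngrams(tokens, 2)
trigrams = ngrams(tokens, 3)
top_bigram = top_terms(["stop spread covid-19", "stop spread"], 2, 1)
```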
Using bigrams and trigrams is more useful than unigrams for understanding the meaning of the tweeted sentences.

3.3. Procedures and techniques

This study was conducted to clarify and analyse the impact of COVID-19 infection rates on the trends of people's posted tweets for a time window of 23 days, from March 22 to April 13, 2020. The tweets contain many terms that can be grouped into three categories: unigrams, bigrams and trigrams. The impact of COVID-19 infections around the world on the evolution of the frequencies of these terms is analysed by two consecutive processes.

In the first process, the terms (unigram, bigram and trigram terms) are clustered using evolutionary clustering that relies on K-means clustering over time. First, for the first day of the data stream (i.e. March 22), the elbow method is used to determine the optimal number of K clusters. Thus, six clusters are generated for each type of term on that mentioned day. The generated cluster centroids are then used as initial centroids for clustering the second day's data stream (i.e. March 23). The same approach is applied for all remaining days until April 13. Clusters on each day represent different ranges of term frequencies, and these ranges do not overlap. All clusters are arranged in descending order from Cluster 1 to Cluster 6, such that Cluster 1 contains the terms with the highest frequency levels while Cluster 6 comprises the terms with the lowest frequency levels. The cluster centroids are determined by finding the averages of the frequencies of their associated terms. The evolution of terms can be seen by analysing their movements to different clusters over time.
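The first process can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the elbow method is omitted and the initial centroids for the first day are simply passed in, the helper names are hypothetical, and the term frequencies are invented toy values. Each day's K-means run starts from the previous day's centroids, and clusters are then ranked so that Cluster 1 holds the most frequent terms.

```python
def assign(values, centroids):
    """Index of the nearest centroid for each one-dimensional value."""
    return [
        min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
        for v in values
    ]


def kmeans_1d(values, init_centroids, max_iter=100):
    """Lloyd's K-means on scalar values, started from given centroids."""
    centroids = list(init_centroids)
    for _ in range(max_iter):
        labels = assign(values, centroids)
        updated = []
        for k, old in enumerate(centroids):
            members = [v for v, lab in zip(values, labels) if lab == k]
            # An empty cluster keeps its previous centroid.
            updated.append(sum(members) / len(members) if members else old)
        if updated == centroids:
            break
        centroids = updated
    return centroids, assign(values, centroids)


def evolutionary_clustering(daily_freqs, init_centroids):
    """Cluster each day's term frequencies, warm-starting from the day before.

    daily_freqs: list of {term: frequency} dicts, one per day.
    Returns {term: {day_index: cluster_rank}}, with rank 1 = most frequent.
    """
    centroids = list(init_centroids)
    evolution = {}
    for day, freqs in enumerate(daily_freqs):
        terms = list(freqs)
        centroids, labels = kmeans_1d([freqs[t] for t in terms], centroids)
        # Rank clusters by descending centroid so Cluster 1 is the top range.
        order = sorted(range(len(centroids)), key=lambda k: -centroids[k])
        rank = {k: i + 1 for i, k in enumerate(order)}
        for term, lab in zip(terms, labels):
            evolution.setdefault(term, {})[day] = rank[lab]
    return evolution


# Toy frequencies for two days (invented for illustration only).
days = [
    {"covid": 100, "virus": 40, "home": 10, "need": 8},
    {"covid": 120, "virus": 90, "home": 12, "need": 9},
]
evolution = evolutionary_clustering(days, init_centroids=[100, 40, 10])
```

In this toy run, "virus" sits in Cluster 2 on the first day and moves up to Cluster 1 on the second, which is the kind of movement the term evolution matrix records.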
In the second process, the relationship is illustrated between the daily numbers of new and total COVID-19 infections and the evolution of the social media term frequencies that are represented by clusters at each time in the time window from March 22 to April 13. The relationship between the first time series, namely the daily number of COVID-19 infections $f = \{f_1, f_2, \ldots, f_T\}$, and the second time series, namely the evolution of a specific cluster centroid at each time $c_i = \{c_{1,i}, c_{2,i}, \ldots, c_{T,i}\}$, can be quantified by means of a statistical measure known as Pearson's correlation coefficient ($r_i$), which is given as follows:

\[
r_i = \frac{\sum_{t=1}^{T} (c_{t,i} - \bar{C}_i)(f_t - \bar{F})}{\sqrt{\sum_{t=1}^{T} (c_{t,i} - \bar{C}_i)^2}\,\sqrt{\sum_{t=1}^{T} (f_t - \bar{F})^2}}
\]

Here, $\bar{C}_i$ represents the average of the centroids of cluster $i$ in a time window and $\bar{F}$ represents the average number of infections in the same time window. $r_i \in [-1, +1]$, where $-1$ denotes a perfect negative correlation, $+1$ denotes a perfect positive correlation and $0$ denotes no correlation (Boslaugh, 2012). Such a statistical measure is supported by the p-value, which is used to measure the significance of the empirical analyses for determining the relationship between the two abovementioned time series, the number of COVID-19 infections and the cluster centroids over time (Arpaci et al., 2020). If the p-value is close to 0, then the observed correlation is unlikely to be due to chance and so there is a correlation between the two time series.

4. Results

The evolutions of cluster centroids over time for the three term types are shown in Figs. 2–4, where the cluster centroid shows the frequency averages of the terms. It can be noted from the results illustrated in Figs.
2–4 that trigram terms trended less frequently than bigram terms, which in turn trended less frequently than unigram terms.

Fig. 2. Evolution of clusters for unigram terms.

Fig. 3. Evolution of clusters for bigram terms.

Figure 2 shows that Clusters 1 and 2, which have the terms with the highest frequency, generally increase over time. Clusters 5 and 6, which have the terms with the lowest frequency, are stable over time. Cluster 3 has roughly three unchanging phases over time: the first phase is from March 22 to March 23, the second phase from March 24 to April 1 and the third phase from April 2 to April 13. It is also seen that unigram terms in the second phase trend less frequently than in the first and third phases. Cluster 4 generally increases over time.

Figure 3 shows the evolution of the clusters for bigram terms. The first cluster, which has the bigram terms of highest frequency, keeps increasing over time, followed by Clusters 2 and 3, which have bigram terms with very low frequencies compared to those in Cluster 1. The other three clusters, which have the bigram terms of lowest frequency, are stable over time.

Figure 4 shows the evolution of the clusters for trigram terms. Cluster 1 comprises highly trending terms and it keeps increasing, while Cluster 4 also tends to increase over time. The other clusters (i.e. Clusters 2, 3, 5 and 6) increase slowly over time compared to Clusters 1 and 4.

Fig. 4. Evolution of clusters for trigram terms.

Fig. 5. Numbers of new and total COVID-19 infections.

The evolution of new COVID-19 infections and that of total COVID-19 infections in the world are shown in Fig. 5. It is observed that the total COVID-19 infections kept increasing over the studied window of time.
Different behaviour is observed in the case of new COVID-19 infections, which kept generally increasing for the first 13 days in the studied time window and peaked on April 3. The impacts of the evolution of total and new COVID-19 infections on the trends of people's tweets are illustrated in Tables 2 and 3. It is observed that the impact of the number of COVID-19 infections is positive on all clusters, and it is strong on 86% of clusters and moderate on 16% of clusters. This impact on the clustered top 1,000 tweet terms of all types is significant and it is not due to chance based on the very low p-values illustrated in Tables 4 and 5. It is observed that the trends of the most frequent terms for the three types represented in Clusters 1 and 2 are generally more influenced by the number of total COVID-19 infections than that of new COVID-19 infections. The impact of total infections is larger than the impact of new infections on the trigram terms.

Table 2. Correlation coefficients between total infections and clusters.

Grams        Unigram terms    Bigram terms    Trigram terms
Cluster 1    0.6826           0.8984          0.8606
Cluster 2    0.9031           0.8616          0.9310
Cluster 3    0.7689           0.7712          0.8921
Cluster 4    0.9068           0.6270          0.8251
Cluster 5    0.9111           0.7124          0.8397
Cluster 6    0.9057           0.8098          0.8748

Table 3. Correlation coefficients between new infections and clusters.

Grams        Unigram terms    Bigram terms    Trigram terms
Cluster 1    0.7399           0.8108          0.6462
Cluster 2    0.8033           0.7787          0.7373
Cluster 3    0.5874           0.7965          0.7278
Cluster 4    0.7418           0.7656          0.6924
Cluster 5    0.7788           0.7896          0.7507
Cluster 6    0.8573           0.8107          0.7578

Table 4. The p-values for total COVID-19 infections and clusters.

Grams        Unigram terms    Bigram terms    Trigram terms
Cluster 1    0.0003           0               0
Cluster 2    0                0               0
Cluster 3    0                0               0
Cluster 4    0                0.0014          0
Cluster 5    0                0.0001          0
Cluster 6    0                0               0

Table 5. The p-values for new COVID-19 infections and clusters.

Grams        Unigram terms    Bigram terms    Trigram terms
Cluster 1    0.0001           0               0.0009
Cluster 2    0                0               0.0001
Cluster 3    0.0032           0               0.0001
Cluster 4    0.0001           0               0.0003
Cluster 5    0                0               0
Cluster 6    0                0               0

The evolutions of some tweet terms of all types (i.e. unigram, bigram and trigram terms) that are positively influenced by both the total and new COVID-19 infections are illustrated in Figs. 6–8. Figure 6 shows the evolution of some unigram terms in the clusters. "COVID-19", as the most frequent term on Twitter, remained in the first cluster during the studied time window. The term "covid19", which was less important than "COVID-19", was in Cluster 3 for the first two days and then moved up to Cluster 2 on March 24, which means it became more important in tweets. The two unigram terms "virus" and "pandemic" were used less than "covid19" during this time window, and they moved up to Cluster 3 and became more important during the five days from March 26 to March 30. The three terms "people", "via" and "trump" moved up and became more important from March 24 to April 1. Some terms like "government", "stay", "many", "die" and "day" were less important than the previous three terms but moved up one cluster in those five days. There is a harmonious relationship between these five terms and the two above-mentioned terms of "virus" and "pandemic". The two sets of terms {"please", "stay"} and {"need", "home"} have almost the same importance in tweets and they moved down one cluster (i.e. to Cluster 5), becoming less important, from April 3.

Fig. 6. Evolution of the unigram terms in clusters.

Fig. 7. Evolution of the bigram terms in clusters.

Figure 7 shows the evolution of some bigram terms in the clusters. The most important bigram term on Twitter was "COVID 19", followed by the bigram term "COVID-19 pandemic", which was in turn followed by the bigram term "COVID-19 crisis"; these terms remained in Clusters 1–3, respectively.
The bigram term "COVID-19 relief" became less important from March 28 and it fluctuated between Clusters 5 and 6 over time. The two bigram terms "COVID-19 cases" and "COVID-19 outbreak" fluctuated between Clusters 2 and 3 from March 22 to April 1 and then they remained in Cluster 3, together with the bigram term "social distancing". The latter bigram term generally remained in Cluster 3. An association between "social distancing" and "stay home" can be observed from April 1. The bigram term "stay home" had the same degree of importance as "COVID-19 pandemic" for the period from March 22 to March 31, and then "stay home" became less important in tweets. The bigram term "death toll" fluctuated between Clusters 3 and 4. There is a relationship among the bigrams "five things", "things help" and "help stop", and together they form the phrase "five things help stop". This phrase quickly became less important and the bigram terms extracted from it no longer existed after March 24. The bigram terms "times share", "stand Italy", "Italian friends" and "support Italian" moved down quickly from a high degree of importance (i.e. Cluster 2) to a very low degree of importance on March 24 and then no longer existed among the top 1,000 bigram terms in the studied time window. A relationship is noted between these two previous sets of bigram terms.

Fig. 8. Evolution of the trigram terms in clusters.

Figure 8 shows the evolution of some trigram terms in clusters for the studied window of time. The "COVID 19 pandemic" trigram term was the most used trigram on Twitter starting from March 23. The "COVID 19 cases" trigram term moved up to Cluster 1 for three consecutive days (March 26–28) and became very important, like "COVID 19 pandemic", and then it moved down one cluster. The trigram term "COVID 19 outbreak" became more important in tweets.
Figure 8 shows that this term was in Cluster 2 from March 23 to April 2, and then it moved down one cluster. The trigram term "positive COVID 19" fluctuated between different clusters (Clusters 1–4) during the time window, such that it was among the most important trigrams in tweets for only one day, on March 27. On 70% of the studied days, "COVID 19 crisis" was considered a trending trigram term, while "amid COVID-19 pandemic" moved down from Cluster 3 to Cluster 4 on March 29 and became a less trending trigram term. A relationship was seen among the trigram terms "trying times share", "support Italian friends", "Italian friends colleagues", "stop spread COVID-19", "five things help" and "things help stop" in that they began as the most highly trending terms and then moved down quickly on the following days to be among the trigram terms of lowest frequency. The trigram terms "mass testing ph", "ph demand duterte" and "duterte COVID 19" also moved down quickly from Cluster 3 to Cluster 6 and no longer existed after March 27.

5. Discussion

In this study, Twitter posts from a recent time frame (March 22, 2020 to April 13, 2020) during the COVID-19 pandemic were analysed using clustering analysis. From this analysis, there were obvious patterns reflecting the reality on the ground: the presence of a pandemic. In order to achieve robust results, six clusters were constructed for each term type, including three-level n-grams (unigram, bigram and trigram), and the frequent occurrences of unigrams ("COVID-19", "virus", "government", "people", etc.), bigrams ("COVID 19", "COVID-19 cases", "times share", etc.) and trigrams ("COVID 19 crisis", "things help stop" and "trying times share") were identified. Our results demonstrated that the unigram trends on Twitter were up to about two times and 54 times more common than bigram and trigram terms, respectively.
Taken together, the present findings confirm that many tweets have been used to broadcast people's prevalent topics of interest during the COVID-19 pandemic.

For the unigram term type, Clusters 1 and 2 have the terms of the highest frequency. Clusters 5 and 6 have low-frequency terms. "COVID-19" remained in the first cluster as not only the most frequent term in the cluster but also its centroid, being the most important term to capture the underlying societal dynamics of recent events. Also, "covid19" moved up the scale from relatively low frequency in Cluster 3 to the high-frequency terms in Cluster 2. This could correspond to the fact that people started adopting the name COVID-19 to refer to the new virus, as announced by the WHO. It is also interesting to observe that policymakers' names, such as "Trump", showed high frequency from March 24 to April 1; this underlines the importance of the influence of decision makers during this period. In general, unigrams like "home" or "need" also became important as these terms reflected the main concerns of people during this period.

Similar activity trends could be observed for bigrams and trigrams. Cluster 1 reflects the most frequent bigrams and trigrams, respectively. The trend of Cluster 1 being the cluster containing the most frequent terms is not surprising, since the algorithm is obviously meant to cluster based on the frequency of terms. A highly trending term would be a nearest neighbour to another highly trending term such that they could eventually fall into the same cluster. As the terms evolve and become more frequent, a change of dynamics and movement into Cluster 1 and Cluster 2 is observed. The most essential bigrams are "COVID 19" and "COVID-19 pandemic". This is also not surprising as these bigrams best explain the dynamics of the underlying events during this period. It may be further explained by the fact that "COVID-19" and "covid19" were also the most frequent unigrams.
It is interesting to note that "stay home" trended at the same level as "COVID-19 pandemic" before it eventually plateaued out, signifying that perhaps the word had circulated, and people were then complying or being forced by law to comply. Bigrams such as "social distancing", "COVID-19 cases" and "COVID-19 outbreak" showed fluctuating trends for the period under study. Perhaps this is because the importance or effects of activities became more apparent as the news spread. The trends also reveal some change in status for Italy regarding the pandemic during the period under study. This is seen in the fact that terms such as "stand Italy" or "Italian friends" trended for some time and eventually plateaued out. The bigram "COVID-19 relief" also revealed a jump to Cluster 3 on March 24. Explaining that jump, Fig. 8 reveals that this period saw the start of another rise in COVID-19 cases and hence the need for relief packages. After the rise, a decrease in "COVID-19 relief", reflected by its jump to a lower cluster, mirrors a decrease in cases, as also depicted in Fig. 8.

The corresponding most frequent trigram is "COVID 19 pandemic". As "COVID 19" is the most frequent bigram, it is logical that a trigram containing "COVID 19" would be the most frequent trigram. The terms "COVID 19 outbreak" and "positive COVID 19" also trended for some time. In particular, the trigrams "Italian friends colleagues" and "stop spread COVID-19" also trended to reflect the rise of daily cases in Italy, as shown in Fig. 8. It is also important to note that this figure reflects the fluctuation and decrease revealed in the trigram analysis.

Figure 5 shows the total infections and new infections. Total infections as declared by the WHO increase during this period, but new infections show a peak and then gradually slope downwards. This downward slope might reflect the start of the trending of "social distancing" and "stay at home".
It would be interesting in future studies to approximate this peak period using a multimodal distribution. This might be explained by the fact that the bigrams "social distancing" and "stay home" circulated widely across different regions of the world, and so in spite of the pattern of total cases, new cases began to lag as a result of awareness and preventive measures.

It is essential to highlight that the trends of the highest frequency for the three term categories represented in Clusters 1 and 2 are influenced more by total COVID-19 infections than new COVID-19 infections. However, specifically for Cluster 1, considering the unigram term, a reverse effect is observed: it is influenced more by new COVID-19 infections, not the total infection rate. This might be because "COVID-19" is the centroid of unigram Cluster 1 and its only term.

6. Conclusion

People around the world spend hours every day on social media to share their thoughts, reactions and opinions with others. During mass emergencies like floods, earthquakes or the widespread occurrence of any infectious diseases such as COVID-19, social media sites such as Twitter can be significantly useful tools when quick communications and actions are required by decision makers. This helps policymakers in healthcare systems actively introduce educational services in the regions exposed to higher risk, and it helps predict morbidity rates in each region. Essentially, this can also help decrease the incidence of cases and even mortality rates in communities. Data analytics is one of the most important aspects of understanding pandemics and preventing their dissemination. Thus, data extraction and processing should not only depend on clinical healthcare data; one needs to understand the effectiveness of other resources such as surveys and social media related to the disease as well.
The models for data processing engines may include different types of resources to analyse the impact of any critical event, including pandemics. These are important findings for understanding how clustering analysis can be used to explore and evaluate recent Twitter data related to the COVID-19 pandemic. In this analysis, the patterns of the trends clearly reflected the realities on the ground. The results of this study demonstrate that the frequency of trending unigrams on Twitter is higher than the frequency of trending bigram and trigram terms. Moreover, the present results indicate that during the COVID-19 pandemic, tweets were widely used to spread the topics attracting people's attention. In general, as reflected in this study, Twitter feeds could help us understand the dynamics of underlying societal incidents. This, in turn, could help us respond adequately to pandemics, social crises or natural disasters.

Such a study would also benefit from sentiment analysis. This might help in understanding the negative or positive perceptions of the populace with respect to pandemics or to the policies surrounding them. For example, knowing the percentage of positive or negative sentiments per region might be helpful when evaluating channels for allocating relief. However, this approach would assume that there are no paid or unpaid deterministic agents purposely aiming to sway the contextual sentiments of Twitter feeds through various malicious actions. Finally, it is worth recalling that Twitter users are not always representative of the general population, especially in the presence of vested interests of a deterministic adversary. This has a serious implication, since the social behaviour found in these studies may apply more specifically to Twitter users than to the general population.

6.1. Limitations and future work

This work has investigated n-gram analysis of Twitter feeds and yielded some useful insights. However, other natural language analyses might also be helpful in future work. For example, tweet summarisation and topic analysis may help gain further insights into the main points of tweets. In addition, since the window of time considered in this study was very short, it would be interesting to perform the analysis for other days; in particular, it would be interesting to compare the term types across other time frames, such as the period of November–December 2019, when COVID-19 cases actually started. This might help us understand the evolution of the terms as a function of time.

Moreover, it would also be interesting to analyse the pandemic based on tweets written in other languages. This would be useful for gaining new insights or findings that might not have been reflected here, and it could provide a meaningful understanding of how the pandemic is perceived in different regions of the world. An obvious advantage is that the reactions of other communities might improve our understanding of the effectiveness of policies and preventive measures across regions. This is valuable for policymakers: it might be unrealistic to expect them to read through extensive Twitter feeds, but receiving a summarisation or topic analysis of such feeds might help. Furthermore, such summarisations and topic analyses would be essential for assisting policymakers in assessing the impacts of their policies on a sample population as represented by the Twitter community, assuming there are no deterministic agents trying to manipulate the feeds. This, in turn, could be useful in making suitable adjustments to policies for the benefit of the populace in difficult times such as the present.

References

Ahmed, W, J Vidal-Alaball, J Downing and FL Seguí (2020). COVID-19 and the 5G conspiracy theory: Social network analysis of Twitter data. Journal of Medical Internet Research, 22(5), e19458. doi: 10.2196/19458.
Arpaci, I, S Alshehabi, M Al-Emran, M Khasawneh, I Mahariq, T Abdeljawad and A Ella Hassanien (2020). Analysis of Twitter data using evolutionary clustering during the COVID-19 pandemic. Computers, Materials & Continua, 65(1), 193–204. doi: 10.32604/cmc.2020.011489.
Arpaci, I, K Karataş, M Baloğlu and A Haktanir (2021). COVID-19 phobia in the United States: Validation of the COVID-19 Phobia Scale (C19P-SE). Death Studies. doi: 10.1080/07481187.2020.1848945.
Banda, JM and T Ramya (2020). A Twitter Dataset of 40+ million tweets related to COVID-19. Available at https://github.com/thepanacealab/covid19_twitter. Accessed on April 18, 2020.
Ben-Hur, A, D Horn, HT Siegelmann and V Vapnik (2001). A support vector method for clustering. In Advances in Neural Information Processing Systems 13, TK Leen, TG Dietterich and V Tresp (eds.), pp. 367–373. Cambridge, MA: The MIT Press.
Boslaugh, S (2012). Statistics in a Nutshell: A Desktop Quick Reference. Sebastopol, CA: O'Reilly Media, Inc.
Burnap, P, ML Williams, L Sloan, O Rana, W Housley, A Edwards, V Knight, R Procter and A Voss (2014). Tweeting the terror: modelling the social media reaction to the Woolwich terrorist attack. Social Network Analysis and Mining, 4, 206. doi: 10.1007/s13278-014-0206-4.
Guo, JW, CL Radloff, SE Wawrzynski and KG Cloyes (2020). Mining Twitter to explore the emergence of COVID-19 symptoms. Public Health Nursing, 37(6), 934–940. doi: 10.1111/phn.12809.
Hassan, N, W Gomaa, G Khoriba and M Haggag (2020). Credibility detection in Twitter using word n-gram analysis and supervised machine learning techniques. International Journal of Intelligent Engineering and Systems, 13(1), 291–300. doi: 10.22266/ijies2020.0229.27.
Islam, MS, T Sarkar, SH Khan, A-HM Kamal, SMM Hasan, A Kabir, D Yeasmin, MA Islam, KIA Chowdhury, KS Anwar, AA Chughtai and H Seale (2020). COVID-19-related infodemic and its impact on public health: A global social media analysis. The American Journal of Tropical Medicine and Hygiene, 103(4), 1621–1629. doi: 10.4269/ajtmh.20-0812.
Kaufhold, MA, M Bayer and C Reuter (2020). Rapid relevance classification of social media posts in disasters and emergencies: A system and evaluation featuring active, incremental and online learning. Information Processing & Management, 57(1), 102132. doi: 10.1016/j.ipm.2019.102132.
Kayes, ASM, MS Islam, PA Watters, A Ng and H Kayesh (2020). Automated measurement of attitudes towards social distancing using social media: A COVID-19 case study. First Monday, 25(11), 10599. doi: 10.20944/preprints202004.0057.v1.
Kohonen, T (1990). The self-organizing map. Proceedings of the IEEE, 78(9), 1464–1480.
Li, H, D Caragea, C Caragea and N Herndon (2018). Disaster response aided by tweet classification with a domain adaptation approach. Journal of Contingencies and Crisis Management, 26(1), 16–27. doi: 10.1111/1468-5973.12194.
Li, J, Q Xu, R Cuomo, V Purushothaman and T Mackey (2020). Data mining and content analysis of the Chinese social media platform Weibo during the early COVID-19 outbreak: Retrospective observational infoveillance study. JMIR Public Health and Surveillance, 6(2), e18700. doi: 10.2196/18700.
Mingoti, SA and JO Lima (2006). Comparing SOM neural network with Fuzzy c-means, K-means and traditional hierarchical clustering algorithms. European Journal of Operational Research, 174(3), 1742–1759. doi: 10.1016/j.ejor.2005.03.039.
Nayar, KR, L Sadasivan, M Shaffi, B Vijayan and AP Rao (2020). Social media messages related to COVID-19: A content analysis. Available at https://ssrn.com/abstract=3560666. Accessed on May 3, 2020.
Qin, L, Q Sun, Y Wang, K-F Wu, M Chen, B-C Shia and S-Y Wu (2020). Prediction of number of cases of 2019 novel coronavirus (COVID-19) using social media search index. International Journal of Environmental Research and Public Health, 17(7), 2365. doi: 10.3390/ijerph17072365.
Sakaki, T, M Okazaki and Y Matsuo (2010). Earthquake shakes Twitter users: Real-time event detection by social sensors. In Proceedings of the 19th International Conference on World Wide Web, pp. 851–860. New York, NY: ACM.
Shah, K, T Abdeljawad, I Mahariq, F Jarad and S Deniz (2020). Qualitative analysis of a mathematical model in the time of COVID-19. BioMed Research International, 2020, 5098598. doi: 10.1155/2020/5098598.
Sharma, A, A Adhikary and SB Borah (2020). Covid-19's impact on supply chain decisions: Strategic insights from NASDAQ 100 firms using Twitter data. Journal of Business Research, 117, 443–449. doi: 10.1016/j.jbusres.2020.05.035.
Tao, ZY, G Chu, C McGrath, F Hua, YY Leung, WF Yang and YX Su (2020). Nature and diffusion of COVID-19-related oral health information on Chinese social media: Analysis of tweets on Weibo. Journal of Medical Internet Research, 22(6), e19981. doi: 10.2196/19981.
Tasnim, S, MM Hossain and H Mazumder (2020). Impact of rumors or misinformation on coronavirus disease (COVID-19) in social media. Journal of Preventive Medicine and Public Health, 53(3), 171–174. doi: 10.3961/jpmph.20.094.
Valdez, D, M Ten Thij, K Bathina, LA Rutter and J Bollen (2020). Social media insights into US mental health during the COVID-19 pandemic: Longitudinal analysis of Twitter data. Journal of Medical Internet Research, 22(12), e21418. doi: 10.2196/21418.
Wilson, ME and LH Chen (2020). Travellers give wings to novel coronavirus (2019-nCoV). Journal of Travel Medicine, 27(2), taaa015. doi: 10.1093/jtm/taaa015.
Winters-Hilt, S and S Merat (2007). SVM clustering. BMC Bioinformatics, 8, S18. doi: 10.1186/1471-2105-8-S7-S18.
Zaki, MJ and W Meira, Jr. (2014). Data Mining and Analysis: Fundamental Concepts and Algorithms. New York, NY: Cambridge University Press.
Zarocostas, J (2020). How to fight an infodemic. The Lancet (London, England), 395(10225), P676. doi: 10.1016/S0140-6736(20)30461-X.