Proceedings of the 53rd Hawaii International Conference on System Sciences | 2020

Follow-back Recommendations for Sports Bettors: A Twitter-based Approach

Herman Wandabwa, M. Asif Naeem, Farhaan Mirza, Russel Pears
School of Engineering, Computer and Mathematical Sciences
Auckland University of Technology
[herman.wandabwa, mnaeem, farhaan.mirza, russel.pears]@aut.ac.nz

Abstract

Social network based recommender systems are powered by a complex web of social discussions and user connections. Short text microblogs such as Twitter are powerful frameworks for information consumption because content and user connections are updated in real time. Users on such platforms consume the disseminated content to a greater or lesser extent depending on their interests. Quantifying this degree of interest is difficult given the amount of information that such platforms generate at any given time. Generating personalized profiles based on the Degree of Interest (DoI) that users have towards certain topics in such short texts is therefore a research problem. We address this challenge with a two-step process that generates personalized sports betting related user profiles from tweets as a case study. We (i) compute the Degree of Interest in Sports Betting (DoiSB) of tweeters and (ii) affirm this DoiSB by correlating it with their friendship network. This is an integral process in the design of short text based recommender systems that suggest users to follow, i.e. follow-back recommendations, as well as content-based recommendations that rely on the interests of users on such platforms. In this paper, we describe the DoiSB computation and follow-back recommendation process by building a vector representation model for tweets. We then use this model to profile users interested in sports betting.
Experiments using a real Twitter dataset geolocated to Kenya show the effectiveness of our approach in identifying tweeters' DoiSBs as well as their correlation with their friendship network.

URI: https://hdl.handle.net/10125/64055
978-0-9981331-3-3
(CC BY-NC-ND 4.0)

1. Introduction

Citizen journalism, aided by the emergence of social networking platforms like Twitter and Facebook, has led to the generation of massive and diverse online content, e.g. text, videos and images. For example, 6000 tweets are published every second, corresponding to over 350,000 tweets per minute and 500 million a day¹. Tweeters² in essence share photos, videos, hyperlinks and locations with members of their networks. Tweeters extrinsically and/or intrinsically formulate online profiles. These mostly depend on the content they consume and disseminate over time, in addition to their follower-followee network. In general, tweeters show diversity in expressing their interests in certain topics. This can be based on the hashtags that they follow at the time as well as on time based, event related information. However, the dynamism of their friendship networks and the streaming nature of the platform make it difficult to quantify their DoI in certain online topics. This is further compounded by the fact that such topics are always dynamic. For example, users who love outdoor activities and are likely to tweet about a mountaineering experience may also often tweet in support of their favourite political candidate or sports team. Interest identification for the purpose of better user profiling on such platforms is therefore an important research problem. Precisely profiling a tweeter by his/her interests, and by the extent of those interests, largely alleviates personalization related problems. Twitter's recommender system normally discovers relevant followers, or Twitter lists of interest, to suggest to tweeters based on the friendship network.
However, this does not mean that the users explicitly share interests. The questions below motivate this research:
• Is it possible to group users based on their topical interests in short text microblogs?
• Do friendship connections in short text microblogs influence topical interests for such users?

¹ http://www.internetlivestats.com/twitter-statistics/
² a person who posts on the social media application Twitter

Sports betting, just like lotteries, is a huge industry worldwide³. Tweeters with an interest in online sports betting are assumed to propagate sports betting related content on Twitter. Despite their interest in sports betting, we cannot assume that all their followers are relevant enough to be followed back. Sports betting companies, as tweeters, may have many followers, but not all of them are relevant. Besides, such users need to be aware of what other tweeters disseminate, which partly helps them to rank important tweets or to create a network of influencers in their domain. This is instrumental in enabling tweeters to decide whom to follow, as they are presented with the most relevant users to follow on such platforms. In addition, this model is vital in suggesting users to follow back in cold-start scenarios or in the case of lurkers⁴, as they normally do not have enough initial friendship connections. We therefore present a method to compute the Degree of Interest in Sports Betting (DoiSB) among tweeters. We consider the Kenyan Twitterspace as a case study because the diversity of topics and the interest in sports betting are high there, and because the authors are familiar with Kenyans' tweeting patterns. To the best of our knowledge, this work presents the first attempt at quantifying tweeters' affinity towards sports betting by analyzing their disseminated content over time and corroborating it with their friendship network.
We make the following scientific contributions in the paper:
• We develop a novel framework for computing user profiles based on their content dissemination patterns.
• We validate the social theory of homophily, i.e. the tendency for people with shared interests to be connected, by correlating tweeters' interests with those of their friendship network.
• We test our framework by deducing users' interests in online sports betting. We also carry out an experimental study on formulating representative user profiles.

The rest of the paper is organized as follows. Section 2 reviews related literature. Section 3 describes our methodology. Section 4 presents the experimental framework and Section 5 the results. Section 6 summarizes the conclusion and future work.

³ https://www.statista.com/topics/1740/sports-betting/
⁴ a member of an online community who observes, but does not participate

2. Background and Related Work

2.1. User Interests Preferences in Short Text Microblogs

The choice of content to be consumed in short text microblogs is largely influenced by the interests of the consumer (tweeter). Such interests are therefore integral to the design of short text based recommender systems, as these are developed to match users to resources of interest. Chen et al. [1] proposed collaborative ranking to capture user interests by integrating useful contextual information such as tweet topic level factors. A User Interest Profile design methodology was also proposed by Goel et al. [2], in which user generated tags were enriched with friendship information through vector representations. In the context of profiling users with malicious intent in short text microblogs, Sahoo et al. [3] proposed a hybrid approach that leverages classification and a Petri net structure.
In addition, the authors in [4] proposed a URL recommender system for Twitter users based on social voting and content sources. Their results suggested that the generated topics and social interactions were the more significant factors in presenting recommendations. Recommendation of users to follow back in short text microblogs was also addressed in [5], [6] and [7].

2.2. Twitter in Sports and Betting Recommendations

Twitter related activities have been instrumental in the domain of sports. Robert et al. [8] predicted outcomes of NFL games using tweets, applying technical stock market techniques to sentiment gathered from social media. Brown et al. [9] evaluated the accuracy of social media forecasting in English Premier League soccer matches, where the aim was to assess whether tweet semantics could be used to predict match outcomes. The authors further investigated whether the predictions were restricted to large events, i.e. when goals were scored. Findings indicated that if the combined tone of tweets was positive at any time during the match, the likelihood of a team winning was higher than betting market prices implied. The work of Vaughan et al. [10] mirrors that of Brown et al. [9]. The authors measured Twitter activity around unique, identifiable and newsworthy events and correlated this activity with betting price fluctuations on Betfair. Their findings corroborated the initial assumption that market prices responded sluggishly when little event related data was available, compared to post-news drift times.

The goal of our research is to improve on methodologies that can be used to infer the level of interest, denoted as Degree of Interest (DoI), that users may have towards certain topics in streaming microblog texts. The DoI measure is instrumental in the design of short text based recommender systems in diverse domains. The lack of studies on short text influence-based recommender systems makes our contribution unique.
In this work, we made use of neural-network based vector representations of short text word tokens to better comprehend the underlying semantic structure of tweets. Vector representations via a neural-network based algorithm, FastText⁵, worked well with our type of textual data, i.e. text with the misspelled and shortened words typical of tweets [11]. The algorithm typically makes intelligent guesses even on out-of-vocabulary words, as long as some character level consistency is observed.

3. Our Approach

Inferring the extent to which a group of short text microblog users is interested in certain topics involves a number of processes, listed below:
• Text Modelling - In this step, we train a FastText based model using a corpus of tweets geolocated to Kenya. The output of the model is a vector space representation of tweet word tokens. In the vector space, a tweet is represented as a vector in an n-dimensional space, where each dimension represents a term. Similarity between terms or documents (tweets in our case) is measured as the cosine of the angle between the vectors being compared. We evaluated this modeling approach against Word2Vec and Glove baselines to choose the best model for the task.
• Clustering and Extraction of Centroids - Tweets are grouped based on their semantic similarity via a clustering algorithm. Cluster centroids, represented as vectors for each cluster, are extracted via the algorithm.
• User's Degree of Interest (DoI) - To compute the DoI, a tweeter's level of interest in a topic is measured. The tweeter's tweets are transformed to vectors via the trained model and their distance to the centroid of interest is measured.
• Correlation with tweeter's friendship network - To validate a tweeter's interest in a certain topic, the DoI of his/her friendship network is computed. This follows the homophily theory in social networks, where similar nodes (friendship connections) may be more likely to share interests than dissimilar ones.

3.1. Text Modeling

We based our text modeling methodology on a neural network model, FastText⁶. FastText was the algorithm of choice because of the way it extracts syntactic information from short, sparse and often misspelled words in a corpus. Unlike other word embedding algorithms like [12], FastText makes use of word morphology: a vector is associated with each character n-gram, and a word is represented as the sum of the vectors of its character n-grams. This algorithm therefore proved to be an ideal model for learning misspelled or out-of-dictionary words. To model tweets via this neural network algorithm, the procedure below was followed:
• Text pre-processing - Pre-processing text is necessary to obtain a better corpus as model input. This process entails the removal of unnecessary words and punctuation, as they do not provide any contextual meaning for the model to learn. We followed the steps below in pre-processing the input text:
– Lower-cased all words in the corpus.
– Removed all accented characters and numbers. Some of the accented characters were encoded to Unicode Transformation Format 8-bit (UTF-8) format.
– Removed all hyperlinks. They were not of interest in this instance.
– Removed all user mentions. They are words prefixed by the @ symbol in a tweet. Often, they refer to tweeters' usernames.
– Removed all words with less than three characters. They were found not to be semantically relevant in most tweets.
– Cleaned out all hashtags. These are words in a tweet that are prefixed by the hash (#) symbol.
– Removed stopwords. These are the most common English words. Normally, they are not semantically significant. We used a custom list of stopwords in addition to the NLTK stopword list⁷.
– Tokenized the remaining words in each tweet and stored them as a list ready for model training.
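The pre-processing steps above can be sketched as a single function. This is a minimal illustration and not the authors' code: the `STOPWORDS` set is a hypothetical stand-in for the custom and NLTK lists (the word "win" is included only as an assumption, so that the example tweet shown later in the paper reduces to the same tokens), accented characters are dropped by discarding non-ASCII characters, and hashtags are assumed to be removed together with their text.

```python
import re

# Hypothetical stand-in for the custom + NLTK stopword lists (assumption).
STOPWORDS = {"the", "and", "for", "with", "this", "that", "are", "you", "win"}


def preprocess(tweet, stopwords=STOPWORDS):
    """Return the cleaned list of tokens for one tweet."""
    text = tweet.lower()                                 # lower-case
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)   # hyperlinks
    text = re.sub(r"@\w+", " ", text)                    # user mentions
    text = re.sub(r"#\w+", " ", text)                    # hashtags (assumed dropped whole)
    text = text.encode("ascii", "ignore").decode("ascii")  # accented characters
    text = re.sub(r"[^a-z\s]", " ", text)                # numbers and punctuation
    tokens = text.split()                                # tokenize on whitespace
    # Drop words shorter than three characters and stopwords.
    return [t for t in tokens if len(t) >= 3 and t not in stopwords]


example = ("Away Win 3 Multibet Football Tips Odds Kenya January 11 2019 "
           "http://www.zuribet.com/away-win-3-multibetfootball-tips-odds-kenya-january-11-2019/")
print(preprocess(example))
```

Hyperlinks, mentions and hashtags are removed before punctuation so that their text does not survive as ordinary tokens.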
⁵ https://fasttext.cc/
⁶ https://fasttext.cc/
⁷ http://www.nltk.org/

• Model Training - The tokenized list of words in the corpus forms the input pipeline for model training. In model training, a machine learning algorithm (FastText in our case) is provided with training data to learn from. The model learns semantic knowledge in the dataset by mapping each word to a continuous vector space based on its distributional properties observed in the corpus. Several parameters have to be specified in order to train Word2Vec and FastText models:
– size: the number of dimensions in the vector space.
– min count: the minimum count of a term in the corpus for it to be included in the training. Terms with word counts lower than this value were excluded from training.
– sg: set to 1 to train a SkipGram model, otherwise Continuous Bag of Words (CBOW). In SkipGram modeling, the algorithm loops over the list of words and uses the current word to predict its neighbors (its context). In CBOW, the context is used to predict the current word.
– window: the maximum distance between the current and predicted word in the list of word tokens.
– word ngrams: specified in order to enrich word vectors with subword (n-gram) information. This enrichment is enabled if the value is set to 1.
– iter: the number of iterations (epochs) over the corpus, i.e. the number of times that the learning algorithm goes through the entire training set.
– The Glove model only had the epochs and learning rate (lr) defined.

FastText is unique in its vector space representation in that it takes the internal structure of words into account. Each word w is represented in the vector space as a bag of character n-grams, where the word itself is included in the set of n-grams. We used 3 ≤ n ≤ 6 in our implementation, as specified in [11]. This way, most of the n-grams were factored into the modeling. Given an n-gram dictionary of size B, the set of n-grams appearing in a word w is denoted $B_w \subset \{1, \ldots, B\}$.
$x_b$ is the vector representation of each n-gram $b$. The scoring function is formulated as in [11]:

$$s(w, c) = \sum_{b \in B_w} x_b^{\top} v_c \qquad (1)$$

where $c$ is the context position of a word and $v_c$ the corresponding word vector. In our case, each tweet is made of word tokens. Its vector representation is therefore the sum of its word vectors after pre-processing. Using the parameters listed earlier, the model was ready to be used to generate vectors for each word in the corpus. The process is the same for the Word2Vec and Glove baselines, except that their vector space applies only at word level. They were trained for validation purposes.

• Clustering and Extraction of Centroids - A clustering algorithm was deployed to group the most similar tweets as closely as possible (clusters). Semantically dissimilar tweets were pushed as far away from each other as possible. The intent is that objects within a cluster are as similar as possible. Thereafter, manual inspection of the underlying keywords and analogy tests for terms in each of the clusters were carried out to identify the topic, or closely related topics, that each of the clusters inclined towards. To cluster tweets, K-Means++ was applied on the training corpus. This algorithm improves the choice of cluster centers for k-means by spreading out the initial set of cluster centroids so that they are not close to each other, which guarantees a solution that is O(log k)-competitive with the optimal clustering in expectation [13]. This gives a principled initialization, although it does not guarantee the optimal set of centroids. We used a heterogeneity convergence metric to determine the optimal number of clusters across the models [14]. To determine this number, we ran tests with different k values on a known test set. The number of clusters that best represented the test set was chosen as optimal. The intra-cluster distance between the points y in a given cluster $X_k$ and the cluster's centroid $X_x$ was then computed as cosine distance.
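The clustering step can be illustrated with a small, dependency-free sketch of k-means with k-means++ seeding. It is an assumption-laden illustration rather than the authors' implementation: vectors are normalized to unit length so that squared Euclidean distance orders points the same way as cosine distance, and the function names, the fixed iteration count and the toy points are all hypothetical.

```python
import math
import random


def normalize(v):
    """Scale a vector to unit length (cosine geometry)."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_pp_init(points, k, rng):
    """k-means++ seeding: spread the initial centroids apart."""
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance of every point to its nearest chosen centroid.
        d2 = [min(sq_dist(p, c) for c in centroids) for p in points]
        if sum(d2) == 0:  # all points coincide with a centroid
            centroids.append(list(rng.choice(points)))
        else:             # sample proportionally to the squared distance
            centroids.append(list(rng.choices(points, weights=d2, k=1)[0]))
    return centroids


def assign(points, centroids):
    return [min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points]


def kmeans(points, k, iters=20, seed=0):
    rng = random.Random(seed)
    pts = [normalize(p) for p in points]
    centroids = kmeans_pp_init(pts, k, rng)
    for _ in range(iters):
        labels = assign(pts, centroids)
        for j in range(k):
            members = [p for p, lab in zip(pts, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assign(pts, centroids), centroids


# Toy 2-D "tweet vectors": two visibly separate directions.
toy = [[1.0, 0.1], [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 1.0], [0.1, 0.9]]
labels, centroids = kmeans(toy, k=2)
print(labels)
```

The cluster identifiers themselves are arbitrary; which cluster corresponds to sports betting still has to be decided by inspecting its terms, as described in the text.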
Our interest in the case study lies in finding the cluster that best represents sports betting content. To do so, we first have to identify the sports betting cluster. As described earlier, this is done via analogy tests as well as manual inspection of the terms in each cluster. Once the terms are identified, a centroid map that contains terms and their respective cluster numbers is computed as in [15]. Thereafter, centroids for each model are generated via the trained models. For example, a FastText model with 100 dimensions and 3 clusters generated a JSON file with three 10 × 10 matrices.

• User's DoI in Sports Betting - Computing a user's DoI with respect to the sports betting cluster first requires the similarity between tweets and the cluster centroids.

– Similarity to Cluster Centroids - The similarity of a tweet to a cluster of interest is obtained by calculating the semantic distance between the tweet's tokens and the centroid of that cluster. Let Q be the set of cluster vectors q ∈ Q in the model. Q is needed to obtain the distance between a tweet and the clusters. In our case study, the interest was in the distance between a given test tweet and the sports betting cluster $q_{SB}$. For a tweet s, let $W_s$ be the set of its word tokens. The average of the vectors in $W_s$ is the vector space representation of tweet s, as illustrated in Equation 2.

$$w'_s = \operatorname{average}(w_s), \quad \forall w_s \in W_s \qquad (2)$$

We then computed the cosine similarity between the tweet vector $w'_s$ and cluster centroids such as $q_{SB}$; the terms cosine distance and cosine similarity are used interchangeably for this measure in this paper. The cosine measure was the preferred similarity metric between tweet vectors and cluster centroids because it works well regardless of the magnitudes of the two vectors being compared. The smaller the angle between the two vectors, the higher the cosine similarity.
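Equation 2 and the cosine comparison against the centroids can be sketched as follows. The word vectors and centroids here are hypothetical three-dimensional toy values, and tokens missing from the dictionary are simply skipped, which is a simplification (a trained FastText model would instead compose a vector from character n-grams).

```python
import math


def average_vector(vectors):
    """Element-wise mean of equally sized vectors (Equation 2)."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def tweet_similarities(tokens, word_vectors, centroids):
    """Cosine similarity of the averaged tweet vector to every centroid."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:  # nothing known about this tweet
        return [0.0] * len(centroids)
    tweet_vector = average_vector(vectors)
    return [cosine_similarity(tweet_vector, q) for q in centroids]


# Hypothetical toy data: index 0 plays the role of the sports betting cluster.
word_vectors = {"odds": [1.0, 0.0, 0.0], "tips": [1.0, 0.0, 0.0], "kenya": [0.0, 1.0, 0.0]}
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

sims_betting = tweet_similarities(["odds", "tips"], word_vectors, centroids)
sims_mixed = tweet_similarities(["odds", "kenya"], word_vectors, centroids)
print(sims_betting, sims_mixed)
```

Because the cosine measure ignores vector length, averaging the word vectors and summing them give the same similarity to a centroid.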
Two objects are therefore presumed very similar if the cosine similarity is close to or equal to 1, and dissimilar if it is close to or equal to 0. Computation of the cosine measure between the tweet vector $w'_x$ and the cluster centroids in $q_{SB}$ is shown in Equation 3.

$$s_{xs} = \operatorname{CosineDistance}(w'_x, s), \quad \forall s \in q_{SB} \qquad (3)$$

This way, the similarity between a tweet and the cluster of interest is computed.

– Computation of the Degree of Interest in Sports Betting (DoiSB) - Computing a tweeter u's DoiSB entailed the steps below:
1. We first extracted tweets from the user's timeline $T_u$ via Twitter's Search API⁸. A maximum of 3200 tweets can be extracted from the timeline.
2. The extracted tweets $x_u \in T_u$ are then preprocessed and modeled as described in Section 3.1.
3. The similarity of the processed tweets $x_u$ to the sports betting cluster $q_{SB}$ is then computed as in Equation 3.

The DoiSB computation for user u is given in Equation 4.

$$DoiSB_u = \operatorname{average}(s_{x_u q_{SB}}), \quad \forall x_u \in T_u,\ q_{SB} \in Q \qquad (4)$$

DoiSB values are interpreted in the same way as the cosine measure. Tweeters with DoiSBs close to 1 disseminated content that was largely related to sports betting. In contrast, tweeters with DoiSBs close to 0 disseminated very little sports betting related content.

• Homophily Social Theory in DoiSBs - Homophily is defined as the tendency for people to have positive ties with people who are similar to themselves in friendship networks. In our case, homophily is measured with regard to the extent to which users share interests [16]. In essence, users with high DoiSB values share interests and can thus be recommended to each other as follow-backs.

4. Experimentation

In this section, we describe the practical steps that were followed to validate the approach proposed in Section 3. Word2Vec and Glove baselines were also trained for validation purposes.

4.1. Datasets, Settings and Analogy Tests

We collected 298835 tweets geolocated to Kenya over six months starting 1/9/2018 via Twitter's streaming API. The collection was made up of tweets as single row records with their associated metadata, which included mentions, hashtags, lists of usernames and geo coordinates, among others. Most of the tweets in the dataset were written in English, with a few in Swahili and the rest in a mixture of the two. The choice of the Kenyan Twitterspace was firstly informed by the authors' knowledge of the dynamic nature of Twitter topical content in the country. In addition, online sports betting related activities are on the rise in the country, which motivates investigating the interests of tweeters in this domain. To make sure that a sports betting related cluster existed for evaluation purposes, betting related tweets had to be added to the generic pool of tweets. We collected 50639 sports betting related tweets to add to the dataset. The tweets were collected from the timelines of online sports betting companies with a presence in Kenya. They included tweets associated with the sportpesa⁹, betin¹⁰, eazibet¹¹, betika¹² and betwayke¹³ Twitter handles. Together with the generic set, the training corpus totalled 349474 tweets. Analogy tests validated the generalization and quality of the trained models. We therefore conducted several qualitative tests on the model to make sure that it was relevant to the test scenario.

⁸ https://developer.twitter.com/en/docs/tweets/search/api-reference/get-search-tweets.html

Table 1. Sample analogy in the sports betting domain.
Sportpesa is a sports betting company operating in several countries in the world.

Most similar to "sportpesa"    Similarity Score
tzsportpesa                    0.9649
sportpesajp                    0.9615
sportpesasa                    0.9536
sportpesacup                   0.9284
sportpesarewind                0.8904
hullcitysportpesa              0.8891
sportpesahullcity              0.8607
sportpesanews                  0.8535
sportpesashield                0.8456
sportpesapic                   0.8442

Figure 1. Sample plot showing the semantic relevance of words in the training set. Semantic distance between words is depicted by the closeness of the words.

Table 1 summarizes one validation example in the context of sports betting. In the given example, sportpesa¹⁴, a Kenyan-based betting company, shows high similarity with words like tzsportpesa, the Tanzanian wing of the same company. We further used the FastText model to plot the semantic distance between words as an analogy test. Figure 1 shows the distance between words in the corpus. The words ruto, raila and uhuru are grouped close to each other, which is semantically sound since they are all politicians in Kenya. Similarly, the words county, governor, joho, senator, parliament and sonko are all governance related. In fact, Sonko and Joho are current governors in Kenya. Sportpesa, betway and betika are sports betting companies in Kenya and are thus grouped together, with words like odds, bonus, games, sports and teams also being semantically close.

⁹ https://www.sportpesa.org/
¹⁰ https://www.betin.co.ke/
¹¹ https://www.eazibet.co.ke
¹² https://www.betika.com/
¹³ https://www.betway.co.ke
¹⁴ https://www.sportpesa.org/

4.2. Samples of Sports Betting Related Tweets

The analogy tests in Section 4.1 provided a general semantic view of the dataset. However, before selecting the best performing model to compute the DoiSBs, the model had to be subjected to a known dataset. Two baselines, i.e. Word2Vec and Glove models, were introduced for validation purposes. All models were trained on the same dataset with the same parameters.
We sampled 100 sports betting tweets to test the accuracy of the three models with different parameters, as well as to obtain the optimal number of clusters. The selected tweets had to be as close as possible to sports betting related content, which made it easier to distinguish classification performance across the models. Manual inspection of the test tweets by three human judges indicated that they were all centered around sports betting. This process was significant in identifying the optimal dimension sizes for model training as well as the number of clusters that best represented the corpus. The hypothesis in this step was that the higher the number of test tweets correctly classified into the sports betting cluster, the better the modeling algorithm and related parameters. It was therefore a matter of iteratively trialing varied model parameters to pick the best performing model for computing follow-back recommendations. The 100 test tweets were subjected to FastText-CBOW, FastText-SkipGram (SG), Word2Vec-CBOW, Word2Vec-SkipGram (SG) and Glove models trained with 100, 200 and 300 dimensions, consistent with [12]. We tested the model dimensions with the number of clusters set to 3, 4, 5 and 6, based on the elbow method for identifying the optimal number of clusters [14]. In computing the classification accuracies, we followed the steps below:
1. Each test tweet was compared against the cluster labels identified by K-means++ for each model. For example, for FastText (100 dimensions, 3 clusters), cluster 0 represented the sports betting domain, cluster 1 Swahili related chatter and cluster 2 general/news content, based on the analogy tests in Section 4.1. Cluster 0, denoted by 0 in our experiments, was therefore the ground truth (true label).
2. Each tweet vector was computed and its distance to the three clusters derived to generate predicted labels.
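The two steps above can be sketched as follows: each tweet is assigned to the cluster with the highest similarity, and the predicted labels are then compared with the true labels. The comparison here uses the Fowlkes-Mallows index, computed from pair counts as the geometric mean of pairwise precision and recall. This is an illustrative implementation under stated assumptions, not the authors' code, and the similarity rows are hypothetical values.

```python
import math
from itertools import combinations


def predict_labels(similarity_rows):
    """Assign each tweet to the cluster with the highest similarity."""
    return [max(range(len(row)), key=row.__getitem__) for row in similarity_rows]


def fowlkes_mallows(true_labels, pred_labels):
    """Geometric mean of pairwise precision and recall."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_true = true_labels[i] == true_labels[j]
        same_pred = pred_labels[i] == pred_labels[j]
        if same_true and same_pred:
            tp += 1      # pair grouped together in both labelings
        elif same_pred:
            fp += 1      # grouped together only in the prediction
        elif same_true:
            fn += 1      # grouped together only in the ground truth
    if tp == 0:
        return 0.0
    return tp / math.sqrt((tp + fp) * (tp + fn))


# Hypothetical per-tweet similarities to clusters 0, 1 and 2.
rows = [
    [0.496, 0.196, 0.434],
    [0.520, 0.310, 0.400],
    [0.300, 0.250, 0.450],
    [0.610, 0.200, 0.300],
]
predicted = predict_labels(rows)
truth = [0] * len(rows)  # every test tweet is a sports betting tweet
score = fowlkes_mallows(truth, predicted)
print(predicted, round(score, 4))
```

When every true label is the same cluster, as in this evaluation, the index reduces to the square root of the fraction of tweet pairs that the model keeps together.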
The Fowlkes-Mallows Index (FMI-Score) was used to derive correlations between the labels. The FMI-Score is interpreted as the geometric mean of pairwise precision and recall between the true and predicted labels. Like the cosine measure used here, the score ranges from 0 to 1. A higher value indicates better agreement between the two labelings [17]. We report the values in Figure 2. FastText-SkipGram with 100 dimensions and 3 clusters reported the highest FMI-Score in relation to the sports betting cluster. From the figure, we can infer that models with more than 3 clusters reported lower FMI-Scores. We therefore selected FastText-SkipGram with 100 dimensions and 3 clusters to compute DoiSBs.

Figure 2. Models' classification scores with respect to model dimensions (100, 200, 300), consistent with [12], and cluster numbers (3, 4, 5 and 6).

4.3. Samples of Tweeters in the Kenyan Twitter-sphere

We simulated a real Twitter environment by collecting sample tweets geolocated to Kenya. The aim of this process was to identify tweeters in the Kenyan Twitter space who would fit this study, i.e. who have an interest in online sports betting. Computation of DoiSBs for sample tweeters as in Section 4.1 involved the collection, pre-processing and modeling of tweets disseminated by the tweeters. We collected a maximum of 3200 tweets from each of 137 users who tweeted from or near Kenya from 1/1/2019 to 1/04/2019, via Twitter's Search API. Our assumption in the collection process was that a three month period was sufficient to gather enough data with diverse topics, as most tweets were disseminated as a reaction to certain events within that timeframe. Another assumption was that sports related content was likely to be tweeted alongside other topics within that timeframe.

4.4. Proof of the Homophily

Follower-followee relationships define connections in social networks.
Therefore, homophily is evident in social networks based on the fact that tweeters tend to follow other users with whom they share interests [18]. On Twitter, friendship connections take the form of mentions, retweets, replies, hashtags etc. Proof of homophily in connections between tweeters and their friendship networks takes the form of shared interests. A positive correlation in DoiSB was therefore evidence that the interests identified in tweeters were realistic, and that the model performed well in identifying the Degree of Interest with respect to sports betting.

4.5. Parameter Settings and Experiments

The selected FastText (100, 3) model was set up with the following parameters: size = 100, minimum count = 2, learning rate (lr) = 0.1 and iter = 30. In-depth descriptions of these parameters are given in Section 3. FastText default values were assumed for parameters that were not explicitly defined. The output of our modeling process was a vector representation of 22816 unique words in the training corpus. The number of clusters as well as the initialization mechanism, i.e. k-means++, was specified in the clustering process to generate cluster centroid maps. The process of choosing the number of clusters K was as described in Section 4.2. Centroid maps consisted of the words in the corpus and their respective cluster assignments. With the centroid maps in place, words in specific clusters could be placed as close as possible to each other. In modeling tweeters, the training corpus in Section 3 was used. The optimal number of clusters in our case was 3, where each of the clusters had a unique identifier. The sports betting related cluster was cluster 0 and consisted of 3123 unique words. Clusters 1 and 2 had 12518 and 7175 unique words respectively. This made it easier to compute cluster centroids. The resulting tweet vectors were then used to compute the similarity between tweets and clusters.
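Once per-tweet cluster similarities are available, a tweeter's DoiSB (Equation 4) is the average of the similarities to the sports betting cluster, and the homophily check of Section 4.4 compares it with the DoiSBs of the tweeter's friends. The sketch below assumes cluster 0 is the sports betting cluster, as in the selected model; the function names, the use of the median as the summary of the friendship network, and all numeric values are hypothetical.

```python
from statistics import median


def doisb(similarity_rows, sb_cluster=0):
    """Average similarity of a user's tweets to the sports betting cluster."""
    if not similarity_rows:
        return 0.0
    return sum(row[sb_cluster] for row in similarity_rows) / len(similarity_rows)


def friends_median_doisb(friends_rows, sb_cluster=0):
    """Median DoiSB over a user's friendship network."""
    return median([doisb(rows, sb_cluster) for rows in friends_rows.values()])


# Hypothetical similarities to clusters 0 (sports betting), 1 and 2.
user_rows = [[0.496, 0.196, 0.434], [0.600, 0.100, 0.300]]
friends_rows = {
    "friend_a": [[0.70, 0.10, 0.20]],
    "friend_b": [[0.50, 0.20, 0.30], [0.30, 0.40, 0.30]],
    "friend_c": [[0.20, 0.50, 0.30]],
}

user_score = doisb(user_rows)
network_score = friends_median_doisb(friends_rows)
print(round(user_score, 3), round(network_score, 3))
```

A user score and a network score that are both high would be read as support for a follow-back recommendation, in line with the homophily argument.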
The similarity, as pointed out earlier, is the distance between the average tweet vector and the cluster centroid of interest. This process is illustrated below:
• Original tweet - Away Win 3 Multibet Football Tips Odds Kenya January 11 2019 http://www.zuribet.com/away-win-3-multibetfootball-tips-odds-kenya-january-11-2019/
• Preprocessed tweet - away multibet football tips odds kenya january
• Cluster similarity values $s_{xy}$ - [0.496, 0.196, 0.434], where the value in array position 0 is the tweet's similarity to the sports betting related cluster ($s_{xy}$ DoiSB). The value shows that the tweet is semantically closer to the sports betting cluster than to the other clusters.

5. Results

5.1. DoiSBs for Follow-Back Recommendations

Short text microblog users tend to have positive ties, as evidenced by follower-followee relationships. Normally, such users tend to have common interests. Group interests based on user DoiSBs were preferred to individual analyses, as depicted in Figure 4.

Figure 3. Overall Distribution of DoiSBs

The graph in Figure 3 shows an almost symmetrical distribution, which motivated grouping the DoiSBs as follows:
a) users with DoiSB equal to 0 (Group i),
b) users with DoiSB greater than 0 but less than or equal to 0.3 (Group ii),
c) users with DoiSB greater than 0.3 but less than 0.5 (Group iii),
d) users with DoiSB greater than or equal to 0.5 (Group iv).

Results in Figure 4 show the correlation distribution between the DoiSBs of tweeters and their friendship network. From the box plot, tweeters with DoiSB = 0 correlated with friends whose median DoiSB = 0.37. Similarly, tweeters with 0 < DoiSB <= 0.3 shared sports betting interests with friends whose median DoiSB = 0.36. The third and fourth groups showed stronger ties between tweeters and their friendship networks. Tweeters with 0.3 < DoiSB < 0.5 correlated with friendship networks whose approximate median DoiSB = 0.47.
Tweeters showing high interest in sports betting likewise had friendship connections who showed the same level of interest. This is shown in Group iv, where tweeters' DoiSB >= 0.5. They shared interests with friends having a median DoiSB of 0.62. The output corroborated the expectations of the homophily social theory, whereby users with shared interests are more likely to connect.

Figure 4. Correlation between users' DoiSBs and their friendship network

5.2. Practical Application Areas

Results from this setup and these experiments are applicable in several areas related to recommender systems based on short text microblogs.

• Follow-back recommendations - From the experimental results, users with DoiSB >= 0.5 can be recommended to other users with DoiSB >= 0.62 or vice versa. The two sets of users correlated with each other, so it is plausible to make suggestions between them based on similar interests.
• Cold-start scenarios - New users on short text microblogs are always in need of accurate recommendations regarding users to follow, hashtags and even lists. Correlating DoiSBs with other factors such as geolocation is a suitable way of suggesting pages of interest for such users.

5.3. Qualitative Evaluation of Homophily

To affirm the quantitative results in Section 5, we presented a sample list of 40 randomised and anonymised clean tweets to five judges/evaluators with a good command of English for evaluation. This was based on topics that we felt were representative of the dataset. Out of the 40 tweets, 20 were from tweeters while the remaining 20 were extracted from the specific tweeter's friendship network. Overall, each evaluator received a unique set of randomised tweets. The intuition behind this process was to present a dataset that mirrored a real Twitter stream in terms of content diversity in both tweeters and their friendship networks. Upon manual inspection, we identified three classes in the tweets.
These were the Swahili-related, Sports Betting and General News classes. Evaluators were expected to classify the tweets into these three classes, where a tweet could fall in only one class, for consistency and in line with the hard clustering approach in the model.

In Table 2, X1 to X5 represent the evaluators/judges. X1u to X5u represent individual tweeter classifications in the topics of interest as per the evaluators. M1u..M5u represent the respective FastText model classifications for the same tweets given to the evaluators, as described in Section 4.5. For example, according to evaluator X1, tweeter X1u had 4 tweets classified under the sports betting topic. Their friendship network, i.e. X1f, had two tweets under the same topic. k1u to k5u are the Cohen Kappa scores, which in this instance measure the inter-topic agreement between the judge's and the model's classifications [19]. This evaluation score was an indicator of the extent to which tweets from tweeters and from their friendship networks contextually correlated. X1f..X5f represent the evaluators' topical classifications of the friendship network tweets, and k1f..k5f the corresponding Kappa scores. M1f..M5f, as in the tweeters' case, represent the model classifications for the same friendship network tweets that were given to the evaluators.

The Kappa score k was derived as follows:

\[ k = \frac{p_o - p_e}{1 - p_e} \]

where $p_e$ was the hypothetical probability of chance agreement and $p_o$ was the relative observed agreement between tweeters and their friendship network ratings.

From the results in Table 2, 56.67 percent of the inter-topic ratings showed weak to perfect agreement as per the Kappa statistic scale [20]. This was encouraging given the small sample of tweets in both groups. The results corroborate the homophily theory in social networks. A positive correlation, to a certain level, between the two sets of data is evidence that friends share interests; follow-back recommendations can therefore be made among such users.
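The Kappa computation defined above can be sketched for two sets of labels over the same tweets. This is a generic illustration of Cohen's kappa, with the observed agreement p_o and the chance agreement p_e computed from the label lists; the function name and the toy labels are hypothetical and not taken from Table 2. A per-topic score of the kind reported in Table 2 could presumably be obtained by first reducing the labels to "topic" versus "not topic", which is an assumption about the procedure.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa, k = (p_o - p_e) / (1 - p_e), for two label sequences."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(ratings_a)
    # p_o: share of items on which the two raters agree.
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # p_e: agreement expected by chance from each rater's label frequencies.
    counts_a = Counter(ratings_a)
    counts_b = Counter(ratings_b)
    p_e = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if p_e == 1:
        # Both raters used one identical label for every item.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for ten tweets: S = Swahili, B = betting, G = general.
judge = ["S", "S", "B", "B", "B", "G", "G", "G", "G", "G"]
model = ["S", "B", "B", "B", "B", "G", "G", "G", "G", "S"]
kappa = cohen_kappa(judge, model)
```

Here the raters agree on 8 of 10 tweets (p_o = 0.8) and the chance agreement is p_e = 0.36, so the sketch evaluates (0.8 - 0.36) / (1 - 0.36).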
This correlation was also evident in the model, in addition to the output from evaluators.

Topics (Tweeters)
Row    Swahili Related   Sports Betting   General/News
X1u    5/20              4/20             11/20
M1u    4/20              6/20             10/20
k1u    0.286             0.474            0.3
X2u    6/20              4/20             10/20
M2u    4/20              8/20             8/20
k2u    0.474             0.545            0.6
X3u    7/20              7/20             6/20
M3u    6/20              5/20             9/20
k3u    0.659             0.765            0.479
X4u    2/20              5/20             13/20
M4u    3/20              4/20             13/20
k4u    1.00              0.857            0.468
X5u    9/20              4/20             7/20
M5u    7/20              6/20             7/20
k5u    0.588             0.474            0.341

Topics (Friendship Network)
Row    Swahili Related   Sports Betting   General/News
X1f    4/20              2/20             14/20
M1f    4/20              3/20             13/20
k1f    0.688             0.773            0.205
X2f    8/20              5/20             7/20
M2f    10/20             6/20             4/20
k2f    0.6               0.625            0.634
X3f    4/20              5/20             11/20
M3f    3/20              4/20             13/20
k3f    0.828             0.571            0.271
X4f    5/20              6/20             9/20
M4f    3/20              9/20             8/20
k4f    0.692             0.479            0.490
X5f    2/20              4/20             14/20
M5f    4/20              3/20             13/20
k5f    0.615             0.828            0.432

Table 2. Correlation between curated topics and their share of sample tweets among users and their friends.

6. Conclusion and Future Work

Twitter, as a short text micro-blogging platform, is instrumental in disseminating event related information or news. Tweeters show preference towards certain topics to a lesser or greater extent based on their level of interest in them. In addition, tweeters with shared interests are deemed to correlate when they follow each other back. We developed a model framework that can be used to identify the interests of microblog users based on their disseminated content. A FastText model was deployed to learn tweet semantics as well as to compute the level of interest that a tweeter has in sports betting. Experimental results were in line with the homophily social theory, whereby users with shared interests also shared connections, a fundamental principle in user recommendations.

As part of our future work, we plan on automatically modeling multi-topic user profiles based on varied interests that short text microblog users may have over time.
Twitter-specific features such as bi-directional network metadata could be used in this computation process.

References

[1] K. Chen, T. Chen, G. Zheng, O. Jin, E. Yao, and Y. Yu, “Collaborative personalized tweet recommendation,” in Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval, pp. 661–670, ACM, 2012.
[2] S. Goel and R. Kumar, “Folksonomy-based user profile enrichment using clustering and community recommended tags in multiple levels,” Neurocomputing, vol. 315, pp. 425–438, 2018.
[3] S. R. Sahoo and B. Gupta, “Hybrid approach for detection of malicious profiles in twitter,” Computers & Electrical Engineering, vol. 76, pp. 65–81, 2019.
[4] J. Chen, R. Nairn, L. Nelson, M. Bernstein, and E. Chi, “Short and tweet: experiments on recommending content from information streams,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp. 1185–1194, ACM, 2010.
[5] Y. Liu, X. Chen, S. Li, and L. Wang, “A user adaptive model for followee recommendation on twitter,” in Natural Language Understanding and Intelligent Applications, pp. 425–436, Springer, 2016.
[6] S. Takimura, R. Harakawa, T. Ogawa, and M. Haseyama, “Twitter followee recommendation based on multimodal ffm considering social relations,” in 2018 IEEE 7th Global Conference on Consumer Electronics (GCCE), pp. 204–205, IEEE, 2018.
[7] D. P. Karidi, Y. Stavrakas, and Y. Vassiliou, “Tweet and followee personalized recommendations based on knowledge graphs,” Journal of Ambient Intelligence and Humanized Computing, vol. 9, no. 6, pp. 2035–2049, 2018.
[8] R. P. Schumaker, C. S. Labedz Jr, A. T. Jarmoszko, and L. L. Brown, “Prediction from regional angst–a study of nfl sentiment in twitter using technical stock market charting,” Decision Support Systems, vol. 98, pp. 80–88, 2017.
[9] A. Brown, D. Rambaccussing, J. J. Reade, and G. Rossi, “Forecasting with social media: evidence from tweets on soccer matches,” Economic Inquiry, vol. 56, no. 3, pp. 1748–1763, 2018.
[10] L. V. Williams and J. J. Reade, “Prediction markets, social media and information efficiency,” Kyklos, vol. 69, no. 3, pp. 518–556, 2016.
[11] P. Bojanowski, E. Grave, A. Joulin, and T. Mikolov, “Enriching word vectors with subword information,” Transactions of the Association for Computational Linguistics, vol. 5, pp. 135–146, 2017.
[12] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” arXiv preprint arXiv:1301.3781, 2013.
[13] D. Arthur and S. Vassilvitskii, “k-means++: The advantages of careful seeding,” in Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 1027–1035, Society for Industrial and Applied Mathematics, 2007.
[14] P. Bholowalia and A. Kumar, “Ebk-means: A clustering technique based on elbow method and k-means in wsn,” International Journal of Computer Applications, vol. 105, no. 9, 2014.
[15] L. Recalde and A. Kaskina, “Who is suitable to be followed back when you are a twitter interested in politics?,” in Proceedings of the 18th Annual International Conference on Digital Government Research, pp. 94–99, ACM, 2017.
[16] Y. Halberstam and B. Knight, “Homophily, group size, and the diffusion of political information in social networks: Evidence from twitter,” Journal of Public Economics, vol. 143, pp. 73–88, 2016.
[17] M. Z. Rodriguez, C. H. Comin, D. Casanova, O. M. Bruno, D. R. Amancio, L. d. F. Costa, and F. A. Rodrigues, “Clustering algorithms: A comparative approach,” PloS one, vol. 14, no. 1, p. e0210236, 2019.
[18] M. McPherson, L. Smith-Lovin, and J. M. Cook, “Birds of a feather: Homophily in social networks,” Annual review of sociology, vol. 27, no. 1, pp. 415–444, 2001.
[19] M. L. McHugh, “Interrater reliability: the kappa statistic,” Biochemia medica, vol. 22, no. 3, pp. 276–282, 2012.
[20] A. J. Viera, J. M. Garrett, et al., “Understanding interobserver agreement: the kappa statistic,” Fam med, vol. 37, no. 5, pp. 360–363, 2005.