Extant literature has explored the social integration process of migrants settling in host communities. However, this literature typically takes a migrant-centric view, implicitly putting the burden of a successful integration on the migrant, and trying to identify the factors that lead to integration along various dimensions. In this paper, we flip this point of view by studying the attributes of natives that govern their propensity to form social ties with migrants. We do so by using anonymous and aggregate social network data provided by Facebook’s advertising platform. More specifically, we look at factors that influence the propensity for a likely-to-be non-Muslim Facebook user to have at least one social connection to a Facebook user who celebrates Ramadan. Given that, in the European context, following Islam is predominantly tied to a migration background, this gives us a lens into cross-cultural native-migrant connectivity. Our study considers demographic attributes of the ho...
... 3. QUERY EXPANSION VIA PREFIX COMPLETION The key idea is to add the information we have about related terms as artificial words to ... and tested our feature for two collections: the TREC Robust collection (1.5 GB, 556,078 documents), and the English Wikipedia (8 GB ...
ABSTRACT Suppose your sole interest in recommending a product to me is to maximize the amount paid to you by the seller for a sequence of recommendations. How should you recommend optimally if I become more inclined to ignore you with each irrelevant recommendation you make? Finding an answer to this question is a key challenge in all forms of marketing that rely on and explore social ties; ranging from personal recommendations to viral marketing. We prove that even if the recommendee regains her initial trust on each successful recommendation, the expected revenue the recommender can make over an infinite period due to payments by the seller is bounded. This can only be overcome when the recommendee also incrementally regains trust during periods without any recommendation. Here, we see a connection to "banner blindness," suggesting that showing fewer ads can lead to a higher long-term revenue.
ABSTRACT While there has been a substantial amount of research into the editorial and organizational processes within Wikipedia, little is known about how Wikipedia editors (Wikipedians) relate to the online world in general. We attempt to shed light on this issue by using aggregated log data from Yahoo!'s browser toolbar in order to analyze Wikipedians' editing behavior in the context of their online lives beyond Wikipedia. We broadly characterize editors by investigating how their online behavior differs from that of other ...
ABSTRACT The increasing ubiquity of Internet use has opened up new avenues in the study of human mobility. Easily obtainable geolocation data resulting from repeated logins to the same website offer the possibility of observing long-term patterns of mobility for a large number of individuals. We use data on the geographic locations from which over 100 million anonymized users log into Yahoo! services to generate the first global map of short- and medium-term mobility flows. We develop a protocol to identify anonymized users who, over a one-year period, had spent more than 3 months in a different country from their stated country of residence ("migrants"), and users who spent less than a month in another country ("tourists"). We compute aggregate estimates of migration propensities between countries, as inferred from a user's location over the observed period. Geolocation data also allow us to characterize the pendularity of migration flows, i.e., the extent to which migrants travel back and forth between their countries of origin and destination. We use data regarding visa regimes, colonial ties, geographic location and economic development to predict migration and tourism flows. Our analysis shows the persistence of traditional migration patterns as well as the emergence of new routes. Migrations tend to be more pendular between countries that are close to each other. We observe particularly high levels of pendularity within the European Economic Area, even after we control for distance and visa regimes. The dataset, methodology and results presented have important implications for the travel industry, as well as for several disciplines in social sciences, including geography, demography and the sociology of networks.
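The migrant/tourist labeling rule described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' protocol: the function name `classify_user`, the one-country-per-day input format, the 90-day and 30-day thresholds (standing in for "3 months" and "a month"), and the extra labels `resident` and `unclassified` are all assumptions made for illustration.

```python
from collections import Counter

def classify_user(home_country, daily_countries, migrant_days=90, tourist_days=30):
    """Label one user from a year of observed login countries.

    home_country    -- the user's stated country of residence
    daily_countries -- one country code per observed day (hypothetical format)
    The day thresholds are illustrative stand-ins for "3 months" and "a month".
    """
    # Count the days spent in each country other than the stated residence.
    abroad = Counter(c for c in daily_countries if c != home_country)
    if not abroad:
        return "resident"
    # Look at the foreign country where the user spent the most days.
    _, days = abroad.most_common(1)[0]
    if days > migrant_days:
        return "migrant"
    if days < tourist_days:
        return "tourist"
    # Stays between the two thresholds are left unlabeled in this sketch.
    return "unclassified"
```

For example, `classify_user("IT", ["IT"] * 200 + ["DE"] * 100)` yields `"migrant"`, because 100 days in Germany exceeds the assumed 90-day threshold.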
ABSTRACT Given only the URL of a Web page, can we identify its language? In this article we examine this question. URL-based language classification is useful when the content of the Web page is not available or downloading the content is a waste of bandwidth and time. We built URL-based language classifiers for English, German, French, Spanish, and Italian by applying a variety of algorithms and features. As algorithms we used machine learning algorithms which are widely applied for text classification and state-of-the-art algorithms for language identification of text. As features we used words, various sized n-grams, and custom-made features (our novel feature set). We compared our approaches with two baseline methods, namely classification by country code top-level domains and classification by IP addresses of the hosting Web servers. We trained and tested our classifiers in a 10-fold cross-validation setup on a dataset obtained from the Open Directory Project and from querying a commercial search engine. We obtained the lowest F1-measure for English (94) and the highest F1-measure for German (98) with the best performing classifiers. We also evaluated the performance of our methods: (i) on a set of Web pages written in Adobe Flash and (ii) as part of a language-focused crawler. In the first case, the content of the Web page is hard to extract, and in the second case, downloading pages of the “wrong” language constitutes a waste of bandwidth. In both settings the best classifiers have a high accuracy with an F1-measure between 95 (for English) and 98 (for Italian) for the Adobe Flash pages and a precision between 90 (for Italian) and 97 (for French) for the language-focused crawler.
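The country-code top-level-domain baseline and the n-gram features mentioned in this abstract might look roughly like the Python sketch below. The mapping `CCTLD_TO_LANGUAGE`, the function names, and the choice of character (rather than word) trigrams are assumptions; the abstract does not specify the exact mapping or feature definition the authors used.

```python
from urllib.parse import urlparse

# Hypothetical mapping; a real baseline would cover many more ccTLDs.
CCTLD_TO_LANGUAGE = {
    "de": "German",
    "fr": "French",
    "es": "Spanish",
    "it": "Italian",
    "uk": "English",
}

def language_from_cctld(url, default=None):
    """Guess the page language from the last label of the URL's host name."""
    host = urlparse(url).hostname or ""
    tld = host.rsplit(".", 1)[-1].lower()
    return CCTLD_TO_LANGUAGE.get(tld, default)

def char_ngrams(url, n=3):
    """Character n-grams of the lowercased URL, usable as classifier features."""
    s = url.lower()
    return [s[i:i + n] for i in range(len(s) - n + 1)]
```

A generic domain such as `.com` carries no country signal, so `language_from_cctld` falls back to `default` there. This is one reason a learned classifier over URL features can outperform the baseline.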
The internet is often thought of as a democratizer, enabling equality in aspects such as pay, as well as a tool introducing novel communication and monetization opportunities. In this study we examine athletes on Cameo, a website that enables bi-directional fan-celebrity interactions, questioning whether the well-documented gender pay gaps in sports persist in this digital setting. Traditional studies into gender pay gaps in sports are mostly in a centralized setting where an organization decides the pay for the players, while Cameo facilitates grassroots fan engagement where fans pay for video messages from their preferred athletes. The results showed that even on such a platform gender pay gaps persist, both in terms of cost per message and in the number of requests, proxied by number of ratings. For instance, we find that female athletes have a median pay of $30 per video, while the same statistic is $40 for men. The results also contribute to the study of parasocial relationships and personalized fan engagements over a distance, something that has become more relevant during the ongoing COVID-19 pandemic, where in-person fan engagement has often been limited.
Background: Brief intervention is a critical method for identifying patients with problematic substance use in primary care settings and for motivating them to consider treatment options. However, despite considerable evidence of delay discounting in patients with substance use disorders, most brief advice by physicians focuses on the long-term negative medical consequences, which may not be the best way to motivate patients to seek treatment information. Objective: Identification of the specific symptoms that most motivate individuals to seek treatment information may offer insights for further improving brief interventions. To this end, we used anonymized internet search engine data to investigate which medical conditions and symptoms preceded searches for 12-step meeting locators and general 12-step information. Methods: We extracted all queries made by people in the United States on the Bing search engine from November 2016 to July 2017. These queries were filtered for those who mention...
Facebook, the most popular social network with over one billion daily users, provides rich opportunities for its use in the health domain. Though much of Facebook's data are not available to outsiders, the company provides a tool for estimating the audience of Facebook advertisements, which includes aggregated information on the demographics and interests, such as weight loss or dieting, of Facebook users. This paper explores the potential uses of Facebook ad audience estimates for eHealth by studying the following: (1) for what type of health conditions prevalence estimates can be obtained via social media and (2) what type of marker interests are useful in obtaining such estimates, which can then be used for recruitment within online health interventions. The objective of this study was to understand the limitations and capabilities of using Facebook ad audience estimates for public health monitoring and as a recruitment tool for eHealth interventions. We use the Facebook Mark...
Social media platforms provide several social interactional features. Due to the large-scale reach of social media, these interactional features help enable various types of political discourse. Constructive and diversified discourse is important for sustaining healthy communities and reducing the impact of echo chambers. In this paper, we empirically examine the role of a newly introduced Twitter feature, 'quote retweets' (or 'quote RTs'), in political discourse, specifically whether it has led to improved, civil, and balanced exchange. Quote RTs allow users to quote the tweet they retweet, while adding a short comment. Our analysis using content, network and crowd-labeled data indicates that the feature has increased political discourse and its diffusion, compared to existing features. We discuss the implications of our findings in understanding and reducing online polarization.
In recent years, the Middle East’s information and communication landscape has changed dramatically. Increasingly, states, businesses, and citizens are capitalising on the opportunities offered by new technologies, the fast pace of digitisation, and enhanced connectivity. These changes are far from turning Middle Eastern nations into network societies, but their impact is significant. The growing adoption of a wide variety of technologies in everyday life has given rise to complex dynamics that beg for a better understanding. Digital Middle East sheds a critical light on the continuing changes closely intertwined with the adoption of information and communication technologies in the region. Drawing on case studies from throughout the Middle East, the contributors explore how these digital transformations are playing out in the social, cultural, political, and economic spheres, exposing the various disjunctions and discordances that have marked the advent of the digital Middle East.
On social media platforms, like Twitter, users are often interested in gaining more influence and popularity by growing their set of followers, aka their audience. Several studies have described the properties of users on Twitter based on static snapshots of their follower network. Other studies have analyzed the general process of link formation. Here, rather than investigating the dynamics of this process itself, we study how the characteristics of the audience and follower links change as the audience of a user grows in size on the road to the user's popularity. To begin with, we find that the early followers tend to be more elite users than the late followers, i.e., they are more likely to have verified and expert accounts. Moreover, the early followers are significantly more similar to the person that they follow than the late followers. Namely, they are more likely to share time zone, language, and topics of interest with the followed user. To some extent, these phenomena are...
While the coronavirus disease 2019 (COVID-19) pandemic wreaked havoc across the globe, we have witnessed substantial mis- and disinformation regarding various aspects of the disease. We conducted a cross-sectional study using a self-administered questionnaire for the general public (recruited via social media) and healthcare workers (recruited via email) from the State of Qatar, and the Middle East and North Africa region to understand the knowledge of and anxiety levels around COVID-19 (April–June 2020) during the early stage of the pandemic. The final dataset used for the analysis comprised 1658 questionnaires (53.0% of 3129 received questionnaires; 1337 [80.6%] from the general public survey and 321 [19.4%] from the healthcare survey). Knowledge about COVID-19 was significantly different across the two survey populations, with a much higher proportion of healthcare workers possessing better COVID-19 knowledge than the general public (62.9% vs. 30.0%, p < 0.0001). A reverse ...
We developed Political Insights, an online searchable database of politically charged queries, which allows you to obtain topical insights into partisan concern. In this paper we demonstrate how you can discover such political queries and how to lay bare which issues are most salient to political audiences. We employ anonymized search engine queries resulting in a click on U.S. political blogs to calculate the probability that a query will land on blogs of a particular leaning. We are thus able to ‘charge’ queries politically and to group them along opposing partisan lines. Finally, by comparing the zip codes of users submitting these queries with election results, we find that the leaning of blogs people read correlates well with their likely voting behavior.
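The idea of "charging" a query by the probability that it lands on blogs of a given leaning can be sketched in Python as a simple relative-frequency estimate. This is an illustration under assumptions, not the paper's method: the `(query, leaning)` click-log format, the function name, and the absence of any smoothing or minimum-count filtering are all hypothetical.

```python
from collections import defaultdict

def leaning_probabilities(clicks):
    """Estimate P(leaning | query) from (query, blog_leaning) click pairs."""
    counts = defaultdict(lambda: defaultdict(int))
    for query, leaning in clicks:
        counts[query][leaning] += 1
    probabilities = {}
    for query, by_leaning in counts.items():
        total = sum(by_leaning.values())
        # Relative frequency of clicks on blogs of each leaning for this query.
        probabilities[query] = {l: n / total for l, n in by_leaning.items()}
    return probabilities
```

With rare queries, a single click would produce a probability of 1.0, so any real use of such an estimate would likely need a minimum click count or smoothing.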
Abstract. Multimedia annotation is central to its organization and retrieval – a task which tag recommendation systems attempt to simplify. We propose a photo tag recommendation system which automatically extracts semantics from visual and meta-data features to complement existing tags. Compared to standard content/tag-based models, these automatic tags provide a richer description of the image and especially improve performance in the case of the “cold start problem”.
We present a system for personalized tag suggestion for Flickr: While the user is entering/selecting new tags for a particular picture, the system is suggesting related tags to her, based on the tags that she or other people have used in the past along with (some of) the tags already entered. The suggested tags are dynamically updated with every additional tag entered/selected. We describe three algorithms which can be applied to this problem. In experiments, our best-performing method yields an improvement in precision of 10-15% over a baseline method very similar to the system currently used by Flickr. Our system is accessible at http://ltaa5.epfl.ch/flickr-tags/. To the best of our knowledge, this is the first study on tag suggestion in a setting where (i) no full text information is available, such as for blogs, (ii) no item has been tagged by more than one person, such as for social bookmarking sites, and (iii) suggestions are dynamically updated, requiring efficient yet effect...
In The Clash of Civilizations, Samuel Huntington argued that the primary axis of global conflict was no longer ideological or economic but cultural and religious, and that this division would characterize the “battle lines of the future.” In contrast to the "top down" approach in previous research focused on the relations among nation states, we focused on the flows of interpersonal communication as a bottom-up view of international alignments. To that end, we mapped the locations of the world’s countries in global email networks to see if we could detect cultural fault lines. Using IP-geolocation on a worldwide anonymized dataset obtained from a large Internet company, we constructed a global email network. In computing email flows we employ a novel rescaling procedure to account for differences due to uneven adoption of a particular Internet service across the world. Our analysis shows that email flows are consistent with Huntington’s thesis. In addition to locatio...
Is it possible to "hack" an image of an international entity by driving international and domestic media? Here, we present an image/brand monitoring tool for a country, Qatar, which presents an overview of the contexts and references to media in which it is mentioned on social media. Tracking dozens of languages, this tool allows a global understanding of the perceptions and concerns Twitter users associate with Qatar, and which mainstream media may be driving these sentiments.
We present a system that visualizes geo-temporal Twitter activity. The distinguishing features our system offers include (i) a large degree of user freedom in specifying the subset of data to visualize and (ii) a focus on *discriminative* patterns rather than high-volume patterns. Tweets with precise GPS co-ordinates are assigned to geographical cells and grouped by (i) tweet language, (ii) tweet topic, (iii) day of week, and (iv) time of day. The spatial resolution of the cells is determined in a data-driven manner using quad-trees and recursive splitting. The user can then choose to see data for, say, English tweets on weekend evenings for the topic "party". This system has been implemented for 1.8 million geo-tagged tweets from Qatar (http://qtr.qcri.org/) and for 4.8 million geo-tagged tweets from New York City (http://nyc.qcri.org/) and can be easily extended to other cities/countries.
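The data-driven cell resolution mentioned in this abstract can be illustrated with a generic quad-tree split in Python. This is a sketch, not the system's implementation: the function name, the `max_points` and `max_depth` parameters, and the stopping rule (split until a cell holds few enough points) are assumptions.

```python
def quadtree_cells(points, bounds, max_points=2, max_depth=8):
    """Recursively split a bounding box into four quadrants until each leaf
    cell holds at most max_points points or max_depth is exhausted.

    points -- list of (x, y) tuples, e.g. (longitude, latitude)
    bounds -- (xmin, ymin, xmax, ymax)
    Returns a list of (bounds, points) leaf cells.
    """
    if len(points) <= max_points or max_depth == 0:
        return [(bounds, points)]
    xmin, ymin, xmax, ymax = bounds
    xmid, ymid = (xmin + xmax) / 2, (ymin + ymax) / 2
    quadrants = [
        (xmin, ymin, xmid, ymid),  # index 0: lower-left
        (xmid, ymin, xmax, ymid),  # index 1: lower-right
        (xmin, ymid, xmid, ymax),  # index 2: upper-left
        (xmid, ymid, xmax, ymax),  # index 3: upper-right
    ]
    buckets = [[], [], [], []]
    for x, y in points:
        # Booleans act as 0/1, so this picks the quadrant index directly.
        buckets[(x >= xmid) + 2 * (y >= ymid)].append((x, y))
    cells = []
    for quadrant, bucket in zip(quadrants, buckets):
        cells.extend(quadtree_cells(bucket, quadrant, max_points, max_depth - 1))
    return cells
```

Dense areas end up with small cells and sparse areas with large ones. The depth limit keeps the recursion finite even when many points share identical coordinates.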

Celebrity and fandom have been studied extensively in real life. Perceived virtual relationships, commonly known as para-social relationships (PSR) have been shown to exist between celebrities and fans [3]. The end of such relationships, para-social breakups (PSB), have also been studied [1]. However, with more and more celebrities using social media, the dynamics of PSR and PSB have changed. Using data from 57,000 fans for the top followed celebrities on Twitter, we try to understand how para-social breakups manifest on Twitter. We hypothesize that a PSB on Twitter happens as an act of unfollowing the celebrity and study the differences in engaging in a PSB between various types of fans. We find that, surprisingly, the most devoted fans are more likely to be involved in a para-social breakup. Given our scale and dependence on non-reactive data, our paper opens new avenues for research in para-social interactions.