Maria Liakata


    With a drastic shift towards digital communication, individuals and organizations can almost instantly disseminate information to a large audience with little to no regulation, creating both new challenges and opportunities. This digital shift in our mediasphere has caused a profound change in the production and consumption of information, which in turn has strong implications for the social and political landscape. The challenges that result from mass information diffusion have become more visible to the general public in light of recent events such as the COVID-19 infodemic and the US election. In this second rendition of MEDIATE, we continued our focus on the topic of misinformation and examined it via three interrelated lenses: (1) automated methods tackling misinformation; (2) uptake of automated information verification; and (3) ethics, regulation, and governance. In line with the spirit of this workshop series, we brought together media practitioners and academics to d...
    Rumour stance classification, the task that determines if each tweet in a collection discussing a rumour is supporting, denying, questioning or simply commenting on the rumour, has been attracting substantial interest. Here we introduce a novel approach that makes use of the sequence of transitions observed in tree-structured conversation threads in Twitter. The conversation threads are formed by harvesting users’ replies to one another, which results in a nested tree-like structure. Previous work addressing the stance classification task has treated each tweet as a separate unit. Here we analyse tweets by virtue of their position in a sequence and test two sequential classifiers, Linear-Chain CRF and Tree CRF, each of which makes different assumptions about the conversational structure. We experiment with eight Twitter datasets, collected during breaking news, and show that exploiting the sequential structure of Twitter conversations achieves significant improvements over the non-s...
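The sequential setup described above can be illustrated with a minimal sketch (not the authors' implementation): a reply tree is linearised into root-to-leaf sequences, the per-conversation instances a linear-chain CRF would consume. The thread structure and tweet ids below are hypothetical.

```python
def branches(thread, node, path=()):
    """Yield every root-to-leaf path of tweet ids in the reply tree."""
    path = path + (node,)
    children = thread.get(node, [])
    if not children:
        yield path
    for child in children:
        yield from branches(thread, child, path)

# Reply tree: tweet id -> ids of direct replies (hypothetical data).
thread = {"t1": ["t2", "t3"], "t2": ["t4"], "t3": []}

# Each sequence becomes one instance for the sequential classifier;
# per-tweet features (e.g. lexical cues) would attach per position.
sequences = list(branches(thread, "t1"))
```

A Tree CRF, by contrast, would operate on the nested structure directly rather than on these flattened branches.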
    System event logs capture the sequence of events occurring in a system. They are often the primary source of information from large-scale distributed systems, such as cluster systems, enabling system administrators to detect system failures and determine their causes. Due to the complex interactions between the system hardware and software components, system event logs are typically huge in size, comprising streams of interleaved log messages. However, only a small fraction of those log messages are relevant for analysis. We thus develop a novel, generic log compression or filtering (i.e., redundancy removal) technique to address this problem. We apply the technique over three different log files obtained from two different production systems and validate the technique through the application of an unsupervised failure detection approach. Our results are positive: (i) our technique achieves good compression, (ii) log analysis yields better results fo...
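A minimal sketch of the general idea of log redundancy removal (not the paper's actual technique): each message is reduced to a template by masking variable fields such as numbers, and only the first message per template is kept. The log lines are hypothetical.

```python
import re

def compress(lines):
    """Keep only the first log line per template (numbers masked out)."""
    seen, kept = set(), []
    for line in lines:
        template = re.sub(r"\d+", "<num>", line)
        if template not in seen:
            seen.add(template)
            kept.append(line)
    return kept

log = [
    "node 12 failed at 10:32",
    "node 47 failed at 11:05",   # same template as above, filtered out
    "disk full on node 3",
]
```

Here `compress(log)` retains one representative of the repeated failure message plus the distinct disk message, cutting the stream while preserving the distinct event types an analyst needs.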
    In this paper we address a new problem of predicting affect and well-being scales in a real-world setting of heterogeneous, longitudinal and non-synchronous textual as well as non-linguistic data that can be harvested from online media and mobile phones. We describe the method for collecting the heterogeneous longitudinal data, how features are extracted to address missing information and differences in temporal alignment, and how the latter are combined to yield promising predictions of affect and well-being on the basis of widely used psychological scales. We achieve a coefficient of determination (R^2) of 0.71-0.76 and a correlation coefficient of 0.68-0.87, which is higher than the state of the art in equivalent multi-modal tasks for affect.
    Dementia is a family of neurodegenerative conditions affecting memory and cognition in an increasing number of individuals in our globally aging population. Automated analysis of language, speech and paralinguistic features has been gaining popularity as a potential indicator of cognitive decline. Here we propose a novel longitudinal multi-modal dataset collected from people with mild dementia and age-matched controls over a period of several months in a natural setting. The multi-modal data consist of spoken conversations, a subset of which are transcribed, as well as typed and written thoughts and associated extra-linguistic information such as pen strokes and keystrokes. We describe the dataset in detail and proceed to focus on a task using the speech modality. The latter involves distinguishing controls from people with dementia by exploiting the longitudinal nature of the data. Our experiments showed significant differences in how the speech varied from session to session in t...
    The authors present a tokenizer and finite-state morphological analyzer (Beesley and Karttunen 2003) for Malagasy, based primarily on the discussion of Malagasy morphology in Keenan and Polinsky (1998) and Randriamasimanana (1986). Words in Malagasy are built from roots by means of a variety of morphological operations such as compounding, affixation and reduplication. The authors analyze productive patterns of nominal and verbal morphology, and describe genitive compounding and suffixation for nouns and various derivational processes involving compounding and affixation for verbs. This work offers a computational analysis of Malagasy morphology, and forms the basis of a computational grammar and lexicon of Malagasy within the framework of the PARGRAM project.
    The development of democratic systems is a crucial task, as confirmed by its selection as one of the Millennium Sustainable Development Goals by the United Nations. In this article, we report on the progress of a project that aims to address barriers to effective direct citizen participation in democratic decision-making processes, one of which is information overload. The main objective is to explore whether the application of Natural Language Processing (NLP) and machine learning can improve citizens’ experience of digital citizen participation platforms. Taking as a case study the “Decide Madrid” Consul platform, which enables citizens to post proposals for policies they would like to see adopted by the city council, we used NLP and machine learning to provide new ways to (a) suggest to citizens proposals they might wish to support; (b) group citizens by interests so that they can more easily interact with each other; (c) summarise comments posted in response to proposal...
    Scientific Reports 7: Article number: 45141; published online: 22 March 2017; updated: 16 May 2017 One of the Supplementary Information files that accompany this study was inadvertently omitted in the original version of this Article. The link to this Supplementary Information file has now been added to the HTML version of the Article.
    The number of people affected by mental illness is on the increase, and with it the burden on health and social care use, as well as the loss of both productivity and quality-adjusted life-years. Natural language processing of electronic health records is increasingly used to study mental health conditions and risk behaviours on a large scale. However, narrative notes written by clinicians do not capture first-hand the patients’ own experiences, and only record cross-sectional, professional impressions at the point of care. Social media platforms have become a source of ‘in the moment’ daily exchange, with topics including well-being and mental health. In this study, we analysed posts from the social media platform Reddit and developed classifiers to recognise and classify posts related to mental illness according to 11 disorder themes. Using a neural network and deep learning approach, we could automatically recognise mental illness-related posts in our balanced dataset with an accu...
    We present a two-level model of Malagasy nominal and verbal morphology (Beesley and Karttunen, 2003), based primarily on the discussion of Malagasy morphology in Keenan and Polinsky (1998) and Randriamasimanana (1986). Words in Malagasy are built from roots by means of a variety of morphological operations such as affixation and reduplication. The present paper analyzes productive patterns of nominal and verbal morphology, describing genitive compounding and suffixation for nouns, and various ...
    In this paper, we introduce the task of targeted aspect-based sentiment analysis. The goal is to extract fine-grained information with respect to entities mentioned in user comments. This work extends both aspect-based sentiment analysis, which assumes a single entity per document, and targeted sentiment analysis, which assumes a single sentiment towards a target entity. In particular, we identify the sentiment towards each aspect of one or more entities. As a testbed for this task, we introduce the SentiHood dataset, extracted from a question answering (QA) platform where urban neighbourhoods are discussed by users. In this context, units of text often mention several aspects of one or more neighbourhoods. This is the first time that a generic social media platform, i.e. QA, is used for fine-grained opinion mining. Text coming from QA platforms is far less constrained than text from the review-specific platforms on which current datasets are based. We develop several strong base...
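The task's output can be pictured as (target entity, aspect, sentiment) triples. The sketch below is illustrative only; the sentence and labels are hypothetical, not drawn from SentiHood.

```python
# One sentence can carry different sentiments towards different
# aspects of the same (masked) location entity.
sentence = "LOC1 is expensive but the transit links in LOC1 are great"
annotations = [
    ("LOC1", "price", "negative"),
    ("LOC1", "transit-location", "positive"),
]

def sentiment_for(annotations, target, aspect):
    """Look up the sentiment expressed towards one aspect of one entity."""
    for t, a, s in annotations:
        if t == target and a == aspect:
            return s
    return "none"
```

The "none" fallback matters for evaluation: a system must also decide which (target, aspect) pairs carry no sentiment at all.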
    With the constant growth of the scientific literature, automated processes to enable access to its contents are increasingly in demand. Several functional discourse annotation schemes have been proposed to facilitate information extraction and summarisation from scientific articles, the most well known being argumentative zoning. Core Scientific Concepts (CoreSC) is a three-layered, fine-grained annotation scheme providing content-based annotations at the sentence level; it has been used to index, extract and summarise scientific publications in the biomedical literature. A previously developed CoreSC corpus, on which existing automated tools have been trained, contains a single annotation for each sentence. However, more than one CoreSC concept can appear in the same sentence. Here, we present the Multi-CoreSC CRA corpus, a text corpus specific to the domain of cancer risk assessment (CRA), consisting of 50 full-text papers, each of which contains sentences annotat...
    Collecting together microblogs representing opinions about the same topics within the same timeframe is useful to a number of different tasks and practitioners. A major question is how to evaluate the quality of such thematic clusters. Here we create a corpus of microblog clusters from three different domains and time windows and define the task of evaluating thematic coherence. We provide annotation guidelines and human annotations of thematic coherence by journalist experts. We subsequently investigate the efficacy of different automated evaluation metrics for the task. We consider a range of metrics including surface level metrics, ones for topic model coherence and text generation metrics (TGMs). While surface level metrics perform well, outperforming topic coherence metrics, they are not as consistent as TGMs. TGMs are more reliable than all other metrics considered for capturing thematic coherence in microblog clusters due to being less sensitive to the effect of time windows.
    Despite widespread research interest in social media sentiment analysis, sentiment and emotion classification across different domains and on Twitter data remains a challenging task. Here we set out to find an effective approach for tackling a cross-domain emotion classification task on a set of Twitter data involving social media discourse around arts and cultural experiences, in the context of museums. While most existing work in domain adaptation has focused on feature-based and/or instance-based adaptation methods, in this work we study a model-based adaptive SVM approach, as we believe its flexibility and efficiency are more suitable for the task at hand. We conduct a series of experiments and compare our system with a set of baseline methods. Our results not only show superior performance in terms of accuracy and computational efficiency compared to the baselines, but also shed light on how different ratios of labelled target-domain data used for adaptation can affect c...
    Tweet-geolocation-5m is a dataset of more than 5 million geolocated tweets with detailed geolocation information. Each geolocated tweet is associated with fine-grained location information, collected from OpenStreetMap [1] using the reverse geocoding feature in Nominatim [2]. It was originally created for country-level classification of tweets, but finer-grained classification is also provided with the dataset. The country codes are provided using the ISO 3166-1 alpha-2 standard [3]. The dataset was collected in two different week-long periods: TC2014, collected in October 2014, and TC2015, collected in October 2015. Two files are provided here: (1) tweet-geolocation-5m.tar.bz2, the actual dataset, providing the tweet IDs and ground-truth country IDs that enable conducting further experiments; (2) vectors-and-folds.tar.bz2, provided for the purposes of reproducibility: with the information in this file, you should be able to reproduce the results presented in the paper.
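As a minimal sketch of working with such a tweet-id / country-code listing (the exact on-disk format of the archive is not shown here, so the rows below are hypothetical), one can tally the ground-truth ISO 3166-1 alpha-2 labels to inspect the class distribution before training a country-level classifier:

```python
from collections import Counter

# Hypothetical (tweet id, ISO 3166-1 alpha-2 country code) rows,
# standing in for the contents of tweet-geolocation-5m.tar.bz2.
rows = [
    ("531031585123456789", "GB"),
    ("531031585123456790", "ES"),
    ("531031585123456791", "GB"),
]

# Class distribution of the ground-truth country labels.
counts = Counter(code for _tweet_id, code in rows)
```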
    HMMER can also be used to discover families. It is capable of sensitive sequence similarity search, and can find homologous proteins that are distantly related to the query sequence. When performing searches, HMMER constructs an HMM from the query sequence and uses this to query the target sequence database, similar to the method used to identify members of Pfam families. This makes HMMER a useful tool for identifying new families.
    The task of recommending relevant scientific literature for a draft academic paper has recently received significant interest. In our effort to ease the discovery of scientific literature and augment scientific writing, we aim to improve the relevance of results based on a shallow semantic analysis of the source document and the potential documents to recommend. We investigate the utility of automatic argumentative and rhetorical annotation of documents for this purpose. Specifically, we integrate automatic Core Scientific Concepts (CoreSC) classification into a prototype context-based citation recommendation system and investigate its usefulness to the task. We frame citation recommendation as an information retrieval task and we use the categories of the annotation schemes to apply different weights to the similarity formula. Our results show interesting and consistent correlations between the type of citation and the type of sentence containing the relevant information.
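The idea of weighting the similarity formula by annotation category can be sketched as follows. This is an illustrative simplification, not the system's actual retrieval model: the category weights are hypothetical, and plain word overlap stands in for the real similarity function.

```python
# Hypothetical per-CoreSC-category weights: overlap with a candidate
# document's "Method" sentences counts for more than with "Background".
WEIGHTS = {"Method": 2.0, "Result": 1.5, "Background": 0.5}

def weighted_overlap(query_sents, doc_sents):
    """Sum word overlap between sentence pairs, scaled by the CoreSC
    weight of the candidate-document sentence.
    Each sentence is a (category, text) pair."""
    score = 0.0
    for _, q in query_sents:
        q_words = set(q.lower().split())
        for cat, d in doc_sents:
            overlap = len(q_words & set(d.lower().split()))
            score += WEIGHTS.get(cat, 1.0) * overlap
    return score

score = weighted_overlap(
    [("Method", "we train a crf model")],
    [("Method", "a crf model for stance"), ("Background", "stance is studied")],
)
```

Ranking candidate documents by such a weighted score lets the retrieval step prefer matches in the sentence types most informative for the citation being recommended.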
    Social media tend to be rife with rumours while new reports are released piecemeal during breaking news. Interestingly, one can mine multiple reactions expressed by social media users in those situations, exploring their stance towards rumours, ultimately enabling the flagging of highly disputed rumours as being potentially false. In this work, we set out to develop an automated, supervised classifier that uses multi-task learning to classify the stance expressed in each individual tweet in a conversation around a rumour as either supporting, denying or questioning the rumour. Using a Gaussian Process classifier, and exploring its effectiveness on two datasets with very different characteristics and varying distributions of stances, we show that our approach consistently outperforms competitive baseline classifiers. Our classifier is especially effective in estimating the distribution of different types of stance associated with a given rumour, which we set forth as a desired charac...
    As breaking news unfolds, people increasingly rely on social media to stay abreast of the latest updates. The use of social media in such situations comes with the caveat that new information being released piecemeal may encourage rumours, many of which remain unverified long after their point of release. Little is known, however, about the dynamics of the life cycle of a social media rumour. In this paper we present a methodology that has enabled us to collect, identify and annotate a dataset of 330 rumour threads (4,842 tweets) associated with 9 newsworthy events. We analyse this dataset to understand how users spread, support, or deny rumours that are later proven true or false, by distinguishing two levels of status in a rumour's life cycle, i.e. before and after its veracity status is resolved. The identification of rumours associated with each event, as well as the tweet that resolved each rumour as true or false, was performed by journalist members of the research team who track...
    Phrase-level word embeddings for sentiment analysis.
