Over the past decade, the paradigm of the World Wide Web has shifted from static web pages towards participatory and collaborative content production. The defining properties of this user-generated content are a low publication threshold and little or no editorial control. While this has improved the variety and timeliness of the available information, it has also led to an even higher variance in quality than that of traditional web content, which is already heterogeneous. Wikipedia is the prime example of a successful, large-scale, collaboratively created resource that reflects the spirit of the open collaborative content creation paradigm.
Even though recent studies have confirmed that the overall quality of Wikipedia is high, a wide gap must still be bridged before Wikipedia can be considered a reliable, citable source.

A key prerequisite to reaching this goal is a quality management strategy that can cope both with the massive scale of Wikipedia and with its open and almost anarchic nature. This includes an efficient communication platform for work coordination among the collaborators as well as techniques for monitoring quality problems across the encyclopedia. This dissertation shows how natural language processing approaches can be used to assist information quality management on a massive scale.

In the first part of this thesis, we establish the theoretical foundations for our work. We first introduce the relatively new concept of open online collaboration, with a particular focus on collaborative writing, and continue with a detailed discussion of Wikipedia and its role as an encyclopedia, a community, an online collaboration platform, and a knowledge resource for language technology applications. We then turn to the three main contributions of this thesis.

Even though there have been previous attempts to adapt existing information quality frameworks to Wikipedia, no quality model has yet incorporated writing quality as a central factor. Since Wikipedia is not merely a repository of facts but consists of full-text articles, the writing quality of these articles has to be taken into consideration when judging article quality. As the first main contribution of this thesis, we therefore define a comprehensive article quality model that consolidates both the quality of writing and the quality criteria defined in multiple Wikipedia guidelines and policies into a single model. The model comprises 23 dimensions segmented into four layers: intrinsic quality, contextual quality, writing quality, and organizational quality.
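The layered structure of the model lends itself to a simple data representation. The following Python sketch is purely illustrative: the four layer names are taken from the model above, but the example dimensions are hypothetical placeholders, not the 23 dimensions actually defined in the thesis.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    # The four layers named in the quality model above.
    INTRINSIC = "intrinsic quality"
    CONTEXTUAL = "contextual quality"
    WRITING = "writing quality"
    ORGANIZATIONAL = "organizational quality"


@dataclass(frozen=True)
class QualityDimension:
    name: str
    layer: Layer
    description: str


# Hypothetical example dimensions, one per layer; the thesis defines
# 23 dimensions in total, which are not reproduced here.
EXAMPLE_DIMENSIONS = [
    QualityDimension("accuracy", Layer.INTRINSIC, "factual correctness"),
    QualityDimension("timeliness", Layer.CONTEXTUAL, "up-to-dateness of the content"),
    QualityDimension("readability", Layer.WRITING, "ease of reading the article text"),
    QualityDimension("structure", Layer.ORGANIZATIONAL, "sectioning and layout"),
]
```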

As a second main contribution, we present an approach for automatically identifying quality flaws in Wikipedia articles. Even though the general idea of quality flaw detection has been introduced in previous work, we dissect the approach and find that the task is inherently prone to a topic bias, which leads to unrealistically high cross-validated evaluation scores that do not reflect a classifier's actual performance on real-world data.

We solve this problem with a novel data sampling approach based on the full article revision history that avoids this bias. It furthermore allows us not only to identify flawed articles but also to find reliable counterexamples that do not exhibit the respective quality flaws. For automatically detecting quality flaws in unseen articles, we present FlawFinder, a modular system for supervised text classification. We evaluate the system on a novel corpus of Wikipedia articles with neutrality and style flaws. The results confirm the initial hypothesis that the reliable classifiers tend to exhibit lower cross-validated performance than the biased ones, but their scores more closely resemble actual performance in the wild.
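To make the sampling idea concrete, here is a minimal sketch of how a flawed instance and a reliable counterexample could be drawn from the same article's revision history, assuming revisions are plain wikitext strings and flaws are marked by cleanup templates. The function name and data format are hypothetical illustrations, not FlawFinder's actual interface.

```python
def sample_flaw_instances(revisions, flaw_template="{{POV}}"):
    # revisions: chronologically ordered list of wikitext strings for
    # one article; flaw_template is the cleanup marker for the flaw.
    # Returns a (flawed, repaired) pair of revision texts, or None.
    # Both instances come from the same article, so the topic is held
    # constant and the topic bias described above is avoided.
    for i in range(len(revisions) - 1):
        if flaw_template in revisions[i] and flaw_template not in revisions[i + 1]:
            # Last revision still carrying the cleanup template,
            # followed by the first revision where it was removed.
            return revisions[i], revisions[i + 1]
    return None
```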

As a third main contribution, we present an approach for automatically segmenting and tagging the user contributions on article Talk pages in order to improve work coordination among Wikipedians. These unstructured discussion pages are hard to navigate, and information is easily lost over time in the discussion archives. By automatically identifying the quality problems that have been discussed in the past and the solutions that have been proposed, we can help users to make informed decisions in the future.

Our contribution in this area is threefold: (i) We describe a novel algorithm for segmenting the unstructured dialog on Wikipedia Talk pages using their revision history. In contrast to related work, which mainly relies on the rudimentary page markup, this new algorithm can reliably extract metadata, such as the identity of a user, and is moreover able to handle discontinuous turns. (ii) We introduce a novel scheme for annotating the turns in article discussions with dialog act labels that capture the coordination efforts of article improvement. The labels reflect the types of criticism discussed in a turn, for example missing information or inappropriate language, as well as any actions proposed for solving the quality problems. (iii) Based on this scheme, we created two automatically segmented and manually annotated discussion corpora extracted from the Simple English Wikipedia (SEWD) and the English Wikipedia (EWD). We evaluate how well text classification approaches can learn to assign the dialog act labels from our scheme to unseen discussion pages, achieving a cross-validated performance of F1 = 0.82 on the SEWD corpus and an average performance of F1 = 0.78 on the larger and more complex EWD corpus.
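The core of the revision-based segmentation can be sketched as follows: each revision's inserted text becomes a turn attributed to that revision's author, and because insertions may occur at several positions in the page, a single turn can be discontinuous. This is a simplified illustration under an assumed revision format, not the thesis's actual algorithm.

```python
import difflib


def segment_turns(revisions):
    # revisions: chronological list of dicts with keys "text", "user",
    # and "timestamp" (a simplified, hypothetical format).
    turns = []
    prev_lines = []
    for rev in revisions:
        cur_lines = rev["text"].splitlines()
        matcher = difflib.SequenceMatcher(a=prev_lines, b=cur_lines)
        inserted = []
        for tag, _, _, j1, j2 in matcher.get_opcodes():
            # Collect all lines added in this revision, wherever they
            # appear; scattered insertions yield a discontinuous turn.
            if tag in ("insert", "replace"):
                inserted.extend(cur_lines[j1:j2])
        if inserted:
            turns.append({
                "user": rev["user"],          # metadata from the revision
                "timestamp": rev["timestamp"],
                "text": "\n".join(inserted),
            })
        prev_lines = cur_lines
    return turns
```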
With more than 22 million articles, the largest collaborative knowledge resource never sleeps, experiencing several article edits every second. Over one fifth of these articles describe individual people, the majority of whom are still alive. Such articles are, by their nature, prone to corruption and vandalism. Manual quality assurance by experts can barely cope with this massive amount of data. Can it be effectively replaced by feedback from the crowd? Can we provide meaningful support for quality assurance with automated text processing techniques? Which properties of the articles should then play a key role in the machine learning algorithms, and why? In this paper, we study the user-perceived quality of Wikipedia articles based on a novel Wikipedia user feedback dataset. In contrast to previous work on quality assessment, which mostly relied on judgements of active Wikipedia authors, we analyze ratings of ordinary Wikipedia users along four quality dimensions (complete, well written, trustworthy, and objective). We first present an empirical analysis of the novel dataset with over 36 million Wikipedia article ratings. We then select a subset of biographical articles and perform classification experiments to predict their quality ratings along each of the dimensions, exploring multiple linguistic, surface, and network properties of the rated articles. Additionally, we study the classification performance and its differences for the biographies of living and dead people as well as those of men and women. We demonstrate the effectiveness of our approach with F1 scores of 0.94, 0.89, 0.73, and 0.73 for the dimensions complete, well written, trustworthy, and objective. Based on these results, we believe that the quality assessment of big textual data can be effectively supported by current text classification and language processing tools.
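A per-dimension classifier of this kind can be sketched with standard tools. The following is a minimal illustration using scikit-learn with lexical surface features only; the paper additionally explores linguistic and network properties, and the binary high/low labeling of ratings is an assumption of this sketch.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline


def train_dimension_classifier(article_texts, labels):
    # article_texts: list of article plain texts; labels: binary
    # high/low ratings for one dimension (e.g. "well written").
    clf = make_pipeline(
        TfidfVectorizer(ngram_range=(1, 2), min_df=5),
        LogisticRegression(max_iter=1000),
    )
    # Cross-validated F1 estimates generalization before the final fit.
    scores = cross_val_score(clf, article_texts, labels, scoring="f1", cv=10)
    clf.fit(article_texts, labels)
    return clf, scores.mean()
```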
We present a sentiment classification system that participated in the SemEval 2014 shared task on sentiment analysis in Twitter. Our system expands tokens in a tweet with semantically similar expressions using a large novel distributional thesaurus and calculates the semantic relatedness of the expanded tweets to word lists representing positive and negative sentiment. This approach helps to assess the polarity of tweets that do not directly contain polarity cues. Moreover, we incorporate syntactic, lexical, and surface sentiment features. On the message level, our system achieved 8th place in terms of macro-averaged F-score among 50 systems, with particularly good performance on the LiveJournal corpus (F1 = 71.92) and the Twitter sarcasm dataset (F1 = 54.59). On the expression level, our system ranked 14th out of 27 systems based on macro-averaged F-score.
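The expansion step can be illustrated as follows. This is a hedged sketch: the thesaurus is modeled as a plain similarity-ranked lookup table, and the overlap-based polarity score stands in for the proper semantic relatedness measure the system actually uses.

```python
def expand_tweet(tokens, thesaurus, top_n=3):
    # thesaurus: dict mapping a token to a similarity-ranked list of
    # related words (a stand-in for the distributional thesaurus).
    expanded = list(tokens)
    for tok in tokens:
        expanded.extend(thesaurus.get(tok.lower(), [])[:top_n])
    return expanded


def polarity_score(tokens, positive_words, negative_words):
    # Overlap of the (expanded) tweet with positive and negative word
    # lists; expansion lets tweets without direct polarity cues still
    # match the lists via semantically similar words.
    pos = sum(t.lower() in positive_words for t in tokens)
    neg = sum(t.lower() in negative_words for t in tokens)
    return pos - neg
```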
We present DKPro TC, a framework for supervised learning experiments on textual data. The main goal of DKPro TC is to enable researchers to focus on the actual research task behind the learning problem and let the framework handle the rest. It enables rapid prototyping of experiments by relying on an easy-to-use workflow engine and standardized document preprocessing based on the Apache Unstructured Information Management Architecture (Ferrucci and Lally, 2004). It ships with standard feature extraction modules, while at the same time allowing the user to add customized extractors. The extensive reporting and logging facilities make DKPro TC experiments fully replicable.
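DKPro TC itself is a Java framework built on Apache UIMA; the following Python sketch only illustrates the pluggable feature-extractor idea described above and deliberately does not reproduce DKPro TC's actual API. All names here are hypothetical.

```python
from typing import Callable, Dict, List

# A feature extractor maps a document to named feature values.
FeatureExtractor = Callable[[str], Dict[str, float]]


def n_tokens(doc: str) -> Dict[str, float]:
    return {"n_tokens": float(len(doc.split()))}


def avg_token_len(doc: str) -> Dict[str, float]:
    toks = doc.split()
    return {"avg_token_len": sum(map(len, toks)) / max(len(toks), 1)}


def extract_features(doc: str, extractors: List[FeatureExtractor]) -> Dict[str, float]:
    # Run every registered extractor; standard modules and custom
    # user-supplied extractors share the same interface.
    features: Dict[str, float] = {}
    for extractor in extractors:
        features.update(extractor(doc))
    return features


# Example: extract_features("a short example document", [n_tokens, avg_token_len])
```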
In this paper, we present a system for automatically answering open-domain, multiple-choice reading comprehension questions about short English narrative texts. The system is based on state-of-the-art text similarity measures, textual entailment metrics, and coreference resolution, and does not make use of any additional domain-specific background knowledge. Each answer option is scored with a combination of all evaluation metrics, and the options are ranked by their overall score to determine the most likely correct answer. Our best configuration achieved the second-highest score across all competing systems in the entrance exam grading challenge, with a c@1 score of 0.375.
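For reference, the c@1 measure used in the challenge (Peñas and Rodrigo, 2011) rewards leaving a question unanswered over answering it wrongly. The sketch below also includes a hypothetical `best_answer` interface illustrating the weighted metric combination; it is not the system's actual implementation.

```python
def c_at_1(n_correct, n_unanswered, n_total):
    # c@1 = (n_correct + n_correct * n_unanswered / n_total) / n_total:
    # unanswered questions contribute proportionally to the accuracy
    # achieved on the answered ones.
    return (n_correct + n_correct * n_unanswered / n_total) / n_total


def best_answer(options, metrics, weights):
    # Each metric maps an answer option to a score; options are ranked
    # by the weighted combination of all metrics and the top-ranked
    # option is returned as the most likely correct answer.
    def combined(option):
        return sum(w * m(option) for m, w in zip(metrics, weights))
    return max(options, key=combined)


# Example: 10 correct answers out of 25 questions with none left
# unanswered gives c_at_1(10, 0, 25) == 0.4.
```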
It is not always easy to define what a word means. We can choose between a variety of possibilities, from simply pointing at the correct object as we say its name to lengthy definitions in encyclopaedias, which can sometimes fill multiple pages. Although the former ...