Lecture Notes in Computer Science 9383
Commenced Publication in 1973
Founding and Former Series Editors:
Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board
David Hutchison
Lancaster University, Lancaster, UK
Takeo Kanade
Carnegie Mellon University, Pittsburgh, PA, USA
Josef Kittler
University of Surrey, Guildford, UK
Jon M. Kleinberg
Cornell University, Ithaca, NY, USA
Friedemann Mattern
ETH Zurich, Zürich, Switzerland
John C. Mitchell
Stanford University, Stanford, CA, USA
Moni Naor
Weizmann Institute of Science, Rehovot, Israel
C. Pandu Rangan
Indian Institute of Technology, Madras, India
Bernhard Steffen
TU Dortmund University, Dortmund, Germany
Demetri Terzopoulos
University of California, Los Angeles, CA, USA
Doug Tygar
University of California, Berkeley, CA, USA
Gerhard Weikum
Max Planck Institute for Informatics, Saarbrücken, Germany
More information about this series at http://www.springer.com/series/7407
Henning Wachsmuth
Text Analysis
Pipelines
Towards Ad-hoc Large-Scale Text Mining
Springer
Author
Henning Wachsmuth
Bauhaus-Universität Weimar
Weimar
Germany
This monograph constitutes a revised version of the author’s doctoral dissertation, which was
submitted to the University of Paderborn, Faculty of Electrical Engineering, Computer Science
and Mathematics, Department of Computer Science, Warburger Straße 100, 33098 Paderborn,
Germany, under the original title “Pipelines for Ad-hoc Large-Scale Text Mining”, and which
was accepted in February 2015.
Cover illustration: The image on the front cover was created by Henning Wachsmuth in 2015.
It illustrates the stepwise mining of structured information from large amounts of unstructured
text with a sequential pipeline of text analysis algorithms.
The last few years have given rise to a new and lasting technology hype that is
receiving much attention, not only in academia and industry but also in the news and
politics: big data. Hardly anything in the computer science world has been as con-
troversial as the ubiquitous storage and analysis of data. While activists keep on
warning that big data will take away our privacy and freedom, industry celebrates it as
the holy grail that will enhance everything from decisions over processes to products.
The truth lies somewhere in the middle.
Extensive data is collected nowadays about where we are, what we do, and how we
think. Being aware of that, dangers like not getting a job or paying higher health
insurance fees just because of one’s private behavior seem real and need to be tackled.
In this respect, big data indeed reduces freedom in that we are forced to refrain from
doing things not accepted by public opinion. On the upside, big data has the potential to
improve our lives and society in many respects including health care, environment
protection, problem solving, decision making, and so forth. It will bring unprecedented
insights into diseases and it will greatly increase the energy efficiency of infrastructure.
It will provide immediate information access to each of us and it will let influencers
better understand what people really need. Striving for such goals definitely makes working
on big data worthwhile and honorable.
Mature technologies exist for storing and analyzing structured data, from databases
to distributed computing clusters. However, most data out there (on the web, in
business clouds, on personal computers) is in fact unstructured, given in the form of
images, videos, music, or—above all—natural language text. Today, the most evolved
technologies to deal with such text are search engines. Search engines excel in finding
texts with the information we need in real time, but they do not understand what
information is actually relevant in a text. Here, text mining comes into play.
Text mining creates structured data from information found in unstructured text. For
this purpose, analysis algorithms are executed that aim to understand natural language
to some extent. Natural language is complex and full of ambiguities. Even the best
algorithms therefore create incorrect data from time to time, especially when a pro-
cessed text differs from expectation. Usually, a whole set of algorithms is assembled
into a sequential pipeline. Although text mining targets large-scale data, such
pipelines still tend to be too inefficient to cope with the scales encountered today in
reasonable time. Moreover, the assembly of algorithms in a pipeline depends on the
information to be found, which is often only known ad-hoc.
People search on the web to find relevant information on topics they wish to know
more about. Accordingly, companies analyze big data to discover new information that
is relevant for their business. Today’s search engines and big data analytics seek to
fulfill such information needs ad-hoc, i.e., immediately in response to a search query or
similar. Often, the relevant information is hidden in large numbers of natural language
texts from web pages and other documents. Instead of returning potentially relevant
texts only, leading search and analytics applications have recently started to return
relevant information directly. To obtain the sought information from the texts, they
perform text mining.
Text mining deals with tasks that target the inference of structured information from
collections and streams of unstructured input texts. It covers all techniques needed to
identify relevant texts, to extract relevant spans from these texts, and to convert the
spans into high-quality information that can be stored in databases and analyzed sta-
tistically. Text mining requires task-specific text analysis processes that may consist of
several interdependent steps. Usually, these processes are realized with text analysis
pipelines. A text analysis pipeline employs a sequence of natural language processing
algorithms where each algorithm infers specific types of information from the input
texts. Although effective algorithms exist for various types, the use of text analysis
pipelines is still restricted to a few predefined information needs. We argue that this is
due to three problems:
First, text analysis pipelines are mostly constructed manually for the tasks to be
addressed, because their design requires expert knowledge about the algorithms to be
employed. When information needs have to be fulfilled that are unknown beforehand,
text mining hence cannot be performed ad-hoc. Second, text analysis pipelines tend to
be inefficient in terms of run-time, because their execution often includes analyzing
texts with computationally expensive algorithms. When information needs have to be
fulfilled ad-hoc, text mining hence cannot be performed in the large. And third, text
analysis pipelines tend not to robustly achieve high effectiveness on all input texts (in
terms of the correctness of the inferred information), because they often include
algorithms that rely on domain-dependent features of texts. Generally, text mining
hence cannot guarantee to infer high-quality information at present.
The findings described in this book should not be attributed to a single person.
Although I wrote this book and the dissertation it is based on at the Database and
Information Systems Group and the Software Quality Lab of the University of
Paderborn myself, many people worked together with me or they helped me in other
important respects during my PhD time.
First, I would like to thank both my advisor, Gregor Engels, and my co-advisor,
Benno Stein, for supporting me throughout and for giving me the feeling that my
research is worth doing. Gregor, I express to you my deep gratitude for showing me
what a dissertation really means and showing me the right direction while letting me
take my own path. Benno, thank you so much for teaching me what science is all about,
how thoroughly I have to work for it, and that the best ideas emerge from collaboration.
I would like to thank Bernd Bohnet for the collaboration and for directly saying “yes,”
when I asked him to be the third reviewer of my dissertation. Similarly, I thank Hans
Kleine Büning and Friedhelm Meyer auf der Heide for serving as members of my
doctoral committee.
Since the book at hand reuses the content of a number of scientific publications,
I would like to thank all my co-authors not named so far. In chronological order, they
are Peter Prettenhofer, Kathrin Bujna, Mirko Rose, Tsvetomira Palakarska, and Martin
Trenkmann. Thank you for your great work. Without you, parts of this book would not
exist. Some parts have also profited from the effort of students who worked at our lab,
wrote their thesis under my supervision, or participated in our project group ID|SE.
Special thanks go to Joachim Köhring and Steffen Beringer. Moreover, some results
presented here are based on work of companies we cooperated with in two research
projects. I want to thank Dennis Hannwacker in this regard, but also the other
employees of Resolto Informatik and Digital Collections.
The aforementioned projects including my position were funded by the German
Federal Ministry of Education and Research (BMBF), for which I am very grateful. For
similar reasons, I want to thank the company HRS, the German Federal Ministry for
Economic Affairs and Energy (BMWI), as well as my employer, the University of
Paderborn, in general.
I had a great time at the university, first and foremost because of my colleagues.
Besides those already named, I say thank you to Fabian Christ and Benjamin Nagel for
all the fun discussions, for tolerating my habits, and for becoming friends. The same
holds for Christian Soltenborn and Christian Gerth, who I particularly thank for making
me confident about my research. Further thanks go to Jan Bals, Markus Luckey, and
Yavuz Sancar for exciting football matches, to the brave soccer team of the AG Engels,
and to the rest of the group. I express my gratitude to Friedhelm Wegener for constant
and patient technical help as well as to Stefan Sauer for managing all the official
matters, actively supported by Sonja Saage and Beatrix Wiechers.
I am especially grateful to Theo Lettmann for pushing me to apply for the PhD
position, for guiding me in the initial phase, and for giving me advice whenever needed
without measurable benefit for himself. Thanks to my new group, the Webis group in
Weimar, for the close collaboration over the years, and to the people at the University
of Paderborn who made all my conference attendances possible. Because of you, I
could enter the computational linguistics community, which I appreciate so much, and
make new friends around the world. I would like to mention Julian Brooke, who I
enjoyed discussing research and life with every year at a biannual conference, as well
as Alberto Barron, who opened my eyes to my limited understanding of the world with
sincerity and a dry sense of humor.
Aside from my professional life, I deeply thank all my dear friends for so many
great experiences and fun memories, for unconditionally accepting how I am, and for
giving me relief from my everyday life. I would like to name Annika, Carmen, Dirk,
Fabian, Kathrin, Lars, Sebastian, Semih, Stephan, and Tim here, but many more current
and former Bielefelders and Paderborners influenced me, including but not limited to
those from the HG Kurzfilme and the Projektbereich Eine Welt. I thank my parents for
always loving and supporting me and for being the best role models I can imagine.
Ipke, the countless time you spent for the great corrections and your encouraging
feedback helped me more than I can tell. Thanks also to the rest of my family for being
such a wonderful family.
Finally, my greatest thanks go to my young small family, Katrin and Max, for letting
me experience that there are much more important things in life than work, for giving
me the chance to learn about being some sort of father, for giving me a home
throughout my PhD, for accepting my long working hours, and for all the love and care
I felt. I hope that the effort I put into my dissertation and this book as well as my
excitement for research and learning will give you inspiration in your life.
The basic notations used in this thesis are listed here. Specific forms and variations
of these notations are introduced where needed and are marked explicitly with the
respective indices or similar.
Analysis
A A text analysis algorithm.
A A set or a repository of text analysis algorithms.
π The schedule of the algorithms in a text analysis pipeline.
Π A text analysis pipeline or a filtering stage within a pipeline.
P A set of text analysis pipelines.
Text
d A portion of text or a unit of a text.
D A text.
D A collection or a stream of texts.
S A scope, i.e., a sequence of portions of a text.
S A set of scopes.
Information
c A piece of information, such as a class label, an entity, a relation, etc.
C An information type or a set of pieces of information.
C A set of information types or a specification of an information need.
f A flow, i.e., the sequence of instances of an information type in a text.
F A set or a cluster of flows.
F A flow clustering, i.e., a partition of a set of flows.
f* A flow pattern, i.e., the average of a set of flows.
F* A set of flow patterns.
x A feature used to model an input in machine learning.
x A feature vector, i.e., an ordered set of features.
X A set of feature vectors.
Task
c A query that specifies a combination of information needs.
c A scoped query, i.e., a query that specifies sizes of relevant portions of text.
C A dependency graph that represents the dependencies in a scoped query.
K An agenda, i.e., a list of input requirements of text analysis algorithms.
l A machine that executes text analysis pipelines.
U A planning problem that specifies a goal to achieve within an environment.
X A universe or an ontology, each of which specifies an environment.
Quality
q A quality value or a quality estimation.
q A vector of quality estimations.
Q A quality criterion.
Q A set of quality criteria.
q A quality prioritization, defined as an ordered set of quality criteria.
Measures
t A run-time, possibly averaged over a certain unit of text.
a An accuracy value, i.e., an achieved ratio of correct decisions.
p A precision value, i.e., an achieved ratio of information inferred correctly.
r A recall value, i.e., an achieved ratio of correct information that is inferred.
f1 An F1-score, i.e., the harmonic mean of a precision and a recall value.
D The averaged deviation, i.e., a measure of text heterogeneity.
H A heuristic that predicts the run-time of a text analysis pipeline.
Q A quality function that maps analysis results to quality values.
Y A machine learning model that maps features to information.
Contents
1 Introduction
1.1 Information Search in Times of Big Data
1.1.1 Text Mining to the Rescue
1.2 A Need for Efficient and Robust Text Analysis Pipelines
1.2.1 Basic Text Analysis Scenario
1.2.2 Shortcomings of Traditional Text Analysis Pipelines
1.2.3 Problems Approached in This Book
1.3 Towards Intelligent Pipeline Design and Execution
1.3.1 Central Research Question and Method
1.3.2 An Artificial Intelligence Approach
1.4 Contributions and Outline of This Book
1.4.1 New Findings in Ad-Hoc Large-Scale Text Mining
1.4.2 Contributions to the Concerned Research Fields
1.4.3 Structure of the Remaining Chapters
1.4.4 Published Research Within This Book

3 Pipeline Design
3.1 Ideal Construction and Execution for Ad-Hoc Text Mining
3.1.1 The Optimality of Text Analysis Pipelines
3.1.2 Paradigms of Designing Optimal Text Analysis Pipelines
3.1.3 Case Study of Ideal Construction and Execution
3.1.4 Discussion of Ideal Construction and Execution
3.2 A Process-Oriented View of Text Analysis
3.2.1 Text Analysis as an Annotation Task
3.2.2 Modeling the Information to Be Annotated
3.2.3 Modeling the Quality to Be Achieved by the Annotation
3.2.4 Modeling the Analysis to Be Performed for Annotation
3.2.5 Defining an Annotation Task Ontology
3.2.6 Discussion of the Process-Oriented View
3.3 Ad-Hoc Construction via Partial Order Planning
3.3.1 Modeling Algorithm Selection as a Planning Problem
3.3.2 Selecting the Algorithms of a Partially Ordered Pipeline
3.3.3 Linearizing the Partially Ordered Pipeline
3.3.4 Properties of the Proposed Approach
3.3.5 An Expert System for Ad-Hoc Construction
3.3.6 Evaluation of Ad-Hoc Construction
3.3.7 Discussion of Ad-Hoc Construction
3.4 An Information-Oriented View of Text Analysis
3.4.1 Text Analysis as a Filtering Task
3.4.2 Defining the Relevance of Portions of Text
3.4.3 Specifying a Degree of Filtering for Each Relation Type
3.4.4 Modeling Dependencies of the Relevant Information Types
3.4.5 Discussion of the Information-Oriented View
3.5 Optimal Execution via Truth Maintenance
3.5.1 Modeling Input Control as a Truth Maintenance Problem
3.5.2 Filtering the Relevant Portions of Text
3.5.3 Determining the Relevant Portions of Text
3.5.4 Properties of the Proposed Approach
3.5.5 A Software Framework for Optimal Execution
3.5.6 Evaluation of Optimal Execution
3.5.7 Discussion of Optimal Execution

6 Conclusion
6.1 Contributions and Open Problems
6.1.1 Enabling Ad-Hoc Text Analysis
6.1.2 Optimally Analyzing Text
6.1.3 Optimizing Analysis Efficiency
6.1.4 Robustly Classifying Text
6.2 Implications and Outlook
6.2.1 Towards Ad-Hoc Large-Scale Text Mining
6.2.2 Outside the Box
Abstract The future of information search is not browsing through tons of web
pages or documents. In times of big data and the information overload of the inter-
net, experts in the field agree that both everyday and enterprise search will gradually
shift from only retrieving large numbers of texts that potentially contain relevant
information to directly mining relevant information in these texts (Etzioni 2011;
Kelly and Hamm 2013; Ananiadou et al. 2013). In this chapter, we first motivate
the benefit of such large-scale text mining for today’s web search and big data an-
alytics applications (Sect. 1.1). Then, we reveal the task specificity and the process
complexity of analyzing natural language text as the main problems that prevent
applications from performing text mining ad-hoc, i.e., immediately in response to a
user query (Sect. 1.2). Section 1.3 points out how we propose to tackle these prob-
lems by improving the design, efficiency, and domain robustness of the pipelines of
algorithms used for text analysis with artificial intelligence techniques. This leads to
the contributions of the book at hand (Sect. 1.4).
Fig. 1.1 Screenshot of Pentaho Big Data Analytics as an example of enterprise analytics software.
The “heat grid” shown visualizes the vehicle sales of a company.
or even unsuccessful for queries where relevant information has to be derived (e.g.
for the query “locations of search companies”), should be aggregated (e.g.
“user opinions on bing”), seems like a needle in a haystack (e.g. “if it isn’t
on google it doesn’t exist”, original source), and so forth.
For enterprise environments, big data analytics applications aim to infer such
high-quality information in the sense of relations, patterns, and hidden facts from
vast amounts of data (Davenport 2012). Figure 1.1 gives an example, showing the
enterprise software of Pentaho.1 As with this software, big data analytics is still
only on the verge of including unstructured texts in its analyses, though such texts are
assumed to make up 95 % of all enterprise-relevant data (HP Labs 2010). To provide
answers to a wide spectrum of information needs, relevant texts must be filtered and
relevant information must be identified in these texts. We hence argue that search
engines and big data analytics applications need to perform more text mining.
Text mining brings together techniques from the research fields of information
retrieval, data mining, and natural language processing in order to infer struc-
tured high-quality information from usually large numbers of unstructured texts
(Ananiadou and McNaught 2005). While information retrieval deals, at its heart,
with indexing and searching unstructured texts, data mining targets the discovery
of patterns in structured data. Natural language processing, finally, is concerned with
algorithms and engineering issues for the understanding and generation of speech
and human-readable text (Tsujii 2011). It bridges the gap between the other fields by
2 InfexBA – Information Extraction for Business Applications, funded by the German Federal
Ministry of Education and Research (BMBF), http://infexba.upb.de.
3 Taken from Business Insider, http://www.businessinsider.com/how-apples-annual-revenue-
Fig. 1.2 Google result page for the query Charles Babbage, showing an example of directly
providing relevant information instead of returning only web links.
on specified topics for monitoring purposes.7 But, in accordance with the quote of
Babbage, the benefit of text mining arising from the increase of velocity becomes
more striking when turning from predefined text analyses in frequent use to arbitrary
and more complex text analysis processes.
Text mining deals with tasks that often entail complex text analysis processes, con-
sisting of several interdependent steps that aim to infer sophisticated information
types from collections and streams of natural language input texts (cf. Chap. 2 for
details). In the mentioned project InfexBA, different entity types (e.g. organization
names) and event types (e.g. forecasts) had to be extracted from input texts and cor-
rectly brought into relation, before they could be normalized and aggregated. Such
steps require syntactic annotations of texts, e.g. part-of-speech tags and parse tree
labels (Sarawagi 2008). These in turn can only be added to a text that is segmented
into lexical units, e.g. into tokens and sentences. Similarly, text classification often
relies on so-called features (Manning et al. 2008) that are derived from lexical and
syntactic annotations or even from entities, like in ArguAna.
To realize the steps of a text analysis process, text analysis algorithms are em-
ployed that annotate new information types in a text or that classify, relate, normalize,
or filter previously annotated information. Such algorithms perform analyses of dif-
ferent computational cost, ranging from the typically cheap evaluation of single
rules and regular expressions, over the matching of lexicon terms and the statisti-
cal classification of text fragments, to complex syntactic analyses like dependency
parsing (Bohnet 2010). Because of the interdependencies between analyses, the
standard way to realize a text analysis process is in the form of a text analysis pipeline,
which sequentially applies each employed text analysis algorithm to its input.
Fig. 1.3 A text analysis pipeline Π = ⟨A, π⟩ with algorithm set A = {A1, ..., Am} and schedule π.
Each text analysis algorithm Ai ∈ A takes a text and information of certain types as input and
provides information of certain types as output.
Fig. 1.4 The basic text analysis scenario discussed in this book: A text analysis pipeline Π = ⟨A, π⟩
composes a subset A of all available text analysis algorithms A1, ..., An in order to infer output
information of a structured set of information types C from a collection or a stream of input texts D.
In principle, text analysis pipelines can be applied to tackle arbitrary text analysis
tasks, i.e., to infer output information of arbitrary types from arbitrary input texts.
This information will not always be correct, as it results from analyzing ambiguous
natural language text. Rather, a pipeline achieves a certain effectiveness in terms of
the quality of the inferred information, e.g. quantified as the relative frequency of
output information that is correct in the given task (cf. Chap. 2 for details).
The inference of high-quality information can be seen as the most general goal of
text mining. Search engines and big data analytics applications, in particular, aim to
infer such information immediately and/or from large numbers of texts. Both of them
deal with ad-hoc information needs, i.e., information needs that are stated ad-hoc
and are, thus, unknown beforehand. We argue that the process complexity and task
specificity outlined above prevent such ad-hoc large-scale text mining today due to
three problems:
First, the design of text analysis pipelines in terms of selecting and scheduling
algorithms for the information needs at hand and the input texts to be processed is
traditionally made manually, because it requires human expert knowledge about the
functionalities and interdependencies of the algorithms (Wachsmuth et al. 2013a).
If information needs are stated ad-hoc, the design of pipelines also has to be made
ad-hoc, which takes time when made manually, even with proper tool support (Kano
et al. 2010). Hence, text mining currently cannot be performed immediately.
Second, the run-time efficiency of traditionally executing text analysis pipelines
is low, because computationally expensive analyses are performed on the whole
input texts (Sarawagi 2008). Techniques are missing that identify the portions of
input texts, which contain information relevant for the information need at hand,
and that restrict expensive analyses to these portions. Different texts vary in the dis-
tribution of relevant information, which additionally makes these techniques input-
dependent (Wachsmuth and Stein 2012). While a common approach to avoid effi-
ciency problems is to analyze input texts in advance when they are indexed (Cafarella
et al. 2005), this is not feasible for ad-hoc information needs. Also, the application
of faster algorithms (Pantel et al. 2004) seems critical because it mostly also results
in a reduced effectiveness. Hence, ad-hoc text mining currently cannot be performed
on large numbers of texts.
Third, text analysis pipelines tend not to infer high-quality information with high
robustness, because the employed algorithms traditionally rely on features of input
texts that are dependent on the domains of the texts (Blitzer et al. 2007). The appli-
cations we target, however, may process texts from arbitrary domains, such that
pipelines will often fail to infer information effectively. An approach to still achieve
user acceptance under limited effectiveness is to explain how information was in-
ferred (Li et al. 2012b), but this is difficult for pipelines, as they realize a process
with several complex and uncertain decisions about natural language (Das Sarma
et al. 2011). Hence, text mining cannot generally guarantee high quality.
Altogether, we summarize that traditional text analysis pipelines fail to address in-
formation needs ad-hoc in an efficient and domain-robust manner. The hypothesis of
this book is that three problems must be solved in order to enable ad-hoc large-scale
text mining within search engines and big data analytics applications:
In this book, we consider the task of efficiently and effectively addressing ad-hoc
information needs in large-scale text mining. We contribute to this task by showing
how to make the design and execution of text analysis pipelines more intelligent.
Our approach relies on knowledge and information available for such pipelines.
As motivated in Sect. 1.2, the design, efficiency, and robustness of text analysis
pipelines depend on the realized text analysis processes. With this in mind, the
central research question underlying this book can be formulated as follows:
The distinction between knowledge and information is controversial and not al-
ways unambiguous (Rowley 2007). For our purposes, it suffices to follow the simple
view that knowledge is an interpretation of data, which is assumed to be true irre-
spective of the context (e.g., Apple is a company), while information is data, which
has been given meaning by a particular context (e.g., in the book at hand the term
“Apple” denotes a company). In this regard, knowledge can be understood as specified
beforehand, while information is inferred during processing. Now, when speaking
of “knowledge about a text analysis process”, we basically mean two kinds:
1. Knowledge about the text analysis task to be addressed, namely, the informa-
tion need at hand, expected properties of the input texts to be processed, as well
as efficiency and effectiveness criteria to be met.
2. Knowledge about the text analysis algorithms to be employed, namely, their
input and output information types, restrictions of their applicability, as well as
their expected efficiency and effectiveness.
The proposed overall approach of this book to enable ad-hoc large-scale text mining
relies on three core ideas, which we discuss in more detail below:
Fig. 1.5 Our overall approach to enable ad-hoc large-scale text mining: Each algorithm Ai∗ in
the automatically constructed ad-hoc large-scale text analysis pipeline Π∗ = ⟨A∗, π∗⟩ gets from
the input control only those portions of text for which its output is relevant. The schedule π∗ is
optimized in terms of efficiency, while the effectiveness of Π∗ is improved through an overall
analysis that produces the final output information.
available algorithms and the information need at hand, and we can optimize
their schedule based on information about their achieved efficiency and
the produced output.
2. Input control. We can automatically infer the portions of input texts that
need to be processed by each algorithm in a text analysis pipeline from the
information need at hand, the algorithm’s output information types, and
the output information produced so far.
3. Overall analysis. We can automatically improve the domain robustness of
specific algorithms in a text analysis pipeline by focusing on information
about the overall structure of the processed input texts within their analyses
while abstracting from their content.
Figure 1.5 illustrates how these ideas can be operationalized to replace the tradi-
tional predefined pipeline in the basic text analysis scenario. By that, Fig. 1.5 serves
as an overall view of all proposed approaches of this book. We detail the different
approaches in the following.
For ad-hoc information needs, we construct ad-hoc text analysis pipelines im-
mediately before analysis. To this end, we formalize the classical process-oriented
view of text analysis that is e.g. realized by the leading software frameworks for text
analysis, Apache UIMA8 and GATE:9 Each algorithm serves as an action that, if
applicable, transforms the state of a text (i.e., its input annotations) into another state
(extended by the algorithm’s output annotations). So, ad-hoc pipeline construction
means to find a viable sequence of actions, i.e., a planning problem (Russell and
Norvig 2009). We tackle this problem with partial order planning while considering
efficiency and effectiveness criteria (Wachsmuth et al. 2013a). Partial order planning
conforms to paradigms of ideal pipeline design and execution that we identified to
be always reasonable, e.g. lazy evaluation (Wachsmuth et al. 2011).
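The following sketch conveys this view in Python on a purely hypothetical algorithm repository: each algorithm is an action with preconditions (required types) and effects (produced types), and a pipeline is constructed by chaining backward from the information need and then linearizing the selected algorithms. This is a strong simplification of partial order planning, intended only to illustrate the idea; all names and types are assumptions.

```python
# Illustrative sketch of ad-hoc pipeline construction: algorithms are treated as
# planning actions with preconditions (input types) and effects (output types).
# Simple backward chaining is used here instead of full partial order planning.
ALGORITHM_REPOSITORY = {              # hypothetical repository; types are examples
    "SentenceSplitter":  (set(),             {"Sentence"}),
    "Tokenizer":         ({"Sentence"},      {"Token"}),
    "POSTagger":         ({"Token"},         {"POS"}),
    "EntityRecognizer":  ({"Token", "POS"},  {"Organization", "Time"}),
    "RelationExtractor": ({"Organization", "Time"}, {"Founded"}),
}

def construct_pipeline(information_need):
    """Select algorithms whose outputs cover all required information types."""
    goals, producible, schedule = set(information_need), set(), []
    while goals - producible:
        for name, (requires, produces) in ALGORITHM_REPOSITORY.items():
            if name not in schedule and produces & (goals - producible):
                schedule.append(name)
                producible |= produces
                goals |= requires          # the algorithm's inputs become sub-goals
                break
        else:
            raise ValueError("No algorithm produces the remaining goal types.")
    return linearize(schedule)

def linearize(selected):
    """Order the selected algorithms so every input type is produced before use."""
    ordered, available = [], set()
    while selected:
        for name in selected:
            requires, produces = ALGORITHM_REPOSITORY[name]
            if requires <= available:
                ordered.append(name)
                available |= produces
                selected.remove(name)
                break
        else:
            raise ValueError("Cyclic or unsatisfiable dependencies.")
    return ordered

if __name__ == "__main__":
    print(construct_pipeline({"Founded"}))
```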
Instead of directly handing over the whole input texts from one algorithm in a
pipeline to the next, we introduce a novel input control that manages the input to be
processed by each algorithm, as sketched in Fig. 1.5. Given the output information
of all algorithms applied so far, the input control determines and filters only those
portions of the current input text that may contain all information required to fulfill
the information need at hand.10 For this information-oriented view, we realize the
input control as a truth maintenance system (Russell and Norvig 2009), which models
the relevance of each portion of text as a propositional formula. Reasoning about all
formulas then enables algorithms to analyze only relevant portions of text. Thereby
we avoid unnecessary analyses and we can influence the efficiency-effectiveness
tradeoff of a pipeline (Wachsmuth et al. 2013c).
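A much-reduced sketch of the input control idea follows. Here, the relevance of a portion of text is simply the conjunction of information types that still have to be found in it, rather than a full propositional formula maintained by a truth maintenance system; all identifiers and types are invented for illustration.

```python
# Illustrative sketch of an input control (heavily simplified): a portion of text
# stays relevant only as long as all information types it still misses can be
# produced by the remaining algorithms. Irrelevant portions are filtered out.
class InputControl:
    def __init__(self, portions, required_types):
        # Each portion starts with the full conjunction of required types.
        self.open = {portion: set(required_types) for portion in portions}

    def update(self, algorithm_output):
        """Register which types an algorithm found in which portions."""
        for portion, found_types in algorithm_output.items():
            if portion in self.open:
                self.open[portion] -= found_types

    def filter(self, producible_types):
        """Keep only portions whose missing types can still all be produced."""
        self.open = {p: missing for p, missing in self.open.items()
                     if missing <= producible_types}
        return list(self.open)

if __name__ == "__main__":
    control = InputControl(["paragraph-1", "paragraph-2", "paragraph-3"],
                           required_types={"Organization", "Time"})
    # Hypothetical output of an organization recognizer:
    control.update({"paragraph-1": {"Organization"}, "paragraph-3": {"Organization"}})
    # Only Time can still be produced by later algorithms, so paragraph-2
    # (which still misses an Organization) is filtered out.
    print(control.filter(producible_types={"Time"}))
```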
Based on the information-oriented view, we next transform every pipeline into
a large-scale text analysis pipeline, meaning that we make it as run-time efficient
as possible. In particular, we found that a pipeline’s schedule strongly affects its
efficiency, since the run-time of each employed algorithm depends on the filtered
portions of text it processes. The filtered portions in turn result from the distribution
of relevant information in the input texts. Given the run-times on the filtered portions,
we apply dynamic programming (Cormen et al. 2009) to obtain an optimal schedule
(Wachsmuth and Stein 2012). In practice, these run-times can only be estimated for
the texts at hand. We thus propose to address scheduling with informed best-first
search (Russell and Norvig 2009), either in a greedy manner using estimations of the
algorithms’ run-times only or, for a more optimized scheduling, using information
from a sample of texts (Wachsmuth et al. 2013a).
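The following toy example conveys why scheduling matters for efficiency. It assumes, purely for illustration, that each algorithm has a fixed run-time per portion and a fixed selectivity (the share of portions its results leave relevant) and finds the cheapest admissible schedule by brute force; the book's approaches instead work with run-times measured on filtered portions and use dynamic programming or informed best-first search. All numbers and names below are made up.

```python
# Illustrative brute-force scheduling sketch: the cost of an algorithm is its
# run-time per portion times the share of portions that earlier filters let pass.
from itertools import permutations

ALGORITHMS = {
    # name: (run-time per portion in ms, selectivity = share of portions kept)
    "TimeRecognizer":           (0.5, 0.30),
    "OrganizationRecognizer":   (2.0, 0.20),
    "FoundedRelationExtractor": (8.0, 1.00),
}
DEPENDS_ON = {"FoundedRelationExtractor": {"TimeRecognizer", "OrganizationRecognizer"}}

def expected_runtime(schedule):
    """Expected run-time per input portion under an independence assumption."""
    total, remaining = 0.0, 1.0
    for name in schedule:
        runtime, selectivity = ALGORITHMS[name]
        total += runtime * remaining   # the algorithm only sees unfiltered portions
        remaining *= selectivity       # its results filter the portions further
    return total

def admissible(schedule):
    seen = set()
    for name in schedule:
        if not DEPENDS_ON.get(name, set()) <= seen:
            return False
        seen.add(name)
    return True

if __name__ == "__main__":
    best = min((s for s in permutations(ALGORITHMS) if admissible(s)),
               key=expected_runtime)
    print(best, round(expected_runtime(best), 3))
```

Under these assumed numbers, running the cheap and strongly filtering time recognizer first yields the lowest expected run-time, which is exactly the intuition behind optimized scheduling.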
Now, problems occur in case input texts are heterogeneous in the distribution of
relevant information, since such texts do not allow for accurate run-time estimations.
We quantify the impact of text heterogeneity on the efficiency of a pipeline in order
to estimate the optimization potential of scheduling. A solution is to perform an
adaptive scheduling that chooses a schedule depending on the text (Wachsmuth et al.
2013b). For this purpose, the characteristics of texts need to be mapped to the run-
times of pipelines. We induce such a mapping with self-supervised online learning,
i.e., by incrementally learning from self-generated training data obtained during
processing (Witten and Frank 2005; Banko et al. 2007). The scheduling approach
has implications for pipeline parallelization that we outline.
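A minimal sketch of the adaptive scheduling idea follows, under strong assumptions: texts are bucketed by a single crude characteristic (their length), and the run-time observed for each executed schedule is averaged per bucket as self-generated training data. A real implementation would use richer text characteristics and a proper online learner.

```python
# Illustrative sketch of adaptive scheduling with self-supervised online learning:
# observed run-times per schedule are averaged per coarse "text class", and the
# schedule with the lowest predicted run-time is chosen for each new text.
from collections import defaultdict
import random

SCHEDULES = ["time-first", "organization-first"]   # hypothetical schedule names

class AdaptiveScheduler:
    def __init__(self):
        # text class -> schedule -> [sum of run-times, count]
        self.stats = defaultdict(lambda: defaultdict(lambda: [0.0, 0]))

    def text_class(self, text):
        return "short" if len(text) < 500 else "long"

    def choose(self, text):
        cls = self.stats[self.text_class(text)]
        untried = [s for s in SCHEDULES if cls[s][1] == 0]
        if untried:                       # explore schedules without observations
            return random.choice(untried)
        return min(SCHEDULES, key=lambda s: cls[s][0] / cls[s][1])

    def update(self, text, schedule, measured_runtime):
        entry = self.stats[self.text_class(text)][schedule]
        entry[0] += measured_runtime      # self-generated training signal
        entry[1] += 1

if __name__ == "__main__":
    scheduler = AdaptiveScheduler()
    text = "Apple was founded in 1976. " * 10
    schedule = scheduler.choose(text)
    scheduler.update(text, schedule, measured_runtime=1.3)   # hypothetical value
    print(schedule, scheduler.choose(text))
```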
Finally, we present a new overall analysis that aims to improve domain robustness
by analyzing the overall structure of input texts while abstracting from their content.
As Fig. 1.5 depicts, the overall analysis is an alternative last algorithm in a pipeline.
Its structure-oriented view of text analysis specifically targets the classification of
10 Throughout this book, we assume that information needs are already given in a processable
form (defined later on). Accordingly, we will not tackle problems from the areas of query analysis
and user interface design related to information search (Hearst 2009).
Fig. 1.6 The three high-level contributions of this book: We present approaches (1) to automatically
design text analysis pipelines that optimally process input texts ad-hoc, (2) to optimize the run-time
efficiency of pipelines on all input texts, and (3) to improve the robustness of pipelines on input
texts from different domains.
argumentative texts. It is based on our observation from Wachsmuth et al. (2014b) that
the sequential flow of information in a text is often decisive in those text classification
tasks where analyzing content does not suffice (Lipka 2013). The overall analysis first
performs a supervised variant of clustering (Witten and Frank 2005) to statistically
learn common flow patterns of argumentative texts (Wachsmuth et al. 2014a). Then,
it uses the learned patterns as features for a more domain-robust classification. The
same patterns and their underlying information can be exploited to explain the results
of the analysis afterwards.
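The following sketch illustrates the notion of flow patterns with invented data: each text is reduced to the flow of its sentence-level sentiment scores, flows are length-normalized and averaged per group into patterns, and the distances of a new text's flow to these patterns serve as features. The book's overall analysis uses supervised clustering and more elaborate normalization; this sketch is only meant to make the idea tangible.

```python
# Illustrative sketch of flow patterns: a text's flow is the sequence of its
# sentence-level sentiment scores (-1, 0, 1); patterns are averaged flows.
def normalize(flow, length=4):
    """Resample a flow of sentence scores to a fixed number of positions."""
    return [flow[int(i * len(flow) / length)] for i in range(length)]

def flow_pattern(flows, length=4):
    """The average of a set of normalized flows (cf. the notation f* above)."""
    normalized = [normalize(f, length) for f in flows]
    return [sum(values) / len(values) for values in zip(*normalized)]

def pattern_features(flow, patterns, length=4):
    """Manhattan distance of a text's flow to each learned flow pattern."""
    norm = normalize(flow, length)
    return [sum(abs(a - b) for a, b in zip(norm, p)) for p in patterns]

if __name__ == "__main__":
    # Hypothetical flows of positive (+1), neutral (0), and negative (-1) sentences:
    positive_reviews = [[1, 1, -1, 1, 1], [1, 0, 1, 1]]
    negative_reviews = [[-1, 1, -1, -1], [0, -1, -1, -1, -1]]
    patterns = [flow_pattern(positive_reviews), flow_pattern(negative_reviews)]
    print(pattern_features([1, -1, 1, 1], patterns))
```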
We claim that our approach makes the design and execution of text analysis pipelines
more intelligent: Efficient solutions to text analysis tasks (i.e., pipelines) are found
and accomplished automatically based on human expert knowledge and informa-
tion perceived in the environment (i.e., the processing of texts). More precisely, we
contribute to the enablement of ad-hoc large-scale text mining in three respects:
Figure 1.6 shows how these high-level main contributions relate to the three core
ideas within our overall approach. In the following, we summarize the most important
findings described in this book for each main contribution.
This book presents, at its heart, findings that refer to the field of computer science.
In particular, the outlined main contributions largely deal with the development and
application of algorithms, especially artificial intelligence algorithms. Most of them
benefit the practical applicability of text mining in big data scenarios. Our main field
of application is computational linguistics. According to our motivation of improving
information search, some of the implications for this field are connected to central
concepts from information retrieval, such as information needs or filtering.
Concretely, our approaches to pipeline design and efficiency affect the information
extraction area in the first place. In many extraction tasks, huge amounts of text are
processed to find the tiny portions of text with relevant information (Sarawagi 2008).
Still, existing approaches waste much effort processing irrelevant portions. If they filter
at all, they do so only based on heuristics or vague statistical models (cf. Sect. 2.4 for details).
In contrast, our input control infers relevance formally and it is well-founded in the
theory of truth maintenance systems. We see the input control as a logical extension
of software frameworks like Apache UIMA or GATE.
The efficiency of information extraction has long been widely disregarded, but it has
been receiving increasing attention in recent years (Chiticariu et al. 2010b).
Unlike related approaches, we neither require lowering the effectiveness of extraction,
nor do we consider only rule-based extraction algorithms, as we do not change the
algorithms themselves at all. Thereby, our approach achieves a very broad applica-
bility in several types of text analysis tasks. It addresses an often overlooked means to
scale information extraction to large numbers of texts (Agichtein 2005).
In terms of domain robustness, we aim at argumentation-related text classification
tasks, like sentiment analysis. Our overall analysis improves effectiveness on texts
from domains unknown beforehand by considering the previously disregarded overall
structure of texts. While the ultimate goal of guaranteeing high-quality information
in ad-hoc large-scale text mining is far from being solved, we are confident that our
approach denotes an important step towards more intelligent text analysis.
In this regard, we also provide new insights into the pragmatics of computational
linguistics, i.e., the study of the relation between utterances and their context
(Jurafsky 2003). Here, the most important findings refer to our work on the argumen-
tation of a text. In particular, we statistically determine common patterns in the way
people structure their argumentation in argumentative texts. Additionally, we claim
that our model and quantification of the heterogeneity of texts constitutes a substan-
tial building block for a better understanding of the processing complexity of texts.
To allow for a continuation of our research, a verifiability of our claims, and a
reproducibility of our experiments, we have made most developed approaches freely
available in open-source software (cf. Appendix B). Moreover, we provide three
new text corpora for the study of different scientifically and industrially relevant
information extraction and text classification problems (cf. Appendix C).
In Fig. 1.7, we illustrate the organization of Chaps. 3–5. In each of these chapters,
we first develop an abstract solution to the respective problem. Then, we present
and evaluate practical approaches. The approaches rely on concrete models of text
analysis and/or are motivated by our own experimental analyses. We conclude each
main chapter with implications for the area of application.
Before, Chap. 2 provides the required background knowledge. First, we introduce
basic concepts and approaches of text mining relevant for our purposes (Sect. 2.1).
We point out the importance of text analysis processes and their realization through
pipelines in Sect. 2, while case studies that we resort to in examples and experiments
follow in Sect. 2.3. Section 2.4 then summarizes the state of the art.
As Fig. 1.7 shows, Chap. 3 deals with the automation of pipeline design. In
Sect. 3.1, we present paradigms of an ideal pipeline construction and execution.
On this basis, we formalize key concepts from Sect. 2.2 in a process-oriented view
of text analysis (Sect. 3.2) and then address ad-hoc pipeline construction (Sect. 3.3).
In Sect. 3.4, we develop an information-oriented view of text analysis, which can
be operationalized to achieve an optimal pipeline execution (Sect. 3.5). This view
provides new ways of trading efficiency for effectiveness (Sect. 3.6).
Next, we optimize pipeline efficiency in Chap. 4, starting with a formal solution to
the optimal scheduling of text analysis algorithms (Sect. 4.1). We analyze the impact
of the distribution of relevant information in Sect. 4.2, followed by our approach to
optimized scheduling (Sect. 4.3) that also requires the filtering view from Sect. 3.4.
An analysis of the heterogeneity of texts (Sect. 4.4) then motivates the need for
adaptive scheduling, which we approach in Sect. 4.5. Scheduling has implications
for pipeline parallelization, as we discuss in Sect. 4.6.
[Fig. 1.7 diagram contents, by chapter (3 Pipeline Design, 4 Pipeline Efficiency, 5 Pipeline Robustness) and layer:
Abstract solutions: ideal construction and execution (3.1), ideal scheduling (4.1), ideal domain independence (5.1).
Concrete models: process-oriented view (3.2), information-oriented view (3.4), structure-oriented view (5.2).
Experimental analyses: impact of relevant information (4.2), impact of heterogeneity (4.4), impact of overall structure (5.3).
Practical approaches: ad-hoc construction (3.3), optimal execution (3.5), optimized scheduling (4.3), adaptive scheduling (4.5), features for domain independence (5.4).
Implications: trading efficiency for effectiveness (3.6), parallelizing pipeline execution (4.6), explaining results (5.5).]
Fig. 1.7 The structure of this book, organized according to the book’s three main contributions.
The white boxes show short names of all sections of the three main chapters.
In the book at hand, the complete picture of our approach to enable ad-hoc large-
scale text mining is published for the first time. However, most of the main findings
described in the book have already been presented in peer-reviewed scientific papers
at renowned international conferences from the fields of computational linguistics
and information retrieval. An overview is given in Table 1.1. Moreover, some parts of
this book integrate content of four master’s theses that were written in the context of
the doctoral dissertation the book is based on. Most notably, the results of Rose (2012)
Table 1.1 Overview of peer-reviewed publications this book is based on. For each publication, the
short name of the venue and the number of pages are given as well as a short sketch of the topic
and the main sections of this book, in which content of the publication is reused.
I put my heart and my soul into my work, and have lost my mind
in the process.
– Vincent van Gogh
In this section, we explain all general foundations of text mining the book at hand
builds upon. After a brief outline of text mining, we organize the foundations along
the three main research fields related to text mining. The goal is not to provide a
formal and comprehensive introduction to these fields, but rather to give exactly
the information that is necessary to follow our discussion. At the end, we describe
how to develop and evaluate approaches to text analysis. Basic concepts of text
analysis processes are defined in Sect. 2.2, while specific concepts related to our
overall approach are directly defined where needed in Chaps. 3–5.1
Text mining deals with the automatic or semi-automatic discovery of new, previ-
ously unknown information of high quality from large numbers of unstructured
texts (Hearst 1999). Contrary to what is sometimes assumed, the types of information to
be inferred from the texts are usually specified manually beforehand, i.e., text mining
tackles given tasks. As introduced in Sect. 1.1, this commonly requires performing
three steps in sequence, each of which can be associated to one field (Ananiadou and
McNaught 2005):
1. Information retrieval. Gather input texts that are potentially relevant for the
given task.
2. Natural language processing. Analyze the input texts in order to identify and struc-
ture relevant information.2
3. Data mining. Discover patterns in the structured information that has been
inferred from the texts.
Hearst (1999) points out that the main aspects of text mining are actually the
same as those studied in empirical computational linguistics. Although focusing
on natural language processing, some of the problems computational linguistics is
concerned with are also addressed in information retrieval and data mining, such as
text classification or machine learning. In this book, we refer to all these aspects with
the general term text analysis (cf. Sect. 1.1). In the following, we look at the concepts
of the three fields that are important for our discussion of text analysis.
1 Notice that, throughout this book, we assume that the reader has a more or less graduate-level
background in computer science or similar.
2 Ananiadou and McNaught (2005) refer to the second step as information extraction. While we
agree that information extraction is often the important part of this step, also other techniques from
natural language processing play a role, as discussed later in this section.
Following Manning et al. (2008), the primary use case of information retrieval is to
search and obtain those texts from a large collection of unstructured texts that can
satisfy an information need, usually given in the form of a query. In ad-hoc web
search, such a query consists of a few keywords, but, in general, it may also be
given by a whole text, a logical expression, etc. An information retrieval application
assesses the relevance of all texts with respect to a query based on some similarity
measure. Afterwards, it ranks the texts by decreasing relevance or it filters only those
texts that are classified as potentially relevant (Manning et al. 2008).
Although the improvement of ad-hoc search denotes one of the main motivations
behind this book (cf. Chap. 1), we hardly consider the retrieval step of text mining,
since we focus on the inference of information from the potentially relevant texts,
as detailed in Sect. 2.2. Still, we borrow some techniques from information retrieval,
such as filtering or similarity measures. For this purpose, we require the following
concepts, which are associated with information retrieval rather than with text analysis.
Vectors. To determine the relevance of texts, many approaches map all texts and
queries into a vector space model (Manning et al. 2008). Such a model defines a
common vector representation x = (x1 , . . . , xk ), k ≥ 1, for all inputs where each
xi ∈ x formalizes an input property. A concrete input like a text D is then represented
by one value xi(D) for each xi . In web search, the standard way to represent texts and
queries is by the frequencies of words they contain from a set of (possibly hundreds of
thousands) words. Generally, any measurable property of an input can be formalized,
though, which becomes particularly relevant for tasks like text classification.
Similarity. Given a common representation, similarities between texts and queries
can be computed. Most word frequencies of a search query will often be 0. In case
they are of interest, a reasonable similarity measure is the cosine distance, which
puts emphasis on the properties of texts that actually occur (Manning et al. 2008). In
Chap. 5, we compute similarities of whole texts, where a zero does not always mean
the absence of a property. Such scenarios suggest other measures. In our experiments,
we use the Manhattan distance between two vectors x(1) and x(2) of length k (Cha
2007), which is defined as:
Manhattan distance(x(1), x(2)) = Σ_{i=1}^{k} |xi(1) − xi(2)|
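For illustration, a small Python sketch of both concepts follows, using a plain word-frequency representation and the Manhattan distance defined above; the two sample texts are invented for this example.

```python
# Illustrative sketch: representing two texts in a common word-frequency vector
# space and computing their Manhattan distance as defined above.
from collections import Counter

def vectorize(texts):
    """Map each text to a frequency vector over the shared vocabulary."""
    tokenized = [text.lower().split() for text in texts]
    vocabulary = sorted(set(word for tokens in tokenized for word in tokens))
    counts = [Counter(tokens) for tokens in tokenized]
    return [[c[word] for word in vocabulary] for c in counts], vocabulary

def manhattan_distance(x1, x2):
    return sum(abs(a - b) for a, b in zip(x1, x2))

if __name__ == "__main__":
    (x1, x2), vocab = vectorize(["apple founded the company",
                                 "the company apple was founded in 1976"])
    print(manhattan_distance(x1, x2))   # the texts differ in "was", "in", "1976" -> 3
```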
Indexing. While queries are typically stated ad-hoc, the key to efficient ad-hoc search
is that all texts in a given collection have been indexed before. A query is then
matched against the search index, thereby avoiding to process the actual texts during
search. Very sophisticated indexing approaches exist and are used in today’s web
search engines (Manning et al. 2008). In its basic form, a search index contains one
entry for every measured property. Each entry points to all texts that are relevant
with respect to the property. Some researchers have adapted indexing to information
Natural language processing covers algorithms and engineering issues for the under-
standing and generation of speech and human-readable text (Tsujii 2011). In the book
at hand, we concentrate on the analysis of text with the goal of deriving structured
information from unstructured texts.
In text analysis, algorithms are employed that, among others, infer lexical infor-
mation about the words in a text, syntactic information about the structure between
words, and semantic information about the meaning of words (Manning and Schütze
1999). Also, they may analyze the discourse and pragmatic level of a text (Jurafsky
and Martin 2009). In Chaps. 3–5, we use lexical and syntactic analyses as preprocess-
ing for information extraction and text classification. Information extraction targets
semantic information. Text classification may seek both semantic and pragmatic
information. To infer information of certain types from an input text, text analysis
algorithms apply rules or statistics, as we detail below.
Generally, natural language processing faces the problem of ambiguity, i.e., many
utterances of natural language allow for different interpretations. As a consequence,
all text analysis algorithms need to resolve ambiguities (Jurafsky and Martin 2009).
Without sufficient context, a correct analysis is hence often hard and can even be
impossible. For instance, the sentence “SHE’S AN APPLE FAN.” alone leaves it unde-
cidable whether it refers to a fruit or to a company.
Technically, natural language processing can be seen as the production of anno-
tations (Ferrucci and Lally 2004). An annotation marks a text or a span of text that
represents an instance of a particular type of information. We discuss the role of anno-
tations more extensively in Sect. 2.2, before we formalize the view of text analysis
as an annotation task in Chap. 3.
Lexical and Syntactic Analyses. For our purposes, we distinguish three types of
lexical and syntactic analyses: the segmentation of a text into single units, the
tagging of units, and the parsing of syntactic structure.
Mostly, the smallest text unit considered in natural language processing is a token,
denoting a word, a number, a symbol, or anything similar (Manning and Schütze
1999). Besides the tokenization of texts, we also refer to sentence splitting and
paragraph splitting as segmentation in this book. In terms of tagging, we look at
part-of-speech, meaning the categories of tokens like nouns or verbs, although much
more specific part-of-speech tags are used in practice (Jurafsky and Martin 2009).
Also, we perform lemmatization in some experiments to get the lemmas of tokens,
i.e., their dictionary forms, such as “be” in the case of “is” (Manning and Schütze 1999).
Finally, we use shallow parsing, called chunking (Jurafsky and Martin 2009), to
identify different types of phrases, and dependency parsing to infer the dependency
tree structure of sentences (Bohnet 2010). Appendix A provides details on all named
analyses and on the respective algorithms we rely on. The output of parsing is par-
ticularly important for information extraction.
Information Extraction. The basic semantic concept is a named or numeric entity
from the real world (Jurafsky and Martin 2009). Information extraction analyzes
usually unstructured texts in order to recognize references of such entities, relations
between entities, and events the entities participate in (Sarawagi 2008). In the classical
view of the Message Understanding Conferences, information extraction is
seen as a template filling task (Chinchor et al. 1993), where the goal is to fill entity slots
of relation or event templates with information from a collection or a stream of texts D.
The set of information types C to be recognized is often predefined, although some
recent approaches address this limitation (cf. Sect. 2.4). Both rule-based approaches,
e.g. based on regular expressions or lexicons, and statistical approaches, mostly based
on machine learning (see below), are applied in information extraction (Sarawagi
2008). The output is structured information that can be stored in databases or directly
displayed to the users (Cunningham 2006). As a matter of fact, information extraction
plays an important role in today’s database research (Chiticariu et al. 2010a), while
it has its origin in computational linguistics (Sarawagi 2008). In principle, the output
qualifies for being exploited in text mining applications, e.g. to provide relevant
information like Google in the example from Fig. 1.2 (Sect. 1.1). However, many
information types tend to be domain-specific and application-specific (Cunningham
2006), making their extraction cost-intensive. Moreover, while some types can be
extracted accurately, at least from high-quality texts of common languages (Ratinov
and Roth 2009), others still denote open challenges in current research (Ng 2010).
Information extraction often involves several subtasks, including coreference res-
olution, i.e., the identification of references that refer to the same entity (Cunningham
2006), and the normalization of entities and the like. In this book, we focus mainly
on the most central subtasks, namely, named and numeric entity recognition, binary
relation extraction, and event detection (Jurafsky and Martin 2009). Concrete analy-
ses and algorithms that realize the analyses are found in Appendix A. As an example,
Fig. 2.1(a) illustrates instances of different information types in a sample text. Some
refer to a relation of the type Founded(Organization, Time).
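To give a concrete impression of a rule-based extraction approach, the following minimal Python sketch matches instances of Founded(Organization, Time) with a single regular expression. Both the pattern and the sample sentence are simplifying assumptions for illustration; they do not correspond to the extraction algorithms from Appendix A.

import re

# Minimal sketch of rule-based relation extraction for Founded(Organization, Time).
# The pattern is a deliberately simple illustrative assumption.
PATTERN = re.compile(
    r"(?P<organization>[A-Z][\w&.]*(?:\s[A-Z][\w&.]*)*)"  # organization name candidate
    r"\s+was\s+founded\s+in\s+"                            # lexical cue of the relation
    r"(?P<time>\d{4})"                                     # time (year) candidate
)

def extract_founded(text):
    # Return all matched Founded(Organization, Time) instances in the text.
    return [(m.group("organization"), m.group("time")) for m in PATTERN.finditer(text)]

print(extract_founded("Google was founded in 1998 by Larry Page and Sergey Brin."))
# [('Google', '1998')]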
Fig. 2.1 a Illustration of an information extraction example: Extraction of a relation of the type
Founded(Organization, Time) from a sample text. b Illustration of a text classification example:
Classification of the topic and the sentiment polarity of a sample text.
Fig. 2.2 Illustration of a high-level view of data mining. Input data is represented as a set of
instances, from which a model is derived using machine learning. The model is then generalized to
infer new output information.
Data mining primarily aims at the inference of new information of specified types
from typically huge amounts of input data, already given in structured form (Witten
and Frank 2005). To address such a prediction problem, the data is first converted into
instances of a defined representation and then handed over to a machine learning
algorithm. The algorithm recognizes statistical patterns in the instances that are
relevant for the prediction problem. This process is called training. The found patterns
are then generalized, such that they can be applied to infer new information from
unseen data, generally referred to as prediction. In this regard, machine learning can
be seen as the technical basis of data mining applications (Witten and Frank 2005).
Figure 2.2 shows a high-level view of the outlined process.
3 Unlike us, some researchers do not distinguish between sentiment analysis and opinion mining,
but they use these two terms interchangeably (Pang and Lee 2008).
Data mining and text mining are related in two respects: (1) The structured output
information of text analysis serves as the input to machine learning, e.g. to train a
text classifier. (2) Many text analyses themselves rely on machine learning algo-
rithms to produce output information. Both respects are important in this book. In
the following, we summarize the basic concepts relevant for our purposes.4
Machine Learning. Machine learning describes the ability of an algorithm to learn
without being explicitly programmed (Samuel 1959). An algorithm can be said to
learn from data with respect to a given prediction problem and to some quality mea-
sure, if the measured prediction quality increases the more data is processed (Mitchell
1997).5 Machine learning aims at prediction problems whose target function,
which maps input data to output information, is unknown and which, thus, cannot
be (fully) solved by following hand-crafted rules. In the end, all non-trivial text
analysis tasks denote such prediction problems, even though many tasks have been
successfully tackled with rule-based approaches.6
A machine learning algorithm produces a model Y : x → C, which generalizes
patterns found in the input data in order to approximate the target function. Y defines
a mapping from represented data x to a target variable C, where C captures the type
of information sought for. In text analysis, the target variable may represent classes
of texts (e.g. topics or genres), types of annotations (e.g. part-of-speech tags or entity
types), etc. Since machine learning generalizes from examples, the learned prediction
of output information cannot be expected to be correct in all cases. Rather, the goal is
to find a model Y that is optimal with respect to a given quality measure (see below).
Besides the input data, the quality of Y depends on how the data is represented and
how the found patterns are generalized.
Representation. Similar to information retrieval, most machine learning algorithms
rely on a vector space model. In particular, the input data is represented by a set X
of feature vectors of the form x. x defines an ordered set of features, where each
feature x ∈ x denotes a measurable property of an input (Hastie et al. 2009). In text
mining, common features are e.g. the frequency of a particular word in a given text
or the shape of a word (say, capitalized or not). Representing input data means to
create a set of instances of x, such that each instance contains one feature value for
every feature in x.7 In many cases, hundreds or thousands of features are considered
in combination. They belong to different feature types, like bag-of-words where each
feature means the frequency of a word (Manning et al. 2008).
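For illustration only, the following minimal Python sketch maps texts to such feature vectors. The vocabulary, the sample texts, and the additional shape feature are assumptions made for the example.

from collections import Counter

# Minimal sketch of a feature representation: bag-of-words frequencies over a
# small vocabulary plus one shape feature (ratio of capitalized words).
vocabulary = ["hotel", "nice", "pool", "cold", "revenue"]

def to_feature_vector(text):
    words = text.lower().split()
    counts = Counter(words)
    bag_of_words = [counts[word] / len(words) for word in vocabulary]
    capitalized_ratio = sum(w[0].isupper() for w in text.split()) / len(words)
    return bag_of_words + [capitalized_ratio]

instances = [to_feature_vector("The hotel is nice but the pool is cold"),
             to_feature_vector("Revenue will rise next year")]
print(instances)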
4 Besides the references cited below, parts of the summary are inspired by the Coursera machine
learning course, https://www.coursera.org/course/ml (accessed on June 15, 2015).
5 A discussion of common quality measures follows at the end of this section.
6 The question for what text analysis tasks to prefer a rule-based approach over a machine learning [...]
7 [...] features are transformed, e.g. a feature with values "red", "green", and "blue" can be represented
by three 0/1-features, one for each value. All values are normalized to the same interval, namely
[0,1], which benefits learning (Witten and Frank 2005).
The feature representation of the input data governs what patterns can be found
during learning. As a consequence, the development of features, which predict a
given target variable C, is one of the most important and often most difficult steps in
machine learning.8 Although common feature types like bag-of-words help in many
text analysis tasks, the most discriminative features tend to require expert knowledge
about the task and input. Also, some features generalize worse than others, often
because they capture domain-specific properties, as we see in Chap. 5.
Generalization. As shown in Fig. 2.2, generalization refers to the inference of output
information from unseen data based on patterns captured in a learned model (Witten
and Frank 2005). As such, it is strongly connected to the used machine learning
algorithm. The training of such an algorithm based on a given set of instances explores
a large space of models, because most algorithms have a number of parameters. An
important decision in this regard is how much to bias the algorithm with respect to
the complexity of the model to be learned (Witten and Frank 2005). Simple models
(say, linear functions) induce a high bias, which may not fit the input data well, but
regularize noise in the data and, thus, tend to generalize well. Complex models (say,
high polynomials) can be fitted well to the data, but tend to generalize less. We come
back to this problem of fitting in Sect. 5.1.9
During training, a machine learning algorithm incrementally chooses a possi-
ble model and evaluates the model based on some cost function. The choice relies
on an optimization procedure; gradient descent, e.g., stepwise heads towards a local
minimum of the cost function by adapting the model to all input data until
convergence (Witten and Frank 2005). In large-scale scenarios, a variant called
stochastic gradient descent is often more suitable. Stochastic gradient descent repeatedly iterates
over all data instances in isolation, thereby being much faster while not guaranteeing
to find a local minimum (Zhang 2004). No deep understanding of the generalization
process is needed in this book, since we focus only on the question of how to address
text analysis tasks with existing machine learning algorithms in order to then select
an adequate one. What matters for us is the type of learning that can or should be
performed within the task at hand. Mainly, we consider two very prominent types in
this book, supervised learning and unsupervised learning.
Supervised Learning. In supervised learning, a machine learning algorithm derives
a model from known training data, i.e., from pairs of a data instance and the asso-
ciated correct output information (Witten and Frank 2005). The model can then be
used to predict output information for unknown data. The notion of being supervised
refers to the fact that the learning process is guided by examples of correct predic-
tions. In this book, we use supervised learning for both statistical classification and
statistical regression.
8 The concrete features of a feature type can often be chosen automatically based on input data,
as we do in our experiments, e.g. by taking only those words whose occurrence is above some
threshold. Thereby, useless features that would introduce noise are excluded.
9 Techniques like feature selection and dimensionality reduction, which aim to reduce the set of
considered features to improve generalizability and training efficiency among others (Hastie et al.
2009), are beyond the scope of this book.
Fig. 2.3 Illustration of supervised learning: a In classification, a decision boundary can be derived
from training instances with known classes (open circles and squares) based on their feature values,
here for x1 and x2 . The boundary decides the class of unknown instances. b In regression, a regression
model can be derived from training instances (represented by the feature x1 ) with known value for
the target variable C. The model decides the values of all other instances.
Classification describes the task to assign a data instance to the most likely of a set
of two or more predefined discrete classes (Witten and Frank 2005). In case of binary
classification, machine learning algorithms seek an optimal decision boundary
that separates the instances of two classes, as illustrated in Fig. 2.3(a). Multi-class
classification is handled through approaches like one-versus-all classification (Hastie
et al. 2009). The applications of classification in text mining are manifold. E.g.,
it denotes the standard approach to text classification (Sebastiani 2002) and it is
also often used to classify candidate relations between entities (Sarawagi 2008). In
all respective experiments below, we perform classification with a support vector
machine (Witten and Frank 2005). Support vector machines aim to maximize the
margin between the decision boundary and the training instances of each class. They
have been shown to often perform well (Meyer et al. 2003) while not being prone to
adapt to noise (Witten and Frank 2005).10
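For illustration only, the following minimal Python sketch mirrors Fig. 2.3(a) with a linear support vector machine from the scikit-learn library. The toy feature values and the use of scikit-learn are assumptions made for the example.

from sklearn.svm import LinearSVC

# Training instances with two features (x1, x2) and known classes.
X_train = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.3],   # class "circle"
           [0.8, 0.9], [0.9, 0.7], [0.7, 0.8]]   # class "square"
y_train = ["circle", "circle", "circle", "square", "square", "square"]

# A linear support vector machine maximizes the margin around the boundary.
classifier = LinearSVC()
classifier.fit(X_train, y_train)

# The learned decision boundary decides the class of unknown instances.
print(classifier.predict([[0.25, 0.2], [0.85, 0.8]]))
# ['circle' 'square']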
In case of regression, the task is to assign a given data instance to the most likely
value of a metric and continuous target variable (Witten and Frank 2005). The result
of learning is a regression model that can predict the target variable for arbitrary
instances (cf. Fig. 2.3(b)). We restrict our view to linear regression models, which
we apply in Chap. 4 to predict the run-times of pipelines. In our experiments, we
learn these models with stochastic gradient descent for efficiency purposes.
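For illustration only, the following minimal Python sketch learns such a linear regression model with stochastic gradient descent, iterating over the instances in isolation as described above. The toy data (a target value that grows with a single feature) is an assumption made for the example.

import random

# Instances with one feature (e.g. a simple text property) and a known target value.
data = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 7.8), (5.0, 10.1)]

w, b = 0.0, 0.0          # parameters of the linear model y = w * x + b
learning_rate = 0.01

for epoch in range(1000):
    random.shuffle(data)
    for x, y in data:    # stochastic: each instance is considered in isolation
        error = (w * x + b) - y
        w -= learning_rate * error * x   # gradient step for the weight
        b -= learning_rate * error       # gradient step for the bias

print(round(w, 2), round(b, 2))   # close to 2.0 and 0.0 on this toy data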
Unsupervised Learning. In contrast to supervised learning, unsupervised learn-
ing is only given data instances without output information. As a consequence, it
usually does not serve for predicting a target variable from an instance, but merely
for identifying the organization and association of input data (Hastie et al. 2009).
The most common technique in unsupervised learning is clustering, which groups a
set of instances into a possibly but not necessarily predefined number of clusters
(Witten and Frank 2005). Here, we consider only hard clusterings, where each
instance belongs to a single cluster that represents some class. Different from
10 Some existing text analysis algorithms that we employ rely on other classification algorithms,
though, such as decision trees or artificial neural networks (Witten and Frank 2005).
Fig. 2.4 Illustration of unsupervised learning: a Flat clustering groups a set of instances into a
(possibly predefined) number of clusters. b Hierarchical clustering creates a binary hierarchy tree
structure over the instances.
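For illustration only, the following minimal Python sketch produces a hard flat clustering as in Fig. 2.4(a) with the k-means algorithm from scikit-learn. The toy instances and the predefined number of clusters are assumptions made for the example.

from sklearn.cluster import KMeans

# Six instances with two features each, grouped into three clusters.
X = [[0.1, 0.2], [0.2, 0.1], [0.8, 0.9], [0.9, 0.8], [0.5, 0.9], [0.4, 0.8]]

clustering = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(clustering.labels_)   # each instance belongs to exactly one cluster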
While there are more learning types, like reinforcement learning, recommender
systems, or one-class classification, we do not apply them in this book and, so, omit
introducing them here for brevity.
11 Besides effectiveness and efficiency, we also investigate the robustness and intelligibility of text
analysis in Chap. 5. Further details are given there.
Fig. 2.5 Venn diagram showing the four sets that can be derived from the ground truth information
of some type in a collection of input texts and the output information of that type inferred from the
input texts by a text analysis approach.
The accuracy is an adequate measure, when all decisions are of equal importance.
This holds for many text classification tasks as well as for other text analysis tasks,
in which every portion of an input text is annotated and, thus, requires a decision,
such as in tokenization. In contrast, especially in information extraction tasks like
entity recognition, the output information usually covers only a small amount of
the processed input texts. As a consequence, high accuracy can be achieved by
simply producing no output information at all. Thus, accuracy is inadequate if the
true negatives are of low importance. Instead, it seems more suitable to measure
effectiveness in terms of the precision p and the recall r (Manning and Schütze 1999):
Precision quantifies the ratio of output information that is inferred correctly, while
recall refers to the ratio of all correct information that is inferred. In many cases,
however, achieving either high precision or high recall alone is as easy as it is useless. E.g.,
perfect recall can be obtained by producing all possible output information. If both
high precision and high recall are desired, their harmonic mean can be computed,
called the F1 -score (or F1 -measure), which rewards an equal balance between p
and r (van Rijsbergen 1979):
f1 = 2 · p · r / (p + r)
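For illustration only, the following minimal Python sketch computes the three measures from the counts of true positives, false positives, and false negatives; the example counts are an assumption.

def precision(tp, fp):
    return tp / (tp + fp) if tp + fp > 0 else 0.0

def recall(tp, fn):
    return tp / (tp + fn) if tp + fn > 0 else 0.0

def f1_score(tp, fp, fn):
    # Harmonic mean of precision and recall.
    p, r = precision(tp, fp), recall(tp, fn)
    return 2 * p * r / (p + r) if p + r > 0 else 0.0

# Example: 8 correct extractions, 2 spurious extractions, 4 missed instances.
print(precision(8, 2), recall(8, 4), round(f1_score(8, 2, 4), 3))
# 0.8 0.666... 0.727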
12 The development of statistical approaches benefits from a balanced dataset (see above). This
can be achieved through either undersampling minority classes or oversampling majority classes.
Where needed, we mostly perform the latter using random duplicates.
Fig. 2.6 Two ways of splitting a corpus for development and evaluation: a A training set is used
for development, a validation set for optimizing parameters, and a test set for evaluation. b Each
fold i out of n folds serves for evaluation in the i-th of n runs. All others are used for development.
In most cases, we split a given text corpus into a training set, a validation set, and
a test set, as illustrated in Fig. 2.6(a).13 After developing an approach on the training
set, the quality of different configurations of the approach (e.g. with different feature
vectors or learning parameters) is iteratively evaluated on the validation set. The
validation set thereby serves for optimizing the approach, while the approach adapts
to the validation set. The best configuration is then evaluated on the test set (also
referred to as the held-out set). A test set represents the unseen data. It serves for
estimating the quality of an approach in practical applications.14
The described method appears reasonable when each dataset is of sufficient size
and when the given split prevents bias that may compromise the representa-
tiveness of the respective corpus. In other cases, an alternative is to perform (strati-
fied) n-fold cross-validation (Witten and Frank 2005). In n-fold cross-validation, a
text corpus is split into n (e.g. 10) even folds, assuring that the distribution of the
target variable is similar in all folds. The development and evaluation then consist of
n runs, over which the measured quality of an approach is averaged. In each run i,
the i-th fold is used for evaluation and all others for development. Such a split is
shown in Fig. 2.6(b). We conduct according experiments once in Chap. 5.
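For illustration only, the following minimal Python sketch performs stratified n-fold cross-validation with scikit-learn. The toy instances, the classifier, and the number of folds are assumptions made for the example.

from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

X = [[0.1], [0.2], [0.3], [0.4], [0.6], [0.7], [0.8], [0.9]]
y = ["neg", "neg", "neg", "neg", "pos", "pos", "pos", "pos"]

scores = []
skf = StratifiedKFold(n_splits=4)   # similar class distribution in each fold
for train_index, test_index in skf.split(X, y):
    X_train = [X[i] for i in train_index]
    y_train = [y[i] for i in train_index]
    X_test = [X[i] for i in test_index]
    y_test = [y[i] for i in test_index]
    model = LinearSVC().fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))   # accuracy on the held-out fold

print(sum(scores) / len(scores))   # quality averaged over the n runs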
Comparison. The measured effectiveness and efficiency results of a text analysis
approach are usually compared to alternative ways of addressing the given task
in order to assess whether the results are good or bad. For many tasks, an upper-
bound ceiling of effectiveness is assumed to be the effectiveness a human would
achieve (Jurafsky and Martin 2009).15 For simplicity, effectiveness is thus often
measured with respect to the human-annotated ground truth. While there is no gen-
eral upper-bound efficiency ceiling, we see in the subsequent chapters that optimal
efficiency can mostly be determined in a given experiment setting. We call every
upper-bound ceiling of a quality measure the gold standard and we define the gold
standard accordingly where needed.
13 Many text corpora already provide an according corpus split, including most of those that we use
For interpretation, results are also checked whether they are significantly better
than some lower bound baseline (Jurafsky and Martin 2009). E.g., an accuracy of
40 % in a 5-class classification task may appear low, but it is still twice as good
as the accuracy of guessing. The standard way to determine lower bounds is to
compare an evaluated approach with one or more approaches that are trivial (like
guessing), standard (like a bag-of-words approach in text classification), state-of-
the-art or at least well known from the literature. We compare our approaches to
according baselines in all our experiments in Chaps. 3–5. In these experiments, we
mostly consider complex text analysis processes realized by pipelines of text analysis
algorithms, as presented next.
In Sect. 1.2, we have roughly outlined that text mining requires task-specific text
analysis processes with several classification, extraction, and similar steps. These
processes are realized by text analysis pipelines that infer output information from
input texts in order to satisfy a given information need. Since text analysis pipelines
are in the focus of all approaches proposed in this book, we now explain the outlined
concepts of text analysis more comprehensively and we illustrate them at the end.
Thereby, we define the starting point for all discussions in Chaps. 3–5.
As specified in Sect. 1.2, we consider tasks in which we are given input texts and an
information need to be addressed. The goal is to infer output information from the
input texts that is relevant with respect to the information need. Here, we detail basic
concepts behind such tasks. An extension of these concepts by quality criteria to be
met follows in Chap. 3 after discussing the optimality of text analysis pipelines.
Input Texts. In principle, the input we deal with in this book may be either given
in the form of a collection of texts or a stream of texts. The former denotes a set of
natural language texts {D1 , . . . , Dn }, n ≥ 1, usually compiled with a purpose, like a
text corpus (see above). With the latter, we refer to continuously incoming natural
language text data. We assume here that such data can be split into logical segments
D1 , D2 , . . . (technically, this is always possible). Given that the speed of processing
a stream can be chosen freely, we can then deal with collections and streams in the
same way except for the constraint that streaming data must be processed in the order
in which it arrives. We denote both a collection and a stream as D.
We see a single text D ∈ D as the atomic input unit in text analysis tasks. While no
general assumptions are made about the length, style, language, or other properties
of D, we largely restrict our view to fully unstructured texts, i.e., plain texts that
have no explicit structure aside from line breaks and comparable character-level
formattings. Although text mining may receive several types of documents as input,
such as HTML files in case of web applications, our restriction is not a limitation
but rather a focus: Most text analysis approaches work on plain text. If necessary,
some content extraction is, thus, usually performed in the beginning that converts
the documents into plain text (Gottron 2008). Besides, some of our approaches in
the subsequent chapters allow the input texts to already have annotations of a certain
set of zero or more information types C0 , which holds for many text corpora in
computational linguistics research (cf. Sect. 2.1).
Output Information. In Sect. 2.1, different information types have been mentioned,
e.g. tokens, part-of-speech tags, concrete types of entities and relations, certain text
classification schemes, etc. In general, an information type C = {c1 , c2 , . . .} denotes
the set of all pieces of information c ∈ C that represent a particular lexical, syntactic,
semantic, or pragmatic concept. We postpone a more exact definition of information
types to Chap. 3, where we formalize the expert knowledge for tackling text analysis
tasks automatically. A concrete information type is denoted with an upper-case term
in this book, such as the Token type or a relation type Founded. To signal that an
information type is part of an event or relation type, we append it to that type in lower
case, such as Token.lemma or Founded.time.
Now, in many tasks from information extraction and text classification, the goal
is to infer output information of a specific set of information types C from texts
or portions of texts. Here, we use the set notation as in propositional logic (Kleine
et al. 1999), i.e., a set C = {C1 , . . . , Ck }, k ≥ 1, can be understood as a conjunction
C1 ∧ . . . ∧ Ck . In case of Founded(Organization, Time) from Sect. 2.1, for example,
a text or a portion of text that contains an instance of this relation type must com-
prise an organization name and a time information as well as a representation of
a foundation relation between them. Hence, the relation type implicitly refers to a
conjunction Founded ∧ Founded.organization ∧ Founded.time, i.e., a set {Founded,
Founded.organization, Founded.time}.
Information Needs. Based on the notion of information types, we can define what
information is relevant with respect to an information need in that it helps to fulfill
the need. The goal of text mining is to infer new information of specified types
from a collection or a stream of input texts D (cf. Sect. 2.1). From a text analysis
perspective, addressing an information need hence means to return all instances
of a given set of information types C that are found in D. In this regard, C itself
can be seen as a specification of an information need, a single information need in
particular. Accordingly, a combination of k > 1 information needs (say, the desire
to get information of k = 2 relation types at the same time) refers to a disjunction
C1 ∨ . . . ∨ Ck . In practical text mining applications, parts of an information need
might be specified beforehand. E.g., Founded(Organization, “1998”) denotes the
request to extract all names of organizations founded in the year 1998.
We assume in this book that information needs are already given in a formal-
ized form. Consequently, we can concentrate on the text analysis processes required
to address information needs. Similar to information types, we actually formalize
information needs later on in Chap. 3.
baseline approaches derive features from the output of tokenization and part-of-
speech tagging only (Pang et al. 2002), while others e.g. also perform chunking,
and extract relations between recognized domain-specific terms (Yi et al. 2003).
Moreover, some text classification approaches rely on fine-grained information from
semantic and pragmatic text analyses, such as the sentiment analysis in our case
study ArguAna that we introduce in Sect. 2.3.
Realization. The complexity of common text analysis processes raises the question
of how to approach a text analysis task without losing one's mind in the process, like
van Gogh according to the introductory quote of this chapter. As the examples above
indicate, especially the dependencies between analysis steps are not always clear in
general (e.g. some entity recognition algorithms require part-of-speech tags, while
others do not). In addition, errors may propagate through the analysis steps, because
the output of one step serves as input to subsequent steps (Bangalore 2012). This
entails the danger of achieving limited overall effectiveness, although each single
analysis step works fine. A common approach to avoid error propagation is to perform
joint inference, where all or at least some steps are performed concurrently. Some
studies indicate that joint approaches can be more effective in tasks like information
extraction (cf. Sect. 2.4 for details).16
For our purposes, joint approaches entail limitations, though, because we seek
to realize task-specific processes ad-hoc for arbitrary information needs from text
analysis. Moreover, joint approaches tend to be computationally expensive (Poon
and Domingos 2007), since they explore larger search spaces emanating from com-
binations of information types. This can be problematic for the large-scale scenarios
we target at. Following Buschmann et al. (1996), our requirements suggest the resort
to a sequence of small analysis steps composed to address a task at hand. In par-
ticular, small analysis steps allow for an easy recombination and they simplify the
handling of interdependencies. Still, a joint approach may be used as a single step in
an according sequence. We employ a few joint approaches (e.g. the algorithm ene
described in Appendix A.1) in the experiments of this book. Now, we present the
text analysis pipelines that realize sequences of analysis steps.
Pipelines denote the standard approach to realize text analysis processes. Although
the application of pipelines is ubiquitous in natural language processing (Holling-
shead and Roark 2007), their design and execution are rarely defined formally. As
sketched in Sect. 1.2, a text analysis pipeline processes a collection or a stream of
input texts with a sequence of algorithms in order to stepwise produce a set of output
information types.17 We model a text analysis pipeline in the following way:18
16 A simple example is the interpretation of periods in tokenization and sentence splitting: Knowing
sentence boundaries simplifies the determination of tokens with periods like abbreviations, but
knowing the abbreviations also helps to determine sentence boundaries.
17 Some related work speaks about workflows rather than pipelines, such as (Shen et al. 2007). The
term workflow is more general, also covering cascades where the input can take different paths.
Indeed, such cascades are important in text analysis, e.g. when the sequence of algorithms to be
executed depends on the language of the input text. From an execution viewpoint, however, we can
see each taken path as a single pipeline in such cases.
18 While named differently, the way we represent pipelines and the algorithms they compose here
largely conforms to their realization in standard software frameworks for text analysis, like Apache
UIMA, http://uima.apache.org, accessed on June 15, 2015.
Fig. 2.7 Abstract view of executing a text analysis pipeline Π = (A1 , . . . , Am ) on a collection or a
stream of input texts D in order to produce a set of output information types C. For every text in D,
each algorithm Ai, 1 ≤ i ≤ m, adds instances of a set of information types Ci(out) to the instances of
all inferred information types C0 ∪ . . . ∪ Ci−1. This set is initialized with instances of a possibly empty
set of information types C0.
types inferred so far is given. The union of information types inferred by all algorithms
employed in Π is supposed to be a superset of the set of information types C, which
represents the information need to be addressed. So, we observe that the pipeline
controls the process of creating all information sought for.
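For illustration only, the following minimal Python sketch mimics this execution model: each algorithm reads the annotations inferred so far and adds instances of its output types. The two placeholder algorithms are assumptions made for the example and do not correspond to those from Appendix A.

# Two trivial placeholder algorithms; each adds instances of one information type.
def tokenizer(text, annotations):
    annotations["Token"] = text.split()

def year_recognizer(text, annotations):
    annotations["Time"] = [t for t in annotations["Token"] if t.isdigit()]

def run_pipeline(pipeline, texts, initial_types=None):
    # Execute a pipeline (A1, ..., Am) on a collection or stream of input texts D.
    results = []
    for text in texts:
        annotations = dict(initial_types or {})   # possibly empty initial set C0
        for algorithm in pipeline:                # each Ai adds its output types
            algorithm(text, annotations)
        results.append(annotations)
    return results

pipeline = (tokenizer, year_recognizer)
print(run_pipeline(pipeline, ["Google was founded in 1998"]))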
At the same time, the processed input texts themselves do not change at all
within the realized text analysis process, as emphasized in the upper part of Fig. 2.7.
I.e., each algorithm traditionally processes each input text completely. We present an
enhancement of such an execution in Chap. 3 after summarizing existing approaches
in Sect. 2.4. Before, we introduce the case studies we examine in order to evaluate
all of our approaches.
In this book, we aim to improve the design, efficiency, and robustness of the text
analysis pipelines defined in Sect. 2.2. All developed approaches are evaluated in
empirical experiments with text analysis tasks that refer to a selection of scientifically
and/or industrially relevant case studies. Some of these case studies are associated
to two of our research projects, InfexBA and ArguAna, whereas the others are
known from related research. We briefly outline all of them in this section.
InfexBA is a research project that was funded by the German Federal Ministry
of Education and Research (BMBF) from 2008 to 2010 under contract number
01IS08007A. The primary goal of InfexBA was to develop text mining applications
(a) Sample input text: "Cairo, August 25th 2010 -- Forecast on Egypt's Automobile industry [...] In the next five years, revenues will rise by 97% to US-$19.6 bn. [...]"
(b) Sample input text: "This hotel is pretty nice. The rooms are large and comfortable. The lobby is also very nice. The only problem with this hotel is the pool. It is very cold. I might go back here."
Fig. 2.8 Example for output information inferred from an input text to address the main information
needs in a the InfexBA project and b the ArguAna project.
for an automatic market analysis in the sense of a focused web search engine. Given
an organization or market name, the search engine retrieves a set of candidate web
pages, extracts, normalizes, and aggregates information about financial forecasts for
the given subject from the web pages, and visualizes the aggregated information.
More details about the project and its results can be found at http://infexba.upb.de.
Within the project, we proposed to perform text mining in a sequence of eight infor-
mation retrieval, natural language processing, and data mining stages (Wachsmuth
et al. 2010), but the focus was on those stages related to information extraction.
In particular, information extraction begins after converting all retrieved web pages
into plain texts and ends after normalization. We can view the set of these plain texts
as a collection of input texts D. One of the main tasks tackled in InfexBA was to
extract each revenue forecast for the given organization or market at a certain loca-
tion (including the date it was published) from D and to bring the time and money
information associated to the forecast into resolved and normalized form. We can
specify the underlying information need as follows:
Forecast(Revenue(Subject, Location, Resolved(Time), Resolved(Money), Date))
Given the market “automobile” as the subject, Fig. 2.8(a) exemplifies for one input
text what information is meant to satisfy the specified need. Some information is
directly found in the text (e.g. “Egypt”), some must be computed (e.g. the money
amount of “9.95 bn” at the end of 2010).
We refer to the InfexBA project in the majority of those experiments in Chaps. 3
and 4 that deal with pipeline design and optimization. For a focused discussion, we
evaluate different simplifications of the presented information need, though. One
such need, for instance, targets at all related pairs of time and money information
that belong to revenue forecasts. Accordingly, the text analysis pipelines that we
As InfexBA, ArguAna is a research project that was funded by the German Fed-
eral Ministry of Education and Research (BMBF). It ran from 2012
to 2014 under contract number 01IS11016A. The project aimed at the development
of novel text analysis algorithms for fine-grained opinion mining from customer
product reviews. In particular, a focus was on the analysis of the sequence of single
arguments in a review in order to capture and interpret the review’s overall argu-
mentation. From the results of a set of reviews, a text mining application can not
only infer collective opinions about different aspects of a product, but also provide a
precise classification of the sentiment of customers with respect to the product. More
information about ArguAna is given at http://www.arguana.com.
In the project, we developed a complex text analysis process to tackle the underly-
ing text classification and information extraction tasks: First, the body of each review
text from a collection of input texts D is segmented into its single subsentence-level
discourse units. Every unit is classified as being either an objective fact, a positive, or
a negative opinion. Discourse relations between the units are then extracted as well
as products and aspects the units are about, together with their attributes. Finally, a
sentiment score in the sense of the review’s overall rating is predicted. The output
information helps to address the following information need:
Opinion(Aspect, Attribute, Polarity) ∧ Sentiment.score
As for this information need, the disjunctions defined in Sect. 2.2 should not be
misunderstood in the sense that addressing one of the two connected conjunctions
suffices. Rather, it states that instances of either of them are relevant. Figure 2.8(b)
shows the output information for a sample hotel review. The sentiment score comes
from a scale between 1 (worst) and 5 (best).
We refer to the ArguAna project mostly in the evaluation of pipeline robustness
in Chap. 5. There, we omit the recognition of products, aspects, and attributes, but
we focus on text classification approaches based on the extracted facts, opinions, and
discourse relations. The remaining text analysis process is realized with the following
pipeline: ΠArguAna = (sse, sto2 , tpo1 , pdu, csb, csp, pdr, css). For details on the
algorithms, see Appendix A.
The experiments related to the ArguAna project are based on two English col-
lections of texts, consisting of reviews from the hotel domain and the film domain,
respectively. In particular, we rely on our own ArguAna TripAdvisor cor-
pus developed within the project (cf. Appendix C.2) as well as on the widely used
Sentiment Scale dataset (cf. Appendix C.4).
Most of the concrete text analysis tasks in this book are at least loosely connected to
the presented projects InfexBA and ArguAna. In some cases, though, we provide
complementary results obtained in other experiments in order to achieve more gen-
erality or to analyze the generalizability of our approaches. All noteworthy results
of this kind are associated to the following three text analysis tasks.
Genia Event Extraction. Genia denotes one of the main evaluation tasks of the
BioNLP Shared Task (Kim et al. 2011). While the latter deals with the general
question of how text mining can help to recognize changes of states of bio-molecules
described in the biomedical literature, the former specifically targets at the extrac-
tion of nine different event types that relate a number of proteins, other entities, or
other events. For instance, a Phosphorylation event refers to an entity of the Protein
type as well as to some information that denotes a binding site of the protein (Kim
et al. 2011). In the evaluation of automatic pipeline design in Chap. 3, we consider
the formal specifications of several entity recognition and event detection algorithms
that infer information types relevant in the Genia task.
Named Entity Recognition. A named entity is an entity that refers to a unique
concept from the real world. While numerous types of named entities exist, most
of them tend to be rather application-specific (Jurafsky and Martin 2009). Some
types, though, occur in diverse types of natural language texts, of which the most
common are person names, location names, and organization names. They have been
in the focus of the CoNLL-2003 shared task on named entity recognition (Tjong
et al. 2003). In Chap. 4, we analyze the distribution of the three entity types in
several text corpora from Appendix C in the context of influencing factors of pipeline
efficiency. There, we rely on a common sequence labeling approach to named entity
recognition (cf. Sect. 2.1), using the algorithm ene (cf. Appendix A) in the pipeline
Πene = (sse, sto2 , tpo1 , pch, ene).
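For illustration only, the following minimal Python sketch recognizes the three entity types with the pretrained model of the spaCy library; this serves merely as an example and is not the algorithm ene from Appendix A.

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Larry Page founded Google in Menlo Park.")

for entity in doc.ents:
    print(entity.text, entity.label_)   # e.g. PERSON, ORG, GPE (location)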
Language Function Analysis. Finally, we address the text classification task lan-
guage function analysis in this book. We introduced this task in Wachsmuth and Bujna
(2011). As argued there, every text can be seen as being predominantly expressive,
appellative, or informative. These language functions define an abstract classification
scheme, which can be understood as capturing a single aspect of genres (Wachsmuth
and Bujna 2011). In Chap. 5, we concretize the scheme for product-related texts in
order to then outline how much text classification depends on the domain of the input
texts. Moreover, in Chap. 3 we integrate clf in the information extraction pipelines
from InfexBA (see above). In particular, we employ clf to filter possibly relevant
candidate input texts, which can be seen as one of the most common applications of
text classification in text mining.
With the approaches developed in this book, we seek to enable the use of text analysis
pipelines for ad-hoc large-scale text mining (cf. Sect. 1.3). Several other approaches
have been proposed in the literature that tackle similar problems or that tackle differ-
ent problems but pursue similar goals. In this section, we survey the state of the art
in these respects, focusing on text analysis to a wide extent, and we stress how our
approaches extend the state of the art. From an abstract viewpoint, our discussion
follows the overall structure of this book. It reuses content from the related work
sections of most of our publications listed in Table 1.1 (Sect. 1.4).
As defined in Sect. 2.2, we consider the classic realization of a text analysis process
in the form of a pipeline, where each algorithm takes as input the output of all pre-
ceeding algorithms and produces further output. Pipelines represent the most widely
adopted text analysis approach (Bangalore 2012). The leading software frameworks
for text analysis, Apache UIMA and GATE, target at pipelines (cf. Sect. 1.3). Some
of our approaches assume that no analysis is performed by more than one algorithm
in a pipeline. This is usual, but not always the case (Whitelaw et al. 2008). As a con-
sequence, algorithms can never make up for errors of their predecessors, which may
limit the overall effectiveness of pipelines (Bangalore 2012). In addition, the task
dependency of effective text analysis algorithms and pipelines (cf. Sects. 2.1 and 2.2)
renders their use in the ad-hoc search scenarios we focus on problematic (Etzioni
2011). In the following, we describe the most important approaches to tackle these
problems, grouped under the topics joint inference, pipeline enhancement, and
task independence.
Joint Inference. We have already outlined joint inference as a way to avoid the
problem of error propagation in classic pipelines in Sect. 2.2. Joint approaches infer
different types of information at the same time, thereby mimicking the way humans
process and analyze texts (McCallum 2009). Among others, tasks like entity recog-
nition and relation extraction have been said to benefit from joint inference (Choi
et al. 2006). However, the possible gain of effectiveness comes at the cost of lower
efficiency and less reusability (cf. Sect. 2.2), which is why we do not target at joint
approaches in this book, but only integrate them when feasible.
Pipeline Enhancement. Other researchers have addressed the error propagation
through iterative or probabilistic pipelines. In case of the former, a pipeline is exe-
cuted repeatedly, such that the output of later algorithms in a pipeline can be used to
improve the output of earlier algorithms (Hollingshead and Roark 2007).19 In case
of the latter, a probability model is built based on different possible outputs of each
algorithm (Finkel et al. 2006) or on confidence values given for the outputs (Raman
et al. 2013). While these approaches provide reasonable enhancements of the clas-
sic pipeline architecture, they require modifications of the available algorithms and
partly also significantly reduce efficiency. Neither fits well with our motivation
of enabling ad-hoc large-scale text mining (cf. Sect. 1.1).
Task Independence. The mentioned approaches can improve the effectiveness of
text analysis. Still, they have to be designed for the concrete task at hand. For the
extraction of entities and relations, Banko et al. (2007) introduced open informa-
tion extraction to overcome such task dependency. Unlike traditional approaches for
predefined entity and relation types (Cunningham 2006), their system TextRunner
efficiently looks for general syntactic patterns (made up of verbs and certain part-
of-speech tags) that indicate relations. Instead of task-specific analyses, it requires
only a keyword-based query as input that allows identifying task-relevant relations.
While Cunningham (2006) argues that high effectiveness implies high specificity,
open information extraction targets at web-scale scenarios. There, precision can be
preferred over recall, which suggests the exploitation of redundancy in the output
information (Downey et al. 2005) and the resort to highly reliable extraction rules,
as in the subsequent system ReVerb (Fader et al. 2011).
Open information extraction denotes an important step towards the use of text
analysis in web search and big data analytics applications. Until today, however, it is
restricted to rather simple binary relation extraction tasks (Mesquita et al. 2013). In
contrast, we seek to be able to tackle arbitrary text analysis tasks, for which appro-
priate algorithms are available. With respect to pipelines, we address the problem of
task dependency in Chap. 3 through an automatic design of text analysis pipelines.
In Sect. 2.2, we have discussed that text analysis processes are mostly realized manu-
ally with regard to the information need to be addressed. Also, the resulting text analysis
approaches traditionally process all input texts completely. Not only Apache UIMA
and GATE themselves provide tool support for the construction and execution of
19 Iterative pipelines are to a certain extent related to compiler pipelines that include feedback
loops (Buschmann et al. 1996). There, results from later compiler stages (say, semantic analysis)
are used to resolve ambiguities in earlier stages (say, lexical analysis).
the filtering of complete texts at different positions in a pipeline impacts the effec-
tiveness in complex extraction tasks. Other researchers observe that also classifying
the relevance of sentences can help to improve effectiveness (Patwardhan and Riloff
2007; Jean-Louis et al. 2011). Nedellec et al. (2001) stress the importance of such
filtering for all extraction tasks where relevant information is sparse. According to
Stevenson (2007), a restriction to sentences may also limit effectiveness in event
detection tasks, though. While we use filtering to improve efficiency, we provide
evidence that our approach maintains effectiveness. Still, we allow specifying the
sizes of filtered portions to trade efficiency for effectiveness.
Filtering approaches for efficiency often target at complete texts, e.g. using fast
text classification (Stein et al. 2005) or querying approaches trained on texts with
the relations of interest (Agichtein and Gravano 2003). A technique that filters por-
tions of text is passage retrieval (cf. Sect. 2.1). While many text mining applications
do not incorporate filtering until today, passage retrieval is common where infor-
mation needs must be addressed in real-time, e.g. in question answering (Krikon
et al. 2012). Cardie et al. (2000) compare the benefit of statistical and linguistic
knowledge for filtering candidate passages, and Cui et al. (2005) propose a fuzzy
matching of questions and possibly relevant portions of text. Sarawagi (2008) sees
the efficient filtering of relevant portions of input texts as a main challenge of infor-
mation extraction in large-scale scenarios. She complains that existing techniques
are still restricted to hand-coded heuristics. Common heuristics aim for high recall
in order not to miss relevant information later on, whereas precision can be preferred
on large collections of texts under the assumption that relevant information appears
redundantly (Agichtein 2005).
Different from all the outlined approaches, our filtering approach does not predict
relevance based on vague models derived from statistics or hand-crafted rules. In
contrast, our approach infers the relevant portions of an input text formally from
the currently available information. Moreover, we discuss in Chap. 3 that the input
control can be integrated with common filtering approaches. At the same time, it
does not prevent most other approaches from improving the efficiency of text analysis.
Efficiency has always been a main aspect of algorithm research (Cormen et al. 2009).
For a long time, most rewarded research on text analysis focused on effectiveness as
did the leading evaluation tracks, such as the Message Understanding Confer-
ences (Chinchor et al. 1993) or the CoNLL shared task. In the latter, efficiency
has at least sometimes been an optional evaluation criterion (Hajič et al. 2009). In
times of big data, however, efficiency is getting increasing attention in both research
and industry (Chiticariu et al. 2010b). While the filtering techniques from above
denote one way to improve efficiency, the filtered texts or portions of texts still
often run through a process with many expensive analysis steps (Sarawagi 2008).
Other techniques address this process, ranging from efficient algorithms over an
optimization through scheduling to indexing and parallelization.
Efficient Algorithms. Efficient algorithms have been developed for several text
analyses. For instance, Al-Rfou’ and Skiena (2012) present how to apply simple
heuristics and caching mechanisms in order to increase the velocity of segmentation
and tagging (cf. Sect. 2.1). Complex syntactic analyses like dependency parsing can
be approached in linear time by processing input texts from left to right only (Nivre
2003). Bohnet and Kuhn (2012) show how to integrate the knowledge of deeper
analyses in such transition-based parsing while still achieving only quadratic com-
plexity in the worst case. van Noord (2009) trades parsing efficiency for effectiveness
by learning a heuristic filtering of useful parses. For entity recognition, Ratinov and
Roth (2009) demonstrate that a greedy search (Russell and Norvig 2009) can com-
pete with a more exact sequence labeling (cf. Sect. 2.1). Others offer evidence that
simple patterns based on words and part-of-speech tags suffice for relation extraction,
when given enough data (Pantel et al. 2004). In text classification tasks like genre
identification, efficiently computable features are best practice (Stein et al. 2010).
Also, the feature computation itself can be sped up through unicode conversion and
string hash computations (Forman and Kirshenbaum 2008).
All these approaches aim to improve the efficiency of single text analyses, mostly
at the cost of some effectiveness. We do not compete with these approaches but
rather complement them, since we investigate how to improve pipelines that realize
complete processes consisting of different text analyses. In particular, we optimize
the efficiency of pipelines without compromising effectiveness through scheduling.
Scheduling. Some approaches related to text mining optimally schedule different
algorithms for the same analysis. For instance, Stoyanov and Eisner (2012) effec-
tively resolve coreferences by beginning with the easy cases, and Hagen et al. (2011)
efficiently detect sessions of search queries with the same information need by begin-
ning with the fastest detection steps. The ordering in which information is sought
for can also have a big influence on the run-time of text analysis (Sarawagi 2008). In
Chap. 4, we seize on this idea where we optimize the schedules of pipelines that filter
only relevant portions of texts. However, the optimal schedule is input-dependent,
as has been analyzed by Wang et al. (2011) for rule-based information extraction.
Similar to the authors, we process samples of input texts in order to estimate the
efficiency of different schedules.
In this regard, our research is in line with approaches in the context of the above-
mentioned SystemT. Concretely, Shen et al. (2007) and Doan et al. (2009) exploit
dependencies and distances between relevant text regions to optimize the schedules of
declarative information extraction approaches, yielding efficiency gains of about one
order of magnitude. Others obtain comparable results through optimization strategies
such as the integration of analysis steps (Reiss et al. 2008).
In these works, the authors provide only heuristic hints on the reasons behind their
empirical results. While some algebraic foundations of SystemT are established in
Chiticariu et al. (2010a), these foundations again reveal the limitation of declarative
information extraction, i.e., its restriction to rule-based text analysis. In contrast, we
approach scheduling for arbitrary sets of text analysis algorithms. While we achieve
similar gains as SystemT through an optimized scheduling, our adaptive scheduling
approach is, to our knowledge, the first that maintains efficiency on heterogeneous
input texts. In addition, we show that the theoretically optimal schedule can be found
with dynamic programming (Cormen et al. 2009) based on the run-times and filtered
portions of text of the employed algorithms.
In the database community, dynamic programming has been used for many years
to optimize the efficiency of join operations (Selinger et al. 1979). However, the
problem of filtering relevant portions of text for an information need corresponds
to processing and-conditioned queries (cf. Sect. 2.2). Such queries select those
tuples of a database table whose values fulfill some attribute conjunction, as e.g.
in SELECT * FROM forecasts WHERE (time>2011 AND time<2015 AND
organization=IBM). Different from text analysis, the optimal schedule for
an and-conditioned query is obtained by ordering the involved attribute tests (e.g.
time>2011) according to the numbers of expected matches (Ioannidis 1997), i.e.,
without having to consider algorithm run-times.
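For illustration only, the following minimal Python sketch contrasts the two settings: it orders filter-like analysis steps with a greedy heuristic that considers both run-times and selectivities. The example values and the ranking rule are assumptions made here; Chap. 4 determines the optimal schedule with dynamic programming instead.

# Steps given as (name, run-time per portion in ms, fraction of portions kept).
steps = [
    ("time_recognizer", 0.4, 0.30),
    ("money_recognizer", 0.6, 0.20),
    ("relation_extractor", 5.0, 0.05),
]

def greedy_schedule(steps):
    # Heuristic: cheap and selective steps first (cost per filtered-out portion).
    return sorted(steps, key=lambda s: s[1] / (1.0 - s[2]))

def expected_runtime(schedule):
    # Expected run-time per input portion, assuming independent filter decisions.
    total, remaining = 0.0, 1.0
    for _, runtime, kept in schedule:
        total += remaining * runtime
        remaining *= kept
    return total

schedule = greedy_schedule(steps)
print([name for name, _, _ in schedule], round(expected_runtime(schedule), 3))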
Indexing. An alternative to optimizing the efficiency of text analysis is to largely
avoid the need for efficient analyses by indexing relevant information for each input
text beforehand (cf. Sect. 2.1). For instance, Cafarella et al. (2005) have presented the
KnowItNow system, which builds specialized index structures using the output of
information extraction algorithms. Their approach has then been adopted in the open
information extraction systems discussed above. Also, the Google Knowledge
Graph is operationalized in an index-like manner as far as known.20
In the best case, indexing renders text analysis unnecessary when addressing
information needs (Agichtein 2005). In the database analogy from above, the run-
times of the tests (that correspond to the text analysis algorithms) drop out then.
By that, indexing is particularly helpful in scenarios like ad-hoc search. However, it
naturally applies only to anticipated information needs and to input texts that can be
preprocessed beforehand. Both cannot be assumed in the tasks that we consider in
this book (cf. Sect. 1.2).
Parallelization. With the goal of efficiency finally arises the topic of parallelization.
As discussed, we concentrate on typical text analysis algorithms and pipelines, which
operate over each input text independently, making many parallelization techniques
easily applicable (Agichtein 2005). This might be the reason for the limited literature
on parallel text analysis, despite the importance of parallelization for practical text
mining applications. Here, we focus on process-related approaches as opposed to
distributed memory management (Dunlavy et al. 2010) or algorithm schemes like
MapReduce for text analysis (Lués and de Matos 2009).21
Text analysis can be parallelized on various levels: Different algorithms may run
distributed, both to increase the load of pipelines (Ramamoorthy and Li 1977) and
to parallelize independent analyses. Pokkunuri et al. (2011) run different pipelines
at the same time, and Dill et al. (2003) report on the parallelization of different algo-
rithms. The two latter do not allow interactions between the parallelized steps, while
others also consider synchronization (Egner et al. 2007). A deep analysis of parallel
scheduling strategies was performed by Zhang (2010). Apart from these, different
texts can be processed in parallel (Gruhl et al. 2004), and the execution of analy-
sis steps like parsing is commonly parallelized for different portions of text (Bohnet
2010). Kalyanpur et al. (2011) even run different pipelines on the same text in parallel
in order to provide results as fast as possible in ad-hoc question answering.
At the end of Chap. 4, we see that input-based parallelization is always applicable
to the pipelines that we employ. The same holds for the majority of other approaches,
as discussed there. Because of filtering, synchronization entails new challenges with
respect to our approaches, though.
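For illustration only, the following minimal Python sketch shows such input-based parallelization: different texts are processed in parallel, each by its own pipeline run. The analyze function is a trivial placeholder assumed here in place of a complete text analysis pipeline.

from concurrent.futures import ProcessPoolExecutor

def analyze(text):
    # Placeholder for executing a complete text analysis pipeline on one text.
    return {"text": text, "tokens": text.split()}

if __name__ == "__main__":
    texts = ["This hotel is pretty nice.", "Revenues will rise by 97%."]
    with ProcessPoolExecutor(max_workers=2) as executor:
        results = list(executor.map(analyze, texts))
    print(results)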
the latter (Daumé and Marcu 2006). Also, structural correspondences can be learned
between domains (Blitzer et al. 2007; Prettenhofer and Stein 2011). In particular,
domain-specific features are aligned based on a few domain-independent features,
e.g. “Stay away!” in the hotel domain might have a similar meaning as “Read the
book!” in the film domain.
Domain adaptation, however, does not really apply to ad-hoc search and simi-
lar applications, where it is not possible to access texts from all target domains in
advance. This also excludes the approach of Gupta and Sarawagi (2009) who derive
domain-independent features from a comparison of the set of all unknown target
texts to the set of known source texts.
Domain Independence. Since domains are often characterized by content words
and the like (cf. Chap. 5 for details), most approaches that explicitly aim for domain
independence try to abstract from content. Glorot et al. (2011), for instance, argue
that higher-level intermediate concepts obtained through the non-linear input trans-
formations of deep learning help in cross-domain sentiment analysis of reviews.
While we evaluate domain robustness for the same task, we do not presume a certain
type of machine learning algorithms. Rather, we work on the features to be learned.
Lipka (2013) observes that style features like character trigrams serve the robustness
of text quality assessment. Similar results are reported for authorship attribution in
Sapkota et al. (2014). The authors reveal the benefit of mixed-domain training sets
for developing robust text analysis algorithms.
Some experiments that we perform in Chap. 5 suggest that style features are
still limited in their generalizability. We therefore propose features that model the
structure of texts. This resembles the idea of open information extraction, which
avoids the resort to any domain-dependent features, but captures only generally
valid syntactic patterns in sentences (see above). However, we seek domain
independence in tasks where complete texts have to be classified. For authorship
attribution, Choi (2011) provides evidence that structure-based features like function
word n-grams achieve high effectiveness across domains. We go one step further by
investigating the argumentation structure of texts.
Argumentation Structure. Argumentation is studied in various disciplines, such
as logic, philosophy, and artificial intelligence. We consider it from the linguistics
perspective, where it is pragmatically viewed as a regulated sequence of speech or
text (Walton and Godden 2006). The purpose of argumentation is to provide persua-
sive arguments for or against a decision or claim, where each argument itself can be
seen as a claim with some evidence. Following the pioneering model of Toulmin (1958),
the structure of an argumentation relates a claim to facts and warrants that are justified
by backings or countered by rebuttals. Most work in the emerging research area of
argumentation mining relies on this or similar models of argumentation (Habernal
et al. 2014). Concretely, argumentation mining analyzes natural language texts in
order to detect different types of arguments as well as their interactions (Mochales
and Moens 2011).
Within our approach to robustness, we focus on texts that comprise a monological
and positional argumentation, like reviews, essays, or scientific articles. In such a
text, a single author collates and structures a choice of facts, pros, and cons in
order to persuade the intended recipients about his or her conclusion (Besnard and
Hunter 2008). Unlike argumentative zoning (Teufel et al. 2009), which classifies
segments of scientific articles according to their argumentative functions, we aim to
find argumentation patterns in these texts that help to solve text classification tasks.
For this purpose, we develop a shallow model of argumentation structure in Chap. 5.
Information Structure. Our model captures sequences of task-specific information
in the units of a text as well as relations between them. By that, it is connected
to information structure, which refers to the way information is packaged in a text
(Lambrecht 1994). Different from approaches like that of Bohnet et al. (2013), however,
we do not analyze the abstract information structure within sentences. Rather, we
look for patterns of how information is composed in whole texts (Gylling 2013).
In sentiment-related tasks, for instance, we claim that the sequence of subjectivities
and polarities in the facts and opinions of a text represents the argumentation of
the text. While Mao and Lebanon (2007) have already investigated such sequences,
they have analyzed the positions in the sequences only separately (cf. Chap. 5 for
details). In contrast, we develop an approach that learns patterns in the complete
sequences found in texts, thereby capturing the overall structure of the texts. To the
best of our knowledge, no text analysis approach to capture overall structure has been
published before.
Discourse Structure. The information structure that we consider is based on the
discourse structure of a text (Gylling 2013). Discourse structure refers to organi-
zational and functional relations between the different parts of a text (Mann and
Thompson 1988), as presented in Chap. 5. There, we reveal that patterns also exist
in the sequences of discourse relations that e.g. co-occur with certain sentiment.
Gurevych (2014b) highlight the close connection between discourse structure and
argumentation, while Ó Séaghdha and Teufel (2014) point out the topic indepen-
dence of discourse structure. The benefit of discourse structure for sentiment analy-
sis, especially in combination with opinion polarities, has been indicated in recent
publications (Villalba and Saint-Dizier 2012; Chenlo et al. 2014). We use corresponding
features as baselines in our domain robustness experiments.
User Acceptance. Even a robust text mining application will occasionally output erroneous
results. If users do not understand the reasons behind them, their acceptance of such
an application may be limited (Lim and Dey 2009). While much attention is paid to the
transparency of results in some technologies related to text mining, such as recommender
systems (Sinha and Swearingen 2002), corresponding research for text analysis
is limited. We consider the explanation of text classification, which traditionally
outputs only a class label, possibly extended by some probability estimate (Manning
et al. 2008). Alvarez and Martin (2009) present an explanation approach to general
supervised classification that puts the decision boundary in the focus (cf. Sect. 2.1).
Kulesza et al. (2011) visualize the internal logic of a text classifier, and Gabrilovich
and Markovitch (2007) stress the understandability of features that correspond
Fig. 3.1 Abstract view of the overall approach of this book (cf. Fig. 1.5). Sections 3.1–3.3 discuss the automatic design of ad-hoc text analysis pipelines.
In this section, we formally develop the notion of optimal text analysis pipelines.
Then, we introduce generic paradigms of constructing and executing such pipelines
in ad-hoc text analysis tasks. The descriptions of the paradigms and a subsequent
case study of their impact are based on and partly reuse content from Wachsmuth et al.
(2011). Figure 3.1 highlights the contribution of this section as well as of the two
subsequent sections to the overall approach of this book. Concretely, this section
contains a great deal of the theory behind ad-hoc large-scale text analysis pipelines,
which will be completed by the optimal solution to pipeline scheduling in Sect. 4.1.
The term “optimal” always relates to some measure of quality. Informally, a text
analysis pipeline can be called optimal if it achieves a higher quality in what it does
than any other text analysis pipeline. Accordingly, the optimality is associated with the
tackled text analysis task, i.e., to a particular information need C and a collection or
a stream of input texts D.
When we speak of the quality of a pipeline in this book, we refer to the effective-
ness of the pipeline’s results with respect to C and to the (run-time) efficiency of its
execution on D. Both can be quantified in terms of the quality criteria introduced in
Sect. 2.1, which provide the basis for defining optimality. As soon as more than one
criterion is considered, finding an optimal pipeline becomes a multi-criteria optimiza-
tion problem (Marler and Arora 2004): Usually, more effective pipelines are less effi-
cient and vice versa, because, in principle, higher effectiveness implies deeper and,
thus, more expensive analyses (cf. Sect. 2.4). Sometimes, the optimal pipeline is the
most efficient one under all most effective ones. Sometimes, the opposite holds, and
sometimes, there may also be a reasonable weighting of quality criteria. In general,
some quality function Q is required that specifies how to compute the quality of a
pipeline from the pipeline’s measured effectiveness and efficiency in the given text
analysis task. Without loss of generality, we assume here that a higher value for Q
means a higher quality. The notion of Q implies how to define pipeline optimality:
Pipeline Optimality. Let D be a collection or a stream of texts and let C be an
information need. Further, let Π = {Π1 , . . . , Π|Π| } be the set of all text analysis
pipelines for C on D. Then, Π ∗ ∈ Π is optimal for C on D with respect to a quality
function Q if and only if the following holds:
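In a minimal form, and assuming that Q assigns a single comparable value to each pipeline for C on D, the condition can be sketched as:

\[ \forall\, \Pi_i \in \Pi:\quad Q(\Pi^*) \;\geq\; Q(\Pi_i) \]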
Now, the question is how to design an optimal pipeline Π ∗ for a given information
need C and a collection or a stream of input texts D. Our focus is realizing complete
text analysis processes rather than single text analyses. Therefore, we consider the
question for a universe Ω where the set AΩ of all available algorithms is prede-
fined (cf. Sect. 2.2). Under this premise, the quality of a pipeline follows only from
its construction and its execution.
As presented in Sect. 2.2, the design style of a pipeline Π = ⟨A, π⟩ is fixed,
consisting of a sequence of algorithms where the output of one algorithm is the input
of the next. Consequently, pipeline construction means the selection of an algorithm
set A from AΩ that can address C on D as well as the definition of a schedule π
of the algorithms in A. Similarly, we use the term pipeline execution to refer to the
application of a pipeline’s algorithms to the texts in D and to its production of output
information of the types in C. While the process of producing output from an input is
defined within an algorithm, the execution can be influenced by controlling what part
of the input is processed by each algorithm. As a matter of fact, pipeline optimality
follows from an optimal selection and scheduling of algorithms as well as from an
optimal control of the input of each selected algorithm.
The dependency of optimality on a specified quality function Q suggests that, in
general, there is not a single pipeline that is always optimal for a given text analysis
task. However, one prerequisite of optimality is to ensure that the respective pipeline
behaves correctly. Since it is not generally possible to design pipelines that achieve
maximum effectiveness (cf. Sect. 2.1), we speak of the validity of a pipeline if it
tackles the task it is meant to solve:
\[ C \;\subseteq\; C_0 \,\cup\, \bigcup_{i=1}^{m} C_i^{(out)} \tag{3.2} \]

\[ \forall A_i \in (A_1, \ldots, A_m):\quad C_i^{(in)} \;\subseteq\; C_0 \,\cup\, \bigcup_{j=1}^{i-1} C_j^{(out)} \tag{3.3} \]
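For illustration, the following Java sketch checks the two conditions for a scheduled sequence of algorithms, each reduced to its sets of input and output information types; the class and method names are only illustrative, not part of the approach.

import java.util.*;

// Illustrative sketch: an algorithm reduced to its input and output information types.
class Algorithm {
    final Set<String> inputTypes;
    final Set<String> outputTypes;
    Algorithm(Set<String> in, Set<String> out) { inputTypes = in; outputTypes = out; }
}

class ValidityCheck {
    // Completeness (cf. Eq. 3.2): C0 and the outputs of the algorithms cover the information need.
    static boolean isComplete(List<Algorithm> pipeline, Set<String> c0, Set<String> need) {
        Set<String> covered = new HashSet<>(c0);
        for (Algorithm a : pipeline) covered.addAll(a.outputTypes);
        return covered.containsAll(need);
    }

    // Admissibility (cf. Eq. 3.3): each algorithm finds its inputs in C0 or among the outputs of its predecessors.
    static boolean isAdmissible(List<Algorithm> pipeline, Set<String> c0) {
        Set<String> available = new HashSet<>(c0);
        for (Algorithm a : pipeline) {
            if (!available.containsAll(a.inputTypes)) return false;
            available.addAll(a.outputTypes);
        }
        return true;
    }
}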
A complete algorithm set does not guarantee that an admissible schedule exists,
since it may yield circular or unfulfillable dependencies. So, both properties are
necessary for validity. Only valid pipelines allow the employed algorithms to produce
output information in the way they are supposed to do, which is why we restrict
our view to such pipelines throughout this book. Admissibility has an important
implication, which can be exploited during pipeline construction and execution:
Given that no information type is output by more than one algorithm of an algorithm
set A, all admissible pipelines based on A achieve the same effectiveness, irrespective
of the tackled text analysis task.1
We come back to this implication when we prove the correctness of our solution to
optimal scheduling in Sect. 4.1. The intuition is that, under admissibility, the schedule
of any two algorithms is only variable if neither depends on the other. In this case,
applying the algorithms in sequence is a commutative operation, which leads to the
same result irrespective of the schedule. In our project InfexBA (cf. Sect. 2.3), for
instance, we extracted relations between time and money entities from sentences.
No matter which entity type is recognized first, relation extraction must take place
only on those sentences that contain both a time entity and a money entity.
1 The limitation to pipelines with only one algorithm for each information type could be dropped by extending the definition of admissibility, which we leave out here for simplicity. Admissibility would then require that an algorithm Ai ∈ A with required input information types Ci(in) is not scheduled before any algorithm Aj for which Cj(out) ∩ Ci(in) ≠ ∅ holds.
Fig. 3.2 The impact of the selection and the schedule of the algorithms in a text analysis pipeline:
a Selecting a more effective algorithm set improves a pipeline’s effectiveness, but it also takes more
run-time. b Scheduling can improve the pipeline’s run-time without impairing its effectiveness.
What we get from admissibility is that we can subdivide the problem of finding
an optimal pipeline Π ∗ = A∗ , π ∗ into two subproblems. The first subproblem is to
select an algorithm set A∗ that best matches the efficiency-effectiveness tradeoff to
be made, which has to be inferred from the quality function Q at hand. This situation
is illustrated in Fig. 3.2(a). Once A∗ is given, the second subproblem breaks down
to a single-criterion optimization problem that is independent of Q, namely, to
schedule and execute the selected algorithms in the most efficient manner, because
all pipelines based on A∗ are of equal effectiveness. Accordingly, the best pipeline
for A∗ in Fig. 3.2(b) refers to the one with lowest run-time. We conclude that only the
selection of algorithms actually depends on Q. Altogether, the developed pipeline
optimization problem can be summarized as follows:2
2 The second subproblem of the pipeline optimization problem was originally presented in the context of the theory on optimal scheduling in Wachsmuth and Stein (2012).
is decided by both the selection and the scheduling. In general, the algorithm selection
already implies whether there is an admissible schedule of the algorithms at all, which
raises the need to consider scheduling within the selection process.
In the following, however, we present paradigms of an ideal pipeline design. For
this purpose, we assume that the selection of algorithms directly follows from the text
analysis task to be tackled. In Sect. 3.3, we drop this assumption, when we develop
a practical approach to pipeline construction. Nevertheless, we see there that the
assumption is justified as long as only one quality criterion is to be optimized.
We consider the pipeline optimization problem for an arbitrary but fixed text analysis
task, i.e., for a collection or a stream of input texts D and an information need C.
In Sect. 2.2, we have discussed that such a task requires a text analysis process that
infers instances of C from the texts in D. To realize this process, we can choose from
a set of available text analysis algorithms AΩ . On this basis, we argue that an optimal
text analysis pipeline Π ∗ for C on D results from following four paradigms:3
Figure 3.3 illustrates the paradigms. Below, we explain each of them in detail.
3 The given steps revise the pipeline construction method from Wachsmuth et al. (2011). There, we
named the last step “optimized scheduling”. We call it “optimal scheduling” here, since we discuss
the theory behind pipeline design rather than a practical approach. The difference between optimal
and optimized scheduling is detailed in Chap. 4.
Fig. 3.3 Sample illustration of the four steps of designing an optimal text analysis pipeline for a
collection or a stream of input texts D and an information need C using a selection A1 , . . . , Am of
the set of all text analysis algorithms AΩ .
Given a text analysis task, maximum decomposition splits the process of addressing C
on D into single text analyses, such that the output of each text analysis can be
inferred with one algorithm A in AΩ . As stated above, in this section we assume
temporarily that the decomposition directly implies the algorithm set A ⊆ AΩ to
be employed, while the text analysis process suggests an initial schedule π (a) of a
pipeline Π (a) = A, π (a) . This is reflected in the top part of Fig. 3.3. The assumption
reduces the pipeline optimization problem to the pipeline scheduling problem (i.e., to
find an optimal schedule π ∗ ). In general, the more a text analysis task is decomposed
into single text analyses, the better the schedule of the algorithms that realize the
analyses can be optimized.
As motivated in Sect. 1.1, we consider mainly tasks from information extraction
and text classification. In information extraction, the intuitive unit of decomposition
is given by a text analysis that produces a certain information type, e.g. author name
annotations or the part-of-speech tags of tokens. In principle, it would be useful
to even decompose single text analyses. Different analyses often share similar elements,
and so a finer decomposition avoids redundant analyses. This fact becomes more
obvious when we look at text classification. Consider a pipeline that first classifies
the subjectivity of each text and then the sentiment polarity of each subjective text,
as used in our case study ArguAna (cf. Sect. 2.3). Both classifiers rely on certain
preprocessing and feature computation steps. While it is common to separate the
preprocessors, it would also be reasonable to decompose feature computation, since
every shared feature of the two classifiers is computed twice otherwise. Mex (2013)
points out the importance of such a decomposition in efficient approaches to tasks
like text quality assessment.
The example also reveals that a non-maximum decomposition can induce redun-
dancy. For the mentioned pipeline, the separation of subjectivity and polarity classi-
fication implies the unit of decomposition. If each classifier encapsulates its feature
The goal of the valid but not yet optimized pipeline Π(a) = ⟨A, π(a)⟩ is to address the
information need C only. Thus, we propose to perform early filtering on the input
texts in D, i.e., to maintain, at each point of a text analysis process, only those portions
of text that are relevant in that they may still contain all information types in C.
To this end, we insert filtering steps A1(F), . . . , Ak(F) into Π(a) after each of the k ≥ 1
algorithms in A that does not annotate all portions of text (i.e., that is not a preprocessor).
By that, we obtain a modified algorithm set A∗ = {A1, . . . , Am, A1(F), . . . , Ak(F)} and
hence a modified schedule π(b) in the resulting pipeline Π(b) = ⟨A∗, π(b)⟩. Such a
pipeline is visualized in step (b) of Fig. 3.3.
Especially information extraction tasks like InfexBA (cf. Sect. 2.3) profit from
filtering. To extract all revenue forecasts, for instance, only those portions of text
need to be filtered that contain both a time entity and a money entity and where
these entities can be normalized if needed. For many applications on top of InfexBA,
relevant portions of text will be only those where an arbitrary or even a specific
organization or market name is recognized. Also, the classification of forecasts can
be restricted to portions that have been identified to refer to revenue.
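As a concrete illustration of such a filtering step, the following Java sketch retains only those sentences for which both a time and a money annotation are present; the sentence representation is assumed for illustration and does not correspond to a particular framework.

import java.util.*;
import java.util.stream.*;

// Assumed minimal representation of an analyzed sentence and the types annotated in it.
record Sentence(String text, Set<String> annotationTypes) {}

class TimeMoneyFilter {
    // Early filtering: keep only the portions of text that may still yield a Relation(Time, Money),
    // i.e., sentences containing both entity types.
    static List<Sentence> filter(List<Sentence> sentences) {
        return sentences.stream()
                .filter(s -> s.annotationTypes().contains("Time")
                          && s.annotationTypes().contains("Money"))
                .collect(Collectors.toList());
    }
}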
In the end, the filtering steps give rise to the optimization potential of pipeline
scheduling: Without filtering, every admissible pipeline for an algorithm set would
have the same run-time on each input text, because each algorithm would then
process all portions of all texts. Conversely, filtering the relevant portions of text
enables a pipeline to avoid all analyses that are unnecessary for the inference of C
from D (Wachsmuth et al. 2013c). Thereby, the run-time of a pipeline can be optimized
without changing its effectiveness. In Sect. 3.5, we discuss filtering in detail,
where we see that a consistent filtering is indeed possible automatically and that it
also allows trading efficiency for effectiveness. There, we substitute the filtering
steps by an input control that works independently of the given pipeline.
Based on the filtering steps, an always reasonable step to further improve the run-
time of the pipeline Π (b) is lazy evaluation, i.e., to delay each algorithm in π (b) until
its output is needed. More precisely, each Ai ∈ A∗ is moved directly before the first
algorithm Aj ∈ A∗ in π(b) for which Ci(out) ∩ Cj(in) ≠ ∅ holds (or to the end of π(b) if
no such Aj exists). Thereby, A∗ is implicitly partitioned into an ordered set of filtering
stages. We define a filtering stage as a partial pipeline Πj, consisting of a filtering
step Aj(F) and all algorithms Ai ∈ A with Ci(out) ∩ Cj(in) ≠ ∅ that precede Aj(F). We
sketch a couple of filtering stages in Fig. 3.3, e.g., the one with A2, Am−1, and Am−1(F).
Of course, no algorithm in A∗ is executed more than once in the resulting pipeline
of filtering stages Π(c) = ⟨A∗, π(c)⟩.4
The rationale behind lazy evaluation is that, the more filtering takes place before an
algorithm is executed, the fewer portions of text the algorithm will process. Therefore,
an algorithm is, in general, faster if it is scheduled later in a pipeline. The potential
of lazy evaluation is rooted in the decomposition of text analysis tasks. E.g., in
InfexBA, tokens were required for time recognition and named entity recognition,
but part-of-speech tags only for the latter. So, the decomposition of tokenization and
tagging allowed us to tag only tokens of portions with time entities.
\[ t_\Pi(D) \;=\; t_1(D) \,+\, \sum_{i=2}^{m} t_i\Big( \bigcap_{j=1}^{i-1} d_j(D) \Big) \tag{3.4} \]
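For illustration, the following Java sketch estimates the run-time of a filtered pipeline in this spirit, assuming that each algorithm reports an average run-time per portion of text and that each filtering step keeps a known fraction of the portions (its selectivity); all names and numbers are hypothetical.

// Hypothetical per-portion run-times (in ms) and filter selectivities of a three-step pipeline.
class RuntimeEstimate {
    static double estimate(int portions, double[] timePerPortion, double[] selectivity) {
        double total = 0.0;
        double remaining = portions;                 // portions of text still considered relevant
        for (int i = 0; i < timePerPortion.length; i++) {
            total += remaining * timePerPortion[i];  // algorithm i processes only the remaining portions
            remaining *= selectivity[i];             // its filtering step keeps this fraction of them
        }
        return total;
    }

    public static void main(String[] args) {
        // e.g., 1000 sentences; a cheap recognizer scheduled first, an expensive parser last
        double t = estimate(1000, new double[] {0.1, 0.5, 9.0}, new double[] {0.2, 0.5, 1.0});
        System.out.printf("estimated run-time: %.1f ms%n", t);
    }
}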
4 The given definition of filtering stages revises the definition in Wachsmuth et al. (2011), where we used the term to denote the partial pipelines resulting from early filtering.
In the following, we present an extended version of the application of the four par-
adigms in the InfexBA context from Wachsmuth et al. (2011). The goals are (1)
to demonstrate how the paradigms can be followed in general and (2) to offer first
evidence that, especially in information extraction, filtering and scheduling significantly
impact efficiency without compromising effectiveness. The results provide
the basis for all practical approaches and evaluations presented in Chaps. 3 and 4.
The Java source code of the performed experiments is detailed in Appendix B.4.
Information Need. We study the application of the paradigms for the extraction of
all related time and money entities from sentences that denote revenue forecasts.
This information need can be modeled as follows:
C = {Relation(Time, Money), Forecast, Revenue}
An instance of C is e.g. found in the sentence “In 2009, market analysts expected
touch screen revenues to reach $9B by 2015”.
Input Texts. In the case study, we process the provided split of our Revenue cor-
pus, for which details are presented in Appendix C.1. We use the training set of the
corpus to estimate all run-times and initially filtered portions of text of the employed
text analysis algorithms.
Maximum Decomposition. To address C, we need a recognition of time entities
and money entities, an extraction of their relations, and a detection of revenue and
forecast events. For these text analyses, an input text must be segmented into
sentences and tokens beforehand. Depending on the employed algorithms, the tokens may
additionally have to be extended by part-of-speech tags, lemmas, and dependency
5 Equation 3.4 assumes that the run-time of each filtering step A(F) ∈ A∗ is zero. In Sect. 3.5, we offer evidence that the time required for filtering is in fact almost negligible.
Fig. 3.4 Application of the paradigms from Fig. 3.3 of designing an optimal pipeline Π2∗ =
A∗2 , π ∗ for addressing the information need Forecast(Time, Money) on the Revenue corpus.
The application is based on the algorithm set A2 .
based on A∗2 execute eti and emo before rtm2 and eti also before rfo. Given these
constraints, we apply the approximation of Eq. 3.5, i.e., we pairwise compute the
optimal schedule of two filtering stages. E.g., for Πemo = (sse, sto2 , emo, emo(F) )
and Πrfo = (sse, sto2 , tpo1 , rfo, rfo(F) ), we have:
Therefore, we move the algorithms in Πrfo before emo, which also means that we
separate tpo1 from pde1 to insert tpo1 before rfo. For corresponding reasons, we
postpone rtm2 to the end of the schedule. Thereby, we obtain the final pipeline Π2∗
that is illustrated in Fig. 3.4(d). Correspondingly, we obtain the following pipelines
for A∗1 , A∗3 , and A∗4 :
Π1∗ = (sse, sto2, rre1, rre(F), eti, eti(F), emo, emo(F), tpo1, rfo, rfo(F), rtm1, rtm(F))
Π3∗ = (sse, sto2, eti, eti(F), emo, emo(F), rre1, rre(F), tpo2, tle, rfo, rfo(F), pde2, rtm2, rtm(F))
Baselines. To show the effects of each paradigm, we consider all constructed intermediate
pipelines. E.g., for A2, this means Π2(a), Π2(b), Π2(c), and Π2∗. In addition,
we compare the schedules of the different optimized pipelines. I.e., for 1 ≤ i, j ≤ 3,
we compare each Πi∗ = ⟨Ai∗, πi∗⟩ to all pipelines ⟨Ai∗, πj∗⟩ with i ≠ j, except for π1∗:
π1∗ applies rre1 before time and money recognition, which would not be admissible
for rre2. For ⟨Ai∗, πj∗⟩, we assume that πj∗ refers to the algorithms of Ai∗.
Experiments. We compute the run-time per sentence t (Π ) and its standard devia-
tion σ for each pipeline Π on the test set of the Revenue corpus using a 2 GHz
Intel Core 2 Duo MacBook with 4 GB RAM. All run-times are averaged over five
runs. Effectiveness is captured in terms of precision p, recall r , and F1 -score f 1
(cf. Sect. 2.1).
Table 3.1 The precision p and the recall r as well as the average run-time in milliseconds per sentence t(⟨Ai, πj⟩) and its standard deviation σ for each considered pipeline ⟨Ai, πj⟩ based on A1, A2, and A3 on the Revenue corpus.

           A1                        A2                        A3
πj         p     r     t ± σ         p     r     t ± σ         p     r     t ± σ
πi(a)      0.65  0.56  3.23 ± .07    0.72  0.58  51.05 ± .40   0.75  0.61  168.40 ± .57
πi(b)      0.65  0.56  2.86 ± .09    0.72  0.58  49.66 ± .28   0.75  0.61  167.85 ± .70
πi(c)      0.65  0.56  2.54 ± .08    0.72  0.58  15.54 ± .23   0.75  0.61  45.16 ± .53
π1∗        0.65  0.56  2.44 ± .03    0.72  0.58  –             0.75  0.61  –
π2∗        0.65  0.56  2.47 ± .15    0.72  0.58  4.77 ± .06    0.75  0.61  16.25 ± .15
π3∗        0.65  0.56  2.62 ± .05    0.72  0.58  4.95 ± .09    0.75  0.61  10.19 ± .05
Fig. 3.5 a The run-time per sentence of each pipeline Ai , π j on the test set of the Revenue
corpus at the levels of effectiveness represented by A1 , A2 , and A3 . b The run-time per sentence
of each algorithm A ∈ A2 on the test set depending on the schedule π j of the pipeline A2 , π j .
Results. Table 3.1 lists the precision, recall and run-time of each pipeline based
on A1 , A2 , or A3 . In all cases, the application of the paradigms does not change the
effectiveness of the employed algorithm set.6 Both precision and recall significantly
increase from A1 to A2 and from A2 to A3 , leading to F1 -scores of 0.60, 0.65, and
0.67, respectively. These values match the hypothesis that deeper analyses support
higher effectiveness.
Paradigms (b) to (c) reduce the average run-time of A1 from 3.23 ms per sentence
of Π1(a) to t(Π1∗) = 2.44 ms. Π1∗ is indeed the fastest pipeline, but only at a low
confidence level according to the standard deviations. The efficiency gain under A1
is largely due to early filtering and lazy evaluation. In contrast, the benefit of optimal
scheduling becomes obvious for A2 and A3 . Most significantly, Π3∗ clearly outper-
forms all other pipelines with t (Π3∗ ) = 10.19 ms. It requires less than one fourth of
the run-time of the pipeline resulting after step (c), lazy evaluation, and even less
than one sixteenth of the run-time of Π3(a) .
In Fig. 3.5(a), we plot the run-times of all considered pipelines as a function of
their effectiveness in order to stress the efficiency impact of the four paradigms. The
shown interpolated curves have the shape sketched in Fig. 3.2. While they grow more
rapidly under increasing F1 -score, only a moderate slope is observed after optimal
6 In Wachsmuth et al. (2011), we report on a small precision loss for A2, which we there assumed to emanate from noise of algorithms that operate on the token level. Meanwhile, we have found out that the actual reason was an implementation error, which is now fixed.
scheduling. For A2 , Fig. 3.5(b) illustrates the main effects of the paradigms on the
employed algorithms: Dependency parsing (pde1 ) takes about 90 % of the run-time
of both A2 , πa and A2 , πb . Lazy evaluation then postpones pde1 , reducing the
run-time to one third. The same relative gain is achieved by optimal scheduling,
resulting in A2 , π2∗ where pde1 takes less than half of the total run-time.
According to the observed results, the efficiency impact of an ideal pipeline construc-
tion and execution seems to grow with the achieved effectiveness. In fact, however,
the important difference between the studied algorithm sets is that the run-times of
the algorithms in A1 are quite uniform, while A2 involves one much more expensive
algorithm (pde1 ) and A3 involves three such algorithms (tpo2 , tle, and pde2 ). This
difference gives rise to much of the potential of lazy evaluation and optimal schedul-
ing. Moreover, the room for improvement depends on the density and distribution of
relevant information in input texts. With respect to C, only 2 % of the sentences in
the test set of the Revenue corpus are relevant. In contrast, the more densely relevant
information occurs, the less filtering impacts efficiency; and the more it is spread
across the text, the larger the filtered portions of text must be in order to
achieve high recall, as we see later on.
Still, the introduced paradigms are generic in that they work irrespective of the
employed algorithms and the tackled text analysis task. To give a first intuition of
the optimization potential of the underlying filtering and scheduling steps, we have
considered a scenario where the algorithm set is already given. Also, the discussed
task is restricted to a single information need with defined sizes of relevant portions
of texts. In general, approaches are needed that can (1) choose a complete algorithm
set on their own and (2) perform filtering for arbitrary text analysis tasks. We address
these issues in the remainder of Chap. 3. On this basis, Chap. 4 then turns our view
to the raised optimization problem of pipeline scheduling.
The paradigms introduced in Sect. 3.1 assume that the set of text analysis algorithms
to be employed is given. In practice, the algorithms’ properties need to be specified
in a machine-readable form in order to be able to automatically select and schedule
an algorithm set that is complete according to Eq. 3.2 (see above). For this purpose,
we now formalize the concepts underlying text analysis processes in a metamodel.
Then, we exemplify how to instantiate the metamodel within an application and
we discuss its limitations. This section presents an extended version of the model
from Wachsmuth et al. (2013a), which in turn consolidates content from the work
of Rose (2012) that has also influenced the following descriptions.
Our goal is to automatically design optimal text analysis pipelines for arbitrary text
analysis tasks (cf. Eq. 3.1). As motivated in Chap. 1, the design of a pipeline requires
human expert knowledge related to the text analysis algorithms to be employed. For
automation, we therefore develop a model that formalizes this expert knowledge. To
this end, we resort to the use of an ontology. Following Gruber (1993), an ontology
specifies a conceptualization by defining associations between names in a universe.
We rely on OWL-DL, a complete and decidable variant of the web ontology lan-
guage.7 OWL-DL represents knowledge using description logic, i.e., a subset of
first-order logic made up of concepts, roles, individuals, and relations (Baader et al.
2003). As Rose (2012) argues, the major advantages of OWL-DL are its wide suc-
cessful use and its readability for both humans and machines. It is the recommended
standard of the W3C (Horrocks 2008), which is why OWL-DL ontologies are nor-
mally visualized as resource description framework (RDF) graphs.
Now, to formalize the expert knowledge, we slightly refine our basic scenario
from Sect. 1.2 by viewing text analysis as an annotation task:
The rationale behind this process-oriented view is that all text analyses can largely
be operationalized as an annotation of texts (cf. Sect. 2.2). While both the input texts
to be processed and the information need to be addressed depend on the given task, we
can hence model general expert knowledge about text analysis processes irrespective
of the task. Such an annotation task metamodel serves as an upper ontology that is
extended by concrete knowledge in a task at hand. In particular, we model three
aspects of the universe of annotation tasks:
1. The information to be annotated,
2. the analysis to be performed for annotation, and
3. the quality to be achieved by the annotation.
Each aspect subsumes different abstract concepts, each of which is instantiated
by the concrete concepts of the text analysis task at hand. Since OWL-DL integrates
types and instances within one model, such an instantiation can be understood as
an extension of the metamodel. Figure 3.6 illustrates the complete annotation task
metamodel as an RDF graph. In the following, we discuss the representation of all
shown concepts in detail. For a concise presentation of limited complexity and for
lack of other requirements, we define only some concepts formally.
[Fig. 3.6 graph contents: Annotation type, Feature, Primitive type, Information type, Value constraint, Algorithm, Filter, Selectivity estimation, Quality criterion, Quality estimation, Order relation, Aggregate function, Quality prioritization, and Quality priority, connected by "has" and "subclass" relations.]
Fig. 3.6 The proposed metamodel of the expert knowledge that is needed for addressing annotation
tasks, given in the form of an RDF graph. Black arrowheads denote “has” relations and white
arrowheads “subclass” relations. The six non-white abstract concepts are instantiated by concrete
concepts in an application.
The information needs addressed in annotation tasks refer to possibly complex real-
world concepts, such as semantic roles or relations between entities (cf. Sect. 2.2).
Usually, the types of information that are relevant for a given application are
implicitly or explicitly predefined in a type system. In our case study InfexBA, for
instance, we considered entity types like Organization, Time, or Money as well as
relation and event types based on these types, whereas the type system in ArguAna
specified opinions, facts, product names, and the like (cf. Sect. 2.3).
As for our case studies, most types are specific to the language, domain, or appli-
cation of interest: Classical natural language processing types emanate from lexical
or syntactic units like tokens or sentences. While the set of types is quite stable,
their values partly vary significantly across languages. An example is given by
part-of-speech tags, for which a universal tagset can only catch the most general
concepts (Petrov et al. 2012). In information extraction, some entity types are very
common like person names, but there does not even exist an approved definition of
the term “entity” (Jurafsky and Martin 2009). Relations and events tend to be both
domain-specific and application-specific, like the association of time and money enti-
ties to revenue forecasts in InfexBA. And in text classification, the class attribute
and its values differ from application to application, though there are some common
classification schemes like the Reuters topics (Lewis et al. 2004) or the sentiment
polarities (Pang et al. 2002).
Since concrete type systems vary across annotation tasks, it does not make sense
to generally model a certain set of concrete types. In contrast, we observe that, in
3.2 A Process-Oriented View of Text Analysis 71
principle, all type systems instantiate a subset of the same abstract structures. These
structures are defined in our metamodel, as shown on the left of Fig. 3.6.8
In particular, we distinguish two abstract types: (1) Primitive types, such as inte-
gers, real numbers, booleans, or strings. In the given context, primitive types play
a role where values are assigned to annotations of a text (examples follow below).
(2) An annotation type, which denotes the set of all annotations of all texts that
represent a specific (usually syntactic or semantic) real-world concept. A concrete
annotation type might e.g. be Author, subsuming all annotations of author names. An
instance of the annotation type (i.e., an annotation) assigns the represented concept
to a span of text, e.g. it marks a token or an author name. Formally, we abstract
from the textual annotations in the definition of an abstract annotation type, which
specifies the associated concept only through its identifier:
Annotation Type. An annotation type C(A) represents a specific real-world concept
and associates it with a 2-tuple ⟨C(A), C̄(A)⟩ such that
1. Features. C(A) is the set of |C(A)| ≥ 0 features of C(A), where each feature is a
concept that has a certain abstract type of information.
2. Supertype. C̄(A) is either undefined or it is the supertype of C(A).
According to the definition, concrete annotation types can be organized hierar-
chically through supertypes. E.g., the supertype of Author may be Person, whose
supertype may in turn be Named entity, and so forth. An annotation type has an
arbitrary but fixed number of features.9 Each feature has a type itself. The value of
a feature is either a primitive or an annotation. Primitive features e.g. represent class
values or normalized forms of annotations (say, the part-of-speech tag or the lemma
of a token) or they simply specify an annotation’s boundary indices or reference
address. Through features, annotations can also model relations or events. In our
case study ArguAna from Sect. 2.3, we modeled the type Discourse relation on
the statement level as an annotation type with two features of the annotation type
Statement as well as a third primitive string feature that defines the type of relation.
When addressing information needs, text analysis pipelines target at the optimization
of a quality function Q (cf. Sect. 3.1). Depending on the task, several concrete quality
criteria exist, mostly referring to effectiveness and efficiency (cf. Sect. 2.1). The
abstract concepts that we consider for quality criteria are shown on the right of
Fig. 3.6. In principle, a quality criterion simply defines a set of comparable values:
8 The notion of type systems and the modeled structures are in line with the software framework
Apache UIMA, http://uima.apache.org, accessed on June 15, 2015.
9 We refer to features of annotations in this chapter only. They should not be confused with the
machine learning features (cf. Sect. 2.1), which play a role in Chaps. 4 and 5.
Quality Criterion. A quality criterion Q denotes a set of values that has the
following properties:
1. Order Relation. The values in Q have a defined total order.
2. Aggregate Function. Q may have an aggregate function that maps two arbitrary
values q1 , q2 ∈ Q to an aggregate value q ∈ Q.
Annotation tasks aim at optimizing a set of quality criteria Q. While aggregate
functions provide a possibility to infer the quality of a solution from the quality of
solutions to subtasks, far from all criteria entail such functions. E.g., aggregating the
absolute run-times of two text analysis algorithms executed in sequence means com-
puting their sum, whereas there is no general way of inferring an overall precision
from the precision of two algorithms. Similarly, quality functions that aggregate val-
ues of different quality criteria (as in the case of precision and recall) only rarely exist.
Thus, in contrast to several other multi-criteria optimization problems, weighting dif-
ferent Pareto-optimal solutions (where any improvement in one criterion worsens
others) does not seem reasonable in annotation tasks. Instead, we propose to rely on
a quality prioritization that defines an order of importance:
Quality Prioritization. A quality prioritization ρ = (Q 1 , . . . , Q k ) is a permutation
of a set of quality criteria Q = {Q 1 , . . . , Q k }, k ≥ 1.
E.g., the quality prioritization (run-time, F1 -score, recall) targets at finding the
best solution in terms of recall under all best solutions in terms of F1 -score under all
best solutions in terms of run-time. As the example shows, quality prioritizations can
at least integrate the weighting of different quality criteria by including an “aggregate
quality criterion” like the F1 -score.
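Operationally, such a prioritization amounts to a lexicographic comparison. The following Java sketch picks the better of two measured solutions by going through the criteria in the order given by ρ; the representation of criteria and measured values is assumed for illustration.

import java.util.*;

// Illustrative sketch: a solution stores one measured value per quality criterion, and a
// criterion knows which of two values is better (its order relation).
interface QualityCriterion {
    String name();
    boolean isBetter(double q1, double q2);   // e.g., lower run-time or higher recall
}

class Prioritization {
    // Returns the better of two solutions under the prioritization rho: the first criterion decides,
    // ties are broken by the next criterion, and so on.
    static Map<String, Double> better(List<QualityCriterion> rho,
                                      Map<String, Double> s1, Map<String, Double> s2) {
        for (QualityCriterion q : rho) {
            double v1 = s1.get(q.name()), v2 = s2.get(q.name());
            if (q.isBetter(v1, v2)) return s1;
            if (q.isBetter(v2, v1)) return s2;
        }
        return s1;  // equal under all criteria
    }
}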
In the annotation task metamodel in Fig. 3.6, we define a quality prioritization
as a sequence of one or more quality priorities, where each quality priority points to
a quality criterion and has zero or one successor. Within a universe Ω, we call the
combination of a set of concrete quality criteria QΩ and a set of concrete quality
prioritizations the quality model of Ω. Such a quality model instantiates the concepts
of the quality aspect of the annotation task metamodel for a concrete application.
Finally, to address a given information need under some quality prioritization, a text
analysis process needs to be realized that performs the annotation of input texts.
Each text analysis refers to the inference of certain annotation types or features. It is
conducted by a set of text analysis algorithms from a given algorithm repository AΩ .
Within a respective process, not all features of an annotation are always set. Similarly,
an algorithm may require or produce only some features of an annotation type. We
call a feature an active feature if it has a value assigned.
In accordance with Sect. 2.2, an information need can be seen as defining a set of
annotation types and active features. In addition, it may specify value constraints,
i.e., constraints a text span must meet in order to be considered for annotation. In one
of our prototypes from InfexBA, for instance, a user can specify the organization
to find forecasts for, say “Google” (cf. Sect. 2.3). Only organization annotations
that refer to Google then meet the implied value constraint. Besides such instance-
specific constraints, implicitly all text spans must meet the basic constraint that they
refer to the real-world concept represented by the respective annotation type. Based
on the notion of active features and value constraints, we formally define the abstract
information type to be found in annotation tasks as follows:
Information Type. A set of instances of an annotation type denotes an information
type C if it contains all instances that meet two conditions:
1. Active Feature. The instances in C either have no active feature or they have the
same single active feature.
2. Constraints. The instances in C fulfill the same value constraints.
By defining an information type C to have at most one active feature, we obtain
a normalized unit of information in annotation tasks. I.e., every information need
can be stated as a set of information types C = {C1 , . . . , Ck }, meaning a conjunc-
tion C1 ∧ . . . ∧ Ck with k ≥ 1, as defined in Sect. 2.2. In this regard, we can denote
the above-sketched example information need from InfexBA as {Forecast, Fore-
cast.organization = “Google”}, where Forecast is a concrete annotation type with a
feature organization.
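To illustrate this normalized unit of information, the following Java sketch reduces an information type to an annotation type, at most one active feature, and an optional value constraint, and checks it against an annotation instance; all names are illustrative only.

import java.util.*;

// Illustrative representation of an annotation instance with its feature values.
record Annotation(String type, Map<String, String> features) {}

// An information type: an annotation type, at most one active feature, and an optional value constraint.
record InformationType(String annotationType, String activeFeature, String requiredValue) {
    boolean matches(Annotation a) {
        if (!a.type().equals(annotationType)) return false;
        if (activeFeature == null) return true;                       // no active feature required
        String value = a.features().get(activeFeature);
        if (value == null) return false;                               // feature not active in this instance
        return requiredValue == null || requiredValue.equals(value);   // value constraint, if any
    }
}

// Example: {Forecast, Forecast.organization = "Google"} corresponds to
// new InformationType("Forecast", null, null) and new InformationType("Forecast", "organization", "Google").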
Regarding information types, what matters is not the internal operations of a text analysis algorithm
that infers such a type from a text, but only the algorithm's behavior in
terms of the input types it requires and the output types it produces. The actual quality
of an algorithm (say, its efficiency and/or effectiveness) in processing a collection or
a stream of texts is, in general, unknown beforehand. For many algorithms, quality
estimations are known from evaluations, though. Formally, our abstract concept of
an algorithm in the center of Fig. 3.6 hence has the following properties:
Algorithm. Let C be a set of information types and Q a set of quality criteria. Then
an algorithm A is a 3-tuple ⟨C(in), C(out), q⟩ with C(in) ∩ C(out) = ∅ and
1. Input Types. C(in) ⊆ C is a set of input information types,
2. Output Types. C(out) ⊆ C is a set of output information types, and
3. Quality Estimations. q ∈ (Q 1 ∪{⊥}) × . . . × (Q |Q| ∪{⊥}) contains one value qi
for each Q i ∈ Q. qi defines a quality estimation or it is unknown, denoted as ⊥.
Different from frameworks like Apache UIMA, the definition does not allow
equal input and output types, which is important for ad-hoc pipeline construction.
We come back to this disparity in Sect. 3.3.
Now, assume that an algorithm has produced instances of an output type C (say,
Organization) for an information need C. As discussed in Sect. 3.1, a means to
improve efficiency is early filtering, i.e., to further analyze only portions of text that
contain instances of C and that, hence, may be relevant for C. Also, portions can
be excluded from consideration, if they span only instances that do not fulfill some
value constraint in C (say, organization = “Google”). For such purposes, we introduce
filters, which discard portions of an input text that do not meet some checked value
constraint while filtering the others. We formalize filters as follows:
Filter. Let C be a set of information types. Then a filter is an algorithm A(F) that
additionally defines a 2-tuple C(F) , q (F) such that
1. Value Constraints. C(F) ⊆ C is the set of value constraints of A(F) ,
2. Selectivity Estimations. q (F) ∈ [0, 1]∗ is a vector of selectivity estimations of
A(F) , where each estimation refers to a set of input types.
In line with our case study in Sect. 3.1, the definition states that a filter entails
certain selectivities, which depend on the given input types. Selectivities, however,
strongly depend on the processed input texts, as we observed in Wachsmuth and Stein
(2012). Therefore, reasonable selectivity estimations can only be obtained during
analysis and then assigned to a given filter.
Filters can be created on-the-fly for information types. A respective filter then has
a single input type in C(in) that equals its output type in C(out) except that C(out)
additionally meets the filter’s value constraints. We use filters in Sect. 3.3 in order to
improve the efficiency of text analysis pipelines. In Sects. 3.4 and 3.5, we outsource
filtering into an input control, which makes an explicit distinction of filters obsolete.
The metamodel in Fig. 3.6 is instantiated within a concrete application. We define the
knowledge induced thereby as an annotation task ontology, which can be understood
as a universe for annotation tasks:
Annotation Task Ontology. An annotation task ontology Ω consists of a 3-tuple
⟨CΩ, QΩ, AΩ⟩ such that
1. Type System. CΩ is a set of concrete annotation types,
2. Quality Model. QΩ is a set of concrete quality criteria, and
3. Algorithm Repository. AΩ is a set of concrete algorithms.
This definition differs from the one in Wachsmuth et al. (2013a), where we define
an annotation task ontology to contain the set of all possible information types and
quality prioritizations instead of the annotation types and quality criteria. However,
the definition in Wachsmuth et al. (2013a) was chosen merely to shorten the dis-
cussion. In the end, the set of possible information types implies the given set of
annotation types and vice versa. The same holds for quality criteria and prioritiza-
tions, given that all possible prioritizations of a set of quality criteria are viable.
To demonstrate the development of concrete concepts based on our annotation task
metamodel within an application, we sketch a sample annotation task ontology for
our case study ArguAna (cf. Sect. 2.3). Here, we follow the notation of Rose (2012).
Although ArguAna does not yield insightful instances of all abstract concepts, it
suffices to outline how to use our annotation task metamodel in practice.

Fig. 3.7 Excerpt from the annotation task ontology associated with the project ArguAna. The shown concepts instantiate the abstract concepts of the annotation task metamodel from Fig. 3.6.

In the Apache UIMA sense, the instantiation process corresponds to the creation of a type
system and all analysis engine descriptors. That being said, we observe that quality
is not modeled in Apache UIMA, which becomes important in Sect. 3.3.
Figure 3.7 illustrates an excerpt from the sample annotation task ontology. It shows
five annotation types, such as Statement and Opinion where the former is the super-
type of the latter. The opinion feature polarity is a primitive, whereas nucleus and
satellite of Discourse relation define a relation between two statements. Annotation
types are referenced by the information types of concrete algorithm concepts. E.g.,
Subjectivity classifier has an output type Opinion() without active features, which
also serves as the input type of Polarity classifier. A polarity classifier sets the value
of the polarity feature, represented by the output type Opinion(polarity). In that, it
achieves an estimated accuracy of 80 %. Accuracy denotes the only quality crite-
rion in the quality model of the sample ontology. Its order relation defines that an
accuracy value q1 is better than an accuracy value q2 if it is greater than q2 . The
quality criteria directly imply possible quality prioritizations. Here, the only quality
prioritization assigns Prio 1 to accuracy. A more sophisticated quality model follows
in the evaluation in Sect. 3.3.
Based on the proposed process-oriented view of text analysis, this section shows
how to construct text analysis pipelines ad-hoc for arbitrary information needs and
quality prioritizations. First, we show how to perform partial order planning (Russell
and Norvig 2009) to select an algorithm set that is complete in terms of the
definition in Sect. 3.1 while allowing for an admissible schedule. Then, we present
a basic approach to linearize the resulting partial order of the selected algorithms.
More efficient linearization approaches follow in Chap. 4 in the context of pipeline
scheduling. We realize the pipeline construction process in an expert system that we
finally use to evaluate the automation of pipeline construction. As above, this section
reuses content of Wachsmuth et al. (2013a) and Rose (2012).
Figure 3.8 exemplarily illustrates the pipeline construction for a sample informa-
tion need C from our case study InfexBA (cf. Sect. 2.3), which requests all forecasts
for the year 2015 or later.

Fig. 3.8 Sample illustration of our approach to ad-hoc pipeline construction: a A complete algorithm set A ⊆ AΩ is selected that addresses a given information need C while following a given quality prioritization ρ. b The partial order of the algorithms in A is linearized to obtain a text analysis pipeline Π = ⟨A, π⟩.

Depending on the given quality prioritization ρ, a set of
algorithms that can address C is selected from an algorithm repository AΩ , relying
on quality estimations of the algorithms. In addition, filters are automatically created
and inserted. The linearization then derives an efficient schedule using measures or
estimations of the run-times and selectivities of the algorithms and filters.
The selection and scheduling of an algorithm set that is optimal for an information
need C on a collection or a stream of input texts D with respect to a quality
function Q is traditionally done manually by human experts (cf. Sect. 2.2). Based on
the presented formalization of expert knowledge in an annotation task ontology Ω,
we now introduce an artificial intelligence approach to automatically construct text
analysis pipelines, hence enabling ad-hoc text mining (cf. Chap. 1). As discussed at
the end of Sect. 3.2, we leave out the properties of input texts in our approach
and we assume Q to be based on a given quality prioritization ρ.
We consider ad-hoc pipeline construction as a planning problem. In artificial intel-
ligence, the term planning denotes the process of generating a viable sequence of
actions that transforms an initial state of the world into a specified goal state (Russell
and Norvig 2009). A planning problem is defined by the goal (and optional con-
straints) to be satisfied as well as by the states and actions of the world. Here, we
describe the pipeline planning problem based on the definition of annotation task
ontologies as follows.
Pipeline Planning Problem. Let Ω = CΩ , QΩ , AΩ be an annotation task ontology.
Then a pipeline planning problem Φ (Ω) is a 4-tuple C0 , C, ρ, AΩ such that
1. Initial State. C0 ⊆ CΩ is the initially given set of information types,
2. Goal. C ⊆ CΩ is the set of information types to be inferred,
For the selection of an algorithm set A, we propose to use partial order planning. This
backward approach recursively generates and combines subplans (i.e., sequences of
actions) for all preconditions of those actions that satisfy a planning goal (Russell
and Norvig 2009). In general, actions may conflict, namely, if an effect of one action
violates a precondition of another one. In annotation tasks, however, algorithms only
produce information. While filters reduce the input to be processed, they do not
remove information types from the current state, thus never preventing subsequent
algorithms from being applicable (Dezsényi et al. 2005). Consequently, the precon-
ditions of an algorithm will always be satisfied as soon as they are satisfied once.
Partial order planning follows a least commitment strategy, which leaves the order-
ing of the actions as open as possible. Therefore, it is in many cases a very efficient
planning variant (Minton et al. 1995).
Pseudocode 3.1 shows our partial order planning approach to algorithm selection.
Given a planning problem, the approach creates a complete algorithm set A together
with a partial schedule π̃ . Only to initialize planning, a helper finish algorithm A0 is
first added to A. Also, the planning agenda Λ is derived from the information need C
and the initial state C0 (pseudocode lines 1–3). Λ stores each open input requirement,
i.e., a single precondition to be satisfied together with the algorithm it refers to. As
long as open input requirements exist, lines 4–15 iteratively update the planning
agenda while inserting algorithms into A and respective ordering constraints into π̃.
In particular, line 5 retrieves an input requirement C, A from Λ using the method
poll(). If the information need C contains C, a filter A(F) is created and integrated on-the-fly (lines 6–9).
According to Sect. 3.2, A(F) discards all portions of text that do not comprise
instances of C. After replacing C, A with the input requirement of A(F) , line 11
selects an algorithm A∗ ∈ AΩ that produces C and that is best in terms of the quality
prioritization ρ. If any C cannot be satisfied, planning fails (line 12) and does not
reach line 16 to return a partially ordered pipeline A, π̃ .
10 Because of the changed definition of annotation task ontologies in Sect. 3.1, the definition of
planning problems also slightly differs from the one in Wachsmuth et al. (2013a).
3.3 Ad-Hoc Construction via Partial Order Planning 79
pipelinePartialOrderPlanning(C0, C, ρ, AΩ)
1: Algorithm set A ← {A0}
2: Partial schedule π̃ ← ∅
3: Input requirements Λ ← {⟨C, A0⟩ | C ∈ C \ C0}
4: while Λ ≠ ∅ do
5:   Input requirement ⟨C, A⟩ ← Λ.poll()
6:   if C ∈ C then
7:     Filter A(F) ← createFilter(C)
8:     A ← A ∪ {A(F)}
9:     π̃ ← π̃ ∪ {(A(F) < A)}
10:    ⟨C, A⟩ ← ⟨A(F).C(in).poll(), A(F)⟩
11:  Algorithm A∗ ← selectBestAlgorithm(C, C, ρ, AΩ)
12:  if A∗ = ⊥ then return ⊥
13:  A ← A ∪ {A∗}
14:  π̃ ← π̃ ∪ {(A∗ < A)}
15:  Λ ← Λ ∪ {⟨C, A∗⟩ | C ∈ A∗.C(in) \ C0}
16: return ⟨A, π̃⟩
Pseudocode 3.1: Partial order planning for selecting an algorithm set A (with a partial schedule π̃ )
that addresses a planning problem Φ (Ω) = C0 , C, ρ, AΩ .
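A compact Java sketch of this selection loop follows. It mirrors the backward processing of an agenda of open input requirements, but omits filters and quality-based selection for brevity; all names are illustrative, and the first producer found is simply taken instead of the best one.

import java.util.*;

// Illustrative sketch: algorithms reduced to their input and output information types.
record Algo(String name, Set<String> in, Set<String> out) {}

class PartialOrderPlanner {
    // Selects algorithms for the open types in need \ initial and returns ordering constraints "A < B".
    static List<String> plan(Set<String> initial, Set<String> need, List<Algo> repository) {
        Deque<Map.Entry<String, String>> agenda = new ArrayDeque<>();   // open requirement: (type, required by)
        for (String c : need) if (!initial.contains(c)) agenda.push(Map.entry(c, "FINISH"));

        Set<String> selected = new HashSet<>();
        List<String> constraints = new ArrayList<>();
        while (!agenda.isEmpty()) {
            Map.Entry<String, String> req = agenda.pop();
            Algo producer = repository.stream()
                    .filter(a -> a.out().contains(req.getKey()))
                    .findFirst().orElse(null);
            if (producer == null) throw new IllegalStateException("no algorithm produces " + req.getKey());
            constraints.add(producer.name() + " < " + req.getValue());
            if (selected.add(producer.name()))                          // expand each algorithm's inputs only once
                for (String c : producer.in())
                    if (!initial.contains(c)) agenda.push(Map.entry(c, producer.name()));
        }
        return constraints;
    }
}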
Different from Wachsmuth et al. (2013a), we also present the method
selectBestAlgorithm in detail here, shown in Pseudocode 3.2. The underlying process
was originally defined by Rose (2012). Lines 1 and 2 check if algorithms exist that
produce the given precondition C. The set AC of these algorithms is then compared
successively for each quality criterion Qi in ρ (lines 3–13) in order to determine the
set AC∗ of all algorithms with the best quality estimation q∗ (initialized with the worst
possible value of Qi in lines 4 and 5). To build AC∗, lines 6–11 iteratively compare
selectBestAlgorithm(C, C, ρ, AΩ )
1: Algorithm set AC ← {A ∈ AΩ | C ∈ A.C(out) }
2: if |AC | = 0 then return ⊥
3: for each Quality criterion Q i ∈ ρ with i from 1 to |ρ| do
4: Algorithm set AC ∗ ←∅
Pseudocode 3.2: Selection of an algorithm from AΩ that produces the information type C and that
is best in terms of the quality prioritization ρ.
estimateQuality(A, Qi, C, AΩ)
1: Quality estimation q ← A.qi
2: if q = ⊥ then q ← Qi.worst()
3: if Qi has no aggregate function then return q
4: for each Information type C(in) ∈ A.C(in) \ C do
5:   Quality estimation q∗C(in) ← Qi.worst()
6:   for each Algorithm AC(in) ∈ AΩ with C(in) ∈ AC(in).C(out) do
7:     Quality estimation qC(in) ← estimateQuality(AC(in), Qi, C, AΩ)
8:     if Qi.isBetter(qC(in), q∗C(in)) then q∗C(in) ← qC(in)
9:   q ← Qi.aggregate(q, q∗C(in))
10: return q
Pseudocode 3.3: Computation of a quality estimation q for the algorithm A in terms of the quality
criterion Q i . Given that Q i has an aggregate function, q recursively aggregates the best quality
estimations of all required predecessors of A.
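To illustrate the interplay of selection (Pseudocode 3.2) and quality estimation (Pseudocode 3.3), here is a minimal Python sketch that follows the description above. The classes Criterion and Algorithm, the dictionary of estimations, and the tie-breaking are assumptions of this sketch rather than the original implementation.

class Criterion:
    """A quality criterion, e.g. precision (maximized) or run-time (minimized).
    An aggregate function (e.g. addition of run-times) triggers the recursion."""
    def __init__(self, name, maximize=True, aggregate=None):
        self.name, self.maximize, self.aggregate = name, maximize, aggregate
    def worst(self):
        return float("-inf") if self.maximize else float("inf")
    def is_better(self, a, b):
        return a > b if self.maximize else a < b

class Algorithm:
    """Stand-in for an algorithm with typed input/output and quality estimations."""
    def __init__(self, name, inputs, outputs, estimations):
        self.name, self.inputs, self.outputs = name, set(inputs), set(outputs)
        self.estimations = estimations            # e.g. {"precision": 0.8, "runtime": 1.2}

def estimate_quality(algorithm, criterion, info_need, repository):
    """Recursive estimation along Pseudocode 3.3 (sketch)."""
    q = algorithm.estimations.get(criterion.name, criterion.worst())
    if criterion.aggregate is None:
        return q
    for needed in algorithm.inputs - info_need:   # A.C(in) \ C
        best = criterion.worst()
        for producer in (a for a in repository if needed in a.outputs):
            est = estimate_quality(producer, criterion, info_need, repository)
            if criterion.is_better(est, best):
                best = est
        q = criterion.aggregate(q, best)
    return q

def select_best_algorithm(needed_type, info_need, prioritization, repository):
    """Keep the candidates that are best for each criterion in turn (sketch of 3.2)."""
    candidates = [a for a in repository if needed_type in a.outputs]
    if not candidates:
        return None
    for criterion in prioritization:
        best_value, best_set = criterion.worst(), []
        for algorithm in candidates:
            value = estimate_quality(algorithm, criterion, info_need, repository)
            if criterion.is_better(value, best_value):
                best_value, best_set = value, [algorithm]
            elif value == best_value:
                best_set.append(algorithm)
        candidates = best_set
    return candidates[0]

if __name__ == "__main__":
    runtime = Criterion("runtime", maximize=False, aggregate=lambda a, b: a + b)
    precision = Criterion("precision", maximize=True)
    repo = [Algorithm("sto", [], ["Token"], {"runtime": 0.1}),
            Algorithm("eti1", ["Token"], ["Time"], {"precision": 0.9, "runtime": 2.0}),
            Algorithm("eti2", ["Token"], ["Time"], {"precision": 0.8, "runtime": 0.5})]
    best = select_best_algorithm("Time", {"Time"}, [precision, runtime], repo)
    print(best.name)   # eti1, since precision is prioritized over run-time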
Before the selected algorithm set A can be executed, an admissible schedule π must
be derived from the partial schedule π̃ , as illustrated at the bottom of Fig. 3.8 above.
Such a linearization of a partial order plan addresses the pipeline scheduling problem
from Sect. 3.1.
11 We assume that run-time estimations of all algorithms in A are given. In doubt, for each algorithm
greedyPipelineLinearization(A, π̃)
1: Algorithm set AΦ ← ∅
2: Schedule π ← ∅
3: while AΦ ≠ A do
4:   Filtering stages Π ← ∅
5:   for each Filter A(F) ∈ {A ∈ A \ AΦ | A is a filter} do
6:     Algorithm set A(F) ← {A(F)} ∪ getPredecessors(A \ AΦ, π̃, A(F))
7:     Schedule π(F) ← getAnyCorrectTotalOrdering(A(F), π̃)
Pseudocode 3.4: Greedy linearization of a partially ordered pipeline A, π̃ . The pipeline’s filtering
stages are ordered by increasing estimated run-time.
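The behavior of greedyPipelineLinearization can be sketched in Python as follows: in each round, a filtering stage (a remaining filter plus its unscheduled predecessors) is built for every filter, and the stage with the lowest estimated run-time is appended in a constraint-respecting order. The representation of algorithms as plain names, the run-time dictionary, and the tie-breaking are assumptions of this sketch.

def greedy_linearization(algorithms, constraints, runtime, filters):
    """Greedy linearization of a partially ordered pipeline (sketch of the behavior
    described for Pseudocode 3.4). algorithms is a set of names, constraints a set of
    pairs (a, b) meaning 'a must run before b', runtime maps names to estimations,
    and filters is the subset of algorithms that are filters. Assumes every algorithm
    precedes some filter, as in pipelines constructed by Pseudocode 3.1."""
    def predecessors(target, remaining):
        # all transitive predecessors of target among the remaining algorithms
        preds, frontier = set(), {target}
        while frontier:
            frontier = ({a for (a, b) in constraints if b in frontier} & remaining) - preds
            preds |= frontier
        return preds
    def total_order(stage):
        # any ordering of a stage that respects the constraints (topological sort)
        order, left = [], set(stage)
        while left:
            ready = sorted(a for a in left
                           if not any(b in left for (b, c) in constraints if c == a))
            order.append(ready[0])
            left.remove(ready[0])
        return order
    scheduled, schedule = set(), []
    while scheduled != algorithms:
        remaining = algorithms - scheduled
        stages = []
        for filt in filters & remaining:
            stage = {filt} | predecessors(filt, remaining)
            stages.append((sum(runtime[a] for a in stage), total_order(stage)))
        cost, cheapest = min(stages)               # filtering stage with the lowest run-time
        schedule += cheapest
        scheduled |= set(cheapest)
    return schedule

if __name__ == "__main__":
    algs = {"sto", "eti", "eti(F)", "emo", "emo(F)"}
    before = {("sto", "eti"), ("eti", "eti(F)"), ("sto", "emo"), ("emo", "emo(F)")}
    times = {"sto": 0.1, "eti": 2.0, "eti(F)": 0.01, "emo": 0.5, "emo(F)": 0.01}
    print(greedy_linearization(algs, before, times, {"eti(F)", "emo(F)"}))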
We now turn to the properties of our approach in terms of its benefits and limitations
as well as its correctness and complexity. Especially the analysis of the correctness
and complexity is a new contribution of this book.
Planning operationalizes the first paradigm from Sect. 3.1, maximum decompo-
sition. The actual benefit of partial order planning relates to the second and third
paradigm. It originates in the least commitment strategy of partial order planning:
As planning proceeds backwards, the constraints in the partial schedule π̃ (cf.
Pseudocode 3.1) prescribe only to execute an algorithm right before its output is
needed, which implies lazy evaluation. Also, π̃ allows a direct execution of a filter
after the text analysis algorithm it refers to, thereby enabling early filtering.
In the described form, pipelinePartialOrderPlanning is restricted to the con-
struction of a pipeline for one information need C. In general, also text analysis
tasks exist that target at k > 1 information needs at the same time. Because our
case studies InfexBA and ArguAna do not serve as proper examples in this
regard, in the evaluation below we also look at the biomedical event extraction
task Genia (cf. Sect. 2.3). Genia addresses nine different event types, such as Pos-
itive regulation or Binding. The principle generalization for k planning problems
Φ1 , . . . , Φk is straightforward: We apply our approach to each Φi in isolation, result-
ing in k partially ordered pipelines ⟨A1, π̃1⟩, . . . , ⟨Ak, π̃k⟩. Then, we unify all algo-
rithm sets and partial schedules, respectively, to create one partially ordered pipeline
⟨A, π̃⟩ = ⟨A1 ∪ . . . ∪ Ak, π̃1 ∪ . . . ∪ π̃k⟩. As a consequence, attention must be paid to filters.
For instance, a portion of text without positive regulations still may comprise a bind-
ing event. To handle Φ1 , . . . , Φk concurrently, a set of relevant portions must be
maintained independently for each Φi , which is achieved by the input control that
follows in Sect. 3.5.
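A minimal Python sketch of this unification step, assuming that each partially ordered pipeline is represented as a pair of an algorithm set and a set of ordering constraints:

def unify_plans(plans):
    """Unify k partially ordered pipelines <Ai, pi~i> into a single one <A, pi~>
    by taking the union of all algorithm sets and all partial schedules (sketch)."""
    algorithms, ordering = set(), set()
    for algs, constraints in plans:
        algorithms |= algs
        ordering |= constraints
    return algorithms, ordering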
Correctness. Our planner may fail if the given algorithm repository AΩ is not con-
sistent, i.e., if there is any algorithm in AΩ whose input types cannot be satisfied
by any other algorithm in AΩ . We ignore this case, because such an algorithm will
never prove helpful and, hence, should be removed from AΩ . Similarly, we do not
pay attention to algorithms with a circular dependency. As an example, assume that
we have (1) a tokenizer sto, which requires Csto(in) = {Sentence} as input and produces
Csto(out) = {Token} as output, and (2) a sentence splitter sse with Csse(in) = {Token} and
Csse(out) = {Sentence}. Given each of them is the best to satisfy the other's precondition,
these algorithms would be repeatedly added to the set of selected algorithms A in an
alternating manner. A solution to avoid circular dependencies is to ignore algorithms
whose input types are output types of algorithms already added to A. However, this
might cause situations where planning fails, even though a valid pipeline would
have been possible. Here, we leave more sophisticated solutions to future work. In
the end, the described problem might be realistic, but it is in our experience far
from common.
Proof. We provide only an informal proof here, since the general correctness of
partial order planning is known from the literature (Minton et al. 1995). The only
case where planning fails is when selectBestAlgorithm finds no algorithm in AΩ
that satisfies C. Since AΩ is consistent, this can happen only if C ∈ C holds. Then,
C\C0 must indeed be unsatisfiable using AΩ .
If C\C0 is satisfiable using AΩ, selectBestAlgorithm always returns an algo-
rithm that satisfies C by definition of AC . It remains to be shown that Pseudocode 3.1
returns a complete algorithm set A for C \ C0 then. Without circular dependencies
in AΩ , the while-loop in lines 4–15 always terminates, because (1) the number of
input requirements added to Λ is finite and (2) an input requirement is removed from Λ
in each iteration. As all added input requirements are satisfied, each algorithm in A
works properly, while the initialization of Λ in line 3 ensures that all information types
in C\C0 are produced. Hence, A is complete and, so, Theorem 3.1 is correct.
least one filter in A. Since all predecessors of the filter in the filtering stage ⟨Aj, πj⟩
chosen in line 10 of greedyPipelineLinearization belong to Aj, AΦ is extended
in every iteration of the while-loop (lines 3–12). Thus, AΦ eventually equals A, so
the method always terminates.
The schedule π is initialized with the empty set, which denotes a trivially correct
total order. Now, for the inductive step, assume a correct total order in π within
some loop iteration. The schedule πj added to π guarantees a correct total order
by definition of getAnyCorrectTotalOrdering. For each Aj referred to in πj,
line 11 adds ordering constraints to π that prescribe the execution of Aj after all
algorithms referred to in π before. Hence, π remains a correct total order and, so,
Theorem 3.2 must hold.
Theorems 3.1 and 3.2 state the correctness and completeness of our approach.
In contrast, the quality of the selected algorithms with respect to the given quality
function Q (implied by the quality prioritization ρ) as well as the optimality of the
derived schedule remain unclear. Our planner relies on externally defined quality
estimations of the algorithms, which e.g. come from related experiments. It works
well as long as the algorithms that are considered best for single text analyses also
achieve high quality when assembled together. Similarly, the greedy linearization
can yield a near-optimal schedule only if comparably slow filtering stages do not
filter far fewer portions of the processed texts than faster filtering stages do. Other
construction approaches like Kano et al. (2010) and Yang et al. (2013) directly
compare alternative pipelines on sample texts. However, our primary goal here is
to enable ad-hoc text mining, which will often not allow the preprocessing of a
significant sample. That is why we decided to stay with an approach that can
construct pipelines in almost zero time.
Complexity. To estimate the run-time of pipelinePartialOrderPlanning, we
determine its asymptotic upper bound using the O-notation (Cormen et al. 2009). The
while-loop in Pseudocode 3.1 is repeated once for each precondition to be satisfied.
Assuming that an annotation type implies a constant number of related information
types, this is at most O(|CΩ |) times due to the finite number of available anno-
tation types in CΩ . Within the loop, selectBestAlgorithm is called. It iterates
O(|AC |) times following from the inner for-loop, since the number of iterations of
the outer for-loop (i.e., the number of quality criteria in ρ) is constant. In the worst
case, each algorithm in the algorithm repository AΩ produces one information type
only. Hence, we infer that O(|CΩ | · |AC |) = O(|CΩ |) holds, so there are actually
only O(|CΩ |) external calls of estimateQuality. For each algorithm A, satisfying
all preconditions requires at most all |AΩ | algorithms. This means that the two for-
loops in estimateQuality result in O(|AΩ |) internal calls of estimateQuality
that recursively require to satisfy preconditions. This process is reflected in Fig. 3.9.
Analogous to our argumentation for the preconditions, the maximum recursion depth is
|CΩ|, which implies a total number of O(|AΩ|^|CΩ|) executions of estimateQual-
ity. Therefore, we obtain the asymptotic worst-case overall run-time

tpipelinePartialOrderPlanning(CΩ, AΩ) = O(|CΩ| · |AΩ|^|CΩ|).   (3.6)
Fig. 3.9 Sketch of the worst-case number O(|AΩ|^|CΩ|) of calls of the method estimateQuality
for a given algorithm A, visualized by the algorithms that produce a required information type and,
thus, lead to a recursive call of the method.
This estimation seems problematic for large type systems CΩ and algorithm
repositories AΩ . In practice, however, both the while-loop iterations and the recursion
depth are governed rather by the number of information types in the information
need C. Moreover, the recursion (which causes the main factor in the worst-case
run-time) assumes the existence of aggregate functions, which will normally hold
for efficiency criteria only. With respect to algorithms, the actual influencing factor
is the number of algorithms that serve as preprocessors, called the branching factor
in artificial intelligence (Russell and Norvig 2009). The average branching factor
is limited by C again. Additionally, it is further reduced through the disregard of
algorithms that allow for filtering (cf. line 6 in Pseudocode 3.3).
Given the output A, π̃ of planning, the run-time of greedyPipelineLineari-
zation depends on the number of algorithms in A. Since the while-loop in
Pseudocode 3.4 adds algorithms to the helper algorithm set AΦ , it iterates O(|A|)
times (cf. the proof of Theorem 3.2). So, the driver of the asymptotic run-time is not
the number of loop iterations, but the computation of a transitive closure for
getPredecessors, which typically takes O(|A|³) operations (Cormen et al. 2009). As
mentioned above, the computation needs to be performed only once. Thus, we obtain
a worst-case run-time of O(|A|³) for greedyPipelineLinearization.
This can be said to be easily tractable, considering that the number of algo-
rithms in A is usually at most in the lower tens (cf. Sect. 2.2). Altogether, we hence
claim that the run-time of our approach to ad-hoc pipeline construction will often
be negligible in practice. In our evaluation of ad-hoc pipeline construction below, we
will offer evidence for this claim.
Fig. 3.10 A UML-like class diagram that shows the three-tier architecture of our expert system
Pipeline XPS for ad-hoc pipeline construction and execution.
Rose (2012) has implemented the described ad-hoc construction of pipelines and
their subsequent execution as a Java software tool on top of the software framework
Apache UIMA already mentioned above. Technical details on the software tool
and its usage are found in Appendix B.1. Here, we present an extended version
of the high-level view of the main concepts underlying the software tool presented
in Wachsmuth et al. (2013a).
The software tool can be regarded as a classical expert system. In general, expert
systems simulate the reasoning of human experts within a specific domain (Jackson
1990). The purpose of expert systems is either to replace or to assist experts in
solving problems that require reasoning based on domain- and task-specific expert
knowledge. Reasoning is performed by an inference engine, whereas the expert
knowledge is represented in a respective knowledge base. To serve their purpose,
expert systems must achieve a high efficiency and effectiveness while being capable
of explaining their problem solutions. One of the tasks expert systems have most
often been used for since the early days of artificial intelligence is the planning of
sequences of actions (Fox and Smith 1984). As discussed above, the construction of
a text analysis pipeline is an example of such a task.
To build an expert system, the required expert knowledge must be formalized.
Here, our model of text analysis as an annotation task from Sect. 3.2 comes into
play. Since the model conforms to basic and already well-defined concepts of
Apache UIMA to a large extent, we rely on Apache UIMA in the realization
of the expert system. In particular, Apache UIMA defines text analysis pipelines
through so-called aggregate analysis engines, which consist of a set of primitive
analysis engines (text analysis algorithms) with a specified flow (the schedule). Each
analysis engine is represented by a descriptor file with metadata, such as the analysis
Fig. 3.11 Visualization of the built-in quality model of our expert system. Each of the partially
labeled circles denotes one possible quality prioritization. As examples, the shown implies relations
illustrate that some prioritizations imply others.
engine’s input and output annotation types and features. Similarly, the available
set of annotation types is specified in a type system descriptor file. In contrast,
quality criteria and estimations are not specified by default. For this reason, we
allow algorithm developers to integrate quality estimations in the description field
of the analysis engine descriptor files via a fixed notation, e.g. “@Recall 0.7”. The
resulting descriptor files comprise all knowledge required by our expert system for
ad-hoc pipeline construction, called Pipeline XPS.
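Such a fixed notation can be read with a simple regular expression. The following Python sketch illustrates this; it operates on the plain description string only, does not touch the actual Apache UIMA descriptor API, and the criterion names in the example are merely illustrative.

import re

# Matches entries like "@Recall 0.7" or "@RuntimePerSentence 1.35" in a description field.
ESTIMATION_PATTERN = re.compile(r"@(\w+)\s+([0-9]*\.?[0-9]+)")

def parse_quality_estimations(description):
    """Extract quality estimations from the description text of an analysis engine
    descriptor (sketch; the '@Criterion value' notation follows the convention above)."""
    return {name: float(value) for name, value in ESTIMATION_PATTERN.findall(description)}

if __name__ == "__main__":
    text = "Recognizes time entities. @Precision 0.8 @Recall 0.7 @RuntimePerSentence 1.35"
    print(parse_quality_estimations(text))
    # {'Precision': 0.8, 'Recall': 0.7, 'RuntimePerSentence': 1.35}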
Figure 3.10 sketches the three-tier architecture of Pipeline XPS in a UML-like
class diagram notation (OMG 2011). As usual for expert systems, the architecture
separates the user interface from the inference engine, and both of them from the
knowledge base. In accordance with Sect. 3.2, the latter stores all domain-specific
expert knowledge in an annotation task ontology realized with OWL-DL. Via a
knowledge acquisition component, users (typically experts) can trigger an automatic
ontology import that creates an algorithm repository and a type system from a set of
descriptor files. In contrast, we decided to rely on a predefined quality model for lack
of specified quality criteria in Apache UIMA (cf. Sect. 3.2) and for convenience
reasons: Since the set of quality criteria is rather stable in text analysis, users thereby
only rarely, if at all, have to deal with ontology specifications.
The quality model that we provide is visualized in Fig. 3.11. It contains six criteria
from Sect. 2.2, one for efficiency (i.e., run-time per sentence) and five for effective-
ness (e.g. accuracy). Possible quality prioritizations are represented by small circles.
Some of these are labeled for illustration, such as (p, t, f1). In addition, the qual-
ity model defines relations between those quality prioritizations where one implies
the other, as in the illustrated case of (t, r, f1) and (t, a). In this way, users can
restrict their view to the three effectiveness criteria in the left part of Fig. 3.11, since
the expert system can e.g. compare algorithms whose effectiveness is measured as
accuracy (say, tokenizers), when e.g. F1-score is to be optimized. Also, some quality
prioritizations are naturally equivalent. For instance, (p, f1, t) is equivalent to (p, r, t),
because, given the best possible precision, the best possible F1-score follows
from the best possible recall. In contrast, (f1, p, t) is different from (r, p, t), since
it prioritizes a high F1-score over a high recall.
Through the information search interface in Fig. 3.10, a user can choose a qual-
ity prioritization, the information need to be addressed, and the collection of input
texts to be processed. The ad-hoc pipeline construction component takes these
parts of an annotation task together with the given ontology as input. Implement-
ing Pseudocodes 3.1–3.4, it outputs a valid text analysis pipeline in the form of a
UIMA aggregate analysis engine. On this basis, the inference engine performs the
pipeline execution, which results in the desired output information. This information
as well as a protocol of the construction and execution are presented to the user via
a result explanation component. A screenshot of the prototypical user interface of
the implemented expert system from Rose (2012) is found in Appendix B.1.
Table 3.2 Each information need C from InfexBA and Genia, for which we evaluate the run-time
of ad-hoc pipeline construction, and the number of algorithms |A| in the resulting pipeline.
types to the respective information need (e.g. Subject in case of the former). Then, we
also require specific values for some of these types (e.g. “Apple” in case of Subject).
In terms of quality, we prioritize precision over recall and both over the run-time
per sentence in all cases. While we construct pipelines once based on Ω1 and once
based on Ω2 , the information needs lead to the same selected algorithm sets for both
ontologies. This allows us to directly compare the run-time of our
expert system under Ω1 and Ω2 . The cardinalities of the algorithm sets are listed in
the right column of Table 3.2.
Figure 3.12 plots interpolated curves of the run-time of our expert system as a
function of the number of algorithms in the resulting pipeline. For simplicity, we
omit to show the standard deviations, which range between 3.6 ms and 9.8 ms for
pipeline construction in total and proportionally lower values for algorithm selection
and scheduling. Even on the given far from up-to-date standard computer, both
the algorithm selection via partial order planning and the scheduling via greedy
linearization take only a few milliseconds for all information needs. The remaining
run-time of pipeline construction refers to operations, such as the creation of Apache
UIMA descriptor files. Different from the asymptotic worst-case run-times computed
above, the measured run-times seem to grow only linearly in the number of employed
text analysis algorithms in practice, although there is some noise in the depicted
curves because of the high deviations.
As expected from theory, the size of the algorithm repositories has only a small
effect on the run-time, since the decisive factor is the number of algorithms available
for each required text analysis. Accordingly, scheduling is not dependent on the size
at all. Altogether, our expert system takes at most 26 ms for pipeline construction
[Fig. 3.12 plot panels: a text analysis pipelines for Revenue events, b text analysis pipelines for
PositiveRegulation events; y-axis: average run-time in milliseconds; x-axis: number of algorithms
in pipeline.]
Fig. 3.12 The run-time of our expert system on a standard computer for ad-hoc pipeline construction
in total as well as for algorithm selection and scheduling alone, each as a function of the number of
algorithms in the constructed pipeline. The algorithms are selected from a repository of 76 (solid
curves) or 38 algorithms (dashed) and target at event types from a InfexBA or b Genia.
in all cases, and this efficiency could certainly be significantly improved through an
optimized implementation. In contrast, manual pipeline construction would take at
least minutes, even with appropriate tool support as given for Apache UIMA.
Correctness of Pipeline Construction. To offer practical evidence for the correct-
ness of our approach, we evaluate the execution of different pipelines for a single
information need, Revenue(Time, Money). We analyze the impact of all six possible
quality prioritizations of precision, recall, and the average run-time per sentence. Our
expert system constructs three different pipelines for these prioritizations, which we
execute on the test set of the Revenue corpus:
Πt = (sto1 , rre1 , rre(F) , eti, eti(F) , emo, emo(F) , tpo1 , pde1 , rtm2 , rtm(F) )
Πr = (sse, sto2 , emo, emo(F) , eti, eti(F) , rre1 , rre(F) , tle, tpo2 , pde2 ,
rtm1 , rtm(F) )
Π p = (sse, sto2 , eti, eti(F) , emo, emo(F) , rre2 , rre(F) , tle, tpo2 , pde2 ,
rtm2 , rtm(F) )
Table 3.3 The run-time per sentence t on the test set of the Revenue corpus averaged over ten
runs with standard deviation σ as well as the precision p and recall r of each text analysis pipeline
resulting from the evaluated quality prioritizations for the information need Revenue(Time, Money).
While the process-oriented view of text analysis described above is suitable for
pipeline construction, we now reinterpret text analysis as the task to filter exactly
those portions of input texts that are relevant for the information need at hand.
This information-oriented view has originally been developed in Wachsmuth et al.
(2013c). Here, we reorganize and detail content of that publication for an improved
presentation. The information-oriented view enables us to automatically execute text
analysis pipelines in an optimal manner, i.e., without performing any unnecessary
analyses, as we see in the subsequent section. Together, Sects. 3.4–3.6 discuss our
concept of an input control (cf. Fig. 3.13) as well as its implications.
Traditional text analysis approaches control the process of creating all output infor-
mation sought for (cf. Sect. 2.2). However, they often do not comprise an efficient
control of the processed input texts, thus executing text analysis pipelines in a sub-
optimal manner. Concretely, much effort is spent for annotating portions of the texts
that are not relevant, as they lack certain required information. For instance, consider
the text analysis task to annotate all mentions of financial developments of orga-
Fig. 3.13 Abstract view of the overall approach of this book (cf. Fig. 1.5). Sections 3.4–3.6 address
the extension of a text analysis pipeline by an input control.
“Google's ad revenues are going to reach $20B. The search company was founded in 1998.”
Fig. 3.14 A sample text with instances of information types associated to a financial event and
a foundation relation. One piece of information (the time) of the financial event is missing, which
can be exploited to filter and analyze only parts of the text.
nizations over time in the sample text at the top of Fig. 3.14, modeled by an event
type Financial(Organization, Sector, Criterion, Time, Money). While the text spans
instances of most required information types, an appropriate time entity is missing.
Hence, the effort of annotating the other information of the financial event is wasted,
except for the organization entity, which also indirectly belongs to a binary relation
of the type Founded(Organization, Time).
Therefore, instead of simply processing the complete text, we propose to filter,
before spending annotation effort, only those portions of the text that may be relevant
for the task. To this end, we reinterpret the basic scenario from Sect. 1.2 and its refined
version from Sect. 3.2 as a filtering task:
To address text analysis in this way, we formalize the required expert knowledge
again (cf. Sect. 3.2). We assume the employed text analysis pipeline to be given
Fig. 3.15 Modeling expert knowledge of filtering tasks: a A query defines the relevance of a portion
of text. b A scoped query specifies the degrees of filtering. c The scoped query implies a dependency
graph for the relevant information types.
already. Thus, we can restrict our view to the input and output of the pipeline.
However, the input and output vary for different text analysis tasks, so we need
to model the expert knowledge ad-hoc when a new task is given. For this purpose,
we propose the three following steps:
a. Defining the relevance of portions of text,
b. specifying a degree of filtering for each relation type, and
c. modeling dependencies of the relevant information types.
Before we explain the steps in detail, we sketch the information-oriented view of
text analysis for the mentioned financial events. Here, a portion of text is relevant
if it contains related instances of all types associated to Financial, such as Time.
Now, assume that time entities have already been annotated in the sample text in
Fig. 3.14. If the specified degree prescribes to filter sentences, then only the second
sentence remains relevant and, thus, needs to be analyzed. According to the schedule
of the employed text analysis algorithms, that sentence sooner or later also turns
out to be irrelevant for lack of financial events. Filtering the relevant sentences (and
disregarding the others) hence prevents a pipeline from wasting time.
By viewing text analysis in this sense, we gain (1) that each algorithm in a text
analysis pipeline annotates only relevant portions of input texts, thus optimizing
the pipeline’s run-time efficiency, and (2) that we can easily trade the run-time
efficiency of the pipeline for its effectiveness. We offer evidence for these benefits
later on in Sect. 3.5. In accordance with the given examples, the primary focus of
the information-oriented view is information extraction and not text analysis as a
whole. Still, the view also applies to text classification in principle, if we model
13 Especially the term “entity” may be counterintuitive for types that are not core information
extraction types, e.g. for Sentence or Opinion. In the end, however, the output of all pipelines is
structured information that can be used in databases. Hence, it serves to fill the (entity) slots of a
(relation) template in the language of the classical MUC tasks (Chinchor et al. 1993). Such templates
represent the table schemes of databases.
Fig. 3.16 The annotations (bottom) and the relevant portions (top) of a sample text. For the query
γ1 = Founded(Organization, Time), the only relevant portion of text is dp2 on the paragraph level
and ds2 on the sentence level, respectively.
At each point of a text analysis process, the relevance of a given portion of text can
be automatically inferred from the addressed query. However, a query alone does
not suffice to perform filtering, because it does not specify the size of the portions
to be filtered. Since these portions serve for the fulfillment of single conjunctions,
different portions can be relevant for different conjunctions of a query. The following
queries exemplify this:
γ2 = Forecast(Anchor, Time)
γ3 = Financial(Money, γ2 ) = Financial(Money, Forecast(Anchor, Time))
γ2 targets at the extraction of forecasts (i.e., statements about the future) with time
information that have an explicit anchor, while γ3 refers to financial events, which
relate forecasts to money entities. With respect to the inner conjunction of γ3 (i.e.,
the query γ2 ), a portion of text without time entities is irrelevant, but since such a
portion may still contain a money entity, it remains relevant for the outer conjunction
of γ3 (i.e., the query γ3 as a whole).
In case of disjunctive queries like γ4 , the relevance of all portions of text is largely
decided independently for each of them:
γ4 = γ1 ∨ γ3 = Founded(Organization, Time) ∨ Financial(Money, γ2 )
Here, a portion of text that does not fulfill the conjunctive query γ1 can, of course,
still fulfill γ3 , except for the constraint that both γ1 and γ3 require an instance of the
entity type Time. In general, every conjunction in a query may entail a different set
of relevant portions of text at each step of the analysis of an input text. Therefore,
we propose to assign a degree of filtering to each conjunction in a query.
Degree of Filtering. A degree of filtering CS is a type of lexical or syntactic text
unit that defines the size of a portion of text within which all information of an
instance of a conjunction CR(C1, . . . , Ck), k ≥ 1, from a query to be addressed must
lie, denoted as CS[CR(C1, . . . , Ck)].
Degrees of filtering associate instances of conjunctions to units of text.14 The
specification of degrees of filtering accounts for the fact that most text analysis algo-
rithms operate on some text unit level. E.g., sequential classifiers for part-of-speech
tagging or entity recognition normally process one sentence at a time. Similarly,
most binary relation extractors take as input only candidate entity pairs within that
sentence. In contrast, coreference resolution rather analyzes paragraphs or even the
entire text. We call a query with assigned degrees of filtering a scoped query:
Scoped Query. A scoped query γ ∗ is a query γ where a degree of filtering is assigned
to each contained conjunction C R (C1 , . . . , Ck ), k ≥ 1, from γ .
14 In Wachsmuth et al. (2013c), we associate the relevance of portions of texts and, hence, also the
assignment of degrees of filtering to relation types instead of conjunctions. The resort to conjunctions
can be seen as a generalization, because it allows us to determine the relevance of a portion of text
also with respect to an atomic entity type only.
Figure 3.15(b) shows how degrees of filtering are integrated in a query to form a
scoped query. Every degree of filtering either belongs to a relation type or to an entity
type, never to none or both (this cannot be modeled in the chosen ontology notation).
Moreover, entity types have an assigned degree of filtering only if they denote an
outer conjunction on their own. All other entity types are bound to a relation type and
are, thus, covered by the degree of filtering of that relation type. As an example, a
scoped version of γ4 may prescribe to look for the event type Financial in paragraphs
and for the binary relation types in sentences. I.e.:
γ4∗ = Sentence[Founded(Organization, Time)]
∨ Paragraph[Financial(Money, Sentence[Forecast(Anchor, Time)])]
The definition of a scoped query denotes a design decision and should, in this
regard, be performed manually. In particular, degrees of filtering provide a means
to influence the tradeoff between the efficiency of a text analysis pipeline and its
effectiveness: small degrees allow for the filtering of small portions of text, which
positively affects run-time efficiency. Larger degrees provide less room for filtering,
but they allow for higher recall if relations exceed the boundaries of small portions.
When the degrees match the text unit levels of the employed text analysis algorithms,
efficiency will be optimized without losing recall, since an algorithm can never find
relevant information that exceeds the respective text units.15 Hence, knowing the text
unit levels of the employed algorithms would in principle also enable an automatic
specification of degrees of filtering.
The notion behind the term “scoped query” is that, at each point of a text analysis
process, every degree of filtering in a scoped query implies a set of relevant portions
of text, which we call a scope of the analyzed input text:
Scope. A scope S = (d1 , . . . , dn ) is an ordered set of n ≥ 0 portions of text where
instances of a conjunction C R (C1 , . . . , Ck ), k ≥ 1, from a scoped query γ ∗ may occur.
When addressing a scoped query γ ∗ on a given input text, only the scopes of the text
need to be analyzed. However, the scopes change within the text analysis process
according to the found instances of the information types that are relevant with respect
to γ ∗ . To maintain the scopes, the dependencies between the entity and relation types
in γ ∗ and their associated degrees of filtering must be known, because the change
of one scope may affect another one. For illustration, consider the above-mentioned
scoped query γ4∗ . In terms of γ4∗ , paragraphs without time entities will never span
15 There is no clear connection between the specified degrees of filtering and the precision of a text
analysis. In many applications, however, a higher precision will often be easier to achieve if text
analysis is performed only on small portions of text.
[Fig. 3.17, panel a: the dependency graph with the roots Sentence :Degree of filtering and
Paragraph :Degree of filtering. Panel b: the sample text “GOOGLE NEWS. 2014 ad revenues
predicted. Forecasts promising: Google, founded in 1998, hits $20B in 2014.” with the sentences
ds1–ds4 and the scope of Sentence[Founded(Organization, Time)].]
Fig. 3.17 a The dependency graph of the scoped query γ4∗ = γ1∗ ∨ γ3∗ . b The scopes of a sample
text associated to the degrees of filtering in γ4∗ . They store the portions of text that are relevant
for γ4∗ after all text analyses have been performed.
sentences with forecasts and, thus, will not yield financial relations. Similarly, if a
paragraph contains no money entities, then there is no need for extracting forecasts
from the sentences in the paragraph. So, filtering one of the scopes of Forecast and
Financial affects the other one.16
In general, an instance of a conjunction C R (C1 , . . . , Ck ) in γ ∗ requires the exis-
tence of information of the relation type C R and of all related sets of information
types C1 , . . . , Ck within the same portion of text. All these types hence depend on
the same degree of filtering. In case of an inner conjunction in an outer conjunction
C R (C1 , . . . , Ck ), the relevance of a portion of text with respect to the inner conjunc-
tion can depend on the relevance with respect to the outer conjunction and vice versa.
So, degrees of filtering depend on the degrees of filtering they subsume and they are
subsumed by. We explicitly represent these hierarchical dependencies between the
relevant types of information as a dependency graph:
Dependency Graph. The dependency graph Γ of a scoped query γ ∗ = C1 ∨
. . . ∨ Ck , k ≥ 1, is a set of directed trees with one tree for each conjunction
Ci ∈ {C1 , . . . , Ck }. An inner node of any Ci corresponds to a degree of filtering C S
and a leaf to an entity type C E or a relation type C R . An edge from an inner node
16 For complex relation types like coreference, the degree of filtering of an inner conjunction may
exceed the degree of an outer conjunction. For instance, in the example from Fig. 3.14, foundation
relations (outer) might be filtered sentence-wise, while coreferences (inner) could be resolved based
on complete paragraphs. In such a case, filtering with respect to the outer conjunction affects the
entities to be resolved, but not the entities to be used for resolution.
to a leaf means that the respective degree of filtering is assigned to the respective
information type, and an edge between two inner nodes implies that the associated
degrees of filtering are dependent. The degree of filtering of Ci itself defines the root
of the tree of Ci .
The structure of a dependency graph is given in Fig. 3.15(c) above. Figure 3.17(a)
models an instance of this structure, namely, the dependency graph of γ4∗ .
Figure 3.17(b) visualizes the associated scopes of a sample text. The dependency
graph of a scoped query can be exploited to automatically maintain the relevant
portions of text at each point of a text analysis process, as we see in Sect. 3.5.
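For illustration, the following Python sketch models a scoped query as nested conjunction objects and derives the dependencies between its degrees of filtering. The class and function names are assumptions of this sketch, not the notation used in the implementation.

class ScopedConjunction:
    """A conjunction CS[CR(C1, ..., Ck)] of a scoped query: a degree of filtering,
    a relation (or entity) type, and nested arguments that are either plain
    information types or scoped conjunctions themselves."""
    def __init__(self, degree, relation_type, arguments=()):
        self.degree = degree
        self.relation_type = relation_type
        self.arguments = list(arguments)

def dependency_edges(conjunction, parent_degree=None, edges=None):
    """Collect the edges between dependent degrees of filtering (one tree per
    outer conjunction), i.e., a flat view of the dependency graph Gamma."""
    if edges is None:
        edges = []
    if parent_degree is not None:
        edges.append((parent_degree, conjunction.degree))
    for argument in conjunction.arguments:
        if isinstance(argument, ScopedConjunction):
            dependency_edges(argument, conjunction.degree, edges)
    return edges

if __name__ == "__main__":
    # The scoped query gamma4* from above as a disjunction of two trees
    gamma1 = ScopedConjunction("Sentence", "Founded", ["Organization", "Time"])
    gamma2 = ScopedConjunction("Sentence", "Forecast", ["Anchor", "Time"])
    gamma3 = ScopedConjunction("Paragraph", "Financial", ["Money", gamma2])
    for tree in [gamma1, gamma3]:
        print(tree.degree, dependency_edges(tree))
    # Sentence []                            -> gamma1* has a single degree of filtering
    # Paragraph [('Paragraph', 'Sentence')]  -> the inner Sentence degree depends on Paragraph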
In most cases, the classical process-oriented view of text analysis and the information-
oriented view proposed in this section can be integrated without loss, meaning that
no output information sought for is lost through filtering. To this end, the specified
degrees of filtering need to match the actual analyses of the employed algorithms, as
described above. We offer evidence for this effectiveness preservation in the evalu-
ation of Sect. 3.5.
An exception that is not explicitly covered by the view emanates from algorithms
that do not operate on some text unit level, but that e.g. look at a sliding window
of an input text without paying attention to text unit boundaries. For instance, an
entity recognition algorithm might classify a candidate term based on clues from the
three preceding and the three subsequent tokens. Such algorithms will change their
behavior when only portions of the text are analyzed. In our experience, though, the most
important information for such classification decisions is typically found within
sentence boundaries, which is why the algorithms will often still work appropriately
when considering a text as a set of portions of text (as we do).
As mentioned, the information-oriented view does not help much in usual text
classification tasks where complete texts shall be categorized, since these tasks often
do not allow for filtering at all. Still, as soon as an analysis can be restricted to
some scope of the text, an appropriate modeling of the task may enable filtering. As
an example, consider the prediction of sentiment scores from the bodies of review
texts in our case study ArguAna (cf. Sect. 2.3). This task directly implies a degree
of filtering Body. Additionally, there may be features used for prediction that are
e.g. computed based only on statements that denote opinions (as opposed to facts).
Therefore, a scoped query may restrict preceding analyses like the recognition of
product names or aspects to the according portions of texts:
γ∗score = Body[SentimentScore(Opinion[Product], Opinion[Aspect])]
Scoped queries and the derived dependency graphs normally ought to model only
the information types explicitly sought for in a text analysis task. In the following
section, we use the dependency graphs in order to automatically maintain the rel-
evant portions of an input text based on the output of the employed text analysis
algorithms. However, some algorithms produce information types that do not appear
in a dependency graph at all, but that only serve as input for other algorithms. Typical
examples are basic lexical and syntactic types like Token or Part-of-speech. Also,
annotations of a specific type (say, Author) may require annotations of a more general
type (say, Person) to be given already. We decided not to model any of these prede-
cessor types here, because they depend on the employed pipeline. Consequently, it
seems more reasonable to determine their dependencies when the pipeline is given.
In particular, the dependencies can be automatically determined from the input and
output types of the algorithms in the pipeline, just as we have done for the ad-hoc
pipeline construction in Sect. 3.3.
17 We do not distinguish between knowledge and information in this section in order to emphasize
the connection between text analysis and non-monotonicity in artificial intelligence.
Fig. 3.18 Comparison of a the classical high-level concept of an assumption-based truth mainte-
nance system and b the proposed input control.
be believed as true, i.e., what inferences can be made based on the currently given
knowledge (Russell and Norvig 2009). To this end, the inference engine passes cur-
rent assumptions and justifications expressed as propositional symbols and formulas
to the ATMS. The ATMS then returns the inferrable beliefs and contradictions.
To address text analysis as a filtering task, we adapt the ATMS concept for main-
taining the portions of an input text that are relevant with respect to a given scoped
query γ ∗ . In particular, we propose to equip a text analysis pipeline with an input
control that takes the annotations and current scopes of a text as input in order to
determine in advance of executing a text analysis algorithm what portions of text
need to be processed by that algorithm. Figure 3.18 compares the high-level concept
of a classical ATMS to the proposed input control.
The input control models the relevance of each portion of text using an independent
set of propositional formulas. In a formula, every propositional symbol represents an
assumption about the portion of text, i.e., the assumed existence of an information
type or the assumed fulfillment of the scoped query γ ∗ or of a conjunction in γ ∗ . The
formulas themselves denote justifications. A justification is an implication in definite
Horn form whose consequent corresponds to the fulfillment of a query or conjunction,
while the antecedent consists of the assumptions under which the fulfillment holds.
Concretely, the following formulas are defined initially. For each portion of text d
that is associated to an outer conjunction CS[CR(C1, . . . , Ck)] in γ∗, we denote the
relevance of d with respect to the scoped query γ∗ as γ∗(d) and we let the input
control model its justification as φ(d):

φ(d):  CR(d) ∧ C1(d) ∧ . . . ∧ Ck(d)  →  γ∗(d)

Additionally, the input control defines a justification of the relevance Ci(d) of the
portion of text d with respect to each inner conjunction of the outer conjunction
CS[CR(C1, . . . , Ck)] that has the form Ci = CS′[CR′(Ci1, . . . , Cil)]. Based on the
portions of text associated to the degree of filtering of Ci, we introduce a formula ψi(d′)
for each such portion of text d′:

ψi(d′):  CR′(d′) ∧ Ci1(d′) ∧ . . . ∧ Cil(d′)  →  Ci(d)   (3.9)
This step is repeated recursively until each child node Cij(d′) in a formula ψi(d′)
represents either an entity type CE or a relation type CR. As a result, the set of all
formulas φ(d) and ψi(d′) of all portions of text defines what can initially be believed for
the respective input text.
To give an example, we look at the sample text from Fig. 3.17(b). The scoped
query γ4∗ to be addressed has two outer conjunctions, γ1∗ and γ3∗ , with the degrees
of filtering Sentence and Paragraph, respectively. For the four sentences and two
paragraphs of the text, we have six formulas:
φ(ds1): Founded(ds1) ∧ Organization(ds1) ∧ Time(ds1) → γ4∗(ds1)
φ(ds2): Founded(ds2) ∧ Organization(ds2) ∧ Time(ds2) → γ4∗(ds2)
φ(ds3): Founded(ds3) ∧ Organization(ds3) ∧ Time(ds3) → γ4∗(ds3)
φ(ds4): Founded(ds4) ∧ Organization(ds4) ∧ Time(ds4) → γ4∗(ds4)
φ(dp1): Financial(dp1) ∧ Money(dp1) ∧ γ2∗(dp1) → γ4∗(dp1)
φ(dp2): Financial(dp2) ∧ Money(dp2) ∧ γ2∗(dp2) → γ4∗(dp2)
In case of the two latter formulas, the relevance depends on the inner conjunction
γ2∗ of γ4∗ , for which we define four additional formulas:
ψ(ds1): Forecast(ds1) ∧ Anchor(ds1) ∧ Time(ds1) → γ2∗(dp1)
ψ(ds2): Forecast(ds2) ∧ Anchor(ds2) ∧ Time(ds2) → γ2∗(dp2)
ψ(ds3): Forecast(ds3) ∧ Anchor(ds3) ∧ Time(ds3) → γ2∗(dp2)
ψ(ds4): Forecast(ds4) ∧ Anchor(ds4) ∧ Time(ds4) → γ2∗(dp2)
The antecedents of these formulas consist of entity and relation types only, so no
further formula needs to be added. Altogether, the relevance of the six distinguished
portions of the sample text is hence initially justified by the ten defined formulas.
After each text analysis, the formulas of a processed input text must be updated,
because their truth depends on the set of currently believed assumptions, which
follows from the output of all text analysis algorithms applied so far. Moreover, the
set of current formulas implies whether a portion of text must be processed by a
specific text analysis algorithm or not. In particular, an algorithm can cause a change
of only those formulas that include an output type of the algorithm. At the end of the
text analysis process then, whatever formula remains must be the truth, just in the
sense of this chapter's introductory quote by Arthur Conan Doyle.
Here, by truth, we mean that the respective portions of text are relevant with
respect to the scoped query γ ∗ to be addressed. To maintain the relevant portions of
an input text, we have already introduced the concept of scopes that are associated
to the degrees of filtering in the dependency graph Γ of γ ∗ . Initially, these scopes
span the whole input text. Updating the formulas then means to filter the scopes
according to the output of a text analysis algorithm. Similarly, we can restrict the
analysis of that algorithm to those portions of text its output types are relevant for.
In the following, we discuss how to perform these operations.
Given the output of a text analysis algorithm, we update all justifications φ (d) and ψ (d)
of the relevance of an analyzed portion of text d that contain an output type C (out) ∈
C(out) of the algorithm. In particular, the assumptions about these types become
either true or false. Once an assumption turns out to be false, it will always be false.
Instead of maintaining the respective justifications, we can hence delete those that
are contradicted, thereby filtering the analyzed scopes of the input text.
For instance, if time entities are found only in the sentences ds2 and ds4 in
Fig. 3.17(b), then all formulas with Time(ds1) or Time(ds3) are falsified. In the other
ones, the respective assumptions Time(ds2) and Time(ds4) are justified by replacing
them with “true” and, consequently, deleting them from the antecedents of the
formulas. In addition, updating a formula ψ(d) requires a recursive update of all
formulas that contain the consequent of ψ(d). In the given case, the consequent γ2∗(dp1)
of ψ(ds1) becomes false, which is why φ(dp1) also cannot hold anymore. This in turn
could render the fulfillment of further nested conjunctions useless. However, such
conjunctions do not exist in φ(dp1). Therefore, the following formulas remain:

φ(ds2): Founded(ds2) ∧ Organization(ds2) → γ4∗(ds2)
φ(ds4): Founded(ds4) ∧ Organization(ds4) → γ4∗(ds4)
φ(dp2): Financial(dp2) ∧ Money(dp2) ∧ γ2∗(dp2) → γ4∗(dp2)
ψ(ds2): Forecast(ds2) ∧ Anchor(ds2) → γ2∗(dp2)
ψ(ds4): Forecast(ds4) ∧ Anchor(ds4) → γ2∗(dp2)
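To give a rough impression of this bookkeeping, the following Python sketch encodes each justification as a definite Horn clause over string-valued assumptions, builds the ten formulas for the sample text, and replays the update after time recognition. The class Justification, the update routine, and its propagation strategy are illustrative assumptions of this sketch, not the actual implementation.

class Justification:
    """A definite Horn clause: antecedent assumptions -> consequent (relevance)."""
    def __init__(self, antecedent, consequent):
        self.antecedent = set(antecedent)
        self.consequent = consequent
    def __repr__(self):
        return " ∧ ".join(sorted(self.antecedent)) + " → " + self.consequent

def update(formulas, found, missing):
    """Remove satisfied assumptions from the antecedents and delete formulas that can
    no longer hold; a consequent counts as falsified only once all of its
    justifications are gone (sketch of the update step described above)."""
    falsified = set(missing)
    while True:
        alive, dead = [], []
        for f in formulas:
            if f.antecedent & falsified:
                dead.append(f)
            else:
                f.antecedent -= set(found)     # assumption justified -> treated as 'true'
                alive.append(f)
        # a consequent justified by no remaining formula becomes a false assumption itself
        still_justified = {f.consequent for f in alive}
        newly_false = {f.consequent for f in dead} - still_justified
        if not newly_false - falsified:
            return alive
        falsified |= newly_false
        formulas = alive

if __name__ == "__main__":
    formulas = []
    for s, p in [("ds1", "dp1"), ("ds2", "dp2"), ("ds3", "dp2"), ("ds4", "dp2")]:
        formulas.append(Justification(
            ["Founded(%s)" % s, "Organization(%s)" % s, "Time(%s)" % s], "γ4*(%s)" % s))
        formulas.append(Justification(
            ["Forecast(%s)" % s, "Anchor(%s)" % s, "Time(%s)" % s], "γ2*(%s)" % p))
    for p in ["dp1", "dp2"]:
        formulas.append(Justification(
            ["Financial(%s)" % p, "Money(%s)" % p, "γ2*(%s)" % p], "γ4*(%s)" % p))
    remaining = update(formulas, found=["Time(ds2)", "Time(ds4)"],
                       missing=["Time(ds1)", "Time(ds3)"])
    for f in remaining:
        print(f)   # the five formulas that remain after time recognition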
We summarize that the output of a text analysis algorithm is used to filter not only
the scopes analyzed by the algorithm, but also the dependent scopes of these scopes.
The set of dependent scopes of a scope S consists of the scope S0 associated to the
root of the degree of filtering CS of S in the dependency graph of γ∗ as well as of
each scope S′ of a descendant degree of filtering of the root. This, of course, includes
the scopes of all ancestor degrees of filtering of CS besides the root.
Pseudocode 3.5 shows how to update the scopes of an input text based on the
output types C(out) of a text analysis algorithm. To enable filtering, all scopes must
initially be generated by segmentation algorithms (e.g. by a sentence splitter), i.e.,
algorithms with an output type C (out) that denotes a degree of filtering in the depen-
dency graph Γ . This is done in lines 1–3 of the pseudocode, given that the employed
pipeline schedules the according algorithms first. Independent of the algorithm, the
updateScopes(C(out))
1: for each Information type C(out) in C(out) do
2:   if C(out) is a degree of filtering in the dependency graph Γ then
3:     generateScope(C(out))
4: Scopes S ← getRelevantScopes(C(out))
5: for each Scope S in S do
6:   Information types C ← all C ∈ C(out) to which S is assigned
7:   for each Portion of text d in S do
8:     if not d contains an instance of any C ∈ C then S.remove(d)
9:   Scope S0 ← Γ.getRootScope(S)
10:  if S0 ≠ S then
11:    for each Portion of text d in S0 do
12:      if not d intersects with S then S0.remove(d)
13:  Scopes S′ ← Γ.getAllDescendantScopes(S0)
14:  for each Scope S′ ≠ S in S′ do
15:    for each Portion of text d in S′ do
16:      if not d intersects with S0 then S′.remove(d)
Pseudocode 3.5: Update of scopes based on the set of output types C(out) of a text analysis algorithm
and the produced instances of these types. An update may lead both to the generation and to the
filtering of the affected scopes.
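The filtering part of this update (lines 5–16) can be sketched in Python with scopes represented as lists of character intervals. The helper structures root_of, descendants_of, and output_instances, as well as the simplification that an instance merely has to overlap a portion, are assumptions of this sketch.

def overlaps(d, scope):
    """True if the portion d = (start, end) intersects any portion in the scope."""
    return any(start < d[1] and d[0] < end for (start, end) in scope)

def update_scopes(scopes, root_of, descendants_of, relevant, output_instances):
    """Filtering part of Pseudocode 3.5 (sketch): scopes maps a degree of filtering to
    a list of (start, end) portions; relevant lists the degrees affected by the
    algorithm's output types; output_instances maps a degree to the (start, end)
    spans of the produced annotations that fall into its portions."""
    for degree in relevant:
        # lines 7-8: keep only portions that contain (here: overlap) a produced instance
        scopes[degree] = [d for d in scopes[degree]
                          if overlaps(d, output_instances.get(degree, []))]
        # lines 9-12: filter the root scope accordingly
        root = root_of[degree]
        if root != degree:
            scopes[root] = [d for d in scopes[root] if overlaps(d, scopes[degree])]
        # lines 13-16: filter all descendant scopes of the root
        for other in descendants_of[root]:
            if other != degree:
                scopes[other] = [d for d in scopes[other] if overlaps(d, scopes[root])]
    return scopes

if __name__ == "__main__":
    # Two sentence portions and one paragraph portion; Time was found only in (20, 24).
    scopes = {"Sentence": [(0, 19), (20, 45)], "Paragraph": [(0, 45)]}
    root_of = {"Sentence": "Paragraph", "Paragraph": "Paragraph"}
    descendants_of = {"Paragraph": ["Sentence"]}
    print(update_scopes(scopes, root_of, descendants_of,
                        relevant=["Sentence"], output_instances={"Sentence": [(20, 24)]}))
    # {'Sentence': [(20, 45)], 'Paragraph': [(0, 45)]}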
getRelevantScopes(C(out))
1: Scopes S ← ∅
2: for each Degree of filtering CS in the dependency graph Γ do
3:   if Γ.getChildren(CS) ∩ C(out) ≠ ∅ then
4:     S.add(Γ.getScope(CS))
5:   else if getPredecessorTypes(Γ.getChildren(CS)) ∩ C(out) ≠ ∅ then
6:     S.add(Γ.getScope(CS))
7: return S
Pseudocode 3.6: Determination of the set S of all scopes that are relevant with respect to the output
types C(out) of a text analysis algorithm.
method getRelevantScopes next determines the set S of scopes that are relevant
with respect to the output of the applied algorithm (line 4).18 For each scope S ∈ S,
a portion of text d is maintained only if it contains an instance of one of the types
C ⊆ C(out) relevant for S (lines 5–8). Afterwards, lines 9–12 remove all portions of
text from the root scope S0 of S that do not intersect with any portion of text in S.
Accordingly, only those portions of text in the set of descendant scopes S′ of S0 are
retained that intersect with a portion in S0 (lines 13–16).
18 Unlike here, for space reasons we do not determine the relevant scopes within updateScopes in
Wachsmuth et al. (2013c), which requires storing the scopes externally instead.
determineUnifiedScope(C(out) )
1: for each Information type C (out) in C(out) do
2: if C (out) is a degree of filtering in the dependency graph Γ then
3: return the whole input text
4: Scopes S ← getRelevantScopes(C(out) )
5: Scope S∪ ← ∅
6: for each Scope S in S do
7: for each Portion of text d in S do
8: if not d intersects with S∪ then S∪ .add(d)
9: else S∪ .merge(d)
10: return S∪
Pseudocode 3.7: Determination of the unified scope S∪ to be analyzed by a text analysis algorithm
based on the given output types C(out) of the algorithm.
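The unification of the relevant scopes in lines 5–9 amounts to merging overlapping text intervals. A minimal Python sketch, assuming portions are given as (start, end) character offsets:

def unify_scopes(relevant_scopes):
    """Unify the relevant scopes into one scope by collecting non-overlapping portions
    and merging overlapping ones (sketch of lines 5-9 of Pseudocode 3.7)."""
    portions = sorted(d for scope in relevant_scopes for d in scope)
    unified = []
    for start, end in portions:
        if unified and start <= unified[-1][1]:              # overlaps the last portion
            unified[-1] = (unified[-1][0], max(unified[-1][1], end))
        else:
            unified.append((start, end))
    return unified

if __name__ == "__main__":
    sentence_scope = [(20, 45), (50, 70)]
    paragraph_scope = [(0, 45)]
    print(unify_scopes([sentence_scope, paragraph_scope]))   # [(0, 45), (50, 70)]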
Above, we discussed how to filter the relevant portions of text based on the output of
a text analysis algorithm. Still, the question is what subset of the current scopes of
an input text actually need to be processed by an algorithm. As an example, consider
the five mentioned formulas that remain after time recognition when addressing the
scoped query γ4∗ = γ1∗ ∨ γ3∗. Although the whole paragraph dp2 is assumed relevant
for γ4∗, an algorithm that produces Organization annotations will only lead to a
change of the formulas φ(ds2) and φ(ds4). So, the analysis of the algorithm can be
restricted to the scope associated with γ1∗, thus leaving out the sentence ds3 of dp2.
In general, an employed text analysis algorithm must be applied to each portion
of text d, for which an assumption φ (d) or ψ (d) exists that depends on one of the
output types C(out) of the algorithm. That is, the algorithm must be applied to the
union S∪ of the set S of all scopes that are relevant for the algorithm according to
the method getRelevantScopes.
Pseudocode 3.7 sketches how to determine the unified scope S∪ that contains
all portions of an input text relevant for the output types C(out) of an employed
algorithm. Lines 1–3 check if a type in C(out) is a degree of filtering. In this case,
the employed algorithm is a segmentation algorithm and, so, the whole input text is
returned. Otherwise, the set S of relevant scopes is identified using getRelevantScopes
from Pseudocode 3.6 again. These scopes are then unified in lines 5–9 by
collecting all non-overlapping portions of text while merging the overlapping ones.
We have already discussed the limitations of the information-oriented view at the end
of Sect. 3.4, namely, there are two noteworthy prerequisites that must be fulfilled in
order to allow for filtering: (1) The algorithms in the employed pipeline must operate
on some text unit level. (2) Not all parts of all input texts (and, hence, not all possible
annotations) are relevant to fulfill the information needs at hand. In the following,
we restrict our view to pipelines where both prerequisites hold. As for the pipeline
construction in Sect. 3.3, we look at the correctness and run-time of the developed
approaches. In Wachsmuth et al. (2013c), we have sketched these properties roughly,
whereas we analyze them more formally here.
Correctness. Concretely, we investigate the question whether the execution of a
pipeline that is equipped with an input control, which determines and updates the
scopes of an input text before each step of a text analysis process (as presented),
is optimal in that it analyzes only relevant portions of text.19 As throughout this
book, we consider only pipelines, where no output type is produced by more than
one algorithm (cf. Sect. 3.1). Also, for consistent filtering, we require all pipelines
to schedule the algorithms whose output is needed for generating the scopes of an
input text before any possible filtering takes place. Given these circumstances, we
now show the correctness of our algorithms for determining and updating scopes:
Lemma 3.1. Let a text analysis pipeline Π = A, π address a scoped query γ ∗
on an input text D. Let updateScopes(C(out) ) be called after each execution of an
algorithm A ∈ A on D with the output types C(out) of A. Then every scope S of D
associated to γ ∗ always contains exactly those portions of text that are currently
relevant with respect to γ ∗ .
Proof. We prove the lemma by induction over the number m of text analysis algo-
rithms executed so far. By assumption, no scope is generated before the first algorithm
has been executed. So, for m = 0, the lemma holds. Therefore, we hypothesize that
each generated scope S contains exactly those portions of text that can be relevant
with respect to γ ∗ after the execution of an arbitrary but fixed number m of text
analysis algorithms.
19 Here, we analyze the optimality of pipeline execution for the case that both the algorithms
employed in a pipeline and the schedule of these algorithms have been defined. In contrast, the
examples at the beginning of Sect. 3.4 have suggested that the amount of text to be analyzed (and,
hence, the run-time optimal pipeline) may depend on the schedule. The problem of finding the
optimal schedule under the given filtering view is discussed in Chap. 4.
Lemma 3.2. Let a text analysis pipeline Π = A, π address a scoped query γ ∗ on
an input text D. Further, let each degree of filtering in γ ∗ have an associated scope S
of D. Given that S contains exactly those portions of text that can be relevant with
respect to γ ∗ , the scope S∪ returned by determineUnifiedScope(C(out) ) contains
a portion of text d ∈ D iff. it is relevant for the information types C(out) .
Proof. By assumption, every segmentation algorithm must always process the whole
input text, which is assured in lines 1–3 of Pseudocode 3.7. For each other algo-
rithm A ∈ A, exactly those scopes belong to S where the output of A may help to
fulfill a conjunction (line 4). All portions of text of the scopes in S are unified incre-
mentally (lines 5–9) while preventing that overlapping parts of the scopes are con-
sidered more than once. Thus, no relevant portion of text is missed and no irrelevant
one is analyzed.
Proof. As Lemma 3.1 holds, all scopes contain exactly those portions of D that are
relevant with respect to γ ∗ according to the current knowledge. As Lemma 3.2 holds,
each algorithm employed in Π gets only those portions of D its output is relevant
for. From that, Theorem 3.3 follows directly.
Theorem 3.3 implies that an input-controlled text analysis pipeline does not perform
any unnecessary analysis. The intended benefit is to make a text analysis process
20 For the proof, it does not matter whether the instances of an information type in C(out) are used
to generate scopes, since no filtering has taken place yet in this case and, hence, the whole text D
can be relevant after lines 1–3 of updateScopes.
faster. Of course, the maintenance of relevant portions of text naturally produces some
overhead in terms of computational cost. In the evaluation below, however, we give
experimental evidence that these additional costs only marginally affect the efficiency
of an application in comparison to the efficiency gains achieved through filtering.
Before that, we analyze the asymptotic time complexity of the proposed methods.
Complexity. The number of output types |C(out) | of an algorithm is constant. Hence,
lines 1–3 of updateScopes take time linear in the length of the input text D, i.e.,
O(|D|). Getting the relevant scopes for C(out) then requires an iteration over all
degrees of filtering C S in the scoped query γ ∗ . Also, the for-loop from line 5 to 16 is
executed at most once for each scope S and, thus, again depends on |C S |. Within one
loop iteration, the filtering of S in lines 6–8 needs O(|D|) operations. Afterwards,
S is intersected with at most |C S | other scopes. Each intersection can be realized in
O(|D|) time by stepwise comparing the portions of text in all scopes according to
their ordering in D. Altogether, the run-time of the for-loop dominates the worst-case
run-time of updateScopes, which can be estimated as

t updateScopes (D) = O(|C S |² · |D|).

Since |C S | is fixed by the scoped query, this again results in time linear in the length of D in practice. We conclude that
the filtering view of text analysis can be efficiently realized in the form of an input
control that governs the portions of text processed by each algorithm in a text analysis
pipeline. In the following, we describe the main concepts of our realization.
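The stepwise comparison mentioned above can be illustrated in a few lines of Java. The sketch below assumes a simplified scope representation as an ordered list of non-overlapping character spans [begin, end); it illustrates the linear-time intersection (an analogous single pass realizes the incremental unification) and is not part of the actual framework.

import java.util.ArrayList;
import java.util.List;

// Illustrative only: a scope is modeled as an ordered list of non-overlapping
// character spans [begin, end) over the input text D.
final class ScopeOps {

  // Intersects two scopes in a single simultaneous pass, i.e., in O(|D|) time.
  static List<int[]> intersect(List<int[]> scopeA, List<int[]> scopeB) {
    List<int[]> result = new ArrayList<>();
    int i = 0, j = 0;
    while (i < scopeA.size() && j < scopeB.size()) {
      int[] a = scopeA.get(i), b = scopeB.get(j);
      int begin = Math.max(a[0], b[0]);
      int end = Math.min(a[1], b[1]);
      if (begin < end) result.add(new int[] {begin, end});  // keep the overlapping part
      if (a[1] <= b[1]) i++; else j++;                      // advance the span that ends first
    }
    return result;
  }
}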
To demonstrate how to equip a text analysis pipeline with an input control in prac-
tice, we now sketch our realization of the developed approach as a Java software
framework on top of Apache UIMA (see above). This Filtering Framework
was originally presented in Wachsmuth et al. (2013c). A few technical details about the framework and its extension by an application are given in Appendix B.2.

Fig. 3.19 A UML-like class diagram that shows the high-level architecture of realizing an input control as a filtering framework, which extends the Apache UIMA framework.
Some concepts of Apache UIMA have been introduced in Sect. 3.3. Here, we
provide a simplified view of its architecture that is illustrated at the bottom of Fig. 3.19
in a UML-like class diagram notation (OMG 2011). An application based on Apache
UIMA inputs at least one but typically many more texts and analyzes these texts with
aggregate analysis engines (text analysis pipelines). An aggregate analysis engine
executes a composition of primitive analysis engines (say, text analysis algorithms),
which make use of common analysis structures in order to process and to produce
output annotations of the input text at hand. Concrete annotation types do not denote
entity types only, but also relation types, because they may have features that store
values or references to other annotations.
We extend Apache UIMA with our Filtering Framework. Figure 3.19 shows
the four main concepts of this extension at the top:
1. The filtering analysis engines that analyze only relevant portions of text,
2. the scoped query to be addressed by the analysis engines,
3. the scopes that contain the relevant portions of the input text, and
4. the scope TMS, which updates and determines all scopes.
Filtering analysis engines inherit from primitive analysis engines and, hence, can
be composed in an aggregate analysis engine. Prior to analysis, a filtering analysis
engine automatically requests the unified scope its output annotation types and fea-
tures C(out) are relevant for from the scope TMS. After analysis, it triggers the update
of scopes based on C(out) and its produced output annotations.21
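To make the interplay of these four concepts concrete, the following Java sketch outlines the control flow of a filtering analysis engine on top of Apache UIMA's JCasAnnotator_ImplBase. The types ScopeTMS and Scope as well as the way the scope TMS is made available are hypothetical stand-ins for the framework components described above; the sketch is not the actual API of the Filtering Framework.

import java.util.List;
import java.util.Set;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.jcas.JCas;

// Hypothetical stand-ins for the scope truth maintenance system and its scopes.
interface Scope { List<int[]> spans(); }
interface ScopeTMS {
  Scope determineUnifiedScope(Set<String> outputTypes);
  void updateScopes(Set<String> outputTypes, JCas jcas);
}

// Sketch of a filtering analysis engine that analyzes only relevant portions of text.
abstract class FilteringAnalysisEngine extends JCasAnnotator_ImplBase {

  protected ScopeTMS scopeTms;        // assumed to be provided by the framework
  protected Set<String> outputTypes;  // the output types and features C(out) of this engine

  @Override
  public final void process(JCas jcas) throws AnalysisEngineProcessException {
    // 1. Request the unified scope that is relevant for this engine's output types.
    Scope unified = scopeTms.determineUnifiedScope(outputTypes);
    // 2. Analyze only the portions of text within that scope.
    for (int[] span : unified.spans()) {
      analyze(jcas, span[0], span[1]);
    }
    // 3. Trigger the update of all scopes based on the produced output annotations.
    scopeTms.updateScopes(outputTypes, jcas);
  }

  // The wrapped text analysis algorithm, restricted to the span [begin, end).
  protected abstract void analyze(JCas jcas, int begin, int end)
      throws AnalysisEngineProcessException;
}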
To enable filtering, an application must define the scoped query γ ∗ to be addressed
by an aggregate analysis engine. In our implementation, γ ∗ is entered in the form of
21 The set C(out) can be inferred from the so-called result specification of an analysis engine, which Apache UIMA automatically derives from the analysis engine's descriptor file.
Based on the realized filtering framework, we now evaluate the effects of using an
input control on the efficiency and effectiveness of pipeline execution. The impact
of the underlying filtering view depends on the amount of information in the given
input texts that is relevant for the given text analysis task. This makes a compre-
hensive evaluation of filtering in text analysis infeasible. Instead, our experiments
serve as a reasonable proof-of-concept that (1) analyzes the main parameters intrin-
sic to filtering and (2) offers evidence for the efficiency of our proposed approach.
Appendix B.4 provides information on the Java source code of this evaluation.
Input Texts. Our experiments are all conducted on texts from two text corpora of different languages. First, the widely used English dataset of the CoNLL-2003 shared task, which originally served for the development of approaches to language-independent named entity recognition (cf. Appendix C.4). The dataset consists of 1393 mixed classic newspaper stories. Second, our complete Revenue
corpus with 1128 German online business news articles that we already processed
in Sects. 3.1 and 3.3 and that is described in Appendix C.1.
Scoped Queries. From a task perspective, the impact of our approach is primarily
influenced by the complexity and the filtering potential of the scoped query to be
addressed. To evaluate these parameters, we consider the example queries γ1 to γ4
from Sect. 3.4 under three degrees of filtering: Sentence, Paragraph, and Text, where
the latter is equivalent to performing no filtering at all. The resulting scoped queries
are specified below.
Text Analysis Pipelines. We address the scoped queries with different pipelines,
some of which use an input control, while the others do not. In all cases, we employ
a subset of eleven text analysis algorithms that have been adjusted to serve as filter-
ing analysis engines. Each of these algorithms can be parameterized to work both
on English and on German texts. Concretely, we make use of the segmentation algo-
rithms sto2 , sse, and tpo1 as well as of the chunker pch for preprocessing. The
entity types that appear in the queries (i.e., Time, Money, and Organization) are
recognized with eti, emo, and ene, respectively. Accordingly, we extract relations
with the algorithms rfo (Forecast), rfu (Founded), and rfi (Financial). While rfo
operates only on the sentence level, the other two qualify for arbitrary degrees of
filtering. Further information on the algorithms can be found in Appendix A.
All employed algorithms have a roughly comparable run-time that scales linearly
with the length of the processed input text. While computationally expensive algo-
rithms (such as the dependency parsers in Sect. 3.1) strongly increase the efficiency
potential of filtering (the later such an algorithm is scheduled the better), employing
them would render it hard to distinguish the effects of filtering from those of the
order of algorithm application (cf. Chap. 4).
Experiments. We quantify the filtering potential of our approach by comparing the
filter ratio (Filter %) of each evaluated pipeline Π , i.e., the quotient between the
number of characters processed by Π and the number of characters processed by a
respective non-filtering pipeline. Similarly, we compute the time ratio (Time %) of
each Π as the quotient between the run-time of Π and the run-time of a non-filtering
pipeline.22 All run-times are measured on a 2 GHz Intel Core 2 Duo MacBook with
4 GB memory and averaged over ten runs (with standard deviation σ ). In terms
of effectiveness, below we partly count the positives (P) only, i.e., the number of
extracted relations of the types sought for, in order to roughly compare the recall of
pipelines (cf. Sect. 2.1). For the foundation relations, we also distinguish between
false positives (FP) and true positives (TP) to compute the precision of extraction.
To this end, we have decided for each positive manually whether it is true or false. In
particular, an extracted foundation relation is considered a true positive if and only
if its anchor is brought into relation with the correct time entity while spanning the correct organization entity.23

Fig. 3.20 Interpolated curves of the filter ratios of the algorithms in pipeline Π1 under three degrees of filtering for the query γ1 = Founded(Organization, Time) on (a) the English CoNLL-2003 dataset and (b) the German Revenue corpus.
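Before turning to the results, the following minimal Java sketch restates the evaluation measures defined above; all names are illustrative only.

// Illustrative restatement of the measures used in this evaluation.
final class EvaluationMeasures {

  // Filter %: characters processed by the filtering pipeline relative to a non-filtering pipeline.
  static double filterRatio(long charsWithFiltering, long charsWithoutFiltering) {
    return 100.0 * charsWithFiltering / charsWithoutFiltering;
  }

  // Time %: run-time of the filtering pipeline relative to a non-filtering pipeline.
  static double timeRatio(double timeWithFiltering, double timeWithoutFiltering) {
    return 100.0 * timeWithFiltering / timeWithoutFiltering;
  }

  // Precision of extraction, computed from manually checked true and false positives.
  static double precision(int truePositives, int falsePositives) {
    return (double) truePositives / (truePositives + falsePositives);
  }
}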
Tradeoff between Efficiency and Effectiveness. We analyze different degrees of
filtering for the query γ1 = Founded(Organization, Time). In particular, we execute
the pipeline Π1 = (spa, sse, eti, sto2 , tpo2 , pch, ene, rfu) on both given corpora
to address each of three scoped versions of γ1 , one per degree of filtering: Sentence[γ1 ], Paragraph[γ1 ], and Text[γ1 ].
To examine the effects of an input control, we first look at the impact of the degree
of filtering. Figure 3.20 illustrates the filter ratios of all single algorithms in Π1 on
each of the two corpora with one interpolated curve for every evaluated degree of
filtering. As the beginnings of the curves convey, even the segmentation of paragraphs
(given for the paragraph level only) and sentences already enables the input control
to disregard small parts of a text, namely those between the segmented text portions.
The first algorithm in Π1 , then, that really reduces the number of relevant portions of text is eti. On the sentence level, it reduces the relevant input to 28.9 % of all characters of the texts in the CoNLL-2003 dataset and to 42.0 % on the Revenue corpus. These values are further decreased by ene, such that rfu has to analyze only 10.8 % and 17.2 % of all characters, respectively. The values for the degree of filtering Paragraph behave similarly, while naturally being higher.
The resulting overall efficiency and effectiveness values are listed in Table 3.4.
On the paragraph level, Π1 processes 81.5 % of the 12.70 million characters of the
CoNLL-2003 dataset that it processes on the text level, resulting in a time ratio of
69.0 %. For both these degrees of filtering, the same eight relations are extracted with
a precision of 87.5 %. So, no relation is found that exceeds paragraph boundaries.
Filtering on the sentence level lowers the filter ratio to 40.6 % and the time ratio to 32.9 %. While this reduces the number of true positives to 5, it also prevents any false positive. Such behavior may be a coincidence, but it may also indicate a tendency to achieve better precision when the filtered portions of texts are small.

23 An exact evaluation of precision and recall is hardly feasible on the input texts, since the relation types sought for are not annotated. Moreover, the given evaluation of precision is only fairly representative: In practice, many extractors do not look for cross-sentence and cross-paragraph relations at all. In such cases, precision remains unaffected by filtering.

Table 3.4 The number of processed characters in millions with filter ratio Filter %, the run-time t in seconds with standard deviation σ and time ratio Time %, the numbers of true positives (TP) and false positives (FP), and the precision p of pipeline Π1 on the English CoNLL-2003 dataset and the Revenue corpus for three degrees of filtering of the query Founded(Organization, Time).
On the Revenue corpus, the filter and time ratios are higher due to a larger
amount of time entities (which are produced first by Π1 ). Still, the use of an input
control saves more than half of the run-time t, when performing filtering on the
sentence level. Even for simple binary relation types like Founded and even without
employing any computationally expensive algorithm, the efficiency potential of fil-
tering hence becomes obvious. At the same time, the numbers of found true positives
in Table 3.4 (37 in total, 27 within paragraphs, 14 within sentences) suggest that the
use of an input control provides an intuitive means to trade the efficiency of a pipeline
for its recall, whereas precision remains quite stable.
Optimization of Run-Time Efficiency. In Sect. 3.4, we claim that it is possible to
optimize the efficiency of a pipeline through an input control without losing effective-
ness by specifying degrees of filtering that match the text unit levels of the employed
algorithms. For demonstration, we assign the same degrees of filtering as above to
the query γ2 = Forecast(Anchor, Time). Each of the three resulting scoped queries
is then addressed using the pipeline Π2 = (spa, sse, eti, sto2 , tpo2 , rfo) on the
Revenue corpus. As stated, rfo operates on the sentence level only.
Table 3.5 offers evidence for the truth of our claim: Under all three degrees of filtering, Π2 extracts the same 3622 forecast relations from the 33,364 sentences in the Revenue corpus. Although more than every tenth sentence is hence classified as being relevant with respect to γ2 , the filter ratio is reduced to 64.8 %.24
24 In Table 3.5, the number of characters for Paragraph is higher than for Text (20.02 M as opposed to
19.14 M), which seems counterintuitive. The reason is that the degree of filtering Paragraph
requires an additional application of the algorithm spa. A respective non-filtering pipeline for the
paragraph level actually processes 22.97 million characters.
Table 3.5 The number of processed characters in millions with filter ratio Filter %, the run-time t
in seconds with standard deviation σ and time ratio Time %, and the number of positives (in terms of
extracted relations) of pipeline Π2 for the query γ2 = Forecast(Anchor, Time) under three degrees
of filtering on the Revenue corpus.
Performing filtering in such a way forms the basis of our approaches to pipeline
scheduling that we develop in Chap. 4. Also, related approaches like Shen et al.
(2007) rely on similar concepts. Here, the input control improves the run-time of Π2
by a factor of almost two, thus emphasizing its great potential for efficiency optimization.
Impact of the Complexity of the Query. Finally, we analyze the benefit and com-
putational effort of filtering on the Revenue corpus under increasing complexity
of the addressed query. For this purpose, we consider γ1∗ from the first experiment
as well as the following scoped queries:
γ3∗ = Paragraph[Financial(Money, Sentence[γ2 ])] γ4∗ = γ1∗ ∨ γ3∗
For γ1∗ , we employ Π1 again, whereas we use the following pipelines Π3 and Π4 to
address γ3∗ and γ4∗ , respectively:
Π3 = (spa, sse, emo, eti, sto2 , tpo2 , rfo, rfi)
Π4 = (spa, sse, emo, eti, sto2 , tpo2 , rfo, rfi, pch, ene, rfu)
In Table 3.6, we list the efficiency results and the numbers of positives for the
three queries. While the time ratios get slightly higher under increasing query com-
plexity (i.e., from γ1∗ to γ4∗ ), the input control saves over 50 % of the run-time of a
standard pipeline in all cases. At the same time, up to 1760 relations are extracted
from the Revenue corpus (2103 relations without filtering). While the longest
pipeline (Π4 ) processes the largest number of characters (24.40 million), the filter
ratio of Π4 (57.9 %) rather appears to be the “weighted average” of the filter ratios
of Π1 and Π3 .
For a more exact interpretation of the results of γ4∗ , Fig. 3.21 visualizes the filter
ratios of all algorithms in Π4 . As shown, the interpolated curve does not decline monotonically along the pipeline. Rather, the filter ratios depend on what portions of text are relevant for which conjunctions in γ4∗ , which follows from the dependency graph of γ4∗ (cf. Fig. 3.17(a)). For instance, the algorithm rfo precedes the algorithm pch, but entails a lower filter ratio (28.9 % vs. 42 %). rfo needs to analyze the portions of text in the scope of γ3∗ only. According to the schedule of Π4 , this
means all sentences with a time entity in paragraphs that contain a money entity.
Fig. 3.21 Interpolated curve of the filter ratios of the eleven algorithms in the pipeline Π4 for the
scoped query γ4∗ = γ1∗ ∨ γ3∗ on the Revenue corpus.
Table 3.6 The number of processed characters with filter ratio Filter % as well as the run-time t in
seconds with standard deviation σ and time ratio Time % of Π1 , Π3 , and Π4 on the Revenue corpus
under increasingly complex queries γ ∗ . Each run-time is broken down into the times spent for text
analysis and for input control. In the right-most column, the positives are listed, i.e., the number of
extracted relations.
In contrast, pch processes all sentences with time entities, as it produces a prede-
cessor type required by ene, which is relevant for the scope of γ1∗ .
Besides the efficiency impact of input control, Table 3.6 also provides insights
into the efficiency of our implementation. In particular, it opposes the analysis time
of each pipeline (i.e., the overall run-time of the employed text analysis algorithms)
to the control time (i.e., the overall run-time of the input control). In case of γ1∗ , the
input control takes 1.0 % of the total run-time (0.7 of 74.9 s). This fraction grows
only marginally under increasing query complexity, as the control times of γ3∗ and
γ4∗ suggest. While our implementation certainly leaves room for optimizations, we
thus conclude that the input control can be operationalized efficiently.
As summarized in Sect. 2.4, the idea of performing filtering to improve the efficiency
of text analysis is not new. However, unlike existing approaches such as the prediction
of sentences that contain relevant information (Nedellec et al. 2001) or the fuzzy
matching of queries and possibly relevant portions of text (Cui et al. 2005), our
proposed input control does not rely on vague statistical models. Instead, we formally
infer the relevance of a portion of text from the current knowledge.
In general, the goal of equipping a text analysis pipeline with an input control is an
optimal pipeline execution. Following Sect. 3.1, this means to fulfill an information
need on a collection or a stream of input texts in the most run-time efficient manner.
In this section, we have proven that a pipeline Π = A, π based on our input control
approach analyzes only possibly relevant portions of text in each step. Given that A
and π are fixed, such an execution is optimal, because no performed analysis can be
omitted without possibly missing some information sought for.
Our evaluation has offered evidence that we can optimize the efficiency of a
pipeline using an input control while not influencing the pipeline's effectiveness. At the same time, the overhead induced by maintaining the relevant portions of text is low. Even in text analysis tasks that hardly lend themselves to filtering, such as most text classification tasks, an input control will hence usually have few negative effects. We
therefore argue that, in principle, every text analysis pipeline can be equipped with
an input control without notable drawbacks. What we have hardly discussed here,
though, is that our current implementation based on Apache UIMA still requires
small modifications of each text analysis algorithm (cf. Appendix B.2 for details).
To overcome this issue, future versions of Apache UIMA could directly integrate
the maintenance of scopes in the common analysis structure.
While the exact efficiency potential of filtering naturally depends on the amount
of relevant information in the given input texts, our experimental results suggest that
filtering can significantly speed up text analysis pipelines. Moreover, the specification
of degrees of filtering provides a means to easily trade the efficiency of a pipeline
for its effectiveness. This tradeoff is highly important in today’s and tomorrow’s text
mining scenarios, as we sketch in the concluding section of this chapter.
As we have seen in the previous section, approaching text analysis as a filtering task
provides a means to trade efficiency for effectiveness within ad-hoc text mining. We
now extend the analysis of this tradeoff by discussing the integration of different
filtering approaches. Then, we conclude with the important observation that filtering
governs what an optimal solution to the pipeline scheduling problem raised in Sect. 3.1 looks like.
As surveyed in Sect. 2.4, our input control is not the first approach that filters possibly
relevant portions of text. The question is to what extent existing approaches integrate with ours. Especially in time-critical ad-hoc text mining applications like question
answering, returning a precise result is usually of higher importance than achieving
high recall (and, thus, high overall effectiveness), which enables great improvements
of run-time efficiency. To this end, only promising candidate passages (i.e., para-
graphs or the like) are retrieved in the first place, from which relevant information
to answer the question at hand is then extracted (Cui et al. 2005).
A study of Stevenson (2007) suggests that most extraction algorithms operate on
the sentence level only, while related information is often spread across passages.
Under the above-motivated assumption that extraction is easier on smaller portions
of text, precision is hence preferred over recall again. In terms of the filtering view
from Sect. 3.4, this makes Sentence the most important degree of filtering and it
directly shows that passage retrieval techniques should often be integrable with our
input control: As long as the size of candidate passages exceeds the specified degrees
of filtering, relevance can be maintained for each portion of an input passage just as
described above. Therefore, we decided not to evaluate passage retrieval against our
approach. Also, we leave the integration for future work.
Besides the filtering of portions of text, the efficiency and effectiveness of pipelines
can also be influenced by filtering complete texts or documents that meet certain
constraints, while discarding others. In Sect. 2.4, we have already pointed out that
such kind of text filtering has been applied since the early times in order to determine
candidate texts for information extraction. As such, text filtering can be seen as a
regular text classification task.
Usually, the classification of candidate texts and the extraction of relevant infor-
mation from these texts are addressed in separate stages of a text mining applica-
tion (Cowie and Lehnert 1996; Sarawagi 2008). However, they often share com-
mon text analyses, especially in terms of preprocessing, such as tokenization or
part-of-speech tagging. Sometimes, features for text classification are also based
on information types like entities, as holds e.g. for the main approach in our project
ArguAna (cf. Sect. 2.3) as well as for related works like Moschitti and Basili (2004).
Given that the two stages are separated, all common text analyses are performed
twice, which increases run-time and produces redundant or inconsistent output.
To address these issues, Beringer (2012) has analyzed the integration of text
classification and information extraction pipelines experimentally in his master’s
thesis written in the context of the book at hand. In particular, the master’s the-
sis investigates the hypothesis that the later filtering is performed within an inte-
grated pipeline, the higher its effectiveness but the lower its efficiency will be (and
vice versa).
While existing works implicitly support this hypothesis, they largely focus
on effectiveness, such as Lewis and Tong (1992) who compare text filtering at
three positions in a pipeline. In contrast, Beringer (2012) explicitly evaluates the
Fig. 3.22 Illustration of the effectiveness of (a) filtering candidate input texts and (b) extracting forecasts from these texts in comparison to the run-time in seconds of the integrated text analysis pipeline Π∗2,lfa , depending on the position of the text filtering algorithm clf in Π∗2,lfa . The figure is based on results from Beringer (2012).
25 As shown in Fig. 3.22(a), the accuracy is already close to its maximum when clf is integrated
after sto2 , i.e., when token-based features are available, such as bag-of-words, bigrams, etc. So,
more complex features are not really needed in the end, which indicates that the classification of
language functions is comparably easy on the given input texts.
26 While the extraction precision remains unaffected by the position of integration in the experiment, this is primarily due to the lack of false positives in the LFA-11 corpus only.
The observed results indicate that integrating text filtering and text analysis pro-
vides another means to trade efficiency for effectiveness. As in our experiments, the
relevant portions of the filtered texts can then be maintained by our input control. We
do not analyze the integration in detail in this book. However, we point out that the
input control does not prevent text filtering approaches from being applicable, as long
as it does not start to restrict the input of algorithms before text filtering is finished.
Otherwise, less and differently distributed information is given for text filtering, which
can cause unpredictable changes in effectiveness, cf. Beringer (2012).
Aside from the outlined tradeoff, the integration of the two stages generally
improves the efficiency of text mining. In particular, the more text analyses are shared by the stages, the more redundant effort can be avoided. For instance, the pipeline Π(b)2,lfa requires 19.1 s in total when clf is scheduled after tpo2 , as shown in Fig. 3.22. Separating text filtering and text analysis would require executing the first four algorithms of Π(b)2,lfa twice on all filtered texts, hence taking a proportional amount
of additional time (except for the time taken by clf itself). The numeric efficiency
impact of avoiding redundant operations has not been evaluated in Beringer (2012).
In the end, however, the impact depends on the schedule of the employed algorithms
as well as on the fraction of relevant texts and relevant information in these texts,
which leads to the concluding remark of this chapter.
In Sect. 3.1, we have defined the pipeline optimization problem as two-tiered, con-
sisting of (1) the selection of a set of algorithms that is optimal with respect to
some quality function and (2) the determination of a run-time optimal schedule of the algorithms. While it should be clear by itself that different algorithm sets vary in terms of efficiency and effectiveness, we have so far only answered implicitly why different schedules vary in their run-time. The reason can be inferred directly from
the theory of ideal pipeline design in Sect. 3.1, namely, the optimization potential of
scheduling emanates solely from the insertion of filtering steps.
By now, we have investigated the efficiency impact of consistently filtering the
relevant portions of input texts under the prerequisite that the employed text analy-
sis pipeline Π = A, π is fixed. In accordance with the lazy evaluation step from
Sect. 3.1, the later an algorithm from A is scheduled in π , the fewer filtered portions of text it has to process, in general. Since the algorithms in A have different run-times
and different selectivities (i.e., they filter different portions of text), the schedule π
hence affects the overall efficiency of Π . This gives rise to the last step in Sect. 3.1,
i.e., to find an optimal scheduling that minimizes Eq. 3.4.
However, both the run-times and the selectivities of the algorithms are not prede-
fined, but they depend on the processed input. Under certain circumstances, it might
be reasonable to assume that the run-times behave proportionally (we come back to
this in Sect. 4.3). In contrast, the selectivities strongly diverge on different collections
or streams of input texts (cf. Fig. 3.20 in Sect. 3.5). Therefore, an optimal schedul-
ing cannot be found ad-hoc in the sense implied so far in this chapter, i.e., without
processing input texts but based on the text analysis task to be addressed only. As a
consequence, we need to integrate the use of an input control with a mechanism that
determines a run-time optimal schedule for the input texts at hand. This is the main
problem tackled in the following chapter.
Chapter 4
Pipeline Efficiency
A man who dares to waste one hour of time has not discovered
the value of life.
– Charles Darwin
Fig. 4.1 Abstract view of the overall approach of this book (cf. Fig. 1.5). All sections of Chap. 4
contribute to the design of large-scale text analysis pipelines.
As defined in Sect. 3.1, the last step of an ideal pipeline design is to optimally schedule
the employed text analysis algorithms. In this section, we present an extended version
of content from Wachsmuth and Stein (2012), where we compute the solution to an
optimal scheduling using dynamic programming (Cormen et al. 2009). Given an input
control as introduced in Sect. 3.5, the most efficient pipeline follows from the run-
times and processed portions of text of the scheduled algorithms. These values must be measured beforehand, which will often make the solution too expensive in practice.
Still, it reveals the properties of pipeline scheduling and it can be used to compute
benchmarks for large-scale text mining.
Fig. 4.2 (a) Venn diagram representation of a sample text with ten sentences, among which one is a forecast that contains a money and an organization entity. (b) The sentences that need to be processed by each text analysis algorithm in the pipelines ΠMOF (top) and ΠFOM (bottom), respectively.
them contain organization entities. Only one of these also spans an organization entity
and, so, contains all information sought for. Figure 4.2(a) represents such an article as
a Venn diagram. To tackle the task, assume that three algorithms A M , A O , and A F for
the recognition of money entities, organization entities, and forecast events are given
that have no interdependencies, meaning that all possible schedules are admissible.
For simplicity, let A M always take t (A M ) = 4 ms to process a single sentence, while
A O and A F need t (A O ) = t (A F ) = 5 ms. Without an input control, each algorithm
must process all ten sentences, resulting in the following run-time t (Πno filtering ) of a respective pipeline:

t (Πno filtering ) = 10 · t (A M ) + 10 · t (A O ) + 10 · t (A F ) = 140 ms
Now, given an input control that performs filtering on the sentence level, it
may appear reasonable to apply the fastest algorithm A M first, e.g. in a pipeline
ΠMOF = (A M , A O , A F ). This is exactly what our method greedyPipelineLin-
earization from Sect. 3.3 does. As a result, A M is applied to all ten sentences,
A O to the six sentences with money entities (assuming all entities are found), and
A F to the four with money and organization entities (accordingly), as illustrated at
the top of Fig. 4.2(b). Hence, we have:
t (ΠMOF ) = 10 · t (A M ) + 6 · t (A O ) + 4 · t (A F ) = 90 ms
Thus, the input control achieves an efficiency gain of 50 ms when using ΠMOF .
However, in an according manner, we compute the run-time of an alternative pipeline
ΠFOM = (A F , A O , A M ), based on the respective number of processed sentences as (cf.
bottom of Fig. 4.2(b)):
t (ΠFOM ) = 10 · t (A F ) + 2 · t (A O ) + 1 · t (A M ) = 64 ms
As can be seen, ΠMOF takes over 40 % more time than ΠFOM to process the article,
even though its first algorithm is 25 % faster. Apparently, the efficiency gain of using
an input control does not depend only on the algorithms employed in a pipeline, but
also on the pipeline’s schedule, which influences the algorithms’ selectivities, i.e.,
the numbers of portions of text filtered after each algorithm application (cf. Sect. 3.1).
The efficiency potential of pipeline scheduling hence corresponds to the maximum
possible impact of the input control.
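The arithmetic of the example can be retraced in a few lines of Java; the sentence counts and per-sentence run-times are the ones given above, and nothing is measured here.

// Recomputes the run-times of the three pipelines from the example above.
final class SchedulingExample {
  public static void main(String[] args) {
    int tM = 4, tO = 5, tF = 5;                        // run-time per sentence in ms
    int noFiltering = 10 * (tM + tO + tF);             // every algorithm processes all 10 sentences
    int mof = 10 * tM + 6 * tO + 4 * tF;               // Π_MOF = (A_M, A_O, A_F)
    int fom = 10 * tF + 2 * tO + 1 * tM;               // Π_FOM = (A_F, A_O, A_M)
    System.out.printf("no filtering: %d ms, MOF: %d ms, FOM: %d ms%n", noFiltering, mof, fom);
    // Prints: no filtering: 140 ms, MOF: 90 ms, FOM: 64 ms
  }
}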
So, optimal scheduling consists in the determination of an admissible sched-
ule π ∗ of a given algorithm set A that minimizes Eq. 3.4 from Sect. 3.1, i.e., the sum
of the run-times of all algorithms in A on the portions of text they process. This
minimization problem is governed by two paradigms: (1) Algorithms with a small
run-time should be scheduled early. (2) Algorithms with a small selectivity should
be scheduled early. Due to the exemplified recurrent structure of the run-times and
selectivities, however, these paradigms cannot be followed independently, but they
require a global analysis. In the following, we perform such an analysis with dynamic
programming. Dynamic programming refers to a class of algorithms that aim to effi-
ciently find solutions to problems by dividing the problems into smaller subproblems
and by solving recurring subproblems only once (Cormen et al. 2009).
According to Eq. 3.4 and to the argumentation above, all admissible pipelines based
on an algorithm set A entail the same relevant portions of an input text D while
possibly requiring different run-times for processing D. To model these run-times,
we consider a pipeline Π ( j) = (A1 , . . . , A j ) with j algorithms. For j = 1, Π ( j) =
(A1 ) must always process the whole input text, which takes t1 (D) time. For j > 1, the
run-time t (Π ( j) ) of Π ( j) is given by the sum of (1) the run-time t (Π ( j−1) ) of Π ( j−1)
on D and (2) the run-time t j (S(Π ( j−1) )) of A j on the scope S(Π ( j−1) ) associated
to Π ( j−1) . Here, we reuse the concept of scopes from Sect. 3.4 to refer to the portions
of texts relevant after applying Π ( j−1) . We can define t (Π ( j) ) recursively as:
t (Π ( j) ) =   t1 (D)                                    if j = 1
               t (Π ( j−1) ) + t j (S(Π ( j−1) ))          otherwise        (4.1)
This recursive definition resembles the one used by the Viterbi algorithm,
which operates on hidden Markov models (Manning and Schütze 1999). A hidden
Markov model describes a statistical process as a sequence of states. A transition
from one state to another is associated to some state probability. While the states
are not visible, each state produces an observation with an according probability.
Hidden Markov models have the Markov property, i.e., the probability of a future
state depends on the current state only. On this basis, the Viterbi algorithm
computes the Viterbi path, which denotes the most likely sequence of states for a
given sequence of observations.
We adapt the Viterbi algorithm for scheduling an algorithm set A, such that the
Viterbi path corresponds to the run-time optimal admissible pipeline Π ∗ = A, π ∗
on an input text D. As throughout this book, we restrict our view to pipelines where
no algorithm processes D multiple times. Also, under admissibility, only algorithms
with fulfilled input constraints can be executed (cf. Sect. 3.1). Putting both together,
we call an algorithm Ai applicable at some position j in a pipeline’s schedule, if Ai
has not been applied at positions 1 to j −1 and if all input types Ai .C(in) of Ai are
produced by the algorithms at positions 1 to j −1 or are already given for D.
To compute the Viterbi path, the original Viterbi algorithm determines the
most likely sequence of states for each observation and possible state at that position
in an iterative (dynamic programming) manner. For our purposes, we let states of
the scheduling process correspond to the algorithms in A, while each observation
denotes the position in a schedule.1 According to the Viterbi algorithm, we then
propose to store a pipeline Πi( j) from 1 to j for each combination of a position j ∈ {1, . . . , m} and an algorithm Ai ∈ A that is applicable at position j. To this end, we determine the set Π ( j−1) with all those previously computed pipelines of length j − 1, after which Ai is applicable. The recursive function to compute the run-time of Πi( j) can be directly derived from Eq. 4.1:
t (Πi( j) ) =   ti (D)                                               if j = 1
               min Πl ∈ Π ( j−1) ( t (Πl ) + ti (S(Πl )) )            otherwise        (4.2)
The scheduling process does not have the Markov property, as the run-time t (Πi( j) ) of an algorithm Ai ∈ A at some position j depends on the scope it is executed on. Thus, we need to keep track of the values t (Πi( j) ) and S(Πi( j) ) for each pipeline Πi( j) during
the computation process. After computing all pipelines based on the full algorithm
set A = {A1 , . . . , Am }, the optimal schedule π ∗ of A on the input text D is the one
of the pipeline Πi(m) with the lowest run-time t (Πi(m) ).
Pseudocode 4.1 shows our Viterbi algorithm adaptation. A pipeline Πi(1) is
stored in lines 1–5 for every algorithm Ai ∈ A that is already applicable given D only.
The run-time t (Πi(1) ) and the scope S(Πi(1) ) of Πi(1) are set to the respective values
1 Different from Wachsmuth and Stein (2012), we do not explicitly define the underlying model here, for a more focused presentation. The adaptation works even without the model.
optimalPipelineScheduling({A1 , . . . , Am }, D)
 1: for each i ∈ {1, . . . , m} do
 2:    if Ai is applicable in position 1 then
 3:       Pipeline Πi(1) ← (Ai )
 4:       Run-time t (Πi(1) ) ← ti (D)
 5:       Scope S(Πi(1) ) ← Si (D)
 6: for each j ∈ {2, . . . , m} do
 7:    for each i ∈ {1, . . . , m} do
 8:       Pipelines Π ( j−1) ← {Πl( j−1) | Ai is applicable after Πl( j−1) }
 9:       if Π ( j−1) ≠ ∅ then
10:          Pipeline Πk( j−1) ← arg min Πl ∈ Π ( j−1) ( t (Πl ) + ti (S(Πl )) )
11:          Pipeline Πi( j) ← Πk( j−1) (Ai )
12:          Run-time t (Πi( j) ) ← t (Πk( j−1) ) + ti (S(Πk( j−1) ))
13:          Scope S(Πi( j) ) ← Si (S(Πk( j−1) ))
14: return arg min Πi(m) , i ∈ {1,...,m} t (Πi(m) )
Pseudocode 4.1: The optimal solution to the computation of a pipeline based on an algorithm set
{A1 , . . . , Am } with a run-time optimal schedule on an input text D.
of Ai . Next, for each algorithm Ai that is applicable at all in position j, lines 6–13 incrementally compute a pipeline Πi( j) of length j. Here, the set Π ( j−1) is computed in line 8. If Π ( j−1) is not empty (which implies the applicability of Ai ), lines 9–11 then create Πi( j) by appending Ai to the pipeline Πk( j−1) that is best in terms of Eq. 4.2.2 In lines 12 and 13, the run-time and the scope are computed accordingly for Πi( j) . After the final iteration, the fastest pipeline Πi(m) of length m is returned as an optimal solution (line 14). A trellis diagram that schematically illustrates the described operations for Ai at position j is shown in Fig. 4.3.
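To make the adaptation more tangible, the following Java sketch mirrors Pseudocode 4.1 under simplifying assumptions: Algorithm, Scope, and Result are hypothetical stand-ins, and runOn is assumed to execute an algorithm on a scope while measuring its run-time and the scope that remains relevant afterwards. The sketch illustrates the dynamic programming scheme only; it is not the implementation used in the experiments below.

import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class PipelineScheduling {

  interface Scope { }                                   // relevant portions of the input text

  interface Algorithm {
    Set<String> inputTypes();                           // C(in)
    Set<String> outputTypes();                          // C(out)
    Result runOn(Scope scope);                          // execute and measure on a scope
  }

  static final class Result {                           // measured run-time and resulting scope
    final double time; final Scope scope;
    Result(double time, Scope scope) { this.time = time; this.scope = scope; }
  }

  static final class Candidate {                        // a pipeline Πi(j) with its bookkeeping
    final List<Algorithm> pipeline; final double time; final Scope scope;
    Candidate(List<Algorithm> p, double t, Scope s) { pipeline = p; time = t; scope = s; }
  }

  static List<Algorithm> optimalPipelineScheduling(List<Algorithm> algorithms, Scope text) {
    int m = algorithms.size();
    Candidate[] candidates = new Candidate[m];          // best pipeline of the current length per last algorithm

    for (int i = 0; i < m; i++) {                       // lines 1-5: pipelines of length 1
      Algorithm a = algorithms.get(i);
      if (applicable(a, Collections.emptyList())) {
        Result r = a.runOn(text);
        candidates[i] = new Candidate(List.of(a), r.time, r.scope);
      }
    }
    for (int j = 2; j <= m; j++) {                      // lines 6-13: extend to length j
      Candidate[] next = new Candidate[m];
      for (int i = 0; i < m; i++) {
        Algorithm a = algorithms.get(i);
        Candidate best = null; Result bestRun = null;
        for (Candidate c : candidates) {                // line 8: predecessors after which a is applicable
          if (c == null || c.pipeline.contains(a) || !applicable(a, c.pipeline)) continue;
          Result r = a.runOn(c.scope);                  // measure t_i(S(Π_l)) on this predecessor's scope
          if (best == null || c.time + r.time < best.time + bestRun.time) {
            best = c; bestRun = r;                      // line 10: arg min
          }
        }
        if (best != null) {                             // lines 11-13: append a and keep track of the values
          List<Algorithm> p = new ArrayList<>(best.pipeline);
          p.add(a);
          next[i] = new Candidate(p, best.time + bestRun.time, bestRun.scope);
        }
      }
      candidates = next;
    }
    Candidate opt = null;                               // line 14: fastest full pipeline
    for (Candidate c : candidates) {
      if (c != null && (opt == null || c.time < opt.time)) opt = c;
    }
    return opt == null ? List.of() : opt.pipeline;
  }

  // Ai is applicable if it has not been applied yet and all of its input types are produced
  // by the algorithms scheduled before it (types already given for D could be added here).
  static boolean applicable(Algorithm a, List<Algorithm> before) {
    Set<String> available = new HashSet<>();
    for (Algorithm b : before) available.addAll(b.outputTypes());
    return available.containsAll(a.inputTypes());
  }
}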
In Wachsmuth and Stein (2012), we have sketched the basic idea of how to prove
that the presented scheduling approach computes an optimal solution. Now, we pro-
vide a formal proof of the correctness of the approach. Then, we continue with its
complexity and with practical implications.
Correctness. In the proof, we consider only algorithm sets that are consistent and
that have no circular dependencies, as defined in Sect. 3.3. These properties ensure
that an admissible schedule exists for a given algorithm set A = {A1 , . . . , Am }. As
usual in dynamic programming, the optimality of the pipeline Πi(m) returned by
optimalPipelineScheduling for A and for an input text D then follows from the
optimal solution to all subproblems. Here, this means the optimality of all computed pipelines Πi( j) , because Πi(m) is then optimal for the full algorithm set A. The following lemma states that each Πi( j) is run-time optimal under all admissible pipelines that are based on the same algorithm set and that end with the same algorithm:
Lemma 4.1. Let A = {A1 , . . . , Am } be a consistent algorithm set without circular dependencies and let D be a text. Let Πi( j) = Ai( j) , π ∗ = (A1 , . . . , A j−1 , Ai ), Ai( j) ⊆ A, be any pipeline that optimalPipelineScheduling(A, D) computes. Then for all admissible pipelines Π′i( j) = Ai( j) , π = (A′1 , . . . , A′j−1 , Ai ), we have:

t (Πi( j) ) ≤ t (Π′i( j) )
Proof. We show the lemma by induction over the length j. For j = 1, each
pipeline Πi(1) created in line 3 of optimalPipelineScheduling consists only of
the algorithm Ai . As there is only one pipeline of length 1 that ends with Ai , for all
i ∈ {1, . . . , m}, Πi(1) = Π′i(1) holds and, so, Πi(1) is optimal. Therefore, we hypothesize that the lemma is true for an arbitrary but fixed length j−1. We prove by contradiction that, in this case, it also holds for j. For this purpose, we assume the opposite:

∃ Π′i( j) = Ai( j) , π = (A′1 , . . . , A′j−1 , Ai ) :  t (Πi( j) ) > t (Π′i( j) )
2 Different from the pseudocode in Wachsmuth and Stein (2012), we explicitly check here if Π ( j−1)
is not empty. Apart from that, the pseudocodes differ only in terms of namings.
Proof. We first point out the admissibility of each text analysis pipeline Πi(m) returned
by Pseudocode 4.1. Both in line 3 and in line 11, only applicable algorithms are
added to the end of the created pipelines Πi(m) . By definition of applicability (see
above), all input requirements of an applicable algorithm are fulfilled. Therefore, all
pipelines Πi(m) must be admissible.
As a consequence, Lemma 4.1 implies that the pipeline computed in the last
iteration of lines 6–13 of optimalPipelineScheduling is run-time optimal under
all admissible pipelines of length m that end with Ai . Since no algorithm is added
twice under applicability, each Πi(m) contains the complete algorithm set A. The
fastest of these pipelines is taken in line 14.
Complexity. Knowing that the approach is correct, we now come to its computa-
tional complexity. As in Chap. 3, we rely on the O-notation (Cormen et al. 2009) to
capture the worst-case run-time of the approach. Given an arbitrary algorithm set A
and some input text D, Pseudocode 4.1 iterates exactly |A| times over the |A| algo-
rithms, once for each position in the schedule to be computed. In each of these |A|² loop iterations, a pipeline Πi( j) is determined based on at most |A|−1 other pipelines Πl( j−1) , resulting in O(|A|³) operations. For each Πi( j) , the run-time t (Πi( j) ) and the scope S(Πi( j) ) are stored. In practice, these values are not known beforehand, but they need to be measured when executing Πi( j) on its input. In the worst case, all
algorithms in A have an equal run-time tA (D) on D and they find relevant infor-
mation in all portions of text, i.e., Si (D) = D for each algorithm Ai ∈ A. Then, all
algorithms must indeed process the whole text D, which leads to an overall upper
bound of
t optimalPipelineScheduling (A, D) = O(|A|³ · tA (D)).        (4.3)
Moreover, while we have talked about a single text D up to now, a more reasonable
input will normally be a whole collection of texts in practice.3 This underlines that
our theoretical approach is not made for practical applications in the first place:
Since each algorithm employed in a text analysis pipeline Π = A, π processes
its input at most once, even without an input control the run-time of Π is bounded by O(|A| · tA (D)) only.
Besides the unrealistic nature of the described worst case, however, the value |A|³
in Eq. 4.3 ignores the fact that an algorithm is, by definition, applicable only once
within a schedule and only if its input requirements are fulfilled. Therefore, the
real cost of optimalPipelineScheduling will be much lower in practice. Also,
instead of scheduling all |A| single algorithms, the search space of possible schedules
can usually be significantly reduced by scheduling the filtering stages that result
from our method for ideal pipeline design (cf. Sect. 3.1). We give an example in
the following case study and then come back to the practical benefits of optimal-
PipelineScheduling in the discussion at the end of this section.
3 Notice that using more than one text as input does not require changes of optimalPipeline-
Scheduling, since we can simply assume that the texts are given in concatenated form.
4 Taken from an IBM annual report, http://ibm.com/investor/1q11/press.phtml, accessed on June
15, 2015.
control from Sect. 3.5, the information need can be formalized as the scoped query
γ ∗ = Sentence[Forecast(Time, Money, Organization)].5
Algorithm Set. To produce the output sought for in γ ∗ , we use the entity recognition
algorithms eti, emo, and ene as well as the forecast event detector rfo. To fulfill their
input requirements, we additionally employ three preprocessing algorithms, namely,
sse, sto2 , and tpo1 . All algorithms operate on the sentence level (cf. Appendix A
for further information). During scheduling, we implicitly apply the lazy evaluation
step from Sect. 3.1 to create filtering stages, i.e., each preprocessor is scheduled as
late as possible in all pipelines. Instead of filtering stages, we simply speak of the
algorithm set A = {eti, emo, ene, rfo} in the following without loss of generality.
Text Corpora. As in Sect. 3.5, we consider both our Revenue corpus (cf. Appen-
dix C.1) and the CoNLL-2003 dataset (cf. Appendix C.4). For lack of alternatives to
the employed algorithms, we rely on the German part of the CoNLL-2003 dataset
this time. We process only the training sets of the two corpora. These training sets
consist of 21,586 sentences (in case of the Revenue corpus) and 12,713 sen-
tences (CoNLL-2003 dataset), respectively.
Experiments. We run all pipelines Πi( j) , which are computed within the execution of optimalPipelineScheduling, ten times on both text corpora using a 2 GHz Intel Core 2 Duo MacBook with 4 GB memory. For each pipeline, we measure the averaged overall run-time t (Πi( j) ). In our experiments, all standard deviations were lower than 1.0 s on the Revenue corpus and 0.5 s on the CoNLL-2003 dataset. Below, we omit them for a concise presentation. For similar reasons, we state only the number of sentences in the scopes S(Πi( j) ) of all Πi( j) , instead of the
sentences themselves.
Input Dependency of Pipeline Scheduling. Figure 4.4 illustrates the application
of optimalPipelineScheduling for the algorithm set A to the two considered
corpora as trellis diagrams. The bold arrows correspond to the respective Viterbi
paths, indicating the optimal pipeline Πi( j) of each length j. Given all four algorithms,
the optimal pipeline takes 48.25 s on the Revenue corpus, while the one on the
CoNLL-2003 dataset requires 18.17 s. eti is scheduled first and ene is scheduled
last on both corpora, but the optimal schedule of emo and rfo differs. This shows
the input-dependency of run-time optimal scheduling.
5 We note here once that some of the implementations in the experiments in Chap. 4 do not use the exact input control approach presented in Chap. 3. Instead, the filtering of relevant portions of text is directly integrated into the employed algorithms. However, as long as only a single information need is addressed and only one degree of filtering is specified, there will be no conceptual difference in the obtained results.

Fig. 4.4 Trellis illustrations of executing optimalPipelineScheduling for the algorithm set A = {eti, emo, ene, rfo} on the training set of (a) the Revenue corpus and (b) the German CoNLL-2003 dataset. Below each pipeline Πi( j) , the run-time t (Πi( j) ) is given in seconds next to the number of sentences (snt.) in S(Πi( j) ). The bold arrows denote the Viterbi paths resulting in the run-time optimal pipelines (eti, rfo, emo, ene) and (eti, emo, rfo, ene), respectively.

One main reason lies in the selectivities of the employed text analysis algorithms: On the Revenue corpus, 3813 sentences remain relevant after applying Πemo(2) = (eti, emo) as opposed to 2294 sentences in case of Πrfo(2) = (eti, rfo). Conversely, only 82 sentences are filtered after applying Πemo(2) to the CoNLL-2003 dataset, whereas 555 sentences still need to be analyzed after Πrfo(2) . Altogether, each admissible pipeline Πi(4) based on the complete algorithm set A classifies the same 215 sentences (1.0 %) of the Revenue corpus as relevant, while not more than two sentences of the CoNLL-2003 dataset (0.01 %) are returned to address γ ∗ .6 These
values originate in different distributions of relevant information, as we discuss in
detail in Sect. 4.2.
While the algorithms’ selectivities impact the optimal schedule in the described
case, the efficiency of scheduling eti first primarily emanates from the low run-time
of eti. On the CoNLL-2003 dataset, for instance, the optimal pipeline Πene(4) schedules eti before emo, although far fewer sentences remain relevant after Πemo(1) = (emo) than after Πeti(1) = (eti). Applying eti and emo in sequence is even faster than apply-
ing emo only. This becomes important in case of parallelization, as we discuss in
Sect. 4.6. Even more clearly, ene alone takes 91.63 s on the Revenue corpus and
48.03 s on the CoNLL-2003 dataset, which underlines the general efficiency impact
of scheduling, when using an input control (cf. Sect. 3.5).
6 As can be seen, the numbers of sentences at position 4 do not vary between the different pipelines
for one corpus. This offers practical evidence for the commutativity of employing independent
algorithms within an admissible pipeline (cf. Sect. 3.1).
On the previous pages, we have stressed the efficiency potential of scheduling the
algorithms in a text analysis pipeline, which arises from equipping the pipeline with
the input control from Sect. 3.5. We have provided a theoretical solution to opti-
mal scheduling and, hence, to the second part of the pipeline optimization problem
defined in Sect. 3.1. Our approach optimizes the run-time efficiency of a given set of
algorithms without compromising their effectiveness by maximizing the impact of
the input control. It works irrespective of the addressed information need as well as
of the language and other characteristics of the input texts to be processed.
For information extraction, both Shen et al. (2007) and Doan et al. (2009) present
scheduling approaches that rely on similar concepts as our input control, such as
dependencies and distances between relevant portions of text. These works improve
efficiency based on empirically reasonable heuristics. While they have algebraic
foundations (Chiticariu et al. 2010a), both approaches are limited to rule-based text
analyses (cf. Sect. 2.4 for details). Our approach closes this gap by showing how to
optimally schedule any set of text analysis algorithms. However, it cannot be applied
prior to pipeline execution, as it requires to keep track of the run-times and relevant
portions of texts of all possibly optimal pipelines. These values must be measured,
which makes the approach expensive.
As such, the chosen dynamic programming approach is not meant to serve for
practical text mining applications, although it still represents an efficient way to com-
pute benchmarks for more or less arbitrary text analysis tasks.7 Rather, it clarifies
the theoretical background of empirical findings on efficient text analysis pipelines
in terms of the underlying algorithmic and linguistic determinants. In particular, we
have shown that the optimality of a pipeline depends on the run-times and selectiv-
ities of the employed algorithms on the processed input texts. In the next section,
we investigate the characteristics of collections and streams of input texts that influ-
ence pipeline optimality. On this basis, we then turn to the development of efficient
practical scheduling approaches.
The case study in the previous section already shows that the run-time optimality
of an input-controlled pipeline depends on the given collection or stream of input
texts. In the following, we provide both formal and experimental evidence that the
reason behind lies in the distribution of relevant information in the texts, which
7 Different from our input control, the dynamic programming approach covers only our basic sce-
nario from Sect. 1.2, where we seek for a single set of information types C. While relevant in practice,
we skip the possibly technically complex extension to more than one set for lack of expected con-
siderable insights. The integration of different pipelines sketched in the properties part of Sect. 3.3
indicates, though, what to pay attention to in the extension.
governs the portions of text filtered by the input control after each execution of an
algorithm (cf. Sect. 3.5). Consequently, the determination of an optimal schedule
requires to estimate the algorithms’ selectivities, which we approach in Sect. 4.3. As
above, this section reuses content from Wachsmuth and Stein (2012).
When we speak of relevant information in the book at hand, we always refer to infor-
mation that can be used to address one or more information needs, each defined as
a set of information types C. Accordingly, with the distribution of relevant informa-
tion, we mean the distribution of instances of each information type C ∈ C in the
input texts to be processed. More precisely, what matters in terms of efficiency is the
distribution of instances of all types in C found by some employed algorithm set,
because this information decides what portions of the texts are classified as relevant
and are, thus, filtered by the input control. We quantify this distribution in terms of
density as opposed to relative frequency:
The relative frequency of an information type C in a collection or a stream of input
texts D is the average number of instances of C found per portion of text in D (of
some specified text unit type, cf. Sect. 3.4). This frequency affects the efficiency of
algorithms that take instances of C as input. Although this effect is definitely worth
analyzing, in the given context we are primarily interested in the efficiency impact
of filtering. Instead, we therefore capture the density of C in D, which we define as
the fraction of portions of text in D in which instances of C are found.8
To illustrate the difference between frequency and density, assume that relations
of a type IsMarriedTo(Person, Person) shall be extracted from a text D with two
portions of text (say, sentences). Let three person names be found in the first sentence
and none in the second one. Then the type Person has a relative frequency of 1.5 in D
but a density of 0.5. The frequency affects the average number of candidate relations
for extraction. In contrast, the density implies that relation extraction needs to take
place on 50 % of all sentences only, which is what we are up to.
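The difference between the two measures can be stated in a few lines of Java; the counts per sentence are the ones from the IsMarriedTo example, and everything else is illustrative.

import java.util.List;

final class Distribution {

  // Relative frequency: average number of instances of a type C per portion of text.
  static double relativeFrequency(List<Integer> instancesPerPortion) {
    int total = instancesPerPortion.stream().mapToInt(Integer::intValue).sum();
    return (double) total / instancesPerPortion.size();
  }

  // Density: fraction of portions of text that contain at least one instance of C.
  static double density(List<Integer> instancesPerPortion) {
    long covered = instancesPerPortion.stream().filter(n -> n > 0).count();
    return (double) covered / instancesPerPortion.size();
  }

  public static void main(String[] args) {
    List<Integer> personsPerSentence = List.of(3, 0);           // three person names, then none
    System.out.println(relativeFrequency(personsPerSentence));  // 1.5
    System.out.println(density(personsPerSentence));            // 0.5
  }
}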
Now, consider the general case that some text analysis pipeline Π = A, π is
given to address an information need C on a text D. The density of each information
type from C in D directly governs what portions of D are filtered by our input control
after the execution of an algorithm in A. Depending on the schedule π , the resulting
run-times of all algorithms on the filtered portions of text then sum up to the run-time
of Π . Hence, it might seem reasonable to conclude that the fraction of those portions
of D, which pipeline Π classifies as relevant, impacts the run-time optimality of π . In
fact, however, the optimality depends on the portions of D classified as not relevant,
as follows from Theorem 4.2:
8 Besides, an influencing factor of efficiency is the length of the portions of text, of course. However,
we assume that, on average, all relative frequencies and densities equally scale with the length.
Consequently, both can be seen as an implicit model of length.
Theorem 4.2. Let Π ∗ = A, π ∗ be run-time optimal on a text D under all admissible
text analysis pipelines based on an algorithm set A, and let S(D) ⊆ D denote a scope
containing all portions of D classified as relevant by Π ∗ . Let S(D′) ⊆ D′ denote the portions of any other input text D′ classified as relevant by Π ∗ . Then Π ∗ is also run-time optimal on (D \ S(D)) ∪ S(D′).
Proof. In the proof, we denote the run-time of a pipeline Π on an arbitrary scope S
as tΠ (S). By hypothesis, the run-time tΠ ∗ (D) of Π ∗ = A, π ∗ is optimal on D, i.e.,
for all admissible pipelines Π = A, π , we have

tΠ ∗ (D) ≤ tΠ (D).        (4.4)
As known from Sect. 3.1, for a given input text D, all admissible text analysis pipelines based on the same algorithm set A classify the same portions S(D) ⊆ D as relevant. Hence, each portion of text in S(D) is processed by every algorithm in A, irrespective of the schedule of the algorithms. So, the run-times of two pipelines Π1 = ⟨A, π1⟩ and Π2 = ⟨A, π2⟩ on S(D) are always equal (at least on average):

tΠ1(S(D))  =  tΠ2(S(D))     (4.5)
Since we do not put any constraints on D, Eq. 4.5 also holds for D′ instead of D. Combining Eq. 4.5 with Ineq. 4.4 and the fact that a pipeline's run-time is additive over disjoint scopes, we can thus derive the correctness of Theorem 4.2 as follows: for every admissible pipeline Π = ⟨A, π⟩,

tΠ∗((D \ S(D)) ∪ S(D′))  =  tΠ∗(D) − tΠ∗(S(D)) + tΠ∗(S(D′))
                         ≤  tΠ(D) − tΠ(S(D)) + tΠ(S(D′))  =  tΠ((D \ S(D)) ∪ S(D′))
Theorem 4.2 states that the portions of text S(D) classified as relevant by a text
analysis pipeline have no impact on the run-time optimality of the pipeline. Conse-
quently, differences in the efficiency of two admissible pipelines based on the same
algorithm set A must emanate from applying the algorithms in A to different numbers
of irrelevant portions of texts. We give experimental evidence for this conclusion in
the following.
In the following experiments, we analyze how the run-time of a text analysis pipeline and its possible run-time optimality behave under changing densities of the relevant information types.
Pipelines. In the experiments, we employ the same algorithm set as we did at the
end of Sect. 4.1. Again, we apply lazy evaluation in all cases, allowing us to consider
the four algorithms eti, emo, ene, and rfo only. Based on these, we investigate the
efficiency of two pipelines, Π1 and Π2 :
Π1 = (eti, rfo, emo, ene) Π2 = (emo, eti, rfo, ene)
As shown in the case study of Sect. 4.1, Π1 denotes the run-time optimal pipeline
for the information need C = {Forecast, Time, Money, Organization} on the training
set of the Revenue corpus. In contrast, Π2 represents an efficient alternative when
given texts with few money entities.
Text Corpora. To achieve different distributions of the set of relevant information
types C, we have created artificially altered versions of the training set of the Rev-
enue corpus. In particular, we have modified the original corpus texts by randomly
duplicating or deleting
(a) relevant sentences, which contain all relevant information types,
(b) irrelevant sentences, which miss at least one relevant type,
(c) irrelevant sentences, which contain money entities, but which miss at least
one other relevant type.
In cases (a) and (b), we created text corpora in which the density of the whole set C is 0.01, 0.02, 0.05, 0.1, and 0.2, respectively. In case (c), it is not possible to obtain densities higher than a little more than 0.021 from the training set of the Revenue corpus, because at that density all irrelevant sentences with money entities have already been deleted. Therefore, we restrict our view to five densities between 0.009 and 0.021 in that case.
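The alteration procedure can be sketched as follows. The sentence representation, the relevance test, and the function name are hypothetical simplifications that merely mimic the modifications described above; only duplication is shown, deletion works analogously:

```python
import random
from typing import Callable, List

def alter_to_density(sentences: List[str],
                     is_relevant: Callable[[str], bool],
                     target_density: float,
                     seed: int = 0) -> List[str]:
    """Randomly duplicate relevant or irrelevant sentences until the fraction of
    relevant sentences (the density of the whole set C) roughly matches the target."""
    rng = random.Random(seed)
    corpus = list(sentences)

    def density(c: List[str]) -> float:
        return sum(1 for s in c if is_relevant(s)) / len(c)

    # Density too low: duplicate random relevant sentences (cf. case (a)).
    while density(corpus) < target_density and any(is_relevant(s) for s in corpus):
        corpus.append(rng.choice([s for s in corpus if is_relevant(s)]))
    # Density too high: duplicate random irrelevant sentences (cf. cases (b) and (c)).
    while density(corpus) > target_density and any(not is_relevant(s) for s in corpus):
        corpus.append(rng.choice([s for s in corpus if not is_relevant(s)]))
    return corpus
```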
Experiments. We processed all created corpora ten times with both Π1 and Π2 on
a 2 GHz Intel Core 2 Duo MacBook with 4 GB memory. Due to the alterations, the
corpora differ significantly in size. For this reason, we compare the efficiency of the
pipelines in terms of the average run-times per sentence. Appendix B.4 outlines how
to reproduce the experiments.
The Impact of the Distribution of Relevant Information. Figure 4.5 plots the run-
times as a function of the density of C. In line with Theorem 4.2, Fig. 4.5(a) conveys that changing the number of relevant sentences does not influence the absolute difference between the run-times of Π1 and Π2.9 In contrast, the gap between the curves in
Fig. 4.5(b) increases proportionally under growing density, because the two pipelines
spend a proportional amount of time processing irrelevant portions of text. Finally,
the impact of the distribution of relevant information becomes explicit in Fig. 4.5(c):
Π1 is faster on densities lower than about 0.018, but Π2 outperforms Π1 under a
9 Minor deviations occur on the processed corpora, since we have changed the number of relevant sentences as opposed to the number of sentences that are classified as relevant.
[Fig. 4.5: three panels: (a) changing the number of relevant sentences, (b) changing the number of random irrelevant sentences, (c) changing the number of specific irrelevant sentences; y-axis: average run-time in ms / sentence; x-axis: density of relevant information types.]
Fig. 4.5 Interpolated curves of the average run-times of the pipelines Π1 = (eti, rfo, emo, ene) and Π2 = (emo, eti, rfo, ene) under different densities of the relevant information types in modified training sets of the Revenue corpus. The densities were created by duplicating or deleting (a) relevant sentences, (b) random irrelevant sentences, and (c) irrelevant sentences with money entities.
higher density (0.021).10 The reason for the change in optimality is that the more irrelevant sentences with money entities are deleted, the fewer portions of text are filtered after emo, which favors the schedule of Π2.
Altogether, we conclude that the distribution of relevant information can be deci-
sive for the optimal scheduling of a text analysis pipeline. While there are other
influencing factors, some of which trace back to the efficiency of the employed text
analysis algorithms (as discussed in the beginning of this section), we have can-
celled out many of these factors by only duplicating and deleting sentences from the
Revenue corpus itself.
For specific text analysis tasks, we have already exemplified the efficiency potential
of pipeline scheduling (Sect. 3.1) and the input dependency (Sect. 4.1). A general
quantification of the practical impact of the distribution of relevant information is
hardly possible, because it depends on the processed input texts and the employed
algorithms. As an indicator of its relevance, however, we offer evidence here that
distributions of information tend to vary much more significantly between different
collections and streams of texts than the run-times of text analysis algorithms.
10 The declining curves in Fig. 4.5(c) seem counterintuitive. The reason is that, in the news articles
of the Revenue corpus, sentences with money entities often contain other relevant information
like time entities, too. So, while duplicating irrelevant sentences of this kind reduces the density of
C, the average time to process these sentences is rather high.
[Fig. 4.6: bar chart of the densities (per sentence) of person, organization, and location entities in the Brown corpus, the English and German CoNLL-2003 datasets, a 10k sample of German Wikipedia articles, the Revenue corpus, and the LFA-11 smartphone corpus; the corpora range from diverse topics and genres over one-genre collections to focused topics with diverse genres.]
Fig. 4.6 Illustration of the densities of person, organization, and location entities in the sentences
of two English and four German collections of texts. All densities are computed based on the results
of the pipeline Πene = (sse, sto2 , tpo1 , pch, ene).
11 Notice that the employed named entity recognition algorithm, ene, has been trained on the CoNLL-2003 datasets, which partly explains the high densities in these corpora.
Table 4.1 The average run-time t in milliseconds per sentence and its standard deviation σ for
every algorithm in the pipeline Πene = (sse, sto2 , tpo1 , pch, ene) on each evaluated text corpus.
In the bottom line, the average of each algorithm’s run-time is given together with the standard
deviation from the average.
The average run-times of the single algorithms on the different corpora vary, e.g. between 0.042 ms and 0.059 ms in the case of sto2. However, the standard deviations of the run-times averaged over all corpora (bottom line of Table 4.1) all lie in an area of about 5 % to 15 % of the respective run-time. Compared to the measured densities in the corpora, these variations are evidently small, which means that the algorithms' run-times are only slightly affected by the processed input.12
So, to summarize, the evaluated text corpora and information types suggest that
the run-times of the algorithms used to infer relevant information tend to remain
rather stable. At the same time, the resulting distributions of relevant information
may completely change. In such cases, the practical relevance of the impact outlined
above becomes obvious, since, at least for algorithms with similar run-times, the
distribution of relevant information will directly decide the optimality of the schedule
of the algorithms.13
This section has made explicit that the distribution of relevant information in the
input texts processed by a text analysis pipeline (equipped with the input control
from Sect. 3.5) impacts the run-time optimal schedule of the pipeline’s algorithms.
12 Of course, there exist algorithms whose run-time per sentence will vary more significantly, namely, if the run-times scale highly overproportionally in the length of the processed sentences, such as for syntactic parsers. However, the algorithms evaluated here have at most quadratic complexity, indicating that the effect of sentence length is limited.
13 The discussed example may appear suboptimal in the sense that our entity recognition algorithm
ene relies on Stanford NER (Finkel et al. 2005), which classifies persons, locations, and orga-
nizations jointly and, therefore, does not allow for scheduling. However, we merely refer to the
standard entity types for illustration purposes, since they occur in almost every collection of texts.
Besides, notice that joint approaches generally conflict with the first step of an ideal pipeline design,
maximum decomposition, as discussed in Sect. 3.1.
This distribution can vary significantly between different collections and streams
of texts, as our experiments have indicated. In contrast, the average run-times of
the employed algorithms remain comparably stable. As a consequence, it seems reasonable to rely on run-time estimations of the algorithms when seeking an efficient schedule.
For given run-time estimations, we have already introduced a greedy pipeline
scheduling approach in Sect. 3.3, which sorts the employed algorithms (or filtering
stages, to be precise) by their run-time in ascending order. Our experimental results
in the section at hand, however, imply that the greedy approach will fail to construct a near-optimal schedule under certain conditions, namely, when the densities of the information types produced by faster algorithms are much higher than those of the types produced by slower algorithms. Since reasonable general estimates of the densities seem infeasible according to our analysis, we need a way to infer the distribution of relevant information from the given input texts in cases where run-time efficiency is of high importance (and, thus, optimality is desired).
A solution for large-scale text mining scenarios is to estimate the selectivities of
the employed algorithms from the results of processing a sample of input texts. For
information extraction, samples have been shown to suffice for accurate selectivity
estimations in narrow domains (Wang et al. 2011). Afterwards, we can obtain a
schedule that is optimal with respect to the estimations of both the run-times and
the selectivities using our adaptation of the Viterbi algorithm from Sect. 4.1.
More efficiently, we can also directly integrate the estimation of selectivities in
the scheduling process by addressing optimal scheduling as an informed search
problem (Russell and Norvig 2009). In particular, the Viterbi algorithm can
easily be transformed into an A∗ best-first search (Huang 2008), which in our case
then efficiently processes the sample of input texts.
In the next section, we propose a corresponding best-first search scheduling approach, which uses a heuristic based on the algorithms' run-time estimations. We provide evidence that it works perfectly as long as the distribution of relevant information does not vary significantly across input texts. In other cases, a schedule should be chosen depending on the text at hand in order to maintain efficiency, as we discuss in Sects. 4.4 and 4.5.
The idea behind the approach presented in this section has been sketched in Wachsmuth et al. (2013a), from which we reuse some content. While Melzner (2012) examines the use of informed search for pipeline scheduling in his master's thesis, we fully revise his approach here. As such, this section constitutes a new contribution of the book at hand.
14 The input text D may e.g. denote the concatenation of all texts from the sample.
15 Depending on the tackled problem, not all leaf nodes represent solutions. Also, sometimes the
path to a leaf node represents the solution and not the leaf node itself.
[Fig. 4.7: search graph with a root node, inner nodes for partial pipelines ⟨A(j), π(j)⟩ reached via applicable actions (algorithms Ai), and leaf nodes (solutions) ⟨A, π⟩; edges carry the step costs ti(S(⟨A(j), π(j)⟩)), paths the path costs t(⟨A(j), π(j)⟩), and leaf nodes the solution costs t(⟨A, π⟩).]
Fig. 4.7 Illustration of the nodes, actions, and costs of the complete search graph of the informed
search for the optimal schedule π ∗ of an algorithm set A.
16 Again, we schedule single text analysis algorithms here for a less complex presentation. To
significantly reduce the search space, it would actually be more reasonable to define the filtering
stages from Sect. 3.1 as actions, as we propose in Wachsmuth et al. (2013a).
17 In the given context, it is important not to confuse the efficiency of the search for a pipeline
schedule and the efficiency of the schedule itself. Both are relevant, as both influence the overall
efficiency of addressing a text analysis task. We evaluate their influence below.
For the search, we assume a vector q of run-time estimations to be given, with one value for each algorithm in A.18 On this basis, we show below how to approach the best-first search for an optimal schedule π∗ in any informed search scheduling problem:
Informed Search Scheduling Problem. An informed search scheduling problem denotes a 4-tuple ⟨A, π̃, q, D⟩ such that
1. Actions. A is an algorithm set to find an optimal schedule π∗ for,
2. Constraints. π̃ is a partial schedule that π∗ must comply with,
3. Knowledge. q ∈ R|A| is a vector with one run-time estimation qi for each algorithm Ai ∈ A, and
4. Input. D is the input text with respect to which π∗ shall be optimal.
The most widely used informed best-first search approach is A∗ search. A∗ search
realizes the best-first search strategy by repeatedly performing two operations based
on a so-called open list, which contains all nodes not yet expanded, i.e., those nodes
without generated successor nodes: First, it computes the estimated solution cost of
all nodes on the open list. Then, it generates all successor nodes of the node with the
minimum estimation. To estimate the cost of a solution that contains some node, A∗
search relies on an additive cost function, that sums up the path cost of reaching the
node and the estimated cost from the node to a leaf node. The latter is obtained from
the heuristic function H . Given that H is optimistic (i.e., H never overestimates
costs), it has been shown that the first leaf node generated by A∗ search denotes an
optimal solution (Russell and Norvig 2009).
To adapt A∗ search for pipeline scheduling, we hence need a heuristic that, for
a given partial pipeline A( j) , π ( j) , optimistically estimates the run-time t (A, π )
of a complete pipeline A, π that begins with A( j) , π ( j) . As detailed in Sects. 3.1
and 4.1, a pipeline’s run-time results from the run-times and the selectivities of
the employed algorithms. While we can resort to the estimations q for the former,
we propose to obtain information about the latter from processing an input text D.
In particular, let S(Π̃ ) contain the relevant portions of D after executing a partial
pipeline Π̃. Further, let q(A) be the estimation of the average run-time per portion of text of an algorithm A ∈ A, and let Ai be the set of all algorithms that are applicable
18 As in the method greedyPipelineLinearization from Sect. 3.3, for algorithms without run-
time estimations, we can at least rely on default values. greedyPipelineLinearization can in fact
be understood as a greedy best-first search whose heuristic function simply assigns the run-time
estimation of the fastest applicable filtering stage to each node.
after Π̃ . Then, we define the heuristic function H for estimating the cost of reaching
a leaf node from some node Π̃ as:19
H(Π̃, Ai, q)  =  |S(Π̃)| · min { q(Ai) | Ai ∈ Ai }     (4.6)
The actually observed run-time t (Π̃) of Π̃ and the value of the heuristic function
then sum up to the estimated solution cost q(Π̃ ). Similar to the dynamic programming
approach from Sect. 4.1, we hence need to keep track of all run-times and filtered
portions of text. By that, we implicitly estimate the selectivities of all algorithms in A
on the input text D at each possible position in a pipeline based on A.
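A minimal sketch of Eq. 4.6 and of the resulting estimated solution cost, with illustrative names and under the simplifying assumption of a single type of portions of text:

```python
from typing import Dict, Set

def heuristic(scope_size: int, q: Dict[str, float], applicable: Set[str]) -> float:
    """H from Eq. 4.6: size of the remaining relevant scope times the smallest
    run-time estimation among the algorithms still applicable after the partial pipeline."""
    return scope_size * min(q[a] for a in applicable)

def estimated_solution_cost(observed_runtime: float, scope_size: int,
                            q: Dict[str, float], applicable: Set[str]) -> float:
    """Estimated cost of a complete pipeline that begins with the partial pipeline:
    the observed run-time so far plus the heuristic estimate for the rest."""
    if not applicable:  # complete pipeline: nothing left to estimate
        return observed_runtime
    return observed_runtime + heuristic(scope_size, q, applicable)
```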
Now, assume that each run-time estimation q(Ai) is optimistic, meaning that the actual run-time ti(S(Π̃)) of Ai is at least q(Ai) · |S(Π̃)| on every scope S(Π̃). In this case, H is optimistic, too, because at least one applicable algorithm has to be executed on the remaining scope S(Π̃). In the end, however, the only way to guarantee optimistic run-
time estimations consists in setting all of them to 0, which would render the defined
heuristic H useless. Instead, we relax the need of finding an optimal schedule here to
the need of optimizing the schedule with respect to the given run-time estimations q.
Consequently, the accuracy of the run-time estimations implies a tradeoff between
the efficiency of the search and the efficiency of the determined schedule: The higher the estimations are set, the fewer nodes A∗ search will expand on average, but also the less likely it is to return an optimal schedule, and vice versa. We analyze some of
the effects of run-time estimations in the evaluation below.
Given an informed search scheduling problem A, π̃ , q, D and the defined heuris-
tic H , we can apply A∗ search to find an optimized schedule. However, A∗ search
may still be inefficient when the number of nodes on the open list with similar esti-
mated solution costs becomes large. Here, this can happen for algorithm sets with
many admissible schedules of similar efficiency. Each time a node is expanded,
every applicable algorithm needs to process the relevant portions of text of that node,
which may cause high run-times in case of computationally expensive algorithms.
To control the efficiency of A∗ search, we introduce a parameter k that defines the
maximum number of nodes to be kept on the open list. Such a k-best variant of A∗
search considers only the seemingly best k nodes for expansion, which improves
efficiency while not guaranteeing optimality (with respect to the given run-time esti-
mations) anymore.20 In particular, k thereby provides another means to influence the
efficiency-effectiveness tradeoff, as we also evaluate below. Setting k to ∞ yields a
standard A∗ search.
19 For simplicity, we assume here that there is one type of portions of text only (say, Sentence)
without loss of generality. For other cases, we could distinguish between instances of the different
types and respective run-time estimations in Eq. 4.6.
20 k-best variants of A∗ search have already been proposed for other tasks in natural language processing.
Pseudocode 4.2: k-best variant of A∗ search for transforming a partially ordered text analysis
pipeline A, π̃ into a pipeline A, π ∗ , which is nearly run-time optimal on the given input text D.
Pseudocode 4.2 shows our k-best A∗ search approach for determining an opti-
mized schedule of an algorithm set A based on an input text D. The root node of the
implied search graph refers to the empty pipeline Π0 and to the complete input text
S(Π0 ) = D. Π0 does not yield any run-time and, so, the estimated solution cost of
Π0 equals the value of the heuristic H , which depends on the initially applicable
algorithms in Ai (lines 1–5). Line 6 creates the set Π open from Π0 , which represents
the open list. In lines 7–17, the partial pipeline Ã, π ∗ with the currently best esti-
mated solution cost is iteratively polled from the open list (line 8) and expanded until
it contains all algorithms and is, thus, returned. Within one iteration, line 10 first
determines all remaining algorithms that are applicable after Ã, π ∗ according to
the partial schedule π̃. Each such algorithm Ai processes the relevant portions of text
of Ã, π ∗ , thereby generating a successor node for the resulting pipeline Πi and its
associated portions of texts (lines 11–13).21 The run-time and the estimated solution
cost of Πi are then updated, before Πi is added to the open list (lines 14–16). After
expansion, line 17 reduces the open list to the k currently best pipelines.22
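To make the procedure concrete, the following Python sketch mirrors the steps just described. It assumes a callable process(algorithm, scope) that filters the relevant portions of text and reports the measured run-time, and it treats all remaining algorithms as applicable, i.e., it ignores the partial schedule π̃; both are simplifying assumptions for illustration only:

```python
import heapq
import itertools
from typing import Callable, Dict, List, Sequence, Set, Tuple

def k_best_astar_scheduling(algorithms: Sequence[str],
                            process: Callable[[str, List[str]], Tuple[List[str], float]],
                            q: Dict[str, float],
                            text_portions: List[str],
                            k: int) -> Tuple[List[str], float]:
    """Best-first search over partial pipelines; after each expansion, the open
    list is pruned to the k nodes with the lowest estimated solution cost."""
    tie = itertools.count()  # tie-breaker so that heap entries never compare scopes

    def estimated_cost(runtime: float, scope: List[str], remaining: Set[str]) -> float:
        # observed run-time so far plus the heuristic H from Eq. 4.6
        if not remaining:
            return runtime
        return runtime + len(scope) * min(q[a] for a in remaining)

    all_algorithms = set(algorithms)
    root = (estimated_cost(0.0, text_portions, all_algorithms), next(tie),
            [], 0.0, text_portions)  # (cost, tie, schedule, run-time, relevant scope)
    open_list = [root]
    while open_list:
        cost, _, schedule, runtime, scope = heapq.heappop(open_list)
        remaining = all_algorithms - set(schedule)
        if not remaining:
            return schedule, runtime  # first complete pipeline polled is returned
        for algorithm in remaining:   # expand: each applicable algorithm processes the scope
            new_scope, t = process(algorithm, scope)
            node_cost = estimated_cost(runtime + t, new_scope, remaining - {algorithm})
            heapq.heappush(open_list, (node_cost, next(tie),
                                       schedule + [algorithm], runtime + t, new_scope))
        open_list = heapq.nsmallest(k, open_list)  # prune the open list to the k best nodes
        heapq.heapify(open_list)
    raise RuntimeError("no complete schedule found")
```

Setting k to a value larger than the number of admissible schedules yields the behavior of a standard A∗ search, as noted above.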
21 In the given pseudocode, we implicitly presume the use of the input control from Sect. 3.5, which
makes the inclusion of filters (cf. Sect. 3.2) obsolete. Without an input control, applicable filters
would need to be preferred in the algorithm set Ai for expansion over other applicable algorithms
in order to maintain early filtering.
22 Notice that, without line 17, Pseudocode 4.2 would correspond to a standard A∗ search approach
23 Besides being correct, A∗ search has also been shown to dominate other informed search approaches, meaning that it generates the minimum number of nodes to reach a leaf node (Russell and Norvig 2009). We leave out a corresponding proof here for lack of relevance.
Proof. We only roughly sketch the proof, since the correctness of A∗ search has
already often been shown in the literature (Russell and Norvig 2009). As clari-
fied above, optimistic run-time estimations in q imply that the employed heuris-
tic H is optimistic, too. When k-bestA∗ PipelineScheduling returns a pipeline
A, π ∗ (pseudocode line 9), the estimated solution cost q(A, π ∗ ) of A, π ∗ equals
its run-time t (A, π ∗ ), as all algorithms have been applied. At the same time, no
other pipeline on the open list has a lower estimated solution cost according to line 8.
By definition of H , all estimated solution costs are optimistic. Hence, no pipeline
on the open list can entail a lower run-time than A, π ∗ , i.e., A, π ∗ is optimal.
Since algorithms from A are added to a pipeline on the open list in each iteration of
the outer loop in Pseudocode 4.2, A, π ∗ is always eventually found.
Avoiding this worst case is what the applied best-first search strategy aims for in
the end. In addition, we can control the run-time with the parameter k. In particular,
k changes the products in Eq. 4.7. Within each product, the last factor denotes the
number of possible expansions of a node of the respective length, while the multipli-
cation of all other factors results in the number of such nodes to be expanded. This
number is limited to k, which means that we can transform Eq. 4.7 into
|A| + k · (|A|−1) + . . . + k · 2  =  O(k · |A|²)     (4.9)
Like above, a single node generation entails costs that largely result from tA(D) plus at most |A| further operations. As a consequence, we obtain a worst-case run-time of

tk-bestA∗PipelineScheduling(A, D)  =  O(k · |A|² · (tA(D) + |A|))     (4.10)
We now evaluate our k-best A∗ search approach for optimized pipeline schedul-
ing in different text analysis tasks related to our information extraction case study
InfexBA (cf. Sect. 2.3). In particular, on the one hand we explore in what scenarios the additional effort of processing a sample of input texts is worth spending. On the other hand, we determine the conditions under which our approach succeeds in finding a run-time optimal pipeline based on a respective training set. Details on the
source code used in the evaluation are given in Appendix B.4.
Corpora. As in Sect. 4.1, we conduct experiments on the Revenue corpus
described in Appendix C.1 and on the German dataset of the CoNLL-2003 shared
task described in Appendix C.4. First, we process different samples of the training
sets of these corpora for obtaining the algorithms’ run-time estimations as well as
for performing scheduling. Then, we execute the scheduled pipelines on the union of
the respective validation and test sets in order to measure their run-time efficiency.
Queries. We consider three information needs of different complexity that we rep-
resent as queries in the form presented in Sect. 3.4:
γ1 = Financial(Money, Forecast(Time))
γ2 = Forecast(Time, Money, Organization)
γ3 = Forecast(Revenue(Resolved(Time), Money, Organization))
γ1 and γ2 have already been analyzed in Sects. 3.5 and 4.1, respectively. In contrast,
we introduce γ3 here, which we also rely on when we analyze efficiency under
increasing heterogeneity of input texts in Sect. 4.5. γ3 targets revenue forecasts that contain resolvable time information, a money value, and an organization name. A simple example of such a forecast is “Apple’s annual revenues could hit $400
billion by 2015”. We require all information of an instance used to address any query
to lie within a sentence, i.e., the degree of filtering is Sentence in all cases.
Pipelines. To address γ1 , γ2 , and γ3 , we assume the following pipelines to be given
initially. They employ different algorithms from Appendix A:
Π1 = (sse, sto2 , tpo2 , eti, emo, rfo, rfi)
Π2 = (sse, sto2 , tpo2 , pch, ene, emo, eti, rfo)
Π3 = (sse, sto2 , tpo2 , pch, ene, emo, eti, nti, rre2 , rfo)
Each pipeline serves as input to all evaluated approaches. The respective pipeline
is seen as an algorithm set with a partial schedule, for which an optimized schedule
can then be computed. The algorithms in Π1 allow for only 15 different admissible
schedules, whereas Π2 entails 84, and Π3 even 1638 admissible schedules.
Baselines. We compare our approach to three baseline approaches. All approaches
are equipped with our input control from Sect. 3.5 and, thus, process only relevant
portions of text in each analysis step. We informally define the three baselines by the
rules they follow to obtain a schedule, when given a training set:
1. Fixed baseline. Do not process the training set at all. Remain with the schedule
of the given text analysis pipeline.
2. Greedy baseline. Do not process the training set at all. Schedule the given algo-
rithms according to their run-time estimation in an increasing and admissible
order, as proposed in Sect. 3.3.
3. Optimal baseline. Process the training set with all possible admissible sched-
ules by stepwise executing the given algorithms in a breadth-first search manner
(Cormen et al. 2009). Choose the schedule that is most efficient on the
training set.24
The fixed baseline is used to highlight the general efficiency potential of scheduling when filtering is performed, while we analyze the benefit of processing a sample of texts in comparison to the greedy baseline. The last baseline is called “optimal”, because it guarantees to find the schedule that is run-time optimal on a training set. However, its brute-force nature contrasts with the efficient process of our informed search approach, as we see below.
Different from Chap. 3, we do not construct filtering stages here (cf. Sect. 3.1) but schedule the single text analysis algorithms instead. This may affect the efficiency of both the greedy baseline and our k-best A∗ search approach, thereby favoring the optimal baseline to a certain extent. Still, it enables us to simplify the
analysis of the efficiency impact of optimized scheduling, which is our main focus
in the evaluation.
Experiments. Below, we measure the absolute run-times of all approaches averaged
over ten runs. We break these run-times down into the scheduling time on the training
24 The optimal baseline generates the complete search graph introduced above. It can be seen as a
simple alternative to the optimal scheduling approach from Sect. 4.1.
Table 4.2 Comparison between our 20-best A∗ search approach and the optimal baseline with
respect to the scheduling time and execution time on the Revenue corpus for each query γ and
for five different numbers of training texts.
sets and the execution time on the combined validation and test sets in order to analyze
and compare the efficiency of the approaches in detail. All experiments are conducted
on a 2 GHz Intel Core 2 Duo Macbook with 4 GB memory.25
Efficiency Impact of k-best A∗ Search Pipeline Scheduling. First, we analyze the
efficiency potential of scheduling a pipeline for each of the given queries with our
k-best A∗ search approach in comparison to the optimal baseline. To imitate realistic
circumstances, we use run-time estimations obtained on one corpus (the CoNLL-
2003 dataset), but schedule and execute all pipelines on another one (the Revenue
corpus). Since it is not clear in advance what number of training texts suffices to
find an optimal schedule, we perform scheduling based on five different training
sizes (with 1, 10, 20, 50, and 100 texts). In contrast, we delay the analysis of the
parameter k of our approach to later experiments. Here, we set k to 20, which has
reliably produced near-optimal schedules in some preliminary experiments.
Table 4.2 contrasts the scheduling times and execution times of the two evaluated
approaches as well as their standard deviations. In terms of scheduling time, the k-best
25 Sometimes, both the optimal baseline and the informed search approaches return different
pipelines in different runs of the same experiment. This can happen when the measured run-time of
the analyzed pipelines are very close to each other. Since such behavior can also occur in practical
applications, we simply average the run-times of the returned pipelines.
A∗ search approach significantly outperforms the optimal baseline for all queries and
training sizes.26 For γ1 , k exceeds the number of admissible schedules (see above),
meaning that the approach equals a standard A∗ search. Accordingly, the gains in
scheduling time appear rather small. Here, the largest difference is observed for
100 training texts, where informed search is almost two times as fast as the optimal
baseline (65.0 vs. 117.3 s). However, this factor goes up to over 30 in case of γ3 (e.g.
507.2 vs. 15589.1 s), which indicates the huge impact of informed search for larger
search graphs. At the same time, it fully competes with the optimal baseline in
finding the optimal schedule. Moreover, we observe that, on the Revenue corpus,
20 training texts seem sufficient for finding a pipeline with a near-optimal execution
time that deviates only slightly in different runs. We therefore restrict our view to
this training size in the remainder of the evaluation.
Efficiency Impact of Optimized Scheduling. Next, we consider the question under
what conditions it is worth spending the additional effort of processing a sample of
input texts. In particular, we evaluate the method k-bestA∗ PipelineScheduling
for different values of the parameter k (1, 5, 10, 20, and 100) on the Revenue
corpus (with estimations obtained from the CoNLL-2003 dataset again) for all three
queries. We measure the scheduling time and execution time for each configuration
and compare them to the respective run-times of the three baselines.
Figure 4.8 shows the results separately for each query. At first sight, we see that the execution time of the fixed baseline is significantly worse than that of all other approaches in all cases, partly being even slower than the total time of both scheduling and execution of the k-best A∗ search approaches. In case of γ1, the greedy baseline is about 13 % slower than the optimal baseline (14.2 vs. 12.6 s). In contrast, all evaluated k values result in the same execution time as the optimal baseline, indicating that the informed search succeeds in finding the optimal schedule.27 Similar results are also
observed for γ2 , except for much higher scheduling times. Consequently, 1-best A∗
search can be said to be most efficient with respect to γ1 and γ2 , as it requires the
lowest number of seconds for scheduling in both cases.
However, the results discussed so far render the processing of a sample of input
texts for scheduling questionable, because the greedy baseline performs almost as well as the other approaches. The reason is that the run-times of the algorithms in Π1 and Π2 differ relatively more than the associated selectivities (on the
Revenue corpus). This situation turns out to be different for the algorithms in Π3 .
For instance, the time resolver nti is faster than some other algorithms (cf. Appen-
dix A.2), but entails a very high selectivity, because of a low number of not resolvable
time entities. In accordance with this fact, the bottom part of Fig. 4.8 shows that our approach clearly outperforms the greedy baseline when addressing query γ3, e.g. by a factor of around 2 in case of k = 20 and k = 100. Also, it denotes an example where the higher the value of k, the better the execution time of k-best A∗ search
26 Partly, our approach improves over the baseline even in terms of execution time. This, however,
emanates from a lower system load and not from finding a better schedule.
27 Notice that the standard deviations in Fig. 4.8 reveal that the slightly better looking execution
[Fig. 4.8: horizontal bar chart; x-axis: average run-time in seconds.]
Fig. 4.8 Illustration of the execution times (medium colors, left) and the scheduling times (light col-
ors, right) as well as their standard deviations (small black markers) of all evaluated approaches on the
Revenue corpus for each of the three addressed queries and 20 training texts (Color figure online).
(with respect to the evaluated values). This demonstrates the benefit of a real informed
best-first search.
Still, it might appear unjustified to disregard the scheduling time when comparing
the efficiency of the approaches. However, while we experiment with small text
corpora, we in fact target large-scale text mining scenarios. For these, Fig. 4.9
exemplarily extrapolates the total run-times of two k-best A∗ search approaches
and the greedy baseline for γ3 , assuming that the run-times grow proportionally to
those on the 366 texts of the validation and test set of the Revenue corpus. Given
20 training texts, our approach actually saves time beginning at a number of 386
processed texts: There, the total run-time of 1-best A∗ search starts to be lower than
the execution time of the greedy baseline. Later on, 5-best A∗ search becomes better,
and so on. Consequently, our hypothesis already raised in Sect. 3.3 turns out to be
true for this evaluation: Given that an information need must be addressed ad-hoc,
a zero-time scheduling approach (like the greedy baseline) seems more reasonable,
but when large amounts of text must be processed, performing scheduling based on
a sample of texts is worth the effort.
[Fig. 4.9: run-time in seconds as a function of the number of processed texts, divided into training, validation and test, and extrapolation; curves for the greedy baseline, 1-best A∗ search, and 5-best A∗ search.]
Fig. 4.9 The total run-times of the greedy baseline and two variants of our k-best A∗ search
approach for addressing γ3 as a function of the number of processed texts. The dashed parts are
extrapolated from the run-times of the approaches on the 366 texts in the validation and test set of
the Revenue corpus.
Table 4.3 The average execution times in seconds with standard deviations of addressing the
query γ3 using the pipelines scheduled by our 20-best A∗ search approach and the greedy base-
line depending on the corpora on which (1) the algorithms’ run-time estimations are determined,
(2) scheduling is performed (in case of 20-best A∗ search), and (3) the pipeline is executed.
The execution times of 20-best A∗ search are also very similar on the Revenue corpus. In contrast, the greedy baseline is heavily affected by the estimations at hand. Overall, the execution times on the two corpora differ largely because only the Revenue corpus contains many portions of text that are relevant for γ3. The best execution times are 16.1 and 5.1 s, respectively, both achieved by the informed search approach. For the CoNLL-2003 dataset, however, we see that scheduling based on inappropriate training texts can have negative effects: when scheduling is performed on the Revenue corpus, the efficiency of 20-best A∗ search significantly drops from 5.1 to 6.5 s. In practice, this becomes important when the input texts to be processed are heterogeneous, which we analyze in the following sections.
This section has introduced our practical approach to optimize the schedule of a text
analysis pipeline. Given run-time estimations of the employed algorithms, it aims to
efficiently find a pipeline schedule that is run-time optimal on a sample of input texts.
The approach can be seen as a modification of the dynamic programming approach
from Sect. 4.1, which incrementally builds up and compares different schedules using
informed best-first search. It is able to trade the efficiency of scheduling for the
efficiency of the resulting schedule through a pruning of the underlying search graph
down to a specified size k.
The presented realization of an informed best-first search in the method k-best-A∗PipelineScheduling is not fully optimized yet. Most importantly, it does not recognize nodes that are dominated by other nodes. For example, if there are two nodes on the open list that represent the partial pipelines (A1, A2) and (A2, A1), then we
already know that one of these is more efficient for the two scheduled algorithms.
A solution is to let nodes represent algorithm sets instead of pipelines, which works
because all admissible schedules of an algorithm set entail the same relevant portions
of text (cf. Sect. 3.1). In this way, the search graph becomes much smaller, still being
a directed acyclic graph but not a tree anymore (as in our realization). Whereas we
tested the efficiency of our approach against a breadth-first search approach (the
optimal baseline), it would then be fairer to compete with the optimal solution from
Sect. 4.1, which applies similar techniques. The efficiency of scheduling could be
further improved by pruning the search graph earlier, e.g. by identifying very slow
schedules on the first training texts (based on certain thresholds) and then ignoring
them on the other texts. All such extensions are left to future work.
In our evaluation, the optimization of schedules has sped up pipelines by a factor of 4. When the employed algorithms differ more strongly in efficiency, even gains of more than one order of magnitude are possible, as exemplified in Sect. 3.1. Also, we have demonstrated that scheduling on a sample of texts provides large benefits over our greedy approach from Sect. 3.3 when the number of processed texts becomes large, which we focus on in this chapter. In contrast, as hypothesized in Chap. 3, the additional effort will often not pay off in ad-hoc text mining scenarios.
in Appendix B.4). In addition, we count how often each pipeline performs best (on
average) and we compare the pipelines’ run-times to the gold standard (cf. Sect. 2.1),
which we define here as the sum of the run-times that result from applying on each
input text at hand the pipeline Π ∗ ∈ {Π1 , . . . , Π12 } that is most efficient on that text.
Optimality Under Heterogeneity. The results are listed in Table 4.4, ordered by the
run-times on the CoNLL-2003 dataset. As known from Sect. 4.1, the pipeline (eti,
rfo, emo, ene) dominates the evaluation on the Revenue corpus, taking only
t (Π5 ) = 48.25 s and being most efficient on 295 of the 752 texts. However, three
other pipelines also do best on far more than a hundred texts. So, there is not one
single optimal schedule at all. A similar situation can be observed for the CoNLL-
2003 dataset, where the second fastest pipeline, Π2 , is still most efficient on 77 and
the sixth fastest, Π6 , even on 96 of the 553 texts. While the best fixed pipeline, Π1 ,
performs well on both corpora, Π2 and Π6 fail to maintain efficiency on the Revenue
corpus, with e.g. Π6 being almost 50 % slower than the gold standard. Although
the gold standard significantly outperforms all pipelines on both corpora at a very
high confidence level (say, 3σ ), the difference to the best fixed pipelines may seem
acceptable. However, the case of Π2 and Π6 shows that a slightly different training
set could have caused the optimized scheduling from Sect. 4.3 to construct a pipeline
whose efficiency is not robust to changing distributions of relevant information. We
Table 4.4 The run-time t (Π ) with standard deviation σ of each admissible pipeline Π based on the
given algorithm set A on both processed corpora in comparison to the gold standard. #best denotes
the number of texts that Π is most efficient on.
[Fig. 4.10: stacked bars of the proportions of positive opinions, negative opinions, and objective facts.]
Fig. 4.10 Distribution of positive opinions, negative opinions, and objective facts in (a) our ArguAna TripAdvisor corpus and its different parts as well as in (b) the Sentiment Scale dataset (Pang and Lee 2005) and its different parts. All distributions are computed based on the results of the pipeline (sse, sto2, tpo1, pdu, csb, csp). See Appendices A and B.4 for details.
hypothesize that such a danger becomes more probable the higher the text heterogeneity of a corpus is.28
To deal with text heterogeneity, the question is whether and how we can anticipate
it for a collection or a stream of texts. Intuitively, it appears reasonable to assume
that text heterogeneity relates to the mixing of types, domains, or similar text characteristics, as is typical for the results of an exploratory web search. However, the following example from text classification suggests that this is not the only dimension that governs the heterogeneity. In text classification tasks, the information sought for
is the final class information of each text. While the density of classes naturally will
be 1.0 in all cases (given that different classes refer to the same information type),
what may vary is the distribution of those information types that serve as input for
the final classification. For instance, our sentiment analysis approach developed in
the ArguAna project (cf. Sect. 2.3) relies on the facts and opinions in a text. For
our ArguAna TripAdvisor corpus (cf. Appendix C.2) and for the Sentiment
Scale dataset from Pang and Lee (2005), we illustrate the distribution of these
types in Fig. 4.10.29
28 Also, larger numbers of admissible schedules make it harder to find a robust pipeline, since they
allow for higher efficiency gaps, as we have seen in the evaluation of Sect. 4.3.
29 In Wachsmuth et al. (2014a), we observe that the distributions and positions of facts and opinions
influence the effectiveness of sentiment analysis. As soon as a pipeline restricts some analysis to
certain portions of text only (say, to positive opinions), however, the different distributions will also
impact the efficiency of the pipeline’s schedule.
Since we consider text heterogeneity with the aim of achieving an efficient text analy-
sis irrespective of the input texts at hand, we propose to quantify text heterogeneity
with respect to the differences that actually impact the efficiency of a text analysis
pipeline equipped with our input control from Sect. 3.5, namely, variations in the
distribution of information relevant for the task at hand (as revealed in Sect. 4.2).
That means, we see text heterogeneity as a task-dependent input characteristic.
In particular, we measure the heterogeneity of a collection or a stream of input
texts D here with respect to the densities of all information types C1 , . . . , C|C| in D
that are referred to in an information need C. The reason is that an input-controlled
pipeline analyzes only portions of text, which contain instances of all information
types produced so far (cf. Sect. 3.5). As a consequence, differences in a pipeline’s
average run-time per portion of text result from varying densities of C1 , . . . , C|C| in
the processed texts.32 So, the text heterogeneity of D can be quantified by measuring
the variance of these densities in C. The outlined considerations give rise to a new
measure that we call the averaged deviation:
30 Since the distributions are computed based on the self-created annotations here, the values for the
the same, since the information types define a partition of all portions of text.
32 Notice that even without an input control the number of instances of the relevant information
types can affect the efficiency, as outlined at the beginning of Sect. 4.2. However, the density of
information types might not be the appropriate measure in this case.
averaged deviation(C, D)  =  1/|C| · Σ_{i=1}^{|C|} σ(Ci, D)     (4.11)

where σ(Ci, D) denotes the standard deviation of the density of the information type Ci across the texts in D.
Given a text analysis task, the averaged deviation can be estimated based on a
sample of texts. Different from other sampling-based approaches for efficiency opti-
mizations, like (Wang et al. 2011), it does not measure the typical characteristics of
input texts, but it quantifies how much these characteristics vary. By that, the aver-
aged deviation reflects the impact of the input texts to be processed by a text analysis
pipeline on the pipeline’s efficiency, namely, the higher the averaged deviation, the
more the optimal pipeline schedule will vary on different input texts.
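A possible computation of the measure from per-text densities is sketched below; whether the population or the sample standard deviation is used is an assumption here, as are all names and example numbers:

```python
import statistics
from typing import Dict, List

def averaged_deviation(densities_per_text: List[Dict[str, float]],
                       info_types: List[str]) -> float:
    """Average, over all information types referred to in C, of the standard
    deviation of each type's density across the texts of the sample."""
    deviations = []
    for c in info_types:
        values = [d.get(c, 0.0) for d in densities_per_text]
        deviations.append(statistics.pstdev(values))
    return sum(deviations) / len(info_types)

# densities of three entity types measured on two sample texts (made-up numbers)
sample = [{"Person": 0.10, "Organization": 0.30, "Location": 0.05},
          {"Person": 0.20, "Organization": 0.10, "Location": 0.05}]
print(averaged_deviation(sample, ["Person", "Organization", "Location"]))
```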
To illustrate the defined measure, we refer to Person, Location, and Organization
entities again, for which we have presented the densities in two English and four
German text corpora in Sect. 4.2. Now, we determine the standard deviations of
these densities in order to compute the associated averaged deviations (as always, see
Appendix B.4 for the source code). Table 4.5 lists the results, ordered by increasing
averaged deviation.33 While the deviations are largely orthogonal to the covered topics and genres, they seem connected to the quality of the texts in a corpus to some
extent. Concretely, the Revenue corpus and Brown corpus (both containing a
carefully planned choice of texts) show less heterogeneity than the random sample
of Wikipedia articles and much less than the LFA-11 web crawl of smartphone blog
posts. This matches the intuition of web texts being heterogeneous. An exception
is given by the values of the CoNLL-2003 datasets, though, which rather suggest that high deviations correlate with high densities (cf. Fig. 4.6). However, the LFA-11
corpus contradicts this, having the lowest densities but the second highest averaged
deviation (18.4 %).
Altogether, the introduced measure does not clearly reflect any of the text char-
acteristics discussed above. For efficiency purposes, it therefore serves as a suitable means to compare the heterogeneity of different collections or streams of texts
with respect to a particular information need. In contrast, it does not help to investi-
gate our hypothesis that the danger of losing efficiency grows under increasing text
heterogeneity, because it leaves unclear what a concrete averaged deviation value
actually means. For this purpose, we need to estimate how much run-time is wasted
by relying on a text analysis pipeline with a fixed schedule.
33 Some of the standard deviations of organization entities in Table 4.5 and the associated averaged
deviations exceed those presented in Wachsmuth et al. (2013c). This is because there we use a
modification of the algorithm ene, which rules out some organization names.
Table 4.5 The standard deviations of the densities of person, organization, and location entities
from Fig. 4.6 (cf. Sect. 4.2) as well as the resulting averaged deviations, which quantify the text
heterogeneity in the respective corpora. All values are computed based on the results of the pipeline
Πene = (sse, sto2 , tpo1 , pch, ene).
For a single input text, our optimal solution from Sect. 4.1 determines the run-time
optimal text analysis pipeline. However, most practical text analysis tasks require processing many input texts, which may entail different optimal pipelines, as the
conducted experiments have shown. For this reason, we now develop an estimation
of the efficiency loss of executing a pipeline with a fixed schedule on a collection or
a stream of input texts as opposed to choosing the best pipeline schedule for each
input text. The latter denotes the gold standard defined above.
To estimate the gold standard run-time, we adopt an idea from the master’s the-
sis of Mex (2013) who analyzes the efficiency-effectiveness tradeoff of schedul-
ing multi-stage classifiers. Such classifiers can be seen as a generalization of text
analysis pipelines (for single information needs) to arbitrary classification prob-
lems. Mex (2013) sketches a method to compare the efficiency potential of dif-
ferent scheduling approaches in order to choose the approach whose potential lies
above some threshold. While this method includes our estimation, we greatly revise
the descriptions from Mex (2013) in order to achieve a simpler but also a more
formal presentation.
We estimate the efficiency impact induced by text heterogeneity on a sample of
texts D for a given algorithm set A = {A1 , . . . , Am }. Technically, the impact can
be understood as the difference between the run-time t ∗ (D) of an optimal (fixed)
pipeline Π ∗ = A, π ∗ on D and the run-time tgs (D) of the gold standard gs. While we
can measure the run-time of an (at least nearly) optimal pipeline using our scheduling
approach from Sect. 4.3, the question is how to compute tgs (D). To actually obtain gs,
we would need to determine the optimal pipeline for each single text in D. In contrast,
its run-time can be found much more efficiently, as shown in the following. For
conciseness, we restrict our view to algorithm sets that have no interdependencies,
meaning that all schedules of the algorithms are admissible. For other cases, a similar
[Fig. 4.11: matrix of the run-times tk(dj) of the algorithms A1, ..., Am (rows) on the portions of text d1, ..., dn (columns), with the algorithms in Aj distinguished from those in A \ Aj for each dj.]
Fig. 4.11 Illustration of the computation of the gold standard run-time tgs (D) of an algorithm set
A = {A1 , . . . , Am } on a sample of portions of texts D = (d1 , . . . , dn ) for the simplified case that
the algorithms in A have no interdependencies.
but more complex computation can be conducted by considering filtering stages (cf.
Sect. 3.1) instead of single algorithms.
Now, to compute the run-time of gs for an algorithm set A without interdepen-
dencies, we consider the sample D as an ordered set of n ≥ 1 portions of text
(d1 , . . . , dn ). We process every d j ∈ D with each algorithm Ai ∈ A in order to mea-
sure the run-time ti (d j ) of Ai on d j and to determine whether Ai classifies d j as
relevant (i.e., whether d j contains all required output information produced by Ai ).
As we know from Sect. 4.2, a portion of text can be disregarded as soon as an applied
algorithm belongs to the subset A j ⊆ A of algorithms that classify d j as irrelevant.
Thus, we obtain the gold standard’s run-time tgs (d j ) on d j from only applying the
fastest algorithm Ak ∈ A j , if such an algorithm exists:
            ⎧ min { tk(dj) | Ak ∈ Aj }    if Aj ≠ ∅
tgs(dj)  =  ⎨                                                  (4.12)
            ⎩ Σ_{k=1}^{m} tk(dj)          otherwise
As a matter of fact, the overall run-time of the gold standard on the sample of
texts D results from summing up all run-times tgs (d j ):
tgs(D)  =  Σ_{j=1}^{n} tgs(dj)     (4.13)
The computation of tgs (D) is illustrated in Fig. 4.11. Given tgs (D) and the optimal
pipeline’s run-time t ∗ (D), we finally estimate the efficiency impact of text hetero-
geneity in the collection or stream of texts represented by the sample D as the fraction
of run-time that can be saved through scheduling the algorithms depending on the
input text, i.e., 1 − t ∗ (D)/tgs (D).34
34 In cases where performing an optimized scheduling in the first place seems too expensive, also
t ∗ (D) could be approximated from the run-times modeled in Fig. 4.11, e.g. by computing a weighted
average of some lower bound t (lb) and upper bound t (ub). For instance, t (lb) could denote the
lowest possible run-time when the first j algorithms of Π∗ are fixed and t(ub) the highest one. The
weighting then may follow from the average number of algorithms in A j . For lack of new insights,
we leave out according calculations here.
The idea is to adapt a pipeline to the input texts by predicting and choosing the run-time
optimal schedule depending on the text. Since run-times can be measured during
processing, the prediction can be learned self-supervised (cf. Sect. 2.1). Learning in
turn works online, because each processed text serves as a new training instance (Wit-
ten and Frank 2005). We conduct several experiments in order to analyze when the approach is necessary in the sense of this chapter's introductory Darwin quote, i.e., when it avoids a significant amount of time that would be wasted by using a fixed schedule only.
This section reproduces the main contributions from Wachsmuth et al. (2013b) but
it also provides several additional insights.
35 We discuss the question what information to use and what features to compute later on.
4.5 Adaptive Scheduling via Self-supervised Online Learning 165
[Fig. 4.12: a prefix pipeline Πpre = ⟨Apre, πpre⟩ feeds a learned scheduling model, which selects one main pipeline Π′ = ⟨A, π′⟩ from the set of main pipelines Π = {Π1, ..., Πk}.]
Fig. 4.12 Illustration of pipeline scheduling as a text classification problem. Given the results of a
prefix pipeline, a learned scheduling model chooses one main pipeline for the input text D at hand.
The texts of a given training set DT are processed by Πpre in order to compute feature values x(D) (for some defined fea-
ture vector representation x) as well as by all main pipelines in order to measure their
run-times. Y (Π ) specifies the mapping from the feature values x(D) of an arbitrary
input text D to the estimated average run-time q(Π ) of the respective pipeline Π
per portion of text from D (of some defined size, e.g. one sentence).36 Given the
regression models, the scheduling model to be realized then simply chooses the
main pipeline with the lowest prediction for each input text from D.
A positive side effect of the self-supervised approach is that the feature values x(D)
of each input text D together with the observed run-time t (Π ) of a pipeline Π that
processes D serve as a new training instance. Accordingly, the regression error is
given by the difference between q(Π ) and t (Π ). As a consequence, the regression
models can be updated in an online learning manner, incrementally processing and
learning from one training instance at a time (Witten and Frank 2005). This, of
course, works only for the regression model of the chosen pipeline Π ∗(D) whose
run-time has been observed. Only an explicit training set DT thus ensures that all
regression models are trained sufficiently.37 Still, the ability to continue learning
online is desired, as it enables our approach not only to adapt to DT , but also to the
collection or stream of input texts D while processing it.
Pseudocode 4.3 shows our adaptive pipeline scheduling approach. Lines 1 and 2
initialize the regression model Y (Π ) of each main pipeline Π . All regression models
are then trained incrementally on each input text D ∈ DT in lines 3–8. First, feature
values are computed based on the results of Π pr e (lines 3 and 4). Then, the run-times
36 By considering the run-time per portion of text (say, per sentence), we make the regression of a
adaptivePipelineScheduling(Πpre, Π, DT, D)
 1: for each Main Pipeline Π ∈ Π do
 2:     Regression model Y(Π) ← initializeRegressionModel()
 3: for each Input text D ∈ DT do
 4:     Πpre.process(D)
 5:     Feature values x(D) ← computeFeatureValues(Πpre, D)
 6:     for each Main Pipeline Π ∈ Π do
 7:         Run-time t(Π) ← Π.process(D)
 8:         updateRegressionModel(Y(Π), x(D), t(Π))
 9: for each Input text D ∈ D do
10:     Πpre.process(D)
11:     Feature values x(D) ← computeFeatureValues(Πpre, D)
12:     for each Main Pipeline Π ∈ Π do
13:         Estimated run-time q(Π) ← Y(Π).predictRunTime(x(D))
14:     Main pipeline Π∗(D) ← arg minΠ∈Π q(Π)
15:     Run-time t(Π∗(D)) ← Π∗(D).process(D)
16:     updateRegressionModel(Y(Π∗(D)), x(D), t(Π∗(D)))
Pseudocode 4.3: Learning the fastest main pipeline Π ∗(D) ∈ Π self-supervised for each input text D
from a training set DT and then predicting and choosing Π ∗(D) depending on the input text D ∈ D
at hand while continuing learning online.
of all main pipelines on D are measured in order to update their regression models.
Lines 9 to 16 process the input texts in D. After feature computation (lines 9 and 10),
the regression models are applied to obtain a run-time estimation q(Π ) for each
main pipeline Π on the current input text D (Lines 11 to 13). The fastest-predicted
main pipeline Π ∗(D) is then chosen to process D in lines 14 and 15. Finally, line 16
updates the regression model Y (Π ∗(D) ) of Π ∗(D) .
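To make the control flow concrete, the following Python sketch mirrors Pseudocode 4.3 under simplifying assumptions: pipelines are plain callables, run-times are measured with a wall-clock timer, and make_regressor produces any incremental regression learner with predict and update methods. All names are illustrative and not part of the original implementation.

```python
import time

def adaptive_pipeline_scheduling(prefix_pipeline, main_pipelines, training_texts,
                                 input_texts, compute_features, make_regressor):
    """Sketch of Pseudocode 4.3 with one online regression model per main pipeline."""
    # Lines 1-2: initialize the regression models.
    models = {pipeline: make_regressor() for pipeline in main_pipelines}

    # Lines 3-8: self-supervised training on the explicit training set.
    for text in training_texts:
        prefix_pipeline(text)
        x = compute_features(text)              # features from the prefix results
        for pipeline in main_pipelines:
            start = time.perf_counter()
            pipeline(text)                      # run to observe the true run-time
            models[pipeline].update(x, time.perf_counter() - start)

    # Lines 9-16: predict, choose the fastest-predicted pipeline, keep learning online.
    for text in input_texts:
        prefix_pipeline(text)
        x = compute_features(text)
        estimates = {p: models[p].predict(x) for p in main_pipelines}
        best = min(estimates, key=estimates.get)
        start = time.perf_counter()
        best(text)
        models[best].update(x, time.perf_counter() - start)
```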
Like the scheduling approaches introduced in preceding sections (cf. Sects. 3.3, 4.1, and 4.3), the proposed adaptive scheduling works for arbitrary text analysis algorithms and collections or streams of input texts. Moreover, it does not place any constraints on the information needs to be addressed, but works for any set of candidate main pipelines, which are equipped with our input control from Sect. 3.5 or which restrict their analyses to relevant portions of text in an according manner. Different from the other scheduling approaches, however, the adaptive scheduling approach in Pseudocode 4.3 defines a method scheme rather than a concrete method. In particular, we have not yet specified what features to compute for the prediction of the run-times (pseudocode line 5), nor how exactly to learn the regression of run-times based on these features.
Correspondingly, we make the following estimate for the update phase of adap-
tivePipelineScheduling. Here, we do not differentiate between the run-times of a
prediction and of the update of a regression model:
In the evaluation below, we do not report on the run-time of the training phase,
since we have already exemplified in Sect. 4.3 how training time amortizes in large-
scale scenarios. Inequality 4.14 stresses, though, that the training time grows linearly
with the size of Π. In principle, the same holds for the run-time of the update phase
because of the factor (|Π| + 1) · treg(DT). However, our results presented next indicate that the regression time does not influence the overall run-time significantly.
Table 4.6 The standard deviations of the densities of all information types from C in the four
evaluated text corpora as well as the resulting averaged deviations. All values are computed from
the results of a non-filtering pipeline based on A.
38 In Wachsmuth et al. (2013c), we state that the prefix pipeline in this evaluation consists of two algorithms only. This is because we use a combined version of tpo1 and pch there.
39 The main benefit of considering all 108 main pipelines would be to know the overall efficiency
In addition, we evaluate two further types in the feature analysis below, which
attempt to capture general characteristics of entities:
4. Chunk n-grams. The frequency of each possible unigram and bigram of all chunk
tags distinguished by pch.
5. Regex matches. The frequencies of matches of a regular expression for arbitrary
numbers and of a regular expression for upper-case words.
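To illustrate feature type 5, the following sketch computes the two regex-match frequencies per sentence. The concrete patterns are assumptions made for illustration; the exact expressions used in the evaluation are not restated here.

```python
import re

# Illustrative patterns for arbitrary numbers and for upper-case words.
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)?\b")
UPPER_CASE_RE = re.compile(r"\b[A-Z]{2,}\b")

def regex_match_features(sentences):
    """Average number of matches per sentence for each regular expression."""
    n = max(len(sentences), 1)
    numbers = sum(len(NUMBER_RE.findall(s)) for s in sentences)
    upper_case = sum(len(UPPER_CASE_RE.findall(s)) for s in sentences)
    return [numbers / n, upper_case / n]
```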
In order to allow for online learning, we trained linear regression models with
the Weka 3.7.5 implementation (Hall et al. 2009) of the incremental algorithm
Stochastic Gradient Descent (cf. Sect. 2.1). In all experiments, we let the
algorithm iterate 10 epochs over the training set, while its learning rate was set to
0.01 and its regularization parameter to 0.00001.
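The original regression models rely on Weka's stochastic gradient descent implementation. As an analogous setup (not a reproduction of the Weka code), the following sketch configures an incremental linear regressor in scikit-learn with the same settings: 10 epochs, a constant learning rate of 0.01, and a regularization parameter of 0.00001.

```python
from sklearn.linear_model import SGDRegressor

def train_runtime_regressor(X_train, y_train, epochs=10):
    """Linear regression via stochastic gradient descent, trained incrementally
    with a constant learning rate of 0.01 and L2 regularization of 0.00001."""
    model = SGDRegressor(penalty="l2", alpha=1e-5,
                         learning_rate="constant", eta0=0.01)
    for _ in range(epochs):
        model.partial_fit(X_train, y_train)   # one epoch over the training set
    return model

# The model can later continue learning online from single observed run-times:
#   model.partial_fit([x], [observed_runtime])
```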
Baselines. The aim of adaptive scheduling is to achieve optimal efficiency on col-
lections or streams of input texts where no single optimal schedule exists. In this
regard, we see the optimal baseline from Sect. 4.3, which determines the run-time optimal fixed pipeline (Πpre, Π∗) on the training set and then chooses this pipeline
for each test text, as the main competitor. Moreover, we introduce another baseline
to assess whether adaptive scheduling improves over trivial non-fixed approaches:
Random baseline. Do not process the training set at all. For each test text, choose
one of the fixed pipelines (pseudo-) randomly.
Gold Standard. Besides the baselines, we compare all approaches to the gold standard, which knows the optimal main pipeline for each text beforehand. Together with the optimal baseline, the gold standard indicates the optimization potential of adaptive scheduling on a given collection or stream of input texts (cf. Sect. 4.4).
Experiments. In the following, we present the results of a number of efficiency experiments. The efficiency is measured as the run-time in milliseconds per sentence, averaged over ten runs. For reproducibility, all run-times and their standard deviations were saved in a file in advance. In the experiments, we then loaded the precomputed run-times instead of executing the pipelines.40 We omit to report on effectiveness, as all main pipelines are equally effective by definition. The experiments were conducted on a 2 GHz Intel Core 2 Duo MacBook with 4 GB of memory.
Efficiency Impact of Adaptive Scheduling. We evaluate adaptive scheduling on
the test sets of each corpus D0 , . . . , D3 after training on the respective training sets.
Figure 4.13 compares the run-times of the main pipelines of our approach to those of
the two baselines and the gold standard as a function of the averaged deviation. The
shown confidence intervals visualize the standard deviations σ , which range from
0.029 to 0.043 ms.
40 For lack of relevance in our discussion, we leave out an analysis of the effects of relying on
precomputed run-times here. In Wachsmuth et al. (2013c), we offer evidence that the main effect
is a significant reduction of the standard deviations of the pipelines’ run-times.
[Plot data omitted; x-axis: averaged deviation (text heterogeneity) of 13.8 % (D0), 15.4 % (D1), 16.6 % (D2), and 18.5 % (D3); y-axis: average run-time per sentence in ms.]
Fig. 4.13 Interpolated curves of the average run-times of the main pipelines of both baselines, of
our adaptive scheduling approach, and of the gold standard under increasing averaged deviation,
which represents the heterogeneity of the processed texts. The background areas denote the 95 %
confidence intervals (±2σ ).
[Plot data omitted; average run-times per sentence on D0: 0.95, 1.06, and 1.09 ms for (Πpre, Π1), (Πpre, Π2), and (Πpre, Π3), each with a Πpre share of 0.65 ms; on D3: 0.96, 1.02, and 1.08 ms, each with a Πpre share of 0.51 ms.]
Fig. 4.14 The average run-times per sentence (with standard deviations) of the three fixed pipelines,
our adaptive scheduling approach, and the gold standard on the test sets of D0 and D3 . Each run-time
is broken down into its different parts.
Table 4.7 The average regression time per sentence (including feature computation and regression),
the mean regression error, and the accuracy of choosing the optimal pipeline for each input text in
either D0 or D3 for different feature types.
(Πpre, Π1) is the fastest pipeline on 598 of the 1000 test texts from D0, whereas (Πpre, Π2) and (Πpre, Π3) have the lowest run-time on 229 and 216 texts, respectively (on some texts, the pipelines are equally fast). In contrast, our approach takes Π2 (569 times) more often than Π1 (349) and Π3 (82), which results in an accuracy of only 39 % for choosing the optimal pipeline. This behavior is caused by a mean regression error of 0.45 ms, which is almost half as high as the run-times to be predicted on average and, thus, often exceeds the differences between them. However, the success on D3 does not emanate from lower regression errors, which are in fact 0.24 ms higher on average. Still, the accuracy is increased to 55 %. So, the success must result from larger differences in the main pipelines' run-times. One reason can be inferred from the average run-time per sentence of Πpre in Fig. 4.14, which is significantly higher on D0 (0.65 ms) than on D3 (0.51 ms). Since the run-times of all algorithms in Πpre scale linearly with the number of input tokens, the average sentence length of D0 must exceed that of D3. Naturally, shorter sentences tend to contain less relevant information. Hence, many sentences can be discovered as being irrelevant after few analysis steps by a pipeline that schedules the respective text analysis algorithms early.
Learning Analysis. The observed regression errors bring up the question of how
suitable the employed feature set is for learning adaptive scheduling. To address this
question, we built a separate regression model on the training sets of D0 and D3 ,
respectively, for each of the five distinguished feature types in isolation as well as
for combinations of them. For each model, we then measured the resulting mean
regression error as well as the classification accuracy of choosing the optimal main
pipeline. In Table 4.7, we compare these values to the respective regression time, i.e.,
the run-time per sentence spent for feature computations and regression.
In terms of the mean regression error, the part-of-speech tags and regex matches perform best among the single feature types, while the average run-times fail completely, especially on D3 (1.34 ms). Still, the accuracy of the average run-times is far from worst, indicating that they sometimes provide meaningful information. The best accuracy is clearly achieved by the lexical statistics.41 Obviously, none of the single feature types dominates the evaluation. The set of all features outperforms both the single types and the standard features in most respects. Nevertheless, we use the standard features in all other experiments, because they entail a regression time of only 0.04 to 0.05 ms per sentence on average. In contrast, the regex matches, for instance, need 0.16 ms alone on D0, which exceeds the difference between the optimal baseline and the gold standard on D0 and, thus, renders the regex matches useless in the given setting.
The regex matches emphasize the need for efficiently computable features that we discussed above. While the set of standard features fulfills this efficiency requirement, it seems as if none of the five feature types really captures the text characteristics relevant for adaptive scheduling.42
Alternatively, though, the features may also require more than the 500 training texts given so far. To rule out this possibility, we next analyze the performance of the standard features depending on the size of the training set. Figure 4.15 shows the main pipelines' run-times for nine training sizes between 1 and 5000. Since the training set of D0 is limited, we have partly performed training on duplicates of the texts in D0 (modified in the way sketched above) where necessary. Adaptive scheduling does better than the random baseline, but not better than the optimal baseline, for all training sizes except 1. The illustrated curve oscillates minimally in the beginning. After its maximum at 300 training texts (1.05 ms), it then declines monotonically until it reaches 0.95 ms at size 1000. From there on, the algorithm mimics the optimal baseline, i.e., it chooses Π1 on about 90 % of the texts. While the observed learning behavior may partly result from overfitting the training set as a consequence of using modified duplicates, it also underlines that the considered features simply do not suffice to always find the optimal pipeline for each text. Still, more training decreases the danger of being worse than without adaptive scheduling.
41 The low inverse correlation of the mean regression error and the classification accuracy seems
counterintuitive, but it indicates the limitations of these measures: E.g., a small regression error can
still be problematic if run-times differ only slightly, while a low classification accuracy may have
few negative effects in this case.
42 We have also experimented with other task-independent features, especially further regular expres-
sions, but their benefit was low. Therefore, we omit to report on them here.
Fig. 4.15 The average run-time of the main pipelines of the two baselines, our adaptive scheduling
approach, and the gold standard on the test set of D0 as a function of the training size.
[Plot data omitted; y-axis: regression error in ms (mean 0.43), x-axis: input texts 1 to 15,000.]
Fig. 4.16 The mean regression error for the main pipelines chosen by our adaptive scheduling
approach on 15,000 modified versions of the texts in D0 with training size 1. The values of the two
interpolated learning curves denote the mean of 100 (light curve) and 1000 (bold curve) consecutive
predictions, respectively.
In this section, we have developed a pipeline scheduling approach that aims to achieve optimal run-time efficiency for a set of text analysis algorithms (equipped with the input control from Sect. 3.5) on each input text. The approach automatically learns to adapt a pipeline's schedule to a processed text without supervision. It targets text analysis tasks where the collection or stream of input texts is heterogeneous in the sense that no single schedule is run-time optimal for all texts.
43 The curves in Fig. 4.16 represent the differences between the predicted and the observed run-times
of the main pipelines that are actually executed on the respective texts.
The approaches developed in this chapter aim to optimize the efficiency of sequen-
tially executing a set of text analysis algorithms on a single machine. The next logical
step is to parallelize the execution. Despite its obvious importance for both ad-hoc
and large-scale text mining, the parallelization of text analysis pipelines is only
discussed sporadically in the literature (cf. Sect. 2.4). In the following, we outline
possible ways to parallelize pipelines and we check how well they integrate with
our approaches. For a homogeneous machine setting, a reasonable parallelization
appears straightforward, although some relevant parameters remain to be evaluated.
Table 4.8 Qualitative overview of the expected effects of the four distinguished types of paral-
lelization with respect to each considered metric. The scale ranges from very positive [++] and
positive [+] over none or hardly any [o] to negative [−] and very negative [− −].
as the network time. We assume these three to be most important for the pipeline execution time and accordingly do not discuss others here.
Aside from a scalable execution, parallelization can also be exploited to speed up
pipeline scheduling. We analyze the effects of parallelization on the scheduling time,
i.e., the time spent for an optimized scheduling on a sample of texts, as proposed in
Sect. 4.3 (or for the optimal scheduling in Sect. 4.1) as well as on the training time
of our adaptive scheduling approach from Sect. 4.5. Also, we look at the minimum
response time of a pipeline, which we define as the pipeline’s run-time on a single
input text. The minimum response time becomes important in ad-hoc text mining,
when first results need to be returned as fast as possible (cf. Sect. 3.3).
In the following, we examine four types of parallelization for the scenario that a
single text analysis task is to be addressed on a network of machines with pipelines
equipped with our input control from Sect. 3.5. All machines are uniform in speed
and execute algorithms and pipelines in the same way. They can receive arbitrary
input texts from other machines, analyze the texts, and return the produced output
information. We assess the effects of each type on all metrics introduced above on a
comparative scale from very positive [++] to very negative [− −]. Table 4.8 provides
an overview of all effects.
To illustrate the different types, we consider three machines μ0, ..., μ2. μ0 serves as the master machine that distributes input texts and aggregates output information. Given this setting, we schedule four sample algorithms related to our case study InfexBA from Sect. 2.3: a time recognizer AT, a money recognizer AM, a forecast event detector AF, and some segmentation algorithm AS. Let the output of AS be required by all others and let AF additionally depend on AT. Then, three admissible pipelines exist: (AS, AT, AM, AF), (AS, AT, AF, AM), and (AS, AM, AT, AF).
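The three admissible pipelines follow directly from these dependencies. The following sketch enumerates all admissible schedules for a given dependency relation; the dictionary encoding of the constraints is an assumption made for illustration.

```python
from itertools import permutations

# Each algorithm maps to the set of algorithms whose output it requires.
DEPENDENCIES = {"AS": set(), "AT": {"AS"}, "AM": {"AS"}, "AF": {"AS", "AT"}}

def admissible_schedules(dependencies):
    """All orderings in which every algorithm comes after its prerequisites."""
    schedules = []
    for order in permutations(dependencies):
        executed = set()
        for algorithm in order:
            if not dependencies[algorithm] <= executed:
                break
            executed.add(algorithm)
        else:                      # no break: all prerequisites were satisfied
            schedules.append(order)
    return schedules

# Yields (AS, AT, AM, AF), (AS, AT, AF, AM), and (AS, AM, AT, AF).
print(admissible_schedules(DEPENDENCIES))
```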
Fig. 4.17 The four considered ways of parallelizing four sample text analysis algorithms on three
machines μ0 , . . . , μ2 : a Different algorithms on different machines in sequence, b different algo-
rithms on different machines in parallel where possible, c one schedule on different machines,
d different schedules on different machines.
44 Notice that we assume homogeneous machines here. In the case of machines that are specialized for certain text analyses, analysis pipelining may entail more advantages.
The same holds for exchanged roles of A1 and A2. We have seen an example that fulfills Ineq. 4.16 in the evaluation of Sect. 4.1, where the pipeline (eti, emo) outperforms the algorithm emo alone. The danger of losing efficiency (which also exists for the minimum response time [+/−]) generally makes analysis parallelization questionable. While the scheduling time [++] and the training time [+] behave in the same way as for analysis pipelining, other types of parallelization exist that come with only a few notable drawbacks and with even more benefits, as discussed next.
Fig. 4.18 Parallel processing of a single input text with four sample text analysis algorithms: a The
master machine μ0 segments the input text into portions of text, each of which is then processed
on one machine. b All machines process the whole text, but schedule the algorithms differently.
As each machine employs all algorithms, the scheduling time can again be signifi-
cantly improved through parallel search node expansions [++]. Similarly, the training
time of adaptive scheduling scales well [++], because input texts can be processed
on different machines (while centrally updating the mapping to be learned on the
master machine). Moreover, pipeline duplication can reduce the minimum response
time to some extent [+], even though all machines execute the employed pipeline
in the same order. In particular, our input control allows the duplicated pipelines to
process different portions of an input text simultaneously. To this end, the master
machine needs to execute some kind of prefix pipeline, which segments the input
text into single portions, whose size is constrained by the largest specified degree of
filtering in the query to be addressed (cf. Sect. 3.4). The portions can then be distrib-
uted to the available machines. Figure 4.18(a) sketches such an input distribution for
our four sample algorithms.
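A minimal sketch of this input distribution, using a local process pool as a stand-in for a network of homogeneous machines: the master segments the input text into portions, each worker runs the duplicated main pipeline on one portion, and the master aggregates the results. The segmentation and the analysis are placeholders, not the actual pipeline algorithms.

```python
from multiprocessing import Pool

def segment_into_portions(text):
    """Prefix step on the master machine; here, simply one portion per paragraph."""
    return [portion for portion in text.split("\n\n") if portion.strip()]

def run_main_pipeline(portion):
    """Duplicated main pipeline executed on a worker ('machine'); here it merely
    counts tokens as a stand-in for the actual text analysis."""
    return {"portion": portion, "tokens": len(portion.split())}

def process_text(text, machines=3):
    portions = segment_into_portions(text)
    with Pool(processes=machines) as pool:
        results = pool.map(run_main_pipeline, portions)   # distribute the portions
    return results                                        # aggregated on the master

if __name__ == "__main__":
    print(process_text("First portion of text.\n\nSecond portion of text."))
```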
Pipeline duplication appears to be an almost perfect choice, at least when a single text analysis pipeline is given, as in the case of our optimized scheduling approach from Sect. 4.3. In contrast, the (ideally) even better adaptive scheduling approach from Sect. 4.5 can still cause a high memory consumption, because every machine needs to maintain all candidate schedules. A solution is to parallelize the schedules instead of the pipeline, as illustrated in Fig. 4.17(d). Such a schedule parallelization requires storing only a subset of the schedules on each machine, thereby reducing memory consumption [+]. The adaptive choice of a schedule (and, hence, of a machine) for an input text then must take place on the master machine. Consequently, idle times can occur, especially when the choice is very imbalanced. In order to ensure a full machine utilization [++], input texts may therefore have to be reassigned to other machines, which implies a negative effect on the network time [−]. So, we cannot generally determine whether schedule parallelization yields a better pipeline execution time than pipeline duplication or vice versa [++].
In terms of scheduling time [++] and training time [++], schedule parallelization behaves analogously to pipeline duplication, whereas the distribution of schedules over machines will tend to benefit the minimum response time on a single input text more clearly [++]: Similar to Kalyanpur et al. (2011), a text can be processed by each machine simultaneously (cf. Fig. 4.18(b)). As soon as the first machine finishes, the execution can stop to directly return the produced output information. However, the
full potential of such a massive parallelization is only achieved when all machines
are working. Still, schedule parallelization makes it easy to cope with machine break-
downs in general, indicating a high but not optimal fault tolerance [+].
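The response-time benefit of processing one text with different schedules simultaneously can be sketched with a thread pool that returns the result of whichever schedule finishes first; the schedule functions are illustrative stand-ins for pipelines with different orderings.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def first_finished_result(schedules, text):
    """Run every schedule on the same text and return the earliest available result."""
    with ThreadPoolExecutor(max_workers=len(schedules)) as executor:
        futures = [executor.submit(schedule, text) for schedule in schedules]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for future in pending:       # remaining executions are no longer needed
            future.cancel()
        return next(iter(done)).result()
```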
46 In exemplary tests with the main pipeline in our project ArguAna (cf. Sect. 2.3), pipeline duplication on five machines reduced the pipeline execution time by a factor of 3.
Chapter 5
Pipeline Robustness
In making a speech one must study three points: first, the means
of producing persuasion; second, the style, or language, to be
used; third, the proper arrangement of the various parts of the
speech.
– Aristotle
Abstract The ultimate purpose of text analysis pipelines is to infer new informa-
tion from unknown input texts. To this end, the algorithms employed in pipelines are
usually developed on known training texts from the anticipated domains of applica-
tion (cf. Sect. 2.1). In many applications, however, the unknown texts significantly
differ from the known texts, because a consideration of all possible domains within
the development is practically infeasible (Blitzer et al. 2007). As a consequence,
algorithms often fail to infer information effectively, especially when they rely on
features of texts that are specific to the training domain. Such missing domain robust-
ness constitutes a fundamental problem of text analysis (Turmo et al. 2006; Daumé
and Marcu 2006). The missing robustness of an algorithm directly reduces the ro-
bustness of a pipeline it is employed in. This in turn limits the benefit of pipelines
in all search engines and big data analytics applications, where the domains of texts
cannot be anticipated. In this chapter, we present first substantial results of an ap-
proach that improves robustness by relying on novel structure-based features that are
invariant across domains.
Section 5.1 discusses how to achieve ideal domain independence in theory. Since
the domain robustness problem is very diverse, we then focus on a specific type of
text analysis tasks (unlike in Chaps. 3 and 4). In particular, we consider tasks that deal
with the classification of argumentative texts, like sentiment analysis, stance recog-
nition, or automatic essay grading (cf. Sect. 2.1). In Sect. 5.2, we introduce a shallow
model of such tasks, which captures the sequential overall structure of argumentative
texts on the pragmatic level while abstracting from their content. For instance, we
observe that review argumentation can be represented by the flow of local sentiment.
Given the model, we demonstrate that common flow patterns exist in argumentative
texts (Sect. 5.3). Our hypothesis is that such patterns generalize well across domains.
In Sect. 5.4, we learn common flow patterns with a supervised variant of clustering.
Then, we use each pattern as a single feature for classifying argumentative texts
Fig. 5.1 Abstract view of the overall approach of this book (cf. Fig. 1.5). The main contribution of
this chapter is represented by the overall analysis.
from different domains. Our results for sentiment analysis indicate the robustness of
modeling overall structure (other tasks are left for future work). In addition, we can
visually make results more intelligible based on the model (Sect. 5.5). Altogether,
this chapter realizes the overall analysis within the approach of this book, highlighted
in Fig. 5.1. Both robustness and intelligibility benefit the use of pipelines in ad-hoc
large-scale text mining.
Several tasks from information extraction and text classification have been successfully tackled with text analysis pipelines (cf. Chap. 2). The algorithms employed in pipelines are mostly developed based on a set of known training texts. These texts are analyzed manually or automatically in order to find rules or statistics about certain features of the texts that help to infer the output information (in terms of classes, annotations, etc.) the respective algorithm is meant to produce for input texts (cf. Sect. 2.1 for details). Such a corpus-based development often results in
Fig. 5.2 Illustration of the domain dependence of text analysis for a two-class classification task:
Applying the decision boundary from domain A in some domain B with a different feature distrib-
ution (here, for x1 and x2 ) often works badly.
high effectiveness when the training texts are representative for the input texts the
algorithm shall process later on, i.e., for the algorithm’s domain of application.
The notion of domains is common in related areas like software engineering, where it captures two respects: (1) the specific concepts of some problem area and (2) shared software requirements and functionalities that are key to software reuse (Harsu 2002). While, to our knowledge, no clear definition of domains exists in text analysis, here the term is used rather in the first respect, namely, to capture common properties of a set of texts.
Many authors refer to domains in terms of topics, such as Li et al. (2012a). However, domains are also distinguished according to other schemes, e.g., with respect to genres or styles (Blitzer et al. 2007). In our project ArguAna, we analyzed the sentiment of reviews (cf. Sect. 2.3), while some related approaches rather target at the comment-like texts from Twitter1 (Mukherjee and Bhattacharyya 2012). Also, the combination of a topic and a genre can make up a domain, as in our study of language functions (Wachsmuth and Bujna 2011). Others differentiate between authors (Pang and Lee 2005) or even see languages as a special case of domains (Prettenhofer and Stein 2011). In the end, the domain scheme depends on the addressed task.
What the texts from a specific domain share, in general, is that they are assumed
to be drawn from the same underlying feature distribution (Daumé and Marcu 2006),
meaning that similar feature values imply similar output information. Given a training
set with texts from a single domain, it is therefore not clear whether found rules or
statistics about a feature represent properties of texts that are generally helpful to
infer output information correctly or whether they refer to properties that occur only
within the specific domain at hand.2 Either way, an algorithm can rely on the feature
when being applied to any other set of texts from that domain.
In practice, however, the domain of application is not always the same as the training domain. Since different domains yield different feature distributions, an algorithm that relies on domain-specific features may then largely lose its effectiveness, as illustrated in Fig. 5.2.
The domain independence of a text analysis pipeline follows from the domain inde-
pendence of the algorithms employed in the pipeline, as it refers to the performed
analyses. According to the discussion above, the development of each algorithm can
be seen as deriving a model from a set of training texts that maps features of input
texts to output information. To obtain an ideally domain-independent algorithm for
a given text analysis task, we argue that three requirements must be fulfilled:
3 According to Blitzer et al. (2008), domain dependence occurs in nearly every application of
machine learning. As exemplified, it is not restricted to statistical approaches, though.
Fig. 5.3 Illustration of two ways of improving the decision boundary from Fig. 5.2 for the domains A
and B as a whole: a Choosing a model that better fits the task. b Choosing a training set (open icons)
that represents both domains.
4 The terms used here come from the area of machine learning (Hastie et al. 2009). However, our argumentation largely applies to rule-based text analysis approaches as well.
Fig. 5.4 Illustration of the different domain invariance of the features x1 and x2 with respect to the
instances from domains A and B: Only for x2 , the distribution of values over the circle and square
remains largely invariant across the domains.
in the end the appropriateness of a model depends on the training set it is derived from. This directly leads to the requirement of optimal representativeness.
The used training set governs the distribution of values of the considered fea-
tures and, hence, the quality of the derived model. According to learning theory (cf.
Sect. 2.1), the best model is obtained when the training set is optimally representa-
tive for the domain of application. The representativeness prevents the model from
incorrectly generalizing from the training data. Given different domains of applica-
tion, the training texts should, thus, optimally represent all texts irrespective of their
domains (Sapkota et al. 2014). Figure 5.3(b) shows an alternative training set (open
icons) for the sample instances from domains A and B that addresses this requirement. As for optimal fitting, it leads to a decision boundary that causes fewer
false classifications than the one shown in Fig. 5.2. Among others, we observe such
behavior in Wachsmuth and Bujna (2011) after training on a random crawl of blog
posts instead of a focused collection of reviews.
Optimal representativeness is a primary goal of corpus design (Biber et al. 1998).
Besides the problem of how to achieve representativeness, an optimally representative
training set can only be built when enough training texts are given from different
domains. This contradicts one of the basic scenarios that motivates the need for
domain robustness, namely, that enough data is given from some source domain,
but only little from a target domain. The domain adaptation approaches summarized
in Sect. 2.4 deal with this scenario by learning the shift in the feature distribution
between domains or by aligning features from the source and the target domain.
Hence, they require at least some data from the target domain for training and, so,
need to assume the domains of application in advance.
The question is how to develop domain-independent text analysis algorithms without knowing the domains of application. Since an algorithm cannot adapt to the target domains in this scenario, the only way seems to be to derive a model from the training texts that already refrains from any domain dependence in the first place.
This leads to our notion of optimal invariance of features. We call a set of features (optimally) domain-invariant in a given text analysis task if the distribution of their values remains the same (with respect to the output information sought for) across all possible domains of application. Accordingly, strongly domain-invariant features entail similar distributions across domains. For illustration, Fig. 5.4 emphasizes the strong domain invariance of the feature x2 in the sample classification task: In both domains, high values of x2 always refer to instances of the circle class, and low values to the square class, which benefits domain robustness. Only the medium values show differences between the domains. In contrast, x1 is very domain-specific. The distribution of its values is almost contrary for the two domains.
The intuition behind domain invariance is that the respective features capture
properties of the task to be addressed only and not of the domain of application. As a
consequence, the resort to domain-invariant features simplifies the above-described
requirement of optimal representativeness. In particular, it limits the need to con-
sider all domains of application since the feature distribution remains stable across
domains. Ideally, a training set then suffices that is representative in terms of the distribution of features with respect to the task in a single domain.
In practice, optimal invariance will often not be achievable, because a differentiation between task-specific and domain-specific properties of texts requires knowing the target function. Still, we believe that strongly domain-invariant features can be
found for many text analysis tasks. While the domain invariance of a certain set of
features cannot be proven, the robustness of an algorithm based on the features can at
least be evaluated using test sets from different domains of application. Such research
has already been done for selected tasks. For instance, Menon and Choi (2011) give
experimental evidence that features based on function words robustly achieve high
effectiveness in authorship attribution across domains.
Domain-invariant features benefit the domain independence of text analysis algo-
rithms and, consequently, the domain independence of a pipeline that employs such
algorithms. That being said, we explicitly point out that the best set of features in
terms of domain invariance is not necessarily the best in terms of effectiveness. In
the case of the figures above, for instance, the domain-specific feature x1 may still
add to the overall classification accuracy, when used appropriately. Hence, domain-
invariant features do not solve the general problem of achieving optimal effectiveness
in text analysis, but they help to robustly maintain the effectiveness of a text analysis
pipeline when applying the pipeline to unknown texts.
To conclude, none of the described requirements can be realized perfectly in
general, preventing ideal domain independence in practice. Still, approaches to ad-
dress each requirement exist. The question is whether general ways can be found
to overcome domain dependence, thereby improving pipeline robustness in ad-hoc
large-scale text mining. For a restricted setting, we consider this question in the
remainder of this chapter.
The domain dependence of text analysis algorithms and pipelines is manifold and
widely discussed in the literature (cf. Sect. 2.4). Different from the problems of
optimizing pipeline design and pipeline efficiency that we tackled in Chaps. 3 and 4,
domain dependence can hardly be addressed irrespective of the text analysis task at
hand, as it is closely connected to the actual analysis of natural language text and to
the domain scheme relevant in the task. Therefore, it seems impossible to approach
domain dependence comprehensively within one book chapter.
Instead, we restrict our view to the requirement of optimally invariant features
here, which directly influences possible solutions to optimal representativeness and
optimal fitting, as sketched above. To enable high-quality text mining, we seek invariant features that, at the same time, achieve high effectiveness in the addressed
text analysis task. Concretely, we focus on tasks that deal with the classification of
argumentative texts like essays, transcripts of political speeches, scientific articles, or
reviews, since they seem particularly viable for the development of domain-invariant
features: In general, an argumentative text represents a written form of monologi-
cal argumentation. For our purposes, such argumentation can be seen as a regulated
sequence of text with the goal of providing persuasive arguments for an intended con-
clusion (cf. Sect. 2.4 for details). This involves the identification of facts about the
topic being discussed as well as the structured presentation of pros and cons (Besnard
and Hunter 2008). As such, argumentative texts resemble the type of speeches
Aristotle refers to in the quote at the beginning of this chapter.
Typical tasks that target at argumentative texts are sentiment analysis, stance
recognition, and automatic essay grading among others (cf. Sect. 2.1). In such tasks,
domains are mostly distinguished in terms of topic, like different product types in
reviews or different disciplines of scientific articles. Moreover, argumentative texts
share common linguistic characteristics in terms of their structure (Trosborg 1997).
Now, according to Aristotle, the arrangement of the parts of a speech (i.e., the overall
structure of the speech) plays an important role in making a speech. Putting both
together, it therefore seems reasonable that the following two-fold hypothesis holds
for many tasks that deal with the classification of argumentative texts, where overall
structure can be equated with argumentation structure:
The first part of the hypothesis is important, since it suggests that structure-based
features actually help to effectively address a given task. If the second part turns out
to be true, we can in fact achieve a domain-robust classification of argumentative
texts. We investigate both parts in this chapter.
To investigate whether a focus on the analysis of overall structure benefits the domain
robustness of pipelines for the classification of argumentative texts, we now model the
units and relations in such texts that make up overall structure from an argumentation
perspective. Our shallow model is based on the intuition that many people organize
their argumentation largely sequentially. The model allows viewing text analysis as
the task to classify the argumentation structure of input texts, as we exemplify for
the sentiment analysis of reviews. This section partly reuses content and follows the
discussion of Wachsmuth et al. (2014a), but the model developed here aims for more
generality and wider applicability in text analysis.
Not only sentiment analysis, but also several other non-standard text classification
tasks (cf. Sect. 2.1) directly or indirectly deal with structure. As an obvious exam-
ple, automatic essay grading explicitly rates argumentative texts, mostly targeting at
structural aspects (Dikli 2006). In genre identification, a central concept is the form
of texts (Stein et al. 2010). Some genre-related tasks explicitly aim at argumentative
texts, such as language function analysis (cf. Sect. 2.3). Criteria in text quality as-
sessment of Wikipedia articles and the like often measure structure (Anderka et al.
2012), while readability has been shown to be connected to discourse (Pitler and
Nenkova 2008). Arun et al. (2009) rely on structural clues like patterns of uncon-
sciously used function words in authorship attribution, and similar patterns have been
successfully exploited for plagiarism detection (Stamatatos 2011).5
According to our hypothesis from Sect. 5.1, we argue that in these and related tasks
the class of an argumentative text is often decided by the structure of its argumentation
rather than by its content, while the content adapts the argumentation to the domain at
hand. For the classification of argumentative texts, we reinterpret the basic scenario
from Sect. 1.2 in this regard by viewing text analysis as a structure classification task:
This reinterpretation differs more significantly from the definition in Sect. 1.2
than those in Sects. 3.2 and 3.4, because here the type of output information to be
5 On a different level, overall structure also plays a role in sequence labeling tasks like named entity recognition (cf. Sect. 2.3). There, many approaches analyze the syntactic structure of a sentence for the decision whether some candidate text span denotes an entity.
Fig. 5.5 The proposed metamodel of the structure of an argumentative text (center) and its con-
nection to the argumentation (left) and the content (right) of the text.
produced is much more restricted (i.e., text-level class information).6 As such, the
reinterpretation may seem unnecessarily limiting. We apply it merely for a more
focused discussion, though. Accordingly, it should not be misunderstood as an ex-
clusive approach to respective tasks. Apart from that, the reinterpretation appears
rather vague, because it leaves open what is exactly meant by argumentation struc-
ture. In the following, we present a metamodel that combines different concepts from
the literature to define such structure for argumentative texts in a granularity that we
argue is promising to address text classification.
Figure 5.5 shows the ontological metamodel with all presented concepts. Similar
to the information-oriented view of text analysis in Sect. 3.4, the ontology is not used
in terms of a knowledge base, but it serves as an analogy to the process-oriented view
in Sect. 3.2. We target at the center part of the model, which defines the overall struc-
ture of argumentative texts, i.e., their argumentation structure. In some classification
tasks, structure may have to be analyzed in consideration of the content referred to in
the argumentation of a text. An example in this regard is stance recognition, where
the topic of a stance is naturally bound to the content. First, we thus introduce the left
and right part. Given the topic is known or irrelevant, however, a focus on structure
benefits domain robustness, as we see later on.
As stated in Sect. 5.1, an argumentative text can be seen as the textual representation
of an argumentation. An argumentation aims to give persuasive arguments for some
6 Notice, though, that all single text classification tasks target at the inference of information of one
type C only, which remains implicit in the basic scenario from Sect. 1.2.
conclusion. In the hotel reviews analyzed in our project ArguAna (cf. Sect. 2.3),
for instance, authors often justify the score they assign to the reviewed hotel by
sharing their experiences with the reader. According to Stab and Gurevych (2014a),
most argumentation theories agree that an argumentation consists in a composition
of argument components (like a claim or a premise) and argumentative relations
between the components (like the support of a claim by a premise). In contrast,
the concrete types of components and relations differ. E.g., Toulmin (1958) further
divides premises into grounds and backings (cf. Sect. 2.4 for details). The conclusion
of an argumentation may or may not be captured explicitly in an argument component
itself. It usually corresponds to the stance of the author of the text with respect to the
topic being discussed.
The topic of an argumentative text sums up what the content of the text is all
about, such as the stay at some specific hotel in case of the mentioned reviews. The
topic is referred to in the text directly or indirectly by talking about different semantic
concepts. We use the generic term semantic concept here to cover entities, attributes,
and similar, like a particular employee of a hotel named John Doe or like the hotel’s
staff in general. Semantic relations may exist between the concepts, e.g. John Doe
works for the reviewed hotel. As the examples show, the relevant concrete types of
both semantic concepts and semantic relations are often domain-specific, similar to
what we observed for annotation types in Sect. 3.2.
An actual understanding of the arguments in a text would be bound to the contained
semantic concepts and relations. In contrast, we aim to determine only the class of an
argumentative text given some classification scheme here. Such a text class represents
meta information about the text, e.g. the sentiment score of a review or the name of its
author. As long as the meta information does not relate to the topic of a text, losing domain independence by analyzing content-related structure seems unnecessary.
Stab and Gurevych (2014a) present an annotation of the structure of argumenta-
tive texts (precisely, of persuasive essays) that relates to the defined concepts. They
distinguish major claims (i.e., conclusions), claims, and premises as argumentation
components as well as support and attack as argumentative relations. Such annota-
tion schemes serve research on the mining of arguments and their interactions (cf.
Sect. 2.4). The induced structure may also prove beneficial for text classification,
though, especially when the given classification scheme targets at the purpose of
argumentation (as in the case of stance recognition). However, we seek for a model
that can be applied to several classification tasks. Accordingly, we need to abstract
from concrete classification tasks in the first place.
7 As the examples demonstrate, the scheme of unit classes can, but needs not necessarily, be related
tion model from Wachsmuth et al. (2014a). However, we emphasize here that the semantic concepts
contained in a discourse unit do not belong to the structure.
Fig. 5.6 Visualization of our model of the structure of an argumentative text D for finding its text
class C(D) in some task. D is represented as a sequence of n ≥ 1 discourse units. Each unit di has
a task-specific unit class and discourse relations of some concrete types to its predecessor di−1 and
its successor di+1 if existing.
The instantiation of the structure part of the metamodel in Fig. 5.5 entails two steps:
Given a classification task to be addressed on a set of argumentative texts, the first
step is to derive a concrete model for that task from the metamodel. Such a derivation
can be understood as defining a structure classification task ontology that instantiates
the abstract concepts of structure:
Structure Classification Task Ontology. A structure classification task ontology Ω denotes a 2-tuple ⟨C_U(Ω), C_R(Ω)⟩ such that C_U(Ω) is a scheme of task-specific unit classes and C_R(Ω) is a set of discourse relation types.
Once a structure classification task ontology has been defined, the second step is
to actually model the structure of each text, i.e., to create individuals of the concepts
in the concrete model.
The considered types of unit classes and discourse relations decide what information to use for analyzing overall structure in the task at hand. In contrast, the defined ontology does not distinguish concrete types of discourse units, since they can be assumed task-independent. While discourse relations are also task-independent (as mentioned), different subsets of the 23 relation types from rhetorical structure theory may be beneficial in different tasks, or even other relation types, such as those from the Penn Discourse Treebank (Carlson et al. 2001).
As an example, we illustrate the two instantiation steps for the sentiment analysis
of reviews, as tackled in our project ArguAna (cf. Sect. 2.3). Reviews comprise a
positional argumentation, where an author collates and structures a choice of state-
ments (i.e., facts and opinions) about a product or service in order to inform intended
recipients about his or her beliefs (Besnard and Hunter 2008). The conclusion of a
review is often not explicit, but it is quantified in terms of an overall sentiment rating.
For example, a review from the hotel domain may look like the following:
Fig. 5.7 Instantiation of the structure part of the metamodel from Fig. 5.5 with concrete concepts
of discourse relations and unit classes as well as with individuals of the concepts for the example
hotel review discussed in the text.
We spent one night at that hotel. Staff at the front desk was very nice, the room was clean
and cozy, and the hotel lies in the city center... but all this never justifies the price, which
is outrageous!
Five statements can be identified in the review: A fact on the stay, followed by two
opinions on the staff and the room, another fact on the hotel’s location, and a final
opinion on the price. Although there are more positive than negative statements, the
argumentation structure of the review reveals a negative global sentiment, i.e., the
overall sentiment to be inferred in the sentiment analysis of reviews.
In simplified terms, the argumentation structure is given by a sequential compo-
sition of statements with local sentiments on certain aspects of the hotel. Figure 5.7
models the argumentation structure of the example review as an instance of our
metamodel from Fig. 5.5: Each statement represents a discourse unit whose unit
class corresponds to a local sentiment. Here, we cover the positive and negative
polarities of opinions as well as the objective nature of a fact as classes of local sen-
timent. They entail the order relation already mentioned above. The five statements
are connected by four discourse relations. The discourse relations in turn instantiate three concrete types from the set of distinguished discourse relation types.
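For illustration, the following sketch instantiates the model for the example review: each of the five statements becomes a discourse unit whose unit class is its local sentiment, and neighboring units are connected by discourse relations. The relation type names are placeholders, since the concrete types are not restated here.

```python
from dataclasses import dataclass

@dataclass
class DiscourseUnit:
    text: str
    unit_class: str   # "positive", "negative", or "objective"

# The five statements of the example hotel review.
units = [
    DiscourseUnit("We spent one night at that hotel.", "objective"),
    DiscourseUnit("Staff at the front desk was very nice,", "positive"),
    DiscourseUnit("the room was clean and cozy,", "positive"),
    DiscourseUnit("and the hotel lies in the city center...", "objective"),
    DiscourseUnit("but all this never justifies the price, which is outrageous!",
                  "negative"),
]

# Discourse relations connect neighboring units; the type names are placeholders.
relations = [(i, i + 1, "elaboration") for i in range(len(units) - 1)]

# The local sentiment flow induced by the structure-oriented view.
flow = [unit.unit_class for unit in units]
print(flow)   # ['objective', 'positive', 'positive', 'objective', 'negative']
```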
What Fig. 5.7 highlights is the sequence information induced by our shallow
structure-oriented view of text analysis. In particular, according to this view an
argumentative text implies both a sequence of unit classes and a sequence of discourse
relations. In the remainder of this chapter, we examine to what extent such sequences can be exploited in features for an effective and domain-robust text analysis. How-
ever, Fig. 5.7 also illustrates that the representation of a text as an instance of the
defined metamodel is tedious and space-consuming. Instead, we therefore prefer
visualizations in the style of Fig. 5.6 from here on.
The metamodel presented in this section defines an abstract view of text analysis
dedicated to tasks that deal with the classification of argumentative texts. At its
heart, it represents the overall structure of a text on the pragmatic level, namely, as
a sequence of interrelated discourse units of certain types. For a classification task
at hand, we propose to instantiate the metamodel and to then derive features from
the resulting concrete model. Since the metamodel defines a significant abstraction,
some information in and about argumentative texts is covered only implicitly if at
all, such as lexical and syntactic properties. When it comes to the computation of
features, missing information can still be integrated with the structural information
derived from the model, though, if needed. The same holds for information related
to the argumentation and content part of the metamodel.
The main goal of this chapter is to develop more domain-robust text classification
approaches. In this regard, our resort to discourse structure in the metamodel for sup-
porting domain independence follows related research. For instance, Ó Séaghdha and
Teufel (2014) hypothesize (and provide evidence to some extent) that the language
used in a text to convey discourse function is independent of the topical content of
the text. Moreover, as argued, the question of how to represent discourse structure is
largely independent from the text analysis tasks being addressed.
On the contrary, the concrete unit classes to be chosen for the discourse units
strongly depend on the given task and they directly influence the effectiveness of
all approaches based on the respective model. In particular, the underlying assumption is that the composition of unit classes in a text reflects argumentation structure on
an abstraction level that helps to solve the task. What makes the choice even more
complex is that a text analysis algorithm is required, in general, to infer the unit
classes of discourse units, since the unit classes are usually unknown beforehand.
Such an algorithm may be domain-dependent again, which then shifts the domain
robustness problem from the text level to the unit level instead of overcoming it.
At least, the shift may go along with a reduction of the problem, as in the case of
modeling local sentiment polarities to reflect global sentiment scores. Also, later on
we discuss what types of fully or nearly domain-independent unit classes can be used
to model argumentation structure and when.
At first sight, the shallow nature of our model of argumentation structure has short-
comings. Especially, the limitation that discourse relations always connect neighbor-
ing discourse units makes an identification of deeper non-sequential interactions of
the arguments in a text hard. Such interactions are in the focus of many approaches
related to argumentation mining (cf. Sect. 2.4). For text classification, however, we
argue that the shallow model should be preferred over deeper models (as far as the
abstraction in our model proves useful): Under our hypothesis from Sect. 5.1, the
overall structure of an argumentative text is decisive for its class in several text clas-
sification tasks. By relying on a more abstract representation of these argumentation
structures, the search space of possible patterns in the structures is reduced and,
hence, common patterns can be found more reliably. In the next section, we provide
evidence for the impact of such patterns.
Preprocessing. Each text is split into sentences and tokens, the latter enriched with
part-of-speech and phrase chunk tags. For these annotations, we rely on the respective
language-specific versions of the algorithms sse, sto2 , tpo1 , and pch. Details on
the algorithms are found in Appendix A.
Features. Based on the annotations, we evaluate the following 15 feature types. Each
type comprises a number of single features (given in brackets).9 For exact parameters
of the feature types, see Appendix B.4. With respect to the information they capture,
the 15 types can be grouped into five sets:
1. Word features, namely, the distribution of frequent token 1-grams (324–425 features), token 2-grams (112–217), and token 3-grams (64–310),
2. Class features, namely, the distribution of class-specific words (8–83) that occur three times as often in one class as in every other (Lee and Myaeng 2002), some lexicon-based sentiment scores (6), and the distribution of subjective sentiment words (123–363). The latter two rely on SentiWordNet (Baccianella et al. 2010), which is available for English only.
3. Part-of-speech (POS) features, namely, the distribution of POS 1-grams (43–52), POS 2-grams (63–178), and POS 3-grams (69–137),
4. Phrase features, namely, the distribution of chunk 1-grams (8–16), chunk 2-grams (24–74), and chunk 3-grams (87–300), and
5. Stylometry features common in authorship attribution (Stamatatos 2009), namely, the distribution of character 3-grams (200–451) and of the most frequent function words (100) as well as lexical statistics (6) like the number of tokens in the text and in a sentence on average.
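As an illustration of the distributional feature types, the following sketch determines the token 1-grams to be used as features on a training set and computes their normalized frequencies for a single text; the same scheme carries over to the other n-gram features. The minimum-occurrence threshold is an assumption.

```python
from collections import Counter

def frequent_ngrams(training_token_lists, n=1, min_count=5):
    """N-grams that occur at least min_count times in the training set."""
    counts = Counter()
    for tokens in training_token_lists:
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return [gram for gram, count in counts.items() if count >= min_count]

def ngram_features(tokens, feature_grams, n=1):
    """Normalized frequency of each feature n-gram in one text."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    total = max(sum(counts.values()), 1)
    return [counts[gram] / total for gram in feature_grams]
```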
Experiments. Given the two domains for each of the two considered tasks, we create
classifiers for all feature types in isolation using supervised learning. Concretely, we
separately train one linear multi-class support vector machine from the LibSVM
integration of Weka (Chang and Lin 2011; Hall et al. 2009) on the training set
of each corpus, optimizing parameters on the respective validation set.10 Then, we
measure the accuracy of each feature type on the test sets of the corpora in the
associated task in two scenarios: (1) when training is performed in-domain on the
training set of the corpus, and (2) when training is performed on the training set of the
other domain.11 The results are listed in Table 5.1 for each possible combination A2B
of a training domain A and a test domain B.
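The experimental protocol can be summarized by the following sketch, which trains one classifier per training domain and measures its accuracy on every test domain, yielding the A2B values reported in Table 5.1. A linear SVM and simple token 1-gram features from scikit-learn stand in for the Weka/LibSVM setup and the 15 feature types of the original experiments.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.svm import LinearSVC

def cross_domain_accuracies(corpora):
    """corpora maps a domain name to (train_texts, train_labels, test_texts,
    test_labels). Returns the accuracy for every scenario A2B."""
    results = {}
    for a, (train_x, train_y, _, _) in corpora.items():
        vectorizer = CountVectorizer(min_df=3)   # token 1-gram features
        classifier = LinearSVC()                 # default parameters; the original
        classifier.fit(vectorizer.fit_transform(train_x), train_y)  # experiments tune them
        for b, (_, _, test_x, test_y) in corpora.items():
            predictions = classifier.predict(vectorizer.transform(test_x))
            results[f"{a}2{b}"] = accuracy_score(test_y, predictions)
    return results
```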
9 The numbers of features vary depending on the processed training set, because only distributional
features with some minimum occurrence are considered (cf. Sect. 2.1).
10 The Sentiment Scale dataset is partitioned into four author datasets (cf. Sect. C.4). Here,
we use the datasets of authors c and d for training, b for validation, and a for testing.
11 Research on domain adaptation often compares the accuracy of a classifier in its training domain to its accuracy in some other test domain (i.e., A2A vs. A2B), because training data from the test domain is assumed to be scarce. However, this leaves unclear whether an accuracy change may be caused by a varying difficulty of the task at hand across domains. For the analysis of domain invariance, we therefore put the comparison of different training domains for the same test domain (i.e., A2A vs. B2A) in the focus here.
Table 5.1 Accuracy of 15 common feature types in experiments with 3-class sentiment analysis and 3-class language function analysis for eight different scenarios A2B. A is the domain of the training texts, and B the domain of the test texts, with A, B ∈ {hotel (H), film (F), music (M), smartphone (S)}. The two right-most columns show the minimum observed accuracy of each feature type and the maximum loss of accuracy points caused by training out of the test domain.
Feature type Sentiment polarity (en) Language function (de) Min Max
H2H F2H F2F H2F M2M S2M S2S M2S acc’y loss
Token 1-grams 64 % 44 % 42 % 30 % 71 % 54 % 67 % 42 % 30 % −25
Token 2-grams 50 % 46 % 40 % 65 % 51 % 48 % 69 % 20 % 20 % −49
Token 3-grams 25 % 40 % 40 % 37 % 42 % 49 % 66 % 27 % 25 % −39
Class-specific words 43 % 40 % 24 % 24 % 78 % 67 % 61 % 24 % 24 % −37
Sentiment scores 61 % 49 % 43 % 27 % n/a n/a n/a n/a 27 % −16
Sentiment words 60 % 45 % 39 % 28 % n/a n/a n/a n/a 27 % −15
POS 1-grams 57 % 44 % 39 % 36 % 72 % 53 % 58 % 40 % 36 % −19
POS 2-grams 53 % 48 % 41 % 36 % 68 % 54 % 64 % 40 % 36 % −24
POS 3-grams 44 % 41 % 40 % 23 % 57 % 49 % 57 % 55 % 23 % −17
Chunk 1-grams 51 % 46 % 37 % 28 % 55 % 53 % 57 % 30 % 28 % −27
Chunk 2-grams 56 % 43 % 39 % 33 % 62 % 50 % 57 % 31 % 31 % −26
Chunk 3-grams 54 % 44 % 41 % 33 % 60 % 55 % 58 % 37 % 33 % −21
Character 3-grams 53 % 51 % 38 % 24 % 62 % 52 % 63 % 39 % 24 % −24
Function words 60 % 44 % 43 % 37 % 79 % 61 % 62 % 49 % 37 % −18
Lexical statistics 43 % 38 % 40 % 40 % 52 % 18 % 34 % 48 % 18 % −34
All features 65 % 40 % 49 % 40 % 81 % 59 % 73 % 48 % 40 % −25
Limited Domain Invariance of all Feature Types. The token 1-grams perform
comparably well in the in-domain scenarios (H2H, F2F, M2M, and S2S), but their
accuracy significantly drops in the out-of-domain scenarios with a loss of up to
25 % points. The token 2-grams and 3-grams behave inconsistently in these respects, indicating that their distribution cannot be learned reliably from the given corpora. In any case, the minimum observed accuracy of all word features lies below 33.3 %, i.e., below chance
in three-class classification. The class features share this problem, but they seem less
domain-dependent, as far as they are available in the respective scenarios. While the sentiment
scores and the sentiment words work well in three of the four sentiment analysis sce-
narios, the class-specific words tend to be more discriminative for language functions
with one exception: In M2S, they fail, most likely because only 11 class-specific words
have been found in the music training set.
While similar observations can be made for many of the remaining feature types,
we restrict our view to the best results, all of which refer to features that capture
style: In terms of domain robustness, the part-of-speech features are the best group
in our experiments with a maximum loss of 17–24 points. Especially the POS 1-
grams turn out to be beneficial across domains, achieving high accuracy in a number
Fig. 5.8 Illustration of capturing the structure of a sample hotel review as a local sentiment flow,
i.e., the sequence of local sentiment in the statements of the review.
of scenarios and always improving over chance. Only the function words do better, with an accuracy of at least 37 % and a loss of at most 18 points when trained outside the domain of application. At the same time, the function words yield the best accuracy among all single feature types, namely 79 % in the case of M2M.
The consistent benefit of function words across domains matches results from
the above-mentioned related work in authorship attribution (Menon and Choi 2011).
However, Table 5.1 also conveys that neither the function words nor any other of the
15 content and style feature types seems strongly domain-invariant. In addition, the last row of the table makes explicit that these types do not suffice to achieve high
quality in the evaluated classification tasks, although the combination of features
at least performs best in all in-domain scenarios. For out-of-domain scenarios, we
hence need to find features that are both effective and domain-robust.
In order to achieve high effectiveness across domains, features are needed that model
text properties, which are specific to the task being addressed and not to the domain
of application. In the chapter at hand, we focus on the classification of argumentative
texts. To obtain domain-invariant features for such texts, our hypothesis is that we
can exploit their sequential argumentation structure, as captured in our model from
Sect. 5.2. Before we turn to the question of domain invariance, however, we first
provide evidence that such structure can be decisive for text classification. For this
purpose, we refer to the sentiment analysis of hotel reviews again.
As stated in Sect. 5.2, the argumentation structure of a review can be represented
by the sequence of local sentiment classes in the review. We distinguish between
positive (pos), negative (neg), and objective (obj). Following Mao and Lebanon
(2007), we call the sequence the local sentiment flow. For instance, the local sentiment
flow of the example review from Sect. 5.2 is visualized in Fig. 5.8. It can be seen as
an instance of the structure classification task ontology from Sect. 5.2, in which
discourse relations are ignored. Not only in the example review, the local sentiment
(a) all statements (b) statements in titles (c) first statements (d) last statements
Fig. 5.9 a The fractions of positive opinions (upper part of the columns), objective facts (center
part), and negative opinions (lower part) in the texts of the ArguAna TripAdvisor corpus,
separated by the sentiment scores between 1 and 5 of the reviews they refer to. b–d The respective
fractions for statements at specific positions in the reviews.
flow of a review impacts the review’s global sentiment, as we now demonstrate. For
a careful distinction between the frequency and the composition of local sentiment,
we hypothesize three dependencies:
1. The global sentiment of a review correlates with the ratio of positive and negative
opinions in the review.
2. The global sentiment of a review correlates with the polarity of opinions at certain
positions of the review.
3. The global sentiment of a review depends on the review’s local sentiment flow.
To test the hypotheses, we statistically analyze our ArguAna TripAdvisor
corpus, which contains 2100 reviews from the hotel domain.12 As detailed in
Appendix C.2, each review has a title and a body and it has been manually seg-
mented into facts, positive opinions, and negative opinions that are annotated as
such. Since the corpus is balanced with respect to the reviews’ global sentiment
scores between 1 (worst) and 5 (best), we can directly measure correlations between
local and global sentiment in the corpus.
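The correlation analysis underlying Fig. 5.9(a) can be reproduced along the following lines, assuming each review is available as a pair of its global score and the list of its statement labels ('pos', 'neg', 'obj'); this representation is a simplification of the corpus format described in Appendix C.2, and the function name is ours.

from collections import Counter

def local_sentiment_fractions(reviews):
    # reviews: list of (score, statements) pairs with score in 1..5 and each
    # statement labeled 'pos', 'neg', or 'obj'.
    counts = {score: Counter() for score in range(1, 6)}
    for score, statements in reviews:
        counts[score].update(statements)
    fractions = {}
    for score, counter in counts.items():
        total = sum(counter.values()) or 1
        fractions[score] = {label: counter[label] / total for label in ("pos", "neg", "obj")}
    return fractions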
Figure 5.9(a) illustrates that hypothesis 1 turns out to be true statistically for our
corpus, matching the intuition that, the larger the fraction of positive opinions, the
better the sentiment score, and vice versa: On average, a hotel review with score 1 is
made up of 71 % negative and 9.4 % positive opinions. This ratio decreases strictly monotonically with increasing scores, down to 5.1 % negative and 77.5 % positive
opinions for score 5. The impact of the frequency of local sentiment is obvious.
Interestingly, the fraction of facts remains stable close to 20 % at the same time.
For the second hypothesis, we compute the distributions of opinions and facts in the reviews' titles as well as in the first and last statements of the reviews' bodies. In comparison with Fig. 5.9(a), the results for the title distributions in Fig. 5.9(b) show much larger gaps in the above-mentioned ratio, with facts appearing only rarely, suggesting that the sentiment polarity of the title of a hotel review often reflects the
review’s global sentiment polarity. Conversely, Fig. 5.9(c) shows that over 40 % of
all first statements denote facts, irrespective of the sentiment score. This number may
12 For information on the source code of the statistical analysis, see Appendix B.4.
Table 5.2 The 13 most frequent sentiment change flows in the ArguAna TripAdvisor corpus
and their distribution over all possible global sentiment scores.
13 Alternatively, Mao and Lebanon (2007) propose to ignore the objective facts. Our corresponding experiments did not yield new insights except for a higher frequency of trivial flows. For lack of relevance, we do not present results on local sentiment flows here, but they can easily be reproduced using the provided source code (cf. Appendix B.4).
combine consecutive statements with the same local sentiment, thereby obtaining
local sentiment segments. We define the sentiment change flow of a review as the
sequence of all such segments in the review’s body.14 In case of the example in
Fig. 5.8, e.g., the second and third statements have the same local sentiment. Hence,
they refer to the same segment in the sentiment change flow, (obj, pos, obj, neg).
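Computationally, a sentiment change flow is obtained by collapsing runs of identical local sentiment labels, as in the following sketch (the function name is ours):

from itertools import groupby

def sentiment_change_flow(local_sentiment_flow):
    # Collapse consecutive statements with the same local sentiment into one segment,
    # e.g. ('obj', 'pos', 'pos', 'obj', 'neg') becomes ('obj', 'pos', 'obj', 'neg').
    return tuple(label for label, _ in groupby(local_sentiment_flow))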
In total, our corpus contains reviews with 826 different sentiment change flows.
Table 5.2 lists all those with a frequency of at least 1 %. Together, they cover over one
third (34.8 %) of all texts. The most frequent flow, (pos), represents the 161 (7.7 %)
fully positive hotel reviews, whereas the best global sentiment score 5 is indicated
by flows with objective facts and positive opinions (table lines 4, 5, and 7). Quite
intuitively, (neg, pos, neg) and (pos, neg, pos) denote typical flows of reviews with
score 2 and 4, respectively. In contrast, none of the listed flows clearly indicates
score 3. The highest correlation is observed for (neg, obj, neg), which results in
score 1 in 88.9 % of the cases.
The outlined cooccurrences offer strong evidence for the hypothesis that the global
sentiment of a review depends on the review’s local sentiment flow. Even more, they
imply the expected effectiveness (in the hotel domain) of a single feature based on
a sentiment change flow. In particular, the frequency of a flow can be seen as the
recall of any feature that applies only to reviews matching the flow. Correspondingly,
the distribution of a flow over the sentiment scores shows what precision the feature
would achieve in predicting the scores. However, Table 5.2 also reveals that all found
flows cooccur with more than one score. Thus, we conclude that sentiment change
flows do not decide global sentiment alone. This becomes explicit for (obj, pos, neg,
pos), which is equally distributed over scores 3–5.
14 In
Wachsmuth et al. (2014b), we name these sequences argumentation flows. In the given more
general context, we prefer a more task-specific naming in order to avoid confusion.
Fig. 5.10 The fractions of five types of discourse relations under all discourse relations, separated
by the sentiment scores between 1 and 5 of the reviews they refer to. The discourse relations were
found using the algorithm pdr (cf. Appendix A.1).
Unlike local sentiment, discourse relations are not annotated in the ArguAna
TripAdvisor corpus. Instead, we processed all texts with our heuristic discourse
parser pdr, which distinguishes a subset of 10 discourse relation types from rhetor-
ical structure theory (cf. Appendix A.1 for details). Based on the output of pdr, we
have computed the distributions of each discourse relation type over the sentiment
scores of the reviews in the corpus in order to check for evidence for the first hy-
pothesis. For brevity, Fig. 5.10 shows only the results of those types that occur in the
discourse change flows, which we discuss below.
Figure 5.10 stresses that, in terms of sentiment analysis, one of the most important
discourse relation types is the contrast between discourse units: Quite intuitively,
medium reviews (those with sentiment score 3) yield the largest fraction of contrast
relations (6.9 %). This is more than twice as high as the fraction in the score 5
reviews (3.0 %) on average. The use of cause relations, which often serve to justify statements, seems to be a sign of rather negative sentiment in hotel reviews. Interestingly, circumstance relations (like when or where something happened) behave even more strongly in this way; they cooccur 3.8 times as often with score 1 as with score 4 or 5. Conversely, motivation relations (e.g. indicated by second person voice) appear more frequently in medium and positive hotel reviews, and concession relations (e.g. indicated by the connective "although") play a particular role in score 4 reviews.
The outlined correlations between frequencies of discourse relations and global
sentiment support hypothesis 1. The remaining question is whether the overall struc-
ture of discourse relations in the text is decisive, as captured by hypothesis 2. Analogous
to above, we again consider changes in the discourse relation flows, resulting in dis-
course change flows, which would be (background, elaboration, contrast) for our
example review. However, not coincidentally, we have left out both elaboration and
sequence in Fig. 5.10. Together, these two types make up over 80 % of all discourse
relations found by pdr, rendering it hard to find common flows with other (poten-
tially more relevant) types. Thus, we ignore elaboration and sequence relations in
the determination of the most frequent discourse change flows.
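Discourse change flows can be computed in the same manner, optionally filtering out the dominant relation types first; a minimal sketch follows (the names are ours, and the relation labels are assumed to come from pdr):

from itertools import groupby

def discourse_change_flow(relations, ignored=()):
    # Without filtering, the example review yields ('background', 'elaboration', 'contrast').
    # For the flows of Table 5.3, pass ignored=('elaboration', 'sequence') so that these
    # two dominant types do not mask the rarer, potentially more relevant ones.
    filtered = [relation for relation in relations if relation not in ignored]
    return tuple(relation for relation, _ in groupby(filtered))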
Table 5.3 shows the distributions of sentiment scores for the 12 discourse change flows that each represent at least 1 % of the reviews in the ArguAna TripAdvisor corpus and that cover 44.6 % of the corpus reviews in total. The flows are
Table 5.3 The 12 most frequent discourse change flows in the ArguAna TripAdvisor corpus
(when ignoring sequence and elaboration relations) and their distribution over all possible global
sentiment scores. The relations were found using the discourse parser pdr (cf. Appendix A).
grouped according to the contained discourse relation types. This is possible because
only contrast cooccurs with other types in the listed flows, certainly due to the low
frequency of the others.
About every fourth review (25.2 %) contains only contrast relations (except for
sequences and elaborations). Compared to Fig. 5.10, such reviews differ from an
average review with contrast relations, having their peak at score 2 (28.1 %). Similar
observations can be made for (cause). The flows found with circumstance relations suggest that discourse change flows do not influence sentiment scores: No matter whether a contrast is expressed before or after a circumstance, the respective review tends to be mostly negative. However, this is different for concession and motivation rela-
tions. E.g., a motivation in isolation leads to score 4 or 5 in over 60 % of the cases,
whereas the flow (contrast, motivation) most often cooccurs with score 3. (motiva-
tion, contrast) even speaks for a negative review on average. An explanation might
be that readers shall be warned right from the beginning in case of negative hotel
experiences, while recommendations are rather made at the end.
Altogether, the correlations between scores and flows in Table 5.3 are not as clear
as for the sentiment change flows. While the existence of certain discourse relations
obviously affects the global sentiment of the hotel reviews, their overall structure
seems sometimes but not always decisive for the sentiment analysis of reviews.
Features that model the content of a text are extensively used in text classification
(Aggarwal and Zhai 2012). In case of argumentative texts, many approaches also rely
on style features (Stamatatos 2009). In this section, we have first offered evidence
for the domain dependence of typical content and style features in the classification
of argumentative texts. As an alternative, we have then analyzed two ways to capture
the overall structure of such texts (reviews, specifically): sentiment change flows and
discourse change flows. The former rely on task-specific information, while the
latter capture discourse structure. We argue that both can be seen as shallow models
of argumentation structure that abstract from content.
The revealed existence of patterns in the sentiment change flows and discourse
change flows of hotel reviews that cooccur with certain sentiment scores demonstrates
the impact of modeling argumentation structure. In addition, the abstract nature
of the two types of flows brings up the possibility that the found or similar flow
patterns generalize well across domains. For these reasons, we concretize our hypothesis from Sect. 5.1 by supposing that features based on such flow patterns help to achieve (1) domain robustness and (2) high effectiveness in
text analysis tasks that deal with the classification of argumentative texts.
Unlike the sentiment analysis of reviews, however, at least the use of local sentiment flows does not generalize well to all such tasks. E.g., scientific ar-
ticles only sporadically contain sentiment at all, making respective features useless
for classification. Still, we assume that local sentiment flows have correspondences
in other classification tasks, which hence allow for corresponding unit class flows.
To this end, our structure-oriented view from Sect. 5.2 needs to be concretized ade-
quately for the task at hand. A few examples for such instantiations have already been
sketched in Sect. 5.2. The discourse change flows may be more generally beneficial,
but the experiments above suggest that the impact of analyzing discourse structure may also be lower than that of analyzing task-specific unit classes.
As motivated, the focus on argumentation structure targets the development of text analysis algorithms that improve the robustness and intelligibility of the
pipeline they are employed in. In this regard, we investigate the benefits of mod-
eling such structure in Sects. 5.4 and 5.5. Since the two types of change flows in
the form considered here still depend on the length of the analyzed texts to some
extent, they may not be optimal for cross-domain text analysis, though. Below, we
therefore introduce a novel alternative way of identifying typical overall structures of
argumentative texts. It allows computing the similarity between arbitrary texts and,
hence, does not require detecting exactly the same patterns across domains.
Either way, our analysis of the ArguAna TripAdvisor corpus in this section
shows that structure-based features are not always decisive for text classification. To
enable high-quality text mining, a combination with other feature types may thus be
required. Candidates have already been presented, namely, some of the content and
style features as well as the distribution of unit classes and discourse relations. We
evaluate below how their integration affects domain robustness and effectiveness.
Based on our structure-oriented model from Sect. 5.2, we now develop statistical
features that aim for an effective classification of argumentative texts across domains.
First, we learn a set of common patterns of the overall structure of such texts through
supervised clustering. In the sense of the overall analysis from Fig. 5.1, we then
use each such flow pattern as a similarity-based feature for text classification. We
detail and evaluate a corresponding analysis for sentiment scoring, reusing content
from Wachsmuth et al. (2014a). Our results suggest that the considered flow patterns
learned in one domain generalize to other domains to a wide extent.
Based on flows, we define flow patterns to capture the overall structure we seek:
Flow Pattern. A flow pattern f ∗ denotes the average of a set of similar length-
normalized flows F = {f1 , . . . , f|F| }, |F| ≥ 1, of some concrete flow type.
Analogous to deriving semantic concepts from Wikipedia articles, we determine a set of flow patterns F∗ = {f∗1 , . . . , f∗|F∗ | }, |F∗ | ≥ 1, of some flow type Cf from a
training set of argumentative texts. Unlike the semantic concepts, however, we aim
for common patterns that represent more than one text. Therefore, we construct F ∗
from flow clusters that cooccur with text classes. Each pattern is then deployed as
a single feature, whose value corresponds to the similarity between the pattern and
a given flow. The resulting feature vectors can be used for statistical approaches to
text classification, i.e., for learning to map flows to text classes (cf. Sect. 2.1). In the
following, we detail the outlined process and we exemplify it for sentiment scoring.
15 In the evaluation at the end of this section, we present results on the extent to which the effective-
ness of inferring Cf affects the quality of the features based on the flow patterns.
Fig. 5.11 a Illustration of the local sentiment flow and the discourse relation flow of the sample
text from Sect. 5.3. b Length-normalized versions of the two flows for length 18 (local sentiment)
and 17 (discourse relations), respectively.
sentiment values, while the nominal discourse relation types are duplicated for lack
of reasonable alternatives. The chosen normalized lengths are exemplary only.
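One plausible realization of such a length normalization is sketched below: metric flows (like local sentiment mapped to 0.0, 0.5, and 1.0) are stretched by linear interpolation over relative text positions, while nominal flows (like discourse relation types) are stretched by duplicating labels. The exact normalization used in the experiments may differ; the sketch is only meant to make the idea concrete.

import numpy as np

def normalize_metric_flow(values, target_length):
    # Interpolate a metric flow to a fixed length over relative positions in [0, 1].
    positions = np.linspace(0.0, 1.0, num=len(values))
    targets = np.linspace(0.0, 1.0, num=target_length)
    return np.interp(targets, positions, values)

def normalize_nominal_flow(labels, target_length):
    # Duplicate nominal labels, since there is no meaningful way to interpolate them.
    indices = np.linspace(0, len(labels) - 1, num=target_length).round().astype(int)
    return [labels[i] for i in indices]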
Once the set of all normalized flows F has been created from DT , flow patterns
can be derived. As usual for feature computations (cf. Sect. 2.1), however, it may be
reasonable to discard rare flows beforehand (say, flows that occur only once in the training
set) in order to avoid capturing noise.
Now, our hypothesis behind flow patterns is that similar flows entail the same
or, if applicable, similar text classes. Here, the similarity of two length-normalized
flows is measured in terms of some similarity function (cf. Sect. 2.1). For instance, the
(inverse) Manhattan distance may capture the similarity of the metric local sentiment
flows. In case of discourse relation flows, we can at least compute the fraction of
matches. With respect to the chosen similarity function, flows that construct the same
pattern should be as similar as possible and flows that construct different patterns
as dissimilar as possible. Hence, it seems reasonable to partition the set F using
clustering (cf. Sect. 2.1) and to derive flow patterns from the resulting clusters.
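The two similarity functions mentioned above can be realized as follows; the scaling of the inverse Manhattan distance into (0, 1] is our choice, not prescribed by the text, and both function names are ours.

def manhattan_similarity(flow_a, flow_b):
    # Inverse Manhattan distance between two length-normalized metric flows;
    # identical flows receive similarity 1.0.
    distance = sum(abs(a - b) for a, b in zip(flow_a, flow_b))
    return 1.0 / (1.0 + distance)

def match_similarity(flow_a, flow_b):
    # Fraction of positions at which two nominal flows carry the same label.
    matches = sum(a == b for a, b in zip(flow_a, flow_b))
    return matches / len(flow_a)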
In particular, we propose to perform supervised clustering, which can be under-
stood as a clustering variant, where we exploit knowledge about the training text
classes to ensure that all obtained clusters have a certain purity. In accordance with
the original purity definition (Manning et al. 2008), here purity denotes the fraction of
those flows in a cluster, whose text class equals the majority class in the cluster. This
standard purity assumes exactly one correct class for each flow, implying that a flow
alone decides the class, which is not necessarily what we aim for (as discussed in Sect. 5.2). At least for larger classification schemes, a relaxed purity definition seems advisable. For example, our results for sentiment scores between 1 and 5 in Sect. 5.3 suggest also treating the dominant neighbor of the majority score as correct. Either way, based
on any measure of purity, we define supervised flow clustering:
Supervised Flow Clustering. Given a set of flows F with known classes, determine a clustering F = {F1 , . . . , F|F| } of F with F1 ∪ · · · ∪ F|F| = F and Fi ∩ Fj = ∅ for Fi ≠ Fj ∈ F, such that the purity of each Fi ∈ F lies above some threshold.
Fig. 5.12 Illustration of hierarchically clustering a set of 20 sample flows, represented by their
associated text classes between 1 and 5. The example relaxed purity threshold of 0.8 leads to cuts
in the hierarchy that create four flow clusters.
Fig. 5.13 Sketch of the construction of (a) a sentiment flow pattern (dashed curve) from two length-normalized sample local sentiment flows (circles and squares) and (b) a discourse flow pattern (bold) from three corresponding discourse relation flows.
We seek clusters with a high purity, as they indicate specific classes, which
matches the intuition behind the flow patterns to be derived. At the same time, the
number of clusters should be small in order to achieve a high average cluster size
and, thus, a high commonness of the flow patterns. An easy way to address both
requirements is to rely on a hierarchical clustering (cf. Sect. 2.1), where we can
directly choose a flat clustering F with a desired number |F| of clusters through cuts
at appropriate nodes in the binary tree of the associated hierarchy. To minimize the
number of clusters, we then search for all cuts closest to the tree’s root that create
clusters whose purity lies above the mentioned threshold. Figure 5.12 exemplifies
the creation for the relaxed purity defined above and a threshold of 0.8.
The centroid of each flow cluster, i.e., the mean of all contained flows, finally
becomes a flow pattern, if it is made up of some minimum number of flows.
Figure 5.13(a) sketches the resulting construction of a sentiment flow pattern for
two sample local sentiment flows of normalized length. Since we consider local
sentiment as a metric classification scheme here, each value of the flow pattern
can in fact represent the average of the flow values. In contrast, for nominal flow types, a possible alternative is to use the majority flow type in the resulting patterns.
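A minimal sketch of the purity-constrained cuts is given below, using SciPy's agglomerative clustering with average linkage and Manhattan distance. It implements the standard purity only; the relaxed purity and the minimum cluster size from the text would be added on top, and all names are ours.

import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def purity(classes):
    # Standard purity: fraction of members carrying the cluster's majority class.
    _, counts = np.unique(classes, return_counts=True)
    return counts.max() / counts.sum()

def pure_flow_clusters(flows, classes, threshold=0.8):
    # Cut the cluster hierarchy as close to the root as possible such that every
    # resulting cluster reaches the purity threshold. Returns lists of flow indices.
    flows, classes = np.asarray(flows), np.asarray(classes)
    root = to_tree(linkage(flows, method="average", metric="cityblock"))
    clusters = []

    def descend(node):
        members = node.pre_order()    # indices of all flows below this node
        if node.is_leaf() or purity(classes[members]) >= threshold:
            clusters.append(members)  # cut here: the cluster is pure enough
        else:
            descend(node.get_left())  # otherwise, split further down the tree
            descend(node.get_right())

    descend(root)
    return clusters

A flow pattern would then be the centroid of each sufficiently large cluster, e.g. np.asarray(flows)[members].mean(axis=0) for metric flows.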
determineFlowPatterns(DT , C T , Cf , Πf )
1: Training flows F ← ∅
2: for each i ∈ {1, . . . , |DT |} do
3: Πf .process(DT [i])
4: Flow f ← DT [i].getOrderedFlowClasses(Cf )
5: f ← normalizeFlow(f)
6: F ← F ∪ {f, C T [i]}
7: F ← retainSignificantFlows(F)
8: Flow patterns F ∗ ← ∅
9: Clusters F ← performSupervisedFlowClustering(F)
10: F ← retainSignificantFlowClusters(F)
11: for each Cluster Fi ∈ F do F ∗ ← F ∗ ∪ Fi .getCentroid()
12: return F ∗
Pseudocode 5.1: Determination of a common set of flow patterns F ∗ from a set of training texts DT
and their associated known text classes C T . The patterns are derived from flows of the type Cf whose
instances are in turn inferred with the pipeline Πf .
Figure 5.13(b) illustrates this for a discourse flow pattern derived from three sample
discourse relation flows.
Altogether, the high-level process of deriving common flow patterns from a train-
ing set DT is summarized in Pseudocode 5.1. There, lines 1–7 determine the set of
training flows F. To this end, a given text analysis pipeline Πf infers instances of the
flow type Cf from all texts in DT (line 3). Next, the sequence of instances in each
text is converted into a normalized flow f (lines 4 and 5). In combination with the
associated text class C T [i], f is stored as a training flow (line 6). Once all such flows
have been computed, only those are retained that occur some significant number
of times (line 7). The flow patterns F ∗ are then found by performing the presented
clustering method (line 9) and retaining only clusters of some significant minimum
size (line 10). Each pattern is given by the centroid of a cluster (line 11).
Although we use the notion of an overall analysis as a motivation, our goal is not to
develop a new text classification method. Rather, we propose the derived flow patterns
as a new feature type to be used for classification. Typically, text classification is
approached with supervised learning, i.e., by deriving a classification model from
the feature representations of training texts with known classes. The classification
model then allows classifying the representations of unknown texts (cf. Sect. 2.1).
Fig. 5.14 Two illustrations of measuring the similarity (here, the inverse Manhattan distance) of an
unknown flow to the flow patterns resulting from the clusters in Fig. 5.12: a 2D plot of the distance
of the flow’s vector to the cluster centroids. b Distances of the values of the flow to the values of
the respective patterns.
createTrainingFeatureVectors(DT , C T , Cf , Πf , F ∗ )
1: Training vectors X ← ∅
2: for each i ∈ {1, . . . , |DT |} do
3: Πf .process(DT [i])
4: Flow f ← DT [i].getOrderedFlowClasses(Cf )
5: f ← normalizeFlow(f)
6: Feature values x(i) ← ∅
7: for each Flow Pattern f ∗ ∈ F ∗ do
8: Feature value x∗(i) ← computeSimilarity(f, f∗ )
9: x(i) ← x(i) ∪ {x∗(i)}
10: X ← X ∪ {x(i) , C T [i]}
11: return X
Pseudocode 5.2: Creation of a training feature vector for every text from a training set DT with
associated text classes C T . Each feature value denotes the similarity of a flow pattern from F ∗ to
the text’s flow of type Cf (inferred with the pipeline Πf ).
After the set of common flow patterns F ∗ has been determined, the next step is
hence to compute one training feature vector for each text from the above-processed
sets DT and C T . In order to capture the overall structure of a text, we measure the
similarity of its normalized flow to each flow pattern in F ∗ using the same similarity
function as for clustering (see above). As a result, we obtain a feature vector with
one similarity value for each flow pattern. Figure 5.14 sketches two views of the
similarity computations for a sample flow: On the left, the vector view with the
cluster centroids that represent the flow patterns, mapped into two dimensions (for
illustration purposes) that correspond to the positions in the flow. And, on the right,
the flow view with the single values of the flows and flow patterns.
Pseudocode 5.2 shows how to create a vector for each text in DT given the flow
patterns F ∗ . Based on the normalized flow f of a text (lines 3–5), one feature value
is computed for each flow pattern f ∗ ∈ F ∗ by measuring the similarity between f
and f ∗ (lines 8 and 9). The combination of the ordered set of feature values and the
text class C T [i] defines the vector (line 10).16
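In code, the feature computation of Pseudocode 5.2 reduces to a few lines once the normalized flows, the patterns, and a similarity function (such as the inverse Manhattan distance sketched earlier) are given; the function names are ours.

import numpy as np

def flow_pattern_features(flow, patterns, similarity):
    # One feature value per learned flow pattern: the similarity of the text's
    # normalized flow to that pattern, using the same function as for clustering.
    return np.array([similarity(flow, pattern) for pattern in patterns])

def training_vectors(normalized_flows, classes, patterns, similarity):
    # Counterpart of Pseudocode 5.2: feature matrix X and class vector y.
    X = np.vstack([flow_pattern_features(flow, patterns, similarity) for flow in normalized_flows])
    return X, np.asarray(classes)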
In general, the introduced feature type itself is applicable to arbitrary tasks from the
area of text classification (a discussion of a transfer to other tasks follows at the end
of this section). It works irrespective of the type, language, or other properties of the
input texts being processed, since it outsources the specific analysis of producing
the flow type Cf to the employed text analysis pipeline Πf . Nevertheless, our feature
type explicitly aims to serve for approaches to the classification of argumentative
texts, because it relies on our hypothesis that the overall structure of a text is decisive
for the class of a text, which will not hold for all text classification tasks, e.g. not for
topic detection (cf. Sect. 2.1). While the feature type can cope with all flow types as
exemplified, we have indicated that nominal flow types restrict its flexibility.
Correctness. Similar to the adaptive scheduling approach in Sect. 4.5, the two pre-
sented pseudocodes (Pseudocodes 5.1 and 5.2) define method schemes rather than
concrete methods. As a consequence, we again cannot prove correctness here. In any case, the notion of correctness is generally of limited use in the context of feature computations (cf. Sect. 2.1).
In particular, besides the flow type Cf that is defined as part of the input, both the
realized processes in general and the flow patterns in specific are schematic in that
they imply a number of relevant parameters:
1. Normalization. How to normalize flows and what normalized length to use.
2. Similarity. How to measure the similarity of flows and clusters.
3. Purity. How to measure purity and what purity threshold to ensure.
4. Clustering. How to perform clustering and what clustering algorithm to use.
5. Significance. How often a flow must occur to be used for clustering, and how
large a cluster must be to be used for pattern construction.
For some parameters, reasonable configurations may be found conceptually with regard to the task at hand. Others should rather be determined empirically.
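For concreteness, the schematic parameters can be bundled into a single configuration object along the following lines; the default values are purely illustrative and not those used in the experiments.

from dataclasses import dataclass

@dataclass
class FlowPatternConfig:
    normalized_length: int = 30       # 1. Normalization: target flow length
    similarity: str = "manhattan"     # 2. Similarity: function for flows and clusters
    purity_threshold: float = 0.8     # 3. Purity: minimum purity per cluster
    clustering: str = "hierarchical"  # 4. Clustering: algorithm used to find pure clusters
    min_flow_count: int = 2           # 5. Significance: minimum occurrences of a flow ...
    min_cluster_size: int = 5         #    ... and minimum cluster size for a pattern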
With respect to the question of how to perform clustering, we have clarified that
the benefit of a supervised flow clustering lies in the construction of common flow
patterns that cooccur with certain text classes. A regular unsupervised clustering may
also achieve commonness, but it may lose the cooccurrences. Still, there are scenarios
where the unsupervised variant can make sense, e.g. when rather few input texts with
classes are available, but a large number of unknown texts. Then, semi-supervised
16 For clarity, we have included the computation of flows both in Pseudocode 5.1 and in
Pseudocode 5.2. In practice, the flow of each text can be maintained during the whole process
of feature determination and vector creation and, thus, needs to be computed only once.
learning could be conducted (cf. Sect. 2.1) where flow patterns are first derived from
the unknown texts and cooccurrences thereafter from the known texts.
Although we see the choice of a concrete clustering algorithm as part of the
realization, above we propose to resort to hierarchical clustering in order to be able
to easily find flow clusters with some minimum purity. An alternative would be to
directly compute a flat clustering and then retain only pure clusters. While such an
approach provides less control over the obtained flow patterns, it may significantly
improve run-time: Many flat clustering algorithms scale linearly with the number
of objects to be clustered, whereas most common hierarchical clustering algorithms
are at least quadratic (Manning et al. 2008). This brings us to the computational
complexity of using flow patterns as features for text classification as described. As far
as possible, we now estimate the asymptotic run-time of the schematic pseudocodes
in terms of the O-calculus (Cormen et al. 2009).
The main input size of determineFlowPatterns in Pseudocode 5.1 is the num-
ber of texts in DT (both the number of flows in F and the number of clusters in F
are restricted by |DT |). The complexity of two operations of the method cannot be
quantified asymptotically, namely, the analysis of the pipeline Πf in line 3 and the
computation of a clustering F in line 9. We denote their overall run-times on DT
as tf (DT ) and tF (DT ), respectively. The run-times of all remaining operations on
the flows and clusters depend on the length of the flows only. In the worst case,
the length of all flows is the same but different from the normalized length. We de-
note this length as |fmax | here. The remaining operations are executed either at most
once for each text (like normalization) or once for all texts (like retaining significant
flows), resulting in the run-time O(|DT | · |fmax |). Altogether, we thus estimate the
run-time of determineFlowPatterns as O(tf (DT ) + tF (DT ) + |DT | · |fmax |).
In practice, the most expensive operation will often be the clustering in deter-
mineFlowPatterns for larger numbers of input texts.17 The goal of this chapter is
not to optimize efficiency, which is why we do not evaluate run-times in the follow-
ing experiments, where we employ the proposed features in an overall analysis. We
17 If
the flow of each text from DT is computed only once during the whole process (see above),
Ineq. 5.2 would even be reduced to O (|DT | · |fmax |).
return to the efficiency of feature computation, when we discuss the overall analysis
in the context of ad-hoc large-scale text mining at the end of this section.
Experiments. Based on the four feature types, we evaluate the accuracy of a super-
vised classification of sentiment scores within and across domains. Concretely, we
use different combinations of the employed corpora to train and test the algorithm css
on subsets of all feature types. css learns a mapping from feature values to scores
using a linear support vector machine (cf. Appendix A.1 for details). To this end, we
18 As in Sect. 5.3, the numbers of features vary depending on the training set, because we take only
those features whose frequency in the training texts exceeds some specified threshold (cf. Sect. 2.1).
For instance, a word unigram is taken into account only if it occurs in at least 5 % of the hotel
reviews or 10 % of the film reviews, respectively.
19 In Wachsmuth et al. (2014a), we also evaluate the local sentiment on specific domain concepts in
the given text. For lack of relevance, we leave out respective experiments here.
first processed the respective training set to determine the concrete features of each
type. Then, we computed values of these features for each review. In the in-domain
tasks, we optimized the cost parameter of css during training. For the classification
across domains, we relied on the default parameter value, because an optimization
with respect to the training domain does not make sense there. After training, we
measure accuracy on the respective test set.
In case of the hotel reviews, we trained and optimized css on the training set
and the validation set of the ArguAna TripAdvisor corpus, respectively. On
the film reviews, we performed 10-fold cross-validation (cf. Sect. 2.1) separately on
the dataset of each review author, averaged over five runs. By that, we can directly
compare our results to those of Pang and Lee (2005) who published the Senti-
ment Scale dataset. In particular, we consider their best support vector machine
approach here, called ova.20
Effectiveness Within Each Domain. First, we report on the in-domain tasks. For the
hotel domain, we provide accuracy values both with respect to the sentiment scale
from 1 to 5 and with respect to the mapped scale from 0 to 2. In addition, we compare
the theoretically possible accuracy to the practically achieved accuracy by opposing
the results on the ground-truth local sentiment annotations of the ArguAna Trip-
Advisor corpus to those on our self-created annotations. For the film domain, we
can refer only to self-created annotations. All results are listed in Table 5.4.
In the 5-class scenario on the ground-truth annotations of the hotel reviews, the
local sentiment distributions and the sentiment flow patterns are best under all single
feature types with an accuracy of 52 % and 51 %, respectively. Combining all types
boosts the accuracy to 54 %. Using self-created annotations, however, it significantly
drops down to 48 %. The loss of feature types 2–4 is even stronger, making them per-
form slightly worse than the content and style features (40 %–42 % vs. 43 %). These
results seem not to match those of Wachsmuth et al. (2014a), where the regression error
of the argumentation features remains lower on the self-created annotations. The
reason can be inferred from the 3-class hotel results, which demonstrate the
effectiveness of modeling argumentation for sentiment scoring: There, all argumen-
tation feature types outperform feature type 1. This indicates that at least the polarity
of their classified scores is often correct, thus explaining the low regression errors.
On all four film review datasets, the sentiment flow patterns classify scores most
accurately among the argumentation feature types, but their effectiveness still re-
mains limited.21 The content and style features dominate the evaluation, which again
gives evidence for the effectiveness of such features within a domain (cf. Sect. 5.3).
20 We evaluate only the classification of scores for a focused discussion. In general, a more or less
metric scale like sentiment scores suggests to use regression (cf. Sect. 2.1), as we have partly done
in Wachsmuth et al. (2014a). Moreover, since our evaluation does not aim at achieving maximum
effectiveness in the first place, for simplicity we do not explicitly incorporate knowledge about the
neighborship between classes here, e.g. that score 1 is closer to score 2 than to score 3. An according
approach has been proposed by Pang and Lee (2005).
21 We suppose that the reason behind mainly lies in the limited accuracy of 74 % of our polarity
classifier csp in the film domain (cf. Appendix A.2), which reduces the impact of all features that
rely on local sentiment.
Table 5.4 Accuracy of all evaluated feature types in 5-class and 3-class sentiment analysis on the
test hotel reviews of the ArguAna TripAdvisor corpus based on ground-truth annotations (Cor-
pus) or on self-created annotations (Self) as well as in 3-class sentiment analysis on the film reviews
of author a, b, c and d in the Sentiment Scale dataset. The bottom line compares our approach
to Pang and Lee (2005).
Compared to ova, our classifier based on all feature types is significantly better on
the reviews of author a and a little worse on two other datasets (c and d).
We conclude that the proposed feature types are competitive, achieving similar
effectiveness to existing approaches. In the in-domain task, the sentiment flow
patterns do not fail, but they also do not excel. Their main benefit lies in their strong
domain invariance, as we see next.
Effectiveness Across Domains. We now offer evidence for our hypothesis from
Sect. 5.1 that features like the sentiment flow patterns, which capture overall structure,
improve the domain robustness of classifying argumentative texts. For this purpose,
we apply the above-given classifiers trained in one domain to the reviews of the other
domain. Different from Wachsmuth et al. (2014a), we consider not only the transfer
from the hotel to the film domain, but also the other way round.
For the transfer to the hotel domain, Fig. 5.15 shows the accuracy loss of each
feature type resulting from employing any of the four film datasets instead of hotel
reviews for training (given in percentage points). With a few exceptions, feature
types 1–3 fail in these out-of-domain scenarios. Most significantly, the content and
style features lose 13 (Fa 2H) to 19 % points (Fb 2H), but the local sentiment and
discourse relation distributions seem hardly more robust. As a consequence, the
accuracy of all four feature types in combination is compromised severely. At the
same time, the sentiment flow patterns maintain effectiveness across domains, losing
only 4–10 points through out-of-domain training. This supports our hypothesis. Still,
the local sentiment distributions compete with the sentiment flow patterns in the
resulting accuracy on the hotel reviews.
However, this is different when we exchange the domains of training and appli-
cation. According to Fig. 5.16, the local sentiment distributions denote the second
Fig. 5.15 Accuracy of the four evaluated feature types and their combination on the test hotel
reviews in the ArguAna TripAdvisor corpus based on self-created annotations with training
either on the training hotel reviews (H2H) or on the film reviews of author a, b, c, or d in the
Sentiment Scale dataset (Fa 2H, Fb 2H, Fc 2H, and Fd 2H).
Fig. 5.16 Accuracy of the four evaluated feature types and their combination on the film reviews
of author a, b, c, and d in the Sentiment Scale dataset in 10-fold cross-validation on these
reviews (Fa 2Fa , Fb 2Fb , Fc 2Fc , and Fd 2Fd ) or with training on the training hotel reviews of the
ArguAna TripAdvisor corpus (H2Fa , H2Fb , H2Fc , and H2Fd ).
worst feature type when trained on the hotel reviews. Their accuracy is reduced
by up to 22 % points (H2Fc ), resulting in values around 40 % on all film datasets.
Only the content and style features seem more domain-dependent with drastic drops
between 18 and 41 points. In contrast, the accuracies of the discourse relation distributions and especially of the sentiment flow patterns provide further evidence for the
truth of our hypothesis. They remain almost stable on three of the film datasets. Only
in scenario H2Fd , they also fail with the sentiment flow patterns being worst. Appar-
ently, the argumentation structure of the film review author d, which is reflected by
the found sentiment flow patterns, differs from that of the other authors.
Insights into Sentiment Flow Patterns. For each sentiment score, Fig. 5.17(a) plots
the most common of the 38 sentiment flow patterns that we found in the training
set of the ArguAna TripAdvisor corpus (based on self-created annotations). As
depicted, the patterns are constructed from the local sentiment flows of up to 226
texts. Below, Figs. 5.17(b–c) show the respective patterns for author c and d in the
Sentiment Scale dataset. One of the 75 patterns of author c results from 155
flows, whereas all 41 patterns of author d represent at most 16 flows.
With respect to the shown sentiment flow patterns, the film reviews yield less clear
sentiment but more changes of local sentiment than the hotel reviews. While there
Fig. 5.17 a The three most common sentiment flow patterns in the training set of the ArguAna
TripAdvisor corpus, labeled with their associated sentiment scores. b–c The according sentiment
flow patterns for all scores of the texts of author c and d in the Sentiment Scale dataset.
appears to be some similarity in the overall argumentation structure between the hotel
reviews and the film reviews of author c, two of the three patterns of author d contain
only little clear sentiment at all, especially in the middle parts. We have already
indicated the disparity of the author d dataset in Fig. 4.10 (Sect. 4.4). In particular,
73 % of all discourse units in the ArguAna TripAdvisor corpus are classified
as positive or negative opinions, but only 37 % of the sentences of author d. The
proportions of the three other film datasets at least range between 58 % and 67 %.
These numbers also serve as a general explanation for the limited accuracy of the
argumentation feature types 1–3 in the film domain.
A solution to improve the accuracy and domain invariance of modeling argumen-
tation structure might be to construct flow patterns from the subjective statements
only or from the changes of local sentiment, which we leave for future work. Here,
we conclude that our novel feature type does not yet solve the domain dependence
problem of classifying argumentative texts, but our experimental sentiment scoring
results suggest that it denotes a promising step towards more domain robustness.
In this section, we have pursued the main goal of this chapter, i.e., to develop a
novel feature type for the classification of argumentative texts whose distribution is
strongly invariant across the domains of the texts. The feature type relies
on our structure-oriented view of text analysis from Sect. 5.2. For the first time, it
captures the overall structure of an argumentative text by measuring the similarity
of the flow of the text to a set of learned flow patterns. Our evaluation of sentiment
scoring supports the hypothesis from Sect. 5.1 that such a focus on overall structure
benefits the domain robustness of text classification. In addition, the obtained results
give evidence for the intuition that people often simply organize an argumentation
sequentially, which denotes an important linguistic finding.
However, the effectiveness of the proposed flow patterns is still far from optimal.
Some possible improvements have been revealed by the provided insight into senti-
ment flow patterns. Also, a semi-supervised learning approach on large numbers of
texts (as sketched above) may result in more effective flow patterns. Besides, further
domain-invariant features might help, whereas we have seen that the combination
with domain-dependent features reduces domain robustness.
With respect to domain robustness, we point out that the evaluated sentiment flow
patterns are actually not fully domain-independent. Concretely, they still require
a (possibly domain-specific) algorithm that can infer local sentiment from an input
text, although this seems at least a somewhat easier problem. To overcome domain
dependence, a solution may be to build patterns from more general information.
For instance, Sect. 5.3 indicates the benefit of discourse relations in this regard.
Within one language, another approach is to compute flow patterns based on the
function words in a text, which can be understood as an evolution of the function
word n-grams used in tasks like plagiarism detection (Stamatatos 2011).
Apparently, even the evaluated sentiment scoring task alone implies several di-
rections of research on flow patterns. Investigating all of them would greatly exceed
the scope of this book. In general, further classification tasks are viable for analyzing
the discussed or other flow types, particularly those that target argumentative texts.
Some are mentioned in the preceding sections, like automatic essay grading or lan-
guage function analysis. Moreover, while we restrict our view to text classification
here, other text analysis tasks may profit from modeling overall structure.
In the area of information extraction, common approaches to tasks like named en-
tity recognition or relation extraction already include structural features (Sarawagi
2008). There, overall structure must be addressed on a different level (e.g. on the
sentence level). A related approach from relation extraction is to classify candidate
entity pairs using kernel functions. Kernel functions measure the similarity between
graphs such as dependency parse trees, while being able to integrate different fea-
ture types. Especially convolutional kernels aim to capture structural information by
looking at similar graph substructures (Zhang et al. 2006). Such kernels have also
proven beneficial for tasks like semantic role labeling (cf. Sect. 2.2).
As in the latter cases, the deeper the analysis, the less shallow patterns of overall
structure will suffice. With respect to argumentative texts, a task of increasing promi-
nence that emphasizes the need for more complex models is a full argumentation
analysis, which seeks to understand the arguments in a text and their interactions (cf.
Sect. 2.4 for details). In the context of ad-hoc large-scale text mining, we primar-
ily aim for comparably shallow analyses like text classification. Under this goal,
the resort to unit class flows and discourse relation flows provides two advantages:
On one hand, the abstract nature of the flows reduces the search space of possible
patterns, which facilitates the determination of patterns that are discriminative for
certain text classes (as argued at the end of Sect. 5.2). On the other hand, computing
flows needs much less time than complex analyses like parsing (cf. Sect. 2.1). This
becomes decisive when processing big data.
Our analysis of Pseudocodes 5.1 and 5.2 indicates that the scalability of our feature
type rises and falls with the efficiency of two operations: (1) Computing flows and
(2) clustering them. The first depends on the effort of inferring all flow type instances
for the flow patterns (including preprocessing steps). The average run-times of the
algorithms csb and csp employed here give a hint of the increased complexity of
performing sentiment scoring in this way (cf. Appendix A.2). Other flow types may be
much cheaper to infer (like function words), but also much more expensive. Discourse
relations, for instance, are often obtained through parsing. Still, efficient alternatives
exist (like our lexicon-based algorithm pdr), indicating the usual tradeoff between
efficiency and effectiveness (cf. Sect. 3.1). With respect to the second (clustering),
we have discussed that our hierarchical approach may be too slow for larger numbers
of training texts and we have outlined flat clusterers as an alternative. Nevertheless,
clustering tends to represent the bottleneck of the feature computation and, thus, of
the training of an according algorithm.
In any case, training time is not of utmost importance in the scenario we aim at, where we assume the text analysis algorithms to choose from to be given in
advance (cf. Sect. 1.2). This observation conforms with our idea of an overall analysis
from Sect. 1.3: The determination of features takes place within the development of
an algorithm A T that produces instances of a set of text classes C T . At development
time, A T can be understood as an overall analysis: It denotes the last algorithm in a
pipeline ΠT for inferring C T while taking as input all information produced by the
preceding algorithms in ΠT . Once A T is given, it simply serves as an algorithm in the
set of all available algorithms. In the end, our overall analysis can hence be seen as a
regular text analysis algorithm for cross-domain usage. Besides the intended domain
robustness, such an analysis provides the benefit that its results can be explained, as
we finally sketch in Sect. 5.5.
We consider the scenario that a text analysis pipeline Π = ⟨A, π⟩ has processed an in-
put text to produce instances of a set of information types C (in terms of annotations
Fig. 5.18 Illustration of an explanation graph for the sentiment score of the sample text from
Sect. 5.2. The dependencies between the different types of output information as well as the accuracy
estimations (a) are derived from knowledge about the text analysis pipeline that has produced the
output information.
with features, cf. Sect. 3.2). The goal is to automatically explain the text analysis
process realized to create some target instance of any information type from C. For
this goal, we now outline how to reuse the partial order induced by the interdepen-
dencies of the algorithms in A, which we exploited in Sect. 3.3.
In particular, we propose to construct a directed acyclic explanation graph that
illustrates the process based on the produced information. Each node in the graph
represents an annotation with its features. The root node corresponds to the target
instance. An edge between two nodes denotes a dependency between the respective
annotations. Here, we define that an annotation of type C1 ∈ C depends on an an-
notation of type C2 ∈ C, if the annotations overlap and if the algorithm in A, which
produces C1 as output, requires C2 as input. In addition to the dependencies, every
node is assigned properties of those algorithms in A that have been used to infer the
associated annotation types and features, e.g. quality estimations. Only nodes that the target instance directly or indirectly depends on belong to the graph.
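The dependency-based construction can be sketched as follows, independently of the Apache UIMA realization from Appendix B.3. The Annotation container, the input_types_of mapping (from an annotation type to the input types of the algorithm producing it), and all names are assumptions for illustration; transitive reduction and quality estimations are left out.

from dataclasses import dataclass

@dataclass
class Annotation:
    type: str        # e.g. "Token", "DiscourseRelation", "SentimentScore"
    begin: int       # character offsets of the annotated text span
    end: int
    value: str = ""  # optional feature value, e.g. a part-of-speech tag

def overlaps(a, b):
    return a.begin < b.end and b.begin < a.end

def explanation_edges(target, annotations, input_types_of):
    # Collect all dependency edges (a, b) reachable from the target annotation:
    # a depends on b if b's type is required as input by the algorithm producing
    # a's type and if the two annotations overlap.
    edges, queue, seen = [], [target], {id(target)}
    while queue:
        node = queue.pop()
        for other in annotations:
            if other is node:
                continue
            if other.type in input_types_of.get(node.type, ()) and overlaps(node, other):
                edges.append((node, other))
                if id(other) not in seen:
                    seen.add(id(other))
                    queue.append(other)
    return edges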
As an example, let the following pipeline Πsco be given to assign a sentiment
score to the sample text from Sect. 5.2:
css classifies sentiment scores based on the local sentiment flow of a text derived
from the output of csb and csp as well as discourse relations between local senti-
ments extracted with pdr (cf. Sect. 5.4). Using the input and output types of the eight
employed algorithms listed in Appendix A.1 and the estimations of their quality from
Appendix A.2, we can automatically construct the explanation graph in Fig. 5.18.23
There, each layer subsumes the instances of one information type (e.g. token with
part-of-speech), including the respective text spans (e.g. “We”) and possibly asso-
ciated values (e.g. the part-of-speech tag PP). Where available, a layer is assigned
the estimated accuracies of all algorithms that have inferred the associated type (e.g.
98 % and 97 % of sto2 and tpo1 , respectively).24 With respect to the shown depen-
dencies, the explanation graph has been transitively reduced, such that no redundant
dependencies are maintained.
The sketched construction of explanation graphs is generic, i.e., it can be per-
formed for arbitrary text analysis pipelines and information types. We have re-
alized the construction on top of the Apache UIMA framework (cf. Sects. 3.3
and 3.5) as part of a prototypical web application for sentiment scoring described in
Appendix B.3.25 However, an explanation graph tends to be meaningful only for
pipelines with deep hierarchical interdependencies. This typically holds for informa-
tion extraction rather than text classification approaches (cf. Sect. 2.2), but the given
example emphasizes that according pipelines also exist within text classification.
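Since the construction relies on the metadata of the employed analysis engines (cf. footnote 25), the following fragment sketches how such metadata can be accessed with the standard Apache UIMA API. The descriptor path is hypothetical, and the quality estimations are assumed to be encoded in our own fixed notation inside the free-text description field, which would still have to be parsed.

import java.io.File;
import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.analysis_engine.metadata.AnalysisEngineMetaData;
import org.apache.uima.resource.metadata.Capability;
import org.apache.uima.resource.metadata.TypeOrFeature;
import org.apache.uima.util.XMLInputSource;

// Sketch: read the input/output capabilities and the free-text description (which may
// carry quality estimations in a fixed notation) from an analysis engine descriptor.
public class DescriptorInfoSketch {
    public static void main(String[] args) throws Exception {
        File descriptorFile = new File("desc/CssSentimentScorer.xml");  // hypothetical path
        AnalysisEngineDescription descriptor = UIMAFramework.getXMLParser()
                .parseAnalysisEngineDescription(new XMLInputSource(descriptorFile));
        AnalysisEngineMetaData metaData = descriptor.getAnalysisEngineMetaData();

        System.out.println("Description: " + metaData.getDescription());
        for (Capability capability : metaData.getCapabilities()) {
            for (TypeOrFeature input : capability.getInputs())
                System.out.println("requires " + input.getName());
            for (TypeOrFeature output : capability.getOutputs())
                System.out.println("produces " + output.getName());
        }
    }
}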
In terms of an external view of text analysis processes, explanation graphs are
quite sound and complete. Only little information about a process is left out (e.g.,
although not common, annotations sometimes depend on annotations they do not
overlap with). Still, the use of explanation graphs in the presented form may not be
adequate for explanations, since some contained information is rather technical (e.g.
part-of-speech tags). While such details might increase the trust of users with a
computational linguistics or similar background, their benefit for average users seems
limited. In our prototype, we thus simplify explanation graphs. Among others, we
group different part-of-speech tags (e.g. common noun (NN) and proper noun (NE))
under meaningful terms (e.g. noun) and we add edges between independent but fully
overlapping nodes (cf. Appendix B.3 for examples). The latter reduces soundness, but
it makes the graph easier to grasp. Besides, Fig. 5.18 indicates that explanation
graphs can become very large, which makes their understanding hard and time-
consuming. To deal with the size, our prototype reduces completeness by displaying
only the first layers in an overview graph and the others in a detail graph. Nevertheless,
long texts make resorting to explanation graphs questionable.
In terms of an internal view, the expressiveness of explanation graphs is rather
low because of their generic nature. Concretely, an explanation graph provides no
information about how the explained target instance emanates from the annotations
it depends on. As mentioned above, the decisive step of most text classification
approaches (and also many information extraction algorithms) is the feature compu-
tation, which remains implicit in an explanation graph. General information in this
23 For clarity, we omit the text span in case of discourse relations in Fig. 5.18. Relation types can
easily be identified, as only they point to the information they are dependent on.
24 Although Π_sco employs pdu, Fig. 5.18 contains no discourse unit annotations. This is because
each discourse unit is classified as being a fact or an opinion by csb afterwards.
25 In Apache UIMA, the algorithms’ interdependencies can be inferred from the descriptor files
of the employed primitive analysis engines. For properties like quality estimations, we use a fixed
notation in the description field of a descriptor file, just as we do for the expert system from Sect. 3.3.
Fig. 5.19 Two possible visual explanations for the sentiment score of the sample text from Fig. 5.3
based on our model from Sect. 5.2: (a) Highlighting all local sentiment. (b) Comparison to the most
similar sentiment flow patterns.
respect could be specified via according properties of the employed algorithm, say
“features: lexical and shallow syntactic” in case of csb. For input-specific informa-
tion, however, the actually performed text analysis must be explained. This is easy
for our overall analysis from Sect. 5.4, as we discuss next.
in Sect. 5.4. If few patterns are much more similar to the flow than all others, the
visualization serves as a sound and rather complete explanation of that feature type.
Given that a user believes the patterns are correct, there should hence be no reason
for mistrusting such an explanation.
To summarize, we claim that certain feature types can be explained adequately
by visualizing our model. However, many text classification approaches combine
several features, like the one evaluated in Sect. 5.4. In this case, both the sound-
ness and the completeness of the visualizations will be reduced. To analyze the
benefit of explanations in a respective scenario, we conducted a first user study in
our project ArguAna using the crowdsourcing platform Amazon Mechanical
Turk26, where so-called workers can be requested to perform tasks. The workers are
paid a small amount of money if the results of the tasks are approved by the requester.
For a concise presentation, we only roughly outline the user study here.
The goal of the study was to examine whether explanations (1) help to assess the
sentiment of a text and (2) increase the speed of assessing the sentiment. To this
end, each task asked a worker to classify the sentiment score of 10 given reviews
from the ArguAna TripAdvisor corpus (cf. Appendix C.2), based on presented
information of exactly one of the following three types that we obtained from our
prototypical web application (cf. Appendix B.3):
1. Plain text. The review in plain text form.
2. Highlighted text. The highlighted review text, as exemplified in Fig. 5.19(a).
3. Plain text + local sentiment flow. The review in plain text form with the associ-
ated local sentiment flow shown below the text.27
For each type, all 2100 reviews of the ArguAna TripAdvisor corpus were
classified by three different workers. To prevent flawed results, two check reviews
with unambiguous sentiment (score 1 or 5) were put among every 10 reviews. We
accepted only tasks with correctly classified check reviews and we reassigned rejected
tasks to other workers. Altogether, this resulted in an approval rate of 93.1 %, which
indicates the quality of the conducted crowdsourcing. Table 5.5 lists the aggregated
classification results separately for the reviews of each possible sentiment score (in
the center) as well as the seconds required by a worker to perform a task, averaged
over all tasks and over the fastest 25 % of the tasks (on the right).
The average score of a TripAdvisor hotel review lies between 3 and 4 (cf.
Appendix C.2). As Table 5.5 shows, these “weak” sentiment scores were most accu-
rately assessed by the workers based on the plain text, possibly because the focus on
the text avoids a biased reading in this case. In contrast, the highlighted text seems
to help to assess the other scores. At least for score 2, the local sentiment flow also
proves beneficial. In terms of the required time, the highlighted text clearly dominates
the study. While type 3 speeds up the classification with respect to the fastest 25 %,
it entails the highest time on average. The latter result is not unexpected, because of
the complexity of understanding two visualizations instead of one.
Table 5.5 The average sentiment score classified by the Amazon Mechanical Turk workers
for all reviews in the ArguAna TripAdvisor corpus of each score between 1 and 5 depending
on the presented information as well as the time the users took for classifying 10 reviews averaged
over all tasks or over the fastest 25 %, respectively.
This section has roughly sketched that knowledge about a text analysis process and
information obtained within the process can be used to improve the intelligibility of
the pipeline that realizes the process. In particular, both our process-oriented view
of text analysis from Sect. 3.2 and our structure-oriented view from Sect. 5.2 can be
operationalized to automatically provide explanations for a pipeline’s results.
The presented explanation graphs give general explanations of text analysis
processes that come at no cost (except for the normally negligible time of con-
structing them). They can be derived from arbitrary pipelines. For a deeper insight
into the reasons behind some result, we argue that more specific explanations are
needed that rely on task-specific information. We have introduced two exemplary
visual explanations in this regard for the sentiment scoring of argumentative texts.
At least the intelligibility of highlighting local sentiment has been underpinned in a
first user study, whereas we leave an evaluation of the explanatory benefit of flow
patterns for future work. In the end, the creation of explanations requires research in
the field of information visualization, which is beyond the scope of this book.
Altogether, this chapter has made explicit the difficulty of ensuring high quality
in text mining. Concretely, we have revealed the domain dependence of the text
analysis pipelines executed to infer certain information from input texts as a major
problem. While we have successfully addressed one facet of improving the domain
robustness of pipelines to a certain extent, our findings indicate that perfect robustness
will often be impossible to achieve. In accordance with that, our experiments have
underlined the fundamental challenge of high effectiveness in complex text analysis
tasks (cf. Sect. 2.1). Although approaches to cope with limited effectiveness exist, like
the exploitation of redundancy (cf. Sect. 2.4), the effectiveness of an employed text
analysis pipeline will always affect the correctness of the results of the associated text
mining application. Consequently, we argue that the intelligibility of a text analysis
process is of particular importance, as it may increase the end user acceptance of
erroneous results.
Moreover, in the given context of ad-hoc large-scale text mining (cf. Sect. 1.2), the
correctness of results can be verified at most sporadically because of the sheer amount
of processed data. Hence, a general trust in the quality of a text analysis pipeline
seems necessary. Under this premise, intelligibility denotes the final building block
of high-quality text mining aside from a robust effectiveness and a robust efficiency
of the performed analyses.
Chapter 6
Conclusion
We can only see a short distance ahead, but we can see plenty
there that needs to be done.
– Alan Turing
Abstract The ability to perform text mining ad-hoc in the large has the potential
to substantially improve the way people find information today in terms of speed
and quality, both in everyday web search and in big data analytics. More complex
information needs can be fulfilled immediately, and previously hidden information
can be accessed. At the heart of every text mining application, relevant information
is inferred from natural language texts by a text analysis process. Mostly, such a
process is realized in the form of a pipeline that sequentially executes a number
of information extraction, text classification, and other natural language processing
algorithms. As a matter of fact, text mining is studied in the field of computational
linguistics, which we consider from a computer science perspective in this book.
Besides the fundamental challenge of inferring relevant information effectively,
we have revealed the automatic design of a text analysis pipeline and the optimization
of a pipeline’s run-time efficiency and domain robustness as major requirements for
the enablement of ad-hoc large-scale text mining. Then, we have investigated the
research question of how to exploit knowledge about a text analysis process and
information obtained within the process to approach these requirements. To this
end, we have developed different models and algorithms that can be employed to
address information needs ad-hoc on large numbers of texts. The algorithms rely
on classical and statistical techniques from artificial intelligence, namely, planning,
truth maintenance, and informed search as well as supervised and self-supervised
learning. All algorithms have been analyzed formally, implemented as software, and
evaluated experimentally.
In Sect. 6.1, we summarize our main findings and their contributions to different
areas of computational linguistics. We outline that they have both scientific and
practical impact on the state of the art in text mining. However, far from every
problem of ad-hoc large-scale text mining has been solved or even approached at all
in this book. In the words of Alan Turing, we can therefore already see plenty there
that needs to be done in the given and in new directions of future research (Sect. 6.2).
Also, some of our main ideas may be beneficial for other problems from computer
science or even from other fields of application, as we finally sketch at the end.
6.1 Contributions and Open Problems
This book presents the development and evaluation of approaches that exploit knowl-
edge and information about a text analysis process in order to effectively address
information needs from information extraction, text classification, and comparable
tasks ad-hoc in an efficient and domain-robust manner.1 In this regard, our high-level
contributions refer to an automatic pipeline design, an optimized pipeline efficiency,
and an improved pipeline robustness, as motivated in Chap. 1. After introducing rel-
evant foundations, basic definitions, and case studies, Chap. 2 has summarized that
several successful related approaches exist, which address similar problems as we
do. Still, we claim that the findings of this book improve the state of the art for
different problems, as we detail in the following.2
In Chap. 3, we have first discussed how to design a text analysis pipeline that is optimal
in terms of efficiency and effectiveness. Given a formal specification of available
text analysis algorithms, which has become standard in software frameworks for
text analysis (cf. Sect. 2.2), we can define an information need to be addressed and
a quality prioritization to be met in order to allow for a fully automatic pipeline
construction. We have realized an according engineering approach with partial order
planning (and a subsequent greedy linearization), implemented in a prototypical
expert system for non-expert users (cf. Appendix B.1). After showing the correctness
and asymptotic run-time complexity of the approach, we have offered evidence that
pipeline construction takes near-zero time in realistic scenarios.
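To illustrate the basic principle of such a construction, the toy sketch below selects algorithms backwards from the required output types and then linearizes them greedily, such that every input type is produced before it is consumed. It abstracts from the actual partial order planner and from quality prioritizations; the algorithm and type names are examples only.

import java.util.*;

// Toy sketch of dependency-driven pipeline construction: select algorithms backwards
// from the information need, then schedule them so that all input types are available.
public class PipelineConstructionSketch {

    record Algorithm(String name, Set<String> inputs, Set<String> outputs) { }

    public static void main(String[] args) {
        List<Algorithm> available = List.of(
                new Algorithm("sse", Set.of(), Set.of("Sentence")),
                new Algorithm("sto2", Set.of("Sentence"), Set.of("Token")),
                new Algorithm("tpo1", Set.of("Sentence", "Token"), Set.of("Token.partOfSpeech")),
                new Algorithm("eti", Set.of("Sentence"), Set.of("Time")),
                new Algorithm("rfo", Set.of("Sentence", "Token.partOfSpeech", "Time"), Set.of("Forecast")));

        Set<String> needed = new HashSet<>(Set.of("Forecast"));   // the information need
        List<Algorithm> selected = new ArrayList<>();

        // Backward selection: add an algorithm for every type that is still missing.
        boolean changed = true;
        while (changed) {
            changed = false;
            for (Algorithm a : available)
                if (!selected.contains(a) && !Collections.disjoint(a.outputs(), needed)) {
                    selected.add(a);
                    needed.addAll(a.inputs());
                    changed = true;
                }
        }

        // Greedy linearization: repeatedly schedule an algorithm whose inputs are available.
        List<Algorithm> pipeline = new ArrayList<>();
        Set<String> produced = new HashSet<>();
        while (pipeline.size() < selected.size()) {
            int before = pipeline.size();
            for (Algorithm a : selected)
                if (!pipeline.contains(a) && produced.containsAll(a.inputs())) {
                    pipeline.add(a);
                    produced.addAll(a.outputs());
                }
            if (pipeline.size() == before)
                throw new IllegalStateException("no algorithm produces some required input type");
        }
        pipeline.forEach(a -> System.out.print(a.name() + " "));   // sse eti sto2 tpo1 rfo
    }
}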
To our knowledge, we are thereby the first to enable ad-hoc text analysis for unan-
ticipated information needs and input texts.3 Some minor problems of our approach
remain for future work, like its current limitation to single information needs. Most
of these are of technical nature and should be solvable without restrictions (see the
discussions in Sects. 3.2 and 3.3 for details). Besides, a few compromises had to be
made due to automation, especially the focus on either effectiveness or efficiency
during the selection of algorithms to be employed in a pipeline. Similarly, the flip-
side of constructing and executing a pipeline ad-hoc is the missing opportunity of
evaluating the pipeline’s quality before using it.
1 Here, addressing an information need means to return all information found in given input texts
that is relevant with respect to a defined query or the like (cf. Sect. 2.2).
2 As clarified in Sect. 1.4, notice that many findings attributed to this book here have already been
may have never been sought for before. Still, algorithms that can infer the single types from input
texts need to be either given in advance or created on-the-fly.
Our main finding regarding an optimal pipeline design in Chap. 3 refers to an often
overlooked optimization potential: By restricting all analyses to those portions of
input texts that may be relevant for the information need at hand, much run-time
can be saved while maintaining effectiveness. Such an input control can be effi-
ciently operationalized for arbitrary text analysis pipelines with assumption-based
truth maintenance based on the dependencies between the information types to be
inferred. Different from related approaches, we thereby assess the relevance of por-
tions of text formally. We have proven the correctness of this approach, analyzed its
worst-case run-time, and realized it in a software framework on top of the industry
standard Apache UIMA (cf. Appendix B.2).
Every pipeline equipped with our input control is able to process input texts
optimally in that all unnecessary analyses are avoided. Alternatively, it can also trade
its recall for even higher run-time efficiency by restricting analyses to even smaller portions
of text. In our experiments with information extraction, only roughly 40 % to 80 %
of an input text needed to be processed by an employed algorithm on average. At the
same time, the effort of maintaining relevant portions of text seems almost negligible.
The benefit of our approach will be limited in tasks where most portions of text are
relevant, as is often the case in text classification. Also, the input restriction does not
work for some algorithms, namely those that do not stepwise process portions of a
text separately (say, sentence by sentence). Still, our approach comes with hardly
any notable drawback, which is why we argue in favor of generally equipping all
pipelines with an input control.
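A strongly simplified illustration of the input control idea is given below: before each analysis step, all sentences that can no longer be relevant for the information need are discarded. The steps, regular expressions, and relevance criterion are placeholders and abstract from the assumption-based truth maintenance used in our actual approach.

import java.util.*;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified sketch of an input control: before each analysis step, all portions of
// text (here: sentences) that can no longer be relevant for the information need are
// discarded, so that later analyses have to process less input.
public class InputControlSketch {

    record Sentence(String text, Map<String, List<String>> annotations) { }

    // A step annotates sentences with instances of its output type.
    record Step(String outputType, Function<String, List<String>> analysis) { }

    public static void main(String[] args) {
        List<Sentence> sentences = new ArrayList<>(List.of(
                new Sentence("Revenues grew by 10 % in 2014.", new HashMap<>()),
                new Sentence("The weather was nice.", new HashMap<>())));

        // Hypothetical steps; the regexes are placeholders, not the real extractors.
        List<Step> pipeline = List.of(
                new Step("Time", text -> matches(text, "\\b(19|20)\\d{2}\\b")),
                new Step("Money", text -> matches(text, "\\b\\d+(\\.\\d+)? ?%")));

        Set<String> requiredSoFar = new HashSet<>();
        for (Step step : pipeline) {
            // Input control: drop sentences that miss any type required so far.
            sentences.removeIf(s -> !s.annotations().keySet().containsAll(requiredSoFar));
            for (Sentence s : sentences) {
                List<String> found = step.analysis().apply(s.text());
                if (!found.isEmpty())
                    s.annotations().put(step.outputType(), found);
            }
            requiredSoFar.add(step.outputType());
        }
        sentences.forEach(s -> System.out.println(s.text() + " -> " + s.annotations()));
    }

    static List<String> matches(String text, String regex) {
        List<String> result = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(text);
        while (m.find())
            result.add(m.group());
        return result;
    }
}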
The use of an input control gives rise to the efficiency impact of optimizing the sched-
ule of a text analysis pipeline, as we have comprehensively investigated in Chap. 4.
We have shown formally and experimentally that, in theory, optimally scheduling
a pipeline constitutes a dynamic programming problem, which depends on the run-
times of the employed algorithms and the distribution of relevant information in the
input texts. Especially the latter may vary significantly, making other scheduling
approaches more efficient in practice. In order to decide what approach to take, we
provide a first measure of the heterogeneity of texts with regard to this distribution.
Under low heterogeneity, an optimal fixed schedule can reliably and efficiently be
determined with informed best-first search on a sample of input texts. This approach,
the proof of its correctness, and its evaluation denote new contributions of this book.4
For higher heterogeneity, we have developed an adaptive approach that learns online
4 The approach is correct given that we have optimistic estimations of the algorithms' run-times.
Then, it always finds a schedule that is optimal on the sample (cf. Sect. 4.3).
and self-supervised what schedule to choose for each text. Our experiments indicate
that we thereby come close to the theoretically best solution in all cases.
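The following sketch indicates, under strong simplifying assumptions, why the best fixed schedule depends on both the algorithms' run-times and the distribution of relevant information: for a pipeline of filtering steps, the expected cost of an ordering is each step's run-time weighted by the fraction of text portions that are still relevant when the step runs. The run-times and selectivities below are invented numbers, and statistical independence of the filters is assumed.

import java.util.*;

// Sketch: compare fixed schedules of filtering analysis steps by expected run-time
// per portion of text. Assumes statistically independent filters (a simplification).
public class ScheduleCostSketch {

    record Step(String name, double runtimeMs, double selectivity) { }

    // Expected cost: each step only processes portions that passed all previous filters.
    static double expectedCost(List<Step> schedule) {
        double cost = 0.0, remaining = 1.0;
        for (Step step : schedule) {
            cost += remaining * step.runtimeMs();
            remaining *= step.selectivity();
        }
        return cost;
    }

    public static void main(String[] args) {
        // Invented run-times (ms per sentence) and selectivities (fraction of relevant sentences).
        List<Step> steps = List.of(
                new Step("time extraction", 0.4, 0.30),
                new Step("money extraction", 0.7, 0.20),
                new Step("dependency parsing", 55.0, 0.90));

        // Enumerate all orderings and keep the cheapest one (fine for small pipelines).
        List<List<Step>> permutations = permute(steps);
        List<Step> best = permutations.get(0);
        for (List<Step> candidate : permutations)
            if (expectedCost(candidate) < expectedCost(best))
                best = candidate;

        for (List<Step> schedule : permutations)
            System.out.printf("%-60s %.2f ms%n", names(schedule), expectedCost(schedule));
        System.out.println("best schedule: " + names(best));
    }

    static List<List<Step>> permute(List<Step> steps) {
        if (steps.isEmpty()) return List.of(List.of());
        List<List<Step>> result = new ArrayList<>();
        for (Step first : steps) {
            List<Step> rest = new ArrayList<>(steps);
            rest.remove(first);
            for (List<Step> tail : permute(rest)) {
                List<Step> perm = new ArrayList<>();
                perm.add(first);
                perm.addAll(tail);
                result.add(perm);
            }
        }
        return result;
    }

    static String names(List<Step> schedule) {
        StringJoiner joiner = new StringJoiner(" -> ");
        schedule.forEach(s -> joiner.add(s.name()));
        return joiner.toString();
    }
}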
With our work on the optimization of efficiency, we support the applicability of
text analysis pipelines to industrial-scale data, which is still often disregarded in
research. The major gain of optimizing a pipeline’s schedule is that even computa-
tionally expensive analyses like dependency parsing can be conducted in little time,
thus often allowing for more effective results. In our experiments, the run-time of
information extraction was improved by up to a factor of 16 over naive approaches and
by a factor of 2 over our greedy linearization named above. These findings conform with
related research while being more generally applicable. In particular, all our schedul-
ing approaches apply to arbitrary text analysis algorithms and to input texts of any
type and language. They target at large-scale scenarios, where spending additional
time for analyzing samples of texts is worth the effort, like in big data analytics.
Conversely, when the goal is to respond to an information need ad-hoc, the greedy
linearization should be preferred.
Some noteworthy aspects of pipeline optimization remain unaddressed here. First,
although our input control handles arbitrary pipelines, we have considered only single
information needs (e.g. forecasts with time information). An extension to combined
needs (e.g. forecasts and declarations) will be more complicated, but is straightfor-
ward in principle as sketched. Next, we evaluated our approaches on large datasets,
but not in real big data scenarios. Among others, big data requires dealing with huge
memory consumption. While we are confident that such challenges even increase the
impact of our approaches on a pipeline’s efficiency, we cannot ultimately rule out
the possibility that they revert some achieved efficiency gains. Similarly, streams of
input texts have been used for motivation in this book, but their analysis is left for
future work. Finally, an open problem refers to the limited accuracy of predicting
pipeline run-times within adaptive scheduling, which prevents an efficiency impact
of the approach on real data of low heterogeneity. Possible solutions have been dis-
cussed in Sect. 4.5. We do not deepen them in this book, since we have presented
successful alternatives for low heterogeneity (cf. Sect. 4.3).
In Chap. 5, we have turned our view to the actual analysis performed by a pipeline. In
particular, we have investigated how to improve the domain robustness of an effec-
tive text analysis in tasks that deal with the classification of argumentative texts (like
reviews or essays). Our general idea is that a focus on the overall structure instead of
the content of respective texts benefits robustness. For reviews, we have found that
common overall structures exist, which cooccur with certain sentiment. Here, overall
structure is modeled as a sequential flow of either local sentiments or discourse rela-
tions, which can be seen as shallow representations of concepts from argumentation
theory. Using a supervised variant of clustering, we have determined common flow
patterns in the reduced search space of shallow overall structures. We exploit the
With respect to our research question from Sect. 1.3, the contributions of this book
emphasize that a text analysis process can be improved in different respects using
knowledge about the process (like formal algorithm specifications) as well as infor-
mation obtained within the process (like observed algorithm run-times). Although
we have not fully integrated all our single approaches, we now discuss to what extent
their combination enables ad-hoc large-scale text mining, thereby coming back to
our original motivation of enhancing today’s information search from Sect. 1.1. Then,
we close this book with an outlook on arising research questions in the concerned
research field as well as in both more and less related fields.
As we have stressed in Sect. 1.3, the overall problem approached in this book aims
to make the design and execution of text analysis pipelines more intelligent. Our
underlying motivation is to enable search engines and big data analytics to perform
ad-hoc large-scale text mining, i.e., to return high-quality results inferred from large
numbers of texts in response to information needs stated ad-hoc. The output of
pipelines is structured information, which defines the basis for the results to be
returned. Therefore, we have addressed the requirements of (1) designing pipelines
automatically, (2) optimizing their efficiency, and (3) improving their robustness, as
summarized above. The fundamental effectiveness problem of text analysis remains
challenging, and the definition of “large-scale” is not clear in general. Still, as
far as we have come, our findings underline that we have successfully enabled ad-hoc text
mining, significantly augmented capabilities in large-scale text mining, and at least
provided an important step towards high-quality text mining.
However, not all single approaches are applicable or beneficial in every text analy-
sis scenario. In particular, the optimization of efficiency rather benefits information
extraction, while our approach to pipeline robustness targets at specific text classifica-
tion tasks. The latter tends to be slower than standard text classification approaches,
but it avoids performing deep analyses. In contrast, all approaches from Chaps. 3
and 4 fit together perfectly. Their integration will even solve remaining problems.
For instance, the restriction of our pipeline construction approach to single informa-
tion needs is easy to manage when given an input control (cf. Sect. 3.3). Moreover,
there are scenarios where all approaches have an impact. E.g., a sentiment analysis
based only on the opinions of a text allows for automatic design, optimized schedul-
ing, and the classification of overall structure. In addition, we have given hints in
Sect. 5.4 on how to transfer our robustness approach to further tasks.
We realized all approaches on top of the standard software framework for text
analysis, Apache UIMA. A promising step still to be taken is their deployment in
widely-recognized platforms and tools. In Sect. 3.5, we have already argued that a
native integration of our input control within Apache UIMA would minimize the
effort of using the input control while benefiting the efficiency of many text analysis
approaches based on the framework. Similarly, applications like U-Compare, which
supports the development and evaluation of pipelines (Kano et al. 2010), may in our
view greatly benefit from including the ad-hoc pipeline construction from Sect. 3.3
or the scheduling approaches from Sects. 4.3 and 4.5. We leave these and other
deployments for future work. The same holds for some important aspects of using
pipelines in practical applications that we have analyzed only roughly here, such
as the parallelization of pipeline execution (cf. Sect. 4.6) and the explanation of
pipeline results (cf. Sect. 5.5). Both fit well to the approaches we have presented, but
still require more investigation.
We conclude that our contributions do not fully enable ad-hoc large-scale text
mining yet, but they define essential building blocks for achieving this goal. The
decisive question is whether academia and industry in the context of information
search will actually evolve in the direction suggested in this book. While we can
only guess, the superficial answer may be “no”, because there are too many possible
variations of this direction. A more nuanced view on today’s search engines and the
lasting hype around big data, however, reveals that the need for automatic, efficient,
and robust text mining technologies is striking: Chiticariu et al. (2010b) highlight
their impact on enterprise analytics, and Etzioni (2011) stresses the importance of
directly returning relevant information as search results (cf. Sect. 2.4 for details).
Hence, we are confident that our findings have the potential of improving the future
of information search. In the end, leading search engines show that this future has
already begun (Pasca 2011).
This book deals with the analysis of natural language texts solely from a text mining
perspective. As a last step, we now give an outlook on the use of our approaches
for other tasks from computational linguistics, from other areas of computer science,
and even from outside these fields.
In computational linguistics, one of the fastest-emerging research areas of recent
years is argumentation mining (Habernal et al. 2014). IBM claims that debating
technologies, which can automatically construct and oppose pro and con arguments,
will be the “next big thing” after their famous question answering tool Watson.5
With our work on argumentation structure, we contribute to the development of such
technologies. Our analysis of the overall structure of argumentative texts may, for
instance, be exploited to retrieve the candidates for argument identification. Also, it
may be adapted to assess the quality of an argumentation. The flow patterns at the
heart of our approach imply several future research directions themselves and may
possibly be transferred to other text analysis tasks, as outlined at the end of Sect. 5.4.
For a deeper analysis of argumentation, other ways to capture argumentation structure
than flow patterns are needed. E.g., deep syntactic parsing (Ballesteros et al. 2014)
and convolutional kernels (Moschitti and Basili 2004) could be used to learn tree-like
argumentation structures.
The derivation of an appropriate model that generates certain sequential informa-
tion from example sequences (like the flows) is addressed in data mining. A common
approach to detect system anomalies is to search for sequences of discrete events that
are improbable under a previously learned model of the system (Aggarwal 2013).
While recent work emphasizes the importance of time information for anomaly detec-
tion (Klerx et al. 2014), the relation to our computation of similarities between flows
in a text seems obvious. This brings up the question whether the two approaches
can benefit from each other, which we leave for future work.
Aside from text analysis, especially our generic approaches to pipeline design
and execution are transferable to other problems. While we have seen in Sect. 4.6
that our scheduling approaches relate to but do not considerably affect the classical
pipelining problem from computer architecture (Ramamoorthy and Li 1977), many
other areas of computer science deal with pipeline architectures and processes.
An important example is computer graphics, where the creation of a 2D raster from
a 3D scene to be displayed on a screen is performed in a rendering pipeline (Angel
2008). Similar to our input control, pipeline stages like clipping decide what parts
of a scene are relevant for the raster. While the ordering of the high-level rendering
stages is usually fixed, stages like a shader compose several programmable steps
whose schedule strongly impacts rendering efficiency (Arens 2014). A transfer of our
approaches to computer graphics seems possible, but it might put other parameters
in the focus, since the execution of a pipeline is parallelized on a specialized graphics
processing unit.
Another related area is software engineering. Among others, recent software test-
ing approaches deal with the optimization of test plans (Güldali et al. 2011). Here,
an optimized scheduling can speed up detecting some defined number of failures
or achieving some defined code coverage. Accordingly, approaches that perform an
assembly-based method engineering based on situational factors and a repository
of services (Fazal-Baqaie et al. 2013) should, in principle, be amenable to automation
with an adaptation of the pipeline construction from Sect. 3.3. Further possible
applications reach down to basic compiler optimization operations like list schedul-
ing (Cooper and Torczon 2011). The use of information obtained from training input
is known in profile-guided compiler optimization (Hsu et al. 2002) where such in-
formation helps to improve the efficiency of program execution, e.g. by optimizing
the scheduling of checked conditions in if-clauses.
Even outside computer science, our scheduling approaches may prove beneficial.
An example from the real world is the authentication of paintings or paper money,
which runs through a sequence of analyses with different run-times and numbers
of found forgeries. Also, we experience in everyday life that scheduling affects the
efficiency of solving a problem. For instance, the number of slices needed to cut
some vegetable into small cubes depends on the ordering of the slices and the form
of the vegetable. Moreover, the abstract concept of adaptive scheduling from Sect. 4.5
should be applicable to every performance problem where (1) different solutions to
the problem are most appropriate for certain situations or inputs and (2) where the
performance of a solution can be assessed somehow.
Altogether, we summarize that possible continuations of the work described in
this book are manifold. We hope that our findings will inspire new approaches of
other researchers and practitioners in the discussed fields and that they might help
anyone who encounters problems like those we approached. With this in mind, we
close the book with a translated quote from the German singer Andreas Front:6
What you learn from that is up to you, though.
I hope at least you have fun doing so.
6 “Was Du daraus lernst, steht Dir frei. Ich hoffe nur, Du hast Spaß dabei.” from the song “Spaß
dabei”, http://andreas-front.bplaced.net/blog/, accessed on December 21, 2014.
Appendix A
Text Analysis Algorithms
The evaluation of the design and execution of text analysis pipelines requires
resorting to concrete text analysis algorithms. Several of these algorithms are employed
in the experiments and case studies of our approaches to enable ad-hoc large-scale
text mining from Chaps. 3–5. Some of them have been developed by the author of
this book, while others refer to existing software libraries. In this appendix, we give
basic details on the functionalities and properties of all employed algorithms and
on the text analyses they perform. First, we describe all algorithms in a canonical
form (Appendix A.1). Then, we present evaluation results on their efficiency and
effectiveness as far as available (Appendix A.2). Especially, the measured run-times
are important in this book, because they directly influence the efficiency impact of
our pipeline optimization approaches, as discussed.
In Chaps. 3–5, we refer to every employed text analysis algorithm mostly in terms of
its three letter acronym (used as the algorithm’s name) and the concrete text analysis
it realizes. The first letter of each acronym stands for the type of text analysis it
belongs to, and the others abbreviate the concrete analysis. The types have been
introduced in Sect. 2.1. Now, we sketch all covered text analyses and we describe
for each employed algorithm (1) how it performs the respective analysis, (2) what
information it requires and produces, and (3) what input texts it is made for. An
overview of the algorithms’ input and output types is given in Table A.1.
We rely on a canonical form of algorithm description, but we also point out specific
characteristics where appropriate. For an easy look-up, in the following we list the
algorithms in alphabetical order of their names and, by that, also in alphabetical order
of the text analysis types. All algorithms are realized as Apache UIMA analysis
engines (cf. Sect. 3.5). These analysis engines come with our software described in
Appendix B. In case of algorithms that are taken from existing software libraries,
wrappers are provided.
Table A.1 The required input types C(in) and the produced output types C(out) of all text analysis
algorithms referred to in this book. Bracketed input types indicate the existence of variations of the
respective algorithm with and without these types.
Name Input types C(in) Output types C(out)
clf Sentence, Token.partOfSpeech, (Time), (Money) LanguageFunction.class
csb DiscourseUnit, Token.partOfSpeech Opinion, Fact
csp Opinion, Token.partOfSpeech Opinion.polarity
css Sentence, Token.partOfSpeech, Opinion.polarity, Fact, (Product), (ProductFeature) Sentiment.score
ene Sentence, Token.partOfSpeech/.chunk Person, Location, Organization
emo Sentence Money
eti Sentence Time
nti Sentence, Time Time.start, Time.end
pch Sentence, Token.partOfSpeech/.lemma Token.chunk
pde1 Sentence, Token.partOfSpeech/.lemma Token.parent/.role
pde2 Sentence, Token.partOfSpeech/.lemma Token.parent/.role
pdr Sentence, Token.partOfSpeech, Opinion.polarity DiscourseRelation.type
pdu Sentence, Token.partOfSpeech DiscourseUnit
rfo Sentence, Token.partOfSpeech/.lemma, Time Forecast.time
rfi Money, Forecast Financial.money/.forecast
rfu Token, Organization, Time Founded.organization/.time
rre1 Sentence, Token Revenue
rre2 Sentence, Token, Time, Money Revenue
rtm1 Sentence, Revenue, Time, Money Revenue.time/.money
rtm2 Sentence, Revenue, Time, Money, Token.partOfSpeech/.lemma/.parent/.role Revenue.time/.money
spa – Paragraph
sse – Sentence
sto1 – Token
sto2 Sentence Token
tle Sentence, Token Token.lemma
tpo1 Sentence, Token Token.partOfSpeech/.lemma
tpo2 Sentence, Token Token.partOfSpeech
The classification of text is one of the central text analysis types the approaches in
this book focus on. It assigns a class from a predefined scheme to each given text. In
our experiments and case studies in Chap. 5, we deal with the classification of both
whole texts and portions of text.
Language Functions. Language functions target at the question why a text was writ-
ten. On an abstract level, most texts can be seen as being predominantly expressive,
appellative, or informative (Bühler 1934). For product-related texts, we concretized
this scheme with a personal, a commercial, and an informational class (Wachsmuth
and Bujna 2011).
clf (Wachsmuth and Bujna 2011) is a statistical classifier, realized as a linear multi-
class support vector machine from the LibSVM integration of Weka (Chang
and Lin 2011; Hall et al. 2009). It assigns a language function to a text based
Appendix A: Text Analysis Algorithms 241
on different word, n-gram, entity, and part-of-speech features (cf. Sect. 5.3).
clf operates on the text level, requiring sentences, tokens and, if given, time
and money entities as input and producing language functions with assigned
class values. It has been trained on German texts from the music domain and
the smartphone domain, respectively.
1 Notice that all algorithms marked as self-implemented have been used in some of our publications,
According to Jurafsky and Martin (2009), the term entity is used not only to refer
to names that represent real-world entities, but also to specific types of numeric
information, like money and time expressions.
Money. In terms of money information, we distinguish absolute mentions (e.g.
“300 million dollars”), relative mentions (e.g. “by 10 %”), and combinations of
absolute and relative mentions.
emo (self-implemented) is a rule-based money extractor that uses lexicon-based
regular expressions, which capture the structure of money entities. emo operates
on the sentence level, requiring sentence annotations as input and producing
money annotations. It works only on German texts and it targets at news articles.
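To give an impression of what such lexicon-based regular expressions may look like, the following is a strongly simplified stand-in for German money expressions; the actual rules of emo are more extensive and are not reproduced here.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Strongly simplified stand-in for a rule-based money extractor on German text:
// absolute mentions ("300 Millionen Euro") and relative mentions ("um 10 %").
// The real emo rules are more extensive.
public class MoneyExtractionSketch {

    private static final String NUMBER = "\\d{1,3}(?:\\.\\d{3})*(?:,\\d+)?";
    private static final String SCALE = "(?:Millionen|Milliarden|Mio\\.|Mrd\\.)";
    private static final String CURRENCY = "(?:Euro|Dollar|EUR|USD|€|\\$)";

    private static final Pattern MONEY = Pattern.compile(
            NUMBER + "(?:\\s+" + SCALE + ")?\\s+" + CURRENCY   // absolute mention
            + "|" + NUMBER + "\\s*(?:%|Prozent)");              // relative mention

    public static void main(String[] args) {
        String sentence = "Der Umsatz stieg um 10 % auf 300 Millionen Euro.";
        Matcher matcher = MONEY.matcher(sentence);
        while (matcher.find()) {
            System.out.println("Money entity: " + matcher.group());
        }
    }
}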
Named Entities. In some of our experiments and case studies, we deal with the
recognition of person, organization, and location names. These three named entity
types are in the focus of widely recognized evaluations, such as the CoNLL-2003
shared task (Tjong et al. 2003).
ene (Finkel et al. 2005) is a statistical sequence labeling algorithm, realized as a con-
ditional random field in the software library Stanford NER2 that sequentially
tags words as belonging to an entity of some type or not. ene operates on the
sentence level, requiring tokens with part-of-speech and chunk information as
input and producing person, location, and organization annotations. It can be
trained for different languages, including English and German, and it targets at
well-formatted texts like news articles.
Time. Similar to money entities, we consider text spans that represent periods of
time (e.g. “last year”) or dates (e.g. “07/21/69”) as time entities.
eti (self-implemented) is a rule-based time extractor that, analogous to emo, uses
lexicon-based regular expressions, which capture the structure of time entities.
eti operates on the sentence level, requiring sentence annotations as input and
producing time annotations. It works only on German texts, and it targets at news
articles.
2006). The only type of information to be resolved in our experiments is time infor-
mation (in Sects. 4.3 and 4.5).
Resolved Time. For our purpose, we define resolved time information to consist
of a start date and an end date, both of the form YYYY-MM-DD. Ahn et al. (2005)
distinguish fully qualified, deictic, and anaphoric time information in a text. For
normalization, some information must be resolved, e.g. “last year” may require the
date the respective text was written on.
nti (self-implemented) is a rule-based time normalizer that splits a time entity into
atomic parts, identifies missing information, and then seeks for this information
in the surrounding text. nti operates on the text level, requiring sentence and time
annotations as input and producing normalized start and end dates as features of
time annotations. It works on German texts only and it targets at news articles.
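The following fragment sketches the kind of resolution nti performs for a deictic expression like “last year”, assuming the date the text was written on is known; the real algorithm covers many more expression types.

import java.time.LocalDate;

// Sketch of resolving a deictic time entity ("last year") against the date a text
// was written on, yielding a start and an end date of the form YYYY-MM-DD.
// Only this single expression type is handled here; nti covers many more.
public class TimeNormalizationSketch {

    record ResolvedTime(LocalDate start, LocalDate end) { }

    static ResolvedTime resolve(String timeEntity, LocalDate documentDate) {
        if (timeEntity.equalsIgnoreCase("last year")) {
            int year = documentDate.getYear() - 1;
            return new ResolvedTime(LocalDate.of(year, 1, 1), LocalDate.of(year, 12, 31));
        }
        throw new IllegalArgumentException("unsupported time expression: " + timeEntity);
    }

    public static void main(String[] args) {
        LocalDate written = LocalDate.of(2014, 7, 21);   // hypothetical document date
        ResolvedTime time = resolve("last year", written);
        System.out.println(time.start() + " .. " + time.end());  // 2013-01-01 .. 2013-12-31
    }
}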
A.1.4 Parsing
pde1 (Bohnet 2010) is a variant of the statistical dependency parser pde2 given
below, realized in the Mate Tools4 as a combination of a linear support
vector machine and a hash kernel. It uses several features to identify only
dependency parse trees without crossing edges (unlike pde2 ). pde1 operates
on the sentence level, requiring sentences and tokens with part-of-speech and
lemma as input and producing the parent and dependency role of each token.
It has been trained on German texts and it targets at well-formatted texts like
news articles.
pde2 (Bohnet 2010) is a statistical dependency parser, realized in the above-
mentioned Mate Tools as a combination of a linear support vector machine
and a hash kernel. It uses several features to identify dependency parse trees,
including those with crossing edges. pde2 operates on the sentence level, re-
quiring sentences and tokens with part-of-speech and lemma as input and
producing the parent and dependency role of each token. It has been trained
on a number of languages, including English and German, and it targets at
well-formatted texts like news articles.
Discourse Units and Relations. Discourse units are the minimum building blocks in
the sense of text spans that make up the discourse of a text. Several types of discourse
relations may exist between discourse units, e.g. 23 types are distinguished by the
widely-followed rhetorical structure theory (Mann and Thompson 1988).
pdr (self-implemented) is a rule-based discourse relation extractor that mainly re-
lies on language-specific lexicons with discourse connectives to identify 10
discourse relation types, namely, background, cause, circumstance, conces-
sion, condition, contrast, motivation, purpose, sequence, and summary. pdr
operates on the discourse unit level, requiring discourse units and tokens with
part-of-speech as input and producing typed discourse relation annotations. It
is implemented for English only and it targets at less-formatted texts like web
user reviews.
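The lexicon-based core of such an approach can be illustrated as follows; the connectives and their assignment to relation types are toy examples and do not reproduce pdr's actual lexicons or rules.

import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Illustrative core of a connective-based discourse relation extractor: if a discourse
// unit contains a known connective, the corresponding relation type is assigned between
// it and the preceding unit. The lexicon below is a toy example, not pdr's actual lexicon.
public class DiscourseRelationSketch {

    private static final Map<String, String> CONNECTIVES = new LinkedHashMap<>();
    static {
        CONNECTIVES.put("because", "cause");
        CONNECTIVES.put("although", "concession");
        CONNECTIVES.put("if", "condition");
        CONNECTIVES.put("but", "contrast");
        CONNECTIVES.put("in order to", "purpose");
        CONNECTIVES.put("then", "sequence");
    }

    static String classify(String discourseUnit) {
        String unit = discourseUnit.toLowerCase(Locale.ROOT);
        for (Map.Entry<String, String> entry : CONNECTIVES.entrySet()) {
            if (unit.startsWith(entry.getKey() + " ") || unit.contains(" " + entry.getKey() + " ")) {
                return entry.getValue();
            }
        }
        return "no relation found";
    }

    public static void main(String[] args) {
        System.out.println(classify("although the room was small"));     // concession
        System.out.println(classify("because the staff was friendly"));  // cause
    }
}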
pdu (self-implemented) is a rule-based discourse unit segmenter that analyzes com-
mas, connectives (using language-specific lexicons), verb types, ellipses, etc.
to identify discourse units in terms of main clauses with all their subordinate
clauses. pdu operates on the text level, requiring sentences and tokens with
part-of-speech as input and producing discourse unit annotations. It is imple-
mented for English and German, and it targets at less-formatted texts like web
user reviews.
In Sect. 3.4, we argue that all relations and events can be seen as relating two or
more entities, often being represented by a span of text. Mostly, they are application-
specific (cf. Sect. 3.2). In this book, we consider relations and events in the context
of our case study InfexBA from Sect. 2.3.5
Financial Events. In Sect. 3.5, financial events denote a specific type of forecasts (see
below) that are associated to money information.
rfi (self-implemented) is a rule-based event detector that naively assumes each
portion of text with a money entity and a forecast event to represent a financial
event. rfi can operate on arbitrary text unit levels, requiring money and forecast
annotations as input and producing financial event annotations that relate the
respective money entities and forecast events. It works on arbitrary texts of any
language.
Forecast Events. A forecast is assumed here to be any sentence about the future
with time information.
rfo (self-implemented) is a statistical event detector, realized as a linear support
vector machine from the LibSVM integration of Weka (Chang and Lin 2011;
Hall et al. 2009). It classifies candidate sentences with time entities using several
types of information, including part-of-speech tags and occurring verbs. rfo
operates on the sentence level, requiring sentences, tokens with part-of-speech
and lemma as input and producing forecast annotations with set time features.
It is implemented for German texts only and it targets at news articles.
5 In the evaluation of ad-hoc pipeline construction in Sect. 3.3, we partly refer to algorithms for the
recognition of biomedical events. Since the construction solely relies on formal properties of the
algorithms, we do not consider the algorithms’ actual implementations and, therefore, omit to talk
about them here. The properties can be found in the respective Apache UIMA descriptor files that
come with our expert system (cf. Appendix B.1).
Time/Money Relations. The relations between time and money entities that we
consider here all refer to according pairs where both entities belong to the same
statement on revenue.
rtm1 (self-implemented) is a rule-based relation extractor that extracts the closest
pairs of time and money entities (in terms of the number of characters). It
operates on the sentence level, requiring sentences, time and money entities
as well as statements on revenue as input and producing the time and money
features of the latter. rtm1 works on arbitrary texts of any language.
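A toy version of this closest-pair heuristic is sketched below; entities are reduced to character offsets, and the entities and positions in the example are chosen for illustration.

import java.util.List;

// Toy version of a closest-pair relation extractor: for each money entity, pick the
// time entity with the smallest character distance within the same sentence.
public class ClosestPairSketch {

    record Entity(String type, String text, int begin, int end) { }

    static int distance(Entity a, Entity b) {
        // Number of characters between the two spans (0 if they overlap or touch).
        return Math.max(0, Math.max(a.begin() - b.end(), b.begin() - a.end()));
    }

    public static void main(String[] args) {
        // Entities of the sentence "In 2014, revenues grew by 10 % to 300 million dollars."
        List<Entity> times = List.of(new Entity("Time", "2014", 3, 7));
        List<Entity> moneys = List.of(
                new Entity("Money", "10 %", 26, 30),
                new Entity("Money", "300 million dollars", 34, 53));

        for (Entity money : moneys) {
            Entity closest = null;
            for (Entity time : times) {
                if (closest == null || distance(money, time) < distance(money, closest)) {
                    closest = time;
                }
            }
            System.out.println(money.text() + " <-> " + (closest == null ? "-" : closest.text()));
        }
    }
}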
rtm2 (self-implemented) is a statistical relation extractor, realized as a linear support
vector machine from the LibSVM integration of Weka (Chang and Lin 2011;
Hall et al. 2009). It classifies relations between candidate pairs of time and
money entities based on several types of information. rtm2 operates on the
sentence level, requiring sentences, tokens with all annotation features, time
and money entities, as well as statements on revenue as input and producing
the time and money features of the latter. It works for German texts only and
it targets at news articles.
A.1.6 Segmentation
Segmentation means the sequential partition of a text into single units. In this book,
we restrict our view to lexical and shallow syntactic segmentations in terms of the
following information types.
Paragraphs. We define a paragraph here syntactically to be a composition of sen-
tences that ends with a line break.
spa (self-implemented) is a rule-based paragraph splitter that looks for line breaks,
which indicate paragraph ends. spa operates on the character level, requiring
only plain text as input and producing paragraph annotations. It works on arbi-
trary texts of any language.
Sentences. Sentences segment the text into basic meaningful grammatical units.
sse (self-implemented) is a rule-based sentence splitter that analyzes whitespaces,
punctuation and quotation marks, hyphenation, ellipses, brackets, abbrevia-
tions (based on a language-specific lexicon), etc. sse operates on the character
level, requiring only plain text as input and producing sentence annotations. It
is implemented for German and English and it targets both at well-formatted
texts like news articles and at less-formatted texts like web user reviews.
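A heavily reduced illustration of such rules: a period ends a sentence unless the token in front of it is found in an abbreviation lexicon. The lexicon and the handling below are toy examples and much simpler than sse.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Heavily reduced sentence splitting: a period ends a sentence unless the token in
// front of it is in a (language-specific) abbreviation lexicon. sse's actual rules
// also cover quotation marks, ellipses, brackets, hyphenation, etc.
public class SentenceSplittingSketch {

    private static final Set<String> ABBREVIATIONS = Set.of("Dr", "Prof", "vs", "approx");

    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            current.append(c);
            if ((c == '.' || c == '!' || c == '?') && !endsWithAbbreviation(current)) {
                sentences.add(current.toString().trim());
                current.setLength(0);
            }
        }
        if (current.toString().trim().length() > 0) {
            sentences.add(current.toString().trim());
        }
        return sentences;
    }

    static boolean endsWithAbbreviation(StringBuilder current) {
        String withoutPeriod = current.substring(0, current.length() - 1);
        int lastSpace = withoutPeriod.lastIndexOf(' ');
        String lastToken = withoutPeriod.substring(lastSpace + 1);
        return ABBREVIATIONS.contains(lastToken);
    }

    public static void main(String[] args) {
        String text = "Dr. Smith visited Weimar. The trip took two days.";
        split(text).forEach(System.out::println);
    }
}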
Tokens. In natural language processing, tokens denote the atomic lexical units of a
text, i.e., words, numbers, symbols, and similar.
sto1 (Apache UIMA6 ) is a rule-based tokenizer that simply looks for whitespaces
and punctuation marks. sto1 operates on the character level, requiring only
plain text as input and producing token and sentence annotations. It works on
arbitrary texts of all those languages, which use the mentioned character types
as word and sentence delimiters.
sto2 (self-implemented) is a rule-based tokenizer that analyzes whitespaces, special
characters, abbreviations (based on a language-specific lexicon), etc. sto2 op-
erates on the sentence level, requiring sentences as input and producing token
annotations. It is implemented for German and English and it targets both at
well-formatted texts like news articles and at less-formatted texts like web user
reviews.
A.1.7 Tagging
Under the term tagging, we finally subsume text analyses that add information to
segments of a text, here to tokens in particular.
Lemmas. A lemma denotes the dictionary form of a word (in the sense of a lexeme),
such as “be” for “am”, “are”, or “be” itself. Lemmas are of particular importance
for highly inflected languages like German and they serve, among others, as input
for many parsers (see above).
tle (Björkelund et al. 2010) is a statistical lemmatizer, realized as a large margin
classifier in the above-mentioned Mate Tools, that uses several features to
find the shortest edit script between the lemmas and the words. tle operates on
the sentence level, requiring tokens as input and producing the lemma features
of the tokens. It has been trained on a number of languages, including English
and German, and it targets at well-formatted texts like news articles.
Part-of-speech Tags. Parts of speech are the linguistic categories of tokens. E.g., in
“Let the fly fly!”, the first “fly” is a noun and the second a verb. Mostly, more specific
part-of-speech tags are assigned to tokens, like common nouns as opposed to proper
nouns. Although some universal part-of-speech tagsets have been proposed (Petrov
et al. 2012), most approaches rather rely on language-specific tagsets, such as the
widely-used STTS TagSet7 for German consisting of 53 different tags.
tpo1 (Schmid 1995) is a statistical part-of-speech tagger, realized with the same
decision-tree classifier of the TreeTagger as pch above. tpo1 operates on
the sentence level, requiring sentences and tokens as input and producing both
part-of-speech and lemma features of the tokens. It has been trained on a
number of languages, including English and German, and it targets at well-
formatted texts like news articles.
tpo2 (Björkelund et al. 2010) is a statistical part-of-speech tagger, realized as a large
margin classifier in the above-mentioned Mate Tools, that uses several fea-
tures to classify part-of-speech. tpo2 operates on the sentence level, requiring
sentences and tokens as input and producing part-of-speech features of tokens.
It has been trained on a number of languages, including English and German,
and it targets at well-formatted texts like news articles.
The impact of the approaches developed in Chaps. 3–5 is affected by the efficiency
and/or the effectiveness of the employed text analysis algorithms. In particular, both
the algorithm selection of ad-hoc pipeline construction (Sect. 3.3) and the informed
search pipeline scheduling (Sect. 4.3) rely on run-time estimations of the algorithms,
the former also on effectiveness estimations. The efficiency gains achieved by our
input control from Sect. 3.5 and by every scheduling approach from Chap. 4 re-
sult from differences in the actually observed algorithm run-times. And, finally, the
effectiveness and robustness of our features for text classification in Sect. 5.4 depends
on the effectiveness of all algorithms used for preprocessing. For these reasons, we
have evaluated the efficiency and effectiveness of all employed algorithms as far as
possible. Table A.2 shows all results that we provide here with reference to the text
corpora they were computed on. The corpora are described in Appendix C.
In terms of efficiency, Table A.2 shows the average run-time per sentence of each
algorithm. We measured all run-times in either five or ten runs on a 2 GHz Intel Core
2 Duo MacBook with 4 GB RAM, partly using the complete respective corpus, partly
its training set only.
The effectiveness values in Table A.2 were obtained on the test sets of the specified
corpora in all cases except for those on the Sentiment Scale dataset, the Subjec-
tivity dataset, and the Sentence polarity dataset. The latter are computed
using 10-fold cross-validation in order to make them comparable to (Pang and Lee
2005). All results are given in terms of the quality criteria we see as most appropriate
for the respective text analyses (cf. Sect. 2.1 for details). For lack of required ground-
truth annotations, we could not evaluate the effectiveness of some algorithms, such as
pdr. Also, for a few algorithms, we analyzed a small subset of the Revenue corpus
manually to compute their precision (eti and emo) or accuracy (sse and sto2 ). With
respect to the effectiveness of the algorithms from existing software libraries, we
refer to the according literature.
We have added information on the number of classes where accuracy values do
not refer to a two-class classification task. In case of css on the Sentiment Scale
dataset, we specify an interval for the accuracy values, because they vary depending
on which of the four datasets of the corpus is analyzed (cf. Appendix C.4). The perfect
effectiveness of pdu on the ArguAna TripAdvisor corpus is due to the fact that
pdu is exactly the algorithm used to create the discourse unit annotations of the
corpus (cf. Appendix C.2).
Table A.2 Evaluation results on the run-time efficiency (in milliseconds per sentence) and the
effectiveness (as precision p, recall r, F1-score f1, and accuracy a) of all text analysis algorithms
referred to in this book on the specified text corpora.
Name Efficiency Effectiveness Evaluation on
clf 0.65 ms/snt. a 82 % (3 classes) LFA-11 corpus (music)
0.53 ms/snt. a 69 % (3 classes) LFA-11 corpus (smartphone)
csb 22.31 ms/snt. a 78 % ArguAna TripAdvisor corpus
– a 91 % Subjectivity dataset
csp 6.96 ms/snt. a 80 % ArguAna TripAdvisor corpus
– a 74 % Sentence polarity dataset
css 0.53 ms/snt. a 48 % (5 classes) ArguAna TripAdvisor corpus
0.86 ms/snt. a 57 %–72 % (3 classes) Sentiment Scale dataset
emo 0.68 ms/snt. p 0.99, r 0.95, f1 0.97 Revenue corpus
0.59 ms/snt. – CoNLL-2003 (de)
ene 2.03 ms/snt. cf. Finkel et al. (2005) Revenue corpus
2.03 ms/snt. CoNLL-2003 (de)
eti 0.36 ms/snt. p 0.91, r 0.97, f1 0.94 Revenue corpus
0.39 ms/snt. – CoNLL-2003 (de)
nti 1.21 ms/snt. – Revenue corpus
0.39 ms/snt. – CoNLL-2003 (de)
pch 0.97 ms/snt. cf. Schmid (1995) Revenue corpus
0.88 ms/snt. CoNLL-2003 (de)
pde1 166.14 ms/snt. cf. Bohnet (2010) Revenue corpus
pde2 54.61 ms/snt. Revenue corpus
pdr 0.11 ms/snt. – ArguAna TripAdvisor corpus
pdu 0.13 ms/snt. a 100.0 % ArguAna TripAdvisor corpus
rfi <0.01 ms/snt. – Revenue corpus
<0.01 ms/snt. – CoNLL-2003 (de)
rfo 0.27 ms/snt. a 93 % Revenue corpus
0.27 ms/snt. – CoNLL-2003 (de)
rfu 0.01 ms/snt. p 0.71 Revenue corpus
0.01 ms/snt. p 0.88 CoNLL-2003 (de)
rre1 0.03 ms/snt. p 0.86, r 0.93, f1 0.89 Revenue corpus
rre2 0.81 ms/snt. p 0.87, r 0.93, f1 0.90 Revenue corpus
0.05 ms/snt. – CoNLL-2003 (de)
rtm1 0.02 ms/snt. p 0.69, r 0.88, f1 0.77
rtm2 10.41 ms/snt. p 0.75, r 0.88, f1 0.81 Revenue corpus
spa <0.01 ms/snt. – Revenue corpus
<0.01 ms/snt. – CoNLL-2003 (de)
sse 0.04 ms/snt. a 95 % Revenue corpus
0.04 ms/snt. – CoNLL-2003 (de)
sto1 0.04 ms/snt. – Revenue corpus
sto2 0.06 ms/snt. a 98 % Revenue corpus
0.06 ms/snt. – CoNLL-2003 (de)
tle 11.12 ms/snt. cf. Björkelund et al. (2010) Revenue corpus
tpo1 0.94 ms/snt. cf. Schmid (1995) Revenue corpus
0.97 ms/snt. CoNLL-2003 (de)
tpo2 10.75 ms/snt. cf. Björkelund et al. (2010) Revenue corpus
Appendix B
Software
The optimization of pipeline design and execution that we discuss in the book at hand
provides practical benefits only if it works fully automatically. In the context of
this book, prototypical software applications were developed that allow for using and
evaluating all parts of our approach to enabling ad-hoc large-scale text mining. This
appendix describes how to work with these applications, all of which are given in the
form of open Java source code. In Appendix B.1, we begin with the expert system
for ad-hoc pipeline construction from Sect. 3.3. Then, Appendix B.2 sketches how
to use our software framework that realizes the input control presented in Sect. 3.5.
Our prototypical web application for sentiment scoring and explanation is described
in Appendix B.3. Finally, we outline how to reproduce the results of all experiments
and case studies of this book using the developed applications. All source code comes
together with instructions and some sample text analysis algorithms and pipelines.
It is split into different projects that we refer to below. As of the end of 2014, the code
should be accessible for at least some years at http://is.upb.de/?id=wachsmuth (under
Software).8
In this appendix, we detail the usage of the expert system Pipeline XPS, presented in
Sect. 3.3. The expert system was implemented by Rose (2012) as part of his master’s
thesis. It provides a graphical user interface for the specification of text analysis tasks
and quality prioritizations. On this basis, Pipeline XPS constructs and executes a
text analysis pipeline ad-hoc.
8 In case you encounter problems with the link, please contact the author of this book.
Installation. The expert system refers to the project XPS of the provided software.
By default, its annotation task ontology (cf. Sect. 3.2) comprises the algorithms and
information types of the EfXTools project. When using the integrated development
environment Eclipse9 , Java projects can be created with the respective top-level
folders as root directories. Otherwise, a corresponding import procedure has to be performed manually.
General Information. Our expert system can be seen as a first prototype, which
may still have some bugs and which tends not to be robust to incorrect inputs and usage.
Therefore, the instructions presented here should be followed carefully.
Launch. Before the first launch, one option has to be adjusted if not using Windows
as the operating system: In the file ./XPS/conf/xps.properties, the line starting with
xps.treeTaggerModel, which belongs to the operating system at hand, must
be commented in, while the respective others must be commented out. The file
Main.launch in the folder XPS can then be run in order to launch the expert system.
At the first start, no annotation task ontology is present in the system. After pressing
OK in response to the appearing popup window, a standard ontology is imported.
When starting the system again, the main window Pipeline XPS should appear as well as an
Explanations window with the message Pipeline XPS has been started.
User Interface. Figure B.1 shows the user interface of the prototype from Rose
(2012). A user first sets the directory of an input text collection to be processed
and chooses a quality prioritization. Then, the user specifies an information need
by repeatedly choosing annotation types with active features (cf. Sect. 3.2).10 The
addition of types to filter beneath does not replace the on-the-fly creation of filters
from the pseudocode in Fig. 3.1, but it defines the value constraints.11 Once all is
set, pressing Start XPS leads to the ad-hoc construction and execution of a pipeline.
Afterwards, explanations and results are given in separate windows. We rely on this
user interface in our evaluation of ad-hoc pipeline construction in Sect. 3.3. In the
following, we describe how to interact with the user interface in more detail.
Fig. B.1 Screenshot of the prototypical user interface of our expert system Pipeline XPS that
realizes our approach to ad-hoc pipeline construction and execution.
The Pipeline XPS user interface in Fig. B.1 is made up of the following areas, each
of which includes certain options that can or have to be set in order to start pipeline
construction and execution:
Input Text Collection. Via the button Browse, a directory with the input texts to
be processed (e.g., given as XMI files) can be set. If the checkbox Only construct
pipelines is activated, pipeline execution will be disabled. Instead, an Apache UIMA
aggregate analysis engine is then constructed and stored in terms of an according
descriptor file in the directory ./XPS/temp/.
Quality Criteria for Algorithm Selection. According to Sect. 3.3, a quality prior-
itization needs to be chosen from the provided selection. Exactly those prioritizations
are given that are illustrated in the quality model in Fig. 3.11, although their names
differ slightly in the expert system.
Output Files. In the area Output files, a name for the pipeline to be constructed can
be specified. This name becomes the file name of the associated Apache UIMA
analysis engine descriptor file.
Information Needs. To set the information need C to be addressed by the pipeline,
the following three steps need to be performed once for each information type C ∈ C:
1. Select a type from the list and click on the Add type button.
2. Choose attributes for the added type by marking the appearing checkboxes (if the
added type has attributes at all).
3. Press the button Add this type.
Value Constraints. The area Value constraints allows setting one or more filters that
represent the value constraints to be checked by the pipeline to be constructed. For
each filter, the following needs to be done:
1. In the Type to filter list, select the type to be filtered.
2. Select one of the appearing attributes of the selected type.
3. Select one of the three provided filters.
4. Insert the text to be used for filtering.
5. Press the button Add this filter.
Start XPS. When all types and filters have been set, Start XPS constructs and ex-
ecutes a pipeline for the specified information need and quality prioritization. Log
output is shown in the console of Eclipse as well as in the Explanations win-
dow. A Calculating results... window appears in which all results are shown once
the pipeline execution is finished. In addition, the results are written to a file
./XPS/pipelineResults/resultOfPipeline-<pipelineName><timestamp>.txt. All cre-
ated pipeline descriptor files can be found in the ./XPS/temp/ directory, while the filter
descriptor files are stored in ./XPS/temp/filter/.
Import Ontology. By default, a sample ontology with a specified type system, an
algorithm repository, and the built-in quality model described in Sect. 3.3 is set as
the annotation task ontology to rely on. When pressing the button Import ontology, a
window appears where an Apache UIMA type system descriptor file can be selected
as well as a directory in which to look for the analysis engine descriptor files (i.e.,
the algorithm repository). After pressing Import Ontology Information, the respec-
tive information is imported into the annotation task ontology and Pipeline XPS is
restarted.12
XPS and EfXTools denote largely independent Java projects. In case the default
ontology is employed, though, the former accesses the source code and the Apache
UIMA descriptor files of the latter. In the following, we give some information on
both projects. For more details, see Appendix B.4.
12 In case other analysis engines are imported, errors may occur in the current implementation. The
reason is that there is a hardcoded blacklist of analysis engine descriptor files that can be edited
in the class de.upb.mrose.xps.application.ExpertSystemFrontendData (Eclipse compiles this class
automatically when the expert system is started the next time).
XPS. The source code XPS consists of four main packages: All classes related to the
user interface of Pipeline XPS belong to the package de.upb.mrose.xps.application,
while the management of annotation task ontologies and their underlying data model
are realized by the classes in the packages de.upb.mrose.xps.knowledgebase and
de.upb.mrose.xps.datamodel. Finally, de.upb.mrose.xps.problemsolver is responsible
for the pipeline construction. Besides, some further packages handle the interaction
with classes and descriptors specific to Apache UIMA. For details on the architecture
and implementation of the expert system, we refer to (Rose 2012).
EfXTools. EfXTools is the primary software project containing text analysis algo-
rithms and text mining applications developed within our case study InfexBA de-
scribed in Sect. 2.3. A large fraction of the source code and associated files is not
relevant for the expert system, but partly plays a role in other experiments and case
studies (cf. Appendix B.4 below). The algorithms used by the expert system can be
found in all sub-packages of the package de.upb.efxtools.ae. The related Apache
UIMA descriptor files are stored in the folders desc, desc38, and desc76, of which the
latter two represent the algorithm repositories evaluated in Sect. 3.3. Text corpora like the
Revenue corpus (cf. Appendix C.1) are given in the folder data.
Libraries. The folder lib of XPS contains the following freely available Java li-
braries, which are needed to compile the associated source code:13
Apache Jena, http://jena.apache.org
Apache Log4j, http://logging.apache.org/log4j/2.x/
Apache Lucene, http://lucene.apache.org
Apache Xerces, http://xerces.apache.org
JGraph, http://sourceforge.net/projects/jgraph
StAX, http://stax.codehaus.org
TagSoup, http://ccil.org/~cowan/XML/tagsoup
Woodstox, http://woodstox.codehaus.org
Similarly, the algorithms in EfXTools are based on the following libraries:
Apache Commons, http://commons.apache.org/pool/
Apache UIMA, http://uima.apache.org
ICU4j, http://site.icu-project.org
LibSVM, http://www.csie.ntu.edu.tw/~cjlin/libsvm
Mate Tools, http://code.google.com/p/mate-tools
StanfordNER, http://nlp.stanford.edu/ner/
TreeTagger, http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/
tt4j, http://reckart.github.io/tt4j/
Weka, http://www.cs.waikato.ac.nz/~ml/weka/
13 All libraries accessed on June 15, 2015. The same holds for the libraries in EfXTools.
Quick Start. For a first try of the framework, the class QuickStartApplication in
the source code package efxtools.sample.application can be executed with the Java
virtual machine parameters -Xmx1000m -Xms1000m. This starts the extraction of
relevant information for the query γ3∗ on the Revenue corpus (cf. Sect. 3.5). During
the processing of the corpus, some output is printed to the console. After processing,
the execution terminates. All results are printed to the console.
The source code of the Filtering Framework has been designed with a focus on
easy integration and minimal additional effort. In order to use the framework for
applications, the following needs to be done:
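In essence, an application loads its aggregate analysis engine, specifies the scoped query, and then processes each input text with the resulting pipeline. The following is only a minimal sketch of these steps: it relies on the standard Apache UIMA API, the descriptor path, query string, and input text are placeholders, and the names myAggregateAEDesc, myTypeSystem, and myQueryString merely mirror the description that follows, so the concrete calls in the provided sample applications may differ.

import org.apache.uima.UIMAFramework;
import org.apache.uima.analysis_engine.AnalysisEngine;
import org.apache.uima.analysis_engine.AnalysisEngineDescription;
import org.apache.uima.jcas.JCas;
import org.apache.uima.util.XMLInputSource;

public class FilteringFrameworkUsageSketch {
  public static void main(String[] args) throws Exception {
    // Load the aggregate analysis engine description (myAggregateAEDesc); its
    // descriptor also imports the application-specific type system (myTypeSystem).
    AnalysisEngineDescription myAggregateAEDesc = UIMAFramework.getXMLParser()
        .parseAnalysisEngineDescription(new XMLInputSource("desc/myAggregateAE.xml"));
    AnalysisEngine pipeline = UIMAFramework.produceAnalysisEngine(myAggregateAEDesc);

    // Specify the scoped query as a text string (myQueryString). In the Filtering
    // Framework, this string is parsed by the class EfXScopedQuery; how the parsed
    // query is handed to the filtering analysis engines is implementation-specific.
    String myQueryString = "...";  // example queries can be found in efxtools.sample

    // Process an input text; filtering analysis engines then restrict their
    // analyses to the scopes of the text that are relevant for the query.
    JCas jcas = pipeline.newJCas();
    jcas.setDocumentText("Some input text to be analyzed.");
    pipeline.process(jcas);
  }
}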
Here, both the application-specific type system myTypeSystem and the aggre-
gate analysis engine description myAggregateAEDesc are available through the
Apache UIMA framework. In the provided implementation, the scoped query is
given as a text string myQueryString that is parsed in the class EfXScopedQuery
of the Filtering Framework. Example queries can be found in the sample appli-
cations in the package efxtools.sample (details on the source code follow below).
Analysis Engines. The determination, generation, and filtering of scopes are auto-
matically called from the method process(JCas) that is invoked by the Apache
UIMA framework on every primitive analysis engine for each text. For this purpose,
the abstract class FilteringAnalysisEngine in the package efxtools.filtering overrides
the process method and instead offers a method process(JCas, Scope). While
it is still possible to use regular primitive analysis engines in the Filtering Frame-
work, every analysis engine that shall restrict its analysis to scopes of a text should
inherit from FilteringAnalysisEngine and implement the new method.14 Typically,
with only minor changes a regular primitive analysis engine can be converted into a
filtering analysis engine. For examples, see the provided sample algorithms.
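A minimal sketch of such a filtering analysis engine is shown below. It assumes the abstract class FilteringAnalysisEngine and a Scope type from efxtools.filtering as described above; the class name, the scope accessors getBegin() and getEnd(), and the exact method signature are assumptions that may differ from the provided code.

import org.apache.uima.jcas.JCas;

import efxtools.filtering.FilteringAnalysisEngine;
import efxtools.filtering.Scope;  // assumed location of the Scope type

public class MyScopedTimeRecognizer extends FilteringAnalysisEngine {

  @Override
  public void process(JCas jcas, Scope scope) {
    // Restrict the analysis to the text spanned by the given scope instead of
    // analyzing the whole document text.
    String scopedText = jcas.getDocumentText().substring(scope.getBegin(), scope.getEnd());

    // ... analyze scopedText and add annotations (with offsets relative to the
    // document) for the information found within the scope ...
  }
}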
Analysis Engine Descriptors. In order to ensure a correct annotation and filtering,
it is important to thoroughly define the input and output capabilities of all employed
primitive and aggregate analysis engines. Examples can be found in the subdirectories
of the directory desc.
The source code of the Filtering Framework in the folder src contains two main
packages, efxtools.filtering and efxtools.sample. The former consists of the actual
framework classes, whereas the latter contains a number of sample algorithms and
applications, including those that are used in the experiments in Sect. 3.5.
The Filtering Framework. The classes in efxtools.filtering realize the assumption-
based truth maintenance system described in Sect. 3.5. Most of these nine classes
implement the classes in Fig. 3.19 and are named accordingly. The code was carefully
documented with javadoc comments to make it easily understandable.
14 In its prototypical form, the framework checks only the existence of annotation types during
filtering while ignoring whether possibly required features have been set explicitly. For some specific
queries, this may prevent the framework from performing filtering as much as would be possible.
This appendix describes the prototypical web application for predicting and explain-
ing sentiment scores that we refer to in Sect. 5.5. The application was developed
within the project ArguAna (acknowledgments are given below). It can be accessed
at http://www.arguana.com.
15 All libraries accessed on June 15, 2015. For the Filtering Framework, only Apache UIMA
is required. The other libraries are used for the sample algorithms and applications.
Before prediction, the application processes the entered text with a pipeline of
several text analysis algorithms. In addition to the feature types considered in the
evaluation of Sect. 5.4, it also extracts hotel names and aspects and it derives features
from the combination of local sentiment and the found names and aspects. Unlike
the evaluated sentiment scoring approach, prediction is then performed using super-
vised regression (cf. Sect. 2.1). Afterwards, the application provides different visual
explanations of the prediction, as described in the following.
Figure B.2 shows the user interface of the application consisting of the following five
areas. Only the first area is shown in the beginning, while the others are displayed
after an input text has been analyzed.
Input Text to Be Analyzed. Here, a user can enter an arbitrary input text. After
choosing either the shortened or the full overview graph style (see below), pressing
Analyze sends the input text to the webservice.
Analyzed Text. When the webservice has returned its results, this area shows a
segmentation of the input text into numbered discourse units. Each discourse unit is
marked as being a fact (dark background), a positive opinion (light background), or
a negative opinion (medium background).
Global Sentiment. Here, the predicted sentiment score is visualized in the form of
a star rating with a more exact but still rounded value given in brackets.
Local Sentiment Flow. This area depicts the local sentiment flow of the input text as
defined in Sect. 5.3. In addition, each value of the flow is labeled with the hotel names
and aspects found in the associated discourse unit. By clicking on one of the values,
a detail view of the explanation graph is displayed in the area below, as illustrated in
Fig. B.3.
Explanation Graph. In the bottom area, a variant of the explanation graph sketched
in Fig. 5.18 from Sect. 5.5 is visualized. In particular, the graph aggregates discourse
relations, facts (marked as objective), and opinions all in the same layer. Moreover,
the visualization pretends that the facts and opinions depend on the found hotel
names and aspects (labeled as products and product features) in order to achieve a
simpler graph layout. If the full overview graph style has been selected,
the explanation graph includes tokens with simplified part-of-speech tags as well as
sentences in addition to the information outlined above. Otherwise, the tokens and sentences are
shown in the mentioned detail view only.
Fig. B.2 Screenshot of the main user interface of the prototypical web application for the prediction
and explanation of the sentiment score of an input text.
The source code of the application can be found in the project ArguAna, except for
the source code of the user interface, which is not part of the provided software. The
packages com.arguana.explanation and com.arguana.server contain the source code
of the creation of explanation graphs and the webservice, respectively. All employed
algorithms can be found in the subpackages of com.arguana.efxtools.ae, whereas the
used text analysis pipeline is represented by the descriptor file HotelTextScoreRe-
gressionPipeline.xml in the folder desc/aggregate-ae.
Fig. B.3 Screenshot of the detail view of the explanation graph of a single discourse unit in the
prototypical web application.
B.3.4 Acknowledgments
The development of the application was funded by the German Federal Ministry
of Education and Research (BMBF) as part of the project ArguAna described in
Sect. 2.3. The application’s user interface was implemented by the Resolto Infor-
matik GmbH16 , based in Herford, Germany. The source code for predicting scores
and creating explanation graphs was developed by the author of this book together
with a research assistant from the Webis group17 of the Bauhaus-Universität Weimar,
Martin Trenkmann. The latter also realized the webservice underlying the
application.
Finally, we briefly present how to reproduce all experiments and case studies
described in this book. First, we give basic information on the provided software and
the processed text corpora. Then, we point out how to perform an experiment or case
study and where to find additional information.
B.4.1 Software
The project ArguAna has already been mentioned (in Appendix B.3). It is given in the top-level
folder ArguAna of the provided software and contains all algorithms and applications from our
case study ArguAna (cf. Sect. 2.3) that are relevant for this book. This folder is
organized similarly to those of the projects EfXTools and IE-as-a-Filtering-Task.
Depending on the experiment or case study at hand, source code of one of the
four projects has to be executed in order to reproduce the respective results. More
details are given after the following notes on the processed corpora.
The top-level folder corpora consists of the three text corpora that we created our-
selves and that are described in Appendix C.1 to C.3. Each of them already comes
in the format that is required to reproduce the experiments and case studies.
For the processed existing corpora (cf. Appendix C.4), we provide conversion
classes within the projects. In particular, the CoNLL-2003 dataset can be con-
verted (1) into XMI files with the class CoNLLToXMIConverter in the package
de.upb.efxtools.application.convert found in the project EfXTools and (2) into plain
texts with the class CoNLL03Converter in efxtools.sample.application of IE-as-a-
Filtering-Task. For the Sentiment Scale dataset and the related sentence subjec-
tivity and polarity datasets, three accordingly named XMI conversion classes can be
found in the package com.arguana.corpus.creation of ArguAna. Finally, the Brown
Corpus is converted using the class BrownCorpusToPlainTextConverter from the
package de.upb.efxtools.application.convert in EfXTools.
18 For sections with several separate experiments, another level of sub-folders is added.
For the experiments that rely on the Weka machine learning toolkit (Hall et al. 2009),
there is also a folder arff-files that contains all feature files of the evaluated text corpora.
Example. As an example, browse through the sub-folder experiments_4_3 and open
the file instructions.txt. At the top, this file shows the three steps needed to perform
any of the experiments on the informed search scheduling approach from Sect. 4.3
as well as some additional notes. Below, an overview of the parameters to be set in
the associated Java class is given as well as the concrete parameter specifications
of each experiment. In the folder results, several plain text files can be found that are
named according to the table or figure from Sect. 4.3 in which the respective results
appear. In addition, the file algorithm-run-time-estimations.txt gives an overview of
the run-time estimations that the informed search strategy relies on.
Memory. Some of the text analysis algorithms employed in the experiments re-
quire a lot of heap space during execution, mostly because of large machine learn-
ing models. As a general rule, we recommend allocating 2 GB of heap space in all ex-
periments and case studies. If Eclipse is used to compile and run the respec-
tive code, memory can be assigned to every class with a main method under
Run... > Run Configurations. In the appearing window, insert the vir-
tual machine arguments -Xmx2000m -Xms2000m in the tab Arguments. In case
a class is run from the console, pass the same arguments to the Java virtual machine
to achieve the same effect: java -Xmx2000m -Xms2000m <myClass>.
Run-times. Depending on the experiment, the reproduction of results may take any-
thing between a few seconds and several hours. Notice that many of our experiments
measure run-times themselves. During their execution, nothing else should be done
with the executing system (as far as possible) in order to ensure accurate measure-
ments. This includes the deactivation of energy saving modes, screen savers, and
so on. Moreover, all run-time experiments include a warm-up run, ignored in the
computation of the results, because pipelines based on Apache UIMA tend to be
somewhat slower in the beginning.
Appendix C
Text Corpora
The development and evaluation of text analysis algorithms and pipelines nearly
always relies on text corpora, i.e., collections of texts with known properties (cf.
Sect. 2.1 for details). For the approach to enable ad-hoc large-scale text mining dis-
cussed in the book at hand, we processed and analyzed three text corpora that we
published ourselves as well as a few existing text corpora often used by researchers
in the field. This appendix provides facts and descriptions on each employed corpus
that are relevant for the understanding of the case studies and experiments in
Chaps. 3–5. We begin with our corpora in Appendices C.1–C.3, the Revenue corpus, the
ArguAna TripAdvisor corpus, and the LFA-11 corpus. Afterwards, we shortly
outline the other employed corpora (Appendix C.4).
First, we outline the corpus that is used most often in this book to evaluate the
developed approaches, the Revenue corpus. The Revenue corpus consists of
1128 German online business news articles, in which different types of statements
on revenue are manually annotated together with all information needed to make them
machine-processable. The purpose of the corpus is to investigate both the structure
of sentences on financial criteria and the distribution of associated information over
the text. The corpus was introduced in Wachsmuth et al. (2010), from which we reuse
some content, but we provide more details here. It is free for scientific use and can
be downloaded at http://infexba.upb.de.
C.1.1 Compilation
The Revenue corpus consists of 1128 German news articles from the years 2003
to 2009. These articles were manually selected from 29 source websites by four
employees of the Resolto Informatik GmbH (cf. the acknowledgments below).
Table C.1 Numbers of texts from the listed websites in the complete Revenue corpus as well as
in its training set and in the union of its validation and test set.
Source websites                        Training set    Validation and test set    Complete corpus
http://www.produktion.de 127 12 139
http://www.heise.de 127 12 139
http://www.golem.de 117 12 129
http://www.wiwo.de 112 11 123
http://www.boerse-online.de 100 11 111
http://www.spiegel.de 93 11 104
http://www.capital.de 76 11 87
http://www.tagesschau.de – 73 73
http://www.finanzen.net – 37 37
http://www.vdma.org – 37 37
http://de.news.yahoo.com – 37 37
http://www.faz.net – 19 19
http://www.vdi.de – 16 16
http://www.zdnet.de – 13 13
http://www.handelsblatt.com – 13 13
http://www.zvei.org – 11 11
http://www.sueddeutsche.de – 7 7
http://boerse.ard.de – 7 7
http://www.it-business.de – 5 5
http://www.manager-magazin.de – 5 5
http://www.sachen-machen.org – 5 5
http://www.swissinfo.ch – 4 4
http://www.hr-online.de – 1 1
http://nachrichten.finanztreff.de – 1 1
http://www.tognum.com – 1 1
http://www.pcgameshardware.de – 1 1
http://www.channelpartner.de – 1 1
http://www.cafe-future.net – 1 1
http://www.pokerzentrale.de – 1 1
Total 752 188 each 1128
Table C.2 Distribution of statements on revenues in the different parts of the Revenue corpus,
separated into the distributions of forecasts and of declarations.
Type Training set Validation set Test set Complete
Forecasts 306 (22.4 %) 113 (31.2 %) 104 (30.0 %) 523 (25.2 %)
Declarations 1060 (77.6 %) 249 (68.8 %) 243 (70.0 %) 1552 (74.8 %)
All statements 1366 (100.0 %) 362 (100.0 %) 347 (100.0 %) 2075 (100.0 %)
The training set comprises 752 texts with a total of 21,586 sentences, while the validation and test set
comprise 188 texts each, with 5751 and 6038 sentences, respectively.
C.1.2 Annotation
In each text of the Revenue corpus, annotations of text spans are given on the event
level and on the entity level, as sketched in the following.
Event Level. Every sentence with explicit time and money information that repre-
sents a statement on the revenue of an organization or market is annotated as either
a forecast or a declaration. If a sentence comprises more than one such statement on
revenue, it is annotated multiple times.
Entity Level. In each statement, the time expression and the monetary expression are
marked as such (relative money information is preferred over absolute information
in case they are separated). Accordingly, the subject is marked within the sentence if
available, otherwise its last mention in the preceding text. The same holds for optional
entities, namely, a possible referenced point a relative time expression refers to, a
trend word that indicates whether a relative monetary expression is increasing or
decreasing, and the author of a statement. All annotated entities are linked to the
statement on revenue they belong to. Only entities that belong to a statement are
annotated.
Table C.2 gives an overview of the 2075 statements on revenue annotated in the
corpus. The varying distributions of forecasts and declarations give a hint that the
validation and test set differ significantly from the training set.
Example
Figure C.1 shows a sample text from the Revenue corpus with one statement on
revenue, a forecast. Besides the required time and monetary expressions, the forecast
spans an author mention and a trend indicator of the monetary expression. Other
relevant information is spread across a text, namely, the organization the forecast is
about as well as a reference date needed to resolve the time expression.
Fig. C.1 Illustration of a sample text from the Revenue corpus. Each statement on revenue that
spans a time expression and a monetary expression is manually marked either as a forecast or as a
declaration. Also, different types of information needed to process the statement are annotated.
Annotation Process
Two employees of the above-mentioned company manually annotated all texts from
the corpus. They were given the following main guideline:
Search for sentences in the text (including its title) that contain statements about the revenues
of an organization or market with explicit time and money information. Annotate each such
sentence as a forecast (if it is about the future) or as a declaration (if about the past). Also,
annotate the following information related to the statement:
C.1.3 Files
C.1.4 Acknowledgments
The creation of the Revenue corpus was funded by the German Federal Min-
istry of Education and Research (BMBF) as part of the project InfexBA, de-
scribed in Sect. 2.3. The corpus was planned by the author of this book together with
a research assistant from the above-mentioned Webis group of the Bauhaus Uni-
versität Weimar, Peter Prettenhofer. The described process of manually selecting
and annotating the texts in the corpus was conducted by the Resolto Informatik
GmbH, also named above.
C.2.1 Compilation
Fig. C.2 a Distribution of the locations of the reviewed hotels in the original dataset from Wang
et al. (2010). The ArguAna TripAdvisor corpus contains 300 annotated texts of each of the
seven marked locations. b Distribution of the overall scores of the reviews in the original dataset.
Table C.3 The number of reviewed hotels of each location in the complete ArguAna TripAdvisor
corpus and in its three parts as well as the number of reviews for each sentiment score
between 1 and 5 and in total.
Set Location Hotels Score 1 Score 2 Score 3 Score 4 Score 5 Σ
Training Amsterdam 10 60 60 60 60 60 300
Seattle 10 60 60 60 60 60 300
Sydney 10 60 60 60 60 60 300
Validation Berlin 44 60 60 60 60 60 300
San Francisco 10 60 60 60 60 60 300
Test Barcelona 10 60 60 60 60 60 300
Paris 26 60 60 60 60 60 300
Complete All seven 120 420 420 420 420 420 2100
The plain text of the reviews is not perfect in all cases, certainly due to crawling errors: Some line breaks have
been lost, which hides a number of sentence boundaries and, sporadically, also word
boundaries. The distributions of locations and overall ratings in the original dataset are
illustrated in Fig. C.2. Since the reviews of the covered locations are crawled more or
less randomly, the distribution of overall ratings can be assumed to be representative
of TripAdvisor in general.
Our sampled subset consists of 2100 reviews, balanced with respect to both lo-
cation and overall rating. In particular, we selected 300 reviews from each of seven of the
15 most-represented locations in the original dataset, 60 for every overall rating
between 1 (worst) and 5 (best). This supports an optimal training for machine learn-
ing approaches to rating prediction. Moreover, the reviews of each location cover at
least 10 but as few hotels as possible, which is beneficial for opinion summarization
approaches.
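The total number of reviews follows directly from this sampling scheme: 7 locations × 5 overall ratings × 60 reviews per location and rating = 2100 reviews.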
To counter location bias, we provide a corpus split with a training set containing
the reviews of three locations, and both a validation set and a test set with two of the
other locations. Table C.3 lists details about the balanced compilation and the split.
Table C.4 Statistics of the tokens, sentences, manually classified statements, and manually anno-
tated product features in the ArguAna TripAdvisor corpus.

Type                 Total     Average ± σ        Median   Min   Max
Tokens               442,615   210.77 ± 171.66    172      3     1823
Sentences            24,162    11.51 ± 7.89       10       1     75
Statements           31,006    14.76 ± 10.44      12       1     96
Facts                6,303     3.00 ± 3.65        2        0     41
Positive opinions    11,786    5.61 ± 5.20        5        0     36
Negative opinions    12,917    6.15 ± 6.69        4        0     52
Product features     24,596    11.71 ± 10.03      10       0     180
C.2.2 Annotation
The reviews in the dataset from (Wang et al. 2010) have a title and a body and they
include different ratings and metadata. We maintain all this information as text-level
and syntax-level annotations in the ArguAna TripAdvisor corpus. In addition,
the corpus is enriched with annotations of local sentiment at the discourse level and
domain concepts at the entity level:
Text Level. Each review comes with optional ratings for seven hotel aspects: value,
room, location, cleanliness, front desk, service, and business service. We interpret the
mandatory overall rating as a global sentiment score. All ratings are integer values
between 1 and 5. In terms of metadata, the ID and location of the reviewed hotel, the
username of the author, and the date of creation are given.
Syntax Level. In every review text, the title and body are annotated as such and they
are separated by two line breaks.
Discourse Level. All review texts are segmented into statements that represent single
discourse units. A statement is a main clause together with all its dependent subor-
dinate clauses (and, hence, a statement spans at most a sentence). Each statement is
classified as being an objective fact, a positive opinion, or a negative opinion.
Entity Level. Two types of domain concepts are marked as product features in
all texts: (1) hotel aspects, like those rated on the text level but also others like
atmosphere, and (2) everything that is called an amenity in the hotel domain, e.g.
facilities like a coffee maker or wifi as well as services like laundry.
Table C.4 lists the numbers of corpus annotations together with some statistics.
The corpus includes 31,006 classified statements and 24,596 product features. On
average, a text comprises 14.76 statements and 11.71 product features. A histogram of
the length of all reviews in terms of the number of statements is given in Fig. C.3(a),
grouped into intervals. As can be seen, over one third of all texts spans less than
10 statements (intervals 0–4 and 5–9), whereas less than one fourth spans 20 or
more. Figure C.3(b) visualizes the distribution of sentiment scores for all intervals
that cover at least 1 % of the corpus. Most significantly, the fraction of reviews with
sentiment score 3 increases with higher numbers of statements.

Fig. C.3 a Histogram of the number of statements in the texts of the ArguAna TripAdvisor
corpus, grouped into intervals. b Interpolated curves of the fraction of sentiment scores in the
corpus depending on the numbers of statements.

Fig. C.4 Illustration of a review from the ArguAna TripAdvisor corpus. Each review text has
a title and a body. It is segmented into discourse-level statements that are manually classified as
positive opinions (light background), negative opinions (medium), and objective facts (dark). Also,
manual annotations of domain concepts are provided (marked in bold).
Example
Figure C.4 illustrates the main annotations of a sample review from the corpus.
Each text has a specified title and body. In this case, the body spans nine mentions of
product features, such as “location” or “internet access”. It is segmented into 12 facts
and opinions. The facts and opinions reflect the review’s rather negative sentiment
score 2 while e.g. highlighting that the internet access was not seen as negative.
Besides, Fig. C.4 exemplifies the typical writing style often found in web user reviews
like those from TripAdvisor: a few grammatical inaccuracies (e.g., inconsistent
capitalization) and colloquial phrases (e.g., "like 2 mins walk"), but still easily readable text.
Annotation Process
The classification of all statements in the texts of the ArguAna TripAdvisor cor-
pus was performed using crowdsourcing, while experts annotated the product fea-
tures. Beforehand, the segmentation of the texts into statements was done automatically
using the algorithm pdu (cf. Appendix A). The manual annotation process is sum-
marized in the following.
Crowdsourcing Annotation. The statements were classified using the crowdsour-
cing platform Amazon Mechanical Turk that we already relied on in Sect. 5.5.
The task we assigned to the workers here involved the classification of a random
selection of 12 statements. After some preliminary experiments with different task
descriptions, the main guideline given to the workers was the following:
When visiting a hotel, are the following statements positive, negative, or neither?
Together with the guideline, three notes were provided: (1) to choose “neither”
only for facts, not for unclear cases, (2) to pay attention to subtle statements where
sentiment is expressed implicitly or ironically, and (3) to pick the most appropriate
answer in controversial cases. The different cases were illustrated using a carefully
chosen set of example statements.
The workers were allowed to work on the 12 statements of a task for at most 10
minutes and were paid $0.05 in case of approval. To assure quality, the tasks were
assigned only to workers with over 1000 approved tasks and an average approval
rate of at least 80 % on Amazon Mechanical Turk. Moreover, we always put
two hidden check statements with known and unambiguous classification among the
statements in order to recognize faked or otherwise flawed answers. The workers
were informed that tasks with incorrectly classified check statements are rejected.
Rejected tasks were reassigned to other workers. For a consistent annotation, we
assigned each statement to three workers and then applied majority voting to obtain
the final classifications. Altogether, 328 workers performed 14,187 tasks with an
approval rate of 72.8 %. On average, a worker spent 75.8 s per task.
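As a minimal illustration of this majority voting, the following hypothetical Java method (not part of the provided source code) returns the label chosen by at least two of the three workers; how the rare cases without a majority were resolved is not detailed here.

// Majority vote over the three worker labels of a statement (hypothetical sketch).
static String majorityLabel(String label1, String label2, String label3) {
  if (label1.equals(label2) || label1.equals(label3)) {
    return label1;  // at least two workers chose the first label
  }
  if (label2.equals(label3)) {
    return label2;  // the other two workers agree
  }
  return null;      // no majority; such cases require an additional decision
}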
Expert Annotation. Two experts with linguistic background annotated product fea-
tures in the corpus based on the following guideline:
Read through each review. Mark all product features of the reviewed hotel in the sense of
hotel aspects, amenities, services, and facilities.
the Ritz”. We classified these statements ourselves in the context of the associated
review. To measure the agreement of the product feature annotations, 633 statements
were annotated by two experts. In 546 cases, both experts marked exactly the same
spans in the statements as product features. Assuming a chance agreement probability
of 0.5, this results in the value 0.73 of Cohen’s Kappa (Fleiss 1981), which again
means substantial agreement.
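For reference, the reported value follows from the standard formula for Cohen's Kappa with the observed agreement $p_o = 546/633 \approx 0.863$ and the assumed chance agreement $p_e = 0.5$:

\kappa \;=\; \frac{p_o - p_e}{1 - p_e} \;=\; \frac{546/633 - 0.5}{1 - 0.5} \;\approx\; 0.73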
C.2.3 Files
C.2.4 Acknowledgments
The creation of the ArguAna TripAdvisor corpus was funded by the German
Federal Ministry of Education and Research (BMBF) as part of the project
ArguAna, described in Sect. 2.3. The corpus was planned by the author of this book
together with a research assistant from the above-mentioned Webis group of the
Bauhaus-Universität Weimar, Tsvetomira Palarkska. The latter then compiled
the texts of the corpus and realized and supervised the crowdsourcing annotation
process. The expert annotations were added by one assistant from the University
of Paderborn and one employee of the Resolto Informatik GmbH.
Besides the Revenue corpus and the ArguAna TripAdvisor corpus, we use a
third self-created corpus in some experiments of this book, the LFA-11 corpus. The
LFA-11 corpus is a collection of 4806 manually annotated product-related texts.
The corpus consists of two separate parts, which refer to different high-level topics,
namely, music and smartphones. It serves as a linguistic resource for the develop-
ment and evaluation of approaches to language function analysis (cf. Sect. 2.3) and
sentiment analysis. In the following, we reuse and extend content from Wachsmuth
and Bujna (2011), where the LFA-11 corpus has originally been presented. The
corpus is free for scientific use and can be downloaded at http://infexba.upb.de.
C.3.1 Compilation
The LFA-11 corpus contains 2713 texts from the music domain as well as 2093
texts from the smartphone domain. The texts of these topical domains come from
different sources and are of very different quality and style:
The music collection is made up of user reviews, professional reviews, and pro-
motional texts from a social network platform, selected by employees of a company
from the digital asset management industry (see acknowledgments below). These
texts are well-written and of homogeneous style. On average, a music text spans 9.4
sentences with 23.0 tokens per sentence, according to the output of our algorithms sse
and sto2 (cf. Appendix A.1). In contrast, the texts in the smartphone collection are
blog posts. These posts were retrieved via queries on a self-made Apache Lucene21
index, which was built for the Spinn3r corpus.22 Spinn3r aims at crawling and
indexing the whole blogosphere. Hence, the texts in the smartphone collection vary
strongly in quality and writing style. They have an average length of 11.8 sentences
but only 18.6 tokens per sentence.
C.3.2 Annotation
All texts of the LFA-11 corpus are annotated on the text level with respect to three
classification schemes:
Text Level. First, the language function of each text is annotated as being predom-
inantly personal, commercial, or informational (cf. Sect. 2.3).23 Second, the texts
are classified with respect to their sentiment polarity, where we distinguish positive,
neutral, and negative sentiment. And third, the relevance with respect to the topic
of the corpus part the text belongs to (music or smartphones) is annotated as being
given (true) or not (false).
In the corpus texts, all three annotations are linked to a metadata annotation that
provides access to them.
Table C.5 Distributions of the text-level classes in the three sets of the two topical parts of the
LFA-11 corpus for the three annotated types.
Topic Type Class Training set Validation set Test set
music Language function personal 521 (38.5 %) 419 (61.7 %) 342 (50.4 %)
commercial 127 (9.4 %) 72 (10.6 %) 68 (10.0 %)
informational 707 (52.2 %) 188 (27.7 %) 269 (39.6 %)
Sentiment polarity positive 1003 (74.0 %) 558 (82.2 %) 514 (75.7 %)
neutral 259 (19.1 %) 82 (12.1 %) 115 (16.9 %)
negative 93 (6.9 %) 39 (5.7 %) 50 (7.4 %)
Topic relevance true 1327 (97.9 %) 673 (99.1 %) 662 (97.5 %)
false 28 (2.1 %) 6 (0.9 %) 17 (2.5 %)
smartphone Language function personal 546 (52.1 %) 279 (53.3 %) 302 (57.7 %)
commercial 90 (8.6 %) 36 (6.9 %) 28 (5.4 %)
informational 411 (39.3 %) 208 (39.8 %) 193 (36.9 %)
Sentiment polarity positive 205 (19.6 %) 110 (21.0 %) 84 (16.1 %)
neutral 738 (70.5 %) 343 (65.6 %) 359 (68.6 %)
negative 104 (9.9 %) 70 (13.4 %) 80 (15.3 %)
Topic relevance true 561 (53.6 %) 307 (58.7 %) 287 (54.9 %)
false 486 (46.4 %) 216 (41.3 %) 236 (45.1 %)
Fig. C.5 Translated excerpts from three texts of the music part of the LFA-11 corpus, exemplifying
one instance of each language function. Notice that the translation to English may have affected the
indicators of the annotated classes.
Some texts were annotated twice for inter-annotator agreement purposes (see the annotation
process below). These texts have two annotations of each type. We created splits for each
topic with half of the texts in the training set and one fourth each in the validation set
and test set, respectively. Table C.5 shows the class distributions of language functions,
sentiment polarities, and topic relevance. The distributions indicate that the training,
validation, and test sets differ significantly from each other. In case of double-annotated
texts, we used the annotation of the second employee to compute the distributions. So,
the exact frequencies of the different classes depend on which annotations are used.
Example
Figure C.5 shows excerpts from three texts of the music collection, one out of each
language function class. The excerpts have been translated to English for convenience
purposes. The neutral sentiment of the personal text might seem inappropriate, but
the given excerpt is misleading in this respect.
Annotation Process
The classification of all texts of the LFA-11 corpus was performed by two employees
of the mentioned company based on the following guidelines:
Read through each text of the two collections. First, tag the text as being predominantly
personal, commercial, or informational with respect to the product discussed in the text:
• personal. Use this annotation if the text seems not to be of commercial interest, but
probably represents the personal view on the product of a private individual.
• commercial. Use this annotation if the text is of obvious commercial interest. The text
seems to predominantly aim at persuading the reader to buy or like the product.
• informational. Use this annotation if the text seems not to be of commercial interest with
respect to the product. Instead, it predominantly appears to be informative in a journalistic
manner.
Second, tag the text with respect to its sentiment as being predominantly positive, neutral, or negative:
• neutral. Use this annotation if the text either reports on the product without making any
positive or negative statement about it or if the text is neither positive nor negative, but
rather close to the middle between positive and negative.
• negative. Use this annotation if the text reports on the product in a negative way from an
overall viewpoint.
• positive. Use this annotation if the text reports on the product in a positive way from an
overall viewpoint.
Finally, decide whether the text is relevant (true) or irrelevant (false) with respect to the topic
of the collection (music or smartphones).
C.3.3 Files
C.3.4 Acknowledgments
The creation of the LFA-11 corpus was also funded by the German Federal
Ministry of Education and Research (BMBF) as part of the project InfexBA,
described in Sect. 2.3. Both parts of the corpus were planned by the author of this
book together with a research assistant from the above-mentioned Webis group, Pe-
ter Prettenhofer. The latter gathered the texts of the smartphone collection, whereas
the music texts were selected by the company Digital Collections Verlagsge-
sellschaft mbH24 , based in Hamburg, Germany. This company also conducted the
described annotation process.
The CoNLL-2003 dataset (Tjong Kim Sang and Meulder 2003) serves for the devel-
opment and evaluation of approaches to the CoNLL-2003 shared task on language-
independent named entity recognition. It consists of an English part and a German
part. The English part contains 1393 news stories with different topics. It is split into
946 training texts (14,987 sentences), 216 validation texts (3466 sentences), and 231
test texts (3684 sentences). The German part contains 909 mixed newspaper articles,
split into 553 training texts (12,705 sentences), 201 validation texts (3068 sentences),
and 155 test texts (3160 sentences). In all texts of both parts, every mention of a per-
son name, a location name, and an organization name is manually annotated as an
entity of the respective type. Besides, there is a type Misc that covers entities that
do not belong to these three types.
We process the English part in the experiments with our input control in Sect. 3.5
and the German part to evaluate all our scheduling approaches in Sects. 4.1, 4.3,
and 4.5. Both parts are analyzed in terms of their distribution of relevant informa-
tion (Sects. 4.2 and 4.4). The CoNLL-2003 dataset is not freely available. For infor-
mation on how to obtain this corpus, see the website of the CoNLL-2003 shared
task, http://www.cnts.ua.ac.be/conll2003/ner (accessed on June 15, 2015).
The Sentiment Scale dataset (Pang and Lee 2005) is a collection of texts that
has been widely used to evaluate approaches to the prediction of sentiment scores.
It consists of 5006 reviews from the film domain and comes with two sentiment
scales. In particular, each review is assigned one integer score in the range [0, 3] and
one in the range [0, 2]. On average, a review has 36.1 sentences. The dataset is split
into four text corpora according to the four authors of the reviews: 1770 reviews of
author a (Steve Rhodes), 902 reviews of author b (Scott Renshaw), 1307 reviews of
author c (James Berardinelli), and 1027 reviews of author d (Dennis Schwartz).
We first analyze the Sentiment Scale dataset in terms of its distribution of
relevant information in Sect. 4.4. Later, we process the dataset in the feature experi-
ments in Sect. 5.3 and in the evaluation of our overall analysis (Sect. 5.4), where we
rely on the three-class sentiment scale. In the performed experiments, we discarded
three reviews of author a and five reviews of author c due to encoding problems.
The Sentiment Scale dataset can be downloaded (accessed on June 8, 2015) at
http://www.cs.cornell.edu/people/pabo/movie-review-data.
In addition to the Sentiment Scale dataset, we also process the Subjectivity
dataset from (Pang and Lee 2004) and the Sentence polarity dataset from
(Pang and Lee 2005) in Sect. 5.4 in order to develop classifiers for sentence sentiment.
Both are also freely available at the mentioned website. The Subjectivity dataset
contains 10,000 sentences, half of which are classified as subjective and half as
objective. Similarly, the Sentence polarity dataset contains 5331 positive and
5331 negative sentences. The sentences from these two datasets are taken from film
reviews.
The widely used Brown corpus (Francis 1966) was introduced in the 1960s as a
standard text collection of present-day American English. It consists of 500 prose
text samples of about 2000 words each. The samples are excerpts from texts that
were printed in the year 1961 and that were written by native speakers of American
English as far as determinable. They cover a wide range of styles and varieties of
prose. At a high level, they can be divided into informative prose (374 samples) and
imaginative prose (126 samples).
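Taken together, the sample sizes stated above imply an overall corpus size of roughly 500 × 2000 = 1,000,000 words.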
We process the Brown corpus in Sects. 4.2 and 4.4 to show how relevant infor-
mation is distributed across texts and collections of texts. The Brown corpus is free
for non-commercial purposes and can be downloaded (accessed on June 15, 2015)
at http://www.nltk.org/nltk_data.
The German Wikipedia sample that we experiment with consists of the first 10,000
articles from the Wikimedia25 dump from March 9, 2013, ordered according to their
internal page IDs. The complete dump contains over 3 million Wikipedia pages,
of which 1.8 million represent articles that are neither empty nor stubs nor
simple lists.
As in the case of the Brown corpus, we process the Wikipedia sample in
Sects. 4.2 and 4.4 to show how relevant information is distributed across texts and
collections of texts. The dump we rely on is outdated and not available anymore.
However, similar dumps from later dates can be downloaded (accessed on June 15,
2015) at http://dumps.wikimedia.org/dewiki.
References
Baccianella, S., Esuli, A., Sebastiani, F.: SentiWordNet 3.0: an enhanced lexical resource for sen-
timent analysis and opinion mining. In: Proceedings of the Seventh International Conference on
Language Resources and Evaluation, pp. 2200–2204 (2010)
Ballesteros, M., Bohnet, B., Mille, S., Wanner, L.: Deep-syntactic parsing. In: Proceedings of the
25th International Conference on Computational Linguistics. Technical papers, pp. 1402–1413
(2014)
Bangalore, S.: Thinking outside the box for natural language processing. In: Gelbukh, A. (ed.)
CICLing 2012, Part I. LNCS, vol. 7181, pp. 1–16. Springer, Heidelberg (2012)
Banko, M., Cafarella, M.J., Soderland, S., Broadhead, M., Etzioni, O.: Open information extraction
from the web. In: Proceedings of the 20th International Joint Conference on Artificial Intelligence,
pp. 2670–2676 (2007)
Batista, G.E.A.P.A., Prati, R.C., Monard, M.C.: A study of the behavior of several methods for
balancing machine learning training data. SIGKDD Explor. Newslett. 6(1), 20–29 (2004)
Bellotti, V., Edwards, K.: Intelligibility and accountability: human considerations in context-aware
systems. Hum.-Comput. Interact. 16(2–4), 193–212 (2001)
Beringer, S.: Effizienz und Effektivität der Integration von Textklassifikation in Information-
Extraction-Pipelines. Master’s thesis, University of Paderborn, Paderborn, Germany (2012)
Besnard, P., Hunter, A.: Elements of Argumentation. The MIT Press, Cambridge (2008)
Biber, D., Conrad, S., Reppen, R.: Corpus Linguistics: Investigating Language Structure and Use.
Cambridge University Press, Cambridge (1998)
Björkelund, A., Bohnet, B., Hafdell, L., Nugues, P.: A high-performance syntactic and seman-
tic dependency parser. In: Proceedings of the 23rd International Conference on Computational
Linguistics: Demonstrations, pp. 33–36 (2010)
Blitzer, J., Dredze, M., Pereira, F.: Biographies, bollywood, boom-boxes and blenders: domain adap-
tation for sentiment classification. In: Proceedings of the 45th Annual Meeting of the Association
for Computational Linguistics, pp. 440–447 (2007)
Blitzer, J., Crammer, K., Kulesza, A., Pereira, F., Wortman, J.: Learning bounds for domain adap-
tation. In: Platt, J., Koller, D., Singer, Y., Roweis, S. (eds.) Advances in Neural Information
Processing Systems, vol. 21, pp. 105–134. MIT Press, Cambridge (2008)
Bohnet, B.: Very high accuracy and fast dependency parsing is not a contradiction. In: Proceedings
of the International Conference on Computational Linguistics, pp. 89–97 (2010)
Bohnet, B., Kuhn, J.: The best of both worlds: a graph-based completion model for transition-based
parsers. In: Proceedings of the 13th Conference of the European Chapter of the Association for
Computational Linguistics, pp. 77–87 (2012)
Bohnet, B., Burga, A., Wanner, L.: Towards the annotation of penn treebank with information struc-
ture. In: Proceedings of the Sixth International Joint Conference on Natural Language Processing,
Nagoya, Japan, pp. 1250–1256 (2013)
Bühler, K.: Sprachtheorie. Die Darstellungsfunktion der Sprache. Verlag von Gustav Fischer, Jena,
Germany (1934)
Buschmann, F., Meunier, R., Rohnert, H., Sommerlad, P., Stal, M.: Pattern-oriented Software Ar-
chitecture: A System of Patterns. Wiley, New York (1996)
Cafarella, M.J., Downey, D., Soderland, S., Etzioni, O.: KnowItNow: fast, scalable information
extraction from the web. In: Proceedings of the Conference on Human Language Technology
and Empirical Methods in Natural Language Processing, pp. 563–570 (2005)
Cardie, C., Ng, V., Pierce, D., Buckley, C.: Examining the role of statistical and linguistic knowledge
sources in a general-knowledge question-answering system. In: Proceedings of the Sixth Applied
Natural Language Processing Conference, pp. 180–187 (2000)
Carlson, L., Marcu, D., Okurowski, M.E.: Building a discourse-tagged corpus in the framework of
rhetorical structure theory. In: Proceedings of the Second SIGdial Workshop on Discourse and
Dialogue, vol. 16, pp. 1–10 (2001)
Cha, S.-H.: Comprehensive survey on distance/similarity measures between probability density
functions. Int. J. Math. Models Methods Appl. Sci. 1(4), 300–307 (2007)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM Trans. Intell. Syst.
Technol. 2, 27:1–27:27 (2011)
Chapelle, O., Schlkopf, B., Zien, A.: Semi-Supervised Learning. The MIT Press, Cambridge (2006)
Chenlo, J.M., Hogenboom, A., Losada, D.E.: Rhetorical structure theory for polarity estimation:
an experimental study. Data Knowl. Eng. 94(B), 135–147 (2014)
Chinchor, N., Lewis, D.D., Hirschman, L.: Evaluating message understanding systems: an analysis
of the third message understanding conference (MUC-3). Comput. Linguist. 19(3), 409–449
(1993)
Chiticariu, L., Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F.R., Vaithyanathan, S.: SystemT:
an algebraic approach to declarative information extraction. In: Proceedings of the 48th Annual
Meeting of the Association for Computational Linguistics, pp. 128–137 (2010a)
Chiticariu, L., Li, Y., Raghavan, S., Reiss, F.R.: Enterprise information extraction: recent develop-
ments and open challenges. In: Proceedings of the 2010 International Conference on Management
of Data, pp. 1257–1258 (2010b)
Choi, Y., Breck, E., Cardie, C.: Joint extraction of entities and relations for opinion recognition. In:
Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pp.
431–439 (2006)
Cooper, K.D., Torczon, L.: Engineering a Compiler. Morgan Kaufmann, Burlington (2011)
Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 3rd edn. MIT
Press, Cambridge (2009)
Covington, M.A.: A fundamental algorithm for dependency parsing. In: Proceedings of the 39th
Annual ACM Southeast Conference, pp. 95–102 (2001)
Cowie, J., Lehnert, W.: Information extraction. Commun. ACM 39(1), 80–91 (1996)
Cui, H., Sun, R., Li, K., Kan, M.-Y., Chua, T.-S.: Question answering passage retrieval using
dependency relations. In: Proceedings of the 28th Annual International ACM SIGIR Conference
on Research and Development in Information Retrieval, pp. 400–407 (2005)
Cunningham, H.: Information extraction, automatic. Encycl. Lang. Linguist. 4, 665–677 (2006)
Sarma, A.D., Jain, A., Bohannon, P.: Building a generic debugger for information extraction
pipelines. In: Proceedings of the 20th ACM International Conference on Information and Knowl-
edge Management, pp. 2229–2232 (2011)
Daumé III, H., Marcu, D.: Domain adaptation for statistical classifiers. J. Artif. Intell. Res. 26(1),
101–126 (2006)
Davenport, T.H.: Enterprise Analytics: Optimize Performance, Process, and Decisions through Big
Data. FT Press, Upper Saddle River (2012)
Dezsényi, C., Dobrowiecki, T.P., Mészáros, T.: Adaptive document analysis with planning. Multi-
Agent Syst. Appl. IV, 620–623 (2005)
Dikli, S.: An overview of automated scoring of essays. J. Technol. Learn. Assess. 5(1), 1–36 (2006)
Dill, S., Eiron, N., Gibson, D., Gruhl, D., Guha, R., Jhingran, A., Kanungo, T., Rajagopalan, S.,
Tomkins, A., Tomlin, J.A., Zien, J.Y.: SemTag and seeker: bootstrapping the semantic web via
automated semantic annotation. In: Proceedings of the 12th International Conference on World
Wide Web, pp. 178–186 (2003)
Doan, A.H., Naughton, J.F., Ramakrishnan, R., Baid, A., Chai, X., Chen, F., Chen, T., Chu, E.,
DeRose, P., Gao, B., Gokhale, C., Huang, J., Shen, W., Vuong, B.-Q.: Information extraction
challenges in managing unstructured data. SIGMOD Rec. 37(4), 14–20 (2009)
Downey, D., Etzioni, O., Soderland, S.: A probabilistic model of redundancy in information ex-
traction. In: Proceedings of the 19th International Joint Conference on Artificial Intelligence, pp.
1034–1041 (2005)
Duggan, G.B., Payne, S.J.: Text skimming: the process and effectiveness of foraging through text
under time pressure. J. Exp. Psychol.: Appl. 15(3), 228–242 (2009)
Dunlavy, D.M., Shead, T.M., Stanton, E.T.: ParaText: scalable text modeling and analysis. In:
Proceedings of the 19th ACM International Symposium on High Performance Distributed Com-
puting, pp. 344–347 (2010)
Egner, M.T., Lorch, M., Biddle, E.: UIMA GRID: distributed large-scale text analysis. In: Pro-
ceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid, pp.
317–326 (2007)
Etzioni, O.: Search needs a shake-up. Nature 476, 25–26 (2011)
Fader, A., Soderland, S., Etzioni, O.: Identifying relations for open information extraction. In:
Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pp.
1535–1545 (2011)
Fazal-Baqaie, M., Luckey, M., Engels, G.: Assembly-based method engineering with method pat-
terns. In: Software Engineering 2013 Workshopband, pp. 435–444 (2013)
Ferrucci, D., Lally, A.: UIMA: an architectural approach to unstructured information processing in
the corporate research environment. Nat. Lang. Eng. 10(3–4), 327–348 (2004)
Finkel, J.R., Grenager, T., Manning, C.D.: Incorporating non-local information into information
extraction systems by Gibbs sampling. In: Proceedings of the 43rd Annual Meeting of the Asso-
ciation for Computational Linguistics, pp. 363–370 (2005)
Finkel, J.R., Manning, C.D., Ng, A.Y.: Solving the problem of cascading errors: approximate
Bayesian inference for linguistic annotation pipelines. In: Proceedings of the 2006 Conference
on Empirical Methods in Natural Language Processing, pp. 618–626 (2006)
Fleiss, J.L.: Statistical Methods for Rates and Proportions, 2nd edn. Wiley, New York (1981)
Forman, G., Kirshenbaum, E.: Extremely fast text feature extraction for classification and indexing.
In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp.
1221–1230 (2008)
Fox, M.S., Smith, S.F.: ISIS: a knowledge-based system for factory scheduling. Expert Syst. 1,
25–49 (1984)
Francis, W.N.: A Standard Sample of Present-day English for Use with Digital Computers. Brown
University (1966)
Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using wikipedia-based explicit
semantic analysis. In: Proceedings of the 20th International Joint Conference on Artifical Intel-
ligence, pp. 1606–1611 (2007)
Gildea, D., Jurafsky, D.: Automatic labeling of semantic roles. Comput. Linguist. 28(3), 245–288
(2002)
Glorot, X., Bordes, A., Bengio, Y.: Domain adaptation for large-scale sentiment classification: a deep
learning approach. In: Proceedings of the 28th International Conference on Machine Learning,
pp. 97–110 (2011)
Gottron, T.: Content extraction - identifying the main content in HTML documents. Ph.D. thesis,
Universität Mainz (2008)
Gray, W.D., Fu, W.-T.: Ignoring perfect knowledge in-the-world for imperfect knowledge in-the-
head. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, pp.
112–119 (2001)
Gruber, T.R.: A translation approach to portable ontology specifications. Knowl. Acquis. 5(2),
199–220 (1993)
Gruhl, D., Chavet, L., Gibson, D., Meyer, J., Pattanayak, P., Tomkins, A., Zien, J.: How to build a
WebFountain: an architecture for very large-scale text analytics. IBM Syst. J. 43(1), 64–76 (2004)
Güldali, B., Funke, H., Sauer, S., Engels, G.: TORC: test plan optimization by requirements clus-
tering. Softw. Qual. J. 1–29 (2011)
Gupta, R., Sarawagi, S.: Domain adaptation of information extraction models. SIGMOD Rec. 37(4),
35–40 (2009)
Gylling, M.: The structure of discourse: a corpus-based cross-linguistic study. Dissertation, Copen-
hagen Business School (2013)
Habernal, I., Eckle-Kohler, J., Gurevych, I.: Argumentation mining on the web from information
seeking perspective. In: Frontiers and Connections between Argumentation Theory and Natural
Language Processing (2014, to appear)
Hagen, M., Stein, B., Rüb, T.: Query session detection as a cascade. In: Proceedings of the 20th ACM
International Conference on Information and Knowledge Management, pp. 147–152 (2011)
Hajič, J., Ciaramita, M., Johansson, R., Kawahara, D., Martí, M.A., Màrquez, L., Meyers, A.,
Nivre, J., Padó, S., Štěpánek, J., Straňák, P., Surdeanu, M., Xue, N., Zhang, Y.: The CoNLL-2009
shared task: syntactic and semantic dependencies in multiple languages. In: Proceedings of the
Thirteenth Conference on Computational Natural Language Learning: Shared Task, pp. 1–18
(2009)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data
mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
Harsu, M.: A Survey on Domain Engineering. Tampere University of Technology (2002)
Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference
and Prediction, 2nd edn. Springer, New York (2009)
Hayes-Roth, B.: A blackboard architecture for control. Artif. Intell. 26(3), 251–321 (1985)
Hearst, M.A.: Untangling text data mining. In: Proceedings of the 37th Annual Meeting of the
Association for Computational Linguistics, pp. 3–10 (1999)
Hearst, M.A.: Search User Interfaces. Cambridge University Press, Cambridge (2009)
Hinrichs, E., Hinrichs, M., Zastrow, T.: WebLicht: web-based LRT services for German. In: Pro-
ceedings of the ACL 2010 System Demonstrations, pp. 25–29 (2010)
Hollingshead, K., Roark, B.: Pipeline iteration. In: Proceedings of the 45th Annual Meeting of the
Association for Computational Linguistics, pp. 952–959 (2007)
Horrocks, I.: Ontologies and the semantic web. Commun. ACM 51(12), 58–67 (2008)
HP Labs: Annual Report (2010). http://www.hpl.hp.com/news/2011/jan-mar/pdf/HPL_AR_2010_
web.pdf. Accessed 12 July 2013
Hsu, W.C., Chen, H., Yew, P.C., Chen, H.: On the predictability of program behavior using different
input data sets. In: Sixth Annual Workshop on Interaction between Compilers and Computer
Architectures, pp. 45–53 (2002)
Huang, L.: Advanced dynamic programming in semiring and hypergraph frameworks. In: COLING
2008: Advanced Dynamic Programming in Computational Linguistics: Theory, Algorithms and
Applications - Tutorial Notes, pp. 1–18 (2008)
Ioannidis, Y.: Query optimization. In: The Computer Science and Engineering Handbook. CRC Press (1997)
Jackson, P.: Introduction to Expert Systems, 2nd edn. Addison-Wesley Longman Publishing Co.
Inc., Boston (1990)
Jean-Louis, L., Besançon, R., Ferret, O.: Text segmentation and graph-based method for template
filling in information extraction. In: Proceedings of the 5th International Joint Conference on
Natural Language Processing, pp. 723–731 (2011)
Joachims, T.: A statistical learning model of text classification for support vector machines. In: Pro-
ceedings of the 24th Annual International ACM SIGIR Conference on Research and Development
in Information Retrieval, pp. 128–136 (2001)
Jurafsky, D.: Pragmatics and computational linguistics. In: Horn, L.R., Ward, G. (eds.) Handbook
of Pragmatics, pp. 578–604. Blackwell, Oxford (2003)
Jurafsky, D., Martin, J.H.: Speech and Language Processing: An Introduction to Natural Language
Processing, Computational Linguistics, and Speech Recognition, 2nd edn. Prentice-Hall, Upper
Saddle River (2009)
Kalyanpur, A., Patwardhan, S., Boguraev, B., Lally, A., Chu-Carroll, J.: Fact-based question de-
composition for candidate answer re-ranking. In: Proceedings of the 20th ACM International
Conference on Information and Knowledge Management, pp. 2045–2048 (2011)
Kano, Y.: Kachako: towards a data-centric platform for full automation of service selection, com-
position, scalable deployment and evaluation. In: Proceedings of the IEEE 19th International
Conference on Web Services, pp. 642–643 (2012)
Kano, Y., Dorado, R., McCrohon, L., Ananiadou, S., Tsujii, J.: U-Compare: an integrated language
resource evaluation platform including a comprehensive UIMA resource library. In: Proceedings
of the Seventh International Conference on Language Resources and Evaluation, pp. 428–434
(2010)
Kelly, J.E., Hamm, S.: Smart Machines: IBM’s Watson and the Era of Cognitive Computing.
Columbia University Press, New York (2013)
Kim, J.-D., Wang, Y., Takagi, T., Yonezawa, A.: Overview of Genia event task in BioNLP shared
task 2011. In: Proceedings of the BioNLP Shared Task 2011 Workshop, pp. 7–15 (2011)
Büning, H.K., Lettmann, T.: Propositional Logic: Deduction and Algorithms. Cambridge University
Press, New York (1999)
Klerx, T., Anderka, M., Büning, H.K., Priesterjahn, S.: Model-based anomaly detection for discrete
event systems. In: Proceedings of the 26th IEEE International Conference on Tools with Artificial
Intelligence, pp. 665–672 (2014)
Krikon, E., Carmel, D., Kurland, O.: Predicting the performance of passage retrieval for ques-
tion answering. In: Proceedings of the 21st ACM International Conference on Information and
Knowledge Management, pp. 2451–2454 (2012)
Krishnamurthy, R., Li, Y., Raghavan, S., Reiss, F., Vaithyanathan, S., Zhu, H.: SystemT: a system
for declarative information extraction. SIGMOD Rec. 37(4), 7–13 (2009)
Kulesza, T., Stumpf, S., Wong, W.-K., Burnett, M.M., Perona, S., Ko, A., Oberst, I.: Why-oriented
end-user debugging of naive Bayes text classification. ACM Trans. Interact. Intell. Syst. 1(1),
2:1–2:31 (2011)
Kulesza, T., Stumpf, S., Burnett, M.M., Yang, S., Kwan, I., Wong, W.-K.: Too much, too little, or
just right? ways explanations impact end users’ mental models. In: IEEE Symposium on Visual
Languages and Human-Centric Computing, pp. 3–10 (2013)
Kumar, V., Grama, A., Gupta, A., Karypis, G.: Introduction to Parallel Computing: Design and
Analysis of Algorithms. Benjamin-Cummings Publishing Co., Inc., Redwood City (1994)
Kushmerick, N.: Wrapper induction for information extraction. Dissertation, University of Wash-
ington (1997)
Lambrecht, K.: Information Structure and Sentence Form: Topic, Focus, and the Mental Represen-
tations of Discourse Referents. Cambridge University Press, New York (1994)
Lee, Y.-B., Myaeng, S.H.: Text genre classification with genre-revealing and subject-revealing
features. In: Proceedings of the 25th Annual International ACM SIGIR Conference on Research
and Development in Information Retrieval, pp. 145–150 (2002)
Lewis, D.D., Tong, R.M.: Text filtering in MUC-3 and MUC-4. In: Proceedings of the 4th Conference
on Message Understanding, pp. 51–66 (1992)
Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization
research. J. Mach. Learn. Res. 5, 361–397 (2004)
Li, L., Jin, X., Long, M.: Topic correlation analysis for cross-domain text classification. In: Pro-
ceedings of the 26th AAAI Conference on Artificial Intelligence, pp. 998–1004 (2012a)
Li, Q., Anzaroot, S., Lin, W.-P., Li, X., Ji, H.: Joint inference for cross-document information extrac-
tion. In: Proceedings of the 20th ACM International Conference on Information and Knowledge
Management, pp. 2225–2228 (2011)
Li, Y., Chiticariu, L., Yang, H., Reiss, F.R., Carreno-fuentes, A.: WizIE: a best practices guided
development environment for information extraction. In: Proceedings of the ACL 2012 System
Demonstrations, pp. 109–114 (2012b)
Lim, B.Y., Dey, A.K.: Assessing demand for intelligibility in context-aware applications. In: Pro-
ceedings of the 11th International Conference on Ubiquitous Computing, pp. 195–204 (2009)
Lipka, N.: Modeling non-standard text classification tasks. Dissertation, Bauhaus-Universität
Weimar (2013)
Luís, T., de Matos, D.M.: High-performance high-volume layered corpora annotation. In: Proceed-
ings of the Third Linguistic Annotation Workshop, pp. 99–107 (2009)
Mann, W.C., Thompson, S.A.: Rhetorical structure theory: toward a functional theory of text orga-
nization. Text 8(3), 243–281 (1988)
Manning, C.D., Schütze, H.: Foundations of Statistical Natural Language Processing. MIT Press,
Cambridge (1999)
Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge Uni-
versity Press, New York (2008)
Mao, Y., Lebanon, G.: Isotonic conditional random fields and local sentiment flow. Adv. Neural
Inf. Process. Syst. 19, 961–968 (2007)
Marcu, D.: The Theory and Practice of Discourse Parsing and Summarization. MIT Press, Cam-
bridge (2000)
Marler, R.T., Arora, J.S.: Survey of multi-objective optimization methods for engineering. Struct.
Multidiscip. Optim. 26(6), 369–395 (2004)
McCallum, A.: Joint inference for natural language processing. In: Proceedings of the Thirteenth
Conference on Computational Natural Language Learning, p. 1 (2009)
Melzner, T.: Heuristische Suchverfahren zur Laufzeitoptimierung von Information-Extraction-Pipelines
[Heuristic search methods for the run-time optimization of information extraction pipelines].
Master's thesis, University of Paderborn, Paderborn, Germany (2012)
Menon, R., Choi, Y.: Domain independent authorship attribution without domain adaptation. In:
Proceedings of the International Conference Recent Advances in Natural Language Processing
2011, pp. 309–315 (2011)
Mesquita, F., Schmidek, J., Barbosa, D.: Effectiveness and efficiency of open relation extraction.
In: Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing,
pp. 447–457 (2013)
Mex, D.: Efficiency and effectiveness of multi-stage machine learning algorithms for text quality
assessment. Master’s thesis, University of Paderborn, Paderborn, Germany (2013)
Meyer, D., Leisch, F., Hornik, K.: The support vector machine under test. Neurocomputing 55(1–2),
169–186 (2003)
Minton, S., Bresina, J., Drummond, M.: Total-order and partial-order planning: a comparative
analysis. J. Artif. Intell. Res. 2(1), 227–262 (1995)
Mitchell, T.M.: Machine Learning, 1st edn. McGraw-Hill, Inc., New York (1997)
Mochales, R., Moens, M.-F.: Argumentation mining. Artif. Intell. Law 19(1), 1–22 (2011)
Montavon, G., Orr, G.B., Müller, K.-R. (eds.): Neural Networks: Tricks of the Trade, Reloaded,
2nd edn. Springer, Heidelberg (2012)
Moschitti, A., Basili, R.: Complex linguistic features for text classification: a comprehensive study.
In: McDonald, S., Tait, J.I. (eds.) ECIR 2004. LNCS, vol. 2997, pp. 181–196. Springer, Heidelberg
(2004)
Mukherjee, S., Bhattacharyya, P.: Sentiment analysis in twitter with lightweight discourse analysis.
In: Proceedings of the 24th International Conference on Computational Linguistics, pp. 1847–
1864 (2012)
Nédellec, C., Vetah, M.O.A., Bessières, P.: Sentence filtering for information extraction in genomics,
a classification problem. In: Siebes, A., De Raedt, L. (eds.) PKDD 2001. LNCS (LNAI), vol. 2168,
pp. 326–337. Springer, Heidelberg (2001)
Ng, V.: Supervised noun phrase coreference research: the first fifteen years. In: Proceedings of the
48th Annual Meeting of the Association for Computational Linguistics, pp. 1396–1411 (2010)
Nivre, J.: An efficient algorithm for projective dependency parsing. In: Proceedings of the 8th
International Workshop on Parsing Technologies, pp. 149–160 (2003)
Ó Séaghdha, D., Teufel, S.: Unsupervised learning of rhetorical structure with un-topic models. In:
Proceedings of COLING 2014, The 25th International Conference on Computational Linguistics:
Technical Papers, Dublin, Ireland, pp. 2–13 (2014)
OMG: Unified Modeling Language (OMG UML) Superstructure, Version 2.4.1. OMG (2011)
Pang, B., Lee, L.: A sentimental education: sentiment analysis using subjectivity summarization based
on minimum cuts. In: Proceedings of the 42nd Annual Meeting of the Association for Computational
Linguistics, pp. 271–278 (2004)
Pang, B., Lee, L.: Seeing stars: exploiting class relationships for sentiment categorization with
respect to rating scales. In: Proceedings of the 43rd Annual Meeting on Association for Compu-
tational Linguistics, pp. 115–124 (2005)
Pang, B., Lee, L.: Opinion mining and sentiment analysis. Found. Trends Inf. Retr. 2(1–2),
1–135 (2008)
Pang, B., Lee, L., Vaithyanathan, S.: Thumbs up?: sentiment classification using machine learning
techniques. In: Proceedings of the ACL 2002 Conference on Empirical Methods in Natural
Language Processing, vol. 10, pp. 79–86 (2002)
Pantel, P., Ravichandran, D., Hovy, E.: Towards terascale knowledge acquisition. In: Proceedings
of the 20th International Conference on Computational Linguistics, pp. 771–777 (2004)
Pasca, M.: Web-based open-domain information extraction. In: Proceedings of the 20th ACM
International Conference on Information and Knowledge Management, pp. 2605–2606 (2011)
Patwardhan, S., Riloff, E.: Effective information extraction with semantic affinity patterns and
relevant regions. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural
Language Processing and Computational Natural Language Learning, pp. 717–727 (2007)
Pauls, A., Klein, D.: k-best A∗ parsing. In: Proceedings of the Joint Conference of the 47th Annual
Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing
of the AFNLP, pp. 958–966 (2009)
Petrov, S., Das, D., McDonald, R.: A universal part-of-speech tagset. In: Proceedings of the Eighth
International Conference on Language Resources and Evaluation, pp. 2089–2096 (2012)
Pitler, E., Nenkova, A.: Revisiting readability: a unified framework for predicting text quality.
In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp.
186–195 (2008)
Pokkunuri, S., Ramakrishnan, C., Riloff, E., Hovy, E., Burns, G.A.: The role of information extrac-
tion in the design of a document triage application for biocuration. In: Proceedings of BioNLP
2011 Workshop, pp. 46–55 (2011)
Poon, H., Domingos, P.: Joint inference in information extraction. In: Proceedings of the 22nd
National Conference on Artificial Intelligence, pp. 913–918 (2007)
Popescu, A.-M., Etzioni, O.: Extracting product features and opinions from reviews. In: Proceedings
of the Conference on Human Language Technology and Empirical Methods in Natural Language
Processing, pp. 339–346 (2005)
Prettenhofer, P., Stein, B.: Cross-lingual adaptation using structural correspondence learning. ACM
Trans. Intell. Syst. Technol. 3, 13:1–13:22 (2011)
Ramamoorthy, C.V., Li, H.F.: Pipeline architecture. ACM Comput. Surv. 9(1), 61–102 (1977)
Raman, K., Swaminathan, A., Gehrke, J., Joachims, T.: Beyond myopic inference in big data
pipelines. In: Proceedings of the 19th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 86–94 (2013)
Ratinov, L., Roth, D.: Design challenges and misconceptions in named entity recognition. In:
Proceedings of the Thirteenth Conference on Computational Natural Language Learning, pp. 147–155 (2009)
Reiss, F., Raghavan, S., Krishnamurthy, R., Zhu, H., Vaithyanathan, S.: An algebraic approach to
rule-based information extraction. In: Proceedings of the 2008 IEEE 24th International Confer-
ence on Data Engineering, pp. 933–942 (2008)
Riabov, A., Liu, Z.: Scalable planning for distributed stream processing systems. In: Proceedings
of the Sixteenth International Conference on Automated Planning and Scheduling, pp. 31–41
(2006)
Rose, M.: Entwicklung eines Expertensystems zur automatischen Erstellung effizienter Information-
Extraction-Pipelines [Development of an expert system for the automatic construction of efficient
information extraction pipelines]. Master's thesis, University of Paderborn, Paderborn, Germany (2012)
Rowley, J.: The wisdom hierarchy: representations of the DIKW hierarchy. J. Inf. Sci. 33(2), 163–
180 (2007)
Russell, S.J., Norvig, P.: Artificial Intelligence: A Modern Approach, 3rd edn. Prentice-Hall, Upper
Saddle River (2009)
Samuel, A.L.: Some studies in machine learning using the game of checkers. IBM J. Res. Dev. 3(3),
210–229 (1959)
Sapkota, U., Solorio, T., Montes, M., Bethard, S., Rosso, P.: Cross-topic authorship attribution: will
out-of-topic data help? In: Proceedings of the 25th International Conference on Computational
Linguistics: Technical Papers, pp. 1228–1237 (2014)
Sarawagi, S.: Information extraction. Found. Trends Databases 1(3), 261–377 (2008)
Schmid, H.: Improvements in part-of-speech tagging with an application to German. In: Proceedings
of the ACL SIGDAT-Workshop, pp. 47–50 (1995)
Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47
(2002)
Selinger, P.G., Astrahan, M.M., Chamberlin, D.D., Lorie, R.A., Price, T.G.: Access path selec-
tion in a relational database management system. In: Proceedings of the 1979 ACM SIGMOD
International Conference on Management of Data, pp. 23–34 (1979)
Shen, W., Doan, A., Naughton, J.F., Ramakrishnan, R.: Declarative information extraction using dat-
alog with embedded extraction predicates. In: Proceedings of the 33rd International Conference
on Very Large Data Bases, pp. 1033–1044 (2007)
Sinha, R., Swearingen, K.: The role of transparency in recommender systems. In: CHI 2002 Ex-
tended Abstracts on Human Factors in Computing Systems, pp. 830–831 (2002)
Solovyev, V., Polyakov, V., Ivanov, V., Anisimov, I., Ponomarev, A.: An approach to semantic
natural language processing of Russian texts. Res. Comput. Sci. 65, 65–73 (2013)
Somasundaran, S., Wiebe, J.: Recognizing stances in ideological on-line debates. In: Proceedings
of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation
of Emotion in Text, pp. 116–124 (2010)
Stab, C., Gurevych, I.: Annotating argument components and relations in persuasive essays. In: Pro-
ceedings of the 25th International Conference on Computational Linguistics: Technical Papers,
pp. 1501–1510 (2014a)
Stab, C., Gurevych, I.: Identifying argumentative discourse structures in persuasive essays. In:
Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp.
46–56 (2014b)
Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inform. Sci. Technol.
60(3), 538–556 (2009)
Stamatatos, E.: Plagiarism detection based on structural information. In: Proceedings of the 20th
ACM International Conference on Information and Knowledge Management, pp. 1221–1230
(2011)
Stein, B., zu Eißen, S.M., Gräfe, G., Wissbrock, F.: Automating market forecast summarization
from internet data. In: Proceedings of the Fourth Conference on WWW/Internet, pp. 395–402
(2005)
Stein, B., zu Eißen, S.M., Lipka, N.: Web genre analysis: use cases, retrieval models, and imple-
mentation issues. In: Mehler, A., Sharoff, S., Santini, M. (eds.) Genres on the Web. Text, Speech
and Language Technology, vol. 42, pp. 167–190. Springer, Berlin (2010)
Stevenson, M.: Fact distribution in information extraction. Lang. Res. Eval. 40(2), 183–201 (2007)
Stoyanov, V., Eisner, J.: Easy-first coreference resolution. In: Proceedings of the 24th International
Conference on Computational Linguistics, pp. 2519–2534 (2012)
Teufel, S., Siddharthan, A., Batchelor, C.: Towards discipline-independent argumentative zoning:
evidence from chemistry and computational linguistics. In: Proceedings of the 2009 Conference
on Empirical Methods in Natural Language Processing, pp. 1493–1502 (2009)
Tjong Kim Sang, E.F., De Meulder, F.: Introduction to the CoNLL-2003 shared task: language-
independent named entity recognition. In: Proceedings of the Seventh Conference on Natural
Language Learning at HLT-NAACL 2003, pp. 142–147 (2003)
Toulmin, S.E.: The Uses of Argument. Cambridge University Press, Cambridge (1958)
Trosborg, A.: Text typology: register, genre and text type. In: Trosborg, A. (ed.) Text Typology and
Translation, pp. 3–24. John Benjamins Publishing, Amsterdam (1997)
Tsujii, J.: Computational linguistics and natural language processing. In: Proceedings of the 12th
International Conference on Computational Linguistics and Intelligent Text Processing, vol. Part
I, pp. 52–67 (2011)
Turmo, J., Ageno, A., Català, N.: Adaptive information extraction. ACM Comput. Surv. 38(2), 1–47
(2006)
van Noord, G.: Learning efficient parsing. In: Proceedings of the 12th Conference of the European
Chapter of the Association for Computational Linguistics, pp. 817–825 (2009)
van Rijsbergen, C.J.: Information Retrieval. Butterworth-Heinemann, Newton (1979)
Villalba, M.P.G., Saint-Dizier, P.: Some facets of argument mining for opinion analysis. In: Pro-
ceedings of the 2012 Conference on Computational Models of Argument, pp. 23–34 (2012)
Wachsmuth, H., Bujna, K.: Back to the roots of genres: text classification by language function.
In: Proceedings of the 5th International Joint Conference on Natural Language Processing, pp.
632–640 (2011)
Wachsmuth, H., Stein, B.: Optimal scheduling of information extraction algorithms. In: Proceed-
ings of the 24th International Conference on Computational Linguistics: Posters, pp. 1281–1290
(2012)
Wachsmuth, H., Prettenhofer, P., Stein, B.: Efficient statement identification for automatic market
forecasting. In: Proceedings of the 23rd International Conference on Computational Linguistics,
pp. 1128–1136 (2010)
Wachsmuth, H., Stein, B., Engels, G.: Constructing efficient information extraction pipelines. In:
Proceedings of the 20th ACM Conference on Information and Knowledge Management, pp.
2237–2240 (2011)
Wachsmuth, H., Rose, M., Engels, G.: Automatic pipeline construction for real-time annotation.
In: Gelbukh, A. (ed.) CICLing 2013, Part I. LNCS, vol. 7816, pp. 38–49. Springer, Heidelberg
(2013a)
Wachsmuth, H., Stein, B., Engels, G.: Learning efficient information extraction on heterogeneous
texts. In: Proceedings of the 6th International Joint Conference on Natural Language Processing,
pp. 534–542 (2013b)
Wachsmuth, H., Stein, B., Engels, G.: Information extraction as a filtering task. In: Proceedings of
the 22nd ACM Conference on Information and Knowledge Management, pp. 2049–2058 (2013c)
Wachsmuth, H., Trenkmann, M., Stein, B., Engels, G.: Modeling review argumentation for ro-
bust sentiment analysis. In: Proceedings of the 25th International Conference on Computational
Linguistics: Technical Papers, pp. 553–564 (2014a)
Wachsmuth, H., Trenkmann, M., Stein, B., Engels, G., Palakarska, T.: A review corpus for argu-
mentation analysis. In: Gelbukh, A. (ed.) CICLing 2014, Part II. LNCS, vol. 8404, pp. 115–127.
Springer, Heidelberg (2014b)
Walton, D., Godden, M.: The impact of argumentation on artificial intelligence. In: Houtlosser, P.,
van Rees, A. (eds.) Considering Pragma-Dialectics, pp. 287–299. Erlbaum, Mahwah (2006)
Wang, D.Z., Wei, L., Li, Y., Reiss, F.R., Vaithyanathan, S.: Selectivity estimation for extraction
operators over text data. In: Proceedings of the 2011 IEEE 27th International Conference on Data
Engineering, pp. 685–696 (2011)
Wang, H., Lu, Y., Zhai, C.: Latent aspect rating analysis on review text data: a rating regression
approach. In: Proceedings of the 16th ACM SIGKDD International Conference on Knowledge
Discovery and Data Mining, pp. 783–792 (2010)
Whitelaw, C., Kehlenbeck, A., Petrovic, N., Ungar, L.: Web-scale named entity recognition. In:
Proceedings of the 17th ACM Conference on Information and Knowledge Management, pp.
123–132 (2008)
Wimalasuriya, D.C., Dou, D.: Components for information extraction: ontology-based information
extractors and generic platforms. In: Proceedings of the 19th ACM International Conference on
Information and Knowledge Management, pp. 9–18 (2010)
Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques, 2nd edn.
Morgan Kaufmann Publishers, San Francisco (2005)
Wu, Q., Tan, S., Duan, M., Cheng, X.: A two-stage algorithm for domain adaptation with application
to sentiment transfer problems. In: Cheng, P.-J., Kan, M.-Y., Lam, W., Nakov, P. (eds.) AIRS 2010.
LNCS, vol. 6458, pp. 443–453. Springer, Heidelberg (2010)
Yang, Z., Garduno, E., Fang, Y., Maiberg, A., McCormack, C., Nyberg, E.: Building optimal in-
formation systems automatically: configuration space exploration for biomedical information
systems. In: Proceedings of the 22nd ACM International Conference on Information and Knowledge
Management, pp. 1421–1430 (2013)
Yi, J., Nasukawa, T., Bunescu, R., Niblack, W.: Sentiment analyzer: extracting sentiments about
a given topic using natural language processing techniques. In: Proceedings of the Third IEEE
International Conference on Data Mining, pp. 427–434 (2003)
Žáková, M., Křemen, P., Železný, F., Lavrač, N.: Automating knowledge discovery workflow com-
position through ontology-based planning. IEEE Trans. Autom. Sci. Eng. 8(2), 253–264 (2011)
Zhang, M., Zhang, J., Su, J., Zhou, G.: A composite kernel to extract relations between entities
with both flat and structured features. In: Proceedings of the 21st International Conference on
Computational Linguistics and the 44th Annual Meeting of the Association for Computational
Linguistics, pp. 825–832 (2006)
Zhang, T.: Solving large scale linear prediction problems using stochastic gradient descent algo-
rithms. In: Proceedings of the Twenty-first International Conference on Machine Learning, pp.
116–123 (2004)
Zhang, Y.: Grid-centric scheduling strategies for workflow applications. Dissertation, Rice Univer-
sity (2010)
Index
The following index covers the major technical terms used in this book. In most
cases, a short explanation of the respective term is given at one of its first mentions.
The most important terms are explained in more detail in Sect. 2.1.
Heuristic, 141, 143–147 140, 142, 146, 150, 159, 167, 175, 178,
function, 143–145, 147, 148 180–182, 233, 234, 236, 237, 248, 251,
optimistic, 144, 145, 148 252, 256, 258, 279
Heuristic search, 142 Input requirement, 5, 39, 78, 83, 130–132
High-quality information, 2, 6, 7, 14, 50, 182 Intelligibility, 30, 184, 207, 223, 224, 229, 230
I J
Index, 21, 22, 49 Java class, 254–258, 262, 263
Indexing, 2, 21, 48, 49, 275 Joint inference, 37, 44, 45
InfexBA, 3, 4, 9, 13, 40–44, 58, 62–64, 70, 73, Justification, 102, 104
76, 82, 88–90, 119, 124, 131, 178, 186,
245, 255, 269, 278
Information extraction, 3, 5, 14, 20, 22–25, 29, K
34–37, 40, 41, 47, 48, 50, 61, 64, 70, 94, Kernel function, 222
118, 134, 139, 141, 149, 169, 186, 222, Kernel, convolutional, 222, 237
224, 226, 232–234, 236, 256 Kernel, hash, 244
algorithm, 14, 46, 49, 91, 118, 226, 231
declarative information extraction, 46, 48
open information extraction, 45, 49, 51 L
pipeline, 44, 118 Labeled attachment score, 32
process, 36 Language function, 43, 119, 185, 194, 198,
task, 13–15, 31, 42, 47, 62, 88, 95, 169, 184 200, 208, 240, 241, 275–277
type, 95 Language function analysis, 43, 119, 191, 198,
Information need, 1–3, 5–8, 10, 11, 14, 21, 200, 222, 275
34–37, 40–42, 45–49, 59–62, 64, 65, Large scale, 1, 7, 27, 37, 47, 141, 163, 168,
68–73, 76–78, 81, 82, 85, 87–92, 95, 231, 234, 236, 274
96, 101, 107, 117, 131, 132, 134, 135, Lazy evaluation, 60, 62, 63, 65, 67, 68, 82, 120,
137, 147, 149, 153, 159–161, 163, 164, 132, 137, 147, 156
167, 231–236, 252–254, 256 Lemma, 23, 35, 36, 64, 71, 244, 245, 247, 248
ad-hoc, 10 Lemmatization, 23, 65, 91, 247
ad-hoc information need, 6–8, 182
Information retrieval, 2, 3, 14, 16, 19–22, 24,
26, 29, 30, 41 M
Information structure, 50, 52, 193, 194 Machine, 177–182
Information type, 4–6, 8–10, 20, 22, 23, 26, breakdown, 177, 182
35–40, 44, 58, 60–62, 69, 71, 73–75, master machine, 178, 180, 181
77–80, 83–85, 91–93, 95, 96, 99–102, Machine learning, 9, 20, 23, 25–27, 31, 33, 48,
106, 108, 118, 134, 135, 139–141, 147, 52, 71, 165, 169, 174, 176, 186, 187,
158, 159, 170, 224–226, 232, 233, 243, 209, 262
245, 246, 252, 253, 268 algorithm, 25–28, 51, 168
input information type, 58, 73–75, 83, 127, approach, 26, 235, 270
142, 225, 240 deep learning, 51
output information type, 40, 73–75, 78, 83, learning process, 27
92, 101, 103–109, 111, 142, 225, 239, learning theory, 31, 188
240 model, 14, 25–27, 29, 45, 47, 50, 117, 165,
predecessor type, 101, 106, 111, 116 186–188, 237, 256, 263
Informed search, 141–143, 147, 150, 152, 155, reinforcement learning, 30
156, 231, 263 self-supervised learning, 11, 14, 29,
scheduling problem, 144, 145, 147 163–167, 170, 231, 234
strategy, 143, 144, 148, 263 semi-supervised learning, 29, 215, 222,
Input control, 10–14, 46, 47, 62, 74, 83, 92, 93, 235, 274
101, 102, 107–121, 124–126, 131–135, statistical, 12, 24
supervised learning, 27–29, 165, 198, 199, 212–214, 219, 221, 222, 227, 228,
208, 209, 212, 217, 231 234–237
type, 27, 29, 30, 50
unsupervised learning, 27–29
Machine utilization, 177, 180, 181 P
Manhattan distance, 21, 210, 213, 217 Paragraph splitting, 23, 246
MapReduce, 49 Parallelization, 11, 15, 39, 48–50, 133,
Markov model, hidden, 126, 127 176–182, 236, 237
Markov property, 127 analysis parallelization, 180
Memory consumption, 32, 152, 177, 179–182, schedule parallelization, 181, 182
234, 263 Pareto optimum, 72
Minimum response time, 178–181 Parse tree, 4, 243
dependency parse tree, 36, 65, 222, 243,
244
N Parsing, 22, 23, 48, 50, 145, 223, 243, 247
Natural language processing, 2, 3, 19, 20, constituency parsing, 243
22–24, 30, 36, 37, 41, 243, 247 dependency parsing, 4, 23, 32, 48, 65, 68,
algorithm, 231 112, 234, 243, 244, 249
task, 36, 145 shallow parsing, 23, 243
type, 70 syntactic parsing, 140, 237
Negatives, 31 transition-based, 48
false negatives, 31 Part-of-speech, 23, 91, 101, 190, 199, 200, 226,
true negatives, 31 241, 242, 244, 245, 247, 248
Network time, 178–181 n-gram, 217
Non-monotonicity, 101 tag, 4, 23, 26, 35–37, 45, 48, 61, 63, 64, 70,
Normalization, 4, 23, 26, 36, 41, 42, 62, 88, 71, 106, 170, 174, 199, 226, 241, 245,
163, 209, 214, 215, 241–243 247, 248, 259
length normalization, 203, 209–211, 215, tagset, 248
217 Part-of-speech tagging, 37, 65, 91, 97, 118,
170, 248
Passage retrieval, 22, 25, 47, 117, 118
O Pipeline construction, 58, 60, 64, 68, 76, 81,
Online learning, 11, 14, 163–167, 171, 174, 82, 84, 86, 88–90, 92, 107, 232, 236,
175, 233 238, 255
Ontology, 46, 69, 87, 95, 98, 192, 195 ad-Hoc, 76, 88, 91
Ontology import, 87 ad-hoc, 73, 76, 77, 85–91, 101, 224, 236,
Ontology, upper, 69 251–253
Open list, 144–146, 148, 155 automatic, 76, 232
Opinion, 2, 3, 25, 42, 52, 70, 75, 92, 95, 100, manual, 90
158, 159, 195, 196, 202, 226, 236, 241, Pipeline design, 1, 6–8, 11–15, 37, 39–41, 43,
259, 272 45, 46, 60, 69, 120, 131, 140, 163, 176,
customer opinion, 3, 42 190, 231–233, 235–237, 239, 251
negative opinion, 42, 158, 159, 202, 216, Pipeline execution, 58, 64, 68, 86, 88, 90, 95,
221, 259, 271, 272 107, 111, 134, 165, 176–182, 235–237,
positive opinion, 42, 158, 159, 202, 204, 251–254, 256
216, 221, 259, 271, 272 optimal, 101, 107, 109, 111, 116, 117, 256
Opinion mining, 25, 42 Pipeline linearization, 39, 76, 77, 80
Opinion summarization, 3, 270 greedy, 81, 82, 84, 89, 141, 232, 234
Order relation, 72, 75, 194, 196 Pipeline optimization, 59–61, 120, 134, 234
Overall analysis, 10–14, 16, 184, 203, 208, Pipeline, text analysis, 1, 4–10, 12, 13, 15, 19,
212, 215, 216, 223, 227, 279 34, 37–41, 44–46, 49, 59–61, 63, 65,
Overall structure, 10, 11, 14, 16, 52, 184, 69, 71, 74, 76, 77, 81, 86, 88–95, 98,
190–195, 197, 198, 201, 203–209, 102, 107–110, 112, 117, 119, 120, 124,
125, 130, 131, 134–138, 140, 142, 150, standard, 210, 217
155, 156, 159–161, 163, 169, 176, 177, threshold, 210, 211, 214, 217
180, 181, 184, 186, 189, 190, 209, 212,
214, 223–226, 229–235, 239, 251, 256,
260, 265 Q
ad-hoc, 10 Quality, 6, 7, 20, 26, 30, 33, 38, 46, 69, 71–73,
ad-hoc large-scale, 9, 10 75, 76, 79, 84, 89, 91, 154, 188, 201,
admissible, 58–60, 62–66, 78, 83, 124, 126, 223, 225, 229, 230, 232, 235
127, 129, 130, 133, 136, 142, 148, 156, criterion, 13, 34, 60, 71–76, 79, 80, 84, 87,
157, 164, 176, 178 88, 249, 253
complete, 58, 143, 144, 180 estimation, 65, 73, 77, 79, 80, 84, 87, 91,
empty, 38 225, 226
large-scale, 11 function, 46, 59, 71, 77, 84, 120
main pipeline, 164–168, 170–176, 182 measure, 26, 30, 34
optimal, 134 model, 72, 74, 75, 87, 88, 253, 254
partial, 38, 63, 142–144, 146, 155 prioritization, 72, 74–78, 80, 84, 87, 90, 91,
partially ordered, 38, 78, 80–83, 146 232, 251–254
prefix pipeline, 164–166, 168, 170, 172, priority, 72
176, 181 Query, 94–98, 102, 112–116, 149–154, 169,
valid, 58, 62, 65, 83, 88 181, 232, 258
Pipelining, analysis, 179, 180, 237 disjunctive, 97, 147
Pipes and filters, 39 scoped, 94, 97–103, 106–112, 114–116,
Planning, 9, 46, 77, 78, 82, 86, 231 132, 256, 257
goal, 78 Question answering, 47, 118, 237
partial order planning, 11, 76, 78, 79, Question answering, 50
81–85, 89, 232
pipeline planning problem, 77–79, 82, 83
planning agenda, 78 R
planning problem, 10, 77 Recall, 31, 32, 45, 47, 66–68, 72, 87–91, 98,
Polarity, 3, 24, 30, 61, 70, 194, 198, 202, 241, 112–114, 118, 119, 204, 233, 250
275–277 Recommender system, 30, 52
global, 202 Regression, 27, 28, 165–168, 172–174, 218,
local, 197 259
negative, 24 error, 32, 173–175, 218
opinion polarity, 52, 61, 262, 273 linear regression, 28, 171
positive, 24 regression model, 28, 165–169, 174, 175
Positives, 31, 112, 115, 116 Regression time, 169, 173, 174
false positives, 31, 112, 114, 119 Regressionerror, 166
true positives, 31, 112, 114 Relation, 3, 4, 23, 24, 28, 35, 45, 47, 52, 58, 70,
Precision, 31, 32, 42, 45, 47, 66, 67, 72, 87–91, 71, 75, 93, 95, 98, 99, 112–116, 135,
98, 112–114, 118, 119, 204, 249, 250 191, 244–246
Precondition, 78, 79, 83, 84 binary relation, 96, 98, 114
Prediction, 25–28, 32, 42, 47, 91, 100, 116, template, 23, 95
164–168, 172, 176, 259 type, 5, 23, 24, 35, 36, 45, 70, 71, 93–99,
error, 176 101, 103, 110, 113, 135, 193, 194, 226
problem, 25, 26 Relation extraction, 4, 23, 36, 37, 39, 42, 44,
run-Time prediction, 165 48, 58, 64, 65, 88, 91, 96, 112, 135, 222,
run-time prediction, 28, 164, 166–168, 170, 243–246, 249
173, 175, 176, 234 binary relation extraction, 97
score prediction, 204, 217, 241, 258–261, task, 45
269, 270, 274, 279 Relevance, 14, 47, 95, 99
Purity, 210, 211, 214, 215, 217 classified as relevant, 114, 133, 135–137,
relaxed, 210, 211, 217 162
relevant information, 1–4, 6, 11, 13–15, 120, 121, 124, 126–128, 130–136, 138,
19, 20, 23, 34, 35, 42, 47, 49, 68, 98, 140–148, 150–152, 155–158, 160, 161,
111, 116–118, 120, 131, 133–141, 156, 163–166, 171, 175, 179, 181, 233, 234,
157, 159, 163, 170, 173, 176, 193, 194, 237
231–233, 236, 249, 256, 267, 279, 280 admissible, 58, 60, 76, 80, 81, 125, 126,
relevant information type, 5, 43, 70, 94, 95, 129, 142, 143, 145, 147, 150, 152, 155,
98, 99, 104, 105, 108, 110, 111, 116, 131, 158, 161, 163, 170
137, 138, 159, 169, 176 fixed, 156, 160, 161, 163, 164, 172, 176,
relevant paragraph, 106 233
relevant portion of text, 7, 11, 13, 22, partial, 38, 78–80, 82, 142, 144, 146, 147,
46–49, 62, 68, 73, 82, 92–104, 106–111, 150
113, 115–118, 120, 126, 130, 132, 134, Scheduling, 5, 11, 14, 15, 39, 48–50, 59, 60,
142–147, 150, 155, 164, 167, 169, 180, 62, 64, 65, 68, 76, 77, 81, 89, 90, 92,
233, 249, 256 115, 120, 124, 126, 127, 129, 131–134,
relevant result, 1, 10, 108 138, 140–144, 149–155, 161, 163–165,
relevant sentence, 47, 68, 94, 132, 133, 137, 167–170, 175, 176, 178, 182, 214, 233,
138 234, 236–238, 248, 279
relevant text, 2, 3, 19–22, 44, 46, 108, 120
adaptive scheduling, 11, 12, 14, 15, 49,
Representativeness, 31, 33, 139, 185, 188, 189,
163, 164, 166, 167, 169, 171–176, 178,
270
179, 181, 234, 238
optimal representativeness, 187–190
Resource description framework, 69, 70 adaptive scheduling problem, 165, 166
Revenue corpus, 42, 64–68, 88, 90, 112–116, greedy scheduling, 13, 89, 141, 153, 163
119, 132, 133, 137–140, 149, 151–157, ideal scheduling, 131
160, 161, 169, 249, 250, 255, 256, optimal scheduling, 13, 15, 58–60, 63–65,
265–269, 274 67, 68, 120, 121, 126, 131, 134, 138, 141,
Rhetorical relation, 194 147, 150, 178, 233
Rhetorical structure theory, 194, 195, 205, 244 optimized scheduling, 11–13, 15, 49, 60,
Run-time, 11, 13, 28, 32, 48, 49, 59, 62–64, 141, 147, 149, 150, 152, 154, 155, 157,
66–68, 72, 77, 81, 84, 85, 88–91, 107, 163, 178, 181, 236, 238, 263
109, 112, 114–116, 118–120, 124–127, scheduling model, 165, 166
129–131, 133–141, 143–146, 148, 149, Scheduling time, 150–153, 178–181
151–153, 157, 160–168, 170–178, 180, Scope, 46, 98–100, 102–111, 115–117, 126,
215, 233–235, 238, 239, 248, 249, 263 127, 129, 131, 132, 136, 145, 257
asymptotic, 85, 109, 215, 232 descendant scope, 105, 111
estimation, 11, 64, 81, 82, 141, 144, 145,
relevant scope, 105–107, 109
147–151, 154, 155, 167, 248, 263
optimality, 59, 60, 107, 120, 121, 127–137, root scope, 105, 111
140–142, 146, 148–150, 155, 161, 164, scope TMS, 110, 111, 257
171, 175 unified scope, 106, 109–111
overall run-time, 32, 72, 84, 114–116, 119, Search engine, 1–3, 6, 7, 19, 21, 41, 91, 182,
132, 150, 153, 154, 156, 162, 164, 169, 186, 223, 235, 236
172, 177, 215 Search graph, 142, 143, 146, 148, 150, 152,
run-time per portion of text, 32, 66, 67, 155
87–90, 137–141, 143, 144, 159, 166, Segmentation, 22, 23, 48, 113, 246, 259, 272
170–175, 223, 248, 249 algorithm, 104, 107, 108, 112, 178
worst-case, 85, 89, 109, 130, 148, 149, 168, Selectivity, 64, 74, 77, 120, 126, 132–135, 141,
233 144, 152
Running time, 32 estimation, 74, 81, 141, 145, 176
Semantic concept, 193, 194, 198, 208, 209
S Semantic relation, 193
Schedule, 5, 9–11, 13, 14, 38, 39, 48, 49, Semantic role labeling, 36, 70, 222
58–64, 66, 67, 84, 86, 94, 107, 115, Sentence splitting, 23, 36, 37, 83, 104, 199, 247
Sentiment, 24, 42, 52, 186, 196, 198, 205, 207, Test set, 33, 66–68, 90, 119, 149, 151, 153,
208, 220, 221, 227, 228, 234, 235, 241, 154, 171–173, 175, 189, 199, 218, 249,
269, 273, 275, 277, 279 266, 267, 270, 276
global sentiment, 196, 202–206, 227, 259 Test text, 156, 171, 173, 200, 216, 278
local sentiment, 3, 194, 196, 201–205, Text analysis, 1, 3–5, 10, 11, 14–16, 20–22, 26,
208–211, 216–222, 225, 227, 229, 234, 30, 32, 34–38, 44–50, 52, 60–62, 64,
241, 259, 271 68, 69, 72, 75, 76, 84, 86, 87, 89, 92–96,
scale, 216–218, 279 98–103, 109, 111, 116–118, 120, 134,
score, 42, 100, 193, 195, 199, 200, 202–207, 159, 164, 177, 179, 182, 184, 185, 187,
210, 216–218, 220, 221, 225, 227–229, 189, 191, 192, 196, 197, 207, 221, 224,
241, 258–260, 270–272, 279 227, 229, 234, 236, 237, 239, 247, 249
score, global, 194, 197, 202–204, 206 ad-hoc, 232
Sentiment analysis, 14, 24, 25, 30, 36, 37, 51, approach, 20, 30–33, 35, 44–47, 50, 52, 92,
52, 158, 184–186, 190, 191, 195, 196, 187, 236
198, 200, 201, 204–207, 219, 236, 241, process, 4, 5, 8, 9, 15, 19, 20, 34, 36–40, 42,
275 44, 45, 48, 60–62, 68, 69, 72, 76, 95–98,
approach, 25, 158 100, 101, 103, 107, 108, 148, 176, 177,
Sentiment scoring, 25, 208, 209, 216, 218, 223–226, 229, 231, 232, 235
221–223, 226, 229, 235, 258, 259
scenario, 5, 6, 10, 236
Sentiment word, 199, 200
task, 5–9, 12, 14, 26, 27, 31, 34–37, 39, 40,
Sentiment-related task, 52
43, 45, 58, 60, 61, 63, 68, 69, 82, 92, 94,
Sequence labeling, 29, 43, 48, 97, 191, 242
95, 100, 111, 117, 121, 124, 134, 138,
Similarity, 21, 29, 207, 208, 210, 213–215,
143, 149, 160, 161, 163, 169, 176, 178,
221, 222, 227
182, 184, 186, 187, 189, 190, 197, 207,
cluster similarity, 217
222, 229, 237, 249, 251
flow similarity, 209, 210, 213, 214, 217,
text analysis model, 15, 68, 69, 76, 86, 91,
221, 228, 237
95, 192, 193, 231, 252, 255
measure, 21, 29, 210, 213
Software engineering, 9, 185, 238 type, 239, 240
Speed, 224, 227, 228 Text categorization, 24
Stance, 24, 192–194 Text class, 193–195, 198, 209, 210, 276
Stance recognition, 24, 190, 192, 193 Text classification, 3–5, 11–14, 16, 20–22,
Standard deviation, 32, 66, 67, 88–90, 112, 24–26, 28, 34–36, 42, 44, 47, 50, 52, 61,
114–116, 132, 140, 151–154, 157, 62, 70, 75, 94, 100, 118, 119, 158, 164,
159–161, 170, 171, 173 165, 182, 184, 186, 190–194, 197–199,
Stream of texts, 4–6, 23, 24, 34, 35, 37, 39, 40, 201, 206–208, 212, 214–216, 218, 219,
46, 59–61, 63, 69, 73, 76, 77, 81, 93, 95, 221, 222, 224, 227, 228, 232–235, 240,
117, 121, 134, 135, 138, 141–143, 156, 241, 248, 272, 273, 277, 279
158–167, 169, 171, 176, 177, 182, 191, approach, 37, 42, 197, 198, 209, 212, 226,
234 228, 231, 236
Structure classification task, 191, 208 pipeline, 118, 191, 226
ontology, 195, 201 process, 36
Subjectivity, 24, 25, 52, 61, 75, 199, 216, 221, task, 12, 15, 24, 31, 42, 43, 48, 52, 100,
241, 262, 279 117, 118, 158, 164, 165, 192–195, 197,
Support vector machine, 28, 218 198, 201, 207, 214, 216, 222, 236
linear, 217, 240, 241, 244–246 task, non-standard, 24, 191, 235
multi-class, 199, 240, 241 Text mining, 2–4, 6, 7, 9, 14, 15, 19–21, 26, 28,
29, 34, 35, 41, 43, 44, 48, 52, 120, 231,
236, 237
T ad-hoc large-scale text mining, 6–10, 12,
Tagging, 22, 23, 48, 63, 247 13, 15, 16, 19, 22, 44, 45, 53, 182, 184,
Target function, 26, 30, 187, 189 189, 216, 222, 229–231, 235, 236, 239,
Target variable, 26–28, 30, 31, 33 251, 265
ad-hoc text mining, 1, 7, 12, 46, 77, 81, 84, 174, 185, 187–189, 199, 200, 209, 210,
91, 117, 118, 142, 155, 177, 178, 182, 212, 213, 217, 218, 220, 221, 248, 266,
236 267, 270, 276
application, 19, 23, 25, 35, 40, 42, 47, 49, training size, 151, 152, 174, 175
52, 118, 134, 182, 229, 231, 255 Training time, 27, 32, 168, 178–181, 223
high-quality text mining, 184, 190, 207, Trust, 224, 226, 228, 230
223, 229, 230, 236 Truth maintenance, 11, 14, 101, 231
large-scale text mining, 1, 12, 81, 141, 142, assumption-based, 101, 102, 111, 233, 257
153, 176, 177, 180, 182, 236 Type system, 70, 71, 74, 75, 85, 87, 254, 256,
scenario, 117 257
Text quality, 5, 23, 160, 270, 275
Text quality assessment, 51, 61, 191
Time ratio, 112–116
Token, 4, 23, 32, 35, 37, 61, 63, 64, 67, 70, 71, U
83, 100, 101, 119, 170, 173, 199, 226, Understanding, 52, 224, 226, 228
241–249, 259, 271, 275 Unit class, 194–197, 200, 201, 207, 208
n-gram, 199, 200 Universe, 38, 69, 72, 74
Tokenization, 23, 31, 36, 37, 63, 83, 87, 88, Unstructured text, 2, 3, 20–24, 35
118, 170, 199, 247
Topic, 1, 3, 4, 24–26, 52, 70, 139, 159, 160,
185, 186, 190, 192–194, 197, 216, 241, V
274–278 Validation set, 33, 149, 151, 153, 154, 199,
Topic detection, 24, 214 216, 218, 266, 267, 270, 276
Topic relevance, 275–277 Validation text, 266, 278
Training, 25–27, 47, 139, 166, 169, 171, 174, Validation, n-fold cross-, 33, 218, 220, 249
175, 186, 188, 199–201, 210, 213, Value constraint, 72–74, 252, 254
216–220, 223, 238, 241–244, 247, 248, Vector space model, 21, 26, 198, 208
270
Viterbi algorithm, 126, 127, 141
data, 11, 27, 29, 50, 186–188, 199
Viterbi path, 127, 132, 133
instance, 28, 164–166, 168
text, 24, 50, 151–156, 165, 174, 184–188,
200, 209, 212, 216, 217, 223, 266, 278
training phase, 166, 168 W
training set, 33, 51, 64, 132, 133, 137, 138, Workflow, 38, 46
149–151, 156, 157, 165–168, 170, 171, Wrapper induction, 24