Active Learning
Synthesis Lectures on Artificial
Intelligence and Machine
Learning
Editor
Ronald J. Brachman, Yahoo! Research
William W. Cohen, Carnegie Mellon University
Thomas Dietterich, Oregon State University
Active Learning
Burr Settles
2012
Human Computation
Edith Law and Luis von Ahn
2011
Trading Agents
Michael P. Wellman
2011
Copyright © 2012 by Morgan & Claypool
Active Learning
Burr Settles
www.morganclaypool.com
DOI 10.2200/S00429ED1V01Y201207AIM018
Lecture #18
Series Editors: Ronald J. Brachman, Yahoo! Research
William W. Cohen, Carnegie Mellon University
Thomas Dietterich, Oregon State University
Series ISSN
Synthesis Lectures on Artificial Intelligence and Machine Learning
Print 1939-4608 Electronic 1939-4616
Active Learning
Burr Settles
Carnegie Mellon University
ABSTRACT
The key idea behind active learning is that a machine learning algorithm
can perform better with less training if it is allowed to choose the data
from which it learns. An active learner may pose “queries,” usually in
the form of unlabeled data instances to be labeled by an “oracle” (e.g., a
human annotator) that already understands the nature of the problem.
This sort of approach is well-motivated in many modern machine
learning and data mining applications, where unlabeled data may be
abundant or easy to come by, but training labels are difficult, time-
consuming, or expensive to obtain.
This book is a general introduction to active learning. It outlines
several scenarios in which queries might be formulated, and details many
query selection algorithms which have been organized into four broad
categories, or “query selection frameworks.” We also touch on some of
the theoretical foundations of active learning, and conclude with an
overview of the strengths and weaknesses of these approaches in
practice, including a summary of ongoing work to address these open
challenges and opportunities.
KEYWORDS
active learning, expected error reduction, hierarchical sampling, optimal
experimental design, query by committee, query by disagreement, query
learning, uncertainty sampling, variance reduction
Dedicated to my family and friends, who keep me
asking questions.
Contents
Preface
Acknowledgments
1 Automating Inquiry
1.1 A Thought Experiment
1.2 Active Learning
1.3 Scenarios for Active Learning
2 Uncertainty Sampling
2.1 Pushing the Boundaries
2.2 An Example
2.3 Measures of Uncertainty
2.4 Beyond Classification
2.5 Discussion
5.4 Discussion
6 Theory
6.1 A Unified View
6.2 A PAC Bound for Active Learning
6.3 Discussion
7 Practical Considerations
7.1 Which Algorithm is Best?
7.2 Real Labeling Costs
7.3 Alternative Query Types
7.4 Skewed Label Distributions
7.5 Unreliable Oracles
7.6 Multi-Task Active Learning
7.7 Data Reuse and the Unknown Model Class
7.8 Stopping Criteria
A Nomenclature Reference
Bibliography
Author’s Biography
Index
Preface
The presentation includes a mix of contrived, illustrative examples as
well as benchmark-style evaluations that compare and contrast various
algorithms on real data sets. However, I caution the reader not to take
any of these results at face value, as there are many factors at play when
choosing an active learning approach. It is my hope that this book does a
good job of pointing out these subtleties, and helps the reader
gain some intuition about which approaches are most appropriate for the
task at hand.
This active learning book is the synthesis of a previous literature
survey (Settles, 2009) with material from other lectures and talks I have
given on the subject. It is meant to be used as an introduction and
reference for researchers, or as a supplementary text for courses in
machine learning—supporting a week or two of lectures—rather than as
a textbook for a complete full-term course on active learning. (Despite
two decades of research, I am not sure that there is enough breadth or
depth of understanding to warrant a full-semester course dedicated to
active learning. At least not yet!) Here is a road map:
• Chapter 1 introduces the basic idea of, and motivations for, active
learning.
I have attempted to wade through the thicket of papers and distill active
learning approaches into core conceptual categories, characterizing their
strengths and weaknesses in both theory and practice. I hope you enjoy it
and find it useful in your work.
Supplementary materials, as well as a mailing list, links to video
lectures, software implementations, and other resources for active
learning can be found online at http://active-learning.net.
Burr Settles
May 2012
Acknowledgments
This book has a roundabout history, and there are a lot of people to thank
along the way. It grew out of an informal literature survey I wrote on
active learning (Settles, 2009) which in turn began as a chapter in my
PhD thesis. During that phase of my career I am indebted to my
committee, Mark Craven, Jude Shavlik, Xiaojin “Jerry” Zhu, David
Page, and Lewis Friedland, who encouraged me to expand on my review
and make it publicly available. There has been a lot of work in active
learning over the past two decades, from simple heuristics to complex
and crazy ideas coming from a variety of subfields in AI and statistics.
The survey was my attempt to curate, organize, and make sense of it for
myself; to help me understand how my work fit into the overall
landscape.
Thanks to John Langford, who mentioned the survey on his popular
machine learning blog1. As a result, many other people found it and
found it helpful as well. Several people encouraged me to write this
book. To that end, Jude Shavlik and Edith Law (independently)
introduced me to Michael Morgan. Thanks to Michael, William Cohen,
Tom Dietterich, and others at Morgan & Claypool for doing their best to
keep things on schedule, and for patiently encouraging me through the
process of expanding what was a literature review into more of a tutorial
or textbook. Thanks also to Tom Mitchell for his support and helpful
advice on how to organize and write a book.
Special thanks to Steve Hanneke and Sanjoy Dasgupta for the
detailed feedback on both the original survey and the expanded
manuscript. Chapter 6 is particularly indebted to their comments as well
as their research. I also found Dasgupta’s review of active learning from
a theoretical perspective (Dasgupta, 2010) quite helpful. The insights and
organization of ideas presented here are not wholly my own, but draw on
conversations I have had with numerous people. In addition to the names
mentioned above, I would like to thank Josh Attenberg, Jason Baldridge,
Carla Brodley, Aron Culotta, Pinar Donmez, Miroslav Dudík, Gregory
Druck, Jacob Eisenstein, Russ Greiner, Carlos Guestrin, Robbie Haertel,
Ashish Kapoor, Percy Liang, Andrew McCallum, Prem Melville, Clare
Monteleoni, Ray Mooney, Foster Provost, Soumya Ray, Eric Ringger,
Teddy Seidenfeld, Kevin Small, Partha Talukdar, Katrin Tomanek, Byron
Wallace, and other colleagues for turning me on to papers, ideas, and
perspectives that I might have otherwise overlooked. I am sure there are
other names I have forgotten to list here, but know that I appreciate all
the ongoing discussions on active learning (and machine learning in
general), both online and in person. Thanks also to Daniel Hsu, Eric
Baum, Nicholas Roy, and their coauthors (some listed above) for kindly
allowing me to reuse figures from their publications.
I would like to thank my parents for getting me started, and my wife
Natalie for keeping me going. She remained amazingly supportive
during my long hours of writing (and re-writing). Whenever I was
stumped or frustrated, she was quick to offer a fresh perspective: “Look
at you, you’re writing a book!” Lo and behold, I have written a book. I
hope you enjoy the book.
While writing this book, I was supported by the Defense Advanced
Research Projects Agency (under contracts FA8750-08-1-0009 and
AF8750-09-C-0179), the National Science Foundation (under grant IIS-
0968487), and Google. The text also includes material written while I
was supported by a grant from the National Human Genome Research
Institute (NHGRI). Any opinions, findings and conclusions, or
recommendations expressed in this material are mine and do not
necessarily reflect those of the sponsors.
CHAPTER 1
Automating Inquiry
“Computers are useless. They can only give you answers.”
— Pablo Picasso (attributed)
Figure 1.1: Several alien fruits, which vary in shape from round to irregular.
Almost immediately, physicians on the expedition notice that colonists who eat the
smooth fruits find them delicious, while those who eat the irregular fruits tend to fall ill.
Naturally, you want to be able to distinguish between which fruits are safe to eat and which
ones are harmful. If you can accurately “classify” these fruits, it will be much easier to keep
the colony healthy and well-fed while finding important alternative uses for the noxious
fruits (such as processing them into dyes and fuels). The physicians assure you that the
shape of a fruit is the only feature that seems related to its safety. The problem, though, is
that a wide variety of fruit shapes from these plants exist: almost a continuous range from
round to irregular. Since the colony has essential uses for both safe and noxious fruits, you
want to be able to classify them as accurately as possible.
Supposing we have a way to quantify the “irregularity” of a fruit’s shape, we can
formalize this classification task using a simple function. Let X be the set of data instances,
e.g., fruits that have been harvested, where each x ∊ R represents the irregularity measure of
a particular fruit. Each data instance has a hidden label y, which in this example comes from
the finite set Y = {safe, noxious}. Our classifier, then, is a function mapping h : X → Y,
parameterized by a threshold θ:

h(x; θ) = safe if x < θ, and noxious if x ≥ θ.

The remaining challenge is to figure out what this threshold really is.
One way to pick the threshold is by supervised learning techniques, where we estimate θ
from a set of labeled data instances L. Here, 〈x, y〉 ∊ L denotes a fruit x
which has been tested and assigned a label y, e.g., by having someone eat it and then
observing whether or not he or she becomes ill. A simple learning algorithm in this case
is to harvest a large number of fruits, arrange them in
order from round to irregular in shape, and have them all tested. The point at which the
fruits switch from being safe to noxious is the best threshold θ* (assuming for the moment
that everyone responds to a fruit in the same way). Figure 1.2 illustrates this method for a
handful of labeled instances.
Figure 1.2: Supervised learning for the alien fruits example. Given a set of 〈x, y〉 instance-
label pairs, we want to choose the threshold θ* that classifies them most accurately.
According to the PAC learning1 framework (Valiant, 1984), if the underlying data
distribution can be perfectly classified by some hypothesis h in the hypothesis class H (in
this case, the set of all values for θ), then it is enough to test O(1/∊) randomly selected
instances, where ∊ is the maximum desired error rate. This means that if we want to achieve
99% accuracy in our “fruit safety” classifier, we might need to harvest and test on the order
of hundreds of fruits using this approach. That is a lot of fruits to test, though, and many of
your family and friends might get sick in the process!
Can we do better? While we certainly want a threshold that classifies these fruits as
accurately as possible, we also want to discover this threshold as quickly and as
economically as possible, by requiring fewer tests while achieving the same results (ideally).
Note some key properties of this problem:
• the fruits themselves are plentiful, easily harvested, and easily measured;
• testing for a fruit’s safety is what mainly incurs a cost: the fruit must be consumed, and
the person who eats it risks falling ill.
Figure 1.3: A binary search through the set of ordered, untested alien fruits. By only testing
this subset of fruits, we can exponentially reduce costs while achieving the same result. The
labels shown in light blue can be inferred, and therefore do not need to be tested.
As it turns out, we can do better by performing a directed search through the set of fruit
shapes. Taking the two observations above into account, we can augment our supervised
learning strategy in the following way. First, let us gather an arbitrary number of unlabeled
instances U for free (or very inexpensively) by simply harvesting a lot of
fruits without testing them. Next, we can arrange these fruits in a sequence from round to
irregular. Our goal is similar to the previous one: to discover where in this ordered series of
fruits the transition from safe to noxious occurs, but by conducting as few tests as possible
this time. If we execute a simple binary search through this ordered set of fruits, we only
need to test ⌈log2 |U|⌉ items, instead of testing them all as we did before.
The process is illustrated in Figure 1.3. Given a set of unlabeled instances, this
sequential algorithm would first test x = 5, then x = 7, and finally x = 6 before arriving at the
exact same parameter value for θ*, alleviating the need to test the other six fruits (two of
which happen to be harmful). This algorithmic speedup means that a classifier with error ∊
or less can be found with a mere O(log2 1/∊) tests, since all the other labels can be inferred.
To put this in perspective, assume that 100 fruits were needed to obtain 99% accuracy in the
earlier supervised setting. Then we would still need to harvest 100 fruits for our new binary
search algorithm, but we would only need to test 6 or 7 of them to get the same guarantee.
This is an exponential reduction in the number of tests necessary, which drastically reduces
the number of people whose health is at risk, and helps the colony to make use of both safe
and noxious fruits as quickly as possible.
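To make the procedure concrete, here is a minimal Python sketch of the binary search strategy just described. The irregularity scores and the is_noxious oracle are hypothetical stand-ins for the harvested fruits and the taste test, and the sketch assumes (as in the example above) that the fruits are perfectly separable by a single threshold.

```python
import random

def binary_search_threshold(irregularity, is_noxious):
    """Find the safe/noxious threshold with roughly log2(|U|) oracle calls.

    irregularity: unlabeled irregularity scores (the pool U), in any order.
    is_noxious:   the labeling oracle (e.g., a taste test). Assumes fruits
                  below the true threshold are safe and the rest are noxious.
    """
    xs = sorted(irregularity)
    lo, hi = 0, len(xs)          # invariant: first noxious fruit lies in xs[lo:hi]
    queries = 0
    while lo < hi:
        mid = (lo + hi) // 2
        queries += 1             # one "test" = one query to the oracle
        if is_noxious(xs[mid]):
            hi = mid             # the transition is at or before mid
        else:
            lo = mid + 1         # the transition is after mid
    # the threshold sits just below the first noxious fruit (if any)
    theta = xs[lo] if lo < len(xs) else float("inf")
    return theta, queries

# Toy usage: 100 harvested fruits, true threshold at 0.6.
pool = [random.random() for _ in range(100)]
theta_star, n_tests = binary_search_threshold(pool, lambda x: x >= 0.6)
print(theta_star, n_tests)      # about 7 tests instead of 100
```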
• Classification and filtering. Learning to classify documents (e.g., articles or web pages)
or any other kind of media (e.g., image, audio, and video files) usually requires people
to annotate each item with particular labels, such as relevant or not-relevant.
Unlabeled instances abound from electronic resources like the Internet, but annotating
thousands of these instances can be tedious and even redundant.
• Speech recognition. Accurate labeling of speech utterances is extremely time consuming
and requires trained linguists. While unannotated audio data is easily obtained from
recording devices, Zhu and Goldberg (2009) have reported that annotation at the word
level can take ten times longer than the actual audio (e.g., one minute of speech takes
ten minutes to label), and annotating phonemes can take 400 times as long (e.g., nearly
seven hours). The problem is compounded for rare languages or dialects.
• Information extraction. Systems that extract factual knowledge from text must be
trained with detailed annotations of documents. Users highlight entities or relations of
interest, such as person and organization names, or whether a person works for a
particular organization. Locating entities and relations can take a half-hour or more for
even simple newswire stories (Settles et al., 2008a). Annotations for specific knowledge
domains may require additional expertise, e.g., annotating gene and disease mentions in
biomedical text usually requires PhD-level biologists.
In all of these examples, data collection (for traditional supervised learning methods) comes
with a hefty price tag, in terms of human effort and/or laboratory materials. If an active
learning system is allowed to be part of the data collection process—to be “curious” about
the task, if you will—the goal is to learn the task better with less training.
While the binary search strategy described in the previous section is a useful
introduction to active learning, it is not directly applicable to most problems. For example,
what if fruit safety is related not only to shape, but to size, color, and texture as well? Now
we have four features to describe the input space instead of just one, and the simple binary
search mechanism no longer works in these higher-dimensional spaces. Also consider that
the bodies of different people might respond slightly differently to the same fruit, which
introduces ambiguity or noise into the observations we use as labels. Most interesting real-
world applications, like the ones in the list above, involve learning with hundreds or
thousands of features (input dimensions), and the labels are often not 100% reliable.
The rest of this book is about the various ways we can apply the principles of active
learning to machine learning problems in general. We focus primarily on classification, but
touch on methods that apply to regression and structured prediction as well. Chapters 2–5
present, in detail, several query selection frameworks, or utility measures that can be used to
decide which query the learner should ask next. Chapter 6 presents a unified view of these
different query frameworks, and briefly touches on some theoretical guarantees for active
learning. Chapter 7 summarizes the strengths and weaknesses of different active learning
methods, as well as some practical considerations and a survey of more recent
developments, with an eye toward the future of the field.
might be generated. In some applications, instance labels come at little or no cost, such as
the “spam” flag you mark on unwanted email messages, or the five-star rating you might
give to films on a social networking website. Learning systems use these flags and ratings to
better filter your junk email and suggest new movies you might enjoy.
In cases like this, you probably have other incentives for providing these labels—like
keeping your inbox or online movie library organized—so you provide many such labels
“for free.” Deploying an active learning system to carefully select queries in these cases may
require significant engineering overhead, with little or no gains in predictive accuracy. Also,
when only a relatively small number (e.g., tens or hundreds) of labeled instances are needed
to train an accurate model, it may not be appropriate to use active learning. The expense of
implementing the query framework might be greater than merely collecting a handful of
labeled instances, which might be sufficient.
Active learning is most appropriate when the (unlabeled) data instances themselves are
numerous, can be easily collected or synthesized, and you anticipate having to label many of
them to train an accurate system. It is also generally assumed that the oracle answers queries
about instance labels, and that the appropriate hypothesis class for the problem is more or
less already decided upon (naive Bayes, decision trees, neural networks, etc.). These last two
assumptions do not always hold, but for now let us assume that queries take the form of
unlabeled instances, and that the hypothesis class is known and fixed2. Given that active
learning is appropriate, there are several different specific ways in which the learner may be
able to ask queries. The main scenarios that have been considered in the literature are (1)
query synthesis, (2) stream-based selective sampling, and (3) pool-based sampling.
Query Synthesis. One of the first active learning scenarios to be investigated is learning
with membership queries (Angluin, 1988). In this setting, the learner may request the label
of any unlabeled data instance in the input space, including queries that the
learner synthesizes de novo. The only assumption is that the learner has a definition of the
input space (i.e., the feature dimensions and ranges) available to it. Figure 1.4(a) illustrates
the active learning cycle for the query synthesis scenario. Query synthesis is often tractable
and efficient for finite problem domains (Angluin, 2001). The idea of synthesizing
queries has also been extended to regression learning tasks, such as learning to predict the
absolute coordinates of a robot hand given the joint angles of its mechanical arm as inputs
(Cohn et al., 1996). Here the robot decides which joint configuration to test next, and
executes a sequence of movements to reach that configuration, obtaining resulting
coordinates that can be used as a training signal.
Query synthesis is reasonable for some problems, but labeling such arbitrary instances
can be awkward and sometimes problematic. For example, Lang and Baum (1992)
employed membership query learning with human oracles to train a neural network to
classify handwritten characters. They encountered an unexpected problem: many of the
query images generated by the learner contained no recognizable symbols; these were
merely artificial hybrid characters with little or no natural semantic meaning. See Figure
1.4(b) for a few examples: is the image in the upper-right hand corner a 5, an 8, or a 9? It
stands to reason that this ambiguous image could help the learner discriminate among the
different characters, if people were able to discriminate among them as well. Similarly, one
could imagine query synthesis for natural language tasks creating streams of text or audio
speech that amount to gibberish. The problem is that the data-generating distribution is not
necessarily taken into account (and may not even be known), so an active learner runs the
risk of querying arbitrary instances devoid of meaning. The stream-based and pool-based
scenarios (described shortly) have been proposed to address these limitations.
Figure 1.4: (a) An active learner might synthesize query instances de novo. (b) Query
synthesis can result in awkward and uninterpretable queries, such as these images generated
by a neural network attempting to learn how to recognize handwritten digits. Source: Lang
and Baum (1992), reprinted with kind permission of the authors.
Nevertheless, King et al. (2004, 2009) found a promising real-world application of query
synthesis. They employ a “robot scientist” which executes autonomous biological
experiments to discover metabolic pathways in the yeast Saccharomyces cerevisiae. Here,
an instance corresponds to a mixture of chemical solutions that constitute a growth medium
crossed with a particular yeast mutant. A label, then, is whether or not the mutant thrived in
the growth medium. Experiments are chosen using an active learning approach based on
inductive logic programming, and physically performed using a laboratory robot. The active
approach results in a three-fold decrease in the cost of experimental materials compared to
naively running the least expensive experiment, and a 100-fold decrease in cost compared to
randomly chosen experiments. In this task, all query instances correspond to well-defined
and meaningful experiments for the robot (or even a human) to perform. In other situations,
arbitrary queries may be meaningless and nonsensical and thus difficult for an oracle to
make judgements about.
Figure 1.5: (a) In selective sampling, the learner decides whether to query or discard items
from a stream of input instances. (b) Pool-based active learning queries the most informative
instance from a large pool of available unlabeled data U.
The decision whether to query or discard an instance can be framed several ways. One
approach is to define a measure of utility or information content (several such measures are
discussed in Chapters 2–5) and make a biased random decision, such that instances with
higher utility are more likely to be queried (c.f., Dagan and Engelson, 1995). Another
approach is to compute an explicit region of uncertainty (Cohn et al., 1994), i.e., the part of
the instance space that is still ambiguous to the learner, and only query instances that fall
within it. A naive way of doing this is to set a minimum threshold on a utility measure
which defines the region. Instances whose evaluation is above this threshold are then
queried. Another somewhat more principled approach is to define the region that is still
unknown to the overall model class, i.e., to the set of hypotheses consistent with the current
labeled training set called the version space (Mitchell, 1982). In other words, if any two
models of the same model class (but different parameter settings) agree on all the labeled
data, but disagree on an unlabeled instance sampled from the data source, then that instance
lies within the region of uncertainty. Calculating this region completely and explicitly is
computationally expensive, however, and it must be maintained after each new query. As a
result, approximations are used in practice (more details in Chapter 3).
The stream-based scenario has been studied in several real-world tasks, including part-
of-speech tagging (Dagan and Engelson, 1995), sensor scheduling (Krishnamurthy, 2002),
and learning ranking functions for information retrieval (Yu, 2005). Fujii et al. (1998)
employed selective sampling for active learning in word sense disambiguation, e.g.,
determining if the word “bank” means land alongside a river or a financial institution in a
given context (except they study Japanese words in their work). The approach not only
reduces annotation effort, but also limits the size of the database used in nearest-neighbor
learning, which in turn expedites the classification algorithm.
It is worth noting that some authors (e.g., Moskovitch et al., 2007; Thompson et al.,
1999) use the term “selective sampling” to refer to the pool-based scenario described next.
Under this interpretation, the term merely signifies that queries are made with a selected
subset of instances sampled from a real data distribution. However, in most of the literature
selective sampling refers to the stream-based scenario described here.
unlabeled data U available. The approach is illustrated in Figure 1.5(b). Queries are selected
from the pool, which is usually assumed to be closed (i.e., static or non-changing), although
this is not strictly necessary. Queries are typically chosen in a greedy fashion, according to a
utility measure used to evaluate all instances in the pool (or, perhaps if U is very large, a
subsample thereof). The binary search algorithm for the alien fruits example in Section 1.1
is a pool-based active learning algorithm.
The pool-based scenario has been studied for many real-world problem domains in
machine learning, such as text classification (Hoi et al., 2006a; Lewis and Gale, 1994;
McCallum and Nigam, 1998; Tong and Koller, 2000), information extraction (Settles and
Craven, 2008; Thompson et al., 1999), image classification and retrieval (Tong and Chang,
2001; Zhang and Chen, 2002), video classification and retrieval (Hauptmann et al., 2006;
Yan et al., 2003), speech recognition (Tür et al., 2005), and cancer diagnosis (Liu, 2004), to
name only a few. In fact, pool-based sampling appears to be the most popular scenario for
applied research in active learning, whereas query synthesis and stream-based selective
sampling are more common in the theoretical literature.
The main difference between stream-based and pool-based active learning is that the
former obtains one instance at a time, sequentially from some streaming data source (or by
scanning through the data) and makes each query decision individually. Pool-based active
learning, on the other hand, evaluates and ranks the entire collection of unlabeled data
before selecting the best query. While the pool-based scenario appears to be much more
common among application papers, one can imagine settings where the stream-based
approach is more appropriate. For example, when memory or processing power is limited,
as with mobile and embedded devices, or when the data set is too large to load into memory
and must be scanned sequentially from disk. Unless otherwise noted, however, we will
assume a pool-based scenario for our discussion of the algorithms discussed in the
remainder of this book.
1 More detail on PAC learning and active learning will be discussed in Chapter 6.
CHAPTER 2
Uncertainty Sampling
“Information is the resolution of uncertainty.”
— Claude Shannon
Figure 2.1: The binary search from Figure 1.3, re-interpreted as an uncertainty sampling approach. The best
instance to query is deemed to be the one closest to the threshold θ.
2.2 AN EXAMPLE
To visualize the way in which uncertainty sampling generalizes to a noisy, two-dimensional classification
problem, consider Figure 2.2. Figure 2.2(a) shows a toy data set constructed from two Gaussians centered at
(-2,0) and (2,0) with standard deviation σ = 1. There are 400 instances total, 200 drawn from each class
distribution. In a real-world setting, these instances may be available but their labels would not. Figure 2.2(b)
illustrates the traditional supervised learning approach of randomly selecting instances for labeling. The line
shows the linear decision boundary of a logistic regression model (i.e., where the posterior label probability
equals 0.5) trained using 30 points. Notice that most of the labeled instances in this training set are far from
zero on the horizontal axis, which is where the Bayes optimal decision boundary should be. As a result, this
classifier only achieves 70% accuracy on the remaining unlabeled data. Figure 2.2(c) tells a different story,
however: the active learner uses uncertainty sampling to focus on the instances closest to its decision
boundary, assuming it can adequately explain the data in other parts of the input space. As a result, it avoids
requesting labels for redundant or irrelevant instances, and achieves 90% accuracy using the same budget of
30 labeled instances.
Figure 2.2: Uncertainty sampling with a toy data set. (a) 400 instances, evenly sampled from two class
Gaussians. Instances are represented as points in a 2D input space. (b) A logistic regression model trained with
30 labeled instances randomly drawn from the problem domain. The line represents the decision boundary of
the classifier. (c) A logistic regression model trained with 30 actively queried instances using uncertainty
sampling.
Least Confident. A basic uncertainty sampling strategy is to query the instance whose predicted output is the
least confident:

x*LC = argmaxx 1 − Pθ(ŷ|x),

where ŷ = argmaxy Pθ(y|x), the prediction with the highest posterior probability under the model θ. In other
words, this strategy prefers the instance whose most likely labeling is actually the least likely among the
unlabeled instances available for querying. One way to interpret this uncertainty measure is the expected 0/1-
loss, i.e., the model’s belief that it has mislabeled x. A drawback of this strategy is that it only considers
information about the best prediction. Thus, it effectively throws away information about the rest of the
posterior distribution.
Margin. Another strategy is to query the instance with the smallest margin between the two most likely labels:

x*M = argminx Pθ(ŷ1|x) − Pθ(ŷ2|x),

where ŷ1 and ŷ2 are the first and second most likely predictions under the model, respectively. Margin
sampling addresses a shortcoming of the least confident strategy by incorporating the second best labeling in
its assessment. Intuitively, instances with large margins are easy, since the learner has little doubt in how to
differentiate between the two most probable alternatives. Instances with small margins are more ambiguous,
though, and knowing the true labeling should help the model discriminate more effectively between them.
However, for problems with a very large number of alternatives, the margin approach still ignores much of the
output distribution.
Entropy. Perhaps the most general (and most common) uncertainty sampling strategy uses entropy (Shannon,
1948), usually denoted by H, as the utility measure:

x*H = argmaxx − Σy Pθ(y|x) log Pθ(y|x),

where y ranges over all possible labelings of x. Entropy is a measure of a variable’s average information
content. As such, it is often thought of as an uncertainty or impurity measure in machine learning. One
interpretation of this strategy is the expected log-loss, i.e., the expected number of bits it will take to “encode”
the model’s posterior label distribution.
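All three measures can be computed directly from a model’s posterior label distribution. The following sketch assumes only that the model supplies a matrix of class probabilities for the pool (one row per instance), e.g., via a predict_proba-style method; the function and variable names are illustrative rather than taken from any particular library.

```python
import numpy as np

def least_confident(probs):
    """1 - P(y_hat | x): the model's belief that its most likely label is wrong."""
    return 1.0 - probs.max(axis=1)

def margin(probs):
    """Difference between the two most probable labels (smaller = more ambiguous)."""
    top_two = np.sort(probs, axis=1)[:, -2:]
    return top_two[:, 1] - top_two[:, 0]

def entropy(probs):
    """Shannon entropy of the posterior label distribution."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def select_query(probs, strategy="entropy"):
    """Index of the pool instance to query next under the chosen measure."""
    if strategy == "least_confident":
        return int(np.argmax(least_confident(probs)))
    if strategy == "margin":
        return int(np.argmin(margin(probs)))   # smallest margin is most uncertain
    return int(np.argmax(entropy(probs)))

# probs would be an (n_instances x n_labels) array, e.g., model.predict_proba(U).
probs = np.array([[0.1, 0.9], [0.5, 0.5], [0.3, 0.7]])
print(select_query(probs, "margin"))   # -> 1, the most ambiguous instance
```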
Figure 2.4: Illustrations of various uncertainty measures. (a–c) Binary classification tasks. Each plot shows
the utility score as a function of Pθ(⊕|x), which is the posterior probability of the positive class. (d–f) Ternary
classification tasks (three labels). Heatmap corners represent posterior distributions where one label is very
likely, with the opposite edge plotting the probability range between the other two classes (when that label has
low probability). The center of each heatmap is the uniform distribution.
Figure 2.4 visualizes the implicit relationship among these uncertainty measures. For binary classification
(top row), all three strategies are monotonic functions of one another. They are symmetric with a peak about
Pθ(⊕|x) = 0.5. In effect, they all reduce to querying the instance that is closest to the decision boundary. For a
three-label classification task (bottom row), the relationship begins to change. For all three measures, the most
informative instance lies at the center of the triangular simplex, because this represents where the posterior
label distribution is most uniform (and therefore most uncertain under the model). Similarly, the least
informative instances are at the three corners, where one of the classes has extremely high probability (and
thus little model uncertainty). The main differences lie in the rest of the probability space. For example, the
entropy measure does not particularly favor instances for which only one of the labels is highly unlikely (i.e.,
along the outer side edges of the simplex), because the model is fairly certain that it is not the true label. The
least confident and margin measures, on the other hand, consider such instances to be useful if the model has
trouble distinguishing between the remaining two classes (i.e., at the midpoint of an outer edge). Intuitively,
entropy seems appropriate if the objective function is to minimize log-loss, while the other two (particularly
margin) are more appropriate if we aim to reduce classification error, since they prefer instances that would
help the model better discriminate among specific classes.
Figure 2.5: An information extraction example viewed as a sequence labeling task. (a) A sample input
sequence x and corresponding label sequence y. (b) A probabilistic sequence model represented as a finite
state machine, illustrating the path of 〈x, y〉 through the model.
Figure 2.5(a) presents an example 〈x, y〉 pair. The labels indicate whether a given word belongs to an entity
class of interest (org and loc in this case, for “organization” and “location,” respectively) or null otherwise.
Unlike simple classification, x is not represented by a single feature vector, but rather a sequence of feature
vectors: one for each token (i.e., word). One approach is to treat each token as an instance, and train a
classifier that scans through the input sequence, assigning output labels to tokens independently. However, the
word “Madison,” devoid of context, might refer to a location (city), an organization (university), or even a
person. For tasks such as this, sequence models based on probabilistic finite state machines, such as hidden
Markov models or linear-chain conditional random fields, are considered the state of the art. An example
sequence model is shown in Figure 2.5(b). Such models can produce a probability distribution for every
possible label sequence y, the number of which can grow exponentially in the sequence length T.
Fortunately, uncertainty sampling generalizes fairly easily to probabilistic structured prediction models.
For example, the least confident strategy is popular for information extraction using sequences (Culotta and
McCallum, 2005; Settles and Craven, 2008), because the most likely output sequence ŷ and the associated
Pθ(ŷ|x) can be efficiently computed with dynamic programming. Selecting the best query is generally no more
complicated or expensive than the standard inference procedure. The Viterbi algorithm (Corman et al., 1992)
requires O(TM) time, for example, where T is the sequence length and M is the number of label states. It is
often possible to perform “N-best” inference using a beam search as well (Schwartz and Chow, 1990), which
finds the N most likely output structures under the model. This makes it simple to compute the necessary
probabilities for ŷ1 and ŷ2 in the margin strategy, and comes at little extra computational expense: the
complexity is O(TMN) for sequences, which for N = 2 merely doubles the runtime compared to the least
confident strategy. Dynamic programs have also been developed to compute the entropy over all possible
sequences (Mann and McCallum, 2007) or trees (Hwa, 2004), although this approach is significantly more
expensive. The fastest entropy algorithm for sequence models requires O(TM²) time, which can be very slow
when the number of label states is large. Furthermore, some structured models are so complex that they
require approximate inference techniques, such as loopy belief propagation or Markov chain Monte Carlo
(Koller and Friedman, 2009). In such cases, the least confident strategy is still straightforward since only the
“best” prediction needs to be evaluated. However, the margin and entropy heuristics cease to be tractable and
exact for these more complex models.
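To illustrate why the least confident strategy stays cheap for sequence models, here is a rough sketch that scores an input sequence under a toy HMM-style model: a Viterbi-like (max) recursion yields the probability of the single best label sequence, a forward (sum) recursion yields the normalizer, and their ratio gives Pθ(ŷ|x). The start, transition, and emission tables are hypothetical, and a real implementation would work in log space for numerical stability.

```python
import numpy as np

def least_confident_score(x, start, trans, emit):
    """1 - P(y_hat | x) for a toy HMM with M label states.

    x:     observation sequence (list of symbol indices), length T
    start: (M,) initial state probabilities
    trans: (M, M) transition probabilities, trans[i, j] = P(y_t = j | y_{t-1} = i)
    emit:  (M, V) emission probabilities,  emit[j, k] = P(x_t = k | y_t = j)
    """
    v = start * emit[:, x[0]]      # Viterbi scores (best path into each state)
    a = start * emit[:, x[0]]      # forward scores (sum over paths into each state)
    for obs in x[1:]:
        v = (v[:, None] * trans).max(axis=0) * emit[:, obs]
        a = (a[:, None] * trans).sum(axis=0) * emit[:, obs]
    best_joint = v.max()           # P(x, y_hat)
    marginal = a.sum()             # P(x)
    return 1.0 - best_joint / marginal

# Toy usage: score each pool sequence and query the highest-scoring one.
start = np.array([0.6, 0.4])
trans = np.array([[0.7, 0.3], [0.4, 0.6]])
emit = np.array([[0.5, 0.5], [0.1, 0.9]])
pool = [[0, 1, 1], [1, 1, 1, 0], [0, 0]]
scores = [least_confident_score(x, start, trans, emit) for x in pool]
print(int(np.argmax(scores)))
```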
So far we have only discussed problems with discrete outputs—classification and structured prediction.
Uncertainty sampling is also applicable to regression, i.e., learning tasks with continuous output variables. In
this setting, the learner can simply query the unlabeled instance for which the learner has the highest output
variance in its prediction. Under a Gaussian assumption, the entropy of a random variable is a monotonic
function of its variance, so this approach is much in the same spirit as entropy-based uncertainty sampling.
Another interpretation of variance is the expected squared-loss of the model’s prediction. Closed-form
estimates of variance can be computed for a variety of model classes, although they can require complex and
expensive computations.
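One simple, model-agnostic approximation (foreshadowing the ensemble ideas of Section 3.4) is to train a small bootstrap ensemble of regressors and treat the spread of their predictions as the output variance. The sketch below uses polynomial regressors purely for illustration; the target function and data are made up.

```python
import numpy as np

def query_by_variance(X_lab, y_lab, X_pool, n_models=20, degree=2, seed=0):
    """Query the pool point where a bootstrap ensemble of regressors disagrees most."""
    rng = np.random.default_rng(seed)
    preds = np.empty((n_models, len(X_pool)))
    for m in range(n_models):
        idx = rng.integers(0, len(X_lab), size=len(X_lab))   # resample with replacement
        coeffs = np.polyfit(X_lab[idx], y_lab[idx], deg=degree)
        preds[m] = np.polyval(coeffs, X_pool)
    variance = preds.var(axis=0)          # ensemble spread approximates output variance
    return int(np.argmax(variance)), variance

# Toy usage: a Gaussian bump target, a handful of labels, a dense unlabeled pool.
target = lambda x: np.exp(-x ** 2)
X_lab = np.array([-8.0, -5.0, -2.0, 3.0, 6.0, 9.0])
y_lab = target(X_lab)
X_pool = np.linspace(-10, 10, 201)
best, _ = query_by_variance(X_lab, y_lab, X_pool)
print(X_pool[best])                       # region of highest predictive variance
```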
Figure 2.6 illustrates variance-based uncertainty sampling using an artificial neural network. The target
function is a Gaussian in the range [-10,10], shown by the solid red line in the top row of plots. The network in
this example has one hidden layer of three logistic units, and one linear output unit. The network is initially
trained with two labeled instances drawn at random (upper left plot), and its variance estimate (lower left plot)
is used to select the first query. This process repeats for a few iterations, and after three queries the network
can approximate the target function fairly well. In general, though, estimating the output variance is nontrivial
and will depend on the type of model being used. This is in contrast to the utility measures in Section 2.3 for
discrete outputs, which only require that the learner produce probability estimates. In Section 3.4, we will
discuss active learning approaches using ensembles as a simpler way to estimate output variance. Active
learning for regression has a long history in the statistics literature, generally referred to as optimal
experimental design (Federov, 1972). However, the statistics community generally eschews uncertainty
sampling in favor of more sophisticated strategies, which we will explore further in Chapter 4.
Figure 2.6: Variance-based uncertainty sampling for a toy 1D regression task. Each column represents an
iteration of active learning. In the top row, solid lines show the target function to be learned, while dashed
lines show a neural network approximation based on available training data (black dots). The bottom row plots
the network’s output variance across the input range, which is used to select the query for the next iteration.
2.5 DISCUSSION
Uncertainty sampling is possibly the most popular active learning strategy in practice. Perhaps this is because
of its intuitive appeal combined with the ease of implementation. Most of the common uncertainty-based
utility measures do not require significant engineering overhead to use. In fact, as long as the learner can
provide a confidence or probability score along with its predictions, any of the measures in Section 2.3 can be
employed with the learner as a “black box.” Standard classification or inference procedures can be used,
leaving the choice of learning algorithm fairly modular. This is not necessarily the case for all active learning
approaches. Furthermore, if inference is fast and tractable, then querying should also be fast and tractable.
Our discussion of uncertainty sampling has, thus far, been limited to the pool-based setting where the
learner selects the “best” query from the set of unlabeled data U. Uncertainty sampling can also be employed
in the stream-based selective sampling setting, where unlabeled instances are drawn x ∽ DX one at a time from
an input distribution, and the learner decides on the spot whether to query or discard it. The simplest way to
implement uncertainty-based selective sampling is to set a threshold on the uncertainty measure and use this to
define a region of uncertainty. An instance is queried if it falls within the region of uncertainty, and discarded
otherwise. The learner is re-trained after each new query instance (or batch of instances) is labeled and added
to L.
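A minimal sketch of that selective sampling loop, assuming a classifier with scikit-learn-style fit and predict_proba methods, and hypothetical stream and oracle callables:

```python
import numpy as np

def selective_sampling(model, stream, oracle, X_seed, y_seed, threshold=0.9, budget=100):
    """Stream-based uncertainty sampling with a fixed confidence threshold.

    An incoming instance is queried only if the model's top posterior probability
    falls below `threshold` (i.e., it lies inside the region of uncertainty);
    otherwise it is discarded. The model is retrained after every new label
    until the query budget is spent.
    """
    X_lab, y_lab = list(X_seed), list(y_seed)
    model.fit(np.array(X_lab), np.array(y_lab))
    queried = 0
    for x in stream:                                   # instances arrive one at a time
        if queried >= budget:
            break
        if model.predict_proba([x]).max() < threshold:
            X_lab.append(x)
            y_lab.append(oracle(x))                    # pay for a label
            queried += 1
            model.fit(np.array(X_lab), np.array(y_lab))
        # instances outside the region of uncertainty are discarded unlabeled
    return model
```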
Figure 2.7 illustrates the idea of selective sampling using a neural network for a toy classification task.
Positive instances lie inside the black box, and the network in this example has one hidden layer of eight units.
If we obtain labels for the first 100 instances drawn randomly from the 2D plane (i.e., passive learning), then
the network tends to learn a nebulous, vaguely square-like shape like the one in Figure 2.7(b). Instead, we can
define a region of uncertainty as any point on the plane where the network confidence is below some threshold
Pθ(ŷ|x) < 0.9. By limiting queries to the streaming instances within the region of uncertainty, the network
learns the square shape much faster; Figure 2.7(c) shows a progression of the learned function at various
intervals. The active network in this example discarded more than 1,000 “uninformative” instances before
obtaining a training set of size 100 and learning the square function pictured above.
Figure 2.7: Stream-based uncertainty sampling for a simple toy classification task. (a) Positive instances lie
inside the black box in 2D. (b) After 100 random samples, the function learned by a neural network is still
somewhat amorphous. (c) Uncertainty-based selective sampling at 20, 60, and 100 queries. The highlighted
areas represent the region of uncertainty, which gradually shrinks and becomes more focused as the network
grows more confident. The output of the resulting network after 100 queries is much more square-like than
(b).
However, uncertainty sampling is not without problems. For one thing, the speed and simplicity of the
approach comes at the expense of being shortsighted, as the utility scores are based on the output of a single
hypothesis. To make matters worse, that hypothesis is often trained with very little data—given that a major
motivating use case for active learning is when little to no labeled data exist—and that data is also inherently
biased by the active sampling strategy. Thus, the learner’s grasp of the problem is probably fairly myopic1.
Figure 2.8 makes this point a bit more concrete. Here, the function to be learned is a pair of triangles in 2D,
and the learner is again a two-layer neural network with eight hidden units. If we are unlucky, the small
sample in Figure 2.8(b) used to train the initial network does not contain many instances from the white (i.e.,
negative) stripe in between the two triangles. As a result, the network over-generalizes and becomes quite
confident that regions in the center of the plane are positive. As active learning progresses, the network
continues to discard any instances that come from between the two triangles, and starts to learn a shape that is
closer to a square-like “convex hull” of the two triangles instead. Even worse, one can imagine that the initial
sample only contained positive instances from one of the two triangles, and the learner avoids querying in the
region of the other triangle because it believes those instances are negative. While these may be pathological
examples, the concerns are real, and uncertainty sampling has been observed to perform worse than random
sampling in some applied settings (Schütze et al., 2006; Tomanek et al., 2009; Wallace et al., 2010a). The
query strategy frameworks in the remainder of this book are aimed at alleviating or circumventing these
issues.
Figure 2.8: An example of uncertainty sampling failure. (a) Positive instances lie inside the two black
triangles. (b) An initial random sample fails to draw many training instances from the negative space in
between the triangles. (c) The trained network becomes overly confident that instances in the center are
positive. As a result, it avoids that region and begins to learn a different, more square-like shape.
1 Sampling bias is an inherent problem in nearly every active learning strategy, not just uncertainty sampling. Section 7.7 discusses some of the practical dangers of sampling bias beyond the myopia discussed here.
CHAPTER 3
Figure 3.1: The binary search from Figure 1.3, re-interpreted as a search through the version space. The
shaded box labeled V at each step represents the version space (the set of possible hypotheses which are
consistent with the labeled instances). By posing queries that bisect V, we efficiently search for good
hypotheses by quickly eliminating the inconsistent ones.
3.2 UNCERTAINTY SAMPLING AS VERSION SPACE SEARCH
We want to conduct a directed search through the hypothesis space by testing unlabeled instances, in order to
minimize the number of “legal” hypotheses under consideration (V). Conceptually, this approach is in contrast
to uncertainty sampling (Chapter 2), which makes query decisions based on the confidence of a single
hypothesis. In special cases, however, it turns out that uncertainty sampling represents an attempt to
incrementally reduce the size of the version space. After all, a binary search through a 1D input space is
justified under both interpretations (Figures 2.1 and 3.1). This is because we used a so-called max-margin
classifier, which positions itself in the center, halfway between positive and negative instances. Now let us
consider support vector machines (SVMs) for binary classification in more complex problem representations,
which have been the subject of much theoretical and empirical research in recent years. In its simplest form,
an SVM is a (K − 1)-dimensional hyperplane in a K-dimensional feature space F. The hypothesis space H
includes all such hyperplanes, and the version space V includes those which can correctly separate the training
data in feature space. Interestingly, there exists a duality between the feature space F and the hypothesis space
H: points in F correspond to hyperplanes (or boundaries) in H and vice versa1. Therefore, labeled data
instances impose geometric constraints on V in the hypothesis space. This is easy to see in 1D (e.g., Figure
3.1): consistent hypotheses must lie between the two nearest labeled instances.
Figure 3.2: The version space duality. (a) Instance points in feature space F correspond to hyperplanes in
hypothesis space H, and vice versa. (b) SVMs attempt to maximize the margin (distance) between the decision
boundary and the closest support vectors in F, corresponding to a hypothesis at the center of the maximal
bounded hypersphere in H. (c) Querying the instance closest to the decision boundary in F approximately
bisects the version space in H.
Figure 3.2 illustrates the so-called version space duality in 2D, and how it can be useful for active learning
with SVMs. The top row depicts F: data instances are points while hypotheses are lines in feature space. The
bottom row is meant to depict the same information in H: data instances are lines while hypotheses are points
in the hypothesis space. The highlighted regions in Figure 3.2(a) show the region of disagreement among
hypotheses in the version space V ⊆ H. SVMs are trained under the notion that the best separating hyperplane
is the one that represents the largest separation, or margin, between the two classes, since larger margins
commonly result in lower generalization error. Data instances on one side of the plane are classified as
positive, while points on the other side are negative, and the training instances closest to the hyperplane (i.e.,
that lie right on the margin) are called support vectors. Figure 3.2(b) represents the max-margin hyperplane as
a solid black line in F, and the margin boundaries as dashed grey lines. In H, this hypothesis corresponds to
the center of the largest hypersphere bounded by the labeled instance vectors. In other words, the best SVM
hypothesis is the one that lies approximately in the center of the version space.
Uncertainty sampling involves querying the unlabeled instance that is closest to the decision boundary in F
(in the case of binary classification, anyway). Since the SVM decision boundary should lie near the center of
the version space V, this heuristic corresponds to querying the instance vector in H that runs nearest to the
center of V, roughly bisecting it. The analogy is shown in Figure 3.2(c). By obtaining a label for this least
confident instance vector, |V| should be cut approximately in half and the bounded hypersphere governing the
decision boundary will be further constrained (regardless of its label), which is the desired outcome. There are
several machine learning methods besides SVMs that attempt to maximize the margin, such as variants of
boosting (Warmuth et al., 2008) and the voted perceptron (Freund and Schapire, 1999). Conceptually,
uncertainty sampling is equally principled as a version-space active learning approach for these algorithms.
Furthermore, margin-based active learning has been demonstrated to work well for several practical
applications of large-margin classifiers (Jain et al., 2010; Schohn and Cohn, 2000; Tong and Koller, 2000).
However, there is still reason to be cautious about uncertainty sampling. For one thing, margin-based
uncertainty sampling will only bisect the version space if it is symmetrical. In Figure 3.2(c), V is square-like
and therefore approximately symmetrical. After labeling the query instance, however, it is no longer a nearly
regular-shaped polygon. As a result, the largest bounded hypersphere in the revised version space will be far
away from the center of V, and the active learner might waste time winnowing away only a few fringe
hypotheses at a time. To address these limitations, Tong and Koller (2000) proposed alternative algorithms for
SVMs that try to anticipate the expected reduction in |V| in a decision-theoretic style analysis2. Finally,
uncertainty sampling is commonly used with classifiers that do not attempt to maximize the margin. The
interpretation of uncertainty sampling as a directed search through the version space does not necessarily carry
over to these other methods, and there is no theoretical argument that it will work well.
Figure 3.3: The query by disagreement algorithm (QBD).
There are a few practical shortcomings of QBD as it is presented here. For one thing, the size of the
version space may be infinite. This means that, in practice, the version space of an arbitrary hypothesis class
cannot be explicitly represented in memory. In some special cases, it may be possible to bound V analytically,
but not always. One solution to this problem is to store the version space implicitly instead. For example,
instead of exhaustively comparing all pairs of hypotheses in line 4 of Figure 3.3, we could consider two
speculative hypotheses:

h⊕ = train(L ∪ {〈x, ⊕〉})   and   h⊖ = train(L ∪ {〈x, ⊖〉}),

where ⊕ and ⊖ denote positive and negative labels, and train(·) is the learning algorithm we intend to use in
active learning: it takes a labeled data set and returns a classifier consistent with the data (if one exists). The
rest of the algorithm remains unchanged. If these two hypotheses disagree on how to label x, then it is queried,
otherwise it is ignored. Note that for more than two classes, we would need to build a speculative classifier for
each label, and then compare all of their predictions. However, if the training procedure is computationally
expensive, this can lead to a drastic slowing down of the algorithm.
A different approach might be to exploit a partial ordering on the “generality” of hypotheses in V. Consider,
for example, a family of axis-parallel box classifiers.
The smallest box, tightly bound around positive instances, is a very conservative hypothesis, labeling very few
instances as positive. The other boxes are more liberal with the input regions they will classify as positive.
Imagine that we have two subsets S, G ⊆ V, where S contains the set of all “most specific” hypotheses, and G
contains all the “most general” hypotheses in the version space (there may be multiple general hypotheses
which are not necessarily comparable). Now we can use the QBD algorithm from Figure 3.3, where h1 ∊ S and
h2 ∊ G in line 4, leaving the rest unchanged. This alleviates the problem of maintaining the entire version space
since DIS(V) can be accurately summarized with these two sets. If a query falls within DIS(V) and is positive,
then hypotheses in S must be made more general; likewise if it proves negative, those in G are made more
specific. Unfortunately, maintenance of the S and G sets is still prohibitively expensive in general, as Haussler
(1994) pointed out, they can grow exponentially in the size of L. Another problem is that the data may not
be perfectly separable after all (i.e., there is label noise), in which case the version space technically does not
even exist.
However, this does bring to mind an approach that maintains only two hypotheses, hS and hG, which are
roughly the most specific and most general hypotheses in V, respectively. That is, they are very conservative
and liberal hypotheses that aim to be consistent with L. The region of disagreement can be approximated by
the instances for which these two extremes disagree, DIS(V) ≈ {x ∊ DX : hS(x) ≠ hG(x)}. A simple way to
encourage this specific/general behavior is to sample a set of “background” data points B ∽ DX, and add them
to L with artificial labels. To train the hS model, these artificial data are given the ⊖ label, thus making it
conservative in its predictions of the unlabeled data. Similarly, they are given the ⊕ label when training the hG
model, making it much more willing to label the unknown regions positive.
Figure 3.4 compares the SG-based QBD approach to random sampling and uncertainty sampling for the
simple two triangles learning task from Section 2.5. In addition to the initial random sample of 20 instances
shown in Figure 3.4(b), 200 background instances are added when training the hS and hG networks. These
background labels are also weighted by 1/200 so as not to overwhelm the actual labeled data in L. Figure
3.4(f) shows how these two networks interact with one another as active learning progresses up to 100 labeled
instances. The highlighted areas represent DIS(V), while black and white regions are where the networks agree
on positive and negative labels, respectively. Figure 3.4(g) shows the analogous sequence using uncertainty
sampling for a single best MAP hypothesis (using only L as training data). Notice that, since the initial sample
does not contain many instances in the center of the image, uncertainty sampling avoids that area until very
late. In contrast, SG-based QBD discovers that there are two triangle-like regions very early on. Figures 3.4(c–
e) show the final networks trained on labeled data sets of size 100. While uncertainty sampling does a better
job than random sampling at discovering the two different objects, QBD does a much better job at modeling
them as triangles.
Figure 3.4: Selective sampling for learning the toy two triangles task. (a) The target concept. (b) An initial
random sample of 20 labeled instances. (c) The output of a neural network after 100 random samples. (d) A
network after 100 instances queried using uncertainty sampling (see Section 2.5). (e) A network after 100
instances using the SG-based QBD approach. (f) The regions of disagreement at 20, 40, 60, 80, and 100
instances queried using QBD. (g) The analogous regions of uncertainty using uncertainty sampling.
The notion of disagreement used by QBD has two noteworthy properties:
• it is measured among all hypotheses h ∊ V (or two approximate extremes hS and hG);
• it is a binary measure, i.e., no controversial instance matters more than any other.
It might be helpful to relax these assumptions. First, it can be problematic to measure disagreement among all
hypotheses in the version space, even if we imperfectly approximate it with two extremes hS and hG. Even
worse, perhaps the data are noisy and V is not even properly defined. Second, we would like to use
disagreement-based heuristics in a pool-based setting, i.e., when we want to query the “most informative”
instance x ∊ U.
The query by committee (QBC) algorithm relaxes these two assumptions to make disagreement-based
active learning more broadly applicable. Technically speaking, the original formulation of QBC (Seung et al.,
1992) and its subsequent analysis (Freund et al., 1997) sampled two random hypotheses from the version
space at each iteration and used a binary notion of disagreement. Over the years, however, QBC has come to
refer to any disagreement-based approach that uses a “committee”—or ensemble—of hypotheses which we
will denote C. These can take a variety of different forms in practice. All one needs is a method for obtaining
hypotheses in the committee, and a heuristic for measuring disagreement among them.
If the data are noise-free and the version space is well defined, it is possible to randomly sample
hypotheses from V with certain learning algorithms, such as the perceptron (Freund et al., 1997) and winnow
(Liere and Tadepalli, 1997). However, this sampling strategy can be computationally
inefficient, and it does not work for noisy data anyway. There are several alternatives that get around these
issues. One could take a Bayesian approach by sampling hypotheses from a posterior distribution P(θ|L). For
example, McCallum and Nigam (1998) did this for naive Bayes by using the Dirichlet distribution over model
parameters, whereas Dagan and Engelson (1995) sampled hidden Markov models by using the Normal
distribution. It is also quite natural to think about using generic ensemble learning algorithms to construct a
committee. For example, query by boosting uses a boosting algorithm (Freund and Schapire, 1997) to
construct a sequence of hypotheses that gradually become more and more specialized, by increasingly focusing
them on the erroneous instances in L. Another approach, query by bagging, employs the bagging algorithm
(Breiman, 1996) to train a committee of hypotheses on bootstrap resamples of L (drawn with replacement), which is meant
to smooth out high-variance predictions. These ensemble-based committee methods were proposed by Abe
and Mamitsuka (1998) and have been used quite extensively since. Melville and Mooney (2004) proposed a
different ensemble-based method that explicitly encourages diversity among the committee members. Muslea
et al. (2000) constructed a committee of hypotheses by partitioning the feature space into conditionally
independent committee-specific subsets, to be used in conjunction with semi-supervised learning (see Section
5.3). Regarding the number of committee members to use, there is no general rule of thumb. It appears from
the literature that committees of five to fifteen hypotheses are quite common. However, even small committee
sizes (e.g., two or three hypotheses) have been shown to work well.
There are also a variety of heuristics for measuring disagreement in classification tasks, but we will focus
on two dominant trends. The first is essentially a committee-based generalization of uncertainty measures. For
example, vote entropy is defined as:
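Writing P_C(y|x) = (1/|C|) Σ_{θ∊C} P_θ(y|x) for the committee's consensus distribution (under hard voting, P_C(y|x) is simply the fraction of members that vote for y), vote entropy takes its standard form:

\[ x^*_{VE} \;=\; \operatorname*{argmax}_{x} \; -\sum_{y} P_C(y \mid x)\,\log P_C(y \mid x) \tag{3.2} \]

The second dominant measure is Kullback-Leibler (KL) divergence to the consensus, which scores how far, on average, each member's posterior diverges from the committee as a whole:

\[ x^*_{KL} \;=\; \operatorname*{argmax}_{x} \; \frac{1}{|C|}\sum_{\theta \in C} \mathrm{KL}\big( P_\theta(Y \mid x)\,\|\,P_C(Y \mid x) \big), \qquad \mathrm{KL}(P_\theta \| P_C) = \sum_y P_\theta(y \mid x)\,\log\frac{P_\theta(y \mid x)}{P_C(y \mid x)} \tag{3.3} \]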
Figure 3.5: Examples of committee and consensus distributions. Pθ(i) refers to the output distribution of the ith
hypothesis, and PC represents the consensus across all committee members.
There is a key conceptual difference in how vote entropy and KL divergence quantify disagreement.
Consider the probability distributions in Figure 3.5(a). The output distributions for individual hypotheses in
this example are relatively uniform, so the consensus output distribution is also uniform. Figure 3.5(b) paints a
slightly different picture: the individual hypothesis distributions are non-uniform and each prefers a different
label. But the consensus ends up being fairly uniform in this case as well. Vote entropy (3.2) only considers PC
and thus cannot distinguish between the two, since they both have high entropy in the consensus. In a way,
then, vote entropy is like a Bayesian generalization of entropy-based uncertainty sampling, relying not on a
single point estimate of model parameters, but aggregating uncertainties across a set of hypotheses. However,
querying an instance with predictions like Figure 3.5(a) is not really in the disagreement-based spirit of QBC.
The consensus label is uncertain, but this is because all committee members agree that it is uncertain. KL
divergence (3.3), on the other hand, would favor an instance with predictions like Figure 3.5(b): the consensus
here is uncertain, but the individual committee members vary wildly in their predictions, which is more in line
with the notion of disagreement.
To make these intuitions a bit more concrete, consider active learning starting from the labeled data shown
in Figure 3.6(a). Uncertainty sampling would train a single classifier on these data instances, and favor
unlabeled instances that fall in less confident regions of the input space according to that single hypothesis.
Figure 3.6(b) shows a heatmap of an entropy-based uncertainty sampling approach (2.3) using a single
multinomial logistic regression classifier (also known as a maximum entropy classifier in the literature).
Darker regions indicate higher utility according to the heuristic. Here we see that this single model is very
confident in most of the input space, and as a result will prefer to only query instances that fall in a narrow
region right along the decision boundary. Figure 3.6(c) shows the analogous heatmap for QBC using soft vote
entropy (3.2), with a committee of ten classifiers obtained by the bagging algorithm (Breiman, 1996). The
ensemble’s preferences still take on a sort of ‘Y’ shape, but it is clearly smoother and less overly-confident
about most of the input space. Figure 3.6(d) shows a heatmap for the KL divergence heuristic (3.3) using the
exact same committee of ten classifiers. Even though the task, training data, and classifiers are all the same,
the two QBC heatmaps look very different: the KL divergence measure determines that the two leftmost
decision boundaries, while uncertain, are not as contentious as the completely unknown region spanning from
the center to the upper-right. In other words, the leftmost decision boundaries resemble posterior distributions
like those in Figure 3.5(a)—uncertain but in agreement—whereas the upper-right boundary is more similar to
Figure 3.5(b)—uncertain and in disagreement.
Figure 3.6: Visualization of different active learning heuristics. (a) A small training set with three classes in
2D. (b) A heatmap of the input space according to entropy-based uncertainty sampling. Darker regions are
deemed more informative by the MAP-estimate logistic regression classifier. (c) Soft vote entropy using a
committee of ten logistic regression classifiers obtained by bagging. (d) KL divergence using the exact same
committee of ten classifiers.
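As a concrete illustration of the two heuristics compared above, the sketch below scores a pool of unlabeled instances with both soft vote entropy and average KL divergence to the consensus, using a bagged committee of logistic regression classifiers from scikit-learn. The function name, committee size, and choice of base classifier are illustrative assumptions, not the exact setup behind Figure 3.6.

```python
import numpy as np
from sklearn.ensemble import BaggingClassifier
from sklearn.linear_model import LogisticRegression

def committee_disagreement(X_lab, y_lab, X_pool, n_members=10):
    """Score pool instances by soft vote entropy and by mean KL divergence to
    the consensus, using a bagged committee of logistic regression classifiers."""
    committee = BaggingClassifier(LogisticRegression(max_iter=1000),
                                  n_estimators=n_members).fit(X_lab, y_lab)
    # Per-member posteriors, shape (|C|, |pool|, |labels|); this assumes every
    # bootstrap resample happens to contain all of the class labels.
    member_probs = np.array([m.predict_proba(X_pool) for m in committee.estimators_])
    consensus = member_probs.mean(axis=0)                      # P_C(y|x)
    eps = 1e-12
    vote_entropy = -(consensus * np.log(consensus + eps)).sum(axis=1)
    kl_to_consensus = (member_probs *
                       np.log((member_probs + eps) / (consensus + eps))).sum(axis=2)
    return vote_entropy, kl_to_consensus.mean(axis=0)
```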
Aside from these two dominant themes, a few other disagreement measures for QBC have been described
in the literature for classification. One example is Jensen-Shannon divergence (Melville et al., 2005), which is
essentially a symmetric and smoothed version of KL divergence. Ngai and Yarowsky (2000) also proposed an
interesting measure called the F-complement, which uses the well-known F1-measure from the information
retrieval literature (Manning and Schütze, 1999) to quantify (dis)agreement among committee members.
QBC can also be employed in regression settings, i.e., by measuring disagreement as the variance among
committee members’ predictions. The idea is illustrated in Figure 3.7 for learning a Gaussian target function
(solid red line) in the range [-10,10] using bagged regression trees (dotted blue line). The ensemble of trees is
initially trained with two labeled instances drawn at random (upper left plot), and the variance estimate (lower
left plot) is used to select the first query. This process repeats, and as more high-variance instances are
queried, the function approximation improves. Burbidge et al. (2007) analyzed this rarely-used approach to
active data selection for real-valued target functions, and found that it works well if the inductive bias of the
learner and the label noise are both small. However, if the model class is misspecified or improperly biased
(e.g., a linear regression being used to actively learn a nonlinear function), a QBC approach based on output
variance can produce lower-quality models than randomly sampled data.
Figure 3.7: Variance-based QBC for a toy 1D task using bagged regression trees. Each column represents an
iteration of active learning. In the top row, solid lines show the target function to be learned, while dashed
lines show the mean output of an ensemble of ten regression trees induced from the training data (black dots).
The bottom row plots the ensemble’s output variance, which is used to select the query for the next iteration.
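A minimal sketch of this variance-based committee for regression, using scikit-learn's bagged decision trees; the function and parameter names are illustrative and this is not the exact setup behind Figure 3.7.

```python
import numpy as np
from sklearn.ensemble import BaggingRegressor
from sklearn.tree import DecisionTreeRegressor

def query_by_variance(X_lab, y_lab, X_pool, n_trees=10):
    """Select the pool instance on which an ensemble of bagged regression trees
    disagrees the most, i.e., where the committee's output variance is highest."""
    ensemble = BaggingRegressor(DecisionTreeRegressor(), n_estimators=n_trees)
    ensemble.fit(X_lab, y_lab)
    preds = np.array([tree.predict(X_pool) for tree in ensemble.estimators_])
    variance = preds.var(axis=0)         # committee disagreement at each candidate
    return int(np.argmax(variance)), variance
```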
3.5 DISCUSSION
In this chapter, we have looked at a variety of active learning approaches that operate on a simple premise:
select queries that eliminate as many “bad” hypotheses as possible. This is in contrast to the vanilla
uncertainty sampling strategies from Chapter 2, which use a single hypothesis to select queries that simply
appear to be confusing.
There is an information-theoretic interpretation of the subtleties at play here. Recall that we want to select
a query x whose labeling would provide the most information about the hypothesis space (i.e., most reduces
uncertainty about the choice of hypothesis itself). Let I(Y; V) denote mutual information between the label
variable Y and the (possibly noisy) version space V, with some abuse of notation (namely, dropping
dependence on x and treating V as a random variable). What we want, then, is to query the instance that
maximizes I(Y; V), providing the most mutual information between a query label and the version space.
Consider the well-known information-theoretic identity:
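In symbols, this is the standard identity that the next sentence unpacks:

\[ I(Y; V) \;=\; H(V) - H(V \mid Y) \tag{3.4} \]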
where H denotes entropy. Assuming that H(V) ∝ |V|—both are measures of uncertainty about or complexity of
the space of “good” hypotheses—then this justifies an approach that searches for queries that reduce |V| in
expectation over all possible labelings. Queries that approximately bisect the version space can accomplish
this, such as the examples early in this chapter: binary search in 1D, uncertainty sampling with max-margin
classifiers like SVMs, or the QBD algorithm in its purest form. Certain query synthesis algorithms, which
create new data instances de novo from the instances space, can be interpreted as generating queries to most
constrain the version space in this way (Angluin, 1988; King et al., 2004).
However, an analysis that tries to maximize Equation 3.4 directly can be very complicated. Fortunately,
mutual information is symmetric and there is an equivalent identity:
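A standard form of that equivalent identity, in the notation of the surrounding discussion, is:

\[ I(Y; V) \;=\; H(Y) - H(Y \mid V) \;=\; H(Y) - \mathbb{E}_{\theta \in V}\big[ H_\theta(Y) \big] \]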
This states that we want to query instances which are uncertain—H(Y) is high—but in such a way that there is
disagreement among the competing hypotheses. If the choices of θ largely agree on y, then EV[Hθ(Y)] will be
very close to H(Y), and the information content is deemed to be low; the more they disagree, the lower the
second term is and the higher the overall information content. However, since the size of V can be intractably
large in practice, the QBC algorithm uses a committee C ≈ V as an approximation to the full set of candidate
hypotheses.
Mutual information, it turns out, is also equivalent to the KL divergence between the joint distribution P(Y,
V) and the product of their marginal distributions:
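In the standard two-line form referred to just below:

\[ I(Y; V) \;=\; \sum_{\theta} \sum_{y} P(y, \theta)\,\log\frac{P(y, \theta)}{P(y)\,P(\theta)} \]
\[ \phantom{I(Y; V)} \;=\; \sum_{\theta} P(\theta) \sum_{y} P(y \mid \theta)\,\log\frac{P(y \mid \theta)}{P(y)} \]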
The second line follows from the fact that P(y|θ) = P(y, θ)/P(θ), so we can marginalize out the specific
hypothesis θ. If we use a QBC approach, assume that each committee member is equally probable (i.e., P(θ) =
1/|C|), and approximate the marginal P(Y) ≈ PC(Y), then this is equivalent to the KL divergence utility measure
defined in Equation 3.3. In other words, disagreement-based approaches are an attempt to maximize the
mutual information between an instance’s unknown label and the learning algorithm’s unknown hypothesis. In
contrast, entropy-based uncertainty sampling only attempts to maximize the entropy H(Y) = I(Y; Y), or “self
information” of an instance’s unknown label. By doing this, uncertainty sampling does not try to obtain
information about the hypothesis space itself, in general, whereas QBC and other disagreement-based active
learning methods do.
However, methods based on uncertainty sampling and hypothesis search alike can suffer if our primary
interest is reducing generalization error on future unseen instances. To see why, consider a variant of our running
alien fruits example. Imagine that there are two types of fruits that come in different colors: red and grey. They
both have the property that smooth fruits are safe and irregular fruits are noxious, but they have different
thresholds. Thus, our hypotheses now consist of two parameters, θ1 and θ2, one for each color fruit. Figure 3.8
shows how this might be problematic for both uncertainty sampling and methods that try to reduce the version
space. There are two candidate queries that will cut the size of the version space in half: A, which is a grey
fruit, and B, which is red. As a result, they both look equally informative. But the impact that each query will
have on future classification error depends on the distribution of colored fruits. If that distribution is fairly
balanced and uniform across both red and grey dimensions, then B is probably a more informative query,
because currently more red fruits are likely to be misclassified by θ2 than grey fruits by θ1 (due to the tightness
of the labeled fruits providing bounds on them). However, grey fruits might be 10,000 times more common
than red fruits, in which case A will probably yield better overall classification by making more use of a better
estimate for θ1. On the other hand, even if the fruit colors are highly skewed, the red fruits might be normally
distributed with a mean near B, while the distribution of grey fruits might be bimodal with means far away
from A. In this case, B is probably the more informative query.
Figure 3.8: A problematic case for hypothesis space search. In this variant of the alien fruits example, there
are two parameters: one for each color of fruit. This plot represents the 2D hypothesis space H. Candidate
queries A (grey) and B (red) both bisect the version space V (highlighted), and thus appear to be equally
informative. However, the query that should most reduce generalization error depends on the distribution of
fruit shapes.
In pathological cases, both uncertainty sampling and QBC might exert a lot of effort querying in sparse,
noisy, or irrelevant regions of the input space. This is especially true for query synthesis scenarios (see the
discussion in Section 1.3), but it can also be problematic for the pool-based setting where the size of U is very
large and full of statistical outliers. This is less of a problem in the stream-based selective sampling scenario,
where data are sampled i.i.d. from the underlying distribution, and in fact most of the strong theoretical results
in active learning (see Chapter 6) assume the stream-based setting. However, the active learning heuristics we
have covered thus far are a bit myopic and can be prone to querying outliers. If our real desire is to develop an
active learning system that reduces classification error, then perhaps we should consider active learning
heuristics that try to optimize for that directly, which is the topic of the next chapter.
1 See Vapnik (1998) or Tong and Koller (2000) for a more technical exposition.
2 We will not go into detail on these alternative SVM algorithms in this book, but they are similar in spirit to the expected error reduction framework (Chapter 4), and can suffer from some of the same issues of computational complexity.
CHAPTER 4
where θ+ refers to a new model after it has been re-trained using a new labeled set L ⋃ 〈x, y〉—that is,
after adding the candidate query x and the hypothetical oracle response y. The objective here is to reduce the
expected total number of incorrect predictions (giving no credit for “near misses”). Another less stringent
objective is to minimize the expected log-loss:
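In a standard form (writing θ+ for the model re-trained with the candidate 〈x, y〉, as above):

\[ x^*_{\log} \;=\; \operatorname*{argmin}_{x} \; \mathbb{E}_{y \sim P_\theta(Y \mid x)} \Big[ -\sum_{x' \in U} \sum_{y'} P_{\theta^{+}}(y' \mid x')\,\log P_{\theta^{+}}(y' \mid x') \Big] \tag{4.2} \]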
which is equivalent to the expected total future output entropy (uncertainty) over U.
Roy and McCallum (2001) proposed the expected error reduction framework for text classification using
naive Bayes, and some of their results are presented in Figure 4.1. Here we see that sampling based on
expected error reduction (log-loss in their case, Equation 4.2) produced significantly better learning curves
in terms of accuracy than the other query strategies they tried: density-weighted query by committee (a
variant of QBC which we will discuss in Chapter 5) and uncertainty sampling. Note that these active
approaches are all still superior to random sampling (“passive” learning), but expected error reduction
produces more accurate classifiers with less labeled data.
In most cases, unfortunately, expected error reduction is very computationally expensive. Not only does
it require estimating the expected future error over U for each query, but a new model must be re-trained for
every possible labeling of every possible query in the pool. This leads to a drastic increase in computational
cost. For a classifier like naive Bayes, incremental re-training is fairly efficient since it only involves
incrementing or decrementing a few counts. However, Roy and McCallum still had to resort to the bagging
algorithm (Breiman, 1996) to ensure that the classifier’s posterior distributions were good estimates1.
Bagging helps to keep these estimates smooth and reliable, but adds to its computational overhead. For non-
parametric model classes like Gaussian random fields or nearest-neighbor methods, the incremental training
procedure is also efficient, making the approach fairly practical2. For many other learning algorithms,
however, this is not the case. For example, a binary logistic regression model would require O(ULG) time
simply to choose the next query, where U is the size of the unlabeled pool U, L is the size of the current
training set L, and G is the number of gradient computations required by the optimization procedure until
convergence. A classification task with three or more labels using a maximum entropy model (Berger et al.,
1996) would require O(M²ULG) time complexity, where M is the number of class labels. For a sequence
labeling task using conditional random fields (Lafferty et al., 2001), the complexity explodes to
O(TM^(T+2)ULG), where T is the length of an input sequence. Because of this, research papers employing
expected error reduction have mostly only considered simple binary classification. Moreover, since the
approach is often still impractical, researchers must resort to sub-sampling from the pool (Roy and
McCallum, 2001) to reduce the U term in the above analyses, or use approximate training techniques (Guo
and Greiner, 2007) to reduce the G term.
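To make the computational burden concrete, here is a brute-force sketch of the log-loss variant with pool sub-sampling, written for any scikit-learn classifier that exposes predict_proba (e.g., naive Bayes). The function name, the sub-sample size, and the reuse of the same sub-sample for both candidates and evaluation are illustrative assumptions; the model passed in is assumed to be already fit on the current labeled set.

```python
import numpy as np
from sklearn.base import clone

def expected_log_loss_query(model, X_lab, y_lab, X_pool, subsample=100, rng=None):
    """Brute-force expected error reduction (log-loss variant): for each candidate
    query and each hypothetical label, re-train the model and measure the total
    predictive entropy over a sub-sample of the pool."""
    rng = rng or np.random.default_rng(0)
    # Sub-sampling the pool tames the U term in the running-time analysis above.
    idx = rng.choice(len(X_pool), size=min(subsample, len(X_pool)), replace=False)
    X_eval = X_pool[idx]
    best_i, best_score = None, np.inf
    for i in idx:
        label_probs = model.predict_proba(X_pool[i:i + 1])[0]   # P_theta(y|x)
        expected_entropy = 0.0
        for label, p_y in zip(model.classes_, label_probs):
            retrained = clone(model).fit(np.vstack([X_lab, X_pool[i:i + 1]]),
                                         np.append(y_lab, label))
            q = retrained.predict_proba(X_eval)
            expected_entropy += p_y * -(q * np.log(q + 1e-12)).sum()
        if expected_entropy < best_score:
            best_i, best_score = i, expected_entropy
    return int(best_i)
```

Even with sub-sampling, every candidate still triggers one re-training per possible label, which is the cost the complexity analysis above is counting.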
Figure 4.1: Learning curves showing that expected error reduction can outperform QBC and uncertainty
sampling for two binary text classification tasks. Source: Adapted from Roy and McCallum (2001), with
kind permission of the authors.
Nevertheless, there are some interesting variants of the approach. Zhu et al. (2003) combined this
framework with semi-supervised learning (more on this in Chapter 5), resulting in a dramatic improvement
over random or uncertainty sampling. Guo and Greiner (2007) employed an “optimistic” variant that biases
the expectation toward the most likely label for computational convenience, using uncertainty sampling as a
fallback strategy when the oracle provides an unexpected labeling. This worked surprisingly well on several
common data mining data sets from the UCI repository3. The estimated error reduction framework has the
dual advantage of being near-optimal and not being dependent on the model class. All that is required is an
appropriate objective function and a way to estimate posterior label probabilities. For example, strategies in
this framework have been successfully used with a variety of models from naive Bayes and Gaussian
random fields to logistic regression, and support vector machines. Furthermore, the general decision-
theoretic approach can be employed not only to minimize error, as with 0/1-loss or log-loss, but to optimize
any generic performance measure of interest, such as maximizing precision, recall, F1-measure, or area
under the ROC curve. As always, however, there are concerns of computational complexity, which may
make these approaches infeasible for a truly interactive learning deployment.
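The variance-reduction strategies discussed next start from the textbook decomposition of a model's expected squared-loss at an instance x (stated here in its standard form; ŷ is the model's prediction and y the true label):

\[ \mathbb{E}\big[ (\hat{y} - y)^2 \mid x \big] \;=\; \mathbb{E}_{Y \mid x}\big[ (y - \mathbb{E}[y \mid x])^2 \big] \;+\; \big( \mathbb{E}_L[\hat{y}] - \mathbb{E}[y \mid x] \big)^2 \;+\; \mathbb{E}_L\big[ (\hat{y} - \mathbb{E}_L[\hat{y}])^2 \big] \]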
where EY|x[·] is an expectation over the conditional density P(y|x), EL[·] is an expectation over the labeled
training set L, and E[·] is an expectation over both. Recall that ŷ is shorthand for the model’s prediction for a
given instance x, while y indicates the true label for that instance.
The first term on the right-hand side of this equation is noise, i.e., the unreliability of the true label y
given only x, which does not depend on the model or training data. Such noise may result from stochastic
effects of the method used to obtain the labels, for example, or because the feature representation is
inadequate. The second term is the bias, which represents the error due to the model class itself, e.g., if a
linear model is used to learn a function that is non-linear. This component of the overall error is invariant
given a fixed model class. The third term is the output variance, which is the remaining component of the
learner’s squared-loss with respect to the target function. Minimizing the variance, then, is guaranteed to
minimize the future generalization error of the model (since the learner itself can do nothing about the noise
or bias components).
Therefore, we can attempt to reduce error in the squared-loss sense by labeling instances that are
expected to most reduce the model’s output variance over the unlabeled instances U:
where θ+ again denotes the model after it has been re-trained with L ⋃ 〈x, y〉. The question is how to
compute this value more efficiently than we could by directly minimizing the error through re-training, as
we did in the previous section.
As it turns out, there is a rich history of such approaches in the statistics literature, often referred to as
optimal experimental design (Chaloner and Verdinelli, 1995; Federov, 1972), and typically focused on
regression tasks with a few input variables as predictors. However, the approaches are fairly general, and
can be used to derive elegant closed-form utility functions for active learning. A key ingredient of these
approaches is Fisher information, which is a way of measuring the amount of information that a random
variable Y carries about a parameter θ upon which the likelihood function Pθ(Y) depends. The partial
derivative of the logarithm of this function with respect to θ is called the Fisher score:
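In symbols, with θ treated for the moment as a single parameter:

\[ \nabla_x \;=\; \frac{\partial}{\partial \theta} \log P_\theta(Y \mid x) \]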
where ∇x denotes the score for an input instance x, upon which the output variable Y also depends. Note that
the score does not depend on the actual label of x, only the distribution over Y under the parameters θ. Note
also that, for multiple parameters, the score is a vector (or gradient). The Fisher information F is the
variance of the score:
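A standard way to write this, taking the variance over the label distribution and integrating over the input distribution:

\[ F \;=\; \mathbb{E}_{x}\Big[ \sum_{y} P_\theta(y \mid x)\, \nabla_x \nabla_x^{\top} \Big] \]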
The Fisher score and information could also be written ∇θx and F(θ), respectively, to make their relationship
to model parameters more explicit, but I will use simplified notation here for brevity. Notice also that the
Fisher information is not a function of a particular observation, as we have integrated over all instances in a
particular input distribution.
Fisher information is important because its inverse sets a lower bound on the variance of the model’s
parameter estimates; this result is known as the Cramér-Rao inequality (Cover and Thomas, 2006). In other
words, to minimize the variance over its parameter estimates, an active learner should select data that
maximizes the Fisher information (or minimizes the inverse thereof). When there is only one parameter, the
Fisher information is a scalar value and this strategy is straightforward. But for models of K parameters,
Fisher information takes the form of a K × K covariance matrix, and deciding what exactly to optimize is
less clear. In the literature, there are three main types of so-called optimal experimental designs:
• A-optimality minimizes the trace of the inverse information matrix. This criterion results in minimizing the average variance of the parameter estimates.
• D-optimality minimizes the determinant of the inverse information matrix. It is related to minimizing the differential posterior entropy of the parameter estimates (Chaloner and Verdinelli, 1995), and results in the following utility function:
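Writing F(L) for the Fisher information of the current training set and F(x) for that of a candidate query, one standard way to state the criterion is:

\[ x^*_{D} \;=\; \operatorname*{argmin}_{x} \; \det\!\Big( \big( F(L) + F(x) \big)^{-1} \Big) \]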
where the additive property of Fisher information allows us to simply add the information content of x to
that of all the previous training observations in L. Note that this is a closed-form solution, and no actual
model re-training is necessary to compute this measure of model variance. Since the determinant can be
thought of as a measure of volume, the D-optimal design criterion can be thought of as the volume of the
(noisy) version space, making it analogous to the query by committee algorithm (Chapter 3); it aims to
select instances that reduce the amount of uncertainty in the parameter estimates. E-optimality, which minimizes the maximum eigenvalue of the inverse information matrix, does not seem to correspond to an obvious utility function, and is not often used in the machine learning literature, though there are some exceptions (e.g., Flaherty et al., 2006).
A-optimal designs are considerably more popular, and aim to reduce the average variance of parameter
estimates by focusing on values along the diagonal of the information matrix. A common variant of A-optimal design is to minimize tr(AF(L)⁻¹)—the trace of the product of A and the inverse information matrix—where A is a square, symmetric "reference" matrix. As a special case, consider a matrix of rank one, in particular Ax = ∇x∇x⊤. In this case, we have tr(AxF(L)⁻¹) = ∇x⊤F(L)⁻¹∇x, which is the equation for computing the output variance for a single instance x in regression models (Schervish, 1995). This criterion
is sometimes called c-optimality (where the vector c = ∇x), which is an attempt to minimize the prediction
variance for a single data instance. One way to do this is to simply query x itself, if we assume that the
output variance will be zero once the true label is known. In this way, c-optimality can be viewed as a form
of uncertainty sampling. For example, this is the strategy employed by the neural network in learning a toy
1D regression task in Figure 2.6.
Now recall that we are really interested in reducing variance across the input distribution (not merely a
single data point), thus the A matrix should encode the whole instance space. Using A-optimal design, we
can derive an appropriate utility measure for active learning, called the Fisher information ratio:
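The derivation sketched in the next paragraph arrives at a criterion of the following standard shape, writing F(·) for the Fisher information computed over a data set:

\[ x^*_{FIR} \;=\; \operatorname*{argmin}_{x} \; \operatorname{tr}\!\Big( F(L \cup \{x\})^{-1}\, F(U) \Big) \]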
We begin with the same objective as Equation 4.4, where Varθ+(·) denotes the variance of the model after it
has been re-trained with the query x and its putative label y. However, unlike in the previous section, we now
have a way of estimating this value in closed-form without explicit model re-training, by simply adding the
information content of x to that of the current training set L (4.5). Furthermore, since matrix traces are also
additive, the utility measure simplifies to the inner product of two matrices (4.6): the Fisher information
F(U) encoded by the model's label predictions for U, and the Fisher information F(L ⋃ {x}) for the
label predictions for L ⋃ {x}. This ratio can be interpreted as the model's output variance across the input
distribution (represented by U) that cannot yet be explained by the observations in L ⋃ {x}. Querying the
instance which minimizes this ratio is analogous to minimizing the future output variance once x has been
labeled, thus indirectly reducing generalization error (with respect to U) in the squared-loss sense. The
advantage here over explicit error reduction, as we did at the beginning of this chapter, is that the model
need not be retrained for each possible labeling of each candidate query. The information matrices give us
an approximation of the updated output variance that simulates retraining.
MacKay (1992) derived such a utility measure to do active learning in regression tasks with neural
networks, and Cohn (1994) applied it to a robot arm kinematics problem, predicting the absolute coordinates
of a robot hand given the arm joint angles as inputs. The approach was actually a query synthesis scenario:
by using numerical optimization, Cohn was able to find a set of joint angles that minimized variance among
predictions across all legal joint angle combinations. The approach was computationally expensive and
approximate, but Cohn et al. (1996) later showed that variance reduction active learning can be efficient and
exact for statistical models like Gaussian mixtures and locally-weighted linear regression. For classification,
Zhang and Oles (2000) and Schein and Ungar (2007) derived similar A-optimal heuristics for active learning
with logistic regression. Hoi et al. (2006a) extended this idea to active text classification in the batch-mode
setting (Section 4.3) in which a set of queries Q is selected all at once in an attempt to minimize the ratio
between FU and FQ. Settles and Craven (2008) also generalized the Fisher information ratio approach for
structured prediction using conditional random fields.
There are still several practical drawbacks to variance-reduction methods, in terms of computational
complexity. Estimating output variance requires inverting and multiplying K × K matrices, where K is the
number of parameters in the model θ. Assuming standard implementations, these operations require O(K³)
time4. This quickly becomes intractable for large K, which is a common occurrence in, say, natural language
processing tasks (with hundreds of thousands or millions of parameters). Because these operations must be
repeated for every instance in U being considered for querying, the computational complexity remains
O(UK³) for selecting the next query. Paaß and Kindermann (1995) proposed a sampling approach based on
Markov chains to reduce the U term in this analysis. To invert the information matrix and reduce the K³
term, Hoi et al. (2006a) used principal component analysis to reduce the dimensionality of the parameter
space. Once the information matrix has been inverted , it can be efficiently updated using the
Woodbury matrix identity (also known as the matrix inversion lemma) to incorporate information from the
query instance . Alternatively, Settles and Craven (2008) approximated the matrix
with its diagonal vector, which can be inverted in only O(K) time and results in substantial speedups. This is
also more space efficient, since for very large K the square matrices may not even fit in memory.
Empirically, these heuristics are still orders of magnitude slower than simple query strategies like
uncertainty sampling. As a consequence, there are few experimental comparisons with other active learning
methods, and the few that exist (Schein and Ungar, 2007; Settles and Craven, 2008) have reported mixed
results.
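The diagonal trick mentioned above is easy to sketch for binary logistic regression; the function names, the ridge term, and the use of a plain weight vector theta (ignoring the intercept) are illustrative assumptions, not the implementation of Settles and Craven (2008).

```python
import numpy as np

def fisher_diag(X, probs):
    """Diagonal of the (binary logistic regression) Fisher information:
    F = sum_i p_i (1 - p_i) x_i x_i^T, keeping only the diagonal entries."""
    return ((probs * (1.0 - probs))[:, None] * X ** 2).sum(axis=0)

def fisher_information_ratio(theta, X_lab, X_pool, ridge=1e-6):
    """Score each candidate x in the pool by tr(F(L + x)^-1 F(U)) using
    diagonal Fisher matrices -- a cheap O(K) approximation per candidate."""
    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))
    p_pool = sigmoid(X_pool @ theta)
    p_lab = sigmoid(X_lab @ theta)
    F_U = fisher_diag(X_pool, p_pool) / len(X_pool)    # averaged over the pool
    F_L = fisher_diag(X_lab, p_lab)
    scores = np.empty(len(X_pool))
    for i, (x, p) in enumerate(zip(X_pool, p_pool)):
        F_Lx = F_L + p * (1.0 - p) * x ** 2 + ridge    # add the candidate's information
        scores[i] = (F_U / F_Lx).sum()                 # trace of diag(F_U) diag(F_Lx)^-1
    return int(np.argmin(scores)), scores
```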
properties of submodular functions (Nemhauser et al., 1978). Submodularity is a property of set functions
that intuitively formalizes the idea of “diminishing returns.” That is, adding some instance x to the set A
provides more gain in terms of the utility function than adding x to a larger set A′, where A ⊆ A′. Informally,
since A′ is a superset of A and already contains more data, adding x will not help as much. More formally, a
set function s is submodular if it satisfies the property
s(A ∪ {x}) − s(A) ≥ s(A′ ∪ {x}) − s(A′) whenever A ⊆ A′,
or, equivalently,
s(A) + s(B) ≥ s(A ∪ B) + s(A ∩ B)
for any two sets A and B. The key advantage of submodularity is that, for monotonically non-decreasing
submodular functions where for the empty set s(∅) = 0, a greedy algorithm for selecting N instances
guarantees a performance of s(Q) ≥ (1 − 1/e) · s(Q*), where s(Q*) is the value of the optimal set of size
N. In other words, using a greedy algorithm to incrementally pick items to optimize a submodular function
will give us a lower-bound performance guarantee of around 63% of optimal. In practice, greedy algorithms
for submodular criteria can be within 90% of optimal (Krause, 2008).
A variance reduction heuristic can be recast as a monotonically non-decreasing function by measuring
the total difference between the model’s output variance before querying the set Q and the expected variance
afterward:
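One way to write such a set function, with θ+ denoting the model re-trained on L together with Q and its (expected) labels:

\[ s(Q) \;=\; \sum_{x' \in U} \Big( \operatorname{Var}_{\theta}(\hat{y} \mid x') - \mathbb{E}_{Y_Q}\big[ \operatorname{Var}_{\theta^{+}}(\hat{y} \mid x') \big] \Big) \]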
For Q = ∅, this utility measure is clearly zero, and we can greedily select the instance x that maximizes it,
add x to the set Q, and repeat until the query budget is spent. Under certain conditions with certain learning
algorithms, this approach can be shown to be submodular. Some examples are Gaussian processes (Guestrin
et al., 2005), logistic regression (Hoi et al., 2006b), and linear regression (Das and Kempe, 2008). In fact,
for linear regression the Fisher score and Fisher information matrices depend only on the input vector x and
not on the parameter coefficients. Thus, all queries within a given budget Q can be selected at the outset
before any data has been labeled (MacKay, 1992)! In settings where there is a fixed budget for gathering
data—or where it is more practical to gather the training data in batches—submodular utility measures
guarantee near-optimal results with significantly less computational effort5.
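A generic sketch of the greedy procedure behind that (1 − 1/e) guarantee is given below; utility here stands for any monotone submodular set function, such as the variance-difference measure above, and all names are illustrative.

```python
def greedy_batch(candidates, utility, budget):
    """Greedily build a batch Q: at each step add the candidate with the largest
    marginal gain under the set utility (assumed monotone and submodular)."""
    Q = []
    for _ in range(budget):
        gains = [(utility(Q + [i]) - utility(Q), i)
                 for i in candidates if i not in Q]
        if not gains:
            break
        _, best = max(gains)
        Q.append(best)
    return Q
```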
4.4 DISCUSSION
In this chapter, we touched on several principled active learning strategies that aim to directly minimize the
expected error or output variance on instances in the unlabeled distribution. These kinds of approaches are
intuitively appealing, and have enjoyed some empirical success in terms of producing more accurate
learners with fewer labeled instances than uncertainty or hypothesis-based approaches. Methods that aim to
reduce classification error by minimizing 0/1-loss or log-loss (Section 4.1) cannot be solved in closed form
generally, and require instead that models be re-trained to account for all possible labelings of all possible
queries. In contrast, methods that aim to reduce output variance in the squared-loss sense (Section 4.2) have
a long history in the statistics literature, can be computed in closed form, and apply equally well to
classification and regression problems. They can also be generalized in a rather straightforward way to
perform active learning in batches (sometimes with efficiency guarantees based on properties of
submodularity).
Still, there are two major drawbacks to these approaches. The first is that they are not as general as
uncertainty sampling or disagreement-based methods, and can only be easily applied to hypothesis classes
of a certain form. In particular, the variance reduction heuristics are only natural for linear, non-linear, and
logistic regression models (which is not surprising, since they were designed by statisticians with these
models in mind). How to use them with decision trees, nearest-neighbor methods, and a variety of other
standard machine learning and data mining techniques is less clear. The second drawback is their
computational complexity due to the inversion and multiplication of the Fisher information matrix. If the
number of parameters or output variables is relatively small, then these methods can still be relatively fast
and efficient. However, if these properties of the problem are very large then both the space and time
complexities of the algorithms involved might be too high to be usable in practice.
1 It is well known that, for problems where input features are not conditionally independent given the class label, a "double counting" effect can occur with naive Bayes. This violates the "naive" independence assumption, which can result in very sharp posteriors near to zero or one.
2 The bottleneck in non-parametric models is generally not re-training, but predictive inference.
3 http://archive.ics.uci.edu/ml/
4 Certain optimizations can reduce the exponent, but known matrix algorithms are still more than quadratic.
5 Many set optimization problems such as this are NP-hard, resulting in computations that unfortunately scale exponentially with the size of the input. Therefore, greedy approaches are usually more efficient.
CHAPTER 5
Figure 5.1: An illustration of when uncertainty sampling can be a poor strategy. Shaded polygons
represent labeled instances in L, and circles represent unlabeled instances in U. Since A is on the
decision boundary, it would be queried as the most uncertain. However, B would probably provide
more information about the input distribution as a whole.
In contrast, error and variance reduction approaches (Chapter 4) attempt to reduce the expected
future error over all data instances, implicitly taking the input distribution into account. By utilizing
the unlabeled pool U when estimating future errors or output variances, these heuristics can be less
sensitive to outliers and noisy inputs than their simpler relatives like uncertainty sampling and QBC.
However, these advantages come at a significant computational expense, and in some cases the costs
may not be worth the potential gains. In this chapter, we look at ways of overcoming these problems
by explicitly considering the structure of the data while selecting queries.
Density-weighted heuristics for active learning aim to do exactly this. The basic intuition is that
informative instances should not only be those with high information content, but also those which
are representative of the underlying distribution in some sense (e.g., inhabit dense regions of the
input space). The generic information density heuristic captures the main ideas of this approach:
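In its usual form, with sim(·,·) denoting a similarity function between instances:

\[ x^*_{ID} \;=\; \operatorname*{argmax}_{x \in U} \; \phi_A(x) \times \Big( \frac{1}{U} \sum_{x' \in U} \operatorname{sim}(x, x') \Big)^{\beta} \tag{5.1} \]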
Here, U is the size of the unlabeled pool and ϕA(·) represents the utility of x according to some “base”
query strategy A, such as an uncertainty sampling or QBC approach. For example, one could use
prediction entropy as the base utility measure: ϕH(x) = Hθ(Y|x). The second term weights the
informativeness of x by its average similarity to all other instances in the input distribution—for
which U is assumed to be representative—subject to a parameter β that controls the relative
importance of the density term. For example, one might use cosine similarity, Euclidean distance,
Pearson’s correlation coefficient, Spearman’s rank correlation, or any other problem-appropriate
measure of similarity among instances.
Figure 5.2: Learning curves showing that, by explicitly weighting queries by their representativeness
among the input instances, information density can yield better results than the base uncertainty
sampling heuristic by itself.
Figure 5.2 shows learning curves for four subsets of the famous 20 Newsgroups text classification
corpus (Lang, 1995). The experiments compare information density (using entropy as the base utility
measure, cosine similarity, and β = 1) to simple entropy-based uncertainty sampling, and random
sampling. All results are averaged across ten folds using cross-validation. Even in cases where
uncertainty sampling is worse than random sampling, information density can yield comparable or
superior results. Such density-weighted approaches have been proposed and employed in a variety of
machine learning applications (Fujii et al., 1998; McCallum and Nigam, 1998; Settles and Craven,
2008; Xu et al., 2007). In general, the published results seem to indicate that these approaches are
superior to simpler methods that do not consider density or representativeness in their calculations.
Although there are some exceptions, such as the comp.* experiments in Figure 5.2, as well as in
Figure 4.1 (where Roy and McCallum (2001) used a density-weighted QBC approach as a baseline
for their error-reduction utility measure). However, a big advantage of density-weighting over error
and variance reduction is this: if the density terms are pre-computed and cached for efficient lookup,
the time required to select the next query is essentially no different than the base informativeness
measure, e.g., uncertainty sampling (Settles and Craven, 2008). This is advantageous for conducting
active learning interactively with oracles in real-time.
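The caching idea is simple to sketch: compute the density term once per pool and reuse it every round. The code below assumes a fitted scikit-learn-style classifier with predict_proba and uses cosine similarity; the names and defaults are illustrative rather than the setup behind Figure 5.2.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def information_density_scores(model, X_pool, beta=1.0):
    """Entropy-based uncertainty weighted by average similarity to the pool,
    in the spirit of the information density heuristic (Equation 5.1)."""
    # The density terms depend only on the pool, so compute them once and cache.
    density = cosine_similarity(X_pool).mean(axis=1)
    probs = model.predict_proba(X_pool)
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return entropy * density ** beta
```

Because the density vector never changes between queries, each round costs essentially the same as plain uncertainty sampling.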
2004). One could also cluster the most informative instances after each iteration and query the most
representative instances of those clusters (Nguyen and Smeulders, 2004).
Figure 5.3: A motivating example for cluster-based active learning. If the input distribution looks
like this, perhaps we only need four labels?
However, naive cluster-based approaches have their limitations. First, there may not be an
obvious clustering of U, or a good similarity measure for clustering the data is unknown. Second,
good clusterings may exist at various levels of granularity (should there be four clusters, or forty?).
Third, the clusters may not correspond to the hidden class labels after all. To take the “best of both
worlds,” it would be nice to have an active learning algorithm that makes no hard assumptions about
the relationship between input and label distributions, but is still able to take advantage of data
clusterings when they happen to be informative. In fact, let us consider an active learning algorithm
that selects queries based solely on the cluster structure and labels received so far, and is therefore
agnostic about the type of classifier (i.e., model or hypothesis class) we intend to use. The basic idea
is illustrated in Figure 5.4. First we (a) take the data and (b) find an initial coarse clustering. Then we
(c) query instances from these clusters and (d) iteratively refine the clusterings so that they become
more pure, and focus our querying attention on the impure clusters.
This approach is called hierarchical sampling (Dasgupta and Hsu, 2008), and the full algorithm is
shown in Figure 5.5. The input is a hierarchical clustering T of the unlabeled data, where the subtree
rooted at node v ∈ T is denoted Tv. The algorithm maintains a pruning P ⊂ T, or a subset of cluster
nodes that are disjoint and contain all the leaves (instances)—in other words, P consists of the
topmost few layers of the hierarchical clustering. Each node v ∈ P is assigned its majority labeling
L(v) so that at any time, if the algorithm were interrupted, all the instances in Tv would be given this
label. For example, the bottom-right cluster in Figure 5.4(d) would be assigned the negative label,
whereas the other two would be assigned the positive label. These labelings are then used to construct
a training set L for supervised learning algorithms. Clearly, this results in label noise if there are
instances in Tv whose true labelings are not L(v), as is the case in the upper-right-hand cluster in
Figure 5.4(d). Therefore, we also want to maintain a bound p̆ v,y on the lowest estimated probability of
label y at node v, which could be estimated from empirical label frequencies, and used to estimate an
upper-bound of the error at node v. This can be computed in a linear-time bottom-up pass through the
tree and used to refine the pruning P, for example, by selecting the upper-right-hand cluster in Figure
5.4(d) with higher probability than the other two clusters (which are much more pure).
Figure 5.5: The hierarchical sampling algorithm. For each node v in the clustering, the algorithm
maintains the majority label L(v), empirical label frequencies p̃ v,y, and the lower bound p̆ v,y on all label
probabilities at node Tv.
A key difference between this and most of the active learning algorithms in this book is that the
learner queries clusters, and obtains labels for randomly drawn instances within a cluster. Regardless
of how we choose to implement the cluster(·) and select(·) procedures (lines 1 and 4, respectively),
the algorithm remains statistically sound: it maintains valid estimates for the error induced by its
current pruning, because the labeled instances are drawn at random from the clusters. Therefore, we
have a lot of freedom in how to perform these operations. Consider these two options for the select(·)
procedure:
• select v ∈ P with probability ∝ |Tv|—since labeled instances are drawn at random within each cluster, this rule is essentially equivalent to random sampling from the pool; or
• select v ∈ P with probability ∝ |Tv|(1 − p̆ v,L(v))—this is an active learning rule that focuses
attention on clusters which appear to be impure or erroneous.
If the cluster structure is pathologically not correlated with the label structure—i.e., all the clusters
are fairly impure—then the behavior of the algorithm will essentially resemble the first option above,
and degrade gracefully to random sampling. However, if the clusterings do have a relationship with
the hidden label distribution, we can take advantage of it by sampling more often from the impure
clusters with the second option (lines 4–6), and refining the clusters accordingly (lines 7–9). If the
clustering contains a pruning whose clusters are ∊-pure in their labels, then the hierarchical sampling
algorithm can find a labeling that is proportionally pure using O(|P|d(P)/∊) labels, where d(P) is the
maximum depth of a node in the pruning P. More details on these bounds are discussed in Dasgupta
and Hsu (2008).
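A minimal sketch of the second selection rule, assuming the bookkeeping quantities from Figure 5.5 are available as plain Python containers (cluster sizes, lower bounds p̆, and majority labels); all names are illustrative, and the clustering and refinement steps are omitted.

```python
import numpy as np

def select_cluster(pruning, sizes, p_lower, majority, rng=None):
    """Pick a cluster v from the current pruning with probability proportional
    to |T_v| * (1 - p_lower[v][L(v)]), i.e., favor large, impure-looking clusters.
    A labeled instance is then drawn uniformly at random inside that cluster."""
    rng = rng or np.random.default_rng()
    weights = np.array([sizes[v] * (1.0 - p_lower[v][majority[v]]) for v in pruning])
    weights = weights + 1e-12            # guard against an all-pure pruning
    return pruning[rng.choice(len(pruning), p=weights / weights.sum())]
```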
Figure 5.6: Learning curves for cluster-based hierarchical sampling vs. margin (uncertainty) and
random sampling baselines on (a) handwritten digit classification and (b-c) text classification.
Source: Adapted from Dasgupta and Hsu (2008), with kind permission of the authors.
Figure 5.6 shows learning curves that compare hierarchical sampling to margin-based uncertainty
sampling and random sampling (Dasgupta and Hsu, 2008). In these experiments T is constructed
using an agglomerative bottom-up clustering algorithm with an average-linkage distance measure
(Ward, 1963), and logistic regression is the final supervised learning algorithm. Figure 5.6(a) plots
the classification error for the ten-label problem of classifying images in the MNIST handwritten
digits data set1, using Euclidean distance as the similarity measure for clustering. Figures 5.6(b–c) are
on binary subsets of the 20 Newsgroups text classification corpus2, using different document
preprocessing techniques for clustering, namely TFIDF (Manning and Schütze, 1999) and latent
Dirichlet allocation (Blei et al., 2003), a topic modeling approach. Both active learning approaches
produce better learning curves than random sampling. The cluster-based hierarchical sampling
manages to produce steeper learning curves early on than margin-based uncertainty sampling in all
cases, but uncertainty sampling manages to achieve lower error rates later on in the process. This
may be a case of serendipitous sampling bias for these particular data sets. However, hierarchical
sampling avoids sampling bias by design, concentrating on converging to the same results it would
obtain if it had all of the correct training labels.
Table 5.1: Conceptually, several active learning algorithms have semi-supervised learning counterparts.
Active Learning Semi-Supervised Learning
uncertainty sampling self-training
query by committee co-training
expected error reduction entropy regularization
active learning is uncertainty sampling, where the instances about which the model is least confident
are selected for querying, assuming that the learner needs the most guidance on these instances.
Alternatively, co-training (Blum and Mitchell, 1998) and multi-view learning (de Sa, 1994) use
ensembles for semi-supervised learning. Initially, separate models are trained with the labeled data
(usually using separate, conditionally independent feature sets), which then classify the unlabeled
data, and “teach” the other models with a few unlabeled instances (using predicted labels) about
which they are most confident. For example, consider the task of classifying web pages: one feature
set might be the words contained on the page itself, while another feature set might consist of the
words contained in hyperlinks to that page. Forcing these different views of the data to agree not only
on the labeled data, but the unlabeled data as well—in this case, large numbers of web pages—helps
to reduce the size of the version space. Query by committee is an active learning complement here, as
the committee is meant to approximate different parts of the version space, and is used to query the
unlabeled instances about which these competing hypotheses do not agree.
Finally, entropy regularization (Grandvalet and Bengio, 2005) is a semi-supervised learning
approach based on the intuition that the best model of the data is the one that can make the most
confident predictions on the unlabeled data. For example, consider training a logistic regression
classifier by maximizing the following likelihood function ℓ:
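In one common parameterization (λ and γ are weighting hyperparameters introduced here for illustration, not notation from elsewhere in this book):

\[ \ell(\theta) \;=\; \sum_{\langle x, y\rangle \in L} \log P_\theta(y \mid x) \;-\; \lambda \sum_{k} \theta_k^2 \;+\; \gamma \sum_{x' \in U} \sum_{y} P_\theta(y \mid x')\,\log P_\theta(y \mid x') \]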
The first summand is the conditional log likelihood of the labeled data, while the second is the
common L2 regularization term to prevent over-fitting. These two terms comprise the standard
objective function for supervised training. The third term aims to reduce the model’s output entropy
(e.g., the expected log-loss) among the unlabeled data. In other words, this entropy term penalizes
parameter settings with the highest risk of making mistakes on the unlabeled data. In active learning,
expected error reduction using log-loss (Equation 4.2) aims to select instances that yield the
analogous result.
We can see through these examples that many popular active and semi-supervised learning
algorithms try to attack the same problem—making the most of unlabeled data—from opposite
directions. While semi-supervised methods exploit what the learner thinks it knows about the
unlabeled data, active methods attempt to explore the unknown aspects3. It is therefore natural to
think about combining the two approaches, which is often easily done. McCallum and Nigam (1998)
combined density-weighted QBC for naive Bayes with the expectation-maximization (EM)
algorithm, which can be seen as a kind of “soft” self-training. Muslea et al. (2002) combined multi-
view algorithms for active and semi-supervised learning into a unified approach, also using naive
Bayes as the underlying learning algorithm. Zhu et al. (2003) combined an error-reduction query
selection strategy with a graph-based semi-supervised learning approach which also aims to
minimize uncertainty (similar in spirit to entropy regularization), applied to text classification and
handwritten digit recognition. Tomanek and Hahn (2009) combined self-training with uncertainty
sampling to train linear-chain conditional random fields for information extraction, while Tür et al.
(2005) combined active and semi-supervised learning for speech recognition. Other examples of
active semi-supervised learning systems are described by Zhou et al. (2004) and Yu et al. (2006).
5.4 DISCUSSION
In this chapter, we considered ways in which the input distribution—whether it is known or can be
modeled from a large pool of unlabeled data—can be explicitly taken advantage of in active learning.
One simple approach is to combine a simple utility measure (like uncertainty or QBC) with a
“representativeness” measure like input density. This can help the active learner avoid queries that
are statistical outliers and otherwise atypical or spurious instances that have little to do with the
problem. Another approach is to exploit cluster structure in the data, and either query clusters or
cluster centroids to obtain labels that are both representative and diverse. Finally, active learning and
semi-supervised learning both traffic in making the most of unlabeled data, but take largely
complementary approaches. It therefore seems natural to combine them so that the learner can exploit
both what it does and does not know about the problem.
1 http://yann.lecun.com/exdb/mnist/
2 http://people.csail.mit.edu/jrennie/20Newsgroups/
3 Some might argue that active methods are also "exploiting" what is known by querying what is not known, rather
CHAPTER 6
Theory
“Theory is the first term in the Taylor series expansion of practice.”
— Thomas M. Cover
where θ+ denotes a new model after the query and its associated label have been added to L and used to re-
train the model. This measure represents the total gain in information (i.e., the change in entropy or
uncertainty) over all the instances. Unfortunately, as we have already seen, the true label y is not known for
the query x, so we again resort to the current model’s predictions and compute the expected information
gain:
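With θ+ denoting the model re-trained on L ⋃ 〈x, y〉 (as elsewhere in this book), the expected gain can be written in its standard form:

\[ x^*_{IG} \;=\; \operatorname*{argmax}_{x} \; \mathbb{E}_{y \sim P_\theta(Y \mid x)} \Big[ \sum_{x' \in U} \big( H_\theta(Y \mid x') - H_{\theta^{+}}(Y \mid x') \big) \Big] \tag{6.1} \]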
From this utility measure, we can derive a simple uncertainty sampling heuristic:
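The two approximations discussed next amount to:

\[ x^* \;\approx\; \operatorname*{argmax}_{x} \; \mathbb{E}_{y \sim P_\theta(Y \mid x)} \big[ H_\theta(Y \mid x) - H_{\theta^{+}}(Y \mid x) \big] \tag{6.2} \]
\[ \phantom{x^*} \;\approx\; \operatorname*{argmax}_{x} \; H_\theta(Y \mid x) \tag{6.3} \]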
Hence, the entropy-based uncertainty sampling strategy from Equation 2.3 can be seen as an
approximation of our idealized utility measure, under two (grossly over-simplifying) assumptions. The
first approximation (6.2) stems from the idea that all the instances in U have equal impact on each other, so
the information gain of query x will be proportional to any other instance x′. The second approximation
(6.3) follows from the intuition that, if the model is re-trained with the oracle's true label y for query x, then
the label is known, it can be accurately predicted, and its entropy will be zero. Similarly, if our goal was to
maximize the gain in 0/1-loss (as opposed to entropy here), these same approximations would yield the
least confident uncertainty sampling strategy from Equation 2.1.
Query by committee (QBC) heuristics such as the vote entropy variants in Equations 3.1 and 3.2 have a
very similar interpretation. Namely, they make the assumptions that query x is representative of U and that
the expected future entropy of that query instance, once labeled, is zero. The main difference is that QBC
methods replace the point estimate θ with a distribution over several hypotheses θ ∊ C, approximating the
version space with a committee. This Bayesian flavor helps to mitigate problems caused by our first
assumption in this chapter (6.1): that prediction under the current single model θ yields a good estimate of
the true label distribution.
The relationship of our ideal utility measure to the error-reduction framework in Chapter 4 is a bit
more obvious:
Since the first entropy term Hθ(Y|x′) depends only on the current state of the model θ, it is held constant for
all possible queries and can be dropped from the equation. Thus, maximizing the negative expected future
entropy is the same as minimizing the expected log-loss, making this equivalent to the expected reduction
of log-loss utility measure from Equation 4.2. Since the entropy of a random variable is a monotonic
function of its variance, variance-reduction utility measures such as Fisher information ratio (4.6) are also
approximations to this ideal objective, in cases where the proper assumptions hold. If we wish to maximize
the gain in 0/1-loss instead, the result is Equation 4.1.
Finally, let us consider how density-weighted methods from Chapter 5 can be derived from this ideal
utility. The approximation is similar to that of uncertainty sampling (or QBC), but attenuates the overly
strong assumption that a single instance x is representative of the input distribution. The expression inside
the summand of Equation 6.1 represents the expected reduction in entropy for an instance x′ ∊ U if x is the
selected as the query. Let us hold to the assumption that the expected future entropy of x will be zero (since
the oracle provides the label). However, let us make a less stringent assumption about the rest of the input
distribution: the reduction in entropy for all other x′ is proportional to its similarity with x. By this notion,
we can make a simple substitution in the parenthetical:
With some trivial rearranging of terms, this is precisely the information density measure from Equation
5.1, if we use entropy as the base utility measure ϕA, set β = 1, and ignore the constant term. Hence,
information density can also be viewed as a somewhat scalable, but less harsh approximation to our
idealized utility measure.
part) on new data instances from the same input distribution. We will not cover all of the details of PAC
learning here; see Mitchell (1997) or Vapnik (1998) for more thorough introductions. We are interested in
finding an upper bound on the number of labeled instances that would be required to train a probably
approximately correct classifier in this sense. Let L be the label complexity, defined as the smallest integer
t such that for all t′ ≥ t,
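P\big( err(h_{t'}) > \epsilon \big) \;\le\; \delta,

the standard PAC-style requirement on the error tolerance ∊ and confidence parameter δ.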
where ht is a hypothesis (classifier) output by the learning algorithm after t queries, and err(·) is the error
rate (expected 0/1-loss) of a classifier. In other words, the label complexity L is the number of labeled
instances required to train a classifier that satisfies the desired values of ∊ and δ. In passive supervised
learning, such label complexity bounds have been fairly well-studied.
Theorem 6.1 Let DXY denote a joint instance-label distribution, and assume that labeled instances in the
training set L ∽ DXY are i.i.d. samples. If DXY is perfectly separable, i.e., there exists a hypothesis h* such
that err(h*) = 0, then the label complexity LPASS of a passive learning algorithm is (Vapnik, 1998):
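L_{PASS} = \tilde{O}\!\left( \frac{d}{\epsilon} \right),

restating the bound in the Õ notation explained just below (and as recalled at the end of this section).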
The Õ notation simplifies things by ignoring terms that are logarithmic or dependent on δ (which is
typically modest in supervised learning bounds). Here also d denotes the VC dimension (Vapnik and
Chervonenkis, 1971), a parameter which is useful in characterizing the complexity of a hypothesis class H.
In brief, d represents the largest number of distinct instances that can be completely discriminated (or
“shattered”) by a hypothesis h ∊ H. For example, linear thresholds on the 1D number line have VC
dimension d = 1. Plugging this value into Theorem 6.1 shows why, in Chapter 1, we needed to test Õ (1/∊)
fruits in order to passively train an accurate classifier for the alien fruits example. Other canonical VC
dimensions include intervals on the 1D number line (d = 2), axis-parallel rectangles in K dimensions (d =
2K), and linear separators (d = K + 1).
Now let us consider the query by disagreement algorithm from Section 3.3 (Cohn et al., 1994). Recall
that QBD is a selective sampling algorithm, meaning we draw instances in a stream from the marginal distribution D over the instance space (writing D instead of DX for subsequent brevity), but
the labels are hidden. It is the task of the learner to request labels for select instances according to the
algorithm in Figure 3.3, which maintains all hypotheses in the version space and queries instances about
which there is any disagreement. In order to proceed, there are a few formal definitions that will prove
helpful in our analysis. First, the data distribution provides a natural measure of the “difference” Δ(h1, h2)
between any two hypotheses, defined as:
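\Delta(h_1, h_2) \;=\; P_{x \sim D}\big[\, h_1(x) \neq h_2(x) \,\big].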
Intuitively, Δ encodes the proportion of instances about which the two hypotheses disagree. The r-ball
centered at some hypothesis h* is defined as:
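B(h^*, r) \;=\; \big\{\, h \in H : \Delta(h, h^*) \le r \,\big\},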
that is, the set of all hypotheses for which the probability mass under the region of disagreement with h* is
within radius r. For a given version space V and instance space X, the region of disagreement is defined as:
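DIS(V) \;=\; \big\{\, x \in X : \exists\, h_1, h_2 \in V \text{ such that } h_1(x) \neq h_2(x) \,\big\}.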
This is the set of instances about which there is some disagreement among hypotheses in the version space.
These three notions are visually depicted in Figure 6.1.
Figure 6.1: Suppose the data are 2D and H consists of linear separators. (a) The difference between two
hypotheses is the probability mass under the region in which they disagree. (b) The thick line is the true
hypothesis h*, and thin lines are examples of hypotheses in B(h*, r). (c) The region of disagreement
among hypotheses in B(h*, r) might look like this.
If h* is the “true” underlying hypothesis with err(h*) = 0, then after a certain amount of querying we
hope to constrain the version space to be within B(h*, r) for small values of r. In that case, the probability
that a random instance drawn from D would be queried is no more than PD(DIS(B(h*, r))). Under this
intuition, Hanneke (2007) proposed a new parameter to characterize the complexity of active learning, as a
function of the hypothesis class H and input distribution D. The so-called disagreement coefficient ξ
measures how the probability of disagreement with h* scales with r:
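\xi \;=\; \sup_{r > 0} \; \frac{ P_D\big( DIS(B(h^*, r)) \big) }{ r }

(a sketch of the definition; whether the supremum ranges over all r > 0 or only sufficiently large r is assumed here).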
Roughly speaking, ξ attempts to quantify how quickly the region of disagreement grows as a function of
the radius of the version space. Hence, smaller values of ξ imply more efficient shrinkage of the version
space.1
Theorem 6.2 Suppose H has finite VC dimension d, and the learning problem is separable with
disagreement coefficient ξ. Then the label complexity LQBD of the query by disagreement algorithm is
(Hanneke, 2009):
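L_{QBD} = \tilde{O}\big( \xi \, d \, \log(1/\epsilon) \big),

where the form of the bound is reconstructed from the proof below, with Õ again hiding constants and logarithmic or δ-dependent terms.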
Proof. Let it index the instance sampled at the time of query t. After labeling Q queries, the set of unlabeled instances to come in from the stream that fall inside the region of disagreement DIS(Vt) is at least of size Q. Furthermore, the first Q instances in this set are drawn conditionally i.i.d. given x ∊ DIS(Vt). Now, let us cleverly define Q to be:
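Q := c \, \xi \Big( d \log \xi + \log \tfrac{1}{\delta'} \Big),

that is, just enough labeled instances that the passive bound of Theorem 6.1, applied within DIS(Vt), guarantees conditional error at most 1/2ξ. The exact expression here is an assumption; any Q = Õ(ξd) of this general shape supports the argument that follows.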
Note that δ′ here is not the same as δ in Theorem 6.2; we will revisit this later in the proof. For now,
though, we can say that with probability 1 − δ′ and for all h ∊ Vt+Q,
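err\big( h \mid x \in DIS(V_t) \big) \;\le\; \frac{c'}{Q}\Big( d \log Q + \log \tfrac{1}{\delta'} \Big) \;\le\; \frac{1}{2\xi}

(sketching the display referred to below as Equation 6.6; the precise constants are assumed).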
The first inequality comes from Theorem 6.1, by setting L equal to the passive learning bound and solving
for the error tolerance ∊. This works because the data are conditionally i.i.d. given x ∊ DIS(Vt), as we saw
above, and because h(x) = h*(x) for all x ∉ DIS(Vt), so all the labels outside the region of disagreement are
known and available “for free.” The second inequality is an algebraic simplification based on our awkward
but careful definition of Q (with appropriate settings for the constants c and c′).
Note that as more instances are queried and labeled, the version space necessarily shrinks. As a result,
Vt+Q ⊆ Vt, which implies that for all h ∊ Vt+Q:
First, we split err(h) into the data proportions that fall inside and outside of the region of disagreement for
Vt. Since the data are separable, h and h* must agree on all x ∉ DIS(Vt) by definition, so the second
summand is zero. By substituting Equation 6.6 into the first summand, we arrive at an upper bound for the
error, as shown in the last line. This means that every hypothesis remaining in Vt+Q lies within Δ-distance PD(DIS(Vt))/2ξ of h*, and that:
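P_D\big( DIS(V_{t+Q}) \big) \;\le\; \xi \cdot \frac{ P_D\big( DIS(V_t) \big) }{ 2\xi } \;=\; \frac{1}{2}\, P_D\big( DIS(V_t) \big),

where the first step applies the definition of the disagreement coefficient to the ball B(h*, PD(DIS(Vt))/2ξ) containing Vt+Q; the display is reconstructed to match the surrounding argument.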
To put it another way, after t + Q queries, the data density under the region of disagreement will be at most
half what it was at the previous time step of t queries. So every Q queries, QBD chops the number of
controversial instances in half (or better). This holds for any value of t which is a multiple of Q, and by a
union bound it holds with probability 1 − δ′⌈log(1/∊)⌉ up to t = Q⌈log(1/∊)⌉. Re-substituting the definition of Q and setting δ′ = δ/⌈log(1/∊)⌉ results in the bound for LQBD given in Theorem 6.2, which completes the
proof. □
Recall from Theorem 6.1 that the label complexity in the passive case was Õ(d/∊), while our analysis of QBD yields a label complexity of Õ(ξd log(1/∊)). For well-behaved values of ξ, this implies a
significant reduction in the number of labeled instances necessary to train a classifier from the same
hypothesis class H, on the same problem distribution DXY, under the same δ and ∊ requirements in the PAC
setting. This is an encouraging result!
6.3 DISCUSSION
The analysis of QBD in the previous section only applies to separable (i.e., noise-free) data distributions
DXY. However, the disagreement coefficient turns out to be a useful quantity for analyzing several active
learning algorithms. For example, Hanneke (2007) used a proof very similar to the one above to analyze
the A2 algorithm (Balcan et al., 2006), which is designed to be able to learn from data in the “agnostic”
PAC setting—that is, obtaining similar label complexity bounds in the face of arbitrarily noisy data.
Dasgupta et al. (2008) also used the disagreement coefficient to analyze a different agnostic active learning
algorithm, based on a reduction to supervised learning. Balcan et al. (2008) use it to show that active
learning always helps under an asymptotic assumption, often with exponential improvements in sample
complexity over passive learning counterparts. Beygelzimer et al. (2009) generalize the disagreement
coefficient in their analysis of a novel active learning algorithm—importance-weighted active learning
(IWAL)—and consider a larger family of loss functions beyond the 0/1-loss we looked at in this chapter.
Interestingly, recent work has been able to place bounds on ξ for certain hypothesis classes (Friedman,
2009; Wang, 2009), which is useful for predicting exactly how much active learning algorithms might help
for these classes.
For many years, the main theoretical result in active learning was an analysis of the original query by committee (QBC) algorithm by Freund et al. (1997). They showed that if the data are drawn uniformly from the surface of the unit sphere in RK, and the true hypothesis h* corresponds to a homogeneous (i.e., through the origin) linear separator, then it is possible to achieve error ∊ after streaming Õ(d/∊) instances and querying only Õ(d log(1/∊)) labels. Nearly a decade passed before significant progress was made toward the analysis of other active learning algorithms (or the development of novel algorithms) that could make similar claims accounting for arbitrary hypothesis classes, input distributions, and label noise. By
now there are several such bounds in the literature, with a handful of the key results summarized in Table
6.1.
Still, many of these theoretical results have been based on contrived algorithms not covered in this
book, which are not very practical or computationally tractable (at least not as formally defined), although
there are a few exceptions (e.g., Beygelzimer et al., 2009; Dasgupta and Hsu, 2008). In fact, most analyses
are limited to binary classification tasks with the goal of minimizing 0/1-loss, and are not easily adapted to
other objective functions that may be more appropriate for many applications (e.g., log-loss, or squared-
loss for regression tasks). Furthermore, some of the algorithms in question require an explicit enumeration
over the version space, which is not only intractable for infinite hypothesis classes, but difficult to even
imagine for more complex hypotheses (e.g., large heterogeneous ensembles or structured prediction
models for sequences, trees, and graphs).
Table 6.1: Some key theoretical results in active learning. The last column indicates whether or not the algorithm can
handle noisy labels.
Algorithm Agnostic?
QBD: query by disagreement (Cohn et al., 1994; Hanneke, 2009) No
QBC: query by committee (Freund et al., 1997; Seung et al., 1992) No
A2 algorithm (Balcan et al., 2006; Hanneke, 2007) Yes
MBAL: margin-based AL (Balcan et al., 2007) Yes
reduction to supervised (Dasgupta et al., 2008) Yes
modified perceptron (Dasgupta et al., 2009) No
IWAL: importance-weighted AL (Beygelzimer et al., 2009) Yes
By and large, the active learning community has been split into two camps—the theoretical and the
empirical—who until the last few years have had little overlap or interaction with one another. As a
result, there is still much to learn about the nature and behavior of active learning algorithms for more
general and practical applications.
1 A quantity called the "capacity function," which is closely related to the disagreement coefficient, was independently proposed by Alexander (1987).
CHAPTER 7
Practical Considerations
“In theory, there is no difference between theory and practice. But in practice, there is.”
—Jan L.A. van de Snepscheut and/or Yogi Berra
Table 7.1: A high-level comparison of the active learning strategies in this book.

uncertainty sampling
  Advantages: simplest approach, very fast, easy to implement, usable with any probabilistic model, justifiable for max-margin classifiers.
  Disadvantages: myopic, runs the risk of becoming overly confident about incorrect predictions.

QBC and disagreement-based methods
  Advantages: reasonably simple, usable with almost any base learning algorithm, theoretical guarantees under some conditions.
  Disadvantages: can be difficult to train/maintain multiple hypotheses, still myopic in terms of reducing generalization error.

error/variance reduction
  Advantages: directly optimizes the objective of interest, empirically successful, natural extension to batch queries with some guarantees.
  Disadvantages: computationally expensive, difficult to implement, limited to pool-based or synthesis scenarios, VR limited to regression models.

density weighting
  Advantages: simple, inherits advantages of the base heuristic while making it less myopic in terms of the input distribution, can be made fast.
  Disadvantages: input distribution or cluster structure may have no relationship to the labels.

hierarchical sampling
  Advantages: exploits cluster structure, degrades gracefully if clusters are not correlated with the labels, theoretical guarantees.
  Disadvantages: requires a hierarchical clustering of the data, which can be slow and expensive in practice, limited to pool-based scenario.

active + semi-supervised
  Advantages: exploits latent structure in the data, aims to make good use of data through both active and semi-supervised methods.
  Disadvantages: not a single algorithm/framework but a suite of approaches, inherits the pros and cons of the base algorithms.
In a recent survey of annotation projects for natural language processing tasks (Tomanek and
Olsson, 2009), only 20% of the respondents said they had ever decided to use active learning,
which is evidence of the community's skepticism about its usefulness in practice.1 Of the large
majority who chose not to use active learning, 21% were convinced that it would not work well,
with some stating that “while they believed [it] would reduce the amount of instances to be
annotated, it would probably not reduce the overall annotation time.” However, all but one of the
eleven respondents that had attempted active learning claimed to be partially or fully satisfied
with the results. In my own experience, if the nature of the problem is fairly well-understood and
the appropriate learning algorithm is known (e.g., using naive Bayes, logistic regression, or
support vector machines for text classification), then simple heuristics like uncertainty sampling
can work successfully. I have heard similar stories from colleagues in industry who needed to
quickly build up new training sets for classifiers on a variant of a problem they were working on.
If the problem is completely new, however, a safer approach is probably to conduct an initial
pilot study with randomly-sampled data until the problem can be better understood, and then
transition to an active learning paradigm that seems most appropriate based on preliminary
results (provided there is still room for improvement!).
It should be noted that nearly all of the discussion in this book has focused on the traditional
active learning problem of selecting data instances to be labeled, with certain assumptions about
the cost and reliability of the “oracle” providing the labels. In many real-world deployment
situations, these assumptions do not hold. Or, perhaps more interestingly, certain applications
offer other opportunities such as actively soliciting rules, mixed-granularity queries, and other
kinds of interactions that are much richer than the mere annotation of instances. The rest of this
chapter focuses on these practical considerations, overviewing some of the work that has been
done so far, and summarizing the open research questions and opportunities that are available in
the rapidly-evolving landscape of active learning. Note that the following sections read more like
a survey than a tutorial, and lack some details relating to these emerging research areas. The
curious reader is encouraged to use the citations as a starting point for further investigation.
motivates various approaches for cost-sensitive active learning. One proposed approach for
reducing effort in active learning is automatic pre-annotation, i.e., using the current model’s
predictions to assist in labeling queries (Baldridge and Osborne, 2004; Culotta and McCallum,
2005). Sometimes this helps by reducing the amount of work a human must do. However, if the
model makes many mistakes (which is quite possible, especially early in the process), this may
also create extra work for the annotator who must correct these mistakes. To help combat this in
structured prediction tasks, Culotta et al. (2006) experimented with a technique called correction
propagation, where local edits are used to interactively update (and hopefully improve) the
overall structure prediction. Nevertheless, it may also be that the active learner ends up biasing
the annotator toward its poor predictions on noisy “edge” cases. For example, Baldridge and
Palmer (2009) reported on a user study involving a domain novice performing rare-language
glossing annotations with an active learner that did precisely this (whereas a second expert
annotator was not so swayed by the model). Felt et al. (2012) found that, for a Syriac
morphology-tagging task, the machine model had to achieve about 80% accuracy or more before
significantly reducing the annotator’s time. In general, though, automatic pre-annotation and
correction propagation do not actually represent or reason about labeling costs themselves.
Instead, these methods simply attempt to reduce cost indirectly by minimizing the number of
labeling actions required by the human oracle.
Another group of cost-sensitive active learning approaches explicitly accounts for varied
labeling costs when selecting queries. Kapoor et al. (2007) proposed a decision-theoretic
approach in which the value of information includes both current labeling costs and expected
future misclassification costs:
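x^* = \arg\min_{x \in U} \Big( c(x) \;+\; \mathbb{E}_{y \sim P_\theta(Y \mid x)} \Big[ \sum_{x' \in U} \kappa_{\theta^+}(\hat{y}, x') \Big] \Big)

(a sketch of the decision-theoretic criterion in the notation defined in the next sentence; the exact display is assumed).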
where c(·) is the cost of annotating the query x, and κθ+(·) is the cost of assigning the model’s
prediction ŷ to some instance x′ ∈ U. The approach is very much in the same flavor as the
expected error reduction framework (Chapter 4). In fact, if c(x) is constant (the traditional
assumption in active learning), and κθ+(ŷ, x′) = 1 − Pθ+ (ŷ|x′), then the equation above is
equivalent to minimizing the expected future 0/1-loss (Equation 4.1). A similar derivation can be
made for the expected future log-loss (Equation 4.2).
Kapoor et al. (2007) applied this approach to a voicemail classification task. Instead of using
real costs, however, they made a simplifying assumption that the cost of an instance label is
linear in the length of a voicemail message, and that misclassification costs are in the same
currency (e.g., c(·) = $0.01 per second of annotation and κθ+(·) = $10 per misclassification). This
approach may not be appropriate or even straightforward for some applications, though—for
example, costs may be non-deterministic, or the two cost measures may not be easily mapped
into the same currency. King et al. (2004) also used this sort of approach to reduce real labeling
costs. They describe a “robot scientist” which executes a series of autonomous biological
experiments to discover metabolic pathways in yeast, with the objective of minimizing the cost
of materials used (i.e., the cost of running an experiment, which can vary according to which lab
supplies are required, plus the expected total cost of future experiments until the correct
hypothesis is found). Note that instead of querying a human, this active learner is “querying
nature” by conducting a laboratory experiment without a human in the loop. In this case, the cost
of materials is real, but fixed and known at the time of experiment (query) selection.
An alternative approach is to select the instance that would maximize the benefit/cost ratio,
or return on investment (ROI):
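x^*_{ROI} = \arg\max_x \; \frac{ \phi_A(x) - c(x) }{ c(x) }
          = \arg\max_x \; \frac{ \phi_A(x) }{ c(x) }

(sketched as two steps so that the simplification described in the next sentence applies).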
where ϕA(x) is the benefit, i.e., any base utility measure (entropy-based uncertainty sampling, for
example). The simplification on the second line comes from splitting the fraction and ignoring
the resulting cost constant. A significant advantage of this approach is that we do not necessarily
have to map utilities and costs into the same currency, as we would have to for labeling and
(mis)classification costs in Equation 7.1.
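As a concrete illustration of how such a criterion might be wired into a pool-based selection loop, here is a small Python sketch. The names entropy_utility, predict_proba, and predict_cost are placeholders assumed for illustration only; any base utility measure ϕA and any cost model (learned or fixed) could be substituted, and the sketch assumes strictly positive costs.

import math

def entropy_utility(class_probs):
    """Base utility phi_A(x): entropy of the model's predicted class distribution."""
    return -sum(p * math.log(p) for p in class_probs if p > 0.0)

def select_roi_query(pool, predict_proba, predict_cost):
    """Pick the unlabeled instance with the best benefit/cost ratio (cf. Equation 7.2).

    pool          -- iterable of unlabeled instances
    predict_proba -- maps an instance to its predicted class distribution
    predict_cost  -- maps an instance to a predicted (or known) annotation cost
    """
    best_x, best_roi = None, float("-inf")
    for x in pool:
        benefit = entropy_utility(predict_proba(x))   # phi_A(x)
        cost = predict_cost(x)                        # c(x), e.g., predicted labeling time
        roi = benefit / cost
        if roi > best_roi:
            best_x, best_roi = x, roi
    return best_x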
In all the settings above, and indeed in most of the cost-sensitive active learning literature,
the cost of annotating an instance is assumed to be fixed and known to the learner before
querying. In some settings, annotation costs are variable and not known a priori, for example,
when labeling cost is a function of elapsed annotation time. Settles et al. (2008a) addressed this
by training a regression cost-model (alongside the active task-model) which tries to learn how to
predict the real, unknown annotation cost using a few simple “meta features” on instances. An
analysis of several user studies using real human annotation costs revealed the following:
• In some domains, annotation costs are not (approximately) constant across instances, and can
indeed vary considerably. This result is supported by the subsequent findings of others,
working on different learning tasks (Arora et al., 2009; Haertel et al., 2008;
Vijayanarasimhan and Grauman, 2009a; Wallace et al., 2010b).
• Consequently, active learning approaches which ignore cost may perform no better than
random selection (i.e., passive learning).
• The cost of annotating an instance may not be intrinsic, but may instead vary based on the
person doing the annotation. This result is also supported by the findings of Ringger et al.
(2008) and Arora et al. (2009).
• The measured cost for an annotation may include stochastic components. In particular, there
are at least two types of noise which affect annotation speed: jitter (minor variations due to
annotator fatigue, latency, etc.) and pause (unexpected major variations, such as a phone call
or other interruption).
• Unknown annotation costs can sometimes be accurately predicted, even after seeing only a
few labeled training instances. This result is also supported by the findings of others
(Vijayanarasimhan and Grauman, 2009a; Wallace et al., 2010b). Moreover, these learned
cost-models are more accurate than simple cost heuristics (e.g., a linear function of document
length).
Several empirical studies have shown that learned cost models can be trained to predict
reasonably accurate annotation times (Haertel et al., 2008; Settles et al., 2008a;
Vijayanarasimhan and Grauman, 2009a; Wallace et al., 2010b). However, there are mixed results
as to how these predicted costs can be effectively incorporated into a query selection algorithm.
Settles et al. (2008a) found that using ROI (7.2) with predicted costs is not necessarily any better
than random sampling for several natural language tasks (this is true even if actual costs are
known). However, results from Donmez and Carbonell (2008), Haertel et al. (2008), and Wallace
et al. (2010b) suggest that ROI is sometimes effective, even when using a learned cost model.
Tomanek and Hahn (2010) evaluated the effectiveness of ROI using the least confident
uncertainty sampling strategy ϕLC(·) (Equation 2.1), and found that it worked better with the non-
linear transformation exp (βϕLC(·)), using some scaling parameter β. This suggests that ROI
requires an appropriate relationship between the benefit and cost functions to work well.
Vijayanarasimhan and Grauman (2009a) demonstrated potential cost savings in active learning
using predicted annotation costs in a computer vision task using a decision-theoretic approach
based on Equation 7.1. It is unclear whether these mixed results are intrinsic, task-specific, or
simply a result of differing experimental assumptions. Therefore, there are still many open
questions as to how we can effectively use real and predicted annotations costs in practical
settings.
(however noisy) are useful if there are enough of them. This begs the question of how to actively
solicit these kinds of feature labels.
Druck et al. (2009) proposed and evaluated several active query strategies aimed at gathering
useful feature labels. They show that active feature labeling is more effective than either
“passive” feature labeling (using a variety of strong baselines) or instance-labeling (both passive
and active) for two information extraction tasks. These results held true for both simulated
experiments and real, interactive human user studies. The query algorithm that proved most
effective was a form of density-weighted uncertainty sampling (Section 5.1), but for features:
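f^* = \arg\max_f \; \mathbb{E}_U\big[ H_\theta(Y \mid x) \,\big|\, f \in x \big] \cdot \log C_f

(a sketch in the notation defined in the next sentence; the exact form of the frequency-based scaling factor, written here as log Cf, is an assumption).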
where f is a candidate feature, EU[·] is the expectation over the unlabeled input distribution, and
Cf is the frequency of the feature f in the corpus. The first term is the average uncertainty of items
which contain that feature (in this work, features were discrete binary variables like words,
capitalization patterns, etc.) times a scaling factor to represent how often the feature occurs, and
thus how much impact its label is likely to have. One explanation for the excellent gains in user
studies is that features are often both intuitively easier to label (requiring 2–4 seconds in user
studies, compared to arbitrarily long amounts of time for instances, depending on the task) and
provide much more information: a feature label is a rule which covers all instances containing
that feature, whereas an instance label only provides information about the instance in question.
Settles (2011) built on this work in a slightly different way for an interactive text
classification system called DUALIST. Figure 7.1 shows the interactive annotation interface
applied to a sentiment analysis task: identifying positive and negative movie reviews (Pang et al.,
2002). Instances are text documents, and features are words. At any time, the user may choose
among four annotation actions: label document queries, label word queries, volunteer
labeled words, or submit all labels, retrain the classifier, and obtain the next set of queries.
The differences compared to previous work are threefold. First, instance and feature queries are
interleaved. Second, the human oracle is allowed to volunteer his or her knowledge about feature
labels by typing them into text boxes, so not all annotations are at the behest of the learner’s
requests. Third, the feature queries are organized into columns according to label, which not only
aims to reduce cognitive load for the annotator, but requires some “initiative” by the active
learner to both identify informative features, and to try and postulate the associated label(s).2 The
query selection strategy in this case is the expected information gain of a feature:
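IG(f) \;\approx\; \frac{1}{|U|} \sum_{x \in U} H_\theta(Y \mid x) \;-\; \frac{1}{|U_f|} \sum_{x \in U_f} H_\theta(Y \mid x),

where Uf ⊆ U denotes the unlabeled instances in which feature f is present; this is only one plausible form of the quantity described after the figure.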
Figure 7.1: User interface for the DUALIST system (Settles, 2011).
that is, the difference between the model’s average uncertainty of all the unlabeled instances, and
the uncertainty of the instances with the feature present. This is the expected amount of
information that feature f provides about the label variable Y. In a similar vein, features can be
assigned to class label columns according to the expected frequency with which f appears in
instances of class y under the model's predictions. In user studies on several benchmark
data sets, this interactive approach outperformed both active and passive learning by labeling
instances (documents) alone.
Figure 7.2: Multiple-instance active learning. (a) Images can be represented as “bags” and
instances correspond to segments. An MI active learner may directly query which segments
belong to an object of interest, such as the gold medal here. (b) Documents are represented as
bags and instances are passages of text. An MI learner may query specific passages to determine
if they represent the class of interest, alleviating the need to read the entire document.
We can also think of interacting with the learner at a granularity somewhere in between
features (very fine) and instances (very coarse). Settles et al. (2008b) introduced a scenario called
multiple-instance active learning, illustrated in Figure 7.2 for (a) computer vision and (b) natural
language processing tasks. In multiple-instance (MI) learning, instances are grouped into bags
(i.e., multi-sets), and it is the bags, rather than instances, that are labeled for training. A bag is
labeled negative if and only if all of its instances are negative. A bag is labeled positive,
however, if at least one of its instances is positive (note that positive bags may also contain
negative instances). A naive approach to MI learning is to view it as supervised learning with
one-sided noise (i.e., all negative instances are truly negative, but some positives are actually
negative). However, special MI learning algorithms have been developed to learn from labeled
bags despite this ambiguity. The MI setting was formalized by Dietterich et al. (1997) in the
context of drug activity prediction, and has since been applied to a wide variety of tasks
including content-based image retrieval (Andrews et al., 2003; Maron and Lozano-Perez, 1998;
Rahmani and Goldman, 2006) and text classification (Andrews et al., 2003; Ray and Craven,
2005).
An advantage of the MI representation in Figure 7.2(a) is that it may be much easier to obtain
image labels than it is to get fine-grained segment labels. Similarly, for text classification in Figure 7.2(b), document labels may be more readily available than labels for passages (e.g., paragraphs).
The MI representation is compelling for classification tasks for which bag labels are freely
available or cheaply obtained (e.g., from online indexes and databases), but the target concept is
represented by only a few portions (i.e., image segments or passages of text). It is often possible
to obtain labels at both levels of granularity. Fully labeling all instances, however, would be
expensive: the rationale for an MI representation is often that it allows us to take advantage of
coarse labelings that may be available at low cost (or even for free). In MI active learning,
however, the learner is sometimes allowed to query for labels at a finer granularity than the target
concept, e.g., querying passages rather than entire documents, or segmented image regions rather
than entire images. Settles et al. (2008b) focused on this type of mixed-granularity active
learning with a multiple-instance generalization of logistic regression and developed weighted uncertainty sampling algorithms that also account for how important an instance is to the MI classifier. Vijayanarasimhan and Grauman (2009a,b) extended the idea to SVMs for the image retrieval task, and also explored an approach that interleaves queries at varying levels of
granularity and cost, using a decision-theoretic approach similar to Equation 7.1.
Figure 7.3 shows that this approach can help to maintain good quality ranking and classifications
in the face of extreme data imbalance.
Arguably, though, this is simply due to a bad match between the selection strategy and the
desired objective function. Recall from Chapter 6 that uncertainty sampling is a crude
approximation to a utility measure that maximizes expected information gain (i.e., reducing
prediction uncertainty). The evaluation in Figure 7.3 is measured instead against area under the
ROC curve (AUC), which is the probability of ranking a randomly-chosen positive instance
above a randomly-chosen negative. Although computationally costly, a decision-theoretic utility
function that directly maximizes the expected AUC of the minority class may perform better that
uncertainty sampling here.
Wallace et al. (2010a) successfully used a dual supervision approach (Section 7.3) with active
learning under extreme class imbalance for biomedical citation screening. In this case, they built
two models: one from labeled data in the normal supervised way, and another based on labeled
words, with the assumption that these feature labels cover the space of representative clusters.
They used query by disagreement (Section 3.3) between these two models to perform active
learning on additional documents for querying, and found that this dual-view strategy helps to
correct some of the problems of label skew compared to several standard active learning
baselines.
Figure 7.3: The effects of artificially-induced label skew on a text classification task (sports vs.
nonsports web pages). Learning curves show area under the ROC curve (AUC) of logistic
regression as a function of the number of labeled documents. As the positive class becomes more
and more rare, uncertainty sampling remains more robust than random sampling, but with a
severe skew (10,000 to one) there are virtually no gains. “Guided” learning, by which the oracle
is allowed to search and provide labels for representative data instances, helps in this respect.
Source: Adapted from Attenberg and Provost (2010b), used with kind permission of the authors.
produce gold-standard quality training sets (Snow et al., 2008) and also to evaluate learning
algorithms on data for which no gold-standard labelings exist (Carlson et al., 2010; Mintz et al.,
2009).
The question remains about how to use non-experts (or even noisy experts) as oracles in
active learning. In particular, when should the learner decide to query for the (potentially noisy)
label of a new unlabeled instance, versus querying for repeated labels to de-noisify an existing
training instance that seems a bit suspect? Sheng et al. (2008) studied this problem using several
heuristics that take into account estimates of both oracle and model uncertainty, and show that
data can be improved by selective repeated labeling. Their analysis assumes that: (i) all oracles
are equally and consistently noisy; and (ii) annotation is a noisy process over some underlying
true label. Donmez et al. (2009) addressed the first issue by allowing annotators to have different
noise levels, and show that both true instance labels and individual oracle qualities can be
estimated (so long as they do not change over time). They take advantage of these estimates by
querying only the more reliable annotators in subsequent iterations of active learning. In follow-up
work, Donmez et al. (2010) used a particle filter to deal with noisy oracles whose quality varies
over time (e.g., improving after becoming more familiar with the task, or degrading after
becoming fatigued). Wallace et al. (2011) also investigated methods for active learning with
annotators of varying levels of expertise. In particular, they assume that novice or non-expert
annotators (e.g., crowdsourced users) are able to estimate the quality of their own labels and
indicate when instances were too difficult. A small user study confirmed that this approach made
better use of both novices and experts.
There are still many open research questions along these lines. For example, how might the
effect of payment influence annotation quality (i.e., if you pay a non-expert twice as much, are
they likely to try and be more accurate)? What if some instances are inherently noisy regardless
of which oracle is used, and repeated labeling is not likely to improve matters? Finally, in most
crowdsourcing environments the users are not necessarily available “on demand,” thus accurate
estimates of annotator quality may be difficult to achieve in the first place, and might possibly
never be applicable again, since the model has no real choice over which oracles to use. How
might the learner continue to make progress?
The joint utility measure for our two tasks is simply the sum of the language model uncertainty,
plus the sentiment model uncertainty as an expectation over the languages. As we saw in
Equation 6.3, this is an approximation to reducing the expected future joint log-loss for the two
tasks.
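A sketch of such a joint measure, with assumed symbols θ1 for the language identification model, θ2 for the sentiment model, and L for the language variable with values ℓ:

\phi_{joint}(x) \;=\; H_{\theta_1}(L \mid x) \;+\; \mathbb{E}_{\ell \sim P_{\theta_1}(L \mid x)}\big[ H_{\theta_2}(Y \mid x, \ell) \big].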
In some cases, the labels may be inherently linked to one another by output constraints which
might be known ahead of time. For example, imagine a system that classifies encyclopedia
articles into hundreds of categories that happen to be arranged in a hierarchy. We may train a
separate classifier for each category (node in the hierarchy), but because of the hierarchy we
know that an article about a biologist is also an article about a scientist and therefore a
person. So a positive label for biologist implies that both scientist and person are positive
as well, and also implies that sportsTeam and fruit are false (since they lie in parts of the
ontology that are mutually exclusive with biologist). Entropy-based uncertainty sampling for a
single task selects instances with the highest expected log-loss:
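x^*_{H} = \arg\max_x \; - \sum_{y} P_\theta(y \mid x) \log P_\theta(y \mid x),

i.e., the entropy Hθ(Y|x) of the model's predictions for that task.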
However, if there are many tasks with output constraints, we can generalize this to the expected
total log-loss of all labelings that are implied by the label of a task in question. Imagine that the
active learner can choose an instance-task query pair 〈x, t〉. This more generalized utility measure
can be written as:
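\langle x, t \rangle^* = \arg\max_{\langle x, t \rangle} \; \mathbb{E}_{y_t \sim P_{\theta_t}(Y_t \mid x)} \Big[ \sum_{t' : \, y_{t'} \Leftarrow y_t} - \log P_{\theta_{t'}}(y_{t'} \mid x) \Big],

sketched here with one model θt per task; the exact display is assumed.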
where the summand inside the expectation iterates over all task labelings yt′ that are implied by
the labeling yt (e.g., {biologist, scientist, person, ¬sportsTeam, ¬fruit} ⇐ biologist).
Zhang (2010) used this sort of approach for a few simple cross-task information extraction and
text classification experiments with successful results. Qi et al. (2008) considered a slightly
different version of multitask active learning with output dependencies in a computer vision
problem, where the output constraints are not necessarily known a priori.
Finally, it may be that the two tasks are not necessarily related at all, or the interdependencies
are too complex to model explicitly. Reichart et al. (2008) experimented with two methods for
such settings. The first is alternating selection, which allows the learner predicting Y1 to query instances in one iteration of active learning—using an appropriate query selection scheme, uncertainty sampling in their case—and lets the Y2 learner query instances in the next iteration, and so on. The approach can be generalized to more than two tasks using a round-robin approach (possibly stochastic). The second method is rank combination, which lets all the learners independently rank-order queries from the pool, and instances with the lowest combined rank are
selected for querying. They experiment with these approaches to select sentences for two natural
language tasks (parsing and information extraction), and conclude that they produce better
learning curves than random sampling. However, each subtask’s curves were not as high as
active learning for that subtask alone, which (not surprisingly) came at the expense of the other
task. Thus, the combined approach was preferable on average across all the tasks. Roth and
Small (2008) also proposed an adaptive strategy for pipeline models which is similar to rank
combination. Specifically, their approach selects instances based on a weighted combination of
utility measures from each stage in the pipeline.
natural to think about devising a “stopping criterion” for active learning, i.e., a method by which
an active learner may decide to stop asking questions in order to conserve resources.
Several such stopping criteria for active learning have been proposed (Bloodgood and
Shanker, 2009; Olsson and Tomanek, 2009; Vlachos, 2008). These methods are all fairly similar,
generally based on the notion that there is an intrinsic measure of stability or self-confidence
within the learner, and active learning ceases to be useful once that measure begins to level off or
degrade. Such self-stopping methods seem like a good idea, and may be applicable in certain
situations. However, in my own experience, the real stopping criterion for practical applications
is based on economic or other external factors, which likely come well before an intrinsic
learner-decided threshold.
1 The authors suspect that even this is an over-estimate, since it was advertised as a survey on the use of active learning and thus biased towards those familiar with it.
2 For tasks with more than two classes, features may take on multiple labels. For example, the word "goalie" might imply both hockey and soccer documents, but not baseball or tennis, etc.
3 http://www.mturk.com
4 http://www.gwap.com
APPENDIX A
Nomenclature Reference
“What’s the use of their having names,” the Gnat said, “if they won’t
answer to them?”
“No use to them,” said Alice; “but it’s useful to the people who name
them, I suppose. If not, why do things have names at all?”
— Lewis Carroll
L labeled data set
U unlabeled data set
h hypothesis
ξ disagreement coefficient
H(·) entropy
∇x Fisher score
Bibliography
N. Abe and H. Mamitsuka. Query learning strategies using boosting and bagging.
In Proceedings of the International Conference on Machine Learning (ICML),
pages 1–9. Morgan Kaufmann, 1998. Cited on page(s) 29
K.S. Alexander. Rates of growth and sample moduli for weighted empirical
processes indexed by sets. Probability Theory and Related Fields, 75(3):379–
423, 1987. DOI: 10.1007/BF00318708 Cited on page(s) 59
S. Arora, E. Nyberg, and C.P. Rosé. Estimating annotation cost for active learning
in a multi-annotator environment. In Proceedings of the NAACL HLT
Workshop on Active Learning for Natural Language Processing, pages 18–26.
ACL, 2009. DOI: 10.3115/1564131.1564136 Cited on page(s) 67
J. Attenberg and F. Provost. Why label when you can search? alternatives to active
learning for applying human resources to build classification models under
extreme class imbalance. In Proceedings of the International Conference on
Knowledge Discovery and Data Mining (KDD), pages 423–432. ACM, 2010a.
DOI: 10.1145/1835804.1835859 Cited on page(s) 72
M.F. Balcan, S. Hanneke, and J. Wortman. The true sample complexity of active
learning. In Proceedings of the Conference on Learning Theory (COLT), pages
45–56. Springer, 2008. Cited on page(s) 61
J. Baldridge and M. Osborne. Active learning and the total cost of annotation. In
Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 9–16. ACL, 2004. Cited on page(s) 65, 76
J. Baldridge and A. Palmer. How well does active learning actually work? Time-
based evaluation of cost-reduction strategies for language documentation. In
Proceedings of the Conference on Empirical Methods in Natural Language
Processing (EMNLP), pages 296–305. ACL, 2009. DOI:
10.3115/1699510.1699549 Cited on page(s) 65
A.L. Berger, V.J. Della Pietra, and S.A. Della Pietra. A maximum entropy
approach to natural language processing. Computational Linguistics, 22(1):39–
71, 1996. Cited on page(s) 39
C. M. Bishop. Pattern Recognition and Machine Learning. Springer, 2006. DOI:
10.1117/1.2819119 Cited on page(s) xi
A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training.
In Proceedings of the Conference on Learning Theory (COLT), pages 92–100.
Morgan Kaufmann, 1998. DOI: 10.1145/279943.279962 Cited on page(s) 53
R. Burbidge, J.J. Rowland, and R.D. King. Active learning for regression based on
query by committee. In Proceedings of Intelligent Data Engineering and
Automated Learning (IDEAL), pages 209–218. Springer, 2007. DOI:
10.1007/978-3-540-77226-2_22 Cited on page(s) 31
D. Cohn, Z. Ghahramani, and M.I. Jordan. Active learning with statistical models.
Journal of Artificial Intelligence Research, 4:129–145, 1996. Cited on page(s)
6, 43
T.M. Cover and J.A. Thomas. Elements of Information Theory. Wiley, 2006. Cited
on page(s) 41
1994. Cited on page(s) 53
Y. Freund and R.E. Schapire. A decision-theoretic generalization of on-line
learning and an application to boosting. Journal of Computer and System
Sciences, 55(1):119–139, 1997. DOI: 10.1006/jcss.1997.1504 Cited on page(s)
28
Y. Freund and R.E. Schapire. Large margin classification using the perceptron
algorithm. Machine learning, 37(3):277–296, 1999. DOI:
10.1023/A:1007662407062 Cited on page(s) 24
Y. Freund, H.S. Seung, E. Shamir, and N. Tishby. Selective sampling using the
query by committee algorithm. Machine Learning, 28:133–168, 1997. DOI:
10.1023/A:1007330508534 Cited on page(s) 28, 61, 62
R. Haertel, K. Seppi, E. Ringger, and J. Carroll. Return on investment for active
learning. In Proceedings of the NIPS Workshop on Cost-Sensitive Learning,
2008. Cited on page(s) 67, 68
A. Hauptmann, W. Lin, R. Yan, J. Yang, and M.Y. Chen. Extreme video retrieval:
joint maximization of human and computer performance. In Proceedings of the
ACM Workshop on Multimedia Image Retrieval, pages 385–394. ACM, 2006.
DOI: 10.1145/1180639.1180721 Cited on page(s) 9
S.C.H. Hoi, R. Jin, and M.R. Lyu. Large-scale text categorization by batch mode
active learning. In Proceedings of the International Conference on the World
Wide Web, pages 633–642. ACM, 2006a. DOI: 10.1145/1135777.1135870
Cited on page(s) 9, 43, 44
S.C.H. Hoi, R. Jin, J. Zhu, and M.R. Lyu. Batch mode active learning and its
application to medical image classification. In Proceedings of the International
Conference on Machine Learning (ICML), pages 417–424. ACM, 2006b. DOI:
10.1145/1143844.1143897 Cited on page(s) 45
J. Kang, K. Ryu, and H.C. Kwon. Using cluster-based sampling to select initial
training set for active learning in text classification. In Proceedings of the
Pacific-Asia Conference on Advances in Knowledge Discovery and Data
Mining (PAKDD), pages 384–388. Springer, 2004. DOI: 10.1007/978-3-540-
24775-3_46 Cited on page(s) 49
R.D. King, K.E. Whelan, F.M. Jones, P.G. Reiser, C.H. Bryant, S.H. Muggleton,
D.B. Kell, and S.G. Oliver. Functional genomic hypothesis generation and
experimentation by a robot scientist. Nature, 427(6971):247–52, 2004. DOI:
10.1038/nature02236 Cited on page(s) 7, 32, 66
International Conference on Machine Learning (ICML), pages 282–289.
Morgan Kaufmann, 2001. DOI: 10.1038/nprot.2006.61 Cited on page(s) 39
K. Lang and E. Baum. Query learning can work poorly when a human oracle is
used. In Proceedings of the IEEE International Joint Conference on Neural
Networks, pages 335–340. IEEE Press, 1992. Cited on page(s) 6, 7
R. Liere and P. Tadepalli. Active learning with committees for text categorization.
In Proceedings of the Conference on Artificial Intelligence (AAAI), pages 591–
597. AAAI Press, 1997. Cited on page(s) 28
Y. Liu. Active learning with support vector machine applied to gene expression
data for cancer classification. Journal of Chemical Information and Computer
Sciences, 44:1936–1941, 2004. DOI: 10.1021/ci049810a Cited on page(s) 9
Association for Computational Linguistics (NAACL), pages 109–112. ACL,
2007. Cited on page(s) 17
learning. In Proceedings of the German Conference on AI, pages 489–493.
Springer, 2007. DOI: 10.1007/978-3-540-74565-5_47 Cited on page(s) 9
G.L. Nemhauser, L.A. Wolsey, and M.L. Fisher. An analysis of approximations for
maximizing submodular set functions. Mathematical Programming, 14(1):265
–294, 1978. DOI: 10.1007/BF01588971 Cited on page(s) 45
G. J. Qi, X.S. Hua, Y. Rui, J. Tang, and H.J. Zhang. Two-dimensional active
learning for image classification. In Proceedings of the Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1–8. IEEE Press,
2008. DOI: 10.1109/CVPR.2008.4587383 Cited on page(s) 76
H. Raghavan, O. Madani, and R. Jones. Active learning with feedback on both
features and instances. Journal of Machine Learning Research, 7:1655–1686,
2006. Cited on page(s) 68
D. Roth and K. Small. Active learning for pipeline models. In Proceedings of the
National Conference on Artificial Intelligence (AAAI), pages 683–688. AAAI
Press, 2008. Cited on page(s) 76
G. Schohn and D. Cohn. Less is more: Active learning with support vector
machines. In Proceedings of the International Conference on Machine
Learning (ICML), pages 839–846. Morgan Kaufmann, 2000. Cited on page(s)
24
R. Schwartz and Y.-L. Chow. The N-best algorithm: an efficient and exact
procedure for finding the N most likely sentence hypotheses. In Proceedings of
the International Conference on Acoustics, Speech, and Signal Processing
(ICASSP), pages 81–83. IEEE Press, 1990. DOI:
10.1109/ICASSP.1990.115542 Cited on page(s) 17
B. Settles, M. Craven, and L. Friedland. Active learning with real annotation costs.
In Proceedings of the NIPS Workshop on Cost-Sensitive Learning, 2008a.
Cited on page(s) 4, 67, 68
V.S. Sheng, F. Provost, and P.G. Ipeirotis. Get another label? improving data
quality and data mining using multiple, noisy labelers. In Proceedings of the
International Conference on Knowledge Discovery and Data Mining (KDD),
pages 614–622. ACM, 2008. DOI: 10.1145/1401890.1401965 Cited on page(s)
74
K. Small, B.C. Wallace, C.E. Brodley, and T.A. Trikalinos. The constrained weight
space SVM: Learning with ranked features. In Proceedings of the International
Conference on Machine Learning (ICML), pages 865–872. Omnipress, 2011.
Cited on page(s) 68, 69
B.C. Smith, B. Settles, W.C. Hallows, M.W. Craven, and J.M. Denu. SIRT3
substrate specificity determined by peptide arrays and machine learning. ACS
Chemical Biology, 6(2):146–157, 2011. DOI: 10.1021/cb100218d Cited on
page(s) 4
C.A. Thompson, M.E. Califf, and R.J. Mooney. Active learning for natural
language parsing and information extraction. In Proceedings of the
International Conference on Machine Learning (ICML), pages 406–414.
Morgan Kaufmann, 1999. Cited on page(s) 9
K. Tomanek and F. Olsson. A web survey on the use of active learning to support
annotation of text data. In Proceedings of the NAACL HLT Workshop on Active
Learning for Natural Language Processing, pages 45–48. ACL, 2009. DOI:
10.3115/1564131.1564140 Cited on page(s) 64
S. Tong and E. Chang. Support vector machine active learning for image retrieval.
In Proceedings of the ACM International Conference on Multimedia, pages
107–118. ACM, 2001. DOI: 10.1145/500141.500159 Cited on page(s) 9
S. Tong and D. Koller. Support vector machine active learning with applications to
text classification. In Proceedings of the International Conference on Machine
Learning (ICML), pages 999–1006. Morgan Kaufmann, 2000. DOI:
10.1162/153244302760185243 Cited on page(s) 9, 23, 24
V. Vapnik. Statistical Learning Theory. Wiley, 1998. Cited on page(s) 23, 57, 58
V. Vapnik and A. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and its Applications, 16:264–280, 1971. DOI: 10.1137/1116025 Cited on page(s) 58
B. C. Wallace, K. Small, C.E. Brodley, and T.A. Trikalinos. Active learning for
biomedical citation screening. In Proceedings of the International Conference
on Knowledge Discovery and Data Mining (KDD), pages 173–182. ACM,
2010a. DOI: 10.1145/1835804.1835829 Cited on page(s) 20, 72
B.C. Wallace, K. Small, C.E. Brodley, and T.A. Trikalinos. Who should label
what? Instance allocation in multiple expert active learning. In Proceedings of
the SIAM Conference on Data Mining (SDM), pages 176–187, 2011. Cited on
page(s) 74
Byron C. Wallace, Kevin Small, Carla E. Brodley, Joseph Lau, and Thomas A.
Trikalinos. Modeling annotation time to reduce workload in comparative
effectiveness reviews. In Proceedings of the ACM International Health
Informatics Symposium (IHI), pages 28–35. ACM, 2010b. DOI:
10.1145/1882992.1882999 Cited on page(s) 67, 68
Z. Xu, R. Akella, and Y. Zhang. Incorporating diversity and density in active
learning for relevance feedback. In Proceedings of the European Conference
on IR Research (ECIR), pages 246–257. Springer-Verlag, 2007. DOI:
10.1007/978-3-540-71496-5_24 Cited on page(s) 44, 49
H. Yu. SVM selective sampling for ranking with application to data retrieval. In
Proceedings of the International Conference on Knowledge Discovery and
Data Mining (KDD), pages 354–363. ACM, 2005. DOI:
10.1145/1081870.1081911 Cited on page(s) 8
K. Yu, J. Bi, and V. Tresp. Active learning via transductive experimental design. In
Proceedings of the International Conference on Machine Learning (ICML),
pages 1081–1087. ACM, 2006. DOI: 10.1145/1143844.1143980 Cited on
page(s) 54
T. Zhang and F.J. Oles. A probability analysis on the value of unlabeled data for
classification problems. In Proceedings of the International Conference on
Machine Learning (ICML), pages 1191–1198. Morgan Kaufmann, 2000. Cited
on page(s) 43
Z.H. Zhou, K.J. Chen, and Y. Jiang. Exploiting unlabeled data in content-based
image retrieval. In Proceedings of the European Conference on Machine
Learning (ECML), pages 425–435. Springer, 2004. Cited on page(s) 54
X. Zhu, J. Lafferty, and Z. Ghahramani. Combining active learning and semi-
supervised learning using Gaussian fields and harmonic functions. In
Proceedings of the ICML Workshop on the Continuum from Labeled to
Unlabeled Data, pages 58–65, 2003. Cited on page(s) 39, 54
Author’s Biography
BURR SETTLES
sandals to shoes, and plays guitar in the Pittsburgh pop band Delicious
Pastries.
Index
The index that appeared in the print version of this title was intentionally
removed from the eBook. Please use the search function on your
eReading device to search for terms of interest. For your reference, the
terms that appear in the print index are listed below.
0/1-loss
classification
clustering
co-training
cost-sensitive active learning
data reuse
decision theory
density weighting
density-weighted methods
disagreement coefficient
entropy
uncertainty sampling
vote entropy
entropy regularization
expected error reduction
Fisher information
ratio
hierarchical sampling
hypothesis space
information density
information extraction
KL divergence
label skew
learning curves
least confident
log-loss
logistic regression
margin
max-margin classifiers
membership queries, see also query synthesis
multi-task active learning
multi-view learning
multiple-instance active learning
mutual information
neural networks
PAC learning
pool-based sampling
query by committee
for classification
for regression
query by disagreement
SG disagreement
theoretical analysis
query synthesis
region of disagreement
region of uncertainty
regression
regression trees
return on investment (ROI)
selective sampling
self-training
semi-supervised learning
sequence models
squared-loss
stopping criteria
stream-based active learning, see also selective sampling
structured prediction
submodular functions
support vector machines
uncertainty sampling
for classification
for regression
unreliable oracles