Skip to main content

Michael Wick

University of Massachusetts Amherst, Computer Science, Graduate Student

Followers

34

Following

10

Co-author

1

Public Views

PhD Candidate, Computer Science.

My research addresses the notoriously difficult learning and inference problems that arise from high-level specifications of factor graphs, especially procedurally-encoded conditional random fields. My approach is largely driven by real-world applications (coreference, information extraction, ontology mapping) that require scalability to both large amounts of data, and to statistical models with many dependencies among the hidden variables (factors with high arity in graphs with large cliques).

There will soon be a publicly available language/toolkit for specifying factor graphs (tentatively called FactorIE) taking full advantage of our research in learning and inference. In the mean time, please see: http://ciir-publications.cs.umass.edu/pdf/IR-697.pdf for more information.
Supervisors: Andrew McCallum
Address: Information Extraction and Synthesis Lab
Center for Intelligent Information Retrieval
Department of Computer Science
University of Massachusetts
140 Governor's Drive
Amherst, MA 01003

less

Interests

Uploads

Papers by Michael Wick

ELERFED: Final Report

by Vladimir Eidelman and Michael Wick

1.3.3 Relation extraction (Jian Su) . . . . . . . . . . . . . . . . 10 1.4 Intra-document corefer... more 1.3.3 Relation extraction (Jian Su) . . . . . . . . . . . . . . . . 10 1.4 Intra-document coreference . . . . . . . . . . . . . . . . . . . . . 11 1.4.1 The BART toolkit . . . . . . . . . . . . . . . . . . . . . . 11 1.4.2 Machine learning . . . . . . . . . . . . . . . . . . . . . . 12 1.4.3 Extracting Lexical and Commonsense Knowledge . . . . 14 1.5 EvaluationandAnnotation . . . . . . . . . . . . . . . . . . . . . 15 1.5.1 TheACECDCcorpus . . . . . . . . . . . . . . . . . . . 16 1.5.2 The ARRAU IDC corpus . . . . . . . . . . . . . . . . . . 16 1.5.3 ... ... 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 ProblemSetting . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 ...

Distantly labeling data for large scale cross-document coreference

Abstract: Cross-document coreference, the problem of resolving entity mentions across multi-docum... more Abstract: Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on``distantly-labeling''a data set from which we can train a discriminative cross-document coreference model.

ELERFED: Final Report

SampleRank Tutorial: Leveraging Sparsity for Fast Implementations

This technical note describes how to leverage the sparsity of SampleRank gradients to efficiently... more This technical note describes how to leverage the sparsity of SampleRank gradients to efficiently implement parameter averaging and L2 regularization. What we mean by sparse is that the gradient has very few non-zero elements. SampleRank gradients exhibit extraordinary sparsity even in comparison to other batch, online, and stochastic gradient parameter estimation algorithms—including contrastive divergence [1] and its persistent variants [6, 4]!

SampleRank: training factor graphs with atomic gradients

Abstract We present SampleRank, an alternative to contrastive divergence (CD) for estimating para... more Abstract We present SampleRank, an alternative to contrastive divergence (CD) for estimating parameters in complex graphical models. SampleR-ank harnesses a user-provided loss function to distribute stochastic gradients across an MCMC chain. As a result, parameter updates can be computed between arbitrary MCMC states. SampleRank is not only faster than CD, but also achieves better accuracy in practice (up to 23% error reduction on noun-phrase coreference).

Monte Carlo MCMC: Efficient Inference by Approximate Sampling

Abstract Conditional random fields and other graphical models have achieved state of the art resu... more Abstract Conditional random fields and other graphical models have achieved state of the art results in a variety of tasks such as coreference, relation extraction, data integration, and parsing. Increasingly, practitioners are using models with more complex structure—higher treewidth, larger fan-out, more features, and more data—rendering even approximate inference methods such as MCMC inefficient.

A discriminative hierarchical model for fast coreference at large scale

Abstract Methods that measure compatibility between mention pairs are currently the dominant appr... more Abstract Methods that measure compatibility between mention pairs are currently the dominant approach to coreference. However, they suffer from a number of drawbacks including difficulties scaling to large numbers of mentions and limited representational power. As these drawbacks become increasingly restrictive, the need to replace the pairwise approaches with a more expressive, highly scalable alternative is becoming urgent.

Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference}

Monte Carlo MCMC: Efficient Inference by Sampling Factors

Abstract Conditional random fields and other graphical models have achieved state of the art resu... more Abstract Conditional random fields and other graphical models have achieved state of the art results in a variety of NLP and IE tasks including coreference and relation extraction. Increasingly, practitioners are using models with more complex structure��higher tree-width, larger fanout, more features, and more data��rendering even approximate inference methods such as MCMC inefficient.

A Demographic Analysis of Online Sentiment during Hurricane Irene

Abstract We examine the response to the recent natural disaster Hurricane Irene on Twitter. com. ... more Abstract We examine the response to the recent natural disaster Hurricane Irene on Twitter. com. We collect over 65,000 Twitter messages relating to Hurricane Irene from August 18th to August 31st, 2011, and group them by location and gender. We train a sentiment classifier to categorize messages based on level of concern, and then use this classifier to investigate demographic differences.

Human-Machine Cooperation with Epistemological DBs: Supporting User Corrections to Knowledge Bases

Abstract Knowledge bases (KB) provide support for real-world decision making by exposing data in ... more Abstract Knowledge bases (KB) provide support for real-world decision making by exposing data in a structured format. However, constructing knowledge bases requires gathering data from many heterogeneous sources. Manual efforts for this task are accurate, but lack scalability, and automated approaches provide good coverage, but are not reliable enough for realworld decision makers to trust.

SampleRank: Learning Preferences with Atomic Gradients

Large templated factor graphs with complex structure that changes during infer- ence have been s... more Large templated factor graphs with complex structure that changes during infer-
ence have been shown to provide state-of-the-art experimental results on tasks
such as identity uncertainty and information integration. However, learning pa-
rameters in these models is difﬁcult because computing the gradients require ex-
pensive inference routines. In this paper we propose an online algorithm that
instead learns preferences over hypotheses from the gradients between the atomic
steps of inference. Although there are a combinatorial number of ranking con-
straints over the entire hypothesis space, a connection to the frameworks of sam-
pled convex programs reveals a polynomial bound on the number of rankings
that need to be satisﬁed in practice. We further apply ideas of passive aggressive
algorithms to our update rules, enabling us to extend recent work in conﬁdence-
weighted classiﬁcation to structured prediction problems. We compare our algo-
rithm to structured perceptron, contrastive divergence, and persistent contrastive
divergence, demonstrating substantial error reductions on two real-world prob-
lems (20% over contrastive divergence).

Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Large, relational factor graphs with structure defined by first-order logic or other language... more Large, relational factor graphs with structure defined by
first-order logic or other languages give rise to notoriously
difficult inference problems. Because unrolling the structure
necessary to represent distributions over all hypotheses has
exponential blow-up, solutions are often derived from MCMC. However,
because of limitations in the design and parameterization of the
jump function, these sampling-based methods suffer from local
minima|the system must transition through lower-scoring
configurations before arriving at a better MAP solution. This paper
presents a new method of explicitly selecting fruitful downward
jumps by leveraging reinforcement learning (RL). Rather than setting
parameters to maximize the likelihood of the training data,
parameters of the factor graph are treated as a log-linear function
approximator and learned with temporal difference (TD); MAP
inference is performed by executing the resulting policy on held out
test data. Our method allows efficient gradient updates since only
factors in the neighborhood of variables affected by an action need
to be computed|we bypass the need to compute marginals entirely.
Our method provides dramatic empirical success, producing new
state-of-the-art results on a complex joint model of ontology
alignment, with a 48\% reduction in error over state-of-the-art in
that domain.

FACTORIE: Efficient Probabilistic Programming for Relational Factor Graphs via Imperative Declarations of Structure, Inference and Learning

Reinforcement Learning for MAP Inference in Large Factor Graphs

We present a framework for solving stuctured prediction problems in large factor graphs using... more We present a framework for solving stuctured prediction problems in
large factor graphs using reinforcement learning. We formulate MAP
inference as an optimization problem in the output configuration
space and use reinforcement learning (RL) to learn an optimal policy
that identifies a sequence of transitions in the configuration
manifold that transforms an arbitrary point in the feasible region
into the MAP configuration. In our RL treatment of this problem the
delayed reward is a measure of the residual performance improvement
between configurations as the system transitions through the
configuration space. We propose two approaches. The first approach
uses the ground truth signal during training to learn a linear
function approximator that generalizes to a novel testing set. In
this scenario, MAP inference is just a matter of performing policy
search on the learned value function. In the second approach, we use
the ground truth signal to learn a reward function directly from the
training set; on held-out test data, the approximate reward function
is used to guide a traditional reinforcement learning algorithm to
the MAP configuration. In either case, the linear additive function
approximators provide the following advantages: (1) they allow
generalization from a training set to a novel testing set; (2) they
provide a representation compatible with log-linear models such as
conditional random fields (CRFs). We present preliminary results of
on real-world datasets.

A Unified Approach for Schema Matching, Coreference,and Canonicalization

The automatic consolidation of database records from many heterogeneous sources into a single rep... more The automatic consolidation of database records from many
heterogeneous sources into a single repository requires solving
several information integration tasks. Although tasks such as
coreference, schema matching, and canonicalization are closely related, they are most commonly studied in isolation. Systems that do tackle multiple integration problems traditionally solve each independently, allowing errors to propagate from one task to another. In this paper, we describe a discriminatively-trained model that reasons about schema matching, coreference, and canonicalization jointly. We evaluate our model on a real-world data set of people and demonstrate that simultaneously solving these tasks reduces errors over a cascaded or isolated approach. Our experiments show that a joint model is able to improve substantially over systems that either solve each task in isolation or with the conventional cascade. We demonstrate nearly a 50\% error reduction for coreference and a 40\% error reduction for schema matching.

Canonicalization of Database Records using Adaptive Similarity Measures

It is becoming increasingly common to construct databases from information automatically culled f... more It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as {\sl canonicalization}. Despite its importance, there is very little existing work on canonicalization. In this paper, we explore the use of edit distance measures to construct a canonical representation that is ``central'' in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different {\sl styles} of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g. {\sl KDD} versus {\sl Conference on Knowledge Discovery and Data Mining}). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. We empirically evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.

First Order Probabilistic Models for Coreference Resolution

Traditional noun phrase coreference resolution systems represent features only of pairs of noun p... more Traditional noun phrase coreference resolution systems represent features only of pairs of noun phrases. In this paper, we propose a machine learning method that enables features over {\sl sets} of noun phrases, resulting in a first-order probabilistic model for coreference. We outline a set of approximations that make this approach practical, and apply our method to the ACE coreference dataset, achieving a 45\% error reduction over a comparable method that only considers features of pairs of noun phrases. This result demonstrates an example of how a first-order logic representation can be incorporated into a probabilistic model and scaled efficiently.

Context-Sensitive Error Correction: Using Topic Models to Improve OCR

Learning Field Compatibilities to Extract Database Records from Unstructured Text

Named-entity recognition systems extract entities such as people, organizations, and locations fr... more Named-entity recognition systems extract entities such as people, organizations, and locations from unstructured text. Rather than extract these mentions in isolation, this paper presents a {\sl record extraction} system that assembles mentions into records (i.e. database tuples). We construct a probabilistic model of the compatibility between field values, then employ graph partitioning algorithms to cluster fields into cohesive records. We also investigate compatibility functions over {\sl sets} of fields, rather than simply pairs of fields, to examine how higher representational power can impact performance. We apply our techniques to the task of extracting contact records from faculty and student homepages, demonstrating a 53\% error reduction over baseline approaches.

ELERFED: Final Report

by Vladimir Eidelman and Michael Wick

1.3.3 Relation extraction (Jian Su) . . . . . . . . . . . . . . . . 10 1.4 Intra-document corefer... more 1.3.3 Relation extraction (Jian Su) . . . . . . . . . . . . . . . . 10 1.4 Intra-document coreference . . . . . . . . . . . . . . . . . . . . . 11 1.4.1 The BART toolkit . . . . . . . . . . . . . . . . . . . . . . 11 1.4.2 Machine learning . . . . . . . . . . . . . . . . . . . . . . 12 1.4.3 Extracting Lexical and Commonsense Knowledge . . . . 14 1.5 EvaluationandAnnotation . . . . . . . . . . . . . . . . . . . . . 15 1.5.1 TheACECDCcorpus . . . . . . . . . . . . . . . . . . . 16 1.5.2 The ARRAU IDC corpus . . . . . . . . . . . . . . . . . . 16 1.5.3 ... ... 2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18 2.2 ProblemSetting . . . . . . . . . . . . . . . . . . . . . . . . . . . 19 ...

Distantly labeling data for large scale cross-document coreference

Abstract: Cross-document coreference, the problem of resolving entity mentions across multi-docum... more Abstract: Cross-document coreference, the problem of resolving entity mentions across multi-document collections, is crucial to automated knowledge base construction and data mining tasks. However, the scarcity of large labeled data sets has hindered supervised machine learning research for this task. In this paper we develop and demonstrate an approach based on``distantly-labeling''a data set from which we can train a discriminative cross-document coreference model.

ELERFED: Final Report

SampleRank Tutorial: Leveraging Sparsity for Fast Implementations

This technical note describes how to leverage the sparsity of SampleRank gradients to efficiently... more This technical note describes how to leverage the sparsity of SampleRank gradients to efficiently implement parameter averaging and L2 regularization. What we mean by sparse is that the gradient has very few non-zero elements. SampleRank gradients exhibit extraordinary sparsity even in comparison to other batch, online, and stochastic gradient parameter estimation algorithms—including contrastive divergence [1] and its persistent variants [6, 4]!

SampleRank: training factor graphs with atomic gradients

Abstract We present SampleRank, an alternative to contrastive divergence (CD) for estimating para... more Abstract We present SampleRank, an alternative to contrastive divergence (CD) for estimating parameters in complex graphical models. SampleR-ank harnesses a user-provided loss function to distribute stochastic gradients across an MCMC chain. As a result, parameter updates can be computed between arbitrary MCMC states. SampleRank is not only faster than CD, but also achieves better accuracy in practice (up to 23% error reduction on noun-phrase coreference).

Monte Carlo MCMC: Efficient Inference by Approximate Sampling

Abstract Conditional random fields and other graphical models have achieved state of the art resu... more Abstract Conditional random fields and other graphical models have achieved state of the art results in a variety of tasks such as coreference, relation extraction, data integration, and parsing. Increasingly, practitioners are using models with more complex structure—higher treewidth, larger fan-out, more features, and more data—rendering even approximate inference methods such as MCMC inefficient.

A discriminative hierarchical model for fast coreference at large scale

Abstract Methods that measure compatibility between mention pairs are currently the dominant appr... more Abstract Methods that measure compatibility between mention pairs are currently the dominant approach to coreference. However, they suffer from a number of drawbacks including difficulties scaling to large numbers of mentions and limited representational power. As these drawbacks become increasingly restrictive, the need to replace the pairwise approaches with a more expressive, highly scalable alternative is becoming urgent.

Human Language Technologies 2007: The Conference of the North American Chapter of the Association for Computational Linguistics; Proceedings of the Main Conference}

Monte Carlo MCMC: Efficient Inference by Sampling Factors

Abstract Conditional random fields and other graphical models have achieved state of the art resu... more Abstract Conditional random fields and other graphical models have achieved state of the art results in a variety of NLP and IE tasks including coreference and relation extraction. Increasingly, practitioners are using models with more complex structure��higher tree-width, larger fanout, more features, and more data��rendering even approximate inference methods such as MCMC inefficient.

A Demographic Analysis of Online Sentiment during Hurricane Irene

Abstract We examine the response to the recent natural disaster Hurricane Irene on Twitter. com. ... more Abstract We examine the response to the recent natural disaster Hurricane Irene on Twitter. com. We collect over 65,000 Twitter messages relating to Hurricane Irene from August 18th to August 31st, 2011, and group them by location and gender. We train a sentiment classifier to categorize messages based on level of concern, and then use this classifier to investigate demographic differences.

Human-Machine Cooperation with Epistemological DBs: Supporting User Corrections to Knowledge Bases

Abstract Knowledge bases (KB) provide support for real-world decision making by exposing data in ... more Abstract Knowledge bases (KB) provide support for real-world decision making by exposing data in a structured format. However, constructing knowledge bases requires gathering data from many heterogeneous sources. Manual efforts for this task are accurate, but lack scalability, and automated approaches provide good coverage, but are not reliable enough for realworld decision makers to trust.

SampleRank: Learning Preferences with Atomic Gradients

Large templated factor graphs with complex structure that changes during infer- ence have been s... more Large templated factor graphs with complex structure that changes during infer-
ence have been shown to provide state-of-the-art experimental results on tasks
such as identity uncertainty and information integration. However, learning pa-
rameters in these models is difﬁcult because computing the gradients require ex-
pensive inference routines. In this paper we propose an online algorithm that
instead learns preferences over hypotheses from the gradients between the atomic
steps of inference. Although there are a combinatorial number of ranking con-
straints over the entire hypothesis space, a connection to the frameworks of sam-
pled convex programs reveals a polynomial bound on the number of rankings
that need to be satisﬁed in practice. We further apply ideas of passive aggressive
algorithms to our update rules, enabling us to extend recent work in conﬁdence-
weighted classiﬁcation to structured prediction problems. We compare our algo-
rithm to structured perceptron, contrastive divergence, and persistent contrastive
divergence, demonstrating substantial error reductions on two real-world prob-
lems (20% over contrastive divergence).

Training Factor Graphs with Reinforcement Learning for Efficient MAP Inference

Large, relational factor graphs with structure defined by first-order logic or other language... more Large, relational factor graphs with structure defined by
first-order logic or other languages give rise to notoriously
difficult inference problems. Because unrolling the structure
necessary to represent distributions over all hypotheses has
exponential blow-up, solutions are often derived from MCMC. However,
because of limitations in the design and parameterization of the
jump function, these sampling-based methods suffer from local
minima|the system must transition through lower-scoring
configurations before arriving at a better MAP solution. This paper
presents a new method of explicitly selecting fruitful downward
jumps by leveraging reinforcement learning (RL). Rather than setting
parameters to maximize the likelihood of the training data,
parameters of the factor graph are treated as a log-linear function
approximator and learned with temporal difference (TD); MAP
inference is performed by executing the resulting policy on held out
test data. Our method allows efficient gradient updates since only
factors in the neighborhood of variables affected by an action need
to be computed|we bypass the need to compute marginals entirely.
Our method provides dramatic empirical success, producing new
state-of-the-art results on a complex joint model of ontology
alignment, with a 48\% reduction in error over state-of-the-art in
that domain.

FACTORIE: Efficient Probabilistic Programming for Relational Factor Graphs via Imperative Declarations of Structure, Inference and Learning

Reinforcement Learning for MAP Inference in Large Factor Graphs

We present a framework for solving stuctured prediction problems in large factor graphs using... more We present a framework for solving stuctured prediction problems in
large factor graphs using reinforcement learning. We formulate MAP
inference as an optimization problem in the output configuration
space and use reinforcement learning (RL) to learn an optimal policy
that identifies a sequence of transitions in the configuration
manifold that transforms an arbitrary point in the feasible region
into the MAP configuration. In our RL treatment of this problem the
delayed reward is a measure of the residual performance improvement
between configurations as the system transitions through the
configuration space. We propose two approaches. The first approach
uses the ground truth signal during training to learn a linear
function approximator that generalizes to a novel testing set. In
this scenario, MAP inference is just a matter of performing policy
search on the learned value function. In the second approach, we use
the ground truth signal to learn a reward function directly from the
training set; on held-out test data, the approximate reward function
is used to guide a traditional reinforcement learning algorithm to
the MAP configuration. In either case, the linear additive function
approximators provide the following advantages: (1) they allow
generalization from a training set to a novel testing set; (2) they
provide a representation compatible with log-linear models such as
conditional random fields (CRFs). We present preliminary results of
on real-world datasets.

A Unified Approach for Schema Matching, Coreference,and Canonicalization

The automatic consolidation of database records from many heterogeneous sources into a single rep... more The automatic consolidation of database records from many
heterogeneous sources into a single repository requires solving
several information integration tasks. Although tasks such as
coreference, schema matching, and canonicalization are closely related, they are most commonly studied in isolation. Systems that do tackle multiple integration problems traditionally solve each independently, allowing errors to propagate from one task to another. In this paper, we describe a discriminatively-trained model that reasons about schema matching, coreference, and canonicalization jointly. We evaluate our model on a real-world data set of people and demonstrate that simultaneously solving these tasks reduces errors over a cascaded or isolated approach. Our experiments show that a joint model is able to improve substantially over systems that either solve each task in isolation or with the conventional cascade. We demonstrate nearly a 50\% error reduction for coreference and a 40\% error reduction for schema matching.

Canonicalization of Database Records using Adaptive Similarity Measures

It is becoming increasingly common to construct databases from information automatically culled f... more It is becoming increasingly common to construct databases from information automatically culled from many heterogeneous sources. For example, a research publication database can be constructed by automatically extracting titles, authors, and conference information from papers and their references. A common difficulty in consolidating data from multiple sources is that records are referenced in a variety of ways (e.g. abbreviations, aliases, and misspellings). Therefore, it can be difficult to construct a single, standard representation to present to the user. We refer to the task of constructing this representation as {\sl canonicalization}. Despite its importance, there is very little existing work on canonicalization. In this paper, we explore the use of edit distance measures to construct a canonical representation that is ``central'' in the sense that it is most similar to each of the disparate records. This approach reduces the impact of noisy records on the canonical representation. Furthermore, because the user may prefer different {\sl styles} of canonicalization, we show how different edit distance costs can result in different forms of canonicalization. For example, reducing the cost of character deletions can result in representations that favor abbreviated forms over expanded forms (e.g. {\sl KDD} versus {\sl Conference on Knowledge Discovery and Data Mining}). We describe how to learn these costs from a small amount of manually annotated data using stochastic hill-climbing. Additionally, we investigate feature-based methods to learn ranking preferences over canonicalizations. We empirically evaluate our approach on a real-world publications database and show that our learning method results in a canonicalization solution that is robust to errors and easily customizable to user preferences.

First Order Probabilistic Models for Coreference Resolution

Traditional noun phrase coreference resolution systems represent features only of pairs of noun p... more Traditional noun phrase coreference resolution systems represent features only of pairs of noun phrases. In this paper, we propose a machine learning method that enables features over {\sl sets} of noun phrases, resulting in a first-order probabilistic model for coreference. We outline a set of approximations that make this approach practical, and apply our method to the ACE coreference dataset, achieving a 45\% error reduction over a comparable method that only considers features of pairs of noun phrases. This result demonstrates an example of how a first-order logic representation can be incorporated into a probabilistic model and scaled efficiently.

Context-Sensitive Error Correction: Using Topic Models to Improve OCR

Learning Field Compatibilities to Extract Database Records from Unstructured Text

Named-entity recognition systems extract entities such as people, organizations, and locations fr... more Named-entity recognition systems extract entities such as people, organizations, and locations from unstructured text. Rather than extract these mentions in isolation, this paper presents a {\sl record extraction} system that assembles mentions into records (i.e. database tuples). We construct a probabilistic model of the compatibility between field values, then employ graph partitioning algorithms to cluster fields into cohesive records. We also investigate compatibility functions over {\sl sets} of fields, rather than simply pairs of fields, to examine how higher representational power can impact performance. We apply our techniques to the task of extracting contact records from faculty and student homepages, demonstrating a 53\% error reduction over baseline approaches.