
    Eric Bax

    Over the summer of 1989 I worked as an undergraduate research student with Dr. Hayden Porter through a COSEN grant funded by the Pew Memorial Trust on a project entitled "Understanding Chaos." I studied some of the history of research in the area of nonlinear systems, and my studies so far have concentrated mainly upon the logistic equation: its behavior on both the real line and the complex plane; its properties, including scaling factors among some of the structures that arise within its domain; and how it can serve as a simplified model for studies of chaos in general. This paper is a report on work still in progress, a review of some of the literature on nonlinear dynamics, and a documentation of some of our results.
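
As a quick illustration of the object of study, here is a minimal Python sketch (the function name and parameter choices are mine, not from the report) that iterates the logistic map x_{t+1} = r x_t (1 - x_t) and shows the period-doubling route to chaos as r grows:

```python
# Iterate the logistic map at several values of r: a fixed point,
# a 2-cycle, a 4-cycle, and chaotic behavior.

def logistic_orbit(r, x0=0.5, warmup=500, keep=8):
    """Iterate x -> r*x*(1-x) and return `keep` post-transient values."""
    x = x0
    for _ in range(warmup):          # discard the transient
        x = r * x * (1 - x)
    orbit = []
    for _ in range(keep):
        x = r * x * (1 - x)
        orbit.append(round(x, 4))
    return orbit

for r in (2.8, 3.2, 3.5, 3.9):       # fixed point, 2-cycle, 4-cycle, chaos
    print(f"r = {r}: {logistic_orbit(r)}")
```
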
    In markets for online advertising, advertisers may post bids that they pay only when a user responds to an ad. Market-makers estimate response rates for each ad and multiply by the bid to estimate expected revenue for showing the ad. For each advertising opportunity, called an ad call, the market-maker selects an ad that maximizes estimated expected revenue. Actual revenue deviates from estimated expected revenue for two reasons: (a) uncertainty introduced by errors in estimation of response rates and (b) random fluctuations in response rates from their expected values. This paper outlines a method to allocate a set of ad calls over a set of ads. The method mediates a tradeoff between maximizing estimated expected revenue for publishers and minimizing estimated variance for that revenue. The method accounts for uncertainty as well as randomness as sources of variability. The paper also demonstrates the surprising result that using portfolio allocation to reduce variance can also increase expected revenue.
    One way to estimate a statistic over a large data set is to draw a sample consisting of some records from the data set, and compute the statistic over the sample as an estimate of the statistic over the data set. This procedure may fail to produce an accurate estimate. Using one sample for multiple statistics reduces computation and latency, but it can increase the probability of multiple failures to produce accurate estimates, because estimates based on the same sample may not have independent failure probabilities. We show how to bound the probability of multiple failures for sequences of estimates over one or more samples.
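
To make the failure-bounding idea concrete, here is a hedged sketch, assuming Hoeffding-style per-estimate failure probabilities for means of values in [0, 1]; the key point is that a union bound remains valid even when estimates share a sample and are dependent:

```python
import math

def hoeffding_failure_prob(n, eps):
    """P(|sample mean - true mean| > eps) <= 2 exp(-2 n eps^2)."""
    return 2 * math.exp(-2 * n * eps * eps)

def union_bound_any_failure(ns, epss):
    """Upper bound on P(at least one estimate fails). The union bound
    needs no independence, so it holds even when the estimates are
    computed from the same sample and their failures are dependent."""
    return min(1.0, sum(hoeffding_failure_prob(n, e) for n, e in zip(ns, epss)))

# Three statistics over one shared sample of 10,000 records,
# each estimated to within tolerance 0.02:
print(union_bound_any_failure([10_000] * 3, [0.02] * 3))
```
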
    We develop an algorithm for the traveling salesman problem by applying finite differences to a generating function. This algorithm requires polynomial space. In comparison, a dynamic programming algorithm requires exponential space. Also, the finite-difference algorithm requires less space than a similar inclusion and exclusion algorithm.
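
The following sketch (my own illustrative reduction, not the paper's full algorithm) shows the core identity at work for the related problem of counting Hamiltonian paths: applying an n-th finite difference over vertex subsets to a walk-counting function takes exponential time but only polynomial space, since each term is recomputed rather than stored.

```python
from itertools import combinations

def count_walks(adj, allowed, length):
    """Number of walks of the given edge-length that stay inside `allowed`,
    computed by iterating a vector of walk counts (polynomial space)."""
    counts = {v: 1 for v in allowed}            # walks of length 0
    for _ in range(length):
        counts = {v: sum(counts[u] for u in allowed if adj[u][v])
                  for v in allowed}
    return sum(counts.values())

def hamiltonian_paths(adj):
    """Finite differences over vertex subsets: signed sums of walk counts
    cancel every walk that repeats a vertex, leaving exactly the walks on
    n distinct vertices, i.e., the Hamiltonian paths."""
    n = len(adj)
    total = 0
    for k in range(n + 1):
        for subset in combinations(range(n), k):
            total += (-1) ** (n - k) * count_walks(adj, subset, n - 1)
    return total

# Complete graph on 4 vertices: 4! = 24 directed Hamiltonian paths.
K4 = [[i != j for j in range(4)] for i in range(4)]
print(hamiltonian_paths(K4))  # 24
```
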
    We introduce a technique to compute probably approximately correct (PAC) bounds on precision and recall for matching algorithms. The bounds require some verified matches, but those matches may be used to develop the algorithms. The bounds can be applied to network reconciliation or entity resolution algorithms, which identify nodes in different networks or values in a data set that correspond to the same entity. For network reconciliation, the bounds do not require knowledge of the network generation process.
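
As a hedged illustration of what such a bound can look like, here is a simple Hoeffding-style lower bound on precision computed from a uniform random sample of the algorithm's proposed matches, each verified by hand; the paper's technique may differ:

```python
import math

def precision_lower_bound(num_verified, num_correct, delta):
    """With probability at least 1 - delta over the random sample of
    verified matches, true precision >= sample precision
    - sqrt(ln(1/delta) / (2 m)), by Hoeffding's inequality."""
    m = num_verified
    return num_correct / m - math.sqrt(math.log(1 / delta) / (2 * m))

# 400 verified matches, 376 correct, at 95% confidence:
print(round(precision_lower_bound(400, 376, 0.05), 3))
```

A recall bound works the same way over a verified sample of true matches rather than proposed matches.
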
    A data sketch algorithm scans a big data set, collecting a small amount of data (the sketch) that can be used to statistically infer properties of the big data set. Some data sketch algorithms take a fixed-size random sample of a big data set, and use that sample to infer frequencies of items that meet various criteria in the big data set. This paper shows how to statistically infer probably approximately correct (PAC) bounds for those frequencies, efficiently and precisely enough that the frequency bounds are either sharp or off by only one, which is the best possible result without exact computation.
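
One way to compute such a bound, sketched here under the assumption of a fixed-size sample drawn uniformly without replacement, is to invert the exact hypergeometric tail; the paper's method is presumably more efficient, and this brute-force search just illustrates the inference:

```python
import math

def tail_at_most(N, K, s, k):
    """P(sample count <= k) when s records are drawn without replacement
    from a data set of N records, K of which meet the criterion."""
    return sum(math.comb(K, i) * math.comb(N - K, s - i)
               for i in range(k + 1)) / math.comb(N, s)

def frequency_upper_bound(N, s, k, delta):
    """Largest data-set count K consistent, at confidence 1 - delta, with
    observing k matching records in the sample. The tail probability is
    decreasing in K, so a linear search upward suffices."""
    K = k
    while K < N and tail_at_most(N, K + 1, s, k) >= delta:
        K += 1
    return K

# Sample 100 of 1,000 records; 8 sampled records meet the criterion:
print(frequency_upper_bound(N=1_000, s=100, k=8, delta=0.05))
```
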
    We propose a system for privacy-aware machine learning. The data provider encodes each record in a way that avoids revealing information about the record’s field values or about the ordering of values from different records. A service provider stores the encoded records and uses them to perform classification on queries consisting of encoded input field values. The encoding provides privacy for the data provider from the service provider and from a third party issuing unauthorized queries. But the encoding makes regression-based and many tree-based classifiers impossible to implement. It does allow histogram-type classifiers that are based on category membership, and we present one such classification method that ensures data sufficiency on a per-classification basis.
    In quasi-proportional auctions, each bidder receives a fraction of the allocation equal to the weight of their bid divided by the sum of weights of all bids, where each bid's weight is determined by a weight function. We study the relationship between the weight function, bidders' private values, number of bidders, and the seller's revenue in equilibrium. It has been shown that if one bidder has a much higher private value than the others, then a nearly flat weight function maximizes revenue. Essentially, threatening the bidder who has the highest valuation with having to share the allocation maximizes the revenue. We show that as bidder private values approach parity, steeper weight functions maximize revenue by making the quasi-proportional auction more like a winner-take-all auction. We also show that steeper weight functions maximize revenue as the number of bidders increases. For flatter weight functions, there is known to be a unique pure-strategy Nash equilibrium.
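
A small sketch of the allocation rule, using an assumed power weight function w(b) = b^p, which is nearly flat for small p and approaches winner-take-all as p grows:

```python
def allocations(bids, p):
    """Quasi-proportional allocation: bidder i gets w(b_i) / sum_j w(b_j),
    here with the power weight function w(b) = b**p."""
    weights = [b ** p for b in bids]
    total = sum(weights)
    return [w / total for w in weights]

bids = [10.0, 4.0, 1.0]
for p in (0.1, 1.0, 5.0):   # nearly flat -> proportional -> near winner-take-all
    print(p, [round(a, 3) for a in allocations(bids, p)])
```
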
    Network reconciliation is the problem of identifying nodes in separate networks that represent the same entity, for example matching nodes across social networks that correspond to the same user. We introduce a technique to compute probably approximately correct (PAC) bounds on precision and recall for network reconciliation algorithms. The bounds require some verified matches, but those matches may be used to develop the algorithms. The bounds do not require knowledge of the network generation process, and they can supply confidence levels for individual matches.
    For an ensemble classifier that is composed of classifiers selected from a hypothesis set of classifiers, and that selects one of its constituent classifiers at random to use for each classification, we present ensemble error bounds consisting of the average of error bounds for the individual classifiers in the ensemble, a term that depends on the fraction of hypothesis classifiers selected for the ensemble, and a small constant term and multiplier. There is no penalty for using a richer hypothesis set if the same fraction of the hypothesis classifiers is selected for the ensemble.
    In markets for online advertising, some advertisers pay only when users respond to ads. So publishers estimate ad response rates and multiply by advertiser bids to estimate expected revenue for showing ads. Since these estimates may be inaccurate, the publisher risks not selecting the ad for each ad call that would maximize revenue. The variance of revenue can be decomposed into two components: variance due to 'uncertainty', because the true response rate is unknown, and variance due to 'randomness', because realized response statistics fluctuate around the true response rate. Over a sequence of many ad calls, the variance due to randomness nearly vanishes due to the law of large numbers. However, the variance due to uncertainty doesn't diminish. We introduce a technique for ad selection that augments existing estimation and explore-exploit methods. The technique uses methods from portfolio optimization to produce a distribution over ads rather than selecting a single ad.
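
A hedged sketch of the decomposition, using the law of total variance under an assumed illustrative model: the true response rate theta is uncertain with some prior, and given theta each of m ad calls converts independently.

```python
def revenue_variance(bid, m, theta_mean, theta_var):
    """Variance of total revenue over m ad calls at a fixed bid.

    randomness:  E[ Var(revenue | theta) ] = bid^2 * m   * E[theta (1 - theta)]
    uncertainty: Var( E[revenue | theta] ) = bid^2 * m^2 * Var(theta)
    """
    e_theta_sq = theta_var + theta_mean ** 2
    randomness = bid ** 2 * m * (theta_mean - e_theta_sq)
    uncertainty = bid ** 2 * m ** 2 * theta_var
    return randomness, uncertainty

# Per ad call (dividing by m^2), randomness variance shrinks like 1/m,
# but uncertainty variance does not diminish -- the abstract's point.
for m in (10, 1_000, 100_000):
    r, u = revenue_variance(bid=1.0, m=m, theta_mean=0.02, theta_var=1e-4)
    print(m, round(r / m**2, 8), round(u / m**2, 8))
```
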
    A media platform's policy on obtrusive ads mediates an effectiveness-nuisance tradeoff. Allowing more obtrusive advertising can increase the effectiveness of ads, so the platform can elicit more short-term revenue from advertisers, but the nuisance to viewers can decrease their engagement over time, which decreases the platform's opportunity for future revenue. To optimize long-term revenue, a platform can use a combination of advertiser bids and ad impact on user experience to price and allocate ad space. We study the conditions for advertisers, viewers, and the platform to simultaneously benefit from using ad impact on user experience as a criterion for ad selection and pricing. It is important for advertisers to benefit, because media platforms compete with one another for advertisers. Our results show that platforms with more advertisers competing for ad space are more likely to generate increased profits for themselves and their advertisers by introducing ad impact on user experience into ad selection and pricing.
    We extend Hoeffding bounds to develop superior probabilistic performance guarantees for accurate classifiers. The original Hoeffding bounds on classifier accuracy depend on the accuracy itself as a parameter. Since the accuracy is not known a priori, the parameter value that gives the weakest bounds is used. We present a method that loosely bounds the accuracy using the old method and uses the loose bound as an improved parameter value for tighter bounds. We show how to use the bounds in practice, and we generalize the bounds for individual classifiers to form uniform bounds over multiple classifiers.
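
A hedged sketch of the bootstrapping idea, using standard inequalities chosen for illustration (an additive Hoeffding bound for the first stage and a Bernstein-style bound, whose variance parameter the first stage supplies, for the second); the paper's exact bounds may differ.

```python
import math

def two_stage_upper_bound(err_hat, m, delta):
    """Upper bound on the true error rate, holding with prob >= 1 - delta
    by a union bound over the two stages."""
    log_term = math.log(2 / delta)
    # Stage 1: loose, parameter-free additive Hoeffding bound at delta/2.
    p1 = err_hat + math.sqrt(log_term / (2 * m))
    assert p1 <= 0.5, "sketch assumes an accurate classifier (p1 <= 1/2)"
    # Stage 2: Bernstein-style bound at delta/2, with the Bernoulli
    # variance p(1-p) capped by p1(1-p1), valid since p <= p1 <= 1/2.
    variance_cap = p1 * (1 - p1)
    p2 = (err_hat + math.sqrt(2 * variance_cap * log_term / m)
          + 2 * log_term / (3 * m))
    return min(p1, p2)

# 5% test error on 10,000 examples at 95% confidence:
print(round(two_stage_upper_bound(0.05, 10_000, 0.05), 4))
```

For accurate classifiers the second-stage bound is tighter than the first, which is the gain the abstract describes.
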
    We compare and contrast two approaches to validating a trained classifier while using all in-sample data for training. One is simultaneous validation over an organized set of hypotheses (SVOOSH), the well-known method that began with VC theory. The other is withhold and gap (WAG). WAG withholds a validation set, trains a holdout classifier on the remaining data, uses the validation data to validate that classifier, then adds the rate of disagreement between the holdout classifier and one trained using all in-sample data, which is an upper bound on the difference in error rates. We show that complex hypothesis classes and limited training data can make WAG a favorable alternative.
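
A minimal sketch of WAG under these definitions; `train` is a hypothetical stand-in for any learning procedure, and the disagreement (gap) term is measured here on the available inputs as an approximation:

```python
import math

def wag_error_upper_bound(train, X, y, num_withheld, delta):
    """Validation error of the holdout classifier, plus a Hoeffding term,
    plus the holdout/full disagreement rate (which upper-bounds the
    difference in error rates)."""
    X_train, y_train = X[:-num_withheld], y[:-num_withheld]
    X_val, y_val = X[-num_withheld:], y[-num_withheld:]

    holdout_clf = train(X_train, y_train)   # trained without validation data
    full_clf = train(X, y)                  # trained on all in-sample data

    m = len(X_val)
    val_error = sum(holdout_clf(x) != t for x, t in zip(X_val, y_val)) / m
    eps = math.sqrt(math.log(1 / delta) / (2 * m))   # Hoeffding term
    gap = sum(holdout_clf(x) != full_clf(x) for x in X) / len(X)
    return val_error + eps + gap

# Toy usage with a majority-class learner (hypothetical stand-in):
def train_majority(X, y):
    majority = max(set(y), key=y.count)
    return lambda x: majority

X = [[i] for i in range(200)]
y = [i % 4 == 0 for i in range(200)]
print(round(wag_error_upper_bound(train_majority, X, y,
                                  num_withheld=50, delta=0.05), 3))
```
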
    We show that $k$-nearest neighbor classifiers, in spite of their famously fractured decision boundaries, have exponential error bounds with nearly O($n^{-\frac{1}{2}}$) Gaussian-style bound ranges, similar to error bounds based on VC dimension for other types of classifiers that have simpler decision boundaries. Specifically, we present an exponential PAC error bound for $k$-nearest neighbor classifiers that has O($n^{-\frac{1}{2}}\sqrt{(k + \ln n)(\ln \ln n + \frac{1}{\delta})}$) error bound range, for $n$ in-sample examples and bound failure probability $\delta$.
    In a second-price auction with i.i.d. (independent identically distributed) bidder valuations, adding bidders increases expected buyer surplus if the distribution of valuations has a sufficiently heavy right tail. While this does not imply that a bidder in an auction should prefer for more bidders to join the auction, it does imply that a bidder should accept more bidders per auction in exchange for being allowed to participate in more auctions. Also, for a heavy-tailed valuation distribution, marginal expected seller revenue per added bidder remains strong even when there are already many bidders.
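
A quick Monte Carlo illustration (not the paper's analysis), using Pareto valuations with tail index alpha = 1.5 as an assumed heavy-tailed distribution; expected buyer surplus is the gap between the highest and second-highest valuations:

```python
import random

def expected_surplus(num_bidders, alpha, trials=100_000, seed=0):
    """Monte Carlo estimate of E[highest - second highest valuation]
    in a second-price auction with i.i.d. Pareto(alpha) valuations."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        vals = sorted(rng.paretovariate(alpha) for _ in range(num_bidders))
        total += vals[-1] - vals[-2]   # winner pays the second-highest value
    return total / trials

# With a heavy right tail, surplus tends to grow as bidders are added:
for n in (2, 4, 8, 16):
    print(n, round(expected_surplus(n, alpha=1.5), 3))
```
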
    This note describes how to collect charges for ad impact on user experience. The charge may be per-view, to account for impact on user experience from viewing an ad, or per-click, to account for impact from clicking on the ad. The results for per-click charges also apply to per-conversion charges or per-action charges. Conceivably, a marketplace could assess both kinds of charges.
    Today, web-based companies use user data to provide and enhance services to users, both individually and collectively. Some also analyze user data for other purposes, for example to select advertisements or price offers for users. Some even use or allow the data to be used to evaluate investments in financial markets. Users' concerns about how their data is or may be used have prompted legislative action in the European Union and congressional questioning in the United States. But data can also benefit society, for example giving early warnings for disease outbreaks, allowing in-depth study of relationships between genetics and disease, and elucidating local and macroeconomic trends in a timely manner. So, instead of just a focus on privacy, in the future, users may insist that their data be used on their behalf. We explore potential frameworks for groups of consenting, informed users to pool their data for their own benefit and that of society, discussing directions, challenges, ...
    We improve error bounds based on VC analysis for classes with sets of similar classifiers. We apply the new error bounds to separating planes and artificial neural networks.
    Quality data is a fundamental contributor to success in statistics and machine learning. If a statistical assessment or machine learning leads to decisions that create value, data contributors may want a share of that value. This paper presents methods to assess the value of individual data samples, and of sets of samples, in order to apportion value among different data contributors. We use Shapley values for individual samples and Owen values for combined samples, and show that these values can be computed in polynomial time even though their definitions have numbers of terms that are exponential in the number of samples.
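
For intuition, here is the generic permutation-sampling estimator of per-sample Shapley values, with a toy diminishing-returns utility standing in for the real value function; the paper computes the values exactly in polynomial time, which this sketch does not attempt:

```python
import random

def shapley_estimates(samples, utility, num_permutations=200, seed=0):
    """Monte Carlo Shapley value of each sample for the given utility:
    average each sample's marginal contribution over random orderings."""
    rng = random.Random(seed)
    n = len(samples)
    values = [0.0] * n
    for _ in range(num_permutations):
        order = list(range(n))
        rng.shuffle(order)
        coalition = []
        prev = utility([samples[i] for i in coalition])
        for i in order:
            coalition.append(i)
            cur = utility([samples[j] for j in coalition])
            values[i] += cur - prev
            prev = cur
    return [v / num_permutations for v in values]

# Toy utility: the first three samples contributed are worth 1 each,
# later ones nothing. By symmetry each Shapley value is 3/5 = 0.6,
# and the estimates should land near that.
data = ["a", "b", "c", "d", "e"]
print([round(v, 2) for v in shapley_estimates(data, lambda s: min(len(s), 3))])
```
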
    For many problem instances, the inclusion and exclusion formula has many cancellations and symmetries. By imposing a hierarchy on the formula's terms, we develop general reductions for inclusion and exclusion algorithms. We apply these reductions to an algorithm which counts Hamiltonian paths, and we develop a branch and bound algorithm to detect Hamiltonian paths. Then we show how to apply the reductions to other inclusion and exclusion algorithms.
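
A small sketch of the detection side: a backtracking search for a Hamiltonian path with a simple branch-and-bound pruning rule; the pruning rule here is my own illustration, not the paper's reductions.

```python
def has_hamiltonian_path(adj):
    """Backtracking search; adj is a symmetric boolean adjacency matrix."""
    n = len(adj)

    def can_still_enter_all(end, unvisited):
        # Bound: prune if some unvisited vertex can no longer be entered;
        # its only possible predecessors in any completion are the current
        # path end or another unvisited vertex.
        for v in unvisited:
            if not adj[end][v] and not any(adj[u][v] for u in unvisited if u != v):
                return False
        return True

    def extend(end, unvisited):
        if not unvisited:
            return True
        if not can_still_enter_all(end, unvisited):
            return False           # cut off this branch
        for v in list(unvisited):  # branch on each feasible next vertex
            if adj[end][v]:
                unvisited.remove(v)
                if extend(v, unvisited):
                    return True
                unvisited.add(v)
        return False

    return any(extend(s, set(range(n)) - {s}) for s in range(n))

# The path graph 0-1-2-3 has a Hamiltonian path; the star K_{1,3} does not.
path4 = [[abs(i - j) == 1 for j in range(4)] for i in range(4)]
star = [[(i == 0) != (j == 0) for j in range(4)] for i in range(4)]
print(has_hamiltonian_path(path4), has_hamiltonian_path(star))  # True False
```
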
    This paper presents a series of PAC exponential error bounds for $k$-nearest neighbors classifiers, with O($n^{-\frac{r}{2r+1}}\sqrt{k \ln n}$) error bound range for each integer $r>0$, where $n$ is the number of in-sample examples. This shows that $k$-nn classifiers, in spite of their famously fractured decision boundaries, come close to having Gaussian-style exponential error bounds with O($n^{-\frac{1}{2}}$) bound ranges.
    Many networks grow by adding successive cohorts – layers of nodes. Often, the nodes in each layer are selected independently of each other, but from a distribution that can depend on which nodes were selected for previous cohorts. For example, successive waves of friends invite their friends to join social networks. We present error bounds for collective classification over these networks.
    Some online advertising offers pay only when an ad elicits a response. Randomness and uncertainty about response rates make showing those ads a risky investment for online publishers. Like financial investors, publishers can use portfolio allocation over multiple advertising offers to pursue revenue while controlling risk. Allocations over multiple offers do not have a distinct winner and runner-up, so the usual second-price mechanism does not apply. This paper develops a pricing mechanism for portfolio allocations. The mechanism is efficient, truthful, and rewards offers that reduce risk.
    For a voting ensemble that selects an odd-sized subset of the ensemble classifiers at random for each example, applies them to the example, and returns the majority vote, we show that any number of voters may minimize the error rate over an out-of-sample distribution. The optimal number of voters depends on the out-of-sample distribution of the number of classifiers in error. To select the number of voters to use, estimating that distribution and then inferring error rates for different numbers of voters gives lower-variance estimates than estimating those error rates directly.
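
A sketch of the inference step described above: given the distribution of the number of ensemble classifiers in error on a random example, the exact error rate for v voters (drawn without replacement, v odd) follows from hypergeometric majority probabilities. The example distribution below is made up for illustration.

```python
from math import comb

def vote_error_rate(err_dist, M, v):
    """err_dist[k] = P(exactly k of the M ensemble classifiers err on a
    random example); returns the error rate of a majority vote of v
    voters sampled without replacement."""
    error = 0.0
    for k, p_k in err_dist.items():
        # Hypergeometric probability that a majority of the v voters err.
        p_majority = sum(comb(k, i) * comb(M - k, v - i)
                         for i in range((v + 1) // 2, v + 1)) / comb(M, v)
        error += p_k * p_majority
    return error

# 10 classifiers; on most examples few err, on some examples most err:
dist = {1: 0.7, 4: 0.2, 8: 0.1}
for v in (1, 3, 5, 9):
    print(v, round(vote_error_rate(dist, M=10, v=v), 4))
```
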
    We introduce methods to bound the mean of a discrete distribution (or finite population) based on sample data, for random variables with a known set of possible values. In particular, the methods can be applied to categorical data with known category-based values. For small sample sizes, we show how to leverage knowledge of the set of possible values to compute bounds that are stronger than standard concentration inequalities for general random variables.
    We examine methods to estimate the average and variance of test error rates over a set of classifiers. We begin with the process of drawing a classifier at random for each example. Given validation data, the average test error rate can be estimated as if validating a single classifier. Given the test example inputs, the variance can be computed exactly. Next, we consider the process of drawing a classifier at random and using it on all examples. Once again, the expected test error rate can be validated as if validating a single classifier. However, the variance must be estimated by validating all classifiers, which yields loose or uncertain bounds.
