

28. STATISTICS

Revised April 1998 by F. James (CERN). Updated February 2000 by R. Cousins (UCLA).

A probability density function f(x; α) (p.d.f.) with known parameters α enables us to predict the frequency with which random data x will take on a particular value (if discrete) or lie in a given range (if continuous). Here we are concerned with the inverse problem, that of making inferences about α from a set of actual observations. Such inferences are part of a larger subject variously known as statistics, statistical inference, or inverse probability.

There are two different approaches to statistical inference, which we may call Frequentist and Bayesian. In the former, the frequency definition of probability (Sec. 27.1) is used, and it is usually meaningless to define a p.d.f. in α (for example, a parameter which is a constant of nature has a value which is fixed). In Frequentist statistics, one can compute confidence intervals as a function of the observed data, and they will contain ("cover") the unknown true value of α a specified fraction of the time in the long run, as defined in Sec. 28.6.

In Bayesian statistics, the concept of probability is not based on limiting frequencies, but is more general and includes degree of belief. With this definition, one may define p.d.f.'s in α, and then inverse probability simply obeys the general rules of probability. Bayesian methods allow for a natural way to input additional information such as physical boundaries and subjective information; in fact they require as input the prior p.d.f. for any parameter to be estimated. Using Bayes' Theorem (Eq. (27.7)), the prior degree of belief is updated by incoming data.

For many inference problems, the Frequentist and Bayesian approaches give the same numerical answers, even though they are based on fundamentally different assumptions. However, for exact results for small samples and for measurements near a physical boundary, the different approaches may yield very different confidence limits, so we are forced to make a choice. There is an enormous amount of literature devoted to the question of Bayesian vs non-Bayesian methods, much of it written by people who are fervent advocates of one or the other methodology, which often leads to exaggerated conclusions. For a reasonably balanced discussion, we recommend the following articles: by a statistician [1], and by a physicist [2]. A more advanced comparison is offered in Ref. 3.

In high energy physics, where experiments are repeatable (at least in principle), the frequentist definition of probability is normally used. However, Bayesian equations are often used to treat uncertainties on luminosity, background, etc. If the result has poor properties from a Frequentist point of view, one should note that the result is not a classical confidence interval.

Frequentist methods cannot provide the probability that a theory is true, or that a parameter has a particular value. (Such probabilities require input of prior belief.) Rather, Frequentist methods calculate probabilities that various data sets are obtained given specified theories or parameters; these frequencies are often calculated by Monte Carlo methods. As described below, confidence intervals are constructed from such frequencies, and therefore do not represent degree of belief.

The Bayesian methodology is particularly well-adapted to decision-making, which requires subjective input not only for prior belief, but also for risk tolerance, etc. Even primarily Frequentist texts such as Ref. 4 outline Bayesian decision theory. However, the usefulness of Bayesian methods as a means for the communication of experimental measurements is controversial.

Recently, the first Workshop on Confidence Limits [5] was held at CERN, where proponents of various statistical methods presented and discussed the issues. One sees that there was not a consensus on the best way to report confidence limits. We recommend the web site and eventual proceedings as a starting point for discussion of these issues. The methods described below use the Frequentist definition of probability, except where noted.

28.1. Parameter estimation [3,4,6-9]

Here we review parametric statistics, in which one desires estimates of the parameters α from a set of actual observations.

A statistic is any function of the data, plus known constants, which does not depend upon any of the unknown parameters. A statistic is a random variable if the data have random errors. An estimator is any statistic whose value (the estimate α̂) is intended as a meaningful guess for the value of the parameter α, or the vector α if there is more than one parameter.

Since we are free to choose any function of the data as an estimator of the parameter α, we will try to choose that estimator which has the best properties. The most important properties are (a) consistency, (b) bias, (c) efficiency, and (d) robustness.

(a) An estimator is said to be consistent if the estimate α̂ converges to the true value α as the amount of data increases. This property is so important that it is possessed by all commonly used estimators.

(b) The bias, b = E(α̂) − α, is the difference between the expectation of the estimates and the true value, where the expectation value is taken over a hypothetical set of similar experiments in which α̂ is constructed the same way. When b = 0 the estimator is said to be unbiased. The bias depends on the chosen metric, i.e., if α̂ is an unbiased estimator of α, then (α̂)² is generally not an unbiased estimator of α². The bias may be due to statistical properties of the estimator or to systematic errors in the experiment. If we can estimate the bias b, we can subtract it from α̂ to obtain a new α̂′ = α̂ − b. However, b may depend upon α or other unknowns, in which case we usually try to choose an estimator which minimizes its average size.

(c) Efficiency is the inverse of the ratio between the variance of the estimates Var(α̂) and the minimum possible value of the variance. Under rather general conditions, the minimum variance is given by the Rao-Cramér-Frechet bound:

  Var_min = [1 + ∂b/∂α]² / I(α) ,  (28.1)

where the information I(α) is

  I(α) = E[ ( ∑_i ∂ ln f(x_i; α)/∂α )² ] .

(Compare with Eq. (28.6) below.) The sum is over all data and b is the bias, if any; the x_i are assumed independent and distributed as f(x_i; α), and the allowed range of x must not depend upon α. The mean-squared error, mse = E[(α̂ − α)²] = V(α̂) + b², is a convenient quantity which combines in the appropriate way the errors due to bias and efficiency.

(d) Robustness is the property of being insensitive to departures from assumptions in the p.d.f. due to such factors as noise.

For some common estimators the above properties are known exactly. More generally, it is always possible to evaluate them by Monte Carlo simulation. Note that they will often depend on the unknown α.

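The Monte Carlo evaluation just mentioned is easy to carry out in practice. The following sketch (not from the text; the Gaussian model, sample size, and seed are arbitrary assumptions) compares the bias of the maximum-likelihood variance estimator, which divides by N, with the unbiased estimator of Eq. (28.3) below, which divides by N − 1:

```python
# A minimal sketch: checking estimator properties (here, bias) by Monte Carlo.
import numpy as np

rng = np.random.default_rng(seed=1)
N, n_experiments = 10, 200_000
true_mu, true_sigma2 = 0.0, 4.0

y = rng.normal(true_mu, np.sqrt(true_sigma2), size=(n_experiments, N))
s2_ml  = y.var(axis=1, ddof=0)   # divides by N: biased, b = -sigma^2/N
s2_unb = y.var(axis=1, ddof=1)   # divides by N-1: unbiased

print("bias of 1/N estimator:     ", s2_ml.mean() - true_sigma2)   # ~ -0.4
print("bias of 1/(N-1) estimator: ", s2_unb.mean() - true_sigma2)  # ~ 0
```
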

28.2. Data with a common mean

Suppose we have a set of N independent measurements y_i, assumed to be unbiased measurements of the same unknown quantity μ with a common, but unknown, variance σ² resulting from measurement error. Then

  μ̂ = (1/N) ∑_{i=1}^{N} y_i ,  (28.2)

  σ̂² = (1/(N−1)) ∑_{i=1}^{N} (y_i − μ̂)²  (28.3)

are unbiased estimators of μ and σ². The variance of μ̂ is σ²/N. If the common p.d.f. of the y_i is Gaussian, these estimates are uncorrelated. Then, for large N, the standard deviation of σ̂ (the "error of the error") is σ/√(2N). Again if the y_i are Gaussian, μ̂ is an efficient estimator for μ. Otherwise the mean is in general not the most efficient estimator. For example, if the y_i follow a double-exponential distribution [∝ exp(−√2 |y − μ|/σ)], the most efficient estimator of the mean is the sample median (the value for which half the y_i lie above and half below). This is discussed in more detail in Ref. 4, Sec. 8.7.

If σ² is known, it does not improve the estimate μ̂, as can be seen from Eq. (28.2); however, if μ is known, one may substitute it for μ̂ in Eq. (28.3) and replace N − 1 by N, to obtain a somewhat better estimator of σ².

If the y_i have different, known variances σ_i², then the weighted average

  μ̂ = (1/w) ∑_{i=1}^{N} w_i y_i ,  (28.4)

where w_i = 1/σ_i² and w = ∑ w_i, is an unbiased estimator for μ with smaller variance than an unweighted average. The standard deviation of μ̂ is 1/√w.

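As a concrete illustration, here is a minimal sketch of Eq. (28.4) with hypothetical measurements and errors:

```python
# Weighted average of Eq. (28.4): w_i = 1/sigma_i^2, mu_hat = sum(w_i y_i)/w.
import numpy as np

y     = np.array([10.1, 9.6, 10.4])   # measurements of the same quantity
sigma = np.array([0.2, 0.5, 0.3])     # their known standard deviations

w_i = 1.0 / sigma**2
w   = w_i.sum()
mu_hat = (w_i * y).sum() / w
print(mu_hat, 1.0 / np.sqrt(w))       # weighted mean and its standard deviation
```
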
28.3. The method of maximum likelihood

28.3.1. Parameter estimation by maximum likelihood:

"From a theoretical point of view, the most important general method of estimation so far known is the method of maximum likelihood" [6]. We suppose that a set of independently measured quantities x_i came from a p.d.f. f(x; α), where α is an unknown set of parameters. The method of maximum likelihood consists of finding the set of values α̂ which maximizes the joint probability density for all the data, given by

  ℒ(α) = ∏_i f(x_i; α) ,  (28.5)

where ℒ is called the likelihood. It is usually easier to work with ln ℒ, and since both are maximized for the same set of α, it is sufficient to solve the likelihood equation

  ∂ ln ℒ / ∂α_n = 0 .  (28.6)

When the solution to Eq. (28.6) is a maximum, it is called the maximum likelihood estimate of α. The importance of the approach is shown by the following proposition, proved in Ref. 3:

  If an efficient estimate α̂ of α exists, the likelihood equation will have a unique solution equal to α̂.

In evaluating ℒ, it is important that any normalization factors in the f's which involve α be included. However, we will only be interested in the maximum of ℒ and in ratios of ℒ at different α's; hence any multiplicative factors which do not involve the parameters we want to estimate may be dropped; this includes factors which depend on the data but not on α. The results of two or more independent experiments may be combined by forming the product of the ℒ's, or the sum of the ln ℒ's.

Most commonly the solution to Eq. (28.6) will be found using a general numerical minimization program such as the CERN program MINUIT [10], which contains considerable code to take account of the many special cases and problems which can arise.

Under a one-to-one change of parameters from α to β = β(α), the maximum likelihood estimate α̂ transforms to β(α̂). That is, the maximum likelihood solution is invariant under a change of parameters. However, many properties of α̂, in particular the bias, are not invariant under a change of parameters.

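The following is a minimal sketch of such a numerical maximum likelihood fit. The text refers to MINUIT; here scipy's general-purpose minimizer stands in for it, and the model (an exponential decay-time distribution f(t; τ) = (1/τ) exp(−t/τ)) and simulated data are assumptions for illustration only:

```python
# Maximum likelihood by numerical minimization of -ln L (MINUIT stand-in).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(seed=2)
t = rng.exponential(scale=2.0, size=500)   # simulated data, true tau = 2

def neg_log_like(params):
    tau = params[0]
    # -ln L = sum over events of -ln f(t_i; tau); the normalization of f
    # involves tau and must be kept (see text).
    return np.sum(np.log(tau) + t / tau)

res = minimize(neg_log_like, x0=[1.0], bounds=[(1e-6, None)])
print("tau_hat =", res.x[0], "(analytic MLE is the sample mean:", t.mean(), ")")
```
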
28.3.2. Uses of ℒ: ℒ(α) is not a p.d.f. for α:

Recall the definition of a probability density function: a function p(α) is a p.d.f. for α if p(α) dα is the probability for α to be within α and α + dα. The likelihood function ℒ(α) is not a p.d.f. for α, so in general it is nonsensical to integrate the likelihood function with respect to its parameter(s).

Consider, for example, the Poisson probability for obtaining n when sampling from a distribution with mean α: f(n; α) = αⁿ exp(−α)/n!. If one obtains n = 3 in a particular experiment, then ℒ(α) = α³ exp(−α)/6. Nothing in the construction of ℒ makes it a probability density, i.e., a function which one can multiply by dα in order to obtain a probability.

In Bayesian theory, one applies Bayes' Theorem to construct the posterior p.d.f. for α by multiplying the prior p.d.f. for α by ℒ:

  p_posterior(α) ∝ ℒ(α) × p_prior(α) .

If the prior p.d.f. is uniform, integrating the posterior p.d.f. may give the appearance of integrating ℒ. But note that the prior p.d.f. crucially provides the density which makes it sensible to multiply by dα to obtain a probability. In non-Bayesian applications, such as those considered in the following subsections, only likelihood ratios are used (or, equivalently, differences in ln ℒ).

Because ℒ is so useful, we strongly encourage publishing it (or enough information to allow the reader to reconstruct it), when practical.

28.3.3. Confidence intervals from the likelihood function:

The covariance matrix V may be estimated from

  V_nm = ( E[ −∂² ln ℒ / ∂α_n ∂α_m ] )⁻¹ .  (28.7)

(Here and below, the superscript −1 indicates matrix inversion, followed by application of the subscripts.)

In the large sample case (or a linear model with Gaussian errors), ℒ is Gaussian, ln ℒ is a (multidimensional) parabola, and the second derivative in Eq. (28.7) is constant, so the "expectation" operation has no effect. This leads to the usual approximation of calculating the error matrix of the parameters by inverting the second derivative matrix of ln ℒ. In this asymptotic case, it can be seen that a numerically equivalent way of determining s-standard-deviation errors is from the contour given by the α′ such that

  ln ℒ(α′) = ln ℒ_max − s²/2 ,  (28.8)

where ln ℒ_max is the value of ln ℒ at the solution point (compare with Eq. (28.32), below). The extreme limits of this contour parallel to the α_n axis give an approximate s-standard-deviation confidence interval in α_n. These intervals may not be symmetric, and in pathological cases they may even consist of two or more disjoint intervals.

Although asymptotically Eq. (28.7) is equivalent to Eq. (28.8) with s = 1, the latter is a better approximation when the model deviates from linearity. This is because Eq. (28.8) is invariant with respect to even a non-linear transformation of the parameters α, whereas Eq. (28.7) is not. Still, when the model is non-linear or the errors are not Gaussian, confidence intervals obtained with both these formulas are only approximate. The true coverage of these confidence intervals can always be determined by a Monte Carlo simulation, or exact confidence intervals can be determined as in Sec. 28.6.1.

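Continuing the hypothetical exponential example above, a one-parameter interval from Eq. (28.8) can be obtained by scanning ln ℒ until it falls s²/2 below its maximum; the grid range below is an arbitrary assumption:

```python
# Approximate 1-standard-deviation interval from ln L >= ln L_max - 1/2.
import numpy as np

def log_like(tau, t):
    return -np.sum(np.log(tau) + t / tau)

rng = np.random.default_rng(seed=2)
t = rng.exponential(scale=2.0, size=500)   # same hypothetical data as above

tau_grid = np.linspace(1.5, 2.7, 2001)
lnL = np.array([log_like(tau, t) for tau in tau_grid])
inside = tau_grid[lnL >= lnL.max() - 0.5]  # s = 1 contour, Eq. (28.8)
print("tau interval ~ [%.3f, %.3f]" % (inside[0], inside[-1]))
# For comparison, the asymptotic (Hessian) error here is tau_hat/sqrt(N).
```
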
28.3.4. Application to Poisson-distributed data:

In the case of Poisson-distributed data in a counting experiment, the unbinned maximum likelihood method (where the index i in Eq. (28.5) labels events) is preferred if the total number of events is very small. (Sometimes it is "extended" to include the total number of events as a Poisson-distributed observable.) If there are enough events to justify binning them in a histogram, then one may alternatively maximize the likelihood function for the contents of the bins (so i labels bins). This is equivalent to minimizing [11]

  χ² = 2 ∑_i [ N_i^th − N_i^obs + N_i^obs ln(N_i^obs / N_i^th) ] ,  (28.9)

where N_i^obs and N_i^th are the observed and theoretical (from f) contents of the ith bin. In bins where N_i^obs = 0, the terms containing N_i^obs vanish. This function asymptotically behaves like a classical χ² for purposes of point estimation, interval estimation, and goodness-of-fit. It also guarantees that the area under the fitted function f is equal to the sum of the histogram contents (as long as the overall normalization of f is effectively left unconstrained during the fit), which is not the case for χ² statistics based on a least-squares procedure with traditional weights.

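A sketch of the binned statistic of Eq. (28.9) (see Ref. 11), with hypothetical bin contents; the helper name poisson_chi2 is ours:

```python
# Poisson likelihood chi-square of Eq. (28.9) for histogram fits.
import numpy as np

def poisson_chi2(n_obs, n_th):
    """2*sum[ N_th - N_obs + N_obs*ln(N_obs/N_th) ]; N_obs terms drop where N_obs = 0."""
    n_obs = np.asarray(n_obs, dtype=float)
    n_th  = np.asarray(n_th, dtype=float)
    safe  = np.where(n_obs > 0, n_obs, 1.0)                 # avoid log(0)
    log_term = np.where(n_obs > 0, n_obs * np.log(safe / n_th), 0.0)
    return 2.0 * np.sum(n_th - n_obs + log_term)

print(poisson_chi2([5, 0, 12, 3], [4.2, 0.8, 10.5, 3.9]))  # illustrative bins
```
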

28.4. Propagation of errors

Suppose that F(x; α) is some function of variable(s) x and the fitted parameters α̂, with a value F̂ at α̂. The variance matrix of the parameters is V_mn. To first order in α_m − α̂_m, F is given by

  F = F̂ + ∑_m (∂F/∂α_m) (α_m − α̂_m) ,  (28.10)

and the variance of F about its estimator is given by

  (ΔF)² = E[(F − F̂)²] = ∑_{mn} (∂F/∂α_m)(∂F/∂α_n) V_mn ,  (28.11)

evaluated at the x of interest. For different functions F_j and F_k, the covariance is

  E[(F_j − F̂_j)(F_k − F̂_k)] = ∑_{mn} (∂F_j/∂α_m)(∂F_k/∂α_n) V_mn .  (28.12)

If the first-order approximation is in serious error, the above results may be very approximate. F̂ may be a biased estimator of F even if the α̂ are unbiased estimators of α. Inclusion of higher-order terms or direct evaluation of F in the vicinity of α̂ will help to reduce the bias.

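A short sketch of Eq. (28.11) using a numerical (central-difference) gradient; the derived quantity F, the parameter values, and the covariance matrix below are hypothetical:

```python
# First-order error propagation: (Delta F)^2 = grad(F)^T V grad(F).
import numpy as np

def propagate(F, alpha_hat, V, eps=1e-6):
    """Return (Delta F)^2 of Eq. (28.11) with a central-difference gradient."""
    alpha_hat = np.asarray(alpha_hat, dtype=float)
    grad = np.empty_like(alpha_hat)
    for m in range(alpha_hat.size):
        step = np.zeros_like(alpha_hat)
        step[m] = eps
        grad[m] = (F(alpha_hat + step) - F(alpha_hat - step)) / (2 * eps)
    return grad @ V @ grad

V = np.array([[0.04, 0.01], [0.01, 0.09]])   # covariance of (a1, a2), illustrative
F = lambda a: a[0] * np.exp(a[1])            # some derived quantity
print(np.sqrt(propagate(F, [1.0, 0.5], V)))  # its propagated standard deviation
```
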

28.5. Method of least squares

The method of least squares can be derived from the maximum likelihood theorem. We suppose a set of N measurements at points x_i. The ith measurement y_i is assumed to be chosen from a Gaussian distribution with mean F(x_i; α) and variance σ_i². Then

  χ² = −2 ln ℒ + constant = ∑_i [y_i − F(x_i; α)]² / σ_i² .  (28.13)

Finding the set of parameters α which maximizes ℒ is the same as finding the set which minimizes χ².

In many practical cases one further restricts the problem to the situation in which F(x_i; α) is a linear function of the α_n's,

  F(x_i; α) = ∑_n α_n f_n(x_i) ,  (28.14)

where the f_n are k linearly independent functions (e.g., 1, x, x², ..., or Legendre polynomials) which are single-valued over the allowed range of x. We require N ≥ k, and at least k of the x_i must be distinct. We wish to estimate the linear coefficients α_n. Later we will discuss the nonlinear case.

If the point errors ε_i = y_i − F(x_i; α) are Gaussian, then the minimum χ² will be distributed as a χ² random variable with n = N − k degrees of freedom. We can then evaluate the goodness-of-fit (significance level) from Figs. 27.1 or 27.3, as per the earlier discussion. The significance level expresses the probability that a worse fit would be obtained in a large number of similar experiments under the assumptions that: (a) the model y = ∑ α_n f_n is correct, and (b) the errors ε_i are Gaussian and unbiased with variance σ_i². If this probability is larger than an agreed-upon value (0.001, 0.01, or 0.05 are common choices), the data are consistent with the assumptions; otherwise we may want to find improved assumptions. As for the converse, most people do not regard a model as being truly inconsistent unless the probability is as low as that corresponding to four or five standard deviations for a Gaussian (6 × 10⁻³ % or 6 × 10⁻⁵ %; see Sec. 28.6.2). If the ε_i are not Gaussian, the method of least squares still gives an answer, but the goodness-of-fit test would have to be done using the correct distribution of the random variable which is still called "χ²."

Minimizing χ² in the linear case is straightforward:

  −(1/2) ∂χ²/∂α_m = ∑_i f_m(x_i) [ y_i − ∑_n α_n f_n(x_i) ] / σ_i² = 0 ,

i.e.,

  ∑_i y_i f_m(x_i)/σ_i² = ∑_{i,n} α_n f_n(x_i) f_m(x_i)/σ_i² .  (28.15)

With the definitions

  g_m = ∑_i y_i f_m(x_i)/σ_i²  (28.16)

and

  (V⁻¹)_nm = ∑_i f_n(x_i) f_m(x_i)/σ_i² ,  (28.17)

the k-element column vector of solutions α̂, for which ∂χ²/∂α_m = 0 for all m, is given by

  α̂ = V g .  (28.18)

With this notation, χ² for the special case of a linear fitting function (Eq. (28.14)) can be rewritten in the compact form

  χ² = χ²_min + (α − α̂)ᵀ V⁻¹ (α − α̂) .  (28.19)

Nonindependent y_i's:

Eq. (28.13) is based on the assumption that the likelihood function is the product of independent Gaussian distributions. More generally, the measured y_i's are not independent, and we must consider them as coming from a multivariate distribution with nondiagonal covariance matrix S, as described in Sec. 27.3.3. The generalization of Eq. (28.13) is

  χ² = ∑_{jk} [y_j − F(x_j; α)] (S⁻¹)_{jk} [y_k − F(x_k; α)] .  (28.20)

In the case of a fitting function that is linear in the parameters, one may differentiate χ² to find the generalization of Eq. (28.15), and with the extended definitions

  g_m = ∑_{jk} y_j f_m(x_k) (S⁻¹)_{jk} ,
  (V⁻¹)_nm = ∑_{jk} f_n(x_j) f_m(x_k) (S⁻¹)_{jk} ,  (28.21)

solve Eq. (28.18) for the estimators α̂.

The problem of constructing the covariance matrix S is simplified by the fact that contributions to S (not to its inverse) are additive. For example, suppose that we have three variables, all of which have independent statistical errors. The first two also have a common error resulting in a positive correlation, perhaps because a common baseline with its own statistical error (variance s²) was subtracted from each. In addition, the second two have a common error (variance a²), but this time the values are anticorrelated. This might happen, for example, if the sum of the two variables is a constant. Then

  S = [ σ₁²  0    0
        0    σ₂²  0
        0    0    σ₃² ]
    + [ s²   s²   0
        s²   s²   0
        0    0    0 ]
    + [ 0    0    0
        0    a²  −a²
        0   −a²   a² ] .  (28.22)

If unequal amounts of the common baseline were subtracted from variables 1, 2, and 3 (e.g., fractions f₁, f₂, and f₃), then we would have

  S = [ σ₁²  0    0
        0    σ₂²  0
        0    0    σ₃² ]
    + [ f₁²s²   f₁f₂s²  f₁f₃s²
        f₁f₂s²  f₂²s²   f₂f₃s²
        f₁f₃s²  f₂f₃s²  f₃²s² ] .  (28.23)

While in general this "two-vector" representation is not possible, it underscores the procedure: add zero-determinant correlation matrices to the matrix expressing the independent variation.


Care must be taken when fitting to correlated data, since off-diagonal contributions to χ² are not necessarily positive. It is even possible for all of the residuals to have the same sign.

Example: straight-line fit

For the case of a straight-line fit, y(x) = α₁ + α₂ x, one obtains, for independent measurements y_i, the following estimates of α₁ and α₂:

  α̂₁ = (g₁ Λ₂₂ − g₂ Λ₁₂)/D ,  (28.24)
  α̂₂ = (g₂ Λ₁₁ − g₁ Λ₁₂)/D ,  (28.25)

where

  (Λ₁₁, Λ₁₂, Λ₂₂) = ∑_i (1, x_i, x_i²)/σ_i² ,  (28.26a)
  (g₁, g₂) = ∑_i (1, x_i) y_i/σ_i² ,  (28.26b)

respectively, and

  D = Λ₁₁ Λ₂₂ − (Λ₁₂)² .  (28.27)

The covariance matrix of the fitted parameters is

  [ V₁₁  V₁₂ ]           [  Λ₂₂  −Λ₁₂ ]
  [ V₁₂  V₂₂ ]  = (1/D)  [ −Λ₁₂   Λ₁₁ ] .  (28.28)

The estimated variance of an interpolated or extrapolated value of y at point x is

  (ŷ − y_true)²_est = 1/Λ₁₁ + (Λ₁₁/D) (x − Λ₁₂/Λ₁₁)² .  (28.29)

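A direct transcription of Eqs. (28.24)-(28.29), with hypothetical data:

```python
# Straight-line fit y = a1 + a2*x with independent Gaussian errors.
import numpy as np

x     = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y     = np.array([2.8, 5.1, 7.3, 8.9, 11.2])
sigma = np.array([0.3, 0.3, 0.4, 0.4, 0.5])

L11 = np.sum(1.0   / sigma**2)        # Lambda_11
L12 = np.sum(x     / sigma**2)        # Lambda_12
L22 = np.sum(x**2  / sigma**2)        # Lambda_22
g1  = np.sum(y     / sigma**2)
g2  = np.sum(x * y / sigma**2)
D   = L11 * L22 - L12**2

a1 = (g1 * L22 - g2 * L12) / D        # Eq. (28.24): intercept
a2 = (g2 * L11 - g1 * L12) / D        # Eq. (28.25): slope
V  = np.array([[L22, -L12], [-L12, L11]]) / D            # Eq. (28.28)
var_y = lambda x0: 1.0 / L11 + (L11 / D) * (x0 - L12 / L11)**2   # Eq. (28.29)
print(a1, a2, var_y(6.0))
```
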
28.5.1. Confidence intervals from the chisquare function:

If y is not linear in the fitting parameters α, the solution vector may have to be found by iteration. If we have a first guess α₀, then we may expand to obtain

  (∂χ²/∂α)|_α = (∂χ²/∂α)|_{α₀} + 2 V⁻¹(α₀) · (α − α₀) + ... ,  (28.30)

where ∂χ²/∂α is a vector whose mth component is ∂χ²/∂α_m, and (V⁻¹)_mn = (1/2) ∂²χ²/∂α_m∂α_n. (See Eqns. (28.7) and (28.17). When evaluated at α̂, V⁻¹ is the inverse of the covariance matrix.) The next iteration toward α̂ can be obtained by setting (∂χ²/∂α)|_α = 0 and neglecting higher-order terms:

  α = α₀ − (1/2) V(α₀) · (∂χ²/∂α)|_{α₀} .  (28.31)

If V is constant in the vicinity of the minimum, as it is when the model function is linear in the parameters, then χ² is parabolic as a function of α and Eq. (28.31) gives the solution immediately. Otherwise, further iteration is necessary. If the problem is highly nonlinear, considerable difficulty may be encountered. There may be secondary minima, and χ² may be decreasing at physical boundaries. Numerical methods have been devised to find such solutions without divergence [9,10]. In particular, the CERN program MINUIT [10] offers several iteration schemes for solving such problems.

Note that minimizing any function proportional to χ² (or maximizing any function proportional to ln ℒ) will result in the same parameter set α̂. Hence, for example, if the variances σ_i² are known only up to a common constant, one can still solve for α̂. One cannot, however, evaluate goodness-of-fit, and the covariance matrix is known only to within the constant multiplier. The scale can be estimated at least roughly from the value of χ² compared to its expected value.

Additional information can be extracted from the behavior of the normalized residuals (known as "pulls"), r_j = (y_j − F(x_j; α̂))/σ_j, which should themselves distribute normally with mean 0 and rms deviation 1.

If the data covariance matrix S has been correctly evaluated (or, equivalently, the σ_j's, if the data are independent), then the s-standard-deviation limits on each of the parameters are given by a set α′ such that

  χ²(α′) = χ²_min + s² .  (28.32)

This equation gives confidence intervals in the same sense as Eq. (28.8), and all the discussion of Sec. 28.3.3 applies as well here, substituting −χ²/2 for ln ℒ.

28.6. Exact confidence intervals

The unqualified phrase "confidence intervals" refers to frequentist (also called classical) intervals obtained with a construction due to Neyman [12], described below. Approximate confidence intervals are obtained in classical statistics from likelihood ratios as described in the preceding subsections. The validity of the approximation (in terms of coverage; see below) should be checked (typically by the Monte Carlo method) when in doubt, as is usually the case with small numbers of events.

Intervals in Bayesian statistics, usually called credible intervals or Bayesian confidence intervals, are obtained by integrating the posterior p.d.f. (based on a non-frequency definition of probability), and in many cases do not obey the defining properties of confidence intervals. Correspondingly, confidence intervals do not in general behave like credible intervals.

In the Bayesian framework, all uncertainty, including systematic and theoretical uncertainties, can be treated in a straightforward manner: one includes in the p.d.f. one's degree of belief about background estimates, luminosity, etc. Then one integrates out such "nuisance parameters." In the Frequentist approach, one should have exact coverage no matter what the value of the nuisance parameters, and this is not in general possible. If one performs a Bayesian-style integration over nuisance parameters while constructing nominally Frequentist intervals, then coverage must be checked.

28.6.1. Neyman's construction of confidence intervals:

Figure 28.1: Confidence intervals for a single unknown parameter α. One might think of the p.d.f. f(x; α) as being plotted out of the paper as a function of x along each horizontal line of constant α. The domain D(ε) contains a fraction 1 − ε of the area under each of these functions.

We consider the parameter α whose true value is fixed but unknown. The properties of our experimental apparatus are expressed in the function f(x; α), which gives the probability of observing data x if the true value of the parameter is α. This function must be known in order to interpret the results of an experiment. For a large complex experiment, f is usually determined numerically using Monte Carlo simulation.

Given f(x; α), we can find for every value of α two values x₁(α, ε) and x₂(α, ε) such that

  P(x₁ < x < x₂; α) = 1 − ε = ∫_{x₁}^{x₂} f(x; α) dx .  (28.33)

This is shown graphically in Fig. 28.1: a horizontal line segment [x₁(α, ε), x₂(α, ε)] is drawn for representative values of α. The union of all such intervals, designated in the figure as the domain D(ε), is known as the confidence belt. Typically the curves x₁(α, ε) and x₂(α, ε) are monotonic functions of α, which we assume for this discussion.

Upon performing an experiment to measure x and obtaining the value x₀, one draws a vertical line through x₀ on the horizontal axis.


The confidence interval for α is the union of all values of α for which the corresponding line segment [x₁(α, ε), x₂(α, ε)] is intercepted by this vertical line. The confidence interval is an interval [α₁(x₀), α₂(x₀)], where α₁(x₀) and α₂(x₀) are on the boundary of D(ε). Thus, the boundaries of D(ε) can be considered to be functions x(α) when constructing D, and then to be functions α(x) when reading off confidence intervals.

Such confidence intervals are said to have a Confidence Level (CL) equal to 1 − ε.

Now suppose that some unknown particular value of α, say α₀ (indicated in the figure), is the true value of α. We see from the figure that α₀ lies between α₁(x) and α₂(x) if and only if x lies between x₁(α₀) and x₂(α₀). Thus we can write

  P[x₁(α₀) < x < x₂(α₀)] = 1 − ε = P[α₂(x) < α₀ < α₁(x)] .  (28.34)

And since, by construction, this is true for any value α₀, we can drop the subscript 0 and obtain the relationship we wanted to establish for the probability that the confidence limits will contain the true value of α:

  P[α₂(x) < α < α₁(x)] = 1 − ε .  (28.35)

In this probability statement, α₁ and α₂ are the random variables (not α), and we can verify that the statement is true, as a limiting ratio of frequencies in random experiments, for any assumed value of α. In a particular real experiment, the numerical values α₁ and α₂ are determined by applying the algorithm to the real data, and the probability statement is (all too frequently) misinterpreted to be a statement about the true value α, since this is the only unknown remaining in the equation. It should however be interpreted as the probability of obtaining values α₁ and α₂ which include the true value of α, in an ensemble of identical experiments. Any method which gives confidence intervals that contain the true value with probability 1 − ε (no matter what the true value of α is) is said to have the correct coverage. The frequentist intervals as constructed above have the correct coverage by construction. Coverage is a critical property of confidence intervals [2]. (Power to exclude false values of α, related to the length of the intervals in a relevant measure, is also important.)

The condition of coverage, Eq. (28.33), does not determine x₁ and x₂ uniquely, since any range which gives the desired value of the integral would give the same coverage. Additional criteria are thus needed. The most common criterion is to choose central intervals such that the area of the excluded tail on either side is ε/2. This criterion is sufficient in most cases, but there is a more general ordering principle which reduces to centrality in the usual cases and produces confidence intervals with better properties in the neighborhood of a physical limit. This ordering, which consists of taking the interval which includes the largest values of a likelihood ratio, is briefly outlined in Ref. 3 and has been applied to prototypical problems by Feldman and Cousins [13].

For the problem of a counting rate experiment in the presence of background, Roe and Woodroofe [14] have proposed a modification to Ref. 13 incorporating conditioning, i.e., conditional probabilities computed using constraints on the number of background events actually observed. This and other prescriptions giving frequentist intervals have not yet been fully explored [5].

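For the simplest case, a Gaussian measurement with known σ and central intervals (ε/2 in each tail), the belt of Eq. (28.33) is x₁,₂(α) = α ∓ δ, and the coverage statement of Eq. (28.35) can be verified by Monte Carlo. A minimal sketch (the true value, seed, and ensemble size are arbitrary assumptions):

```python
# Neyman construction for a Gaussian with known sigma, central intervals.
import numpy as np
from scipy.stats import norm

eps, sigma = 0.10, 1.0
delta = norm.ppf(1.0 - eps / 2.0) * sigma       # 1.645*sigma for 90% CL

rng = np.random.default_rng(seed=3)
alpha_true = 2.7                                 # any fixed true value
x = rng.normal(alpha_true, sigma, size=100_000)  # ensemble of experiments
# Reading the belt vertically gives alpha in [x - delta, x + delta]:
covered = (x - delta <= alpha_true) & (alpha_true <= x + delta)
print("coverage:", covered.mean())               # ~0.90 for any alpha_true
```
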
28.6.2. Gaussian errors:

If the data are such that the distribution of the estimator(s) satisfies the central limit theorem discussed in Sec. 27.3.3, the function f(x; α) is the Gaussian distribution. If there is more than one parameter being estimated, the multivariate Gaussian is used. For the univariate case with known σ,

  1 − ε = ∫_{μ−δ}^{μ+δ} (1/(√(2π) σ)) exp[−(x − μ)²/2σ²] dx = erf( δ/(√2 σ) )  (28.36)

is the probability that the measured value x will fall within ±δ of the true value μ. From the symmetry of the Gaussian with respect to x and μ, this is also the probability that the true value will be within ±δ of the measured value. Fig. 28.2 shows a δ = 1.64σ confidence interval unshaded. The choice δ = σ gives an interval called the standard error, which has 1 − ε = 68.27% if σ is known. Confidence coefficients ε for other frequently used choices of δ are given in Table 28.1.

Table 28.1: Area of the tails ε outside ±δ from the mean of a Gaussian distribution.

  ε (%)      δ    |  ε (%)   δ
  31.73      1σ   |  20      1.28σ
  4.55       2σ   |  10      1.64σ
  0.27       3σ   |   5      1.96σ
  6.3×10⁻³   4σ   |   1      2.58σ
  5.7×10⁻⁵   5σ   |   0.1    3.29σ
  2.0×10⁻⁷   6σ   |   0.01   3.89σ

Figure 28.2: Illustration of a symmetric 90% confidence interval (unshaded) for a measurement of a single quantity with Gaussian errors. Integrated probabilities, defined by ε, are as shown.

For other δ, find ε as the ordinate of Fig. 27.1 on the n = 1 curve at χ² = (δ/σ)². We can set a one-sided (upper or lower) limit by excluding above μ + δ (or below μ − δ); the ε's for such limits are half the values in Table 28.1.

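Table 28.1 follows directly from Eq. (28.36); a small sketch using the complementary error function:

```python
# Two-sided Gaussian tail areas: eps = 1 - erf(delta/(sqrt(2)*sigma)) = erfc(n/sqrt(2)).
from math import erfc, sqrt

for n_sigma in [1, 2, 3, 4, 5, 6]:
    eps = erfc(n_sigma / sqrt(2.0))
    print(f"delta = {n_sigma} sigma:  eps = {100 * eps:.3g} %")
# delta = 1 sigma -> 31.7 %, ..., delta = 5 sigma -> 5.73e-05 %, as in the table.
```
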
For multivariate α the scalar Var(μ) becomes a full variance-covariance matrix. Assuming a multivariate Gaussian, Eq. (27.22), and the subsequent discussion, the standard error ellipse for the pair (α̂_m, α̂_n) may be drawn as in Fig. 28.3.

The minimum χ² or maximum likelihood solution is at (α̂_m, α̂_n). The standard errors σ_m and σ_n are defined as shown, where the ellipse is at a constant value of χ² = χ²_min + 1 or ln ℒ = ln ℒ_max − 1/2. The angle of the major axis of the ellipse is given by

  tan 2φ = 2 ρ_mn σ_m σ_n / (σ_m² − σ_n²) .  (28.37)

For non-Gaussian or nonlinear cases, one may construct an analogous contour from the same χ² or ln ℒ relations. Any other parameters α_ℓ, ℓ ≠ m, n, must be allowed to vary freely to find their optimum values for every trial point.

Figure 28.3: Standard error ellipse for the estimators α̂_m and α̂_n. In this case the correlation is negative.

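A small sketch of Eq. (28.37), extracting σ_m, σ_n, ρ_mn, and the tilt angle from a hypothetical covariance matrix:

```python
# Tilt angle of the standard error ellipse from a 2x2 covariance matrix.
import numpy as np

V = np.array([[0.25, -0.12],
              [-0.12, 0.16]])        # negative correlation, as in Fig. 28.3
sig_m, sig_n = np.sqrt(V[0, 0]), np.sqrt(V[1, 1])
rho = V[0, 1] / (sig_m * sig_n)
phi = 0.5 * np.arctan2(2 * rho * sig_m * sig_n, sig_m**2 - sig_n**2)
print(np.degrees(phi))               # angle of the major axis, Eq. (28.37)
```
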

Table 28.2: Δχ² corresponding to (1 − ε), for joint estimation of k parameters.

  (1 − ε) (%)   k = 1   k = 2   k = 3
  68.27          1.00    2.30    3.53
  90.            2.71    4.61    6.25
  95.45          4.00    6.18    8.03
  99.            6.63    9.21   11.34
  99.73          9.00   11.83   14.16

For any unbiased procedure (e.g., least squares or maximum likelihood) used to estimate k parameters α_i, i = 1, ..., k, the probability 1 − ε that the true values of all k parameters lie within an ellipsoid bounded by a fixed value of Δχ² = χ² − χ²_min may be found from Fig. 27.1. This is because the difference Δχ² = χ² − χ²_min obeys the "χ²" p.d.f. given in Table 27.1, if the parameter n in the formula is taken to be k (rather than the number of degrees of freedom in the fit). In Fig. 27.1, read the ordinate as ε and the abscissa as Δχ². The correct values of ε are on the n = k curve. For k > 1, the values of ε for a given Δχ² are much greater than for k = 1. Hence, using Δχ² = s², which gives s-standard-deviation errors on a single parameter (irrespective of the other parameters), is not appropriate for a multi-dimensional ellipsoid. For example, for k = 2, the probability (1 − ε) that the true values of α₁ and α₂ simultaneously lie within the one-standard-deviation error ellipse (s = 1), centered on α̂₁ and α̂₂, is only 39%.

Values of Δχ² corresponding to commonly used values of ε and k are given in Table 28.2. These probabilities assume Gaussian errors, unbiased estimators, and that the model describing the data in terms of the α_i is correct. When these assumptions are not satisfied, a Monte Carlo simulation is typically performed to determine the relation between Δχ² and ε.

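Table 28.2 can be reproduced from the χ² quantile function, since the Δχ² at confidence 1 − ε for k jointly estimated parameters is the (1 − ε) quantile of a χ² distribution with k degrees of freedom. A short sketch:

```python
# Delta-chi2 for joint estimation of k parameters at confidence 1 - eps.
from scipy.stats import chi2

for cl in [0.6827, 0.90, 0.9545, 0.99, 0.9973]:
    row = [chi2.ppf(cl, df=k) for k in (1, 2, 3)]
    print(f"{100 * cl:6.2f}%  " + "  ".join(f"{v:5.2f}" for v in row))
# e.g. 68.27% -> 1.00, 2.30, 3.53, matching the table.
```
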
28.6.3. Upper limits and two-sided intervals:

When a measured value is close to a physical boundary, it is natural to report a one-sided confidence interval (often an upper limit). It is straightforward to force the procedure of Sec. 28.6.1 to produce only an upper limit, by setting x₂ = ∞ in Eq. (28.33). Then x₁ is uniquely determined. Clearly this procedure will have the desired coverage, but only if we always choose to set an upper limit. In practice one might decide after seeing the data whether to set an upper limit or a two-sided limit. In this case the upper limits calculated by Eq. (28.33) will not give exact coverage, as has been noted in Ref. 13.

In order to correct this problem and assure coverage in all circumstances, it is necessary to adopt a unified procedure, that is, a single ordering principle which will provide coverage globally. Then it is the ordering principle which decides whether a one-sided or two-sided interval will be reported for any given set of data. The unified procedure and ordering principle which follows from the theory of likelihood-ratio tests [3] is described in Ref. 13. We reproduce below the main results.

28.6.4. Gaussian data close to a boundary:

One of the most controversial statistical questions in physics is how to report a measurement which is close to the edge, or even outside, of the allowed physical region. This is because there are several admissible possibilities depending on how the result is to be used or interpreted. Normally one or more of the following should be reported:

(a) The actual measurement should be reported, even if it is outside the physical region. As with any other measurement, it is best to report the value of a quantity which is nearly Gaussian distributed if possible. Thus one may choose to report mass squared rather than mass, or cos θ rather than θ. For a complex quantity z close to zero, report Re(z) and Im(z) rather than the amplitude and phase of z. Data carefully reported in this way can be unbiased, objective, easily interpreted, and combined (averaged) with other data in a straightforward way, even if they lie partly or wholly outside the physical region. The reported error is a direct measure of the intrinsic accuracy of the result, which cannot always be inferred from the upper limits proposed below.

(b) If the data are to be used to make a decision, for example to determine the dimensions of a new experimental apparatus for an improved measurement, it may be appropriate to report a Bayesian upper limit, which must necessarily contain subjective belief about the possible values of the parameter, as well as containing information about the physical boundary. Its interpretation requires knowledge of the prior distribution which was necessarily used to obtain it.

(c) If it is desired to report an upper limit that has a well-defined meaning in terms of a limiting frequency, then report the Frequentist confidence bound(s) as given by the unified approach [3,13]. This algorithm always gives a non-null interval (that is, the confidence limits are always inside the physical region, even for a measurement well outside the physical region), and still has correct global coverage. These confidence limits for a Gaussian measurement close to a non-physical boundary are summarized in Fig. 28.4. Additional tables are given in Ref. 13.

Figure 28.4: Plot of the 99%, 95%, 90%, and 68.27% ("one σ") confidence intervals (using the unified approach as in Ref. 13) for a physical quantity μ based on a Gaussian measurement x (in units of standard deviations), for the case where the true value of μ cannot be negative. The curves become straight lines above the horizontal tick marks. The probability of obtaining an experimental value at least as negative as the left edge of the graph (x = −2.33) is less than 1%. Values of x more negative than −1.64 (dotted segments) are less than 5% probable, no matter what the true value of μ.

28.6.5. Poisson data for small samples:

When the observable is restricted to integer values (as in the case of Poisson and binomial distributions), it is not generally possible to construct confidence intervals with exact coverage for all values of α. In these cases the integral in Eq. (28.33) becomes a sum of finite contributions, and it is no longer possible (in general) to find consecutive terms which add up exactly to the required confidence level 1 − ε for all values of α. Thus one constructs intervals which happen to have exact coverage for a few values of α, and unavoidable over-coverage for all other values.

In addition to the problem posed by the discreteness of the data, we usually have to contend with possible background, whose expectation must be evaluated separately and may not be known precisely. For these reasons, the reporting of this kind of data is even more controversial than that of the Gaussian data near a boundary discussed above. This is especially true when the number of observed counts is greater than the expected background. As in the Gaussian case, there are at least three possibilities for reporting such results, depending on how the result is to be used:

(a) The actual measurements should be reported, which means (1) the number of recorded counts, (2) the expected background, possibly with its error, and (3) the normalization factor which turns the number of counts into a cross section, decay rate, etc. As with Gaussian data, these data can be combined with those of other experiments to make improved upper limits, for example.


Figure 28.5: 90% confidence intervals [μ₁, μ₂] on the number of signal events as a function of the expected number of background events b. For example, if the expected background is 8 events and 5 events are observed, then the signal is 2.60 or less with 90% confidence. Dotted portions of the μ₂ curves on the upper left indicate regions where μ₁ is non-zero (as shown by the inset). Dashed portions in the lower right indicate regions where the probability of obtaining the number of events observed or fewer is less than 1%, even if μ = 0. Horizontal curve sections occur because of discrete number statistics. Tables showing these data as well as the CL = 68.27%, 95%, and 99% results are given in Ref. 13. There is considerable discussion about the behavior of the intervals when the number of observed events is less than the expected background; see Ref. 5.

(b) A Bayesian upper limit may be reported. This has the advantages and disadvantages of any Bayesian result, as discussed above. The noninformative priors (based on invariance principles rather than subjective degree of belief) recommended in the statistics literature for a Poisson mean are rarely, if at all, used in high energy physics; they diverge for the case of zero events observed, and they give upper limits which undercover when evaluated by the Frequentist criterion of coverage. Rather, priors uniform in the counting rate have been used by convention; care must be taken in interpreting such results either as "degree of belief" or as a limiting frequency.

(c) An upper limit (or confidence region) with optimal coverage can be reported using the unified approach of Ref. 13. At the moment these confidence limits have been calculated only for the case of exactly known background expectation. The main results can be read from Fig. 28.5 or from Table 28.3; more extensive tables can be found in Ref. 13.

Table 28.3: Poisson limits [μ₁, μ₂] for n₀ observed events in the absence of background.

  n₀    CL = 90%: μ₁   μ₂     CL = 95%: μ₁   μ₂
   0              0.00  2.44            0.00  3.09
   1              0.11  4.36            0.05  5.14
   2              0.53  5.91            0.36  6.72
   3              1.10  7.42            0.82  8.25
   4              1.47  8.60            1.37  9.76
   5              1.84  9.99            1.84 11.26
   6              2.21 11.47            2.21 12.75
   7              3.56 12.53            2.58 13.81
   8              3.96 13.99            2.94 15.29
   9              4.36 15.30            4.36 16.77
  10              5.50 16.50            4.75 17.82

None of the above gives a single number which quantifies the quality or sensitivity of the experiment. This is a serious shortcoming of most upper limits, including those of method (c), since it is impossible to distinguish, from the upper limit alone, between a clean experiment with no background and a lucky experiment with fewer observed counts than expected background. For this reason, we suggest that in addition to (a) and (c) above, a measure of the sensitivity should be reported whenever the expected background is larger than or comparable to the number of observed counts. The best such measure we know of is that proposed and tabulated in Ref. 13, defined as the average upper limit that would be attained by an ensemble of experiments with the expected background and no true signal.

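For the case of exactly known background b, the unified intervals of Ref. 13 can be constructed numerically: for each trial μ, values of n are accepted in decreasing order of the likelihood ratio R = P(n; μ + b)/P(n; μ_best + b), with μ_best = max(0, n − b), until the acceptance region holds at least the required probability; the interval for an observed n₀ is the range of μ whose acceptance region contains n₀. The following sketch (the grid range and step are arbitrary assumptions) reproduces the first row of Table 28.3 for b = 0:

```python
# Unified (likelihood-ratio ordering) Poisson intervals, following Ref. 13.
import numpy as np
from scipy.stats import poisson

def fc_interval(n0, b=0.0, cl=0.90, mu_max=20.0, mu_step=0.005):
    mus = np.arange(0.0, mu_max, mu_step)
    n = np.arange(0, int(mu_max + b) + 60)      # extends well past the Poisson tail
    accepted = []
    for mu in mus:
        p = poisson.pmf(n, mu + b)
        p_best = poisson.pmf(n, np.maximum(n - b, 0.0) + b)  # mu_best = max(0, n-b)
        order = np.argsort(-(p / p_best))       # decreasing likelihood ratio
        cum = np.cumsum(p[order])
        k = np.searchsorted(cum, cl) + 1        # smallest set with total prob >= cl
        band = n[order[:k]]
        accepted.append(band.min() <= n0 <= band.max())
    accepted = np.array(accepted)
    return mus[accepted][0], mus[accepted][-1]

print(fc_interval(0))   # ~ (0.00, 2.44), the first row of Table 28.3
```
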
References:
1. B. Efron, Am. Stat. 40, 11 (1986).
2. R.D. Cousins, Am. J. Phys. 63, 398 (1995).
3. A. Stuart and A.K. Ord, Kendall's Advanced Theory of Statistics, Vol. 2: Classical Inference and Relationship, 5th ed. (Oxford Univ. Press, 1991), and earlier editions by Kendall and Stuart. The likelihood-ratio ordering principle is described at the beginning of Ch. 23. Chapter 31 compares different schools of statistical inference.
4. W.T. Eadie, D. Drijard, F.E. James, M. Roos, and B. Sadoulet, Statistical Methods in Experimental Physics (North Holland, Amsterdam and London, 1971).
5. Workshop on Confidence Limits, CERN, 17-18 Jan. 2000, www.cern.ch/CERN/Divisions/EP/Events/CLW/. See also the later Fermilab workshop linked to the CERN web page.
6. H. Cramér, Mathematical Methods of Statistics (Princeton Univ. Press, New Jersey, 1958).
7. B.P. Roe, Probability and Statistics in Experimental Physics (Springer-Verlag, New York, 208 pp., 1992).
8. G. Cowan, Statistical Data Analysis (Oxford University Press, Oxford, 1998).
9. W.H. Press et al., Numerical Recipes (Cambridge University Press, New York, 1986).
10. F. James and M. Roos, "MINUIT, Function Minimization and Error Analysis," CERN D506 (Long Writeup). Available from the CERN Program Library Office, CERN-IT Division, CERN, CH-1211, Geneva 21, Switzerland.
11. For a review, see S. Baker and R.D. Cousins, Nucl. Instrum. Methods 221, 437 (1984).
12. J. Neyman, Phil. Trans. Royal Soc. London, Series A 236, 333 (1937); reprinted in A Selection of Early Statistical Papers on J. Neyman (University of California Press, Berkeley, 1967).
13. G.J. Feldman and R.D. Cousins, Phys. Rev. D57, 3873 (1998). This paper does not specify what to do if the ordering principle gives equal rank to some values of x. Eq. 23.6 of Ref. 3 gives the rule: all such points are included in the acceptance region (the domain D(ε)). Some authors have assumed the contrary, and shown that one can then obtain null intervals.
14. B.P. Roe and M.B. Woodroofe, Phys. Rev. D60, 053009 (1999).
