Preface
These notes are a work in progress, with contributions from the students taking the courses and
the instructors teaching them. The original version of the notes was prepared by Jerry Lawless, with
editorial changes made by Jock MacKay, Don McLeish, and others. To help us provide improved versions of the
notes for students in subsequent terms, please email lists of errors, sections that are confusing, or
additional remarks/suggestions to your instructor or dlmcleis@uwaterloo.ca.
Specific topics in these notes also have associated video files that can be accessed at www.watstat.ca.
Where possible we will reference these videos in the text.
INTRODUCTION TO STATISTICAL SCIENCE
Statistical science, or statistics, is the discipline that deals with the collection, analysis and interpreta-
tion of data, and with the study and treatment of variability and of uncertainty. If you think about it,
you soon realize that almost everything we do or know depends on data of some kind, and that there is
usually some degree of uncertainty present. For example, in deciding whether to take an umbrella on a
long walk we may utilize information from weather forecasts along with our own direct impression of
the weather, but even so there is usually some degree of uncertainty as to whether it will actually rain
or not. In areas such as insurance or finance, decisions must be made about what rates to charge for an
insurance policy, or whether to buy or sell a stock, on the basis of certain types of data. The uncertainty
as to whether a policy holder will have a claim over the next year, or whether a stock’s price will rise
or fall, is the basis of financial risk for the insurer and the investor.
In order to increase our knowledge about some area or to make better decisions, we must collect
and analyze data about the area in question. To discuss general ways of doing this, it is useful to have
terms that refer to the objects we are studying. The words “population”, “phenomenon” and “process”
are frequently used below; they are simply catch-all terms that represent groups of objects and events
that someone might wish to study.
Variability and uncertainty are present in most processes and phenomena in the real world. Uncer-
tainty or lack of knowledge is the main reason why someone chooses to study a phenomenon in the
first place. For example, a medical study to assess the effect of a new drug for controlling hyperten-
sion (high blood pressure) may be conducted by a drug company because they do not know how the
drug will perform on different types of people, what its side effects will be, and so on. Variability is
ever-present; people have varying degrees of hypertension, they react differently to drugs, they have
different physical characteristics. One might similarly want to study variations in currency or stock val-
ues, variation in sales for a company over time, or variation in hits and response times for a commercial
web site. Statistical science deals both with the study of variability in processes and phenomena, and
with good (i.e. informative, cost-effective) ways to collect and analyze data about such processes.
There are various possible objectives when one collects and analyzes data on a population, phenom-
enon, or process. In addition to pure “learning” or furthering knowledge, these include decision-making
and the improvement of processes or systems. Many problems involve a combination of objectives. For
example, government scientists collect data on fish stocks in order to further their scientific knowledge,
but also to provide information to legislators or groups who must set quotas or limits on commercial
fishing. Statistical data analysis occurs in a huge number of areas. For example, statistical algorithms
are the basis for software involved in the automated recognition of handwritten or spoken text; statis-
tical methods are commonly used in law cases, for example in DNA profiling or in determining costs;
statistical process control is used to increase the quality and productivity of manufacturing processes;
individuals are selected for direct mail marketing campaigns through statistical analysis of their char-
acteristics. With modern information technology, massive amounts of data are routinely collected and
stored. But data does not equal information, and it is the purpose of statistical science to provide and
analyze data so that the maximum amount of information or knowledge may be obtained. Poor or
improperly analyzed data may be useless or misleading.
Mathematical models are used to represent many phenomena, populations, or processes; to deal
with problems that involve variability, we use probability models. These have been introduced and
studied in your first probability course, and you have seen how to describe variability and solve certain
types of problems using them. This course will focus more on the collection, analysis and interpreta-
tion of data, but the probability models studied earlier will be heavily used. The most important part of
probability for this course is the material dealing with random variables and their probability distribu-
tions, including distributions such as the binomial, hypergeometric, Poisson, multinomial, normal and
exponential. You should review your previous notes on this material.
Statistical science is a large discipline, and this course is only an introduction. Our broad objectives
are to discuss the collection, analysis and interpretation of data, and to show why this is necessary. By
way of further introduction we will outline important statistical topics, first data collection, and then
probability models, data analysis, and statistical inference. We should bear in mind that study of a
process or phenomenon involves iteration between model building, data collection, data analysis, and
interpretation. We must also remember that data are collected and models are constructed for a specific
reason. In any given application we should keep the big picture in mind (e.g. why are we studying this?
what else do we know about it?) even when considering one specific aspect of a problem.
1.2 Collection of Data

The objects of study in this course are usually referred to as either populations or processes. In essence
a population is just some collection of units (which can be either real or imagined), for example, the
collection of persons under the age of 18 in Canada as of September 1, 2012 or the collection of car
insurance policies issued by a company over a one year period. A process is a mechanism by which
output of some kind is produced; units can often be associated with the output. For example, hits on a
website constitute a process (the “units" are the distinct hits), as do the sequence of claims generated
by car insurance policy holders (the “units" are the individual claims). A key feature of processes is
that they usually occur over time, whereas populations are often static (defined at one moment in time).
Populations or processes are studied by defining variates or variables which represent characteristics
of units. These are usually numerical-valued and are represented by letters such as x, y, z. For
example, we might define a variable y as the number of car insurance claims from an individual policy
holder in a given year, or as the number of hits on a website over a specified one hour period. The
values of y vary across the units in a population or process, and it is this variability that generates uncertainty
and makes it necessary to study populations and processes by collecting data about them. By "data" we
mean here the values of the variates for specific units in the population.
In planning for the collection of data about some phenomenon, we must carefully specify what the
objectives of doing this are. Then, the feasibility of obtaining information by various means must be
considered, as well as to what extent it will be possible to answer questions of interest. This sounds
simple but is usually difficult to do well, especially with limited resources.
There are several ways in which data are commonly obtained. One is purely according to what is
available: that is, data are provided by some existing source. Huge amounts of data collected by many
technological systems are of this type, for example, data on credit card usage or on purchases made by
customers in a supermarket. Sometimes it is not clear exactly what "available" data represent, and they
may be unsuitable for serious analysis. For example, people who voluntarily provide data in a survey
may not be representative of the population at large. Statistical science stresses the importance of ob-
taining data so that they will be “objective" and provide maximal information. Three broad approaches
are often used to do this:
(i) Sample Surveys The object of many studies is a finite population of some sort (e.g. all persons
over 19 in Ontario; all cars produced by GM in the past year). In this case information may be
obtained by selecting a “representative” sample of individuals from the population and studying
them. Representativeness of the sample is usually achieved by selecting the sample members
randomly from those in the population. Sample surveys are widely used in government statistical
studies, economics, marketing, public opinion polls, sociology, and other areas.
(ii) Observational Studies An observational study is one in which data are collected about a process
or phenomenon (over which the observer has no control) in some objective way, often over some
period of time. For example, in studying risk factors associated with a disease such as lung
cancer, one might investigate all such cases (or perhaps a random sample of them) that occur
over a given time period. A distinction between a sample survey and an observational study is
that for the latter the “population” of interest is usually infinite or conceptual. For example, in
investigating risk factors for a disease we would prefer to think of the population as a conceptual
one consisting of persons at risk from the disease recently or in the future.
(iii) Experiments An experiment is a study in which the experimenter (i.e. the person collecting
the data) exercises some degree of control over the process being studied. This usually takes the
form of the experimenter being able to control certain factors in the process. For example, in
an engineering experiment to quantify the effect of temperature on the performance of personal
computers, we might decide to run an experiment with 40 PC’s, ten of which would be operated
at each of the temperatures 10, 20, 30, and 40 degrees Celsius.
The three types of studies described above are not mutually exclusive, and many studies involve
aspects of two or more of them. Here are some slightly more detailed examples.
Consider a manufacturer that fills bottles with a liquid product; the amount of liquid x that goes into
a bottle varies over a small range. Note that the manufacturer would like the variability in x to be as
small as possible, and for bottles to contain at least 26 ounces. Suppose that the manufacturer has just
added a new filling machine to increase the plant’s capacity and wants to compare the new machine
with the older ones. She decides to do this by sampling some filled bottles from each machine and
accurately measuring the amount of liquid x in each bottle; this will be an observational study.
How exactly should the data be collected? The machines may “drift” over time (i.e. the average or
the variability in the values of x may vary systematically up or down over time) so we should randomly
select bottles over time from each machine; we would have to decide how many, and over what time
periods to collect them.
In a direct mail marketing campaign, a company has data on a large number of persons' characteristics (e.g. age,
salary, credit cards held), spending, and other variates. Based on this data, the company wishes to select
persons whom it considers have a good chance of responding positively to the mailout. The challenge
is to use data from previous mail campaigns, along with the current data, to achieve as high a response
rate as possible.
Numerical data summaries are useful for describing features of a data set $\{y_1, \ldots, y_n\}$. Important ones
are

the mean (also called the sample mean) $\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

the (sample) variance $s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$

the (sample) standard deviation $s_y = \sqrt{s_y^2}$

the percentiles and quantiles: the p'th quantile or 100p'th percentile is a y-value Q(p) such that
a fraction p of the values in the data set are below Q(p). The values Q(0.5), Q(0.25) and Q(0.75)
are called the median, the lower quartile, and the upper quartile respectively. In fact, quantiles
are not uniquely defined for all p-values in (0, 1) for a given data set, and there are different
conventions for defining quantiles and percentiles. For example, what is the median of the values
{1, 2, 3, 4, 5, 6}? What is the lower quartile? The different conventions for defining quantiles
become identical as n becomes large.
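R's quantile() function implements several of these conventions through its type argument; a small illustration for the six values above:

y <- c(1, 2, 3, 4, 5, 6)
median(y)                     # 3.5: the average of the two middle values
quantile(y, 0.25)             # lower quartile under R's default convention (type 7)
quantile(y, 0.25, type = 6)   # a different convention gives a different value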
The mean and the percentiles and quantiles are easily understood. The variance and standard de-
viation measure the variability or “spread" of the y-values in a data set, which is usually an important
characteristic. Another way to measure variability is in terms of the distance between a “low" and
“high" percentile, for example Q(0.10) and Q(0.90).
A final numerical summary is a frequency table. This is closely related to a histogram and, in fact, is
just a table showing the intervals $I_j$ and their frequencies $f_j$, as used in a histogram. For example, for the
200 male height measurements in Example 1.3.2, the frequency table corresponding to the bottom-left
histogram in Figure 1.1 is shown in Table 1.3.1.
Histograms
Consider measurements $\{y_1, y_2, \ldots, y_n\}$ on a variable y. Partition the range of y into intervals
$I_j = [a_{j-1}, a_j)$, $j = 1, 2, \ldots, k$, and then calculate, for $j = 1, \ldots, k$, the frequency $f_j$ = number of
values in the data set that fall in $I_j$. A histogram may then be drawn as either
(a) a "standard" histogram, where the range of y is taken to be finite and the $I_j$ are of equal length.
The height of the rectangle for $I_j$ is taken to be $f_j$. This type of histogram is similar to a bar chart.
(b) a "relative frequency" histogram, where the $I_j$ may or may not be of equal length. The height of
the rectangle for $I_j$ is chosen so that its area equals $f_j/n$, which we call the relative frequency
for $I_j$. Note that in this case the sum of the areas of all the rectangles in the histogram equals one.
Example 1.3.2
Figure 1.1 shows relative frequency histograms based on each of two samples, (a) heights of 200
females randomly selected from workers aged 18 - 60 in New Zealand, and (b) heights of 200 males,
selected from the same population. Heights are recorded in metres; the female heights range from 1.45
to 1.78 m (57.1 to 70.1 in.) and the male heights from 1.59 to 1.88 m (62.6 to 74.0 in.).
To construct a histogram, we have to choose the number (k) and location of the intervals. The
intervals are typically selected in such a way that each interval contains at least one y-value from the
sample (that is, each $f_j \geq 1$). Software packages are used to produce histogram plots (see Section 1.6)
and they will either automatically select the intervals for a given data set or allow the user to specify
them.
The visual impression from a histogram can change somewhat according to the choice of intervals.
In Figure 1.1, the left-hand panels use 7 intervals and the right-hand panels use 17 for females and 15 for
males. Note that the histograms give a picture of the distribution of y values in the two samples. For
both females and males the distributions are fairly symmetrical-looking. To allow easy comparison of
female and male height distributions we have used the same scale for y (the x-axis) in both plots.
Obviously, the distribution of male heights is to the right of that for female heights, but the “spread"
and shape of the two distributions is similar.
Example 1.3.3 Different shapes of distributions can occur in data on a variable y. Figure 1.2 shows
a histogram of the lifetimes (in terms of number of km driven) for the front brake pads on 200 new
mid-size cars of the same type. Notice that the distribution is less symmetrical than the ones in Figure
1.1; the brake pad lifetimes have a rather long right-hand tail. The high degree of variability in life-
times is due to the wide variety of driving conditions which different cars are exposed to, as well as to
variability in how soon car owners decide to replace their brake pads.
Cumulative frequency plots Another way to portray a data set $\{y_1, y_2, \ldots, y_n\}$ is to count the number
or proportion of values in the set which are smaller than any given value. This gives a function
$$\tilde{F}(y) = \frac{1}{n}\,(\text{number of values } y_i \text{ in the data set with } y_i \leq y),$$
called the empirical cumulative distribution function.
Figure 1.1: Histograms for Female and Male Heights. Sample sizes = 200.
Software will produce such functions for a given data set. This is conveniently done by first
ordering the $y_i$'s $(i = 1, \ldots, n)$ to give the ordered values $y_{(1)} \leq y_{(2)} \leq \cdots \leq y_{(n)}$. Then, we note
that $\tilde{F}(y)$ is a "staircase" or "step" function that is easily obtained from the ordered values. If the data
values $y_i$ $(i = 1, \ldots, n)$ are all different, then $\tilde{F}(y_{(j)}) = j/n$.
Example 1.3.4 Suppose that n = 4 and the y-values (ordered for convenience) are 1.5, 2.2, 3.4, 5.0.
Then
$$\tilde{F}(y) = \begin{cases} 0 & y < 1.5 \\ 0.25 & 1.5 \leq y < 2.2 \\ 0.50 & 2.2 \leq y < 3.4 \\ 0.75 & 3.4 \leq y < 5.0 \\ 1.00 & y \geq 5.0 \end{cases}$$
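In R the base function ecdf() computes this step function directly; a brief sketch using the four values above:

y <- c(1.5, 2.2, 3.4, 5.0)
Ftilde <- ecdf(y)    # the empirical cumulative distribution function, as an R function
Ftilde(3.4)          # proportion of values <= 3.4, here 0.75
plot(Ftilde)         # draws the staircase plot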
Example 1.3.5 Figure 1.3 shows the cumulative relative frequency plots $\tilde{F}(y)$ for (a) the sample of
female heights, and (b) the sample of male heights in Example 1.3.2.
A cumulative frequency plot does not show the “shape" of the distribution of y-values in a data set
quite as clearly as a histogram. However, it shows us the proportion of y-values in any given interval;
Figure 1.2: Lifetimes (in km driven) for Front Brake Pads on 200 Cars
the proportion in the interval (a, b] is just $\tilde{F}(b) - \tilde{F}(a)$. In addition, this plot allows us to pinpoint
values such as the median (a y-value m such that half of the data values are below m and half are above
m) or the 100p'th percentile (a y-value Q(p) such that a proportion p of the data values is less than
Q(p)), where 0 < p < 1. For example, we see from Figure 1.3 that the median (or Q(0.5)) height
for females is about 1.60 m (63.0 in.) and for males, about 1.73 m (68.1 in.).
Other plots are also sometimes useful. The size n of a data set can be small or large. Histograms
are not very useful when n is less than about 20-30, and for small samples we often just plot the loca-
tions of y-values on a line; an example of such a "strip plot" is given in Section 1.7. A useful plot for
comparing two or more data sets, the box plot, is described next.
Box plots
Sometimes we have two or more samples of y-values, and we may wish to compare them. One way is
by plotting histograms or relative frequency plots for the different samples on the same graph or page;
we did this in Example 1.3.2 for the samples of female heights and male heights. The box plot is a plot
in which only certain values based on a data set are shown, in particular the median and the upper and lower
quartiles (the 25th and 75th percentiles Q(0.25) and Q(0.75)), plus values equal to 1.5 times the
"inter-quartile range" Q(0.75) - Q(0.25) below Q(0.25) and above Q(0.75). Figure 1.4 shows such a
plot for the female heights and male heights data sets from Example 1.3.2.
From the boxplot we can determine, for example, that approximately 75% of the females have
heights less than 1.65 m, or that about 50% of the males have heights between 1.70 and 1.79 m.
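A plot like Figure 1.4 can be produced with R's boxplot() function; a minimal sketch, assuming the heights are in the vectors hfemale and hmale used elsewhere in this chapter:

boxplot(hfemale, hmale, names = c("F", "M"),
        ylab = "Heights (m)")    # side-by-side box plots for the two samples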
Figure 1.3: Cumulative relative frequency for Female (F) and Male (M) heights
Two-variable plots
Often we have data on two or more variables for each unit represented in a sample. For example, we
might have the heights x and weights y for samples of individuals. The data set can then be represented
as n pairs, $\{(x_i, y_i),\ i = 1, \ldots, n\}$, where $x_i$ and $y_i$ are the height and weight of the i'th person in the
sample.
When we have two such variables, a useful plot is a scatter plot, which is an x-y plot of the
points $(x_i, y_i)$, $i = 1, \ldots, n$. This shows whether $x_i$ and $y_i$ tend to be related in some way. Figure
1.5 shows a scatterplot of heights xi and weights yi for 100 adult males. As is obvious from looking
at people around us, taller people tend to weigh more, but there is considerable variability in weight
across persons of the same height.
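Such a plot takes one line in R; a sketch, where the vectors x and y (hypothetical names) are assumed to hold the heights and weights:

plot(x, y, xlab = "height (in.)", ylab = "weight (lb.)")   # one point per person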
1.4 Probability Distributions and Statistical Models

Probability distributions and statistical models play a central role in the study of populations and processes, for several reasons:

(i) when studying a process scientifically, questions are often formulated in terms of a model for
the process. The questions of primary interest do not concern the data themselves, but the data
provide information about the model and hence about the process,
Figure 1.4: Box Plots Based on 200 Female and 200 Male Heights in Example 1.3.2 ("F" = 1, "M" = 2).
(ii) the data collected in studying processes are variable, so random variables are often used in dis-
cussing and dealing with data,
(iii) studies of a process usually lead to inferences or decisions that involve some degree of uncer-
tainty, and probability is used to quantify this,
(iv) procedures for making decisions are often formulated in terms of models,
(v) models allow us to characterize processes, and to simulate them via computer experiments or
other means.
Consider a variable y associated with the units in a population or process. To describe or "model"
the variability in y-values we use probability distributions, which were introduced in your first proba-
bility course. This is done as follows: let Y be the value for a randomly chosen unit in the population or
process. Because this value is random (we do not know which unit will be chosen) we call Y a random
variable, and use a probability distribution to provide us with probabilities such as $P(a \leq Y \leq b)$. You
should review your probability notes (a limited review is given in an appendix to this chapter) and recall
that random variables are usually either discrete or continuous. A discrete random variable (r.v.) Y is
one for which the range R (set of possible values) of Y is countable. A continuous r.v. is one whose
range R consists of one or more continuous intervals of real numbers. For a discrete r.v. the probability
function is $f(y) = P(Y = y)$ for $y \in R$; for a continuous r.v., $f(y)$ denotes the probability density function of Y.
Figure 1.5: Scatterplot of Height vs. Weight for 100 Adult Males
Recall that the cumulative distribution function (c.d.f.) is defined for a r.v. Y as
$$F(y) = P(Y \leq y) \qquad (1.4)$$
If Y is discrete then $F(y) = \sum_{x \leq y} f(x)$; if Y is continuous then
$$F(y) = \int_{-\infty}^{y} f(x)\, dx$$
Recall also that if h(Y) is some function of Y, then the expectation (or "expected value") of h(Y) is
defined as
$$E[h(Y)] = \sum_{y \in R} h(y) f(y) \qquad (1.5)$$
if Y is discrete, and as
$$E[h(Y)] = \int_{R} h(y) f(y)\, dy \qquad (1.6)$$
if Y is continuous. Expectations are used in many settings, for example when costs, profits, or losses
are associated with a random variable. The expectation E(Y) is called the mean of Y and is often
denoted by the Greek letter $\mu$. The expectation $E[(Y - \mu)^2]$ is called the variance of Y and is often
denoted either as $\mathrm{Var}(Y)$ or with the Greek symbol $\sigma^2$. The square root $\sigma = \sqrt{\mathrm{Var}(Y)}$ is called the
standard deviation of Y, or sd(Y).
Your previous course introduced several families of probability distributions along with processes
or populations to which they are applied. Models such as the binomial, Poisson, exponential, normal
(Gaussian), and multinomial will be reintroduced and used in this course. The first few problems at the
end of the chapter provide a review of some models, and examples of where they are applied.
Many problems involve two or more random variables defined for any given unit. For example
$Y_1$ could represent the height and $Y_2$ the weight of a randomly selected 30 year old male in some
population. In general, we can think of a random variable $Y = (Y_1, Y_2, \ldots)$ as being a vector of
length $\geq 1$. This may make it necessary to consider multivariate probability distributions, which were
introduced in your last course for discrete random variables.
In many statistical applications there is a primary variable y of interest, but there may be a number
of other variables $x_1, x_2, \ldots$ that affect y, or are "related" to y in some way. In this case we often refer
to y as the "response" variable and $x_1, x_2, \ldots$ as "explanatory" variables or covariates. Many studies
are carried out for the purpose of determining how one or more explanatory variables are related to a
response variable. For example, we might study how the number of insurance claims y for a driver is
related to their sex, age, and type of car (x variables). One reason for studying explanatory variables
is to search out cause and effect relationships. Another is that we can often use explanatory variables
to improve decisions, predictions or “guesses" about a response variable. For example, insurance com-
panies use explanatory variables in defining risk classes and determining life insurance premiums.
1.5 Data Analysis and Statistical Inference

Whether we are collecting data to increase our knowledge or to serve as a basis for making decisions,
proper analysis of the data is crucial. Two broad aspects of the analysis and interpretation of data may
be distinguished. The first is what we refer to as descriptive statistics: This is the portrayal of the
data, or parts of it, in numerical and graphical ways so as to show certain features. (On a historical
note, the word “statistics” in its original usage referred to numbers generated from data; today the word
is used both in this sense and to denote the discipline of Statistics.) We have considered methods of
doing this in Section 1.3. The terms data mining and knowledge discovery in data bases (KDD) refer
to exploratory data analysis where the emphasis is on descriptive statistics. This is often carried out on
very large data bases.
A second aspect of a statistical analysis of data is what we refer to as statistical inference: that is,
we use the data obtained in the study of a process or phenomenon to draw more general inferences
about the process or phenomenon itself. In general, we try to use study data to draw inferences about
some target population or process. This is a form of inductive inference, in which we reason from the
specific (the observed data) to the general (the target population or process). This may be contrasted
with deductive inference (as in logic and mathematics) in which we use general results (e.g. axioms)
to prove specific things (e.g. theorems).
This course introduces some basic methods of statistical inference. Two main types of problems
will be discussed, loosely referred to as estimation problems and hypothesis testing problems. In the
former, the problem is to estimate some feature of a process or population. For example, we may wish
to estimate the proportion of Ontario residents aged 14 - 20 who smoke, or to estimate the distribution
of survival times for certain types of AIDS patients. Another type of estimation problem is that of
“fitting” or selecting a probability model for a process.
Testing problems involve using the data to assess the truth of some question or hypothesis. For
example, we may hypothesize that in the 14-20 age group a higher proportion of females than males
smoke, or that the use of a new treatment will increase the average survival time of AIDS patients by
at least 50 percent. These questions can be addressed by collecting data on the populations in question.
Statistical analysis involves the use of both descriptive statistics and methods of estimation and
testing. As brief illustrations, we return to the first two examples of section 1.2.
Old machine:
27.8 28.9 26.8 27.4 28.0 27.4 27.1 28.0 26.6 25.6
24.8 27.1 25.7 27.9 25.3 26.0 27.3 27.4 25.7 26.9
27.3 25.2 25.6 27.0 26.2 27.3 24.8 27.1 26.7 26.8
26.6 26.6 28.6 27.0 26.6 27.3 25.9 27.6 27.6 28.3
28.0 26.4 25.4 26.7 27.8 27.4 27.3 26.9 26.9 26.9
New Machine:
26.6 26.8 27.2 26.9 27.6 26.7 26.8 27.4 26.9 27.1
27.0 27.1 27.0 26.6 27.2 26.1 27.6 27.2 26.5 26.3
28.0 26.8 27.1 26.7 27.7 26.7 27.1 26.5 26.8 26.8
26.9 27.2 27.4 27.1 26.5 27.2 26.8 27.3 26.6 26.6
27.0 26.9 27.3 26.0 27.4 27.4 27.6 27.2 27.8 27.7
The amount of liquid X that goes into a bottle is a random variable, and a main objective of this
study is to determine what the distribution of X looks like for the old machine and for the new machine,
over the one week period of the study. For this to be really useful, we should check first that there are
no “drifts" or time trends in the data. Figure 1.6 gives a plot of xi vs. i (order of production) for each
machine, and no trends are apparent. The random variable X is continuous and so we would like to use
a continuous probability distribution as the model. It often turns out for problems involving weights
or measures that a normal distribution provides a suitable model. Recall that the distribution $N(\mu, \sigma^2)$
(which we will often call in this course a Gaussian distribution, denoted $G(\mu, \sigma)$) has probability
density function
$$f(x; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2\right\}, \qquad -\infty < x < \infty$$
and that probabilities are obtained by integrating it; recall also that $\mu = E(X)$ and $\sigma^2 = \mathrm{Var}(X)$ are
the mean and variance of the distribution. (Note: $\exp(a)$ is the same as $e^a$.)
Before trying to use any particular model for X, it is a good idea to "look" at the data. Figure
1.7 shows frequency histograms of the data from the old and new machines, respectively. It shows that
(i) the distributions of X for the old and new machines each look like they might be well described
by (different) normal distributions, and (ii) the variability in the new machine's distribution is considerably
less than in the old machine's.

[Figure 1.6: plots of bottle volume (oz.) vs. index i (order of production) for the old machine and the new machine]
After this simple bit of descriptive statistics we could carry out a more thorough analysis, for
example fitting normal distributions to each machine. We can also estimate attributes of interest, such
as the probability a bottle will receive less than 26 ounces of liquid, and recommend adjustments that
could be made to the machines. In manufacturing processes it is important that the variability in the
output is small, and so in this case the new machine is better than the old one.
[Figure 1.7: frequency histograms of bottle volume (oz.) for the old machine and the new machine]

1.6 Statistical Software

Software is essential for data manipulation and analysis. It is also used to deal with numerical calcu-
lations, to produce graphics, and to simulate probability models. There exist many statistical software
systems; some of the most comprehensive and popular are SAS, S-Plus, SPSS, Stata, Systat and R.
Spreadsheet software is also useful.
In this course we will primarily use the R software system. It is an open source package that
has extensive statistical capabilities and very good graphics procedures. Its home page is at
www.r-project.org. In structure it is similar to the commercial package S-Plus (Insightful Corp.); both are
based on the ideas developed from the S statistical system at AT & T Bell Laboratories.
Some of the basics of R are described in the Appendix at the end of this chapter; it is very easy
to use. In this course we will employ R for several purposes: to manipulate and graph data; to fit and
check statistical models (distributions); to estimate quantities or test hypotheses; to simulate data from
probability models.
As an introductory example we consider some data on the heights and the body-mass indexes
(BMI’s) of 150 males and 150 females, aged 18-60, that were collected from a random sample of
workers in New Zealand. The data are listed below, along with a few summary statistics. The BMI is
often used to measure obesity or severely low weight. It is defined as follows:
$$BMI = \frac{\text{weight (kg)}}{\text{height (m)}^2}$$
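For example, a person 1.75 m tall who weighs 70 kg has BMI $70/1.75^2 \approx 22.9$; in R:

70 / 1.75^2   # BMI of about 22.9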
There is some variation in what different types of guidelines refer to as "overweight", "underweight",
etc.; guidelines used by public health professionals typically define such categories in terms of ranges of BMI values.
The data on heights are stored in two R vectors (see the Appendix at the end of the chapter) called
hmale and hfemale; the BMI measurement are in vectors bmimale and bmifemale.
Heights and Body-Mass Index (BMI) Measurements for 150 Males and Females
MALE HEIGHTS (m) - hmale
[1] 1.76 1.76 1.68 1.72 1.73 1.78 1.78 1.86 1.77 1.72 1.72 1.77 1.77 1.70 1.72
[16] 1.77 1.79 1.75 1.74 1.71 1.73 1.74 1.70 1.71 1.72 1.66 1.74 1.73 1.77 1.69
[31] 1.91 1.77 1.81 1.74 1.87 1.76 1.69 1.87 1.78 1.70 1.78 1.84 1.82 1.77 1.72
[46] 1.80 1.72 1.69 1.78 1.69 1.80 1.82 1.65 1.56 1.64 1.60 1.82 1.73 1.62 1.77
[61] 1.81 1.73 1.74 1.75 1.73 1.71 1.63 1.72 1.74 1.75 1.72 1.83 1.77 1.74 1.66
[76] 1.93 1.81 1.73 1.68 1.71 1.69 1.74 1.74 1.79 1.68 1.71 1.74 1.82 1.68 1.78
[91] 1.79 1.77 1.74 1.78 1.86 1.80 1.74 1.69 1.85 1.71 1.79 1.74 1.80 1.64 1.82
[106] 1.66 1.56 1.80 1.68 1.73 1.78 1.69 1.57 1.64 1.67 1.74 1.89 1.77 1.75 1.84
[121] 1.66 1.71 1.75 1.75 1.64 1.73 1.79 1.74 1.83 1.80 1.74 1.81 1.80 1.66 1.75
[136] 1.82 1.80 1.81 1.71 1.59 1.71 1.79 1.80 1.70 1.77 1.78 1.64 1.70 1.86 1.75
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.56 1.71 1.74 1.744 1.79 1.93
FEMALE HEIGHTS (m)- hfemale
[1] 1.60 1.56 1.61 1.64 1.65 1.58 1.71 1.72 1.72 1.61 1.72 1.52 1.47 1.61 1.64
[16] 1.60 1.67 1.76 1.57 1.60 1.59 1.61 1.59 1.61 1.56 1.68 1.61 1.63 1.58 1.68
[31] 1.51 1.64 1.52 1.59 1.62 1.64 1.65 1.64 1.67 1.56 1.77 1.55 1.71 1.71 1.54
[46] 1.60 1.67 1.58 1.53 1.64 1.63 1.60 1.64 1.67 1.54 1.65 1.57 1.59 1.58 1.58
[61] 1.67 1.53 1.69 1.64 1.54 1.66 1.71 1.58 1.60 1.52 1.41 1.51 1.56 1.65 1.68
[76] 1.55 1.60 1.57 1.73 1.58 1.53 1.58 1.53 1.66 1.57 1.54 1.69 1.62 1.65 1.64
[91] 1.61 1.67 1.64 1.57 1.70 1.66 1.61 1.62 1.58 1.67 1.67 1.69 1.53 1.70 1.65
[106] 1.56 1.79 1.70 1.61 1.56 1.65 1.59 1.62 1.71 1.57 1.72 1.58 1.70 1.70 1.66
[121] 1.60 1.54 1.60 1.68 1.68 1.67 1.57 1.61 1.64 1.57 1.72 1.48 1.60 1.66 1.60
[136] 1.58 1.65 1.59 1.57 1.53 1.60 1.64 1.57 1.59 1.68 1.61 1.66 1.52 1.67 1.65
[121] 26.4 28.4 24.8 29.1 25.6 29.4 26.2 29.7 22.8 21.5 21.3 29.9 17.6 28.3 24.1
[136] 28.3 24.0 25.4 23.9 24.3 30.4 28.6 25.0 23.8 36.0 31.5 21.8 29.4 30.8 28.1
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.4 23.42 26.8 26.92 29.7 38.8
Methods for summarizing data were discussed in Section 1.3. Both numerical and graphical sum-
maries can be obtained easily using R. For example, mean(x) and var(x) produce the mean $\bar{x}$ and
variance $s_x^2$ of a set of numbers $x_1, \ldots, x_n$ contained in the vector x. The definitions of $\bar{x}$ and $s_x^2$ are
(see Section 1.3.2)
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad s_x^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2.$$
Using this we can find that the mean (average) height for the 150 males in the sample is $\bar{x} = 1.74$ m
(68.5 in.) and for the 150 females is $\bar{x} = 1.62$ m (63.8 in.).
Figure 1.8: Histograms and Models for Height and BMI data
A histogram gives a picture of the data. Figure 1.8 shows relative frequency histograms for heights
and BMI’s for males and females. We also show normal distribution probability density functions
overlaid on each histogram. In each case we used a normal distribution $N(\mu, \sigma^2)$ where the mean $\mu$
and variance $\sigma^2$ were taken to equal $\bar{x}$ and $s_x^2$. For example, for the male heights we used $\mu = 1.74$ and
$\sigma^2 = 0.004316$. Note from Figure 1.8 that the normal (Gaussian) distributions agree only moderately
well with the observed data. Chapter 2 discusses probability models and comparisons of models and of
data in more detail.
The following R code, reproduced from the Appendix, illustrates how to look at the data and
produce plots like those in Figure 1.8.
> summary(bmimale)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.3 24.7 26.75 27.08 29.1 37.5
> summary(bmifemale)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.4 23.42 26.8 26.92 29.7 38.8
> sd(bmifemale)   # assumed command; this is the standard deviation used below
[1] 4.602213
> par(mfrow=c(1,2))   # Sets up graphics to do two side-by-side plots per page
> hist(bmimale,prob=T,xlim=c(15,40))   # Relative frequency histogram; the xlim
                                       # option specifies the range we want for
                                       # the x-axis.
> x<- seq(15,40,.01)   # We'll use this vector to plot a Gaussian pdf
> fx<- dnorm(x,27.08,3.56)   # Computes values f(x) of the G(27.08,3.56) pdf; we
                             # have estimated the distribution mean and standard
                             # deviation from the sample values.
> lines(x,fx)   # This function adds the points (x,fx) to the latest plot created
                # and joins them up with lines. This creates a plot of the pdf
                # overlaid on the histogram.
> hist(bmifemale,prob=T,xlim=c(15,40))   # Now do a histogram for the female data.
> fx<- dnorm(x,26.92,4.60)   # Compute pdf f(x) for the G(26.92,4.60) distribution
> lines(x,fx)   # As previously
> q()   # Quit the R session.
1.7 A More Detailed Example: Colour Classification by Robots

In this example a robot uses a sensor that measures light intensity y in order to identify the colour of items (Black, Light Blue, Green, Red, or White). Figure 1.9 shows a plot (called a "strip" plot) of such data on 20 items of each of the five colours.
It is clear that the measurements for the Black items are well separated from the rest, but that there is
some overlap in the ranges of the intensities for some pairs of items.
To program the robot to “recognize" colour we must partition the range for y into five regions,
one corresponding to each colour. There are various ways to do this, the simplest being to choose
values that minimize the number of misclassifications (incorrectly identified colours) in the data that
are available. Another approach is to model the variability in y-values for each colour using a random
variable Y with a probability distribution and then to use this to select the partition. This turns out to
have certain advantages, and we will consider how this can be done.
[Figure 1.9: strip plot of light intensity y (range roughly 10 to 60) for 20 items of each of the colours White, Red, Green, Blue, and Black]
Empirical study has shown that for the population of White items Y has close to a Gaussian distri-
bution,
$$Y \sim G(\mu_W, \sigma_W).$$
Similarly, for Black, Green, Light Blue and Red items the distribution of Y is close to $G(\mu_B, \sigma_B)$,
$G(\mu_G, \sigma_G)$, $G(\mu_L, \sigma_L)$ and $G(\mu_R, \sigma_R)$, respectively. The approximate values of $\mu$ and $\sigma$ for each
colour, used in the calculations below, are (we will discuss in Chapter 2 how to find such values)

Colour       $\mu$    $\sigma$
Black        25.7     2.4
Light Blue   38.4     1.0
Green        42.7     1.3
Red          47.4     1.1
White        49.8     1.2
The operators of the equipment have set the following decision rule (partition) for identifying the
colour of an item, based on the observed value y:
Black: $y \leq 34.0$
Light Blue: $34.0 < y \leq 40.5$
Green: $40.5 < y \leq 45.2$
Red: $45.2 < y \leq 48.6$
White: $y > 48.6$
Question: Based on the Gaussian models, what are the probabilities an item colour is misclassified?
Note that colours may be classified incorrectly in more than one way. For example, with $Y \sim G(47.4, 1.1)$ for a Red item,
$$P(\text{Red is misclassified as Green}) = P(40.5 < Y \leq 45.2) = 0.0227$$
$$P(\text{Red is misclassified as White}) = P(Y > 48.6) = 0.1377.$$
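These values can be computed with R's pnorm() function; a sketch, using the G(47.4, 1.1) model for Red items:

pnorm(45.2, 47.4, 1.1) - pnorm(40.5, 47.4, 1.1)   # P(40.5 < Y <= 45.2), about 0.0227
1 - pnorm(48.6, 47.4, 1.1)                        # P(Y > 48.6), about 0.1377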
Question: What kind of data would we collect to check that the Gaussian models are satisfactory
approximations to the distributions of Y ?
We would need to randomly select items of a specific colour, then use the sensor to get an inten-
sity measurement y. By doing this for a large number n of items, we would get measurements
$y_1, \ldots, y_n$ which can be used to examine or fit parametric models for Y, using methods developed
in later chapters.
Question: Do we need to use a Gaussian model (or any other probability distribution) for this problem?
No. We could use a completely “empirical” approach in which we used the sensor experimentally
with many items of different colours, then determined a “good” decision rule for identifying
colour from the observed data. (For example, we could, as mentioned above, do this so that the
total number of items misclassified was minimized.)
To see how this would work, use R to simulate, say 20, y values for each colour, then try to pick
the cut-points for your decision rule.
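A minimal sketch of such a simulation, using rnorm() with the parameter values given earlier (the object names are arbitrary):

yB  <- rnorm(20, 25.7, 2.4)   # 20 simulated intensities for Black items
yLB <- rnorm(20, 38.4, 1.0)   # Light Blue
yG  <- rnorm(20, 42.7, 1.3)   # Green
yR  <- rnorm(20, 47.4, 1.1)   # Red
yW  <- rnorm(20, 49.8, 1.2)   # White
stripchart(list(Black = yB, Blue = yLB, Green = yG, Red = yR, White = yW))   # strip plot as in Figure 1.9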
Question: What are some advantages of using a probability model (assuming it fits the data)?
In more complicated problems (e.g. with more types of items or with multivariate measurements)
a direct empirical approach may be difficult to implement.
Models allow comparisons to be made easily across similar types of applications. (For example,
sensors of similar types, used in similar settings.)
Models are associated with “scientific” descriptions of measurement devices and processes.
Figure 1.10 shows the probability density functions for the Gaussian distributions for Black, Light
Blue, Green, Red and White items, which provides a clear picture of misclassification probabilities.
Given the models above, it is possible to determine an “optimal" partition of the y-scale, according
to some criterion. The criterion that is often used is called the overall misclassification rate, and it is
defined as follows for this problem. Suppose that among all of the items which the robot encounters
over some period of time, the fractions which are Black, Light Blue, Green, Red, and White are
$P_B, P_{LB}, P_G, P_R$ and $P_W$ respectively (with $P_B + P_{LB} + P_G + P_R + P_W = 1$). Suppose also that
instead of the values above we consider arbitrary cut-points $c_1 < c_2 < c_3 < c_4$, so that if $y \leq c_1$, the
decision is "Black", if $c_1 < y \leq c_2$, the decision is "Light Blue", and so on. The overall probability a
randomly selected item is classified correctly (CC) is
$$P(CC) = P(CC|\text{Black})P(\text{Black}) + P(CC|\text{Blue})P(\text{Blue}) + P(CC|\text{Green})P(\text{Green}) + P(CC|\text{Red})P(\text{Red}) + P(CC|\text{White})P(\text{White}) \qquad (1.7)$$
$$= P(Y_B \leq c_1)P_B + P(c_1 < Y_{LB} \leq c_2)P_{LB} + P(c_2 < Y_G \leq c_3)P_G + P(c_3 < Y_R \leq c_4)P_R + P(Y_W > c_4)P_W$$
[Figure 1.10: probability density functions pdf(y) of the five Gaussian intensity models, plotted over the range 0 to 60]
where $Y_B, Y_{LB}, Y_G, Y_R, Y_W$ denote random variables with the $G(25.7, 2.4)$, $G(38.4, 1.0)$, $G(42.7, 1.3)$, $G(47.4, 1.1)$, $G(49.8, 1.2)$
distributions respectively. It is not trivial to choose $c_1, c_2, c_3, c_4$ to maximize this (and therefore
minimize the probability of incorrect classification) but it can be done numerically, for any set of
values for $P_B, P_{LB}, P_G, P_R, P_W$. Note that for given values of $c_1, c_2, c_3, c_4$ we can readily calcu-
late $P(CC)$. For example, if $P_B = P_{LB} = 0.05$ and $P_G = P_R = P_W = 0.3$, then with the values
$c_1 = 34.0$, $c_2 = 40.5$, $c_3 = 45.2$, $c_4 = 48.6$ used above we get, using R to calculate the probabili-
ties in (1.7),
P(CC) = .05[pnorm(34.0, 25.7, 2.4)] + .05[pnorm(40.5, 38.4, 1.0) - pnorm(34.0, 38.4, 1.0)]
        + .3[pnorm(45.2, 42.7, 1.3) - pnorm(40.5, 42.7, 1.3)]
        + .3[pnorm(48.6, 47.4, 1.1) - pnorm(45.2, 47.4, 1.1)] + .3[1 - pnorm(48.6, 49.8, 1.2)]
      = .05(.9997) + .05(.9821) + .3(.9723) + .3(.8396) + .3(.8413)
      = .895
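A runnable version of this calculation (a sketch; the mix proportions are those assumed above):

pB <- .05; pLB <- .05; pG <- .3; pR <- .3; pW <- .3
pcc <- pB  * pnorm(34.0, 25.7, 2.4) +
       pLB * (pnorm(40.5, 38.4, 1.0) - pnorm(34.0, 38.4, 1.0)) +
       pG  * (pnorm(45.2, 42.7, 1.3) - pnorm(40.5, 42.7, 1.3)) +
       pR  * (pnorm(48.6, 47.4, 1.1) - pnorm(45.2, 47.4, 1.1)) +
       pW  * (1 - pnorm(48.6, 49.8, 1.2))
pcc   # about 0.895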
Thus the probability of incorrect identification of a colour is .105. Note that if the “mix" of colours
that the robot sees changes (i.e. the values PB ; : : : ; PW change) then P (CC) changes. For example,
if there were more Black and fewer White items in the mix, the P (CC) would go up. Problems 3 and
4 at the end of the chapter consider slightly simpler classification problems involving only two types
of "items". It is easier to maximize $P(CC)$ in these cases. You should note that the general
problem of classification based on observed data is very common: spam detection in your email program,
credit checking at a bank, fraud detection in a financial institution, and even decisions in legal settings such as
courts are all examples where classification takes place on the basis of noisy data.
1.8 Appendix: The R Language and Software

To see the documentation for an R function, use help(); for example, type help(mean).
In some cases help.search() is also helpful. The assignment symbol is <- , as in the examples below.
1.8.2 Vectors
Vectors can consist of numbers or other symbols; we will consider only numbers here. Vectors are
defined using c(): for example,
x<- c(1,3,5,7,9)
defines a vector of length 5 with the elements given. Vectors and other classes of objects possess certain
attributes. For example, typing
length(x)
will give the length of the vector x. Vectors of length n are often a convenient way to store data values
for n individuals or units in a sample. For example, if there are variates x and y associated with any
given individual, we would define vectors for x and for y.
1.8.3 Arithmetic
The following R commands and responses should explain arithmetic operations.
> 7+3
[1] 10
> 7*3
[1] 21
> 7/3
[1] 2.333333
> 2^3
[1] 8
> mean(x)   # assumed command; the mean of x = c(1,3,5,7,9) defined earlier
[1] 5
> summary(x) # A useful function which summarizes features of
a vector x
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 3 5 5 7 9
> var(x) # Computes the (sample) variance of the elements of x
[1] 10
> exp(1) # The exponential function
[1] 2.718282
> y<- seq(1,2,.25)   # assumed definition of y, consistent with the output below
> exp(y)
[1] 2.718282 3.490343 4.481689 5.754603 7.389056
> round(exp(y),2) # round(y,n) rounds the elements of vector y to
n decimals
[1] 2.72 3.49 4.48 5.75 7.39
> x+2*y
[1] 3.0 5.5 8.0 10.5 13.0
1.8.5 Graphs
To open a graphics window in Unix, type x11(). Note that in R, a graphics window opens automatically
when a graphical function is used. There are various plotting and graphical functions. Two useful ones
are plot(), which produces scatterplots of points, and hist(), which produces histograms.
Graphs can be tailored with respect to axis labels, titles, numbers of plots to a page etc. Type help(plot),
help(hist) or help(par) for some information.
To save/print a graph in R using UNIX, you generate the graph you would like to save/print in R using
a graphing function like plot() and type:
dev.print(device,file="filename")
where device is the device you would like to save the graph to (i.e. x11) and filename is the name of
the file that you would like the graph saved to. To look at a list of the different graphics devices you
can save to, type
help(Devices).
Alternatively, you can right click on the graph. This gives you a choice of copying the graph and then pasting it into
an editor, such as MS Word, or saving the graph as a metafile or bitmap. You may also print directly
to a printer using this option.
1.8.6 Distributions
There are functions which compute values of probability or probability density functions, cumulative
distribution functions, and quantiles for various distributions. It is also possible to generate (pseudo)
random samples from these distributions. Some examples follow for the Gaussian distribution. For
other distribution information, type help(Poisson),
help(Binomial) etc.
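For the Gaussian distribution, the relevant functions follow R's d/p/q/r naming convention; a brief sketch:

dnorm(1.5, mean = 0, sd = 1)   # pdf of G(0,1) evaluated at 1.5
pnorm(1.5, 0, 1)               # cdf: P(Y <= 1.5)
qnorm(0.95, 0, 1)              # the 0.95 quantile of G(0,1)
rnorm(10, 0, 1)                # a pseudorandom sample of size 10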
You can read numerical data stored in a text file called (say) data into an R vector y by typing
y<- scan("data")
To write the contents of the vector y to a text file, type
write(y,file="filename")
1.8.9 Example: Examining the BMI Data
The R session below describes how to take data on the BMI measurements for 150 males and 150
females and examine them, including the possibility of fitting Gaussian distributions to the data. The
data are in vectors bmimale and bmifemale.
> summary(bmimale)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.3 24.7 26.75 27.08 29.1 37.5
> summary(bmifemale)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.4 23.42 26.8 26.92 29.7 38.8
[91] 27.7 27.7 27.8 27.8 27.9 27.9 27.9 27.9 28.0 28.0 28.0 28.2 28.5 28.5 28.6
[106] 28.6 28.7 28.7 28.8 28.9 29.0 29.1 29.1 29.2 29.3 29.3 29.7 29.7 30.0 30.1
[121] 30.1 30.1 30.2 30.2 30.2 30.3 30.4 30.7 30.8 30.9 31.0 31.0 31.8 32.0 32.4
[136] 32.4 32.5 32.6 32.7 33.5 33.6 33.7 34.0 34.1 34.2 34.7 34.9 35.2 35.5 37.5
NOTE: You can see from the histograms and Gaussian pdf plots that the Gaussian distribution does
not seem an especially good model for BMI variation. The Gaussian pdf’s are symmetric whereas the
distribution of BMI measurements looks somewhat asymmetric.
1.9 Problems
1. The binomial distribution is a discrete probability model with probability function of the form
$$f(y) = \binom{n}{y} p^y (1-p)^{n-y}, \qquad y = 0, 1, \ldots, n$$
where 0 < p < 1 and n is a positive integer. If Y is a random variable with probability function
f(y) we write $Y \sim \mathrm{Bin}(n, p)$.
A woman who claims to have special guessing abilities is given a test, as follows: a deck which
contains five cards with the numbers 1 to 5 is shuffled and a card drawn out of sight of the
woman. The woman then guesses the card, the deck is reshuffled with the card replaced, and
the procedure is repeated several times. Let Y represent the number of correct guesses by the
woman.
2. The exponential distribution is a continuous probability model with probability density function of the form
$$f(x) = \frac{1}{\theta}\, e^{-x/\theta}, \qquad x \geq 0 \qquad (1)$$
where $\theta > 0$ is a parameter.
(a) Show that $\theta$ is the mean of X. Graph the probability density function of X.
(b) The exponential distribution is often found to be a suitable model for distributions of life-
times. The 30 observations X1 ; :::; X30 below, for example, are the lifetimes (in days) of a
random sample of a particular type of lightbulb, subjected to constant use:
23 261 87 7 120 14 62 47 225 71
246 21 42 20 5 12 120 11 3 14
71 11 14 11 16 90 1 16 52 95
The mean of these 30 numbers is $\bar{X} = 59.6$. It has been suggested that if an exponential
model is suitable for representing the distribution of lifetimes in the population of lightbulbs
from which they came, then $\theta$ in (1) should have a value of around 59.6. Why should this
be so?
(c) For the exponential distribution (1) with $\theta = 59.6$, calculate
i) $p_1 = P(0 \leq X < 40)$  ii) $p_2 = P(40 \leq X < 100)$  iii) $p_3 = P(100 \leq X < 200)$  iv) $p_4 = P(X \geq 200)$
Compare the values 30p1 , 30p2 , 30p3 , 30p4 with the actual number of observations in
the four intervals [0; 40), [40; 100), [100; 200), [200; 1), respectively. Why should these
numbers agree fairly well if the exponential distribution is a suitable model?
(d) Use a graph created using R to compare the model (1) with the data observed.
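A sketch of how the computations in (c) and (d) might be done in R, taking the exponential mean to be 59.6 (pexp and dexp are parametrized by rate = 1/mean; the vector name lifetimes is an assumed name for the 30 observations):

theta <- 59.6
p <- diff(pexp(c(0, 40, 100, 200, Inf), rate = 1/theta))   # p1, p2, p3, p4
30 * p                               # expected counts in the four intervals
hist(lifetimes, prob = TRUE)         # relative frequency histogram of the data
x <- seq(0, 300, 0.5)
lines(x, dexp(x, rate = 1/theta))    # overlay the fitted exponential pdf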
3. Let Y be a variate representing the systolic blood pressure of a randomly selected woman in
a large population, grouped or stratified by age. Good models for Y for persons not taking any
medication have been found to be:
ages 17-24: $Y \sim G(118, 8)$
(a) Plot the probability density functions for the two models on the same graph using R.
(b) For each age group find the probability a randomly selected woman has blood pressure over
140.
(c) Suppose you were given the blood pressures $Y_1, \ldots, Y_{25}$ for 25 women from one of the two
age groups; you do not know which. Show that you could decide with near certainty which
group they came from by considering the value of $\bar{Y} = (Y_1 + \cdots + Y_{25})/25$.
(d) Suppose you know that very close to 10% of the population of women are in each age group
and that you want to devise a rule for “deciding” which age group a woman is in, based on
knowing only her blood pressure Y . Good rules will be of the form: for some chosen value
y0 , decide that
age is 17-24 iff $Y \leq y_0$.
Assuming you are going to use this rule over and over again, find y0 so that the fraction of
decisions which are wrong is minimized.
4. A test for diabetes is based on glucose (sugar) levels Y in the blood, measured after fasting
for a specified period of time. For healthy people the glucose levels have close to a Gaussian
distribution with mean $\mu = 5.31$ mmol/L and standard deviation $\sigma = 0.58$ mmol/L. For
untreated diabetics Y is (close to) Gaussian with $\mu = 11.74$ and $\sigma = 3.50$.
A diagnostic test for diabetes can be based on a measurement Y for a given individual. Suppose
that we use the cutoff value 6.5, and diagnose someone as diabetic (diagnosis to be confirmed by
further tests) if Y > 6.5.
(a) If a person is diabetic, what is the probability that they are diagnosed as such by the test?
(b) What is the probability that a nondiabetic is incorrectly diagnosed as being diabetic?
(c) The probability in part (a) is called the sensitivity of the test. We can increase it by choosing
a cutoff value C which is less than 6.5. However, this will increase the probability of
incorrectly diagnosing nondiabetics, as in part (b). Recompute the probabilities in (a) and
(b) if C = 6.0.
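A sketch of the computations for parts (a) and (b), using pnorm with the parameter values stated in the problem:

1 - pnorm(6.5, 11.74, 3.50)   # (a) sensitivity: P(Y > 6.5) for a diabetic
1 - pnorm(6.5, 5.31, 0.58)    # (b) P(Y > 6.5) for a nondiabetic
# for part (c), repeat both lines with the cutoff 6.0 in place of 6.5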
5. If $Y \sim \mathrm{Bin}(n, p)$, then when n is large, probabilities for Y can be approximated using the normal distribution:
$$P(Y \leq y) \approx F\left(\frac{y - np}{\sqrt{np(1-p)}}\right) \qquad (2)$$
where F(z) is the cumulative distribution function (cdf) for the G(0, 1) distribution. A slightly
better approximation is to replace y with y + 0.5 in (2).
A mail campaign to sign up new customers for a particular credit card has in the recent past had
a success rate of about p = 0.012 (i.e. about 1.2% of persons contacted sign up for the card).
(a) If 140,000 persons are sent the offer to sign up, what is the probability at least 1600 new
customers are obtained, if p = 0.012?
(b) Suppose 2000 new customers were actually obtained. Would you conclude that this cam-
paign was more successful than other recent ones? Explain your answer.
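A sketch of the computation in part (a), comparing the exact binomial probability with the continuity-corrected normal approximation (2):

n <- 140000; p <- 0.012
1 - pbinom(1599, n, p)                           # exact: P(Y >= 1600)
1 - pnorm((1599 + 0.5 - n*p)/sqrt(n*p*(1-p)))    # approximation via (2)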
(a) Suggest two things that you might do in order to assess whether a given computer procedure
satisfies the above conditions. (Assume that you can use the procedure to generate, say,
1000 digits and that you will base your assessment on these values.)
(b) The next problem gives such a procedure, but it is theoretically a little complicated to work
out its distribution mathematically. However, the distribution can be closely approximated
by computer simulation. You can generate a sequence $Y_1, Y_2, \ldots, Y_n$ of length n by the R
command y<- sample(0:9, n, replace=T)
Generate a sequence of 1000 digits and plot a histogram of the data. ( You can use it to check,
for example, whether runs of odd and even digits are consistent with the randomness conditions.)
One way to test whether a sequence “looks” like it could have come from a Bernoulli model
is to consider the number of runs (maximal subsequences of 0’s or 1’s) in the sequence; let the
random variable R denote the number of runs.
(a) Consider a randomly generated sequence from the Bernoulli model and suppose there were
a ones and b zeros in the sequence. Show that, conditional on a and b, the probability function of R is
$$P(R = r) = \begin{cases} \dfrac{2\dbinom{a-1}{k-1}\dbinom{b-1}{k-1}}{\dbinom{a+b}{a}} & r = 2k \\[3ex] \dfrac{\dbinom{a-1}{k-1}\dbinom{b-1}{k} + \dbinom{a-1}{k}\dbinom{b-1}{k-1}}{\dbinom{a+b}{a}} & r = 2k+1 \end{cases}$$
(b) Part (a) is challenging. An approximation to the distribution of R which is not conditional
on a and b may be based on the Central Limit Theorem. Let $Y_1, \ldots, Y_n$ be a Bernoulli
sequence of 0's and 1's and define the random variables
$$U_1 = 1, \qquad U_i = \begin{cases} 0 & \text{if } Y_i = Y_{i-1} \\ 1 & \text{if } Y_i \neq Y_{i-1} \end{cases} \quad (i = 2, \ldots, n).$$
Note that $R = \sum_{i=1}^{n} U_i$ (why?) and use this to find E(R) and Var(R) in terms of n and p,
where $p = P(Y_i = 1) = 1 - P(Y_i = 0)$.
(c) By the Central Limit Theorem, $Z = [R - E(R)]/\mathrm{Var}(R)^{1/2}$ has a limiting G(0, 1) distri-
bution as $n \to \infty$. Use this to find the approximate probability of 20 or fewer runs in a
Bernoulli sequence with n = 100, p = 0.5.
(d) Outline how you could approximate the distribution of R in part (b) by simulation. For the
case n = 10, p = 0.5 (i.e. sequences of length 10) generate 1000 sequences and obtain the
R value for each one. Plot these in a histogram.
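A minimal sketch of this simulation for n = 10 and p = 0.5:

nsim <- 1000
runs <- replicate(nsim, {
  y <- rbinom(10, 1, 0.5)      # one Bernoulli(0.5) sequence of length 10
  1 + sum(y[-1] != y[-10])     # R = U1 + the number of changes between neighbours
})
hist(runs)                     # histogram of the 1000 simulated values of R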
8. The data below show the lengths (in cm) of male and female coyotes captured in Nova Scotia.
Females
93.0 97.0 92.0 101.6 93.0 84.5 102.5 97.8 91.0 98.0 93.5 91.7
90.2 91.5 80.0 86.4 91.4 83.5 88.0 71.0 81.3 88.5 86.5 90.0
84.0 89.5 84.0 85.0 87.0 88.0 86.5 96.0 87.0 93.5 93.5 90.0
85.0 97.0 86.0 73.7
Males
97.0 95.0 96.0 91.0 95.0 84.5 88.0 96.0 96.0 87.0 95.0 100.0
101.0 96.0 93.0 92.5 95.0 98.5 88.0 81.3 91.4 88.9 86.4 101.6
83.8 104.1 88.9 92.0 91.0 90.0 85.0 93.5 78.0 100.5 103.0 91.0
105.0 86.0 95.5 86.5 90.5 80.0 80.0
(a) Obtain relative frequency histograms of the data for the females and the males using R.
(b) Compute the sample mean $\bar{y}$ and standard deviation $s_y$ for the female and male coyotes.
Assuming $\mu = \bar{y}$ and $\sigma = s_y$, plot the pdf's for Gaussian distributions $G(\mu, \sigma)$ over top of
the histograms for the females and males. (Note: if you have n measurements $y_1, \ldots, y_n$
then $\bar{y} = \sum_{i=1}^{n} y_i / n$ and $s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$.) (Based on Table 2.3.2 in Wild and Seber,
1999.)
9. The data below come from a study of the completion times for a task experienced by computer
users in a multiuser system. Each data point (x; y) is the result of an experiment in which x
terminals connected to the same server all initiated the same task. The variable y is the average
time per task.
x y x y x y x y
40 9.9 40 11.9 50 15.1 65 18.6
50 17.8 10 5.5 30 13.3 65 19.8
60 18.4 30 11.0 65 21.8
45 16.5 20 8.1 40 13.8
10. The R system contains a lot of interesting data sets. Here's how to look at a data set contained in
an array wtloss; it shows the weight on each day of a special diet for a very obese 48-year-old male
who weighed 184.35 kg before starting the diet. The data set is in the MASS package which is
part of R. First you need the command
> library (MASS)
Then the command
> wtloss
will display a data base called wtloss which has 52 rows and two columns. Column 1 gives the
day of the diet and Column 2 the person’s weight on that day. Obtain a scatterplot of weight vs.
day by the command
> plot(wtloss$Days, wtloss$Weight, xlab = "Day", ylab = "Weight")
Does the weight appear to be a linear function of days on diet? Why would you expect that it
would not be linear if a long time period is involved?
MODEL FITTING, MAXIMUM
LIKELIHOOD ESTIMATION, AND
MODEL CHECKING
2.1 Statistical Models and Probability Distributions

A statistical model is a mathematical model that incorporates probability in some way. As described
in Chapter 1, our interest here is in studying variability and uncertainty in populations and processes
and drawing inferences where warranted in the presence of this uncertainty. This will be done by con-
sidering random variables that represent characteristics of the units or individuals in the population or
process, and by studying the probability distributions of these random variables. It is very important to
be clear about what the “target" population or process is, and exactly how the variables being consid-
ered are defined and measured. Chapter 3 discusses these issues. You have already seen some examples
in Chapter 1, and been reminded of material on random variables and probability distributions that is
taught in earlier courses.
A preliminary step in probability and statistics is the choice of a probability model to suit a given
application. The choice of a model is usually driven by some combination of three factors:
1. Background knowledge or assumptions about the population or process which lead to certain distributions.
2. Past experience with data from the population or process, which has shown that certain distribu-
tions are suitable.
3. Mathematical convenience.
In probability theory, there is a lot of emphasis on factor 1 above, and there are many "families" of probability distributions that describe certain types of situations. For example, binomial and Poisson distributions were derived as models for outcomes in repeated trials and for the random occurrence of events in time or space, respectively. The normal (Gaussian) distribution, on the other hand, is often used to represent the distributions of continuous measurements such as the heights or weights of individuals, based largely on past experience that such models are suitable and on mathematical convenience, more than on factor 1 above.
In choosing a model we usually consider families of probability distributions. To be specific, let us suppose that for some discrete random variable Y we consider a family whose probability function depends on one or more parameters $\theta$:
$$P(Y = y) = f(y; \theta) \quad \text{for } y \in R$$
where R is a countable (i.e. discrete) set of real numbers, the range of the random variable Y. In order to apply the model to a specific problem we require a value for $\theta$; the selection of a value based on the data (let's call it $\hat{\theta}$) is often referred to as "fitting" the model or as "estimating" the value of $\theta$. The next section gives a way to do this.
Most applications require a series of steps in the formulation (the word “specification" is also used)
of a model. In particular, we often start with some family of models in mind, but find after examining
data and fitting the model that it is unsuitable in certain respects. (Methods for checking the suitability
of a model will be discussed in Section 2.4.) We then try out other models, and perhaps look at more
data, in order to work towards a satisfactory model. Doing this is usually an iterative process, sometimes represented as a cycle of model formulation, fitting, and checking.
Statistics devotes considerable effort to the steps of this process. However, here we will focus on
settings in which the models are not too complicated, so that model formulation problems are mini-
mized. There are several distributions that you should review before continuing, since they will appear frequently in these notes. There is a little more detail about them in the appendix and in the Stat 230 notes. You should also consult the table of distributions at the end of these notes for a condensed summary of their properties, including their moment generating functions and their moments.
Binomial Distribution
The discrete random variable (r.v.) Y has a binomial distribution ($Y \sim \text{Bin}(n, \theta)$) if its probability function is of the form
$$f(y; \theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y} \quad \text{for } y = 0, 1, \ldots, n \tag{2.2}$$
Poisson Distribution
The discrete random variable Y has a Poisson distribution ($Y \sim \text{Poisson}(\theta)$) if its p.f. is of the form
$$f(y; \theta) = \frac{\theta^y e^{-\theta}}{y!} \quad \text{for } y = 0, 1, 2, \ldots$$
where $\theta$ is a parameter with $\theta > 0$. Then $E(Y) = \theta$ and $\mathrm{Var}(Y) = \theta$.

Exponential Distribution
The continuous random variable Y has an exponential distribution ($Y \sim \text{Exp}(\theta)$) if its p.d.f. is of the form
$$f(y; \theta) = \frac{1}{\theta} e^{-y/\theta} \quad \text{for } y > 0$$
where $\theta$ is a parameter with $\theta > 0$. Be careful not to confuse the notation $Y \sim \text{Exp}(\theta)$, which specifies the exponential distribution for the random variable Y, with a common notation for the exponential function². Then $E(Y) = \theta$ and $\mathrm{Var}(Y) = \theta^2$.

Normal (Gaussian) Distribution
The continuous random variable Y has a normal (Gaussian) distribution if its p.d.f. is of the form
$$f(y; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right] \quad \text{for } -\infty < y < \infty$$
where $\mu$ and $\sigma$ are parameters, with $-\infty < \mu < \infty$ and $\sigma > 0$. Then $E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2$, so that $\text{sd}(Y) = \sigma$. We write either $Y \sim G(\mu, \sigma)$ or $Y \sim N(\mu, \sigma^2)$ to indicate that Y has this distribution.
Note that in the former case, $G(\mu, \sigma)$, the second parameter is the standard deviation, whereas in the latter, $N(\mu, \sigma^2)$, we specify the variance $\sigma^2$ as the second parameter. Most software syntax, including R, requires that you input the standard deviation for the second parameter. The normal (Gaussian) distribution
provides a suitable model for the distribution of measurements on characteristics like the size or weight
of individuals in certain populations, but is also used in many other settings. It is particularly useful in
finance where it is the basis for the most common model for asset prices, exchange rates, interest rates,
² Sometimes $e^x$ is denoted by $\exp(x)$.
etc.
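As a quick illustration of the parameterization point (a hedged sketch, reusing the fitted height values 1.62 and 0.0637 that appear later in this chapter):

    # R's normal functions take the standard deviation, not the variance:
    dnorm(1.7, mean = 1.62, sd = 0.0637)   # G(1.62, 0.0637) p.d.f. evaluated at y = 1.7
    pnorm(1.7, mean = 1.62, sd = 0.0637)   # corresponding c.d.f. value P(Y <= 1.7)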
Multinomial Distribution
This is a multivariate distribution in which the discrete random variables $Y_1, \ldots, Y_k$ ($k \ge 2$) have the joint p.f.
$$P(Y_1 = y_1, \ldots, Y_k = y_k) = f(y_1, \ldots, y_k; \theta) = \frac{n!}{y_1! y_2! \cdots y_k!}\, \theta_1^{y_1} \theta_2^{y_2} \cdots \theta_k^{y_k}$$
where each $y_i$ (for $i = 1, \ldots, k$) is an integer between 0 and n, and they satisfy the condition $\sum_{i=1}^{k} y_i = n$. The elements of the parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_k)$ satisfy $0 \le \theta_i \le 1$ for $i = 1, \ldots, k$ and $\sum_{i=1}^{k} \theta_i = 1$. This distribution is a generalization of the binomial distribution. It arises when there
are repeated independent trials, where each trial results in one of k types of outcomes (call them types
$1, \ldots, k$), and the probability that a type i outcome occurs is $\theta_i$. If $Y_i$ (where $i = 1, \ldots, k$) is the number of times that type i occurs in a sequence of n trials, then $(Y_1, \ldots, Y_k)$ have the joint distribution above. This is indicated by $(Y_1, \ldots, Y_k) \sim \text{Mult}(n; \theta)$.
Since $\sum_{i=1}^{k} Y_i = n$ we can, if we wish, rewrite $f(y_1, \ldots, y_k; \theta)$ using only $k - 1$ variables, say $y_1, \ldots, y_{k-1}$ (with $y_k$ replaced by $n - y_1 - \cdots - y_{k-1}$). We see that the multinomial distribution
with k = 2 is just the binomial distribution, where the S outcomes are type 1 (say) and the F outcomes
are type 2.
We will also consider models that include explanatory variables, or covariates. For example, sup-
pose that the response variable Y is the weight (in kg) of a randomly selected female in the age range
16-25, in some population. A person’s weight is related to their height, so we might want to study
this relationship. A way to do this is to consider females with a given height x (say in meters), and to propose that the distribution of Y, given x, is Gaussian, $G(\alpha + \beta x, \sigma)$. That is, we are proposing that the average (expected) weight of a female depends linearly on her height x; we write this as
$$E(Y \mid x) = \alpha + \beta x$$
This seems sensible (why?) and similar ideas can be developed for other parameters; in particular, note that $\sigma$ must also be estimated, and you might think about how you could use $y_1, \ldots, y_{10}$ to do this. (Hint: what does $\sigma$ or $\sigma^2$ represent in the Gaussian model?) Before we move on, note that although we are estimating the parameter $\mu$ we did not write $\mu = \frac{1}{10} \sum_{i=1}^{10} y_i$. Why did we introduce a special notation $\hat{\mu}$? This serves a dual purpose: both to remind you that $\frac{1}{10} \sum_{i=1}^{10} y_i$ is not exactly equal to the unknown value of the parameter $\mu$, and also to indicate that $\hat{\mu}$ is a quantity derived from the data $y_i$, $i = 1, 2, \ldots, 10$, and is therefore random. A different draw of the sample $y_i$, $i = 1, 2, \ldots, 10$, will result in a different value for the random variable $\hat{\mu}$.
2.2 Estimation of Parameters (Model Fitting)

Instead of ad hoc approaches to estimation as in (2.3), it is desirable to have a general method for estimating parameters. Maximum likelihood is a very general method, which we now describe.
Let the (vector) random variable Y represent potential data that will be used to estimate $\theta$, and let y represent the actual observed data that are obtained in a specific application. Note that to apply the maximum likelihood method, we must know (or make assumptions about) how the data y were collected. It is usually assumed here that the data consist of measurements on a random sample of population units. The likelihood function for $\theta$ is then defined as
$$L(\theta) = P(Y = y; \theta) \quad \text{for } \theta \in \Omega$$
where the parameter space $\Omega$ is the set of possible values for $\theta$. Thus the likelihood function is the probability of observing the data y, considered as a function of the parameter. Obviously, values of the parameter that render our observation more probable seem more credible than those that render it less probable, so values of $\theta$ for which $L(\theta)$ is large are more consistent with the observation y. The value $\hat{\theta}$ that maximizes $L(\theta)$ for given data y is called the maximum likelihood estimate (MLE)³ of $\theta$. This seems like a "sensible" approach, and it turns out to have very good properties.
Example 2.2.1⁴ A poll⁵ was commissioned by CAUT (the Canadian Association of University Teachers). This is one of a semi-annual series of polls on Post-Secondary Education and Canadian Public Opinion, this one conducted in November 2010. Harris/Decima uses a telephone poll of 2000 "representative" adults. Twenty-six percent of respondents agreed and 48% disagreed with the following statement: "University and college teachers earn too much".
³ We will often distinguish between the random variable, the maximum likelihood estimator, which is a function of the data in general, and its numerical value for the data at hand, referred to as the maximum likelihood estimate.
⁴ See the corresponding video at www.watstat.ca
⁵ http://www.caut.ca/uploads/Decima_Fall_2010.pdf
Figure 2.11: Results of the Harris/Decima poll. The two bars are from polls conducted Nov. 9, 2010 and
Nov 10, 2010 respectively.
Harris/Decima declared their results accurate to within ±2.2 percent, 19 times out of 20, but the margin of error for regional, demographic or other subgroups is wider. What does this mean, and where did these estimates and intervals come from? Suppose that the random variable Y represents the number of individuals who, in a randomly selected group of n persons, agreed with the statement. It is assumed that Y is closely modelled by a binomial distribution:
$$P(Y = y) = f(y; \theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y} \quad y = 0, 1, \ldots, n$$
where $\theta$ represents the fraction of the whole population that agree. In this case a random sample of n = 2000 persons was selected, and y = 520, the number that agree, was observed.
Thus the likelihood function is given by
$$L(\theta) = \binom{n}{y} \theta^y (1 - \theta)^{n-y}$$
or in this case
$$L(\theta) = \binom{2000}{520} \theta^{520} (1 - \theta)^{2000 - 520}. \tag{2.2.1}$$
It is easy to see that $L(\theta)$ is maximized by the value $\hat{\theta} = y/n$. (You should show this.) The value of this maximum likelihood estimate is 520/2000 or 26%. This is easily seen from a graph of the likelihood function (2.2.1) given in Figure 2.12. From the graph it can also be seen that the interval suggested by the pollsters, $26 \pm 2.2\%$ or (23.8, 28.2), is a reasonable interval for the parameter $\theta$ since it seems to contain most of the values of $\theta$ with large values of the likelihood $L(\theta)$. We will return to the construction of such interval estimates later.
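A small R sketch of this graph (our own illustration, not the notes' code): it plots (2.2.1), normalized by its maximum, and marks the pollsters' interval.

    theta <- seq(0.20, 0.32, by = 0.001)
    # log likelihood for y = 520, n = 2000 (the binomial coefficient is a constant)
    loglik <- 520 * log(theta) + 1480 * log(1 - theta)
    rel <- exp(loglik - max(loglik))          # relative likelihood L(theta)/L(thetahat)
    plot(theta, rel, type = "l", xlab = expression(theta), ylab = "relative likelihood")
    abline(v = c(0.238, 0.282), lty = 2)      # the interval 26% +/- 2.2%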
Example 2.2.2 Suppose that the random variable Y represents the number of persons infected with the human immunodeficiency virus (HIV) in a randomly selected group of n persons. Again assume that Y is closely modelled by a binomial distribution.
Figure 2.12: Likelihood function for the Harris/Decima poll and corresponding interval estimate for $\theta$.

The relative likelihood function is defined as
$$R(\theta) = L(\theta)/L(\hat{\theta}) \quad \text{for } \theta \in \Omega.$$
It is often convenient to work with the log likelihood function
$$\ell(\theta) = \log L(\theta) \quad \text{for } \theta \in \Omega.$$
Note that $\hat{\theta}$ also maximizes $\ell(\theta)$. (Why?) Because functions are often maximized by setting their derivatives equal to zero⁶, we can usually obtain $\hat{\theta}$ by solving the equation(s)
$$\frac{d\ell}{d\theta} = 0. \tag{2.2.2}$$
For example, from $L(\theta) = \theta^y (1 - \theta)^{n-y}$ we get $\ell(\theta) = y \log(\theta) + (n - y) \log(1 - \theta)$. Thus
$$\frac{d\ell}{d\theta} = \frac{y}{\theta} - \frac{n - y}{1 - \theta}$$
and solving $d\ell/d\theta = 0$ gives $\hat{\theta} = y/n$.
In many applications the data Y are assumed to consist of a random sample $Y_1, \ldots, Y_n$ from some process or population, where each $Y_i$ has the probability density function (or probability function) $f(y; \theta)$. In this case $y = (y_1, \ldots, y_n)$ and
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta). \tag{2.2.3}$$
(You should recall from probability that if $Y_1, \ldots, Y_n$ are independent random variables then their joint probability density function is the product of their individual probability density functions.) In addition, if we have independent data $Y_1$ and $Y_2$ about $\theta$ from two independent studies, then since $P(Y_1 = y_1, Y_2 = y_2) = P(Y_1 = y_1) P(Y_2 = y_2)$ by independence, we can obtain the "combined" likelihood function $L(\theta)$ based on $Y_1$ and $Y_2$ together as
$$L(\theta) = L_1(\theta) L_2(\theta)$$
where $L_j(\theta) = P(Y_j = y_j; \theta)$, $j = 1, 2$.
Example 2.2.3 Suppose that the random variable Y represents the lifetime of a randomly selected light bulb in a large population of bulbs, and that Y follows an exponential distribution with probability density function
$$f(y; \theta) = \frac{1}{\theta} e^{-y/\theta} \quad \text{for } y > 0,$$
⁶ Can you think of an example of a continuous function $f(x)$ defined on the interval $[0, 1]$ for which the maximum $\max_{0 \le x \le 1} f(x)$ is NOT found by setting $f'(x) = 0$?
where $\theta > 0$. If a random sample of light bulbs is tested and gives lifetimes $y_1, \ldots, y_n$, then the likelihood function for $\theta$ is
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta} e^{-y_i/\theta} = \theta^{-n} \exp\left(-\sum_{i=1}^{n} y_i / \theta\right).$$
Thus
$$\ell(\theta) = -n \log \theta - \frac{1}{\theta} \sum_{i=1}^{n} y_i.$$
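The maximization step is not shown at this point; as a brief sketch (our addition), setting the derivative to zero gives the sample mean:
$$\frac{d\ell}{d\theta} = -\frac{n}{\theta} + \frac{1}{\theta^2} \sum_{i=1}^{n} y_i = 0 \quad \Longrightarrow \quad \hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} y_i = \bar{y}.$$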
Example 2.2.2 revisited. Sometimes the likelihood function for a given set of data can be constructed
in more than one way. For the random sample of n persons who are tested for HIV, for example, we
could define for $i = 1, \ldots, n$
$$Y_i = I(\text{person } i \text{ is infected}).$$
(Note: I(A) is the indicator function; it equals 1 if A is true and 0 if A is false.) In this case the p.f. for $Y_i$ is $\text{Bin}(1, \theta)$ with
$$f(y_i; \theta) = \theta^{y_i} (1 - \theta)^{1 - y_i} \quad \text{for } y_i = 0, 1$$
and the likelihood function is
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta) = \theta^{y} (1 - \theta)^{n - y}$$
where $y = \sum_{i=1}^{n} y_i$. This is the same likelihood function as we obtained in Example 2.2.1, where we used the fact that $Y = \sum_{i=1}^{n} Y_i$ has a binomial distribution, $\text{Bin}(n, \theta)$.
Example 2.2.4 As an example involving more than one parameter, suppose that the random variable Y has a normal (Gaussian) distribution with probability density function
$$f(y; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{y - \mu}{\sigma}\right)^2\right] \quad \text{for } -\infty < y < \infty.$$
Example 2.2.5 The number of coliform bacteria Y in a random sample of water of volume $v_i$ ml has close to a Poisson distribution:
$$P(Y = y) = f(y; \theta) = \frac{(\theta v_i)^y}{y!} e^{-\theta v_i}, \quad y = 0, 1, 2, \ldots \tag{2.2.4}$$
where $\theta$ is the average number of bacteria per millilitre (ml) of water. There is an inexpensive test which can detect the presence (but not the number) of bacteria in a water sample. In this case what we get to observe is not Y, but rather the "presence" indicator $I(Y > 0)$, or
$$Z = \begin{cases} 1 & \text{if } Y > 0 \\ 0 & \text{if } Y = 0. \end{cases}$$
⁷ To maximize a function of two variables, set the derivative with respect to each equal to zero. Of course, finding values at which the derivatives are zero does not prove this is a maximum. Showing it is a maximum is another exercise in calculus.
Suppose that n water samples, of volumes $v_1, \ldots, v_n$, are selected. Let $z_1, \ldots, z_n$ be the presence indicators. The likelihood function is then the product of $P(Z_i = z_i)$, or
$$L(\theta) = \prod_{i=1}^{n} (1 - e^{-\theta v_i})^{z_i} (e^{-\theta v_i})^{1 - z_i}$$
and
$$\ell(\theta) = \sum_{i=1}^{n} \left[ z_i \log(1 - e^{-\theta v_i}) - (1 - z_i)\, \theta v_i \right].$$
Suppose the data observed were as follows:

    $v_i$ (ml)             8     4     2     1
    no. of samples        10    10    10    10
    no. with $z_i = 1$    10     8     7     3

This gives
$$\ell(\theta) = 10 \log(1 - e^{-8\theta}) + 8 \log(1 - e^{-4\theta}) + 7 \log(1 - e^{-2\theta}) + 3 \log(1 - e^{-\theta}) - 21\theta.$$
Either by maximizing $\ell(\theta)$ numerically for $\theta > 0$, or by solving $d\ell/d\theta = 0$ numerically, we find the MLE to be $\hat{\theta} = 0.478$. A simple way to maximize $\ell(\theta)$ is to plot it, as shown in Figure 2.13; the MLE can then be found by inspection or, for more accuracy, by iteration using Newton's method⁹.
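A hedged R sketch of the numerical maximization, using the built-in optimize() (the notes mention nlm() as an alternative):

    # log likelihood for the coliform presence/absence data
    loglik <- function(theta) {
      10 * log(1 - exp(-8 * theta)) + 8 * log(1 - exp(-4 * theta)) +
       7 * log(1 - exp(-2 * theta)) + 3 * log(1 - exp(-theta)) - 21 * theta
    }
    # maximize over a plausible range for theta
    optimize(loglik, interval = c(0.01, 2), maximum = TRUE)$maximum  # approx. 0.478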
A few remarks about numerical methods are in order. Aside from a few simple models, it is not possible to maximize likelihood functions explicitly. However, software implementing powerful numerical methods exists which can easily maximize (or minimize) functions of one or more variables. Multi-purpose optimizers can be found in many software packages; in R the function nlm() is powerful and easy to use. In addition, statistical software packages contain special functions for fitting and analyzing a large number of statistical models. The R package MASS (which can be accessed by the command library(MASS)) has a function fitdistr that will fit many common models. R and other packages are also invaluable for doing arithmetic, graphical presentations, and for manipulation of data.
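For instance, a minimal sketch of a fitdistr call on simulated data (the data here are ours, generated just to illustrate the usage):

    library(MASS)
    set.seed(1)
    y <- rexp(50, rate = 1/100)            # 50 simulated exponential lifetimes, mean 100
    fitdistr(y, densfun = "exponential")   # returns the m.l.e. of the rate (= 1/mean)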
⁹ You should recall this from your calculus course.
[Figure 2.13: plot of the log likelihood $\ell(\theta)$ for Example 2.2.5, for $0.2 \le \theta \le 1$; vertical axis: log likelihood (roughly $-25$ to $-17$).]
Multinomial models are used in many statistical applications. From Section 2.1, the multinomial probability function takes the form (using $p_i$ for the probability of a type i outcome instead of $\theta_i$)
$$f(y_1, \ldots, y_k) = \frac{n!}{y_1! \cdots y_k!} \prod_{i=1}^{k} p_i^{y_i} \quad \text{for } y_i = 0, 1, \ldots \text{ where } \sum_{i=1}^{k} y_i = n$$
If the pi ’s are to be estimated from data involving n “trials", of which yi resulted in a type i outcome
(i = 1; : : : ; k), then it seems obvious that
$$\hat{p}_i = y_i / n \quad (i = 1, \ldots, k) \tag{2.3.1}$$
would be a sensible estimate. This can also be shown to be the m.l.e. for p = (p1 ; : : : ; pk ).10
Example 2.3.1 Each person is one of 4 blood types, labelled A,B,AB and O. (Which type a person
is has important consequences, for example in determining to whom they can donate blood for a trans-
fusion.) Let p1 ; p2 ; p3 ; p4 be the fraction of a population that has types A,B,AB,O, respectively. Now
suppose that in a random sample of 400 persons whose blood was tested, the numbers who were types
1 to 4 were y1 = 172; y2 = 38; y3 = 14 and y4 = 176 (note that y1 + y2 + y3 + y4 = 400).
Note that the random variables Y1 ; Y2 ; Y3 ; Y4 that represent the number of type A,B,AB,O persons
we might get in a random sample of size n = 400 follow a multinomial distribution, Mult(400; p1 ; p2 ; p3 ; p4 ).
The m.l.e.’s from the observed data are therefore
$$\hat{p}_1 = \tfrac{172}{400} = 0.43, \quad \hat{p}_2 = \tfrac{38}{400} = 0.095, \quad \hat{p}_3 = \tfrac{14}{400} = 0.035, \quad \hat{p}_4 = \tfrac{176}{400} = 0.44$$
¹⁰ The log likelihood can be taken as (dropping the $n!/(y_1! \ldots y_k!)$ term for convenience) $\ell(p) = \sum_{i=1}^{k} y_i \log p_i$. This is a little tricky to maximize because the $p_i$'s satisfy a linear constraint, $\sum p_i = 1$. The Lagrange multiplier method (Calculus III) for constrained optimization allows us to find the solution $\hat{p}_i = y_i/n$ ($i = 1, \ldots, k$).
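As a brief sketch of that Lagrange calculation (our addition, under the footnote's setup):
$$\frac{\partial}{\partial p_i}\Big[\ell(p) + \lambda\big(1 - \sum_j p_j\big)\Big] = \frac{y_i}{p_i} - \lambda = 0 \;\Rightarrow\; p_i = \frac{y_i}{\lambda}; \qquad \sum_i p_i = 1 \Rightarrow \lambda = \sum_i y_i = n \;\Rightarrow\; \hat{p}_i = \frac{y_i}{n}.$$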
(As a check, note that $\sum \hat{p}_i = 1$.) These give estimates of the population fractions $p_1, p_2, p_3, p_4$. (Note:
studies involving much larger numbers of people put the values of the pi ’s for Caucasians at close to
p1 = 0:448; p2 = 0:083; p3 = 0:034; p4 = 0:436:)
In some problems the multinomial parameters $p_1, \ldots, p_k$ may be functions of fewer than $k - 1$ parameters. The following is an example.
Example 2.3.2 Another way of classifying a person’s blood is through their “M-N" type. Each person
is one of 3 types, labelled MM, MN and NN, and we can let $p_1, p_2, p_3$ be the fraction of the population that is each of the 3 types. According to a model in genetics, the $p_i$'s can be expressed in terms of a single parameter $\theta$ for human populations:
$$p_1 = \theta^2, \quad p_2 = 2\theta(1 - \theta), \quad p_3 = (1 - \theta)^2$$
where $\theta$ is a parameter with $0 < \theta < 1$. In this case we would estimate $\theta$ from a random sample giving $y_1$, $y_2$ and $y_3$ persons of types MM, MN and NN by using the likelihood function
$$L(\theta) = \prod_{i=1}^{3} p_i^{y_i} = [\theta^2]^{y_1} [2\theta(1 - \theta)]^{y_2} [(1 - \theta)^2]^{y_3} = 2^{y_2}\, \theta^{2y_1 + y_2} (1 - \theta)^{y_2 + 2y_3}$$
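The maximization is not carried out at this point in the notes; as a short sketch (our addition), the log likelihood and its maximizer are
$$\ell(\theta) = y_2 \log 2 + (2y_1 + y_2)\log\theta + (y_2 + 2y_3)\log(1 - \theta), \qquad \frac{d\ell}{d\theta} = \frac{2y_1 + y_2}{\theta} - \frac{y_2 + 2y_3}{1 - \theta} = 0 \;\Rightarrow\; \hat{\theta} = \frac{2y_1 + y_2}{2n}.$$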
Example 2.4.2 in the next section considers some data for this setting.
2.4 Checking Models

Suppose we have partitioned the range of Y into intervals $I_j = [a_{j-1}, a_j)$, $j = 1, \ldots, k$, and computed the model-based probabilities $\hat{p}_j = P(a_{j-1} \le Y < a_j)$. If the model is suitable, these values should be "close" to the values of the relative frequencies $\tilde{p}_j = f_j/n$ in the sample. (Recall that $f_j$ is the number of y-values in the sample that are in the interval $I_j$.) This method of comparison works for either discrete or continuous r.v.'s. An example of each type follows.
Example 2.4.1 Suppose that an exponential model for a positive-valued continuous random variable
Y has been proposed, with probability density function
$$f(y) = 0.01\, e^{-0.01 y}, \quad y > 0 \tag{2.4.2}$$
and that a random sample of size n = 20 has given the following values y1 ; : : : ; y20 (rounded to the
nearest integer):
10; 32; 15; 26; 157; 99; 109; 88; 39; 118;
61; 104; 77; 144; 338; 72; 180; 63; 155; 140
For illustration purposes, let us partition the range of Y into 4 intervals [0, 30), [30, 70), [70, 140), [140, $\infty$). The probabilities $\hat{p}_j$ from the model (2.4.2) are, for $j = 1, \ldots, 4$,
$$\hat{p}_j = \int_{a_{j-1}}^{a_j} 0.01\, e^{-0.01 y}\, dy = e^{-0.01 a_{j-1}} - e^{-0.01 a_j}$$
and we find $\hat{p}_1 = 0.261$, $\hat{p}_2 = 0.244$, $\hat{p}_3 = 0.250$, $\hat{p}_4 = 0.247$ (the numbers add to 1.002 and not 1.0 because of round-off). The relative frequencies $\tilde{p}_j = f_j/20$ from the random sample are $\tilde{p}_1 = 0.15$, $\tilde{p}_2 = 0.25$, $\tilde{p}_3 = 0.30$, $\tilde{p}_4 = 0.30$. These agree fairly well with the model-based values $\hat{p}_j$, but we might wonder about the first interval. We discuss how "close" we can expect the agreement to be following the next example. With a sample of this small a size, the difference between $\tilde{p}_1$ and $\hat{p}_1$ seen here does not suggest that the model is inadequate.
This example is an artificial numerical illustration. In practice we usually want to check a family
of models for which one or more parameter values is unknown. Problem 2 in Chapter 1 discusses an
application involving the exponential distribution where this is the case. When parameter values are
unknown we first estimate them using maximum likelihood, and then check the resulting model. The
following example illustrates this procedure.
Example 2.4.2 In Example 2.3.2 we considered a model from genetics in which the probability a person is blood type MM, MN or NN is $p_1 = \theta^2$, $p_2 = 2\theta(1 - \theta)$, $p_3 = (1 - \theta)^2$, respectively. Suppose a random sample of 100 individuals gave 17 of type MM, 46 of type MN, and 37 of type NN.
The relative frequencies from the sample are $\tilde{p}_1 = 0.17$, $\tilde{p}_2 = 0.46$, $\tilde{p}_3 = 0.37$, where we use the obvious "intervals" $I_1 = \{\text{person is MM}\}$, $I_2 = \{\text{person is MN}\}$, $I_3 = \{\text{person is NN}\}$. (If we wish, we can define the random variable Y to be 1, 2, 3 according to whether a person is MM, MN or NN.)
Since $\theta$ is unknown, we must estimate it before we can check the family of models given above. From Example 2.3.2, the likelihood function for $\theta$ from the observed data is of multinomial form:
$$L(\theta) = 2^{y_2}\, \theta^{2y_1 + y_2} (1 - \theta)^{y_2 + 2y_3} = 2^{46}\, \theta^{80} (1 - \theta)^{120}$$
and $d\ell/d\theta = 0$ gives the m.l.e. $\hat{\theta} = 80/200 = 0.40$. The model-based probabilities for $I_1, I_2, I_3$ are thus
$$\hat{p}_1 = \hat{\theta}^2 = 0.16, \quad \hat{p}_2 = 2\hat{\theta}(1 - \hat{\theta}) = 0.48, \quad \hat{p}_3 = (1 - \hat{\theta})^2 = 0.36.$$
These agree quite closely with $\tilde{p}_1, \tilde{p}_2, \tilde{p}_3$ and on this basis the model seems satisfactory.
The method above suffers from some arbitrariness in how the $I_j$'s are defined and in what constitutes "close" agreement between the model-based probabilities $\hat{p}_j$ and the relative frequencies $\tilde{p}_j = f_j/n$. Some theory that provides a formal comparison will be given later in Chapter 7, but for now we will just rely on the following simple guideline. If we consider the frequencies $f_j$ from the sample as random variables, then they have a multinomial distribution, $\text{Mult}(n, p_1, \ldots, p_k)$, where $p_j$ is the "true" value of $P(a_{j-1} \le Y < a_j)$ in the population. In addition, any single $f_j$ has a binomial distribution, $\text{Bin}(n, p_j)$. This means we can assess how variable either $f_j$ or $\tilde{p}_j = f_j/n$ is likely to be, in a random sample. From Stat 230, if n is large enough then the distribution of $f_j$ is approximately normal, $N(np_j, np_j(1 - p_j))$. It then follows that
$$P\left(np_j - 1.96\sqrt{np_j(1 - p_j)} \le f_j \le np_j + 1.96\sqrt{np_j(1 - p_j)}\right) \doteq 0.95 \tag{2.4.3}$$
This allows us to get a rough idea of what constitutes a large discrepancy between an observed relative frequency $\tilde{p}_j$ and a true probability $p_j$. For example, when n = 20 and $p_j$ is about 0.25, as in Example 2.4.1, we get from (2.4.3) that
$$P\left(|\tilde{p}_j - p_j| \le 1.96\sqrt{p_j(1 - p_j)/n}\right) = P\left(|\tilde{p}_j - p_j| \le 0.19\right) \doteq 0.95.$$
That is, it is quite common for $\tilde{p}_j$ to differ from $p_j$ by up to .19. The discrepancy between $\tilde{p}_1 = 0.15$ and $\hat{p}_1 = 0.261$ in Example 2.4.1 is consequently not unusual and does not suggest the model is inadequate.
For larger sample sizes, $\tilde{p}_j$ will tend to be closer to the true value $p_j$. For example, with n = 100 and $p_j = 0.5$, (2.4.3) gives
$$P\left(|\tilde{p}_j - p_j| \le 1.96\sqrt{0.5 \times 0.5 / 100}\right) = P\left(|\tilde{p}_j - p_j| \le 0.098\right) \doteq 0.95.$$
Thus in Example 2.4.2, there is no indication that the model is inadequate. (We are assuming here that the model-based values $\hat{p}_j$ are like the true probabilities as far as (2.4.3) is concerned. This is not quite correct, but (2.4.3) will still serve as a rough guide. We are also ignoring the fact that we have picked the largest of the discrepancies $|\tilde{p}_j - p_j|$, so the binomial distribution is not quite correct either. Chapter 7 shows how to develop checks of the model that get around these points.)
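The guideline is a one-line calculation in R (a sketch of ours, not code from the notes):

    # rough 95% variation band for a relative frequency, per (2.4.3)
    margin <- function(n, p) 1.96 * sqrt(p * (1 - p) / n)
    margin(20, 0.25)   # about 0.19, as in Example 2.4.1
    margin(100, 0.5)   # about 0.098, as in Example 2.4.2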
Graphical Checks
A graph that compares relative frequencies and model-based probabilities provides a nice picture of the "fit" of the model to the data. Two plots that are widely used are based on histograms and on cumulative frequency functions $\tilde{F}(y)$, which are also called empirical c.d.f.'s.
The histogram plot for a continuous random variable Y is as follows. Plot a relative frequency histogram of the random sample $y_1, \ldots, y_n$ and superimpose on this a plot of the probability density function $f(y; \hat{\theta})$ for the proposed model. The area under the probability density function between values $a_{j-1}$ and $a_j$ equals $P(a_{j-1} \le Y < a_j)$, so this should agree well with the area of the rectangle over $[a_{j-1}, a_j)$, which equals $\tilde{p}_j$. The plots in Figure 1.6.1 for the height and body-mass index data in Section 1.6 are of this type.
For a discrete random variable Y we plot a histogram for the probability distribution $f(y; \hat{\theta})$ and superimpose a relative frequency histogram for the data, using the same intervals $I_j$ in each case.
A second graphical procedure is to plot the cumulative frequency function or empirical c.d.f.
(ECDF) F~ (y) described by (1.3.1) in Section 1.3 and then to superimpose on this a plot of the model-
based c.d.f., F (y; ). If the model is suitable, the two curves should not be too far apart.
Example 2.4.3 Figure 2.14 shows a plot of the data on female heights from Section 1.6. We show (a) a relative frequency histogram, with the G(1.62, 0.0637) probability density function superimposed (the m.l.e.'s were $\hat{\mu} = 1.62$, $\hat{\sigma} = 0.0637$, from the data), and (b) a plot of the ECDF with the G(1.62, 0.0637) c.d.f. superimposed. The two types of plots give complementary but consistent pictures. An advantage of the distribution function comparison is that the exact heights in the sample are used, whereas in the histogram versus probability density function plot the data are grouped in forming the histogram. However, the histogram and probability density function show the distribution of heights more clearly. Neither plot suggests strongly that the Gaussian model is unsatisfactory. Note that the R function ecdf can be used to obtain $\tilde{F}(y)$.
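A hedged R sketch of both plots, assuming the heights are in a vector heights (our name):

    muhat <- 1.62; sigmahat <- 0.0637
    # (a) relative frequency histogram with the fitted Gaussian p.d.f.
    hist(heights, freq = FALSE, xlab = "height (m)")
    curve(dnorm(x, muhat, sigmahat), add = TRUE)
    # (b) empirical c.d.f. with the fitted Gaussian c.d.f.
    plot(ecdf(heights), xlab = "height (m)", ylab = "cumulative proportion")
    curve(pnorm(x, muhat, sigmahat), add = TRUE, lty = 2)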
[Figure 2.14: (a) relative frequency histogram of the female heights with the G(1.62, 0.0637) p.d.f. superimposed; (b) ECDF with the G(1.62, 0.0637) c.d.f. superimposed; x-axis: height (m).]
2.5 Problems
1. In modelling the number of transactions of a certain type received by a central computer for a company with many on-line terminals, the Poisson distribution can be used. If the transactions arrive at random at the rate of $\theta$ per minute then the probability of x transactions in a time interval of length t minutes is
$$P(X = x) = \frac{(\theta t)^x}{x!} e^{-\theta t}, \quad x = 0, 1, 2, \ldots$$
(a) The numbers of transactions received in 10 separate one minute intervals were as follows:
8, 3, 2, 4, 5, 3, 6, 5, 4, 1.
Write down the likelihood function for $\theta$ and find the m.l.e. $\hat{\theta}$.
(b) Estimate the probability that during a two-minute interval, no transactions arrive.
(c) Use the R function rpois() with the value $\theta = 4.1$ to simulate the number of transactions received in 100 one-minute intervals. Calculate the sample mean and variance; are they approximately the same? (Note that $E(X) = \mathrm{Var}(X) = \theta$ for the Poisson model.)
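One hedged sketch of part (c):

    x <- rpois(100, lambda = 4.1)  # 100 simulated one-minute counts
    mean(x); var(x)                # both should be near 4.1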
2. Consider the following two experiments whose purpose was to estimate $\theta$, the fraction of a large population with blood type B.
i) Individuals were selected at random until 10 with blood type B were found. The total
number of people examined was 100.
ii) 100 individuals were selected at random and it was found that 10 of them have blood type
B.
(a) Find the probability of the observed results (as a function of $\theta$) for the two experiments. Thus obtain the likelihood function for $\theta$ for each experiment and show that they are proportional. Show that the m.l.e. $\hat{\theta}$ is the same in each case: what is it?
(b) Suppose n people came to a blood donor clinic. Assuming $\theta = 0.10$, how large should n be to make the probability of getting 10 or more B-type donors at least 0.90? (The R functions gbinom() or pbinom() can help here.)
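A hedged sketch of the search in part (b) using pbinom() (gbinom() is a course-supplied function we do not reproduce here):

    # P(X >= 10) for X ~ Bin(n, 0.1) is 1 - pbinom(9, n, 0.1); find the smallest such n
    n <- 10
    while (1 - pbinom(9, n, 0.1) < 0.90) n <- n + 1
    n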
3. Consider Example 2.3.2 on M-N blood types. If a random sample of n individuals gives $x_1$, $x_2$ and $x_3$ persons of types MM, MN, and NN respectively, find the m.l.e. $\hat{\theta}$ in the model in terms of $x_1, x_2, x_3$.
4. Suppose that in a population of twins, males (M) and females (F) are equally likely to occur and that the probability that a pair of twins is identical is $\theta$. If twins are not identical, their genders are independent.
(a) Show that
$$P(MM) = P(FF) = \frac{1 + \theta}{4}, \qquad P(MF) = \frac{1 - \theta}{2}.$$
(b) Suppose that n pairs of twins are randomly selected; it is found that n1 are M M , n2 are
F F , and n3 are M F , but it is not known whether each set is identical or fraternal. Use
these data to find the m.l.e. $\hat{\theta}$ of $\theta$. What does this give if n = 50 with $n_1 = 16$, $n_2 = 16$,
n3 = 18?
(c) Does the model appear to fit the data well?
6. The following model has been proposed for the distribution of the number of offspring X in a
family, for a large population of families:
$$P(X = k) = \theta^k \quad k = 1, 2, \ldots \qquad P(X = 0) = (1 - 2\theta)/(1 - \theta).$$
(a) Suppose that n families are selected at random and that fk is the number of families with k
children ($f_0 + f_1 + \ldots = n$). Obtain the m.l.e. $\hat{\theta}$.
(b) Consider a different type of sampling wherein a single child is selected at random and the
size of family the child comes from is determined. Let X represent the number of children
in the family. Show that
$$P(X = k) = c\, k\, \theta^k \quad k = 1, 2, \ldots$$
and determine c.
(c) Suppose that the type of sampling in part (b) was used and that with n = 33 the following
data are obtained:
k: 1 2 3 4
fk : 22 7 3 1
Obtain the m.l.e. $\hat{\theta}$. Also estimate the probability a couple has 0 children.
(d) Suppose the sample in (c) was incorrectly assumed to have arisen from the sampling plan in (a). What would $\hat{\theta}$ be found to be? This problem shows that the way the data have been collected can affect the model for the response variable.
7. Radioactive particles are emitted randomly over time from a source at an average rate of $\theta$ per second. In n time periods of varying lengths $t_1, t_2, \ldots, t_n$ (seconds), the numbers of particles emitted (as determined by an automatic counter) were $y_1, y_2, \ldots, y_n$ respectively.
(a) Give an estimate of from these data. What assumptions have you made to do this?
(b) Suppose that instead of knowing the $y_i$'s, we know only whether or not there was one or more particles emitted in each time interval. Making a suitable assumption, give the likelihood function for $\theta$ based on these data, and describe how you could find the maximum likelihood estimate $\hat{\theta}$.
8. Censored lifetime data. Consider the exponential distribution as a model for the lifetimes of
equipment. In experiments, it is often not feasible to run the study long enough that all the
pieces of equipment fail. For example, suppose that n pieces of equipment are each tested for a
maximum of C hours (C is called a "censoring time"). Suppose that r of the n pieces fail before time C, at observed lifetimes $x_1, \ldots, x_r$, and that the remaining $n - r$ pieces are still operating at time C.
(a) If lifetimes X follow an exponential distribution with mean $\theta$, show that
$$P(X > C) = \exp(-C/\theta).$$
(b) Give the likelihood function for $\theta$ based on the observed data described above. Show that the m.l.e. is
$$\hat{\theta} = \frac{\sum_{i=1}^{r} x_i + (n - r)C}{r}.$$
(c) What does part (b) give when r = 0? Explain this intuitively.
(d) A standard test for the reliability of electronic components is to subject them to large fluc-
tuations in temperature inside specially designed ovens. For one particular type of compo-
nent, 50 units were tested and r = 5 failed before 400 hours, when the test was terminated, with $\sum_{i=1}^{5} x_i = 450$ hours. Find $\hat{\theta}$.
9. Poisson model with a covariate. Let Y represent the number of claims in a given year for
a single general insurance policy holder. Each policy holder has a numerical “risk score” x
assigned by the company, based on available information. The risk score may be used as a
covariate (explanatory variable) when modeling the distribution of Y , and it has been found that
models of the form
$$P(Y = y \mid x) = \frac{\lambda(x)^y}{y!} e^{-\lambda(x)} \quad y = 0, 1, 2, \ldots$$
where $\lambda(x) = \exp(\alpha + \beta x)$, are useful.
(a) Suppose that n randomly chosen policy holders with risk scores $x_1, x_2, \ldots, x_n$ had $y_1, y_2, \ldots, y_n$ claims, respectively, in a given year. Give the likelihood function for $\alpha$ and $\beta$ based on these data.
(b) Can $\hat{\alpha}$ and $\hat{\beta}$ be found explicitly?
10. In a large population of males ages 40-50, the proportion who are regular smokers is $\alpha$ ($0 < \alpha < 1$) and the proportion who have hypertension (high blood pressure) is $\beta$ ($0 < \beta < 1$). If the events S (a person is a smoker) and H (a person has hypertension) are independent, then for a man picked at random from the population the probabilities he falls into the four categories $SH$, $S\bar{H}$, $\bar{S}H$, $\bar{S}\bar{H}$ are, respectively, $\alpha\beta$, $\alpha(1 - \beta)$, $(1 - \alpha)\beta$, $(1 - \alpha)(1 - \beta)$. (Why?)
(a) Suppose that 100 men are selected and the numbers in each of the four categories are as
follows:
    Category     $SH$    $S\bar{H}$    $\bar{S}H$    $\bar{S}\bar{H}$
    Frequency     20       15            22            43
Assuming that S and H are independent, write down the likelihood function for $\alpha$, $\beta$ based on the multinomial distribution, and maximize it to obtain $\hat{\alpha}$ and $\hat{\beta}$.
(b) Compute expected frequencies for each of the four categories. Do you think the model used
is appropriate? Why might it be inappropriate?
11. The course web page has data on the lifetimes of the right front disc brakes pads for a
specific car model. The lifetimes x are in km driven, and correspond to the point at which
the brake pads in new cars are reduced to a specified thickness. The data on m = 92
randomly selected cars are contained in the file brakelife.text.
(a) It is often found that the log lifetimes, Y = log X, are well modelled by a Gaussian distribution. Fit such a model to the data, and then produce a plot of a relative frequency histogram of the x-data with the probability density function for X superimposed on it, using $\hat{\mu}$ and $\hat{\sigma}$ for the values of $\mu$ and $\sigma$.
(Note: The probability density function of X = exp(Y) can be found from results in STAT 230, and is
$$f(x) = \frac{1}{\sqrt{2\pi}\, \sigma x}\, e^{-\frac{1}{2}\left(\frac{\log x - \mu}{\sigma}\right)^2} \quad x > 0.)$$
(b) Suppose that instead of the model in part (a), you assumed that $X \sim G(\mu, \sigma)$. Repeat the procedure in part (a). Which model appears to fit the data better?
PLANNING AND CONDUCTING
EMPIRICAL STUDIES
These notes describe empirical studies in terms of the PPDAC framework: Problem, Plan, Data, Analysis, Conclusion. The Plan step, for example, refers to the procedures used to carry out the study, including the collection of data.
We will discuss the Plan, Data, Analysis and Conclusion steps in sections following this one. First,
we consider some aspects of the Problem step.
The objectives of a study may be difficult to state precisely in some cases, because when trying to
learn about a phenomenon we are venturing into the unknown. However, we must state as clearly as
possible what we hope to learn, and (looking ahead) what the “output" or type of conclusions from our
study will be. Here are some terms that describe certain problems or objectives:
causative: this means that an objective is to study (possible) causal relationships among factors
or variables. For example, we might want to study whether high amounts of fat in a person’s
diet increase their risk of heart disease, or whether a drug decreases the risk of a person having a
stroke.
descriptive: this means that an objective is to describe the variability or other characteristics of
certain variables, or to describe relationships among variables. (There is no attempt to consider
causal connections.) For example, we may wish to estimate the percentage of persons who
are unemployed this month in different regions of Canada, and to relate this to their level of
education.
analytic or inferential: this means that an objective is to generalize from the study to a larger
context or process. This is what is known as inductive inference. Causative objectives are one
type of analytic objective.
technological: this refers to objectives involving predictions or decisions (e.g. develop a model
or algorithm to identify someone from biometric data, or to decide whether to sell a stock)
A distinction can often be made between what we call the target population (or process) and the
study population (or process). The former is what we really want to study, and what our objectives are
based on. The latter is what the units in the study are actually selected from. Sometimes the target and
study populations are the same but often they are not; in that case we aim to make the study population
as “similar" as possible to the target population. This will make generalizations from the study to the
target population plausible.
At the Problem step we must also consider what the “units" of the population or process are, what
variables will be used to study the units, and what characteristics or attributes of the variables we
want to examine. This can be difficult, especially in areas where the amount of “hard" scientific back-
ground is limited, because it involves quantification and measurement issues. Some things can be
quantified and measured easily and accurately (e.g. a person’s height, weight, age or sex) but many
cannot. For example, how can we measure the amount of fat in a person’s diet? Public opinion polls or
surveys are similarly tricky to design questions for, and we may not be sure what an answer represents.
For example (see Utts 2001), in a survey the following two questions were asked in the order shown:
1. How happy are you with life in general?
2. How often do you normally go out on a date?
There was almost no relationship between the respondents' replies to the two questions. However,
when the order of the questions was reversed there was a strong relationship, with persons going out
frequently generally being happy. (Why might this occur?)
The units or individuals that are selected for a study are called a sample, because they usually are a subset of all the units in the study population. In so far as it is possible, we attempt to select a random sample of units from the study population, to ensure that the sample is "representative". Drawing a random sample of units from a population or process can be difficult logistically. (Think about what this would involve for some of the studies discussed in Chapters 1 and 2.) If our sample is not representative of the study or target population, conclusions we draw may be inaccurate or incorrect; this is often referred to as study bias. Also, ideally we have
study population $\subset$ target population,
but in some cases the study population may not be a subset of the target population but only related
to it in some way. For example, our target population for a carcinogenicity study might be humans,
but experimental studies have to be conducted on laboratory mice or other animals. It is difficult to
generalize from a study on animals to a human population.
The selection of explanatory variables for a study is also important. In many problems there is
a large number of factors that might be related (in a causal way or not) to a response variable, and
we have to decide which of these to collect data on. For example, in an observational study on heart
disease we might include factors in the person's diet, their age, weight, family history of heart disease, smoking history and so on. Sometimes cause and effect diagrams are used to help us sort out such factors and decide what to measure.
We’ll conclude this section with a couple of examples that illustrate aspects of the Problem step.
just say it consists of persons alive and at risk of stroke now and in the future. The study population
will consist of individuals who can be “selected" to participate in the study; individual persons are
the units in the populations. In most such studies these are persons who come to a specific group of
doctors or medical facilities for care and treatment, and this may or may not be representative of the
target population. To make the study generalizable at least to the study population, we would want the
sample from the study population to be representative of it. We try to achieve this by drawing a random
sample of persons, but in medical studies this is complicated by the fact that a person has to agree to
participate if selected.
The choice of response variables depends on the study objectives but also partly on how long the
study will be conducted. Some examples of response variables y are
(i) the number of strokes for a person over a specified “followup" period, say 3 years
(ii) the time until the first stroke occurs (if it does)
(iv) I(a stroke leading to major loss of function occurs within 3 years).
Once response variables have been defined, a specific objective is to compare attributes based on the
distributions of these variables for persons who receive the drug and for persons who do not. In ad-
dition, to avoid study bias we randomly assign either the drug or a “placebo" (which cannot be dis-
tinguished from the drug) to each person. You can think of this being done by flipping a coin: if its
Heads a person gets the drug, if Tails they get the placebo. As discussed earlier in Example 1.2.3, it
is also best not to let the person (or their doctor) know which “treatment" they are receiving. Finally,
one might want to also define and measure explanatory variables considered to be risk factors for stroke.
A second example concerns admission to an academic program, where x-variables measured on applicants are used to predict success. Here the study population consists only of persons who get admitted to the program (and who accept). This mismatch between the target and study populations means that there is some doubt about how persons not admitted (but whose x-variables are known) would do in the program. For the problem of predicting a person's success once they are admitted, we
can take the target population to be all students admitted (and accepting) over some time period. The
study population is more “representative" in this case. However, for this ( and for admissions deci-
sions), there is still a discrepancy between study and target populations: data on past years’ students
must be used for a target population consisting of this (or a future) year’s students.
(i) Surveys or random samples from a finite population of units. In this case there is a population
of, say, N units (e.g. persons, widgets, transactions) and we randomly select n(< N ) for our
study. Public opinion polls or surveys on health, income, and other matters are often of this type,
as are random audits and quality inspections.
(ii) Experimental studies. In this case the person(s) conducting the study exercise control over
certain factors, or explanatory variables. These studies are often hard or expensive to conduct, but
are important for demonstrating causal relationships. Clinical trials involving medical treatments
(e.g. does a drug lower the risk of stroke?) and engineering experiments (e.g. how is the strength
of a steel bolt related to its diameter?) are examples of such studies.
(iii) Observational studies. In this case the data are collected on a study population or process over
which there is no control. The units selected for the study are often selected in some random way,
so as to make the selected units “representative". For example, we might collect data on a random
set of transactions in an audit study, or on a random sample of persons buying a particular product
in a marketing study. In medicine, we might compare the “risk profiles" of persons diagnosed
with a disease (e.g. diabetes) with persons not diagnosed with the disease.
Two main questions with observational studies are whether the study population is representative
of the target population, and whether the units chosen for the study are representative of the study pop-
ulation. It is sometimes impossible to generalize, or even to draw any conclusions, from observational
data.
Selecting random samples from a population or process can be challenging due to logistical prob-
lems as noted above. In addition, there is the question whether a process is stable over time. This
affects both the objectives of a study and the way units are sampled and measured. Entire books are
devoted to ways of drawing random or representative samples in different settings.
In settings where the population or process has a finite number of units, N, we can draw a random sample by labelling the units (say as $1, 2, \ldots, N$) and then drawing n numbers at random (usually without replacement). This identifies the units selected for the sample. This can of course be hard or impossible to implement in many settings; consider for example how you could select a random sample of n persons aged 18-25 and living in Canada on some specific date.
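When a complete list of units does exist, the draw itself is one call in R (a sketch; the numbers are arbitrary):

    N <- 5000; n <- 100
    selected <- sample(1:N, n)   # simple random sample without replacement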
It is often taken for granted that measurements, or data, are “reliable". In fact, measurement of the
variables in a study may be subject to errors, and this should be considered in planning and analyzing
a study. If measurement errors are too severe, it may not even be worth doing the study.
Most measurement methods have some degree of error. In many cases this has no significant effect,
but the only way to be sure is to understand (and study, if necessary) the measurement processes being
used. Note that in some problems the measurement may involve a definition that is somewhat vague or
subject to error. For example, if y is the number of incidents of sexual abuse that a person was subjected
to over a 5 year period, then we should consider the definition of an “incident" and whether everyone
interprets this the same way.
Finally, the number of population units to be included in a study is a crucial issue, since it is directly
related to the amount of information that will be obtained. This topic will be discussed in subsequent
chapters so we will not consider it here.
Some points to keep in mind at the Data step include the following:
- errors can occur in recording data or entering it in a data base, as well as in measurement of variables
- in many studies the "units" must be tracked and measured over a fairly long period of time (e.g. consider a stroke study in which persons are followed for 3 years)
- when data are recorded over time or in different locations, the time and place for each measurement should be recorded
- there may be departures from the study plan that arise over time (e.g. persons may drop out of a medical study because of adverse reactions to a treatment; it may take longer than anticipated to collect the data so the number of units sampled must be reduced). Departures from the plan should be recorded since they may have an important impact on the analysis and conclusions
- in some studies the amount of data may be massive, so data base design and management is important.
Objectives addressed at the Analysis step include, for example:
- testing hypotheses (e.g. Is dietary fat related to heart disease? Does a drug prevent strokes? Is one computer sort algorithm better than another?)
- model building (e.g. produce a model relating lifestyle risk factors to the probability of a stroke; produce a model relating the strength of a steel bolt to its diameter; produce a model that can be used to identify customers for a targeted marketing campaign). Models are often used for classifying objects or units, or for making decisions.
Statistical analysis uses a wide array of numerical and graphical methods. Several major topics in
statistical analysis are introduced in Chapters 4 to 8 which follow. Specific applications are used to
illustrate the methods and how conclusions are drawn from the analysis. Although we do not discuss
it much in the remaining chapters, we must remember that well planned and conducted studies are
important for drawing reliable conclusions.
3.4 Problems
1. Suppose you wish to study the smoking habits of teenagers and young adults, in order to understand what personal factors are related to whether, and how much, a person smokes. Briefly
describe the main components of such a study, using the PPDAC framework. Be specific about
the target and study population, the sample, and what variables you would collect data on.
2. Suppose you wanted to study the relationship between a person’s “resting" pulse rate (heart beats
per minute) and the amount and type of exercise they get.
(a) List some factors (including exercise) that might affect resting pulse rate. You may wish to
draw a cause and effect diagram to represent potential causal factors.
(b) Describe briefly how you might study the relationship between pulse rate and exercise us-
ing (i) an observational study, and (ii) an experimental study.
3. A large company uses photocopiers leased from two suppliers A and B. The lease rates are
slightly lower for B’s machines but there is a perception among workers that they break down
and cause disruptions in work flow substantially more often. Describe briefly how you might
design and carry out a study of this issue, with the ultimate objective being a decision whether to
continue the lease with company B. What additional factors might affect this decision?
4. For a study like the one in Section 1.6, where heights x and weights y of individuals are to be
recorded, discuss sources of variation due to the measurement of x and y on any individual.
STATISTICAL INFERENCE:
ESTIMATION
4.1 Introduction
Many statistical problems involve the estimation of some quantity or attribute. For example, the fraction
of North American women age 16-25 who smoke; the 10th, 50th and 90th percentiles of body-mass
index (BMI) for Canadian males age 21-35; the probability a sensor will classify the colour of an item
correctly. The statistical approach to estimation is based on the following idea:
Develop a model for variation in the population or process you are considering, in which the at-
tribute or quantity you want to estimate is included, and a corresponding model for data collection.
As we will see, this leads to powerful methods for estimating unknown quantities and, importantly,
for determining the uncertainty in an estimate.
We have already seen in Chapter 2 that quantities that can be expressed as parameters in a statis-
tical model (probability distribution) can be estimated using maximum likelihood. Let us consider the following example, and make some important observations.
Example 4.1.1. Suppose we want to estimate quantities associated with BMI for some population of individuals (e.g. Canadian males age 21-35). If the distribution of BMI values y in the population is well described by a Gaussian model, $Y \sim G(\mu, \sigma)$, then by estimating $\mu$ and $\sigma$ we can estimate any quantity associated with the BMI distribution. For example,
(i) the mean BMI in the population, $E(Y) = \mu$;
(ii) the median BMI, which for a Gaussian distribution also equals $\mu$;
(iii) the 10th and 90th percentiles (the .10 and .90 quantiles) for BMI, $y_{0.10} = \mu - 1.28\sigma$ and $y_{0.90} = \mu + 1.28\sigma$ (to see this, note for example that $P(Y \le \mu - 1.28\sigma) = P(Z \le -1.28) = 0.10$, where $Z = (Y - \mu)/\sigma \sim G(0, 1)$);
(iv) the fraction of the population with BMI over 35.0, $p = 1 - \Phi\left(\frac{35.0 - \mu}{\sigma}\right)$, where $\Phi(z)$ is the c.d.f. for a G(0, 1) random variable.
Thus, if we collected a random sample of, say, 150 individuals and found maximum likelihood estimates $\hat{\mu} = 27.1$, $\hat{\sigma} = 3.56$, then estimates of the quantities in (i)-(iv) would be: (i) and (ii) $\hat{\mu} = 27.1$; (iii) $\hat{y}_{0.10} = \hat{\mu} - 1.28\hat{\sigma} = 22.54$, $\hat{y}_{0.90} = 31.66$; and (iv) $\hat{p} = 0.0132$.
The preceding example raises several issues, if we think about it.
Where do we get our probability distribution? What if it isn’t a good description of the population
or process?
We discussed the first question in Chapters 1 and 2. It is important to check the adequacy (or “fit")
of the model; some ways of doing this were discussed in Chapter 2 and more will be considered
later in the course. If the model used is not satisfactory, we may not be able to use the estimates
based on it. In the data introduced in Section 1.5 it was not in fact clear that a Gaussian model
was suitable when y is BMI. We will consider other models later.
The estimation of parameters or population attributes depends on data collected from the popula-
tion or process, and the likelihood function is based on the probability of the observed data. This
implies that factors associated with the selection of sample units or the measurement of variables
(e.g. measurement error) must be included in the model. In the BMI example it has been as-
sumed that a BMI was measured without error for a random sample of units (persons) from the
population. In these notes it is typically assumed that the data came from a random sample of
population units, but in any given application we would need to design the data collection plan
to ensure this assumption is valid.
Estimates such as $\hat{\mu} = 27.1$ surely cannot be exactly equal to the average BMI $\mu$ in the population. How far away from $\mu$ is it likely to be? If I take a sample of only n = 50 persons, would I expect the estimate $\hat{\mu}$ to be as "good" as $\hat{\mu}$ based on 150 persons? (What does "good" mean?)
The third point is what we will focus on in this chapter; it is assumed that we can deal with the first
two points with ideas introduced in Chapters 1 and 2.
In general, an estimate of a parameter $\theta$ based on observed data $y_1, \ldots, y_n$ is a function of the data:
$$\hat{\theta} = g(y_1, \ldots, y_n). \tag{4.1.1}$$
Maximum likelihood provides a general method for obtaining estimates, but other methods also exist. For example, if $\theta = E(Y) = \mu$ is the average (mean) value of y in the population, then the sample mean $\hat{\theta} = \bar{y}$ is an intuitively sensible estimate; it is the m.l.e. if Y has a Gaussian distribution, but because of the Central Limit Theorem it is a good estimate of $\mu$ more generally. Thus, while we will use ML estimation a great deal, you should remember that the discussion below applies to estimates of any type. The problem facing us is how to determine or quantify the uncertainty in an estimate. We do this using sampling distributions, which are based on the following point. If we select random samples on repeated occasions, then the estimates $\hat{\theta}$ obtained from the different samples will vary. For example, five separate random samples of n = 50 persons from the same male population described in Section 1.6 gave five estimates $\hat{\mu} = \bar{y}$ of E(Y): 1.723, 1.743, 1.734, 1.752, 1.736 (meters). The distribution of estimates obtained from repeated samples of the same size is termed a sampling distribution. More precisely, we define this as follows. Let the random variables $Y_1, \ldots, Y_n$ represent the observations in a random sample, and associate with the estimate $\hat{\theta}$ given by (4.1.1) a random variable
$$\tilde{\theta} = g(Y_1, \ldots, Y_n). \tag{4.1.2}$$
We call $\tilde{\theta}$ the estimator of $\theta$ corresponding to $\hat{\theta}$. (We will always use $\hat{\theta}$ to denote an estimate and $\tilde{\theta}$ to denote the corresponding estimator.) The distribution of $\tilde{\theta}$ is called the sampling distribution of the estimator.
Since $\tilde{\theta}$ is a function of the random variables $Y_1, \ldots, Y_n$ we can find its distribution, at least in principle. Two ways to do this are (i) using mathematics and (ii) using computer simulation. Once we know the sampling distribution of an estimator $\tilde{\theta}$ (we can think of this as describing an estimation procedure, if we wish) then we are in a position to express the uncertainty in an estimate. The following example illustrates how this is done: we examine the probability that the estimator $\tilde{\theta}$ is "close" to $\theta$.
Example 4.1.2 Suppose we want to estimate the mean $\mu = E(Y)$ of a random variable, and that a Gaussian distribution $Y \sim G(\mu, \sigma)$ describes variation in Y in the population. Let $y_1, \ldots, y_n$ represent a random sample from the population, and consider the estimator
$$\tilde{\mu} = \bar{Y} = \sum_{i=1}^{n} Y_i / n$$
for $\mu$. At this point, we recall some probability theory which says that the distribution of $\bar{Y}$ is also Gaussian, $G(\mu, \sigma/\sqrt{n})$. Let us now consider the probability that $|\tilde{\mu} - \mu|$ is less than or equal to some specified value $\Delta$. We have
$$P(|\tilde{\mu} - \mu| \le \Delta) = P(\mu - \Delta \le \bar{Y} \le \mu + \Delta) = P\left(-\frac{\sqrt{n}\,\Delta}{\sigma} \le Z \le \frac{\sqrt{n}\,\Delta}{\sigma}\right) \tag{4.1.3}$$
where $Z = (\bar{Y} - \mu)/(\sigma/\sqrt{n}) \sim G(0, 1)$. Clearly, as n increases, the probability (4.1.3) approaches 1. Furthermore, if we know $\sigma$ (even approximately) then we can find the probability for any given $\Delta$ and n. For example, suppose Y represents the height of a male (in meters) in the population of Section 1.5, and that we take $\Delta = 0.01$. That is, we want to find the probability that $|\tilde{\mu} - \mu|$ is no more than .01 meters. Assuming $\sigma = 0.07$ (m), which is roughly what we estimated it to be in Section 1.5, (4.1.3) gives the following results:
$$n = 50: \quad P(|\tilde{\mu} - \mu| \le 0.01) = P(-1.01 \le Z \le 1.01) \doteq 0.69$$
$$n = 100: \quad P(|\tilde{\mu} - \mu| \le 0.01) = P(-1.43 \le Z \le 1.43) \doteq 0.85$$
This indicates that a large sample is “better” in the sense that the probability is higher that μ̃ will be within .01 m of the true (and unknown) average height μ in the population. It also allows us to express the uncertainty in an estimate μ̂ = ȳ from an observed sample y1, . . . , yn: we merely give probabilities like the above for the associated estimator, which indicate the probability that any single random sample will give an estimate within a certain distance of μ.
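The calculation in (4.1.3) is easy to reproduce in R; here is a minimal sketch using the values σ = .07 and δ = .01 above.

    delta <- 0.01; sigma <- 0.07
    for (n in c(50, 100)) {
      z <- delta * sqrt(n) / sigma
      cat("n =", n, " probability =", pnorm(z) - pnorm(-z), "\n")
    }
    # prints about 0.69 for n = 50 and about 0.85 for n = 100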
Example 4.1.3 In the preceding example we were able to work out the variability in the estimator μ̃ mathematically, using results about Gaussian probability distributions. In some settings we might not be able to work out the distribution of an estimator mathematically; however, we could use simulation to study the distribution. This approach can also be used to study sampling from a finite population of N values, {y1, . . . , yN}, where we might not want to use a continuous probability distribution for Y. For illustration, consider the case where a variable Y has a G(μ, σ) distribution in the population, with μ = 100 and σ = 50, and suppose we take samples y = (y1, . . . , yn) of size n, giving μ̂ = ȳ. We can investigate the distribution of μ̃ by simulation: in R we generate a sample and compute its mean,

    y <- rnorm(n, 100, 50)    # a random sample of size n from G(100, 50)
    ybar <- mean(y)           # the estimate from this sample

and then repeat this, say k times. The k values ȳ1, . . . , ȳk can then be considered as a sample from the distribution of μ̃, and we can study it by plotting a histogram or other plot of the values. Exercise: Generate k = 100 samples this way, and plot a histogram based on the values ȳ1, . . . , ȳ100.
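A minimal sketch of this exercise, assuming a sample size of n = 50 (any n could be used):

    k <- 100
    ybars <- replicate(k, mean(rnorm(50, mean = 100, sd = 50)))
    hist(ybars, xlab = "sample mean", main = "Simulated sampling distribution")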
The approaches illustrated in the preceding examples can be used generally. Given an estimator θ̃, we can consider its sampling distribution and compute probabilities of the form P(|θ̃ − θ| ≤ δ).
4.2 Some Distribution Theory
We need to be able to find the distributions of estimators and other variables. We now review some probability results from Stat 230 and derive a few other results that will be used in dealing with estimation and other statistical procedures.

If Z ~ G(0, 1) and W = Z², then for w > 0 we have P(W ≤ w) = P(−√w ≤ Z ≤ √w) = Φ(√w) − Φ(−√w), where Φ denotes the standard Gaussian c.d.f. Differentiating with respect to w gives the probability density function of W:

d/dw [Φ(√w) − Φ(−√w)] = [φ(√w) + φ(−√w)] (1/2) w^{−1/2}   (φ the standard Gaussian probability density function)
                      = w^{−1/2} e^{−w/2} / √(2π).

So the square of a standard normal random variable has a density of the form constant × w^{−1/2} e^{−w/2} for w > 0, and this is a member of the chi-squared family of distributions. This result is proved again as Theorem 4.2.6 using moment generating functions.
The χ² (chi-squared) Distribution

The χ² distribution is usually derived from Gaussian random variables, as we will see below in Theorem 4.2.5. It is a continuous family of distributions on (0, ∞) with probability density functions of the form

f(x) = 1/(2^{r/2} Γ(r/2)) x^{(r/2)−1} e^{−x/2},   x > 0,     (4.2.8)

where r is a parameter with values in {1, 2, . . .}. To denote that X has probability density function (4.2.8) we write X ~ χ²(r). The parameter r is referred to as the “degrees of freedom” (d.f.) parameter.
In Figure 4.15 you see the characteristic shapes of the chi-squared densities. For degrees of freedom r = 2 the density is exponential, but for r > 2 it increases to a maximum value at a point to the left of r and then decreases thereafter. For large values of r, the probability density function resembles that of a normal distribution with variance 2r.
Figure 4.15: Chi-squared probability densities with degrees of freedom 1, 2, 4 and 8.
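A figure like this is easy to reproduce with the R density function dchisq (used again below); a minimal sketch:

    x <- seq(0.05, 20, length.out = 400)
    plot(x, dchisq(x, 1), type = "l", ylim = c(0, 0.6), xlab = "x", ylab = "p.d.f.")
    for (r in c(2, 4, 8)) lines(x, dchisq(x, r), lty = 2)
    # the r = 1 density is unbounded near 0, so it is clipped by the plot limits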
Problem 11 at the end of the chapter gives some results for the χ² distributions, including the fact that the m.g.f. is

M(t) = (1 − 2t)^{−r/2}   for t < 1/2,     (4.2.9)

and that the mean and variance are E(X) = r and Var(X) = 2r. The c.d.f. F(x) can be given in closed algebraic form for even values of r, but tables and software give the function’s values. In R the functions dchisq(x, r) and pchisq(x, r) give the probability density function f(x) and c.d.f. F(x) for the χ²(r) distribution. A table with selected values is given at the end of these notes.
We now give a pair of important results. The first shows that when we add independent chi-squared random variables, the sum is also chi-squared, and the degrees of freedom simply sum.

Theorem 4.2.5 Let W1, . . . , Wn be mutually independent random variables with Wi ~ χ²(ri). Then S = ∑ Wi has a χ²(r1 + · · · + rn) distribution.¹¹

Theorem 4.2.6 If Z ~ G(0, 1) then W = Z² ~ χ²(1).¹²

Furthermore, if we add together the squares of several independent standard normal random variables, then we are adding independent chi-squared random variables. The result can only be chi-squared!
Corollary¹³ If Z1, . . . , Zn are mutually independent G(0, 1) random variables and S = ∑ Zi², then S ~ χ²(n).
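A quick simulation check of the corollary (a sketch; the number of replications is arbitrary):

    n <- 4
    s <- replicate(5000, sum(rnorm(n)^2))
    hist(s, freq = FALSE, breaks = 40, main = "Sums of 4 squared G(0,1) variables")
    curve(dchisq(x, df = n), add = TRUE)   # the chi-squared(4) density matches the histogram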
The estimates and estimators discussed in Section 4.1 are often referred to as point estimates (and estimators). This is because they consist of a single value or “point”. The discussion of sampling distributions shows how to address the uncertainty in an estimate, but we nevertheless prefer in most settings to indicate the uncertainty explicitly with the estimate. This leads to the concept of an interval estimate¹⁴, which takes the form

θ ∈ (L, U)   or   L ≤ θ ≤ U,

where L = h1(y1, . . . , yn) and U = h2(y1, . . . , yn) are based on the observed data y1, . . . , yn. Notice that this provides an interval with endpoints L and U, both of which depend on the data. With random variables replacing the observed data, the endpoints L̃ = h1(Y1, . . . , Yn) and Ũ = h2(Y1, . . . , Yn) are random variables; for example, if we were to draw another sample from the same population, the assertion would be L̃ < θ < Ũ. There is a specific probability (hopefully large) that this assertion
¹¹Proof: Wi has m.g.f. Mi(t) = (1 − 2t)^{−ri/2}. Thus MS(t) = ∏ Mi(t) = (1 − 2t)^{−∑ri/2}, and this is the m.g.f. of a χ² distribution with degrees of freedom ∑ ri.
¹²Proof of 4.2.6: The m.g.f. of W is MW(t) = E(e^{tW}) = E(e^{tZ²}) = ∫ e^{tz²} (1/√(2π)) e^{−z²/2} dz = ∫ (1/√(2π)) e^{−z²(1−2t)/2} dz (integrals over −∞ < z < ∞). This integral exists (is finite) provided 1 − 2t > 0, i.e., t < 1/2. Making the change of variable y = z(1 − 2t)^{1/2} we get MW(t) = (1 − 2t)^{−1/2} ∫ (1/√(2π)) e^{−y²/2} dy = (1 − 2t)^{−1/2}. This is the m.g.f. for χ²(1), so W must have a χ²(1) distribution.
¹³Proof: By Theorem 4.2.6, each Zi² has a χ²(1) distribution. Theorem 4.2.5 then gives the result.
¹⁴See the video “What is a Confidence Interval?” at watstat.ca.
is correct in general, i.e., that the parameter θ will fall in this random interval, and this probability is P(L̃ < θ < Ũ). This probability is called the coverage probability. It gives an indication of how good the rule is by which the interval estimate was obtained. For example, if it is 0.95, this means that 95% of the time (i.e., for 95% of the different samples we might draw), the parameter falls in the interval (L̃, Ũ). This means we can be reasonably safe in assuming that on this occasion, and for this dataset, the interval contains the true value. In general, uncertainty in an interval estimate is explicitly stated by giving the interval estimate along with the probability P[θ ∈ (L̃, Ũ)], where L̃ = h1(Y1, . . . , Yn) and Ũ = h2(Y1, . . . , Yn) are the random variables associated with L and U.
The likelihood function can be used to obtain interval estimates for parameters in a very straightforward way. We do this here for the case in which the probability model involves only a single scalar parameter θ. Models with two or more parameters will be considered later. Individual models often have constraints on the parameters: for example, in the Gaussian distribution the mean can take any real value, −∞ < μ < ∞, while the standard deviation must be positive, σ > 0. Similarly, in the binomial model the probability of success must lie in the interval [0, 1]. These constraints are usually identified by requiring that the parameter fall in some set Ω, called the parameter space. As mentioned before, we often multiply the likelihood function by a convenient scale, resulting in the relative likelihood.

Definition: Suppose θ is scalar and that some observed data (say a random sample y1, . . . , yn) have given a likelihood function L(θ). The relative likelihood function R(θ) is then defined as

R(θ) = L(θ)/L(θ̂)   for θ ∈ Ω,

where θ̂ is the m.l.e. (obtained by maximizing L(θ)) and Ω is the parameter space. Note that 0 ≤ R(θ) ≤ 1 for all θ ∈ Ω.
For example, suppose a binomial model in which Y ~ Bin(n, θ) is a reasonable model; the observed data Y = y then give the likelihood function

L(θ) = (n choose y) θ^y (1 − θ)^{n−y},   0 ≤ θ ≤ 1.
Figure 4.16: Relative likelihood and log relative likelihood functions for a binomial parameter θ, for samples of size n = 200 and n = 1000.
The log relative likelihood function is

r(θ) = log R(θ) = ℓ(θ) − ℓ(θ̂),

where ℓ(θ) = log L(θ) is the log likelihood function. It is often convenient to compute r(θ) instead of R(θ), and to compute a p likelihood interval using the fact that R(θ) ≥ p if and only if r(θ) ≥ log p. While both plots in Figure 4.16 are unimodal and have identical locations of the maximum, they differ in shape: the plot of the relative likelihood function resembles a normal probability density function, while that of the log relative likelihood resembles a quadratic function of θ.
Likelihood intervals have desirable properties. One is that they become narrower as the sample size increases, thus indicating that larger samples contain more information about θ. They are also easy to obtain, since all we really have to do is plot the relative likelihood function R(θ) or r(θ) = log R(θ). This approach can also be extended to deal with vector parameters, in which case R(θ) ≥ p gives likelihood “regions” for θ.
The one apparent shortcoming of likelihood intervals so far is that we do not know how probable it is that a given interval will contain the true parameter value. As a result we also do not have a basis for the choice of p. Sometimes it is argued that values like p = 0.10 or p = 0.05 make sense because they rule out parameter values for which the probability of the observed data is less than 1/10 or 1/20 of the probability when θ = θ̂. However, a more satisfying approach is to apply the sampling distribution ideas in Section 4.1 to the interval estimates, as discussed at the start of this section. This leads to the concept of confidence intervals, which we describe next.
4.3 Confidence Intervals for a Parameter

Suppose an interval estimate [L(y), U(y)] for a parameter θ is computed from observed data y, and let [L(Y), U(Y)] denote the corresponding random interval. The probability

C(θ0) = P{θ0 ∈ [L(Y), U(Y)]; θ0}     (4.1.6)

is called the coverage probability for the interval estimate [L(Y), U(Y)]. A few words are in order about the meaning of the probability in (4.1.6). The parameter θ0 is an unknown constant associated with the population, but it is a fixed constant, NOT a random variable, and does not therefore have a distribution. You should read (4.1.6) as follows. If we were to draw a random sample of the same
size from the same population, and the true value of the parameter was θ0, the probability that θ0 is contained in the interval constructed from this new data is C(θ0).¹⁵ How does C(θ0) assist in the evaluation of interval estimates? In practice, we try to find intervals for which C(θ0) is fairly close to 1 (values .90, .95 and .99 are often used) while keeping the interval fairly short. Such interval estimates are called confidence intervals, and the value of C(θ0) is also called the confidence coefficient. If C(θ0) is 0.95, for example, then this indicates that 95% of the samples that we would draw from this model result in an interval [L(Y), U(Y)] which includes the parameter θ0 (and of course 5% do not). This gives us some confidence that for a particular sample, such as the one at hand, the true value of the parameter is contained in the interval.

To show that confidence intervals exist, and that at least in some cases the coverage probability C(θ0) does not depend on the unknown parameter θ0, we consider the following simple example.
Example 4.4.1 Suppose that a random variable Y has a G(μ0, 1) distribution. That is, μ0 = E(Y) is unknown but the standard deviation of Y is known to equal 1. Let the random variables Y1, . . . , Yn represent a random sample of y-values, and consider the interval (Ȳ − 1.96 n^{−1/2}, Ȳ + 1.96 n^{−1/2}), where Ȳ = ∑ Yi/n is the sample mean. Since Ȳ ~ G(μ0, n^{−1/2}), we find that

P(Ȳ − 1.96 n^{−1/2} ≤ μ0 ≤ Ȳ + 1.96 n^{−1/2}) = P(−1.96 ≤ (Ȳ − μ0)/n^{−1/2} ≤ 1.96)
                                              = P(−1.96 ≤ Z ≤ 1.96) = .95,
where Z ~ G(0, 1). Thus the interval (Ȳ − 1.96 n^{−1/2}, Ȳ + 1.96 n^{−1/2}) is a confidence interval for μ0 with confidence coefficient (coverage probability) C(μ0) = .95. This is an example in which C(μ0) does not depend on μ0, an extremely desirable feature of an interval estimator, since μ0 is unknown.
We repeat the very important interpretation of a confidence interval: if the procedure is used repeatedly, then in a fraction C(μ0) of cases the interval will contain the true value μ0. If a particular sample of size 16 in Example 4.4.1 had observed mean ȳ = 10.4, then the observed .95 confidence interval would be (ȳ − 1.96/4, ȳ + 1.96/4), or (9.91, 10.89). We do not say that the probability of the observed interval 9.91 ≤ μ0 ≤ 10.89 is .95, but we have a high degree of confidence (95%) that this interval contains μ0.
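The repeated-sampling interpretation is easy to check by simulation; a sketch, taking μ0 = 10 and n = 16 for illustration:

    mu0 <- 10; n <- 16
    covered <- replicate(10000, {
      ybar <- mean(rnorm(n, mu0, 1))
      ybar - 1.96 / sqrt(n) <= mu0 && mu0 <= ybar + 1.96 / sqrt(n)
    })
    mean(covered)   # fraction of intervals containing mu0; close to 0.95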
Confidence intervals tend to become narrower as the size of the sample on which they are based increases. For example, note the effect of n in Example 4.4.1: the width of the confidence interval is 2(1.96)/√n, which gets smaller as n increases. We noted this characteristic earlier for likelihood intervals, and we show a bit later that likelihood intervals are a type of confidence interval. Recall that the coverage probability for the interval in the example did not depend on μ0, a highly desirable property because
¹⁵When we use the observed data y, L(y) and U(y) are numerical values, not random variables. We do not know whether it is true or not true that L(y) ≤ θ0 ≤ U(y), and P{L(y) ≤ θ0 ≤ U(y)} makes no more sense than P{1 ≤ θ0 ≤ 3}, since there is no random variable to which the probability statement can refer.
we’d like to know the coverage probability while not knowing the value of the parameter θ0. We next consider a general way to find confidence intervals which have this property.

Definition: A pivotal quantity Q = g(Y1, . . . , Yn; θ) is a function of a random sample Y1, . . . , Yn and θ such that the distribution of the r.v. Q is fully known. That is, the distribution does not depend on θ or other unknown information.

The motivation for this definition is the following. If we can begin with a statement such as P{a ≤ g(Y1, . . . , Yn; θ) ≤ b} = 0.95, where g(Y1, . . . , Yn; θ) is a pivotal quantity with known distribution, then the inequality a ≤ g(Y1, . . . , Yn; θ) ≤ b can often be rewritten in the form L(Y1, . . . , Yn) ≤ θ ≤ U(Y1, . . . , Yn). Then [L(Y1, . . . , Yn), U(Y1, . . . , Yn)] is a confidence interval for θ with coverage probability 0.95 that does not depend on the parameter θ. In general, for coverage probability (also referred to as the confidence coefficient) α, we choose L, U so that

P{a ≤ g(Y1, . . . , Yn; θ) ≤ b} = P{L(Y1, . . . , Yn) ≤ θ ≤ U(Y1, . . . , Yn)} = α.

The coverage probability (confidence coefficient) does not depend on the value of θ. It does depend on a and b, but these are determined by the known distribution of g(Y1, . . . , Yn; θ).
Example 4.4.2 Suppose Y1, . . . , Yn is a random sample from a G(μ, σ0) distribution, where σ0 is known. Then Q = (Ȳ − μ)/(σ0/√n) ~ G(0, 1) is a pivotal quantity, and

.95 = P(a ≤ (Ȳ − μ)/(σ0/√n) ≤ b)
    = P(Ȳ − bσ0/√n ≤ μ ≤ Ȳ − aσ0/√n),

so that

Ȳ − bσ0/√n ≤ μ ≤ Ȳ − aσ0/√n     (4.4.1)

is a .95 confidence interval for μ. Note that there are infinitely many pairs (a, b) giving P(a ≤ Q ≤ b) = .95. A common choice is a = −1.96, b = 1.96; this gives the interval (Ȳ − 1.96σ0/√n, Ȳ + 1.96σ0/√n), which turns out to be the shortest possible .95 confidence interval. Another choice would be a = −∞, b = 1.645, which gives the interval (Ȳ − 1.645σ0/√n, ∞). This is useful when we are interested in getting a lower bound on the value of μ.
It turns out that for most distributions it is not possible to find “exact” pivotal quantities or confidence intervals for θ whose coverage probabilities do not depend somewhat on the true value of θ. However, in general we can find quantities Qn = g(Y1, . . . , Yn; θ) such that as n → ∞, the distribution of Qn ceases to depend on θ or other unknown information. We then say that Qn is asymptotically pivotal, and in practice we treat Qn as a pivotal quantity for sufficiently large values of n; more accurately, we term it an approximate pivotal quantity.
Example 4.4.3. Polls Consider Example 4.3.1 discussed earlier, where Y ~ Bin(n, θ). For large n we know that Q1 = (Y − nθ)/[nθ(1 − θ)]^{1/2} is approximately G(0, 1). It can also be proved that the distribution of

Q = (Y − nθ)/[nθ̂(1 − θ̂)]^{1/2},

where θ̂ = Y/n, is also close to G(0, 1) for large n. Thus Q can be used as an (approximate) pivotal quantity to get confidence intervals for θ. For example,

.95 ≈ P(−1.96 ≤ Q ≤ 1.96)
    = P(θ̂ − 1.96 [θ̂(1 − θ̂)/n]^{1/2} ≤ θ ≤ θ̂ + 1.96 [θ̂(1 − θ̂)/n]^{1/2}).     (4.4.2)
Remark: It is important to understand that confidence intervals may vary quite a lot when we take
repeated samples. For example, 10 samples of size n = 100 which were simulated for a population
where p = 0:25 gave the following .95 confidence intervals for p: .20 - .38, .14 - .31, .23 - .42, .22 -
.41, .18 - .36, .14 - .31, .10 - .26, .21 - .40, .15 - .33, .19 - .37.
When we get a .95 confidence interval from a single sample, it will include the true value of p with
probability .95, but this does not necessarily mean another sample will give a confidence interval that
is similar to the first one. If we take larger samples, then the confidence intervals are narrower and will
agree better. For example, try generating a few samples of size n = 1000 and compare the confidence
intervals for p.
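A sketch of this experiment in R, using the interval from Example 4.4.3 (the function name ci is ours):

    ci <- function(n, p = 0.25) {
      ph <- rbinom(1, n, p) / n
      ph + c(-1, 1) * 1.96 * sqrt(ph * (1 - ph) / n)
    }
    t(replicate(10, ci(100)))    # ten .95 intervals for n = 100: they vary a lot
    t(replicate(10, ci(1000)))   # ten intervals for n = 1000: narrower, more stable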
It turns out that likelihood intervals are approximate confidence intervals, and sometimes they are exact confidence intervals. Let R(θ) = L(θ)/L(θ̂) and define the quantity

λ(θ) = −2 log R(θ) = 2ℓ(θ̂) − 2ℓ(θ).

This is called the likelihood ratio statistic. The following result can be proved:

If L(θ) is based on a random sample of size n and if θ is the true value of the scalar parameter, then (under mild mathematical conditions) the distribution of λ(θ) converges to χ²(1) as n → ∞.

This means that λ(θ) can be used as an approximate pivotal quantity in order to get confidence intervals for θ. Because highly plausible values of θ are ones where R(θ) is close to 1 (i.e., λ(θ) is close to 0), we get confidence intervals by working from the probability

P(χ²(1) ≤ c) ≈ P(λ(θ) ≤ c) = P(L(Y1, . . . , Yn) ≤ θ ≤ U(Y1, . . . , Yn)).     (4.4.4)
Example 4.4.4 Consider the binomial model in Examples 4.3.1 and 4.4.3 once again. We have, after rearrangement,

λ(θ) = 2nθ̂ log(θ̂/θ) + 2n(1 − θ̂) log[(1 − θ̂)/(1 − θ)].

To get a .95 confidence interval for θ we note that P(χ²(1) ≤ 3.841) = .95. To find the confidence interval we have to find all θ values satisfying λ(θ) ≤ 3.841. This has to be done numerically, and depends on the observed data. For example, suppose that we observe n = 100, y = 40. Thus θ̂ = .40, and the observed value of λ is a function of θ,

λ(θ) = 80 log(.4/θ) + 120 log[.6/(1 − θ)].

Figure 4.17 shows a plot of λ(θ) and the line g(θ) = 3.841, from which the confidence interval can be extracted. Solving λ(θ) ≤ 3.841, we find that .307 ≤ θ ≤ .496 is the .95 confidence interval. We could also use the approximate pivotal quantity in Example 4.4.3 for this situation. It gives the .95 confidence interval (4.4.2), which is .304 ≤ θ ≤ .496. The two confidence intervals differ slightly (they are both based on approximations) but are extremely close.
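A sketch of the numerical solution with uniroot (the bracketing intervals are our choice):

    lam <- function(th) 80 * log(0.4 / th) + 120 * log(0.6 / (1 - th))
    lower <- uniroot(function(th) lam(th) - 3.841, c(0.01, 0.40))$root
    upper <- uniroot(function(th) lam(th) - 3.841, c(0.40, 0.99))$root
    c(lower, upper)   # about 0.307 and 0.498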
We can now see the connection between likelihood intervals and confidence intervals. The likelihood interval defined by R(θ) ≥ p is the same as the confidence interval defined by λ(θ) ≤ −2 log p. For a .95 confidence interval we use λ(θ) ≤ 3.841 (since P(χ²(1) ≤ 3.841) = .95), which corresponds to R(θ) ≥ 0.147. Conversely, a .10 likelihood interval given by R(θ) ≥ 0.1 corresponds to λ(θ) ≤ 4.605. Since P(χ²(1) ≤ 4.605) = .968, we see that a .10 likelihood interval is a confidence
Figure 4.17: The likelihood ratio statistic λ(θ) for the binomial example with n = 100, y = 40, together with the line g(θ) = 3.841.
interval with approximate confidence coefficient .968. Normally in statistical work, however, we use confidence intervals with (approximate) confidence coefficients .90, .95 or .99, and we usually employ λ(θ) rather than R(θ) in discussions about likelihood-based interval estimates.
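These correspondences are one-line computations in R:

    exp(-qchisq(0.95, 1) / 2)    # the p for a .95 confidence interval: about 0.147
    pchisq(-2 * log(0.10), 1)    # confidence coefficient of a .10 likelihood interval: about 0.968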
We have seen in examples that confidence intervals for a parameter tend to get narrower as the sample
size n increases. When designing a study we often decide how large a sample to collect on the basis of
(i) how narrow we would like confidence intervals to be, and (ii) how much we can afford to spend (it
costs time and money to collect data). The following example illustrates the procedure.
Example 4.4.5 Estimation of a Binomial Probability Suppose we want to estimate the probability θ from a binomial experiment in which the response variable Y has a Bin(n, θ) distribution. We will assume that the approximate pivotal quantity

Q = (Y − nθ)/[nθ̂(1 − θ̂)]^{1/2} ≈ G(0, 1)     (4.4.5)

introduced in Example 4.4.3 will be used to get confidence intervals for θ. (Using the likelihood ratio statistic leads to a more difficult derivation and in any case, for large n, the LR confidence intervals are very close to those based on Q.) Here is a criterion that is widely used for choosing the size of n: pick n large enough that a .95 CI for θ is no wider than θ̂ ± .03. Let’s see why this is used and where it leads.
By Example 4.4.3,

θ̂ ± 1.96 [θ̂(1 − θ̂)/n]^{1/2}

is an approximate .95 CI for θ. To make this CI θ̂ ± .03 (or even shorter, say θ̂ ± .025), we need n large enough that

1.96 [θ̂(1 − θ̂)/n]^{1/2} ≤ .03.

We of course don’t know what θ̂ will be once we take our sample, but we note that the worst case scenario is when θ̂ = .5. So to be conservative, we find n such that

n = (1.96/.03)² (.5)² ≈ 1067.
Thus, choosing n = 1067 (or larger) will result in a .95 CI of the form θ̂ ± c, where c ≤ .03. If you look or listen carefully when polling results are announced, you’ll often hear words like “this poll is accurate to within 3 percentage points 19 times out of 20.” What this really means is that the estimator (which is usually reported in percentage form) satisfies P(|θ̃ − θ| ≤ .03) ≈ .95, or equivalently, that the actual estimate θ̂ is the centre of a .95 confidence interval θ̂ ± c, for which c = .03. In practice, many polls are based on around 1050-1100 people, giving “accuracy to within 3 percent” (with probability .95). Of course, one needs to be able to afford to collect a sample of this size. If we were satisfied with an accuracy of 5 percent, then we’d only need n = 385. In many situations this might not be sufficiently accurate for the purpose of the study, however.
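A sketch of this calculation as a small R function (the name n_needed is ours):

    n_needed <- function(margin, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)
      ceiling((z / margin)^2 * 0.25)   # conservative choice theta-hat = .5
    }
    n_needed(0.03)   # 1068, essentially the 1067 computed above
    n_needed(0.05)   # 385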
Exercise: Show that to make the .95 CI θ̂ ± .02 or smaller, you need n = 2401. What should n be to make a .99 CI θ̂ ± .02 or less?
Remark: Very large binomial polls (n ≥ 2000) are not done very often. Although we can in theory estimate θ very precisely with an extremely large poll, there are two problems:

2. in many settings the value of θ fluctuates over time. A poll is at best a snapshot at one point in time.
As a result, the “real” accuracy of a poll cannot generally be made arbitrarily high. Sample sizes can be similarly determined so as to give confidence intervals of some desired length in other settings. We consider this topic again in Chapter 6. Many of the tools of this section can also be extended to the multi-parameter¹⁶ setting, but we will not discuss this further here.
Components of electronic products often must be very reliable; that is, they must perform over long periods of time without failing. Consequently, manufacturers who supply components to a company that produces, for example, personal computers must satisfy the company that their components are reliable. Demonstrating that a component is highly reliable is difficult, because if the component is used under “normal” conditions it will usually take a very long time to fail. It is generally not feasible for a manufacturer to carry out tests on the component that last for a period of years (or even months, in most cases), and therefore they use what are called accelerated life tests. These involve placing such high levels of stress on the components that they fail in much less than the normal time. If a model relating the level of stress to the lifetime of the component is known, then such experiments can be used to estimate lifetime at normal stress levels for the population from which the experimental units are taken.

We consider below some life test experiments on power supplies for PCs, with ambient temperature being the stress factor. As the temperature increases, the lifetimes of components tend to decrease, and at a temperature around 70°C average lifetimes tend to be of the order of 100 hours. The normal usage temperature is around 20°C. The data in Table 4.5.1 show the lifetimes (i.e., times to failure) of components tested at each of 40°, 50°, 60° and 70°C. The experiment was terminated after 600 hours, and for temperatures 40°, 50° and 60° some of the 25 components being tested had still not failed. Such observations are called censored: we know in each case only that the lifetime in question was
¹⁶Models With Two or More Parameters. When there is a vector θ = (θ1, . . . , θk) of unknown parameters, we may want to get interval estimates for individual parameters θj (j = 1, . . . , k) or for functions ψ = h(θ1, . . . , θk). For example, with a Gaussian model G(μ, σ) we might want to estimate μ and σ. In some problems there are pivotal quantities which are functions of the data and (only) the parameter of interest. We will use such quantities in Chapter 6, where we consider estimation and testing for Gaussian models. There also exist approximate pivotal quantities based on the likelihood function and m.l.e.’s. These are mainly developed in more advanced followup courses to this one, but we will briefly consider this approach later in the notes.

It is also possible to construct confidence regions for two or more parameters. For example, suppose a model has two parameters θ1, θ2 and a likelihood function L(θ1, θ2) based on observed data. Then we can define the relative likelihood function R(θ1, θ2) = L(θ1, θ2)/L(θ̂1, θ̂2) as in the scalar case. The set of pairs (θ1, θ2) which satisfy R(θ1, θ2) ≥ p is then called a p likelihood region for (θ1, θ2). The concept of confidence intervals can similarly be extended to confidence regions.
over 600 hours. In Table 4.5.1 the asterisks denote the censored observations.

It is known from past experience that at each temperature level lifetimes follow close to an exponential distribution; let us therefore suppose that at temperature t (t = 40, 50, 60, 70), component lifetimes Y have probability density function

f(y; θt) = (1/θt) e^{−y/θt},   y ≥ 0,     (4.5.1)

where θt = E(Y) is the mean lifetime. The likelihood function based on a sample consisting of both censoring times and lifetimes is a little different from one based on a random sample of lifetimes. It is, for the tests at temperature t,

L(θt) = {∏_{LT} (1/θt) e^{−yi/θt}} {∏_{CT} e^{−yi/θt}},     (4.5.2)

where LT stands for the set of lifetimes yi and CT for the set of censoring times yi.
Question 1 Show that for the distribution (4.5.1), P(Y > y) = e^{−y/θt}. Then describe how (4.5.2) is obtained.
Note that (4.5.2) can be rewritten as

L(θt) = (1/θt^r) e^{−s/θt},     (4.5.3)

where r = number of lifetimes observed and s = ∑ yi = sum of all lifetimes and censoring times.
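Since (4.5.3) is maximized at θt = s/r, the m.l.e. and relative likelihood are easy to compute. A sketch with a small hypothetical data set (the times and the status vector are made up purely for illustration):

    times  <- c(55, 139, 206, 263, 600, 600)   # hours; 600 = test ended without failure
    status <- c(1, 1, 1, 1, 0, 0)              # 1 = observed failure, 0 = censored
    r <- sum(status); s <- sum(times)
    mle <- s / r                               # maximizes (4.5.3)
    R <- function(th) (mle / th)^r * exp(s / mle - s / th)
    curve(R(x), from = 100, to = 3000, xlab = "theta", ylab = "R(theta)")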
Question 2 Assuming that the exponential model is correct, obtain the m.l.e. θ̂t for the mean lifetime at each of the four temperature levels t = 40, 50, 60, 70. Graph the likelihood functions for θ40 and θ70 and comment on any qualitative differences.
Question 3 Check, perhaps using some kind of graph, whether the exponential model seems appropriate.

Engineers used a model (called the Arrhenius model) that relates the mean lifetime of a component to the ambient temperature. This states that

θt = exp{α + β/(t + 273.2)},     (4.5.4)

where t is the temperature in degrees Celsius and α and β are parameters.
Question 4 Make a plot of log θ̂t vs (t + 273.2)^{−1} for the four temperatures involved in the life test experiment. Do the points lie roughly along a straight line? Give rough point estimates of α and β. Extrapolate your plot, or use your estimates of α and β, to estimate the mean lifetime at t = 20°C, the normal temperature.
Question 5 A point estimate of θ at 20°C is not very satisfactory. Outline how you might attempt to get an interval estimate based on the likelihood function. Once armed with an interval estimate, would you have many remaining qualms about indicating to the engineers what mean lifetime could be expected at 20°C? (Explain.)
Question 6 Engineers and statisticians have to design reliability tests like the one just discussed, and considerations such as the following are often used. Suppose that the mean lifetime at 20°C is supposed to be about 90,000 hours and that at 70°C you know from past experience that it is about 100 hours. If the model (4.5.4) applies, determine what α and β must approximately equal, and thus what θt is roughly equal to at 40°, 50° and 60°C. How might you use this information in deciding how long a period of time to run the life test? In particular, give the approximate expected number of uncensored lifetimes from an experiment that was terminated after 600 hours.
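For the first part of Question 6, (4.5.4) gives two linear equations in α and β; a sketch of the solve in R, under the stated assumptions about θ20 and θ70:

    A  <- rbind(c(1, 1 / (20 + 273.2)), c(1, 1 / (70 + 273.2)))
    ab <- solve(A, c(log(90000), log(100)))      # alpha and beta in (4.5.4)
    theta <- function(t) exp(ab[1] + ab[2] / (t + 273.2))
    theta(c(40, 50, 60))                         # rough mean lifetimes (hours)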
Table 4.5.1 Lifetimes (in hours) from an accelerated life test experiment on PC power supplies
Temperature
 70°C   60°C   50°C   40°C
   2      1     55     78
   5     20    139    211
   9     40    206    297
  10     47    263    556
  10     56    347    600*
  11     58    402    600*
  64     63    410    600*
  66     88    563    600*
  69     92    600*   600*
  70    103    600*   600*
  71    108    600*   600*
  73    125    600*   600*
  75    155    600*   600*
  77    177    600*   600*
  97    209    600*   600*
 103    224    600*   600*
 115    295    600*   600*
 130    298    600*   600*
 131    352    600*   600*
 134    392    600*   600*
 145    441    600*   600*
 181    489    600*   600*
 242    600*   600*   600*
 263    600*   600*   600*
 283    600*   600*   600*
Notes: Lifetimes are given in ascending order; asterisks (*) denote censored observations.
4.4 Problems
1. Consider the data on heights of adult males and females from Chapter 1. (The data are on the
course web page.)
(a) Assuming that for each gender the heights Y in the population from which the samples were drawn are adequately represented by Y ~ G(μ, σ), obtain the m.l.e.’s μ̂ and σ̂ in each case.
(b) Give the m.l.e.’s for the 10th and 90th percentiles of the height distribution for males and
for females.
(c) Give the m.l.e.’s for the probability P(Y > 1.83) for males and females (i.e. the fraction of the population over 1.83 m, or 6 ft).
(d) A simpler estimate of P(Y > 1.83) that doesn’t use the Gaussian model is

P̂(Y > 1.83) = (number of persons in sample with y > 1.83)/n,

where here n = 150. Obtain these estimates for males and for females. Can you think of any advantages of this estimate over the one in part (c)? Can you think of any disadvantages?

(e) Suggest and try a method of estimating the 10th and 90th percentiles of the height distribution that is similar to that in part (d).
2. When we measure something we are in effect estimating the true value of the quantity; measurements of the same quantity on different occasions are usually not equal. A chemist has two ways of measuring a particular quantity μ; one has more random error than the other. For method I, measurements X1, X2, . . . follow a normal distribution with mean μ and variance σ1², whereas for method II, measurements Y1, Y2, . . . have a normal distribution with mean μ and variance σ2².

(a) Suppose that the chemist has n measurements X1, . . . , Xn of a quantity by method I and m measurements Y1, . . . , Ym by method II. Assuming that σ1² and σ2² are known, write down the combined likelihood function for μ, and show that

μ̂ = (w1 x̄ + w2 ȳ)/(w1 + w2),

where w1 = n/σ1² and w2 = m/σ2².

(b) Suppose that σ1 = 1, σ2 = .5 and n = m = 10. How would you rationalize to a non-statistician why you were using the estimate μ̂ = (x̄ + 4ȳ)/5 instead of (x̄ + ȳ)/2?

(c) Determine the standard deviation of μ̃ and of (X̄ + Ȳ)/2 under the conditions of part (b). Why is μ̃ a better estimator?
3. Suppose that a fraction p of a large population of persons over 18 years of age never drink alcohol. In order to estimate p, a random sample of n persons is to be selected and the number y who do not drink determined; the maximum likelihood estimate of p is then p̂ = y/n. We want our estimate p̂ to have a high probability of being close to p, and want to know how large n should be to achieve this.

(a) Consider the random variable Y and the estimator p̃ = Y/n. Describe how you could work out the probability that −.03 ≤ p̃ − p ≤ .03, if you knew the values of n and p.

(b) Suppose that p is .40. Determine how large n should be in order to make P(−.03 ≤ p̃ − p ≤ .03) = .95. Use an approximation if you wish.
4. Proof of Central Limit Theorem (Special Case) Suppose Y1, Y2, . . . are independent random variables with E(Yi) = μ, Var(Yi) = σ², and that they have the same distribution, which has a m.g.f.

(a) Show that (Yi − μ)/σ has m.g.f. of the form (1 + t²/2 + terms in t³, t⁴, . . .) and thus that (Yi − μ)/(σ√n) has m.g.f. of the form (1 + t²/(2n) + o(n)), where o(n) signifies a remainder term Rn with the property that nRn → 0 as n → ∞.

(b) Let

Zn = ∑_{i=1}^n (Yi − μ)/(σ√n) = √n(Ȳ − μ)/σ,

and note that its m.g.f. is of the form (1 + t²/(2n) + o(n))^n. Show that as n → ∞ this approaches the limit e^{t²/2}, which is the m.g.f. for G(0, 1). (Hint: For any real number a, (1 + a/n)^n → e^a as n → ∞.)
5. A sequence of random variables {Xn} is said to converge in probability to the constant c if, for all ε > 0,

P(|Xn − c| > ε) → 0 as n → ∞.

We denote this by writing Xn →p c.

(a) If {Xn} and {Yn} are two sequences of random variables with Xn →p c1 and Yn →p c2, show that Xn + Yn →p c1 + c2 and Xn Yn →p c1 c2.
(b) Let X1, X2, . . . be i.i.d. random variables with probability density function f(x; θ). A point estimator θ̂n based on a random sample X1, . . . , Xn is said to be consistent for θ if θ̂n →p θ as n → ∞.

i. Let X1, . . . , Xn be i.i.d. U(0, θ). Show that the m.l.e. θ̂n = max(X1, . . . , Xn) is consistent for θ.

ii. Let X ~ Bin(n, p). Show that p̂n = X/n is consistent for p.
6. Refer to the definition of consistency in Problem 5(b). Difficulties can arise when the number of parameters increases with the amount of data. Suppose that two independent measurements of blood sugar are taken on each of n individuals, and consider the model

Xi1, Xi2 ~ N(μi, σ²),   i = 1, . . . , n,

where Xi1 and Xi2 are the independent measurements. The variance σ² is to be estimated, but the μi’s are also unknown.

(a) Find the m.l.e. σ̂² and show that it is not consistent. (To do this you have to find the m.l.e.’s for μ1, . . . , μn as well as for σ².)

(b) Suggest an alternative way to estimate σ² by considering the differences Wi = Xi1 − Xi2.

(c) What does σ represent physically if the measurements are taken very close together in time?
7. Suppose that blood samples for nk people are to be tested to obtain information about θ, the fraction of the population infected with a certain virus. In order to save time and money, pooled testing is used: the nk samples are mixed together k at a time to give n pooled samples. A pooled sample will test negative if all k individuals are not infected.

(a) Give an expression for the probability that x out of n samples will be negative, if the nk people are a random sample from the population. State any assumptions you make.

(b) Obtain a general expression for the maximum likelihood estimate θ̂ in terms of n, k and x.

(c) Suppose n = 100, k = 10 and x = 89. Give the m.l.e. θ̂ and the relative likelihood function, and find a 10% likelihood interval for θ.

(d) Discuss (or do it) how you would select an “optimal” value of k to use for pooled testing, if your objective was not to estimate θ but to identify persons who are infected, with the smallest number of tests. Assume that you know the value of θ and that the procedure would be to test all k persons individually each time a pooled sample was positive. (Hint: Suppose a large number n of persons must be tested, and find the expected number of tests needed.)
8. (a) For the data in Problem 4 of Chapter 2, plot the relative likelihood function R(θ) and determine a .10 likelihood interval. Is θ very accurately determined?

(b) Suppose that we can find out whether each pair of twins is identical or not, and that it is determined that of 50 pairs, 17 were identical. Obtain the likelihood function and m.l.e. of θ in this case. Plot the relative likelihood function along with the one in (a), and compare the accuracy of estimation in the two cases.
9. Company A leased photocopiers to the federal government, but at the end of their recent contract
the government declined to renew the arrangement and decided to lease from a new vendor,
Company B. One of the main reasons for this decision was a perception that the reliability of
Company A’s machines was poor.
(a) Over the preceding year the monthly numbers of failures requiring a service call from
Company A were
16 14 25 19 23 12
22 28 19 15 18 29.
Assuming that the number of service calls needed in a one month period has a Poisson distribution with mean θ, obtain and graph the relative likelihood function R(θ) based on the data above.
(b) In the first year using Company B’s photocopiers, the monthly numbers of service calls
were
13 7 12 9 15 17
10 13 8 10 12 14.
Under the same assumption as in part (a), obtain R(θ) for these data and graph it on the
same graph as used in (a). Do you think the government’s decision was a good one, as far
as the reliability of the machines is concerned?
(c) Use the likelihood ratio statistic λ(θ) as an approximate pivotal quantity to get .95 confidence intervals for θ for each company.
(d) What conditions would need to be satisfied to make the assumptions and analysis in (a) to
(c) valid? What approximations are involved?
10. The lifetime T (in days) of a particular type of lightbulb is assumed to have a distribution with probability density function

f(t; θ) = (θ³ t² e^{−θt})/2,   t > 0, θ > 0.

(a) Suppose t1, t2, . . . , tn is a random sample from this distribution. Show that the likelihood function for θ is proportional to

L(θ) = θ^{3n} e^{−θ ∑ ti}.
11. The χ² (Chi-squared) distribution. Consider the χ² distribution whose probability density function is given by (4.2.8) in Section 4.2.3. If Y ~ χ²(r) then
12. In an early study concerning survival time for patients diagnosed with Acquired Immune Deficiency Syndrome (AIDS), the survival times (i.e., times between diagnosis of AIDS and death) of 30 male patients were such that ∑_{i=1}^{30} xi = 11,400 days. It is known that survival times were approximately exponentially distributed with mean θ days.
(a) Write down the likelihood function for θ and obtain the likelihood ratio statistic. Use this to get an approximate .90 confidence interval for θ.

(b) Show that m = θ log 2 is the median survival time. Give a .90 confidence interval for m.

13. Suppose that X has an exponential distribution with probability density function

f(x; θ) = (1/θ) e^{−x/θ},   x > 0,

where θ > 0.

(a) Show that Y = 2X/θ has a χ²(2) distribution. (Hint: compare the probability density function of Y with (4.2.8).)

(b) If X1, . . . , Xn is a random sample from the exponential distribution above, prove that

U = 2 ∑_{i=1}^n Xi/θ ~ χ²(2n).

(You may use results in Section 4.2.) U is therefore a pivotal quantity, and can be used to get confidence intervals for θ.

(c) Refer to Problem 12. Using the fact that

P(43.19 ≤ χ²(60) ≤ 79.08) = .90,

obtain a .90 confidence interval for θ based on U. Compare this with the interval found in 12(a). Which interval is preferred here? (Why?)
14. Two hundred adults are chosen at random from a population and each is asked whether informa-
tion about abortions should be included in high school public health sessions. Suppose that 70%
say they should.
(a) Obtain a 95% confidence interval for the proportion p of the population who support abor-
tion information being included.
(b) Suppose you found out that the 200 persons interviewed consisted of 50 married couples
and 100 other persons. The 50 couples were randomly selected, as were the other 100 per-
sons. Discuss the validity (or non-validity) of the analysis in (a).
15. Consider the height data discussed in Problem 1 above. If heights Y are G(μ, σ) and μ̃ = Ȳ and

σ̃² = ∑_{i=1}^n (Yi − μ̃)²/n

are the ML estimators based on a sample of size n, then it can be shown that when n is large,

Z = √n(Ȳ − μ)/σ̃

is very close to G(0, 1), and so it is approximately a pivotal quantity. Use Z to obtain a .99 confidence interval for μ for males and for females.
16. In the U.S.A. the prevalence of HIV (Human Immunodeficiency Virus) infection in the population of child-bearing women has been estimated by doing blood tests (anonymized) on all women giving birth in a hospital. One study tested 29,000 women and found that 64 were HIV positive (had the virus). Give an approximate .99 confidence interval for θ, the fraction of the population that is HIV positive. State any concerns you feel about the accuracy of this estimate.
STATISTICAL INFERENCE: TESTING HYPOTHESES
5.1 Introduction
There are often hypotheses that a statistician or scientist might want to “test” in the light of observed data. Two important types of hypotheses are

(1) that a parameter (vector) θ has some specified value θ0; we denote this as H0: θ = θ0;

(2) that a random variable Y has a specified probability distribution, say with probability density function f0(y); we denote this as H0: Y ~ f0(y).
The statistical approach to hypothesis testing is as follows. First, assume that the hypothesis H0 will be tested using some random data “Data”. Next, define a test statistic (also called a discrepancy measure) D = g(Data) that is constructed to measure the degree of “agreement” between Data and the hypothesis H0. It is conventional to define D so that D = 0 represents the best possible agreement between the data and H0, and so that the larger D is, the poorer the agreement. Methods of constructing test statistics will be described later. Third, once specific observed “data” have been collected, let dobs = g(data) be the corresponding observed value of D. To test H0, we now calculate the observed significance level (also called the p-value), defined as

SL = P(D ≥ dobs; H0),     (5.1.1)

where the notation “; H0” means “assuming H0 is true”. If SL is close to zero then we are inclined to doubt that H0 is true, because if it is true, the probability of getting agreement as poor as or worse than observed is small. This makes the alternative explanation, that H0 is false, more appealing. In other words, we must accept that one of the following two statements is correct:

(a) H0 is true but by chance we have observed data that indicate poor agreement with H0, or
(b) H0 is false.
Example 5.1.1 Suppose we suspect that a certain die gives too high a probability of ones, and to check this we test H0: θ = 1/6, where θ = P(a one is tossed). The die is tossed n = 180 times, and y = 44 ones are observed. Under H0 the expected number of ones is 30, so we use the test statistic D = Y − 30 with observed value dobs = 14, and

SL = P(D ≥ 14; H0)
   = P(Y ≥ 44; θ = 1/6)
   = ∑_{y=44}^{180} (180 choose y) (1/6)^y (5/6)^{180−y}
   = .005.

This provides strong evidence against H0, and suggests that θ is bigger than 1/6.
Example 5.1.2 Suppose that in the experiment in Example 5.1.1 we observed y = 35 ones in n = 180 tosses. Now the SL is

SL = P(Y ≥ 35; θ = 1/6)
   = ∑_{y=35}^{180} (180 choose y) (1/6)^y (5/6)^{180−y}
   = 0.183.

This probability is not especially small, so we conclude that there is no strong evidence against H0. Note that we do not claim that H0 is true, only that there is no evidence that it is not true.
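Both binomial tail probabilities can be checked in R with pbinom:

    1 - pbinom(43, 180, 1/6)   # P(Y >= 44): about 0.005
    1 - pbinom(34, 180, 1/6)   # P(Y >= 35): about 0.183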
Example 5.1.3 Two laboratory scales, A and B, are each used to weigh a standard 1 kg weight 10 times, giving the following measurements (in kilograms):

A: 1.026, 0.998, 1.017, 1.045, 0.978, 1.004, 1.018, 0.965, 1.010, 1.000
B: 1.011, 0.966, 0.965, 0.999, 0.988, 0.987, 0.956, 0.969, 0.980, 0.988
Let Y represent a single measurement on one of the scales, and let μ represent the average measurement E(Y) in repeated weighings of a single 1 kg weight. If an experiment involving n weighings is conducted, then a sensible test of H0: μ = 1 could be based on the test statistic

D = |Ȳ − 1.00|,

where Ȳ = ∑ Yi/n. Since Ȳ is approximately G(μ, σ/√n), where μ = E(Y) and σ² = Var(Y), we can compute the significance level (at least approximately) using a Gaussian distribution. Since we don’t know σ², we will estimate it by the sample variance s² = ∑(yi − ȳ)²/(n − 1) in the calculations below.

The samples from scales A and B above give ȳ = 1.0061, s = 0.0230 for scale A and ȳ = 0.9810, s = 0.0170 for scale B. For scale A, dobs = 0.0061 and

SL = P(D ≥ 0.0061; μ = 1.00)
   = P(|Ȳ − 1.00| ≥ 0.0061)
   = P(|Ȳ − 1.00|/(0.0230/√10) ≥ 0.0061/(0.0230/√10))
   = P(|Z| ≥ 0.839)   where Z ~ G(0, 1)
   = .401.
Thus there is no evidence of bias (that is, that H0: μ = 1.00 is false) for scale A. For scale B, however, we get

SL = P(|Ȳ − 1.00|/(0.0170/√10) ≥ 0.0190/(0.0170/√10))
   = P(|Z| ≥ 3.534) = .0004.

Thus there is strong evidence against H0: μ = 1.00, suggesting strongly that scale B is biased.
Finally, note that just because there is strong evidence against H0 for scale B, the degree of bias in its measurements is not necessarily large. In fact, we can get an approximate .95 confidence interval for μ = E(Y) for scale B by using the approximate pivotal quantity

Z = (Ȳ − μ)/(s/√10) ≈ G(0, 1).

Since P(−1.96 ≤ Z ≤ 1.96) ≈ .95, we get the approximate .95 confidence interval ȳ ± 1.96 s/√10, or 0.981 ± .011, or 0.970 ≤ μ ≤ 0.992. Thus the bias in measuring the 1 kg weight is likely fairly small.
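A sketch reproducing these significance levels in R from the raw measurements (using the Gaussian approximation, as in the text):

    A <- c(1.026, 0.998, 1.017, 1.045, 0.978, 1.004, 1.018, 0.965, 1.010, 1.000)
    B <- c(1.011, 0.966, 0.965, 0.999, 0.988, 0.987, 0.956, 0.969, 0.980, 0.988)
    SL <- function(y) {
      z <- abs(mean(y) - 1.00) / (sd(y) / sqrt(length(y)))
      2 * (1 - pnorm(z))          # two-sided Gaussian tail probability
    }
    SL(A); SL(B)                  # about 0.40 and 0.0004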
The approach to testing hypotheses described above is very general and straightforward, but a few points should be stressed:
1. If the SL is small (close to 0) then the test indicates strong evidence against H0; this is often termed “statistically significant” evidence against H0. Rough rules of thumb are that SL < .05 provides moderately strong evidence against H0 and that SL < .01 provides strong evidence.

2. If the SL is not small we do not conclude by saying that H0 is true: we simply say there is no evidence against H0. The reason for this “hedging” is that in most settings a hypothesis may never be strictly “true”. (For example, one might argue when testing H0: θ = 1/6 in Example 5.1.1 that no real die ever has a probability of exactly 1/6 for side 1.) Hypotheses can be “disproved” (with a small degree of possible error) but not proved.

3. Just because there is strong evidence (“highly statistically significant” evidence) against a hypothesis H0, there is no implication about how “wrong” H0 is. For example, in Example 5.1.3 there was strong evidence that scale B was biased (that is, strong evidence against H0: bias = 0), but the relative magnitude (1-3%) of the bias is apparently small. In practice, we try to supplement a significance test with an interval estimate that indicates the magnitude of the departure from H0. This is how we check whether a result is “scientifically” significant as well as statistically significant.
A drawback with the approach to testing described so far is that we are not told how to construct
test statistics D. Often there are “intuitively obvious” statistics that can be used; this is the case in the
examples of this section. However, in more complicated situations it is not always easy to come up
with a test statistic. In the next section we show how to use the likelihood function to construct test
statistics in general settings.
A final point is that once we have specified a test statistic D, we need to be able to compute the significance level (5.1.1) for the observed data. This brings us back to distribution theory: in most cases the exact probability (5.1.1) is hard to determine mathematically, and we must either use an approximation or use computer simulation. Fortunately, for the tests in the next section we can usually use approximations based on χ² distributions.
5.2 Testing Parametric Hypotheses with Likelihood Ratio Statistics

General Theory
First we show how test statistics can be constructed from the likelihood function, for any hypothesis H0 that is specified in terms of one or more parameters. Let “Data” represent data generated from a distribution with probability or probability density function f(Data; θ), which depends on the k-dimensional parameter θ. Let Ω be the parameter space (set of possible values) for θ. Consider a hypothesis of the form

H0: θ ∈ Ω0,

where Ω0 ⊂ Ω and Ω0 is of dimension p < k. The dimensions of Ω and Ω0 refer to the minimum number of parameters (or “coordinates”) needed to specify points in them. We can test H0 using as our test statistic the likelihood ratio test statistic Λ, defined as follows:
Let L(θ) = f(Data; θ) be the likelihood function, and let

θ̂ denote the m.l.e. of θ over Ω,
θ̂0 denote the m.l.e. of θ over Ω0.

Now let

Λ = 2ℓ(θ̂) − 2ℓ(θ̂0) = −2 log{L(θ̂0)/L(θ̂)}.     (5.2.1)
This seems like a sensible way to measure the degree of agreement between H0 and the observed data: we look at the relative likelihood

R(θ̂0) = L(θ̂0)/L(θ̂)

of the “most likely” vector θ̂0 under H0 (i.e., when θ is restricted to Ω0). If this is very small, then there is evidence against H0 (think why). The reason we take D = Λ as the test statistic instead of R(θ̂0) is that Λ = −2 log R(θ̂0) takes values ≥ 0, with Λ = 0 (R(θ̂0) = 1) representing the best agreement between H0 and the data. Also, it can be shown that under H0, the distribution of Λ becomes χ²(k − p) as the size of the data set becomes large. Large values of λobs (small values of R(θ̂0)) indicate evidence against H0, so the p-value (significance level) is

SL = P(Λ ≥ λobs; H0) ≈ P(χ²(k − p) ≥ λobs).     (5.2.2)

(Note: Here we are using λobs to represent the value of Λ obtained when we get the (numerical) data; Λ represents the r.v. when we think, as usual, of the data as random variables, before they are collected.)
Some Examples

This approach is very general and can be used with many different types of problems. A few examples follow.

(a) A single parameter model. Suppose Y has an exponential distribution with probability density function f(y; θ) = (1/θ) e^{−y/θ} (y ≥ 0). Test H0: θ = θ0 (a given value) based on a random sample y1, . . . , yn. Thus Ω = {θ: θ > 0}, Ω0 = {θ0}, k = 1, p = 0 and

L(θ) = ∏_{i=1}^n f(yi; θ).

(b) A model with two parameters. Suppose Y ~ G(μ, σ) with probability density function

f(y; μ, σ) = (1/(√(2π) σ)) e^{−(1/2)((y−μ)/σ)²}   (−∞ < y < ∞);   θ = (μ, σ).

(c) Comparison of two parameters. Suppose we have data from two Poisson distributions with p.f.’s f(y1i; θ1) = e^{−θ1} θ1^{y1i}/y1i! and f(y2i; θ2) = e^{−θ2} θ2^{y2i}/y2i!, where y1i and y2i take values in {0, 1, 2, . . .}. Test H0: θ1 = θ2 based on two independent random samples y1i (i = 1, . . . , n1) and y2i (i = 1, . . . , n2). Thus θ = (θ1, θ2), Ω = {(θ1, θ2): θ1 > 0, θ2 > 0}, Ω0 = {(θ, θ): θ > 0}, k = 2, p = 1 and

L(θ) = L(θ1, θ2) = ∏_{i=1}^{n1} f(y1i; θ1) ∏_{i=1}^{n2} f(y2i; θ2).
We now consider problems that involve actual data, and describe the steps in the tests in more detail. In many testing problems it is necessary to use numerical methods to maximize likelihoods and find θ̂ and θ̂0. In this course and the examples below we focus on problems in which these estimates can be found mathematically.
Example 5.2.1. Lifetimes of light bulbs. The variability in lifetimes of light bulbs (in hours, say, of operation before failure) and other electrical components is often well described by an exponential distribution with probability density function of the form

f(y; θ) = (1/θ) e^{−y/θ},   y > 0,

where θ = E(Y) is the average (mean) lifetime. A manufacturer claims that the mean life of a particular brand of bulbs is 2000 hours. We can examine that claim by testing the hypothesis

H0: θ = 2000.

Note that in terms of our general theory the parameter space of θ is Ω = {θ: θ > 0} and the parameter space under H0 is the single point Ω0 = {2000}. The dimensions of Ω and Ω0 are 1 and 0, respectively. We use the likelihood ratio statistic of (5.2.1) as our test statistic D. To evaluate this we first write down the log likelihood function (noting that n = 20 and ∑ yi = 38,524 here),

ℓ(θ) = −20 log θ − 38524/θ.

Maximizing ℓ(θ) gives θ̂ = 38524/20 = 1926.2, and the observed value of the likelihood ratio statistic is λobs = 2ℓ(θ̂) − 2ℓ(2000) = 0.028. The final computational step is to compute the significance level, which we do using the χ² approximation (5.2.2). This gives

SL = P(Λ ≥ 0.028; H0 true)
   ≈ P(χ²(1) ≥ 0.028)
   = 0.87.

The SL is not close to zero, so we conclude that there is no evidence against H0 and against the manufacturer’s claim that θ is 2000 hours. Although the m.l.e. θ̂ was under 2000 hours (1926.2), it was not sufficiently under to give conclusive evidence against H0: θ = 2000.
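A sketch of this computation in R, using only n = 20 and ∑ yi = 38524 as given:

    n <- 20; s <- 38524
    loglik <- function(th) -n * log(th) - s / th
    lam_obs <- 2 * loglik(s / n) - 2 * loglik(2000)
    lam_obs                   # 0.028
    1 - pchisq(lam_obs, 1)    # SL, about 0.87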
Example 5.2.2. Comparison of two Poisson means. Recall the photocopier service call data from Problem 9 of the previous chapter, which gave the following monthly numbers of failures requiring a service call:

Company A: 16 14 25 19 23 12 22 28 19 15 18 29
Company B: 13 7 12 9 15 17 10 13 8 10 12 14
Denote the value of θ for Company A’s copiers as θA and the value for Company B’s as θB. It appears from the data that B’s copiers fail less often, but let us assess this formally by testing the hypothesis

H0: θA = θB.
This problem was sketched in Example (c) above. To handle it we consider the likelihood functions for A and B based on the observed data, and then the likelihood formed by combining them. The likelihoods for A and B are

L1(θA) = ∏_{i=1}^{12} e^{−θA} θA^{yAi}/yAi! = e^{−12θA} θA^{240} × constant,

L2(θB) = ∏_{i=1}^{12} e^{−θB} θB^{yBi}/yBi! = e^{−12θB} θB^{140} × constant.

To test H0 we view the two Poisson distributions together as one big model, with the parameter vector θ = (θA, θB). In terms of the general framework for likelihood ratio tests, the parameter space for θ is Ω = {(θA, θB): θA > 0, θB > 0}, and under H0 the parameter space becomes Ω0 = {(θA, θB): θA = θB > 0}. The likelihood function for the “combined” model is the product of L1(θA) and L2(θB), since the two samples are independent:

L(θ) = L(θA, θB) = L1(θA) L2(θB).
We now carry out the steps for the test of H0 exactly as in the previous example, except that now Ω has dimension k = 2 and Ω0 has dimension p = 1. First, we write down the log likelihood function,

ℓ(θA, θB) = −12θA + 240 log θA − 12θB + 140 log θB.     (5.2.3)

Next we find θ̂ and θ̂0. The m.l.e. θ̂ maximizes ℓ(θA, θB) in the unconstrained case. This can be done by solving the maximum likelihood equations

∂ℓ/∂θA = 0,   ∂ℓ/∂θB = 0,

which gives θ̂A = 240/12 = 20.0 and θ̂B = 140/12 = 11.667. That is, θ̂ = (20.0, 11.667). The constrained m.l.e. θ̂0 maximizes ℓ(θA, θB) under the constraint θA = θB. To do this we merely have to maximize

ℓ(θA, θA) = −24θA + 380 log θA

with respect to θA. Solving ∂ℓ(θA, θA)/∂θA = 0, we find θ̂A = 380/24 = 15.833 (= θ̂B); that is, θ̂0 = (15.833, 15.833).
The next step is to compute the observed value of the likelihood ratio statistic, which from (5.2.1) and (5.2.3) is

λobs = 2ℓ(θ̂A, θ̂B) − 2ℓ(θ̂0) = 2ℓ(20.0, 11.667) − 2ℓ(15.833, 15.833) = 26.64.

Finally, we compute the significance level for the test, which by (5.2.2) is

SL ≈ P(Λ ≥ 26.64; H0 true) = P(χ²(1) ≥ 26.64) = 2.4 × 10⁻⁷.

Our conclusion is that there is very strong evidence against the hypothesis; the test indicates that Company B’s copiers have a lower rate of failure than Company A’s copiers.
We can follow this conclusion up by giving confidence intervals for θA and θB; this indicates the magnitude of the difference in the two failure rates. (The m.l.e.’s θ̂A = 20.0 average failures per month and θ̂B = 11.67 failures per month differ a lot, but we also give confidence intervals in order to express the uncertainty in such estimates.)
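A sketch of the whole test in R from the two data vectors:

    yA <- c(16, 14, 25, 19, 23, 12, 22, 28, 19, 15, 18, 29)
    yB <- c(13, 7, 12, 9, 15, 17, 10, 13, 8, 10, 12, 14)
    loglik <- function(thA, thB)
      -12 * thA + sum(yA) * log(thA) - 12 * thB + sum(yB) * log(thB)
    th0 <- (sum(yA) + sum(yB)) / 24                 # constrained m.l.e., 15.833
    lam_obs <- 2 * loglik(mean(yA), mean(yB)) - 2 * loglik(th0, th0)
    lam_obs                   # about 26.6
    1 - pchisq(lam_obs, 1)    # a very small significance level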
Other hypothesis tests are considered in the remaining chapters of these notes. We conclude with
some short remarks about hypothesis testing and estimation.
5.3 Hypothesis Testing and Interval Estimation

The likelihood ratio statistic λ(θ0) is used for both tests and confidence intervals. For a test, the significance level using (5.2.2) is

SL = P(χ²(1) ≥ λobs(θ0)),     (5.3.1)

where we write λobs(θ0) to remind us that we are testing H0: θ = θ0. On the other hand, to get a confidence interval for θ with confidence coefficient α we find, by (4.4.4), all values θ0 such that

λobs(θ0) ≤ χ²(1),α,     (5.3.2)

where χ²(1),α is the α quantile of a χ²(1) distribution; for example, for a .95 confidence interval we use χ²(1),.95 = 3.84.
We now see the following by comparing (5.3.1) and (5.3.2): the parameter value θ0 is inside an α confidence interval given by (5.3.2) if and only if (iff) for the test of H0: θ = θ0 we have SL ≥ 1 − α. For example, θ0 is inside the .95 confidence interval (CI) iff the significance level for H0: θ = θ0 satisfies SL ≥ .05. To see this, note that

SL ≥ .05 ⟺ P(χ²(1) ≥ λobs(θ0)) ≥ .05
         ⟺ λobs(θ0) ≤ 3.84
         ⟺ θ0 is inside the .95 CI.

The connection between tests and confidence intervals can also be made when other test statistics besides the LR statistic are used. If D is a test statistic for testing H0: θ = θ0, then we can obtain a .95 confidence interval for θ by finding all values θ0 such that SL ≥ .05, or an α CI by finding values θ0 such that SL ≥ 1 − α.
5.4 Problems
1. The accident rate over a certain stretch of highway was about θ = 10 per year for a period of several years. In the most recent year, however, the number of accidents was 25. We want to know whether this many accidents is very probable if θ = 10; if not, we might conclude that the accident rate has increased for some reason. Investigate this question by assuming that the number of accidents in the current year follows a Poisson distribution with mean θ and then testing H0: θ = 10. Use the test statistic D = max(0, Y − 10), where Y represents the number of accidents in the most recent year.
2. Refer back to Problem 1 in Chapter 1. Frame this problem as a hypothesis test. What test statistic
is being used? What are the significance levels from the data in parts (b) and (c)?
3. The R function runif() generates pseudorandom U(0, 1) (uniform distribution on (0, 1)) random variables. The command y <- runif(n) will produce a vector of n values y1, . . . , yn.
(a) Give a test statistic which could be used to test that the yi's (i = 1, …, n) are consistent with a random sample from U(0,1).
(b) Generate 1000 yi ’s and carry out the test in (a).
4. A company that produces power systems for personal computers has to demonstrate a high degree
of reliability for its systems. Because the systems are very reliable under normal use conditions,
it is customary to ‘stress’ the systems by running them at a considerably higher temperature than
they would normally encounter, and to measure the time until the system fails. According to a
contract with one PC manufacturer, the average time to failure for systems run at 70°C should be no less than 1,000 hours.
From one production lot, 20 power systems were put on test and observed until failure at 70°C.
The 20 failure times x1 ; : : : ; x20 were (in hours)
374.2 544.0 1113.9 509.4 1244.3
551.9 853.2 3391.2 297.0 63.1
250.2 678.1 379.6 1818.9 1191.1
162.8 1060.1 1501.4 332.2 2382.0
(Note: Σ_{i=1}^{20} xi = 18,698.6)
Failure times Xi are known to be approximately exponential with mean θ.
(a) Use a likelihood ratio test to test the hypothesis that θ = 1000 hours. Is there any evidence
that the company’s power systems do not meet the contracted standard?
(b) If you were a PC manufacturer using these power systems, would you like the company to
perform any other statistical analyses besides testing H0: θ = 1000? Why?
5. In the Wintario lottery draw, six digit numbers were produced by six machines that operate
independently and which each simulate a random selection from the digits 0; 1; : : : ; 9. Of 736
numbers drawn over a period from 1980-82, the following frequencies were observed for position
1 in the six digit numbers:
Digit (i): 0 1 2 3 4 5 6 7 8 9 Total
Frequency (fi ): 70 75 63 59 81 92 75 100 63 58 736
Consider the 736 draws as trials in a multinomial experiment and let pi = P (digit i is drawn on
any trial), i = 0; 1; : : : 9. If the machines operate in a truly ‘random’ fashion, then we should
have pi = 0:1 (i = 0; 1; : : : ; 9).
(a) Test this hypothesis using a likelihood ratio test. What do you conclude?
(b) The data above were for digits in the first position of the six digit Wintario numbers. Sup-
pose you were told that similar likelihood ratio tests had in fact been carried out for each of
the six positions, and that position 1 had been singled out for presentation above because it
gave the largest value of the test statistic, D. What would you now do to test the hypothesis
pi = 0.1 (i = 0, 1, 2, …, 9)? (Hint: You need to consider P(largest of 6 independent D's ≥ D_obs).)
6. Testing a genetic model. Recall the model for the M-N blood types of people, discussed
in Examples 2.3.2 and 2.5.2. In a study involving a random sample of n persons the numbers
Y1, Y2, Y3 (Y1 + Y2 + Y3 = n) who have blood types MM, MN and NN respectively have a multinomial distribution with p.f.
f(y1, y2, y3) = [n!/(y1! y2! y3!)] p1^{y1} p2^{y2} p3^{y3},  yi ≥ 0,  Σ yi = n,
and since p1 + p2 + p3 = 1 the parameter space Ω = {(p1, p2, p3): pi ≥ 0, Σ pi = 1} has dimension 2. The genetic model discussed earlier specified that p1, p2, p3 can be expressed in terms of only a single parameter θ (0 < θ < 1), as follows:
p1 = θ²,  p2 = 2θ(1 − θ),  p3 = (1 − θ)².   (5.4.1)
Consider (5.4.1) as a hypothesis H0 to be tested. In that case, the dimension of the parameter
space for (p1 ; p2 ; p3 ) under H0 is 1, and the general methodology of likelihood ratio tests can be
applied. This gives a test of the adequacy of the genetic model.
Suppose that a sample with n = 100 persons gave observed values y1 = 18; y2 = 50; y3 =
32: Test the model (5.4.1) and state your conclusion.
7. Likelihood ratio test for a Gaussian mean. Suppose that a r.v. Y has a G(μ, σ) distribution and that we want to test the hypothesis H0: μ = μ0, where μ0 is some specified number. The value of σ is unknown.
(a) Set this up as a likelihood ratio test. (Note that the parameter space is Ω = {θ = (μ, σ): −∞ < μ < ∞, σ > 0}.) Assume that a random sample y1, …, yn is available.
(b) Derive the LR statistic and show that it can be expressed as a function of t = √n(ȳ − μ0)/s, where s is the sample standard deviation and ȳ is the sample mean. (Note: use the easily proved identity
Σ_{i=1}^n (yi − μ0)² = Σ_{i=1}^n (yi − ȳ)² + n(ȳ − μ0)².)   (5.4.2)
8. Use of simulation to obtain significance levels. In some testing problems the distribution of the test statistic D is so complicated that it is not possible (or is very difficult) to find the significance level
SL = P(D ≥ D_obs; H0 true)
D = jR 25:5j
(a) Suppose you observe R = 14 and want to find the SL. Since D_obs = 11.5,
9. The Poisson model is often used to compare rates of occurrence for certain types of events in different geographic regions. For example, consider K regions with populations P1, …, PK, and let λj (j = 1, …, K) be the annual expected number of events per person for region j. By assuming that the number of events Yj for region j in a given t-year period has a Poisson distribution with mean Pj λj t, we can estimate and compare the λj's, or test that they are equal.
(a) Under what conditions might the stated Poisson model be reasonable?
(b) Suppose you observe values y1, …, yK for a given t-year period. Describe how to test the hypothesis that λ1 = λ2 = … = λK.
(c) The data below show the numbers of children yj born with “birth defects" for 5 regions
over a given five year period, along with the total numbers of births Pj for each region.
Test the hypothesis that the five rates of birth defects are equal.
GAUSSIAN RESPONSE MODELS

6.1 Introduction
Gaussian models for response variables Y are very widely used in applications of statistics. We have seen examples involving variables such as heights and body-mass index measurements of people previously in these notes. Many problems involve explanatory variables x (which may be a vector) that are related to a response Y; in this case we can generalize the simple Gaussian model Y ~ G(μ, σ) to Y ~ G(μ(x), σ(x)), where x is a vector of covariates (explanatory variables). The trick in creating models in such settings is to decide on the forms for μ(x) and σ(x). To do this we rely on past information and on current data from the population or process in question.
Here are some examples of settings where Gaussian models could be used.
Example 6.1.1
The soft drink bottle filling process of Example 1.4.2 involved two machines (Old and New). For a
given machine it is reasonable to represent the distribution for the amount of liquid Y deposited in a
single bottle by a Gaussian distribution: Y ~ G(μ, σ).
In this case we can think of the machines as being like a covariate, with μ and σ differing for the two machines. We could write
Y ~ G(μ_O, σ_O)  and  Y ~ G(μ_N, σ_N)
for the old and new machines, respectively. In this case there is no formula relating μ and σ to the machines; they are simply different.
Example 6.1.2 Price vs. Size of Commercial Buildings (Reference: Oldford and MacKay STAT
231 Course Notes, Ch. 16)
Ontario property taxes are based on "market value", which is determined by comparing a property to
the price of those which have recently been sold. The value of a property is separated into components
for land and for buildings. Here we deal with the value of the buildings only.
A large manufacturing company was appealing the assessed market value of its property, which
included a large building. Sales records were collected on the 30 largest buildings sold in the previous
three years in the area. The data are given in Table 6.1.1 and plotted in a scatter plot (Figure 6.18), which is a plot of the points (xi, yi). They include the size of the building x (in m²/10⁵) and the selling price y (in $ per m²).
The building in question was 4.47 × 10⁵ m², with an assessed market value of $75 per m².
The scatter plot shows that price (y) is roughly inversely proportional to size (x) but there is obvi-
ously variability in the price of buildings having the same area (size). In this case we might consider a
model where the price of a building of size xi is represented by a random variable Yi, with
Yi ~ G(μi, σ),  μi = β0 + β1 xi,
where β0 and β1 are parameters. In this model we've assumed that σi = σ(xi) = σ, a constant. Alternatively, we could let it depend on xi somehow. (Note that the scatter plot does not provide much information on how to do this, however.)
Figure 6.18: Scatter plot of selling price y ($ per m²) vs. size x (m²/10⁵) for the 30 buildings.
Example 6.1.3 Breaking strengths of steel bolts. Steel bolts break when subjected to sufficiently large forces. Understanding the distribution of breaking strengths is very important in construction and other areas.
The data below show the breaking strengths (y) of six steel bolts at each of five different bolt
diameters (x). The data are plotted in Figure 6.19
The scatter plot gives a clear picture of the relationship between y and x. A reasonable model for the breaking strength Y of a randomly selected bolt of diameter x would appear to be Y ~ G(μ(x), σ), because the variability in y-values appears to be about the same for bolts of different diameters. It's not clear what the best choice for μ(x) would be; the relationship looks slightly nonlinear, so presumably we want a quadratic form such as
μ(x) = β0 + β1 x + β2 x².
Figure 6.19: Scatter Plot of Diameter vs. Strength for Steel Bolts.
A Gaussian response model is one for which the distribution of the response variable Y, given the associated covariates x for an individual unit, is of the form
Yi ~ G(μi, σi),  i = 1, …, n,
where μi = g1(xi) and σi = g2(xi) for some specified functions g1 and g2. In many problems it is only μi that depends much on xi, and we then use models where σi = σ is constant. Furthermore, in a great many situations μi can be written as a linear function of covariates. These models are called Gaussian linear models and are of the following form:
Yi ~ G(μi, σ),  i = 1, …, n   (6.1.2)
with
μi = Σ_{j=1}^{k} βj xij,  i = 1, …, n   (6.1.3)
where xi = (xi1, xi2, …, xik) is the vector of covariates associated with unit i and the βj's are parameters.
Remark: Sometimes the model (6.1.2) is written a little differently as
Yi = μi + Ri,  where Ri ~ G(0, σ).
2. The model in Example 6.1.2 had μi = β0 + β1 xi, where xi was the building's size. This can be re-expressed as μi = β0 xi0 + β1 xi1, where xi0 = 1, xi1 = xi (here we've used xij with j = 0, 1 for simplicity).
Now we’ll consider estimation and testing procedures for Gaussian models. We begin in the next
section with models that have no covariates.
6.2 Inference for a Single Sample from a Gaussian Distribution

Suppose Y1, …, Yn is a random sample from a G(μ, σ) distribution. If σ were known, then
Z = (Ȳ − μ) / (σ/√n) ~ G(0, 1)
would be a pivotal quantity and could be used to get confidence intervals (CI's) for μ. However, σ is generally unknown. Fortunately it turns out that if we simply replace σ with either σ̂ or s in Z, then we still have a pivotal quantity, which we denote as T. We will write T in terms of s since the formulas below look a little simpler then, so T is defined as
T = (Ȳ − μ) / (s/√n)   (6.2.1)
Since s is treated as a r.v. in (6.2.1) (we’ll use s to represent both the r.v. and an observed value, for
convenience), T does not have a G(0; 1) distribution. It turns out that its distribution in this case is
what is known as a student-t (or just “t") distribution. We’ll digress briefly to present this distribution
and show how it arises.
Student t Distributions
This distribution arises when we consider independent random variables Z ~ G(0, 1) and U ~ χ²(ν), and then define the new r.v.
X = Z / (U/ν)^{1/2}   (6.2.2)
Then X has a Student t distribution with ν degrees of freedom (d.f.), and we write X ~ t(ν) to denote this. The probability density function of X can be shown by a bivariate change of variables method (we won't do this) to be
f(x) = kν (1 + x²/ν)^{−(ν+1)/2},  −∞ < x < ∞   (6.2.3)
where
kν = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)).   (6.2.4)
This distribution is symmetric about x = 0 and for large ν is closely approximated by the probability density function of G(0, 1). Problem 1 at chapter's end considers some properties of f(x).
Probabilities for the t distribution are available from tables or computer software. In R, the c.d.f. value
F(x) = P(t(ν) ≤ x)   (6.2.5)
is calculated as pt(x, ν). For example, pt(1.5, 10) gives P(t(10) ≤ 1.5) as .918.
Returning to the random sample Y1, …, Yn from G(μ, σ), the key result is that
T = (Ȳ − μ) / (s/√n) ~ t(n − 1)   (6.2.6)
This follows from the facts that
(i) Z = (Ȳ − μ) / (σ/√n) ~ G(0, 1);
(ii) U = (n − 1)s²/σ² ~ χ²(n − 1);
(iii) Z and U are independent random variables.
(The results (ii) and (iii) are not too hard to show using multivariate calculus. Proofs are omitted here but done in Stat 330.)
Then, we see that X defined in (6.2.2) is just the r.v. T here, so (6.2.6) follows and T is a pivotal quantity.
Therefore, to get an α CI for μ we find values a1 and a2 such that
P(a1 ≤ t(n−1) ≤ a2) = α = P(a1 ≤ (Ȳ − μ)/(s/√n) ≤ a2).
This converts to
P(Ȳ − a2 s/√n ≤ μ ≤ Ȳ − a1 s/√n) = α   (6.2.7)
so (ȳ − a2 s/√n, ȳ − a1 s/√n) is an α CI.
Example 6.2.1 Scores Y for an IQ test administered to ten year olds in a very large population have close to a Gaussian distribution G(μ, σ). A random sample of 10 children got test scores as follows:
103, 115, 97, 101, 100, 108, 111, 91, 119, 101.
We can obtain confidence intervals for the average IQ test score μ in the population by using the pivotal quantity
T = (Ȳ − μ) / (s/√10) ~ t(9).
Since P(−2.262 ≤ t(9) ≤ 2.262) = .95, for example, a .95 confidence interval for μ is ȳ ± 2.262 s/√10. For the data given above ȳ = 104.6 and s = 8.57, so the observed confidence interval is 104.6 ± 6.13, or 98.47 ≤ μ ≤ 110.73.
Remarks:
1. Confidence intervals for μ get narrower as n increases. They are also narrower if σ is known (though this is unusual). In the limit as n → ∞, the CI based on (6.2.6) is equivalent to using Z and knowing that σ = s. For example, if in Example 6.2.1 we knew that σ = 8.57, then the .95 CI would use ȳ ± 1.96 σ/√n instead of ȳ ± 2.262 s/√n when n = 10. As n increases, s/√n becomes arbitrarily close to zero, so the CI's shrink to include only the point ȳ.
2. If we have a rough idea what the value of σ is, we can determine the value of n needed to make a (.95, say) CI a given length. This is used in deciding how large a sample to take in a study.
3. Sometimes we want one-sided CI's of the form μ ≥ L(y1, …, yn) or μ ≤ U(y1, …, yn). These are obtained by taking a1 = −∞ and a2 = ∞, respectively, in (6.2.7). For "two-sided" intervals we usually pick a1 = −a2, so that the interval is symmetric about ȳ.
Exercise: Show that these provide the shortest CI with a given confidence coefficient α.
Tests of the hypothesis H0: μ = μ0 can also be based on (6.2.6), using the test statistic
D = |Ȳ − μ0| / (s/√n)   (6.2.8)
which, when H0 is true, is distributed as |t(n−1)|. The significance level is then
SL = P(D ≥ D_obs; H0 true)
   = P(|t(n−1)| ≥ D_obs)
   = 1 − P(−D_obs ≤ t(n−1) ≤ D_obs).
Example 6.2.2 For the setting in Example 6.2.1, test H0: μ = 110. With (6.2.8) the observed value of D is then
D_obs = |104.6 − 110| / (8.57/√10) = 1.99
and the significance level is
SL = P(|t(9)| ≥ 1.99) = 1 − P(−1.99 ≤ t(9) ≤ 1.99) = .078.
This indicates there isn’t any strong evidence against H0 : (Such tests are sometimes used to com-
pare IQ test scores for a sub-population (e.g. students in one school district) with a known mean for
a “reference" population.)
Remark: The likelihood ratio (LR) statistic could also be used for testing H0: μ = μ0 or for CI's for μ, but the methods above are a little simpler. In fact, it can be shown that the LR statistic for H0 is a one-to-one function of |T|; see Problem 2 at the end of the chapter.
Remark: The function t.test in R will obtain confidence intervals and test hypotheses about μ; for a data set y, use t.test(y).
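For instance, a short sketch using t.test with the IQ data of Example 6.2.1 (output values rounded):

y <- c(103, 115, 97, 101, 100, 108, 111, 91, 119, 101)
t.test(y)            # gives the .95 CI for mu, about (98.5, 110.7)
t.test(y, mu = 110)  # two-sided test of H0: mu = 110; SL about .078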
Confidence intervals or tests for σ are based on result (ii) above,
U = (n − 1)s²/σ² ~ χ²(n − 1),
so U is a pivotal quantity and can be used to find CI's for σ² or σ. To get an α CI we find a1, a2 such that
P(a1 ≤ χ²(n−1) ≤ a2) = α = P(a1 ≤ (n − 1)s²/σ² ≤ a2).
This converts to
P((n − 1)s²/a2 ≤ σ² ≤ (n − 1)s²/a1) = α   (6.2.10)
so an α CI for σ is (√((n − 1)s²/a2), √((n − 1)s²/a1)). For "two-sided" CI's we usually choose a1 and a2 such that
P(χ²(n−1) < a1) = P(χ²(n−1) > a2) = (1 − α)/2.
In some applications we are interested in an upper bound on σ (because small σ is "good" in some sense); then we take a2 = ∞, so the lower confidence limit in (6.2.10) is 0.
Example 6.2.3. A manufacturing process produces wafer-shaped pieces of optical glass for lenses. Pieces must be very close to 25 mm thick, and only a small amount of variability around this can be tolerated. If Y represents the thickness of a randomly selected piece of glass then, to a close approximation, Y ~ G(μ, σ). Periodically, random samples of 15 pieces of glass are selected and the values of μ and σ are estimated to see if they are consistent with μ = 25 and with σ being under .02 mm. On one such occasion the sample mean and sum of squares from the data were ȳ = 25.009 and Σ(yi − ȳ)² = 0.002347.
Consider getting a .95 confidence interval for σ, using the pivotal quantity
U = 14s²/σ² = Σ(Yi − Ȳ)²/σ² ~ χ²(14).
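A brief R sketch of this interval, using qchisq for the quantiles a1 and a2 (the inputs are the numbers of Example 6.2.3):

n <- 15; SS <- 0.002347            # SS = sum((yi - ybar)^2) = 14 * s^2
a1 <- qchisq(.025, df = n - 1)     # about 5.63
a2 <- qchisq(.975, df = n - 1)     # about 26.12
sqrt(c(SS / a2, SS / a1))          # .95 CI for sigma: about (.0095, .0204)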
Example 6.2.4 For the manufacturing process in Example 6.2.3, test the hypothesis H0: σ = .008 (.008 is the desired or target value of σ that the manufacturer would like to achieve).
Note that since the value σ = .008 is outside the two-sided .95 CI for σ in Example 6.2.3, the SL for H0 based on the test statistic (or equivalently, U) will be less than .05. To find the exact SL, we follow the procedure above:
1. U_obs = 14 s²_obs / .008² = 36.67
2. SL = 2 P(χ²(14) ≥ 36.67) = .0017
This indicates very strong evidence against H0 and suggests that σ is bigger than .008.
6.3 General Gaussian Response Models

We now summarize some results about the maximum likelihood estimators of the parameters β = (β1, …, βk)′ and σ in the Gaussian linear model (6.1.2)-(6.1.3). Maximizing L(β, σ), or equivalently ℓ(β, σ) = log L(β, σ), gives
β̂ = (X′X)⁻¹ X′y,  σ̂² = (1/n) Σ_{i=1}^n (yi − μ̂i)²
where X_{n×k} = (xij), y_{n×1} = (y1, …, yn)′, and μ̂i = Σ_{j=1}^k β̂j xij. We also define
s² = (1/(n − k)) Σ_{i=1}^n (yi − μ̂i)² = (n/(n − k)) σ̂².
W = n σ̃²/σ² = (n − k)s²/σ² ~ χ²(n − k)   (6.3.3)
Remark: The m.l.e. β̂ is also a least squares (LS) estimate of β. Least squares is a method of estimation in linear models that predates maximum likelihood. Problem 16 describes least squares methods.
Recall the distribution theory for the (Student) t distribution. If Z ~ G(0, 1) and W ~ χ²(r) are independent, then the r.v.
T = Z / √(W/r) ~ t(r)   (6.3.5)
This provides a way to get confidence intervals (or tests) for any βj. Because (6.3.2) implies that
Z = (β̃j − βj) / (σ√cj) ~ G(0, 1)
and because of (6.3.3), then (6.3.5) implies that
T = (β̃j − βj) / (s√cj) ~ t(n − k)   (6.3.6)
Below we will consider some special types of Gaussian models which fall under the general theory.
However, we’ll alter the notation a bit for each model, for convenience.
(Note: The c1 in (6.3.2) is 1/n here; it's easiest to get this from the fact that Var(Ȳ) = Var(μ̃) = σ²/n, proved earlier.)
Once again, we stick with μ1 and μ2 as the names of the parameters. The likelihood function for μ1, μ2, σ is
L(μ1, μ2, σ) = Π_{j=1}^{2} Π_{i=1}^{nj} [1/(√(2π) σ)] exp{−(yji − μj)²/(2σ²)}.
Note that
s² = [1/(n1 + n2 − 2)] Σ_{j=1}^{2} (nj − 1) sj² = (n1 + n2) σ̂² / (n1 + n2 − 2)
where sj² = [1/(nj − 1)] Σ_{i=1}^{nj} (yji − ȳj)².
i=1
Z = [(Ȳ1 − Ȳ2) − (μ1 − μ2)] / (σ√(1/n1 + 1/n2)) ~ G(0, 1),  W = (n1 + n2 − 2)s²/σ² ~ χ²(n1 + n2 − 2),
and so by (6.3.5)
T = [(Ȳ1 − Ȳ2) − (μ1 − μ2)] / (s√(1/n1 + 1/n2)) ~ t(n1 + n2 − 2).   (6.3.7)
Confidence intervals or tests about σ can be obtained by using the pivotal quantity W exactly as in Section 6.2 for a single distribution.
Example 6.3.1. In an experiment to assess the durability of two types of white paint used on asphalt highways, 12 lines (each 4 inches wide) of each paint were laid across a heavily traveled section of highway, in random order. After a period of time, reflectometer readings were taken for each line of paint; the
higher the readings the greater the reflectivity and the visibility of the paint. The measurements of
reflectivity were as follows:
Paint A: 12.5, 11.7, 9.9, 9.6, 10.3, 9.6, 9.4, 11.3, 8.7, 11.5, 10.6, 9.7
Paint B: 9.4, 11.6, 9.7, 10.4, 6.9, 7.3, 8.4, 7.2, 7.0, 8.2, 12.7, 9.2
Statistical objectives are to test that the average reflectivity for paints A and B is the same, and if
there is evidence of a difference, to obtain a confidence interval for their difference. (In many problems
where two attributes are to be compared we start by testing the hypothesis that they are equal, even if
we feel there may be a difference. If there is no statistical evidence of a difference then we stop there.)
To do this it is assumed that, to a close approximation, reflectivity measurements Y1 for paint A are G(μ1, σ), and measurements Y2 for paint B are G(μ2, σ), with a common standard deviation σ. We can test H: Δ = μ1 − μ2 = 0 and get confidence intervals for Δ by using the pivotal quantity
T = [(Ȳ1 − Ȳ2) − Δ] / (s√(1/12 + 1/12)) ~ t(22).
To test H: Δ = 0 we use the test statistic D = |T|. From the data given above we find
n1 = 12,  ȳ1 = 10.4,  Σ(y1i − ȳ1)² = 14.08,  s1² = 1.2800
n2 = 12,  ȳ2 = 9.0,  Σ(y2i − ȳ2)² = 38.64,  s2² = 3.5127.
This gives Δ̂ = 1.4 and s² = 2.3964, and the observed test statistic d_obs = 2.22. The significance level is then
SL = P(|t(22)| ≥ 2.22) ≐ .038.
This indicates there is fairly strong evidence against H: μ1 = μ2. Since ȳ1 > ȳ2, the indication is that paint A keeps its visibility better. A .95 confidence interval based on T is obtained using the fact that P(−2.074 ≤ t(22) ≤ 2.074) = .95. This gives the confidence interval for Δ = μ1 − μ2 of Δ̂ ± 2.074 s/√6, or 0.09 ≤ Δ ≤ 2.71. This suggests that although the difference in reflectivity (and durability) of the paints is statistically significant, the size of the difference is not really large relative to μ1 and μ2 (look at ȳ1 and ȳ2).
The procedures above assume that the two Gaussian distributions have the same standard deviation. Sometimes this isn't a reasonable assumption (it can be tested using a LR test, but we won't do this here), and we must assume that Y1 ~ G(μ1, σ1) and Y2 ~ G(μ2, σ2). In this case there is no exact pivotal quantity with which to get CI's for Δ = μ1 − μ2, but
Z = [(Ȳ1 − Ȳ2) − (μ1 − μ2)] / √(s1²/n1 + s2²/n2) ≈ G(0, 1)   (6.3.8)
is an approximate pivotal quantity that becomes exact as n1 and n2 become arbitrarily large.
To illustrate its use, consider Example 6.3.1, where we had s1² = 1.2800, s2² = 3.5127. These appear quite different, but they are in squared units and n1, n2 are small; the standard deviations s1 = 1.13 and s2 = 1.87 do not provide evidence against the hypothesis that σ1 = σ2 if a LR test is carried out. Nevertheless, let us use (6.3.8) to get a .95 CI for Δ. This gives the CI
(Ȳ1 − Ȳ2) ± 1.96 √(s1²/n1 + s2²/n2)
which with the data observed equals 1.4 ± 1.24, or .16 ≤ Δ ≤ 2.64. This is not much different from the interval obtained in Example 6.3.1.
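Both analyses are available in R through t.test; here is a sketch with the paint data. (Note that t.test's default unequal-variance method is Welch's t approximation, which is closely related to, but not identical to, the Z in (6.3.8).)

paintA <- c(12.5, 11.7, 9.9, 9.6, 10.3, 9.6, 9.4, 11.3, 8.7, 11.5, 10.6, 9.7)
paintB <- c(9.4, 11.6, 9.7, 10.4, 6.9, 7.3, 8.4, 7.2, 7.0, 8.2, 12.7, 9.2)
t.test(paintA, paintB, var.equal = TRUE)  # pooled t(22): t = 2.22, SL about .038
t.test(paintA, paintB)                    # Welch's approximate procedure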
The average score is somewhat higher in district A, and we will give a confidence interval for the difference in average scores μA − μB in a model representing this setting. This is done by thinking of the students in each district as a random sample from a conceptual population of "similar" students writing "similar" tests. Assuming that in a given district the scores Y have a G(μ, σ) distribution, we can test that the means μA and μB for districts A and B are the same, or give a CI for the difference. (Achievement tests are usually designed so that the scores are approximately Gaussian, so this is a sensible procedure.)
Let us get a .95 CI for Δ = μA − μB using the pivotal quantity (6.3.8). This gives the CI
(ȳ1 − ȳ2) ± 1.96 √(s1²/n1 + s2²/n2)
which becomes 2.1 ± (1.96)(.779), or 0.57 ≤ Δ ≤ 3.63. Since Δ = 0 is outside the .95 CI (and also the .99 CI), we can conclude there is fairly strong evidence against the hypothesis that μA = μB, suggesting that μA > μB.
It is always a good idea to look carefully at the data and the distributions suggested for the two
groups, however; we should not rely only on a comparison of their means. Figure 6.3.1 shows a box plot
of the two samples; this type of plot was mentioned in Section 1.3. It shows both the median value and
other summary statistics of each sample: the upper and lower quartiles (i.e. 25th and 75th percentiles)
and the smallest and largest values. Figure 6.20 was obtained using the R function boxplot().
Note that the distributions of marks for districts A and B are actually quite similar. The median (and mean) is a little higher for district A, and because the sample sizes are so large, this gives a "statistically significant" difference in a test that μA = μB. However, it would be a mistake to conclude that the actual difference in the two distributions is very large. Unfortunately, "significant" tests like this are often used to make claims about one group being "superior" to another.
Remark: The R function t.test will carry out the test above and will give confidence intervals for μA − μB. This can be done with the command t.test(yA, yB, var.equal = TRUE), where yA and yB are the data vectors from A and B.
Figure 6.20: Box Plot of Math Test Scores for Two School Districts.
6.4 Inference for Paired Data

Example 6.4.1 Heights were measured for a sample of 1401 (brother, sister) pairs. With Y1i and Y2i denoting the heights of the brother and sister in the i'th pair, interest centres on the means μ1 = E(Y1i), μ2 = E(Y2i) and the difference Δ = μ1 − μ2. However, the heights of related persons are not independent, so to estimate Δ the method in the preceding section would not be strictly usable; it requires that we have independent random samples of males and females. In fact, the primary reason for collecting these data was to consider the joint distribution of Y1i, Y2i and to examine their relationship. This topic is not considered in this course, but a clear picture is obtained by plotting the points (Y1i, Y2i) in a scatter plot.
Δ = E(Y1i − Y2i).
The fuel consumptions Y1i ; Y2i for the i’th car are related, because factors such as size, weight and
engine size (and perhaps the driver) affect consumption. As in the preceding example it would likely
not be appropriate to treat the Y1i ’s (i = 1; : : : ; 50) and Y2i ’s (i = 1; : : : ; 50) as two independent
samples. Note that in this example it may not be of much interest to consider E(Y1i ) and E(Y2i )
separately, since there is only a single observation on each car type for either fuel.
Two types of Gaussian models are used to represent settings involving paired data. The first involves what is called a bivariate normal distribution for (Y1i, Y2i), and it could be used in Example 6.4.1. This is a continuous bivariate model. Only discrete bivariate models were introduced in Stat 230 and we will not consider this model here (it is studied in Stat 330), except to note an important property:
Yi = Y1i − Y2i ~ G(μ1 − μ2, σ)   (6.4.1)
for some σ. The second type of model assumes that
Y1i ~ G(μ1 + αi, σ1),  Y2i ~ G(μ2 + αi, σ2),
where the αi's are unknown constants. Here it is assumed that Y1i and Y2i are independent random variables, and the αi's represent factors specific to the different pairs. This model also gives the distribution (6.4.1), since Y1i − Y2i then has mean (μ1 + αi) − (μ2 + αi) = μ1 − μ2 and variance σ1² + σ2² (the αi's cancel out of the difference).
This model seems relevant for Example 6.4.2, where αi refers to the i'th car type. Interestingly, the two models for (Y1i, Y2i) can be connected; if the αi's are considered as Gaussian random variables in the population of pairs of units, then the result is that (Y1i, Y2i) has a bivariate normal distribution.
Thus, whenever we encounter paired data in which the variation in variables Y1i and Y2i is adequately modeled by Gaussian distributions, we will make inferences about Δ = μ1 − μ2 by working with the model (6.4.1).
Example 6.4.1 revisited. The data on 1401 (brother, sister) pairs gave differences Yi = Y1i − Y2i (i = 1, …, 1401) for which the sample mean and variance were
ȳ = 4.895 in,  s² = Σ(yi − ȳ)²/1400 = 6.5480 in².
Using the Student t pivotal quantity (6.2.6), a two-sided .95 confidence interval for Δ = E(Yi) is ȳ ± 1.96 s/√n, where n = 1401. (Note that t(1400) is indistinguishable from G(0, 1).) This gives the .95 CI 4.895 ± 0.134 inches, or 4.76 ≤ Δ ≤ 5.03 in.
Remark: The method above assumes that the (brother, sister) pairs are a random sample from the population of families with a living adult brother and sister. The question arises as to whether Δ also represents the difference in the average heights of all adult males and all adult females (call them μ1′ and μ2′) in the population. Presumably μ1′ = μ1 (i.e. the average height of all adult males equals the average height of all adult males who also have an adult sister) and similarly μ2′ = μ2, so Δ does represent this difference. However, it might be wise to check this assumption.
Recall our earlier Example 2.4.1 involving the difference in the average heights of males and females in New Zealand. This gave the estimate Δ̂ = ȳ1 − ȳ2 = 68.72 − 64.10 = 4.62 inches, which is a little less than the difference in the example above. This is likely due to the fact that we are considering two distinct populations, but it should be noted that the New Zealand data are not paired.
If we sample n pairs (Y1i, Y2i), the estimator Δ̃ = Ȳ1 − Ȳ2 has
Var(Δ̃) = (σ1² + σ2² − 2σ12)/n,
where σ12 = Cov(Y1i, Y2i). If σ12 > 0, then Var(Δ̃) is smaller than when σ12 = 0 (i.e. when Y1i and Y2i are independent). Therefore, if we can collect a sample of pairs (Y1i, Y2i), this is better than two independent random samples (one of Y1i's and one of Y2i's) for estimating Δ. Note, on the other hand, that if σ12 < 0, then pairing is a bad idea, since it increases the variance of Δ̃.
Table 6.4.1 shows the cholesterol levels y (in mmol per liter) for each subject, measured at the end of each 6 week period. We'll let the random variables Y1i, Y2i represent the cholesterol levels for subject i on the high fibre and low fibre diets, respectively. We'll also assume that the differences are represented by the model
Yi = Y1i − Y2i ~ G(Δ, σ),  i = 1, …, 20.
The differences yi are also shown in Table 6.4.1, and from them we calculate the sample mean and standard deviation
ȳ = −.020,  s = 0.411.
A .95 CI for Δ is found using (6.2.7) and the fact that P(−2.093 ≤ t(19) ≤ 2.093) = .95. This gives the CI ȳ ± 2.093 s/√20, or −.020 ± 0.192, or
−0.212 ≤ Δ ≤ 0.172.
This confidence interval includes Δ = 0, and there is clearly no evidence that the high fibre diet gives a lower cholesterol level.
Remark: The results here can be obtained using the R function t.test.
Exercise: Compute the significance level for the hypothesis H0: Δ = 0, using the test statistic (6.2.8).
Subject Y1i (High F) Y2i (Low F) Yi Subject Y1i (High F) Y2i (Low F) Yi
1 5.55 5.42 .13 11 4.44 4.43 .01
2 2.91 2.85 .06 12 5.22 5.27 -.05
3 4.77 4.25 .52 13 4.22 3.61 .61
4 5.63 5.43 .20 14 4.29 4.65 -.36
5 3.58 4.38 -.80 15 4.03 4.33 -.30
6 5.11 5.05 .06 16 4.55 4.61 -.06
7 4.29 4.44 -.15 17 4.56 4.45 .11
8 3.40 3.36 .04 18 4.67 4.95 -.28
9 4.18 4.38 -.20 19 3.55 4.41 -.86
10 5.41 4.55 .86 20 4.44 4.38 .06
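As a sketch, the paired analysis can be run in R on the differences from Table 6.4.1 (equivalently, t.test(y1, y2, paired = TRUE) with the two columns as vectors):

d <- c(.13, .06, .52, .20, -.80, .06, -.15, .04, -.20, .86,
       .01, -.05, .61, -.36, -.30, -.06, .11, -.28, -.86, .06)
t.test(d)   # ybar = -0.02; .95 CI for Delta about (-0.212, 0.172)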
Final Remarks: When you see data from a comparative study (i.e. one whose objective is to compare two distributions, often through their means), you have to determine whether it involves paired data or not. Of course, a sample of Y1i's and Y2i's cannot be from a paired study unless there are equal numbers of each, but if there are equal numbers the study might be either "paired" or "unpaired". Note also that there is a subtle difference in the study populations in paired and unpaired studies. In the former it is pairs of individual units that form the population, whereas in the latter there are (conceptually at least) separate individual units for Y1 and Y2 measurements.
6.5 Linear Regression Models

Suppose the response Y is related to a single covariate x through the model
Yi ~ G(μi, σ)  with  μi = α + βxi   (6.5.1)
(Note that this is of the form (6.1.2) and (6.1.3) with β1 = α, β2 = β, x1i = 1, x2i = xi.)
Once again, we can use the general results of Section 6.3, or just maximize the likelihood
L(α, β, σ) = Π_{i=1}^{n} [1/(√(2π) σ)] exp{−(yi − α − βxi)²/(2σ²)}
to get
β̂ = Sxy/Sxx,  α̂ = ȳ − β̂x̄,  σ̂² = (1/n) Σ_{i=1}^n (yi − α̂ − β̂xi)² = (1/n)(Syy − β̂ Sxy)
where Sxx = Σ_{i=1}^n (xi − x̄)², Sxy = Σ_{i=1}^n (xi − x̄)(yi − ȳ), Syy = Σ_{i=1}^n (yi − ȳ)².
In later calculations we use s² = [1/(n − 2)] Σ(yi − α̂ − β̂xi)² instead of σ̂². The parameter β represents the change in mean response μ(x) resulting from an increase of 1 in x. As well, if β = 0 then x has no effect on Y (within the constraints of this model).
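As a quick check of these formulas, here is an R sketch computing α̂ and β̂ directly and via lm; the x-y values are made up purely for illustration.

x <- c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)     # made-up covariate values
y <- c(520, 470, 410, 380, 330, 300)     # made-up responses
Sxx <- sum((x - mean(x))^2)
Sxy <- sum((x - mean(x)) * (y - mean(y)))
beta.hat  <- Sxy / Sxx
alpha.hat <- mean(y) - beta.hat * mean(x)
c(alpha.hat, beta.hat)
coef(lm(y ~ x))                          # lm gives the same two estimates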
From Section 6.3 we know that β̃ ~ G(β, σ√c) for some constant c. It's easy to show this directly here, and to obtain c. Write β̃ as
β̃ = Sxy/Sxx = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ)/Sxx
  = Σ_{i=1}^n [(xi − x̄)/Sxx] Yi   (since Σ(xi − x̄)Ȳ = 0)
  = Σ_{i=1}^n ai Yi
where ai = (xi − x̄)/Sxx. This is a linear combination of independent Gaussian random variables, and so its distribution is also Gaussian, with
E(β̃) = Σ_{i=1}^n ai E(Yi)
     = Σ_{i=1}^n [(xi − x̄)/Sxx](α + βxi)
     = α Σ(xi − x̄)/Sxx + β Σ(xi − x̄)xi/Sxx
     = 0 + β   (since Σ(xi − x̄)² = Σ(xi − x̄)xi)
and
Var(β̃) = Σ_{i=1}^n ai² Var(Yi) = Σ_{i=1}^n (xi − x̄)² σ²/Sxx² = σ²/Sxx.
Thus
β̃ ~ G(β, σ/√Sxx)   (6.5.2)
and combining this with the fact that
W = (n − 2)s²/σ² ~ χ²(n − 2)   (6.5.3)
(where W is independent of β̃) gives, by (6.3.5),
T = (β̃ − β) / (s/√Sxx) ~ t(n − 2)   (6.5.4)
This can be used as a pivotal quantity to get CI's for β, or to test hypotheses about β. Note also that (6.5.3) can be used to get CI's or tests for σ, but these are usually of less interest than inference about β or the other quantities below.
To estimate the mean μ(x) = α + βx at a given x we use the m.l.e.
μ̃(x) = α̃ + β̃x = Ȳ + β̃(x − x̄),
since α̃ = Ȳ − β̃x̄. Thus μ̃(x) is a linear function of Gaussian random variables (because Ȳ and β̃ are) and so must have a Gaussian distribution. Its mean and variance are
E[μ̃(x)] = E(Ȳ) + (x − x̄) E(β̃) = α + βx̄ + (x − x̄)β = α + βx = μ(x),
Var[μ̃(x)] = σ² [1/n + (x − x̄)²/Sxx],
the last step using the independence of Ȳ and β̃. Thus
μ̃(x) ~ G(μ(x), σ√(1/n + (x − x̄)²/Sxx)).
Remark: The parameter α equals μ(0), so it doesn't require special treatment. Sometimes x = 0 is a value of interest, but often it is not. In the following example it refers to a building of area 0, which is nonsensical!
Remark: The results of the analyses below can be obtained using the R function lm, with the command lm(y ~ x). We give the detailed results below to illustrate how the calculations are made.
Note that β̂ is negative: larger buildings tend to sell for less per square meter. (The estimate β̂ = −144.5 indicates a drop in average price of $144.50 per square meter for each increase of 1 unit in x; remember x's units are m²/10⁵.) The line y = α̂ + β̂x is often called the fitted regression line for y on x, and if we plot it on the same graph as the points (xi, yi) in the scatter plot of Figure 6.18, we see it passes close to the points.
A confidence interval for β isn't of major interest in the setting here, where the data were called on to indicate a fair assessment value for a large building with x = 4.47. One way to address this is to estimate μ(x) when x = 4.47. We get the m.l.e.
μ̂(4.47) = α̂ + β̂(4.47) = $40.94 per square meter,
which we note is much below the assessed value of $75 per square meter. However, one can object that there is uncertainty in μ̂(4.47), and that it would be better to give a CI. Using (6.5.5) and the fact that P(−2.048 ≤ t(28) ≤ 2.048) = .95, we get a .95 CI for μ(4.47) as
μ̂(4.47) ± 2.048 s √(1/30 + (4.47 − x̄)²/Sxx)
or $40.94 ± $26.54, or $14.40 ≤ μ(4.47) ≤ $67.50. Thus the assessed value of $75 is outside this range.
However (playing lawyer for the Assessor), we could raise another objection: since we are considering a single building (and not the average of all buildings) of size x = 4.47 (× 10⁵) m², we must recognize that Y has a non-negligible variance. This suggests that what we should do is predict the y value for a building with x = 4.47, instead of estimating μ(4.47). We will temporarily leave the example in order to develop a method to do this.
Suppose, then, that we want to predict the response Y for a new unit, with value x for its covariate. We can get a pivotal quantity that can be used to give a prediction interval (or interval "estimate") for Y, as follows.
Note that Y ~ G(μ(x), σ) and, from above, that μ̃(x) ~ G(μ(x), σ[1/n + (x − x̄)²/Sxx]^{1/2}). Also, Y is independent of μ̃(x), since it is not connected to the existing sample. Thus
Y − μ̃(x) ~ G(0, σ[1 + 1/n + (x − x̄)²/Sxx]^{1/2})
and
T = [Y − μ̃(x)] / (s[1 + 1/n + (x − x̄)²/Sxx]^{1/2}) ~ t(n − 2).   (6.5.6)
We can use the pivotal quantity T to get interval estimates for Y, since
P(a1 ≤ T ≤ a2) = P(μ̃(x) + a1 s√(1 + 1/n + (x − x̄)²/Sxx) ≤ Y ≤ μ̃(x) + a2 s√(1 + 1/n + (x − x̄)²/Sxx)).
We usually call these prediction intervals instead of confidence intervals, since Y isn’t a parameter
but a “future" observation.
or −6.30 ≤ Y ≤ 88.20 (dollars per square meter). The lower limit is negative, which is nonsensical. This happened because we're using a Gaussian model (in which Y can be positive or negative) in a setting where Y (price) must be positive. Nonetheless, the Gaussian model fits the data well, so we'll just truncate the PI and take it to be 0 ≤ Y ≤ $88.20.
Now we find that the assessed value of $75 is inside this interval! On this basis it's hard to say that the assessed value is unfair (though it is towards the high end of the PI).
Note also that the value x = 4.47 of interest is well outside the range of x-values (.20 - 3.26) in the data set of 30 buildings; look again at Figure 6.18. Thus any conclusions we reach are based on an assumption that the linear model μ(x) = α + βx applies beyond x = 3.26 and out to x = 4.47. This may not be true, but we have no way to check it with the data we have. Note also that there is a slight suggestion in Figure 6.18 that Var(Y | x) may be smaller for larger x-values. There are not sufficient data to check this either.
Remark: Note from (6.5.5) and (6.5.6) that CI's for μ(x) and PI's for Y are wider the further x is from x̄. Thus, as we move further away from the "middle" of the x's in the data, we get wider and wider intervals for μ(x) or Y.
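In R, both kinds of intervals are available from predict applied to an lm fit; here is a sketch (with made-up data, as in the earlier sketch, since the building data are not reproduced here):

x <- c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)     # made-up data
y <- c(520, 470, 410, 380, 330, 300)
fit <- lm(y ~ x)
new <- data.frame(x = 2.2)
predict(fit, new, interval = "confidence")  # CI for mu(2.2)
predict(fit, new, interval = "prediction")  # wider PI for a single new Y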
The fitted regression line y = α̂ + β̂x1 is shown on the scatter plot in Figure 6.21; the model appears to fit well.
More as a numerical illustration, let us get a CI for β, which represents the increase in average strength μ(x1) from increasing x1 = x² (the squared diameter) by 1 unit. Using the pivotal quantity (6.5.4) and the fact that P(−2.048 ≤ t(28) ≤ 2.048) = .95, we get the .95 CI
β̂ ± 2.048 s/√Sxx,  or  2.828 ± 0.223.
A .95 CI for the value of β is therefore (2.605, 3.051).
Exercise: This model could be used to predict the breaking strength of a new bolt of given diameter
x. Find a PI for a new bolt of diameter x = 0:35.
Figure 6.21: Scatter plot of strength y vs. x1 (diameter squared) for the steel bolts, with the fitted regression line.
6.6 Model Checking

The Gaussian response models used in this chapter rest on the assumed form for μi = g(xi) and on the assumption that Yi (given any covariates xi) is Gaussian with constant standard deviation σ.
Models should always be checked, and in this case there are several ways to do this. Some of these are based on what we term "residuals" of the fitted model: the residuals are the values
r̂i = yi − μ̂i,  i = 1, …, n.
For example, if Yi ~ G(α + βxi, σ), then r̂i = yi − α̂ − β̂xi. The R function lm produces these values as part of its output.
If Yi ~ G(μi, σ), then Ri = Yi − μi ~ G(0, σ). The idea behind the r̂i's is that they can be thought of as "observed" Ri's. This isn't exactly correct, since we are using μ̂i instead of μi in r̂i, but if the model is correct, then the r̂i's should behave roughly like a random sample from the distribution G(0, σ). Two standard residual plots are:
(1) Plot the points (xi, r̂i), i = 1, …, n. If the model is satisfactory, these should lie within a horizontal band around the line r̂i = 0.
(2) Plot the points (μ̂i, r̂i), i = 1, …, n. If the model is satisfactory, we should get the same type of pattern as for (1).
Departures from the "expected" pattern in (1) and (2) suggest problems with the model. For example, if in (2) we see that the variability in the r̂i's is bigger for larger values of μ̂i, this suggests that Var(Yi) = Var(Ri) is not constant, but may be larger when μ(x) is larger.
Figure 6.6.1 shows a couple of such patterns; the left-hand plot suggests non-constant variance, whereas the right-hand plot suggests that the function μi = g(xi) is not correctly specified.
In problems with only one x-variable, a plot of μ̂(x) superimposed on the scatter plot of the data (as in Figure 6.5.1) shows pretty clearly how well the model fits. The residual plots described are, however, very useful when there are two or more covariates in the model.
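A minimal R sketch of plots (1) and (2); the x-y data are made up purely for illustration.

x <- c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
y <- c(520, 470, 410, 380, 330, 300)
fit <- lm(y ~ x)
rhat <- resid(fit)                       # residuals yi - muhat_i
plot(x, rhat); abline(h = 0)             # plot (1): points (xi, rhat_i)
plot(fitted(fit), rhat); abline(h = 0)   # plot (2): points (muhat_i, rhat_i)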
When there are no covariates in the model, as in Section 6.2, plots (1) and (2) are undefined. In this case the only assumption is that Yi ~ G(μ, σ). We can still define residuals, either as
r̂i = yi − μ̂  or  r̂i* = (yi − μ̂)/σ̂,
where μ̂ = ȳ and σ̂ (we could alternatively use s) is the MLE of σ. One way to check the model is to treat the r̂i*'s (which are called standardized residuals) as a random sample of values (Y − μ)/σ. Since (Y − μ)/σ ~ G(0, 1) under our assumed model, we could plot the empirical c.d.f. (EDF) from the r̂i* (i = 1, …, n) and superimpose on it the G(0, 1) c.d.f. The two curves should agree well if the Gaussian model is satisfactory. This plot can also be used when there are covariates, by defining the standardized residuals
r̂i* = r̂i/σ̂ = (yi − μ̂i)/σ̂,  i = 1, …, n.
We can also use the r̂i*'s in place of the r̂i's in plots (1) and (2) above; in fact, that is what we did in Figure 6.6.1. When the r̂i*'s are used, the patterns in the plot are unchanged, but the r̂i* values tend to lie in the range (−3, 3). (Think about why this is so.)
r̂i = yi − α̂ − β̂x1i,  i = 1, …, 30,
for the model fitted in Example 6.5.3. Figure 6.23 shows a plot of the points (x1i, r̂i); no deviation from the expected pattern is observed. This is of course also evident from Figure 6.5.1.
A further check on the Gaussian distribution is shown in Figure 6.24. Here we have plotted the EDF based on the 30 standardized residuals
r̂i* = (yi − α̂ − β̂x1i)/σ̂.
On the same graph is the G(0, 1) c.d.f. There is good agreement between the two curves.
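A sketch of this EDF check in R; we standardize with s (summary(fit)$sigma) rather than the MLE σ̂, which makes little difference, and the data are made up, as above.

x <- c(0.5, 1.0, 1.5, 2.0, 2.5, 3.0)
y <- c(520, 470, 410, 380, 330, 300)
fit <- lm(y ~ x)
rstar <- resid(fit) / summary(fit)$sigma   # standardized residuals
plot(ecdf(rstar))                          # empirical c.d.f. of the rstar's
curve(pnorm(x), add = TRUE, lty = 2)       # G(0,1) c.d.f. for comparison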
Figure 6.6.1: Plots of standardized residuals illustrating departures from a fitted model: non-constant variance (left, residuals vs. μ̂) and an incorrectly specified μi = g(xi) (right, residuals vs. x).
6.7 Problems
1. Student's t Distribution
Suppose that Z and U are independent variates with
Z ~ N(0, 1),  U ~ χ²(ν),
and define X = Z/(U/ν)^{1/2}, as in (6.2.2). Then X has probability density function
f(x) = k (1 + x²/ν)^{−(ν+1)/2},  −∞ < x < ∞,
where k is a normalizing constant such that the total area under the pdf is 1:
k = Γ((ν+1)/2) / (√(νπ) Γ(ν/2)).
The pdf is symmetric about the origin, and is similar in shape to the pdf of N(0, 1) but has more probability in the tails. It can be shown that f(x) tends to the N(0, 1) pdf as ν → ∞.
Figure 6.23: Residuals r̂i = yi − μ̂i plotted against x1i for the steel bolt model.
2. (a) Show that the likelihood ratio statistic for testing H0: μ = μ0 is given by (assume σ is unknown)
Λ(μ0) = n log(1 + T²/(n − 1)),
where T = √n(Ȳ − μ0)/s, with s the sample standard deviation. (Note: The sample variance s² is defined as Σ(yi − ȳ)²/(n − 1).)
(b) Show that the likelihood ratio statistic for testing a value of σ is a function of
W = (n − 1)s²/σ².
Figure 6.24: Empirical distribution function of standardized residuals and the G(0, 1) c.d.f.
3. The following data are instrumental measurements of level of dioxin (in parts per billion) in 20
samples of a “standard” water solution known to contain 45 ppb dioxin.
44.1 46.0 46.6 41.3 44.8 47.8 44.5 45.1 42.9 44.5
42.5 41.5 39.6 42.0 45.8 48.9 46.6 42.9 47.0 43.7
(a) Assuming that the measurements are independent and N(μ, σ²), obtain a .95 confidence interval for μ and test the hypothesis that μ = 45.
(b) Obtain a .95 confidence interval for σ. Of what interest is this scientifically?
4. A new method gave the following ten measurements of the specific gravity of mercury:
13.696 13.699 13.683 13.692 13.705
13.695 13.697 13.688 13.690 13.707
Assume these to be independent observations from N(μ, σ²).
(a) An old method produced measurements with standard deviation σ = .02. Test the hypothesis that the new method has the same standard deviation as the old.
(b) A physical chemistry handbook lists the specific gravity of mercury as 13.75. Are the data
consistent with this value?
(c) Obtain 95% CI's for μ and σ.
5. Sixteen packages are randomly selected from the production of a detergent packaging machine.
Their weights (in grams) are as follows:
287 293 295 295 297 298 299 300
300 302 302 303 306 307 308 311
(a) Assuming that the weights are independent N(μ, σ²) random variables, obtain .95 confidence intervals for μ and σ.
(b) Let X̄ and s² = [1/(n−1)] Σ(xi − x̄)² be the mean and variance in a sample of size n, and let X represent the weight of a future, independent, randomly selected package. Show that X − X̄ ~ N(0, σ²(1 + 1/n)) and then that
Z = (X − X̄) / (s√(1 + 1/n)) ~ t(n−1).
For the data above, use this as a pivotal quantity to obtain a .95 "confidence" interval for X.
6. A manufacturer wishes to determine the mean breaking strength (force) of a type of string to "within a pound", which we interpret as requiring that the 95% confidence interval for μ should have length at most 2 pounds. If the breaking strengths Y of strings tested are G(μ, σ) and if 10 preliminary tests gave Σ(yi − ȳ)² = 80, how many additional measurements would you advise the manufacturer to take?
7. To compare the mathematical abilities of incoming first year students in Mathematics and Engi-
neering, 30 Math students and 30 Engineering students were selected randomly from their first
year classes and given a mathematics aptitude test. A summary of the resulting marks xi (for the
math students) and yi (for the engineering students), i = 1; : : : ; 30, is as follows:
Math students: n = 30, x̄ = 120, Σ(xi − x̄)² = 3050
Engineering students: n = 30, ȳ = 114, Σ(yi − ȳ)² = 2937
Obtain a .95 confidence interval for the difference in mean scores for first year Math and Engi-
neering students, and test the hypothesis that the difference is zero.
8. A study was done to compare the durability of diesel engine bearings made of two different
compounds. Ten bearings of each type were tested. The following table gives the “times” until
failure (in units of millions of cycles):
Type I Type II
3.03 3.19
5.53 4.26
5.60 4.47
9.30 4.53
9.92 4.67
12.51 4.69
12.95 12.78
15.21 6.79
16.04 9.37
16.84 12.75
(a) Assuming that Y, the number of million cycles to failure, has a normal distribution with the same variance for each type of bearing, obtain a .90 confidence interval for the difference in the means μ1 and μ2 of the two distributions.
(b) Test the hypothesis that μ1 = μ2.
(c) It has been suggested that log failure times are approximately normally distributed, but
not failure times. Assuming that the log Y ’s for the two types of bearing are normally
distributed with the same variance, test the hypothesis that the two distributions have the
same mean. How does the answer compare with that in part (b)?
(d) How might you check whether Y or log Y is closer to normally distributed?
(e) Give a plot of the data which could be used to describe the data and your analysis.
9. Fourteen welded girders were cyclically stressed at 1900 pounds per square inch and the numbers
of cycles to failure were observed. The sample mean and variance of the log failure "times" were ȳ = 14.564 and s² = 0.0914. Similar tests on four additional girders with repaired welds gave ȳ = 14.291 and s² = 0.0422. Log failure times are assumed to be independent with a G(μ, σ) distribution.
(a) Test the hypothesis that the variance of Y is the same for repaired welds as for the normal
welds.
(b) Assuming equal variances, obtain a 90% confidence interval for the difference in mean log
failure time.
(c) Note that μ1 − μ2 in part (b) is also the difference in median log failure times, and obtain a 90% confidence interval for the ratio
(median lifetime (cycles) for repaired welds) / (median lifetime (cycles) for normal welds).
11. Readings produced by a set of scales are independent and normally distributed about the true
weight of the item being measured. A study is carried out to assess whether the standard deviation
of the measurements varies according to the weight of the item.
(a) Ten weighings of a 10 kg. weight yielded y = 10:004 and s = 0:013 as the sample
mean and standard deviation. Ten weighings of a 40 kg. weight yielded y = 39:989
and s = 0:034. Is there any evidence of a difference in the standard deviations for the
measurements of the two weights?
(b) Suppose you had a further set of weighings of a 20 kg. item. How could you study the
question of interest further?
12. An experiment was conducted to compare gas mileages of cars using a synthetic oil and a con-
ventional oil. Eight cars were chosen as representative of the cars in general use. Each car was
run twice under as similar conditions as possible (same drivers, routes, etc.), once with the syn-
thetic oil and once with the conventional oil, the order of use of the two oils being randomized.
The average gas mileages were as follows:
Car 1 2 3 4 5 6 7 8
Synthetic oil 21.2 21.4 15.9 37.0 12.1 21.1 24.5 35.7
Conventional oil 18.0 20.6 14.2 37.8 10.6 18.5 25.9 34.7
(a) Obtain a .95 confidence interval for the difference in mean gas mileage, and state the as-
sumptions on which your analysis depends.
(b) Repeat (a) if the natural pairing of the data is (improperly) ignored.
(c) Why is it better to take pairs of measurements on eight cars rather than taking only one
measurement on each of 16 cars?
13. Consider the data in Problem 8 of Chapter 1 on the lengths of male and female coyotes.
(a) Fit separate Gaussian models for the lengths of males and females. Estimate the difference
in mean lengths for the two sexes.
(b) Estimate P (Y1 > Y2 ) (give the m.l.e.), where Y1 is the length of a randomly selected female
and Y2 is the length of a randomly selected male. Can you suggest how you might get a
confidence interval?
(c) Give separate CI’s for the average length of males and females.
14. Comparing sorting algorithms. Suppose you want to compare two algorithms A and B that will sort a set of numbers into an increasing sequence. (The R function sort(x) will, for example, sort the elements of the numeric vector x.)
To compare the speed of algorithms A and B, you decide to “present" A and B with random
permutations of n numbers, for several values of n. Explain exactly how you would set up such
a study, and discuss what pairing would mean in this context.
15. Sorting algorithms continued. Two sort algorithms as in the preceding question were each run on (the same) 20 sets of numbers (there were 500 numbers in each set). The times for the two algorithms to sort the sets are shown below.
Set: 1 2 3 4 5 6 7 8 9 10
A: 3.85 2.81 6.47 7.59 4.58 5.47 4.72 3.56 3.22 5.58
B: 2.66 2.98 5.35 6.43 4.28 5.06 4.36 3.91 3.28 5.19
Set: 11 12 13 14 15 16 17 18 19 20
A: 4.58 5.46 3.31 4.33 4.26 6.29 5.04 5.08 5.08 3.47
B: 4.05 4.78 3.77 3.81 3.17 6.02 4.84 4.81 4.34 3.48
using this approach. (It is also possible to get a CI using the Gaussian model.)
16. Least squares estimation. Suppose you have a model where the mean of the response variable Yi, given the covariates xi, has the form
μi = E(Yi | xi) = g(xi; β),   (6.7.1)
where β is a vector of unknown parameters. Then the least squares (LS) estimate of β based on data (xi, yi), i = 1, …, n, is the value that minimizes the objective function
S(β) = Σ_{i=1}^n [yi − g(xi; β)]².
Show that the LS estimate β̂ of β is the same as the MLE of β in the Gaussian model Yi ~ G(μi, σ), when μi is of the form (6.7.1).
17. To assess the effect of a low dose of alcohol on reaction time, a sample of 24 student volunteers
took part in a study. Twelve of the students (randomly chosen from the 24) were given a fixed
dose of alcohol (adjusted for body weight) and the other twelve got a nonalcoholic drink which
looked and tasted the same as the alcoholic drink. Each student was then tested using software
that flashes a coloured rectangle randomly placed on a screen; the student has to move the cursor
into the rectangle and double click the mouse. As soon as the double click occurs, the process
is repeated, up to a total of 20 times. The response variate is the total reaction time (i.e. time to
complete the experiment) over the 20 trials.
The data on the times are shown below for the 24 students.
"Alcohol" Group: 1.33, 1.55, 1.43, 1.35, 1.17, 1.35, 1.17, 1.80, 1.68, 1.19, 0.96, 1.46 (ȳ = 1.370, s = 0.235)
"Non-Alcohol" Group: 1.68, 1.30, 1.85, 1.64, 1.62, 1.69, 1.57, 1.82, 1.41, 1.78, 1.40, 1.43 (ȳ = 1.599, s = 0.180)
Analyze the data with the objective of seeing when there is any evidence that the dose of alcohol
increases reaction time. Justify any models that you use.
18. There are often both expensive (and highly accurate) and cheaper (and less accurate) ways of
measuring concentrations of various substances (e.g. glucose in human blood, salt in a can of
soup). The table below gives the actual concentration x (determined by an expensive but very
accurate procedure) and the measured concentration y obtained by a cheap procedure, for each
of 10 units.
x: 4.01 8.12 12.53 15.90 20.24 24.81 30.92 37.26 38.94 40.15
y: 3.70 7.80 12.40 16.00 19.90 24.90 30.80 37.20 38.40 39.40
(a) Fit a Gaussian linear regression model for Y given x to the data and obtain .95 confidence
intervals for the slope and standard deviation . Use a plot to check the adequacy of the
model.
(b) Describe briefly how you would characterize the cheap measurement process’s accuracy to
a lay person.
(c) Assuming that the units being measured have true concentrations in the range 0-40, do you
think that the cheap method tends to produce a value that is lower than the true concentra-
tion? Support your answer with an argument based on the data.
19. The following data, collected by Dr. Joseph Hooker in the Himalaya mountains, relates at-
mospheric pressure to the boiling point of water. Theory suggests that a graph of log pressure vs.
boiling point should give a straight line.
(a) Prepare a scatterplot of y = log(Pressure) vs. x =Temperature. Do the same for y =Pressure
vs. x. Which is better described by a linear model? Does this confirm the theory’s model?
(b) Fit a normal linear regression model for y = log(Pressure) vs. x. Are there any obvious
difficulties with the model?
(c) Obtain a .95 confidence interval for the atmospheric pressure if the boiling point of water is 195°F.
20. Consider the data in Problem 9 of Chapter 1, in which the variable y was the average time to
complete tasks by computer users, and x was the number of users on the system. Fit a regression
model, using x as the explanatory variable. Give a .95 confidence interval for the mean of Y
when there are 50 users on the system.
21. (a) For the steel bolt experiment in Examples 6.1.3 and 6.5.2, use a Gaussian model to
(i) estimate the average breaking strength of bolts of diameter 0.35
(ii) estimate (predict) the breaking strength of a single bolt of diameter 0.35.
Give interval estimates in each case.
(b) Suppose that a bolt of diameter 0.35 is exposed to a large force V that could potentially
break it. In structural reliability and safety calculations, V is treated as a r.v. and if Y
represents the breaking strength of the bolt (or some other part of a structure), then the
probability of a "failure" of the bolt is P(V > Y). Give a point estimate of this value if V ~ G(1.60, 0.10), where V and Y are independent.
22. Optimal Prediction. In many settings we want to use covariates x to predict a future value Y. (For example, we use economic factors x to predict the price Y of a commodity a month from now.) The value Y is random, but suppose we know μ(x) = E(Y | x) and σ(x)² = Var(Y | x).
(a) Predictions take the form Ŷ = g(x), where g(·) is our "prediction" function. Show that E[(Ŷ − Y)²] is minimized by choosing g(x) = μ(x).
(b) Show that the minimum achievable value of E[(Ŷ − Y)²], that is, its value when g(x) = μ(x), is σ(x)².
This shows that if we can determine or estimate μ(x), then "optimal" prediction (in terms of Euclidean distance) is possible. Part (b) shows that we should try to find covariates x for which σ(x)² = Var(Y | x) is as small as possible.
(c) What happens when σ(x)² is close to zero? (Explain this in ordinary English.)
TESTS AND INFERENCE PROBLEMS
BASED ON MULTINOMIAL MODELS
7.1 Introduction
Many important hypothesis testing problems can be addressed using multinomial models. An example was given in Chapter 5, whose general ideas we will use here. To start, recall the setting in Example (d) of Chapter 5, Section 2, where data were assumed to arise from a multinomial distribution with probability function
f(y1, …, ym; p1, …, pm) = [n!/(y1! ⋯ ym!)] p1^{y1} ⋯ pm^{ym}   (7.1.1)
where 0 ≤ yj ≤ n and Σ yj = n. The multinomial probabilities pj satisfy 0 ≤ pj ≤ 1 and Σ pj = 1, and we define θ = (p1, …, pm). Suppose now that we wish to test the hypothesis
H0: pj = pj(α),  j = 1, …, m   (7.1.2)
where the pj(α)'s are functions of a lower-dimensional parameter α. Let Ω be the parameter space for θ. It was shown earlier that L(θ) is maximized over Ω by the vector θ̂ with p̂j = yj/n (j = 1, …, m). A likelihood ratio test of the hypothesis (7.1.2) is based on the likelihood ratio statistic
Λ = 2ℓ(θ̂) − 2ℓ(θ̂0) = −2 log{L(θ̂0)/L(θ̂)},   (7.1.3)
where θ̂0 maximizes L(θ) under the hypothesis (7.1.2), which restricts θ to lie in a space Ω0 of dimension p. If H0 is true (that is, if θ really lies in Ω0), then the distribution of Λ is approximately χ²(k − p), where k = m − 1 is the dimension of Ω.   (7.1.4)
This approximation is very accurate when n is large and none of the pj ’s is too small; when the ej ’s
below all exceed 5 it is accurate enough for testing purposes.
The test statistic (7.1.4) can be written in a simple form. Let θ̂0 = (p̃1, …, p̃m) = (p1(α̃), …, pm(α̃)) denote the m.l.e. of θ under the hypothesis (7.1.2). Then, by (7.1.3), we get
Λ = 2ℓ(θ̂) − 2ℓ(θ̂0) = 2 Σ_{j=1}^m yj log(p̂j/p̃j).
Defining the expected frequencies
ej = n p̃j,  j = 1, …, m,
and noting that yj = n p̂j, we can rewrite Λ as
Λ = 2 Σ_{j=1}^m yj log(yj/ej).   (7.1.6)
j=1
An alternative test statistic, developed historically before Λ, is the "Pearson" statistic
D = Σ_{j=1}^m (yj − ej)²/ej.   (7.1.7)
This has similar properties to Λ; for example, both equal 0 when yj = ej for all j = 1, …, m, and both are larger when the yj's and ej's differ greatly. It turns out that, like Λ, the statistic D also has a limiting χ²(k − p) distribution, with k = m − 1, when H0 is true.
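Both statistics are one-liners in R; the following sketch defines a helper (the function name is ours, and the usage values are made up) that returns Λ and D from observed frequencies y and expected frequencies e computed under H0. It assumes all yj > 0.

multinomial.stats <- function(y, e) {
  c(LR = 2 * sum(y * log(y / e)), Pearson = sum((y - e)^2 / e))
}
multinomial.stats(y = c(30, 50, 20), e = c(25, 50, 25))  # made-up illustration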
The remainder of this chapter applies the general methods above to some important testing problems.

7.2 Goodness of Fit Tests
Example 7.2.1. Recall Example 2.5.2, where people in a population are classified as being one of three blood types MM, MN, NN. The proportions of the population that are these three types are $p_1$, $p_2$, $p_3$ respectively, with $p_1 + p_2 + p_3 = 1$. Genetic theory indicates, however, that the $p_j$'s can be expressed in terms of a single parameter $\theta$, as
$$p_1 = \theta^2, \quad p_2 = 2\theta(1 - \theta), \quad p_3 = (1 - \theta)^2. \qquad (7.2.1)$$
Data collected on 100 persons gave $y_1 = 17$, $y_2 = 46$, $y_3 = 37$, and we can use this to test the hypothesis $H_0$ that (7.2.1) is correct. (Note that $(Y_1, Y_2, Y_3) \sim \text{Mult}(n = 100;\ p_1, p_2, p_3)$.) The likelihood ratio test statistic is (7.1.6), but we have to find $\tilde{\theta}$ and then the $e_j$'s. The likelihood function under (7.2.1) is
$$L_1(\theta) = L(p_1(\theta), p_2(\theta), p_3(\theta)) = (\theta^2)^{17} [2\theta(1 - \theta)]^{46} [(1 - \theta)^2]^{37} = c\, \theta^{80} (1 - \theta)^{120}$$
and we easily find that $\tilde{\theta} = 0.40$. The expected frequencies are therefore $e_1 = 100\tilde{\theta}^2 = 16$, $e_2 = 100[2\tilde{\theta}(1 - \tilde{\theta})] = 48$, $e_3 = 100[(1 - \tilde{\theta})^2] = 36$. Clearly these are close to the observed frequencies $y_1$, $y_2$, $y_3$, and (7.1.6) gives the observed value $\Lambda_{obs} = 0.17$. The significance level is
$$P(\chi^2_{(1)} \ge 0.17) = 0.68.$$
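The calculation in this example can be checked in base R; the closed form for $\tilde{\theta}$ below comes from maximizing $c\,\theta^{80}(1-\theta)^{120}$:

# Example 7.2.1 numerically: estimate theta, form expected frequencies,
# and compute the likelihood ratio statistic and significance level.
y <- c(17, 46, 37)
n <- sum(y)
theta <- (2 * y[1] + y[2]) / (2 * n)                         # = 0.40
e <- n * c(theta^2, 2 * theta * (1 - theta), (1 - theta)^2)  # 16, 48, 36
lambda <- 2 * sum(y * log(y / e))                            # about 0.17
1 - pchisq(lambda, df = 1)                                   # about 0.68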
Example 7.2.2. Continuous distributions can also be tested by grouping the data into intervals
and then using the multinomial model. Example 2.5.1 previously did this in an informal way for an
exponential distribution. For example, suppose that T is thought to have an exponential distribution
with probability density function
$$f(t; \theta) = \frac{1}{\theta} e^{-t/\theta}, \quad t > 0. \qquad (7.2.2)$$
Suppose a random sample $t_1, \dots, t_{100}$ is collected and the objective is to test the hypothesis $H_0$ that (7.2.2) is correct. To do this we partition the range of T into intervals $I_j = (a_j, b_j)$, $j = 1, \dots, m$, and count the number of observations $y_j$ that fall into each interval. Under (7.2.2), the probability that an observation lies in the j'th interval is
$$p_j(\theta) = \int_{a_j}^{b_j} f(t; \theta)\, dt, \quad j = 1, \dots, m \qquad (7.2.3)$$
and if $y_j$ is the number of observations (t's) that lie in $I_j$, then $Y_1, \dots, Y_m$ follow a multinomial distribution with $n = 100$. Thus we can test (7.2.2) by testing that the multinomial probabilities have the form (7.2.3).
Consider the following data, which have been divided into m = 7 intervals:
[Table: observed frequencies $y_j$ for the seven intervals, with the expected frequencies $e_j$ alongside.]
We have also shown expected frequencies $e_j$, calculated as follows. The distribution of $(Y_1, \dots, Y_7)$ is multinomial with probabilities given by (7.2.3) when the model (7.2.2) is correct. In particular,
$$p_1(\theta) = \int_0^{100} \frac{1}{\theta} e^{-t/\theta}\, dt = 1 - e^{-100/\theta},$$
and so on. Expressions for $p_2, \dots, p_7$ are $p_2(\theta) = e^{-100/\theta} - e^{-200/\theta}$, $p_3(\theta) = e^{-200/\theta} - e^{-300/\theta}$, $p_4(\theta) = e^{-300/\theta} - e^{-400/\theta}$, $p_5(\theta) = e^{-400/\theta} - e^{-600/\theta}$, $p_6(\theta) = e^{-600/\theta} - e^{-800/\theta}$, $p_7(\theta) = e^{-800/\theta}$.
The likelihood function from $y_1, \dots, y_7$ based on model (7.2.3) is then
$$L_1(\theta) = \prod_{j=1}^{7} p_j(\theta)^{y_j}.$$
This is maximized numerically to give $\tilde{\theta}$, from which the expected frequencies $e_j = 100\, p_j(\tilde{\theta})$ shown in the table were computed. The observed value of the likelihood ratio statistic (7.1.6) is $\Lambda_{obs} = 1.91$, and the significance level is
$$P(\Lambda \ge \Lambda_{obs}; H_0) = P(\chi^2_{(5)} \ge 1.91) = 0.86,$$
so there is no evidence against the model (7.2.2). Note that the number of $\chi^2$ degrees of freedom is 5 because $k = m - 1 = 6$ and $p = \dim(\theta) = 1$.
The goodness of fit test just given has some arbitrary elements, since we could have used different intervals and a different number of intervals. Theory and guidelines as to how best to choose the intervals can be developed, but we won't consider this here. A rough guideline for our purposes is to choose 4 to 10 intervals, arranged so that the expected frequencies are all at least 5.
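To make the procedure concrete, here is a sketch in R under stated assumptions: the sample is simulated (the 100 observations of Example 7.2.2 are not reproduced in these notes), and the interval endpoints follow the example.

# Grouped goodness of fit test for an exponential model, as in Example 7.2.2.
set.seed(1)
t <- rexp(100, rate = 1/300)                 # hypothetical sample, theta = 300
breaks <- c(0, 100, 200, 300, 400, 600, 800, Inf)
y <- as.vector(table(cut(t, breaks)))        # observed frequencies y_1, ..., y_7
neg.loglik <- function(theta) {
  p <- diff(pexp(breaks, rate = 1/theta))    # p_j(theta) from (7.2.3)
  -sum(y * log(p))
}
theta.tilde <- optimize(neg.loglik, c(50, 2000))$minimum
e <- 100 * diff(pexp(breaks, rate = 1/theta.tilde))
lambda <- 2 * sum(ifelse(y > 0, y * log(y / e), 0))
1 - pchisq(lambda, df = 7 - 1 - 1)           # k - p = 6 - 1 = 5 degrees of freedom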
7.3 Two-Way Tables and Testing for Independence of Two Variables

A question that arises frequently is whether two variates are statistically independent. We consider this in the case where both variables are discrete, and take on a fairly small number of possible values.
This turns out to cover a great many important settings.
Two types of studies give rise to data that can be used to test independence, and in both cases the
data can be arranged as frequencies in a two-way table. These tables are sometimes called "contingency" tables in the statistics literature. We'll consider the two types of studies in turn.
Suppose first that a random sample of n individuals is selected from a population, and that each individual is cross-classified by two variates: an "A-type" classification with a possible classes $A_1, \dots, A_a$, and a "B-type" classification with b possible classes $B_1, \dots, B_b$. If $p_{ij} = P(A_i B_j)$ is the probability that a randomly selected individual is of types $A_i$ and $B_j$, the observed frequencies $y_{ij}$ follow a multinomial distribution, and the two classifications are independent if
$$H_0: p_{ij} = u_i v_j, \quad i = 1, \dots, a;\ j = 1, \dots, b \qquad (7.3.1)$$
where $u_i = P(A_i)$ and $v_j = P(B_j)$. Under (7.3.1) the m.l.e.'s are $\tilde{u}_i = y_{i+}/n$ and $\tilde{v}_j = y_{+j}/n$, giving expected frequencies
$$e_{ij} = n \tilde{u}_i \tilde{v}_j = \frac{y_{i+}\, y_{+j}}{n} \qquad (7.3.3)$$
where $y_{i+} = \sum_{j=1}^{b} y_{ij}$ and $y_{+j} = \sum_{i=1}^{a} y_{ij}$. The likelihood ratio statistic (7.1.6) for testing the hypothesis (7.3.1) is then
$$\Lambda = 2 \sum_{i=1}^{a} \sum_{j=1}^{b} y_{ij} \log(y_{ij} / e_{ij}). \qquad (7.3.4)$$
The significance level is computed from the approximation $P(\chi^2_{(a-1)(b-1)} \ge \Lambda_{obs})$; the $\chi^2$ degrees of freedom $(a-1)(b-1)$ come from $k - p = (ab - 1) - (a + b - 2) = (a - 1)(b - 1)$.
Example 7.3.1. Human blood is classified according to several systems. Two are the OAB system
and the Rh system. In the former a person is one of 4 types O, A, B, AB and in the latter a person is Rh+ or Rh−. A random sample of 300 persons produced the observed frequencies in the following
table. Expected frequencies, computed below, are in brackets after each observed frequency.
              O          A          B          AB      Total
Rh+       82 (77.3)  89 (94.4)  54 (49.6)  19 (22.8)    244
Rh−       13 (17.7)  27 (21.6)   7 (11.4)   9 (5.2)      56
Total         95        116         61         28       300
It is of interest to see whether these two classification systems are genetically independent. The row
and column totals in the table are also shown, since they are the values yi+ and y+j needed to compute
the eij ’s in (7.3.3). In this case we can think of the Rh types as the A-type classification and the OAB
types as the B-type classification in the general theory above. Thus a = 2, b = 4 and the 2 degrees of
freedom are (a 1)(b 1) = 3.
To carry out the test that a person’s Rh and OAB blood types are statistically independent, we
merely need to compute the eij ’s by (7.3.3). This gives, for example,
$$e_{11} = \frac{(244)(95)}{300} = 77.3, \qquad e_{12} = \frac{(244)(116)}{300} = 94.4$$
and, similarly, $e_{13} = 49.6$, $e_{14} = 22.8$, $e_{21} = 17.7$, $e_{22} = 21.6$, $e_{23} = 11.4$, $e_{24} = 5.2$.
It may be noted that $e_{i+} = y_{i+}$ and $e_{+j} = y_{+j}$, so it is necessary to compute only $(a-1)(b-1)$ of the $e_{ij}$'s via (7.3.3); the remainder can be obtained by subtraction from row and column totals. For example, if we compute $e_{11}$, $e_{12}$, $e_{13}$ here then $e_{21} = 95 - e_{11}$, $e_{22} = 116 - e_{12}$, and so on. (This is no great saving when using a computer, where it is simpler just to use (7.3.3). However, it suggests where the term "degrees of freedom" comes from.)
The observed value of the likelihood ratio test statistic (7.3.4) is $\Lambda_{obs} = 8.52$, and the significance level is approximately $P(\chi^2_{(3)} \ge 8.52) = 0.036$, so there is some degree of evidence against the hypothesis of independence. Note that by comparing the $e_{ij}$'s and the $y_{ij}$'s we get some idea about the
lack of independence, or relationship, between the two classifications. We see here that the degree of
dependence does not appear large.
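This test can also be checked in R. The sketch below uses the table of Example 7.3.1, with the Rh− row being the one obtained above by subtraction from the row and column totals:

# Likelihood ratio and Pearson tests of independence for the blood-type table.
blood <- matrix(c(82, 89, 54, 19,
                  13, 27,  7,  9), nrow = 2, byrow = TRUE,
                dimnames = list(c("Rh+", "Rh-"), c("O", "A", "B", "AB")))
e <- outer(rowSums(blood), colSums(blood)) / sum(blood)   # (7.3.3)
lambda <- 2 * sum(blood * log(blood / e))                 # about 8.5
1 - pchisq(lambda, df = (2 - 1) * (4 - 1))                # about 0.036
chisq.test(blood)                                          # the Pearson version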
A similar problem arises when individuals in a population can be one of b types B1 ; :::; Bb , but where
the population is sub-divided into a groups A1 ; :::; Aa . In this case, we might be interested in whether
the proportions of individuals of types B1 ; :::; Bb are the same for each group. This is essentially
the same as the question of independence in the preceding section: we want to know whether the
probability pij that a person in population group i is B-type Bj is the same for all i = 1; :::; a. That is,
$p_{ij} = P(B_j | A_i)$, and we want to know whether this depends on $A_i$ or not.
Although the framework is superficially the same as in the preceding discussion, the details are a little different. In particular, the probabilities satisfy $\sum_{j=1}^{b} p_{ij} = 1$ for each $i$, and the hypothesis of interest is
$$H_0: \mathbf{p}_1 = \mathbf{p}_2 = \cdots = \mathbf{p}_a, \qquad (7.3.6)$$
where $\mathbf{p}_i = (p_{i1}, p_{i2}, \dots, p_{ib})$. Furthermore, the data in this case arise by selecting specified numbers of individuals $n_i$ from groups $i = 1, \dots, a$, and so there are actually $a$ multinomial distributions, $\text{Mult}(n_i;\ p_{i1}, \dots, p_{ib})$.
If we denote the observed frequency of $B_j$-type individuals in the sample from the i'th group as $y_{ij}$ (where $y_{i1} + \cdots + y_{ib} = n_i$), then it can be shown that the likelihood ratio statistic for testing (7.3.6) is exactly the same as (7.3.4), where now the expected frequencies $e_{ij}$ are given by
$$e_{ij} = n_i\, \frac{y_{+j}}{n}, \quad i = 1, \dots, a;\ j = 1, \dots, b \qquad (7.3.7)$$
where $n = n_1 + \cdots + n_a$. Since $n_i = y_{i+}$, the expected frequencies have exactly the same form as in the preceding case when we lay out the data in a two-way table with a rows and b columns.
Example 7.3.2. The study in Example 7.3.1 could have been conducted differently, by selecting a fixed number of Rh+ persons and a fixed number of Rh− persons, and then determining their OAB blood type. Then the proper framework would be to test that the probabilities for the 4 types O, A, B, AB were the same for Rh+ and for Rh− persons, and so the methods of the present section apply. This gives exactly the same testing procedure as a study where the numbers of Rh+ and Rh− persons in the sample are random, as discussed above.
Example 7.3.3. In a randomized clinical trial to assess the effectiveness of a small daily dose of
Aspirin in preventing strokes among high-risk persons, a group of patients were randomly assigned
to get either Aspirin or a placebo. They were then followed for 3 years, and it was determined for
each person whether they had a stroke during that period or not. The data were as follows (expected
frequencies are also given in brackets).
                 Stroke       No Stroke     Total
Aspirin Group   64 (75.6)    176 (164.4)     240
Placebo Group   86 (74.4)    150 (161.6)     236
Total              150           326         476
We can think of the persons receiving Aspirin and those receiving Placebo as two groups, and test the
hypothesis
H0 : p11 = p21 ;
where p11 = P (Stroke) for a person in the Aspirin group and p21 = P (Stroke) for a person in the
Placebo group. The test statistic (7.3.4) requires the expected frequencies, which are
$$e_{ij} = \frac{(y_{i+})(y_{+j})}{476}, \quad i = 1, 2;\ j = 1, 2.$$
This gives the values shown in the table. The test statistic then has observed value
$$\Lambda = 2 \sum_{i=1}^{2} \sum_{j=1}^{2} y_{ij} \log(y_{ij} / e_{ij}) = 5.25,$$
and the significance level is approximately $P(\chi^2_{(1)} \ge 5.25) = 0.022$, so there is fairly strong evidence against $H_0$. A look at the $y_{ij}$'s and the $e_{ij}$'s indicates that persons receiving Aspirin have had fewer strokes than expected under $H_0$, suggesting that $p_{11} < p_{21}$.
This test can be followed up with estimates for p11 and p21 . Because each row of the table follows
a binomial distribution, we have
$$\hat{p}_{11} = \frac{y_{11}}{n_1} = \frac{64}{240} = 0.267, \qquad \hat{p}_{21} = \frac{y_{21}}{n_2} = \frac{86}{236} = 0.364.$$
We can also give confidence intervals for $p_{11}$ and $p_{21}$; approximate .95 confidence intervals based on earlier methods are $0.211 \le p_{11} \le 0.323$ and $0.303 \le p_{21} \le 0.425$. Confidence intervals for $\delta = p_{11} - p_{21}$ can also be obtained from the approximate $G(0, 1)$ pivotal quantity
$$Z = \frac{(\hat{p}_{11} - \hat{p}_{21}) - \delta}{\sqrt{\hat{p}_{11}(1 - \hat{p}_{11})/n_1 + \hat{p}_{21}(1 - \hat{p}_{21})/n_2}}.$$
Remark: This and other tests involving binomial probabilities and contingency tables can be carried out using the R function prop.test.
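For instance, applying prop.test to the aspirin trial of Example 7.3.3, and computing the observed value of the pivotal quantity Z directly:

# Example 7.3.3 via prop.test, plus the pivotal Z computed by hand.
prop.test(x = c(64, 86), n = c(240, 236))   # compares p11 and p21
p1 <- 64 / 240; p2 <- 86 / 236
z <- (p1 - p2) / sqrt(p1 * (1 - p1) / 240 + p2 * (1 - p2) / 236)
z                                           # about -2.30
2 * pnorm(z)                                # two-sided significance level, about 0.02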
7.4 Problems
1. To investigate the effectiveness of a rust-proofing procedure, 50 cars that had been rust-proofed
and 50 cars that had not were examined for rust five years after purchase. For each car it was
noted whether rust was present (actually defined as having moderate or heavy rust) or absent
(light or no rust). The data are as follows:
                     Cars            Cars Not
                     Rust-Proofed    Rust-Proofed
Rust present             14              28
Rust absent              36              22
Total                    50              50
(a) Test the hypothesis that the probability of rust occurring is the same for the rust-proofed
cars as for those not rust-proofed. What do you conclude?
(b) Do you have any concerns about inferring that the rust-proofing prevents rust? How might
a better study be designed?
3. Mass-produced items are packed in cartons of 12 as they come off an assembly line. The items
from 250 cartons are inspected for defects, with the following results:
Number defective: 0 1 2 3 4 5 6
Frequency observed: 103 80 31 19 11 5 1
Test the hypothesis that the number of defective items Y in a single carton has a binomial distribution Bin(12, p). Why might the binomial not be a suitable model?
4. The numbers of service interruptions in a communications system over 200 separate weekdays
is summarized in the following frequency table:
Number of interruptions: 0 1 2 3 4 5
Frequency observed: 64 71 42 18 4 1
Test whether a Poisson model for the number of interruptions Y on a single day is consistent
with these data.
5. The table below records data on 292 litters of mice classified according to litter size and number
of females in the litter.
                              Number of females
                      0     1     2     3     4    Total # of litters
  Litter size   1     8    12                            20
                2    23    44    13                      80
                3    10    25    48    13                96
                4     5    30    34    22     5          96
(a) For litters of size n (n = 1; 2; 3; 4) assume that the number of females in a litter follows
a binomial distribution with parameters n and pn = P (female). Test the binomial model
separately for each of the litter sizes n = 2; n = 3 and n = 4. (Why is it of scientific
interest to do this?)
(b) Assuming that the binomial model is appropriate for each litter size, test the hypothesis that
p1 = p2 = p3 = p4 .
6. A long sequence of digits (0; 1; : : : ; 9) produced by a pseudo random number generator was
examined. There were 51 zeros in the sequence, and for each successive pair of zeros, the
number of (non-zero) digits between them was counted. The results were as follows:
1 1 6 8 10 22 12 15 0 0
2 26 1 20 4 2 0 10 4 19
2 3 0 5 2 8 1 6 14 2
2 2 21 4 3 0 0 7 2 4
4 7 16 18 2 13 22 7 3 5
Give an appropriate probability model for the number of digits between two successive zeros,
if the pseudo random number generator is truly producing digits for which P (any digit = j) =
$0.1$ $(j = 0, 1, \dots, 9)$, independent of any other digit. Construct a frequency table and test the fit of your model.
7. 1398 school children with tonsils present were classified according to tonsil size and absence or presence of the carrier for Streptococcus pyogenes. The results were as follows:
Normal Enlarged Much enlarged
Carrier present 19 29 24
Carrier absent 497 560 269
Is there evidence of an association between the two classifications?
8. The following data on heights of 210 married couples were presented by Yule in 1900.
Tall wife Medium wife Short wife
Tall husband 18 28 19
Medium husband 20 51 28
Short husband 12 25 9
Test the hypothesis that the heights of husbands and wives are independent.
9. In the following table, 64 sets of triplets are classified according to the age of their mother at
their birth and their sex distribution:
                     3 boys   2 boys, 1 girl   1 boy, 2 girls   3 girls   Total
Mother under 30         5            8                9             7       29
Mother over 30          6           10               13             6       35
Total                  11           18               22            13       64
(a) Is there any evidence of an association between the sex distribution and the age of the
mother?
(b) Suppose that the probability of a male birth is 0.5, and that the sexes of triplets are determined independently. Find the probability that there are x boys in a set of triplets (x = 0, 1, 2, 3), and test whether the column totals are consistent with this distribution.
10. A study was undertaken to determine whether there is an association between the birth weights
of infants and the smoking habits of their parents. Out of 50 infants of above average weight,
9 had parents who both smoked, 6 had mothers who smoked but fathers who did not, 12 had
fathers who smoked but mothers who did not, and 23 had parents of whom neither smoked. The
corresponding results for 50 infants of below average weight were 21, 10, 6, and 13, respectively.
(a) Test whether these results are consistent with the hypothesis that birth weight is independent
of parental smoking habits.
(b) Are these data consistent with the hypothesis that, given the smoking habits of the mother,
the smoking habits of the father are not related to birth weight?
CAUSE AND EFFECT
8.1 Introduction
As mentioned in Chapters 1 and 3, many studies are carried out with causal objectives in mind. That
is, we would like to be able to establish or investigate a possible cause and effect relationship between
variables X and Y .
We use the word “causes" often; for example we might say that “gravity causes dropped objects to
fall to the ground", or that “smoking causes lung cancer". The concept of causation (as in “X causes
Y ") is nevertheless hard to define. One reason is that the “strengths" of causal relationships vary a lot.
For example, on earth gravity may always lead to a dropped object falling to the ground; however, not
everyone who smokes gets lung cancer.
Idealized definitions of causation are often of the following form. Let y be a response variate
associated with units in a population or process, and let x be an explanatory variate associated with
some factor that may affect y. Then, if all other factors that affect y are held constant, let us change
x (or observe different values of x) and see if y changes. If it does we say that x has a causal effect
on y.
In fact, this definition is not broad enough, because in many settings a change in x may only lead
to a change in y in some probabilistic sense. For example, giving an individual person at risk of stroke
a small daily dose of aspirin instead of a placebo may not necessarily lower their risk. (Not everyone
is helped by this medication.) However, on average the effect is to lower the risk of stroke. One way
to measure this is by looking at the probability a randomly selected person has a stroke (say within 3
years) if they are given aspirin versus if they are not.
Therefore, a better idealized definition of causation is to say that changing x should result in a
change in some attribute of the random variable Y (for example, its mean or some probability such as
P (Y > 0)). Thus we revise the definition above to say:
if all other factors that affect Y are held constant, let us change x (or observe different values
of x) and see if some specified attribute of Y changes. If it does we say x has a causal effect on Y .
These definitions are unfortunately unusable in most settings since we cannot hold all other factors
that affect y constant; often we don’t even know what all the variables are. However, the definition
serves as a useful ideal for how we should carry out studies in order to show that a causal relationship
exists. What we do is try to design our studies so that alternative (to the variate x) explanations of what
causes changes in attributes of y can be ruled out, leaving x as the causal agent. This is much easier
to do in experimental studies, where explanatory variables may be controlled, than in observational
studies. The following are brief examples.
Example 8.1.1. Recall Example 6.1.3 concerning the (breaking) strength y of a steel bolt and the
diameter x of the bolt. It is clear that bolts with larger diameters tend to have higher strength, and it
seems clear on physical and theoretical grounds that increasing the diameter “causes" an increase in
strength. This can be investigated in experimental studies like that in Example 6.1.3, when random
samples of bolts of different diameters are tested, and their strengths y determined.
Clearly, the value of x does not determine y exactly (different bolts with the same diameter don’t
have the same strength), but we can consider attributes such as the average value of y. In the experiment
we can hold other factors more or less constant (e.g. the ambient temperature, the way the force is applied, the metallurgical properties of the bolts), so we feel that the observed larger average values of y for bolts of larger diameter x are due to a causal relationship.
Note that even here we have to depart slightly from the idealized definition of cause and effect. In
particular, a bolt cannot have its diameter x changed, so that we can see if y changes. All we can do
is consider two bolts that are as similar as possible, and are subject to the same explanatory variables
(aside from diameter). This difficulty arises in many experimental studies.
Example 8.1.2. Suppose that data had been collected on 10,000 persons aged 40 to 80 who had smoked for at least 20 years, and on 10,000 persons in the same age range who had not; there is roughly the same distribution of ages in the two groups. The (hypothetical) data concerning the numbers with lung cancer are as follows:
[Table: numbers of persons with and without lung cancer, for smokers and non-smokers.]
There are many more lung cancer cases among the smokers, but without further information or assumptions we cannot conclude that a causal relationship (smoking causes lung cancer) exists. Alternative explanations might explain some or all of the observed difference. (This is an observational
study and other possible explanatory variables are not controlled.) For example, family history is an
important factor in many cancers; maybe smoking is also related to family history. Moreover, smoking
tends to be connected with other factors such as diet and alcohol consumption; these may explain some
of the effect seen.
The last example illustrates that association (statistical dependence) between two variables X and Y does not imply that a causal relationship exists. Suppose for example that we observe a positive correlation between X and Y: higher values of X tend to go with higher values of Y in a unit. Then there are at least three "explanations": (i) X causes Y (meaning X has a causative effect on Y), (ii) Y causes X, and (iii) some other factor(s) Z cause both X and Y.
We’ll now consider the question of cause and effect in experimental and observational studies in a
little more detail.
8.2 Experimental Studies

Consider again the clinical trial of Example 7.3.3, in which persons at high risk of stroke were randomly assigned to receive either a daily dose of aspirin or a placebo. If the stroke rates in the two groups differ, we may attribute the difference to the causative effect of the aspirin. Here's how we rule out alternative explanations: suppose you claim that it is not the aspirin but dietary factors and blood pressure that cause this observed effect. I respond that the randomization procedure has led to those factors being balanced in the two treatment groups. That is, the Aspirin group and the Placebo group both have similar variations in dietary and blood pressure values across the subjects in the group. Thus, a difference in the two groups should not be due to these factors.
It is thought that fuel consumption in automobiles is greater at speeds in excess of 100 km per hour.
(Some years ago during oil shortages, many U.S. states reduced speed limits on freeways because of
this.) A study is planned that will focus on freeway-type driving, because fuel consumption is also
affected by the amount of stopping and starting in town driving, in addition to other factors.
In this case a decision was made to carry out an experimental study at a special paved track owned
by a car company. Obviously a lot of factors besides speed affect fuel consumption: for example,
the type of car and engine, tire condition, fuel grade and the driver. As a result, these factors were
controlled in the study by balancing them across different driving speeds. An experimental plan of the
following type was employed.
- 8 cars of eight different types were used; each car was used for 8 test drives.
- The cars were each driven twice for 600 km on the track at each of four speeds: 80, 100, 120 and 140 km/hr.
- 8 drivers were involved, each driving each of the 8 cars for one test, and each driving two tests at each of the four speeds.
- The cars had similar initial mileages and were carefully checked and serviced so as to make them as comparable as possible; they used comparable fuels.
- The drivers were instructed to drive steadily for the 600 km. Each was allowed a 30 minute rest stop after 300 km.
- The order in which each driver did his or her 8 test drives was randomized. The track was large enough that all 8 drivers could be on it at the same time. (The tests were conducted over 8 days.)
The response variate was the amount of fuel consumed for each test drive. Obviously in the analysis we must deal with the fact that the cars differ in size and engine type, and their fuel consumption will depend on that as well as on driving speed. A simple approach would be to add the fuel amounts consumed for the 16 test drives at each speed, and to compare them (other methods are also possible). Then, for example, we might find that the average consumption (across the 8 cars) at 80, 100, 120 and 140 km/hr was 43.0, 44.1, 45.8 and 47.2 liters, respectively. Statistical methods of testing and estimation could then be used to test or estimate the differences in average fuel consumption at each of the four speeds. (Can you think of a way to do this?)
Exercise: Suppose that statistical tests demonstrated a significant difference in consumption across the
four driving speeds, with lower speeds giving lower consumption. What (if any) qualifications would
you have about concluding there is a causal relationship?
8.3 Observational Studies

Example 8.3.1. Suppose that over a five year period, the applications and admissions to graduate studies in Engineering and Arts faculties in a university are as follows:
[Table: numbers of male and female applicants, and numbers admitted, for Engineering and for Arts.]
We want to see if females have a lower probability of admission than males. If we looked only
at the totals for Engineering plus Arts, then it would appear that the probability a male applicant is
admitted is a little higher than the probability for a female applicant. However, if we look separately at
Arts and Engineering, we see the probability for females being admitted appears higher in each case!
The reason for the reverse direction in the totals is that Engineering has a higher admission rate than
Arts, but the fraction of women applying to Engineering is much lower than for Arts.
In cause and effect language, we would say that the faculty one applies to (i.e. Engineering or Arts)
is a causative factor with respect to probability of admission. Furthermore, it is related to the gender
(M or F) of an applicant, so we cannot ignore it in trying to see if gender is also a causative factor.
Remark: The feature illustrated in the example above is sometimes called Simpson's Paradox. In probabilistic terms, it says that for events $A$, $B_1$, $B_2$ and $C_1, \dots, C_k$, we can have
$$P(A | B_1 C_i) \ge P(A | B_2 C_i) \quad \text{for each } i = 1, \dots, k,$$
but have
$$P(A | B_1) < P(A | B_2).$$
(Note that $P(A|B_1) = \sum_{i=1}^{k} P(A | B_1 C_i)\, P(C_i | B_1)$, and similarly for $P(A|B_2)$, so these probabilities depend on what $P(C_i|B_1)$ and $P(C_i|B_2)$ are.) In the example above we can take $B_1 = \{\text{person is female}\}$, $B_2 = \{\text{person is male}\}$, $C_1 = \{\text{person applies to Engineering}\}$, $C_2 = \{\text{person applies to Arts}\}$, and $A = \{\text{person is admitted}\}$.
Exercise: Write down estimated probabilities for the various events based on Example 8.3.1, and so
illustrate Simpson’s paradox.
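For readers who want to experiment, here is a small R sketch using hypothetical admission numbers of our own (the figures from Example 8.3.1 are not reproduced in these notes) that exhibits the paradox:

# Hypothetical admissions data: women are admitted at a higher rate within
# each faculty, yet at a lower rate when the faculties are pooled.
adm <- array(c(480, 90, 120, 10,      # Engineering: admitted (M, F), rejected (M, F)
               100, 270, 300, 630),   # Arts:        admitted (M, F), rejected (M, F)
             dim = c(2, 2, 2),
             dimnames = list(sex = c("M", "F"),
                             status = c("admitted", "rejected"),
                             faculty = c("Engineering", "Arts")))
rate <- function(m) m[, "admitted"] / rowSums(m)   # admission rate by sex
rate(adm[, , "Engineering"])   # M 0.80, F 0.90
rate(adm[, , "Arts"])          # M 0.25, F 0.30
rate(adm[, , 1] + adm[, , 2])  # pooled: M 0.58, F 0.36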
Epidemiologists (specialists in the study of disease) have developed guidelines or criteria which should be met in order to argue that a causal association exists between a risk factor x and a disease (represented by a response variable $Y = I(\text{person has the disease})$, for example). These include:
- the need to account for other possible risk factors and to demonstrate that x and Y are consistently related when these factors vary;
- the demonstration that the association between x and Y holds in different types of settings.
8.4 Problems
1. In an Ontario study, 50267 live births were classified according to the baby’s weight (less than or
greater than 2.5 kg.) and according to the mother’s smoking habits (non-smoker, 1-20 cigarettes
per day, or more than 20 cigarettes per day). The results were as follows:
[Table: numbers of births classified by birth weight and mother's smoking habits.]
(a) Test the hypothesis that birth weight is independent of the mother’s smoking habits.
(b) Explain why it is that these results do not prove that birth weights would increase if mothers
stopped smoking during pregnancy. How should a study to obtain such proof be designed?
(c) A similar, though weaker, association exists between birth weight and the amount smoked
by the father. Explain why this is to be expected even if the father’s smoking habits are
irrelevant.
2. One hundred and fifty Statistics students took part in a study to evaluate computer-assisted in-
struction (CAI). Seventy-five received the standard lecture course while the other 75 received
some CAI. All 150 students then wrote the same examination. Fifteen students in the standard
course and 29 of those in the CAI group received a mark over 80%.
(a) Are these results consistent with the hypothesis that the probability of achieving a mark
over 80% is the same for both groups?
(b) Based on these results, the instructor concluded that CAI increases the chances of a mark
over 80%. How should the study have been carried out in order for this conclusion to be
valid?
3. (a) The following data were collected some years ago in a study of possible sex bias in graduate
admissions at a large university:
Admitted Not admitted
Male applicants 3738 4704
Female applicants 1494 2827
Test the hypothesis that admission status is independent of sex. Do these data indicate a
lower admission rate for females?
(b) The following table shows the numbers of male and female applicants and the percentages
admitted for the six largest graduate programs in (a):
Men Women
Program Applicants % Admitted Applicants % Admitted
A 825 62 108 82
B 560 63 25 68
C 325 37 593 34
D 417 33 375 35
E 191 28 393 24
F 373 6 341 7
Test the independence of admission status and sex for each program. Do any of the pro-
grams show evidence of a bias against female applicants?
(c) Why is it that the totals in (a) seem to indicate a bias against women, but the results for
individual programs in (b) do not?
4. To assess the (presumed) beneficial effects of rust-proofing cars, a manufacturer randomly se-
lected 200 cars that were sold 5 years earlier and were still used by the original buyers. One
hundred cars were selected from purchases where the rust-proofing option package was included,
and one hundred from purchases where it was not (and where the buyer did not subsequently get
the car rust-proofed by a third party).
The amount of rust on the vehicles was measured on a scale in which the responses Y are assumed
roughly Gaussian, as follows:
1. Rust-proofed cars: $Y \sim G(\mu_1, \sigma)$
2. Non-rust-proofed cars: $Y \sim G(\mu_2, \sigma)$
Sample means and standard deviations from the two sets of cars were found to be (higher y means more rust)
1. $\bar{y}_1 = 11.7$, $s_1 = 2.1$
2. $\bar{y}_2 = 12.0$, $s_2 = 2.4$
(b) The manufacturer was surprised to find that the data did not show a beneficial effect of
rust-proofing. Describe problems with their study and outline how you might carry out a
study designed to demonstrate a causal effect of rust-proofing.
5. In randomized clinical trials that compare two (or more) medical treatments it is customary not to
let either the subject or their physician know which treatment they have been randomly assigned.
(These are referred to as double blind studies.)
Discuss why failing to do this might compromise a causative study (i.e. a study where you want to assess the causative effect of one or more treatments).
6. Public health researchers want to study whether specifically designed educational programs about
the effects of cigarette smoking have the effect of discouraging people from smoking. One
particular program is delivered to students in grade 9, with followup in grade 11 to determine each student's smoking "history". Briefly discuss some factors you'd want to consider in designing such a study, and how you might address them.
References and Supplementary
Resources
9.1 References
R.J. MacKay and R.W. Oldford (2001). Statistics 231: Empirical Problem Solving (Stat 231 Course Notes).

C.J. Wild and G.A.F. Seber (1999). Chance Encounters: A First Course in Data Analysis and Inference. John Wiley and Sons, New York.

J. Utts (2003). What Educated Citizens Should Know About Statistics and Probability. The American Statistician 57, 74-79.
Summary of Distributions
Discrete r.v. Y

Binomial(n, p), $0 < p < 1$, $q = 1 - p$: probability function $f(y) = \binom{n}{y} p^y q^{n-y}$, $y = 0, 1, 2, \dots, n$; mean $np$; variance $npq$; m.g.f. $M_Y(t) = (pe^t + q)^n$.

Bernoulli(p), $0 < p < 1$, $q = 1 - p$: $f(y) = p^y (1-p)^{1-y}$, $y = 0, 1$; mean $p$; variance $p(1-p)$; m.g.f. $(pe^t + q)$.

Negative Binomial(k, p), $0 < p < 1$, $q = 1 - p$: $f(y) = \binom{y+k-1}{y} p^k q^y$, $y = 0, 1, 2, \dots$; mean $kq/p$; variance $kq/p^2$; m.g.f. $\left(\frac{p}{1 - qe^t}\right)^k$.

Geometric(p), $0 < p < 1$, $q = 1 - p$: $f(y) = p q^y$, $y = 0, 1, 2, \dots$; mean $q/p$; variance $q/p^2$; m.g.f. $\frac{p}{1 - qe^t}$.

Hypergeometric(N, r, n), $r < N$, $n < N$: $f(y) = \binom{r}{y}\binom{N-r}{n-y} / \binom{N}{n}$, $y = 0, 1, 2, \dots, \min(r, n)$; mean $\frac{nr}{N}$; variance $n \frac{r}{N}\left(1 - \frac{r}{N}\right)\frac{N-n}{N-1}$; m.g.f. intractable.

Poisson($\lambda$), $\lambda > 0$: $f(y) = \frac{e^{-\lambda}\lambda^y}{y!}$, $y = 0, 1, \dots$; mean $\lambda$; variance $\lambda$; m.g.f. $e^{\lambda(e^t - 1)}$.
Continuous r.v. Y

Uniform(a, b): probability density function $f(y) = \frac{1}{b-a}$, $a < y < b$; mean $\frac{a+b}{2}$; variance $\frac{(b-a)^2}{12}$; m.g.f. $M_Y(t) = \frac{e^{bt} - e^{at}}{(b-a)t}$.

Exponential($\theta$), $\theta > 0$: $f(y) = \frac{1}{\theta} e^{-y/\theta}$, $0 < y$; mean $\theta$; variance $\theta^2$; m.g.f. $\frac{1}{1 - \theta t}$, $t < 1/\theta$.

Normal($\mu, \sigma^2$) or Gaussian($\mu, \sigma$), $-\infty < \mu < \infty$, $\sigma > 0$: $f(y) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(y-\mu)^2/(2\sigma^2)}$, $-\infty < y < \infty$; mean $\mu$; variance $\sigma^2$; m.g.f. $e^{\mu t + \sigma^2 t^2/2}$.

Chi-squared(r), d.f. $r > 0$: $f(y) = \frac{1}{2^{r/2}\Gamma(r/2)}\, y^{(r/2)-1} e^{-y/2}$, $0 < y < \infty$, where $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\, dx$; mean $r$; variance $2r$; m.g.f. $(1 - 2t)^{-r/2}$.

Student t, d.f. $\nu > 0$: $f(y) = k\left(1 + \frac{y^2}{\nu}\right)^{-(\nu+1)/2}$, $-\infty < y < \infty$, where $k = \frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}$; mean 0 if $\nu > 1$ (otherwise undefined); variance $\frac{\nu}{\nu - 2}$ if $\nu > 2$; m.g.f. undefined.
Statistical Tables
[Figure: the standard normal probability density function $\varphi(z)$, with the shaded area to the left of x representing $F(x)$.]

The table gives the value of $F(x) = \int_{-\infty}^{x} \varphi(z)\, dz$ for $x \ge 0$.
11.2 Chi-Squared Cumulative Distribution Function

11.3 Student t Quantiles
APPENDIX. ANSWERS TO SELECTED PROBLEMS

Chapter 1
5. (a) .9745

Chapter 2
3. $(2x_1 + x_2)/n$
4. (b) .28
6. (a) $\hat{\theta} = \dfrac{(f_0 + 3T) - [(f_0 + 3T)^2 - 8T^2]^{1/2}}{4T}$, where $T = \sum k f_k$
   (b) $c = (1 - \theta)^2/\theta$
   (c) $\hat{\theta} = 0.195$, $\hat{P}(X = 0) = .758$
   (d) $\hat{\theta} = .5$
7. $\hat{\theta} = \sum y_i / \sum t_i$

Chapter 4
3. (b) n = 1024

Chapter 5
9. (c) LR statistic gives $\Lambda_{obs} = 3.73$ and $SL = P(\chi^2_{(4)} \ge 3.73) = .44$. There is no evidence that the rates are not equal.

Chapter 6
18. (a) $\hat{\mu} = 0.9935$, $\hat{\sigma} = 0.0866$, $s = 0.2694$. Confidence intervals are $0.978 \le \mu \le 1.009$ and $0.182 \le \sigma \le 0.516$.

Chapter 7
1. (a) LR statistic gives $\Lambda_{obs} = 8.17$ and Pearson statistic $D_{obs} = 8.05$. The SL is about .004 in each case, so there is strong evidence against $H_0$.
2. LR statistic gives $\Lambda_{obs} = 5.70$ and Pearson statistic $D_{obs} = 5.64$. The SL is about .017 in each case.
5. (a) LR statistics for n = 2, 3, 4 are 1.11, 4.22, 1.36. The SL's are $P(\chi^2_{(1)} \ge 1.11) = 0.29$, $P(\chi^2_{(2)} \ge 4.22) = 0.12$, and $P(\chi^2_{(3)} \ge 1.36) = .71$, respectively.
   (b) LR statistic is 7.54 and $SL = P(\chi^2_{(3)} \ge 7.54) = .057$.
7. The LR statistic is 7.32 and $SL = P(\chi^2_{(2)} \ge 7.32) = .026$, so there is evidence against independence and in favour of an association.

Chapter 8
1. (a) LR statistic is 480.65, so the SL is almost zero; there is very strong evidence against independence.
APPENDIX: A Short Review of
Probability
In this appendix we provide some of the basic results and definitions of elementary probability. For further detail, refer to the Statistics 230 notes.
A discrete probability model consists of a sample space S, a collection of events (subsets of the sample space) to which we can assign probabilities, and a mechanism for assigning these probabilities. The probability of an arbitrary event A can be determined by summing the probabilities of the simple events in A, and so we have the following rules:
Rule 1 $P(S) = 1$.
Proof: $P(S) = \sum_{a \in S} P(a) = \sum_{\text{all } a} P(a) = 1$.
Rule 3 If A and B are two events with $A \subseteq B$ (that is, all of the points in A are also in B), then $P(A) \le P(B)$.
Proof: $P(A) = \sum_{a \in A} P(a) \le \sum_{a \in B} P(a) = P(B)$, so $P(A) \le P(B)$.
Rule 4a (probability of a union) $P(A \cup B) = P(A) + P(B) - P(AB)$.
Rule 4b (the probability of the union of three events) By a similar argument, we have
$$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC).$$
Definition 1 Events A and B are mutually exclusive if $AB = \emptyset$ (the empty event). For mutually exclusive events, $P(A \cup B) = P(A) + P(B)$, a consequence of rule 4a and the fact that $P(AB) = P(\emptyset) = 0$.
Rule 5 In general, let $A_1, A_2, \dots$ be finitely or countably many mutually exclusive events. Then
$$P(A_1 \cup A_2 \cup \cdots) = \sum_{i} P(A_i).$$
Rule 6 (probability of complements) If $\bar{A}$ denotes the complement of the event A, i.e. all of the points in the sample space that are not in A, then $P(\bar{A}) = 1 - P(A)$.
Definition 2 Events A and B are independent if and only if $P(AB) = P(A)P(B)$. If they are not independent, we call the events dependent.
For a finite or countably infinite collection of events, we call the events independent if all finite subcollections are independent, i.e. if
$$P(A_{i_1} A_{i_2} \cdots A_{i_k}) = P(A_{i_1})\, P(A_{i_2}) \cdots P(A_{i_k})$$
for all sets $(i_1, i_2, \dots, i_k)$ of distinct subscripts chosen from $(1, 2, 3, \dots)$.
Partition Rule. Let $A_1, A_2, A_3, \dots$ be a partition of the sample space S into disjoint (mutually exclusive) events, that is, $A_1 \cup A_2 \cup A_3 \cup \cdots = S$ and $A_i A_j = \emptyset$ for $i \ne j$. Then for any event B,
$$P(B) = \sum_{i} P(B | A_i)\, P(A_i).$$
Definition 5 A random variable is a function that assigns a real number to each point in a sample
space S.
Discrete random variables take integer values or, more generally, values in a countable set (recall
that a set is countable if its elements can be placed in a one-one correspondence with a subset of the
positive integers).
Continuous random variables take values in some interval of real numbers like $(0, 1)$ or $(0, \infty)$ or $(-\infty, \infty)$. Note that the cardinality of the real numbers in an interval is NOT countable.
Definition 6 The probability function (p.f.) of a discrete random variable Y is the function
$$f(y) = P(Y = y),$$
defined for all y in the range A of Y. The set of pairs $\{(y, f(y)) : y \in A\}$ is called the probability distribution of Y. All probability functions must have two properties: $f(y) \ge 0$ for all y, and $\sum_{y \in A} f(y) = 1$.
Definition 7 The cumulative distribution function (c.d.f.) of the random variable Y is the function, usually denoted by F(y),
$$F(y) = P(Y \le y),$$
defined for all real numbers y.
Note that for a discrete random variable we can obtain the c.d.f. by summing the values of the probability function: $F(y) = P(Y \le y) = \sum_{u \le y} f(u)$. In general, cumulative distribution functions are non-decreasing, satisfy $0 \le F(y) \le 1$ for all y, and have $\lim_{y \to -\infty} F(y) = 0$ and $\lim_{y \to \infty} F(y) = 1$.
Discrete random variables have jumps in the cumulative distribution function F(y), and indeed the probability of a given value is the size of the jump at that point, i.e. $\lim_{h \downarrow 0}(F(y) - F(y - h)) = P(Y = y)$. Continuous random variables, however, have continuous and indeed differentiable (except possibly at finitely many points) cumulative distribution functions F(y), and the derivative $F'(y) = f(y)$ is referred to as the probability density function (p.d.f.): a function f with the property that $P(a \le Y \le b) = P(a < Y < b) = \int_a^b f(z)\, dz$ for all real numbers $a < b$. It follows that
1. $\int_{-\infty}^{\infty} f(x)\, dx = \int_{\text{all } x} f(x)\, dx = 1$ (this is because $P(-\infty \le Y \le \infty) = 1$), and
2. $F(x) = \int_{-\infty}^{x} f(u)\, du$.
Hypergeometric Distribution
Suppose we have a collection of N objects which can be classified into two distinct types. Call one type "success" (S) and the other type "failure" (F). The population contains a total of r successes and N − r failures. Pick n objects at random without replacement, and let Y be the number of successes obtained. Then Y has a hypergeometric distribution with probability function
$$f(y) = \frac{\binom{r}{y}\binom{N-r}{n-y}}{\binom{N}{n}}.$$
The range of possible values of Y is implicit since, for example, for integers $y > r$, expressions such as $\binom{r}{y}$ take the value 0.
Binomial Distribution
Suppose an "experiment" has two types of distinct outcomes, "success" (S) and "failure" (F), with probabilities p (for S) and 1 − p (for F). Repeat the experiment n independent times and let Y be the number of successes obtained. Then the discrete random variable Y has a binomial distribution (write $Y \sim \text{Bin}(n, p)$) with probability function
$$f(y) = \binom{n}{y} p^y (1 - p)^{n - y}, \quad y = 0, 1, 2, \dots, n.$$
Poisson Distribution
The discrete random variable Y has a Poisson distribution if its p.f. is of the form
$$f(y; \lambda) = \frac{e^{-\lambda}\lambda^y}{y!} \quad \text{for } y = 0, 1, 2, \dots$$
where $\lambda$ is a parameter with $\lambda > 0$. To indicate this we write $Y \sim \text{Poisson}(\lambda)$. It can be shown that $E(Y) = \lambda$ for this model. The Poisson distribution arises in settings where Y represents the number of
random events of some kind that occur over a fixed period of time, for example, the number of arrivals
in a queue or the number of hits on a web site in a 1 hour period. For this model to be suitable the
events must occur completely randomly in time. The Poisson distribution is also used to describe the
random occurrence of events in space.
Uniform Distribution
The continuous random variable Y has a uniform distribution on [a, b] if its probability density function is constant over the interval: $f(y) = k$, $a \le y \le b$, for some constant k. So that $\int_a^b f(y)\, dy = 1$, we require $k = \frac{1}{b-a}$, and the probability density function is $f(y) = \frac{1}{b-a}$ for $a \le y \le b$. The single most important case is $a = 0$, $b = 1$.
Exponential Distribution
The continuous random variable Y has an exponential distribution if its probability density function is
of the form
$$f(y; \theta) = \frac{1}{\theta} e^{-y/\theta} \quad \text{for } y > 0$$
where $\theta$ is a parameter with $\theta > 0$. To indicate this we write $Y \sim \text{Exp}(\theta)$. It can be shown that $E(Y) = \theta$. The exponential distribution arises in some settings where Y represents the time until an event occurs. Be careful not to confuse the notation $Y \sim \text{Exp}(\theta)$, which specifies the exponential distribution for the random variable Y, with a common notation for the exponential function (sometimes $e^x$ is denoted $\exp(x)$). This distribution arises in various problems involving the time until some event occurs: in a Poisson process for events in time, if Y is the length of time we wait for the first event occurrence, then Y has an exponential distribution. Integrating the probability density function, the c.d.f. is $F(y) = 1 - e^{-y/\theta}$ for $y > 0$, and otherwise $F(y) = 0$.
Normal (Gaussian) Distribution
The continuous random variable Y has a normal (Gaussian) distribution if its probability density function is of the form
$$f(y) = \frac{1}{\sigma\sqrt{2\pi}} e^{-(y - \mu)^2/(2\sigma^2)}, \quad -\infty < y < \infty$$
where $-\infty < \mu < \infty$ and $\sigma > 0$ are parameters. It turns out that $E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2$. We write
$$Y \sim N(\mu, \sigma^2) \quad \text{or} \quad Y \sim G(\mu, \sigma)$$
to denote that Y has a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$). The normal distribution is the most widely used distribution in probability and statistics. The shape of the probability density function is what is often termed a "bell shape" or "bell curve"; the standard normal density, symmetric about 0, is shown in Figure 13.25 (you should be able to verify the shape without graphing the function).
[Figure 13.25: The standard normal (N(0, 1)) probability density function $\varphi(x)$.]
(Footnote: "The only normal people are the ones you don't know very well." — Joe Ancis.)

13.3 Moments

Definition 8 Suppose the discrete random variable Y has probability function f(y). Then we define the expected value of some function g(Y) of Y as
$$E[g(Y)] = \sum_{\text{all } y} g(y)\, f(y).$$

Definition 9 When Y is continuous with probability density function f(y), we define
$$E[g(Y)] = \int_{\text{all } y} g(y)\, f(y)\, dy.$$
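A quick numerical check of Definition 8 (a small example of our own, using the binomial probability function):

# E(Y) and E(Y^2) for Y ~ Bin(3, 0.4), by direct summation over the
# probability function as in Definition 8.
y <- 0:3
f <- dbinom(y, 3, 0.4)
c(EY = sum(y * f), EY2 = sum(y^2 * f))   # 1.2 and 2.16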
Definition 10 The moment generating function (m.g.f.) of the random variable Y is
$$M(t) = E(e^{tY}),$$
assuming that this expectation exists for all t in some open interval $(-a, a)$ of real numbers $(a > 0)$. If Y is continuous with probability density function f(y) then
$$M(t) = \int_{-\infty}^{\infty} e^{ty} f(y)\, dy \qquad (4.2.1)$$
while if Y is discrete with probability function f(y) then
$$M(t) = \sum_{y} e^{ty} f(y), \qquad (4.2.2)$$
where the integral and sum in (4.2.1) and (4.2.2) are over the range of Y.
The m.g.f. is a transform, that is, a function M(t) that is obtained from the function f(y). Not all probability distributions have m.g.f.'s (since (4.2.1) or (4.2.2) may not exist in some cases), but if the m.g.f. exists, it uniquely determines the distribution. That is, an m.g.f. M(t) can arise from only one function f(y).
Theorem 4.2.1 Let the random variable Y have m.g.f. M(t). Then for $r = 1, 2, 3, \dots$
$$E(Y^r) = \left.\frac{d^r M(t)}{dt^r}\right|_{t=0} = M^{(r)}(0) = \text{coefficient of } \frac{t^r}{r!} \text{ in the power series representation of } M(t).$$
20 For example, for $Y \sim \text{Bin}(n, p)$,
$$M(t) = \sum_{y=0}^{n} e^{ty} \binom{n}{y} p^y (1-p)^{n-y} = \sum_{y=0}^{n} \binom{n}{y} (pe^t)^y (1-p)^{n-y} = (pe^t + 1 - p)^n$$
by the binomial theorem.
21 Proof: For simplicity consider a continuous r.v. Y. Then
$$M^{(r)}(t) = \frac{d^r}{dt^r} \int_{-\infty}^{\infty} e^{ty} f(y)\, dy = \int_{-\infty}^{\infty} \frac{d^r(e^{ty})}{dt^r} f(y)\, dy = \int_{-\infty}^{\infty} y^r e^{ty} f(y)\, dy.$$
Therefore $M^{(r)}(0) = \int_{-\infty}^{\infty} y^r f(y)\, dy = E(Y^r)$.
Proof. First consider $Z \sim G(0, 1)$. If we can find $M_Z(t)$ then we can get $M_Y(t)$, where $Y \sim G(\mu, \sigma)$, from it. This is because $Z = (Y - \mu)/\sigma \sim G(0, 1)$, or $Y = \sigma Z + \mu$. Thus
$$M_Y(t) = e^{\mu t} M_Z(\sigma t). \qquad (4.2.5)$$
22 Proof: we have $M(t) = (pe^t + 1 - p)^n$. This gives $M^{(1)}(t) = npe^t(pe^t + 1 - p)^{n-1}$ and $M^{(2)}(t) = npe^t(pe^t + 1 - p)^{n-1} + n(n-1)(pe^t)^2(pe^t + 1 - p)^{n-2}$. Therefore $E(Y) = M^{(1)}(0) = np$ and $E(Y^2) = M^{(2)}(0) = np + n(n-1)p^2$. Finally $\mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2 = np(1 - p)$.
23 Proof: $M_Y(t) = E(e^{tY}) = E(e^{t(aX + b)}) = E(e^{bt} e^{(at)X}) = e^{bt} E(e^{(at)X}) = e^{bt} M_X(at)$.
Sums of Independent Random Variables
We often encounter sums or linear combinations of independent random variables, and m.g.f.'s are useful in dealing with them.
Theorem 4.2.3 Let $Y_1, \dots, Y_k$ be mutually independent random variables with m.g.f.'s $M_1(t), \dots, M_k(t)$. Then the m.g.f. of
$$S = \sum_{i=1}^{k} Y_i$$
is
$$M_S(t) = \prod_{i=1}^{k} M_i(t).$$
Proof.
$$M_S(t) = E(e^{tS}) = E\!\left(e^{t\sum Y_i}\right) = E\!\left(\prod_{i=1}^{k} e^{tY_i}\right) = \prod_{i=1}^{k} E\!\left(e^{tY_i}\right) \quad \text{(since the } Y_i\text{'s are independent)} = \prod_{i=1}^{k} M_i(t).$$
This allows us to prove a fundamental property of Gaussian random variables: if we take linear combinations (such as averages or sums) of independent Gaussian random variables, the result is also a Gaussian random variable. This is one reason the Gaussian distribution is so popular. For example, in finance it is a great convenience in a model that if the returns from individual investments are each normally distributed, then the total return from the portfolio is normally distributed as well.
Theorem 4.2.4 Let $Y_1, \dots, Y_k$ be independent random variables with $Y_i \sim G(\mu_i, \sigma_i)$. Then if $a_1, \dots, a_k$ are constants, the distribution of
$$W = \sum_{i=1}^{k} a_i Y_i$$
is $G\!\left(\sum_{i=1}^{k} a_i \mu_i,\ \left(\sum_{i=1}^{k} a_i^2 \sigma_i^2\right)^{1/2}\right)$.
Proof. By Theorems 4.2.2 and 4.2.3 we have the m.g.f. of $a_i Y_i$ as
$$M_{a_i Y_i}(t) = M_{Y_i}(a_i t) = \exp\!\left(a_i \mu_i t + \tfrac{1}{2} a_i^2 \sigma_i^2 t^2\right).$$
194 APPENDIX: A SHORT REVIEW OF PROBABILITY
Thus
k
Y
Mw (t) = Mai Yi (t)
i=1
" k
! k
! #
X 1 X
= exp ai i t+ a2i 2
i t2
2
i=1 i=1
1=2
!
P
k P
k
This is the m.g.f. for a G ai i ; a2i 2
i random variable, which proves the theorem.
i=1 i=1
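A simulation sketch (an example of our own) consistent with Theorem 4.2.4:

# W = 2*Y1 - Y2 with Y1 ~ G(1, 2) and Y2 ~ G(3, 1) should be
# G(2(1) - 3, sqrt(4(4) + 1)) = G(-1, sqrt(17)).
set.seed(7)
w <- 2 * rnorm(100000, mean = 1, sd = 2) - rnorm(100000, mean = 3, sd = 1)
c(mean(w), sd(w))   # close to -1 and sqrt(17) = 4.12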
13.4 Joint and Conditional Probability Functions

For n discrete random variables $Y_1, \dots, Y_n$, define the joint probability function
$$f(y_1, y_2, \dots, y_n) = P(Y_1 = y_1, Y_2 = y_2, \dots, Y_n = y_n).$$
We will also refer to the marginal probability function of one random variable, for example $f_j(y_j) = P(Y_j = y_j)$.
For n continuous random variables $Y_1, \dots, Y_n$, define the joint c.d.f.
$$F(y_1, \dots, y_n) = P(Y_1 \le y_1, \dots, Y_n \le y_n).$$
We call $f(y_1, y_2, \dots, y_n)$ the joint probability density function of $(Y_1, Y_2, \dots, Y_n)$ if the joint c.d.f. can be written as a multiple integral of f, i.e. if
$$F(y_1, y_2, \dots, y_n) = \int_{-\infty}^{y_1} \int_{-\infty}^{y_2} \cdots \int_{-\infty}^{y_n} f(u_1, u_2, \dots, u_n)\, du_n \cdots du_1.$$
In the continuous case the marginal probability density function of one variable, for example $f_1(y_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(y_1, u_2, \dots, u_n)\, du_n \cdots du_2$, is obtained by integrating the joint probability density function over all variables except $y_1$.
Definition 11 If $f(y_1, y_2)$ denotes the joint probability function of the discrete random variables $Y_1, Y_2$ and $f_1(y_1)$ the marginal probability function of $Y_1$, then the conditional probability function of $Y_2$ given $Y_1 = y_1$ is
$$f(y_2 | y_1) = \frac{f(y_1, y_2)}{f_1(y_1)}.$$
Similarly, in the continuous case, if $f(y_1, y_2)$ is the joint probability density function and $f_1(y_1)$ denotes the marginal probability density function of $Y_1$, then $f(y_2 | y_1) = f(y_1, y_2)/f_1(y_1)$ is the conditional probability density function of $Y_2$ given $Y_1 = y_1$. (In both cases, of course, these functions are only defined where the denominator is not zero.)
Definition 12
$$E[g(X_1, X_2, \dots, X_n)] = \sum_{\text{all } (x_1, x_2, \dots, x_n)} g(x_1, x_2, \dots, x_n)\, f(x_1, \dots, x_n) \quad \text{(discrete case)}$$
$$E[g(X_1, X_2, \dots, X_n)] = \int \cdots \int_{\text{all } (x_1, x_2, \dots, x_n)} g(x_1, x_2, \dots, x_n)\, f(x_1, \dots, x_n)\, dx_1\, dx_2 \cdots dx_n \quad \text{(continuous case)}$$
Definition 14 The correlation coefficient of $Y_1$ and $Y_2$ is $\rho = \dfrac{\mathrm{Cov}(Y_1, Y_2)}{\sigma_1 \sigma_2}$.
For events A and B, we defined A and B to be independent if and only if $P(AB) = P(A)P(B)$. This definition extends to random variables: random variables are independent if their joint probability function is the product of the marginal probability functions.
Theorem 16 Suppose random variables $Y_1$ and $Y_2$ are independent. Then, if $g_1(y)$ and $g_2(y)$ are any two functions for which the expectations exist,
$$E[g_1(Y_1)\, g_2(Y_2)] = E[g_1(Y_1)]\, E[g_2(Y_2)].$$
3. Let $X_1, X_2, \dots, X_n$ be random variables which have mean $\mu$. (You can imagine these being some sample results from an experiment such as recording the number of occupants in cars travelling over a toll bridge.) The sample mean is $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then $E(\bar{X}) = \mu$.
1. $\mathrm{Cov}(X, X) = E[(X - \mu_X)(X - \mu_X)] = E[(X - \mu_X)^2] = \mathrm{Var}(X)$.
2. $\mathrm{Cov}(aX + bY, cU + dV) = ac\,\mathrm{Cov}(X, U) + ad\,\mathrm{Cov}(X, V) + bc\,\mathrm{Cov}(Y, U) + bd\,\mathrm{Cov}(Y, V)$, where a, b, c and d are constants.
3. Variance of a linear combination: $\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$.
In particular, if X and Y are independent (so that $\mathrm{Cov}(X, Y) = 0$),
$$\mathrm{Var}(X - Y) = \sigma_X^2 + (-1)^2 \sigma_Y^2 = \sigma_X^2 + \sigma_Y^2,$$
i.e., for independent variables, the variance of a difference is the sum of the variances. Similarly, for the mean of n independent variables with common variance $\sigma^2$,
$$\mathrm{Var}(\bar{X}) = \sigma^2/n.$$
Limiting Distributions
Let $Y_1, Y_2, \dots$ be a sequence of random variables, where $Y_n$ has cumulative distribution function $F_n(y)$. If there is a c.d.f. F(y) such that
$$\lim_{n \to \infty} F_n(y) = F(y) \qquad (4.2.7)$$
at all points y at which F is continuous, then we say that the sequence of random variables has a limiting distribution with c.d.f. F(y). We often write this as $Y_n \xrightarrow{d} Y$, where Y is a random variable with c.d.f. F(y).
A major use of limiting distributions is as approximations. In many problems the c.d.f. $F_n(y)$ may be complicated or intractable, but as $n \to \infty$ there is a much simpler limit F(y). If n is sufficiently large, we often just use F(y) as a close approximation to $F_n(y)$.
We will state and use some limiting distributions in these notes. The most famous limit result, the Central Limit Theorem, is contained in the following theorem. In essence it states that we may approximate the distribution of a sum of independent, identically distributed random variables using the normal distribution; of course, the more summands, the better the approximation. A way to prove the theorem is provided in Problem 4 at the end of the chapter. Loosely speaking, the central limit theorem says:
let $Y_1, Y_2, \dots, Y_n$ be independent random variables all having the same distribution, with mean $\mu$ and variance $\sigma^2$. Then as $n \to \infty$,
$$\sum_{i=1}^{n} Y_i \approx N(n\mu,\ n\sigma^2) \qquad (4.2.8)$$
and
$$\bar{Y} \approx N\!\left(\mu,\ \frac{\sigma^2}{n}\right). \qquad (4.2.9)$$
This is actually a rough statement of the result since, as $n \to \infty$, both the $N(n\mu, n\sigma^2)$ and $N(\mu, \sigma^2/n)$ distributions fail to exist. (The former because both $n\mu$ and $n\sigma^2 \to \infty$, the latter because $\sigma^2/n \to 0$.) A precise version of the results is:
Theorem 18 If $Y_1, Y_2, \dots, Y_n$ are independent random variables all having the same distribution, with mean $\mu$ and variance $\sigma^2$, then as $n \to \infty$, the cumulative distribution function of the random variable
$$Z_n = \frac{\sum Y_i - n\mu}{\sigma\sqrt{n}} = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$$
approaches the $G(0, 1)$ c.d.f. That is, the theorem asserts that as $n \to \infty$, $Z_n$ converges in distribution to $G(0, 1)$.²⁵ Although this is a theorem about limits, we will use it when n is large, but finite, to approximate the distribution of $\sum Y_i$ or $\bar{Y}$ by a normal distribution, so the rough version of the theorem in (4.2.8) and (4.2.9) is adequate for our purposes.
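A simulation sketch of the theorem (an example of our own, with exponential summands):

# Sums of 30 Exponential(theta = 2) variables: mean 30*2 = 60,
# variance 30*theta^2 = 120, and a roughly bell-shaped histogram.
set.seed(42)
sums <- replicate(10000, sum(rexp(30, rate = 1/2)))
c(mean(sums), var(sums))   # close to 60 and 120
hist(sums, breaks = 50)    # approximately normal in shape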
Remark: For n sufficiently large we may use the $G(0, 1)$ distribution to approximate the distribution of $Z_n$ and thus to calculate probabilities for $\sum_{i=1}^{n} Y_i$. Recall that a constant times a normal random variable is also normally distributed, so the theorem also asserts that $\bar{Y}$ has approximately a Gaussian distribution with mean $\mu$ and standard deviation $\sqrt{\sigma^2/n}$. The accuracy of the approximation depends on n (bigger is better) and on the shape of the distribution of the $Y_i$.²⁶

²⁵ It is an interesting question as to whether this implies the p.d.f. of $Z_n$ converges to the normal p.d.f. Is it possible for a sequence $F_n(x)$ to converge to a limit $F(x)$ and yet their derivatives not to converge?
²⁶ Suppose $Y_j$ is your winning in a lottery in week j, and $Y_j$ is either 1 million (with probability $10^{-6}$) or 0 (with probability $1 - 10^{-6}$). How close do you think the distribution of $\sum_{i=1}^{100} Y_i$ is to a normal distribution?
Remark: It is possible to use the theorem to approximate discrete as well as continuous distributions. A very important case is the binomial distribution: let $Y_n \sim \text{Bin}(n, p)$ be a binomial r.v. Then as $n \to \infty$ the limiting distribution of
$$Z_n = \frac{Y_n - np}{\sqrt{np(1 - p)}}$$
is $G(0, 1)$. This is proved by noting that $Y_n$ can be written as a sum of independent random variables,
$$Y_n = \sum_{i=1}^{n} X_i$$
where $X_i \sim \text{Bin}(1, p)$, with $\mu = E(X_i) = p$ and $\sigma^2 = \mathrm{Var}(X_i) = p(1 - p)$. The result then follows from the Central Limit Theorem above.
Exercise: Use the limiting Gaussian approximation to the binomial distribution to evaluate $P(20 \le Y \le 28)$ and $P(16 \le Y \le 32)$ when $Y \sim \text{Bin}(60, 0.4)$. Compare the answers with the exact probabilities, obtained from the R function pbinom(x, n, p). (Note: recall that when approximating a discrete distribution we can use a continuity correction; this means here that we consider $P(19.5 \le Y \le 28.5)$ and $P(15.5 \le Y \le 32.5)$ when we apply the normal approximation.)
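The exercise can be carried out in a few lines of R:

# Normal approximation (with continuity correction) versus exact binomial
# probabilities for Y ~ Bin(60, 0.4).
n <- 60; p <- 0.4
mu <- n * p; sigma <- sqrt(n * p * (1 - p))
approx1 <- pnorm(28.5, mu, sigma) - pnorm(19.5, mu, sigma)  # P(20 <= Y <= 28)
exact1  <- pbinom(28, n, p) - pbinom(19, n, p)
approx2 <- pnorm(32.5, mu, sigma) - pnorm(15.5, mu, sigma)  # P(16 <= Y <= 32)
exact2  <- pbinom(32, n, p) - pbinom(15, n, p)
c(approx1, exact1, approx2, exact2)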
Besides convergence in distribution, there is another concept called convergence in probability, which we will mention briefly. A sequence of random variables $\{X_n : n = 1, 2, \dots\}$ is said to converge in probability to the constant c if, for each $\epsilon > 0$,
$$\lim_{n \to \infty} P(|X_n - c| \ge \epsilon) = 0. \qquad (4.2.10)$$
Loosely, this says that however small we set our tolerance $\epsilon$, the probability that $X_n$ is nearly equal to c (i.e. within this tolerance) gets closer and closer to 1. A major application of this concept is in showing that certain estimators $\tilde{\theta}_n$, based on a sample of size n, converge to the true value of the parameter being estimated as $n \to \infty$ (i.e. as the sample size becomes arbitrarily large). Convergence in probability will not be used much in this course, but it is an important tool in more advanced discussions of statistical theory. In fact, it can be shown that convergence in distribution in the sense of (4.2.7)²⁷ to a constant is equivalent to convergence in probability in the sense of (4.2.10).

²⁷ What is the cumulative distribution function F(y) of the constant c, and at what points y is it continuous?