
STATISTICS 231 COURSE NOTES

Department of Statistics and Actuarial Science, University of Waterloo

Winter 2013 Edition


Contents

INTRODUCTION TO STATISTICAL SCIENCE
1.1 Statistical Science
1.2 Collection of Data
1.3 Data Summaries
1.4 Probability Distributions and Statistical Models
1.5 Data Analysis and Statistical Inference
1.6 Statistical Software
1.7 A More Detailed Example: Colour Classification by Robots
1.8 Appendix. The R Language and Software
1.9 Problems

MODEL FITTING, MAXIMUM LIKELIHOOD ESTIMATION, AND MODEL CHECKING
2.1 Statistical Models and Probability Distributions
2.2 Estimation of Parameters (Model Fitting)
2.3 Likelihood Functions From Multinomial Models
2.4 Checking Models
2.5 Problems

PLANNING AND CONDUCTING EMPIRICAL STUDIES
3.1 Empirical Studies
3.2 Planning a Study
3.3 Data Collection
3.4 Problems

STATISTICAL INFERENCE: ESTIMATION
4.1 Introduction
4.2 Some Distribution Theory
4.3 Confidence Intervals for a Parameter
4.4 Problems

STATISTICAL INFERENCE: TESTING HYPOTHESES
5.1 Introduction
5.2 Testing Parametric Hypotheses with Likelihood Ratio Statistics
5.3 Hypothesis Testing and Interval Estimation
5.4 Problems

GAUSSIAN RESPONSE MODELS
6.1 Introduction
6.2 Inference for a single sample from a Gaussian Distribution
6.3 General Gaussian Response Models
6.4 Inference for Paired Data
6.5 Linear Regression Models
6.6 Model Checking
6.7 Problems

TESTS AND INFERENCE PROBLEMS BASED ON MULTINOMIAL MODELS
7.1 Introduction
7.2 Goodness of Fit Tests
7.3 Two-Way Tables and Testing for Independence of Two Variables
7.4 Problems

CAUSE AND EFFECT
8.1 Introduction
8.2 Experimental Studies
8.3 Observational Studies
8.4 Problems

References and Supplementary Resources
9.1 References
9.2 Departmental Web Resources

Summary of Distributions

Statistical Tables
11.1 Probabilities for the Standard Normal Distribution N(0, 1)
11.2 Chi-Squared Cumulative Distribution Function
11.3 Student t Quantiles

APPENDIX. ANSWERS TO SELECTED PROBLEMS

APPENDIX: DATA

APPENDIX: A Short Review of Probability
13.2 Some Basic Distributions
13.3 Moments
13.4 Joint and Conditional Probability Functions

Preface
These notes are a work-in-progress with contributions from those students taking the courses and
the instructors teaching them. The original version of the notes was prepared by Jerry Lawless with
editorial changes made by Jock MacKay, Don McLeish,... In order to provide improved versions of the
notes for students in subsequent terms, please email lists of errors, or sections that are confusing, or
additional remarks/suggestions to your instructor or dlmcleis@uwaterloo.ca.
Specific topics in these notes also have associated video files that can be accessed at www.watstat.ca.
Where possible we will reference these videos in the text.
INTRODUCTION TO STATISTICAL SCIENCE

1.1 Statistical Science

Statistical science, or statistics, is the discipline that deals with the collection, analysis and interpreta-
tion of data, and with the study and treatment of variability and of uncertainty. If you think about it,
you soon realize that almost everything we do or know depends on data of some kind, and that there is
usually some degree of uncertainty present. For example, in deciding whether to take an umbrella on a
long walk we may utilize information from weather forecasts along with our own direct impression of
the weather, but even so there is usually some degree of uncertainty as to whether it will actually rain
or not. In areas such as insurance or finance, decisions must be made about what rates to charge for an
insurance policy, or whether to buy or sell a stock, on the basis of certain types of data. The uncertainty
as to whether a policy holder will have a claim over the next year, or whether a stock’s price will rise
or fall, is the basis of financial risk for the insurer and the investor.
In order to increase our knowledge about some area or to make better decisions, we must collect
and analyze data about the area in question. To discuss general ways of doing this, it is useful to have
terms that refer to the objects we are studying. The words “population”, “phenomenon” and “process”
are frequently used below; they are simply catch-all terms that represent groups of objects and events
that someone might wish to study.
Variability and uncertainty are present in most processes and phenomena in the real world. Uncer-
tainty or lack of knowledge is the main reason why someone chooses to study a phenomenon in the
first place. For example, a medical study to assess the effect of a new drug for controlling hyperten-
sion (high blood pressure) may be conducted by a drug company because they do not know how the
drug will perform on different types of people, what its side effects will be, and so on. Variability is
ever-present; people have varying degrees of hypertension, they react differently to drugs, they have
different physical characteristics. One might similarly want to study variations in currency or stock val-
ues, variation in sales for a company over time, or variation in hits and response times for a commercial
web site. Statistical science deals both with the study of variability in processes and phenomena, and
with good (i.e. informative, cost-effective) ways to collect and analyze data about such processes.


There are various possible objectives when one collects and analyzes data on a population, phenom-
enon, or process. In addition to pure “learning” or furthering knowledge, these include decision-making
and the improvement of processes or systems. Many problems involve a combination of objectives. For
example, government scientists collect data on fish stocks in order to further their scientific knowledge,
but also to provide information to legislators or groups who must set quotas or limits on commercial
fishing. Statistical data analysis occurs in a huge number of areas. For example, statistical algorithms
are the basis for software involved in the automated recognition of handwritten or spoken text; statis-
tical methods are commonly used in law cases, for example in DNA profiling or in determining costs;
statistical process control is used to increase the quality and productivity of manufacturing processes;
individuals are selected for direct mail marketing campaigns through statistical analysis of their char-
acteristics. With modern information technology, massive amounts of data are routinely collected and
stored. But data does not equal information, and it is the purpose of statistical science to provide and
analyze data so that the maximum amount of information or knowledge may be obtained. Poor or
improperly analyzed data may be useless or misleading.
Mathematical models are used to represent many phenomena, populations, or processes; to deal
with problems that involve variability, we utilize probability models. These have been introduced and
studied in your first probability course, and you have seen how to describe variability and solve certain
types of problems using them. This course will focus more on the collection, analysis and interpreta-
tion of data, but the probability models studied earlier will be heavily used. The most important part of
probability for this course is the material dealing with random variables and their probability distribu-
tions, including distributions such as the binomial, hypergeometric, Poisson, multinomial, normal and
exponential. You should review your previous notes on this material.
Statistical science is a large discipline, and this course is only an introduction. Our broad objectives
are to discuss the collection, analysis and interpretation of data, and to show why this is necessary. By
way of further introduction we will outline important statistical topics, first data collection, and then
probability models, data analysis, and statistical inference. We should bear in mind that study of a
process or phenomenon involves iteration between model building, data collection, data analysis, and
interpretation. We must also remember that data are collected and models are constructed for a specific
reason. In any given application we should keep the big picture in mind (e.g. why are we studying this?
what else do we know about it?) even when considering one specific aspect of a problem.

1.2 Collection of Data

The objects of study in this course are usually referred to as either populations or processes. In essence
a population is just some collection of units (which can be either real or imagined), for example, the
collection of persons under the age of 18 in Canada as of September 1, 2012 or the collection of car
insurance policies issued by a company over a one year period. A process is a mechanism by which
output of some kind is produced; units can often be associated with the output. For example, hits on a
website constitute a process (the “units" are the distinct hits), as do the sequence of claims generated
by car insurance policy holders (the “units" are the individual claims). A key feature of processes is
that they usually occur over time, whereas populations are often static (defined at one moment in time).
Populations or processes are studied by defining variates or variables which represent character-
istics of units. These are usually numerical-valued and are represented by letters such as x, y, z. For
example, we might define a variable y as the number of car insurance claims from an individual policy
holder in a given year, or as the number of hits on a website over a specified one hour period. The
values of y vary across the units in a population or process; it is this variability that generates uncertainty
and makes it necessary to study populations and processes by collecting data about them. By "data" we
mean here the values of the variates for specific units in the population.
In planning for the collection of data about some phenomenon, we must carefully specify what the
objectives of doing this are. Then, the feasibility of obtaining information by various means must be
considered, as well as to what extent it will be possible to answer questions of interest. This sounds
simple but is usually difficult to do well, especially with limited resources.
There are several ways in which data are commonly obtained. One is purely according to what is
available: that is, data are provided by some existing source. Huge amounts of data collected by many
technological systems are of this type, for example, data on credit card usage or on purchases made by
customers in a supermarket. Sometimes it is not clear exactly what "available" data represent, and they
may be unsuitable for serious analysis. For example, people who voluntarily provide data in a survey
may not be representative of the population at large. Statistical science stresses the importance of ob-
taining data so that they will be “objective" and provide maximal information. Three broad approaches
are often used to do this:

(i) Sample Surveys The object of many studies is a finite population of some sort (e.g. all persons
over 19 in Ontario; all cars produced by GM in the past year). In this case information may be
obtained by selecting a “representative” sample of individuals from the population and studying
them. Representativeness of the sample is usually achieved by selecting the sample members
randomly from those in the population. Sample surveys are widely used in government statistical
studies, economics, marketing, public opinion polls, sociology, and other areas.

(ii) Observational Studies An observational study is one in which data are collected about a process
or phenomenon (over which the observer has no control) in some objective way, often over some
period of time. For example, in studying risk factors associated with a disease such as lung
cancer, one might investigate all such cases (or perhaps a random sample of them) that occur
over a given time period. A distinction between a sample survey and an observational study is
that for the latter the “population” of interest is usually infinite or conceptual. For example, in
investigating risk factors for a disease we would prefer to think of the population as a conceptual
one consisting of persons at risk from the disease recently or in the future.

(iii) Experiments An experiment is a study in which the experimenter (i.e. the person collecting
the data) exercises some degree of control over the process being studied. This usually takes the
form of the experimenter being able to control certain factors in the process. For example, in
an engineering experiment to quantify the effect of temperature on the performance of personal
computers, we might decide to run an experiment with 40 PC’s, ten of which would be operated
at each of the temperatures 10, 20, 30, and 40 degrees Celsius.

The three types of studies described above are not mutually exclusive, and many studies involve
aspects of two or more of them. Here are some slightly more detailed examples.

Example 1.2.1 A sample survey about smoking


Suppose we wish to study the smoking behaviour of Ontario residents aged 14-20 years. (Think
about reasons why such studies are considered important.) Of course, people’s smoking habits and the
population referred to both change over time, so we will content ourselves with a “snapshot” of the
population at some point in time (e.g. the second week of September in a given year). Since we cannot
possibly contact all persons in the population, we decide to select a random sample of n persons. The
data to be obtained from each person might consist of their age, sex, place of residence, occupation,
whether they currently smoke, and some additional information about their smoking habits and how
long they have smoked (if they are smokers or ex-smokers).
Note that we have to decide how large n should be, and how we are going to obtain our random
sample. The latter question is, in particular, very important if we want to ensure that our sample is
indeed “representative” of the population. The amount of time and money available to carry out the
study heavily influences how we will proceed.

Example 1.2.2 A study about a manufacturing process


When a manufacturer produces a product in packages stated to weigh or contain a certain amount,
they are generally required by law to provide at least the stated amount in each package. Since there
is always some inherent variation in the amount of product which the manufacturing process deposits
in each package, the manufacturer has to understand this variation and set up the process so that no (or
only a very small fraction of) packages contain less than the required amount.
Consider, for example, soft drinks sold in nominal 26 ounce bottles. Because of inherent variation
in the bottle filling process (what might some sources of this be?), the amount of liquid x that goes into
a bottle varies over a small range. Note that the manufacturer would like the variability in x to be as
small as possible, and for bottles to contain at least 26 ounces. Suppose that the manufacturer has just
added a new filling machine to increase the plant’s capacity and wants to compare the new machine
with the older ones. She decides to do this by sampling some filled bottles from each machine and
accurately measuring the amount of liquid x in each bottle; this will be an observational study.
How exactly should the data be collected? The machines may “drift” over time (i.e. the average or
the variability in the values of x may vary systematically up or down over time) so we should randomly
select bottles over time from each machine; we would have to decide how many, and over what time
periods to collect them.

Example 1.2.3 Clinical trials in medicine


In medical studies of the treatment of disease, it is common to compare alternative treatments in
experiments called clinical trials. Consider, for example, persons who are considered at high risk of a
stroke. Some years ago it was established in clinical trials that small daily doses of aspirin (which acts
as a blood thinner) could lower the risk of stroke. This was done by giving some persons daily doses of
aspirin (call this Treatment 1) and others a daily dose of a placebo, that is, an inactive compound, given
in the same form as the aspirin (call this Treatment 2). The two groups of persons were then followed
for a period of time, and the number of strokes in each group was observed.
This sounds simple, but there are several important points. For example, patients should be assigned
to receive Treatment 1 or Treatment 2 in some random fashion so as to avoid unconscious bias (e.g.
doctors might otherwise tend to put persons at higher risk in the Aspirin group) and to “balance” other
factors (e.g. age, sex, severity of condition) across the two groups. It is also best not to let the patients
or their doctors know which treatment they are receiving. Many other questions must also be addressed.
For example, what variables should we measure as the basis for our data? What should we do about
patients who are forced to drop out of the study because of adverse side effects? Is it possible that the
Aspirin treatment works for certain types of patients but not others? How long should the study go
on? How many persons should be included?
As an example of a statistical setting where the data are not obtained by a survey, experiment, or
even an observational study, consider the following.

Example 1.2.4 Direct marketing campaigns


With products or services such as credit cards it is common to conduct direct marketing campaigns
in which large numbers of individuals are contacted by mail and “invited” to acquire a product or
service. Such individuals are usually picked from a much larger number of persons on whom the com-
pany has information. For example, in a credit card marketing campaign a company might have data
on several million persons, pertaining to demographic (e.g. sex, age, place of residence), financial (e.g.
salary, credit cards held), spending, and other variates. Based on this data, the company wishes to select
persons whom it considers have a good chance of responding positively to the mailout. The challenge
is to use data from previous mail campaigns, along with the current data, to achieve as high a response
rate as possible.

1.3 Data Summaries


We noted in the previous section that data consisting of measurements on variables x, y, z, ... of interest
are collected when we study a phenomenon or process. Data in raw form can be difficult to compre-
hend, especially if the volume is great or if there are large numbers of variables. Many methods of
summarizing data so they can be more easily understood have been developed. There are two main
types: graphical and numerical. We will consider a few important data summaries here.
The basic setup is as follows. Suppose that data on a variable x are collected for n units in a
population or process. By convention, we label the units as 1, 2, ..., n and denote their respective
x-values as x_1, x_2, ..., x_n. We might also collect data on a second variate y for each unit, and we
would denote the values as y_1, y_2, ..., y_n. We often refer to {x_1, x_2, ..., x_n}, {y_1, y_2, ..., y_n} or
{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)} as samples or data sets, and refer to n as the sample size.
First we describe some numerical summaries of data sets like this, and then we describe some
graphical summaries.

1.3.1 Numerical Summaries

Numerical data summaries are useful for describing features of a data set {y_1, ..., y_n}. Important ones
are:

the mean (also called the sample mean)
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$$

the (sample) variance
$$s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$$

the (sample) standard deviation
$$s_y = \sqrt{s_y^2}$$

the percentiles and quantiles: the p'th quantile or 100p'th percentile is a y-value Q(p) such that
a fraction p of the values in the data set are below Q(p). The values Q(0.5), Q(0.25) and Q(0.75)
are called the median, the lower quartile, and the upper quartile respectively. In fact, quantiles
are not uniquely defined for all p-values in (0, 1) for a given data set, and there are different
conventions for defining quantiles and percentiles. For example, what is the median of the values
{1, 2, 3, 4, 5, 6}? What is the lower quartile? The different conventions for defining quantiles
become identical as n becomes large.

The mean and the percentiles and quantiles are easily understood. The variance and standard de-
viation measure the variability or “spread" of the y-values in a data set, which is usually an important
characteristic. Another way to measure variability is in terms of the distance between a “low" and
"high" percentile, for example Q(0.10) and Q(0.90).
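As a small illustration, these summaries are easily computed in R. The sketch below assumes the data are stored in a numeric vector y; the values shown are arbitrary.

y <- c(1.5, 2.2, 3.4, 5.0, 2.8)           # a small illustrative data set
mean(y)                                    # sample mean
var(y)                                     # sample variance (divisor n - 1)
sd(y)                                      # sample standard deviation
quantile(y, probs = c(0.25, 0.5, 0.75))    # lower quartile, median, upper quartile
quantile(y, 0.90) - quantile(y, 0.10)      # distance between a "high" and "low" percentile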
A final numerical summary is a frequency table. This is closely related to a histogram and, in fact, is
just a table showing the intervals I_j and their frequencies f_j, as used in a histogram. For example, for the
200 male height measurements in Example 1.3.2, the frequency table corresponding to the bottom-left
histogram in Figure 1.1 is shown in Table 1.3.1.

Table 1.3.1 Frequency Table of Male Heights (in m.)

Interval (I_j)      Frequency (f_j)
[1.55, 1.60)               2
[1.60, 1.65)              13
[1.65, 1.70)              48
[1.70, 1.75)              64
[1.75, 1.80)              42
[1.80, 1.85)              25
[1.85, 1.90)               6
Total                    200
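A frequency table like Table 1.3.1 can be produced in R with cut() and table(). The sketch below assumes the height measurements are stored in a numeric vector heights (an illustrative name); the break points are those used in Table 1.3.1.

breaks <- seq(1.55, 1.90, by = 0.05)                       # interval endpoints a_0, a_1, ..., a_k
intervals <- cut(heights, breaks = breaks, right = FALSE)  # assign each value to an interval [a_{j-1}, a_j)
table(intervals)                                           # observed frequencies f_j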

1.3.2 Graphical Summaries

We first consider two types of plots for a data set {y_1, y_2, ..., y_n} of numerical values. These are
called histograms and cumulative frequency plots.

Histograms
Consider measurements {y_1, y_2, ..., y_n} on a variable y. Partition the range of y into intervals
I_j = [a_{j-1}, a_j), j = 1, 2, ..., k and then calculate, for j = 1, ..., k,

f_j = number of values from {y_1, ..., y_n} that are in I_j.

The f_j are called the observed frequencies for I_1, ..., I_k; note that f_1 + f_2 + ... + f_k = n. A histogram is
a graph in which a rectangle is placed on each interval; the height of the rectangle for I_j is chosen so
that the rectangle's area is proportional to f_j. Two main types of histogram are used:
(a) a "standard" histogram, where the range of y is taken to be finite and the I_j are of equal length.
The height of the rectangle is taken to be f_j. This type of histogram is similar to a bar chart.

(b) a "relative frequency" histogram, where the I_j may or may not be of equal length. The height of
the rectangle for I_j is chosen so that its area equals f_j/n, which we call the relative frequency
for I_j. Note that in this case the sum of the areas of all the rectangles in the histogram equals one.

Example 1.3.2
Figure 1.1 shows relative frequency histograms based on each of two samples, (a) heights of 200
females randomly selected from workers aged 18 - 60 in New Zealand, and (b) heights of 200 males,
selected from the same population. Heights are recorded in metres; the female heights range from 1.45
to 1.78 m (57.1 to 70.1 in.) and the male heights from 1.59 to 1.88 m (62.6 to 74.0 in.).
To construct a histogram, we have to choose the number (k) and location of the intervals. The
intervals are typically selected in such a way that each interval contains at least one y-value from the
sample (that is, each f_j ≥ 1). Software packages are used to produce histogram plots (see Section 1.6)
and they will either automatically select the intervals for a given data set or allow the user to specify
them.
The visual impression from a histogram can change somewhat according to the choice of intervals.
In Figure 1.1, the left-hand panels use 7 intervals and the right-hand use 17 for females and 15 for
males. Note that the histograms give a picture of the distribution of y values in the two samples. For
both females and males the distributions are fairly symmetrical-looking. To allow easy comparison of
female and male height distributions we have used the same y scale (x-axis) for males and females.
Obviously, the distribution of male heights is to the right of that for female heights, but the “spread"
and shape of the two distributions are similar.
Example 1.3.3 Different shapes of distributions can occur in data on a variable y. Figure 1.2 shows
a histogram of the lifetimes (in terms of number of km driven) for the front brake pads on 200 new
mid-size cars of the same type. Notice that the distribution is less symmetrical than the ones in Figure
1.1; the brake pad lifetimes have a rather long right-hand tail. The high degree of variability in life-
times is due to the wide variety of driving conditions which different cars are exposed to, as well as to
variability in how soon car owners decide to replace their brake pads.

Cumulative frequency plots
Another way to portray a data set {y_1, y_2, ..., y_n} is to count the number or proportion of values
in the set which are less than or equal to any given value. This gives the function

$$\tilde{F}(y) = \frac{\text{number of values in } \{y_1, y_2, \ldots, y_n\} \text{ which are} \le y}{n}. \qquad (1.1)$$
Figure 1.1: Histograms for Female and Male Heights. Sample sizes = 200.

Software will produce such functions for a given data set. This is conveniently done by first ordering
the y_i's (i = 1, ..., n) to give the ordered values y_(1) ≤ y_(2) ≤ ... ≤ y_(n). Then we note that F̃(y) is a
"staircase" or "step" function that is easily obtained from the ordered values. If the data values y_i
(i = 1, ..., n) are all different, then F̃(y_(j)) = j/n.

Example 1.3.4 Suppose that n = 4 and the y-values (ordered for convenience) are 1.5, 2.2, 3.4, 5.0.
Then

$$\tilde{F}(y) = \begin{cases} 0 & y < 1.5 \\ 0.25 & 1.5 \le y < 2.2 \\ 0.50 & 2.2 \le y < 3.4 \\ 0.75 & 3.4 \le y < 5.0 \\ 1.00 & y \ge 5.0 \end{cases}$$
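This step function can be checked in R with the built-in empirical c.d.f. function ecdf():

y <- c(1.5, 2.2, 3.4, 5.0)        # the ordered values from Example 1.3.4
Ftilde <- ecdf(y)                 # ecdf() returns the step function described above
Ftilde(c(1.0, 1.5, 3.0, 5.0))     # gives 0.00 0.25 0.50 1.00
plot(Ftilde)                      # draws the "staircase" plot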

Example 1.3.5 Figure 1.3 shows the cumulative relative frequency plots F̃(y) for (a) the sample of
female heights, and (b) the sample of male heights in Example 1.3.2.

A cumulative frequency plot does not show the "shape" of the distribution of y-values in a data set
quite as clearly as a histogram. However, it shows us the proportion of y-values in any given interval:
the proportion in the interval (a, b] is just F̃(b) − F̃(a). In addition, this plot allows us to pinpoint
values such as the median (a y-value m such that half of the data values are below m and half are above
m) or the 100p'th percentile (a y-value Q(p) such that a proportion p of the data values is less than
Q(p)), where 0 < p < 1. For example, we see from Figure 1.3 that the median (or Q(0.5)) height
for females is about 1.60 m (63.0 in.) and for males, about 1.73 m (68.1 in.).

Figure 1.2: Lifetimes (in km driven) for Front Brake Pads on 200 Cars

Other plots are also sometimes useful. The size n of a data set can be small or large. Histograms
are not very useful when n is less than about 20-30, and for small samples we often just plot the
locations of y-values on a line; an example is given in Section 1.7. A useful plot for comparing two or
more data sets, the box plot, is described next.

Box plots
Sometimes we have two or more samples of y-values, and we may wish to compare them. One way is
by plotting histograms or relative frequency plots for the different samples on the same graph or page;
we did this in Example 1.3.2 for the samples of female heights and male heights. The box plot is a plot
in which only certain values based on a data set are shown, in particular the median, upper and lower
quartiles (these are the 25th and 75th percentiles, Q(0.25) and Q(0.75)), plus values equal to 1.5 times the
"interquartile range" Q(0.75) − Q(0.25) below Q(0.25) and above Q(0.75). Figure 1.4 shows such a
plot for the female heights and male heights data sets from Example 1.3.2.

From the boxplot we can determine, for example, that approximately 75% of the females have
heights less than 1.65 m. or that about 50% of males had heights between 1.7 and 1.79 m.
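A sketch of how such a plot is drawn in R, assuming the two height samples are stored in vectors hfemale and hmale as in Section 1.6:

boxplot(hfemale, hmale,
        names = c("F", "M"),     # label the two samples as in Figure 1.4
        ylab = "Heights (m)")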
Figure 1.3: Cumulative relative frequency for Female (F) and Male (M) heights

Two-variable plots
Often we have data on two or more variables for each unit represented in a sample. For example, we
might have the heights x and weights y for samples of individuals. The data set can then be represented
as n pairs, {(x_i, y_i); i = 1, ..., n}, where x_i and y_i are the height and weight of the i'th person in the
sample.
When we have two such variables, a useful plot is a scatter plot, which is an x-y plot of the
points (x_i, y_i), i = 1, ..., n. This shows whether x_i and y_i tend to be related in some way. Figure
1.5 shows a scatterplot of heights x_i and weights y_i for 100 adult males. As is obvious from looking
at people around us, taller people tend to weigh more, but there is considerable variability in weight
across persons of the same height.
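A scatter plot is produced in R with plot(); a brief sketch, assuming the paired measurements are in vectors height and weight (illustrative names):

plot(height, weight,
     xlab = "height (in.)", ylab = "weight (lb.)")   # one point per (x_i, y_i) pair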

1.4 Probability Distributions and Statistical Models


Probability models are used to describe random processes. (For convenience we’ll often use the single
term “process" below but the terms population or phenomenon could also be inserted.) They help us
understand such processes and to make decisions in the face of uncertainty. They are important in
studies involving the collection and analysis of data for several reasons. These include:

(i) when studying a process scientifically, questions are often formulated in terms of a model for
the process. The questions of primary interest do not concern the data, but the data provides a
window to the population or the model.

Figure 1.4: Box Plots Based on 200 Female and 200 Male Heights in Example 1.3.2. "F" = 1, "M" = 2.

(ii) the data collected in studying processes are variable, so random variables are often used in dis-
cussing and dealing with data,

(iii) studies of a process usually lead to inferences or decisions that involve some degree of uncer-
tainty, and probability is used to quantify this,

(iv) procedures for making decisions are often formulated in terms of models,

(v) models allow us to characterize processes, and to simulate them via computer experiments or
other means.

Consider a variable y associated with the units in a population or process. To describe or “model"
the variability in y-values we use probability distributions, which were introduced in your first proba-
bility course. This is done as follows: let Y denote the value of the variable for a randomly chosen unit
in the population or process. Because this value is random (we do not know which unit will be chosen)
we call Y a random variable, and use a probability distribution to provide us with probabilities such as
P(a ≤ Y ≤ b). You should review your probability notes (a limited review is given in an appendix to
this chapter) and recall that random variables are usually either discrete or continuous. A discrete
random variable (r.v.) Y is one for which the range R (set of possible values) of Y is countable. A
continuous r.v. is one whose range R consists of one or more continuous intervals of real numbers.

Figure 1.5: Scatterplot of Height vs. Weight for 100 Adult Males

For a discrete r.v. the probability function (p.f.) f(y) is defined as

$$f(y) = P(Y = y) \quad \text{for } y \in R,$$

where R = {r_1, r_2, r_3, ...}, a countable subset of the real numbers, is the range of Y. For a continuous
random variable, the probability density function (p.d.f.) f(y) is such that for any interval (a, b)
contained in R,

$$P(a \le Y \le b) = \int_a^b f(y)\,dy.$$

Example 1.4.1 A Binomial Distribution


Consider a “toy" example in which a six-sided die is rolled repeatedly. This constitutes the target
process, and for a study we might roll the die a total of n times. For this study, let Y denote the number
of rolls that result in the number 6. We treat it as a random variable since it is subject to random
variation. Assuming that the die has probability 1/6 of turning up the face "six" on any single roll, and
that the rolls are independent, the probability distribution of Y has probability function

$$P(Y = y) = f(y) = \binom{n}{y}\left(\frac{1}{6}\right)^{y}\left(\frac{5}{6}\right)^{n-y}, \quad y = 0, 1, \ldots, n. \qquad (1.2)$$

This is called a binomial distribution. We should bear in mind that (1.2) is a model; for any real die the
assumption that a six has probability 1/6 could be slightly in error. However, (1.2) is a very accurate
model that closely represents variability for most real dice.
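Binomial probabilities such as those in (1.2) are easily evaluated in R with dbinom(); the values of n and y below are arbitrary.

n <- 12
dbinom(3, size = n, prob = 1/6)           # P(Y = 3) for n = 12 rolls
sum(dbinom(0:2, size = n, prob = 1/6))    # P(Y <= 2), adding the individual probabilities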
Example 1.4.2 An Exponential Distribution


Suppose that in a population of light bulbs the random variable X represents the lifetime (say in
days) of a randomly chosen bulb. The continuous exponential distribution provides a good model for
many types of bulbs. For example, if the bulbs have an average lifetime of 100 days (2400 hours) of
operation, then a distribution with probability density function

$$f(x) = 0.01\,e^{-0.01x}, \quad x > 0 \qquad (1.3)$$

would be suitable. Using this, we can compute probabilities such as

$$P(X > 100) = \int_{100}^{\infty} 0.01\,e^{-0.01x}\,dx = e^{-1} = 0.368.$$
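This calculation can be checked in R (pexp() gives the exponential c.d.f., so the upper-tail probability is one minus it):

1 - pexp(100, rate = 0.01)                               # P(X > 100) = exp(-1) = 0.3678794
integrate(function(x) 0.01 * exp(-0.01 * x), 100, Inf)   # the same probability by numerical integration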

Recall that the cumulative distribution function (c.d.f.) is defined for a r.v. Y as

$$F(y) = P(Y \le y), \quad y \in \mathbb{R}. \qquad (1.4)$$

If Y is discrete then F(y) = Σ_{x ≤ y} f(x); if Y is continuous then

$$F(y) = \int_{x \le y} f(x)\,dx.$$

Recall also that if h(Y) is some function of Y, then the expectation (or "expected value") of h(Y) is
defined as

$$E[h(Y)] = \sum_{y \in R} h(y)f(y) \qquad (1.5)$$

if Y is discrete and as

$$E[h(Y)] = \int_R h(y)f(y)\,dy \qquad (1.6)$$

if Y is continuous. Expectations are used in many settings, for example when costs, profits, or losses
are associated with a random variable. The expectation E(Y) is called the mean of Y and is often
denoted by the Greek letter μ. The expectation E[(Y − μ)²] is called the variance of Y and is often
denoted either as Var(Y) or with the Greek symbol σ². The square root σ = √Var(Y) is called the
standard deviation of Y, or sd(Y).
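As a hedged illustration of how these quantities connect to data, the short R sketch below checks E(Y) and Var(Y) for the exponential model of Example 1.4.2 by simulation; the sample size 100000 is arbitrary.

set.seed(1)                        # for reproducibility
y <- rexp(100000, rate = 0.01)     # simulate lifetimes from the model in (1.3)
mean(y)                            # close to E(Y) = 1/0.01 = 100
var(y)                             # close to Var(Y) = 1/0.01^2 = 10000
sd(y)                              # close to sd(Y) = 100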

Your previous course introduced several families of probability distributions along with processes
or populations to which they are applied. Models such as the binomial, Poisson, exponential, normal
(Gaussian), and multinomial will be reintroduced and used in this course. The first few problems at the
end of the chapter provide a review of some models, and examples of where they are applied.
Many problems involve two or more random variables defined for any given unit. For example
Y_1 could represent the height and Y_2 the weight of a randomly selected 30 year old male in some
population. In general, we can think of a random variable Y = (Y_1, Y_2, ...) as being a vector of
length ≥ 1. This may make it necessary to consider multivariate probability distributions, which were
introduced in your last course for discrete random variables.
In many statistical applications there is a primary variable y of interest, but there may be a number
of other variables x_1, x_2, ... that affect y, or are "related" to y in some way. In this case we often refer
to y as the "response" variable and x_1, x_2, ... as "explanatory" variables or covariates. Many studies
are carried out for the purpose of determining how one or more explanatory variables are related to a
response variable. For example, we might study how the number of insurance claims y for a driver is
related to their sex, age, and type of car (x variables). One reason for studying explanatory variables
is to search out cause and effect relationships. Another is that we can often use explanatory variables
to improve decisions, predictions or “guesses" about a response variable. For example, insurance com-
panies use explanatory variables in defining risk classes and determining life insurance premiums.

1.5 Data Analysis and Statistical Inference

Whether we are collecting data to increase our knowledge or to serve as a basis for making decisions,
proper analysis of the data is crucial. Two broad aspects of the analysis and interpretation of data may
be distinguished. The first is what we refer to as descriptive statistics: This is the portrayal of the
data, or parts of it, in numerical and graphical ways so as to show certain features. (On a historical
note, the word “statistics” in its original usage referred to numbers generated from data; today the word
is used both in this sense and to denote the discipline of Statistics.) We have considered methods of
doing this in Section 1.3. The terms data mining and knowledge discovery in data bases (KDD) refer
to exploratory data analysis where the emphasis is on descriptive statistics. This is often carried out on
very large data bases.
A second aspect of a statistical analysis of data is what we refer to as statistical inference: that is,
we use the data obtained in the study of a process or phenomenon to draw more general inferences
about the process or phenomenon itself. In general, we try to use study data to draw inferences about
some target population or process. This is a form of inductive inference, in which we reason from the
specific (the observed data) to the general (the target population or process). This may be contrasted
with deductive inference (as in logic and mathematics) in which we use general results (e.g. axioms)
to prove specific things (e.g. theorems).
This course introduces some basic methods of statistical inference. Two main types of problems
will be discussed, loosely referred to as estimation problems and hypothesis testing problems. In the
former, the problem is to estimate some feature of a process or population. For example, we may wish
to estimate the proportion of Ontario residents aged 14 - 20 who smoke, or to estimate the distribution
of survival times for certain types of AIDS patients. Another type of estimation problem is that of
“fitting” or selecting a probability model for a process.
Testing problems involve using the data to assess the truth of some question or hypothesis. For
example, we may hypothesize that in the 14-20 age group a higher proportion of females than males
smoke, or that the use of a new treatment will increase the average survival time of AIDS patients by
at least 50 percent. These questions can be addressed by collecting data on the populations in question.
Statistical analysis involves the use of both descriptive statistics and methods of estimation and
testing. As brief illustrations, we return to the first two examples of section 1.2.

Example 1.5.1 A smoking behaviour survey


Suppose that a random sample of 200 persons aged 14-20 was selected, as described in example
1.2.1. Let us focus only on the sex of each person in the sample, and whether or not they smoked. The
data are nicely summarized in a two-way frequency table such as the following:
            No. of smokers   No. of non-smokers   Total
Female            32                  66             98
Male              27                  75            102
Total             59                 141            200
If we wished to estimate, say, the proportion of females aged 14-20 in the population who smoke,
we might simply use the sample proportion P = 32/98 = 0.327. However, we would also like some
idea as to how close this estimate is likely to be to the actual proportion in the population. Note that if
we selected a second sample of 200 persons, we would very likely find a different proportion of females
who smoked. When we consider estimation problems later in the course, we will learn how to use a
probability model to calculate the uncertainty for this kind of study. For now, let us merely note what
kind of model seems appropriate.
Consider only the proportion of females who smoke, and suppose that we select n females at
random. (This is not quite what was done in the above survey.) Then the number of women X who
smoke is actually a random variable in the sense that before the data are collected, it is random. Suppose
now that the population of females from which the sample is drawn is very large and that a proportion p
of the population are smokers. Then the probability distribution of X is, to a very close approximation,
binomial with probability function
$$f(x) = P(X = x) = \binom{n}{x}\, p^{x} (1-p)^{n-x}, \quad x = 0, 1, \ldots, n.$$
Knowing this will allow better estimation procedures to be developed later in the course.
Example 1.5.2 A soft drink bottle filler study


Recall example 1.2.2, and suppose that 26 ounce bottles are randomly selected from the output of
each of two machines, one old and one new, over a period of one week. The bottles were selected from
each machine’s output at roughly the same times. Accurate measurements of the amounts of liquid in
the bottles are as follows:

Old machine:
27.8 28.9 26.8 27.4 28.0 27.4 27.1 28.0 26.6 25.6
24.8 27.1 25.7 27.9 25.3 26.0 27.3 27.4 25.7 26.9
27.3 25.2 25.6 27.0 26.2 27.3 24.8 27.1 26.7 26.8
26.6 26.6 28.6 27.0 26.6 27.3 25.9 27.6 27.6 28.3
28.0 26.4 25.4 26.7 27.8 27.4 27.3 26.9 26.9 26.9
New Machine:
26.6 26.8 27.2 26.9 27.6 26.7 26.8 27.4 26.9 27.1
27.0 27.1 27.0 26.6 27.2 26.1 27.6 27.2 26.5 26.3
28.0 26.8 27.1 26.7 27.7 26.7 27.1 26.5 26.8 26.8
26.9 27.2 27.4 27.1 26.5 27.2 26.8 27.3 26.6 26.6
27.0 26.9 27.3 26.0 27.4 27.4 27.6 27.2 27.8 27.7

The amount of liquid X that goes into a bottle is a random variable, and a main objective of this
study is to determine what the distribution of X looks like for the old machine and for the new machine,
over the one week period of the study. For this to be really useful, we should check first that there are
no “drifts" or time trends in the data. Figure 1.6 gives a plot of xi vs. i (order of production) for each
machine, and no trends are apparent. The random variable X is continuous and so we would like to use
a continuous probability distribution as the model. It often turns out for problems involving weights
or measures that a normal distribution provides a suitable model. Recall that the distribution N ( ; 2 )
(which we will often call in this course a Gaussian distribution, denoted G( ; )) has probability
density function
( )
2
1 1 x
f (x; ; ) = p exp 1<x<1
2 2

and that probabilities are obtained by integrating it; recall also that = E(X) and 2 = Var(X) are
the mean and variance of the distribution. (note: exp (a) is the same as ea .)
Before trying to use any particular model for X, it is a good idea to "look" at the data. Figure
1.7 shows frequency histograms of the data from the old and new machines, respectively. It shows that
(i) the distributions of X for the old and new machines each look like they might be well described
Figure 1.6: Time Sequence Plot of Bottle Contents y (Old Machine, top; New Machine, bottom)

by (different) normal distributions, (ii) the variability in the new machine’s distribution is considerably
less than in the old’s.
After this simple bit of descriptive statistics we could carry out a more thorough analysis, for
example fitting normal distributions to each machine. We can also estimate attributes of interest, such
as the probability a bottle will receive less than 26 ounces of liquid, and recommend adjustments that
could be made to the machines. In manufacturing processes it is important that the variability in the
output is small, and so in this case the new machine is better than the old one.
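A sketch of how this analysis might begin in R, assuming the 50 measurements from each machine have been entered into vectors old and new (illustrative names):

mean(old); sd(old)     # sample mean and standard deviation, old machine
mean(new); sd(new)     # sample mean and standard deviation, new machine

# Estimated probability that a bottle contains less than 26 ounces, if each
# machine's output is modelled as Gaussian with mean and standard deviation
# estimated from its own sample:
pnorm(26, mean = mean(old), sd = sd(old))
pnorm(26, mean = mean(new), sd = sd(new))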

Figure 1.7: Histograms of Data from Two Machines

1.6 Statistical Software

Software is essential for data manipulation and analysis. It is also used to deal with numerical calcu-
lations, to produce graphics, and to simulate probability models. There exist many statistical software
systems; some of the most comprehensive and popular are SAS, S-Plus, SPSS, Stata, Systat and R.
Spreadsheet software is also useful.
In this course we will primarily use the R software system. It is an open source package that
has extensive statistical capabilities and very good graphics procedures. Its home page is at
www.r-project.org. In structure it is similar to the commercial package S-Plus (Insightful Corp.); both
are based on the ideas developed from the S statistical system at AT&T Bell Laboratories.
Some of the basics of R are described in the Appendix at the end of this chapter; it is very easy
to use. In this course we will employ R for several purposes: to manipulate and graph data; to fit and
check statistical models (distributions); to estimate quantities or test hypotheses; to simulate data from
probability models.
As an introductory example we consider some data on the heights and the body-mass indexes
(BMIs) of 150 males and 150 females, aged 18-60, that were collected from a random sample of
workers in New Zealand. The data are listed below, along with a few summary statistics. The BMI is
often used to measure obesity or severely low weight. It is defined as follows:

$$BMI = \frac{\text{weight(kg)}}{\text{height(m)}^{2}}$$
There is some variation in what different types of guidelines refer to as “overweight", “underweight",
etc. One that is sometimes used by public health professionals is:

Underweight         BMI < 18.5
Normal              18.5 ≤ BMI < 25.0
Overweight          25.0 ≤ BMI < 30.0
Moderately Obese    30.0 ≤ BMI < 35.0
Severely Obese      BMI ≥ 35.0
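As a hedged sketch, categories like these can be attached to BMI values in R with cut(); here bmimale is the vector of male BMI values given below, and the labels abbreviate the table above.

bmi_class <- cut(bmimale,
                 breaks = c(0, 18.5, 25, 30, 35, Inf),
                 right = FALSE,    # intervals of the form [a, b), matching the table
                 labels = c("Underweight", "Normal", "Overweight",
                            "Moderately Obese", "Severely Obese"))
table(bmi_class)                   # number of the 150 males in each category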

The data on heights are stored in two R vectors (see the Appendix at the end of the chapter) called
hmale and hfemale; the BMI measurements are in vectors bmimale and bmifemale.

Heights and Body-Mass Index (BMI) Measurements for 150 Males and Females

NOTE: BMI = weight(kg)/height(m)**2

MALE HEIGHTS (m)- hmale

[1] 1.76 1.76 1.68 1.72 1.73 1.78 1.78 1.86 1.77 1.72 1.72 1.77 1.77 1.70 1.72
[16] 1.77 1.79 1.75 1.74 1.71 1.73 1.74 1.70 1.71 1.72 1.66 1.74 1.73 1.77 1.69
[31] 1.91 1.77 1.81 1.74 1.87 1.76 1.69 1.87 1.78 1.70 1.78 1.84 1.82 1.77 1.72
[46] 1.80 1.72 1.69 1.78 1.69 1.80 1.82 1.65 1.56 1.64 1.60 1.82 1.73 1.62 1.77
[61] 1.81 1.73 1.74 1.75 1.73 1.71 1.63 1.72 1.74 1.75 1.72 1.83 1.77 1.74 1.66
[76] 1.93 1.81 1.73 1.68 1.71 1.69 1.74 1.74 1.79 1.68 1.71 1.74 1.82 1.68 1.78
[91] 1.79 1.77 1.74 1.78 1.86 1.80 1.74 1.69 1.85 1.71 1.79 1.74 1.80 1.64 1.82
[106] 1.66 1.56 1.80 1.68 1.73 1.78 1.69 1.57 1.64 1.67 1.74 1.89 1.77 1.75 1.84
[121] 1.66 1.71 1.75 1.75 1.64 1.73 1.79 1.74 1.83 1.80 1.74 1.81 1.80 1.66 1.75
[136] 1.82 1.80 1.81 1.71 1.59 1.71 1.79 1.80 1.70 1.77 1.78 1.64 1.70 1.86 1.75
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.56 1.71 1.74 1.744 1.79 1.93
FEMALE HEIGHTS (m)- hfemale
[1] 1.60 1.56 1.61 1.64 1.65 1.58 1.71 1.72 1.72 1.61 1.72 1.52 1.47 1.61 1.64
[16] 1.60 1.67 1.76 1.57 1.60 1.59 1.61 1.59 1.61 1.56 1.68 1.61 1.63 1.58 1.68
[31] 1.51 1.64 1.52 1.59 1.62 1.64 1.65 1.64 1.67 1.56 1.77 1.55 1.71 1.71 1.54
[46] 1.60 1.67 1.58 1.53 1.64 1.63 1.60 1.64 1.67 1.54 1.65 1.57 1.59 1.58 1.58
[61] 1.67 1.53 1.69 1.64 1.54 1.66 1.71 1.58 1.60 1.52 1.41 1.51 1.56 1.65 1.68
[76] 1.55 1.60 1.57 1.73 1.58 1.53 1.58 1.53 1.66 1.57 1.54 1.69 1.62 1.65 1.64
[91] 1.61 1.67 1.64 1.57 1.70 1.66 1.61 1.62 1.58 1.67 1.67 1.69 1.53 1.70 1.65
[106] 1.56 1.79 1.70 1.61 1.56 1.65 1.59 1.62 1.71 1.57 1.72 1.58 1.70 1.70 1.66
[121] 1.60 1.54 1.60 1.68 1.68 1.67 1.57 1.61 1.64 1.57 1.72 1.48 1.60 1.66 1.60
[136] 1.58 1.65 1.59 1.57 1.53 1.60 1.64 1.57 1.59 1.68 1.61 1.66 1.52 1.67 1.65

Min. 1st Qu. Median Mean 3rd Qu. Max.


1.41 1.573 1.61 1.618 1.667 1.79

MALE BMI- bmimale


[1] 20.6 30.3 28.5 18.9 37.5 29.1 27.7 26.1 27.9 34.7 26.8 28.9 25.6 23.7 30.0
[16] 28.6 27.5 30.4 22.7 24.7 26.2 28.0 35.5 22.7 26.5 26.4 30.2 24.7 24.8 25.9
[31] 24.3 25.7 21.7 24.9 30.1 29.3 23.6 27.0 33.6 29.0 26.4 28.0 25.6 31.0 27.7
[46] 23.1 25.4 24.9 29.7 24.5 28.5 25.1 32.7 27.5 25.1 24.0 26.0 30.2 27.0 26.3
[61] 29.7 21.7 26.7 26.3 34.2 23.5 26.0 26.5 26.4 22.8 22.3 22.5 23.7 27.4 31.0
[76] 28.7 27.2 25.1 25.1 27.9 26.8 23.9 30.9 28.8 27.5 26.8 23.4 32.4 25.6 24.0
[91] 34.0 30.8 32.0 31.8 23.3 28.0 22.8 23.9 23.2 32.5 23.1 32.6 24.7 27.2 23.7
[106] 27.1 22.1 22.6 18.3 25.6 22.3 28.6 21.8 26.1 26.6 22.9 29.3 33.7 30.2 29.2
[121] 33.5 26.2 26.7 27.7 26.5 25.5 27.9 30.1 34.9 28.7 29.1 27.8 34.1 24.2 27.9
[136] 27.8 25.5 25.6 24.1 23.8 30.1 23.5 27.5 27.1 25.1 28.2 35.2 32.4 30.7 21.3

Min. 1st Qu. Median Mean 3rd Qu. Max.


18.3 24.7 26.75 27.08 29.1 37.5

FEMALE BMI- bmifemale


[1] 23.4 21.2 31.2 27.1 25.9 26.8 28.8 24.3 36.2 37.0 37.5 37.2 28.4 20.7 25.1
[16] 18.9 20.4 27.7 30.1 27.9 19.8 27.0 23.3 17.5 25.8 23.2 21.2 26.8 25.9 21.6
[31] 34.2 20.3 37.9 32.7 22.7 25.9 35.6 32.0 32.2 19.8 23.3 32.7 22.7 22.8 27.0
[46] 21.3 22.7 23.7 25.6 21.4 32.8 30.3 29.0 27.7 29.4 26.3 26.2 28.0 29.1 24.6
[61] 28.4 22.5 33.2 29.6 26.1 27.8 26.8 26.8 32.4 38.8 23.5 33.7 30.2 29.1 26.8
[76] 36.4 19.0 24.5 23.1 33.9 26.5 31.0 26.1 29.8 23.4 31.2 28.4 26.7 27.3 24.1
[91] 20.7 25.0 23.6 33.1 23.1 32.2 24.8 22.5 29.9 26.9 28.5 27.5 26.3 29.7 21.9
[106] 26.3 20.3 21.5 24.5 31.1 31.3 34.0 31.9 27.2 16.4 20.3 29.1 25.7 23.4 27.6
[121] 26.4 28.4 24.8 29.1 25.6 29.4 26.2 29.7 22.8 21.5 21.3 29.9 17.6 28.3 24.1
[136] 28.3 24.0 25.4 23.9 24.3 30.4 28.6 25.0 23.8 36.0 31.5 21.8 29.4 30.8 28.1
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.4 23.42 26.8 26.92 29.7 38.8
Methods for summarizing data were discussed in Section 1.3. Both numerical and graphical sum-
maries can be obtained easily using R. For example, mean(x) and var(x) produce the sample mean
and variance s_x^2 of a set of numbers x_1, ..., x_n contained in the vector x. The definitions are (see
Section 1.3.1)

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad s_x^{2} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^{2}.$$

Using these we can find that the mean (average) height for the 150 males in the sample is 1.74 m
(68.5 in.) and for the 150 females is 1.62 m (63.8 in.).

Figure 1.8: Histograms and Models for Height and BMI data

A histogram gives a picture of the data. Figure 1.8 shows relative frequency histograms for heights
and BMIs for males and females. We also show normal distribution probability density functions
overlaid on each histogram. In each case we used a normal distribution N(μ, σ²) where the mean μ
and variance σ² were taken to equal the sample mean and variance. For example, for the male heights
we used μ = 1.74 and σ² = 0.004316. Note from Figure 1.8 that the normal (Gaussian) distributions
agree only moderately well with the observed data. Chapter 2 discusses probability models and
comparisons of models and of data in more detail.
The following R code, reproduced from the Appendix, illustrates how to look at the data and
produce plots like those in Figure 1.8.

EXAMPLE: BODY-MASS INDEX DATA


The R session below describes how to take data on the BMI measurements
for 150 males and 150 females and examine them, including the possibility
of fitting Gaussian distributions to the data.

The data are in vectors bmimale and bmifemale.

> summary(bmimale)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.3 24.7 26.75 27.08 29.1 37.5
> summary(bmifemale)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.4 23.42 26.8 26.92 29.7 38.8

> sort(bmimale) #Sometimes it's nice to look at the ordered sample


[1] 18.3 18.9 20.6 21.3 21.7 21.7 21.8 22.1 22.3 22.3 22.5 22.6 22.7 22.7 22.8
[16] 22.8 22.9 23.1 23.1 23.2 23.3 23.4 23.5 23.5 23.6 23.7 23.7 23.7 23.8 23.9
[31] 23.9 24.0 24.0 24.1 24.2 24.3 24.5 24.7 24.7 24.7 24.8 24.9 24.9 25.1 25.1
[46] 25.1 25.1 25.1 25.4 25.5 25.5 25.6 25.6 25.6 25.6 25.6 25.7 25.9 26.0 26.0
[61] 26.1 26.1 26.2 26.2 26.3 26.3 26.4 26.4 26.4 26.5 26.5 26.5 26.6 26.7 26.7
[76] 26.8 26.8 26.8 27.0 27.0 27.1 27.1 27.2 27.2 27.4 27.5 27.5 27.5 27.5 27.7
[91] 27.7 27.7 27.8 27.8 27.9 27.9 27.9 27.9 28.0 28.0 28.0 28.2 28.5 28.5 28.6
[106] 28.6 28.7 28.7 28.8 28.9 29.0 29.1 29.1 29.2 29.3 29.3 29.7 29.7 30.0 30.1
[121] 30.1 30.1 30.2 30.2 30.2 30.3 30.4 30.7 30.8 30.9 31.0 31.0 31.8 32.0 32.4
[136] 32.4 32.5 32.6 32.7 33.5 33.6 33.7 34.0 34.1 34.2 34.7 34.9 35.2 35.5 37.5

> sqrt(var(bmimale)) #Get the sample standard deviations


[1] 3.555644
> sqrt(var(bmifemale))
[1] 4.602213
> par(mfrow=c(1,2))   #Sets up graphics to do two side by side plots per page
> hist(bmimale,prob=T,xlim=c(15,40))   #Relative frequency histogram; the
                                       #xlim option specifies the range we want
                                       #for the x-axis.
> x<- seq(15,40,.01)   #We'll use this vector to plot a Gaussian pdf
> fx<- dnorm(x,27.08,3.56)   #Computes values f(x) of the G(27.08,3.56) pdf; we
                             #have estimated the distribution mean and standard
                             #deviation from the sample values.
> lines(x,fx)   #This function adds points (x,fx) to the latest plot created
                #and joins them up with lines. This creates a plot of the
                #pdf overlaid on the histogram.
> hist(bmifemale,prob=T,xlim=c(15,40))   #Now do a histogram for the female data.
> fx<- dnorm(x,26.92,4.60)   #Compute pdf f(x) for G(26.92,4.60) distribution
> lines(x,fx)   #As previously
> q()   #Quit the R session.

1.7 A More Detailed Example: Colour Classification by Robots


Inexpensive robots and other systems sometimes use a crude light sensor to identify colour-coded items.
In one particular application, items were one of five colours: White, Black, Green, Light Blue, Red.
The sensor determines a light intensity measurement y from any given item and uses it to identify the
colour.
In order to program the robot to do a good job, experiments are conducted on the sensor, as follows:
items of different colours are passed by the sensor and the intensity readings y are recorded. Table 1
shows some typical data for 10 Red and 10 White items. Note that not all Red items (or all White items)
give the same y-values. The reasons for the variability include variation in the colour and texture
of the items, variations in lighting and other ambient conditions, and variations in the angle at which
the item passes the sensor.

Table 1. Light intensity measurements y for 10 Red and 10 White items


Red 47.6 47.2 46.6 46.8 47.8 46.8 46.3 46.5 47.6 48.8
White 49.2 50.1 48.8 50.6 51.3 49.6 49.3 50.8 48.6 49.8

Figure 1.9 shows a plot (called a “strip" plot) of similar data on 20 items of each of the five colours.

It is clear that the measurements for the Black items are well separated from the rest, but that there is
some overlap in the ranges of the intensities for some pairs of colours.
To program the robot to “recognize" colour we must partition the range for y into five regions,
one corresponding to each colour. There are various ways to do this, the simplest being to choose
values that minimize the number of misclassifications (incorrectly identified colours) in the data that
are available. Another approach is to model the variability in y-values for each colour using a random
variable Y with a probability distribution and then to use this to select the partition. This turns out to
have certain advantages, and we will consider how this can be done.


Figure 1.9: Light Intensities for 20 Items of Each Colour

Empirical study has shown that for the population of White items Y has close to a Gaussian distribution,

Y ∼ G(μ_W, σ_W).

Similarly, for Black, Green, Light Blue and Red items the distribution of Y is close to G(μ_B, σ_B), G(μ_G, σ_G), G(μ_L, σ_L) and G(μ_R, σ_R), respectively. The approximate values of μ and σ for each colour are (we will discuss in Chapter 2 how to find such values)

Black:       μ = 25.7   σ = 2.4
Light Blue:  μ = 38.4   σ = 1.0
Green:       μ = 42.7   σ = 1.3
Red:         μ = 47.4   σ = 1.1
White:       μ = 49.8   σ = 1.2

The operators of the equipment have set the following decision rule (partition) for identifying the
colour of an item, based on the observed value y:

Black:       y ≤ 34.0
Light Blue:  34.0 < y ≤ 40.5
Green:       40.5 < y ≤ 45.2
Red:         45.2 < y ≤ 48.6
White:       y > 48.6

We now consider a few questions that shed light on this procedure.

Question: Based on the Gaussian models, what are the probabilities an item colour is misclassified?

This can be determined for each colour. For example,

P(Black item is misclassified) = P(Y > 34.0), where Y ∼ G(25.7, 2.4)
                               = 0.0003.

P(Red item is misclassified) = 1 − P(45.2 < Y ≤ 48.6), where Y ∼ G(47.4, 1.1)
                             = 0.1604.

Note that colours may be classified incorrectly in more than one way. For example,
P(Red is misclassified as Green) = P(40.5 < Y ≤ 45.2) = 0.0227
P(Red is misclassified as White) = P(Y > 48.6) = 0.1377.

Question: What kind of data would we collect to check that the Gaussian models are satisfactory
approximations to the distributions of Y ?

We would need to randomly select items of a specific colour, then use the sensor to get an intensity measurement y. By doing this for a large number n of items, we would get measurements y₁, ..., yₙ which can be used to examine or fit parametric models for Y, using methods developed in later chapters.

Question: Do we need to use a Gaussian model (or any other probability distribution) for this problem?

No. We could use a completely “empirical” approach in which we used the sensor experimentally
with many items of different colours, then determined a “good” decision rule for identifying
colour from the observed data. (For example, we could, as mentioned above, do this so that the
total number of items misclassified was minimized.)

To see how this would work, use R to simulate, say, 20 y-values for each colour, then try to pick the cut-points for your decision rule; a sketch of such a simulation follows.
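A minimal sketch of the simulation (the set.seed() call and the use of stripchart() are arbitrary choices; the means and standard deviations are those given above for each colour):

set.seed(123)                      # so the simulated values can be reproduced
y.black <- rnorm(20, 25.7, 2.4)    # 20 intensity readings per colour, from the
y.blue  <- rnorm(20, 38.4, 1.0)    # Gaussian models G(mu, sigma) given above
y.green <- rnorm(20, 42.7, 1.3)
y.red   <- rnorm(20, 47.4, 1.1)
y.white <- rnorm(20, 49.8, 1.2)
stripchart(list(Black=y.black, Blue=y.blue, Green=y.green,
                Red=y.red, White=y.white), xlab="light intensity")
sort(y.red); sort(y.white)         # inspect where adjacent colours overlap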

Question: What are some advantages of using a probability model (assuming it fits the data)?

It allows decision rules to be obtained and compared mathematically (or numerically).

In more complicated problems (e.g. with more types of items or with multivariate measurements)
a direct empirical approach may be difficult to implement.

Models allow comparisons to be made easily across similar types of applications. (For example,
sensors of similar types, used in similar settings.)

Models are associated with “scientific” descriptions of measurement devices and processes.

Figure 1.10 shows the probability density functions for the Gaussian distributions for Black, Light Blue, Green, Red and White items, which provides a clear picture of the misclassification probabilities.
Given the models above, it is possible to determine an “optimal" partition of the y-scale, according to some criterion. The criterion that is often used is called the overall misclassification rate, and it is defined as follows for this problem. Suppose that among all of the items which the robot encounters over some period of time, the fractions which are Black, Light Blue, Green, Red, and White are P_B, P_LB, P_G, P_R and P_W respectively (with P_B + P_LB + P_G + P_R + P_W = 1). Suppose also that instead of the values above we consider arbitrary cut-points c₁ < c₂ < c₃ < c₄, so that if y ≤ c₁ the decision is “Black", if c₁ < y ≤ c₂ the decision is “Light Blue", and so on. The overall probability a randomly selected item is classified correctly (CC) is
$$P(CC) = P(CC \mid \text{Black})P(\text{Black}) + P(CC \mid \text{Blue})P(\text{Blue}) + P(CC \mid \text{Green})P(\text{Green}) + P(CC \mid \text{Red})P(\text{Red}) + P(CC \mid \text{White})P(\text{White}) \qquad (1.7)$$
$$= P(Y_B \le c_1)P_B + P(c_1 < Y_{LB} \le c_2)P_{LB} + P(c_2 < Y_G \le c_3)P_G + P(c_3 < Y_R \le c_4)P_R + P(Y_W > c_4)P_W$$


Figure 1.10: Distributions of Light Intensities for Items of Each Colour

where Y_B, Y_LB, Y_G, Y_R, Y_W denote random variables with the G(25.7, 2.4), G(38.4, 1.0), G(42.7, 1.3), G(47.4, 1.1), G(49.8, 1.2) distributions respectively. It is not trivial to choose c₁, c₂, c₃, c₄ to maximize this (and therefore minimize the probability of incorrect classification) but it can be done numerically, for any set of values for P_B, P_LB, P_G, P_R, P_W. Note that for given values of c₁, c₂, c₃, c₄ we can readily calculate P(CC). For example, if P_B = P_LB = 0.05, P_G = P_R = P_W = 0.3, then with the values c₁ = 34.0, c₂ = 40.5, c₃ = 45.2, c₄ = 48.6 used above we get, using R to calculate the probabilities in (1.7),

P(CC) = 0.05[pnorm(34.0, 25.7, 2.4)] + 0.05[pnorm(40.5, 38.4, 1.0) − pnorm(34.0, 38.4, 1.0)]
      + 0.3[pnorm(45.2, 42.7, 1.3) − pnorm(40.5, 42.7, 1.3)]
      + 0.3[pnorm(48.6, 47.4, 1.1) − pnorm(45.2, 47.4, 1.1)] + 0.3[1 − pnorm(48.6, 49.8, 1.2)]
      = 0.05(0.9997) + 0.05(0.9821) + 0.3(0.9723) + 0.3(0.8396) + 0.3(0.8413)
      = 0.895

Thus the probability of incorrect identification of a colour is 0.105. Note that if the “mix" of colours that the robot sees changes (i.e. the values P_B, ..., P_W change) then P(CC) changes. For example, if there were more Black and fewer White items in the mix, then P(CC) would go up. Problems 3 and 4 at the end of the chapter consider slightly simpler classification problems involving only two types of “items"; it is easier to maximize P(CC) in these cases. You should note that the general problem of classification based on observed data is very common: spam detection in your emailer, credit checking at the bank, fraud detection in a financial institution, and even legal settings such as courts are all examples where a classification takes place on the basis of noisy data.
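The same calculation is easy to script. The following is a minimal R sketch of (1.7) for the cut-points and colour fractions used above (the vector names are arbitrary):

mu    <- c(25.7, 38.4, 42.7, 47.4, 49.8)       # Black, Light Blue, Green, Red, White
sigma <- c( 2.4,  1.0,  1.3,  1.1,  1.2)
cuts  <- c(-Inf, 34.0, 40.5, 45.2, 48.6, Inf)  # c1,...,c4 padded with -Inf and Inf
P     <- c(0.05, 0.05, 0.3, 0.3, 0.3)          # fractions of each colour seen
# probability each colour's reading falls in its own interval, element-wise
p.correct <- pnorm(cuts[2:6], mu, sigma) - pnorm(cuts[1:5], mu, sigma)
P.CC <- sum(P * p.correct)   # overall probability of correct classification
1 - P.CC                     # overall misclassification rate

Changing the entries of P (the colour mix) or of cuts and re-running the last three lines shows how P(CC) responds.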

1.8 Appendix. The R Language and Software


1.8.1 Some R Basics
R is a statistical software system that has excellent numerical, graphical and statistical capabilities.
There are Unix and Windows versions. These notes are a very brief introduction to a few of the features
of R. Web resources have much more information. You can also download a Unix or Windows version
of R to your own computer. R is invoked on Math Unix machines by typing R. The R prompt is >. R
objects include variables, functions, vectors, arrays, lists and other items. To see online documentation
about something, we use the help function. For example, to see documentation on the function mean(),
type

help(mean).

In some cases help.search() is also helpful. The assignment symbol is <- : for example,

x<- 15 assigns the value 15 to variable x.


To quit an R session in unix, type q()

1.8.2 Vectors
Vectors can consist of numbers or other symbols; we will consider only numbers here. Vectors are
defined using c(): for example,

x<- c(1,3,5,7,9)

defines a vector of length 5 with the elements given. Vectors and other classes of objects possess certain
attributes. For example, typing

length(x)

will give the length of the vector x. Vectors of length n are often a convenient way to store data values
for n individuals or units in a sample. For example, if there are variates x and y associated with any
given individual, we would define vectors for x and for y.

1.8.3 Arithmetic
The following R commands and responses should explain arithmetic operations.

> 7+3
[1] 10
> 7*3
[1] 21
> 7/3
[1] 2.333333
> 2^3
[1] 8

1.8.4 Some Functions


Functions of many types exist in R. Many operate on vectors in a transparent way, as do arithmetic
operations. (For example, if x and y are vectors then x+y adds the vectors element-wise; thus x and y
must be the same length.) Some examples, with comments, follow.

> x<- c(1,3,5,7,9) # Define a vector x


> x # Display x
[1] 1 3 5 7 9
> y<- seq(1,2,.25) #A useful function for defining a vector whose
elements are an arithmetic progression
> y
[1] 1.00 1.25 1.50 1.75 2.00
> y[2] # Display the second element of vector y
[1] 1.25
> y[c(2,3)] # Display the vector consisting of the second and
third elements of vector y.
[1] 1.25 1.50
> mean(x) #Computes the mean of the elements of vector x

[1] 5
> summary(x) # A useful function which summarizes features of
a vector x
Min. 1st Qu. Median Mean 3rd Qu. Max.
1 3 5 5 7 9
> var(x) # Computes the (sample) variance of the elements of x
[1] 10
> exp(1) # The exponential function
[1] 2.718282
> exp(y)
[1] 2.718282 3.490343 4.481689 5.754603 7.389056
> round(exp(y),2) # round(y,n) rounds the elements of vector y to
n decimals
[1] 2.72 3.49 4.48 5.75 7.39
> x+2*y
[1] 3.0 5.5 8.0 10.5 13.0

1.8.5 Graphs
To open a graphics window in Unix, type x11(). Note that in R, a graphics window opens automatically
when a graphical function is used. There are various plotting and graphical functions. Two useful ones
are

plot(x,y) # Gives a scatterplot of x versus y; thus x and y must


be vectors of the same length.

hist(x) # Creates a frequency histogram based on the values in


the vector x. To get a relative frequency histogram
(areas of rectangles sum to one) use hist(x,prob=T).

Graphs can be tailored with respect to axis labels, titles, numbers of plots to a page etc. Type help(plot),
help(hist) or help(par) for some information.
To save/print a graph in R using UNIX, you generate the graph you would like to save/print in R using
a graphing function like plot() and type:

dev.print(device,file="filename")

where device is the device you would like to save the graph to (i.e. x11) and filename is the name of
the file that you would like the graph saved to. To look at a list of the different graphics devices you
can save to, type

help(Devices).

To save/print a graph in R using Windows, you can do one of two things.


a) You can go to the File menu and save the graph using one of several formats (i.e. postscript, jpeg,
etc.). It can then be printed. You may also copy the graph to the clipboard using one of the formats
and then paste to an editor, such as MS Word. Note that the graph can be printed directly to a printer
using this option as well.

b) You can right click on the graph. This gives you a choice of copying the graph and then pasting to
an editor, such as MS Word, or saving the graph as a metafile or bitmap. You may also print directly
to a printer using this option as well.

1.8.6 Distributions
There are functions which compute values of probability or probability density functions, cumulative
distribution functions, and quantiles for various distributions. It is also possible to generate (pseudo)
random samples from these distributions. Some examples follow for the Gaussian distribution. For
other distribution information, type help(Poisson),

help(Binomial) etc.

> y<- rnorm(10,25,5) # Generate 10 random values from the Gaussian


distribution G(25,5); this is the same as
the normal distribution N(25,25). The values
are stored in the vector y.
> y # Display the values
[1] 22.50815 26.35255 27.49452 22.36308 21.88811 26.06676 18.16831 30.37838
[9] 24.73396 27.26640
> pnorm(1,0,1) # Compute P(Y<=1) for a G(0,1) random variable.
[1] 0.8413447
> qnorm(.95,0,1) # Find the .95 quantile (95th percentile) for G(0,1).
[1] 1.644854
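The same naming pattern (d for the probability or density function, p for the c.d.f., q for quantiles, r for random samples) is used for the other distributions; a few illustrative calls follow (the argument values here are arbitrary):

dbinom(3,10,0.2)   # P(Y=3) for Y ~ Bin(10,0.2)
pbinom(3,10,0.2)   # P(Y<=3) for the same binomial distribution
ppois(2,1.5)       # P(Y<=2) for a Poisson random variable with mean 1.5
rpois(5,1.5)       # Generate 5 random values from that Poisson distribution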

1.8.7 Reading Data from a file

You can read numerical data stored in a text file called (say) data into an R vector y by typing

y<- scan("data")

Type help(scan) to see more about the scan function.

1.8.8 Writing Data or information to a file.

You can write an R vector or other object to a text file through

write(y,file="filename")

To see more about the write function use help(write).

1.8.9 Example: Body-Mass Index Data

The R session below describes how to take data on the BMI measurements for 150 males and 150
females and examine them, including the possibility of fitting Gaussian distributions to the data. The
data are in vectors bmimale and bmifemale.

> summary(bmimale)
Min. 1st Qu. Median Mean 3rd Qu. Max.
18.3 24.7 26.75 27.08 29.1 37.5
> summary(bmifemale)
Min. 1st Qu. Median Mean 3rd Qu. Max.
16.4 23.42 26.8 26.92 29.7 38.8

> sort(bmimale) #Sometimes it's nice to look at the ordered sample


[1] 18.3 18.9 20.6 21.3 21.7 21.7 21.8 22.1 22.3 22.3 22.5 22.6 22.7 22.7 22.8
[16] 22.8 22.9 23.1 23.1 23.2 23.3 23.4 23.5 23.5 23.6 23.7 23.7 23.7 23.8 23.9
[31] 23.9 24.0 24.0 24.1 24.2 24.3 24.5 24.7 24.7 24.7 24.8 24.9 24.9 25.1 25.1
[46] 25.1 25.1 25.1 25.4 25.5 25.5 25.6 25.6 25.6 25.6 25.6 25.7 25.9 26.0 26.0
[61] 26.1 26.1 26.2 26.2 26.3 26.3 26.4 26.4 26.4 26.5 26.5 26.5 26.6 26.7 26.7
[76] 26.8 26.8 26.8 27.0 27.0 27.1 27.1 27.2 27.2 27.4 27.5 27.5 27.5 27.5 27.7

[91] 27.7 27.7 27.8 27.8 27.9 27.9 27.9 27.9 28.0 28.0 28.0 28.2 28.5 28.5 28.6
[106] 28.6 28.7 28.7 28.8 28.9 29.0 29.1 29.1 29.2 29.3 29.3 29.7 29.7 30.0 30.1
[121] 30.1 30.1 30.2 30.2 30.2 30.3 30.4 30.7 30.8 30.9 31.0 31.0 31.8 32.0 32.4
[136] 32.4 32.5 32.6 32.7 33.5 33.6 33.7 34.0 34.1 34.2 34.7 34.9 35.2 35.5 37.5

> sqrt(var(bmimale)) #Get the sample standard deviations


[1] 3.555644
> sqrt(var(bmifemale))
[1] 4.602213
> par(mfrow=c(1,2)) #Sets up graphics to do two side by side plots per page
> hist(bmimale,prob=T,xlim=c(15,40)) #Relative frequency histogram; the
xlim option specifies the range we want
for the x-axis.
> x<- seq(15,40,.01) #We’ll use this vector to plot a Gaussian pdf
> fx<- dnorm(x,27.08,3.56) #Computes values f(x) of the G(27.08,3.56) pdf; we
have estimated the distribution mean and standard
deviation from the sample values.
> lines(x,fx) #This function adds points (x,fx) to the latest plot created
and joins them up with lines. This creates a plot of the
pdf overlaid on the histogram.
> hist(bmifemale,prob=T,xlim=c(15,40)) #Now do a histogram for the female
data.
> fx<- dnorm(x,26.92,4.60) #Compute pdf f(x) for G(26.92,4.60) distribution
> lines(x,fx) # As previously
> q() #Quit the R session.

NOTE: You can see from the histograms and Gaussian pdf plots that the Gaussian distribution does
not seem an especially good model for BMI variation. The Gaussian pdf’s are symmetric whereas the
distribution of BMI measurements looks somewhat asymmetric.

1.9 Problems
1. The binomial distribution is a discrete probability model with probability function of the form
$$f(y) = \binom{n}{y} p^y (1-p)^{n-y} \qquad y = 0, 1, \ldots, n$$
where 0 < p < 1 and n is a positive integer. If Y is a random variable with probability function f(y) we write Y ∼ Bin(n, p).

A woman who claims to have special guessing abilities is given a test, as follows: a deck which
contains five cards with the numbers 1 to 5 is shuffled and a card drawn out of sight of the
woman. The woman then guesses the card, the deck is reshuffled with the card replaced, and
the procedure is repeated several times. Let Y represent the number of correct guesses by the
woman.

(a) Suppose an experiment consists of 20 repetitions, or guesses. If someone guesses “randomly" each time, discuss why Y ∼ Bin(20, 0.2) would be an appropriate model.

(b) Suppose the woman guessed correctly 8 times in 20 repetitions. Calculate P(Y ≥ 8) if Y ∼ Bin(20, 0.2) and use the result to consider whether the woman might have a probability p of guessing correctly which is greater than 0.2.

(c) In a longer sequence of 100 repetitions over two days, the woman guessed correctly 32 times. Calculate P(Y ≥ 32) if Y ∼ Bin(100, 0.2); you can use a normal approximation if you wish. What do you conclude now?

2. The exponential distribution is a continuous probability model in which a random variable X has probability density function
$$f(x) = \frac{1}{\theta} e^{-x/\theta}, \qquad x \ge 0 \qquad\qquad (1)$$
where θ > 0 is a parameter.

(a) Show that θ is the mean of X. Graph the probability density function of X.

(b) The exponential distribution is often found to be a suitable model for distributions of lifetimes. The 30 observations X₁, ..., X₃₀ below, for example, are the lifetimes (in days) of a random sample of a particular type of lightbulb, subjected to constant use:
23 261 87 7 120 14 62 47 225 71
246 21 42 20 5 12 120 11 3 14
71 11 14 11 16 90 1 16 52 95
The mean of these 30 numbers is X̄ = 59.6. It has been suggested that if an exponential model is suitable for representing the distribution of lifetimes in the population of lightbulbs from which they came, then θ in (1) should have a value of around 59.6. Why should this be so?

(c) For the exponential distribution (1) with θ = 59.6, calculate
i) p₁ = P(0 ≤ X < 40)
ii) p₂ = P(40 ≤ X < 100)
iii) p₃ = P(100 ≤ X < 200)
iv) p₄ = P(X ≥ 200)
Compare the values 30p₁, 30p₂, 30p₃, 30p₄ with the actual number of observations in the four intervals [0, 40), [40, 100), [100, 200), [200, ∞), respectively. Why should these numbers agree fairly well if the exponential distribution is a suitable model?
(d) Use a graph created using R to compare the model (1) with the data observed.

3. The normal or Gaussian distribution is an important continuous probability model which describes the variation in many types of physiological measurements very well. Recall that Y ∼ G(μ, σ) (or Y ∼ N(μ, σ²)) means that Y has a normal distribution with mean μ and standard deviation σ (variance σ²).

Let Y be a variate representing the systolic blood pressure of a randomly selected woman in a large population, grouped or stratified by age. Good models for Y for persons not taking any medication have been found to be:

ages 17-24: Y ∼ G(118, 8)
ages 45-54: Y ∼ G(130, 9)

(a) Plot the probability density functions for the two models on the same graph using R.

(b) For each age group find the probability a randomly selected woman has blood pressure over 140.

(c) Suppose you were given the blood pressures Y₁, ..., Y₂₅ for 25 women from one of the two age groups; you do not know which. Show that you could decide with near certainty which group they came from by considering the value of Ȳ = (Y₁ + ⋯ + Y₂₅)/25.

(d) Suppose you know that very close to 10% of the population of women are in each age group and that you want to devise a rule for “deciding" which age group a woman is in, based on knowing only her blood pressure Y. Good rules will be of the form: for some chosen value y₀, decide that
age is 17-24 iff Y ≤ y₀.
Assuming you are going to use this rule over and over again, find y₀ so that the fraction of decisions which are wrong is minimized.

4. A test for diabetes is based on glucose (sugar) levels Y in the blood, measured after fasting for a specified period of time. For healthy people the glucose levels have close to a Gaussian distribution with mean μ = 5.31 mmol/L and standard deviation σ = 0.58 mmol/L. For untreated diabetics Y is (close to) Gaussian with μ = 11.74 and σ = 3.50.
A diagnostic test for diabetes can be based on a measurement Y for a given individual. Suppose that we use the cutoff value 6.5, and diagnose someone as diabetic (diagnosis to be confirmed by further tests) if Y > 6.5.

(a) If a person is diabetic, what is the probability that they are diagnosed as such by the test?

(b) What is the probability that a nondiabetic is incorrectly diagnosed as being diabetic?

(c) The probability in part (a) is called the sensitivity of the test. We can increase it by choosing
a cutoff value C which is less than 6.5. However, this will increase the probability of
incorrectly diagnosing nondiabetics, as in part (b). Recompute the probabilities in (a) and
(b) if C = 6.0.

(d) What cutoff value C do we need to make the sensitivity 0.98?

(Based on Exercise 6.6 in Wild and Seber, 1999)

5. Normal (Gaussian) approximation for binomial probabilities. If Y ∼ Bin(n, p) and n is large, the Gaussian distribution can be used to calculate approximate probabilities for Y. In particular,
$$P(Y \le y) \approx F\!\left(\frac{y - np}{\sqrt{np(1-p)}}\right) \qquad y = 0, 1, \ldots, n \qquad (1.8)$$
where F(z) is the cumulative distribution function (cdf) for the G(0, 1) distribution. A slightly better approximation is to replace y with y + 0.5 in (1.8).

A mail campaign to sign up new customers for a particular credit card has in the recent past had a success rate of about p = 0.012 (i.e. about 1.2% of persons contacted sign up for the card).

(a) If 140,000 persons are sent the offer to sign up, what is the probability at least 1600 new customers are obtained, if p = 0.012?

(b) Suppose 2000 new customers were actually obtained. Would you conclude that this campaign was more successful than other recent ones? Explain your answer.

6. Computer-generated (pseudo) random numbers. Consider a procedure designed to generate sequences of digits Yᵢ (i = 1, 2, ...) such that for each i

P(Yᵢ = y) = 0.1, y = 0, 1, ..., 9

and such that Y₁, Y₂, ... are statistically independent.

(a) Suggest two things that you might do in order to assess whether a given computer procedure satisfies the above conditions. (Assume that you can use the procedure to generate, say, 1000 digits and that you will base your assessment on these values.)

(b) The next problem gives such a procedure, but it is theoretically a little complicated to work out its distribution mathematically. However, the distribution can be closely approximated by computer simulation. You can generate a sequence Y₁, Y₂, ..., Yₙ of length n by the R command y <- sample(0:9, n, replace=T).

Generate a sequence of 1000 digits and plot a histogram of the data. (You can use it to check, for example, whether runs of odd and even digits are consistent with the randomness conditions.)

7. A runs test for randomness


Consider a process which produces binary sequences of 0's and 1's. The process is supposed to have the property that for a sequence Y₁, Y₂, ..., Yₙ of arbitrary length n,

P(Yᵢ = 1) = p = 1 − P(Yᵢ = 0), i = 1, ..., n

Y₁, ..., Yₙ are mutually independent.

In probability this is called a Bernoulli model.

One way to test whether a sequence “looks” like it could have come from a Bernoulli model
is to consider the number of runs (maximal subsequences of 0’s or 1’s) in the sequence; let the
random variable R denote the number of runs.

(a) Consider a randomly generated sequence from the Bernoulli model and suppose there were
a 0's and b 1's (a + b = n). Try to prove that, conditional on a and b,
$$P(R = r) = \begin{cases} \dfrac{2\dbinom{a-1}{k-1}\dbinom{b-1}{k-1}}{\dbinom{a+b}{a}} & r = 2k \\[3ex] \dfrac{\dbinom{a-1}{k-1}\dbinom{b-1}{k} + \dbinom{a-1}{k}\dbinom{b-1}{k-1}}{\dbinom{a+b}{a}} & r = 2k+1 \end{cases}$$

(b) Part (a) is challenging. An approximation to the distribution of R which is not conditional on a and b may be based on the Central Limit Theorem. Let Y₁, ..., Yₙ be a Bernoulli sequence of 0's and 1's and define the random variables
$$U_1 = 1, \qquad U_i = \begin{cases} 0 & \text{if } Y_i = Y_{i-1} \\ 1 & \text{if } Y_i \ne Y_{i-1} \end{cases} \qquad i = 2, \ldots, n.$$
Note that $R = \sum_{i=1}^{n} U_i$ (why?) and use this to find E(R) and Var(R) in terms of n and p, where p = P(Yᵢ = 1) = 1 − P(Yᵢ = 0).

(c) By the Central Limit Theorem, Z = [R − E(R)]/Var(R)^{1/2} has a limiting G(0, 1) distribution as n → ∞. Use this to find the approximate probability of 20 or fewer runs in a Bernoulli sequence with n = 100, p = 0.5.

(d) Outline how you could approximate the distribution of R in part (b) by simulation. For the case n = 10, p = 0.5 (i.e. sequences of length 10) generate 1000 sequences and obtain the R value for each one. Plot these in a histogram.

8. The data below show the lengths (in cm) of male and female coyotes captured in Nova Scotia.

Females
93.0 97.0 92.0 101.6 93.0 84.5 102.5 97.8 91.0 98.0 93.5 91.7
90.2 91.5 80.0 86.4 91.4 83.5 88.0 71.0 81.3 88.5 86.5 90.0
84.0 89.5 84.0 85.0 87.0 88.0 86.5 96.0 87.0 93.5 93.5 90.0
85.0 97.0 86.0 73.7

Males
97.0 95.0 96.0 91.0 95.0 84.5 88.0 96.0 96.0 87.0 95.0 100.0
101.0 96.0 93.0 92.5 95.0 98.5 88.0 81.3 91.4 88.9 86.4 101.6
83.8 104.1 88.9 92.0 91.0 90.0 85.0 93.5 78.0 100.5 103.0 91.0
105.0 86.0 95.5 86.5 90.5 80.0 80.0

(a) Obtain relative frequency histograms of the data for the females and the males using R.
(b) Compute the sample mean ȳ and standard deviation s_y for the female and male coyotes. Assuming μ = ȳ and σ = s_y, plot the pdf's for Gaussian distributions G(μ, σ) over top of the histograms for the females and males. (Note: if you have n measurements y₁, ..., yₙ then $\bar{y} = \sum_{i=1}^{n} y_i / n$ and $s_y^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2$.) (Based on Table 2.3.2 in Wild and Seber 1999)

9. The data below come from a study of the completion times for a task experienced by computer
users in a multiuser system. Each data point (x; y) is the result of an experiment in which x
terminals connected to the same server all initiated the same task. The variable y is the average
time per task.
x y x y x y x y
40 9.9 40 11.9 50 15.1 65 18.6
50 17.8 10 5.5 30 13.3 65 19.8
60 18.4 30 11.0 65 21.8
45 16.5 20 8.1 40 13.8

(a) Make a scatterplot of the data.


(b) Which variable is the response and which is the explanatory variable?
(Based on Wild and Seber 1999, Table 3.1.2)

10. The R system contains a lot of interesting data sets. Here's how to look at a data set called wtloss; it shows the weight on each day of a special diet for a very obese 48-year-old male who weighed 184.35 kg before starting the diet. The data set is in the MASS package which is part of R. First you need the command
> library(MASS)
Then the command
> wtloss
will display a data frame called wtloss which has 52 rows and two columns. Column 1 gives the day of the diet and Column 2 the person's weight on that day. Obtain a scatterplot of weight vs. day by the command
> plot(wtloss$Days, wtloss$Weight, xlab="Day", ylab="Weight")

Does the weight appear to be a linear function of days on diet? Why would you expect that it
would not be linear if a long time period is involved?
MODEL FITTING, MAXIMUM
LIKELIHOOD ESTIMATION, AND
MODEL CHECKING

2.1 Statistical Models and Probability Distributions

A statistical model is a mathematical model that incorporates probability1 in some way. As described
in Chapter 1, our interest here is in studying variability and uncertainty in populations and processes
and drawing inferences where warranted in the presence of this uncertainty. This will be done by con-
sidering random variables that represent characteristics of the units or individuals in the population or
process, and by studying the probability distributions of these random variables. It is very important to
be clear about what the “target" population or process is, and exactly how the variables being consid-
ered are defined and measured. Chapter 3 discusses these issues. You have already seen some examples
in Chapter 1, and been reminded of material on random variables and probability distributions that is
taught in earlier courses.
A preliminary step in probability and statistics is the choice of a probability model to suit a given
application. The choice of a model is usually driven by some combination of three factors:

1. Background knowledge or assumptions about the population or process which lead to certain
distributions.

2. Past experience with data from the population or process, which has shown that certain distribu-
tions are suitable.

3. Current data, against which models can be assessed.


1
The material in this section is largely a review of material in a previous probability course. A more complete review is
provided in an appendix.


In probability theory, there is a lot of emphasis on factor 1 above, and there are many “families"
of probability distributions that describe certain types of situation. For example, binomial and Poisson
distributions were derived as models for outcomes in repeated trials and for the random occurrence
of events in time or space, respectively. The normal (Gaussian) distribution, on the other hand, is
often used to represent the distributions of continuous measurements such as the heights or weights
of individuals, based largely on past experience that such models are suitable and mathematical
convenience, more than on factor 1 above.
In choosing a model we usually consider families of probability distributions. To be specific, let us suppose that for some discrete random variable Y we consider a family whose probability function depends on one or more parameters θ:

P(Y = y) = f(y; θ) for y ∈ R

where R is a countable (i.e. discrete) set of real numbers, the range of the random variable Y. In order to apply the model to a specific problem we require a value for θ; the selection of a value based on the data (let's call it θ̂) is often referred to as “fitting" the model or as “estimating" the value of θ. The next section gives a way to do this.
Most applications require a series of steps in the formulation (the word “specification" is also used)
of a model. In particular, we often start with some family of models in mind, but find after examining
data and fitting the model that it is unsuitable in certain respects. (Methods for checking the suitability
of a model will be discussed in Section 2.4.) We then try out other models, and perhaps look at more
data, in order to work towards a satisfactory model. Doing this is usually an iterative process, which is
sometimes represented by diagrams such as

Collect and Examine data
          ↓
Propose a (revised?) Model
          ↓          ↑
     Fit model  →  Check model
          ↓
   Draw Conclusions

Statistics devotes considerable effort to the steps of this process. However, here we will focus on settings in which the models are not too complicated, so that model formulation problems are minimized. There are several distributions that you should review before continuing since they will appear frequently in these notes. There is a little more detail about them in the appendix and in the Stat 230 notes. You should also consult the table of distributions at the end of these notes for a condensed summary of their properties, including their moment generating functions and their moments.

Binomial Distribution
The discrete random variable (r.v.) Y has a binomial distribution (Y ∼ Bin(n, θ)) if its probability function is of the form
$$f(y; \theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y} \qquad \text{for } y = 0, 1, \ldots, n \qquad (2.2)$$
where θ is a parameter with 0 < θ < 1. Then E(Y) = nθ and Var(Y) = nθ(1 − θ).

Poisson Distribution
The discrete random variable Y has a Poisson distribution (Y ∼ Poisson(θ)) if its p.f. is of the form
$$f(y; \theta) = \frac{\theta^y}{y!} e^{-\theta} \qquad \text{for } y = 0, 1, 2, \ldots$$
where θ is a parameter with θ > 0. Then E(Y) = θ and Var(Y) = θ.


Exponential Distribution
The continuous random variable Y has an exponential distribution (Y ∼ Exp(θ)) if its probability density function is of the form
$$f(y; \theta) = \frac{1}{\theta} e^{-y/\theta} \qquad \text{for } y > 0$$
where θ is a parameter with θ > 0. Be careful not to confuse the notation Y ∼ Exp(θ), which specifies the exponential distribution for the random variable Y, with a common notation for the exponential function². Then E(Y) = θ and Var(Y) = θ².

Gaussian (normal) Distribution


The continuous random variable Y has a Gaussian (also called a normal) distribution if its probability density function is of the form
$$f(y; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2} \qquad \text{for } -\infty < y < \infty$$
where μ and σ are parameters, with −∞ < μ < ∞ and σ > 0. Then E(Y) = μ and Var(Y) = σ², so that sd(Y) = σ. We write either Y ∼ G(μ, σ) or Y ∼ N(μ, σ²) to indicate that Y has this distribution. Note that in the former case, G(μ, σ), the second parameter is the standard deviation, whereas in the latter, N(μ, σ²), we specify the variance σ². Most software syntax, including R's, requires that you input the standard deviation for the parameter. The normal (Gaussian) distribution provides a suitable model for the distribution of measurements on characteristics like the size or weight of individuals in certain populations, but is also used in many other settings. It is particularly useful in finance where it is the basis for the most common model for asset prices, exchange rates, interest rates, etc.

² Sometimes eˣ is denoted by exp(x).

Multinomial Distribution
This is a multivariate distribution in which the discrete random variables Y₁, ..., Y_k (k ≥ 2) have the joint p.f.
$$P(Y_1 = y_1, \ldots, Y_k = y_k) = f(y_1, \ldots, y_k; \theta) = \frac{n!}{y_1! y_2! \cdots y_k!}\, \theta_1^{y_1} \theta_2^{y_2} \cdots \theta_k^{y_k}$$
where each yᵢ (for i = 1, ..., k) is an integer between 0 and n, and they satisfy the condition $\sum_{i=1}^{k} y_i = n$. The elements of the parameter vector θ = (θ₁, θ₂, ..., θ_k) satisfy 0 < θᵢ < 1 for i = 1, ..., k, and $\sum_{i=1}^{k} \theta_i = 1$. This distribution is a generalization of the binomial distribution. It arises when there are repeated independent trials, where each trial results in one of k types of outcomes (call them types 1, ..., k), and the probability that a type i outcome occurs is θᵢ. If Yᵢ (where i = 1, ..., k) is the number of times that type i occurs in a sequence of n trials, then (Y₁, ..., Y_k) have the joint distribution above. This is indicated by (Y₁, ..., Y_k) ∼ Mult(n; θ).
Since $\sum_{i=1}^{k} Y_i = n$ we can if we wish rewrite f(y₁, ..., y_k; θ) using only k − 1 variables, say y₁, ..., y_{k−1} (with y_k replaced by n − y₁ − ⋯ − y_{k−1}). We see that the multinomial distribution with k = 2 is just the binomial distribution, where the S outcomes are type 1 (say) and the F outcomes are type 2.
We will also consider models that include explanatory variables, or covariates. For example, suppose that the response variable Y is the weight (in kg) of a randomly selected female in the age range 16-25, in some population. A person's weight is related to their height, so we might want to study this relationship. A way to do this is to consider females with a given height x (say in meters), and to propose that the distribution of Y, given x, is Gaussian, G(α + βx, σ). That is, we are proposing that the average (expected) weight of a female depends linearly on her height x; we write this as

E(Y | x) = α + βx

Such models are considered in Chapters 6-8.

We now turn to the problem of fitting a model. This requires estimating or assigning numerical values to the parameters in the model (for example, μ, σ in the Gaussian model or θ in an exponential model).

2.2 Estimation of Parameters (Model Fitting)


Suppose a probability distribution that serves as a model for some random process depends on an unknown parameter θ (possibly a vector). In order to use the model we have to “estimate" or specify a value for θ. To do this we usually rely on some data that have been collected for the random variable in question. It is important that such data be collected carefully, and we consider this issue in Chapter 3. For example, suppose that the random variable Y represents the weight of a randomly chosen female in some population, and that we were considering a Gaussian model, Y ∼ G(μ, σ). Since E(Y) = μ, we might decide to randomly select, say, 10 females from the population, to measure their weights y₁, y₂, ..., y₁₀, and then to estimate μ by the average,
$$\hat{\mu} = \frac{1}{10}\sum_{i=1}^{10} y_i \qquad\qquad (2.3)$$
This seems sensible (why?) and similar ideas can be developed for other parameters; in particular, note that σ must also be estimated, and you might think about how you could use y₁, ..., y₁₀ to do this. (Hint: what does σ or σ² represent in the Gaussian model?) Before we move on, note that although we are estimating the parameter μ we did not write $\mu = \frac{1}{10}\sum_{i=1}^{10} y_i$. Why did we introduce a special notation μ̂? This serves a dual purpose, both to remind you that $\frac{1}{10}\sum_{i=1}^{10} y_i$ is not exactly equal to the unknown value of the parameter μ, but also to indicate that μ̂ is a quantity derived from the data yᵢ, i = 1, 2, ..., 10 and is therefore random. A different draw of the sample yᵢ, i = 1, 2, ..., 10 will result in a different value for the random variable μ̂.
Instead of ad hoc approaches to estimation as in (2.3), it is desirable to have a general method for estimating parameters. Maximum likelihood is a very general method, which we now describe.
Let the (vector) random variable Y represent potential data that will be used to estimate θ, and let y represent the actual observed data that are obtained in a specific application. Note that to apply the maximum likelihood method, we must know (or make assumptions about) how the data y were collected. It is usually assumed here that the data consist of measurements on a random sample of population units. The likelihood function for θ is then defined as

L(θ) = P(Y = y; θ) for θ ∈ Ω

where the parameter space Ω is the set of possible values for θ. Thus the likelihood function is the probability of observing the data y, considered as a function of the parameter. Obviously values of the parameter that render our observation more probable would seem more credible than those that render it less probable, so values of θ for which L(θ) is large are more consistent with the observation y. The value θ̂ that maximizes L(θ) for given data y is called the maximum likelihood estimate (MLE)³ of θ. This seems like a “sensible" approach, and it turns out to have very good properties.

Example 2.2.1 (a public opinion poll)4 .


We are surrounded by polls. They guide the policies of our political leaders, the products that are developed by manufacturers, and increasingly the content of the media. For example, the poll⁵ in Figure 2.11 was conducted by the Harris/Decima company under contract to the CAUT (Canadian Association of University Teachers). It is one of a series of semi-annual polls on Post-Secondary Education and Canadian Public Opinion, this one conducted in November 2010. Harris/Decima uses a telephone poll of 2000 “representative" adults. Twenty-six percent of respondents agreed and 48% disagreed with the following statement: “University and college teachers earn too much".
3
We will often distinguish between the random variable, the maximum likelihood estimator, which is the function of the
data in general, and its numerical value for the data at hand, referred to as the maximum likelihood estimate.
4
See the corresponding video at www.watstat.ca
5
http://www.caut.ca/uploads/Decima_Fall_2010.pdf

Figure 2.11: Results of the Harris/Decima poll. The two bars are from polls conducted Nov. 9, 2010 and
Nov 10, 2010 respectively.

Harris/Decima declared their results were accurate to within ±2.2 percent, 19 times out of 20, but the margin of error for regional, demographic or other subgroups is wider. What does this mean and where did these estimates and intervals come from? Suppose that the random variable Y represents the number of individuals who, in a randomly selected group of n persons, agreed with the statement. It is assumed that Y is closely modelled by a binomial distribution:
$$P(Y = y) = f(y; \theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y} \qquad y = 0, 1, \ldots, n$$
where θ represents the fraction of the whole population that agree. In this case, if we select a random sample of n = 2000 persons and obtain their views, the observed data are y = 520, the number that agree. Thus the likelihood function is given by
$$L(\theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y}$$
or in this case
$$L(\theta) = \binom{2000}{520} \theta^{520} (1-\theta)^{2000-520}. \qquad (2.2.1)$$
It is easy to see that L(θ) is maximized by the value θ̂ = y/n. (You should show this.) The value of this maximum likelihood estimate is 520/2000 or 26%. This is easily seen from a graph of the likelihood function (2.2.1) given in Figure 2.12. From the graph it can also be seen that the interval suggested by the pollsters, 26 ± 2.2% or (23.8, 28.2), is a reasonable interval for the parameter θ since it seems to contain most of the values of θ with large values of the likelihood L(θ). We will return to the construction of such interval estimates later.
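A minimal R sketch of how a plot like Figure 2.12 can be produced (the grid of θ values is an arbitrary choice):

theta <- seq(0.20, 0.32, 0.001)        # grid of parameter values near y/n
L <- dbinom(520, 2000, theta)          # L(theta) = P(Y = 520; theta)
plot(theta, L, type="l", xlab="theta", ylab="L(theta)")
theta[which.max(L)]                    # maximizing value; essentially y/n = 0.26
abline(v=c(0.238, 0.282), lty=2)       # the pollsters' interval, 26 +/- 2.2%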
Example 2.2.2 Suppose that the random variable Y represents the number of persons infected with
the human immunodeficiency virus (HIV) in a randomly selected group of n persons. Again assume

Figure 2.12: Likelihood function for the Harris/Decima poll and corresponding interval estimate for θ

that Y is modelled by a binomial distribution:
$$P(Y = y) = f(y; \theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y} \qquad y = 0, 1, \ldots, n$$
where θ represents the fraction of the population that are infected. In this case, if we select a random sample of n persons and test them for HIV, the observed data are y, the observed number infected. Thus
$$L(\theta) = \binom{n}{y} \theta^y (1-\theta)^{n-y} \qquad (2.2.1)$$
and again L(θ) is maximized by the value θ̂ = y/n.
The likelihood function's basic properties, for example where its maximum occurs and its shape, are not affected if we multiply L(θ) by a constant. Indeed it is not the absolute value of the likelihood that is important but the relative values at two different values of the parameter, e.g. L(θ₁)/L(θ₂). You might think of this ratio as how much more or less consistent the data are with the parameter value θ₁ versus θ₂. The ratio L(θ₁)/L(θ₂) is also unaffected if we multiply L(θ) by a constant. In view of this we might define the likelihood as P(Y = y; θ) or any constant multiple of it, so, for example, we could drop the term $\binom{n}{y}$ in (2.2.1) and define L(θ) = θʸ(1 − θ)ⁿ⁻ʸ. This function and (2.2.1) are maximized by the same value θ̂ = y/n and have the same shape. Indeed we might rescale the likelihood function by dividing through by the maximum so that the new maximum is 1. This rescaled version is called

the relative likelihood function

R(θ) = L(θ)/L(θ̂) for θ ∈ Ω.

It is also convenient to define the log likelihood function,

ℓ(θ) = log L(θ) for θ ∈ Ω.

Note that θ̂ also maximizes ℓ(θ). (Why?) Because functions are often maximized by setting their derivatives equal to zero⁶, we can usually obtain θ̂ by solving the equation(s)
$$\frac{d\ell}{d\theta} = 0. \qquad (2.2.2)$$
For example, from L(θ) = θʸ(1 − θ)ⁿ⁻ʸ we get ℓ(θ) = y log θ + (n − y) log(1 − θ). Thus
$$\frac{d\ell}{d\theta} = \frac{y}{\theta} - \frac{n-y}{1-\theta}$$
and solving dℓ/dθ = 0 gives θ̂ = y/n.

In many applications the data Y are assumed to consist of a random sample Y₁, ..., Yₙ from some process or population, where each Yᵢ has the probability density function (or probability function) f(y; θ). In this case y = (y₁, ..., yₙ) and
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta). \qquad (2.2.3)$$
(You should recall from probability that if Y₁, ..., Yₙ are independent random variables then their joint probability density function is the product of their individual probability density functions.) In addition, if we have independent data Y₁ and Y₂ about θ from two independent studies, then since P(Y₁ = y₁, Y₂ = y₂) = P(Y₁ = y₁)P(Y₂ = y₂) by independence, we can obtain the “combined" likelihood function L(θ) based on Y₁ and Y₂ together as

L(θ) = L₁(θ)L₂(θ)

where L_j(θ) = P(Y_j = y_j; θ), j = 1, 2.

Example 2.2.3 Suppose that the random variable Y represents the lifetime of a randomly selected light bulb in a large population of bulbs, and that Y follows an exponential distribution with probability density function
$$f(y; \theta) = \frac{1}{\theta} e^{-y/\theta} \qquad \text{for } y > 0,$$
where θ > 0. If a random sample of light bulbs is tested and gives lifetimes y₁, ..., yₙ, then the likelihood function for θ is
$$L(\theta) = \prod_{i=1}^{n} \frac{1}{\theta} e^{-y_i/\theta} = \theta^{-n} \exp\!\left(-\sum_{i=1}^{n} y_i \big/ \theta\right).$$
Thus
$$\ell(\theta) = -n\log\theta - \frac{1}{\theta}\sum_{i=1}^{n} y_i$$
and solving dℓ/dθ = 0, we get
$$\hat{\theta} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y}.$$
It is easily checked that θ̂ = ȳ maximizes ℓ(θ) and so it is the MLE.

⁶ Can you think of an example of a continuous function f(x) defined on the interval [0, 1] for which the maximum max₀≤x≤1 f(x) is NOT found by setting f′(x) = 0?
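A quick numerical check of this result, using simulated lifetimes (a sketch only; any exponential sample would do, and rexp() is parameterized by the rate 1/θ):

set.seed(1)
y <- rexp(30, rate=1/60)                 # 30 lifetimes with true mean theta = 60
mean(y)                                  # the MLE, theta-hat = y-bar
negloglik <- function(theta) length(y)*log(theta) + sum(y)/theta
optimize(negloglik, c(1, 500))$minimum   # numerical minimum; agrees with mean(y)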

Example 2.2.2 revisited. Sometimes the likelihood function for a given set of data can be constructed in more than one way. For the random sample of n persons who are tested for HIV, for example, we could define for i = 1, ..., n

Yᵢ = I(person i tests positive for HIV).

(Note: I(A) is the indicator function; it equals 1 if A is true and 0 if A is false.) In this case the p.f. for Yᵢ is Bin(1, θ) with

f(yᵢ; θ) = θ^{yᵢ}(1 − θ)^{1−yᵢ} for yᵢ = 0, 1

and the likelihood function is
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \theta) = \theta^{\sum y_i}(1-\theta)^{\,n - \sum y_i} = \theta^{y}(1-\theta)^{n-y}$$
where $y = \sum_{i=1}^{n} y_i$. This is the same likelihood function as we obtained in Example 2.2.1, where we used the fact that $Y = \sum_{i=1}^{n} Y_i$ has a binomial distribution, Bin(n, θ).

Example 2.2.4 As an example involving more than one parameter, suppose that the random variable Y has a normal (Gaussian) distribution with probability density function
$$f(y; \mu, \sigma) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left[-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2\right] \qquad \text{for } -\infty < y < \infty.$$
(Note: exp(x) means the same as eˣ.)

The random sample y₁, ..., yₙ then gives, with θ = (μ, σ),
$$L(\theta) = \prod_{i=1}^{n} f(y_i; \mu, \sigma) = (2\pi)^{-n/2} \sigma^{-n} \exp\left[-\frac{1}{2}\sum_{i=1}^{n}\left(\frac{y_i - \mu}{\sigma}\right)^2\right],$$
where −∞ < μ < ∞ and σ > 0. Thus
$$\ell(\theta) = \ell(\mu, \sigma) = -n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - \mu)^2 - (n/2)\log(2\pi).$$
We wish to maximize ℓ(μ, σ) with respect to both parameters μ, σ. Solving⁷ (simultaneously)
$$\frac{\partial \ell}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - \mu) = 0$$
$$\frac{\partial \ell}{\partial \sigma} = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^{n}(y_i - \mu)^2 = 0,$$
we find that the MLE is θ̂ = (μ̂, σ̂), where
$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} y_i = \bar{y} \qquad\qquad \hat{\sigma} = \left[\frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\right]^{1/2}.$$

⁷ To maximize a function of two variables, set the derivative with respect to each equal to zero. Of course finding values at which the derivatives are zero does not prove this is a maximum. Showing it is a maximum is another exercise in calculus.
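Note that σ̂ divides by n rather than n − 1, so it is not quite the sample standard deviation that R's sd() function returns. A small sketch of the comparison, using arbitrary simulated data:

set.seed(2)
y <- rnorm(25, 10, 3)
mean(y)                                   # MLE of mu
sqrt(mean((y - mean(y))^2))               # MLE of sigma (divides by n)
sd(y) * sqrt((length(y)-1)/length(y))     # the same value, obtained from sd()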

In many applications we encounter likelihood functions which cannot be maximized mathematically and we need to resort to numerical methods. The following example provides an illustration.

Example 2.2.5 The number of coliform bacteria Y in a random sample of water of volume vᵢ ml has close to a Poisson distribution:
$$P(Y = y) = f(y; \theta) = \frac{(\theta v_i)^y}{y!} e^{-\theta v_i}, \qquad y = 0, 1, 2, \ldots \qquad (2.2.4)$$
where θ is the average number of bacteria per millilitre (ml) of water. There is an inexpensive test which can detect the presence (but not the number) of bacteria in a water sample. In this case what we get to observe is not Y, but rather the “presence" indicator I(Y > 0), or
$$Z = \begin{cases} 1 & \text{if } Y > 0 \\ 0 & \text{if } Y = 0. \end{cases}$$

Note that from (2.2.4),

P(Z = 1; θ) = 1 − e^{−θvᵢ} = 1 − P(Z = 0; θ).

Suppose that n water samples, of volumes v₁, ..., vₙ, are selected. Let z₁, ..., zₙ be the presence indicators. The likelihood function is then the product of the P(Zᵢ = zᵢ), or
$$L(\theta) = \prod_{i=1}^{n} (1 - e^{-\theta v_i})^{z_i} (e^{-\theta v_i})^{1 - z_i}$$
and
$$\ell(\theta) = \sum_{i=1}^{n} \left[ z_i \log(1 - e^{-\theta v_i}) - (1 - z_i)\theta v_i \right].$$

We cannot maximize ℓ(θ) mathematically by solving dℓ/dθ = 0, so we will use numerical methods. Suppose for example that n = 40 samples gave data as follows:

vᵢ (ml)            8    4    2    1
no. of samples    10   10   10   10
no. with zᵢ = 1   10    8    7    3

This gives
$$\ell(\theta) = 10\log(1 - e^{-8\theta}) + 8\log(1 - e^{-4\theta}) + 7\log(1 - e^{-2\theta}) + 3\log(1 - e^{-\theta}) - 21\theta.$$
Either maximizing ℓ(θ) numerically for θ > 0, or by solving dℓ/dθ = 0 numerically, we find the MLE to be θ̂ = 0.478. A simple way to maximize ℓ(θ) is to plot it, as shown in Figure 2.13; the MLE can then be found by inspection or, for more accuracy, by iteration using Newton's method⁹.
A few remarks about numerical methods are in order. Aside from a few simple models, it is not possible to maximize likelihood functions explicitly. However, software implementing powerful numerical methods exists which can easily maximize (or minimize) functions of one or more variables. Multi-purpose optimizers can be found in many software packages; in R the function nlm() is powerful and easy to use. In addition, statistical software packages contain special functions for fitting and analyzing a large number of statistical models. The R package MASS (which can be accessed by the command library(MASS)) has a function fitdistr() that will fit many common models. R and other packages are also invaluable for doing arithmetic, graphical presentations, and for manipulation of data.

9
You should recall this from your calculus course
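For instance, the log likelihood in Example 2.2.5 can be maximized in a few lines; the sketch below uses optimize() rather than nlm(), and the search interval is an arbitrary choice:

loglik <- function(theta) {                          # l(theta) from Example 2.2.5
  10*log(1 - exp(-8*theta)) + 8*log(1 - exp(-4*theta)) +
   7*log(1 - exp(-2*theta)) + 3*log(1 - exp(-theta)) - 21*theta
}
optimize(loglik, interval=c(0.01, 5), maximum=TRUE)$maximum   # about 0.478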


Figure 2.13: The log likelihood function ℓ(θ) for Example 2.2.5

2.3 Likelihood Functions From Multinomial Models

Multinomial models are used in many statistical applications. From Section 2.1, the multinomial probability function takes the form (using pᵢ for the probability of a type i outcome instead of θᵢ)
$$f(y_1, \ldots, y_k) = \frac{n!}{y_1! \cdots y_k!} \prod_{i=1}^{k} p_i^{y_i} \qquad \text{for } y_i = 0, 1, \ldots \text{ where } \sum_{i=1}^{k} y_i = n$$
If the pᵢ's are to be estimated from data involving n “trials", of which yᵢ resulted in a type i outcome (i = 1, ..., k), then it seems obvious that

p̂ᵢ = yᵢ/n (i = 1, ..., k)  (2.3.1)

would be a sensible estimate. This can also be shown to be the m.l.e. for p = (p₁, ..., p_k).¹⁰
Example 2.3.1 Each person is one of 4 blood types, labelled A, B, AB and O. (Which type a person is has important consequences, for example in determining to whom they can donate blood for a transfusion.) Let p₁, p₂, p₃, p₄ be the fraction of a population that has types A, B, AB, O, respectively. Now suppose that in a random sample of 400 persons whose blood was tested, the numbers who were types 1 to 4 were y₁ = 172, y₂ = 38, y₃ = 14 and y₄ = 176 (note that y₁ + y₂ + y₃ + y₄ = 400).
Note that the random variables Y₁, Y₂, Y₃, Y₄ that represent the number of type A, B, AB, O persons we might get in a random sample of size n = 400 follow a multinomial distribution, Mult(400; p₁, p₂, p₃, p₄). The m.l.e.'s from the observed data are therefore

p̂₁ = 172/400 = 0.43, p̂₂ = 38/400 = 0.095, p̂₃ = 14/400 = 0.035, p̂₄ = 176/400 = 0.44

¹⁰ The log likelihood can be taken as (dropping the n!/(y₁! ⋯ y_k!) term for convenience) $\ell(p) = \sum_{i=1}^{k} y_i \log p_i$. This is a little tricky to maximize because the pᵢ's satisfy a linear constraint, $\sum p_i = 1$. The Lagrange multiplier method (Calculus III) for constrained optimization allows us to find the solution p̂ᵢ = yᵢ/n (i = 1, ..., k).
(As a check, note that Σ p̂ᵢ = 1.) These give estimates of the population fractions p₁, p₂, p₃, p₄. (Note: studies involving much larger numbers of people put the values of the pᵢ's for Caucasians at close to p₁ = 0.448, p₂ = 0.083, p₃ = 0.034, p₄ = 0.436.)
In some problems the multinomial parameters p₁, ..., p_k may be functions of fewer than k − 1 parameters. The following is an example.

Example 2.3.2 Another way of classifying a person's blood is through their “M-N" type. Each person is one of 3 types, labelled MM, MN and NN and we can let p₁, p₂, p₃ be the fraction of the population that is each of the 3 types. According to a model in genetics, the pᵢ's can be expressed in terms of a single parameter θ for human populations:

p₁ = θ², p₂ = 2θ(1 − θ), p₃ = (1 − θ)²

where θ is a parameter with 0 < θ < 1. In this case we would estimate θ from a random sample giving y₁, y₂ and y₃ persons of types MM, MN and NN by using the likelihood function
$$L(\theta) = \prod_{i=1}^{3} p_i^{y_i} = [\theta^2]^{y_1} [2\theta(1-\theta)]^{y_2} [(1-\theta)^2]^{y_3} = 2^{y_2}\, \theta^{2y_1 + y_2} (1-\theta)^{y_2 + 2y_3}$$
Example 2.4.2 in the next section considers some data for this setting.

2.4 Checking Models


The models used in this course are probability distributions for random variables Y that represent measurements or variates in a population or process. A typical model has probability density function (p.d.f.) f(y; θ) if Y is continuous, or probability function (p.f.) f(y; θ) if Y is discrete, where θ is a vector of parameter values. If a family of models is to be used for some purpose then it is important to check that the model adequately represents the variability in Y. This can be done by comparing the model with random samples y₁, ..., yₙ of y-values from the population or process.
The probability model is supposed to represent the relative frequency of sets of y-values in large samples, so a fundamental check is to compare model probabilities and relative frequencies for a sample. Recall the definition of a histogram in Section 1.3 and let the range of Y be partitioned into intervals I_j = [a_{j−1}, a_j), j = 1, ..., k. From our model f(y; θ) we can compute the values

p̂_j = P(a_{j−1} ≤ Y < a_j)  j = 1, ..., k.  (2.4.1)



If the model is suitable, these values should be “close" to the values of the relative frequencies p̃_j = f_j/n in the sample. (Recall that f_j is the number of y-values in the sample that are in the interval I_j.) This method of comparison works for either discrete or continuous r.v.'s. An example of each type follows.

Example 2.4.1 Suppose that an exponential model for a positive-valued continuous random variable Y has been proposed, with probability density function
$$f(y) = 0.01 e^{-0.01y}, \qquad y > 0 \qquad (2.4.2)$$
and that a random sample of size n = 20 has given the following values y₁, ..., y₂₀ (rounded to the nearest integer):

10, 32, 15, 26, 157, 99, 109, 88, 39, 118,
61, 104, 77, 144, 338, 72, 180, 63, 155, 140

For illustration purposes, let us partition the range of Y into 4 intervals [0, 30), [30, 70), [70, 140), [140, ∞). The probabilities p̂_j from the model (2.4.2) are, for j = 1, ..., 4,
$$\hat{p}_j = \int_{a_{j-1}}^{a_j} 0.01 e^{-0.01y}\, dy = e^{-0.01 a_{j-1}} - e^{-0.01 a_j}$$
and we find p̂₁ = 0.261, p̂₂ = 0.244, p̂₃ = 0.250, p̂₄ = 0.247 (the numbers add to 1.002 and not 1.0 because of round-off). The relative frequencies p̃_j = f_j/20 from the random sample are p̃₁ = 0.15, p̃₂ = 0.25, p̃₃ = 0.30, p̃₄ = 0.30. These agree fairly well with the model-based values p̂_j, but we might wonder about the first interval. We discuss how “close" we can expect the agreement to be following the next example. With a sample of this small a size, the difference between p̃₁ and p̂₁ represented here does not suggest that the model is inadequate.
This example is an artificial numerical illustration. In practice we usually want to check a family
of models for which one or more parameter values is unknown. Problem 2 in Chapter 1 discusses an
application involving the exponential distribution where this is the case. When parameter values are
unknown we first estimate them using maximum likelihood, and then check the resulting model. The
following example illustrates this procedure.

Example 2.4.2 In Example 2.3.2 we considered a model from genetics in which the probability that a
person is of blood type MM, MN or NN is $p_1 = \theta^2$, $p_2 = 2\theta(1-\theta)$, $p_3 = (1-\theta)^2$, respectively. Suppose
a random sample of 100 individuals gave 17 of type MM, 46 of type MN, and 37 of type NN.
The relative frequencies from the sample are $\tilde{p}_1 = 0.17$, $\tilde{p}_2 = 0.46$, $\tilde{p}_3 = 0.37$, where we use the
obvious "intervals" $I_1 = \{\text{person is MM}\}$, $I_2 = \{\text{person is MN}\}$, $I_3 = \{\text{person is NN}\}$. (If we wish,
we can define the random variable $Y$ to be 1, 2 or 3 according to whether a person is MM, MN or NN.)

Since $\theta$ is unknown, we must estimate it before we can check the family of models given above. From
Example 2.3.2, the likelihood function for $\theta$ from the observed data is of multinomial form:
$$L(\theta) = [\theta^2]^{17}\,[2\theta(1-\theta)]^{46}\,[(1-\theta)^2]^{37}$$
where $0 < \theta < 1$. Collecting terms, we find
$$\ell(\theta) = \log L(\theta) = 80 \log \theta + 120 \log(1-\theta) + \text{constant}$$
and $d\ell/d\theta = 0$ gives the m.l.e. $\hat\theta = 0.40$. The model-based probabilities for $I_1, I_2, I_3$ are thus
$$\hat{p}_1 = \hat\theta^2 = 0.16, \qquad \hat{p}_2 = 2\hat\theta(1-\hat\theta) = 0.48, \qquad \hat{p}_3 = (1-\hat\theta)^2 = 0.36.$$
These agree quite closely with $\tilde{p}_1, \tilde{p}_2, \tilde{p}_3$, and on this basis the model seems satisfactory.
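The same calculation is easy to reproduce in R; the sketch below maximizes the log likelihood numerically with optimize() instead of using calculus, and then compares the fitted probabilities with the relative frequencies.

loglik <- function(theta) 80 * log(theta) + 120 * log(1 - theta)     # log likelihood (constant omitted)
theta.hat <- optimize(loglik, c(0.01, 0.99), maximum = TRUE)$maximum # about 0.40
phat <- c(theta.hat^2, 2 * theta.hat * (1 - theta.hat), (1 - theta.hat)^2)
rbind(model = round(phat, 2), sample = c(0.17, 0.46, 0.37))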

The method above suffers from some arbitrariness in how the $I_j$'s are defined and in what constitutes
"close" agreement between the model-based probabilities $\hat{p}_j$ and the relative frequencies $\tilde{p}_j = f_j/n$.
Some theory that provides a formal comparison will be given later in Chapter 7, but for now
we will just rely on the following simple guideline. If we consider the frequencies $f_j$ from the sample
as random variables, then they have a multinomial distribution, $\mathrm{Mult}(n; p_1, \ldots, p_k)$, where $p_j$ is the
"true" value of $P(a_{j-1} \le Y < a_j)$ in the population. In addition, any single $f_j$ has a binomial
distribution, $\mathrm{Bin}(n, p_j)$. This means we can assess how variable either $f_j$ or $\tilde{p}_j = f_j/n$ is likely to be
in a random sample. From Stat 230, if $n$ is large enough then the distribution of $f_j$ is approximately
normal, $N(np_j,\, np_j(1-p_j))$. It then follows that
$$P\!\left(np_j - 1.96\sqrt{np_j(1-p_j)} \le f_j \le np_j + 1.96\sqrt{np_j(1-p_j)}\right) \approx 0.95$$

and thus (dividing by $n$ and rearranging)
$$P\!\left(-1.96\sqrt{\frac{p_j(1-p_j)}{n}} \le \tilde{p}_j - p_j \le 1.96\sqrt{\frac{p_j(1-p_j)}{n}}\right) \approx 0.95 \qquad (2.4.3)$$
This allows us to get a rough idea of what constitutes a large discrepancy between an observed relative
frequency $\tilde{p}_j$ and a true probability $p_j$. For example, when $n = 20$ and $p_j$ is about 0.25, as in Example
2.4.1, we get from (2.4.3) that
$$P(-0.19 \le \tilde{p}_j - p_j \le 0.19) \approx 0.95.$$
That is, it is quite common for $\tilde{p}_j$ to differ from $p_j$ by up to 0.19. The discrepancy between $\tilde{p}_1 = 0.15$ and
$\hat{p}_1 = 0.259$ in Example 2.4.1 is consequently not unusual and does not suggest the model is inadequate.

For larger sample sizes, $\tilde{p}_j$ will tend to be closer to the true value $p_j$. For example, with $n = 100$
and $p_j = 0.5$, (2.4.3) gives
$$P(-0.10 \le \tilde{p}_j - p_j \le 0.10) \approx 0.95.$$
Thus in Example 2.4.2, there is no indication that the model is inadequate. (We are assuming here that
the model-based values $\hat{p}_j$ behave like the true probabilities as far as (2.4.3) is concerned. This is not quite
correct, but (2.4.3) will still serve as a rough guide. We are also ignoring the fact that we have singled out the
largest of the discrepancies $\tilde{p}_j - \hat{p}_j$, so the binomial distribution is not quite correct either. Chapter 7 shows how to
develop checks of the model that get around these points.)

Graphical Checks
A graph that compares relative frequencies and model-based probabilities provides a nice picture
of the "fit" of the model to the data. Two widely used plots are based on histograms and on
cumulative frequency functions $\tilde{F}(y)$, which are also called empirical c.d.f.'s.
The histogram plot for a continuous random variable $Y$ is as follows. Plot a relative frequency
histogram of the random sample $y_1, \ldots, y_n$ and superimpose on it a plot of the probability density
function $f(y; \theta)$ for the proposed model. The area under the probability density function between
values $a_{j-1}$ and $a_j$ equals $P(a_{j-1} \le Y < a_j)$, so this should agree well with the area of the rectangle
over $[a_{j-1}, a_j)$, which equals $\tilde{p}_j$. The plots in Figure 1.6.1 for the height and body-mass index data in
Section 1.6 are of this type.
For a discrete random variable $Y$ we plot a histogram for the probability distribution $f(y; \theta)$ and
superimpose a relative frequency histogram for the data, using the same intervals $I_j$ in each case.
A second graphical procedure is to plot the cumulative frequency function or empirical c.d.f.
(ECDF) $\tilde{F}(y)$ described by (1.3.1) in Section 1.3 and then to superimpose on it a plot of the model-based
c.d.f., $F(y; \theta)$. If the model is suitable, the two curves should not be too far apart.

Example 2.4.3 Figure 2.14 shows plots for the data on female heights from Section 1.6. We show (a)
a relative frequency histogram, with the $G(1.62, 0.0637)$ probability density function superimposed
(the m.l.e.'s from the data were $\hat\mu = 1.62$, $\hat\sigma = 0.0637$), and (b) a plot of the ECDF with the
$G(1.62, 0.0637)$ c.d.f. superimposed. The two types of plots give complementary but consistent pictures.
An advantage of the distribution function comparison is that the exact heights in the sample are
used, whereas in the histogram and probability density function plot the data are grouped in forming the
histogram. However, the histogram and probability density function show the distribution of heights
more clearly. Neither plot suggests strongly that the Gaussian model is unsatisfactory. Note that the R
function ecdf() can be used to obtain $\tilde{F}(y)$.
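In R, plots like those in Figure 2.14 can be produced with a few lines; this sketch assumes the female heights are stored in a vector called height.

mu.hat <- mean(height)
sigma.hat <- sqrt(mean((height - mu.hat)^2))     # m.l.e. of sigma (divisor n, not n - 1)
hist(height, freq = FALSE, xlab = "height (m)")  # relative frequency histogram
curve(dnorm(x, mu.hat, sigma.hat), add = TRUE)   # superimposed Gaussian p.d.f.
plot(ecdf(height), xlab = "height (m)")          # empirical c.d.f.
curve(pnorm(x, mu.hat, sigma.hat), add = TRUE)   # superimposed Gaussian c.d.f.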


Figure 2.14: Model and Data Comparisons for Female Heights

2.5 Problems
1. In modelling the number of transactions of a certain type received by a central computer for a
company with many on-line terminals, the Poisson distribution can be used. If the transactions
arrive at random at the rate of $\lambda$ per minute, then the probability of $x$ transactions in a time interval
of length $t$ minutes is
$$P(X = x) = \frac{(\lambda t)^x}{x!}\, e^{-\lambda t}, \qquad x = 0, 1, 2, \ldots$$
(a) The numbers of transactions received in 10 separate one-minute intervals were as follows:
8, 3, 2, 4, 5, 3, 6, 5, 4, 1.
Write down the likelihood function for $\lambda$ and find the m.l.e. $\hat\lambda$.
(b) Estimate the probability that during a two-minute interval, no transactions arrive.
(c) Use the R function rpois() with the value $\lambda = 4.1$ to simulate the number of transactions
received in 100 one-minute intervals. Calculate the sample mean and variance; are they
approximately the same? (Note that $E(X) = \mathrm{Var}(X) = \lambda t$ for the Poisson model.)

2. Consider the following two experiments whose purpose was to estimate $\theta$, the fraction of a large
population with blood type B.

i) Individuals were selected at random until 10 with blood type B were found. The total
number of people examined was 100.

ii) 100 individuals were selected at random and it was found that 10 of them have blood type B.

(a) Find the probability of the observed results (as a function of $\theta$) for the two experiments.
Thus obtain the likelihood function for $\theta$ for each experiment and show that they are
proportional. Show that the m.l.e. $\hat\theta$ is the same in each case: what is it?
(b) Suppose $n$ people came to a blood donor clinic. Assuming $\theta = 0.10$, how large should
$n$ be to make the probability of getting 10 or more B-type donors at least 0.90? (The R
functions gbinom() or pbinom() can help here.)

3. Consider Example 2.3.2 on M-N blood types. If a random sample of $n$ individuals gives $x_1$, $x_2$
and $x_3$ persons of types MM, MN and NN respectively, find the m.l.e. $\hat\theta$ in the model, in terms
of $x_1, x_2, x_3$.

4. Suppose that in a population of twins, males ($M$) and females ($F$) are equally likely to occur and
that the probability that a pair of twins is identical is $\theta$. If twins are not identical, their genders
are independent.

(a) Show that
$$P(MM) = P(FF) = \frac{1+\theta}{4}, \qquad P(MF) = \frac{1-\theta}{2}.$$

(b) Suppose that $n$ pairs of twins are randomly selected; it is found that $n_1$ are $MM$, $n_2$ are
$FF$, and $n_3$ are $MF$, but it is not known whether each pair is identical or fraternal. Use
these data to find the m.l.e. $\hat\theta$ of $\theta$. What does this give if $n = 50$ with $n_1 = 16$, $n_2 = 16$,
$n_3 = 18$?
(c) Does the model appear to fit the data well?

5. Estimation from capture-recapture studies.


In order to estimate the number of animals, N , in a wild habitat the capture-recapture method is
often used. In this scheme r animals are caught, tagged, and then released. Later on n animals
are caught and the number X of these that bear tags is noted. The idea is to use this information
to estimate N .
(a) Show that
$$P(X = x) = \binom{r}{x}\binom{N-r}{n-x} \bigg/ \binom{N}{n},$$
under suitable assumptions.
(b) For observed $r$, $n$ and $x$, find the value $\hat{N}$ that maximizes the probability in part (a). Does
this ever differ much from the intuitive estimate $\tilde{N} = rn/x$? (Hint: The likelihood $L(N)$
depends on the discrete parameter $N$, and a good way to find where $L(N)$ is maximized
over $\{1, 2, 3, \ldots\}$ is to examine the ratios $L(N+1)/L(N)$.)
(c) When might the model in part (a) be unsatisfactory?

6. The following model has been proposed for the distribution of the number of offspring $X$ in a
family, for a large population of families:
$$P(X = k) = \theta^k, \quad k = 1, 2, \ldots; \qquad P(X = 0) = \frac{1 - 2\theta}{1 - \theta}.$$
Here $\theta$ is an unknown parameter with $0 < \theta < \frac{1}{2}$.

(a) Suppose that $n$ families are selected at random and that $f_k$ is the number of families with $k$
children ($f_0 + f_1 + \cdots = n$). Obtain the m.l.e. $\hat\theta$.
(b) Consider a different type of sampling wherein a single child is selected at random and the
size of the family the child comes from is determined. Let $X$ represent the number of children
in the family. Show that
$$P(X = k) = c\, k\, \theta^k, \qquad k = 1, 2, \ldots$$
and determine $c$.
(c) Suppose that the type of sampling in part (b) was used and that with $n = 33$ the following
data were obtained:

k:    1    2    3    4
f_k:  22   7    3    1

Obtain the m.l.e. $\hat\theta$. Also estimate the probability a couple has 0 children.
(d) Suppose the sample in (c) was incorrectly assumed to have arisen from the sampling plan
in (a). What would $\hat\theta$ be found to be? This problem shows that the way the data have been
collected can affect the model for the response variable.

7. Radioactive particles are emitted randomly over time from a source at an average rate of $\lambda$ per
second. In $n$ time periods of varying lengths $t_1, t_2, \ldots, t_n$ (seconds), the numbers of particles
emitted (as determined by an automatic counter) were $y_1, y_2, \ldots, y_n$ respectively.

(a) Give an estimate of $\lambda$ from these data. What assumptions have you made to do this?
(b) Suppose that instead of knowing the $y_i$'s, we know only whether or not there was one or
more particles emitted in each time interval. Making a suitable assumption, give the likelihood
function for $\lambda$ based on these data, and describe how you could find the maximum
likelihood estimate $\hat\lambda$.

8. Censored lifetime data. Consider the exponential distribution as a model for the lifetimes of
equipment. In experiments, it is often not feasible to run the study long enough that all the
pieces of equipment fail. For example, suppose that $n$ pieces of equipment are each tested for a
maximum of $C$ hours ($C$ is called a "censoring time"). The observed data are then as follows:

$r$ (where $0 \le r \le n$) pieces fail, at times $x_1, \ldots, x_r$.

$n - r$ pieces are still working after time $C$.

(a) For the exponential model in Section 2.1, show that
$$P(X > C) = \exp(-C/\theta).$$
(b) Give the likelihood function for $\theta$ based on the observed data described above. Show that
the m.l.e. is
$$\hat\theta = \frac{\sum_{i=1}^{r} x_i + (n-r)C}{r}.$$
(c) What does part (b) give when $r = 0$? Explain this intuitively.
(d) A standard test for the reliability of electronic components is to subject them to large fluctuations
in temperature inside specially designed ovens. For one particular type of component,
50 units were tested and $r = 5$ failed before 400 hours, when the test was terminated,
with $\sum_{i=1}^{5} x_i = 450$ hours. Find $\hat\theta$.
i=1

9. Poisson model with a covariate. Let $Y$ represent the number of claims in a given year for
a single general insurance policy holder. Each policy holder has a numerical "risk score" $x$
assigned by the company, based on available information. The risk score may be used as a
covariate (explanatory variable) when modelling the distribution of $Y$, and it has been found that
models of the form
$$P(Y = y \mid x) = \frac{\lambda(x)^y}{y!}\, e^{-\lambda(x)}, \qquad y = 0, 1, 2, \ldots$$
where $\lambda(x) = \exp(\alpha + \beta x)$, are useful.

(a) Suppose that $n$ randomly chosen policy holders with risk scores $x_1, x_2, \ldots, x_n$ had $y_1, y_2, \ldots, y_n$
claims, respectively, in a given year. Give the likelihood function for $\alpha$ and $\beta$ based on these
data.
(b) Can $\hat\alpha$ and $\hat\beta$ be found explicitly?

10. In a large population of males ages 40-50, the proportion who are regular smokers is $\alpha$ ($0 < \alpha < 1$)
and the proportion who have hypertension (high blood pressure) is $\beta$ ($0 < \beta < 1$). If
the events $S$ (a person is a smoker) and $H$ (a person has hypertension) are independent, then for
a man picked at random from the population the probabilities that he falls into the four categories
$SH$, $S\bar{H}$, $\bar{S}H$, $\bar{S}\bar{H}$ are, respectively, $\alpha\beta$, $\alpha(1-\beta)$, $(1-\alpha)\beta$, $(1-\alpha)(1-\beta)$. (Why?)

(a) Suppose that 100 men are selected and the numbers in each of the four categories are as
follows:

Category:   $SH$   $S\bar{H}$   $\bar{S}H$   $\bar{S}\bar{H}$
Frequency:  20     15           22           43

Assuming that $S$ and $H$ are independent, write down the likelihood function for $\alpha, \beta$ based
on the multinomial distribution, and maximize it to obtain $\hat\alpha$ and $\hat\beta$.
(b) Compute expected frequencies for each of the four categories. Do you think the model used
is appropriate? Why might it be inappropriate?

11. The course web page has data on the lifetimes of the right front disc brake pads for a
specific car model. The lifetimes $x$ are in km driven, and correspond to the point at which
the brake pads in new cars are reduced to a specified thickness. The data on $m = 92$
randomly selected cars are contained in the file brakelife.text.
(a) It is often found that the log lifetimes, $Y = \log X$, are well modelled by a Gaussian
distribution. Fit such a model to the data, and then produce a plot of a relative frequency
histogram of the $x$-data with the probability density function for $X$ superimposed
on it, using $\hat\mu$ and $\hat\sigma$ for the values of $\mu$ and $\sigma$.
(Note: The probability density function of $X = \exp(Y)$ can be found from results in
STAT 230, and is
$$f(x) = \frac{1}{\sqrt{2\pi}\,\sigma x}\, e^{-\frac{1}{2}\left(\frac{\log x - \mu}{\sigma}\right)^2}, \qquad x > 0.)$$
(b) Suppose that instead of the model in part (a), you assumed that $X \sim G(\mu, \sigma)$. Repeat
the procedure in part (a). Which model appears to fit the data better?
PLANNING AND CONDUCTING
EMPIRICAL STUDIES

3.1 Empirical Studies


An empirical study is one which is carried out to learn about some real world population or process.
Several examples have been given in the preceding two chapters, but we have not yet considered the
various aspects of a study in any detail. It is the object of this chapter to do that; well-conducted studies
are needed to produce maximal information within existing cost and time constraints. Conversely, a
poor study can be worthless or even misleading.
It is helpful to break the process of planning and conducting a study into a series of parts or steps.
We describe below a formulation with the acronym PPDAC, proposed by Jock Mackay and Wayne
Oldford of University of Waterloo. Other similar formulations exist, and it should be remembered that
their purpose is to focus on the essential aspects of an empirical study. Although they are presented as a
series of steps, many studies, as well as the overall objective of learning about a population or process,
involve repeated passes through one or more steps.
The steps or aspects of a study that PPDAC refers to are as follows:

Problem: a clear statement of the study’s objectives

Plan: the procedures used to carry out the study, including the collection of data.

Data: the physical collection of the data, as per the plan

Analysis: the analysis of the data collected

Conclusion: conclusions that are drawn from the study

We will discuss the Plan, Data, Analysis and Conclusion steps in sections following this one. First,
we consider some aspects of the Problem step.


The objectives of a study may be difficult to state precisely in some cases, because when trying to
learn about a phenomenon we are venturing into the unknown. However, we must state as clearly as
possible what we hope to learn, and (looking ahead) what the “output" or type of conclusions from our
study will be. Here are some terms that describe certain problems or objectives:

causative: this means that an objective is to study (possible) causal relationships among factors
or variables. For example, we might want to study whether high amounts of fat in a person’s
diet increase their risk of heart disease, or whether a drug decreases the risk of a person having a
stroke.

descriptive: this means that an objective is to describe the variability or other characteristics of
certain variables, or to describe relationships among variables. (There is no attempt to consider
causal connections.) For example, we may wish to estimate the percentage of persons who
are unemployed this month in different regions of Canada, and to relate this to their level of
education.

analytic or inferential: this means that an objective is to generalize from the study to a larger
context or process. This is what is known as inductive inference. Causative objectives are one
type of analytic objective.

technological: this refers to objectives involving predictions or decisions (e.g. develop a model
or algorithm to identify someone from biometric data, or to decide whether to sell a stock)

A distinction can often be made between what we call the target population (or process) and the
study population (or process). The former is what we really want to study, and what our objectives are
based on. The latter is what the units in the study are actually selected from. Sometimes the target and
study populations are the same but often they are not; in that case we aim to make the study population
as “similar" as possible to the target population. This will make generalizations from the study to the
target population plausible.
At the Problem step we must also consider what the “units" of the population or process are, what
variables will be used to study the units, and what characteristics or attributes of the variables we
want to examine. This can be difficult, especially in areas where the amount of “hard" scientific back-
ground is limited, because it involves quantification and measurement issues. Some things can be
quantified and measured easily and accurately (e.g. a person’s height, weight, age or sex) but many
cannot. For example, how can we measure the amount of fat in a person’s diet? Public opinion polls or
surveys are similarly tricky to design questions for, and we may not be sure what an answer represents.
For example (see Utts 20012), in a survey the following two questions were asked in the order shown:

1. How happy are you with life in general?

2. How often do you normally go out on a date?

There was almost no relationship between the respondents’ replies to the two questions. However,
when the order of the questions was reversed there was a strong relationship, with persons going out
frequently generally being happy. (Why might this occur?)
The units or individuals that are selected for a study are called a sample, because they usually
are a subset of all the units in the study population. In so far as it is possible, we attempt to select
a random sample of units from the study population, to ensure that the sample is “representative’.
Drawing a random sample of units from a population or process can be difficult logistically. (Think
about what this would involve for some of the studies discussed in chapters 1 and 2.) If our sample is not
representative of the study or target population, conclusions we draw may be inaccurate or incorrect;
this is often referred to as study bias. Also, ideally we have,

Sample ⊆ Study Population ⊆ Target Population

but in some cases the study population may not be a subset of the target population but only related
to it in some way. For example, our target population for a carcinogenicity study might be humans,
but experimental studies have to be conducted on laboratory mice or other animals. It is difficult to
generalize from a study on animals to a human population.
The selection of explanatory variables for a study is also important. In many problems there is
a large number of factors that might be related (in a causal way or not) to a response variable, and
we have to decide which of these to collect data on. For example, in an observational study on heart
disease we might include factors in the person’s diet, their age, weight, family history of heart disease,
smoking history and so on. Sometimes cause and effect diagrams are used to help us sort out factors
and decide what to measure, e.g.

Diet (fat, sugar, ...)  ↘
Smoking   →   Heart Disease
Family History  ↗

We’ll conclude this section with a couple of examples that illustrate aspects of the Problem step.

Example 3.1.1. Blood thinners and risk of stroke


A study is to be undertaken to assess whether a certain drug that “thins" the blood reduces the risk of
stroke. The main objective is thus causative. The target population is hard to specify precisely; let us
just say it consists of persons alive and at risk of stroke now and in the future. The study population
will consist of individuals who can be “selected" to participate in the study; individual persons are
the units in the populations. In most such studies these are persons who come to a specific group of
doctors or medical facilities for care and treatment, and this may or may not be representative of the
target population. To make the study generalizable at least to the study population, we would want the
sample from the study population to be representative of it. We try to achieve this by drawing a random
sample of persons, but in medical studies this is complicated by the fact that a person has to agree to
participate if selected.
The choice of response variables depends on the study objectives but also partly on how long the
study will be conducted. Some examples of response variables y are

(i) the number of strokes for a person over a specified “followup" period, say 3 years

(ii) the time until the first stroke occurs (if it does)

(iii) I(a stroke occurs within 3 years)

(iv) I(a stroke leading to major loss of function occurs within 3 years).

Once response variables have been defined, a specific objective is to compare attributes based on the
distributions of these variables for persons who receive the drug and for persons who do not. In ad-
dition, to avoid study bias we randomly assign either the drug or a “placebo" (which cannot be dis-
tinguished from the drug) to each person. You can think of this being done by flipping a coin: if it's
heads a person gets the drug, if tails they get the placebo. As discussed earlier in Example 1.2.3, it
is also best not to let the person (or their doctor) know which “treatment" they are receiving. Finally,
one might want to also define and measure explanatory variables considered to be risk factors for stroke.

Example 3.1.2 Predicting success in university courses


Universities use certain variables to help predict how a student will do in their university program,
and to help decide whom to accept for a program. One objective might be to maximize the percentage
of students who pass first year, or who graduate from the program; another might be to have a formula
to use in the admissions process. Empirical studies for this problem usually collect data on certain
“explanatory" variables x (e.g. for Math students these might include high school course marks, contest
marks, information (or a rating) for the person’s high school, and so on) for some collection of students
(e.g. all students entering the university program over some period). Response variables y might
include university course marks or averages, or whether a person graduates. Obviously such a study
would involve collecting data over several years.
Note that for the admissions problem the target population is actually the population of students
who apply to the university program, whereas the study population consists only of those persons who
get admitted to the program (and who accept). This mismatch between the target and study populations
means that there is some doubt about how persons not admitted (but whose x-variables are known)
would do in the program. For the problem of predicting a person’s success once they are admitted, we
can take the target population to be all students admitted (and accepting) over some time period. The
study population is more “representative" in this case. However, for this (and for admissions deci-
sions), there is still a discrepancy between study and target populations: data on past years’ students
must be used for a target population consisting of this (or a future) year’s students.

3.2 Planning a Study


To plan a study we need to specify (i) the study population and its units, (ii) how (and how many) units
will be “selected" for the study, and (iii) how response and explanatory variables will be measured.
This assumes that at the Problem stage we have specified what variables are to be considered.
Recall from Chapter 1 that there are various types of studies. Three main types are:

(i) Surveys or random samples from a finite population of units. In this case there is a population
of, say, N units (e.g. persons, widgets, transactions) and we randomly select n(< N ) for our
study. Public opinion polls or surveys on health, income, and other matters are often of this type,
as are random audits and quality inspections.

(ii) Experimental studies. In this case the person(s) conducting the study exercise control over
certain factors, or explanatory variables. These studies are often hard or expensive to conduct, but
are important for demonstrating causal relationships. Clinical trials involving medical treatments
(e.g. does a drug lower the risk of stroke?) and engineering experiments (e.g. how is the strength
of a steel bolt related to its diameter?) are examples of such studies.

(iii) Observational studies. In this case the data are collected on a study population or process over
which there is no control. The units selected for the study are often selected in some random way,
so as to make the selected units “representative". For example, we might collect data on a random
set of transactions in an audit study, or on a random sample of persons buying a particular product
in a marketing study. In medicine, we might compare the “risk profiles" of persons diagnosed
with a disease (e.g. diabetes) with persons not diagnosed with the disease.

Two main questions with observational studies are whether the study population is representative
of the target population, and whether the units chosen for the study are representative of the study pop-
ulation. It is sometimes impossible to generalize, or even to draw any conclusions, from observational
data.

Selecting random samples from a population or process can be challenging due to logistical prob-
lems as noted above. In addition, there is the question whether a process is stable over time. This
affects both the objectives of a study and the way units are sampled and measured. Entire books are
devoted to ways of drawing random or representative samples in different settings.
In settings where the population or process has a finite number of units, N, we can draw a random
sample by labelling the units (say as 1, 2, ..., N) and then drawing n of these numbers at random (usually without
replacement). This identifies the units selected for the sample. This can of course be hard or impossible
to implement in many settings; consider for example how you could select a sample of n persons aged
18-25 and living in Canada on some specific date.
It is often taken for granted that measurements, or data, are “reliable". In fact, measurement of the
variables in a study may be subject to errors, and this should be considered in planning and analyzing
a study. If measurement errors are too severe, it may not even be worth doing the study. Consider the
following example.

Example 3.2.1 Piston diameters


Pistons for car engines must have very precise dimensions, but they vary slightly in key dimensions
such as diameter because of variability in the manufacturing process. Let $Y$ denote the deviation of
a piston diameter from its desired value, and suppose that $Y \sim G(0, \sigma)$. The value of $\sigma$ determines
the variability of the piston diameters; suppose that for the process to be acceptable it is necessary
that $\sigma \le 1$ (in some unspecified units). We can assess whether $\sigma \le 1$ by randomly selecting some
pistons and measuring their $y$-values. However, suppose $y$ is measured using a crude tool, so that what
we observe is not $y$ but $y^* = y + e$, where $e$ is the measurement error. If $e$ is independent of $y$ and
$\mathrm{Var}(e) = \tau^2$, then $\mathrm{Var}(Y^*) = \mathrm{Var}(Y) + \mathrm{Var}(e) = \sigma^2 + \tau^2$ and $\mathrm{sd}(Y^*) = (\sigma^2 + \tau^2)^{1/2}$. If $\tau$ is
large enough (e.g. suppose $\tau = 1$) then $\mathrm{sd}(Y^*) = (\sigma^2 + 1)^{1/2} \ge 1$, so even if $\sigma$ is very small the measurements
make it look as though $\mathrm{sd}(Y) > 1$. In order to assess whether $\sigma \le 1$, the measurements $y^*$ alone
would be useless, and a more precise measurement process (method) is needed.
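A small simulation makes the effect of measurement error visible; the values of sigma and tau below are illustrative choices only.

set.seed(1)
sigma <- 0.5                          # true process standard deviation (acceptable, since <= 1)
tau <- 1                              # standard deviation of the measurement error
y <- rnorm(10000, 0, sigma)           # true diameter deviations
ystar <- y + rnorm(10000, 0, tau)     # what the crude measuring tool records
c(sd.true = sd(y), sd.observed = sd(ystar), theory = sqrt(sigma^2 + tau^2))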

Most measurement methods have some degree of error. In many cases this has no significant effect,
but the only way to be sure is to understand (and study, if necessary) the measurement processes being
used. Note that in some problems the measurement may involve a definition that is somewhat vague or
subject to error. For example, if y is the number of incidents of sexual abuse that a person was subjected
to over a 5 year period, then we should consider the definition of an “incident" and whether everyone
interprets this the same way.
Finally, the number of population units to be included in a study is a crucial issue, since it is directly
related to the amount of information that will be obtained. This topic will be discussed in subsequent
chapters so we will not consider it here.

3.3 Data Collection


The previous sections noted the need to define study variables clearly and to have satisfactory methods
of measuring them. It is difficult to discuss data collection except in the context of specific examples,
but we mention a few relevant points.

errors can occur in recording data or entering it in a data base, as well as in measurement of
variables

in many studies the “units" must be tracked and measured over a fairly long period of time (e.g.
consider a stroke study in which persons are followed for 3 years)

when data are recorded over time or in different locations, the time and place for each measure-
ment should be recorded

there may be departures from the study plan that arise over time (e.g. persons may drop out of
a medical study because of adverse reactions to a treatment; it may take longer than anticipated
to collect the data so the number of units sampled must be reduced). Departures from the plan
should be recorded since they may have an important impact on the analysis and conclusions

in some studies the amount of data may be massive, so data base design and management is
important.

3.4 Analysis and Conclusions


The remainder of this course is focussed on statistical methods of data analysis and inference, so we
won’t discuss it in any detail here. Statistical analysis is a huge area, but there are a few major types of
analysis:

estimation of characteristics of populations or processes (e.g. the unemployment rate in Ontario,
the probability a certain type of person has a stroke, the degree of association between dietary fat
and heart disease)

testing hypotheses (e.g. Is dietary fat related to heart disease? Does a drug prevent strokes? Is
one computer sort algorithm better than another?)

model building (e.g. produce a model relating lifestyle risk factors to the probability of a stroke;
produce a model relating the strength of a steel bolt to its diameter; produce a model that can
be used to identify customers for a targeted marketing campaign.) Models are often used for
classifying objects or units, or for making decisions.

exploratory analysis: looking at data so as to uncover special structure or relationships. This
is often done without any specific objectives (e.g. hypotheses to test or attributes to estimate) in
mind. “Data mining" is an evocative term for this type of activity.

Statistical analysis uses a wide array of numerical and graphical methods. Several major topics in
statistical analysis are introduced in Chapters 4 to 8 which follow. Specific applications are used to
illustrate the methods and how conclusions are drawn from the analysis. Although we do not discuss
it much in the remaining chapters, we must remember that well planned and conducted studies are
important for drawing reliable conclusions.

3.4 Problems
1. Suppose you wish to study the smoking habits of teenagers and young adults, in order to un-
derstand what personal factors are related to whether, and how much, a person smokes. Briefly
describe the main components of such a study, using the PPDAC framework. Be specific about
the target and study population, the sample, and what variables you would collect data on.

2. Suppose you wanted to study the relationship between a person’s “resting" pulse rate (heart beats
per minute) and the amount and type of exercise they get.

(a) List some factors (including exercise) that might affect resting pulse rate. You may wish to
draw a cause and effect diagram to represent potential causal factors.
(b) Describe briefly how you might study the relationship between pulse rate and exercise us-
ing (i) an observational study, and (ii) an experimental study.

3. A large company uses photocopiers leased from two suppliers A and B. The lease rates are
slightly lower for B’s machines but there is a perception among workers that they break down
and cause disruptions in work flow substantially more often. Describe briefly how you might
design and carry out a study of this issue, with the ultimate objective being a decision whether to
continue the lease with company B. What additional factors might affect this decision?

4. For a study like the one in Section 1.6, where heights x and weights y of individuals are to be
recorded, discuss sources of variation due to the measurement of x and y on any individual.
STATISTICAL INFERENCE:
ESTIMATION

4.1 Introduction

Many statistical problems involve the estimation of some quantity or attribute. For example, the fraction
of North American women age 16-25 who smoke; the 10th, 50th and 90th percentiles of body-mass
index (BMI) for Canadian males age 21-35; the probability a sensor will classify the colour of an item
correctly. The statistical approach to estimation is based on the following idea:
Develop a model for variation in the population or process you are considering, in which the at-
tribute or quantity you want to estimate is included, and a corresponding model for data collection.
As we will see, this leads to powerful methods for estimating unknown quantities and, importantly,
for determining the uncertainty in an estimate.
We have already seen in Chapter 2 that quantities that can be expressed as parameters in a statis-
tical model (probability distribution) can be estimated using maximum likelihood. Let us consider the
following example and make some important observations.

Example 4.1.1. Suppose we want to estimate quantities associated with BMI for some population of
individuals (e.g. Canadian males age 21-35). If the distribution of BMI values $y$ in the population is
well described by a Gaussian model, $Y \sim G(\mu, \sigma)$, then by estimating $\mu$ and $\sigma$ we can estimate any
quantity associated with the BMI distribution. For example,

(i) $\mu = E(Y)$ is the average BMI in the population

(ii) $\mu$ is also the median BMI (50th percentile)

(iii) The 10th and 90th percentiles (0.10 and 0.90 quantiles) for BMI are $y_{0.10} = \mu - 1.28\sigma$ and $y_{0.90} = \mu + 1.28\sigma$.
(To see this, note for example that $P(Y \le \mu - 1.28\sigma) = P(Z \le -1.28) = 0.10$, where
$Z = (Y - \mu)/\sigma \sim G(0, 1)$.)


(iv) The fraction of the population with BMI over 35.0 is $p = 1 - \Phi\!\left(\frac{35.0 - \mu}{\sigma}\right)$, where $\Phi(z)$ is the c.d.f.
for a $G(0, 1)$ random variable.
Thus, if we collected a random sample of, say, 150 individuals and found maximum likelihood
estimates $\hat\mu = 27.1$, $\hat\sigma = 3.56$, then estimates of the quantities
in (i)-(iv) would be: (i) and (ii) $\hat\mu = 27.1$; (iii) $\hat{y}_{0.10} = \hat\mu - 1.28\hat\sigma = 22.54$, $\hat{y}_{0.90} = 31.66$; and
(iv) $\hat{p} = 0.0132$.
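These estimates are easy to reproduce with the R functions qnorm() and pnorm():

mu.hat <- 27.1; sigma.hat <- 3.56
qnorm(c(0.10, 0.90), mu.hat, sigma.hat)   # estimated 10th and 90th percentiles: 22.54 and 31.66
1 - pnorm(35.0, mu.hat, sigma.hat)        # estimated fraction with BMI over 35.0: about 0.013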
The preceding example raises several issues, if we think about it.

Where do we get our probability distribution? What if it isn’t a good description of the population
or process?
We discussed the first question in Chapters 1 and 2. It is important to check the adequacy (or “fit")
of the model; some ways of doing this were discussed in Chapter 2 and more will be considered
later in the course. If the model used is not satisfactory, we may not be able to use the estimates
based on it. In the data introduced in Section 1.5 it was not in fact clear that a Gaussian model
was suitable when y is BMI. We will consider other models later.

The estimation of parameters or population attributes depends on data collected from the popula-
tion or process, and the likelihood function is based on the probability of the observed data. This
implies that factors associated with the selection of sample units or the measurement of variables
(e.g. measurement error) must be included in the model. In the BMI example it has been as-
sumed that a BMI was measured without error for a random sample of units (persons) from the
population. In these notes it is typically assumed that the data came from a random sample of
population units, but in any given application we would need to design the data collection plan
to ensure this assumption is valid.

Estimates such as $\hat\mu = 27.1$ surely cannot be exactly equal to the average BMI $\mu$ in the population. How
far away from $\mu$ is it likely to be? If I take a sample of only $n = 50$ persons, would I expect the
estimate $\hat\mu$ to be as "good" as a $\hat\mu$ based on 150 persons? (What does "good" mean?)

The third point is what we will focus on in this chapter; it is assumed that we can deal with the first
two points with ideas introduced in Chapters 1 and 2.

Estimators and Sampling Distributions


Suppose that some attribute or parameter $\theta$ is to be estimated. We assume that a random sample
$y_1, \ldots, y_n$ can be drawn from the population or process in question, from which $\theta$ can be estimated. In
general terms an estimate of $\theta$, denoted $\hat\theta$, is some function of the observed sample $y_1, \ldots, y_n$:
$$\hat\theta = g(y_1, \ldots, y_n). \qquad (4.1.1)$$

Maximum likelihood provides a general method for obtaining estimates, but other methods also exist.
For example, if $\theta = E(Y) = \mu$ is the average (mean) value of $y$ in the population, then the sample mean
$\hat\theta = \bar{y}$ is an intuitively sensible estimate; it is the m.l.e. if $Y$ has a Gaussian distribution, but because
of the Central Limit Theorem it is a good estimate of $\mu$ more generally. Thus, while we will use maximum
likelihood estimation a great deal, you should remember that the discussion below applies to estimates of any
type. The problem facing us is how to determine or quantify the uncertainty in an estimate. We do this
using sampling distributions, which are based on the following point. If we select random samples
on repeated occasions, then the estimates $\hat\theta$ obtained from the different samples will vary. For example,
five separate random samples of $n = 50$ persons from the same male population described in Section
1.6 gave five estimates $\hat\mu = \bar{y}$ of $E(Y)$: 1.723, 1.743, 1.734, 1.752, 1.736 (meters). The variability in
estimates obtained from repeated samples of the same size is described by a sampling distribution. More
precisely, we define this as follows. Let the random variables $Y_1, \ldots, Y_n$ represent the observations in
a random sample, and associate with the estimate $\hat\theta$ given by (4.1.1) a random variable
$$\tilde\theta = g(Y_1, \ldots, Y_n). \qquad (4.1.2)$$

We call $\tilde\theta$ the estimator of $\theta$ corresponding to $\hat\theta$. (We will always use $\hat\theta$ to denote an estimate and $\tilde\theta$ to
denote the corresponding estimator.) The distribution of $\tilde\theta$ is called the sampling distribution of the
estimator.

Since $\tilde\theta$ is a function of the random variables $Y_1, \ldots, Y_n$ we can find its distribution, at least in principle.
Two ways to do this are (i) using mathematics and (ii) by computer simulation. Once we know
the sampling distribution of an estimator $\tilde\theta$ (we can think of this as describing an estimation procedure,
if we wish), then we are in a position to express the uncertainty in an estimate. The following example
illustrates how this is done: we examine the probability that the estimator $\tilde\theta$ is "close" to $\theta$.

Example 4.1.2 Suppose we want to estimate the mean $\mu = E(Y)$ of a random variable, and that a
Gaussian distribution $Y \sim G(\mu, \sigma)$ describes variation in $Y$ in the population. Let $y_1, \ldots, y_n$ represent
a random sample from the population, and consider the estimator
$$\tilde\mu = \bar{Y} = \sum_{i=1}^{n} Y_i / n$$
for $\mu$. At this point, we recall some probability theory which says that the distribution of $\bar{Y}$ is also
Gaussian, $G(\mu, \sigma/\sqrt{n})$. Let us now consider the probability that $|\tilde\mu - \mu|$ is less than or equal to some
specified value $\Delta$. We have
$$P(|\tilde\mu - \mu| \le \Delta) = P(\mu - \Delta \le \bar{Y} \le \mu + \Delta) = P\!\left(-\frac{\Delta\sqrt{n}}{\sigma} \le Z \le \frac{\Delta\sqrt{n}}{\sigma}\right), \qquad (4.1.3)$$
p
where Z = (Y )=( = n) G(0; 1). Clearly, as n increases, the probability (4.1.3) approaches
1. Furthermore, if we know (even approximately) then we can find the probability for any given
and n. For example, suppose Y represents the height of a male (in meters) in the population of Section
1.5, and that we take = :01. That is, we want to find the probability that j~ j is no more than
.01 meters. Assuming = :07(m), which is roughly what we estimated it to be in Section 1.5, (4.1.3)
gives the following results for a sample of size n = 50 and for a sample of size n = 100:

n = 50 = P (j~ j :01) = P ( 1:01 Z 1:01) = :688


n = 100 = P (j~ j :01) = P ( 1:43 Z 1:43) = :847

This indicates that a large sample is “better" in the sense that the probability is higher that ~ will be
within .01m of the true (and unknown) average height in the population. It also allows us to express
the uncertainty in an estimate ^ = y from an observed sample y1 ; : : : ; yn : we merely give probabilities
like the above for the associated estimator, which indicate the probability that any single random sample
will give an estimate within a certain distance of .
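The probabilities above come directly from (4.1.3); in R they can be computed as follows (sigma = 0.07 and delta = 0.01 as in the example, and prob.within() is just an illustrative function name).

prob.within <- function(n, delta = 0.01, sigma = 0.07) {
  z <- delta * sqrt(n) / sigma
  pnorm(z) - pnorm(-z)                 # P(|mu~ - mu| <= delta) under the Gaussian model
}
prob.within(50)     # about 0.69
prob.within(100)    # about 0.85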
Example 4.1.3 In the preceding example we were able to work out the variability in the estimator $\tilde\mu$
mathematically, using results about Gaussian probability distributions. In some settings we might not
be able to work out the distribution of an estimator mathematically; however, we could use simulation
to study the distribution. This approach can also be used to study sampling from a finite population of
$N$ values, $\{y_1, \ldots, y_N\}$, where we might not want to use a continuous probability distribution for $Y$.
For illustration, consider the case where a variable $Y$ has a $G(\mu, \sigma)$ distribution in the population, with
$\mu = 100$ and $\sigma = 50$, and suppose we take samples $y = (y_1, \ldots, y_n)$ of size $n$, giving $\hat\mu = \bar{y}$. We can
investigate the distribution of $\tilde\mu$ by simulation, by

(i) generating a sample of size $n$; in R this is done by

y <- rnorm(n, 100, 50)

(ii) computing $\hat\mu = \bar{y}$ from the sample; in R this is done by

ybar <- mean(y)

and then repeating this, say $k$ times. The $k$ values $\bar{y}_1, \ldots, \bar{y}_k$ can then be considered as a sample
from the distribution of $\tilde\mu$, and we can study it by plotting a histogram or other plot of the values.

Exercise: Generate $k = 100$ samples this way, and plot a histogram based on the values $\bar{y}_1, \ldots, \bar{y}_{100}$.
The approaches illustrated in the preceding examples can be used generally. Given an estimator $\tilde\theta$,
we can consider its sampling distribution and compute probabilities of the form $P(|\tilde\theta - \theta| \le \Delta)$.
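When no convenient formula is available, such probabilities can be approximated by simulation. The sketch below estimates $P(|\tilde\mu - \mu| \le 5)$ for the $G(100, 50)$ setting of Example 4.1.3 with $n = 50$; the values of n, k and delta are illustrative choices only.

set.seed(1)
n <- 50; k <- 10000; delta <- 5
ybar <- replicate(k, mean(rnorm(n, 100, 50)))   # k simulated values of the estimator
mean(abs(ybar - 100) <= delta)                  # estimates P(|mu~ - mu| <= delta)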

We need to be able to find the distributions of estimators and other variables. We now review some
probability results from Stat 230 and derive a few other results that will be used in dealing with estima-
tion and other statistical procedures.

4.2 Some Distribution Theory


Many of the distributions required for statistics you have already seen in your first probability course,
including the basic discrete distributions (the uniform, the binomial, the Poisson and the hypergeometric)
as well as the simpler continuous distributions (the uniform, the exponential and the normal or
Gaussian). You can review some of this material in the appendix. There are a few new ones you will
encounter here, including the chi-squared distribution and the Student $t$ distribution. Recall the continuous
extension of the factorial, the Gamma function $\Gamma(a) = \int_0^\infty x^{a-1} e^{-x}\, dx$, whose properties are
outlined in the appendix. The next distribution appears naturally whenever we obtain squares or sample
variances of Gaussian random variables. Suppose $Z \sim G(0, 1)$ and we consider the distribution of its
square $W = Z^2$. Then
$$P(W \le w) = P(-\sqrt{w} \le Z \le \sqrt{w}) = \Phi(\sqrt{w}) - \Phi(-\sqrt{w})$$
where $\Phi$ is the standard Gaussian c.d.f. Therefore the probability density function of $W$ is
$$\frac{d}{dw}\left[\Phi(\sqrt{w}) - \Phi(-\sqrt{w})\right] = \left[\varphi(\sqrt{w}) + \varphi(-\sqrt{w})\right]\frac{1}{2}\, w^{-1/2} = \frac{w^{-1/2}}{\sqrt{2\pi}}\, e^{-w/2}$$
where $\varphi$ is the standard Gaussian probability density function. So the square of a standard normal random
variable has a density of the form (constant)$\,\times\, w^{-1/2} e^{-w/2}$ for $w > 0$, and this is a member of the
chi-squared family of distributions. This result is proved again as Theorem 4.2.6 using moment generating functions.
The $\chi^2$ (Chi-squared) Distribution
The $\chi^2$ distribution is usually derived from Gaussian random variables, as we will see below in
Theorem 4.2.6. It is a continuous family of distributions on $(0, \infty)$ with probability density functions
of the form
$$f(x) = \frac{1}{2^{r/2}\,\Gamma(r/2)}\, x^{(r/2)-1} e^{-x/2}, \qquad x > 0 \qquad (4.1.4)$$
where $r$ is a parameter with values in $\{1, 2, \ldots\}$. To denote that $X$ has probability density function
(4.1.4) we write $X \sim \chi^2_{(r)}$. The parameter $r$ is referred to as the "degrees of freedom" (d.f.) parameter.
In Figure 4.15 you see the characteristic shapes of the chi-squared densities. For degrees of freedom
$r = 2$ the density is that of an exponential distribution, but for $r > 2$ it increases to a maximum at a point to the left of $r$
and then decreases thereafter. For large values of $r$, the probability density function resembles that of
a normal distribution with variance $2r$.

Figure 4.15: Chi-squared probability densities with degrees of freedom 1, 2, 4 and 8.

Problem 11 at the end of the chapter gives some results for the $\chi^2_{(r)}$ distribution, including the
fact that its m.g.f. is
$$M(t) = (1 - 2t)^{-r/2}, \qquad t < 1/2 \qquad (4.1.5)$$
and that its mean and variance are $E(X) = r$ and $\mathrm{Var}(X) = 2r$. The c.d.f. $F(x)$ can be given in
closed algebraic form for even values of $r$, but tables and software give the function's values. In R the
functions dchisq(x, r) and pchisq(x, r) give the probability density function $f(x)$ and c.d.f. $F(x)$ for
the $\chi^2_{(r)}$ distribution. A table with selected values is given at the end of these notes.
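For example, in R:

pchisq(3.84, df = 1)                          # P(X <= 3.84) for X ~ chi-squared with 1 d.f.; about 0.95
curve(dchisq(x, df = 4), from = 0, to = 15)   # plot of the chi-squared p.d.f. with 4 d.f.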
We now give a pair of important results. The first shows that when we add independent chi-squared
random variables, the sum is also chi-squared, and the degrees of freedom simply sum.

Theorem 4.2.5 Let $W_1, \ldots, W_n$ be independent random variables with $W_i \sim \chi^2_{(r_i)}$. Then
$S = \sum_{i=1}^{n} W_i$ has a $\chi^2$ distribution with $\sum_{i=1}^{n} r_i$ degrees of freedom; i.e. $S \sim \chi^2_{(\sum r_i)}$.

Proof: $W_i$ has m.g.f. $M_i(t) = (1-2t)^{-r_i/2}$. Thus $M_S(t) = \prod_{i=1}^{n} M_i(t) = (1-2t)^{-\sum_{i=1}^{n} r_i/2}$,
and this is the m.g.f. of a $\chi^2$ distribution with $\sum_{i=1}^{n} r_i$ degrees of freedom.

The next result shows why the chi-squared distribution is important whenever we study Gaussian
distributed random variables. It arises as the square of a standard normal random variable.

Theorem 4.2.6 If $Z \sim G(0, 1)$ then the distribution of $W = Z^2$ is $\chi^2_{(1)}$.

Proof: The m.g.f. of $W$ is
$$M_W(t) = E(e^{tW}) = E(e^{tZ^2}) = \int_{-\infty}^{\infty} e^{tz^2}\, \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-z^2(1-2t)/2}\, dz.$$
This integral exists (is finite) provided $1 - 2t > 0$, i.e. $t < 1/2$. Making the change of variable
$y = z(1-2t)^{1/2}$ we get $M_W(t) = (1-2t)^{-1/2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-y^2/2}\, dy = (1-2t)^{-1/2}$. This is the
m.g.f. for the $\chi^2_{(1)}$ distribution, so $W$ must have a $\chi^2_{(1)}$ distribution.

Furthermore, if we add together the squares of several independent standard normal random variables
then we are adding independent chi-squared random variables. The result can only be chi-squared!

Corollary If $Z_1, \ldots, Z_n$ are mutually independent $G(0, 1)$ random variables and $S = \sum_{i=1}^{n} Z_i^2$, then
$S \sim \chi^2_{(n)}$.

Proof: By Theorem 4.2.6, each $Z_i^2$ has a $\chi^2_{(1)}$ distribution. Theorem 4.2.5 then gives the result.

4.2.1 Interval Estimation Using Likelihood Functions

The estimates and estimators discussed in Section 4.1 are often referred to as point estimates (and
point estimators), because they consist of a single value or "point". The discussion of sampling
distributions shows how to address the uncertainty in an estimate, but we nevertheless prefer in most
settings to indicate the uncertainty explicitly along with the estimate. This leads to the concept of an interval
estimate (see the video "What is a Confidence Interval?" at watstat.ca), which takes the form
$$\theta \in (L, U) \qquad \text{or} \qquad L \le \theta \le U,$$
where $L = h_1(y_1, \ldots, y_n)$ and $U = h_2(y_1, \ldots, y_n)$ are based on the observed data $y_1, \ldots, y_n$. Notice
that this provides an interval whose endpoints $L$ and $U$ both depend on the data. With random
variables replacing the observed data, the endpoints $\tilde{L} = h_1(Y_1, \ldots, Y_n)$ and $\tilde{U} = h_2(Y_1, \ldots, Y_n)$
are random variables (they would change, for example, if we were to draw another sample from the same
population), and the assertion becomes $\tilde{L} < \theta < \tilde{U}$. There is a specific probability (hopefully large) that this assertion
is correct in general, i.e. that the parameter will fall in this random interval, and this probability is
$P(\tilde{L} < \theta < \tilde{U})$. This probability is called the coverage probability. It gives an indication of how good
the rule is by which the interval estimate was obtained. For example, if it is 0.95, this means that 95% of
the time (i.e. for 95% of the different samples we might draw), the parameter falls in the interval $(\tilde{L}, \tilde{U})$.
This means we can be reasonably safe in assuming that, on this occasion and for this dataset, it does so. In
general, uncertainty in an interval estimate is explicitly stated by giving the interval estimate along with
the probability $P\left[\theta \in (\tilde{L}, \tilde{U})\right]$, where $\tilde{L} = h_1(Y_1, \ldots, Y_n)$ and $\tilde{U} = h_2(Y_1, \ldots, Y_n)$ are the random
variables associated with $L$ and $U$.
The likelihood function can be used to obtain interval estimates for parameters in a very straightforward
way. We do this here for the case in which the probability model involves only a single scalar
parameter $\theta$. Models with two or more parameters will be considered later. Individual models often
have constraints on the parameters: for example, in the Gaussian distribution the mean can take
any real value, $-\infty < \mu < \infty$, while the standard deviation must be positive, $\sigma > 0$. Similarly, in the
binomial model the probability of success must lie in the interval $[0, 1]$. These constraints are usually
identified by requiring that the parameter fall in some set $\Omega$, called the parameter space. As mentioned
before, we often rescale the likelihood function by a convenient constant, resulting in the relative
likelihood.
Definition: Suppose $\theta$ is scalar and that some observed data (say a random sample $y_1, \ldots, y_n$) have
given a likelihood function $L(\theta)$. The relative likelihood function $R(\theta)$ is then defined as
$$R(\theta) = \frac{L(\theta)}{L(\hat\theta)} \qquad \text{for } \theta \in \Omega$$
where $\hat\theta$ is the m.l.e. (obtained by maximizing $L(\theta)$) and $\Omega$ is the parameter space. Note that
$0 \le R(\theta) \le 1$ for all $\theta \in \Omega$.

Definition: A "$p$" likelihood interval for $\theta$ is the set $\{\theta : R(\theta) \ge p\}$.

Actually, $\{\theta : R(\theta) \ge p\}$ is not necessarily an interval unless $R(\theta)$ is unimodal, but this is the
case for all models that we consider. The motivation for this approach is that the values of $\theta$ that give
larger values of $L(\theta)$ (and hence of $R(\theta)$) are the most plausible. The main challenge is to decide what $p$
should be; we show later that choosing $p$ in the range 0.10 to 0.15 is often useful. If you return to the
likelihood for the Harris/Decima poll in Figure 2.12, note that the interval the pollsters provided,
i.e. $26 \pm 2.2$ percent, looks as if it was constructed so that the value of the likelihood at the endpoints
is around 1/10 of its maximum value, so $p$ is in the range 0.10 to 0.15.

Example 4.3.1 Polls


Suppose $\theta$ is the proportion of people in a large population who have a specific characteristic. If $n$
persons are randomly selected and $Y$ is the number who have the characteristic, then $Y \sim \mathrm{Bin}(n, \theta)$ is
a reasonable model, and the observed data $Y = y$ give the likelihood function
$$L(\theta) = \binom{n}{y}\,\theta^y (1-\theta)^{n-y}, \qquad 0 \le \theta \le 1.$$
Maximizing $L(\theta)$ we find $\hat\theta = y/n$, and then
$$R(\theta) = \frac{\theta^y (1-\theta)^{n-y}}{\hat\theta^{\,y} (1-\hat\theta)^{n-y}}, \qquad 0 \le \theta \le 1.$$

Figure 4.16: Relative Likelihood and log Relative Likelihood Functions for a Binomial Parameter

Figure 4.16 shows the $R(\theta)$ functions for two polls:

Poll 1: $n = 200$, $y = 80$
Poll 2: $n = 1{,}000$, $y = 400$.

In each case $\hat\theta = 0.40$, but the relative likelihood function is more "concentrated" around $\hat\theta$ for the larger
poll (Poll 2). The 0.10 likelihood intervals also reflect this:
Poll 1: $R(\theta) \ge 0.1$ for $0.33 \le \theta \le 0.47$
Poll 2: $R(\theta) \ge 0.1$ for $0.37 \le \theta \le 0.43$.
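These intervals can be found numerically in R; the sketch below does this for Poll 1 by solving $R(\theta) = 0.10$ on each side of $\hat\theta$ with uniroot() (the bracketing values 0.01 and 0.99 are illustrative choices).

n <- 200; y <- 80
theta.hat <- y / n
rel.lik <- function(theta) (theta / theta.hat)^y * ((1 - theta) / (1 - theta.hat))^(n - y)
lower <- uniroot(function(t) rel.lik(t) - 0.10, c(0.01, theta.hat))$root
upper <- uniroot(function(t) rel.lik(t) - 0.10, c(theta.hat, 0.99))$root
c(lower, upper)    # approximately 0.33 and 0.47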


The graph also shows the log relative likelihood function,
$$r(\theta) = \log R(\theta) = \ell(\theta) - \ell(\hat\theta),$$
where $\ell(\theta) = \log L(\theta)$ is the log likelihood function. It is often convenient to compute $r(\theta)$ instead of
$R(\theta)$, and to obtain a $p$ likelihood interval using the fact that $R(\theta) \ge p$ if and only if $r(\theta) \ge \log p$. While both
plots are unimodal and have identical locations of the maximum, they differ in shape: the
plot of the relative likelihood function resembles a normal probability density function, while
that of the log relative likelihood resembles a quadratic function of $\theta$.
Likelihood intervals have desirable properties. One is that they become narrower as the sample size
increases, thus indicating that larger samples contain more information about $\theta$. They are also easy to
obtain, since all we really have to do is plot the relative likelihood function $R(\theta)$ or $r(\theta) = \log R(\theta)$.
This approach can also be extended to deal with vector parameters, in which case $R(\theta) \ge p$ gives
likelihood "regions" for $\theta$.
The one apparent shortcoming of likelihood intervals so far is that we do not know how probable it
is that a given interval will contain the true parameter value. As a result we also do not have a basis for
the choice of $p$. Sometimes it is argued that values like $p = 0.10$ or $p = 0.05$ make sense because they
rule out parameter values for which the probability of the observed data is less than 1/10 or 1/20 of
the probability when $\theta = \hat\theta$. However, a more satisfying approach is to apply the sampling distribution
ideas of Section 4.1 to interval estimates, as discussed at the start of this section. This leads to the
concept of confidence intervals, which we describe next.

4.3 Confidence Intervals for a Parameter


In general, a likelihood interval or any other interval estimate for takes the form [L(Y); U (Y)], where
Y stands for the data the estimate is based on. Suppose we assume that the model chosen is correct and
0 is the true (unknown) value of the parameter. It is by no means certain that the true parameter value
will fall in this interval, i.e. for the particular sample we selected it is true that L(Y) 0 U (Y) but
an important property of the interval estimator is the coverage probability for any interval estimate,
defined as the probability of this event.
Definition: The value
C( 0 ) = P fL(Y) 0 U (Y)g (4.1.6)

is called the coverage probability for the interval estimate [L(Y); U (Y)]. A few words are in order
about the meaning of the probability in (4.1.6). The parameter 0 is an unknown constant associated
with the population, but it is a fixed constant, NOT a random variable and does not therefore have a
distribution. You should read (4.1.6) as follows. If we were to draw a random sample of the same

size from the same population and the true value of the parameter was θ₀, the probability that θ₀ is
contained in the interval constructed from this new data is C(θ₀).¹⁵ How does C(θ₀) assist in the
evaluation of interval estimates? In practice, we try to find intervals for which C(θ₀) is fairly close to 1
(values .90, .95 and .99 are often used) while keeping the interval fairly short. Such interval estimates
are called confidence intervals and the value of C(θ₀) is also called the confidence coefficient. If
C(θ₀) is 0.95, for example, then this indicates that 95% of the samples that we would draw from this
model result in an interval [L(Y), U(Y)] which includes the parameter θ₀ (and of course 5% do not).
This gives us some confidence that for a particular sample, such as the one at hand, the true value of
the parameter is contained in the interval.
To show that confidence intervals exist, and that at least in some cases the coverage probability C(θ₀)
does not depend on the unknown parameter θ₀, we consider the following simple example.

Example 4.4.1 Suppose that a random variable Y has a G(θ₀, 1) distribution. That is, θ₀ = E(Y)
is unknown but the standard deviation of Y is known to equal 1. Let the random variables Y₁, ..., Yₙ
represent a random sample of y-values, and consider the interval (Ȳ − 1.96 n^(−1/2), Ȳ + 1.96 n^(−1/2)),
where Ȳ = (1/n) Σᵢ Yᵢ is the sample mean. Since Ȳ ∼ G(θ₀, n^(−1/2)), we find that

    P(Ȳ − 1.96 n^(−1/2) ≤ θ₀ ≤ Ȳ + 1.96 n^(−1/2)) = P(−1.96 ≤ (Ȳ − θ₀)/n^(−1/2) ≤ 1.96)
                                                   = P(−1.96 ≤ Z ≤ 1.96) = .95,

where Z ∼ G(0, 1). Thus the interval (Ȳ − 1.96 n^(−1/2), Ȳ + 1.96 n^(−1/2)) is a confidence interval for
θ₀ with confidence coefficient (coverage probability) C(θ₀) = .95. This is an example in which C(θ₀)
does not depend on θ₀, an extremely desirable feature of an interval estimator since θ₀ is unknown.
We repeat the very important interpretation of a confidence interval: if the procedure is used re-
peatedly then in a fraction C(θ₀) of cases the interval will contain the true value θ₀. If a particular
sample of size 16 in Example 4.4.1 had observed mean ȳ = 10.4, then the observed .95 confidence
interval would be (ȳ − 1.96/4, ȳ + 1.96/4), or (9.91, 10.89). We do not say that the probability of the
observed interval 9.91 ≤ θ₀ ≤ 10.89 is .95, but we have a high degree of confidence (95%) that this
interval contains θ₀.
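The repeated-sampling interpretation can be checked by simulation. The R sketch below (the seed, the 10,000 repetitions and the true value 10 are arbitrary choices made only for illustration) draws many samples of size 16 from G(10, 1) and records how often the interval contains the true mean; the proportion should be close to .95.

    # Simulated coverage of (Ybar - 1.96/sqrt(n), Ybar + 1.96/sqrt(n)) for G(theta0, 1) data
    set.seed(123)
    theta0 <- 10; n <- 16; nsim <- 10000
    covered <- logical(nsim)
    for (i in 1:nsim) {
      ybar <- mean(rnorm(n, mean = theta0, sd = 1))
      covered[i] <- (ybar - 1.96 / sqrt(n) <= theta0) && (theta0 <= ybar + 1.96 / sqrt(n))
    }
    mean(covered)   # should be close to .95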
Confidence intervals tend to become narrower as the size of the sample on which they are based
increases. For example, note the effect of n in Example 4.4.1. The width of the confidence interval is
2(1.96)/√n, which gets smaller as n increases. We noted this characteristic earlier for likelihood intervals, and
we show a bit later that likelihood intervals are a type of confidence interval. Recall that the coverage
probability for the interval in the example did not depend on θ₀, a highly desirable property because
¹⁵ When we use the observed data y, L(y) and U(y) are numerical values, not random variables. We do not know whether it is
true or not that L(y) ≤ θ₀ ≤ U(y), and P{L(y) ≤ θ₀ ≤ U(y)} makes no more sense than P{1 ≤ θ₀ ≤ 3},
since there is no random variable to which the probability statement can refer.

we would like to know the coverage probability while not knowing the value of the parameter θ₀. We next
consider a general way to find confidence intervals which have this property.

Pivotal Quantities and Confidence Intervals

Definition: A pivotal quantity Q = g(Y₁, ..., Yₙ; θ) is a function of a random sample Y₁, ..., Yₙ and
θ such that the distribution of the random variable Q is fully known. That is, the distribution does not depend on θ
or on other unknown information.

The motivation for this definition is the following. If we can begin with a statement such as
P{a ≤ g(Y₁, ..., Yₙ; θ) ≤ b} = 0.95, where g(Y₁, ..., Yₙ; θ) is a pivotal quantity with known distribution, we
can then re-express the inequality a ≤ g(Y₁, ..., Yₙ; θ) ≤ b in the form L(Y₁, ..., Yₙ) ≤ θ ≤
U(Y₁, ..., Yₙ). Then [L(Y₁, ..., Yₙ), U(Y₁, ..., Yₙ)] is a confidence interval for θ with coverage
probability 0.95 that does not depend on the parameter θ. In general, for coverage probability (also
referred to as the confidence coefficient) α, we choose L, U so that

    P{a ≤ g(Y₁, ..., Yₙ; θ) ≤ b} = P{L(Y₁, ..., Yₙ) ≤ θ ≤ U(Y₁, ..., Yₙ)} = α.

The coverage probability (confidence coefficient) does not depend on the value of θ. It does depend
on a and b, but these are determined by the known distribution of g(Y₁, ..., Yₙ; θ).

Example 4.4.2 Suppose Y ∼ G(μ, σ₀), where μ is unknown but σ₀ is known. Then if Y₁, ..., Yₙ is
a random sample, we know that

    Q = (Ȳ − μ)/(σ₀/√n) ∼ G(0, 1),

so it is a pivotal quantity. (For simplicity we just write μ instead of μ₀ for the unknown true value
which is to be estimated.) To get a .95 confidence interval for μ we just need to find values a and b such
that P(a ≤ Q ≤ b) = .95, and then

    .95 = P(a ≤ (Ȳ − μ)/(σ₀/√n) ≤ b)
        = P(Ȳ − bσ₀/√n ≤ μ ≤ Ȳ − aσ₀/√n),

so that

    Ȳ − bσ₀/√n ≤ μ ≤ Ȳ − aσ₀/√n     (4.4.1)

is a .95 confidence interval for μ. Note that there are infinitely many pairs (a, b) giving P(a ≤ Q ≤ b) = .95.
A common choice is a = −1.96, b = 1.96; this gives the interval (Ȳ − 1.96σ₀/√n, Ȳ +
1.96σ₀/√n), which turns out to be the shortest possible .95 confidence interval. Another choice would
be a = −∞, b = 1.645, which gives the interval (Ȳ − 1.645σ₀/√n, ∞). This is useful when we are
interested in getting a lower bound on the value of μ.
It turns out that for most distributions it is not possible to find “exact” pivotal quantities or con-
fidence intervals for θ whose coverage probabilities do not depend somewhat on the true value of θ.
However, in general we can find quantities Qₙ = g(Y₁, ..., Yₙ; θ) such that as n → ∞, the distribution
of Qₙ ceases to depend on θ or on other unknown information. We then say that Qₙ is asymptotically piv-
otal, and in practice we treat Qₙ as a pivotal quantity for sufficiently large values of n; more accurately,
we term it an approximate pivotal quantity.

Example 4.4.3 Polls Consider Example 4.3.1 discussed earlier, where Y ∼ Bin(n, θ). For large n
we know that Q₁ = (Y − nθ)/[nθ(1 − θ)]^(1/2) is approximately G(0, 1). It can also be proved that the
distribution of

    Q = (Y − nθ)/[nθ̂(1 − θ̂)]^(1/2),

where θ̂ = Y/n, is also close to G(0, 1) for large n. Thus Q can be used as an (approximate)
pivotal quantity to get confidence intervals for θ. For example,

    .95 ≈ P(−1.96 ≤ Q ≤ 1.96)
        = P(θ̂ − 1.96[θ̂(1 − θ̂)/n]^(1/2) ≤ θ ≤ θ̂ + 1.96[θ̂(1 − θ̂)/n]^(1/2)).

Thus

    θ̂ ± 1.96[θ̂(1 − θ̂)/n]^(1/2)     (4.4.2)

gives an approximate .95 confidence interval for θ. As a numerical example, suppose we observed
n = 100, y = 18 in a poll. Then (4.4.2) becomes 0.18 ± 1.96[0.18(0.82)/100]^(1/2), or 0.105 ≤ θ ≤ 0.255.
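The arithmetic in this example is easily checked; a minimal R sketch for the n = 100, y = 18 poll is

    # Approximate .95 confidence interval (4.4.2) for a binomial probability
    n <- 100; y <- 18
    theta.hat <- y / n
    se <- sqrt(theta.hat * (1 - theta.hat) / n)
    theta.hat + c(-1.96, 1.96) * se   # approximately (0.105, 0.255)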

Remark: It is important to understand that confidence intervals may vary quite a lot when we take
repeated samples. For example, 10 samples of size n = 100 which were simulated for a population
where p = 0.25 gave the following .95 confidence intervals for p: .20 - .38, .14 - .31, .23 - .42, .22 -
.41, .18 - .36, .14 - .31, .10 - .26, .21 - .40, .15 - .33, .19 - .37.
When we get a .95 confidence interval from a single sample, it will include the true value of p with
probability .95, but this does not necessarily mean another sample will give a confidence interval that
is similar to the first one. If we take larger samples, then the confidence intervals are narrower and will
agree better. For example, try generating a few samples of size n = 1000 and compare the confidence
intervals for p.
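The suggestion above can be tried directly. The sketch below (the function name and the number of repetitions are arbitrary choices) generates repeated samples with p = 0.25 and prints the resulting .95 confidence intervals; compare the output for n = 100 and n = 1000.

    # Repeated .95 confidence intervals for p based on samples of size n
    sim.ci <- function(nsamples, n, p = 0.25) {
      for (i in 1:nsamples) {
        y <- rbinom(1, size = n, prob = p)
        p.hat <- y / n
        se <- sqrt(p.hat * (1 - p.hat) / n)
        cat(round(p.hat - 1.96 * se, 3), "-", round(p.hat + 1.96 * se, 3), "\n")
      }
    }
    sim.ci(10, n = 100)    # intervals vary quite a lot
    sim.ci(10, n = 1000)   # intervals are narrower and agree better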

Likelihood-Based Confidence Intervals

It turns out that likelihood intervals are approximate confidence intervals, and sometimes they are exact
confidence intervals. Let R(θ) = L(θ)/L(θ̂) and define the quantity

    Λ = −2 log R(θ) = 2ℓ(θ̂) − 2ℓ(θ).     (4.4.3)

This is called the likelihood ratio statistic. The following result can be proved:

If L(θ) is based on a random sample of size n and if θ is the true value of the scalar parameter, then
(under mild mathematical conditions) the distribution of Λ converges to χ²_(1) as n → ∞.

This means that Λ can be used as an approximate pivotal quantity in order to get confidence intervals
for θ. Because highly plausible values of θ are ones where R(θ) is close to 1 (i.e. Λ is close to 0), we
get confidence intervals by working from the probability

    P(χ²_(1) ≤ c) ≈ P(Λ ≤ c) = P(L(Y₁, ..., Yₙ) ≤ θ ≤ U(Y₁, ..., Yₙ)).     (4.4.4)

Example 4.4.4 Consider the binomial model in Examples 4.3.1 and 4.4.3 once again. We have, after
rearrangement,

    Λ = 2nθ̂ log(θ̂/θ) + 2n(1 − θ̂) log[(1 − θ̂)/(1 − θ)].

To get a .95 confidence interval for θ we note that P(χ²_(1) ≤ 3.841) = .95. To find the confidence
interval we have to find all values θ satisfying λ(θ) ≤ 3.841. This has to be done numerically, and
depends on the observed data. For example, suppose that we observe n = 100, y = 40. Thus θ̂ = .40
and the observed value of Λ is a function of θ,

    λ(θ) = 80 log(.4/θ) + 120 log[.6/(1 − θ)].

Figure 4.17 shows a plot of λ(θ) together with the horizontal line at 3.841, from which the confidence interval
can be extracted. Solving λ(θ) ≤ 3.841, we find that .307 ≤ θ ≤ .496 is the .95 confidence interval.
We could also use the approximate pivotal quantity in Example 4.4.3 for this situation. It gives the .95
confidence interval (4.4.2), which is .304 ≤ θ ≤ .496. The two confidence intervals differ slightly
(they are both based on approximations) but are extremely close.

We can now see the connection between likelihood intervals and confidence intervals. The likeli-
hood interval defined by R(θ) ≥ p is the same as the confidence interval defined by λ(θ) ≤ −2 log p.
For a .95 confidence interval we use λ(θ) ≤ 3.841 (since P(χ²_(1) ≤ 3.841) = .95), which corre-
sponds to R(θ) ≥ 0.147. Conversely, a .10 likelihood interval given by R(θ) ≥ 0.1 corresponds to
λ(θ) ≤ 4.605. Since P(χ²_(1) ≤ 4.605) = .968, we see that a .10 likelihood interval is a confidence
interval with approximate confidence coefficient .968.

Figure 4.17: Likelihood Ratio Statistic λ(θ) for a Binomial Parameter (plotted against θ, with the horizontal line at 3.841)

Normally in statistical work, however, we use
confidence intervals with (approximate) confidence coefficients .90, .95 or .99, and we usually employ
λ(θ) rather than R(θ) in discussions about likelihood-based interval estimates.
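Solving λ(θ) ≤ 3.841 numerically, as required in Example 4.4.4, can be done with any root finder; the R sketch below uses uniroot, which is one possible choice rather than the only one, for the n = 100, y = 40 data.

    # .95 likelihood ratio confidence interval for theta when n = 100, y = 40
    n <- 100; y <- 40
    theta.hat <- y / n
    lambda <- function(theta) {
      2 * y * log(theta.hat / theta) + 2 * (n - y) * log((1 - theta.hat) / (1 - theta))
    }
    c95 <- qchisq(0.95, df = 1)   # 3.841
    lower <- uniroot(function(th) lambda(th) - c95, c(0.01, theta.hat))$root
    upper <- uniroot(function(th) lambda(th) - c95, c(theta.hat, 0.99))$root
    c(lower, upper)               # approximately (0.307, 0.496)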

4.4.3 Choosing a Sample Size

We have seen in examples that confidence intervals for a parameter tend to get narrower as the sample
size n increases. When designing a study we often decide how large a sample to collect on the basis of
(i) how narrow we would like confidence intervals to be, and (ii) how much we can afford to spend (it
costs time and money to collect data). The following example illustrates the procedure.
Example 4.4.5 Estimation of a Binomial Probability Suppose we want to estimate the probability θ
from a binomial experiment in which the response variable Y has a Bin(n, θ) distribution. We will
assume that the approximate pivotal quantity

    Q = (Y − nθ)/[nθ̂(1 − θ̂)]^(1/2) ≈ G(0, 1)     (4.4.5)

introduced in Example 4.4.3 will be used to get confidence intervals for θ. (Using the likelihood ratio
statistic leads to a more difficult derivation, and in any case, for large n the LR confidence intervals are
very close to those based on Q.) Here is a criterion that is widely used for choosing the size of n: pick
n large enough so that a .95 CI for θ is no wider than θ̂ ± .03. Let's see why this is used and where it leads.

From Example 4.4.3, we know that (see (4.4.2))

    θ̂ ± 1.96[θ̂(1 − θ̂)/n]^(1/2)

is an approximate .95 CI for θ. To make this CI θ̂ ± .03 (or even shorter, say θ̂ ± .025), we need n large
enough that

    1.96[θ̂(1 − θ̂)/n]^(1/2) ≤ .03.

This can be rewritten as

    n ≥ (1.96/.03)² θ̂(1 − θ̂).

We of course don't know what θ̂ will be once we take our sample, but we note that the worst case
scenario is when θ̂ = .5. So to be conservative, we find n such that

    n ≥ (1.96/.03)² (.5)² ≈ 1067.

Thus, choosing n = 1067 (or larger) will result in a .95 CI of the form θ̂ ± c, where c ≤ .03. If you
look or listen carefully when polling results are announced, you'll often hear words like “this poll is
accurate to within 3 percentage points 19 times out of 20.” What this really means is that the estimator θ̃
(which is usually given in percentage form) satisfies P(|θ̃ − θ| ≤ .03) ≈ .95, or equivalently, that the
actual estimate θ̂ is the centre of a .95 confidence interval θ̂ ± c, for which c ≤ .03. In practice, many
polls are based on around 1050-1100 people, giving “accuracy to within 3 percent” (with probability
.95). Of course, one needs to be able to afford to collect a sample of this size. If we were satisfied
with an accuracy of 5 percent, then we would only need n ≈ 385. In many situations this might not be
sufficiently accurate for the purpose of the study, however.
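The calculation above is easily turned into a small function; the sketch below (the function name is arbitrary, and the conservative worst case θ̂(1 − θ̂) = .25 is built in) returns the n required so that a confidence interval has the form θ̂ ± c.

    # Conservative sample size so that a confidence interval is theta.hat +/- c
    n.needed <- function(c, conf = 0.95) {
      z <- qnorm(1 - (1 - conf) / 2)   # 1.96 when conf = .95
      ceiling((z / c)^2 * 0.25)        # worst case theta.hat = 0.5
    }
    n.needed(0.03)   # about 1067
    n.needed(0.05)   # about 385
    n.needed(0.02)   # 2401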

Exercise: Show that to make the .95 CI θ̂ ± .02 or smaller, you need n = 2401. What should n be to
make a .99 CI θ̂ ± .02 or less?

Remark: Very large binomial polls (n ≥ 2000) are not done very often. Although we can in theory
estimate θ very precisely with an extremely large poll, there are two problems:

1. it is difficult to pick a sample that is truly random, so Y ∼ Bin(n, θ) is only an approximation;

2. in many settings the value of θ fluctuates over time. A poll is at best a snapshot at one point in
time.

As a result, the “real” accuracy of a poll cannot generally be made arbitrarily high.
Sample sizes can be similarly determined so as to give confidence intervals of some desired length in
other settings. We consider this topic again in Chapter 6. Many of the tools of this section can also be
extended to the multi-parameter¹⁶ setting, but we will not discuss this further here.

4.3.1 A Case Study: Testing Reliability of Computer Power Supplies

Components of electronic products often must be very reliable, that is, they must perform over long
periods of time without failing. Consequently, manufacturers who supply components to a company
that produces, for example, personal computers must satisfy the company that their components are reliable.

Demonstrating that a component is highly reliable is difficult because if the component is used
under “normal” conditions it will usually take a very long time to fail. It is generally not feasible for
a manufacturer to carry out tests on the component that last for a period of years (or even months, in
most cases), and therefore they use what are called accelerated life tests. These involve placing such
high levels of stress on the components that they fail in much less than the normal time. If a model
relating the level of stress to the lifetime of the component is known, then such experiments can be used
to estimate lifetime at normal stress levels for the population from which the experimental units are
taken.

We consider below some life test experiments on power supplies for PCs, with ambient temperature
being the stress factor. As the temperature increases, the lifetimes of components tend to decrease, and
at a temperature around 70°C average lifetimes tend to be of the order of 100 hours. The normal
usage temperature is around 20°C. The data in Table 4.5.1 show the lifetimes (i.e. times to failure) of
components tested at each of 40°, 50°, 60° and 70°C. The experiment was terminated after 600 hours,
and for temperatures 40°, 50° and 60° some of the 25 components being tested had still not failed.
Such observations are called censored: we know in each case only that the lifetime in question was
¹⁶ Models With Two or More Parameters. When there is a vector θ = (θ₁, ..., θₖ) of unknown parameters, we may
want to get interval estimates for individual parameters θⱼ (j = 1, ..., k) or for functions h(θ₁, ..., θₖ) of them. For example,
with a Gaussian model G(μ, σ) we might want to estimate μ and σ. In some problems there are pivotal quantities which
are functions of the data and (only) the parameter of interest. We will use such quantities in Chapter 6, where we consider
estimation and testing for Gaussian models. There also exist approximate pivotal quantities based on the likelihood function
and m.l.e.'s. These are mainly developed in more advanced followup courses to this one, but we will briefly consider this
approach later in the notes.
It is also possible to construct confidence regions for two or more parameters. For example, suppose a model has two
parameters θ₁, θ₂ and a likelihood function L(θ₁, θ₂) based on observed data. Then we can define the relative likelihood
function R(θ₁, θ₂) = L(θ₁, θ₂)/L(θ̂₁, θ̂₂) as in the scalar case. The set of pairs (θ₁, θ₂) which satisfy R(θ₁, θ₂) ≥ p is then called a p
likelihood region for (θ₁, θ₂). The concept of confidence intervals can similarly be extended to confidence regions.

over 600 hours. In Table 4.5.1 the censored observations are those recorded as 600 hours.

It is known from past experience that at each temperature level lifetimes follow close to an ex-
ponential distribution; let us therefore suppose that at temperature t (t = 40, 50, 60, 70), component
lifetimes Y have probability density function

    f(y; θₜ) = (1/θₜ) e^(−y/θₜ),   y ≥ 0,     (4.5.1)

where θₜ = E(Y) is the mean lifetime. The likelihood function based on a sample consisting of both
censoring times and lifetimes is a little different from one based on a random sample of lifetimes. It is,
for the tests at temperature t,

    L(θₜ) = { ∏_LT (1/θₜ) e^(−yᵢ/θₜ) } { ∏_CT e^(−yᵢ/θₜ) },     (4.5.2)

where LT stands for the set of lifetimes yᵢ and CT for the set of censoring times yᵢ.

Question 1 Show that for the distribution (4.5.1), P(Y > y) = e^(−y/θₜ). Then describe how (4.5.2) is
obtained.

Note that (4.5.2) can be rewritten as

    L(θₜ) = θₜ^(−r) e^(−s/θₜ),     (4.5.3)

where r = number of lifetimes observed and s = Σᵢ yᵢ = sum of all lifetimes and censoring times.
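To illustrate how the likelihood (4.5.3) can be examined numerically (the values r = 10 and s = 2500 below are made up for illustration and are not taken from Table 4.5.1), the relative likelihood can be computed and plotted in R as follows.

    # Relative likelihood based on (4.5.3) for an exponential model with censoring
    r <- 10; s <- 2500            # made-up values, for illustration only
    loglik <- function(theta) -r * log(theta) - s / theta
    theta <- seq(100, 800, by = 1)
    R <- exp(loglik(theta) - max(loglik(theta)))   # relative likelihood on the grid
    plot(theta, R, type = "l", xlab = "theta", ylab = "R(theta)")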
Question 2 Assuming that the exponential model is correct, obtain the m.l.e.'s θ̂ₜ for the mean lifetime at
each of the four temperature levels t = 40, 50, 60, 70. Graph the likelihood functions for θ₄₀ and θ₇₀
and comment on any qualitative differences.

Question 3 Check, perhaps using some kind of graph, whether the exponential model seems appropriate.

Engineers used a model (called the Arrhenius model) that relates the mean lifetime of a component
to the ambient temperature. This states that

    θₜ = exp{α + β/(t + 273.2)},     (4.5.4)

where t is the temperature in degrees Celsius and α and β are parameters.

Question 4 Make a plot of log θ̂ₜ vs (t + 273.2)^(−1) for the four temperatures involved in the life
test experiment. Do the points lie roughly along a straight line? Give rough point estimates of α and
β. Extrapolate your plot or use your estimates of α and β to estimate the mean lifetime at t = 20°C,
the normal usage temperature.

Question 5 A point estimate of θₜ at t = 20°C is not very satisfactory. Outline how you might attempt
to get an interval estimate based on the likelihood function. Once armed with an interval estimate,
would you have many remaining qualms about indicating to the engineers what mean lifetime could be
expected at 20°C? (Explain.)

Question 6 Engineers and statisticians have to design reliability tests like the one just discussed, and
considerations such as the following are often used.

Suppose that the mean lifetime at 20°C is supposed to be about 90,000 hours and that at 70°C you
know from past experience that it is about 100 hours. If the model (4.5.4) applies, determine what α and
β must approximately equal, and thus what θₜ is roughly equal to at 40°, 50° and 60°C. How might you
use this information in deciding how long a period of time to run the life test? In particular, give the
approximate expected number of uncensored lifetimes from an experiment that was terminated after
600 hours.

Table 4.5.1: Lifetimes (in hours) from an accelerated life test experiment on PC power supplies

Temperature
70 C 60 C 50 C 40 C
2 1 55 78
5 20 139 211
9 40 206 297
10 47 263 556
10 56 347 600
11 58 402 600
64 63 410 600
66 88 563 600
69 92 600 600
70 103 600 600
71 108 600 600
73 125 600 600
75 155 600 600
77 177 600 600
97 209 600 600
103 224 600 600
115 295 600 600
130 298 600 600
131 352 600 600
134 392 600 600
145 441 600 600
181 489 600 600
242 600 600 600
263 600 600 600
283 600 600 600

Notes: Lifetimes are given in ascending order; observations recorded as 600 hours are censored (the component had not failed when the test was terminated at 600 hours).

4.4 Problems

1. Consider the data on heights of adult males and females from Chapter 1. (The data are on the
course web page.)

(a) Assuming that for each gender the heights Y in the population from which the samples
were drawn are adequately represented by Y ∼ G(μ, σ), obtain the m.l.e.'s μ̂ and σ̂ in each
case.
(b) Give the m.l.e.'s for the 10th and 90th percentiles of the height distribution for males and
for females.
(c) Give the m.l.e.'s for the probability P(Y > 1.83) for males and females (i.e. the fraction
of the population over 1.83 m, or 6 ft).
(d) A simpler estimate of P(Y > 1.83) that doesn't use the Gaussian model is

    P̂(Y > 1.83) = (number of persons in sample with y > 1.83)/n,
where here n = 150. Obtain these estimates for males and for females. Can you think of
any advantages for this estimate over the one in part (c)? Can you think of any disadvan-
tages?
(e) Suggest and try a method of estimating the 10th and 90th percentile of the height distribu-
tion that is similar to that in part (d).

2. When we measure something we are in effect estimating the true value of the quantity; measure-
ments of the same quantity on different occasions are usually not equal. A chemist has two ways
of measuring a particular quantity μ; one has more random error than the other. For method I,
measurements X₁, X₂, ... follow a normal distribution with mean μ and variance σ₁², whereas
for method II, measurements Y₁, Y₂, ... have a normal distribution with mean μ and variance
σ₂².

(a) Suppose that the chemist has n measurements X₁, ..., Xₙ of a quantity by method I and m
measurements Y₁, ..., Yₘ by method II. Assuming that σ₁² and σ₂² are known, write down
the combined likelihood function for μ, and show that

    μ̂ = (w₁X̄ + w₂Ȳ)/(w₁ + w₂),

where w₁ = n/σ₁² and w₂ = m/σ₂².

(b) Suppose that σ₁ = 1, σ₂ = .5 and n = m = 10. How would you rationalize to a non-
statistician why you were using the estimate μ̂ = (x̄ + 4ȳ)/5 instead of (x̄ + ȳ)/2?

(c) Determine the standard deviation of μ̂ and of (X̄ + Ȳ)/2 under the conditions of part (b).
Why is μ̂ a better estimate?

3. Suppose that a fraction p of a large population of persons over 18 years of age never drink alcohol.
In order to estimate p, a random sample of n persons is to be selected and the number y who do
not drink determined; the maximum likelihood estimate of p is then p̂ = y/n.
We want our estimate p̂ to have a high probability of being close to p, and want to know how
large n should be to achieve this.

(a) Consider the random variable Y and the estimator p̃ = Y/n. Describe how you could work
out the probability that −.03 ≤ p̃ − p ≤ .03, if you knew the values of n and p.
(b) Suppose that p is .40. Determine how large n should be in order to make P(−.03 ≤ p̃ − p ≤
.03) = .95. Use an approximation if you wish.

4. Proof of Central Limit Theorem (Special Case) Suppose Y₁, Y₂, ... are independent random
variables with E(Yᵢ) = μ, Var(Yᵢ) = σ², and that they have the same distribution, which has a
m.g.f.

(a) Show that (Yᵢ − μ)/σ has m.g.f. of the form 1 + t²/2 + (terms in t³, t⁴, ...), and thus that
(Yᵢ − μ)/(σ√n) has m.g.f. of the form 1 + t²/(2n) + Rₙ, where Rₙ is a remainder
term with the property that nRₙ → 0 as n → ∞.
(b) Let

    Zₙ = Σᵢ₌₁ⁿ (Yᵢ − μ)/(σ√n) = √n(Ȳ − μ)/σ

and note that its m.g.f. is of the form (1 + t²/(2n) + Rₙ)ⁿ. Show that as n → ∞ this ap-
proaches the limit e^(t²/2), which is the m.g.f. for G(0, 1). (Hint: For any real number a,
(1 + a/n)ⁿ → e^a as n → ∞.)

5. A sequence of random variables {Xₙ} is said to converge in probability to the constant c if for
all ε > 0,

    lim_{n→∞} P{|Xₙ − c| ≥ ε} = 0.

We denote this by writing Xₙ →ᵖ c.

(a) If {Xₙ} and {Yₙ} are two sequences of random variables with Xₙ →ᵖ c₁ and Yₙ →ᵖ c₂,
show that Xₙ + Yₙ →ᵖ c₁ + c₂ and XₙYₙ →ᵖ c₁c₂.

(b) Let X₁, X₂, ... be i.i.d. random variables with probability density function f(x; θ). A
point estimator θ̂ₙ based on a random sample X₁, ..., Xₙ is said to be consistent for θ if
θ̂ₙ →ᵖ θ as n → ∞.
i. Let X₁, ..., Xₙ be i.i.d. U(0, θ). Show that the m.l.e. θ̂ₙ is consistent for θ.
ii. Let X ∼ Bin(n, p). Show that p̂ₙ = X/n is consistent for p.

6. Refer to the definition of consistency in Problem 5(b). Difficulties can arise when the number of
parameters increases with the amount of data. Suppose that two independent measurements of
blood sugar are taken on each of n individuals and consider the model

    Xᵢ₁, Xᵢ₂ ∼ N(μᵢ, σ²),   i = 1, ..., n,

where Xᵢ₁ and Xᵢ₂ are the independent measurements. The variance σ² is to be estimated, but
the μᵢ's are also unknown.

(a) Find the m.l.e. σ̂² and show that it is not consistent. (To do this you have to find the m.l.e.'s
for μ₁, ..., μₙ as well as for σ².)
(b) Suggest an alternative way to estimate σ² by considering the differences Wᵢ = Xᵢ₁ − Xᵢ₂.
(c) What does σ represent physically if the measurements are taken very close
together in time?

7. Suppose that blood samples for nk people are to be tested to obtain information about θ, the
fraction of the population infected with a certain virus. In order to save time and money, pooled
testing is used: the nk samples are mixed together k at a time to give n pooled samples. A
pooled sample will test negative if all k individuals are not infected.

(a) Give an expression for the probability that x out of n samples will be negative, if the nk
people are a random sample from the population. State any assumptions you make.
(b) Obtain a general expression for the maximum likelihood estimate θ̂ in terms of n, k and x.
(c) Suppose n = 100, k = 10 and x = 89. Give the m.l.e. θ̂ and the relative likelihood function,
and find a 10% likelihood interval for θ.
(d) Discuss (or do it) how you would select an “optimal” value of k to use for pooled testing,
if your objective was not to estimate θ but to identify persons who are infected, with the
smallest number of tests. Assume that you know the value of θ and that the procedure would be
to test all k persons individually each time a pooled sample was positive. (Hint: Suppose a
large number n of persons must be tested, and find the expected number of tests needed.)

8. (a) For the data in Problem 4 of Chapter 2, plot the relative likelihood function R(θ) and
determine a .10 likelihood interval. Is θ very accurately determined?
(b) Suppose that we can find out whether each pair of twins is identical or not, and that it is
determined that of 50 pairs, 17 were identical. Obtain the likelihood function and m.l.e.
of θ in this case. Plot the relative likelihood function with the one in (a), and compare the
accuracy of estimation in the two cases.

9. Company A leased photocopiers to the federal government, but at the end of their recent contract
the government declined to renew the arrangement and decided to lease from a new vendor,
Company B. One of the main reasons for this decision was a perception that the reliability of
Company A’s machines was poor.

(a) Over the preceding year the monthly numbers of failures requiring a service call from
Company A were
16 14 25 19 23 12
22 28 19 15 18 29.
Assuming that the number of service calls needed in a one month period has a Poisson
distribution with mean θ, obtain and graph the relative likelihood function R(θ) based on
the data above.
(b) In the first year using Company B's photocopiers, the monthly numbers of service calls
were
13 7 12 9 15 17
10 13 8 10 12 14.
Under the same assumption as in part (a), obtain R(θ) for these data and graph it on the
same graph as used in (a). Do you think the government's decision was a good one, as far
as the reliability of the machines is concerned?
(c) Use the likelihood ratio statistic Λ(θ) as an approximate pivotal quantity to get .95 confi-
dence intervals for θ for each company.
(d) What conditions would need to be satisfied to make the assumptions and analysis in (a) to
(c) valid? What approximations are involved?

10. The lifetime T (in days) of a particular type of lightbulb is assumed to have a distribution with
probability density function

    f(t; θ) = θ³ t² e^(−θt) / 2,   t > 0, θ > 0.

(a) Suppose t₁, t₂, ..., tₙ is a random sample from this distribution. Show that the likelihood
function for θ is proportional to

    L(θ) = θ^(3n) e^(−θ Σ tᵢ).

Find the m.l.e. θ̂ and the relative likelihood function R(θ).

(b) If n = 20 and Σ tᵢ = 996, graph R(θ) and determine the 10% likelihood interval for θ.
What is the approximate confidence level associated with this interval?
(c) Suppose we wish to estimate the mean lifetime of a lightbulb. Show that E(T) = 3/θ. (Recall
that ∫₀^∞ x^(n−1) e^(−x) dx = Γ(n) = (n − 1)! for n = 1, 2, ...). Find a .95 confidence interval
for the mean.
(d) The probability p that a lightbulb lasts less than 50 days is p = P(T ≤ 50) = 1 −
e^(−50θ) [1250θ² + 50θ + 1]. (Can you show this?) Thus p̂ = .580 and we can find a .95
confidence interval for p from a CI for θ. In the data referred to in part (b), the number of
lightbulbs which lasted less than 50 days was 11 (out of 20). Using a binomial model, we
can also obtain a .95 confidence interval for p (see Examples 4.4.3 and 4.4.4). Find both
intervals. What are the pros and cons of the second interval over the first one?

11. The χ² (Chi-squared) distribution. Consider the χ²_(r) distribution whose probability density
function is given by (4.2.8) in Section 4.2.3. If Y ∼ χ²_(r) then

(a) show that f(y) integrates to 1 for any r in {1, 2, ...};

(b) find the m.g.f. of Y (see (4.2.10)) and use it to show that E(Y) = r and Var(Y) = 2r;
(c) plot the probability density functions for the χ²_(5) and χ²_(10) distributions on the same graph.

12. In an early study concerning survival time for patients diagnosed with Acquired Immune Defi-
ciency Syndrome (AIDS), the survival times (i.e. times between diagnosis of AIDS and death)
of 30 male patients were such that Σᵢ₌₁³⁰ xᵢ = 11,400 days. It is known that survival times were
approximately exponentially distributed with mean θ days.

(a) Write down the likelihood function for θ and obtain the likelihood ratio statistic. Use this
to get an approximate .90 confidence interval for θ.
(b) Show that m = θ log 2 is the median survival time. Give a .90 confidence interval for m.

13. Let X have an exponential distribution with probability density function

    f(x; θ) = (1/θ) e^(−x/θ),   x > 0,

where θ > 0.
(a) Show that Y = 2X/θ has a χ²_(2) distribution. (Hint: compare the probability density
function of Y with (4.2.8).)
(b) If X₁, ..., Xₙ is a random sample from the exponential distribution above, prove that

    U = 2 Σᵢ₌₁ⁿ Xᵢ / θ ∼ χ²_(2n).

(You may use results in Section 4.2.) U is therefore a pivotal quantity, and can be used to
get confidence intervals for θ.
(c) Refer to Problem 12. Using the fact that

    P(43.19 ≤ χ²_(60) ≤ 79.08) = .90,

obtain a .90 confidence interval for θ based on U. Compare this with the interval found in
12(a). Which interval is preferred here? (Why?)

14. Two hundred adults are chosen at random from a population and each is asked whether informa-
tion about abortions should be included in high school public health sessions. Suppose that 70%
say they should.

(a) Obtain a 95% confidence interval for the proportion p of the population who support abor-
tion information being included.
(b) Suppose you found out that the 200 persons interviewed consisted of 50 married couples
and 100 other persons. The 50 couples were randomly selected, as were the other 100 per-
sons. Discuss the validity (or non-validity) of the analysis in (a).

15. Consider the height data discussed in Problem 1 above. If heights Y are G(μ, σ) and μ̃ = Ȳ and
σ̃² = Σᵢ₌₁ⁿ (Yᵢ − μ̃)²/n are the ML estimators based on a sample of size n, then it can be shown
that when n is large,

    Z = √n(Ȳ − μ)/σ̃

is very close to G(0, 1), and so it is approximately a pivotal quantity. Use Z to obtain a .99
confidence interval for μ for males and for females.

16. In the U.S.A. the prevalence of HIV (Human Immunodeficiency Virus) infections in the popula-
tion of child-bearing women has been estimated by doing blood tests (anonymized) on all women
giving birth in a hospital. One study tested 29,000 women and found that 64 were HIV positive
(had the virus). Give an approximate .99 confidence interval for θ, the fraction of the population
that is HIV positive.
State any concerns you feel about the accuracy of this estimate.
STATISTICAL INFERENCE: TESTING
HYPOTHESES

5.1 Introduction
There are often hypotheses that a statistician or scientist might want to “test” in the light of observed
data. Two important types of hypotheses are

(1) that a parameter vector θ has some specified value θ₀; we denote this as H₀: θ = θ₀.

(2) that a random variable Y has a specified probability distribution, say with probability density
function f₀(y); we denote this as H₀: Y ∼ f₀(y).

The statistical approach to hypothesis testing is as follows: First, assume that the hypothesis H0
will be tested using some random data “Data”. Next, define a test statistic (also called a discrepancy
measure) D = g(Data) that is constructed to measure the degree of “agreement” between Data and
the hypothesis H0 . It is conventional to define D so that D = 0 represents the best possible agreement
between the data and H0 , and so that the larger D is, the poorer the agreement. Methods of constructing
test statistics will be described later. Third, once specific observed “data” have been collected, let
dobs = g(data) be the corresponding observed value of D. To test H₀, we now calculate the observed
significance level (also called the p-value), defined as

    SL = P(D ≥ dobs; H₀),     (5.1.1)

where the notation “; H₀” means “assuming H₀ is true”. If SL is close to zero then we are inclined
to doubt that H₀ is true, because if it is true the probability of getting agreement as poor as or worse
than observed is small. This makes the alternative explanation that H₀ is false more appealing. In
other words, we must accept that one of the following two statements is correct:

(a) H0 is true but by chance we have observed data that indicate poor agreement with H0 , or


(b) H0 is false.

An SL less than about .05 provides moderately strong evidence against H₀.

Example 5.1.1 Testing a binomial probability


Suppose that it is suspected that a 6-sided die has been “doctored” so that the number one turns up
more often than if the die were fair. Let θ = P(die turns up one) on a single toss and consider the
hypothesis H₀: θ = 1/6. To test H₀, we toss the die n times and observe the number of times Y that
a one occurs. Then “Data” = Y and a reasonable test statistic would then be either D₁ = |Y − n/6| or
(if we wanted to focus on the possibility that θ was bigger than 1/6), D = max(Y − n/6, 0).

Suppose that n = 180 tosses gave y = 44. Using D, we get dobs = 14 and the significance level is
(using the second definition of D)

    SL = P(D ≥ 14; H₀)
       = P(Y ≥ 44; θ = 1/6)
       = Σ_{y=44}^{180} (180 choose y) (1/6)^y (5/6)^(180−y)
       = .005.

This provides strong evidence against H₀, and suggests that θ is bigger than 1/6.
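The binomial tail probability above is easy to evaluate as a check using R's built-in pbinom function:

    # SL = P(Y >= 44) when Y ~ Bin(180, 1/6)
    1 - pbinom(43, size = 180, prob = 1/6)   # approximately 0.005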

Example 5.1.2 Suppose that in the experiment in Example 5.1.1 we observed y = 35 ones in n = 180
tosses. Now the SL is

    SL = P(Y ≥ 35; θ = 1/6)
       = Σ_{y=35}^{180} (180 choose y) (1/6)^y (5/6)^(180−y)
       = 0.183.

This probability is not especially small, so we conclude that there is no strong evidence against H0 .
Note that we do not claim that H0 is true, only that there is no evidence that it is not true.

Example 5.1.3. Testing for bias in a measurement system


Two cheap scales A and B for measuring weight are tested by taking 10 weighings of a 1 kg weight on
each of the scales. The measurements on A and B are

A: 1.026, 0.998, 1.017, 1.045, 0.978, 1.004, 1.018, 0.965, 1.010, 1.000
B: 1.011, 0.966, 0.965, 0.999, 0.988, 0.987, 0.956, 0.969, 0.980, 0.988

Let Y represent a single measurement on one of the scales, and let μ represent the average measure-
ment E(Y) in repeated weighings of a single 1 kg weight. If an experiment involving n weighings is
conducted then a sensible test of H₀: μ = 1 could be based on the test statistic

    D = |Ȳ − 1.00|,

where Ȳ = Σ Yᵢ/n. Since Ȳ ≈ G(μ, σ/√n), where μ = E(Y) and σ² = Var(Y), we can compute
the significance level (at least approximately) using a Gaussian distribution. Since we don't know σ²
we will estimate it by the sample variance s² = Σ(yᵢ − ȳ)²/(n − 1) in the calculations below.

The samples from scales A and B above give us

    A: ȳ = 1.0061,  s = 0.0230,  dobs = 0.0061
    B: ȳ = 0.9810,  s = 0.0170,  dobs = 0.0190.

The SL for A is (pretending σ = s = 0.0230)

    SL = P(D ≥ 0.0061; μ = 1.00)
       = P(|Ȳ − 1.00| ≥ 0.0061)
       = P( |Ȳ − 1.00| / (0.0230/√10) ≥ 0.0061 / (0.0230/√10) )
       = P(|Z| ≥ 0.839)   where Z ∼ G(0, 1)
       = .401.

Thus there is no evidence of bias (that is, that H₀: μ = 1.00 is false) for scale A.

For scale B, however, we get

    SL = P( |Ȳ − 1.00| / (0.0170/√10) ≥ 0.0190 / (0.0170/√10) )
       = P(|Z| ≥ 3.534) = .0004.

Thus there is strong evidence against H₀: μ = 1.00, suggesting strongly that scale B is biased.

Finally, note that just because there is strong evidence against H₀ for scale B, the degree of bias in
its measurements is not necessarily large. In fact, we can get an approximate .95 confidence interval
for μ = E(Y) for scale B by using the approximate pivotal quantity

    Z = (Ȳ − μ) / (s/√10) ≈ G(0, 1).

Since P(−1.96 ≤ Z ≤ 1.96) ≈ .95, we get the approximate .95 confidence interval ȳ ± 1.96 s/√10,
or 0.981 ± .011, or 0.970 ≤ μ ≤ 0.992. Thus the bias in measuring the 1 kg weight is likely fairly
small (about 1% - 3%).
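These calculations can be reproduced with a few lines of R (a sketch; the two samples are entered directly and the Gaussian approximation above is used):

    # Approximate significance levels for scales A and B, and a .95 CI for scale B
    A <- c(1.026, 0.998, 1.017, 1.045, 0.978, 1.004, 1.018, 0.965, 1.010, 1.000)
    B <- c(1.011, 0.966, 0.965, 0.999, 0.988, 0.987, 0.956, 0.969, 0.980, 0.988)
    SL <- function(y, mu0 = 1.00) {
      z <- abs(mean(y) - mu0) / (sd(y) / sqrt(length(y)))
      2 * (1 - pnorm(z))                  # P(|Z| >= observed value)
    }
    SL(A)                                 # about 0.40: no evidence of bias
    SL(B)                                 # about 0.0004: strong evidence of bias
    mean(B) + c(-1.96, 1.96) * sd(B) / sqrt(length(B))   # roughly (0.970, 0.992)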

The approach to testing hypotheses described above is very general and straightforward, but a few
points should be stressed:

1. If the SL is small (close to 0) then the test indicates strong evidence against H₀; this is often
termed “statistically significant” evidence against H₀. Rough rules of thumb are that SL < .05
provides moderately strong evidence against H₀ and that SL < .01 provides strong evidence.

2. If the SL is not small we do not conclude by saying that H₀ is true: we simply say there is
no evidence against H₀. The reason for this “hedging” is that in most settings a hypothesis
may never be strictly “true”. (For example, one might argue when testing H₀: θ = 1/6 in
Example 5.1.1 that no real die ever has a probability of exactly 1/6 for side 1.) Hypotheses can
be “disproved” (with a small degree of possible error) but not proved.

3. Just because there is strong evidence (“highly statistically significant” evidence) against a hy-
pothesis H₀, there is no implication about how “wrong” H₀ is. For example, in Example 5.1.3
there was strong evidence that scale B was biased (that is, strong evidence against H₀: bias =
0), but the relative magnitude (1-3%) of the bias is apparently small. In practice, we try to sup-
plement a significance test with an interval estimate that indicates the magnitude of the departure
from H₀. This is how we check whether a result is “scientifically” significant as well as statis-
tically significant.

A drawback with the approach to testing described so far is that we are not told how to construct
test statistics D. Often there are “intuitively obvious" statistics that can be used; this is the case in the
examples of this section. However, in more complicated situations it is not always easy to come up
with a test statistic. In the next section we show how to use the likelihood function to construct test
statistics in general settings.
A final point is that once we have specified a test statistic D, we need to be able to compute the
significance level (5.1.1) for the observed data. This brings us back to distribution theory: in most cases
the exact probability (5.1.1) is hard to determine mathematically, and we must either use an approx-
imation or use computer simulation. Fortunately, for the tests in the next section we can usually use
approximations based on 2 distributions.

5.2 Testing Parametric Hypotheses with Likelihood Ratio Statistics

General Theory

First we show how test statistics can be constructed from the likelihood function for any hypothesis
H₀ that is specified in terms of one or more parameters. Let “Data” represent data generated from
a distribution with probability or probability density function f(Data; θ) which depends on the k-
dimensional parameter θ. Let Ω be the parameter space (set of possible values) for θ.
Consider a hypothesis of the form

    H₀: θ ∈ Ω₀,

where Ω₀ ⊂ Ω and Ω₀ is of dimension p < k. The dimensions of Ω and Ω₀ refer to the minimum
number of parameters (or “coordinates”) needed to specify points in them. We can test H₀ using as our
test statistic the likelihood ratio test statistic Λ, defined as follows:
Let L(θ) = f(Data; θ) be the likelihood function and let
    θ̂ denote the m.l.e. of θ over Ω,
    θ̂₀ denote the m.l.e. of θ over Ω₀.

Now let

    Λ = 2ℓ(θ̂) − 2ℓ(θ̂₀) = −2 log{ L(θ̂₀) / L(θ̂) }.     (5.2.1)

This seems like a sensible way to measure the degree of agreement between H₀ and the observed data:
we look at the relative likelihood

    R(θ̂₀) = L(θ̂₀)/L(θ̂)

of the “most likely” vector θ̂₀ under H₀ (i.e. when θ is restricted to Ω₀). If this is very small then there is
evidence against H₀ (think why). The reason we take D = Λ as the test statistic instead of R(θ̂₀) is
that Λ = −2 log R(θ̂₀) takes values ≥ 0, with Λ = 0 (R(θ̂₀) = 1) representing the best agreement
between H₀ and the data. Also, it can be shown that under H₀, the distribution of Λ becomes χ²_(k−p) as
the size of the data set becomes large. Large values of λobs (small values of R(θ̂₀)) indicate evidence
against H₀, so the p-value (significance level) is

    SL = P(Λ ≥ λobs; H₀) ≈ P(χ²_(k−p) ≥ λobs).     (5.2.2)

(Note: Here we are using λobs to represent the value of Λ obtained when we get the (numerical) data;
Λ represents the r.v. when we think, as usual, of the data as random variables, before they are collected.)

Some Examples

This approach is very general and can be used with many different types of problems. A few examples
follow.

(a) A single parameter model. Suppose Y has an exponential distribution with probability density
function f(y; θ) = (1/θ) e^(−y/θ), y ≥ 0. Test H₀: θ = θ₀ (a given value) based on a random sample
y₁, ..., yₙ. Thus Ω = {θ: θ > 0}, Ω₀ = {θ₀}, k = 1, p = 0 and

    L(θ) = ∏ᵢ₌₁ⁿ f(yᵢ; θ).

(b) A model with two parameters. Suppose Y ∼ G(μ, σ) with probability density function

    f(y; μ, σ) = (1/(σ√(2π))) e^(−(1/2)((y−μ)/σ)²),   −∞ < y < ∞;   θ = (μ, σ).

Test H₀: σ = σ₀ based on a random sample y₁, ..., yₙ. Thus Ω = {(μ, σ): −∞ < μ <
∞, σ > 0}, Ω₀ = {(μ, σ₀): −∞ < μ < ∞}, k = 2, p = 1 and

    L(θ) = L(μ, σ) = ∏ᵢ₌₁ⁿ f(yᵢ; μ, σ).

(c) Comparison of two parameters. Suppose we have data from two Poisson distributions with
p.f.'s f(y_{1i}; θ₁) = e^(−θ₁) θ₁^(y_{1i}) / y_{1i}! and f(y_{2i}; θ₂) = e^(−θ₂) θ₂^(y_{2i}) / y_{2i}!,
where y_{1i} and y_{2i} take values in {0, 1, 2, ...}.

Test H₀: θ₁ = θ₂ based on two independent random samples y_{1i} (i = 1, ..., n₁) and y_{2i} (i =
1, ..., n₂). Thus θ = (θ₁, θ₂) and Ω = {(θ₁, θ₂): θ₁ > 0, θ₂ > 0}, Ω₀ = {(θ, θ): θ > 0},
k = 2, p = 1 and

    L(θ) = L(θ₁, θ₂) = ∏ᵢ₌₁^(n₁) f(y_{1i}; θ₁) ∏ᵢ₌₁^(n₂) f(y_{2i}; θ₂).

(d) A test about multinomial probabilities. Consider a multinomial p.f.

    f(y₁, ..., yₘ; p₁, ..., pₘ) = [n!/(y₁! ··· yₘ!)] p₁^(y₁) p₂^(y₂) ··· pₘ^(yₘ),   0 ≤ yⱼ ≤ n,  Σ yⱼ = n.

Test H₀: pⱼ = pⱼ(α), where the parameter α has dimension p < m − 1. Thus θ = (p₁, ..., pₘ) and

    Ω = {(p₁, ..., pₘ): 0 ≤ pⱼ ≤ 1, Σ pⱼ = 1},   Ω₀ = {(p₁, ..., pₘ): pⱼ = pⱼ(α) for some α},

k = m − 1, the dimension of Ω₀ is p, and

    L(θ) = f(y₁, ..., yₘ; p₁, ..., pₘ).

We now consider problems that involve actual data, and describe the steps in the tests in more de-
tail. In many testing problems it is necessary to use numerical methods to maximize likelihoods and
find θ̂ and θ̂₀. In this course and in the examples below we focus on problems in which these estimates
can be found mathematically.

Example 5.2.1 Lifetimes of light bulbs. The variability in lifetimes of light bulbs (in hours, say,
of operation before failure) and other electrical components is often well described by an exponential
distribution with probability density function of the form

    f(y; θ) = (1/θ) e^(−y/θ),   y > 0,

where θ = E(Y) is the average (mean) lifetime. A manufacturer claims that the mean life of a partic-
ular brand of bulbs is 2000 hours. We can examine that claim by testing the hypothesis

    H₀: θ = 2000,

assuming that the exponential model applies.


Suppose for illustration that a random sample of 20 light bulbs was tested over a long period and
the total of the lifetimes y₁, ..., y₂₀ was observed to be Σᵢ₌₁²⁰ yᵢ = 38,524 hours. (It turns out that
for the test below we need only the value of Σ yᵢ and not the individual lifetimes y₁, ..., y₂₀, so we
haven't bothered to list them. They would be needed, however, to check that the exponential model was
satisfactory.)

Let us carry out a likelihood ratio test of H₀. The setup is as described in Example (a) above: the
likelihood function based on a random sample y₁, ..., yₙ is

    L(θ) = ∏ᵢ₌₁ⁿ (1/θ) e^(−yᵢ/θ) = θ^(−n) e^(−Σ yᵢ/θ).

Note that in terms of our general theory the parameter space of θ is Ω = {θ: θ > 0} and the parameter
space under H₀ is the single point Ω₀ = {2000}. The dimensions of Ω and Ω₀ are 1 and 0, respectively.
We use the likelihood ratio statistic Λ of (5.2.1) as our test statistic D. To evaluate this we first write
down the log likelihood function (noting that n = 20 and Σ yᵢ = 38,524 here)

    ℓ(θ) = −20 log θ − 38524/θ.

Next, we obtain θ̂ by maximizing ℓ(θ); this gives

    θ̂ = 38524/20 = 1926.2 (hours).

Now we can compute the “observed” value of Λ from (5.2.1) as

    λobs = 2ℓ(θ̂) − 2ℓ(2000)
         = −40 log(θ̂/2000) − 77048/θ̂ + 77048/2000
         = 0.028.

The final computational step is to compute the significance level, which we do using the χ² approxi-
mation (5.2.2). This gives

    SL = P(Λ ≥ 0.028; H₀ true)
       ≈ P(χ²_(1) ≥ 0.028)
       = 0.87.

The SL is not close to zero, so we conclude that there is no evidence against H₀ and against the
manufacturer's claim that θ is 2000 hours. Although the m.l.e. θ̂ was under 2000 hours (1926.2) it was
not sufficiently far under to give conclusive evidence against H₀: θ = 2000.
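The numbers in this example can be verified with a short R sketch (the closed forms for ℓ(θ) and θ̂ are taken from the derivation above):

    # Likelihood ratio test of H0: theta = 2000 for exponential lifetimes
    n <- 20; S <- 38524
    loglik <- function(theta) -n * log(theta) - S / theta
    theta.hat <- S / n                    # 1926.2
    lambda.obs <- 2 * loglik(theta.hat) - 2 * loglik(2000)
    lambda.obs                            # about 0.028
    1 - pchisq(lambda.obs, df = 1)        # SL, about 0.87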

Example 5.2.2 Comparison of Two Poisson Distributions. In Problem 9 of Chapter 4 some
data were given on the numbers of failures per month for each of two companies' photocopiers. To a
good approximation we can assume that in a given month the number of failures Y follows a Poisson
distribution with p.f. of the form

    f(y; θ) = P(Y = y) = θ^y e^(−θ) / y!,   y = 0, 1, 2, ...,

where θ = E(Y) is the mean number of failures per month. (This ignores that the number of days that
the copiers are used varies a little across months. Adjustments can be made to the analysis to deal with
this.)
The number of failures in 12 consecutive months for company A and company B’s copiers are given
below; there were the same number of copiers from each company in use.

Company A: 16 14 25 19 23 12 22 28 19 15 18 29
Company B: 13 7 12 9 15 17 10 13 8 10 12 14

Denote the value of θ for Company A's copiers as θ_A and the value for Company B's as θ_B. It
appears from the data that B's copiers fail less often, but let us assess this formally by testing the
hypothesis

    H₀: θ_A = θ_B.

This problem was sketched in Example (c) above. To handle it we consider the likelihood functions
for θ_A and θ_B based on the observed data and then the likelihood formed by combining them. The
likelihoods for A and B are

    L₁(θ_A) = ∏ᵢ₌₁¹² e^(−θ_A) θ_A^(y_Ai) / y_Ai! = e^(−12θ_A) θ_A^(240) × constant,
    L₂(θ_B) = ∏ᵢ₌₁¹² e^(−θ_B) θ_B^(y_Bi) / y_Bi! = e^(−12θ_B) θ_B^(140) × constant.

To test H₀ we view the two Poisson distributions together as one big model, with the parameter
vector θ = (θ_A, θ_B). In terms of the general framework for likelihood ratio tests the parameter space
for θ is Ω = {(θ_A, θ_B): θ_A > 0, θ_B > 0}, and under H₀ the parameter space becomes Ω₀ =
{(θ_A, θ_B): θ_A = θ_B > 0}. The likelihood function for the “combined” model is the product of
L₁(θ_A) and L₂(θ_B), since the two samples are independent:

    L(θ) = L(θ_A, θ_B) = L₁(θ_A) L₂(θ_B).

We now carry out the steps for the test of H₀ exactly as in the previous example, except that now Ω
has dimension k = 2 and Ω₀ has dimension p = 1. First, we write down the log likelihood function,

    ℓ(θ) = ℓ(θ_A, θ_B) = −12θ_A + 240 log θ_A − 12θ_B + 140 log θ_B.     (5.2.3)

Next we find θ̂ and θ̂₀. The m.l.e. θ̂ maximizes ℓ(θ_A, θ_B) in the unconstrained case. This can be done
by solving the maximum likelihood equations

    ∂ℓ/∂θ_A = 0,   ∂ℓ/∂θ_B = 0,

which gives θ̂_A = 240/12 = 20.0 and θ̂_B = 140/12 = 11.667. That is, θ̂ = (20.0, 11.667). The
constrained m.l.e. θ̂₀ maximizes ℓ(θ_A, θ_B) under the constraint θ_A = θ_B. To do this we merely have
to maximize

    ℓ(θ_A, θ_A) = −24θ_A + 380 log θ_A

with respect to θ_A. Solving ∂ℓ(θ_A, θ_A)/∂θ_A = 0, we find θ̂_A = 380/24 = 15.833 (= θ̂_B); that is,
θ̂₀ = (15.833, 15.833).

The next step is to compute the observed value of the likelihood ratio statistic, which from (5.2.1)
and (5.2.3) is

    λobs = 2ℓ(θ̂) − 2ℓ(θ̂₀)
         = 2ℓ(20.0, 11.667) − 2ℓ(15.833, 15.833)
         = 2[682.92 − 669.60]
         = 26.64.

Finally, we compute the significance level for the test, which by (5.2.2) is

    P(Λ ≥ 26.64; H₀ true) ≈ P(χ²_(1) ≥ 26.64)
                           ≈ 2.5 × 10⁻⁷.

Our conclusion is that there is very strong evidence against the hypothesis; the test indicates that
Company B's copiers have a lower rate of failure than Company A's copiers.

We can follow this conclusion up by giving confidence intervals for θ_A and θ_B; this indicates the
magnitude of the difference in the two failure rates. (The m.l.e.'s θ̂_A = 20.0 average failures per month
and θ̂_B = 11.67 failures per month differ a lot, but we also give confidence intervals in order to express
the uncertainty in such estimates.)
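For completeness, a short R sketch of the two-sample Poisson likelihood ratio test in Example 5.2.2 follows; it simply evaluates the closed-form m.l.e.'s and log likelihood derived above.

    # Likelihood ratio test of H0: thetaA = thetaB for two Poisson samples
    yA <- c(16, 14, 25, 19, 23, 12, 22, 28, 19, 15, 18, 29)
    yB <- c(13, 7, 12, 9, 15, 17, 10, 13, 8, 10, 12, 14)
    loglik <- function(thetaA, thetaB) {
      -12 * thetaA + sum(yA) * log(thetaA) - 12 * thetaB + sum(yB) * log(thetaB)
    }
    theta.hatA <- mean(yA); theta.hatB <- mean(yB)   # 20.0 and 11.667
    theta.hat0 <- mean(c(yA, yB))                    # 15.833 under H0
    lambda.obs <- 2 * loglik(theta.hatA, theta.hatB) - 2 * loglik(theta.hat0, theta.hat0)
    lambda.obs                                       # about 26.6
    1 - pchisq(lambda.obs, df = 1)                   # SL, about 2.5e-07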
Other hypothesis tests are considered in the remaining chapters of these notes. We conclude with
some short remarks about hypothesis testing and estimation.

5.3 Hypothesis Testing and Interval Estimation


Hypothesis tests that are of the form H₀: θ = θ₀, where θ is a scalar parameter, are very closely
related to interval estimates for θ. For likelihood ratio (LR) tests the connection is immediately obvious,
because the LR statistic

    Λ = 2ℓ(θ̂) − 2ℓ(θ₀)

is used for both tests and confidence intervals. For a test, the significance level using (5.2.2) is

    SL = P(χ²_(1) ≥ λobs(θ₀)),     (5.3.1)

where we write λobs(θ₀) to remind us that we are testing H₀: θ = θ₀. On the other hand, to get a
confidence interval for θ with confidence coefficient α we find by (4.4.4) all values θ₀ such that

    λobs(θ₀) ≤ χ²_(1),α,     (5.3.2)

where χ²_(1),α is the α quantile of a χ²_(1) distribution; for example, for a .95 confidence interval we use
χ²_(1),.95 = 3.84.

We now see the following by comparing (5.3.1) and (5.3.2):

The parameter value θ₀ is inside an α confidence interval given by (5.3.2) if and only if (iff) for
the test of H₀: θ = θ₀ we have SL ≥ 1 − α.

For example, θ₀ is inside the .95 confidence interval (CI) iff the significance level for H₀: θ = θ₀
satisfies SL ≥ .05. To see this note that

    SL ≥ .05
    ⇔ P(χ²_(1) ≥ λobs(θ₀)) ≥ .05
    ⇔ λobs(θ₀) ≤ 3.84
    ⇔ θ₀ is inside the .95 CI.

The connection between tests and confidence intervals can also be made when other test statistics
besides the LR statistic are used. If D is a test statistic for testing H₀: θ = θ₀ then we can obtain a .95
confidence interval for θ by finding all values θ₀ such that SL ≥ .05, or an α CI by finding values θ₀
such that SL ≥ 1 − α.

5.4 Problems
1. The accident rate over a certain stretch of highway was about θ = 10 per year for a period of
several years. In the most recent year, however, the number of accidents was 25. We want to
know whether this many accidents is very probable if θ = 10; if not, we might conclude that the
accident rate has increased for some reason. Investigate this question by assuming that the num-
ber of accidents in the current year follows a Poisson distribution with mean θ and then testing
H₀: θ = 10. Use the test statistic D = max(0, Y − 10), where Y represents the number of
accidents in the most recent year.

2. Refer back to Problem 1 in Chapter 1. Frame this problem as a hypothesis test. What test statistic
is being used? What are the significance levels from the data in parts (b) and (c)?

3. The R function runif() generates pseudo random U(0, 1) (uniform distribution on (0, 1)) ran-
dom variables. The command y <- runif(n) will produce a vector of n values y₁, ..., yₙ.

(a) Give a test statistic which could be used to test that the yi ’s (i = 1; ; n) are consistent
with a random sample from U (0; 1).
(b) Generate 1000 yi ’s and carry out the test in (a).

4. A company that produces power systems for personal computers has to demonstrate a high degree
of reliability for its systems. Because the systems are very reliable under normal use conditions,
it is customary to ‘stress’ the systems by running them at a considerably higher temperature than
they would normally encounter, and to measure the time until the system fails. According to a
contract with one PC manufacturer, the average time to failure for systems run at 70°C should be
no less than 1,000 hours.
From one production lot, 20 power systems were put on test and observed until failure at 70°C.
The 20 failure times x1 ; : : : ; x20 were (in hours)
374.2 544.0 1113.9 509.4 1244.3
551.9 853.2 3391.2 297.0 63.1
250.2 678.1 379.6 1818.9 1191.1
162.8 1060.1 1501.4 332.2 2382.0
(Note: Σᵢ₌₁²⁰ xᵢ = 18,698.6)

Failure times Xᵢ are known to be approximately exponential with mean θ.

(a) Use a likelihood ratio test to test the hypothesis that θ = 1000 hours. Is there any evidence
that the company's power systems do not meet the contracted standard?
(b) If you were a PC manufacturer using these power systems, would you like the company to
perform any other statistical analyses besides testing H₀: θ = 1000? Why?

5. In the Wintario lottery draw, six digit numbers were produced by six machines that operate
independently and which each simulate a random selection from the digits 0; 1; : : : ; 9. Of 736
numbers drawn over a period from 1980-82, the following frequencies were observed for position
1 in the six digit numbers:
Digit (i): 0 1 2 3 4 5 6 7 8 9 Total
Frequency (fi ): 70 75 63 59 81 92 75 100 63 58 736
Consider the 736 draws as trials in a multinomial experiment and let pᵢ = P(digit i is drawn on
any trial), i = 0, 1, ..., 9. If the machines operate in a truly 'random' fashion, then we should
have pᵢ = 0.1 (i = 0, 1, ..., 9).

(a) Test this hypothesis using a likelihood ratio test. What do you conclude?
(b) The data above were for digits in the first position of the six digit Wintario numbers. Sup-
pose you were told that similar likelihood ratio tests had in fact been carried out for each of
the six positions, and that position 1 had been singled out for presentation above because it
gave the largest value of the test statistic, D. What would you now do to test the hypothesis
pᵢ = 0.1 (i = 0, 1, 2, ..., 9)? (Hint: You need to consider P(largest of 6 independent D's
is ≥ Dobs).)

6. Testing a genetic model. Recall the model for the M-N blood types of people, discussed
in Examples 2.3.2 and 2.5.2. In a study involving a random sample of n persons the numbers
Y₁, Y₂, Y₃ (Y₁ + Y₂ + Y₃ = n) who have blood types MM, MN and NN respectively have a
multinomial distribution with p.f.

    f(y₁, y₂, y₃) = [n!/(y₁! y₂! y₃!)] p₁^(y₁) p₂^(y₂) p₃^(y₃),   yᵢ ≥ 0,  Σ yᵢ = n,

and since p₁ + p₂ + p₃ = 1 the parameter space Ω = {(p₁, p₂, p₃): pᵢ ≥ 0, Σ pᵢ = 1} has
dimension 2. The genetic model discussed earlier specified that p₁, p₂, p₃ can be expressed in
terms of only a single parameter θ (0 < θ < 1), as follows:

    p₁ = θ²,   p₂ = 2θ(1 − θ),   p₃ = (1 − θ)².     (5.4.1)

Consider (5.4.1) as a hypothesis H₀ to be tested. In that case, the dimension of the parameter
space for (p₁, p₂, p₃) under H₀ is 1, and the general methodology of likelihood ratio tests can be
applied. This gives a test of the adequacy of the genetic model.
Suppose that a sample with n = 100 persons gave observed values y₁ = 18, y₂ = 50, y₃ =
32. Test the model (5.4.1) and state your conclusion.

7. Likelihood ratio test for a Gaussian mean. Suppose that a r.v. Y has a G(μ, σ) distribution
and that we want to test the hypothesis H0: μ = μ0, where μ0 is some specified number. The
value of σ is unknown.

(a) Set this up as a likelihood ratio test. (Note that the parameter space is Ω = {θ = (μ, σ):
−∞ < μ < ∞, σ > 0}.) Assume that a random sample y1, ..., yn is available.
(b) Derive the LR statistic and show that it can be expressed as a function of t = √n(ȳ − μ0)/s,
where s is the sample standard deviation and ȳ is the sample mean. (Note: the easily
proved identity

Σ_{i=1}^n (yi − μ0)² = Σ_{i=1}^n (yi − ȳ)² + n(ȳ − μ0)²   (5.4.2)

can be used here.)

8. Use of simulation to obtain significance levels. In some testing problems the distribution of the
test statistic D is so complicated that it is not possible (or very difficult) to find the significance level,

SL = P(D ≥ Dobs; H0 true)

mathematically. In many problems computer simulation can be used as an alternative. Here is
an approach that can be used with the "runs test" for randomness in a binary sequence that was
discussed in Problem 7 of Chapter 1. For illustration we consider sequences of length 50.
Let R denote the number of runs in a sequence of 50 binary (0 or 1) digits. If the probability a
digit is 1 is p then, from part (b) of Problem 7, Chapter 1, we have E(R) = 1 + 98p(1 − p). If
p = .5 (i.e. 0 and 1 both have probability .5) then E(R) = 25.5. Thus, let us use

D = |R − 25.5|

as our statistic for testing the hypothesis


H0: Digits come from a Bernoulli process with p = .5.

(a) Suppose you observe R = 14 and want to find the SL. Since Dobs = 11.5,

SL = P(D ≥ 11.5) = P(R ≤ 14) + P(R ≥ 37)

Evaluate this in the following way:


(i) Simulate a sequence of 50 independent binary digits, with P(0) = P(1) = .5.
(ii) Determine R and store it.
(iii) Repeat this 1000 times and determine the fraction of times that either R ≤ 14 or R ≥ 37.
This is an approximation to SL (Why?).
Note: This problem is tricky because it requires that code be written to deal with step
(ii). Step (i) can be handled in the R statistical software system by the command
x <- rbinom(50, 1, .5), which generates 50 independent Bin(1, .5) random variables. The
vector x thus is a random binary sequence of the type desired. A minimal sketch of the whole
simulation is given below.
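Here is one minimal R sketch of this simulation. The run count is obtained with rle(), which is just
one convenient way of counting runs; treat the block as an illustrative sketch, not the required solution.

    # Approximate the significance level of the runs test by simulation
    nsim <- 1000
    R <- numeric(nsim)
    for (k in 1:nsim) {
      x <- rbinom(50, 1, 0.5)          # step (i): 50 independent Bernoulli(.5) digits
      R[k] <- length(rle(x)$lengths)   # step (ii): number of runs in the sequence
    }
    mean(R <= 14 | R >= 37)            # step (iii): approximate SL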

9. The Poisson model is often used to compare rates of occurrence for certain types of events in
different geographic regions. For example, consider K regions with populations P1, ..., PK
and let λj (j = 1, ..., K) be the annual expected number of events per person for region j.
By assuming that the number of events Yj for region j in a given t-year period has a Poisson
distribution with mean Pj λj t, we can estimate and compare the λj's or test that they are equal.

(a) Under what conditions might the stated Poisson model be reasonable?
(b) Suppose you observe values y1, ..., yK for a given t-year period. Describe how to test the
hypothesis that λ1 = λ2 = ... = λK.
(c) The data below show the numbers of children yj born with “birth defects" for 5 regions
over a given five year period, along with the total numbers of births Pj for each region.
Test the hypothesis that the five rates of birth defects are equal.

yj: 2025 1116 3210 1687 2840
Pj: 27 18 41 29 31
GAUSSIAN RESPONSE MODELS

6.1 Introduction
Gaussian models for response variables Y are very widely used in applications of statistics. We have
seen examples involving variables such as heights and body-mass index measurements of people previously in these notes. Many problems involve explanatory variables x (which may be a vector) that
are related to a response Y; in this case we can generalize the simple Gaussian model Y ∼ G(μ, σ) to
Y ∼ G(μ(x), σ(x)), where x is a vector of covariates (explanatory variables). The trick in creating
models in such settings is to decide on the forms for μ(x) and σ(x),

μ(x) = g1(x),   σ(x) = g2(x).

To do this we rely on past information and on current data from the population or process in question.
Here are some examples of settings where Gaussian models could be used.
Here are some examples of settings where Gaussian models could be used.

Example 6.1.1
The soft drink bottle filling process of Example 1.4.2 involved two machines (Old and New). For a
given machine it is reasonable to represent the distribution of the amount of liquid Y deposited in a
single bottle by a Gaussian distribution: Y ∼ G(μ, σ).
In this case we can think of the machines as being like a covariate, with μ and σ differing for the
two machines. We could write

Y ∼ G(μO, σO)   and   Y ∼ G(μN, σN)

for the old and new machines. In this case there is no formula relating μ and σ to the machines; they
are simply different.

Example 6.1.2 Price vs. Size of Commercial Buildings (Reference: Oldford and MacKay STAT
231 Course Notes, Ch. 16)
Ontario property taxes are based on "market value", which is determined by comparing a property to

the price of those which have recently been sold. The value of a property is separated into components
for land and for buildings. Here we deal with the value of the buildings only.
A large manufacturing company was appealing the assessed market value of its property, which
included a large building. Sales records were collected on the 30 largest buildings sold in the previous
three years in the area. The data are given in Table 6.1.1 and plotted in Figure 6.18 in a scatter plot,
which is a plot of the points (xi, yi). They include the size of the building x (in units of 10⁵ m²) and the selling
price y (in $ per m²).
The building in question was 4.47 × 10⁵ m², with an assessed market value of $75 per m².

Table 6.1.1 Size and Price of 30 Buildings

Size Price Size Price Size Price


3.26 226.2 0.86 532.8 0.38 636.4
3.08 233.7 0.80 563.4 0.38 657.9
3.03 248.5 0.77 578.0 0.38 597.3
2.29 360.4 0.73 597.3 0.38 611.5
1.83 415.2 0.60 617.3 0.38 670.4
1.65 458.8 0.48 624.4 0.34 660.6
1.14 509.9 0.46 616.4 0.26 623.8
1.11 525.8 0.45 620.9 0.24 672.5
1.11 523.7 0.41 624.3 0.23 673.5
1.00 534.7 0.40 641.7 0.20 611.8

The scatter plot shows that price (y) decreases as size (x) increases, but there is obviously
variability in the price of buildings having the same area (size). In this case we might consider a
model where the price of a building of size xi is represented by a random variable Yi, with

Yi ∼ G(μi, σ),   μi = β0 + β1 xi

where β0 and β1 are parameters. In this model we've assumed that σi = σ(xi) = σ, a constant. Alternatively, we could let it depend on xi somehow. (Note that the scatter plot does not provide much
information on how to do this, however.)

Example 6.1.3 Strength of Steel Bolts


The “breaking strength" of steel bolts is measured by subjecting a bolt to an increasing (lateral) force
and determining the force at which the bolt breaks. This force is called the breaking strength; it de-
pends on the diameter of the bolt and the material the bolt is composed of. There is variability in
breaking strengths: two bolts of the same dimensions and material will generally break at different
forces. Understanding the distribution of breaking strengths is very important in construction and other
areas.

Figure 6.18: Scatter Plot of Size vs. Price for 30 Buildings
The data below show the breaking strengths (y) of six steel bolts at each of five different bolt
diameters (x). The data are plotted in Figure 6.19

Diameter 0.10 0.20 0.30 0.40 0.50


1.62 1.71 1.86 2.14 2.45
Breaking 1.73 1.78 1.86 2.07 2.42
Strength 1.70 1.79 1.90 2.11 2.33
1.66 1.86 1.95 2.18 2.36
1.74 1.70 1.96 2.17 2.38
1.72 1.84 2.00 2.07 2.31

The scatter plot gives a clear picture of the relationship between y and x. A reasonable model for
the breaking strength Y of a randomly selected bolt of diameter x would appear to be Y ∼ G(μ(x), σ),
because the variability in y-values appears to be about the same for bolts of different diameters. It's not
clear what the best choice for μ(x) would be; the relationship looks slightly nonlinear, so presumably we want

μ(x) = β0 + β1 x + β2 x²

or some other nonlinear function.

Figure 6.19: Scatter Plot of Diameter vs. Strength for Steel Bolts

A Gaussian response model is one for which the distribution of the response variable Y, given the
associated covariates x for an individual unit, is of the form

Y ∼ G(μ(x), σ(x)).   (6.1.1)

If observations are made on n randomly selected units we often write this as

Yi ∼ G(μi, σi),   i = 1, ..., n

where μi = g1(xi) and σi = g2(xi) for some specified functions g1 and g2. In many problems it is
only μi that depends much on xi, and we then use models where σi = σ is constant. Furthermore, in
a great many situations μi can be written as a linear function of covariates. These models are called
Gaussian linear models and are of the following form:

Yi ∼ G(μi, σ),   i = 1, ..., n   (6.1.2)

with

μi = Σ_{j=1}^k βj xij,   i = 1, ..., n   (6.1.3)

where xi = (xi1, xi2, ..., xik) is the vector of covariates associated with unit i and the βj's are parameters.
Remark: Sometimes the model (6.1.2) is written a little differently as

Yi = μi + Ri   where   Ri ∼ G(0, σ).

This splits Yi into deterministic (μi) and random (Ri) components.

These models are also referred to as linear regression models, and the βj's are called the regression coefficients. (The term "regression" is used because it was introduced in the 19th century in
connection with these models. We won't bother explaining just how the term arose.)
The model (6.1.2) plus (6.1.3) describes many situations well. The following are some illustrations.

1. Y G( ; ), where Y is the height of a random female, corresponds to xi = (1); 1 =

2. The model in Example 6.1.2 had i = 0 + 1 xi where xi was the building’s size. This can be
re-expressed as i = 0 xi0 + 1 xi1 where xi0 = 1; xi1 = xi (here we’ve used xij with j = 0; 1
for simplicity.)

3. The bolt strength model in Example 6.1.3 had (x) = + 2.


0 1x + 2x
This could be re-expressed as

i = 0 xi1 + 1 xi1 + 2 xi2

where xi0 = 1; xi1 = xi ; xi2 = x2i .

Now we’ll consider estimation and testing procedures for Gaussian models. We begin in the next
section with models that have no covariates.

6.2 Inference for a single sample from a Gaussian Distribution

Suppose that Y ∼ G(μ, σ) models a response variable y in some population or process. A random
sample Y1, ..., Yn is selected, and we want to estimate the model parameters and possibly to test
hypotheses about them.
We have already seen in Section 2.2 that the maximum likelihood estimators of μ and σ² are

μ̂ = ȳ = (1/n) Σ_{i=1}^n yi,   σ̂² = (1/n) Σ_{i=1}^n (yi − ȳ)²
A closely related point estimate of σ² is the sample variance,

s² = (1/(n−1)) Σ_{i=1}^n (yi − ȳ)²

We now consider interval estimation and tests for μ and σ.

6.2.1 Confidence Intervals and Tests About μ and σ

If σ were known then, as discussed in Chapter 4,

Z = (Ȳ − μ)/(σ/√n) ∼ G(0, 1)

would be a pivotal quantity and could be used to get confidence intervals (CI's) for μ. However, σ is
generally unknown. Fortunately it turns out that if we simply replace σ with either σ̂ or s in Z, then
we still have a pivotal quantity, which we denote as T. We will write T in terms of s since the formulas
below look a little simpler then, so T is defined as

T = (Ȳ − μ)/(s/√n)   (6.2.1)

Since s is treated as a r.v. in (6.2.1) (we'll use s to represent both the r.v. and an observed value, for
convenience), T does not have a G(0, 1) distribution. It turns out that its distribution in this case is
what is known as a Student-t (or just "t") distribution. We'll digress briefly to present this distribution
and show how it arises.

Student-t Distributions
This distribution arises when we consider independent random variables Z ∼ G(0, 1) and U ∼ χ²(ν)
and then define the new r.v.

X = Z/(U/ν)^{1/2}   (6.2.2)

Then X has a Student-t distribution with ν degrees of freedom (d.f.), and we write X ∼ t(ν) to
denote this. The probability density function of X can be shown by a bivariate change of variables
method (we won't do this) to be

f(x) = kν (1 + x²/ν)^{−(ν+1)/2},   −∞ < x < ∞   (6.2.3)

where Γ(·) is the gamma function and

kν = Γ((ν+1)/2) / [Γ(ν/2) √(πν)]   (6.2.4)

This distribution is symmetric about x = 0 and for large ν is closely approximated by the probability
density function of G(0, 1). Problem 1 at the chapter's end considers some properties of f(x).
Probabilities for the t distribution are available from tables or computer software. In R, the
c.d.f. value

F(x) = P(t(ν) ≤ x)   (6.2.5)

is calculated as pt(x, ν). For example, pt(1.5, 10) gives P(t(10) ≤ 1.5) as .918.
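For instance, a few R calls of this kind (qt is the quantile function, the inverse of pt; the values in the
comments are approximate):

    pt(1.5, 10)    # P(t(10) <= 1.5), about .918
    qt(0.975, 9)   # value a with P(t(9) <= a) = .975, about 2.262
    qt(0.975, 28)  # about 2.048, used in the regression examples of Section 6.5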

Confidence Intervals for μ

We can show using (6.2.2) that (6.2.1) has a t-distribution with n − 1 degrees of freedom:

T = (Ȳ − μ)/(s/√n) ∼ t(n−1)   (6.2.6)

To do this we use the facts that

(i) Z = (Ȳ − μ)/(σ/√n) ∼ G(0, 1)

(ii) U = (n−1)s²/σ² ∼ χ²(n−1)

(iii) Ȳ and s² are independent.

(The results (ii) and (iii) are not too hard to show using multivariate calculus. Proofs are omitted
here but done in Stat 330.)
Then, we see that X defined in (6.2.2) is just the r.v. T here, so (6.2.6) follows and T is a pivotal
quantity.
Therefore, to get a CI for μ with confidence coefficient p, we find values a1 and a2 such that

P(a1 ≤ t(n−1) ≤ a2) = p = P(a1 ≤ (Ȳ − μ)/(s/√n) ≤ a2).

This converts to

P(Ȳ − a2 s/√n ≤ μ ≤ Ȳ − a1 s/√n) = p   (6.2.7)

so (ȳ − a2 s/√n, ȳ − a1 s/√n) is a p confidence interval for μ.

Example 6.2.1 Scores Y for an IQ test administered to ten year olds in a very large population have
close to a Gaussian distribution G(μ, σ). A random sample of 10 children got test scores as follows:

103, 115, 97, 101, 100, 108, 111, 91, 119, 101.

We can obtain confidence intervals for the average IQ test score μ in the population by using the pivotal
quantity

T = (Ȳ − μ)/(s/√10) ∼ t(9).

Since P(−2.262 ≤ t(9) ≤ 2.262) = .95, for example, a .95 confidence interval for μ is
ȳ ± 2.262 s/√10. For the data given above ȳ = 104.6 and s = 8.57, so the observed confidence interval is 104.6 ± 6.13, or 98.47 ≤ μ ≤ 110.73.

Remarks:

1. Confidence intervals for μ get narrower as n increases. They are also narrower if σ is known
(though this is unusual). In the limit as n → ∞, the CI based on (6.2.6) is equivalent to using Z
and knowing that σ = s. For example, if in Example 6.2.1 we knew that σ = 8.57 then the .95
CI would use ȳ ± 1.96 σ/√n instead of ȳ ± 2.262 s/√n when n = 10. As n increases, s/√n
becomes arbitrarily close to zero, so the CI's shrink to include only the point ȳ.

2. If we have a rough idea of the value of σ, we can determine the value of n needed to make
a (.95, say) CI a given length. This is used in deciding how large a sample to take in a study.

3. Sometimes we want CI's of the form L(y1, ..., yn) ≤ μ or μ ≤ U(y1, ..., yn). These are
obtained by taking a1 = −∞ and a2 = ∞, respectively, in (6.2.7). For "two-sided" intervals we
usually pick a1 = −a2 so that the interval is symmetrical about ȳ.

Exercise: Show that these provide the shortest CI with a given confidence coefficient p.

Hypothesis Tests for μ

We may wish to test a hypothesis H0: μ = μ0, where μ0 is some specified value. To do this we can
use the statistic

D = |T| = |Ȳ − μ0| / (s/√n)   (6.2.8)

Significance levels are obtained from the t-distribution: if Dobs is the value of D observed in a sample
giving mean ȳ and standard deviation s, then

SL = P(D ≥ Dobs; H0 true)
   = P(|t(n−1)| ≥ Dobs)
   = 1 − P(−Dobs ≤ t(n−1) ≤ Dobs)
Example 6.2.2 For the setting in Example 6.2.1, test H0: μ = 110. With (6.2.8) the observed value
of D is then

Dobs = |104.6 − 110| / (8.57/√10) = 1.99

and the significance level is

SL = P(|t(9)| ≥ 1.99)
   = 1 − P(−1.99 ≤ t(9) ≤ 1.99)
   = .078

This indicates there isn't any strong evidence against H0. (Such tests are sometimes used to compare IQ test scores for a sub-population (e.g. students in one school district) with a known mean for
a "reference" population.)

Remark: The likelihood ratio (LR) statistic could also be used for testing H0: μ = μ0 or for CI's
for μ, but the methods above are a little simpler. In fact, it can be shown that the LR statistic for H0 is
a one-to-one function of |T|; see Problem 2 at the end of the chapter.

Remark: The function t.test in R will obtain confidence intervals and test hypotheses about μ; for a
data set y use t.test(y).
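For instance, a sketch of the calculations in Examples 6.2.1 and 6.2.2 (y is the data vector listed in
Example 6.2.1):

    y <- c(103, 115, 97, 101, 100, 108, 111, 91, 119, 101)
    # .95 confidence interval for mu, as in Example 6.2.1
    mean(y) + c(-1, 1) * qt(0.975, 9) * sd(y) / sqrt(10)
    # significance level for H0: mu = 110, as in Example 6.2.2
    D.obs <- abs(mean(y) - 110) / (sd(y) / sqrt(10))
    2 * (1 - pt(D.obs, 9))
    # t.test gives the same interval and test in one call
    t.test(y, mu = 110)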

Confidence Intervals or Tests for σ

From the results following (6.2.6) we have that

U = (n−1)s²/σ² ∼ χ²(n−1)   (6.2.9)

so U is a pivotal quantity and can be used to find CI's for σ² or σ. To get a p CI we find a1, a2 such that

P(a1 ≤ χ²(n−1) ≤ a2) = p = P(a1 ≤ (n−1)s²/σ² ≤ a2).

This converts to

P((n−1)s²/a2 ≤ σ² ≤ (n−1)s²/a1) = p   (6.2.10)

so a p CI for σ is (√((n−1)s²/a2), √((n−1)s²/a1)). For "two-sided" CI's we usually choose a1 and a2 such
that

P(χ²(n−1) < a1) = P(χ²(n−1) > a2) = (1 − p)/2

In some applications we are interested in an upper bound on σ (because small σ is "good" in some
sense); then we take a2 = ∞ so the lower confidence limit in (6.2.10) is 0.
Example 6.2.3. A manufacturing process produces wafer-shaped pieces of optical glass for lenses.
Pieces must be very close to 25 mm thick, and only a small amount of variability around this can
be tolerated. If Y represents the thickness of a randomly selected piece of glass then, to a close approximation, Y ∼ G(μ, σ). Periodically, random samples of 15 pieces of glass are selected and the
values of μ and σ are estimated to see if they are consistent with μ = 25 and with σ being under .02
mm. On one such occasion the sample mean and sum of squares from the data were ȳ = 25.009 and
Σ(yi − ȳ)² = 0.002347.
Consider getting a .95 confidence interval for σ, using the pivotal quantity

U = 14s²/σ² = Σ(Yi − Ȳ)²/σ² ∼ χ²(14).

Since P(5.63 ≤ χ²(14) ≤ 26.12) = .95, we find

P(5.63 ≤ Σ(Yi − Ȳ)²/σ² ≤ 26.12) = P( (Σ(Yi − Ȳ)²/26.12)^{1/2} ≤ σ ≤ (Σ(Yi − Ȳ)²/5.63)^{1/2} ) = .95

This gives the observed confidence interval .0095 ≤ σ ≤ .0204. It seems plausible that σ ≤ .02, though
the right-hand end of the .95 confidence interval is just over .02. A one-sided .95 CI is σ ≤ .0189; this
comes from P(6.57 ≤ χ²(14) < ∞) = .95.
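A sketch of this calculation in R, using the chi-squared quantile function qchisq (the sum of squares
0.002347 is the value from the example):

    ss <- 0.002347                     # observed sum of squares, 14 d.f.
    a  <- qchisq(c(0.975, 0.025), 14)  # about 26.12 and 5.63
    sqrt(ss / a)                       # two-sided .95 CI for sigma: about (.0095, .0204)
    sqrt(ss / qchisq(0.05, 14))        # one-sided .95 upper bound: about .0189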

Hypothesis Tests for σ

Sometimes a test for H0: σ = σ0 is of interest. One approach is to use a likelihood ratio (LR)
statistic, as described in Chapter 4. It can be seen (see Problem 2) that the LR statistic is a function
of U = (n−1)s²/σ0²,

λ = U − n log(U/n) − n   (6.2.11)

This is not a one-to-one function of U, but λ is 0 when U = n and is large when U/n is much
bigger than or much less than 1 (i.e. when s²/σ0² is much bigger than one or much less than 1).
Since U ∼ χ²(n−1) when H0 is true, we can use this to compute exact significance levels, instead of using the χ²(1) approximation for λ discussed in Chapter 4. The following simpler calculation
approximates this SL:

1. Obtain Uobs = (n−1)s²obs/σ0² from the observed data.

2. If Uobs > n − 1 compute SL = 2P(χ²(n−1) ≥ Uobs).
   If Uobs < n − 1 compute SL = 2P(χ²(n−1) ≤ Uobs).

Example 6.2.4 For the manufacturing process in Example 6.2.3, test the hypothesis H0: σ = .008
(.008 is the desired or target value of σ that the manufacturer would like to achieve).
Note that since the value σ = .008 is outside the two-sided .95 CI for σ in Example 6.2.3, the SL
for H0 based on the test statistic λ (or equivalently, U) will be less than .05. To find the exact SL, we
follow the procedure above:

1. Uobs = 14 s²obs / .008² = 36.67

2. SL = 2P(χ²(14) ≥ 36.67) = .0017

This indicates very strong evidence against H0 and suggests that σ is bigger than .008.
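The corresponding R calculation, using the chi-squared c.d.f. pchisq:

    U.obs <- 0.002347 / 0.008^2                # 14 * s^2 / sigma0^2 = 36.67
    2 * pchisq(U.obs, 14, lower.tail = FALSE)  # SL, about .0017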

6.3 General Gaussian Response Models

We now consider general models of the form (6.1.2) plus (6.1.3): Yi ∼ G(μi, σ) with μi = Σ_{j=1}^k βj xij
for independent units i = 1, 2, ..., n. For convenience we define the n × k matrix X of covariate
values:

X = (xij)_{n×k}   (6.3.1)

We now summarize some results about the maximum likelihood estimators of the parameters β =
(β1, ..., βk)′ and σ.

Some results about the M.L.E.'s of β = (β1, ..., βk)′ and of σ

Maximization of the likelihood function

L(β, σ) = ∏_{i=1}^n (1/(√(2π) σ)) exp[ −(1/2)((yi − μi)/σ)² ]

or of ℓ(β, σ) = log L(β, σ) gives

β̂ = (X′X)⁻¹ X′y,   σ̂² = (1/n) Σ_{i=1}^n (yi − μ̂i)²

where X_{n×k} = (xij), y_{n×1} = (y1, ..., yn)′ and μ̂i = Σ_{j=1}^k β̂j xij. We also define

s² = (1/(n−k)) Σ_{i=1}^n (yi − μ̂i)² = (n/(n−k)) σ̂²

For the estimators β̃j (j = 1, ..., k) and σ̃² it can be proved that

β̃j ∼ G(βj, σ√cj),   j = 1, ..., k   (6.3.2)

W = nσ̃²/σ² = (n−k)s²/σ² ∼ χ²(n−k)   (6.3.3)

W is independent of (β̃1, ..., β̃k).   (6.3.4)

In (6.3.2), the cj are constants which are functions of the xij's.

Remark: The m.l.e. β̂ is also a least squares (LS) estimate of β. Least squares is a method
of estimation in linear models that predates maximum likelihood. Problem 16 describes least
squares methods.

Recall the distribution theory for the (Student) t distribution. If Z ∼ G(0, 1) and W ∼ χ²(r) then
the r.v.

T = Z/√(W/r)   (6.3.5)

has a t(r) distribution.

This provides a way to get confidence intervals (or tests) for any βj. Because (6.3.2) implies
that

Z = (β̃j − βj)/(σ√cj) ∼ G(0, 1)

and because of (6.3.3), then (6.3.5) implies that

T = (β̃j − βj)/(s√cj) ∼ t(n−k)   (6.3.6)

so T is a pivotal quantity for βj. In addition, W given in (6.3.3) is a pivotal quantity for σ² or σ.

Below we will consider some special types of Gaussian models which fall under the general theory.
However, we’ll alter the notation a bit for each model, for convenience.

Single Gaussian distribution

Here, Yi ∼ G(μ, σ), i = 1, ..., n, i.e. μi = μ (we use μ instead of β1 as the parameter name). This model
was discussed in detail in Section 6.2, where we used

T = (Ȳ − μ)/(s/√n) ∼ t(n−1),   W = (n−1)s²/σ² ∼ χ²(n−1).

(Note: the c1 in (6.3.2) is 1/n here; it's easiest to get this from the fact that Var(Ȳ) = Var(μ̃) = σ²/n,
proved earlier.)

Comparing Two Gaussian Distributions G(μ1, σ) and G(μ2, σ)

Independent samples y11, y12, ..., y1n1 from G(μ1, σ) and y21, y22, ..., y2n2 from G(μ2, σ) are obtained.
(We use double subscripts for the y's here, for convenience.)
Once again, we stick with μ1 and μ2 as the names of the parameters. The likelihood function for μ1, μ2, σ is

L(μ1, μ2, σ) = ∏_{j=1}^2 ∏_{i=1}^{nj} (1/(√(2π) σ)) exp[ −(1/2)((yji − μj)/σ)² ]

Maximization gives the m.l.e.'s

μ̂1 = Σ_{i=1}^{n1} y1i/n1 = ȳ1,   μ̂2 = Σ_{i=1}^{n2} y2i/n2 = ȳ2,   σ̂² = (1/(n1+n2)) Σ_{j=1}^2 Σ_{i=1}^{nj} (yji − μ̂j)²

Note that s² = (1/(n1+n2−2)) Σ_{j=1}^2 (nj − 1)sj² = (n1+n2)σ̂²/(n1+n2−2),

where sj² = (1/(nj−1)) Σ_{i=1}^{nj} (yji − ȳj)².

To get CI's for δ = μ1 − μ2 note that

Z = (Ȳ1 − Ȳ2 − δ) / (σ√(1/n1 + 1/n2)) ∼ G(0, 1),   W = (n1+n2−2)s²/σ² ∼ χ²(n1+n2−2)

and so by (6.3.5)

T = (Ȳ1 − Ȳ2 − δ) / (s√(1/n1 + 1/n2)) ∼ t(n1+n2−2).   (6.3.7)

Confidence intervals or tests about σ can be obtained by using the pivotal quantity W exactly as in
Section 6.2 for a single distribution.

Example 6.3.1. In an experiment to assess the durability of two types of white paint used on asphalt
highways, 12 lines (each 4" wide) of each paint were laid across a heavily traveled section of highway,
in random order. After a period of time, reflectometer readings were taken for each line of paint; the
higher the reading the greater the reflectivity and the visibility of the paint. The measurements of
reflectivity were as follows:

Paint A: 12.5, 11.7, 9.9, 9.6, 10.3, 9.6, 9.4, 11.3, 8.7, 11.5, 10.6, 9.7
Paint B: 9.4, 11.6, 9.7, 10.4, 6.9, 7.3, 8.4, 7.2, 7.0, 8.2, 12.7, 9.2

Statistical objectives are to test that the average reflectivity for paints A and B is the same, and if
there is evidence of a difference, to obtain a confidence interval for their difference. (In many problems
where two attributes are to be compared we start by testing the hypothesis that they are equal, even if
we feel there may be a difference. If there is no statistical evidence of a difference then we stop there.)
To do this it is assumed that, to a close approximation, reflectivity measurements Y1 for paint A are
G(μ1, σ1), and measurements Y2 for paint B are G(μ2, σ2). We can test H: δ = μ1 − μ2 = 0 and
get confidence intervals for δ by using the pivotal quantity

T = (Ȳ1 − Ȳ2 − δ) / (s√(1/12 + 1/12)) ∼ t(22)

where it is assumed that σ1 = σ2 = σ, which is estimated by

s² = (1/22) [ Σ_{i=1}^{12} (y1i − ȳ1)² + Σ_{i=1}^{12} (y2i − ȳ2)² ].

To test H: δ = 0 we use the test statistic D = |T|. From the data given above we find

n1 = 12,  ȳ1 = 10.4,  Σ(y1i − ȳ1)² = 14.08,  s1² = 1.2800
n2 = 12,  ȳ2 = 9.0,   Σ(y2i − ȳ2)² = 38.64,  s2² = 3.5127

This gives δ̂ = 1.4 and s² = 2.3964, and the observed test statistic dobs = 2.22. The significance level
is then

SL = P(|t(22)| ≥ 2.22) = .038.

This indicates there is fairly strong evidence against H: μ1 = μ2. Since ȳ1 > ȳ2, the indication
is that paint A keeps its visibility better. A .95 confidence interval based on T is obtained using the
fact that P(−2.074 ≤ t(22) ≤ 2.074) = .95. This gives the confidence interval for δ = μ1 − μ2 of
δ̂ ± 2.074 s/√6, or 0.09 ≤ δ ≤ 2.71. This suggests that although the difference in reflectivity (and
durability) of the paint is statistically significant, the size of the difference is not really large relative to
μ1 and μ2 (look at ȳ1 and ȳ2).
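A sketch of this analysis in R (the two vectors are the reflectivity measurements listed above):

    paintA <- c(12.5, 11.7, 9.9, 9.6, 10.3, 9.6, 9.4, 11.3, 8.7, 11.5, 10.6, 9.7)
    paintB <- c(9.4, 11.6, 9.7, 10.4, 6.9, 7.3, 8.4, 7.2, 7.0, 8.2, 12.7, 9.2)
    # two-sample t test and .95 CI assuming equal standard deviations
    t.test(paintA, paintB, var.equal = TRUE)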
The procedures above assume that the two Gaussian distributions have the same standard deviations. Sometimes this isn't a reasonable assumption (it can be tested using a LR test, but we won't do
this here) and we must assume that Y1 ∼ G(μ1, σ1) and Y2 ∼ G(μ2, σ2). In this case there is no exact
pivotal quantity with which to get CI's for δ = μ1 − μ2, but

Z = (Ȳ1 − Ȳ2 − δ) / √(s1²/n1 + s2²/n2) ≈ G(0, 1)   (6.3.8)

is an approximate pivotal quantity that becomes exact as n1 and n2 become arbitrarily large.
To illustrate its use, consider Example 6.3.1, where we had s1² = 1.2800, s2² = 3.5127. These
appear quite different, but they are in squared units and n1, n2 are small; the standard deviations s1 =
1.13 and s2 = 1.87 do not provide evidence against the hypothesis that σ1 = σ2 if a LR test is carried
out. Nevertheless, let us use (6.3.8) to get a .95 CI for δ. This gives the CI

(ȳ1 − ȳ2) ± 1.96 √(s1²/n1 + s2²/n2)

which with the data observed equals 1.4 ± 1.24, or 0.16 ≤ δ ≤ 2.64. This is not much different from the
interval obtained in Example 6.3.1.

Example 6.3.2 Scholastic Achievement Test Scores


Tests that are designed to “measure" the achievement of students are often given in various subjects.
Educators and others often compare results for different schools or districts. We consider here the
scores on a mathematics test given to Canadian students in the 5th grade. Summary statistics (sample
sizes, means, and standard deviations) of the scores y for the students in two small school districts in
Ontario are as follows:

District A: n = 278,  ȳ = 60.2,  s = 10.16
District B: n = 345,  ȳ = 58.1,  s = 9.02

The average score is somewhat higher in district A, and we will give a confidence interval for the
difference in average scores μA − μB in a model representing this setting. This is done by thinking
of the students in each district as a random sample from a conceptual population of "similar" students
writing "similar" tests. Assuming that in a given district the scores Y have a G(μ, σ) distribution, we
can test that the means μA and μB for districts A and B are the same, or give a CI for the difference.
(Achievement tests are usually designed so that the scores are approximately Gaussian, so this is a
sensible procedure.)
Let us get a .95 CI for δ = μA − μB using the pivotal quantity (6.3.8). This gives the CI

ȳ1 − ȳ2 ± 1.96 √(s1²/n1 + s2²/n2)

which becomes 2.1 ± (1.96)(.779) or 0.57 ≤ δ ≤ 3.63. Since δ = 0 is outside the .95 CI (and also the
.99 CI) we can conclude there is fairly strong evidence against the hypothesis that μA = μB, suggesting
that μA > μB.
It is always a good idea to look carefully at the data and the distributions suggested for the two
groups, however; we should not rely only on a comparison of their means. Figure 6.20 shows a box plot
of the two samples; this type of plot was mentioned in Section 1.3. It shows both the median value and
other summary statistics of each sample: the upper and lower quartiles (i.e. 25th and 75th percentiles)
and the smallest and largest values. Figure 6.20 was obtained using the R function boxplot().
Note that the distributions of marks for districts A and B are actually quite similar. The median
(and mean) is a little higher for district A, and because the sample sizes are so large, this gives a "statistically significant" difference in a test that μA = μB. However, it would be a mistake to conclude that
the actual difference in the two distributions is very large. Unfortunately, "significant" tests like this
are often used to make claims about one group being "superior" to another.

Remark: The R function t.test will carry out the test above and will give confidence intervals for
μA − μB. This can be done with the command t.test(yA, yB, var.equal = TRUE), where yA and yB are
the data vectors from A and B.
Figure 6.20: Box Plot of Math Test Scores for Two School Districts.

6.4 Inference for Paired Data


Although this and the next section are also special cases of the general Gaussian model of Section 6.3,
the procedures are sufficiently important that we give them their own sections.
Often experimental studies designed to compare means are conducted with pairs of units, where
the responses within a pair are not independent. The following examples illustrate this.

Example 6.4.1 Heights of Males vs Females


In a study in England, the heights of 1401 (brother, sister) pairs of adults were determined. One
objective of the study was to compare the heights of adult males and females; another was to examine
the relationship between the heights of male and female siblings.
Let Y1i and Y2i be the heights of the male and female, respectively, in the i'th (brother, sister) pair
(i = 1, 2, ..., 1401). Assuming that the pairs are sampled randomly from the population, we can use
them to estimate

μ1 = E(Y1i),   μ2 = E(Y2i)

and the difference δ = μ1 − μ2. However, the heights of related persons are not independent, so to
estimate δ the method in the preceding section would not be strictly usable; it requires that we have
independent random samples of males and females. In fact, the primary reason for collecting these
data was to consider the joint distribution of Y1i, Y2i and to examine their relationship. This topic is not
considered in this course, but a clear picture is obtained by plotting the points (Y1i, Y2i) in a scatter plot.

Example 6.4.2 Comparing Car Fuels


In a study to compare “standard" gasoline with gas containing an additive designed to improve mileage
(i.e. reduce fuel consumption), the following experiment was conducted:
Fifty cars of a variety of makes and engine sizes were chosen. Each car was driven in a standard
way on a test track for 1000 km, with the standard fuel (S) and also with the enhanced fuel (E). The
order in which the S and E fuels was used was randomized for each car (you can think of a coin being
tossed for each car, with fuel S being used first if a Head occurred) and the same driver was used for
both fuels in a given car. Drivers were different across the 50 cars.
Suppose we let Y1i and Y2i be the amount of fuel consumed (in litres) for the i'th car with the S and
E fuels, respectively. We want to estimate

δ = E(Y1i − Y2i).

The fuel consumptions Y1i, Y2i for the i'th car are related, because factors such as size, weight and
engine size (and perhaps the driver) affect consumption. As in the preceding example it would likely
not be appropriate to treat the Y1i's (i = 1, ..., 50) and Y2i's (i = 1, ..., 50) as two independent
samples. Note that in this example it may not be of much interest to consider E(Y1i) and E(Y2i)
separately, since there is only a single observation on each car type for either fuel.
Two types of Gaussian models are used to represent settings involving paired data. The first involves
what is called a bivariate normal distribution for (Y1i, Y2i), and it could be used in Example 6.4.1. This
is a continuous bivariate model. Only discrete bivariate models were introduced in Stat 230 and we
will not consider this model here (it is studied in Stat 330), except to note an important property:

Yi = Y1i − Y2i ∼ G(δ, σ)   (6.4.1)

where δ = μ1 − μ2 = E(Y1i) − E(Y2i). Thus, if we are interested in estimating or testing δ, we can do
this by considering the within-pair differences Yi and using the methods for a single Gaussian model
in Section 6.2.
The second Gaussian model used with paired data has

Y1i ∼ G(μ1 + αi, σ1),   Y2i ∼ G(μ2 + αi, σ2)

where the αi's are unknown constants. Here it is assumed that Y1i and Y2i are independent random variables, and the αi's represent factors specific to the different pairs. This model also gives the distribution
(6.4.1), since

E(Y1i − Y2i) = μ1 − μ2   (αi cancels)
Var(Y1i − Y2i) = σ1² + σ2² = σ².

This model seems relevant for Example 6.4.2, where αi refers to the i'th car type. Interestingly, the two
models for (Y1i, Y2i) can be connected; if the αi's are considered as Gaussian random variables in the
population of pairs of units then the result is that (Y1i, Y2i) have a bivariate normal model.
Thus, whenever we encounter paired data in which the variation in variables Y1i and Y2i is adequately modeled by Gaussian distributions, we will make inferences about δ = μ1 − μ2 by working
with the model (6.4.1).

Example 6.4.1 revisited. The data on 1401 (brother, sister) pairs gave differences Yi = Y1i − Y2i (i =
1, ..., 1401) for which the sample mean and variance were

ȳ = 4.895 in,   s² = Σ(yi − ȳ)²/1400 = 6.5480 in²

Using the Student t pivotal quantity (6.2.6), a two-sided .95 confidence interval for δ = E(Yi) is
ȳ ± 1.96 s/√n where n = 1401. (Note that t(1400) is indistinguishable from G(0, 1).) This gives the
.95 CI 4.895 ± 0.134 inches, or 4.76 ≤ δ ≤ 5.03 in.
Remark: The method above assumes that the (brother, sister) pairs are a random sample from the
population of families with a living adult brother and sister. The question arises as to whether δ also
represents the difference in the average heights of all adult males and all adult females (call them μ1′
and μ2′) in the population. Presumably μ1′ = μ1 (i.e. the average height of all adult males equals
the average height of all adult males who also have an adult sister) and similarly μ2′ = μ2, so δ does
represent this difference. However, it might be wise to check this assumption.
Recall our earlier Example 2.4.1 involving the difference in the average heights of males and females in New Zealand. This gave the estimate δ̂ = ȳ1 − ȳ2 = 68.72 − 64.10 = 4.62 inches, which is a
little less than the difference in the example above. This is likely due to the fact that we are considering
two distinct populations, but it should be noted that the New Zealand data are not paired.

Pairing as an Experimental Design Choice

In settings where the population can be arranged in pairs, the estimation of a difference in means, δ =
μ1 − μ2, can often be made more precise (shorter CI's) by using pairing in the study. The condition for
this is that the association (or correlation) between Y1i and Y2i be positive. This is the case in both of
Examples 6.4.1 and 6.4.2, so the pairing in these studies is a good idea.
To illustrate this further, in Example 6.4.1 the height measurements on the 1401 males gave ȳ1 =
69.720 and s1² = 7.3861 and those on the females gave ȳ2 = 64.825 and s2² = 6.7832. If the males and
females were two independent samples (this is not quite right because the heights for the brother-sister
combinations are not independent, but the sample means and variances are close to what we would get
if we did have completely independent samples), then we could use the pivotal quantity (6.3.7) to get a
confidence interval for δ = μ1 − μ2. This gives the .95 CI 4.70 ≤ δ ≤ 5.09; we note that it is slightly
longer than the .95 CI 4.76 ≤ δ ≤ 5.03 obtained using the pairing.
To see why the pairing is helpful in estimating δ, suppose that Y1i ∼ G(μ1, σ1) and Y2i ∼ G(μ2, σ2),
but that Y1i and Y2i are not necessarily independent (i = 1, 2, ..., n). The estimator of δ is

δ̃ = Ȳ1 − Ȳ2

and we have that E(δ̃) = δ = μ1 − μ2 and

Var(δ̃) = Var(Ȳ1) + Var(Ȳ2) − 2Cov(Ȳ1, Ȳ2)
       = σ1²/n + σ2²/n − 2σ12/n,

where σ12 = Cov(Y1i, Y2i). If σ12 > 0, then Var(δ̃) is smaller than when σ12 = 0 (i.e. when Y1i and
Y2i are independent). Therefore if we can collect a sample of pairs (Y1i, Y2i), this is better than two
independent random samples (one of Y1i's and one of Y2i's) for estimating δ. Note on the other hand
that if σ12 < 0, then pairing is a bad idea since it increases the variance of δ̃.
The following example involves an experimental study with pairing.

Example 6.4.3. Fibre in Diet and Cholesterol Level


This example comes from the Stat 231 Course Notes, chapter 15. In the study 20 subjects (who were
actually volunteers from workers in a Boston hospital) with ordinary cholesterol levels were given a
low-fibre diet for 6 weeks and a high-fibre diet for another 6 week period. The order in which the two
diets were given was randomized for each subject (person), and there was a two-week gap between
the two 6 week periods, in which no dietary fibre supplements were given. A primary objective of the
study was to see if cholesterol levels are lower with the high-fibre diet.
Details of the study are given in the New England Journal of Medicine, volume 322 (January 18,
1990), pages 147-152, and in the Stat 231 Notes. These provide interesting comments on factors and
difficulties in the design of studies on the effects of diet. Here we will simply present the data from the
study and estimate the effect of the amount of dietary fibre.

Table 6.4.1 shows the cholesterol levels y (in mmol per liter) for each subject, measured at the
end of each 6 week period. We'll let the random variables Y1i, Y2i represent the cholesterol levels for
subject i on the high fibre and low fibre diets, respectively. We'll also assume that the differences are
represented by the model

Yi = Y1i − Y2i ∼ G(δ, σ),   i = 1, ..., 20.

The differences yi are also shown in Table 6.4.1, and from them we calculate the sample mean and
standard deviation

ȳ = −0.020,   s = 0.411

A .95 CI for δ is found using the pivotal quantity (6.2.7) and the fact that P(−2.093 ≤ t(19) ≤
2.093) = .95. This gives the CI ȳ ± 2.093 s/√20, or −0.020 ± 0.192, or

−0.212 ≤ δ ≤ 0.172

This confidence interval includes δ = 0, and there is clearly no evidence that the high fibre diet gives
a lower cholesterol level.

Remark: The results here can be obtained using the R function t.test.

Exercise: Compute the significance level for the hypothesis H0: δ = 0, using the test statistic (6.2.8).

Table 6.4.1. Cholesterol Levels on Two Diets



Subject Y1i (High F) Y2i (Low F) Yi Subject Y1i (High F) Y2i (Low F) Yi
1 5.55 5.42 .13 11 4.44 4.43 .01
2 2.91 2.85 .06 12 5.22 5.27 -.05
3 4.77 4.25 .52 13 4.22 3.61 .61
4 5.63 5.43 .20 14 4.29 4.65 -.36
5 3.58 4.38 -.80 15 4.03 4.33 -.30
6 5.11 5.05 .06 16 4.55 4.61 -.06
7 4.29 4.44 -.15 17 4.56 4.45 .11
8 3.40 3.36 .04 18 4.67 4.95 -.28
9 4.18 4.38 -.20 19 3.55 4.41 -.86
10 5.41 4.55 .86 20 4.44 4.38 .06
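A sketch of the paired analysis in R, assuming y1 and y2 are vectors holding the high-fibre and
low-fibre columns of Table 6.4.1:

    # y1, y2: the 20 high-fibre and low-fibre cholesterol levels from Table 6.4.1
    d <- y1 - y2                                          # within-pair differences
    mean(d) + c(-1, 1) * qt(0.975, 19) * sd(d) / sqrt(20) # .95 CI for delta
    t.test(y1, y2, paired = TRUE)                         # equivalent paired analysis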

Final Remarks: When you see data from a comparative study (i.e. one whose objective is to compare two distributions, often through their means), you have to determine whether it involves paired
data or not. Of course, a sample of Y1i's and Y2i's cannot be from a paired study unless there are equal
numbers of each, but if there are equal numbers the study might be either "paired" or "unpaired". Note
also that there is a subtle difference in the study populations in paired and unpaired studies. In the former it is pairs of individual units that form the population, whereas in the latter there are (conceptually
at least) separate individual units for the Y1 and Y2 measurements.

6.5 Linear Regression Models


Many studies involve covariates x, as described in Section 6.1. In this section we consider settings
where there is a single x-variable. Problems with multiple x-variables were mentioned in Sections 6.1
and 6.3, and are considered in Stat 331. We start by summarizing results from Sections 6.1 and 6.3.
Consider the model with independent Yi's such that

Yi ∼ G(μi, σ)   with   μi = α + βxi   (6.5.1)

(Note that this is of the form (6.1.2) and (6.1.3) with β1 = α, β2 = β, x1i = 1, x2i = xi.)

Once again, we can use the general results of Section 6.3 or just maximize the likelihood to get
the maximum likelihood estimators: maximize

L(α, β, σ) = ∏_{i=1}^n (1/(√(2π) σ)) exp[ −(1/2)((yi − α − βxi)/σ)² ]

to get

β̂ = Sxy/Sxx,   α̂ = ȳ − β̂x̄,   σ̂² = (1/n) Σ_{i=1}^n (yi − α̂ − β̂xi)² = (1/n)(Syy − β̂Sxy)

where Sxx = Σ_{i=1}^n (xi − x̄)², Sxy = Σ_{i=1}^n (xi − x̄)(yi − ȳ), Syy = Σ_{i=1}^n (yi − ȳ)².

Remark: In regression models we often "redefine" a covariate xi as xi′ = xi − c, where c is a
constant value that makes Σ xi′ close to zero. (Often we take c = x̄, which makes Σ xi′ exactly
zero.) The reasons for doing this are that it reduces round-off errors in calculations, and that it makes
the parameter α more interpretable. Note that β does not change if we "centre" xi this way, because
E(Y | x) is

μ(x) = α + βx = α + β(x′ + c) = (α + βc) + βx′.

Thus, the intercept changes if we redefine x, but not β. In the examples here we have kept the given
definition of x, for simplicity.
We now consider how to get confidence intervals for quantities of interest. As usual, formulas are
written in terms of

s² = (1/(n−2)) Σ_{i=1}^n (yi − μ̂i)² = (1/(n−2)) Σ_{i=1}^n (yi − α̂ − β̂xi)²

instead of σ̂².

Confidence Intervals for β

These are important because β represents the increase in

μ(x) = E(Y | x) = α + βx

resulting from an increase of 1 in x. As well, if β = 0 then x has no effect on Y (within the constraints
of this model).
From Section 6.3 we know that β̃ ∼ G(β, √c σ) for some constant c. It's easy to show this directly
here, and to obtain c. Write β̃ as

β̃ = Sxy/Sxx = Σ_{i=1}^n (xi − x̄)(Yi − Ȳ)/Sxx
  = Σ_{i=1}^n [(xi − x̄)/Sxx] Yi   (since Σ(xi − x̄)Ȳ = 0)
  = Σ_{i=1}^n ai Yi
where ai = (xi − x̄)/Sxx. This is a linear combination of independent Gaussian random variables and
so its distribution is also Gaussian, with

E(β̃) = Σ_{i=1}^n ai E(Yi)
     = Σ_{i=1}^n [(xi − x̄)/Sxx] (α + βxi)
     = α Σ_{i=1}^n (xi − x̄)/Sxx + β Σ_{i=1}^n (xi − x̄)xi/Sxx
     = 0 + β Σ_{i=1}^n (xi − x̄)²/Sxx = β   (since Σ(xi − x̄)² = Σ(xi − x̄)xi and Σ(xi − x̄)² = Sxx)

and

Var(β̃) = Σ_{i=1}^n ai² Var(Yi) = Σ_{i=1}^n (xi − x̄)²σ²/Sxx² = σ²/Sxx

Thus

β̃ ∼ G(β, σ/√Sxx)   (6.5.2)

and combining this with the fact that

W = (n−2)s²/σ² ∼ χ²(n−2)   (6.5.3)

and that β̃ and s² are independent, we get as in (6.3.6) that

T = (β̃ − β)/(s/√Sxx) ∼ t(n−2)   (6.5.4)

This can be used as a pivotal quantity to get CI's for β, or to test hypotheses about β.
Note also that (6.5.3) can be used to get CI's or tests for σ, but these are usually of less interest than
inference about β or the other quantities below.

Confidence Intervals for μ(x)

We are often interested in estimating μ(x) = α + βx for a specified value of x. We'll derive a Student-t
pivotal quantity for doing this.

The MLE of μ(x) has associated estimator

μ̃(x) = α̃ + β̃x = Ȳ + β̃(x − x̄),

since α̃ = Ȳ − β̃x̄. Thus μ̃(x) is a linear function of Gaussian random variables (because Ȳ and β̃ are)
and so must have a Gaussian distribution. Its mean and variance are

E[μ̃(x)] = E(Ȳ) + (x − x̄)E(β̃)
        = (1/n) Σ_{i=1}^n E(Yi) + β(x − x̄)
        = (1/n) Σ_{i=1}^n (α + βxi) + β(x − x̄)
        = α + βx̄ + β(x − x̄)
        = α + βx = μ(x),

and because Ȳ and β̃ are independent (can be shown),

Var[μ̃(x)] = Var(Ȳ) + (x − x̄)² Var(β̃)
          = (1/n²) Σ_{i=1}^n Var(Yi) + (x − x̄)² σ²/Sxx
          = σ² [1/n + (x − x̄)²/Sxx]

Thus

μ̃(x) ∼ G( μ(x), σ √(1/n + (x − x̄)²/Sxx) )

and it then follows that

T = (μ̃(x) − μ(x)) / ( s √(1/n + (x − x̄)²/Sxx) ) ∼ t(n−2)   (6.5.5)

This can be used as a pivotal quantity to get CI's for μ(x).

Remark: The parameter α equals μ(0) so it doesn't require special treatment. Sometimes x = 0 is a
value of interest but often it is not. In the following example it refers to a building of area 0, which is
nonsensical!

Remark: The results of the analyses below can be obtained using the R function lm, with the command lm(y ~ x). We give the detailed results below to illustrate how the calculations are made. In R,
summary(lm(y ~ x)) gives a lot of useful output.

Example 6.5.1 Price vs Size of Commercial Buildings


Example 6.1.2 gave data on the selling price per square meter (y) and area (x) of commercial buildings.
Figure 6.1.1 suggested that a model of the form (6.5.1) would be reasonable, so let us consider that.
We find easily that x̄ = 0.954, ȳ = 549.0 and Sxx = 22.945, Sxy = −3316.68, Syy = 489,462.62,
so we find

β̂ = −144.5,   α̂ = 686.9,   s² = 364.37,   s = 19.09.

Note that β̂ is negative: the larger buildings tend to sell for less per square meter. (The estimate
β̂ = −144.5 indicates a drop in average price of $144.50 per square meter for each increase of 1 unit
in x; remember x's units are 10⁵ m².) The line y = α̂ + β̂x is often called the fitted regression line
for y on x, and if we plot it on the same graph as the points (xi, yi) in the scatter plot Figure 6.1.1, we
see it passes close to the points.
A confidence interval for β isn't of major interest in the setting here, where the data were called on
to indicate a fair assessment value for a large building with x = 4.47. One way to address this is to
estimate μ(x) when x = 4.47. We get the MLE

μ̂(4.47) = α̂ + β̂(4.47) = $40.94

which we note is much below the assessed value of $75 per square meter. However, one can object that
there is uncertainty in μ̂(4.47), and that it would be better to give a CI. Using (6.5.5) and the fact that
P(−2.048 ≤ t(28) ≤ 2.048) = .95, we get a .95 CI for μ(4.47) as

μ̂(4.47) ± 2.048 s √(1/30 + (4.47 − x̄)²/Sxx)

or $40.94 ± $26.54, or $14.40 ≤ μ(4.47) ≤ $67.50. Thus the assessed value of $75 is outside this
range.
However (playing lawyer for the Assessor), we could raise another objection: since we are considering a single building (and not the average of all buildings) of size x = 4.47 (×10⁵ m²), we must
recognize that Y has a non-negligible variance. This suggests that what we should do is predict the
y value for a building with x = 4.47, instead of estimating μ(4.47). We will temporarily leave the
example in order to develop a method to do this.
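A sketch of this fit in R, assuming size and price are vectors holding the 30 (x, y) pairs of Table 6.1.1:

    # size, price: the x and y columns of Table 6.1.1
    fit <- lm(price ~ size)
    summary(fit)     # alpha-hat, beta-hat, s, and related output
    # estimate of mu(4.47) with a .95 confidence interval, as in the example
    predict(fit, newdata = data.frame(size = 4.47),
            interval = "confidence", level = 0.95)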

Prediction Intervals for Y

Suppose we want to estimate or predict the Y-value for a random unit which has a specific value x
for its covariate. We can get a pivotal quantity that can be used to give a prediction interval (or interval
"estimate") for Y, as follows.

Note that Y ∼ G(μ(x), σ) and, from above, that μ̃(x) ∼ G(μ(x), σ[1/n + (x − x̄)²/Sxx]^{1/2}). Also, Y is
independent of μ̃(x) since it is not connected to the existing sample. Thus

Var(Y − μ̃(x)) = Var(Y) + Var(μ̃(x)) = σ² + σ²[1/n + (x − x̄)²/Sxx]

Thus

Y − μ̃(x) ∼ G( 0, σ[1 + 1/n + (x − x̄)²/Sxx]^{1/2} )

and it also follows that

T = (Y − μ̃(x)) / ( s √(1 + 1/n + (x − x̄)²/Sxx) ) ∼ t(n−2)   (6.5.6)

We can use the pivotal quantity T to get interval estimates for Y, since

P(a1 ≤ T ≤ a2) = P( μ̃(x) + a1 s √(1 + 1/n + (x − x̄)²/Sxx) ≤ Y ≤ μ̃(x) + a2 s √(1 + 1/n + (x − x̄)²/Sxx) )

We usually call these prediction intervals instead of confidence intervals, since Y isn't a parameter
but a "future" observation.

Example 6.5.1 Revisited

Let us obtain a .95 prediction interval (PI) for Y when x = 4.47. Using (6.5.6) and the fact that
P(−2.048 ≤ t(28) ≤ 2.048) = .95, we get the .95 PI

μ̂(4.47) ± 2.048 s √(1 + 1/30 + (4.47 − x̄)²/Sxx)

or −6.30 ≤ Y ≤ 88.20 (dollars per square meter). The lower limit is negative, which is nonsensical.
This happened because we're using a Gaussian model (in which Y can be positive or negative) in a
setting where Y (price) must be positive. Nonetheless, the Gaussian model fits the data well, so we'll
just truncate the PI and take it to be 0 ≤ Y ≤ $88.20.
Now we find that the assessed value of $75 is inside this interval! On this basis it's hard to say that
the assessed value is unfair (though it is towards the high end of the PI).
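In R, the same fitted object gives a prediction interval by changing the interval argument of predict
(again assuming fit from the earlier sketch):

    # .95 prediction interval for the price of a single building of size 4.47
    predict(fit, newdata = data.frame(size = 4.47),
            interval = "prediction", level = 0.95)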
Note also that the value x = 4.47 of interest is well outside the range of x-values (.20 – 3.26) in
the data set of 30 buildings; look again at Figure 6.1.1. Thus any conclusions we reach are based on
an assumption that the linear model μ(x) = α + βx applies beyond x = 3.26 and out to x = 4.47.
This may not be true, but we have no way to check it with the data we have. Note also that there is a
slight suggestion in Figure 6.1.1 that Var(Y | X = x) may be smaller for larger x-values. There is not
sufficient data to check this either.

Remark: Note from (6.5.5) and (6.5.6) that CI's for μ(x) and PI's for Y are wider the further away
x is from x̄. Thus, as we move further away from the "middle" of the x's in the data, we get wider and
wider intervals for μ(x) or Y.

Example 6.5.2 Strength of Steel Bolts


Recall the data given in Example 6.1.3, where Y represented the breaking strength of a randomly
selected steel bolt and x was the bolt's diameter. A scatter plot of points (xi, yi) for 30 bolts suggested
a nonlinear relationship between Y and x. A bolt's strength might be expected to be proportional to its
cross-sectional area, which is proportional to x². Figure 6.21 shows a plot of the points (xi², yi); it looks
quite linear.
Let us fit a linear model

Yi ∼ G(α + βx1i, σ),   x1i = xi²

to the data. We find (check these for yourself)

α̂ = 1.667,   β̂ = 2.838,   s = 0.0515,   Sxx = 0.2244

The fitted regression line y = α̂ + β̂x1 is shown on the scatter plot in Figure 6.21; the model appears
to fit well.
More as a numerical illustration, let us get a CI for β, which represents the increase in average
strength μ(x1) from increasing x1 = x² by 1 unit. Using the pivotal quantity (6.5.4) and the fact that
P(−2.048 ≤ t(28) ≤ 2.048) = .95, we get the .95 CI

β̂ ± 2.048 s/√Sxx,   or   2.838 ± 0.223

A .95 CI for the value of β is therefore (2.615, 3.061).

Exercise: This model could be used to predict the breaking strength of a new bolt of given diameter
x. Find a PI for a new bolt of diameter x = 0.35.

6.6 Model Checking


Figure 6.21: Scatter Plot of Bolt Diameter Squared vs. Strength

There are two main components in Gaussian response models:
(i) the assumption that Yi (given any covariates xi) is Gaussian with constant standard deviation σ.

(ii) the assumption that μi = g(xi) is of a given form like (6.1.3).

Models should always be checked, and in this case there are several ways to do this. Some of these
are based on what we term "residuals" of the fitted model: the residuals are the values

r̂i = yi − μ̂i,   i = 1, ..., n

For example, if Yi ∼ G(α + βxi, σ) then r̂i = yi − α̂ − β̂xi. The R function lm produces these values
as part of its output.
If Yi ∼ G(μi, σ) then Ri = Yi − μi ∼ G(0, σ). The idea behind the r̂i's is that they can be
thought of as "observed" Ri's. This isn't exactly correct since we are using μ̂i instead of μi in r̂i, but
if the model is correct, then the r̂i's should behave roughly like a random sample from the distribution
G(0, σ).

Plots of residuals are used as a model check. For example, we can

(1) Plot the points (xi, r̂i), i = 1, ..., n. If the model is satisfactory these should lie within a horizontal
band around the line r̂i = 0.

(2) Plot the points (μ̂i, r̂i), i = 1, ..., n. If the model is satisfactory we should get the same type of
pattern as for (1).

Departures from the "expected" pattern in (1) and (2) suggest problems with the model. For example, if in (2) we see that the variability in the r̂i's is bigger for larger values of μ̂i, this suggests that
Var(Yi) = Var(Ri) is not constant, but may be larger when μ(x) is larger.
Figure 6.22 shows a couple of such patterns; the left-hand plot suggests non-constant variance
whereas the right-hand plot suggests that the function μi = g(xi) is not correctly specified.
In problems with only one x-variable, a plot of μ̂(x) superimposed on the scatterplot of the data (as
in Figure 6.21) shows pretty clearly how well the model fits. The residual plots described are, however,
very useful when there are two or more covariates in the model.
When there are no covariates in the model, as in Section 6.2, plots (1) and (2) are undefined. In this
case the only assumption is that Yi ∼ G(μ, σ). We can still define residuals, either as

r̂i = yi − μ̂   or   r̂i* = (yi − μ̂)/σ̂,

where μ̂ = ȳ and σ̂ (we could alternatively use s) is the MLE of σ. One way to check the model is
to treat the r̂i*'s (which are called standardized residuals) as a random sample of values of (Y − μ)/σ.
Since (Y − μ)/σ ∼ G(0, 1) under our assumed model, we could plot the empirical c.d.f. (EDF) of the
r̂i* (i = 1, ..., n) and superimpose on it the G(0, 1) c.d.f. The two curves should agree well if the
Gaussian model is satisfactory. This plot can also be used when there are covariates, by defining the
standardized residuals

r̂i* = r̂i/σ̂ = (yi − μ̂i)/σ̂,   i = 1, ..., n

We can also use the r̂i*'s in place of the r̂i's in plots (1) and (2) above; in fact that is what we did in
Figure 6.22. When the r̂i*'s are used the patterns in the plots are unchanged, but the r̂i* values tend to lie
in the range (−3, 3). (Think about why this is so.)
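A sketch of these checks in R for a fitted simple linear regression (fit is any object returned by lm, as
in the earlier sketches; summary(fit)$sigma is s, used here in place of the MLE σ̂):

    r <- residuals(fit)                            # the r-hat values
    rstd <- r / summary(fit)$sigma                 # standardized residuals
    plot(fitted(fit), rstd, ylab = "std. resid.")  # plot (2): expect a horizontal band
    abline(h = 0)
    # compare the EDF of the standardized residuals with the G(0,1) c.d.f.
    plot(ecdf(rstd))
    curve(pnorm(x), add = TRUE, lty = 2)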

Example 6.6.1 Steel Bolts Let us define residuals

r̂i = yi − α̂ − β̂x1i,   i = 1, ..., 30

for the model fitted in Example 6.5.2. Figure 6.23 shows a plot of the points (x1i, r̂i); no deviation
from the expected pattern is observed. This is of course also evident from Figure 6.21.
A further check on the Gaussian distribution is shown in Figure 6.24. Here we have plotted the
EDF based on the 30 standardized residuals

r̂i* = (yi − α̂ − β̂x1i)/σ̂.

On the same graph is the G(0, 1) c.d.f. There is good agreement between the two curves.
Figure 6.22: Examples of Patterns in Residual Plots

6.7 Problems
1. Student’s t Distribution
Suppose that Z and U are independent variates with

Z ∼ N(0, 1),   U ∼ χ²(ν).

Consider the ratio

X = Z / √(U/ν).

Then X is a continuous variate. Its distribution is called the t (Student's) distribution with ν
degrees of freedom, and we write X ∼ t(ν) for short. It can be shown by change of variables that
X has pdf

f(x) = kν (1 + x²/ν)^{−(ν+1)/2}   for −∞ < x < ∞

where kν is a normalizing constant such that the total area under the pdf is 1:

kν = Γ((ν+1)/2) / [Γ(ν/2) √(πν)].

The pdf is symmetric about the origin, and is similar in shape to the pdf of N(0, 1) but has more
probability in the tails. It can be shown that f(x) tends to the N(0, 1) pdf as ν → ∞.
Figure 6.23: Residual Plot for Bolt Strength Model

(a) Plot the pdf for ν = 1 and ν = 5.

(b) Find values a, b such that

P(−a ≤ t(30) ≤ a) = 0.98,   P(t(20) ≤ b) = 0.95.

(c) Show that f(x) is unimodal for all ν.

(d) Show that as ν → ∞, f(x) approaches the G(0, 1) pdf f(x) = (1/√(2π)) e^{−x²/2}.
(Note: To do this you will need to use the fact that kν → 1/√(2π) as ν → ∞; this follows from a
property of gamma functions.)

2. Suppose that Y1, ..., Yn are independent G(μ, σ) observations.

(a) Show that the likelihood ratio statistic for testing a value of μ is given by (assume σ is
unknown)

λ(μ) = n log( 1 + T²/(n − 1) )

where T = √n(Ȳ − μ)/s, with s the sample standard deviation. (Note: The sample
variance s² is defined as Σ(yi − ȳ)²/(n − 1).)

(b) Show that the likelihood ratio statistic for testing a value of σ is a function of

W = (n−1)s²/σ².
Figure 6.24: Empirical Distribution Function of Standardized Residuals and the G(0, 1) c.d.f.

3. The following data are instrumental measurements of level of dioxin (in parts per billion) in 20
samples of a “standard” water solution known to contain 45 ppb dioxin.
44.1 46.0 46.6 41.3 44.8 47.8 44.5 45.1 42.9 44.5
42.5 41.5 39.6 42.0 45.8 48.9 46.6 42.9 47.0 43.7

(a) Assuming that the measurements are independent and N(μ, σ²), obtain a .95 confidence
interval for μ and test the hypothesis that μ = 45.
(b) Obtain a .95 confidence interval for σ. Of what interest is this scientifically?

4. A new method gave the following ten measurements of the specific gravity of mercury:
13.696 13.699 13.683 13.692 13.705
13.695 13.697 13.688 13.690 13.707
Assume these to be independent observations from N(μ, σ²).

(a) An old method produced measurements with standard deviation σ = .02. Test the hypothesis that the new method has the same standard deviation as the old.

(b) A physical chemistry handbook lists the specific gravity of mercury as 13.75. Are the data
consistent with this value?
(c) Obtain 95% CI’s for and .

5. Sixteen packages are randomly selected from the production of a detergent packaging machine.
Their weights (in grams) are as follows:
287 293 295 295 297 298 299 300
300 302 302 303 306 307 308 311

(a) Assuming that the weights are independent N(μ, σ²) random variables, obtain .95 confidence
    intervals for μ and σ.

(b) Let X̄ and s² = (1/(n − 1)) ∑(xi − x̄)² be the mean and variance in a sample of size n, and let
    X represent the weight of a future, independent, randomly selected package. Show that
    X − X̄ ~ N(0, σ²(1 + 1/n)) and then that

        Z = (X − X̄) / ( s √(1 + 1/n) ) ~ t(n − 1).

    For the data above, use this as a pivotal quantity to obtain a .95 "confidence" interval for X.

6. A manufacturer wishes to determine the mean breaking strength (force) of a type of string to
   "within a pound", which we interpret as requiring that the 95% confidence interval for μ should
   have length at most 2 pounds. If the breaking strengths Y of strings tested are G(μ, σ) and if 10
   preliminary tests gave ∑(yi − ȳ)² = 80, how many additional measurements would you advise
   the manufacturer to take?

7. To compare the mathematical abilities of incoming first year students in Mathematics and Engi-
neering, 30 Math students and 30 Engineering students were selected randomly from their first
year classes and given a mathematics aptitude test. A summary of the resulting marks xi (for the
math students) and yi (for the engineering students), i = 1; : : : ; 30, is as follows:
    Math students:         n = 30,  x̄ = 120,  ∑(xi − x̄)² = 3050
    Engineering students:  n = 30,  ȳ = 114,  ∑(yi − ȳ)² = 2937
Obtain a .95 confidence interval for the difference in mean scores for first year Math and Engi-
neering students, and test the hypothesis that the difference is zero.

8. A study was done to compare the durability of diesel engine bearings made of two different
compounds. Ten bearings of each type were tested. The following table gives the “times” until
failure (in units of millions of cycles):
Type I Type II
3.03 3.19
5.53 4.26
5.60 4.47
9.30 4.53
9.92 4.67
12.51 4.69
12.95 12.78
15.21 6.79
16.04 9.37
16.84 12.75

(a) Assuming that Y, the number of million cycles to failure, has a normal distribution with the
    same variance for each type of bearing, obtain a .90 confidence interval for the difference
    in the means μ1 and μ2 of the two distributions.
(b) Test the hypothesis that μ1 = μ2.

(c) It has been suggested that log failure times are approximately normally distributed, but
not failure times. Assuming that the log Y ’s for the two types of bearing are normally
distributed with the same variance, test the hypothesis that the two distributions have the
same mean. How does the answer compare with that in part (b)?
(d) How might you check whether Y or log Y is closer to normally distributed?
(e) Give a plot of the data which could be used to describe the data and your analysis.

9. Fourteen welded girders were cyclically stressed at 1900 pounds per square inch and the numbers
of cycles to failure were observed. The sample mean and variance of the log failure "times" were
ȳ = 14.564 and s² = 0.0914. Similar tests on four additional girders with repaired welds gave
ȳ = 14.291 and s² = 0.0422. Log failure times are assumed to be independent with a G(μ, σ)
distribution.

(a) Test the hypothesis that the variance of Y is the same for repaired welds as for the normal
welds.
(b) Assuming equal variances, obtain a 90% confidence interval for the difference in mean log
failure time.
(c) Note that μ1 − μ2 in part (b) is also the difference in median log failure times, and obtain a
    90% confidence interval for the ratio

        (median lifetime (cycles) for repaired welds) / (median lifetime (cycles) for normal welds).

10. Let Y1, ..., Yn be a random sample from G(μ1, σ1) and X1, ..., Xn be a random sample from
    G(μ2, σ2). Obtain the likelihood ratio statistic for testing H: σ1 = σ2 and show that it is a
    function of F = s1²/s2², where s1² and s2² are the sample variances from the y and x samples.

11. Readings produced by a set of scales are independent and normally distributed about the true
weight of the item being measured. A study is carried out to assess whether the standard deviation
of the measurements varies according to the weight of the item.

(a) Ten weighings of a 10 kg. weight yielded ȳ = 10.004 and s = 0.013 as the sample
    mean and standard deviation. Ten weighings of a 40 kg. weight yielded ȳ = 39.989
    and s = 0.034. Is there any evidence of a difference in the standard deviations for the
measurements of the two weights?
(b) Suppose you had a further set of weighings of a 20 kg. item. How could you study the
question of interest further?

12. An experiment was conducted to compare gas mileages of cars using a synthetic oil and a con-
ventional oil. Eight cars were chosen as representative of the cars in general use. Each car was
run twice under as similar conditions as possible (same drivers, routes, etc.), once with the syn-
thetic oil and once with the conventional oil, the order of use of the two oils being randomized.
The average gas mileages were as follows:
Car 1 2 3 4 5 6 7 8
Synthetic oil 21.2 21.4 15.9 37.0 12.1 21.1 24.5 35.7
Conventional oil 18.0 20.6 14.2 37.8 10.6 18.5 25.9 34.7

(a) Obtain a .95 confidence interval for the difference in mean gas mileage, and state the as-
sumptions on which your analysis depends.
(b) Repeat (a) if the natural pairing of the data is (improperly) ignored.
(c) Why is it better to take pairs of measurements on eight cars rather than taking only one
measurement on each of 16 cars?

13. Consider the data in Problem 8 of Chapter 1 on the lengths of male and female coyotes.

(a) Fit separate Gaussian models for the lengths of males and females. Estimate the difference
in mean lengths for the two sexes.
(b) Estimate P (Y1 > Y2 ) (give the m.l.e.), where Y1 is the length of a randomly selected female
and Y2 is the length of a randomly selected male. Can you suggest how you might get a
confidence interval?
(c) Give separate CI’s for the average length of males and females.

14. Comparing sorting algorithms. Suppose you want to compare two algorithms A and B that
    will sort a set of numbers into an increasing sequence. (The R function sort(x) will, for example,
    sort the elements of the numeric vector x.)
To compare the speed of algorithms A and B, you decide to “present" A and B with random
permutations of n numbers, for several values of n. Explain exactly how you would set up such
a study, and discuss what pairing would mean in this context.

15. Sorting algorithms continued. Two sort algorithms as in the preceding question were each run
    on (the same) 20 sets of numbers (there were 500 numbers in each set). The times to sort the sets
    are shown below for the two algorithms.
Set: 1 2 3 4 5 6 7 8 9 10
A: 3.85 2.81 6.47 7.59 4.58 5.47 4.72 3.56 3.22 5.58
B: 2.66 2.98 5.35 6.43 4.28 5.06 4.36 3.91 3.28 5.19

Set: 11 12 13 14 15 16 17 18 19 20
A: 4.58 5.46 3.31 4.33 4.26 6.29 5.04 5.08 5.08 3.47
B: 4.05 4.78 3.77 3.81 3.17 6.02 4.84 4.81 4.34 3.48

(a) Plot the data so as to illustrate its main features.


(b) Estimate (give a CI) for the difference in the average time to sort with algorithms A and B,
assuming a Gaussian model applies.
(c) Suppose you are asked to estimate the probability that A will sort a randomly selected list
    faster than B. Give a point estimate of this probability.
(d) Another way to estimate the probability p in part (c) is just to notice that of the 20 sets of
    numbers in the study, A sorted faster on 15. Indicate how you could also get a CI for p
    using this approach. (It is also possible to get a CI using the Gaussian model.)

16. Least squares estimation. Suppose you have a model where the mean of the response variable
    Yi given the covariates xi has the form

        μi = E(Yi | xi) = g(xi; β)    (6.7.1)

    where β is a vector of unknown parameters. Then the least squares (LS) estimate of β based
    on data (xi, yi), i = 1, ..., n is the value that minimizes the objective function

        S(β) = ∑_{i=1}^n [yi − g(xi; β)]².

    Show that the LS estimate β̂ of β is the same as the MLE of β in the Gaussian model
    Yi ~ G(μi, σ), when μi is of the form (6.7.1).

17. To assess the effect of a low dose of alcohol on reaction time, a sample of 24 student volunteers
took part in a study. Twelve of the students (randomly chosen from the 24) were given a fixed
dose of alcohol (adjusted for body weight) and the other twelve got a nonalcoholic drink which
looked and tasted the same as the alcoholic drink. Each student was then tested using software
that flashes a coloured rectangle randomly placed on a screen; the student has to move the cursor
into the rectangle and double click the mouse. As soon as the double click occurs, the process
is repeated, up to a total of 20 times. The response variate is the total reaction time (i.e. time to
complete the experiment) over the 20 trials.
The data on the times are shown below for the 24 students.
    "Alcohol" Group:      1.33, 1.55, 1.43, 1.35, 1.17, 1.35, 1.17, 1.80, 1.68,
                          1.19, 0.96, 1.46   (ȳ = 1.370, s = 0.235)
    "Non-Alcohol" Group:  1.68, 1.30, 1.85, 1.64, 1.62, 1.69, 1.57, 1.82, 1.41,
                          1.78, 1.40, 1.43   (ȳ = 1.599, s = 0.180)

Analyze the data with the objective of seeing whether there is any evidence that the dose of alcohol
increases reaction time. Justify any models that you use.

18. There are often both expensive (and highly accurate) and cheaper (and less accurate) ways of
measuring concentrations of various substances (e.g. glucose in human blood, salt in a can of
soup). The table below gives the actual concentration x (determined by an expensive but very
accurate procedure) and the measured concentration y obtained by a cheap procedure, for each
of 10 units.
x: 4.01 8.12 12.53 15.90 20.24 24.81 30.92 37.26 38.94 40.15
y: 3.70 7.80 12.40 16.00 19.90 24.90 30.80 37.20 38.40 39.40

(a) Fit a Gaussian linear regression model for Y given x to the data and obtain .95 confidence
    intervals for the slope and the standard deviation σ. Use a plot to check the adequacy of the
model.

(b) Describe briefly how you would characterize the cheap measurement process’s accuracy to
a lay person.

(c) Assuming that the units being measured have true concentrations in the range 0-40, do you
think that the cheap method tends to produce a value that is lower than the true concentra-
tion? Support your answer with an argument based on the data.

19. The following data, collected by Dr. Joseph Hooker in the Himalaya mountains, relates at-
mospheric pressure to the boiling point of water. Theory suggests that a graph of log pressure vs.
boiling point should give a straight line.

Temp (°F)    Pres (in. Hg)    Temp (°F)    Pres (in. Hg)


210.8 29.211 189.5 18.869
210.2 28.559 188.8 18.356
208.4 27.972 188.5 18.507
202.5 24.697 185.7 17.267
200.6 23.726 186.0 17.221
200.1 23.369 185.6 17.062
199.5 23.030 184.1 16.959
197.0 21.892 184.6 16.881
196.4 21.928 184.1 16.817
196.3 21.654 183.2 16.385
195.6 21.605 182.4 16.235
193.4 20.480 181.9 16.106
193.6 20.212 181.9 15.928
191.4 19.758 181.0 15.919
191.1 19.490 180.6 15.376
190.6 19.386

(a) Prepare a scatterplot of y = log(Pressure) vs. x = Temperature. Do the same for y = Pressure
    vs. x. Which is better described by a linear model? Does this confirm the theory's model?
(b) Fit a normal linear regression model for y = log(Pressure) vs. x. Are there any obvious
difficulties with the model?
(c) Obtain a .95 confidence interval for the atmospheric pressure if the boiling point of water
    is 195°F.

20. Consider the data in Problem 9 of Chapter 1, in which the variable y was the average time to
complete tasks by computer users, and x was the number of users on the system. Fit a regression
model, using x as the explanatory variable. Give a .95 confidence interval for the mean of Y
when there are 50 users on the system.

21. (a) For the steel bolt experiment in Examples 6.1.3 and 6.5.2, use a Gaussian model to
(i) estimate the average breaking strength of bolts of diameter 0.35
(ii) estimate (predict) the breaking strength of a single bolt of diameter 0.35.
Give interval estimates in each case.
(b) Suppose that a bolt of diameter 0.35 is exposed to a large force V that could potentially
break it. In structural reliability and safety calculations, V is treated as a r.v. and if Y
represents the breaking strength of the bolt (or some other part of a structure), then the
probability of a "failure" of the bolt is P(V > Y). Give a point estimate of this value if
V ~ G(1.60, 0.10), where V and Y are independent.

22. Optimal Prediction. In many settings we want to use covariates x to predict a future value Y.
    (For example, we use economic factors x to predict the price Y of a commodity a month from
    now.) The value Y is random, but suppose we know μ(x) = E(Y | x) and σ(x)² = Var(Y | x).

    (a) Predictions take the form Ŷ = g(x), where g(·) is our "prediction" function. Show that
        E[(Ŷ − Y)²] is minimized by choosing g(x) = μ(x).
    (b) Show that the minimum achievable value of E[(Ŷ − Y)²], that is, its value when
        g(x) = μ(x), is σ(x)².
        This shows that if we can determine or estimate μ(x), then "optimal" prediction (in terms
        of Euclidean distance) is possible. Part (b) shows that we should try to find covariates x for
        which σ(x)² = Var(Y | x) is as small as possible.
    (c) What happens when σ(x)² is close to zero? (Explain this in ordinary English.)
TESTS AND INFERENCE PROBLEMS
BASED ON MULTINOMIAL MODELS

7.1 Introduction

Many important hypothesis testing problems can be addressed using multinomial models. An example
was given in Chapter 5, whose general ideas we will use here. To start, recall the setting in Example
(d) of Chapter 5, Section 2, where data were assumed to arise from a multinomial distribution with
probability function

    f(y1, ..., ym; p1, ..., pm) = (n! / (y1! ⋯ ym!)) p1^{y1} ⋯ pm^{ym}    (7.1.1)

where 0 ≤ yj ≤ n and ∑ yj = n. The multinomial probabilities pj satisfy 0 ≤ pj ≤ 1 and ∑ pj = 1,
and we define α = (p1, ..., pm). Suppose now that we wish to test the hypothesis

    H0: pj = pj(θ),   j = 1, ..., m    (7.1.2)

where dim(θ) = p < m − 1.

The likelihood function based on (7.1.1) is anything proportional to

    L(α) = ∏_{j=1}^m pj^{yj}.    (7.1.3)

Let Ω be the parameter space for α. It was shown earlier that L(α) is maximized over Ω by the vector
α̂ with p̂j = yj/n (j = 1, ..., m). A likelihood ratio test of the hypothesis (7.1.2) is based on the
likelihood ratio statistic

    Λ = 2ℓ(α̂) − 2ℓ(α̂0) = −2 log{ L(α̂0) / L(α̂) },    (7.1.4)

where α̂0 maximizes L(α) under the hypothesis (7.1.2), which restricts α to lie in a space Ω0 of
dimension p. If H0 is true (that is, if α really lies in Ω0) then the distribution of Λ is approximately
χ²(k − p) when n is large, where k = m − 1 is the dimension of Ω. This enables us to compute
significance levels (p-values) from observed data by using the approximation

    P(Λ ≥ Λobs; H0) ≈ P(χ²(k − p) ≥ Λobs).    (7.1.5)

This approximation is very accurate when n is large and none of the pj's is too small; when the ej's
below all exceed 5 it is accurate enough for testing purposes.

The test statistic (7.1.4) can be written in a simple form. Let α̂0 = (p̃1, ..., p̃m) = (p1(θ̃), ..., pm(θ̃))
denote the m.l.e. of α under the hypothesis (7.1.2). Then, by (7.1.3), we get

    Λ = 2ℓ(α̂) − 2ℓ(α̂0)
      = 2 ∑_{j=1}^m yj log(p̂j / p̃j).

Noting that p̂j = yj/n and defining "expected frequencies" under H0 as

    ej = n p̃j,   j = 1, ..., m

we can rewrite Λ as

    Λ = 2 ∑_{j=1}^m yj log(yj / ej).    (7.1.6)

An alternative test statistic that was developed historically before Λ is the "Pearson" statistic

    D = ∑_{j=1}^m (yj − ej)² / ej.    (7.1.7)

This has similar properties to Λ; for example, both equal 0 when yj = ej for all j = 1, ..., m, and
both are larger when the yj's and ej's differ greatly. It turns out that, like Λ, the statistic D also has a
limiting χ²(k − p) distribution, with k = m − 1, when H0 is true.
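Both statistics are simple to compute once the ej's are available. The short R function below is our own sketch (the name gof.stats is not from the notes): it takes observed frequencies y, expected frequencies e and the degrees of freedom k − p, and returns Λ, D and their approximate significance levels. It assumes all yj > 0.

    # Likelihood ratio (7.1.6) and Pearson (7.1.7) statistics with chi-squared p-values
    gof.stats <- function(y, e, df) {
      lambda <- 2 * sum(y * log(y / e))
      D <- sum((y - e)^2 / e)
      c(lambda = lambda, D = D,
        SL.lambda = pchisq(lambda, df, lower.tail = FALSE),
        SL.D = pchisq(D, df, lower.tail = FALSE))
    }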
The remainder of this chapter consists of the application of the general methods above to some
important testing problems.

7.2 Goodness of Fit Tests


Recall from Section 2.5 that one way to check the fit of a probability distribution is by comparing the
p̂j's (the sample relative frequencies) with the estimates p̃j from the distributional model. This is
equivalent to comparing the yj's (observed frequencies) and the ej's. In Section 2.5 this comparison
was informal, with only a rough guideline for how closely the yj's and ej's (or p̂j's and p̃j's) should agree.
It is possible to test the correctness of a parametric model by using an implied multinomial model.
We illustrate this through two examples.

Example 7.2.1. Recall Example 2.5.2, where people in a population are classified as being one
of three blood types MM, MN, NN. The proportions of the population that are these three types are
p1, p2, p3 respectively, with p1 + p2 + p3 = 1. Genetic theory indicates, however, that the pj's can be
expressed in terms of a single parameter θ, as

    p1 = θ²,   p2 = 2θ(1 − θ),   p3 = (1 − θ)².    (7.2.1)

Data collected on 100 persons gave y1 = 17, y2 = 46, y3 = 37, and we can use this to test
the hypothesis H0 that (7.2.1) is correct. (Note that (Y1, Y2, Y3) ~ Mult(n = 100; p1, p2, p3).) The
likelihood ratio test statistic is (7.1.6), but we have to find θ̃ and then the ej's. The likelihood function
under (7.2.1) is

    L1(θ) = L(p1(θ), p2(θ), p3(θ))
          = (θ²)^{17} [2θ(1 − θ)]^{46} [(1 − θ)²]^{37}
          = c θ^{80} (1 − θ)^{120}

and we easily find that θ̃ = 0.40. The expected frequencies are therefore e1 = 100 θ̃² = 16,
e2 = 100[2θ̃(1 − θ̃)] = 48, e3 = 100[(1 − θ̃)²] = 36. Clearly these are close to the observed frequencies
y1, y2, y3, and (7.1.6) gives the observed value Λobs = 0.17. The significance level is

    P(χ²(1) ≥ 0.17) = 0.68

so there is no evidence against the model (7.2.1).

The Pearson statistic (7.1.7) usually gives close to the same value as Λ when n is large. In this case
we find that D = 0.17.
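The arithmetic in this example can be checked with a few lines of R (a sketch using the frequencies above; the closed form for θ̃ follows from maximizing c θ^80 (1 − θ)^120):

    y <- c(17, 46, 37)                       # observed frequencies for MM, MN, NN
    n <- sum(y)
    theta <- (2 * y[1] + y[2]) / (2 * n)     # m.l.e. under (7.2.1): 0.40
    e <- n * c(theta^2, 2 * theta * (1 - theta), (1 - theta)^2)   # 16, 48, 36
    lambda <- 2 * sum(y * log(y / e))        # statistic (7.1.6), about 0.17
    D <- sum((y - e)^2 / e)                  # Pearson statistic (7.1.7), about 0.17
    pchisq(lambda, df = 1, lower.tail = FALSE)   # significance level, about 0.68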

Example 7.2.2. Continuous distributions can also be tested by grouping the data into intervals
and then using the multinomial model. Example 2.5.1 previously did this in an informal way for an
exponential distribution. For example, suppose that T is thought to have an exponential distribution
with probability density function

    f(t; θ) = (1/θ) e^{−t/θ},   t > 0.    (7.2.2)

Suppose a random sample t1, ..., t100 is collected and the objective is to test the hypothesis H0 that
(7.2.2) is correct. To do this we partition the range of T into intervals Ij, j = 1, ..., m, and count the
number of observations yj that fall into each interval. Under (7.2.2), the probability that an observation
lies in the j'th interval Ij = (aj, bj) is

    pj(θ) = ∫_{aj}^{bj} f(t; θ) dt,   j = 1, ..., m    (7.2.3)

and if yj is the number of observations (t's) that lie in Ij, then Y1, ..., Ym follow a multinomial
distribution with n = 100. Thus we can test (7.2.2) by testing that (7.2.3) is true.
Consider the following data, which have been divided into m = 7 intervals:

    Interval   0-100   100-200   200-300   300-400   400-600   600-800   >800
    yj           29       22        12        10        10         9        8
    ej         27.6     20.0      14.4      10.5      13.1       6.9      7.6

We have also shown expected frequencies ej, calculated as follows. The distribution of (Y1, ..., Y7) is
multinomial with probabilities given by (7.2.3) when the model (7.2.2) is correct. In particular,

    p1(θ) = ∫_0^{100} (1/θ) e^{−t/θ} dt = 1 − e^{−100/θ},

and so on. Expressions for p2, ..., p7 are p2(θ) = e^{−100/θ} − e^{−200/θ}, p3(θ) = e^{−200/θ} − e^{−300/θ},
p4(θ) = e^{−300/θ} − e^{−400/θ}, p5(θ) = e^{−400/θ} − e^{−600/θ}, p6(θ) = e^{−600/θ} − e^{−800/θ},
p7(θ) = e^{−800/θ}. The likelihood function from y1, ..., y7 based on model (7.2.3) is then

    L1(θ) = ∏_{j=1}^7 pj(θ)^{yj}.

It is possible to maximize L1(θ) mathematically. (Hint: rewrite L1(θ) in terms of the parameter
φ = e^{−100/θ} and find φ̃ first; then θ̃ = −100/ln φ̃.) This gives θ̃ = 310.3 and the expected frequencies
ej = 100 pj(θ̃) given in the table are then obtained.
The likelihood ratio statistic (7.1.6) gives Λobs = 1.91. The significance level is computed as

    P(Λ ≥ Λobs; H0) ≈ P(χ²(5) ≥ 1.91) = 0.86,

so there is no evidence against the model (7.2.3). Note that the reason the χ² degrees of freedom is 5 is
because k = m − 1 = 6 and p = dim(θ) = 1.
The goodness of fit test just given has some arbitrary elements, since we could have used different
intervals and a different number of intervals. Theory and guidelines as to how best to choose the
intervals can be developed, but we won't consider this here. Rough guidelines for our purposes are to
choose 4-10 intervals, so that expected frequencies are at least 5.
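For this example the restricted m.l.e. can also be found numerically, which avoids the reparametrization in the hint. A minimal R sketch, assuming the grouped frequencies above (the exact maximizer quoted in the text is θ̃ = 310.3):

    y <- c(29, 22, 12, 10, 10, 9, 8)                 # observed interval frequencies
    breaks <- c(0, 100, 200, 300, 400, 600, 800, Inf)
    n <- sum(y)
    pj <- function(theta) diff(pexp(breaks, rate = 1 / theta))  # interval probabilities (7.2.3)
    negloglik <- function(theta) -sum(y * log(pj(theta)))
    theta.tilde <- optimize(negloglik, c(50, 2000))$minimum     # about 310
    e <- n * pj(theta.tilde)                         # expected frequencies
    lambda <- 2 * sum(y * log(y / e))                # statistic (7.1.6), about 1.9
    pchisq(lambda, df = 5, lower.tail = FALSE)       # significance level, about 0.86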

7.3 Two-Way Tables and Testing for Independence of Two Variables


Often we want to assess whether two factors or variables appear to be related. One tool for doing this is
to test the hypothesis that the factors are independent (and thus statistically unrelated). We will consider

this in the case where both variables are discrete, and take on a fairly small number of possible values.
This turns out to cover a great many important settings.
Two types of studies give rise to data that can be used to test independence, and in both cases the
data can be arranged as frequencies in a two-way table. These tables are sometimes called “contin-
gency” tables in the statistics literature. We’ll consider the two types of studies in turn.

7.3.1 Cross-Classification of A Random Sample of Individuals


Suppose that individuals or items in a population can be classified according to each of two factors
A and B. For A, an individual can be one of a mutually exclusive types A1, A2, ..., Aa, and for B an
individual can be one of b mutually exclusive types B1, B2, ..., Bb, where a ≥ 2 and b ≥ 2.
If a random sample of n individuals is selected, let yij denote the number that have A-type Ai and
B-type Bj. Let pij be the probability a randomly selected individual is combined type (Ai, Bj). Note
that

    ∑_{i=1}^a ∑_{j=1}^b yij = n,    ∑_{i=1}^a ∑_{j=1}^b pij = 1

and that the ab frequencies (y11, y12, ..., yab) follow a multinomial distribution with m = ab classes.
To test independence of the A and B classifications, we consider the hypothesis

    H0: pij = αi βj,   i = 1, ..., a; j = 1, ..., b    (7.3.1)

where 0 < αi < 1, 0 < βj < 1, ∑_{i=1}^a αi = 1, ∑_{j=1}^b βj = 1. Note that αi = P(an individual is A-type
Ai) and βj = P(an individual is B-type Bj), and that (7.3.1) is the standard definition for independent
events: P(Ai Bj) = P(Ai)P(Bj).
We recognize that testing (7.3.1) falls into the general framework of Section 7.1, where m = ab,
k = m − 1, and the dimension of the parameter space under (7.3.1) is p = (a − 1) + (b − 1) = a + b − 2.
All that needs to be done in order to use the statistics (7.1.6) or (7.1.7) to test H0 given by (7.3.1) is to
obtain the m.l.e.'s α̃i, β̃j under model (7.3.1), and then the expected frequencies eij. Under (7.3.1), the
likelihood function for the yij's is proportional to

    L1(α, β) = ∏_{i=1}^a ∏_{j=1}^b pij(α, β)^{yij}
             = ∏_{i=1}^a ∏_{j=1}^b (αi βj)^{yij}.    (7.3.2)

It is easy to maximize ℓ(α, β) = log L1(α, β) subject to the linear constraints ∑ αi = 1, ∑ βj = 1.
This gives

    α̃i = yi+ / n,   β̃j = y+j / n   and   eij = n α̃i β̃j = yi+ y+j / n,    (7.3.3)

where yi+ = ∑_{j=1}^b yij and y+j = ∑_{i=1}^a yij. The likelihood ratio statistic (7.1.6) for testing the
hypothesis (7.3.1) is then

    Λ = 2 ∑_{i=1}^a ∑_{j=1}^b yij log(yij / eij).    (7.3.4)

The significance level is computed by the approximation P(χ²((a−1)(b−1)) ≥ Λobs); the χ² degrees of
freedom (a − 1)(b − 1) come from k − p = (ab − 1) − (a + b − 2) = (a − 1)(b − 1).

Example 7.3.1. Human blood is classified according to several systems. Two are the OAB system
and the Rh system. In the former a person is one of 4 types O, A, B, AB and in the latter a person is
Rh+ or Rh−. A random sample of 300 persons produced the observed frequencies in the following
table. Expected frequencies, computed below, are in brackets after each observed frequency.

                 O           A           B          AB       Total
    Rh+      82 (77.3)   89 (94.4)   54 (49.6)   19 (22.8)    244
    Rh−      13 (17.7)   27 (21.6)    7 (11.4)    9 (5.2)      56
    Total        95         116          61          28        300

It is of interest to see whether these two classification systems are genetically independent. The row
and column totals in the table are also shown, since they are the values yi+ and y+j needed to compute
the eij's in (7.3.3). In this case we can think of the Rh types as the A-type classification and the OAB
types as the B-type classification in the general theory above. Thus a = 2, b = 4 and the χ² degrees of
freedom are (a − 1)(b − 1) = 3.
To carry out the test that a person's Rh and OAB blood types are statistically independent, we
merely need to compute the eij's by (7.3.3). This gives, for example,

    e11 = (244)(95)/300 = 77.3,    e12 = (244)(116)/300 = 94.4

and, similarly, e13 = 49.6, e14 = 22.8, e21 = 17.7, e22 = 21.6, e23 = 11.4, e24 = 5.2.
It may be noted that ei+ = yi+ and e+j = y+j, so it is necessary to compute only (a − 1)(b − 1) eij's
via (7.3.3); the remainder can be obtained by subtraction from row and column totals. For example, if
we compute e11, e12, e13 here then e21 = 95 − e11, e22 = 116 − e12, and so on. (This isn't an advantage
with a computer; it's simpler to use (7.3.3) directly then. However, it suggests where the term "degrees
of freedom" comes from.)
The observed value of the likelihood ratio test statistic (7.3.4) is Λobs = 8.52, and the significance
level is approximately P(χ²(3) ≥ 8.52) = 0.036, so there is some degree of evidence against the
hypothesis of independence. Note that by comparing the eij's and the yij's we get some idea about the
lack of independence, or relationship, between the two classifications. We see here that the degree of
dependence does not appear large.
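This test is easy to reproduce in R; the sketch below (the matrix layout is ours) computes the expected frequencies (7.3.3) and the likelihood ratio statistic (7.3.4), and chisq.test gives the corresponding Pearson test of independence.

    y <- matrix(c(82, 89, 54, 19,
                  13, 27,  7,  9), nrow = 2, byrow = TRUE,
                dimnames = list(c("Rh+", "Rh-"), c("O", "A", "B", "AB")))
    e <- outer(rowSums(y), colSums(y)) / sum(y)   # expected frequencies (7.3.3)
    lambda <- 2 * sum(y * log(y / e))             # statistic (7.3.4), about 8.52
    pchisq(lambda, df = 3, lower.tail = FALSE)    # significance level, about 0.036
    chisq.test(y)                                 # Pearson statistic and its p-value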

Testing Equality of Multinomial Parameters from Two or More Groups

A similar problem arises when individuals in a population can be one of b types B1, ..., Bb, but where
the population is sub-divided into a groups A1, ..., Aa. In this case, we might be interested in whether
the proportions of individuals of types B1, ..., Bb are the same for each group. This is essentially
the same as the question of independence in the preceding section: we want to know whether the
probability pij that a person in population group i is B-type Bj is the same for all i = 1, ..., a. That is,
pij = P(Bj | Ai) and we want to know if this depends on Ai or not.
Although the framework is superficially the same as the preceding section, the details are a little
different. In particular, the probabilities pij satisfy

    pi1 + pi2 + ⋯ + pib = 1   for each i = 1, ..., a    (7.3.5)

and the hypothesis we are interested in testing is

    H0: p1 = p2 = ⋯ = pa,    (7.3.6)

where pi = (pi1, pi2, ..., pib). Furthermore, the data in this case arise by selecting specified numbers
of individuals ni from groups i = 1, ..., a, and so there are actually a multinomial distributions,
Mult(ni; pi1, ..., pib).
If we denote the observed frequency of Bj-type individuals in the sample from the i'th group as yij
(where yi1 + ⋯ + yib = ni), then it can be shown that the likelihood ratio statistic for testing (7.3.6)
is exactly the same as (7.3.4), where now the expected frequencies eij are given by

    eij = ni y+j / n,   i = 1, ..., a; j = 1, ..., b    (7.3.7)

where n = n1 + ⋯ + na. Since ni = yi+, the expected frequencies have exactly the same form as in
the preceding section, when we lay out the data in a two-way table with a rows and b columns.

Example 7.3.2. The study in Example 7.3.1 could have been conducted differently, by selecting a
fixed number of Rh+ persons and a fixed number of Rh− persons, and then determining their OAB
blood type. Then the proper framework would be to test that the probabilities for the 4 types O, A, B,
AB were the same for Rh+ and for Rh− persons, and so the methods of the present section apply. This
study gives exactly the same testing procedure as one where the numbers of Rh+ and Rh− persons in
the sample are random, as discussed.

Example 7.3.3. In a randomized clinical trial to assess the effectiveness of a small daily dose of
Aspirin in preventing strokes among high-risk persons, a group of patients were randomly assigned
to get either Aspirin or a placebo. They were then followed for 3 years, and it was determined for
each person whether they had a stroke during that period or not. The data were as follows (expected
frequencies are also given in brackets).

                     Stroke       No Stroke      Total
    Aspirin Group    64 (75.6)    176 (164.4)     240
    Placebo Group    86 (74.4)    150 (161.6)     236
    Total           150           326             476

We can think of the persons receiving Aspirin and those receiving Placebo as two groups, and test the
hypothesis

    H0: p11 = p21,

where p11 = P(Stroke) for a person in the Aspirin group and p21 = P(Stroke) for a person in the
Placebo group. The test statistic (7.3.4) requires the expected frequencies, which are

    eij = (yi+)(y+j)/476,   i = 1, 2; j = 1, 2.

This gives the values shown in the table. The test statistic then has observed value

    Λ = 2 ∑_{i=1}^2 ∑_{j=1}^2 yij log(yij / eij) = 5.25.

The approximate significance level is

    P(χ²(1) ≥ 5.25) = 0.022

so there is fairly strong evidence against H0. A look at the yij's and the eij's indicates that persons
receiving Aspirin have had fewer strokes than expected under H0, suggesting that p11 < p21.
This test can be followed up with estimates for p11 and p21. Because each row of the table follows
a binomial distribution, we have

    p̂11 = y11/n1 = 64/240 = 0.267,    p̂21 = y21/n2 = 86/236 = 0.364.

We can also give confidence intervals for p11 and p21; approximate .95 confidence intervals based on
earlier methods are 0.211 ≤ p11 ≤ 0.323 and 0.303 ≤ p21 ≤ 0.425. Confidence intervals for
δ = p11 − p21 can also be obtained from the approximate G(0, 1) pivotal quantity

    Z = (p̂11 − p̂21 − δ) / √( p̂11(1 − p̂11)/n1 + p̂21(1 − p̂21)/n2 ).

Remark: This and other tests involving binomial probabilities and contingency tables can be carried
out using the R function prop.test.
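For instance, a brief sketch of how the Aspirin example could be analyzed with prop.test (the vector names are ours):

    strokes <- c(64, 86)                 # strokes in the Aspirin and Placebo groups
    n <- c(240, 236)                     # numbers of persons in the two groups
    prop.test(strokes, n, correct = FALSE)
    # returns the estimates 0.267 and 0.364, a Pearson chi-squared test of
    # H0: p11 = p21, and an approximate .95 confidence interval for p11 - p21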

7.4 Problems
1. To investigate the effectiveness of a rust-proofing procedure, 50 cars that had been rust-proofed
and 50 cars that had not were examined for rust five years after purchase. For each car it was
noted whether rust was present (actually defined as having moderate or heavy rust) or absent
(light or no rust). The data are as follows:
Cars Cars Not
Rust-Proofed Rust Proofed
Rust present 14 28
Rust absent 36 22
50 50

(a) Test the hypothesis that the probability of rust occurring is the same for the rust-proofed
cars as for those not rust-proofed. What do you conclude?
(b) Do you have any concerns about inferring that the rust-proofing prevents rust? How might
a better study be designed?

2. Two hundred volunteers participated in an experiment to examine the effectiveness of vitamin C


in preventing colds. One hundred were selected at random to receive daily doses of vitamin C
and the others received a placebo. (None of the volunteers knew which group they were in.) Dur-
ing the study period, 20 of those taking vitamin C and 30 of those receiving the placebo caught
colds. Test the hypothesis that the probability of catching a cold during the study period was the
same for each group.

3. Mass-produced items are packed in cartons of 12 as they come off an assembly line. The items
from 250 cartons are inspected for defects, with the following results:
Number defective: 0 1 2 3 4 5 6
Frequency observed: 103 80 31 19 11 5 1

Test the hypothesis that the number of defective items Y in a single carton has a binomial distri-
bution Bin(12; p). Why might the binomial not be a suitable model?
7.4. PROBLEMS 163

4. The numbers of service interruptions in a communications system over 200 separate weekdays
is summarized in the following frequency table:
Number of interruptions: 0 1 2 3 4 5
Frequency observed: 64 71 42 18 4 1

Test whether a Poisson model for the number of interruptions Y on a single day is consistent
with these data.

5. The table below records data on 292 litters of mice classified according to litter size and number
of females in the litter.
                                   Number of females
                            0      1      2      3      4     Total # of litters
    Litter size   1         8     12                                  20
                  2        23     44     13                           80
                  3        10     25     48     13                    96
                  4         5     30     34     22      5             96

(a) For litters of size n (n = 1; 2; 3; 4) assume that the number of females in a litter follows
a binomial distribution with parameters n and pn = P (female). Test the binomial model
separately for each of the litter sizes n = 2; n = 3 and n = 4. (Why is it of scientific
interest to do this?)
(b) Assuming that the binomial model is appropriate for each litter size, test the hypothesis that
p1 = p2 = p3 = p4 .

6. A long sequence of digits (0; 1; : : : ; 9) produced by a pseudo random number generator was
examined. There were 51 zeros in the sequence, and for each successive pair of zeros, the
number of (non-zero) digits between them was counted. The results were as follows:
1 1 6 8 10 22 12 15 0 0
2 26 1 20 4 2 0 10 4 19
2 3 0 5 2 8 1 6 14 2
2 2 21 4 3 0 0 7 2 4
4 7 16 18 2 13 22 7 3 5
Give an appropriate probability model for the number of digits between two successive zeros,
if the pseudo random number generator is truly producing digits for which P(any digit = j) =
0.1 (j = 0, 1, ..., 9), independent of any other digit. Construct a frequency table and test the
goodness of fit of your model.

7. 1398 school children with tonsils present were classified according to tonsil size and absence or
presence of the carrier for streptococcus pyogenes. The results were as follows:
Normal Enlarged Much enlarged
Carrier present 19 29 24
Carrier absent 497 560 269
Is there evidence of an association between the two classifications?

8. The following data on heights of 210 married couples were presented by Yule in 1900.
Tall wife Medium wife Short wife
Tall husband 18 28 19
Medium husband 20 51 28
Short husband 12 25 9
Test the hypothesis that the heights of husbands and wives are independent.

9. In the following table, 64 sets of triplets are classified according to the age of their mother at
their birth and their sex distribution:
3 boys 2 boys 2 girls 3 girls Total

Mother under 30 5 8 9 7 29
Mother over 30 6 10 13 6 35

Total 11 18 22 13 64

(a) Is there any evidence of an association between the sex distribution and the age of the
mother?
(b) Suppose that the probability of a male birth is 0.5, and that the sexes of triplets are determined
    independently. Find the probability that there are x boys in a set of triplets
    (x = 0, 1, 2, 3), and test whether the column totals are consistent with this distribution.

10. A study was undertaken to determine whether there is an association between the birth weights
of infants and the smoking habits of their parents. Out of 50 infants of above average weight,
9 had parents who both smoked, 6 had mothers who smoked but fathers who did not, 12 had
fathers who smoked but mothers who did not, and 23 had parents of whom neither smoked. The
corresponding results for 50 infants of below average weight were 21, 10, 6, and 13, respectively.

(a) Test whether these results are consistent with the hypothesis that birth weight is independent
of parental smoking habits.
(b) Are these data consistent with the hypothesis that, given the smoking habits of the mother,
the smoking habits of the father are not related to birth weight?
CAUSE AND EFFECT

8.1 Introduction

As mentioned in Chapters 1 and 3, many studies are carried out with causal objectives in mind. That
is, we would like to be able to establish or investigate a possible cause and effect relationship between
variables X and Y .
We use the word “causes" often; for example we might say that “gravity causes dropped objects to
fall to the ground", or that “smoking causes lung cancer". The concept of causation (as in “X causes
Y ") is nevertheless hard to define. One reason is that the “strengths" of causal relationships vary a lot.
For example, on earth gravity may always lead to a dropped object falling to the ground; however, not
everyone who smokes gets lung cancer.
Idealized definitions of causation are often of the following form. Let y be a response variate
associated with units in a population or process, and let x be an explanatory variate associated with
some factor that may affect y. Then, if all other factors that affect y are held constant, let us change
x (or observe different values of x) and see if y changes. If it does we say that x has a causal effect
on y.
In fact, this definition is not broad enough, because in many settings a change in x may only lead
to a change in y in some probabilistic sense. For example, giving an individual person at risk of stroke
a small daily dose of aspirin instead of a placebo may not necessarily lower their risk. (Not everyone
is helped by this medication.) However, on average the effect is to lower the risk of stroke. One way
to measure this is by looking at the probability a randomly selected person has a stroke (say within 3
years) if they are given aspirin versus if they are not.
Therefore, a better idealized definition of causation is to say that changing x should result in a
change in some attribute of the random variable Y (for example, its mean or some probability such as
P (Y > 0)). Thus we revise the definition above to say:
if all other factors that affect Y are held constant, let us change x (or observe different values
of x) and see if some specified attribute of Y changes. If it does we say x has a causal effect on Y .
These definitions are unfortunately unusable in most settings since we cannot hold all other factors
that affect y constant; often we don’t even know what all the variables are. However, the definition
serves as a useful ideal for how we should carry out studies in order to show that a causal relationship
exists. What we do is try to design our studies so that alternative (to the variate x) explanations of what
causes changes in attributes of y can be ruled out, leaving x as the causal agent. This is much easier
to do in experimental studies, where explanatory variables may be controlled, than in observational
studies. The following are brief examples.

Example 8.1.1. Recall Example 6.1.3 concerning the (breaking) strength y of a steel bolt and the
diameter x of the bolt. It is clear that bolts with larger diameters tend to have higher strength, and it
seems clear on physical and theoretical grounds that increasing the diameter “causes" an increase in
strength. This can be investigated in experimental studies like that in Example 6.1.3, when random
samples of bolts of different diameters are tested, and their strengths y determined.
Clearly, the value of x does not determine y exactly (different bolts with the same diameter don’t
have the same strength), but we can consider attributes such as the average value of y. In the experiment
we can hold other factors more or less constant (e.g. the ambient temperature, the way the force is
applied; the metallurgical properties of the bolts) so we feel that the observed larger average values of
y for bolts of larger diameter x is due to a causal relationship.
Note that even here we have to depart slightly from the idealized definition of cause and effect. In
particular, a bolt cannot have its diameter x changed, so that we can see if y changes. All we can do
is consider two bolts that are as similar as possible, and are subject to the same explanatory variables
(aside from diameter). This difficulty arises in many experimental studies.

Example 8.1.2. Suppose that data had been collected on 10,000 persons ages 40-80 who had smoked
for at least 20 years, and 10,000 persons in the same age range who had not. There is roughly the
same distribution of ages in the two groups. The (hypothetical) data concerning the numbers with lung
cancer are as follows:

Lung Cancer No Lung Cancer


Smokers 500 9500 (10,000)
Non-Smokers 100 9900 (10,000)

There are many more lung cancer cases among the smokers, but without further information or
assumptions we cannot conclude that a causal relationship (smoking causes lung cancer) exists. Al-
ternative explanations might explain some or all of the observed difference. (This is an observational
study and other possible explanatory variables are not controlled.) For example, family history is an
important factor in many cancers; maybe smoking is also related to family history. Moreover, smoking
tends to be connected with other factors such as diet and alcohol consumption; these may explain some
of the effect seen.
The last example exemplifies that association (statistical dependence) between two variables X
and Y does not imply that a causal relationship exists. Suppose for example that we observe a
positive correlation between X and Y ; higher values of X tend to go with higher values of Y in a
unit. Then there are at least three "explanations": (i) X causes Y (meaning X has a causative effect on
Y), (ii) Y causes X, and (iii) some other factor(s) Z cause both X and Y.
We’ll now consider the question of cause and effect in experimental and observational studies in a
little more detail.

8.2 Experimental Studies


Suppose we want to investigate whether a variate x has a causal effect on a response variate Y . In an
experimental setting we can control the values of x that a unit “sees". In addition, we can use one or
both of the following devices for ruling out alternative explanations for any observed changes in Y that
might be caused by x:

(i) Hold other possible explanatory variables fixed.

(ii) Use randomization to control for other variables.

These are most simply explained via examples.

Example 8.2.1 Blood thinning and the risk of stroke


Suppose 500 persons that are at high risk of stroke have agreed to take part in a clinical trial to
assess whether aspirin lowers the risk of stroke. These persons are representative of a population of
high risk individuals. The study is conducted by giving some persons aspirin and some a placebo, then
comparing the two groups in terms of the number of strokes observed.
Other factors such as age, sex, weight, existence of high blood pressure, and diet also may affect the
risk of stroke. These variables obviously vary substantially across persons and cannot be held constant
or otherwise controlled. However, such studies use randomization in the following way: among the
study subjects, who gets aspirin and who gets a placebo is determined by a random mechanism. For
example, we might flip a coin (or draw a random number from {0, 1}), with one outcome (say Heads)
indicating a person is to be given aspirin, and the other indicating they get the placebo.
The effect of this randomization is to balance the other possible explanatory variables in the two
“treatment" groups (Aspirin and Placebo). Thus, if at the end of the study we observe that 20% of
the Placebo subjects have had a stroke but only 9% of the Aspirin subjects have, then we can attribute
the difference to the causative effect of the aspirin. Here’s how we rule out alternative explanations:
suppose you claim that it's not the aspirin but dietary factors and blood pressure that cause this observed
effect. I respond that the randomization procedure has led to those factors being balanced in the two
treatment groups. That is, the Aspirin group and the Placebo group both have similar variations in
dietary and blood pressure values across the subjects in the group. Thus, a difference in the two groups
should not be due to these factors.

Example 8.2.2. Driving speed and fuel consumption


(Adapted from Stat 230 Course Notes).

It is thought that fuel consumption in automobiles is greater at speeds in excess of 100 km per hour.
(Some years ago during oil shortages, many U.S. states reduced speed limits on freeways because of
this.) A study is planned that will focus on freeway-type driving, because fuel consumption is also
affected by the amount of stopping and starting in town driving, in addition to other factors.
In this case a decision was made to carry out an experimental study at a special paved track owned
by a car company. Obviously a lot of factors besides speed affect fuel consumption: for example,
the type of car and engine, tire condition, fuel grade and the driver. As a result, these factors were
controlled in the study by balancing them across different driving speeds. An experimental plan of the
following type was employed.

  • 8 cars of eight different types were used; each car was used for 8 test drives.

  • The cars were each driven twice for 600 km on the track at each of four speeds: 80, 100, 120 and
    140 km/hr.

  • 8 drivers were involved, each driving each of the 8 cars for one test, and each driving two tests at
    each of the four speeds.

  • The cars had similar initial mileages and were carefully checked and serviced so as to make them
    as comparable as possible; they used comparable fuels.

  • The drivers were instructed to drive steadily for the 600 km. Each was allowed a 30 minute rest
    stop after 300 km.

  • The order in which each driver did his or her 8 test drives was randomized. The track was large
    enough that all 8 drivers could be on it at the same time. (The tests were conducted over 8 days.)

The response variate was the amount of fuel consumed for each test drive. Obviously in the analy-
sis we must deal with the fact that the cars differ in size and engine type, and their fuel consumption
will depend on that as well as on driving speed. A simple approach would be to add the fuel amounts
consumed for the 16 test drives at each speed, and to compare them (other methods are also possible).
Then, for example, we might find that the average consumption (across the 8 cars) at 80, 100, 120 and
140 km/hr were 43.0, 44.1, 45.8 and 47.2 liters, respectively. Statistical methods of testing and
estimation could then be used to test or estimate the differences in average fuel consumption at each of
the four speeds. (Can you think of a way to do this?)

Exercise: Suppose that statistical tests demonstrated a significant difference in consumption across the
four driving speeds, with lower speeds giving lower consumption. What (if any) qualifications would
you have about concluding there is a causal relationship?

8.3 Observational Studies


In observational studies there are often unmeasured factors that affect the response Y . If these factors
are also related to the explanatory variable x whose (potential) causal effect we are trying to assess,
then we cannot easily make any inferences about causation. For this reason, we try in observational
studies to measure other important factors besides x.
For example, Problem 1 at the end of Chapter 7 discusses an observational study on whether rust-
proofing prevents rust. It is clear that an unmeasured factor is the care a car owner takes in looking
after a vehicle; this could quite likely be related to whether a person opts to have their car rust-proofed.
The following example shows how we must take note of measured factors that affect Y .

Example 8.3.1 Suppose that over a five year period, the applications and admissions to graduate studies
in Engineering and Arts faculties in a university are as follows:

                                 No. Applied    No. Admitted    % Admitted
    Engineering    Men              1000             600            60%
                   Women             200             150            75%
    Arts           Men              1000             400            40%
                   Women            1800             800            44%
    Total          Men              2000            1000            50%
                   Women            2000             950            47.5%

We want to see if females have a lower probability of admission than males. If we looked only
at the totals for Engineering plus Arts, then it would appear that the probability a male applicant is
admitted is a little higher than the probability for a female applicant. However, if we look separately at
Arts and Engineering, we see the probability for females being admitted appears higher in each case!
The reason for the reverse direction in the totals is that Engineering has a higher admission rate than
Arts, but the fraction of women applying to Engineering is much lower than for Arts.
In cause and effect language, we would say that the faculty one applies to (i.e. Engineering or Arts)
is a causative factor with respect to probability of admission. Furthermore, it is related to the gender
(M or F) of an applicant, so we cannot ignore it in trying to see if gender is also a causative factor.

Remark: The feature illustrated in the example above is sometimes called Simpson’s Paradox. In
probabilistic terms, it says that for events A, B1, B2 and C1, ..., Ck, we can have

    P(A | B1 Ci) > P(A | B2 Ci)   for each i = 1, ..., k

but have

    P(A | B1) < P(A | B2).

(Note that P(A | B1) = ∑_{i=1}^k P(A | B1 Ci) P(Ci | B1), and similarly for P(A | B2), so they depend
on what P(Ci | B1) and P(Ci | B2) are.) In the example above we can take B1 = {person is female},
B2 = {person is male}, C1 = {person applies to Engineering}, C2 = {person applies to Arts}, and
A = {person is admitted}.
Exercise: Write down estimated probabilities for the various events based on Example 8.3.1, and so
illustrate Simpson’s paradox.
Epidemiologists (specialists in the study of disease) have developed guidelines or criteria which
should be met in order to argue that a causal association exists between a risk factor x and a disease
(represented by a response variable Y = I(person has the disease), for example). These include

  • the need to account for other possible risk factors and to demonstrate that x and Y are consistently
    related when these factors vary;

  • the demonstration that the association between x and Y holds in different types of settings;

  • the existence of a plausible scientific explanation.

Similar criteria apply to other areas.

8.4 Problems
1. In an Ontario study, 50267 live births were classified according to the baby’s weight (less than or
greater than 2.5 kg.) and according to the mother’s smoking habits (non-smoker, 1-20 cigarettes
per day, or more than 20 cigarettes per day). The results were as follows:
                          No. of cigarettes
                         0        1-20      > 20

    Weight ≤ 2.5      1322       1186        793
    Weight > 2.5     27036      14142       5788

(a) Test the hypothesis that birth weight is independent of the mother’s smoking habits.

(b) Explain why it is that these results do not prove that birth weights would increase if mothers
stopped smoking during pregnancy. How should a study to obtain such proof be designed?

(c) A similar, though weaker, association exists between birth weight and the amount smoked
by the father. Explain why this is to be expected even if the father’s smoking habits are
irrelevant.

2. One hundred and fifty Statistics students took part in a study to evaluate computer-assisted in-
struction (CAI). Seventy-five received the standard lecture course while the other 75 received
some CAI. All 150 students then wrote the same examination. Fifteen students in the standard
course and 29 of those in the CAI group received a mark over 80%.

(a) Are these results consistent with the hypothesis that the probability of achieving a mark
over 80% is the same for both groups?

(b) Based on these results, the instructor concluded that CAI increases the chances of a mark
over 80%. How should the study have been carried out in order for this conclusion to be
valid?

3. (a) The following data were collected some years ago in a study of possible sex bias in graduate
admissions at a large university:
Admitted Not admitted
Male applicants 3738 4704
Female applicants 1494 2827
Test the hypothesis that admission status is independent of sex. Do these data indicate a
lower admission rate for females?

(b) The following table shows the numbers of male and female applicants and the percentages
admitted for the six largest graduate programs in (a):
Men Women
Program Applicants % Admitted Applicants % Admitted

A 825 62 108 82
B 560 63 25 68
C 325 37 593 34
D 417 33 375 35
E 191 28 393 24
F 373 6 341 7

Test the independence of admission status and sex for each program. Do any of the pro-
grams show evidence of a bias against female applicants?
(c) Why is it that the totals in (a) seem to indicate a bias against women, but the results for
individual programs in (b) do not?

4. To assess the (presumed) beneficial effects of rust-proofing cars, a manufacturer randomly se-
lected 200 cars that were sold 5 years earlier and were still used by the original buyers. One
hundred cars were selected from purchases where the rust-proofing option package was included,
and one hundred from purchases where it was not (and where the buyer did not subsequently get
the car rust-proofed by a third party).
The amount of rust on the vehicles was measured on a scale in which the responses Y are assumed
roughly Gaussian, as follows:

    1. Rust-proofed cars:      Y ~ G(μ1, σ)
    2. Non-rust-proofed cars:  Y ~ G(μ2, σ)

    Sample means and standard deviations from the two sets of cars were found to be (higher y means
    more rust):
    1. ȳ1 = 11.7,  s1 = 2.1
    2. ȳ2 = 12.0,  s2 = 2.4

(a) Test the hypothesis that there is no difference in μ1 and μ2.

(b) The manufacturer was surprised to find that the data did not show a beneficial effect of
rust-proofing. Describe problems with their study and outline how you might carry out a
study designed to demonstrate a causal effect of rust-proofing.

5. In randomized clinical trials that compare two (or more) medical treatments it is customary not to
let either the subject or their physician know which treatment they have been randomly assigned.
(These are referred to as double blind studies.)
Discuss why not doing this might not be a good idea in a causative study (i.e. a study where you
want to assess the causative effect of one or more treatments).

6. Public health researchers want to study whether specifically designed educational programs about
the effects of cigarette smoking have the effect of discouraging people from smoking. One
particular program is delivered to students in grade 9, with follow-up in grade 11 to determine each
student's smoking "history". Briefly discuss some factors you'd want to consider in designing
such a study, and how you might address them.
References and Supplementary
Resources

9.1 References
R.J. Mackay and R.W. Oldford (2001). Statistics 231: Empirical Problem Solving (Stat 231 Course
Notes)

C.J. Wild and G.A.F. Seber (1999). Chance Encounters: A First Course in Data Analysis and Infer-
ence. John Wiley and Sons, New York.

J. Utts (2003). What Educated Citizens Should Know About Statistics and Probability. The American
Statistician 57, 74-79.

9.2 Departmental Web Resources


Videos on sections: see www.watstat.ca

Summary of Distributions

Discrete r.v. Y

Binomial(n, p),  0 < p < 1, q = 1 − p
    f(y) = (n choose y) p^y q^{n−y},   y = 0, 1, 2, ..., n
    Mean: np        Variance: npq        M_Y(t) = (pe^t + q)^n

Bernoulli(p),  0 < p < 1, q = 1 − p
    f(y) = p^y (1 − p)^{1−y},   y = 0, 1
    Mean: p         Variance: p(1 − p)   M_Y(t) = pe^t + q

Negative Binomial(k, p),  0 < p < 1, q = 1 − p
    f(y) = (y+k−1 choose y) p^k q^y,   y = 0, 1, 2, ...
    Mean: kq/p      Variance: kq/p²      M_Y(t) = [p/(1 − qe^t)]^k

Geometric(p),  0 < p < 1, q = 1 − p
    f(y) = p q^y,   y = 0, 1, 2, ...
    Mean: q/p       Variance: q/p²       M_Y(t) = p/(1 − qe^t)

Hypergeometric(N, r, n),  r < N, n < N
    f(y) = (r choose y)(N−r choose n−y) / (N choose n),   y = 0, 1, 2, ..., min(r, n)
    Mean: nr/N      Variance: n(r/N)(1 − r/N)(N − n)/(N − 1)      M_Y(t): intractable

Poisson(μ),  μ > 0
    f(y) = e^{−μ} μ^y / y!,   y = 0, 1, ...
    Mean: μ         Variance: μ          M_Y(t) = e^{μ(e^t − 1)}


Continuous r.v. Y

Uniform(a, b)
    f(y) = 1/(b − a),   a < y < b
    Mean: (a + b)/2      Variance: (b − a)²/12      M_Y(t) = (e^{bt} − e^{at}) / ((b − a)t)

Exponential(θ),  θ > 0
    f(y) = (1/θ) e^{−y/θ},   0 < y
    Mean: θ              Variance: θ²               M_Y(t) = 1/(1 − θt),  t < 1/θ

Normal(μ, σ²) or Gaussian(μ, σ),  −∞ < μ < ∞, σ > 0
    f(y) = (1/(√(2π) σ)) e^{−(y−μ)²/(2σ²)},   −∞ < y < ∞
    Mean: μ              Variance: σ²               M_Y(t) = e^{μt + σ²t²/2}

Chi-squared(r),  d.f. r > 0
    f(y) = 1/(2^{r/2} Γ(r/2)) y^{(r/2)−1} e^{−y/2},   0 < y < ∞,
        where Γ(a) = ∫_0^∞ x^{a−1} e^{−x} dx
    Mean: r              Variance: 2r               M_Y(t) = (1 − 2t)^{−r/2}

Student t,  d.f. ν > 0
    f(y) = k_ν (1 + y²/ν)^{−(ν+1)/2},   −∞ < y < ∞,
        where k_ν = Γ((ν + 1)/2) / (√(νπ) Γ(ν/2))
    Mean: 0 if ν > 1 (otherwise undefined)      Variance: ν/(ν − 2) if ν > 2      M_Y(t): undefined
Statistical Tables

11.1 Probabilities For the Standard Normal Distribution N(0, 1)


0.4
0.35
0.3
φ(z)

0.25
0.2
0.15
0.1
F(x)
0.05
0
-4 -3 -2
z -1 0 x
1 2 3 4

Rx
The table gives the value of F (x) = 1 '(z)dz for x 0:

178
11.2 Chi-Squared Cumulative Distribution function

11.3 Student t Quantiles

APPENDIX. ANSWERS TO SELECTED
PROBLEMS

Chapter 1

1. (b) .032 (c) .003 (.002 using Gaussian approx.)

2. (c) p1 = .489, p2 = .325, p3 = 0.151, p4 = .035

3. (b) .003 and .133 (d) y0 = 124.3

4. (a) .933 (b) .020 (c) .949 and .117 (d) 4.56

5. (a) .9745

7. (a) E(R) = 1 + 2(n − 1)p(1 − p)
   (b) Var(R) = 2(n − 1)p(1 − p)[1 − 2p(1 − p)] + 2(n − 2)p(1 − p)(1 − 2p)²
   (c) E(R) = 50.5, Var(R) = 24.75 and P(R ≤ 20) < 10⁻⁶

Chapter 2

1. (a) 4.1 (b) .000275

2. (a) .10 (b) n = 140

3. (2x1 + x2)/n

4. (b) .28
6. (a) θ̂ = [(f0 + 3T) − ((f0 + 3T)² − 8T²)^{1/2}] / (4T), where T = ∑ k fk
   (b) c = (1 − θ)²/θ
   (c) θ̂ = 0.195; P̂(X = 0) = .758
   (d) θ̂ = .5

7. θ̂ = ∑ yi / ∑ ti

9. (a) the two m.l.e.'s are .35 and .42
   (b) 14.7, 20.3, 27.3 and 37.7

Chapter 4

1. (a) $\hat\mu = 1.744$, $\hat\sigma = .0664$ (M); $\hat\mu = 1.618$, $\hat\sigma = .0636$ (F)
   (b) 1.659 and 1.829 (M); 1.536 and 1.670 (F)
   (c) .098 (M) and .0004 (F)
   (d) 11/150 = .073 (M); 0 (F)

2. (c) 0.1414 and 0.1768, respectively

3. (b) $n = 1024$

7. (b) $\hat\theta = 1 - (x/n)^{1/k}$ (c) $\hat\theta = .0116$; interval approximately (.0056, .0207)

8. (a) $0 \le \theta \le .548$ (b) the .10 likelihood interval is now $0.209 \le \theta \le .490$

10. (a) $\hat\lambda = 3n/\sum t_i$ (b) $\hat\lambda = .06024$; $.0450 \le \lambda \le .0785$
    (c) the .95 CI for $\lambda$ is (.0463, .0768) and for $\mu$ is $39.1 \le \mu \le 64.8$
    (d) the CI's are $.408 \le p \le .738$ (using the model) and $0.287 \le p \le .794$ (using the binomial). The
    binomial model involves fewer assumptions but gives a less precise (wider) interval.
    (Note: the first CI can be obtained directly from the CI for $\lambda$ in part (c).)

12. (a) $\hat\theta = 380$ days; the CI is $285.5 \le \theta \le 521.3$
    (b) $197.9 \le m \le 361.3$

13. (b) $288.3 \le \theta \le 527.9$

14. (a) $.637 \le p \le .764$

Chapter 5

1. $SL = P(D \ge 15) = P(Y \ge 25; \mu = 10) = .000047$

4. (a) LR statistic gives $\Lambda_{obs} = .0885$ and $SL = .76$.

5. (a) $\Lambda_{obs} = 23.605$ and $SL = .005$
   (b) $SL = 1 - .9956 = .03$ now

6. $\Lambda_{obs} = .042$ and $SL = .84$. There is no evidence against the model.

9. (c) LR statistic gives $\Lambda_{obs} = 3.73$ and $SL = P(\chi^2_{(4)} \ge 3.73) = .44$. There is no evidence that
   the rates are not equal.

Chapter 6

3. (a) $43.28 \le \mu \le 45.53$ (b) $1.82 \le \sigma \le 3.50$

4. (a) $D_{obs} = 9s^2/.02^2 = 1.2$ and $SL = 2P(\chi^2_{(9)} \le 1.2) = .0024$, so there is strong evidence
   against $H: \sigma = .02$
   (b) No: testing $H: \mu = 13.75$ gives $SL < .001$
   (c) $13.690 \le \mu \le 13.700$ and $.0050 \le \sigma \le .0132$

5. (a) $296.91 \le \mu \le 303.47$; $4.55 \le \sigma \le 9.53$
   (b) $286.7 \le X \le 313.7$

7. $.75 \le \mu \le 11.25$ where $\mu = \mu_1 - \mu_2$

8. (a) $0.64 \le \mu_1 - \mu_2 \le 7.24$
   (b) $SL = .05$ (c) $SL = .07$

9. (a) the LR test gives $SL = .4$
   (b) $-.011 \le \mu_1 - \mu_2 \le .557$

12. (a) the confidence interval is (0.23, 2.38) (b) (8.77, 10.92)

18. (a) the parameter estimates are 0.9935 and 0.0866, with $s = 0.2694$. Confidence intervals are
    (0.978, 1.009) for the first parameter and (0.182, 0.516) for $\sigma$

19. (b) the estimates are .02087 and 1.022, with $s = .008389$
    (c) the .95 prediction interval for $Y$ ($= \log P$) is $3.030 \le Y \le 3.065$, so the PI for $P$ is $20.70 \le P \le 21.43$

Chapter 7

1. (a) LR statistic gives $\Lambda_{obs} = 8.17$ and Pearson statistic $D_{obs} = 8.05$. The SL is about .004 in
   each case, so there is strong evidence against $H$.

2. LR statistic gives $\Lambda_{obs} = 5.70$ and Pearson statistic $D_{obs} = 5.64$. The SL is about .017 in each
   case.

5. (a) LR statistics for $n = 2, 3, 4$ are 1.11, 4.22, 1.36. The SL's are $P(\chi^2_{(1)} \ge 1.11) = 0.29$,
   $P(\chi^2_{(2)} \ge 4.22) = 0.12$, and $P(\chi^2_{(3)} \ge 1.36) = .71$, respectively.
   (b) LR statistic is 7.54 and $SL = P(\chi^2_{(3)} \ge 7.54) = .057$.

7. The LR statistic is 7.32 and $SL = P(\chi^2_{(2)} \ge 7.32) = .026$, so there is evidence against independence
   and in favour of an association.

8. LR statistic is 3.13 and $SL = P(\chi^2_{(4)} \ge 3.13) = .54$. There is no evidence against independence.

9. (a) LR statistic gives $\Lambda_{obs} = 0.57$ and $SL = P(\chi^2_{(3)} \ge .57) = .90$, so there is no evidence of
   association.
   (b) LR statistic gives $\Lambda_{obs} = 5.44$ and $SL = P(\chi^2_{(3)} \ge 5.44) = 0.14$. There is no evidence
   against the binomial model.

10. (a) $\Lambda_{obs} = 10.8$ and $SL = P(\chi^2_{(3)} \ge 10.8) = .013$.

Chapter 8

1. (a) LR statistic is 480.65 so the SL is almost zero; there is very strong evidence against independence.

3. (a) LR statistic gives $\Lambda_{obs} = 112$ and $SL = 0$
   (b) Only Program B shows any evidence of non-independence, and that is in the direction of a
   lower admission rate for males.
APPENDIX: DATA

APPENDIX: A Short Review of
Probability

In this appendix we collect some basic definitions and results from probability. For further
detail, refer to the Statistics 230 notes.
A discrete probability model consists of a sample space $S$, a set of events (subsets of the sample
space) to which we can assign probabilities, and a mechanism for assigning these probabilities. The
probability of an arbitrary event $A$ can be determined by summing the probabilities of the simple events
in $A$, and so we have the following rules:

Rule 1 $P(S) = 1$.
Proof: $P(S) = \sum_{a \in S} P(a) = \sum_{\text{all } a} P(a) = 1$.

Rule 2 For any event $A$, $0 \le P(A) \le 1$.
Proof: $P(A) = \sum_{a \in A} P(a) \le \sum_{a \in S} P(a) = 1$, and since each $P(a) \ge 0$, we have
$0 \le P(A) \le 1$.

Rule 3 If $A$ and $B$ are two events with $A \subseteq B$ (that is, all of the points in $A$ are also in $B$), then
$P(A) \le P(B)$.
Proof: $P(A) = \sum_{a \in A} P(a) \le \sum_{a \in B} P(a) = P(B)$.

Rule 4a (probability of the union of two events) $P(A \cup B) = P(A) + P(B) - P(AB)$

Rule 4b (probability of the union of three events) By a similar argument, we have

$P(A \cup B \cup C) = P(A) + P(B) + P(C) - P(AB) - P(AC) - P(BC) + P(ABC)$

Definition 1 Events $A$ and $B$ are mutually exclusive if $AB = \varnothing$ (the empty event). In that case
$P(A \cup B) = P(A) + P(B)$, a consequence of Rule 4a and the fact that $P(AB) = P(\varnothing) = 0$.


Rule 5 In general, let $A_1, A_2, \ldots$ be finitely or countably many mutually exclusive events. Then

$P(A_1 \cup A_2 \cup \cdots) = \sum_{i=1}^{\infty} P(A_i).$

Rule 6 (probability of complements) If $\bar{A}$ denotes the complement of the event $A$, i.e. all of the points
in the sample space that are not in $A$, then $P(\bar{A}) = 1 - P(A)$.

Definition 2 Events A and B are independent if and only if P (AB) = P (A)P (B). If they are not
independent, we call the events dependent.

For a finite or countably infinite set of events, we call them independent if all finite subsets are
independent, i.e.

Definition 3 The events $A_1, A_2, A_3, \ldots$ are independent if and only if

$P(A_{i_1} A_{i_2} \cdots A_{i_k}) = P(A_{i_1}) P(A_{i_2}) \cdots P(A_{i_k})$

for all sets $(i_1, i_2, \ldots, i_k)$ of distinct subscripts chosen from $(1, 2, 3, \ldots)$.$^{17}$

Definition 4 The conditional probability of event $A$, given event $B$, is

$P(A|B) = \frac{P(AB)}{P(B)}$, provided $P(B) \ne 0$.
Product rules. Let $A, B, C, D, \ldots$ be arbitrary events in a sample space. Assume that
$P(A) > 0$, $P(AB) > 0$, and $P(ABC) > 0$. Then

$P(AB) = P(A)P(B|A)$
$P(ABC) = P(A)P(B|A)P(C|AB)$
$P(ABCD) = P(A)P(B|A)P(C|AB)P(D|ABC)$

and so on.
Partition Rule. Let $A_1, A_2, A_3, \ldots$ be a partition of the sample space $S$ into disjoint (mutually
exclusive) events, that is

$A_1 \cup A_2 \cup A_3 \cup \cdots = S$ and $A_i \cap A_j = \varnothing$ if $i \ne j.$

Let $B$ be an arbitrary event in $S$. Then

$P(B) = \sum_{i=1}^{\infty} P(B|A_i)P(A_i)$
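As a quick numerical illustration of the partition rule (the numbers here are hypothetical, not from the course data): suppose an item is produced on one of two machines, with $P(A_1) = 0.6$ and $P(A_2) = 0.4$, and the probability an item is defective is $P(B|A_1) = 0.02$ for machine 1 and $P(B|A_2) = 0.05$ for machine 2. Then $P(B) = (0.02)(0.6) + (0.05)(0.4) = 0.032$.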
17. We need all subsets so that events are independent of combinations of other events. For example, if $A_1$ is independent of
$A_2$, and $A_4$ is to be independent of $A_1 A_2$, then $P(A_1 A_2 A_4) = P(A_1 A_2)P(A_4) = P(A_1)P(A_2)P(A_4)$.

Definition 5 A random variable is a function that assigns a real number to each point in a sample
space S.

Discrete random variables take integer values or, more generally, values in a countable set (recall
that a set is countable if its elements can be placed in a one-one correspondence with a subset of the
positive integers).
Continuous random variables take values in some interval of real numbers such as $(0, 1)$ or $(0, \infty)$ or
$(-\infty, \infty)$. Note that the set of real numbers in an interval is NOT countable.

Definition 6 The probability function (p.f.) of a discrete random variable $Y$ is the function

$f(y) = P(Y = y)$, defined for all $y \in A$.

The set of pairs $\{(y, f(y)) : y \in A\}$ is called the probability distribution of $Y$. All probability
functions must have two properties:

1. $f(y) \ge 0$ for all values of $y$ (i.e. for $y \in A$)

2. $\sum_{\text{all } y \in A} f(y) = 1$

Definition 7 The cumulative distribution function (c.d.f.) of the random variable $Y$ is the function,
usually denoted by $F(y)$,
$F(y) = P(Y \le y),$
defined for all real numbers $y$.

Note that for discrete random variables we can obtain the c.d.f. by summing the values of the
probability function: $F(y) = P(Y \le y) = \sum_{u \le y} f(u)$. In general, cumulative distribution functions have
the properties:

1. $F(y)$ is a non-decreasing function of $y$.

2. $0 \le F(y) \le 1$ for all $y$.

3. $\lim_{y \to -\infty} F(y) = 0$ and $\lim_{y \to \infty} F(y) = 1$.

Discrete random variables have jumps in the cumulative distribution function $F(y)$, and indeed
the probability of a given value is the size of the jump at that point, i.e. $\lim_{h \downarrow 0}(F(y) - F(y-h)) = P(Y = y)$.
Continuous random variables, however, have continuous and indeed differentiable (except
possibly at finitely many points) cumulative distribution functions $F(y)$, and the derivative
$F'(y) = f(y)$ is referred to as the probability density function (p.d.f.), a function
$f$ with the property that $P(a \le Y \le b) = P(a < Y < b) = \int_a^b f(z)\,dz$ for all real numbers $a < b$. It
follows that

1. $\int_{-\infty}^{\infty} f(y)\,dy = \int_{\text{all } y} f(y)\,dy = 1$ (this is because $P(-\infty < Y < \infty) = 1$), and

2. $F(y) = \int_{-\infty}^{y} f(u)\,du$.

13.2 Some Basic Distributions

13.2.1 Discrete Distributions

Hypergeometric Distribution

Suppose we have a collection of $N$ objects which can be classified into two distinct types. Call one type
"success" ($S$) and the other type "failure" ($F$). The population contains a total of $r$ successes and
$N - r$ failures. Pick $n$ objects at random without replacement, and let $Y$ be the number of successes
obtained. Then $Y$ has a hypergeometric distribution, with probability function (entry 2 in the table below)

$f(y) = \frac{\binom{r}{y}\binom{N-r}{n-y}}{\binom{N}{n}}.$

The range of possible values of $Y$ is implicit since, for example, for integers $y > r$, expressions such as
$\binom{r}{y}$ take the value $0$.
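In R these probabilities are given by dhyper, whose arguments are the number of successes in the sample, the numbers of successes and failures in the population, and the sample size. For example, with hypothetical values $N = 20$, $r = 8$ and $n = 5$:

    N <- 20; r <- 8; n <- 5
    dhyper(0:5, r, N - r, n)        # P(Y = y) for y = 0, 1, ..., 5
    sum(dhyper(0:5, r, N - r, n))   # the probabilities sum to 1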

Binomial Distribution

Suppose an "experiment" has two types of distinct outcomes, "success" ($S$) and "failure" ($F$), and let
their probabilities be $p$ (for $S$) and $1-p$ (for $F$). Repeat the experiment $n$ independent times, and let $Y$
be the number of successes obtained. Then the discrete random variable $Y$ has a binomial distribution
(write $Y \sim \mathrm{Bin}(n, p)$) with probability function $f(y) = \binom{n}{y} p^y (1-p)^{n-y}$, $y = 0, 1, 2, \ldots, n$.

Poisson Distribution

The discrete random variable $Y$ has a Poisson distribution if its p.f. is of the form

$f(y; \mu) = \frac{e^{-\mu}\mu^y}{y!}$ for $y = 0, 1, 2, \ldots$

where $\mu$ is a parameter with $\mu > 0$. To indicate this we write $Y \sim \mathrm{Poisson}(\mu)$. It can be shown that
$E(Y) = \mu$ for this model. The Poisson distribution arises in settings where $Y$ represents the number of
random events of some kind that occur over a fixed period of time, for example, the number of arrivals
in a queue or the number of hits on a web site in a one-hour period. For this model to be suitable the
events must occur completely randomly in time. The Poisson distribution is also used to describe the
random occurrence of events in space.
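As a small illustration (with a hypothetical mean of 12 hits per hour), dpois and ppois give Poisson probabilities in R:

    mu <- 12
    dpois(10, mu)       # P(Y = 10)
    1 - ppois(19, mu)   # P(Y >= 20)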

Summary of Single Variable Discrete Models

Name and probability function:

1. Discrete Uniform: $f(y) = \frac{1}{b-a+1}$, $y = a, a+1, a+2, \ldots, b$

2. Hypergeometric: $f(y) = \frac{\binom{r}{y}\binom{N-r}{n-y}}{\binom{N}{n}}$, $y = \max(0, n-(N-r)), \ldots, \min(n, r)$

3. Binomial: $f(y) = \binom{n}{y} p^y (1-p)^{n-y}$, $y = 0, 1, 2, \ldots, n$

4. Negative Binomial: $f(y) = \binom{y+k-1}{y} p^k (1-p)^y$, $y = 0, 1, 2, \ldots$

5. Geometric: $f(y) = p(1-p)^y$, $y = 0, 1, 2, \ldots$

6. Poisson: $f(y) = \frac{e^{-\mu}\mu^y}{y!}$, $y = 0, 1, 2, \ldots$

13.2.2 Continuous Distributions

Continuous Uniform Distribution

The probability density function is constant over an interval: $f(y) = k$, $a \le y \le b$, for some constant
$k$. So that $\int_a^b f(y)\,dy = 1$, we require $k = \frac{1}{b-a}$, and the probability density function is $f(y) = \frac{1}{b-a}$
for $a \le y \le b$. The single most important case is $a = 0$, $b = 1$.

Exponential Distribution

The continuous random variable $Y$ has an exponential distribution if its probability density function is
of the form

$f(y; \theta) = \frac{1}{\theta} e^{-y/\theta}$ for $y > 0$

where $\theta$ is a parameter with $\theta > 0$. To indicate this we write $Y \sim \mathrm{Exp}(\theta)$. It can be shown that
$E(Y) = \theta$. (Be careful not to confuse the notation $Y \sim \mathrm{Exp}(\theta)$, which specifies the exponential
distribution for the random variable $Y$, with a common notation for the exponential function.$^{18}$) This
distribution arises in various problems involving the time until some event occurs: in a Poisson process
for events in time, if $Y$ is the length of time we wait for the first event occurrence, then $Y$ has an
exponential distribution. Integrating the probability density function, the c.d.f. is $F(y) = 1 - e^{-y/\theta}$
for $y > 0$, and $F(y) = 0$ otherwise.

18. Sometimes $e^x$ is denoted by $\exp(x)$.
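Note that R's exponential functions are parameterized by the rate $1/\theta$ rather than the mean $\theta$; a quick numerical check of the c.d.f. above, using a hypothetical value $\theta = 5$:

    theta <- 5; y <- 2
    pexp(y, rate = 1/theta)   # F(y) computed by R
    1 - exp(-y/theta)         # F(y) = 1 - e^(-y/theta) from the formula above; the two agree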

Normal (Gaussian) Distribution

A random variable $Y$ defined on $(-\infty, \infty)$ has a normal$^{19}$ distribution if it has probability density
function of the form

$f(y) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2}\left(\frac{y-\mu}{\sigma}\right)^2}, \quad -\infty < y < \infty,$

where $-\infty < \mu < \infty$ and $\sigma > 0$ are parameters. It turns out that $E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2$. We
write

$Y \sim N(\mu, \sigma^2)$ or $Y \sim G(\mu, \sigma)$

to denote that $Y$ has a normal distribution with mean $\mu$ and variance $\sigma^2$ (standard deviation $\sigma$). The
normal distribution is the most widely used distribution in probability and statistics. The probability
density function $f(y)$ has the familiar "bell shape", symmetric about $\mu$; the standard normal ($N(0,1)$)
density is shown in Figure 13.25. (You should be able to verify the shape without graphing the function.)

19. "The only normal people are the ones you don't know very well." Joe Ancis
[Figure 13.25: The standard normal ($N(0, 1)$) probability density function.]

13.3 Moments

Definition 8 Suppose the discrete random variable $Y$ has probability function $f(y)$. Then we can
define the expected value of some function $g(Y)$ of $Y$ as $E[g(Y)] = \sum_{\text{all } y} g(y) f(y)$.

Definition 9 When $Y$ is continuous with probability density function $f(y)$, we can define

$E[g(Y)] = \int_{\text{all } y} g(y) f(y)\,dy.$

For constants $a$ and $b$ and two functions $g_1$ and $g_2$, it is easy to show

$E[a g_1(Y) + b g_2(Y)] = a E[g_1(Y)] + b E[g_2(Y)]$

Definition 10 The mean or expected value of a random variable $Y$ is $\mu = E(Y)$. The variance of a r.v.
$Y$ is $E\left[(Y-\mu)^2\right]$, and is denoted by $\sigma^2$ or by $\mathrm{Var}(Y)$. The standard deviation of a random variable
$Y$ is $\sigma = \sqrt{E\left[(Y-\mu)^2\right]}$.

13.3.1 Moment Generating Functions


Let $Y$ be a random variable. The moment generating function (m.g.f.) of $Y$ is defined as

$M(t) = E(e^{tY}),$

assuming that this expectation exists for all $t$ in some open interval $(-a, a)$ of real numbers ($a > 0$). If $Y$
is continuous with probability density function $f(y)$ then

$M(t) = \int_{-\infty}^{\infty} e^{ty} f(y)\,dy \qquad (4.2.1)$

and if $Y$ is discrete with p.f. $f(y)$ then

$M(t) = \sum_{y} e^{ty} f(y) \qquad (4.2.2)$

where the integral and sum in (4.2.1) and (4.2.2) are over the range of $Y$.
The m.g.f. is a transform, that is, a function $M(t)$ that is obtained from the function $f(y)$. Not all
probability distributions have m.g.f.'s (since (4.2.1) or (4.2.2) may not exist in some cases), but if the
m.g.f. exists, it uniquely determines the distribution. That is, an m.g.f. $M(t)$ can arise from only one
function $f(y)$.

Example 4.2.1. Binomial Distribution


The m.g.f.$^{20}$ for the distribution $\mathrm{Bin}(n, p)$ is $M(t) = (pe^t + 1-p)^n$.
We now give some simple theorems about m.g.f.'s, with applications. Moment generating functions
are so-called because from them we can derive the moments of a random variable, as follows.

Theorem 4.2.1$^{21}$ Let the random variable $Y$ have m.g.f. $M(t)$. Then for $r = 1, 2, 3, \ldots$

$E(Y^r) = \frac{d^r M(t)}{dt^r}\bigg|_{t=0} = M^{(r)}(0) = \text{coefficient of } \frac{t^r}{r!} \text{ in the power series representation of } M(t).$
20. $M(t) = \sum_{y=0}^{n} e^{ty}\binom{n}{y} p^y (1-p)^{n-y} = \sum_{y=0}^{n} \binom{n}{y} (pe^t)^y (1-p)^{n-y} = (pe^t + 1-p)^n$ by the binomial theorem.
21. Proof: For simplicity consider a continuous r.v. $Y$. Then $M^{(r)}(t) = \frac{d^r}{dt^r}\int_{-\infty}^{\infty} e^{ty} f(y)\,dy = \int_{-\infty}^{\infty} \frac{d^r(e^{ty})}{dt^r} f(y)\,dy = \int_{-\infty}^{\infty} y^r e^{ty} f(y)\,dy$. Therefore $M^{(r)}(0) = \int_{-\infty}^{\infty} y^r f(y)\,dy = E(Y^r)$.

Example 4.2.2$^{22}$ Mean and Variance of the Binomial Distribution

For $Y \sim \mathrm{Bin}(n, p)$, $E(Y) = np$ and $\mathrm{Var}(Y) = np(1-p)$.

Theorem 4.2.2$^{23}$ Let $X$ and $Y$ be random variables related by the fact that $Y = aX + b$, where $a$ and
$b$ are constants. Then

$M_Y(t) = e^{bt} M_X(at), \qquad (4.2.3)$

where $M_X(t)$ is the m.g.f. for $X$ and $M_Y(t)$ is the m.g.f. for $Y$.
Example 4.2.3 MGF's for Gaussian Distributions

The moment generating function of $Y \sim G(\mu, \sigma)$ is given by

$M_Y(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2} \qquad (4.2.4)$

Proof. First consider $Z \sim G(0, 1)$. If we can find $M_Z(t)$ then we can get $M_Y(t)$, where $Y \sim G(\mu, \sigma)$,
from it. This is because $Z = (Y - \mu)/\sigma \sim G(0, 1)$, or $Y = \sigma Z + \mu$. Thus

$M_Y(t) = e^{\mu t} M_Z(\sigma t) \qquad (4.2.5)$

by Theorem 4.2.2. To find $M_Z(t)$, we must evaluate

$M_Z(t) = \int_{-\infty}^{\infty} e^{tz} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}z^2}\,dz.$

This can be obtained by "completing the square" in the exponent of the integrand: since

$tz - \frac{1}{2}z^2 = -\frac{1}{2}\left[(z - t)^2 - t^2\right]$

we get

$M_Z(t) = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}\left[(z-t)^2 - t^2\right]}\,dz = e^{\frac{1}{2}t^2} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{1}{2}(z-t)^2}\,dz = e^{\frac{1}{2}t^2},$

since the integral is the integral of the probability density function for $G(t, 1)$ and therefore equals 1.
Now, using (4.2.5), we get

$M_Y(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2} \qquad (4.2.6)$

as the m.g.f. for $Y \sim G(\mu, \sigma)$.

Exercise: Verify using (4.2.4) that $E(Y) = \mu$ and $\mathrm{Var}(Y) = \sigma^2$.

22. Proof: we have $M(t) = (pe^t + 1-p)^n$. This gives $M^{(1)}(t) = npe^t(pe^t + 1-p)^{n-1}$ and $M^{(2)}(t) = npe^t(pe^t + 1-p)^{n-1} + n(n-1)(pe^t)^2(pe^t + 1-p)^{n-2}$. Therefore $E(Y) = M^{(1)}(0) = np$ and $E(Y^2) = M^{(2)}(0) = np + n(n-1)p^2$. Finally $\mathrm{Var}(Y) = E(Y^2) - [E(Y)]^2 = np(1-p)$.
23. Proof: $M_Y(t) = E(e^{tY}) = E(e^{t(aX+b)}) = E(e^{bt} e^{(at)X}) = e^{bt} E(e^{(at)X}) = e^{bt} M_X(at)$.

Sums of Independent Random Variables. We often encounter sums or linear combinations of independent random variables, and m.g.f.'s are useful in dealing with them.

Theorem 4.2.3 Let $Y_1, \ldots, Y_k$ be mutually independent random variables with m.g.f.'s $M_1(t), \ldots, M_k(t)$.
Then the m.g.f. of

$S = \sum_{i=1}^{k} Y_i$

is

$M_S(t) = \prod_{i=1}^{k} M_i(t).$

Proof.

$M_S(t) = E(e^{tS}) = E\left(e^{t\sum Y_i}\right) = E\left(\prod_{i=1}^{k} e^{tY_i}\right) = \prod_{i=1}^{k} E\left(e^{tY_i}\right) \;\text{ (since the $Y_i$'s are independent)} \;= \prod_{i=1}^{k} M_i(t).$

This allows us to prove a fundamental property of Gaussian random variables: if we take linear
combinations (such as averages or sums) of independent Gaussian random variables, the result is also
a Gaussian random variable. This is one reason the Gaussian distribution is so popular. For example,
in finance it is a great convenience that if the returns from the individual investments in a portfolio are
each normally distributed, then so too is the total return from the portfolio.
Theorem 4.2.4 Let $Y_1, \ldots, Y_k$ be independent random variables with $Y_i \sim G(\mu_i, \sigma_i)$. Then if
$a_1, \ldots, a_k$ are constants, the distribution of

$W = \sum_{i=1}^{k} a_i Y_i$

is $G\left(\sum_{i=1}^{k} a_i \mu_i,\; \left(\sum_{i=1}^{k} a_i^2 \sigma_i^2\right)^{1/2}\right)$.

Proof. By Theorems 4.2.2 and 4.2.3 we have the m.g.f. of $a_i Y_i$ as $M_{a_i Y_i}(t) = M_{Y_i}(a_i t) = \exp\left(a_i \mu_i t + \frac{1}{2} a_i^2 \sigma_i^2 t^2\right)$.

Thus

$M_W(t) = \prod_{i=1}^{k} M_{a_i Y_i}(t) = \exp\left[\left(\sum_{i=1}^{k} a_i \mu_i\right) t + \frac{1}{2}\left(\sum_{i=1}^{k} a_i^2 \sigma_i^2\right) t^2\right]$

This is the m.g.f. for a $G\left(\sum_{i=1}^{k} a_i \mu_i,\; \left(\sum_{i=1}^{k} a_i^2 \sigma_i^2\right)^{1/2}\right)$ random variable, which proves the theorem.

Corollary$^{24}$ Suppose that $Y_1, \ldots, Y_n$ are independent random variables with $Y_i \sim G(\mu, \sigma)$. If $\bar{Y} = \sum Y_i / n$, then

$\bar{Y} \sim G(\mu, \sigma/\sqrt{n}).$
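A small simulation (with arbitrary illustrative values $\mu = 10$, $\sigma = 2$, $n = 25$) is a useful sanity check of this corollary:

    set.seed(231)
    mu <- 10; sigma <- 2; n <- 25
    ybar <- replicate(10000, mean(rnorm(n, mu, sigma)))   # 10000 simulated sample means
    mean(ybar)   # close to mu = 10
    sd(ybar)     # close to sigma/sqrt(n) = 0.4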

13.4 Joint and Conditional Probability Functions:


For $n$ discrete random variables $Y_1, \ldots, Y_n$, define the function

$f(y_1, y_2, \ldots, y_n) = P(Y_1 = y_1 \text{ and } Y_2 = y_2 \text{ and } \ldots \text{ and } Y_n = y_n).$

We call $f(y_1, y_2, \ldots, y_n)$ the joint probability function of $(Y_1, Y_2, \ldots, Y_n)$. We will also refer to the
marginal probability function of one random variable, for example $f_j(y_j) = P(Y_j = y_j)$.
For $n$ continuous random variables $Y_1, \ldots, Y_n$, define the joint c.d.f.

$F(y_1, y_2, \ldots, y_n) = P(Y_1 \le y_1 \text{ and } Y_2 \le y_2 \text{ and } \ldots \text{ and } Y_n \le y_n).$

We call $f(y_1, y_2, \ldots, y_n)$ the joint probability density function of $(Y_1, Y_2, \ldots, Y_n)$ if the above function can
be found as a multiple integral of $f$, i.e. if

$F(y_1, y_2, \ldots, y_n) = \int_{-\infty}^{y_1}\int_{-\infty}^{y_2} \cdots \int_{-\infty}^{y_n} f(u_1, u_2, \ldots, u_n)\,du_n \cdots du_1.$

We will also refer to the marginal probability (density) function of one random variable: for example,
$f_j(y_j) = P(Y_j = y_j)$ in the discrete case, and in the continuous case
$f_1(y_1) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} f(y_1, u_2, \ldots, u_n)\,du_n \cdots du_2$,
which is obtained by integrating the joint probability density function over all variables except $y_1$.
24. Proof: Consider Theorem 4.2.4 with $\mu_i = \mu$, $\sigma_i = \sigma$ and $a_i = 1/n$ for $i = 1, \ldots, n$.

Definition 11 If $f(y_1, y_2)$ denotes the joint probability function of $(Y_1, Y_2)$ and $f_1(y_1)$ the marginal
probability function of $Y_1$, then the conditional probability function of $Y_2$ given $Y_1 = y_1$ is
$f(y_2|y_1) = \frac{f(y_1, y_2)}{f_1(y_1)}$.
Similarly, in the continuous case, if $f(y_1, y_2)$ is the joint probability density function and $f_2(y_2)$
denotes the marginal probability density function of $Y_2$, then $f(y_1|y_2) = \frac{f(y_1, y_2)}{f_2(y_2)}$ is the conditional
probability density function of $Y_1$ given $Y_2 = y_2$. (In both cases, of course, these functions are only defined
where the denominator is not zero.)

Definition 12 For a function $g$ of several random variables,

$E[g(X_1, X_2, \ldots, X_n)] = \sum_{\text{all } (x_1, x_2, \ldots, x_n)} g(x_1, x_2, \ldots, x_n)\, f(x_1, \ldots, x_n) \quad \text{(discrete case)}$

$E[g(X_1, X_2, \ldots, X_n)] = \int \cdots \int_{\text{all } (x_1, x_2, \ldots, x_n)} g(x_1, x_2, \ldots, x_n)\, f(x_1, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n \quad \text{(continuous case)}$

Definition 13 The covariance of $Y_1$ and $Y_2$, denoted $\mathrm{Cov}(Y_1, Y_2)$, is, with $E(Y_j) = \mu_j$, $j = 1, 2$,

$\mathrm{Cov}(Y_1, Y_2) = E[(Y_1 - \mu_1)(Y_2 - \mu_2)]$

Definition 14 The correlation coefficient of $Y_1$ and $Y_2$ is $\rho = \frac{\mathrm{Cov}(Y_1, Y_2)}{\sigma_1 \sigma_2}$.

13.4.1 Independent Random Variables:

For events A and B, we defined A and B to be independent if and only if P (AB) = P (A)P (B). This
definition extends to random variables: random variables are independent if their joint probability
function is the product of the marginal probability functions.

Definition 15 $Y_1, Y_2, \ldots, Y_n$ are independent random variables (discrete or continuous case) iff

$f(y_1, y_2, \ldots, y_n) = f_1(y_1) f_2(y_2) \cdots f_n(y_n)$ for all $y_1, y_2, \ldots, y_n$.

Theorem 16 Suppose random variables $Y_1$ and $Y_2$ are independent. Then, if $g_1(y)$ and $g_2(y)$ are any
two functions for which these expectations exist,

$E[g_1(Y_1) g_2(Y_2)] = E[g_1(Y_1)]\, E[g_2(Y_2)].$

Properties of Expected Value, Variance, Covariance.

1. $E(aX + bY) = aE(X) + bE(Y) = a\mu_X + b\mu_Y$, when $a$ and $b$ are constants. (This follows
from the definition of expected value.) In particular, $E(X + Y) = \mu_X + \mu_Y$ and $E(X - Y) = \mu_X - \mu_Y$.

2. Let $a_i$ be constants (real numbers) and $E(X_i) = \mu_i$. Then $E\left(\sum a_i X_i\right) = \sum a_i \mu_i$. In particular,
$E\left(\sum X_i\right) = \sum E(X_i)$.

3. Let $X_1, X_2, \ldots, X_n$ be random variables which have mean $\mu$. (You can imagine these being
some sample results from an experiment such as recording the number of occupants in cars
travelling over a toll bridge.) The sample mean is $\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then $E(\bar{X}) = \mu$.

1. $\mathrm{Cov}(X, X) = E[(X - \mu_X)(X - \mu_X)] = E\left[(X - \mu_X)^2\right] = \mathrm{Var}(X)$

2. $\mathrm{Cov}(aX + bY, cU + dV) = ac\,\mathrm{Cov}(X, U) + ad\,\mathrm{Cov}(X, V) + bc\,\mathrm{Cov}(Y, U) + bd\,\mathrm{Cov}(Y, V)$
where $a, b, c,$ and $d$ are constants.

3. Variance of a linear combination: $\mathrm{Var}(aX + bY) = a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$

4. Variance of a sum of independent random variables: let $X$ and $Y$ be independent. Since
$\mathrm{Cov}(X, Y) = 0$, $\mathrm{Var}(X + Y) = \sigma_X^2 + \sigma_Y^2$; i.e., for independent variables, the variance of a
sum is the sum of the variances. Also note

$\mathrm{Var}(X - Y) = \sigma_X^2 + (-1)^2 \sigma_Y^2 = \sigma_X^2 + \sigma_Y^2;$

i.e., for independent variables, the variance of a difference is the sum of the variances.

5. Variance of a general linear combination: let $a_i$ be constants and $\mathrm{Var}(X_i) = \sigma_i^2$. Then

$\mathrm{Var}\left(\sum a_i X_i\right) = \sum a_i^2 \sigma_i^2 + 2\sum_{i<j} a_i a_j\,\mathrm{Cov}(X_i, X_j).$

6. Variance of a linear combination of independent random variables: special cases of result 5 are:

a) If $X_1, X_2, \ldots, X_n$ are independent then $\mathrm{Cov}(X_i, X_j) = 0$, so that

$\mathrm{Var}\left(\sum a_i X_i\right) = \sum a_i^2 \sigma_i^2.$

b) If $X_1, X_2, \ldots, X_n$ are independent and all have the same variance $\sigma^2$, then

$\mathrm{Var}(\bar{X}) = \sigma^2/n.$

(A small R check of the linear-combination results follows this list.)
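As a rough numerical check of result 3 above, we can simulate two correlated variables (arbitrary illustrative choices) and compare the sample variance of $aX + bY$ with $a^2\,\mathrm{Var}(X) + b^2\,\mathrm{Var}(Y) + 2ab\,\mathrm{Cov}(X, Y)$:

    set.seed(1)
    x <- rnorm(100000)
    y <- 0.5 * x + rnorm(100000)   # y is correlated with x
    a <- 2; b <- -3
    var(a * x + b * y)                                    # sample variance of the combination
    a^2 * var(x) + b^2 * var(y) + 2 * a * b * cov(x, y)   # the formula; the two values agree closely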

The Gamma Function


Definition 17 The gamma function is $\Gamma(a) = \int_0^{\infty} x^{a-1} e^{-x}\,dx$, defined for $a > 0$.

There are 3 properties of the gamma function which we'll use.

1. $\Gamma(\alpha) = (\alpha - 1)\,\Gamma(\alpha - 1)$ for $\alpha > 1$.

Proof: Using integration by parts, $\int_0^{\infty} x^{\alpha-1} e^{-x}\,dx = -x^{\alpha-1}e^{-x}\big|_0^{\infty} + (\alpha-1)\int_0^{\infty} x^{\alpha-2} e^{-x}\,dx$, and
provided that $\alpha > 1$, $-x^{\alpha-1}e^{-x}\big|_0^{\infty} = 0$. Therefore $\int_0^{\infty} x^{\alpha-1}e^{-x}\,dx = (\alpha-1)\int_0^{\infty} x^{\alpha-2}e^{-x}\,dx$.

2. $\Gamma(\alpha) = (\alpha - 1)!$ if $\alpha$ is a positive integer.

Proof: It is easy to show that $\Gamma(1) = 1$. Using property 1 repeatedly, we obtain $\Gamma(2) = 1\cdot\Gamma(1) = 1$,
$\Gamma(3) = 2\,\Gamma(2) = 2!$, and in general $\Gamma(n+1) = n!$ for integer $n$.

3. $\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}$
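These properties are easy to check numerically with R's built-in gamma function:

    gamma(5)       # equals 4! = 24, by property 2
    gamma(0.5)^2   # equals pi, by property 3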

Limiting Distributions and Convergence in Probability

Sometimes we encounter a sequence of random variables $Y_1, Y_2, \ldots$ whose distributions converge to
some limit. Random variables are said to converge in distribution if their corresponding cumulative
distribution functions converge. The following definition states this more precisely.

Definition (convergence in distribution): Let $\{Y_n : n = 1, 2, \ldots\}$ be a sequence of random variables
with c.d.f.'s $F_n(y)$, $n = 1, 2, \ldots$. If there is a c.d.f. $F(y)$ such that

$\lim_{n \to \infty} F_n(y) = F(y) \qquad (4.2.7)$

at all points $y$ at which $F$ is continuous, then we say that the sequence of random variables has a
limiting distribution with c.d.f. $F(y)$. We often write this as $Y_n \xrightarrow{d} Y$, where $Y$ is a random variable
with c.d.f. $F(y)$.

A major use of limiting distributions is as approximations. In many problems the c.d.f. $F_n(y)$ may
be complicated or intractable, but as $n \to \infty$ there is a much simpler limit $F(y)$. If $n$ is sufficiently
large, we often just use $F(y)$ as a close approximation to $F_n(y)$.
We will state and use some limiting distributions in these notes. A famous limiting result, the Central
Limit Theorem, is contained in the following theorem. In essence the theorem states that we may
approximate the distribution of a sum of independent random variables using the normal distribution;
the more summands, the better the approximation. A way to prove the theorem is provided in Problem
4 at the end of the chapter. Loosely speaking, the central limit theorem says:

let $Y_1, Y_2, \ldots, Y_n$ be independent random variables all having the same distribution, with mean $\mu$ and
variance $\sigma^2$. Then as $n \to \infty$,

$\sum_{i=1}^{n} Y_i \sim N(n\mu, n\sigma^2) \qquad (4.2.8)$

and

$\bar{Y} \sim N\left(\mu, \frac{\sigma^2}{n}\right). \qquad (4.2.9)$

This is actually a rough statement of the result since, as $n \to \infty$, both the $N(n\mu, n\sigma^2)$ and $N(\mu, \sigma^2/n)$
distributions fail to exist. (The former because both $n\mu$ and $n\sigma^2 \to \infty$, the latter because $\sigma^2/n \to 0$.) A
precise version of the results is:

Theorem 18 If $Y_1, Y_2, \ldots, Y_n$ are independent random variables all having the same distribution, with
mean $\mu$ and variance $\sigma^2$, then as $n \to \infty$, the cumulative distribution function of the random variable

$Z_n = \frac{\sum Y_i - n\mu}{\sigma\sqrt{n}}$

approaches the $N(0, 1)$ c.d.f. Similarly, the c.d.f. of

$Z_n = \frac{\bar{Y} - \mu}{\sigma/\sqrt{n}}$

approaches the standard normal c.d.f.

The theorem asserts that as $n \to \infty$, $Z_n$ converges in distribution to $G(0, 1)$.$^{25}$ Although this is a
theorem about limits, we will use it when $n$ is large but finite to approximate the distribution of $\sum Y_i$
or $\bar{Y}$ by a normal distribution, so the rough version of the theorem in (4.2.8) and (4.2.9) is adequate for
our purposes.

Remark: For $n$ sufficiently large we may use the $G(0, 1)$ distribution to approximate the distribution of
$Z_n$ and thus to calculate probabilities for $\sum_{i=1}^{n} Y_i$. Recall that a constant times a normal random variable
is also normally distributed, so the theorem also asserts that $\bar{Y}$ has approximately a Gaussian distribution with mean
$\mu$ and standard deviation $\sqrt{\sigma^2/n}$. The accuracy of the approximation depends on $n$ (bigger is better)
and also on the distributions of $Y_1, Y_2, \ldots$.$^{26}$

25. It is an interesting question as to whether this implies the p.d.f. of $Z_n$ converges to the normal p.d.f. Is it possible for a
sequence $F_n(x)$ to converge to a limit $F(x)$ and yet their derivatives not converge, $F_n'(x) \nrightarrow F'(x)$?
26. Suppose $Y_j$ is your winning in a lottery in week $j$, and $Y_j$ is either 1 million (with probability $10^{-6}$) or 0 (with probability
$1 - 10^{-6}$). How close do you think the distribution of $\sum_{i=1}^{100} Y_i$ is to a normal distribution?

Remark: It is possible to use the theorem to approximate discrete as well as continuous distributions.
A very important case is the binomial distribution: let $Y_n \sim \mathrm{Bin}(n, p)$ be a binomial r.v. Then as
$n \to \infty$ the limiting distribution of

$Z_n = \frac{Y_n - np}{\sqrt{np(1-p)}}$

is $G(0, 1)$. This is proved by noting that $Y_n$ can be written as a sum of independent random variables:

$Y_n = \sum_{i=1}^{n} X_i$

where $X_i \sim \mathrm{Bin}(1, p)$, with $\mu = E(X_i) = p$ and $\sigma^2 = \mathrm{Var}(X_i) = p(1-p)$. The result above then
follows from the Central Limit Theorem (Theorem 18).

Exercise: Use the limiting Gaussian approximation to the binomial distribution to evaluate $P(20 \le Y \le 28)$
and $P(16 \le Y \le 32)$ when $Y \sim \mathrm{Bin}(60, .4)$. Compare the answers with the exact
probabilities, obtained from the R function pbinom(x, n, p). (Note: recall that when approximating a
discrete distribution we can use a continuity correction; this means here that we consider $P(19.5 \le Y \le 28.5)$
and $P(15.5 \le Y \le 32.5)$ when we apply the normal approximation.)
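One way to carry out this exercise in R (a sketch of one possible approach) is:

    n <- 60; p <- 0.4
    mu <- n * p; sigma <- sqrt(n * p * (1 - p))
    # exact binomial probabilities
    pbinom(28, n, p) - pbinom(19, n, p)   # P(20 <= Y <= 28)
    pbinom(32, n, p) - pbinom(15, n, p)   # P(16 <= Y <= 32)
    # Gaussian approximation with the continuity correction
    pnorm(28.5, mu, sigma) - pnorm(19.5, mu, sigma)
    pnorm(32.5, mu, sigma) - pnorm(15.5, mu, sigma)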
Besides convergence in distribution, there is another concept, called convergence in probability,
which we will mention briefly. A sequence of random variables $\{X_n : n = 1, 2, \ldots\}$ is said to converge
in probability to the constant $c$ if

$\lim_{n \to \infty} P\{|X_n - c| \ge \epsilon\} = 0 \quad \text{for all } \epsilon > 0. \qquad (4.2.10)$

Loosely, this means that however small we take our tolerance $\epsilon$, the probability that $X_n$ is within $\epsilon$ of $c$
gets closer and closer to 1. A major application of this concept is in showing
that certain estimators $\tilde\theta_n$, based on a sample of size $n$, converge to the true value of the parameter
being estimated as $n \to \infty$ (i.e. as the sample size becomes arbitrarily large). Convergence in probability
will not be used much in this course, but it is an important tool in more advanced discussions of statistical
theory. In fact we can show that convergence in distribution (as defined in (4.2.7))$^{27}$ to a constant
is equivalent to convergence in probability to that constant as in (4.2.10).

27. What is the cumulative distribution function $F(y)$ of the constant $c$, and at what points $y$ is it continuous?
