1. Two Means (t-test)

Age (yrs)
 i    x_i   y_i
 1    45    41
 2    54    50
 3    36    29

(a) $\bar{x} = \frac{45 + 54 + 36}{3} = 45$ yrs, $\bar{y} = \frac{41 + 50 + 29}{3} = 40$ yrs,

$s_x^2 = \frac{(45 - 45)^2 + (54 - 45)^2 + (36 - 45)^2}{3 - 1} = 81$ yrs$^2$,

$s_y^2 = \frac{(41 - 40)^2 + (50 - 40)^2 + (29 - 40)^2}{3 - 1} = 111$ yrs$^2$.

Note that the sample mean age difference $\bar{x} - \bar{y}$ = 5 yrs.
Testing $H_0: \mu_X - \mu_Y = 0$ versus $H_A: \mu_X - \mu_Y \neq 0$, at the $\alpha = .05$ significance level. The problem
states that the age variable is normally distributed in each population, and these independent
samples are clearly small. This is therefore a good candidate for the two-sample t-test, provided
that equivariance can be established, via the informal condition that the ratio of sample variances
lies between 0.25 and 4. We have $s_x^2 / s_y^2 = 81/111 = 0.73$, which satisfies this criterion.
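Next, for $\alpha = .05$ and df $= n_x + n_y - 2 = (3 + 3) - 2 = 4$, we calculate

critical value $= t_{4,\,.025} = 2.776$

standard error s.e. $= s_{\text{pooled}} \sqrt{\frac{1}{n_x} + \frac{1}{n_y}} = \sqrt{96}\,\sqrt{\frac{1}{3} + \frac{1}{3}} = \sqrt{96 \cdot \frac{2}{3}} = 8$ yrs,

where $s_{\text{pooled}}^2 = \frac{(n_x - 1)\,s_x^2 + (n_y - 1)\,s_y^2}{\text{df}} = \frac{2(81) + 2(111)}{4} = 96$ yrs$^2$.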
Therefore,
95% margin of error = $(t_{4,\,.025})(\text{s.e.})$ = (2.776)(8 yrs) = 22.208 yrs, so that
95% Confidence Interval for $\mu_X - \mu_Y$ = $(5 - 22.2,\ 5 + 22.2)$ = $(-17.2,\ 27.2)$ yrs.
As this interval contains the null value $\mu_X - \mu_Y = 0$, the null hypothesis cannot be rejected
at the $\alpha = .05$ significance level.
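The pooled two-sample calculation above can be sketched in Python as a check on the arithmetic. This is a minimal illustration using only the standard library; the function name `pooled_t_interval` is hypothetical, and the critical value is the tabulated $t_{4,\,.025} = 2.776$ quoted in the text rather than something computed here.

```python
from math import sqrt
from statistics import mean, variance

# Ages from the problem; T_CRIT is the tabulated t(4, .025) quoted in the text.
x = [45, 54, 36]
y = [41, 50, 29]
T_CRIT = 2.776


def pooled_t_interval(x, y, t_crit):
    """Return (difference, standard error, CI) for mean(x) - mean(y)."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    # Pooled variance: df-weighted average of the two sample variances.
    s2_pooled = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = sqrt(s2_pooled * (1 / nx + 1 / ny))
    diff = mean(x) - mean(y)
    margin = t_crit * se
    return diff, se, (diff - margin, diff + margin)


diff, se, ci = pooled_t_interval(x, y, T_CRIT)
print(f"difference = {diff}, s.e. = {se:.3f}, 95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
# H0 is rejected only if the interval excludes the null value 0.
print("do not reject H0" if ci[0] <= 0 <= ci[1] else "reject H0")
```

The variance-ratio check (between 0.25 and 4) is not enforced in this sketch; it would need to be verified separately before pooling.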
(b) Interpretation: Based on the sample data, there is not enough evidence to conclude that a
statistically significant difference exists between the population mean ages of men and women,
at the $\alpha = .05$ level. (Not surprising, given the small sample sizes, and thus low power of
detecting such a difference, even if present.)
(c) If normality is not established for two independent samples, then a nonparametric test,
specifically the Wilcoxon Rank Sum Test, should be used instead.
(d) In this design, we have two small, paired samples. Therefore, a paired t-test is appropriate, i.e.,
a t-test of $H_0: \mu_D = 0$ vs. $H_A: \mu_D \neq 0$, where $\mu_D = \mu_X - \mu_Y$, using the single sample of n = 3
individual differences D = {4, 4, 7}. Here, as before, $\bar{d} = 5$ yrs, but

$s_d^2 = \frac{(4 - 5)^2 + (4 - 5)^2 + (7 - 5)^2}{3 - 1} = 3$ yrs$^2$.

So, for $\alpha = .05$ and df = 3 - 1 = 2, we have

critical value $= t_{2,\,.025} = 4.303$

standard error s.e. $= \sqrt{s_d^2 / n} = \sqrt{3 \text{ yrs}^2 / 3} = 1$ yr.
Therefore, the 95% margin of error = $(t_{2,\,.025})(\text{s.e.})$ = (4.303)(1 yr) = 4.303 yrs, so the
95% Confidence Interval for $\mu_D$ = $(5 - 4.3,\ 5 + 4.3)$ = $(0.7,\ 9.3)$ yrs.
As this interval does not contain the null value $\mu_D = 0$, the null hypothesis can be rejected at
the $\alpha = .05$ significance level.
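As a check on the paired analysis, the same data can be reduced to differences and run through a one-sample interval. This is a minimal sketch with the standard library only; the critical value is the tabulated $t_{2,\,.025} = 4.303$ from the text, hard-coded as an assumption rather than computed.

```python
from math import sqrt
from statistics import mean, variance

x = [45, 54, 36]
y = [41, 50, 29]
T_CRIT = 4.303  # tabulated t(2, .025) quoted in the text

# Reduce the paired samples to a single sample of differences.
d = [xi - yi for xi, yi in zip(x, y)]
n = len(d)
d_bar = mean(d)
se = sqrt(variance(d) / n)  # standard error of the mean difference
margin = T_CRIT * se
ci = (d_bar - margin, d_bar + margin)

print(f"differences = {d}, mean = {d_bar}, s.e. = {se:.3f}")
print(f"95% CI = ({ci[0]:.1f}, {ci[1]:.1f})")
print("reject H0" if not (ci[0] <= 0 <= ci[1]) else "do not reject H0")
```

Pairing removes the large between-subject variation in age, which is why the standard error drops from 8 yrs to 1 yr even though the mean difference is unchanged.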
(e) Interpretation: Based on the data from this sample, at the $\alpha = .05$ level, there is a statistically
significant difference between the population mean ages of married couples, with husbands
older than their wives by an average of 5 years.
(f) If normality is not established for two paired samples, then (after reducing them to a single
sample of pairwise differences) a nonparametric test, specifically the Wilcoxon Signed Rank
Test, should be used instead.
2. Chi-squared Tests
(a) $\hat{\pi}_{\text{Dark}}$ = 48/120 = 0.4, $\hat{\pi}_{\text{Milk}}$ = 40/120 = 0.333, and $\hat{\pi}_{\text{White}}$ = 32/120 = 0.267.

 Dark Chocolate   Milk Chocolate   White Chocolate
 48 (40)          40 (40)          32 (40)

With the three equal expected values under $H_0: \pi_{\text{Dark}} = \pi_{\text{Milk}} = \pi_{\text{White}}$ shown in parentheses, we
have $\chi^2 = \frac{(+8)^2}{40} + \frac{(0)^2}{40} + \frac{(-8)^2}{40} = 3.2$ on 2 df, so that
0.10 < p-value < 0.25. Do not reject $H_0$ at $\alpha = .05$.
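The goodness-of-fit statistic can be reproduced with a few lines of Python. This is a sketch under one stated assumption: the p-value line uses the closed form $e^{-\chi^2/2}$, which is the chi-squared upper tail only when df = 2, so it should not be reused for other degrees of freedom.

```python
from math import exp

observed = {"Dark": 48, "Milk": 40, "White": 32}

n = sum(observed.values())
expected = n / len(observed)  # equal proportions under H0
chi2 = sum((o - expected) ** 2 / expected for o in observed.values())
df = len(observed) - 1

# Upper-tail probability; this closed form holds only for df == 2.
p_value = exp(-chi2 / 2)

print(f"chi-squared = {chi2:.2f} on {df} df, p-value = {p_value:.3f}")
```

The computed p-value falls inside the tabulated bracket 0.10 < p < 0.25 given above.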
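(b) $\hat{\pi}_{\text{Men} \mid \text{Dark}}$ = 12/48 = 0.25, $\hat{\pi}_{\text{Men} \mid \text{Milk}}$ = 16/40 = 0.4, $\hat{\pi}_{\text{Men} \mid \text{White}}$ = 20/32 = 0.625
vs.
$\hat{\pi}_{\text{Women} \mid \text{Dark}}$ = 36/48 = 0.75, $\hat{\pi}_{\text{Women} \mid \text{Milk}}$ = 24/40 = 0.6, $\hat{\pi}_{\text{Women} \mid \text{White}}$ = 12/32 = 0.375

$\hat{\pi}_{\text{Dark} \mid \text{Men}}$ = 12/48 = 0.25, $\hat{\pi}_{\text{Dark} \mid \text{Women}}$ = 36/72 = 0.5 vs.
$\hat{\pi}_{\text{Milk} \mid \text{Men}}$ = 16/48 = 0.333, $\hat{\pi}_{\text{Milk} \mid \text{Women}}$ = 24/72 = 0.333 vs.
$\hat{\pi}_{\text{White} \mid \text{Men}}$ = 20/48 = 0.417, $\hat{\pi}_{\text{White} \mid \text{Women}}$ = 12/72 = 0.167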
         Dark Chocolate   Milk Chocolate   White Chocolate   Total
 Men     12 (19.2)        16 (16.0)        20 (12.8)         48
 Women   36 (28.8)        24 (24.0)        12 (19.2)         72
 Total   48               40               32                120

With the expected values shown (the two Milk cells contribute 0), we have
$\chi^2 = \frac{(-7.2)^2}{19.2} + \frac{(+7.2)^2}{12.8} + \frac{(+7.2)^2}{28.8} + \frac{(-7.2)^2}{19.2} = 11.25$ on 2 df, so that .0025 < p-value < .005.
Reject $H_0$ strongly at the $\alpha = .05$ level.
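The expected counts and the test statistic for the 2 x 3 table can be rebuilt from the observed counts alone. The sketch below assumes the usual independence formula (row total times column total over grand total) and, as an explicitly limited shortcut, uses $e^{-\chi^2/2}$ for the p-value, which is exact only because df = 2 here.

```python
from math import exp

# Observed counts: columns are Dark, Milk, White.
table = {"Men": [12, 16, 20], "Women": [36, 24, 12]}

row_totals = {g: sum(counts) for g, counts in table.items()}
col_totals = [sum(col) for col in zip(*table.values())]
n = sum(row_totals.values())

# Expected count under independence: (row total)(column total) / n.
expected = {g: [row_totals[g] * c / n for c in col_totals] for g in table}

chi2 = sum(
    (o - e) ** 2 / e
    for g in table
    for o, e in zip(table[g], expected[g])
)
df = (len(table) - 1) * (len(col_totals) - 1)
p_value = exp(-chi2 / 2)  # closed form valid only for df == 2

print(expected)
print(f"chi-squared = {chi2:.2f} on {df} df, p-value = {p_value:.4f}")
```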
(c) Interpretation: At the 5% level, there is no statistically significant difference shown in the
overall proportions of people who prefer dark, milk, or white chocolate. However, there is
strong evidence that a significant difference exists when stratified on gender. In particular, the
two categorical variables I = Gender (Men / Women) and J = Chocolate preference (Dark /
Milk / White) are not independent of one another; i.e., an association exists between the two.
3. Analysis of Variance (ANOVA)
(a) With sample size n = 20 + 21 + 22 = 63, the grand mean is

$\bar{x} = \frac{20(130) + 21(150) + 22(151)}{63} = 144.$

$MS_{\text{Trt}} = \frac{SS_{\text{Trt}}}{df_{\text{Trt}}} = \frac{20(130 - 144)^2 + 21(150 - 144)^2 + 22(151 - 144)^2}{3 - 1} = \frac{5754}{2} = 2877$

$MS_{\text{Err}} = \frac{SS_{\text{Err}}}{df_{\text{Err}}} = \frac{(20 - 1)(207.5) + (21 - 1)(204.5) + (22 - 1)(217.5)}{63 - 3} = \frac{12600}{60} = 210$

ANOVA Table
 Source      df   SS      MS     F = MSTrt / MSErr   p-value
 Treatment   2    5754    2877   13.7                p < .05
 Error       60   12600   210
 Total       62   18354
(b) Because 13.7 is greater than the tabulated $\alpha = .05$ $F_{2,\,60}$-score of 3.15, it follows that p << .05,
and we can strongly reject the null hypothesis $H_0: \mu_1 = \mu_2 = \mu_3$, at the $\alpha = .05$ significance level.
Interpretation: At the 5% level, there is a statistically significant difference in the mean yields
between the three corn varieties. Specifically, the two experimental varieties do indeed have
significantly higher mean yields than the control, beyond what one would expect by chance.
[Figure: $F_{2,\,60}$ distribution, with the critical value 3.15 and the observed F = 13.7 marked.]
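The ANOVA table can be rebuilt from the group sizes, means, and variances alone. The following is a minimal sketch of that calculation; the critical value 3.15 is the tabulated $F_{2,\,60}$ score from the text, taken as given rather than computed.

```python
# Summary statistics for the three groups, in the order used in the text.
sizes = [20, 21, 22]
means = [130, 150, 151]
variances = [207.5, 204.5, 217.5]
F_CRIT = 3.15  # tabulated F(2, 60) critical value at alpha = .05

n = sum(sizes)
k = len(sizes)
grand_mean = sum(ni * mi for ni, mi in zip(sizes, means)) / n

# Between-group (treatment) and within-group (error) sums of squares.
ss_trt = sum(ni * (mi - grand_mean) ** 2 for ni, mi in zip(sizes, means))
ss_err = sum((ni - 1) * vi for ni, vi in zip(sizes, variances))

ms_trt = ss_trt / (k - 1)
ms_err = ss_err / (n - k)
f_stat = ms_trt / ms_err

print(f"grand mean = {grand_mean}")
print(f"SSTrt = {ss_trt}, SSErr = {ss_err}, SSTotal = {ss_trt + ss_err}")
print(f"F = {f_stat:.1f} on ({k - 1}, {n - k}) df; reject H0: {f_stat > F_CRIT}")
```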
4. Linear Correlation and Regression
[Figure: scatterplot of the data with the fitted line $\hat{Y} = 121.4 - 11.1X$ and the residuals +20.8, -17.0, -14.8, -2.6, +13.6 marked.]
(a) With means $\bar{x} = 6$ and $\bar{y} = 54.8$, we have

$s_{xy} = \frac{1}{5 - 1}\,[(2 - 6)(120 - 54.8) + (4 - 6)(60 - 54.8) + (6 - 6)(40 - 54.8) + (8 - 6)(30 - 54.8) + (10 - 6)(24 - 54.8)] = -111.$

(b) $r = \frac{s_{xy}}{s_x\, s_y} = \frac{-111}{\sqrt{10}\,\sqrt{1515.2}} = -0.902$; a strong, negative linear correlation between X and Y.

(c) $SS_{\text{Total}} = \sum_{i=1}^{5} (y_i - \bar{y})^2 = (n - 1)\, s_y^2 = 4\,(1515.2) = 6060.8$

(d) $b_1 = \frac{s_{xy}}{s_x^2} = \frac{-111}{10} = -11.1$, $b_0 = \bar{y} - b_1 \bar{x} = 54.8 - (-11.1)(6) = 121.4$

Hence, the equation of the least squares regression line is $\hat{Y} = 121.4 - 11.1X$.
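Parts (a) through (d) can be verified numerically from the five data points. This is a plain-Python sketch of the textbook formulas used above, written out by hand so that each quantity matches a line of the solution.

```python
from math import sqrt

x = [2, 4, 6, 8, 10]
y = [120, 60, 40, 30, 24]
n = len(x)

x_bar = sum(x) / n
y_bar = sum(y) / n

# Sample covariance and variances, each with divisor n - 1.
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)
s_y2 = sum((yi - y_bar) ** 2 for yi in y) / (n - 1)

r = s_xy / sqrt(s_x2 * s_y2)   # correlation coefficient
ss_total = (n - 1) * s_y2      # total sum of squares
b1 = s_xy / s_x2               # least squares slope
b0 = y_bar - b1 * x_bar        # least squares intercept

print(f"s_xy = {s_xy:.1f}, s_x^2 = {s_x2:.1f}, s_y^2 = {s_y2:.1f}")
print(f"r = {r:.3f}, SSTotal = {ss_total:.1f}")
print(f"fitted line: Y-hat = {b0:.1f} + ({b1:.1f}) X")
```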
(e)
 predictors X                  2       4       6       8       10
 observed responses Y          120     60      40      30      24
 fitted responses $\hat{Y}$    99.2    77.0    54.8    32.6    10.4
 residuals $Y - \hat{Y}$       +20.8   -17.0   -14.8   -2.6    +13.6

$SS_{\text{Err}} = (+20.8)^2 + (-17.0)^2 + (-14.8)^2 + (-2.6)^2 + (+13.6)^2 = 1132.4$
(By construction, this is the smallest value that the residual sum of squares can attain, of
any regression line for these data.)
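The residual table and the minimality remark can both be illustrated in code. The sketch below takes the coefficients 121.4 and -11.1 from part (d) as given, and compares the resulting residual sum of squares with that of a few arbitrarily perturbed lines; the perturbation sizes are illustrative choices, not part of the original solution.

```python
x = [2, 4, 6, 8, 10]
y = [120, 60, 40, 30, 24]
B0, B1 = 121.4, -11.1  # least squares coefficients from part (d)


def sse(b0, b1):
    """Residual sum of squares for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


fitted = [B0 + B1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
ss_err = sse(B0, B1)

# Nudging either coefficient in either direction should only increase SSE.
perturbations = [(1.0, 0.0), (-1.0, 0.0), (0.0, 0.5), (0.0, -0.5)]
perturbed = [sse(B0 + db0, B1 + db1) for db0, db1 in perturbations]

print("fitted:   ", [round(f, 1) for f in fitted])
print("residuals:", [round(e, 1) for e in residuals])
print(f"SSErr = {ss_err:.1f}")
print("perturbed SSE values:", [round(s, 1) for s in perturbed])
```

A finite set of perturbations is only a spot check, not a proof; the general result follows from the least squares normal equations.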
(f) $SS_{\text{Reg}}$ can be computed directly from its definition, but because $SS_{\text{Total}} = SS_{\text{Reg}} + SS_{\text{Err}}$,
it follows that $SS_{\text{Reg}} = SS_{\text{Total}} - SS_{\text{Err}} = 6060.8 - 1132.4 = 4928.4$.
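For comparison, the regression sum of squares can also be computed directly from its definition, $\sum (\hat{y}_i - \bar{y})^2$, and checked against the subtraction used in (f). The sketch below assumes the fitted coefficients 121.4 and -11.1 from part (d).

```python
x = [2, 4, 6, 8, 10]
y = [120, 60, 40, 30, 24]
B0, B1 = 121.4, -11.1  # least squares coefficients from part (d)

y_bar = sum(y) / len(y)
fitted = [B0 + B1 * xi for xi in x]

ss_total = sum((yi - y_bar) ** 2 for yi in y)
ss_err = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
ss_reg = sum((fi - y_bar) ** 2 for fi in fitted)  # direct definition

print(f"SSTotal = {ss_total:.1f}")
print(f"SSReg (direct) = {ss_reg:.1f}, SSErr = {ss_err:.1f}")
print(f"SSReg + SSErr = {ss_reg + ss_err:.1f}")
```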
(g) $r^2 = (-0.902)^2 = 0.813$ or, equivalently, $r^2 = \frac{SS_{\text{Reg}}}{SS_{\text{Total}}} = \frac{4928.4}{6060.8} = 0.813.$
Interpretation: The negative linear association between Y (protein level) and X (time)
accounts for 81.3% of the total variation in the data. (The remaining 18.7% of the
variation is unaccounted for by the line, and may in fact just be due to random chance.)
This indicates that the linear model is a reasonably good fit to the data. However...
(h) At x = 12 hours, the corresponding point estimate of the response is $\hat{Y} = 121.4 - 11.1(12)$
= $-11.8$ mg/L, a negative value, which is physically impossible in this situation.
(i) The overall graph of the line, along with its high $r^2$ value, indicates that this model may be
a reasonable, perhaps useful, description of the true manner in which the protein levels
decrease with time, at least over the recorded intervals. However, the residuals clearly
follow a nonlinear pattern of +, to -, then back to +, which we would not expect to see if
this were actually the case. Furthermore, as time increases, we might anticipate that the
protein levels would drop to zero as a lower bound, but not less, as the model yielded
for x = 12 hours above; the margin of error there is almost certainly so large as to make the
estimate practically useless. All of this suggests that, being physically unrealistic, the
linear model has definite limitations, and a nonlinear model might be more appropriate.
(j) The nonlinear model $\hat{Y} = \frac{240}{X}$ exactly fits the original data values for X = 2, 4, 6, 8, 10, is a
mathematically simple relationship and hence easy to interpret, and unlike the linear model,
decreases toward zero as time X gets large, which conforms to physical intuition. When
x = 12 hours, this model predicts that $\hat{Y} = \frac{240}{12} = 20$ mg/dL, a much more realistic value.
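The contrast between the two models can be checked directly. This short sketch evaluates both on the observed data and at the extrapolated point x = 12; the function names `linear` and `inverse` are illustrative labels, not notation from the original solution.

```python
x = [2, 4, 6, 8, 10]
y = [120, 60, 40, 30, 24]


def linear(t):
    """Least squares line from part (d)."""
    return 121.4 - 11.1 * t


def inverse(t):
    """Nonlinear model from part (j)."""
    return 240 / t


# The inverse model reproduces every observed value exactly.
inverse_residuals = [yi - inverse(xi) for xi, yi in zip(x, y)]
linear_residuals = [yi - linear(xi) for xi, yi in zip(x, y)]

print("inverse residuals:", inverse_residuals)
print("linear residuals: ", [round(e, 1) for e in linear_residuals])
print(f"at x = 12: linear = {linear(12):.1f}, inverse = {inverse(12):.1f}")
```

An exact fit to five points is not by itself evidence that the inverse model is correct; its main advantage here is that it stays positive for every positive X.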