2025jun 02402-Python en

The document outlines the structure and rules for a written examination in Statistics at the Technical University of Denmark, scheduled for June 26, 2025. It consists of 30 multiple-choice questions across 16 exercises, with specific scoring rules for correct and incorrect answers. Additionally, it includes various statistical scenarios and questions related to productivity, normal distribution, hypothesis testing, and survey analysis.

Technical University of Denmark Page 1 of 29 pages.

Written examination: 26.06.2025


Course name and number: Statistics (02402)
Duration: 4 hours
Aids and facilities allowed: All aids - no internet access

The questions were answered by

(student number) (signature) (table number)

This exam consists of 30 questions of the “multiple choice” type, which are divided between
16 exercises. To answer the questions, you need to fill in the “multiple choice” form on
exam.dtu.dk.

5 points are given for a correct “multiple choice” answer, and −1 point is given for a wrong
answer. ONLY the following 5 answer options are valid: 1, 2, 3, 4, or 5. If a question is left
blank or an invalid answer is entered, 0 points are given for the question. Furthermore, if more
than one answer option is selected for a single question, which is in fact technically possible in
the online system, 0 points are given for the question. The number of points needed to obtain
a specific mark or to pass the exam is ultimately determined during censoring.

The final answers should be given by filling in and submitting the form.
The table provided here is ONLY an emergency alternative.
Remember to provide your student number if you do hand in on paper.

Exercise I.1 I.2 II.1 III.1 III.2 IV.1 V.1 VI.1 VI.2 VII.1
Question (1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
Answer

Exercise VII.2 VIII.1 VIII.2 VIII.3 IX.1 IX.2 IX.3 X.1 X.2 X.3
Question (11) (12) (13) (14) (15) (16) (17) (18) (19) (20)
Answer

Exercise XI.1 XII.1 XII.2 XIII.1 XIII.2 XIV.1 XV.1 XV.2 XV.3 XVI.1
Question (21) (22) (23) (24) (25) (26) (27) (28) (29) (30)
Answer

The exam paper contains 29 pages. Continue on page 2

Multiple choice questions: Note that in each question, one and only one of the answer
options is correct. Furthermore, not all the suggested answers are necessarily meaningful.
Always remember to round your own result to the number of decimals given in the answer options
before you choose your answer. Also remember that there may be slight discrepancies between
the result of the book’s formulas and corresponding built-in functions in Python.

Exercise I

A company wants to investigate the effect of lighting and music conditions on employee
productivity (measured in the number of units produced per hour).

Two factors are tested:

1) Lighting (Factor A) with two levels: Low and High

2) Music (Factor B) with three levels: None, Calm, and Energetic

All combinations are tested, and the employees’ average productivity is measured:

                 No Music   Calm Music   Energetic Music
Low Lighting        20          23             19
High Lighting       25          27             22

It is now assumed that the model is

$Y_{ij} = \mu + \alpha_i + \beta_j + \epsilon_{ij}, \quad \epsilon_{ij} \sim N(0, \sigma^2),$

where the $\epsilon_{ij}$ are iid., $Y_{ij}$ (y) is productivity, and $\alpha_i$ and $\beta_j$ represent the effects of
Lighting (Light) and Music (Music), respectively. To investigate the effects, the following result
has been obtained from Python (data from the table above is stored in D).

import statsmodels.api as sm
import statsmodels.formula.api as smf

fit = smf.ols('y ~ Light + Music', data = D).fit()
sm.stats.anova_lm(fit)

           df     sum_sq    mean_sq          F    PR(>F)
Light     1.0  24.000000  24.000000  48.000000  0.020204
Music     2.0  20.333333  10.166667  20.333333  0.046875
Residual  2.0   1.000000   0.500000        NaN       NaN
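The sums of squares in this table can be reproduced by hand from the six cell values. The following is a minimal sketch in plain Python, without statsmodels; the variable names are illustrative and not part of the exam.

```python
# Two-way ANOVA without replication, computed by hand from the table above.
# Rows are lighting levels (Low, High); columns are music levels
# (None, Calm, Energetic). Variable names are illustrative.
y = [[20, 23, 19],
     [25, 27, 22]]

a = len(y)        # number of lighting levels
b = len(y[0])     # number of music levels
grand_mean = sum(sum(row) for row in y) / (a * b)

row_means = [sum(row) / b for row in y]
col_means = [sum(y[i][j] for i in range(a)) / a for j in range(b)]

# Each factor's sum of squares weights the squared deviations of its level
# means by the number of observations behind each level mean.
ss_light = b * sum((m - grand_mean) ** 2 for m in row_means)
ss_music = a * sum((m - grand_mean) ** 2 for m in col_means)
ss_total = sum((v - grand_mean) ** 2 for row in y for v in row)
ss_resid = ss_total - ss_light - ss_music

df_light, df_music = a - 1, b - 1
df_resid = df_light * df_music
f_light = (ss_light / df_light) / (ss_resid / df_resid)
f_music = (ss_music / df_music) / (ss_resid / df_resid)

print(ss_light, ss_music, ss_resid, f_light, f_music)
```

Each F statistic is the factor's mean square divided by the residual mean square, which is how the F column of the table arises.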

Question I.1 (1)

What is the conclusion from the relevant statistical tests, using a significance level of α = 0.05?

1 □ There is a significant difference in productivity both with respect to different lighting and
different music.

2 □ There is a significant difference in productivity with respect to different lighting, but not
with respect to different music.

3 □ There is a significant difference in productivity with respect to different music, but not
with respect to different lighting.

4 □ There is no significant difference in productivity, either with respect to different music or
different lighting.

5 □ There is a significant difference in productivity with respect to different music, but one
cannot conclude whether there is an effect of different lighting, since there are only two
levels.

Question I.2 (2)

To perform further investigations on the effect of music, pairwise comparisons of its effects need
to be made. While correcting for multiple tests, what is the Least Significant Difference (LSD)
for the pairwise comparisons of the effect of music (using a significance level of α = 0.05)?

1 □ 2.5

2 □ 2.0

3 □ 1.5

4 □ 5.4

5 □ 6.2
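One way to sketch the calculation, assuming the Bonferroni-corrected form of the LSD, $t_{1-\alpha/(2M)} \sqrt{2 \cdot MSE / n}$, with M = 3 pairwise comparisons and n = 2 observations per music level. In practice the quantile would usually come from `scipy.stats.t.ppf`; to stay dependency-free, this sketch instead uses a closed-form quantile that is valid only for 2 degrees of freedom. The helper name `t_quantile_df2` is hypothetical.

```python
import math

# Bonferroni-corrected LSD for the three pairwise music comparisons.
mse = 0.5          # residual mean square from the ANOVA table
df_resid = 2       # residual degrees of freedom
n_per_level = 2    # observations per music level (one per lighting level)
n_tests = 3        # pairwise comparisons among three music levels
alpha = 0.05

def t_quantile_df2(p):
    """Quantile of the t-distribution with exactly 2 degrees of freedom.

    Inverts the closed-form CDF F(t) = 1/2 + t / (2 * sqrt(t**2 + 2)),
    which holds only for 2 degrees of freedom.
    """
    u = 2 * p - 1
    return u * math.sqrt(2 / (1 - u * u))

t_crit = t_quantile_df2(1 - alpha / (2 * n_tests))
lsd = t_crit * math.sqrt(2 * mse / n_per_level)
print(t_crit, lsd)
```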

Exercise II

Let X follow a normal distribution with mean 2 and variance 16.

Question II.1 (3)

What is the median of X?

1 □ median(X) = -4

2 □ median(X) = -2

3 □ median(X) = 0

4 □ median(X) = 2

5 □ median(X) = 4
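A quantile such as the median can be checked with the standard library. This is a small sketch; note that `statistics.NormalDist` is parameterised by the standard deviation, so a variance of 16 enters as 4.

```python
from statistics import NormalDist

# X ~ N(2, 16): NormalDist takes the standard deviation, so sigma = sqrt(16) = 4.
x_dist = NormalDist(mu=2, sigma=16 ** 0.5)
median_x = x_dist.inv_cdf(0.5)   # the 50% quantile
print(median_x)
```

Because the normal distribution is symmetric, the 50% quantile coincides with the mean.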

Exercise III

An engineer wants to test whether a new alloy has a tensile strength with a mean value of 500
MPa. A random sample of 30 specimens is tested, which gives a sample mean of 510 MPa
and a sample standard deviation of 20 MPa. It is assumed that the observations are iid and
normally distributed.

Question III.1 (4)

What is the corresponding p-value for the relevant hypothesis test with the following hypotheses:

$H_0: \mu = 500, \quad H_A: \mu \neq 500.$

1 □ p = 0.010

2 □ p = 0.621

3 □ p = 0.310

4 □ p = 0.006

5 □ p = 0.005
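The test statistic is $t_{obs} = (\bar{x} - \mu_0) / (s / \sqrt{n})$ with n − 1 degrees of freedom. The sketch below computes it and then obtains the two-sided p-value by numerically integrating the t density with Simpson's rule. That integration is a stand-in chosen to avoid extra dependencies (a library routine such as `scipy.stats.t.sf` would normally be used), and the helper names are hypothetical.

```python
import math

# Summary statistics from the exercise text.
n, xbar, s, mu0 = 30, 510.0, 20.0, 500.0
df = n - 1
t_obs = (xbar - mu0) / (s / math.sqrt(n))

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

def t_two_sided_p(t, df, steps=2000):
    """Two-sided p-value via Simpson's rule on [0, |t|] (steps must be even)."""
    upper = abs(t)
    h = upper / steps
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(k * h, df)
    central_half = total * h / 3          # integral of the density from 0 to |t|
    return 1 - 2 * central_half

p_value = t_two_sided_p(t_obs, df)
print(t_obs, p_value)
```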

The engineer is now planning a new experiment where he wants to achieve a “margin of error”
of at most 2 MPa. He uses the observed standard deviation as a scenario and a significance
level of α = 0.05.

Question III.2 (5)

What sample size should be taken to achieve a “margin of error” of at most 2 MPa?

1 □ approx. 20 observations

2 □ approx. 40 observations

3 □ approx. 1538 observations

4 □ approx. 16 observations

5 □ approx. 385 observations
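A common planning formula is $n = (z_{1-\alpha/2} \cdot \sigma / ME)^2$, rounded up. The sketch below applies it, assuming that the normal quantile is the intended one for planning purposes; the variable names are illustrative.

```python
import math
from statistics import NormalDist

sigma_guess = 20.0   # observed sample standard deviation used as the scenario
margin = 2.0         # desired margin of error (MPa)
alpha = 0.05

z = NormalDist().inv_cdf(1 - alpha / 2)                  # standard normal quantile
n_required = math.ceil((z * sigma_guess / margin) ** 2)  # round up to a whole specimen
print(z, n_required)
```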

Exercise IV

A coach wants to investigate whether there is a difference between different types of targeted
training in terms of improving the time it takes to run up stairs. The coach collects data from
15 participants, who are (randomly) divided into three equally sized groups: Group A, Group
B, and Group C. The coach has the participants perform targeted exercises over the next 4
weeks. Participants in the same group do the same exercises, but the coach assigns different
exercises to the three groups. For each participant, data is collected on the improvement in
the time it takes them to run up a staircase at the gym (the time improvement is measured in
seconds).

The observed time improvements are:

Group   Time improvement (measured in seconds)
A       2.1, 2.5, 2.3, 2.4, 2.2
B       2.8, 2.9, 2.7, 3.0, 2.6
C       2.3, 2.4, 2.5, 2.2, 2.1

The average time improvement for all 15 participants is µ̂ = 2.467, and the average time
improvements within each group are given by: µ̂A = 2.30, µ̂B = 2.80, µ̂C = 2.30. It can be
assumed that all observations are independent and normally distributed.

Question IV.1 (6)

What is the most appropriate statistical model and analysis when one wishes to examine
whether there is a difference in the effect of the different types of training?

1 □ An appropriate model could be $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$ ($\epsilon_{ij} \sim N(0, \sigma^2)$), where $Y_{ij}$ is the time
improvement of person number j in group number i. A relevant analysis would then be
to perform a t-test that tests the null hypothesis $H_0: \mu = 0$.

2 □ An appropriate model could be $Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$ ($\epsilon_i \sim N(0, \sigma^2)$), where $x_i$ is the time
improvement of person number i. A relevant analysis would then be to perform a t-test
that tests the null hypothesis $H_0: \beta_1 = 0$.

3 □ An appropriate model could be $Y_{ij} = \mu_i + \epsilon_{ij}$ ($\epsilon_{ij} \sim N(0, \sigma^2)$), where $Y_{ij}$ is the time
improvement of person number j in group number i. A relevant analysis would then be
to perform an analysis of variance that tests the null hypothesis $H_0: \mu_A = \mu_B = \mu_C = 0$.

4 □ An appropriate model could be $Y_{ij} = \beta_0 + \beta_i x_{ij} + \epsilon_{ij}$ ($\epsilon_{ij} \sim N(0, \sigma^2)$), where $x_{ij}$ is the
time improvement of person number j in group number i. A relevant analysis would then
be an analysis of variance that tests the null hypothesis $H_0: \beta_i = 0$ (that is, a total of 3
tests are performed – one for each group).

5 □ An appropriate model could be $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$ ($\epsilon_{ij} \sim N(0, \sigma^2)$), where $Y_{ij}$ is the time
improvement of person number j in group number i. A relevant analysis would then be
an analysis of variance that tests the null hypothesis $H_0: \alpha_A = \alpha_B = \alpha_C = 0$.

Exercise V

Assume that Y follows an exponential distribution with E(Y ) = 3.

Question V.1 (7)

What is P (2 < Y < 4)?

1 □ 0.49

2 □ 0.25

3 □ 0.61

4 □ 0.75

5 □ 0.0024
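For an exponential distribution the rate is the reciprocal of the mean, and an interval probability is a difference of two CDF values. A minimal sketch, with the helper name `exp_cdf` chosen for illustration:

```python
import math

mean_y = 3.0
rate = 1 / mean_y                       # lambda = 1 / E(Y)

def exp_cdf(y, rate):
    """CDF of the exponential distribution, F(y) = 1 - exp(-rate * y) for y >= 0."""
    return 1 - math.exp(-rate * y)

prob = exp_cdf(4, rate) - exp_cdf(2, rate)   # P(2 < Y < 4)
print(prob)
```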

Exercise VI

A pet store wants to investigate what proportion of Danish households have a dog. They
conduct a survey among 1000 of their customers, asking whether they have a dog. The store
assumes that these 1000 customers represent 1000 households.

Of these, 320 respond that they have a dog.

Question VI.1 (8)

What is the estimated proportion (p̂) of households that have a dog, and what is the uncertainty
(standard error, s.e.p̂ ) of this proportion?

1 □ p̂ = 0.32 and s.e.p̂ = 0.00022


2 □ p̂ = 0.32 and s.e.p̂ = 0.015
3 □ p̂ = 0.32 and s.e.p̂ = 0.047
4 □ p̂ = 0.32 and s.e.p̂ = 0.32
5 □ p̂ = 0.32 and s.e.p̂ = 0.010
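The estimated proportion and its standard error follow from $\hat{p} = x/n$ and $\sqrt{\hat{p}(1-\hat{p})/n}$. A short sketch with illustrative variable names:

```python
import math

n = 1000      # surveyed customers, taken to represent households
x = 320       # respondents with a dog

p_hat = x / n
se_p_hat = math.sqrt(p_hat * (1 - p_hat) / n)
print(p_hat, se_p_hat)
```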

Question VI.2 (9)

Official figures indicate that about 20% of Danes have a dog. The pet store had therefore
expected that their survey would result in a proportion closer to 0.20. Is it likely that their
result – that as many as 32% of households have a dog – is due to random variation? And
could it be true that the true proportion of Danish households with a dog is actually around
20%?

1 □ Yes, the pet store has randomly selected a sample where more than expected have a dog.
This is likely due to random variation, and the true proportion could well be around 20%.
2 □ No, it is unlikely that the pet store’s result is due to random variation. The p-value for
the relevant test is 0.0015, so we would reject the null hypothesis that the true proportion
is 0.20. Thus, we must conclude that the true proportion is probably not 20%.
3 □ No, it is unlikely that the pet store’s result is due to random variation. The p-value for the
relevant test is 0.0015, so we would reject the null hypothesis that the true proportion is
0.20. However, it is doubtful whether the sample is representative, so the true proportion
could still be 20%.
4 □ No, it is unlikely that the pet store’s result is due to random variation. The p-value
for the relevant test is $2 \cdot 10^{-21}$, so we would reject the null hypothesis that the true
proportion is 0.20. Since the sample is clearly representative, we must conclude that the
true proportion of households with a dog is probably not 20%.

5 □ No, it is unlikely that the pet store’s result is due to random variation. The p-value for the
relevant test is $2 \cdot 10^{-21}$, so we would reject the null hypothesis that the true proportion is
0.20. However, it is doubtful whether the sample is representative, so the true proportion
could still be 20%.
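A sketch of the one-sample proportion test, assuming the usual z-statistic with the null value $p_0$ in the standard error. The two-sided p-value $2(1 - \Phi(|z|))$ is computed as `erfc(|z| / sqrt(2))`, because `1 - cdf` would round to zero this far out in the tail.

```python
import math

n, x, p0 = 1000, 320, 0.20
p_hat = x / n
z_obs = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Two-sided p-value 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)); erfc
# keeps precision where 1 - cdf would underflow to exactly 0.0.
p_value = math.erfc(abs(z_obs) / math.sqrt(2))
print(z_obs, p_value)
```

The test only quantifies random variation; whether customers of a pet store represent Danish households is a separate question that the p-value cannot answer.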

Exercise VII

In a study, data from 4 different groups are available:

Group 1: 89, 102, 94, 90, 100
Group 2: 78, 46, 65, 72, 69
Group 3: 83, 89, 81, 89, 90
Group 4: 82, 101, 93, 88, 104

The table can be entered into Python using the following code.

import numpy as np
import pandas as pd

y = np.array([89, 102, 94, 90, 100,
              78, 46, 65, 72, 69,
              83, 89, 81, 89, 90,
              82, 101, 93, 88, 104])
Group = pd.Categorical([1,1,1,1,1,2,2,2,2,2,3,3,3,3,3,4,4,4,4,4])
D = pd.DataFrame({'y': y, 'Group': Group})

It can be assumed that the data can be described by the following model: $Y_{ij} = \mu + \alpha_i + \epsilon_{ij}$,
$\epsilon_{ij} \sim N(0, \sigma^2)$.

Question VII.1 (10)

What is the between group variation, MS(Group), and the within group variation, MSE?

1 □ MS(Group) = 190.3 and MSE = 1122

2 □ MS(Group) = 894.5 and MSE = 70.15

3 □ MS(Group) = 2683 and MSE = 1122

4 □ MS(Group) = 894.5 and MSE = 2683

5 □ MS(Group) = 190.3 and MSE = 70.15
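The two mean squares can be computed directly from the group data. The following is a plain-Python sketch of the one-way ANOVA decomposition (the exam itself would typically use `smf.ols` and `anova_lm`); all names are illustrative.

```python
groups = {
    1: [89, 102, 94, 90, 100],
    2: [78, 46, 65, 72, 69],
    3: [83, 89, 81, 89, 90],
    4: [82, 101, 93, 88, 104],
}

all_obs = [v for obs in groups.values() for v in obs]
n_total, k = len(all_obs), len(groups)
grand_mean = sum(all_obs) / n_total
group_means = {g: sum(obs) / len(obs) for g, obs in groups.items()}

# Between-group sum of squares: group size times squared deviation of each group mean.
ss_group = sum(len(obs) * (group_means[g] - grand_mean) ** 2
               for g, obs in groups.items())
# Within-group sum of squares: squared deviations from the observation's own group mean.
ss_error = sum((v - group_means[g]) ** 2
               for g, obs in groups.items() for v in obs)

ms_group = ss_group / (k - 1)
mse = ss_error / (n_total - k)
alpha_hat_1 = group_means[1] - grand_mean   # estimated effect of Group 1
print(ms_group, mse, alpha_hat_1)
```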

Question VII.2 (11)

Which statement about the model above is NOT correct?

1 □ Yij is observation number j in group number i. α̂i is group i’s average deviation from the
overall mean µ̂.
2 □ The total variance of the data (i.e. $\frac{1}{N-1} SST$) cannot be greater than $\hat{\sigma}^2$.

3 □ MSE represents the variance within each group, and since we assume it is the same across
all groups, we also have $MSE = \hat{\sigma}^2$.

4 □ If the variance of the αi ’s is large compared to the MSE, this means that there is a
difference between the groups.

5 □ In the data above (for Group 1, i = 1), we obtain α̂1 = 9.75.

Exercise VIII

Let $X \sim N(50, 2^2)$ and $Y \sim N(45, 5^2)$ be independent and form the random variables $V = 3X + Y$
and $U = \max(X, Y)$, i.e., U is the maximum of X and Y. Use simulation to answer
the following three questions.

Question VIII.1 (12)

What is the probability P (Y > X)?

1 □ Approximately 17.7%

2 □ Approximately 42.2%

3 □ Approximately 50.0%

4 □ Approximately 57.8%

5 □ Approximately 82.3%
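A simulation along the lines the exercise asks for can be sketched with the standard library. The seed and the number of draws are arbitrary choices made here for reproducibility, not values from the exam.

```python
import random

random.seed(2402)          # arbitrary seed, only for reproducibility
n_sim = 200_000

count_y_greater = 0
for _ in range(n_sim):
    x = random.gauss(50, 2)    # X ~ N(50, 2^2): second argument is the standard deviation
    y = random.gauss(45, 5)    # Y ~ N(45, 5^2)
    if y > x:
        count_y_greater += 1

p_est = count_y_greater / n_sim   # Monte Carlo estimate of P(Y > X)
print(p_est)
```

The same loop can collect `max(x, y)` and `3 * x + y` to study U and V.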

Question VIII.2 (13)

Which of the following plots shows the probability density function (pdf) of U ?
[Figure: five density plots labelled Plot A to Plot E, each drawn over the horizontal range 30 to 70. Vertical-axis maxima: Plot A 0.200, Plot B 0.200, Plot C 0.08, Plot D 0.12, Plot E 0.10.]

1 □ Plot A

2 □ Plot B

3 □ Plot C

4 □ Plot D

5 □ Plot E

Question VIII.3 (14)

Which of the following plots most likely shows a scatter plot of 100 observations of (X, V )?
[Figure: five scatter plots labelled Plot A to Plot E, each showing V (vertical axis) against X (horizontal axis, 44 to 54). Vertical-axis ranges: Plot A about 130 to 170, Plot B about 180 to 210, Plot C about 120 to 240, Plot D about 240 to 320, Plot E about 180 to 210.]

1 □ Plot A

2 □ Plot B

3 □ Plot C

4 □ Plot D

5 □ Plot E

Exercise IX

The police set up a traffic checkpoint where they randomly select cars for inspection. On
average, they check 10 cars per hour, and the number of cars selected for inspection in an
hour follows a Poisson distribution. Let X denote the total number of cars selected during a
five-hour period.

Question IX.1 (15)

What is the most appropriate model for X?

1 □ X follows a binomial distribution with parameters n = 5 and p = 0.1.

2 □ X follows a binomial distribution with parameters n = 5 and p = 0.5.

3 □ X follows a Poisson distribution with rate λ = 2.

4 □ X follows a Poisson distribution with rate λ = 10.

5 □ X follows a Poisson distribution with rate λ = 50.

Question IX.2 (16)

What is the probability that no cars are selected for inspection during a given hour?

1 □ Approximately 0%

2 □ Approximately 10%

3 □ Approximately 20%

4 □ Approximately 80%

5 □ Approximately 90%
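For a Poisson process the expected count scales with the length of the interval, and single probabilities follow from the Poisson pmf. A minimal sketch, with a hypothetical helper `poisson_pmf`:

```python
import math

rate_per_hour = 10
hours = 5

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

lam_five_hours = rate_per_hour * hours            # mean count over five hours
p_zero_one_hour = poisson_pmf(0, rate_per_hour)   # no cars selected in one hour
print(lam_five_hours, p_zero_one_hour)
```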

Question IX.3 (17)

Historical data suggests that 40% of inspections result in a citation. If, during a given hour,
the police randomly select three cars for inspection, what is the probability that exactly two
of the inspections result in a citation?

1 □ 0.096

2 □ 0.144

3 □ 0.288

4 □ 0.432

5 □ 0.720
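With three independent inspections and a citation probability of 0.40 each, the number of citations is binomial. A short sketch using `math.comb`; the helper name is illustrative.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

p_two_of_three = binom_pmf(2, 3, 0.40)   # exactly two citations in three inspections
print(p_two_of_three)
```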

Exercise X

A couple is planning to buy a house and wants to estimate the expected price. They collect
data on recent property sales in the area and use Python to fit a multiple linear regression
model with property price as the response variable and the sizes of the house and lot (in square
meters) as explanatory variables.

They are particularly interested in two very similar properties. Property A has a house of
175 square meters and a lot of 800 square meters, while Property B has the same lot size but
a house that is 10 square meters smaller i.e., 165 square meters.

The couple obtains the following output from their Python model, where the price is given
in tkr. (tusinde kroner - DKK thousands).

OLS Regression Results
==============================================================================
Dep. Variable: Price R-squared: 0.892
Model: OLS Adj. R-squared: 0.887
No. Observations: 45 F-statistic: 173.8
Covariance Type: nonrobust Prob (F-statistic): 4.84e-21
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 585.8165 613.968 0.954 0.345 -653.220 1824.853
House 43.9742 2.757 15.951 0.000 38.411 49.538
Lot 3.1214 0.275 11.334 0.000 2.566 3.677
==============================================================================

Question X.1 (18)

What is the expected price of Property A according to the model?

1 □ Approximately DKK 10.193 M

2 □ Approximately DKK 10.778 M

3 □ Approximately DKK 20.200 M

4 □ Approximately DKK 35.726 M

5 □ Approximately DKK 36.311 M
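The expected price is the fitted linear predictor evaluated at the property's house and lot sizes. A sketch using the coefficients from the output above; the dictionary names are illustrative.

```python
# Coefficients copied from the regression output (prices in tkr., i.e. DKK thousands).
coef = {"Intercept": 585.8165, "House": 43.9742, "Lot": 3.1214}
property_a = {"House": 175, "Lot": 800}   # sizes in square meters

price_a_tkr = coef["Intercept"] + sum(coef[name] * size
                                      for name, size in property_a.items())
price_a_mdkk = price_a_tkr / 1000         # convert DKK thousands to DKK millions
print(price_a_tkr, price_a_mdkk)
```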

A real estate agent claims that Property B should cost 400,000 less than Property A, implying
that each square meter of house is worth 40,000. The couple wants to evaluate whether the
data support this claim (null hypothesis) based on the fitted model.

Question X.2 (19)

The usual test of the real estate agent’s claim (null hypothesis) results in which p-value?

1 □ p = 0.922

2 □ p = 0.692

3 □ p = 0.345

4 □ p = 0.157

5 □ p < 0.001

The couple validates the model and generates the following diagnostic plots:
[Figure: four diagnostic plots. (1) Normal QQ-plot of residuals (Sample Quantiles against Theoretical Quantiles). (2) Scatter plot of residuals (tkr.) against fitted values (tkr.). (3) Scatter plot of residuals (tkr.) against house size (sq. m). (4) Scatter plot of residuals (tkr.) against lot size (sq. m).]

Question X.3 (20)

Which of the following statements is false?

1 □ The normal QQ-plot of the residuals suggests that the residuals may be slightly right-
skewed but shows no obvious violation of the normality assumption.

2 □ The normal QQ-plot of the residuals suggests that the residuals may be inter-dependent
but shows no obvious violation of the independence assumption.

3 □ The scatter plot of residuals versus fitted values suggests that the variance of the residuals
may increase with the fitted values but shows no obvious violation of the homoscedasticity
(variance homogeneity) assumption.

4 □ The scatter plot of residuals versus house size suggests that the variance of the residuals
may increase with house size but shows no obvious violation of the homoscedasticity
(variance homogeneity) assumption.

5 □ The scatter plot of residuals versus lot size suggests that the variance of the residuals may
increase with lot size but shows no obvious violation of the homoscedasticity (variance
homogeneity) assumption.

Exercise XI

One wants to compare the means in two samples, A and B. Both samples contain 50 independent
measurements, which are assumed to be normally distributed. It is stated that the 95%
confidence interval for the mean in each group is:

95% CI for µ̂A = [15.2, 17.8]

95% CI for µ̂B = [13.0, 15.5].

Question XI.1 (21)

Which of the following statements is correct?

1 □ Since the confidence intervals overlap, the two underlying populations could have the
same mean. Therefore, we can easily see that the difference between the two sample
means is not significantly different from zero (at a 5% significance level).

2 □ Since the confidence intervals overlap, the difference between the two sample means is
not statistically significantly different from zero (at a 5% significance level). Thus, the
two underlying populations have the same distribution.

3 □ The sample means for sample A and B are 16.50 and 14.25, respectively, and there is
a significant difference between these means at a 5% significance level (but not at a 1%
significance level).

4 □ The sample means for sample A and B are 16.50 and 14.25, respectively, and there is a
significant difference between these means (at a 1% significance level).

5 □ The sample means for sample A and B are 16.50 and 14.25, respectively, but there is no
significant difference between these means (at a 5% significance level).

Exercise XII

Capture-recapture is a method in which a number of individuals (animals) are captured, tagged,
and released. After a period of time, a number of individuals are captured and it is examined
how many individuals are tagged. The method can be used to estimate population sizes.

A biologist has captured n1 = 150 fish in a lake, tagged them, and released them again. The
biologist now plans to return and capture n2 = 200 fish from the same lake.

Question XII.1 (22)

If we denote the total number of fish in the lake by N (and assume that N is the same when
released and recaptured), what distribution will the number of tagged fish (Y ) then follow at
recapture (it is assumed that all tagged fish survive and that it is completely random which of
the N fish are captured)?

1 □ A binomial distribution with $p = \frac{150}{N}$ and $n = 200$, i.e. $Y \sim B\left(200, \frac{150}{N}\right)$.

2 □ A normal distribution with parameters $\mu = \frac{200 \cdot 150}{N}$ and $\sigma^2 = \frac{200 \cdot 150}{N}\left(1 - \frac{150}{N}\right)$.

3 □ A hypergeometric distribution with parameters n = 200, a = 150, and N, i.e. $Y \sim H(200, 150, N)$.

4 □ A Poisson distribution with parameter $\lambda = \frac{150 \cdot 200}{N}$, i.e. $Y \sim Pois\left(\frac{150 \cdot 200}{N}\right)$.

5 □ An exponential distribution with parameter $\lambda = \frac{N}{150 \cdot 200}$, i.e. $Y \sim Exp\left(\frac{N}{150 \cdot 200}\right)$.

The length of the caught fish is measured in order to provide an estimate of their age. Thus, fish
between 6 and 10 cm. are classified as 1-year-old, while fish of more than 10 cm. are classified
as older. It is assumed that the length of a one-year-old fish follows a normal distribution with
mean µ = 8 cm. and standard deviation σ = 1 cm.

Question XII.2 (23)

What is the probability that a one-year-old fish is classified as older than one year?

1 □ 0.159
2 □ 0.5
3 □ 0.841
4 □ 0.0228
5 □ 0.977
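A one-year-old fish is classified as older when its length exceeds 10 cm, so the probability is an upper-tail area of the stated normal distribution. A minimal sketch with the standard library:

```python
from statistics import NormalDist

length_one_year = NormalDist(mu=8, sigma=1)    # length in cm of a one-year-old fish
p_classified_older = 1 - length_one_year.cdf(10)
print(p_classified_older)
```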

Exercise XIII

A consumer organization wants to investigate how often a parcel delivery company delivers
packages to the nearest parcel shop. They collected data from 750 parcel deliveries, distributed
across 5 regions. The results are summarized in the following table:

Region             Delivered to the nearest parcel shop   Delivered elsewhere   Total
Capital region                                       40                   110     150
Central Jutland                                      95                    55     150
Southern Denmark                                     80                    70     150
North Jutland                                        85                    65     150
Zealand                                              70                    80     150
Total                                               370                   380     750

A $\chi^2$-test is now performed to investigate whether the proportion of packages delivered to the
nearest parcel shop is the same across all 5 regions.

Question XIII.1 (24)

What are the expected values in each cell of the table under the null hypothesis?

1 □
Region             Delivered to the nearest parcel shop   Delivered elsewhere
Capital region                                       75                    75
Central Jutland                                      75                    75
Southern Denmark                                     75                    75
North Jutland                                        75                    75
Zealand                                              75                    75

2 □
Region             Delivered to the nearest parcel shop   Delivered elsewhere
Capital region                                       80                    70
Central Jutland                                      70                    80
Southern Denmark                                     60                    90
North Jutland                                        50                   100
Zealand                                              40                   110

3 □
Region             Delivered to the nearest parcel shop   Delivered elsewhere
Capital region                                      100                     0
Central Jutland                                     100                     0
Southern Denmark                                    100                     0
North Jutland                                       100                     0
Zealand                                             100                     0

4 □
Region             Delivered to the nearest parcel shop   Delivered elsewhere
Capital region                                       74                    76
Central Jutland                                      74                    76
Southern Denmark                                     74                    76
North Jutland                                        74                    76
Zealand                                              74                    76

5 □
Region             Delivered to the nearest parcel shop   Delivered elsewhere
Capital region                                       40                   110
Central Jutland                                      95                    55
Southern Denmark                                     80                    70
North Jutland                                        85                    65
Zealand                                              70                    80
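Under the null hypothesis of equal proportions, each expected count is the row total times the column total divided by the grand total. The sketch below computes these from the observed table and also accumulates the $\chi^2$ statistic; the variable names are illustrative.

```python
observed = {
    "Capital region":   (40, 110),
    "Central Jutland":  (95, 55),
    "Southern Denmark": (80, 70),
    "North Jutland":    (85, 65),
    "Zealand":          (70, 80),
}

row_totals = {region: sum(counts) for region, counts in observed.items()}
col_totals = [sum(counts[j] for counts in observed.values()) for j in range(2)]
grand_total = sum(row_totals.values())

# Expected count = row total * column total / grand total.
expected = {
    region: tuple(row_totals[region] * col / grand_total for col in col_totals)
    for region in observed
}

# Chi-square statistic: sum over all cells of (observed - expected)^2 / expected.
chi2_stat = sum(
    (o - e) ** 2 / e
    for region in observed
    for o, e in zip(observed[region], expected[region])
)
print(expected["Capital region"], chi2_stat)
```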

Question XIII.2 (25)

A $\chi^2$-test is performed to investigate whether the proportion of packages delivered to the nearest
parcel shop is the same across all 5 regions. The relevant test statistic has been calculated as
47.21. What is the p-value for the relevant test, and what is the corresponding conclusion (use
a significance level of α = 0.05)?

1 □ The p-value is 0.10, and the conclusion is that there is a difference in the proportion of
packages delivered to the nearest parcel shop in the different regions.

2 □ The p-value is 0.10, and the conclusion is that there is no difference in the proportion of
packages delivered to the nearest parcel shop in the different regions.

3 □ The p-value is 0.05, and the conclusion is that there is no difference in the proportion of
packages delivered to the nearest parcel shop in the different regions.

4 □ The p-value is $1.4 \cdot 10^{-9}$, and the conclusion is that there is no difference in the proportion
of packages delivered to the nearest parcel shop in the different regions.

5 □ The p-value is $1.4 \cdot 10^{-9}$, and the conclusion is that there is a difference in the proportion
of packages delivered to the nearest parcel shop in the different regions.
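For an r × c table the test has (r − 1)(c − 1) degrees of freedom. The sketch below obtains the p-value from a closed form of the $\chi^2$ survival function that is valid only for an even number of degrees of freedom. This is a dependency-free stand-in for a library call such as `scipy.stats.chi2.sf`, and the helper name is hypothetical.

```python
import math

chi2_stat = 47.21                   # test statistic given in the question
n_rows, n_cols = 5, 2
df = (n_rows - 1) * (n_cols - 1)

def chi2_sf_even_df(x, df):
    """P(Q > x) for a chi-square distribution with an even number of degrees of freedom.

    Uses exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)**k / k!, valid only when df is even.
    """
    if df % 2 != 0:
        raise ValueError("closed form requires an even number of degrees of freedom")
    half = x / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k)
                                 for k in range(df // 2))

p_value = chi2_sf_even_df(chi2_stat, df)
print(df, p_value)
```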

Exercise XIV

A sensor measures the temperature of a machine that should not exceed 80 °C. A sample of 40
temperature measurements gives a sample mean of 75.2 °C and a sample standard deviation of
2.5 °C. Assume that the temperature measurements are independent of each other and follow a
normal distribution.

Question XIV.1 (26)

What is a 95% confidence interval for the true mean temperature?

1 □ [72.7, 77.7] °C

2 □ [70.1, 80.3] °C

3 □ [74.4, 76.0] °C

4 □ [75.1, 75.3] °C

5 □ [71.0, 79.4] °C

Exercise XV

A car owner wants to buy a new (used) car and wants to investigate what the price should be. She
has collected prices for the car (make and model) she wants. The histogram below shows the
distribution of prices for the car she wants.
[Figure: histogram of Price [1000 kr] (horizontal axis, 50 to 350) against Frequency (vertical axis, 0 to 50).]

In addition to the histogram, she has calculated the average and empirical variance for the
observed prices (price) [1000 kr.].

np.mean(price)

np.float64(118.94622598870056)

np.var(price,ddof=1)

np.float64(3633.1131006077294)

Question XV.1 (27)

Based on the above, which of the following assumptions about the distribution of the price (Y )
is the most reasonable?

1 □ A log-normal distribution with parameters $\alpha = 4.66$ and $\beta^2 = 0.478^2$, i.e. $Y \sim LN(4.66, 0.478^2)$.
2 □ A normal distribution with parameters $\mu = 118.94$ and $\sigma^2 = 3633.1^2$, i.e. $Y \sim N(118.9, 3633.1^2)$.
3 □ An exponential distribution with parameter $\lambda = 118.9$, i.e. $Y \sim Exp(118.9)$.
4 □ A log-normal distribution with parameters $\alpha = 118.9$ and $\beta^2 = 60.28^2$, i.e. $Y \sim LN(118.9, 60.28^2)$.
5 □ A normal distribution with parameters $\mu = 118.94$ and $\sigma^2 = 60.28^2$, i.e. $Y \sim N(118.9, 60.28^2)$.
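Candidate parameters can be checked against the reported mean and variance by moment matching. For $Y \sim LN(\alpha, \beta^2)$ we have $E(Y) = e^{\alpha + \beta^2/2}$ and $V(Y) = (e^{\beta^2} - 1) E(Y)^2$, and the sketch below solves these two equations for $\alpha$ and $\beta$. Moment matching is an assumption made here for illustration; the exam does not state how the options were derived.

```python
import math

mean_price = 118.94622598870056     # np.mean(price) from the output above
var_price = 3633.1131006077294      # np.var(price, ddof=1) from the output above

# Solve E(Y) = exp(alpha + beta^2 / 2) and V(Y) = (exp(beta^2) - 1) * E(Y)^2.
beta_sq = math.log(1 + var_price / mean_price ** 2)
alpha = math.log(mean_price) - beta_sq / 2
beta = math.sqrt(beta_sq)

sd_price = math.sqrt(var_price)     # standard deviation on the original price scale
print(alpha, beta, sd_price)
```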

The car owner has also collected data on the age and mileage of the cars. To investigate the
relationship between price, age and mileage of the car, the car owner has run the following code
(price [1000 kr.] is the price, age [years] is the age of the car, dist [1000 km.] is the mileage
of the car, and cars is the dataset with the collected numbers),

fit = smf.ols("price ~ age + dist", data = cars).fit()

corresponding to the model

$Y_i = \mu_i + \epsilon_i,$

where the formula for $\mu_i$ is determined from the input to smf.ols and $E(\epsilon_i) = 0$.

Question XV.2 (28)

Which of the following statements about the model or model assumptions is correct?

1 □ The $\epsilon_i$'s are normally distributed and iid. (independent and identically distributed).
2 □ It is assumed that the observed correlation between age and dist is equal to zero.
3 □ The $Y_i$'s are normally distributed and iid. (independent and identically distributed).
4 □ The $\mu_i$'s follow a normal distribution and are iid.
5 □ $\mu_i = \beta_1 x_{1i} + \beta_2 x_{2i}$, where $x_{1i}$ and $x_{2i}$ are the age and mileage of car i, respectively.

The results of the estimation above are given below (some numbers have been replaced with
symbols)

fit.summary(slim = True)

OLS Regression Results
==============================================================================
Dep. Variable: price R-squared: 0.793
Model: OLS Adj. R-squared: 0.791
No. Observations: 177 F-statistic: 334.0
Covariance Type: nonrobust Prob (F-statistic): 2.65e-60
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 255.8098 5.707 T1 P1 Ql_1 Qu_1
age -9.6077 0.963 T2 P2 Ql_2 Qu_2
dist -0.4756 0.055 T3 P3 Ql_3 Qu_3
==============================================================================

Question XV.3 (29)

Which of the following statements is correct when using a significance level of α = 0.05?

1 □ Both effects (age and mileage) are significantly different from zero and the expected price
decreases with age and mileage.

2 □ None of the effects (age and mileage) are significantly different from zero.

3 □ The age of the car has a significant effect on the price, while an effect of the mileage
cannot be demonstrated. The price increases as the age increases.

4 □ Mileage has a significant effect on price, while an effect of age cannot be demonstrated.
The price decreases as mileage increases.

5 □ The age of the car has a significant effect on the price, while an effect of the mileage
cannot be demonstrated. The price decreases as the age increases.
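The masked t-values in the summary are the coefficients divided by their standard errors, with residual degrees of freedom equal to the number of observations minus the number of estimated coefficients. A sketch using the numbers printed above; the dictionary names are illustrative.

```python
# Values copied from the regression summary above.
coef = {"Intercept": 255.8098, "age": -9.6077, "dist": -0.4756}
std_err = {"Intercept": 5.707, "age": 0.963, "dist": 0.055}
n_obs = 177

df_resid = n_obs - len(coef)                      # observations minus estimated coefficients
t_values = {name: coef[name] / std_err[name] for name in coef}

for name, t in t_values.items():
    print(name, t)
```

With 174 degrees of freedom the two-sided 5% critical value is roughly 1.97, so the size of each |t| relative to that value, together with the sign of the coefficient, settles the question.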

Exercise XVI

A consultant has received data on arrival and departure times for 35 employees at a given
workplace. Arrival and departure times are recorded for the same 35 employees on two different
days - one day in the summer and one day in the winter. The consultant now wishes to assess
whether the average working hours are the same on both days.

Question XVI.1 (30)

Which analysis is relevant to perform?

1 □ For each arrival and departure time, the working hours are calculated. Now, you have
two independent samples (one for the summer day and one for the winter day) with 35
measurements in each. The means of these samples are compared using a t-test with the
null hypothesis H0 : µ1 = µ2 .

2 □ For each of the two days, the average arrival time and the average departure time are
calculated. Then, two t-tests are performed: one t-test tests for a significant difference
in arrival times, and the other tests for a significant difference in departure times.

3 □ For each arrival and departure time, the working hours are calculated. Now, you have two
paired samples (one for the summer day and one for the winter day) with 35 measurements
in each. A paired t-test is used to examine whether the average difference in working hours
is significantly different from zero.

4 □ For each of the two days, a 95% confidence interval is calculated for the average arrival
time, $CI_{\bar{x}_{arrive}}$, and for the average departure time, $CI_{\bar{x}_{leave}}$. If the two confidence intervals
do not overlap, there is a significant difference in the average working hours between the
two days.

5 □ For each arrival and departure time, a total working time is calculated. Now, you have
two samples with 35 measurements. A one-way ANOVA model is used to test whether
there is a difference in the average working hours between the two days.

The exam is finished. Enjoy the vacation!
