2025jun 02402-Python en
This exam consists of 30 questions of the “multiple choice” type, which are divided between
16 exercises. To answer the questions, you need to fill in the “multiple choice” form on
exam.dtu.dk.
5 points are given for a correct “multiple choice” answer, and −1 point is given for a wrong
answer. ONLY the following 5 answer options are valid: 1, 2, 3, 4, or 5. If a question is left
blank or an invalid answer is entered, 0 points are given for the question. Furthermore, if more
than one answer option is selected for a single question, which is in fact technically possible in
the online system, 0 points are given for the question. The number of points needed to obtain
a specific mark or to pass the exam is ultimately determined during censoring.
The final answers should be given by filling in and submitting the form.
The table provided here is ONLY an emergency alternative.
Remember to provide your student number if you do hand in on paper.
Exercise I.1 I.2 II.1 III.1 III.2 IV.1 V.1 VI.1 VI.2 VII.1
Question (1) (2) (3) (4) (5) (6) (7) (8) (9) (10)
Answer
Exercise VII.2 VIII.1 VIII.2 VIII.3 IX.1 IX.2 IX.3 X.1 X.2 X.3
Question (11) (12) (13) (14) (15) (16) (17) (18) (19) (20)
Answer
Exercise XI.1 XII.1 XII.2 XIII.1 XIII.2 XIV.1 XV.1 XV.2 XV.3 XVI.1
Question (21) (22) (23) (24) (25) (26) (27) (28) (29) (30)
Answer
Multiple choice questions: Note that in each question, one and only one of the answer
options is correct. Furthermore, not all the suggested answers are necessarily meaningful.
Always remember to round your own result to the number of decimals given in the answer options
before you choose your answer. Also remember that there may be slight discrepancies between
the result of the book’s formulas and corresponding built-in functions in Python.
Exercise I
A company wants to investigate the effect of lighting and music conditions on employee
productivity (measured in the number of units produced per hour).
All combinations are tested, and the employees’ average productivity is measured:
where ϵij are iid., Yij (y) is productivity, and αi and βj represent the effects of Music (Music)
and Lighting (Light). To investigate the effects, the following result has been obtained from
Python (data from the table above is stored in D).
Question I.1 (1)
What is the conclusion from the relevant statistical tests, using a significance level of α = 0.05?
1 □ There is a significant difference in productivity both with respect to different lighting and
different music.
2 □ There is a significant difference in productivity with respect to different lighting, but not
with respect to different music.
3 □ There is a significant difference in productivity with respect to different music, but not
with respect to different lighting.
5 □ There is a significant difference in productivity with respect to different music, but one
cannot conclude whether there is an effect of different lighting, since there are only two
levels.
To perform further investigations on the effect of music, pairwise comparisons of its effects need
to be made. While correcting for multiple tests, what is the Least Significant Distance (LSD)
for the pairwise comparisons of the effect of music (using a significance level of α = 0.05)?
1 □ 2.5
2 □ 2.0
3 □ 1.5
4 □ 5.4
5 □ 6.2
Exercise II
1 □ median(X) = -4
2 □ median(X) = -2
3 □ median(X) = 0
4 □ median(X) = 2
5 □ median(X) = 4
Exercise III
An engineer wants to test whether a new alloy has a tensile strength with a mean value of 500
MPa. A random sample of 30 specimens is tested, which gives a sample mean of 510 MPa
and a sample standard deviation of 20 MPa. It is assumed that the observations are iid and
normally distributed.
What is the corresponding p-value for the relevant hypothesis test with the following hypotheses:
H0: µ = 500, HA: µ ≠ 500.
1 □ p = 0.010
2 □ p = 0.621
3 □ p = 0.310
4 □ p = 0.006
5 □ p = 0.005
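A minimal sketch of how the two-sided p-value could be computed from the summary statistics stated in the exercise (n = 30, sample mean 510, sample standard deviation 20, µ0 = 500). To avoid third-party packages, it integrates the Student t density numerically with Simpson's rule; the helper names `t_pdf` and `t_upper_tail` are illustrative assumptions, and in practice a library routine such as `scipy.stats.t.sf` would normally be used instead.

```python
import math


def t_pdf(x, df):
    """Density of the Student t distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def t_upper_tail(t, df, upper=60.0, steps=20_000):
    """Approximate P(T > t) with Simpson's rule on [t, upper]; steps must be even."""
    h = (upper - t) / steps
    total = t_pdf(t, df) + t_pdf(upper, df)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * t_pdf(t + k * h, df)
    return total * h / 3


# Summary statistics taken from the exercise text
n, mean, sd, mu0 = 30, 510.0, 20.0, 500.0

t_obs = (mean - mu0) / (sd / math.sqrt(n))        # observed test statistic
p_value = 2 * t_upper_tail(abs(t_obs), df=n - 1)  # two-sided p-value

print(f"t_obs = {t_obs:.3f}, p-value = {p_value:.3f}")
```

The truncation point `upper=60.0` is a convenience choice: the tail mass of a t distribution with 29 degrees of freedom beyond that point is negligible.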
The engineer is now planning a new experiment where he wants to achieve a "margin of error"
of at most 2 MPa. He uses the observed standard deviation as a scenario and a significance
level of α = 0.05.
What sample size should be taken to achieve a "margin of error" of at most 2 MPa?
1 □ approx. 20 observations
2 □ approx. 40 observations
4 □ approx. 16 observations
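The sample-size calculation can be sketched as follows, assuming the common planning formula n = (z_{1−α/2} · σ / ME)², with the observed standard deviation of 20 MPa used as the scenario value for σ. The use of the normal quantile (rather than a t quantile) and the variable names are assumptions of this sketch.

```python
import math
from statistics import NormalDist

sigma = 20.0   # scenario standard deviation (MPa), from the first sample
margin = 2.0   # desired margin of error (MPa)
alpha = 0.05   # significance level

# Standard normal quantile z_{1 - alpha/2}
z = NormalDist().inv_cdf(1 - alpha / 2)

# Smallest integer n with z * sigma / sqrt(n) <= margin
n_required = math.ceil((z * sigma / margin) ** 2)

print(f"z = {z:.3f}, required sample size = {n_required}")
```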
Exercise IV
A coach wants to investigate whether there is a difference between different types of targeted
training in terms of improving the time it takes to run up stairs. The coach collects data from
15 participants, who are (randomly) divided into three equally sized groups: Group A, Group
B, and Group C. The coach has the participants perform targeted exercises over the next 4
weeks. Participants in the same group do the same exercises, but the coach assigns different
exercises to the three groups. For each participant, data is collected on the improvement in
the time it takes them to run up a staircase at the gym (the time improvement is measured in
seconds).
The average time improvement for all 15 participants is µ̂ = 2.467, and the average time
improvements within each group are given by: µ̂A = 2.30, µ̂B = 2.80, µ̂C = 2.30. It can be
assumed that all observations are independent and normally distributed.
What is the most appropriate statistical model and analysis when one wishes to examine
whether there is a difference in the effect of the different types of training?
1 □ An appropriate model could be Yij = µ + αi + ϵij (ϵij ∼ N(0, σ²)), where Yij is the time
improvement of person number j in group number i. A relevant analysis would then be
to perform a t-test that tests the null hypothesis H0: µ = 0.
3 □ An appropriate model could be Yij = µi + ϵij (ϵij ∼ N(0, σ²)), where Yij is the time
improvement of person number j in group number i. A relevant analysis would then be
to perform an analysis of variance that tests the null hypothesis H0: µA = µB = µC = 0.
4 □ An appropriate model could be Yij = β0 + βi xij + ϵij (ϵij ∼ N(0, σ²)), where xij is the
time improvement of person number j in group number i. A relevant analysis would then
be an analysis of variance that tests the null hypothesis H0: βi = 0 (that is, a total of 3
tests are performed, one for each group).
5 □ An appropriate model could be Yij = µ + αi + ϵij (ϵij ∼ N(0, σ²)), where Yij is the time
improvement of person number j in group number i. A relevant analysis would then be
an analysis of variance that tests the null hypothesis H0: αA = αB = αC = 0.
Exercise V
1 □ 0.49
2 □ 0.25
3 □ 0.61
4 □ 0.75
5 □ 0.0024
Exercise VI
A pet store wants to investigate what proportion of Danish households have a dog. They
conduct a survey among 1000 of their customers, asking whether they have a dog. The store
assumes that these 1000 customers represent 1000 households.
What is the estimated proportion (p̂) of households that have a dog, and what is the uncertainty
(standard error, s.e.p̂ ) of this proportion?
Official figures indicate that about 20% of Danes have a dog. The pet store had therefore
expected that their survey would result in a proportion closer to 0.20. Is it likely that their
result – that as many as 32% of households have a dog – is due to random variation? And
could it be true that the true proportion of Danish households with a dog is actually around
20%?
1 □ Yes, the pet store has randomly selected a sample where more than expected have a dog.
This is likely due to random variation, and the true proportion could well be around 20%.
2 □ No, it is unlikely that the pet store’s result is due to random variation. The p-value for
the relevant test is 0.0015, so we would reject the null hypothesis that the true proportion
is 0.20. Thus, we must conclude that the true proportion is probably not 20%.
3 □ No, it is unlikely that the pet store’s result is due to random variation. The p-value for the
relevant test is 0.0015, so we would reject the null hypothesis that the true proportion is
0.20. However, it is doubtful whether the sample is representative, so the true proportion
could still be 20%.
4 □ No, it is unlikely that the pet store’s result is due to random variation. The p-value
for the relevant test is 2 · 10⁻²¹, so we would reject the null hypothesis that the true
proportion is 0.20. Since the sample is clearly representative, we must conclude that the
true proportion of households with a dog is probably not 20%.
5 □ No, it is unlikely that the pet store’s result is due to random variation. The p-value for the
relevant test is 2 · 10⁻²¹, so we would reject the null hypothesis that the true proportion is
0.20. However, it is doubtful whether the sample is representative, so the true proportion
could still be 20%.
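A short sketch of the calculations behind this exercise, taking p̂ = 0.32 and n = 1000 from the text. It computes the standard error of the estimated proportion and the two-sided p-value of the usual one-sample proportion z-test of H0: p = 0.20. The variable names are illustrative, and `math.erfc` is used because the tail probability is too small to obtain reliably as 1 − Φ(z).

```python
import math

n = 1000       # number of surveyed customers
p_hat = 0.32   # observed proportion with a dog (from the exercise text)
p0 = 0.20      # proportion under the null hypothesis

# Standard error of the estimated proportion
se_hat = math.sqrt(p_hat * (1 - p_hat) / n)

# Test statistic uses the standard error under H0
z_obs = (p_hat - p0) / math.sqrt(p0 * (1 - p0) / n)

# Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z_obs) / math.sqrt(2))

print(f"se = {se_hat:.4f}, z = {z_obs:.2f}, p-value = {p_value:.1e}")
```

The calculation says nothing about whether customers of a pet store are representative of Danish households; that judgement has to be made separately.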
Exercise VII
The table can be entered into Python using the following code.
It can be assumed that the data can be described by the following model: Yij = µ + αi + ϵij,
ϵij ∼ N(0, σ²).
What is the between group variation, MS(Group), and the within group variation, MSE?
Question VII.2 (11)
1 □ Yij is observation number j in group number i. α̂i is group i’s average deviation from the
overall mean µ̂.
2 □ The total variance of the data (i.e. (1/(N − 1)) · SST) cannot be greater than σ̂².
3 □ MSE represents the variance within each group, and since we assume it is the same across
all groups, we also have MSE = σ̂².
4 □ If the variance of the αi’s is large compared to the MSE, this means that there is a
difference between the groups.
Exercise VIII
Let X ∼ N(50, 2²) and Y ∼ N(45, 5²) be independent and form the random variables V =
3X + Y and U = max(X, Y), i.e., U is the maximum of X and Y. Use simulation to answer
the following three questions.
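A minimal simulation sketch for this setup, using only the standard library. The seed, the number of simulations `k`, and the two summary quantities computed at the end are illustrative choices of this sketch, not quantities requested by the exam text.

```python
import random
from statistics import NormalDist, fmean

random.seed(2025)   # arbitrary seed, for reproducibility only
k = 100_000         # number of simulated pairs (an assumption)

# Note: random.gauss takes the standard deviation, not the variance
x = [random.gauss(50, 2) for _ in range(k)]
y = [random.gauss(45, 5) for _ in range(k)]

v = [3 * xi + yi for xi, yi in zip(x, y)]     # V = 3X + Y
u = [max(xi, yi) for xi, yi in zip(x, y)]     # U = max(X, Y)

# Example summaries: the mean of V and the share of pairs where X is the maximum
mean_v = fmean(v)
share_x_largest = sum(xi > yi for xi, yi in zip(x, y)) / k

# Exact value for comparison: X - Y ~ N(5, 2^2 + 5^2)
exact_share = NormalDist().cdf(5 / (2**2 + 5**2) ** 0.5)

print(f"mean(V) = {mean_v:.2f}, P(X > Y) simulated = {share_x_largest:.3f}, "
      f"exact = {exact_share:.3f}")
```

A histogram of `u` or a scatter plot of `x` against `v` (for example with matplotlib) could be drawn from the same simulated lists.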
1 □ Approximately 17.7%
2 □ Approximately 42.2%
3 □ Approximately 50.0%
4 □ Approximately 57.8%
5 □ Approximately 82.3%
Question VIII.2 (13)
Which of the following plots shows the probability density function (pdf) of U ?
[Figure: five candidate density plots, Plot A to Plot E, each with a horizontal axis running from 30 to 70.]
1 □ Plot A
2 □ Plot B
3 □ Plot C
4 □ Plot D
5 □ Plot E
Question VIII.3 (14)
Which of the following plots most likely shows a scatter plot of 100 observations of (X, V )?
[Figure: five candidate scatter plots, Plot A to Plot E, with X on the horizontal axis (44 to 54) and V on the vertical axis.]
1 □ Plot A
2 □ Plot B
3 □ Plot C
4 □ Plot D
5 □ Plot E
Exercise IX
The police set up a traffic checkpoint where they randomly select cars for inspection. On
average, they check 10 cars per hour, and the number of cars selected for inspection in an
hour follows a Poisson distribution. Let X denote the total number of cars selected during a
five-hour period.
What is the probability that no cars are selected for inspection during a given hour?
1 □ Approximately 0%
2 □ Approximately 10%
3 □ Approximately 20%
4 □ Approximately 80%
5 □ Approximately 90%
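As a small illustration, the Poisson probability of zero events in one hour can be computed directly from the probability mass function with rate λ = 10 per hour. The helper name `poisson_pmf` is an assumption of this sketch; the rate for the five-hour count X follows from scaling the hourly rate.

```python
import math


def poisson_pmf(k, lam):
    """P(X = k) for a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam**k / math.factorial(k)


lam_one_hour = 10                   # average number of cars per hour
lam_five_hours = 5 * lam_one_hour   # rate for the five-hour total X

p_no_cars_one_hour = poisson_pmf(0, lam_one_hour)

print(f"P(no cars in one hour) = {p_no_cars_one_hour:.6f}")
print(f"Rate for five hours    = {lam_five_hours}")
```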
Question IX.3 (17)
Historical data suggests that 40% of inspections result in a citation. If, during a given hour,
the police randomly select three cars for inspection, what is the probability that exactly two
of the inspections result in a citation?
1 □ 0.096
2 □ 0.144
3 □ 0.288
4 □ 0.432
5 □ 0.720
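A sketch of the corresponding binomial calculation, assuming the three inspections are independent and each results in a citation with probability 0.40. The helper name `binom_pmf` is illustrative.

```python
import math


def binom_pmf(k, n, p):
    """P(X = k) for a binomial distribution with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


# Exactly two citations among three inspected cars
p_two_citations = binom_pmf(2, 3, 0.40)

print(f"P(exactly two citations) = {p_two_citations:.3f}")
```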
Exercise X
A couple is planning to buy a house and wants to estimate the expected price. They collect
data on recent property sales in the area and use Python to fit a multiple linear regression
model with property price as the response variable and the sizes of the house and lot (in square
meters) as explanatory variables.
They are particularly interested in two very similar properties. Property A has a house of
175 square meters and a lot of 800 square meters, while Property B has the same lot size but
a house that is 10 square meters smaller i.e., 165 square meters.
The couple obtains the following output from their Python model, where the price is given
in tkr. (thousands of Danish kroner, DKK).
A real estate agent claims that Property B should cost 400,000 less than Property A, implying
that each square meter of house is worth 40,000. The couple wants to evaluate whether the
data support this claim (null hypothesis) based on the fitted model.
The usual test of the real estate agent’s claim (null hypothesis) results in which p-value?
1 □ p = 0.922
2 □ p = 0.692
3 □ p = 0.345
4 □ p = 0.157
5 □ p < 0.001
The couple validates the model and generates the following diagnostic plots:
[Figure: four diagnostic plots. Normal QQ-plot of residuals (sample quantiles); scatter plot of residuals (tkr.) and fitted values; residuals (tkr.) versus house size (sq. m); residuals (tkr.) versus lot size (sq. m).]
Question X.3 (20)
1 □ The normal QQ-plot of the residuals suggests that the residuals may be slightly right-
skewed but shows no obvious violation of the normality assumption.
2 □ The normal QQ-plot of the residuals suggests that the residuals may be inter-dependent
but shows no obvious violation of the independence assumption.
3 □ The scatter plot of residuals versus fitted values suggests that the variance of the residuals
may increase with the fitted values but shows no obvious violation of the homoscedasticity
(variance homogeneity) assumption.
4 □ The scatter plot of residuals versus house size suggests that the variance of the residuals
may increase with house size but shows no obvious violation of the homoscedasticity
(variance homogeneity) assumption.
5 □ The scatter plot of residuals versus lot size suggests that the variance of the residuals may
increase with lot size but shows no obvious violation of the homoscedasticity (variance
homogeneity) assumption.
Exercise XI
One wants to compare the means in two samples, A and B. Both samples contain 50 independent
measurements, which are assumed to be normally distributed. It is stated that the 95%
confidence interval for the mean in each group is:
1 □ Since the confidence intervals overlap, the two underlying populations could have the
same mean. Therefore, we can easily see that the difference between the two sample
means is not significantly different from zero (at a 5% significance level).
2 □ Since the confidence intervals overlap, the difference between the two sample means is
not statistically significantly different from zero (at a 5% significance level). Thus, the
two underlying populations have the same distribution.
3 □ The sample means for sample A and B are 16.50 and 14.25, respectively, and there is
a significant difference between these means at a 5% significance level (but not at a 1%
significance level).
4 □ The sample means for sample A and B are 16.50 and 14.25, respectively, and there is a
significant difference between these means (at a 1% significance level).
5 □ The sample means for sample A and B are 16.50 and 14.25, respectively, but there is no
significant difference between these means (at a 5% significance level).
Exercise XII
A biologist has captured n1 = 150 fish in a lake, tagged them, and released them again. The
biologist now plans to return and capture n2 = 200 fish from the same lake.
If we denote the total number of fish in the lake by N (and assume that N is the same when
released and recaptured), what distribution will the number of tagged fish (Y ) then follow at
recapture (it is assumed that all tagged fish survive and that it is completely random which of
the N fish are captured)?
1 □ A binomial distribution with p = 150/N and n = 200, i.e. Y ∼ B(200, 150/N).
2 □ A normal distribution with parameters µ = 200·150/N and σ² = (200·150/N)(1 − 150/N).
3 □ A hypergeometric distribution with parameters n = 200, a = 150, and N, i.e. Y ∼
H(200, 150, N).
4 □ A Poisson distribution with parameter λ = 150·200/N, i.e. Y ∼ Pois(150·200/N).
5 □ An exponential distribution with parameter λ = N/(150·200), i.e. Y ∼ Exp(N/(150·200)).
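Drawing n2 fish without replacement from a lake containing N fish, of which n1 are tagged, is the standard setting for hypergeometric probabilities. The sketch below evaluates them for a purely hypothetical lake size `N_lake = 1000`; the exercise does not state N, and the helper name `hypergeom_pmf` is an assumption of this sketch.

```python
from math import comb


def hypergeom_pmf(y, n, a, N):
    """P(Y = y) when n items are drawn without replacement from N, of which a are marked."""
    return comb(a, y) * comb(N - a, n - y) / comb(N, n)


N_lake = 1000        # hypothetical total number of fish (not given in the exercise)
a_tagged = 150       # tagged fish, n1
n_recaptured = 200   # fish caught at recapture, n2

support = range(0, min(a_tagged, n_recaptured) + 1)
pmf = [hypergeom_pmf(y, n_recaptured, a_tagged, N_lake) for y in support]

total = sum(pmf)                                   # should be 1
mean = sum(y * p for y, p in zip(support, pmf))    # should equal n * a / N

print(f"sum of pmf = {total:.6f}, mean = {mean:.3f}")
```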
The length of the caught fish is measured in order to provide an estimate of their age. Thus, fish
between 6 and 10 cm. are classified as 1-year-old, while fish of more than 10 cm. are classified
as older. It is assumed that the length of a one-year-old fish follows a normal distribution with
mean µ = 8 cm. and standard deviation σ = 1 cm.
What is the probability that a one-year-old fish is classified as older than one year?
1 □ 0.159
2 □ 0.5
3 □ 0.841
4 □ 0.0228
5 □ 0.977
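The classification probabilities can be sketched with the standard library's `statistics.NormalDist`, using the stated N(8, 1²) length distribution for one-year-old fish and the 6 cm and 10 cm thresholds from the text. The variable names are illustrative.

```python
from statistics import NormalDist

# Length (cm) of a one-year-old fish, as assumed in the exercise
length = NormalDist(mu=8, sigma=1)

# Classified as older: length above 10 cm
p_classified_older = 1 - length.cdf(10)

# Classified as one-year-old: length between 6 and 10 cm
p_classified_one_year = length.cdf(10) - length.cdf(6)

print(f"P(length > 10)     = {p_classified_older:.4f}")
print(f"P(6 < length < 10) = {p_classified_one_year:.4f}")
```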
Exercise XIII
A consumer organization wants to investigate how often a parcel delivery company delivers
packages to the nearest parcel shop. They collected data from 750 parcel deliveries, distributed
across 5 regions. The results are summarized in the following table:
A χ²-test is now performed to investigate whether the proportion of packages delivered to the
nearest parcel shop is the same across all 5 regions.
What are the expected values in each cell of the table under the null hypothesis?
4 □
Region              Delivered to the nearest parcel shop   Delivered elsewhere
Capital region      74                                     76
Central Jutland     74                                     76
Southern Denmark    74                                     76
North Jutland       74                                     76
Zealand             74                                     76
A χ²-test is performed to investigate whether the proportion of packages delivered to the nearest
parcel shop is the same across all 5 regions. The relevant test statistic has been calculated as
47.21. What is the p-value for the relevant test, and what is the corresponding conclusion (use
a significance level of α = 0.05)?
1 □ The p-value is 0.10, and the conclusion is that there is a difference in the proportion of
packages delivered to the nearest parcel shop in the different regions.
2 □ The p-value is 0.10, and the conclusion is that there is no difference in the proportion of
packages delivered to the nearest parcel shop in the different regions.
3 □ The p-value is 0.05, and the conclusion is that there is no difference in the proportion of
packages delivered to the nearest parcel shop in the different regions.
4 □ The p-value is 1.4 · 10⁻⁹, and the conclusion is that there is no difference in the proportion
of packages delivered to the nearest parcel shop in the different regions.
5 □ The p-value is 1.4 · 10⁻⁹, and the conclusion is that there is a difference in the proportion
of packages delivered to the nearest parcel shop in the different regions.
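A sketch of the p-value calculation for the stated test statistic of 47.21. For a 5 × 2 table the χ²-test has (5 − 1)(2 − 1) = 4 degrees of freedom, and for an even number of degrees of freedom the upper tail probability has a closed form, which avoids third-party packages. The helper name `chi2_sf_even_df` is an assumption; `scipy.stats.chi2.sf` would be the usual library call.

```python
import math


def chi2_sf_even_df(x, df):
    """P(Q > x) for a chi-square distribution with an even number of degrees of freedom.

    Uses the closed form exp(-x/2) * sum_{k=0}^{df/2 - 1} (x/2)^k / k!.
    """
    if df <= 0 or df % 2 != 0:
        raise ValueError("df must be a positive even integer")
    half = x / 2
    return math.exp(-half) * sum(half**k / math.factorial(k) for k in range(df // 2))


rows, cols = 5, 2                  # regions x delivery outcome
df = (rows - 1) * (cols - 1)       # degrees of freedom
chi2_obs = 47.21                   # test statistic given in the question

p_value = chi2_sf_even_df(chi2_obs, df)

print(f"df = {df}, p-value = {p_value:.1e}")
```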
Exercise XIV
A sensor measures the temperature of a machine that should not exceed 80 °C. A sample of 40
temperature measurements gives a sample mean of 75.2 °C and a sample standard deviation of
2.5 °C. Assume that the temperature measurements are independent of each other and follow a
normal distribution.
1 □ [72.7, 77.7] °C
2 □ [70.1, 80.3] °C
3 □ [74.4, 76.0] °C
4 □ [75.1, 75.3] °C
5 □ [71.0, 79.4] °C
Exercise XV
A car owner wants to buy a new (used) car and wishes to investigate what the price should be. She
has collected prices for the car (make and model) she wants. The histogram below shows the
distribution of prices for the car she wants.
50
40
Frequency
30
20
10
0
50 100 150 200 250 300 350
Price [1000 kr]
In addition to the histogram, she has calculated the average and empirical variance for the
observed prices (price) [1000 kr.].
np.mean(price)
np.float64(118.94622598870056)
np.var(price,ddof=1)
np.float64(3633.1131006077294)
Based on the above, which of the following assumptions about the distribution of the price (Y )
is the most reasonable?
1 □ A log-normal distribution with parameters α = 4.66 and β² = 0.478², i.e. Y ∼ LN(4.66, 0.478²).
2 □ A normal distribution with parameters µ = 118.94 and σ² = 3633.1², i.e. Y ∼ N(118.9, 3633.1²).
3 □ An exponential distribution with parameter λ = 118.9, i.e. Y ∼ Exp(118.9).
4 □ A log-normal distribution with parameters α = 118.9 and β² = 60.28², i.e. Y ∼
LN(118.9, 60.28²).
5 □ A normal distribution with parameters µ = 118.94 and σ² = 60.28², i.e. Y ∼ N(118.9, 60.28²).
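The parameter values appearing in the options can be related to the reported mean and variance with a short calculation. The sketch below computes the sample standard deviation and, by moment matching, the log-normal parameters from E[Y] = exp(α + β²/2) and Var[Y] = (exp(β²) − 1) · exp(2α + β²). Moment matching is an assumption of this sketch; the exam text does not say how the option values were derived.

```python
import math

# Values printed by np.mean(price) and np.var(price, ddof=1) in the exercise
mean = 118.94622598870056
var = 3633.1131006077294

# Standard deviation on the original scale
sd = math.sqrt(var)

# Log-normal parameters matching the mean and variance
beta2 = math.log(1 + var / mean**2)    # beta^2
beta = math.sqrt(beta2)
alpha = math.log(mean) - beta2 / 2

print(f"sd = {sd:.2f}, alpha = {alpha:.2f}, beta = {beta:.3f}")
```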
The car owner has also collected data on the age and mileage of the cars. To investigate the
relationship between price, age and mileage of the car, the car owner has run the following code
(price [1000 kr.] is the price, age [years] is the age of the car, dist [1000 km.] is the mileage
of the car, and cars is the dataset with the collected numbers),
Yi = µi + ϵi ,
where the formula for µi is determined from the input to smf.ols and E(ϵi ) = 0.
Which of the following statements about the model or model assumptions is correct?
1 □ The ϵi’s are normally distributed and iid. (independent and identically distributed).
2 □ It is assumed that the observed correlation between age and dist is equal to zero.
3 □ The Yi’s are normally distributed and iid. (independent and identically distributed).
4 □ The µi’s follow a normal distribution and are iid.
5 □ µi = β1 x1i + β2 x2i , where x1i and x2i are the age and mileage of car i, respectively.
The results of the estimation above are given below (some numbers have been replaced with
symbols)
fit.summary(slim = True)
Question XV.3 (29)
Which of the following statements is correct when using a significance level of α = 0.05?
1 □ Both effects (age and mileage) are significantly different from zero and the expected price
decreases with age and mileage.
2 □ None of the effects (age and mileage) are significantly different from zero.
3 □ The age of the car has a significant effect on the price, while an effect of the mileage
cannot be demonstrated. The price increases as the age increases.
4 □ Mileage has a significant effect on price, while an effect of age cannot be demonstrated.
The price decreases as mileage increases.
5 □ The age of the car has a significant effect on the price, while an effect of the mileage
cannot be demonstrated. The price decreases as the age increases.
Exercise XVI
A consultant has received data on arrival and departure times for 35 employees at a given
workplace. Arrival and departure times are recorded for the same 35 employees on two different
days - one day in the summer and one day in the winter. The consultant now wishes to assess
whether the average working hours are the same on both days.
1 □ For each arrival and departure time, the working hours are calculated. Now, you have
two independent samples (one for the summer day and one for the winter day) with 35
measurements in each. The means of these samples are compared using a t-test with the
null hypothesis H0 : µ1 = µ2 .
2 □ For each of the two days, the average arrival time and the average departure time are
calculated. Then, two t-tests are performed: one t-test tests for a significant difference
in arrival times, and the other tests for a significant difference in departure times.
3 □ For each arrival and departure time, the working hours are calculated. Now, you have two
paired samples (one for the summer day and one for the winter day) with 35 measurements
in each. A paired t-test is used to examine whether the average difference in working hours
is significantly different from zero.
4 □ For each of the two days, a 95% confidence interval is calculated for the average arrival
time CIx̄arrive and for the average departure time CIx̄leave . If the two confidence intervals
do not overlap, there is a significant difference in the average working hours between the
two days.
5 □ For each arrival and departure time, a total working time is calculated. Now, you have
two samples with 35 measurements. A one-way ANOVA model is used to test whether
there is a difference in the average working hours between the two days.