CS2B Workbook


www.sankhyiki.in
+91-9711150002

INDEX

1. User Defined Functions in R
2. Loss Distributions
3. Reinsurance
4. Risk Models
5. Survival Models
6. Proportional Hazards Models
7. Markov Chains
8. Markov Jump Processes
9. Time Series
10. Extreme Value Theory
11. Machine Learning


User Defined Functions in R


Question 1. Define functions in R to calculate the values of the PDF and CDF of the Pareto(α, λ)
distribution. Name these functions dpareto and ppareto respectively.

Question 2. Define a function in R to simulate values from the Pareto(α, λ) distribution.
Name this function rpareto.

Question 3. Define a function in R to calculate the quantiles of the Pareto(α, λ) distribution.
Name this function qpareto.

Question 4. (i) Using the function defined in Q2, simulate 500 values from a
𝑃𝑎𝑟𝑒𝑡𝑜 (3,1000) distribution, using a seed value of 80.

(ii) Calculate the coefficient of skewness for the sample dataset.

Question 5. Repeat Q1 to Q3 for the Weibull(c, γ) distribution. You can refer to the Actuarial
Tables for the formulae.

Question 6. Repeat Q1 to Q3 for the Burr(α, λ, γ) distribution. You can refer to the Actuarial
Tables for the formulae.

Question 7. Adapting the code of dpareto from Q1, create a corresponding function for the
three-parameter Pareto distribution. You can refer to the Actuarial Tables for the
formulae.


Solutions
Answer 1. dpareto <- function(x,a,l){
a*(l^a)/((l+x)^(a+1))
}

ppareto <- function(q,a,l){


1-(l/(l+q))^a
}

Answer 2. rpareto <- function(n,a,l){


rp <- l*((1-runif(n))^(-1/a)-1)
rp
}

Answer 3. qpareto <- function(p,a,l){


q <- l*((1-p)^(-1/a)-1)
q
}
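As a quick consistency check (not part of the original question), applying ppareto from Answer 1 to the output of qpareto should return the original probability, assuming both functions are loaded:

ppareto(qpareto(0.75,3,1000),3,1000)   # should return 0.75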

Answer 4. (i)
set.seed(80)
rpareto <- function(n,a,l){
rp <- l*((1-runif(n))^(-1/a)-1)
rp
}

x<-rpareto(500,3,1000)

(ii)
skewness<-sum((x-mean(x))^3)/length(x)
coeff.of.skew<-skewness/var(x)^(3/2)
coeff.of.skew
[1] 6.903348

Answer 5. dweibull <- function(x,c,g){
c*g*x^(g-1)*exp(-c*x^g)
}

pweibull <- function(q,c,g){
1-exp(-c*q^g)
}

rweibull <- function(n,c,g){
rp <- (-log(1-runif(n))/c)^(1/g)
rp
}

qweibull <- function(p,c,g){
q <- (-log(1-p)/c)^(1/g)
q
}

# Note: these definitions use the Weibull parameterisation from the Actuarial Tables,
# F(x) = 1 - exp(-c*x^g), and mask R's built-in dweibull/pweibull/qweibull/rweibull.


Answer 6. dburr <- function(x,a,l,g){


(a*g*l^a)*x^(g-1)/((l+x^g)^(a+1))
}

pburr <- function(q,a,l,g){


1-(l/(l+q^g))^a
}

rburr <- function(n,a,l,g){


rp <- (l*((1-runif(n))^(-1/a)-1))^(1/g)
rp
}
qburr <- function(p,a,l,g){
q <- (l*((1-p)^(-1/a)-1))^(1/g)
q
}

Answer 7. d3pareto <- function(x,a,l,k){


gamma(a+k)*(l^a)*x^(k-1)/(gamma(a)*gamma(k)*(l+x)^(a+k))
}
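As a quick check (a sketch, not part of the original solution), the density defined above should integrate to 1 over (0, ∞) for any valid parameter values, for example:

integrate(function(x) d3pareto(x,2,200,7), 0, Inf)$value   # should be very close to 1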


ASSIGNMENT – Loss Distributions


Question 1. Calculate f (50) and F(50) for the random variable X where X~Gamma(10,0.2).

Question 2. Calculate f(1000) and F(1000) for the random variable Y where Y is lognormally
distributed with parameters μ = 6 and σ² = 4.

Question 3. The random variable Y follows a Weibull distribution with parameters


𝑐 = 0.00002 and 𝛾 = 4. Calculate P(8<Y <12) to four significant figures.

Question 4. The random variable Y follows a Weibull distribution with parameters


𝑐 = 0.00002 and 𝛾 = 4. The following code creates a vector y containing a
sample of 20 independent observations from this distribution.
set.seed(300)
y<-rweibull(20,4,0.00002^(-1/4))
Run this code. Hence calculate the log‐likelihood of observing this sample to 2
significant figures.

Question 5. Calculate the 99.5th percentile for the random variable Y, where Y is lognormally
distributed with parameters μ = 6 and σ² = 4.

Question 6. Calculate the IQR for the random variable X where X~Gamma(6,0.1).

Question 7. The random variable Y follows a Weibull distribution with parameters


𝑐 = 0.00002 and 𝛾 = 4. Calculate the value of y for which P(Y > y) = 0.3.

Question 8. Use a seed of 50 to generate 10 random variates from a lognormal distribution
with parameters μ = 5 and σ² = 4.

Question 9. You believe that claim amounts follow a Gamma(6,𝜆) distribution, where
𝜆~𝐸𝑥𝑝(10). Generate a random sample of five claims, using a seed value equal
to 1.

Question 10. The random variable Z is defined as Z = Y / X where:

• Y follows a Weibull distribution with parameters 𝑐 = 0.00002 and 𝛾 = 4


• X follows an exponential distribution with parameter 𝜆 = 0.1

Generate three random variates from Z, using a seed value of 10.

Question 11. Claim sizes X for a particular class of insurance business are believed to follow a
three‐parameter Pareto distribution such that X ~ Pa(2,200,7) . Summarise the
code required to calculate P(X > 10,000) , and display the result.


Question 12. The claim frequency on a portfolio of life insurance policies follows a beta
distribution with parameters 𝛼 = 0.02 and 𝛽 = 0.4. Calculate the probability that
the claim frequency exceeds 0.75.

Question 13. The random variable X follows a three‐parameter Pareto distribution with
parameters 𝛼 = 3, 𝜆 = 5 and k = 4. Calculate the median of X.
Hint: write a function to calculate |P(X <M) - 0.5| and minimise this with respect to M.

Question 14. The random variable X follows a LogN(0, 0.25) distribution (i.e. with parameters
μ = 0 and σ² = 0.25).
(i) Calculate the mode of X, to 6 significant figures.
(ii) Check your answer against a plot of the lognormal distribution.

Question 15. Run the following code:


set.seed(1812)
x<-rgamma(1000,rpois(1,100),rexp(1,5))
The vector x represents a set of claim sizes for a particular insurance portfolio.
You believe that claims follow a gamma distribution with parameters 𝛼 and 𝛽.
Find the maximum likelihood estimates of α and β. (You are given that the
method of moments estimates are α̂ = 111 and β̂ = 0.189.)

Question 16. This question uses the following file of claim sizes for a portfolio of general
insurance policies: ClaimSize.csv
(i) An analyst believes that claims are independent and follow a
Gamma(𝛼, 𝛽) distribution. Determine the maximum likelihood estimates
of the parameters 𝛼 and 𝛽, choosing from the following possibilities:
• 𝛼 = 1.7,1.8,1.9 or 2
• 𝛽 = 0.0015, 0.0016, 0.0017, …, 0.002
(ii) Plot the fitted density function against the data, and use it to comment on
your results.

Question 17. Insurance claims are believed to follow an exponential distribution with parameter
𝜆. The file ‘exp.txt’ gives the claim size for 1,000 of these claims.
(i) Find the maximum likelihood estimate of 𝜆. (You should use a starting
value of 𝜆 = 0.001 for the iteration process.)
(ii) Plot the empirical density function against your fitted distribution and
hence comment on the goodness of fit.

Question 18. A fellow student has collected a set of motor insurance claims data in the
following file: MotorClaims.txt
The student believes that this data follows an exponential distribution, and has
calculated the maximum likelihood estimate λ̂ = 0.003.
Plot the Q‐Q plot of the data against the student’s fitted distribution, and comment
on the result.

Question 19. Run the following code:


set.seed(1066)
x<-rlnorm(1000,5,2)
(i) Use the data stored in vector x to find the MLE of parameters of Log-
Normal distribution.
(ii) Use the method of moments on the simulated data above to estimate the
lognormal parameters.
(iii) Verify that your estimates can be used to give sensible maximum
likelihood estimates.


Solutions
Answer 1. dgamma(50,10,0.2)
[1] 0.02502201
pgamma(50,10,0.2)
[1] 0.5420703

Answer 2. dlnorm(1000,6,2)
[1] 0.0001799479
plnorm(1000,6,2)
[1] 0.6750416

Answer 3. c<-0.00002
g<-4
pweibull(12,4,c^(-1/g))-pweibull(8,4,c^(-1/g))
[1] 0.2608205

Answer 4. c<-0.00002
g<-4
sum(dweibull(y,4,c^(-1/g),log=TRUE))
[1] -56.0779

Answer 5. qlnorm(0.995,6,2)
[1] 69679.3

Answer 6. qgamma(0.75,6,0.1)-qgamma(0.25,6,0.1)
[1] 32.03492

Answer 7. c<-0.00002
g<-4
qweibull(0.7,g,c^(-1/g))
[1] 15.66378

Answer 8. set.seed(50)
rlnorm(10,5,2)
[1] 445.563506 27.571772 158.538247 423.392403 4.686996
[6] 85.137785 305.410535 45.521062 1044.382530 8.235879

Answer 9. Specifying the seed value:


set.seed(1)
If we were to generate five random variates from the Gamma(6, λ) distribution
with λ a constant, this would be rgamma(5,6,λ) for some known value of λ.
However, 𝜆 is a random variable. So every time we generate a value from our
gamma distribution, we’ll have to randomly choose a different value for 𝜆. We’ll
do this by inserting the rexp()function into the rgamma() function:

rgamma(5,6,rexp(1,10))
[1] 63.05133 49.19141 86.27594 32.85776 88.75371


Note, since your answer will depend on the order in which you generate numbers
from the two distributions, it is possible that your solution will be different to
ours, but still equally valid!
For example, you may have first generated a vector of random variates from the
exponential distribution, and then used these to simulate values from the
gamma distribution. In this case, your answer would be:
set.seed(1)
lambda <- rexp(5,10)
rgamma(5,6,lambda)
[1] 32.85776 56.72204 505.65996 495.95443 110.23761

Answer 10. Setting the seed:


set.seed(10)
Generating three random variates from the Weibull distribution:
c<-0.00002
g<-4
y<-rweibull(3,g,c^(-1/g))
Generating three random variates from the exponential distribution:
x<-rexp(3,0.1)
Generating the set of three random variates from Z and displaying the result:
z<-y/x; z
[1] 3.5138341 0.6385402 0.9119199
Again, different generation methods would also be valid.

Answer 11. a<-2; l<-200; k<-7; q<-10000


d3pareto <- function(x){
gamma(a+k)*(l^a)*x^(k-1)/(gamma(a)*gamma(k)*(l+x)^(a+k))
}
1-integrate(d3pareto,0,q)$value
[1] 0.009951169

Answer 12. pbeta(0.75,0.02,0.4,lower=FALSE)


[1] 0.02990434

Answer 13. a<-3; l<-5; k<-4

f<-function(x) {
gamma(a+k)*(l^a)*x^(k-1)/(gamma(a)*gamma(k)*(l+x)^(a+k))
}

F<-function(M) {
abs(integrate(f,0,M)$value-0.5)
}

M.start<-10
nlm(F,M.start)
$minimum
[1] 1.947338e-08
$estimate
[1] 6.865008
$gradient
[1] -3.221498e-05
$code


[1] 2
$iterations
[1] 61 So the median M = 6.865
Answer 14. (i)
m<-0
s2<-0.25
s<-s2^0.5
exp(m+0.5*s^2)   # this is the mean of the distribution, shown for comparison
[1] 1.133148
x<-seq(0.000001,2,by=0.000001)
f<-function(x) {
dlnorm(x,m,s)
}
max<-max(f(x)); max
[1] 0.9041217
i<-which(f(x)==max); i
[1] 778801
x[i]
[1] 0.778801
(ii)
plot(x,f(x),type="l",main="Density function for
LogN(0,0.25)")

The highest point on the curve is at x ≈0.8 , which is close to our answer for part (i).

Answer 15. # Function for negative log-likelihood function of gamma dist


f<-function(params) {
neglnL<- -sum(dgamma(x,params[1],params[2],log=TRUE)); neglnL
}

p<-c(111,0.189)

f(p)
[1] 5433.073

# To minimize f wrt the parameters


MLE<-nlm(f,p)
MLE

$minimum


[1] 5433.067
$estimate
[1] 110.9999999 0.1890614
$gradient
[1] -0.0301862472 -0.0002226943
$code
[1] 2
$iterations
[1] 2
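For reference, the method of moments starting values quoted in the question can be recovered directly from the simulated data (a sketch, assuming the vector x created by the question's code is still in memory):

alpha.mom <- mean(x)^2/var(x)   # for Gamma(alpha,beta): E(X) = alpha/beta, Var(X) = alpha/beta^2
beta.mom <- mean(x)/var(x)
c(alpha.mom, beta.mom)          # approximately 111 and 0.189, as quoted in the question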

Answer 16. (i) #Importing data


data <- read.table("ClaimSize.csv",header=T)

#R has imported this as a dataframe one column wide. We extract this column
#of data as a vector, just to make it easier to work with:
datavector <- data[,1]

#creating two vectors for the possible parameter values of alpha


#and beta
alpha <- seq(1.7,2,0.1)
beta <- seq(0.0015,0.002,0.0001)

#blank vector to save the results


result<-numeric(0)

#loop for combination of the log-likelihood function sumL


for (i in 1:length(alpha)){
for (j in 1:length(beta)){
logL <- dgamma(datavector,shape=alpha[i],rate=beta[j],log=TRUE)
sumL <- sum(logL)
result<-rbind(result,c(alpha[i],beta[j],sumL))
}
}
result

#In the code above, the function rbind appends each iteration
into a new row of the dataframe result.

The maximum of these is the row:


[6,] 1.7 0.0020 -154.6880

Alternatively, we can extract the correct row automatically, like this:


result[result[,3]==max(result[,3]),]
[1] 1.700 0.002 -154.688

(ii) The maximum claim size observation is 4,664 so we’ll plot a histogram with the x axis
on the range (0,5000):

hist(datavector,freq=FALSE, breaks=20, xlim=c(0,5000), ylim=


c(0,0.001),ylab="density")


We have also chosen to set breaks=20, ie twenty bars in our histogram. This is an
arbitrary choice; you may have chosen a different number altogether. In any case, R only
treats this as a suggestion and will automatically set the breakpoints to ‘pretty’ values.

The option freq=FALSE tells R to plot the PDF, rather than the frequencies. Now we
plot our fitted gamma distribution on the same graph as the histogram:

lines(seq(0,5000),dgamma(seq(0,5000),shape=1.7,rate=0.002),
col="blue")

We can see that the data is more positively skewed than the fitted distribution. There are
far more claims greater than 800 than the fitted distribution suggests, and far fewer claims below 800.

Answer 17. (i) #Importing data


claims <- read.table("exp.txt",header=T)

Extracting the data into a vector will make it easier to work with:
claims<-claims[,1]

Now we write a function to calculate the negative log-likelihood:


f<-function(param) {
neglnL<- -sum(dexp(claims,param,log=TRUE))
neglnL
}

p<-0.001
MLE<-nlm(f,p)
MLE
$minimum
[1] 75087.7
$estimate
[1] 0.001490436
$gradient
[1] -9535.74
$code
[1] 3
$iterations
[1] 5
Hence, λ̂ = 0.0015.


(ii) Empirical density function


We’ll use the density function on the range from 0 to max(claims). We’ll also give
the graph some sensible labels, just as you should do in the exam:
plot(density(claims,from=0,to=max(claims)),xlab="claims",main=
"Empirical density function",col="blue")

We want to plot the fitted PDF on a similar range, so we check the maximum claim size:
max(claims)
[1] 5612.656
Hence, we set our x coordinates as:
x<-seq(0,5613)

Now we calculate the PDF of the exp(0.0015) distribution, for each value of x:
y<-dexp(x,MLE$estimate)

Finally we add this to our graph:


lines(x,y,col="red")

This looks like a very close fit.

Answer 18. Importing the data


x <- read.table("MotorClaims.txt",header=T)

Extracting the data into a vector:


x<-x[,1]

Simulating values from the fitted distribution


We use a seed of 77 below, so that you can reproduce our results. However this
wouldn’t be necessary in the exam:

set.seed(77)
y<-rexp(1000,0.003)

Creating the Q-Q plot:


Since we’ve called our data ‘x’, we’ll plot it on the x axis rather than the y axis, to avoid
confusion. This is the opposite way round to what we’ve done elsewhere in this unit, but
it is still perfectly valid.


It’s good technique to:


- use sensible labels for your axes and your main title, rather than R’s default labels
- add the line of perfect fit, using the abline function.

qqplot(x,y,xlab="Sample quantiles",ylab="Quantiles of
exp(0.003)", main="Q-Q plot")

abline(0,1,col="red")

The data doesn’t lie anywhere near the reference line for claims greater than about 500,
so the fit is particularly bad in the tail of the distribution. Nor is it a good fit for smaller
claims. Overall we would recommend the student tries to fit an alternative distribution.

Answer 19. Defining Negative Log-Likelihood function


f<-function(params) {
neglnL<- -sum(dlnorm(x,params[1],params[2],log=TRUE))
neglnL
}

we have written params[1] instead of mu, and params[2] instead of sigma. So when
we specify params (ie our vector of inputs), we need the first element of params to
correspond to the parameter mu and the second element to correspond to the
parameter sigma.

p<-c(100,100)
MLE<-nlm(f,p)
MLE
$minimum
[1] 7108.991
$estimate
[1] 4.988397 2.017083
$gradient
[1] 7.110559e-05 -5.455841e-05
$code
[1] 1
$iterations


[1] 20

MLE$estimate
[1] 4.988397 2.017083

(ii) Using the method of moments mathematically leads to the following estimates:
mu<-log(mean(x))-0.5*log(1+var(x)/mean(x)^2)
sigma<-(log(1+var(x)/mean(x)^2))^0.5
mu
[1] 5.541139
sigma
[1] 1.674361

(iii) Using the method of moments estimates as inputs in the MLE function
We enter the estimates mu and sigma from part (ii) as inputs in our function f, and
minimise the negative log-likelihood:
p<-c(mu,sigma)
MLE<-nlm(f,p)
MLE
$minimum
[1] 7108.991
$estimate
[1] 4.988397 2.017082
$gradient
[1] 8.897315e-05 -5.830088e-04
$code
[1] 1
$iterations
[1] 8
This time, our maximum likelihood estimates are μ̂ = 4.99 and σ̂ = 2.02, which are very
close to the population parameters μ = 5 and σ = 2.


Reinsurance
Question 1. Claims from a particular portfolio of insurance policies are expected to follow a
Pareto distribution with parameters 𝛼 = 2 and 𝜆 = 800. An individual excess of
loss arrangement is in force where the insurer pays a maximum of 1,500 on any
claim.
(i) Simulate 10,000 claims from this portfolio, using a seed value of 5.
(ii) Estimate the probability that a claim involves the reinsurer.

Question 2. Claim sizes this year, X, from a particular class of insurance follow an exponential
distribution with parameter 𝜆 = 0.001.
(i) Simulate 10,000 claims from this portfolio, using a seed value of 2357.
(ii) Inflate each claim in the sample by 5%, to derive a sample of 10,000
claims from next year’s claim size distribution.
(iii) Apply the policy excess of 100 to each claim and find the mean amount
paid by the insurer after inflation.

Question 3. Run the following code:


set.seed(10249)
yclaims <-pmin(rexp(10000,rgamma(1,3,runif(1,580,620))),1000)

The vector yclaims denotes the amounts paid by an insurer for a book of general
insurance business, net of reinsurance. The reinsurance is an excess of loss
arrangement with retention limit M = 1,000 so that the insurer pays a maximum of
1,000 for any individual claim. (Hence, many claims in yclaims are exactly equal
to 1,000.)
You believe that the underlying claim size random variable X (before reinsurance)
is exponentially distributed with unknown parameter 𝜆.
(i) Calculate the maximum likelihood estimate of 𝜆.
(ii) Plot a graph of your fitted distribution against the empirical density plot of
the data yclaims. Your x axis should extend from 100 to 1,000 and you
should label your graph appropriately.
(iii) Comment on the key features of your graph.

Question 4. Individual claim amounts follow a gamma distribution with parameters α = 5 and
β = 0.04. The insurer arranges excess of loss reinsurance with a retention limit of
200. Calculate the proportion of claims on which the reinsurer is involved.

Question 5. Run the following code in R: _


set.seed(1); x <- sort(rpois(365,rgamma(365,20,0.01)))

The number of claims reported to an insurance company for a particular type of


insurance policy has been recorded every day for the last 365 days. These are listed
in the vector x. On average, one in every 100 claims is rejected by the insurer as it
is deemed to be fraudulent, and this rate has been constant for many years.
However, the insurer now expects that the number of fraudulent claims will


increase by 4% next year. Estimate the skewness of the number of claims the
insurer can expect to pay next year.

Question 6. Before attempting this question, you should run the following line of code:
set.seed(16550)
An insurer believes that claim amounts, X, on its portfolio of pet insurance policies
follow an exponential distribution with parameter λ = 0.0005. The insurer wants to
purchase an excess of loss reinsurance policy with retention level M, such that the
reinsurer pays Z = X - M on claims that exceed size M, and zero otherwise.
(i) Simulate 10,000 claims from the insurer’s gross claims distribution and
hence estimate the insurer’s mean claims size net of reinsurance,
for M = 2, 500 .
The reinsurer imposes a maximum reinsurance payment of £1,000 on each claim,
so that:
Z = 0          if X ≤ M
Z = X − M      if M < X ≤ M + 1000
Z = 1000       if X > M + 1000
(ii) Estimate the retention level M that minimises the variance of the insurer’s
claims net of reinsurance.

Question 7. A reinsurer has entered into an excess of loss reinsurance arrangement with a
retention limit of £45,000. The amounts paid by the reinsurer on each claim are
provided in the file: Reinsurance payments.txt

(i) Import the data into R.


(ii) Assuming that individual claim amounts, gross of reinsurance, follow a
log N(μ ,σ) distribution, calculate the maximum likelihood estimates for μ
and σ .
(iii) By plotting a suitable graph, check the reasonableness of your answer to
part (ii).
(iv) Locate the largest claim amount paid by the reinsurer.
(v) Estimate the probability that the reinsurer will have to pay a claim that is
larger than this largest claim.

Question 8. In an insurance company’s portfolio, individual claim sizes, in £, follow an
exponential distribution with parameter 0.0001.
(i) Run the following code: set.seed(123)and then use R to
simulate 10,000 claims and plot a histogram of the simulated data. Paste
your R code and chart into your answer.
(ii) (a) Calculate the mean and the variance of the simulated claims in part (i).
(b) Compare your answers to part (ii)(a) with the theoretical values expected
from the exponential distribution with parameter 0.0001.

The insurer decides to take out an individual excess of loss reinsurance


arrangement with a retention level of £20,000.
(iii) Calculate the mean and the variance of the claims paid, under this


arrangement, by
(a) the insurer
(b) the reinsurer.

The insurer wishes to determine an appropriate retention limit and has asked an
analyst to investigate the effect of different retention limits on the mean and
variance of claims.
(iv) Calculate the mean and variance of the claims paid by:
(a) the insurer
(b) the reinsurer
under each of the following six retention levels:
£5,000, £10,000, £20,000, £30,000, £40,000 and £50,000.
(v) Plot your results from part (iv) on four separate charts to show how the
mean and variance of the claims paid by the insurer and reinsurer vary
with the retention level.
(vi) Comment, with reference to the total variance, on your results in part (v).

Question 9. An exponential distribution has a parameter of λ = 0.4.


(i) Use the in-built functions in R to:
(a) Simulate 1,000 values from this distribution, assigning this to a
variable called Exp_Vector and calculate the mean and
variance of the simulated values. Paste the results of your
calculation into your answer.
(b) Plot a histogram of Exp_Vector showing the frequencies and
paste the plot into your answer.
(c) Plot the probability density function for this distribution as:
1 a scatter plot
2 a line graph.
Paste your plots into your answer.
A lognormal distribution has parameters μ = 0 and σ² = 1.
(ii) Use the in-built functions in R to:
(a) Simulate 1,000 values from this distribution, assigning this to a
vector called LNorm_Vector and calculate the mean and
variance of the simulated values. Paste the results of your
calculation into your answer.
(b) Plot a histogram of LNorm_Vector showing the frequencies
and paste the plot into your answer.
(c) Plot a second histogram in a new graph of LNorm_Vector
showing the probability densities, setting the y-axis range from 0 to
0.7 for this graph and paste the plot into your answer.
(d) Add the cumulative density function of LNorm_Vector to the
chart in part (ii)(c) and paste the plot into your answer.
(e) Add the theoretical lognormal (0,1) distribution to the chart in
part (ii)(d) to highlight the difference to the sample, including
appropriate labels and legend and paste the plot into your answer.


Question 10. Claims from a particular portfolio of insurance policies are expected to follow a
Pareto distribution with parameters 𝛼 = 3 and 𝜆 = 400. A policyholder
deductible of £100 is applied to these policies.
(i) Simulate 10,000 claims from this portfolio, using a seed value of 225 and
store the values in vector x.
(ii) Calculate the expected claim size paid by the insurance company.
(iii) Comment on the difference between your answer to (ii) and the mean of
vector x.


Solutions
Answer 1. (i) Simulating claims
Since the Pareto distribution is not built in to the basic R functions, we will write
our own function. You will have seen how to do this before in a previous unit:
rpareto <- function(n,a,l){
rp <- l*((1-runif(n))^(-1/a)-1)
rp
}

Simulating a set of 10,000 claims from this distribution, using the seed value of
5:
set.seed(5)
x<-rpareto(10000,2,800)

(ii) Probability of the reinsurer being involved


We’ll call the reinsurer’s share of these claims z:
z <- pmax(0,x-1500)

So the probability of the reinsurer being involved in a claim is approximately:


length(z[z>0])/length(x)
[1] 0.1231
ie 12.31%.

Answer 2. (i) Simulating claims


set.seed(2357)
x <- rexp(10000,0.001)

(ii) kx <- 1.05*x

(iii) y <- pmax(0,kx-100)


mean(y[y>0])
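If the average is instead taken over all claims, including those falling entirely within the excess, a one-line alternative under that interpretation is:

mean(y)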

Answer 3. (i) M <- 1000


nz <- length(yclaims[yclaims==1000])
nz
[1] 643

Extracting the vector of claims that do not exceed the retention level 1,000:
yclaims.excl1000 <- yclaims[yclaims<1000]

Now we write the negative log-likelihood function flnL:

flnL <- function(parameter) {


- nz*pexp(M,rate=parameter,lower.tail=FALSE,log.p=TRUE)-
sum(dexp(yclaims.excl1000,rate=parameter,log=TRUE ))
}


p<-1
nlm(flnL,p)
$minimum
[1] 64624.84
$estimate
[1] 0.002720991
$gradient
[1] 5.491267
$code
[1] 2
$iterations
[1] 14

So, the maximum likelihood estimate is λ̂ = 0.00272.

(ii) Plot
First we extract the maximum likelihood estimate 𝜆" from the output of the nlm
function, and call this lambda:
lambda<-nlm(flnL,p)$estimate

Now we plot the empirical density function of the original data. In the code
below we have split the title of the graph into three lines, so that R will do the
same when it creates the graph:
plot(density(yclaims,from=100,to=1000),xlab="Amounts paid
by insurer",main="Empirical density function vs
Exp(0.00272)",col="blue")

Now we create a vector of values x, on the same range as the empirical density
plot:
x<-seq(100,1000)
y<-dexp(x,rate=lambda)

Finally we add the fitted PDF to our empirical density plot:


lines(x,y,col="red")

(iii) Comment
The blue line (ie the empirical density function) has a peak at x =1,000 . This is to
be expected since the data yclaims represents the insurer’s claim amounts
after reinsurance (ie there are many claims that have hit the reinsurance layer
and for which the insurer pays the maximum1,000).


The two lines diverge in the right-hand tail of the distribution since the data
represents claim amounts after reinsurance whereas the fitted distribution
models the underlying claim size before reinsurance.
The two lines are quite close across the rest of the range (ie where the
underlying claim sizes are the same as the insurer’s claim amounts), so the fitted
distribution appears to be a good fit.

Answer 4. pgamma(200, shape=5, rate=0.04, lower.tail=FALSE)

Answer 5. set.seed(1)
x <- sort(rpois(365,rgamma(365,20,0.01)))
f <- 0.01*x                  # expected number of fraudulent claims per day (1 in 100)
f2 <- 1.04*f                 # fraudulent claims next year, after the 4% increase
nonf <- (1-0.01)*x           # genuine claims per day
r2 <- nonf+f2                # total claims expected to be reported next year
x2 <- r2*(1-0.01*1.04)       # claims the insurer expects to pay, after removing next year's rejection rate
skewX2 <- sum((x2-mean(x2))^3)/length(x2); skewX2   # third central moment
skewX2^(1/3)

Answer 6. Part (i)


set.seed(16550)
M <- 2500
n <- 10000
x <- rexp(n,rate=0.0005)
z <- pmax(x-M,0)
y <- x-z
mean(y)
y <- pmin(x,M)
mean(y)

# Part (ii)
f <- function(M) {
var(x-pmin(pmax(x-M,0),1000))
}
nlm(f,1000)
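The retention level that minimises the net variance can then be read off from the estimate component of the nlm output (a sketch):

M.opt <- nlm(f,1000)$estimate
M.opt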

Answer 7. Part (i)


data <- read.table("Reinsurance payments.txt",header=FALSE)
head(data)

We extract this column of data as a vector, just to make it easier to work


with:
Payments <- data[,1]
# Part (ii)
M <- 45000

Now we extract all claims which involve the reinsurer:


RI.claims <- Payments[Payments>0]
head(RI.claims)


The number of claims that fall below the retention limit of £45,000 is:
nm <- length(Payments)-length(RI.claims)
nm
[1] 1634

Hence, we write the negative loglikelihood function flogN as:


flogN <- function(parameters) {
- nm*plnorm(M,meanlog=parameters[1],sdlog
=parameters[2],log.p=TRUE)-
sum(dlnorm(RI.claims+M,meanlog=parameters[1],
sdlog = parameters[2],log=TRUE ))
}

mu <- mean(log(RI.claims))
sigma <- sd(log(RI.claims))
p <- c(mu,sigma)
flogN(p)
nlm(flogN,p)
$minimum
[1] 4578.015
$estimate
[1] 10.4747560 0.2651853
$gradient
[1] -8.453505e-06 -1.749413e-05
$code
[1] 1
$iterations
[1] 25

# Part (iii)
x <- Payments+M
max(x)
breakpoints <- c(20000,45000,seq(46000,86000,1000))
hist(x, freq=FALSE, breaks=breakpoints, xlim=c(0,86000),
ylim=c(0,0.00005), ylab="Density",
main="Truncated data versus logN(10.475,0.265)", col="grey")
lines(seq(0,86000),dlnorm(seq(0,86000),meanlog=10.475,sdlog=0.265),col="blue")
# or, using the fitted parameters extracted from the nlm output:
m <- nlm(flogN,p)$estimate[1]
s <- nlm(flogN,p)$estimate[2]
lines(seq(0,86000),dlnorm(seq(0,86000),meanlog=m,sdlog=s),col="blue")


# Part (iv) Largest claim


The largest payment made by the reinsurer is:
max<-max(RI.claims)
max
[1] 40200
We use the which function to locate this claim in the data:
which(RI.claims %in% max)
[1] 199
Alternatively:
which(Payments %in% max)
[1] 1111
So the largest claim occurs as the 199th element of reinsurer’s conditional
claims data RI.claims and the 1,111th claim of the original data Payments.

# Part (v)
x <- M+max
plnorm(x,10.475,0.265,lower.tail=FALSE)
m <- nlm(flogN,p)$estimate[1]
s <- nlm(flogN,p)$estimate[2]
plnorm(x,m,s,lower.tail=FALSE)

Answer 8. (i)
set.seed(123)
claims <- rexp(10000, 0.0001)
hist(claims, main = "Histogram of 10,000 Simulated Claims
Values from the Exponential Distribution")

(ii)(a)
mean(claims)
[1] 10037.81
var(claims)
[1] 99976440
(b)
The theoretical value of the mean should be 1/0.0001 = 10,000.
The theoretical value of the variance should be (1/0.0001)² = 100,000,000.
The simulated values are very close to the theoretical values.
The difference is due to sampling error

(iii)(a)
For the insurer:
InsClaims3 <- pmin(claims, 20000)
mean(InsClaims3)
[1] 8671.328
var(InsClaims3)


[1] 43964610
(b)
For the reinsurer:
ReClaims3 <- pmax(0, claims - 20000)
mean(ReClaims3)
[1] 1366.485
var(ReClaims3)
[1] 25047814
(iv)(a)
InsClaims1 <- pmin(claims, 5000)
InsClaims2 <- pmin(claims, 10000)
InsClaims4 <- pmin(claims, 30000)
InsClaims5 <- pmin(claims, 40000)
InsClaims6 <- pmin(claims, 50000)
MeanIns <- c(mean(InsClaims1), mean(InsClaims2),
mean(InsClaims3), mean(InsClaims4), mean(InsClaims5),
mean(InsClaims6))

MeanIns
[1] 3940.658 6339.937 8671.328 9538.658 9869.412 9971.611

VarIns <- c(var(InsClaims1), var(InsClaims2),


var(InsClaims3), var(InsClaims4), var(InsClaims5),
var(InsClaims6))

VarIns
[1] 2533775 12858446 43964610 70204248 86376566 93370367
(b)
ReClaims1 <- pmax(0, claims - 5000)
ReClaims2 <- pmax(0, claims - 10000)
ReClaims4 <- pmax(0, claims - 30000)
ReClaims5 <- pmax(0, claims - 40000)
ReClaims6 <- pmax(0, claims - 50000)
MeanRe <- c(mean(ReClaims1), mean(ReClaims2),
mean(ReClaims3), mean(ReClaims4), mean(ReClaims5),
mean(ReClaims6))
MeanRe
[1] 6097.15469 3697.87535 1366.48492 499.15454 168.40057
66.20174
VarRe <- c(var(ReClaims1), var(ReClaims2), var(ReClaims3),
var(ReClaims4), var(ReClaims5), var(ReClaims6))
VarRe
[1] 84523429 60046376 25047814 9343406 3450843 1305645
(v)
x <- c(5000, 10000, 20000, 30000, 40000, 50000)
plot(x, MeanIns, xlab = "Retention Limit", ylab = "Mean
amount paid by Insurer", main = "Mean Claim Amount for
Insurer by Retention Limit", type = "l")


plot(x, MeanRe, xlab = "Retention Limit", ylab = "Mean


amount paid by Reinsurer",main = "Mean Claim Amount for
Reinsurer by Retention Limit", type = "l")

plot(x, VarIns, xlab = "Retention Limit", ylab = "Variance


of amount paid by Insurer", main = "Variance of Claim
Amount for Insurer by Retention Limit", type = "l")

plot(x, VarRe, xlab = "Retention Limit", ylab = "Variance


of amount paid by Reinsurer", main = "Variance of Claim
Amount for Reinsurer by Retention Limit", type = "l")

(vi) The mean claim amount paid by the insurer increases in size with the retention
limit, but at a decreasing rate. It tends towards the unrestricted mean as the retention
limit increases.
The mean claim amount paid by the reinsurer is equal to the total mean claim minus
the mean amount paid by the insurer.
Hence the mean claim amount paid by the reinsurer decreases in size with the
retention limit, also at a decreasing rate.
The trends in the variance of the claims for the insurer and the reinsurer
are not “mirror images”.
Among the six retention limits investigated, the sum of the variances of the reinsurer
and the insurer reaches a minimum at a retention limit of around £20,000.
TotalVar <- VarIns + VarRe
TotalVar
[1] 87057204 72904822 69012424 79547654 89827408 94676012
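A quick way to confirm which of the six retention limits gives the smallest total variance (assuming the vector x of retention limits from part (v) is still in memory):

x[which.min(TotalVar)]
[1] 20000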


Answer 9. (i)
(a)
Exp_Vector <- rexp(1000,0.4)
mean(Exp_Vector) # or summary(Exp_Vector)
var(Exp_Vector)
The mean and variance will vary due to the random number generation. If the
sample size is large enough, the mean and variance should be close to those of the
underlying distribution (exponential with parameter 0.4), as follows:
Mean = 2.5
Variance = 6.25
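These theoretical values follow directly from the exponential formulae:

1/0.4     # theoretical mean = 1/lambda
1/0.4^2   # theoretical variance = 1/lambda^2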
(b)
hist(Exp_Vector)

(c) 1.
x <- sort(Exp_Vector)
PDF <- dexp(x, 0.4)
plot(x, PDF)

(c) 2.
plot(x, PDF, type="l")

(ii) (a)
LNorm_Vector <- rlnorm(1000, meanlog = 0, sdlog = 1)
mean(LNorm_Vector)
var(LNorm_Vector)


The mean and variance will vary due to the random number generation. If the sample size
is large enough, the mean and variance should be close to those of the underlying
distribution (lognormal with parameters μ = 0, σ² = 1), as follows:
Mean = 1.649
Variance = 4.6708
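These values can be checked against the standard lognormal moment formulae:

exp(0 + 0.5*1)              # mean = exp(mu + sigma^2/2)
(exp(1) - 1)*exp(2*0 + 1)   # variance = (exp(sigma^2) - 1)*exp(2*mu + sigma^2)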

(ii) (b)
hist(LNorm_Vector)

(c)
hist(LNorm_Vector, freq = FALSE, xlim = c(0,25), ylim = c(0,0.7))

(d) and (e)

grid = seq(0,25,0.1)
lines(grid, dlnorm(grid,0,1), type="l", xlab="x", ylab="f(x)", col="black")
lines(density(LNorm_Vector), col="red")
legend("topright", c("True Density", "Estimate"), lty=1, col=c("black", "red"))

hist(LNorm_Vector, freq = FALSE, xlim = c(0,25), ylim = c(0,1))

lines(grid, plnorm(grid,0,1), type="l", xlab="x", ylab="f(x)", col="black")


lines(ecdf(LNorm_Vector),col="red")
legend("bottomright", c("True cumulative distribution", "Estimate"), lty=1, col=c("black", "red"))

Answer 10. (i)


rpareto <- function(n,a,l){
rp <- l*((1-runif(n))^(-1/a)-1)
rp
}

set.seed(225)
x<-rpareto(10000,3,400)

(ii)
y <- x[x>100]
mean(y)
[1] 355.6002

(iii)
mean(x)
[1] 204.6041
The mean amount of Y is greater than this because we have removed some of
the smallest claims by introducing the deductible.
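For comparison, the unconditional mean of the underlying Pareto(3, 400) distribution is λ/(α − 1), so the sample mean above is close to its theoretical value:

400/(3-1)
[1] 200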


Risk Models

Question 1. Aggregate claims S follow a compound distribution where the number of claims N
follows a (type 2) negative binomial distribution with parameters k = 10 and p = 0.1.
Individual claim sizes X follow a lognormal distribution with parameters μ = 10 and
σ² = 5.

(i) Simulate 10,000 random variates from the compound distribution and display the
first five values of your data. (Use seed value 16.)

(ii) Compare the sample mean and variance of your dataset to the population mean
and variance of S. Hint: Use the formula for E(S) and Var(S) given on page 16 of
the Tables.

(iii) Adjust your answer to part (i) to simulate the insurer’s aggregate claims net of
reinsurance, SI, if an individual excess of loss arrangement applies, with
retention limit M = 2,000,000.

Question 2. An insurer’s total annual claim amount S follows a compound Poisson distribution
with parameter λ = 500. The individual claim size X is exponentially distributed
with parameter β = 0.001.

(i) Simulate 1,000 aggregate claim amounts, using the seed value 1975, and display
the last five results.

The insurer believes that aggregate claims can be approximated using a normal
distribution, with parameters μ and σ².
(ii) Calculate the maximum likelihood estimates of the parameters μ and σ² based on
your simulated data. Hint: You should use the sample mean and standard deviation
as starting values for your iteration.

(iii) Use a suitable graph to comment on the goodness of fit.

Question 3. This question uses the data file ‘IRM.txt’. Therefore please run the following code
before attempting this question:

probability.of.claim <-
read.table("IRM.txt",header=T)[,2]
exponential.parameter <-
read.table("IRM.txt",header=T)[,3]

An insurer has sold 500 life insurance policies. Each policy can give rise to at
most one claim, and if a claim occurs on risk i the benefit payment follows an
exponential distribution with parameter λi (for i = 1, ..., 500). Policies are


independent, and claim sizes (and claim numbers) are not necessarily identically
distributed.

The vector probability.of.claim contains the probability of a claim qi on the ith


policy. The vector exponential.parameter contains the exponential parameter λi
for the ith policy.

(i) Calculate the insurer’s expected total claim payment for the portfolio.
(ii) Estimate the probability that the insurer’s aggregate claim amount, S, will be
within plus or minus 10% of the population mean. (Hint: You should carry out
10,000 simulations, using the seed value 5.)
(iii) (a) Calculate the probability of the aggregate claims on the portfolio being
zero, and compare this to the probability implied by your simulation.
(b) Hence or otherwise, comment on the shape of the distribution of S.

Question 4. We have generated the data required for this question (policies) using the
following code. You should run this code before attempting the question:
set.seed(1812)
q <- runif(200,0,0.01)
mu <- runif(200,5,13)
sigma <- runif(200,0.5,0.8)
policies <- cbind(q,mu,sigma)

Aggregate claims under a certain portfolio of insurance business are expected to


follow the individual risk model.
The ith row in the dataframe policies contains the risk characteristics of the ith
policy in the portfolio.
The dataframe consists of three columns. The first column represents the
probability qi of a claim on the ith policy.
Individual claims follow a lognormal distribution, and the second and third
columns of the dataframe represent the parameters μi and σi respectively of the
claim size random variable for the ith policy.

There are 200 policies in the portfolio, so i = 1,...,200 . An excess of loss


reinsurance arrangement is in place where the insurer pays a maximum of M =
60,000 on any one claim.

(i) Simulate the reinsurer’s mean aggregate claim amount, using 5,000
simulations and a seed value of 1854.


(ii) Repeat your answer to part(i) assuming that, instead of an individual


excess of loss reinsurance arrangement, an aggregate excess of loss
arrangement applies whereby the insurer will pay a maximum of 300,000
in total for all claims, and the reinsurer will pay the rest.

Question 5. Aggregate claims S follow a compound distribution where the number of claims N
follows a Poisson distribution with parameter λ = 100. Individual claim sizes X
follow an exponential distribution with parameter β = 0.002.
Use seed value 8,143 to simulate 1,000 random variates from the compound
distribution.
(i) Hence state the value of the 98th simulated value of S.
(ii) Estimate the median of the insurer’s aggregate claims payment.
(iii) Estimate the coefficient of skewness of the insurer’s aggregate claims
payment S

Question 6. The random variable T follows a compound binomial distribution with


N ~ Bin(80,0.3) and the claim size random variable X follows a two‐parameter
Pareto distribution with parameters 𝛼 = 4 and 𝜆 = 4,500 .

(i) Use a seed of 123 to generate a sample of 10,000 observations from T and
display the last five values from your simulation.
(ii) Estimate q such that P(T > q) = 0.05.
(iii) Fit a gamma distribution to the simulated dataset t that you derived above.
(iv) Create a Q‐Q plot of the maximum likelihood gamma distribution against
the simulated aggregate claims data t. Comment on the result.
(v) Check the interpretation of the Q‐Q plot in the previous question by
plotting the right‐hand tail of the fitted density function against the right‐
hand tail of the empirical density function.

Question 7. The random variable W follows a compound negative binomial distribution with
N ~NBin(5,0.8) and the claim size random variable X follows a Burr distribution
with parameters 𝛼 = 0.3 , 𝜆 = 10 and 𝛾 = 1.2. Use a seed of 123 to generate a
sample of 10 observations from W.

Question 8. Company B has sold 100 life insurance policies which each gives rise to a
maximum of one claim per year. There are three types of policy:
• for Type 1 policies, claims amounts (in £000s) follow an
Exp(0.005)distribution, and the probability of a claim occurring on the
policy is 0.001
• for Type 2 policies, claims amounts (in £000s) follow an Exp(0.006)
distribution, and the probability of a claim occurring on the policy is 0.002
• for Type 3 policies, claims amounts (in £000s) follow an Exp(0.007)
distribution, and the probability of a claim occurring on the policy is 0.003
There are 30 Type 1 policies, 50 Type 2 policies and 20 Type 3 policies. All
policies are independent.


Simulate a set of 5,000 observations from the insurer’s aggregate claims size
distribution, and hence estimate the probability that it will have to pay out more
than £100,000 a year for these policies.
(You should use the seed value 54321.)


SOLUTIONS

Answer 1
(i)
k <- 10; p <- 0.1
mu <- 10; sigma <- 5^0.5
set.seed(16)
n <- rnbinom(10000,size=k,prob=p)
s <- numeric(10000)
for(i in 1:10000) {
x <- rlnorm(n[i],meanlog=mu,sdlog=sigma)
s[i] <- sum(x)
}
head(s,5)

(ii)
sample.mean <- mean(s)
sample.variance <- var(s)
meanN<-k*(1-p)/p
varN<-k*(1-p)/p^2
meanX<-exp(mu+0.5*sigma^2)
varX<-exp(2*mu+sigma^2)*(exp(sigma^2)-1)
meanS<-meanN*meanX
varS<-meanN*varX+varN*meanX^2
c(meanS, sample.mean)
c(varS,sample.variance)
meanS/sample.mean
varS/sample.variance

(iii)
set.seed(16)


k <- 10; p <- 0.1


mu <- 10; sigma <- 5^0.5
M <- 2000000
n <- rnbinom(10000,size=k,prob=p)
sI <- numeric(10000)
for(i in 1:10000) {
x <- rlnorm(n[i],meanlog=mu,sdlog=sigma)
y <- pmin(x,M)
sI[i] <- sum(y)
}
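To inspect the net-of-reinsurance simulations in the same way as part (i), the first few values can then be displayed:

head(sI,5)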

Answer 2
(i)
set.seed(1975)
n <- rpois(1000,500)
s <- numeric(1000)
for(i in 1:1000) {
x <- rexp(n[i],rate=0.001)
s[i] <- sum(x)
}
tail(s,5)

(ii)
f <- function(params) {
lnL <- dnorm(s,mean=params[1],sd=params[2],log=TRUE)
sum(-lnL)
}
p <- c(mean(s),sd(s))
MLE<-nlm(f,p); MLE

(iii)


mu <-MLE$estimate[1]
sigma <- MLE$estimate[2]
agg <- rnorm(100, mean=mu, sd=sigma)
qqplot(agg,s,xlab = "Quantiles from fitted
distribution",ylab = "Quantiles of simulated data",main =
"QQ plot of data")
abline(0,1,col="red")
plot(density(s),main="Simulated data versus fitted
distribution",xlab="Aggregate claim size",col="blue")
agg <- seq(from=min(s),to=max(s))
y <- dnorm(agg, mean=mu, sd=sigma)
lines(agg,y,col="red")

Answer 3
probability.of.claim <- read.table("IRM.txt",header=T)[,2]
exponential.parameter <- read.table("IRM.txt",header=T)[,3]
(i)
E.Si <- probability.of.claim/exponential.parameter
sum(E.Si)
(ii)
lower.bound <- sum(E.Si)*0.9
lower.bound
upper.bound <- sum(E.Si)*1.1
upper.bound
death <- rbinom(500,1,probability.of.claim)
x <- rexp(500,rate=exponential.parameter)
sim <- x*death
sim <- sum(sim)
set.seed(5)
S.sim <- numeric(10000)
for (i in 1:10000) {


death <- rbinom(500,1,probability.of.claim)


x <- rexp(500,rate=exponential.parameter)
sim <- x*death
sim <- sum(sim)
S.sim[i] <- sim
}
length(S.sim[S.sim>=lower.bound&S.sim<=upper.bound])/length
(S.sim)
(iii)
(a) q <- 1-probability.of.claim
prod(q)
instances.of.0 <- 10000-length(S.sim[S.sim!=0 ])
instances.of.0
(instances.of.0/10000)/prod(q)-1

(b)
summary(probability.of.claim)
Answer 4
set.seed(1812)
q <- runif(200,0,0.01)
mu <- runif(200,5,13)
sigma <- runif(200,0.5,0.8)
policies <- cbind(q,mu,sigma)

(i)
set.seed(1854)
SR.sim <- numeric(5000)
for (i in 1:5000) {
death <- rbinom(200,1,policies[,1])
x <- rlnorm(200,meanlog=policies[,2],sdlog=policies[,3])
z <- pmax(0,x-60000)
sim <- z*death
sim <- sum(sim)


SR.sim[i] <- sim


}
mean(SR.sim)
[1] 51425.54

(ii)
set.seed(1854)
S.sim <- numeric(5000)
for (i in 1:5000) {
death <- rbinom(200,1,policies[,1])
x <- rlnorm(200,meanlog=policies[,2],sdlog=policies[,3])
sim <- x*death
sim <- sum(sim)
S.sim[i] <- sim
}
SR.sim <- pmax(0,S.sim-300000)
mean(SR.sim)
[1] 20635.58
So the reinsurer’s mean aggregate claims payment is approximately 20,636

Answer 5.
(i)
set.seed(8143)
n <- rpois(1000,100)
s <- numeric(1000)

for(i in 1:1000) {
x <- rexp(n[i], rate=0.002)
s[i] <- sum(x)
}
The 98th element of s is:
s[98]
[1] 37309.99


So the 98th simulated value of the aggregate claim amount S is


37,309.99.

(ii)
quantile(s,0.5)
50%
49812.93

(iii)
skewness<-sum((s-mean(s))^3)/length(s)
skewness/var(s)^(3/2)
[1] 0.1570086

Answer 6.
(i)
set.seed(123)
n <- rbinom(10000,size=80,prob=0.3)
t <- numeric(10000)
rpareto <- function(n,a,l){
rp <- l*((1-runif(n))^(-1/a)-1)
rp
}
for(i in 1:10000) {
x <- rpareto(n[i],a=4,l=4500)
t[i] <- sum(x)
}

Displaying the last five values:


tail(t,5)
[1] 59012.54 47233.24 39993.46 25798.18 45305.19

(ii)
We need to find the 95th percentile q, which we can do
using the quantile function:


quantile(t,0.95)
95%
57819.2

(iii)
#First we write the negative log-likelihood function f:
f <- function(params) {
lnL <- dgamma(t,shape=params[1],rate=params[2],log=TRUE)
sum(-lnL)
}

#We don’t know what the parameters might be, so we will


#choose arbitrary starting values for the iteration:
p <- c(10, 10)

#Finding the maximum likelihood estimates for the Gamma
#distribution:
nlm<-nlm(f,p)
nlm
$minimum
[1] 107603
$estimate
[1] 9.2516397652 0.0002576924
$gradient
[1] -3.408396e+01 -1.867021e+07
$code
[1] 3
$iterations
[1] 33

So α̂ = 9.2516 and λ̂ = 0.0003


(iv)
#First we extract the maximum likelihood estimates from the
#output of nlm.
a <- nlm$estimate[1]
l <- nlm$estimate[2]

#Then we generate a set of, say, 100 random variates from


#the fitted distribution:
set.seed(123)
agg <- rgamma(100, shape=a, rate=l)

#Now we create a Q-Q plot of the fitted dataset agg against


#the simulated dataset t:
qqplot(agg,t,xlab = "Quantiles from fitted
distribution",ylab = "Quantiles of simulated data",main =
"QQ plot of data")

#Finally we add a straight line along the main diagonal, to


#help judge the goodness of fit:
abline(0,1,col="red")


The data appears to be a reasonable fit, except for the right‐hand tail of the distribution.
Here the quantiles of the simulated data are higher than the quantiles of the fitted
distribution, and there is one outlier that is very far from the maximum likelihood
distribution. So the simulated data is more positively skewed than the fitted gamma
distribution. (In fact, the data is not a particularly good fit for the left‐hand tail of the
distribution either.)
(v)
#Plotting the empirical density function:
plot(density(t),ylim=c(0,0.000001),xlim=c(50000,300000),
main="Simulated data versus fitted distribution",
xlab="Aggregate claim size",col="blue")

#(We have chosen the range of the empirical density function plot
#by trial and error, using the ylim and xlim arguments.)

#Generating a sequence of values corresponding to the same range as the x axis:
agg <- seq(from=50000,to=300000)

#Calculating the corresponding values f (agg) for the fitted


#gamma distribution:
y <- dgamma(agg, shape=a, rate=l)

#Adding the fitted density function to the graph:


lines(agg,y,col="red")

The simulated data is indeed more positively skew than the fitted distribution.


Answer 7.
set.seed(123)
n <- rnbinom(10,size=5,prob=0.8)
w <- numeric(10)

rburr <- function(n,a,l,g){


rp <- (l*((1-runif(n))^(-1/a)-1))^(1/g)
rp
}

for(i in 1:10) {
x <- rburr(n[i],a=0.3,l=10,g=1.2)
w[i] <- sum(x)
}

w
[1] 558.14071 387.40722 23.26043 39.80472 50.99664

[6] 48.72566 1546.04093 620.37950 0.00000 63.42809

Answer 8.
set.seed(54321)
rate1 <- 0.005; q.1 <- 0.001
rate2 <- 0.006; q.2 <- 0.002
rate3 <- 0.007; q.3 <- 0.003

S.sim <- numeric(5000)

for (j in 1:5000) {
x<-numeric(100)
for (i in 1:30) {
x[i] <- rexp(1,rate=rate1)
}
for (i in 31:80) {
x[i] <- rexp(1,rate=rate2)


}
for (i in 81:100) {
x[i] <- rexp(1,rate=rate3)
}
death<-numeric(100)
for (i in 1:30) {
death[i] <- rbinom(1,size=1,prob=q.1)
}
for (i in 31:80) {
death[i] <- rbinom(1,size=1,prob=q.2)
}
for (i in 81:100) {
death[i] <- rbinom(1,size=1,prob=q.3)
}
sim <- x*death
sim<-sum(sim)
S.sim[j] <- sim
}

#To estimate the empirical probability:


length(S.sim[S.sim>100])/length(S.sim)
[1] 0.1056
Hence, the probability of the insurer having to pay out more than £100,000 in total
is approximately 10.56%.
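The nested loops above can also be written more compactly. A vectorised sketch of the same model (it draws the random numbers in a different order, so it will not reproduce the seeded values above):

rates <- c(rep(0.005,30), rep(0.006,50), rep(0.007,20))
probs <- c(rep(0.001,30), rep(0.002,50), rep(0.003,20))
S.sim2 <- replicate(5000, sum(rexp(100, rates)*rbinom(100, 1, probs)))
length(S.sim2[S.sim2 > 100])/length(S.sim2)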


Survival Models

Question 1. Let T40 be the random variable denoting the future lifetime of lives
currently aged 40. It is believed that T40 follows the W(0.00001, 3.5) distribution.
(i) (a) Generate a sample of 10,000 values of T40, setting a seed of 100 and
storing the simulated values in the vector sims.
(b) Estimate P(T50 > 10) using your sample.
Under this model, the hazard at age 40 + t is given by:
μ40+t = αβt^(β−1)
where α = 0.00001 and β = 3.5.
(ii) (a) Construct a function in R to calculate μx for an inputted age, x ≥ 40.
(b) Calculate P(T50 > 10) using your function for μx.
(iii) Comment on your answers to parts (i)(b) and (ii)(b).

Question 2. Mortality of a group of lives is assumed to follow Gompertz’ law for ages 25 and
over. The force of mortality at age x ≥ 25 is given by:
μx = B c^(x−25) for x ≥ 25
where B = 0.00000729 and c = 1.128.
(i) Plot the force of mortality from ages 25 to 100.
(ii) Plot the PDF of T45, the future lifetime random variable of someone aged
45.
(iii) Calculate a numerical estimate of e̊50, the complete expectation of life of
someone aged 50.
(iv) (a) Simulate 100,000 values of T50, the future lifetime random variable of
someone aged 50, setting a seed of 50 and storing your simulated values in
the vector sims.
(b) Estimate e̊50 from your simulated values.

Question 3. The ‘Very-ruthless Management Consultancy Company’ pays very high wages
but also has a very high failure rate, both from sackings and through people
leaving.
The R code to create a data frame representing the life table for a typical new
recruit (with durations measured in years) is given below:
life.table = data.frame(duration = 0:9,
lives = c(100000,72000,51000,36000,24000,15000,10000,6000,2500,0))
(i) Construct an extra column in the table that contains in each row the
probabilities of a new recruit’s survival to that year (the first row should
be set to 1).
(ii) Calculate the expected number of complete years that a new graduate will
complete with the company.
(iii) 45 of the graduates that started working at the company on 1 September
last year are still at the company exactly one year later. Calculate the
expected number of complete additional years that these graduates will
complete with the company.


Question 4. The values of μx for the AM92 mortality table for ages in the range 17 ≤ x ≤ 119
are calculated using the following formula:
μx = a0 + a1 t + exp[b0 + b1 t + b2 (2t² − 1)]
where t = (x − 70)/50 and a0 = 0.00005887, a1 = -0.0004988, b0 = -4.363378,
b1 = 5.544956, b2 = -0.620345.
(i) (a) Construct a function in R to calculate μx for an inputted age, x.
(b) Calculate the value of μx at one of the tabulated ages using your function,
checking it against the tabulated value on page 81 of the Tables.
(c) Plot a graph of μx using a log scale. [Hint: use the log = "y"
argument in the plot() function.]
You are given that qx can be approximated using the relationship
qx ≈ 1 − exp(−μx+½).
(ii) (a) Use this approximation to calculate values of qx, rounded to 6 decimal
places, to complete the table below:

Age      Approximate value of qx (based on AM92 mortality)
20
30
40
50
60
70
80
90
100
110

(b) Comment on the accuracy of the values in your table.

(iii) (a) Calculate an approximate value for the curtate expectation of life e25.
Hint: you can use the cumprod() function and the fact that tpx = px × px+1 × ... × px+t−1

(b) Check your value of e25 against the tabulated value on page 82 of the
Tables.
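As a starting point for part (i)(a), the formula above translates into R along the following lines (a minimal sketch using the constants given in the question):

a0 <- 0.00005887; a1 <- -0.0004988
b0 <- -4.363378; b1 <- 5.544956; b2 <- -0.620345
mu.am92 <- function(x){
t <- (x - 70)/50
a0 + a1*t + exp(b0 + b1*t + b2*(2*t^2 - 1))
}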


SOLUTIONS

Answer 1
(i)(a) a = 0.00001
b = 3.5
shape = b
scale = a^(-1/b)
We then simulate the values using the rweibull() function:
set.seed(100)
sims = rweibull(10000,shape,scale)

(b) In part (i) we simulated values of T40. So, we first need to convert P(T50 > 10) into a probability
in terms of T40. Using the principle of consistency:
20p40 = 10p40 × 10p50   ⇒   10p50 = 20p40 / 10p40 = P(T40 > 20) / P(T40 > 10)
To estimate this from the simulated values, we need to calculate how many of the simulated
values are at least 20 and divide this by how many are at least 10.
In R, we can do this using square bracket notation with logical conditions and the length()
function:
length(sims[sims > 20]) / length(sims[sims > 10])
[1] 0.7277898
(ii)(a)
mux = function(x){
a * b * (x - 40)^(b - 1)
}
(b) exp(-integrate(mux, 50, 60)$value)
[1] 0.7216983
(iii) The answers are very similar.
We expect the answer to part (ii)(b) to be fairly accurate, even though R evaluates the integral
numerically. The answer to part (i)(b) is slightly different due to sampling error. If we took a
larger sample, we would expect the answer to be closer to that in part (ii)(b).

Answer 2
(i) B = 0.00000729
c = 1.128
mux = function(x){
B * c ^ (x - 25)
}
plot(25:100, mux(25:100), type = "l", col = "blue",
main = "Graph of mu_x against age",
xlab = "age", ylab = "Force of mortality", lwd = 3)

(ii) We first load the flexsurv package:


library(flexsurv)
We then create some objects for the shape and rate parameters:
shape = log(c)
rate = B
We can then plot the PDF using the dgompertz() function, noting that for a 45-year-
old, the rate parameter is B × c^20:
plot(dgompertz(0:100, shape, rate * c ^ 20), col = "blue",
type = "l",
main = "Density of the future lifetime of a 45-year-old",
xlab = "t", ylab = "f(t)", lwd = 3)

(iii) integrate(pgompertz, 0, Inf, shape = shape, rate = rate*c^25,
      lower.tail = FALSE)
50.91776 with absolute error < 2.7e-05
(iv) (a) set.seed(50)
sims = rgompertz(100000, shape, rate * c ^ 25)

(b) mean(sims)
[1] 50.94788

Answer 3
(i) First we load in the data:
life.table = data.frame(duration = 0:9, lives =
c(100000,72000,51000,36000,24000,15000,10000,6000,2500,0))
The probabilities of surviving to each year are the number of lives corresponding to that year
divided by the starting number of lives.
We can calculate these using square brackets for indexing and creating a new column using $:
life.table$surv = life.table$lives / life.table$lives[1]
life.table
duration lives surv
1 0 100000 1.000
2 1 72000 0.720
3 2 51000 0.510
4 3 36000 0.360
5 4 24000 0.240
6 5 15000 0.150
7 6 10000 0.100
8 7 6000 0.06
9 8 2500 0.025
10 9 0 0.000

(ii) We can do this using the sum() function and square brackets with a negative number to
remove the first entry:
sum(life.table$surv[-1])
[1] 2.165

(iii) We now want to calculate e1, the expected number of complete additional years, using the formula:

e1 = Σ (t = 1 to ω) tp1

Here, we have ω = 7. We first need to calculate the tp1. These probabilities are the lives in
each year divided by the number of lives in year 1. We can calculate these and add a
column to the table similar to part (i). We set the first row to NA as it represents −1p1,
which doesn’t make any sense:

life.table$surv_after_1_year=
c(NA,life.table$surv[2:nrow(life.table)] / life.table$surv[2])

life.table
e1 is now the sum of the probabilities in the last column from the third row on. To
calculate this, we can take the sum of the column excluding the first and second entries:
sum(life.table$surv_after_1_year[-c(1:2)])
[1] 2.006944
Answer 4
(i) (a) a0 = 0.00005887
a1 = -0.0004988
b0 = -4.363378
b1 = 5.544956
b2 = -0.620345
mu <- function(x){
t = (x-70)/50
a0 + a1*t + exp(b0 + b1*t + b2*(2*t^2 - 1))
}
(b) mu(80)
[1] 0.06827114

(c) plot(mu,20,110,log="y", lwd = 3, xlab="Age (x)", ylab="mu(x)",


main="Mortality rates for AM92 (log scale)")

(ii) (a)
We can create a list of ages for the first column using the following R command:
xlist = seq(20,110,by=10)
We can now define a function to calculate the values in the right-hand column of the table in
the question:
fnexp <- function(x) {1-exp(-mu(x + 0.5))}
We can then input the required ages to this function, rounding the answers to 6 decimal
places:
qlist = round(fnexp(xlist),6)
There are various ways to combine the ages and probabilities in the form of a table. One
method is to use the cbind function to join the columns together:
cbind(xlist,qlist)

(b)
The accurate values for ages 20, 60 and 110 (say), from pages 78 and 79 of the Tables, are:
q20 = 0.000582, q60 = 0.008022 and q110 = 0.607918
We can see that all the approximations are very good – accurate to at least 4 decimal places.
However, the accuracy diminishes as the age increases.
This is to be expected, because there will be a larger variation in the value of µx over the year
of age at the older ages. So the approximation of using the value for the middle of the age
range will not be as accurate here.

(iii) (a) The formula for calculating the curtate expectation of life is ex = Σ (t ≥ 1) tpx


Our formula only applies up to age 119. Since 25 + 94 = 119 we need:
e25 = Σ (t = 1 to 94) tp25
From the hint in the question, tpx = px × p(x+1) × … × p(x+t−1). We can estimate the px
probabilities using the fnexp() function and then multiply them together to estimate tpx.
For example, estimating 2p25 = p25 × p26:
prod(1 - fnexp(25:26))
[1] 0.9988669
Firstly, getting out all the required px estimates from p25 to p118 (as 94p25 = p25 × p26 × … × p118,
the last probability required is p(25+93) = p118):
pxs = 1 - fnexp(25:118)
We can use the cumprod() function to calculate the cumulative product of these estimated
px probabilities, which calculates the required estimates of tp25 :
tp25s = cumprod(1 - fnexp(25:118))
To estimate e25, we take the sum of these probabilities:
sum(tp25s)
[1] 53.61316
(b) We can see from page 82 of the Tables that the accurate value is 53.609.
There is a small error in our approximation, but it is still accurate to 2 decimal places,
which would be sufficient for most purposes.

Estimating The Lifetime Distribution


Question 1. A mortality study involving 10 patients exposed to a serious disease produced the
results in the data file surv_table_1.csv. The file contains the following
information for each patient:
• patient – the patient ID
• exposure_time – the time the patient was exposed in weeks since the study
began
• observation_end – the time observation of the patient ended in weeks since the
study began
• reason – the reason the observation of the patient ended, either death (D) or
right censored (C)
(i) Load the table into R.
(ii) Create a survival object in R using the data from the file.
(iii) (a) Calculate the Kaplan-Meier estimate of the survival function.
(b) Produce an appropriate graph showing the Kaplan-Meier estimate of the
survival function.
(iv) Calculate an approximate 95% confidence interval for S(7) using the Kaplan-
Meier approach.
(v) (a) Calculate the Nelson-Aalen estimate of the survival function.
(b) Produce an appropriate graph showing the Nelson-Aalen estimate of the
survival function.

Question 2. A study aimed at estimating hazard rates and survival probabilities from the point
of contracting a particular disease has produced the results in the data file
surv_table_2.csv, which has the following columns:
• Time – the time an event took place measured in weeks since contraction of the
condition
• Deaths – the number of deaths that took place at each time
• Censorings – the number of right censorings that took place at each time

A total of 100 patients were observed since the day that they contracted the
illness. Observation of patients stopped after 25 weeks following contraction of
the condition.

(i) Load the table into R into an object called surv.data.


(ii) Create a data frame called deaths with columns called tj, dj and nj that
contain tj, dj and nj respectively.
The researchers initially carry out an analysis of the results using the Kaplan-Meier
model.
(iii) (a) Update your table to calculate a column called skm that contains an
estimate of the survival function using the Kaplan-Meier model.
(b) Produce an appropriate graph in R showing the Kaplan-Meier estimate.
The researchers have now decided to re-analyse the results using the Nelson-Aalen model
for comparison.

(iv) (a) Update your table to include a column called Lj that contains an estimate
of the integrated hazard Λ(t) using the Nelson-Aalen model.
(b) Calculate a set of approximate 95% confidence intervals for Λ(t) at the
time of each death event.
(c) Produce an appropriate graph in R showing a plot of the estimated
integrated hazard function with approximate 95% confidence intervals.

The following code creates a vector of event times and a vector of event codes (1
indicating death and 0 indicating censoring) for the 100 patients:

event.times = c(rep(surv.data$Time, times = surv.data$Deaths),


rep(surv.data$Time, times = surv.data$Censorings))

event.codes = c(rep(1, times = sum(surv.data$Deaths)),


rep(0, times = sum(surv.data$Censorings)))

(v) (a) Create a survival object using the vectors event.times and event.codes.
(b) Calculate the Nelson-Aalen estimate of the survival function and
approximate 99% confidence intervals.
Hint: the conf.int argument in the survfit() function sets the level of the
estimated confidence interval.
(c) Produce an appropriate graph in R showing the Nelson-Aalen estimate and
approximate 99% confidence intervals.
(vi) (a) State an inequality relating the values of Ŝ_KM(t) and Ŝ_NA(t), the
estimated survival functions based on the Kaplan-Meier and Nelson-Aalen models.
(b) Demonstrate numerically the validity of the inequality in part (vi)(a) using
the data in this study.

SOLUTIONS

Answer 1
(i) surv.data = read.csv("surv_table_7.1.csv")
(ii) We first need to ensure the survival package is active:
library(survival)

We then need to create a vector of event times as the difference between the end of
observation and time of exposure:
surv.data$event.time = surv.data$observation_end -
surv.data$exposure_time

We also need a vector of event codes. The table has a reason column, which we need to use to
create a vector of 1’s (to indicate death) and 0’s (to indicate censoring):
surv.data$event.code = ifelse(surv.data$reason == "D", 1, 0)

We can then use the Surv() function to create the survival object:
surv.obj = Surv(surv.data$event.time, surv.data$event.code)
surv.obj
[1] 2 3+ 5 8 2 10+ 14 3 10+ 3

(iii) (a)
We can use the survfit() function with the survival object from part (ii):
fitkm = survfit(surv.obj ~ 1, conf.type = "plain")

We can then use the summary function to output the estimate:


summary(fitkm)
Call: survfit(formula = surv.obj ~ 1, conf.type = "plain")
time n.risk n.event survival std.err lower 95% CI upper 95% CI
2 10 2 0.8 0.126 0.5521 1.000
3 8 2 0.6 0.155 0.2964 0.904
5 5 1 0.48 0.164 0.1587 0.801
8 4 1 0.36 0.161 0.0445 0.676
14 1 1 0.00 NaN NaN NaN

(b) plot(fitkm,
main = "Kaplan-Meier estimate of survival function",
xlab = "weeks",
ylab = "SKM(t)")

(iv) We can extract the confidence interval from the summary:


summary(fitkm)
The estimate of S(7) is in the third row, as this gives the estimate of the survival function
for 5 ≤ t < 8. So, the confidence interval is (0.1587, 0.801). We can extract more decimal
places using the $ symbol:
summary(fitkm)$lower[3]
[1] 0.1586615
summary(fitkm)$upper[3]
[1] 0.8013385
So, the confidence interval is (0.15866, 0.80134) .
(v) (a) Again using the survfit() function, this type setting the stype argument to 2:
fitna = survfit(surv.obj ~ 1, conf.type = "plain", stype = 2)
summary(fitna)
(b) plot(fitna,
main = "Nelson-Aalen estimate of survival function",
xlab = "weeks",
ylab = "SNA(t)")

Answer 2
(i) surv.data = read.csv("surv_table_2.csv")
surv.data

(ii) The data in the table is by event time rather than by patient. So, first we need to just extract the
rows that relate to death times:
deaths = surv.data[surv.data$Death > 0, 1:2]
deaths
This gives both tj and dj . We can rename the columns to these titles:
colnames(deaths) = c("tj", "dj")
deaths
Next we need to calculate the nj . As each patient was observed since they contracted the
illness, nj is simply the number of patients that experienced their event on or after tj. To make
the calculation a little easier, we can first create a new column in the original data set that is the
sum of deaths and censoring, ie the total number of events at each time:
surv.data$Events = surv.data$Death + surv.data$Censorings

Next, we calculate the number of events happening at times greater than or equal to all the
death times
deaths$nj = numeric(nrow(deaths))

for (i in 1:nrow(deaths)) {
deaths$nj[i] = sum(surv.data$Events[surv.data$Time >= deaths$tj[i]])
}
deaths

(iii) (a) deaths$lamj = deaths$dj / deaths$nj


deaths$skm = cumprod(1 - deaths$lamj)
deaths
(b) plot(c(0, deaths$tj, surv.data$Time[nrow(surv.data)]),
c(1, deaths$skm, deaths$skm[nrow(deaths)]),
type = "s", lwd = 2,
main = "Kaplan-Meier estimate of survival function",
xlab = "weeks",
ylab = "SKM(t)")

(iv) (a) deaths$Lj = cumsum(deaths$lamj)


deaths

(b) deaths$Lj.se = sqrt(cumsum(deaths$dj*(deaths$nj -


deaths$dj)/deaths$nj^3))

Lj.CI = data.frame(tj = deaths$tj,


lower = deaths$Lj - qnorm(0.975) * deaths$Lj.se,
upper = deaths$Lj + qnorm(0.975) * deaths$Lj.se)

Lj.CI$lower = pmax(0, Lj.CI$lower)


Lj.CI

(c) plot(c(0, deaths$tj), c(0, deaths$Lj), type = "s", lwd = 2,


main = "Nelson-Aalen estimate of the integrated hazard
function",
xlab = "weeks",
ylab = "Lambda(t)")

We can then add the confidence intervals using the lines() function and setting the line
type to 2 for a dotted line:

lines(deaths$tj, Lj.CI$lower, type = "s", lty = 2)


lines(deaths$tj, Lj.CI$upper, type = "s", lty = 2)

Unfortunately, the upper part of the approximate confidence interval is being cut off by
the limits of the y -axis. We can fix this with the ylim argument in the plot() function:
plot(c(0, deaths$tj), c(0, deaths$Lj), type = "s", lwd = 2,
main = "Nelson-Aalen estimate of the integrated hazard
function",
xlab = "weeks",
ylab = "Lambda(t)", ylim = c(0,0.3))

lines(deaths$tj, Lj.CI$lower, type = "s", lty = 2)


lines(deaths$tj, Lj.CI$upper, type = "s", lty = 2)

(v) (a) First ensuring we have the package loaded:


library(survival)
Next running the code provided:
event.times = c(rep(surv.data$Time, times = surv.data$Death),
rep(surv.data$Time, times = surv.data$Censorings))

event.codes = c(rep(1, times = sum(surv.data$Death)),


rep(0, times = sum(surv.data$Censorings)))

We now have a vector of times and codes to use with the Surv() function:
surv.obj = Surv(event.times, event.codes)
surv.obj

(b) Using the survfit() function with stype set to 2 and conf.int set to 0.99:
fitna = survfit(surv.obj ~ 1, conf.type = "plain", stype = 2,
conf.int = 0.99)
We can then look at the estimate using the summary() function:
summary(fitna)
(c) plot(fitna,
main = "Nelson-Aalen estimate of survival function",
xlab = "weeks",
ylab = "SKM(t)", ylim = c(0.6, 1))

(vi) (a) Ŝ_KM(t) ≤ Ŝ_NA(t)


(b) (data.frame(tj = deaths$tj,
na = summary(fitna)$surv,
km = deaths$skm))
We could also check this graphically, however it is difficult to see clearly in this case as the
estimates are so close:
plot(fitna,main = "Nelson-Aalen estimate of survival function",
xlab = "weeks",
ylab = "SKM(t)", ylim = c(0.7, 1))

lines(c(0, deaths$tj, surv.data$Time[nrow(surv.data)]),


c(1, deaths$skm, deaths$skm[nrow(deaths)]),
type = "s", col = "blue")

legend("bottomleft",
c("Nelson-Aalen estimate", "Kaplan-Meier estimate"),
col = c("black", "blue"), lty = 1)

The blue line is never above the black line, as expected based on the relationship between the
estimates.

Proportional Hazards Models

Question 1. The table below shows the number of circuits a group of 10 army recruits were
able to complete in a training exercise and their ages, split by sex.
Complete Circuits    Males         Females
        1            –             –
        2            –             30
        3            21, 20        24, 25
        4            –             19
        5            18, 18, 17    –
        6            –             21

Two Cox proportional hazards models are being considered for modelling the
result :
Model 1: ln λ(t; z1) = ln λ0(t) + β1 z1
Model 2: ln λ(t; z1, z2) = ln λ0(t) + β1 z1 + β2 z2
Here:
• λ(t; z1) and λ(t; z1, z2) are the hazard rates for ‘dropping out’, i.e. for the
recruit reaching their limit in terms of the number of circuits they can
complete
• z1 is the recruit’s sex (z1 = 0 for males, z1 = 1 for females)
• z2 is the recruit’s age.

(i) Construct a data frame that contains a row for each recruit and the
following columns:
• Age – the age of the recruit
• Sex – the sex of the recruit
• Circuits – the number of circuits completed by the recruit
• Status – whether each recruit experienced the death event (1) or
was censored (0)
Hint: the data.frame(<col 1> = <vector 1>, ..., <col n> = <vector
n>) function creates a data frame with columns called <col 1>,...,<col n>
containing the values in <vector 1> to <vector n>.

(ii) (a) Fit Model 1 to the data.


(b) State the estimated value of β1 and comment on this value.

(iii) (a) Fit Model 2 to the data.


(b) State the estimated values of β1 and β2, and comment on these values.

(iv) Determine whether Model 2 is a significant improvement over Model 1.


Hint: anova(model1, model2, test ="Chi")performs a likelihood ratio
test.

Question 2. This question uses the models from Question 1.


Ignoring constants, the partial log-likelihood function for Model 1 in Question 1
is given by:
ln L(β1) = 4β1 − ln(5exp(β1)+5) − 4 ln(4exp(β1)+5) − ln(2exp(β1)+3) − 3 ln(exp(β1)+3)

(i) (a) Construct a function that returns the partial log-likelihood for an
inputted value of β1.
(b) Show that the value of the partial log-likelihood (ignoring
constants) is –16.78165 when β1 = 0.3.

(ii) Plot a graph of the partial log-likelihood function over a suitable range,
indicating an appropriate interval that contains the estimate β̂1.

(iii) Estimate β1 using non-linear minimisation.

(iv) Compare your answer to part (iii) to your answer to Question 1 part (ii)(b).

SOLUTIONS

Answer 1
(i) Creating the necessary vectors and putting them into a data frame:
circuits = c(2,3,3,3,3,4,5,5,5,6)
status = rep(1,10)
sex = c(1,0,0,1,1,1,0,0,0,1)
age = c(30,21,20,24,25,19,18,18,17,21)
PH_data = data.frame(circuits,status,sex,age)

We could also do this all in one step as follows:

PH_data = data.frame(circuits = c(2,3,3,3,3,4,5,5,5,6),


status = rep(1,10),sex = c(1,0,0,1,1,1,0,0,0,1),
age = c(30,21,20,24,25,19,18,18,17,21))

(ii)(a) Firstly ensuring that the survival package is loaded:


library(survival)

Then creating the survival object:


surv.obj = Surv(PH_data$circuits, PH_data$status)

Finally, fitting the model:


fit1 = coxph(surv.obj ~ PH_data$sex, ties = "breslow")
summary(fit1)

(b) The estimated value of β1 is 0.2738.

This gives the fitted model:
Model 1: ln λ(t; z1) = ln λ0(t) + 0.2738 z1
Here exp(β̂1) equals:
exp(fit1$coefficients)
sex
1.314932

So we can also express this model as:


λ(t; z1) = λ0(t) for z1 = 0 and λ(t; z1) = λ0(t) × 1.315 for z1 = 1
This model suggests that the drop-out rate is 31.5% higher for the females compared to
the males.
However, the p -value for this covariate is 69%. This is much greater than 5%, which
indicates that the difference between the sexes based on this model is not remotely
significant.

(iii)(a) Fitting the model:


fit2 = coxph(surv.obj ~ PH_data$sex + PH_data$age,ties="breslow")
summary(fit2)

(b) The estimated values of β1 and β2 are now −2.0086 and 0.5744.
This gives the fitted model:
Model 2: ln λ(t; z1, z2) = ln λ0(t) − 2.0086 z1 + 0.5744 z2

We now have:
exp(fit2$coefficients)
sex age
0.1341723 1.7760945

Since β̂1 is negative, this model now suggests that the drop-out rate is lower for the
females compared to the males, but again this difference is not significant (p-value =
14.1%).
Since β̂2 is positive, this model suggests that the drop-out rate increases with age (77.6%
per year of age according to the fitted model). This effect is significant in this model,
since the p-value of 4.3% is less than 5%.

(iv) We are testing the null hypothesis that Model 2 is not a significant improvement over
Model 1.
Using the hint in the question, we can use the anova() function.
anova(fit1, fit2, test = "Chi")

This tells us that the partial log-likelihood for the first model is –16.781 and –13.419 for
the second. The observed value of the test statistic, −2(ℓ1 − ℓ2), is 6.7042 and the
associated p-value is 0.009618.
As the p -value of 0.962% is less than 5%, there is sufficient evidence to reject the null
hypothesis that Model 2 is not a significant improvement over Model 1. It appears
reasonable to conclude that incorporating age has improved the model.

Answer 2
(i) (a) Constructing the function:
part.log.lik = function(b){
4 * b - log(5 * exp(b) + 5) - 4 * log(4 * exp(b) + 5) -
log(2 * exp(b) + 3) - 3 * log(exp(b) + 3)
}
(b) Checking the value:
part.log.lik(0.3)
[1] -16.78165
(ii) plot(part.log.lik, 0, 1, lwd = 2,
main = "Partial log-likelihood",
xlab = "beta", ylab = "log(L(beta))")

There appears to be a solution somewhere in the range 0.2 to 0.4.

(iii) First setting up a function to return the negative partial log-likelihood:


neg.part.log.lik = function(b){
-part.log.lik(b)
}
Then using the nlm() function with a starting value of say 0.3:
nlm(neg.part.log.lik, p = 0.3)

The function appears to have found a solution, as the code is 1


and there was at least one iteration. So, the estimate is β̂1 = 0.2737846.

(iv) This value is almost exactly the same as the estimate from Question 1 part (ii)(b).

Markov Chains
Question 1. A Markov chain has the following state space and one-step transition
probabilities:

(i) Construct a matrix that contains the one-step transition probabilities, labelling the
rows and columns with ‘state 1’, ‘state 2’ and ‘state 3’.

The new() function can be used to create a markovchain object.

(ii) Construct a markovchain object, mc, with the name ‘Q1MC’, using the
matrix from part (i) as the transition probability matrix.
(iii) (a) Determine whether the Markov chain is irreducible using functions in R.

(b) Determine the period of the Markov chain using functions in R.

The @ symbol can be used with markovchain objects to extract its components. The
following code extracts the transition matrix:
mc@transitionMatrix

(iv) Check the row sums of the extracted transition matrix are all 1.

(v) (a) Calculate the probability that the chain will be in state 1 after 4 steps,
given it is currently in state 2, using mc.

The current distribution across the states is d = (0.2, 0.3, 0.5) .

(b) Calculate the probability that the chain will be in state 2 after 8 steps, using
mc.

(c) Calculate the stationary distribution of the chain, using mc.

Question 2. The most recent 18 values of a discrete-time process with states 1, 2 and 3 were as
follows:

2 1 2 1 3 1 1 2 2 1 1 3 3 1 2 1 2 1

(i) Create a vector named past18 containing the values of the past states.

Enter the following lines of R code, which fit a Markov chain model to the past
data:
fit1 = markovchainFit(past18)

Now query the result by entering:


fit1$estimate
fit1$logLikelihood

(ii) Explain the output produced by fit1$estimate and


fit1$logLikelihood.

You should tackle questions 3, 4 and 5 using either the basic R functions or functionality in the
markovchain package (or both). Answers for both approaches are presented in the solutions.

Question 3. A no-claims discount (NCD) system for motor insurance is to be modelled using a
Markov chain with constant transition probabilities, as shown in the diagram:

State 1 No discount
State 2 15% discount
State 3– 30% discount (at least one claim in previous year)
State 3+ 30% discount (no claims in previous yr)
State 4 40% discount

(i) (a) Construct the one-step transition probability matrix for this process P.
(b) Check that the rows of the matrix sum to 1.

A new policyholder is currently on the 0% discount level.

(ii) Calculate the probabilities for the discount rate this policyholder will be
receiving after five years.
(iii) Calculate the probabilities for the discount rate for this policyholder after 20
years.
(iv) (a) Calculate P^25, the 25-step transition matrix.
(b) Comment on the rows of P^25.

Hint: you may wish to use the functions matrix(), diag(), rowSums() and the
matrix multiplication operator %*%.

Question 4. An insurance company has been using a Markov chain to model its no-claims
discount (NCD) system, which offers the following discounts to motorists on their
annual premium:

Level 1 No discount
Level 2 15% discount
Level 3 25% discount
Level 4 30% discount
• After a claim free year, probability 0.85, policyholders move up a level or stay at
Level 4.
• After a year with exactly one claim, probability 0.11, policyholders move down a
level or stay at Level 1.
• After a year with more than one claim, policyholders move down two levels,
subject to a minimum level of Level 1.

(i) Construct the one-step transition probability matrix P.


(ii) Construct a function that returns the n -step transition probability matrix P (n)
with inputs P and n .
(iii) Calculate (a) P(4) (b) P(7)

For a particular class of policyholder (same type of car, type of cover etc.), the
insurance company has re-evaluated the one-step probability distribution so that there is
now a 0.83 probability of a claim-free year and a 0.12 probability of a policyholder
claiming exactly once in any given year. The full premium in this class is 750.

(iv) (a) Recalculate the one-step transition probability matrix P


(b) Calculate the stationary distribution of the class.

(v) Construct a function that calculates the expected total premium income over n
years for a given starting distribution, income vector and one-step transition
matrix, without using the markovchain package.

(vi) Calculate the expected premium income:


(a) over the next 5 years for a new driver
(b) over the next 10 years for a new driver
(c) over the next 10 years for a driver currently on Level 4
The insurance company is thinking of offering its Level 4 policyholders a chance to
pay an additional premium to protect their own discount level (NCD protection).
Under the proposal, policyholders on Level 4 who had paid the additional premium
would:
• stay on Level 4 if they make exactly one claim
• drop just one level (instead of two) if they make more than one claim.
(vii) Calculate the new expected premium income over the next 10 years for a driver
currently on Level 4 if the additional premium is £0 (so that all level 4
policyholders get the above benefits for free).

The insurance company is considering setting the additional premium at between £10 and
£15 per annum.

(viii) Calculate the new expected premium income over the next 10 years for a driver
currently on Level 4 who pays the additional premium for the case where:

(a) the additional premium is £10, storing your answer in the object prem.10
(b) the additional premium is £15, storing your answer in the object prem.15.

(ix) Comment on whether the insurer should set the additional premium at £10, with
reference to your answer to part (vi)(c).
The overall expected premium income is linearly related to the additional premium
amount.
(x) Find the additional premium required for NCD protection (to 2 decimal places)
so that the new expected premium income is equal to that calculated in part
(vi)(c), by using linear interpolation or otherwise.
(xi) Explain why the insurance company might opt for a much higher additional
premium than that calculated in part (x).

Question 5. An insurance company is using a Markov chain to model its no-claims discount
(NCD) system, which offers the following discounts to motorists on their annual
premium:
Level 1 No discount
Level 2 10% discount
Level 3 20% discount
Level 4 30% discount
Level 5 40% discount
After a claim-free year, probability 0.9, policyholders move up a level or stay at Level
5. After a year with exactly one claim, probability 0.08, policyholders move down a
level or stay at Level 1. After a year with more than one claim, policyholders move to
Level 1.
(i) Construct the one-step transition probability matrix P.
The probability that a policyholder on Level i is on Level j after n years is
denoted by pij(n).
(ii) Construct a function that calculates the n-step transition probability matrix as a
function of P and use it to calculate the following probabilities:
(a) p15(5)
(b) p53(4)
(c) p34(6)
For a particular class of policyholder (same type of car, type of cover etc) the full
premium is 800 and there are 2,000 policyholders in the class. You may assume that
the policyholders have been driving for a number of years.
(iii) Calculate the stationary distribution and use it to calculate the expected annual
premium income for this policy class.
To improve the overall premium income and make the system fairer to the safer
drivers, the insurance company decides to change the rules of the NCD system slightly,
so that any policyholder making exactly one claim in a given year will be moved down
two levels if they also made a claim the previous year, subject to a minimum level of
Level 1.
(iv) Explain how many more states will be needed to continue to model the NCD
system as a Markov chain, and give the new one-step transition probability
matrix Q .
(v) Recalculate the probabilities in part (ii) under the new rules. You may assume
that the driver in (c) did NOT have a claim the previous year.
(vi) Calculate the increase in expected annual premium income from the change in
the rules and comment on your answer.

Question 6. Data is available for the movement of taxis in London. The city is divided into
three zones “North”, “South” and “West”. The movement of a taxi from one zone
to another will depend only on its current position. The following probabilities
have been determined for taxi movements:
• Of all taxis in the North zone, 30% will remain in North and 30% will move to
South, with the remaining 40% moving to West.
• In the South zone, taxis have a 40% chance of moving to North, 40% chance of
staying in South and 20% chance of moving to West.
• Of all the drivers in the West zone, 50% will move to North and 30% to South
with the remaining 20% staying in West.
The movement of taxis in London will be modelled in R using a Markov Chain.
(i) Create a vector with the state space of the Markov Chain, using R code. You
should print this to the screen and paste into your answer.
(ii) Construct a transition matrix of the zone movement probabilities. You should
print this to the screen and paste into your answer.
(iii) Load the R package for Markov Chains and paste your coding into your answer.
(iv) Create a Markov Chain object with state space equal to your vector in part (i)
and transition matrix from part (ii). You should print this to the screen and
paste into your answer.

The transition diagram for the Markov Chain is shown below:

(v) Calculate the probability that a driver currently in the North zone will be in the
North zone after:
(a) two trips (b) three trips
(vi) Determine the stationary state of the Markov Chain.

Question 7. Before answering this question, the R package for Markov chains should be
loaded into R using the following code:
install.packages("markovchain")
library(markovchain)
Xt is a Markov chain on the state space {1,2,3} with the following transition
matrix:

where 0 < p, q, r < 1.

(i) Construct an R function, with arguments p, q and r, that generates a


Markov chain object for Xt.
Assume that p = 0.75 and r = 0.25.
(ii) Construct R code that calculates the stationary distribution of Xt for values
of q from 0.1 to 0.9 inclusive, at intervals of 0.1, pasting your results into
your answer script.
(iii) Plot a graph showing the stationary distributions of Xt calculated in part
(ii) as a function of q. You should include three separately coloured lines
on your graph, each line representing the stationary probabilities of each
state.

Question 8. Suppose the transition probabilities are a function of the age of a person. The
transition probability of a person aged x moving:
• from Healthy to Sick state is 0.007x
• from Healthy to Death state is 0.001x
• from Sick to Death state is 0.002(100-x)
• from Sick to Healthy is 0.006(100-x).
Assuming 100 is the maximum age. Calculate the probability of:
(i) Healthy person aged 30 will be in Sick state after 4 years.
(ii) Sick Person aged 25 will be Dead after 7 years.

Solutions
Answer 1.
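(The full worked solution is not shown here. Below is a minimal sketch of the approach using the markovchain package. Since the question's transition probability matrix is not reproduced above, the probabilities in the matrix below are placeholders only and should be replaced with the values given in the question.)

library(markovchain)

# (i) Placeholder one-step transition probabilities - replace these with the
# values from the question (each row must sum to 1)
P = matrix(c(0.2, 0.5, 0.3,
             0.4, 0.1, 0.5,
             0.3, 0.3, 0.4),
           nrow = 3, byrow = TRUE,
           dimnames = list(c("state 1","state 2","state 3"),
                           c("state 1","state 2","state 3")))

# (ii) markovchain object named 'Q1MC'
mc = new("markovchain", transitionMatrix = P, name = "Q1MC")

# (iii) irreducibility and period
is.irreducible(mc)
period(mc)

# (iv) row sums of the extracted transition matrix
rowSums(mc@transitionMatrix)

# (v)(a) 4-step probability of moving from state 2 to state 1
(mc ^ 4)@transitionMatrix["state 2", "state 1"]

# (v)(b) probability of being in state 2 after 8 steps from d = (0.2, 0.3, 0.5)
d = c(0.2, 0.3, 0.5)
(d %*% (mc ^ 8)@transitionMatrix)[2]

# (v)(c) stationary distribution
steadyStates(mc)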

Answer 2
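(The full worked solution is not shown here. A minimal sketch of the calculation follows; the interpretation given afterwards is a brief summary rather than a complete answer.)

library(markovchain)

# (i) vector of the 18 most recent states
past18 = c(2,1,2,1,3,1,1,2,2,1,1,3,3,1,2,1,2,1)

# fit a Markov chain to the past data by maximum likelihood
fit1 = markovchainFit(past18)

fit1$estimate
fit1$logLikelihood

(ii) fit1$estimate contains the maximum likelihood estimates of the one-step transition
probabilities, p̂ij = nij / ni, where nij is the number of observed transitions from state i to
state j and ni is the total number of observed transitions out of state i. fit1$logLikelihood
is the log-likelihood of the observed transitions evaluated at these estimates, ie Σ nij ln p̂ij.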

Answer 3

Answer 4
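(The full worked solution is not shown here. Below is a minimal base R sketch of parts (i) to (iv), with the transition matrix built directly from the rules stated in the question: claim-free 0.85, exactly one claim 0.11, more than one claim 0.04. The later parts on premium income follow the same pattern.)

# (i) one-step transition matrix (rows/columns are Levels 1 to 4)
P = matrix(c(0.15, 0.85, 0,    0,
             0.15, 0,    0.85, 0,
             0.04, 0.11, 0,    0.85,
             0,    0.04, 0.11, 0.85),
           nrow = 4, byrow = TRUE)
rowSums(P)   # check: each row sums to 1

# (ii) n-step transition matrix by repeated multiplication
Pn = function(P, n) {
  result = diag(nrow(P))
  for (i in 1:n) result = result %*% P
  result
}

# (iii) P(4) and P(7)
Pn(P, 4)
Pn(P, 7)

# (iv)(a) revised matrix: claim-free 0.83, exactly one claim 0.12, more than one 0.05
P2 = matrix(c(0.17, 0.83, 0,    0,
              0.17, 0,    0.83, 0,
              0.05, 0.12, 0,    0.83,
              0,    0.05, 0.12, 0.83),
            nrow = 4, byrow = TRUE)

# (iv)(b) stationary distribution from the eigenvector of t(P2) with eigenvalue 1
ev = eigen(t(P2))$vectors[, 1]
statdist = Re(ev) / sum(Re(ev))
statdist

# long-run expected premium per policyholder (full premium 750,
# discounts 0%, 15%, 25%, 30%)
sum(statdist * 750 * c(1, 0.85, 0.75, 0.70))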

Answer 5
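(The full worked solution is not shown here. A sketch of parts (i) to (iii), built from the rules stated in the question (claim-free 0.9, exactly one claim 0.08, more than one claim 0.02), is given below. Parts (iv) to (vi) require an extended state space and are not sketched.)

# (i) one-step transition matrix (rows/columns are Levels 1 to 5)
P = matrix(c(0.10, 0.90, 0,    0,    0,
             0.10, 0,    0.90, 0,    0,
             0.02, 0.08, 0,    0.90, 0,
             0.02, 0,    0.08, 0,    0.90,
             0.02, 0,    0,    0.08, 0.90),
           nrow = 5, byrow = TRUE)
rowSums(P)   # check: each row sums to 1

# (ii) n-step transition matrix and the requested probabilities
Pn = function(P, n) {
  result = diag(nrow(P))
  for (i in 1:n) result = result %*% P
  result
}
Pn(P, 5)[1, 5]   # p15(5)
Pn(P, 4)[5, 3]   # p53(4)
Pn(P, 6)[3, 4]   # p34(6)

# (iii) stationary distribution and expected annual premium income:
# 2,000 policyholders, full premium 800, discounts 0%, 10%, 20%, 30%, 40%
ev = eigen(t(P))$vectors[, 1]
statdist = Re(ev) / sum(Re(ev))
2000 * sum(statdist * 800 * c(1, 0.9, 0.8, 0.7, 0.6))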

Answer 6
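(The full worked solution is not shown here. A minimal sketch using the markovchain package and the movement probabilities stated in the question:)

# (i) state space
zones = c("North", "South", "West")
zones

# (ii) transition matrix of zone movement probabilities
P = matrix(c(0.3, 0.3, 0.4,
             0.4, 0.4, 0.2,
             0.5, 0.3, 0.2),
           nrow = 3, byrow = TRUE,
           dimnames = list(zones, zones))
P

# (iii) load the Markov chain package
library(markovchain)

# (iv) markovchain object
taxi = new("markovchain", states = zones, transitionMatrix = P,
           name = "London taxis")
taxi

# (v) probability a driver currently in North is in North after 2 and 3 trips
(taxi ^ 2)@transitionMatrix["North", "North"]
(taxi ^ 3)@transitionMatrix["North", "North"]

# (vi) stationary distribution
steadyStates(taxi)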

Answer 7

Answer 8
(i) H2S<-function(x){ 0.007*x}
H2D<-function(x){ 0.001*x}
S2H<-function(x){ 0.006*(100-x)}
S2D<-function(x){ 0.002*(100-x)}
transmat<-function(x){
M<-matrix(0,nrow=3,ncol=3)
M[1,1]<-1-H2S(x)-H2D(x)
M[1,2]<-H2S(x)
M[1,3]<-H2D(x)
M[2,1]<-S2H(x)
M[2,2]<-1-S2H(x)-S2D(x)
M[2,3]<-S2D(x)
M[3,1]<-0
M[3,2]<-0
M[3,3]<-1
M
}
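# Part (i): start in the Healthy state at age 30 and apply the age-dependent
# one-year transition matrices for ages 30, 31, 32 and 33 in turn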
n=30
B<-c(1,0,0)
for (i in 1:4){
B=B%*%transmat(n+i-1)
}
B[2]

Hence the probability of Healthy person aged 30 will be in Sick state after 4 years is 0.2608138.

(ii) n=25
C<-c(0,1,0)
for (i in 1:7){
C=C%*%transmat(n+i-1)
}
C[3]

Hence the probability of sick person aged 25 will be in Death state after 7 years is 0.4333386.

Markov Jump Processes


Question 1. A time-homogeneous Markov jump process is to be used to monitor the number
of athletes opting to be omnivore, vegetarian or vegan, over a 10-year period. At
any time, an athlete may be classed as:
• Omnivore (State 1)
• Vegetarian (State 2)
• Vegan (State 3)

The annual transition rates collected from past data are given here:
µ12 = 0.12    µ13 = 0.02    µ21 = 0.18
µ23 = 0.05    µ31 = 0.013   µ32 = 0.05

(i) Construct the generator matrix, A .


(ii) Estimate P(1/12), the matrix of transition probabilities over one month.
(iii) Estimate, using one month steps, the probability that:
(a) an omnivore athlete is a vegetarian in 10 years’ time
(b) an omnivore athlete is a vegan in 10 years’ time
(c) a vegan athlete is an omnivore in 10 years’ time.

After an extended analysis of trends in the diet of athletes, expert nutritionists


have proposed some new rates:
µ12 = 0.15    µ13 = 0.08    µ21 = 0.09
µ23 = 0.07    µ31 = 0.004   µ32 = 0.07

(iv) (a) Recalculate the probabilities from part (iii) using these new rates.
(b) Comment on your answer to part (iv)(a).

(v) Suggest two improvements to the current model.

Question 2. An insurance company is using a time-homogeneous Markov jump, Healthy-Sick-


Dead (HSD) model to monitor the current state of its Income Protection
policyholders, with the following transition rate diagram of annual rates:

(i) Construct the generator matrix A.


(ii) Explain how the transition probability matrix, P (8) , can be calculated
using P (1) .

The insurance company wishes to use this to approximate P (8).


(iii) (a) Construct a function that takes two inputs, A and h , and returns the
approximate transition matrix P (h) over a short time period, h .
(b) Construct a function that takes three inputs, A , h and t , and estimates
the values of P(t).
(c) Estimate the values of P (8 ) for h = 1, 0.5, 0.25, 0.125, and 0.0625 .
(iv) Comment on your answers to part (iii)(c).
(v) (a) Estimate P(16) for h = 1, 1/2, 1/4, … (successively halving h) using the functions in part (iii).
(b) Comment on the level of accuracy achieved here.
(vi) Estimate, using the smallest value of h from part (v), the probability that:
(a) a healthy life will be sick after 16 years.
(b) a sick life will be healthy after 20 years.
(c) a sick life will be dead after 13 1/3 years.
(vii) Explain whether the insurance company should be using this model to
calculate the premiums for its Income Protection policies.

Question 3. In a particular European country, where anyone over the age of 18 may vote, there
are 3 main political parties:
• the ‘Reds’ (R)
• the ‘Blues’ (B)
• the ‘Xenophobes’ (X).
A company that produces and sells opinion polls has suggested that the voting
population’s political affiliation be monitored over time using a 3-state Markov
process.
(i) Explain the most suitable choice for the time set if a Markov chain is used.

The company has instead decided to use a time-inhomogeneous Markov jump


process with the following age-dependent transition rates between the political
parties:
µRB(t) = 0.05 + 0.01t        µRX(t) = 0.001 + 0.0002t
µBR(t) = 0.6 − 0.005t        µBX(t) = 0.004 + 0.001t
µXR(t) = 0.05 − 0.0003t      µXB(t) = 0.9 − 0.006t
where t is the voter’s age and it is assumed, for simplicity, that all voters are no
older than 110.
(ii) Construct a function to create a generator matrix, A(t), for a given input of
voter age.
(iii) Calculate:
(a) the rate at which a 25-year old member of the Blues moves to the Reds
(b) the rate at which a 60-year old member of the Reds moves to the Blues

(c) the rate at which a 50-year old member of the Blues moves to the
Xenophobes.
Over a very small time period, h, the transition probability matrix P(t, t+h) can be
approximated as:
P(t, t+h) ≈ I + h A(t)
where I is the appropriate identity matrix with the same dimensions as A(t).
Probabilities over longer time periods can then be estimated by successively
multiplying the probabilities over shorter time periods (increments of length h) as
follows:
P(s, t) ≈ Π (j = 0 to (t−s)/h − 1) P(s + hj, s + hj + h)

(iv) Construct a function to estimate the value of P(s,t) for a given subdivision
value of h and start and end times s and t .
(v) Estimate:
(a) the probability that a 25-year old member of the Blues moves to the Reds
over the next 10 years, using h=1/12
(b) the probability that a 60-year old member of the Reds moves to the
Blues over the next 5 years, using h=1/365
(c) the probability that a 50-year old member of the Xenophobes moves to
the Reds over the next 7 years, using h=1/2.
(vi) Discuss whether this time-inhomogeneous Markov jump process is a good
model for tracking political party affiliation.

Solutions
Answer 1
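(The full worked solution is not shown here. Below is a minimal base R sketch, approximating the one-month matrix by P(1/12) ≈ I + A/12; an alternative would be to use a matrix exponential.)

# (i) generator matrix (states: 1 = Omnivore, 2 = Vegetarian, 3 = Vegan)
A = matrix(c(-0.14,  0.12,  0.02,
              0.18, -0.23,  0.05,
              0.013, 0.05, -0.063),
           nrow = 3, byrow = TRUE)

# (ii) approximate one-month transition matrix: P(1/12) ~ I + A/12
h = 1/12
Ph = diag(3) + h * A

# (iii) 10-year probabilities using monthly steps: P(10) ~ P(1/12)^120
P10 = diag(3)
for (i in 1:120) P10 = P10 %*% Ph
P10[1, 2]   # (a) omnivore now, vegetarian in 10 years
P10[1, 3]   # (b) omnivore now, vegan in 10 years
P10[3, 1]   # (c) vegan now, omnivore in 10 years

# (iv) repeat with the revised rates
A2 = matrix(c(-0.23,  0.15,  0.08,
               0.09, -0.16,  0.07,
               0.004, 0.07, -0.074),
            nrow = 3, byrow = TRUE)
Ph2 = diag(3) + h * A2
P10.new = diag(3)
for (i in 1:120) P10.new = P10.new %*% Ph2
c(P10.new[1, 2], P10.new[1, 3], P10.new[3, 1])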

Answer 2
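(The full worked solution is not shown here, and the annual transition rates from the question's diagram are also not reproduced above, so the generator below uses placeholder rates that are purely an assumption. The two helper functions are the substance of parts (iii)(a) and (iii)(b).)

# Placeholder annual transition rates - replace with the rates from the
# question's diagram (states: 1 = Healthy, 2 = Sick, 3 = Dead)
sigma = 0.1    # Healthy -> Sick (placeholder)
rho   = 0.3    # Sick -> Healthy (placeholder)
mu    = 0.02   # Healthy -> Dead (placeholder)
nu    = 0.1    # Sick -> Dead (placeholder)

# (i) generator matrix
A = matrix(c(-(sigma + mu), sigma,       mu,
             rho,          -(rho + nu),  nu,
             0,             0,           0),
           nrow = 3, byrow = TRUE)

# (iii)(a) approximate transition matrix over a short period h: P(h) ~ I + hA
Ph = function(A, h) diag(nrow(A)) + h * A

# (iii)(b) estimate P(t) by multiplying together t/h of the short-step matrices
Pt = function(A, h, t) {
  steps = round(t / h)
  result = diag(nrow(A))
  for (i in 1:steps) result = result %*% Ph(A, h)
  result
}

# (iii)(c) P(8) for a range of step sizes
for (h in c(1, 0.5, 0.25, 0.125, 0.0625)) print(Pt(A, h, 8))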

Answer 3
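(The full worked solution is not shown here. A minimal sketch of parts (ii) to (v) follows, using the age-dependent rates as reconstructed in the question (state order R, B, X) and the approximation P(t, t+h) ≈ I + h A(t).)

# (ii) generator matrix as a function of voter age t
gen = function(t) {
  muRB = 0.05  + 0.01 * t
  muRX = 0.001 + 0.0002 * t
  muBR = 0.6   - 0.005 * t
  muBX = 0.004 + 0.001 * t
  muXR = 0.05  - 0.0003 * t
  muXB = 0.9   - 0.006 * t
  matrix(c(-(muRB + muRX), muRB,           muRX,
           muBR,          -(muBR + muBX),  muBX,
           muXR,           muXB,          -(muXR + muXB)),
         nrow = 3, byrow = TRUE,
         dimnames = list(c("R","B","X"), c("R","B","X")))
}

# (iii) individual transition rates
gen(25)["B", "R"]   # 25-year-old Blue moving to the Reds
gen(60)["R", "B"]   # 60-year-old Red moving to the Blues
gen(50)["B", "X"]   # 50-year-old Blue moving to the Xenophobes

# (iv) estimate P(s, t) by chaining short increments of length h
Pst = function(s, t, h) {
  steps = round((t - s) / h)
  result = diag(3)
  for (j in 0:(steps - 1)) {
    result = result %*% (diag(3) + h * gen(s + j * h))
  }
  result
}

# (v) requested probabilities
Pst(25, 35, 1/12)["B", "R"]
Pst(60, 65, 1/365)["R", "B"]
Pst(50, 57, 1/2)["X", "R"]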

Time Series

Question 1. Plot the ACF and PACF up to lag k = 20 for the AR(2) time series Yt with defining
equation: Yt = 15 + 0.8Y(t−1) − 0.6Y(t−2) + Zt
where the white noise terms {Zt} are independent and identically distributed random
variables with mean 0 and standard deviation σ = 1.

You should plot your graphs one directly above the other and label them clearly.

Question 2. The time series ldeaths is available in R. It measures the number of deaths in the
UK from lung disease, over a period of time.

For the time series ldeaths, find:

(a) the number of observations that have been made per year
(b) the time of the first observation
(c) the time of the last observation
(d) the autocorrelation function at lag 2 (ie ρ2)
(e) the number of people who died from lung disease in February 1978
(f) the total number of people who died from lung disease during 1976.

Question 3. Plot the sample ACF and sample PACF for the data ldeaths, up to lag k = 5. Your
graphs should be side by side, and should have appropriate labels.

Question 4. The following code generates 1,000 observations from a time series, and saves it as
an object s.

xx <- numeric(1003)
set.seed(4567)
ww <- rnorm(1003)
xx[1:3] <- ww[1:3]
for(t in 4:1003) { xx[t] <- 0.8*xx[t-1] +0.2* xx[t-3]
+ ww[t]+2*ww[t-1]}
s<-ts(xx[4:1003])

Run the code above and hence determine how many times the time series s should be
differenced.

Question 5. The file sales.txt records the number of sales per month of a particular type of
insurance product, over the ten‐year period from January 2005 to December 2014.
(i) Import this data and classify it as a time series object. You should specify
the relevant dates.
(ii) Apply the method of moving averages and the method of seasonal means
to the data. (You may assume the time series is additive.)

(iii) Plot the ACF and the PACF of the remainder, and comment on the result.

Question 6. The volume of water in a reservoir is measured every month from 2010, for five
years. The results (in millions of gallons) are stored in the file seasonality.txt
(i) Import this data and classify it as a time series object, specifying the
relevant dates.
(ii) Plot the data, the sample ACF and the sample PACF in a single plotting
area.
(iii) Remove the seasonality from the data using seasonal differencing. (You
should assume that the data exhibits yearly seasonal variation.)
(iv) Use your answer to part (iii) to plot the seasonally‐adjusted data and its
corresponding sample ACF and sample PACF. Your graphs should be
displayed in a single plotting area.
Question 7. This question uses the file ‘TS.txt’.
Run the following code:
TS<-read.table("TS.txt",sep="\t")
TS<-ts(TS[,1],start=1,end=length(TS[,1]),frequency=1)
(i) Fit an AR(2) model to the time series TS and interpret the result.
(ii) Carry out a Ljung‐Box test using five lags, and hence determine whether
the fitted model is an appropriate fit.
Question 8. This question uses the file ‘Wt.csv’.
Run the following code:
Wt<-read.table("Wt.csv",sep="",header=TRUE)
Wt<-ts(Wt$value,start=min(Wt$day),end=max(Wt$day))
The time series Wt represents the value of a particular commodity, recorded daily.
Plot the correlogram and partial correlogram for the data, in a 2 × 1 layout. Hence
suggest what order of ARIMA you would fit to this time series.
Question 9. This question uses the file ‘Xt.csv’.
Run the following code:
Xt<-read.table("Xt.csv",sep=",",header=TRUE)
Xt<-ts(Xt,1990,frequency=12)
Xt is a set of time series data recorded monthly from January 1990 to August
2006.
(i) Calculate the Akaike Information Criteria for ARIMA(p,d,q) models with p ≤ 2,
d ≤ 2, q ≤ 2.
(ii) Hence, determine which model is the best fit.

Question 10. This question uses the file ‘Yt.csv’.


Run the following code:
Yt<-read.table("Yt.csv",sep=",",header=TRUE)
Yt<-ts(Yt[,2],2005,frequency=12)
(i) Fit an ARMA (1,1) model to the time series Yt and state the values of the
fitted parameters.
(ii) By plotting suitable graphs, suggest whether you think the model is a good
fit.
(iii) Carry out a turning point test on the residuals of your fitted model.

Question 11. Generate the time series object data using the following code:
set.seed(1558)
data<-100+arima.sim(list(ar=runif(1,0.3,0.6),ma=runif(1,12,15)), n=40)
data<-ts(data,start=c(2008,1),frequency=4)
This stationary time series represents a company’s revenue in thousands, over a
ten‐year period.
(i) Fit an ARMA(1,1) to the data, and hence estimate the company’s revenue
up until the end of 2019. (You do not have to state the values of the fitted
parameters.)
(ii) Use exponential smoothing to estimate the company’s revenue, over the
same period.
(iii) Create a graph that shows the time series data and the two sets of
forecasts. You should label each line clearly.

Question 12. Run the following code:


set.seed(1952)
x<-arima.sim(list(ar=0.7),n=240)
x<-1800+cumsum(x)
x<-ts(x,start=c(1990,1),frequency=12)
The time series x represents the number of diagnoses of a rare disease, every
month from January 1990 to December 2009.
(i) Fit an autoregressive time series to the differenced data, and hence
estimate the number of diagnoses to be expected in December 2014. (You
are not required to state the fitted parameters.)
(ii) Plot the data and the forecast number of diagnoses on the same graph, for
every month until December 2014.

Question 13. Run the following code:


set.seed(1952)
y<-90+round(arima.sim(list(ma=c(15,7)),n=40))
y<-ts(y,start=c(2007,3),frequency=4)
The stationary time series y represents the number of fraudulent claims received
by an insurer every three months (ie every quarter year), from July 2007 for a
period of ten years.
Fit a MA(2) process to this data, and hence estimate the number of fraudulent
claims that will be submitted to the insurer in the last three months of 2038.

Question 14. The following two equations form a bivariate time series (Xt, Yt)^T:
Xt = 0.3X(t−1) + 0.1Y(t−1) + et
Yt = −0.2X(t−1) + 0.6Y(t−1) + et′
where {et} and {et′} are independent white noise processes.
Determine the eigenvalues of the parameter matrix and hence state whether
(Xt, Yt)^T is stationary.

Question 15. The following code generates two vectors, s and t:


set.seed(1234)
z <- rep(0, 1000)
for (i in 2:1000) z[i] <- z[i-1] + rnorm(1)
s <- t <- rep(0, 1000)
s <- runif(1,16,20)*z + rnorm(1000)
t <- -runif(1,1,10)*z + rnorm(1000)
(i) Run this code and determine whether the vector (0.2,1.5) is a cointegrating
vector for s and t.
(ii) Determine whether s and t are cointegrated, and if so, find a
cointegrating vector.

SOLUTIONS

Answer 1
We calculate the theoretical ACF and PACF using the ARMAacf function, with
lag.max=20:
AR2acf<-ARMAacf(ar=c(0.8,-0.6),lag.max=20)
AR2pacf<-ARMAacf(ar=c(0.8,-0.6),lag.max=20,pacf=TRUE)

Setting the graphical parameters:


par(mfrow=c(2,1))

Creating a bar chart for AR2acf:


barplot(AR2acf,main="ACF of AR(2)",col="blue",xlab="Lag")

Similarly, we create the bar chart for AR2pacf, but including the argument
names.arg=seq(1,20), in order to label the bars explicitly:

barplot(AR2pacf,main="PACF of AR(2)",col="red",xlab="Lag",
names.arg=seq(1,20))

Finally, we reset the graphical parameters to their default setting:


par(mfrow=c(1,1))

Answer 2
(a) Number of observations per year
frequency(ldeaths)
[1] 12
So there are 12 observations per year.
(b) Time of the first observation
start(ldeaths)
[1] 1974 1
Since data is recorded monthly, the first observation is for January 1974.
(c) Time of the last observation
end(ldeaths)
[1] 1979 12
So the last observation is for December 1979.
(d) Autocorrelation function at lag 2
The acf function will plot a graph of the sample autocorrelation function. So to access the
information behind the graph, we can define the object a, and set plot=FALSE:
a<-acf(ldeaths,plot=FALSE,lag.max=72)
(We’ve stipulated lag.max=72 since there are 72 observations in the dataset:
length(ldeaths)
[1] 72
However, any lag of 25 or over would score marks.)
We’ve calculated the ACF ρk for lags k = 0, 1/12, 2/12, …, 71/12, 72/12. So to find ρ2 we need the 25th
ACF in the series (ie 2 × 12 + 1):


a$acf[25]
[1] 0.6001395
(e) Number of people who died from lung disease in February 1978
We can print out the dataset and read off the correct number:
ldeaths

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1974 3035 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512
1975 2933 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837
1976 2787 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823
1977 3102 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293
1978 2815 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491
1979 3084 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915
So the number of deaths in February 1978 was 3,137.
However, we wouldn’t want to print out all the data for very large datasets. So instead, we can
extract the answer like this:
Since the time series starts at January 1974 and we have monthly data, the observation for
February 1978 is the 50th element of the series:
ldeaths[50]
[1] 3137
(f) Total number of people who died from lung disease during 1976
We could sum the required elements:
sum(ldeaths[25:36])
[1] 25718

Answer 3
First we set our graphical parameters:
par(mfrow=c(1,2))
Then we plot the ACF using the acf function with lag.max=60 (since, for k = 5 and
monthly data, we need to calculate 5 × 12 = 60 autocorrelations):
acf(ldeaths,lag.max=60,main="Number of deaths from lung
disease", ylab="Sample ACF",col="red")
Similarly, plotting the PACF using the pacf function:
pacf(ldeaths,lag.max=60,main="Number of deaths from lung
disease", ylab="Sample PACF",col="blue")

Finally, we reset the graphical parameters to their default setting:


par(mfrow=c(1,1))

Answer 4
Differencing s
It’s very quick to difference data in R so we will do this a few times and then analyse the
results:
ds<-diff(s)
dds<-diff(ds)
(Alternatively, we could have calculated dds as dds<-diff(s,differences=2).)

Plotting the ACF


We’ll first plot the ACFs for each of our time series. In the code below:

• we’ll tell R to plot the graphs in a 1 × 3 array using par(mfrow=c(1,3)) and then
reset the graphical parameters afterwards using par(mfrow=c(1,1))
• we use semicolons ; to carry out more than one command in a single line
par(mfrow=c(1,3))
acf(s); acf(ds); acf(dds)
par(mfrow=c(1,1))

There is a steady linear decrease in the ACF of the undifferenced time series,
which indicates that differencing is required. This disappears after differencing
once (ie in the graph of ds), which indicates that we don’t need to difference a
second time.
Analysing the variance
Calculating the sample variance for each of our datasets:
var(s); var(ds); var(dds)
[1] 652.342 [1] 4.760322 [1] 7.903931
The variance reduces (from 652 to 4.76) if we difference s once, but increases (up to
7.90) if we difference a second time. Again this indicates that we should difference the
data only once.

Answer 5
(i) Importing the data:
We use the read.table function, and check this looks sensible using the head
function:
sales<-read.table("sales.txt")
head(sales)


V1
1 1057
2 1093
3 1120
4 1120
5 1091
6 1082
Now we classify the data as a time series, where:
• we specify the start date as January 2005 using the argument start=c(2005,1)
• we specify monthly observations using the argument frequency=12.
sales<-ts(sales,frequency=12,start=c(2005,1))

(ii) Removing the trend and seasonality


The decompose function is ideal for the method of moving averages and the method of
seasonal means.
decomposed<-(decompose(sales,type="additive"))
The ‘untrended’ time series (ie after applying both filters) is called random, and we can
extract this using the $ symbol:
untrended<-decomposed$random

(iii) Plot the ACF and PACF


We can’t immediately use the acf and pacf functions. See what happens if we try:
acf(untrended)
Error in na.fail.default(as.ts(x)) : missing values in
object
This has occurred because it is impossible to calculate the moving average for the first
and last six time periods. Hence, the remainder is also undefined for these time
periods. R has set these values to NA.
To exclude these from the data, we can type:
r<-untrended[!is.na(untrended)]
where:
• !is.na means ‘all entries where the results are not NA’
• the square brackets [ ] mean ‘only include entries that meet this criterion’.


Plotting the ACF and PACF:


par(mfrow=c(1,2))
acf(r,main="Untrended data"); pacf(r,main="Untrended data")
par(mfrow=c(1,1))

The ACF still shows significant seasonal variation in the data. Hence there appears to be
some residual seasonality that the decomposition hasn’t captured.
However, the PACF cuts off for lags k > 2 which suggests that (assuming we can remove
the remaining seasonal variation), an AR(2) model might be appropriate.

Answer 6
(i) Importing the data
Using the read.table function and checking that the data has been imported sensibly:
seas<-read.table("seasonality.txt")
head(seas)
V1
1 212.7483
2 239.6343
3 242.7506
4 239.3206
5 233.8502
6 207.3584


Now we specify it as a time series, where:


• frequency=12 implies that we have monthly data
• start=c(2010,1) implies that the first observation occurs in January 2010.
seas<-ts(seas,frequency=12,start=c(2010,1))
(ii) Creating graphs
We first create the matrix m :
m<-matrix(2,2,data=c(1,2,1,3))
Now we set the graphical parameters using the layout function:
layout(m)
Now we plot our graphs and, as good practice, reset our graphical parameters afterwards:
ts.plot(seas,main="Data: Reservoir level",ylab="gallons (millions)")
acf(seas,lag.max=36,main="")
pacf(seas,lag.max=36,main="")
par(mfrow=c(1,1))


(iii) Seasonal differencing


Since we have monthly data with a yearly seasonal variation, we should use lag=12 in
the diff function:
d<-diff(seas,lag=12,differences=1)
(iv) Creating graphs of filtered data
We will adapt the code we used in part (ii) by:
• using the data d instead of seas
• changing the main title
• adding a horizontal reference line at y = 0 (by specifying h=0 in the abline function):
m<-matrix(2,2,data=c(1,2,1,3))
layout(m)
ts.plot(d,main="Change since same month last year",ylab="gallons (millions)")
abline(h=0,lty="dashed",col="red")
acf(d,lag.max=36,main="")
pacf(d,lag.max=36,main="")
par(mfrow=c(1,1))


Answer 7
(i) Fitting an autoregressive of order 2
We can use the arima function, where the order argument represents p, d and q
respectively. Since an AR(2) is the same as an ARIMA(2,0,0) , we have:
fit<-arima(TS,order=c(2,0,0))
fit
Call:
arima(x = TS, order = c(2, 0, 0))
Coefficients:
ar1 ar2 intercept
0.2193 0.4475 500.1856
s.e. 0.0889 0.0894 0.2803
sigma^2 estimated as 0.9245: log likelihood = -138.28, aic
= 284.56
So we have fitted a model of the form:
X_t = μ + α(X_{t−1} − μ) + β(X_{t−2} − μ) + Z_t
where: μ̂ = 500.1856, α̂ = 0.2193, β̂ = 0.4475 and the estimated variance of the
white noise terms Z_t is σ̂² = 0.9245.

(ii) Ljung‐Box test


We’ll use the Box.test() function, specifying type="Ljung" and lag=5.
We need to apply the test to the residuals, so we will extract from our fitted model as
fit$residuals.
Since 𝑝 + 𝑞 = 2 + 0 = 2 we need to reduce our degrees of freedom by 2, ie we set
fitdf=2.
Combining all these into one function gives:
Box.test(fit$residuals,lag=5,type="Ljung",fitdf=2)
Box-Ljung test
data: fit$residuals
X-squared = 3.9903, df = 3, p-value = 0.2625


The null hypothesis is:


𝐻" : the residuals are independent.
The p‐value is 0.2625, which is greater than 0.05. So we have insufficient evidence to
reject the null hypothesis. We can conclude that the model is a satisfactory fit.

Answer 8
We’ll plot the ACF and the PACF one above the other by setting the graphical
parameter:
par(mfrow=c(2,1))
(This means the layout of the graphs will be two rows deep and one column wide.)
Now we use the acf and pacf functions:
acf(Wt, main="Data: Wt",ylab="sample ACF")
pacf(Wt,main="",ylab="sample PACF")

The ACF appears to decay. The PACF cuts off (ie the values fall inside the confidence
interval) for lags greater than 1.
Hence, we would fit an AR(1) model, ie an ARIMA(1,0,0).


Answer 9
We need to fit a model for every combination of p, d and q and calculate the AIC. That’s
a lot of work! So we’ll write a ‘loop’ to do the work for us. In our solution below, we’ll
start by calculating the AIC for just one model, and then extend this to do the same for all
models.
Calculating the AIC for one model
First let’s create a vector with four entries:
answer<-numeric(4)
All the entries in this vector are zero for now. We’ll populate it with our answers in a
moment.
To calculate the AIC for a ARIMA(0,1,2) model (for example), we fit the model using
the arima function, and then extract the AIC from this model by using $aic, like this:
model<-arima(Xt,order=c(0,1,2))
aic<-model$aic
In fact, we can simplify these last two lines to:
aic<-arima(Xt,order=c(0,1,2))$aic
We create a vector to display this result alongside the values of p , d and q respectively:
row<-c(0,1,2,aic)
row
[1] 0.0000 1.0000 2.0000 886.8326
Finally we append this onto the bottom of our vector answer, using the rbind
function:
answer<-rbind(answer,row)
answer
[,1] [,2] [,3] [,4]
answer 0 0 0 0.0000
row 0 1 2 886.8326
(The first row is a ‘dummy’ row, we can ignore it.)

Calculating the AIC for many models


The code above seems to work fine for just one model. So we’ll extend this method to
repeat the process for all combinations of p, d and q.
We will replace all instances of “0,1,2” in the code above with “p,d,q” and then insert this
into a loop.


So our final code is:


answer<-numeric(4)
for (p in 0:2) for (d in 0:2) for (q in 0:2) {
aic<-arima(Xt,order=c(p,d,q))$aic
row<-c(p,d,q,aic)
answer<-rbind(answer,row)
}
We can remove the first (dummy) row of our answer:
answer <- answer[-1,]

(ii) Best fit


The model with the lowest AIC is the best fit. This is the ARIMA(1,1,2) model, with
AIC =872.
We can obtain this answer using R:
min(answer[,4])
[1] 872.1794

Answer 10
(i) Fitting the model
To fit an ARMA(1,1) model we use the arima function, with order=c(1,0,1):
fit<-arima(Yt,order=c(1,0,1))
fit
Call:
arima(x = Yt, order = c(1, 0, 1))
Coefficients:
ar1 ma1 intercept
0.8772 0.4089 109.8152
s.e. 0.0387 0.0815 1.7909
sigma^2 estimated as 4.749: log likelihood = -396.76,
aic = 801.52
So for our fitted model we have: μ̂ = 109.8152, α̂ = 0.8772, β̂ = 0.4089 and σ̂² = 4.749.


(ii) Using graphs to interpret goodness‐of‐fit


We could extract the residuals from the fitted model, and plot the ACF like this:
et<-fit$residuals
acf(et,lag.max=24)
However, it is much easier to use the tsdiag() function. This function plots the
(standardised) residuals, as well as the ACF of the residuals.
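Applying this to the model fitted in part (i):
tsdiag(fit)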
From the top graph, it looks like the residuals have mean zero, constant variance and no
discernible pattern, which indicates a good fit.
The ACFs lie inside the confidence interval at least 95% of the time (which is
desirable). However there is a hint of oscillation in the ACFs, so there may be some
underlying pattern in the data which the model has not captured.

(iii) Turning points test


There is no built-in function for the turning points test in R (unless we load a
package). So we will have to write the test ourselves.
Extracting the residuals from the fitted model:
et<-fit$residuals
We create an empty vector called turns of length n, where n is the number of residuals
we have calculated:
n<-length(et)
turns<-numeric(n)

for (i in 2:(n-1)) {
  if (et[i]>et[i+1] && et[i]>et[i-1]) {
    turns[i]<-1     # mark a local maximum (peak)
  }
}
In the code above:
• notice that our loop runs from i = 2,…,(n−1), not from i = 1,…,n, because the first and
last residuals cannot be turning points


• the square brackets [] let us extract individual entries from the vector, so et[i] is the ith
element of vector et.
Similarly, to check whether each residual is a minimum, we can simply change the
direction of the inequality signs:

for (i in 2:(n-1)) {
  if (et[i]<et[i+1] && et[i]<et[i-1]) {
    turns[i]<-1     # mark a local minimum (trough)
  }
}
We can calculate the total number of turning points in the residuals as:
nt<-sum(turns)
If the residuals form a white noise process, we would expect the mean and variance of the
number of turning points in the residuals to be:
E<-2/3*(n-2)
V<-(16*n-29)/90
where n is the number of residuals.
Our test statistic should include a continuity correction. Comparing the expected number
of turning points with the actual number of turning points:
E; nt
[1] 118.6667
[1] 110
Since nt is less than E, we will apply the continuity correction by including +0.5 in our
calculation of the test statistic:
ts<-(nt-E+0.5)/V^0.5
ts
[1] -1.451
Our null hypothesis is H₀: the residuals are white noise. The test statistic of −1.451 lies
within the 5% critical values of ±1.96, so we have insufficient evidence to reject H₀, and we
conclude that the residuals are consistent with white noise.


Answer 11
(i)

fit<-arima(data,order=c(1,0,1))
start(data);end(data);frequency(data)
p<-predict(fit,n.ahead=8)
p<-p$pred; p

(ii)

HW<-HoltWinters(data,beta=FALSE,gamma=FALSE)
p2<-predict(HW,n.ahead=8); p2

(iii)

ts.plot(data,main="Data and forecasts",ylab="Revenue (000s)",
    xlim=c(2008,2020),col="blue")
lines(p,col="red")
lines(p2,col="dark green")
text(2019.5,90,"data",col="blue")
text(2018.7,85,"step-ahead forecast",col="red")
text(2018.1,80,"exponential smoothing forecast",col="dark green")

Answer 12
(i)
dx<-diff(x)
fit<-ar(dx)
pd<-predict(fit,n.ahead=60)$pred
px<-cumsum(pd)+tail(x,1)
px<-ts(px,start=c(2010,1),frequency=12)
px
px[time(px)==2014+11/12]

(ii)

min(px,x); max(px,x)
ts.plot(x,main="Past data and forecast",ylab="Number of diagnoses",
    xlim=c(1990,2015),ylim=c(1796,1833),col="blue")
lines(px,col="red")

Answer 13
p<-predict(fit,n.ahead=100)$pred
p[time(p)==2038.75]


Answer 14
m<-matrix(c(0.3,-0.2,0.1,0.6),2,2); m
eigen(m)$values

Answer 15
(i)
PP.test(0.2*s+1.5*t)$p.value
(ii)
PP.test(s)$p.value
PP.test(diff(s))$p.value
PP.test(t)$p.value; PP.test(diff(t))$p.value
find.coint<-function(coint) {
comb<-coint[1]*s+coint[2]*t
test<-PP.test(comb)$p.value
test
}
v<-c(1,1)
find.coint(v)
fit<-nlm(find.coint,v)
alpha<-fit$estimate[1]
beta<-fit$estimate[2]
alpha; beta
PP.test(alpha*s+beta*t)$p.value


Extreme Value Theory

Question 1. An insurer has recorded its claims, alongside the year and month in which each
claim event occurred, in the following file:
table.txt

(i) Calculate the maximum claim that occurred in:


(a) July 2016 (b) August 2016 (c) September 2016

(ii) Group the claims according to the quarter in which they occurred (ie
quarter 1 refers to the months January to March, quarter 2 refers to months
April to June, etc) and hence find the maximum claim that occurred in the
third quarter of 2016.

Question 2. Aggregate monthly claims are recorded in the following file:


aggClaims.txt

Calculate the threshold exceedances for these claims at the threshold u = 3,000,000.

Question 3. Claim sizes in the file ‘exp.txt’ are assumed to follow an exponential distribution
with parameter λ. Estimate the distribution of the threshold exceedances for this
data.

Question 4. Plot the hazard rate for the Weibull(0.4,5) distribution, ie where c = 0.4 and γ = 5.
Hence state whether this distribution has a heavy tail.

Question 5. Plot the density ratios for the Pareto(2, 4) and Pareto(7,12) distributions and hence
state which distribution has the heavier tail.


Solutions
Answer 1
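One possible approach is sketched below. It assumes that table.txt has a header row with columns
named year, month and claim; the actual column names in the file may differ.
claims<-read.table("table.txt",header=TRUE)
head(claims)
(i) Maximum claim in each month:
max(claims$claim[claims$year==2016 & claims$month==7])
max(claims$claim[claims$year==2016 & claims$month==8])
max(claims$claim[claims$year==2016 & claims$month==9])
(ii) Grouping the claims by quarter and extracting the maximum for the third quarter of 2016:
claims$quarter<-ceiling(claims$month/3)
max.by.quarter<-tapply(claims$claim,list(claims$year,claims$quarter),max)
max.by.quarter["2016","3"]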


So the maximum claim in the third quarter of 2016 was 31,075. This is consistent with our answer to
part (i).


Answer 2
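A minimal sketch, assuming aggClaims.txt holds a single column of aggregate monthly claim amounts
(adjust the header argument to match the file):
agg<-read.table("aggClaims.txt",header=TRUE)[,1]
u<-3000000
The threshold exceedances are the amounts by which the claims above the threshold exceed u:
exceedances<-agg[agg>u]-u
exceedances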


Answer 3
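A minimal sketch, assuming exp.txt holds a single column of claim sizes. By the memoryless property
of the exponential distribution, the threshold exceedances are also exponentially distributed with
the same parameter λ, so it is enough to estimate λ by its maximum likelihood estimate 1/x̄:
x<-read.table("exp.txt",header=TRUE)[,1]
lambda.hat<-1/mean(x)
lambda.hat
The estimated distribution of the threshold exceedances is then Exp(λ̂).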


Answer 4
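A sketch using the Tables parameterisation F(x) = 1 − exp(−cx^γ), for which the hazard rate is
h(x) = cγx^(γ−1):
c<-0.4; g<-5
x<-seq(0,5,by=0.01)
hazard<-c*g*x^(g-1)
plot(x,hazard,type="l",main="Hazard rate of Weibull(0.4,5)",xlab="x",ylab="hazard rate")
Since γ = 5 > 1 the hazard rate is increasing, so the tail of this distribution is light rather
than heavy.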


Answer 5
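A sketch using the Pareto(α, λ) density f(x) = αλ^α / (λ + x)^(α+1) and plotting the ratio of the
two densities:
x<-seq(0,200,by=0.5)
ratio<-(2*4^2/(4+x)^3)/(7*12^7/(12+x)^8)
plot(x,ratio,type="l",main="Density ratio: Pareto(2,4) / Pareto(7,12)",ylab="ratio")
For large x the ratio increases without limit, so the Pareto(2,4) distribution has the heavier tail.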


Machine Learning

Question 1. The built-in data set in R called faithful contains 272 records of the waiting time
between eruptions and the duration of the eruption for the Old Faithful geyser.

(i) Plot the data.


(ii) (a) Calculate z-scores for the data.
Hint: the scale() function centres and scales data.
(b) Perform k-means clustering with k = 2 clusters, setting a seed of 272 on
the standardised data.
Hint: use the kmeans(<data>, k) function.

(iii) Show the result of the clustering, explaining each component of the output.

(iv) (a) Show the standardised cluster centres from part (iii).

(b) Calculate the cluster centres in the original units of the data.

(v) Plot the data, making each cluster a different colour.

(vi) Describe the clusters.

Someone suggests rerunning the algorithm with initial cluster centres of (2.2,80) and
(4.5,55).

(vii) Assess the suitability of the proposed initial centres.

(viii) (a) Calculate z-scores for the proposed cluster centres based on the data in
faithful.

(b) Perform k-means clustering with k = 2 clusters on the standardised data


using the standardised proposed initial cluster centres.

Hint: use the following structure of the kmeans() function:


kmeans(<data>, <initial centres>)

Where <initial centres> is an r × c matrix (or data frame) with r as the number
of clusters and c as the number of variables in the data set.
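For illustration, a call of this form might look like the following, where scaled.data and
init.centres are hypothetical objects (standardised data and a 2 × 2 matrix of standardised
starting centres):
init.centres<-matrix(c(-1,1,1,-1),nrow=2,ncol=2)   # 2 clusters, 2 variables
kmeans(scaled.data,init.centres)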

(ix) Show the standardised cluster centres for the clusters calculated in part (viii)(b)
and compare them to those calculated in part (iv)(a).


Question 2. A movie streaming website has collected data about movie preferences on some of
their customers. The customers rated various movies, each assigned a unique genre,
giving a score from 1 to 100. The file movie.data.csv contains the following
information:

• Horror: the average score given by the customer for horror films
• Romcom: the average score given by the customer for romantic comedies
• Action: the average score given by the customer for action movies
• Comedy: the average score given by the customer for comedies
• Fantasy: the average score given by the customer for fantasy films.
• Cluster: the proposed cluster allocation for the customer (1, 2 or 3).

An analyst working for the website has attempted to allocate customers into clusters
using the k-means algorithm based on their average movie scores across the genres.
The analyst has stored the proposed cluster for each customer in the column Cluster.

(i) (a) Calculate the three cluster centres.


(b) Calculate the distance from each point to the centre of Cluster 1, storing
your answer in an additional column called dist_1.
(c) Calculate the distance from each point to the centre of Cluster 2, storing
your answer in an additional column called dist_2.
(d) Calculate the distance from each point to the centre of Cluster 3, storing
your answer in an additional column called dist_3.

The analyst realises that the algorithm was terminated too early and there is one data
point that is not correctly allocated to the nearest centroid.

(ii) (a) Identify the data point that is not correctly allocated to its nearest
cluster centre.
Hint: the function which.min(<vector>) function outputs the index
of the minimum value in <vector>.

The function which(<logical condition>) can be used to return


the indices of items satisfying <logical condition>.
(b) Identify the correct cluster for the point in part (ii)(a).
(c) Calculate the new cluster centres using the updated clusters.
(d) Verify that the algorithm has now converged.

The analyst now further realises that the data should really be standardised before
using the k-means algorithm. The analyst decides to use z-scores to standardise the
data.
(iii) (a) Perform k-means clustering using the kmeans() function with 3
clusters and setting a seed of 6 after appropriately standardising the data
using the scale() function.


(b) Output the unscaled cluster centres of the clusters calculated in part
(iii)(a).
(c) Compare the cluster centres calculated in part (iii)(b) to those
calculated in part (ii)(c).
(iv) Describe the clusters calculated in part (iii)(a) by considering the centres
in part (iii)(c).
(v) Plot the first five columns of the data set, colouring each point by its
allocated cluster from the k-means algorithm.
The total within-groups sum of squares is stored for each cluster in the k-means
output as the item ‘tot.withinss’. This measures the within-cluster
homogeneity (ie how similar the points in that cluster are). This item may be
referenced using the following code:

kmeans(<data>, <centers>)$tot.withinss

The website wants to consider other numbers of clusters and wants to use the total
within-groups sum of squares metric to determine an appropriate number to use.
(vi) (a) Calculate the total within-groups sum of squares when performing
k-means clustering on standardised data with 1 to 10 clusters, setting a
seed of 6 before running each iteration and storing the results in a vector
called total.within.ss.
(b) Plot total.within.ss against the number of clusters using a
combined line and points graph.
(c) Suggest an appropriate number of clusters.

Question 3. A scientist is analysing a large number of items, which are known to be of one of
three types (1, 2 or 3). She has measured two factors for each one in order to help
identify which type each item belongs to. She has found that the values of the factors
for each of the types conform to the distributions shown in the table below. Here
U(a,b) indicates the discrete uniform distribution, which is equally likely to take any
of the values a, a+1,…,b.

TYPE FACTOR 1 FACTOR 2


1 U(40,100) U(40,100)
2 U(10,70) U(30,90)
3 U(0,80) U(0,60)
Run the following code to create a blank set of axes in the plot window:
plot(NULL,NULL,xlim=c(0,100),ylim=c(0,100),xlab="Factor 1",
ylab="Factor 2", main="Range for Types 1, 2 and 3")
(i) (a) Show the ranges of values of the two factors for each type on a graph
using rectangles.
Hint: polygon(c(x1,x2,...,xn),c(y1,y2,...,yn)) plots a polygon with
vertices at (x1,y1),...,(xn,yn).
Hint: text(c(x1,x2,...,xn),c(y1,y2,...,yn),c(<text>)) adds the
elements of <text> to a graph at the points (x1,y1),...,(xn,yn).
(b) Comment on your illustration.


(ii) Construct a data frame of 30 simulated items (10 of each type) with the
characteristics shown in the table, after setting a seed value of 19. The data
frame should contain three columns; the type of the object; the value of
FACTOR1 and the value of FACTOR2.
Hint: the sample(<vector>, n, replace =TRUE) function
generates a sample of size n using the values in <vector> with
replacement.

The scientist has produced the decision tree below:

(iii) (a) Construct a function to allocate an item in the simulated data to


one of the three types based on the decision tree shown above.
(b) Test the function from part (iii)(a) for an item where both factors
have the value 50.
(c) Apply the function from part (iii)(a) to the simulated data from part
(ii).
(d) Calculate the accuracy of the decision tree.

The scientist is also considering using a naïve Bayes approach.


(iv) (a) Write down an expression for the probability of obtaining a value x
from a discrete uniform distribution that can take the integer values
a,a+1,…,b. Your expression should be valid for all values of x .
(b) Construct a function called ddunif that takes three inputs; x,a and
b and returns the probability of obtaining the value x from a
discrete uniform distribution that can take the values a,a+1,…,b.
Your function should:
• work for vector inputs of x to calculate multiple probabilities at
once
• correctly output the probability for all integer values of x
• return an error message if x (or, if x is a vector, any value in x )
is not an integer.


In this question ‘integer’ refers to whether x ∈ ℤ and not the integer data
type in R.
Hint: (sum(x != floor(x)) != 0) returns TRUE if any value in x is not an
integer and FALSE otherwise.
(c) Write a function for each Type that calculates the conditional
probability of obtaining values for the factors F1 and F2 , given the
Type, ie 𝑃(𝐹|𝑇𝑦𝑝𝑒). You should assume that the factors operate
independently given the Type.

You are given that the prior probabilities of each type are P(Type 1) = 0.5 ,
P(Type 2) = 0.3 and P(Type 3) = 0.2. You are also given that, for factor values F:
P(Type = i | F) = P(F | Type = i) × P(Type = i) / Σ_j [ P(F | Type = j) × P(Type = j) ]

(d) Write a function that calculates the conditional probability of Type


given the values of the factors F1 and F2 , P(Type|F ) , using your
functions from part (c). Your function should take two inputs, F1
and F2 , and output a data frame containing one row and three
columns, one conditional probability for each type.
(e) Evaluate your function from part (c) for an item where both factors
have the value 50, commenting on the result.
(f) Determine the predicted value for each item in the simulated data
using the highest P(Type|F) calculated using the naïve Bayes
method.
Hint: the which.max(<vector>) function outputs the index of
the maximum value in <vector>.
(g) Calculate the accuracy of the naïve Bayes model using the results
from (iv)(f) and compare this to the decision tree model.
