CS2B Workbook
INDEX
Question 2.   Define a function in R to simulate the values from 𝑃𝑎𝑟𝑒𝑡𝑜 (𝛼, 𝜆) distribution.
              Name this function rpareto.
Question 4.   (i) Using the function defined in Q2, simulate 500 values from a
                  𝑃𝑎𝑟𝑒𝑡𝑜 (3,1000) distribution, using a seed value of 80.
Question 5.   Repeat Q1 to Q3 for the Weibull(c, γ) distribution. You may refer to the Actuarial
              Tables for the formulas.
Question 6.   Repeat Q1 to Q3 for the Burr(α, λ, γ) distribution. You may refer to the Actuarial
              Tables for the formulas.
Question 7.   Adapting the code of dpareto from Q1, create a corresponding function for the
              three-parameter Pareto distribution. You may refer to the Actuarial Tables for
              the formulas.
                                 Solutions
Answer 1.          dpareto <- function(x,a,l){
                   a*(l^a)/((l+x)^(a+1))
                   }
Answer 4.          (i)
                   set.seed(80)
                   rpareto <- function(n,a,l){
                   rp <- l*((1-runif(n))^(-1/a)-1)
                   rp
                   }
                   x<-rpareto(500,3,1000)
                   (ii)
                   skewness<-sum((x-mean(x))^3)/length(x)
                   coeff.of.skew<-skewness/var(x)^(3/2)
                   coeff.of.skew
                   [1] 6.903348
Question 2.    Calculate f(1000) and F(1000) for the random variable Y where Y is lognormally
               distributed with parameters 𝜇 = 6 and 𝜎² = 4.
Question 5.    Calculate the 99.5th percentile for the random variable Y, where Y is lognormally
               distributed with parameters 𝜇 = 6 and 𝜎² = 4.
Question 6. Calculate the IQR for the random variable X where X~Gamma(6,0.1).
Question 9.    You believe that claim amounts follow a Gamma(6,𝜆) distribution, where
               𝜆~𝐸𝑥𝑝(10). Generate a random sample of five claims, using a seed value equal
               to 1.
                                                        "
Question 10. The random variable Z is defined as Z = # where:
Question 11. Claim sizes X for a particular class of insurance business are believed to follow a
             three-parameter Pareto distribution such that X ~ Pa(2, 200, 7). Summarise the
             code required to calculate P(X > 10,000), and display the result.
Question 12. The claim frequency on a portfolio of life insurance policies follows a beta
             distribution with parameters 𝛼 = 0.02 and 𝛽 = 0.4. Calculate the probability that
             the claim frequency exceeds 0.75.
Question 13. The random variable X follows a three‐parameter Pareto distribution with
             parameters 𝛼 = 3, 𝜆 = 5 and k = 4. Calculate the median of X.
             Hint: write a function to calculate |P(X < M) - 0.5| and minimise this with respect to M.
Question 14. The random variable X follows a LogN(0, 0.25) distribution (i.e. with parameters
             𝜇 = 0 and 𝜎² = 0.25).
             (i)    Calculate the mode of X, to 6 significant figures.
             (ii)   Check your answer against a plot of the lognormal distribution.
Question 16. This question uses the following file of claim sizes for a portfolio of general
             insurance policies: ClaimSize.csv
             (i)    An analyst believes that claims are independent and follow a
                    Gamma(𝛼, 𝛽) distribution. Determine the maximum likelihood estimates
                    of the parameters 𝛼 and 𝛽, choosing from the following possibilities:
                    •       𝛼 = 1.7,1.8,1.9 or 2
                    •       𝛽 = 0.0015, 0.0016, 0.0017, …, 0.002
             (ii)   Plot the fitted density function against the data, and use it to comment on
                    your results.
Question 17. Insurance claims are believed to follow an exponential distribution with parameter
             𝜆. The file ‘exp.txt’ gives the claim size for 1,000 of these claims.
             (i)    Find the maximum likelihood estimate of 𝜆. (You should use a starting
                    value of 𝜆 = 0.001 for the iteration process.)
             (ii)   Plot the empirical density function against your fitted distribution and
                    hence comment on the goodness of fit.
Question 18. A fellow student has collected a set of motor insurance claims data in the
             following file: MotorClaims.txt
             The student believes that this data follows an exponential distribution, and has
             calculated the maximum likelihood estimate 𝜆̂ = 0.003.
             Plot the Q‐Q plot of the data against the student’s fitted distribution, and comment
             on the result.
              set.seed(1066)
              x<-rlnorm(1000,5,2)
              (i)     Use the data stored in vector x to find the MLE of parameters of Log-
                      Normal distribution.
              (ii)    Use the method of moments on the simulated data above to estimate the
                      lognormal parameters.
              (iii)   Verify that your estimates can be used to give sensible maximum
                      likelihood estimates.
                                   Solutions
Answer 1.          dgamma(50,10,0.2)
                   [1] 0.02502201
                   pgamma(50,10,0.2)
                   [1] 0.5420703
Answer 2.           dlnorm(1000,6,2)
                    [1] 0.0001799479
                    plnorm(1000,6,2)
                    [1] 0.6750416
Answer 3.          c<-0.00002
                   g<-4
                   pweibull(12,4,c^(-1/g))-pweibull(8,4,c^(-1/g))
                   [1] 0.2608205
Answer 4.          c<-0.00002
                   g<-4
                   sum(dweibull(y,4,c^(-1/g),log=TRUE))
                   [1] -56.0779
Answer 5.          qlnorm(0.995,5,2)
                   [1] 25633.58
Answer 6.          qgamma(0.75,6,0.1)-qgamma(0.25,6,0.1)
                   [1] 32.03492
Answer 7.          c<-0.00002
                   g<-4
                   qweibull(0.7,g,c^(-1/g))
                   [1] 15.66378
Answer 8.          set.seed(50)
                   rlnorm(10,5,2)
                   [1] 445.563506 27.571772 158.538247 423.392403 4.686996
                   [6] 85.137785 305.410535 45.521062 1044.382530 8.235879
Answer 9.          set.seed(1)
                   rgamma(5,6,rexp(1,10))
                   [1] 63.05133 49.19141 86.27594 32.85776 88.75371
                   Note, since your answer will depend on the order in which you generate numbers
                   from the two distributions, it is possible that your solution will be different to
                   ours, but still equally valid!
                   For example, you may have first generated a vector of random variates from the
                   exponential distribution, and then used these to simulate values from the
                   gamma distribution. In this case, your answer would be:
                   set.seed(1)
                   lambda <- rexp(5,10)
                   rgamma(5,6,lambda)
                   [1] 32.85776 56.72204 505.65996 495.95443 110.23761
Answer 13.         a<-3
                   l<-5
                   k<-4
                   f<-function(x) {
                   gamma(a+k)*(l^a)*x^(k-1)/(gamma(a)*gamma(k)*(l+x)^(a+k))
                   }
                   F<-function(M) {
                   abs(integrate(f,0,M)$value-0.5)
                   }
                   M.start<-10
                   nlm(F,M.start)
                   $minimum
                   [1] 1.947338e-08
                   $estimate
                   [1] 6.865008
                   $gradient
                   [1] -3.221498e-05
                    $code
                    [1] 2
                    $iterations
                    [1] 61
                    So the median is M = 6.865.
Answer 14.            m<-0
                      s2<-0.25
                      s<-s2^0.5
                      exp(m+0.5*s^2)
                      [1] 1.133148
                      x<-seq(0.000001,2,by=0.000001)
                      f<-function(x) {
                      dlnorm(x,m,s)
                      }
                      max<-max(f(x)); max
                      [1] 0.9041217
                      i<-which(f(x)==max); i
                      [1] 778801
                      x[i]
                      [1] 0.778801
                      (ii)
                      plot(x,f(x),type="l",main="Density function for
                      LogN(0,0.25)")
The highest point on the curve is at x ≈ 0.8, which is close to our answer for part (i).
p<-c(111,0.189)
              f(p)
              [1] 5433.073
$minimum
              [1] 5433.067
              $estimate
              [1] 110.9999999 0.1890614
              $gradient
              [1] -0.0301862472 -0.0002226943
              $code
              [1] 2
              $iterations
              [1] 2
              #In the code above, the function rbind appends each iteration
              into a new row of the dataframe result.
              (ii) The maximum claim size observation is 4,664 so we’ll plot a histogram with the x axis
                   on the range (0,5000):
              We have also chosen to set breaks=20, ie twenty bars in our histogram. This is an
              arbitrary choice; you may have chosen a different number altogether. In any case, R only
              treats this as a suggestion and will automatically set the breakpoints to ‘pretty’ values.
              The option freq=FALSE tells R to plot the PDF, rather than the frequencies. Now we
              plot our fitted gamma distribution on the same graph as the histogram:
              lines(seq(0,5000),dgamma(seq(0,5000),shape=1.7,rate=0.002),
              col="blue")
              We can see that the data is more positively skew than the fitted distribution. There are
              far too many claims greater than 800 and far too few claims of less than 800.
              Extracting the data into a vector will make it easier to work with:
              claims<-claims[,1]
              p<-0.001
              MLE<-nlm(f,p)
              MLE
              $minimum
              [1] 75087.7
              $estimate
              [1] 0.001490436
              $gradient
              [1] -9535.74
              $code
              [1] 3
              $iterations
              [1] 5
              Hence, 𝜆̂ = 0.0015.
              We want to plot the fitted PDF on a similar range, so we check the maximum claim size:
              max(claims)
              [1] 5612.656
              Hence, we set our x coordinates as:
              x<-seq(0,5613)
              Now we calculate the PDF of the exp(0.0015) distribution, for each value of x:
              y<-dexp(x,MLE$estimate)
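              The overlay commands themselves are not shown above; a minimal sketch of how the
              comparison might be plotted (assuming the claims data is still stored in the vector
              claims, as above):
              plot(density(claims),xlab="Claim size",col="blue",
              main="Empirical density function vs fitted exponential")
              lines(x,y,col="red")    # fitted Exp(0.0015) density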
              set.seed(77)
              y<-rexp(1000,0.003)
              qqplot(x,y,xlab="Sample quantiles",ylab="Quantiles of
              exp(0.003)", main="Q-Q plot")
abline(0,1,col="red")
              The data doesn’t lie anywhere near the reference line for claims greater than about 500,
              so the fit is particularly bad in the tail of the distribution. Nor is it a good fit for smaller
              claims. Overall we would recommend the student tries to fit an alternative distribution.
              we have written params[1] instead of mu, and params[2] instead of sigma. So when
              we specify params (ie our vector of inputs), we need the first element of params to
              correspond to the parameter mu and the second element to correspond to the
              parameter sigma.
              p<-c(100,100)
              MLE<-nlm(f,p)
              MLE
              $minimum
              [1] 7108.991
              $estimate
              [1] 4.988397 2.017083
              $gradient
              [1] 7.110559e-05 -5.455841e-05
              $code
              [1] 1
              $iterations
[1] 20
              MLE$estimate
              [1] 4.988397 2.017083
              (iii) Using the method of moments estimates as inputs in the MLE function
              We enter the estimates mu and sigma from part (i) as inputs in our function f, and
              minimise the negative log-likelihood:
              p<-c(mu,sigma)
              MLE<-nlm(f,p)
              MLE
              $minimum
              [1] 7108.991
              $estimate
              [1] 4.988397 2.017082
              $gradient
              [1] 8.897315e-05 -5.830088e-04
              $code
              [1] 1
              $iterations
              [1] 8
              This time, our maximum likelihood estimates are 𝜇̂ = 4.99 and 𝜎̂ = 2.02, which are very
              close to the population parameters 𝜇 = 5 and 𝜎 = 2.
                                      Reinsurance
Question 1.   Claims from a particular portfolio of insurance policies are expected to follow a
              Pareto distribution with parameters 𝛼 = 2 and 𝜆 = 800. An individual excess of
              loss arrangement is in force where the insurer pays a maximum of 1,500 on any
              claim.
              (i)     Simulate 10,000 claims from this portfolio, using a seed value of 5.
              (ii)    Estimate the probability that a claim involves the reinsurer.
Question 2.   Claim sizes this year, X, from a particular class of insurance follow an exponential
              distribution with parameter 𝜆 = 0.001.
              (i)     Simulate 10,000 claims from this portfolio, using a seed value of 2357.
              (ii)    Inflate each claim in the sample by 5%, to derive a sample of 10,000
                      claims from next year’s claim size distribution.
              (iii)   Apply the policy excess of 100 to each claim and find the mean amount
                      paid by the insurer after inflation.
              The vector yclaims denotes the amounts paid by an insurer for a book of general
              insurance business, net of reinsurance. The reinsurance is an excess of loss
              arrangement with retention limit M = 1,000 so that the insurer pays a maximum of
              1,000 for any individual claim. (Hence, many claims in yclaims are exactly equal
              to 1,000.)
              You believe that the underlying claim size random variable X (before reinsurance)
              is exponentially distributed with unknown parameter 𝜆.
              (i)     Calculate the maximum likelihood estimate of 𝜆.
              (ii)    Plot a graph of your fitted distribution against the empirical density plot of
                      the data yclaims. Your x axis should extend from 100 to 1,000 and you
                      should label your graph appropriately.
              (iii)   Comment on the key features of your graph.
Question 4.   Individual claim amounts follow a gamma distribution with parameters α = 5 and
              β = 0.04. The insurer arranges excess of loss reinsurance with a retention limit of
              200. Calculate the proportion of claims on which the reinsurer is involved.
              increase by 4% next year. Estimate the skewness of the number of claims the
              insurer can expect to pay next year.
Question 6.   Before attempting this question, you should run the following line of code:
               set.seed(16550)
              An insurer believes that claim amounts, X, on its portfolio of pet insurance policies
              follow an exponential distribution with parameter λ = 0.0005. The insurer wants to
              purchase an excess of loss reinsurance policy with retention level M, such that the
              reinsurer pays Z = X - M on claims that exceed size M, and zero otherwise.
              (i)      Simulate 10,000 claims from the insurer’s gross claims distribution and
                       hence estimate the insurer’s mean claims size net of reinsurance,
                       for M = 2,500.
              The reinsurer imposes a maximum reinsurance payment of £1,000 on each claim,
              so that:
                                   Z = 0            if X ≤ M
                                   Z = X − M        if M < X ≤ M + 1,000
                                   Z = 1,000        if X > M + 1,000
              (ii)     Estimate the retention level M that minimises the variance of the insurer’s
                       claims net of reinsurance.
Question 7.   A reinsurer has entered into an excess of loss reinsurance arrangement with a
              retention limit of £45,000. The amounts paid by the reinsurer on each claim are
              provided in the file: Reinsurance payments.txt
              arrangement, by
                      (a) the insurer
                     (b) the reinsurer.
              The insurer wishes to determine an appropriate retention limit and has asked an
              analyst to investigate the effect of different retention limits on the mean and
              variance of claims.
              (iv)     Calculate the mean and variance of the claims paid by:
                       (a) the insurer
                       (b) the reinsurer
                       under each of the following six retention levels:
                    £5,000, £10,000, £20,000, £30,000, £40,000 and £50,000.
              (v)      Plot your results from part (iv) on four separate charts to show how the
                       mean and variance of the claims paid by the insurer and reinsurer vary
                       with the retention level.
              (vi)     Comment, with reference to the total variance, on your results in part (v).
Question 10. Claims from a particular portfolio of insurance policies are expected to follow a
             Pareto distribution with parameters 𝛼 = 3 and 𝜆 = 400. A policyholder
             deductible of £100 is applied to these policies.
             (i)    Simulate 10,000 claims from this portfolio, using a seed value of 225 and
                    store the values in vector x.
             (ii)   Calculate the expected claim size paid by the insurance company.
             (iii)  Comment on the difference between your answer to (ii) and the mean of
                    vector x.
                                    Solutions
Answer 1.          (i) Simulating claims
                   Since the Pareto distribution is not built in to the basic R functions, we will write
                   our own function. You will have seen how to do this before in a previous unit:
                           rpareto <- function(n,a,l){
                                 rp <- l*((1-runif(n))^(-1/a)-1)
                                 rp
                           }
                   Simulating a set of 10,000 claims from this distribution, using the seed value of
                   5:
                           set.seed(5)
                           x<-rpareto(10000,2,800)
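                    (ii) Estimating the probability that a claim involves the reinsurer
                    The remainder of this solution is not reproduced above; a minimal sketch, using
                    the retention level of 1,500 from the question, is the proportion of simulated
                    claims that exceed the retention:
                            length(x[x>1500])/length(x)    # equivalently mean(x > 1500)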
Answer 3.          (i)
                    Extracting the vector of claims that do not exceed the retention level of 1,000:
                           yclaims.excl1000 <- yclaims[yclaims<1000]
                           p<-1
                           nlm(flnL,p)
                           $minimum
                           [1] 64624.84
                           $estimate
                           [1] 0.002720991
                           $gradient
                           [1] 5.491267
                           $code
                           [1] 2
                           $iterations
                           [1] 14
                   (ii) Plot
                   First we extract the maximum likelihood estimate 𝜆" from the output of the nlm
                   function, and call this lambda:
                           lambda<-nlm(flnL,p)$estimate
                   Now we plot the empirical density function of the original data. In the code
                   below we have split the title of the graph into three lines, so that R will do the
                   same when it creates the graph:
                   plot(density(yclaims,from=100,to=1000),xlab="Amounts paid
                   by insurer",main="Empirical density function vs
                   Exp(0.00272)",col="blue")
                   Now we create a vector of values x, on the same range as the empirical density
                   plot:
                   x<-seq(100,1000)
                   y<-dexp(x,rate=lambda)
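                    The overlay command itself is not shown above; a minimal sketch would be:
                            lines(x,y,col="red")    # fitted Exp(0.00272) density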
                   (iii) Comment
                    The blue line (ie the empirical density function) has a peak at x = 1,000. This is to
                    be expected since the data yclaims represents the insurer’s claim amounts
                    after reinsurance (ie there are many claims that have hit the reinsurance layer
                    and for which the insurer pays the maximum of 1,000).
                   The two lines diverge in the right-hand tail of the distribution since the data
                   represents claim amounts after reinsurance whereas the fitted distribution
                   models the underlying claim size before reinsurance.
                   The two lines are quite close across the rest of the range (ie where the
                   underlying claim sizes are the same as the insurer’s claim amounts), so the fitted
                   distribution appears to be a good fit.
Answer 5.          set.seed(1)
                   x <- sort(rpois(365,rgamma(365,20,0.01)))
                   f <- 0.01*x
                   f2 <- 1.04*f
                   nonf <- (1-0.01)*x
                   r2 <- nonf+f2
                   x2 <- r2*(1-0.01*1.04)
                   skewX2 <- sum((x2-mean(x2))^3)/length(x2); skewX2
                   skewX2^(1/3)
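                    # Part (i) of Answer 6 is not shown above; an assumed sketch, using the
                    # seed 16550, lambda = 0.0005 and M = 2,500 given in the question:
                    set.seed(16550)
                    x <- rexp(10000,0.0005)
                    mean(pmin(x,2500))    # insurer's mean claim size net of reinsurance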
                   # Part (ii)
                   f <- function(M) {
                   var(x-pmin(pmax(x-M,0),1000))
                   }
                   nlm(f,1000)
                   The number of claims that fall below the retention limit of £45,000 is:
                   nm <- length(Payments)-length(RI.claims)
                   nm
                   [1] 1634
                   mu <- mean(log(RI.claims))
                   sigma <- sd(log(RI.claims))
                   p <- c(mu,sigma)
                   flogN(p)
                   nlm(flogN,p)
                   $minimum
                   [1] 4578.015
                   $estimate
                   [1] 10.4747560 0.2651853
                   $gradient
                   [1] -8.453505e-06 -1.749413e-05
                   $code
                   [1] 1
                   $iterations
                   [1] 25
                   # Part (iii)
                   x <- Payments+M
                   max(x)
                   breakpoints <- c(20000,45000,seq(46000,86000,1000))
                   hist(x,freq=FALSE, breaks=breakpoints, xlim=c(0,86000),
                   ylim=c(0,0.00005),ylab="Density",main="Truncated data
                   versus logN(10.475,0.265)",
                   col="grey")
                    lines(seq(0,86000),dlnorm(seq(0,86000),meanlog=10.475,
                    sdlog=0.265),col="blue")
                   m <- nlm(flogN,p)$estimate[1]
                   s <- nlm(flogN,p)$estimate[2]
                    lines(seq(0,86000),dlnorm(seq(0,86000),meanlog=m,sdlog=s),
                    col="blue")
                   # Part (v)
                    x <- M + max(Payments)
                   plnorm(x,10.475,0.265,lower.tail=FALSE)
                   m <- nlm(flogN,p)$estimate[1]
                   s <- nlm(flogN,p)$estimate[2]
                   plnorm(x,m,s,lower.tail=FALSE)
Answer 8.          (i)
                   claims <- rexp(10000, 0.0001)
                   hist(claims, main = "Histogram of 10,000 Simulated Claims
                   Values from the Exponential Distribution")
                   (ii)(a)
                   mean(claims)
                   [1] 10037.81
                   var(claims)
                   [1] 99976440
                   (b)
                    The theoretical value of the mean should be 1/0.0001 = 10,000.
                    The theoretical value of the variance should be (1/0.0001)² = 100,000,000.
                    The simulated values are very close to the theoretical values.
                    The difference is due to sampling error.
                   (iii)(a)
                   For the insurer:
                   InsClaims3 <- pmin(claims, 20000)
                   mean(InsClaims3)
                   [1] 8671.328
                   var(InsClaims3)
                   [1] 43964610
                   (b)
                   For the reinsurer:
                   ReClaims3 <- pmax(0, claims - 20000)
                   mean(ReClaims3)
                   [1] 1366.485
                   var(ReClaims3)
                   (iv)(a)
                   InsClaims1 <- pmin(claims, 5000)
                   InsClaims2 <- pmin(claims, 10000)
                   InsClaims4 <- pmin(claims, 30000)
                   InsClaims5 <- pmin(claims, 40000)
                   InsClaims6 <- pmin(claims, 50000)
                   MeanIns <- c(mean(InsClaims1), mean(InsClaims2),
                   mean(InsClaims3), mean(InsClaims4), mean(InsClaims5),
                   mean(InsClaims6))
                   MeanIns
                   [1] 3940.658 6339.937 8671.328 9538.658 9869.412 9971.611
                    VarIns <- c(var(InsClaims1), var(InsClaims2),
                    var(InsClaims3), var(InsClaims4), var(InsClaims5),
                    var(InsClaims6))
                    VarIns
                   [1] 2533775 12858446 43964610 70204248 86376566 93370367
                   (b)
                   ReClaims1 <- pmax(0, claims - 5000)
                   ReClaims2 <- pmax(0, claims - 10000)
                   ReClaims4 <- pmax(0, claims - 30000)
                   ReClaims5 <- pmax(0, claims - 40000)
                   ReClaims6 <- pmax(0, claims - 50000)
                   MeanRe <- c(mean(ReClaims1), mean(ReClaims2),
                   mean(ReClaims3), mean(ReClaims4), mean(ReClaims5),
                   mean(ReClaims6))
                   MeanRe
                   [1] 6097.15469 3697.87535 1366.48492 499.15454 168.40057
                   66.20174
                   VarRe <- c(var(ReClaims1), var(ReClaims2), var(ReClaims3),
                   var(ReClaims4), var(ReClaims5), var(ReClaims6))
                   VarRe
                   [1] 84523429 60046376 25047814 9343406 3450843 1305645
                   (v)
                   x <- c(5000, 10000, 20000, 30000, 40000, 50000)
                   plot(x, MeanIns, xlab = "Retention Limit", ylab = "Mean
                   amount paid by Insurer", main = "Mean Claim Amount for
                   Insurer by Retention Limit", type = "l")
              (vi) The mean claim amount paid by the insurer increases in size with the retention
              limit, but at a decreasing rate. It tends towards the unrestricted mean as the retention
              limit increases.
              The mean claim amount paid by the reinsurer is equal to the total mean claim minus
              the mean amount paid by the insurer.
              Hence the mean claim amount paid by the reinsurer decreases in size with the
              retention limit, also at a decreasing rate.
              The trends in the variance of the claims for the insurer and the reinsurer
              are not “mirror images”.
              Among the six retention limits investigated, the sum of the variances of the reinsurer
              and the insurer reaches a minimum at a retention limit of around £20,000.
              TotalVar <- VarIns + VarRe
              TotalVar
              [1] 87057204 72904822 69012424 79547654 89827408 94676012
Answer 9.     (i)
              (a)
              Exp_Vector <- rexp(1000,0.4)
              mean(Exp_Vector) # or summary(Exp_Vector)
              var(Exp_Vector)
              The mean and variance will vary due to the random number generation. If the
              sample size were large enough, the mean and variance should be close to those of the
              underlying distribution (exponential with parameter 0.4), as follows:
              Mean = 2.5
              Variance = 6.25
              (b)
              hist(Exp_Vector)
              (c) 1.
              x <- sort(Exp_Vector)
              PDF <- dexp(x, 0.4)
              plot(x, PDF)
              (c) 2.
              plot(x, PDF, type="l")
              (ii) (a)
              LNorm_Vector <- rlnorm(1000, meanlog = 0, sdlog = 1)
              mean(LNorm_Vector)
              var(LNorm_Vector)
              The mean and variance will vary due to the random number generation. If the sample size
              were large enough, the mean and variance should be close to those of the underlying
              distribution (lognormal with parameters μ = 0, σ² = 1), as follows:
              Mean = 1.649
              Variance = 4.6708
              (ii) (b)
              hist(LNorm_Vector)
              (c)
              hist(LNorm_Vector, freq = FALSE, xlim = c(0,25), ylim = c(0,0.7))
              lines(ecdf(LNorm_Vector),col="red")
              legend("bottomright",c("True cumulative
              distribution","Estimate"),lty=1,col=c("black", "red"))
Answer 10.    (i)
              set.seed(225)
              x<-rpareto(10000,3,400)
              (ii)
              y <- x[x>100]
              mean(y)
              [1] 355.6002
              (iii)
              mean(x)
              [1] 204.6041
              The mean amount of Y is greater than this because we have removed some of
              the smallest claims by introducing the deductible.
Risk Models
Question 1.       Aggregate claims S follow a compound distribution where the number of claims N
               follows a (type 2) negative binomial distribution with parameters k = 10 and p = 0.1.
               Individual claim sizes X follow a lognormal distribution with parameters 𝜇 = 10 and
               𝜎² = 5.
              (i) Simulate 10,000 random variates from the compound distribution and display the
                  first five values of your data. (Use seed value 16.)
              (ii) Compare the sample mean and variance of your dataset to the population mean
                   and variance of S. Hint: Use the formula for E(S) and Var(S) given on page 16 of
                   the Tables.
               (iii) Adjust your answer to part (i) to simulate the insurer’s aggregate claims net of
                     reinsurance, S_I, if an individual excess of loss arrangement applies, with
                     retention limit M = 2,000,000.
Question 2. An insurer’s total annual claim amount S follows a compound Poisson distribution
            with parameter 𝜆 = 500. The individual claim size X is exponentially distributed
            with parameter 𝛽 = 0.001.
              (i) Simulate 1,000 aggregate claim amounts, using the seed value 1975, and display
                  the last five results.
               The insurer believes that aggregate claims can be approximated using a normal
               distribution with parameters 𝜇 and 𝜎².
               (ii) Calculate the maximum likelihood estimates of the parameters 𝜇 and 𝜎² based on
              your simulated data. Hint: You should use the sample mean and standard deviation
              as starting values for your iteration.
Question 3. This question uses the data file ‘IRM.txt’. Therefore please run the following code
           before attempting this question:
                 probability.of.claim <-
                 read.table("IRM.txt",header=T)[,2]
                 exponential.parameter <-
                 read.table("IRM.txt",header=T)[,3]
                 An insurer has sold 500 life insurance policies. Each policy can give rise to at
                 most one claim, and if a claim occurs on risk i the benefit payment follows an
                  exponential distribution with parameter 𝜆ᵢ (for i = 1, …, 500). Policies are
              independent, and claim sizes (and claim numbers) are not necessarily identically
              distributed.
              (i) Calculate the insurer’s expected total claim payment for the portfolio.
              (ii) Estimate the probability that the insurer’s aggregate claim amount, S, will be
              within plus or minus 10% of the population mean. (Hint: You should carry out
              10,000 simulations, using the seed value 5.)
              (iii) (a) Calculate the probability of the aggregate claims on the portfolio being
                       zero, and compare this to the probability implied by your simulation.
                    (b) Hence or otherwise, comment on the shape of the distribution of S.
Question 4.   We have generated the data required for this question (policies) using the
              following code. You should run this code before attempting the question:
                     set.seed(1812)
                     q <- runif(200,0,0.01)
                     mu <- runif(200,5,13)
                     sigma <- runif(200,0.5,0.8)
                     policies <- cbind(q,mu,sigma)
              (i)     Simulate the reinsurer’s mean aggregate claim amount, using 5,000
                      simulations and a seed value of 1854.
Question 5.   Aggregate claims S follow a compound Poisson distribution where the number of
              claims N follows a Poisson distribution with parameter 𝜆 = 100. Individual
              claim sizes X follow an exponential distribution with parameter 𝛽 = 0.002.
              Use seed value 8,143 to simulate 1,000 random variates from the compound
              distribution.
              (i)     Hence state the value of the 98th simulated value of S.
              (ii)    Estimate the median of the insurer’s aggregate claims payment.
              (iii)   Estimate the coefficient of skewness of the insurer’s aggregate claims
                      payment S
              (i)     Use a seed of 123 to generate a sample of 10,000 observations from T and
                      display the last five values from your simulation.
              (ii)    Estimate q such that P(T > q) = 0.05.
              (iii)   Fit a gamma distribution to the simulated dataset t that you derived above.
              (iv)    Create a Q‐Q plot of the maximum likelihood gamma distribution against
                      the simulated aggregate claims data t. Comment on the result.
              (v)     Check the interpretation of the Q‐Q plot in the previous question by
                      plotting the right‐hand tail of the fitted density function against the right‐
                      hand tail of the empirical density function.
Question 7.   The random variable W follows a compound negative binomial distribution with
              N ~NBin(5,0.8) and the claim size random variable X follows a Burr distribution
              with parameters 𝛼 = 0.3 , 𝜆 = 10 and 𝛾 = 1.2. Use a seed of 123 to generate a
              sample of 10 observations from W.
Question 8.   Company B has sold 100 life insurance policies which each gives rise to a
              maximum of one claim per year. There are three types of policy:
                  • for Type 1 policies, claims amounts (in £000s) follow an
                      Exp(0.005)distribution, and the probability of a claim occurring on the
                      policy is 0.001
                  • for Type 2 policies, claims amounts (in £000s) follow an Exp(0.006)
                      distribution, and the probability of a claim occurring on the policy is 0.002
                  • for Type 3 policies, claims amounts (in £000s) follow an Exp(0.007)
                      distribution, and the probability of a claim occurring on the policy is 0.003
              There are 30 Type 1 policies, 50 Type 2 policies and 20 Type 3 policies. All
              policies are independent.
              Simulate a set of 5,000 observations from the insurer’s aggregate claims size
              distribution, and hence estimate the probability that it will have to pay out more
              than £100,000 a year for these policies.
              (You should use the seed value 54321.)
SOLUTIONS
Answer 1
       (i)
       k <- 10; p <- 0.1
       mu <- 10; sigma <- 5^0.5
       set.seed(16)
       n <- rnbinom(10000,size=k,prob=p)
       s <- numeric(10000)
       for(i in 1:10000) {
            x <- rlnorm(n[i],meanlog=mu,sdlog=sigma)
           s[i] <- sum(x)
       }
       head(s,5)
       (ii)
       sample.mean <- mean(s)
       sample.variance <- var(s)
       meanN<-k*(1-p)/p
       varN<-k*(1-p)/p^2
       meanX<-exp(mu+0.5*sigma^2)
       varX<-exp(2*mu+sigma^2)*(exp(sigma^2)-1)
       meanS<-meanN*meanX
       varS<-meanN*varX+varN*meanX^2
       c(meanS, sample.mean)
       c(varS,sample.variance)
       meanS/sample.mean
       varS/sample.variance
       (iii)
       set.seed(16)
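        # The rest of part (iii) is not shown in the original; an assumed
        # completion, in which the insurer pays min(X, M) on each individual
        # claim with retention M = 2,000,000:
        M <- 2000000
        n <- rnbinom(10000,size=k,prob=p)
        sI <- numeric(10000)
        for(i in 1:10000) {
            x <- rlnorm(n[i],meanlog=mu,sdlog=sigma)
            sI[i] <- sum(pmin(x,M))
        }
        head(sI,5)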
Answer 2
       (i)
       set.seed(1975)
       n <- rpois(1000,500)
       s <- numeric(1000)
       for(i in 1:1000) {
           x <- rexp(n[i],rate=0.001)
           s[i] <- sum(x)
       }
       tail(s,5)
       (ii)
       f <- function(params) {
           lnL <- dnorm(s,mean=params[1],sd=params[2],log=TRUE)
           sum(-lnL)
       }
       p <- c(mean(s),sd(s))
       MLE<-nlm(f,p); MLE
(iii)
       mu <-MLE$estimate[1]
       sigma <- MLE$estimate[2]
       agg <- rnorm(100, mean=mu, sd=sigma)
       qqplot(agg,s,xlab = "Quantiles from fitted
       distribution",ylab = "Quantiles of simulated data",main =
       "QQ plot of data")
       abline(0,1,col="red")
       plot(density(s),main="Simulated data versus fitted
       distribution",xlab="Aggregate claim size",col="blue")
       agg <- seq(from=min(s),to=max(s))
       y <- dnorm(agg, mean=mu, sd=sigma)
       lines(agg,y,col="red")
Answer 3
       probability.of.claim <- read.table("IRM.txt",header=T)[,2]
       exponential.parameter <- read.table("IRM.txt",header=T)[,3]
       (i)
       E.Si <- probability.of.claim/exponential.parameter
       sum(E.Si)
       (ii)
       lower.bound <- sum(E.Si)*0.9
       lower.bound
       upper.bound <- sum(E.Si)*1.1
       upper.bound
       death <- rbinom(500,1,probability.of.claim)
       x <- rexp(500,rate=exponential.parameter)
       sim <- x*death
       sim <- sum(sim)
       set.seed(5)
       S.sim <- numeric(10000)
       for (i in 1:10000) {
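        # The loop body and the final estimate are not shown in the original;
        # an assumed completion, mirroring the single-simulation code above:
        death <- rbinom(500,1,probability.of.claim)
        x <- rexp(500,rate=exponential.parameter)
        S.sim[i] <- sum(x*death)
        }
        # Estimated probability that S is within +/- 10% of the population mean:
        mean(S.sim >= lower.bound & S.sim <= upper.bound)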
       (b)
              summary(probability.of.claim)
Answer 4
       set.seed(1812)
       q <- runif(200,0,0.01)
       mu <- runif(200,5,13)
       sigma <- runif(200,0.5,0.8)
       policies <- cbind(q,mu,sigma)
       (i)
       set.seed(1854)
       SR.sim <- numeric(5000)
       for (i in 1:5000) {
       death <- rbinom(200,1,policies[,1])
       x <- rlnorm(200,meanlog=policies[,2],sdlog=policies[,3])
       z <- pmax(0,x-60000)
       sim <- z*death
       sim <- sum(sim)
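        # The remaining lines of part (i) are not shown in the original; an
        # assumed completion, following the same pattern as part (ii) below:
        SR.sim[i] <- sim
        }
        mean(SR.sim)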
       (ii)
       set.seed(1854)
       S.sim <- numeric(5000)
       for (i in 1:5000) {
       death <- rbinom(200,1,policies[,1])
       x <- rlnorm(200,meanlog=policies[,2],sdlog=policies[,3])
       sim <- x*death
       sim <- sum(sim)
       S.sim[i] <- sim
       }
       SR.sim <- pmax(0,S.sim-300000)
       mean(SR.sim)
       [1] 20635.58
       So the reinsurer’s mean aggregate claims payment is approximately 20,636
Answer 5.
       (i)
       set.seed(8143)
       n <- rpois(1000,100)
       s <- numeric(1000)
       for(i in 1:1000) {
              x <- rexp(n[i], rate=0.002)
              s[i] <- sum(x)
              }
       The 98th element of s is:
       s[98]
       [1] 37309.99
       (ii)
       quantile(s,0.5)
       50%
       49812.93
       (iii)
       skewness<-sum((s-mean(s))^3)/length(s)
       skewness/var(s)^(3/2)
       [1] 0.1570086
Answer 6.
       (i)
       set.seed(123)
       n <- rbinom(10000,size=80,prob=0.3)
       t <- numeric(10000)
       rpareto <- function(n,a,l){
       rp <- l*((1-runif(n))^(-1/a)-1)
       rp
       }
       for(i in 1:10000) {
       x <- rpareto(n[i],a=4,l=4500)
       t[i] <- sum(x)
       }
       (ii)
       We need to find the 95th percentile q, which we can do
       using the quantile function:
       quantile(t,0.95)
       95%
       57819.2
       (iii)
       #First we write the negative log-likelihood function f:
       f <- function(params) {
       lnL <- dgamma(t,shape=params[1],rate=params[2],log=TRUE)
       sum(-lnL)
       }
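        The call to nlm() is not shown above; a sketch of how the fit might be carried out
        (the method-of-moments starting values and the name MLE are assumptions):
        p <- c(mean(t)^2/var(t),mean(t)/var(t))
        MLE <- nlm(f,p)
        MLE$estimate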
       (iv)
        #First we extract the maximum likelihood estimates from the
        #output of nlm (stored in MLE above).
        a <- MLE$estimate[1]
        l <- MLE$estimate[2]
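        The Q-Q plot command itself is not reproduced above; one possible sketch plots the
        quantiles of the fitted gamma distribution against the ordered simulated data:
        qqplot(qgamma(ppoints(10000),shape=a,rate=l),t,
               xlab="Quantiles of fitted gamma distribution",
               ylab="Quantiles of simulated data",main="Q-Q plot")
        abline(0,1,col="red")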
       The data appears to be a reasonable fit, except for the right‐hand tail of the distribution.
       Here the quantiles of the simulated data are higher than the quantiles of the fitted
       distribution, and there is one outlier that is very far from the maximum likelihood
       distribution. So the simulated data is more positively skewed than the fitted gamma
       distribution. (In fact, the data is not a particularly good fit for the left‐hand tail of the
       distribution either.)
       (v)
       #Plotting the empirical density function:
       plot(density(t),ylim=c(0,0.000001),xlim=c(50000,300000),main=
       "Simulated data
       versus
       fitted distribution",xlab="Aggregate claim size",col="blue")
       #(We have chosen the range of the empirical density function plot
       #by trial and error, using the ylim and xlim arguments.)
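        The overlay of the fitted density is not shown above; a sketch, using the maximum
        likelihood estimates a and l extracted in part (iv):
        lines(seq(50000,300000),dgamma(seq(50000,300000),shape=a,rate=l),
              col="red")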
The simulated data is indeed more positively skew than the fitted distribution.
Answer 7.
       set.seed(123)
       n <- rnbinom(10,size=5,prob=0.8)
       w <- numeric(10)
       for(i in 1:10) {
       x <- rburr(n[i],a=0.3,l=10,g=1.2)
       w[i] <- sum(x)
       }
       w
       [1] 558.14071 387.40722 23.26043 39.80472 50.99664
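        Note that rburr() is not a built-in R function; it would have been defined earlier
        (see Question 6 of the distributions chapter). A sketch using the inversion method,
        analogous to rpareto above, is:
        rburr <- function(n,a,l,g){
            (l*((1-runif(n))^(-1/a)-1))^(1/g)
        }
        The exact simulated values of w depend on the implementation of rburr used, so your
        output may differ slightly from that shown above.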
Answer 8.
       set.seed(54321)
       rate1 <- 0.005; q.1 <- 0.001
       rate2 <- 0.006; q.2 <- 0.002
       rate3 <- 0.007; q.3 <- 0.003
        S.sim <- numeric(5000)
        for (j in 1:5000) {
       x<-numeric(100)
       for (i in 1:30) {
       x[i] <- rexp(1,rate=rate1)
       }
       for (i in 31:80) {
       x[i] <- rexp(1,rate=rate2)
       }
       for (i in 81:100) {
       x[i] <- rexp(1,rate=rate3)
       }
       death<-numeric(100)
       for (i in 1:30) {
       death[i] <- rbinom(1,size=1,prob=q.1)
       }
       for (i in 31:80) {
       death[i] <- rbinom(1,size=1,prob=q.2)
       }
       for (i in 81:100) {
       death[i] <- rbinom(1,size=1,prob=q.3)
       }
       sim <- x*death
       sim<-sum(sim)
       S.sim[j] <- sim
       }
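        # The final estimate is not shown in the original; since claim amounts
        # are in £000s, an assumed estimate of P(aggregate claims > £100,000) is:
        mean(S.sim > 100)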
Survival Models
Question 1.   Let T₄₀ be the random variable denoting the future lifetime of lives currently
              aged 40. It is believed that T₄₀ follows the Weibull(0.00001, 3.5) distribution.
              (i)     (a) Generate a sample of 10,000 values of T₄₀, setting a seed of 100 and
                      storing the simulated values in the vector sims.
                      (b) Estimate P(T₅₀ > 10) using your sample.
              Under this model, the hazard at age 40 + t is given by:
                              𝜇₄₀₊ₜ = 𝛼𝛽t^(𝛽−1)
              where 𝛼 = 0.00001 and 𝛽 = 3.5.
              (ii)    (a) Construct a function in R to calculate 𝜇ₓ for an inputted age x ≥ 40.
                      (b) Calculate P(T₅₀ > 10) using your function for 𝜇ₓ.
              (iii)   Comment on your answers to parts (i)(b) and (ii)(b).
Question 2.   Mortality of a group of lives is assumed to follow Gompertz’ law for ages 25 and
              over. The force of mortality at age x ≥ 25 is given by:
                             𝜇ₓ = Bc^(x−25)     for x ≥ 25
              where B = 0.00000729 and c = 1.128.
              (i)    Plot the force of mortality from ages 25 to 100.
              (ii)   Plot the PDF of T₄₅, the future lifetime random variable of someone aged
                     45.
              (iii)  Calculate a numerical estimate of e̊₅₀, the complete expectation of life at
                     age 50.
              (iv)   (a) Simulate 100,000 values of T₅₀, the future lifetime random variable of
                     someone aged 50, setting a seed of 50 and storing your simulated values in
                     the vector sims.
                     (b) Estimate e̊₅₀ from your simulated values.
Question 3.   The ‘Very-ruthless Management Consultancy Company’ pays very high wages
              but also has a very high failure rate, both from sackings and through people
              leaving.
              The R code to create a data frame representing the life table for a typical new
              recruit (with durations measured in years) is given below:
              life.table = data.frame(duration = 0:9,
              lives = c(100000,72000,51000,36000,24000,15000,10000,6000,2500,0))
              (i)     Construct an extra column in the table that contains in each row the
                      probabilities of a new recruit’s survival to that year (the first row should
                      be set to 1).
              (ii)    Calculate the expected number of complete years that a new graduate will
                      complete with the company.
              (iii)   45 of the graduates that started working at the company on 1 September
                      last year are still at the company exactly one year later. Calculate the
                      expected number of complete additional years that these graduates will
                      complete with the company.
Question 4.   The values of 𝜇ₓ for the AM92 mortality table for ages in the range 17 ≤ x ≤ 119
              are calculated using the following formula:
                      𝜇ₓ = a₀ + a₁t + exp[b₀ + b₁t + b₂(2t² − 1)]
              where t = (x − 70)/50 and a₀ = 0.00005887, a₁ = -0.0004988, b₀ = -4.363378,
              b₁ = 5.544956, b₂ = -0.620345.
              (i)    (a)     Construct a function in R to calculate 𝜇ₓ for an inputted age x.
                     (b)     Calculate the value of 𝜇₈₀ using your function, checking it against
                             the tabulated value on page 81 of the Tables.
                     (c)     Plot a graph of 𝜇ₓ using a log scale. [Hint: use the log = “y”
                             argument in the plot() function.]
              You are given that 𝑞ₓ can be approximated using the relationship
              𝑞ₓ ≈ 1 − exp(−𝜇ₓ₊½).
              (ii)   (a) Use this approximation to calculate values of 𝑞ₓ, rounded to 6 decimal
                     places, to complete the table below:
              (iii)  (a) Calculate an approximate value for the curtate expectation of life e₂₅.
                     Hint: you can use the cumprod() function and the fact that
                     ₜpₓ = pₓ × pₓ₊₁ × … × pₓ₊ₜ₋₁
                     (b) Check your value of e₂₅ against the tabulated value on page 82 of the
                     Tables.
SOLUTIONS
Answer 1
(i)(a)      a = 0.00001
           b = 3.5
           shape = b
           scale = a^(-1/b)
         We then simulate the values using the rweibull() function:
                 set.seed(100)
                sims = rweibull(10000,shape,scale)
(b)      In part (i) we simulated values of T₄₀. So, we first need to convert P(T₅₀ > 10) into a probability
         in terms of T₄₀. Using the principle of consistency:
                   ₂₀p₄₀ = ₁₀p₄₀ × ₁₀p₅₀  ⇒  ₁₀p₅₀ = ₂₀p₄₀ / ₁₀p₄₀
         To estimate this from the simulated values, we count how many of the simulated
         values exceed 20 and divide by how many exceed 10.
         In R, we can do this using square bracket notation with logical conditions and the length()
         function:
                 length(sims[sims > 20]) / length(sims[sims > 10])
                 [1] 0.7277898
(ii)(a)
                 mux = function(x){
                       a * b * (x - 40)^(b - 1)
                 }
(b)              exp(-integrate(mux, 50, 60)$value)
                 [1] 0.7216983
(iii) The answers are very similar.
         We expect the answer to part (ii)(b) to be fairly accurate, even though R evaluates the integral
         numerically. The answer to part (i)(b) is slightly different due to sampling error. If we took a
         larger sample, we would expect the answer to be closer to that in part (ii)(b).
Answer 2
   (i)           B = 0.00000729
                 c = 1.128
                 mux = function(x){
                       B * c ^ (x - 25)
                       }
                 plot(25:100, mux(25:100), type = "l", col = "blue",
                      main = "Graph of mu_x against age",
                      xlab = "age", ylab = "Force of mortality", lwd = 3)
              (b)      mean(sims)
                       [1] 50.94788
Answer 3
   (i)        First we load in the data:
               life.table = data.frame(duration = 0:9, lives =
               c(100000,72000,51000,36000,24000,15000,10000,6000,2500,0))
              The probabilities of surviving to each year are the number of lives corresponding to that year
              divided by the starting number of lives.
              We can calculate these using square brackets for indexing and creating a new column using $:
              life.table$surv = life.table$lives / life.table$lives[1]
              life.table
                        duration    lives    surv
               1               0   100000   1.000
               2               1    72000   0.720
               3               2    51000   0.510
               4               3    36000   0.360
               5               4    24000   0.240
               6               5    15000   0.150
               7               6    10000   0.100
               8               7     6000   0.060
               9               8     2500   0.025
               10              9        0   0.000
     (ii)     We can do this using the sum() function and square brackets with a negative number to
              remove the first entry:
                      sum(life.table$surv[-1])
                      [1] 2.165
               Here, we have w = 7. We first need to calculate ₜp₁. These probabilities are the lives in
               each year divided by the number of lives in year 1. We can calculate these and add a
               column to the table similar to part (i). We set the first row to NA as it would represent
               ₋₁p₁, which doesn’t make any sense:
               life.table$surv_after_1_year =
               c(NA, life.table$surv[2:nrow(life.table)] / life.table$surv[2])
               life.table
               e₁ is now the sum of the probabilities in the last column from the third row on. To
              calculate this, we can take the sum of the column excluding the first and second entries:
                   sum(life.table$surv_after_1_year[-c(1:2)])
                   [1] 2.006944
Answer 4
   (i) (a)          a0 = 0.00005887
                    a1 = -0.0004988
                    b0 = -4.363378
                    b1 = 5.544956
                    b2 = -0.620345
                    mu <- function(x){
                    t = (x-70)/50
                    a0 + a1*t + exp(b0 + b1*t + b2*(2*t^2 - 1))
                    }
       (b)    mu(80)
              [1] 0.06827114
   (ii)      (a)
             We can create a list of ages for the first column using the following R command:
                   xlist = seq(20,110,by=10)
             We can now define a function to calculate the values in the right-hand column of the table in
             the question:
                 fnexp <- function(x) {1-exp(-mu(x + 0.5))}
             We can then input the required ages to this function, rounding the answers to 6 decimal
             places:
                 qlist = round(fnexp(xlist),6)
             There are various ways to combine the ages and probabilities in the form of a table. One
             method is to use the cbind function to join the columns together:
                 cbind(xlist,qlist)
             (b)
             The accurate values for ages 20, 60 and 110 (say), from pages 78 and 79 of the Tables, are:
                                𝑞$" = 0.000582, 𝑞+" = 0.008022 𝑎𝑛𝑑 𝑞%%" = 0.607918
             We can see that all the approximations are very good – accurate to at least 4 decimal places.
             However, the accuracy diminishes as the age increases.
             This is to be expected, because there will be a larger variation in the value of µx over the year
             of age at the older ages. So the approximation of using the value for the middle of the age
             range will not be as accurate here.
Question 2.    A study aimed at estimating hazard rates and survival probabilities from the point
               of contracting a particular disease has produced the results in the data file
               surv_table_2.csv, which has the following columns:
               • Time – the time an event took place measured in weeks since contraction of the
               condition
               • Deaths – the number of deaths that took place at each time
               • Censorings – the number of right censorings that took place at each time
               A total of 100 patients were observed since the day that they contracted the
               illness. Observation of patients stopped after 25 weeks following contraction of
               the condition.
        (iv)   (a)     Update your table to include a column called Lj that contains an estimate
                       of Λ(t), the integrated hazard function, using the Nelson-Aalen model.
               (b)     Calculate a set of approximate 95% confidence intervals for Λ(t) at the
                       time of each death event.
              (c)     Produce an appropriate graph in R showing a plot of the estimated
                      integrated hazard function with approximate 95% confidence intervals.
       The following code creates a vector of event times and a vector of event codes (1
       indicating death and 0 indicating censoring) for the 100 patients:
       (v)    (a)     Create a survival object using the vectors event.times and event.codes.
              (b)     Calculate the Nelson-Aalen estimate of the survival function and
                      approximate 99% confidence intervals.
              Hint: the conf.int argument in the survfit() function sets the level of the
              estimated confidence interval.
               (c)     Produce an appropriate graph in R showing the Nelson-Aalen estimate and
                      approximate 99% confidence intervals.
        (vi)   (a)     State an inequality relating the values of Ŝ_KM(t) and Ŝ_NA(t), where
                       Ŝ_NA(t) is the estimated survival function based on the Nelson-Aalen model.
              (b)     Demonstrate numerically the validity of the inequality in part (vi)(a) using
                      the data in this study.
SOLUTIONS
Answer 1
   (i) surv.data = read.csv("surv_table_7.1.csv")
   (ii) We first need to ensure the survival package is active:
               library(survival)
           We then need to create a vector of event times as the difference between the end of
           observation and time of exposure:
               surv.data$event.time = surv.data$observation_end -
               surv.data$exposure_time
           We also need a vector of event codes. The table has a reason column, which we need to use to
           create a vector of 1’s (to indicate death) and 0’s (to indicate censoring):
               surv.data$event.code = ifelse(surv.data$reason == "D", 1, 0)
           We can then use the Surv() function to create the survival object:
              surv.obj = Surv(surv.data$event.time, surv.data$event.code)
              surv.obj
              [1] 2    3+ 5    8   2 10+ 14    3 10+ 3
   (iii) (a)
         We can use the survfit() function with the survival object from part (ii):
               fitkm = survfit(surv.obj ~ 1, conf.type = "plain")
           (b) plot(fitkm,
                      main = "Kaplan-Meier estimate of survival function",
                      xlab = "weeks",
                      ylab = "SKM(t)")
Answer 2
   (i) surv.data = read.csv("surv_table_2.csv")
         surv.data
   (ii) The data in the table is by event time rather than by patient. So, first we need to just extract the
         rows that relate to death times:
               deaths = surv.data[surv.data$Death > 0, 1:2]
               deaths
         This gives both tj and dj . We can rename the columns to these titles:
              colnames(deaths) = c("tj", "dj")
              deaths
         Next we need to calculate the nj . As each patient was observed since they contracted the
         illness, nj is simply the number of patients that experienced their event on or after tj. To make
         the calculation a little easier, we can first create a new column in the original data set that is the
         sum of deaths and censoring, ie the total number of events at each time:
               surv.data$Events = surv.data$Death + surv.data$Censorings
         Next, we calculate the number of events happening at times greater than or equal to all the
         death times
                deaths$nj = numeric(nrow(deaths))
                for (i in 1:nrow(deaths)) {
                  deaths$nj[i] = sum(surv.data$Events[surv.data$Time >= deaths$tj[i]])
                }
                deaths
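          The published steps then move on to plotting. A minimal sketch of how the Λ̂(t) column
          and the approximate 95% confidence intervals for parts (iv)(a) and (b) could be built
          (assuming the usual Nelson-Aalen variance estimate dj(nj − dj)/nj³) is:
                deaths$Lj = cumsum(deaths$dj / deaths$nj)
                # approximate variance of the Nelson-Aalen estimate at each death time
                var.Lj = cumsum(deaths$dj * (deaths$nj - deaths$dj) / deaths$nj^3)
                deaths$lower = deaths$Lj - 1.96 * sqrt(var.Lj)
                deaths$upper = deaths$Lj + 1.96 * sqrt(var.Lj)
                deaths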
              We can then add the confidence intervals using the lines() function and setting the line
              type to 2 for a dashed line.
              Unfortunately, the upper part of the approximate confidence interval is cut off by
              the limits of the y-axis. We can fix this with the ylim argument in the plot() function:
                   plot(c(0, deaths$tj), c(0, deaths$Lj), type = "s", lwd = 2,
                        main = "Nelson-Aalen estimate of the integrated hazard function",
                        xlab = "weeks",
                        ylab = "Lambda(t)", ylim = c(0, 0.3))
              We now have a vector of times and codes to use with the Surv() function:
              surv.obj = Surv(event.times, event.codes)
              surv.obj
       (b)     Using the survfit() function with stype set to 2 and conf.int set to 0.99:
               fitna = survfit(surv.obj ~ 1, conf.type = "plain", stype = 2,
               conf.int = 0.99)
               We can then look at the estimate using the summary() function:
               summary(fitna)
       (c)     plot(fitna,
               main = "Nelson-Aalen estimate of survival function",
               xlab = "weeks",
               ylab = "SKM(t)", ylim = c(0.6, 1))
               legend("bottomleft",
                      c("Nelson-Aalen estimate", "Kaplan-Meier estimate"),
                      col = c("black", "blue"), lty = 1)
       The blue line is never above the black line, as expected based on the relationship between the
       estimates.
Question 1.   The table below shows the number of circuits a group of 10 army recruits were
              able to complete in a training exercise and their ages, split by sex.
                          Complete Circuits       Males      Females
                                   1
                                   2                             30
                                   3              21,20        24,25
                                   4                             19
                                   5            18,18,17
                                   6                             21
              Two Cox proportional hazards models are being considered for modelling the
              result :
              Model 1:         ln λ(t; z₁) = ln λ₀(t) + β₁z₁
              Model 2:         ln λ(t; z₁, z₂) = ln λ₀(t) + β₁z₁ + β₂z₂
              Here:
                  • λ(t; z₁) and λ(t; z₁, z₂) are the hazard rates for ‘dropping out’, i.e. for the
                       recruit reaching their limit in terms of the number of circuits they can
                       complete
                  • z₁ is the recruit’s sex (z₁ = 0 for males, z₁ = 1 for females)
                  • z₂ is the recruit’s age.
              (i)     Construct a data frame that contains a row for each recruit and the
                      following columns:
                          • Age – the age of the recruit
                          • Sex – the sex of the recruit
                          • Circuits – the number of circuits completed by the recruit
                          • Status – whether each recruit experienced the death event (1) or
                             was censored (0)
              Hint: the data.frame(<col 1> = <vector 1>, ..., <col n> = <vector
              n>) function creates a data frame with columns called <col 1>,...,<col n>
              containing the values in <vector 1> to <vector n>.
              (i)     (a)    Construct a function that returns the partial log-likelihood for an
                             inputted value of β₁.
                      (b)    Show that the value of the partial log-likelihood (ignoring
                             constants) is –16.78165 when β₁ = 0.3.
              (ii)    Plot a graph of the partial log-likelihood function over a suitable range,
                      indicating an appropriate interval that contains β̂₁.
(iv) Compare your answer to part (iii) to your answer to Question 1 part (ii)(b).
SOLUTIONS
Answer 1
   (i)         Creating the necessary vectors and putting them into a data frame:
                    circuits = c(2,3,3,3,3,4,5,5,5,6)
                    status = rep(1,10)
                    sex = c(1,0,0,1,1,1,0,0,0,1)
                    age = c(30,21,20,24,25,19,18,18,17,21)
                    PH_data = data.frame(circuits,status,sex,age)
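                    The fitting step itself is not shown in this extract. A sketch of how the two models
                    might have been fitted (assuming the survival package and the data frame above, with
                    the number of circuits playing the role of the event time) is:
                    library(survival)
                    fit1 = coxph(Surv(circuits, status) ~ sex, data = PH_data)
                    fit2 = coxph(Surv(circuits, status) ~ sex + age, data = PH_data)
                    summary(fit2)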
           (b) The estimated values of β̂₁ and β̂₂ are now -2.0086 and 0.5744.
               This gives the fitted model:
               Model 2: ln λ(t; z₁, z₂) = ln λ₀(t) - 2.0086 z₁ + 0.5744 z₂
              We now have:
              exp(fit2$coefficients)
              sex         age
              0.1341723 1.7760945
              Since β̂₁ is negative, this model now suggests that the drop-out rate is lower for the
              females compared to the males, but again this difference is not significant (p-value =
              14.1%).
              Since β̂₂ is positive, this model suggests that the drop-out rate increases with age (77.6%
              per year of age according to the fitted model). This effect is significant in this model,
              since the p-value of 4.3% is less than 5%.
       (iv)   We are testing the null hypothesis that Model 2 is not a significant improvement over
              Model 1.
              Using the hint in the question, we can use the anova() function.
                   anova(fit1, fit2, test = "Chi")
              This tells us that the partial log-likelihood is –16.781 for the first model and –13.419 for
              the second. The observed value of the test statistic, -2(ℓ₁ - ℓ₂), is 6.7042 and the
              associated p-value is 0.009618.
              As the p -value of 0.962% is less than 5%, there is sufficient evidence to reject the null
              hypothesis that Model 2 is not a significant improvement over Model 1. It appears
              reasonable to conclude that incorporating age has improved the model.
Answer 2
       (i)    (a) Constructing the function:
                 part.log.lik = function(b){
                  4 * b - log(5 * exp(b) + 5) - 4 * log(4 * exp(b) + 5) -
                         log(2 * exp(b) + 3) - 3 * log(exp(b) + 3)
                              }
              (b) Checking the value:
                    part.log.lik(0.3)
                    [1] -16.78165
       (ii)   plot(part.log.lik, 0, 1, lwd = 2,
                         main = "Partial log-likelihood",
                         xlab = "beta", ylab = "log(L(beta))")
(iv) This value is almost exactly the same as the estimate from Question 1.
                                       Markov Chains
Question 1.    A Markov chain has the following state space and one-step transition
               probabilities:
       (i)     Construct a matrix that contains the one-step transition probabilities, labelling the
               rows and columns with ‘state 1’, ‘state 2’ and ‘state 3’.
       (ii)    Construct a markovchain object, mc, with the name ‘Q1MC’ and using the
               matrix from part (i) as the transition probability matrix.
       (iii)   (a) Determine whether the Markov chain is irreducible using functions in R.
        The @ symbol can be used with markovchain objects to extract its components. The
        following code extracts the transition matrix:
               mc@transitionMatrix
(iv) Check the row sums of the extracted transition matrix are all 1.
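       For example, the check in (iv) could be carried out with (a one-line sketch, assuming the
       object mc from part (ii)):
              rowSums(mc@transitionMatrix)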
       (v)     (a)     Calculate the probability that the chain will be in state 1 after 4 steps,
               given it is currently in state 2, using mc.
               (b) Calculate the probability that the chain will be in state 2 after 8 steps, using
               mc.
Question 2.     The most recent 18 values of a discrete-time process with states 1, 2 and 3 were as
                follows:
2 1 2 1 3 1 1 2 2 1 1 3 3 1 2 1 2 1
(i) Create a vector named past18 containing the values of the past states.
                Enter the following lines of R code, which fit a Markov chain model to the past
                data:
                         fit1 = markovchainFit(past18)
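                        The fitted transition matrix can then be inspected via the estimate component of
                        the returned list, eg:
                          fit1$estimate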
You should tackle questions 3, 4 and 5 using either the basic R functions or functionality in the
markovchain package (or both). Answers for both approaches are presented in the solutions.
Question 3.     A no-claims discount (NCD) system for motor insurance is to be modelled using a
                Markov chain with constant transition probabilities, as shown in the diagram:
                         State 1         No discount
                         State 2         15% discount
                         State 3–        30% discount (at least one claim in previous year)
                         State 3+        30% discount (no claims in previous year)
                         State 4         40% discount
          (i)       (a) Construct the one-step transition probability matrix for this process P.
                   (b) Check that the rows of the matrix sum to 1.
            (ii)     Calculate the probabilities for the discount rate this policyholder will be
                      receiving after five years.
            (iii)    Calculate the probabilities for the discount rate for this policyholder after 20
                      years.
            (iv)     (a) Calculate 𝑃!&
                     (b) Comment on the rows of 𝑃!& .
            Hint: you may wish to use the functions matrix(), diag(), rowSums() and the
            matrix multiplication operator %*%.
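            For instance, an n-step transition matrix can be built up by repeated matrix
            multiplication (a generic sketch, assuming a transition matrix P has already been
            constructed from the diagram):
            nstep = function(P, n) {
              Pn = diag(nrow(P))            # identity matrix of the right size
              for (i in 1:n) Pn = Pn %*% P  # multiply by P a total of n times
              Pn
            }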
Question 4.         An insurance company has been using a Markov chain to model its no-claims
                    discount (NCD) system, which offers the following discounts to motorists on their
                    annual premium:
                                    Level 1       No discount
                                    Level 2       15% discount
                                    Level 3       25% discount
                                    Level 4       30% discount
        •           After a claim-free year (probability 0.85), policyholders move up a level, or stay at
                    Level 4.
        •           After a year with exactly one claim (probability 0.11), policyholders move down a
                    level, or stay at Level 1.
        •           After a year with more than one claim, policyholders move down two levels,
                    subject to a minimum of Level 1.
        For a particular class of policyholder (same type of car, type of cover etc.), the
        insurance company has re-evaluated the one-step probability distribution so that there is
        now a 0.83 probability of a claim-free year and a 0.12 probability of a policyholder
        claiming exactly once in any given year. The full premium in this class is 750.
        (v)     Construct a function that calculates the expected total premium income over n
                years for a given starting distribution, income vector and one-step transition
                matrix, without using the markovchain package.
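        A minimal sketch of such a function (assuming premium income in each year is taken
        according to the level distribution at the start of that year, with the transition matrix
        and per-level income built from the rules stated above) might be:
        premium.income = function(start, income, P, n) {
          dist = start
          total = 0
          for (year in 1:n) {
            total = total + sum(dist * income)   # expected premium collected this year
            dist = dist %*% P                    # level distribution at the start of next year
          }
          total
        }
        # for example, under the re-evaluated probabilities (0.83 claim-free, 0.12 one claim):
        P = matrix(c(0.17, 0.83, 0,    0,
                     0.17, 0,    0.83, 0,
                     0.05, 0.12, 0,    0.83,
                     0,    0.05, 0.12, 0.83), nrow = 4, byrow = TRUE)
        income = 750 * c(1, 0.85, 0.75, 0.70)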
       The insurance company is considering setting the additional premium at between £10 and
       £15 per annum.
      (viii)       Calculate the new expected premium income over the next 10 years for a driver
                   currently on Level 4 who pays the additional premium for the case where:
                   (a) the additional premium is £10, storing your answer in the object prem.10
                   (b) the additional premium is £15, storing your answer in the object prem.15.
        (ix)     Comment on whether the insurer should set the additional premium at £10, with
                 reference to your answer to part (vi)(c).
      The overall expected premium income is linearly related to the additional premium
      amount.
        (x)      Find the additional premium required for NCD protection (to 2 decimal places)
                 so that the new expected premium income is equal to that calculated in part
                 (vi)(c), by using linear interpolation or otherwise.
        (xi)     Explain why the insurance company might opt for a much higher additional
                 premium than that calculated in part (vi).
Question 5.    An insurance company is using a Markov chain to model its no-claims discount
               (NCD) system, which offers the following discounts to motorists on their annual
               premium:
                           Level 1 No discount
                           Level 2 10% discount
                           Level 3 20% discount
                           Level 4 30% discount
                           Level 5 40% discount
         After a claim-free year (probability 0.9), policyholders move up a level or stay at Level
         5. After a year with exactly one claim (probability 0.08), policyholders move down a
         level or stay at Level 1. After a year with more than one claim, policyholders move to
         Level 1.
        (i)       Construct the one-step transition probability matrix P.
                   The probability that a policyholder on Level i is on Level j after n years is
                   denoted by pᵢⱼ(n).
        (ii)       Construct a function that calculates the n -step transition probability matrix as a
                   function of P and use it to calculate the following probabilities:
                   (a) p+& (5)
                   (b) p&: (4)
                   (c) p:$ (6)
        For a particular class of policyholder (same type of car, type of cover etc) the full
        premium is 800 and there are 2,000 policyholders in the class. You may assume that
        the policyholders have been driving for a number of years.
        (iii) Calculate the stationary distribution and use it to calculate the expected annual
              premium income for this policy class.
        To improve the overall premium income and make the system fairer to the safer
        drivers, the insurance company decides to change the rules of the NCD system slightly,
        so that any policyholder making exactly one claim in a given year will be moved down
        two levels if they also made a claim the previous year, subject to a minimum level of
        Level 1.
        (iv)   Explain how many more states will be needed to continue to model the NCD
               system as a Markov chain, and give the new one-step transition probability
               matrix Q .
        (v)    Recalculate the probabilities in part (ii) under the new rules. You may assume
               that the driver in (c) did NOT have a claim the previous year.
        (vi)   Calculate the increase in expected annual premium income from the change in
               the rules and comment on your answer.
Question 6. Data is available for the movement of taxis in London. The city is divided into
            three zones “North”, “South” and “West”. The movement of a taxi from one zone
            to another will depend only on its current position. The following probabilities
            have been determined for taxi movements:
                • Of all taxis in the North zone, 30% will remain in North and 30% will move to
                  South, with the remaining 40% moving to West.
                • In the South zone, taxis have a 40% chance of moving to North, 40% chance of
                  staying in South and 20% chance of moving to West.
                • Of all the drivers in the West zone, 50% will move to North and 30% to South
                  with the remaining 20% staying in West.
                  The movement of taxis in London will be modelled in R using a Markov Chain.
        (i)       Create a vector with the state space of the Markov Chain, using R code. You
                  should print this to the screen and paste into your answer.
        (ii)      Construct a transition matrix of the zone movement probabilities. You should
                  print this to the screen and paste into your answer.
        (iii)     Load the R package for Markov Chains and paste your coding into your answer.
        (iv)      Create a Markov Chain object with state space equal to your vector in part (i)
                  and transition matrix from part (ii). You should print this to the screen and
                  paste into your answer.
        (v)       Calculate the probability that a driver currently in the North zone will be in the
                  North zone after:
                  (a)    two trips              (b)     three trips
        (vi)      Determine the stationary state of the Markov Chain.
Question 7.   Before answering this question, the R package for Markov chains should be
              loaded into R using the following code:
                      install.packages("markovchain")
                      library(markovchain)
              Xt is a Markov chain on the state space {1,2,3} with the following transition
              matrix:
Question 8.   Suppose the transition probabilities are a function of the age of a person. The
              transition probability of a person aged x moving:
                    • from Healthy to Sick state is 0.007x
                    • from Healthy to Death state is 0.001x
                    • from Sick to Death state is 0.002(100-x)
                    • from Sick to Healthy is 0.006(100-x).
               Assuming 100 is the maximum age, calculate the probability that:
               (i)     a healthy person aged 30 will be in the Sick state after 4 years
               (ii)    a sick person aged 25 will be dead after 7 years.
                                        Solutions
Answer 1.
Answer 2
Answer 3
Answer 4
Answer 5
Answer 6
Answer 7
Answer 8
(i)      H2S<-function(x){ 0.007*x}
         H2D<-function(x){ 0.001*x}
         S2H<-function(x){ 0.006*(100-x)}
         S2D<-function(x){ 0.002*(100-x)}
         transmat<-function(x){
               M<-matrix(0,nrow=3,ncol=3)
               M[1,1]<-1-H2S(x)-H2D(x)
               M[1,2]<-H2S(x)
               M[1,3]<-H2D(x)
               M[2,1]<-S2H(x)
               M[2,2]<-1-S2H(x)-S2D(x)
               M[2,3]<-S2D(x)
               M[3,1]<-0
               M[3,2]<-0
               M[3,3]<-1
               M
               }
         n=30
         B<-c(1,0,0)
         for (i in 1:4){
               B=B%*%transmat(n+i-1)
               }
         B[2]
Hence the probability that a healthy person aged 30 will be in the Sick state after 4 years is 0.2608138.
(ii)     n=25
         C<-c(0,1,0)
         for (i in 1:7){
               C=C%*%transmat(n+i-1)
               }
         C[3]
Hence the probability that a sick person aged 25 will be in the Death state after 7 years is 0.4333386.
              The annual transition rates collected from past data are given here:
                      μ₁₂ = 0.12      μ₁₃ = 0.02      μ₂₁ = 0.18
                      μ₂₃ = 0.05      μ₃₁ = 0.013     μ₃₂ = 0.05
              (iv)    (a) Recalculate the probabilities from part (iii) using these new rates.
                      (b) Comment on your answer to part (iv)(a).
Question 3.   In a particular European country, where anyone over the age of 18 may vote, there
              are 3 main political parties:
                      • the ‘Reds’ (R)
                      • the ‘Blues’ (B)
                      • the ‘Xenophobes’ (X).
              A company that produces and sells opinion polls has suggested that the voting
              population’s political affiliation be monitored over time using a 3-state Markov
              process.
              (i)     Explain the most suitable choice for the time set if a Markov chain is used.
                    (c) the rate at which a 50-year old member of the Blues moves to the
                         Xenophobes.
               Over a very small time period, h, the transition probability matrix P(t, t+h) can be
               approximated as:
                                               P(t, t+h) ≈ I + h·A(t)
               where I is the appropriate identity matrix with the same dimensions as A(t).
              Probabilities over longer time periods can then be estimated by successively
              multiplying the probabilities over shorter time periods (increments of length h ) as
              follows:
                      P(s, t) ≈ ∏_{j=0}^{(t−s)/h − 1} P(s + hj, s + hj + h)
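               As an illustration of this approximation, a generic sketch (assuming a user-supplied
               function A(x) that returns the generator matrix at time x) could be:
               estimate.P = function(s, t, h, A) {
                 steps = round((t - s) / h)        # number of increments of length h
                 P = diag(nrow(A(s)))              # start from the identity matrix
                 for (j in 0:(steps - 1)) {
                   P = P %*% (diag(nrow(P)) + h * A(s + j * h))
                 }
                 P
               }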
              (iv) Construct a function to estimate the value of P(s,t) for a given subdivision
                   value of h and start and end times s and t .
              (v) Estimate:
                   (a) the probability that a 25-year old member of the Blues moves to the Reds
                   over the next 10 years, using h=1/12
                   (b) the probability that a 60-year old member of the Reds moves to the
                   Blues over the next 5 years, using h=1/365
                   (c) the probability that a 50-year old member of the Xenophobes moves to
                   the Reds over the next 7 years, using h=1/2.
              (vi) Discuss whether this time-inhomogeneous Markov jump process is a good
                   model for tracking political party affiliation.
                                 Solutions
Answer 1
Answer 2
Answer 3
Time Series
Question 1. Plot the ACF and PACF up to lag k = 20 for the AR(2) time series Yₜ with defining
            equation:        Yₜ = 15 + 0.8Yₜ₋₁ − 0.6Yₜ₋₂ + Zₜ
            where the white noise terms {Zₜ} are independent and identically distributed random
            variables with mean 0 and standard deviation σ = 1.
You should plot your graphs one directly above the other and label them clearly.
Question 2. The time series ldeaths is available in R. It measures the number of deaths in the
            UK from lung disease, over a period of time.
                  (a) the number of observations that have been made per year
                  (b) the time of the first observation
                  (c) the time of the last observation
                  (d) the autocorrelation function at lag 2 (ie ρ₂)
                  (e) the number of people who died from lung disease in February 1978
                  (f) the total number of people who died from lung disease during 1976.
Question 3. Plot the sample ACF and sample PACF for the data ldeaths, up to lag k = 5. Your
            graphs should be side by side, and should have appropriate labels.
Question 4. The following code generates 1,000 observations from a time series, and saves it as
            an object s.
             xx <- numeric(1003)
             set.seed(4567)
             ww <- rnorm(1003)
             xx[1:3] <- ww[1:3]
             for(t in 4:1003) { xx[t] <- 0.8*xx[t-1] + 0.2*xx[t-3] + ww[t] + 2*ww[t-1] }
             s <- ts(xx[4:1003])
           Run the code above and hence determine how many times the time series s should be
           differenced.
Question 5. The file sales.txt records the number of sales per month of a particular type of
            insurance product, over the ten‐year period from January 2005 to December 2014.
           (i)           Import this data and classify it as a time series object. You should specify
                         the relevant dates.
           (ii)          Apply the method of moving averages and the method of seasonal means
                         to the data. (You may assume the time series is additive.)
(iii) Plot the ACF and the PACF of the remainder, and comment on the result.
Question 6. The volume of water in a reservoir is measured every month from 2010, for five
            years. The results (in millions of gallons) are stored in the file seasonality.txt
            (i)           Import this data and classify it as a time series object, specifying the
                          relevant dates.
            (ii)          Plot the data, the sample ACF and the sample PACF in a single plotting
                          area.
            (iii)         Remove the seasonality from the data using seasonal differencing. (You
                          should assume that the data exhibits yearly seasonal variation.)
            (iv)          Use your answer to part (iii) to plot the seasonally‐adjusted data and its
                          corresponding sample ACF and sample PACF. Your graphs should be
                          displayed in a single plotting area.
Question 7. This question uses the file ‘TS.txt’.
            Run the following code:
                   TS<-read.table("TS.txt",sep="\t")
                   TS<-ts(TS[,1],start=1,end=length(TS[,1]),frequency=1)
            (i)           Fit an AR(2) model to the time series TS and interpret the result.
            (ii)          Carry out a Ljung‐Box test using five lags, and hence determine whether
                          the fitted model is an appropriate fit.
Question 8. This question uses the file ‘Wt.csv’.
                    Run the following code:
                   Wt<-read.table("Wt.csv",sep="",header=TRUE)
                   Wt<-ts(Wt$value,start=min(Wt$day),end=max(Wt$day))
            The time series Wt represents the value of a particular commodity, recorded daily.
            Plot the correlogram and partial correlogram for the data, in a 2 × 1 layout. Hence
            suggest what order of ARIMA you would fit to this time series.
Question 9. This question uses the file ‘Xt.csv’.
                   Run the following code:
                   Xt<-read.table("Xt.csv",sep=",",header=TRUE)
                   Xt<-ts(Xt,1990,frequency=12)
             Xt is a set of time series data recorded monthly from January 1990 to August
             2006.
              (i)          Calculate the Akaike Information Criterion for ARIMA(p,d,q) models with
                           p ≤ 2, d ≤ 2, q ≤ 2.
             (ii)         Hence, determine which model is the best fit.
Question 11. Generate the time series object data using the following code:
                      set.seed(1558)
                      data <- 100 + arima.sim(list(ar=runif(1,0.3,0.6),
                                              ma=runif(1,12,15)), n=40)
                      data <- ts(data, start=c(2008,1), frequency=4)
              This stationary time series represents a company’s revenue in thousands, over a
              ten‐year period.
              (i)     Fit an ARMA(1,1) to the data, and hence estimate the company’s revenue
                      up until the end of 2019. (You do not have to state the values of the fitted
                      parameters.)
              (ii)    Use exponential smoothing to estimate the company’s revenue, over the
                      same period.
              (iii)   Create a graph that shows the time series data and the two sets of
                      forecasts. You should label each line clearly.
Question 14. The following two equations form a bivariate time series (Xₜ, Yₜ)ᵀ:
                     Xₜ = 0.3Xₜ₋₁ + 0.1Yₜ₋₁ + eₜ
                     Yₜ = −0.2Xₜ₋₁ + 0.6Yₜ₋₁ + eₜ′
               where {eₜ} and {eₜ′} are independent white noise processes.
               Determine the eigenvalues of the parameter matrix and hence state whether
               (Xₜ, Yₜ)ᵀ is stationary.
SOLUTIONS
Answer 1
           We calculate the theoretical ACF and PACF using the ARMAacf function, with
           lag.max=20:
             AR2acf<-ARMAacf(ar=c(0.8,-0.6),lag.max=20)
             AR2pacf<-ARMAacf(ar=c(0.8,-0.6),lag.max=20,pacf=TRUE)
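            The bar chart of the ACF itself is not visible in this extract; one possible version
            (a sketch, setting up a 2 × 1 plotting area so the two charts sit one above the other) is:
              par(mfrow=c(2,1))
              barplot(AR2acf,main="ACF of AR(2)",col="blue",xlab="Lag")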
           Similarly, we create the bar chart for AR2pacf, but including the argument
           names.arg=seq(1,20), in order to label the bars explicitly:
              barplot(AR2pacf,main="PACF of AR(2)",col="red",xlab="Lag",
              names.arg=seq(1,20))
Answer 2
       (a)     Number of observations per year
               frequency(ldeaths)
               [1] 12
               So there are 12 observations per year.
       (b)     Time of the first observation
               start(ldeaths)
               [1] 1974 1
               Since data is recorded monthly, the first observation is for January 1974.
       (c)     Time of the last observation
               end(ldeaths)
               [1] 1979 12
               So the last observation is for December 1979.
       (d)     Autocorrelation function at lag 2
       The acf function will plot a graph of the sample autocorrelation function. So to access the
       information behind the graph, we can define the object a, and set plot=FALSE:
               a<-acf(ldeaths,plot=FALSE,lag.max=72)
       (We’ve stipulated lag.max=72 since there are 72 observations in the dataset:
               length(ldeaths)
               [1] 72
       However, any lag of 25 or over would score marks.)
        We’ve calculated the ACF ρₖ for lags k = 0, 1/12, 2/12, …, 71/12, 72/12 (measured in years,
        since the data is monthly). So to find ρ₂ we need the 25th element of the stored ACF.
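        For example (the exact extraction step is not visible in this extract):
                a$acf[25]
        (e)     Number of people who died from lung disease in February 1978
        Printing the whole series shows the monthly figures:
                ldeaths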
               Jan      Feb     Mar Apr      May    Jun     Jul    Aug     Sep    Oct     Nov       Dec
       1974 3035 2552 2704 2554 2014 1655 1721 1524 1596 2074 2199 2512
       1975 2933 2889 2938 2497 1870 1726 1607 1545 1396 1787 2076 2837
       1976 2787 3891 3179 2011 1636 1580 1489 1300 1356 1653 2013 2823
       1977 3102 2294 2385 2444 1748 1554 1498 1361 1346 1564 1640 2293
       1978 2815 3137 2679 1969 1870 1633 1529 1366 1357 1570 1535 2491
       1979 3084 2605 2573 2143 1693 1504 1461 1354 1333 1492 1781 1915
       So the number of deaths in February 1978 was 3,137.
       However, we wouldn’t want to print out all the data for very large datasets. So instead, we can
       extract the answer like this:
       Since the time series starts at January 1974 and we have monthly data, the observation for
       February 1978 is the 50th element of the series:
               ldeaths[50]
               [1] 3137
       (f) Total number of people who died from lung disease during 1976
       We could sum the required elements:
               sum(ldeaths[25:36])
               [1] 25718
Answer 3
              First we set our graphical parameters:
                         par(mfrow=c(1,2))
               Then we plot the ACF using the acf function with lag.max=60 (since, for k = 5 and
               monthly data, we need to calculate 5 × 12 = 60 autocorrelations):
              acf(ldeaths,lag.max=60,main="Number of deaths from lung
              disease", ylab="Sample ACF",col="red")
              Similarly, plotting the PACF using the pacf function:
              pacf(ldeaths,lag.max=60,main="Number of deaths from lung
              disease", ylab="Sample PACF",col="blue")
Answer 4
              Differencing s
              It’s very quick to difference data in R so we will do this a few times and then analyse the
              results:
                         ds<-diff(s)
                         dds<-diff(ds)
               (Alternatively, we could have calculated dds as dds<-diff(s,differences=2).)
              • we’ll tell R to plot the graphs in a 1 × 3 array using par(mfrow=c(1,3)) and then
              reset the graphical parameters afterwards using par(mfrow=c(1,1))
              • we use semicolons ; to carry out more than one command in a single line
                      par(mfrow=c(1,3))
                      acf(s); acf(ds); acf(dds)
                      par(mfrow=c(1,1))
              There is a steady linear decrease in the ACF of the undifferenced time series,
              which indicates that differencing is required. This disappears after differencing
              once (ie in the graph of ds), which indicates that we don’t need to difference a
              second time.
              Analysing the variance
                      Calculating the sample variance for each of our datasets:
                                var(s); var(ds); var(dds)
                                [1] 652.342 [1] 4.760322 [1] 7.903931
              The variance reduces (from 652 to 4.76) if we difference s once, but increases (up to
              7.90) if we difference a second time. Again this indicates that we should difference the
              data only once.
Answer 5
              (i)     Importing the data:
              We use the read.table function, and check this looks sensible using the head
              function:
                      sales<-read.table("sales.txt")
                      head(sales)
              V1
              1 1057
              2 1093
              3 1120
              4 1120
              5 1091
              6 1082
              Now we classify the data as a time series, where:
              • we specify the start date as January 2005 using the argument start=c(2005,1)
              • we specify monthly observations using the argument frequency=12.
                       sales<-ts(sales,frequency=12,start=c(2005,1))
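               The decomposition code for part (ii) is not visible in this extract. One possible sketch
               of the moving-average decomposition and the remainder’s ACF/PACF for part (iii)
               (using decompose(), which assumes an additive series) is:
                       d <- decompose(sales)
                       plot(d)
                       r <- na.omit(d$random)   # the remainder after removing trend and seasonality
                       par(mfrow=c(1,2))
                       acf(r); pacf(r)
                       par(mfrow=c(1,1))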
              The ACF still shows significant seasonal variation in the data. Hence there appears to be
              some residual seasonality that the decomposition hasn’t captured.
              However, the PACF cuts off for lags k > 2 which suggests that (assuming we can remove
              the remaining seasonal variation), an AR(2) model might be appropriate.
Answer 6
              (i) Importing the data
              Using the read.table function and checking that the data has been imported sensibly:
                      seas<-read.table("seasonality.txt")
                      head(seas)
                      V1
                      1 212.7483
                      2 239.6343
                      3 242.7506
                      4 239.3206
                      5 233.8502
                      6 207.3584
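                       The remaining steps are not shown in this extract. A sketch of parts (i) and (iii)
                       (assuming monthly observations starting in January 2010 and yearly seasonality) is:
                       seas <- ts(seas[,1], start=c(2010,1), frequency=12)
                       seas.adj <- diff(seas, lag=12)    # seasonal differencing at lag 12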
Answer 7
               (i)     Fitting an autoregressive model of order 2
              We can use the arima function, where the order argument represents p, d and q
              respectively. Since an AR(2) is the same as an ARIMA(2,0,0) , we have:
                      fit<-arima(TS,order=c(2,0,0))
                      fit
                      Call:
                      arima(x = TS, order = c(2, 0, 0))
                      Coefficients:
                              ar1              ar2              intercept
                       0.2193                  0.4475           500.1856
              s.e.     0.0889                  0.0894           0.2803
              sigma^2 estimated as 0.9245: log likelihood = -138.28, aic
              = 284.56
              So we have fitted a model of the form:
                       Xₜ = μ + α(Xₜ₋₁ − μ) + β(Xₜ₋₂ − μ) + Zₜ
               where: μ̂ = 500.1856, α̂ = 0.2193, β̂ = 0.4475 and the estimated variance of the
               white noise terms Zₜ is σ̂² = 0.9245.
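               (ii)    The Ljung-Box step is not shown in this extract; a minimal sketch (using the
                       residuals of the fitted AR(2), with fitdf = 2 for the two estimated AR
                       coefficients) is:
                       Box.test(fit$residuals, lag = 5, type = "Ljung-Box", fitdf = 2)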
Answer 8
              We’ll plot the ACF and the PACF one above the other by setting the graphical
              parameter:
                      par(mfrow=c(2,1))
              (This means the layout of the graphs will be two rows deep and one column wide.)
              Now we use the acf and pacf functions:
                      acf(Wt, main="Data: Wt",ylab="sample ACF")
                      pacf(Wt,main="",ylab="sample PACF")
              The ACF appears to decay. The PACF cuts off (ie the values fall inside the confidence
              interval) for lags greater than 1.
              Hence, we would fit an AR(1) model, ie an ARIMA(1,0,0).
Answer 9
              We need to fit a model for every combination of p, d and q and calculate the AIC. That’s
              a lot of work! So we’ll write a ‘loop’ to do the work for us. In our solution below, we’ll
              start by calculating the AIC for just one model, and then extend this to do the same for all
              models.
              Calculating the AIC for one model
              First let’s create a vector with four entries:
                        answer<-numeric(4)
              All the entries in this vector are zero for now. We’ll populate it with our answers in a
              moment.
               To calculate the AIC for an ARIMA(0,1,2) model (for example), we fit the model using
              the arima function, and then extract the AIC from this model by using $aic, like this:
                        model<-arima(Xt,order=c(0,1,2))
                        aic<-model$aic
              In fact, we can simplify these last two lines to:
                         aic<-arima(Xt,order=c(0,1,2))$aic
              We create a vector to display this result alongside the values of p , d and q respectively:
                        row<-c(0,1,2,aic)
                        row
                        [1] 0.0000 1.0000 2.0000 886.8326
              Finally we append this onto the bottom of our vector answer, using the rbind
              function:
                        answer<-rbind(answer,row)
                        answer
                               [,1] [,2] [,3] [,4]
              answer              0       0       0       0.0000
              row                 0       1       2       886.8326
              (The first row is a ‘dummy’ row, we can ignore it.)
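               Repeating this for every required combination of p, d and q (a sketch of the looping
               step, which is not visible in this extract) might look like:
                         for (p in 0:2) {
                           for (d in 0:2) {
                             for (q in 0:2) {
                               aic <- arima(Xt, order = c(p, d, q))$aic
                               answer <- rbind(answer, c(p, d, q, aic))
                             }
                           }
                         }
                         answer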
Answer 10
              (i)     Fitting the model
              To fit an ARMA(1,1) model we use the arima function, with order=c(1,0,1):
                      fit<-arima(Yt,order=c(1,0,1))
                      fit
              Call:
              arima(x = Yt, order = c(1, 0, 1))
              Coefficients:
                                 ar1           ma1              intercept
                                 0.8772        0.4089           109.8152
                      s.e.       0.0387        0.0815           1.7909
                      sigma^2 estimated as 4.749: log likelihood = -396.76,
                      aic = 801.52
               So for our fitted model we have: μ̂ = 109.8152, α̂ = 0.8772, β̂ = 0.4089 and σ̂² = 4.749.
                       for (i in 2:(n-1)) {
                         if (et[i] > et[i+1] && et[i] > et[i-1]) {
                           turns[i] <- 1
                         } else {
                           next
                         }
                       }
              In the code above:
              • notice that our loop runs from i =2,…,(n-1), not from i=1,…. n, because the first and
              last residuals cannot be turning points
              • the square brackets [] let us extract individual entries from the vector, so et[i] is the ith
              element of vector et.
              Similarly, to check whether each residual is a minimum, we can simply change the
              direction of the inequality signs:
                       for (i in 2:(n-1)) {
                         if (et[i] < et[i+1] && et[i] < et[i-1]) {
                           turns[i] <- 1
                         } else {
                           next
                         }
                       }
              We can calculate the total number of turning points in the residuals as:
                       nt<-sum(turns)
              If the residuals form a white noise process, we would expect the mean and variance of the
              number of turning points in the residuals to be:
                       E<-2/3*(n-2)
                       V<-(16*n-29)/90
               where n is the number of residuals.
              Our test statistic should include a continuity correction. Comparing the expected number
              of turning points with the actual number of turning points:
                       E; nt
                       [1] 118.6667
                       [1] 110
              Since nt is less than E, we will apply the continuity correction by including +0.5 in our
              calculation of the test statistic:
                       ts<-(nt-E+0.5)/V^0.5
                       ts
                       [1] -1.451
               Our null hypothesis is H₀: the residuals are white noise. The 5% critical value is
               ±1.96, so we have insufficient evidence to reject H₀, and we conclude that the residuals
               are white noise.
Answer 11
              (i)
                      fit<-arima(data,order=c(1,0,1))
                      start(data);end(data);frequency(data)
                      p<-predict(fit,n.ahead=8)
                      p<-p$pred; p
(ii)
                      HW<-HoltWinters(data,beta=FALSE,gamma=FALSE)
                      p2<-predict(HW,n.ahead=8); p2
(iii)
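                      One possible plot (the colours, line types and labels are assumptions) is:
                      ts.plot(data, p, p2, col=c("black","red","blue"), lty=c(1,2,2),
                              main="Revenue and forecasts", ylab="Revenue (000s)")
                      legend("topleft", c("data","ARMA(1,1) forecast","exponential smoothing"),
                             col=c("black","red","blue"), lty=c(1,2,2))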
Answer 12
              (i)
                      dx<-diff(x)
                      fit<-ar(dx)
                      pd<-predict(fit,n.ahead=60)$pred
                      px<-cumsum(pd)+tail(x,1)
                      px<-ts(px,start=c(2010,1),frequency=12)
                      px
                      px[time(px)==2014+11/12]
(ii)
                      min(px,x); max(px,x)
                      ts.plot(x, main="Past data and forecast", ylab="Number of diagnoses",
                              xlim=c(1990,2015), ylim=c(1796,1833), col="blue")
                      lines(px, col="red")
Answer 13
                      p<-predict(fit,n.ahead=100)$pred
                      p[time(p)==2038.75]
Answer 14
                   m<-matrix(c(0.3,-0.2,0.1,0.6),2,2); m
                   eigen(m)$values
Answer 15
                   (i)
                   PP.test(0.2*s+1.5*t)$p.value
                    (ii)
                    PP.test(s)$p.value
                    PP.test(diff(s))$p.value
                    PP.test(t)$p.value; PP.test(diff(t))$p.value
                    find.coint<-function(coint) {
                         comb<-coint[1]*s+coint[2]*t
                         test<-PP.test(comb)$p.value
                         test
                    }
                    v<-c(1,1)
                    find.coint(v)
                    fit<-nlm(find.coint,v)
                    alpha<-fit$estimate[1]
                     beta<-fit$estimate[2]
                     alpha; beta
                   PP.test(alpha*s+beta*t)$p.value
Question 1. An insurer has recorded its claims, alongside the year and month in which each
            claim event occurred, in the following file:
                      table.txt
            (ii)       Group the claims according to the quarter in which they occurred (ie
                       quarter 1 refers to the months January to March, quarter 2 refers to months
                       April to June, etc) and hence find the maximum claim that occurred in the
                       third quarter of 2016.
Calculate the threshold exceedances for these claims at the threshold u = 3,000,000.
Question 3. Claims sizes in the file ‘exp.txt’ are assumed to follow an exponential distribution
            with parameter l . Estimate the distribution of the threshold exceedances for this
            data.
Question 4. Plot the hazard rate for the Weibull(0.4, 5) distribution, ie where c = 0.4 and γ = 5.
            Hence state whether this distribution has a heavy tail.
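            A sketch of one way to do this (assuming the parameterisation S(x) = exp(-c*x^γ) from
            the actuarial tables, so that the hazard rate is h(x) = cγx^(γ-1)) is:
             cc <- 0.4; g <- 5
             x <- seq(0.01, 3, by=0.01)
             hazard <- cc * g * x^(g - 1)
             plot(x, hazard, type="l", main="Hazard rate of Weibull(0.4, 5)",
                  xlab="x", ylab="h(x)")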
Question 5. Plot the density ratios for the Pareto(2, 4) and Pareto(7,12) distributions and hence
            state which distribution has the heavier tail.
                                 Solutions
Answer 1
So the maximum claim in the third quarter of 2016 was 31,075. This is consistent with our answer to
part (i).
Answer 2
Answer 3
Answer 4
Answer 5
Machine Learning
Question 1. The built-in data set in R called faithful contains 272 records of the waiting time
            between eruptions and the duration of the eruption for the Old Faithful geyser.
(iii) Show the result of the clustering, explaining each component of the output.
(iv) (a) Show the standardised cluster centres from part (iii).
(b) Calculate the cluster centres in the original units of the data.
           Someone suggests rerunning the algorithm with initial cluster centres of (2.2,80) and
           (4.5,55).
           (viii)      (a) Calculate z-scores for the proposed cluster centres based on the data in
                       faithful.
           where <initial centres> is an r × c matrix (or data frame) with r as the number
           of clusters and c as the number of variables in the data set.
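           For example, parts (viii)(a) and (b) might be sketched as follows (assuming z-scores
           based on the column means and standard deviations of faithful):
            centres <- rbind(c(2.2, 80), c(4.5, 55))
            scaled.centres <- scale(centres, center = colMeans(faithful),
                                    scale = apply(faithful, 2, sd))
            fit <- kmeans(scale(faithful), centers = scaled.centres)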
           (ix) Show the standardised cluster centres for the clusters calculated in part (viii)(b)
           and compare them to those calculated in part (iv)(a).
Question 2. A movie streaming website has collected data about movie preferences on some of
            their customers. The customers rated various movies, each assigned a unique genre,
            giving a score from 1 to 100. The file movie.data.csv contains the following
            information:
           •     Horror: the average score given by the customer for horror films
           •     Romcom: the average score given by the customer for romantic comedies
           •     Action: the average score given by the customer for action movies
           •     Comedy: the average score given by the customer for comedies
           •     Fantasy: the average score given by the customer for fantasy films.
           •     Cluster: the proposed cluster allocation for the customer (1, 2 or 3).
           An analyst working for the website has attempted to allocate customers into clusters
           using the k-means algorithm based on their average movie scores across the genres.
           The analyst has stored the proposed cluster for each customer in the column Cluster.
           The analyst realises that the algorithm was terminated too early and there is one data
           point that is not correctly allocated to the nearest centroid.
            (ii)         (a) Identify the data point that is not correctly allocated to its nearest
                         cluster centre.
                         Hint: the which.min(<vector>) function outputs the index
                         of the minimum value in <vector>.
           The analyst now further realises that the data should really be standardised before
           using the k-means algorithm. The analyst decides to use z-scores to standardise the
           data.
           (iii)      (a)     Perform k-means clustering using the kmeans() function with 3
                      clusters and setting a seed of 6 after appropriately standardising the data
                      using the scale() function.
                       (b)     Output the unscaled cluster centres of the clusters calculated in part
                               (iii)(a).
                      (c)      Compare the cluster centres calculated in part (iii)(b) to those
                               calculated in part (ii)(c).
           (iv)       Describe the clusters calculated in part (iii)(a) by considering the centres
                      in part (iii)(c).
           (v)        Plot the first five columns of the data set, colouring each point by its
                      allocated cluster from the k-means algorithm.
           The total within-groups sum of squares is stored for each cluster in the k-means
           output as the item ‘tot.withinss’. This measures the within-cluster
           homogeneity (ie how similar the points in that cluster are). This item may be
           referenced using the following code:
kmeans(<data>, <centers>)$tot.withinss
           The website wants to consider other numbers of clusters and wants to use the total
           within-groups sum of squares metric to determine an appropriate number to use.
           (vi)      (a)     Calculate the total within-groups sum of squares when performing
                     k-means clustering on standardised data with 1 to 10 clusters, setting a
                     seed of 6 before running each iteration and storing the results in a vector
                     called total.within.ss.
                     (b)     Plot total.within.ss against the number of clusters using a
                             combined line and points graph.
                     (c)     Suggest an appropriate number of clusters.
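           A sketch of how parts (vi)(a) and (b) might be coded (assuming the imported data is in a
           data frame called movie.data, a hypothetical name, and that only the five score columns
           are standardised) is:
           scaled.data <- scale(movie.data[, 1:5])   # movie.data is an assumed object name
           total.within.ss <- numeric(10)
           for (k in 1:10) {
             set.seed(6)
             total.within.ss[k] <- kmeans(scaled.data, centers = k)$tot.withinss
           }
           plot(1:10, total.within.ss, type = "b",
                xlab = "Number of clusters", ylab = "Total within-groups SS")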
Question 3. A scientist is analysing a large number of items, which are known to be of one of
           three types (1, 2 or 3). She has measured two factors for each one in order to help
           identify which type each item belongs to. She has found that the values of the factors
           for each of the types conform to the distributions shown in the table below. Here
           U(a,b) indicates the discrete uniform distribution, which is equally likely to take any
           of the values a, a+1,…,b.
           (ii)       Construct a data frame of 30 simulated items (10 of each type) with the
                      characteristics shown in the table, after setting a seed value of 19. The data
                      frame should contain three columns; the type of the object; the value of
                      FACTOR1 and the value of FACTOR2.
                      Hint: the sample(<vector>, n, replace =TRUE) function
                      generates a sample of size n using the values in <vector> with
                      replacement.
                      In this question ‘integer’ refers to whether x ∈ ℤ and not the integer data
                      type in R.
                      Hint: sum(x != floor(x)) != 0 returns TRUE if any value in
                      <vector> is not an integer and FALSE otherwise.
                     (c)     Write a function for each Type that calculates the conditional
                             probability of obtaining values for the factors F1 and F2 , given the
                             Type, ie 𝑃(𝐹|𝑇𝑦𝑝𝑒). You should assume that the factors operate
                             independently given the Type.
           You are given that the prior probabilities of each type are P(Type 1) = 0.5 ,
           P(Type 2) = 0.3 and P(Type3)=0.2. You are also given that: