Envx3002 Lecture Notes
Week 1
Why:
● Describe relationship
● Explain variation
● Predict new values
Anova:
● Regression SS: variation explained by the linear model (regression)
● Residual SS: variation not explained by regression
● n is the number of observations
Anova vs Drop 1:
● The anova function provides a sequential set of tests for each variable
○ You can set up 2 nested models under a null hypothesis and compare them, to see which model does the better job of explaining the variation
● The drop1 function gives tests adjusted for all other terms in the model, so these are to be preferred
● anova can only compare 2 models at a time; drop1 tests every term against the full model
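A minimal R sketch of the difference, using the built-in mtcars data (the variable choice is illustrative only):

```r
# anova() tests terms sequentially: wt first, then hp given wt, then disp given both.
fit <- lm(mpg ~ wt + hp + disp, data = mtcars)
anova(fit)

# drop1() tests each term adjusted for all the other terms, so order doesn't matter.
drop1(fit, test = "F")

# Comparing two nested models directly under a null hypothesis:
reduced <- lm(mpg ~ wt, data = mtcars)
anova(reduced, fit)
```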
Assumptions:
● Data normally distributed
● Useful to check residuals
Leverage:
● A measure of how extreme an observation's predictor values are, and hence how much it can pull the fitted line
Principle of Parsimony:
● Fewer variables is better: the simplest model that explains the data is preferred
Variance-bias tradeoff:
● Overfitting: lots of variables, tracking the noise in the data rather than the mechanism
behind it
○ High variance, low bias
● Underfitting: low variance, high bias
● Good balance: low bias and low variance
Partial F test:
● When comparing nested models you can use anova; if the partial F test's p-value is greater than 0.05, the extra terms are not significant and not better at explaining the variation
Automated methods:
● Forward selection
● Backward elimination
● Problem: stepwise procedures consider one variable at a time, so a different combination of variables might take out important information, as variables may affect each other
● AIC penalises the number of parameters, so the principle of parsimony is captured
Problems:
● Every test has a 5% chance of a Type I error (the probability of falsely rejecting the null)
● It is still not clear which method is best; it depends on the hypothesis
Lab 1
Prediction vs Confidence:
● The prediction interval is wider because the uncertainty around a single new observation is much larger
● The confidence interval reflects only the uncertainty in the fitted mean, so it is narrower
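A short R sketch of the two intervals, using the built-in cars data:

```r
# Fit a simple regression and predict at a new speed value.
fit <- lm(dist ~ speed, data = cars)
new <- data.frame(speed = 15)

predict(fit, new, interval = "confidence")  # uncertainty in the fitted mean: narrower
predict(fit, new, interval = "prediction")  # uncertainty for a new observation: wider
```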
Week 2
Treatment Design:
● Selection of treatments for an experiment: both the factors and the levels of each factor
● Designs with more than one factor are termed factorial treatment designs
Experiment Design:
● How we allocate treatments to the experimental units
● Completely randomised design:
● Randomised block
Split plot:
● Used when more than one factor affects the analysis, increasing the complexity of the blocking structure
● Blocks enter as an error term, used to account for the differences between groups of experimental units
● Hint for assessment: need to think about the randomisation order for the whole plots and the sub-plots
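A split-plot sketch in R with hypothetical factors (Irrigation on whole plots within blocks, Variety on sub-plots) and simulated data:

```r
set.seed(1)
d <- expand.grid(Block = factor(1:4),
                 Irrigation = factor(c("low", "high")),
                 Variety = factor(c("A", "B", "C")))
d$yield <- rnorm(nrow(d), mean = 10)

# Error(Block/Irrigation) defines the whole-plot error stratum for Irrigation;
# Variety and the interaction are tested against the sub-plot error.
fit <- aov(yield ~ Irrigation * Variety + Error(Block/Irrigation), data = d)
summary(fit)
```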
Definitions:
● Block
● Treatment
Week 3
The models
● ANOVA
○ Predictor is categorical
○ Observations = mean + treatment (categorical) + error
● Regression
○ Predictor is continuous
○ Y values = intercept + slope (continuous predictor) + error
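Both models come from the same lm() call in R; only the predictor type differs. A sketch with simulated data:

```r
set.seed(1)
d <- data.frame(trt = factor(rep(c("ctrl", "A", "B"), each = 10)),
                x = runif(30))
d$y <- 2 + 3 * d$x + rnorm(30)

anova_fit <- lm(y ~ trt, data = d)  # categorical predictor -> ANOVA
reg_fit   <- lm(y ~ x,   data = d)  # continuous predictor  -> regression

anova(anova_fit)  # the ANOVA table: treatment and residual SS
summary(reg_fit)  # intercept and slope estimates
```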
Anova tables
Assumptions:
● Normality
● Constant variance
● Independent and randomly collected
Notes:
Assignment:
Building a model to predict the variation in species abundance depending on the treatment used.
● The reason we use the error term is to consider whether there is a difference between the samples; think about which error stratum is actually relevant here
Question 1:
● Justification includes should you use the interaction
● Should you do the blocking
● Whether to use anova or linear model
● Do we need design.split
Question 2:
● Linearity, homoscedasticity,
Question 4:
● Use emmeans to see which pairs differ, and what more can be said about the interaction
Question 5:
● What are the recommendations based on
● Rehabilitate, but for biological diversity, optimal place for species 2
Questions:
● Use design.split, independence, linearity, 3 x 2, what did you put in the output,
● Remove emmip
● How to get 4 marks in last part, what does the anova actually show
Week 4 Lecture
Assumptions:
● Independence
● Errors are normally distributed with mean 0 and a variance
● Constant variance
● Homoscedasticity / Homogeneity
● A straight line cannot be fitted to binary data directly; logistic regression fits the data through the logit link
Maximum likelihood:
● The parameter values that make the observed outcome most probable
● Normal distribution: the most likely value is the mean
● Binomial distribution: the likelihood is maximised at the observed proportion of successes
Anova in regression:
● If the observed values lie close to the least-squares line, that means the relationship is strong and there is a significant relationship
● Akaike Information Criterion (AIC): adjusts the deviance for a given model for the number of predictor variables
○ Lower is better
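A logistic-regression sketch in R showing the AIC comparison, with simulated binary data:

```r
set.seed(1)
d <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
d$y <- rbinom(100, 1, plogis(0.5 + 1.2 * d$x1))  # y depends on x1 only

m1 <- glm(y ~ x1,      data = d, family = binomial)
m2 <- glm(y ~ x1 + x2, data = d, family = binomial)

# AIC adjusts the deviance for the number of predictors: lower is better.
AIC(m1, m2)
```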
Assignment
● Question 1:
○ Plot
● Question 2:
○ Test assumptions
○ Then do a model
● Question 3:
○ Test dispersion and AIC and then use a new model
● Question 4:
○ Prediction plot using code
● Question 5:
○ Interpretation of prediction plot
Week 4 Tutorial
● Look at the residual deviance and degrees of freedom to identify whether there is an overdispersion issue or not
● Check for overdispersion at the start, and whether it is significant
○ With binary (0/1) data you can't assess overdispersion this way
● Look at the lecture code as well; it helps
● With binary data don't use quasibinomial, but with proportion data you can use quasibinomial
● Compare the reduced model against the full model to see whether the change in deviance shows the extra terms are actually useful
Exercise 2
● Need a failure and success within binomial data
● Check for over dispersion, if there is then instead of using binomial, re run as
quasibinomial
● It is our responsibility to check
● The residual deviance divided by the residual degrees of freedom gives the dispersion parameter
● If overdispersion is present (and the data are not binary) then quasibinomial can be used
● The standard errors increase once overdispersion is taken into consideration; check the lecture slides for the formula used when accounting for dispersion
● Initial model is just one predictor variable
● type = "response" back-transforms predictions from the link (log-odds) scale to the response scale
● If you want to test interaction have to check using the *
● If you remove a term (e.g. the cow variable), the change in deviance is the variation explained by that variable
● If the residual deviance goes up substantially, the dropped term was useful; we don't want to inflate the deviance
● If the deviance increases, the model is not performing as well
● Can't drop a term if the residual deviance goes up significantly
● When the interaction is not significant it can be dropped
● Check the AIC as well
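A sketch of the dispersion check and quasibinomial refit in R, using simulated proportion data:

```r
set.seed(1)
n <- 40
d <- data.frame(x = runif(n), total = 20)
d$success <- rbinom(n, d$total, plogis(-1 + 2 * d$x))
d$failure <- d$total - d$success

fit <- glm(cbind(success, failure) ~ x, data = d, family = binomial)

# Rough dispersion estimate: residual deviance / residual df (should be near 1).
deviance(fit) / df.residual(fit)

# If clearly greater than 1, refit as quasibinomial: the estimates stay the same
# but standard errors are scaled by the square root of the dispersion parameter.
qfit <- glm(cbind(success, failure) ~ x, data = d, family = quasibinomial)
summary(qfit)
```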
Week 5 Lecture
● You can use proportions, but you need to give extra information: the total number of trials, i.e. what proportion was successful and what wasn't
○ This is more or less the same as binary data
● Counts can't be negative and must be whole numbers
● When you have counts you can use Poisson, e.g. number of plant species or number of bushfires; you can't have 0.1 (counts not expressed as proportions)
Recall:
Generalised linear models are useful when the y (response) variable is:
● Binary
○ Use logistic regression
● Counts expressed as proportions (e.g. indexes)
○ Use logistic regression
● Counts not expressed as proportions
○ Use log-linear model (family = "poisson")
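A log-linear (Poisson) sketch in R with simulated count data:

```r
set.seed(1)
d <- data.frame(x = runif(50, 0, 2))
d$count <- rpois(50, lambda = exp(0.5 + d$x))  # counts: whole numbers, never negative

fit <- glm(count ~ x, data = d, family = poisson)
summary(fit)    # coefficients are on the log scale
exp(coef(fit))  # back-transformed: multiplicative effect on the expected count
```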
Deviance:
● The discrepancy between fitted model and the observed data
● Equivalent to Residual SS of ordinary linear model
Over-dispersion:
● Indicates the variance is larger than the model assumes (for Poisson, larger than the mean)
● Tends to occur if
○ One or more important predictors are not included in the model OR
○ The underlying distribution is non-binomial or non-Poisson
Quasi family:
● When the precise form of distribution is not known
● Can deal with overdispersion
● Standard errors of the estimated parameters are multiplied by square root of the
dispersion parameter
Anova:
● If a quasi family is used because of a dispersion issue, use the F test in anova instead of the chi-squared test
Week 5 Tutorial
Curved relationships:
● If the x and y relationship is not straight you can think about using a generalised additive model
○ The smooth is fitted over local windows of the data
○ If you change the number of knots this will change the number of windows, i.e. how flexible the smooth is
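A GAM sketch assuming the mgcv package is available; the k argument sets the number of knots (basis functions), i.e. how many local windows the smooth can bend across:

```r
library(mgcv)

set.seed(1)
d <- data.frame(x = runif(200))
d$y <- sin(2 * pi * d$x) + rnorm(200, sd = 0.3)  # a clearly curved relationship

fit_k5  <- gam(y ~ s(x, k = 5),  data = d)  # few knots: smoother fit
fit_k20 <- gam(y ~ s(x, k = 20), data = d)  # more knots: more flexible fit
plot(fit_k5)
```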
Coding Example:
Fruit flies:
● Gaussian distribution → normal
● Continue from 55:36 in the recording
Week 7 Lecture
● Important to block: you can see that a lot of variation is explained by the Dog term, from its Sum Sq relative to the total Sum Sq
● You can start thinking of clever ways to think of how to explain variation
● Random effect: explaining unknown variation between the dogs, you can assume
different structures
When you have a balanced design, ANOVA and REML estimates will be identical, but with an unbalanced design they will not be
Correlation:
● Look at the intra-class correlation by doing this (how correlated observations within the same group are)
Limitations:
● We assumed that the random errors of this model are all independent (i.e.
uncorrelated).
● However, this may be untrue given the repeated measures nature of the data.
Fixed effects:
● We know when we will take the measurements and what treatments we will apply
● The treatment–hour interaction is fixed: some treatments cause the response to drop off more quickly
● Picking the type of animal, which replicates, and the final interaction
LMM:
● The 1| in the formula is the pipe notation for a random intercept
● Interpretation
○ Check the two different models, checking the interaction first
● Correlation matrix
○ A correlation of 0 means not much association
○ Higher values just mean there is more correlation
○ See 1:10 in the recording for the correlation material
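A linear mixed model sketch assuming the lme4 package is available, with simulated data; (1 | Block) is the pipe notation for a random intercept per block:

```r
library(lme4)

set.seed(1)
d <- expand.grid(Block = factor(1:6),
                 Trt = factor(c("ctrl", "A", "B")),
                 rep = 1:4)
d$y <- rnorm(nrow(d)) + 0.5 * as.numeric(d$Block)  # block-to-block variation

fit <- lmer(y ~ Trt + (1 | Block), data = d)
summary(fit)   # fixed effects plus variance components
VarCorr(fit)   # variance of the random Block intercepts
```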
Assumptions:
● Variance of the residuals are stable
● Residuals are normally distributed
Code:
● Relevel sets control as the reference level
● The spikes at the left and the right in the qqplot just means a bit of kurtosis
● Emmip is extra code to see the interaction
● An unbalanced design will affect the anova results
● The "marginal" option tests both ways and gives the lower of the two
● Use the p-value adjustment
● 1:40 in the recording goes through the conclusion; in the exercise use Tukey's adjustment, as recommended
Tutorial
● If species is significant it means there is a significant difference between the species, and it therefore needs to be considered
● We need to test whether the random effect explains a significant amount of variation in the model
● VarCorr: if it is above 0 then it means there is some variation there
● Use anova, summary and VarCorr to explore the model and interpret the output
● The Wald test compares each level to the reference level to see which has more deviation (fixed effects), but this can be explored further through a post hoc test
● 1d
○ Generally don’t look at this but can be useful
○ If the confidence doesn’t include zero it indicates significant variation,
approximate method for standard deviation
○ Most articles never consider it
● Instead of removing the missing data ourselves, we should add na.action = na.omit so the model is fitted without the missing rows
Week 8 Lecture
Calves example:
● Need one time column coded as numeric and one coded as a factor
● If you are doing a t test or an anova, you are ignoring the correlation issue
● Option 1: looping through time points and performing t test on each time point
● Option 2: analyse as a split-plot experiment, with time as the sub-plot factor
● Option 3: REML Estimation of split plot
○ Define correlation model
● Option 4: Differing variance assumption
○ Add in the corr and the weights
○ Check whether we should have done it with an anova comparison; it is worth it if the p-value is less than 0.05
● Option 5: a model where the correlation between 2 observations depends on how far apart they are
○ Less correlation the further apart they are
● For the post hoc tests in the last two parts, use the p-value adjustment
Week 8 Tutorial
● In the lecture we had corAR1, but here we need corCAR1 (continuous-time AR(1)) because the time intervals are not uniform
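A sketch assuming the nlme package is available, with simulated unevenly spaced repeated measures; corCAR1 is the continuous-time analogue of corAR1:

```r
library(nlme)

set.seed(1)
d <- expand.grid(unit = factor(1:8), time = c(0, 1, 3, 7))  # uneven intervals
d$y <- 0.2 * d$time + rnorm(nrow(d))

# corCAR1(form = ~ time | unit) lets the correlation decay with the actual
# time gap between observations on the same unit.
fit <- lme(y ~ time, random = ~ 1 | unit,
           correlation = corCAR1(form = ~ time | unit), data = d)
summary(fit)
```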
Assignment 3:
In conducting the experiment, each experimental unit (box of inoculated cucumbers) was sampled at equally spaced times representing 8-hour time blocks. Hence, this is a repeated measures analysis, as the response to each treatment is compared over time. The identified problem in this situation is that the data are no longer independent, as each sample will be affected by the previous sample; for example, the second time block will be affected by the first time block. Therefore, in order to relax independence, a linear mixed model using restricted maximum likelihood (REML) was utilised. REML was chosen over ANOVA. Upon initial analysis of the data, it became clear that this was a balanced design, with the strains being completely repeated for each of the treatments (i.e. 100% allocated); however, this included missing values. After removing the null values, this became an unbalanced design, and REML performs better in unbalanced design situations.
The outcome variable is the log10 transform of the Colony Forming Units (Log10CFU) of Listeria. The fixed effects of this experiment are the transport parameters, including time (Time_stage), relative humidity (RH) and temperature (Temp_C), as these are the variables that are held constant across the experimental units. In contrast, the pathogen strains, Listeria monocytogenes FPS_0004 (serotype 3) and Listeria monocytogenes FPS_0007 (serotype 1/2b), are random effects, as the experimenter is unable to control the differences between two samples within the same strain. Additionally, ID was treated as a random effect, as there are potential differences in the way that various units react in different situations.
The gls function was used because, in analysing the compound symmetry model, the residuals vs fitted plot demonstrates potential fanning of the residuals over time; therefore, changing variance and correlation structures were tested. "GLS extends ordinary least-squares (OLS) estimation of the normal linear model by providing for possibly unequal error variances and for correlations between different errors" (Fox & Weisberg, 2018). glmer was not used, as the outcome variable was converted into a form that can be used as a continuous rather than a count variable.
Using the likelihood ratio test to compare the constant variance model against the changing variance model, both with compound symmetry, a p-value of 0.7592 was obtained. Since p > 0.05, allowing changing variance does not provide a significant improvement, and therefore constant variance was used. Next the constant variance model was tested against the combined model (changing variance and changing correlation). The AICs were compared, as both models had the same number of parameters. The combined model had an AIC of 337.55 (2 d.p.) compared to 341.40 (2 d.p.) for constant variance. Since the combined model's AIC is lower by more than 2, the combined model is preferred. Next the combined model was tested against the changing correlation model, again using AIC due to the same number of parameters. The AICs were the same at 337.55 (2 d.p.), so by the principle of parsimony the simpler model, using only the changing correlation, was selected.
Currently, the two terms with the highest p-values are the 3-way interaction "Time:Temp_C:RH" (0.2822) and the 2-way interaction "Temp_C:RH" (0.7244). The backwards AIC process will be performed, dropping the 3-way interaction first, as the 2-way interaction can't be dropped before the 3-way interaction. Despite the AIC dropping from 337.55 (2 d.p.) to 333.49 (2 d.p.), the log-likelihood ratio test gives a p-value of 0.1013 > 0.05, so there is not a significant difference between the models. Therefore, the 3-way interaction is kept in this model and the 2-way interaction does not require testing.
Question 2:
Code:
qqnorm(resid(final_model))
qqline(resid(final_model))
Using the Normal QQ-Plot, the normality assumption is seen to be satisfied as the values do not
significantly deviate from the normality line. Additionally, there is minimal fanning at the extreme
values.
Code:
Although the residuals vs fitted plot demonstrates some variation in deviation from the 0 line, this was accounted for using the gls function and the final model provided above. A test was conducted using the lme function to see whether the homoscedasticity assumption is violated, and the lme function produced a model with a higher AIC of 339.1569. Therefore, the homoscedasticity assumption is not violated enough to warrant a switch to lme.
Output
Question 3:
● Pathogen: can't control the difference between two pathogen samples even if they are of the same strain
● Through analysing the data in Excel, it became clear that this was a balanced design, with the strains being completely repeated for each of the treatments (i.e. 100% allocated); however, this included missing values. After removing the null values, this became an unbalanced design
● ID is treated as a random effect as there are uncontrolled differences between units
● Does the random effect explain a significant amount of variation?
Week 10 Lecture
Machine learning
● Construct algorithms that can learn from data
● Learn from data without being explicitly programmed
Neural network
● Take the inputs and multiply them by weights (like linear regression coefficients), sum them and add a bias, then use the activation function to transform the result from linear to non-linear
Week 11 Lecture
Prediction quality:
● RMSE: the square root of the mean squared residuals, measuring the spread of the errors
● Mean error (bias): the mean difference between measured and predicted, checking for under- or over-prediction
● Coefficient of determination R^2: one minus the variance of the error over the variance of the data. Ideally the variance of the error is small compared to the data, so R^2 is closer to 1
● Lin's concordance correlation coefficient: evaluates agreement between pairs of observations by measuring deviation from the 45-degree line
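The four measures can be computed directly in base R; observed and predicted values are simulated here for illustration:

```r
set.seed(1)
obs  <- rnorm(100, mean = 10)
pred <- obs + rnorm(100, sd = 0.5)  # a stand-in for model predictions

rmse <- sqrt(mean((obs - pred)^2))      # spread of the errors
bias <- mean(obs - pred)                # under- or over-prediction
r2   <- 1 - var(obs - pred) / var(obs)  # close to 1 when error variance is small

# Lin's concordance: agreement with the 45-degree line.
ccc <- 2 * cov(obs, pred) /
  (var(obs) + var(pred) + (mean(obs) - mean(pred))^2)

c(rmse = rmse, bias = bias, r2 = r2, ccc = ccc)
```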
Model validation:
● Data split
○ Hold out part of the data for validation and fit the model on the rest
● K-fold cross validation:
○ Uses all of the data by splitting it into folds that each take a turn as the validation set, whereas a data split uses a single partition
○ Leave-one-out cross-validation: each observation is its own fold
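A hand-rolled k-fold cross-validation sketch in base R on the built-in cars data:

```r
set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(cars)))  # random fold assignment

rmse <- numeric(k)
for (i in 1:k) {
  train <- cars[folds != i, ]  # fit on k-1 folds
  test  <- cars[folds == i, ]  # validate on the held-out fold
  fit <- lm(dist ~ speed, data = train)
  rmse[i] <- sqrt(mean((test$dist - predict(fit, test))^2))
}
mean(rmse)  # average out-of-fold error
```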
Tree models:
● Generates a set of rules which represent the model and displayed in a binary tree
Data ensemble:
● If one model does not perform well on its own, joining it with different types of models can produce a better combined model
Bootstrap:
● Generate multiple models by resampling the data (with replacement)
Random forest:
● Grow lots of trees and average predictions to reduce the noise
○ Use bootstrap subsampled version of the data
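A random forest sketch assuming the randomForest package is available, on the built-in mtcars data:

```r
library(randomForest)

set.seed(1)
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500, importance = TRUE)

fit               # out-of-bag (bootstrap) error summary
importance(fit)   # %IncMSE: rise in error when a variable is permuted
varImpPlot(fit)
partialPlot(fit, mtcars, "wt")  # partial plot: effect of one variable on the prediction
```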
Week 11 Tutorial
● Permuting pH increases the MSE by 15%, meaning pH is important in decreasing the error
● On the RHS: how important each variable is in explaining the data; elevation is important
Partial plot:
● Showing how each variable affects the prediction
Week 13 Lecture
Module 1:
● If it was just an RCBD then you wouldn't have the blocking structure Block.T + Block.T.F, as it is random
● For the pairwise comparisons you can do plot(the emmeans object you created, comparison = TRUE)
comparison = T)
Module 2:
● cbind(success, failure)
● In anova(), the test argument changes from "Chisq" to "F" if the family is quasibinomial
● 1) ii) testing that the coefficient is equal to 0 (odds ratio of 1) by comparing the model to a null model
● exp(coef(glm))
○ Gives the odds: if the odds ratio is above 1 then success is more likely; if it is below 1 then failure is more likely
○ 1 divided by the odds ratio gives the odds in the other direction
● Do not need to check overdispersion if the data are binary 1s and 0s
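A sketch of back-transforming logistic-regression coefficients to odds, with simulated binary data:

```r
set.seed(1)
d <- data.frame(x = rnorm(200))
d$y <- rbinom(200, 1, plogis(0.4 + 0.8 * d$x))

fit <- glm(y ~ x, data = d, family = binomial)

exp(coef(fit))      # odds ratios: > 1 means success becomes more likely as x rises
1 / exp(coef(fit))  # the reciprocal gives the odds in the other direction
```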
Module 3:
● If the response is quantitative (continuous) you don't need a GLM
● A clustering effect suggests a random effect
● Log-transform the data if it is not normal