
ENVX3002 - Lecture notes

Statistics in the Natural Sciences (University of Sydney)


ENVX3002

● Final Exam: Record +, 4 Reports

Week 1

● Simple linear regression vs multiple linear regression – interpretation of regression coefficients
● Assumptions
● Model selection

Why:
● Describe relationship
● Explain variation
● Predict new values

Simple Linear Regression:


● The coefficient estimates are chosen to minimise the residual sum of squares – this is called the least squares method

Anova:
● Regression SS: variation explained by the linear model (regression)
● Residual SS: variation not explained by the regression
● p is the number of observations

Multiple Linear Regression:


● Using more than one predictor variable to explain something, extending the model
● Can use a p-value approach: a model or term is significant if p < 0.05

Anova vs Drop 1:
● The anova() function provides a sequential set of tests for each variable
○ You can set up two models, state a null hypothesis and compare them to see which model does the better job of explaining the variation
● The drop1() function gives tests adjusted for all other terms in the model, so these are to be preferred
● anova() can only compare two models at a time, whereas drop1() tests every term against the rest of the model (see the sketch below)
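A minimal sketch of the two approaches, assuming a hypothetical data frame dat with response y and predictors x1 and x2:

fit <- lm(y ~ x1 + x2, data = dat)

anova(fit)              # sequential tests: each term tested after the ones listed before it
drop1(fit, test = "F")  # each term tested adjusted for all other terms in the model

# Comparing two explicitly specified nested models
fit0 <- lm(y ~ x1, data = dat)
anova(fit0, fit)        # does adding x2 significantly reduce the residual SS?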

Assumptions:
● Data normally distributed

Downloaded by Douglas Vanbeek (coleacoke@gmail.com)


lOMoARcPSD|5570213

● Variance is constant: fanning of the residuals is not good
● The linear model is correct (the relationship really is linear)
● Useful to check the residuals (see the diagnostic-plot sketch below)
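A minimal sketch of the standard residual checks, using the hypothetical model fit from the sketch above:

par(mfrow = c(2, 2))
plot(fit)   # residuals vs fitted (linearity, constant variance), normal Q-Q plot,
            # scale-location, and residuals vs leverage
par(mfrow = c(1, 1))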

Leverage:
● A measure of how extreme an observation's predictor values are, and hence how much potential influence it has on the fitted model

Principle of Parsimony:
● Fewer variables are better – the simplest adequate model is preferred

Variance-bias tradeoff:
● Overfitting: lots of variables, tracking the noise in the data rather than the mechanism
behind it
○ High variance, low bias
● Underfitting: low variance, high bias
● Good balance: low bias and low variance

Partial F test:
● When comparing nested models you can use anova(); if the p-value of the F test is greater than 0.05, the extra terms are not significant and are not better at explaining the variation

Automated methods:
● Forward selection
● Backward elimination
● Problem: different combinations of variables may remove important information, because the variables can affect each other

Akaike Information Criteria (AIC)


● Smaller AIC is a better model


● Penalises the number of parameters, so the principle of parsimony is captured (see the step() sketch below)
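A hedged sketch of AIC-based automated selection with R's step() function, again using the hypothetical data frame dat plus an extra hypothetical predictor x3:

full <- lm(y ~ x1 + x2 + x3, data = dat)   # hypothetical full model
AIC(full)                                  # smaller AIC = better fit/complexity trade-off

step(full, direction = "backward")         # backward elimination by AIC

null <- lm(y ~ 1, data = dat)              # forward selection from the null model
step(null, scope = formula(full), direction = "forward")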


Problems
● Every test has a 5% chance of Type 1 Error (probability of falsely rejecting null)
● Still not clear which is best method, depends on hypothesis

Lab 1
Prediction vs Confidence:
● The prediction interval is wider because it includes the uncertainty of a single new observation on top of the model uncertainty
● The confidence interval reflects only the uncertainty in the estimated mean response of the model we just fitted, so it is narrower (see the sketch below)
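A minimal sketch of the two intervals with predict(), using the hypothetical fit from the earlier sketch and a made-up new observation:

new <- data.frame(x1 = 10, x2 = 5)
predict(fit, newdata = new, interval = "confidence")   # uncertainty in the mean response
predict(fit, newdata = new, interval = "prediction")   # also includes the variability of a
                                                       # single new observation, so it is wider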

Difference between anova and drop1:
● drop1(): takes out one variable at a time to see whether it is helpful in describing the data
○ Checking which explanatory variables are useful to keep
● anova(): works sequentially the whole way down the list of terms to check them

anova(model one, model two)
● If the RSS (residual sum of squares) is higher, it is not as good a model

AIC exercises, rules of thumb:
● Want the lowest AIC
● Fewer variables are better, assuming the explained variance is the same
● The model needs to explain a lot of variation, but if the data change slightly the model shouldn't change – don't overfit

Week 2

Lecture 2: ANOVA and its extensions

● Introducing experimental structures, blocking structures, plot designs

Experimental unit: the smallest unit to which a treatment is applied
● This is the level of replication
Sampling unit: the unit at which the observations are made

ANOVA (Analysis of Variance):


● Tests for differences in means between groups by analysing variance
● Assumption
○ Data is normal
○ Variances are equal between groups
● Treatment: explained variance
● Residual: not explained

Treatment Design:


● Selection of treatments for an experiment: both the factors and the levels of each factor
● Designs with more than one factor are termed factorial treatment designs

Experiment Design:
● How we allocate treatments to the experimental units
● Completely randomised design:
● Randomised block

Split plot:
● Used when more than one factor affects the analysis, increasing the complexity of the blocking structure
● Blocks enter as an error term, used to account for the differences between the samples
● Hint for assessment: think about the ordering of the whole plots and the sub-plots (see the aov() sketch below)
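A sketch of a split-plot analysis with aov() and an Error() term; the data frame expt and the factors block, wholeplot, A (whole-plot factor) and B (sub-plot factor) are hypothetical:

fit.sp <- aov(y ~ A * B + Error(block/wholeplot), data = expt)
summary(fit.sp)   # A is tested against the whole-plot error stratum;
                  # B and A:B are tested against the sub-plot (residual) stratum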

Repeated measure design:


● Treatments applied sequentially or repeatedly at the scale of the entire block
○ Measures of the same variable
● How many times do you resurvey the block?
○ Six times in the lecture example, because there were six

Extra assumptions for repeated measure:


● Sphericity:
○ Need equal variance of the differences between each pair of within-block treatments
● No block-by-treatment interaction

Definitions:
● Block
● Treatment

Week 3


● ANOVA: categorical predictors
○ Used to compare the means of different treatments
● Regression: quantitative predictors
○ Models relationships between predictor and response variables
● Both can now be considered together as the general linear model

Cases where this is important:


● Mixture of categorical and numerical predictor variables
● In designed experiments we may have unbalance designs by choice or mishap
○ Regression takes into account this

The models
● ANOVA
○ Predictor is categorical
○ Observations = mean + treatment (categorical) + error
● Regression
○ Predictor is continuous
○ Y values = intercept + slope (continuous predictor) + error


Anova tables

Assumptions:
● Normality
● Constant variance
● Independent and randomly collected

Indicator / dummy variables:
● Give categorical data a numerical value with an indicator code
● For each treatment, create an indicator variable
● In the ANOVA model each treatment effect is multiplied by its indicator variable, which makes the ANOVA model equal to the regression model (see the sketch below)
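A sketch of this equivalence, assuming a hypothetical data frame dat with response y and a factor treatment:

X <- model.matrix(~ treatment, data = dat)   # R builds the indicator (dummy) variables;
head(X)                                      # the first level becomes the reference

fit.lm  <- lm(y ~ treatment, data = dat)     # regression with dummy variables
fit.aov <- aov(y ~ treatment, data = dat)    # classical ANOVA
anova(fit.lm)                                # identical F test and p-value
summary(fit.aov)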

Notes:


Assignment:

Look through the AIC and drop function

Building a model to predict species abundance and how it varies depending on the treatment used.
● The error term is used to account for differences between the samples – consider which error term is actually relevant here

Question 1:
● Justification includes whether you should use the interaction
● Should you include blocking?
● Whether to use ANOVA or a linear model
● Do we need design.split?

Randomised block design

Question 2:


● Linearity, homoscedasticity,

Question 3: Interpretation of results


● What results are here

Question 4:
● Use emmeans to see which pairs differ (pairwise comparisons), and what more can be said

Question 5:
● What are the recommendations based on
● Rehabilitate, but for biological diversity, optimal place for species 2

Questions:
● Use design.split, independence, linearity, 3 x 2, what did you put in the output,
● Remove emmip
● How to get 4 marks in last part, what does the anova actually show

Week 4 Lecture

General linear model:


● Determining the relationship between the explanatory and response variable

Assumptions:
● Independence
● Errors are normally distributed with mean 0 and common variance
● Constant variance
● Homoscedasticity / Homogeneity

Data with non-normal distributions:


● Binary response
● Count data
● Proportions

Generalised linear model


● An extension of general (normal) regression that incorporates both normal and non-normal distributions
● A GLM consists of 3 components
○ Random component: the response variable and its probability distribution (error structure)
○ Systematic component: the predictor variables in the model
○ Link function: links the response to the predictors

GLM link functions:



● The link function allows a straight-line (linear) predictor to be fitted, e.g. through the logit link in logistic regression

Maximum likelihood:
● Choose the parameter values that maximise the probability (likelihood) of the observed outcome
● Normal distribution: the most likely value is the mean
● Binomial distribution:

GLM model fitting:


● Deviance: G^2 = 2 × (log-likelihood of the saturated model − log-likelihood of the fitted model), i.e. it compares the fitted model with the saturated model

Anova in regression:
● If the fitted (least squares) line is close to the observed data, the relationship is strong and there is a significant relationship

Analysis of binary data

● With a binary response, make sure to use the binomial family with a logit link (see the sketch below)
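A minimal logistic-regression sketch, assuming a hypothetical data frame dat2 with a 0/1 response present and a predictor x:

fit.bin <- glm(present ~ x, family = binomial(link = "logit"), data = dat2)
summary(fit.bin)                      # coefficients are on the log-odds (logit) scale
exp(coef(fit.bin))                    # odds ratios
predict(fit.bin, type = "response")   # fitted values back-transformed to probabilities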

GLM model fitting:


● Overdispersion: when the variance is larger than expected (e.g. larger than the mean for Poisson-type data)
○ When the ratio of residual deviance to residual degrees of freedom is > 1, there is overdispersion
● Deviance residual: measures the contribution of an individual observation to the deviance
● Pearson residual: the residual of an individual observation divided by the square root of the variance


● Akaike Information Criterion (AIC): adjusts the deviance of a given model for the number of predictor variables
○ Lower is better

● Need to consider dispersion?

Assignment
● Question 1:
○ Plot
● Question 2:
○ Test assumptions
○ Then do a model
● Question 3:
○ Test dispersion and AIC and then use a new model
● Question 4:
○ Prediction plot using code
● Question 5:
○ Interpretation of prediction plot

Week 4 Tutorial

● Look at the residual deviance and degrees of freedom to identify whether there is an overdispersion issue or not
● Check at the start for overdispersion and whether it is significant
○ With binary (0/1) data, even when there is overdispersion you can't use the quasi correction
● Look at the lecture code as well – it helps
● When you have binary data don't use quasibinomial, but when you have proportion data you can use quasibinomial
● Have to check the reduced model against the full model, to see whether the extra terms actually improve the model

Exercise 2
● Need a failure and a success count within binomial (proportion) data
● Check for overdispersion; if it is present then, instead of using binomial, re-run the model as quasibinomial
● It is our responsibility to check
● You can use this to get the dispersion parameter
● If overdispersion is present and you are not dealing with strictly binary data, quasibinomial can be used
● The standard errors increase once overdispersion is taken into consideration; check the lecture slides for the formula that takes the dispersion into account
● The initial model has just one predictor variable
● type = "response" in predict() back-transforms the fitted values from the link (log-odds) scale to the response scale
● If you want to test an interaction, include it with * in the model formula


● If you remove a term (e.g. the cow variable), the change in deviance is the variation explained by that variable
● If the residual deviance goes up when a term is dropped, the term was useful – we don't want to inflate the residual deviance
● If the deviance increases, the model is not performing as well
● Can't drop a term if the residual deviance goes up
● When an interaction is not significant it can be dropped
● Check the AIC as well

Week 5 Lecture

● You can use proportions, but you need to give extra information – the total number – so the model knows what proportion was successful and what wasn't
○ This is more or less the same as analysing binary data
● Counts can't be fractional (and can't be negative)
● When you have counts you can use Poisson, e.g. the number of plant species or the number of bushfires; you can't have 0.1 of a count and counts are not expressed as proportions

Generalised linear models (Analysis of Deviance):


● Used when the data are not normally distributed and the variance is not constant

Recall:

Generalised linear models are useful when the y (response) variable is:
● Binary
○ Use logistic regression
● Counts expressed as proportions (e.g. indexes)
○ Use logistic regression
● Counts not expressed as proportions
○ Use a log-linear model (family = poisson)

Deviance:
● The discrepancy between fitted model and the observed data
● Equivalent to Residual SS of ordinary linear model

Over-dispersion:
● Indicates variance is larger than the mean


● Tends to occur if
○ One or more important predictors are not included in the model OR
○ The underlying distribution is non-binomial or non-Poisson

Quasi family:
● When the precise form of distribution is not known
● Can deal with overdispersion
● Standard errors of the estimated parameters are multiplied by square root of the
dispersion parameter
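A sketch for proportion (success/failure) data with overdispersion handled by the quasi family; the data frame seeds and its columns germinated, failed and dose are hypothetical:

fit.b <- glm(cbind(germinated, failed) ~ dose, family = binomial, data = seeds)
deviance(fit.b) / df.residual(fit.b)   # rough check: near 1 is fine; clearly > 1 suggests overdispersion

# Refit with the quasi family: estimates are unchanged, but standard errors are
# multiplied by the square root of the dispersion parameter
fit.qb <- glm(cbind(germinated, failed) ~ dose, family = quasibinomial, data = seeds)
summary(fit.qb)
anova(fit.qb, test = "F")              # use the F test when the dispersion is estimated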

GLM with count data:


Linear regression is NOT APPROPRIATE for count data
● The variance of the response variable is most likely to increase with mean
● The errors are not normally distributed
● Prediction of negative counts
● Zeros are difficult to handle in transformation
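A log-linear (Poisson) sketch for count data, assuming a hypothetical data frame counts with a count response n.species and a predictor habitat:

fit.p <- glm(n.species ~ habitat, family = poisson(link = "log"), data = counts)
summary(fit.p)
deviance(fit.p) / df.residual(fit.p)   # overdispersion check; if clearly > 1, use quasipoisson

fit.qp <- glm(n.species ~ habitat, family = quasipoisson, data = counts)
anova(fit.qp, test = "F")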

Anova:
● If a quasi family is used to handle a dispersion issue, use the F test (rather than the chi-squared test) in the analysis of deviance

Week 5 Tutorial

● Specify the family of distribution
● Check whether there is overdispersion, and whether the deviance and other parameters are acceptable
● Compare models using AIC or anova()
● Can use drop1() when there are multiple variables



Week 6 Lecture Splines and generalised additive models

Curved relationships:
● If the relationship between x and y is not a straight line, you can think about using a generalised additive model

Smoothers & Splines:


● Smoothers: move a window through the data and fit a local regression within it, which builds up a smoother line
○ LOESS (local regression smoother)
○ LOWESS (weighted local regression smoother)
■ Down-weights those data points that do not follow the local trend of the data we need
● Splines:
○ Piecewise polynomial curves – choosing the right polynomial within each window (between knots)
○ Splines can give you all the usual statistics, whereas smoothers are more for exploring patterns
○ Can use a GAM to fit smoothed lines
○ If you change the number of knots this changes the number of windows

Optimal number of knots:


● Generalised cross-validation (GCV, as used by the mgcv package) finds the optimal amount of smoothing
● Need to account for the roughness penalty, which offsets over-fitting so that the smooth function s(⋅) does not track exactly through the data points and give an overall 'wiggly' appearance

Coding Example:

Fruit flies:
● Gaussian distribution → normal
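A sketch of this kind of GAM fit with the mgcv package; the data frame flies and its columns longevity and thorax are assumptions based on the fruit-fly example:

library(mgcv)
fit.gam <- gam(longevity ~ s(thorax), family = gaussian, data = flies)   # smoothness chosen by GCV
summary(fit.gam)     # approximate significance of the smooth term
plot(fit.gam)        # fitted smooth with confidence band

fit.gam2 <- gam(longevity ~ s(thorax, k = 5), data = flies)   # k sets the basis dimension
                                                              # (roughly the maximum number of knots)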

(Continue from 55:36 in the lecture recording.)


Week 7 Lecture

● Important to block: you can see that a lot of the variation is explained by the Dog term (its Sum Sq relative to the total Sum Sq)

● You can start thinking of clever ways to think of how to explain variation

● Random effect: explains otherwise unknown variation between the dogs; you can assume different covariance structures


When you have a balanced design, ANOVA and REML will give identical results, but with an unbalanced design they will not

Linear Mixed Models (REML):


● Allows for the relaxation of the independence assumption
○ Important for taking observations over time
● Useful for unbalanced experimental designs
● Contains fixed and random effects
○ Random effects are assumed to have effects that are normally distributed

library(nlme)   # lme() and VarCorr() are in the nlme package

dog.lme <- lme(PEP ~ Anaesthetic, random = ~ 1 | Dog, data = dogs)

VarCorr(dog.lme)   # variance components: between-dog vs residual variation (correlations between subsets of samples)

Residual Maximum Likelihood:


● A model fitting procedure that accounts for correlations amongst the data
○ May result from clustering or repeated sampling
● Allows complex models to be fitted

Correlation:
● Look at the intra-class correlation (within-group vs between-group variation) this way

Limitations:
● We assumed that the random errors of this model are all independent (i.e.
uncorrelated).
● However, this may be untrue given the repeated measures nature of the data.

Fixed effects:
● We know when we will take the measurements and what treatments we will apply
● The treatment–hour interaction is fixed; some treatments cause the response to drop off more quickly
● Picking the type of animal, which replicates are used, and the final interaction

LMM:
● The | in random = ~ 1 | group is the 'pipe' (grouping) symbol; ~ 1 | group fits a random intercept for each group


● Interpretation
○ Check the two different models, checking the interaction first
● Correlation matrix
○ A correlation near 0 means little association; larger values mean more correlation
○ 1:10 to look at correlation stuff

Assumptions:
● Variance of the residuals is stable (constant)
● Residuals are normally distributed

Code:
● relevel() sets control as the reference level
● The spikes at the left and the right of the QQ plot just indicate a bit of kurtosis
● emmip() is extra code to visualise the interaction
● An unbalanced design will have an effect on the ANOVA results
● The "marginal" option tests both ways and gives the lower of the two
● Use a p-value adjustment
● The conclusion is covered around 1:40; in the exercise use Tukey's adjustment – it is recommended (see the sketch below)
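A sketch of Tukey-adjusted pairwise comparisons with the emmeans package, applied to the dog.lme model defined earlier:

library(emmeans)
emm <- emmeans(dog.lme, ~ Anaesthetic)   # estimated marginal means for each anaesthetic
pairs(emm, adjust = "tukey")             # pairwise comparisons with Tukey adjustment
plot(emm, comparisons = TRUE)            # comparison arrows: no overlap => significant difference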
Tutorial
● If the species is significant it means there is a significant difference between the species
and therefore, needs to be considered
● We need to test whether the random model explain a significant amount of variation in
the model
● VarCorr: if it is above 0 then there is some variation there
● Using anova, summary and VarCorr to explore the model and interpret output
● Wald tests compare each level to the reference level to see which one deviates more (fixed effects), but this can be explored further with a post hoc test
● 1d
○ Generally don’t look at this but can be useful
○ If the confidence interval doesn't include zero it indicates significant variation (an approximate method based on the standard deviation)
○ Most articles never consider it
● Instead of removing rows with missing data by hand, add na.action = na.omit to the model call

Week 8 Lecture

Repeated data analysis


● Experiments where the response to each treatment is compared over time

Calves example:


● Need to have one time column coded as numeric (for the correlation structure) and one coded as a factor (for the treatment-by-time terms)
● If you are doing a t test or an anova, you are ignoring the correlation issue
● Option 1: looping through time points and performing t test on each time point
● Option 2: analyse as a split-plot experiment (time as the sub-plot factor)


● Option 3: REML Estimation of split plot
○ Define correlation model
● Option 4: Differing variance assumption
○ Add in the corr and the weights
○ Check whether this was worthwhile by comparing the models with anova(); it is worth it if the p-value is less than 0.05
● Option 5: a model where the correlation between two observations depends on how far apart they are in time
○ The further apart, the lower the correlation
● Post hoc tests (the last two parts) should use a p-value adjustment

Week 8 Tutorial
● In the lecture we have corAR1 but in this one we need corCAR1 because the time
intervals are not uniform
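A sketch of the continuous-time AR(1) structure for unevenly spaced measurement times; the names (y, treat, time, subject, data frame rm.dat) are hypothetical:

library(nlme)
# corCAR1 lets the correlation decay with the actual time gap, so the
# measurement times do not have to be evenly spaced
fit.car1 <- lme(y ~ treat * factor(time),
                random = ~ 1 | subject,
                correlation = corCAR1(form = ~ time | subject),
                data = rm.dat, na.action = na.omit)
anova(fit.car1)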

Assignment 3:

In conducting the experiment, each experimental unit (a box of inoculated cucumbers) was sampled at equally spaced times representing 8-hour time blocks. Hence this is a repeated measures analysis, as the response to each treatment is compared over time. The problem in this situation is that the data are no longer independent: each sample is affected by the previous sample (for example, the second time block is affected by the first). Therefore, in order to relax the independence assumption, a linear mixed model fitted by restricted maximum likelihood (REML) was used. REML was chosen over ANOVA. Upon initial analysis of the data, it became clear that this was a balanced design, with the strains completely replicated for each of the treatments (i.e. 100% allocated); however, the data included missing values. After removing the null values, the design became unbalanced, and REML performs better in unbalanced design situations.

The outcome variable is the log10 transform of the Colony Forming Units (Log10CFU) of Listeria. The fixed effects of this experiment are the transport parameters, including time (Time_stage), relative humidity (RH) and temperature (Temp_C), as these are the variables applied consistently across the experimental units. The pathogen strains, Listeria monocytogenes (FPS_0004 serotype 3) and Listeria monocytogenes (FPS_0007 serotype 1/2b), are treated as random effects, as the experimenter is unable to control the differences between two samples within the same strain. Additionally, ID was treated as a random effect as there are potential differences in the way that individual units react in different situations.

The gls function was used because, in analysing the compound symmetry model, the residuals vs fitted plot showed potential fanning of the residuals over time; therefore changing variance and correlation structures were tested. "Gls extends ordinary least-squares (OLS) estimation of the normal linear model by providing for possibly unequal error variances and for correlations between different errors." (Fox & Weisberg, 2018) glmer was not used because the outcome variable was transformed into a form that is treated as continuous rather than as a count.

Using the likelihood ratio test to compare the constant-variance model with the changing-variance model (both with compound symmetry), a p-value of 0.7592 was obtained. Since p > 0.05, allowing changing variance does not provide a significant improvement, so constant variance was retained. Next, the constant-variance model was tested against the combined model (changing variance and changing correlation). The AICs were compared because both models had the same number of parameters. The combined model had an AIC of 337.55 (2 d.p.) compared with 341.40 (2 d.p.) for the constant-variance model. Since the combined model's AIC is lower by more than 2, the combined model is preferred. Next, the combined model was tested against the changing-correlation model, also using AIC because the models have the same number of parameters. The AICs were the same at 337.55 (2 d.p.). By the principle of parsimony, the simpler model, using only the changing correlation, was selected.

Currently, the two terms with the highest p-values are the 3-way interaction Time:Temp_C:RH (0.2822) and the 2-way interaction Temp_C:RH (0.7244). The backwards AIC process will be performed, dropping the 3-way interaction first as the 2-way interaction can't be dropped before the 3-way interaction. Despite the AIC dropping from 337.55 (2 d.p.) to 333.49 (2 d.p.), the log-likelihood ratio test gives a p-value of 0.1013, which is > 0.05, and therefore there is not a significant difference between the models. Therefore, the 3-way interaction is kept in this model and the 2-way interaction does not require testing.

The final model is:

Log10CFU = Intercept + Time + Temp_C + RH + Time:Temp_C + Time:RH + Temp_C:RH + Time:Temp_C:RH (the full expansion of Time * Temp_C * RH)

Code for final model:

library(nlme)

cs1 <- corAR1(form = ~ Timepts | Pathogen/ID)

final_model <- gls(Log10CFU ~ Time * Temp_C * RH,
                   correlation = cs1, data = listeria, na.action = na.omit)

Question 2:


The assumptions tested are normality and homoscedasticity.

Code:

qqnorm(resid(final_model))

qqline(resid(final_model))

Using the Normal QQ-Plot, the normality assumption is seen to be satisfied as the values do not
significantly deviate from the normality line. Additionally, there is minimal fanning at the extreme
values.

Code:

plot(final_model, main = "Residuals vs Fitted")   # for a gls fit this shows standardised residuals vs fitted values

Although the residuals vs fitted plot demonstrates some deviation from the 0 line, this was accounted for using the gls function and the final model provided above. A test was conducted using the lme function to see whether the homoscedasticity assumption is violated, and the lme function gave a model with a higher AIC of 339.16. Therefore, the homoscedasticity assumption is not violated severely enough to warrant a switch to lme.

Output

Question 3:

● Should temperature, RH, pathogen be considered as factors

● Pathogen: can't control the difference between two pathogens even if they are of the same strain
● Through analysing the data in Excel, it became clear that this was a balanced design with the strains completely replicated for each of the treatments (i.e. 100% allocated); however, this included missing values. After removing null values, this became an unbalanced design
● ID is treated as a random effect as there are differences between the experimental units
● Does the random effect explain a significant amount of variation?

● Compare models modelling different variances and covariances

Week 10 Lecture

Machine learning
● Construct algorithms that can learn from data
● Learn from data without being explicitly programmed


Neural network
● Inputs are multiplied by weights (like linear regression coefficients) and summed together with a bias term; an activation function then transforms this linear combination into a non-linear output

Data to assess model quality:


● Divide the dataset into two parts: a prediction (training) set and a validation set
○ The prediction set is used to build the model
○ The validation set is used to test the predictive quality
● Issue: how do we divide the data?

Week 11 Lecture

Prediction quality:
● RMSE: the root mean square of the errors (residuals)
● Mean error (bias): the mean difference between measured and predicted values, checking for under- or over-prediction
● Coefficient of determination R^2: one minus the variance of the error divided by the variance of the data; ideally the error variance is small compared with the data, so R^2 is close to 1
● Lin's concordance correlation coefficient: evaluates agreement between pairs of observations by measuring deviation from the 45-degree (1:1) line (see the sketch below)
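A sketch of these quality measures computed from vectors of observed and predicted values (obs and pred are hypothetical):

rmse <- sqrt(mean((obs - pred)^2))        # root mean square error
bias <- mean(pred - obs)                  # mean error: positive = over-prediction
r2   <- 1 - var(obs - pred) / var(obs)    # error variance relative to data variance

# Lin's concordance correlation coefficient: agreement with the 1:1 (45-degree) line
ccc <- 2 * cov(obs, pred) / (var(obs) + var(pred) + (mean(obs) - mean(pred))^2)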

Training and validation:

Model validation:
● Data split



● K-fold cross-validation:
○ Uses all of the data: it is split into k folds and each fold is held out for validation in turn, whereas a single data split holds out one fixed portion
○ Leave-one-out cross-validation is the special case where each fold contains a single observation
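A minimal k-fold cross-validation sketch in base R, again with a hypothetical data frame dat, response y and predictors x1 and x2:

set.seed(1)
k <- 5
folds <- sample(rep(1:k, length.out = nrow(dat)))   # assign each row to a fold

cv.rmse <- sapply(1:k, function(i) {
  train <- dat[folds != i, ]                        # build the model without fold i
  test  <- dat[folds == i, ]                        # predict the held-out fold
  fit.i <- lm(y ~ x1 + x2, data = train)
  sqrt(mean((test$y - predict(fit.i, newdata = test))^2))
})
mean(cv.rmse)                                       # cross-validated RMSE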

Time series validation:


● Time series are correlated, so a random hold-out for validation can give over-confident estimates of prediction quality
Machine learning models:
● Supervised learning: learning from labelled data (the value of what we want to predict is known)
○ Regression: predicting a continuous variable
○ Classification: predicting a non-continuous (categorical) variable
● Unsupervised learning: learning without labels

Linear discriminant analysis:
● Projects the variables onto a smaller subspace while maintaining the class-discriminatory information (see the sketch after the confusion-matrix notes below)

Goodness of fit for categorical variables:


● Contingency table / confusion matrix
● Overall accuracy can be assessed, as well as the true positive and true negative rates
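A sketch combining linear discriminant analysis with a confusion matrix; the data frame dat3, with a class column group and numeric predictors, is hypothetical:

library(MASS)
fit.lda <- lda(group ~ ., data = dat3)                   # project onto the discriminant axes
pred    <- predict(fit.lda, dat3)$class

conf <- table(observed = dat3$group, predicted = pred)   # confusion matrix
conf
sum(diag(conf)) / sum(conf)                              # overall accuracy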

Tree models:
● Generates a set of rules which represent the model, displayed as a binary tree

Data ensemble:
● If one model does not perform well on its own, combining it with different types of models (an ensemble) will usually give a better model

Bootstrap:
● Generate multiple models from resampled (bootstrap) versions of the data

Random forest:
● Grow lots of trees and average predictions to reduce the noise
○ Uses bootstrap-subsampled versions of the data (see the sketch below)
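A random-forest sketch with the randomForest package, using a hypothetical data frame dat with response y; x1 stands in for one of its predictors:

library(randomForest)
set.seed(1)
rf <- randomForest(y ~ ., data = dat, ntree = 500, importance = TRUE)
rf                        # out-of-bag (bootstrap) error summary
importance(rf)            # %IncMSE: increase in error when a variable is permuted
varImpPlot(rf)
partialPlot(rf, pred.data = dat, x.var = "x1")   # partial plot for one predictor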

Week 11 Tutorial

● Regression: continuous variables


● Classification: categorical variables

Cubist model output:


● Check which predictors are used for prediction in the attribute usage section
● If leaving pH out would increase the MSE by 15%, pH is important for reducing the prediction error
● On the right-hand side: how important each variable is in explaining the data (elevation is important in this example)

Partial plot:
● Shows how each variable affects the prediction

Week 13 Lecture

● May need to stack data, change it to a long format


● We can upload a HTML file
● Can do boxplot(output ~ interaction, data = )

Module 1:
● If it was just an RCBD then you wouldn't have the blocking terms Block.T + Block.T.F, as they are random
● For the pairwise comparisons you can do plot() on the emmeans object you created, with comparisons = TRUE


○ If the comparison arrows don't overlap there is a significant difference; if they do overlap there isn't
● Check whether the design is balanced: do a group_by() and summarise()

● bwplot() conditioned on plot | subplot


● Don’t need to check error terms

Module 2:
● cbind(success, failure) supplies the successes and failures for a binomial GLM
● The test argument changes from "Chisq" to "F" if the model is quasibinomial
● 1) ii) test whether the coefficient is equal to 0 by comparing the model to the null model (~ 1)
● exp(coef(glm))
○ Gives the odds: if the odds are above 1 then success is more likely; if they are below 1 then failure is more likely
○ 1/odds gives the odds in the opposite direction
● Do not need to check the data if it is already coded as 1s and 0s

● First-order interactions are interactions between two independent variables
○ The number of independent variables is not equal to the number of first-order interactions
● Second-order interactions are interactions between three independent variables
● Two-way interaction: independent * independent
● 3-way interaction: independent * independent * independent

Module 3:
● If the response is quantitative (continuous and roughly normal) you don't need a GLM
● A clustering effect is modelled as a random effect
● Log-transform the data if it is not normal
