SPSS notes
by
Abdelrahman M. Attia
MBBCH Candidate, Faculty of Medicine, Cairo University
Peer-Reviewer at Journal of Infection and Public Health
Multiple international publications
Biostatistician
Contents
SPSS Interface
Data files
Data Entry
Computing Variables
Recoding Variables
Split File
Select Cases
Descriptive Statistics for categorical variables
Descriptive Statistics Cross tabs
Descriptive Statistics for numeric data
Descriptive Statistics for numeric data in Multiple groups
Descriptive Statistics for numeric data in multiple Categorical Variables
Data Visualization
Normality Testing
One sample Z-test
One sample T-test
Paired-T test
Independent-T test
One-Way ANOVA
Two-Way ANOVA
Wilcoxon signed rank test
Mann-Whitney U test
Kruskal-Wallis test
Pearson's correlation
Spearman's correlation
Chi-square test
Simple linear regression
Multiple linear regression
Univariate Logistic regression
Multiple Logistic regression
SPSS Interface
Data View
• Row: Each row stores the data for 1 patient (1
observation).
• Column: Each column stores the values for 1 variable
(e.g., Age of patients)
Variable View
• Each row in the variable view represents a Variable in
the data view
Data files
How do we get data into SPSS so that we can run statistical
analyses on it?
There are two ways:
• The first is manual data entry: typing the data directly into
the program from scratch (uncommon in practice).
o Each column represents a variable, and we enter the values as
shown in the image.
• The second is to open data that already exists in an Excel
file or another format and work on it in SPSS.
o Click the Open data document icon.
Data Entry
• How do we define variables in SPSS so that the program
interprets our data correctly?
• Suppose we have 3 variables:
o Sex (male, female)
o Weight (under, normal, over, obese)
o Age in years
• Open Variable View and define each variable.
• Variable View has several properties for each variable; some
are more important than others:
o Name: must not start with a number and must not contain
spaces.
o Label: a free-text description. The naming rules do not apply,
so any text can be used.
o Type: most of the time we leave it as Numeric because we code
the data. All variables are stored as numbers, even categorical
ones (for example, sex is a categorical variable, but we enter
it as numeric codes, so its type stays Numeric).
o Width: the maximum number of digits that can be entered. With
a width of 8, values up to 99999999 are allowed for this
variable.
o Decimals: the number of digits after the decimal point.
o Values: where we define the text label that each numeric code
stands for.
o Measure: where we set the level of measurement (Scale,
Ordinal, or Nominal).
Computing Variables
• Transform > Compute Variables…
• Dataset used: 2-Computing Variables.sav
How are calculations done in SPSS?
• We use Compute Variable when we want to calculate a new
variable from two or more variables already present in the data.
• For example, from weight and height we can calculate the body
mass index (BMI).
• BMI = weight (kg) / (height (m) × height (m))
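The parentheses in the BMI formula matter, because division and multiplication are evaluated left to right. As a minimal sketch in Python (not SPSS syntax; the function name and the patient values are made up for illustration), the calculation can be checked by hand like this:

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kg divided by height in metres squared."""
    return weight_kg / (height_m * height_m)

# Hypothetical patient: 70 kg and 1.75 m tall.
correct = bmi(70, 1.75)

# Without parentheses the expression is read as (70 / 1.75) * 1.75,
# which simply returns the weight again.
without_parentheses = 70 / 1.75 * 1.75

print(round(correct, 2), without_parentheses)
```

The same grouping should be typed into the Numeric Expression box of the Compute Variable dialog.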
Recoding Variables
• Transform > Recode into Different Variables ...
• Dataset used: 3-Recoding Variables.sav
Recoding is used to convert the values of a variable into other
values that we choose.
• We use it in several cases, for example:
o To convert a numeric variable, such as age in years, into a
categorical variable made of age groups
o To reverse-code items, which is needed for some survey
questions
• Example: age in years is a numeric variable, and we want to
change it into a categorical variable:
o 0-10 years: group 1
o 10-20 years: group 2
o 20-30 years: group 3
o More than 30 years: group 4
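The age ranges above share their boundary values (10, 20, 30), so a rule is needed for which group a boundary age belongs to. The following Python sketch (not SPSS syntax) assumes each boundary value falls in the lower group; that choice and the function name are assumptions made for illustration.

```python
def age_group(age_years):
    """Recode age in years into groups 1-4.

    Assumption: a boundary age (10, 20, 30) is placed in the lower group.
    """
    if age_years <= 10:
        return 1
    if age_years <= 20:
        return 2
    if age_years <= 30:
        return 3
    return 4

# Hypothetical ages, recoded into a new list so the original is kept,
# as "Recode into Different Variables" does.
ages = [5, 10, 15, 30, 31]
groups = [age_group(age) for age in ages]
print(groups)
```

Whatever rule is chosen, it should be applied consistently when the ranges are entered in the Old and New Values dialog.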
Split File
• Data > Split File ...
• Dataset used: 4-Split File.sav
• We use it when we want to run the same statistical analysis
for each group separately:
o For example, if the dataset contains males and females, we
can repeat the same analysis for males alone and for females
alone.
• Here we choose Sex of patients as the grouping variable, so
the file is split into a male group and a female group.
• Compare groups: performs all operations for each group
separately, but presents the results in the same table.
• Organize output by groups: performs all operations for each
group separately and presents the results in separate tables.
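Conceptually, Split File repeats one analysis inside each group. A minimal Python sketch of that idea, using invented patient records and a hypothetical helper `mean_by_group`:

```python
from statistics import mean

# Invented records for illustration only.
patients = [
    {"sex": "male", "age": 40},
    {"sex": "male", "age": 50},
    {"sex": "female", "age": 30},
    {"sex": "female", "age": 36},
]

def mean_by_group(rows, group_key, value_key):
    """Collect the values of each group, then run the same summary on each."""
    groups = {}
    for row in rows:
        groups.setdefault(row[group_key], []).append(row[value_key])
    return {group: mean(values) for group, values in groups.items()}

mean_age = mean_by_group(patients, "sex", "age")
print(mean_age)
```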
Select Cases
• Data > Select Cases ...
• Dataset used: 5-Select Cases.sav
• One of the methods used to filter the data.
• We use it when we are interested in only one group and do not
want to work on the whole dataset.
• Example: to work only on obese patients and exclude the rest,
select the cases whose weight category is coded 4 (obese = 4).
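The filter condition is simply "weight category equals 4". As a rough Python illustration (the records and the field name `weight_cat` are invented; only the code obese = 4 comes from the notes):

```python
OBESE = 4  # coding taken from the notes: obese = 4

# Invented records for illustration only.
patients = [
    {"id": 1, "weight_cat": 2},
    {"id": 2, "weight_cat": 4},
    {"id": 3, "weight_cat": 4},
    {"id": 4, "weight_cat": 1},
]

# Keep only the cases that satisfy the condition; the rest are excluded.
selected = [p for p in patients if p["weight_cat"] == OBESE]
print([p["id"] for p in selected])
```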
Descriptive Statistics for categorical variables
• Analyze > Descriptive Statistics > Frequencies…
• Dataset used: 6-Descriptive Statistics for categorical
variables.sav
• Run Frequencies for each categorical variable separately.
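A frequency table reports, for each category, the count and the percentage of cases. A small Python sketch of the same arithmetic, with invented codes and a hypothetical helper `frequency_table`:

```python
from collections import Counter

def frequency_table(codes):
    """Return {code: (count, percent)} for a list of category codes."""
    counts = Counter(codes)
    total = len(codes)
    return {code: (count, 100 * count / total)
            for code, count in sorted(counts.items())}

# Invented sex codes for five patients.
sex_codes = [1, 1, 2, 1, 2]
table = frequency_table(sex_codes)
print(table)
```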
Descriptive Statistics Cross tabs
• Analyze > Descriptive Statistics > Crosstabs…
• Dataset used: 7-Descriptive Statistics Cross-tabulation.sav
• We use it to describe the relationship between two or more
categorical variables by cross-tabulating them.
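A cross-tabulation counts how many cases fall into each combination of categories. The sketch below shows the idea in Python; the two variables and their values are invented for illustration.

```python
from collections import Counter

def crosstab(rows, cols):
    """Count the cases in each (row category, column category) cell."""
    return Counter(zip(rows, cols))

# Invented data: one entry per patient.
sex = ["male", "male", "female", "female", "male"]
smoker = ["yes", "no", "no", "no", "yes"]

table = crosstab(sex, smoker)
for cell, count in sorted(table.items()):
    print(cell, count)
```

Cells that never occur (here female smokers) have a count of zero.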
Descriptive Statistics for numeric data
There are three methods for describing numeric data in SPSS:
• First method
o Analyze > Descriptive Statistics > Frequencies…
o Dataset used: 8-Descriptive Statistics for numeric data.sav
• Second method
o Analyze > Descriptive Statistics > Descriptives…
o Dataset used: 8-Descriptive Statistics for numeric data.sav
• Third method (best)
o Analyze > Descriptive Statistics > Explore…
Descriptive Statistics for numeric data in Multiple groups
• Analyze > Descriptive Statistics > Explore…
• Dataset used: 9-Descriptive Statistics for numeric data in
Multiple groups.sav
For example, to describe the ages of patients who have had
cancer and of those who have not.
Descriptive Statistics for numeric data in multiple Categorical Variables
• Analyze > Compare Means > Means…
• Dataset used: 10-Descriptive Statistics for numeric data in multiple Categorical Variables.sav
For example, to describe the ages of males and females who have
had cancer and of those who have not.
Data Visualization
Bar Graph
• Graphs > Chart Builder…
• Dataset used: 11-Bar Chart.sav
• Used for categorical variables to show frequency or
proportion in each category.
• Summarize a variable between different groups
• It can also be used to compare groups
Another way to make a bar chart to compare groups
• Analyze > Descriptive Statistics > Crosstabs…
Box Plot
• Graphs > Chart Builder…
• Dataset used: 12-BOX PLOT.sav
• Shows the distribution (shape, center, range, variation) of a
quantitative variable.
• Useful for comparing the same numeric variable across
different groups, such as a score between men and women.
• The box plot (also called a box-and-whisker plot) summarizes
a numeric variable using the five-number summary: minimum,
lower quartile, median, upper quartile, and maximum.
o Median = horizontal line in the box
o Upper quartile = top edge of the box
o Lower quartile = lower edge of the box
o Maximum = top of the whisker
o Minimum = bottom of the whisker
• SPSS draws outliers as separate points beyond the whiskers.
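To make the five-number summary concrete, here is a short Python sketch on invented scores. It uses the "inclusive" quartile method of Python's `statistics` module; SPSS may use a different quartile definition, so small differences from SPSS output are possible.

```python
from statistics import quantiles

def five_number_summary(values):
    """Minimum, lower quartile, median, upper quartile, maximum."""
    ordered = sorted(values)
    lower_q, median, upper_q = quantiles(ordered, n=4, method="inclusive")
    return {
        "minimum": ordered[0],
        "lower_quartile": lower_q,
        "median": median,
        "upper_quartile": upper_q,
        "maximum": ordered[-1],
    }

# Invented scores for illustration only.
scores = [1, 2, 3, 4, 5, 6, 7, 8, 9]
summary = five_number_summary(scores)
print(summary)
```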
Histogram
• Graphs > Chart Builder…
• Dataset used: 13-Histogram.sav
• Well suited to continuous numeric data when you are
interested in the frequency of values.
Scatter Plot
• Graphs > Chart Builder…
• Dataset used: 14-Scatter PLOT.sav
• Visual representation of relationship between 2 numeric
variables
• We can add a fit line to see the direction and strength of
the relationship (in this example, there is no correlation
between age and patients' hope in the future).
Normality Testing
• Analyze > Descriptive Statistics > Explore…
• Dataset used: 15-normality testing.sav
• Visually: plot a histogram or Q-Q plot.
• Statistically: use the Shapiro-Wilk test.
o P-value > 0.05: the data are treated as normally distributed
o P-value ≤ 0.05: the data are not normally distributed
One sample Z-test
• x̄ is the sample mean
• μ is the population mean
• σ is the population standard deviation
• n is the sample size
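The formula these symbols belong to is not reproduced in the notes. The standard one-sample z statistic, which is also what the syntax in this section computes step by step, combines them as follows:

```latex
z = \frac{\bar{x} - \mu}{\sigma / \sqrt{n}}
```

The denominator, \sigma / \sqrt{n}, is the standard error of the mean.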
❖ SPSS has no menu option for a one-sample Z-test, but it can be
calculated in a syntax window.
❖ File > New > Syntax
Example:
Suppose the IQ in a certain population is normally distributed
with a mean of μ = 100 and standard deviation of σ = 15. A
scientist wants to know if a new medication affects IQ levels, so
she recruits 20 patients to use it for one month and records their
IQ levels at the end of the month.
* One-sample z-test from summary statistics.
DATA LIST LIST / n sample_mean population_mean population_sd.
BEGIN DATA
20 103.05 100 15
END DATA.
COMPUTE mean_difference = sample_mean - population_mean.
COMPUTE square_root_n = SQRT(n).
* standard_difference is the standard error of the mean.
COMPUTE standard_difference = population_sd / square_root_n.
COMPUTE z_statistic = mean_difference / standard_difference.
* z squared follows a chi-square distribution with 1 df, which gives a two-sided p-value.
COMPUTE chi_square = z_statistic * z_statistic.
COMPUTE p_value = SIG.CHISQ(chi_square, 1).
EXECUTE.
FORMATS z_statistic p_value (F8.4).
LIST z_statistic p_value.
Explanation:
• Dataset used: 16-One sample z test.sav
• First get the mean of your sample:
o Analyze > Descriptive Statistics > Descriptives…
o The sample mean is 103.05
• The values needed for the calculation are:
o n (sample size) = 20
o sample_mean = 103.05
o population_mean = 100
o population_sd = 15
• Open a syntax window (File > New > Syntax), enter the syntax
above, and run it.
• The P-value is higher than 0.05, so we fail to reject the null
hypothesis: there is no evidence that the new drug affects IQ.
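The result can be cross-checked outside SPSS. The following Python sketch (an independent check, not part of the SPSS workflow; the function name is made up) uses the same four numbers and the normal distribution directly. It should give z of about 0.91 and a two-sided p-value of about 0.36.

```python
import math

def one_sample_z(sample_mean, population_mean, population_sd, n):
    """Return the z statistic and its two-sided p-value."""
    standard_error = population_sd / math.sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Two-sided p-value from the standard normal distribution.
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return z, p_two_sided

# Values from the IQ example: mean 103.05, mu 100, sigma 15, n 20.
z, p = one_sample_z(103.05, 100, 15, 20)
print(round(z, 4), round(p, 4))
```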
One sample T-test
• Analyze > Compare means > One-Sample T test…
• Dataset used: 17-One sample t test.sav
Example:
Suppose the IQ in a certain population is normally distributed
with a mean of μ = 100 and an unknown standard deviation. A
scientist wants to know if a new medication affects IQ levels, so
she recruits 20 patients to use it for one month and records their
IQ levels at the end of the month.
• The P-value is higher than 0.05, so we fail to reject the null
hypothesis: there is no evidence that the new drug affects IQ.
Paired-T test
• First, test the normality of the difference between the before
and after measurements (grade2 – grade1):
o Dataset used: 18-Paired-T test.sav
o Compute the difference: Transform > Compute Variables…
• Then test the normality of the difference:
o Analyze > Descriptive Statistics > Explore…
• Then run the paired t-test:
o Analyze > Compare Means > Paired-Sample T test…
• The P-value is lower than 0.05, so we reject the null
hypothesis in favor of the alternative: the training program
significantly increased the students' exam grades.
Independent-T test
• First, test normality:
o Analyze > Descriptive Statistics > Explore…
o Dataset used: 19-independent t test.sav
• Then run the independent t-test, which includes Levene's test
for equality of variances:
o Analyze > Compare Means > Independent-Sample T test…
• Example: compare the depression score between males and
females.
• The P-value is higher than 0.05, so we fail to reject the null
hypothesis: there is no significant difference in depression
score between males and females.
One-Way ANOVA
• First, test normality:
o Analyze > Descriptive Statistics > Explore…
o Dataset used: 20-One-Way ANOVA.sav
• Then run the one-way ANOVA with Levene's test:
o Analyze > Compare Means > One-Way ANOVA…
• Example: compare the effort resistance score between the
placebo, low-dose vitamin, and high-dose vitamin groups (does
the vitamin dose increase effort resistance?).
• Run the post-hoc analysis to see which groups differ.
• Effort resistance is significantly higher with the high-dose
vitamin than with placebo or the low-dose vitamin.
Two-Way ANOVA
The two-way analysis of variance is used to measure the combined
influence of two factors on a dependent variable. The factors
(independent variables) are categorical, while the dependent variable is
continuous.
Example: new vitamin test
• The effort resistance is measured on a continuous scale from 1 to
30.
o The employees in the first group will receive a placebo (this is
the control group)
o The employees in the second group will take the vitamin in low
dose
o The employees in the third group will take the vitamin in high
dose.
• Moreover, we have information about each subject’s gender (male
or female).
• We are interested to know if there is a combined influence of the
two factors, dose and gender, on the effort resistance. In other
words, we want to detect the interaction effect of these variables.
Assumptions:
• The two independent variables are categorical, each having at least
two categories.
• The dependent variable is continuous.
• There is independence of observation; in other words, there is no
relationship between the subjects in our groups.
• The dependent variable is normally distributed in all groups.
• The dependent variable does not present significant outliers in any
group.
• The dependent variable has equal variances in all groups (there is
homogeneity of variances).
• First, split the file by gender and dose, then test normality:
o Split File
▪ Data > Split File …
▪ Dataset used: 31-Two-way Anova.sav
o Normality Testing
▪ Analyze > Descriptive Statistics > Explore…
Note: Don't forget to reset Split File after checking the
assumptions.
• Then run the two-way ANOVA:
o Analyze > General Linear Model > Univariate…
• The P-value is < 0.001, so we reject the null hypothesis: there
is a significant combined influence (interaction) of the two
factors.
• In the placebo group, the average effort resistance is
practically equal for males and females (the difference is very
small).
• In the low-dose group, the difference between males and
females is much more pronounced.
• In the high-dose group, the difference between males and
females is also much more pronounced.
The simple main effects: represent the influences of one factor
at each level of the other factor. In other words, we keep a factor
constant and make the other factor vary.
The simple main effects for the factor “dose” represent the effect
of dose at every level of the gender. We must compute two sets
of differences here:
• the differences between the average effort resistance for
the “placebo”, “low dose” and “high dose” levels, for
male subjects
• the differences between the average effort resistance for
the “placebo”, “low dose” and “high dose” levels, for
female subjects
The simple main effects for the factor “gender” represent the
effect of gender at every level of the dose factor. We must
compute three differences here:
• the difference between the average male and female
effort resistance, at the “placebo” level
• the difference between the average male and female
effort resistance, at the “low dose” level
• the difference between the average male and female
effort resistance, at the “high dose” level
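The three gender differences listed above are plain subtractions between cell means. The Python sketch below illustrates this with invented cell means (they are not the values from the dataset); an interaction shows up as a difference that changes from one dose level to the next.

```python
# Invented cell means (dose, gender) -> mean effort resistance.
cell_means = {
    ("placebo", "male"): 10.0, ("placebo", "female"): 10.0,
    ("low dose", "male"): 16.0, ("low dose", "female"): 13.0,
    ("high dose", "male"): 24.0, ("high dose", "female"): 18.0,
}

def gender_difference(dose):
    """Simple main effect of gender at one dose level (male minus female)."""
    return cell_means[(dose, "male")] - cell_means[(dose, "female")]

gender_effects = {dose: gender_difference(dose)
                  for dose in ("placebo", "low dose", "high dose")}
print(gender_effects)
```

With these made-up numbers the male-female gap grows with the dose, which is the pattern an interaction effect describes. Whether each difference is statistically significant still has to be read from the SPSS output.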
Final conclusions:
• Overall, the vitamin does increase the effort resistance
• A high dose is significantly more effective than a low
dose
• At the same dose, the vitamin has a stronger effect on
male than on female employees.
Non-Parametric Tests
Parametric test        Non-parametric test        Use
Paired t-test          Wilcoxon signed rank test  Comparing the difference between 2 paired groups
Independent t-test     Mann-Whitney test          Comparing the difference between 2 groups
One-way ANOVA          Kruskal-Wallis test        Comparing the difference between 3 groups or more
Pearson's correlation  Spearman's correlation     Relation between 2 variables
Wilcoxon signed rank test
It is the non-parametric equivalent of the paired t-test.
• First, test the normality of the difference between the blood
pressure measured before giving the drug (S1) and after giving
the drug (S2), i.e. S2 – S1:
o Dataset used: 21-Wilcoxon signed rank test.sav
o Compute the difference: Transform > Compute Variables…
• Then test the normality of the difference:
o Analyze > Descriptive Statistics > Explore…
• If the difference is not normally distributed, run the Wilcoxon
signed rank test instead of the paired t-test:
o Analyze > Nonparametric Tests > Legacy Dialogs > 2 Related
Samples…
• The P-value is higher than 0.05, so we fail to reject the null
hypothesis: there is no significant change in blood pressure
after treatment.
Mann-Whitney U test
• First step normality testing
o Normality Testing
o Analyze > Descriptive Statistics > Explore…
o Dataset used: 22-Mann Witney U test - Copy.sav
• Then do Mann-Whitney U test
o Analyze>Nonparametric Tests > Legacy Dialogs > 2
Independent Samples …
Example: we want to know whether there is a significant
difference between males and females in hours of exercise per
week.
• The P-value is lower than 0.05, so we reject the null
hypothesis and conclude that males exercise significantly more
hours per week than females.
Kruskal-Wallis test
• First step normality testing
o Normality Testing
o Analyze > Descriptive Statistics > Explore…
o Dataset used: 23-Kruskal-Wallis test.sav
• Then do Kruskal-Wallis test
o Analyze>Nonparametric Tests > Independent
Samples …
Example: we want to know whether there is a significant
difference between groups A, B, and C in hours of exercise per
week.
• Look at the post-hoc analysis to compare each pair of groups.
Pearson's correlation
• First step normality testing
o Normality Testing
o Analyze > Descriptive Statistics > Explore…
o Dataset used: 24-Pearson's correlation.sav
• Then do Pearson's correlation
o Analyze>Correlate > Bivariate …
Example: we want to know whether there is an association between
the height and the weight of cases.
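Pearson's r is the covariation of the two variables divided by the product of their individual variations. The Python sketch below computes it by hand on invented heights and weights, only to show what the coefficient measures; SPSS additionally reports the p-value.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented heights (cm) and weights (kg) for five cases.
heights = [150, 160, 170, 180, 190]
weights = [50, 58, 70, 77, 90]
r = pearson_r(heights, weights)
print(round(r, 3))
```

Values of r close to +1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.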
• For Graphical presentation we can use Scatter plot
o Graphs > Chart Builder…
Spearman's correlation
Used for ordinal data, or when the assumptions for numeric data
(such as normality) are not satisfied.
• First step normality testing
o Normality Testing
o Analyze > Descriptive Statistics > Explore…
o Dataset used: 25-Spearman'scorrelation.sav
• Then do Spearman's correlation
o Analyze>Correlate > Bivariate …
Example: we want to know whether there is an association between
educational level and current salary.
Chi-square test
Used to compare categorical data between groups.
• Analyze > Descriptive Statistics > Crosstabs …
• Dataset used: 26-Chi-square Test.sav
Example: we want to know whether there is an association between
smoking and lung cancer, by cross-tabulating smokers and
non-smokers against those who did and did not get lung cancer.
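For a 2 × 2 table, the chi-square statistic compares the observed counts with the counts expected if the two variables were unrelated. The Python sketch below uses invented counts and computes the Pearson chi-square without continuity correction; for 1 degree of freedom the p-value can be obtained from the normal distribution.

```python
import math

def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]]."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Upper-tail probability of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Invented counts: smokers 30 with cancer / 70 without,
# non-smokers 10 with cancer / 90 without.
chi2, p_value = chi_square_2x2(30, 70, 10, 90)
print(round(chi2, 2), p_value)
```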
Simple linear regression
• Simple linear regression studies the relationship between one
independent variable and a dependent variable, in order to predict
the values of the dependent variable from the values of the
independent variable.
o Analyze > Regression > Linear…
o Dataset used: 27-Simple Linear Regression.sav
Example: we want to know whether there is an association between
students' IQ and their exam score.
• R square: 0.79, meaning that 79% of the variability in the
exam score can be explained by IQ through this model.
• Coefficient (B): 0.14, significant (P-value < 0.001), meaning
that for every one-unit increase in IQ there is a 0.14-unit
increase in the mean exam score. In other words, students with a
higher IQ had significantly higher exam scores.
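To show where B and R square come from, here is a least-squares fit written out in Python on four invented (IQ, exam score) pairs. The numbers are illustrative only and do not reproduce the dataset's results.

```python
def fit_line(x, y):
    """Least-squares slope, intercept, and R square for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_residual = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_total = sum((b - mean_y) ** 2 for b in y)
    return slope, intercept, 1 - ss_residual / ss_total

# Invented data for illustration only.
iq = [90, 100, 110, 120]
exam_score = [10, 12, 13, 15]
slope, intercept, r_square = fit_line(iq, exam_score)
print(round(slope, 2), round(intercept, 2), round(r_square, 3))
```

The slope plays the role of B in the SPSS Coefficients table: the expected change in the dependent variable for a one-unit increase in the predictor.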
Multiple linear regression
• Multiple linear regression studies the relationship between
several independent variables and a dependent variable, in order to
predict the values of the dependent variable from the values of
the independent variables.
o Analyze > Regression > Linear…
o Dataset used: 28-Multiple Linear Regression.sav
Example: we want to know whether there is an association between
students' IQ, hours of study, and gender and their exam score.
• R square: 0.84, meaning that 84% of the variability in the
exam score can be explained by the independent variables (IQ,
hours of study, and gender) through this model.
• Coefficient (B) of IQ: 0.14, significant (P-value < 0.001),
meaning that for every one-unit increase in IQ there is a
0.14-unit increase in the mean exam score while controlling for
all other independent variables. In other words, students with a
higher IQ had significantly higher exam scores.
• Coefficient (B) of hours of study: 0.39, significant (P-value
< 0.001), meaning that for every one-unit increase in hours of
study there is a 0.39-unit increase in the mean exam score while
controlling for all other independent variables. In other words,
students who studied more hours had significantly higher exam
scores.
• Coefficient (B) of gender: -0.28, significant (P-value =
0.004), meaning that male students scored on average 0.28 units
lower than female students while controlling for all other
independent variables.
Univariate Logistic regression
• The binomial logistic regression is a predictive technique which is
used when the dependent variable is dichotomous, and the
independent variables are continuous, ordinal or nominal.
o Analyze > Regression > Binary Logistic…
o Dataset used: 29-Univariate Logistic Regression.sav
• Example: we want to know whether age is a predictor of
ischemic heart disease (IHD).
• Exp(B) (odds ratio, OR): 1.38, significant (P-value < 0.001),
meaning that older age is associated with IHD: for each one-unit
increase in age, the odds of IHD are multiplied by 1.38.
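Because Exp(B) is the exponential of the coefficient B, odds ratios multiply across units rather than add. The Python sketch below works through the reported OR of 1.38; the helper names are made up, and the 5-unit increase is an arbitrary illustration.

```python
import math

def odds_ratio_for_increase(odds_ratio_per_unit, units):
    """Odds ratio for an increase of several units in the predictor."""
    b = math.log(odds_ratio_per_unit)  # B is the natural log of Exp(B)
    return math.exp(b * units)

def percent_change_in_odds(odds_ratio):
    """Express an odds ratio as a percentage change in the odds."""
    return (odds_ratio - 1) * 100

or_five_units = odds_ratio_for_increase(1.38, 5)
print(round(or_five_units, 2), round(percent_change_in_odds(1.38)))
```

An odds ratio below 1 works the same way: an OR of 0.05 corresponds to 95% lower odds.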
Multiple Logistic regression
• The binomial logistic regression is a predictive technique which is
used when the dependent variable is dichotomous, and the
independent variables are continuous, ordinal or nominal.
o Analyze > Regression > Binary Logistic…
o Dataset used: 30-Multiple Logistic Regression.sav
• Example: we want to know whether age, gender, and smoking
status are predictors of ischemic heart disease (IHD).
• Exp(B) (odds ratio, OR) of age: 1.41, significant (P-value
< 0.001), meaning that older age is associated with IHD: each
one-unit increase in age multiplies the odds of IHD by 1.41,
controlling for the other variables.
• Exp(B) (odds ratio, OR) of gender (male): 24.5, significant
(P-value < 0.001), meaning that male patients have 24.5 times
the odds of IHD compared with female patients.
• Exp(B) (odds ratio, OR) of non-smoker: 0.05, significant
(P-value < 0.001), meaning that non-smokers have lower odds of
IHD than current smokers.